Using Ruby as a cli tool for text processing
Sundeep
Posted on September 30, 2020
Why use Ruby for one-liners?
I assume you are already familiar with use cases where command line is more productive compared to GUI. See also this series of articles titled Unix as IDE.
A shell utility like bash provides built-in commands and scripting features that make it easier to solve and automate various tasks. External *nix commands like grep, sed, awk, sort, find, parallel, etc. can be combined to work with each other. Depending upon your familiarity with those tools, you can either use ruby as a single replacement or as a complement for specific use cases.
Here are some one-liners (options will be explained later):
- ruby -e 'puts readlines.uniq' *.txt
  retain only one copy if lines are duplicated in the given list of input file(s)
- ruby -e 'puts readlines.uniq {|s| s.split[1]}' *.txt
  retain only the first copy of duplicate lines, using the second field as the duplicate criteria
- ruby -rcommonregex -ne 'puts CommonRegex.get_links($_)' *.md
  extract only the URLs, using the third-party CommonRegexRuby library
- stackoverflow: merge duplicate key values while preserving order
  a recent Q&A that I answered with a simpler ruby solution compared to awk
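To make the first two one-liners concrete, here is a minimal script-style sketch of what they compute. The sample data and the method names dedupe_lines and dedupe_by_second_field are made up for illustration. The one-liners themselves read the lines from the files given on the command line.

```ruby
# Hypothetical sample data standing in for the combined contents of *.txt
SAMPLE_LINES = ["apple 10\n", "fig 20\n", "mango 10\n", "apple 10\n"].freeze

# Same idea as: ruby -e 'puts readlines.uniq' *.txt
def dedupe_lines(lines)
  lines.uniq
end

# Same idea as: ruby -e 'puts readlines.uniq {|s| s.split[1]}' *.txt
# The block's return value (the second field) decides what counts as a
# duplicate; the first line seen for each key is the one that is kept.
def dedupe_by_second_field(lines)
  lines.uniq { |s| s.split[1] }
end

puts dedupe_lines(SAMPLE_LINES)
puts dedupe_by_second_field(SAMPLE_LINES)
```

In both cases the original order of the surviving lines is preserved, which is often the reason to reach for uniq instead of sorting first.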
The main advantages of ruby over tools like grep, sed and awk are its feature-rich regular expression engine, standard library and third-party libraries. If you don't already know the syntax and idioms of sed and awk, learning the command line options for ruby would be the easier option. The main disadvantage is that ruby is likely to be slower compared to those tools.
Command line options
Option | Description |
---|---|
-0[octal] | specify record separator (\0, if no argument) |
-a | autosplit mode with -n or -p (splits $_ into $F) |
-c | check syntax only |
-Cdirectory | cd to directory before executing your script |
-d | set debugging flags (set $DEBUG to true) |
-e 'command' | one line of script. Several -e's allowed. Omit [programfile] |
-Eex[:in] | specify the default external and internal character encodings |
-Fpattern | split() pattern for autosplit (-a) |
-i[extension] | edit ARGV files in place (make backup if extension supplied) |
-Idirectory | specify $LOAD_PATH directory (may be used more than once) |
-l | enable line ending processing |
-n | assume 'while gets(); ... end' loop around your script |
-p | assume loop like -n but print line also like sed |
-rlibrary | require the library before executing your script |
-s | enable some switch parsing for switches after script name |
-S | look for the script using PATH environment variable |
-v | print the version number, then turn on verbose mode |
-w | turn warnings on for your script |
-W[level=2\|:category] | set warning level; 0=silence, 1=medium, 2=verbose |
-x[directory] | strip off text before #!ruby line and perhaps cd to directory |
--jit | enable JIT with default options (experimental) |
--jit-[option] | enable JIT with an option (experimental) |
-h | show this message, --help for more info |
Executing Ruby code
If you want to execute a ruby program file, one way is to pass the filename as an argument to the ruby command.
$ echo 'puts "Hello Ruby"' > hello.rb
$ ruby hello.rb
Hello Ruby
For short programs, you can also directly pass the code as an argument to the -e option.
$ ruby -e 'puts "Hello Ruby"'
Hello Ruby
$ # multiple statements can be issued separated by ;
$ ruby -e 'x=25; y=12; puts x**y'
59604644775390625
$ # or use -e option multiple times
$ ruby -e 'x=25' -e 'y=12' -e 'puts x**y'
59604644775390625
Filtering
ruby one-liners can be used for filtering lines matched by a regexp, similar to grep, sed and awk. And similar to many command line utilities, ruby can accept input from both stdin and file arguments.
$ # sample stdin data
$ printf 'gate\napple\nwhat\nkite\n'
gate
apple
what
kite
$ # print all lines containing 'at'
$ # same as: grep 'at' and sed -n '/at/p' and awk '/at/'
$ printf 'gate\napple\nwhat\nkite\n' | ruby -ne 'print if /at/'
gate
what
$ # print all lines NOT containing 'e'
$ # same as: grep -v 'e' and sed -n '/e/!p' and awk '!/e/'
$ printf 'gate\napple\nwhat\nkite\n' | ruby -ne 'print if !/e/'
what
By default, grep, sed and awk automatically loop over the input content line by line (with \n as the line separator). The -n or -p option enables this feature for ruby. As seen before, the -e option accepts code as a command line argument. Many shortcuts are available to reduce the amount of typing needed.
In the above examples, a regular expression (defined by the pattern between a pair of forward slashes) has been used to filter the input. When the input string isn't specified in a conditional context (for example: if), the test is performed against the global variable $_, which holds the contents of the input line (the correct term would be input record). To summarize, in a conditional context:
- /regexp/ is a shortcut for $_ =~ /regexp/
- !/regexp/ is a shortcut for $_ !~ /regexp/
$_ is also the default argument for the print method, which is why it is generally preferred over the puts method in one-liners. More such defaults that apply to the print method will be discussed later.
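As a rough sketch of what the -n option and the bare regexp shortcut amount to, the filtering one-liner can be written out as an ordinary method. The name grep_lines and the use of StringIO to stand in for stdin are assumptions made for this illustration, not something ruby does internally.

```ruby
require "stringio"

# Written-out approximation of: ruby -ne 'print if /at/'
# `input` is any IO-like object, `pattern` is the regexp to test against.
def grep_lines(input, pattern)
  matched = []
  # -n wraps the code in a `while gets ... end` loop; here the record is
  # held in a local variable instead of the global $_
  while (line = input.gets)
    # a bare /regexp/ in a condition is shorthand for `$_ =~ /regexp/`
    matched << line if line =~ pattern
  end
  matched
end

SAMPLE_INPUT = "gate\napple\nwhat\nkite\n"
print grep_lines(StringIO.new(SAMPLE_INPUT), /at/).join
```

Each record keeps its trailing newline, which is why print (and not puts) is the natural choice for emitting unmodified lines.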
See ruby-doc: Pre-defined global variables for documentation on $_, $&, etc.
Here's an example with file input instead of stdin.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
$ # same as: grep -oE '[0-9]+$' table.txt
$ ruby -ne 'puts $& if /\d+$/' table.txt
42
7
14
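Note that $& holds only the matched portion, so the sign of -7 and the integer part of 3.14 are not part of the output above. The sketch below reproduces that result with String#[] and then, purely as an illustration, uses a wider regexp that also accepts an optional sign and decimal part. The method names and the second regexp are assumptions, not from the original example.

```ruby
TABLE = "brown bread mat hair 42\n" \
        "blue cake mug shirt -7\n" \
        "yellow banana window shoes 3.14\n"

# Same result as: ruby -ne 'puts $& if /\d+$/' table.txt
# String#[] with a regexp returns the matched text, or nil if no match.
def trailing_digits(text)
  text.each_line.map { |line| line[/\d+$/] }.compact
end

# Hypothetical variation: keep an optional minus sign and decimal part.
def trailing_numbers(text)
  text.each_line.map { |line| line[/-?\d+(?:\.\d+)?$/] }.compact
end

puts trailing_digits(TABLE)
puts trailing_numbers(TABLE)
```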
Substitution
Use the sub and gsub methods for search and replace requirements. By default, these methods operate on $_ when the input string isn't provided. For these examples, the -p option is used instead of the -n option, so that the value of $_ is automatically printed after processing each input line.
$ # for each input line, change only first ':' to '-'
$ # same as: sed 's/:/-/' and awk '{sub(/:/, "-")} 1'
$ printf '1:2:3:4\na:b:c:d\n' | ruby -pe 'sub(/:/, "-")'
1-2:3:4
a-b:c:d
$ # for each input line, change all ':' to '-'
$ # same as: sed 's/:/-/g' and awk '{gsub(/:/, "-")} 1'
$ printf '1:2:3:4\na:b:c:d\n' | ruby -pe 'gsub(/:/, "-")'
1-2-3-4
a-b-c-d
You might wonder how $_ is modified without the use of ! methods. The reason is that these methods are part of Kernel (see ruby-doc: Kernel for details) and are available only when the -n or -p option is used.
- sub(/regexp/, repl) is a shortcut for $_.sub(/regexp/, repl) and $_ will be updated if the substitution succeeds
- gsub(/regexp/, repl) is a shortcut for $_.gsub(/regexp/, repl) and $_ will be updated if the substitution succeeds
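Outside of -n and -p, the same operations are spelled with the String methods. The following sketch contrasts the non-mutating and mutating forms. The method names are hypothetical and chosen only for this illustration.

```ruby
# Non-mutating: returns a new string, the argument is left untouched.
# Comparable to what `ruby -pe 'sub(/:/, "-")'` prints for each line.
def first_colon_to_dash(line)
  line.sub(/:/, "-")
end

# Mutating: changes the string in place.
# gsub! returns nil when nothing was replaced, so the return value
# can double as a "did anything change?" check.
def all_colons_to_dashes!(line)
  line.gsub!(/:/, "-")
end

original = "1:2:3:4"
puts first_colon_to_dash(original)
puts original

buffer = "a:b:c:d".dup
all_colons_to_dashes!(buffer)
puts buffer
```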
Field processing
Consider the sample input file shown below with fields separated by a single space character.
$ cat table.txt
brown bread mat hair 42
blue cake mug shirt -7
yellow banana window shoes 3.14
Here are some examples that are based on a specific field rather than the entire line. The -a option causes the input line to be split on whitespace, and the field contents can be accessed using the $F global variable. Leading and trailing whitespace is suppressed and won't result in empty fields.
$ # print the second field of each input line
$ # same as: awk '{print $2}' table.txt
$ ruby -ane 'puts $F[1]' table.txt
bread
cake
banana
$ # print lines only if last field is a negative number
$ # same as: awk '$NF<0' table.txt
$ ruby -ane 'print if $F[-1].to_f < 0' table.txt
blue cake mug shirt -7
$ # change 'b' to 'B' only for the first field
$ # same as: awk '{gsub(/b/, "B", $1)} 1' table.txt
$ ruby -ane '$F[0].gsub!(/b/, "B"); puts $F * " "' table.txt
Brown bread mat hair 42
Blue cake mug shirt -7
yellow banana window shoes 3.14
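With default settings, the field splitting done by -a behaves like calling split with no arguments on each line. Here is a small sketch of the examples above written as plain methods. The method names are made up for this illustration.

```ruby
# Same idea as: ruby -ane 'puts $F[1]'
# split with no arguments splits on runs of whitespace and ignores
# leading and trailing whitespace, so no empty fields are produced.
def second_field(line)
  line.split[1]
end

# Same idea as: ruby -ane '$F[0].gsub!(/b/, "B"); puts $F * " "'
def capitalize_b_in_first_field(line)
  fields = line.split
  return "" if fields.empty?   # guard for blank lines
  fields[0] = fields[0].gsub(/b/, "B")
  fields * " "                 # Array#* with a string works like join
end

puts second_field("blue cake mug shirt -7\n")
puts capitalize_b_in_first_field("brown bread mat hair 42\n")
```

Since the fields are joined back with a single space, any original runs of multiple spaces or tabs between fields are not preserved in the output.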
BEGIN and END
You can use a BEGIN{} block when you need to execute something before the input is read and an END{} block to execute something after all of the input has been processed.
$ # same as: awk 'BEGIN{print "---"} 1; END{print "%%%"}'
$ # note the use of ; after BEGIN block
$ seq 4 | ruby -pe 'BEGIN{puts "---"}; END{puts "%%%"}'
---
1
2
3
4
%%%
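A common use of this before/during/after structure is to accumulate a value over the whole input and report it at the end. The sketch below shows that structure as an ordinary method, without using BEGIN{} and END{} themselves. The running total and the name process_with_banner are additions made for this illustration.

```ruby
# Mirrors the three phases of a one-liner that uses BEGIN{} and END{}.
def process_with_banner(input_lines)
  out = ["---"]               # BEGIN{} phase: runs once, before any input
  total = 0
  input_lines.each do |line|  # loop body: runs once per input line
    total += line.to_i
    out << line.chomp
  end
  out << "total: #{total}"    # END{} phase: runs once, after all input
  out << "%%%"
  out
end

puts process_with_banner(["1\n", "2\n", "3\n", "4\n"])
```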
ENV hash
When it comes to automation and scripting, you'd often need to construct commands that can accept input from the user, a file, the output of a shell command, etc. This post assumes bash as the shell being used. To access environment variables of the shell, you can use the special hash variable ENV with the name of the environment variable as a string key.
$ # existing environment variable
$ # output shown here is for my machine, would differ for you
$ ruby -e 'puts ENV["HOME"]'
/home/learnbyexample
$ ruby -e 'puts ENV["SHELL"]'
/bin/bash
$ # defined along with ruby command
$ # note that the variable is placed before the shell command
$ word='hello' ruby -e 'puts ENV["word"]'
hello
$ # the input characters are preserved as is
$ ip='hi\nbye' ruby -e 'puts ENV["ip"]'
hi\nbye
Here's another example where a regexp is passed as the content of an environment variable.
$ cat word_anchors.txt
sub par
spar
apparent effort
two spare computers
cart part tart mart
$ # assume 'r' is a shell variable that has to be passed to the ruby command
$ r='\Bpar\B'
$ rgx="$r" ruby -ne 'print if /#{ENV["rgx"]}/' word_anchors.txt
apparent effort
two spare computers
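When the environment variable is meant to be matched literally rather than as a regexp, the metacharacters need to be escaped first. The following sketch shows both cases. The method name matching_lines, the literal keyword, and the variable name lit are assumptions made for this example.

```ruby
WORD_ANCHORS = [
  "sub par", "spar", "apparent effort",
  "two spare computers", "cart part tart mart"
].freeze

# Builds a regexp from an environment variable and filters the lines.
# With literal: true, the value is escaped so that characters like '.'
# or '\' lose their special meaning.
def matching_lines(lines, env_name, literal: false)
  source = ENV.fetch(env_name)   # raises KeyError if the variable is unset
  source = Regexp.escape(source) if literal
  lines.grep(Regexp.new(source))
end

# Set here only to keep the example self-contained; in the one-liner
# the value comes from the shell.
ENV["rgx"] = '\Bpar\B'
puts matching_lines(WORD_ANCHORS, "rgx")
```

Using ENV.fetch instead of ENV[] makes a missing variable fail loudly instead of silently producing an empty regexp that matches every line.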
As an example, see my repo ch: command help for a practical shell script, where commands are constructed dynamically.
Executing external commands
You can call external commands using the system Kernel method. See ruby-doc: system for documentation.
$ ruby -e 'system("echo Hello World")'
Hello World
$ ruby -e 'system("wc -w <word_anchors.txt")'
12
$ ruby -e 'system("seq -s, 10 > out.txt")'
$ cat out.txt
1,2,3,4,5,6,7,8,9,10
The return value of system or the global variable $? can be used to act upon the exit status of the command issued.
$ ruby -e 'es=system("ls word_anchors.txt"); puts es'
word_anchors.txt
true
$ ruby -e 'system("ls word_anchors.txt"); puts $?'
word_anchors.txt
pid 6087 exit 0
$ ruby -e 'system("ls xyz.txt"); puts $?'
ls: cannot access 'xyz.txt': No such file or directory
pid 6164 exit 2
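In a script, it is usually more convenient to inspect the exit status as a number than to print $? as a whole. Here is a minimal sketch, assuming a Unix-like system where sh is available. The method name run_and_report is made up for this illustration.

```ruby
# Runs a command and reports how it ended.
# system returns true for exit status 0, false for a non-zero status,
# and nil when the command could not be executed at all.
def run_and_report(*command)
  ok = system(*command)
  status = $?   # Process::Status of the most recently finished child
  { ok: ok, exitstatus: status&.exitstatus }
end

p run_and_report("sh", "-c", "exit 0")
p run_and_report("sh", "-c", "exit 3")
```

Passing the command and its arguments as separate strings, as done here, avoids an extra layer of shell quoting when the arguments come from variables.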
To save the result of an external command, use backticks or %x.
$ ruby -e 'words = `wc -w <word_anchors.txt`; puts words'
12
$ ruby -e 'nums = %x/seq 3/; print nums'
1
2
3
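The captured output is a single string that includes the trailing newlines, so a script will often split it into lines and check $? before using it. A small sketch of that pattern follows, assuming a POSIX shell with printf. The method name capture_lines is hypothetical.

```ruby
# Captures the stdout of a shell command as an array of lines.
def capture_lines(command)
  output = `#{command}`   # backticks return stdout as one string
  raise "command failed: #{command}" unless $?.success?
  output.lines(chomp: true)   # split into lines, dropping the newlines
end

# %q keeps the backslashes as is, so the shell's printf sees \n
p capture_lines(%q(printf '1\n2\n3\n'))
```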
See also stackoverflow: difference between exec, system and %x() or backticks
Summary
This post introduced some of the common options for ruby cli usage, along with typical cli text processing examples. While special-purpose cli tools like grep, sed and awk are usually faster, ruby has a much more extensive standard library and ecosystem. And you do not have to learn a lot if you are comfortable with ruby but not familiar with those cli tools.
Ruby one-liners cookbook
If you liked this post and would like to learn more, check out my ebook using the links below. These are free to download until this Sunday (4-Oct-2020).
You can also get the ebooks as part of Ruby text processing bundle using these links: