Sundeep
Posted on March 26, 2019
Now that you're familiar with regexp syntax and some of the methods, the next step is to know about the special features of regular expressions. In this chapter, you'll be learning about qualifying a pattern. Instead of matching anywhere in the given input string, restrictions can be specified. For now, you'll see the ones that are already part of regular expression features. In later chapters, you'll learn how to define your own rules for restriction.
These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as metacharacters in regexp parlance. In case you need to match those characters literally, you need to escape them with a \ character (discussed in the Escaping metacharacters section).
String anchors
This restriction is about qualifying a regexp to match only at the start or end of an input string. These anchors provide functionality similar to the string methods start_with? and end_with?. There are three different escape sequences related to string level regexp anchors. First up is \A, which restricts the matching to the start of string.
# \A is placed as a prefix to the search term
>> 'cater'.match?(/\Acat/)
=> true
>> 'concatenation'.match?(/\Acat/)
=> false
>> "hi hello\ntop spot".match?(/\Ahi/)
=> true
>> "hi hello\ntop spot".match?(/\Atop/)
=> false
To restrict the match to the end of string, \z is used.
# \z is placed as a suffix to the search term
>> 'spare'.match?(/are\z/)
=> true
>> 'nearest'.match?(/are\z/)
=> false
>> words = %w[surrender unicorn newer door empty eel pest]
>> words.grep(/er\z/)
=> ["surrender", "newer"]
>> words.grep(/t\z/)
=> ["pest"]
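The similarity to start_with? and end_with? mentioned earlier can be checked directly. The sketch below is only an illustration: the helper names ending_with_er and same_as_string_methods? are made up here, and the sample words are the ones from the examples above.

```ruby
# Sample words reused from the examples above.
WORDS = %w[surrender unicorn newer door empty eel pest].freeze

# Hypothetical helper: filter words ending with 'er' using the \z anchor.
def ending_with_er(words)
  words.grep(/er\z/)
end

# Hypothetical helper: check that the anchored regexps agree with
# the start_with? and end_with? string methods for literal text.
def same_as_string_methods?(str)
  str.match?(/\Acat/) == str.start_with?('cat') &&
    str.match?(/er\z/) == str.end_with?('er')
end

# regexp based filter and string method based filter give the same words
p ending_with_er(WORDS)
p WORDS.select { |w| w.end_with?('er') }

# both approaches agree for each of these inputs
p %w[cater concatenation cat].all? { |s| same_as_string_methods?(s) }
```

For fixed text like this, the string methods are enough. The anchored regexp form pays off once the search term is more than a literal string.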
There is another end of string anchor, \Z. It is similar to \z, but if a newline is the last character, then \Z also allows matching just before that newline character.
# same result for both \z and \Z
# as there is no newline character at the end of string
>> 'dare'.sub(/are\z/, 'X')
=> "dX"
>> 'dare'.sub(/are\Z/, 'X')
=> "dX"
# different results as there is a newline character at the end of string
>> "dare\n".sub(/are\z/, 'X')
=> "dare\n"
>> "dare\n".sub(/are\Z/, 'X')
=> "dX\n"
Combining both the start and end string anchors, you can restrict the matching to the whole string, similar to comparing strings using the == operator.
>> 'cat'.match?(/\Acat\z/)
=> true
>> 'cater'.match?(/\Acat\z/)
=> false
>> 'concatenation'.match?(/\Acat\z/)
=> false
The anchors can be used by themselves as a pattern. This helps to insert text at the start or end of string, emulating string concatenation operations. This might not feel like a useful capability, but combined with other regexp features it becomes quite a handy tool.
>> 'live'.sub(/\A/, 're')
=> "relive"
>> 'send'.sub(/\A/, 're')
=> "resend"
>> 'cat'.sub(/\z/, 'er')
=> "cater"
>> 'hack'.sub(/\z/, 'er')
=> "hacker"
Line anchors
A string input may contain single or multiple lines. The newline character \n is used as the line separator. There are two line anchors: the ^ metacharacter for matching the start of line and $ for matching the end of line. If there are no newline characters in the input string, these will behave the same as the \A and \z anchors respectively.
>> pets = 'cat and dog'
>> pets.match?(/^cat/)
=> true
>> pets.match?(/^dog/)
=> false
>> pets.match?(/dog$/)
=> true
>> pets.match?(/^dog$/)
=> false
Here are some multiline examples to distinguish line anchors from string anchors.
# check if any line in the string starts with 'top'
>> "hi hello\ntop spot".match?(/^top/)
=> true
# check if any line in the string ends with 'er'
>> "spare\npar\ndare".match?(/er$/)
=> false
# filter all lines ending with 'are'
>> "spare\npar\ndare".each_line.grep(/are$/)
=> ["spare\n", "dare"]
# check if any complete line in the string is 'par'
>> "spare\npar\ndare".match?(/^par$/)
=> true
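The difference between the two kinds of anchors matters most when a whole-input check is intended. Here is a minimal sketch using the same multiline string as above; the method names any_line_is_par? and whole_string_is_par? are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper: true if any single line is exactly 'par'.
def any_line_is_par?(str)
  str.match?(/^par$/)
end

# Hypothetical helper: true only if the entire string is exactly 'par'.
def whole_string_is_par?(str)
  str.match?(/\Apar\z/)
end

multiline = "spare\npar\ndare"

# line anchors are satisfied by the second line alone
p any_line_is_par?(multiline)
# string anchors require the whole input to be 'par'
p whole_string_is_par?(multiline)
p whole_string_is_par?('par')
```

The first call gives true, the second gives false and the third gives true. So, if the intention is to validate the entire input, the string anchors are the ones to reach for.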
Just like string anchors, you can use the line anchors by themselves as a pattern. gsub and puts will be used here to better illustrate the transformation. The gsub method returns an Enumerator if you neither specify a replacement string nor pass a block. That paves the way to use all those wonderful Enumerator and Enumerable methods.
>> str = "catapults\nconcatenate\ncat"
>> puts str.gsub(/^/, '1: ')
1: catapults
1: concatenate
1: cat
>> puts str.gsub(/^/).with_index(1) { |m, i| "#{i}: " }
1: catapults
2: concatenate
3: cat
>> puts str.gsub(/$/, '.')
catapults.
concatenate.
cat.
If there is a newline character at the end of string, there is an additional end of line match but no additional start of line match.
>> puts "1\n2\n".gsub(/^/, 'foo ')
foo 1
foo 2
>> puts "1\n\n".gsub(/^/, 'foo ')
foo 1
foo
# note the number of lines in output
>> puts "1\n2\n".gsub(/$/, ' baz')
1 baz
2 baz
 baz
>> puts "1\n\n".gsub(/$/, ' baz')
1 baz
 baz
 baz
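Another way to see the extra end of line match is to count how many times each anchor matches. This is a rough sketch, with line_anchor_counts being a made-up helper name; it relies on the scan string method returning one (empty) element per match.

```ruby
# Hypothetical helper: returns a two element array with
# the number of ^ matches followed by the number of $ matches.
def line_anchor_counts(str)
  [str.scan(/^/).size, str.scan(/$/).size]
end

# no newline at the end: same number of start and end matches
p line_anchor_counts("1\n2")    # [2, 2]

# newline at the end: one additional end of line match
p line_anchor_counts("1\n2\n")  # [2, 3]
p line_anchor_counts("1\n\n")   # [2, 3]
```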
If you are dealing with Windows OS based text files, you'll have to convert \r\n line endings to \n first. This is easily handled by many of the Ruby methods. For example, you can specify which line ending to use for the File.open method, the split string method handles all whitespace by default, and so on. Or, you can handle \r as an optional character with quantifiers (see the Greedy quantifiers section).
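Here is a small sketch of two of those options, assuming the text is already available as a string. The sample data and the helper names normalize_newlines and lines_ending_with_are are made up for illustration, and the ? quantifier used in the second helper is covered in a later chapter.

```ruby
# Made-up sample with Windows style line endings.
WINDOWS_TEXT = "spare\r\npar\r\ndare\r\n".freeze

# Option 1 (hypothetical helper): convert \r\n to \n before matching.
def normalize_newlines(str)
  str.gsub("\r\n", "\n")
end

# Option 2 (hypothetical helper): keep the input as is and
# allow an optional \r just before the end of line.
def lines_ending_with_are(str)
  str.each_line.grep(/are\r?$/)
end

# the plain pattern fails because \r sits between 'are' and the newline
p WINDOWS_TEXT.each_line.grep(/are$/)

# works after normalizing the line endings
p normalize_newlines(WINDOWS_TEXT).each_line.grep(/are$/)

# works on the original text by treating \r as optional
p lines_ending_with_are(WINDOWS_TEXT)
```

The first result is an empty array, while the other two return the lines for spare and dare (with their respective line endings intact).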
Word anchors
The third type of restriction is word anchors. Alphabets (irrespective of case), digits and the underscore character qualify as word characters. You might wonder why there are digits and underscores as well, why not only alphabets? This comes from variable and function naming conventions — typically alphabets, digits and underscores are allowed. So, the definition is more oriented to programming languages than natural ones.
The escape sequence \b denotes a word boundary. This works for both the start of word and end of word anchoring. Start of word means either the character prior to the word is a non-word character or there is no character (start of string). Similarly, end of word means the character after the word is a non-word character or there is no character (end of string). This implies that you cannot have a word boundary \b without a word character.
>> words = 'par spar apparent spare part'
# replace 'par' irrespective of where it occurs
>> words.gsub(/par/, 'X')
=> "X sX apXent sXe Xt"
# replace 'par' only at the start of word
>> words.gsub(/\bpar/, 'X')
=> "X spar apparent spare Xt"
# replace 'par' only at the end of word
>> words.gsub(/par\b/, 'X')
=> "X sX apparent spare part"
# replace 'par' only if it is not part of another word
>> words.gsub(/\bpar\b/, 'X')
=> "X spar apparent spare part"
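If the word to be replaced is stored in a variable, the same idea can be wrapped in a method. This is only a sketch: replace_whole_word is a hypothetical helper, and it assumes that the given word starts and ends with a word character, otherwise \b won't anchor where you expect. Regexp.escape is used so that any metacharacters in the word are matched literally.

```ruby
# Hypothetical helper: replace 'word' only when it is not part of
# another word. Assumes 'word' starts and ends with a word character.
def replace_whole_word(text, word, replacement)
  # Regexp.escape makes metacharacters in 'word' match literally
  text.gsub(/\b#{Regexp.escape(word)}\b/, replacement)
end

words = 'par spar apparent spare part'
p replace_whole_word(words, 'par', 'X')    # "X spar apparent spare part"
p replace_whole_word(words, 'spare', 'X')  # "par spar apparent X part"
```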
You can get a lot more creative by using the word boundary as a pattern by itself:
# space separated words to double quoted csv
# note the use of 'tr' string method
>> puts words.gsub(/\b/, '"').tr(' ', ',')
"par","spar","apparent","spare","part"
>> '-----hello-----'.gsub(/\b/, ' ')
=> "----- hello -----"
# make a programming statement more readable
# shown for illustration purpose only, won't work for all cases
>> 'foo_baz=num1+35*42/num2'.gsub(/\b/, ' ')
=> " foo_baz = num1 + 35 * 42 / num2 "
# excess space at start/end of string can be stripped off
# later you'll learn how to add a qualifier so that strip is not needed
>> 'foo_baz=num1+35*42/num2'.gsub(/\b/, ' ').strip
=> "foo_baz = num1 + 35 * 42 / num2"
The word boundary has an opposite anchor too. \B matches wherever \b doesn't match. This duality will be seen with some other escape sequences too. Negative logic is handy in many text processing situations. But use it with care, as you might end up matching things you didn't intend!
>> words = 'par spar apparent spare part'
# replace 'par' if it is not start of word
>> words.gsub(/\Bpar/, 'X')
=> "par sX apXent sXe part"
# replace 'par' at the end of word but not whole word 'par'
>> words.gsub(/\Bpar\b/, 'X')
=> "par sX apparent spare part"
# replace 'par' if it is not end of word
>> words.gsub(/par\B/, 'X')
=> "par spar apXent sXe Xt"
# replace 'par' if it is surrounded by word characters
>> words.gsub(/\Bpar\B/, 'X')
=> "par spar apXent sXe part"
Here's some standalone pattern usage to compare and contrast the two word anchors.
>> 'copper'.gsub(/\b/, ':')
=> ":copper:"
>> 'copper'.gsub(/\B/, ':')
=> "c:o:p:p:e:r"
>> '-----hello-----'.gsub(/\b/, ' ')
=> "----- hello -----"
>> '-----hello-----'.gsub(/\B/, ' ')
=> " - - - - -h e l l o- - - - - "
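To see that \B matches wherever \b doesn't, you can collect the positions matched by each anchor. This is a rough sketch and match_positions is a made-up helper name; it uses the block form of the scan string method, where Regexp.last_match gives access to the current match.

```ruby
# Hypothetical helper: collect the index of every match of a pattern.
# Inside the scan block, Regexp.last_match refers to the current match.
def match_positions(str, pattern)
  positions = []
  str.scan(pattern) { positions << Regexp.last_match.begin(0) }
  positions
end

word = 'copper'
boundaries = match_positions(word, /\b/)
non_boundaries = match_positions(word, /\B/)

p boundaries      # [0, 6]
p non_boundaries  # [1, 2, 3, 4, 5]

# every position from 0 to the string length shows up exactly once
all_positions = (boundaries + non_boundaries).sort
p all_positions == (0..word.length).to_a
```

The last line prints true, since each position in the string is matched by one anchor or the other, never both.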
Exercises
For practice problems, visit the Exercises.md file from this book's repository on GitHub.