Ruby Regexp Part 4 - Quantifiers

learnbyexample

Sundeep

Posted on March 28, 2019

Ruby Regexp Part 4 - Quantifiers

Dot metacharacter and Quantifiers

This chapter introduces dot metacharacter and quantifiers. As the name implies, quantifiers allows you to specify how many times a character or grouping should be matched. With * string operator, you can do something like 'no' * 5 to get "nonononono". This saves you manual repetition as well as gives the ability to programmatically repeat a string object as many times as you need. Quantifiers support this simple repetition as well as ways to specify a range of repetition. This range has the flexibility of being bounded or unbounded with respect to start and end values. Combined with dot metacharacter (and alternation if needed), quantifiers allow you to construct conditional AND logic between patterns.

Dot metacharacter

The dot metacharacter matches any character except the newline character.

# matches character 'c', any character and then character 't'
>> 'tac tin c.t abc;tuv acute'.gsub(/c.t/, 'X')
=> "taXin X abXuv aXe"

# matches character 'r', any two characters and then character 'd'
>> 'breadth markedly reported overrides'.gsub(/r..d/) { |s| s.upcase }
=> "bREADth maRKEDly repoRTED oveRRIDes"

# matches character '2', any character and then character '3'
>> "42\t33".sub(/2.3/, '8')
=> "483"
Enter fullscreen mode Exit fullscreen mode

See m modifier section to know how . can match newline as well. Chapter Character class will discuss how to define your own custom placeholder for limited set of characters.

split method

This chapter will additionally use split method to illustrate examples. The split method separates the string based on given regexp (or string) and returns an array of strings.

# same as: 'apple-85-mango-70'.split('-')
>> 'apple-85-mango-70'.split(/-/)
=> ["apple", "85", "mango", "70"]

>> 'bus:3:car:5:van'.split(/:.:/)
=> ["bus", "car", "van"]

# optional limit can be specified as second argument
# when limit is positive, you get maximum of limit-1 splits
>> 'apple-85-mango-70'.split(/-/, 2)
=> ["apple", "85-mango-70"]
Enter fullscreen mode Exit fullscreen mode

See split with capture groups section for details of how capture groups affect the output of split method.

Greedy quantifiers

Quantifiers have functionality like the string repetition operator and range method. They can be applied to both characters and groupings (and more, as you'll see in later chapters). Apart from ability to specify exact quantity and bounded range, these can also match unbounded varying quantities. If the input string can satisfy a pattern with varying quantities in multiple ways, you can choose among three types of quantifiers to narrow down possibilities. In this section, greedy type of quantifiers is covered.

First up, the ? metacharacter which quantifies a character or group to match 0 or 1 times. In other words, you make that character or group as something to be optionally matched. This leads to a terser regexp compared to alternation and grouping.

# same as: /ear|ar/
>> 'far feat flare fear'.gsub(/e?ar/, 'X')
=> "fX feat flXe fX"

# same as: /\bpar(t|)\b/
>> 'par spare part party'.gsub(/\bpart?\b/, 'X')
=> "X spare X party"

# same as: /\b(re.d|red)\b/
>> words = %w[red read ready re;d road redo reed rod]
>> words.grep(/\bre.?d\b/)
=> ["red", "read", "re;d", "reed"]

# same as: /part|parrot/
>> 'par part parrot parent'.gsub(/par(ro)?t/, 'X')
=> "par X X parent"
# same as: /part|parrot|parent/
>> 'par part parrot parent'.gsub(/par(en|ro)?t/, 'X')
=> "par X X X"
Enter fullscreen mode Exit fullscreen mode

The * metacharacter quantifies a character or group to match 0 or more times. There is no upper bound, more details will be discussed later in this chapter.

# match 't' followed by zero or more of 'a' followed by 'r'
>> 'tr tear tare steer sitaara'.gsub(/ta*r/, 'X')
=> "X tear Xe steer siXa"

# match 't' followed by zero or more of 'e' or 'a' followed by 'r'
>> 'tr tear tare steer sitaara'.gsub(/t(e|a)*r/, 'X')
=> "X X Xe sX siXa"

# match zero or more of '1' followed by '2'
>> '3111111111125111142'.gsub(/1*2/, 'X')
=> "3X511114X"
Enter fullscreen mode Exit fullscreen mode

Here's some examples with split and related methods. partition splits the input string on the first match and the text matched by the regexp is also present in the output. rpartition is like partition but splits on the last match.

# note how '25' and '42' gets split, there is '1' zero times in between them
>> '3111111111125111142'.split(/1*/)
=> ["3", "2", "5", "4", "2"]
# there is '1' zero times at end of string as well, note the use of -1 for limit
>> '3111111111125111142'.split(/1*/, -1)
=> ["3", "2", "5", "4", "2", ""]

>> '3111111111125111142'.partition(/1*2/)
=> ["3", "11111111112", "5111142"]

# last element is empty because there is nothing after 2 at the end of string
>> '3111111111125111142'.rpartition(/1*2/)
=> ["311111111112511114", "2", ""]
Enter fullscreen mode Exit fullscreen mode

The + metacharacter quantifies a character or group to match 1 or more times. Similar to * quantifier, there is no upper bound. More importantly, this doesn't have surprises like matching empty string in between patterns or at the end of string.

>> 'tr tear tare steer sitaara'.gsub(/ta+r/, 'X')
=> "tr tear Xe steer siXa"
>> 'tr tear tare steer sitaara'.gsub(/t(e|a)+r/, 'X')
=> "tr X Xe sX siXa"

>> '3111111111125111142'.gsub(/1+2/, 'X')
=> "3X5111142"
>> '3111111111125111142'.split(/1+/)
=> ["3", "25", "42"]
Enter fullscreen mode Exit fullscreen mode

You can specify a range of integer numbers, both bounded and unbounded, using {} metacharacters. There are four ways to use this quantifier as listed below:

Pattern Description
{m,n} match m to n times
{m,} match at least m times
{,n} match up to n times (including 0 times)
{n} match exactly n times
>> demo = %w[abc ac adc abbc xabbbcz bbb bc abbbbbc]

>> demo.grep(/ab{1,4}c/)
=> ["abc", "abbc", "xabbbcz"]
>> demo.grep(/ab{3,}c/)
=> ["xabbbcz", "abbbbbc"]
>> demo.grep(/ab{,2}c/)
=> ["abc", "ac", "abbc"]
>> demo.grep(/ab{3}c/)
=> ["xabbbcz"]
Enter fullscreen mode Exit fullscreen mode

The {} metacharacters have to be escaped to match them literally. However, unlike () metacharacters, these have lot more leeway. For example, escaping { alone is enough, or if it doesn't conform strictly to any of the four forms listed above, escaping is not needed at all. Also, if you are applying {} quantifier to # character, you need to escape the # to override interpolation.

AND conditional

Next up, how to construct AND conditional using dot metacharacter and quantifiers.

# match 'Error' followed by zero or more characters followed by 'valid'
>> 'Error: not a valid input'.match?(/Error.*valid/)
=> true

>> 'Error: key not found'.match?(/Error.*valid/)
=> false
Enter fullscreen mode Exit fullscreen mode

To allow matching in any order, you'll have to bring in alternation as well. That is somewhat manageable for 2 or 3 patterns. See AND conditional with lookarounds section for an easier approach.

>> seq1, seq2 = ['cat and dog', 'dog and cat']
>> seq1.match?(/cat.*dog|dog.*cat/)
=> true
>> seq2.match?(/cat.*dog|dog.*cat/)
=> true

# if you just need true/false result, this would be a scalable approach
>> patterns = [/cat/, /dog/]
>> patterns.all? { |re| seq1.match?(re) }
=> true
>> patterns.all? { |re| seq2.match?(re) }
=> true
Enter fullscreen mode Exit fullscreen mode

What does greedy mean?

When you are using the ? quantifier, how does Ruby decide to match 0 or 1 times, if both quantities can satisfy the regexp? For example, consider this substitution expression 'foot'.sub(/f.?o/, 'X') — should foo be replaced or fo? It will always replace foo because these are greedy quantifiers, meaning they try to match as much as possible.

>> 'foot'.sub(/f.?o/, 'X')
=> "Xt"

# a more practical example
# prefix '<' with '\' if it is not already prefixed
# both '<' and '\<' will get replaced with '\<'
>> puts 'blah \< foo < bar \< blah < baz'.gsub(/\\?</, '\<')
blah \< foo \< bar \< blah \< baz

# say goodbye to /handful|handy|hand/ shenanigans
>> 'hand handy handful'.gsub(/hand(y|ful)?/, 'X')
=> "X X X"
Enter fullscreen mode Exit fullscreen mode

But wait, then how did /Error.*valid/ example work? Shouldn't .* consume all the characters after Error? Good question. The regular expression engine actually does consume all the characters. Then realizing that the regexp fails, it gives back one character from end of string and checks again if the overall regexp is satisfied. This process is repeated until a match is found or failure is confirmed. In regular expression parlance, this is called backtracking.

>> sentence = 'that is quite a fabricated tale'

# /t.*a/ will always match from first 't' to last 'a'
# also, note that 'sub' is being used here, not 'gsub'
>> sentence.sub(/t.*a/, 'X')
=> "Xle"
>> 'star'.sub(/t.*a/, 'X')
=> "sXr"

# matching first 't' to last 'a' for t.*a won't work for these cases
# the regexp engine backtracks until .*q matches and so on
>> sentence.sub(/t.*a.*q.*f/, 'X')
=> "Xabricated tale"
>> sentence.sub(/t.*a.*u/, 'X')
=> "Xite a fabricated tale"
Enter fullscreen mode Exit fullscreen mode

Backtracking can be quite time consuming for certain corner cases (see ruby-doc: Regexp Performance). Or even catastrophic (see cloudflare: Details of the Cloudflare outage on July 2, 2019).

Non-greedy quantifiers

As the name implies, these quantifiers will try to match as minimally as possible. Also known as lazy or reluctant quantifiers. Appending a ? to greedy quantifiers makes them non-greedy.

>> 'foot'.sub(/f.??o/, 'X')
=> "Xot"
>> 'frost'.sub(/f.??o/, 'X')
=> "Xst"

>> '123456789'.sub(/.{2,5}?/, 'X')
=> "X3456789"

>> 'green:3.14:teal::brown:oh!:blue'.split(/:.*?:/)
=> ["green", "teal", "brown", "blue"]
Enter fullscreen mode Exit fullscreen mode

Like greedy quantifiers, lazy quantifiers will try to satisfy the overall regexp.

>> sentence = 'that is quite a fabricated tale'

# /t.*?a/ will always match from first 't' to first 'a'
>> sentence.sub(/t.*?a/, 'X')
=> "Xt is quite a fabricated tale"

# matching first 't' to first 'a' for t.*?a won't work for this case
# so, regexp engine will move forward until .*?f matches and so on
>> sentence.sub(/t.*?a.*?f/, 'X')
=> "Xabricated tale"
# this matches last 'e' after 'q' to satisfy the anchor requirement
>> sentence.sub(/q.*?e$/, 'X')
=> "that is X"
Enter fullscreen mode Exit fullscreen mode

Possessive quantifiers

Appending a + to greedy quantifiers makes them possessive quantifiers. These are like greedy quantifiers, but without the backtracking. So, something like /Error.*+valid/ will never match because .*+ will consume all the remaining characters. If both the greedy and possessive quantifier versions are functionally equivalent, then possessive is preferred because it will fail faster for non-matching cases.

# functionally equivalent greedy and possessive versions
>> %w[abc ac adc abbc xabbbcz bbb bc abbbbbc].grep(/ab*c/)
=> ["abc", "ac", "abbc", "xabbbcz", "abbbbbc"]
>> %w[abc ac adc abbc xabbbcz bbb bc abbbbbc].grep(/ab*+c/)
=> ["abc", "ac", "abbc", "xabbbcz", "abbbbbc"]

# different results
# numbers >= 100 if there are leading zeros
# \d will be discussed in a later chapter, it matches all digit characters
>> '0501 035 154 12 26 98234'.gsub(/\b0*\d{3,}\b/, 'X')
=> "X X X 12 26 X"
>> '0501 035 154 12 26 98234'.gsub(/\b0*+\d{3,}\b/, 'X')
=> "X 035 X 12 26 X"
Enter fullscreen mode Exit fullscreen mode

The effect of possessive quantifier can also be expressed using atomic grouping. The syntax is (?>pat), where pat is an abbreviation for a portion of regular expression pattern. In later chapters you'll see more such special groupings.

# same as: /(b|o)++/
>> 'abbbc foooooot'.gsub(/(?>(b|o)+)/, 'X')
=> "aXc fXt"

# same as: /\b0*+\d{3,}\b/
>> '0501 035 154 12 26 98234'.gsub(/\b(?>0*)\d{3,}\b/, 'X')
=> "X 035 X 12 26 X"
Enter fullscreen mode Exit fullscreen mode

Exercises

For practice problems, visit Exercises.md file from this book's repository on GitHub.

💖 💪 🙅 🚩
learnbyexample
Sundeep

Posted on March 28, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related