Just Enough Regex

poulamic

Poulami Chakraborty

Posted on September 4, 2020

This is a basic primer on a powerful programming tool - regular expressions.

Regular expressions (regex) are a powerful way to describe patterns in string data. In JavaScript, regular expressions are objects used to find patterns of character combinations in strings. Typical use cases are validating a string against a pattern, searching within a string, replacing substrings, and extracting information from a string.

However, at first (and sometimes even after the hundredth) glance, regex looks complex and daunting. Until now, I had tried to get away with just understanding the concept and use of regex - after all, once I knew that I wanted to use regex, I could look up the syntax to hack together whatever I wanted. That works most of the time.

There are two problems with that process though - i) it is time-consuming, ii) it doesn't help when I need to deconstruct a regex in someone else's code.

So, I finally decided to dive into regex with the express purpose of demystifying regex, and this is my documentation.

Some Regex and String Methods

Regex methods are out of scope of this article. However, as I would be using some methods to demonstrate concepts, I am starting with the format and use of the JavaScript functions.

test()

The test() method executes a search for a match between a regular expression and a specified string. Returns true or false.

var str = "This is an example"
var regex = /exam/;
console.log(regex.test(str)) // true

match()

This is a method of String. It finds matches for a regexp in a string and returns an array of the matches.

var str = "This is a test to test match method";
var regex =/ test/
console.log(str.match(regex));  // [ " test" ]

To find all matches, we use the g (global) flag

var str = "This is a test to test match method";
var regex =/test/g
console.log(str.match(regex)); // [ "test", "test" ]

In case of no matches, null is returned (and NOT an empty array. Important to remember while applying array methods).

var str = "This is a test" ;
console.log(str.match(/hello/)); // null
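
Because of that null, chaining an array method directly onto match() can throw. Here is a minimal sketch of one way to guard against it; allMatches is a hypothetical helper name of my own, not a built-in.

```javascript
// Hypothetical helper: always returns an array, even when nothing matches.
function allMatches(str, regex) {
  // match() returns null on no match, so fall back to an empty array.
  return str.match(regex) || [];
}

console.log(allMatches("This is a test to test", /test/g).length); // 2
console.log(allMatches("This is a test", /hello/g).length); // 0
```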

(There is more to these functions - but again, out of scope of this article)

Regular Expressions

Constructor

There are two ways to construct a regular expression

  1. Using the RegExp constructor

    var re = new RegExp("pattern", "flags");
    
  2. Using a regular expression literal, which consists of a pattern enclosed between slashes (slashes are like quotes for strings - they tell JavaScript that this is a regular expression)

    var re = /pattern/flags;
    

'flags' are optional, and I will discuss them shortly.

Difference between the two methods

Both of them create a RegExp object and return the same results. There is one difference:

Regex literals are compiled when the script is loaded, while the constructor function compiles the regular expression at runtime.

As a result, regex literals can only be static, i.e. we must know the exact pattern while writing the code. They cannot be created from a dynamically generated string, for example when we want to use user input as the regular expression.

For dynamic regex expressions we use the RegExp constructor method

var filter = "star";
var re = new RegExp(filter);
var str = "Twinkle twinkle little star";
console.log(str.match(re));

// [ "star" ]
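
One caveat with dynamic patterns: if the input contains characters that are special in regex (covered later in this article), they will be interpreted as part of the pattern. A common workaround is to backslash-escape them first. The sketch below assumes a hypothetical helper called escapeRegExp; it is not part of the language.

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter in the input.
function escapeRegExp(text) {
  // "$&" in the replacement stands for the matched character itself.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var userInput = "C*3"; // stands in for user-supplied text
var safe = new RegExp(escapeRegExp(userInput));
console.log(safe.test("a^2 + b^2 - C*3")); // true
console.log(safe.test("CCC3")); // false

// Without escaping, "*" acts as a quantifier and "CCC3" matches.
console.log(new RegExp(userInput).test("CCC3")); // true
```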

Flags

Flags are optional parameters that can be added to a regular expression to affect its matching behavior. Each flag modifies the search in a different way. The two most common are:

  • i: Ignores casing (/e/i will match both 'e' and 'E')
  • g: Global search returning all matches for a given expression inside a string - without it, only the first match is returned

The other flags (m, s, u, y) are rarely used, and some require an understanding of more advanced concepts, so they are left out of this article. This codegauge lesson dives deep into the flags.

These flags can be used separately or together in any order.

var str = "Hello, hello";
console.log(str.match(/he/gi));
// Array(2) [ "He", "he" ]

Regular Expression Pattern

Literal Characters

The simplest regular expression is a series of letters and numbers that have no special meaning. There is a match only if there is exactly that sequence of characters in the string it is searching, i.e., it is a 'literal' match.

Simple patterns are constructed of characters for which you want to find a direct match. For example, the pattern /abc/ matches character combinations in strings only when the exact sequence "abc" occurs (all characters together and in that order).

console.log(/abc/.test("abc")); //true
console.log(/abc/.test("I am learning my abcs")); //true
console.log(/abc/.test("The cab collided")); //false

But this could also be done with indexOf. Why do we need regex?

Well, regex is mostly used when we want complicated or 'less literal' matches (ex: a date pattern - we don't want to match a particular date, just check the format). To do that, we use metacharacters.

Special (meta) characters and Escaping

A metacharacter is a character that has a special meaning (instead of a literal meaning) during pattern processing. We use these special characters in regular expressions to transform literal characters into powerful expressions.

In JavaScript, the special characters are: backslash \, caret ^, dollar sign $, period or dot ., vertical bar |, question mark ?, asterisk *, plus sign +, opening parenthesis (, closing parenthesis ), and opening square bracket [. Some, like the opening curly bracket {, have a special meaning when paired with the closing curly bracket }.

We will go over each of these in time. Before that - escaping.

What if we want to find a 'literal match' for one of these special characters? (Example: find a literal match for "^"?)

To do that, we use another metacharacter: the backslash \. Prepending \ to a special character causes it to be treated as a literal character.

console.log(/b^2/.test('a^2 + b^2 - C*3')); //false
console.log(/b\^2/.test('a^2 + b^2 - C*3')); //true
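
Escaping needs a little extra care with the RegExp constructor, because the backslash is also an escape character inside JavaScript strings. A small sketch of the difference (the variable names are just for illustration):

```javascript
// In a string, "\\" is one backslash, so "b\\^2" becomes the pattern b\^2.
var fromLiteral = /b\^2/;
var fromString = new RegExp("b\\^2");

console.log(fromLiteral.source === fromString.source); // true
console.log(fromString.test("a^2 + b^2 - C*3")); // true
```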

Groups and ranges

Several characters or character classes inside square brackets […] mean "search for any of these characters".

For example [ae] will match for 'a' or 'e'

console.log(/[ae]/.test("par")); //true
console.log(/[ae]/.test("per")); //true
console.log(/[ae]/.test("por")); //false

We can use square brackets within a bigger regular expression.

console.log(/b[ae]r/.test("bard")); //true
console.log(/b[ae]r/.test("bread")); //false

Within square brackets, a hyphen (-) between two characters can be used to indicate a range of characters (where the ordering is determined by the character’s Unicode number).

Ex: [0-9] will match any character between 0 and 9, '[a-z]' is a character in range from a to z

console.log(/[0-9]/.test("for 4 years")); //true
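
Ranges and single characters can share the same brackets. As a quick sketch, here is a set that matches one hexadecimal digit (hexDigit is just an illustrative name):

```javascript
// One set can hold several ranges: digits plus a-f in either case.
var hexDigit = /[0-9a-fA-F]/;

console.log(hexDigit.test("c")); // true
console.log(hexDigit.test("7")); // true
console.log(hexDigit.test("xyz")); // false
```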

A number of common character groups have their own built-in shortcuts in the form of character classes.

Character classes

Character classes are shorthands for certain character sets.

Character Class   Represents
\d                Any digit character (from 0 to 9)
\D                Non-digit: any character except \d
\w                Any alphanumeric character from the basic Latin alphabet (including digits), plus the underscore
\W                Non-word character: anything but \w. Ex: a non-Latin letter, a symbol such as %, or a space
\s                A single whitespace character, including space, tab, form feed, line feed, and other Unicode spaces
\S                Non-space: any character except \s, for instance a letter

As we can note: For every character class there exists an “inverse class”, denoted with the same letter, but uppercased.

Apart from these, there are character classes that support certain non-printable characters.

Character Class   Represents
\t                Matches a horizontal tab
\r                Matches a carriage return
\n                Matches a linefeed
\v                Matches a vertical tab
\f                Matches a form-feed
\0                Matches a NUL character (do not follow this with another digit)
[\b]              Matches a backspace

Character classes can be written in series to create complex patterns. For example, to check for a time in hh:mm format, the regular expression is '\d\d:\d\d'. (For now, we are not checking the validity of the input, i.e. 72:80 is also a valid time for our purposes.)

console.log(/\d\d:\d\d/.test("2:25")); //false
console.log(/\d\d:\d\d/.test("02:25")); //true
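
The same idea covers the date format check mentioned earlier. This is a rough sketch that assumes a yyyy-mm-dd layout (my choice for the example) and, like the time pattern, checks only the shape and not whether the date is valid:

```javascript
// Four digits, a hyphen, two digits, a hyphen, two digits.
var datePattern = /\d\d\d\d-\d\d-\d\d/;

console.log(datePattern.test("Posted on 2020-09-04")); // true
console.log(datePattern.test("Posted on 04/09/2020")); // false
console.log(datePattern.test("2020-9-4")); // false
```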

Anchors

Anchors in regular expressions do not match any character. Instead, they match a position before or after characters. They can be used to “anchor” the regex match at a certain position.

  • Caret (^) matches the position before the first character in the string, i.e. the regular expression that follows it should be at the start of the test string.
  • Dollar ($) matches the position right after the last character in the string, i.e. the regular expression that precedes it should be at the end of the test string.

console.log(/^Jack/.test("Jack and Jill went up the hill")); //true
console.log(/^hill/.test("Jack and Jill went up the hill")); //false
console.log(/hill$/.test("Jack and Jill went up the hill")); //true
console.log(/Jack$/.test("Jack and Jill went up the hill")); //false

Both anchors together (^...$) are often used to test whether a string fully matches the pattern.

Going back to our time example:

console.log(/\d\d:\d\d/.test("02:25")); //true
console.log(/\d\d:\d\d/.test("02:225")); //true
console.log(/^\d\d:\d\d/.test("02:225")); //true
console.log(/\d\d:\d\d$/.test("102:25")); //true
console.log(/^\d\d:\d\d$/.test("102:25")); //false
console.log(/^\d\d:\d\d$/.test("02:225")); //false

In multiline mode (with flag 'm'), ^ and $ match not only at the beginning and the end of the string, but also at start/end of line.
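
A short sketch of the difference, using a two-line version of the earlier sentence (the line break position is my own choice):

```javascript
var lines = "Jack and Jill\nwent up the hill";

// Without m, ^ and $ only see the start and end of the whole string.
console.log(/^went/.test(lines)); // false
console.log(/Jill$/.test(lines)); // false

// With m, they also match at the start and end of each line.
console.log(/^went/m.test(lines)); // true
console.log(/Jill$/m.test(lines)); // true
```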

Apart from line boundaries, we can also check for a word boundary position in a string, using \b. Three kinds of position qualify as word boundaries:

  • At string start, if the first string character is a word character \w
  • Between two characters in the string, where one is a word character \w and the other is not
  • At string end, if the last string character is a word character \w

var str = "Hello, hello"; // assumed sample string
console.log(/hell/i.test(str)) //true
console.log(/hell\b/i.test(str)) //false
console.log(/hello\b/i.test(str)) //true
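
A typical use of \b is matching a whole word and nothing longer. A minimal sketch, with wholeWord as an illustrative name:

```javascript
// \b on both sides: "cat" must not be glued to other word characters.
var wholeWord = /\bcat\b/;

console.log(wholeWord.test("the cat sat")); // true
console.log(wholeWord.test("cat")); // true
console.log(wholeWord.test("concatenate")); // false
console.log(wholeWord.test("cats")); // false
```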

Quantifiers

Quantifiers are used to handle repeated patterns in regular expressions. For example, if we are to check for a 10-digit number, having to write /\d\d\d\d\d\d\d\d\d\d/ seems awful - how about a 100 digit number?

With quantifiers, we can specify how many instances of a character, group, or character class are required. The quantifier is appended just after the element to be repeated and applies only to that element. For example, in /a+/ the quantifier '+' applies to the character 'a'; in /cat+/, the '+' applies to 't' and not to the word 'cat'.

  • {n} - matches exactly "n" occurrences
  • {n,m} - matches at least "n" occurrences and at most "m" occurrences (n ≤ m)
  • {n,} - matches at least "n" occurrences
  • + - matches 1 or more times
  • * - matches 0 or more times
  • ? - matches 0 or 1 times. In other words, it makes the preceding token optional
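
To see each quantifier act on only the character right before it, here is a small sketch. The anchors are there so that the whole string has to fit the pattern; the test words are my own.

```javascript
// + repeats only the "t": one or more of them.
console.log(/^cat+$/.test("cattt")); // true
console.log(/^cat+$/.test("catcat")); // false

// * allows zero "t"s, ? makes the "s" optional.
console.log(/^cat*$/.test("ca")); // true
console.log(/^cats?$/.test("cat")); // true

// {2,3} asks for two or three "a"s.
console.log(/^ca{2,3}t$/.test("caaat")); // true
console.log(/^ca{2,3}t$/.test("cat")); // false
```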

Let's go back to the time example and simplify it using quantifiers. We want to have time in the format hh:mm or h:mm (Note ^ and $ are not affected by quantifiers)

var re = /^\d{1,2}:\d{2}$/
console.log(re.test("02:25")); //true
console.log(re.test("2:25")); //true
console.log(re.test("102:25")); //false
console.log(re.test("02:225")); //false
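
And the 10-digit (or 100-digit) number from the start of this section becomes a one-liner. A sketch, with variable names chosen for the example:

```javascript
var tenDigits = /^\d{10}$/;
console.log(tenDigits.test("9876543210")); // true
console.log(tenDigits.test("987654321")); // false (only 9 digits)
console.log(tenDigits.test("98765432100")); // false (11 digits)

var hundredDigits = /^\d{100}$/;
console.log(hundredDigits.test("7".repeat(100))); // true
```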

Let's try something a little more complex: checking whether a string looks like an HTML element. We will check for an opening and a closing tag (not considering attributes for now). The pattern is a tag name of one or more characters between '<' and '>', followed by some text and then a closing tag.

var re = /<[a-z][a-z0-6]*>[\w\W]+<\/[a-z][a-z0-6]*>/i;
console.log(re.test('<h1>Hello World!</h1>')); //true
console.log(re.test('<h1>Hello World!')); //false
console.log(re.test('Hello World!</h1>')); //false
console.log(re.test('</h1>Hello World!</h1>')); //false

Groups

A part of a pattern can be enclosed in parentheses (). This is called a “capturing group”. It counts as a single element as far as the operators following it are concerned.

console.log(/(java)/.test('javascript')) //true
console.log(/(java)/.test('javscript')) //false

If we put a quantifier after the parentheses, it applies to the parentheses as a whole.

console.log(/(la)+/.test('lalalala')); //true
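
The "capturing" part of the name matters when we use match(): without the g flag, the returned array holds the full match at index 0, followed by the text captured by each group. A minimal sketch with the time pattern (the sample sentence is made up):

```javascript
// Two capturing groups: one for the hours, one for the minutes.
var result = "Meet at 02:25 today".match(/(\d\d):(\d\d)/);

console.log(result[0]); // 02:25 (the full match)
console.log(result[1]); // 02 (first group)
console.log(result[2]); // 25 (second group)
```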

Negation

For cases where we don't want to match a character, we create a negated (or complemented) character set. Negation also uses a combination of the [] and ^ special characters.
[^xyz] means that it matches anything that is not enclosed in the brackets. (Note: in anchors ^ is outside the brackets).

console.log(/ello/.test('hello')); //true
console.log(/[^h]ello/.test('hello')); //false
console.log(/[^h]ello/.test('cello')); //true
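
Negation works with ranges too. As a sketch, a negated digit range can flag any string that contains something other than digits (hasNonDigit is an illustrative name):

```javascript
// Matches any single character that is NOT in the range 0-9.
var hasNonDigit = /[^0-9]/;

console.log(hasNonDigit.test("2020")); // false
console.log(hasNonDigit.test("20a20")); // true
console.log(hasNonDigit.test("")); // false (no characters to match at all)
```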

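Note that a negated set still works on single characters, not on whole words. The set below excludes '(', ')', '+' and every character matched by \w (which already covers the letters of "password"); the test returns false only because every character in the string is a word character. To exclude a whole word, use a negative lookahead, covered in the next section.

console.log(/[^(password)\w+]/.test('password1234')); //false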

Conditionals (Lookahead and lookbehind)

Sometimes we need to find only those matches for a pattern that are (or are not) followed or preceded by another pattern.

Pattern    Meaning
x(?=y)     Matches "x" only if "x" is followed by "y"
x(?!y)     Matches "x" only if "x" is not followed by "y"
(?<=y)x    Matches "x" only if "x" is preceded by "y"
(?<!y)x    Matches "x" only if "x" is not preceded by "y"

var str = "apple mango pineApples grape Grapefruit";
console.log(str.match(/grape(?=(fruit))/gi)); // [ "Grape"]
console.log(str.match(/grape(?!(fruit))/gi)); // [ "grape"]
console.log(str.match(/(?<=(pine))apple/gi)); // [ "Apple"]
console.log(str.match(/(?<!(pine))apple/gi)); // [ "apple"]
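
Lookbehind is handy for pulling out values that follow a marker without including the marker itself. The sketch below uses a made-up receipt string and assumes a JavaScript runtime that supports lookbehind (older engines may not):

```javascript
var receipt = "2 apples for $3 and 5 mangoes for $12";

// Numbers that come right after a dollar sign: the prices.
console.log(receipt.match(/(?<=\$)\d+/g)); // [ "3", "12" ]

// Whole numbers NOT preceded by a dollar sign: the quantities.
// \b stops the match from starting in the middle of "12".
console.log(receipt.match(/(?<!\$)\b\d+/g)); // [ "2", "5" ]
```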

Alternation

Alternation is just another word for logical OR - i.e. match this OR that. Previously discussed [] was for single character (out of several possible characters). Alternation is to match a single regular expression out of several possible regular expressions. It is denoted by the pipe character (|).

Ex: with /(abc|def)/, we are looking for matches for either 'abc' or 'def'

console.log(/\b(apple|mango)\b/.test('I like mango')) //true
console.log(/\b(apple|mango)\b/.test('I like apple')) //true

We can combine/nest with other things we have learnt to create more complex patterns

console.log(/\b((pine)?apple|mango)\b/.test('I like pineapple')) //true
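
As one more sketch that pulls together alternation, an optional character, an escaped dot, an anchor, and the i flag, here is a rough check for image file names. The list of extensions is my own pick for the example, not an exhaustive one.

```javascript
// A literal dot, then one of the listed extensions, at the end of the string.
var imageFile = /\.(png|jpe?g|gif)$/i;

console.log(imageFile.test("photo.JPG")); // true
console.log(imageFile.test("photo.jpeg")); // true
console.log(imageFile.test("notes.txt")); // false
console.log(imageFile.test("png.backup")); // false
```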

That's it for this article. This is just an introduction; there are some more concepts to understand which can help become more proficient in regex - like greedy and lazy quantifiers, backreferences, more advanced conditionals, etc. Javascript.info and eloquentjavascript are two good places to start from.
