Regex 101 - Kill the Monster

More often than not, I see people recommending RegEx (or RegExp) to help someone solve a problem, and the reaction is always the same: they don't want to use RegEx because they don't understand it or find it very confusing. I find that odd, because I never thought of RegEx as that awful, at least within reasonable bounds. So today I'll try to explain the basics of RegEx to make it less of a monster for you guys.

Disclaimer: depending on the programming language you are using, the syntax may vary slightly. For reference, I will be using the JS/C# syntax. You can try it out on Regex101.

The basics

I would really like to split this into more topics, but they wouldn't make much sense on their own, so I'll just call them "the basics". This will include:

Letters and Numbers
Basic Symbols
Groups and Ranges
Counters
Tokens

Letters and Numbers

First of all, let's talk about letters and numbers. They work very much as you would expect: if you write a, the regex will expect the letter a, lower case, and so forth. There's really not much to explain here.
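
As a quick illustration, here is a minimal sketch using Python's built-in re module (the helper name contains and the sample words are made up for this example). It only shows that plain letters and digits match themselves and that the match is case sensitive by default.

```python
import re


def contains(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None


# A plain letter or digit matches exactly that character.
found_lower = contains("a", "banana")     # True
found_upper = contains("a", "BANANA")     # False: matching is case sensitive
found_number = contains("42", "room 42")  # True

print(found_lower, found_upper, found_number)
```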

Basic Symbols

Symbols, on the other hand, can be a little bit confusing. Some symbols are reserved by RegEx to do special things. Any of them can be escaped with a backslash \, which brings us to the first special symbol:

\: escapes a character that would otherwise be special, so that it means literally that character
(): group delimiter, we'll dive in deeper on that later
[]: range delimiter, we'll dive in deeper on that later
{}: counter delimiter, we'll dive in deeper on that later
^: outside a range, it means the start of a string. As the first character inside a range ([]), it means not (the same as the good old ! in programming)
$: end of a string
.: anything. The . means that the character there can be any single character (by default, anything except a line break). Also known as a wildcard
|: our good old boolean operator or
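
To see a few of these symbols in action, here is a small sketch in Python (the sample strings are invented for the example). It is not exhaustive; it just shows the anchors, the wildcard, the or operator and escaping side by side.

```python
import re

# ^ and $ anchor the pattern to the start and the end of the string.
starts_with_cat = re.search(r"^cat", "catalog") is not None  # True
ends_with_cat = re.search(r"cat$", "catalog") is not None    # False

# . stands for any single character (except a line break by default).
wildcard = re.findall(r"c.t", "cat cot c-t ct")  # ['cat', 'cot', 'c-t']

# | means "or": either side may match.
either = re.findall(r"cat|dog", "a cat and a dog")  # ['cat', 'dog']

# A backslash turns a special symbol back into a literal character.
literal_dot = re.findall(r"3\.14", "3.14 or 3x14")   # ['3.14']
unescaped_dot = re.findall(r"3.14", "3.14 or 3x14")  # ['3.14', '3x14']

print(starts_with_cat, ends_with_cat, wildcard, either, literal_dot, unescaped_dot)
```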

Groups and Ranges

Groups, delimited by (), work more or less like parentheses in maths or programming: they group operations together so that something applies to the entire group (e.g. a counter).

Ranges, delimited by [], are a little bit more complex, but not by much. A range matches a single character, and any character within the range is valid. Note that ranges can be mixed and matched, and even negated:

[abc] means any character from a, b or c
[^abc] means any character except a, b or c
[a-z] means any character from a to z, in the alphabetical order (so [a-c] would be the same as [abc])
[a-zA-Z] means any letter from a to z, in lower or upper case
[0-9] means any digit
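
The ranges listed above can be tried out with a short Python sketch like the one below (the sample text "Cab 42!" is arbitrary). Each call collects every single character that the range accepts.

```python
import re

text = "Cab 42!"

only_abc = re.findall(r"[abc]", text)       # ['a', 'b'] (the C is upper case)
not_abc = re.findall(r"[^abc]", text)       # ['C', ' ', '4', '2', '!']
lowercase = re.findall(r"[a-z]", text)      # ['a', 'b']
any_letter = re.findall(r"[a-zA-Z]", text)  # ['C', 'a', 'b']
digits = re.findall(r"[0-9]", text)         # ['4', '2']

print(only_abc, not_abc, lowercase, any_letter, digits)
```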

Counters

Counters specify how many times you expect a given character (or rule, such as a group or a range) to repeat.

* means any number, or from 0 to ∞, also known as zero or more
+ means from 1 to ∞, aka one or more
? means from 0 to 1, aka zero or one
{3} means exactly 3
{3,} means 3 or more
{3,6} means from 3 to 6
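
Here is a minimal Python sketch of the counters above (the helper full is a made-up name; it uses re.fullmatch so the whole string has to fit the pattern). The last two lines also show why groups matter: a counter applies only to the thing right before it.

```python
import re


def full(pattern: str, text: str) -> bool:
    """Return True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None


star = [full(r"ab*", s) for s in ("a", "ab", "abbb")]      # [True, True, True]
plus = [full(r"ab+", s) for s in ("a", "ab", "abbb")]      # [False, True, True]
optional = [full(r"ab?", s) for s in ("a", "ab", "abbb")]  # [True, True, False]

exactly_three = [full(r"a{3}", s) for s in ("aa", "aaa", "aaaa")]         # [False, True, False]
three_or_more = [full(r"a{3,}", s) for s in ("aa", "aaa", "aaaaaaa")]     # [False, True, True]
three_to_six = [full(r"a{3,6}", s) for s in ("aaa", "aaaaaa", "aaaaaaa")]  # [True, True, False]

# With a group, the counter repeats the whole group, not just the last letter.
group_repeated = full(r"(ab){2}", "abab")  # True
last_letter_only = full(r"ab{2}", "abab")  # False: ab{2} means "abb"

print(star, plus, optional, exactly_three, three_or_more, three_to_six)
print(group_repeated, last_letter_only)
```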

Tokens

Just as we have \n in programming as a token for a new line, RegEx has its own tokens as well.

\s means any whitespace character (space, tab, new line)
\S means any non-whitespace character
\d means any digit, the same as [0-9] (some engines, such as .NET and Python 3, also accept digits from other scripts by default)
\D means any non-digit, the same as [^0-9]
\w means any word character, that is, any letter, digit or underscore
\W means any non-word character, that is, anything besides letters, digits or underscores
\b means word boundary, that is, the position between a character matched by \w and a character not matched by \w (or the start or end of the string), in either order. It matches a position, not a character
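
The tokens can be tried out with a short Python sketch such as this one (the sample strings are invented). Note how \b keeps "cat" from matching inside "concat" or "cats".

```python
import re

text = "Order 66, item_7\tshipped"

whitespace = re.findall(r"\s", text)  # [' ', ' ', '\t']
digits = re.findall(r"\d", text)      # ['6', '6', '7']
words = re.findall(r"\w+", text)      # ['Order', '66', 'item_7', 'shipped']
non_word = re.findall(r"\W", text)    # [' ', ',', ' ', '\t']

# The upper-case tokens are the negated versions.
non_digits = re.findall(r"\D+", "a1b22c")  # ['a', 'b', 'c']
non_spaces = re.findall(r"\S+", "a b\tc")  # ['a', 'b', 'c']

# \b matches the position between a word character and a non-word character.
whole_word = re.findall(r"\bcat\b", "cat concat cats")  # ['cat']
anywhere = re.findall(r"cat", "cat concat cats")        # ['cat', 'cat', 'cat']

print(whitespace, digits, words, non_word)
print(non_digits, non_spaces, whole_word, anywhere)
```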

Join all that together and...

By joining all those definitions, we can start writing RegExes. Let's see some samples:

Match a 🇧🇷BR ZIP Code: Brazilian ZIP Codes are 5 digits, followed by a dash, followed by 3 more digits. Or, in RegEx:
- [0-9]{5}-[0-9]{3}
- \d{5}-\d{3}
- Some people might not type in the -: \d{8}
- Furthermore, to accept both forms with a single expression: \d{5}-?\d{3}
Match a DD/MM/YYYY or DD/MM/YY date:
- \d{2}/\d{2}/(\d{4}|\d{2})
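- Note that | is evaluated from left to right and stops at the first alternative that matches, so the order matters: written as (\d{2}|\d{4}), a four-digit year would match only its first two digits (unless the pattern is anchored at the end)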
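
Both samples can be checked with a minimal Python sketch like the one below. The names ZIP_PATTERN, DATE_PATTERN, is_zip and is_date are made up for this example, and the sample values are arbitrary. The last two lines illustrate why the order of the alternatives around | matters.

```python
import re

# Hypothetical names for this sketch; the patterns come from the samples above.
ZIP_PATTERN = re.compile(r"\d{5}-?\d{3}")
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/(\d{4}|\d{2})")


def is_zip(text: str) -> bool:
    """Return True if the whole text looks like a BR ZIP code."""
    return ZIP_PATTERN.fullmatch(text) is not None


def is_date(text: str) -> bool:
    """Return True if the whole text looks like DD/MM/YYYY or DD/MM/YY."""
    return DATE_PATTERN.fullmatch(text) is not None


print(is_zip("01310-100"), is_zip("01310100"), is_zip("1310-100"))     # True True False
print(is_date("31/12/2024"), is_date("31/12/24"), is_date("31/12/2"))  # True True False

# | tries the alternatives from left to right and keeps the first that fits.
long_first = re.search(r"\d{2}/\d{2}/(\d{4}|\d{2})", "31/12/2024").group()   # '31/12/2024'
short_first = re.search(r"\d{2}/\d{2}/(\d{2}|\d{4})", "31/12/2024").group()  # '31/12/20'
print(long_first, short_first)
```

Keep in mind that these patterns only check the shape of the text: 99/99/9999 still counts as a "date" here.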

Naming groups

Named groups are available in most programming languages, but the exact syntax may vary. They are very useful for readability and should always be used in production environments or serious work, should RegEx make it that far. The syntax for naming a group is (?<name>...) (or (?P<name>...) in Python, as in the example).

((?P<ZIPCode>\d{5}-?\d{3})|(?P<Date>\d{2}\/\d{2}\/(\d{4}|\d{2})))
- Yes, in Python it is very ugly, but it is language dependent. In C# it is much better
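
Here is a rough sketch of how the named groups from the expression above could be read back in Python. The names PATTERN and classify are made up for this example; the group names ZIPCode and Date come from the expression itself.

```python
import re
from typing import Optional, Tuple

# The expression from above, unchanged.
PATTERN = re.compile(
    r"((?P<ZIPCode>\d{5}-?\d{3})|(?P<Date>\d{2}\/\d{2}\/(\d{4}|\d{2})))"
)


def classify(text: str) -> Optional[Tuple[str, str]]:
    """Return (group name, matched text), or None if nothing fits."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    # A named group that did not take part in the match is None.
    if match.group("ZIPCode") is not None:
        return ("ZIPCode", match.group("ZIPCode"))
    return ("Date", match.group("Date"))


print(classify("01310-100"))   # ('ZIPCode', '01310-100')
print(classify("31/12/2024"))  # ('Date', '31/12/2024')
print(classify("hello"))       # None
```

Reading match.group("Date") is a lot easier to follow than remembering that the date happens to be group number 3.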

Wrap up

This article aimed just to "kill the monster" that people consider Regular Expressions to be, and to show that they are not that scary for simple work. Of course, it gets harder the more complex your matching needs are (e.g. finding an email address), but there are usually better ways of doing complex tasks.

If you want or need to dive deeper into Regular Expressions, consider studying the theory behind them (from Formal Languages) and reading up and playing around on Regex101. But beware: it gets really deep, though it's very interesting!
