Regex 101 - Kill the Monster
Lucas Fonseca Mundim
Posted on September 2, 2021
More often than not, I see people recommending `RegEx` (or `RegExp`) to others to help solve a problem, and the reaction is always the same: they don't want to use `RegEx` because they don't understand it or find it very confusing. I find that odd, because I never thought of `RegEx` as that awful, at least within reasonable boundaries. So today I'll try to explain the basics of `RegEx` to make it less of a monster for you.
Disclaimer: depending on the programming language you are using, the syntax may vary slightly. For reference, I will be using the `JS`/`C#` syntax. You can try it out on Regex101.
The basics
I would really like to split this into more topics, but they wouldn't make much sense on their own, so I'll call them "the basics". This will include:
- Letters and Numbers
- Basic Symbols
- Groups and Ranges
- Counters
- Tokens
Letters and Numbers
First of all, let's talk about letters and numbers. They work very much as you would expect: if you write `a`, the regex will expect the letter `a`, lower case, and so forth. There's really not much to explain here.
Basic Symbols
Symbols, on the other hand, can be a little confusing. Some symbols are reserved by RegEx to do special things. All of them can be escaped with a backslash `\`, which brings us to the first special symbol:
- `\`: escapes a character that would otherwise be special, so that it means literally that character
- `()`: group delimiter, we'll dive deeper into that later
- `[]`: range delimiter, we'll dive deeper into that later
- `{}`: counter delimiter, we'll dive deeper into that later
- `^`: outside delimiters, it means the start of a string. As the first character inside a `[]` range, it means not (the same as the good old `!` in programming)
- `$`: end of a string
- `.`: anything. The `.` means that the character there can be any single character (by default, anything except a line break). Also known as the wildcard
- `|`: our good old boolean operator or
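As a rough sketch of how the symbols above behave in JavaScript (the variable names and sample strings are made up for illustration):

```javascript
// Each pattern exercises one of the special symbols from the list above.
const startsWithCat = /^cat/; // ^ anchors the match to the start of the string
const endsWithCat = /cat$/;   // $ anchors the match to the end of the string
const anyMiddle = /c.t/;      // . stands for any single character (except a line break)
const literalDot = /3\.14/;   // \. means a literal dot instead of "anything"
const catOrDog = /cat|dog/;   // | matches either alternative

console.log(startsWithCat.test("category")); // true
console.log(startsWithCat.test("bobcat"));   // false
console.log(endsWithCat.test("bobcat"));     // true
console.log(anyMiddle.test("cut"));          // true
console.log(literalDot.test("3.14"));        // true
console.log(literalDot.test("3x14"));        // false
console.log(catOrDog.test("hot dog"));       // true
```

Without the backslash, `/3.14/` would also accept `"3x14"`, because the unescaped `.` is the wildcard.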
Groups and Ranges
Groups, delimited by `()`, have more or less the same idea as parentheses in maths or any programming language: they group operations together so that something applies to the entire group (e.g. a counter).
Ranges, delimited by `[]`, are a little more complex, but not by much. They mean that any character within the range is valid. Note that they can be mixed and matched, and even improved:
- `[abc]` means any one character out of `a`, `b` or `c`
- `[^abc]` means any character except `a`, `b` or `c`
- `[a-z]` means any character from `a` to `z`, in alphabetical order (so `[a-c]` would be the same as `[abc]`)
- `[a-zA-Z]` means the same as the above, but accepting both lower and upper case letters
- `[0-9]` means any digit
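The ranges above could be tried out in JavaScript along these lines. This is only a sketch: the variable names and inputs are illustrative, and the `^`/`$` anchors in `hexDigit` are there so that the whole input has to be a single character.

```javascript
const vowel = /[aeiou]/;          // any one of the listed characters
const notDigit = /[^0-9]/;        // any character that is NOT a digit
const hexDigit = /^[0-9a-fA-F]$/; // ranges mixed and matched in one set

console.log(vowel.test("tree"));     // true
console.log(vowel.test("sky"));      // false
console.log(notDigit.test("2021"));  // false (only digits in there)
console.log(notDigit.test("20a21")); // true
console.log(hexDigit.test("f"));     // true
console.log(hexDigit.test("G"));     // false
```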
Counters
Counters (also known as quantifiers) specify how many times you expect the character, range or group right before them to appear.
- `*` means any number, from `0` to ∞, also known as zero or more
- `+` means from `1` to ∞, aka one or more
- `?` means from `0` to `1`, aka zero or one
- `{3}` means exactly 3
- `{3,}` means 3 or more
- `{3,6}` means from 3 to 6
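Here is a small, hedged JavaScript sketch of the counters in action. The names and inputs are made up, and every pattern is wrapped in `^...$` so that the counter has to account for the entire string.

```javascript
const zeroOrMore = /^ab*$/;         // "a" followed by any number of "b"
const oneOrMore = /^ab+$/;          // "a" followed by at least one "b"
const optionalU = /^colou?r$/;      // the "u" may appear zero times or once
const exactlyThree = /^[0-9]{3}$/;  // exactly 3 digits
const threeToSix = /^[0-9]{3,6}$/;  // from 3 to 6 digits
const repeatedGroup = /^(ab){2}$/;  // the counter applies to the whole group

console.log(zeroOrMore.test("a"));          // true
console.log(oneOrMore.test("a"));           // false
console.log(optionalU.test("color"));       // true
console.log(optionalU.test("colour"));      // true
console.log(exactlyThree.test("1234"));     // false
console.log(threeToSix.test("123456"));     // true
console.log(repeatedGroup.test("abab"));    // true
console.log(repeatedGroup.test("abb"));     // false
```

The last two lines show why groups matter: `(ab){2}` repeats `ab`, whereas `ab{2}` would only repeat the `b`.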
Tokens
Just as we have `\n` in programming as a token for `new line`, `RegEx` has its own tokens as well.
- `\s` means any whitespace character (space, tab, new line)
- `\S` means any non-whitespace character
- `\d` means any digit, the same as `[0-9]`
- `\D` means any non-digit, the same as `[^0-9]`
- `\w` means any word character, that is, any letter, digit or underscore
- `\W` means any non-word character, that is, anything besides letters, digits or underscores
- `\b` means word boundary, that is, the position between a character matched by `\w` and a character not matched by `\w`, in either order
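To see the tokens at work, here is a short JavaScript sketch. The sample sentence and variable names are just for illustration, and the `g` flag (not covered in this article) asks `match` to return every occurrence instead of only the first.

```javascript
const text = "Posted on September 2, 2021";

const digits = text.match(/\d+/g); // runs of digits
const words = text.match(/\w+/g);  // runs of word characters
const pieces = text.split(/\s+/);  // split on runs of whitespace

console.log(digits); // [ '2', '2021' ]
console.log(words);  // [ 'Posted', 'on', 'September', '2', '2021' ]
console.log(pieces); // [ 'Posted', 'on', 'September', '2,', '2021' ]

// \b only matches "on" when it stands alone as a word.
const wholeWord = /\bon\b/;
console.log(wholeWord.test(text));      // true
console.log(wholeWord.test("Monster")); // false
```

Note how `\w+` drops the comma after `2` while splitting on `\s+` keeps it, since a comma is neither a word character nor whitespace.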
Join all that together and...
By joining all those definitions, we can start writing `RegEx`es. Let's see some samples:
- Match a 🇧🇷 BR ZIP Code: Brazilian ZIP Codes are 5 digits, followed by a dash, followed by 3 more digits. Or, in `RegEx`:
  - `[0-9]{5}-[0-9]{3}`
  - `\d{5}-\d{3}`
  - Some people might not type in the `-`: `\d{8}`
  - Furthermore, making the dash optional covers both cases: `\d{5}-?\d{3}`
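- Match a `DD/MM/YYYY` or `DD/MM/YY` date:
  - `\d{2}/\d{2}/(\d{4}|\d{2})`
  - Note that `|` evaluation is lazy: alternatives are tried from left to right and the first one that matches wins. That is why `\d{4}` comes before `\d{2}` in `\d{2}/\d{2}/(\d{4}|\d{2})`; with the order reversed, the match would stop after the first two digits of a four-digit year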
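The two samples can be checked with a few lines of JavaScript. Treat this as a sketch under some assumptions: the inputs are made-up values, the `^` and `$` anchors are added so the whole input has to match, and each `/` is escaped because JavaScript regex literals are themselves delimited by slashes.

```javascript
const zipCode = /^\d{5}-?\d{3}$/;
const date = /^\d{2}\/\d{2}\/(\d{4}|\d{2})$/;

console.log(zipCode.test("30130-010")); // true
console.log(zipCode.test("30130010"));  // true (the dash is optional)
console.log(zipCode.test("3013-0010")); // false
console.log(date.test("02/09/2021"));   // true
console.log(date.test("02/09/21"));     // true
console.log(date.test("2/9/2021"));     // false

// Order of alternatives: the first one that matches wins.
const longFirst = "02/09/2021".match(/\d{2}\/\d{2}\/(\d{4}|\d{2})/)[0];
const shortFirst = "02/09/2021".match(/\d{2}\/\d{2}\/(\d{2}|\d{4})/)[0];
console.log(longFirst);  // 02/09/2021
console.log(shortFirst); // 02/09/20
```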
Naming groups
Naming groups should be available in most programming languages, but how it works may vary. It is very useful for readability and should always be used in production environments or serious work, should `RegEx` make it that far. The syntax for a named group is `(?<name>...)` (or `(?P<name>...)` for `Python`, as in the example).
- `((?P<ZIPCode>\d{5}-?\d{3})|(?P<Date>\d{2}\/\d{2}\/(\d{4}|\d{2})))`
  - Yes, in `Python` it is very ugly, but it is language dependent. In `C#` it is much better
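For comparison, this is roughly how the same idea could look in JavaScript, which uses the `(?<name>...)` form. The `classify` helper, the group names and the inputs are hypothetical, written only to show how named groups are read back from a match.

```javascript
// Same two alternatives as the example above, with JavaScript-style names.
const pattern = /(?<zipCode>\d{5}-?\d{3})|(?<date>\d{2}\/\d{2}\/(\d{4}|\d{2}))/;

// Hypothetical helper: reports which named group matched, or null if none did.
function classify(input) {
  const match = pattern.exec(input);
  if (match === null) return null;
  const { zipCode, date } = match.groups;
  if (zipCode !== undefined) return { kind: "zipCode", value: zipCode };
  return { kind: "date", value: date };
}

console.log(classify("30130-010"));  // { kind: 'zipCode', value: '30130-010' }
console.log(classify("02/09/2021")); // { kind: 'date', value: '02/09/2021' }
console.log(classify("hello"));      // null
```

Reading `match.groups.zipCode` says far more about intent than `match[1]`, which is the readability gain named groups are meant to bring.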
Wrap up
This article aimed to just "kill the monster" that people consider Regular Expressions to be, and to show that they are not that scary for simple work. Of course, it gets harder and harder the more complex your matching needs are (e.g. finding an email), but there are usually better ways of doing complex tasks.
If you want or need to dive deeper into Regular Expressions, consider studying the theory behind them (from Formal Languages) and read/play around on Regex101. But beware: it gets really deep, though it's very interesting!