Finally Understand Regular Expressions: Regex isn't as hard as it looks
Alex Hyett
Posted on January 10, 2023
There's nothing like a regular expression to strike fear in the heart of a developer.
Regular expressions (regex) are used for a lot of things, such as validating that a string is in the right format and grabbing certain parts of a string.
You can do simple string searches with regex, but obviously that's not what makes it powerful.
If you want to follow along with these examples, there is a great website called Regexr that I always use when testing regular expressions.
Special characters
There are a few different special characters you can use to help you with your searches.
- \w will match every alphanumeric character as well as underscores.
- \d will match all number characters.
- \s matches spaces, tabs and new lines.
We can turn these into the negative versions by using a capital letter:
- \W will match everything that is not alphanumeric or an underscore.
- \D will match everything that is not a number.
- \S will match everything that is not a space, tab or new line.
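If you would rather see these in code than in Regexr, here is a minimal sketch using Python's built-in re module. The choice of Python and the sample string are my own assumptions for illustration; the shorthand classes behave the same way in most regex flavours for plain ASCII text like this.

```python
import re

# Hypothetical sample text, chosen to contain letters, digits,
# an underscore, spaces and a tab.
sample = "id_7 = 42\tok"

word_chars = re.findall(r"\w", sample)   # letters, digits and underscore
digits = re.findall(r"\d", sample)       # number characters only
whitespace = re.findall(r"\s", sample)   # spaces, tabs and new lines

non_word = re.findall(r"\W", sample)     # everything \w does not match
non_digits = re.findall(r"\D", "a1b2")   # everything \d does not match
non_space = re.findall(r"\S", "a b")     # everything \s does not match

print(word_chars)   # ['i', 'd', '_', '7', '4', '2', 'o', 'k']
print(digits)       # ['7', '4', '2']
print(whitespace)   # [' ', ' ', '\t']
print(non_word)     # [' ', '=', ' ', '\t']
print(non_digits)   # ['a', 'b']
print(non_space)    # ['a', 'b']
```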
There is also another special character that is used to match any character in your string, and that is a dot (.).
If you were to search for .at in the following sentence:
The cat sat on the mat at home.
It will match on cat, sat, mat, and at (together with the space in front of it, because the dot matches the space too).
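The same search can be run in code. This is a small sketch using Python's re module (my choice of language, not something the examples depend on); note how the last match carries the leading space that the dot consumed.

```python
import re

sentence = "The cat sat on the mat at home."

# The dot matches any single character (except a new line by default),
# so the space before the standalone "at" counts as well.
dot_matches = re.findall(r".at", sentence)

print(dot_matches)  # ['cat', 'sat', 'mat', ' at']
```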
Quantifiers
There are a few quantifiers you can use in regex to match multiple occurrences of a pattern.
- * matches 0 or more of the preceding pattern.
- + matches at least 1 of the preceding pattern.
- ? matches 0 or 1 of the preceding pattern (it is basically optional).
- {3} matches exactly 3 occurrences of the preceding pattern.
- {3,5} matches between 3 and 5 occurrences (3, 4 or 5) of the preceding pattern.
Let's say we have the following text:
a aa aaa aaaa aaaaa aaaaaa
This is what we get with the following patterns:
- a* matches a, aa, aaa, aaaa, aaaaa and aaaaaa.
- a+ matches all of them as well, as each one has at least 1 a.
- a? matches a 21 times, once for each individual a in the text.
- a{3} matches just aaa, 5 times: once each in aaa, aaaa and aaaaa, and twice in aaaaaa.
- a{4,5} matches 3 times: aaaa once and aaaaa twice (once in aaaaa and once in the first five letters of aaaaaa).
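To check those counts yourself, here is a rough sketch in Python's re module (an assumption on my part; any language with a regex engine would do). One caveat: a* and a? are also allowed to match nothing at all, so Python reports extra zero-length matches, which the sketch filters out to line up with the list above.

```python
import re

text = "a aa aaa aaaa aaaaa aaaaaa"

# a* and a? can match the empty string, so drop zero-length results.
star = [m for m in re.findall(r"a*", text) if m]
plus = re.findall(r"a+", text)
optional = [m for m in re.findall(r"a?", text) if m]
exactly_three = re.findall(r"a{3}", text)
four_to_five = re.findall(r"a{4,5}", text)

print(star)                # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa']
print(plus == star)        # True
print(len(optional))       # 21
print(len(exactly_three))  # 5
print(four_to_five)        # ['aaaa', 'aaaaa', 'aaaaa']
```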
Character Sets
In some cases, we want to match a range of different characters. For this, we have character sets. To use a character set, we can put a range of characters in square brackets [].
Let's take our simple sentence again and look at an example:
The cat sat on the mat at home.
If we search for the pattern [cs]at we are going to match on cat and sat but not mat.
You can also do ranges of characters. If we search for the pattern [a-p]at then we are going to match on cat and mat but not sat.
As with the special characters, it is also possible to look at the negative version of this by putting a ^ symbol at the start inside the brackets.
So doing [^a-p]at will match on sat but also on at with the space in front of it, as spaces are included as characters as well.
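Here are the three character set patterns side by side, as a short sketch in Python's re module (again, the language is my own pick for illustration).

```python
import re

sentence = "The cat sat on the mat at home."

either_c_or_s = re.findall(r"[cs]at", sentence)    # c or s, then "at"
a_to_p = re.findall(r"[a-p]at", sentence)          # any letter a..p, then "at"
not_a_to_p = re.findall(r"[^a-p]at", sentence)     # anything except a..p, then "at"

print(either_c_or_s)  # ['cat', 'sat']
print(a_to_p)         # ['cat', 'mat']
print(not_a_to_p)     # ['sat', ' at']
```

The last result includes the space in front of at, because a space is not in the range a to p.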
Capture Groups
One of the main reasons for using regular expressions is to extract a string from a bit of text.
For example, if you wanted to extract the domain from the following email address:
cat@alexhyett.com
We can use the following regular expression to match on this email address:
[\w-\.]+@([\w-]+\.+[\w]{2,63})
Let's break this down, so we can see what it is doing:
- [\w-\.]+ matches any alphanumeric character and underscore (as denoted by the \w) as well as a hyphen (-) and a dot (.). The dot here has been escaped with a backslash so that it doesn’t get confused with the . special character (inside square brackets the escape is optional, but it does no harm). These characters are matched one or more times, denoted by the +.
- @ is just matching the @ character.
- [\w-]+ matches any alphanumeric character and underscore (as denoted by the \w) as well as a hyphen (-). These characters are matched one or more times, denoted by the +.
- \.+ is just matching the . character, one or more times.
- [\w]{2,63} matches any alphanumeric character and underscore. I don’t think you can have underscores in top level domains so this should probably just be [a-zA-Z] but \w is simpler. This is then matched 2 to 63 times to allow for extensions such as uk, com and technology.
We have then added brackets (parentheses) from just after the @ until the end of the pattern to create a capture group.
When you use this regex in code you will be able to look at the groups and extract the domain, e.g. alexhyett.com.
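As a rough sketch of what "looking at the groups" means in practice, here is the extraction in Python's re module. Two things here are my assumptions rather than part of the pattern above: the language itself, and a small rewrite of the first character set as [\w.-]. Python is stricter than Regexr about a hyphen sitting between \w and \. inside brackets, and moving the hyphen to the end keeps the same set of characters.

```python
import re

# Same pattern as above, with the first character set written as [\w.-]
# so that Python reads the hyphen as a literal character.
EMAIL_PATTERN = re.compile(r"[\w.-]+@([\w-]+\.+[\w]{2,63})")

match = EMAIL_PATTERN.search("cat@alexhyett.com")

# group(0) is the whole match, group(1) is the first capture group.
whole_match = match.group(0) if match else None
domain = match.group(1) if match else None

print(whole_match)  # cat@alexhyett.com
print(domain)       # alexhyett.com
```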
Lookahead and Lookbehind
This is where people start to switch off when it comes to regular expressions.
Positive and Negative Lookahead and Lookbehinds sound complicated, but they are not actually that hard.
A lookahead, or lookbehind, just looks for a particular pattern ahead or behind what you are looking for, without including it as part of the match.
Positive Lookahead
Let’s go back to our simple string and see how we can use a positive lookahead.
The cat sat on the mat at home.
Say we want to match on the letter o but only if it has the letter m after it.
To do this, we use the pattern o(?=m), which will match the o in home but not the o in on.
Negative Lookahead
We can also do the negative of this. If we wanted to find all occurrences of the letter o that do not have the letter m after them, we would use o(?!m), basically replacing the = with an !.
This would then match the o in on but not the o in home.
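Here is a minimal sketch of both lookaheads in Python's re module (my choice of language for illustration). It records the position of each match to show which o was found, and the matched text to show that the m is checked but never included.

```python
import re

sentence = "The cat sat on the mat at home."

# (position, matched text) for every o that is followed by an m
followed_by_m = [(m.start(), m.group()) for m in re.finditer(r"o(?=m)", sentence)]

# (position, matched text) for every o that is NOT followed by an m
not_followed_by_m = [(m.start(), m.group()) for m in re.finditer(r"o(?!m)", sentence)]

print(followed_by_m)      # [(27, 'o')]  -> the o in "home"
print(not_followed_by_m)  # [(12, 'o')]  -> the o in "on"
```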
Positive Lookbehind
You can probably see where this is going now. A lookbehind, as the name suggests, looks backwards instead of forwards.
If we want to find all occurrences of at that are preceded by the letter c, we can use the following pattern: (?<=c)at. This will only match the at in cat but not any of the other occurrences.
Negative Lookbehind
Similarly, we can get the negative version of the positive lookbehind by changing the = to a !.
If we now search for (?<!c)at it will match on the at in sat and mat, and on the standalone at.
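The lookbehinds work the same way in code. This sketch again assumes Python's re module, which supports lookbehind as long as the pattern inside it has a fixed length (a single c is fine).

```python
import re

sentence = "The cat sat on the mat at home."

# Start positions of every "at" that has a c directly before it
after_c = [m.start() for m in re.finditer(r"(?<=c)at", sentence)]

# Start positions of every "at" that does NOT have a c directly before it
not_after_c = [m.start() for m in re.finditer(r"(?<!c)at", sentence)]

print(after_c)      # [5]          -> the "at" in "cat"
print(not_after_c)  # [9, 20, 23]  -> "sat", "mat" and the standalone "at"
```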
Extra Tip
It is also possible to combine multiple patterns in one regular expression.
Let's say we want to find all the a characters that aren't preceded by an s as well as all the t characters.
We can use the OR symbol | and have a pattern that looks like this:
(?<!s)a|t
You can do this, but it can get quite complicated if you're going to be chaining on lots of different expressions.
If you're doing these regular expressions in code, then I recommend that you split these out, just to make the regular expressions that much clearer.
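To make that trade-off concrete, here is one way it could look in Python's re module. The language and the constant names A_NOT_AFTER_S and ANY_T are hypothetical, made up for this sketch; the point is only that two small named patterns can be easier to read than one combined pattern.

```python
import re

sentence = "The cat sat on the mat at home."

# Combined: one pattern with an OR in the middle.
combined = re.findall(r"(?<!s)a|t", sentence)

# Split out: one named pattern per idea (names are illustrative only).
A_NOT_AFTER_S = re.compile(r"(?<!s)a")
ANY_T = re.compile(r"t")

a_positions = [m.start() for m in A_NOT_AFTER_S.finditer(sentence)]
t_positions = [m.start() for m in ANY_T.finditer(sentence)]

print(combined)     # ['a', 't', 't', 't', 'a', 't', 'a', 't']
print(a_positions)  # [5, 20, 23]
print(t_positions)  # [6, 10, 15, 21, 24]
```

Both approaches find the same 8 characters; the split version just keeps the two kinds of match apart. The capital T at the start of the sentence is not matched either way, because the pattern is case sensitive.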
I hope that demystifies regular expressions for you. If you like this post, you can also follow me on Twitter and Medium.