Primer on Regexes
Cihat Gündüz
Posted on May 6, 2021
In this post, I will try to give you a practical overview of Regular Expressions to teach you what they are, what they can be used for and a quick intro to how you can use them.
What are Regular Expressions even?
Regular Expressions (short Regexes) are Strings that work as a DSL (domain-specific language) to do some common tasks within other Strings. A DSL can also be subscribed as "a programming language within a programming language".
In the case of Regexes, the outer programming language can be any programming language that supports the String
type, it just has to support Regexes. Nearly all popular programming languages support Regexes, which makes Regexes so useful to know. The inner language of Regexes consists of only String
with some characters having a special meaning.
For example in the String ".*@.*\.com"
the .
means "any character", the *
means "any amount of <whatever precedes>", together .*
means "any amount of any character". Then we have a non-special character @
, then again .*
followed by \
which means "escape the next character and treat like a non-special character" so \.
together reads like a normal .
character without the special meaning "any character". Lastly, there's com
which is just a set of characters without any special meaning. Overall this Regex is a simple matcher for any email address ending with .com
and containing a @
somewhere.
What can I do with the "Regex DSL"?
NOTE: With Ruby installed (on Macs it's preinstalled), you can type
irb
to start an interactive Ruby shell to play around with the samples below.
There are three main functions that any Regex string can be used with:
1. matches
Given a Regex and another String, this function checks if the given String "matches" the Regex. This means if there's "any" part within the given String that matches the specified Regex, it returns true
, otherwise, it's false
.
For example in Ruby (where matches
is called match?
– ?
is part of the function name):
/.*@.*\.com/.match?('harry.potter@hogwarts.co.uk')
# => false
/.*@.*\.com/.match?('queenie.goldstein@ilvermorny.com')
# => true
2. captures
Given a Regex and another String, this function can read substrings out of the given text which matches marked portions of the given Regex. The portions in the Regex can be marked via (
and )
. They are called "capture groups".
For example in Ruby (where captures
can be accessed on the Match
object):
/(.*)@(.*)\.com/.match('queenie.goldstein@ilvermorny.com').captures
# => ["queenie.goldstein", "ilvermorny"]
3. replace
Given a Regex and a Template String, this function can automatically replace matches with a given String where even the capture groups can be referenced via $1
, $2
or in some languages also \1
, \2
, etc.
For example in Ruby (where replace
is called gsub
):
'queenie.goldstein@ilvermorny.com'.gsub(/(.*)@(.*)\.com/, '\1@\2.org')
# => "queenie.goldstein@ilvermorny.org"
What does the "Regex DSL" look like?
There's plenty of useful "cheat sheets" for this with great examples:
https://www.rexegg.com/regex-quickstart.html#chars
Generally, there's 5 different kinds of DSL components to understand:
1. Character/Group Modifiers (e.g. *
, +
, {,}
, ?
)
The default "building" block of Regexes are characters. After each character, you can write a modifier that tells how many times the preceding character is matched. The following modifiers are available:
0
or 1
times:
?
(example: a?b?c?
matches all of a
, ab
, abc
, bc
, c
)
1
time exactly:
No modifier (default)
0
to ♾️ times:
*
(example: a*bc
matches bc
, abc
, aaabc
)
1
to ♾️ times:
+
(example: a+bc
matches abc
, aaabc
but not bc
)
X
times exactly:
{X}
(example: a{3}bc
matches aaabc
but not aabc
, aaaabc
)
X
to Y
times:
{X,Y}
(example: a{2,5}bc
matches aaaaabc
, but not abc
)
X
to ♾️ times:
{X,}
(example: a{2,}bc
matches aaaaaaaabc
but not abc
)
The same modifiers also work on Groups (e.g. (abc)+
) (see below for groups).
2. Custom Sets (created with [
and ]
)
You can define custom sets of characters by listing them without any separator within brackets, e.g. for a set of the characters a, b, c and numbers 1, 2, 3 we would write [abc123]
. This is then considered as "one character of this set", thus matching multiple of them need character modifiers as in [abc123]*
or [abc123]{2,5}
.
You can also use ^
at the beginning of a custom set to specify that you accept any character except the set you specified in the brackets, e.g. [^\n]
to accept any character except a newline.
Characters of which you know are ordered right after each other like numbers or the Alphabet you can also use ranges by putting a -
in between, e.g. [a-zA-Z0-9]
.
[abc123]{3,}
would not match a
, b
, c
, ab
, but would match 111
, abc
3. Predefined Sets (\s
, \S
, \d
, \D
, \w
, \W
)
The following sets (simplified) are already pre-defined and can be used directly:
-
\s
effectively same as[ \t\n]
, reads "any whitespace character" -
\S
effectively same as[^ \t\n]
, reads "any non-whitespace character" -
\d
effectively same as[0-9]
, reads "any digit" -
\D
effectively same as[^0-9]
, reads "any non-digit" -
\w
similar to[a-zA-Z_0-9]
(includes Umlauts etc.), reads "any word character" -
\W
similar to[^a-zA-Z_0-9]
reads "any non-word character"
4. Groups (e.g. (
and )
, (?<name>
and )
)
Groups could be thought of like "words" or "sentences", they change the default building block, which is "character" for any modifier to a set of characters, or a "group". For example, writing abc*
reads "one time a, one times b and any number of times c". If you want to write "any number of times abc" you do this: (abc)*
. The abc
is then considered one group and the regex would match the whole string abcabcabc
.
Groups also allow for specifying different options to choose from. For this, you write a group and separate the different words via a |
like so: (abc|def)
– this reads "either abc or def" and would match both 123abc123
and 456def456
but not adbecf
.
These capture some sub-portions of a Regex and assign them a number or name which then can be used to reference them in code or in replacement template Strings. Typically capture groups like in (.*)@(.*).com
are used then referenced back via \1@\2.com
or $1@$2.com
(depending on the language).
It's also possible to give the groups names, e.g. (?<user>.*)@(?<domain>.*).com
to reference back like in ${user}@${domain}.com
, but these are advanced features which are implemented differently in different languages (and are missing in some).
5. Match Modifiers (\A
, \z
, ^
, $
, Lookarounds)
By default, a match for a Regex, like abc
is done like a contains
method. But you can also specify that the abc
string needs to be at the beginning or end of a given string or of a line. For example, the ^
in ^abc
makes sure only strings with abc
at the beginning of a new line match. This will match def\nabc
but not defabc
. The $
in abc$
makes sure there's a line-end after abc
. Use \A
and \z
to match among the entire String (matching multiple lines).
Lookaheads & Lookbehinds are more of an advanced topic and useful mostly when you want to match that the beginning or end of your Regex does NOT match a given Regex. In most cases, Regexes with Lookaheads & Lookbehinds can be rewritten with Capture Groups, so you should try to write them as Capture Groups instead and only read about these if the other options don't work as Lookaround are CPU-intensive operations and also kind of restricted (e.g. they don't support most modifiers).
Here's a good place to learn about them:
https://www.regular-expressions.info/lookaround.html
Common Quirks and validating new Regexes
One common thing to consider is that .
in most languages does not match the newline character by default. But it can typically be turned on with an option, in Ruby by specifying the /m
at the end which stands for "make dot match newlines".
Also note that in every language there are different characters that are reserved due to how Strings work in them, for example in Ruby /
needs to be escaped with \/
, in Swift this escape is not needed but there you need to escape {
and }
with \{
and \}
. These quirks are important to remember when copy & pasting Regexes written for other languages.
Generally, when writing a new Regex, I recommend using a website or tool with 3 features:
- An option to add a sample String to match against.
- A Regex cheat sheet visible right on the screen to look up things.
- A live matcher for the regex you write among the given sample String.
The site I've come to use here is this (runs Ruby in the background) :
How can I use Regexes in my projects today?
There's no need to wait until there's a good opportunity to use Regexes, you can simply lint your projects using Regular expressions (including Auto-Correction support) via AnyLint:
_This article was written by Cihat Gündüz
Photo by Agence Olloweb
Posted on May 6, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.