A beginner's guide to Regex
kimcc
Posted on January 1, 2022
Have you ever seen something like /^[a-z0-9_-]{1,10}$/ and thought, "What the heck am I looking at? Clearly this is some kind of keyboard smash and not an actual readable line of code!" Or perhaps you've heard of the term "regex" and wanted to nope right out of there and hit the Back button?
Well, this tutorial aims to provide a basic overview of the components you'll see in regular expressions, so that the next time you see regex out in the wild, or find yourself needing to use regex, you can have a better understanding of regex and approach it with more confidence.
Summary
We'll start out by giving a general overview of what regex is, outlining some of the components you'll see in regex, and then using what we've learned to interpret a sample regular expression for finding a hex code: /^#?([a-f0-9]{6}|[a-f0-9]{3})$/. At the end of the tutorial, several links will also be provided for resources to learn more about regex.
Table of Contents
- Overview
- Anchors
- Quantifiers
- OR Operator
- Character Classes
- Flags
- Grouping and Capturing
- Bracket Expressions
- Greedy and Lazy Match
- Boundaries
- Example Breakdown
- Additional Resources
Overview
First off, what is regex? Regex stands for "regular expression". It's basically a way of defining a pattern of some text you want to search for. Think about when you use Ctrl+F or Cmd+F to search for something in a document. For example, let's say you wanted to look for the word color. You could type Ctrl+F or Cmd+F and input color, but what if you also wanted to match colors, colour and colours?
This is where regular expressions can come in handy. In regex, this search would be defined as:
/colou?rs?/g
Other uses might be for input fields where you want to check if someone has typed in an email address where the expected format is:
<1 or more characters including a-z, 0-9, underscores, hyphens, or dots>@<1 or more characters including 0-9, a-z, hyphens or dots>.<between 2 and 6 characters including a-z, or dots>
You could use the following regex expression to check for a formatted email:
/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/g
From these examples, you can see how regex can be very powerful and save you a lot of time!
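To make that a bit more concrete, here's a rough sketch of how the email pattern might be used to check an input value. The helper name isValidEmail is made up for illustration, and I've left off the g flag on purpose: a global regex remembers where it stopped between test() calls, which isn't what we want for a simple yes/no check.

```javascript
// Same pattern as above, minus the g flag (see the note in the lead-in)
const emailRegex = /^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/;

// Hypothetical helper, not part of any library
function isValidEmail(input) {
  return emailRegex.test(input);
}

console.log(isValidEmail("jane_doe@example.com")); // true
console.log(isValidEmail("not an email"));         // false: there is no @
console.log(isValidEmail("Jane@Example.com"));     // false: the pattern only allows lowercase letters
```

Keep in mind this pattern is a teaching example. Real-world email addresses can be a lot messier than this format allows.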
A few things to note before we dive into the details:
First, you might have noticed that we use forward slashes (/) to wrap our regular expressions. The slashes indicate a regex literal, which is how JavaScript knows that whatever sits between them is a regular expression. The g at the end of the expression is a flag that we'll go over in more depth in the Flags section. For now, just know that it tells the search to keep going after the first match and find every match.
Second, while the concept of regex is language agnostic, this particular tutorial will focus on JavaScript. Different languages will have variations in syntax, so keep this in mind if you want to use this tutorial for languages other than JavaScript!
Finally, this tutorial is not meant to be exhaustive, so the additional resources at the end are meant to provide further material for deepening your regex knowledge.
Regex Components
Anchors
Anchors are used to test for characters at the beginning and at the end of a piece of text.
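A caret (^) will look for a match at the beginning of the text (or at the beginning of each line, if the m flag is set).
A dollar sign ($) will look for a match at the end of the text (or at the end of each line, if the m flag is set).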
Let's look at an example line of text:
there are many instances of there in this text aren't there
If we use the expression /there/g, we will get back all instances of there. However, if we only wanted the first there, we could type /^there/g. By the same token, if we only wanted the last there, we could type /there$/g. Now what do you think would be matched if we typed /^there$/g? If you guessed "nothing", you'd be right! No single there sits at both the start AND the end of the text, so nothing would match. We would need to change our example to be:
there
Then there would be matched with this expression: /^there$/g.
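If you'd like to try the anchor examples yourself, here's a small sketch you can paste into a browser console or Node. The variable name is just for illustration.

```javascript
const anchorText = "there are many instances of there in this text aren't there";

console.log(anchorText.match(/there/g).length); // 3: every occurrence
console.log(anchorText.match(/^there/g));       // ["there"]: only the one at the start
console.log(anchorText.match(/there$/g));       // ["there"]: only the one at the end
console.log(anchorText.match(/^there$/g));      // null: the text is more than just "there"
console.log("there".match(/^there$/g));         // ["there"]
```

Note that match() returns null, not an empty array, when nothing matches.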
We can also use the combo of ^ and $ to test for a full match of a pattern. For example:
const string = "Hello there";
const regex = /^Hello there$/;
regex.test(string); // Returns true
Quantifiers
Quantifiers are used to indicate how many of a type of character we want to match.
An asterisk (*) will look for 0 or more of the character that comes right before it.
To illustrate an example, we will write out a short JavaScript snippet:
const string = "a ab abb";
const regex = /ab*/g;
string.match(regex);
We are defining a string, a regular expression, and then matching our regex against the string. What do you think will be returned? Since our regex has b followed by *, we are looking for a followed by 0 or more bs. So a, ab, and abb will all be matched.
Now let's look at another quantifier.
A plus sign (+) will look for 1 or more of the character that comes right before it.
const string = "a ab abb";
const regex = /ab+/g;
string.match(regex); // Matches ab, abb
Here, we've changed the * to a +. Now we're looking for a followed by 1 or more bs. So now only ab and abb will be matched.
A question mark (?) can be used to indicate 0 or 1 of a character.
In our earlier example, we had
/colou?rs?/g
Now, we can break down what this means in more detail. We're looking for the letters colo followed by 0 or 1 u, followed by r, and then 0 or 1 s.
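Here's a quick sketch of that expression in action, in the same style as the other quantifier examples. The sample sentence is made up for illustration.

```javascript
const sentence = "color, colors, colour, colours, and colr";
const colorRegex = /colou?rs?/g;

// "colr" is skipped because the second o is not optional
console.log(sentence.match(colorRegex)); // ["color", "colors", "colour", "colours"]
```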
Curly braces ({}) can be used to indicate a specific number of characters or a range of characters.
If we put one number inside the braces, like {5}, we will match exactly that number of characters. If we use a range, like {1,5}, then we'll look for a number of characters within that range (including both ends).
For example:
const string = "My address is 1234 Main Street, Apartment 12";
const regex1 = /\d{4}/g;
const regex2 = /\d{1,4}/g;
string.match(regex1); // Matches 1234 since we're looking for exactly 4 digits
string.match(regex2); // Matches 1234, 12 since we're looking for between 1 and 4 digits
OR Operator
A pipe character | in between two expressions indicates an OR.
If you are used to JavaScript, the | symbol should look familiar since we use two of them (||) as the logical OR operator. However, keep in mind that in regex, spaces are part of the pattern, so don't put spaces around the | unless you want them included in the match!
const string = "Find me at example.com, example.org, or example.net";
const regex = /com|org|net/g;
string.match(regex); // Matches com, org, net
Characters within brackets can also indicate OR.
const string = "Is it spelled gray or grey?";
const regex = /gr[ae]y/g;
string.match(regex); // Matches gray, grey
Brackets will let you match anything inside them, which could be single instances of characters or a range. The Bracket Expressions section goes over this in more detail.
Character Classes
These are special expressions that match specific types of characters. You can use these if you want to match characters such as digits or word characters. There are also "negation" expressions, for example if you want to find everything EXCEPT word characters.
- \d -> Any digit from 0-9
- . -> Any character except a line break
- \w -> Any word character, including A-Z, a-z, 0-9, and the underscore
- \W -> This is the anti-\w. It will match anything that's not a word character
- \s -> Any whitespace (e.g.: space, tab)
- \S -> Similar to \W, this is the anti-\s. It will match anything that's not whitespace
- \. -> An actual dot (this is different from the "any character" indicator!)
- \t -> Tab
- \n -> New line
- \r -> Carriage return
For example, let's say I wanted to match any digits. I could type:
const string = "My phone number is 123-456-7890";
const regex = /\d/g;
string.match(regex); // Matches each digit separately: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 (use /\d+/g to get 123, 456, 7890)
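To see how a few of these character classes behave side by side, here's a small sketch. The sample string is made up, and it includes a tab (\t) and an underscore on purpose.

```javascript
const sample = "Room 42_b\tis open.";

console.log(sample.match(/\d/g));  // ["4", "2"]
console.log(sample.match(/\w+/g)); // ["Room", "42_b", "is", "open"]: the underscore counts as a word character
console.log(sample.match(/\s/g));  // [" ", "\t", " "]
console.log(sample.match(/\W/g));  // [" ", "\t", " ", "."]
console.log(sample.match(/\./g));  // ["."]: the escaped dot only matches a literal dot
```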
Flags
Flags are like modifiers that you put after the closing / of a regular expression. We've already used g for several examples, but let's recap that and a few other flags.
- g -> This stands for "global". With this flag, the search looks for all matches for your expression. Without the g flag, the search will stop after the first match is returned.
- i -> This stands for case-insensitive, meaning the search will not care if something is capitalized or lowercase.
- m -> This indicates multiline mode. With the m flag, the ^ and $ anchors match at the start and end of every line in the text, not just at the start and end of the whole text.
Example:
const string = "My phone number is 123-456-7890";
const regex1 = /\d/;
const regex2 = /\d/g;
string.match(regex1); // Matches 1 because there is no g flag and it will stop after getting the first number
string.match(regex2); // Matches every digit, one at a time: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
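The i and m flags are easier to understand once you see them change a result. Here's a short sketch using a made-up two-line string (the \n is the line break):

```javascript
const twoLines = "Hello there\nhello again";

console.log(twoLines.match(/hello/g));    // ["hello"]: case-sensitive by default
console.log(twoLines.match(/hello/gi));   // ["Hello", "hello"]: i ignores case
console.log(twoLines.match(/^hello/gi));  // ["Hello"]: ^ only matches at the very start of the text
console.log(twoLines.match(/^hello/gim)); // ["Hello", "hello"]: with m, ^ also matches after each line break
```

As the last two lines show, flags can be combined in any order after the closing slash.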
Grouping and Capturing
The full match of any expression gets saved as a group (Group 0). You can create subgroups with parentheses. The advantage of groups is that you can refer back to a group by its number: use $<num> in a replacement string, or \<num> inside the regex itself, where <num> is the group's number.
Let's look at an example with groups:
/\d{3}-(\d{3})-(\d{4})/
The first group is the entire expression and is Group 0. The next group is the (\d{3}) in parentheses; it is Group 1 and can be referred to as $1 or \1. The group after that, (\d{4}), is Group 2 and can be referred to as $2 or \2.
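Here's a sketch of what those group references might look like in practice. The phone number and variable names are just for illustration.

```javascript
const phone = "123-456-7890";
const phoneRegex = /\d{3}-(\d{3})-(\d{4})/;

const groups = phone.match(phoneRegex);
console.log(groups[0]); // "123-456-7890": Group 0, the whole match
console.log(groups[1]); // "456": Group 1
console.log(groups[2]); // "7890": Group 2

// $1 and $2 refer to the groups inside a replacement string
console.log(phone.replace(phoneRegex, "$2-$1")); // "7890-456"

// \1 refers back to Group 1 inside the regex itself, here to find a repeated word
const repeatedWord = /(\w+) \1/;
console.log(repeatedWord.test("bye bye"));   // true
console.log(repeatedWord.test("hello bye")); // false
```

Since the first \d{3} isn't wrapped in parentheses, it's part of Group 0 but doesn't get a group number of its own.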
Bracket Expressions
Use brackets to match single characters, or use them to specify a range to match.
- [abc] -> Match an a OR b OR c
- [0-5] -> Match any single digit from 0 to 5 (including 0 and 5)
- [^0-5] -> Match any character that is NOT from 0 to 5 (this only applies if the caret is the first character). If the caret is in the middle then it looks for the literal caret character
- [a^bc] -> a OR ^ OR b OR c
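Running each of the bracket expressions above against the same string makes the differences easier to spot. This is just a sketch with a made-up string:

```javascript
const mixed = "a1^b7c3";

console.log(mixed.match(/[abc]/g));  // ["a", "b", "c"]
console.log(mixed.match(/[0-5]/g));  // ["1", "3"]: 7 is outside the range
console.log(mixed.match(/[^0-5]/g)); // ["a", "^", "b", "7", "c"]: everything that is not 0 to 5
console.log(mixed.match(/[a^bc]/g)); // ["a", "^", "b", "c"]: the caret is a literal here
```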
Greedy and Lazy Match
When we use the term "greedy" within the context of regex, we're referring to when regex tries to match as much as possible. Think about Kirby, the pink Nintendo character, who wants to keep eating and eating! If your expression is greedy, it will keep trying to match as much text as it can!
Here is a greedy expression in usage:
const string = "[something] and [something else]";
const regex = /\[.*\]/; // The g flag has nothing to do with greediness: .* is greedy on its own, so it grabs as many characters as it can before the final ]
string.match(regex); // Matches [something] and [something else] instead of just [something]
Note that we use the backslash (\) to "escape" the bracket characters since we are looking for literal brackets and not trying to write a bracket expression.
For the above example, if we put a question mark after the quantifier, we can make the expression not greedy, otherwise known as "lazy". You've probably guessed by now, but lazy is the opposite of greedy, meaning that the expression will try to match as little as it can.
Now we can match just the first set of brackets:
const string = "[something] and [something else]";
const regex = /\[.*?\]/;
string.match(regex); // Only matches [something]
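Greedy vs. lazy really shows its value once you add the g flag and want every bracketed chunk. Here's a sketch comparing the two. The last line uses a negated bracket expression as an alternative way to get the same result, which is my own addition and not something you have to use.

```javascript
const bracketText = "[something] and [something else]";

console.log(bracketText.match(/\[.*\]/g));     // ["[something] and [something else]"]: greedy, one big match
console.log(bracketText.match(/\[.*?\]/g));    // ["[something]", "[something else]"]: lazy, two small matches
console.log(bracketText.match(/\[[^\]]*\]/g)); // ["[something]", "[something else]"]: "anything except ]" between the brackets
```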
Boundaries
A \b can be used to indicate a "boundary": the position between a word character and a non-word character (or the start or end of the text). It marks a position and doesn't match a character itself. The non-word character could be something like a space, a period, or a comma. Keep in mind that regex's definition of word characters includes things like _ and numbers.
Let's say you wanted to search for the word bye that has a boundary on both sides:
const string1 = "Hello, bye!";
const string2 = "Hello, goodbye!";
const regex = /\bbye\b/;
string1.match(regex); // Matches bye
string2.match(regex); // Will not match
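A few more cases can help show where boundaries do and don't fall, especially around underscores and numbers. This is a sketch with made-up test strings:

```javascript
const byeRegex = /\bbye\b/;

console.log(byeRegex.test("bye"));     // true: the start and end of the text count as boundaries
console.log(byeRegex.test("bye-bye")); // true: a hyphen is a non-word character
console.log(byeRegex.test("bye_now")); // false: _ is a word character, so there is no boundary after bye
console.log(byeRegex.test("bye2"));    // false: digits are word characters too

// The boundary itself takes up no characters, so the match is still just "bye"
console.log("Hello, bye!".match(byeRegex)[0]); // "bye"
```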
Example Breakdown
Now that we have a better understanding of regex, let's go back to our example expression:
/^#?([a-f0-9]{6}|[a-f0-9]{3})$/
We know that this is a regex literal because it's wrapped with forward slashes (/).
Next, inside the forward slashes we see a ^ at the start and a $ at the end. This means the whole text, from start to end, has to match the pattern in between.
# followed by ? means that we are looking for 0 or 1 # sign.
Next we see the start of a group with the ( sign. This group includes [a-f0-9]{6}|[a-f0-9]{3}.
The first part of the group is [a-f0-9], which is a bracket expression that's looking for matches from a to f OR 0-9. Right after that, we see {6}, which means that the match must be exactly 6 characters.
Next we see a | which means that we're looking for something that could match the expression right before the | OR the expression right after it.
The expression after the | is [a-f0-9], a bracket expression that's looking for matches from a to f OR 0-9, just like the previous expression. However, this time we have {3} instead of {6}, meaning that we are looking for exactly 3 characters.
Let's summarize the entire expression in plain English.
Look for a match that starts with 0 or 1 # sign.
Next, check if this match ends with the following: either 6 characters that could be from a to f or 0 to 9, or 3 characters that could be from a to f or 0 to 9.
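To double-check that reading, here's a sketch that runs the expression against a few made-up values. Notice that the pattern as written only accepts lowercase letters. Adding the i flag is one way to accept uppercase hex codes as well.

```javascript
const hexRegex = /^#?([a-f0-9]{6}|[a-f0-9]{3})$/;

console.log(hexRegex.test("#ff8800")); // true: # plus 6 characters
console.log(hexRegex.test("fff"));     // true: the # is optional, 3 characters
console.log(hexRegex.test("#ff88"));   // false: 4 characters is neither 6 nor 3
console.log(hexRegex.test("#FF8800")); // false: uppercase is not in a-f

// Same pattern with the i (case-insensitive) flag added
console.log(/^#?([a-f0-9]{6}|[a-f0-9]{3})$/i.test("#FF8800")); // true

// The parentheses capture the code without the #
console.log("#ff8800".match(hexRegex)[1]); // "ff8800"
```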
And there you have it! We've learned a little bit about regex and learned how to analyze a regex expression for finding a hex code!
Additional Resources