Regular Expressions—A Rite of Passage for Web Developers
Robert Hieger
Posted on January 12, 2023
This article was originally published on JavaScript in Plain English.
Part 1: Regular Expressions Crash Course—Explaining the Theory Behind Regular Expressions
Photo by ThisIsEngineering (pexels.com)
Intended Audience: Intermediate Developers—a strong grasp of HTML, some understanding of CSS, and a strong intermediate grasp of the JavaScript language are required.
Table of Contents
Introduction
What Are We Going to Do in This Tutorial?
Grappling with Theory
Just Enough Regex
What Are Regular Expressions?
A Thumbnail Historical Sketch of Regex
Table 1. Some Well-Known Flavors of Regex
Nuts & Bolts of Regular Expressions in JavaScript
The Anatomy of a Regular Expression
Delimiters
Anchors
Character Sets
Ranges
Quantifiers
Capturing Groups
Flags
The Regex Tear-Down
Taking it Apart
The First Half of the Pattern
The Second Half of the Pattern
Sometimes a Hyphen is Just a Hyphen
Putting It Back Together
The Test String
What Next?
References
Introduction
The entire software engineering profession, regardless of discipline, eventually faces a challenge along the learning pathway. When a budding or even experienced developer encounters this challenge, it presents as something of a monolithic barrier. It is almost as if that moment is the discovery of a new and threatening four-letter word (though it’s really five letters): regex.
When I first encountered regex (or regular expressions), my experience was a disheartening struggle to learn what seemed a mysterious language, which despite its obvious power, did not make obvious its utility.
Yet its utility is quite obvious any time we use a web browser to search out an item, whether we search on Google or within the content of a website. Its utility is also quite evident when we fill in online forms and click the Submit button.
Before the data in your form is submitted, it is validated to make certain that it is correctly formed and/or sanitized (stripped of erroneous content).
What Are We Going to Do in This Tutorial?
We will be building a simple demo Single Page Application (SPA), but with fairly complex underpinnings.
This tutorial is the first of a three-part series in which we will proceed from theory to practice to the realization of a complete demo SPA.
Grappling with Theory
If you are completely new to regular expressions, then this first part of the tutorial will be for you. If you are fairly experienced with them, then you can skip ahead to the second article in this series—Regular Expressions—a Rite of Passage: From Theory to Practice.
However, because this part of the tutorial uses the data in our web application to give a quick crash course in regular expressions, you might still find it useful.
By the end of the third tutorial in this series, we will build a small web application the purpose of which will be to extract from a string of text valid zip codes and display them in a result box that populates upon clicking the Validate Zip Codes button.
Figure 1 below shows the opening screen of our finished application:
Fig. 1 Application Landing Page
In short, the test data frame scrolls to reveal a multiline string of characters within which are found valid zip codes. When the user clicks the Validate Zip Codes button, using the regular expression shown just below the test data, the application will replace the NO MATCH message displayed in the frame to its right with a scrolling list of validated zip code matches.
The Reset button restores the screen to its original state.
To realize this project, we will need to take a dive into the theory of regular expressions.
But this app seems simple enough. Why devote a whole article to theoretical underpinnings? Shouldn’t we learn by doing?
Well, yes…but not really. We will learn by doing, but doing without some minimal basis of understanding would be a painful pursuit. In writing this application, we will slam head-on into the barrier of Regex, which at least initially, might feel like bumping into a Tyrannosaurus Regex.
All kidding aside, without this rite of passage, any attempts to come up with a similar application to the one we are going to create would require writing code that likely would not work as well as that made possible by the inclusion of regular expressions.
A thorough exploration of Regex and its implementation in JavaScript is beyond the scope of this article. Such exploration could easily require a course of its own.
However, to facilitate some understanding of how it will come into play for our demo web application, I will give a quick crash course in just the concepts of regular expressions necessary for the realization of this project.
Just Enough Regex
To begin, we need to define our terms. Thus far, we have only named our concept—regular expressions. But what are they?
What Are Regular Expressions?
As defined by the Mozilla Developer Network—
“Regular expressions are patterns used to match character combinations in strings” (“Regular expressions—JavaScript”).
What does this mean? This is a very abstract statement. Let’s take yet a further step back to set the theoretical context for regular expressions.
A Thumbnail Historical Sketch of Regular Expressions
Mathematician Stephen Cole Kleene
The roots of regular expressions are in a 1951 paper entitled Representation of Events in Nerve Nets and Finite Automata, by mathematician Stephen Cole Kleene. The theories defined in this paper are far beyond the scope of this tutorial, but should you be interested in exploring what he has to say, you can download the paper here (Kleene, “Representation of Events”).
Practical application of Kleene’s theories came into its own around 1968, when they were used to facilitate pattern matching in text editors and by compilers in their lexical analysis of source code. For more on this, you might want to consult this article (Wikipedia, “Regular Expressions”).
Suffice it to say that in the years to come, many different implementations of regular expressions were developed, all of them owing their heritage to the advent of these concepts in the UNIX operating system. Table 1 below is a sampling of some of the different flavors of regular expressions found on Wikipedia (“Comparison of Regular Expression Engines”):
Regex Engine (Library) | Where It is Used |
---|---|
PCRE (Perl Compatible Regular Expressions) | This implementation of regular expressions derives from the Perl language, often used for server-side scripts in web applications. Perl is compatible with the POSIX (Portable Operating System Interface) standard established by the IEEE. UNIX was used as the model for this standard. The PCRE implementation found its way into PHP, the Apache HTTP Server, and the C and C++ languages, to name but a few. |
FREJ (Fuzzy Regular Expressions for Java) | This implementation is a library specific to the Java programming language. |
ECMAScript (JavaScript Regular Expressions) | This is the reference library standard used by most JavaScript engines across many web browsers. |
XRegExp (Extended JavaScript Regular Expressions) | This library, which may be used with JavaScript, is a superset of the standard implementation of regular expressions in the JavaScript engine. Coded and maintained by Steven Levithan, the releases of this library may be found on this repository: https://github.com/slevithan/xregexp/releases |
Table 1: Some Well Known Flavors of Regex
Nuts and Bolts of Regular Expressions in JavaScript
In JavaScript, regular expressions are represented in one of two ways:
An object literal.
An object declared with a constructor.
The object literal notation is what we see above in Figure 1 minus its declaration. The code that you will be writing soon uses the notation shown here:
const regex = /[a-z]/g;
This expression will search for and match any instances of lowercase letters a through z in a specified test string. More on this later.
This same regular expression, declared using its object constructor, has the following syntax:
const regex = new RegExp('[a-z]', 'g');
Both of these syntaxes work identically. Is there a time one or the other should be used? Yes, there is. According to the Mozilla Developer Network:
“The literal notation results in compilation of the regular expression when the expression is evaluated. Use literal notation when the regular expression will remain constant. For example, if you use literal notation to construct a regular expression used in a loop, the regular expression won’t be recompiled on each iteration” (“RegExp –JavaScript | MDN”).
You will find more complete information on this in the full Mozilla Developer Network article.
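As a rough illustration of when each form might be chosen, the sketch below builds the same pattern both ways and then uses the constructor for a pattern that is only known at run time. The names literalRegex, constructedRegex, userLetter and dynamicRegex are made up for this example and are not part of the demo application.

```javascript
// Two ways to create the same regular expression.
const literalRegex = /[a-z]/g;
const constructedRegex = new RegExp('[a-z]', 'g');

// Both objects carry the same pattern text and the same flags.
console.log(literalRegex.source === constructedRegex.source); // true
console.log(literalRegex.flags === constructedRegex.flags);   // true

// The constructor is handy when part of the pattern is only known at run time.
// userLetter is a hypothetical value, e.g. something typed into a form field.
const userLetter = 'e';
const dynamicRegex = new RegExp(userLetter, 'g');
const found = 'regular expressions'.match(dynamicRegex);
console.log(found.length); // 3
```

When the pattern never changes, as with our zip code expression, the literal form is the simpler choice.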
The Anatomy of a Regular Expression
/^[0-9]{5}(-[0-9]{4})?$/gm
Looking at this cryptic line of code, you might be asking yourself, “What is this Gobbledygook?” I assure you that in a few minutes, this will look less intimidating than it does now.
If we think of Regex as a symbolic representation of character patterns we wish to match, we can begin to classify different kinds of symbolic characters.
We will take a quick look at the regular expression above, pick it apart piece by piece and then put it back together to understand what is being asked of the regex engine.
The component parts of our regular expression can be broken down as follows:
1. Delimiters
2. Anchors
3. Character Sets and Ranges
4. Quantifiers
5. Capturing Groups
6. Flags
All six of these components come into play for our regular expression. Let’s take them one at a time.
Delimiters
In JavaScript object literal notation, regular expressions are delimited on either side by forward slashes. Everything that appears between these slashes is a representation of the pattern for which we wish to find matches in our test string.
Anchors
An anchor does not match any specific character in a test string. Rather it defines the beginning or endpoint of the match we wish to see returned by the regex engine.
The caret symbol (^) represents the beginning of the match for which we are searching. For example, if we want to match for any instance of the letter J at the beginning of our pattern, the syntax would be ^J.
On the other hand, the dollar sign ($) represents the end of the match for which we are searching. This symbol is used at the end of the string pattern and affects the character or character set immediately preceding it.
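A minimal sketch of both anchors in action, using the test() method of a regular expression. The sample strings and the names startsWithJ and endsWithT are invented for illustration.

```javascript
// ^J: the input must start with a capital J.
const startsWithJ = /^J/;
console.log(startsWithJ.test('JavaScript'));       // true
console.log(startsWithJ.test('Learn JavaScript')); // false

// t$: the input must end with a lowercase t.
const endsWithT = /t$/;
console.log(endsWithT.test('JavaScript'));         // true
console.log(endsWithT.test('JavaScript rocks'));   // false
```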
Character Sets
A character set is a specified collection of characters, which may be character literals or a specified progression of characters against which we want to test a specific test string. Character sets are enclosed by square brackets.
Ranges
When a series of characters defines a range of characters, we have a range. Ranges are comprised of a set of sequential digits or alphabetical characters. For example, 0 through 9 is what we would consider (in the decimal system, anyway) the complete range of digits possible in a number. This would also be notated in square brackets, like so:
[0-9]
For alphabetical characters we also have the ranges [a-z] and [A-Z]. But we could just as easily specify a range of digits such as [1-5] or alphabetical ranges such as [a-g] or [A-G].
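To see the difference between a set of literal characters and a range, here is a small sketch. The sample string and the variable names are assumptions made for this example only.

```javascript
const vowels = /[aeiou]/g;    // a set of literal characters
const lowDigits = /[1-5]/g;   // a range of digits, 1 through 5
const sample = 'Area 51, Gate 7';

// The capital A is skipped because the set lists only lowercase vowels.
const vowelMatches = sample.match(vowels);
console.log(vowelMatches); // [ 'e', 'a', 'a', 'e' ]

// The 7 is skipped because it falls outside the range 1 through 5.
const digitMatches = sample.match(lowDigits);
console.log(digitMatches); // [ '5', '1' ]
```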
Quantifiers
A quantifier specifies how many instances of a given character (or of a character from a character set) a match must contain. It may set a minimum, a minimum and a maximum, or an exact count.
Table 2 below shows some of the uses for these quantifiers:
Quantifier | Description |
---|---|
* | Represents 0 or more instances of the character or character set that immediately precedes it. |
+ | Represents 1 or more instances of the character or character set that immediately precedes it. |
? | Represents 0 or 1 instance of the character or character set that immediately precedes it. |
{ } | Represents a specific count of the character or character set that immediately precedes it. For example, {5} means exactly 5 instances. {2,} means 2 or more instances. {1,5} means between 1 and 5 instances (written without a space after the comma). |
Table 2: Regex Quantifiers
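The quantifiers in Table 2 can be tried out with a few throwaway patterns. This is only a sketch: the object name checks, its property names and the sample inputs are all invented for illustration.

```javascript
// Each pattern is wrapped in ^ and $ so the whole input has to fit.
const checks = {
  starZero: /^ab*$/.test('a'),               // true: * accepts zero b's
  plusZero: /^ab+$/.test('a'),               // false: + needs at least one b
  optionalTwice: /^ab?$/.test('abb'),        // false: ? accepts at most one b
  exactlyFive: /^[0-9]{5}$/.test('10003'),   // true: exactly 5 digits
  twoOrMore: /^[0-9]{2,}$/.test('7'),        // false: fewer than 2 digits
  oneToFive: /^[0-9]{1,5}$/.test('123456'),  // false: more than 5 digits
};
console.log(checks);
```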
Capturing Groups
Notated by the use of parentheses, capturing groups set a pattern sequence to be taken as a whole, rather than as its constituent parts. In other words, if we have a capturing group such as this…
([A-G]-0[0-9]{3})
…what we are saying is to match any complete pattern that starts with any capital letter between A and G, followed by a hyphen and 0, then finally exactly 3 digits between 0 and 9.
How does this differ from a plain character set? In a character set, unless a quantifier follows it immediately such as in the sequence [0-9]{3} above, the regex engine will match only 1 character that falls within the range specified.
With a capturing group, on the other hand, a sequence is analyzed as a whole and must be matched as a whole. This proves very useful in our mini-application, as you will see.
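Here is a short sketch of a capturing group at work, using a group made of a capital letter from A to G, a hyphen, a 0 and exactly 3 digits. The name partCode and the shelf-label strings are hypothetical and exist only for this example.

```javascript
const partCode = /([A-G]-0[0-9]{3})/;

// match() without the g flag returns the full match at index 0
// and the text captured by the first group at index 1.
const result = 'Shelf C-0421, row 9'.match(partCode);
console.log(result[0]); // 'C-0421'
console.log(result[1]); // 'C-0421'

// The group only matches as a whole: one wrong piece and nothing matches.
console.log(partCode.test('Shelf H-0421')); // false: H is outside A through G
console.log(partCode.test('Shelf C-042'));  // false: only 2 digits follow the 0
```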
Flags
With flags, we reach perhaps the easiest to understand of all the syntax thus far described. Flags have an impact on the way the regex engine parses the test string passed to it.
The JavaScript flavor of regular expressions has several flags (d, g, i, m, s, u, v, and y in current engines), but we will confine our examination to the three you will probably most often encounter. These flags occur at the end of a regular expression after the closing forward slash. They are:
1. i: the case-insensitive flag specifies that alphabetical characters will be matched whether they are upper or lower case.
2. g: the global flag specifies that all matches from beginning to end of the test string will be returned. By default, the regex engine reads the test string from left to right, and once it returns the first match, exits, ignoring any matches beyond the first.
3. m: Stefan Judis clarifies that “the multiline flag changes the meaning…” of the ^ and $ anchors mentioned above (Judis, Multiline mode in JavaScript regular expressions). By default, the regex engine treats a test string as one long uninterrupted string of characters to be searched. The boundaries of this string are defined by the ^ anchor, which denotes the start of a string, and the $ anchor, which denotes the end of a string. With the m (multiline) flag appended to the end of an expression, these two anchors define the beginning and end of a line rather than the string as a whole.
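The effect of the three flags can be sketched with one small multiline string. The string and the variable names below are assumptions made for this example; they are not taken from the demo application.

```javascript
// Three lines: 'Cat', 'cat' and 'concat'.
const text = 'Cat\ncat\nconcat';

// No flags: the engine stops after the first match.
const firstOnly = text.match(/cat/)[0];
console.log(firstOnly);   // 'cat' (from the second line)

// g: every match is returned.
const allMatches = text.match(/cat/g);
console.log(allMatches);  // [ 'cat', 'cat' ]

// i: upper and lower case are treated alike.
const ignoreCase = text.match(/cat/gi);
console.log(ignoreCase);  // [ 'Cat', 'cat', 'cat' ]

// Without m, ^ only matches at the very start of the string.
const stringStart = text.match(/^cat/gi);
console.log(stringStart); // [ 'Cat' ]

// With m, ^ matches at the start of every line.
const lineStart = text.match(/^cat/gim);
console.log(lineStart);   // [ 'Cat', 'cat' ]
```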
OKAY. I know that was an awful lot of stuff and some of it might have even been a bit confusing. That’s about to change.
The Regex Tear-Down
We are now ready to analyze the regular expression used in the web application we are going to build. Let’s look at it again in all its glory:
/^[0-9]{5}(-[0-9]{4})?$/gm
Taking it Apart
Let’s start by examining the outermost parts of this expression. The shell of our regular expression looks like this:
//gm
What does this little fragment say on its own? Not much, as no pattern has been specified as yet. Nonetheless, this fragment sets the stage for how any pattern appearing between the forward slash delimiters will be treated.
Using the definitions of regex flags above, this fragment says to the regex engine “Match all instances of the pattern specified between the delimiters (global matching) and consider each line of the test string separately (multiline matching).” Not much yet, but the engine now knows how to treat whatever pattern is defined for it between the delimiters.
The First Half of the Pattern
The first piece of this pattern makes use of the ^ anchor we learned about earlier. Because the m flag is set, the anchor marks the start of each line rather than the start of the whole string. Let’s take a look at the first half of the pattern with this in mind:
^[0-9]{5}
First we have the ^ anchor, which pins the start of our pattern to the beginning of a line (the m flag is in effect). Next we have a character set enclosed in square brackets. This character set consists entirely of a range of digits, 0 through 9.
This range is followed immediately by a quantifier of {5}. Taken together, the range and the quantifier specify that any set of exactly 5 digits in the range of 0 through 9 will be returned as a match.
Now we have the complete first half of our regular expression:
/^[0-9]{5}/gm
What this says all together to the regex engine is
“Search for and return all matches across all lines of the test string, that start with any combination of exactly 5 digits.”
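Taken on its own, the first half of our regular expression is a complete pattern in its own right. It will match the first 5 digits of any line that begins with them, which includes every valid 5-digit zip code in our test string.
But there’s still a problem. For a valid +4 zip code, what we have so far returns only the first 5 digits and leaves the hyphen and the final 4 digits out of the match. Nothing yet prevents other characters from following those 5 digits either. The second half of our regular expression will address this problem.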
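A quick experiment with the partial pattern shows why the second half is needed. The four sample lines and the names firstHalf, lines and partial are made up for this sketch.

```javascript
const firstHalf = /^[0-9]{5}/gm;

// Four lines: a 5-digit code, two invalid lines, and a +4 code.
const lines = '10003\nasdf10003\n10003asdf\n10003-8924';

const partial = lines.match(firstHalf);
console.log(partial); // [ '10003', '10003', '10003' ]
```

The second line is skipped, but the third line is accepted even though asdf follows the digits, and the +4 code on the last line is cut down to its first 5 digits.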
The Second Half of the Pattern
Now let’s approach the second half of the regular expression:
(-[0-9]{4})?$
Here we encounter a capturing group, which was explained earlier. The presence of the parentheses around the hyphen, range and quantifier indicates that the group is to be analyzed only as a whole. No constituent part of the capturing group will be considered on its own.
First let’s take on the contents of the capturing group.
Sometimes a Hyphen is Just a Hyphen
We start with the hyphen. When a hyphen occurs within square brackets that delimit a character set, it denotes a range, e.g. [0-9]. When it occurs outside of square brackets, as it does here as the first character in the capturing group, it is a character literal. Thus the hyphen is a required character at the beginning of the pattern to be matched in the capturing group.
Next is the range [0-9], followed by the quantifier {4}. So far our group will only return a match if it starts with a hyphen and is followed by exactly 4 digits between 0 and 9.
Outside the parentheses of the capturing group is the ? quantifier, which specifies that there can be 0 or 1 instance of what immediately preceded it. It’s important to note that this ? quantifier applies to the entire capturing group, not just the single character to its left.
This is the power of capturing groups. They make it possible to require of a match a specified group of characters or character sets. Because the ? quantifier accepts either 0 or 1 match of what precedes it, it is sometimes called an optional. In other words, a match of 0 or 1 instance of the character or pattern specified will be returned.
This behavior is what makes it possible for the regex engine to return matches of either a 5-digit zip code or a +4 zip code.
Finally, we have the $ anchor, which defines the end of the match. With the m flag in effect, that means the end of a line. This unambiguously specifies that a valid zip code match will begin with a sequence of 5 digits and have an optional hyphen followed by exactly 4 digits, with absolutely nothing following this sequence on the same line.
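The optional group and the closing anchor can be checked one candidate at a time. In this sketch the flags are left off so that each string is tested on its own; the name zipPattern and the sample values are assumptions for illustration.

```javascript
const zipPattern = /^[0-9]{5}(-[0-9]{4})?$/;

console.log(zipPattern.test('10003'));      // true: 5 digits, group absent
console.log(zipPattern.test('10003-8924')); // true: 5 digits plus the whole group
console.log(zipPattern.test('10003-89'));   // false: the group is incomplete
console.log(zipPattern.test('10003-'));     // false: a hyphen alone is not the group
console.log(zipPattern.test('100038924'));  // false: extra digits before the end
```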
Putting it Back Together
Let’s put everything back together now. Figure 2 below shows the entire sequence of our regular expression.
/^[0-9]{5}(-[0-9]{4})?$/gm
Fig. 2 The Reassembled Regular Expression
This sequence should now be clearer to us. Read from left to right, here is the meaning of the complete regular expression:
“Search for and return all matches across all lines of the test string that start with any combination of exactly 5 digits, and conclude optionally with a hyphen and exactly 4 digits in the range of 0 through 9.”
There we have it.
How about a Real-World Example?
Here is a very stripped down example of what we will implement for our application:
The Test String
10003
asdf10003
10003asdf
10003-8924
How many matches do you think will be returned from this search string? 1, 2, 3 or 4?
Using the very useful online tool called Regex Pal, we obtain the results found in Figure 3 below:
Fig. 3 Test Results from RegexPal.com
On the first line, 10003 is highlighted as a match. This is because it meets the criteria for a 5 digit zip code.
On the second line, asdf10003 has not been highlighted. This is because even though there are 5 digits on this line, they are preceded by asdf. It therefore does not meet the criteria, as the beginning of the matched string is not a sequence of 5 digits as required.
On the third line, 10003asdf has also not been highlighted. This is because, even though there are 5 digits at the beginning of the line, asdf follows this sequence. It therefore does not meet the criteria, as the end of the matched string must either be the last digit of the 5-digit sequence at the beginning, or the optional hyphen followed by exactly 4 digits. Nothing else may follow.
Finally, on the fourth line, 10003-8924 is highlighted. This is because it meets the criteria of a string that begins with exactly 5 digits, and ends with the optional sequence of a hyphen followed by exactly 4 digits.
There are therefore 2 matches.
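The same result can be reproduced in JavaScript with the match() method of a string. This is a minimal sketch rather than the code of the finished application; the names zipRegex, testString and matches are chosen for this example.

```javascript
const zipRegex = /^[0-9]{5}(-[0-9]{4})?$/gm;

// The four lines of the test string, joined with newline characters.
const testString = ['10003', 'asdf10003', '10003asdf', '10003-8924'].join('\n');

// With the g flag, match() returns every full match, or null if there are none.
const matches = testString.match(zipRegex) || [];
console.log(matches);        // [ '10003', '10003-8924' ]
console.log(matches.length); // 2
```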
Though this example is admittedly a bit simpler than the one we will be using in our finished application, the principles remain the same.
What Next?
Once you have come up for air, please continue to the second part of this tutorial series in which we will begin building our application—Regular Expressions—a Rite of Passage: From Theory to Practice.
References
“Regular Expressions - JavaScript: MDN.” JavaScript | MDN,
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions.
Kleene, Stephen Cole. Representation of Events in Nerve Nets and Finite…
- Rand Corporation. https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf.
“Regular Expression.” Wikipedia, Wikimedia Foundation, 30 May 2022,
https://en.wikipedia.org/wiki/Regular_expression.
“Comparison of Regular Expression Engines.” Wikipedia, Wikimedia
Foundation, 29 Apr. 2022,
https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines.
“RegExp - JavaScript: MDN.” JavaScript | MDN,
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#literal_notation_and_constructor.
Judis, Stefan. “Multiline Mode in JavaScript Regular Expressions.” Stefan
Judis Web Development, 23 Jan. 2022,
https://www.stefanjudis.com/today-i-learned/multiline-mode-in-javascript-regular-expressions/.