Beginner's Guide to Regex
Rahul Yadav
Posted on May 18, 2022
Regex or Regexp?
Both are short for regular expression, a mechanism for matching characters. In other words, a regular expression is a pattern that matches a single character or a combination of characters in a string. The concept dates back to the early 1950s, when the American mathematician Stephen Cole Kleene formalized the notion of a regular language. Most general-purpose programming languages support regex either natively or via libraries, including Python, C, C++, Java, and JavaScript.
How does it work?
A regex engine follows these rules to find a match:
- It reads the input string from left to right and attempts to match the characters against the pattern's instructions one at a time.
- If a character satisfies the current instruction, the engine moves on to the next character and the next instruction.
- If a character does not match, the engine abandons the attempt and starts over from the next starting position in the string.
- If all the instructions are satisfied, the engine reports that the regex matched successfully.
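The rules above can be sketched in code. The snippet below is a deliberately simplified illustration, not how a production regex engine is implemented: naiveFind is a hypothetical helper that handles only literal patterns (no meta-characters). It tries each start position from left to right and compares one character at a time.

```javascript
// Hypothetical helper for illustration only: literal patterns, no meta-characters.
function naiveFind(pattern, text) {
  // Try every possible start position, left to right.
  for (let start = 0; start <= text.length - pattern.length; start++) {
    let i = 0;
    // Compare one character at a time until a mismatch or the end of the pattern.
    while (i < pattern.length && text[start + i] === pattern[i]) {
      i++;
    }
    if (i === pattern.length) {
      return start; // every instruction was satisfied: success
    }
    // Mismatch: fall through and retry from the next start position.
  }
  return -1; // no start position produced a full match
}

const sample = "The fat cat sat on the mat.";
console.log(naiveFind("the", sample)); // 19
console.log(sample.search(/the/));     // 19, the built-in engine agrees
console.log(naiveFind("dog", sample)); // -1
```

Note that the capitalized "The" at the start is skipped, because matching is case-sensitive by default.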
Basic matching
A regular expression is just a pattern of characters that we use to perform a search in a text.
For example, the regular expression "the" means: the letter t, followed by the letter h, followed by the letter e.
"the" => The fat cat sat on the mat.
Meta-characters
Meta-characters are reserved characters that set a specific rule for matching.
The meta-characters are as follows:
- ^ : Matches the start of the string. It lets us match a string that starts with a specific substring.
^The => The car is parked in the garage.
- $ : Matches the end of the string. If a string ends with a specific character or substring, we can use $ to find the match.
Without $:
at\. => The fat cat. sat. on the mat.
With $:
at\.$ => The fat cat. sat. on the mat.
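As a quick check of the two anchors, here is a small sketch in JavaScript using the example sentences above. The variable names are only illustrative, and the g flag (described under Flags) is used so that match() returns every match instead of just the first.

```javascript
const parked = "The car is parked in the garage.";
const startsWithThe = /^The/.test(parked);      // "The" is at the very start
const startsWithLowerThe = /^the/.test(parked); // "the" occurs later, but not at the start

const cats = "The fat cat. sat. on the mat.";
const anywhere = cats.match(/at\./g); // every "at." in the string
const atEnd = cats.match(/at\.$/g);   // only the "at." that ends the string

console.log(startsWithThe, startsWithLowerThe); // true false
console.log(anywhere.length, atEnd.length);     // 3 1
```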
- . : Matches any single character except a line break. This is the simplest meta-character. Placed after a pattern, it matches the next single character; placed before a pattern, it matches the previous single character.
.at => The fat cat sat on the mat.
ca. => The fat cat sat on the mat.
- [] : Character class. It matches any one of the characters written inside the square brackets. A hyphen between two characters specifies a range, so [0-9] matches any single digit.
[0-9] => 3 fat cats sat on the mat.
- [^] : Negated character class. It matches any character that is not written inside the square brackets. If a caret (^) appears right after the opening bracket, the class matches every character except the ones listed.
[^a-b] => 3 fat cats sat on the mat.
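The dot and the two kinds of character class can be tried out with a short JavaScript sketch. The variable names and the short string "cab" are only illustrative, and the g flag (described under Flags) collects every match.

```javascript
const mat = "The fat cat sat on the mat.";
const dotBefore = mat.match(/.at/g); // any character followed by "at"
const dotAfter = mat.match(/ca./g);  // "ca" followed by any character

const digits = "3 fat cats sat on the mat.".match(/[0-9]/g); // any single digit
const notAorB = "cab".match(/[^a-b]/g);                      // anything except a or b

console.log(dotBefore); // [ 'fat', 'cat', 'sat', 'mat' ]
console.log(dotAfter);  // [ 'cat' ]
console.log(digits);    // [ '3' ]
console.log(notAorB);   // [ 'c' ]
```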
- * : It is used after another rule, as in [a-z]*. It matches zero or more occurrences of the preceding character, class, or group.
[0-9] => 525 fat cats sat on the mat.
[0-9]* => 525 fat cats sat on the mat.
It is often used with the whitespace character (\s) when we want to match a substring followed by zero or more whitespace characters.
The\s*fat => The fat cat sat on the mat.
- + : Works like *, but instead of zero or more occurrences it matches one or more occurrences of the preceding character, class, or group.
[0-9]+ => 525 fat cats sat on the mat.
The difference:
hel* => heo.
hel+ => heo. //no match
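To see the difference between * and + in practice, here is a minimal JavaScript sketch built from the examples above. The variable names and the digit-free string "fat cats" are only illustrative.

```javascript
const text525 = "525 fat cats sat on the mat.";
const oneDigit = text525.match(/[0-9]/)[0];    // a single digit
const starDigits = text525.match(/[0-9]*/)[0]; // zero or more digits
const plusDigits = text525.match(/[0-9]+/)[0]; // one or more digits

// With no digits at the start, * still succeeds with an empty match, while + fails.
const starEmpty = "fat cats".match(/[0-9]*/)[0];
const plusNone = "fat cats".match(/[0-9]+/);

console.log(oneDigit, starDigits, plusDigits); // 5 525 525
console.log(JSON.stringify(starEmpty));        // ""
console.log(plusNone);                         // null
console.log(/hel*/.test("heo."));              // true: "he" followed by zero l's
console.log(/hel+/.test("heo."));              // false: at least one l is required
console.log(/The\s*fat/.test("Thefat"));       // true: zero whitespace characters is allowed
```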
- ? : Makes the preceding symbol optional. It matches zero or one instance of the preceding character.
[t]?he => the fat cat sat on the mat he.
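A small JavaScript sketch of the optional marker, using the example sentence above. The variable name is only illustrative, and the g flag (described under Flags) returns all matches.

```javascript
// The leading t is optional, so both "the" and a bare "he" are matched.
const optionalT = "the fat cat sat on the mat he.".match(/[t]?he/g);

console.log(optionalT);            // [ 'the', 'the', 'he' ]
console.log(/^[t]?he$/.test("he")); // true: zero instances of t
console.log(/^[t]?he$/.test("tthe")); // false: at most one t is allowed
```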
- {n,m} : Specifies how many times the preceding character, class, or group may repeat: at least n times and at most m times. Do not put a space after the comma; some engines, including JavaScript's, then treat the braces as literal text.
[0-9]{2,3} => the value of pi is 3.1415.
In this example, if we omit the second number (3), it matches 2 or more digits.
[0-9]{2,} => the value of pi is 3.14159.
If we omit the comma too, it matches exactly 2 digits.
[0-9]{2} => the value of pi is 3.141.
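The three forms of the braces quantifier can be compared with a short JavaScript sketch based on the examples above. The variable names are only illustrative. The last line shows why the quantifier should be written without a space after the comma: in JavaScript (without the u flag) the spaced form is read as literal text.

```javascript
const twoToThree = "the value of pi is 3.1415.".match(/[0-9]{2,3}/g); // 2 to 3 digits
const twoOrMore = "the value of pi is 3.14159.".match(/[0-9]{2,}/g);  // 2 or more digits
const exactlyTwo = "the value of pi is 3.141.".match(/[0-9]{2}/g);    // exactly 2 digits

console.log(twoToThree); // [ '141' ]
console.log(twoOrMore);  // [ '14159' ]
console.log(exactlyTwo); // [ '14' ]

// With a space, "{2, 3}" is no longer a quantifier, so plain digits do not match.
console.log(/[0-9]{2, 3}/.test("1415")); // false
```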
- (xyz) : Character group. Matches the characters x, y, z in that exact order.
- | : Alternation. It works like the OR operator: it matches the expression before it or the expression after it.
(T|t)he|car => The car is parked in the garage.
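Groups and alternation can be combined, as in the example above. The JavaScript sketch below uses illustrative variable names; it also shows that, without the g flag, match() exposes what the group captured at index 1 of the result.

```javascript
const garage = "The car is parked in the garage.";

// Either "The" or "the" (via the group), or the word "car".
const allMatches = garage.match(/(T|t)he|car/g);

// Without g, the result holds the full match first, then the group's capture.
const first = garage.match(/(T|t)he/);

console.log(allMatches);          // [ 'The', 'car', 'the' ]
console.log(first[0], first[1]);  // The T
```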
- \ : Escape. It lets us match reserved characters such as [ ] ( ) { } . * + ? ^ $ \ | literally. For example, to find a . character we need to write \. in the pattern.
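Here is a minimal JavaScript sketch of escaping. escapeRegExp is a hypothetical helper (it is not built into the language as written here) that puts a backslash in front of every reserved character, which is handy when a pattern is built from arbitrary text. The other names are only illustrative.

```javascript
const unescapedDot = "3.14".match(/./g); // . matches any character
const escapedDot = "3.14".match(/\./g);  // \. matches only a literal dot

// Hypothetical helper: escape every reserved character in a string.
function escapeRegExp(source) {
  return source.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const rawPlus = new RegExp("1+1").test("1+1");                // + is a quantifier here
const safePlus = new RegExp(escapeRegExp("1+1")).test("1+1"); // + is a literal character here

console.log(unescapedDot);      // [ '3', '.', '1', '4' ]
console.log(escapedDot);        // [ '.' ]
console.log(rawPlus, safePlus); // false true
```

In a string passed to the RegExp constructor, the backslash itself has to be doubled, so the literal /\./ corresponds to new RegExp("\\.").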
Assertions
Assertions include boundaries, which mark the beginnings and endings of lines and words, and lookarounds, which check what comes before or after a match without including it in the result.
- Lookahead assertion: x(?=y) matches x only if it is followed by y.
monica(?=\sgeller) => my name is monica geller.
- Negative lookahead assertion: x(?!y) matches x only if it is not followed by y.
geller(?!\sfamily) => monica geller belongs to the geller family.
- Lookbehind assertion: (?<=x)y matches y only if it is preceded by x.
(?<=monica\s)geller => monica geller belongs to the geller family.
- Negative lookbehind assertion: (?<!x)y matches y only if it is not preceded by x.
(?<!monica\s)geller => monica geller belongs to the geller family.
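The four lookaround assertions can be compared side by side in JavaScript. This sketch assumes a runtime that supports lookbehind (part of ES2018, available in current Node.js and modern browsers); the variable names are only illustrative. The index property shows which occurrence of "geller" each pattern picks.

```javascript
const family = "monica geller belongs to the geller family.";

// Lookahead: "monica" only when " geller" follows; the lookahead text is not part of the match.
const ahead = "my name is monica geller.".match(/monica(?=\sgeller)/);

// Negative lookahead: the "geller" that is NOT followed by " family" (the first one).
const notAhead = family.match(/geller(?!\sfamily)/);

// Lookbehind: the "geller" that IS preceded by "monica " (the first one).
const behind = family.match(/(?<=monica\s)geller/);

// Negative lookbehind: the "geller" that is NOT preceded by "monica " (the second one).
const notBehind = family.match(/(?<!monica\s)geller/);

console.log(ahead[0]);        // monica
console.log(notAhead.index);  // 7
console.log(behind.index);    // 7
console.log(notBehind.index); // 29
```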
Lazy vs Greedy matching
By default, a regex performs a greedy match, which means the match will be as long as possible. Adding ? after a quantifier makes it lazy, which means the match will be as short as possible.
Greedy matching
he.*l => say hello to hell.
Lazy matching
he.*?l => say hello to hell.
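Running both patterns in JavaScript makes the difference visible: .* is the greedy form and .*? is the lazy form. The variable names are only illustrative.

```javascript
const greeting = "say hello to hell.";

const greedy = greeting.match(/he.*l/)[0]; // runs on to the last possible l
const lazy = greeting.match(/he.*?l/)[0];  // stops at the first possible l

console.log(greedy); // hello to hell
console.log(lazy);   // hel
```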
Flags
Flags are also called modifiers because they modify how a regular expression matches. They can be used in any order or combination and are an integral part of the RegExp.
- i : Case insensitive. The match ignores letter case.
- g : Global search. Matches all instances, not just the first.
- m : Multiline. The anchors ^ and $ work on each line.
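The effect of each flag can be seen on a small two-line string. This is a JavaScript sketch with illustrative variable names and an illustrative input string.

```javascript
const lines = "The fat cat\nthe mat";

const caseSensitive = lines.match(/the/)[0];     // no flags: skips "The", finds "the"
const ignoreCase = lines.match(/the/i)[0];       // i: the first match is now "The"
const allMatches = lines.match(/the/gi);         // g: every match, not just the first
const startOfString = lines.match(/^the/gi);     // without m, ^ is only the start of the string
const startOfEachLine = lines.match(/^the/gim);  // with m, ^ is the start of every line

console.log(caseSensitive, ignoreCase); // the The
console.log(allMatches);                // [ 'The', 'the' ]
console.log(startOfString);             // [ 'The' ]
console.log(startOfEachLine);           // [ 'The', 'the' ]

// The order in which flags are written does not matter.
console.log(/the/gi.flags === /the/ig.flags); // true
```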
Regex in JavaScript
There are two ways to create a RegExp object: a literal notation and a constructor.
- The literal notation encloses the pattern between slashes and does not use quotation marks.
/[0-9]/
- The constructor function takes the pattern as a string, so it is not enclosed between slashes but does use quotation marks.
new RegExp('[0-9]')
These patterns are used with the exec() and test() methods of RegExp, and with the match(), matchAll(), replace(), replaceAll(), search(), and split() methods of String.
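Here is a brief tour of those methods, as a sketch rather than a full reference. It assumes a reasonably recent runtime (replaceAll() and matchAll() are newer additions, and both need the g flag when given a regex). The variable names and sample strings are only illustrative.

```javascript
const literal = /[0-9]/;                 // literal notation
const constructed = new RegExp("[0-9]"); // constructor notation, same pattern

// RegExp methods
const hasDigit = literal.test("3 fat cats");            // is there a match at all?
const firstDigit = constructed.exec("3 fat cats")[0];   // details of the first match

// String methods
const matched = "525 cats".match(/[0-9]+/)[0];
const allNumbers = [..."1 fat, 22 cats".matchAll(/[0-9]+/g)].map((m) => m[0]);
const replacedOnce = "a1b2".replace(/[0-9]/, "#");
const replacedAll = "a1b2".replaceAll(/[0-9]/g, "#");
const position = "fat cat".search(/cat/);
const pieces = "a1b2c".split(/[0-9]/);

console.log(hasDigit, firstDigit);      // true 3
console.log(matched, allNumbers);       // 525 [ '1', '22' ]
console.log(replacedOnce, replacedAll); // a#b2 a#b#
console.log(position, pieces);          // 4 [ 'a', 'b', 'c' ]
```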
I hope this beginner's guide helps you to understand the basics of regex.
Practice regex with Regex 101