Irregular Expressions: Matching Strings in Python
Marc Katz
Posted on July 3, 2023
What is Regex?
Regex, short for regular expressions, is a way to determine whether a string, or part of a string, fits a certain pattern. It is powerful and compact, but the syntax is not very intuitive. In this article, I will give a brief rundown of how to use regex in Python, along with a basic dictionary of the symbols used.
Startup
To use regex in Python, you first need to import the module with import re. Once you've done this, you're good to go!
Search
The first function that we'll go over is search. This function takes two parameters: the pattern to look for, and the string to look for it in. The full syntax is <match> = re.search(<pattern>, <string>). If this function finds a match, it returns a Match object representing the first substring that fits the pattern; otherwise it returns None. You can get the matching substring with <match>.group(0), and the start and end positions of the substring as a tuple with <match>.span(). For example, re.search("ra", "abracadabra").span() will return (2, 4). If you use groupings in your pattern (see below), you can use <match>.group(<n>), where n is the number of the group you want to access, counting from 1.
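Here is a minimal sketch of search in action, using the same example string as above. The variable names are my own, and the grouped pattern "(ab)(ra)" is just an illustration I made up to show group numbering.

```python
import re

# Look for the first occurrence of "ra" in the example string.
match = re.search("ra", "abracadabra")

# search returns None when nothing matches, so check before using the result.
if match is not None:
    print(match.group(0))  # the matching substring: ra
    print(match.span())    # the (start, end) positions: (2, 4)

# A pattern with two groups; group 0 is the whole match, groups count from 1.
grouped = re.search("(ab)(ra)", "abracadabra")
whole = grouped.group(0)         # "abra"
first_group = grouped.group(1)   # "ab"
second_group = grouped.group(2)  # "ra"

# No match at all: the result is None rather than a Match object.
no_match = re.search("xyz", "abracadabra")
```

Since a failed search gives back None, calling .group() or .span() directly on the result only works safely when you know the pattern is present.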
Find All
Findall will give you all the substrings that match the pattern. re.findall(<pattern>, <string>) will return a list of all the non-overlapping substrings in the string that match the pattern. For example, re.findall("a.{1,2}a", "abracadabra") will return ["abra", "ada"]. Note that it finds neither "aca" nor the second "abra", since they overlap with the substrings already found.
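To see why the overlapping candidates get skipped, it can help to look at where each match starts and ends. The sketch below uses re.finditer, which is not covered in this article but is part of the same module and yields Match objects instead of plain strings. The last line is an extra illustration of my own: when the pattern contains a capturing group, findall returns the group contents rather than the whole match.

```python
import re

text = "abracadabra"

# All non-overlapping matches, scanned left to right.
found = re.findall("a.{1,2}a", text)
print(found)  # ['abra', 'ada']

# finditer gives Match objects, so the positions are visible too.
spans = [m.span() for m in re.finditer("a.{1,2}a", text)]
print(spans)  # [(0, 4), (5, 8)]

# With a capturing group, findall returns what the group captured.
inner = re.findall("a(.{1,2})a", text)
print(inner)  # ['br', 'd']
```

The first match uses up positions 0 to 3, so the scan resumes at position 4, which is why "aca" (starting at position 3) is never considered.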
Split
Split is used when you want to break up your original string. re.split(<pattern>, <string>) will return a list of substrings, split wherever the pattern matches in the string. For example, re.split("ra", "abracadabra") will return ["ab", "cadab", ""]. The empty string at the end is there because the original string ends with a match.
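A short sketch of split, starting with the example above. The second and third calls are illustrations I added: one splits on a pattern rather than a fixed string, and one uses the optional maxsplit argument to limit how many splits are made.

```python
import re

# Split wherever "ra" appears; the trailing "" comes from the match at the end.
parts = re.split("ra", "abracadabra")
print(parts)  # ['ab', 'cadab', '']

# Split on any run of commas and whitespace (sample input made up for this sketch).
fields = re.split(r"[,\s]+", "a, b,c   d")
print(fields)  # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split and leaves the rest untouched.
head_tail = re.split("ra", "abracadabra", maxsplit=1)
print(head_tail)  # ['ab', 'cadabra']
```

The pattern in the second call is written as a raw string (the r prefix) so that Python passes the backslash through to the regex engine unchanged.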
Substitute
Sub is used when you want to replace parts of your original string with something else. re.sub(<pattern>, <replacement>, <string>) will return a copy of string in which every substring that matches the pattern is replaced with replacement. For example, re.sub("ra", "lo", "abracadabra") will return "ablocadablo".
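Here is a small sketch of sub. The first call is the example above. The other two go slightly beyond what this article covers, so treat them as illustrations: the optional count argument limits the number of replacements, and \1 and \2 in the replacement refer back to whatever the first and second groups captured. The "user@host" input is made up.

```python
import re

# Replace every "ra" with "lo".
replaced = re.sub("ra", "lo", "abracadabra")
print(replaced)  # ablocadablo

# count=1 replaces only the first match.
first_only = re.sub("ra", "lo", "abracadabra", count=1)
print(first_only)  # ablocadabra

# Groups in the pattern can be reused in the replacement as \1, \2, ...
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")
print(swapped)  # host@user
```

Note that sub does not change the original string; it returns a new one.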
Special Characters
There are many special characters that can be used in the pattern in order to make your searches more powerful. Here is a list of some of the more widely used ones:
- ^: Start of the string
- $: End of the string
- []: Will match any one character inside the square brackets. Ranges can be given with -, and a leading ^ will negate the set, matching anything except what's inside. Examples:
  - [abc] will match "a", "b", or "c"
  - [^abc] will match anything except "a", "b", or "c"
  - [4-7e-f] will match any digit between 4 and 7, and any letter between e and f (inclusive)
- .: Wildcard. This will match any character except a newline
- \d: Any digit. For ASCII text, the same as [0-9]
- \D: Any non-digit. For ASCII text, the same as [^0-9]
- \s: Any whitespace character, such as spaces, tabs, and newlines
- \S: Any non-whitespace character
- \w: Any "word" character: letters, digits, and _ (underscore)
- \W: Any non-"word" character
- *: Zero or more repetitions of the expression before it. For example, a*b will match "b", "ab", "aab", "aaab", etc.
- +: One or more repetitions of the expression before it. For example, a+b will match "ab", "aab", "aaab", etc.
- ?: Zero or one occurrence of the expression before it. For example, a?b will match "b" and "ab"
- {}: Used to match a specific number of repetitions:
  - {n}: will match exactly n repetitions:
    - a{3}b will match "aaab"
  - {n,}: will match n or more repetitions:
    - a{3,}b will match "aaab", "aaaab", etc.
  - {n,m}: will match between n and m repetitions, inclusive:
    - a{1,3}b will match "ab", "aab", and "aaab"
- \: Will escape the next character, allowing you to search for special characters. For example, \*\? will search for "*?"
- |: "Or" operator: will match the expression on either side. For example, a|b will match "a" and "b"
- (): Will "group" the expression inside the parentheses, either for capturing with the functions above, or to use in combination with the repetition or | symbols. If you don't want to capture, start the group with (?: instead, as in (?:ab)+
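To try a few of these symbols out, here is a rough sketch. The matches helper and the sample sentence are hypothetical, made up for this example: the helper wraps a pattern in ^(?: and )$ so that it only succeeds when the whole string fits the pattern. Patterns containing a backslash are written as raw strings (the r prefix) so that Python leaves the backslash alone.

```python
import re

def matches(pattern, string):
    """Hypothetical helper: True if the entire string fits the pattern."""
    return re.search("^(?:" + pattern + ")$", string) is not None

# Repetition symbols
star = matches("a*b", "b")             # True: zero a's are allowed
plus = matches("a+b", "b")             # False: at least one a is required
optional = matches("a?b", "aab")       # False: at most one a is allowed
bounded = matches("a{1,3}b", "aaaab")  # False: four a's is too many
at_least = matches("a{3,}b", "aaaab")  # True: three or more a's

# Sets, escaping, and "or"
in_range = matches("[4-7]", "5")       # True
negated = matches("[^abc]", "a")       # False
escaped = matches(r"\*\?", "*?")       # True: matches the literal characters
either = matches("a|b", "b")           # True

# Shorthand classes on a made-up sentence
text = "Order 66 shipped to bay_7 on 2023-07-03"
digits = re.findall(r"\d+", text)  # ['66', '7', '2023', '07', '03']
words = re.findall(r"\w+", text)   # 'bay_7' stays in one piece
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year = date.group(1)               # '2023'

# Capturing vs non-capturing groups with findall
non_capturing = re.findall("(?:ab)+", "ababab ab")  # ['ababab', 'ab']
capturing = re.findall("(ab)+", "ababab ab")        # ['ab', 'ab']
```

The last two lines show why (?: can matter: with a capturing group, findall returns what the group captured (here, the last repetition) instead of the whole match.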
Sources:
https://docs.python.org/3/library/re.html
https://www.w3schools.com/python/python_regex.asp
Useful Links:
https://regex101.com/
https://www.rexegg.com/regex-quickstart.html
https://xkcd.com/1313/
https://alf.nu/RegexGolf