Irregular Expressions: Matching Strings in Python

marckatz

Marc Katz

Posted on July 3, 2023

What is Regex?

Regex, short for Regular Expressions, is a way to determine whether a string or a part of a string fits a certain pattern. It is very powerful and concise, but the syntax is not very intuitive. In this article, I will give a brief run-down of how to use regex in Python, and a basic dictionary of the symbols used.

Startup

In order to use regex in Python, you first need to import the built-in re module with the statement import re. Once you've done this, you're good to go!

Search

The first function that we'll go over is search. This function takes in two parameters: the pattern that the function will look for, and the string it'll look for it in. The full syntax is <match> = re.search(<pattern>, <string>). If this function finds a match, it returns a Match object representing the first substring that fits the pattern; otherwise it returns None. You can get the matching substring with <match>.group(0), and the start and end positions of the substring as a tuple with <match>.span(). For example, re.search("ra", "abracadabra").span() will return (2, 4). If you use groupings in your pattern (see below), you can use <match>.group(<n>), where n is the number of the group you want to access.
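
Here is a minimal sketch of the calls described above. The variable names and the "xyz" and "(a)(b)" patterns are made up for illustration and are not from the original example.

```python
import re

# Look for the first occurrence of "ra" in the string.
match = re.search("ra", "abracadabra")

# search returns None when nothing matches, so check before using the result.
if match is not None:
    matched_text = match.group(0)   # the matching substring
    positions = match.span()        # (start, end) as a tuple
    print(matched_text, positions)  # ra (2, 4)

# A pattern that does not occur anywhere in the string.
no_match = re.search("xyz", "abracadabra")
print(no_match)  # None

# With groups in the pattern, group(n) returns the text captured by group n.
grouped = re.search("(a)(b)", "abracadabra")
print(grouped.group(1), grouped.group(2))  # a b
```

Checking for None first matters because calling .group() or .span() on None raises an AttributeError.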

Find All

Findall will give you all the substrings that match the pattern. re.findall(<pattern>, <string>) will return a list of all the non-overlapping substrings in the string that match the pattern. For example, re.findall("a.{1,2}a", "abracadabra") will return ["abra", "ada"]. Note that it doesn't find "aca" nor the second "abra", since they overlap with the substrings already found.
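
The example above can be tried out directly. The sketch below also uses re.finditer, which this article does not cover but which is part of the same module, to show where each match sits and why the overlapping candidates are skipped.

```python
import re

text = "abracadabra"

# findall scans left to right and never reuses characters already matched.
matches = re.findall("a.{1,2}a", text)
print(matches)  # ['abra', 'ada']

# finditer yields Match objects instead of strings,
# which exposes the position of each match.
spans = [m.span() for m in re.finditer("a.{1,2}a", text)]
print(spans)  # [(0, 4), (5, 8)]
```

The first match covers positions 0 to 4, so "aca" (which starts at position 3) is never considered. The second match covers positions 5 to 8, which rules out the final "abra" starting at position 7.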

Split

Split is used when you want to break up your original string. re.split(<pattern>, <string>) will return a list of substrings, separated where the pattern matches in the string. For example, re.split("ra", "abracadabra") will return ["ab", "cadab", ""]. The empty string at the end appears because the original string ends with a match, leaving nothing after it.
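
As a quick sketch, here is the split from the example, followed by a case where a pattern does something a plain str.split cannot. The "red,green;blue" string is made up for illustration.

```python
import re

# Split wherever "ra" occurs; the string ends with "ra",
# so the last piece is an empty string.
parts = re.split("ra", "abracadabra")
print(parts)  # ['ab', 'cadab', '']

# A pattern lets one call split on several different separators.
fields = re.split("[,;]", "red,green;blue")
print(fields)  # ['red', 'green', 'blue']
```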

Substitute

Sub is used when you want to replace parts of your original string with something else. re.sub(<pattern>, <replacement>, <string>) will replace all the substrings in string that match the pattern with replacement. For example, re.sub("ra", "lo", "abracadabra") will return "ablocadablo".
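
The substitution above can be sketched as follows. The second call uses the optional count argument of re.sub, which is not discussed in this article, to limit how many replacements are made.

```python
import re

# Replace every occurrence of "ra" with "lo".
replaced = re.sub("ra", "lo", "abracadabra")
print(replaced)  # ablocadablo

# count=1 stops after the first replacement.
first_only = re.sub("ra", "lo", "abracadabra", count=1)
print(first_only)  # ablocadabra
```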

Special Characters

There are many special characters that can be used in the pattern in order to make your searches more powerful. Here is a list of some of the more widely used ones:

  • ^: Start of the string
  • $: End of the string
  • []: Will match any single character inside the square brackets. Ranges can be given with -, and a ^ as the first character will negate the set, matching anything except what's inside.

    Examples:

    • [abc] will match "a", "b", "c"
    • [^abc] will match anything except "a", "b", "c"
    • [4-7a-f] will match any digit between 4 and 7, and any letter between a and f (inclusive). A range must run from low to high, so a reversed range such as f-e is an error.
  • .: Wildcard. This will match any single character except a newline

  • \d: Any digit. For ASCII text, this is the same as [0-9]

  • \D: Any non-digit. For ASCII text, this is the same as [^0-9]

  • \s: Any whitespace character, such as spaces, tabs, and newlines

  • \S: Any non whitespace character

  • \w: Any "word" character: numbers, letters, and _ (underscore)

  • \W: Any non-"word" character

  • *: Zero or more repetitions of the expression before. For example, a*b will match "b", "ab", "aab", "aaab", etc

  • +: One or more repetitions of the expression before. For example, a+b will match "ab", "aab", "aaab", etc

  • ?: Zero or one occurrence of the expression before. For example, a?b will match "b" and "ab"

  • {}: Used to match a specific number of repetitions:

    • {n}: will match exactly n repetitions:
      • a{3}b will match "aaab"
    • {n,}: will match n or more repetitions:
      • a{3,}b will match "aaab", "aaaab", etc
    • {n,m}: will match between n and m repetitions, inclusive:
      • a{1,3}b will match "ab", "aab", and "aaab"
  • \: Will escape the next character, allowing you to search for special characters. For example, \*\? will search for "*?"

  • |: "Or" operator: will match the expression on either side. For example, a|b will match "a" and "b".

  • (): Will "group" the expression inside the parentheses, either for capturing with the functions above, or to use with the repetition or | symbols. If you don't want to capture the group, use (?:...) instead.
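
To see a few of these symbols in action, here is a short sketch. All of the sample strings and variable names are made up for illustration. It also uses re.fullmatch, which is not covered above; it only succeeds when the whole string fits the pattern.

```python
import re

# Raw strings (r"...") keep Python from interpreting the backslashes itself.

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_abra = bool(re.search(r"^abra", "abracadabra"))  # True
ends_with_cad = bool(re.search(r"cad$", "abracadabra"))      # False

# Character classes: a negated set, digits, and "word" characters.
not_abc = re.findall(r"[^abc]", "abracadabra")     # ['r', 'd', 'r']
numbers = re.findall(r"\d+", "Room 101, floor 3")  # ['101', '3']
words = re.findall(r"\w+", "snake_case works!")    # ['snake_case', 'works']

# Quantifiers: a{1,3}b needs one to three "a"s followed by "b".
candidates = ["b", "ab", "aaab", "aaaab"]
fits = {s: bool(re.fullmatch(r"a{1,3}b", s)) for s in candidates}
print(fits)  # {'b': False, 'ab': True, 'aaab': True, 'aaaab': False}

# Escaping: \* and \? match a literal "*" and "?".
literal = re.search(r"\*\?", "what*?").group(0)    # '*?'

# Alternation and groups: (cat|dog) captures, (?:cat|dog) does not.
capturing = re.search(r"(cat|dog)s?", "two dogs")
non_capturing = re.search(r"(?:cat|dog)s?", "two dogs")
print(capturing.group(0), capturing.group(1))  # dogs dog
print(non_capturing.groups())                  # ()
```

Writing patterns as raw strings is a common habit in Python, since otherwise sequences like \d would have to be written as \\d.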

Sources:

https://docs.python.org/3/library/re.html
https://www.w3schools.com/python/python_regex.asp

Useful Links:

https://regex101.com/
https://www.rexegg.com/regex-quickstart.html
https://xkcd.com/1313/
https://alf.nu/RegexGolf
