Kaustubh Joshi
Posted on November 8, 2023
๐ Introduction
World famous Oxford English Dictionary defines the word Regular
as,
Characterized by evenness, order, or harmony in physical form, structure, or organization; arranged in or constituting a constant or definite pattern.
But when it comes to 'Regular' Expressions, they are the most uneven, disordered, and dissonance structures in coding I have ever seen.
But again, When it comes to data cleaning, extracting data from a huge string or working on simple text parsing, nothing else can compete with the power of Regular Expressions.
๐ Relevant(?) XKCD
PostgreSQL regular expressions are a combination of LIKE notations and POSIX regular expression notations.
This post will provide you all the information you need to understand these notations a better and a simpler way.
๐ปMetacharacters and Operators
Match Operators:
- ~ - String matches regular expression ```python
Example:
select 'thomas' ~ 't.*ma';
Output:
-- true
- **~*** - String matches regular expression, case-insensitively
```python
#Example:
select 'thomas' ~* 'T.*ma';
#Output:
#-- true
- !~ - String does not match regular expression ```python
Example:
select 'thomas' !~ 't.*max';
Output:
-- true
- **!~*** - String does not match regular expression, case-insensitively
```python
#Example:
select 'thomas' !~* 'T.*ma';
#Output:
#-- false
POSIX Regular Expressions:
- | - Represents alternation, either of two alternatives provided in regex. ```python
Example:
select city from taxdata where state ~ 'CA|KS';
Output:
-- List of cities which has either CA or KS as state
- **^** - Matches begining of the line.
```python
#Example:
select city from taxdata where city ~ '^B';
#Output:
#-- List of cities which name starts with 'B'
- $ - Matches end of the line. ```python
Example:
select city from taxdata where city ~ 'S$';
Output:
-- List of cities which name ends with S
- **. (dot)** - Matches any charcter.
```python
#Example:
select city from taxdata where city ~ '.NN';
#Output:
#-- List of cities which has character 'NN' anywhere in the name
- * - Denotes repetition of the character zero or more times. ```python
Example:
select city from taxdata where city ~ '.*N';
Output:
-- List of cities which has zero or more repetition of character 'N'
> **\*?** - Denotes repetition of the character **zero or more times** in **non-greedy way.**
- **\+** - Denotes repetition of the character **one or more times**.
```python
#Example:
select city from taxdata where city ~ '.+N';
#Output:
#-- List of cities which has one or more repetition of character 'N'
+? - Denotes repetition of the character one or more times in non-greedy way.
- ? - Denotes repetition of the character zero or one time. ```python
Example:
select city from taxdata where city ~ '.?N';
Output:
-- List of cities which has zero or one repetition of character 'N'
---
**_<u>Working with Item Set:</u>_**
- **{m}** - Denotes repetition of the previous item exactly m times.
```python
#Example:
select city from taxdata where city ~ 'N{2}';
#Output:
#-- List of cities which has 2 repetition of character 'N'
- {m,} - Denotes repetition of the previous item m or more times. ```python
Example:
select city from taxdata where city ~ 'N{2,}';
Output:
-- List of cities which has 2 or more repetition of character 'N'
- **{m,n}** - Denotes repetition of the previous item at least m and not more than n times.
```python
#Example:
select city from taxdata where city ~ 'N{2,5}';
#Output:
#-- List of cities which has 2 to 5 repetition of character 'N'
- () - Parentheses () can be used to group items into a single logical item. ```python
Example:
select distinct city from taxdata where city similar to '%(Long|New)%';
Output:
-- List of cities which has 'Long' or 'New' in there name
- **(** - Indicates string extraction is to start
- **)** - Indicates string extraction is to end
```python
#Example:
select substring(email from '.+@(.*)$') from taxdata;
#Output:
#-- Returns domain name from email
- [...] - Can be used to match characters in listed set ```python
Example:
select substring(email from '[0-9]') from taxdata;
Output:
-- Returns emails which has any one of 0-9 digit
- **regexp_matches(input, regex)** - Returns a text array of matching substring(s) within the first match of a POSIX regular expression pattern to a string
```python
#Example:
SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
#Output:
#-- Returns output {bar,beque}
๐ค Greedy and Lazy
greedy - Greedy quantifiers first tries to repeat the token as many times as possible, and gradually gives up matches as the engine backtracks to find an overall match.
tl;dr; Keep searching until condition is not satisfied.lazy - Lazy quantifier first repeats the token as few times as required, and gradually expands the match as the engine backtracks through the regex to find an overall match.
tl;dr; Stop searching once condition is satisfied.
๐โ๐จ Conclusion
This article should provide you with a birds-eye-view of different regular expressions and notations you can use in your Postgres queries.
I hope you've found this post helpful and do not hesitate to revisit whenever you are stuck finding the correct way to write your regex.๐
NOTE: I would really encourage you to practice these operators on your own dataset or any real life examples. Remember practice is the key to become the hero who saves the day with REGEX.
HAPPY CODING!!!โค๏ธ
๐ References
Posted on November 8, 2023
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.