🐍 regex: a Python example using lookbehind and lookahead

Introduction

Still in a regex mood, I'm going to share a solution to a simple HackerRank code challenge I took last week. I used regular expressions to solve it, and I'll walk through the solution by explaining the two group constructs it relies on: lookbehind and lookahead.

The challenge

The challenge gives us a string S consisting of alphanumeric characters, spaces and the symbols + and -. Our task is to find all the substrings of S that contain two or more vowels. These substrings must lie between two consonants and must contain vowels only.
The vowels are defined as: AEIOU and aeiou.
The consonants are defined as: QWRTYPSDFGHJKLZXCVBNM and qwrtypsdfghjklzxcvbnm.
If no substrings are found, you just print -1.

Example:
Say you get the string S = 'rabcdeefgyYhFjkIoomnpOeorteeeeet'.
You should return all the substrings that meet the requirements described above. The substrings should be:
'ee', 'Ioo', 'Oeo' and 'eeeee', since each of them is a run of two or more vowels located between consonants.

Discussing a solution

On the surface, the purpose of this challenge is to learn when to use findall() and finditer() from Python's regex module, re.
But you still need to write a regex pattern that matches the challenge requirements in order to find the substrings.

So let's start from the basic idea: scanning the string from the first character to the last one. The pattern shouldn't only match two or more vowels, it must also check what comes immediately before and after each of those substrings.
And that's where lookbehind and lookahead come in.

Lookbehind

The lookbehind can be positive (?<=) or negative (?<!). The first one is used to "ensure that the given pattern will match, ending at the current position in the expression. The pattern must have a fixed width. Does not consume any characters." (reference from regex101)

So basically, the positive lookbehind finds an expression X only when it is preceded by another given expression Y:
(?<=Y)X

The negative lookbehind negates that condition, finding X only when it is not preceded by Y:
(?<!Y)X
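
To make the two forms concrete, here is a small sketch in Python. The sample text and the "$" prefix are made up for illustration and are not part of the challenge. Note that the negative version also excludes a preceding digit, otherwise the regex engine could start a match in the middle of a number.

```python
import re

# Hypothetical sample text, only for illustration.
text = "price: $30, cost: 45, fee: $7"

# Positive lookbehind: digits that ARE preceded by "$".
with_dollar = re.findall(r'(?<=\$)\d+', text)
print(with_dollar)      # ['30', '7']

# Negative lookbehind: digits NOT preceded by "$" (or by another digit,
# so that the "0" of "$30" is not picked up on its own).
without_dollar = re.findall(r'(?<![$\d])\d+', text)
print(without_dollar)   # ['45']
```

In both cases the "$" itself is never part of the result, because the lookbehind only checks what is there without consuming it.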

Lookahead

Just like lookbehind, the lookahead can also be positive (?=) or negative (?!).
The positive one "asserts that the given subpattern can be matched here, without consuming characters" while the negative one "starts at the current position in the expression, ensuring that the given pattern will not match. Does not consume any characters." (reference from regex101)

The behaviour is very similar to what we've already seen for lookbehind, but now we look at what comes after the given expression.

So the positive lookahead finds X when it's followed by Y: X(?=Y)

While the negative one finds X when not followed by Y: X(?!Y)
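
Here is the same kind of sketch for lookahead. Again, the sample text and the "kg" suffix are assumptions chosen just for the example. The negative version also rejects a following digit, because otherwise the engine would backtrack and happily match the "1" of "10kg".

```python
import re

# Hypothetical sample text, only for illustration.
text = "10kg 25cm 7kg 300"

# Positive lookahead: numbers that ARE followed by "kg".
kg_values = re.findall(r'\d+(?=kg)', text)
print(kg_values)      # ['10', '7']

# Negative lookahead: numbers NOT followed by "kg" (or by another digit,
# so that a number is never cut short just to satisfy the condition).
other_values = re.findall(r'\d+(?!\d|kg)', text)
print(other_values)   # ['25', '300']
```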

The solution

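Now that we know what those groups do, here is a possible regex pattern for the solution using both of them:

(?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])[aeiouAEIOU]{2,}(?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])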

First we use a positive lookbehind group (?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm]), which requires one of the listed consonants immediately before the match. It is followed by two or more vowels [aeiouAEIOU]{2,}. Finally, a positive lookahead group (?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm]) requires a consonant immediately after the vowels.

Note that in the first string example it matches the expected substrings, meeting the challenge requirements. In the second string example it doesn't match the vowels at the beginning and at the end of the string (since they're not between consonants), which is the expected behaviour.
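
If you want to check this behaviour yourself, a quick sketch like the one below will do. The first string is the example from the challenge. The second one, 'aabeeceo', is a hypothetical string I made up to show vowels at the very beginning and end being skipped. The names CONSONANTS, VOWELS and PATTERN are just helpers for this sketch.

```python
import re

# Character sets as defined by the challenge.
CONSONANTS = 'QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm'
VOWELS = 'aeiouAEIOU'

# Same pattern as above, built from the two sets
# ({{2,}} is an escaped {2,} inside an f-string).
PATTERN = re.compile(rf'(?<=[{CONSONANTS}])[{VOWELS}]{{2,}}(?=[{CONSONANTS}])')

# Example string from the challenge.
example_matches = PATTERN.findall('rabcdeefgyYhFjkIoomnpOeorteeeeet')
print(example_matches)   # ['ee', 'Ioo', 'Oeo', 'eeeee']

# Hypothetical string: 'aa' has no consonant before it and 'eo' has none after it.
edge_matches = PATTERN.findall('aabeeceo')
print(edge_matches)      # ['ee']
```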

The fact that those groups don't consume any characters is very relevant for our solution. It means the consonants aren't consumed when we find two or more vowels between them, so a consonant that closes one match can also open the next one.
For example, if the consonants were consumed, then in a string like baabaaab the first match would be baab, and the search would resume after that second b. The second run of vowels, aaa, would then have no consonant left in front of it, so it would be missed even though it also sits between consonants.
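
The difference is easy to see side by side. In this sketch I simplify the character sets to just b and a so the patterns stay readable. The "consuming" pattern is a hypothetical alternative that captures the vowels in a group instead of using lookarounds.

```python
import re

s = 'baabaaab'

# Consuming version: the surrounding b's are part of each match,
# so the middle b cannot be reused for the second run of vowels.
consuming = re.findall(r'b(a{2,})b', s)
print(consuming)       # ['aa']

# Lookaround version: the b's are only checked, never consumed.
non_consuming = re.findall(r'(?<=b)a{2,}(?=b)', s)
print(non_consuming)   # ['aa', 'aaa']
```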

And here is the complete solution using re.findall(), which "returns all the non-overlapping matches of patterns in a string as a list of strings" (HackerRank reference):



import re

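# Read the input string S.
s = input()
# Find every run of two or more vowels that has a consonant right before and right after it.
lst = re.findall(r'(?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])[aeiouAEIOU]{2,}(?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])', s)

# Print each match on its own line, or -1 if nothing was found.
if lst: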
    for element in lst:
        print(element)
else:
    print(-1)
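
Since the challenge is also about finditer(), here is a sketch of the same idea using it. This is not required for the solution, it just shows that finditer() gives you match objects, so you can also get the position of each substring. The function name vowel_runs is my own invention for this example.

```python
import re

PATTERN = re.compile(
    r'(?<=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])'
    r'[aeiouAEIOU]{2,}'
    r'(?=[QWRTYPSDFGHJKLZXCVBNMqwrtypsdfghjklzxcvbnm])'
)

def vowel_runs(s):
    """Return (start_index, substring) pairs for every matching run of vowels."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(s)]

# Example string from the challenge.
for start, run in vowel_runs('rabcdeefgyYhFjkIoomnpOeorteeeeet'):
    print(start, run)
```

If you only need the substrings, findall() is the shorter option. finditer() becomes useful when you also care about where each match was found.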
