Deep Dive into Preprocessing Techniques in NLP using Python - Part 1

blessing988

Blessing Agyei Kyem

Posted on January 13, 2023

Photo by Patrick Tomasso on Unsplash



Language and speech data in the real world are usually messy and disorganized. This makes them hard for machines to understand, so we need to preprocess them before we can make informed decisions during analysis and modelling.

Without a systematic way to start and keep data clean, bad data will happen - Donato Diorio


Consider the sentence below :

Lifeeee is such a painnn :(

And another sentence :

Life is such a pain :(

The two sentences carry the same semantic meaning but the former requires a bit of preprocessing to remove the extra characters at the end of some of the words.
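As a rough sketch of that clean-up step, the snippet below collapses any character repeated three or more times in a row into a single one. The helper name squeeze_repeats is hypothetical, and the rule is only a heuristic assumed for this example: it would also turn a legitimate stretched word like "coool" into "col", so real pipelines usually pair it with a dictionary check.

```python
import re

def squeeze_repeats(text: str) -> str:
    """Collapse a word character repeated 3+ times in a row into one (heuristic)."""
    # (\w) captures one word character; \1{2,} requires at least two more copies of it.
    return re.sub(r'(\w)\1{2,}', r'\1', text)

cleaned = squeeze_repeats('Lifeeee is such a painnn :(')
print(cleaned)  # Life is such a pain :(
```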

Preprocessing is an essential part of NLP work and an important tool for machine learning engineers and data scientists as they move on to building ML models.

To become good at data preprocessing, especially in NLP, you need to be able to detect and extract patterns in data.

As a result, knowing how to manipulate strings with regex should be a priority. In this tutorial, we will dive deep into how to use regular expressions in Python.


Regular Expressions

Regular expressions are strings of characters and symbols (literals and metacharacters) that are used to detect patterns in text.
Suppose we have the following text :

The numbers are : 022-236-1823, 0554-236-172, 055 345 17584, and 0234456812

We might want to extract only the numbers from the text.

The following regex can help us achieve this :

/\d+[\s-]?\d+[\s-]?\d+/
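Here is a small sketch of how that pattern behaves in Python. The surrounding slashes are only delimiters used when writing regexes on paper (and in some other languages), so they are dropped. The variable names are just for illustration, and findall() is one of the functions covered later in this tutorial.

```python
import re

text = 'The numbers are : 022-236-1823, 0554-236-172, 055 345 17584, and 0234456812'

# Three runs of digits, each optionally separated by one whitespace or hyphen
phone_pattern = re.compile(r'\d+[\s-]?\d+[\s-]?\d+')

numbers = phone_pattern.findall(text)
print(numbers)
# ['022-236-1823', '0554-236-172', '055 345 17584', '0234456812']
```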

Another example :

Hello everyone, Helium, hectic, help me!

Considering the text above we might be interested in words that start with He or he.
We can type the following expression :

/[Hh]e[a-z]+/
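A quick sketch of this pattern in Python follows. One caveat worth knowing: the pattern on its own does not check that He/he sits at the start of a word, so on other texts it can fire inside a longer word. The second half of the snippet uses the word-boundary token \b, which is not covered in this tutorial and is shown here only as one possible way to tighten the pattern.

```python
import re

text = 'Hello everyone, Helium, hectic, help me!'
he_words = re.findall(r'[Hh]e[a-z]+', text)
print(he_words)  # ['Hello', 'Helium', 'hectic', 'help']

# On a different sentence the unanchored pattern also matches inside 'weather'
other = 'The weather helps'
inside = re.findall(r'[Hh]e[a-z]+', other)
anchored = re.findall(r'\b[Hh]e[a-z]+', other)  # \b = word boundary (assumption, see above)
print(inside)    # ['her', 'helps']
print(anchored)  # ['helps']
```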

We can also match a specific word by typing it out as literals. For example, the regex below :

/happy/

will extract or match happy from the sentence I am very happy


To understand regex, we have to know the difference between metacharacters and literals.

Let's consider the pattern we used earlier:

/\d+[\s-]?\d+[\s-]?\d+/

The metacharacters are :

\, +, \d, [ ], \s, ?

The only literal we have is -.

In our second example :

/[Hh]e[a-z]+/

Our literals are H, h, e, a and z (inside [a-z], a and z mark the two ends of a range).


There are a lot of metacharacters in regex and each of them has its specific use case. Let's explore them and know when to use them :

Metacharacter   Description
\    Placed before a character to either give it a special meaning (e.g. \d) or turn a metacharacter into a literal (e.g. \. matches a dot)
^    Matches the start of the input
$    Matches the end of the input
.    Matches any single character except a newline
|    Matches either alternative. E.g. x|y will match either x or y
?    Matches the character before it zero or one time. E.g. s?it will match sit or it
+    Matches the character before it one or more times. E.g. a+ will match the a in bag and the aaaa in baaaag
[ ]  Matches any single character listed inside it. E.g. [A-Z] will match any uppercase letter from A to Z
\w   Matches any word character including underscore. It is equivalent to [A-Za-z0-9_]
\W   Matches any non-word character. It is equivalent to [^A-Za-z0-9_]
\d   Matches any digit, i.e. 0-9
\D   Matches any non-digit character
\s   Matches any whitespace character
\S   Matches any non-whitespace character
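To make a few of these rows concrete, here is a minimal sketch that runs some of the metacharacters through re.findall(). The sample strings and variable names are made up for illustration.

```python
import re

optional = re.findall(r's?it', 'sit it')           # ? -> zero or one 's'
one_or_more = re.findall(r'a+', 'bag baaaag')      # + -> one or more 'a'
either = re.findall(r'x|y', 'xyz')                 # | -> 'x' or 'y'
uppercase = re.findall(r'[A-Z]', 'NLP in Python')  # [ ] -> one character from the set
non_digits = re.findall(r'\D', 'a1b2')             # \D -> anything that is not a digit

print(optional)     # ['sit', 'it']
print(one_or_more)  # ['a', 'aaaa']
print(either)       # ['x', 'y']
print(uppercase)    # ['N', 'L', 'P', 'P']
print(non_digits)   # ['a', 'b']
```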

For more information on metacharacters, check this resource.

We will be using Python's built-in re module for regex matching operations.

Let's import it :

import re

To create a pattern you can use the compile function, which takes the pattern you want to match and, optionally, flags that modify how it is matched.

re.compile(pattern, flags=0)

Let's say we want to create a pattern to extract some number from the text : I am 25 years old. We can type :

re.compile(r'\d+')

The r prefix marks the pattern as a raw string. Python itself treats \ as an escape character in normal strings, so we use raw strings to pass backslashes through to the regex engine unchanged.
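The difference a raw string makes can be seen in a short sketch. It uses \b (a regex word boundary, not covered in this tutorial) purely because '\b' in a normal Python string means a backspace character, which makes the effect easy to observe. The variable names are illustrative.

```python
import re

# In a normal string '\n' is one newline character;
# in a raw string the backslash and the 'n' stay as two characters.
print(len('\n'), len(r'\n'))  # 1 2

# Without r, Python turns '\b' into a backspace before the regex engine sees it.
plain = re.findall('\bcat\b', 'a cat sat')
raw = re.findall(r'\bcat\b', 'a cat sat')
print(plain)  # []
print(raw)    # ['cat']

# The compiled pattern from the text applied to the sample sentence
age = re.compile(r'\d+').search('I am 25 years old').group()
print(age)  # 25
```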

We will be using the following functions to match a pattern:

  • re.match() -> checks for a match only at the beginning of the string
  • re.search() -> checks for a match anywhere in the string
  • re.findall() -> checks for all occurrences of the match

Suppose we want to check whether Coming is at the beginning of the text below :

Coming is a verb

We will first create our pattern :

# Create our pattern
pattern = re.compile(r'Coming')

Let's use match() to match our pattern :

text = 'Coming is a verb'

# Creating our Match Object
match = pattern.match(text)
print(match)

## Output:
<re.Match object; span=(0, 6), match='Coming'>

Alternatively, we can use re.match() directly :

match = re.match(pattern, text)
print(match)

## Output:
<re.Match object; span=(0, 6), match='Coming'>

NOTE: When using match(), if the pattern isn't found at the beginning of the text, there will be no match.

Let's verify that with an example below :

pattern = re.compile(r'Coming')
text = 'Is Coming a verb?'

# Creating our Match Object
match = pattern.match(text)
print(match)

## Output:
None

As illustrated above, because the text begins with Is, there will be no match.

We can rectify this by using the search() function :

pattern = re.compile(r'Coming')
text = 'Is Coming a verb?'

# Creating our Match Object
match = pattern.search(text)
print(match)

##Output:
<re.Match object; span=(3, 9), match='Coming'>

Yes! We have been able to match Coming. This is because the search() function matches anywhere within the text.
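Since both match() and search() return None when nothing is found, it usually pays to check the result before using it. The sketch below wraps that check in a hypothetical helper called first_match and pulls the matched text and its position out of the Match object with group() and span().

```python
import re

def first_match(pattern: str, text: str):
    """Return (matched_text, span) for the first match, or None when there is none."""
    m = re.search(pattern, text)
    if m is None:  # search() found nothing anywhere in the text
        return None
    return m.group(), m.span()

found = first_match(r'Coming', 'Is Coming a verb?')
missing = first_match(r'Coming', 'Is it a verb?')
print(found)    # ('Coming', (3, 9))
print(missing)  # None
```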


Now, what if a particular pattern exists multiple times within a text and we would like to detect all the instances of that pattern?

Example: Say, we want to detect all occurrences of a number within the string below :

These are four-digit numbers : 1245, 1220, 9028. 

Using the search() function will only match the first occurrence of the number :

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.search(text)
print(match)

## Output
<re.Match object; span=(31, 35), match='1245'>

Intuition behind the above code :

  • our pattern \d+ has two components : \d and +.
  • \d will match any single digit like 1, 2, ...
  • + is a quantifier which, when added to \d, matches one or more consecutive digits until it reaches a non-digit character such as whitespace or a letter. E.g. 1245
  • search() then scans our text and, as soon as it finds the first match of the pattern, returns it. In this case it will match only 1245.

NOTE: search() only returns a single occurrence of the match.

We can use findall() to match all occurrences of the pattern in our text:

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.findall(text) # -> Returns a list
print(match)

## Output:
['1245', '1220', '9028']  
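Note that findall() hands back strings, not numbers. If the goal is to compute with the values, a conversion step is needed; a minimal sketch, with illustrative variable names:

```python
import re

text = 'These are four-digit numbers : 1245, 1220, 9028.'

# findall() returns strings, so convert each one before doing arithmetic
values = [int(n) for n in re.findall(r'\d+', text)]
print(values)       # [1245, 1220, 9028]
print(sum(values))  # 11493
```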

Suppose you have a large chunk of data and you don't want all the matches at once. You can instead retrieve them one at a time.

finditer() can help us achieve that.

Let's get the four-digit numbers one at a time :

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+') 

match = pattern.finditer(text) # -> Returns a callable iterator

# Let's check the type of the match 
print(type(match))

## Output 
<class 'callable_iterator'>

To get the next item in the iterator object, we can use the next() function in Python.

Let's get the matches one after another :

print(next(match))  # -> Outputs the first match 

print(next(match))  # -> Outputs the second match 

print(next(match))  # -> Outputs the last match 

## Output
<re.Match object; span=(31, 35), match='1245'>
<re.Match object; span=(37, 41), match='1220'>
<re.Match object; span=(43, 47), match='9028'>
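In practice the iterator is more often consumed with a loop than with repeated next() calls. The sketch below shows that, and also what happens once the iterator is exhausted: a bare next() would raise StopIteration, so a default value is passed instead. Variable names are illustrative.

```python
import re

text = 'These are four-digit numbers : 1245, 1220, 9028.'
pattern = re.compile(r'\d+')

# Each item yielded by finditer() is a Match object
results = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
for item in results:
    print(item)
# ('1245', 31, 35)
# ('1220', 37, 41)
# ('9028', 43, 47)

# After the three matches are consumed, the iterator is empty
it = pattern.finditer(text)
for _ in range(3):
    next(it)
leftover = next(it, None)  # the default avoids a StopIteration error
print(leftover)  # None
```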

Using the ^ and $ metacharacters

^ is placed at the start of a pattern to match it only at the beginning of a text. E.g. we can check whether the word The is at the beginning of a line by typing ^The.

Let's illustrate that with an example.
We can detect whether The is at the beginning of the text below:

The work is super easy. 

We can achieve that as illustrated:

text = 'The work is super easy.'
pattern = re.compile(r'^The')

match = pattern.search(text)

print(match)

## Output
<re.Match object; span=(0, 3), match='The'>

In the same way, $ is used to match whether a character or some set of characters is at the end of a line.
Let's check if cool is at the end of the sentence in the text below :

Regex is super cool

Code :

text = 'Regex is super cool'
pattern = re.compile(r'cool$')
match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(15, 19), match='cool'>

NOTE: By default, ^ only matches at the very beginning of the whole string and $ only at its very end (or just before a trailing newline), not at the start or end of every line. In NLP and other applications, you will often work with multi-line documents that you have to preprocess to extract patterns.

Let's consider an example.

Suppose we want to extract the first user id (24ga-d34) in the string:

'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

Each user id sits at the beginning of a new line, so using search() with ^ alone won't work :

pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}')
text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

match = pattern.search(text)
print(match)

## Output:
None

We can fix this by passing the re.MULTILINE (or re.M) flag to compile().

You can check all the available flags in the re module.

With the re.MULTILINE flag, ^ and $ no longer apply only to the start and end of the whole string: ^ also matches at the beginning of every line and $ at the end of every line.

Code :

import re
pattern = re.compile(r'^\d{2}[a-z]{2}-[a-z]\d{2}', re.MULTILINE)
text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'

match = pattern.search(text)
print(match)

## Output:
<re.Match object; span=(9, 17), match='24ga-d34'> 

Intuition behind the above code :

  • ^ -> matches the pattern at the beginning of a line
  • \d{2} -> matches exactly two digits
  • [a-z]{2} -> matches exactly two lowercase letters
  • - -> matches a hyphen
  • [a-z] -> matches any single lowercase letter
  • \d{2} -> matches exactly two digits
  • re.MULTILINE -> overrides the default behavior of ^, which matches only at the beginning of the whole string.
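The same flag combines naturally with findall() when every user id is wanted, not just the first. The sketch below also appends $ to the pattern so that each id has to fill its whole line; that extra anchor is an addition made for this example, not part of the pattern above.

```python
import re

text = 'User ids\n24ga-d34\n87bx-f60\n47nd-q21'
id_regex = r'^\d{2}[a-z]{2}-[a-z]\d{2}$'

# With re.MULTILINE, ^ and $ are tested at every line, so all three ids are found
user_ids = re.findall(id_regex, text, re.MULTILINE)
print(user_ids)  # ['24ga-d34', '87bx-f60', '47nd-q21']

# Without the flag, ^ is only tried at the very start of the string ('User ids...')
without_flag = re.findall(id_regex, text)
print(without_flag)  # []
```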

{n} is a quantifier that matches the element before it exactly n times, where n is a non-negative integer.
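A small sketch of {n} on its own, using made-up inputs. Combined with ^ and $ it can act as a simple length check, for example to confirm that a string is exactly four digits long.

```python
import re

# \d{2} consumes digits two at a time; the leftover '5' cannot form a pair
pairs = re.findall(r'\d{2}', '12345')
print(pairs)  # ['12', '34']

# Anchoring with ^ and $ turns {4} into an "exactly four digits" test
is_four = re.search(r'^\d{4}$', '1245') is not None
is_five = re.search(r'^\d{4}$', '12456') is not None
print(is_four, is_five)  # True False
```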


To be continued later...

Conclusion

In this tutorial, you learnt the difference between literals and metacharacters in regex. You also learnt how to use these metacharacters to match patterns in text using the re module in Python. In the next part of the tutorial, we will delve into other preprocessing techniques in NLP.


Follow me for more of this content. Let's connect on LinkedIn!

