Python Text Processing

This chapter will discuss str methods and introduce a few examples with the string and re modules.

join

The join() method is similar to what the print() function does with the sep option, except that you get a str object as the result. The iterable you pass to join() can only have string elements. On the other hand, print() uses an object's __str__() method to get its string representation (__repr__() method is used as a fallback).

>>> print(1, 2)
1 2
>>> ' '.join((1, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: sequence item 0: expected str instance, int found
>>> ' '.join(('1', '2'))
'1 2'

>>> c = ' :: '
>>> c.join(['This', 'is', 'a', 'sample', 'string'])
'This :: is :: a :: sample :: string'
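
If the iterable holds non-string values, one common workaround is to convert each element with str() before joining. Here's a minimal sketch of that idea (the variable names are only for illustration):

```python
# join() accepts only strings, so convert each element first
numbers = (1, 2)
joined = ' '.join(map(str, numbers))
print(joined)   # 1 2

# a generator expression works as well
mixed = [1, 2.5, None, 'four']
csv_row = ', '.join(str(item) for item in mixed)
print(csv_row)  # 1, 2.5, None, four
```

Note that str() is applied explicitly here, whereas print() does the conversion for you.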

As an exercise, check what happens if you pass multiple string values separated by commas to join() instead of a single iterable.

Transliteration

The translate() method accepts a table of codepoints (numerical value of a character) mapped to another character/codepoint or None (if the character has to be deleted). You can use the ord() built-in function to get the codepoint of characters. Or, you can use the str.maketrans() method to generate the mapping for you.

>>> ord('a')
97
>>> ord('A')
65

>>> str.maketrans('aeiou', 'AEIOU')
{97: 65, 101: 69, 105: 73, 111: 79, 117: 85}

>>> greeting = 'have a nice day'
>>> greeting.translate(str.maketrans('aeiou', 'AEIOU'))
'hAvE A nIcE dAy'
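
You can also build the table by hand as a dict. The sketch below (with an arbitrary mapping chosen only for illustration) shows the three kinds of values that translate() understands: a replacement string, a codepoint and None for deletion.

```python
greeting = 'have a nice day'

# keys are codepoints; values can be a string, a codepoint or None
table = {
    ord('a'): 'A',       # replace with another character
    ord('e'): None,      # delete the character
    ord('i'): '!!',      # a replacement string can be longer than one character
    ord('y'): ord('Y'),  # a codepoint also works as the value
}
result = greeting.translate(table)
print(result)   # hAv A n!!c dAY

# str.maketrans() accepts a dict with single-character keys too
same_table = str.maketrans({'a': 'A', 'e': None, 'i': '!!', 'y': 'Y'})
print(greeting.translate(same_table))
```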

The string module has a collection of constants that are often useful in text processing. Here's an example of deleting punctuation characters.

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

>>> para = '"Hi", there! How *are* you? All fine here.'
>>> para.translate(str.maketrans('', '', string.punctuation))
'Hi there How are you All fine here'

>>> chars_to_delete = ''.join(set(string.punctuation) - set('.!?'))
>>> para.translate(str.maketrans('', '', chars_to_delete))
'Hi there! How are you? All fine here.'

As an exercise, read the documentation for features covered in this section. See also stackoverflow: character translation examples.

Removing leading/trailing characters

The strip() method removes consecutive characters from the start and end of the given string. By default it removes whitespace characters. You can change this by passing a str argument, in which case each character of the argument is treated as an individual character to remove (not as a substring). Use the lstrip() and rstrip() methods to work only on the leading and trailing characters respectively.

>>> greeting = '  \t\r\n have    a  nice \t day  \f\v\r\t\n '
>>> greeting.strip()
'have    a  nice \t day'
>>> greeting.lstrip()
'have    a  nice \t day  \x0c\x0b\r\t\n '
>>> greeting.rstrip()
'  \t\r\n have    a  nice \t day'

>>> '"Hi",'.strip(string.punctuation)
'Hi'

The removeprefix() and removesuffix() methods (available since Python 3.9) delete a substring from the start or end of the input string, if it is present.

>>> 'spare'.removeprefix('sp')
'are'
>>> 'free'.removesuffix('e')
'fre'

# difference between remove and strip
>>> 'cared'.removesuffix('de')
'cared'
>>> 'cared'.rstrip('de')
'car'
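
The same difference shows up at the start of a string. Here's a small sketch with made-up domain names, assuming Python 3.9 or later for removeprefix():

```python
url = 'www.example.com'

# the argument is a set of characters, so order and repetition don't matter
stripped = url.strip('cmowz.')
print(stripped)     # example

# pitfall: lstrip('www.') removes any leading run of 'w' and '.' characters
site = 'www.wiki.org'
via_lstrip = site.lstrip('www.')
print(via_lstrip)   # iki.org

# removeprefix() deletes the exact substring, at most once
via_removeprefix = site.removeprefix('www.')
print(via_removeprefix)  # wiki.org
```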

Dealing with case

Here are five methods for changing the case of characters. For word-level transformation, a word is any run of consecutive letters, so words need not be separated by whitespace.

>>> sentence = 'thIs iS a saMple StrIng'

>>> sentence.capitalize()
'This is a sample string'

>>> sentence.title()
'This Is A Sample String'

>>> sentence.lower()
'this is a sample string'

>>> sentence.upper()
'THIS IS A SAMPLE STRING'

>>> sentence.swapcase()
'THiS Is A SAmPLE sTRiNG'

The string.capwords() function is similar to title(), but it splits words on a separator that you can specify (the default is whitespace).

>>> phrase = 'this-IS-a:colon:separated,PHRASE'

>>> phrase.title()
'This-Is-A:Colon:Separated,Phrase'
>>> string.capwords(phrase, ':')
'This-is-a:Colon:Separated,phrase'
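
Text with apostrophes is a case where the difference matters. Here's a short sketch (the sample sentence is just for illustration):

```python
import string

text = "they're bill's friends"

# title() starts a new word after every non-letter, including the apostrophe
titled = text.title()
print(titled)   # They'Re Bill'S Friends

# capwords() splits on whitespace only, so the apostrophe stays inside the word
capped = string.capwords(text)
print(capped)   # They're Bill's Friends
```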

is methods

The islower(), isupper() and istitle() methods check if the given string conforms to the specific case pattern. Characters other than letters do not influence the result, but at least one cased character needs to be present for a True output.

>>> 'αλεπού'.islower()
True

>>> '123'.isupper()
False
>>> 'ABC123'.isupper()
True

>>> 'Today is Sunny'.istitle()
False

Here are some examples with the isnumeric() and isascii() methods. As an exercise, read the documentation for the rest of the is methods.

# checks if string has numeric characters only, at least one
>>> '153'.isnumeric()
True
>>> ''.isnumeric()
False
>>> '1.2'.isnumeric()
False
>>> '-1'.isnumeric()
False

# False if any character codepoint is outside the range 0x00 to 0x7F
>>> '123—456'.isascii()
False
>>> 'happy learning!'.isascii()
True
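
Since isnumeric() returns False for strings like '-1', it cannot be used on its own to check for signed integers. One possible approach, sketched below, is to let int() attempt the conversion (is_int is a hypothetical helper name, not a built-in):

```python
def is_int(text):
    """Return True if int() can parse text as a base-10 integer."""
    try:
        int(text)
    except ValueError:
        return False
    return True

# note: int() is more lenient than isnumeric() in some ways,
# for example it accepts a sign and surrounding whitespace
for sample in ['153', '-1', '1.2', '']:
    print(repr(sample), is_int(sample))
```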

Substring and count

The in operator checks if the LHS string is a substring of the RHS string.

>>> sentence = 'This is a sample string'

>>> 'is a' in sentence
True
>>> 'this' in sentence
False
>>> 'this' in sentence.lower()
True
>>> 'test' not in sentence
True

The count() method gives the number of times the given substring is present (non-overlapping).

>>> sentence = 'This is a sample string'
>>> sentence.count('is')
2
>>> sentence.count('w')
0

>>> word = 'phototonic'
>>> word.count('oto')
1
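
If you need overlapping matches to be counted as well, you'll have to write that logic yourself. Here's a minimal sketch using the find() method (count_overlapping is a hypothetical helper, not part of str):

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError('sub must not be empty')
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # resume the search one character after the previous match start
        start = text.find(sub, start + 1)
    return count

word = 'phototonic'
print(word.count('oto'))               # 1
print(count_overlapping(word, 'oto'))  # 2
```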

Match start/end

The startswith() and endswith() methods check for the presence of substrings only at the start/end of the input string.

>>> sentence = 'This is a sample string'

>>> sentence.startswith('This')
True
>>> sentence.startswith('is')
False

>>> sentence.endswith('ing')
True
>>> sentence.endswith('ly')
False

If you need to check for multiple substrings, pass a tuple argument.

>>> words = ['refuse', 'impossible', 'present', 'read']
>>> prefix = ('im', 're')
>>> for w in words:
...     if w.startswith(prefix):
...         print(w)
... 
refuse
impossible
read
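
The tuple form works with endswith() too. Here's a sketch that filters file names by extension (the file names and extensions are made up for illustration). Note that the argument has to be a tuple; a list raises TypeError.

```python
files = ['notes.txt', 'photo.JPG', 'report.pdf', 'data.csv', 'scan.jpeg']
image_exts = ('.jpg', '.jpeg', '.png')

# lower() makes the extension check case insensitive
images = [f for f in files if f.lower().endswith(image_exts)]
print(images)   # ['photo.JPG', 'scan.jpeg']

# a list is not accepted in place of the tuple
try:
    'refuse'.startswith(['im', 're'])
except TypeError as err:
    error_name = type(err).__name__
    print(error_name)   # TypeError
```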

split

The split() method splits a string based on the given substring and returns a list. By default, whitespace is used for splitting. You can also control the number of splits.

>>> greeting = '  \t\r\n have    a  nice \t day  \f\v\r\t\n '
# note that leading/trailing whitespace does not create empty elements
>>> greeting.split()
['have', 'a', 'nice', 'day']

# note that the empty elements are preserved here
>>> ':car::jeep::'.split(':')
['', 'car', '', 'jeep', '', '']

>>> 'apple<=>grape<=>mango<=>fig'.split('<=>', maxsplit=1)
['apple', 'grape<=>mango<=>fig']
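
Two related details are worth a quick sketch: the rsplit() method counts splits from the end of the string, and passing a single space explicitly behaves differently from the default whitespace splitting.

```python
fruits = 'apple<=>grape<=>mango<=>fig'

# rsplit() applies maxsplit starting from the right
last_split = fruits.rsplit('<=>', maxsplit=1)
print(last_split)       # ['apple<=>grape<=>mango', 'fig']

# default splitting treats a run of whitespace as one separator
spaced = 'a  b'
default_split = spaced.split()
print(default_split)    # ['a', 'b']

# an explicit ' ' separator preserves empty elements
explicit_split = spaced.split(' ')
print(explicit_split)   # ['a', '', 'b']
```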

replace

Use the replace() method for substitution. An optional third argument limits the number of replacements made.

>>> phrase = '2 be or not 2 be'

>>> phrase.replace('2', 'to')
'to be or not to be'

>>> phrase.replace('2', 'to', 1)
'to be or not 2 be'

# recall that string is immutable, you'll need to re-assign if needed
>>> phrase = phrase.replace('2', 'to')
>>> phrase
'to be or not to be'
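
Keep in mind that replace() matches the given substring wherever it occurs, including inside longer words. Here's a small sketch of that behavior (the sample text is arbitrary):

```python
text = 'par spare part'

# every occurrence of 'par' is replaced, even within 'spare' and 'part'
replaced = text.replace('par', 'X')
print(replaced)     # X sXe Xt

# replace() returns a new string, so calls can be chained
chained = text.replace('par', 'X').replace(' ', '_')
print(chained)      # X_sXe_Xt
```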

re module

Regular expressions are a versatile tool for text processing. They are included in the standard library of most programming languages used for scripting. Where they are not, you can usually find a third-party library. Syntax and features of regular expressions vary from language to language, though. The re module is Python's built-in regular expression library.

What's so special about regular expressions and why would you need them? They form a mini programming language, specialized for text processing. Parts of a regular expression can be saved for future use, analogous to variables and functions. There are ways to perform AND, OR and NOT conditionals, as well as operations similar to the range() function, the string repetition operator and so on. Here are some common use cases:

Sanitizing a string to ensure that it satisfies a known set of rules. For example, to check if a given string matches password rules.
Filtering or extracting portions at an abstract level, such as letters, numbers, punctuation and so on.
Qualified string replacement. For example, at the start or the end of a string, only whole words, based on surrounding text, etc.

>>> import re

# extract non-colon character sequences
>>> ip = ':car::jeep::'
>>> ip.split(':')
['', 'car', '', 'jeep', '', '']
>>> re.findall(r'[^:]+', ip)
['car', 'jeep']

# replace only whole words 'par' and 'hand' with 'X'
# \b is an anchor to restrict the matching to the start/end of words
>>> ip = 'par spare part hand handy unhanded'
>>> re.sub(r'\b(par|hand)\b', 'X', ip)
'X spare part X handy unhanded'
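
The sanitizing and extraction use cases can be sketched in a similar way. The password rule below (at least 8 characters, limited to ASCII letters, digits and underscore) is a made-up assumption for illustration, and is_valid_password is a hypothetical helper name.

```python
import re

# hypothetical rule: 8 or more characters from the set A-Z, a-z, 0-9 and _
PASSWORD_RE = re.compile(r'[A-Za-z0-9_]{8,}')

def is_valid_password(text):
    """Return True if the entire string satisfies the rule above."""
    # fullmatch() succeeds only if the pattern covers the whole string
    return PASSWORD_RE.fullmatch(text) is not None

print(is_valid_password('secret_123'))      # True
print(is_valid_password('short'))           # False
print(is_valid_password('has space 123'))   # False

# extract all sequences of ASCII digits
numbers = re.findall(r'[0-9]+', 'order 66 shipped 3 items on day 12')
print(numbers)  # ['66', '3', '12']
```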

See my book Python re(gex)? for a detailed guide on regular expressions (it is longer than this book!). The book covers the third-party regex module as well.

Exercises

Write a function that checks if two strings are anagrams irrespective of case (assume the input is made up of letters only).
```
>>> anagram('god', 'Dog')
True
>>> anagram('beat', 'table')
False
>>> anagram('Beat', 'abet')
True
```

Read the documentation and implement these formatting examples with equivalent str methods.

>>> fruit = 'apple'

>>> f'{fruit:=>10}'
'=====apple'
>>> f'{fruit:=<10}'
'apple====='
>>> f'{fruit:=^10}'
'==apple==='

>>> f'{fruit:^10}'
'  apple   '

Write a function that returns a list of words present in the input string.

>>> words('"Hi", there! How *are* you? All fine here.')
['Hi', 'there', 'How', 'are', 'you', 'All', 'fine', 'here']
>>> words('This-Is-A:Colon:Separated,Phrase')
['This', 'Is', 'A', 'Colon', 'Separated', 'Phrase']
