Applying regular expressions – matching single-letter words in text

aloisseckar

Alois Sečkár

Posted on January 14, 2024

In my previous article on this topic, I said regex is not your enemy. I tried to explain how regular expressions work on a practical problem I was solving at the time. As a follow-up, I have another application today.

The task

I live in Czechia. In our language it is a typography error to leave a single-letter word at the end of a line. We have plenty of such words (prepositions and conjunctions). You need to add a “hard” (non-breaking) space to make them wrap together with the following word.

This was always a pain for me to deal with on websites. I produced many texts where I literally went through the whole content and manually inserted all the required &nbsp; (or &#160;) HTML entities.

But apart from this being tedious and error-prone, what if the displayed text is out of your direct control?

The “dumb” solution

Not so long ago I discovered that you can put the Unicode character \u00A0 (or, even better, the shorthand \xa0) into a JavaScript string and it works as a non-breaking space even in plain text displayed in the browser.

Therefore, we can use replaceAll to replace all occurrences of a given standalone letter: text.replaceAll(' a ', ' a\xa0')
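A minimal sketch of that idea, in case you want to try it in a console. The sample sentence and the NBSP constant name are made up for illustration only.

```javascript
// '\xa0' and '\u00A0' are two escapes for the same character (U+00A0).
const NBSP = '\xa0'
console.log(NBSP === '\u00A0')   // true
console.log(NBSP.charCodeAt(0))  // 160

// Hypothetical sample sentence, used only for illustration.
const text = 'Pes a kocka'

// The regular space after the standalone "a" becomes a non-breaking one.
const fixed = text.replaceAll(' a ', ' a' + NBSP)
console.log(fixed === 'Pes a\xa0kocka') // true
```

The result looks identical when printed, which is why the comparison against the escaped form is the easier way to check it.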

Expanding this idea to cover all variants (we have 8 such “words” in our language) would be a bit verbose, but it would cover most of the cases. However, there are a couple of extra rules that make it harder:

  • They can be at the beginning of a sentence, so it could be either lower or upper letter. This means the number of string-based replaces doubles up.
  • All others can appear in pairs after an “a” or “i” (like “a s”, which means “and with”, or “i v”, which means “also in”). This would mean another multiplication of replaceAll commands.
  • They can also appear directly after opening brackets, so the ' a ' string won’t even match these, and besides, you also have to modify the replacement. Ready for another round of new clauses?
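To make the growth concrete, here is a rough sketch of where the string-based approach is heading. The function name fixNaively and the sample text are hypothetical; the letters are the eight single-letter words this article deals with. It handles lower and upper case, and it already needs two calls per letter while still missing the bracket case.

```javascript
// The eight single-letter words this article deals with.
const letters = ['a', 'i', 'k', 'o', 's', 'u', 'v', 'z']

// Hypothetical helper: one pair of replaceAll calls per letter.
function fixNaively(text) {
  let result = text
  for (const letter of letters) {
    // lower-case variant in the middle of a sentence
    result = result.replaceAll(` ${letter} `, ` ${letter}\xa0`)
    // upper-case variant at the start of a sentence
    const upper = letter.toUpperCase()
    result = result.replaceAll(` ${upper} `, ` ${upper}\xa0`)
  }
  return result
}

// Hypothetical sample text, not taken from a real page.
const sample = 'Byl v lese. V lese (v noci) byl klid.'
const naive = fixNaively(sample)

// " v " and " V " are fixed, but "(v " is not, because no space precedes the letter.
console.log(naive === 'Byl v\xa0lese. V\xa0lese (v noci) byl klid.') // true
```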

Stop right there! If you haven’t already, this is the last moment when you can save yourself from bloating the function with an ever-growing list of rules that will make your code smell like those infamous Swedish fish cans.

As you may have guessed, there is a better way, and the name is “regular expressions”.

The regex solution

First of all, we switch from replaceAll to the replace method, which accepts a RegExp object as its first argument instead of a plain string. (replaceAll accepts a RegExp too, but only one with the global flag set; otherwise it throws a TypeError.)

To define a regex in JavaScript code, just wrap its definition into slashes / … / Similar to ' or " enclosing strings, content between / will be interpreted as a regular expression. If you have an IDE with syntax highlighting for JavaScript, you will see it better.

JS regex defined in VSCode

Before delving into matching texts, we must prevent one undesirable regression. Unlike replaceAll, the replace method only runs its comparison once. So, we would match and fix the first occurrence of a single-letter word, but the rest would remain intact! This can be easily fixed by adding a “global” flag. Flags are simple modifiers for JS regex behaviour. To use them, simply add the respective letter(s) after the closing slash.

In our case it will be: / … /g. Making regex “global” causes it to continue looking for matches until there is no other left. And they will all be replaced, even when using the replace method.
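The difference is easy to see on a tiny example. This is only a sketch with a made-up sample string; the last part shows the related behaviour of replaceAll, which refuses a regex without the global flag.

```javascript
// Hypothetical sample text for illustration.
const text = 'a b a b a'

// Without the "g" flag only the first match is replaced.
const once = text.replace(/a/, 'X')
console.log(once) // 'X b a b a'

// With the "g" flag every match is replaced.
const all = text.replace(/a/g, 'X')
console.log(all) // 'X b X b X'

// replaceAll also accepts a regex, but only a global one;
// a non-global regex makes it throw a TypeError.
let threw = false
try {
  text.replaceAll(/a/, 'X')
} catch (e) {
  threw = e instanceof TypeError
}
console.log(threw) // true
```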

Reducing the amount of separate commands

The first thing to optimize is to get rid of the 8 separate invocations. In regular expressions we can join individual options into one by putting them into square brackets to define a list of options. In our case it will be:



[aikosuvz]



It raises one concern though. When we were looking for an “a”, it was obvious we could replace the match using the “a” as well. But what about now? It can be any of the 8 options. You may remember from the previous article that we can capture the matched text as a group by using round brackets (parentheses).



([aikosuvz])



Then we have a reference to what was actually found inside. How to ask for it in JavaScript replace? Because not everything in JavaScript is bad, the n-th captured group is simply available under $n placeholder and you can just use it in the replacement string and everything will be handled for you under the hood.
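The placeholder mechanism can be tried in isolation. In this small sketch the sample text is made up, and the angle brackets in the replacement are there only to make the captured letter visible.

```javascript
// Hypothetical sample text for illustration.
const text = 'pes a kocka v lese'

// [aikosuvz] matches any one of the listed letters;
// the parentheses capture whichever letter was found as group 1,
// and $1 puts it back into the replacement.
const marked = text.replace(/ ([aikosuvz]) /g, ' <$1> ')
console.log(marked) // 'pes <a> kocka <v> lese'
```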

Armed with this knowledge we can rewrite:



text.replaceAll(' a ', ' a\xa0')

text.replaceAll(' z ', ' z\xa0')



Into a one liner:



text.replace(/ ([aikosuvz]) /g, ' $1\xa0')



Improving matching

This is a nice refactor, but it does not address any of the three problems mentioned in the previous chapter yet. It does not match capital letters, words in brackets and in the case of two single letters in a row it will only fix the first one.
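The three gaps can be reproduced with a few made-up strings. This sketch uses the same pattern as the one-liner, with ' $1\xa0' as the replacement (assumed here so that no regular space follows the non-breaking one); the helper name is hypothetical.

```javascript
// Same pattern as the one-liner; the helper name is hypothetical.
const fixLowercaseOnly = (text) => text.replace(/ ([aikosuvz]) /g, ' $1\xa0')

// 1) A capital letter is not in the character class, so nothing changes.
const capital = fixLowercaseOnly('Klid. V lese.')
console.log(capital === 'Klid. V lese.') // true

// 2) After an opening bracket there is no space right before the letter.
const bracket = fixLowercaseOnly('byl (v lese)')
console.log(bracket === 'byl (v lese)') // true

// 3) In a pair, the space after "a" is consumed by the first match,
//    so the following "s" has no leading space left to match on.
const pair = fixLowercaseOnly('pes a s nim')
console.log(pair === 'pes a\xa0s nim') // true
```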

I am pretty sure it is possible to keep expanding one regular expression until it covers all the edge cases. However, I also advocate against trying to do so, because the more complex a regex becomes, the less obvious it is, making it harder to understand or maintain. While regular expressions may be invaluable helpers, it is wise to keep them at bay.

So instead of assembling some god-like regex, I decided to split the possible cases into separate branches. After a few iterations I ended up with two of them:



// dual occurrences
input = input.replace(/(\s\(?)([AIai])\s([ikosuvz])\s/g, '$1$2\xa0$3\xa0')
// single occurrences
input = input.replace(/(\s\(?)([AIKOSUVZaikosuvz])\s/g, '$1$2\xa0')



Do not worry if it looks complicated. We will decode the meaning in no time. Whenever in doubt, just call our good friend https://regex101.com/ to help you describe what’s going on.

The first regex basically consists of 3 parts – the capturing groups – enclosed in brackets (highlighted green) plus some separators:

Regex analysis from regex101.com

The first group focuses on what comes BEFORE the single-letter word sequence. There must be a space, otherwise the pattern would match letters at the end of other words. I used the \s metacharacter for “any white space” to make it more robust. The \(? covers the edge case when the single-letter word appears right after an opening bracket. The bracket has to be escaped with a backslash, and the ? quantifier means “zero or one” to mark it optional.
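A quick way to check what the first group captures is to run the pattern on a few strings. This sketch uses a shortened, lower-case-only pattern and made-up inputs just to isolate that first group.

```javascript
// Shortened pattern for illustration: prefix group, one letter, trailing white space.
const pattern = /(\s\(?)([aikosuvz])\s/

// An opening bracket after the space becomes part of group 1.
const withBracket = pattern.exec('byl (v lese')[1]
console.log(withBracket === ' (') // true

// \s accepts other white space too, for example a tab.
const withTab = pattern.exec('byl\tv lese')[1]
console.log(withTab === '\t') // true

// The "v" at the end of "kov" has no white space before it, so there is no match.
const endOfWord = pattern.test('kov lese')
console.log(endOfWord) // false
```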

The second group handles the dual single-letter words. When appearing in pairs, only “a” or “i” can be the first. We only say “and sth” or “also sth”, not the other way around. Therefore, we can define a list of those two letters inside square brackets and capture it. Capital “A” and “I” are also present, as this dual occurrence may start a sentence. Either way, it must then be followed by a white space character.

The third group contains a list of possible single-letter words. The “a” can be omitted, because “a a” makes no sense and we always write “a i”, but never “i a” in our language. Technically, there is a little flaw, because this way it will also capture “i i” / “I i”, but those never appear in Czech texts, so from a practical point of view there is no need to mitigate it. Finally, after the capturing group there must be another space to make it a stand-alone word and not the first letter of something longer.

The replacement string '$1$2\xa0$3\xa0' looks ugly, but it really just glues the captured parts together using non-breaking spaces. Nothing is inserted between the first and second group, because the first group already holds the original white space and we do want the line to be able to wrap there.

The second variant is much easier, as we only need to capture one occurrence of a single letter word. Just don’t forget to include “a” back in the character list. Also, we are targeting the full spectrum of upper/lower characters:

Regex analysis from regex101.com

Replacing all the matches with '$1$2\xa0' will get the job done. Note that this variant is listed second: the first command deals with all the dual occurrences, and then this one treats the rest without having to worry about any clashes.
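Put together, the two commands can be wrapped in a small function and tried on sample input. This is only a sketch: the function name fixSingleLetterWords and the sample strings are made up here. The last check shows one thing worth knowing about these patterns as written: a word at the very start of the input has no white space in front of it, so it is left alone.

```javascript
// Hypothetical wrapper around the two commands from this article.
function fixSingleLetterWords(input) {
  // dual occurrences first
  let output = input.replace(/(\s\(?)([AIai])\s([ikosuvz])\s/g, '$1$2\xa0$3\xa0')
  // then the remaining single occurrences
  output = output.replace(/(\s\(?)([AIKOSUVZaikosuvz])\s/g, '$1$2\xa0')
  return output
}

// A pair: both letters are glued to the following word.
const pairFixed = fixSingleLetterWords('pes a s nim kocka')
console.log(pairFixed === 'pes a\xa0s\xa0nim kocka') // true

// A capital letter after a full stop and a letter after a bracket.
const mixedFixed = fixSingleLetterWords('Klid. V lese (v noci).')
console.log(mixedFixed === 'Klid. V\xa0lese (v\xa0noci).') // true

// No white space precedes the first character, so nothing matches here.
const startOfText = fixSingleLetterWords('V lese')
console.log(startOfText === 'V lese') // true
```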

Is this good enough?

I have to confess I was struggling when putting this article together. I was repeatedly coming up with “solutions” and scrapping them again for various flaws. For a long period of time, I was convinced 4 separate commands (dual/single and lower/upper) would be better considering trade-offs between compactness and simplicity. But as I was constantly trying to challenge my own judgement, I shifted towards the solution I have presented here.

Can you do better? I am eager to see your solutions, feel free to post them into comments!

I want to point out one more feature that may be handy trying to evolve this further. The second argument of the replace function doesn’t have to be just a plain string (enhanced with $n placeholders). You can also pass in a function that takes the match followed by the 0-n captured groups.

We may rewrite our code like this:



// dual occurrences
input = input.replace(/(\s\(?)([AIai])\s([ikosuvz])\s/g, function (match, g1, g2, g3) {
  return g1 + g2 + '\xa0' + g3 + '\xa0'
})
// single occurrences
input = input.replace(/(\s\(?)([AIKOSUVZaikosuvz])\s/g, function (match, g1, g2) {
  return g1 + g2 + '\xa0'
})



However, I think this is too verbose and unclear. It may be useful for more complex use cases like conditionally captured groups, so it is good to know it exists, but I would avoid using it here.
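For the curious, here is one way a replacer function might deal with conditionally captured groups. It is a sketch under assumptions, not the approach used in this article: both cases are merged into a single pattern with an alternation, and groups that did not take part in the match arrive in the function as undefined. The function name and samples are made up. It also illustrates the earlier point that a merged regex is harder to read.

```javascript
// Hypothetical single-pass variant: the dual branch is tried first, then the single one.
function fixInOnePass(input) {
  const pattern = /(\s\(?)(?:([AIai])\s([ikosuvz])|([AIKOSUVZaikosuvz]))\s/g
  return input.replace(pattern, (match, prefix, first, second, single) => {
    // Only one branch of the alternation matches; the other groups are undefined.
    if (single !== undefined) {
      return prefix + single + '\xa0'
    }
    return prefix + first + '\xa0' + second + '\xa0'
  })
}

const onePassPair = fixInOnePass('pes a s nim kocka')
console.log(onePassPair === 'pes a\xa0s\xa0nim kocka') // true

const onePassSingle = fixInOnePass('pes a kocka')
console.log(onePassSingle === 'pes a\xa0kocka') // true

const onePassMixed = fixInOnePass('Klid. V lese (v noci).')
console.log(onePassMixed === 'Klid. V\xa0lese (v\xa0noci).') // true
```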

Conclusion

The article is coming to an end, but we have only just started. The function for sanitizing single-letter words with non-breaking spaces is still a work in progress. I keep coming across new rules that need to be followed, as I am dealing with real texts on my website. Also, single-letter words are not the only thing to watch for. There are other special symbols and abbreviations that should wrap together. Or we may start tackling more languages with even more rules to incorporate.

You are welcome to review and suggest improvements on my GitHub: https://github.com/AloisSeckar/ELRHUtils/blob/main/src/text/textUtils.ts

Nevertheless, you have seen another application of regular expressions in a scenario that would be hard, if not impossible, to deal with using something else. Although the quest for the ultimate solution is still underway, we are well armed to solve it eventually.

If you have questions or remarks, go ahead and share them with us in the comments. Until next time!
