UTF-8 regular expressions

Paweł bbkr Pabian

Posted on September 7, 2023

For many, many years the Perl language has been the top choice for text processing tasks. As a result it established an informal standard for regular expressions. Today almost every major language either uses the PCRE (Perl Compatible Regular Expressions) library directly or implements its own regular expression engine that is heavily inspired by, and mostly compatible with, Perl's.

The Raku language was meant to be a direct continuation of Perl (its former name was Perl 6). Its regular expression engine was redesigned from scratch. However, the modernized syntax and new features came at the cost of backward compatibility.

Let's compare them side by side to get a general understanding of what is currently available in most languages (I will call those regular expressions "Perl" ones) and what may be adopted by other languages if Raku manages to establish a new standard. There is a lot to cover here, so comments will be divided into Unicode-specific aspects and a separate section that clarifies technical differences.

Literal text

$ perl -E 'use utf8; say "Żółw 🐢" =~ /Ż..w 🐢/'
1

$ raku -e 'say "Żółw 🐢" ~~ /Ż..w \s "🐢"/'
「Żółw 🐢」

Unicode:

  • Both Perl and Raku match text in a Unicode-aware manner, respecting multi-byte code points. Non-ASCII characters are allowed in the regular expression body. Yay, good start!
  • Raku treats non-alphabetic symbols (like emojis) as metacharacters and requires them to be quoted.
  • Perl needs the use utf8 pragma to indicate that the source code is in UTF-8; a similar declaration is a common requirement in a lot of other languages. Raku source code is UTF-8 by default.

Technical:

  • Whitespace handling is flipped. Perl treats whitespace in a regular expression literally and can ignore it with the //x modifier. Raku ignores whitespace by default and can make it significant with the m:s// or m:sigspace// modifier. So you can write /Ż..w \s 🐢/x in Perl to get Raku behavior or m:s/Ż..w "🐢"/ in Raku to get Perl-like behavior.
  • In Raku modifiers were moved to the beginning of the regular expression for better readability.
  • Perl returns a boolean as the match result and the matched text is available in the $& variable, while Raku returns a Match object.

Predefined character classes

$ perl -E 'use utf8; "1꧕ żółtych róż" =~ /\d{2} \w+ [[:alpha:]]+/; say $&'
1꧕ żółtych róż

$ raku -e 'say "1꧕ żółtych róż" ~~ /\d**2 \s \w+ \s <.alpha>+/'
「1꧕ żółtych róż」

Unicode:

  • Both Perl and Raku support a similar set of long and short classes that include non-ASCII characters.
  • Raku also supports a small set of predefined tokens.
  • Careful what you wish for!

A very common mistake is to write a regular expression in a Unicode-aware language without realizing what a given character class matches, or to blindly copy-paste old regular expressions into Unicode-aware code. For example, \d matches a digit. The Javanese digit five (꧕) is a digit and will be matched by the ^\d{5}\z American short ZIP code regular expression, probably causing weird side effects and errors. If you need only ASCII digits you must be explicit about it: [0-9] in Perl or <[0..9]> in Raku.

Technical:

  • Character classes are handled very differently. In Perl, predefined POSIX [:classes:] are only usable within a bracketed class [], while in Raku they are written as <tokens>, which is super consistent with built-in Grammars. More on that later.

Code point properties

I recommend reading this post in series before continuing...

$ perl -E 'use utf8; "Cool😎" =~ /\p{Lu}\P{Uppercase_Letter}+\p{Block=Emoticons}/; say $&'
Cool😎

$ raku -e 'say "Cool😎" ~~ /<:Lu><:!Uppercase_Letter>+ <:Block("Emoticons")>/'
「Cool😎」

Unicode:

  • Both Perl and Raku support code point properties.
  • Binary properties can be tested without a value, using either long or short names (for example Uppercase_Letter or Lu).
  • A specific value of a property can be checked by providing a parameter (for example, the value of the property named Block should be equal to Emoticons).
  • Perl mixes Unicode properties, POSIX properties and internal properties under a common \p{} test. They also have variants: \p{PosixDigit} matches 0-9 while \p{XPosixDigit} matches all Unicode digits. One way to look at it is that a property is a property, no matter who defined it. But I personally dislike it because it provides duplicated, overlapping functionality and makes regular expressions less portable. I really wish there was a separate test dedicated to Unicode properties only.

Technical:

  • Perl uses \p{Foo} for a property and \P{Foo} for a negated property, while Raku uses the token-ish form <:Foo> for a property and <:!Foo> for a negated property.
  • The property value parameter is different. Perl uses Foo=Bar syntax, which is compact but kind of weird due to the unquoted value; even Perl itself does not compare strings like that. Raku decided on the Foo('Bar') method call style, aligned with the rest of Raku and commonly used in other languages.
  • Perl treats a string property tested without a value as matching if the property returns any value indicating that it applies, while Raku only matches if the value itself matches:
$ perl -E 'use utf8; say "4" =~ /\p{Digit}/;'
1

$ raku -e 'say "4" ~~ /<:Digit>/'
Nil # oops, not explicit enough

$ raku -e 'say "4" ~~ /<:Digit("Decimal")>/'
「4」 # because property "Digit" of "4" is "Decimal"
  • Raku has a nasty trap here. One may think: "if I need the Digit property of any kind, I can just request any defined value":
$ raku -e 'say "4" ~~ /<:Digit(Any:D)>/'
「4」 # success?

This is very far from being correct, because some properties return defined strings indicating that they do not apply:

$ raku -e 'say "A" ~~ /<:Digit(Any:D)>/'
「A」 # wrong

$ raku -e 'say "A".uniprop("Digit")'
None # literal string 'None' matching Any:D value

Hint:

  • If you are mixing tests for General_Category, Script and Block properties in a single regular expression, I strongly recommend using full property names. For example, can you tell what the 'A' ~~ /<:Latin>/ test means? It tests Script, not Block, because A is in the Block named Basic Latin. Being explicit makes a regular expression much easier to understand, for example in Perl:
$ perl -E '
    use utf8;
    "A" =~ /\p{General_Category=Uppercase_Letter}/;
    "A" =~ /\p{Block=Basic Latin}/;
    "A" =~ /\p{Script=Latin}/;
'

Warning: in Raku the explicit General_Category test currently only accepts short forms.

Property arithmetic

This is one of those features that look useless on their own but really shine when combined with Unicode properties. Let's assume you got a text about animal life expectancy stats: แฮมสเตอร์ ๔, แมว ๑๖ (stats: hamster 4, cat 16) and must extract Thai words from it, skipping numbers.

One way to solve it is to manually enumerate all Thai letters:

$ perl -E '
    use utf8;
    my $text = "stats: แฮมสเตอร์ ๔, แมว ๑๖";
    say for $text =~ /[กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮฯะาำเแโใไๅๆ]+/g;
'

แฮมสเตอร  # hamster
แมว       # cat

That works but will cause a lot of head scratching if someone unfamiliar with the Thai alphabet encounters this regular expression. You can try to be more explicit and provide a range:

$ perl -E '
    use utf8;
    my $text = "stats: แฮมสเตอร์ ๔, แมว ๑๖";
    say for $text =~ /[\N{THAI CHARACTER KO KAI}-\N{THAI CHARACTER MAIYAMOK}]+/g;
'

แฮมสเตอร
แมว

This also works, but it still requires knowledge of the Thai alphabet and introduces a new risk: the provided range may not be a continuous series of code points exclusively from this alphabet. For example, the Polish alphabet starts with a and ends with ź, but there are actually 280 code points between them containing a lot of other stuff.

That is the perfect application for an extended character class:

$ perl -E '
    use utf8;
    my $text = "stats: แฮมสเตอร์ ๔, แมว ๑๖";
    say for $text =~ /(?[ \p{Thai} & \p{Letter} ])+/g;
'

แฮมสเตอร
แมว

An extended class is wrapped in (?[ ]) and allows class arithmetic; in this case & indicates the intersection of the Thai script and the Letter general category. You can use intersection &, union +, subtraction - and XOR ^. No knowledge of the Thai alphabet is needed to extract Thai words!

Well, kind of... The full Thai word for hamster is หนูแฮมสเตอร์ (thehamster). You may have already noticed that none of the previous solutions extracted the last character ร์ properly. And our code actually splits this word:

$ perl -E '
    use utf8;
    my $text = "stats: หนูแฮมสเตอร์ ๔, แมว ๑๖";
    say for $text =~ /(?[ \p{Thai} & \p{Letter} ])+/g;
'

หน        # the
แฮมสเตอร  # hamster
แมว

This is because นู and ร์ are each actually two code points written one above the other, forming a grapheme cluster. Let's analyze them:

$ raku -e '.say for "นู".uninames;'
THAI CHARACTER NO NU
THAI CHARACTER SARA UU

$ raku -e '.say for "นู".uniprops;'
Lo # Letter_Other
Mn # Nonspacing_Mark

That solves our mystery. Those missing Thai characters are not letters but nonspacing marks. But hey, we have property arithmetic. Let's fix that quickly:

$ perl -E '
    use utf8;
    my $text = "stats: หนูแฮมสเตอร์ ๔, แมว ๑๖";
    say for $text =~ /(?[ \p{Thai} & ( \p{Letter}  + \p{Nonspacing_Mark} ) ])+/g;
'

หนูแฮมสเตอร์
แมว

So now we have the intersection of the Thai script with the union of the Letter and Nonspacing_Mark general categories. Everything is encapsulated in a neat, self-documenting extended character class. Lovely!

In the Raku world things are not that mature yet. Character class arithmetic only supports union and subtraction. For example, let's find stuff that looks like model numbers (at least 2 characters long):

$ raku -e '
    say "Production of AR-15 rifle..." ~~ /
        <:Uppercase_Letter + :Digit("Decimal") + :Dash_Punctuation> ** 2..*
    /
'

「AR-15」

The syntax for an extended class is <:A + :B>; grouping inside is not supported.

Grapheme clusters

$ perl -E 'use utf8; "หนูแฮมสเตอร์" =~ /\p{Letter}+/; say $&;'
หน # the

$ raku -e 'say "หนูแฮมสเตอร์" ~~ /<:Letter>+/'
「หนูแฮมสเตอร์」 # thehamster, unharmed :)

This time the point goes to Raku, which handles grapheme clusters properly.

Perl has the predefined \X class, which represents "what appears to be a single character, but may be represented internally by more than one", so pretty much everything. Because it cannot be intersected in an extended class to get a cluster with a specific property, it is next to useless.

Diacritics

Matching while ignoring combining code points is a Raku-only feature.

$ raku -e 'say "👋🏾Cześć" ~~ m:ignoremark/ "👋" Czesc /'
「👋🏾Cześć」

In Perl it is possible by decomposing the text with the Unicode::Normalize module, filtering out combining code points and matching the preprocessed text. But the Perl regular expression engine does not support that out of the box.

Variable case length

There is a perfect example in the German language: sharp s, also named Eszett.

It looks like this: ß, and it is basically equal to ss. So weiße and weisse both mean white. It had no uppercase form; SS was always used. I wrote "was", because in 2017 an uppercase form of ß was officially added to the German alphabet as ẞ, causing some backward-compatibility havoc:

$ raku -e 'say "ß".uc'
SS # still translates to SS, backward compatibility

$ raku -e 'say "ẞ".lc'
ß # this does not translate to ss, because it never did

So we have an intransitive case change that also changes length: lower case is ß, which is a synonym for lower case ss. Both Perl and Raku handle this correctly:

$ raku -e 'say "WEIẞE" ~~ m:ignorecase/ weisse /'
「WEIẞE」

$ perl -E 'use utf8; say "WEIẞE" =~ /weisse/i;'
1

Pick your poison

We had two regular expression engines flexing their muscles to prove which one is the Unicode handling champion. Perl dominates with Unicode properties and property arithmetic. Raku fights back with grapheme clusters and diacritic-insensitive matching.

Coming up next: Optional fun with homoglyphs. And Byte Order Mark. I promise next posts will be shorter and easier.
