Balancing Clarity, Elegance, and Risk in Regular Expressions

arclight

arclight

Posted on August 16, 2021

Balancing Clarity, Elegance, and Risk in Regular Expressions

Recently a colleague posted a security incident report about a single-character error in a regular expression. It got me thinking about the techniques I have used to reduce errors in my regular expressions and how I could use this incident as a guide to applying these techniques. In the process, I found this is a case where elegance works against clarity; sometimes clarity needs to win especially in security or safety applications.

Disclaimer

Before I start, I want to make one thing clear: regular expressions are complex and brittle. Everyone makes this sort of mistake so do not interpret this as a personal criticism of any individual, any group, any language, or any project. I am not second-guessing design decisions. Nobody cares about how I would have written the original code. Even I don't.

Very simply, the code is what it is; my goal is to see what we can learn from it and what a team can do to reduce the risk posed by regular expressions. This is not an exhaustive tutorial. The examples below are meant to illustrate techniques for writing safer, more reviewable code. That's all.

The Incident

Daniel Cuthbert (@dcuthbert) writes:

Nice little bug this showing how not properly escaping a full stop
(.) can lead to code execution The bug writeup is brilliant too
https://github.com/pi-hole/AdminLTE/security/advisories/GHSA-5cm9-6p3m-v259

tl;dr

The one line technical summary from @SchneiderSec gets right to the point:

The issue lies in the fact that one of the periods is not escaped
allowing any character to be used in it's place.

A Few Words About Regular Expressions

  • Regular expressions are extremely powerful tools
  • Regular expression notation is notoriously difficult to read
  • Regular expressions are remarkably brittle and can fail in unexpected and catastrophic ways

When we use them (or any construct, really) we are balancing clarity, elegance, and risk.

Regular expressions are not an always/never tool. There are as many good reasons to use them as to avoid them. Here we are analyzing short strings which is the dominant use case for regular expressions.

The Code

Here is code we are worried about:

function validDomainWildcard($domain_name)
{
    // There has to be either no or at most one "*" at the beginning of a line
    $validChars = preg_match("/^((\*.)?[_a-z\d](-*[_a-z\d])*)(\.([_a-z\d](-*[a-z\d])*))*(\.([_a-z\d])*)*$/i", $domain_name);
    $lengthCheck = preg_match("/^.{1,253}$/", $domain_name);
    $labelLengthCheck = preg_match("/^[^\.]{1,63}(\.[^\.]{1,63})*$/", $domain_name);
    return ( $validChars && $lengthCheck && $labelLengthCheck ); //length of each label
}
Enter fullscreen mode Exit fullscreen mode

This function checks if a string matches three search patterns. The function returns True if all three patterns match; if any pattern doesn't match, it returns False.

Quick Impressions and Setting Focus

The problem is that there is a simple typo which drastically changes the function's behavior. It's not easy to see either as an author or reviewer. Otherwise there's nothing fundamentally wrong with this code.

Parts of this routine are extremely simple and parts are very complex. The simple parts are really simple - three assignments and a logical expression. The complex parts are the regexes so that is where we will focus our analytical efforts.

So what can we do to limit the risks that regexes pose?

Spacing and Documentation

Let's start by addressing clarity - how can we make regexes more readable?

We can:

  • add whitespace
  • decompose complex expressions into meaningful subexpressions
  • add comments to explain the intent of subexpressions

In some regex languages (for example PCRE and those embedded in Perl, Python, and PHP) you can disable the significance of carriage returns and other whitespace and allow embedded comments. Perl, Python, and PHP all support the 'x' option. In PHP, the 'x' option can also be enabled using the PCRE_EXTENDED modifier. Check your languagefor details.

Note that comments embedded in a regex are subject to additional escaping rules; we'll keep our documentation simple. There is already enough punctuation; let's try not to add any more.

But say our regex language doesn't support embedded comments. Can we still use this method?

Sure! Even without strange regex option flags, we can

  • decompose the pattern string into smaller substrings,
  • document them, and
  • compose the full pattern string from parts.

This is straightforward and should work in any language so you may want to try this tactic first.

When Should We Apply This?

There is no hard and fast rule but if a regex has more than 2-3 paren'd groups or is more than half a line, consider breaking it up and commenting it.

PHP Example

This is an example of adding whitespace and comments to the $validChars regex. The result is still rather long and confusing, but the structure and intent are much more apparent. Repetitive groups and character classes stand out and we can see inconsistencies in patterns. As a reviewer, I should at least mark these down and ask the author for clarification of any suspicious constructs.

Side note: My job as a reviewer is to ask for clarification, not to rewrite the code or redesign the application. These examples are more intended for code authors than reviewers.

We might want to break out groups 1, 4, and 6 and explain what we are looking for but this is a start.

$validChars = preg_match(
    "/^                         # Start of string,
      (                         # Open group 1
        (\*.)?           # ERROR: Optional '*.' literal (domain wildcard) as group 2
        [_a-z\d]                # One letter, digit, or underscore
        (                       # Start group 3
          -*                    # Any number of hyphens
          [_a-z\d]              # One letter, digit, or underscore
        )*                      # Any number of group 3 matches
      )                         # Close group 1; group 1 should match once
      (                         # Open group 4
        \.                      # Literal '.'
        (                       # Open group 5
          [_a-z\d]              # One letter, digit, or underscore
          (                     # Open group 6
            -*                  # Any number of hyphens
            [a-z\d]             # One letter or digit -- Q: Should this have an underscore?
          )*                    # Any number of group 6 matches
        )                       # Close group 5
      )*                        # Close group 4; any number of group 4 matches
      (                         # Open group 6
        \.                      # Literal '.'
        (                       # Open group 7
          [_a-z\d]              # One letter, digit, or underscore
        )*                      # Close group 7; any number of group 7 matches
                                # -- Q: Group 7 can be empty; is that intended?
      )*                        # Close group 6; any number of group 6
      $/ix", $domain_name);     # End of string; case-insensitive,
Enter fullscreen mode Exit fullscreen mode

If you have ever spent time programming LISP or Scheme, you might think this looks very similar to those languages. That is no accident. Regular expression notation is a low-level functional language. It just isn't as obvious when it's all crammed onto a single line.

Backslashitis, Delimiters, and Character Classes

Another clarity-killer is excessive escaping or as it is more commonly known, Backslashitis. Some common entities and patterns require backslashes such as tab (\t), digit (\d), and word boundary (\b). Some characters require escaping because they match regex operators.

Basic examples:

  • . matches any character,
  • \. matches a literal .,
  • * is the Kleene closure which modifies the pattern before it (pattern matches zero or more times),
  • \* matches a literal *.

Parens are often used for embedding regex options or for capturing (grouping) parts of matches. This isn't consistent among regex languages so sometimes \( and \) are used for grouping, other times () defines a group.

Still other characters need to be escaped because they match pattern delimiters, historically /. For this reason Backslashitis is especially painful in web and filesystem code. The use of backslashes as path separators on Windows adds another layer of badness but it is only slightly less bad everywhere else.

Alternate Delimiters

Perl worked around delimiter collision by allowing letting users set alternate pattern delimiters; for example

/my\/path/
Enter fullscreen mode Exit fullscreen mode

can be replaced with

m#my/path#
Enter fullscreen mode Exit fullscreen mode

The leading m stands for match.

Python and PHP use normal strings to hold patterns so there is some overlap between escaped characters in both strings and regexes, e.g. \t for horizontal tab. You may not remove all escaping backslashes by changing delimiters but it is an easy way to clarify slash-heavy regexes.

Character Classes

If you can’t change delimiters, you can use character classes to reduce escaping. [.] can often be used for as \. and [*] for \* so the original example pattern

(\*\.)?
Enter fullscreen mode Exit fullscreen mode

could be rephrased as

([*][.])?
Enter fullscreen mode Exit fullscreen mode

That’s still not great; it looks like ASCII art of a confused robot. It is definitely clearer and it makes some problems more obvious.

Note that typos in character class notation have a very different failure mode than typos in backslash-escaped characters.

Accidentally omitting [ or ] will probably cause the regex to fail to compile. It won’t silently change the meaning as drastically as dropping \ from \. or \* will.

Named Entities

But what about entities like \d or \b which don't correspond to a single character? Often regex languages give these patterns explicit names.

\d is shorthand for [0-9]; in POSIX bracket expression notation it can be written as [:digit:]. [A-Za-z] can be written as [:alpha:]. [:alnum:] is equivalent to [A-Za-z0-9], and [A-Za-z0-9_] can be written as \w or [:word:].

In some regex dialects, \b matches a word boundary, a zero-length position before or after a pattern if it is outside a character class, but a backspace if inside a character class.

The point is that sometimes you are stuck using an escaped entity for which there is no good backslash-free substitute. That's okay; just rephrasing common patterns can improve clarity.

When Should We Apply This?

If there are many escaped slashes in your pattern, see if you can change the pattern delimiters to something other than /.

If you use the same character class more than 2-3 times in a pattern, try to substitute a POSIX bracketed expression or define it as a string, document it, and compose the main regex from these subpatterns.

Keep a cheat sheet on your language's regex syntax handy; the meaning of entities like \s (whitespace) varies among languages. You don't have to memorize all this weirdness; be safe and look it up. Life's too short to spend it debugging regexes.

PHP Example

Let's revisit our $validChars regex. PHP supports POSIX expressions so we can clarify our pattern by using named entities. Recall the named patterns we found earlier:

  • single digits can be matched with [0-9], \d, or [:digit:],
  • single letters can be matched with [A-Za-z] or [:alpha:],
  • a single digit or letter is called an alphanumeric character and can be matched with [A-Za-z0-9], and
  • 'word' characters are defined as [A-Za-z0-9_] and can be matched using \w or [:word:]

Note that [[:alnum:]_] also matches word characters; please do not do this.

The $validChars regex made heavy use of [_a-z\d] and [a-z\d]. We can replace these with [:word:] and [:alnum:] respectively. This is possible because the /i flag at the end of the pattern makes it case-insensitive so [A-Z] is treated just like [a-z].

Also, let's replace the problematic (\*.)? construct with our confused robot regex. Our / delimiter is fine as-is; we would probably want to change it if we were matching paths or URLs.

We could also replace \. with [.]. It is not a big improvement in readability but it is cheap risk reduction so let's swap that in too. We can always change it back if it looks stupid.

Substitution gives us:

$validChars = preg_match(
    "/^                 # Start of string
    (                   # Open group 1
      ([*][.])?         # FIXED: Optional '*.' literal (domain wildcard) as group 2
      [:word:]          # One letter, digit, or underscore
      (                 # Start group 3
        -*              # Any number of hyphens
        [:word:]        # One letter, digit, or underscore
      )*                # Any number of group 3 matches
    )                   # Close group 1; group 1 should match once
    (                   # Open group 4
      [.]               # Literal '.'
      (                 # Open group 5
        [:word:]        # One letter, digit, or underscore
        (               # Open group 6
          -*            # Any number of hyphens
          [:alnum:]     # One letter or digit -- Q: Should this be [:word:]?
        )*              # Any number of group 6 matches
      )                 # Close group 5
    )*                  # Close group 4; any number of group 4 matches
    (                   # Open group 6
      [.]               # Literal '.'
      (                 # Open group 7
        [:word:]        # One letter, digit, or underscore
      )*                # Close group 7; any number of group 7 matches
                        # -- Q: Group 7 can be empty; is that intended?
    )*                  # Close group 6; any number of group 6
    $/ix",              $domain_name);     # End of string; case-insensitive,
Enter fullscreen mode Exit fullscreen mode

We completely eliminated backslashes and made the distinction between [_a-z\d] and [a-z\d] much more obvious. Consistent use of [.] makes domain name separators look a little strange if we are used to \. but the domain wildcard pattern is safer. It is much easier to see the difference between

([*][.])? and ([*].)?

versus

(\*\.)? and (\*.)?

You have to drop both '[' and ']' from around '.' to cause the bug in the original code. Dropping just one would cause the regex to fail to compile and give us a chance to find the original error.

At this point we are mostly left with grouping characters (parens) and have possibly identified a larger subpattern we may want to extract, document, and test separately, i.e. [:word:](-*[:word:])*

We also see some breaks in symmetry; domain parts are variously matched by

  • [:word:](-*[:word:])*,
  • [:word:](-*[:alnum:])*, and
  • ([:word:])*

The reviewer should ask if these differences are intended; it isn't clear from the context or documentation.

Testing

This is a deep and complex topic so I will keep it brief.

Build test cases that are expected to fail. This may be difficult to remember because unit testing is often a constructive affirmation process used to demonstrate expected behavior with expected input. Compare this against adversarial security testing which actively seeks to exploit a function by throwing mountains of bad input at it.

If you can test subpatterns, do so. It is easier to compose test cases for simple patterns than complex ones and you will need fewer tests overall.

Remember that unit tests are a developer convenience, not a substitute for verification and validation (V & V), acceptance tests, integral tests, regression tests, etc. In legacy code, the cost of backfitting unit tests is often prohibitive but you can add unit tests to new code and remediate the rest as time permits.

Is There a Simpler or Clearer Way to Do This?

I programmed Perl for over a decade; I think regexes are awesome. Also, I avoid them if possible because as we see here, regex errors are subtle, easy to make, and difficult to find.

What are some simpler replacements?

Direct Checks

One common regex is /^X/ (that is, does this string start with the specific character X?). It is simple, clear, and probably optimized in your language’s regex engine.

But why not directly compare the first character of the string to X?

Which is easier to explain, this?

$target = 'Xena';
if (preg_match('/^X/', $target)) {
    # Do some warrior princess things
}
Enter fullscreen mode Exit fullscreen mode

Or this?

$target = 'Xena';
if ($target[0] === 'X') {
    # Do some warrior princess things
}
Enter fullscreen mode Exit fullscreen mode

The direct check has several advantages:

  • it does not rely on cryptic regular expression notation,
  • direct checks often use fewer resources (memory, time) than regexes, and
  • direct checks fail less spectacularly than regexes

The regex has two advantages:

  • it is more flexible, and
  • it does not rely on the correct number of '=' signs to work correctly

Question: is this limited to PHP or are there other languages where this is an issue?

If we want to make this a case-insensitive test or search for multiple characters, the direct comparison method gets complex and ugly fast. It is easy enough to convert the direct check to a regex later if we need to.

Note that in PHP there is a risk of the conditional changing meaning if we leave out one or more = signs in the direct check. The consequences of using == instead of === may be small but the consequences of using = instead of == can be massive.

A simple way to guard against this is to swap the positions of the constant and the variable terms in the test, such as

$target = 'Xena';
if ('X' === $target[0]) {
    # Do some warrior princess things
}
Enter fullscreen mode Exit fullscreen mode

It looks strange but it has the same meaning. The only difference is that if there is a typo in the conditional operator, the code will crash rather than silently making an assignment and continuing execution.

Simple Length Checks

The original example code uses two regexes to check the length of the domain name and its parts:

$lengthCheck = preg_match("/^.{1,253}$/", $domain_name);
$labelLengthCheck = preg_match("/^[^\.]{1,63}(\.[^\.]{1,63})*$/", $domain_name);
Enter fullscreen mode Exit fullscreen mode

Let's focus on the first one. Most languages have a string length function. For example, string length is found in PHP using strlen(), Python uses len(), etc.

The regex here seems like seems like overkill. As a reviewer I have to ask if there is a reason string length checks are done with regexes instead of strlen()? There might be; I don't know. That's why I'm asking.

We can easily replace

$lengthCheck = preg_match("/^.{1,253}$/", $domain_name);
Enter fullscreen mode Exit fullscreen mode

with

$domainLength = strlen($domain_name);
$lengthCheck = (($domainLength >= 1) && ($domainLength <= 253));
Enter fullscreen mode Exit fullscreen mode

This can be clarified further if we define a utility function such as

**
 * Determines if $number is between $min and $max
 *
 * @param  integer  $number     The number to test
 * @param  integer  $min        The minimum value in the range
 * @param  integer  $max        The maximum value in the range
 * @param  boolean  $inclusive  Whether the range should be inclusive or not
 * @return boolean              Whether the number was in the range
 */
function in_range($number, $min, $max, $inclusive = FALSE)
{
    if (is_int($number) && is_int($min) && is_int($max))
    {
        return $inclusive
            ? ($number >= $min && $number <= $max)
            : ($number > $min && $number < $max) ;
    }

    return FALSE;
}
Enter fullscreen mode Exit fullscreen mode

(credit: Luke at Stack Overflow https://stackoverflow.com/a/35272658 )

We can reduce the check down to a single line with:

$lengthCheck = in_range(strlen($domain_name), 1, 253, TRUE);
Enter fullscreen mode Exit fullscreen mode

This change is not critical but it is an option.

Complex Length Checks

The second length check is more complex, checking if all the domain name components have a length between 1 and 63 inclusive.

$labelLengthCheck = preg_match("/^[^\.]{1,63}(\.[^\.]{1,63})*$/", $domain_name);
Enter fullscreen mode Exit fullscreen mode

It would take some effort to replace this regex with elemental string functions and probably will not make it any clearer. What we can do is break up the regex and document it.

Maybe do this:

$labelLengthCheck = preg_match(
    "/^              # Starts with
    [^\.]{1,63}      # 1-63 non-dot chars
    (                # any number of groups
        \.           # containing a dot
        [^\.]{1,63}  # followed by 1-63 non-dot chars
    )*               # until the end of the string
    $/x", $domain_name);
Enter fullscreen mode Exit fullscreen mode

Or this

$domain_part = '[^\.]{1,63}';
$labelLengthPattern = "/^${domain_part}(\.${domain_part})*$/";

$domain_name = 'reasonable.example.com';
$labelLengthCheck = preg_match($labelLengthPattern, $domain_name);
Enter fullscreen mode Exit fullscreen mode

Or alternately, since we are already detecting and grouping domain parts in the $validChars regex, we could apply our length checks to selected groups. We might want to use non-capturing subpatterns to restrict the captured matches to just the full domain parts. Non-capturing subpatterns are inside parens just like groups but the opening paren is followed by '?:'. If we wanted our confused robot regex to be non-capturing, we
could change it from:

([*][.])?
Enter fullscreen mode Exit fullscreen mode

to

(?:[*][.])?
Enter fullscreen mode Exit fullscreen mode

Our Confused Robot Regex looks even more confused. The Very Confused Robot Regex.

The optional third argument to preg_match is a list of captured groups. We would really like to capture everything that matches groups 1, 4, and 6 and check if all those matches have 1-63 characters. Note that groups 4 and 6 may match multiple times and we need to check each one. This also leads to weird numbering in the capture array; here it doesn't matter since we have to check all the matched domain parts.

$validChars = preg_match(
    "/^                 # Start of string
    (                   # Open group 1
                        # Next line is non-capturing pattern AAA
      (?:[*][.])?       # FIXED: Optional '*.' literal (domain wildcard)
      [:word:]          # One letter, digit, or underscore
      (?:               # Open non-capturing pattern BBB
        -*              # Any number of hyphens
        [:word:]        # One letter, digit, or underscore
      )*                # Any number of group BBB matches
    )                   # Close group 1; group 1 should match once
    (?:                 # Open non-capturing group CCC
      [.]               # Literal '.'
      (                 # Open capturing group 2
        (?:             # Open non-capturing pattern DDD
          [:word:]      # One letter, digit, or underscore
          (?:           # Open non-capturing pattern EEE
            -*          # Any number of hyphens
            [:alnum:]   # One letter or digit -- Q: Should this be [:word:]?
          )*            # Close group EEE. Any number of group EEE matches
        )               # Close group DDD
      )                 # Close group 2; any number of group 2 matches captured
    )*                  # Close group CCC
    (?:                 # Open non-capturing pattern FFF
      [.]               # Literal '.'
      (                 # Open capturing group 3
        (?:             # Non-capturing group GGG
          [:word:]      # One letter, digit, or underscore
        )*              # Close non-capturing group GGG;
      )                 # Close group 3; There may be any number of group 3 matches
                        # -- Q: Group GGG can be empty; is that intended?
                        # -- Q: Can domain part end in a dot
    )*                  # Close non-capturing group FFF
    $/ix",              $domain_name, $domain_parts); # End of string; case-insensitive
Enter fullscreen mode Exit fullscreen mode

Let's edit out unnecessary non-capturing groups and compress some of the vertical space:

$validChars = preg_match(
    "/^                   # Start of string
    (                     # Open group 1
      (?:[*][.])?         # FIXED: Optional '*.' literal (domain wildcard)
      [:word:]            # One letter, digit, or underscore
      (?:-*[:word:])*     # Optional hyphen then a word character (0 or more)
    )                     # Close group 1; group 1 should match once. Captured.
    (?:                   # Open non-capturing group AAA
      [.]                 # Literal '.'
      (                   # Open capturing group 2
        [:word:]          # One letter, digit, or underscore
        (?:-*[:alnum:])*  # Optional hyphen then an alphanumeric character (0 or more)
      )                   # Close group 2; any number of group 2 matches captured
    )*                    # Close group AAA
    (?:                   # Open non-capturing pattern BBB
      [.]                 # Literal '.'
      ([:word:]*)         # Group 3 captures any number of word characters
                          # -- Q: Group 3 can be empty; is that intended?
                          # -- Q: Can domain part end in a dot
    )*                    # Close non-capturing group BBB
    $/ix",                $domain_name, $domain_parts); # End of string; case-insensitive

// Initially set $labelLengthCheck = 1 (TRUE)
// Iterate over domain parts; if one is too short
// or too long, set $labelLengthCheck = 0 (FALSE) and exit loop
$labelLengthCheck = 1
foreach ($domain_parts as $part) {
    $partLength = strlen($part)
    if (! in_range($partLength, 1, 63, TRUE)) {
        $labelLengthCheck = 0
        break;
    }
}
Enter fullscreen mode Exit fullscreen mode

I'm not sure this is any clearer than the original $labelLengthCheck regex; someone with more recent PHP experience could clean this further. Again, as a reviewer my job is to point out that the $ValidChars regex already returns enough information for free that we can set $labelLengthCheck without using an additional regex. It is not my goal to rewrite the code, just to ask if errors are more obvious or more catastrophic in the regex version or the strlen() version.

The $validChars regex has been simplified in some ways, made more complex in others. There is a legibility cost incurred by using non-capturing patterns.

When Do We Apply This?

When you see a short regex, ask if a regex is really necessary.

Look for clearer alternatives with safer and more obvious failure modes.

Elegance Does Not Guarantee Clarity

Let’s look at the original function again (complete with error), but rearrange it so regexes are stored in strings

function validDomainWildcard($domain_name)
{
    // There has to be either no or at most one "*" at the beginning of a line
    $validCharsPattern       = "/^((\*.)?[_a-z\d](-*[_a-z\d])*)"
                             . "(\.([_a-z\d](-*[a-z\d])*))*"
                             . "(\.([_a-z\d])*)*$/i";
    $lengthCheckPattern      = "/^.{1,253}$/";
    $labelLengthCheckPattern = "/^[^\.]{1,63}(\.[^\.]{1,63})*$/";

    $validChars       = preg_match($validCharsPattern,       $domain_name);
    $lengthCheck      = preg_match($lengthCheckPattern,      $domain_name);
    $labelLengthCheck = preg_match($labelLengthCheckPattern, $domain_name);

    return ( $validChars && $lengthCheck && $labelLengthCheck ); //length of each label
}
Enter fullscreen mode Exit fullscreen mode

Using three regex checks in a row like this has a nice symmetry; a column each for results, regex patterns and targets. At a high level, it is very clear what is intended. At a low level it’s difficult to see if it is implemented correctly.

We can rephrase it as

function in_range($number, $min, $max, $inclusive = FALSE)
{
    if (is_int($number) && is_int($min) && is_int($max))
    {
        return $inclusive
            ? ($number >= $min && $number <= $max)
            : ($number > $min && $number < $max) ;
    }

    return FALSE;
}

function validDomainWildcard($domain_name)
{
    // There has to be either no or at most one "*" at the beginning of a line
    $validChars = preg_match(
        "/^                   # Start of string
        (                     # Open group 1
          (?:[*][.])?         # FIXED: Optional '*.' literal (domain wildcard)
          [:word:]            # One letter, digit, or underscore
          (?:-*[:word:])*     # Optional hyphen then a word character (0 or more)
        )                     # Close group 1; group 1 should match once. Captured.
        (?:                   # Open non-capturing group AAA
          [.]                 # Literal '.'
          (                   # Open capturing group 2
            [:word:]          # One letter, digit, or underscore
            (?:-*[:alnum:])*  # Optional hyphen then an alphanumeric character (0 or more)
          )                   # Close group 2; any number of group 2 matches captured
        )*                    # Close group AAA
        (?:                   # Open non-capturing pattern BBB
          [.]                 # Literal '.'
          ([:word:]*)         # Group 3 captures any number of word characters
                              # -- Q: Group 3 can be empty; is that intended?
                              # -- Q: Can domain part end in a dot
        )*                    # Close non-capturing group BBB
        $/ix",                $domain_name, $domain_parts); # End of string; case-insensitive

    // Check overall domain length is from 1 to 253 characters
    $lengthCheck = in_range(strlen($domain_name), 1, 253, TRUE)

    // Iterate over domain parts; if one is too short
    // or too long, set $labelLengthCheck = 0 (FALSE) and exit loop
    foreach ($domain_parts as $part) {
        $labelLengthCheck = in_range(strlen($part), 1, 63, TRUE)
        if (! $labelLengthCheck)
            break;
        }
    }

    // If any check fails, return false (0)
    return ( $validChars && $lengthCheck && $labelLengthCheck );
}
Enter fullscreen mode Exit fullscreen mode

It is longer and uglier; we lost the clean symmetry and brevity of the original. We also added another function which needs testing, documentation, etc. It's less elegant but the complex regex is documented so overall it's easier to review.

Elegance isn’t bad, it just may not always serve us.

Important code like a security check requires extra clarity; failure modes are serious. Regexes are difficult to read and they are key to maintaining security at this point in the code. And here, clarity at low level is more important than at high level because errors are more likely to occur at the low level of regexes.

You Are Probably Misusing Use Peer Review

Find a colleague not on the project who can behave like a professional, have them read over your code and write down any technical questions they have. Read through their questions and resolve each one by changing the code or adding documentation to make it clearer. That's it.

Any mentoring that happens in this process is incidental. A technical review is for finding errors or weaknesses and making sure the code is clear enough that any member of the team can read and debug it.

Whatever you do, DO NOT sit around a table with a half-dozen team members going line-by-line through someone's code. This wastes a lot of time and can easily turn into a developer-beating party. This is neither productive nor effective and it can turn ugly and unprofessional before anyone realizes it's happening.

Aside: I worked in the nuclear industry under a formal independent technical review process. Every engineer was assigned as a technical reviewer on their peers' work; it was a normal, integral part of our work process whether we were documenting engineering calculations or writing code. I found errors in senior engineers' work. People found errors in mine. Errors happen; it is a team effort to sniff them out before a report is issued or code is released. Peer technical review is one aspect of nuclear work I miss when in other environments. If I could, I wouldn't release important code without it.

You don't have to be working on safety-critical systems to get the benefit of peer technical review. Peer review does not have to be a nitpicky ritual beating; it shouldn't be. If it is, something is very, very wrong with your process or work culture.

The three rules for technical review: be professional, be thorough, and be fast -- in that order.

Conclusion

Regular expressions are very powerful tools but they can be difficult to read. They fail in subtle, unexpected, and spectacular ways.

It is impossible to keep errors out of your regular expressions but there are techniques to make errors more obvious in review and less catastrophic in production. Some are easier to apply than others but none of them require you to become some regex wizard overnight. You may not need every one on every project but it helps to know what techniques are available to give you and your team the best chance of success.

💖 💪 🙅 🚩
arclight
arclight

Posted on August 16, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related