Regex for lazy developers

Regular expressions are a text-processing system built on a special pattern notation. Simply put, they give programmers an easy way to process and validate strings. They are also a good illustration of the DRY (Don't Repeat Yourself) principle: in almost all languages that support them, the same pattern can be reused with little or no change.

The pattern used in backend and frontend applications can be identical, which saves teams from implementing the same feature twice. Regular expressions are also well suited to large or complex strings, where they make related problems simpler and quicker to solve.

Over a cup of tea in the kitchen or on a team Zoom call, you can often hear that regular expressions are difficult to learn, write and read, and that in general they were invented by terrible people 😈. But is that so? Let's figure it out.

Note:
This article is for those who consider regular expressions complex and incomprehensible, and for those who think that basic knowledge is quite enough for work.

What does it look like
The following are examples in 6 programming languages for determining a Russian phone number.

In this example, you can immediately notice the first feature of regular expressions: the pattern is identical in every language, so you can easily share it with a team that writes in another programming language. The ability to quickly share patterns between teams saves time on developing and implementing features.
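As a rough illustration, here is what such a check might look like in TypeScript. The pattern is an assumption made for this sketch (a "+7" or "8" prefix followed by exactly ten digits, no separators), and isRussianPhone is a hypothetical helper name.

```typescript
// Hypothetical pattern: "+7" or "8", then exactly ten digits, no separators.
const russianPhone = /^(?:\+7|8)\d{10}$/;

// Hypothetical helper that wraps the pattern.
function isRussianPhone(input: string): boolean {
  return russianPhone.test(input);
}

console.log(isRussianPhone("+79261234567")); // true
console.log(isRussianPhone("89261234567"));  // true
console.log(isRussianPhone("+7926123456"));  // false: only nine digits after the prefix
```

The same pattern text could be pasted into most other regex engines as-is, which is the point of the section.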

History of appearance
Regular expressions first appeared in scientific papers on automata theory and the theory of formal languages in the mid-1950s. Stephen Cole Kleene is credited as the person who first introduced the concept of regular expressions.

The principles and ideas laid down in his work were put into practice by Ken Thompson, who built them into the QED and ed text editors. From there they spread to Unix tools such as grep and, later, to languages such as Perl.

In practice, regular expressions are available as a built-in feature or a standard library module of your programming language and are used to search and manipulate text.

The Regular Expression Language is not a full-fledged programming language, although, like other languages, it has its own syntax and commands.

What programming languages support them?
The list is quite large. Here are just a few of them:

C
C#
C++
Cobol
Delphi
F#
Go
Groovy
Haskell
Java
JavaScript
Julia
Kotlin
MATLAB
Objective-C
PHP
Perl
Python
R
Ruby
Rust
Scala
Swift
Visual Basic
Visual Basic .NET
...

Capabilities

Match input data against a pattern.
Search and modify input data by pattern.
Return the first or all matches from the input string.
Return, along with the overall match, the substrings captured by named and unnamed groups.
Replace characters, words and phrases in the input string.
And most importantly, write once and use everywhere.
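The capabilities above can be sketched in a few lines of TypeScript. The sample text and the date pattern are made up for illustration.

```typescript
const text = "Order 15 shipped on 2024-03-07, order 16 on 2024-03-09.";

// 1. Pattern matching: does the input contain a date at all?
const hasDate: boolean = /\d{4}-\d{2}-\d{2}/.test(text);

// 2. First result only.
const firstDate = text.match(/\d{4}-\d{2}-\d{2}/)?.[0];

// 3. All results (the g flag keeps searching after the first match).
const allDates = text.match(/\d{4}-\d{2}-\d{2}/g) ?? [];

// 4. Named and unnamed groups returned along with the overall match.
const parts = /(?<year>\d{4})-(\d{2})-(\d{2})/.exec(text);
const year = parts?.groups?.year; // named group
const month = parts?.[2];         // unnamed group, addressed by position

// 5. Replacement by pattern.
const masked = text.replace(/\d{4}-\d{2}-\d{2}/g, "<date>");

console.log(hasDate, firstDate, allDates, year, month, masked);
```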

Where will it be useful?

Search and replace code by pattern in an IDE (VS Code, Rider, CLion, VS).
Validation of strings against a pattern (for example, a file extension).
Validation of fields on the frontend (e-mail, phone number and others).
Validation of request and response data.
Validating huge strings and extracting the necessary pieces of text without spending a lot of time.
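For instance, the file extension check from the list might look roughly like this. The list of allowed extensions and the isImage helper are assumptions made for this sketch.

```typescript
// Hypothetical whitelist of image extensions, checked case-insensitively.
// The pattern is anchored with $ so the extension must be at the very end.
const imageFile = /\.(?:png|jpe?g|gif)$/i;

// Hypothetical helper that wraps the pattern.
const isImage = (fileName: string): boolean => imageFile.test(fileName);

console.log(isImage("photo.JPG"));      // true
console.log(isImage("archive.tar.gz")); // false
console.log(isImage("pic.png.exe"));    // false: ".png" is not at the end
```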

Basic Syntax
^ - start of string (the input string must begin with whatever follows this symbol in the pattern; not suitable if the text you are looking for may appear in the middle of the input).

$ - end of string (everything before this symbol in the pattern must be the last thing in the input string, with nothing after it; not suitable if you want to return several results from one input string).

* - the preceding condition may occur zero or more times.

+ - the preceding condition must occur one or more times.

[a-z] - a set of allowed characters; here, any single lowercase Latin letter (a or b or c ... or x or y or z).

[0-9] - a set of allowed characters; here, any single digit (0 or 1 or 2 ... or 8 or 9).

. - any single character (by default, except a line break).

\ - escapes the special character that follows it, so that it is treated as a literal (for example, \. matches a dot).

| – logical OR (either the condition to the left or the condition to the right of this operator must be fulfilled)
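A small TypeScript sketch that exercises each of the symbols above. The inputs are arbitrary; the third element of every row is the result the pattern is expected to give.

```typescript
// [pattern, input, expected result of pattern.test(input)]
const checks: Array<[RegExp, string, boolean]> = [
  [/^ab/, "abc", true],         // ^ : the string must start with "ab"
  [/^ab/, "cab", false],
  [/bc$/, "abc", true],         // $ : the string must end with "bc"
  [/^ab*c$/, "ac", true],       // * : "b" zero or more times
  [/^ab+c$/, "ac", false],      // + : "b" at least once
  [/^ab+c$/, "abbbc", true],
  [/^[a-z]+$/, "hello", true],  // [a-z] : lowercase Latin letters only
  [/^[a-z]+$/, "Hello", false],
  [/^[0-9]+$/, "2024", true],   // [0-9] : digits only
  [/^a.c$/, "a-c", true],       // . : any single character
  [/^a\.c$/, "a-c", false],     // \. : a literal dot
  [/^cat$|^dog$/, "dog", true], // | : left or right condition
];

const allAsExpected = checks.every(
  ([pattern, input, expected]) => pattern.test(input) === expected
);
console.log(allAsExpected); // true
```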

Syntax Simplification

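\d ≡ [0-9] - any digit from 0 to 9
\D ≡ [^0-9] - any character except digits
\w ≡ [a-zA-Z0-9_] - any Latin letter, any digit and “_”
\W ≡ [^a-zA-Z0-9_] – any character except Latin letters, digits and “_”
\s ≡ [ \t\r\n\f\v] - any whitespace character (space, tab, line break and so on)
\S ≡ [^ \t\r\n\f\v] - any character except whitespace

Note: these equivalences describe ASCII-oriented engines and modes. In some engines (for example, .NET without RegexOptions.ECMAScript) \d and \w also match non-Latin digits and letters.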
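A quick way to see what the shorthand classes really match is to run them over a string that contains a tab and a line break. The sample string is arbitrary.

```typescript
const sample = "id_42\tok\n";

const digits = sample.match(/\d/g) ?? [];     // single digits
const wordRuns = sample.match(/\w+/g) ?? [];  // runs of letters, digits and "_"
const whitespace = sample.match(/\s/g) ?? []; // tab and line break, not only spaces
const nonWord = sample.match(/\W/g) ?? [];    // everything that \w rejects

console.log(digits);            // [ '4', '2' ]
console.log(wordRuns);          // [ 'id_42', 'ok' ]
console.log(whitespace.length); // 2
console.log(nonWord.length);    // 2
```

Note that \s matched the tab and the line break even though the string contains no ordinary space.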

Basic Syntax Explanation

Condition Length
In addition to validating values in a string, we can also specify how many times in a row a condition must be met. There are three ways to set the length:
{3} – exactly this number of repetitions
{3,5} - min. and max. number of repetitions
{3,} – mandatory min. number and unlimited max. number

Note: The condition [0-9] can be replaced with the abbreviation \d
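The three length forms can be compared side by side. This sketch uses \d as the condition; the variable names are made up for illustration.

```typescript
const exactlyThree = /^\d{3}$/;   // {3}   : exactly three digits
const threeToFive = /^\d{3,5}$/;  // {3,5} : from three to five digits
const atLeastThree = /^\d{3,}$/;  // {3,}  : three digits or more

for (const input of ["12", "123", "12345", "123456"]) {
  console.log(
    input,
    exactlyThree.test(input),
    threeToFive.test(input),
    atLeastThree.test(input)
  );
}
// 12     false false false
// 123    true  true  true
// 12345  false true  true
// 123456 false false true
```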

Working with groups (Advanced)
It's going to be a little more tricky, so get ready.

() - creates an unnamed (numbered) capturing group: the part of the input matched inside the brackets is stored as a separate substring
(?'nameGroup') or (?<nameGroup>) – creates a named capturing group (the (?'nameGroup') form is not supported in JavaScript)
\k<nameGroup> - backreference to a named group: matches the same text that the group “nameGroup” captured earlier in the pattern. Note that it repeats the captured text, not the group's condition, so (?<d>\d)\k<d> matches 11 or 22 but not 12.
(?:) - non-capturing group: encloses a condition in logical brackets without naming it and without creating a substring
(?<=) - positive lookbehind: checks that the condition inside the brackets matches immediately before the current position, and does not include it in the selection.
(?!) - negative lookahead: checks that the condition inside the brackets does not match immediately after the current position, and does not include it in the selection.
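Each of the group constructs above can be tried on a tiny input. This is a sketch with made-up patterns; named groups and lookbehind assume an ES2018-capable JavaScript engine.

```typescript
// Named groups plus a backreference: \k<quote> must match the same
// character that the group "quote" captured, so the quotes have to pair up.
const quoted = /^(?<quote>['"])(?<body>.*)\k<quote>$/;
const body = quoted.exec(`"hello"`)?.groups?.body; // "hello"
const mismatched = quoted.test(`"hello'`);         // false

// Non-capturing group: (?:...) groups the alternatives but is not numbered,
// so the host ends up in group 1.
const scheme = /^(?:http|https):\/\/(\w+)/;
const host = scheme.exec("https://example.com")?.[1]; // "example"

// Positive lookbehind: digits that come right after "$"; the "$" is not
// part of the match.
const price = "cost: $42".match(/(?<=\$)\d+/)?.[0]; // "42"

// Negative lookahead: "foo" that is not followed by "bar".
const notBar = /foo(?!bar)/;
const okBaz = notBar.test("foobaz"); // true
const okBar = notBar.test("foobar"); // false

console.log(body, mismatched, host, price, okBaz, okBar);
```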

Real life example
Once, at work, I had to parse data from a QR code printed on receipts for purchases and returns of various goods and services. The first version of the parser was written on the C# backend. It was about 150 lines of code and did not take into account some features of various fiscal registrars (devices that print receipts and send data to the Federal Tax Service). Changing this function meant carefully reading and checking every line of code. Later, the number of variants grew, and the same parsing was needed on the frontend for validation. So it was decided to rewrite the parser using regular expressions, both to simplify it and to make it quick and easy to port to another programming language.

Goals:

Validate the input against a pattern.
Extract the date and the amount of the purchase for further use in the system.
Check that the field “n” is always equal to 1 (0 - return, 1 - purchase)

Here is an example of input data:
t=20181125T142800&s=850.12&fn=8715000100011785&i=86841&fp=1440325305&n=1
Regular expression for parsing such data:
^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$
Code example (C#):

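using System;
using System.Text.RegularExpressions;

public static class QrCodeParser
{
   // Built once and reused by every call. ECMAScript mode is kept from the
   // original version so that behavior stays close to the JavaScript one.
   private static readonly Regex Pattern = new Regex(
       @"^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$",
       RegexOptions.ECMAScript);

   public static (string date, string sum) ParseQRCode(string data)
   {
       var matchResult = Pattern.Match(data);
       if (!matchResult.Success)
           throw new ArgumentException("Invalid qrCode");
       // Date and Sum are mandatory parts of the pattern; the checks below are defensive.
       var dateGroup = matchResult.Groups["Date"];
       if (!dateGroup.Success)
           throw new ArgumentException("Invalid qrCode, Date group not found");
       var sumGroup = matchResult.Groups["Sum"];
       if (!sumGroup.Success)
           throw new ArgumentException("Invalid qrCode, Sum group not found");

       return (dateGroup.Value, sumGroup.Value);
   }
}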

Code example (TypeScript):
This version reports errors through exceptions, but it could equally return false or null.

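// A regex literal is used on purpose: inside a string literal "\." would
// silently turn into ".", which matches any character instead of a dot.
const qrPattern = /^t=(?<Date>[0-9-:T]+)&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?)&fn=[0-9]+&i=[0-9]+&fp=[0-9]+&n=1$/;

const parseQRCode = (data: string): { date: string; sum: string } => {
  const matchResult = qrPattern.exec(data);
  if (!matchResult)
      throw new Error("Invalid qrCode");
  // Named groups are exposed on matchResult.groups (ES2018 and later).
  const dateGroup = matchResult.groups?.Date;
  if (!dateGroup)
      throw new Error("Invalid qrCode, Date group not found");
  const sumGroup = matchResult.groups?.Sum;
  if (!sumGroup)
      throw new Error("Invalid qrCode, Sum group not found");
  return { date: dateGroup, sum: sumGroup };
};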

At the output, we get two values:

Date - a field indicating the date and time of purchase (it remains only to parse it and turn it into a date object)
Sum - purchase amount
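Turning the Date value into a date object can itself be done with a regular expression. This is a minimal sketch, assuming the value always has the compact form from the sample input (YYYYMMDDTHHmmss) and should be read as UTC; parseCompactDate is a hypothetical helper name, and both assumptions would need checking against real receipts.

```typescript
// Converts a compact "YYYYMMDDTHHmmss" value into a Date.
// Assumptions: the seconds are always present and the time is treated as UTC.
function parseCompactDate(value: string): Date {
  const m = /^(\d{4})(\d{2})(\d{2})T(\d{2})(\d{2})(\d{2})$/.exec(value);
  if (!m) throw new Error(`Unexpected date format: ${value}`);
  // Groups 1..6 hold year, month, day, hour, minute and second.
  const [year, month, day, hour, minute, second] = m.slice(1).map(Number);
  // Months are zero-based in the JavaScript Date API.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

console.log(parseCompactDate("20181125T142800").toISOString());
// 2018-11-25T14:28:00.000Z
```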

Now let's analyze the pattern in more detail:

^ - denotes the beginning of the string
t=(?<Date>[0-9-:T]+) – required characters t= followed by the named group Date: any of the characters 0 to 9, -, : or T, in one or more instances
&s=(?<Sum>[0-9]+(?:\.[0-9]{2})?) – required characters &s= followed by the named group Sum:
1. &s= – required sequence of characters & and s and =
2. [0-9]+ – digits 0 to 9 in one or more instances
3. (?:\.[0-9]{2})? – optional non-capturing group: a . symbol followed by exactly 2 digits
&fn=[0-9]+ – required characters &fn= followed by [0-9]+ (any digit from 0 to 9 in one or more instances)
&i=[0-9]+ – required characters &i= followed by [0-9]+ (any digit from 0 to 9 in one or more instances)
&fp=[0-9]+ – required characters &fp= followed by [0-9]+ (any digit from 0 to 9 in one or more instances)
&n=1 – required characters &n=1
$ - denotes the end of the string

The problem of working with non-Latin alphabets
When you need to work with the entire Latin alphabet, just write [a-zA-Z]. Many people think that when working with Cyrillic it is enough to write [а-яА-Я]. It looks logical, but at some point you will notice that it sometimes does not work correctly. The problem is that the range [а-я] does not include the letter “ё”, so you need to change your pattern from [а-яА-Я] to [а-яёА-ЯЁ] for the code to take this letter into account. This problem is not unique to Cyrillic: it is also relevant for Greek, Turkish, Chinese and a number of other languages. Be careful when writing a pattern that has to work with these languages.
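The effect is easy to reproduce. The sketch below also shows one possible alternative, a Unicode property escape with the u flag, which is an addition for illustration and assumes an ES2018-capable engine.

```typescript
const naive = /^[а-яА-Я]+$/;    // misses "ё" and "Ё"
const withYo = /^[а-яёА-ЯЁ]+$/; // lists them explicitly

// Possible alternative: match any letter of the Cyrillic script.
const anyCyrillic = /^\p{Script=Cyrillic}+$/u;

console.log(naive.test("елка"));       // true
console.log(naive.test("ёлка"));       // false: "ё" lies outside the range а-я
console.log(withYo.test("ёлка"));      // true
console.log(anyCyrillic.test("ёлка")); // true
```

The property escape is broader than the explicit set: it also accepts Cyrillic letters from other alphabets, which may or may not be what you want.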

JS regex flags

global (g) - does not stop searching after finding the first match.
multi line (m) - ^ and $ match at the start and end of every line, not only of the whole string.
insensitive (i) - case-insensitive search (a ≡ A)
sticky (y) - matches only at the position stored in the lastIndex property of the regex, instead of searching the rest of the string (not supported in IE)
unicode (u) - treats the pattern and the input as Unicode code points rather than UTF-16 code units, and enables \u{...} and \p{...} (not supported in IE)
single line (s) - in this mode, the symbol . also matches line breaks (part of ES2018, supported by current versions of the major browsers)
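A short sketch of what each flag changes, using one small multi-line string. The inputs are arbitrary; the s and u flags assume an ES2018-capable engine.

```typescript
const lines = "Cat\ncat\nDOG";

// g: all matches instead of the first one; i: ignore case.
const gCount = (lines.match(/cat/g) ?? []).length;   // 1 (only the lowercase "cat")
const giCount = (lines.match(/cat/gi) ?? []).length; // 2

// m: ^ and $ apply to every line.
const wholeOnly = lines.match(/^\w+$/g) ?? []; // []: \w+ cannot cross a line break
const perLine = lines.match(/^\w+$/gm) ?? [];  // ["Cat", "cat", "DOG"]

// s: the dot also matches a line break.
const dotPlain = /Cat.cat/.test(lines); // false
const dotAll = /Cat.cat/s.test(lines);  // true

// u: the dot consumes a whole code point, not half of a surrogate pair.
const emojiPlain = /^.$/.test("😈");    // false
const emojiUnicode = /^.$/u.test("😈"); // true

// y: match only at lastIndex, no scanning forward.
const sticky = /cat/y;
sticky.lastIndex = 4;               // position of the lowercase "cat"
const atFour = sticky.test(lines);  // true
sticky.lastIndex = 5;
const atFive = sticky.test(lines);  // false: "cat" does not start at index 5

console.log(gCount, giCount, wholeOnly, perLine, dotPlain, dotAll, emojiPlain, emojiUnicode, atFour, atFive);
```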

Additional regex settings in C#
RegexOptions is passed as an additional parameter to the constructor of the Regex class. It can also be passed to the static Match and Matches methods.

None - set by default.
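IgnoreCase (inline: (?i)) - matches case-insensitively.
Multiline (inline: (?m)) - ^ and $ match at the start and end of every line (lines are separated by \n), not only of the whole input.
ExplicitCapture (inline: (?n)) - adds only named groups to the result; unnamed brackets () behave like (?:).
Compiled - compiles the pattern to IL code: matching is faster, but creating the Regex object is slower. Most useful for a static instance that is reused many times.
Singleline (inline: (?s)) - the . sign matches every character, including \n.
IgnorePatternWhitespace (inline: (?x)) - ignores unescaped whitespace in the pattern and allows # comments (whitespace inside [] is still taken literally).
RightToLeft - search from right to left.
ECMAScript - JS-like matching behavior, while the group syntax stays the same as in .NET. Can be combined only with IgnoreCase, Multiline and Compiled.
CultureInvariant - ignores cultural differences in language when comparing characters (for example, together with IgnoreCase).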

Good Practices and Optimization Tips

The fewer capturing groups, the faster the execution. Avoid them, or use non-capturing groups (?:), if you don't need the captured substring.
When using shorthand classes (\d, \w and others), make sure they match exactly what you need. Better check twice.
If you use the same regular expression often, create it once globally, thereby reducing the amount of duplicate code.
Almost every language offers a way to compile regular expressions, which often optimizes them and speeds up their execution. BUT enable compilation only after the pattern itself has been validated, so that you pay the compilation cost for a pattern you will actually keep.
Try to reduce the number of escaped special characters (\); escaping can slow down execution in many programming languages.
Regular expressions support Unicode character codes (for example, \u0451 for “ё”). At some points, this will improve performance, but reduce readability. If you decide to use them, be sure that the team will approve your decision and that it's worth it.
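The "create it once" tip might look like this in TypeScript. EMAIL_LIKE and isEmailLike are hypothetical names, and the pattern is a deliberately loose sketch, not a full e-mail validator. The second half shows a caveat worth knowing when sharing a regex in JavaScript: with the g (or y) flag the object remembers its position between test calls.

```typescript
// Built once at module level and reused by every call.
// Hypothetical, deliberately loose pattern: something@something.something
const EMAIL_LIKE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function isEmailLike(value: string): boolean {
  return EMAIL_LIKE.test(value);
}

console.log(isEmailLike("dev@example.com")); // true
console.log(isEmailLike("dev@example"));     // false

// Caveat: a shared regex with the g flag is stateful.
const sharedGlobal = /a/g;
const firstCall = sharedGlobal.test("a");  // true, lastIndex becomes 1
const secondCall = sharedGlobal.test("a"); // false, the search resumed at index 1
console.log(firstCall, secondCall);
```

For shared validation patterns it is therefore safer to leave the g flag off, as EMAIL_LIKE does.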

Conclusion
Regular expressions only seem complicated. In fact, the features they provide open up many opportunities and allow you to simplify and speed up the work of everyone from Junior to Senior / Lead.
If you have any questions, please feel free to comment, and we can discuss them there.

P.S. Don't forget one important rule: "Programming is still cool." and have a nice working day


Ilya Ermoshin
