Rust: Working with Regular Expressions

ssivakumar

Sivakumar

Posted on May 2, 2023

Regular expressions are a way to describe sets of strings using syntactic rules. Like many programming languages, Rust has great support for regular expressions.

A regular expression processor compiles a regular expression, which is written in a formal grammar, and then uses it to examine a text string.

Crate regex

This crate provides a library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but it lacks a few features such as look-around and backreferences. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text.

Let us look at some examples.

Parsing Integers

With regex, it is easy to check whether a given input is a valid integer. The pattern is anchored with ^ and $ so that the whole input must consist of digits; an unanchored \d would match any input that merely contains a digit.

    use regex::Regex;

    // ^\d+$ requires the entire input to be one or more digits.
    let re_int = Regex::new(r"^\d+$").unwrap();
    println!("Is 100 a valid integer? {}", re_int.is_match("100"));
    println!("Is HelloWorld a valid integer? {}", re_int.is_match("HelloWorld"));

This code produces the following output:

Is 100 a valid integer? true
Is HelloWorld a valid integer? false

Parsing String

To check whether a given input contains only ASCII letters, use the [[:alpha:]] class. Note the closing colon before the inner bracket; without it the pattern is not the ASCII class. The pattern is anchored so that every character of the input must be a letter.

    use regex::Regex;

    // ^[[:alpha:]]+$ requires the entire input to be ASCII letters.
    let re_str = Regex::new(r"^[[:alpha:]]+$").unwrap();
    println!("Is 100 alphabetic? {}", re_str.is_match("100"));
    println!("Is HelloWorld alphabetic? {}", re_str.is_match("HelloWorld"));

This code produces the following output:

Is 100 alphabetic? false
Is HelloWorld alphabetic? true

Parsing Date

A common use case is verifying that a user-supplied date is in the expected format. Note that this pattern checks only the YYYY-MM-DD shape of the input, not whether the month and day are in range.

    use regex::Regex;

    // Four digits, a hyphen, two digits, a hyphen, two digits.
    let re_ymd = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
    println!("Is 2023-01-01 matching against YYYY-MM-DD? {}", re_ymd.is_match("2023-01-01"));
    println!("Is 01-01-2023 matching against YYYY-MM-DD? {}", re_ymd.is_match("01-01-2023"));

This code produces the following output:

Is 2023-01-01 matching against YYYY-MM-DD? true
Is 01-01-2023 matching against YYYY-MM-DD? false

Supported Expressions

The regex crate supports a wide variety of expression patterns. Let us explore them here.

Matching one character
.             any character except new line (includes new line with s flag)
\d            digit (\p{Nd})
\D            not digit
\pX           Unicode character class identified by a one-letter name
\p{Greek}     Unicode character class (general category or script)
\PX           Negated Unicode character class identified by a one-letter name
\P{Greek}     negated Unicode character class (general category or script)

Character classes
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
[\[\]]        Escaping in character classes (matching [ or ])

Composites
xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)

Repetitions
x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x

Empty matches
^     the beginning of text (or start-of-line with multi-line mode)
$     the end of text (or end-of-line with multi-line mode)
\A    only the beginning of text (even with multi-line mode enabled)
\z    only the end of text (even with multi-line mode enabled)
\b    a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B    not a Unicode word boundary

Grouping and flags
(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (names must be alpha-numeric)
(?<name>exp)   named (also numbered) capture group (names must be alpha-numeric)
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)

Escape sequences
\*          literal *, works for any punctuation character: \.+*?()|[]{}^$
\a          bell (\x07)
\f          form feed (\x0C)
\t          horizontal tab
\n          new line
\r          carriage return
\v          vertical tab (\x0B)
\123        octal character code (up to three digits) (when enabled)
\x7F        hex character code (exactly two digits)
\x{10FFFF}  any hex character code corresponding to a Unicode code point
\u007F      hex character code (exactly four digits)
\u{7F}      any hex character code corresponding to a Unicode code point
\U0000007F  hex character code (exactly eight digits)
\U{7F}      any hex character code corresponding to a Unicode code point

ASCII character classes
[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])

I hope this blog post gives you a closer look at regular expressions and how to use them in Rust. You can get the code samples from this link.

Please feel free to share your comments if any

Happy reading!!!
