Rust: Working with Regular Expressions
Sivakumar
Posted on May 2, 2023
Regular expressions are a way to describe sets of strings using syntactic rules. Like many programming languages, Rust has good support for regular expressions.
A regular expression processor takes a pattern written in that formal grammar and uses it to examine a text string.
Crate regex
This crate provides a library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but it lacks a few features such as look-around and backreferences. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text.
Let us look at some examples.
Parsing Integers
With regex, it is easy to check whether a given input is a valid integer. The pattern needs the ^ and $ anchors and a + repetition, because a bare \d only checks that the input contains a digit somewhere.
use regex::Regex;
let re_int = Regex::new(r"^\d+$").unwrap();
println!("Is 100 a valid integer? {}", re_int.is_match("100"));
println!("Is HelloWorld a valid integer? {}", re_int.is_match("HelloWorld"));
This code generates the output below:
Is 100 a valid integer? true
Is HelloWorld a valid integer? false
Parsing String
To check whether a given input contains only alphabetic characters, use the [[:alpha:]] class (note the colon before the closing brackets), anchored and repeated so that the whole input must match.
use regex::Regex;
let re_str = Regex::new(r"^[[:alpha:]]+$").unwrap();
println!("Is 100 alphabetic? {}", re_str.is_match("100"));
println!("Is HelloWorld alphabetic? {}", re_str.is_match("HelloWorld"));
This code generates the output below:
Is 100 alphabetic? false
Is HelloWorld alphabetic? true
Parsing Date
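This is a quite common use case: verifying that a user-supplied date has the expected format. Note that the pattern checks only the shape of the input (digits and dashes in the right places), not whether the date actually exists.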
let re_ymd = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
println!("Is 2023-01-01 matching against YYYY-MM-DD? {}", re_ymd.is_match("2023-01-01"));
println!("Is 01-01-2023 matching against YYYY-MM-DD? {}", re_ymd.is_match("01-01-2023"));
This code generates the output below:
Is 2023-01-01 matching against YYYY-MM-DD? true
Is 01-01-2023 matching against YYYY-MM-DD? false
Supported Expressions
The regex crate supports a wide variety of expression patterns. Let us explore them here.
Matching one character
. any character except new line (includes new line with s flag)
\d digit (\p{Nd})
\D not digit
\pX Unicode character class identified by a one-letter name
\p{Greek} Unicode character class (general category or script)
\PX Negated Unicode character class identified by a one-letter name
\P{Greek} negated Unicode character class (general category or script)
Character classes
[xyz] A character class matching either x, y or z (union).
[^xyz] A character class matching any character except x, y and z.
[a-z] A character class matching any character in range a-z.
[[:alpha:]] ASCII character class ([A-Za-z])
[[:^alpha:]] Negated ASCII character class ([^A-Za-z])
[x[^xyz]] Nested/grouping character class (matching any character except y and z)
[a-y&&xyz] Intersection (matching x or y)
[0-9&&[^4]] Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4] Direct subtraction (matching 0-9 except 4)
[a-g~~b-h] Symmetric difference (matching `a` and `h` only)
[\[\]] Escaping in character classes (matching [ or ])
Composites
xy concatenation (x followed by y)
x|y alternation (x or y, prefer x)
Repetitions
x* zero or more of x (greedy)
x+ one or more of x (greedy)
x? zero or one of x (greedy)
x*? zero or more of x (ungreedy/lazy)
x+? one or more of x (ungreedy/lazy)
x?? zero or one of x (ungreedy/lazy)
x{n,m} at least n x and at most m x (greedy)
x{n,} at least n x (greedy)
x{n} exactly n x
x{n,m}? at least n x and at most m x (ungreedy/lazy)
x{n,}? at least n x (ungreedy/lazy)
x{n}? exactly n x
Empty matches
^ the beginning of text (or start-of-line with multi-line mode)
$ the end of text (or end-of-line with multi-line mode)
\A only the beginning of text (even with multi-line mode enabled)
\z only the end of text (even with multi-line mode enabled)
\b a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B not a Unicode word boundary
Grouping and flags
(exp) numbered capture group (indexed by opening parenthesis)
(?P<name>exp) named (also numbered) capture group (names must be alpha-numeric)
(?<name>exp) named (also numbered) capture group (names must be alpha-numeric)
(?:exp) non-capturing group
(?flags) set flags within current group
(?flags:exp) set flags for exp (non-capturing)
Escape sequences
\* literal *, works for any punctuation character: \.+*?()|[]{}^$
\a bell (\x07)
\f form feed (\x0C)
\t horizontal tab
\n new line
\r carriage return
\v vertical tab (\x0B)
\123 octal character code (up to three digits) (when enabled)
\x7F hex character code (exactly two digits)
\x{10FFFF} any hex character code corresponding to a Unicode code point
\u007F hex character code (exactly four digits)
\u{7F} any hex character code corresponding to a Unicode code point
\U0000007F hex character code (exactly eight digits)
\U{7F} any hex character code corresponding to a Unicode code point
ASCII character classes
[[:alnum:]] alphanumeric ([0-9A-Za-z])
[[:alpha:]] alphabetic ([A-Za-z])
[[:ascii:]] ASCII ([\x00-\x7F])
[[:blank:]] blank ([\t ])
[[:cntrl:]] control ([\x00-\x1F\x7F])
[[:digit:]] digits ([0-9])
[[:graph:]] graphical ([!-~])
[[:lower:]] lower case ([a-z])
[[:print:]] printable ([ -~])
[[:punct:]] punctuation ([!-/:-@\[-`{-~])
[[:space:]] whitespace ([\t\n\v\f\r ])
[[:upper:]] upper case ([A-Z])
[[:word:]] word characters ([0-9A-Za-z_])
[[:xdigit:]] hex digit ([0-9A-Fa-f])
I hope this blog post gives you a closer look at regular expressions and how to use them in Rust. You can get the code samples from this link
Please feel free to share your comments, if any.
Happy reading!!!