Regular expressions in Rust

CarlyRaeJepsenStan

Posted on June 18, 2021

In many popular languages such as JavaScript or Python, using regular expressions is simple. Well... mostly simple.

For example, let's say I have this inscrutable JS string:
let a = "123bbasfsdf23asd2021-06-17";

For reasons known only to me, I desire to find numbers that are at most 3 digits long - specifically, I want to find the number "23."

The solution? A regular expression. In JavaScript syntax, \d matches a single digit, and putting {1,3} after it means we search for any run of 1 to 3 adjacent digits.

let regex = /\d{1,3}/gi
let a = "123bbasfsdf23asd2021-06-17"
a.match(regex)

     ==>["123", "23", "202", "1", "06", "17"]

That was easy! Now let's try to reproduce this in Rust. Rust is a lower-level language with very different priorities from JavaScript, so the process is quite different.

First, we have to use the regex crate, which thankfully uses expression syntax similar to JavaScript's. To recreate our previous example, we add the crate to Cargo.toml under [dependencies] and begin our code with a use statement. The starting code is fairly simple:

use regex::Regex;

fn main() {
    let regex = Regex::new(r"\d{1,3}").unwrap();
    let a: &str = "123bbasfsdf23asd2021-06-17";
}

Some of this might not make sense:

The r in front of our regular expression makes it a raw string literal, so backslashes don't have to be escaped with more backslashes. Especially useful for backslash-heavy regexes!

The .unwrap() comes from Result<T, E>, the type Rust uses for operations that can fail. Regex::new returns a Result because a pattern can be invalid (for example, user input with an unclosed parenthesis), in which case compiling it fails with an error. But, because we are hardcoding a pattern we know is valid, we can safely call unwrap, which hands us the compiled Regex (and panics if compilation failed) and allows us to do stuff with it without having to do pattern matching or whatever.

Having written these simple lines, our problems begin. In the documentation for Regex, there's no match function. There are find and captures, but each of these only returns the leftmost first match - insufficient for my aim of finding the second match.

After lots of thinking and help from the Rust community, I finally decided on find_iter, which returns an Iterator composed of all the matches. Does it work? Note that I used the regex directly, skipping assigning it to a variable.

use regex::Regex;

fn main() {
    let a: &str = "123bbasfsdf23asd2021-06-17";
    for cap in Regex::new(r"\d{1,3}").unwrap().find_iter(a) {
        println!("{:#?}", cap);
    }
}


==> Match {
    text: "123bbasfsdf23asd2021-06-17",
    start: 0,
    end: 3,
}
Match {
    text: "123bbasfsdf23asd2021-06-17",
    start: 11,
    end: 13,
}
Match {
    text: "123bbasfsdf23asd2021-06-17",
    start: 16,
    end: 19,
}
Match {
    text: "123bbasfsdf23asd2021-06-17",
    start: 19,
    end: 20,
// --snip 

So yes, this works. But it's not very convenient yet - trying to read cap.start or cap.end fails to compile with an error about "private fields." (Match exposes those values through the methods start() and end() instead.) After even more thinking, scrolling through documentation and conversing with friendly Discord members, I finally found the as_str method, which returns the matched text.

Side note: putting :? in the curly braces of println! formats a value through its Debug implementation, which lets you log many things that empty braces cannot format, while :#? pretty-prints it. Read more about it - it's cool!
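As a quick illustration of the two Debug styles, here is a sketch using only the standard library. The helper names compact and pretty are invented for this example.

```rust
/// Hypothetical helper: single-line Debug output.
fn compact(items: &[&str]) -> String {
    format!("{:?}", items)
}

/// Hypothetical helper: multi-line, indented Debug output.
fn pretty(items: &[&str]) -> String {
    format!("{:#?}", items)
}

fn main() {
    let items = ["123", "23"];
    // Prints everything on one line.
    println!("{}", compact(&items));
    // Prints one element per line, indented.
    println!("{}", pretty(&items));
}
```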

use regex::Regex;

fn main() {
    let a: &str = "123bbasfsdf23asd2021-06-17";
    for cap in Regex::new(r"\d{1,3}").unwrap().find_iter(a) {
        println!("{:#?}", cap.as_str());
    }
}

=> "123"
"23"
"202"
"1"
"06"
"17"

Yay!! Now I have my clean, usable outputs. But how do I turn this into an array?

In Rust, an Iterator is not an array itself but a lazy sequence that produces its values one at a time, which makes looping and chaining operations easier. Just as vectors (Rust's growable arrays) can be easily turned into iterators through into_iter, iterators can be turned back into vectors through collect.
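The round trip from vector to iterator and back can be sketched with the standard library alone. The helper name lengths and the sample data are made up for illustration.

```rust
/// Hypothetical helper: vector -> iterator -> vector round trip.
fn lengths(words: Vec<&str>) -> Vec<usize> {
    words
        .into_iter() // consume the vector, yielding one item at a time
        .map(|w| w.len()) // describe a transformation of each item (lazy)
        .collect() // run the iterator and gather the results into a Vec
}

fn main() {
    let words = vec!["123", "23", "1"];
    println!("{:?}", lengths(words)); // [3, 2, 1]
}
```

Nothing is computed until collect runs; map only wraps the iterator in another iterator.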

However, just running Regex::new(r"\d{1,3}").unwrap().find_iter(a).collect() doesn't work - not only do we have to write a type annotation, we also get an error that an iterator of Match values can't be collected into a clean string vector.

The solution? Use the incredible map function (which every developer should know about!) and apply as_str to every item of the iterator. Slap on a type annotation (which collect needs in order to know what to build), and voila:

let my_captures: Vec<&str> = Regex::new(r"\d{1,3}").unwrap().find_iter(a).map(|x| x.as_str()).collect();
println!("{:?}", my_captures);
     => ["123", "23", "202", "1", "06", "17"]

Great! Now use bracket notation: my_captures[1], and you're done! Try it yourself here.

Hopefully this article was helpful to you - I spent over an hour banging my head against documentation and Discord to solve this. Thanks for reading, and good luck!
