CarlyRaeJepsenStan
Posted on June 18, 2021
In many popular languages such as JavaScript or Python, using regular expressions is simple. Well... mostly simple.
For example, let's say I have this inscrutable JS string:
let a = "123bbasfsdf23asd2021-06-17";
For reasons known only to me, I desire to find numbers that are at most 3 digits long - specifically, I want to find the number "23."
The solution? A regular expression. In JavaScript syntax, \d matches a single digit, and putting {1,3} after it means we search for any run of 1 to 3 adjacent digits.
let regex = /\d{1,3}/gi
let a = "123bbasfsdf23asd2021-06-17"
a.match(regex)
==>["123", "23", "202", "1", "06", "17"]
That was easy! Now let's try to reproduce this in Rust. As Rust is a lower-level language with very different priorities than JavaScript, the process is quite different.
First, we have to use the regex crate, which thankfully uses expression syntax similar to JavaScript's. So to recreate our previous example, we first have to import the crate by both putting it into Cargo.toml under [dependencies] and by beginning our code with a use statement. So our beginning code is fairly simple:
use regex::Regex;

fn main() {
    // Same pattern as the JavaScript example: runs of 1 to 3 digits.
    let regex = Regex::new(r"\d{1,3}").unwrap();
    let a: &str = "123bbasfsdf23asd2021-06-17";
}
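For the Cargo.toml half of the import, a minimal sketch might look like the fragment below. The version requirement "1" is an assumption on my part (it accepts any 1.x release of the crate); use whatever version your project actually needs.

```toml
# Cargo.toml (fragment). The "1" version requirement is an assumption.
[dependencies]
regex = "1"
```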
Some of the Rust code above might not make sense:
The r in front of our regular expression makes it a raw string, so we don't have to double every backslash to make the string valid. Especially useful for backslash-heavy regexes!
The .unwrap() is there because Regex::new returns a Result rather than a bare Regex: compiling a pattern can fail, for example when the pattern comes from user input and has invalid syntax. Result is a close cousin of Option<T>, which basically replaces null or undefined with None. But, because we are hardcoding a pattern we know is valid, we can safely call unwrap, which hands us the compiled Regex (and panics if compilation failed) and allows us to do stuff with it without having to do pattern matching or whatever.
Having written these simple lines, our problems begin. In the documentation for Regex, there's no match function. There are find and captures, but each of these only returns the leftmost first match - insufficient for my aim of finding the second match.
After lots of thinking and help from the Rust community, I finally decided on find_iter, which returns an iterator over all the non-overlapping matches. Does it work? Note that I used the regex directly, skipping assigning it to a variable.
use regex::Regex;

fn main() {
    let a: &str = "123bbasfsdf23asd2021-06-17";
    for cap in Regex::new(r"\d{1,3}").unwrap().find_iter(a) {
        println!("{:#?}", cap);
    }
}
==> Match {
text: "123bbasfsdf23asd2021-06-17",
start: 0,
end: 3,
}
Match {
text: "123bbasfsdf23asd2021-06-17",
start: 11,
end: 13,
}
Match {
text: "123bbasfsdf23asd2021-06-17",
start: 16,
end: 19,
}
Match {
text: "123bbasfsdf23asd2021-06-17",
start: 19,
end: 20,
// --snip
So yes, this works. But it's also practically useless - trying to log cap.start or cap.end throws an error about "private fields," because start and end are methods rather than public fields (cap.start() and cap.end() do compile). After even more thinking, scrolling through documentation and conversing with friendly Discord members, I finally found the as_str method, which returns the matched text.
Side note, putting :? in the curly braces of println! lets you log values that empty braces cannot format (anything that implements Debug but not Display), while :#? pretty-prints them. Read more about it - it's cool!
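To make the side note concrete, here is a small sketch using only the standard library. The helper names compact and pretty are made up for illustration; a slice of strings is a handy test subject because it implements Debug but not Display.

```rust
/// Hypothetical helper: format with the compact Debug specifier.
fn compact(items: &[&str]) -> String {
    format!("{:?}", items)
}

/// Hypothetical helper: format with the pretty-printing Debug specifier.
fn pretty(items: &[&str]) -> String {
    format!("{:#?}", items)
}

fn main() {
    let items = ["123", "23"];
    // Everything on one line.
    println!("{}", compact(&items));
    // One element per line, indented.
    println!("{}", pretty(&items));
    // println!("{}", items) would not compile: arrays do not implement Display.
}
```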
use regex::Regex;

fn main() {
    let a: &str = "123bbasfsdf23asd2021-06-17";
    for cap in Regex::new(r"\d{1,3}").unwrap().find_iter(a) {
        println!("{:#?}", cap.as_str());
    }
}
=> "123"
"23"
"202"
"1"
"06"
"17"
Yay!! Now I have my clean, usable outputs. But how do I turn this into an array?
In Rust, an Iterator is not itself a collection: it is a value that hands out items one at a time, which makes looping and chaining operations easier. Just as vectors (Vec, Rust's growable array type) can be easily turned into iterators through into_iter, iterators can be turned back into vectors through collect.
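The round trip from vector to iterator and back can be sketched with the standard library alone. The function doubled is a made-up example, not something from the regex crate:

```rust
/// Hypothetical helper: double every number by going Vec -> iterator -> Vec.
fn doubled(numbers: Vec<u32>) -> Vec<u32> {
    numbers
        .into_iter()     // Vec<u32> becomes an iterator over its items
        .map(|n| n * 2)  // transform each item as it passes through
        .collect()       // gather the items back into a Vec<u32>
}

fn main() {
    let result = doubled(vec![123, 23, 202]);
    println!("{:?}", result);
}
```

Here collect learns what to build from the function's return type; in a let binding, the same information has to come from a type annotation.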
However, just running Regex::new(r"\d{1,3}").unwrap().find_iter(a).collect() doesn't work - not only do we have to write a type annotation, we also get an error that a Match iterator can't be collected directly into a clean string vector.
The solution? Use the incredible map function (which every developer should know about!) and apply as_str to every item of the iterator. Slap on a type annotation (which collect needs in order to know what to build) and some borrowing (the & in Vec<&str>), and voila:
let my_captures: Vec<&str> = Regex::new(r"\d{1,3}").unwrap().find_iter(a).map(|x| x.as_str()).collect();
println!("{:?}", my_captures);
=> ["123", "23", "202", "1", "06", "17"]
Great! Now use bracket notation: my_captures[1], and you're done! Try it yourself here.
Hopefully this article was helpful to you - I spent over an hour banging my head against documentation and Discord to solve this. Thanks for reading, and good luck!