RegEx Basics in Ruby
Joe Christensen
Posted on August 20, 2021
What is Regex?
A regular expression, or RegEx, is a sequence of characters that specifies a search or filtering pattern. Created in the 1950s, RegEx is a way for programmers to quickly and easily search, validate, and filter strings of characters. Using RegEx, a complex, multi-line validation function can be condensed into a much smaller pattern of characters. Because it has proven so useful, most programming languages have some form of RegEx implementation.
As an example, let's look at emails. When someone signs up for a new account on a website, many websites ask for an email. If the email is invalid, say it's missing an @ or .com, the website will show the user an error saying so. But how does the website actually check that the email is invalid? Well, a poor programmer would have to create an incredibly complex validation function that splits up the string the user typed in, checks every part, makes sure the @ and .com are present and in the right positions, and so on. Now imagine having to write a function like this every time you needed user input to be validated. That's where RegEx really shines. These crazy, multi-line functions can be condensed into a small regular expression.
One of the major downsides of RegEx, however, is also due to its ability to condense lots of code into a small expression. RegEx is confusing! A regular expression looks like a random string of characters all mushed together, making it really daunting for new programmers to try and tackle learning it. In this article, we'll go over some of the more useful parts about Ruby's RegEx.
Syntax
There is a lot going on in any given RegEx expression, but there is a method to the madness! We'll be going through some of the basics of Ruby RegEx expressions to hopefully see some of that method. As you're following along, Rubular is a great tool to use. It allows you to write your own RegEx expressions and test them against a given string. It also has a handy list of useful RegEx expressions.
Here is a basic RegEx expression that finds every lowercase letter in the range a-f in a given string:
/[a-f]/
Go ahead and try it out in Rubular!
Delimiters
For most programming languages, Ruby included, a RegEx expression normally starts and ends with the delimiter /. These forward slashes mark the beginning and end of a regular expression, although in some cases more information, such as option flags, immediately follows the second slash.
In our example, you can see the delimiters at the start and end of the expression. This lets the Ruby interpreter know that everything between these two characters is a RegEx expression.
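As a small sketch of the delimiter idea, the snippet below builds the example pattern three different ways and then shows one flag placed after the closing slash. The constant names are made up for illustration; %r{...}, Regexp.new, and the i flag are standard Ruby.

```ruby
# Three ways to write the same pattern (constant names are illustrative only).
SLASH_FORM   = /[a-f]/
PERCENT_FORM = %r{[a-f]}            # handy when the pattern itself contains slashes
BUILT_FORM   = Regexp.new("[a-f]")  # built from a plain string at runtime

# Regexp#== compares the pattern source and its options.
puts SLASH_FORM == PERCENT_FORM
puts SLASH_FORM == BUILT_FORM

# A flag after the closing slash changes how the pattern behaves.
# The i flag makes the match ignore case.
puts "ABC".match?(/[a-f]/)    # "ABC" contains no lowercase a-f
puts "ABC".match?(/[a-f]/i)   # with i, the uppercase "A" counts
```

The slash form is the one you'll see most often; the other two are mainly useful when the pattern contains slashes or has to be assembled from a string.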
Atoms and Quantifiers
Starting from the smallest part, an atom is a single unit within a RegEx expression, such as a literal character, that matches one piece of a given string. An atom can be followed by a quantifier, which says how many times in a row that atom is allowed to match.
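To make that a bit more concrete, here is a rough sketch using the most common quantifiers. In each pattern the atom being quantified is the letter "o". The first_match helper is hypothetical and only exists to keep the example short.

```ruby
# Hypothetical helper: return the first piece of text that matches, or nil.
def first_match(text, pattern)
  text[pattern] # String#[] with a Regexp returns the matched text or nil
end

p first_match("gooool", /go?/)   # ?   means zero or one "o"   => "go"
p first_match("gooool", /go*/)   # *   means zero or more "o"  => "goooo"
p first_match("gooool", /go+/)   # +   means one or more "o"   => "goooo"
p first_match("gooool", /go{2}/) # {2} means exactly two "o"   => "goo"

# The difference between * and + shows up when there is no "o" at all.
p first_match("gl", /go*/)       # => "g"
p first_match("gl", /go+/)       # => nil
```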
Metacharacters
A step up from atoms are metacharacters. These are characters with a special meaning built into RegEx, such as grouping or describing a set of characters, instead of matching themselves literally. They work together with atoms and quantifiers.
In our example expression, the [] pair is made of metacharacters. If you refer back to the Rubular cheat sheet, you can see that [] with characters inside matches any single one of those characters in the string.
As an example, the RegEx /[a]/ will try to find every instance of "a" in the string.
Our expression is a little more special. We're using something called a range, which is a way to tell RegEx that we want any single letter or digit that falls between two characters. Our range [a-f] is asking RegEx to find a, b, c, d, e, or f anywhere in our test string.
This also works for digits. Doing [0-9] is the same as looking for every instance of 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
One final handy thing about this metacharacter is the ability to have multiple ranges.
Looking back to our example, /[a-f]/ only highlights the lowercase characters a-f. What if we wanted every character, both lowercase and uppercase, in that range to be selected? Well, then we would combine two ranges like so: /[a-fA-F]/. As you can see, we've added a new range, A-F, to our original one, using the same set of brackets. This will say to look for any character from the range a-f or from the range A-F.
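If you'd rather try the ranges in Ruby than in Rubular, a quick sketch like the one below should do it. The sample string and constant names are made up for illustration.

```ruby
# Made-up sample text for trying out ranges.
SAMPLE = "Bad Cafe 42"

LOWER_ONLY = SAMPLE.scan(/[a-f]/)     # only lowercase a-f
BOTH_CASES = SAMPLE.scan(/[a-fA-F]/)  # two ranges inside one set of brackets
DIGITS     = SAMPLE.scan(/[0-9]/)     # ranges work for digits too

p LOWER_ONLY # => ["a", "d", "a", "f", "e"]
p BOTH_CASES # => ["B", "a", "d", "C", "a", "f", "e"]
p DIGITS     # => ["4", "2"]
```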
The [] is only one metacharacter, and there are many, many more that are commonly used in RegEx. As this was only a light overview, I strongly suggest experimenting with them on your own using the Rubular cheat sheet.
Regex With Ruby Methods
So, we now know a little about writing RegEx patterns. Now what? How can this actually be used in code? Well, Ruby has a few string methods where the use of RegEx fits perfectly to filter, find, and validate strings.
Scan
The first method is the scan method. Calling .scan() on a string returns an array of every piece of text that matches the given pattern. This is an incredibly useful way to filter through a string, only retaining the data that you want.
As an example, say you have the string "bat cat dot hat mat eat pat sat bit batter hit split " and you want to keep only the three-letter words that end in "at". The easiest way to do this would be using the .scan method like so:
"bat cat dot hat mat eat pat sat bit batter hit split ".scan(/\w+at/)
# Returns ["bat", "cat", "hat", "mat", "eat", "pat", "sat", "bat"]
# If you want it as a string again, just do .join
The .scan method moves through the string from left to right, collecting every piece of text that matches the given RegEx pattern. Our RegEx pattern starts with the metacharacter \w, which matches any word character (letter, number, or underscore). The + that follows is a quantifier meaning "one or more of whatever is to the left of me", so \w+ matches one or more word characters in a row. The last part of our expression, the at, just matches a literal "at". Put together, we get an expression that matches one or more word characters followed by "at".
One important thing to notice is the last element in the returned array. It's "bat", which the .scan method grabbed from the word "batter". Since our expression only requires one or more word characters followed by "at", nothing stops the scan from grabbing just that valid section of a longer word.
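One possible way to stop the scan from reaching into longer words is the word-boundary metacharacter \b. The sketch below is a variation on the pattern above, not the pattern from the example itself, and the constant names are made up.

```ruby
WORDS = "bat cat dot hat mat eat pat sat bit batter hit split "

# The pattern from the example: one or more word characters, then "at".
LOOSE = WORDS.scan(/\w+at/)

# Variation: \b marks a word boundary, and a single \w (no +) allows
# exactly one character before "at", so only whole three-letter words match.
STRICT = WORDS.scan(/\b\wat\b/)

p LOOSE  # => ["bat", "cat", "hat", "mat", "eat", "pat", "sat", "bat"]
p STRICT # => ["bat", "cat", "hat", "mat", "eat", "pat", "sat"]
```

With the boundaries in place, "batter" no longer contributes a stray "bat", because the "t" that follows it is still part of the same word.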
Match
The second method is the match method. Match returns the first part of a string that matches a given RegEx pattern, in the form of a MatchData object. More often than not, the match method is used for input validation. You check to see if a string contains a given RegEx pattern. If the return value is nil, then the given string doesn't pass the given RegEx pattern.
As an example, say you have a phone number input on a website and you want to make sure that people are entering 10 numbers. You would do something like this:
"1234567891".match(/^\d{10}$/)
# Returns a valid MatchData object containing the string
"123".match(/^\d{10}$/)
# Returns nil
As you can see, the first string passed is exactly 10 digits, so an object is returned. The second string isn't, so nil is returned. Using this information, you can easily set up a validation check and response system.
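A validation check along those lines might look roughly like the sketch below. The phone_error helper, the constant names, and the error message are all hypothetical. The second pattern is an assumption on my part rather than part of the example: in Ruby, ^ and $ match at the start and end of each line, while \A and \z match at the start and end of the whole string, which is usually safer for user input that could contain a newline.

```ruby
# Pattern from the example above: exactly 10 digits on a line.
LINE_PATTERN = /^\d{10}$/

# Stricter variant (an assumption, not from the example): whole string only.
WHOLE_STRING_PATTERN = /\A\d{10}\z/

# Hypothetical helper: returns nil when the input passes,
# or an error message to show the user when it doesn't.
def phone_error(input, pattern = WHOLE_STRING_PATTERN)
  return nil if input.match(pattern) # a MatchData object is truthy
  "Please enter exactly 10 digits."
end

p phone_error("1234567891") # => nil
p phone_error("123")        # => "Please enter exactly 10 digits."

# Input with an extra line slips past the line-based pattern only.
sneaky = "1234567891\nextra"
p phone_error(sneaky, LINE_PATTERN) # => nil
p phone_error(sneaky)               # => "Please enter exactly 10 digits."
```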
What exactly is the RegEx pattern that we're using? Well, to start with, the ^ symbol means "start of line" and the $ symbol means "end of line". By having them at the start and the end of the expression, we're saying that a single-line input must match our expression EXACTLY, with no extra characters at the start or end of it. (In Ruby these two anchors work line by line, so for input that might contain newlines, \A and \z anchor to the start and end of the whole string instead.) Moving on, the \d means "any digit", or any number 0-9. It's roughly the same as doing [0-9]. Finally, the {10} is saying "exactly 10 of whatever is to the left of me". Since the \d is to the left, it's saying "exactly 10 \d", or "exactly 10 digits". All in all, our pattern is asking for exactly 10 digits to be input with nothing before or after it.
Grep
Finally, another useful method is grep. Grep, however, comes from Ruby's Enumerable module, so you call it on an array rather than on a string. Calling .grep() on an array of strings compares each value to a given RegEx pattern and returns a new, filtered array containing only the values that match.
Carrying on from our previous example, say you have a large array of phone numbers, and you only want the valid ones. Using .grep(), you can filter through the array and get back only the valid numbers like so:
["1234567891", "3216549871", "3456215435", "12", "65435", "9328456214"].grep(/^\d{10}$/)
# Returns ["1234567891", "3216549871", "3456215435", "9328456214"] (only the valid phone numbers)
As you can see, .grep() is an incredibly useful filtering method for arrays of data.
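Two related tricks are worth a quick sketch: .grep_v() returns the values that do not match, and .grep() accepts a block that transforms each matching value. The constant names and the display format below are made up for illustration.

```ruby
PHONE_NUMBERS = ["1234567891", "3216549871", "3456215435", "12", "65435", "9328456214"]
TEN_DIGITS = /^\d{10}$/

# grep keeps the matches; grep_v keeps everything else.
VALID_NUMBERS   = PHONE_NUMBERS.grep(TEN_DIGITS)
INVALID_NUMBERS = PHONE_NUMBERS.grep_v(TEN_DIGITS)

# With a block, grep filters and transforms in one pass.
# The parentheses in the pattern capture groups that \1, \2, \3 refer back to.
FORMATTED_NUMBERS = PHONE_NUMBERS.grep(TEN_DIGITS) do |number|
  number.sub(/(\d{3})(\d{3})(\d{4})/, '(\1) \2-\3')
end

p VALID_NUMBERS     # => ["1234567891", "3216549871", "3456215435", "9328456214"]
p INVALID_NUMBERS   # => ["12", "65435"]
p FORMATTED_NUMBERS # => ["(123) 456-7891", "(321) 654-9871", "(345) 621-5435", "(932) 845-6214"]
```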
Closing Statements
In the end, RegEx is an incredibly powerful, but daunting, tool. As a programmer, you should try to at least familiarize yourself with the basics of RegEx, but don't worry too much about mastering it. If there is a common validation or filtering expression that you need, like email validation, then someone has probably already created it and shared it online.
As you try to get more familiar with RegEx, be sure to use Rubular, Regexr, or some other online source. Experimenting and messing around with RegEx is by far the best way to learn it. Good luck!