Golang: Regex

mr_destructive

Meet Rajesh Gor

Posted on March 24, 2023


Introduction

In this 26th part of the series, we will cover the basics of using regular expressions in Golang. The article covers basic operations such as matching, finding, replacing, and extracting sub-matches of a regular expression pattern from a string or from file content. Each concept comes with an example, and similar variants follow the same ideas, so you can explore their syntax on your own.

Regex in golang

So, let's start with what are regular expressions.

Regular expressions are basic building blocks for searching, pattern matching, and manipulating a source of text using computational logic.

This is not the formal definition, but it is how I have understood regular expressions so far. It might not be fully accurate, but it makes sense to me after playing with and exploring them (not fully).

Regular expressions describe patterns using basic operators like concatenation, alternation, and quantifiers. They relate quite closely to the theory of computation, but you don't need to get into much theory to understand how regular expressions work. However, it won't hurt if you are curious and want to explore further.

Some resources to learn the fundamentals of regular expressions:

Regexp package

We will be using the regexp package in the Golang standard library, which provides quick and neat methods for pattern matching and searching. It provides a Regexp type and many methods on top of it to perform matching, finding, replacing, and sub-match extraction in the source text.

The package offers two families of methods for different use cases: one for strings and one for slices of bytes. The byte-slice variants are useful when reading from a buffer, file, etc., while the string variants are handy for searching a simple string.
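As a rough sketch of the two families, the snippet below runs the same search through the string variant (FindString) and the byte-slice variant (Find). The pattern and the helper names firstWordString and firstWordBytes are made up for illustration and are not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// wordPattern is an illustrative pattern: one or more lowercase letters.
var wordPattern = regexp.MustCompile(`[a-z]+`)

// firstWordString uses the string variant of Find.
func firstWordString(src string) string {
	return wordPattern.FindString(src)
}

// firstWordBytes uses the byte-slice variant of Find.
func firstWordBytes(src []byte) []byte {
	return wordPattern.Find(src)
}

func main() {
	fmt.Println(firstWordString("42 gophers"))
	// %s prints the byte slice as text instead of a list of numbers.
	fmt.Printf("%s\n", firstWordBytes([]byte("42 gophers")))
}
```

Both calls find gophers; only the type of the input and of the result differs.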

Matching Patterns

One of the fundamental uses of a regular expression is to check whether a particular pattern is present in a source text. The regexp package provides the Match and MatchString functions, which check a pattern string against a slice of bytes and a string respectively.

Matching Strings

The most basic operation with a regular expression is to check whether the pattern matches a given string.

In Golang, the regexp package provides a few functions to simply match expressions against strings or text. The easiest ones to understand are the MatchString and Match functions.

package main

import (
    "log"
    "regexp"
)

func log_error(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

func main() {

    str := "gophers of the goland"
    is_matching, err := regexp.MatchString("go", str)
    log_error(err)
    log.Println(is_matching)
    // Reuse the variables declared above; a second := would not compile
    is_matching, err = regexp.MatchString("no", str)
    log_error(err)
    log.Println(is_matching)

}
$ go run main.go

true
false

In the above code, we have used the MatchString function, which takes two parameters: the pattern to find and the source string. It returns a boolean that is true if the pattern is present in the source string and false otherwise. It also returns an error if the pattern (the first parameter) cannot be parsed as a valid regular expression.

So, we can clearly see that the string go is present in the string gophers of the goland, and the string no is not a substring.
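To see the error return in action, here is a minimal sketch that feeds MatchString a pattern with an unclosed group. The helper name check and its three result strings are hypothetical and only exist to make the three outcomes visible.

```go
package main

import (
	"fmt"
	"regexp"
)

// check is a hypothetical helper that turns the two return values of
// regexp.MatchString into one of three readable outcomes.
func check(pattern, src string) string {
	found, err := regexp.MatchString(pattern, src)
	if err != nil {
		// The pattern could not be parsed as a regular expression.
		return "bad pattern"
	}
	if found {
		return "match"
	}
	return "no match"
}

func main() {
	src := "gophers of the goland"
	fmt.Println(check("go", src))  // match
	fmt.Println(check("no", src))  // no match
	fmt.Println(check("go(", src)) // bad pattern (the group is never closed)
}
```

Returning the error to the caller instead of calling log.Fatal is a design choice that suits patterns coming from user input.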

We also have the Match function, the byte-slice counterpart of MatchString: it accepts a slice of bytes rather than a string as the source. The first parameter is still a string pattern, but the second parameter is a slice of bytes.

is_matching, err = regexp.Match(`.*land`, []byte("goland is a land of gophers"))
log_error(err)
log.Println(is_matching)
$ go run main.go

true

Since Match accepts a slice of bytes, we can pass it the contents of a file directly as the source text to check against the pattern.

package main

import (
    "log"
    "os"
    "regexp"
)

func log_error(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

func main() {
    file_content, err := os.ReadFile("temp.txt")
    log_error(err)
    // is_matching is new in this program, so it is declared with :=
    is_matching, err := regexp.Match(`memory`, file_content)
    log_error(err)
    log.Println(is_matching)
    is_matching, err = regexp.Match(`text `, file_content)
    log_error(err)
    log.Println(is_matching)
}
# temp.txt

One of the gophers used a slice,
the other one used a arrays.
Some gophers were idle in the memory.
$ go run main.go

true
false

Because os.ReadFile returns the contents of a file as a slice of bytes, we can quickly check for a pattern in a file. In the above example, the first call checks whether memory is present in the text, which it is; the second call checks whether the pattern text (with a trailing space) is present anywhere in the file content, which it is not.

Find Patterns

We can also use regular expressions to search text with the Regexp type provided by Golang's regexp package. We compile a regular expression once and then call methods on it, such as MatchString, Match, and the others we will explore, to match or find a pattern in the text.
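A minimal sketch of that compile-once, use-many-times idea could look like the following. The pattern, the sample lines, and the helper name countMatching are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// goWord is compiled once and reused for every line.
// Illustrative pattern: a word that starts with "go".
var goWord = regexp.MustCompile(`\bgo\w*`)

// countMatching reports how many lines contain a match of goWord.
func countMatching(lines []string) int {
	count := 0
	for _, line := range lines {
		// MatchString here is a method on the compiled *regexp.Regexp.
		if goWord.MatchString(line) {
			count++
		}
	}
	return count
}

func main() {
	lines := []string{"gophers of the goland", "a slice of bytes", "golang"}
	fmt.Println(countMatching(lines)) // 2
}
```

The package-level regexp.MatchString parses its pattern on every call, so a compiled Regexp is the usual choice when the same pattern is checked repeatedly.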

Find String from RegEx

The FindAll method takes a slice of bytes as the source and, as the second parameter, the maximum number of matches to return; -1 returns all matches. It returns a slice of byte slices, each holding the bytes of one matched string in the source text.

exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)
match := exp.FindAll(pincode_file, -1)
log.Println(match)
# pincode.txt

Pincode: 12345-1234
City, 40084
State 123
$ go run main.go

[[49 50 51 52 53 45 49 50 51 52] [52 48 48 56 52]]

In the above example, we have used the Compile function to create a regular expression and FindAll to get all the occurrences of the pattern in the text. We have again read the contents from a file. Here, exp is a regular expression for a postal code, which can be either 5 digits or a 5-digit and 4-digit combination separated by a hyphen or whitespace. We read the file pincode.txt as a slice of bytes and pass it to FindAll, along with an integer for the maximum number of matches to return. A negative number returns all the matches.

We search for pin codes in the file, and the function returns a list of byte slices that match the regular expression exp. The result contains 12345-1234 and 40084, printed as their byte values. The number 123 is not matched because it has fewer than 5 digits.
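The byte values are hard to read, so here is a small sketch that converts each match back to a string and also shows the effect of a positive limit. The file content is inlined as a byte slice to keep the example self-contained, and the helper name pincodes is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same postal code pattern as in the article.
var pincode = regexp.MustCompile(`\b\d{5}(?:[-\s]\d{4})?\b`)

// pincodes returns at most n matches as strings (all matches when n < 0).
func pincodes(content []byte, n int) []string {
	var out []string
	for _, m := range pincode.FindAll(content, n) {
		out = append(out, string(m)) // convert each []byte match to a string
	}
	return out
}

func main() {
	// Inlined stand-in for the contents of pincode.txt.
	content := []byte("Pincode: 12345-1234\nCity, 40084\nState 123\n")
	fmt.Println(pincodes(content, -1)) // [12345-1234 40084]
	fmt.Println(pincodes(content, 1))  // [12345-1234]
}
```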

There is also a string version of FindAll called FindAllString, which takes a string as the source text and returns a slice of strings.

matches := exp.FindAllString(string(pincode_file), -1)
log.Println(matches)
$ go run main.go

[12345-1234 40084]

So, FindAllString returns the matches in the text as a slice of strings.

Find the Index of String from RegEx

We can also get the start and end index of a matched string in the text. The FindIndex method returns the indexes of the first match, and FindAllIndex returns the indexes of all the matches in the file content.

exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)

match_index := exp.FindIndex(pincode_file)
log.Printf("%T\n", match_index)
log.Println(match_index)
$ go run main.go

[]int
[9 19]

The above code uses the FindIndex method to get the indexes of the first match of the regular expression. The output is a single slice of two integers: the start index of the match and the index just past its end. So, here, 9 is the index of the first matched character, and 19 is one past the index of the last matched character (18).

Pincode: 12345-1234
01234567890123456789

The ruler digits repeat after 9, so read the second 0 as 10, the next 1 as 11, and so on.
The match starts at index 9 with the character `1` and ends with the character `4`
at index 18, but the returned end position is 18 + 1 = 19 for ease of slicing.

City, 40084
012345678901

State 123
234567890

The characters at indexes 9 and 18 are the first and the last characters of the match in the source string, and the method returns the end position + 1 as the end index. This makes slicing the source string much easier, as we don't have to adjust the end index by one.

If we want to get the matched text from the source, we can slice it as follows:

exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)

match_index := exp.FindIndex(pincode_file)
if len(match_index) > 0 {

    // Get the slice of the original string from start to end index
    sliced_string := pincode_file[match_index[0]:match_index[1]]
    log.Printf("%q\n", sliced_string)
}
$ go run main.go

"12345-1234"

So, we can access the matched text without calling other functions; simply slicing the original byte slice retrieves the expected result. Making the end index exclusive is the convention throughout the Golang standard library.

Similarly, the FindAllIndex method can be used to get a slice of such index pairs, one for each matched string.

exp, err := regexp.Compile(`\b\d{5}(?:[-\s]\d{4})?\b`)
log_error(err)
pincode_file, err := os.ReadFile("pincode.txt")
log_error(err)

// The second argument is the maximum number of matches; -1 means all
match_indexes := exp.FindAllIndex(pincode_file, -1)
log.Printf("%T\n", match_indexes)
log.Println(match_indexes)
$ go run main.go

[][]int
[[9 19] [26 31]]

The above example returns a slice of index pairs, one for each match of the pattern in the source text. We can iterate over it to get the start and end index of each matched string.
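Iterating over the index pairs might look like the sketch below, which slices the source with each pair to recover the matched text. The file content is inlined as a byte slice, and the helper name locate and its output format are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same postal code pattern as in the article.
var pincode = regexp.MustCompile(`\b\d{5}(?:[-\s]\d{4})?\b`)

// locate describes every match as "start-end:text".
func locate(content []byte) []string {
	var out []string
	for _, loc := range pincode.FindAllIndex(content, -1) {
		start, end := loc[0], loc[1] // end is exclusive
		out = append(out, fmt.Sprintf("%d-%d:%s", start, end, content[start:end]))
	}
	return out
}

func main() {
	// Inlined stand-in for the contents of pincode.txt.
	content := []byte("Pincode: 12345-1234\nCity, 40084\nState 123\n")
	fmt.Println(locate(content)) // [9-19:12345-1234 26-31:40084]
}
```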

Find Submatch

The regexp package also has methods for finding the sub-matches of a given regular expression, that is, the parts of a match captured by the parenthesized groups in the pattern. FindStringSubmatch (or FindSubmatch for bytes) returns a slice of strings (or byte slices) containing the leftmost match followed by the sub-matches within that match. There is also an All version which, instead of returning only the leftmost match, returns all the matches with their corresponding sub-matches.

package main

import (
    "log"
    "regexp"
)

func main() {

    str := "abc@def.com is the mail address of 8th user with id 12"
    exp := regexp.MustCompile(
        `([a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,})` +
            `|(email|mail)|` +
            `(\d{1,3})`,
    )
    match := exp.FindStringSubmatch(str)
    log.Println(match)
    matches := exp.FindAllStringSubmatch(str, -1)
    log.Println(matches)
}
$ go run main.go

[abc@def.com abc@def.com  ]
[[abc@def.com abc@def.com  ] [mail  mail ] [8   8] [12   12]]

The above example uses a regex that matches an email address, the word mail or email, or a number of up to 3 digits. The | between these expressions means that any one of the three alternatives can match. The FindStringSubmatch method takes a string as the source and returns a slice of strings. The first element is the leftmost match of the regular expression in the source, and the subsequent elements are the sub-matches of the capture groups, in order. A group that did not take part in the match yields an empty string, which is why the output ends with blank entries.

We can now go a step further to actually understand sub-matches in a regular expression.

str := "abe21@def.com is the mail address of 8th user with id 124"
exp := regexp.MustCompile(
    `([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-z|A-Z]{2,})` +
        `|(mail|email)` +
        `|(\d{1,3})`,
)

match := exp.FindStringSubmatch(str)
log.Println(match)
matches := exp.FindAllStringSubmatch(str, -1)
log.Println(matches)
$ go run main.go

[abe21@def.com abe21@def.com 21   ]
[[abe21@def.com abe21@def.com 21   ] [mail    mail ] [8     8] [124     124]]

There are a few things to take away from the above example, so let us break it down into small pieces. We have a regex for matching either a mail address, the word mail or email, or a number of up to 3 digits. The regex differs a bit from the previous example in order to show sub-matches within an expression. The sub-matches are found by FindStringSubmatch, which takes a string and returns a slice of strings: the leftmost match in the source string followed by its sub-matches.

First, we need to understand the regex to get a clear idea of the code snippet. The first group is for the email address. Here we use [a-zA-Z] for the username and the domain name, because we don't want the letters part to match digits; the goal of this regex is to pick out numbers inside an email address. So, we have 1 or more letters [a-zA-Z]+, followed by 0 or more digits (\d*), and then 0 or more letters [a-zA-Z]*. The + means 1 or more, * means 0 or more, and \d means a digit. After this, we have the @ as a compulsory character in the mail, followed by the same sequence as in the username, i.e. 1 or more letters, 0 or more digits, and 0 or more letters. Finally, we have a literal dot and the domain extension such as com, as a run of 2 or more characters [a-z|A-Z]{2,}. Note that inside a character class the | is a literal pipe character, so this class also accepts |; [a-zA-Z] would be the stricter choice.

So, the regex accepts an email, with sub-matches for a number in the username and in the domain name.

The FindStringSubmatch function lists the sub-matches for the leftmost (first) match of the regex in the source string. So it finds the string abe21@def.com, which is the email id. The regex ([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-z|A-Z]{2,}) has two nested groups (\d*), one in the username part and one in the domain part. So, the result contains the whole match, then the outer group (which is the email address again), and then the number captured inside the username. In the result [abe21@def.com abe21@def.com 21   ], the trailing empty strings are the second (\d*) group, which matched no digits in the domain name, and the mail|email and number groups, which did not take part in this match.

Similarly, FindAllStringSubmatch returns the list of all the matches in the source string. The other matches don't involve the email group, so each result contains the match and then the same text again in the position of the group that matched: the string mail, the digit 8, and the number 124.
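One way to make use of the group positions is to label each match by the group that captured it. The sketch below does this with the simpler three-group pattern from the first sub-match example; the labels and the helper name classify are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Three top-level groups: an email address, the word mail/email, a short number.
var exp = regexp.MustCompile(
	`([a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,})` +
		`|(email|mail)` +
		`|(\d{1,3})`,
)

// labels[i] names capture group i; index 0 (the whole match) is unused.
var labels = []string{"", "email", "word", "number"}

// classify tags every match with the label of the group that captured it.
func classify(src string) []string {
	var out []string
	for _, m := range exp.FindAllStringSubmatch(src, -1) {
		for i := 1; i < len(m); i++ {
			if m[i] != "" { // groups that did not take part are empty strings
				out = append(out, labels[i]+":"+m[i])
			}
		}
	}
	return out
}

func main() {
	str := "abc@def.com is the mail address of 8th user with id 12"
	fmt.Println(classify(str))
	// [email:abc@def.com word:mail number:8 number:12]
}
```

Checking for an empty string works here because none of the three groups can match empty text; the index-based methods are needed to tell an empty capture from a group that did not take part.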

We can also run this example on a file read as a slice of bytes. The result is then a slice of slices of byte slices instead of a slice of slices of strings.

package main

import (
    "log"
    "os"
    "regexp"
)

func log_error(err error) {
    if err != nil {
        log.Fatal(err)
    }
}

func main() {

    exp := regexp.MustCompile(
        `([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-z|A-Z]{2,})` +
            `|(mail|email)` +
            `|(\d{1,3})`,
    )
    // Read the sample file shown below as a slice of bytes
    email_file, err := os.ReadFile("submatch.txt")
    log_error(err)
    mail_match := exp.FindSubmatch(email_file)
    log.Printf("%T\n", mail_match)
    log.Printf("%s\n", mail_match)
    mail_matches := exp.FindAllSubmatch(email_file, -1)
    log.Printf("%T\n", mail_matches)
    //log.Println(mail_matches)
    log.Printf("%s\n", mail_matches)

}
# submatch.txt

abc21@def.com is the mail address of user id 1234
The email address abe2def.com is of user name abc
a2be.@def.com
Email address: abe@de2f.com, User id: 45
johndoe@example.com
jane.doe123@example.com
janedoe@example.co.uk
john123@example.org
janedoe456@example.net
$ go run main.go

[][]uint8
[abc21@def.com abc21@def.com 21   ]

[][][]uint8
[
    [abc21@def.com abc21@def.com 21   ] [mail    mail ] [123     123] [4     4] 
    [email    email ] [2     2] [2     2] [mail    mail ] 
    [abe@de2f.com abe@de2f.com  2  ] [45     45] 
    [johndoe@example.com johndoe@example.com    ] 
    [doe123@example.com doe123@example.com 123   ] 
    [janedoe@example.co janedoe@example.co    ] 
    [john123@example.org john123@example.org 123   ] 
    [janedoe456@example.net janedoe456@example.net 456   ]
]

As we can see from the dummy email ids and some random text, we are able to match the email ids and the numbers in them as sub-matches. The result is a [][]uint8 in the case of FindSubmatch and a [][][]uint8 in the case of FindAllSubmatch. The behaviour is the same for bytes as it was for strings.

Find Submatch Index

We also have FindSubmatchIndex and FindAllSubmatchIndex, along with their string variants, to get the index(es) of the match and its sub-matches in the source.

str := "abe21@def.com is the mail address of 8th user with id 124"
exp := regexp.MustCompile(
    `([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-z|A-Z]{2,})` +
        `|(mail|email)` +
        `|(\d{1,3})`,
)

match := exp.FindStringSubmatch(str)
match_index := exp.FindStringSubmatchIndex(str)
log.Println(match)
log.Println(match_index)
$ go run main.go

[abe21@def.com abe21@def.com 21   ]
[0 13 0 13 3 5 9 9 -1 -1 -1 -1]

This returns a flat list of index pairs; read it as [(0, 13), (0, 13), (3, 5), (9, 9), (-1, -1), (-1, -1)]. Each pair holds the start and end index of the match or of one sub-match in the source string. The first pair is the whole match: it starts at the first character, 0 -> a, and ends at index 13, which is the space right after the match, since the end index is exclusive. Next is the outer group for the email address, with the same indexes. Then 3 and 5 indicate the number sub-match 21 in abe21@def.com; it starts at 3 and ends at 5, so it occupies the characters at indexes 3 and 4. The domain name has no number in the source string, so its group is returned as an empty match located at index 9, the end of the domain name def.

We used (\d*), which matches 0 or more digits, so for the domain name it matched zero digits; hence we get the pair (9, 9), an empty string at the position where the digits would have been. The remaining pairs are for the mail|email group and the lone-number group, which take no part in the first match and are therefore reported as -1.
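The difference between an empty sub-match like (9, 9) and a group that took no part in the match (-1, -1) can be made explicit by walking the flat list two entries at a time. The sketch below uses a pattern modelled on the one described in the prose, with [a-zA-Z] for the domain extension; the helper name describeGroups and the words "empty" and "unused" are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Modelled on the article's pattern: the email group contains two nested
// (\d*) groups, followed by the mail|email group and the number group.
var exp = regexp.MustCompile(
	`([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-zA-Z]{2,})` +
		`|(mail|email)` +
		`|(\d{1,3})`,
)

// describeGroups reports, for the first match, the text of every group,
// "empty" for a zero-length capture, and "unused" for a (-1, -1) pair.
func describeGroups(src string) []string {
	loc := exp.FindStringSubmatchIndex(src) // nil when nothing matches
	var out []string
	for g := 0; 2*g < len(loc); g++ {
		start, end := loc[2*g], loc[2*g+1]
		switch {
		case start < 0:
			out = append(out, "unused")
		case start == end:
			out = append(out, "empty")
		default:
			out = append(out, src[start:end])
		}
	}
	return out
}

func main() {
	str := "abe21@def.com is the mail address of 8th user with id 124"
	fmt.Println(describeGroups(str))
	// [abe21@def.com abe21@def.com 21 empty unused unused]
}
```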

str := "abe21@def.com is the mail address of 8th user with id 124"
exp := regexp.MustCompile(
    `([a-zA-Z]+(\d*)[a-zA-Z]*@[a-zA-Z]+(\d*)[a-zA-Z]*\.[a-z|A-Z]{2,})` +
        `|(mail|email)` +
        `|(\d{1,3})`,
)

match := exp.FindAllStringSubmatch(str, -1)
match_indexes := exp.FindAllStringSubmatchIndex(str, -1)
// Print both results; an unused variable would not compile
log.Println(match)
log.Println(match_indexes)
$ go run main.go

[[abe21@def.com abe21@def.com 21   ] [mail    mail ] [8     8] [124     124]]
[
    [0 13 0 13 3 5 9 9 -1 -1 -1 -1]
    [21 25 -1 -1 -1 -1 -1 -1 21 25 -1 -1]
    [37 38 -1 -1 -1 -1 -1 -1 -1 -1 37 38]
    [54 57 -1 -1 -1 -1 -1 -1 -1 -1 54 57]
]

Similarly, we have used FindAllStringSubmatchIndex to get a slice of index slices ([][]int), one for each match of the expression in the source string.

The first element is the same as in the previous example. The next match in the source string is mail, which starts at index 21 and ends at index 24, reported as 24 + 1 = 25 by the exclusive-end convention. Similarly, the number 8 is matched at index 37 and the number 124 at index 54. The groups that take no part in these matches are reported as -1.

The same works for the byte-slice (uint8) variants, FindSubmatchIndex and FindAllSubmatchIndex.

Replace Patterns

The Replace family of methods is used for replacing the matched patterns.

ReplaceAll and ReplaceAllLiteral, along with their string variants, replace every match of a regular expression in the source text with a replacement.

Let's start with a simple example of strings.

package main

import (
    "log"
    "regexp"
)

func main() {

    str := "Gophy gophers go in the golang grounds"
    exp := regexp.MustCompile(`(Go|go)`)
    replaced_str := exp.ReplaceAllString(str, "to")
    log.Println("Original: ", str)
    log.Println("Replaced: ", replaced_str)

}

The above code replaces every match of the regex in the string with the replacement string. The regex in the exp variable matches Go or go. The ReplaceAllString method takes two string arguments, the source and the replacement, and returns the string after replacing the matches.

$ go run replace.go

Original:  Gophy gophers go in the golang grounds
Replaced:  tophy tophers to in the tolang grounds

So, this replaces every occurrence of Go or go in the source string with to.

There is a special syntax available in the replacement string which expands to the text captured by the groups of the regular expression. Since the regular expression can contain several parenthesized capture groups, we can refer to them as $1, $2, and so on, to keep or rearrange the text that a particular group matched.

str = "Gophy gophers go in the golang grounds"
exp2 := regexp.MustCompile(`(Go|go)|(phers)|(rounds)`)
log.Println(exp2.ReplaceAllString(str, "hop"))
log.Println(exp2.ReplaceAllString(str, "$1"))
log.Println(exp2.ReplaceAllString(str, "$2"))
log.Println(exp2.ReplaceAllString(str, "$3"))
$ go run replace.go

hopphy hophop hop in the hoplang ghop
Gophy go go in the golang g
phy phers  in the lang g
phy   in the lang grounds

In the above code, the matches of the regular expression are replaced using its capture groups. The regex (Go|go)|(phers)|(rounds) has three groups: Go|go, phers, and rounds. The groups are separated by the | operator, so a match is any one of the three alternatives.

In the first statement, we replace the regex with hop, and as you can see, all the matches are replaced with the replacement string. For instance, the word gophers is replaced by hophop, because go and phers are matched separately and each is replaced.

In the second statement, we replace each match with $1, i.e. the text captured by the first group Go|go. The $1 is expanded for every match: where the first group took part in the match, the matched text is kept, and where it did not, $1 is empty and the match is removed. So, the Go in Gophy is matched by Go|go and is kept as is. However, in grounds, the match is rounds, which was captured by the third group and not by $1, so it is replaced with an empty string and only g remains.

In the third print statement, each match is replaced with the second group $2, i.e. phers. So, only the text matched by phers is kept, and the other matches are substituted by an empty string. So, the Go in Gophy doesn't match phers and is removed; in gophers, the phers part is kept as it is, but the go part is removed.

Similarly, in the fourth print statement, each match is replaced with the third group $3, i.e. rounds. Only the text matched by the third group is kept, and the other matches are substituted with an empty string. So, grounds remains as it is because rounds is captured by the third group.

In short, every match is replaced, and the $n references put the captured text of chosen groups back into the result. This can be used to fine-tune the replacement or to access specific groups of a regular expression.
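Group references can also reorder the captured text, and the ${1} form is worth knowing when a reference is directly followed by a letter, digit, or underscore: $1_user is read as a group named 1_user, which does not exist and expands to nothing. The sketch below illustrates both points; the pattern and the helper names swap, suffixWrong, and suffixRight are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative pattern: group 1 is the username, group 2 the domain name.
var mail = regexp.MustCompile(`(\w+)@(\w+)\.com`)

// swap exchanges the username and the domain name.
func swap(src string) string {
	return mail.ReplaceAllString(src, "${2}@${1}.com")
}

// suffixWrong shows the pitfall: "$1_user" refers to a group named "1_user",
// which does not exist, so the whole match is replaced by an empty string.
func suffixWrong(src string) string {
	return mail.ReplaceAllString(src, "$1_user")
}

// suffixRight uses braces to end the group name before the literal text.
func suffixRight(src string) string {
	return mail.ReplaceAllString(src, "${1}_user")
}

func main() {
	fmt.Println(swap("abc@def.com"))              // def@abc.com
	fmt.Printf("%q\n", suffixWrong("abc@def.com")) // ""
	fmt.Println(suffixRight("abc@def.com"))       // abc_user
}
```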

str = "Gophy gophers go in the golang grounds"
exp2 := regexp.MustCompile(`(Go|go)|(phers)|(rounds)`)
log.Println(exp2.ReplaceAllString(str, "$1$2"))

str = "Gophy gophers go in the golang cophers grounds"
log.Println(exp2.ReplaceAllString(str, "$1$3"))
$ go run replace.go

Gophy gophers go in the golang g
Gophy go go in the golang co grounds

We can also concatenate group references and mix them with literal text in the replacement string, as we have done in the example with $1$2. Each match is replaced with whatever the first group and the second group captured, so text matched by Go|go or phers is kept. We can see the result of the first statement, Gophy gophers go in the golang g: every match of Go|go or phers is kept as it is (substituted with the same string), whereas the rounds in grounds was captured by neither $1 nor $2 and is hence replaced with an empty string.

Similarly, in the second statement, $1$3 keeps the text matched by Go|go or rounds. So, the phers in gophers and cophers is captured by neither $1 nor $3 and is replaced by an empty string. However, the matches in Gophy, golang, and grounds are captured by one of those groups and are replaced by the matched text (which is the same string).

If we want to avoid the expansion of $1 and similar references as in the previous example, we can use ReplaceAllLiteral or ReplaceAllLiteralString, which use the replacement text exactly as it is.

str := "Gophy gophers go in the golang cophers grounds"
exp2 := regexp.MustCompile(`(Go|go)|(phers)|(rounds)`)
log.Println(exp2.ReplaceAllLiteralString(str, "$1"))
$ go run replace.go

$1phy $1$1 $1 in the $1lang co$1 g$1

As we can see, the $1 is not expanded and is inserted as it is in place of every match. The Go in Gophy is replaced with $1 to give $1phy, and similarly for the rest of the matches.

That's it for the 26th part of the series. All the source code for the examples is linked on GitHub in the 100 days of Golang repository.

Conclusion

This article covered the fundamentals of using the regexp package for working with regular expressions in Golang. We explored the Regexp type and the various methods available on it. Through examples and simple snippets, we walked through various ways of matching, finding, and replacing patterns.

So, hopefully, you found the article useful. If you have any queries, questions, or feedback, or if you spot mistakes in the article, you can let me know in the discussion or on my social handles. Happy Coding :)
