Specifying a pattern

lizmat

Elizabeth Mattijsen

Posted on November 2, 2022

Specifying a pattern

This blog post will discuss the types of patterns you can specify with rak as part 3 of the It's time to rak! blog series.

The --type argument

To start right off the bat: the --type argument indicates how a given pattern (specified as a string on the command line) should be interpreted. Having to specify --type=foo all of the time, can be bothersome. So there are a number of shortcuts that you can use to make your life easier. These will be discussed at the end.

The results in all these examples are based on the existence of a file called "twenty" in the current directory, which contains the words "one", "two", "three" ... "twenty", each on a separate line. Because the contents of the line matches the line number, it hopefully makes understanding the example output in this post easier.

Note that lines will be matched without the end-of-line by default.

--type=contains

The "contains" value on --type indicates that a line will match if they contain the given literal pattern anywhere on the line.

# Look for "ve" anywhere on any line in file "twenty"
$ rak --type=contains ve twenty
twenty
5:fi๐ฏ๐ž
7:se๐ฏ๐žn
11:ele๐ฏ๐žn
12:twel๐ฏ๐ž
17:se๐ฏ๐žnteen
Enter fullscreen mode Exit fullscreen mode

Note that first the filename is shown, and then the actual matches with the pattern (๐ฏ๐ž) highlighted in the result. This is on by default, if rak is being run by a human (aka, there's actually someone at the keyboard). Or more technically, if STDIN is connected to a TTY.

Also note that the matched lines are prefixed with their line number (which happens to be the same as the textual version of the number).

Unfortunately, markdown (in which this blog post is written) does not allow bolding of characters inside code blocks. So I've had to resort to using the equivalent "๐›๐จ๐ฅ๐" codepoints (versus "normal" and "bold"), which may render not quite as I intended. Hopefully they will get the message across anyway.

--type=words

The "words" value on --type indicates that a line will match if it contains the literal pattern as a word. A word here is defined as a string consisting of alphanumeric characters, with non-alphanumeric characters (or the absence of a character) on both ends.

# Look for "six" as a word on any line in file "twenty"
$ rak --type=words six twenty
twenty
6:๐ฌ๐ข๐ฑ
Enter fullscreen mode Exit fullscreen mode

In this case "six" was acceptable, because there are no characters before or after on the line. So in that sense, it is a bad example of using "words" as a type of search. Note that "sixteen" was not matched, because there was no word-boundary at "x".

--type=starts-with

The "starts-with" value on --type indicates that a line will match if it starts with the literal pattern.

# Look for "seven" at the start of all lines in file "twenty"
$ rak --type=starts-with seven twenty
twenty
7:๐ฌ๐ž๐ฏ๐ž๐ง
17:๐ฌ๐ž๐ฏ๐ž๐งteen
Enter fullscreen mode Exit fullscreen mode

Note that in this case the line with "seventeen" was matched, because it starts with "seven".

--type=ends-with

The "ends-with" value on --type indicates that a line will match if it ends with the literal pattern.

# Look for "ve" at the end of all lines in file "twenty"
$ rak --type=ends-with ve twenty
twenty
5:fi๐ฏ๐ž
12:twel๐ฏ๐ž
Enter fullscreen mode Exit fullscreen mode

Note that in this case the lines with "seven", "eleven" and "seventeen" were not matched, because they didn't end with "ve", even though they had the literal string in them.

--type=equal

The "equal" value on --type indicates that a line will be accepted if it matches in its entirety with the literal pattern.

# Look for "eight" as the whole line in file "twenty"
$ rak --type=equal eight twenty
twenty
8:๐ž๐ข๐ ๐ก๐ญ
Enter fullscreen mode Exit fullscreen mode

Note that in this case the line with "eighteen" was not matched, because it had additional characters after "eight".

--type=regex

The "regex" value on --type indicates that a line will be accepted if it matches the literal pattern as a regex.

# Look for strings starting with "e" and ending with "t" and any
# number of characters between them (.*), in file twenty
$ rak --type=regex 'e.*t' twenty
twenty
8:๐ž๐ข๐ ๐ก๐ญ
17:s๐ž๐ฏ๐ž๐ง๐ญeen
18:๐ž๐ข๐ ๐ก๐ญeen
19:nin๐ž๐ญeen
20:tw๐ž๐ง๐ญy
Enter fullscreen mode Exit fullscreen mode

Note that because of shell issues, you will most likely need to quote this string, because . and * have special meanings in most shells.

We will get back to what you can do with Raku regexes at the end of this blog post.

Ignoring case and accents

All of the above values of the --type arguments adhere to the --ignorecase, --ignoremark and --smartcase flags.

--ignorecase

If this flag is specified, then any matching will be done without regards to case. In other words "E" will match "e", and vice-versa.

# Look for strings containing y or Y
$ rak --type=contains --ignorecase Y twenty
twenty
20:twent๐ฒ
Enter fullscreen mode Exit fullscreen mode

--ignoremark

If this flag is specified, then any matching will be done by comparing base characters, and ignore additional marks such as combining accents. In other words "รฉ" will match "e", and vice-versa.

# Look for strings matching "u" having any additional marks
$ rak --type=contains --ignoremark รบ twenty
twenty
4:fo๐ฎr
Enter fullscreen mode Exit fullscreen mode

--smartcase

If this flag is specified, then the pattern is checked for uppercase characters. If none are found, then it acts as if --ignorecase has been specified. Otherwise it does not perform any action.

--type=code

The "code" value on --type instructs rak to convert the pattern into Raku code and run the code for each line. If it returns True, it will consider the line matched. If it returns False, or Empty, or Nil it will consider the line did not match.

If it returns a Slip, it will produce the elements of the slip as separate items, and consider the line matched.

If it returns anything else, it will produce that value (and consider the line matched).

# Look for lines containing "ve" and show them in uppercase
$ rak '.uc if .contains("ve")' twenty --type=code
twenty
5:FIVE
7:SEVEN
11:ELEVEN
12:TWELVE
17:SEVENTEEN
Enter fullscreen mode Exit fullscreen mode

Note that no highlighting can occur when the pattern is code, as it cannot indicate where there was a match (if any).

# Look for lines that contain "y" and produce all of its letters separately
$ rak '.comb.Slip if .contains("y")' twenty --type=code
twenty
20:t
20:w
20:e
20:n
20:t
20:y
Enter fullscreen mode Exit fullscreen mode

Note that returning a Slip, will produce multiple items for the same line.

# Convert all occurrences of "teen" into "10"
$ rak '.subst("teen",10)' twenty --type=code
twenty
1:one
2:two
3:three
4:four
5:five
6:six
7:seven
8:eight
9:nine
10:ten
11:eleven
12:twelve
13:thir10
14:four10
15:fif10
16:six10
17:seven10
18:eigh10
19:nine10
20:twenty
Enter fullscreen mode Exit fullscreen mode

The .subst for "substitute" method is a very handy tool to convert a target into something else, because it will return the original string if no match was found. The target in .subst can also be a regex.

Code patterns are typically used with options such as --unique and --modify-files and similar.

# Produce the frequencies of the letters in file "twenty"
$ rak 'slip .comb' twenty --type=code --frequencies
33:e
17:n
15:t
9:i
5:f
5:v
4:h
4:o
4:r
4:s
3:w
2:g
2:l
2:u
2:x
1:y
Enter fullscreen mode Exit fullscreen mode

--type=auto

This is the default type if there is no --type specified. It indicates that the given pattern(s) should be inspected for shortcuts, and act accordingly.

Shortcuts

These shortcuts are only available if --type=auto has been (implicitely) specified.

string

If there are no special characters in the pattern, then a line will match if the pattern occurs anywhere on a line (as if --type=contains was specified).

# Show the lines containing "ne"
$ rak ne twenty
twenty
1:o๐ง๐ž
9:ni๐ง๐ž
19:ni๐ง๐žteen
Enter fullscreen mode Exit fullscreen mode

ยงstring

If the pattern starts with "ยง" (A7 SECTION SIGN), then the rest of the pattern must match as a word in a line. So rak ยงstring is the same as rak string --type=words.

Depending on your keyboard and OS, the ยง character may be hard to type: if you're going to use this feature often, it probably makes sense to create a keyboard shortcut for it.

# Show the lines that contain "six" as a word
$ rak ยงsix twenty
twenty
6:๐ฌ๐ข๐ฑ
Enter fullscreen mode Exit fullscreen mode

^string

If the pattern starts with "^" (5E CIRCUMFLEX ACCENT), then the rest of the pattern must match at the start of a line. So rak ^string is the same as rak string --type=starts-with.

# Show the lines starting with "e"
$ rak ^e twenty
twenty
8:๐žight
11:๐žleven
18:๐žighteen
Enter fullscreen mode Exit fullscreen mode

string$

If the pattern ends with "$" (24 DOLLAR SIGN), then the rest of the pattern must match at the end of a line. So rak string$ is the same as rak string --type=ends-with.

# Show the lines ending with "o"
$ rak o$ twenty
twenty
2:tw๐จ
Enter fullscreen mode Exit fullscreen mode

^string$

If the pattern starts with "^" (5E CIRCUMFLEX ACCENT) and ends with "$" (24 DOLLAR SIGN), then the rest of the pattern must match the line in its entirety. So rak ^string$ is the same as rak string --type=equal.

# Show all the lines that consist of "seven"
$ rak ^seven$ twenty
twenty
7:๐ฌ๐ž๐ฏ๐ž๐ง
Enter fullscreen mode Exit fullscreen mode

/ regex /

If the pattern starts and ends with "/" (2F SOLIDUS), then the rest of the pattern will be interpreted as a regex to match lines with. So rak /string/ is the same as rak string --type=regex. The pattern generally will need to be quoted as the shell may interpret the pattern in unwanted ways otherwise.

# Show all the lines with exactly on character between "e" and "t"
$ rak /e.t/ twenty
twenty
17:sev๐ž๐ง๐ญeen
20:tw๐ž๐ง๐ญy
Enter fullscreen mode Exit fullscreen mode

{ code }

If the pattern starts with "{" (7B LEFT CURLY BRACKET) and ends with "}" (7D RIGHT CURLY BRACKET), then the rest of the pattern will be interpreted as Raku code. So rak '{string}' is the same as rak string --type=code. The pattern generally will need to be quoted as the shell may interpret the pattern in unwanted ways otherwise.

# Show all lines starting with "t" titlecased
$ rak '{.tc if /^ t /}' twenty
twenty
2:Two
3:Three
10:Ten
12:Twelve
13:Thirteen
20:Twenty
Enter fullscreen mode Exit fullscreen mode

*code

If the pattern starts with "*" (2A ASTERISK), then the rest of the pattern will be interpreted as Raku code. So rak *string is the same as rak string --type=code. The pattern generally will need to be quoted as the shell may interpret the pattern in unwanted ways otherwise.

This is typically used for short pieces of code that use Whatever-currying

# Reverse the order of the characters of each line
$ rak '*.flip' twenty
twenty
1:eno
2:owt
3:eerht
4:ruof
5:evif
6:xis
7:neves
8:thgie
9:enin
10:net
11:nevele
12:evlewt
13:neetriht
14:neetruof
15:neetfif
16:neetxis
17:neetneves
18:neethgie
19:neetenin
20:ytnewt
Enter fullscreen mode Exit fullscreen mode

On Raku regexes

Everybody knows regular expressions, right? Well, I guess that's true to some extent. But regular expressions haven't been regular for a long time already. That's why it was decided to just call them "regexes" in the Raku Programming Language.

Regexes in Raku look like your average regular expression: some recipe to tell a state machine to look for matches. The biggest difference with other regular expression implementations, is that Raku regexes are code under the hood that you specify with Raku's regex syntax. You can think of a regex as a piece of code that you can call inside another regex, like a subroutine or a method. This has all sorts of implications, but that will be for another series of blog posts.

Within the context of rak, the most important thing to know, is that you have to quote any non-alphanumeric character in Raku regexes, unless it has a special meaning in the context of regexes, such as ., *, ?, to name but a few. The reason for this strictness, is to ensure future compatibility of regexes, should another non-alphanumeric character get a special meaning in Raku regexes at some time in the future.

# Show all lines that have an "h" in them, optionally followed by a "t"
$ rak '/ h t? /' twenty
twenty
3:t๐กree
8:eig๐ก๐ญ
13:t๐กirteen
18:eig๐ก๐ญeen
Enter fullscreen mode Exit fullscreen mode

Note that in the above example we used ? to indicate that the "t" may or may not occur.

You should also be aware that whitespace in Raku regexes has no meaning. So whitespace will need to be explicitely specified if it is intended to be part of the regex. However, if there is whitespace between alphanumeric characters, it will warn.

# Look for one, with unneeded whitespace
% rak '/ o ne /' twenty
Potential difficulties:
    Space is not significant here; please use quotes or :s (:sigspace) modifier (or, to suppress this warning, omit the space, or otherwise change the spacing)
    ------> / oโ ne /
twenty
1:๐จ๐ง๐ž
Enter fullscreen mode Exit fullscreen mode

The pipe character | can be use to indicate alternatives:

# Look for lines with either "foo" or "bar"
$ rak '/ one | four /' twenty
twenty
1:๐จ๐ง๐ž
4:๐Ÿ๐จ๐ฎ๐ซ
14:๐Ÿ๐จ๐ฎ๐ซteen
Enter fullscreen mode Exit fullscreen mode

The << marker indicates the start of a word, and \w* indicates zero or more alphanumeric characters.

# Look for lines in which there are words that start with "fi"
$ rak '/ << fi \w+ /' twenty
twenty
5:๐Ÿ๐ข๐ฏ๐ž
15:๐Ÿ๐ข๐Ÿ๐ญ๐ž๐ž๐ง
Enter fullscreen mode Exit fullscreen mode

There's a lot more about regexes that can be told. There is even a dedicated book about it by Moritz Lenz: Parsing with Regexes and Grammars: A Recursive Descent into Parsing.

Conclusion

This concludes part 3 of a series of blog posts about rak.

If you have any comments, find bugs, have recommendations / ideas, please submit them as issues at the App::Rak repository. If you would like to have a more direct interaction, you can visit the #raku-rak channel on Libera.chat.

Thank you (again) for reading all the way to the end!

๐Ÿ’– ๐Ÿ’ช ๐Ÿ™… ๐Ÿšฉ
lizmat
Elizabeth Mattijsen

Posted on November 2, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Setting up your haystack
raku Setting up your haystack

November 18, 2022

Specifying a pattern
raku Specifying a pattern

November 2, 2022

ยฉ TheLazy.dev

About