Juan Julián Merelo Guervós
Posted on December 25, 2017
We already know that Perl6 does roles. But this series is about grammars, so sooner or later roles had to turn up in one of these articles, right?
Using roles in grammars
We already know that grammars are actually classes, a particular kind of class that returns Matches when parsing a text. But the marked-up texts we are dealing with are actually combinations of many different elements. Paragraphs are made of (maybe enhanced) words, for instance. We do not need to create them hierarchically: we can mix and match the word role in different markdown parsers, from the simplest to the most complicated. And we can create a parser for a semi-decent markdown-like little language like this, starting with the word.
All scripts for this series of articles are on GitHub
role like-a-word {
    regex like-a-word { «\H+» }
}
role declares, you guessed it, a role. But instead of populating it with methods, we use grammar stuff like regex. Grammar roles are just roles.
And regexen are just regular expressions, in the same way we saw them in the Match article. They match things. Tokens do that too; that is what we have used so far. But tokens do not backtrack: once they have started to match and find something that does not correspond to the rule, they fail, and don't go back and say, wait, maybe it matches this little other thing. Regexen do go back. In a word, they behave just like regular regular expressions. I always wanted to say that.
In this case we use a regex because we do not know where this rule will end up; allowing it to backtrack will prove useful later on. The regex itself might seem weird, with the «» and all. In Perl6, they are just word boundaries: « marks the left one and » the right one. This regex matches a run of anything that is not horizontal whitespace, starting and ending at a word boundary. Note that \H on its own also matches a newline; with the single-line texts used here that makes no difference. It is a pretty general way of describing words.
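The difference between a backtracking regex and a token can be seen with a minimal sketch like the following. The two lexical names are hypothetical, made up for this illustration; they wrap the same pattern as like-a-word and differ only in the declarator.

```raku
# Both names are hypothetical; only the declarator differs.
my regex backtracking-word { «\H+» }
my token ratcheting-word   { «\H+» }

my $text = "thing**";

# regex: \H+ first grabs "thing**", then gives back the two stars
# until » finds a word boundary right after "thing".
say ~($text ~~ / ^ <backtracking-word> /);   # thing

# token: \H+ keeps "thing**", » fails there, and the match fails with it.
say so $text ~~ / ^ <ratcheting-word> /;     # False
```

This is the behaviour that later lets a word sit right next to closing quote characters.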
We want another structure that is built out of words. Like this:
role span does like-a-word {
    regex span { <like-a-word>(\s+ <like-a-word>)* }
}
Declaring that span, which is also a role, does like-a-word allows it to use the regex of the same name declared in that role. A span is just a group of things that look like a word. But we can build on that:
role pair-quoted does span {
    proto regex quoted {*}
    regex quoted:sym<em>     { '*'  ~ '*'  <span> }
    regex quoted:sym<alsoem> { '~'  ~ '~'  <span> }
    regex quoted:sym<code>   { '`'  ~ '`'  <span> }
    regex quoted:sym<strong> { '**' ~ '**' <span> }
    regex quoted:sym<strike> { '~~' ~ '~~' <span> }
}
We want to surround these spans with quote-like things that express emphasis or other kinds of markup. We use proto, which makes all the regexen called quoted share the same name and signature but run different code, depending on what they have to deal with. Syntax again might get in the way, but we'll get to that later on. Suffice it to say that we are declaring here different kinds of quoted spans.
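As a rough illustration of that syntax, here is a small self-contained grammar. Quote-Demo and its word token are hypothetical names invented for this sketch, not part of the article's code. It shows that a single call to quoted tries the :sym variants, and that 'open' ~ 'close' <inner> matches the opener, then the inner rule, then the closer.

```raku
# Quote-Demo and its rule names are hypothetical, for illustration only.
grammar Quote-Demo {
    token TOP { <quoted> }

    # The proto declares the shared name; each :sym<...> variant is one alternative.
    proto token quoted {*}
    token quoted:sym<em>     { '*'  ~ '*'  <word> }   # roughly '*'  <word> '*'
    token quoted:sym<strong> { '**' ~ '**' <word> }   # roughly '**' <word> '**'

    token word { \w+ }
}

for "*hi*", "**hi**" -> $text {
    my $match = Quote-Demo.parse($text);
    # In both cases the inner word is the same; only the quotes differ.
    say $text, " -> ", ~$match<quoted><word>;
}
```

Both inputs parse through the same quoted call, which is what lets a grammar mention quoted once and leave the choice of variant to the proto.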
Theoretically, we could already use this to match things; however, since these roles do not declare TOP, they have to be used in conjunction with a real grammar. Just like this one:
grammar better-paragraph does pair-quoted {
    token TOP { <chunk>[ (\s+) <chunk>]* }
    regex chunk { <quoted> | <span> }
}
This grammar only needs to do the most complicated of the roles we have declared, the one which includes all of them. A chunk is either a quoted (taken from the pair-quoted role) or a span (taken from the span role). By using roles we have simplified the construction of this grammar, and created something that can be easily understood by someone reading it. A better-paragraph is a sequence of chunks, which can be either quoted spans or simple spans.
Let's put it to use.
Let's do the parsing:
my $simple-thing = better-paragraph.parse("Simple **thing**");
$simple-thing<chunk>.map: { .put };
The first line does the parsing as usual, and we know this returns a Match object. This object can be used like a hash, whose keys are the names of the rules called from TOP. This is what we use in the next line: $simple-thing<chunk>.map: { .put }; has to be read from left to right. $simple-thing<chunk> is a list of the different chunks that have been extracted from the simple text. We map them to a function, in this case simply put, which prints them; that is, .put actually does $_.put, where $_ is the implicit loop variable. We could use our beloved thorn to write it this way:
$simple-thing<chunk>.map: { $^þ.put };
which would do exactly the same, that is printing:
Simple
**thing**
We might actually want to get rid of the markers, and just make a note that there was something marking that span. We can do it like so:
$simple-thing<chunk>.map: { so $^þ<quoted> ??
    say "["~$^þ<quoted><span> ~ "]" !!
    $^þ.put };
Is it a quoted thing? That is what so $^þ<quoted> ?? asks: so turns whatever is to its right into a boolean. If the chunk has a quoted key, it will be true, and then the next part kicks in:
say "["~$^þ<quoted><span> ~ "]"
Instead of printing the <quoted> part directly, we dive deeper into the Match object and go to the next level, where there should be a <span>. We'll get almost the same as above:
Simple
[thing]
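The hash-like and list-like accesses used so far can be checked with a sketch like this one. It carries a condensed copy of the roles and grammar from above (only two of the quoted variants are kept) so that it runs on its own; the variable names $match and @chunks are chosen here for illustration.

```raku
# A condensed copy of the article's roles and grammar, so this runs on its own.
role like-a-word { regex like-a-word { «\H+» } }
role span does like-a-word {
    regex span { <like-a-word>(\s+ <like-a-word>)* }
}
role pair-quoted does span {
    proto regex quoted {*}
    regex quoted:sym<em>     { '*'  ~ '*'  <span> }
    regex quoted:sym<strong> { '**' ~ '**' <span> }
}
grammar better-paragraph does pair-quoted {
    token TOP   { <chunk>[ (\s+) <chunk>]* }
    regex chunk { <quoted> | <span> }
}

my $match  = better-paragraph.parse("Simple **thing**");
my @chunks = $match<chunk>.list;      # one Match per chunk

say @chunks.elems;                    # 2
say so @chunks[0]<quoted>;            # False: "Simple" is a plain span
say so @chunks[1]<quoted>;            # True:  "**thing**" is quoted
say ~@chunks[1]<quoted><span>;        # thing
```

Each level of the Match mirrors one rule call, which is why <quoted><span> reaches the text between the quotes.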
But this is kind of disappointing, right? We went to all that trouble and still cannot tell which quotes were actually used.
Working with unnamed captures
Anything we put inside parentheses in a rule, token or regex will be captured. Let's slightly change the pair-quoted role this way:
regex quoted:sym<em> { ('*') ~ '*' <span> }
(and do the same to the rest). We'll have two captures in the Match object: the first, a positional one, will contain the quoting operator used, and the second will be the same named <span> as before. We can also change the printing map:
$simple-thing<chunk>.map: { so $^þ<quoted> ??
    say $^þ<quoted>[0] ~ " → " ~ $^þ<quoted><span> !!
    $^þ.put };
Now $^þ<quoted>[0] contains the captured operator, and the rest is like before. This would print:
Simple
** → thing
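The same mix of a positional and a named capture can be tried outside any grammar. In this minimal sketch, inner is a hypothetical lexical regex standing in for span.

```raku
my regex inner { \w+ }   # hypothetical stand-in for <span>

my $match = "**thing**" ~~ / ('**') ~ '**' <inner> /;

say ~$match[0];       # **     positional capture: the opening quotes
say ~$match<inner>;   # thing  named capture: what sits between them
```

Positional captures are reached by index and named ones by key, so both can live side by side in one Match.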
Nifty, right? We can put that to good use in our eventual markdown grammar. But this will have to wait until the next installment.