Writing a BNF Grammar (featuring Prof Jeff Heinz)
Marianne
Posted on December 2, 2020
Summary
Writing a programming language starts with its grammar. But what is a grammar in the context of computation? And how does one go about writing one? This week Marianne talks with computational linguist Jeff Heinz about Alan Turing, Noam Chomsky and what context-free grammar actually means.
Links
- Prof Jeff Heinz's Research
- The death match poster pitting big data against structural natural language processing
- The Definitive Antlr 4 Reference
Transcript
MB: This is going to be a great week. Because this is the week I actually start implementing things. Are you ready? Okay, let’s get started!
MB: What you're doing when you write a programming language is actually writing a series of applications that take string input and translate it into something the machine can execute.
MB: The building blocks of programming language implementation are: parsers, lexers, and evaluators. Lexers take your code and break it up into parts. Parsers assemble those parts into trees that suggest semantic meaning. Evaluators walk the tree, interpret the meaning of each leaf and execute the appropriate behavior on the machine.
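As a toy illustration of those layers, here is a minimal sketch in ANTLR 4 notation (the tool that comes up later in the episode) showing how an input like `x = 2 + 2` moves through them; the grammar and its rule names are invented for this example:

```antlr
// Toy grammar: the lexer rules (uppercase) break "x = 2 + 2" into the
// tokens IDENT ASSIGN INT PLUS INT; the parser rule (lowercase) then
// assembles those tokens into a tree for an evaluator to walk.
grammar Toy;

assignment : IDENT ASSIGN expr ;   // tree: (assignment x = (expr 2 + 2))
expr       : INT PLUS INT ;        // the evaluator adds the two INT leaves

IDENT  : [a-zA-Z]+ ;
ASSIGN : '=' ;
INT    : [0-9]+ ;
PLUS   : '+' ;
WS     : [ \t\r\n]+ -> skip ;      // whitespace never reaches the parser
```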
MB: I wouldn't necessarily say that this stuff is easy to understand, but there are enough explanations out there (some very academic, some more plain language, some with examples, some heavy on history) that after I read two or three of them I felt pretty confident about it.
MB: But for me at least, the hardest part of learning something is figuring out how to learn it in the first place.
MB: The thing is... just about every resource I studied on parsers said the same thing: this isn't how you would do things if you wanted to do this for real. If you were going to do it for real you would use a generator.
MB: When I raised my hand and told the internet "hello I'm going to do it for real, what do I need to know to use these generators?" dead silence came back.
MB: There are only two types of people who write the code for parsers from scratch: absolute beginners and absolute experts. The beginners write parsers "by hand" so that they can learn HOW they do what they do. The experts write them so that they can optimize them within an inch of their lives.
MB: Everyone else, as it turns out, writes a grammar that a generator uses to write a full lexer-parser for them.
MB: There are huge advantages to generators. You get to leverage the experience of the whole language design community and control for edge cases and errors that would probably never even occur to you. You also make it easier for other people to implement your language, as most generators accept the same formats for grammar. And lastly, you have less code you need to write when you use a generator.
MB: Since I wanted to implement my language in Go, selecting a generator was really just about which ones could generate parsers in Go. Skimming the Wikipedia chart, I picked one out and I started reading the documentation. Accepts grammars in EBNF, cool, well what does that look like?
MB: ...you know... Confronting what feels like a tidal wave of information is becoming an all too familiar feeling on this project. All the tutorials I could find on writing BNF grammars seemed to assume you understood the mechanics of grammars in the first place. I'm basically looking for Strunk and White's Elements of Style in the Microsoft Word Help Center ... that's what it feels like.
MB: Something that kept getting referenced over and over and over again, without much of any context to hint at why it was relevant, was Chomsky's hierarchy: Chomsky's "context-free grammar," also "Chomsky normal form" ... but coming from a position of complete ignorance, it is hard to understand how these concepts map back to the work of parsers.
MB: So I reached out to a friend of mine who is a linguist and happened to study under Chomsky. Did she know anyone who understood both the world of the programming and the world of linguistics?
MB: Oh... she said, Yes! You should talk to Jeff Heinz out at Stony Brook.
MB: .... I have to confess that I'm vulnerable to a certain degree of magical thinking. I did my bachelor's degree at Stony Brook, so this felt right somehow. Then when I looked Professor Heinz up I saw he had given a joint talk on grammar-vs-statistical approaches to natural language processing, which they had promoted with a poster parodying a heavyweight boxing match.
MB: Yeees... I think this is the right person to talk to.
JH: So right now, we're in the age of big data and a lot of the hype around artificial intelligence and machine learning is that you can provide these sort of general learning machines with lots of data and it's going to learn what it is that it's supposed to do. And I don't think that's how children learn language at all. I think children learn language from actually small amounts of data. And I think that the big data, I mean, even in our own world of big data, is good for languages like English, French, Chinese. But, you know, there's thousands of languages and there isn't a lot of data for most of those languages.
JH: But children will learn those languages if they're raised in those communities, again effortlessly. So my own feeling is that I want the computing, the learning algorithms, the machine learning algorithms and the AI algorithms to learn more like people do. And what that means is that they're going to learn from small data. And the way that they're going to do that, I believe, and what I'm working on now is... currently these machine learning algorithms need big data because they're searching for the needle in a huge haystack. And that's actually... so they frame the problem wrong. In the case of natural language, we're not searching a huge haystack. In fact, it's a very small little pile of straw and the needle's right there.
JH: And so the thing is, what's that small pile of straw? What's that? How do we characterize natural language in the right kind of way? So the search problem is actually very easy.
MB: Something I had been unaware of until I started digging into generators for parsers is how useful both parsers and grammars are for other challenges outside of implementing programming languages. I feel like I've been struggling to hang pictures around my home and one day someone knocks on my door and introduces me to the hammer. My career is filled with all these challenges that it turns out I just should have written a parser for.
MB: Like data transformation, should have written a parser for that. Log alerting? Parser. Allowing rich search querying? Parser again.
MB: But when you tell me to write a grammar for my programming language ... I don't even know where to start. Obviously a grammar is a list of rules. But what kind of rules even go in a BNF grammar and what sort of things are handled by error handling on the evaluator?
MB: How would you actually define what a grammar is? Because most people associate the word grammar with what they learn in school, which is a whole variety of things, including grammar but also spelling and syntax and all sorts of different concepts. And most software engineers, when they think about software, what they're thinking about is not grammar. They're thinking about syntax. Right. So, like, how would you explain what a grammar... what a formal grammar actually is?
JH: Yeah, so...
JH: So a formal grammar, I would explain, is just any kind of device that's able to take a sequence and tell you whether it's valid or not. And that's it. And so in some sense, it's going to identify some subset of all logically possible sequences, the ones that are valid. Now, that could be a very large set. There could be infinitely many strings in that set. Sort of like you can think of the set of all even numbers. You can think of all the strings that begin with an A, or the word "the," or whatever you want. And you can write formal grammars that would recognize those strings. And so that is in one sense what a formal grammar is. And it can take many forms. I mean, people have studied formal grammars from logical perspectives. They've studied them from sort of machine-type perspectives, from algebraic perspectives. So you can study these formal grammars from many perspectives. Having said that, I would say that a grammar in the sense that people typically use it, both in software engineering and in linguistics, I would define a little bit differently than a formal grammar. So the formal grammar is just focused on, say, the set of strings, or sort of structures more generally. Doesn't have to be strings. You could be talking about tree structures or graph structures. So for any kind of object you're interested in, the formal grammar is essentially a classifier.
JH: It just says: these are going to be OK and those are not going to be OK. Now, why would you care about that?
JH: Well, as you said, in software engineering and in natural language alike, you want to interpret those structures that are admissible as valid structures. And so that's the other part of grammar that I would say is important: not only is it a set of structures or something like that, it's actually more abstract. It's a way of trying to go from the structure that you see to something else, to some kind of meaning, some kind of semantics.
JH: And so this is also why we study it, because we often think of the syntax, in a sense, as the part of the grammar that gets interpreted in two ways. So one way is sort of the linear string that, you know, you would write out, which gets parsed into that syntactic structure. And the other one is going to be the meaning of that. So that's going to be the instructions to the computer, how to fire up the circuits in certain ways, or the meaning that we're expressing when we're talking to each other. So the grammar is sort of fundamentally this connection between the linear string that's being expressed and the meaning that's being expressed by that string. It's that connection.
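To make the "classifier" framing concrete, here is a minimal sketch of a formal grammar, in ANTLR 4 notation, that accepts exactly the strings whose first word is "the" (one of the toy sets Heinz mentions) and rejects everything else; the grammar and rule names are invented for this example:

```antlr
// A formal grammar as a classifier: input that matches the start rule
// is "in the set"; anything else is rejected with a syntax error.
grammar StartsWithThe;

sentence : 'the' WORD* EOF ;   // valid if and only if the first word is "the"

WORD : [a-z]+ ;                // any other lowercase word
WS   : [ \t\r\n]+ -> skip ;
```

Feed it "the cat sat" and the parse succeeds; feed it "a cat sat" and it fails. That accept/reject behavior is the whole job of the formal grammar.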
MB: So we need to think about it in terms of the structure of our strings, in this case code-- As it happens, the best resources I found for how to think through this were the reference manuals for the generators themselves. This caught me a little by surprise because ... you know, I don't expect to read about the vanishing horizon in the manual for Adobe InDesign, right? Reference manuals assume you know what the tool is for and just need to learn how to use the tool. But I guess I'm not the only one who has no idea where to start with grammars.
MB: My favorite resource so far has been The Definitive Antlr 4 Reference by Antlr creator Terence Parr. The book includes common patterns in grammar and step-by-step example grammars for real world languages.
MB: So... Here's how Parr recommends starting: go from the largest complete unit of the string down. So in other words, the first rule should define what a file is ... what is a file made up of? And the answer to that could be lines if you're parsing something like CSV, or it could be multi-line objects for JSON. Let's say it's an object... what is the object made up of? Key-Value pairs, right? What are the keys made up of? What makes a key valid versus an invalid key? Strings with double quotation marks if we’re talking about JSON. What about the values? What makes a valid value versus an invalid value?
MB: ...And so on. Each description of a new level is a new set of rules to implement.
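Working top-down the way Parr describes might produce a sketch like this, in ANTLR 4 notation (a deliberately simplified, hedged version; the real JSON grammar in the Antlr book is more complete):

```antlr
// Top-down: define the whole file first, then describe each level
// in terms of the level below it.
grammar Json;

file   : value EOF ;                   // a file is a single value
value  : object | STRING | NUMBER ;   // deliberately simplified
object : '{' pair (',' pair)* '}' ;   // an object is key-value pairs
pair   : STRING ':' value ;           // keys are double-quoted strings

STRING : '"' ~["]* '"' ;
NUMBER : [0-9]+ ;
WS     : [ \t\r\n]+ -> skip ;
```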
MB: All this is well and good, but when does the context-free part come in?
MB: So in the world of language design and language development, we hear about Chomsky all the time, particularly around Chomsky's hierarchy, like what is Chomsky's hierarchy in linguistics? And like, does it connect to an underlying philosophy in any way?
JH: Yeah, yeah, it does. So Noam Chomsky is a famous linguist and intellectual, arguably one of the most important figures of the 20th century; at one point, one of the most cited people in the world of all time. And he's someone who can be somewhat controversial, I guess, both in linguistics and in his political writings. But there's no question his impact has been kind of enormous. Having said that, I also will say that... you know, I saw him give a talk several years ago at the University of Maryland, actually, and he began by pointing out how, in the way that most people view language in the broader context of cognition, nothing's changed in 50 or 60 years. They all view it as just something that is like any other aspect of cognition. There's nothing about it in particular, and you see that in a lot of work outside of linguistics proper... in cognitive science and so on, people are trying to tackle the language learning problem just like they would tackle any other problem. So they're not necessarily trying to focus on features of language per se. They're just focused on general features of cognition. So I was quite taken aback, thinking: here's a guy who's done more than anybody else to challenge that view, and he's saying, actually, there's really no difference between now and 60 years ago when you look at the broader scientific field. I thought that was quite interesting.
JH: Well anyway, so what is the Chomsky hierarchy? So the Chomsky hierarchy is actually a sort of refined view of the notion of computability more generally. So, you know, one of the big advances in the 20th century was this notion of what is computation. And there were several figures who contributed to that. The most famous one now is Alan Turing, but he's not the only one. There was Alonzo Church.
MB: Right.
JH: There was Kurt Gödel. There were all these guys in the '20s and '30s who really made these, you know, incredible... I don't know how to describe it, but they showed what is computable. And so they showed that there are certain problems that, you know, no computer can compute. And by computer, they don't just mean the hunk of metal and plastic on your desk or your phone. They mean anything that computes. Including people; people are information processing systems. They are, in a sense, computers, just made of a different type of material than the phone in your pocket. So there's sort of an outer limit of what can be computed.
JH: And so, for example, what Alan Turing showed was that most real numbers... a real number is just like pi, 3.14159, goes on and on and on. Most of these numbers cannot be computed. And by that, what he means is that, like, for pi, you can give me a number like five and I can tell you the fifth digit of pi. And if you give me a number like 10 million, I can follow some kind of computation and eventually I can give you the 10 millionth digit of pi, and there are programs that are running this stuff all the time. But most real numbers aren't like that. For most real numbers, there's no algorithm for which you can give me a number N and I can give you the Nth digit of that real number.
MB: Yeah.
JH: OK, and so that's quite interesting. So that's the limit of computability. So what's the Chomsky hierarchy, you asked. I don't know... I think I'm probably belaboring this point too much.
JH: I'm taking too long to get to the point. But when you refine the hierarchy, you're asking: OK, that's if I have all the resources and time in the world. Suppose I limit my resources in some way. Now what am I able to compute? And so that's how you get these different levels. And if we do this in string space instead of in number space, what can we do? OK, so I mentioned before how a formal grammar can distinguish between sets of strings: these strings are valid, those strings are not valid. So that's a computational problem, which is sort of the same. Just as you can identify a number by being able to compute its digits, you want to be able to identify a set of strings by saying: is that string in the set or not? Okay?
MB: So the hierarchy isn't so much about the complexity of the grammar as it is about the computational abilities and the level of specification of the grammar.
MB: Is that a good way of summing it up?
JH: Well, it is about the complexity of the grammar, but it's also... What the Chomsky hierarchy does is identify the different types of computational resources that you need to solve different types of... I would use the term membership problem to describe this: the problem of "is this string in the set or not?"
MB: Got it. Got it.
JH: So if we have the term membership problem, then I would say that the Chomsky hierarchy classifies membership problems over strings.
MB: OK.
JH: Over sets of strings. The reason why I hearken back to the dawn of computer science in this long story that I'm telling is because it wasn't just about linguistics. I mean, context-free grammars, which is sort of the part of the Chomsky hierarchy that Chomsky introduced, became the basis for programming languages and compilers and all that kind of stuff. So when you're going to write your own programming language, you're going to have to learn something about context-free grammars.
MB: Right. Right.
JH: And there's no way around that because that's what's used. I mean, all of these programming languages are essentially context-free. Yeah.
JH: And that's not the only type, I mean... that's a very important region in the Chomsky hierarchy. And we can ask the question: are all natural language patterns, in a sense, context-free? And the answer to that question appears to be no. When you look at certain kinds of syntactic patterns, you have to go outside of that. You don't have to go all the way to the Turing-complete computer.
JH: There's an area called context sensitive.
MB: Right.
JH: And that seems to be where we find natural language patterns.
JH: In my own work in morphology and phonology, context-free actually seems to be more than sufficient. And you can go much smaller than that. The last thing I would say about the Chomsky hierarchy is that many people, when they study it... you know, it's in all the computer science undergrad courses. You have to take a course in automata theory and you study the Chomsky hierarchy and complexity and so on. And they all know that this was work that was done 50, 60 years ago. So many people think that was a chapter that was opened and closed, and that's it. But that's not the case. The fact is that in the past 60 years, there are new chapters constantly being written with those same methods. And we have much better knowledge now. I mean, the Chomsky hierarchy is just the most famous of many, many types of hierarchies, many, many types of ways we can study sequences, structures, objects, transformations from sequences to other sequences. So machine translation: you want to translate a sentence in French to a sentence in Swahili. Those are all things that formal language theory studies. And there are many more hierarchies beyond just the Chomsky hierarchy that are relevant to these problems. And so it's a very... it's an alive, living area.
MB: Uh, this is so interesting.
MB: So, like, what's particularly compelling to me is: when I was first exposed to the Chomsky hierarchy in the context of writing programming languages, my first immediate assumption was that this is a metaphor, right? That this is something that we have misappropriated from another field of study to provide a little bit of visualization behind what these really kind of abstract grammars mean, and it doesn't actually connect to anything specific to computation. And what I'm hearing is: no, no, no, that's completely wrong. It actually is all about computation and the level of computation needed at various parts of the hierarchy, which is not what I was expecting to hear, and that is super, super interesting.
MB: After my conversation with Professor Heinz I went back to the things I had read about Chomsky's Hierarchy. But this time I reread them with computation on my mind.
MB: And just like that, it clicked. The lowest level of grammar in Chomsky's Hierarchy is what's called a "regular grammar" .... as in a regular expression. When a regular expression matches a pattern in a string, all it knows, all it needs to know to determine that match is what is directly in front of it at the time. And from this type of parsing we can deduce a certain level of meaning. But you couldn't use regular expressions to parse a programming language, because the same characters can have different meanings in programming languages depending on how they are used. Think about the equal sign: by itself it could be assignment, with another equal sign it means something different, with an exclamation mark... still another meaning. So in order to parse a programming language, you need a little more computation... you need to factor in whatever came immediately before. Those are context-free grammars.
MB: Above context-free are context-sensitive grammars. Most spoken languages are context-sensitive. In English it's not enough to have the word directly before the word you're trying to figure out the meaning of; we often need the whole sentence up to that point ... still more computation. And so on and so forth.
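One way to see the jump in computational power, sketched in ANTLR 4 notation (my own illustration, not from the interview): lexer rules are regular, so longest-match can tell `=` from `==` from `!=`, but only a recursive, context-free parser rule can guarantee that arbitrarily deep nesting stays balanced:

```antlr
grammar Nesting;

// A recursive, context-free rule: no regular expression can check that
// every '(' has a matching ')' at arbitrary nesting depth.
expr : '(' expr ')' | INT ;

// Lexer rules stay regular: longest match distinguishes the three tokens.
ASSIGN : '=' ;
EQ     : '==' ;
NEQ    : '!=' ;
INT    : [0-9]+ ;
WS     : [ \t\r\n]+ -> skip ;
```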
MB: OK ... I feel ready to start writing this grammar!
MB: I really like the look of specs written in Alloy, so I'm gravitating towards using that as my inspiration. An Alloy file is made up of hash-map-looking structures called signatures, blocks of logic code, and short one-line statements that define what the model should check and at what level of complexity. You can even import other models, although this doesn't actually work the way I want it to in Alloy.
MB: Our spec files will start with a declaration statement giving the model a name. That's one line... let's use the keyword 'spec' followed by a valid identifier. Identifiers are alphanumeric, but the first character must be a letter (in order to keep them from being confused with actual numbers by the parser).
MB: After the declaration statement we'll have an import block. Let's do that the same way Go does it: keyword "import" followed by a left parenthesis, followed by a set of strings with import paths, followed by a right parenthesis. .... Yeah, I like the way that looks.
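In ANTLR 4 notation, those first two rules might look something like this (a hedged sketch; the rule names and token details are my guesses at this stage, not the final grammar):

```antlr
grammar Spec;

specFile    : specClause importBlock? ;      // header: name, then imports
specClause  : 'spec' IDENT ;                 // e.g.  spec mySpec
importBlock : 'import' '(' STRING* ')' ;     // Go-style import block

IDENT  : [a-zA-Z] [a-zA-Z0-9]* ;             // must start with a letter
STRING : '"' ~["]* '"' ;                     // quoted import path
WS     : [ \t\r\n]+ -> skip ;
```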
MB: At this point I compiled the grammar and started experimenting with how that was looking in token form. Antlr 4 lets you determine what things the parser handles and what things the lexer handles by modifying the case of the rule names .... basically what that means is that each line of the grammar looks like a variable assignment: the left-hand side (the rule name) is like a variable name and the right-hand side is the rule definition. And yes, there can be multiple rules joined by a '|' as an OR, and they can also be recursive. But for a simple example, the rule to parse 2+2 might read something like INT "+" INT.
MB: In Antlr 4, if the rule name begins with a lowercase letter then the rule belongs to the parser; if it begins with a capital letter it belongs to the lexer. The distinction is between what gets treated as a token (rules belonging to the lexer) and what gets treated as a string to potentially be broken up into more branches and nodes on the tree (rules belonging to the parser).
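So the 2+2 example from above, split across the two cases, would read something like this:

```antlr
grammar Addition;

// lowercase name = parser rule: becomes a branch in the tree
addition : INT '+' INT ;

// UPPERCASE name = lexer rule: becomes a single token
INT : [0-9]+ ;
WS  : [ \t\r\n]+ -> skip ;
```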
MB: So far this grammar is just for a tiny header that names the spec and imports other specs. After that come our blocks of code, and that's more complicated. If we're using Alloy specs as an inspiration, then we will have hashmap-like structures that define the elements of our system. Therefore the next rule is that there will be one or more entities. Entities are defined by the keyword "def", followed by an identifier, followed by an equal sign, followed by a left bracket, followed by a properties list, followed by a right bracket.
MB: What do the properties look like? Well... They start with an identifier obviously (the key), followed by a ':', followed by a value. For the time being the values will be either integers or strings. We don't want to get too complicated before recompiling and looking for errors.
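Put together, the entity rules just described might sketch out like this (again hedged: the curly braces, the comma separators, and the rule names are my own reading of the description above):

```antlr
grammar Entity;

specBody     : entity+ ;                      // one or more entities
entity       : 'def' IDENT '=' '{' propertyList '}' ;
propertyList : property (',' property)* ;     // assuming comma-separated pairs
property     : IDENT ':' value ;              // key : value
value        : INT | STRING ;                 // just these two types for now

IDENT  : [a-zA-Z] [a-zA-Z0-9]* ;
INT    : [0-9]+ ;
STRING : '"' ~["]* '"' ;
WS     : [ \t\r\n]+ -> skip ;
```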
MB: The final grammar will need to be more robust than this, but this is the way I worked through it: starting simple, compiling the parser and lexer, inputting a mock spec and fixing bugs before adding more complexity.
MB: It is incredibly satisfying when it gives you the output you were expecting.