Understanding Grammars — Jack Vanlightly

SERIES LINKS

How to Create a Query Language DSL in C#
Creating a Simple Tokenizer (Lexer) in C#
--> Understanding Grammars
Implementing a DSL Parser in C#
Generating SQL from a Data Structure

Example code on GitHub

When you use a parser generator like ANTLR, the grammar is fed in and the code of a parser is the result. However we are hand crafting a parser so for us the grammar will act as a reference and will ultimately map neatly onto our code.

There are different styles of grammars and we'll look at two different ways of expressing a grammar.

Microsoft SQL Grammar Style

If you have ever been on MSDN to look up SQL reference materials then you may have seen their SQL grammar before. Check out MSDN SQL MERGE for an example. We can express LQL in the same way.

  
    MATCH <match-condition>
[AND|OR <match-condition>]
BETWEEN 'yyyy-MM-dd HH:mm:ss' AND 'yyyy-MM-dd HH:mm:ss'
[LIMIT integer]

<match-condition>
<object> {{=|!=|[NOT] LIKE} 'string_value'}|{[NOT]  IN('string_value'[,'string_value']*)}

<object>
{app|application}
|{ex|exception}
|{msg|message}
|{sf|stackframe}
|{fp|fingerprint}
  

The text in <> act as placeholders.
[] denote optional groups
{} denote mandatory groups
| denotes OR

So we see that the text MATCH followed by one <match-condition> is mandatory. A second <match-condition> with an associated logical operator is optional. After match conditions BETWEEN is mandatory followed by 'date' AND 'date' with a specific date format. Finally we see that LIMIT is optional, but when included it must be followed by an integer.

Regarding <match-condition>, we see that it must start with an <object> but then we have two options. Either it is followed by an equality operator and string value, or by an IN('value'...). For an IN clause we see that NOT is optional. For the list of string values I have deviated from the MSDN SQL grammar method of denoting a list and used a more easily understood optional [,'value'] group followed by a * which in regex signifies zero or more of the previous character or group.

Finally <object> can be either an application, exception type, message, stack frame or fingerprint. Each of these objects has a shorter alias.

I find this style of grammar very readable and is well suited to a declarative language such as SQL and LQL. But it doesn't translate well into a parser and so we need to look at a grammar better suited to feeding parsers.

Production Rule Notation Grammars

This can be a very abstract method of expressing the rules of a language but it ultimately translates to a parser very well so it is worth some time and effort into understanding.

First of all let's cover some terminology.

Production Rules

Productions define a set of rules for recursively substituting symbols (or placeholders) with another set of symbols. You will see examples like:

S -> aSb
S -> ba

Lowercase are Terminal symbols, basically, the letters a and b. They are terminal because they cannot be substituted. Then we have Non Terminal symbols which are placeholders for another set of symbols.

The following string is valid for the above productions: aababb

Let's iterate over the character sequence and match the rules against it. I will refer to the rules as S1 (rule S -> aSb) and S2 (rule S -> ba)

aababb - S1 is the only rule that can start with the letter a so we follow rule S1. We match 'a'

aababb - Next position S -> aSb - The next symbol in S1 is S. Again, only one S rule starts with 'a', S1. So now we are nested recursively S -> a(S -> aSb)b. We match 'a'

aababb - Next position S -> a(S -> aSb)b - The next symbol in S1 is S. Again, only one S rule starts with 'b', S2. So now we are nested recursively S -> a(S -> a(S -> ba)b)b. We match 'b'.

aababb - Next position S -> a(S -> a(S -> ba)b)b - The next symbol in our nested S2 is 'a' which matches the 'a; in our sequence. We match 'a'.

aababb - Next position S -> a(S -> a(S -> ba)b)b - According to the grammar then next letter should be 'b' and this matches the sequence. We match 'b'.

aababb - Next position S -> a(S -> a(S -> ba)b)b - According to the grammar then next letter should be 'b' again and this matches the sequence. We match 'b'.

So our two production rules match aababb nicely. The text abba will not fit nicely, let's run through it.

abba - This matches S1. We match 'a'.

abba - Next position S -> aSb - The next symbol is S. The only S rule to begin with 'b' is S2. We are now nested recursively S -> a(S -> ba)b. We match 'b'.

baba - Next position S -> a(S -> ba)b - The only next valid letter is 'a' according to the grammar but we get 'b' in the sequence. This text does not match this grammar.

Context Free Grammar

If you can, you want your grammar to be context free, it makes your parser much simpler. This means that we can parse the text based on syntax alone. The moment we have to start tracking variables then we need to start storing information about what tokens are and then make different decisions based on what that token is. Syntax is not enough. Luckily in our case LQL has no variables or other features that make it context based and can be parsed purely based on its syntax.

LL(1) and LL(k) Grammars

1 and k refer to a fixed amount of lookahead that the grammar needs to be interpreted. So for example, let's take the production rules:

S -> abS
S -> acS
S -> z

This is an LL(2) grammar. It requires a lookahead of two. If we have the text 'abz' and 'adz', when the parser reaches the 'a' in both texts it doesn't know if it matches abS or adS, it has to look two symbols ahead to know.

Let's create a production rule grammar for LQL and see if it is LL(1) or LL(k)

LQL Production Rules

  
    S -> MATCH
MATCH -> match MATCH_CONDITION

MATCH_CONDITION -> object operator string_literal MATCH_CONDITION_NEXT
MATCH_CONDITION -> object in open_bracket STRING_LITERAL_LIST close_bracket  MATCH_CONDITION_NEXT
MATCH_CONDITION -> object not_in open_bracket STRING_LITERAL_LIST close_bracket MATCH_CONDITION_NEXT
MATCH_CONDITION_NEXT -> and MATCH_CONDITION
MATCH_CONDITION_NEXT -> or MATCH_CONDITION
MATCH_CONDITION_NEXT -> DATE_CONDITION

STRING_LITERAL_LIST -> string_literal STRING_LITERAL_LIST_NEXT
STRING_LITERAL_LIST_NEXT -> comma string_literal STRING_LITERAL_LIST_NEXT
STRING_LITERAL_LIST_NEXT -> empty_string

DATE_CONDITION -> between datetime_literal and datetime_literal  DATE_CONDITION_NEXT
DATE_CONDITION_NEXT -> LIMIT
DATE_CONDITION_NEXT -> eos

LIMIT -> limit number eos
  

The first rule points to the MATCH rule. MATCH rule says the first token is match followed by the MATCH_CONDITION rule. There are three MATCH_CONDITION rules depending on whether it is an equality predicate, IN or NOT IN.

Each MATCH_CONDITION rule ends with MATCH_CONDITION_NEXT which has three rules which declare that next up is either another match condition preceded by an AND or OR, or the next thing is DATE_CONDITION which expects a BETWEEN clause.

If we look at STRING_LITERAL_LIST we see that it expects a string literal '' followed by STRING_LITERAL_LIST_NEXT. STRING_LITERAL_LIST_NEXT declares that next up is either a comma then another string literal or else we match an empty string.

We can see that this grammar is LL(2). It requires two lookaheads. All the rules are LL(1) except MATCH_CONDITION which has three rules all starting with object. The second symbol is different for each of the three rules. It is LL(2) so our parser will need two lookaheads in order to parse LQL.

If we go through the same exercise as before we'll see how it works. We'll denote the first lookahead as L1 and the second as L2. Let's use a simple LQL query:

  
    MATCH APP = 'My App'
BETWEEN 2016-01-01 00:00:00 AND 2016-01-02 00:00:00

L1 MATCH, L2 'APP'
Current Rule: S -> MATCH
MATCH says (MATCH -> match MATCH_CONDITION) that first we expect the match token.
Match MATCH

L1 'APP', L2 =
Current Rule: MATCH -> match MATCH_CONDITION
Next possible rules:
MATCH_CONDITION -> object operator string_literal MATCH_CONDITION_NEXT
MATCH_CONDITION -> object in open_bracket STRING_LITERAL_LIST close_bracket MATCH_CONDITION_NEXT
MATCH_CONDITION -> object not_in open_bracket STRING_LITERAL_LIST close_bracket MATCH_CONDITION_NEXT
With two lookaheads we can decide which rule to follow. The second lookahead has identified an equality operator. So we follow the first MATCH_CONDITION rule.
Match APP

L1 =, L2 'My App'
Current rule: MATCH_CONDITION -> object operator string_literal MATCH_CONDITION_NEXT
We expect an equality operator and get one.
Match =

L1 'My App', L2 BETWEEN
Current rule: MATCH_CONDITION -> object operator string_literal MATCH_CONDITION_NEXT
We expect a string literal and get one.
Match 'My App'

L1 BETWEEN, L2 2016-01-01 00:00:00
Current rule: MATCH_CONDITION -> object operator string_literal MATCH_CONDITION_NEXT
Next possible rules:
MATCH_CONDITION_NEXT -> and MATCH_CONDITION
MATCH_CONDITION_NEXT -> or MATCH_CONDITION
MATCH_CONDITION_NEXT -> DATE_CONDITION
Our first lookahead has BETWEEN and this matches the DATE_CONDITION rule.
Match BETWEEN.

L1 2016-01-01 00:00:00, L2 AND
Current rule: DATE_CONDITION -> between datetime_literal and datetime_literal DATE_CONDITION_NEXT
DATE_CONDITION says to expect a date in yyyy-MM-dd HH:mm:ss.
Match 2016-01-01 00:00:00

L1 AND, L2 2016-01-02 00:00:00
Current rule: DATE_CONDITION -> between datetime_literal and datetime_literal DATE_CONDITION_NEXT
We expect the and symbol.
Match AND

L1 2016-01-02 00:00:00, L2 LIMIT
Current rule: DATE_CONDITION -> between datetime_literal and datetime_literal DATE_CONDITION_NEXT
We expect a date.
Match 2016-01-02 00:00:00

L1 LIMIT, L2 100
Current rule: DATE_CONDITION -> between datetime_literal and datetime_literal DATE_CONDITION_NEXT
We expect DATE_CONDITION_NEXT will match LIMIT or the end of the query. LIMIT states LIMIT -> limit number. We follow LIMIT.
Match LIMIT.

L1 100, L2 EOS
Current rule: LIMIT -> limit number
We expect a number.
Match 100.

L1 EOS, L2 EOS
We expect nothing further and reach the End Of Sequence symbol.

So we have defined an LL(2) grammar that can parse any LQL query. Next we'll build a parser that implements this grammar. You may be thinking that the first grammar style was easier to read, and you'd be right, but you'll see that the production rule notation will map to your code so cleanly that you'll see why we use production rules.

SERIES LINKS

How to Create a Query Language DSL in C#
Creating a Simple Tokenizer (Lexer) in C#
--> Understanding Grammars
Implementing a DSL Parser in C#
Generating SQL from a Data Structure