Implementing a DSL Parser

SERIES LINKS

How to Create a Query Language DSL in C#
Creating a Simple Tokenizer (Lexer) in C#
Understanding Grammars
--> Implementing a DSL Parser in C#
Generating SQL from a Data Structure

Example code on GitHub

We previously created a tokenizer that breaks up a sequence of characters into a sequence of tokens (enum TokenType) and a LL2 production notation grammar that acts as a template for the code of the parser.

The input of this parser will be the sequence of tokens and the output will be an Intermediate Representation (IR) which is a data structure that represents the DSL text in a structured manner. The next step, after parsing, will be translating this IR into SQL.

Let's start with the IR.

Intermediate Representation (IR) - DslQueryModel Class


public class DslQueryModel
{
    public DslQueryModel()
    {
        MatchConditions = new List<MatchCondition>();
    }

    public DateRange DateRange { get; set; }
    public int? Limit { get; set; }
    public IList<MatchCondition> MatchConditions { get; set; }
}

public class DateRange
{
    public DateTime From { get; set; }
    public DateTime To { get; set; }
}

public class MatchCondition
{
    public DslObject Object { get; set; }
    public DslOperator Operator { get; set; }
    public string Value { get; set; }
    public List<string> Values { get; set; }
    
    public DslLogicalOperator LogOpToNextCondition { get; set; }
}

public enum DslObject
{
    Application,
    ExceptionType,
    Message,
    StackFrame,
    Fingerprint
}

public enum DslOperator
{
    NotDefined,
    Equals,
    NotEquals,
    Like,
    NotLike,
    In,
    NotIn
}

public enum DslLogicalOperator
{
    NotDefined,
    Or,
    And
}

These classes contain all the concepts required to build an SQL query. Limit will translate to TOP. DateRange will translate to a BETWEEN predicate. The collection of MatchCondition class will translate to predicates and joins regarding those logging objects. The next post will go through generating SQL from this object graph.

The Parser Algorithm

The algorithm needs to consume tokens, walking the grammar, and creating the DslQueryModel class as it goes. Let's see how the grammar maps onto the algorithm. Each grammar rule has an equivalent method in the parser

Rule: MATCH -> match MATCH_CONDITION
Method: Match()

Rule: MATCH_CONDITION -> object operator string_literal MATCH_CONDITION_NEXT
Method: MatchCondition() followed by EqualityMatchCondition()

Rule: MATCH_CONDITION -> object in open_bracket STRING_LITERAL_LIST close_bracket MATCH_CONDITION_NEXT
Method: MatchCondition() followed by InMatchCondition()

Rule: MATCH_CONDITION -> object not_in open_bracket STRING_LITERAL_LIST close_bracket MATCH_CONDITION_NEXT
Method: MatchCondition() followed by NotInMatchCondition()

Rule: MATCH_CONDITION_NEXT -> and MATCH_CONDITION
Method: MatchConditionNext() followed by AndMatchCondition()

Rule: MATCH_CONDITION_NEXT -> or MATCH_CONDITION
Method: MatchConditionNext() followed by OrMatchCondition()

Rule: MATCH_CONDITION_NEXT -> DATE_CONDITION
Method: MatchConditionNext() followed by DateCondition()

Rule: STRING_LITERAL_LIST -> string_literal STRING_LITERAL_LIST_NEXT
Method: StringLiteralList() followed by StringLiteralListNext()

Rule: STRING_LITERAL_LIST_NEXT -> comma string_literal STRING_LITERAL_LIST_NEXT
Method: StringLiteralListNext() followed by StringLiteralListNext()

Rule: DATE_CONDITION -> between datetime_literal and datetime_literal DATE_CONDITION_NEXT
Method: DateCondition() followed by DateConditionNext()

Rule: DATE_CONDITION_NEXT -> LIMIT
Method: DateCondition() followed by Limit()

Rule: LIMIT -> limit number eos
Method: Limit()

Let's see the code

First we store the token sequence in Stack<DslToken> and store the first and second lookahead tokens in variables. Also we create a DslQueryModel class instance that we will build as we go.


// Stack and Lookaheads
private Stack<DslToken> _tokenSequence;
private DslToken _lookaheadFirst;
private DslToken _lookaheadSecond;

// IR class that we will build as we go
private DslQueryModel _queryModel;
private MatchCondition _currentMatchCondition;

We fill the Stack and load the lookaheads


private void LoadSequenceStack(List<DslToken> tokens)
{
    _tokenSequence = new Stack<DslToken>();
    int count = tokens.Count;
    for (int i = count - 1; i >= 0; i--)
    {
        _tokenSequence.Push(tokens[i]);
    }
}

private void PrepareLookaheads()
{
    _lookaheadFirst = _tokenSequence.Pop();
    _lookaheadSecond = _tokenSequence.Pop();
}

We need some methods for reading and discarding tokens as we go consuming the token sequence. When we read a token we pass the type of token we expect and a DslParserException is thrown when we encounter a token we don't expect. When we discard a token we can do a check on the type and it also keeps the lookaheads in sync.


private DslToken ReadToken(TokenType tokenType)
{
    if (_lookaheadFirst.TokenType != tokenType)
        throw new DslParserException(string.Format("Expected {0} but found: {1}", tokenType.ToString().ToUpper(), _lookaheadFirst.Value));

    return _lookaheadFirst;
}

private void DiscardToken()
{
    _lookaheadFirst = _lookaheadSecond.Clone();

    if (_tokenSequence.Any())
        _lookaheadSecond = _tokenSequence.Pop();
    else
        _lookaheadSecond = new DslToken(TokenType.SequenceTerminator, string.Empty);
}

private void DiscardToken(TokenType tokenType)
{
    if (_lookaheadFirst.TokenType != tokenType)
        throw new DslParserException(string.Format("Expected {0} but found: {1}", tokenType.ToString().ToUpper(), _lookaheadFirst.Value));

    DiscardToken();
}

The top method is Parse(). Let's walk through the code and see how the methods are called according to the grammar.


public DslQueryModel Parse(List<DslToken> tokens)
{
	// create the stack and load the lookaheads
    LoadSequenceStack(tokens);
    PrepareLookaheads();
    _queryModel = new DslQueryModel();

	// S -> MATCH
    Match();

	DiscardToken(TokenType.SequenceTerminator);

    return _queryModel;
}

private void Match()
{
	// MATCH -> match MATCH_CONDITION
    DiscardToken(TokenType.Match);
    MatchCondition();
}

In the MatchCondition() method we need to identify if it is an equality, in or not in predicate and follow the corresponding grammar rule.


private void MatchCondition()
{
    CreateNewMatchCondition();

    if (IsObject(_lookaheadFirst))
    {
        if (IsEqualityOperator(_lookaheadSecond))
        {
            EqualityMatchCondition();
        }
        else if (_lookaheadSecond.TokenType == TokenType.In)
        {
            InCondition();
        }
        else if (_lookaheadSecond.TokenType == TokenType.NotIn)
        {
            NotInCondition();
        }
        else
        {
            throw new DslParserException(ExpectedObjectErrorText + " " + _lookaheadSecond.Value);
        }

        MatchConditionNext();
    }
    else
    {
        throw new DslParserException(ExpectedObjectErrorText + _lookaheadFirst.Value);
    }
}

private bool IsObject(DslToken token)
{
    return token.TokenType == TokenType.Application
           || token.TokenType == TokenType.ExceptionType
           || token.TokenType == TokenType.Fingerprint
           || token.TokenType == TokenType.Message
           || token.TokenType == TokenType.StackFrame;
}

private bool IsEqualityOperator(DslToken token)
{
    return token.TokenType == TokenType.Equals
           || token.TokenType == TokenType.NotEquals
           || token.TokenType == TokenType.Like
           || token.TokenType == TokenType.NotLike;
}

private void CreateNewMatchCondition()
{
	// create a new empty MatchCondition instance ready to fill
    _currentMatchCondition = new MatchCondition();
    _queryModel.MatchConditions.Add(_currentMatchCondition);
}

Let's look at the EqualityMatchCondition() route.


private void EqualityMatchCondition()
{
    _currentMatchCondition.Object = GetObject(_lookaheadFirst);
    DiscardToken();
    _currentMatchCondition.Operator = GetOperator(_lookaheadFirst);
    DiscardToken();
    _currentMatchCondition.Value = _lookaheadFirst.Value;
    DiscardToken();
}

private DslObject GetObject(DslToken token)
{
    switch (token.TokenType)
    {
        case TokenType.Application:
            return DslObject.Application;
        case TokenType.ExceptionType:
            return DslObject.ExceptionType;
        case TokenType.Fingerprint:
            return DslObject.Fingerprint;
        case TokenType.Message:
            return DslObject.Message;
        case TokenType.StackFrame:
            return DslObject.StackFrame;
        default:
            throw new DslParserException(ExpectedObjectErrorText + token.Value);
    }
}

private DslOperator GetOperator(DslToken token)
{
    switch (token.TokenType)
    {
        case TokenType.Equals:
            return DslOperator.Equals;
        case TokenType.NotEquals:
            return DslOperator.NotEquals;
        case TokenType.Like:
            return DslOperator.Like;
        case TokenType.NotLike:
            return DslOperator.NotLike;
        case TokenType.In:
            return DslOperator.In;
        case TokenType.NotIn:
            return DslOperator.NotIn;
        default:
            throw new DslParserException("Expected =, !=, LIKE, NOT LIKE, IN, NOT IN but found: " + token.Value);
    }
}

As we can see, the code is following the grammar rules and building a MatchCondition class.

Let's look at how it parses an IN and NOT IN predicate.


private void NotInCondition()
{
    ParseInCondition(DslOperator.NotIn);
}

private void InCondition()
{
    ParseInCondition(DslOperator.In);
}

private void ParseInCondition(DslOperator inOperator)
{
    _currentMatchCondition.Operator = inOperator;
    _currentMatchCondition.Values = new List<string>();
    _currentMatchCondition.Object = GetObject(_lookaheadFirst);
    DiscardToken();

    if (inOperator == DslOperator.In)
        DiscardToken(TokenType.In);
    else if (inOperator == DslOperator.NotIn)
        DiscardToken(TokenType.NotIn);

    DiscardToken(TokenType.OpenParenthesis);
    StringLiteralList();
    DiscardToken(TokenType.CloseParenthesis);
}

private void StringLiteralList()
{
    _currentMatchCondition.Values.Add(ReadToken(TokenType.StringValue).Value);
    DiscardToken(TokenType.StringValue);
    StringLiteralListNext();
}

private void StringLiteralListNext()
{
    if (_lookaheadFirst.TokenType == TokenType.Comma)
    {
        DiscardToken(TokenType.Comma);
        _currentMatchCondition.Values.Add(ReadToken(TokenType.StringValue).Value);
        DiscardToken(TokenType.StringValue);
        StringLiteralListNext();
    }
    else
    {
        // nothing
    }
}

After consuming a match condition we call MatchConditionNext(). From here it could be an AND or OR and another match condition or a date condition. After the date condition we may or may not have a limit.


private void MatchConditionNext()
{
    if (_lookaheadFirst.TokenType == TokenType.And)
    {
        AndMatchCondition();
    }
    else if (_lookaheadFirst.TokenType == TokenType.Or)
    {
        OrMatchCondition();
    }
    else if (_lookaheadFirst.TokenType == TokenType.Between)
    {
        DateCondition();
    }
    else
    {
        throw new DslParserException("Expected AND, OR or BETWEEN but found: " + _lookaheadFirst.Value);
    }
}

private void AndMatchCondition()
{
    _currentMatchCondition.LogOpToNextCondition = DslLogicalOperator.And;
    DiscardToken(TokenType.And);
    MatchCondition();
}

private void OrMatchCondition()
{
    _currentMatchCondition.LogOpToNextCondition = DslLogicalOperator.Or;
    DiscardToken(TokenType.Or);
    MatchCondition();
}

private void DateCondition()
{
    DiscardToken(TokenType.Between);

    _queryModel.DateRange = new DateRange();
    _queryModel.DateRange.From = DateTime.ParseExact(ReadToken(TokenType.DateTimeValue).Value, "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
    DiscardToken(TokenType.DateTimeValue);
    DiscardToken(TokenType.And);
    _queryModel.DateRange.To = DateTime.ParseExact(ReadToken(TokenType.DateTimeValue).Value, "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
    DiscardToken(TokenType.DateTimeValue);
    DateConditionNext();
}

private void DateConditionNext()
{
    if (_lookaheadFirst.TokenType == TokenType.Limit)
    {
        Limit();
    }
    else if (_lookaheadFirst.TokenType == TokenType.SequenceTerminator)
    {
        // nothing
    }
    else
    {
        throw new DslParserException("Expected LIMIT or the end of the query but found: " + _lookaheadFirst.Value);
    }

}

private void Limit()
{
    DiscardToken(TokenType.Limit);
    int limit = 0;
    bool success = int.TryParse(ReadToken(TokenType.Number).Value, out limit);
    if (success)
        _queryModel.Limit = limit;
    else
        throw new DslParserException("Expected an integer number but found " + ReadToken(TokenType.Number).Value);

    DiscardToken(TokenType.Number);
}

Once we reach the end of all these recursive calls we exit below the Match() method in the Parse() method, discard the expected SequenceTerminator token and return the DslQueryModel class with all the information we nned to generate the SQL.


public DslQueryModel Parse(List<DslToken> tokens)
{
	// create the stack and load the lookaheads
    LoadSequenceStack(tokens);
    PrepareLookaheads();
    _queryModel = new DslQueryModel();

	// S -> MATCH
    Match();

	DiscardToken(TokenType.SequenceTerminator);

    return _queryModel;
}

Running the Code and See the Results

Let's run the code of the tokenizer and parser and output the DslQueryModel as JSON for readability.


static void Main(string[] args)
{
    new Program().ParseQuery();
    Console.ReadLine();
}

public void ParseQuery()
{
    string query = @"MATCH app = 'MyTestApp'
AND ex IN ('System.NullReferenceException', 'System.FormatException')
BETWEEN 2016-01-01 00:00:00 AND 2016-02-01 00:00:00
LIMIT 100";

    var tokenizer = new Tokenizer();
    var tokenSequence = tokenizer.Tokenize(query);
    
    var parser = new Parser();
    var dataRepresentation = parser.Parse(tokenSequence);

    var json = JsonConvert.SerializeObject(dataRepresentation, Formatting.Indented);
    File.WriteAllText("jsonData.txt", json);
}

Produces the following JSON


{
  "DateRange": {
    "From": "2016-01-01T00:00:00",
    "To": "2016-02-01T00:00:00"
  },
  "Limit": 100,
  "MatchConditions": [
    {
      "Object": 0, // APP
      "Operator": 1, // =
      "Value": "'MyTestApp'",
      "Values": null,
      "LogOpToNextCondition": 2 // AND
    },
    {
      "Object": 1, // ExceptionType
      "Operator": 5, // IN
      "Value": null,
      "Values": [
        "'System.NullReferenceException'",
        "'System.FormatException'"
      ],
      "LogOpToNextCondition": 0 // None
    }
  ]
}