Creating a Simple Tokenizer (Lexer) in C#

 

SERIES LINKS

How to Create a Query Language DSL in C#
--> Creating a Simple Tokenizer (Lexer) in C#
Understanding Grammars
Implementing a DSL Parser in C#
Generating SQL from a Data Structure 

Example code on GitHub

What is a Tokenizer?

Before you read on: there is a more efficient version of this tokenizer here: http://jack-vanlightly.com/blog/2016/2/24/a-more-efficient-regex-tokenizer that has significantly better performance. If you are parsing text of any significant length, do not use the version documented in this post.

A tokenizer, or lexer, takes a sequence of characters and outputs a sequence of tokens. Let's dive straight into an example to illustrate this.

Meet a simplified version of Logging Query Language (LQL)

MATCH APP = 'My App'
AND EX IN ('System.NullReferenceException','System.FormatException')
BETWEEN 2016-01-01 10:00:00 AND 2016-01-01 11:00:00
LIMIT 100

Let's create an enum that represents the keywords, symbols and values of LQL.


public enum TokenType
{
    NotDefined,
    And,
    Application,
    Between,
    CloseParenthesis,
    Comma,
    DateTimeValue,
    Equals,
    ExceptionType,
    Fingerprint,
    In,
    Invalid,
    Like,
    Limit,
    Match,
    Message,
    NotEquals,
    NotIn,
    NotLike,
    Number,
    Or,
    OpenParenthesis,
    StackFrame,
    StringValue,
    SequenceTerminator
}

Given the LQL query at the top of the page, the output of our tokenizer should be the following sequence:

TokenType.Match
TokenType.Application
TokenType.Equals
TokenType.StringValue, 'My App'
TokenType.And
TokenType.ExceptionType
TokenType.In
TokenType.OpenParenthesis
TokenType.StringValue, 'System.NullReferenceException'
TokenType.Comma
TokenType.StringValue, 'System.FormatException'
TokenType.CloseParenthesis
TokenType.Between
TokenType.DateTimeValue, 2016-01-01 10:00:00
TokenType.And
TokenType.DateTimeValue, 2016-01-01 11:00:00
TokenType.Limit
TokenType.Number, 100
TokenType.SequenceTerminator

There are multiple ways of tokenizing a stream of characters. Just as there are LL Recursive-Descent Parsers, there are LL Recursive-Descent Lexers. These tokenizers analyze the character stream one character at a time, using multiple levels of lookahead to identify which token is currently being examined. However, as this is our first foray into tokenizing, I want to put off LL Recursive-Descent algorithms for now and use a very simple, if inefficient, method.

Meet the Regex Tokenizer

The great thing about regex is that it makes it trivial to match complex patterns and map them to specific tokens. This algorithm is very memory inefficient, but considering that LQL queries are pretty small it doesn't really impact us. What is great is that it is super simple.

How the algorithm works

We create a list of regex patterns and associate each one with a token type. Then we iterate through the query text, checking whether the text at the current position matches a pattern. If it does, we create the token, create a new string with the matched text cut off the front, and continue checking. If it doesn't match, we create a new string with the current character cut off and do the check again, and so on.

It is inefficient as we generate multiple strings, each one slightly smaller than the one before. This method would not be viable on very large texts (if you care about performance and load).
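
As an illustration (a sketch of the intermediate states, not real program output), tokenizing the start of the query above proceeds roughly like this:

"MATCH APP = 'My App' ..."   matches ^match   ->  emit TokenType.Match,        remaining " APP = 'My App' ..."
" APP = 'My App' ..."        no match         ->  drop one character,          remaining "APP = 'My App' ..."
"APP = 'My App' ..."         matches ^app     ->  emit TokenType.Application,  remaining " = 'My App' ..."

...and so on, until the remaining text is empty or only whitespace, at which point we append TokenType.SequenceTerminator.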

TokenDefinition, TokenMatch and DslToken classes


public class TokenDefinition
{
    private readonly Regex _regex;
    private readonly TokenType _returnsToken;

    public TokenDefinition(TokenType returnsToken, string regexPattern)
    {
        _regex = new Regex(regexPattern, RegexOptions.IgnoreCase);
        _returnsToken = returnsToken;
    }

    public TokenMatch Match(string inputString)
    {
        var match = _regex.Match(inputString);
        if (match.Success)
        {
            string remainingText = string.Empty;
            if (match.Length != inputString.Length)
                remainingText = inputString.Substring(match.Length);

            return new TokenMatch()
            {
                IsMatch = true,
                RemainingText = remainingText,
                TokenType = _returnsToken,
                Value = match.Value
            };
        }
        else
        {
            return new TokenMatch() { IsMatch = false};
        }

    }
}

public class TokenMatch
{
    public bool IsMatch { get; set; }
    public TokenType TokenType { get; set; }
    public string Value { get; set; }
    public string RemainingText { get; set; }
}

public class DslToken
{
    public DslToken(TokenType tokenType)
    {
        TokenType = tokenType;
        Value = string.Empty;
    }

    public DslToken(TokenType tokenType, string value)
    {
        TokenType = tokenType;
        Value = value;
    }

    public TokenType TokenType { get; set; }
    public string Value { get; set; }

    public DslToken Clone()
    {
        return new DslToken(TokenType, Value);
    }
}
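
Before wiring these into the tokenizer, here is a minimal sketch of a TokenDefinition used on its own (the pattern is one of those we register below; the commented values are what Match returns for this input):

var matchKeyword = new TokenDefinition(TokenType.Match, "^match");

TokenMatch result = matchKeyword.Match("MATCH APP = 'My App'");
// result.IsMatch       == true   (RegexOptions.IgnoreCase lets "MATCH" match "^match")
// result.TokenType     == TokenType.Match
// result.Value         == "MATCH"
// result.RemainingText == " APP = 'My App'"

TokenMatch noMatch = matchKeyword.Match("LIMIT 100");
// noMatch.IsMatch == false; the other properties are left at their defaults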

In our tokenizer we create a list of definitions. Let's do that in the constructor.


private readonly List<TokenDefinition> _tokenDefinitions;

public Tokenizer()
{
	_tokenDefinitions = new List<TokenDefinition>();

	_tokenDefinitions.Add(new TokenDefinition(TokenType.And, "^and"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Application, "^application|^app"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Between, "^between"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.CloseParenthesis, "^\\)"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Comma, "^,"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Equals, "^="));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.ExceptionType, "^exception|^ex"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Fingerprint, "^fingerprint"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.NotIn, "^not in"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.In, "^in"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Like, "^like"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Limit, "^limit"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Match, "^match"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Message, "^message|^msg"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.NotEquals, "^!="));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.NotLike, "^not like"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.OpenParenthesis, "^\\("));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Or, "^or"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.StackFrame, "^stackframe|^sf"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.DateTimeValue, "^\\d\\d\\d\\d-\\d\\d-\\d\\d \\d\\d:\\d\\d:\\d\\d"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.StringValue, "^'[^']*'"));
	_tokenDefinitions.Add(new TokenDefinition(TokenType.Number, "^\\d+"));
}
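
One detail worth calling out: within the alternations the longer form comes first (for example "^application|^app"). .NET regex alternation tries alternatives left to right, so if "^app" were tried first, the word APPLICATION would be consumed as just APP and the leftover LICATION would later surface as an invalid token. More generally, if two patterns could match at the same position, the definition registered earlier in this list wins, because FindMatch (shown below) returns the first match it finds.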

The algorithm


public List<DslToken> Tokenize(string lqlText)
{
    var tokens = new List<DslToken>();
    string remainingText = lqlText;

    while (!string.IsNullOrWhiteSpace(remainingText))
    {
        var match = FindMatch(remainingText);
        if (match.IsMatch)
        {
            tokens.Add(new DslToken(match.TokenType, match.Value));
            remainingText = match.RemainingText;
        }
        else
        {
            remainingText = remainingText.Substring(1);
        }
    }

    tokens.Add(new DslToken(TokenType.SequenceTerminator, string.Empty));

    return tokens;
}

private TokenMatch FindMatch(string lqlText)
{
    foreach (var tokenDefinition in _tokenDefinitions)
    {
        var match = tokenDefinition.Match(lqlText);
        if (match.IsMatch)
            return match;
    }

    return new TokenMatch() {  IsMatch = false };
}
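
To see it in action, here is a minimal usage sketch (the output in the comments is abbreviated; it corresponds to the token sequence listed earlier):

var tokenizer = new Tokenizer();

var lql = "MATCH APP = 'My App' "
        + "AND EX IN ('System.NullReferenceException','System.FormatException') "
        + "BETWEEN 2016-01-01 10:00:00 AND 2016-01-01 11:00:00 "
        + "LIMIT 100";

List<DslToken> tokens = tokenizer.Tokenize(lql);

foreach (var token in tokens)
    Console.WriteLine(token.TokenType + " " + token.Value);

// Match MATCH
// Application APP
// Equals =
// StringValue 'My App'
// And AND
// ExceptionType EX
// In IN
// OpenParenthesis (
// ...
// Limit LIMIT
// Number 100
// SequenceTerminator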

That's it! But wait, there is one change that would make this more robust.

Detecting Mistakes

One attribute of a good DSL parsing system is informative errors when a user enters invalid syntax. Let's add detection of invalid syntax in the form of the TokenType.Invalid token.

Let's modify the algorithm to detect when the current text doesn't match any pattern and is not just whitespace.


public List<DslToken> Tokenize(string lqlText)
{
    var tokens = new List<DslToken>();

    string remainingText = lqlText;

    while (!string.IsNullOrWhiteSpace(remainingText))
    {
        var match = FindMatch(remainingText);
        if (match.IsMatch)
        {
            tokens.Add(new DslToken(match.TokenType, match.Value));
            remainingText = match.RemainingText;
        }
        else
        {
            if (IsWhitespace(remainingText))
            {
                remainingText = remainingText.Substring(1);
            }
            else
            {
                var invalidTokenMatch = CreateInvalidTokenMatch(remainingText);
                tokens.Add(new DslToken(invalidTokenMatch.TokenType, invalidTokenMatch.Value));
                remainingText = invalidTokenMatch.RemainingText;
            }
        }
    }

    tokens.Add(new DslToken(TokenType.SequenceTerminator, string.Empty));

    return tokens;
}

private TokenMatch FindMatch(string lqlText)
{
    foreach (var tokenDefinition in _tokenDefinitions)
    {
        var match = tokenDefinition.Match(lqlText);
        if (match.IsMatch)
            return match;
    }

    return new TokenMatch() {  IsMatch = false };
}

private bool IsWhitespace(string lqlText)
{
    return Regex.IsMatch(lqlText, "^\\s+");
}

private TokenMatch CreateInvalidTokenMatch(string lqlText)
{
    var match = Regex.Match(lqlText, "(^\\S+\\s)|^\\S+");
    if (match.Success)
    {
        return new TokenMatch()
        {
            IsMatch = true,
            RemainingText = lqlText.Substring(match.Length),
            TokenType = TokenType.Invalid,
            Value = match.Value.Trim()
        };
    }

    throw new DslParserException("Failed to generate invalid token");
}
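
With this change an unrecognised word no longer disappears silently. For example (illustrative query, abbreviated result):

var tokens = new Tokenizer().Tokenize("MATCH APP = 'My App' FOO");

// tokens now ends with:
//   (StringValue, 'My App')
//   (Invalid, FOO)
//   (SequenceTerminator, "")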

Next we'll look at grammars and how they will help us to craft a parser to interpret this list of tokens.

SERIES LINKS

How to Create a Query Language DSL in C#
--> Creating a Simple Tokenizer (Lexer) in C#
Understanding Grammars
Implementing a DSL Parser in C#
Generating SQL from a Data Structure

I have since written about a similar tokenizer with much better performance.