A More Efficient Regex Tokenizer

Example code on GitHub

As part of a DSL parsing series, I wrote a post about a super simple but memory-inefficient way of tokenizing input text. The benefit was that the tokenizer was extremely simple; the downside was that it wasn't suitable for large texts or for being called frequently.

In this post we'll look at a similar regex-based tokenizer and trade a little simplicity for a lot of performance.

The main problem with the Super Simple Regex Tokenizer is that it creates lots of substrings. Every time it finds a match it takes the match value and returns the original text with the captured value cut out, so we end up creating loads of ever smaller copies of the original text. Because each match copies most of the remaining text, memory allocation grows quadratically with the size of the text.
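To see why, here is a minimal sketch of the pattern the Super Simple tokenizer follows. This is my illustration of the shape of the problem, not the original post's code:

// Hypothetical sketch of the Super Simple approach's hot loop (not the
// original post's exact code). Each Substring call copies the entire
// remaining tail of the text, so a text with m matches copies on the
// order of m * n characters in total.
var regex = new Regex("\\d+");
var remainingText = "BETWEEN 2016-01-01 10:00:00 AND 2016-01-02 10:00:00";
Match match;
while ((match = regex.Match(remainingText)).Success)
{
    // ... emit a token for match.Value ...

    // This line is the problem: it allocates a brand new string
    // containing the whole tail of the input on every single match.
    remainingText = remainingText.Substring(match.Index + match.Length);
}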

The Precedence Based Regex Tokenizer works differently. Instead of moving along the text sequentially checking which TokenDefinitions match, we gather all the matches that exist into a single list. Each TokenDefinition has a precedence assigned, and for each match we note the Start and End index and record that precedence. We then group the matches by Start index and, for each group, take the match with the highest precedence (lowest value). As we loop through the groups we ignore matches that fall within the index range of the previous accepted match, so that matches don't overlap.

Precedence in Action

As an example of how precedence takes part in the process we'll look at dates and numbers. Imagine we have the following text:

BETWEEN 2016-01-01 10:00:00 AND 2016-01-02 10:00:00
LIMIT 100

We have a DateTimeValue TokenDefinition with the regex pattern "\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d" and the NumberValue TokenDefinition with regex pattern "\d+".

The DateTimeValue pattern will match the whole date and the NumberValue pattern will match each numeric component of the date as well as the limit number. The Start index of the date match and of the 2016 number match will be the same, but if we assign the DateTimeValue a precedence of 1 and the NumberValue a precedence of 2 then the conflict is resolved in the date's favour. The NumberValue pattern will still match the month, day, hour, minute and second components, but these are ignored because they overlap with the DateTimeValue match.
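To make that concrete, here are the matches the tokenizer would gather at and around the start of the first date. The index arithmetic assumes the sample text above with BETWEEN at index 0; this is an illustration, not output from the real code:

// Group at StartIndex 8 - two definitions match, precedence decides:
//   DateTimeValue  "2016-01-01 10:00:00"  Start=8  End=27  Precedence=1  <- selected
//   NumberValue    "2016"                 Start=8  End=12  Precedence=2  <- loses the group
//
// Later NumberValue matches fall inside the winner's range [8, 27)
// and are skipped as overlaps:
//   NumberValue "01" Start=13 End=15
//   NumberValue "01" Start=16 End=18
//   NumberValue "10" Start=19 End=21, and so on.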

The Code

We define the TokenType enum as before:


public enum TokenType
{
    NotDefined,
    And,
    Application,
    Between,
    CloseParenthesis,
    Comma,
    DateTimeValue,
    Equals,
    ExceptionType,
    Fingerprint,
    In,
    Invalid,
    Like,
    Limit,
    Match,
    Message,
    NotEquals,
    NotIn,
    NotLike,
    NumberValue,
    Or,
    OpenParenthesis,
    StackFrame,
    StringValue,
    SequenceTerminator
}

The TokenDefinition stores a precedence and has a FindMatches() method, which returns the TokenMatch objects for a given input; each TokenMatch records the match's Start and End index, its value, its token type and its precedence.


using System.Collections.Generic;
using System.Text.RegularExpressions;

public class TokenDefinition
{
    private readonly Regex _regex;
    private readonly TokenType _returnsToken;
    private readonly int _precedence;

    public TokenDefinition(TokenType returnsToken, string regexPattern, int precedence)
    {
        _regex = new Regex(regexPattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
        _returnsToken = returnsToken;
        _precedence = precedence;
    }

    public IEnumerable<TokenMatch> FindMatches(string inputString)
    {
        // Yield every occurrence of the pattern, recording where each
        // match starts and ends (EndIndex is exclusive) and its precedence.
        foreach (Match match in _regex.Matches(inputString))
        {
            yield return new TokenMatch()
            {
                StartIndex = match.Index,
                EndIndex = match.Index + match.Length,
                TokenType = _returnsToken,
                Value = match.Value,
                Precedence = _precedence
            };
        }
    }
}

public class TokenMatch
{
    public TokenType TokenType { get; set; }
    public string Value { get; set; }
    public int StartIndex { get; set; }
    public int EndIndex { get; set; }
    public int Precedence { get; set; }
}
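
Before wiring everything together, FindMatches can be tried in isolation. Here is a minimal usage sketch; the sample text and variable names are mine, not from the original post:

var numberDefinition = new TokenDefinition(TokenType.NumberValue, "\\d+", 2);

foreach (var m in numberDefinition.FindMatches("LIMIT 100"))
    Console.WriteLine($"{m.TokenType} '{m.Value}' at [{m.StartIndex}, {m.EndIndex})");

// Prints: NumberValue '100' at [6, 9)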

The tokenizer class defines the TokenDefinitions in its constructor. Precedence matters whenever two TokenDefinitions can produce a match with the same Start index. For our DSL there is only one such conflict: DateTimeValue versus NumberValue. We assign the DateTimeValue TokenDefinition a precedence of 1 and the NumberValue TokenDefinition a precedence of 2, so when both match at the start of a date, the DateTimeValue token is selected as the best match.


private readonly List<TokenDefinition> _tokenDefinitions;

public PrecedenceBasedRegexTokenizer()
{
    _tokenDefinitions = new List<TokenDefinition>();

    // Longer alternatives come first ("application|app") because .NET
    // regex alternation is ordered; "app|application" would capture only
    // "app" at the start of "application".
    _tokenDefinitions.Add(new TokenDefinition(TokenType.And, "and", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Application, "application|app", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Between, "between", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.CloseParenthesis, "\\)", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Comma, ",", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Equals, "=", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.ExceptionType, "exception|ex", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Fingerprint, "fingerprint", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.NotIn, "not in", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.In, "in", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Like, "like", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Limit, "limit", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Match, "match", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Message, "message|msg", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.NotEquals, "!=", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.NotLike, "not like", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.OpenParenthesis, "\\(", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.Or, "or", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.StackFrame, "stackframe|sf", 1));

    // DateTimeValue gets precedence 1 and NumberValue precedence 2, so a
    // date wins when both match at the same Start index.
    _tokenDefinitions.Add(new TokenDefinition(TokenType.DateTimeValue, "\\d\\d\\d\\d-\\d\\d-\\d\\d \\d\\d:\\d\\d:\\d\\d", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.StringValue, "'([^']*)'", 1));
    _tokenDefinitions.Add(new TokenDefinition(TokenType.NumberValue, "\\d+", 2));
}

First we loop through all the definitions and gather every match. Next we group the matches by Start index and order the groups. Then, looping through the groups, we select the highest precedence (lowest value) match in each, ignoring any match that falls within the range of the last accepted match.


public IEnumerable<DslToken> Tokenize(string errorMessage)
{
    var tokenMatches = FindTokenMatches(errorMessage);

    // Group all matches by their Start index and process the groups
    // in document order.
    var groupedByIndex = tokenMatches.GroupBy(x => x.StartIndex)
        .OrderBy(x => x.Key)
        .ToList();

    TokenMatch lastMatch = null;
    for (int i = 0; i < groupedByIndex.Count; i++)
    {
        // Within a group, the lowest precedence value wins.
        var bestMatch = groupedByIndex[i].OrderBy(x => x.Precedence).First();

        // Skip matches that start inside the previous accepted match.
        // EndIndex is exclusive, so a match starting exactly at
        // lastMatch.EndIndex is allowed.
        if (lastMatch != null && bestMatch.StartIndex < lastMatch.EndIndex)
            continue;

        yield return new DslToken(bestMatch.TokenType, bestMatch.Value);

        lastMatch = bestMatch;
    }

    yield return new DslToken(TokenType.SequenceTerminator);
}

private List<TokenMatch> FindTokenMatches(string errorMessage)
{
    var tokenMatches = new List<TokenMatch>();

    // Gather the matches from every TokenDefinition into one flat list.
    foreach (var tokenDefinition in _tokenDefinitions)
        tokenMatches.AddRange(tokenDefinition.FindMatches(errorMessage));

    return tokenMatches;
}
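
Putting it all together, here is a minimal usage sketch. It assumes the DslToken type from the earlier post in the series exposes a TokenType property; the exact shape may differ:

var tokenizer = new PrecedenceBasedRegexTokenizer();
var query = "BETWEEN 2016-01-01 10:00:00 AND 2016-01-02 10:00:00 LIMIT 100";

foreach (var token in tokenizer.Tokenize(query))
    Console.WriteLine(token.TokenType);

// Expected sequence: Between, DateTimeValue, And, DateTimeValue,
// Limit, NumberValue, SequenceTerminator.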

Performance

Let's look at the performance profiles of these two approaches. The chart below shows the total memory allocated in MB when executing the tokenizers 1000 times. I used four different texts of 150, 1000, 1800 and 2700 characters.

The Precedence Based tokenizer's memory allocation grew almost linearly with the size of the text being tokenized, whereas the Super Simple tokenizer's grew with a quadratic curve, allocating almost 1GB for 1000 executions over a text of 2700 characters.
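That quadratic growth is easy to sanity-check with a rough back-of-envelope estimate (my numbers, not measurements from the post): if the 2700-character text yields a match roughly every ten characters, the Super Simple tokenizer makes around 270 substring copies, each averaging half the text, for roughly 270 × 1350 ≈ 365,000 characters per run. At two bytes per character that is about 730KB per execution, or around 730MB over 1000 executions, which is in the right ballpark for the observed figure.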

The Precedence Based Regex Tokenizer is a little more complex, and scenarios with large volumes of conflicting matches, or simply a huge number of total matches, would add to that complexity. But for a simple DSL like this one I would say that both approaches are equally simple and easy to modify and tweak.

In future posts I will look at more complex, more performant tokenizers for cases where performance is critical and worth the trade-off in simplicity.