Although we didn't realise it at the time, this decision was the making of the platform: it has become a key component of the architecture, used by the support, development, architecture and operations teams. The key has been its simplicity: anyone can write queries, and yet it is very powerful.
Querying the Error Log
LQL has four types of object:
- Applications
- Exception Types
- Exception Messages
- Stack Frames
Some example queries would be:
Get me all System.OutOfMemoryException errors in ApplicationXYZ during the first two days of December 2015
MATCH (APP = 'ApplicationXYZ' AND EX = 'System.OutOfMemoryException') BETWEEN 2015-12-01 AND 2015-12-02
Get me all the errors that occurred in method PublishEvents in ApplicationXYZ during the last week
MATCH (APP = 'ApplicationXYZ' AND SF LIKE 'PublishService.PublishEvents(MyEvent event)') BETWEEN 2015-12-31 AND 2016-01-06
Get me all the timeout errors that occurred in the last week, where the number in one hour exceeded 1000
MATCH (MSG LIKE 'timeout exceeded') BETWEEN 2015-12-31 AND 2016-01-06 FREQUENCY > 1000 IN 1h
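To give a feel for the mechanics, here is a minimal Python sketch of how a subset of such queries could be parsed and evaluated against in-memory error records. The grammar subset, the field-name mapping and the record shape are inferred from the examples above (and LIKE is treated as a simple substring match); the real platform evaluates queries against SQL Server full-text search, not in memory like this.

```python
import re
from datetime import date

# Assumed mapping from LQL field names to error-record keys.
FIELDS = {"APP": "app", "EX": "exception", "MSG": "message", "SF": "stack_frame"}

def parse(lql):
    """Compile `MATCH (...) BETWEEN yyyy-mm-dd AND yyyy-mm-dd` into a predicate
    over error dicts. Only = and LIKE conditions joined by AND are handled."""
    m = re.match(r"MATCH \((.+)\) BETWEEN (\S+) AND (\S+)", lql)
    conds = m.group(1)
    start, end = date.fromisoformat(m.group(2)), date.fromisoformat(m.group(3))
    tests = []
    for clause in conds.split(" AND "):
        field, op, value = re.match(r"(\w+) (=|LIKE) '(.+)'", clause).groups()
        attr = FIELDS[field]
        if op == "=":
            tests.append(lambda e, a=attr, v=value: e[a] == v)
        else:  # LIKE is modelled as a substring match in this sketch
            tests.append(lambda e, a=attr, v=value: v in e[a])
    return lambda e: start <= e["when"] <= end and all(t(e) for t in tests)
```

A query compiled this way can then filter a list of error records:

```python
pred = parse("MATCH (APP = 'ApplicationXYZ' AND EX = 'System.OutOfMemoryException') "
             "BETWEEN 2015-12-01 AND 2015-12-02")
matches = [e for e in errors if pred(e)]
```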
Before I created LQL I envisaged having to build bespoke reports and alerts, but once LQL existed I could use it for alerts as well as ad hoc queries.
This again has been successful because anyone can create an alert using the simple DSL and subscribe to it for as long as they want, from one day to forever. They get notified when errors occur that match their alerts and they can view the alerts in a timeline view.
Example use cases:
A team deploys a new API function, and they want to be alerted of any type of error that occurs within that API call. So they create an alert with one or more stack frames that match the API call and give it a two-week Time To Live (TTL).
The Operations team create alerts related to the server platform, such as web service activation failures, messaging platform errors and insufficient memory errors.
The support team has many more alerts and shares many of the ops team's alerts.
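The TTL-bounded subscriptions described above could be modelled along these lines. This is a hedged sketch: the `Alert` type, its field names and its methods are illustrative, not the platform's actual API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Optional

@dataclass
class Alert:
    name: str
    predicate: Callable[[dict], bool]  # compiled from the user's LQL query
    created: date
    ttl_days: Optional[int] = None     # None models "subscribe forever"

    def is_live(self, today: date) -> bool:
        # An alert with no TTL never expires.
        if self.ttl_days is None:
            return True
        return today <= self.created + timedelta(days=self.ttl_days)

    def fires(self, error: dict, today: date) -> bool:
        # Notify only while the subscription is live and the error matches.
        return self.is_live(today) and self.predicate(error)
```

Under this model, the API-deployment use case above is an `Alert` whose predicate matches the relevant stack frames and whose `ttl_days` is 14.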
In addition to ad hoc queries and alerts, LQL is also used for trend analysis. Users can create a query, and the number of errors matching that query over a time period is visualised on a timeline chart.
MATCH (MSG LIKE 'timeout exceeded') BETWEEN 2015-12-01 AND 2015-12-31 COUNT BY 1d
One example of this was when we changed the server architecture in order to reduce out-of-memory errors. We were able to track the number of memory-related errors with the following query:
MATCH (MSG LIKE 'memory gates checking failed' OR EX IN ('System.OutOfMemoryException', 'System.InsufficientMemoryException')) BETWEEN 2015-12-01 AND 2015-12-31 COUNT BY 1h
We monitored the trend analysis over a period after the change to ensure there was real improvement.
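The COUNT BY clause can be thought of as bucketing the matching errors into fixed intervals and counting each bucket. A minimal sketch, assuming simple timestamp truncation for the 1h and 1d granularities seen in the queries above:

```python
from collections import Counter
from datetime import datetime

def count_by(timestamps, bucket="1d"):
    """Count datetimes per bucket; supports the 1h and 1d granularities
    used in the example COUNT BY queries."""
    def key(ts):
        if bucket == "1h":
            # Truncate to the start of the hour.
            return ts.replace(minute=0, second=0, microsecond=0)
        # Truncate to the calendar day.
        return ts.date()
    return Counter(key(ts) for ts in timestamps)
```

The resulting bucket-to-count mapping is exactly what a timeline chart needs: one point per hour or per day.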
We are considering migrating to Elasticsearch. We currently use SQL Server with its full-text search, and we would like to leverage the performance of ES and its percolator for real-time notifications. We currently have about 150 million errors in our database and still get sub-second to five-second query performance, but we have made sacrifices to achieve this: we use a dimensional model coupled with full-text search, and while we store all data about the stack trace and exception type, we sample the exception text to reduce the size of the data. With Elasticsearch we would not need such compromises.
Elasticsearch comes with Kibana, which is more powerful than our current interface in terms of full-text search and visualisations. But I don't see it matching the ease of use of our current platform. The success of our logging platform comes from the fact that anyone can write ad hoc queries, create alerts and create trend analyses, and all they have to learn is the very simple query syntax. The Elasticsearch API is a joy to use for a developer who understands ES and its mappings, but it is not accessible to non-ES experts.
LQL was my first domain-specific language, and having seen its success I really understand why DSLs are so useful. They democratise access to data and computation that would not otherwise be accessible to Subject Matter Experts (SMEs), who are the people best placed to take advantage of the data. I am definitely looking at future use cases for other DSLs within the organisation.