With Great Observation Comes Great Insight

In this post I'm going to talk about how I leverage machine automation and human observation as a powerful combination, both for testing and for creating better guidance for the community. I tentatively call the approach automated exploratory testing.

I joined the RabbitMQ core team back in September 2019. I believe I was hired not because of my awesome Erlang skills (which are laughable) but because of my work in the community as both a tester and a writer. The team is small but diverse, and each person brings different strengths, with years of experience and formidable knowledge across a wide variety of areas. My testing, formal verification and writing skills complement the team nicely, and I have concentrated on those strengths since joining.

Specifically, I'll cover how I approach testing and how combining automation with experimentation helps make RabbitMQ better and helps craft guidance for the RabbitMQ community.

First of all, the various projects that make up RabbitMQ have a huge number of automated tests: unit tests, integration tests, property tests, Jepsen tests, long-running environments and so on. It didn't take me long to see that the team had automated testing covered. In fact, they had to help me learn about Erlang test frameworks and the Erlang way of testing distributed Erlang components. We discussed writing a blog post about our test infrastructure and the lengths we go to in order to make sure RabbitMQ ships as a dependable data service.

What your CI tests don't tell you

CI tests don't give you a feel for the system they are testing. They are great for checking that things work, but they don't help you get to know the system - they tell you little about its behaviour under different conditions. You can only get to know a system by tinkering with it and observing it. The more tinkering and observation you do, the better you get to know it.

Question Driven Testing or Curiosity Driven Testing

As you tinker, questions arise - questions for which you don't have an answer. So you run the system again and again under different conditions until you have answered them. The answers might simply be knowledge of how it behaves, or the discovery of a problem, or more questions that require more testing. Below are some of the questions I had when I started looking at quorum queues:

  • What is the memory profile when using quorum queues under different loads?

  • What is the memory profile of a large quorum queue with no activity? (see the sketch after this list)

  • How do throughput and latency change as a quorum queue grows larger?

  • What is the recovery time of a queue on broker start-up, as a function of queue size?

  • How does queue count affect throughput and latency?

  • What is best: 3 large VMs or 7 small ones?

  • What is quorum queue performance like on HDDs compared to SSDs?

  • How long is availability lost, if at all, for a quorum queue when a leader goes down? How does that change as a function of total system load?

  • Can a quorum queue recover when it has permanently lost a majority of its brokers? Is there any way out?

  • What tends to be the resource bottleneck, if any, of various quorum queue workloads?

  • What effect do publisher confirms and consumer acks have on quorum queues?

  • What happens to throughput and latency if I change quorum queue defaults?

  • When do mirrored queues do better?
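
To make one of these concrete, take the question about the memory profile of a large, idle quorum queue. Here is a minimal sketch of how it could be probed, assuming the management plugin is enabled on its default port with default credentials (the queue name and sampling interval are purely illustrative):

    import time
    import requests  # polls the RabbitMQ management HTTP API

    QUEUE = "qq.memory-probe"  # hypothetical quorum queue, pre-filled with messages
    URL = f"http://localhost:15672/api/queues/%2F/{QUEUE}"

    # Sample the queue's reported memory footprint once a minute while it sits idle.
    for _ in range(60):
        q = requests.get(URL, auth=("guest", "guest")).json()
        print(f"{time.strftime('%H:%M:%S')} messages={q['messages']} memory={q['memory']} bytes")
        time.sleep(60)

Watching those numbers over an hour or two tells you far more about memory behaviour than any pass/fail assertion could.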

One question then leads to further questions.

For example, the question: "What is quorum queue performance like on HDDs compared to SSDs?" led to more questions:

  • What is single queue performance?

  • What is the performance of hundreds of queues?

  • How do the IO characteristics of quorum queues differ from those of classic or mirrored queues?

  • What is the effect of a mixed workload, i.e. mixed IO patterns (classic and quorum)?

  • How does isolating the different disk accesses (WAL, segment files, message store, logs) onto separate disks affect performance?

  • What happens when the disk throughput limit is reached?

  • Do high IOPS disks do better than low IOPS disks, or do quorum queues simply adjust to the lower IOPS limit and perform fewer, larger operations?

  • What is the most cost effective disk configuration?

Or the question: "What effect do publisher confirms and consumer acks have on quorum queues" led to:

  • Is TCP back-pressure enough to protect a broker when under high load?

  • Does using publisher confirms as a flow control mechanism help avoid broker overload?

  • How much does network latency affect the choice of in-flight limit?

  • Is using the multiple flag with acks beneficial, and if so, by how much? When is it counterproductive? (see the sketch after this list)

  • Are there any rules of thumb regarding confirms and performance?

  • Can messages be lost if publisher confirms are not used but there are no connection failures or downed brokers? (RabbitMQ provides no guarantees without confirms.) If a stress test can cause message loss, how much stress is required and for how long?
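
To experiment with these, I need clients where confirms and acks can be switched on and off and tuned. As a rough illustration of the knobs involved (not my actual test harness), here is a sketch using the Python client, pika; the queue name and batch size are illustrative:

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="qq.confirms-demo", durable=True,
                     arguments={"x-queue-type": "quorum"})

    # Publisher side: enable confirms; pika's blocking channel then waits for the
    # broker to confirm each publish before returning.
    ch.confirm_delivery()
    for i in range(1000):
        ch.basic_publish(exchange="", routing_key="qq.confirms-demo",
                         body=f"message {i}".encode())

    # Consumer side: ack every 50th delivery with multiple=True so one ack
    # acknowledges the whole preceding batch.
    def on_message(channel, method, properties, body):
        if method.delivery_tag % 50 == 0:
            channel.basic_ack(delivery_tag=method.delivery_tag, multiple=True)

    ch.basic_consume(queue="qq.confirms-demo", on_message_callback=on_message)
    ch.start_consuming()

Varying the confirm mode, the ack batch size and the in-flight limit across otherwise identical runs is what turns these questions into measurable comparisons.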

Or the question "What happens to throughput and latency if I change internal quorum queue defaults?" led to me running hundreds of tests with different default values. For example, I ran it with quorum_commands_soft_limit set to 1, then 2, then 4, then 8, then 16... 1 million. The default was 512, but the optimum setting ended up being 32 when taking into account a variety of workloads. In fact, reducing this value played a huge role in making quorum queues act better under a stress test.

It's amazing how many optimisations and bugs can be discovered just by being curious and chasing down the answers. For example, we discovered that mirrored queues were better under heavy load than quorum queues. "Why?" and "How?" questions led to the discovery of bottlenecks, which led to optimisations, and now, as of 3.8.4, mirrored queues rarely have an edge and are often outclassed.

The final benefit is that once you understand behaviour you can craft guidance.

It does all the things

RabbitMQ does a lot of things. This makes it both interesting and daunting from a tester's perspective. In fact, the hardest part about RabbitMQ doing so many things is that it becomes difficult to simplify when crafting guidance on expert usage. For example, we have three queue types and each has different performance characteristics. Having run thousands of tests by now, I have a pretty good idea of the behaviour under a variety of conditions, but communicating all the nuance is tough. Formulating rules of thumb is indispensable, and if that can't be done then educating the community on how to find the answers themselves is the next best thing.

Human does test design, automation does execution, human does analysis

With all these questions, I need to be able to create the different conditions, place workloads into those conditions, measure performance, record the environment (metrics), monitor invariants and so on. For that I need a powerful automation framework that does the execution part: it deploys the servers, installs RabbitMQ, runs the specific workload, gathers the metrics and logs, and detects invariant violations such as message loss. It is a partly automated approach, which is why I call it automated exploratory testing.
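
The invariant-violation part is conceptually simple, even if the real framework does it at a much larger scale: every message the publisher received a confirm for must eventually be consumed, and duplicates (which are expected under connection failures) are worth counting too. A toy version of that post-run check, with hypothetical sequence numbers:

    from collections import Counter

    def check_invariants(published: list[int], consumed: list[int]) -> dict:
        """Hypothetical post-run check: 'published' holds the sequence numbers of
        confirmed publishes, 'consumed' holds what the consumers actually saw."""
        lost = sorted(set(published) - set(consumed))  # confirmed but never consumed
        duplicates = sorted(s for s, n in Counter(consumed).items() if n > 1)
        return {"lost": lost, "duplicates": duplicates}

    # Example: message 3 was confirmed but never consumed; message 5 arrived twice.
    print(check_invariants(published=[1, 2, 3, 4, 5], consumed=[1, 2, 4, 5, 5]))
    # -> {'lost': [3], 'duplicates': [5]}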

I start with a question, then I design a large number of tests that probe that question in different ways and under different conditions. The test automation runs it all, but in the end I have to interpret the results, which takes a lot of analysis: checking performance metrics against system metrics, reading logs, and comparing results against the same test run under different conditions.

Observation takes time.

Kyle Kingsbury, the author of the Jepsen tests, noted that his testing is not fully automated either, but requires experimentation and analysis too:

Each report is typically the product of months of experimental work; it's not like Jepsen is a pass-fail test suite that gives immediately accurate results. There is, unfortunately, a lot of subtle interpretive work that goes into figuring out if a test is doing something meaningful, and a lot of that work needs to be repeated on each test run. Think, like... staring at the logs and noticing that a certain class of exception is being caught more often than you might have expected, and realizing that a certain type of transaction now triggers a new conflict detection mechanism which causes higher probabilities of aborts; those aborts reduce the frequency with which you can observe database state, allowing a race condition to go un-noticed. That kinda thing.

https://news.ycombinator.com/item?id=23291847

You can't automate that part because you don't always know what you are looking for. The important thing is to have a powerful automation and observability framework that is capable of realising any test you need and of running many experiments concurrently. Typically, when I first start trying to answer a question, I don't know the best configuration with which to test it. So I'll run multiple configurations, both in parallel and one after another, with very short tests. It's like a scatter-gather phase where I try a lot of things and identify the best ones. That is phase 1.

Once I've identified the best configurations, I start narrowing things down, running fewer configurations for longer.

Comparison

One key aspect of my testing revolves around comparisons. A performance or resource utilisation result by itself is not very informative, but when we compare it to another test that is identical except for one difference (SSD vs HDD, for example), we start to tease out how environment, workload and hardware affect behaviour.

I tend to run different configurations simultaneously and observe their behaviours together in real time. At the end of a run, I'll do a simple statistical analysis comparing the runs, which gives me another indicator of which configuration was superior. Each run often comprises tens of tests, and each configuration is run many times in parallel.

For example, if I run an HDD vs SSD test, I'll run each configuration three times in parallel, so six clusters. That also tells me how much variance there is between runs and lets me find outliers to be investigated or potentially discarded. Running the tests in the cloud also introduces a fair amount of variability, so multiple runs are required.
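
The statistical analysis really is simple. Roughly, the comparison step looks something like this, with made-up throughput samples standing in for the three runs of each configuration (the real analysis also looks at latency percentiles and system metrics, and the 15% spread threshold is just illustrative):

    from statistics import mean, stdev

    # Made-up average throughput (msg/s) from three parallel runs of each configuration.
    runs = {
        "ssd": [41_200, 39_800, 40_500],
        "hdd": [18_900, 23_400, 12_100],
    }

    for config, samples in runs.items():
        m, s = mean(samples), stdev(samples)
        spread = (max(samples) - min(samples)) / m
        flag = "  <- high variance, investigate before trusting" if spread > 0.15 else ""
        print(f"{config}: mean={m:.0f} msg/s, stdev={s:.0f}, spread={spread:.0%}{flag}")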

How Experimentation and Observation Improved Quorum Queues

All told, I've probably dedicated about 50% of my time to automated exploratory testing of various RabbitMQ features, building the automation and sharing some of what I have learned in blog posts. The most productive time has been working with Karl Nilsson, the main developer behind quorum queues. We form a tight feedback loop in which we work together to diagnose issues, fix them and verify the fixes. We've been able to mature quorum queues pretty fast through this collaboration.

The most recent example was the improvements to Ra, our Raft library that quorum queues are built on. Under heavy load, memory usage would climb. We checked what was going on under the hood and discovered that every time it happened, the segment writer had fallen behind the write-ahead log (WAL). Message bodies couldn't be safely removed from memory, so memory usage grew until the memory alarms activated, causing large swings in throughput. We experimented with different ways to make the IO faster, including running fio benchmarks.

The conclusion was that parallelising the segment file writing would be the most effective. So Karl implemented a change to Ra to go from a single segment writer per node to a writer per scheduler (per core). I created a RabbitMQ build with the change, tested again and verified that the bottleneck was removed. No stress test could make the segment writer fall behind anymore. Now the WAL is the bottleneck, but it is less often a problem and is an area of research for future improvements.

This change, along with better defaults, shipped in 3.8.4 and makes a big difference to quorum queue stability under heavy load. Since 3.8.0 was released we've continued to invest in making quorum queues better: they use 30% less memory under load and 90% less memory at rest, deliver 30% higher throughput, and handle overload gracefully. Additionally, with all the tests that have been run, bugs have been identified and fixed.

Final Thoughts

Automated testing in the CI system is the bedrock of any testing strategy, and it gets you most of the way. But the last 20% only comes through experimentation and observation.

Observation leads to insights, which lead to both improvements in the product and better community guidance. This all requires tooling to create and observe the scenarios, but it also requires a curious mind.

Thanks for reading :)

Banner image credit: Luxy Images/ESO