How to Lose Messages on a Kafka Cluster - Part 1

Chaos testing Apache Kafka with Blockade, Docker, Python and Bash

In my previous post I used Blockade, Python and some Bash scripts to test a RabbitMQ cluster under various failure conditions such as failed nodes, network partitions, packet loss and a slow network. The aim was to find out how and when a RabbitMQ cluster loses messages. In this post we’ll do exactly the same but with a Kafka cluster. We’ll use our knowledge of the inner workings of Kafka and Zookeeper to induce failure modes that lead to message loss. Please read my post on Kafka fault tolerance first, as this post assumes you understand the basics of acknowledgements and the replication protocol.

In each scenario we’ll publish messages to a topic called “test1” while introducing failures. At the end we see how many messages are lost. Because Kafka has more components and more configurations than RabbitMQ, we’ll do this in two parts. In Part 1 we’ll concentrate on node failures and network partitions. In Part 2 we’ll take some of the scenarios from Part 1 and add a flaky/slow network.

I have created a few helper bash scripts to perform tasks such as recreating the cluster between scenarios, creating topics and so on. You’ll find the scripts in the GitHub repo.

When there is no Blockade cluster up and running I run the following commands:

$ blockade up
$ bash update-hosts.sh
$ bash create-topic.sh kafka1 test1

After that, to recreate the cluster between each scenario I run:

$ bash reset-cluster.sh
$ bash create-topic.sh kafka1 test1

Scenario 1 - Fire-and-forget with a failed node and partition leader fail-over

In this scenario we’ll send 100000 messages with acks=0, meaning that the client does not require any acknowledgements back. At about the 30000 mark, I kill the node that hosts the leader partition.

First we’ll create the topic:

$ bash create-topic.sh kafka1 test1
Created topic "test1".
Topic:test1    PartitionCount:1    ReplicationFactor:3    Configs:
Topic: test1    Partition: 0    Leader: 3    Replicas: 3,1,2    Isr: 3,1,2

Next we kick off the publishing of the messages. It sends 100000 messages to the topic “test1” at a rate of 10000 messages a second. It uses acks mode 0 (fire-and-forget).
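
The producer script itself is not reproduced here, but the rdkafka lines in the output below suggest the confluent-kafka Python client (a librdkafka wrapper). A minimal sketch of such a producer, assuming the same argument order as the commands in this post (count, delay between sends, acks, topic); details will differ from the script in the repo:

# producer.py sketch: an assumed reconstruction with the confluent-kafka client,
# not the exact code from the repo
import sys
import time
from confluent_kafka import Producer

msg_count = int(sys.argv[1])    # e.g. 100000
delay = float(sys.argv[2])      # e.g. 0.0001, giving roughly 10000 messages a second
acks = sys.argv[3]              # "0", "1" or "all"
topic = sys.argv[4]             # e.g. "test1"

delivered = 0
failed = 0

def on_delivery(err, msg):
    # delivery report callback, served by poll()/flush()
    global delivered, failed
    if err is None:
        delivered += 1
    else:
        failed += 1
    if (delivered + failed) % 10000 == 0:
        print('Success: %d Failed: %d' % (delivered, failed))

producer = Producer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9093,kafka3:9094',  # assumed addresses
    'acks': acks,               # librdkafka alias for request.required.acks
})

for i in range(1, msg_count + 1):
    producer.produce(topic, value=str(i), callback=on_delivery)
    producer.poll(0)            # serve pending delivery callbacks
    time.sleep(delay)

producer.flush()                # wait for all outstanding delivery reports
print('Sent: %d' % msg_count)
print('Delivered: %d' % delivered)
print('Failed: %d' % failed)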

$ python producer.py 100000 0.0001 0 test1
Success: 10000 Failed: 0
Success: 20000 Failed: 0
Success: 30000 Failed: 0
%3|1537022288.472|FAIL|rdkafka#producer-1| [thrd:172.17.0.5:9094/bootstrap]: 172.17.0.5:9094/3: Receive failed: Connection reset by peer
%3|1537022288.472|ERROR|rdkafka#producer-1| [thrd:172.17.0.5:9094/bootstrap]: 172.17.0.5:9094/3: Receive failed: Connection reset by peer
%3|1537022288.572|FAIL|rdkafka#producer-1| [thrd:172.17.0.5:9094/bootstrap]: 172.17.0.5:9094/3: Connect to ipv4#172.17.0.5:9094 failed: Connection refused
%3|1537022288.572|ERROR|rdkafka#producer-1| [thrd:172.17.0.5:9094/bootstrap]: 172.17.0.5:9094/3: Connect to ipv4#172.17.0.5:9094 failed: Connection refused
Success: 40000 Failed: 0
Success: 50000 Failed: 0
Success: 60000 Failed: 0
Success: 70000 Failed: 0
Success: 80000 Failed: 0
Success: 90000 Failed: 0
Success: 100000 Failed: 0
Sent: 100000
Delivered: 100000
Failed: 0

At about the 30000 message mark we tell Blockade to kill kafka3 which hosts the partition leader.

$ blockade kill kafka3

Once publishing is complete we run a script to get the current high watermark of the topic. This will tell us how many messages have been persisted to the topic. The print-hw.sh script takes three arguments: the broker to run the command on, the internal port and the topic name.
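
The script runs the offset query on one of the brokers, but the same number can also be read from Python, since the confluent-kafka client exposes it through get_watermark_offsets. A hedged sketch of that alternative (the broker address is an assumption):

# print-hw.sh alternative: read the high watermark from Python (assumed approach)
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'kafka2:9093',   # assumed address of a surviving broker
    'group.id': 'hw-check',               # required by the client, otherwise unused
})

# low and high watermarks of partition 0 of the test1 topic
low, high = consumer.get_watermark_offsets(TopicPartition('test1', 0), timeout=10)
print('test1:0:%d' % high)
consumer.close()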

$ bash print-hw.sh kafka2 19093 test1
test1:0:93236

We see that we lost 6764 messages that the producer had sent. This is due to a combination of a connection failure and a leader fail-over.

Scenario 2 - Acks=1 with a failed node and partition leader fail-over

The producer config request.required.acks=1, or its alias acks=1, tells the broker that the producer wants an acknowledgement once the leader partition has written the message to its local log. This scenario should still lose messages, but fewer than with acks=0: we should lose none to the connection failure, only to the leader fail-over.

We recreate the cluster and topic for each scenario.

$ bash reset-cluster.sh 
NODE            CONTAINER ID    STATUS  IP              NETWORK    PARTITION
kafka1          e180fe0cc4fa    UP      172.17.0.3      NORMAL
kafka2          637e7ef260ba    UP      172.17.0.4      NORMAL
kafka3          ccffaab3d0c0    UP      172.17.0.5      NORMAL
zk1             d751f9f7aaac    UP      172.17.0.2      NORMAL 
Acquiring container ids and ip addresses
Updating hosts of kafka1
Updating hosts of kafka2
Updating hosts of kafka3

$ bash create-topic.sh kafka1 test1
Created topic "test1".
Topic:test1    PartitionCount:1    ReplicationFactor:3    Configs:
    Topic: test1    Partition: 0    Leader: 1    Replicas: 1,3,2    Isr: 1,3,2

This time the leader is on broker 1.

Now we start sending the 100000 messages, with acks=1.

$ python producer.py 100000 0.0001 1 test1
Success: 10000 Failed: 0
Success: 20000 Failed: 0
Success: 30000 Failed: 0
%3|1537091610.422|FAIL|rdkafka#producer-1| [thrd:172.17.0.3:9092/bootstrap]: 172.17.0.3:9092/1: Receive failed: Connection reset by peer
%3|1537091610.422|ERROR|rdkafka#producer-1| [thrd:172.17.0.3:9092/bootstrap]: 172.17.0.3:9092/1: Receive failed: Connection reset by peer
Success: 35187 Failed: 4813
%3|1537091610.527|FAIL|rdkafka#producer-1| [thrd:172.17.0.3:9092/bootstrap]: 172.17.0.3:9092/1: Connect to ipv4#172.17.0.3:9092 failed: Connection refused
%3|1537091610.527|ERROR|rdkafka#producer-1| [thrd:172.17.0.3:9092/bootstrap]: 172.17.0.3:9092/1: Connect to ipv4#172.17.0.3:9092 failed: Connection refused
Success: 39356 Failed: 10644
Success: 49356 Failed: 10644
Success: 59356 Failed: 10644
Success: 69356 Failed: 10644
Success: 79356 Failed: 10644
Success: 89356 Failed: 10644
Sent: 100000
Delivered: 89356
Failed: 10644

At roughly the 30000 message mark I killed broker 1 with the following command:

$ blockade kill kafka1

With acks=1 only 89356 messages are acknowledged, but hopefully all or nearly all will have survived the fail-over.

$ bash print-hw.sh kafka2 19093 test1
test1:0:89356

We see that all of the acknowledged messages survived the fail-over.

Let’s rerun the scenario with ten producers, each sending 100000 messages at the same time. For this I run the Python script from a bash script that launches the instances concurrently as background jobs.
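
A rough Python stand-in for that bash script, using subprocess and assuming producer.py takes the arguments shown earlier (the repo's version is a bash script, not this):

# concurrent-producer.sh stand-in: launch ten copies of producer.py (assumed approach)
import subprocess
import sys

msg_count, delay, acks, topic = sys.argv[1:5]

procs = []
for run_id in range(1, 11):
    # the real bash script also prefixes each child's output with its run ID
    procs.append(subprocess.Popen(['python', 'producer.py', msg_count, delay, acks, topic]))

for proc in procs:
    proc.wait()

print('Runs complete')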

$ bash concurrent-producer.sh 100000 0.0001 1 test1
Run ID: 2 Success: 10000 Failed: 0
Run ID: 4 Success: 10000 Failed: 0
Run ID: 6 Success: 10000 Failed: 0
Run ID: 10 Success: 10000 Failed: 0
Run ID: 10 Success: 20000 Failed: 0
Run ID: 3 Success: 10000 Failed: 0
Run ID: 3 Success: 20000 Failed: 0
Run ID: 3 Success: 30000 Failed: 0
Run ID: 3 Success: 40000 Failed: 0
Run ID: 3 Success: 50000 Failed: 0
Run ID: 1 Success: 10000 Failed: 0
Run ID: 1 Success: 20000 Failed: 0
Run ID: 1 Success: 30000 Failed: 0
Run ID: 9 Success: 10000 Failed: 0
Run ID: 9 Success: 20000 Failed: 0
Run ID: 9 Success: 30000 Failed: 0
Run ID: 9 Success: 40000 Failed: 0
Run ID: 9 Success: 50000 Failed: 0
Run ID: 9 Success: 60000 Failed: 0
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.199|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.201|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.201|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.198|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.199|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.201|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.201|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.202|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.206|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.206|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.209|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
%3|1537094097.209|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Receive failed: Connection reset by peer
Run ID: 2 Success: 19930 Failed: 70
%3|1537094097.337|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.337|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.338|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.338|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.344|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.344|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
Run ID: 8 Success: 8495 Failed: 1505
%3|1537094097.354|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.355|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.358|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.358|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
Run ID: 7 Success: 7435 Failed: 2565
%3|1537094097.362|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.364|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
Run ID: 5 Success: 6966 Failed: 3034
%3|1537094097.386|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.394|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.387|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.394|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.395|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.395|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.404|FAIL|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
%3|1537094097.404|ERROR|rdkafka#producer-1| [thrd:172.17.0.4:9093/bootstrap]: 172.17.0.4:9093/2: Connect to ipv4#172.17.0.4:9093 failed: Connection refused
Run ID: 1 Success: 37279 Failed: 2721
Run ID: 6 Success: 14414 Failed: 5586
Run ID: 10 Success: 23463 Failed: 6537
Run ID: 4 Success: 12325 Failed: 7675
Run ID: 9 Success: 62589 Failed: 7411
Run ID: 2 Success: 19930 Failed: 10070
Run ID: 3 Success: 51539 Failed: 8461
Run ID: 9 Success: 62589 Failed: 17411
Run ID: 8 Success: 8495 Failed: 11505
Run ID: 7 Success: 7435 Failed: 12565
Run ID: 3 Success: 51539 Failed: 18461
Run ID: 5 Success: 6966 Failed: 13034
Run ID: 1 Success: 37279 Failed: 12721
Run ID: 10 Success: 23463 Failed: 16537
Run ID: 6 Success: 14414 Failed: 15586
Run ID: 7 Success: 7435 Failed: 22565
Run ID: 4 Success: 12325 Failed: 17675
Run ID: 2 Success: 19930 Failed: 20070
Run ID: 1 Success: 37279 Failed: 22721
Run ID: 8 Success: 8495 Failed: 21505
Run ID: 10 Success: 23463 Failed: 26537
Run ID: 5 Success: 6966 Failed: 23034
Run ID: 6 Success: 14414 Failed: 25586
Run ID: 7 Success: 7435 Failed: 32565
Run ID: 1 Success: 37279 Failed: 32721
Run ID: 10 Success: 23463 Failed: 36537
Run ID: 2 Success: 19930 Failed: 30070
Run ID: 4 Success: 12325 Failed: 27675
Run ID: 7 Success: 7435 Failed: 42565
Run ID: 8 Success: 8495 Failed: 31505
Run ID: 5 Success: 6966 Failed: 33034
Run ID: 1 Success: 37279 Failed: 42721
Run ID: 6 Success: 14414 Failed: 35586
Run ID: 7 Success: 7435 Failed: 52565
Run ID: 2 Success: 19930 Failed: 40070
Run ID: 4 Success: 12325 Failed: 37675
Run ID: 8 Success: 8495 Failed: 41505
Run ID: 5 Success: 6966 Failed: 43034
Run ID: 6 Success: 14414 Failed: 45586
Run ID: 4 Success: 12325 Failed: 47675
Run ID: 8 Success: 8495 Failed: 51505
Run ID: 5 Success: 6966 Failed: 53034
Run ID: 5 Success: 6966 Failed: 63034
Run ID: 2 Success: 29608 Failed: 40392
Run ID: 3 Success: 52695 Failed: 27305
Run ID: 3 Success: 62695 Failed: 27305
Run ID: 4 Success: 16069 Failed: 53931
Run ID: 6 Success: 17998 Failed: 52002
Run ID: 7 Success: 10298 Failed: 59702
Run ID: 9 Success: 63620 Failed: 26380
Run ID: 1 Success: 46565 Failed: 43435
Run ID: 10 Success: 27688 Failed: 42312
Run ID: 2 Success: 39608 Failed: 40392
Run ID: 2 Success: 49608 Failed: 40392
Run ID: 2 Success: 59608 Failed: 40392
Run ID: 2 Sent: 100000 Delivered: 59608 Failed: 40392
Run ID: 4 Success: 26069 Failed: 53931
Run ID: 3 Success: 72695 Failed: 27305
Run ID: 3 Sent: 100000 Delivered: 72695 Failed: 27305
Run ID: 5 Success: 16393 Failed: 63607
Run ID: 4 Success: 36069 Failed: 53931
Run ID: 5 Success: 26393 Failed: 63607
Run ID: 4 Success: 46069 Failed: 53931
Run ID: 4 Sent: 100000 Delivered: 46069 Failed: 53931
Run ID: 5 Success: 36393 Failed: 63607
Run ID: 5 Sent: 100000 Delivered: 36393 Failed: 63607
Run ID: 6 Success: 27998 Failed: 52002
Run ID: 6 Success: 37998 Failed: 52002
Run ID: 6 Success: 47998 Failed: 52002
Run ID: 6 Sent: 100000 Delivered: 47998 Failed: 52002
Run ID: 7 Success: 20298 Failed: 59702
Run ID: 7 Success: 30298 Failed: 59702
Run ID: 8 Success: 16788 Failed: 53212
Run ID: 7 Success: 40298 Failed: 59702
Run ID: 7 Sent: 100000 Delivered: 40298 Failed: 59702
Run ID: 8 Success: 26788 Failed: 53212
Run ID: 8 Success: 36788 Failed: 53212
Run ID: 8 Success: 46788 Failed: 53212
Run ID: 8 Sent: 100000 Delivered: 46788 Failed: 53212
Run ID: 9 Success: 73620 Failed: 26380
Run ID: 9 Sent: 100000 Delivered: 73620 Failed: 26380
Run ID: 1 Success: 56565 Failed: 43435
Run ID: 1 Sent: 100000 Delivered: 56565 Failed: 43435
Run ID: 10 Success: 37688 Failed: 42312
Run ID: 10 Success: 47688 Failed: 42312
Run ID: 10 Success: 57688 Failed: 42312
Run ID: 10 Sent: 100000 Delivered: 57688 Failed: 42312
Runs complete

I killed the leader midway. Summing the success counts of the ten producers gives 537722 acknowledged messages.

$ bash print-hw.sh kafka1 19092 test1
test1:0:537717

So we see that this time we lost 5 acknowledged messages out of 1 million sent concurrently by 10 producers.

I ran the scenario again, this time with a variant of the script that only prints the final acknowledged total.

$ bash concurrent-producer-silent.sh 100000 0.0001 1 test1
Runs complete
Acknowledged total: 766804

$ bash print-hw.sh kafka1 19092 test1
test1:0:766793

This time it lost 11 acknowledged writes. Not too bad for a leader fail-over with acks=1.

Scenario 3 - Acks=all with a failed node and partition leader fail-over (No message loss)

We’ll repeat the same concurrent ten producers sending 1 million messages over a period of a few seconds, killing the leader midway.

$ bash concurrent-producer-silent.sh 100000 0.0001 all test1
Runs complete
Acknowledged total: 889751

After about 10 seconds I executed the command to kill the leader, as in the previous scenarios. Then I retrieved the high watermark for the topic.

$ bash print-hw.sh kafka2 19093 test1
test1:0:889758

With acks=all we did not lose any messages; in fact, 7 messages for which we never got an ack were also persisted. So we see that acks=all helped guarantee no message loss during a leader fail-over.

Scenario 4 - Completely Isolate Leader from other Kafka nodes and Zookeeper with acks=1

Isolating the Kafka node that hosts the leader should lead to greater message loss than killing it, because the leader does not realize it has lost its Zookeeper connection for several seconds, during which it continues to acknowledge messages.

In this scenario I send 100000 messages at a rate of roughly 10000 messages a second, and at about the 30000 message mark I isolate the leader using Blockade.

$ python producer.py 100000 0.0001 1 test1
Success: 10000 Failed: 0
Success: 20000 Failed: 0
Success: 30000 Failed: 0
Success: 40000 Failed: 0
Success: 50000 Failed: 0
Success: 60000 Failed: 0
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
... (over 33000 error lines here)
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
Success: 66453 Failed: 33547
Sent: 100000
Delivered: 66453
Failed: 33547

Midway I isolated kafka2.

$ blockade partition kafka2

You can see that the timeout errors appear after 60000 messages have been acknowledged. In the end the producer gets 66453 acknowledgements.

Below we see that the leader failed over to kafka3 and the high watermark is 34669, which means we lost 31784 acknowledged messages.

$ bash print-topic-details.sh kafka3 test1
Topic:test1    PartitionCount:1    ReplicationFactor:3    Configs:
    Topic: test1    Partition: 0    Leader: 3    Replicas: 2,3,1    Isr: 3,1

$ bash print-hw.sh kafka1 19092 test1
test1:0:34669

Let’s dig a bit deeper and see what is going on. I have a slow-producer.py script that publishes one message per second and prints out more information. We’ll reset the cluster, run this new script and isolate the leader again.

The script does a few things:

  • Prints out the broker id of the leader at the start and when it changes

  • On receipt of an acknowledgement, prints out the value of the message and its offset.

  • Detects when the message offset drops to a lower value and warns of message loss (sketched below)
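
The message-loss detection is the interesting bit: the delivery callback remembers the highest offset acknowledged so far and, if a later acknowledgement comes back with a lower offset, it knows the new leader's log ends before the old leader's and reports the difference. A sketch of roughly how that might look, assuming the confluent-kafka client and omitting the leader tracking for brevity; this is a reconstruction, not the code from the repo:

# slow-producer.py sketch: assumed reconstruction of the offset-drop check
import sys
import time
from confluent_kafka import Producer

msg_count, delay, acks, topic = int(sys.argv[1]), float(sys.argv[2]), sys.argv[3], sys.argv[4]
last_offset = -1

def on_delivery(err, msg):
    global last_offset
    value = msg.value().decode()
    if err is not None:
        print('ERROR! Value: %s Offset: none Error: %s' % (value, err.str()))
        return
    offset = msg.offset()
    if offset is not None and offset < last_offset:
        # the new leader's log ends before the old one's: acknowledged messages are gone
        print('Partition fail-over, messages lost: %d' % (last_offset - offset))
    if offset is not None:
        last_offset = offset
    print('Value: %s Offset: %s' % (value, offset))

producer = Producer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9093,kafka3:9094',  # assumed addresses
    'acks': acks,
})
for i in range(1, msg_count + 1):
    producer.produce(topic, value=str(i), callback=on_delivery)
    producer.poll(0)
    time.sleep(delay)
producer.flush()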

$ python slow-producer.py 100 1 1 test1
Current leader: 1
Value: 1 Offset: none
Value: 2 Offset: 1
Value: 3 Offset: 2
Value: 4 Offset: 3
Value: 5 Offset: 4
Value: 6 Offset: 5
Value: 7 Offset: 6
Value: 8 Offset: 7
Value: 9 Offset: 8
Value: 10 Offset: 9
Value: 11 Offset: 10
Value: 12 Offset: 11
Value: 13 Offset: 12
Value: 14 Offset: 13
Value: 15 Offset: 14
Value: 16 Offset: 15
Value: 17 Offset: 16
Value: 18 Offset: 17
Value: 19 Offset: 18
Value: 20 Offset: 19
Value: 21 Offset: 20
Value: 22 Offset: 21
Value: 23 Offset: 22
Value: 24 Offset: 23
Value: 25 Offset: 24
Value: 26 Offset: 25
Value: 27 Offset: 26
Value: 28 Offset: 27
%3|1537197027.978|FAIL|rdkafka#producer-1| [thrd:172.17.0.3:9092/bootstrap]: 172.17.0.3:9092/1: 1 request(s) timed out: disconnect
%3|1537197027.978|ERROR|rdkafka#producer-1| [thrd:172.17.0.3:9092/bootstrap]: 172.17.0.3:9092/1: 1 request(s) timed out: disconnect
Current leader: 2
ERROR! Value: 29 Offset: none Error: Local: Message timed out
ERROR! Value: 30 Offset: none Error: Local: Message timed out
ERROR! Value: 31 Offset: none Error: Local: Message timed out
ERROR! Value: 32 Offset: none Error: Local: Message timed out
ERROR! Value: 33 Offset: none Error: Local: Message timed out
ERROR! Value: 34 Offset: none Error: Local: Message timed out
ERROR! Value: 35 Offset: none Error: Local: Message timed out
ERROR! Value: 36 Offset: none Error: Local: Message timed out
ERROR! Value: 37 Offset: none Error: Local: Message timed out
ERROR! Value: 38 Offset: none Error: Local: Message timed out
ERROR! Value: 39 Offset: none Error: Local: Message timed out
ERROR! Value: 40 Offset: none Error: Local: Message timed out
ERROR! Value: 41 Offset: none Error: Local: Message timed out
ERROR! Value: 42 Offset: none Error: Local: Message timed out
ERROR! Value: 43 Offset: none Error: Local: Message timed out
ERROR! Value: 44 Offset: none Error: Local: Message timed out
ERROR! Value: 45 Offset: none Error: Local: Message timed out
ERROR! Value: 46 Offset: none Error: Local: Message timed out
ERROR! Value: 47 Offset: none Error: Local: Message timed out
ERROR! Value: 48 Offset: none Error: Local: Message timed out
ERROR! Value: 49 Offset: none Error: Local: Message timed out
ERROR! Value: 50 Offset: none Error: Local: Message timed out
ERROR! Value: 51 Offset: none Error: Local: Message timed out
ERROR! Value: 52 Offset: none Error: Local: Message timed out
ERROR! Value: 53 Offset: none Error: Local: Message timed out
ERROR! Value: 54 Offset: none Error: Local: Message timed out
ERROR! Value: 55 Offset: none Error: Local: Message timed out
ERROR! Value: 56 Offset: none Error: Local: Message timed out
ERROR! Value: 57 Offset: none Error: Local: Message timed out
ERROR! Value: 58 Offset: none Error: Local: Message timed out
ERROR! Value: 59 Offset: none Error: Local: Message timed out
ERROR! Value: 60 Offset: none Error: Local: Message timed out
ERROR! Value: 61 Offset: none Error: Local: Message timed out
ERROR! Value: 62 Offset: none Error: Local: Message timed out
ERROR! Value: 63 Offset: none Error: Local: Message timed out
ERROR! Value: 64 Offset: none Error: Local: Message timed out
ERROR! Value: 65 Offset: none Error: Local: Message timed out
ERROR! Value: 66 Offset: none Error: Local: Message timed out
ERROR! Value: 67 Offset: none Error: Local: Message timed out
ERROR! Value: 68 Offset: none Error: Local: Message timed out
ERROR! Value: 69 Offset: none Error: Local: Message timed out
ERROR! Value: 70 Offset: none Error: Local: Message timed out
ERROR! Value: 71 Offset: none Error: Local: Message timed out
ERROR! Value: 72 Offset: none Error: Local: Message timed out
ERROR! Value: 73 Offset: none Error: Local: Message timed out
ERROR! Value: 74 Offset: none Error: Local: Message timed out
ERROR! Value: 75 Offset: none Error: Local: Message timed out
ERROR! Value: 76 Offset: none Error: Local: Message timed out
ERROR! Value: 77 Offset: none Error: Local: Message timed out
ERROR! Value: 78 Offset: none Error: Local: Message timed out
ERROR! Value: 79 Offset: none Error: Local: Message timed out
ERROR! Value: 80 Offset: none Error: Local: Message timed out
ERROR! Value: 81 Offset: none Error: Local: Message timed out
ERROR! Value: 82 Offset: none Error: Local: Message timed out
ERROR! Value: 83 Offset: none Error: Local: Message timed out
ERROR! Value: 84 Offset: none Error: Local: Message timed out
ERROR! Value: 85 Offset: none Error: Local: Message timed out
ERROR! Value: 86 Offset: none Error: Local: Message timed out
ERROR! Value: 87 Offset: none Error: Local: Message timed out
ERROR! Value: 88 Offset: none Error: Local: Message timed out
Partition fail-over, messages lost: 12
Value: 89 Offset: 15
Value: 90 Offset: 16
Value: 91 Offset: 17
Value: 92 Offset: 18
Value: 93 Offset: 19
Value: 94 Offset: 20
Value: 95 Offset: 21
Value: 96 Offset: 22
Value: 97 Offset: 23
Value: 98 Offset: 24
Value: 99 Offset: 25
Value: 100 Offset: 26
Sent: 100
Delivered: 40
Failed: 60

I isolated the leader at about the 13-14 message mark. Acknowledgements came in right up to message 28, at which point the producer seemingly hung for 60 seconds. After 60 seconds the client detected that the new leader was broker 2 and we got 60 seconds’ worth of message timeouts all at once. After the timeouts we started getting acknowledgements again, from offset 15.

If we look in the logs of the three brokers we see some interesting snippets.

kafka1 logs extract

[2018-09-15 19:20:31,244] INFO Opening socket connection to server zk1/172.17.0.2:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)

[2018-09-15 19:20:37,884] WARN Client session timed out, have not heard from server in 7025ms for sessionid 0x165e83d8e870005 (org.apache.zookeeper.ClientCnxn)

[2018-09-15 19:20:37,884] INFO Client session timed out, have not heard from server in 7025ms for sessionid 0x165e83d8e870005, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

[2018-09-15 19:20:39,034] INFO [Partition test1-0 broker=3] Shrinking ISR from 1,2,3 to 1 (kafka.cluster.Partition)

[2018-09-15 19:20:43,991] INFO Opening socket connection to server zk1/172.17.0.2:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)

[2018-09-15 19:20:43,991] WARN Client session timed out, have not heard from server in 6004ms for sessionid 0x165e83d8e870005 (org.apache.zookeeper.ClientCnxn)

[2018-09-15 19:20:44,095] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)

This shows that broker 1 cannot connect to Zookeeper (for 7 seconds); shortly afterwards it tries to shrink the ISR to itself, though it cannot update Zookeeper with that information. So it declares that it will wait until it is connected to Zookeeper again.

kafka2 logs extract

[2018-09-15 19:20:26,181] INFO [Partition test1-0 broker=2] test1-0 starts at Leader Epoch 1 from offset 15. Previous Leader Epoch was: 0 (kafka.cluster.Partition)

This shows kafka2 taking on leadership of partition 0, starting at offset 15. Note that this happens roughly 13 seconds before kafka1 tries to shrink the ISR (19:20:26 versus 19:20:39), still thinking that it is the leader.

kafka3 logs extract

[2018-09-15 19:20:26,245] INFO [Log partition=test1-0, dir=/var/lib/kafka/data] Truncating to 15 has no effect as the largest offset in the log is 14 (kafka.log.Log)

Shows kafka3 truncating its log to the new leader’s high watermark (though no data is truncated as it is not ahead of the new leader).

So we see that a completely isolated node is worse than a failed node with acks=1, as it takes a while for the broker to detect that it has lost its Zookeeper connection. While sending 100000 messages we lost 31784 acknowledged messages! Those were the messages sent during the roughly 15 second window in which the original leader was still accepting writes even though a new leader had already been elected.

Scenario 5 - Completely Isolate Leader from other Kafka nodes and Zookeeper with acks=all (no message loss)

I will repeat the exact same steps as scenario 4 except we’ll set acks=all.

In theory, once the leader is isolated we should not get any acknowledgements because no followers in the ISR are reachable. Once the 10 second replica lag limit is reached the leader will try to shrink the ISR to itself, but it cannot update Zookeeper with the change and will stop accepting writes. Meanwhile leadership should fail over to a follower and the client should detect that change within 60 seconds and start sending messages to the new leader instead.

Again at about the 30000 message mark I isolate the leader.

$ python producer.py 100000 0.0001 all test1
Success: 10000 Failed: 0
Success: 20000 Failed: 0
Success: 30000 Failed: 0
KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
... (many many timeout errors here)
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
Success: 37640 Failed: 62360
Sent: 100000
Delivered: 37640
Failed: 62360

This time, soon after isolating the leader, we get three “Broker: Request timed out” errors because the followers are out of contact. Then comes a long period with nothing, followed by thousands of “Local: Message timed out” errors.

In the end we get 37640 acknowledged messages.

$ bash print-topic-details.sh kafka1 test1
Topic:test1    PartitionCount:1    ReplicationFactor:3    Configs:
    Topic: test1    Partition: 0    Leader: 1    Replicas: 2,1,3    Isr: 1,3
$ bash print-hw.sh kafka1 19092 test1
test1:0:37641

We see that the new leader is kafka1 and that we did not lose any messages. In fact, there is one message in the topic for which we never received an ack. Acks=all is required to avoid data loss in leader fail-overs, whether they are caused by a network partition or a failed node.

Let’s run it again with the ten concurrent producers that will attempt to send 1 million messages in total. I ran the script and after 10 seconds I started the network partition. In the end we see that still no acknowledged messages were lost, though the number of acknowledged messages was very low.

$ bash concurrent-producer-silent.sh 100000 0.0001 all test1
Runs complete
Acknowledged total: 8347

$ bash print-hw.sh kafka1 19092 test1
test1:0:8347

Remember that we are not retrying failed messages, which is what you would do in production. In our case we don’t want duplicates interfering with the numbers.
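
For reference, a more production-like configuration with the confluent-kafka client would look roughly like the sketch below; the values are illustrative and enable.idempotence needs a reasonably recent client:

# a more production-like producer configuration (illustrative values, assumed)
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka1:9092,kafka2:9093,kafka3:9094',  # assumed addresses
    'acks': 'all',               # wait for the in-sync replicas before acknowledging
    'retries': 5,                # retry failed sends rather than dropping them
    'enable.idempotence': True,  # newer clients: retry without introducing duplicates
})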

Scenario 6 - Leader Isolated from Zookeeper only with Acks=1

In this scenario we will isolate the leader from Zookeeper but not from the other Kafka nodes. Once the broker that hosts the leader cannot talk to Zookeeper, it will be marked as dead and leadership will fail over to a follower. For a while the original leader will keep accepting writes even though a follower has already been elected as the new leader. Any messages acknowledged by the original leader during this short window will be lost.

This should play out much like scenario 4, where the leader was fully isolated. At some point the followers will stop sending fetch requests to the original leader and it will try to shrink the ISR to itself. The difference is the reason they stop fetching: here it is because leadership has failed over to another node, not because the network path to the leader is down.

The leader is kafka2 and the controller node is kafka1.

$ bash create-topic.sh kafka3 test1
Created topic "test1".
Topic:test1    PartitionCount:1    ReplicationFactor:3    Configs:
    Topic: test1    Partition: 0    Leader: 2    Replicas: 2,3,1    Isr: 2,3,1
$ bash print-controller.sh kafka2
1
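
print-controller.sh reads the controller id that Kafka registers under the /controller znode in Zookeeper. The same lookup can be done from Python with the kazoo client; this is an assumed alternative, not the script from the repo:

# print-controller.sh alternative: read the controller id from Zookeeper (assumed approach)
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181')   # assumed Zookeeper address
zk.start()
data, _stat = zk.get('/controller')  # Kafka registers the active controller id here as JSON
print(json.loads(data)['brokerid'])
zk.stop()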

Messages start getting sent.

$ python producer.py 100000 0.0001 1 test1
Success: 10000 Failed: 0
Success: 20000 Failed: 0
Success: 30000 Failed: 0
Success: 40000 Failed: 0
Success: 50000 Failed: 0
Success: 60000 Failed: 0
Success: 70000 Failed: 0
Success: 80000 Failed: 0
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
... (many many time out errors)
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
Success: 89027 Failed: 10973
Sent: 100000
Delivered: 89027
Failed: 10973

Midway I isolated kafka2 from Zookeeper only.

$ blockade partition kafka1,kafka2,kafka3 kafka1,kafka3,zk1

The final count showed that of the 89027 acknowledged messages, only 45518 made it to the topic, meaning we lost 43509 acknowledged messages.

$ bash print-hw.sh kafka1 19092 test1
test1:0:45518

Again we can combine the slow-producer.py output with the docker logs to confirm what is happening.

$ python slow-producer.py 100 1 1 test1
Current leader: 3
Value: 1 Offset: none
Value: 2 Offset: 1
Value: 3 Offset: 2
Value: 4 Offset: 3
Value: 5 Offset: 4
Value: 6 Offset: 5
Value: 7 Offset: 6
Value: 8 Offset: 7
Value: 9 Offset: 8
Value: 10 Offset: 9
Value: 11 Offset: 10
Value: 12 Offset: 11
Value: 13 Offset: 12
Value: 14 Offset: 13
Value: 15 Offset: 14
Value: 16 Offset: 15
Value: 17 Offset: 16
Value: 18 Offset: 17
Value: 19 Offset: 18
Value: 20 Offset: 19
Value: 21 Offset: 20
Value: 22 Offset: 21
Value: 23 Offset: 22
Value: 24 Offset: 23
Value: 25 Offset: 24
Value: 26 Offset: 25
Value: 27 Offset: 26
Value: 28 Offset: 27
Value: 29 Offset: 28
Value: 30 Offset: 29
Value: 31 Offset: 30
Value: 32 Offset: 31
Value: 33 Offset: 32
%3|1537212099.509|FAIL|rdkafka#producer-1| [thrd:172.17.0.5:9094/bootstrap]: 172.17.0.5:9094/3: 1 request(s) timed out: disconnect
%3|1537212099.509|ERROR|rdkafka#producer-1| [thrd:172.17.0.5:9094/bootstrap]: 172.17.0.5:9094/3: 1 request(s) timed out: disconnect
Current leader: 1
ERROR! Value: 34 Offset: none Error: Local: Message timed out
ERROR! Value: 35 Offset: none Error: Local: Message timed out
ERROR! Value: 36 Offset: none Error: Local: Message timed out
ERROR! Value: 37 Offset: none Error: Local: Message timed out
ERROR! Value: 38 Offset: none Error: Local: Message timed out
ERROR! Value: 39 Offset: none Error: Local: Message timed out
...
ERROR! Value: 90 Offset: none Error: Local: Message timed out
ERROR! Value: 91 Offset: none Error: Local: Message timed out
ERROR! Value: 92 Offset: none Error: Local: Message timed out
ERROR! Value: 93 Offset: none Error: Local: Message timed out
Partition fail-over, messages lost: 12
Value: 94 Offset: 20
Value: 95 Offset: 21
Value: 96 Offset: 22
Value: 97 Offset: 23
Value: 98 Offset: 24
Value: 99 Offset: 25
Value: 100 Offset: 26
Sent: 100
Delivered: 40
Failed: 60

kafka3 logs extract

[2018-09-16 19:20:31,877] INFO Opening socket connection to server zk1/172.17.0.2:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)

[2018-09-16 19:20:37,884] WARN Client session timed out, have not heard from server in 7327ms for sessionid 0x165e8aa663e0005 (org.apache.zookeeper.ClientCnxn)

[2018-09-16 19:20:37,884] INFO Client session timed out, have not heard from server in 7327ms for sessionid 0x165e8aa663e0005, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

[2018-09-16 19:20:39,034] INFO [Partition test1-0 broker=3] Shrinking ISR from 3,1,2 to 3 (kafka.cluster.Partition)

[2018-09-16 19:20:39,609] INFO Opening socket connection to server zk1/172.17.0.2:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)

[2018-09-16 19:20:43,991] WARN Client session timed out, have not heard from server in 6005ms for sessionid 0x165e8aa663e0005 (org.apache.zookeeper.ClientCnxn)

[2018-09-16 19:20:43,991] INFO Client session timed out, have not heard from server in 6005ms for sessionid 0x165e8aa663e0005, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

[2018-09-16 19:20:44,095] INFO [ZooKeeperClient] Waiting until connected. (kafka.zookeeper.ZooKeeperClient)

We see that kafka3, the original leader, loses its connection to Zookeeper. Soon after, it tries to shrink the ISR to itself but cannot, and declares that it will wait until it is connected again.

kafka2 logs extract

[2018-09-16 19:20:26,154] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions test1-0 (kafka.server.ReplicaFetcherManager)

[2018-09-16 19:20:26,185] INFO [ReplicaFetcherManager on broker 2] Added fetcher for partitions List([test1-0, initOffset 20 to broker BrokerEndPoint(1,kafka1,19092)] ) (kafka.server.ReplicaFetcherManager)

[2018-09-16 19:20:26,186] INFO [ReplicaAlterLogDirsManager on broker 2] Added fetcher for partitions List() (kafka.server.ReplicaAlterLogDirsManager)

[2018-09-16 19:20:26,192] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Starting (kafka.server.ReplicaFetcherThread)

[2018-09-16 19:20:26,199] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutting down (kafka.server.ReplicaFetcherThread)

[2018-09-16 19:20:26,209] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Error sending fetch request (sessionId=1307994263, epoch=10152) to node 3: java.nio.channels.ClosedSelectorException. (org.apache.kafka.clients.FetchSessionHandler)

[2018-09-16 19:20:26,210] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)

[2018-09-16 19:20:26,211] INFO [ReplicaFetcher replicaId=2, leaderId=3, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)

[2018-09-16 19:20:26,245] INFO [Log partition=test1-0, dir=/var/lib/kafka/data] Truncating to 20 has no effect as the largest offset in the log is 19 (kafka.log.Log)

We see that kafka2 removes its existing fetcher and creates a new one pointing at the new leader, starting at offset 20.

kafka1 logs extract

[2018-09-16 19:20:26,024] INFO [Controller id=1] Newly added brokers: , deleted brokers: 3, all live brokers: 1,2 (kafka.controller.KafkaController)

[2018-09-16 19:20:26,025] INFO [RequestSendThread controllerId=1] Shutting down (kafka.controller.RequestSendThread)

[2018-09-16 19:20:26,029] INFO [RequestSendThread controllerId=1] Stopped (kafka.controller.RequestSendThread)

[2018-09-16 19:20:26,031] INFO [RequestSendThread controllerId=1] Shutdown completed (kafka.controller.RequestSendThread)

[2018-09-16 19:20:26,069] INFO [Controller id=1] Broker failure callback for 3 (kafka.controller.KafkaController)

[2018-09-16 19:20:26,074] INFO [Controller id=1] Removed ArrayBuffer() from list of shutting down brokers. (kafka.controller.KafkaController)

[2018-09-16 19:20:26,148] INFO [RequestSendThread controllerId=1] Controller 1 connected to kafka2:19093 (id: 2 rack: null) for sending state change requests (kafka.controller.RequestSendThread)

[2018-09-16 19:20:26,171] INFO [RequestSendThread controllerId=1] Controller 1 connected to kafka1:19092 (id: 1 rack: null) for sending state change requests (kafka.controller.RequestSendThread)

[2018-09-16 19:20:26,180] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions test1-0 (kafka.server.ReplicaFetcherManager)

[2018-09-16 19:20:26,181] INFO [Partition test1-0 broker=1] test1-0 starts at Leader Epoch 1 from offset 20. Previous Leader Epoch was: 0 (kafka.cluster.Partition)

We see that kafka1, the controller, detects that kafka3 is gone and removes its own fetcher for the partition. The partition replica on kafka1 then gets elected leader.

Note that the fail-over occurred at 19:20:26 and that kafka3 still thought it was the leader at 19:20:39, only stopping accepting messages at around 19:20:40 - a whole 14 seconds after the fail-over.

One last interesting thing is that the controller node could have stopped kafka3 from accepting writes if it had sent a request to it to stop being leader. In the logs we see the following line straight after the fail-over:

[2018-09-17 19:20:26,323] WARN [Channel manager on controller 1]: Not sending request (type=StopReplicaRequest, controllerId=1, controllerEpoch=1, deletePartitions=false, partitions=test1-0) to broker 3, since it is offline. (kafka.controller.ControllerChannelManager)

It didn’t send the request because it believed the node was offline, since its Zookeeper session had expired.

The conclusion is that acks=1 loses a lot of messages even when a Kafka node is isolated from Zookeeper alone.

Scenario 7 - Leader Isolated from Zookeeper only with Acks=all (no message loss)

We’ll repeat the same actions as in scenario 6 except we’ll set acks to all.

Just like in scenario 5, I don’t expect to see message loss. The reason is that the leader is isolated from Zookeeper, so when the fail-over occurs the controller won’t bother telling the original leader to stop being leader. Thinking it is still the leader, it will see the followers stop sending fetch requests and try to remove them from the ISR, but it won’t be able to update Zookeeper. While the followers remain in the ISR we won’t get any acks, because they will never confirm receipt, and once the attempt to shrink the ISR has been made no more writes will be accepted.

The fail-over will occur and the client will detect the changes within a minute and start publishing to the new leader.

$ python producer.py 100000 0.0001 all test1
Success: 10000 Failed: 0
Success: 20000 Failed: 0
Success: 30000 Failed: 0
Success: 40000 Failed: 0
KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
KafkaError{code=REQUEST_TIMED_OUT,val=7,str="Broker: Request timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
... (many many time out errors)
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
KafkaError{code=_MSG_TIMED_OUT,val=-192,str="Local: Message timed out"}
Success: 46419 Failed: 53581
Sent: 100000
Delivered: 46419
Failed: 53581

We got 46419 acknowledgements.

$ bash print-hw.sh kafka1 19092 test1
test1:0:46419

And a matching number of messages in the topic, meaning that once again we lost none.


Conclusions

We still have more failures to introduce to the cluster that we’ll cover in part 2. But the conclusions from this part are clear - use acks=all if message loss is unacceptable to you.

Message loss due to a fail-over caused by a node failure with acks=1 was surprisingly low, and we’ll see what effect slowing down the network has on that number. The biggest danger with acks=1 is a network partition, due to the short window of split-brain in which a partition can have two leaders.

Acks=all is the equivalent of RabbitMQ’s publisher confirms and we have seen with both messaging systems that confirming replication to secondaries is critical for lossless fail-overs.

In the next part we’ll start messing with the network speed and also contrast RabbitMQ’s queue synchronization to Kafka’s replication protocol.