In part 1 we built a logical model for how copy-on-write tables work in Apache Hudi, and posed a number of questions about consistency under different types of concurrency control, timestamp monotonicity and more. In part 2 we studied timestamp collisions, their probabilities, and how to avoid them (while remaining conformant to the Hudi spec). In part 3, we'll focus on the results of model checking the TLA+ specification, and answer those questions.
This TLA+ specification only models the logic I have explained so far:
Commit action types only.
COW tables.
Uses a fixed-size pool of file groups, with lazy assignment.
No table services (cleaning, compaction, clustering etc).
Only single primary key operations. Instead of rows, the data model is based on simple KV pairs, where the key represents the primary key and the value represents the non-PK column values.
The specification has the following parameters (sketched as TLA+ constants after the list):
Writers. E.g. {w1, w2}.
Keys. E.g. {k1, k2}.
Values. E.g. {A, B}.
MonotonicTs
True = Monotonic timestamp provider, such as ZooKeeper.
False = Non-deterministic local clock: issues values between 1 and the current monotonic value.
ConcurrencyControl
0 = None
1 = Optimistic
2 = Pessimistic
PutIfAbsentSupported
True = Instant and file slice files cannot be overwritten (an attempted overwrite causes the writer to abort).
False = Instant and file slice files can be silently overwritten.
Primary key conflict check (KeyConflictCheck)
True = Primary key conflicts that would result in duplicates are detected at index update time (before commit).
False = No primary key conflict detection is performed.
Salt
True = All instant and file slice identifiers include a unique salt to avoid file overwriting.
False = No salt is used.
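Putting these together, here is a minimal sketch of how the parameters might be declared as TLA+ constants. The names match the case parameters used later in this post, except UseSalt, which is my placeholder:

```tla
CONSTANTS
    Writers,              \* e.g. {w1, w2}
    Keys,                 \* e.g. {k1, k2}
    Values,               \* e.g. {A, B}
    FileGroupCount,       \* size of the fixed pool of file groups
    MonotonicTs,          \* TRUE = monotonic provider, FALSE = local clock
    ConcurrencyControl,   \* 0 = none, 1 = optimistic, 2 = pessimistic
    PutIfAbsentSupported, \* TRUE = create-only instant/file slice writes
    KeyConflictCheck,     \* TRUE = PK conflict detection before commit
    UseSalt               \* TRUE = salted instant/file slice names
```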
The specification has one important invariant, ConsistentRead, which checks that each committed KV pair op (insert/update/delete) remains forever readable with the value associated with that commit (at a compatible timestamp).
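As a rough sketch, the invariant has the following shape (committedOps, CompatibleTimestamps and ReadKey are illustrative names, not the spec's actual definitions):

```tla
\* Every committed op must remain readable: a read of its key at any
\* compatible timestamp returns the value of that commit (or absence,
\* for a delete).
ConsistentRead ==
    \A op \in committedOps :
        \A readTs \in CompatibleTimestamps(op) :
            ReadKey(op.key, readTs) = op.value
```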
I have summarized the results into two categories:
V5 spec conformant configurations
V5 spec non-conformant configurations
Hudi v5 spec conformant configurations
The Hudi spec makes it clear that timestamps must be monotonic, so all the configurations below use monotonic timestamps. Locking is recommended for multi-writer scenarios, so the configurations include optimistic and pessimistic concurrency control. Finally, primary key conflict detection (which avoids duplicates) is optional, so I have configurations with and without it.
Id   Locking       Timestamps   PK conflict check   Consistent
---- ------------- ------------ ------------------- ----------------
1    Optimistic    Monotonic    YES                 OK
2    Optimistic    Monotonic    NO                  Duplicate rows
3    Pessimistic   Monotonic    YES                 OK
4    Pessimistic   Monotonic    NO                  Duplicate rows
These results correlate with the guarantees listed in the Hudi docs.
There were no surprises, which was good to see.
ACID guarantees in compliant implementations
As we covered in part 1, atomicity and durability are trivially true. Model checking now gives us the results to determine whether Hudi supports consistency and isolation too.
When optional primary key conflict detection is implemented and enabled, full ACID guarantees are delivered. Without it, however, we get a failure of isolation resulting in primary key duplication across file groups. This occurs only when two or more concurrent operations insert the same primary key into different file groups. The last write to the primary-key-to-file-group mapping index wins. In an OLTP system this kind of isolation issue might only result in a lost write/update, but in Hudi it results in a consistency problem, as the orphaned row can still be read in the wrong file group. Using the primary key conflict check in multi-writer scenarios solves the problem.
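A minimal sketch of such a check follows (opOf, keyIndex and fileGroupOf are illustrative names): an insert is rejected at index update time if its key is already mapped to a different file group.

```tla
\* TRUE when writer w's insert would create a duplicate: the key is
\* already mapped in the index to a file group other than w's.
PkConflict(w) ==
    /\ opOf[w].type = "insert"
    /\ opOf[w].key \in DOMAIN keyIndex
    /\ keyIndex[opOf[w].key] # fileGroupOf[w]
```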
Hudi v5 spec non-conformant configurations
The following configurations do not conform to the Hudi v5 spec (all bets are off). However, countermeasures exist that can make non-conformant configurations safe, namely PutIfAbsent storage support and the use of salts in instant and file slice file names. For completeness, we'll look at both safe and unsafe non-conformant configurations.
Id   PutIfAbsent   Salt   CC            Timestamps      PK check   Consistent
---- ------------- ------ ------------- --------------- ---------- ----------------------------------
5    Any           Any    No locking    Any             Any        Fail (lost write)
6    NO            NO     Optimistic    Non-monotonic   YES        Fail (lost write - ts collision)
7    NO            NO     Pessimistic   Non-monotonic   YES        Fail (lost write - ts collision)
8    YES           NO     Optimistic    Non-monotonic   YES        OK
9    YES           NO     Pessimistic   Non-monotonic   YES        OK
10   NO            YES    Optimistic    Non-monotonic   YES        OK
11   NO            YES    Pessimistic   Non-monotonic   YES        OK
The only cases of data loss are related to non-conformant configurations. We also see that we can get away with non-monotonic timestamps if we use storage that supports PutIfAbsent or use salts. However, running multiple writers without concurrency control is never safe.
If you want to look at some traces of safe and unsafe configurations, then read on, else feel free to jump to the end.
Let’s dive into some of these scenarios to understand why each is safe or unsafe.
Some pass/fail scenarios
Case 1 - No concurrency control (Non-conformant and unsafe)
Parameters:
Writers={w1, w2}
Keys={k1, k2}
Values={A, B}
FileGroupCount=1
MonotonicTs=TRUE
ConcurrencyControl=0 (none)
KeyConflictCheck=TRUE
PutIfAbsentSupported = any (set to TRUE or FALSE)
This configuration is guaranteed to avoid timestamp collisions but experiences a lost write.
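Here is the kind of interleaving that leads to the lost write (paraphrased from the model, not a literal TLC trace):

```tla
\* 1. w1 obtains ts=1 and w2 obtains ts=2 (monotonic, no collision).
\* 2. Both scan the timeline before either commits, so both merge
\*    from the same (empty) state of file group fg1.
\* 3. w1 writes file slice [fg1, ts=1] with k1=A and commits.
\* 4. w2 writes file slice [fg1, ts=2] with k2=B and commits; with no
\*    locking there is no conflict check to stop it.
\* 5. Readers take the latest slice, [fg1, ts=2], which lacks k1=A:
\*    w1's write is lost and ConsistentRead is violated.
```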
Case 2 - OCC, two keys, PK conflict detection enabled (Conformant & safe)
Parameters:
Writers={w1, w2}
Keys={k1, k2}
Values={A, B}
FileGroupCount=1
MonotonicTs=TRUE
ConcurrencyControl=1 (optimistic)
KeyConflictCheck=TRUE
PutIfAbsentSupported = any
This is the same as case 1, except we use optimistic concurrency control. This time the critical actions are wrapped in a lock, leading to the second operation failing its OCC check and aborting.
Case 3 - OCC, one key, PK conflict detection disabled (Conformant - duplicate keys)
Parameters:
Writers={w1, w2}
Keys={k1}
Values={A}
FileGroupCount=2
MonotonicTs=TRUE
ConcurrencyControl=1 (optimistic)
KeyConflictCheck=FALSE
PutIfAbsentSupported = any
Without the PK conflict detection, two concurrent inserts of the same key by different writers can cause the same key to be written to two separate file groups, despite OCC.
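An interleaving of this shape produces the duplicate (paraphrased from the model, not a literal TLC trace):

```tla
\* 1. w1 (ts=1) and w2 (ts=2) both insert k1; neither sees an index
\*    entry for it, so each is lazily assigned its own file group:
\*    w1 gets fg1, w2 gets fg2.
\* 2. Each writes a file slice: [fg1, ts=1, k1=A] and [fg2, ts=2, k1=A].
\* 3. Both update the key-to-file-group index; the last update wins
\*    (say k1 -> fg2) and, with KeyConflictCheck=FALSE, nothing objects.
\* 4. Both pass the OCC check (they touched different file groups)
\*    and both commit: k1 now exists in fg1 and fg2, and the fg1 copy
\*    is an orphaned duplicate.
```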
When a duplicate of a primary key exists in a file group that doesn't correspond to the index, it is still readable as long as its file slice remains referenced from the timeline. The interesting question is how such an orphaned row, which is still readable, eventually gets filtered out. Presumably merges of the file slice into new file slices would preserve the row.
Case 4 - OCC, non-monotonic timestamps, PutIfAbsentSupported=FALSE (Non-conformant, unsafe)
Parameters:
Writers={w1, w2}
Keys={k1}
Values={A, B}
FileGroupCount=1
MonotonicTs=FALSE
ConcurrencyControl=1 (optimistic)
KeyConflictCheck=TRUE
PutIfAbsentSupported = FALSE
In the TLA+ specification, a non-monotonic timestamp is issued non-deterministically with any value between 1 and the current monotonic value (which includes duplicate timestamps that would collide). When doing brute-force checking, the model checker actually explores all timestamp values between 1 and the current monotonic value for each operation. In this configuration, a collision lets the second writer silently overwrite the first writer's instant and file slice files, resulting in a lost write.
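A sketch of what that issuance step can look like (writerTs and nextTs are illustrative variable names):

```tla
\* With MonotonicTs = FALSE, the next-state relation permits ANY value
\* in 1..nextTs, so TLC explores every possible collision.
ObtainTimestamp(w) ==
    IF MonotonicTs
    THEN /\ writerTs' = [writerTs EXCEPT ![w] = nextTs]
         /\ nextTs' = nextTs + 1
    ELSE \E t \in 1..nextTs :
            /\ writerTs' = [writerTs EXCEPT ![w] = t]
            /\ nextTs' = nextTs + 1
```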
Case 5 - OCC, non-monotonic timestamps, PutIfAbsentSupported=TRUE (Non-conformant, safe)
Parameters:
Writers={w1, w2}
Keys={k1}
Values={A, B}
FileGroupCount=1
MonotonicTs=FALSE
ConcurrencyControl=1 (optimistic)
KeyConflictCheck=TRUE
PutIfAbsentSupported = TRUE
This time the timestamp collision was caught: the second write to 1.commit.requested failed as the file already existed, and w2 aborted early.
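A sketch of the create-only write semantics involved (illustrative names; UNCHANGED clauses omitted; :> and @@ come from the TLC module):

```tla
\* With PutIfAbsentSupported, writing an existing file name fails and
\* the writer aborts; without it, the file is silently overwritten.
WriteFile(w, name, content) ==
    IF name \in DOMAIN files
    THEN IF PutIfAbsentSupported
         THEN writerState' = [writerState EXCEPT ![w] = "aborted"]
         ELSE files' = [files EXCEPT ![name] = content] \* overwrite
    ELSE files' = files @@ (name :> content)            \* fresh create
```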
On timeline order not matching insertion order
Going back to the beginning of this analysis in part 1, I said I wasn't sure whether the v5 Hudi spec, in talking about monotonic timestamps, meant insertion time or issuance time. Having gone through the process of modeling Hudi in TLA+, from a correctness perspective the most important thing is that timestamps shouldn't collide, at least when using a storage service that does not support PutIfAbsent.
However, what happens if two writers obtain timestamps that are monotonic at the time of issuance but the operations are executed out of order?
The answer is that everything is ok, as long as you have chosen one of the compliant, safe configurations.
Example: Out of order, same primary key
The following operations are carried out; note that the insertion order does not match the timestamp order:
Op 1, k1=A, ts=1, insertion order=1
Op 2, k1=B, ts=2, insertion order=3
Op 3, k1=C, ts=3, insertion order=2
Behavior:
Op 1 is carried out to completion, writing file slice [file_id=1, ts=1] with k1=A.
Op 2 obtains ts=2.
Op 3 obtains ts=3.
Op 2 pauses for a while.
Op 3 goes ahead and the operation succeeds, writing file slice [file_id=1, ts=3] containing k1=C.
Op 2 resumes. It scans the timeline and determines that the merge commit timestamp is 3, which is higher than its own timestamp, so it aborts early. If the operations had interleaved in a different way, with Op 3 still completing first, then the OCC check of Op 2 would have detected the conflict and aborted.
Where two overlapping operations are executed out of timestamp order, only one succeeds. File slices can only be committed in timestamp order when OCC is used. From a performance perspective, this means that operations executed in monotonic timestamp order will perform better due to fewer conflicts.
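A sketch of the optimistic check being applied here (completedInstants, fileGroupOf and mergeTargetTs are illustrative names):

```tla
\* The commit-time check passes only if no completed instant touches
\* the writer's file group with a timestamp above the file slice it
\* merged against.
OccCheckOk(w) ==
    ~\E i \in completedInstants :
        /\ i.fileGroup = fileGroupOf[w]
        /\ i.ts > mergeTargetTs[w]
```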
Example: Out of order, different primary keys mapping to different file groups
Op 1, k1=A, ts=1, fg=1, insertion order=1
Op 2, k2=X, ts=2, fg=2, insertion order=2
Op 3, k1=B, ts=3, fg=1, insertion order=4
Op 4, k2=Y, ts=4, fg=2, insertion order=3
First, ops 1 and 2 perform upserts:
k1=A at ts=1 to file slice [file_id=1, ts=1]
k2=X at ts=2 to file slice [file_id=2, ts=2]
Then ops 3 and 4 execute.
Op 3 obtains ts=3.
Op 4 obtains ts=4.
Op 3 pauses for a while.
Op 4 goes ahead and succeeds, writing file slice [file_id=2,ts=4] containing k2=Y.
Op 3 resumes. It scans the timeline and determines that the merge commit timestamp is 2. It looks for a file slice merge target of file_id=1 with a timestamp <= 2. It finds [file_id=1,ts=1] with k1=A. It writes a new file slice, [file_id=1, ts=3] with k1=B. Its concurrency control check passes as there is no completed instant in the timeline that touches the same file group with a ts > 1. Op 3 succeeds.
Where two disjoint operations are executed out of order, both succeed.
But what about consistency across keys? If a client had been repeatedly retrieving all keys the whole time, at ts=3 or at ts=4, would the results have been consistent?
At ts=3, a reader would have seen the following results as it repeated its query over and over:
k1=null, k2=null
k1=A, k2=null
k1=A, k2=X
k1=B, k2=X
At ts=4, a reader would have seen the following results as it repeated its query over and over:
k1=null, k2=null
k1=A, k2=null
k1=A, k2=X
k1=A, k2=Y <- sees k2=Y of ts=4 before k1=B of ts=3.
k1=B, k2=Y
In the ts=4 case, a reader saw k2=Y before k1=B. This is ok because the two operations overlapped and therefore any chosen total order of materializing these operations is valid (which is what we are seeing here). Multiple clients reading at the same timestamp would have seen the same total order.
Conclusions
This analysis has been of limited scope, but so far the results of model checking the TLA+ specification correspond to the guarantees discussed in the multi-writer section of the Apache Hudi concurrency control documentation. ACID is supported with Hudi spec compliant configurations, provided primary key conflict detection is used in multi-writer scenarios.
I’ve focused a lot on multi-writer scenarios in this analysis because I’m a distributed systems engineer and I wanted to understand how Hudi handles this. Yet the single writer setup is the more common case.
Regarding multi-writer scenarios, the Apache Hudi v5 spec clearly states that timestamps should be monotonic. From my analysis, the most important thing is that timestamps should not collide, and there are multiple options for achieving this. If you use a storage service that supports PutIfAbsent, then it's a solved problem. Otherwise, if you're on S3, you'll need a source of monotonic timestamps. Given that distributed locking is definitely needed for correctness in a multi-writer setup anyway, something like DynamoDB or ZooKeeper can provide both locks and monotonic counters. The performance impact of using such a system for timestamps and locking should be minimal, given that the number of ops per second should be much lower than, say, a Kafka topic or an OLTP database table. The cost of loading the timeline and reading and writing Parquet files should massively outweigh the cost of obtaining timestamps and acquiring/releasing locks.
Delta Lake and Apache Hudi are very similar on this point, with both taking the write-ahead-log (WAL) approach, and both requiring monotonic identifiers for WAL entries. Databricks uses a lightweight coordination (presumably locking) service when using S3 to ensure that id conflicts do not occur (due to S3's lack of PutIfAbsent support). Databricks notes that load on this coordination service is low due to the relatively low write rate of lakehouse tables.
The next steps, if I dedicate more time to this, would be to model Merge-On-Read (MOR) tables and also table services (compaction, clustering, cleaning etc). There are definitely more data consistency questions I have about how these additional features work safely. So far, my results correlate with the guarantees in the Hudi documentation, so I have no reason to expect I would find issues. Even so, it would be a useful exercise to understand Hudi internals further.
I’ll finish by just saying that I find TLA+ an excellent tool for learning existing systems. Going through the modeling process really makes things concrete for me, and I end up with a far deeper understanding of the target systems afterwards. TLA+ isn’t just for design time.
Series links: