feat: PROMOTE now requires quorum & CONFIRM (v2)
This patchset fundamentally changes the behavior of the protocol message `IPROTO_RAFT_PROMOTE`.
## The problem
Previously, we'd first win the raft elections, becoming the raft leader, and then write the message unconditionally, without gathering any quorum. This works fine under many circumstances, but things become more complicated once we introduce network partitions, lags, and cluster node hiccups (remember that a node might slow down significantly or become unresponsive due to heavy load, insufficient fiber context switches, etc.). Once we start writing `IPROTO_RAFT_PROMOTE`, there is no going back: if we limp and fail to deliver it in time, any other node that succeeds in becoming the new leader will trip the split-brain check. This has been described previously in https://github.com/tarantool/tarantool/issues/9376.
## The proposed solution
To mitigate this problem, the algorithm becomes a little more complicated:
1. Start the elections using `election_mode = "candidate"` or `box.ctl.promote()`.
2. Win the elections, becoming a raft leader.
3. Claim the limbo using `IPROTO_RAFT_PROMOTE`:
   1. Write an `IPROTO_RAFT_PROMOTE` (promote request) in the current term...
   2. ...linking it to the preceding unconfirmed promote requests (or `limbo->promote_greatest_term` if there are none) using `PREV_TERM` (more on that later).
   3. Add it to the queue `limbo->pending_promotes`.
   4. Gather quorum on the promote request by counting ACKs from fellow nodes.
   5. Once the quorum is gathered, write an `IPROTO_RAFT_CONFIRM` (confirm request) pointing to the promote request (aka a confirm-for-promote).
   6. Now everyone reads the confirm request and applies the promote request and its dependencies (`PREV_TERM`) in order.
4. Success: the leader now owns the limbo, and the cluster is R/W.
It's impossible for the raft leader to add new transactions to the limbo until the ownership transition has completed.
Every promote request is uniquely identified by its term (as per raft), so we can store all of them in a list sorted by term. Until the leader gathers quorum on its own promote request written in the current term (the active request, see `txn_limbo_get_active_promote`), it may not assume limbo ownership or commit/rollback regular transactions. Once the quorum is gathered, all promote requests reachable from the active one via `PREV_TERM` are committed as well, thus retroactively becoming part of the history.
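To make this bookkeeping concrete, here is a minimal Lua sketch of the queue. It is an illustrative model only, not the actual implementation: the real queue is `limbo->pending_promotes` in the C core, and `add_promote`/`confirm_promote` are hypothetical names.

```lua
-- Illustrative model of the pending-promote queue (hypothetical names;
-- the real implementation lives in Tarantool's C core).
local limbo = {
    promote_greatest_term = 0, -- greatest term whose promote is in history
    pending_promotes = {},     -- promote requests sorted by term
}

-- Steps 3.2-3.3: queue a promote request, linking it to the preceding
-- unconfirmed promote (or promote_greatest_term if there are none).
-- The WAL write itself (step 3.1) is not modeled here.
local function add_promote(term)
    local pending = limbo.pending_promotes
    local last = pending[#pending]
    local prev_term = last and last.term or limbo.promote_greatest_term
    table.insert(pending, {term = term, prev_term = prev_term})
end

-- Steps 3.5-3.6: a confirm-for-promote commits the promote it points to,
-- together with everything reachable from it via PREV_TERM.
local function confirm_promote(term)
    local pending = limbo.pending_promotes
    for i = 1, #pending do
        if pending[i].term == term then
            -- Entries 1..i form the PREV_TERM chain; they all become
            -- part of history retroactively.
            limbo.promote_greatest_term = term
            local rest = {}
            for j = i + 1, #pending do
                rest[#rest + 1] = pending[j]
            end
            limbo.pending_promotes = rest
            return true
        end
    end
    return false
end
```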
Whenever a node receives a foreign promote request, it may not know whether that request has gathered quorum or not -- only its author is responsible for counting ACKs, but the author might die suddenly and never write the confirm-for-promote. The promote request in question is then known as a pending promote request. If a new leader comes to power, it has to deal with the consequences: it may not drop the pending promote request, because the request might still get confirmed eventually. This is exactly why the new leader adds it as a dependency (a prerequisite on its path to owning the limbo) via `PREV_TERM`. This situation might repeat several times in a row, which is why we introduce a queue for all pending promote requests.
The algorithm above makes sure that failures at steps 3.4 and 3.5 -- including the inability to complete them in time -- do not cause a split brain or permanent cluster desynchronization. A newly elected leader should always link all promote requests it has witnessed, so any other node should eventually come to the same conclusion regarding the limbo ownership.
If the leader has reached step 3.1 but failed to complete the write in its term, the promote request will not gather quorum and will be nopified upon arrival. The same goes for the corresponding confirm-for-promote at step 3.5: either the promote request reached quorum and will eventually become part of history with the help of a more recent promote request (via `PREV_TERM`), or it never gathered quorum and won't get a corresponding confirm-for-promote. Also keep in mind that if the message gathered a quorum, any future leader must have it.
## Regarding the upgrades
The patch adds two new iproto keys: `PREV_TERM` (0x71) and `WAIT_ACK` (0x72). We have observed an upgrade problem in clusters running tarantool versions prior to https://github.com/tarantool/tarantool/commit/ee0660b8316966e4f9b4c79b18b6a9837e83ac89. This is something to keep in mind. Luckily, the problem does not affect 2.11.2.
Speaking of the upgrade routine itself: there are no guard rails per se, but there is a limited compatibility mode for legacy promote requests (the ones that don't have `WAIT_ACK` set to `true`). We apply those immediately, without a quorum, which means that we can easily read old snapshots and even respect newly-elected leaders running a pre-patch tarantool version. However, once an upgraded instance emits a new-style promote request, it will eventually cause a transient "split brain" error on all pre-patch nodes that happen to be running. The error goes away once such a node is restarted with a post-patch tarantool binary.
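Continuing the illustrative model above, the dispatch might look roughly like this (`apply_promote` and the request shape are hypothetical; the real logic lives in the C applier):

```lua
-- Hypothetical dispatch for an incoming promote request. Legacy requests
-- lack WAIT_ACK and are applied immediately, without a quorum; new-style
-- ones are queued until a confirm-for-promote arrives.
local function apply_promote(req)
    if not req.wait_ack then
        -- Legacy (pre-patch) promote: takes effect right away.
        limbo.promote_greatest_term = req.term
    else
        -- New-style promote: becomes a pending promote request.
        add_promote(req.term)
    end
end
```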
Thus, the effective way to upgrade is the following:
1. Upgrade all replicas, but not the current leader L.
2. Promote any upgraded replica to leader, causing a transient "split brain" on L.
3. Upgrade L's binary and restart it, completing the upgrade.
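For step 2, a minimal console recipe, using the `box.ctl.promote()` + `box.ctl.wait_rw()` pairing described in the Questions section below:

```lua
-- On one of the upgraded replicas:
box.ctl.promote()   -- emits a new-style promote request
box.ctl.wait_rw()   -- wait until the ownership transition completes
```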
## The patchset structure
Currently, the commits may lack elaborate commit messages, yet each of them is meaningful on its own. Certain undecided features (e.g. everything `box.ctl.wait`-related) have been extracted into their own commits. We really shouldn't squash them prematurely, so as not to inhibit the quick removal of poor ideas.
## Questions
### Should `box.ctl.promote()` wait for the promote request to take effect?

Currently, `box.ctl.promote()` does not wait for the promote to take effect. However, it does block until a new promote is written and queued. In order to wait for completion, one is expected to use `box.ctl.wait_rw()`. Sadly, we can't say the same about `election_mode = "candidate"`, because elections in that mode happen implicitly, leaving no call to follow up on.
In other words, the recipe is:

- `box.ctl.promote()` => `box.ctl.wait_rw()`.
- `election_mode = "candidate"` => `while box.info.synchro.queue.term < box.info.election.term do fiber.sleep(0) end` (used in a few tests).
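Spelled out as a runnable snippet (the polling loop is the same one the tests use; note that `fiber` has to be required first):

```lua
local fiber = require('fiber')

-- Explicit promotion: promote, then wait until the instance is R/W.
box.ctl.promote()
box.ctl.wait_rw()

-- With election_mode = "candidate" there is no explicit call to follow,
-- so poll until the limbo catches up with the current election term.
while box.info.synchro.queue.term < box.info.election.term do
    fiber.sleep(0)
end
```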
For now, this seems like a good middle ground: not too blocking (no need to spawn a fiber most of the time) and not too irritating (it can be worked around). However, the author has no strong opinions on this matter and is willing to discuss alternatives.
### What about `IPROTO_RAFT_DEMOTE`?

Oh, this is a tricky one. Given that `IPROTO_RAFT_DEMOTE` only becomes relevant when `election_mode = 'off'` and serves as a "synchro escape hatch", we pretty much don't care about this mode in Picodata. Thus, I left the non-quorum implementation in place just to make the tests pass. However, if you feel strongly that we should implement a quorum demote for the sake of the upstream, I won't argue.
### What about on-disk data format (snapshots)?
We deliberately decided not to change the snapshot format and `IPROTO_JOIN_META`. Instead, we disallow both snapshotting and adding new nodes while the limbo ownership transition is taking place.
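As a consequence, administrative scripts may need to wait out a transition before snapshotting. A hypothetical guard, assuming the transition is observable via `box.info.synchro.queue.busy` (an existing field, though whether it covers this exact state is an assumption):

```lua
local fiber = require('fiber')

-- Hypothetical guard: wait until the limbo is idle before snapshotting.
-- Assumes box.info.synchro.queue.busy reflects the ownership transition;
-- box.snapshot() is expected to fail while the transition is in progress.
while box.info.synchro.queue.busy do
    fiber.sleep(0.1)
end
box.snapshot()
```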
### What about tests?
Considering that `box.ctl.promote()` does not wait for the promote to take effect, we had to add a dozen `box.ctl.wait_rw()` calls here and there. A few tests didn't make it: we had to drop them, being unable to fix them properly. Hopefully, the maintainers will have a better idea.
Furthermore, we have added a new test suite (`qpromote`) specifically for this feature -- among other things, it contains a few test cases which routinely cause a "split brain" on pre-patch versions.
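For illustration, a minimal test in this suite might look roughly like this (a hypothetical luatest sketch assuming the standard `luatest.server` helper; not an actual test from the suite):

```lua
local t = require('luatest')
local server = require('luatest.server')

local g = t.group('qpromote')

g.before_all(function()
    -- Single-node sketch; the real suite runs multi-node clusters with
    -- network partition scenarios that used to trigger "split brain".
    g.server = server:new({alias = 'qpromote_sketch'})
    g.server:start()
end)

g.after_all(function()
    g.server:drop()
end)

g.test_promote_then_wait_rw = function()
    g.server:exec(function()
        -- box.ctl.promote() does not wait for the promote to take effect,
        -- so follow it with box.ctl.wait_rw().
        box.ctl.promote()
        box.ctl.wait_rw()
        assert(not box.info.ro)
    end)
end
```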
Previous attempt: !160 (closed)
Tracking issue: https://git.picodata.io/picodata/tarantool/-/issues/48