Skip to content
Snippets Groups Projects
Commit 54495510 authored by Vladislav Shpilevoy's avatar Vladislav Shpilevoy
Browse files

replication: send raft terms in applier heartbeats

There was a bug that an instance could ack a transaction from an
old Raft term thus allowing the old leader to CONFIRM it, even if
that first instance knew there is a newer Raft term going on.

As a result, the old leader could write CONFIRM even if there is
already a new leader elected and the synchro quorum was > half.
That led to split-brain, when bad txn reached the new leader, and
PROMOTE reached the old leader.

Split-brain here is totally unnecessary. If the quorum is correct,
synchro timeout is infinite, and there is no async transactions,
then split-brain shouldn't ever happen.

The fix is as simple as attach the current Raft term number to
applier heartbeats.

In the testcase above if terms are attached, the old leader gets
ACK + new term. That causes the old leader freeze even if the
pending txn got quorum. The old leader can't CONFIRM nor ROLLBACK
its pending txns until a new leader is elected.

Freeze is guaranteed, because if a new leader was elected, then it
had got votes from > half cluster. It means > half nodes have the
new term. That in turn means the old leader during collecting ACKs
for its "new" txn will get the new term number from at least one
replica.

When the new leader finished writing PROMOTE, it either confirms
or rolls back the txn of the old leader (depending on whether it
has reached the new leader before promotion). Neither result
causes split brain. The rollback only causes a non-critical error
on the old leader raised by the bad txn's commit attempt.

There were some alternatives considered. One of the most promising
ones was to make instances reject txns if they see these txns
coming from an instance having an old Raft term. It would help in
the test provided above. But wouldn't do in a more complicated
test, when there is a third node which gets the bad transaction,
then gets local term bumped, and then replicates to any other
instance. Others would accept that bad txn, because the sender has
a newer Raft term, even though the txn author is still in the old
term. Tracking terms of txn author is not possible in too many
cases so as to rely on that.

Closes #7253

@TarantoolBot document
Title: New iproto field in applier -> relay ACKs
The applier->relay channel (from replica back to master) is used
only for sending ACKs. Replication data goes the other way
(relay->applier).

These ACKs had 2 fields: `IPROTO_VCLOCK (0x26)` and
`IPROTO_VCLOCK_SYNC (0x5a)`.

Now they have a new field: `IPROTO_TERM (0x53)`. It is a unsigned
number containing `box.info.election.term` of the sender node
(applier, replica).
parent 135fd0ff
No related branches found
No related tags found
Loading
Showing
with 386 additions and 72 deletions
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment