raft: persist new term and vote separately
If a node persisted a foreign term and a vote request in the same WAL
write, it increased the split-brain probability. A node could vote for a
candidate having a smaller vclock than the local one. For example, via
the following scenario:

- Node1, node2, node3 are started;
- Node1 becomes a leader;
- The topology becomes node1 <-> node2 <-> node3 due to network issues;
- Node1 sends a synchro txn to node2. The txn starts a WAL write;
- Node3 bumps term and votes for self. Sends it all to node2;
- Node2 votes for node3, because their vclocks are equal;
- Node2 finishes all pending WAL writes, including the txn from node1.
  Now its vclock is > node3's one and the vote was wrong;
- Node3 wins, writes PROMOTE, and it conflicts with node1 writing
  CONFIRM.

This patch makes it so a node can't persist a vote in a new term in the
same WAL write as the term bump. The term bump is written first and
alone. It serves as a WAL sync after which the node's vclock is not
supposed to change except for the 0 (local) component. The vote requests
are re-checked after the term bump is persisted to see if they can still
be applied.

Part of #7253

NO_DOC=bugfix

(cherry picked from commit c9155ac8)
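The two-step persistence described above can be illustrated with a minimal sketch. It is not the code from src/lib/raft/raft.c: the struct layout, the helper names (`node_on_vote_request`, `node_on_term_persisted`, `vclock_can_vote_for`) and the 4-component vclock are all hypothetical, and the vclock comparison is simplified to "the candidate must not be behind in any non-local component".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-size vclock; component 0 is the local one. */
enum { VCLOCK_MAX = 4 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Simplified check (an assumption, not the real comparison): the
 * candidate is acceptable if it is not behind the local vclock in
 * any component except 0.
 */
static bool
vclock_can_vote_for(const struct vclock *local,
		    const struct vclock *candidate)
{
	for (int i = 1; i < VCLOCK_MAX; i++) {
		if (candidate->lsn[i] < local->lsn[i])
			return false;
	}
	return true;
}

struct node {
	uint64_t term;			/* Persisted term. */
	uint32_t vote;			/* Persisted vote, 0 means none. */
	struct vclock vclock;		/* Advances as WAL writes finish. */
	uint64_t volatile_term;		/* Term seen but not yet persisted. */
	uint32_t pending_candidate;	/* Vote request waiting for re-check. */
	struct vclock pending_vclock;	/* Vclock sent by that candidate. */
};

/*
 * A vote request for a newer term arrives. Only the term bump is
 * scheduled; the vote is remembered but not written yet.
 */
static void
node_on_vote_request(struct node *n, uint64_t term, uint32_t candidate,
		     const struct vclock *candidate_vclock)
{
	assert(candidate != 0);
	if (term <= n->volatile_term)
		return;	/* The sketch only covers requests in a new term. */
	n->volatile_term = term;
	n->pending_candidate = candidate;
	n->pending_vclock = *candidate_vclock;
}

/*
 * The term bump is persisted. Earlier WAL writes are done by now, so
 * the non-local vclock components are final and the remembered vote
 * request can be re-checked against them.
 */
static void
node_on_term_persisted(struct node *n)
{
	n->term = n->volatile_term;
	n->vote = 0;
	if (n->pending_candidate == 0)
		return;
	if (vclock_can_vote_for(&n->vclock, &n->pending_vclock))
		n->vote = n->pending_candidate;	/* A second, separate write. */
	n->pending_candidate = 0;
}
```

In the scenario from the commit message, node2 would call `node_on_vote_request()` when node3's request arrives, its vclock would advance when node1's txn finishes its WAL write, and `node_on_term_persisted()` would then find node3's vclock behind and leave the vote unset.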
Showing
- changelogs/unreleased/gh-7253-split-brain-early-vote.md: 5 additions, 0 deletions
- src/lib/raft/raft.c: 99 additions, 19 deletions
- src/lib/raft/raft.h: 6 additions, 0 deletions
- test/replication-luatest/gh_7253_election_long_wal_write_test.lua: 106 additions, 0 deletions
- test/replication/election_basic.result: 35 additions, 13 deletions
- test/replication/election_basic.test.lua: 20 additions, 13 deletions
- test/unit/raft.c: 182 additions, 59 deletions
- test/unit/raft.result: 49 additions, 32 deletions