raft: prevent disruptive elections

Raft cluster members treat the leader as alive only as long as they keep
a direct connection to it. This means that as soon as one of the
followers loses its connection to the leader, it starts new elections,
bumping the term and forcing the existing leader to step down.

Moreover, when a server is partitioned from the rest of the cluster, it
repeatedly starts new elections by bumping the term and voting for
itself. It never gets any votes, since it is partitioned, but its term
keeps growing. Once such a server reconnects to the rest of the cluster,
it makes everyone read-only, because the working majority will probably
have a smaller term than this server.

Such elections, performed while there is a working leader (as in the
first example) or when there is no chance of winning (as in the second
example), are considered disruptive.

Diego Ongaro's thesis (https://github.com/ongardie/dissertation)
proposes the following solution to the problem of disruptive servers: an
additional election phase called Pre-Vote. The basic idea is to bump the
term and start elections only once a quorum of peers reply OK to a
Pre-Vote message. A peer replies OK when:
 - the candidate's log is sufficiently up to date
 - the peer doesn't see the leader

The downside of this approach is an additional request travelling to and
from the followers.

Tarantool's replication architecture allows us to achieve the same goals
without issuing Pre-Vote requests. Here's what this patch does:

1) Track the number of live peer connections, and only start elections
   when there is a quorum of connected peers.
2) Make every node broadcast whether it sees the leader of the current
   term. Every candidate collects info about who sees the leader in the
   current term. When at least one node sees the leader, the candidate
   doesn't start new elections.
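
The two rules above can be sketched as a single predicate. This is only
an illustration, not the code from src/box/raft.c: `struct peer_view`,
`quorum_size()` and `may_start_election()` are hypothetical names, the
quorum is assumed to be a simple majority, and only reports from
currently connected peers are assumed to count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer state; not the actual Tarantool structures. */
struct peer_view {
	/* This node has a live connection to the peer. */
	bool is_connected;
	/* The peer reported that it sees the leader of the current term. */
	bool sees_leader;
};

/* Assumption: the quorum is a simple majority of the cluster. */
static int
quorum_size(int cluster_size)
{
	assert(cluster_size > 0);
	return cluster_size / 2 + 1;
}

/*
 * Decide whether this node may bump the term and start elections.
 * peers[] describes the other cluster members, excluding this node.
 */
static bool
may_start_election(const struct peer_view *peers, int peer_count,
		   bool self_sees_leader)
{
	/* The node itself counts towards the quorum. */
	int connected = 1;
	if (self_sees_leader)
		return false;
	for (int i = 0; i < peer_count; i++) {
		if (!peers[i].is_connected)
			continue;
		connected++;
		/* Rule 2: someone still sees the leader, don't disrupt. */
		if (peers[i].sees_leader)
			return false;
	}
	/* Rule 1: no quorum of live connections means no chance to win. */
	return connected >= quorum_size(peer_count + 1);
}
```

In a 3-node cluster, a node that lost the leader but is still connected
to one peer that also lost it may start elections. A fully partitioned
node may not, so its term stops growing.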

Closes #6654

@TarantoolBot document
Title: elections: new `box.info` field and binary protocol changes

* `box.info.election` got a new field: `leader_idle`. When elections are
  enabled, it shows the time in seconds since the last interaction with
  the known leader.

* `IPROTO_RAFT` request got 2 new fields:
  `IPROTO_RAFT_LEADER_ID = 0x04` - uint - the id of the current leader
  (as seen by the node which issued the request).
  `IPROTO_RAFT_IS_LEADER_SEEN = 0x05` - bool - whether the node has a
  direct connection to the leader.
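
To make the wire format of the two new keys concrete, here is a minimal
sketch that encodes them alone as a MsgPack map. The key values come
from the description above; everything else is an assumption made for
illustration: `encode_new_raft_fields()` is a hypothetical helper, the
request body is assumed to be a MsgPack map, the leader id is assumed to
fit a positive fixint, and the other `IPROTO_RAFT` keys are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
	IPROTO_RAFT_LEADER_ID = 0x04,
	IPROTO_RAFT_IS_LEADER_SEEN = 0x05,
};

/*
 * Hypothetical helper: write {0x04: leader_id, 0x05: is_leader_seen}
 * as a MsgPack fixmap. Returns the number of bytes written, or 0 when
 * the buffer is too small.
 */
static size_t
encode_new_raft_fields(uint8_t *buf, size_t size, uint8_t leader_id,
		       bool is_leader_seen)
{
	/* Assumption: the id fits a MsgPack positive fixint (0..127). */
	assert(leader_id <= 0x7f);
	if (size < 5)
		return 0;
	buf[0] = 0x80 | 2;                      /* fixmap with 2 pairs */
	buf[1] = IPROTO_RAFT_LEADER_ID;         /* key as positive fixint */
	buf[2] = leader_id;                     /* uint value */
	buf[3] = IPROTO_RAFT_IS_LEADER_SEEN;    /* key as positive fixint */
	buf[4] = is_leader_seen ? 0xc3 : 0xc2;  /* MsgPack true / false */
	return 5;
}
```

A node that sees leader 3 directly would produce the bytes
`82 04 03 05 c3` under these assumptions.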
Showing
- changelogs/unreleased/elections-pre-vote.md (10 additions, 0 deletions)
- src/box/box.cc (1 addition, 3 deletions)
- src/box/iproto_constants.h (2 additions, 0 deletions)
- src/box/lua/info.c (4 additions, 0 deletions)
- src/box/raft.c (136 additions, 0 deletions)
- src/box/raft.h (4 additions, 0 deletions)
- src/box/replication.cc (9 additions, 2 deletions)
- src/box/xrow.c (28 additions, 0 deletions)
- src/box/xrow.h (2 additions, 0 deletions)
- test/replication-luatest/election_pre_vote_test.lua (106 additions, 0 deletions)
- test/replication-luatest/election_split_vote_test.lua (4 additions, 0 deletions)
- test/replication/gh-3055-election-promote.result (9 additions, 3 deletions)
- test/replication/gh-3055-election-promote.test.lua (5 additions, 3 deletions)
- test/replication/gh-6127-election-join-new.result (5 additions, 1 deletion)
- test/replication/gh-6127-election-join-new.test.lua (3 additions, 1 deletion)
- test/replication/gh-6127-master1.lua (2 additions, 1 deletion)
- test/replication/gh-6127-master2.lua (2 additions, 1 deletion)
- test/replication/gh-6127-replica.lua (1 addition, 0 deletions)