promote: fix infinite elections with multi-promote
If box.ctl.promote() was called on more than one instance, it could lead to infinite or extremely long elections, bumping thousands of terms in just a few seconds.

This was because box.ctl.promote() used to be a loop. The loop retried term bump + vote for self until the node won. A retry happened immediately, as soon as the node saw the term was bumped again and there was no leader elected or the connection quorum was lost.

If 2 nodes started box.ctl.promote() at almost the same time, they could bump each other's terms, not see any winner, bump them again, and so on.

For example:
- Node1 term=1, node2 term=2;
- Promote is called on both;
- Node1 term=2, node2 term=3. They receive the messages. Node2 ignores node1's old term. Node1's term is bumped and it votes for node2, but node1 itself did not win, so its box.ctl.promote() bumps the term to 4.
- Node2 receives term 4 from node1. Its own box.ctl.promote() sees the term was bumped and there is no winner, so it bumps the term to 5, and the process continues for a long time.

It worked well enough in tests: the problem happened sometimes, terms could be bumped about 80k times in a few seconds, but the tests ended fine anyway.

One of the next commits will write the term bump and the vote in separate WAL records. That aggravates the problem drastically. Basically, this mutual term bump loop could end only if one node received a vote for itself from the other node and sent back the message 'I am a leader' before the other node's box.ctl.promote() noticed the term was bumped externally. This will get much harder to achieve.

The patch simply drops the loop. Let box.ctl.promote() fail if the term was bumped outside.

There was an alternative: keep running it in a loop with a randomized election timeout, like it works inside of raft. But the current solution is just simpler.

NO_DOC=bugfix
NO_TEST=election_split_vote_test.lua catches it already

(cherry picked from commit dd89c57e)
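The mutual term bump described in the example can be illustrated with a toy model. The sketch below is not Tarantool code: `simulate`, its `retry` flag and the lockstep message delivery are all assumptions made for illustration. It also assumes the worst case from the description, where with the old loop the retry always fires before the vote for the peer can produce a leader.

```python
def simulate(retry, max_rounds=1000):
    """Toy lockstep model of two nodes that both call promote.

    retry=True  models the old loop (bump the term again at once).
    retry=False models the patched behaviour (promote fails).
    Returns (leader, terms, rounds); leader is 1, 2 or None.
    """
    terms = {1: 1, 2: 2}
    promoting = {1: True, 2: True}
    # Both nodes call promote: bump the term and vote for self.
    for node in terms:
        terms[node] += 1
    for rnd in range(1, max_rounds + 1):
        # Each node sees the term its peer had after the previous round.
        seen = {1: terms[2], 2: terms[1]}
        for node, peer in ((1, 2), (2, 1)):
            if seen[node] <= terms[node]:
                continue  # an old term is ignored
            terms[node] = seen[node]
            if retry and promoting[node]:
                # Old behaviour: no winner is visible, so bump again.
                terms[node] += 1
            else:
                # New behaviour: promote fails, the node stays a voter.
                promoting[node] = False
                if promoting[peer] and terms[peer] == terms[node]:
                    # The peer gets this node's vote and wins the term.
                    return peer, terms, rnd
    return None, terms, max_rounds


if __name__ == "__main__":
    print("loop dropped:", simulate(retry=False))
    print("old loop:    ", simulate(retry=True, max_rounds=1000))
```

In this model the run without the loop elects node2 in the first round at term 3, while the run with the loop never elects a leader and raises the term once per round until `max_rounds` is exhausted.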
Showing
- changelogs/unreleased/promote-multiple-infinite.md: 4 additions, 0 deletions
- src/box/box.cc: 5 additions, 46 deletions
- src/box/errcode.h: 1 addition, 0 deletions
- src/box/raft.c: 27 additions, 9 deletions
- src/box/raft.h: 2 additions, 2 deletions
- test/box/error.result: 1 addition, 0 deletions