Commit b200d298 authored by Vladislav Shpilevoy

promote: fix infinite elections with multi-promote

If box.ctl.promote() was called on more than one instance, then it
could lead to infinite or extremely long elections bumping
thousands of terms in just a few seconds.

This was because box.ctl.promote() used to be a loop. The loop
retried a term bump plus a vote for self until the node won. The
retry happened immediately once the node saw that the term was
bumped again and there was no leader elected or the connection
quorum was lost.

If 2 nodes started box.ctl.promote() at almost the same time,
they could bump each other's terms, not see any winner, bump them
again, and so on. For example:

- Node1 term=1, node2 term=2;
- Promote is called on both;
- Node1 term=2, node2 term=3. They receive the messages. Node2
    ignores node1's old term. Node1's term is bumped to 3 and it
    votes for node2, but node1 itself didn't win, so its
    box.ctl.promote() bumps the term to 4.
- Node2 receives term 4 from node1. Its own box.ctl.promote() sees
    the term was bumped and no winner, so it bumps it to 5 and the
    process continues for a long time.
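
The sequence above can be replayed with a small toy model. This is
only a sketch, not Tarantool code: the function name and the lockstep
message delivery are assumptions made for illustration, and the rare
case where one node wins before the other notices the bump is left
out on purpose.

```python
def old_promote_race(term1, term2, max_bumps):
    """Toy model of two nodes running the old looping promote.

    Hypothetical helper, not part of Tarantool. Returns the list of
    term snapshots after each step.
    """
    terms = {"node1": term1, "node2": term2}
    # Both nodes call promote: each bumps its own term and votes for self.
    for name in terms:
        terms[name] += 1
    trace = [dict(terms)]
    # The node with the lower term is the first to see a foreign bump.
    behind, ahead = sorted(terms, key=terms.get)
    for _ in range(max_bumps):
        # 'behind' receives the higher term and sees no leader, so its
        # promote loop retries at once: it bumps past the foreign term.
        terms[behind] = terms[ahead] + 1
        trace.append(dict(terms))
        # Now the other node is the one that lags behind.
        behind, ahead = ahead, behind
    return trace


if __name__ == "__main__":
    for step in old_promote_race(1, 2, 3):
        print(step)
```

In this model every delivered message causes one more bump, so the
terms grow for as long as the messages keep flowing.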

It worked well enough in tests: the problem happened sometimes,
terms could roll about 80k times in a few seconds, but the tests
ended fine anyway.

One of the next commits will write the term bump and the vote in
separate WAL records. That aggravates the problem drastically.

Basically, this mutual term bump loop could end only if one node
received a vote for itself from the other node and sent back the
message 'I am a leader' before the other node's box.ctl.promote()
noticed the term was bumped externally. This will get much harder
to achieve.

The patch simply drops the loop. Now box.ctl.promote() fails if
the term was bumped from outside.
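
A minimal sketch of the single-attempt behavior, assuming a simplified
node model. The names Node, TermBumpedError, promote_once and the
wait_for_outcome callback are hypothetical and exist only for this
illustration; the real change is in Tarantool's own code and may be
structured differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TermBumpedError(Exception):
    """Hypothetical error: the term was bumped externally during promote."""


@dataclass
class Node:
    node_id: int
    term: int
    leader_id: Optional[int] = None


def promote_once(node: Node, wait_for_outcome: Callable[[Node], None]) -> bool:
    # One term bump plus a vote for self, with no retry afterwards.
    node.term += 1
    own_term = node.term
    node.leader_id = None
    # Stand-in for waiting on election events; it may mutate the node.
    wait_for_outcome(node)
    if node.term != own_term:
        # Someone else bumped the term: fail instead of bumping again.
        raise TermBumpedError(
            f"term {own_term} was superseded by term {node.term}")
    # True only if this node became the leader in its own term.
    return node.leader_id == node.node_id
```

Because the function never bumps the term a second time, two nodes
that interfere with each other stop after one round, and the caller
decides whether to call promote again.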

An alternative was to keep running it in a loop with a randomized
election timeout, the way it works inside raft. But the current
solution is just simpler.

NO_DOC=bugfix
NO_TEST=election_split_vote_test.lua catches it already

(cherry picked from commit dd89c57e)
parent 705f0e51