Serge Petrenko authored
The commit c1c77782 ("replication: fix bootstrap failing with
ER_READONLY") made the applier retry the connection infinitely upon
receiving an ER_READONLY error on join. At the time of writing that
commit, this was the only way to make the join retriable, because there
were no retries in the scope of bootstrap_from_master: the join either
succeeded or failed.

Later on, bootstrap_from_master was made retriable in commit
f2ad1dee ("replication: retry join automatically"). Now when
bootstrap_from_master fails, the replica reconnects to all the remote
nodes, thus updating their ballots, chooses a new (possibly different
from the previous one) bootstrap leader, and retries booting from it.

The second approach is preferable, and here's why. Imagine
bootstrapping a cluster of 3 nodes, A, B and C, in a full-mesh topology.
B and C connect to all the remote peers almost instantly, and both
independently decide that B will be the bootstrap leader (meaning it has
the smallest uuid among A, B and C).

At the same time, A can't connect to C. B bootstraps the cluster and
joins C. After C is joined, A finally connects to C. Now A can choose a
bootstrap leader. It has B's old ballot (smallest uuid, but not yet
booted) and C's ballot (already booted). This is because C's ballot was
received after the cluster bootstrap, while B's ballot was received
earlier than that. So A believes C is a better bootstrap leader and
tries to boot from it.
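
The comparison A performs can be pictured with the minimal sketch below.
It uses a hypothetical, simplified ballot struct and helper, not the
actual ones from Tarantool's source: a booted peer beats a non-booted
one, and only among equals does the smallest uuid win.
NO_WRAP
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified ballot: only the fields relevant here. */
struct ballot_sketch {
	const char *uuid;   /* instance uuid, compared lexicographically */
	bool is_booted;     /* whether the peer already joined a cluster */
};

/* Return true if @b is a better bootstrap leader candidate than @a. */
static bool
ballot_is_better(const struct ballot_sketch *a,
		 const struct ballot_sketch *b)
{
	if (a->is_booted != b->is_booted)
		return b->is_booted;             /* booted peers win */
	return strcmp(b->uuid, a->uuid) < 0;     /* then smallest uuid */
}

int
main(void)
{
	/* What A sees: B's stale ballot vs. C's post-bootstrap ballot. */
	struct ballot_sketch b = { "bbbbbbbb-0000-0000-0000-000000000000", false };
	struct ballot_sketch c = { "cccccccc-0000-0000-0000-000000000000", true };
	const struct ballot_sketch *leader =
		ballot_is_better(&b, &c) ? &c : &b;
	printf("A picks %s as the bootstrap leader\n", leader->uuid);
	return 0;
}
NO_WRAP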

A will fail to join C, because at the same time C tries to sync with
everyone, including A, and thus stays read-only. Since A retries joining
the same instance over and over again, this situation leaves A and C
stuck forever.

Let's retry ER_READONLY on another level: instead of trying to join the
same bootstrap leader over and over, choose a new bootstrap leader and
boot from it.
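
In pseudo-C, the intended retry placement looks roughly like the sketch
below. fetch_ballots(), choose_bootstrap_leader() and join_to() are
hypothetical helpers standing in for the real replication code: an
ER_READONLY failure sends the replica back to leader election instead of
back to the same join attempt.
NO_WRAP
#include <stdbool.h>
#include <stddef.h>

enum join_rc { JOIN_OK, JOIN_ER_READONLY, JOIN_FATAL };

struct peer;                                  /* opaque remote node */

/* Hypothetical helpers, not the real replication API. */
extern size_t fetch_ballots(struct peer **peers);  /* reconnect, refresh ballots */
extern struct peer *choose_bootstrap_leader(struct peer **peers, size_t n);
extern enum join_rc join_to(struct peer *leader);

static bool
bootstrap_sketch(struct peer **peers)
{
	while (true) {
		size_t n = fetch_ballots(peers);
		struct peer *leader = choose_bootstrap_leader(peers, n);
		switch (join_to(leader)) {
		case JOIN_OK:
			return true;      /* joined successfully */
		case JOIN_ER_READONLY:
			continue;         /* re-elect a (possibly new) leader */
		case JOIN_FATAL:
			return false;     /* give up */
		}
	}
}
NO_WRAP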

In the situation described above, this means that A would try to join
C once, fail due to ER_READONLY, re-fetch new ballots from everyone and
choose B as the join master (it now has the smallest uuid and is
booted).

The issue was discovered due to linearizable_test.lua hanging
occasionally with the following output:
NO_WRAP
 No output during 40 seconds. Will abort after 320 seconds without output. List of workers not reporting the status:
- 059_replication-luatest [replication-luatest/linearizable_test.lua, None] at /tmp/t/059_replication-luatest/linearizable.result:0
[059] replication-luatest/linearizable_test.lua                       [ fail ]
[059] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/linearizable.reject:
[059] TAP version 13
[059] 1..6
[059] # Started on Thu Sep 29 10:30:45 2022
[059] # Starting group: linearizable-read
[059] not ok 1	linearizable-read.test_wait_others
[059] #   ....11.0~entrypoint.531.dev/test/luatest_helpers/server.lua:104: Waiting for "readiness" on server server_1-q7berSRY4Q_E (PID 53608) timed out
[059] #   stack traceback:
[059] #   	....11.0~entrypoint.531.dev/test/luatest_helpers/server.lua:104: in function 'wait_for_readiness'
[059] #   	...11.0~entrypoint.531.dev/test/luatest_helpers/cluster.lua:92: in function 'start'
[059] #   	...t.531.dev/test/replication-luatest/linearizable_test.lua:50: in function <...t.531.dev/test/replication-luatest/linearizable_test.lua:20>
[059] #   	...
[059] #   	[C]: in function 'xpcall'
NO_WRAP

Part-of #7737

NO_DOC=bugfix