Author: Serge Petrenko
Commit c1c77782 ("replication: fix bootstrap failing with ER_READONLY") made the applier retry the connection infinitely upon receiving an ER_READONLY error during join. At the time that commit was written, this was the only way to make the join retriable, because there were no retries in the scope of bootstrap_from_master: the join either succeeded or failed.

Later on, bootstrap_from_master was made retriable in commit f2ad1dee ("replication: retry join automatically"). Now, when bootstrap_from_master fails, the replica reconnects to all the remote nodes, thus updating their ballots, chooses a new bootstrap leader (possibly different from the one chosen on the previous attempt), and retries booting from it.

The second approach is preferable, and here's why. Imagine bootstrapping a cluster of 3 nodes, A, B and C, in a full-mesh topology. B and C connect to all the remote peers almost instantly, and both independently decide that B will be the bootstrap leader (B has the smallest uuid among A, B and C). At the same time, A can't connect to C. B bootstraps the cluster and joins C. After C is joined, A finally connects to C.

Now A can choose a bootstrap leader. It has an old ballot from B (smallest uuid, but not yet booted) and C's ballot (already booted). This is because C's ballot was received after the cluster bootstrap, while B's ballot was received before it. So A believes C is a better bootstrap leader and tries to boot from it. A's join to C fails, because C, in the meantime, tries to sync with everyone, including A, and thus stays read-only. Since A retries joining the same instance over and over again, A and C are left stuck forever.

Let's retry ER_READONLY on another level: instead of trying to join the same bootstrap leader over and over, try to choose a new bootstrap leader and boot from it. In the situation described above, this means that A would try to join C once, fail due to ER_READONLY, re-fetch fresh ballots from everyone and choose B as the join master (now B has the smallest uuid and is booted).

The issue was discovered because linearizable_test.lua occasionally hung with the following output:

NO_WRAP
No output during 40 seconds. Will abort after 320 seconds without output.
List of workers not reporting the status:
- 059_replication-luatest [replication-luatest/linearizable_test.lua, None] at /tmp/t/059_replication-luatest/linearizable.result:0
[059] replication-luatest/linearizable_test.lua                       [ fail ]
[059] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/linearizable.reject:
[059] TAP version 13
[059] 1..6
[059] # Started on Thu Sep 29 10:30:45 2022
[059] # Starting group: linearizable-read
[059] not ok 1 linearizable-read.test_wait_others
[059] # ....11.0~entrypoint.531.dev/test/luatest_helpers/server.lua:104: Waiting for "readiness" on server server_1-q7berSRY4Q_E (PID 53608) timed out
[059] # stack traceback:
[059] # ....11.0~entrypoint.531.dev/test/luatest_helpers/server.lua:104: in function 'wait_for_readiness'
[059] # ...11.0~entrypoint.531.dev/test/luatest_helpers/cluster.lua:92: in function 'start'
[059] # ...t.531.dev/test/replication-luatest/linearizable_test.lua:50: in function <...t.531.dev/test/replication-luatest/linearizable_test.lua:20>
[059] # ...
[059] # [C]: in function 'xpcall'
NO_WRAP

Part-of #7737

NO_DOC=bugfix
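For illustration only, below is a minimal standalone C sketch of the idea described above, not Tarantool's actual code or API: when a join attempt fails with ER_READONLY, the joining node re-fetches everyone's ballots and re-runs leader selection, preferring already-booted instances and breaking ties by the smallest uuid. The names (struct ballot_info, pick_bootstrap_leader) and the example uuids are hypothetical.

/*
 * Hypothetical sketch, not Tarantool's real data structures: the bits
 * of a ballot that matter when choosing a bootstrap leader.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct ballot_info {
	const char *uuid;	/* instance uuid as a string */
	bool is_booted;		/* has the instance finished bootstrap? */
};

/*
 * Prefer booted instances; among equally preferred candidates pick
 * the one with the smallest uuid, as in the description above.
 */
static const struct ballot_info *
pick_bootstrap_leader(const struct ballot_info *ballots, int count)
{
	const struct ballot_info *best = NULL;
	for (int i = 0; i < count; i++) {
		const struct ballot_info *b = &ballots[i];
		if (best == NULL ||
		    (b->is_booted && !best->is_booted) ||
		    (b->is_booted == best->is_booted &&
		     strcmp(b->uuid, best->uuid) < 0))
			best = b;
	}
	return best;
}

int
main(void)
{
	/* A's stale view: B is not booted yet, C already is. */
	struct ballot_info stale[] = {
		{"bbbbbbbb-0000-0000-0000-000000000000", false},
		{"cccccccc-0000-0000-0000-000000000000", true},
	};
	/* After re-fetching the ballots both B and C are booted. */
	struct ballot_info fresh[] = {
		{"bbbbbbbb-0000-0000-0000-000000000000", true},
		{"cccccccc-0000-0000-0000-000000000000", true},
	};
	/* First attempt targets C; the retry after ER_READONLY picks B. */
	printf("stale ballots -> join %s\n",
	       pick_bootstrap_leader(stale, 2)->uuid);
	printf("fresh ballots -> join %s\n",
	       pick_bootstrap_leader(fresh, 2)->uuid);
	return 0;
}

The point, as stated above, is that this re-selection now happens at the bootstrap retry level rather than inside the applier's own ER_READONLY retry loop.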