replication: retry in case of XlogGapError
Previously XlogGapError was considered a critical error that stopped
replication. That is stricter than it needs to be.

XlogGapError is a perfectly fine error, which should not kill the
replication connection. It should be retried instead.

Here is an example where the gap is recovered on its own. Consider
the case: node1 is a leader, and it is booted with vclock {1: 3}.
Node2 connects and fetches a snapshot of node1, so it also gets
vclock {1: 3}. Then node1 writes something and its vclock becomes
{1: 4}. Now node3 boots from node1 and gets the same vclock {1: 4}.
The vclocks now look like this:

  - node1: {1: 4}, leader, has {1: 3} snap.
  - node2: {1: 3}, booted from node1, has only snap.
  - node3: {1: 4}, booted from node1, has only snap.

If the cluster is a fullmesh, node2 will send subscribe requests
with vclock {1: 3}. If node3 receives one, it will respond with an
xlog gap error, because it only has a snap with {1: 4} and nothing
else. In that case node2 should retry connecting to node3, and in
the meantime try to get newer changes from node1.

The example is totally valid. However, it is unreachable now,
because the master registers all replicas in _cluster before
allowing them to join. So they all bootstrap from a snapshot
containing all their IDs. This is a bug, because such
auto-registration leads to registration of anonymous replicas if
they are present during bootstrap. It also blocks Raft, which can't
work if there are registered but not yet joined nodes.

Once the registration problem is solved in a later commit,
XlogGapError will strike quite often during bootstrap. This patch
prevents that from breaking replication.

Needed for #5287
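The three-node example can be replayed with a small simulation. This is only an illustrative sketch in Python, not the code from src/box/applier.cc: the names Node, subscribe, sync_round and the `earliest` field (the oldest vclock a node can still stream rows from) are hypothetical and were introduced for this sketch. It assumes a gap is reported whenever the subscriber's vclock is behind the oldest data the master still has.

```python
from dataclasses import dataclass


class XlogGapError(Exception):
    """The master has no rows that directly follow the subscriber's vclock."""


@dataclass
class Node:
    name: str
    vclock: dict    # current vclock, e.g. {1: 4}
    earliest: dict  # hypothetical: oldest vclock this node can stream from


def subscribe(master, replica_vclock):
    """Return the master's vclock, or raise XlogGapError on a gap."""
    for replica_id, lsn in master.earliest.items():
        if replica_vclock.get(replica_id, 0) < lsn:
            raise XlogGapError(f"{master.name} has no rows after {replica_vclock}")
    return dict(master.vclock)


def sync_round(replica, masters):
    """Subscribe to every master once; return the names that need a retry."""
    retry = []
    for master in masters:
        try:
            remote = subscribe(master, replica.vclock)
        except XlogGapError:
            # Not fatal: keep the peer and try it again on the next round.
            retry.append(master.name)
            continue
        # Applying the received rows advances the replica component-wise.
        for replica_id, lsn in remote.items():
            replica.vclock[replica_id] = max(replica.vclock.get(replica_id, 0), lsn)
    return retry


node1 = Node("node1", {1: 4}, {1: 3})  # leader, still has the {1: 3} snap
node2 = Node("node2", {1: 3}, {1: 3})  # booted from node1, only snap
node3 = Node("node3", {1: 4}, {1: 4})  # booted from node1, only snap

print(sync_round(node2, [node3, node1]))  # ['node3']: gap on node3, caught up from node1
print(sync_round(node2, [node3, node1]))  # []: the retry to node3 now succeeds
```

In the first round node3 reports the gap while node1 brings node2 up to {1: 4}. In the second round the retried subscription to node3 goes through, which is the recovery described in the message.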
Changed files:
- src/box/applier.cc: 27 additions, 0 deletions
- test/replication/force_recovery.result: 5 additions, 5 deletions
- test/replication/force_recovery.test.lua: 2 additions, 2 deletions
- test/replication/replica.lua: 4 additions, 2 deletions
- test/replication/replica_rejoin.result: 11 additions, 3 deletions
- test/replication/replica_rejoin.test.lua: 7 additions, 2 deletions
- test/replication/show_error_on_disconnect.result: 1 addition, 1 deletion
- test/replication/show_error_on_disconnect.test.lua: 1 addition, 1 deletion
- test/xlog/panic_on_wal_error.result: 9 additions, 11 deletions
- test/xlog/panic_on_wal_error.test.lua: 7 additions, 5 deletions
- test/xlog/replica.lua: 4 additions, 2 deletions