replication: fix rebootstrap crash in case master has replica's rows

During SUBSCRIBE, of the rows originating from the subscribed replica, the master sends only those that aren't present on the replica. Such rows may appear after a sudden power loss in case the replica doesn't issue fdatasync() after each WAL write, which is the default behavior. This means that a replica can write some rows to WAL, relay them to another replica, then stop without syncing the WAL file. If this happens we expect the replica to read its own rows from other members of the cluster upon restart. For more details see commit eae84efb ("replication: recover missing local data from replica").

Obviously, this feature only makes sense for SUBSCRIBE. During JOIN we must relay all rows. This is how it initially worked, but commit adc28591 ("replication: do not delete relay on applier disconnect") witlessly removed the corresponding check from relay_send_row() so that now we don't send any rows originating from the joined replica:

    @@ -595,8 +630,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
     	 * it). In the latter case packet's LSN is less than or equal to
     	 * local master's LSN at the moment it received 'SUBSCRIBE' request.
     	 */
    -	if (relay->replica == NULL ||
    -	    packet->replica_id != relay->replica->id ||
    +	if (packet->replica_id != relay->replica->id ||
     	    packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
     				      packet->replica_id)) {
     		relay_send(relay, packet);

(relay->local_vclock_at_subscribe is initialized to 0 on JOIN)

This only affects the case of rebootstrap, automatic or manual, because when a new replica joins a cluster there can't be any rows on the master originating from it. On manual rebootstrap, i.e. when the replica files are deleted by the user and the replica is restarted from an empty directory with the same UUID (set via box.cfg.instance_uuid), this isn't critical - the replica will still receive those rows it should have received during JOIN once it subscribes.
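
To make the role of the removed check concrete, here is a minimal, self-contained sketch of the row filter as it is described above. It is not the code from relay.cc: the model_* types, the fixed-size array standing in for the vclock, and the function name are all hypothetical simplifications; only the shape of the condition follows the diff quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bound on replica ids, used only by this model. */
enum { MODEL_VCLOCK_MAX = 32 };

/* Hypothetical, simplified stand-ins for the real relay structures. */
struct model_replica {
	uint32_t id;
};

struct model_row {
	uint32_t replica_id;
	int64_t lsn;
};

struct model_relay {
	/* NULL models a relay that is not attached to a replica (JOIN). */
	const struct model_replica *replica;
	/* Per-replica LSNs the master had when SUBSCRIBE arrived. */
	int64_t local_vclock_at_subscribe[MODEL_VCLOCK_MAX];
};

/* Return true if the row should be sent to the peer. */
bool
model_relay_should_send(const struct model_relay *relay,
			const struct model_row *row)
{
	assert(row->replica_id < MODEL_VCLOCK_MAX);
	/* JOIN: no replica is attached, so every row is relayed. */
	if (relay->replica == NULL)
		return true;
	/* SUBSCRIBE: rows written by other instances are always sent. */
	if (row->replica_id != relay->replica->id)
		return true;
	/*
	 * SUBSCRIBE: the replica's own rows are sent only if the master
	 * already had them when the request arrived, i.e. the replica
	 * may have lost them.
	 */
	return row->lsn <=
	       relay->local_vclock_at_subscribe[row->replica_id];
}
```

In this model, dropping the NULL test while attaching a replica during JOIN reproduces the bug: with local_vclock_at_subscribe zeroed, every row of the joined replica with a positive LSN fails the last comparison and is silently skipped.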

However, in case of automatic rebootstrap this can result in broken order of xlog/snap files, because the replica directory still contains old xlog/snap files created before rebootstrap. The rebootstrap logic expects their vclocks to be strictly less than those of new files, but if JOIN stops prematurely, this condition may not hold, leading to a crash when the vclock of a new xlog/snap is inserted into the corresponding xdir.

This patch fixes this issue by restoring pre eae84efb behavior: now we create a new relay for FINAL JOIN instead of reusing the one attached to the joined replica so that relay_send_row() can detect JOIN phase and relay all rows in this case. It also adds a comment so that we don't make such a mistake in the future.

Apart from fixing the issue, this patch also fixes a relay leak in relay_initial_join() in case engine_join_xc() fails, which was also introduced by the above-mentioned commit.

A note about the xlog/panic_on_broken_lsn test. Now the relay status isn't reported by box.info.replication if FINAL JOIN failed and the replica never subscribed (this is how it worked before commit eae84efb), so we need to tweak the test a bit to handle this.

Closes #3740

Showing 7 changed files with 218 additions and 21 deletions:
- src/box/box.cc (1 addition, 2 deletions)
- src/box/relay.cc (25 additions, 11 deletions)
- src/box/relay.h (2 additions, 2 deletions)
- test/replication/replica_rejoin.result (137 additions, 0 deletions)
- test/replication/replica_rejoin.test.lua (51 additions, 0 deletions)
- test/xlog/panic_on_broken_lsn.result (1 addition, 4 deletions)
- test/xlog/panic_on_broken_lsn.test.lua (1 addition, 2 deletions)