Fix replication freeze if slave bumps lsn while master is down
To avoid rescanning the last recovered xlog in case it has been properly finalized, recover_remaining_wals() skips xlogs whose signature is less than the signature of the current recovery position. This assumption is incorrect if this function is used for replication. For example consider the following scenario in case of master -> slave replication:

 1. Master temporarily shuts down.
 2. Slave bumps its LSN while master is down.
 3. Master is brought back online.
 4. Slave reconnects to master.

In such a case the recovery vclock signature sent by slave on reconnect will be greater than the signature of the xlog file created after master restart, causing replication to silently freeze.

Instead of comparing xlog signature to recovery position, we should compare it to the signature of the last scanned xlog. To do that, we need to remove TRASH() from xlog_cursor_close() so that xlog cursor meta isn't overwritten on close. To make sure nobody attempts to use a closed cursor, let's add corresponding assertions to each public xlog cursor function.

Fixes b25c60f0 ("recovery: do not rescan last xlog")
Closes #3038
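
The difference between the two checks can be illustrated with a small, simplified model. This is only a sketch and not the actual patch: the types and functions below (struct cursor, cursor_open(), cursor_close(), skip_by_recovery_position(), skip_by_last_scanned()) are hypothetical stand-ins, and a signature is modelled as the plain sum of per-instance LSNs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a signature is the sum of all per-instance LSNs. */
int64_t
signature(const int64_t *lsn, size_t n)
{
	assert(lsn != NULL || n == 0);
	int64_t sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += lsn[i];
	return sum;
}

/* Hypothetical cursor: remembers the signature of the xlog it scanned. */
enum cursor_state { CURSOR_NEW, CURSOR_ACTIVE, CURSOR_CLOSED };

struct cursor {
	enum cursor_state state;
	int64_t meta_signature;
};

void
cursor_open(struct cursor *c, int64_t xlog_signature)
{
	c->state = CURSOR_ACTIVE;
	c->meta_signature = xlog_signature;
}

void
cursor_close(struct cursor *c)
{
	/* Using a cursor that is not open is a bug, so assert on it. */
	assert(c->state == CURSOR_ACTIVE);
	/* Only the state changes; the meta is kept for later comparison. */
	c->state = CURSOR_CLOSED;
}

/* Old check: skip an xlog that lies below the recovery position. */
bool
skip_by_recovery_position(int64_t xlog_signature, int64_t recovery_signature)
{
	return xlog_signature < recovery_signature;
}

/* New check: skip only an xlog that is not newer than the last scanned one. */
bool
skip_by_last_scanned(const struct cursor *last, int64_t xlog_signature)
{
	return last->state != CURSOR_NEW &&
	       xlog_signature <= last->meta_signature;
}
```

Under these assumptions, take a master that stopped at LSN 100 and a slave that then wrote 5 rows of its own. The xlog created after master restart has signature 100, while the slave reconnects with signature 105. The old check skips that xlog, so rows later written to it are never relayed; the new check compares against the last scanned xlog instead and does not skip it.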
Showing
- src/box/recovery.cc: 2 additions, 3 deletions
- src/box/xlog.c: 12 additions, 4 deletions
- test/replication/errinj.result: 63 additions, 0 deletions
- test/replication/errinj.test.lua: 23 additions, 0 deletions