test: fix replication/gc flaky failures

Two problems are fixed here. The first one is about correctness of the test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually the rows are applied and the new vclock is sent to the master before replication is disabled. So the master removes the old xlog before the replica restarts, and the next case tests nothing.

This commit uses the new test-run ability to stop a tarantool instance with a custom signal and stops the replica with SIGKILL without dropping ERRINJ_WAL_DELAY. This change fixes the race between applying rows and disabling replication and so makes the test case correct.

About flaky failures. They looked like this:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029]  ...
 | [029]  wait_xlog(1) or fio.listdir('./master')
 | [029]  ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] +  - 00000000000000000305.xlog
 | [029] +  - '512'
 | [029] +  - 00000000000000000310.xlog
 | [029] +  - 00000000000000000310.vylog
 | [029] +  - 00000000000000000310.snap
 | [029]  ...
 | [029]  -- Stop the replica.
 | [029]  test_run:cmd("stop server replica")
 | <...next cases could have induced mismatches too...>

The reason for the failure is that the replica applied all rows from the old xlog, but didn't send an ACK with the new vclock to the master, because replication was disabled before that. The master stops the relay and keeps the old xlog. When the replica starts again, it subscribes with a vclock value that instructs the relay to open the new xlog.

Tarantool can remove an old xlog right after a replica's ACK, when it observes that the xlog was fully read by all replicas. But tarantool does not remove xlogs at the moment a replica subscribes.
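
The difference between the flaky run and the fixed run can be illustrated with a toy model. This is only a sketch in Python, not Tarantool code: the class ToyMaster, its methods and the helper functions are hypothetical, there is a single replica, and an xlog is considered fully read once the replica has ACKed the LSN at which the next xlog starts. The LSNs 305 and 310 are taken from the file names in the diff above.

```python
class ToyMaster:
    """Toy model of xlog removal on a master with one replica."""

    def __init__(self, xlog_starts):
        # Starting LSNs of the xlog files that exist on the master.
        self.xlogs = sorted(xlog_starts)
        # Last LSN confirmed (ACKed) by the replica.
        self.acked = self.xlogs[0]

    def on_ack(self, lsn):
        # An ACK moves the replica position forward; every xlog that
        # ends at or before that position is removed, except the newest.
        self.acked = max(self.acked, lsn)
        while len(self.xlogs) > 1 and self.xlogs[1] <= self.acked:
            self.xlogs.pop(0)

    def on_subscribe(self, lsn):
        # Assumption from the text above: a subscribe only selects the
        # xlog the relay opens, it never removes files.
        return max(start for start in self.xlogs if start <= lsn)


def flaky_run():
    # The replica applied rows up to 310, sent no ACK and re-subscribed.
    master = ToyMaster([305, 310])
    opened = master.on_subscribe(310)
    return opened, master.xlogs


def fixed_run():
    # The replica was killed before applying the rows, so it subscribes
    # with the old position, reads the old xlog and then ACKs.
    master = ToyMaster([305, 310])
    opened = master.on_subscribe(305)
    master.on_ack(310)
    return opened, master.xlogs
```

In this model flaky_run() leaves both xlogs on the master, while fixed_run() leaves only the new one. The model says nothing about when a 'stuck' xlog is eventually removed.
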
This is not a big problem, because such a 'stuck' xlog file will be removed together with the next xlog removal. There was an attempt to fix this behaviour and remove old xlogs at subscribe, see the following commits:

* b5b4809c ('replication: update replica gc state on subscribe');
* 766cd3e1 ('Revert "replication: update replica gc state on subscribe"').

Anyway, this commit fixes these flaky failures, because it stops the replica before it applies rows from the old xlog. So when the replica starts, it continues reading from the old xlog, and the xlog file is removed once it has been fully read.

Closes #4162