replication: fix flaky election_qsync.test

Fix the test failing occasionally with the following result mismatch:

[001] replication/election_qsync.test.lua       memtx           [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- replication/election_qsync.result	Thu Jul 15 17:15:48 2021
[001] +++ var/rejects/replication/election_qsync.reject	Thu Jul 15 20:46:51 2021
[001] @@ -145,8 +145,7 @@
[001]  | ...
[001]  box.space.test:select{}
[001]  | ---
[001] - | - - [1]
[001] - |   - [2]
[001] + | - - [2]
[001]  | ...
[001]  box.space.test:drop()
[001]  | ---
[001]

The issue happened because row [1] wasn't delivered to the 'default'
instance from the 'replica' at all. The test does try to wait for [1] to be
written to WAL and replicated, but sometimes it fails to wait until this
event happens:

box.ctl.promote() is issued asynchronously once the instance becomes the
Raft leader. So issuing `box.ctl.wait_rw()` doesn't guarantee that the
replica has already written the PROMOTE (the limbo is initially unclaimed so
replica becomes writeable as soon as it becomes the Raft leader).

Right after `wait_rw()` we wait for lsn propagation and for 'default'
instance to reach replica's lsn. It may happen that lsn propagation happens
due to PROMOTE being written to WAL, and not row [1].

When this is the case, the 'default' instance doesn't receive row [1] at
all, resulting in the test error shown above.

Fix the issue by waiting for the promotion to happen explicitly.

Part of #5430
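The race described in the commit message can be illustrated with a toy model in plain Lua. This is only a sketch under simplifying assumptions, not Tarantool code: the WAL is modeled as a list whose length is the lsn, the asynchronous PROMOTE is modeled as a write that lands either before or after the test samples the lsn, and every name here (`new_instance`, `write_promote`, `write_row`, `promote_is_written`, `simulate`) is hypothetical. Only the final comparison mirrors the condition used in the fix, `box.info.synchro.queue.owner == box.info.id`.

```lua
-- Toy model of the lsn race; runs in plain Lua, no Tarantool required.
-- All helpers below are hypothetical and exist only for illustration.

-- A fake instance: the WAL is a list, and the lsn is simply its length.
local function new_instance(id)
    return {id = id, wal = {}, queue_owner = 0}
end

local function lsn(inst)
    return #inst.wal
end

-- Writing PROMOTE bumps the lsn and claims the synchro queue (the limbo).
local function write_promote(inst)
    table.insert(inst.wal, 'PROMOTE')
    inst.queue_owner = inst.id
end

local function write_row(inst, row)
    table.insert(inst.wal, row)
end

-- Models the condition the fixed test waits for.
local function promote_is_written(inst)
    return inst.queue_owner == inst.id
end

-- Returns the WAL entry that satisfies the test's "lsn > old_lsn" wait.
function simulate(wait_for_promote)
    local inst = new_instance(2)
    if wait_for_promote then
        -- Fixed test: PROMOTE is known to be written before lsn is sampled.
        write_promote(inst)
        assert(promote_is_written(inst))
    end
    local old_lsn = lsn(inst)
    if not wait_for_promote then
        -- Flaky test: the pending PROMOTE lands after lsn was sampled.
        write_promote(inst)
    end
    write_row(inst, 'row [1]')
    return inst.wal[old_lsn + 1]
end

print(simulate(false)) --> PROMOTE
print(simulate(true))  --> row [1]
```

In the flaky ordering, the first lsn bump after sampling comes from PROMOTE, so a wait keyed only on lsn growth can finish before row [1] is written. Once the test waits for queue ownership first, the next lsn bump can only come from the row.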

Commit 096a0a7d authored 3 years ago by Serge Petrenko, committed by Kirill Yukhin 3 years ago
--- a/test/replication/election_qsync.result
+++ b/test/replication/election_qsync.result
@@ -75,13 +75,19 @@ box.cfg{
 | ---
 | ...

-box.ctl.wait_rw()
+-- Promote is written asynchronously to the instance becoming the leader, so
+-- wait for it. As soon as it's written, the instance's definitely a leader.
+test_run:wait_cond(function()                                                   \
+    return box.info.synchro.queue.owner == box.info.id                          \
+end)
 | ---
+ | - true
 | ...
 assert(box.info.election.state == 'leader')
 | ---
 | - true
 | ...
+
 lsn = box.info.lsn
 | ---
 | ...

--- a/test/replication/election_qsync.test.lua
+++ b/test/replication/election_qsync.test.lua
@@ -39,8 +39,13 @@ box.cfg{
    replication_timeout = 0.1,                                                  \
 }

-box.ctl.wait_rw()
+-- Promote is written asynchronously to the instance becoming the leader, so
+-- wait for it. As soon as it's written, the instance's definitely a leader.
+test_run:wait_cond(function()                                                   \
+    return box.info.synchro.queue.owner == box.info.id                          \
+end)
 assert(box.info.election.state == 'leader')
+
 lsn = box.info.lsn
 _ = fiber.create(function()                                                     \
    ok, err = pcall(box.space.test.replace, box.space.test, {1})                \