Commit 096a0a7d authored by Serge Petrenko, committed by Kirill Yukhin

replication: fix flaky election_qsync.test

Fix the test failing occasionally with the following result mismatch:

[001] replication/election_qsync.test.lua             memtx           [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- replication/election_qsync.result	Thu Jul 15 17:15:48 2021
[001] +++ var/rejects/replication/election_qsync.reject	Thu Jul 15 20:46:51 2021
[001] @@ -145,8 +145,7 @@
[001]   | ...
[001]  box.space.test:select{}
[001]   | ---
[001] - | - - [1]
[001] - |   - [2]
[001] + | - - [2]
[001]   | ...
[001]  box.space.test:drop()
[001]   | ---
[001]

The issue happened because row [1] was never delivered from the
'replica' to the 'default' instance. The test does try to wait for [1]
to be written to WAL and replicated, but the wait sometimes returns
before that happens:

box.ctl.promote() is issued asynchronously once the instance becomes the
Raft leader. So a return from `box.ctl.wait_rw()` doesn't guarantee that
the replica has already written the PROMOTE: the limbo is initially
unclaimed, so the replica becomes writable as soon as it becomes the
Raft leader.

Right after `wait_rw()` we save the replica's lsn, then wait for it to
advance and for the 'default' instance to reach it. The lsn may advance
because PROMOTE is written to WAL, and not because of row [1].

When this is the case, the 'default' instance doesn't receive row [1] at
all, resulting in the test error shown above.
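
The race can be illustrated with a toy model in plain Lua. This is only a
sketch, not Tarantool code: `new_replica` and
`first_entry_after_saved_lsn` are hypothetical names, and the model
assumes every WAL write bumps the lsn by exactly one.

```lua
-- Toy model, not Tarantool code: each WAL write bumps the lsn by one.
local function new_replica(promote_written)
    local r = {lsn = 0, wal = {}}
    function r:write(entry)
        self.lsn = self.lsn + 1
        self.wal[#self.wal + 1] = entry
    end
    -- Optionally model PROMOTE having reached the WAL before the test
    -- saves the lsn.
    if promote_written then
        r:write('PROMOTE')
    end
    return r
end

-- Returns the WAL entry that makes "lsn > saved lsn" become true.
local function first_entry_after_saved_lsn(promote_written)
    local r = new_replica(promote_written)
    local saved = r.lsn            -- models `lsn = box.info.lsn`
    if not promote_written then
        r:write('PROMOTE')         -- the asynchronous promote lands late
    end
    r:write('row [1]')
    return r.wal[saved + 1]
end

print(first_entry_after_saved_lsn(false)) -- PROMOTE
print(first_entry_after_saved_lsn(true))  -- row [1]
```

In the first call the lsn wait is satisfied by PROMOTE, so it proves
nothing about row [1]. In the second call the only write that can satisfy
it is row [1].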

Fix the issue by waiting for the promotion to happen explicitly.
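
A minimal sketch of what "waiting explicitly" means, again in plain Lua
with no Tarantool dependency. The `wait_cond` helper here is a
hypothetical stand-in for `test_run:wait_cond()`, and the `info` table
only imitates the shape of `box.info`. The sketch assumes an owner of 0
stands for an unclaimed limbo.

```lua
-- Hypothetical stand-in for test_run:wait_cond(): poll `cond`, calling
-- `step` between polls to let the toy world advance.
local function wait_cond(cond, step, max_steps)
    for _ = 1, max_steps do
        if cond() then
            return true
        end
        step()
    end
    return cond()
end

-- Imitation of box.info; owner == 0 models an unclaimed limbo.
local info = {id = 2, synchro = {queue = {owner = 0}}}

local ticks = 0
local function step()
    ticks = ticks + 1
    -- Model the asynchronous PROMOTE being written on the third tick.
    if ticks == 3 then
        info.synchro.queue.owner = info.id
    end
end

local ok = wait_cond(function()
    return info.synchro.queue.owner == info.id
end, step, 10)
print(ok, ticks)
```

The wait ends only once the instance owns the synchronous queue, so an
lsn saved afterwards already accounts for PROMOTE.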

Part of #5430
parent cdb234e1
@@ -75,13 +75,19 @@ box.cfg{
| ---
| ...
box.ctl.wait_rw()
-- Promote is written asynchronously to the instance becoming the leader, so
-- wait for it. As soon as it's written, the instance's definitely a leader.
test_run:wait_cond(function() \
return box.info.synchro.queue.owner == box.info.id \
end)
| ---
| - true
| ...
assert(box.info.election.state == 'leader')
| ---
| - true
| ...
lsn = box.info.lsn
| ---
| ...
......
@@ -39,8 +39,13 @@ box.cfg{
replication_timeout = 0.1, \
}
box.ctl.wait_rw()
-- Promote is written asynchronously to the instance becoming the leader, so
-- wait for it. As soon as it's written, the instance's definitely a leader.
test_run:wait_cond(function() \
return box.info.synchro.queue.owner == box.info.id \
end)
assert(box.info.election.state == 'leader')
lsn = box.info.lsn
_ = fiber.create(function() \
ok, err = pcall(box.space.test.replace, box.space.test, {1}) \
......