test: fix flaky read_only_reason test
This is the second attempt to stabilize the #5568 test, following 5105c2d7 ("test: fix flaky read_only_reason test"). The test had several failures of varying frequency:

* test_read_only_reason_synchro() could see ro_reason as nil even after box.error.READONLY was raised (and some yields passed);
* test_read_only_reason_orphan() could do the same;
* `box.ctl.demote()` could raise the error "box.ctl.demote does not support simultaneous invocations".

The orphan failure couldn't be reproduced. It was caught only locally, so maybe it was just some unnoticed local diff breaking the test.

The failure of test_read_only_reason_synchro() could happen when demote() was called in a previous test case, then the current test case called promote() and got box.error.READONLY on the replica, and only then was that old demote() delivered to the replica, so the attempt to get ro_reason returned nil.

The attempted fix is to set replication_synchro_quorum = 2, so that the master's promote()/demote() implicitly pushes the previous operation to the replica via a term bump and a quorum wait.

Additionally, a huge replication_synchro_timeout is set for manual promotions. Automatic promotion is retried, so the timeout is not so important there.

The `box.ctl.demote` failure due to `simultaneous invocations` seems to happen because the original auto-election win has not finished the limbo transition yet. Hence the instance calling demote() thinks it is being called 'simultaneously' with another promote()/demote().

The attempted fix comes from 2 sides:

- Wait for `not box.info.ro` on the leader node after auto-promotion, to ensure the limbo is taken by the leader;
- The first option didn't help much, so `box.ctl.demote()` is simply called in a loop until it succeeds.

Closes #6670
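The "call demote() in a loop until it succeeds" approach can be sketched roughly as below. This is only an illustration, not the code from the patch: `retry_until_success` and `fake_demote` are hypothetical names, and the stub stands in for `box.ctl.demote()` so the snippet runs in plain Lua without a Tarantool instance.

```lua
-- Hypothetical helper (not part of the patch): call fn() until it stops
-- raising an error, giving up after max_attempts tries.
local function retry_until_success(fn, max_attempts)
    local last_err
    for attempt = 1, max_attempts do
        local ok, err = pcall(fn)
        if ok then
            -- Report how many tries were needed.
            return attempt
        end
        last_err = err
    end
    error(('still failing after %d attempts: %s'):format(
        max_attempts, tostring(last_err)))
end

-- Stub standing in for box.ctl.demote(): it raises the
-- "simultaneous invocations" error twice, then succeeds.
local calls = 0
local function fake_demote()
    calls = calls + 1
    if calls < 3 then
        error('box.ctl.demote does not support simultaneous invocations')
    end
end

local attempts = retry_until_success(fake_demote, 10)
print(attempts)
```

In a real test the function passed in would presumably be `box.ctl.demote` itself, with a yield between attempts so the pending limbo transition has a chance to finish.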