Commit e23b14dc authored by Vladislav Shpilevoy

test: fix flaky unit/swim test

swim_test_indirect_ping() failed with random seed 1605651752.

The test created a cluster of 3 swim nodes and broke the network
connection between node-1 and node-2. Then it ran the cluster for
10 seconds and ensured that both node-1 and node-2 were eventually
alive, even though they could be suspected at times.

    node1 <-> node3 <-> node2

'Alive' means that a node is considered alive on all the other
nodes.

The test spun for 10 seconds giving the nodes a chance to become
suspected. Then it checked that node-1 is either still alive, or
it is suspected, but will be restored in at most 3 seconds. The
same was checked for node-2. They were supposed to interact via
node-3.

3 seconds was chosen on the assumption that the worst that could
happen is that the checked node, say node-1, is suspected on
node-3 from the beginning of this three-second interval, because
it was suspected by node-2 and the suspicion was disseminated to
node-3.

Then node-3 might need 1 second to finish its current
dissemination round by sending a ping to node-2, 1 second to start
a new round, randomly from node-2 again, and only then send a ping
to node-1. So 3 seconds total.

But it could also happen that at the beginning of the
three-second interval node-1 is already suspected on node-2. On
the next step node-2 shares the suspicion with node-3, and then
the scenario above happens. So the test case needed at least 4
seconds.
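
The arithmetic above can be written out as a short sketch. It only
counts the steps named in this message at 1 second each. The names
are illustrative and are not part of the swim test code.

```python
# Illustrative arithmetic only; the 1-second step is the figure used above.
STEP_SECONDS = 1

# Steps covered by the original 3-second bound
# (node-1 is already suspected on node-3 when the interval starts).
assumed_worst_case = [
    "node-3 finishes its current round with a ping to node-2",
    "node-3 starts a new round, randomly from node-2 again",
    "node-3 finally pings node-1",
]

# The overlooked case puts one more step in front:
# node-1 is suspected on node-2, and node-3 does not know yet.
actual_case = ["node-2 shares its suspicion with node-3"] + assumed_worst_case


def seconds_needed(steps):
    """Total time if every step takes STEP_SECONDS."""
    return len(steps) * STEP_SECONDS


print(seconds_needed(assumed_worst_case))  # 3
print(seconds_needed(actual_case))         # 4
```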

And actually this could go on indefinitely, because while the test
waits 3 seconds for the gossip about node-1 to be refuted on
node-3, node-2 can suspect node-1 again. And so on.

Also the test would pass even without indirect pings, because
node-3 has access to both node-1 and node-2. So even if, say,
node-1 suspects node-2, it will tell node-3 about it. Node-3 will
ping node-2, get an ack, and refute the gossip. The refutation
will then be sent back to node-1. It means indirect pings don't
matter here.

The patch adds a new test, which won't pass without indirect
pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY,
which turns off all the SWIM components except failure detection.
So only pings and acks are sent.

Then without proper indirect pings node-1 and node-2 would suspect
each other and eventually declare each other dead. The new test
checks that this does not happen.
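
A toy model of the failure-detection-only setup may help to see why
indirect pings decide the outcome. It is a simplified sketch, not
the real SWIM implementation or the new unit test: the topology
matches the diagram above, and all function names are hypothetical.

```python
# Toy model: failure detection only, no gossip. Not the real swim code.
# Working links; the node-1 <-> node-2 link is broken.
LINKS = {(1, 3), (3, 1), (2, 3), (3, 2)}
NODES = [1, 2, 3]


def direct_ping(src, dst):
    """A ping succeeds only if both the ping and its ack can travel."""
    return (src, dst) in LINKS and (dst, src) in LINKS


def indirect_ping(src, dst):
    """Ask every other node to ping dst on behalf of src."""
    for proxy in NODES:
        if proxy in (src, dst):
            continue
        if direct_ping(src, proxy) and direct_ping(proxy, dst):
            return True
    return False


def probe(src, dst, use_indirect):
    """Status of dst as seen by src after one probe."""
    if direct_ping(src, dst):
        return "alive"
    if use_indirect and indirect_ping(src, dst):
        return "alive"
    return "suspected"


print(probe(1, 2, use_indirect=False))  # suspected
print(probe(1, 2, use_indirect=True))   # alive
```

In this model node-1 can keep node-2 alive only through node-3,
which is the property the new test relies on.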

Closes #5399
parent 546499c9