test: fix flaky unit/swim test
swim_test_indirect_ping() failed with random seed 1605651752.

The test created a cluster of 3 SWIM nodes and broke the network connection between node-1 and node-2. Then it ran the cluster for 10 seconds and ensured that both node-1 and node-2 were eventually alive, even though they could be suspected from time to time.

    node1 <-> node3 <-> node2

'Alive' means that a node is considered alive on all the other nodes.

The test spun for 10 seconds, giving the nodes a chance to become suspected. Then it checked that node-1 was either still alive, or suspected but restored within at most 3 seconds. The same was checked for node-2. They were supposed to interact via node-3.

The 3 seconds assumed that the worst that could happen was that node-1 was suspected on node-3 from the beginning of the three-second interval, because it was suspected by node-2 and the suspicion was disseminated to node-3. Then node-3 might need 1 second to finish its current dissemination round by sending a ping to node-2, 1 second to start a new round, randomly again from node-2, and only then send a ping to node-1. So 3 seconds in total.

But it could also happen that at the beginning of the three-second interval node-1 was already suspected on node-2. On the next step node-2 shares the suspicion with node-3, and then the scenario above happens. So the test case needed at least 4 seconds. In fact it could go on indefinitely, because while the test waits 3 seconds for the gossip about node-1 to be refuted on node-3, node-2 can suspect node-1 again. And so on.

Also, the test would pass even without indirect pings, because node-3 has access to both node-1 and node-2. So even if, say, node-1 suspects node-2, it tells node-3 about it. Node-3 pings node-2, gets an ack, and refutes the gossip. The refutation is then sent back to node-1. It means indirect pings don't matter here.

The patch adds a new test, which won't pass without indirect pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY, which allows turning off all the SWIM components except failure detection, so only pings and acks are sent. Without proper indirect pings node-1 and node-2 would suspect each other and eventually declare each other dead. The new test checks that this does not happen.

Closes #5399
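
The idea behind the new test can be illustrated with a toy model. The sketch below is not the Tarantool SWIM implementation and does not use its API: the function names, the thresholds NO_ACK_TO_SUSPECT and NO_ACK_TO_DEAD, and the rule "send indirect pings only to suspected members" are assumptions made for illustration. It models three nodes running failure detection only (no gossip), with the link between node-1 and node-2 (indices 0 and 1) broken.

```c
#include <stdbool.h>

/* Hypothetical constants, chosen only for this illustration. */
enum { NODE_COUNT = 3, NO_ACK_TO_SUSPECT = 2, NO_ACK_TO_DEAD = 4 };

/* link_up[i][j] is true when packets can travel between nodes i and j. */
static bool link_up[NODE_COUNT][NODE_COUNT];

/* Check whether some third node can relay a ping from src to dst. */
static bool
has_proxy(int src, int dst)
{
	for (int proxy = 0; proxy < NODE_COUNT; proxy++) {
		if (proxy == src || proxy == dst)
			continue;
		if (link_up[src][proxy] && link_up[proxy][dst])
			return true;
	}
	return false;
}

/*
 * Run failure detection only (pings and acks, no gossip) for the given
 * number of rounds with the node-1 <-> node-2 link broken. Return true
 * if any node declares another node dead.
 */
static bool
cluster_has_dead_member(int rounds, bool use_indirect)
{
	int unacked[NODE_COUNT][NODE_COUNT] = {{0}};

	for (int i = 0; i < NODE_COUNT; i++) {
		for (int j = 0; j < NODE_COUNT; j++)
			link_up[i][j] = true;
	}
	/* Break the direct connection between node-1 and node-2. */
	link_up[0][1] = false;
	link_up[1][0] = false;

	for (int round = 0; round < rounds; round++) {
		for (int src = 0; src < NODE_COUNT; src++) {
			for (int dst = 0; dst < NODE_COUNT; dst++) {
				if (src == dst)
					continue;
				bool suspected =
					unacked[src][dst] >= NO_ACK_TO_SUSPECT;
				/* Indirect pings go only to suspected members. */
				bool acked = link_up[src][dst] ||
					(use_indirect && suspected &&
					 has_proxy(src, dst));
				if (acked)
					unacked[src][dst] = 0;
				else if (++unacked[src][dst] >= NO_ACK_TO_DEAD)
					return true;
			}
		}
	}
	return false;
}
```

In this model cluster_has_dead_member(10, true) returns false: node-1 and node-2 suspect each other periodically, but the relay through node-3 always restores them. cluster_has_dead_member(10, false) returns true, because with gossip turned off nothing can refute the suspicion.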
Showing
- test/unit/suite.ini: 0 additions, 8 deletions
- test/unit/swim.c: 1 addition, 25 deletions
- test/unit/swim.result: 7 additions, 13 deletions
- test/unit/swim_errinj.c: 40 additions, 1 deletion
- test/unit/swim_errinj.result: 7 additions, 1 deletion