Commit ad562340 authored by Konstantin Belyavskiy, committed by Vladimir Davydov

replication: fix relay disconnect due to race condition

Incoming ACKs lead to a race condition that prevents heartbeat
messages from being sent, so the connection ends up being
dropped on timeout. This fix is based on @locker's proposal to
send the vclock only in reply to the master (since the master
itself sends heartbeat messages).

Closes #3160
parent b9b7eb74
@@ -98,6 +98,11 @@ applier_log_error(struct applier *applier, struct error *e)
 /*
  * Fiber function to write vclock to replication master.
+ * To track connection status, replica answers master
+ * with encoded vclock. In addition to DML requests,
+ * master also sends heartbeat messages every
+ * replication_timeout seconds (introduced in 1.7.7).
+ * On such requests replica also responds with vclock.
  */
 static int
 applier_writer_f(va_list ap)
@@ -106,10 +111,18 @@ applier_writer_f(va_list ap)
 	struct ev_io io;
 	coio_create(&io, applier->io.fd);
 	/* Re-connect loop */
 	while (!fiber_is_cancelled()) {
-		fiber_cond_wait_timeout(&applier->writer_cond,
-					replication_timeout);
+		/*
+		 * Tarantool >= 1.7.7 sends periodic heartbeat
+		 * messages so we don't need to send ACKs every
+		 * replication_timeout seconds any more.
+		 */
+		if (applier->version_id >= version_id(1, 7, 7))
+			fiber_cond_wait_timeout(&applier->writer_cond,
+						TIMEOUT_INFINITY);
+		else
+			fiber_cond_wait_timeout(&applier->writer_cond,
+						replication_timeout);
 		/* Send ACKs only when in FOLLOW mode ,*/
 		if (applier->state != APPLIER_SYNC &&
 		    applier->state != APPLIER_FOLLOW)
......
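The timeout choice made in the hunk above can be isolated as a small pure function. What follows is only a sketch under stated assumptions: `SKETCH_VERSION_ID`, `SKETCH_TIMEOUT_INFINITY` and `sketch_ack_wait_timeout` are hypothetical names, and the one-byte-per-component version packing is an assumption made for illustration, not necessarily how `version_id()` is defined in the tree.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/*
 * Assumption: a version is packed one byte per component, so that
 * packed values compare in the same order as (major, minor, patch).
 */
#define SKETCH_VERSION_ID(major, minor, patch) \
	(((((uint32_t)(major) << 8) | (uint32_t)(minor)) << 8) | (uint32_t)(patch))

/* Hypothetical stand-in for "sleep until explicitly woken up". */
#define SKETCH_TIMEOUT_INFINITY HUGE_VAL

/*
 * Pick how long the ACK writer may sleep before sending an
 * unsolicited ACK. A master that sends heartbeats itself
 * (version >= 1.7.7) wakes the writer through the rows it sends,
 * so no periodic wake-up is needed on the replica side.
 */
static double
sketch_ack_wait_timeout(uint32_t master_version, double replication_timeout)
{
	assert(replication_timeout > 0);
	if (master_version >= SKETCH_VERSION_ID(1, 7, 7))
		return SKETCH_TIMEOUT_INFINITY;
	return replication_timeout;
}
```

With a 1.7.6 master the function returns `replication_timeout` unchanged, which keeps the old periodic-ACK behaviour for peers that do not send heartbeats.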
@@ -472,6 +472,45 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 ---
 - ok
 ...
+-- Check replica's ACKs don't prevent the master from sending
+-- heartbeat messages (gh-3160).
+test_run:cmd("start server replica_timeout with args='0.009'")
+---
+- true
+...
+test_run:cmd("switch replica_timeout")
+---
+- true
+...
+fiber = require('fiber')
+---
+...
+while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
+---
+...
+box.info.replication[1].upstream.status -- follow
+---
+- follow
+...
+for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+---
+...
+box.info.replication[1].upstream.status -- follow
+---
+- follow
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica_timeout")
+---
+- true
+...
+test_run:cmd("cleanup server replica_timeout")
+---
+- true
+...
 box.snapshot()
 ---
 - ok
......
@@ -196,6 +196,22 @@ test_run:cmd("stop server replica_timeout")
 test_run:cmd("cleanup server replica_timeout")
 errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
+-- Check replica's ACKs don't prevent the master from sending
+-- heartbeat messages (gh-3160).
+test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("switch replica_timeout")
+fiber = require('fiber')
+while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
+box.info.replication[1].upstream.status -- follow
+for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+box.info.replication[1].upstream.status -- follow
+test_run:cmd("switch default")
+test_run:cmd("stop server replica_timeout")
+test_run:cmd("cleanup server replica_timeout")
 box.snapshot()
 for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
......
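The regression test first waits for the upstream to reach 'follow' and then samples the status 16 more times, stopping early if it ever leaves 'follow'. The same "must stay in state across N polls" check can be sketched in C as below. Everything here is hypothetical and for illustration only: `sketch_status_fn`, `sketch_stays_in_follow` and the scripted probe `sketch_script_next` are not part of Tarantool, and the sleep between polls is left out to keep the sketch deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical probe: returns the current upstream status string. */
typedef const char *(*sketch_status_fn)(void *ctx);

/*
 * Return true only if every one of `polls` samples reports "follow".
 * Stops at the first sample that reports anything else, like the
 * `break` in the test loop.
 */
static bool
sketch_stays_in_follow(sketch_status_fn probe, void *ctx, int polls)
{
	assert(polls > 0);
	for (int i = 0; i < polls; i++) {
		if (strcmp(probe(ctx), "follow") != 0)
			return false;
	}
	return true;
}

/* Scripted probe for demonstration: replays a fixed list of statuses. */
struct sketch_script {
	const char **statuses;
	size_t len;
	size_t pos;
};

static const char *
sketch_script_next(void *ctx)
{
	struct sketch_script *s = ctx;
	assert(s->len > 0);
	if (s->pos < s->len)
		return s->statuses[s->pos++];
	/* Past the end of the script: keep reporting the last status. */
	return s->statuses[s->len - 1];
}
```

A real probe would read the live replication status and sleep between samples; the scripted one only makes the early-exit behaviour easy to check.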