fix: don't cpipe_push into a closed pipe in tx_status_update
Summary
- fix: don't cpipe_push into a closed pipe in tx_status_update
This bug was discovered while developing the quorum promote feature: it makes qpromote_several_outstanding_promotes_test.lua fail intermittently. We also suspect that it is the underlying cause of lost heartbeats and the severe replication lag that follows.
Long story short, whenever we want to send a message from the TX thread back to a relay thread, we should first check that the two are still connected. Otherwise, we'll see either:
- An assertion failure in debug, or
- (Presumably) a relay hangup in release due to status_msg->msg.route != NULL in relay_status_update() -> relay_check_status_needs_update().
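The guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: every `toy_*` name and the `is_open` flag are hypothetical stand-ins, not Tarantool's cbus or relay API. The `assert` in `toy_pipe_push()` only mimics the kind of debug-build assertion seen in the backtrace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a cbus pipe; it only counts accepted messages. */
struct toy_pipe {
	bool is_open; /* assumed flag: false once the other endpoint is gone */
	int pushed;   /* number of messages pushed so far */
};

/* Hypothetical stand-in for the relay object that owns the pipe. */
struct toy_relay {
	struct toy_pipe relay_pipe;
};

/* Hypothetical stand-in for the status message travelling TX -> relay. */
struct toy_status_msg {
	struct toy_relay *relay;
};

/* Pushing into a closed pipe is a programming error, as in debug builds. */
void
toy_pipe_push(struct toy_pipe *pipe)
{
	assert(pipe->is_open);
	pipe->pushed++;
}

/*
 * TX-side handler: send the reply back only if the pipe is still open.
 * Returns true if the message was pushed, false if it was dropped.
 */
bool
toy_tx_status_update(struct toy_status_msg *msg)
{
	struct toy_pipe *pipe = &msg->relay->relay_pipe;
	if (!pipe->is_open)
		return false; /* the relay has disconnected: do not push */
	toy_pipe_push(pipe);
	return true;
}
```

With an open pipe, `toy_tx_status_update()` pushes the message and returns true. Once `is_open` is cleared, it returns false without touching the pipe, so the assertion is never reached.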
Upstream is aware of this issue: https://github.com/tarantool/tarantool/issues/9920
Backtrace:
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
0x00007f891f6a5463 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
0x00007f891f64c120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
0x00007f891f6334c3 in __GI_abort () at abort.c:79
0x00007f891f6333df in __assert_fail_base (fmt=0x7f891f7c3c20 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x58560077b94b "loop() == pipe->producer", file=file@entry=0x58560077b935 "./src/lib/core/cbus.h", line=line@entry=224,
function=function@entry=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:94
0x00007f891f644177 in __assert_fail (assertion=0x58560077b94b "loop() == pipe->producer", file=0x58560077b935 "./src/lib/core/cbus.h", line=224,
function=0x58560077b910 "void cpipe_push_input(cpipe*, cmsg*)") at assert.c:103
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
0x00005856001be16a in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x585600469b82 <fiber_pool_f>, ap=0x7f891e8129a8)
at tarantool/src/lib/core/fiber.h:1283
0x000058560045f495 in fiber_loop (data=0x0) at tarantool/src/lib/core/fiber.c:1085
0x0000585600745b8f in coro_init () at tarantool/third_party/coro/coro.c:108
Relevant frames:
0x0000585600328782 in cpipe_push_input (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:224
224 assert(loop() == pipe->producer);
(gdb)
0x0000585600328802 in cpipe_push (pipe=0x58563115dfc8, msg=0x58563115e028) at tarantool/src/lib/core/cbus.h:241
241 cpipe_push_input(pipe, msg);
(gdb)
0x000058560032a157 in tx_status_update (msg=0x58563115e028) at tarantool/src/box/relay.cc:629
629 cpipe_push(&status->relay->relay_pipe, msg);
(gdb)
0x0000585600468d4b in cmsg_deliver (msg=0x58563115e028) at tarantool/src/lib/core/cbus.c:553
553 msg->hop->f(msg);
(gdb)
0x0000585600469e50 in fiber_pool_f (ap=0x7f891e8129a8) at tarantool/src/lib/core/fiber_pool.c:64
64 cmsg_deliver(msg);
(gdb)
Close #...
Docs follow-up: not necessary / new issue
Edited by Dmitry Ivanov