Fix race in garbage collection
Engine callbacks that perform garbage collection may sleep, because they use coio for removing files to avoid blocking the TX thread. If garbage collection is called concurrently from different fibers (e.g. from relay fibers), we may attempt to delete the same file multiple times. What is worse xdir_collect_garbage(), used by engine callbacks to remove files, isn't safe against concurrent execution - it first unlinks a file via coio, which involves a yield, and only then removes the corresponding vclock from the directory index. This opens a race window for another fiber to read the same clock and yield, in the interim the vclock can be freed by the first fiber: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x00007f105ceda3fa in __GI_abort () at abort.c:89 #2 0x000055e4c03f4a3d in sig_fatal_cb (signo=11) at main.cc:184 #3 <signal handler called> #4 0x000055e4c066907a in vclockset_remove (rbtree=0x55e4c1010e58, node=0x55e4c1023d20) at box/vclock.c:215 #5 0x000055e4c06256af in xdir_collect_garbage (dir=0x55e4c1010e28, signature=342, use_coio=true) at box/xlog.c:620 #6 0x000055e4c0417dcc in memtx_engine_collect_garbage (engine=0x55e4c1010df0, lsn=342) at box/memtx_engine.c:784 #7 0x000055e4c0414dbf in engine_collect_garbage (lsn=342) at box/engine.c:155 #8 0x000055e4c04a36c7 in gc_run () at box/gc.c:192 #9 0x000055e4c04a38f2 in gc_consumer_advance (consumer=0x55e4c1021360, signature=342) at box/gc.c:262 #10 0x000055e4c04b4da8 in tx_gc_advance (msg=0x7f1028000aa0) at box/relay.cc:250 #11 0x000055e4c04eb854 in cmsg_deliver (msg=0x7f1028000aa0) at cbus.c:353 #12 0x000055e4c04ec871 in fiber_pool_f (ap=0x7f1056800ec0) at fiber_pool.c:64 #13 0x000055e4c03f4784 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55e4c04ec6d4 <fiber_pool_f>, ap=0x7f1056800ec0) at fiber.h:665 #14 0x000055e4c04e6816 in fiber_loop (data=0x0) at fiber.c:631 #15 0x000055e4c0687dab in coro_init () at /home/vlad/src/tarantool/third_party/coro/coro.c:110 Fix this by serializing concurrent execution of garbage collection callbacks with a latch.
Showing
- src/box/gc.c 19 additions, 3 deletionssrc/box/gc.c
- test/replication/gc.result 31 additions, 0 deletionstest/replication/gc.result
- test/replication/gc.test.lua 16 additions, 0 deletionstest/replication/gc.test.lua
- test/replication/lua/fast_replica.lua 27 additions, 0 deletionstest/replication/lua/fast_replica.lua
Please register or sign in to comment