    b8717738
    Fix race in garbage collection · b8717738
    Vladimir Davydov authored
    Engine callbacks that perform garbage collection may sleep, because they
    use coio for removing files to avoid blocking the TX thread. If garbage
    collection is called concurrently from different fibers (e.g. from relay
    fibers), we may attempt to delete the same file multiple times. What is
    worse, xdir_collect_garbage(), used by engine callbacks to remove files,
    isn't safe against concurrent execution - it first unlinks a file via
    coio, which involves a yield, and only then removes the corresponding
    vclock from the directory index. This opens a race window for another
    fiber to read the same vclock and yield; in the interim, the vclock can
    be freed by the first fiber:
    
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
      #1  0x00007f105ceda3fa in __GI_abort () at abort.c:89
      #2  0x000055e4c03f4a3d in sig_fatal_cb (signo=11) at main.cc:184
      #3  <signal handler called>
      #4  0x000055e4c066907a in vclockset_remove (rbtree=0x55e4c1010e58, node=0x55e4c1023d20) at box/vclock.c:215
      #5  0x000055e4c06256af in xdir_collect_garbage (dir=0x55e4c1010e28, signature=342, use_coio=true) at box/xlog.c:620
      #6  0x000055e4c0417dcc in memtx_engine_collect_garbage (engine=0x55e4c1010df0, lsn=342) at box/memtx_engine.c:784
      #7  0x000055e4c0414dbf in engine_collect_garbage (lsn=342) at box/engine.c:155
      #8  0x000055e4c04a36c7 in gc_run () at box/gc.c:192
      #9  0x000055e4c04a38f2 in gc_consumer_advance (consumer=0x55e4c1021360, signature=342) at box/gc.c:262
      #10 0x000055e4c04b4da8 in tx_gc_advance (msg=0x7f1028000aa0) at box/relay.cc:250
      #11 0x000055e4c04eb854 in cmsg_deliver (msg=0x7f1028000aa0) at cbus.c:353
      #12 0x000055e4c04ec871 in fiber_pool_f (ap=0x7f1056800ec0) at fiber_pool.c:64
      #13 0x000055e4c03f4784 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55e4c04ec6d4 <fiber_pool_f>, ap=0x7f1056800ec0) at fiber.h:665
      #14 0x000055e4c04e6816 in fiber_loop (data=0x0) at fiber.c:631
      #15 0x000055e4c0687dab in coro_init () at /home/vlad/src/tarantool/third_party/coro/coro.c:110
    
    Fix this by serializing concurrent execution of garbage collection
    callbacks with a latch.
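
    The idea can be illustrated with a minimal, self-contained sketch. It is
    not the actual patch: Tarantool runs cooperative fibers and has its own
    latch, whereas this sketch assumes POSIX threads and uses a
    pthread_mutex_t as a stand-in. All names here (struct dir_index,
    dir_collect_garbage, run_concurrent_gc) are hypothetical. The loop keeps
    the same order as described above (unlink, yield, then drop the index
    entry) and relies on the lock alone to keep callers from interleaving.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

enum { FILE_COUNT = 8, MAX_THREADS = 16 };

/* Hypothetical stand-in for a directory index: one signature per file. */
struct dir_index {
	long signature[FILE_COUNT];
	int unlink_count[FILE_COUNT]; /* times each file was "unlinked" */
	int first;                    /* oldest entry still in the index */
	pthread_mutex_t gc_latch;     /* serializes garbage collection */
};

static void
dir_index_init(struct dir_index *dir)
{
	for (int i = 0; i < FILE_COUNT; i++) {
		dir->signature[i] = 100 * (i + 1); /* 100, 200, ..., 800 */
		dir->unlink_count[i] = 0;
	}
	dir->first = 0;
	pthread_mutex_init(&dir->gc_latch, NULL);
}

/* Remove every file whose signature is older than the given one. */
static void
dir_collect_garbage(struct dir_index *dir, long signature)
{
	/* Only one caller at a time may walk and modify the index. */
	pthread_mutex_lock(&dir->gc_latch);
	while (dir->first < FILE_COUNT &&
	       dir->signature[dir->first] < signature) {
		int i = dir->first;
		dir->unlink_count[i]++; /* stands in for the file unlink */
		sched_yield();          /* the point where a caller may yield */
		dir->first = i + 1;     /* only now drop the index entry */
	}
	pthread_mutex_unlock(&dir->gc_latch);
}

struct gc_arg {
	struct dir_index *dir;
	long signature;
};

static void *
gc_thread_f(void *arg)
{
	struct gc_arg *a = arg;
	dir_collect_garbage(a->dir, a->signature);
	return NULL;
}

/*
 * Run garbage collection from several threads at once and return
 * the number of files that were unlinked more than once.
 */
static int
run_concurrent_gc(int thread_count, long signature)
{
	assert(thread_count >= 0 && thread_count <= MAX_THREADS);
	struct dir_index dir;
	dir_index_init(&dir);
	struct gc_arg arg = { &dir, signature };
	pthread_t threads[MAX_THREADS];
	int started = 0;
	for (int i = 0; i < thread_count; i++) {
		if (pthread_create(&threads[started], NULL,
				   gc_thread_f, &arg) == 0)
			started++;
	}
	for (int i = 0; i < started; i++)
		pthread_join(threads[i], NULL);
	int duplicates = 0;
	for (int i = 0; i < FILE_COUNT; i++) {
		if (dir.unlink_count[i] > 1)
			duplicates++;
	}
	pthread_mutex_destroy(&dir.gc_latch);
	return duplicates;
}
```

    Calling run_concurrent_gc(4, 342) should report no file unlinked twice,
    since every caller waits for the previous one to finish before reading
    the index. Holding the lock across the yield trades some concurrency
    for simplicity, which seems acceptable for a background cleanup task.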