netbox: close transport after stopping worker loop and wait for the stop
Currently, we close the transport from transport from
`luaT_netbox_transport_stop`, and we do not wait for the worker fiber to
stop. This causes several problems.
Firstly, the worker can switch context by yielding (`coio_wait`) or
entering the Lua VM (`netbox_on_state_change`). During a context switch,
the connection can get closed. When the connection is closed, its receive
buffer is reset. If there was some pending response that was partially
retrieved (e.g., a large select), then after resetting the buffer we will
read some inconsistent data. We must not allow this to happen, so let's
check for this case after returning from places where the worker can switch
context. In between closing the connection and cancelling the connection's
worker, an `on_disconnect` trigger can be called, which, in turn, can
also yield, returning control to the worker before it gets cancelled.
Secondly, when the worker enters the Lua VM, garbage collection can be
triggered and the connection owning the worker could get closed
unexpectedly to the worker.
The fundamental source of these problems is that we close the transport
before the worker's loop stops. Instead, we should close it after the
worker's loop stops. In `luaT_netbox_transport_stop`, we should only cancel
the worker, and either wait for the worker to stop, if we are not executing
on it, or otherwise throw an exception (`luaL_testcancel`) to stop the
worker's loop. The user will still have the opportunity to catch this
exception and prevent stoppage of the worker at his own risk. To safeguard
from this scenario, we will now keep the `is_closing` flag enabled once
`luaT_netbox_transport_stop` is called and never disable it.
There also still remains a special case of the connection getting garbage
collected, when it is impossible to stop the worker's loop, since we cannot
join the worker (yielding is forbidden from finalizers), and an exception
will not go past the finalizer. However, this case is safe, since the
connection is not going to be used by this point, so the worker can simply
stop on its own at some point. The only thing we need to account for is
that we cannot wait for the worker to stop: we can reuse the `wait` option
of `luaT_netbox_transport_stop` for this.
Closes #9621
Closes #9826
NO_DOC=<bugfix>
Co-authored-by:
Vladimir Davydov <vdavydov@tarantool.org>
Showing
- changelogs/unreleased/gh-9621-netbox-worker-crash.md 4 additions, 0 deletionschangelogs/unreleased/gh-9621-netbox-worker-crash.md
- changelogs/unreleased/gh-9826-netbox-on_schema_reload-close.md 4 additions, 0 deletions...elogs/unreleased/gh-9826-netbox-on_schema_reload-close.md
- src/box/lua/net_box.c 21 additions, 19 deletionssrc/box/lua/net_box.c
- test/app-luatest/gh_9621_netbox_worker_crash_test.lua 79 additions, 0 deletionstest/app-luatest/gh_9621_netbox_worker_crash_test.lua
- test/box-luatest/gh_9826_netbox_on_schema_reload_close_test.lua 25 additions, 0 deletions...ox-luatest/gh_9826_netbox_on_schema_reload_close_test.lua
- test/box-luatest/net_box_test.lua 17 additions, 0 deletionstest/box-luatest/net_box_test.lua
Loading
Please register or sign in to comment