netbox: close transport after stopping worker loop and wait for the stop
Currently, we close the transport from transport from `luaT_netbox_transport_stop`, and we do not wait for the worker fiber to stop. This causes several problems. Firstly, the worker can switch context by yielding (`coio_wait`) or entering the Lua VM (`netbox_on_state_change`). During a context switch, the connection can get closed. When the connection is closed, its receive buffer is reset. If there was some pending response that was partially retrieved (e.g., a large select), then after resetting the buffer we will read some inconsistent data. We must not allow this to happen, so let's check for this case after returning from places where the worker can switch context. In between closing the connection and cancelling the connection's worker, an `on_disconnect` trigger can be called, which, in turn, can also yield, returning control to the worker before it gets cancelled. Secondly, when the worker enters the Lua VM, garbage collection can be triggered and the connection owning the worker could get closed unexpectedly to the worker. The fundamental source of these problems is that we close the transport before the worker's loop stops. Instead, we should close it after the worker's loop stops. In `luaT_netbox_transport_stop`, we should only cancel the worker, and either wait for the worker to stop, if we are not executing on it, or otherwise throw an exception (`luaL_testcancel`) to stop the worker's loop. The user will still have the opportunity to catch this exception and prevent stoppage of the worker at his own risk. To safeguard from this scenario, we will now keep the `is_closing` flag enabled once `luaT_netbox_transport_stop` is called and never disable it. There also still remains a special case of the connection getting garbage collected, when it is impossible to stop the worker's loop, since we cannot join the worker (yielding is forbidden from finalizers), and an exception will not go past the finalizer. However, this case is safe, since the connection is not going to be used by this point, so the worker can simply stop on its own at some point. The only thing we need to account for is that we cannot wait for the worker to stop: we can reuse the `wait` option of `luaT_netbox_transport_stop` for this. Closes #9621 Closes #9826 NO_DOC=<bugfix> Co-authored-by:Vladimir Davydov <vdavydov@tarantool.org> (cherry picked from commit fcf7f5c4) Cherry pick note: Dropped gh_9621_netbox_worker_crash_test because box.iproto.encode helpers aren't available on 2.11.
Showing
- changelogs/unreleased/gh-9621-netbox-worker-crash.md 4 additions, 0 deletionschangelogs/unreleased/gh-9621-netbox-worker-crash.md
- changelogs/unreleased/gh-9826-netbox-on_schema_reload-close.md 4 additions, 0 deletions...elogs/unreleased/gh-9826-netbox-on_schema_reload-close.md
- src/box/lua/net_box.c 21 additions, 19 deletionssrc/box/lua/net_box.c
- test/box-luatest/gh_9826_netbox_on_schema_reload_close_test.lua 25 additions, 0 deletions...ox-luatest/gh_9826_netbox_on_schema_reload_close_test.lua
- test/box-luatest/net_box_test.lua 17 additions, 0 deletionstest/box-luatest/net_box_test.lua
Loading
Please register or sign in to comment