Commit c4c022e2 authored by Georgiy Lebedev, committed by Vladimir Davydov

netbox: close transport after stopping worker loop and wait for the stop


Currently, we close the transport from `luaT_netbox_transport_stop`, and
we do not wait for the worker fiber to stop. This causes several problems.

Firstly, the worker can switch context by yielding (`coio_wait`) or
entering the Lua VM (`netbox_on_state_change`). During a context switch,
the connection can get closed. When the connection is closed, its receive
buffer is reset. If there was some pending response that was partially
retrieved (e.g., a large select), then after resetting the buffer we will
read some inconsistent data. We must not allow this to happen, so let's
check for this case after returning from places where the worker can switch
context. In between closing the connection and cancelling the connection's
worker, an `on_disconnect` trigger can be called, which, in turn, can
also yield, returning control to the worker before it gets cancelled.
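
The check described above can be sketched roughly as follows. This is a
simplified model, not the actual netbox code: `struct transport`,
`recv_response`, and the `yield_fn` callbacks are hypothetical names, and the
callback merely stands in for a real context switch such as `coio_wait`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the netbox transport. */
struct transport {
    bool is_closing;  /* set once a stop has been requested */
    size_t rbuf_used; /* bytes of a partially received response */
};

/* Models a yield point: any code may run here, including a close. */
typedef void (*yield_fn)(struct transport *);

/*
 * Receive `total` bytes in chunks of `chunk`, yielding before each chunk.
 * Returns 0 on success, -1 if a stop was requested during a yield.
 */
static int
recv_response(struct transport *t, size_t total, size_t chunk, yield_fn yield)
{
    assert(chunk > 0);
    t->rbuf_used = 0;
    while (t->rbuf_used < total) {
        yield(t); /* context switch: the connection may get closed */
        if (t->is_closing)
            return -1; /* the buffer may have been reset, bail out */
        size_t left = total - t->rbuf_used;
        t->rbuf_used += left < chunk ? left : chunk;
    }
    return 0;
}

/* A yield during which nothing happens. */
static void
yield_noop(struct transport *t)
{
    (void)t;
}

/* A yield during which another fiber requests a stop and resets the buffer. */
static void
yield_and_close(struct transport *t)
{
    t->is_closing = true;
    t->rbuf_used = 0;
}
```

With `yield_noop` the response is received in full. With `yield_and_close`
the function returns -1 without touching the reset buffer.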

Secondly, when the worker enters the Lua VM, garbage collection can be
triggered and the connection owning the worker could get closed
unexpectedly to the worker.

The fundamental source of these problems is that we close the transport
before the worker's loop stops. Instead, we should close it after the
worker's loop stops. In `luaT_netbox_transport_stop`, we should only cancel
the worker, and either wait for the worker to stop, if we are not executing
on it, or otherwise throw an exception (`luaL_testcancel`) to stop the
worker's loop. The user will still have the opportunity to catch this
exception and prevent stoppage of the worker at his own risk. To safeguard
from this scenario, we will now keep the `is_closing` flag enabled once
`luaT_netbox_transport_stop` is called and never disable it.

There also still remains a special case of the connection getting garbage
collected, when it is impossible to stop the worker's loop, since we cannot
join the worker (yielding is forbidden from finalizers), and an exception
will not go past the finalizer. However, this case is safe, since the
connection is not going to be used by this point, so the worker can simply
stop on its own at some point. The only thing we need to account for is
that we cannot wait for the worker to stop: we can reuse the `wait` option
of `luaT_netbox_transport_stop` for this.
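
The resulting stop logic can be modelled with the sketch below. It only
illustrates the decision described above and is not the code from this
commit: `struct stop_model`, `enum stop_action`, and `transport_stop` are
hypothetical names, and checking `wait` before the "are we on the worker"
test is an assumption made for the finalizer case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the caller of the stop routine should do next (hypothetical). */
enum stop_action {
    STOP_NO_WAIT,         /* finalizer case: the worker stops on its own */
    STOP_JOIN_WORKER,     /* called from another fiber: wait for the worker */
    STOP_THROW_ON_WORKER, /* called on the worker: raise a cancel error */
};

/* Hypothetical, simplified transport state. */
struct stop_model {
    bool is_closing;       /* sticky: set on stop and never cleared */
    bool worker_cancelled; /* the worker has been asked to stop */
};

/*
 * Request a stop. `on_worker` tells whether the caller runs on the worker
 * itself; `wait` is false when waiting is impossible (e.g. a finalizer).
 */
static enum stop_action
transport_stop(struct stop_model *t, bool on_worker, bool wait)
{
    assert(t != NULL);
    t->is_closing = true;       /* stays set even if the error is caught */
    t->worker_cancelled = true; /* only cancel; closing happens later */
    if (!wait)
        return STOP_NO_WAIT;
    return on_worker ? STOP_THROW_ON_WORKER : STOP_JOIN_WORKER;
}
```

In this model the transport itself is closed only once the worker's loop has
exited, and because `is_closing` is never reset, a worker that catches the
cancellation error still sees the stop request on its next check.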

Closes #9621
Closes #9826

NO_DOC=<bugfix>

Co-authored-by: Vladimir Davydov <vdavydov@tarantool.org>
(cherry picked from commit fcf7f5c4)

Cherry pick note: Dropped gh_9621_netbox_worker_crash_test because
box.iproto.encode helpers aren't available on 2.11.
parent 7b49ff36