replication: stop pushing TimedOut error to the replica
Every error that happens while the master processes a join or subscribe request is sent to the replica for better diagnostics. With a TimedOut error this could go wrong: the error could be written on top of a half-written row, making the replica stop replication with an ER_INVALID_MSGPACK error. The error is unrecoverable, and the only way to resume replication after it happens is to reset box.cfg.replication.

Here's what happened:

1) The replica is under heavy load: its event loop is occupied by some fiber that doesn't yield control to others.

2) The applier and other fibers aren't scheduled while the event loop is blocked. This means the applier neither sends heartbeat messages to the master nor reads any data coming from the master.

3) The unread master data piles up: first in the replica's receive buffer, then in the master's send buffer.

4) Once the master's send buffer is full, the corresponding socket stops being writeable, and the relay yields, waiting for the socket to become writeable again. The send buffer might contain a partially written row by now.

5) A replication timeout happens on the master, because it hasn't heard from the replica for a while. An exception is raised and pushed to the replica's socket. Now two situations are possible:

   a) The socket becomes writeable by the time the exception is raised. In this case the exception is written to the buffer right after the partially written row. Once the replica receives the half-written row with an exception logged on top, it errors with ER_INVALID_MSGPACK. Replication is broken.

   b) The socket is still not writeable (the most probable scenario). The exception isn't written to the socket and the connection is closed. The replica eventually receives the partially written row and retries the connection to the master normally.

To prevent case a) from happening, let's not push TimedOut errors to the socket at all. They're the only errors that can be raised while a row is being written, i.e. the only errors that can lead to the situation described in 5a.

Closes #4040
Showing 12 changed files:
- changelogs/unreleased/gh-4040-invalid-msgpack.md: 4 additions, 0 deletions
- src/box/applier.cc: 2 additions, 0 deletions
- src/box/iproto.cc: 6 additions, 0 deletions
- src/box/xrow.c: 3 additions, 0 deletions
- src/lib/core/errinj.h: 2 additions, 0 deletions
- test/box/errinj.result: 3 additions, 1 deletion
- test/replication/errinj.result: 8 additions, 4 deletions
- test/replication/errinj.test.lua: 4 additions, 4 deletions
- test/replication/gh-4040-invalid-msgpack.result: 169 additions, 0 deletions
- test/replication/gh-4040-invalid-msgpack.test.lua: 71 additions, 0 deletions
- test/replication/suite.cfg: 1 addition, 0 deletions
- test/replication/suite.ini: 1 addition, 1 deletion