replication: stop pushing TimedOut error to the replica
Every error that happens while the master processes a join or subscribe request is sent to the replica for better diagnostics. With a TimedOut error this could go wrong: the error could be written on top of a half-written row, making the replica stop replication with an ER_INVALID_MSGPACK error. The error is unrecoverable, and the only way to resume replication after it happens is to reset box.cfg.replication.

Here's what happened:

1) The replica is under heavy load: its event loop is occupied by some fiber that doesn't yield control to others.

2) The applier and other fibers aren't scheduled while the event loop is blocked. This means the applier neither sends heartbeat messages to the master nor reads any data coming from the master.

3) The unread master data piles up: first in the replica's receive buffer, then in the master's send buffer.

4) Once the master's send buffer is full, the corresponding socket stops being writeable, and the relay yields, waiting for the socket to become writeable again. The send buffer might contain a partially written row by now.

5) A replication timeout happens on the master, because it hasn't heard from the replica for a while. An exception is raised and pushed to the replica's socket. Now two situations are possible:

   a) The socket becomes writeable by the time the exception is raised. In this case the exception is written to the buffer right after the partially written row. Once the replica receives the half-written row with an exception logged on top, it errors with ER_INVALID_MSGPACK. Replication is broken.

   b) The socket is still not writeable (the most probable scenario). The exception isn't written to the socket and the connection is closed. The replica eventually receives the partially written row and retries the connection to the master normally.

To prevent case a) from happening, let's not push TimedOut errors to the socket at all. They're the only errors that can be raised while a row is being written, i.e. the only errors that can lead to the situation described in 5a.

Closes #4040
Showing 12 changed files:
- changelogs/unreleased/gh-4040-invalid-msgpack.md: 4 additions, 0 deletions
- src/box/applier.cc: 2 additions, 0 deletions
- src/box/iproto.cc: 6 additions, 0 deletions
- src/box/xrow.c: 3 additions, 0 deletions
- src/lib/core/errinj.h: 2 additions, 0 deletions
- test/box/errinj.result: 3 additions, 1 deletion
- test/replication/errinj.result: 8 additions, 4 deletions
- test/replication/errinj.test.lua: 4 additions, 4 deletions
- test/replication/gh-4040-invalid-msgpack.result: 169 additions, 0 deletions
- test/replication/gh-4040-invalid-msgpack.test.lua: 71 additions, 0 deletions
- test/replication/suite.cfg: 1 addition, 0 deletions
- test/replication/suite.ini: 1 addition, 1 deletion