Commit a708a94a, authored and committed by Serge Petrenko

replication: fix ER_PROTOCOL in relay

We've had numerous problems with transaction boundaries in replication.
They were mostly caused by various cases when either the beginning or
end of the transaction happened to be a local row. Local rows are not
replicated, so the peer saw "corrupted" transactions with either no
beginning or no end flag, even though the transaction contents were
fine.

The problem with starting a transaction with a local row was solved in
commit f41d1ddd ("wal: fix tx boundaries"), and that fix seems to
continue working fine to this day.

The problem with ending transactions with a local row was first fixed
in commit 25382617 ("replication: append NOP as the last tx row").
However, that approach had a flaw: when a user wrote to local spaces on
a replica from a replication trigger, it became impossible to ever
start replicating from the replica back to the master.

Another fix was proposed: in commit f96782b5 ("relay: send rows
transactionally") we made relay read a full transaction into memory and
then send it all at once, adjusting the transaction start and end flags
when necessary.
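
The idea of sending rows transactionally can be sketched as a toy model
in Python. This is not the relay code: the Row type, the is_begin field
name and relay_send_tx() are hypothetical and exist only to show how the
boundary flags move once local rows are dropped.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Row:
    lsn: int
    is_local: bool = False
    is_begin: bool = False   # hypothetical name for the tx start flag
    is_commit: bool = False  # tx end flag

def relay_send_tx(tx):
    """Drop local rows from one full transaction and re-mark its boundaries."""
    out = [r for r in tx if not r.is_local]
    if not out:
        return []  # the whole transaction was local, nothing to send
    # Clear stale flags, then mark the first and last surviving rows.
    out = [replace(r, is_begin=False, is_commit=False) for r in out]
    out[0] = replace(out[0], is_begin=True)
    out[-1] = replace(out[-1], is_commit=True)
    return out

# A transaction whose last row is local: the commit flag moves to lsn 2.
tx = [Row(1, is_begin=True), Row(2), Row(3, is_local=True, is_commit=True)]
sent = relay_send_tx(tx)
```

This only works if relay_send_tx() is handed the complete transaction.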

After that the NOPs were removed in commit f5e52b2c ("box: get rid
of dummy NOPs after transactions ending with local rows"), since relay
became capable of fixing transaction boundaries itself.

Turns out the assumption that relay always sees a full transaction and
may correctly set transaction boundaries is wrong: when a replica
reconnects to master we set its starting vclock[0] to the one master has
at the moment of reconnect, so when recovery reads local rows with lsns
less than vclock[0] it silently skips them without showing them to
relay. When such a skipped row carries the is_commit flag of the
transaction currently being sent, we get the same problem as described
before.
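
A minimal sketch of the failure, assuming a simplified model in which a
vclock is a dict from replica id to lsn and local rows use replica id 0.
The Row type and recover_old() are hypothetical, not the actual recovery
code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    replica_id: int  # 0 marks a local row in this toy model
    lsn: int
    is_commit: bool = False

def recover_old(rows, start_vclock):
    """Yield rows to relay, silently skipping rows the vclock already covers."""
    for row in rows:
        if row.lsn <= start_vclock.get(row.replica_id, 0):
            continue  # skipped without relay ever seeing it
        yield row

# The transaction ends with a local row whose lsn is below vclock[0].
tx = [Row(1, 10), Row(1, 11), Row(0, 4, is_commit=True)]
seen = list(recover_old(tx, {0: 5, 1: 9}))
# Relay sees lsns 10 and 11, but no row with is_commit.
```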

Let's make recovery track whether it has pushed any transaction rows to
relay or not, and if yes, recover rows with is_commit flag regardless of
whether the rows were already applied. To prevent recovering the same
data twice, recovery replaces such row contents with NOPs. Basically the
row is "recovered" only for the sake of showing its is_commit flag to
relay. Relay will skip the row anyway, since it remains local.
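
The approach can be sketched in the same kind of toy model. All names
here (Row, body, recover(), tx_in_progress) are hypothetical and the
real patch may track the state differently; the sketch only illustrates
the rule described above.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Row:
    replica_id: int  # 0 marks a local row in this toy model
    lsn: int
    is_commit: bool = False
    body: str = "data"

def recover(rows, start_vclock):
    """Yield rows to relay; keep the commit row of a started tx as a NOP."""
    tx_in_progress = False  # True once a row of the current tx was pushed
    for row in rows:
        already_applied = row.lsn <= start_vclock.get(row.replica_id, 0)
        if already_applied:
            if not (tx_in_progress and row.is_commit):
                continue
            # Shown to relay only for its is_commit flag, so drop the data.
            row = replace(row, body="NOP")
        tx_in_progress = not row.is_commit
        yield row

tx = [Row(1, 10), Row(1, 11), Row(0, 4, is_commit=True)]
out = list(recover(tx, {0: 5, 1: 9}))
```

A transaction whose rows are all already applied still yields nothing,
because no row of it was pushed before its commit row is reached.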

Follow-up #8958
Closes #9491

NO_DOC=bugfix

(cherry picked from commit 60d45765)
parent 1c51dd5e