replication: fix ER_PROTOCOL in relay
We've had numerous problems with transaction boundaries in replication. They were mostly caused by cases where either the beginning or the end of a transaction happened to be a local row. Local rows are not replicated, so the peer saw "corrupted" transactions with either no beginning or no end flag, even though the transaction contents were fine.

The problem with starting a transaction with a local row was solved in commit f41d1ddd ("wal: fix tx boundaries"), and that fix continues to work fine to this day.

The problem with ending a transaction with a local row was first fixed in commit 25382617 ("replication: append NOP as the last tx row"). However, that approach had a problem: when a user wrote to local spaces on a replica from a replication trigger, it became impossible to ever start replicating from the replica back to the master.

Another fix was proposed: in commit f96782b5 ("relay: send rows transactionally") we made relay read a full transaction into memory and then send it all at once, adjusting the transaction start and end flags when necessary. After that, the NOPs were removed in commit f5e52b2c ("box: get rid of dummy NOPs after transactions ending with local rows"), since relay became capable of fixing transaction boundaries itself.

It turns out that the assumption that relay always sees a full transaction and can correctly set transaction boundaries is wrong: when a replica reconnects to the master, we set its starting vclock[0] to the one the master has at the moment of reconnect, so when recovery reads local rows with lsns less than vclock[0], it silently skips them without showing them to relay. When such skipped rows carry the is_commit flag for the transaction currently being sent, we get the same problem as described before.

Let's make recovery track whether it has pushed any rows of the current transaction to relay, and if it has, recover rows with the is_commit flag regardless of whether they were already applied.

To prevent recovering the same data twice, recovery replaces the contents of such rows with NOPs. Basically, the row is "recovered" only for the sake of showing its is_commit flag to relay. Relay will skip the row anyway, since it remains local.

Follow-up #8958
Closes #9491

NO_DOC=bugfix

(cherry picked from commit 60d45765)
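
The recovery-side rule described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from src/box/recovery.cc: the `Row`, `Recovery`, `filter` and `replay` names are hypothetical, replica id 0 stands for local rows, and a row is treated as already applied when its lsn is less than or equal to the corresponding vclock component (an assumption made for this sketch).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified row; not Tarantool's real row header.
enum class RowType { Dml, Nop };

struct Row {
    uint32_t replica_id;  // 0 marks a local row in this sketch
    int64_t lsn;
    RowType type;
    bool is_commit;       // set on the last row of a transaction
};

struct Recovery {
    std::vector<int64_t> vclock;  // last applied lsn per replica id
    bool tx_in_progress = false;  // pushed rows of a tx that is not finished yet

    // Returns true and fills `out` if the row must be shown to the consumer.
    bool filter(const Row &row, Row &out) {
        bool already_applied = row.lsn <= vclock[row.replica_id];
        if (already_applied) {
            // Normally skipped. The exception: the row closes a transaction
            // whose earlier rows were already pushed to the consumer.
            if (!(tx_in_progress && row.is_commit))
                return false;
            out = row;
            out.type = RowType::Nop;  // keep only the commit flag, drop the data
            tx_in_progress = false;
            return true;
        }
        vclock[row.replica_id] = row.lsn;
        out = row;
        tx_in_progress = !row.is_commit;
        return true;
    }
};

// Feeds a log through the filter and collects what the consumer would see.
std::vector<Row> replay(Recovery &r, const std::vector<Row> &log) {
    std::vector<Row> sent;
    for (const Row &row : log) {
        Row out{};
        if (r.filter(row, out))
            sent.push_back(out);
    }
    return sent;
}
```

In this sketch a fully applied local transaction is still skipped as before, while a transaction that starts with a new global row and ends with an already applied local row reaches the consumer with a trailing NOP that carries is_commit.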
Showing
- changelogs/unreleased/gh-9491-last-local-row-tx-boundary.md (4 additions, 0 deletions)
- src/box/recovery.cc (26 additions, 9 deletions)
- test/replication-luatest/gh_9491_local_space_tx_boundary_test.lua (68 additions, 0 deletions)