vinyl: rework replication
Currently, on initial join we send the current vinyl state. To do that,
we open a read iterator over a space's primary index and send the
statements returned by it. Such an approach has a number of inherent
problems:

 - An open read iterator blocks compaction, which is unacceptable for
   such a long operation as join. To avoid blocking compaction, we open
   the iterator in the dirty mode, i.e. it skims over the tops. This,
   however, introduces a different kind of problem: it makes the
   threshold between the initial and final join phases hazy. Statements
   sent on final join may or may not be among those sent during initial
   join, and there is no efficient way to tell them apart without
   sending extra information.

 - The replica expects LSNs to grow monotonically. This constraint is
   imposed by the lsregion allocator used for storing statements in
   memory, but the read iterator returns statements ordered by key, not
   by LSN. Currently, the replica simply crashes if statements happen to
   be sent in non-chronological order, which renders vinyl replication
   unusable. Within the current model, we can't fix this by assigning
   fake LSNs to statements received on initial join, because there is no
   strict LSN threshold between the initial and final join phases (see
   the previous item).

 - In the initial join phase, the replica is only aware of spaces that
   were created before the last snapshot, while vinyl sends statements
   from spaces that exist now. As a result, if a space was created after
   the most recent snapshot, the replica won't be able to receive its
   tuples and will fail.

To address the above-mentioned problems, we make vinyl initial join
send the latest snapshot, just like memtx does. We implement this by
loading the vinyl state from the last snapshot of the metadata log and
sending the statements of all runs from the snapshot as is (including
deletes and updates), to be applied by the replica.
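A minimal C sketch of the LSN-ordering problem and of the fake-LSN renumbering described in the next paragraph. It assumes a hypothetical `struct stmt` carrying only a key and an LSN; `lsn_order_ok()` models the lsregion constraint simply as "LSNs must not decrease", and `assign_fake_lsns()` is a hypothetical helper, not the actual vinyl code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical statement: only the fields this sketch needs. */
struct stmt {
	int64_t key;
	int64_t lsn;
};

/*
 * Model of the lsregion constraint (an assumption of this sketch):
 * statements must arrive with non-decreasing LSNs.
 */
static bool
lsn_order_ok(const struct stmt *stmts, size_t count)
{
	for (size_t i = 1; i < count; i++) {
		if (stmts[i].lsn < stmts[i - 1].lsn)
			return false;
	}
	return true;
}

/*
 * Hypothetical renumbering: overwrite each statement's LSN with the
 * next value of a monotonically growing counter, in arrival order.
 */
static void
assign_fake_lsns(struct stmt *stmts, size_t count, int64_t *next_lsn)
{
	assert(next_lsn != NULL);
	for (size_t i = 0; i < count; i++)
		stmts[i].lsn = ++*next_lsn;
}
```

For a key-ordered stream with LSNs 30, 10, 20, `lsn_order_ok()` returns false; after `assign_fake_lsns()` with the counter starting at 0, the LSNs become 1, 2, 3 and the check passes. The sketch says nothing about why fake LSNs stay below final-join LSNs; that relies on the inequalities in the next paragraph.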
To make lsregion at the receiving end happy, we assign fake
monotonically growing LSNs to statements received on initial join.
This is OK, because

  any LSN from final join > max real LSN from initial join
  max real LSN from initial join >= max fake LSN

hence

  any LSN from final join > any fake LSN from initial join

Besides fixing vinyl replication, this patch also enables the
replication test suite for the vinyl engine (except for hot_standby)
and makes engine/replica_join cover the following test cases:

 - secondary indexes
 - delete and update statements
 - keys added in an order different from LSN
 - recreate space after checkpoint

Closes #1911
Closes #2001
- src/box/box.cc: 0 additions, 5 deletions
- src/box/memtx_space.cc: 4 additions, 1 deletion
- src/box/vinyl.c: 121 additions, 26 deletions
- src/box/vinyl.h: 4 additions, 2 deletions
- src/box/vinyl_engine.cc: 5 additions, 56 deletions
- src/box/vinyl_space.cc: 24 additions, 6 deletions
- src/box/xctl.c: 64 additions, 6 deletions
- src/box/xctl.h: 10 additions, 0 deletions
- test/engine/replica_join.result: 154 additions, 129 deletions
- test/engine/replica_join.test.lua: 27 additions, 3 deletions
- test/replication/suite.cfg: 3 additions, 1 deletion