vinyl: eliminate disk read on REPLACE/DELETE
When executing a REPLACE or DELETE request for a vinyl space, we need to delete the old tuple from the secondary indexes, if any. E.g. if there is a space with the primary index over field 1 and a secondary index over field 2, and the space contains REPLACE{1, 10}, then REPLACE{1, 20} has to generate DELETE{10, 1}, which overwrites REPLACE{10, 1}, before inserting REPLACE{20, 1} into the secondary index.

Currently, we generate DELETEs for secondary indexes immediately on request execution, which makes REPLACE/DELETE operations disk-bound when the space has secondary indexes: in order to delete the old tuple, we have to look it up in the primary index first.

Actually, we can postpone DELETE generation and still yield correct results. All we have to do is compare each tuple read from a secondary index with the full tuple corresponding to it in the primary index: if they match, the tuple is OK to return to the user; if they don't, the tuple was overwritten in the primary index and we have to skip it. This doesn't introduce any overhead, because we have to look up full tuples in the primary index while reading a secondary index anyway. Consider the example from the previous paragraph: if we don't insert DELETE{10, 1} into the secondary index, we will encounter REPLACE{10, 1} when reading it, but the tuple stored in the primary index under key {1} is REPLACE{1, 20}, not REPLACE{1, 10}, so we skip it. This is the first thing this patch does.

However, skipping garbage tuples isn't enough. We have to purge them sooner or later, otherwise we risk iterating over thousands of stale tuples before encountering a fresh one, which would adversely affect the latency of SELECT requests over a secondary index.
So we mark each and every REPLACE and DELETE statement that was inserted into the primary index without generating DELETEs for secondary indexes with a special per-statement flag, VY_STMT_DEFERRED_DELETE, and generate DELETEs for these statements when the time comes. The time comes when the primary index finally gets compacted. When writing a compacted run, we iterate over all tuples in the order set by the primary key, from newer to older tuples, so each statement marked with VY_STMT_DEFERRED_DELETE will be followed by the tuple it overwrote, provided enough runs are compacted. We take these tuples and send them to the tx thread over cbus (compaction is done in a worker thread, remember), where deferred DELETEs are generated and inserted into secondary indexes. Well, it isn't that simple actually, but you should have got the basic idea by now.

The first problem here is that by the time we generate a deferred DELETE, newer statements for the same key could have been inserted into the index and dumped to disk, while the read iterator assumes that newer sources store newer statements for the same key. In order not to break this read iterator assumption by inserting deferred DELETEs, we mark them with another special per-statement flag, VY_STMT_SKIP_READ, which renders them invisible to the read iterator. The flag doesn't affect the write iterator, though, so deferred DELETEs will purge garbage statements when the secondary index eventually gets compacted.

The second problem concerns the recovery procedure. Since we write deferred DELETEs to the in-memory level, we need to recover them after restart somehow in case they didn't get dumped. To do that, we write them to WAL (along with LSN and space id) with the aid of a special system blackhole space, _vinyl_deferred_delete. The insertion of deferred DELETEs into in-memory trees is actually done by an on_replace trigger installed on the space, so deferred DELETEs are generated and recovered by the same code.
In order not to recover statements that have already been dumped, we account the LSNs of WAL rows that generate deferred DELETEs to vy_lsm::dump_lsn and filter out dumped statements with vy_is_committed(), just like normal statements.

Finally, we may run out of memory while generating deferred DELETEs. This is manageable if it happens during compaction - we simply throttle the compaction task until the memory level is dumped. However, we can't do that while generating deferred DELETEs during index dump. Solution: don't generate deferred DELETEs during dump. The thing is, we can skip generating a deferred DELETE during dump only if the overwritten tuple is stored in memory - but if it is, the lookup is nearly free, so we can generate the DELETE when the transaction gets committed. So we introduce a special version of point lookup, vy_point_lookup_mem(), which looks up a tuple by the full key in the cache and in memory. When a transaction is committed, we use this function to generate DELETEs.

This should outline the pivotal points of the algorithm. More details, as usual, in the code.

Closes #2129
Files changed:

- src/box/vinyl.c: 226 additions, 28 deletions
- src/box/vy_lsm.h: 5 additions, 0 deletions
- src/box/vy_mem.h: 6 additions, 0 deletions
- src/box/vy_point_lookup.c: 32 additions, 0 deletions
- src/box/vy_point_lookup.h: 18 additions, 0 deletions
- src/box/vy_scheduler.c: 304 additions, 2 deletions
- src/box/vy_tx.c: 139 additions, 0 deletions
- test/unit/vy_point_lookup.c: 2 additions, 0 deletions
- test/vinyl/deferred_delete.result: 677 additions, 0 deletions
- test/vinyl/deferred_delete.test.lua: 261 additions, 0 deletions
- test/vinyl/errinj.result: 78 additions, 0 deletions
- test/vinyl/errinj.test.lua: 29 additions, 0 deletions
- test/vinyl/info.result: 15 additions, 3 deletions
- test/vinyl/info.test.lua: 7 additions, 2 deletions
- test/vinyl/layout.result: 22 additions, 24 deletions
- test/vinyl/quota.result: 1 addition, 1 deletion
- test/vinyl/tx_gap_lock.result: 8 additions, 8 deletions
- test/vinyl/tx_gap_lock.test.lua: 5 additions, 5 deletions
- test/vinyl/write_iterator.result: 5 additions, 0 deletions
- test/vinyl/write_iterator.test.lua: 3 additions, 0 deletions