Commit a6edd455 authored 6 years ago by Vladimir Davydov

vinyl: eliminate disk read on REPLACE/DELETE

When executing a REPLACE or DELETE request for a vinyl space, we need to
delete the old tuple from secondary indexes if any, e.g. if there's a
space with the primary index over field 1 and a secondary index over
field 2 and there's REPLACE{1, 10} in the space, then REPLACE{1, 20} has
to generate DELETE{10, 1} in order to overwrite REPLACE{10, 1} before
inserting REPLACE{20, 1} into the secondary index. Currently, we
generate DELETEs for secondary indexes immediately on request execution,
which makes REPLACE/DELETE operations disk-bound in case the space has
secondary indexes, because in order to delete the old tuple we have to
look it up in the primary index first.

Actually, we can postpone DELETE generation and still yield correct
results. All we have to do is compare each tuple read from a secondary
index with the full tuple corresponding to it in the primary index: if
they match, then the tuple is OK to return to the user; if the don't,
then the tuple was overwritten in the primary index and we have to skip
it. This doesn't introduce any overhead, because we have to look up full
tuples in the primary index while reading a secondary index anyways.
For instance, consider the example given in the previous paragraph: if
we don't insert DELETE{10, 1} into the secondary index, then we will
encounter REPLACE{10, 1} when reading it, but the tuple corresponding to
it in the primary index is REPLACE{1, 20} != REPLACE{10, 1} so we skip
it. This is the first thing that this patch does.

However, skipping garbage tuples isn't enough. We have to purge them
sooner or later, otherwise we risk iterating over thousands of stale
tuples before encountering a fresh one, which would adversely affect
latency of SELECT requests over a secondary index. So we mark each and
every REPLACE and DELETE statement that was inserted into the primary
index without generating DELETEs for secondary index with a special per
statement flag VY_STMT_DEFERRED_DELETE and generate DELETEs for these
statements when the time comes.

The time comes when the primary index finally gets compacted. When
writing a compacted run, we iterate over all tuples in the order set by
the primary key from newer to older tuples, so each statement marked
with VY_STMT_DEFERRED_DELETE will be followed by the tuple it overwrote,
provided there's enough runs compacted. We take these tuples and send
them to the tx thread over cbus (compaction is done in a worker thread,
remember), where deferred DELETEs are generated and inserted into
secondary indexes. Well, it isn't that simple actually, but you should
have got the basic idea by now.

The first problem here is by the time we generate a deferred DELETE,
newer statements for the same key could have been inserted into the
index and dumped to disk, while the read iterator assumes that the newer
the source the newer statements it stores for the same key. In order not
to break the read iterator assumptions by inserting deferred DELETEs, we
mark them with another special per-statement flag, VY_STMT_SKIP_READ,
which renders them invisible to the read iterator. The flag doesn't
affect the write iterator though so deferred DELETEs will purge garbage
statements when the secondary index eventually gets compacted.

The second problem concerns the recovery procedure. Since we write
deferred DELETEs to the in-memory level, we need to recover them after
restart somehow in case they didn't get dumped. To do that, we write
them to WAL (along with LSN and space id) with the aid of a special
system blackhole space, _vinyl_deferred_delete. The insertion of
deferred DELETEs into in-memory trees is actually done by on_replace
trigger installed on the space so deferred DELETEs are generated and
recovered by the same code. In order not to recover statements that have
been dumped, we account LSNs of WAL rows that generates deferred DELETEs
to vy_lsm::dump_lsn and filter dumped statements with vy_is_committed(),
just like normal statements.

Finally, we may run out of memory while generating deferred DELETEs.
This is manageable if happens during compaction - we simply throttle the
compaction task until the memory level is dumped. However, we can't do
that while generating deferred DELETEs during index dump. Solution:
don't generate deferred DELETEs during dump. The thing is we can
generate a deferred DELETE during dump only if the overwritten tuple is
stored in memory, but if it is, the lookup is nearly free and so we can
generate a DELETE when the transaction gets committed. So we introduce a
special version of point lookup, vy_point_lookup_mem(), which look ups a
tuple by the full key in cache and in memory. When a transaction is
committed, we use this function to generate DELETEs.

This should outline the pivotal points of the algorithm. More details,
as usual, in the code.

Closes #2129

parent d9d14334

No related branches found

No related tags found

No related merge requests found

Hide whitespace changes

Inline Side-by-side

Showing with 1843 additions and 73 deletions

Please register or to comment