Commit 9e5f3757 authored by Vladimir Davydov, committed by Konstantin Osipov

vinyl: decouple dump scheduling logic from statement LSNs

Historically, we have used LSNs as identifiers for lsregion allocations
and have therefore prioritized index dumps by the minimal statement LSN
stored in the index. Although this design decision seemed clear and
straightforward at the very beginning, over time it has accumulated some
pretty ugly hacks:

 - To make snapshots consistent we invented a new Engine method,
   prepareWaitCheckpoint(), because in beginCheckpoint() we don't know
   the LSN of the checkpoint yet, while it is (or rather was - see
   below) required to schedule dumps of in-memory trees created before
   the checkpoint started - see commit db180cde ("vinyl: make sure all
   statements with LSN <= snapshot LSN are dumped").

 - Commit a0d27684 ("vinyl: apply TX in vy_prepare and make vinyl
   similar to memtx"), which separated statement allocation from LSN
   assignment, had to introduce a notion of lsregion_id for statement
   allocations, because at the time a statement is inserted into a tree
   we don't know its final LSN yet, we only know its fake LSN, so we
   started to use fake LSNs as lsregion_id instead of normal LSNs.
   However, we still prioritize index dumps by real LSN for the sake of
   snapshot consistency (see the previous point). This means that in
   order to update min_lsregion_id, which is needed for lsregion_gc(),
   and min_checkpoint_lsn, which is required by checkpoint, we have to
   maintain two lists of in-memory trees, dump_fifo and checkpoint_fifo,
   sorted by fake and real LSN, respectively. This is especially ugly,
   because the order of the two lists is usually the same - it can only
   differ in case of a rollback.

The only reason we stick to LSNs is checkpoint: we need to guarantee
that only statements with LSNs <= checkpoint_lsn will make it to the
disk. But since commit a0d27684 we don't need to know the checkpoint
LSN to schedule a consistent dump: since statements are now inserted
into an in-memory tree before the WAL write (in Engine::prepare()), we
can schedule a dump of *all* in-memory trees from
Engine::beginCheckpoint() - all statements inserted before WAL rotation
are guaranteed to be there; we only need to seal the trees to make sure
that no statements inserted after the WAL checkpoint are included in
the dump.

That said, we don't need to know LSNs to schedule dumps in vinyl.
Instead, we can use an arbitrary monotonically growing counter; the only
question is when we should increment it. Apart from being convoluted,
the current design based on LSNs has a serious flaw: if new statements
are inserted into index A while a dump of index B, triggered by
exceeding the memory quota, is in progress, and we have more than one
worker thread, we will start dumping index A immediately, no matter how
few statements are in it, even though index B may free enough memory on
dump completion. This results in the creation of tiny run files.
Obviously, a sane scheduler should wait until all in-memory trees that
existed at the time when the quota was exceeded are dumped before
scheduling new dumps, because only dumping all those trees can guarantee
that the lsregion allocator releases memory.

Keeping that in mind, this patch implements a new scheduling algorithm
that can be summarized by the following points:

 - The scheduler maintains a monotonically growing counter called
   'generation'.

 - Upon creation, an in-memory tree is assigned the current generation.
   The generation is then used as an identifier for all lsregion
   allocations for this tree.

 - Right after creation, an in-memory tree is added to the scheduler's
   list of all in-memory trees. The list is sorted by generation, which
   makes it possible to determine the minimal generation currently in
   use. A tree is removed from the list only on dump.

 - While the minimal generation is less than the current generation, the
   scheduler schedules dumps of in-memory trees whose generation is
   strictly less than the current generation. As before, dumps are
   scheduled per index and a heap of indexes is used for scheduling. The
   heap is prioritized by the generation of the oldest in-memory tree of
   the index.

 - Once all in-memory trees older than the current generation have been
   dumped, the scheduler may increment the current generation if the
   memory quota is exceeded.

 - The current generation is also incremented by checkpoint to force
   dumping of all in-memory data.

 - Statements are always inserted into a tree of the current generation.
   If the active in-memory tree is older, it is rotated.
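
The points above can be illustrated with a small, self-contained model.
This is only a sketch in Python, not the vinyl code: the names Mem,
Scheduler, insert(), on_quota_exceeded(), begin_checkpoint() and
dump_next() are hypothetical, a plain list stands in for the heap of
indexes, dumps are modelled as synchronous, and the unconditional
increment on checkpoint is one reading of the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Mem:
    """In-memory tree; its generation doubles as the lsregion allocation id."""
    index: str
    generation: int
    stmts: List[str] = field(default_factory=list)


class Scheduler:
    def __init__(self) -> None:
        self.generation = 0               # monotonically growing counter
        self.mems: List[Mem] = []         # all in-memory trees, oldest first
        self.active: Dict[str, Mem] = {}  # index name -> tree accepting inserts

    def min_generation(self) -> int:
        # Oldest generation still in use (current one if nothing is in memory).
        return self.mems[0].generation if self.mems else self.generation

    def dump_in_progress(self) -> bool:
        return self.min_generation() < self.generation

    def insert(self, index: str, stmt: str) -> None:
        mem = self.active.get(index)
        if mem is None or mem.generation < self.generation:
            # Rotate: an older tree stays sealed in self.mems until dumped.
            mem = Mem(index, self.generation)
            self.active[index] = mem
            # New trees always carry the highest generation, so appending
            # keeps the list sorted by generation.
            self.mems.append(mem)
        mem.stmts.append(stmt)

    def on_quota_exceeded(self) -> bool:
        # Start a new dump round only after the previous one has finished.
        if self.dump_in_progress():
            return False
        self.generation += 1
        return True

    def begin_checkpoint(self) -> None:
        # Assumption: checkpoint bumps the generation unconditionally.
        self.generation += 1

    def dump_next(self) -> Optional[Tuple[str, List[str]]]:
        """Dump every old tree of the index that owns the oldest tree."""
        if not self.dump_in_progress():
            return None
        index = self.mems[0].index
        old = [m for m in self.mems
               if m.index == index and m.generation < self.generation]
        self.mems = [m for m in self.mems
                     if not (m.index == index and m.generation < self.generation)]
        return index, [s for m in old for s in m.stmts]
```

In this model, a statement inserted into index A while index B is still
waiting to be dumped lands in a fresh tree of the current generation, and
that tree is not eligible for dumping until the whole round completes and
the generation is incremented again.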

As a bonus, this makes every complete dump consistent, not only dumps
triggered by checkpoint - a property that is exploited by the next patch
to make sure that a run of the primary index cannot contain statements
newer than a run of a secondary index of the same space.
parent 57bc94e2