vinyl: decouple dump scheduling logic from statement LSNs

Historically, we use LSNs as identifiers for lsregion allocations and therefore prioritize dumps of indexes by the minimal statement LSN stored in the index. Although this design decision seemed to be clear and straightforward at the very beginning, over time it has developed pretty ugly hacks:

- To make snapshot consistent we invented a new Engine method, prepareWaitCheckpoint(), because in beginCheckpoint() we don't know the LSN of the checkpoint yet, while it is (or rather was - see below) required to schedule dump of in-memory trees created before checkpoint started - see commit db180cde ("vinyl: make sure all statements with LSN <= snapshot LSN are dumped").
- Commit a0d27684 ("vinyl: apply TX in vy_prepare and make vinyl similar to memtx"), which separated statement allocation from LSN assignment, had to introduce a notion of lsregion_id for statement allocations, because at the time a statement is inserted into a tree we don't know its final LSN yet, we only know its fake LSN, so we started to use fake LSNs as lsregion_id instead of normal LSNs. However, we still prioritize index dump by real LSN for the sake of snapshot consistency (see the previous point). This means in order to update min_lsregion_id, which is needed for lsregion_gc(), and min_checkpoint_lsn, which is required by checkpoint, we have to maintain two lists of in-memory trees, dump_fifo and checkpoint_fifo, sorted by fake and real LSN, respectively. The latter is especially ugly, because the order of the two lists is usually the same - it can only differ in case of rollback.

The only reason we stick to LSNs is checkpoint: we need to guarantee that only statements with LSNs <= checkpoint_lsn will make it to the disk.
But since commit a0d27684 we don't need to know the checkpoint LSN to schedule a consistent dump: since statements are now inserted into an in-memory tree before WAL write (in Engine::prepare()), we can schedule a dump of *all* in-memory trees from Engine::beginCheckpoint() - all statements inserted before WAL rotation are guaranteed to be there; we only need to seal the trees to make sure that no statements inserted after WAL checkpoint are included in the dump. That said, we don't need to know LSNs to schedule dumps in vinyl. Instead, we can use an arbitrary monotonically growing counter; the only question is when we should increment it.

Apart from being convoluted, the current design based on LSNs has a serious flaw: if new statements are inserted into index A while a dump of index B triggered by exceeding the memory quota is in progress, and we have more than one worker thread, we will start dumping index A immediately, no matter how few statements are in it, even though index B may free enough memory on dump completion. This will result in the creation of tiny run files. Obviously, a sane scheduler should wait until all in-memory trees that existed at the time when the quota was exceeded are dumped before scheduling new dumps, because only dumping all those trees can guarantee that the lsregion allocator releases memory.

Keeping that in mind, this patch implements a new scheduling algorithm that can be summarized as follows:

- The scheduler maintains a monotonically growing counter called 'generation'.
- Upon creation, an in-memory tree is assigned the current generation. The generation is then used as an identifier for all lsregion allocations for this tree.
- Right after creation, an in-memory tree is added to the scheduler's list of all in-memory trees. The list is sorted by generation, which makes it possible to determine the minimal generation currently in use. A tree is removed from the list only on dump.
- While the minimal generation is less than the current generation, the scheduler schedules dumps of in-memory trees whose generation is strictly less than the current generation. As before, dumps are scheduled per index and a heap of indexes is used for scheduling. The heap is prioritized by the generation of the oldest in-memory tree of the index.
- Once all in-memory trees older than the current generation have been dumped, the scheduler may increment the current generation in case the memory quota is exceeded.
- The current generation is also incremented by checkpoint to force dumping of all in-memory data.
- Statements are always inserted into a tree of the current generation. If the active in-memory tree is older, it is rotated.

As a bonus, this makes every complete dump consistent, not only dumps triggered by checkpoint - a property that is exploited by the next patch to make sure that a run of the primary index cannot contain statements newer than a run of a secondary index of the same space.
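
The generation rules listed above can be illustrated with a small standalone sketch. It is not the code from this patch: the struct and function names (struct mem, struct scheduler, scheduler_add_mem() and so on) are hypothetical, the list of trees is a fixed-size array, and the per-index heap is replaced by simply picking the oldest tree. It only shows when the counter is bumped and which trees become eligible for dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_MEMS = 16 };

/* Hypothetical stand-in for an in-memory tree. */
struct mem {
    int64_t generation; /* scheduler generation at creation time */
    int index_id;       /* index that owns the tree */
};

/* Hypothetical scheduler state, not the layout used in vinyl.c. */
struct scheduler {
    int64_t generation;        /* current generation */
    struct mem mems[MAX_MEMS]; /* all undumped trees, oldest first */
    size_t mem_count;
};

/*
 * Create a tree with the current generation. The counter only grows,
 * so appending keeps the list sorted by generation.
 */
static void
scheduler_add_mem(struct scheduler *s, int index_id)
{
    assert(s->mem_count < MAX_MEMS);
    s->mems[s->mem_count].generation = s->generation;
    s->mems[s->mem_count].index_id = index_id;
    s->mem_count++;
}

/* Minimal generation in use, or the current one if no trees are left. */
static int64_t
scheduler_min_generation(const struct scheduler *s)
{
    return s->mem_count > 0 ? s->mems[0].generation : s->generation;
}

/*
 * Dump the oldest tree if it is strictly older than the current
 * generation. Returns the id of the dumped index, or -1 if there is
 * nothing to dump.
 */
static int
scheduler_dump_next(struct scheduler *s)
{
    if (scheduler_min_generation(s) >= s->generation)
        return -1;
    int index_id = s->mems[0].index_id;
    for (size_t i = 1; i < s->mem_count; i++)
        s->mems[i - 1] = s->mems[i];
    s->mem_count--;
    return index_id;
}

/*
 * Memory quota exceeded: bump the generation only if all older trees
 * have already been dumped. Returns true if the generation was bumped.
 */
static bool
scheduler_quota_exceeded(struct scheduler *s)
{
    if (scheduler_min_generation(s) < s->generation)
        return false;
    s->generation++;
    return true;
}

/* Checkpoint always bumps the generation to force a full dump. */
static void
scheduler_begin_checkpoint(struct scheduler *s)
{
    s->generation++;
}
```

In this sketch, a tree created for index A while trees of an older generation are still being dumped keeps the current generation, so scheduler_dump_next() skips it until the next bump. That is the behaviour that avoids the tiny run files described above.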
Showing
- src/box/engine.cc: 0 additions, 12 deletions
- src/box/engine.h: 0 additions, 6 deletions
- src/box/vinyl.c: 197 additions, 231 deletions
- src/box/vinyl.h: 1 addition, 1 deletion
- src/box/vinyl_engine.cc: 2 additions, 2 deletions
- src/box/vinyl_engine.h: 1 addition, 1 deletion
- src/box/vy_mem.c: 3 additions, 14 deletions
- src/box/vy_mem.h: 8 additions, 18 deletions
- src/box/vy_stmt.c: 2 additions, 2 deletions
- src/box/vy_stmt.h: 2 additions, 2 deletions
- test/vinyl/info.result: 0 additions, 22 deletions
- test/vinyl/info.test.lua: 0 additions, 7 deletions