- Jul 23, 2024
Vladimir Davydov authored
An index can be dropped while a memory dump is in progress. If the vinyl garbage collector happens to delete the index from the vylog by the time the memory dump completes, the dump will log an entry for a deleted index, resulting in an error next time we try to recover the vylog, like:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 2 committed after deletion
```

or

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Deleted range 9 has run slices
```

We already fixed a similar issue with compaction in commit 29e2931c ("vinyl: fix race between compaction and gc of dropped LSM"). Let's fix this one in exactly the same way: discard the new run without logging it to the vylog on memory dump completion if the index was dropped while the dump was in progress.

Closes #10277

NO_DOC=bug fix
-
- Jul 18, 2024
Vladimir Davydov authored
The function `vy_space_build_index`, which builds a new index on DDL, calls `vy_scheduler_dump` on completion. If there's a checkpoint in progress, the latter will wait on `vy_scheduler::dump_cond` until `vy_scheduler::checkpoint_in_progress` is cleared. The problem is that `vy_scheduler_end_checkpoint` doesn't broadcast `dump_cond` when it clears the flag. Usually everything works fine because the condition variable is broadcast on any dump completion, and a vinyl checkpoint implies a dump, but under certain conditions this may lead to a fiber hang. Let's broadcast `dump_cond` in `vy_scheduler_end_checkpoint` to be on the safe side. While we are at it, let's also inject a dump delay into the original test to make it more robust.

Closes #10267
Follow-up #10234

NO_DOC=bug fix
-
- Jul 15, 2024
Vladimir Davydov authored
There may be more than one fiber waiting on `vy_scheduler::dump_cond`:

```
box.snapshot
  vinyl_engine_wait_checkpoint
    vy_scheduler_wait_checkpoint

space.create_index
  vinyl_space_build_index
    vy_scheduler_dump
```

To avoid a hang, we should use `fiber_cond_broadcast`.

Closes #10233

NO_DOC=bug fix
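The difference matters because a signal wakes only one waiter. Tarantool's `fiber_cond` is cooperative rather than thread-based, so the following is only a pthread-based analogy of the broadcast semantics, not the real scheduler code:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool checkpoint_done = false;
static int woken = 0;

/* Each waiter models a fiber blocked on vy_scheduler::dump_cond. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!checkpoint_done)
        pthread_cond_wait(&cond, &lock);
    woken++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Wake *all* waiters, like fiber_cond_broadcast() does. */
static int finish_checkpoint_and_count_woken(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, waiter, NULL);
    pthread_mutex_lock(&lock);
    checkpoint_done = true;
    /* A plain signal could leave one of the two waiters blocked. */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return woken;
}
```

Both waiters re-check the predicate under the lock, so the count is deterministic regardless of whether they block before or after the broadcast.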
-
- May 16, 2024
Vladimir Davydov authored
Between picking an LSM tree from a heap and taking a reference to it in vy_task_new(), there are a few places where the scheduler may yield:
- in vy_worker_pool_get(), to start a worker pool;
- in vy_task_dump_new(), to wait for a memory tree to be unpinned;
- in vy_task_compaction_new(), to commit an entry to the metadata log after splitting or coalescing a range.

If a concurrent fiber drops and deletes the LSM tree in the meantime, the scheduler will crash. To avoid that, let's take a reference to the LSM tree. It's quite difficult to write a functional test for this without a bunch of ugly error injections, so we rely on fuzzing tests.

Closes #9995

NO_DOC=bug fix
NO_TEST=fuzzing
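The fix pattern can be sketched with a toy refcount model (the real `struct vy_lsm` and its ref/unref functions carry far more state; everything below is illustrative):

```c
#include <stdbool.h>

/* Toy stand-in for an LSM tree picked from the scheduler heap. */
struct lsm_stub {
    int refs;
    bool destroyed;
};

static void lsm_ref(struct lsm_stub *lsm)
{
    lsm->refs++;
}

static void lsm_unref(struct lsm_stub *lsm)
{
    if (--lsm->refs == 0)
        lsm->destroyed = true; /* the real code would free the tree */
}

/*
 * Models vy_task_new(): take a reference before any yield so that a
 * concurrent drop can't free the tree under the scheduler's feet.
 */
static bool task_survives_concurrent_drop(struct lsm_stub *lsm)
{
    lsm_ref(lsm);   /* scheduler's reference, taken before yielding */
    lsm_unref(lsm); /* concurrent DDL drops its reference during a yield */
    bool alive = !lsm->destroyed;
    lsm_unref(lsm); /* task completion releases the scheduler's ref */
    return alive;
}
```

Without the first `lsm_ref`, the concurrent unref would drop the count to zero mid-task, which is exactly the crash the commit describes.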
-
- Feb 16, 2024
Nikolay Shirokovskiy authored
`sql-tap/intpkey.test` started to flake under load after commit fe769b0a ("vinyl: add graceful shutdown"). The issue is that vy_scheduler_complete_tasks() may yield, so on shutdown the `vinyl.scheduler` fiber may be cancelled during this yield and then go to sleep waiting for new tasks forever, at which point vinyl shutdown hangs.

Part of #8423

NO_TEST=fix flaky test
NO_CHANGELOG=fix flaky test
NO_DOC=fix flaky test
-
Nikolay Shirokovskiy authored
Let's stop all vinyl internal fibers and threads. In the case of the scheduler, this effectively reverts commit e463128e ("vinyl: cancel reader and writer threads on shutdown"), so we could again see a delay on shutdown in 'vinyl/replica_quota.test'. We should not, though. At the time of that commit, deferred deletes were the default behavior, and there is a secondary index in the test space. Deferred deletes involve TX thread communication, and at the moment the scheduler worker threads were stopped, the TX event loop was not running. This could result in worker threads hanging on stop. In this patch we stop worker threads in the shutdown phase, while the TX event loop is still active. We delete part of the test for #3412, as we now finish the fibers that may use the latch. We also restore destroying the latch.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal
-
- Jan 29, 2024
Nikolay Shirokovskiy authored
In the process of graceful shutdown it is convenient to first finish all client (non-system) fibers. Otherwise we would have to be ready for any subsystem to handle a request from a client fiber during or after that subsystem's shutdown, which would make the code more complex. We first cancel client fibers and then wait for them to finish. A fiber may not respond to cancellation and hang, which causes the shutdown to hang, but this is the approach we already chose for iproto shutdown. Note that as a result of this approach the application will panic if it is shut down during execution of the initialization script (in particular, if that script is executing box.cfg).

There are changes in the application/tests to adapt to client fiber shutdown:
- make code cancellable (only to pass existing tests; we did not investigate all the possible places that should be made such);
- make the console stop sending echo to the client before client fiber shutdown; otherwise, since the console server fiber is a client fiber, we would send a message that the fiber is cancelled on shutdown, which breaks a lot of existing tests; this approach is on par with iproto shutdown;
- some tests (7743, replication-luatest/shutdown, replication/anon, replication/force_recovery, etc.) test shutdown during execution of the init script; now a panic is expected, so change them accordingly;
- some tests (8530, errinj_vylog) use an injection that blocks client fiber finishing; those tests don't need graceful shutdown, so let's just kill tarantool instead;
- change the test in vinyl/errinj for gh-3225: we don't really need to check when the vinyl reader is blocked, as it executes small tasks (we assume the read syscall will not hang); also change the test for vinyl dump shutdown by slowing the dump down instead of blocking it entirely, which is required for the client fibers in the test to finish in time;
- other similar changes.

Also we can drop the code from replication shutdown that was required to handle client requests during/after shutdown.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal
-
- Feb 13, 2023
Mergen Imeev authored
This patch replaces malloc() with xmalloc() in key_def_dup() to avoid the possibility of skipping the malloc() return value check. Closes tarantool/security#81 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
- Dec 12, 2022
Vladislav Shpilevoy authored
It is a wrapper around pthread cancel and join. This pattern was repeated many times and was dangerous, because it left cord.id set: an accidental attempt to cord_join/cojoin() such a cord would then lead to UB. The patch introduces a function which encapsulates the blocking cancellation. It is going to be used in a next patch to count the number of cords in the process, which in turn is needed for a new test. The counter is atomic in case some cords are created not by the main cord. There are now also more sanity checks against accidental attempts to join the same cord twice.

Needed for #7743

NO_DOC=internal
NO_CHANGELOG=internal
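A minimal sketch of such a wrapper on top of raw pthreads (the real cord wrapper also clears `cord.id` and maintains the atomic cord counter; the function name and body here are assumptions):

```c
#include <pthread.h>
#include <unistd.h>

/* Cancel a thread and block until it has actually terminated. */
static int thread_cancel_and_join(pthread_t tid)
{
    int rc = pthread_cancel(tid);
    if (rc != 0)
        return rc;
    /* Joining reaps the thread; after this its id must not be reused. */
    return pthread_join(tid, NULL);
}

/* A thread that would run forever unless cancelled. */
static void *sleepy(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1); /* sleep() is a cancellation point */
    return NULL;
}
```

Even if the cancel request arrives before the thread reaches a cancellation point, it stays pending and is acted on at the first `sleep()`, so the join always completes.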
-
- Nov 23, 2022
Nikolay Shirokovskiy authored
As it breaks sane usage of region as a data stack:

```
size_t region_svp = region_used(&fiber()->gc);
/* some allocation on fiber gc and usage of allocated memory. */
region_truncate(&fiber()->gc, region_svp);
```

If in the above snippet one calls a function that in turn calls `fiber_gc`, the snippet may have a use-after-free and later UB on truncation. For this reason let's get rid of fiber_gc. However, we need to make sure we don't introduce leaks this way. So before actually removing fiber_gc, we made it perform a leak check instead, and only after fixing all the leaks was fiber_gc removed. To find a leak easily, the backtrace of the first fiber gc allocation that is not truncated is saved and then reported. To catch leaks that are not triggered by the current test suite and to prevent introducing leaks in future patches, the leak check is added on fiber exit/recycle and, for long-living system fibers, on every loop iteration. The leak check in release builds is on, but without leak backtrace info by default for performance reasons. A backtrace can be obtained by using the `fiber.leak_backtrace_enable()` knob before starting the leaking fiber. Normally leaks are only reported in the log, but that would not help to catch errors when running test suites, so the build option ABORT_ON_LEAK is added. When it is on, we abort on leak. This option is turned on for all builds used in CI.

Closes #5665

NO_CHANGELOG=internal
NO_DOC=internal
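The data-stack discipline the snippet relies on can be modeled with a toy region (stub names; the real API lives in the small library's region allocator):

```c
#include <stddef.h>

/* A toy bump allocator illustrating the savepoint discipline:
 * save the watermark, allocate, use, then truncate back. */
struct region_stub {
    char buf[1024];
    size_t used;
};

static void *region_alloc_stub(struct region_stub *r, size_t size)
{
    if (r->used + size > sizeof(r->buf))
        return NULL;
    void *p = r->buf + r->used;
    r->used += size;
    return p;
}

static size_t region_used_stub(const struct region_stub *r)
{
    return r->used;
}

/* Frees everything allocated after the savepoint was taken. */
static void region_truncate_stub(struct region_stub *r, size_t svp)
{
    r->used = svp;
}
```

A `fiber_gc`-style "reset everything" call in the middle of this pattern would invalidate `p` and make the later truncate operate on a watermark that no longer matches the allocator state, which is exactly the UB the commit removes.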
-
- Aug 30, 2022
Nikita Zheleztsov authored
Currently internal tarantool fibers can be cancelled from the user's app, which can lead to critical errors. Let's mark these fibers as system ones in order to be sure that they won't be cancelled from the Lua world.

Closes #7448
Closes #7473

NO_DOC=minor change
-
- Aug 04, 2022
Vladimir Davydov authored
We have a method for getting the number of elements stored in a BPS tree. Let's use it instead of accessing BPS tree internals directly so that we can freely refactor BPS tree internals. NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
- Jun 24, 2022
Vladimir Davydov authored
We often rotate an LSM tree's active in-memory index if its generation or schema version is older than the current one. Let's add a helper function, vy_lsm_rotate_mem_if_required(), for that.

Needed for #5080

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
-
- Dec 06, 2021
Georgiy Lebedev authored
On completion of compaction tasks, we immediately remove compacted run files created after the last checkpoint to save disk space. To perform this optimization, we compare the unused runs' dump LSN with the last checkpoint's one. But during the replica's initial JOIN stage we set the LSN of all rows received from the remote master to 0 (see bootstrap_journal_write in box/box.cc). Considering that the LSN of an initial checkpoint is also 0, our optimization stops working, and we get a huge disk space usage spike (as the unused run files only get removed when garbage collection occurs). We should check the vinyl engine's status and perform our optimization unconditionally if we are in the replica's initial JOIN stage.

Closes #6568
-
- Sep 22, 2021
Vladimir Davydov authored
For deferred DELETE statements to be recovered after restart, we write them to a special 'blackhole' system space, _vinyl_deferred_delete, which doesn't store any data, but is logged in the WAL, as a normal space. In the on_replace trigger installed for this space, we insert deferred DELETE statements into the memory (L0) level of the LSM tree corresponding to the space for which the statement was generated. We also wait for L0 quota in the trigger. The problem is a space can be dropped while we are waiting for quota, in which case the trigger function will crash once it resumes execution. To fix this, let's wait for quota before we write the information about the deferred DELETE statement to the _vinyl_deferred_delete space and check if the LSM tree was dropped after yield. This way, everything will work as expected even if a new space is created with the same id, because we don't yield after checking quota. Closes #6448
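The fixed ordering can be sketched as a tiny model (stub types and names; the real code waits on vinyl's L0 quota and then checks whether the LSM tree was dropped during the yield):

```c
#include <stdbool.h>

/* Toy stand-in for the target LSM tree of a deferred DELETE. */
struct target_lsm {
    bool dropped;
};

/*
 * Models the on_replace trigger flow after the fix: the quota wait
 * (which may yield) happens first, then the tree is re-checked before
 * any statement is inserted into its memory level.
 */
static bool apply_deferred_delete(struct target_lsm *lsm,
                                  bool drop_during_yield)
{
    /* Quota wait: may yield; the space can be dropped meanwhile. */
    if (drop_during_yield)
        lsm->dropped = true;
    /* Re-check after the yield; skip a dropped tree instead of crashing. */
    if (lsm->dropped)
        return false;
    /* Insert into L0 here; no further yields, so the check stays valid. */
    return true;
}
```

The key property is that no yield happens between the check and the insert, which is why a new space reusing the same id cannot be confused with the dropped one.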
-
- Jul 01, 2021
Vladimir Davydov authored
An LSM tree (that is, a space index) can be dropped while compaction is in progress for it. In this case compaction will still commit the new run to the vylog upon completion. This usually works fine, but not if gc has already purged all the information about the dropped LSM tree from the vylog by that time, in which case an attempt to commit the new run will result in a permanently broken vylog (because compaction will write vylog records for a non-existing object):

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 13 deleted but not registered
```

To prevent this from happening, let's make compaction silently drop the new run without committing it to the vylog if the LSM tree has been dropped. This should work just fine: since the LSM tree isn't used anymore, we don't need to have it compacted, nor do we need to delete the run, since gc will eventually clean up all artefacts left from the dropped LSM tree. One thing to note is that we must also exclude dropped LSM trees from further compaction; if we don't, we might end up picking the dropped LSM tree for compaction over and over again (because it isn't actually compacted). This patch also drops the gh-5141-invalid-vylog-file test, because it only ensured that the issue fixed by this patch was there.

Closes #5436
-
- Jun 16, 2021
Vladislav Shpilevoy authored
Sometimes a transaction can fail before it goes to WAL. In that case the signature bears no sign of it, and neither does the journal_entry result (which might not even be created yet). Still, if txn_commit/try_async() is called, it invokes the on_rollback triggers. The triggers can only see TXN_SIGNATURE_ROLLBACK and can't distinguish it from a real rollback like box.rollback(). Due to that, some important errors like a transaction manager conflict or OOM are lost. The patch introduces a new error signature, TXN_SIGNATURE_ABORT, which says the transaction didn't manage to try going to WAL, and for the error one needs to look at the global diag. The next patch is going to stop overriding it with TXN_SIGNATURE_ROLLBACK.

Part of #6027
-
- May 14, 2021
Cyrill Gorcunov authored
Drop redundant "%s" from format.

Follow-up #5846

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
- Apr 14, 2021
Aleksandr Lyapunov authored
Part of #5958
-
- Jun 10, 2020
Nikita Pettik authored
It may turn out that dump_generation does not catch up with the current generation while no other dump tasks are in progress. This may happen if the dump process is throttled due to errors. In this case the generation is bumped but dump_generation is not (since the dump is not completed). In turn, throttling opens a window for DDL operations. For instance, dropping an index and creating a new one results in the mentioned situation:

```
box.snapshot() -- fails for some reason; next attempt at dumping will be
               -- taken in one second.
s:drop() -- drop index to be dumped
s = box.schema.space.create('test', {engine = 'vinyl'})
-- create new one (its mem generation is greater than scheduler's one)
i = s:create_index('pk')
```

Closes #4821
-
- Jun 09, 2020
Vladislav Shpilevoy authored
Some error injections use usleep() to put the current thread to sleep, for example, to simulate that the WAL thread is slow. A few of them allow passing a custom number of microseconds to usleep, in the form:

```
usleep(injection->dvalue * 1000000);
```

assuming that dvalue is a number of seconds. But the usleep argument is uint32_t, at least on Mac (it is useconds_t, whose size is 4). This means that a big enough dvalue easily overflows it. The patch makes it use nanosleep() in a new wrapper, thread_sleep(), which takes a double value and does not truncate it to 4 bytes. The overflow was the case for ERRINJ_VY_READ_PAGE_TIMEOUT = 9000 in test/vinyl/errinj_vylog.test.lua and ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT = 9000 in test/vinyl/errinj.test.lua.

Part of #4609
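A sketch of such a wrapper (the name and double-seconds signature follow the commit's description; the EINTR retry loop is an assumption about the implementation):

```c
#include <stdint.h>
#include <time.h>

/* Sleep for a possibly large number of seconds given as a double,
 * without truncating it to a 32-bit microsecond count. */
static void thread_sleep(double sec)
{
    struct timespec ts;
    ts.tv_sec = (time_t)sec;
    ts.tv_nsec = (long)((sec - (double)ts.tv_sec) * 1e9);
    while (nanosleep(&ts, &ts) != 0)
        ; /* interrupted by a signal: retry with the remaining time */
}

/* The overflow the commit describes: the delay in microseconds
 * doesn't fit into a 32-bit useconds_t. */
static int usleep_arg_overflows(double sec)
{
    uint64_t usec = (uint64_t)(sec * 1e6);
    return (uint32_t)usec != usec;
}
```

With dvalue = 9000, the product is 9e9 microseconds, well above the 32-bit maximum of about 4.29e9, so the truncated usleep argument bore no relation to the intended delay.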
-
- May 27, 2020
Nikita Pettik authored
Before this patch, box.snapshot() bailed out immediately if it saw that the scheduler was throttled due to errors. For instance:

```
box.error.injection.set('ERRINJ_VY_RUN_WRITE', true)
snapshot() -- fails due to ERRINJ_VY_RUN_WRITE
box.error.injection.set('ERRINJ_VY_RUN_WRITE', false)
snapshot() -- still fails despite the fact that injected error is unset
```

As a result, one has to wait up to a minute to make a snapshot. The reason throttling was introduced was to avoid flooding the log in case of repeating disk errors. What is more, to deal with scheduler throttling in tests, we had to introduce a new error injection (ERRINJ_VY_SCHED_TIMEOUT). It reduces the time during which the scheduler remains throttled, which is ugly and race prone. So let's unthrottle the scheduler when the checkpoint process is launched via a manual box.snapshot() invocation.

Closes #3519
-
- Mar 26, 2020
Kirill Shcherbatov authored
Let's rename diag_add_error() to diag_set_error(), because it actually replaces the error object in the diagnostic area with a new one, and the old name is not representative. Moreover, we are going to introduce a new diag_add_error() which will place an error at the top of the stacked diagnostic area.

Needed for #1148
-
- Feb 20, 2020
Cyrill Gorcunov authored
Using void explicitly in functions which take no arguments allows the compiler to optimize the code a bit and not to assume there might be variable arguments. Moreover, in commit e070cc4d we dropped the arguments from txn_begin but didn't update vy_scheduler.c. The compiler didn't complain because it assumed there were varargs.

Acked-by: Konstantin Osipov <kostja.osipov@gmail.com>
Acked-by: Nikita Pettik <korablev@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
- Oct 28, 2019
Ilya Kosarev authored
The trigger function return type is changed from void to int, and any non-zero value means the trigger finished with an error. A trigger can still raise an error; there is no other refactoring except the obvious `diag_raise();` --> `return -1;` replacement.

Prerequisites: #4247
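The shape of the change can be sketched like this (a hypothetical trigger body; the real trigger callbacks take a `struct trigger *` and an event pointer):

```c
#include <stdbool.h>

/* Before the patch a failing trigger raised an exception via
 * diag_raise(); after it, errors are reported with a return code
 * and the caller propagates them. */
static int example_on_replace_trigger(bool constraint_violated)
{
    if (constraint_violated) {
        /* was: diag_set(...); diag_raise(); */
        return -1;
    }
    return 0;
}
```

Callers then check the return value instead of relying on C++ exception unwinding, which is what makes the trigger machinery usable from plain C code paths.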
-
- Aug 20, 2019
Vladimir Davydov authored
We remove an LSM tree from the scheduler queues as soon as it is dropped, even though the tree may hang around for a while after that, e.g. because it is pinned by an iterator. As a result, once an index is dropped, it won't be dumped anymore - its memory level will simply disappear without a trace. This is okay for now, but to implement snapshot iterators we must make sure that an index stays valid as long as there's an iterator that references it. So let's delay removal of an index from the scheduler queues until it is about to be destroyed.
-
- Jul 04, 2019
Vladimir Davydov authored
ERROR_INJECT_YIELD yields the current fiber execution by calling fiber_sleep(0.001) while the given error injection is set. ERROR_INJECT_SLEEP suspends the current thread execution by calling usleep(1000) while the given error injection is set.
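The thread-suspending variant can be sketched roughly like this (the real macros consult the global error-injection table, and ERROR_INJECT_YIELD calls fiber_sleep(0.001) instead, which needs the fiber runtime; the flag below is a simplified stand-in):

```c
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for an entry in the error-injection table. */
static bool errinj_test_delay = true;

/* Suspend the current *thread* while the injection is set. */
#define ERROR_INJECT_SLEEP(flag) \
    do { \
        while (flag) \
            usleep(1000); \
    } while (0)

/* Models the test harness clearing the injection from elsewhere. */
static void release_injection(void)
{
    errinj_test_delay = false;
}
```

The difference between the two macros is what they block: ERROR_INJECT_YIELD parks only the current fiber and lets the event loop keep running, while ERROR_INJECT_SLEEP stalls the whole thread.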
-
- Jun 25, 2019
Georgy Kirichenko authored
Refactoring: don't touch the fiber gc storage on a transaction rollback explicitly. This relaxes the dependencies between the fiber and transaction life cycles.

Prerequisites: #1254
-
Georgy Kirichenko authored
Move transaction auto start and auto commit behavior to the box level. From now on, a transaction won't start and commit automatically without txn_begin/txn_commit invocations. This is a part of a bigger transaction refactoring aimed at implementing detachable transactions and a parallel applier.

Prerequisites: #1254
-
- Jun 20, 2019
Vladimir Davydov authored
To apply replicated rows in parallel, we need to be able to complete transactions asynchronously, from the tx_prio callback. We can't yield there, so we must ensure that on_commit/on_rollback triggers don't yield. The only place where we may yield in a completion trigger is vinyl DDL, which submits vylog records and waits for them to complete. Actually, there's no reason to wait for a vylog write to complete, as we can handle missing records on recovery. So this patch reworks vylog to make vy_log_tx_try_commit(), and hence the on_commit/on_rollback triggers using it, non-yielding. To achieve that, we need to:
- Use vy_log.latch only to sync log rotation vs writes; don't protect the vylog buffer with it. This makes vy_log_tx_begin() non-yielding.
- Use a separate list and buffer for storing the vylog records of each transaction. We used to share them among transactions, but without vy_log.latch we can't sync access to them anymore. Since vylog transactions are rare events, this should be fine.
- Make vy_log_tx_try_commit() append the transaction to the list of pending transactions and wake up a background fiber to flush all pending transactions. This way it doesn't need to yield.

Closes #4218
-
- Jun 06, 2019
Vladimir Davydov authored
After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN), then try to delete their files unless they are needed for recovery from the checkpoint, and finally mark them as not needed in the vylog (VY_LOG_FORGET_RUN). There's a potential race sitting here: the garbage collector might kick in after the files are dropped, but before they are marked as not needed. If this happens, there will be runs that have two VY_LOG_FORGET_RUN records, which will break recovery:

```
Run XX is forgotten, but not registered
```

The following patches make the race more likely to happen, so let's eliminate it by making the garbage collector the only one who can mark runs as not needed (i.e. write a VY_LOG_FORGET_RUN record). There will be no warnings, because the garbage collector silently ignores ENOENT errors, see vy_gc(). Another good thing about this patch is that now we never yield inside a vylog transaction, which makes it easier to remove the vylog latch that blocks the implementation of transactional DDL.
-
- Apr 16, 2019
Vladimir Davydov authored
To propagate changes applied to a space while a new index is being built, we install an on_replace trigger. In case the on_replace trigger callback fails, we abort the DDL operation. The problem is the trigger may yield, e.g. to check the unique constraint of the new index. This opens a time window for the DDL operation to complete and clear the trigger. If this happens, the trigger will try to access the outdated build context and crash:

```
| #0 0x558f29cdfbc7 in print_backtrace+9
| #1 0x558f29bd37db in _ZL12sig_fatal_cbiP9siginfo_tPv+1e7
| #2 0x7fe24e4ab0e0 in __restore_rt+0
| #3 0x558f29bfe036 in error_unref+1a
| #4 0x558f29bfe0d1 in diag_clear+27
| #5 0x558f29bfe133 in diag_move+1c
| #6 0x558f29c0a4e2 in vy_build_on_replace+236
| #7 0x558f29cf3554 in trigger_run+7a
| #8 0x558f29c7b494 in txn_commit_stmt+125
| #9 0x558f29c7e22c in box_process_rw+ec
| #10 0x558f29c81743 in box_process1+8b
| #11 0x558f29c81d5c in box_upsert+c4
| #12 0x558f29caf110 in lbox_upsert+131
| #13 0x558f29cfed97 in lj_BC_FUNCC+34
| #14 0x558f29d104a4 in lua_pcall+34
| #15 0x558f29cc7b09 in luaT_call+29
| #16 0x558f29cc1de5 in lua_fiber_run_f+74
| #17 0x558f29bd30d8 in _ZL16fiber_cxx_invokePFiP13__va_list_tagES0_+1e
| #18 0x558f29cdca33 in fiber_loop+41
| #19 0x558f29e4e8cd in coro_init+4c
```

To fix this issue, let's recall that when a DDL operation completes, all pending transactions that affect the altered space are aborted by the space_invalidate callback. So to avoid the crash, we just need to bail out early from the on_replace trigger callback if we detect that the current transaction has been aborted.

Closes #4152
-
- Apr 11, 2019
Vladimir Davydov authored
L1 runs are usually the most frequently read and smallest runs at the same time so we gain nothing by compressing them. Closes #2389
-
- Apr 07, 2019
Vladimir Davydov authored
Apart from speeding up statement comparisons and hence index lookups, this is also a prerequisite for multikey indexes, which will reuse tuple comparison hints as offsets in indexed arrays. Albeit huge, this patch is pretty straightforward - all it does is replace struct tuple with struct vy_entry (which is tuple + hint pair) practically everywhere in the code. Now statements are stored and compared without hints only in a few places, primarily at the very top level. Hints are also computed at the top level so it should be pretty easy to replace them with multikey offsets when the time comes.
-
- Apr 01, 2019
Vladimir Davydov authored
It's an independent piece of code that is definitely worth moving from vy_scheduler to vy_lsm internals anyway. Besides, having it wrapped up in a separate function will make it easier to patch.
-
- Mar 13, 2019
Vladimir Davydov authored
It's actually only needed to initialize disk streams so let's pass it to vy_write_iterator_new_slice() instead.
-
- Feb 25, 2019
Vladislav Shpilevoy authored
The only goal of reading and writing heap_node.pos was checking if a node is now in a heap, or not. This commit encapsulates this logic into a couple of functions.
-
- Feb 22, 2019
Vladislav Shpilevoy authored
Until now, the heap API worked with struct heap_node only, which forced a user to constantly call container_of. Such code looks really awful. This commit makes the heap take and return user-defined structures, and removes the container_of glue. It is worth noting that rb-tree and b-tree have a similar API. Even rlist has its rlist_*_entry() wrappers, and mhash provides macros to define your own value type.
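The container_of boilerplate the commit removes looks roughly like this (type and field names are illustrative, not the real scheduler structures):

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct heap_node {
    size_t pos;
};

/* A scheduler task as a heap user would define it. */
struct task {
    int priority;
    struct heap_node in_heap;
};

/* Old style: the heap hands back a bare heap_node and the caller
 * must recover the enclosing structure manually. */
static struct task *task_from_node(struct heap_node *node)
{
    return container_of(node, struct task, in_heap);
}
```

After the change, the heap's pop/top operations return `struct task *` directly, so this conversion helper (and the chance of passing the wrong member name to it) disappears from user code.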
-
- Feb 14, 2019
Vladimir Davydov authored
We used to swap it between vy_lsm objects, but we don't do that anymore so we can embed it.
-
- Feb 12, 2019
Vladimir Davydov authored
This patch adds dumps_per_compaction metric to per index statistics. It shows the number of dumps it takes to trigger a major compaction of a range in a given LSM tree. We need it to automatically choose the optimal number of ranges that would smooth out the load generated by range compaction. To calculate this metric, we assign dump_count to each run. It shows how many dumps it took to create the run. If a run was created by a memory dump, it is set to 1. If a run was created by a minor compaction, it is set to the sum of dump counts of compacted ranges. If a run was created by a major compaction, it is set to the sum of dump counts of compacted ranges minus dump count of the last level run. The dump_count is stored in vylog. This allows us to estimate the number of dumps that triggers compaction in a range as dump_count of the last level run stored in the range. Finally, we report dumps_per_compaction of an LSM tree as the average dumps_per_compaction among all ranges constituting the tree. Needed for #3944
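The bookkeeping rules above can be sketched as plain arithmetic (hypothetical helper names; in the real code the values live on the run objects and are persisted in vylog):

```c
#include <stddef.h>
#include <stdint.h>

/* A run written by a memory dump took exactly one dump to create. */
static uint32_t dump_count_for_dump(void)
{
    return 1;
}

/* Minor compaction: the sum of the dump counts of the compacted runs. */
static uint32_t dump_count_for_minor(const uint32_t *counts, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += counts[i];
    return sum;
}

/* Major compaction: the same sum minus the last-level run's dump
 * count, whose value estimates dumps-per-compaction for the range. */
static uint32_t dump_count_for_major(const uint32_t *counts, size_t n,
                                     uint32_t last_level_count)
{
    return dump_count_for_minor(counts, n) - last_level_count;
}
```

For example, compacting two fresh dump runs (count 1 each) with a last-level run of count 4 yields a new last-level run with count 2, meaning it took roughly two dumps since the previous major compaction to trigger this one.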
-