- Jul 23, 2024
Vladimir Davydov authored
An index can be dropped while a memory dump is in progress. If the vinyl garbage collector happens to delete the index from the vylog by the time the memory dump completes, the dump will log an entry for a deleted index, resulting in an error next time we try to recover the vylog, like:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 2 committed after deletion
```

or

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Deleted range 9 has run slices
```

We already fixed a similar issue with compaction in commit 29e2931c ("vinyl: fix race between compaction and gc of dropped LSM"). Let's fix this one in exactly the same way: discard the new run without logging it to the vylog on memory dump completion if the index was dropped while the dump was in progress.

Closes #10277

NO_DOC=bug fix
-
- Jul 18, 2024
Vladimir Davydov authored
The function `vy_space_build_index`, which builds a new index on DDL, calls `vy_scheduler_dump` on completion. If there's a checkpoint in progress, the latter will wait on `vy_scheduler::dump_cond` until `vy_scheduler::checkpoint_in_progress` is cleared. The problem is that `vy_scheduler_end_checkpoint` doesn't broadcast `dump_cond` when it clears the flag. Usually everything works fine because the condition variable is broadcast on any dump completion, and a vinyl checkpoint implies a dump, but under certain conditions this may lead to a fiber hang. Let's broadcast `dump_cond` in `vy_scheduler_end_checkpoint` to be on the safe side. While we are at it, let's also inject a dump delay into the original test to make it more robust.

Closes #10267
Follow-up #10234

NO_DOC=bug fix
-
- Jul 15, 2024
Vladimir Davydov authored
There may be more than one fiber waiting on `vy_scheduler::dump_cond`:

```
box.snapshot
  vinyl_engine_wait_checkpoint
    vy_scheduler_wait_checkpoint

space.create_index
  vinyl_space_build_index
    vy_scheduler_dump
```

To avoid a hang, we should use `fiber_cond_broadcast`.

Closes #10233

NO_DOC=bug fix
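The difference matters because a signal wakes only one waiter. Tarantool's `fiber_cond` is cooperative rather than thread-based, so the following is only a pthread-based analogy of the broadcast semantics, not the real scheduler code:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool checkpoint_done = false;
static int woken = 0;

/* Each waiter models a fiber blocked on vy_scheduler::dump_cond. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!checkpoint_done)
        pthread_cond_wait(&cond, &lock);
    woken++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Wake *all* waiters, like fiber_cond_broadcast() does. */
static int finish_checkpoint_and_count_woken(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, waiter, NULL);
    pthread_mutex_lock(&lock);
    checkpoint_done = true;
    /* A plain signal could leave one of the two waiters blocked. */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return woken;
}
```

Both waiters re-check the predicate under the lock, so the count is deterministic regardless of whether they block before or after the broadcast.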
-
- May 16, 2024
Vladimir Davydov authored
Between picking an LSM tree from a heap and taking a reference to it in vy_task_new(), there are a few places where the scheduler may yield:
- in vy_worker_pool_get(), to start a worker pool;
- in vy_task_dump_new(), to wait for a memory tree to be unpinned;
- in vy_task_compaction_new(), to commit an entry to the metadata log after splitting or coalescing a range.

If a concurrent fiber drops and deletes the LSM tree in the meantime, the scheduler will crash. To avoid that, let's take a reference to the LSM tree. It's quite difficult to write a functional test for this without a bunch of ugly error injections, so we rely on fuzzing tests.

Closes #9995

NO_DOC=bug fix
NO_TEST=fuzzing
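The fix pattern can be sketched with a toy refcount model (the real `struct vy_lsm` and its ref/unref functions carry far more state; everything below is illustrative):

```c
#include <stdbool.h>

/* Toy stand-in for an LSM tree picked from the scheduler heap. */
struct lsm_stub {
    int refs;
    bool destroyed;
};

static void lsm_ref(struct lsm_stub *lsm)
{
    lsm->refs++;
}

static void lsm_unref(struct lsm_stub *lsm)
{
    if (--lsm->refs == 0)
        lsm->destroyed = true; /* the real code would free the tree */
}

/*
 * Models vy_task_new(): take a reference before any yield so that a
 * concurrent drop can't free the tree under the scheduler's feet.
 */
static bool task_survives_concurrent_drop(struct lsm_stub *lsm)
{
    lsm_ref(lsm);   /* scheduler's reference, taken before yielding */
    lsm_unref(lsm); /* concurrent DDL drops its reference during a yield */
    bool alive = !lsm->destroyed;
    lsm_unref(lsm); /* task completion releases the scheduler's ref */
    return alive;
}
```

Without the first `lsm_ref`, the concurrent unref would drop the count to zero mid-task, which is exactly the crash the commit describes.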
-
- Feb 16, 2024
Nikolay Shirokovskiy authored
`sql-tap/intpkey.test` started to flake under load after commit fe769b0a ("vinyl: add graceful shutdown"). The issue is that vy_scheduler_complete_tasks() may yield, so on shutdown the `vinyl.scheduler` fiber may be cancelled during this yield and then go to sleep waiting for new tasks forever, at which point vinyl shutdown hangs.

Part of #8423

NO_TEST=fix flaky test
NO_CHANGELOG=fix flaky test
NO_DOC=fix flaky test
-
Nikolay Shirokovskiy authored
Let's stop all vinyl internal fibers and threads. In the case of the scheduler, this effectively reverts commit e463128e ("vinyl: cancel reader and writer threads on shutdown"), so we could again see a delay on shutdown in 'vinyl/replica_quota.test'. We should not, though. At the time of that commit, deferred deletes were the default behavior, and there is a secondary index in the test space. Deferred deletes involve TX thread communication, and at the moment the scheduler worker threads were stopped, the TX event loop was not running. This could result in worker threads hanging on stop. In this patch we stop worker threads in the shutdown phase, while the TX event loop is still active. We delete part of the test for #3412, as we now finish the fibers that may use the latch. We also restore destroying the latch.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal
-
- Jan 29, 2024
Nikolay Shirokovskiy authored
In the process of graceful shutdown it is convenient to first finish all client (non-system) fibers. Otherwise we would have to be ready for any subsystem to handle a request from a client fiber during or after that subsystem's shutdown, which would make the code more complex. We first cancel client fibers and then wait for them to finish. A fiber may not respond to cancellation and hang, which causes the shutdown to hang, but this is the approach we already chose for iproto shutdown. Note that as a result of this approach the application will panic if it is shut down during execution of the initialization script (in particular, if that script is executing box.cfg).

There are changes in the application/tests to adapt to client fiber shutdown:
- make code cancellable (only to pass existing tests; we did not investigate all the possible places that should be made such);
- make the console stop sending echo to the client before client fiber shutdown; otherwise, since the console server fiber is a client fiber, we would send a message that the fiber is cancelled on shutdown, which breaks a lot of existing tests; this approach is on par with iproto shutdown;
- some tests (7743, replication-luatest/shutdown, replication/anon, replication/force_recovery, etc.) test shutdown during execution of the init script; now a panic is expected, so change them accordingly;
- some tests (8530, errinj_vylog) use an injection that blocks client fiber finishing; those tests don't need graceful shutdown, so let's just kill tarantool instead;
- change the test in vinyl/errinj for gh-3225: we don't really need to check when the vinyl reader is blocked, as it executes small tasks (we assume the read syscall will not hang); also change the test for vinyl dump shutdown by slowing the dump down instead of blocking it entirely, which is required for the client fibers in the test to finish in time;
- other similar changes.

Also we can drop the code from replication shutdown that was required to handle client requests during/after shutdown.

Part of #8423

NO_CHANGELOG=internal
NO_DOC=internal
-
- Feb 13, 2023
Mergen Imeev authored
This patch replaces malloc() with xmalloc() in key_def_dup() to avoid the possibility of skipping the malloc() return value check. Closes tarantool/security#81 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
- Dec 12, 2022
Vladislav Shpilevoy authored
It is a wrapper around pthread cancel and join. This pattern was repeated many times and was dangerous, because it left cord.id set: an accidental attempt to cord_join/cojoin() such a cord would then lead to UB. The patch introduces a function which encapsulates the blocking cancellation. It is going to be used in a next patch to count the number of cords in the process, which in turn is needed for a new test. The counter is atomic in case some cords are created not by the main cord. There are now also more sanity checks against accidental attempts to join the same cord twice.

Needed for #7743

NO_DOC=internal
NO_CHANGELOG=internal
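A minimal sketch of such a wrapper on top of raw pthreads (the real cord wrapper also clears `cord.id` and maintains the atomic cord counter; the function name and body here are assumptions):

```c
#include <pthread.h>
#include <unistd.h>

/* Cancel a thread and block until it has actually terminated. */
static int thread_cancel_and_join(pthread_t tid)
{
    int rc = pthread_cancel(tid);
    if (rc != 0)
        return rc;
    /* Joining reaps the thread; after this its id must not be reused. */
    return pthread_join(tid, NULL);
}

/* A thread that would run forever unless cancelled. */
static void *sleepy(void *arg)
{
    (void)arg;
    for (;;)
        sleep(1); /* sleep() is a cancellation point */
    return NULL;
}
```

Even if the cancel request arrives before the thread reaches a cancellation point, it stays pending and is acted on at the first `sleep()`, so the join always completes.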
-
- Nov 23, 2022
Nikolay Shirokovskiy authored
As it breaks sane usage of region as a data stack:

```
size_t region_svp = region_used(&fiber()->gc);
/* some allocation on fiber gc and usage of allocated memory. */
region_truncate(&fiber()->gc, region_svp);
```

If in the above snippet one calls a function that in turn calls `fiber_gc`, the snippet may have a use-after-free and later UB on truncation. For this reason let's get rid of fiber_gc. However, we need to make sure we don't introduce leaks this way. So before actually removing fiber_gc, we made it perform a leak check instead, and only after fixing all the leaks was fiber_gc removed. To find a leak easily, the backtrace of the first fiber gc allocation that is not truncated is saved and then reported. To catch leaks that are not triggered by the current test suite and to prevent introducing leaks in future patches, the leak check is added on fiber exit/recycle and, for long-living system fibers, on every loop iteration. The leak check in release builds is on, but without leak backtrace info by default for performance reasons. A backtrace can be obtained by using the `fiber.leak_backtrace_enable()` knob before starting the leaking fiber. Normally leaks are only reported in the log, but that would not help to catch errors when running test suites, so the build option ABORT_ON_LEAK is added. When it is on, we abort on leak. This option is turned on for all builds used in CI.

Closes #5665

NO_CHANGELOG=internal
NO_DOC=internal
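The data-stack discipline the snippet relies on can be modeled with a toy region (stub names; the real API lives in the small library's region allocator):

```c
#include <stddef.h>

/* A toy bump allocator illustrating the savepoint discipline:
 * save the watermark, allocate, use, then truncate back. */
struct region_stub {
    char buf[1024];
    size_t used;
};

static void *region_alloc_stub(struct region_stub *r, size_t size)
{
    if (r->used + size > sizeof(r->buf))
        return NULL;
    void *p = r->buf + r->used;
    r->used += size;
    return p;
}

static size_t region_used_stub(const struct region_stub *r)
{
    return r->used;
}

/* Frees everything allocated after the savepoint was taken. */
static void region_truncate_stub(struct region_stub *r, size_t svp)
{
    r->used = svp;
}
```

A `fiber_gc`-style "reset everything" call in the middle of this pattern would invalidate `p` and make the later truncate operate on a watermark that no longer matches the allocator state, which is exactly the UB the commit removes.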
-
- Aug 30, 2022
Nikita Zheleztsov authored
Currently internal tarantool fibers can be cancelled from the user's app, which can lead to critical errors. Let's mark these fibers as system ones in order to be sure that they won't be cancelled from the Lua world.

Closes #7448
Closes #7473

NO_DOC=minor change
-
- Aug 04, 2022
Vladimir Davydov authored
We have a method for getting the number of elements stored in a BPS tree. Let's use it instead of accessing BPS tree internals directly so that we can freely refactor BPS tree internals. NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
- Jun 24, 2022
Vladimir Davydov authored
We often rotate an LSM tree's active in-memory index if its generation or schema version is older than the current one. Let's add a helper function, vy_lsm_rotate_mem_if_required(), for that.

Needed for #5080

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
-
- Dec 06, 2021
Georgiy Lebedev authored
On completion of compaction tasks, we immediately remove compacted run files created after the last checkpoint to save disk space. To perform this optimization, we compare the unused runs' dump LSN with the last checkpoint's one. But during the replica's initial JOIN stage we set the LSN of all rows received from the remote master to 0 (see bootstrap_journal_write in box/box.cc). Considering that the LSN of an initial checkpoint is also 0, our optimization stops working, and we get a huge disk space usage spike (as the unused run files only get removed when garbage collection occurs). We should check the vinyl engine's status and perform our optimization unconditionally if we are in the replica's initial JOIN stage.

Closes #6568
-
- Sep 22, 2021
Vladimir Davydov authored
For deferred DELETE statements to be recovered after restart, we write them to a special 'blackhole' system space, _vinyl_deferred_delete, which doesn't store any data, but is logged in the WAL, as a normal space. In the on_replace trigger installed for this space, we insert deferred DELETE statements into the memory (L0) level of the LSM tree corresponding to the space for which the statement was generated. We also wait for L0 quota in the trigger. The problem is a space can be dropped while we are waiting for quota, in which case the trigger function will crash once it resumes execution. To fix this, let's wait for quota before we write the information about the deferred DELETE statement to the _vinyl_deferred_delete space and check if the LSM tree was dropped after yield. This way, everything will work as expected even if a new space is created with the same id, because we don't yield after checking quota. Closes #6448
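The fixed ordering can be sketched as a tiny model (stub types and names; the real code waits on vinyl's L0 quota and then checks whether the LSM tree was dropped during the yield):

```c
#include <stdbool.h>

/* Toy stand-in for the target LSM tree of a deferred DELETE. */
struct target_lsm {
    bool dropped;
};

/*
 * Models the on_replace trigger flow after the fix: the quota wait
 * (which may yield) happens first, then the tree is re-checked before
 * any statement is inserted into its memory level.
 */
static bool apply_deferred_delete(struct target_lsm *lsm,
                                  bool drop_during_yield)
{
    /* Quota wait: may yield; the space can be dropped meanwhile. */
    if (drop_during_yield)
        lsm->dropped = true;
    /* Re-check after the yield; skip a dropped tree instead of crashing. */
    if (lsm->dropped)
        return false;
    /* Insert into L0 here; no further yields, so the check stays valid. */
    return true;
}
```

The key property is that no yield happens between the check and the insert, which is why a new space reusing the same id cannot be confused with the dropped one.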
-
- Jul 01, 2021
Vladimir Davydov authored
An LSM tree (that is, a space index) can be dropped while compaction is in progress for it. In this case compaction will still commit the new run to the vylog upon completion. This usually works fine, but not if gc has already purged all the information about the dropped LSM tree from the vylog by that time, in which case an attempt to commit the new run will result in a permanently broken vylog (because compaction will write vylog records for a non-existing object):

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 13 deleted but not registered
```

To prevent this from happening, let's make compaction silently drop the new run without committing it to the vylog if the LSM tree has been dropped. This should work just fine: since the LSM tree isn't used anymore, we don't need to have it compacted, nor do we need to delete the run, since gc will eventually clean up all artefacts left from the dropped LSM tree. One thing to note is that we must also exclude dropped LSM trees from further compaction; if we don't, we might end up picking the dropped LSM tree for compaction over and over again (because it isn't actually compacted). This patch also drops the gh-5141-invalid-vylog-file test, because it only ensured that the issue fixed by this patch was there.

Closes #5436
-
- Jun 16, 2021
Vladislav Shpilevoy authored
Sometimes a transaction can fail before it goes to WAL. In that case the signature bears no sign of it, and neither does the journal_entry result (which might not even be created yet). Still, if txn_commit/try_async() is called, it invokes the on_rollback triggers. The triggers can only see TXN_SIGNATURE_ROLLBACK and can't distinguish it from a real rollback like box.rollback(). Due to that, some important errors like a transaction manager conflict or OOM are lost. The patch introduces a new error signature, TXN_SIGNATURE_ABORT, which says the transaction didn't manage to try going to WAL, and for the error one needs to look at the global diag. The next patch is going to stop overriding it with TXN_SIGNATURE_ROLLBACK.

Part of #6027
-
- May 14, 2021
Cyrill Gorcunov authored
Drop redundant "%s" from format.

Follow-up #5846

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
- Apr 14, 2021
Aleksandr Lyapunov authored
Part of #5958
-
- Jun 10, 2020
Nikita Pettik authored
It may turn out that dump_generation does not catch up with the current generation while no other dump tasks are in progress. This may happen if the dump process is throttled due to errors. In this case the generation is bumped but dump_generation is not (since the dump is not completed). In turn, throttling opens a window for DDL operations. For instance, dropping an index and creating a new one results in the mentioned situation:

```
box.snapshot() -- fails for some reason; next attempt at dumping will be
               -- taken in one second.
s:drop() -- drop index to be dumped
s = box.schema.space.create('test', {engine = 'vinyl'})
-- create new one (its mem generation is greater than scheduler's one)
i = s:create_index('pk')
```

Closes #4821
-
- Jun 09, 2020
Vladislav Shpilevoy authored
Some error injections use usleep() to put the current thread to sleep, for example, to simulate that the WAL thread is slow. A few of them allow passing a custom number of microseconds to usleep, in the form:

```
usleep(injection->dvalue * 1000000);
```

assuming that dvalue is a number of seconds. But the usleep argument is uint32_t, at least on Mac (it is useconds_t, whose size is 4). This means that a big enough dvalue easily overflows it. The patch makes it use nanosleep() in a new wrapper, thread_sleep(), which takes a double value and does not truncate it to 4 bytes. The overflow was the case for ERRINJ_VY_READ_PAGE_TIMEOUT = 9000 in test/vinyl/errinj_vylog.test.lua and ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT = 9000 in test/vinyl/errinj.test.lua.

Part of #4609
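A sketch of such a wrapper (the name and double-seconds signature follow the commit's description; the EINTR retry loop is an assumption about the implementation):

```c
#include <stdint.h>
#include <time.h>

/* Sleep for a possibly large number of seconds given as a double,
 * without truncating it to a 32-bit microsecond count. */
static void thread_sleep(double sec)
{
    struct timespec ts;
    ts.tv_sec = (time_t)sec;
    ts.tv_nsec = (long)((sec - (double)ts.tv_sec) * 1e9);
    while (nanosleep(&ts, &ts) != 0)
        ; /* interrupted by a signal: retry with the remaining time */
}

/* The overflow the commit describes: the delay in microseconds
 * doesn't fit into a 32-bit useconds_t. */
static int usleep_arg_overflows(double sec)
{
    uint64_t usec = (uint64_t)(sec * 1e6);
    return (uint32_t)usec != usec;
}
```

With dvalue = 9000, the product is 9e9 microseconds, well above the 32-bit maximum of about 4.29e9, so the truncated usleep argument bore no relation to the intended delay.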
-
- May 27, 2020
Nikita Pettik authored
Before this patch, box.snapshot() bailed out immediately if it saw that the scheduler was throttled due to errors. For instance:

```
box.error.injection.set('ERRINJ_VY_RUN_WRITE', true)
snapshot() -- fails due to ERRINJ_VY_RUN_WRITE
box.error.injection.set('ERRINJ_VY_RUN_WRITE', false)
snapshot() -- still fails despite the fact that injected error is unset
```

As a result, one has to wait up to a minute to make a snapshot. The reason throttling was introduced was to avoid flooding the log in case of repeating disk errors. What is more, to deal with scheduler throttling in tests, we had to introduce a new error injection (ERRINJ_VY_SCHED_TIMEOUT). It reduces the time during which the scheduler remains throttled, which is ugly and race prone. So let's unthrottle the scheduler when the checkpoint process is launched via a manual box.snapshot() invocation.

Closes #3519
-
- Mar 26, 2020
Kirill Shcherbatov authored
Let's rename diag_add_error() to diag_set_error(), because it actually replaces the error object in the diagnostic area with a new one, and the old name is not representative. Moreover, we are going to introduce a new diag_add_error() which will place an error at the top of the stacked diagnostic area.

Needed for #1148
-
- Feb 20, 2020
Cyrill Gorcunov authored
Using void explicitly in functions which take no arguments allows the compiler to optimize the code a bit and not to assume there might be variable arguments. Moreover, in commit e070cc4d we dropped the arguments from txn_begin but didn't update vy_scheduler.c. The compiler didn't complain because it assumed there were varargs.

Acked-by: Konstantin Osipov <kostja.osipov@gmail.com>
Acked-by: Nikita Pettik <korablev@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
- Oct 28, 2019
Ilya Kosarev authored
The trigger function return type is changed from void to int, and any non-zero value means the trigger finished with an error. A trigger can still raise an error; there is no other refactoring except the obvious `diag_raise();` --> `return -1;` replacement.

Prerequisites: #4247
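The shape of the change can be sketched like this (a hypothetical trigger body; the real trigger callbacks take a `struct trigger *` and an event pointer):

```c
#include <stdbool.h>

/* Before the patch a failing trigger raised an exception via
 * diag_raise(); after it, errors are reported with a return code
 * and the caller propagates them. */
static int example_on_replace_trigger(bool constraint_violated)
{
    if (constraint_violated) {
        /* was: diag_set(...); diag_raise(); */
        return -1;
    }
    return 0;
}
```

Callers then check the return value instead of relying on C++ exception unwinding, which is what makes the trigger machinery usable from plain C code paths.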
-
- Aug 20, 2019
Vladimir Davydov authored
We remove an LSM tree from the scheduler queues as soon as it is dropped, even though the tree may hang around for a while after that, e.g. because it is pinned by an iterator. As a result, once an index is dropped, it won't be dumped anymore - its memory level will simply disappear without a trace. This is okay for now, but to implement snapshot iterators we must make sure that an index stays valid as long as there's an iterator that references it. So let's delay removal of an index from the scheduler queues until it is about to be destroyed.
-
- Jul 04, 2019
Vladimir Davydov authored
ERROR_INJECT_YIELD yields the current fiber execution by calling fiber_sleep(0.001) while the given error injection is set. ERROR_INJECT_SLEEP suspends the current thread execution by calling usleep(1000) while the given error injection is set.
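The thread-suspending variant can be sketched roughly like this (the real macros consult the global error-injection table, and ERROR_INJECT_YIELD calls fiber_sleep(0.001) instead, which needs the fiber runtime; the flag below is a simplified stand-in):

```c
#include <stdbool.h>
#include <unistd.h>

/* Stand-in for an entry in the error-injection table. */
static bool errinj_test_delay = true;

/* Suspend the current *thread* while the injection is set. */
#define ERROR_INJECT_SLEEP(flag) \
    do { \
        while (flag) \
            usleep(1000); \
    } while (0)

/* Models the test harness clearing the injection from elsewhere. */
static void release_injection(void)
{
    errinj_test_delay = false;
}
```

The difference between the two macros is what they block: ERROR_INJECT_YIELD parks only the current fiber and lets the event loop keep running, while ERROR_INJECT_SLEEP stalls the whole thread.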
-
- Jun 25, 2019
Georgy Kirichenko authored
Refactoring: don't touch the fiber gc storage on a transaction rollback explicitly. This relaxes the dependencies between the fiber and transaction life cycles.

Prerequisites: #1254
-
Georgy Kirichenko authored
Move transaction auto start and auto commit behavior to the box level. From now on, a transaction won't start and commit automatically without txn_begin/txn_commit invocations. This is a part of a bigger transaction refactoring aimed at implementing detachable transactions and a parallel applier.

Prerequisites: #1254
-
- Jun 20, 2019
Vladimir Davydov authored
To apply replicated rows in parallel, we need to be able to complete transactions asynchronously, from the tx_prio callback. We can't yield there, so we must ensure that on_commit/on_rollback triggers don't yield. The only place where we may yield in a completion trigger is vinyl DDL, which submits vylog records and waits for them to complete. Actually, there's no reason to wait for a vylog write to complete, as we can handle missing records on recovery. So this patch reworks vylog to make vy_log_tx_try_commit(), and hence the on_commit/on_rollback triggers using it, non-yielding. To achieve that, we need to:
- Use vy_log.latch only to sync log rotation vs writes; don't protect the vylog buffer with it. This makes vy_log_tx_begin() non-yielding.
- Use a separate list and buffer for storing the vylog records of each transaction. We used to share them among transactions, but without vy_log.latch we can't sync access to them anymore. Since vylog transactions are rare events, this should be fine.
- Make vy_log_tx_try_commit() append the transaction to the list of pending transactions and wake up a background fiber to flush all pending transactions. This way it doesn't need to yield.

Closes #4218
-
- Jun 06, 2019
Vladimir Davydov authored
After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN), then try to delete their files unless they are needed for recovery from the checkpoint, and finally mark them as not needed in the vylog (VY_LOG_FORGET_RUN). There's a potential race sitting here: the garbage collector might kick in after the files are dropped, but before they are marked as not needed. If this happens, there will be runs that have two VY_LOG_FORGET_RUN records, which will break recovery:

```
Run XX is forgotten, but not registered
```

The following patches make the race more likely to happen, so let's eliminate it by making the garbage collector the only one who can mark runs as not needed (i.e. write a VY_LOG_FORGET_RUN record). There will be no warnings, because the garbage collector silently ignores ENOENT errors, see vy_gc(). Another good thing about this patch is that now we never yield inside a vylog transaction, which makes it easier to remove the vylog latch that blocks the implementation of transactional DDL.
-
- Apr 16, 2019
Vladimir Davydov authored
To propagate changes applied to a space while a new index is being built, we install an on_replace trigger. In case the on_replace trigger callback fails, we abort the DDL operation. The problem is the trigger may yield, e.g. to check the unique constraint of the new index. This opens a time window for the DDL operation to complete and clear the trigger. If this happens, the trigger will try to access the outdated build context and crash:

```
| #0 0x558f29cdfbc7 in print_backtrace+9
| #1 0x558f29bd37db in _ZL12sig_fatal_cbiP9siginfo_tPv+1e7
| #2 0x7fe24e4ab0e0 in __restore_rt+0
| #3 0x558f29bfe036 in error_unref+1a
| #4 0x558f29bfe0d1 in diag_clear+27
| #5 0x558f29bfe133 in diag_move+1c
| #6 0x558f29c0a4e2 in vy_build_on_replace+236
| #7 0x558f29cf3554 in trigger_run+7a
| #8 0x558f29c7b494 in txn_commit_stmt+125
| #9 0x558f29c7e22c in box_process_rw+ec
| #10 0x558f29c81743 in box_process1+8b
| #11 0x558f29c81d5c in box_upsert+c4
| #12 0x558f29caf110 in lbox_upsert+131
| #13 0x558f29cfed97 in lj_BC_FUNCC+34
| #14 0x558f29d104a4 in lua_pcall+34
| #15 0x558f29cc7b09 in luaT_call+29
| #16 0x558f29cc1de5 in lua_fiber_run_f+74
| #17 0x558f29bd30d8 in _ZL16fiber_cxx_invokePFiP13__va_list_tagES0_+1e
| #18 0x558f29cdca33 in fiber_loop+41
| #19 0x558f29e4e8cd in coro_init+4c
```

To fix this issue, let's recall that when a DDL operation completes, all pending transactions that affect the altered space are aborted by the space_invalidate callback. So to avoid the crash, we just need to bail out early from the on_replace trigger callback if we detect that the current transaction has been aborted.

Closes #4152
-
- Apr 11, 2019
Vladimir Davydov authored
L1 runs are usually the most frequently read and smallest runs at the same time so we gain nothing by compressing them. Closes #2389
-
- Apr 07, 2019
Vladimir Davydov authored
Apart from speeding up statement comparisons and hence index lookups, this is also a prerequisite for multikey indexes, which will reuse tuple comparison hints as offsets in indexed arrays. Albeit huge, this patch is pretty straightforward - all it does is replace struct tuple with struct vy_entry (which is tuple + hint pair) practically everywhere in the code. Now statements are stored and compared without hints only in a few places, primarily at the very top level. Hints are also computed at the top level so it should be pretty easy to replace them with multikey offsets when the time comes.
-
- Apr 01, 2019
Vladimir Davydov authored
It's an independent piece of code that is definitely worth moving from vy_scheduler to vy_lsm internals anyway. Besides, having it wrapped up in a separate function will make it easier to patch.
-
- Mar 13, 2019
Vladimir Davydov authored
It's actually only needed to initialize disk streams so let's pass it to vy_write_iterator_new_slice() instead.
-
- Feb 25, 2019
Vladislav Shpilevoy authored
The only goal of reading and writing heap_node.pos was checking if a node is now in a heap, or not. This commit encapsulates this logic into a couple of functions.
-
- Feb 22, 2019
Vladislav Shpilevoy authored
Until now, the heap API worked with struct heap_node only, which forced a user to constantly call container_of. Such code looks really awful. This commit makes the heap take and return user-defined structures, and removes the container_of glue. It is worth noting that rb-tree and b-tree have a similar API. Even rlist has its rlist_*_entry() wrappers, and mhash provides macros to define your own value type.
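The container_of boilerplate the commit removes looks roughly like this (type and field names are illustrative, not the real scheduler structures):

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct heap_node {
    size_t pos;
};

/* A scheduler task as a heap user would define it. */
struct task {
    int priority;
    struct heap_node in_heap;
};

/* Old style: the heap hands back a bare heap_node and the caller
 * must recover the enclosing structure manually. */
static struct task *task_from_node(struct heap_node *node)
{
    return container_of(node, struct task, in_heap);
}
```

After the change, the heap's pop/top operations return `struct task *` directly, so this conversion helper (and the chance of passing the wrong member name to it) disappears from user code.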
-
- Feb 14, 2019
Vladimir Davydov authored
We used to swap it between vy_lsm objects, but we don't do that anymore so we can embed it.
-
- Feb 12, 2019
Vladimir Davydov authored
This patch adds dumps_per_compaction metric to per index statistics. It shows the number of dumps it takes to trigger a major compaction of a range in a given LSM tree. We need it to automatically choose the optimal number of ranges that would smooth out the load generated by range compaction. To calculate this metric, we assign dump_count to each run. It shows how many dumps it took to create the run. If a run was created by a memory dump, it is set to 1. If a run was created by a minor compaction, it is set to the sum of dump counts of compacted ranges. If a run was created by a major compaction, it is set to the sum of dump counts of compacted ranges minus dump count of the last level run. The dump_count is stored in vylog. This allows us to estimate the number of dumps that triggers compaction in a range as dump_count of the last level run stored in the range. Finally, we report dumps_per_compaction of an LSM tree as the average dumps_per_compaction among all ranges constituting the tree. Needed for #3944
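The bookkeeping rules above can be sketched as plain arithmetic (hypothetical helper names; in the real code the values live on the run objects and are persisted in vylog):

```c
#include <stddef.h>
#include <stdint.h>

/* A run written by a memory dump took exactly one dump to create. */
static uint32_t dump_count_for_dump(void)
{
    return 1;
}

/* Minor compaction: the sum of the dump counts of the compacted runs. */
static uint32_t dump_count_for_minor(const uint32_t *counts, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += counts[i];
    return sum;
}

/* Major compaction: the same sum minus the last-level run's dump
 * count, whose value estimates dumps-per-compaction for the range. */
static uint32_t dump_count_for_major(const uint32_t *counts, size_t n,
                                     uint32_t last_level_count)
{
    return dump_count_for_minor(counts, n) - last_level_count;
}
```

For example, compacting two fresh dump runs (count 1 each) with a last-level run of count 4 yields a new last-level run with count 2, meaning it took roughly two dumps since the previous major compaction to trigger this one.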
-