Commits · ae6a02ebab0b69e5421ae96c027219d4dc267f93 · core / tarantool

Jul 23, 2024

vinyl: do not log dump if index was dropped · ae6a02eb

Vladimir Davydov authored 8 months ago

An index can be dropped while a memory dump is in progress. If the vinyl
garbage collector happens to delete the index from the vylog by the time
the memory dump completes, the dump will log an entry for a deleted
index, resulting in an error next time we try to recover the vylog,
like:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 2 committed after deletion
```

or

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Deleted range 9 has run slices
```

We already fixed a similar issue with compaction in commit 29e2931c
("vinyl: fix race between compaction and gc of dropped LSM"). Let's fix
this one in exactly the same way: discard the new run without logging it
to the vylog on a memory dump completion if the index was dropped while
the dump was in progress.

Closes #10277

NO_DOC=bug fix

ae6a02eb

errinj: log error injection value · 019bacbe

Vladimir Davydov authored 8 months ago

Let's log the new value when an error injection is set in orer to ease
debugging in tests.

NO_DOC=logging
NO_TEST=logging
NO_CHANGELOG=logging

019bacbe

Jul 22, 2024

config/schema: support field deletion using :set() · 8a0013ea

Alexander Turenko authored 8 months ago

This commit implements the `<schema object>:set()` algorithm in a more
accurate way and it solves several drawbacks of the previous
implementation.

* It was impossible to set a field that is nested to a record or a map
  that has the box.NULL value (#10190).
* It was impossible to set a field to the box.NULL value (#10193).
* It was impossible to delete a field, now `nil` RHS value means the
  deletion (#10194).

Fixes #10190
Fixes #10193
Fixes #10194

NO_DOC=Included into https://github.com/tarantool/doc/issues/4279

8a0013ea

lua: bump metrics module · e79982e7

Georgy Moiseev authored 8 months ago

Bump metrics package submodule. Commits from PRs [1-4] affect
Tarantool, the other ones are related to module infrastructure.

1. https://github.com/tarantool/metrics/pull/482
2. https://github.com/tarantool/metrics/pull/483
3. https://github.com/tarantool/metrics/pull/484
4. https://github.com/tarantool/metrics/pull/491

NO_DOC=doc is a part of submodule

e79982e7

tuple: allocate formats table statically · a2da1de7

Vladimir Davydov authored 8 months ago

The tuple formats table may be accessed with `tuple_format_by_id()` from
any thread, not just tx. For example, it's accessed by a vinyl writer
thread when it deletes a tuple. If a thread happens to access the table
while it's being reallocated by tx, see `tuple_format_register()`,
the accessing thread may crash with a use-after-free or NULL pointer
dereference bug, like the one below:

```
 # 1  0x64bd45c09e22 in crash_signal_cb+162
 # 2  0x76ce74e45320 in __sigaction+80
 # 3  0x64bd45ab070c in vy_run_writer_append_stmt+700
 # 4  0x64bd45ada32a in vy_task_write_run+234
 # 5  0x64bd45ad84fe in vy_task_f+46
 # 6  0x64bd45a4aba0 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+16
 # 7  0x64bd45c13e66 in fiber_loop+70
 # 8  0x64bd45e83b9c in coro_init+76
```

To avoid that, let's make the tuple formats table statically allocated.
This shouldn't increase actual memory usage because system memory is
allocated lazily, on page fault. The max number of tuple formats isn't
that big (64K) to care about the increase in virtual memory usage.

Closes #10278

NO_DOC=bug fix
NO_TEST=mt race

a2da1de7

config/schema: allow to index 'any' type in :get() · 0f8c715d

Alexander Turenko authored 9 months ago

`<schema object>:get()` now can access a field inside the `any` type if
it is a `table` or `nil`/`box.NULL`.

`config:get()` now can access fields inside `app.cfg.<key>` and
`roles_cfg.<key>`.

Fixes #10205

NO_DOC=The `<schema object>:get()` update is included into
       https://github.com/tarantool/doc/issues/4279.
       The `config:get()` reference on the website doesn't mention the
       constraint, so it doesn't need an update.

0f8c715d

Jul 18, 2024

vinyl: wake up waiters after clearing checkpoint_in_progress flag · fc3196dc

Vladimir Davydov authored 8 months ago

The function `vy_space_build_index`, which builds a new index on DDL,
calls `vy_scheduler_dump` on completion. If there's a checkpoint in
progress, the latter will wait on `vy_scheduler::dump_cond` until
`vy_scheduler::checkpoint_in_progress` is cleared. The problem is
`vy_scheduler_end_checkpoint` doesn't broadcast `dump_cond` when it
clears the flag. Usually, everything works fine because the condition
variable is broadcast on any dump completion, and vinyl checkpoint
implies a dump, but under certain conditions this may lead to a fiber
hang. Let's broadcast `dump_cond` in `vy_scheduler_end_checkpoint`
to be on the safe side.

While we are at it, let's also inject a dump delay to the original
test to make it more robust.

Closes #10267
Follow-up #10234

NO_DOC=bug fix

fc3196dc

Jul 17, 2024

iproto: introduce FETCH_SNAPSHOT_CURSOR feature · 62c49367

Nikita Zheleztsov authored 8 months ago

This commit introduces FETCH_SNAPSHOT_CURSOR feature, which is available
only in EE. The feature is not returned in response to IPROTO_ID and is
not shown in box.iproto.protocol_features in Community Edition. Its id
is shown only in box.iproto.feature, which is a list of all available
features in the current version.

Needed for tarantool/tarantool-ee#741

NO_CHANGELOG=minor

@TarantoolBot document
Title: Document iproto feature FETCH_SNAPSHOT_CURSOR

Root document: https://www.tarantool.io/en/doc/latest/reference/reference_lua/net_box/#net-box-connect

FETCH_SNAPSHOT_CURSOR feature requires cursor FETCH_SNAPSHOT on the
server. Its ID is IPROTO_FEATURE_FETCH_SNAPSHOT_CURSOR. IPROTO version
is 8 or more, Enterprise Edition is also required.

62c49367

engine: introduce stubs for checkpoint FETCH_SNAPSHOT · 2fca5c13

Nikita Zheleztsov authored 8 months ago

This commit introduces engine stubs that enable a new method
of fetching snapshots for anonymous replicas. Instead of using
the traditional read-view join approach, this update allows
file snapshot fetching. Note that file snapshot fetching
is only available in Tarantool EE.

Checkpoint fetching is done via IPROTO_IS_CHECKPOINT_JOIN,
IPROTO_CHECKPOINT_VCLOCK and IPROTO_CHECKPOINT_LSN fields.

If IPROTO_CHECKPOINT_JOIN is set to true, join will be done from
files: .snap for memtx, .run for vinyl, if false - from read view.

Checkpoint join allows to continue from the place, where client
stopped in case of snapshot fetching error. This allows to avoid
rebootstrap of an anonymous client. This can be done by specifying
CHECKPOINT_VCLOCK, which says from which file server should continue
join, client gets vclock at the beginning of the join. Specifying
CHECKPOINT_LSN allows to continue from some position in checkpoint.
Server sends all data >= CHECKPOINT_LSN.

If CHECKPOINT_VCLOCK is not specified, fetching is done from the latest
available checkpoint. If CHECKPOINT_LSN is not specified - start from
the beginning of the snap. So, specifying only IS_CHECKPOINT_JOIN
triggers fetching the latest checkpoint from files.

Needed for tarantool/tarantool-ee#741

NO_DOC=ee
NO_TEST=ee
NO_CHANGELOG=ee

2fca5c13

engine: send vclock with 0th component during join · 56058393

Nikita Zheleztsov authored 8 months ago

This commit makes engine to send vclock without ignoring 0th component
during join, which is needed for checkpoint FETCH SNAPSHOT.

Currently engine join functions are invoked only from
relay_initial_join, which is done during JOIN or FETCH SNAPSHOT.
They respond with vclock of the read view we're going to send.

In the following commit checkpoint FETCH SNAPSHOT will be introduced,
which responds with vclock of the checkpoint, we're going to send.
Such vclock may include 0th component and it's crucial to send it to
a client, as in case of connection failure, client will send us the
same vclock and we'll have to use its signature to figure out, which
checkpoint client wants.

So, we have to send and receive 0th component of the vclock during
FETCH_SNAPSHOT. This commit also introduces decoding vclocks without
ignoring 0th component, as they'll be used in the following commit too.

Needed for tarantool/tarantool-ee#741

NO_DOC=internal
NO_TEST=ee
NO_CHANGELOG=internal

56058393

xrow: rename xrow_encode_vclock · 313bd730

Nikita Zheleztsov authored 8 months ago

This commit renames xrow_encode_vlock to xrow_encode_vclock_ignore0
since the next commit will introduce encoding vclock without ignoring
0th component, which is needed during sending the response to fetch
snapshot request.

This commit also removes internal field inside the replication_request
structure, as the following commit will use 'vclock' for
encoding/decoding vclock without ignoring component.

Needed for tarantool/tarantool-ee#741

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

313bd730

relay: refactor relay_initial_join · 72cc2b3e

Nikita Zheleztsov authored 8 months ago

From now on during initial join memtx engine prepares vclock, raft and
limbo states, it also sends them during memtx_engine_join.

It's done in order to simplify the code of initial join, as in the
consequent commit checkpoint initial join will be introduced and we want
relay code to handle it the same as read-view join without confusing
conditions.

Needed for tarantool/tarantool-ee#741

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

72cc2b3e

engine: move raft and limbo states after system data in checkpoint · 3da31b83

Nikita Zheleztsov authored 8 months ago

Before this commit raft and limbo states were written at the end of the
checkpoint, which makes it very costly to access them.

Checkpoint join needs to access limbo and raft state in order to send
them during JOIN_META stage. We cannot use the latest states, like it's
done for read-view snapshot fetching: states may be far ahead of the
data, written to the checkpoint, which we're going to send.

This commit moves raft and limbo states after data from the system
spaces but before user data. We cannot put them right at the beginning
of the snapshot, because then we'll have to patch recovery process,
which currently strongly relies on the fact, that system spaces are
at the beginning of the snapshot (this was done in order to apply force
recovery only for user data). If we patch recovery process, then old
versions, where it's unpatched, won't be able to recover from the
snapshots done by the newer version, compatibility of snapshots will be
broken.

The current change is not breaking, old Tarantool versions can restore
from the snapshot made by the newer one.

Needed for tarantool/tarantool-ee#741

NO_DOC=internal
NO_CHANGELOG=internal

3da31b83

Jul 16, 2024

applier: fix assertion failure after split brain · 5ce010c5

Nikita Zheleztsov authored 8 months ago

After receiving async transaction from an old term applier_apply_tx
exits without unlocking the latch. If the same applier tries to
subscribe for replication, it fails with assertion, as the latch is
already locked.

Let's fix the function, which raises error so that it just sets
diag and returns -1.

Closes #10073

NO_DOC=bugfix
NO_CHANGELOG=no crash on release version

5ce010c5

error: introduce ERRINJ_TUPLE_ALLOC_COUNTDOWN · 52926402
Ilya Verbin authored 9 months ago
```
Needed for tarantool/tarantool-ee#712

NO_DOC=internal
NO_TEST=internal
NO_CHANGELOG=internal
```
52926402

Jul 15, 2024

applier: drop apply_final_join_tx · da158b9b

Vladislav Shpilevoy authored 9 months ago

Can use the regular applier_apply_tx(), they do the same. The
latter is just more protective, but doesn't matter much in this
case if the code does a few latch locks.

The patch also drops an old test about double-received row panic
during final join. The logic is that absolutely the same situation
could happen during subscribe, but it was always filtered out by
checking replicaset.applier.vclock and skipping duplicate rows.

There doesn't seem to be a reason why final join must be any
different. It is, after all, same subscribe logic but the received
rows go into replica's initial snapshot instead of xlogs. Now it
even uses the same txn processing function applier_apply_tx().

The patch also moves `replication_skip_conflict` option setting
after bootstrap is finished. In theory, final join could deliver
a conflicting row and it must not be ignored. The problem is that
it can't be reproduced anyhow without illegal error injection
(which would corrupt something in an unrealistic way). But lets
anyway move it below bootstrap for clarity.

Follow-up #10113

NO_DOC=refactoring
NO_CHANGELOG=refactoring

da158b9b

box: make instance_vclock const · 19b2cc20

Vladislav Shpilevoy authored 8 months ago

No code besides box.cc can now update instance's vclock
explicitly. That is a protection against hacks like #9916.

Closes #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

19b2cc20

box: make final join vclock update only in box.cc · fe338ed4

Vladislav Shpilevoy authored 9 months ago

The goal is to make sure that no files except box.cc can change
instance_vclock_storage directly. That leads to all sorts of hacks
which in turn lead to bugs - #9916 is a good example.

Now applier on final join only sends rows into the journal. The
journal then is handled by box.cc where vclock is properly
updated.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

fe338ed4

journal: extract journal_write_row from limbo · 7d10096c

Vladislav Shpilevoy authored 9 months ago

The function writes a single xrow into the journal in a blocking
way. It isn't so simple, so makes sense to keep as a function,
especially given that it will be used more in the next commit.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

7d10096c

box: move recovery_journal creation · 2620eb9e

Vladislav Shpilevoy authored 9 months ago

Recovery journal uses word "recovery" to say that it works with
xlogs. For snapshot recovery there is bootstrap_journal. Lets use
it during local snapshot recovery.

The reasoning is that while right now there is no difference, in
next commits the recovery_journal will do more.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

2620eb9e

box: move replicaset.vclock into instance_vclock · f1e8e4e1

Vladislav Shpilevoy authored 9 months ago

Storing vclock of the instance in replicaset.vclock wasn't right.
It wasn't vclock of the whole replicaset. It was local to this
instance. There is no such thing as "replicaset vclock".

The patch moves it to box.h/cc.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

f1e8e4e1

applier: treat register txns like regular ones · 51751f87

Vladislav Shpilevoy authored 9 months ago

Applier during the registration waiting (for registering a new ID
or a name) could keep doing the master txns received before the
registration was started. They could still be inside WAL doing a
disk write, when the replica sends a register request.

Before this commit, it could cause an assertion failure in debug
and a double LSN error in release.

The reason was that during the registration waiting the applier
treated all incoming txns as "final join" txns. I.e. it wasn't
checking if those txns were already received, but not committed
yet.

During normal subscribe process the appliers (potentially
multiple) protect themselves from that by keeping track of the
vclocks which are already applied and also being applied right now
(replicaset.applier.vclock).

Such protection ensures that receiving same row from 2 appliers
wouldn't result into its double write. It also protects from the
case when a txn was received, goes to WAL, but then the applier
reconnects, resubscribes, and gets the same txn again - it
shouldn't be applied.

The patch makes so that the registration waiting after recovery
works like subscribe. Registration during recovery would mean
bootstrap via join. And outside of recovery it means the instance
is already running.

Closes #9916

NO_DOC=bugfix

51751f87

raft: finish worker fiber on Tarantool shutdown · ab386e65

Nikolay Shirokovskiy authored 9 months ago

Let's make sure box raft worker fiber is finished on Tarantool shutdown as
we are going to free fibers stacks. If fiber is not finished it's stack
may have references to objects on heap. Thus as fiber stack will be freed
we will have FP memory leaks under ASAN.

Part of #9722

NO_TEST=rely on existing tests
NO_CHANGELOG=internal
NO_DOC=internal

ab386e65

net.box: finish worker fiber on Tarantool shutdown · 50ce6ebe

Nikolay Shirokovskiy authored 8 months ago

Let's make sure net.box system fiber is finished on Tarantool shutdown as
we are going to free fibers stacks. If fiber is not finished it's stack
may have references to objects on heap. Thus as fiber stack will be freed
we will have FP memory leaks under ASAN.

Part of #9722

NO_CHANGELOG=internal
NO_DOC=internal

50ce6ebe

core: fiber pool shutdown · e513d4ce

Nikolay Shirokovskiy authored 8 months ago

Fiber pool shutdown is finishing all idle fibers. Any message processing
is finished earlier on client fiber shutdown.

We need some changes in shutdown order to make fiber pool shutdown.
First we need to move stopping of iproto threads from free to shutdown
step. The issue is we want to destroy "tx" endpoint which requires all
pipes connected to it to be destroyed first. There are pipes in iproto
threads that connected to "tx". Currently we destroy pipes in free step
and at this point as there is no event loop in tx thread
`cbus_endpoint_destroy` can't receive notifications that pipes are
destroyed.

Originally we put stopping of iproto threads to the free step because we
don't have client fibers shutdown. So it was convenient to have working
`net_pipe` so that client fibers can use iproto API without adding extra
logic to them. Now I guess it make sense to stop client fibers before
iproto shutdown. This is the second change in shutdown order.

There is another reason why we have iproto shutdown before client fiber
shutdown. In the process of iproto shutdown we close connections first
and then cancel all requests in progress. This way client do not receive
unexpected `FiberIsCancelled` errors in the process of server shutdown.
After the patch it not so. Well we may close connections as an extra
step before client fibers shutdown. But let's leave it this way.  Good
clients can subscribe to servere shutdown and prepare for it.  Otherwise
they may receive `FiberIsCancelled` for theier request which looks
sensible.

It is also makes sense now to move watcher and client fiber shutdown
to `box_shutdown` as we can both use watcher and create client fibers
without starting a box.

While at it also drop a note in code why we shutdown watcher before even
fiber clients shutdown.

Part of #9722

NO_CHANGELOG=internal
NO_DOC=internal

e513d4ce

memtx: shutdown memtx gc fiber · f944953e

Nikolay Shirokovskiy authored 10 months ago

As we are going to check for memory leaks in ASAN we need to wait while
memtx gc finishes freeing tuples of dropped primary index or keys of
dropped functional indexes.

Part of #9722

NO_CHANGELOG=internal
NO_DOC=internal

f944953e

vinyl: use broadcast instead of signal to notify about dump completion · 30547157

Vladimir Davydov authored 8 months ago

There may be more than one fiber waiting on `vy_scheduler::dump_cond`:

```
box.snapshot
  vinyl_engine_wait_checkpoint
    vy_scheduler_wait_checkpoint

space.create_index
  vinyl_space_build_index
    vy_scheduler_dump
```

To avoid hang, we should use `fiber_cond_broadcast`.

Closes #10233

NO_DOC=bug fix

30547157

trivia: use __builtin* for offsetof macro · 27e94824

Lev Kats authored 8 months ago

Changed default tarantool `offsetof` macro implementation so it don't
access members of null pointer in typeof that triggers UBsan.

Needed for #10143

NO_DOC=bugfix
NO_CHANGELOG=minor
NO_TEST=tested manually with fuzzer

27e94824

Jul 09, 2024

uuid: relax UUID value validation · b0b32bff

Igor Munkin authored 1 year ago

This patch completely relaxes UUID checks and accepts an arbitrary
128-bit sequence as an UUID for binary data. String representations
still should match the grammars in RFC 4122, Section 3 [1] and RFC 9562,
Section 4 [2].

[1]: https://datatracker.ietf.org/doc/html/rfc4122#section-3
[2]: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-format

Closes #5444

@TarantoolBot document
Title: uuid: relaxed UUID validation

[The UUID module documentation][1] mentions that Tarantool generates
UUIDs following the rules for RFC 4122,[version 4, variant 1][2]. It is
worth mentioning that the user can store an arbitrary 128-bit sequence
as an UUID for binary data. String representations still should match
the grammars in RFC 4122, [Section 3][3], and RFC 9562, [Section 4][4].

[1]: https://www.tarantool.io/en/doc/latest/reference/reference_lua/uuid/
[2]: https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random)
[3]: https://datatracker.ietf.org/doc/html/rfc4122#section-3
[4]: https://datatracker.ietf.org/doc/html/rfc9562#name-uuid-format

b0b32bff

Jul 03, 2024

config: expose experimental.config.utils.schema · ef716a3e

Alexander Turenko authored 9 months ago

The module is renamed from `internal.config.utils.schema` to
`experimental.config.utils.schema` without changes.

It is useful for validation of configuration data in roles and
applications.

Also, it provides a couple of methods that aim to simplify usual tasks
around processing of hierarchical configuration data. For example,

* get/set a nested value
* apply defaults from the schema
* filter data based on annotations from the schema
* transform a hierarchical data using a function
* merge two hierarchical values
* parse environment variable according to its type in the schema

See https://github.com/tarantool/doc/issues/4279 for an in-depth
description.

Fixes #10117

NO_DOC=https://github.com/tarantool/doc/issues/4279

ef716a3e

Jun 26, 2024

box: fix memleak on functional index drop · 319357d5

Nikolay Shirokovskiy authored 9 months ago

We just don't free functional index keys on functional index drop now.
Let's approach keys deletion as in the case of primary index drop ie
let's drop these keys in background.

We should set `use_hint` to `true` in case of MEMTX_TREE_VTAB_DISABLED
tree index methods because `memtx_tree_disabled_index_vtab` uses
`memtx_tree_index_destroy<true>`. Otherwise we get read outside of index
structure for stub functional index on destroy for introduced `is_func`
field (which is reported by ASAN).

Closes #10163

NO_DOC=bugfix

319357d5

Jun 24, 2024

config: add missing ssl.ssl_cert for etcd · 13249eb3

Georgy Moiseev authored 9 months ago

etcd configuration section allows to connect to TLS-encrypted etcd
cluster, providing a way to pass `ssl.ssl_key`. But it is not enough
when etcd server have client cert auth enabled and has a CA file, since
it requires a ssl_cert as well. Actually, propagating ssl_cert is
already a part of the EE connect code [1], we just missing the top-level
config option.

Fixes https://github.com/tarantool/tarantool-ee/issues/827

1. https://github.com/tarantool/tarantool-ee/blame/1138443c46e7a6e1bb855277bc6cb3333240131c/src/box/lua/config/source/etcd.lua#L103

@TarantoolBot document
Title: config: add missing ssl.ssl_cert for etcd

etcd configuration section already allows to set `ssl.ssl_key`. Now it
also allows to pass `ssl.ssl_cert`.

13249eb3

Jun 20, 2024

ci: add workflow to check downgrade versions · 6d856347

Nikolay Shirokovskiy authored 9 months ago

Tarantool has hardcoded list of versions it can downgrade to. This list
should consist of all the released versions less than Tarantool version.
This workflow helps to make sure we update the list before release.

It is run on pushing release tag to the repo, checks the list and fails
if it misses some released version less than current. In this case we
are supposed to update downgrade list (with required downgrade code) and
update the release tag.

Closes #8319

NO_TEST=ci
NO_CHANGELOG=ci
NO_DOC=ci

6d856347

Jun 18, 2024

config: use vshard-ee if available · 29519c71

Alexander Turenko authored 9 months ago

The new closed-source `vshard-ee` module was recently introduced. It is
based on the open-source `vshard` module and, as far as I know, provides
the same API.

Let's configure `vshard-ee` in the same way as `vshard` if sharding is
enabled in tarantool's configuration.

Prefer `vshard-ee` if both are available.

Fixes https://github.com/tarantool/tarantool-ee/issues/815

@TarantoolBot document
Title: config: vshard-ee is now supported

The declarative configuration supports `vshard-ee` in addition to
`vshard` since Tarantool 3.1.1 and 3.2+.

`vshard` is mentioned in the documentation at least [here][1]. All such
places should be updated to mention both `vshard-ee` and `vshard`.

[1]: https://www.tarantool.io/en/doc/latest/reference/configuration/configuration_reference/#sharding

29519c71

Jun 17, 2024

box: feature `tuple:format` to get a format of a tuple · 6d5f1db5

DerekBum authored 9 months ago

This patch adds `tuple:format()` method to get a format
of a tuple.

Closes #10005

@TarantoolBot document
Title: New `format` method for `box.tuple`
Product: Tarantool
Since: 3.2

The `tuple:format` method returns a format of a tuple.

6d5f1db5

Jun 13, 2024

vinyl: fix gc vs vylog race leading to duplicate record · 9d3859b2

Vladimir Davydov authored 9 months ago

Vinyl run files aren't always deleted immediately after compaction,
because we need to keep run files corresponding to checkpoints for
backups. Such run files are deleted by the garbage collection procedure,
which performs the following steps:

 1. Loads information about all run files from the last vylog file.
 2. For each loaded run record that is marked as dropped:
    a. Tries to remove the run files.
    b. On success, writes a "forget" record for the dropped run,
       which will make vylog purge the run record on the next
       vylog rotation (checkpoint).

(see `vinyl_engine_collect_garbage()`)

The garbage collection procedure writes the "forget" records
asynchronously using `vy_log_tx_try_commit()`, see `vy_gc_run()`.
This procedure can be successfully executed during vylog rotation,
because it doesn't take the vylog latch. It simply appends records
to a memory buffer which is flushed either on the next synchronous
vylog write or vylog recovery.

The problem is that the garbage collection isn't necessarily loads
the latest vylog file because the vylog file may be rotated between
it calls `vy_log_signature()` and `vy_recovery_new()`. This may
result in a "forget" record written twice to the same vylog file
for the same run file, as follows:

  1. GC loads last vylog N
  2. GC starts removing dropped run files.
  3. CHECKPOINT starts vylog rotation.
  4. CHECKPOINT loads vylog N.
  5. GC writes a "forget" record for run A to the buffer.
  6. GC is completed.
  7. GC is restarted.
  8. GC finds that the last vylog is N and blocks on the vylog latch
     trying to load it.
  9. CHECKPOINT saves vylog M (M > N).
 10. GC loads vylog N. This triggers flushing the forget record for
     run A to vylog M (not to vylog N), because vylog M is the last
     vylog at this point of time.
 11. GC starts removing dropped run files.
 12. GC writes a "forget" record for run A to the buffer again,
     because in vylog N it's still marked as dropped and not forgotten.
     (The previous "forget" record was written to vylog M).
 13. Now we have two "forget" records for run A in vylog M.

Such duplicate run records aren't tolerated by the vylog recovery
procedure, resulting in a permanent error on the next checkpoint:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run XXXX forgotten but not registered
```

To fix this issue, we move `vy_log_signature()` under the vylog latch
to `vy_recovery_new()`. This makes sure that GC will see vylog records
that it's written during the previous execution.

Catching this race in a function test would require a bunch of ugly
error injections so let's assume that it'll be tested by fuzzing.

Closes #10128

NO_DOC=bug fix
NO_TEST=tested manually with fuzzer

9d3859b2

tuple: don't use offset_slot_cache in vinyl threads · 19d1f1cc

Vladimir Davydov authored 9 months ago

`key_part::offset_slot_cache` and `key_part::format_epoch` are used for
speeding up tuple field lookup in `tuple_field_raw_by_part()`. These
structure members are accessed and updated without any locks, assuming
this code is executed exclusively in the tx thread. However, this isn't
necessarily true because we also perform tuple field lookups in vinyl
read threads. Apparently, this can result in unexpected races and bugs,
for example:

```
  #1  0x590be9f7eb6d in crash_collect+256
  #2  0x590be9f7f5a9 in crash_signal_cb+100
  #3  0x72b111642520 in __sigaction+80
  #4  0x590bea385e3c in load_u32+35
  #5  0x590bea231eba in field_map_get_offset+46
  #6  0x590bea23242a in tuple_field_raw_by_path+417
  #7  0x590bea23282b in tuple_field_raw_by_part+203
  #8  0x590bea23288c in tuple_field_by_part+91
  #9  0x590bea24cd2d in unsigned long tuple_hint<(field_type)5, false, false>(tuple*, key_def*)+103
  #10 0x590be9d4fba3 in tuple_hint+40
  #11 0x590be9d50acf in vy_stmt_hint+178
  #12 0x590be9d53531 in vy_page_stmt+168
  #13 0x590be9d535ea in vy_page_find_key+142
  #14 0x590be9d545e6 in vy_page_read_cb+210
  #15 0x590be9f94ef0 in cbus_call_perform+44
  #16 0x590be9f94eae in cmsg_deliver+52
  #17 0x590be9f9583e in cbus_process+100
  #18 0x590be9f958a5 in cbus_loop+28
  #19 0x590be9d512da in vy_run_reader_f+381
  #20 0x590be9cb4147 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+34
  #21 0x590be9f8b697 in fiber_loop+219
  #22 0x590bea374bb6 in coro_init+120
```

Fix this by skipping this optimization for threads other than tx.

No test is added because reproducing this race is tricky. Ideally, bugs
like this one should be caught by fuzzing tests or thread sanitizers.

Closes #10123

NO_DOC=bug fix
NO_TEST=tested manually with fuzzer

19d1f1cc

vinyl: fix cache iterator skipping tuples in read view · 7b72080d

Vladimir Davydov authored 9 months ago

The tuple cache doesn't store older tuple versions so if a reader is
in a read view, it must skip tuples that are newer than the read view,
see `vy_cache_iterator_stmt_is_visible()`. A reader must also ignore
cached intervals if any of the tuples used as a boundary is invisible
from the read view, see `vy_cache_iterator_skip_to_read_view()`.
There's a bug in `vy_cache_iterator_restore()` because of which such
an interval may be returned to the reader: when we step backwards
from the last returned tuple we consider only one of the boundaries.
As a result, if the other boundary is invisible from the read view,
the reader will assume there's nothing in the index between the
boundaries and skip reading older sources (memory, disk). Fix this by
always checking if the other boundary is visible.

Closes #10109

NO_DOC=bug fix

7b72080d

vinyl: fix run iterator skipping tuples following non-terminal statement · 72763f94

Vladimir Davydov authored 9 months ago

If a run iterator is positioned at a non-terminal statement (UPSERT or
UPDATE), `vy_run_iterator_next()` will iterate over older statements
with the same key using `vy_run_iterator_next_lsn()` to build the key
history. While doing so, it may reach the end of the run file (if the
current key is the last in the run). This would stop iteration
permanently, which is apparently wrong for reverse iterators (LE or LT):
if this happens the run iterator won't return any keys preceding the
last one in the run file. Fix this by removing `vy_run_iterator_stop()`
from `vy_run_iterator_next_lsn()`.

Part of #10109

NO_DOC=bug fix
NO_CHANGELOG=next commit

72763f94

Jun 10, 2024

box: refactor synchro quorum update on deletion from `_cluster` space · 9b63ced3

Georgiy Lebedev authored 9 months ago

For symmetry with the update of the synchronous replication quorum on
insertion into the `_cluster` space, let's reuse the
`on_replace_cluster_update_quorum` on_commit trigger.

Follows-up #10087

NO_CHANGELOG=<refactoring>
NO_DOC=<refactoring>
NO_TEST=<refactoring>

9b63ced3