Commits · 00e153408bb2db44763d52bd48a82e52e2a2b7b2 · core / tarantool

Jul 24, 2024

test: do not test errinj.info() output · 00e15340

Ilya Verbin authored 8 months ago

There is not much sense in testing it, but it is sensitive to source code
changes, especially `ERRINJ_*_COUNTDOWN` injections, e.g. see commit
697123d0 ("box: use maximal space id instead of _schema.max_id").

Needed for tarantool/tarantool-ee#712

NO_DOC=test
NO_CHANGELOG=test

(cherry picked from commit dc0fd81c)


Jul 23, 2024

vinyl: do not log dump if index was dropped · 37eea2b9

Vladimir Davydov authored 8 months ago

An index can be dropped while a memory dump is in progress. If the vinyl
garbage collector happens to delete the index from the vylog by the time
the memory dump completes, the dump will log an entry for a deleted
index, resulting in an error next time we try to recover the vylog,
like:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 2 committed after deletion
```

or

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Deleted range 9 has run slices
```

We already fixed a similar issue with compaction in commit 29e2931c
("vinyl: fix race between compaction and gc of dropped LSM"). Let's fix
this one in exactly the same way: discard the new run without logging it
to the vylog on a memory dump completion if the index was dropped while
the dump was in progress.

Closes #10277

NO_DOC=bug fix

(cherry picked from commit ae6a02eb)


Jul 22, 2024

tuple: allocate formats table statically · bfbf5a10

Vladimir Davydov authored 8 months ago

The tuple formats table may be accessed with `tuple_format_by_id()` from
any thread, not just tx. For example, it's accessed by a vinyl writer
thread when it deletes a tuple. If a thread happens to access the table
while it's being reallocated by tx, see `tuple_format_register()`,
the accessing thread may crash with a use-after-free or NULL pointer
dereference bug, like the one below:

```
 # 1  0x64bd45c09e22 in crash_signal_cb+162
 # 2  0x76ce74e45320 in __sigaction+80
 # 3  0x64bd45ab070c in vy_run_writer_append_stmt+700
 # 4  0x64bd45ada32a in vy_task_write_run+234
 # 5  0x64bd45ad84fe in vy_task_f+46
 # 6  0x64bd45a4aba0 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+16
 # 7  0x64bd45c13e66 in fiber_loop+70
 # 8  0x64bd45e83b9c in coro_init+76
```

To avoid that, let's make the tuple formats table statically allocated.
This shouldn't increase actual memory usage because system memory is
allocated lazily, on page fault. The max number of tuple formats isn't
that big (64K) to care about the increase in virtual memory usage.
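
A minimal sketch of the idea, not the actual Tarantool code: the names
`formats_table`, `FORMAT_ID_MAX`, `format_register` and `format_by_id`
are hypothetical. Because the array has static storage duration, its
address never changes, so a reader in another thread cannot observe a
freed buffer the way it could with a table grown by `realloc()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit mirroring the 64K figure from the commit message. */
enum { FORMAT_ID_MAX = 65536 };

struct format {
	uint16_t id;
};

/*
 * Statically allocated table: zero-initialized and never moved.
 * Untouched pages are typically not backed by physical memory.
 */
static struct format *formats_table[FORMAT_ID_MAX];
static uint32_t formats_count;

/* Register a format; returns 0 on success, -1 if the table is full. */
static int
format_register(struct format *f)
{
	assert(f != NULL);
	if (formats_count >= FORMAT_ID_MAX)
		return -1;
	f->id = (uint16_t)formats_count;
	formats_table[formats_count++] = f;
	return 0;
}

/* Look up a format by id; returns NULL for an unused slot. */
static struct format *
format_by_id(uint16_t id)
{
	return formats_table[id];
}
```

The sketch only shows that the table itself is never reallocated; how
individual slots are published to other threads is out of its scope.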

Closes #10278

NO_DOC=bug fix
NO_TEST=mt race

(cherry picked from commit a2da1de7)


applier: drop apply_final_join_tx · 596d56f7

Vladislav Shpilevoy authored 9 months ago

Can use the regular applier_apply_tx(), they do the same. The
latter is just more protective, but doesn't matter much in this
case if the code does a few latch locks.

The patch also drops an old test about double-received row panic
during final join. The logic is that absolutely the same situation
could happen during subscribe, but it was always filtered out by
checking replicaset.applier.vclock and skipping duplicate rows.

There doesn't seem to be a reason why final join must be any
different. It is, after all, same subscribe logic but the received
rows go into replica's initial snapshot instead of xlogs. Now it
even uses the same txn processing function applier_apply_tx().

The patch also moves `replication_skip_conflict` option setting
after bootstrap is finished. In theory, final join could deliver
a conflicting row and it must not be ignored. The problem is that
it can't be reproduced anyhow without illegal error injection
(which would corrupt something in an unrealistic way). But let's
move it below bootstrap anyway for clarity.

Follow-up #10113

NO_DOC=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit da158b9b)


box: make instance_vclock const · a62da4ee

Vladislav Shpilevoy authored 8 months ago

No code besides box.cc can now update instance's vclock
explicitly. That is a protection against hacks like #9916.

Closes #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 19b2cc20)


box: make final join vclock update only in box.cc · 15f4482c

Vladislav Shpilevoy authored 9 months ago

The goal is to make sure that no files except box.cc can change
instance_vclock_storage directly. That leads to all sorts of hacks
which in turn lead to bugs - #9916 is a good example.

Now applier on final join only sends rows into the journal. The
journal then is handled by box.cc where vclock is properly
updated.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit fe338ed4)


journal: extract journal_write_row from limbo · 972d909b

Vladislav Shpilevoy authored 9 months ago

The function writes a single xrow into the journal in a blocking
way. It isn't so simple, so it makes sense to keep it as a function,
especially given that it will be used more in the next commit.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 7d10096c)


box: move recovery_journal creation · f4438449

Vladislav Shpilevoy authored 9 months ago

Recovery journal uses the word "recovery" to say that it works with
xlogs. For snapshot recovery there is bootstrap_journal. Let's use
it during local snapshot recovery.

The reasoning is that while right now there is no difference, in
the next commits the recovery_journal will do more.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 2620eb9e)


box: move replicaset.vclock into instance_vclock · 15cf2419

Vladislav Shpilevoy authored 9 months ago

Storing vclock of the instance in replicaset.vclock wasn't right.
It wasn't vclock of the whole replicaset. It was local to this
instance. There is no such thing as "replicaset vclock".

The patch moves it to box.h/cc.

Part of #10113

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit f1e8e4e1)


applier: treat register txns like regular ones · d7d31846

Vladislav Shpilevoy authored 9 months ago

Applier during the registration waiting (for registering a new ID
or a name) could keep doing the master txns received before the
registration was started. They could still be inside WAL doing a
disk write, when the replica sends a register request.

Before this commit, it could cause an assertion failure in debug
and a double LSN error in release.

The reason was that during the registration waiting the applier
treated all incoming txns as "final join" txns. I.e. it wasn't
checking if those txns were already received, but not committed
yet.

During normal subscribe process the appliers (potentially
multiple) protect themselves from that by keeping track of the
vclocks which are already applied and also being applied right now
(replicaset.applier.vclock).

Such protection ensures that receiving the same row from 2 appliers
wouldn't result in its double write. It also protects from the
case when a txn was received, goes to WAL, but then the applier
reconnects, resubscribes, and gets the same txn again - it
shouldn't be applied.
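
The check described above can be sketched as follows. This is a
simplified model, not the applier code: `struct vclock` is reduced to a
plain array, and `applier_accept_row` is a hypothetical helper standing
in for the comparison against `replicaset.applier.vclock`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified vector clock: one LSN per replica id. */
enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Return true if the row must be applied, and advance the tracking
 * vclock right away, before the WAL write completes. Return false if
 * the row was already applied or is being applied right now.
 */
static bool
applier_accept_row(struct vclock *applier_vclock, uint32_t replica_id,
		   int64_t lsn)
{
	assert(replica_id < VCLOCK_MAX);
	if (lsn <= applier_vclock->lsn[replica_id])
		return false;
	applier_vclock->lsn[replica_id] = lsn;
	return true;
}
```

Advancing the vclock at receive time rather than at commit time is what
filters out a row that arrives a second time while the first copy is
still inside the WAL.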

The patch makes it so that the registration waiting after recovery
works like subscribe. Registration during recovery would mean
bootstrap via join. And outside of recovery it means the instance
is already running.

Closes #9916

NO_DOC=bugfix

(cherry picked from commit 51751f87)


Jul 18, 2024

vinyl: wake up waiters after clearing checkpoint_in_progress flag · 84b2d213

Vladimir Davydov authored 8 months ago

The function `vy_space_build_index`, which builds a new index on DDL,
calls `vy_scheduler_dump` on completion. If there's a checkpoint in
progress, the latter will wait on `vy_scheduler::dump_cond` until
`vy_scheduler::checkpoint_in_progress` is cleared. The problem is
`vy_scheduler_end_checkpoint` doesn't broadcast `dump_cond` when it
clears the flag. Usually, everything works fine because the condition
variable is broadcast on any dump completion, and vinyl checkpoint
implies a dump, but under certain conditions this may lead to a fiber
hang. Let's broadcast `dump_cond` in `vy_scheduler_end_checkpoint`
to be on the safe side.

While we are at it, let's also inject a dump delay to the original
test to make it more robust.

Closes #10267
Follow-up #10234

NO_DOC=bug fix

(cherry picked from commit fc3196dc)


applier: fix assertion failure after split brain · 73e6d02f

Nikita Zheleztsov authored 8 months ago

After receiving an async transaction from an old term, applier_apply_tx
exits without unlocking the latch. If the same applier then tries to
subscribe for replication, it fails with an assertion, as the latch is
already locked.

Let's fix the function that raises the error so that it just sets
diag and returns -1.
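
A hedged sketch of the pattern, with simplified stand-ins: `struct latch`,
`check_term`, `apply_tx` and `last_error` are illustrative names, not the
real applier code. The point is that a callee which reports failure via a
return code lets the caller reach the unlock on every path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a latch. */
struct latch {
	bool locked;
};

static void
latch_lock(struct latch *l)
{
	assert(!l->locked);
	l->locked = true;
}

static void
latch_unlock(struct latch *l)
{
	assert(l->locked);
	l->locked = false;
}

/* Stand-in for the diagnostics area. */
static const char *last_error;

/* Set the error and return -1 instead of raising. */
static int
check_term(int row_term, int current_term)
{
	if (row_term < current_term) {
		last_error = "async transaction from an old term";
		return -1;
	}
	return 0;
}

static int
apply_tx(struct latch *l, int row_term, int current_term)
{
	latch_lock(l);
	int rc = check_term(row_term, current_term);
	/* The transaction itself would be applied here when rc == 0. */
	latch_unlock(l);
	return rc;
}
```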

Closes #10073

NO_DOC=bugfix
NO_CHANGELOG=no crash on release version

(cherry picked from commit 5ce010c5)


Jul 16, 2024

sio: fix error message displaying bind address · b9a2a87d

Lev Kats authored 8 months ago

Now the `sio_bind` function prints the address into the error message
directly instead of relying on the `fd` used in the failed `bind` call.

`sio_bind` used `sio_socketname_to_buffer` for the error message,
effectively attempting to print the address bound to `fd`, even though
binding that address to that socket had failed in the
first place.
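
As an illustration only, formatting the requested address directly could
look like the sketch below. `format_bind_addr` is a hypothetical helper
limited to IPv4; the real `sio_bind` code is not shown in this commit
message and may differ.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Format the address we tried to bind as "host:port". It uses the
 * sockaddr passed to bind(), not the socket, so it works even though
 * the socket never got a name. Returns the snprintf() result or -1.
 */
static int
format_bind_addr(const struct sockaddr_in *addr, char *buf, size_t size)
{
	assert(addr != NULL && buf != NULL);
	char host[INET_ADDRSTRLEN];
	if (inet_ntop(AF_INET, &addr->sin_addr, host, sizeof(host)) == NULL)
		return -1;
	return snprintf(buf, size, "%s:%u", host,
			(unsigned)ntohs(addr->sin_port));
}
```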

Fixes #5925

NO_DOC=bugfix
NO_CHANGELOG=minor

(cherry picked from commit a5214bfc)


test: cover split-brain during promote · 252cad12

Nikita Zheleztsov authored 8 months ago

This test checks that when PROMOTE from the previous term is
encountered, we immediately notice the split-brain situation and break
replication without corrupting data.

Closes #9943

NO_DOC=test
NO_CHANGELOG=test

(cherry picked from commit 06b87e27)


box: refactor synchro quorum update on deletion from `_cluster` space · 096e453e

Georgiy Lebedev authored 9 months ago

For symmetry with the update of the synchronous replication quorum on
insertion into the `_cluster` space, let's reuse the
`on_replace_cluster_update_quorum` on_commit trigger.

Follows-up #10087

NO_CHANGELOG=<refactoring>
NO_DOC=<refactoring>
NO_TEST=<refactoring>

(cherry picked from commit 9b63ced3)


box: update synchro quorum in on_commit trigger instead of on_replace · 737dc12a

Georgiy Lebedev authored 9 months ago

Currently, we update the synchronous replication quorum from the
`on_replace` trigger of the `_cluster` space when registering a new
replica. However, during the join process, the replica cannot ack its own
insertion into the `_cluster` space. In the scope of #9723, we are going to
enable synchronous replication for most of the system spaces, including the
`_cluster` space. There are several problems with this:

1. Joining a replica to a 1-member cluster without manually changing the
quorum won't work: it is impossible to commit the insertion into the
`_cluster` space with only 1 node, since the quorum will equal 2 right
after the insertion.

2. Joining a replica to a 3-member cluster may fail: the quorum will become
equal to 3 right after the insertion, the newly joined replica cannot ACK
its own insertion into the `_cluster` space — if one out of original 3
nodes fails, then reconfiguration will fail.

Generally speaking, it will be impossible to join a new replica to the
cluster, if a quorum, which includes the newly added replica (which cannot
ACK), cannot be gathered.

To solve these problems, let's update the quorum in the `on_commit`
trigger. This way we’ll be able to insert a node regardless of the current
configuration. This somewhat contradicts the Raft specification, which
requires application of all configuration changes in the `on_replace`
trigger (i.e., as soon as they are persisted in the WAL, without quorum
confirmation), but still forbids several reconfigurations at the same time.
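
The arithmetic behind the two problem cases can be checked with a small
sketch. It assumes a majority quorum of `N / 2 + 1`, which matches the
numbers quoted above; `majority_quorum` and `join_can_commit` are
hypothetical helpers, not Tarantool functions.

```c
#include <assert.h>

/* Majority quorum for n registered members (assumed formula). */
static int
majority_quorum(int n)
{
	assert(n > 0);
	return n / 2 + 1;
}

/*
 * Can the insertion of a new replica into `_cluster` be confirmed?
 * The joining replica cannot ACK its own registration, so only the
 * `alive_old` existing members count. `quorum_updated_early` models
 * updating the quorum in `on_replace` (1) versus `on_commit` (0).
 */
static int
join_can_commit(int old_members, int alive_old, int quorum_updated_early)
{
	int n = quorum_updated_early ? old_members + 1 : old_members;
	return alive_old >= majority_quorum(n);
}
```

With the early update, `join_can_commit(1, 1, 1)` is false because the
quorum is already 2, and `join_can_commit(3, 2, 1)` is false because the
quorum is 3 while only 2 old nodes can ACK. With the update deferred to
`on_commit`, both joins succeed.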

Closes #10087

NO_DOC=<no special documentation page devoted to cluster reconfiguration>

(cherry picked from commit 29d1c0fa)


Jul 15, 2024

vinyl: use broadcast instead of signal to notify about dump completion · 04347ee7

Vladimir Davydov authored 8 months ago

There may be more than one fiber waiting on `vy_scheduler::dump_cond`:

```
box.snapshot
  vinyl_engine_wait_checkpoint
    vy_scheduler_wait_checkpoint

space.create_index
  vinyl_space_build_index
    vy_scheduler_dump
```

To avoid hang, we should use `fiber_cond_broadcast`.
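
A toy model of the difference, assuming nothing about the real fiber
condition variable beyond "signal wakes one waiter, broadcast wakes all".
The `cond_model` names are hypothetical.

```c
#include <assert.h>

/* Toy condition variable: only the number of waiters matters. */
struct cond_model {
	int waiters;
};

/* Wake at most one waiter, as a signal does; returns the number woken. */
static int
cond_model_signal(struct cond_model *c)
{
	assert(c->waiters >= 0);
	if (c->waiters == 0)
		return 0;
	c->waiters--;
	return 1;
}

/* Wake every waiter, as a broadcast does; returns the number woken. */
static int
cond_model_broadcast(struct cond_model *c)
{
	assert(c->waiters >= 0);
	int woken = c->waiters;
	c->waiters = 0;
	return woken;
}
```

With two waiters (the checkpoint and the index build), a single signal
leaves one of them asleep until some later dump completes, while a
broadcast leaves none.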

Closes #10233

NO_DOC=bug fix

(cherry picked from commit 30547157)


small: bump new version with UBSan fixes · cf278f56

Lev Kats authored 8 months ago

This patch bumped small to the new version that does not trigger
UBSan with *_entry* macros and should support new oss-fuzz builder.

New commits:

* rlist: make its methods accept const arguments
* lsregion: introduce lsregion_to_iovec method
* rlist: make foreach_entry_* macros not to use UB

Fixes: #10143

NO_DOC=small submodule bump
NO_TEST=small submodule bump
NO_CHANGELOG=small submodule bump

(cherry picked from commit 3e183044)


trivia: use __builtin* for offsetof macro · 25146985

Lev Kats authored 8 months ago

Changed the default tarantool `offsetof` macro implementation so it
doesn't access members of a null pointer in typeof, which triggers UBSan.
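
For illustration, a sketch of the builtin-based form. `my_offsetof` and
`struct pair` are hypothetical names; `__builtin_offsetof` is the GCC and
Clang builtin. The exact macro used in the Tarantool tree is not quoted
here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * The classic form, ((size_t)&((type *)0)->member), formally accesses
 * a member through a null pointer, which UBSan reports. The builtin
 * computes the same value at compile time without any such access.
 */
#define my_offsetof(type, member) __builtin_offsetof(type, member)

struct pair {
	char tag;
	int value;
};

/* Offset of `value` inside `struct pair`, padding included. */
static size_t
pair_value_offset(void)
{
	size_t off = my_offsetof(struct pair, value);
	assert(off >= sizeof(char));
	return off;
}
```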

Needed for #10143

NO_DOC=bugfix
NO_CHANGELOG=minor
NO_TEST=tested manually with fuzzer

(cherry picked from commit 27e94824)


Jul 08, 2024

fiber: prohibit fiber self join · 2131743e

Nikolay Shirokovskiy authored 8 months ago

In this case join will just hang. Instead let's raise an error in case
of Lua API and panic in case of C API.

Closes #10196

NO_DOC=minor

(cherry picked from commit 1e1bf36d)


fiber: make the concurrent fiber_join safer · cf3def52

Magomed Kostoev authored 1 year ago

Prior to this patch a bunch of illegal conditions were possible:
1. The joinability of a fiber could be changed while the fiber is
   being joined by someone. This could lead to double recycling:
   the first one happened on the fiber finish, and the second one
   in the fiber join.
2. The joinability of a dead joinable fiber could be altered; this
   led to an inability to join the dead fiber and free its resources.
3. A running fiber could be joined concurrently by two or more
   fibers, so the fiber could be recycled more than once (once
   per each concurrent join).
4. A dead recycled fiber could be made joinable and joined leading
   to the double recycle.

Fixed these issues by adding a new FIBER_JOIN_BEEN_INVOKED flag: now
the `fiber_set_joinable` and `fiber_join_timeout` functions detect
the double join. Because of the API limitations both of them panic
when an invalid condition is met:
- The `fiber_set_joinable` was not designed to report errors.
- The `fiber_join_timeout` can't raise any error unless a timeout
  is met, because the `fiber_join` users don't expect to receive
  any error from this function at all (except the one generated
  by the joined fiber).

It's still possible that a fiber join is performed on a struct which
has been recycled and, if the new fiber is joinable too, this can't
be detected. The current fiber API does not allow fixing this, so
this is the user's responsibility: users should be warned that
a double join on the same fiber is illegal.
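
The flag-based detection can be sketched roughly as below. The flag
values, the reduced `struct fiber`, and the `*_allowed` helpers are
assumptions made for illustration; the helpers return false where the
real functions would panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; only the flag names follow the commit. */
enum {
	FIBER_IS_JOINABLE = 1 << 0,
	FIBER_IS_DEAD = 1 << 1,
	FIBER_JOIN_BEEN_INVOKED = 1 << 2,
};

struct fiber {
	unsigned flags;
};

/* A join is legal once per joinable fiber; mark it as invoked. */
static bool
fiber_join_allowed(struct fiber *f)
{
	assert(f != NULL);
	if ((f->flags & FIBER_IS_JOINABLE) == 0)
		return false;
	if ((f->flags & FIBER_JOIN_BEEN_INVOKED) != 0)
		return false;
	f->flags |= FIBER_JOIN_BEEN_INVOKED;
	return true;
}

/* Joinability may not change for a dead or already joined fiber. */
static bool
fiber_set_joinable_allowed(const struct fiber *f)
{
	assert(f != NULL);
	return (f->flags & (FIBER_IS_DEAD | FIBER_JOIN_BEEN_INVOKED)) == 0;
}
```

As noted above, a flag stored in the fiber struct cannot help once the
struct has been recycled for a new fiber.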

Closes #7562

@TarantoolBot document
Title: `fiber_join`, `fiber_join_timeout` and `fiber_set_joinable`
behave differently now.

`fiber_join` and `fiber_join_timeout` now panic if a double
join of the given fiber is detected.

`fiber_set_joinable` now panics if the given fiber is dead or is
joined already. This prevents some amount of error conditions that
could happen when using the API in an unexpected way, including:
- Making a dead joinable fiber non-joinable could lead to a memory
  leak: one can't join the fiber anymore.
- Making a dead joinable fiber joinable again is a sign of attempt
  to join the fiber later. That means the fiber struct may be joined
  later, when it's been recycled and reused. This could lead to a
  very hard to debug double join.
- Making an alive joined fiber non-joinable would lead to the double
  free: once on the fiber function finish, and secondly in the active
  fiber join finish. Risks of making it joinable are described above.
- Making a dead and recycled fiber joinable allowed to join the fiber
  once again leading to a double free.

Any given by the API `struct fiber` should only be joined once. If a
fiber is joined after the first join on it has finished the behavior
is undefined: it can either be a panic or an incidental join to a
totally foreign fiber.

(cherry picked from commit 44401529)


luajit: bump new version · 03d9038c

Sergey Kaplun authored 8 months ago

* Correct fix for stack check when recording BC_VARG.
* test: remove inline suppressions of _TARANTOOL
* FFI: Fix ffi.alignof() for reference types.
* FFI: Fix sizeof expression in C parser for reference types.
* FFI: Allow ffi.metatype() for typedefs with attributes.
* FFI: Fix ffi.metatype() for non-raw types.
* Maintain chain invariant in DCE.
* build: introduce option LUAJIT_ENABLE_TABLE_BUMP
* ci: add tablebump flavor for exotic builds
* test: allow `jit.parse` to return aborted traces
* Handle all types of errors during trace stitching.
* Use generic trace error for OOM during trace stitching.
* Check for IR_HREF vs. IR_HREFK aliasing in non-nil store check.
* cmake: set cmake_minimum_required only once
* cmake: fix warning about minimum required version
* ci: add a workflow for testing with AVX512 enabled
* test: introduce a helper read_file
* OSX/iOS/ARM64: Fix generation of Mach-O object files.
* OSX/iOS/ARM64: Fix bytecode embedding in Mach-O object file.
* build: introduce LUAJIT_USE_UBSAN option
* ci: enable UBSan for sanitizers testing workflow
* cmake: add the build directory to the .gitignore
* Prevent sanitizer warning in snap_restoredata().
* Avoid negation of signed integers in C that may hold INT*_MIN.
* Show name of NYI bytecode in -jv and -jdump.

Closes #9924
Closes #8473

NO_DOC=LuaJIT submodule bump
NO_TEST=LuaJIT submodule bump


Jul 04, 2024

fiber: fix leak on dead joinable fiber search · e97b01f6

Nikolay Shirokovskiy authored 8 months ago

When a fiber is accessed from Lua we create a userdata object and keep
the reference for future accesses. The reference is cleared when the
fiber is stopped. But if the fiber is joinable, it can still be found
with `fiber.find`. In this case we create the userdata object again.
Unfortunately, as the fiber is already stopped, we fail to clear the
reference. The trigger memory that clears the reference is also leaked,
as is the fiber storage if it is accessed after the fiber is stopped.

Let's add `on_destroy` trigger to fiber and clear the references there.

Note that with current set of LSAN suppressions the trigger memory leak
of the issue is not reported.

Closes #10187

NO_DOC=bugfix

(cherry picked from commit 7db4de75)


Jun 26, 2024

box: fix memleak on functional index drop · 432789dc

Nikolay Shirokovskiy authored 8 months ago

We just don't free functional index keys on functional index drop now.
Let's approach key deletion as in the case of primary index drop, i.e.,
let's drop these keys in the background.

We should set `use_hint` to `true` in case of MEMTX_TREE_VTAB_DISABLED
tree index methods because `memtx_tree_disabled_index_vtab` uses
`memtx_tree_index_destroy<true>`. Otherwise we get a read outside of the
index structure for the stub functional index on destroy for the
introduced `is_func` field (which is reported by ASAN).

Closes #10163

NO_DOC=bugfix

(cherry picked from commit 319357d5)


Jun 25, 2024

third_party: update libcurl from 8.7.0 to 8.8.0+patches · d4c07f32

Sergey Bronnikov authored 9 months ago

The patch updates curl module to the version 8.8.0 [1] plus
a number of commits in a range curl-8_8_0..30de937bda0f because
it includes a fix for a regression [2] caught on the previous bump.
The new version brings a number of functional fixes.

Previous changelog entry has been removed because duplicate
entries about bumps in the release changelog confuse end users.

Closes #9612

1. https://curl.se/changes.html#8_8_0
2. https://github.com/curl/curl/issues/13740

NO_DOC=libcurl submodule bump
NO_TEST=libcurl submodule bump

(cherry picked from commit 7192bf66)


third_party: update libcurl from 8.6.0 to 8.7.1 · b49b4b8c

Sergey Bronnikov authored 11 months ago

The patch updates curl module to the version 8.7.1 [1][2] that
brings a number of functional and security fixes, and updates
CMake module for building curl library.

Security fixes:

- CVE-2024-2004: Usage of disabled protocol. (low)
- CVE-2024-2398: HTTP/2 push headers memory-leak. (medium)
- CVE-2024-2379: QUIC certificate check bypass with wolfSSL. (low)
- CVE-2024-2466: TLS certificate check bypass with mbedTLS. (medium)

Changes in CMake module:

- Option `USE_OPENSSL_QUIC` was added and disabled by default [3]

Previous changelog entry has been removed because duplicate
entries about bumps in the release changelog confuse end users.

The bump was blocked by a regression in libcurl [4][5].

1. https://curl.se/changes.html#8_7_1
2. https://github.com/curl/curl/compare/curl-8_6_0...curl-8_7_1
3. https://github.com/curl/curl/commit/8e741644a229c3791963b4f5cae1dcfccba842dd
4. https://curl.se/mail/lib-2024-03/0059.html
5. https://github.com/curl/curl/issues/13260

NO_DOC=libcurl submodule bump
NO_TEST=libcurl submodule bump

(cherry picked from commit 63cb2bf6)


third_party: update libcurl from 8.5.0 to 8.6.0 · 587abe70

Sergey Bronnikov authored 1 year ago

The patch updates curl module to the version 8.6.0 [1][2] that
brings a number of functional fixes, and updates CMake module for
building curl library.

Changes in CMake module:

- Option `ENABLE_CURL_MANUAL` was added and disabled by default [3]
- Option `BUILD_LIBCURL_DOCS` was added and disabled by default [3]

Previous changelog entry has been removed because duplicate
entries about bumps in the release changelog confuse end users.

This bump was blocked by a regression in libcurl [4].

1. https://curl.se/changes.html#8_6_0
2. https://github.com/curl/curl/compare/curl-8_5_0...curl-8_6_0
3. https://github.com/curl/curl/commit/a808aab06851d4364ab1773c664df3d906a497a9
4. https://github.com/curl/curl/commit/b8c003832d730bb2f4b9de4204675ca5d9f7a903

NO_DOC=libcurl submodule bump
NO_TEST=libcurl submodule bump

(cherry picked from commit 00cfc959)


Jun 22, 2024

sio: use kern.ipc.somaxconn for listen() on Mac · 23e58efb

Vladislav Shpilevoy authored 9 months ago

listen() on Mac used to take SOMAXCONN as the backlog size. It is
just 128, which is too small when connections are incoming too
fast. They get rejected.

Increase of the queue size wasn't possible, because the limit was
hardcoded. But now sio takes the runtime limit from
kern.ipc.somaxconn sysctl setting.

One weird thing is that when set too high, it seems to have no
effect, as if nothing was changed. Specifically, values above
32767 are not doing anything, even though they stay visible in
kern.ipc.somaxconn.

It seems listen() on Mac internally might be using 'short' or
int16_t to store the queue size and it gets broken when anything
above INT16_MAX is used. The code truncates the queue size to this
value if the given one is too high.
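
The truncation step can be sketched as a small pure function.
`listen_backlog_clamp` is a hypothetical name; on macOS the input would
come from the `kern.ipc.somaxconn` sysctl (for example via
`sysctlbyname()`), which is left out here to keep the sketch portable.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Limit the backlog passed to listen() to INT16_MAX, since larger
 * values appeared to have no effect on Mac.
 */
static int
listen_backlog_clamp(int somaxconn)
{
	assert(somaxconn >= 0);
	return somaxconn > INT16_MAX ? INT16_MAX : somaxconn;
}
```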

Closes #8130

NO_DOC=bugfix
NO_TEST=requires root privileges for testing

(cherry picked from commit 7e9a872f)


Jun 20, 2024

ci: add workflow to check downgrade versions · 4ab1dcfd

Nikolay Shirokovskiy authored 9 months ago

Tarantool has a hardcoded list of versions it can downgrade to. This list
should consist of all the released versions less than the Tarantool
version. This workflow helps to make sure we update the list before
release.

It is run on pushing a release tag to the repo, checks the list and fails
if it misses some released version less than the current one. In this
case we are supposed to update the downgrade list (with the required
downgrade code) and update the release tag.

Closes #8319

NO_TEST=ci
NO_CHANGELOG=ci
NO_DOC=ci

(cherry picked from commit 6d856347)


Jun 14, 2024

lsan: add another FP leak suppression · 4e8478dd

Nikolay Shirokovskiy authored 9 months ago

See #8890

NO_TEST=internal
NO_CHANGELOG=internal
NO_DOC=internal

(cherry picked from commit c5b3e594)


Jun 13, 2024

ci: followup fix RPM package builds on aarch64 runners · 74223a2d

Serge Petrenko authored 9 months ago

Commit 715abaaf ("ci: fix RPM package builds on aarch64 runners")
limited the number of parallel jobs to 6 on these runners to fix the
OOM, but it turns out this isn't enough: the almalinux_9_aarch64 workflow
fails constantly even with this setting. Let's try to reduce the number
of jobs to 4.

NO_CHANGELOG=ci
NO_TEST=ci
NO_DOC=ci


relay: do not report vclock[0] anywhere · 4f2e67f5

Vladislav Shpilevoy authored 9 months ago

Remote replica's vclock is given to master to send data starting
from that position. The master does that, but, in order to find
the relevant position in local WAL to start from, the master must
ignore the local rows. Consider them all already "sent". For that
the master replaces the remote vclock[0] with the local vclock[0].
That makes xlog cursor skip all the local rows.

The problem is that this vclock was taken by relay as is, as if
it was truly reported by the replica. It was even saved as the
"last received ACK". Which clearly isn't the case.

When a real ACK was received, it didn't contain anything in
vclock[0], and yet relay "saw" that the previous ACK has
vclock[0] > 0. That looked like the replica went backwards without
even closing connection, which isn't possible. That made the relay
crash from cringe (on assert).

The fix is not to save the local vclock[0] in the last received
ACK.

For GC and xlog cursor the hack is still needed. An option how to
make it easier was to set vclock[0] to INT64_MAX to just never
even bother with any local rows, but that didn't work. Some
assumptions in other places seem to depend on having a proper
local LSN in these places.

Closes #10047

NO_CHANGELOG=the bug wasn't released
NO_DOC=bugfix

(cherry picked from commit 1f75231a)


relay: rename vclock args and make const · 49b374f9

Vladislav Shpilevoy authored 9 months ago

It wasn't clear which of them are inputs and which are outputs.
The patch explicitly marks the input vclocks as const. It makes
the code a bit easier to read inside of relay.cc knowing that
these vclocks shouldn't change.

Alongside "replica_clock" in subscribe is renamed to
"start_vclock". To make it consistent with relay_final_join(), and
to signify that technically it doesn't have to be a replica
vclock. It isn't really. Box.cc alters the replica's vclock before
giving it to relay, which means it is no longer "replica clock".

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit 5ebbed77)


relay: move gc subscriber creation out of it · 605752e5

Vladislav Shpilevoy authored 9 months ago

GC consumer creation and destroy seemed to only happen in box.cc
with one exception in relay_subscribe(). Let's move it out for
consistency. Now relay can only notify GC consumers, but can't
manage them.

That also makes it harder to misuse the GC by passing some wrong
vclock to it, similar to what was happening in #10047.

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit 4dc0c1ea)


box: introduce box_localize_vclock · 149fc1f7

Vladislav Shpilevoy authored 9 months ago

The function takes the burden of explaining why this hack about
setting local component in a remote vclock is needed. It also
creates a new vclock instead of altering an existing one. This is to
signify that the vclock is no longer what was received from a
remote host.

Otherwise it is too easy to actually mistreat this mutant vclock as
a remote vclock. That btw did happen and is fixed in following
commits.
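
A rough sketch of the "new vclock, not an altered one" idea, under
simplifying assumptions: `struct vclock` is reduced to a plain array and
`localize_vclock` is a hypothetical stand-in, since the real signature
of `box_localize_vclock` is not shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified vector clock: component 0 holds the local rows' LSN. */
enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Build a separate vclock for the xlog cursor and GC: a copy of the
 * remote vclock with component 0 taken from the local instance. The
 * remote vclock stays const, so it cannot later be mistaken for
 * something the replica reported.
 */
static struct vclock
localize_vclock(const struct vclock *remote, const struct vclock *local)
{
	assert(remote != NULL && local != NULL);
	struct vclock result = *remote;
	result.lsn[0] = local->lsn[0];
	return result;
}
```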

In scope of #10047

NO_TEST=refactoring
NO_CHANGELOG=refactoring
NO_DOC=refactoring

(cherry picked from commit b8463960)


ci: add a workflow to check for entrypoint tags · 426bff55

Nikolay Shirokovskiy authored 1 year ago

See the comment in check-entrypoint.sh for an explanation of what an
entrypoint tag is. The workflow fails if the current branch does not have
the most recent entrypoint tag that it should have.

Part of #8319

NO_TEST=ci
NO_CHANGELOG=ci
NO_DOC=ci

(cherry picked from commit c06d0d14)


vinyl: fix gc vs vylog race leading to duplicate record · 085279aa

Vladimir Davydov authored 9 months ago

Vinyl run files aren't always deleted immediately after compaction,
because we need to keep run files corresponding to checkpoints for
backups. Such run files are deleted by the garbage collection procedure,
which performs the following steps:

 1. Loads information about all run files from the last vylog file.
 2. For each loaded run record that is marked as dropped:
    a. Tries to remove the run files.
    b. On success, writes a "forget" record for the dropped run,
       which will make vylog purge the run record on the next
       vylog rotation (checkpoint).

(see `vinyl_engine_collect_garbage()`)

The garbage collection procedure writes the "forget" records
asynchronously using `vy_log_tx_try_commit()`, see `vy_gc_run()`.
This procedure can be successfully executed during vylog rotation,
because it doesn't take the vylog latch. It simply appends records
to a memory buffer which is flushed either on the next synchronous
vylog write or vylog recovery.

The problem is that the garbage collection doesn't necessarily load
the latest vylog file because the vylog file may be rotated between
its calls to `vy_log_signature()` and `vy_recovery_new()`. This may
result in a "forget" record written twice to the same vylog file
for the same run file, as follows:

  1. GC loads last vylog N
  2. GC starts removing dropped run files.
  3. CHECKPOINT starts vylog rotation.
  4. CHECKPOINT loads vylog N.
  5. GC writes a "forget" record for run A to the buffer.
  6. GC is completed.
  7. GC is restarted.
  8. GC finds that the last vylog is N and blocks on the vylog latch
     trying to load it.
  9. CHECKPOINT saves vylog M (M > N).
 10. GC loads vylog N. This triggers flushing the forget record for
     run A to vylog M (not to vylog N), because vylog M is the last
     vylog at this point of time.
 11. GC starts removing dropped run files.
 12. GC writes a "forget" record for run A to the buffer again,
     because in vylog N it's still marked as dropped and not forgotten.
     (The previous "forget" record was written to vylog M).
 13. Now we have two "forget" records for run A in vylog M.

Such duplicate run records aren't tolerated by the vylog recovery
procedure, resulting in a permanent error on the next checkpoint:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run XXXX forgotten but not registered
```

To fix this issue, we move `vy_log_signature()` under the vylog latch
to `vy_recovery_new()`. This makes sure that GC will see vylog records
that it's written during the previous execution.

Catching this race in a functional test would require a bunch of ugly
error injections so let's assume that it'll be tested by fuzzing.

Closes #10128

NO_DOC=bug fix
NO_TEST=tested manually with fuzzer

(cherry picked from commit 9d3859b2)


box: prevent demoted leader from being a candidate in the next elections · 22a9cfd8

Georgiy Lebedev authored 9 months ago


Currently, the demoted leader sees that nobody has requested a vote in the
newly persisted term (because it has just written it without voting, and
nobody had time to see the new term yet), and hence votes for itself,
becoming the most probable winner of the next elections.

To prevent this from happening, let's forbid the demoted leader to be a
candidate in the next elections using `box_raft_leader_step_off`.

Closes #9855

NO_DOC=<bugfix>

Co-authored-by: Serge Petrenko <sergepetrenko@tarantool.org>
(cherry picked from commit 05d03a1c)


box: refactor `box_demote` to make it more comprehensible · 49747a4b

Georgiy Lebedev authored 9 months ago


Suggested by Nikita Zheleztsov in the scope of #9855.

Needed for #9855

NO_CHANGELOG=<refactoring>
NO_DOC=<refactoring>
NO_TEST=<refactoring>

Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
(cherry picked from commit ff010fe9)


election: fix box.ctl.demote() nop in off-mode · 42631d5b

Vladislav Shpilevoy authored 1 year ago

box.ctl.demote() used not to do anything with election_mode='off'
if the synchro queue didn't belong to the caller in the same term
as the election state.

The reason could be that if the synchro queue term is "outdated",
there is no guarantee that some other instance doesn't own it in
the latest term right now.

The "problem" is that this could be easily worked around by just
calling promote + demote together.

There isn't much sense in fixing it for the off-mode because the
only reasons off-mode exists are 1) for people who don't use
synchro at all, 2) who did use it and want to stop. Hence they
need demote just to disown the queue.

The patch "legalizes" the mentioned workaround by allowing to
perform demote in off-mode even if the synchro queue term is old.

Closes #6860

NO_DOC=bugfix

(cherry picked from commit 1afe2274)
