Commits · 8cbe42eb332aa9bf50d80174dc6bd85242d662ff · core / tarantool

Sep 17, 2024

memtx: fix use-after-free in mvcc on ddl · 8cbe42eb

Andrey Saranchin authored 6 months ago

When a space is being altered, `memtx_tx_on_space_delete` is called - it
deletes all the stories associated with the old schema. However, before
deleting a story, its `reader_list` member is not unlinked from the list,
so other nodes can still access this memory. The commit fixes this
problem and adds an assertion that checks that a story is always unlinked
from the reader list when it is being deleted.
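
As an illustration only (not the actual C code of the patch), the idea of
unlinking a list member before its owner is deleted can be sketched in
Python. The `Link` class and `delete_story` helper are hypothetical names
that model an intrusive circular doubly linked list.

```python
class Link:
    """Node of a circular doubly linked list (toy stand-in for an intrusive list link)."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self

    def is_linked(self):
        # A detached link points at itself.
        return self.next is not self

    def add_after(self, head):
        # Insert this link right after `head`.
        self.prev, self.next = head, head.next
        head.next.prev = self
        head.next = self

    def unlink(self):
        # Detach from neighbours so that nobody keeps a reference to us.
        self.prev.next = self.next
        self.next.prev = self.prev
        self.prev = self.next = self


def delete_story(link):
    """Hypothetical deletion path: unlink first, then check the invariant."""
    link.unlink()
    assert not link.is_linked(), "story must leave the reader list before deletion"
```

If the `unlink()` call were skipped, the neighbours would still point at
the deleted node, which in C means access to freed memory.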

Part of #10146

NO_CHANGELOG=later
NO_DOC=bugfix

(cherry picked from commit a32f56dfbb4b56b410ac376fce079613cac0ccb6)


memtx: do not use memtx_build_on_replace trigger with mvcc enabled · 74584869

Andrey Saranchin authored 8 months ago

Now background build of an index uses an index iterator that collects
conflicts during iteration if MVCC is enabled. Thus, the trigger
`memtx_build_on_replace` is not needed - if someone writes to the
prefix we have already scanned, it will lead to a transaction conflict.
Moreover, `memtx_ddl_state`, which is needed for rollback, is allocated
on the stack of a function called from the DDL transaction, so if a
conflicted transaction rolls back after the DDL is over (which is possible
only with MVCC enabled), a segmentation fault will happen. So let's simply
not set the trigger if MVCC is enabled.

Closes #10147

NO_CHANGELOG=later
NO_DOC=bugfix

(cherry picked from commit 9fe60c5754cf77686404fc7ee3d24af32b6c486c)


memtx: unlink all delete statements of mvcc stories on space delete · 1d72b80f

Andrey Saranchin authored 9 months ago

Since one tuple can be deleted by many concurrent transactions, member
`del_stmt` of `struct memtx_story` is actually a list. It seems we
forgot about it when implementing `memtx_tx_on_space_delete`, so the
function unlinks only one of the delete statements. The commit fixes this
mistake.
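
A minimal sketch of the difference, assuming a singly linked chain of
delete statements hanging off a story. The field names `del_story` and
`next_in_del_list` and all helpers here are illustrative assumptions, not
a copy of the Tarantool structures.

```python
class Stmt:
    """Toy transaction statement that may delete a story."""

    def __init__(self, name):
        self.name = name
        self.del_story = None          # story this statement deletes
        self.next_in_del_list = None   # next statement deleting the same story


class Story:
    """Toy story: `del_stmt` is the head of a list, not a single statement."""

    def __init__(self):
        self.del_stmt = None


def link_deleter(story, stmt):
    # Push the statement onto the story's list of deleters.
    stmt.del_story = story
    stmt.next_in_del_list = story.del_stmt
    story.del_stmt = stmt


def unlink_all_deleters(story):
    # Walk the whole list; unlinking only the head would leave the
    # remaining statements pointing at a story that is about to go away.
    while story.del_stmt is not None:
        stmt = story.del_stmt
        story.del_stmt = stmt.next_in_del_list
        stmt.next_in_del_list = None
        stmt.del_story = None
```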

Part of #10146

NO_CHANGELOG=later
NO_DOC=bugfix

(cherry picked from commit 5a31551467308f26b8471a9de233b94e380f23cf)


Sep 16, 2024

box: fix crash on rollback on memtx memory OOM and massive index change · e9fc51d0

Nikolay Shirokovskiy authored 6 months ago

We cannot tolerate index extent memory allocation failure on rollback.
At the same time it is not practical to reserve memory because a whole
index can easily be changed on rollback if read view is created before
rollback.

So in case of rollback and memtx memory OOM let's allocate outside the
memtx arena limited by quota.

Now part of the index can reside outside the memtx arena. But regular
index changes will move this part back to the memtx arena, until the next
such situation, of course.

Closes #10551

NO_DOC=bugfix

(cherry picked from commit 32ea713af0a4f27f9ae37bb767c21722ee8c6742)


memtx: free extents on exit · 1fb5a7cc

Nikolay Shirokovskiy authored 6 months ago

Part-of #10211

NO_TEST=internal
NO_CHANGELOG=internal
NO_DOC=internal

(cherry picked from commit 134a2a4f7f0a3bad15bc42e2dc051708c3583fed)


core: add (void *) set definition · e320972a

Nikolay Shirokovskiy authored 6 months ago

Part-of #10551

NO_TEST=declarative code
NO_CHANGELOG=internal
NO_DOC=internal

(cherry picked from commit 398c7031c915380bd6e93b7aeab9145cf0ebe511)


Sep 13, 2024

small: bump version · e60f5fbd

Nikolay Shirokovskiy authored 6 months ago

New commits:
* slab cache: fix slab alignment to 16 bytes

NO_TEST=submodule bump
NO_CHANGELOG=submodule bump
NO_DOC=submodule bump

(cherry picked from commit 2300704e8317f2d8a545cde1394f8cbbb7e95741)


sptree: don't use variable length arrays · 977ef353

Vladimir Davydov authored 6 months ago

This causes warnings if compiled with clang-18. Let's define a sane
upper limit for the max tree depth and use it for allocating arrays
on stack. Note that we don't really care about performance because
sptree is used only in unit tests.

Closes #10354

NO_DOC=internal
NO_TEST=internal
NO_CHANGELOG=internal

(cherry picked from commit 187d288f0c3b008ed2d281e8bb43159e44c4106e)


test: disable flaky testcases in http_client_test · 7d120035

Sergey Bronnikov authored 6 months ago

The testcase "http_client.sock_family:\"AF_UNIX\".test_follow_location"
is flaky in each run of `release_clang_asan` and
`debug_asan_clang` workflows. Disabling a single testcase does not
help. The patch disables a group of testcases executed with Unix
domain socket.

Needed for #9854

NO_CHANGELOG=testing
NO_DOC=testing

(cherry picked from commit 8fae8004f79ecd555537960c60c6e646b037c4cc)


test: fix luacheck warnings · e393ee29

Sergey Bronnikov authored 6 months ago

The patch fixes a warning produced by luacheck:

NO_WRAP
test/app-luatest/http_client_test.lua:27:8: Error prone negation: negation is executed before relational operator.
test/app-luatest/http_client_test.lua:28:8: Error prone negation: negation is executed before relational operator.
NO_WRAP

Found by Luacheck 1.2.0.

Closes #10037

NO_CHANGELOG=codehealth
NO_DOC=codehealth
NO_TEST=codehealth

(cherry picked from commit 8fd37731b68e1e1d8e258ab919d65907d52ec764)


Sep 09, 2024

test: fix flaky #10148 test · adbb726a

Vladimir Davydov authored 6 months ago

The test may exceed the default fiber slice (1 second):

```
[060] server | 2024-09-09 09:16:16.329 [33093] main/111/main fiber.h:1132 W> fiber has not yielded for more than 0.500 seconds
[060] server | 2024-09-09 09:16:16.825 [33093] main/111/main/test-run.lib.luatest.luatest.log I> Assert "FiberSliceIsExceeded" equals to "OutOfMemory"
[060] not ok 1	box-luatest.gh_10148_fix_crash_low_slab_alloc_factor.test_low_slab_alloc_factor
[060] #   ...uatest/gh_10148_fix_crash_low_slab_alloc_factor_test.lua:36: expected: "OutOfMemory"
[060] #   actual: "FiberSliceIsExceeded"
[060] #   stack traceback:
[060] #   	...uatest/gh_10148_fix_crash_low_slab_alloc_factor_test.lua:30: in function 'box-luatest.gh_10148_fix_crash_low_slab_alloc_factor.test_low_slab_alloc_factor'
[060] #   	...
[060] #   	[C]: in function 'xpcall'
[060] #   artifacts:
[060] #   	server -> /tmp/t/060_box-luatest/artifacts/server-RulP4Fj6qEoI
[060] luatest | 2024-09-09 09:16:16.839 [32904] main/104/luatest/test-run.lib.luatest.luatest.log I> End test "box-luatest.gh_10148_fix_crash_low_slab_alloc_factor.test_low_slab_alloc_factor"
[060] server | 2024-09-09 09:16:16.849 [33093] main/116/iproto.shutdown I> tx_binary: stopped
[060] # Ran 1 tests in 2.388 seconds, 0 succeeded, 1 failed
```

Let's set the fiber slice to a sufficiently big value.

Fixes commit e4ce9e111483 ("test: add test for #10148").

NO_DOC=test fix
NO_CHANGELOG=test fix

(cherry picked from commit 565cda7f2f0d74b2b726b475d2b7ed0c3344920e)


vinyl: fix ERRINJ_VY_DELAY_PK_LOOKUP · 7c1d6841

Vladimir Davydov authored 6 months ago

Enabling `ERRINJ_VY_DELAY_PK_LOOKUP` makes Vinyl yield in a place where
it wouldn't normally do. If the transaction is aborted in the meantime,
we'll get the assertion failure:

```
./src/box/vy_point_lookup.c:219: vy_point_lookup: Assertion 'tx == NULL || tx->state == VINYL_TX_READY' failed.
```

To prevent this from happening, let's replace this invalid error
injection with the new one `ERRINJ_VY_POINT_LOOKUP_DELAY` that injects
a delay to `vy_point_lookup()` before reading disk. This doesn't have
exactly the same effect as the old error injection because it also
delays direct lookups in the primary index. Fortunately, the old error
injection is used in only one test, where the new one works as expected
if we make the secondary index created in the test non-unique and enable
deferred writes (this makes the `s:replace{2, 2}` statement bypass
a lookup in the primary index).

Also, let's replace `VY_POINT_ITER_WAIT` with the new error injection
because they have a very similar meaning and `VY_POINT_LOOKUP_DELAY`
works in the test using it with a very small adjustment (we need to
clear it explicitly after `box.snapshot()`).

Closes #10517

NO_DOC=errinj fix
NO_CHANGELOG=errinj fix

(cherry picked from commit 926196359eaa46bbc670d196103730e196c31437)


vinyl: use VERBOSE level for logging ranges · a9765933

Vladimir Davydov authored 6 months ago

Whenever a range is compacted, split, or coalesced, we log the range
boundaries. This gets really annoying if there's an index that has
a lot of key parts or contains binary strings. Let's lower the level
used for logging these events down to VERBOSE so that they are not
shown by default but can be enabled if needed.

Closes #10524

NO_DOC=bug fix

(cherry picked from commit 06fa83947b0b63c39732efba4c9d67578f113612)


test: add test for #10148 · 6bebc1b5

Nikolay Shirokovskiy authored 6 months ago

The fix itself is in the small submodule which is bumped in the previous
commit.

Closes #10148

NO_DOC=bugfix

(cherry picked from commit e4ce9e111483a24d66e078f4f05679d309fcb94d)


small: bump version · e1bb094f

Nikolay Shirokovskiy authored 6 months ago

New commits:

* small: small: fix crash with low alloc_factor and high memory pressure
* test: get rid of debug message
* test: assign label to tests
* test: introduce a CMake function create_test

Part of #10148

NO_TEST=submodule bump
NO_CHANGELOG=submodule bump
NO_DOC=submodule bump

(cherry picked from commit f3dd6960852f1885ca14587a9c72769fad6b9f55)


small: bump version · 387dcbaa

Serge Petrenko authored 8 months ago

New commits:
* test: fix memory leaks reported by LSAN
* region: fix memleak in ASAN version
* matras: introduce `matras_needs_touch` and `matras_touch_no_check`
* lsregion: implement lsregion_reserve for asan build

Prerequisite #10161

NO_CHANGELOG=submodule bump
NO_TEST=submodule bump
NO_DOC=submodule bump

(cherry picked from commit c191a1bbe96a67405cbdbb3e421dbf7ea543bf47)


Sep 06, 2024

vinyl: handle error loading statement from disk during key lookup · 69450ca7

Vladimir Davydov authored 6 months ago

`vy_page_stmt()` may fail (return NULL) if:
 - the statement is corrupted;
 - memory allocation for the statement fails;
 - the statement size exceeds `box.cfg.vinyl_max_tuple_size`.

If this happens `vy_page_find_key()` won't return an error. Instead,
it'll either point the caller to a wrong statement or claim that there's
no statement matching the key in this page. This may result in invalid
index selection results and, later on, a crash caused by inconsistencies
in the tuple cache. The issue was introduced by commit ac8ce023
("vinyl: factor out function to lookup key in page").

All of the three cases are actually very unlikely to happen in
production:
 - If a statement stored in a run file is corrupted, we'll probably fail
   to load the whole page due to failed checksums and never even get to
   `vy_page_stmt()`.
 - Statements are allocated with `malloc()`, which doesn't normally
   fail (instead the whole process would be terminated by OOM).
 - Users don't tend to lower the tuple size limit after restart.

Still, let's fix the issue by implementing proper error handling for
`vy_page_find_key()`.
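
To illustrate why the failure must be propagated, here is a hedged Python
sketch of a binary search whose element loader may fail. `find_key`,
`load_stmt` and `PageReadError` are hypothetical names; the real
`vy_page_find_key()` is C code with a different interface.

```python
class PageReadError(Exception):
    """Raised when a statement cannot be loaded from the page."""


def find_key(page, key, load_stmt):
    """Return the index of the first statement >= key, or len(page) if none.

    `load_stmt(page, index)` returns the statement or None on failure.
    """
    low, high = 0, len(page)
    while low < high:
        mid = (low + high) // 2
        stmt = load_stmt(page, mid)
        if stmt is None:
            # Treating a failed load as "less than" or "greater than" would
            # silently steer the search to a wrong position, so report it.
            raise PageReadError(f"failed to load statement {mid}")
        if stmt < key:
            low = mid + 1
        else:
            high = mid
    return low
```

With this shape the caller can tell "no matching statement" (a valid
index result) apart from "the lookup itself failed" (an exception).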

Closes #10512

NO_DOC=bug fix

(cherry picked from commit 9dbaa6a9bc0d65984b417f8a76aa8373b6125d16)


Aug 30, 2024

lua: fix iconv memory leak · 08c80081

Nikolay Shirokovskiy authored 8 months ago

`ffi.C.tnt_iconv_open` returns pointer to `struct iconv`. In this case
`__gc` in metatable is not bound to the object.

Closes #10487
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit 105e6188ee6cc8de71ca2ab077f78f51be07559d)


vinyl: fix memory leak on dump/compaction failure · b6cd6bbe

Nikolay Shirokovskiy authored 8 months ago

The issue is that we increment `page_count` only on page write. If we fail
for some reason before that, the page info `min_key` is leaked.

LSAN report for 'vinyl/recovery_quota.test.lua':

```
2024-07-05 13:30:34.605 [478603] main/103/on_shutdown vy_scheduler.c:1668 E> 512/0: failed to compact range (-inf..inf)

=================================================================
==478603==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x5e4ebafcae09 in malloc (/home/shiny/dev/tarantool/build-asan-debug/src/tarantool+0x1244e09) (BuildId: 20c5933d67a3831c4f43f6860379d58d35b81974)
    #1 0x5e4ebb3f9b69 in vy_key_dup /home/shiny/dev/tarantool/src/box/vy_stmt.c:308:14
    #2 0x5e4ebb49b615 in vy_page_info_create /home/shiny/dev/tarantool/src/box/vy_run.c:257:23
    #3 0x5e4ebb48f59f in vy_run_writer_start_page /home/shiny/dev/tarantool/src/box/vy_run.c:2196:6
    #4 0x5e4ebb48c6b6 in vy_run_writer_append_stmt /home/shiny/dev/tarantool/src/box/vy_run.c:2287:6
    #5 0x5e4ebb72877f in vy_task_write_run /home/shiny/dev/tarantool/src/box/vy_scheduler.c:1132:8
    #6 0x5e4ebb73305e in vy_task_compaction_execute /home/shiny/dev/tarantool/src/box/vy_scheduler.c:1485:9
    #7 0x5e4ebb73e152 in vy_task_f /home/shiny/dev/tarantool/src/box/vy_scheduler.c:1795:6
    #8 0x5e4ebb01e0b1 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) /home/shiny/dev/tarantool/src/lib/core/fiber.h:1331:10
    #9 0x5e4ebc389ee0 in fiber_loop /home/shiny/dev/tarantool/src/lib/core/fiber.c:1182:18
    #10 0x5e4ebd3e9595 in coro_init /home/shiny/dev/tarantool/third_party/coro/coro.c:108:3

SUMMARY: AddressSanitizer: 4 byte(s) leaked in 1 allocation(s).
```

Closes #10489
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit 84101f60947dc9322b6bb31d2b3c536101c723c7)


box: fix memory leak on user DDL when access is denied · 97902542

Nikolay Shirokovskiy authored 6 months ago

Besides the mentioned #10485, we also fix a similar memleak (updating user)
that was introduced by the same commit 5b32bb7f ("alter: Refactor
access_check outside constructors").

Closes #10485
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit 84f10be00824348844c9e1997bd813b881836928)


test: add test_ prefix to a function name · 12246244

Maksim Tiushev authored 7 months ago

The test function `g.jit_off_on_macOS_by_default` in `gh_8252` was
silently ignored by the luatest due to its lack of the required
`test_` prefix. This commit renames the function to
`test_jit_off_on_macOS_by_default`, ensuring that it is recognized
and executed by the luatest.

Closes #10210

NO_DOC=codehealth
NO_CHANGELOG=codehealth

(cherry picked from commit eca4f17b3588d38a4d61a71af8371f5ed15de248)


Aug 28, 2024

box: fix memory leak on foreign key constraint failure · 925a9b89

Nikolay Shirokovskiy authored 6 months ago

Closes #10476
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit a4f4569286c2bc6c237656f63a93b5ff5913ad95)


box: fix memory leak on xlog open failure · bebff767

Nikolay Shirokovskiy authored 6 months ago

Closes #10479
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit 3333040462069f064ab0eb01e0ae245e034950a6)


coio: fix memleak in coio_connect_timeout · 12f38792

Nikolay Shirokovskiy authored 6 months ago

coio_connect_timeout() falls back to the next address returned by
coio_getaddrinfo() if it cannot connect to the first one. In this case
it fails to free resources related to the first address in function
cleanup.

Closes #10482
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit 00124cea3df72b51b7100d337368db018f542779)


box: fix memory leak on disconnection from replica · c8eecc05

Nikolay Shirokovskiy authored 6 months ago

Closes #10480
Part-of #10211

NO_TEST=covered by existing tests
NO_DOC=bugfix

(cherry picked from commit b8d1ae4e8f833d255814f83af1352b698834d610)


Aug 26, 2024

vinyl: do not discard run on dump/compaction abort if index was dropped · 5b5a0568

Vladimir Davydov authored 7 months ago

If an index is dropped while a dump or compaction task is in progress
we must not write any information about it to the vylog when the task
completes otherwise there's a risk of getting a vylog recovery failure
in case the garbage collector manages to purge the index from the vylog.

We disabled logging on successful completion of a compaction task quite a
while ago, in commit 29e2931c ("vinyl: fix race between compaction
and gc of dropped LSM"), and for dump only recently, in commit
ae6a02eb ("vinyl: do not log dump if index was dropped"), but the
issue remains for a dump/compaction failure, when we log a discard
record for a run file we failed to write. This results in errors like:

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 6 deleted twice
```

or

```
ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 5768 deleted but not registered
```

Let's fix these issues in exactly the same way as we fixed them for
successful dump/compaction completion - by skipping writing to vylog
in case the index is marked as dropped.

Closes #10452

NO_DOC=bug fix

(cherry picked from commit de59504c2bdb0369cdd27af892301f8515293fe1)


memtx: skip excluded tuples in index count with MVCC enabled · a9b5ae1d

Andrey Saranchin authored 7 months ago

Excluded tuples actually have their own history chains in MVCC - such
chains consist of only one `memtx_story` containing excluded tuple
itself. Such chains should be skipped when counting invisible tuples
because they are not inserted to the index - that's what the commit
does.

Closes #10396

NO_DOC=bugfix

(cherry picked from commit 8947cb04f59423e2944d48b8a1effec2fb11b1db)


Aug 23, 2024

memtx: do not pass pagination key to MVCC · 76bd0d99

Andrey Saranchin authored 7 months ago

Currently, when starting an iterator in memtx tree on a range request,
we pass the key from `start_data` to memtx MVCC. The problem is that
`start_data` can contain a pagination key that is extracted with `cmp_def`,
but MVCC performs all comparisons with `key_def`. Fortunately, the first
parts of `cmp_def` are actually the `key_def` of the index, so let's crop
`start_data` by passing `part_count` not greater than
`key_def->part_count` to MVCC.
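
A tiny sketch of the cropping, assuming a secondary index whose `key_def`
has one part and whose `cmp_def` appends one primary key part. The
function name `crop_key_for_mvcc` is hypothetical.

```python
def crop_key_for_mvcc(start_key, key_def_part_count):
    """Keep only the parts that MVCC can compare with `key_def`."""
    part_count = min(len(start_key), key_def_part_count)
    return start_key[:part_count]


# A pagination key extracted with cmp_def = (b, id) has two parts,
# while key_def = (b) has one, so only the first part is passed on.
pagination_key = ("abc", 42)
mvcc_key = crop_key_for_mvcc(pagination_key, key_def_part_count=1)
```

A key that is already shorter than `key_def` is passed through unchanged.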

Closes #10448

NO_DOC=bugfix

(cherry picked from commit 0dca0076c0fdaee142020cdeddb031bc0e2238cb)


vinyl: enable exact match optimization for unique secondary indexes · 93a8edbc

Vladimir Davydov authored 7 months ago

If the iterator type is EQ/REQ/LE/GE and the search key is exact (that
is, there may be at most one tuple matching the key in the index),
there's no need to scan disk levels if we found a statement for this
key in the memory level. We've had this optimization for ages but it
worked only for full keys in terms of `cmp_def` (key definition extended
with primary key parts). Apparently, a lookup in a secondary index
performed by the user wouldn't match these criteria unless the secondary
index explicitly included all primary key parts.

This commit improves on that. Now, we enable the optimization if the
search key is **exact**. We consider a key **exact** if any of the
following conditions is true:

 - The key statement is a tuple (tuple has all key parts).
 - The key statement is a full key in terms of `cmp_def`.
 - The key statement is a full key in terms of `key_def`, it doesn't
   contain nulls, and the index is unique. The check for nulls is
   necessary because even a unique nullable index may contain more than
   one equal key with nulls.
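
The three conditions above can be sketched as a predicate. This is a
hedged illustration: the function `is_exact_key` and its parameters are
assumptions, keys are modeled as Python lists, and `None` stands for a
null key part.

```python
def is_exact_key(key, is_tuple, key_def_part_count, cmp_def_part_count, is_unique):
    """Return True if at most one tuple in the index can match `key`."""
    if is_tuple:
        # A tuple has all key parts.
        return True
    if len(key) >= cmp_def_part_count:
        # Full key in terms of cmp_def (index parts plus primary key parts).
        return True
    if is_unique and len(key) >= key_def_part_count:
        # Full key in terms of key_def: exact only if it has no nulls,
        # since a unique nullable index may hold several keys with nulls.
        return all(part is not None for part in key[:key_def_part_count])
    return False
```

For example, with a unique secondary index over one field and a
one-part primary key, `[5]` is exact while `[None]` is not.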

Note, this patch slightly refactors the optimization, adding a few
comments and hopefully making it more understandable. In particular,
we remove the one-result-tuple optimization for exact EQ/REQ from
`vy_read_iterator_advance` and put it in `vy_read_iterator_evaluate_src`
instead. This way the whole optimization resides in one place.

Closes #10442

NO_DOC=bug fix

(cherry picked from commit 850673db5a69df2c7250d174ab15305624b2634a)


Aug 22, 2024

test: fix flaky gh-5998-one-tx-for-ddl.test.lua · 3067139b

Vladimir Davydov authored 7 months ago

The test expects that any DDL operation aborts **all** concurrent
transactions, but since commit f5f061d051dc ("vinyl: do not abort
unrelated transactions on DDL") this isn't exactly true: transactions
that haven't read/written anything aren't aborted. In the test we expect
a transaction that hasn't done anything to be aborted by DDL and it
**is** aborted most of the time but for a different reason: it reads
data that are later modified, because `box.schema.user.create()` reads
`box.space._user:max()` to generate an id for the new user first. Since
it reads before writing anything, it has the "read-confirmed" isolation
level, hence it's aborted by the transaction creating another user
because the latter updates `box.space._user:max()`. However, sometimes
both users are created and the test fails. This happens if the first
transaction manages to commit before the second one reads the `_user`
system space.

To fix the test and make the transaction creating the second user fail
due to DDL, let's add a read of the `_user` system space before putting
it to sleep. Actually, this even makes the test closer to the "original
test from #5998".

Closes #10444

NO_DOC=test fix
NO_CHANGELOG=test fix

(cherry picked from commit 62c051e22109369f9079b5adf4de30e0c53f6ca7)


Aug 21, 2024

test/fuzz: update compiler and library flags · b37b41f3

Sergey Bronnikov authored 7 months ago

The Clang compiler links against libc++ by default rather than
libstdc++, which the code seems to be built against.

The patch adds "-stdlib=libstdc++", which is required by the build on
OSS Fuzz, removes dirty hacks in a build script [2][3],
and allows switching to a static build in OSS Fuzz.

1. https://libcxx.llvm.org/docs/UsingLibcxx.html
2. https://github.com/google/oss-fuzz/commit/92d4e951c3a10b8b3d67ccd061ddaba71f01a328
3. https://github.com/google/oss-fuzz/commit/7a6315db98f574cc8beb4de7f9d2e71c6c589a34
4. https://github.com/google/oss-fuzz/pull/12322

NO_CHANGELOG=testing
NO_DOC=testing

(cherry picked from commit 79737d9d13d9e473687192be76fe055658b6e0b8)


Aug 20, 2024

sql: fix use-after-poison in variable binding · 6d775540

Maksim Tiushev authored 7 months ago

This patch fixes a bug found by the ASAN instrumentation of LuaJIT
allocator [1]. The problem is using `memcpy` beyond the size of the
buffer being copied.

Failing tests:
  - ./test/sql-luatest/gh_10243_varbinary_bound_variable_test.lua

[1]: Issue #10231

Closes #10398

NO_DOC=bugfix
NO_CHANGELOG=bugfix
NO_TEST=rely on existing test (run failing tests with tarantool
build described in [1]).

(cherry picked from commit 1923078b4f65649f03fdd6f789f29ba635b836b7)


Dummy commit · 2a756941

Serge Petrenko authored 7 months ago

schema: add missing downgrade versions · 8cebbf2c

Serge Petrenko authored 7 months ago

NO_DOC=tools
NO_TEST=tools
NO_CHANGELOG=tools

Tag: 2.11.4

Generate changelog for 2.11.4 · 89fc437a

Serge Petrenko authored 7 months ago

Also, remove unreleased/ entries.

NO_DOC=changelog
NO_TEST=changelog
NO_CHANGELOG=changelog


Aug 16, 2024

engine: introduce stubs for checkpoint FETCH_SNAPSHOT · 23c7899e

Nikita Zheleztsov authored 8 months ago

This commit introduces engine stubs that enable a new method
of fetching snapshots for anonymous replicas. Instead of using
the traditional read-view join approach, this update allows
file snapshot fetching. Note that file snapshot fetching
is only available in Tarantool EE.

Checkpoint fetching is done via IPROTO_IS_CHECKPOINT_JOIN,
IPROTO_CHECKPOINT_VCLOCK and IPROTO_CHECKPOINT_LSN fields.

If IPROTO_IS_CHECKPOINT_JOIN is set to true, join will be done from
files: .snap for memtx, .run for vinyl; if false - from read view.

Checkpoint join allows the client to continue from the place where it
stopped in case of a snapshot fetching error. This avoids
rebootstrap of an anonymous client. This can be done by specifying
CHECKPOINT_VCLOCK, which says from which file the server should continue
the join; the client gets the vclock at the beginning of the join.
Specifying CHECKPOINT_LSN allows continuing from some position in the
checkpoint. The server sends all data >= CHECKPOINT_LSN.

If CHECKPOINT_VCLOCK is not specified, fetching is done from the latest
available checkpoint. If CHECKPOINT_LSN is not specified - start from
the beginning of the snap. So, specifying only IS_CHECKPOINT_JOIN
triggers fetching the latest checkpoint from files.
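
The defaulting rules above can be sketched as follows. This is only a
model under loud assumptions: the request is a plain dict, checkpoints
are keyed by an integer standing in for the vclock, rows are
`(lsn, payload)` pairs, and `select_join_rows` is a hypothetical helper,
not a Tarantool API.

```python
def select_join_rows(request, checkpoints):
    """Return (checkpoint_id, rows) for a checkpoint join, or None for read-view join."""
    if not request.get("is_checkpoint_join", False):
        # No flag: the join is served from a read view, not from files.
        return None
    if not checkpoints:
        raise LookupError("no checkpoints available")
    checkpoint_id = request.get("checkpoint_vclock")
    if checkpoint_id is None:
        # CHECKPOINT_VCLOCK is not specified: use the latest checkpoint.
        checkpoint_id = max(checkpoints)
    # CHECKPOINT_LSN is not specified: start from the beginning.
    start_lsn = request.get("checkpoint_lsn", 0)
    rows = [row for row in checkpoints[checkpoint_id] if row[0] >= start_lsn]
    return checkpoint_id, rows
```

A client that failed mid-fetch would repeat the request with the
checkpoint identifier it received and the last position it reached.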

Needed for tarantool/tarantool-ee#741

NO_DOC=ee
NO_TEST=ee
NO_CHANGELOG=ee

(cherry picked from commit 2fca5c13)


engine: send vclock with 0th component during join · 9434531b

Nikita Zheleztsov authored 8 months ago

This commit makes the engine send the vclock without ignoring the 0th
component during join, which is needed for checkpoint FETCH SNAPSHOT.

Currently engine join functions are invoked only from
relay_initial_join, which is done during JOIN or FETCH SNAPSHOT.
They respond with vclock of the read view we're going to send.

In the following commit checkpoint FETCH SNAPSHOT will be introduced,
which responds with the vclock of the checkpoint we're going to send.
Such a vclock may include the 0th component and it's crucial to send it
to the client, because in case of a connection failure the client will
send us the same vclock and we'll have to use its signature to figure
out which checkpoint the client wants.

So, we have to send and receive 0th component of the vclock during
FETCH_SNAPSHOT. This commit also introduces decoding vclocks without
ignoring 0th component, as they'll be used in the following commit too.
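
A small sketch of why the 0th component matters, assuming a vclock is
modeled as a dict from replica id to LSN and its signature is the sum of
all components. The helper names are hypothetical.

```python
def encode_vclock(vclock, ignore0):
    """Return the components that would be sent over the wire."""
    return {
        replica_id: lsn
        for replica_id, lsn in sorted(vclock.items())
        if not (ignore0 and replica_id == 0)
    }


def signature(vclock):
    """Sum of all vclock components."""
    return sum(vclock.values())


checkpoint_vclock = {0: 3, 1: 10, 2: 5}
sent_full = encode_vclock(checkpoint_vclock, ignore0=False)
sent_cropped = encode_vclock(checkpoint_vclock, ignore0=True)
```

Here the signature of `sent_cropped` no longer equals the signature of
the checkpoint, so a lookup by signature would miss it.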

Needed for tarantool/tarantool-ee#741

NO_DOC=internal
NO_TEST=ee
NO_CHANGELOG=internal

(cherry picked from commit 56058393)


xrow: rename xrow_encode_vclock · 4de3d0d6

Nikita Zheleztsov authored 8 months ago

This commit renames xrow_encode_vclock to xrow_encode_vclock_ignore0
since the next commit will introduce encoding vclock without ignoring
the 0th component, which is needed when sending the response to a fetch
snapshot request.

This commit also removes an internal field inside the replication_request
structure, as the following commit will use 'vclock' for
encoding/decoding vclock without ignoring the 0th component.

Needed for tarantool/tarantool-ee#741

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 313bd730)


relay: refactor relay_initial_join · 854d09ff

Nikita Zheleztsov authored 8 months ago

From now on, during initial join the memtx engine prepares vclock, raft and
limbo states, and it also sends them during memtx_engine_join.

This is done in order to simplify the code of initial join, as in a
subsequent commit checkpoint initial join will be introduced and we want
the relay code to handle it the same way as read-view join, without
confusing conditions.

Needed for tarantool/tarantool-ee#741

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 72cc2b3e)


engine: move raft and limbo states after system data in checkpoint · c107ba11

Nikita Zheleztsov authored 8 months ago

Before this commit raft and limbo states were written at the end of the
checkpoint, which makes it very costly to access them.

Checkpoint join needs to access limbo and raft state in order to send
them during JOIN_META stage. We cannot use the latest states, like it's
done for read-view snapshot fetching: states may be far ahead of the
data, written to the checkpoint, which we're going to send.

This commit moves raft and limbo states after data from the system
spaces but before user data. We cannot put them right at the beginning
of the snapshot, because then we'd have to patch the recovery process,
which currently strongly relies on the fact that system spaces are
at the beginning of the snapshot (this was done in order to apply force
recovery only to user data). If we patched the recovery process, old
versions, where it's unpatched, wouldn't be able to recover from
snapshots made by the newer version, and snapshot compatibility would be
broken.

The current change is not breaking, old Tarantool versions can restore
from the snapshot made by the newer one.

Needed for tarantool/tarantool-ee#741

NO_DOC=internal
NO_CHANGELOG=internal

(cherry picked from commit 3da31b83)
