Commits · 65f126ffa54d96637beedb28f71d256a8312800d · core / tarantool

Oct 06, 2022

ci: rename `pack_and_deploy` to `pack-and-deploy` · 65f126ff

Usually, GitHub actions are named like `foo-bar` rather than `foo_bar`.
A few widely known examples: upload-artifact [1], download-artifact [2],
setup-python [3], setup-node [4]. So let's stick to this approach also.

[1] https://github.com/actions/upload-artifact
[2] https://github.com/actions/download-artifact
[3] https://github.com/actions/setup-python
[4] https://github.com/actions/setup-node

NO_DOC=ci
NO_TEST=ci
NO_CHANGELOG=ci

(cherry picked from commit 3c4a5056)

65f126ff

sql: fix assertion during INDEXED BY · 481cfb75

Mergen Imeev authored 2 years ago

This patch fixes the assertion failure when using INDEXED BY with an index
that is at least the third in the space.

Closes #5976

NO_DOC=bugfix

(cherry picked from commit 22c65f96)

481cfb75

luajit: bump new version · d8a166c3

Igor Munkin authored 2 years ago

* FFI: Always fall back to metamethods for cdata length/concat.
* FFI: Add tonumber() specialization for failed conversions.
* build: introduce LUAJIT_ENABLE_CHECKHOOK option
* Fix overflow check in unpack().
* gdb: refactor iteration over frames while dumping stack
* gdb: adjust to support Python 2 (CentOS 7)

Closes #7458
Closes #7655
Needed for #7762
Part of #7230

NO_DOC=LuaJIT submodule bump
NO_TEST=LuaJIT submodule bump

d8a166c3

Oct 05, 2022

lua/datetime: clearer error msg for new() · 36571fd8

Gleb Kashkin authored 2 years ago

dt.new() will raise a clear error on wrong timestamp type.

Closes #7273

NO_DOC=bugfix
NO_CHANGELOG=bugfix

(cherry picked from commit 8d4cbf44)

36571fd8

Sep 30, 2022

test: don't call box.cfg() from test instance in gh-7288 · 13a839e8

Nikolay Shirokovskiy authored 2 years ago

Such style is not recommended and is going to be banned.

Closes #7715

NO_TEST=test refactoring
NO_DOC=test refactoring
NO_CHANGELOG=test refactoring

(cherry picked from commit ce784430)

13a839e8

test: replace Lua `assert`s with luatest `assert`s in luatest tests · c7a65c94

Georgiy Lebedev authored 2 years ago

Some luatest framework tests use Lua `assert`s, which are uninformative
on failure (the only information provided is 'assertion failed!'),
making debugging difficult: replace them with luatest `assert`s and their
context-specific varieties.

NO_CHANGELOG=<code health>
NO_DOC=<code health>

(cherry picked from commit 1d9645f7)

c7a65c94

box: fix usage error messages if FFI is disabled for index read ops · 5ebfc581

Vladimir Davydov authored 2 years ago

Commit 7f2c7609 ("box: add option to disable Lua FFI for read
operations") added an internal configuration option that switches all
index read operations from FFI to Lua C, which was necessary to
implement space upgrade in the EE repository. Turns out there was a
minor bug in that commit - when FFI is disabled, method usage errors
(raised when object.method is called instead of object:method) stop
working. Fix this bug and add an extensive test that checks that Lua C
and FFI implementations of index read operations are equivalent.

Fixes https://github.com/tarantool/tarantool-ee/issues/254

NO_DOC=bug fix
NO_CHANGELOG=bug manifests itself only in EE

(cherry picked from commit e8d9a7cb)

5ebfc581

memtx: fix loss of committed tuple in secondary index · 2ec3dfbb

Georgiy Lebedev authored 2 years ago

Concurrent transactions can try to insert tuples that intersect only by
parts of secondary index: in this case when one of them gets prepared, the
others get conflicted, but the committed story does not get retained
(because the conflicting statements are not added to the committed story's
delete statement list as opposed to primary index) and is lost after
garbage collection: retain stories if there is a newer uncommitted story
in the secondary indexes' history chain.

Closes #7712

NO_DOC=bugfix

(cherry picked from commit 7b0baa57)

2ec3dfbb

Generate changelog for 2.10.3 · 8aca8a19

Kirill Yukhin authored 2 years ago

Generate changelog for 2.10.3 release.
Also, clean changelogs/unreleased folder.

NO_DOC=no code changes
NO_TEST=no code changes
NO_CHANGELOG=no code changes

8aca8a19

Fix wording, punctuation, and formatting. · 15d8c463

Pavel Semyonov authored 2 years ago

NO_CHANGELOG=changelog
NO_DOC=changelog
NO_TEST=changelog

15d8c463

Fix wording, punctuation, and formatting. · 74a8debd

Pavel Semyonov authored 2 years ago

NO_CHANGELOG=changelog
NO_DOC=changelog
NO_TEST=changelog

74a8debd

doc: proofread 2.10.3 changelogs · 6cfd02fe

Pavel Semyonov authored 2 years ago

Fix wording, punctuation, and formatting.

NO_CHANGELOG=changelog
NO_DOC=changelog
NO_TEST=changelog

6cfd02fe

Sep 29, 2022

gc: replace vclockset_psearch with _match in wal_collect_garbage_f · d6fc95f6

Serge Petrenko authored 2 years ago

When using vclockset_psearch, the resulting vclock may be incomparable
to the search key. For example, with a vclock set { } (empty vclock),
{0: 1, 1: 10}, {0: 2, 1: 11}, vclockset_psearch(set, {0: 2, 1: 9}) might
return {0: 1, 1: 10}, and not { }.
This is known and avoided in other places, for example
recover_remaining_wals(), where vclockset_match() is used instead.
vclockset_match() starts with the same result as vclockset_psearch() and
then unwinds the result until the first vclock which is less than or equal
to the search key is found.
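
The difference can be sketched with a toy model. This is only an
illustration, not the Tarantool implementation: `struct vc`,
`toy_psearch()` and `toy_match()` are hypothetical names, vclocks are
reduced to two components, and the set is assumed to be sorted
lexicographically.

```c
#include <assert.h>
#include <stddef.h>

enum { NCOMP = 2 };

/* Toy vclock: a fixed number of LSN components (assumption). */
struct vc {
	long lsn[NCOMP];
};

/* Lexicographic order, assumed here to be the order of the set. */
static int
vc_lex_cmp(const struct vc *a, const struct vc *b)
{
	for (int i = 0; i < NCOMP; i++) {
		if (a->lsn[i] != b->lsn[i])
			return a->lsn[i] < b->lsn[i] ? -1 : 1;
	}
	return 0;
}

/* Partial order: a <= b only if every component of a is <= b's. */
static int
vc_le(const struct vc *a, const struct vc *b)
{
	for (int i = 0; i < NCOMP; i++) {
		if (a->lsn[i] > b->lsn[i])
			return 0;
	}
	return 1;
}

/* Index of the last element that is lexicographically <= key, or -1. */
static int
toy_psearch(const struct vc *set, int n, const struct vc *key)
{
	int res = -1;
	for (int i = 0; i < n; i++) {
		if (vc_lex_cmp(&set[i], key) <= 0)
			res = i;
	}
	return res;
}

/* Start from the psearch result and unwind to a comparable element. */
static int
toy_match(const struct vc *set, int n, const struct vc *key)
{
	int i = toy_psearch(set, n, key);
	while (i >= 0 && !vc_le(&set[i], key))
		i--;
	return i;
}
```

With the set { }, {0: 1, 1: 10}, {0: 2, 1: 11} and the key {0: 2, 1: 9},
`toy_psearch()` picks {0: 1, 1: 10}, which is incomparable to the key,
while `toy_match()` unwinds to { }.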

Having vclockset_psearch in wal_collect_garbage_f could lead to issues
even before local space changes started being written to the 0-th vclock
component. Once a replica subscribes, its gc consumer is set to the vclock
that the replica sent in the subscribe request. This vclock might be incomparable
with xlog vclocks of the master, leading to the same issue of
potentially deleting a needed xlog during gc.

Closes #7584

NO_DOC=bugfix

(cherry picked from commit c63bfb9a)

d6fc95f6

Sep 28, 2022

memtx: fix transaction manager MVCC invariant violation · 1fac9eef

Georgiy Lebedev authored 2 years ago

We hold the following invariant in MVCC: the story at the top of the
history chain is present in index.

If a story is subject to be deleted from index and there is an older story
in the history chain, the older story starts to be at the top of the
history chain and is not present in index, which violates our invariant:
explicitly check for this case when evaluating whether a story can be
garbage collected and add an assertion to check the invariant above is not
violated.

Rollbacked stories need to be handled in a special way: they are
present at the end of some history chains and completely unlinked from
others (which also implies they are not present in the corresponding
indexes).

`memtx_tx_story_full_unlink` is called in two contexts: space deletion, in
which we delete all stories, and garbage collection step — the former case
can break the invariant described above, while the latter must preserve it,
hence add two different functions for the corresponding contexts.

Closes #7490

NO_CHANGELOG=<internal bugfix not user observable>
NO_DOC=<bugfix>

(cherry picked from commit c8eccfbb)

1fac9eef

memtx: rework transaction rollback · 61be2c8f

Georgiy Lebedev authored 2 years ago

When we rollback a transaction statement, we relink its read trackers
to a newer story in the history chain, if present (6c990a7b), but we do not
handle the case when there is no newer story.

If there is an older story in the history chain, we can relink the
rollbacked story's reader to it, but if the rollbacked story is the
only one left, we need to retain it, because it stores the reader list
needed for conflict resolution — such stories are distinguished by the
rollbacked flag, and there can be no more than one such story located
strictly at the end of a given history chain (which means a story can be
fully unlinked from some indexes and present at the end of others).

There are several nuances we need to account for:

Firstly, such rollbacked stories must be impossible to read from an index:
this is ensured by `memtx_tx_story_is_visible`.

Secondly, rollbacked transactions need to be treated as prepared with
stories that have `add_psn == del_psn`, so that they are correctly deleted
during garbage collection.

After this logical change we have the following partially ordered set over
tuple stories:
---------------------------------------------------------> serialization time
|----------------|-----------|----------|-------------|----------------
| No more than   | Committed | Prepared | In-progress | One dirty
| one rollbacked |           |          |             | story in index
| story          |           |          |             |
|----------------|-----------|----------|-------------|----------------

Closes #7343

NO_DOC=bugfix

(cherry picked from commit 56cf737c)

61be2c8f

memtx: remove redundant `space` field from `struct memtx_story` · 8ee6a2f2

Georgiy Lebedev authored 2 years ago

`struct memtx_story` has a `space` field, which is basically used
to identify that a tuple is unlinked from the history chain in
`memtx_tx_index_invisible_count_slow` (though this can be determined by its
presence in the index) and is used to get the space's index in
`memtx_tx_story_link_top` (though it can be retrieved from the older
story's link field): remove this redundant field.

Needed for #7343

NO_CHANGELOG=<refactoring>
NO_DOC=<refactoring>
NO_TEST=<refactoring>

(cherry picked from commit 55e64a8d)

8ee6a2f2

memtx: refactor story cleanup on space delete · 0533d097

Georgiy Lebedev authored 2 years ago

When a space is deleted, all transactions need to be aborted and all their
stories need to be removed immediately out of order: currently we
artificially rollback statements. Instead, call this statement
removal to logically distinguish it from rollback. It differs in the sense
that the whole space's tuple history is torn down: no more
transaction managing is going to be done, as opposed to rollback of an
individual transaction.

Needed for #7343

NO_CHANGELOG=refactoring
NO_DOC=refactoring
NO_TEST=refactoring

(cherry picked from commit 88203d4f)

0533d097

memtx: refactor `memtx_tx_history_rollback_stmt` · 81fede2f

Georgiy Lebedev authored 2 years ago

Follow `memtx_tx_history_{add, prepare}_{insert, delete}` pattern: split
code responsible for rollbacking addition and deletion of a story into
separate functions.

Needed for #7343

NO_CHANGELOG=refactoring
NO_DOC=refactoring
NO_TEST=refactoring

(cherry picked from commit 9dd27681)

81fede2f

memtx: refactor removing of story's delete statements · a3f136cf

Georgiy Lebedev authored 2 years ago

When a statement gets rollbacked, we need to remove delete statements
attached to the story it adds by relinking them and making them delete an
older story in the history chain: refactor this loop out into a separate
function.

Needed for #7343

NO_CHANGELOG=refactoring
NO_DOC=refactoring
NO_TEST=refactoring

(cherry picked from commit 1da727f6)

a3f136cf

memtx: refactor sinking of story added by prepared statement · f0c3ccb8

Georgiy Lebedev authored 2 years ago

If a statement becomes prepared, the story it adds must be 'sunk' to
the level of prepared stories: refactor this loop into a
separate function.

Needed for #7343

NO_CHANGELOG=refactoring
NO_DOC=refactoring
NO_TEST=refactoring

(cherry picked from commit b25d3729)

f0c3ccb8

Sep 26, 2022

xrow: fix crash on nested map/array update ops · 3def2916

Vladislav Shpilevoy authored 2 years ago

If an update operation tried to insert a new key into a map or an
array which was created by a previous update operation, then the
process would fail an assertion.

That was because the first operation was stored as a bar update.
The second operation tried to branch it assuming that the entire
bar update's JSON path must exist, but it wasn't so for the newly
created part of the path.

The solution is to fall back to branching before the entire
bar path ends if we can see that the next part of the path can't be
found.

Closes #7705

NO_DOC=bugfix

(cherry picked from commit 8425ebfc)

3def2916

Sep 23, 2022

memtx: track `index:random` reads and clarify result · 2af84a85

Georgiy Lebedev authored 2 years ago

TREE (HASH) index implements `random` method: if the space is empty from
the transaction's perspective, which means we have to return nothing, add
gap tracking of the whole range (full scan tracking), since this result is
equivalent to `index:select{}`; otherwise repeatedly call `random` and
clarify the result until we get a non-empty one.
We do not care about performance here, since all operations in context of
transaction management currently have O(number of dirty tuples)
complexity.

Closes #7670

NO_DOC=bugfix

(cherry picked from commit 1b82beb2)

2af84a85

salad: add LIGHT(random) method · a647b1d8

Vladimir Davydov authored 2 years ago

This commit moves the code that gets the index of a random light
record from the memtx hash index implementation to a new light method.
This gives us more freedom of refactoring the light internals without
modifying the code using it.

After this change, LIGHT(pos_valid) isn't needed anymore so it's
inlined in LIGHT(random).

Needed for #7192

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 76add786)

a647b1d8

memtx: refactor `index_def_new` · 07c0d3a6

Georgiy Lebedev authored 2 years ago

Since `key_def_merge` sets the merged key definition's unique part count
equal to the new part count, the extra assignment in case the index is not
unique is redundant: remove it.

NO_CHANGELOG=<refactoring>
NO_DOC=<refactoring>
NO_TEST=<refactoring>

(cherry picked from commit 1d6c92e5)

07c0d3a6

memtx: fix TREE index `get` check for part count · b9d62fca

Georgiy Lebedev authored 2 years ago

If TREE index `get` result is empty, the key part count is incorrectly
compared to the tree's `cmp_def->part_count`, though it should be compared
with `cmp_def->unique_part_count`. But we can actually assume that by the
time we get to the index's `get` method the part count is equal to the
unique part count (partial keys are rejected and `get` is not
supported for non-unique indexes): change check to correct assertion.

Closes #7685

NO_DOC=<bugfix>

(cherry picked from commit bfcd8ca7)

b9d62fca

Sep 21, 2022

limbo: fix assertions in box_issue_de/promote · 65b3bad6

Boris Stepanenko authored 2 years ago

Replaced the assertions that no one started new elections or got promoted
while acquiring the limbo with checks that the raft term and limbo term
didn't change. If they did, don't write DEMOTE/PROMOTE and just release
the limbo, because it is already owned by someone else or soon will be.

Closes #7086

NO_DOC=Bugfix

(cherry picked from commit 8ee0e434)

65b3bad6

Sep 16, 2022

box: check constraint name against identifier rules · 2c912de8

Ilya Verbin authored 2 years ago

Currently, it is possible to create a constraint with a name that does
not match the rules for identifiers. Fix this by validating the name with
identifier_check.

Closes #7201

NO_DOC=bugfix
NO_CHANGELOG=minor bug

(cherry picked from commit 1d00b544)

2c912de8

Sep 15, 2022

test: bump test-run to new version · 18b8a80d

Yaroslav Lobankov authored 2 years ago

Bump test-run to new version with the following improvements:

- Improve getting iproto port for tarantool < 2.4.1 [1]

[1] https://github.com/tarantool/test-run/pull/349

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff

(cherry picked from commit 4668db62)

18b8a80d

cmake: add extra security compiler options · ce4b08eb

Ilya Verbin authored 2 years ago

Introduce cmake option ENABLE_HARDENING, which is TRUE by default for
non-debug regular and static builds, excluding AArch64 and FreeBSD.
It passes compiler flags that harden Tarantool (including the bundled
libraries) against memory corruption attacks. The following flags are
passed:

* -Wformat - Check calls to printf and scanf, etc., to make sure that
  the arguments supplied have types appropriate to the format string
  specified.

* -Wformat-security -Werror=format-security - Warn about uses of format
  functions that represent possible security problems, and turn the
  warning into an error.

* -fstack-protector-strong - Emit extra code to check for buffer
  overflows, such as stack smashing attacks.

* -fPIC -pie - Generate position-independent code (PIC). This makes it
  possible to take advantage of Address Space Layout Randomization (ASLR).

* -z relro -z now - Resolve all dynamically linked functions at the
  beginning of the execution, and then make the GOT read-only.

Also do not disable hardening for Debian and RPM-based Linux distros.
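
As an illustration of what -Wformat-security is aimed at, here is a
minimal sketch. The helper `toy_copy_message()` is hypothetical and is
not taken from the Tarantool sources.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy an untrusted string into buf.
 *
 * The call below passes untrusted data as the format string. That is
 * the pattern -Wformat-security reports, and with
 * -Werror=format-security it stops the build:
 *
 *     snprintf(buf, size, untrusted);
 */
static int
toy_copy_message(char *buf, size_t size, const char *untrusted)
{
	/* The data is an argument, so '%' sequences in it are not interpreted. */
	return snprintf(buf, size, "%s", untrusted);
}
```

With the constant "%s" format, an input such as "100%n done" is copied
verbatim instead of being parsed as a conversion specification.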

Closes #5372
Closes #7536

NO_DOC=build
NO_TEST=build

(cherry picked from commit e6abe1c9)

ce4b08eb

memtx: fix 'use after free' of garbage collected MVCC stories · 0daf8382

Georgiy Lebedev authored 2 years ago

`directly_replaced` stories can potentially get garbage collected in
`memtx_tx_handle_gap_write`, which is unexpected and leads to 'use after
free': in order to fix this, limit garbage collection points only to
external API calls.

Wrap all possible garbage collection points with explicit warnings (see
c9981a56).

Closes #7449

NO_DOC=bugfix

(cherry picked from commit 18e042f5)

0daf8382

Sep 14, 2022

lua/merger: fix use-after-free during iteration · f9aecfb8

Alexander Turenko authored 2 years ago

All merge sources (including the merger itself) share the same
`<merge source>:pairs()` implementation, which returns `gen, param,
state` triplet. `gen` is `lbox_merge_source_gen()`, `param` is `nil`,
`state` is the merge source.

The `lbox_merge_source_gen()` returns `source, tuple`. The returned
source is supposed to be the same object as the one passed to the function
(`gen(param, state)`), so the function assumes the object is alive and
doesn't increment the source's refcounter on entry or decrease it on
exit.

This logic is perfect, but there was a mistake in the implementation:
the function returns a new cdata object (which holds the same pointer to
the merge source structure) instead of the same cdata object.

The new cdata object neither increases the source's refcounter when
pushed to Lua, nor decreases it when collected. As a result, if we lose
the original merge source object (and the first `state` that is returned
from `:pairs()`), the source structure may be freed. The pointer in the
new cdata object will then be invalid.

A code sketch that illustrates the problem:

```lua
gen, param, state0 = source:pairs()
assert(state0 == source)
source = nil
state1, tuple = gen(param, state0)
state0 = nil
-- assert(state1 == source) -- would fail
collectgarbage()
-- The cdata object that was referenced as `source` and as `state0`
-- is collected. The GC handler is called and drops the merge
-- source structure's refcounter to zero. The structure is freed.
-- The call below will crash.
gen(param, state1)
```

In the fixed code `state1 == source`, so the GC handler is not called
prematurely: we have the merge source object alive till the end of the
iterator or till the stop of the traversal.

Fixes #7657

NO_DOC=a crash is definitely not what we want to document

(cherry picked from commit 3bc64229)

f9aecfb8

Sep 13, 2022

test: slight refactoring of replication-py tests · 852770c2

Yaroslav Lobankov authored 2 years ago

- Remove unused imports
- Remove unnecessary creation of 'replica' instance objects
- Use `<instance>.iproto.uri` object attribute instead of calling
  `box.cfg.listen` via admin connection

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff

(cherry picked from commit d13b06bd)

852770c2

test: bump test-run to new version · 96dfd98b

Yaroslav Lobankov authored 2 years ago

Bump test-run to new version with the following improvements:

- Report job summary on GitHub Actions [1]
- Free port auto resolving for TarantoolServer and AppServer [2]

Also, this patch includes the following changes:

- removing `use_unix_sockets` option from all suite.ini config files
  due to permanent using Unix sockets for admin connection recently
  introduced in test-run
- switching replication-py tests to Unix sockets for iproto connection
- fixing replication-py/swap.test.py and swim/swim.test.lua tests

[1] tarantool/test-run#341
[2] tarantool/test-run#348

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff

(cherry picked from commit 4335b442)

96dfd98b

memtx: track read story when conflicting full scans due to gap write · 23b7d3cb

Georgiy Lebedev authored 2 years ago

When conflicting transactions that made full scans in
`memtx_tx_handle_gap_write`, we need to also track that the conflicted
transaction has read the inserted tuple, just like we do in gap tracking
for ordered indexes — otherwise another transaction can overwrite the
inserted tuple in which case no gap tracking will be handled.

Closes #7493

NO_DOC=bugfix

(cherry picked from commit 7f52f445)

23b7d3cb

Sep 12, 2022

Use MT-Safe strerror_r instead of strerror · 03ceaafc

Vladimir Davydov authored 2 years ago

strerror() is MT-Unsafe, because it uses a static buffer under the hood.
We should use strerror_r() instead, which takes a user-provided buffer.
The problem is there are two implementations of strerror_r(): XSI and
GNU. The first one returns an error code and always writes the message
to the beginning of the buffer while the second one returns a pointer to
a location within the buffer where the message starts. Let's introduce a
macro HAVE_STRERROR_R_GNU set if the GNU version is available and define
tt_strerror() which writes the message to the static buffer, like
tt_cstr() or tt_sprintf().
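
The two return conventions can be sketched as follows. This is not the
patch itself: the patch selects the variant at build time with
HAVE_STRERROR_R_GNU and writes to a static buffer, while this
self-contained sketch uses C11 `_Generic` to pick the variant from the
declared return type and takes a caller-provided buffer. The names
`from_xsi()`, `from_gnu()` and `toy_strerror()` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose strerror_r() in strict modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* XSI variant: returns 0 on success, the message starts at buf. */
static const char *
from_xsi(int rc, const char *buf)
{
	return rc == 0 ? buf : "unknown error";
}

/* GNU variant: returns a pointer to the message, not necessarily buf. */
static const char *
from_gnu(const char *msg, const char *buf)
{
	(void)buf;
	return msg;
}

/* Pick the handler by the return type of the declared strerror_r(). */
#define STRERROR_R_MSG(ret, buf) \
	_Generic((ret), int: from_xsi, char *: from_gnu)((ret), (buf))

/* Thread-safe as long as each caller passes its own buffer. */
static const char *
toy_strerror(int errnum, char *buf, size_t len)
{
	return STRERROR_R_MSG(strerror_r(errnum, buf, len), buf);
}
```

A caller would use it as `toy_strerror(errno, buf, sizeof(buf))` with a
buffer on its own stack, so concurrent threads never share storage.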

Note, we have to export tt_strerror(), because it is used by Lua via
FFI. We also need to make it available in the module API header, because
the say_syserror() macro uses strerror() directly. In order to avoid
adding tt_strerror() to the module API, we introduce an internal helper
function _say_strerror(), which calls tt_strerror().

NO_DOC=bug fix
NO_TEST=code is covered by existing tests

(cherry picked from commit 44f46dc8)

03ceaafc

Sep 09, 2022

popen: fix a race between setpgrp() and killpg() · 99040255

Alexander Turenko authored 2 years ago

In brief: `vfork()` on Mac OS 12 and newer doesn't suspend the parent
process, so we should wait for `setpgrp()` before using `killpg()`. See a
more detailed description of the problem in the comment to the
`popen_wait_group_leadership()` function.

The solution is to spin in a loop and check the child's process group. It
looks like the simplest and most direct solution. Other possible solutions
require weighing the pros and cons of using an extra file descriptor or
assigning a signal number for child -> parent communication.
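
The spinning approach can be sketched like this. It is only an
illustration under assumptions, not the code of
`popen_wait_group_leadership()`: the function names are hypothetical, the
sketch uses `fork()` and `setpgid(0, 0)` for portability, and the 1 ms
pause and the attempt limit are arbitrary.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* expose getpgid()/killpg() in strict modes */
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Spin until the child leads its own process group. 0 on success. */
static int
toy_wait_group_leadership(pid_t child, int max_attempts)
{
	struct timespec delay = {0, 1000000}; /* 1 ms, arbitrary */
	for (int i = 0; i < max_attempts; i++) {
		pid_t pgid = getpgid(child);
		if (pgid == child)
			return 0;
		if (pgid == (pid_t)-1)
			return -1; /* the child is gone or inaccessible */
		nanosleep(&delay, NULL);
	}
	return -1;
}

/* Spawn a child, wait for its group, then signal the whole group. */
static int
toy_spawn_and_kill_group(void)
{
	pid_t child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		setpgid(0, 0); /* become a process group leader */
		for (;;)
			pause();
	}
	int rc = toy_wait_group_leadership(child, 5000);
	if (rc == 0)
		rc = killpg(child, SIGKILL); /* safe: the group exists now */
	else
		kill(child, SIGKILL);
	int status;
	waitpid(child, &status, 0);
	return rc;
}
```

Calling `killpg()` only after the wait succeeds is the point of the
sketch: before that, the child may still belong to the parent's group.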

There are the following alternatives and variations:

* Create a pipe and notify the parent from the child about the
  `setpgrp()` call.

  It costs an extra file descriptor, so I decided not to do that.
  However, if we need a channel to deliver information from the
  child to the parent for another task, it'll be worth reimplementing
  this function too.

  One possible place where we may need such a channel is the delivery of
  the child's errors to the parent. Now the child writes them directly to
  the logger's fd, and it requires some tricky code to keep and close the
  descriptor at the right points. Also, it doesn't allow catching those
  errors in the parent, but we may need it for #4925.

* Notify the parent about `setpgrp()` using a signal.

  It seems too greedy to assign a specific signal for such a local
  problem. It is also unclear how to guarantee that it won't break any
  user's code: a user can load a dynamic library, which uses some
  signals on its own.

  However, we can consider using this approach here if we design some
  common interprocess notification system.

* We can use the fiber cond or the `popen_wait_timeout()` function from
  PR #7648 to react to the child termination instantly.

  It would complicate the code and anyway wouldn't allow reacting
  instantly to `setpgrp()` in the child.

  Also it assumes yielding during the wait (see below).

* Wait until `setpgrp()` in `popen_send_signal()` instead of
  `popen_new()`.

  It would add yielding/waiting inside `popen_send_signal()` and would
  likely extend the set of its possible exit situations. It is undesirable:
  this function should have simple and predictable behavior.

* Finally, we considered yielding in `popen_wait_group_leadership()`
  instead of sleeping the whole tx thread.

  `<popen handle>:new()` doesn't yield at the moment and a user's code
  may rely on this fact.

  Yielding would allow achieving better throughput (number of parallel
  requests per second), but we don't care much about performance on
  Mac OS. The primary goal for this platform is to offer the same
  behavior as on Linux to allow development of applications.

I didn't replace `vfork()` with `fork()` on Mac OS, because `vfork()`
works and I don't know the consequences of calling `pthread_atfork()`
handlers in a child created by popen. See the comment in `popen_new()`
near the `vfork()` call: it warns about possible mutex double locks. This
topic will be investigated further in #6674.

Fixes #7658

NO_DOC=fixes incorrect behavior, no need to document the bug
NO_TEST=already tested by app-tap/popen.test.lua

(cherry picked from commit e2207fdc)

99040255

Sep 07, 2022

raft: persist new term and vote separately · 61a07baf

Vladislav Shpilevoy authored 2 years ago

If a node persisted a foreign term + vote request at the same
time, it increased split-brain probability. A node could vote for
a candidate having smaller vclock than the local one. For example,
via the following scenario:

- Node1, node2, node3 are started;
- Node1 becomes a leader;
- The topology becomes node1 <-> node2 <-> node3 due to network
    issues;
- Node1 sends a synchro txn to node2. The txn starts a WAL write;
- Node3 bumps term and votes for self. Sends it all to node2;
- Node2 votes for node3, because their vclocks are equal;
- Node2 finishes all pending WAL writes, including the txn from
    node1. Now its vclock is > node3's one and the vote was wrong.
- Node3 wins, writes PROMOTE, and it conflicts with node1 writing
    CONFIRM.

This patch makes it so that a node can't persist a vote in a new term in
the same WAL write as the term bump. Term bump is written first
and alone. It serves as a WAL sync after which the node's vclock
is not supposed to change except for the 0 (local) component.

The vote requests are re-checked after term bump is persisted to
see if they still can be applied.

Part of #7253

NO_DOC=bugfix

(cherry picked from commit c9155ac8)

61a07baf

qsync: fix txn fiber hang on fencing at CONFIRM · 618bafe6

Vladislav Shpilevoy authored 2 years ago

If the limbo was fenced during CONFIRM WAL write, then the
confirmed txn was committed just fine, but its author-fiber kept
hanging. This is because when it was woken up, it checked if the
limbo is frozen and went to infinite waiting before actually
checking if the txn is completed.

As a workaround, the fiber would unfreeze if it were woken up
explicitly.

The fix is simple - change the checks order.

Part of #7253

NO_DOC=bugfix

(cherry picked from commit ec628100)

618bafe6

promote: abort it when become non-candidate · cbebd024

Vladislav Shpilevoy authored 2 years ago

box.ctl.promote() bumps the term, makes the node a candidate, and
waits for the term outcome. The waiting used to be until there is
a leader elected or the node lost connection quorum or the term
was bumped again.

There was a bug where a node could hang in box.ctl.promote() even
after it became a voter. It could happen if the quorum was still there
and a leader couldn't be elected in the current term at all. For
instance, others could have `election_mode='off'`.

The fix is to stop waiting for the term outcome if the node can't
win anyway.

NO_DOC=bugfix

(cherry picked from commit ab08dad9)

cbebd024

promote: fix infinite elections with multi-promote · b200d298

Vladislav Shpilevoy authored 2 years ago

If box.ctl.promote() was called on more than one instance, then it
could lead to infinite or extremely long elections bumping
thousands of terms in just a few seconds.

This was because box.ctl.promote() used to be a loop. The loop
retried term bump + voted for self until the node won. Retry
happened immediately as the node saw the term was bumped again
and there was no leader elected or the connection quorum was lost.

If 2 nodes would start box.ctl.promote() almost at the same time,
they could bump each other's terms, not see any winner, bump them
again, and so on. For example:

- Node1 term=1, node2 term=2;
- Promote is called on both;
- Node1 term=2, node2 term=3. They receive the messages. Node2
    ignores node1's old term. Node1 term is bumped and it votes
    for node2, but it didn't win, so box.ctl.promote() bumps its
    term to 4.
- Node2 receives term 4 from node1. Its own box.ctl.promote() sees
    the term was bumped and no winner, so it bumps it to 5 and the
    process continues for a long time.

It worked well enough in tests - the problem happened sometimes,
terms could roll like 80k times in a few seconds, but the tests
ended fine anyway.

One of the next commits will make term bump + vote written in
separate WAL records. That aggravates the problem drastically.

Basically, this mutual term bump loop could end only if one node
would receive vote for self from another node and send back the
message 'I am a leader' before the other node's box.ctl.promote()
notices the term was bumped externally. This will get much harder
to achieve.

The patch simply drops the loop. Let box.ctl.promote() fail if the
term was bumped outside.

There was an alternative to keep running it in a loop with a
randomized election timeout like it works inside of raft. But the
current solution is just simpler.

NO_DOC=bugfix
NO_TEST=election_split_vote_test.lua catches it already

(cherry picked from commit dd89c57e)

b200d298