Commits · f9aecfb81b345c49535716b87f85e95e71487298 · core / tarantool

Sep 14, 2022

lua/merger: fix use-after-free during iteration · f9aecfb8

All merge sources (including the merger itself) share the same
`<merge source>:pairs()` implementation, which returns a `gen, param,
state` triplet. `gen` is `lbox_merge_source_gen()`, `param` is `nil`,
and `state` is the merge source.

`lbox_merge_source_gen()` returns `source, tuple`. The returned
source is supposed to be the same object as the one passed to the function
(`gen(param, state)`), so the function assumes the object is alive and
neither increments the source's refcounter on entry nor decrements it on
exit.

This logic is perfect, but there was a mistake in the implementation:
the function returns a new cdata object (which holds the same pointer to
the merge source structure) instead of the same cdata object.

The new cdata object neither increases the source's refcounter when
pushed to Lua nor decreases it when collected. As a result, if we lose
the original merge source object (and the first `state` that is returned
from `:pairs()`), the source structure may be freed. The pointer in the
new cdata object then becomes invalid.

A code sketch that illustrates the problem:

```lua
gen, param, state0 = source:pairs()
assert(state0 == source)
source = nil
state1, tuple = gen(param, state0)
state0 = nil
-- assert(state1 == source) -- would fail
collectgarbage()
-- The cdata object that was referenced as `source` and as `state0`
-- is collected. The GC handler is called and drops the merge
-- source structure refcounter to zero. The structure is freed.
-- The call below will crash.
gen(param, state1)
```

In the fixed code `state1 == source`, so the GC handler is not called
prematurely: we have the merge source object alive till the end of the
iterator or till the stop of the traversal.

Fixes #7657

NO_DOC=a crash is definitely not what we want to document

(cherry picked from commit 3bc64229)

f9aecfb8

Sep 13, 2022

test: slight refactoring of replication-py tests · 852770c2

Yaroslav Lobankov authored 2 years ago

- Remove unused imports
- Remove unnecessary creation of 'replica' instance objects
- Use `<instance>.iproto.uri` object attribute instead of calling
  `box.cfg.listen` via admin connection

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff

(cherry picked from commit d13b06bd)

852770c2

test: bump test-run to new version · 96dfd98b

Yaroslav Lobankov authored 2 years ago

Bump test-run to new version with the following improvements:

- Report job summary on GitHub Actions [1]
- Free port auto resolving for TarantoolServer and AppServer [2]

Also, this patch includes the following changes:

- removing `use_unix_sockets` option from all suite.ini config files
  due to permanent using Unix sockets for admin connection recently
  introduced in test-run
- switching replication-py tests to Unix sockets for iproto connection
- fixing replication-py/swap.test.py and swim/swim.test.lua tests

[1] tarantool/test-run#341
[2] tarantool/test-run#348

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff

(cherry picked from commit 4335b442)

96dfd98b

memtx: track read story when conflicting full scans due to gap write · 23b7d3cb

Georgiy Lebedev authored 2 years ago

When conflicting transactions that made full scans in
`memtx_tx_handle_gap_write`, we also need to track that the conflicted
transaction has read the inserted tuple, just like we do in gap tracking
for ordered indexes. Otherwise another transaction can overwrite the
inserted tuple, in which case no gap tracking will be handled.

Closes #7493

NO_DOC=bugfix

(cherry picked from commit 7f52f445)

23b7d3cb

Sep 12, 2022

Use MT-Safe strerror_r instead of strerror · 03ceaafc

Vladimir Davydov authored 2 years ago

strerror() is MT-Unsafe, because it uses a static buffer under the hood.
We should use strerror_r() instead, which takes a user-provided buffer.
The problem is there are two implementations of strerror_r(): XSI and
GNU. The first one returns an error code and always writes the message
to the beginning of the buffer while the second one returns a pointer to
a location within the buffer where the message starts. Let's introduce a
macro HAVE_STRERROR_R_GNU set if the GNU version is available and define
tt_strerror() which writes the message to the static buffer, like
tt_cstr() or tt_sprintf().

Note, we have to export tt_strerror(), because it is used by Lua via
FFI. We also need to make it available in the module API header, because
the say_syserror() macro uses strerror() directly. In order to avoid
adding tt_strerror() to the module API, we introduce an internal helper
function _say_strerror(), which calls tt_strerror().
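
A minimal sketch of the XSI/GNU split described above. The helper name
`sketch_strerror()` is hypothetical and this is not the actual
`tt_strerror()` code. The `HAVE_STRERROR_R_GNU` detection below is only
a stand-in for a build-system check and assumes glibc conventions; the
per-thread static buffer is also an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Stand-in for a build-system check (an assumption of this sketch):
 * glibc exposes the GNU variant only when _GNU_SOURCE is defined.
 */
#if defined(__GLIBC__) && defined(_GNU_SOURCE)
#define HAVE_STRERROR_R_GNU 1
#endif

/* Hypothetical helper, not the real tt_strerror(). */
const char *
sketch_strerror(int errnum)
{
	/* One buffer per thread, so concurrent callers do not clash. */
	static _Thread_local char buf[256];
#ifdef HAVE_STRERROR_R_GNU
	/* GNU: the message is wherever the returned pointer says. */
	return strerror_r(errnum, buf, sizeof(buf));
#else
	/* XSI: zero means the message is at the start of buf. */
	if (strerror_r(errnum, buf, sizeof(buf)) != 0)
		snprintf(buf, sizeof(buf), "Unknown error %d", errnum);
	return buf;
#endif
}
```

Callers must copy the result if they need it to survive the next call
in the same thread, since the buffer is reused.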

NO_DOC=bug fix
NO_TEST=code is covered by existing tests

(cherry picked from commit 44f46dc8)

03ceaafc

Sep 09, 2022

popen: fix a race between setpgrp() and killpg() · 99040255

Alexander Turenko authored 2 years ago

In brief: `vfork()` on Mac OS 12 and newer doesn't suspend the parent
process, so we should wait for `setpgrp()` before using `killpg()`. See a
more detailed description of the problem in the comment to the
`popen_wait_group_leadership()` function.

The solution is to spin in a loop and check the child's process group. It
looks like the simplest and most direct solution. Other possible solutions
require weighing the pros and cons of using an extra file descriptor or
assigning a signal number for the child -> parent communication.
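
A minimal sketch of the spin-and-check idea, not the actual
`popen_wait_group_leadership()` code. It assumes plain `fork()` instead
of `vfork()`, uses `setpgid(0, 0)` in the child, and picks a 1 ms poll
interval and an attempt limit arbitrarily. Both function names are
hypothetical.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Poll until the child `pid` leads its own process group.
 * Returns 0 on success, -1 on error or after 5000 attempts.
 */
int
wait_group_leadership(pid_t pid)
{
	const struct timespec delay = {0, 1000000}; /* 1 ms */
	for (int i = 0; i < 5000; i++) {
		pid_t pgid = getpgid(pid);
		if (pgid == -1)
			return -1;
		if (pgid == pid)
			return 0;
		nanosleep(&delay, NULL);
	}
	return -1;
}

/* Spawn a child, wait for its setpgid(), then kill its group. */
int
spawn_and_kill_group(void)
{
	pid_t pid = fork();
	if (pid == -1)
		return -1;
	if (pid == 0) {
		/* Child: become a group leader, then idle. */
		setpgid(0, 0);
		for (;;)
			pause();
	}
	int rc = wait_group_leadership(pid);
	if (rc == 0)
		rc = killpg(pid, SIGKILL);
	else
		kill(pid, SIGKILL); /* fall back to the single process */
	int status;
	if (waitpid(pid, &status, 0) != pid || rc != 0)
		return -1;
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL ? 0 : -1;
}
```

Without the wait, `killpg()` could run before the child has its own
group and fail or hit the wrong group.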

There are the following alternatives and variations:

* Create a pipe and notify the parent from the child about the
  `setpgrp()` call.

  It costs an extra file descriptor, so I decided not to do that.
  However, if we need some channel to deliver information from the
  child to the parent for another task, it will be worth reimplementing
  this function too.

  One possible place where we may need such a channel is the delivery of
  the child's errors to the parent. Now the child writes them directly to
  the logger's fd, and it requires some tricky code to keep and close the
  descriptor at the right points. Also, it doesn't allow catching those
  errors in the parent, but we may need it for #4925.
* Notify the parent about `setpgrp()` using a signal.

  It seems too greedy to assign a specific signal for such a local
  problem. It is also unclear how to guarantee that it won't break any
  user's code: a user can load a dynamic library, which uses some
  signals on its own.

  However, we can consider using this approach here if we design some
  common interprocess notification system.
* We can use the fiber cond or the `popen_wait_timeout()` function from
  PR #7648 to react to the child termination instantly.

  It would complicate the code and anyway wouldn't allow to react
  instantly on `setpgrp()` in the child.

  Also it assumes yielding during the wait (see below).
* Wait until `setpgrp()` in `popen_send_signal()` instead of
  `popen_new()`.

  It would add yielding/waiting inside `popen_send_signal()` and likely
  will extend a set of its possible exit situations. It is undesirable:
  this function should have simple and predictable behavior.
* Finally, we considered yielding in `popen_wait_group_leadership()`
  instead of sleeping the whole tx thread.

  `<popen handle>:new()` doesn't yield at the moment and a user's code
  may lean on this fact.

  Yielding would allow achieving better throughput (the number of parallel
  requests per second), but we don't care much about performance on
  Mac OS. The primary goal for this platform is to offer the same
  behavior as on Linux to allow development of applications.

I didn't replace `vfork()` with `fork()` on Mac OS, because `vfork()`
works and I don't know the consequences of calling `pthread_atfork()`
handlers in a child created by popen. See the comment in `popen_new()`
near the `vfork()` call: it warns about possible mutex double locks. This
topic will be investigated further in #6674.

Fixes #7658

NO_DOC=fixes incorrect behavior, no need to document the bug
NO_TEST=already tested by app-tap/popen.test.lua

(cherry picked from commit e2207fdc)

99040255

Sep 07, 2022

raft: persist new term and vote separately · 61a07baf

Vladislav Shpilevoy authored 2 years ago

If a node persisted a foreign term + vote request at the same
time, it increased split-brain probability. A node could vote for
a candidate having smaller vclock than the local one. For example,
via the following scenario:

- Node1, node2, node3 are started;
- Node1 becomes a leader;
- The topology becomes node1 <-> node2 <-> node3 due to network
    issues;
- Node1 sends a synchro txn to node2. The txn starts a WAL write;
- Node3 bumps term and votes for self. Sends it all to node2;
- Node2 votes for node3, because their vclocks are equal;
- Node2 finishes all pending WAL writes, including the txn from
    node1. Now its vclock is > node3's one and the vote was wrong.
- Node3 wins, writes PROMOTE, and it conflicts with node1 writing
    CONFIRM.

This patch makes it so that a node can't persist a vote in a new term in
the same WAL write as the term bump. Term bump is written first
and alone. It serves as a WAL sync after which the node's vclock
is not supposed to change except for the 0 (local) component.

The vote requests are re-checked after term bump is persisted to
see if they still can be applied.
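
The ordering can be sketched with a toy model. Everything here is
hypothetical: the vclock is reduced to a single integer, the WAL is a
counter of in-flight rows, and `handle_vote_request()` is not the real
raft code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified node state. */
struct node {
	int term;       /* persisted term */
	int vote;       /* node id voted for in `term`, 0 if none */
	int vclock;     /* scalar stand-in for a real vclock */
	int wal_queue;  /* rows submitted to WAL but not yet written */
};

/* Model of a WAL write: everything queued before it completes first. */
static void
wal_sync(struct node *n)
{
	n->vclock += n->wal_queue;
	n->wal_queue = 0;
}

/* Handle a vote request with a newer term. True if the vote is granted. */
bool
handle_vote_request(struct node *n, int term, int candidate, int cand_vclock)
{
	if (term <= n->term)
		return false;
	/* Step 1: persist the term bump alone; it acts as a WAL sync. */
	wal_sync(n);
	n->term = term;
	n->vote = 0;
	/* Step 2: re-check the request against the settled vclock. */
	if (cand_vclock < n->vclock)
		return false;
	/* Step 3: persist the vote in a separate write. */
	wal_sync(n);
	n->vote = candidate;
	return true;
}

/* Node2 from the scenario, with `in_flight` pending rows from node1. */
int
vote_scenario(int in_flight)
{
	struct node node2 = {.term = 1, .vote = 0, .vclock = 5,
			     .wal_queue = in_flight};
	return handle_vote_request(&node2, 2, 3, 5) ? 1 : 0;
}
```

With one row in flight, the vote for node3 is refused because node2's
vclock has moved ahead by the time the term bump is persisted.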

Part of #7253

NO_DOC=bugfix

(cherry picked from commit c9155ac8)

61a07baf

qsync: fix txn fiber hang on fencing at CONFIRM · 618bafe6

Vladislav Shpilevoy authored 2 years ago

If the limbo was fenced during CONFIRM WAL write, then the
confirmed txn was committed just fine, but its author-fiber kept
hanging. This is because when it was woken up, it checked if the
limbo is frozen and went to infinite waiting before actually
checking if the txn is completed.

As a workaround, the fiber would unfreeze if it was woken up
explicitly.

The fix is simple - change the checks order.
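
A sketch of the reordering, with hypothetical names and result codes
(the real limbo wait code is more involved):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of a wakeup of the txn author fiber. */
enum wait_result { TXN_DONE, WAIT_UNFROZEN, WAIT_CONFIRM };

/* Old order: a frozen limbo hides an already completed txn. */
enum wait_result
on_wakeup_old(bool txn_is_done, bool limbo_is_frozen)
{
	if (limbo_is_frozen)
		return WAIT_UNFROZEN;
	if (txn_is_done)
		return TXN_DONE;
	return WAIT_CONFIRM;
}

/* New order: completion is checked before the frozen state. */
enum wait_result
on_wakeup_new(bool txn_is_done, bool limbo_is_frozen)
{
	if (txn_is_done)
		return TXN_DONE;
	if (limbo_is_frozen)
		return WAIT_UNFROZEN;
	return WAIT_CONFIRM;
}
```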

Part of #7253

NO_DOC=bugfix

(cherry picked from commit ec628100)

618bafe6

promote: abort it when become non-candidate · cbebd024

Vladislav Shpilevoy authored 2 years ago

box.ctl.promote() bumps the term, makes the node a candidate, and
waits for the term outcome. The waiting used to be until there is
a leader elected or the node lost connection quorum or the term
was bumped again.

There was a bug where a node could hang in box.ctl.promote() even
after it became a voter. It could happen if the quorum was still there
and a leader couldn't be elected in the current term at all. For
instance, others could have `election_mode='off'`.

The fix is to stop waiting for the term outcome if the node can't
win anyway.

NO_DOC=bugfix

(cherry picked from commit ab08dad9)

cbebd024

promote: fix infinite elections with multi-promote · b200d298

Vladislav Shpilevoy authored 2 years ago

If box.ctl.promote() was called on more than one instance, then it
could lead to infinite or extremely long elections bumping
thousands of terms in just a few seconds.

This was because box.ctl.promote() used to be a loop. The loop
retried term bump + voted for self until the node won. Retry
happened immediately as the node saw the term was bumped again
and there was no leader elected or the connection quorum was lost.

If 2 nodes would start box.ctl.promote() almost at the same time,
they could bump each other's terms, not see any winner, bump them
again, and so on. For example:

- Node1 term=1, node2 term=2;
- Promote is called on both;
- Node1 term=2, node2 term=3. They receive the messages. Node2
    ignores node1's old term. Node1 term is bumped and it votes
    for node2, but it didn't win, so box.ctl.promote() bumps its
    term to 4.
- Node2 receives term 4 from node1. Its own box.ctl.promote() sees
    the term was bumped and no winner, so it bumps it to 5 and the
    process continues for a long time.

It worked well enough in tests: the problem happened sometimes,
terms could roll like 80k times in a few seconds, but the tests
ended fine anyway.

One of the next commits will make term bump + vote written in
separate WAL records. That aggravates the problem drastically.

Basically, this mutual term bump loop could end only if one node
would receive vote for self from another node and send back the
message 'I am a leader' before the other node's box.ctl.promote()
notices the term was bumped externally. This will get much harder
to achieve.

The patch simply drops the loop. Let box.ctl.promote() fail if the
term was bumped outside.

There was an alternative to keep running it in a loop with a
randomized election timeout like it works inside of raft. But the
current solution is just simpler.

NO_DOC=bugfix
NO_TEST=election_split_vote_test.lua catches it already

(cherry picked from commit dd89c57e)

b200d298

Sep 06, 2022

ci: add RedOS 7.3 rpm package build (x86_64) · 705f0e51

Sergey Vorontsov authored 2 years ago

Add the redos_7.3.yml workflow to build Tarantool packages (x86_64) for
the RedOS 7.3 system.

Packages are created by https://github.com/packpack/packpack.

NO_DOC=ci
NO_TEST=ci

(cherry picked from commit a6b48f14)

705f0e51

Sep 05, 2022

uri: fix resolve with only port specification · 399bea26

Ilya Grishnov authored 2 years ago

Supplemented the implementation of the `src/lib/uri` parser.
Before this fix, the call `uri.parse(uri.format(uri.parse(3301)))`
returned an 'Incorrect URI' error.
Now this call returns the correct `service: '3301'`.
As a result, the ability to use host=localhost by default
for `tarantoolctl connect` has been restored,
as well as for `console.connect`.

Fixes #7479

NO_DOC=bugfix

(cherry picked from commit 96d8dcec)

399bea26

test: always perform assertions in module API test · 3247d3d1

Alexander Turenko authored 2 years ago

This commit pursues several goals:

* Eliminate unused parameter/variable warnings at building module_api.c
  in non-debug configuration. The problem was introduced in commit
  5c1bc3da ("decimal: add the library into the module API").
* Eliminate the need to check newly added tests in two build
  configurations (Debug and RelWithDebInfo) and to remember to add
  `(void)x;` statements in addition to a test condition check.
* Fail the testing if conditions required by the
  app-tap/module_api.test.lua test are not met -- not only in the Debug
  build, but also in RelWithDebInfo.
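
A sketch of the kind of always-on check these goals imply. The macro
name `check_always` is hypothetical and is not taken from
module_api.c; the point is only that, unlike `assert()`, the condition
is evaluated regardless of `NDEBUG`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical macro: unlike assert(), it is not disabled by NDEBUG. */
#define check_always(expr) do {						\
	if (!(expr)) {							\
		fprintf(stderr, "%s:%d: check failed: %s\n",		\
			__FILE__, __LINE__, #expr);			\
		abort();						\
	}								\
} while (0)

int
parse_digit(char c)
{
	int rc = (c >= '0' && c <= '9') ? 0 : -1;
	/*
	 * With assert(rc == 0), `rc` would be set but unused in a
	 * non-debug build unless `(void)rc;` were added by hand.
	 */
	check_always(rc == 0);
	return c - '0';
}
```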

Fixes #7625

NO_DOC=a change in a test, purely development matter
NO_CHANGELOG=see NO_DOC

(cherry picked from commit aaf3bf91)

3247d3d1

box: fix high CPU usage while on_shutdown triggers are running · 69a8a649

Ilya Verbin authored 2 years ago


Currently this script causes 100% CPU usage for 10 sec, because
os.exit() infinitely yields to the scheduler until on_shutdown
fiber completes and breaks the event loop. Fix this by a sleep.

```
box.ctl.set_on_shutdown_timeout(100)
box.ctl.on_shutdown(function() require('fiber').sleep(10) end)
os.exit()
```

Closes #6801

NO_DOC=bugfix
NO_TEST=don't know how to catch this by a test

Co-authored-by: Georgy Moshkin <louielouie314@gmail.com>
(cherry picked from commit 6d91e44b)

69a8a649

main: run an event loop for on_shutdown triggers · 53347bcf

Ilya Verbin authored 2 years ago

When Tarantool is stopped by Ctrl+D or by reaching the end of the
script, run_script_f() breaks the event loop, then tarantool_exit()
is called from main(). However, the fibers that execute on_shutdown
triggers can no longer be scheduled, because the event loop is
already stopped. Fix this by starting an auxiliary event loop for
such cases.

Closes #7434

NO_DOC=bugfix

(cherry picked from commit cdd5674c)

53347bcf

Sep 02, 2022

Revert "log: free resources while event loop is running" · 8a97dccd

Vladimir Davydov authored 2 years ago

This reverts commit 0c3f9b37.

If log_destroy and log_boot use the same fd (STDERR_FILENO), say()
called after say_logger_free() will write to a closed fd. What's worse,
the fd may be reused, in which case say() will write to a completely
unrelated file or socket (maybe a data file!). This is what happened
with flightrec - flightrec finalization info message was written to
an xlog file. Let's move say_logger_free() back to where it belongs -
after other subsystems have been finalized.
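
The fd reuse hazard can be reproduced in isolation. This is a sketch
with pipes standing in for the log and the data file; the function name
is hypothetical, and the sketch assumes a single-threaded process so
that no other descriptor is opened in between.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a write through a stale "log" descriptor lands in an
 * unrelated pipe, 0 if the number was not reused, -1 on a setup error.
 */
int
stale_fd_write_demo(void)
{
	int logp[2], datap[2];
	if (pipe(logp) != 0)
		return -1;
	if (pipe(datap) != 0)
		return -1;
	int log_fd = logp[1];
	close(log_fd);               /* the logger is "freed" */
	int data_fd = dup(datap[1]); /* lowest free number: log_fd again */
	int hit = 0;
	if (data_fd == log_fd) {
		/* A late log call still uses the old number. */
		const char msg[] = "late log line";
		char buf[sizeof(msg)] = {0};
		if (write(log_fd, msg, sizeof(msg)) == (ssize_t)sizeof(msg) &&
		    read(datap[0], buf, sizeof(buf)) == (ssize_t)sizeof(msg))
			hit = strcmp(buf, msg) == 0;
	}
	close(logp[0]);
	close(datap[0]);
	close(datap[1]);
	if (data_fd >= 0)
		close(data_fd);
	return hit;
}
```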

Since 2.10.2 was released, this commit also adds a changelog.

Reopens #4450
Needed for https://github.com/tarantool/tarantool-ee/issues/223

NO_DOC=bug fix
NO_TEST=revert

(cherry picked from commit 5cb688ed)

8a97dccd

Sep 01, 2022

Generate changelog for 2.10.2 · b924f0b4

Kirill Yukhin authored 2 years ago

Generate changelog for 2.10.2 release.
Also, clean changelogs/unreleased folder.

NO_DOC=no code changes
NO_TEST=no code changes
NO_CHANGELOG=no code changes

b924f0b4

doc: proofread 2.10.2 changelogs · 116a9fe4

Pavel Semyonov authored 2 years ago

Fix wording, punctuation, and formatting.

NO_CHANGELOG=changelog
NO_DOC=changelog
NO_TEST=changelog

116a9fe4

Aug 31, 2022

box: fix unauthorized inserts into _truncate table · 01c9ea9e

Nikolay Shirokovskiy authored 2 years ago

A non-privileged user (through the public role) has write access to the
_truncate table in order to be able to truncate their own tables. Normally
such a user should be able to modify records only for the tables they have
write access to. Yet currently, due to the bootstrap check, this is not so.

Closes tarantool/security#5

NO_DOC=bugfix

(cherry picked from commit 941318e7)

01c9ea9e

box: make part simplicity check easier · f2b8d63b

Nikolay Shirokovskiy authored 2 years ago

Simple part is a part without any extra key besides 'field' and 'type'.
Let's make a check in try_simplify_index_parts itself.

NO_TEST=refactoring
NO_DOC=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit bc0872fd)

f2b8d63b

box: fix inheriting format options for old-style parts · 4deb7663

Nikolay Shirokovskiy authored 2 years ago

If index parts are specified using old syntax like:

	parts = {1, 'number', 2, 'string'},

then (unless the part count is 1) index options set in the space format
are not taken into account. The solution is to continue after parsing
1.6.0-style parts so that the code that checks format options is used.

Closes #7614

NO_DOC=bugfix

(cherry picked from commit 91ba0a59)

4deb7663

Aug 30, 2022

core: mark some internal fibers as system ones · 00fab37c

Nikita Zheleztsov authored 2 years ago

Currently internal tarantool fibers can be cancelled from the user's app,
which can lead to critical errors.

Let's mark these fibers as system ones in order to be sure that they
won't be cancelled from the Lua world.

Closes #7448
Closes #7473

NO_DOC=minor change

(cherry picked from commit 3733ff25)

00fab37c

core: introduce system fiber · cbe833ac

Nikita Zheleztsov authored 2 years ago

There are a number of internal system fibers which are not supposed to
be cancelled.

Let's introduce the `FIBER_IS_SYSTEM` flag, which indicates whether the
fiber can be explicitly killed. If this flag is set, killing functions will
just ignore the cancellation request.
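
A sketch of the flag check. The flag values, `struct fiber_sketch`,
and `fiber_cancel_sketch()` are hypothetical; only the
`FIBER_IS_SYSTEM` name comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values and fiber layout, for illustration only. */
enum {
	FIBER_IS_CANCELLED = 1 << 0,
	FIBER_IS_SYSTEM    = 1 << 1,
};

struct fiber_sketch {
	unsigned flags;
};

/* Returns true if the cancellation request was accepted. */
bool
fiber_cancel_sketch(struct fiber_sketch *f)
{
	if (f->flags & FIBER_IS_SYSTEM)
		return false; /* system fibers ignore the request */
	f->flags |= FIBER_IS_CANCELLED;
	return true;
}

/* Returns 1 if a fiber with `initial_flags` ends up cancelled. */
int
cancel_demo(unsigned initial_flags)
{
	struct fiber_sketch f = {initial_flags};
	fiber_cancel_sketch(&f);
	return (f.flags & FIBER_IS_CANCELLED) != 0;
}
```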

This commit blocks system fiber cancellation only from the Lua
public API, as it is more important to have it right. The prohibition to
cancel fibers from the C API will be introduced later.

Related to #7448
Part of #7473

NO_DOC=internal
NO_TEST=will be added in subsequent commit
NO_CHANGELOG=internal

(cherry picked from commit 3a18a9bf)

cbe833ac

Aug 26, 2022

ci: use `ubuntu-latest` instead of `ubuntu-18.04` · 48a3ecda

Yaroslav Lobankov authored 2 years ago

The `ubuntu-18.04` environment is deprecated, so let's switch to
`ubuntu-latest` where it is safe. For more details see [1].

[1] https://github.com/actions/virtual-environments/issues/6002

NO_DOC=ci
NO_TEST=ci
NO_CHANGELOG=ci

(cherry picked from commit 4572a584)

48a3ecda

Aug 25, 2022

Fix a bug in qsort · 1fbd446e

Aleksandr Lyapunov authored 2 years ago

In commit 35334ca1 qsort was fixed, but unfortunately a small
typo was introduced. Due to that typo qsort worked incorrectly.

Fix the problem and add unit test for qsort.

Unfortunately the test right from the issue runs extremely long,
so it should go to long-tests.

Closes #7605

NO_DOC=bugfix

(cherry picked from commit e1d96170)

1fbd446e

say: introduce on_log_level routines · 0cb1d0b8

Nikita Pettik authored 2 years ago

Let's introduce the on_log_level static variable, which is assumed to be
configured in `say_set_log_callback()`. on_log_level is the
log level of the `log->on_log` callback (i.e. if the entry to be logged has
a higher log level, it is simply skipped). Note that the ordinary log_level
is now calculated as MAX(level, on_log_level), since log_level is the single
guard for passing execution flow to `log_vsay()`, where both things (to
be precise, on_log callback invocation and ordinary logging) happen.

This change is required because if log_level is lower than
on_log_level, the on_log callback would otherwise be skipped.
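
A sketch of the guard logic, assuming that larger level numbers mean
more verbose messages. The threshold values and function names are
hypothetical and do not mirror the real `say` internals.

```c
#include <assert.h>

/* Hypothetical thresholds; larger numbers mean more verbose messages. */
static const int log_level = 2;    /* ordinary logging */
static const int on_log_level = 5; /* on_log callback */

static int
max_int(int a, int b)
{
	return a > b ? a : b;
}

/* The single guard in front of the slow path. */
int
say_guard_passes(int msg_level)
{
	return msg_level <= max_int(log_level, on_log_level);
}

/* Bit 0: written to the log; bit 1: passed to the callback. */
int
log_vsay_sketch(int msg_level)
{
	int out = 0;
	if (!say_guard_passes(msg_level))
		return out;
	if (msg_level <= log_level)
		out |= 1;
	if (msg_level <= on_log_level)
		out |= 2;
	return out;
}
```

A level 4 message reaches only the callback here. If the guard used
log_level alone, it would be dropped before the callback could see it.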

NO_DOC=<Internal change>
NO_TEST=<Internal change>
NO_CHANGELOG=<Internal change>

(cherry picked from commit 3d39d23a)

0cb1d0b8

say: get rid of say_log_level() macro family · 2d0fdb6a

Nikita Pettik authored 2 years ago

The macros are unused and misleading. Let's remove them so that we have
a single entry point for the log subsystem: `say()`.

NO_DOC=<Refactoring>
NO_CHANGELOG=<Refactoring>
NO_TEST=<Refactoring>

(cherry picked from commit fbfa5aaf)

2d0fdb6a

core: fix crashes after altering trigger list while it is run · 702d2c39

Serge Petrenko authored 2 years ago

This patch fixes a number of issues with trigger_clear() while the
trigger list is being run:
1) clearing the next-to-be-run trigger doesn't prevent it from being run
2) clearing the next-to-be-run trigger causes an infinite loop or a
   crash
3) swapping trigger list head before the last trigger is run causes an
   infinite loop or a crash (see space_swap_triggers() in alter.cc, which
   had worked all this time by miracle: space _space on_replace trigger
   swaps its own head during local recovery, and that had only worked
   because the trigger by luck was the last to run)

This is fixed by adding triggers to a separate run list on trigger_run.
This list may be iterated by `rlist_shift_entry`, which doesn't suffer
from any of the problems mentioned above.

While being bad in a number of ways, the old approach supported a
practically unlimited number of concurrent trigger_runs for the same
trigger list. The new approach requires the trigger to be in as many run
lists as there are concurrent trigger_runs, which results in quite a big
refactoring.
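
A simplified sketch of the run-list idea. It supports only one
trigger_run at a time (the real patch handles concurrent runs), and the
list helpers and `struct trigger` layout are hypothetical stand-ins for
rlist and the real structure.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of rlist. */
struct link { struct link *prev, *next; };

static void
link_init(struct link *l) { l->prev = l->next = l; }

static void
link_del(struct link *l)
{
	l->prev->next = l->next;
	l->next->prev = l->prev;
	link_init(l);
}

static void
link_add_tail(struct link *head, struct link *l)
{
	l->prev = head->prev;
	l->next = head;
	head->prev->next = l;
	head->prev = l;
}

struct trigger {
	struct link in_list; /* membership in the trigger list */
	struct link in_run;  /* membership in the current run list */
	void (*run)(struct trigger *self);
};

#define trigger_of(ptr, member) \
	((struct trigger *)((char *)(ptr) - offsetof(struct trigger, member)))

void
trigger_clear(struct trigger *t)
{
	link_del(&t->in_list);
	link_del(&t->in_run); /* also drops it from a run in progress */
}

void
trigger_run(struct link *list)
{
	struct link run;
	link_init(&run);
	for (struct link *l = list->next; l != list; l = l->next)
		link_add_tail(&run, &trigger_of(l, in_list)->in_run);
	/* Shift entries one by one: no saved "next" pointer can go stale. */
	while (run.next != &run) {
		struct trigger *t = trigger_of(run.next, in_run);
		link_del(&t->in_run);
		t->run(t);
	}
}

static struct trigger t1, t2, t3;
static int calls;

static void
count(struct trigger *self) { (void)self; calls++; }

static void
clear_t2_and_count(struct trigger *self)
{
	(void)self;
	trigger_clear(&t2); /* t2 is the next-to-be-run trigger */
	calls++;
}

/* Returns how many triggers ran when t1 clears t2 during the run. */
int
trigger_demo(void)
{
	struct link list;
	struct trigger *all[] = {&t1, &t2, &t3};
	link_init(&list);
	for (int i = 0; i < 3; i++) {
		link_init(&all[i]->in_run);
		link_add_tail(&list, &all[i]->in_list);
		all[i]->run = count;
	}
	t1.run = clear_t2_and_count;
	calls = 0;
	trigger_run(&list);
	return calls;
}
```

In the demo t1 and t3 run and the cleared t2 is skipped, with no stale
pointer followed.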

Add a luatest-based test and a unit test.

Closes #4264

NO_DOC=bugfix

(cherry picked from commit 607cb553)

702d2c39

core: add a trigger initializer macro · d9debdd6

Serge Petrenko authored 2 years ago

struct trigger is about to get a new field, and it's mandatory that this
field is specified in all initializers. Let's introduce a macro to avoid
adding every new field to all the initializers and at the same time keep
the benefits of static initialization.
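
A sketch of such an initializer macro. The struct layout, the
`run_count` field, and the macro name are hypothetical; they only show
how a new field can be added in one place while keeping static
initialization.

```c
#include <assert.h>
#include <stddef.h>

struct trigger_sketch {
	int (*run)(struct trigger_sketch *self);
	void *data;
	void (*destroy)(struct trigger_sketch *self);
	int run_count; /* stands in for the newly added field */
};

/* Hypothetical macro: callers name only the fields they care about. */
#define TRIGGER_SKETCH_INIT(run_, data_) {	\
	.run = (run_),				\
	.data = (data_),			\
	.destroy = NULL,			\
	.run_count = 0,				\
}

static int
noop(struct trigger_sketch *self)
{
	(void)self;
	return 0;
}

/* Static initialization still works, with every field set. */
static struct trigger_sketch on_event = TRIGGER_SKETCH_INIT(noop, NULL);

int
initializer_demo(void)
{
	return on_event.run(&on_event) == 0 && on_event.run_count == 0 &&
	       on_event.destroy == NULL;
}
```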

Also, while we're at it, fix `lbox_trigger_reset` setting all trigger
fields manually.

Part-of #4264

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring

(cherry picked from commit 2040d1f9)

d9debdd6

core: refactor trigger_fiber_run · b6d2494a

Serge Petrenko authored 2 years ago

Make trigger_fiber_run return an error, when it occurs, so that the
calling code decides how to log it.
Also, while I'm at it, simplify trigger_fiber_run's code a bit.

In-scope-of #4264

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit ca59d305)

b6d2494a

core: introduce cord_exit() function · e64a01b3

Serge Petrenko authored 2 years ago

cord_exit should always be called in the exiting thread. It's a single
place to call all the thread-specific module deinitialization routines.

In-scope-of #4264

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring

(cherry picked from commit 35b724c0)

e64a01b3

test: make unit.h self sufficient · f9bfabfe

Serge Petrenko authored 2 years ago

Unit test compilation with `#define UNIT_TAP_COMPATIBLE 1` might fail
with an error complaining that <stdarg.h> is not included. Fix this.

In-scope-of #4264

NO_CHANGELOG=testing stuff
NO_DOC=testing stuff

(cherry picked from commit b9fd4557)

f9bfabfe

core: add a test for recursive trigger invocation · e543004b

Serge Petrenko authored 2 years ago

Our triggers support recursive invocation: for example, an on_replace
trigger on a space may do a replace in the same space.

However, this is not tested and might get broken easily. Let's add a
corresponding test.

In-scope-of #4264

NO_DOC=testing
NO_CHANGELOG=testing

(cherry picked from commit bf852b41)

e543004b

Aug 24, 2022

ci: report PRs from TarantoolBot to main chat · 6e33cb0a

Nick Volynkin authored 2 years ago

Report workflow failures in PRs made by TarantoolBot to the same chat
as with stable branches. Such PRs are used for automated integration
testing, so it's important for the team to notice failures in them.
There is no personal chat for TarantoolBot and no need to make one.

NO_DOC=CI reporting
NO_TEST=CI reporting
NO_CHANGELOG=CI reporting

(cherry picked from commit 8ebfa611)

6e33cb0a

Aug 23, 2022

ci: improve `report-job-status` action · 0374211b

Anna Balaeva authored 2 years ago

This patch allows calling the `report-job-status` action with only one
input: `bot-token`. The VK Teams chat ID has a default value in the
current action, and the API URL has a default value in [1].

[1] `tarantool/actions/report-job-status`

NO_DOC=ci
NO_TEST=ci
NO_CHANGELOG=ci

(cherry picked from commit b07d00a3)

0374211b

console: fix multiline commands saved as oneline · ed14f68d

Gleb Kashkin authored 2 years ago

When multiline commands were loaded from .tarantool_history, they were
treated as a bunch of oneline commands. Now readline is configured to
write timestamps in .tarantool_history as delimiters and multiline
commands are handled correctly.

If there is already a .tarantool_history file, readline will set
timestamps automatically, nothing will be lost.

Closes #7320
NO_DOC=bugfix
NO_TEST=impossible to check readline history from lua

(cherry picked from commit d2271ec0)

ed14f68d

Aug 22, 2022

ci: improve `report-job-status` README · f38a36ce

Anna Balaeva authored 2 years ago

This patch adds a warning about using the action in pull requests
created from forks. GitHub secrets are not passed to workflows in
this case [1] and due to this limitation the action will not work
correctly for such pull requests.

[1] https://docs.github.com/en/actions/security-guides/encrypted-secrets#using-encrypted-secrets-in-a-workflow

NO_DOC=ci
NO_TEST=ci
NO_CHANGELOG=ci

(cherry picked from commit b7cb1421)

f38a36ce

Aug 19, 2022

test: fix flakiness in app-luatest/http_client_test.lua · 6b0587fd

Sergey Bronnikov authored 2 years ago

The problem could be easily reproduced with following command line:
./test/test-run.py
	$(yes app-luatest/http_client_test.lua | head -n 1000).

Before this commit we bound a socket to find a free network port,
then closed the socket and started httpd.py on that network port. However,
this was not reliable, and even with the SO_REUSEPORT socket option the
start of httpd.py failed. With the proposed patch the scheme is changed: we
start httpd.py and pass only a socket family (AF_INET for a TCP connection
and AF_UNIX for a connection via a Unix socket) and then read the output
from the process. A successfully started httpd.py prints a path to a Unix
socket or a pair of IP address and network port separated by ":".
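
One common way for a server to pick its own port and report it is to
bind to port 0 and ask the kernel which port was assigned. The C sketch
below illustrates that idea only; it is an assumption about the
approach, not a description of what httpd.py (a Python script) actually
does, and the function names are hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a listening TCP socket to an OS-chosen port on 127.0.0.1.
 * Returns the socket and stores the port in *port, or -1 on error.
 */
int
listen_on_free_port(unsigned short *port)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;
	struct sockaddr_in addr;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0; /* let the kernel pick a free port */
	socklen_t len = sizeof(addr);
	if (bind(fd, (struct sockaddr *)&addr, len) != 0 ||
	    listen(fd, 16) != 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &len) != 0) {
		close(fd);
		return -1;
	}
	*port = ntohs(addr.sin_port);
	return fd;
}

/* Returns 1 if a non-zero port was assigned, -1 on error. */
int
free_port_demo(void)
{
	unsigned short port = 0;
	int fd = listen_on_free_port(&port);
	if (fd < 0)
		return -1;
	close(fd);
	return port != 0;
}
```

Because the socket is never closed between choosing the port and
listening on it, there is no window for another process to grab it.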

With the proposed patch the test has passed 1000 times without any problems.
Tests previously marked as "fragile" pass too:

./test/test-run.py --builddir=$(pwd)/build box-tap/net.box.test.lua \
	box-tap/cfg.test.lua box-tap/session.storage.test.lua \
	box-tap/session.test.lua app-tap/tarantoolctl.test.lua \
	app-tap/debug.test.lua app-tap/inspector.test.lua \
	app-tap/logger.test.lua app-tap/transitive1.test.lua \
	app-tap/csv.test.lua app-luatest/http_client_test.lua

P.S. The problem with "fragile" tests is that rerunning hides other
problems. [1] is about "Address already in use" and [2] is about hangs
in tests. I made a pull request with changes in the http client module and
triggered a CI run. The job passed, but in the log [3] I see three test
restarts due to failures in the http_client test related to my changes.

1. https://github.com/tarantool/tarantool-qa/issues/186
2. https://github.com/tarantool/tarantool-qa/issues/31
3. https://github.com/tarantool/tarantool/runs/7726358823?check_suite_focus=true

Closes https://github.com/tarantool/tarantool-qa/issues/186
Closes https://github.com/tarantool/tarantool-qa/issues/31

NO_CHANGELOG=testing
NO_DOC=testing
NO_TEST=testing

(cherry picked from commit 02fae15a)

6b0587fd

Aug 17, 2022

replication: fix downstream lag growing when there's no new transactions · 2cf7350e

Serge Petrenko authored 2 years ago

downstream lag is the difference in time between the moment a
transaction was written to master's WAL and the moment an ack for it
arrived.

Its calculation is supported by replicas sending the last applied row
timestamp. When there is no replication, the last applied row timestamp
stays the same, so in this case downstream lag grows as time passes.

Once an old master is replaced by a new one, it notices changes in peer
vclocks and tries to update downstream lag unconditionally. This makes
the lag appear to be growing indefinitely, showing the time since the
last transaction on the old master:

```
 downstream:
   status: follow
   idle: 0.018218606001028
   vclock: {1: 3, 2: 2}
   lag: 34.623061401367
```

The commit 56571d83 ("raft: make followers notice leader hang")
made relay exchange information with tx even when there are no new
transactions, so the issue became even easier to reproduce.

The issue itself was present since downstream lag introduction in commit
29025bce ("relay: provide information about downstream lag").

Closes #7581

NO_DOC=bugfix

(cherry picked from commit a167a070)

2cf7350e

log: free resources while event loop is running · 7afd7efd

Cyrill Gorcunov authored 2 years ago


The 'log' module uses fibers internally for log rotation, and
before we can free the log's resources (on program exit) we need to wait
until rotation is complete, which implies that the event loop is still
running. But we break the event loop in the `on_shutdown_f` trigger, and
calling any event-based functionality later causes unexpected results
because fibers are no longer valid to use. Thus move the `say_logger_free`
call into the `on_shutdown_f` body, where fibers are still alive.

N.B. Testing the issue is sensitive to timings: local tests showed
that a minimal delay of 1 ms is enough to trigger it, thus ERRINJ_LOG_ROTATE
is increased.

Fixes #4450

NO_DOC=bugfix

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
(cherry picked from commit 0c3f9b37)

7afd7efd