- Dec 25, 2020
-
-
Sergey Bronnikov authored
The serpent module was dropped in commit b53cb2ae ("console: drop unused serpent module"), but a comment that belonged to the module was left in the luacheck config.
-
Sergey Bronnikov authored
Closes #5454
Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Reviewed-by: Igor Munkin <imun@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Co-authored-by: Igor Munkin <imun@tarantool.org>
-
Sergey Bronnikov authored
Closes #5453
Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Reviewed-by: Igor Munkin <imun@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Co-authored-by: Igor Munkin <imun@tarantool.org>
-
Serge Petrenko authored
Follow-up #5435
-
Serge Petrenko authored
We designed the limbo so that it errors on receiving a CONFIRM or ROLLBACK for another instance's data. Actually, this error is pointless, and even harmful. Here's why: imagine you have 3 instances: 1, 2 and 3. First, 1 writes some synchronous transactions but dies before writing CONFIRM. Now 2 has to write CONFIRM instead of 1 to take limbo ownership. From now on 2 is the limbo owner, and under high enough load it constantly has some data in the limbo. Once 1 restarts, it first recovers its xlogs and fills its limbo with its own unconfirmed transactions from the previous run. Now replication between 1, 2 and 3 is started, and the first thing 1 sees is that 2 and 3 ack its old transactions. So 1 writes a CONFIRM for its own transactions even before the same CONFIRM written by 2 reaches it. Once the CONFIRM written by 1 is replicated to 2 and 3, they error and stop replication, since their limbo contains entries from 2, not from 1. Actually, there's no need to error, since it's just a really old CONFIRM which has already been processed by both 2 and 3. So, ignore CONFIRM/ROLLBACK when it references a wrong limbo owner. The issue was discovered with the test replication/election_qsync_stress. Follow-up #5435
-
Serge Petrenko authored
The test involves writing synchronous transactions on one node and making the other nodes confirm these transactions after its death. In order for the test to work properly, we need to make sure the old node replicates all its transactions to its peers before killing it. Otherwise, once the node is resurrected, it'll have newer data not present on the other nodes, which leads to their vclocks being incompatible, no one becoming the new leader, and the test hanging. Follow-up #5435
-
Serge Petrenko authored
It is possible that a new leader (elected either via raft, or manually, or via some user-written election algorithm) loses the data that the old leader has successfully committed and confirmed. Imagine such a situation: there are N nodes in a replicaset, and the old leader, denoted A, tries to apply some synchronous transaction. It is written on the leader itself and on N/2 other nodes, one of which is B. The transaction has thus gathered a quorum, N/2 + 1 acks. Now A writes CONFIRM and commits the transaction, but dies before the confirmation reaches any of its followers. B is elected the new leader, and it sees that A's last transaction is present on N/2 nodes, so it doesn't have a quorum (A was one of the N/2 + 1). The current `clear_synchro_queue()` implementation makes B roll the transaction back, leading to a rollback after commit, which is unacceptable. To fix the problem, make `clear_synchro_queue()` wait until all the rows from the previous leader gather `replication_synchro_quorum` acks. In case the quorum isn't achieved within replication_synchro_timeout, roll back nothing and wait for the user's intervention. Closes #5435
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
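As a usage sketch of the reworked behavior on the new leader (assuming `box.ctl.clear_synchro_queue()` is the Lua-visible entry point, and with illustrative values for the quorum and timeout):

```
-- On the instance taking over leadership: tune how many acks the old
-- leader's pending rows must gather and how long to wait for them.
box.cfg{
    replication_synchro_quorum = 2,
    replication_synchro_timeout = 30,
}
-- Waits for the quorum; on timeout nothing is rolled back and the
-- queue is left for the user's intervention, as described above.
box.ctl.clear_synchro_queue()
```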
-
Serge Petrenko authored
It'll be useful for the box_clear_synchro_queue() rework. Prerequisite #5435
-
Vladislav Shpilevoy authored
The trigger is fired every time any of the relays notifies tx of a change in the replica's known vclock. The trigger will be used to collect a synchronous transaction quorum for the old leader's transactions. Part of #5435
-
Serge Petrenko authored
clear_synchro_queue() isn't meant to be called multiple times on a single instance. Multiple simultaneous invocations of clear_synchro_queue() shouldn't hurt now, since clear_synchro_queue() simply exits on an empty limbo, but they may be harmful in the future, when clear_synchro_queue() is reworked. Prohibit such misuse by introducing an execution guard and raising an error once a duplicate invocation is detected. Prerequisite #5435
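The guard itself lives in the C implementation; a Lua sketch of the same pattern, with hypothetical names, purely for illustration:

```
-- Hypothetical illustration of the execution guard described above.
local in_clear_synchro_queue = false

local function clear_synchro_queue()
    if in_clear_synchro_queue then
        error("clear_synchro_queue() is already running")
    end
    in_clear_synchro_queue = true
    -- ... drain the limbo here ...
    in_clear_synchro_queue = false
end
```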
-
Sergey Bronnikov authored
Closes #5538
-
Sergey Bronnikov authored
For Python 3, PEP 3106 changed the design of the dict builtin and the mapping API in general, replacing the separate list-based and iterator-based APIs of Python 2 with a merged, memory-efficient set- and multiset-view based API. This new style of dict iteration was also added to the Python 2.7 dict type as a new set of iteration methods. PEP 469 [1] recommends replacing d.iteritems() with iter(d.items()) to make code compatible with Python 3.
1. https://www.python.org/dev/peps/pep-0469/
Part of #5538
-
Sergey Bronnikov authored
The largest change in Python 3 is the handling of strings. In Python 2, the str type was used for two different kinds of values, text and bytes, whereas in Python 3 these are separate and incompatible types. The patch converts strings to byte strings where required to make the tests compatible with Python 3. Part of #5538
-
Sergey Bronnikov authored
In Python 2.x, calling items() makes a copy of the keys that you can iterate over while modifying the dict. This doesn't work in Python 3.x, because items() returns an iterator instead of a list, and Python 3 raises a "dictionary changed size during iteration" exception. To work around it, one can use list() to force a copy of the keys to be made. Part of #5538
-
Sergey Bronnikov authored
- convert the print statement to a function: in Python 3, 'print' becomes a function, see [1]; the patch makes 'print' in the regression tests compatible with Python 3.
- according to PEP 8, mixing double and single quotes in a project looks inconsistent; the patch makes the quoting of strings consistent.
- use "format()" instead of "%" everywhere.
1. https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function
Part of #5538
-
Serge Petrenko authored
Report box.stat().*.total, box.stat.net().*.total and box.stat.net().*.current via the feedback daemon report. Accompany this data with the time when the report was generated, so that it is possible to calculate RPS from this data on the feedback server. `box.stat().OP_NAME.total` resides in `feedback.stats.box.OP_NAME.total`, while `box.stat.net().OP_NAME.total` resides in `feedback.stats.net.OP_NAME.total`. The time of report generation is located at `feedback.stats.time`. Closes #5589
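For instance, the server side could derive RPS from two consecutive decoded reports like this (a sketch, assuming `feedback.stats.time` is a numeric timestamp in seconds):

```
-- Given two decoded reports, compute requests per second for one
-- operation, e.g. rps(prev, cur, 'SELECT').
local function rps(prev, cur, op)
    local dt = cur.stats.time - prev.stats.time
    return (cur.stats.box[op].total - prev.stats.box[op].total) / dt
end
```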
-
- Dec 24, 2020
-
-
Cyrill Gorcunov authored
We have a feedback server which gathers information about a running instance. While general info is enough for now, we may lose precious information about crashes (such as the call backtrace which caused the issue, the build type, etc). In this commit we add support for sending this kind of information to the feedback server. Internally we gather the reason of the failure, pack it into base64 form and then run another Tarantool instance which sends it out. A typical report might look like

 | {
 |   "crashdump": {
 |     "version": "1",
 |     "data": {
 |       "uname": {
 |         "sysname": "Linux",
 |         "release": "5.9.14-100.fc32.x86_64",
 |         "version": "#1 SMP Fri Dec 11 14:30:38 UTC 2020",
 |         "machine": "x86_64"
 |       },
 |       "build": {
 |         "version": "2.7.0-115-g360565efb",
 |         "cmake_type": "Linux-x86_64-Debug"
 |       },
 |       "signal": {
 |         "signo": 11,
 |         "si_code": 0,
 |         "si_addr": "0x3e800004838",
 |         "backtrace": "#0 0x630724 in crash_collect+bf\n...",
 |         "timestamp": "2020-12-23 14:42:10 MSK"
 |       }
 |     }
 |   }
 | }

There is no simple way to test this, so I did it manually:
1) Run an instance with box.cfg{log_level = 8, feedback_host = "127.0.0.1:1500"}
2) Run a listener shell as: while true ; do nc -l -p 1500 -c 'echo -e "HTTP/1.1 200 OK\n\n $(date)"'; done
3) Send SIGSEGV: kill -11 `pidof tarantool`
Once SIGSEGV is delivered, the crashinfo data is generated and sent out. For debug purposes this data is also printed to the terminal on the debug log level.
Closes #5261
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
@TarantoolBot document
Title: Configuration update, allow to disable sending crash information
For better analysis of program crashes, the information associated with the crash, such as:
- utsname (similar to `uname -a` output except the network name)
- build information
- reason for the crash
- call backtrace
is sent to the feedback server. To disable it, set `feedback_crashinfo` to `false`.
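Per the docbot request above, opting out is a single dynamic configuration change:

```
-- Keep the feedback daemon running but stop sending crash dumps.
box.cfg{feedback_crashinfo = false}
```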
-
Cyrill Gorcunov authored
When SIGSEGV or SIGFPE reaches tarantool, we try to gather all information related to the crash and print it out to the console (well, stderr actually). Still, there is a request to not just show this info locally but to send it out to the feedback server. Thus, to keep the gathering of crash-related information in one module, we move fatal signal handling into the separate crash.c file. This allows us to collect the data we need in one place and reuse it when we need to send reports to stderr (and to the feedback server, which will be implemented in the next patch). Part-of #5261
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
This will allow reusing this routine in crash reports. Part-of #5261
Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
It is very convenient to have this string extension. We will use it in crash handling.
Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Sergey Nikiforov authored
Added a corresponding test.
Fixes: #5307
-
Alexander V. Tikhonov authored
A Fedora 32 gitlab-ci packaging job was added in commit 507c47f7a829581cc53ba3c4bd6a5191d088cdf ("gitlab-ci: add packaging for Fedora 32"), but it also had to be enabled in the update_repo tool to be able to save packages in S3 buckets. Follows up #4966
-
Cyrill Gorcunov authored
Part-of #5446
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
When we fetch the replication_synchro_quorum value (either as a plain integer or via formula evaluation), we trim the number down to an integer, which silently hides potential overflow errors. For example,

 | box.cfg{replication_synchro_quorum = '4294967297'}

is 1 in terms of machine words. Let's use 8-byte values and trigger an error instead. Part-of #5446
Reported-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
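A sketch of the behavior change (the exact error text is not specified here, so treat it as an assumption):

```
-- 4294967297 is 2^32 + 1: it used to be silently truncated to 1;
-- with 8-byte handling the out-of-range value makes box.cfg fail.
local ok, err = pcall(box.cfg, {replication_synchro_quorum = '4294967297'})
assert(not ok)
```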
-
Cyrill Gorcunov authored
When synchronous replication is used, we prefer the user to specify a quorum number, i.e. the number of replicas where data must be replicated before the master node continues accepting new transactions. This is not very convenient, since the user may not know initially how many replicas will be used. Moreover, the number of replicas may vary dynamically. For this sake, we allow specifying the quorum in a symbolic way. For example,

    box.cfg {
        replication_synchro_quorum = "N/2+1",
    }

where `N` is the number of registered replicas in a cluster. Once a new replica is attached or an old one is detached, the number is recalculated and propagated. Internally, on each replica_set_id() and replica_clear_id(), i.e. at the moment a replica gets registered or unregistered, we call the box_update_replication_synchro_quorum() helper, which finds out whether evaluation of replication_synchro_quorum is needed, and if so, calculates the new replication_synchro_quorum value based on the number of currently registered replicas. Then we notify the dependent systems, such as qsync and raft, to update their guts. Note: we do *not* change the default setting for this option; it remains 1 by default for now. Changing the default should be done in a separate commit once we make sure that everything is fine. Closes #5446
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
@TarantoolBot document
Title: Support dynamic evaluation of synchronous replication quorum
Setting the `replication_synchro_quorum` option to an explicit integer value was introduced mostly for simplicity's sake. For example, if the cluster's size is not constant and new replicas are connected dynamically, an administrator might need to increase the option by hand or with some other external tool. Instead, one can use dynamic evaluation of the quorum value via a formal representation, using the symbol `N` for the current number of registered replicas in a cluster. For example, the canonical definition of a quorum (i.e. a majority of members in a set) of `N` replicas is `N/2+1`. For such a configuration, define
```
box.cfg {replication_synchro_quorum = "N/2+1"}
```
The formal statement allows for flexible configuration, but keep in mind that only the canonical quorum (and bigger values, say `N` for all replicas) guarantees data reliability; various weird forms such as `N/3+1`, while allowed, may lead to unexpected results.
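The evaluation itself happens in the server's C code; the idea can be sketched in Lua (`eval_quorum` is a hypothetical helper, not the actual implementation):

```
-- Substitute the registered replica count for N, evaluate the
-- arithmetic expression, and truncate the result to an integer.
local function eval_quorum(formula, replica_count)
    local expr = formula:gsub('N', tostring(replica_count))
    local chunk = assert(load('return ' .. expr))
    return math.floor(chunk())
end

assert(eval_quorum('N/2+1', 5) == 3) -- canonical majority of 5 replicas
```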
-
Cyrill Gorcunov authored
Currently, the box_check_replication_synchro_quorum() helper tests that the "replication_synchro_quorum" value is valid and returns the value itself for later use in the code. This is fine for regular numbers, but since we are going to support formula evaluation, the real value to use will be dynamic, and returning a number "to use" won't be convenient. Thus, let's change the contract: make box_check_replication_synchro_quorum() return 0|-1 for success|failure, and when the real value is needed, fetch it explicitly via a cfg_geti() call. To make this more explicit, the real update of the appropriate variable is done via the box_update_replication_synchro_quorum() helper. Part-of #5446
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
We will need it to figure out whether a parameter is a numeric value when doing the configuration check. Part-of #5446
Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Mergen Imeev authored
Prior to this patch, the region on fiber was reset during select(), get(), count(), max(), or min(). This would result in an error if one of these operations was used in a user-defined function in SQL. After this patch, these functions truncate the region instead of resetting it. Closes #5427
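A hypothetical Lua-level repro of the affected scenario, i.e. a persistent function exported to SQL that calls one of the listed operations (GETTER and its body are illustrative, not from the patch):

```
-- GETTER calls get() from inside an SQL statement; before the fix this
-- reset the fiber region that the calling SQL engine was still using.
box.schema.func.create('GETTER', {
    language = 'LUA',
    returns = 'boolean',
    param_list = {},
    exports = {'LUA', 'SQL'},
    body = [[function() return box.space._schema:get{'version'} ~= nil end]],
})
box.execute([[SELECT "GETTER"();]])
```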
-
- Dec 23, 2020
-
-
Nikita Pettik authored
Accidentally, the built-in declaration list specified that ifnull() can return only integer values, whereas it should return SCALAR: ifnull() returns its first non-null argument, so the type of the return value depends on the types of the arguments. Let's fix this and set the return type of ifnull() to SCALAR.
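So, for example, both of these now type-check, since the result follows the arguments rather than being forced to INTEGER:

```
-- The first non-NULL argument wins, so the result may be a string,
-- a double, etc. - hence the SCALAR return type.
box.execute([[SELECT ifnull(NULL, 'text'), ifnull(NULL, 1.5);]])
```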
-
Mergen Imeev authored
After this patch, the persistent functions "box.schema.user.info" and "LUA" will have the same rights as the user who executes them. The problem was that setuid was unnecessarily set. Because of this, these functions had the same rights as the user who created them, whereas they must have the rights of the user who invokes them. Fixes tarantool/security#1
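One way to observe the fix (a sketch; `box.func` exposes the definitions of persistent functions):

```
-- After the patch the built-ins are not setuid anymore, so they run
-- with the caller's rights instead of the creator's.
assert(box.func['box.schema.user.info'].setuid == false)
assert(box.func['LUA'].setuid == false)
```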
-
Sergey Kaplun authored
A platform panic occurs when fiber.yield() is used within any active (i.e. currently being executed) hook. It is a regression caused by 96dbc49d ('lua: prohibit fiber yield when GC hook is active'). This patch fixes the false positive panic in cases when the VM is not running a GC hook. Relates to #4518 Closes #5649
Reported-by: Michael Filonenko <filonenko.mikhail@gmail.com>
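A minimal sketch of the case that used to panic and must not anymore: an ordinary debug hook (not a GC one) that yields the fiber.

```
local fiber = require('fiber')
local f = fiber.new(function()
    -- The line hook yields the fiber; before the fix this tripped the
    -- GC-hook panic even though no GC hook was running.
    debug.sethook(function() fiber.yield() end, 'l')
    local _ = 1 + 1 -- any line triggers the hook
    debug.sethook() -- drop the hook
end)
f:set_joinable(true)
f:join()
```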
-
Alexander V. Tikhonov authored
Added packaging jobs for Fedora 32. Closes #4966
-
Alexander V. Tikhonov authored
Found that the test replication/skip_conflict_row.test.lua fails with this output in the results file:

[035] @@ -139,7 +139,19 @@
[035]  -- applier is not in follow state
[035]  test_run:wait_upstream(1, {status = 'stopped', message_re = "Duplicate key exists in unique index 'primary' in space 'test'"})
[035]  ---
[035] -- true
[035] +- false
[035] +- id: 1
[035] +  uuid: f2084d3c-93f2-4267-925f-015df034d0a5
[035] +  lsn: 553
[035] +  upstream:
[035] +    status: follow
[035] +    idle: 0.0024020448327065
[035] +    peer: unix/:/builds/4BUsapPU/0/tarantool/tarantool/test/var/035_replication/master.socket-iproto
[035] +    lag: 0.0046234130859375
[035] +  downstream:
[035] +    status: follow
[035] +    idle: 0.086121961474419
[035] +    vclock: {2: 3, 1: 553}
[035]  ...
[035]  --
[035]  -- gh-3977: check that NOP is written instead of conflicting row.

The test could not be restarted with a checksum, because values like the UUID change on each failure. This happened because test-run uses an internal chain of functions, wait_upstream() -> gen_box_info_replication_cond(), which returns instance information on failure. To avoid this, the output was redirected to the log file instead of the results file.
-
Alexander V. Tikhonov authored
Since the current testing schema uses a separate pipeline for each testing job, the workflow names should be the same as the job names, to make them more visible on the github actions results page [1]. [1] - https://github.com/tarantool/tarantool/actions
-
- Dec 22, 2020
-
-
mechanik20051988 authored
There was an option, 'force_recovery', that makes tarantool ignore some problems during xlog recovery. This patch changes the option's behavior and makes tarantool also ignore some errors during snapshot recovery, just like during xlog recovery. Error types which can be ignored:
- the snapshot is somehow truncated, but after the necessary system spaces
- the snapshot has some garbage after its declared length
- a single tuple within the snapshot has a broken checksum and may be skipped without consequences (in this case the whole row with this tuple is ignored)
@TarantoolBot document
Title: Change 'force_recovery' option behavior
Change the 'force_recovery' option behavior to allow tarantool to load from a broken snapshot. Closes #5422
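The knob itself is unchanged; it now simply covers snapshot recovery as well:

```
-- Set in the instance file before recovery starts; with it, the
-- snapshot defects listed above are skipped instead of aborting startup.
box.cfg{force_recovery = true}
```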
-
Alexander V. Tikhonov authored
Found that jobs triggered by the push and pull_request filters run duplicates of each other [1][2]. To avoid this, an additional module was found [3]. Entire jobs are now skipped for duplicate runs, as well as for previously queued runs that were already superseded [4].
[1] - https://github.community/t/duplicate-checks-on-push-and-pull-request-simultaneous-event/18012
[2] - https://github.community/t/how-to-trigger-an-action-on-push-or-pull-request-but-not-both/16662
[3] - https://github.com/fkirc/skip-duplicate-actions#concurrent_skipping
[4] - https://github.com/fkirc/skip-duplicate-actions#option-1-skip-entire-jobs
-
Alexander V. Tikhonov authored
Added a standalone job with a Coverity check, as described at [1]. This job uploads its results to the coverity.com host, to the 'tarantool' project, when the COVERITY_TOKEN environment variable is set. The main Coverity functionality was added to the .travis.mk make file as standalone targets:
'test_coverity_debian_no_deps' - used in github-ci actions
'coverity_debian' - an additional target with a check for the needed tools
This job is configured with a cron schedule to run each Saturday at 04:00 am. Closes #5600
[1] - https://scan.coverity.com/download?tab=cxx
-
Alexander V. Tikhonov authored
Moved saving of coverage results to the coveralls.io repository from travis-ci to github-ci. Completely removed travis-ci from the commit criteria. Part of #5294
-
Alexander V. Tikhonov authored
Implemented github-ci actions workflow OSX jobs on commits:
- OSX 10.15
- OSX 11.0
Part of #5294
-
Alexander V. Tikhonov authored
Implemented a github-ci actions workflow on commits. Added a group of CI jobs:
1) on Debian 9 ("Stretch"):
- luacheck
- release
- debug_coverage
- release_clang
- release_lto
2) on Debian 10 ("Buster"):
- release_lto_clang11
- release_asan_clang11
Part of #5294
-