- Dec 25, 2020
-
-
Serge Petrenko authored
It is possible that a new leader (elected either via Raft, manually, or via some user-written election algorithm) loses data that the old leader has successfully committed and confirmed. Imagine the following situation: there are N nodes in a replicaset, and the old leader, denoted A, tries to apply some synchronous transaction. It is written on the leader itself and on N/2 other nodes, one of which is B. The transaction has thus gathered a quorum of N/2 + 1 acks. Now A writes CONFIRM and commits the transaction, but dies before the confirmation reaches any of its followers. B is elected the new leader and sees that A's last transaction is present on N/2 nodes, so it doesn't have a quorum (A was one of the N/2 + 1). The current `clear_synchro_queue()` implementation makes B roll the transaction back, leading to a rollback after commit, which is unacceptable. To fix the problem, make `clear_synchro_queue()` wait until all the rows from the previous leader gather `replication_synchro_quorum` acks. In case the quorum wasn't achieved during replication_synchro_timeout, roll nothing back and wait for the user's intervention. Closes #5435 Co-developed-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
-
Serge Petrenko authored
It'll be useful for the box_clear_synchro_queue rework. Prerequisite #5435
-
Vladislav Shpilevoy authored
The trigger is fired every time any of the relays notifies tx of a replica's known vclock change. The trigger will be used to collect the synchronous transaction quorum for the old leader's transactions. Part of #5435
-
Serge Petrenko authored
Clear_synchro_queue isn't meant to be called multiple times on a single instance. Multiple simultaneous invocations of clear_synchro_queue() shouldn't hurt now, since clear_synchro_queue simply exits on an empty limbo, but they may be harmful in the future, when clear_synchro_queue is reworked. Prohibit such misuse by introducing an execution guard and raising an error once a duplicate invocation is detected. Prerequisite #5435
-
Sergey Bronnikov authored
Closes #5538
-
Sergey Bronnikov authored
For Python 3, PEP 3106 changed the design of the dict builtin and the mapping API in general, replacing the separate list-based and iterator-based APIs of Python 2 with a merged, memory-efficient set and multiset view based API. This new style of dict iteration was also added to the Python 2.7 dict type as a new set of iteration methods. PEP 469 [1] recommends replacing d.iteritems() with iter(d.items()) to make code compatible with Python 3. 1. https://www.python.org/dev/peps/pep-0469/ Part of #5538
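To illustrate the described change, here is a minimal sketch of the PEP 469 style conversion; the dict and its contents are made up, not taken from the actual test suite:
```
d = {"a": 1, "b": 2}

# Python 2 only:
#   for key, value in d.iteritems():
#       print key, value

# Portable form recommended by PEP 469 (works on Python 2.7 and 3.x):
for key, value in iter(d.items()):
    print(key, value)
```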
-
Sergey Bronnikov authored
The largest change in Python 3 is the handling of strings. In Python 2, the str type was used for two different kinds of values - text and bytes - whereas in Python 3 these are separate and incompatible types. The patch converts strings to byte strings where required to make the tests compatible with Python 3. Part of #5538
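A minimal made-up illustration of this kind of conversion (the string and its use are hypothetical, not copied from the tests):
```
# Python 2: str held both text and raw bytes, so tests could pass either.
# Python 3: text (str) and binary data (bytes) are separate, incompatible types,
# so data that ends up on the wire has to be encoded explicitly.
text = "ping\n"
payload = text.encode("utf-8")   # or spell it as the bytes literal b"ping\n"
assert isinstance(payload, bytes)
assert payload == b"ping\n"
```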
-
Sergey Bronnikov authored
In Python 2.x, calling items() makes a copy of the keys that you can iterate over while modifying the dict. This doesn't work in Python 3.x because items() returns a view instead of a list, and Python 3 raises the exception "dictionary changed size during iteration". To work around it, one can use list() to force a copy of the keys to be made. Part of #5538
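For illustration, a small sketch of the described workaround with a made-up dict:
```
d = {"a": 1, "b": 2, "c": 3}

# Python 3: deleting keys while iterating d.items() directly raises
# "RuntimeError: dictionary changed size during iteration".
# Wrapping it in list() forces a copy, restoring the Python 2 behaviour.
for key, value in list(d.items()):
    if value % 2 == 0:
        del d[key]

print(d)  # {'a': 1, 'c': 3}
```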
-
Sergey Bronnikov authored
- convert the print statement to a function. In Python 3, 'print' becomes a function, see [1]. The patch makes 'print' in the regression tests compatible with Python 3.
- according to PEP8, mixing double quotes and single quotes in a project looks inconsistent. The patch makes the use of quotes with strings consistent.
- use "format()" instead of "%" everywhere
1. https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function Part of #5538
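A short made-up example combining the three changes listed above; the names and message are illustrative only:
```
name = "tarantool"
port = 3301

# Python 2 style:
#   print 'listening on %s:%d' % (name, port)

# Python 3 compatible style: print() is a function, quoting is consistent,
# and str.format() replaces the "%" operator.
print("listening on {}:{}".format(name, port))
```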
-
Serge Petrenko authored
Report box.stat().*.total, box.stat.net().*.total and box.stat.net().*.current via the feedback daemon report. Accompany this data with the time when the report was generated, so that it's possible to calculate RPS from this data on the feedback server. `box.stat().OP_NAME.total` resides in `feedback.stats.box.OP_NAME.total`, while `box.stat.net().OP_NAME.total` resides in `feedback.stats.net.OP_NAME.total`. The time of report generation is located at `feedback.stats.time`. Closes #5589
-
- Dec 24, 2020
-
-
Cyrill Gorcunov authored
We have a feedback server which gathers information about a running instance. While general info is enough for now, we may lose precious information about crashes (such as the call backtrace which caused the issue, the type of the build, etc). In this commit we add support for sending this kind of information to the feedback server. Internally we gather the reason of the failure, pack it into base64 form and then run another Tarantool instance which sends it out. A typical report might look like

 | {
 |   "crashdump": {
 |     "version": "1",
 |     "data": {
 |       "uname": {
 |         "sysname": "Linux",
 |         "release": "5.9.14-100.fc32.x86_64",
 |         "version": "#1 SMP Fri Dec 11 14:30:38 UTC 2020",
 |         "machine": "x86_64"
 |       },
 |       "build": {
 |         "version": "2.7.0-115-g360565efb",
 |         "cmake_type": "Linux-x86_64-Debug"
 |       },
 |       "signal": {
 |         "signo": 11,
 |         "si_code": 0,
 |         "si_addr": "0x3e800004838",
 |         "backtrace": "#0 0x630724 in crash_collect+bf\n...",
 |         "timestamp": "2020-12-23 14:42:10 MSK"
 |       }
 |     }
 |   }
 | }

There is no simple way to test this, so I did it manually:
1) Run an instance with box.cfg{log_level = 8, feedback_host="127.0.0.1:1500"}
2) Run a listener shell as
   while true ; do nc -l -p 1500 -c 'echo -e "HTTP/1.1 200 OK\n\n $(date)"'; done
3) Send SIGSEGV
   kill -11 `pidof tarantool`
Once SIGSEGV is delivered, the crashinfo data is generated and sent out. For debug purposes this data is also printed to the terminal on debug log level. Closes #5261 Co-developed-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com> @TarantoolBot document Title: Configuration update, allow to disable sending crash information For better analysis of program crashes, the information associated with the crash, such as
- utsname (similar to `uname -a` output except the network name)
- build information
- reason for the crash
- call backtrace
is sent to the feedback server. To disable it, set `feedback_crashinfo` to `false`.
-
Cyrill Gorcunov authored
When SIGSEGV or SIGFPE reaches tarantool, we try to gather all information related to the crash and print it out to the console (well, stderr actually). Still, there is a request to not just show this info locally but to send it out to the feedback server. Thus, to keep gathering of crash-related information in one module, we move fatal signal handling into the separate crash.c file. This allows us to collect the data we need in one place and reuse it when we need to send reports to stderr (and to the feedback server, which will be implemented in the next patch). Part-of #5261 Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
This will allow reusing this routine in crash reports. Part-of #5261 Acked-by:
Serge Petrenko <sergepetrenko@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
Very convenient to have this string extension. We will use it in crash handling. Acked-by:
Serge Petrenko <sergepetrenko@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Sergey Nikiforov authored
Added the corresponding test. Fixes: #5307
-
Alexander V. Tikhonov authored
A Fedora 32 gitlab-ci packaging job was added in commit 507c47f7a829581cc53ba3c4bd6a5191d088cdf ("gitlab-ci: add packaging for Fedora 32"), but it also had to be enabled in the update_repo tool to be able to save packages in S3 buckets. Follows up #4966
-
Cyrill Gorcunov authored
Part-of #5446 Co-developed-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
When we fetch the replication_synchro_quorum value (either as a plain integer or via formula evaluation), we trim the number down to an integer, which silently hides potential overflow errors. For example

 | box.cfg{replication_synchro_quorum='4294967297'}

which is 1 in terms of machine words. Let's use 8-byte values and trigger an error instead. Part-of #5446 Reported-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
When synchronous replication is used, we prefer a user to specify a quorum number, i.e. the number of replicas where data must be replicated before the master node continues accepting new transactions. This is not very convenient, since a user may not know initially how many replicas will be used. Moreover, the number of replicas may vary dynamically. For this sake we allow specifying the quorum in a symbolic way. For example

 box.cfg {
     replication_synchro_quorum = "N/2+1",
 }

where `N` is the number of registered replicas in a cluster. Once a new replica is attached or an old one is detached, the number is recalculated and propagated. Internally, on each replica_set_id() and replica_clear_id(), i.e. at the moment a replica gets registered or unregistered, we call the box_update_replication_synchro_quorum() helper, which finds out if evaluation of replication_synchro_quorum is needed and, if so, calculates the new replication_synchro_quorum value based on the number of currently registered replicas. Then we notify dependent systems such as qsync and raft to update their guts. Note: we do *not* change the default settings for this option, it remains 1 by default for now. Changing the default should be done in a separate commit once we make sure that everything is fine. Closes #5446 Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com> @TarantoolBot document Title: Support dynamic evaluation of synchronous replication quorum Setting the `replication_synchro_quorum` option to an explicit integer value was introduced mostly for simplicity's sake. For example, if the cluster's size is not a constant value and new replicas are connected dynamically, then an administrator might need to increase the option by hand or with some other external tool. Instead, one can use dynamic evaluation of the quorum value via a formal representation, using the symbol `N` as the current number of registered replicas in a cluster. For example, the canonical definition of a quorum (i.e. a majority of members in a set) of `N` replicas is `N/2+1`. For such a configuration define
```
box.cfg {replication_synchro_quorum = "N/2+1"}
```
The formal statement allows providing a flexible configuration, but keep in mind that only the canonical quorum (and bigger values, say `N` for all replicas) guarantees data reliability, and various weird forms such as `N/3+1`, while allowed, may lead to unexpected results.
-
Cyrill Gorcunov authored
Currently the box_check_replication_synchro_quorum helper tests whether the "replication_synchro_quorum" value is valid and returns the value itself to be used later in the code. This is fine for regular numbers, but since we're going to support formula evaluation, the real value to use will be dynamic, and returning a number "to use" won't be convenient. Thus, let's change the contract: make box_check_replication_synchro_quorum() return 0|-1 for success|failure, and when the real value is needed we will fetch it explicitly via a cfg_geti call. To make this more explicit, the real update of the appropriate variable is done via the box_update_replication_synchro_quorum() helper. Part-of #5446 Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
We will need it to figure out if a parameter is a numeric value when doing the configuration check. Part-of #5446 Acked-by:
Serge Petrenko <sergepetrenko@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Mergen Imeev authored
Prior to this patch, the region on fiber was reset during select(), get(), count(), max(), or min(). This would result in an error if one of these operations was used in a user-defined function in SQL. After this patch, these functions truncate the region instead of resetting it. Closes #5427
-
- Dec 23, 2020
-
-
Nikita Pettik authored
Accidentally, in the built-in declaration list it was specified that ifnull() can return only integer values, while it should return SCALAR: ifnull() returns the first non-null argument, so the type of the return value depends on the types of the arguments. Let's fix this and set the return type of ifnull() to SCALAR.
-
Mergen Imeev authored
After this patch, the persistent functions "box.schema.user.info" and "LUA" will have the same rights as the user who executed them. The problem was that setuid was unnecessarily set. Because of this, these functions had the same rights as the user who created them. However, they must have the same rights as the user who used them. Fixes tarantool/security#1
-
Sergey Kaplun authored
A platform panic occurs when fiber.yield() is used within any active (i.e. currently executing) hook. It is a regression caused by 96dbc49d ('lua: prohibit fiber yield when GC hook is active'). This patch fixes the false-positive panic in cases when the VM is not running a GC hook. Relates to #4518 Closes #5649 Reported-by:
Michael Filonenko <filonenko.mikhail@gmail.com>
-
Alexander V. Tikhonov authored
Added packaging jobs for Fedora 32. Closes #4966
-
Alexander V. Tikhonov authored
Found that the test replication/skip_conflict_row.test.lua fails with the following output message in the results file:

[035] @@ -139,7 +139,19 @@
[035]  -- applier is not in follow state
[035]  test_run:wait_upstream(1, {status = 'stopped', message_re = "Duplicate key exists in unique index 'primary' in space 'test'"})
[035]  ---
[035] -- true
[035] +- false
[035] +- id: 1
[035] +  uuid: f2084d3c-93f2-4267-925f-015df034d0a5
[035] +  lsn: 553
[035] +  upstream:
[035] +    status: follow
[035] +    idle: 0.0024020448327065
[035] +    peer: unix/:/builds/4BUsapPU/0/tarantool/tarantool/test/var/035_replication/master.socket-iproto
[035] +    lag: 0.0046234130859375
[035] +  downstream:
[035] +    status: follow
[035] +    idle: 0.086121961474419
[035] +    vclock: {2: 3, 1: 553}
[035]  ...
[035]  --
[035]  -- gh-3977: check that NOP is written instead of conflicting row.

The test could not be restarted with a checksum because of values like the UUID changing on each failure. It happened because test-run uses an internal chain of functions, wait_upstream() -> gen_box_info_replication_cond(), which returns the instance information when it fails. To avoid this, the output was redirected to the log file instead of the results file.
-
Alexander V. Tikhonov authored
Since the current testing schema uses separate pipelines for each testing job, the workflow names should be the same as the job names to make them more visible on the github actions results page [1]. [1] - https://github.com/tarantool/tarantool/actions
-
- Dec 22, 2020
-
-
mechanik20051988 authored
There was an option 'force_recovery' that makes tarantool ignore some problems during xlog recovery. This patch changes this option's behavior and makes tarantool ignore some errors during snapshot recovery just like during xlog recovery. Error types which can be ignored:
- the snapshot is somehow truncated, but after the necessary system spaces
- the snapshot has some garbage after its declared length
- a single tuple within the snapshot has a broken checksum and may be skipped without consequences (in this case we ignore the whole row with this tuple)
@TarantoolBot document Title: Change 'force_recovery' option behavior Change the 'force_recovery' option behavior to allow tarantool to load from a broken snapshot Closes #5422
-
Alexander V. Tikhonov authored
Found that jobs triggered on the push and pull_request filters run duplicating each other [1][2]. To avoid this, an additional module was found [3]. Entire jobs are skipped both for duplicated jobs and for previously queued jobs of commits that were already updated [4]. [1] - https://github.community/t/duplicate-checks-on-push-and-pull-request-simultaneous-event/18012 [2] - https://github.community/t/how-to-trigger-an-action-on-push-or-pull-request-but-not-both/16662 [3] - https://github.com/fkirc/skip-duplicate-actions#concurrent_skipping [4] - https://github.com/fkirc/skip-duplicate-actions#option-1-skip-entire-jobs
-
Alexander V. Tikhonov authored
Added a standalone job with a Coverity check as described at [1]. This job uploads results to the coverity.com host for the 'tarantool' project when the COVERITY_TOKEN environment variable is set. The main Coverity functionality is added to the .travis.mk make file as standalone targets: 'test_coverity_debian_no_deps' - used in github-ci actions; 'coverity_debian' - an additional target with a check for the needed tools. This job is configured with a cron scheduler to run each Saturday at 04:00 am. Closes #5600 [1] - https://scan.coverity.com/download?tab=cxx
-
Alexander V. Tikhonov authored
Moved saving coverage to the coveralls.io repository from travis-ci to github-ci. Completely removed travis-ci from the commit criteria. Part of #5294
-
Alexander V. Tikhonov authored
Implemented github-ci action workflow OSX jobs on commits:
- OSX 10.15
- OSX 11.0
Part of #5294
-
Alexander V. Tikhonov authored
Implemented github-ci action workflow on commits. Added a group of CI jobs:
1) on Debian 9 ("Stretch"):
- luacheck
- release
- debug_coverage
- release_clang
- release_lto
2) on Debian 10 ("Buster"):
- release_lto_clang11
- release_asan_clang11
Part of #5294
-
Alexander V. Tikhonov authored
Due to all the activities moving from Gitlab-CI to Github-CI Actions, the docker image creation routine is updated with the new image naming and container registry: GITLAB_REGISTRY?=registry.gitlab.com is changed to DOCKER_REGISTRY?=docker.io Part of #5294
-
Alexander V. Tikhonov authored
Added a test-run filter on the box.snapshot error message: 'Invalid VYLOG file: Slice [0-9]+ deleted but not registered' to avoid printing changing data in the results file, so that its checksums can be used in the test-run fragile list to rerun it as a flaky issue. Found issues:

1) vinyl/deferred_delete.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/913623306#L4552
[036] 2020-12-15 19:10:01.996 [16602] coio vy_log.c:2202 E> failed to process vylog record: delete_slice{slice_id=744, }
[036] 2020-12-15 19:10:01.996 [16602] main/103/vinyl vy_log.c:2068 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 744 deleted but not registered

2) vinyl/gh-4864-stmt-alloc-fail-compact.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/913810422#L4835
[052] @@ -56,9 +56,11 @@
[052]  --
[052]  dump(true)
[052]  | ---
[052] - | ...
[052] -dump()
[052] - | ---
[052] + | - error: 'Invalid VYLOG file: Slice 253 deleted but not registered'
[052] + | ...

3) vinyl/misc.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/913727925#L5284
[014] @@ -62,14 +62,14 @@
[014]  ...
[014]  box.snapshot()
[014]  ---
[014] -- ok
[014] +- error: 'Invalid VYLOG file: Slice 1141 deleted but not registered'
[014]  ...

4) vinyl/quota.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/914016074#L4595
[025] 2020-12-15 22:56:50.192 [25576] coio vy_log.c:2202 E> failed to process vylog record: delete_slice{slice_id=522, }
[025] 2020-12-15 22:56:50.193 [25576] main/103/vinyl vy_log.c:2068 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 522 deleted but not registered

5) vinyl/update_optimize.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/913728098#L2512
[051] 2020-12-15 20:18:43.365 [17147] coio vy_log.c:2202 E> failed to process vylog record: delete_slice{slice_id=350, }
[051] 2020-12-15 20:18:43.365 [17147] main/103/vinyl vy_log.c:2068 E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 350 deleted but not registered

6) vinyl/upsert.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/913623510#L6132
[008] @@ -441,7 +441,7 @@
[008]  -- Mem has DELETE
[008]  box.snapshot()
[008]  ---
[008] -- ok
[008] +- error: 'Invalid VYLOG file: Slice 1411 deleted but not registered'
[008]  ...

7) vinyl/replica_quota.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/914272656#L5739
[023] @@ -41,7 +41,7 @@
[023]  ...
[023]  box.snapshot()
[023]  ---
[023] -- ok
[023] +- error: 'Invalid VYLOG file: Slice 232 deleted but not registered'
[023]  ...

8) vinyl/ddl.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/914309343#L4538
[039] @@ -81,7 +81,7 @@
[039]  ...
[039]  box.snapshot()
[039]  ---
[039] -- ok
[039] +- error: 'Invalid VYLOG file: Slice 206 deleted but not registered'
[039]  ...

9) vinyl/write_iterator.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/920646297#L4694
[059] @@ -80,7 +80,7 @@
[059]  ...
[059]  box.snapshot()
[059]  ---
[059] -- ok
[059] +- error: 'Invalid VYLOG file: Slice 351 deleted but not registered'
[059]  ...
[059]  --
[059]  -- Create a couple of tiny runs on disk, to increate the "number of runs"

10) vinyl/gc.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/920441445#L4691
[050] @@ -59,6 +59,7 @@
[050]  ...
[050]  gc()
[050]  ---
[050] +- error: 'Invalid VYLOG file: Run 1176 deleted but not registered'
[050]  ...
[050]  files = ls_data()
[050]  ---

11) vinyl/gh-3395-read-prepared-uncommitted.test.lua
https://gitlab.com/tarantool/tarantool/-/jobs/921944705#L4258
[019] @@ -38,7 +38,7 @@
[019]  | ...
[019]  box.snapshot()
[019]  | ---
[019] - | - ok
[019] + | - error: 'Invalid VYLOG file: Slice 634 deleted but not registered'
[019]  | ...
[019]
[019]  c = fiber.channel(1)
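For illustration only, a rough sketch in plain Python of what such an output filter could do; the function name and the real test-run filter API here are assumptions, not the actual patch:
```
import re

# Hypothetical normalizer: replace the changing slice/run id in the
# box.snapshot error message with a placeholder so the result output
# stays stable between runs.
VYLOG_RE = re.compile(
    r"Invalid VYLOG file: (Slice|Run) [0-9]+ deleted but not registered")

def normalize_line(line):
    return VYLOG_RE.sub(
        r"Invalid VYLOG file: \1 <NUM> deleted but not registered", line)

print(normalize_line(
    "- error: 'Invalid VYLOG file: Slice 634 deleted but not registered'"))
# -> - error: 'Invalid VYLOG file: Slice <NUM> deleted but not registered'
```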
-
Alexander V. Tikhonov authored
Found that running the vinyl test suite in parallel using the test-run vardir on a real hard drive may cause a lot of tests to fail. It happens because of a bottleneck with hard drive usage up to 100%, which can be seen with any tool like atop during a parallel vinyl test run. To avoid this, all heavily loaded testing processes should use tmpfs for the vardir path. It was found that the out-of-source build had to be updated to use tmpfs for it. This patch mounts an additional tmpfs mount point in the OOS build docker run process for the test-run vardir. This mount point is set using the '--tmpfs' flag because '--mount' does not support the 'exec' option, which is needed to be able to execute commands in it [2][3]. Issues met on OOS before the patch, as described in #5504 and [1]:

Test hung! Result content mismatch:
--- vinyl/write_iterator.result Fri Nov 20 14:48:24 2020
+++ /rw_bins/test/var/081_vinyl/write_iterator.result Fri Nov 20 15:01:54 2020
@@ -200,831 +200,3 @@
 ---
 ...
 for i = 1, 100 do space:insert{i, ''..i} if i % 2 == 0 then box.snapshot() end end
----
-...
-space:delete{1}
----
-...

Closes #5622 Part of #5504
[1] - https://gitlab.com/tarantool/tarantool/-/jobs/863266476#L5009
[2] - https://stackoverflow.com/questions/54729130/how-to-mount-docker-tmpfs-with-exec-rw-flags
[3] - https://github.com/moby/moby/issues/35890
-
Sergey Kaplun authored
Part of #5187
-
- Dec 21, 2020
-
-
Vladislav Shpilevoy authored
If the death timeout was decreased, during waiting for leader death or discovery, to a new value making the current death waiting end immediately, it could crash in libev, because it would mean the remaining time until leader death became negative. The negative timeout was passed to libev without any checks, and there is an assertion that a timeout should always be >= 0. This commit makes the raft code covered almost 100%, not counting one 'unreachable()' place. Closes #5303
-
Vladislav Shpilevoy authored
If the election timeout was decreased during an election to a new value making the current election expire immediately, it could crash in libev, because it would mean the remaining time until the election end became negative. The negative timeout was passed to libev without any checks, and there is an assertion that a timeout should always be >= 0. Part of #5303
-