- Dec 27, 2020
-
-
Alexander Turenko authored
We want to increase testing timeouts for GitLab CI, where we use our own runners and observe stalls and high disk pressure when several vinyl tests run in parallel. The idea is to set variables in the GitLab CI web interface and read them from test-run (see [1]). First, we need to pass the variables into the inner environments.

GitLab CI jobs run the testing using packpack, Docker or VirtualBox. Packpack already preserves environment variables that are listed in the PRESERVE_ENVVARS variable (see [2]). This commit passes the variables listed in PRESERVE_ENVVARS into the Docker and VirtualBox environments, so all jobs will have the given variables in the environment. (Also dropped the unused EXTRA_ENV variable.)

The next commit will update the test-run submodule with support for setting timeouts using environment variables.

[1]: https://github.com/tarantool/test-run/issues/258
[2]: https://github.com/tarantool/test-run/issues/258
[2]: https://github.com/packpack/packpack/pull/135
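For illustration only, a minimal Python sketch of the mechanism described above (not the actual packpack or CI code): build `docker run` "-e NAME=value" arguments for every variable named in PRESERVE_ENVVARS. The list delimiter and the helper name are assumptions.

    import os
    import re

    def docker_env_args(preserve_envvars):
        # Build "-e NAME=value" arguments for every variable named in
        # PRESERVE_ENVVARS that is present in the current environment.
        # Assumption: the list is comma- or whitespace-separated.
        args = []
        for name in filter(None, re.split(r"[,\s]+", preserve_envvars)):
            if name in os.environ:
                args += ["-e", "{}={}".format(name, os.environ[name])]
        return args

    print(docker_env_args(os.environ.get("PRESERVE_ENVVARS", "")))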
-
- Dec 26, 2020
-
-
Alexander V. Tikhonov authored
Removed the obvious part in the rpm spec for Travis-CI, since it is no longer in use.

---- Comments from @Totktonada ----

This change is a kind of reversion of commit d48406d5 ('test: add more tests to packaging testing'), which closed #4599. Here is the story of why the change was made and why it is reverted now.

We run testing during an RPM package build: it may catch some distribution-specific problem. We had reduced the number of tests and used single-thread test execution to keep the testing stable and not break package builds and deployment due to known fragile tests.

Our CI used Travis CI, but we were in transition to GitLab CI to use our own machines and not hit the Travis CI limit of five jobs running in parallel. We moved package builds to GitLab CI, but kept build+deploy jobs on Travis CI for a while: GitLab CI was new for us and we wanted to make this transition smooth for users of our APT / YUM repositories.

After enabling package building on GitLab CI, we wanted to enable more tests (to catch more problems) and parallel execution of tests to speed up testing (and reduce the time a developer waits for results). We observed that if we enabled more tests and parallel execution on Travis CI, the testing results would become much less stable, so we would often have holes in deployed packages and red CI. So we decided to keep the old way of testing on Travis CI and perform all changes (more tests, more parallelism) only for GitLab CI. We had a guess that we have enough machine resources and would be able to do some load balancing to overcome flaky fails on our own machines, but in fact we picked another approach later (see below). That's the whole story behind #4599.

What has changed since then? We moved deployment jobs to GitLab CI[^1] and now Travis CI is completely disabled (see #4410 and #4894). All jobs were moved either to GitLab CI or right to GitHub Actions[^2].

We revisited our approach to improving the stability of testing. Attempts to do some load balancing while keeping the execution time reasonable failed: we should increase parallelism for speed, but decrease it for stability at the same time, and there is no optimal balance. So we decided to track flaky fails in the issue tracker and restart a test after a known fail (see details in [1]). This way we don't need to exclude tests and disable parallelism in order to get stable and fast testing[^3]. At least in theory. We're on the way to verifying this guess, but hopefully we'll stick with some adequate defaults that will work everywhere[^4].

To sum up, there are several reasons to remove the old workaround, which was implemented in the scope of #4599: no Travis CI, no foreseeable reasons to exclude tests and reduce parallelism depending on a CI provider.

Footnotes:

[^1]: This is a simplification. Travis CI deployment jobs were not moved as is. GitLab CI jobs push packages to the new repositories backend (#3380). Travis CI jobs were disabled later (as part of #4947), after proofs that the new infrastructure works fine. However, that is another story.

[^2]: Now we're going to use GitHub Actions for all jobs, mainly because GitLab CI is poorly integrated with GitHub pull requests (when the source branch is in a forked repository).

[^3]: Some work in this direction is still to be done. First, the 'replication' test suite is still excluded from testing under the RPM package build; it seems we should just enable it again, which is tracked by #4798. Second, there is the issue [2] about getting rid of ancient traces of the old attempts to keep the testing stable (from the test-run side). It'll give us more parallelism in testing.

[^4]: Of course, we investigate flaky fails and fix the code and testing problems they reveal. However, this turns out to be a long-running activity.

References:

[1]: https://github.com/tarantool/test-run/pull/217
[2]: https://github.com/tarantool/test-run/issues/251
-
- Dec 25, 2020
-
-
Sergey Bronnikov authored
To run Tarantool fuzzers on the OSS Fuzz infrastructure it is necessary to pass the $LIB_FUZZING_ENGINE library to the linker and use external CFLAGS and CXXFLAGS. A full description of how to integrate with OSS Fuzz is in [1] and [2]. The patch to the OSS Fuzz repository [3] is ready to merge.

We need to pass options with "-fsanitize=fuzzer" two times (in cmake/profile.cmake and test/fuzz/CMakeLists.txt) because:
- cmake/profile.cmake is for project source files: the -fsanitize=fuzzer-no-link option instruments project source files for fuzzing, but LibFuzzer will not replace main() in these files.
- test/fuzz/CMakeLists.txt uses -fsanitize=fuzzer and not -fsanitize=fuzzer-no-link because we want the automatically generated main() for each fuzzer.

1. https://google.github.io/oss-fuzz/getting-started/new-project-guide/
2. https://google.github.io/oss-fuzz/advanced-topics/ideal-integration/
3. https://github.com/google/oss-fuzz/pull/4723

Closes #1809
-
Sergey Bronnikov authored
OSS Fuzz has a limited number of runs per day; currently it is 4 runs. The ENABLE_FUZZERS option is enabled to make sure that building the fuzzers is not broken.

Part of #1809
-
Sergey Bronnikov authored
Fuzzing tools use evolutionary algorithms. Supplying a seed corpus consisting of good sample inputs is one of the best ways to improve a fuzz target's coverage. The patch adds corpora that can be used with the existing fuzzers. The name of each file in a corpus is the sha1 checksum of its contents. The corpus with http headers was added from [2] and [3].

1. https://google.github.io/oss-fuzz/getting-started/new-project-guide/
2. https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
3. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

libFuzzer allows minimizing a corpus with the help of the `-merge` flag: when 1 is passed, any corpus inputs from the 2nd, 3rd etc. corpus directories that trigger new code coverage are merged into the first corpus directory; when 0 is passed, an existing corpus is minimized. All corpora provided in the patch were minimized.

Part of #1809
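The naming convention above (each corpus file is named after the sha1 checksum of its contents) can be reproduced with a small Python sketch; the helper name and directory path are hypothetical examples, not code from the patch.

    import hashlib
    import os
    import shutil

    def add_to_corpus(sample_path, corpus_dir):
        # Name the corpus file after the sha1 checksum of its contents --
        # the same convention used for the corpora added by this patch.
        with open(sample_path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        os.makedirs(corpus_dir, exist_ok=True)
        shutil.copyfile(sample_path, os.path.join(corpus_dir, digest))
        return digest

    # Example (hypothetical paths):
    # add_to_corpus("headers.txt", "test/static/corpus/http_headers")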
-
Sergey Bronnikov authored
There are a number of bugs related to parsing and encoding/decoding data. Examples:
- csv: #2692, #4497
- uri: #585

One of the effective methods to find such issues is fuzzing. The patch introduces a CMake flag to enable building fuzzers (ENABLE_FUZZER) and adds fuzzers based on LibFuzzer [1] to the csv, http_parser and uri modules. Note that fuzzers must return only a 0 exit code; other exit codes are not supported [2].

NOTE: LibFuzzer requires the Clang compiler.

1. https://llvm.org/docs/LibFuzzer.html
2. http://llvm.org/docs/LibFuzzer.html#id22

How-To Use:

$ mkdir build && cd build
$ cmake -DENABLE_FUZZER=ON \
        -DENABLE_ASAN=ON \
        -DCMAKE_BUILD_TYPE=Debug \
        -DCMAKE_C_COMPILER="/usr/bin/clang" \
        -DCMAKE_CXX_COMPILER="/usr/bin/clang++" ..
$ make -j
$ ./test/fuzz/csv_fuzzer -workers=4 ../test/static/corpus/csv

Part of #1809
-
Sergey Bronnikov authored
The serpent module was dropped in commit b53cb2ae ("console: drop unused serpent module"), but a comment that belongs to the module was left in the luacheck config.
-
Sergey Bronnikov authored
Closes #5454
Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Reviewed-by: Igor Munkin <imun@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Co-authored-by: Igor Munkin <imun@tarantool.org>
-
Sergey Bronnikov authored
Closes #5453
Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Reviewed-by: Igor Munkin <imun@tarantool.org>
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Co-authored-by: Igor Munkin <imun@tarantool.org>
-
Serge Petrenko authored
Follow-up #5435
-
Serge Petrenko authored
We designed the limbo so that it errors on receiving a CONFIRM or ROLLBACK for another instance's data. Actually, this error is pointless, and even harmful. Here's why.

Imagine you have 3 instances, 1, 2 and 3. First 1 writes some synchronous transactions, but dies before writing CONFIRM. Now 2 has to write CONFIRM instead of 1 to take limbo ownership. From now on 2 is the limbo owner and, in case of high enough load, it constantly has some data in the limbo.

Once 1 restarts, it first recovers its xlogs and fills its limbo with its own unconfirmed transactions from the previous run. Now replication between 1, 2 and 3 is started and the first thing 1 sees is that 2 and 3 ack its old transactions. So 1 writes CONFIRM for its own transactions even before the same CONFIRM written by 2 reaches it. Once the CONFIRM written by 1 is replicated to 2 and 3, they error and stop replication, since their limbo contains entries from 2, not from 1. Actually, there's no need to error, since it's just a really old CONFIRM which has already been processed by both 2 and 3.

So, ignore CONFIRM/ROLLBACK when it references a wrong limbo owner.

The issue was discovered with the test replication/election_qsync_stress.

Follow-up #5435
-
Serge Petrenko authored
The test involves writing synchronous transactions on one node and making other nodes confirm these transactions after its death. In order for the test to work properly, we need to make sure the old node replicates all its transactions to peers before killing it. Otherwise, once the node is resurrected, it'll have newer data not present on the other nodes, which leads to their vclocks being incompatible, no one becoming the new leader, and the test hanging.

Follow-up #5435
-
Serge Petrenko authored
It is possible that a new leader (elected via raft, manually, or via some user-written election algorithm) loses the data that the old leader has successfully committed and confirmed.

Imagine such a situation: there are N nodes in a replicaset. The old leader, denoted A, tries to apply some synchronous transaction. It is written on the leader itself and on N/2 other nodes, one of which is B. The transaction has thus gathered a quorum, N/2 + 1 acks. Now A writes CONFIRM and commits the transaction, but dies before the confirmation reaches any of its followers. B is elected the new leader and it sees that A's last transaction is present on N/2 nodes, so it doesn't have a quorum (A was one of the N/2 + 1). The current `clear_synchro_queue()` implementation makes B roll the transaction back, leading to a rollback after commit, which is unacceptable.

To fix the problem, make `clear_synchro_queue()` wait until all the rows from the previous leader gather `replication_synchro_quorum` acks. In case the quorum wasn't achieved during replication_synchro_timeout, roll back nothing and wait for the user's intervention.

Closes #5435
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
-
Serge Petrenko authored
It'll be useful for box_clear_synchro_queue rework. Prerequisite #5435
-
Vladislav Shpilevoy authored
The trigger is fired every time any of the relays notifies tx of a replica's known vclock change. The trigger will be used to collect the synchronous transactions quorum for the old leader's transactions.

Part of #5435
-
Serge Petrenko authored
clear_synchro_queue() isn't meant to be called multiple times on a single instance. Multiple simultaneous invocations of clear_synchro_queue() shouldn't hurt now, since clear_synchro_queue() simply exits on an empty limbo, but they may be harmful in the future, when clear_synchro_queue() is reworked. Prohibit such misuse by introducing an execution guard and raising an error once a duplicate invocation is detected.

Prerequisite #5435
-
Sergey Bronnikov authored
Closes #5538
-
Sergey Bronnikov authored
For Python 3, PEP 3106 changed the design of the dict builtin and the mapping API in general to replace the separate list-based and iterator-based APIs in Python 2 with a merged, memory-efficient set and multiset view based API. This new style of dict iteration was also added to the Python 2.7 dict type as a new set of iteration methods. PEP 469 [1] recommends replacing d.iteritems() with iter(d.items()) to make code compatible with Python 3.

1. https://www.python.org/dev/peps/pep-0469/

Part of #5538
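A minimal sketch of the change PEP 469 recommends (illustrative only, not code from the tests):

    d = {"a": 1, "b": 2}

    # Python 2 only -- d.iteritems() does not exist in Python 3:
    #     for key, value in d.iteritems():
    #         ...
    # Portable form recommended by PEP 469, works on Python 2.7 and 3:
    for key, value in iter(d.items()):
        print(key, value)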
-
Sergey Bronnikov authored
The largest change in Python 3 is the handling of strings. In Python 2, the str type was used for two different kinds of values, text and bytes, whereas in Python 3 these are separate and incompatible types. The patch converts strings to byte strings where it is required to make the tests compatible with Python 3.

Part of #5538
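A small illustration of the text/bytes split that the patch accounts for (not code from the tests):

    # Python 3 separates text (str) from binary data (bytes).
    text = "admin"
    data = text.encode("utf-8")      # explicit conversion to bytes
    assert isinstance(data, bytes) and isinstance(text, str)
    # b"admin" + "user" would raise TypeError on Python 3, so the tests
    # must pick one type and convert explicitly:
    assert data.decode("utf-8") == text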
-
Sergey Bronnikov authored
In Python 2.x, calling items() makes a copy of the keys that you can iterate over while modifying the dict. This doesn't work in Python 3.x, because items() returns an iterator instead of a list and Python 3 raises the exception "dictionary changed size during iteration". To work around it, one can use list() to force a copy of the keys to be made.

Part of #5538
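A minimal illustration of the failure mode and the list() workaround described above (illustrative only):

    d = {"a": 1, "b": 0, "c": 2}

    # Python 3: mutating the dict while iterating over d.items() raises
    # "RuntimeError: dictionary changed size during iteration".
    # list() forces a copy up front, restoring the Python 2 behaviour:
    for key, value in list(d.items()):
        if value == 0:
            del d[key]

    assert d == {"a": 1, "c": 2}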
-
Sergey Bronnikov authored
- Convert the print statement to a function. In Python 3 'print' becomes a function, see [1]. The patch makes 'print' in the regression tests compatible with Python 3.
- According to PEP 8, mixing double quotes and single quotes in a project looks inconsistent. The patch makes the use of quotes with strings consistent.
- Use "format()" instead of "%" everywhere.

1. https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function

Part of #5538
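A small before/after sketch of the conversions listed above; the variable name is a made-up example, not taken from the tests.

    # Python 2 style (removed by the patch):
    #     print "space count: %d" % space_count
    # Python 3 style: print() is a function and str.format() replaces "%".
    space_count = 3
    print("space count: {}".format(space_count))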
-
Serge Petrenko authored
Report box.stat().*.total, box.stat.net().*.total and box.stat.net().*.current via the feedback daemon report. Accompany this data with the time when the report was generated, so that it's possible to calculate RPS from this data on the feedback server.

`box.stat().OP_NAME.total` resides in `feedback.stats.box.OP_NAME.total`, while `box.stat.net().OP_NAME.total` resides in `feedback.stats.net.OP_NAME.total`. The time of report generation is located at `feedback.stats.time`.

Closes #5589
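For illustration, a hypothetical Python sketch of how a feedback server could derive RPS from two consecutive reports carrying the `...total` counters and `feedback.stats.time` described above; the helper and the sample reports are made up, only the field layout follows the commit message.

    def rps(prev_report, curr_report, op):
        # Requests per second for one box.stat() operation between two
        # consecutive feedback reports (hypothetical server-side helper).
        prev, curr = prev_report["stats"], curr_report["stats"]
        dt = curr["time"] - prev["time"]
        dtotal = curr["box"][op]["total"] - prev["box"][op]["total"]
        return dtotal / dt if dt > 0 else 0.0

    prev = {"stats": {"time": 1000.0, "box": {"SELECT": {"total": 500}}}}
    curr = {"stats": {"time": 1060.0, "box": {"SELECT": {"total": 2300}}}}
    print(rps(prev, curr, "SELECT"))  # 30.0 requests per second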
-
- Dec 24, 2020
-
-
Cyrill Gorcunov authored
We have a feedback server which gathers information about a running instance. While general info is enough for now, we may lose precious information about crashes (such as the call backtrace which caused the issue, the type of build, and so on). In this commit we add support for sending this kind of information to the feedback server. Internally we gather the reason of the failure, pack it into base64 form, and then run another Tarantool instance which sends it out.

A typical report might look like

| {
|   "crashdump": {
|     "version": "1",
|     "data": {
|       "uname": {
|         "sysname": "Linux",
|         "release": "5.9.14-100.fc32.x86_64",
|         "version": "#1 SMP Fri Dec 11 14:30:38 UTC 2020",
|         "machine": "x86_64"
|       },
|       "build": {
|         "version": "2.7.0-115-g360565efb",
|         "cmake_type": "Linux-x86_64-Debug"
|       },
|       "signal": {
|         "signo": 11,
|         "si_code": 0,
|         "si_addr": "0x3e800004838",
|         "backtrace": "#0 0x630724 in crash_collect+bf\n...",
|         "timestamp": "2020-12-23 14:42:10 MSK"
|       }
|     }
|   }
| }

There is no simple way to test this, so I did it manually:

1) Run an instance with
   box.cfg{log_level = 8, feedback_host="127.0.0.1:1500"}
2) Run a listener shell as
   while true ; do nc -l -p 1500 -c 'echo -e "HTTP/1.1 200 OK\n\n $(date)"'; done
3) Send SIGSEGV
   kill -11 `pidof tarantool`

Once SIGSEGV is delivered, the crashinfo data is generated and sent out. For debug purposes this data is also printed to the terminal on debug log level.

Closes #5261
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Configuration update, allow to disable sending crash information

For better analysis of program crashes, the information associated with the crash, such as
- utsname (similar to `uname -a` output except the network name)
- build information
- reason for the crash
- call backtrace
is sent to the feedback server. To disable it, set `feedback_crashinfo` to `false`.
-
Cyrill Gorcunov authored
When SIGSEGV or SIGFPE reaches tarantool, we try to gather all information related to the crash and print it out to the console (well, stderr actually). Still, there is a request not just to show this info locally but to send it out to the feedback server. Thus, to keep the gathering of crash-related information in one module, we move fatal signal handling into the separate crash.c file. This allows us to collect the data we need in one place and reuse it when we need to send reports to stderr (and to the feedback server, which will be implemented in the next patch).

Part-of #5261
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
This will allow reusing this routine in crash reports.

Part-of #5261
Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
It is very convenient to have this string extension. We will use it in crash handling.

Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Sergey Nikiforov authored
Added a corresponding test.

Fixes: #5307
-
Alexander V. Tikhonov authored
A Fedora 32 gitlab-ci packaging job was added in commit 507c47f7a829581cc53ba3c4bd6a5191d088cdf ("gitlab-ci: add packaging for Fedora 32"), but it also had to be enabled in the update_repo tool to make it possible to save packages in S3 buckets.

Follows up #4966
-
Cyrill Gorcunov authored
Part-of #5446
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
When we fetch the replication_synchro_quorum value (either as a plain integer or via formula evaluation), we trim the number down to an integer, which silently hides potential overflow errors. For example,

| box.cfg{replication_synchro_quorum='4294967297'}

is 1 in terms of machine words. Let's use 8-byte values and trigger an error instead.

Part-of #5446
Reported-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
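A quick Python illustration (not Tarantool code) of the silent truncation described above, assuming the value is trimmed to a 32-bit machine word:

    value = 4294967297              # 2**32 + 1, as passed to box.cfg above
    truncated = value & 0xFFFFFFFF  # what a silent 32-bit trim produces
    assert truncated == 1           # looks like a perfectly "valid" quorum of 1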
-
Cyrill Gorcunov authored
When synchronous replication is used, we prefer a user to specify a quorum number, i.e. the number of replicas where data must be replicated before the master node continues accepting new transactions. This is not very convenient, since a user may not know initially how many replicas will be used. Moreover, the number of replicas may vary dynamically. For this sake we allow specifying the quorum in a symbolic way. For example,

box.cfg {
    replication_synchro_quorum = "N/2+1",
}

where `N` is the number of registered replicas in a cluster. Once a new replica is attached or an old one is detached, the number is recalculated and propagated.

Internally, on each replica_set_id() and replica_clear_id(), i.e. at the moment when a replica gets registered or unregistered, we call the box_update_replication_synchro_quorum() helper, which finds out whether evaluation of replication_synchro_quorum is needed and, if so, calculates the new replication_synchro_quorum value based on the number of currently registered replicas. Then we notify dependent systems such as qsync and raft to update their guts.

Note: we do *not* change the default setting for this option, it remains 1 by default for now. Changing the default should be done in a separate commit, once we make sure that everything is fine.

Closes #5446
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Support dynamic evaluation of synchronous replication quorum

Setting the `replication_synchro_quorum` option to an explicit integer value was introduced mostly for simplicity's sake. For example, if the cluster's size is not constant and new replicas are connected dynamically, then an administrator might need to increase the option by hand or with some other external tool. Instead, one can use dynamic evaluation of the quorum value via a formal representation, using the symbol `N` as the current number of registered replicas in a cluster. For example, the canonical definition for a quorum (i.e. a majority of members in a set) of `N` replicas is `N/2+1`. For such a configuration define

```
box.cfg {replication_synchro_quorum = "N/2+1"}
```

The formal statement allows a flexible configuration, but keep in mind that only the canonical quorum (and bigger values, say `N` for all replicas) guarantees data reliability; various weird forms such as `N/3+1`, while allowed, may lead to unexpected results.
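A small Python illustration (not the Tarantool evaluator) of how the canonical `N/2+1` majority formula behaves for different cluster sizes, using integer division:

    def canonical_quorum(n_replicas):
        # Majority of N registered replicas: floor(N / 2) + 1.
        return n_replicas // 2 + 1

    for n in (1, 2, 3, 4, 5):
        print(n, "->", canonical_quorum(n))
    # 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 3, 5 -> 3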
-
Cyrill Gorcunov authored
Currently the box_check_replication_synchro_quorum helper tests whether the "replication_synchro_quorum" value is valid and returns the value itself to be used later in the code. This is fine for regular numbers, but since we're going to support formula evaluation, the real value to use will be dynamic and returning a number "to use" won't be convenient. Thus let's change the contract: make box_check_replication_synchro_quorum() return 0|-1 for success|failure, and when the real value is needed we fetch it explicitly via a cfg_geti call. To make this more explicit, the real update of the appropriate variable is done via the box_update_replication_synchro_quorum() helper.

Part-of #5446
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
We will need it to figure out whether a parameter is a numeric value when doing the configuration check.

Part-of #5446
Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Mergen Imeev authored
Prior to this patch, the region on the fiber was reset during select(), get(), count(), max(), or min(). This would result in an error if one of these operations was used in a user-defined function in SQL. After this patch, these functions truncate the region instead of resetting it.

Closes #5427
-
- Dec 23, 2020
-
-
Nikita Pettik authored
Accidentally, the built-in declaration list specified that ifnull() can return only integer values, while it should return SCALAR: ifnull() returns the first non-null argument, so the type of the return value depends on the types of the arguments. Let's fix this and set the return type of ifnull() to SCALAR.
-
Mergen Imeev authored
After this patch, the persistent functions "box.schema.user.info" and "LUA" will have the same rights as the user who executed them. The problem was that setuid was unnecessarily set. Because of this, these functions had the same rights as the user who created them. However, they must have the same rights as the user who used them. Fixes tarantool/security#1
-
Sergey Kaplun authored
A platform panic occurs when fiber.yield() is used within any active (i.e. currently executed) hook. It is a regression caused by 96dbc49d ('lua: prohibit fiber yield when GC hook is active'). This patch fixes the false positive panic in cases when the VM is not running a GC hook.

Relates to #4518
Closes #5649
Reported-by: Michael Filonenko <filonenko.mikhail@gmail.com>
-
Alexander V. Tikhonov authored
Added packaging jobs for Fedora 32. Closes #4966
-
Alexander V. Tikhonov authored
Found that the test replication/skip_conflict_row.test.lua fails with the following output in the results file:

[035] @@ -139,7 +139,19 @@
[035]  -- applier is not in follow state
[035]  test_run:wait_upstream(1, {status = 'stopped', message_re = "Duplicate key exists in unique index 'primary' in space 'test'"})
[035]  ---
[035] -- true
[035] +- false
[035] +- id: 1
[035] +  uuid: f2084d3c-93f2-4267-925f-015df034d0a5
[035] +  lsn: 553
[035] +  upstream:
[035] +    status: follow
[035] +    idle: 0.0024020448327065
[035] +    peer: unix/:/builds/4BUsapPU/0/tarantool/tarantool/test/var/035_replication/master.socket-iproto
[035] +    lag: 0.0046234130859375
[035] +  downstream:
[035] +    status: follow
[035] +    idle: 0.086121961474419
[035] +    vclock: {2: 3, 1: 553}
[035]  ...
[035]  --
[035]  -- gh-3977: check that NOP is written instead of conflicting row.

The test could not be restarted with a checksum because values like the UUID change on each fail. It happened because test-run uses an internal chain of functions wait_upstream() -> gen_box_info_replication_cond() which returns the instance information on its failures. To avoid this, the output was redirected to the log file instead of the results file.
-
Alexander V. Tikhonov authored
Since the current testing schema uses a separate pipeline for each testing job, the workflow names should be the same as the job names to make them more visible on the GitHub Actions results page [1].

[1] - https://github.com/tarantool/tarantool/actions
-