- Jul 01, 2021
-
-
Vladimir Davydov authored
An LSM tree (space index, that is) can be dropped while compaction is in progress for it. In this case compaction will still commit the new run to vylog upon completion. This usually works fine, but not if gc has already purged all the information about the dropped LSM tree from vylog by that time, in which case an attempt to commit the new run will result in a permanently broken vylog (because compaction will write vylog records for a non-existing object):

    ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 13 deleted but not registered

To prevent this from happening, let's make compaction silently drop the new run without committing it to vylog if the LSM tree has been dropped. This should work just fine - since the LSM tree isn't used anymore we don't need to have it compacted, neither do we need to delete the run, since gc will eventually clean up all artefacts left from the dropped LSM tree.

One thing to be noted is that we also must exclude dropped LSM trees from further compaction - if we don't do that, we might end up picking the dropped LSM tree for compaction over and over again (because it isn't actually compacted).

This patch also drops the gh-5141-invalid-vylog-file test, because the latter just ensured that the issue fixed by this patch is there.

Closes #5436
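The two rules described above (skip the vylog commit for a dropped tree, and never pick a dropped tree for compaction again) can be sketched roughly as follows. This is only an illustration in Python, not the actual vinyl code: `LsmTree`, `Vylog`, `pick_for_compaction` and `complete_compaction` are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LsmTree:
    # Hypothetical stand-in for an LSM tree; not the real vinyl structure.
    id: int
    is_dropped: bool = False
    needs_compaction: bool = True


@dataclass
class Vylog:
    # Hypothetical stand-in for the metadata log.
    records: list = field(default_factory=list)

    def commit_run(self, lsm_id, run_id):
        self.records.append(("insert_run", lsm_id, run_id))


def pick_for_compaction(trees):
    """Return the first live tree that still needs compaction, or None."""
    for tree in trees:
        if tree.is_dropped:
            # A dropped tree is never compacted, so it must be skipped here;
            # otherwise it would be picked over and over again.
            continue
        if tree.needs_compaction:
            return tree
    return None


def complete_compaction(tree, run_id, vylog):
    """Commit the new run unless the tree was dropped in the meantime."""
    if tree.is_dropped:
        # Silently discard the run: gc is expected to clean up the files.
        return False
    vylog.commit_run(tree.id, run_id)
    tree.needs_compaction = False
    return True
```

With this shape, a run produced for a tree that was dropped mid-compaction never reaches the log, so no record can refer to an object that gc has already purged.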
-
Egor Elchinov authored
Now idle fibers are present in fiber.info() but without their stacks. Added test ensuring that fiber.info doesn't get cluttered by idle fibers stacks after dispatching multiple requests in short time. Closes #4235
-
Egor Elchinov authored
In some cases it's good to have an opportunity to detect if fiber is idle in a fiber_pool. Now this can be done as fiber->flags & FIBER_IS_IDLE. Needed for: #4235
-
- Jun 24, 2021
-
-
VitaliyaIoffe authored
Make it possible to save packages in S3 buckets. Closes: #5825
-
VitaliyaIoffe authored
Add ubuntu-hirsute workflow, which runs on push and pull-requests. Fix lintian globbing-patterns-out-of-order warnings. Part of: #5825
-
VitaliyaIoffe authored
Since the build became out-of-source after patch 781fd38, which removed the path of the source dir, the __FILE__ macro leads to a compilation failure on ubuntu_21_04. Change __FILE__ to the file path. Needed for: #5825
-
- Jun 23, 2021
-
-
Cyrill Gorcunov authored
We already have the `box.replication.upstream.lag` entry for monitoring purposes. At the same time, in synchronous replication timeouts are key properties of the quorum gathering procedure. Thus we would like to know how long it takes a transaction to traverse the `initiator WAL -> network -> remote applier -> initiator ACK reception` path.

Typical output is

 | tarantool> box.info.replication[2].downstream
 | ---
 | - status: follow
 |   idle: 0.61753897101153
 |   vclock: {1: 147}
 |   lag: 0
 | ...
 | tarantool> box.space.sync:insert{69}
 | ---
 | - [69]
 | ...
 |
 | tarantool> box.info.replication[2].downstream
 | ---
 | - status: follow
 |   idle: 0.75324084801832
 |   vclock: {1: 151}
 |   lag: 0.0011014938354492
 | ...

Closes #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add `box.info.replication[n].downstream.lag` entry

`replication[n].downstream.lag` represents the lag between the moment the main node writes a certain transaction to its own WAL and the moment it receives an ack for this transaction from a replica.
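As a rough illustration of the documented meaning of the lag, the value can be thought of as the difference between the ack reception time and the WAL write timestamp carried in the ack. The Python sketch below is not the relay code: the `DownstreamLag` class and its `on_ack` method are hypothetical, and keeping the previous value when the timestamp is zero is an assumption made for the example.

```python
class DownstreamLag:
    """Hypothetical tracker of the lag for one downstream replica."""

    def __init__(self):
        self.lag = 0.0

    def on_ack(self, wal_write_tm, ack_recv_tm):
        """Update the lag from an ack.

        wal_write_tm: timestamp of the transaction's write to the main
                      node's WAL, echoed back by the replica.
        ack_recv_tm:  time when the main node received the ack.
        """
        # Assumption: a zero timestamp means the peer did not fill the
        # field, so the previous value is kept.
        if wal_write_tm > 0:
            self.lag = ack_recv_tm - wal_write_tm
        return self.lag
```

For example, a transaction written at 100.0 and acknowledged at 100.25 yields a lag of 0.25 seconds.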
-
Cyrill Gorcunov authored
Applier fiber sends the current vclock of the node to the remote relay reader, pointing at the current state of fetched WAL data so the relay will know which new data should be sent. The packet the applier sends carries the xrow_header::tm field as zero, but we can reuse it to provide information about the first timestamp in a transaction we wrote to our WAL. Since old instances of Tarantool simply ignore this field, such an extension won't cause any problems.

The timestamp will be needed to account for the lag of downstream replicas, suitable for informational purposes and cluster health monitoring.

We update applier statistics in WAL callbacks, but since both apply_synchro_row and apply_plain_tx are used not only in real data application but in the final join stage as well (in this stage we're not writing the data yet), apply_synchro_row is extended with a replica_id argument which is non-zero when the applier is subscribed.

The calculation of the downstream lag itself will be addressed in the next patch, because sending the timestamp and its observation are independent actions.

Part-of #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Alexander V. Tikhonov authored
Checked and found that:

    #4353 -> tarantool/tarantool-qa#13: engine/ddl.test.lua fixed in #6102.
    #4926, tarantool/tarantool#115: box/alter_limits.test.lua fixed in tarantool/tarantool-qa#126.
    #5547 -> tarantool/tarantool-qa#50: box/net.box_schema_change_gh-2666.test.lua fixed in tarantool/tarantool-qa#126.
    #5583 -> tarantool/tarantool-qa#22: box/net.box_methods_gh-3107.test.lua fixed in tarantool/tarantool-qa#126.

Closes tarantool/tarantool-qa#13
Closes tarantool/tarantool-qa#115
Closes #4926
Closes tarantool/tarantool-qa#50
Closes tarantool/tarantool-qa#22
-
Alexander V. Tikhonov authored
Found that the root cause of the issues with vinyl tests was side effects of the incorrect test 'vinyl/gh.test.lua', which left the Tarantool worker process in an inconsistent state. After it, any subsequent test on the same Tarantool worker process could fail when making snapshot calls, like tarantool/tarantool-qa#126:

    error: Snapshot is already in progress

Alternatively, restarting the Tarantool worker process could fail on stopping it, like tarantool/test-run#261 and #5141:

    E> failed to process vylog record: delete_slice{slice_id=115, }
    E> ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Slice 115 deleted but not registered

Decided to remove all vinyl tests from the 'fragile' list except two: 'gh.test.lua', which should first be improved so that it can run with the other tests, and 'gh-5141-invalid-vylog-file.test.lua', which checks this issue and can be removed once the fix is done.

The following issues were moved to the tarantool/tarantool-qa repository:

    #4346 -> tarantool/tarantool-qa#11
    #5408 -> tarantool/tarantool-qa#73
    #5584 -> tarantool/tarantool-qa#21
    #5586 -> tarantool/tarantool-qa#19

Part of tarantool/tarantool-qa#97
Closes tarantool/tarantool-qa#11
Closes #4572
Closes #4979
Closes #4984
Closes #5336
Closes #5356
Closes #5377
Closes #5378
Closes #5383
Closes tarantool/tarantool-qa#73
Closes tarantool/tarantool-qa#21
Closes tarantool/tarantool-qa#19
-
Kirill Yukhin authored
* Retry on a Lua error in test_run:wait_cond()
* Re-raise an error from a remote server in :eval()
-
- Jun 22, 2021
-
-
Alexander V. Tikhonov authored
Found that after the change 474eda49 ('github-ci: use vardir option in tests runs'), where 'vardir' was changed, the GitHub Actions workflows were not updated with the same change, so artifacts could not be collected. This patch fixes it. Closes tarantool/tarantool-qa#125
-
Oleg Babin authored
It seems that CMakeLists.txt contains the same line as BuildZSTD.cmake. There is no need to duplicate this logic, so let's remove it.
-
Alexander V. Tikhonov authored
Found that a previous commit, 42c64d06 ('test: fix hanging of vinyl/gh.test.lua'), mistakenly made the following change:

    -while finished ~= 2 do fiber.sleep(0.01) end
    +test_run:wait_cond(function() return finished ~= 2 end)

which broke the logic of the check. This patch fixes it.

Part of #5141
Fixes tarantool/test-run#261
Part of tarantool/tarantool-qa#106
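To see why the changed check was broken: a sleep loop runs while its condition holds, whereas a wait_cond-style helper returns as soon as its predicate becomes true, so the predicate has to be inverted when converting one into the other. The Python sketch below only mimics that behaviour; `wait_cond` here is a simplified stand-in written for this example, not the test-run implementation, and `state` is a hypothetical counter.

```python
import time


def wait_cond(cond, timeout=1.0, delay=0.01):
    """Poll cond() until it returns true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay)


state = {"finished": 0}

# Broken translation: the loop condition was copied unchanged, so the
# predicate is already true and the call returns without waiting.
broken = wait_cond(lambda: state["finished"] != 2, timeout=0.05)

# Inverted predicate: waits until both workers have finished
# (here nothing finishes, so it times out).
correct = wait_cond(lambda: state["finished"] == 2, timeout=0.05)
```

Here `broken` is true even though nothing has finished, which is the behaviour the commit describes as a broken check.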
-
- Jun 21, 2021
-
-
Cyrill Gorcunov authored
Currently we filter synchro packets based on their contents, in particular by their xrow->replica_id value. Still, there was a question whether we can optimize this and instead filter out all packets coming from a non-leader replica.

The Raft specification requires that only data from the current leader be applied to the local WAL, but it doesn't put a concrete claim on the data transport, i.e. how exactly rows reach replicas. This implies that data propagation may reach replicas indirectly via transit hops. Thus we drop applier->instance_id filtering and rely on xrow->replica_id matching instead.

In the test (inspired by Serge Petrenko's test) we recreate the situation where replica3 obtains the data of the master node (which is the raft leader) indirectly via the replica2 node.

Closes #6035

Co-developed-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
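The difference between the two filtering strategies can be illustrated with a small Python sketch. It is a simplification under stated assumptions: `Row`, `accept_by_origin` and `accept_by_sender` are hypothetical names, and the real applier code deals with more than a single id comparison.

```python
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical stand-in for a replicated row.
    replica_id: int  # id of the instance that originally wrote the row
    lsn: int


def accept_by_origin(row, leader_id):
    """Accept a row if it was originally written by the current leader,
    no matter which peer delivered it."""
    return row.replica_id == leader_id


def accept_by_sender(sender_id, leader_id):
    """Rejected alternative: accept only rows delivered directly by the
    leader, which breaks when data arrives via a transit hop."""
    return sender_id == leader_id


# The leader is instance 1; a third replica receives its row via instance 2.
row = Row(replica_id=1, lsn=10)
relayed_ok = accept_by_origin(row, leader_id=1)
relayed_dropped = not accept_by_sender(sender_id=2, leader_id=1)
```

Matching on the row's origin keeps leader data flowing through intermediate replicas, while sender-based filtering would discard it.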
-
Alexander Turenko authored
This update offers two changes:

* More intelligent error reporting for non UTF-8 TAP13 output ([1], [2]).
* Restart tarantool server before each test (to avoid unexpected and non-obvious dependencies between tests, see [3]).

Removed the pretest_clean suite.ini option: test-run does not read it anymore.

[1]: https://github.com/tarantool/test-run/issues/293
[2]: https://github.com/tarantool/test-run/pull/297
[3]: https://github.com/tarantool/test-run/pull/309
-
- Jun 18, 2021
-
-
Oleg Babin authored
After this patch the digest module will use the bundled xxhash. It fixes the build problem that occurs if the user doesn't use the bundled zstd. Closes #6135 Follow-up #2003
-
Oleg Babin authored
This patch is the first step in fixing the regression introduced in f998ea39 (digest: introduce FFI bindings for xxHash32/64). We used the xxhash library that is shipped with zstd. However, it's possible that the user doesn't use the bundled zstd. In such cases we couldn't export xxhash symbols and the build failed with the following error:

```
[ 59%] Linking CXX executable tarantool
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xd80): undefined reference to `XXH32'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xd88): undefined reference to `XXH32_copyState'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xd90): undefined reference to `XXH32_digest'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xd98): undefined reference to `XXH32_reset'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xda0): undefined reference to `XXH32_update'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xda8): undefined reference to `XXH64'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xdb0): undefined reference to `XXH64_copyState'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xdb8): undefined reference to `XXH64_digest'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xdc0): undefined reference to `XXH64_reset'
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: CMakeFiles/tarantool.dir/exports.c.o:(.data.rel+0xdc8): undefined reference to `XXH64_update'
collect2: error: ld returned 1 exit status
```

To avoid the problem, this patch introduces a standalone xxhash library that will be bundled anyway. It's worth mentioning that our approach is still related to zstd: we use Cyan4973/xxHash, which is used in zstd, and pass the same compile flags to it. The only difference is the usage of XXH_NAMESPACE to avoid symbol clashes with zstd.

Needed for #6135
-
VitaliyaIoffe authored
Make it possible to save packages in S3 buckets. Closes: #5824
-
VitaliyaIoffe authored
Add ubuntu-groovy workflow, which runs on push and pull-requests. Fix lintian globbing-patterns-out-of-order warnings. Part of: #5824
-
Alxander V. Tikhonov authored
Found that on slow test hosts like FreeBSD VMware the test failed intermittently, like:

    [001] @@ -153,7 +153,7 @@
    [001] | ...
    [001] assert(leader_count == 1)
    [001] | ---
    [001] - | - true
    [001] + | - error: assertion failed!
    [001] | ...
    [001] -- All nodes have the same leader.
    [001] r1_leader = test_run:eval('election_replica1', leader_id_cmd)[1]

It happened because there was not enough time between the wait_fullmesh() call and the check of the current leader for any of the replicas to become the leader. Later in the code, the logic that determined the leader was incorrect: it chose 'election_replica3' whenever neither of the others was the leader, which caused the later failures in the test. Decided to wait after wait_fullmesh() until one of the replicas becomes the leader, using the test_cond() routine.

Closes #5368
-
- Jun 17, 2021
-
-
Alexander V. Tikhonov authored
On heavily loaded hosts the following issue was found:

    [046] --- replication/transaction.result Wed Sep 30 17:19:32 2020
    [046] +++ /tmp/tnt/rejects/replication/transaction.reject Tue Nov 24 04:39:29 2020
    [046] @@ -234,7 +234,7 @@
    [046] ...
    [046] box.info.replication[1].upstream.status
    [046] ---
    [046] -- follow
    [046] +- disconnected
    [046] ...
    [046] test_run:cmd("switch default")
    [046] ---
    [046]

It happened because box.cfg was not yet ready to provide the information. In fact, there is no need for a local check of the replication information availability, because the wait_upstream() function used below performs this check itself.

Closes #5563
Closes tarantool/tarantool-qa#35
-
- Jun 16, 2021
-
-
Vladislav Shpilevoy authored
When txn_commit/try_async() failed before going to WAL thread, they installed TXN_SIGNATURE_ABORT signature meaning that the caller and the rollback triggers must look at the global diag. But they called txn_rollback() before doing return and calling the triggers, which overrode the signature with TXN_SIGNATURE_ROLLBACK leading to the original error loss. The patch makes TXN_SIGNATURE_ROLLBACK installed only when a real rollback happens (via box_txn_rollback()). This makes the original commit errors like a conflict in the transaction manager and OOM not lost. Besides, ERRINJ_TXN_COMMIT_ASYNC does not need its own diag_log() anymore. Because since this commit the applier logs the correct error instead of ER_WAL_IO/ER_TXN_ROLLBACK. Closes #6027
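The ordering problem described above can be modelled in a few lines of Python. This is a toy model, not the txn code: the `Signature` enum and the `Txn` class with its methods are hypothetical, and only the idea that the rollback path must not overwrite an already installed ABORT signature is taken from the commit message.

```python
from enum import Enum, auto


class Signature(Enum):
    # Hypothetical mirror of the TXN_SIGNATURE_* values named above.
    UNKNOWN = auto()
    ROLLBACK = auto()
    ABORT = auto()


class Txn:
    def __init__(self):
        self.signature = Signature.UNKNOWN
        self.on_rollback = []

    def _run_rollback_triggers(self):
        # Triggers observe whatever signature is installed at this point.
        for trigger in self.on_rollback:
            trigger(self.signature)

    def fail_before_wal(self):
        """Commit failed before reaching WAL: keep ABORT so that triggers
        know to look at the global diag for the original error."""
        self.signature = Signature.ABORT
        self._run_rollback_triggers()

    def user_rollback(self):
        """An explicit rollback is the only path that installs ROLLBACK."""
        self.signature = Signature.ROLLBACK
        self._run_rollback_triggers()
```

In this model a trigger attached to a transaction that fails before WAL sees ABORT, so the original error is not masked by a generic rollback.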
-
Vladislav Shpilevoy authored
Sometimes a transaction can fail before it goes to WAL. Then the signature didn't have any sign of it, as well as the journal_entry result (which might be not even created yet). Still if txn_commit/try_async() are called, they invoke on_rollback triggers. The triggers only can see TXN_SIGNATURE_ROLLBACK and can't distinguish it from a real rollback like box.rollback(). Due to that some important errors like a transaction manager conflict or OOM are lost. The patch introduces a new error signature TXN_SIGNATURE_ABORT which says the transaction didn't manage to try going to WAL and for an error need to look at the global diag. The next patch is going to stop overriding it with TXN_SIGNATURE_ROLLBACK. Part of #6027
-
Vladislav Shpilevoy authored
A transaction in WAL thread could be rolled back not only due to an IO error. But also if there was a cascading rollback in progress. The patch makes such case use a special error code turned into its own diag when it reaches the TX thread. Usage of ER_WAL_IO wasn't correct here. Part of #6027
-
Vladislav Shpilevoy authored
Previously all journal and txn errors were turned into ER_WAL_IO error code. It led to loss of the real error, which sometimes was absolutely not related to IO. For example, a timeout in the limbo for a synchronous transaction. The patch makes journal/txn errors turn into proper diags. Part of #6027
-
Vladislav Shpilevoy authored
In the journal write trigger the transaction assumed it might be already rolled back and completed, hence does not need to do anything except free itself. But it can't happen. The only imaginable reason why a transaction might be rolled back before it completed its WAL write is a ROLLBACK entry issued after the transaction. But ROLLBACK applies its effects only after it is written. Hence only after all the other pending txns are written too. Therefore it is not possible for a transaction to get ROLLBACK before it finishes its own WAL write. Probably it was possible in the time when applier used to execute ROLLBACK before writing it to WAL. But that was fixed in b259e930 ("applier: process synchro rows after WAL write"). Can't happen now. This became easier to realize when not finished transaction signature got its own value TXN_SIGNATURE_UNKNOWN.
-
Vladislav Shpilevoy authored
Journal used to have only one error code in journal_entry.res: -1. It had at least 2 problems:

- There was an assumption that TXN_SIGNATURE_ROLLBACK is the same as journal_entry error = -1;
- It wasn't possible to tell if the entry tried to be written and failed, or it didn't try yet. Both looked as -1.

The patch introduces a new error code JOURNAL_ENTRY_ERR_UNKNOWN. The IO error now has its own value: JOURNAL_ENTRY_ERR_IO. This helps to ensure that a not finished journal entry or a transaction won't try to obtain a diag error for their result.

Part of #6027
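A small Python sketch of why two distinct codes help: with a single -1 value, "not written yet" and "write failed" are indistinguishable, while separate values let the caller decide whether a diag is worth building. The numeric values and the `entry_state` helper below are assumptions made for the illustration, not taken from the source.

```python
from enum import IntEnum


class JournalEntryErr(IntEnum):
    # Assumed numeric values; only the names come from the commit message.
    UNKNOWN = -1  # the entry has not finished (or not even started) writing
    IO = -2       # the write was attempted and failed with an IO error


def entry_state(res):
    """Classify a journal entry result code."""
    if res >= 0:
        return "written"
    if res == JournalEntryErr.UNKNOWN:
        # No error to report yet, so no diag should be requested.
        return "not finished"
    if res == JournalEntryErr.IO:
        return "io error"
    return "other error"
```

A caller can then refuse to turn a "not finished" entry into an error, which is the guarantee the commit message mentions.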
-
Vladislav Shpilevoy authored
A transaction on rollback used to check if it was already rolled back inside of the limbo by looking at its signature as

    signature != TXN_SIGNATURE_ROLLBACK

It meant the transaction is already completed. TXN_SIGNATURE_ROLLBACK was used as a default value of the signature. Therefore if it is not default, it is completed.

This is going to break if normal (not synchronous) transactions would have more rollback codes except just TXN_SIGNATURE_ROLLBACK. Also treatment of TXN_SIGNATURE_ROLLBACK as a default value looks confusing. Next patches are going to rework the codes and render the assumptions above incorrect.

This patch makes the transaction use a correct way to check whether it is in the limbo still - look at TXN_WAIT_SYNC flag. It is set for all txns in the limbo and is not set for all the others.

Part of #6027
-
Vladislav Shpilevoy authored
ER_WAL_IO is set on any WAL error if it was after journal_write() success. It is not correct, because there can be plenty of reasons. In WAL it could be an actual IO error or a cascading rollback in progress. When used for transactions, it could be an error related to synchronous transactions like a timeout, or a persistent ROLLBACK. These errors are overridden by ER_WAL_IO. The patch encapsulates the diag installation for bad journal write and for transaction rollback. The next patches are going to introduce more error codes and use proper ones to install a diag. Part of #6027
-
Vladislav Shpilevoy authored
diag_set() uses the current file and line as built-in macros. In the future patches there are going to appear a couple of new diag_set-like helpers which also would want to preserve the original file and line. For that they must be macros at least partially, like diag_set(), and pass their own file and line. Because they are going not to be very trivial and won't be implemented in the header. The patch introduces diag_set_detailed() which allows to pass custom file and line. Needed for #6027
-
Vladislav Shpilevoy authored
It didn't have a single fail path. That led to some amount of code duplication, and it complicated future patches where the journal entries are going to get a proper error reason instead of default -1 without any details. The patch is a preparation for #6027 where it is wanted to have more detailed errors on journal entry/transaction fail instead of ER_WAL_IO for everything. Sometimes it can override a real error like a cascade rollback, or a transaction conflict. Part of #6027
-
Vladislav Shpilevoy authored
It used to simply return -1 and set a diag only when OOM happened inside. The caller was forced either to ignore the result or set its own diag regardless of what really happened. The patch makes journal_write() set a correct diag error when it returns -1. The only implementation to change was wal_write_async(). The other implementations always return 0. Part of #6027
-
Vladislav Shpilevoy authored
The script name was too long. It was also used as a name for the unix socket file on which the replica listens. As a result, the test couldn't start, at least on my machine. Besides, the script was not any different from the existing replica.lua, except a couple of not important settings. The patch drops it and makes gh-4730-applier-rollback.test.lua use replica.lua. Now it can run on my machine. Done as a preparation for #6027, which is slightly related to the test - it is also about errors in applier and their display.
-
Vladislav Shpilevoy authored
It was called ER_CHECKPOINT_ROLLBACK but was set only when there is a cascade rollback in WAL. The new error name is going to be used in the next patches, where not only checkpoint can fail due to a cascade rollback. Part of #6027
-
mechanik20051988 authored
`VERSION` files in small subproject and in tarantool are treated as C++ standard library on a filesystem with case-insensitive names. So we have to delete the root of tarantool project from `include_directories` in tarantool CMake. Also we have to change `include_directories` in tarantool CMake from the root of `small` project to `include` subfolder in `small` project. Closes #6076
-
Kirill Yukhin authored
* build: fix tarantool build failure on xcode 12.5
-
- Jun 15, 2021
-
-
Alexander V. Tikhonov authored
In scope of this commit new GitHub Actions workflows for testing Tarantool on M1 hosts are added:

    Release: .github/workflows/osx_arm64_11_2.yml
    Debug: .github/workflows/osx_debug_arm64_11_2.yml

Since GitHub Actions uses x86_64 environment by default on M1 targets, 'arch -arm64' prefix is specified in GitHub Actions workflow to make all commands in .travis.mk run in ARM64 environment.

Introduced a new temporary target in .travis.mk Makefile to run only specific LuaJIT test suites on M1. Now it runs only the following LuaJIT test targets:

* PUC-Rio-Lua-5.1-tests
* lua-Harness-tests
* tarantool-tests

Python 3.9 is installed by default on M1 hosts, but gevent is required for Tarantool tests, and its installation fails with the following error:

    Using cached gevent-21.1.2.tar.gz (5.9 MB)
    Installing build dependencies ... done
    Getting requirements to build wheel ... done
    Preparing wheel metadata ... error
    ERROR: Command errored out with exit status 1:
    command: /opt/homebrew/opt/python@3.9/bin/python3.9 /opt/homebrew/lib/python3.9/site-packages/pip/_vendor/pep517/in_process/_in_process.py prepare_metadata_for_build_wheel /var/folders/b0/1vlv5rvn77x2rn6zbl2p4tqr0000gp/T/tmpyy59ae2p
    cwd: /private/var/folders/b0/1vlv5rvn77x2rn6zbl2p4tqr0000gp/T/pip-install-msbf7_vz/gevent_c2956687bb0d4de9bfb5f0660da759ee
    Complete output (42 lines):
    ...
    File "/private/var/folders/b0/1vlv5rvn77x2rn6zbl2p4tqr0000gp/T/pip-build-env-1lesbbxi/overlay/lib/python3.9/site-packages/cffi/api.py", line 48, in __init__
    import _cffi_backend as backend
    ImportError: dlopen(/private/var/folders/b0/1vlv5rvn77x2rn6zbl2p4tqr0000gp/T/pip-build-env-1lesbbxi/overlay/lib/python3.9/site-packages/_cffi_backend.cpython-39-darwin.so, 2): no suitable image found. Did find:
    /private/var/folders/b0/1vlv5rvn77x2rn6zbl2p4tqr0000gp/T/pip-build-env-1lesbbxi/overlay/lib/python3.9/site-packages/_cffi_backend.cpython-39-darwin.so: mach-o, but wrong architecture
    /private/var/folders/b0/1vlv5rvn77x2rn6zbl2p4tqr0000gp/T/pip-build-env-1lesbbxi/overlay/lib/python3.9/site-packages/_cffi_backend.cpython-39-darwin.so: mach-o, but wrong architecture

This issue is described in gevent/gevent#1721. Fortunately, gevent can be successfully installed via Python 3.8, hence to avoid this failure, python3 is pinned to the specific version (i.e. python@3.8) until the mentioned issue is resolved.

Closes tarantool/tarantool-qa#120
Relates to #6068
-
Alexander V. Tikhonov authored
To avoid targets duplication in the later use of .travis.mk file, it was decided to parameterize OSX jobs and move all fine tuning manipulations related to these pipelines from .travis.mk to the corresponding GitHub Actions workflows.
-
- Jun 12, 2021
-
-
Igor Munkin authored
* ARM64: Fix xpcall() error case (really).
* ARM64: Fix xpcall() error case.
* test: add arch-specific skipcond for memprof
* ARM, ARM64, PPC: Fix TSETR fallback.

Closes #6084
Closes #6093
Part of #5629
-