- Jun 20, 2022
-
-
Sergey Bronnikov authored
By default, CMake generates Makefiles for building a project. However, it can also generate Ninja files. Ninja [1] may build a project a bit faster than Make, see [2]. This patch fixes the CMake files so that Ninja can be used for building Tarantool:

1. Fixed dependencies in ExternalProject_Add(), see the explanation in [3].
2. Fixed a ninja error caused by the symbol '$' in cmake/rpm.cmake.
3. Added propagation of CMAKE_GENERATOR to dependencies that use CMake for building, see [4].

How to build with Ninja:

$ cmake -G Ninja -B build -S .
$ ninja -C build/

1. https://ninja-build.org/
2. https://mesonbuild.com/Simple-comparison.html
3. https://stackoverflow.com/a/65803911/3665613
4. https://cmake.org/cmake/help/latest/module/ExternalProject.html

NO_DOC=internal
NO_CHANGELOG=internal
NO_TEST=internal
-
- Jun 18, 2022
-
-
Igor Munkin authored
* ci: add job for build using Ninja on Linux/x86_64
* build: create file lists outside of CMake commands
* build: use unique names for CMake targets
* Revert "test: disable PUC-Rio tests for several -l options"
* ci: make GitHub workflows more CMake-ish
* test: adapt PUC-Rio tests for debug line hook
* test: adapt PUC-Rio test for tail calls debug info
* test: adapt PUC-Rio test with reversed function

Closes #5693
Closes #5702
Closes #5782
Follows up #5747

NO_DOC=LuaJIT submodule bump
NO_TEST=LuaJIT submodule bump
NO_CHANGELOG=LuaJIT submodule bump
-
- Jun 17, 2022
-
-
Cyrill Gorcunov authored
When a fiber has finished its work, it ends up in one of two states:

1) If the "joinable" attribute is not set, the fiber is simply recycled.
2) Otherwise it keeps hanging around, waiting to be joined.

Our API allows calling fiber_wakeup() for dead but joinable fibers (2) in release builds without any side effects: such fibers are simply ignored. In debug builds, however, this triggers an assertion.

We can't change our API, for the sake of backward compatibility, but at the same time we must not keep different behaviour between release and debug builds, since this brings inconsistency. Thus let's get rid of the assertion and allow calling fiber_wakeup() in debug builds as well.

Fixes #5843

NO_DOC=bug fix

Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
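
The behaviour described in this commit can be illustrated with a minimal Python sketch. It is not Tarantool's C code: the `Fiber` class, its `state` values and the `wakeup` function are hypothetical stand-ins. It only shows a wakeup that silently ignores a dead but joinable fiber instead of asserting.

```python
from dataclasses import dataclass


@dataclass
class Fiber:
    # Hypothetical model of a fiber: state is "suspended" or "dead".
    state: str = "suspended"
    joinable: bool = False
    ready: bool = False


def wakeup(fiber: Fiber) -> bool:
    """Schedule the fiber; return False if the call was ignored."""
    if fiber.state == "dead":
        # A dead fiber that still waits to be joined is ignored,
        # identically in every build type, instead of asserting.
        return False
    fiber.ready = True
    return True
```

Returning a boolean is a choice made for the sketch so the ignored case is observable; the commit message does not say that fiber_wakeup() reports anything.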
-
Serge Petrenko authored
Once the split-brain detection is in place, it's fine to nopify obsolete data even on a node with elections disabled. Let's not keep a bug around anymore. This behaviour change leads to changing "gh_6842_qsync_applier_order_test.lua" a bit. It actually relied on old and buggy behaviour: it assumed old transactions would not be nopified and would trigger replication error. This doesn't happen anymore, because nopify works correctly, and the transactions are not followed by a conflicting CONFIRM. The test for this commit is simply altering the gh_5295_split_brain_detection_test.lua to work with elections disabled. Closes #6133 Follow-up #5295 NO_DOC=internal change NO_CHANGELOG=internal change
-
Cyrill Gorcunov authored
When we receive synchro requests we can't just apply them blindly, because in the worst case they may come from a split-brain configuration (where a cluster has split into several clusters, each one has elected its own leader, and then the clusters try to merge back into the original one). We need to do our best to detect such disunity and force these nodes to rejoin from scratch for the sake of data consistency.

Thus, when we're processing requests, we pass them to the packet filter first, which validates their contents and refuses to apply them if they violate consistency.

Depending on the request type, each packet traverses an appropriate chain.

filter_generic(): a common chain for any synchro packet.
 1) request:replica_id = 0 is allowed for PROMOTE requests only.
 2) request:replica_id should match limbo:owner_id, IOW the limbo migration should be noticed by all instances in the cluster.

filter_confirm_rollback(): a chain for CONFIRM | ROLLBACK packets.
 1) Zero LSN is disallowed for such requests.

filter_promote_demote(): a chain for PROMOTE | DEMOTE packets.
 1) The requests should come in with a nonzero term, otherwise the packet is corrupted.
 2) The request's term should not be less than the maximal known one, IOW it should not come in from nodes which didn't notice raft epoch changes and are living in the past.

filter_queue_boundaries(): a common finalization chain.
 1) If the LSN of the request matches the current confirmed LSN, the packet is obviously correct to process.
 2) If the LSN is less than the confirmed LSN, then the request is wrong: we have processed the requested LSN already.
 3) If the LSN is greater than the confirmed LSN, then
    a) if the limbo is empty, we can't do anything, since the data is already processed, and should issue an error;
    b) if there is some data in the limbo, then the requested LSN should be in the range of the limbo's [first; last] LSNs, thus the request will be able to commit and rollback the limbo queue.

Note that the filtration is disabled during initial configuration, where we apply requests from the only source of truth (either the remote master or our own journal), so no split brain is possible.

In order to make split-brain checks work, the applier nopify filter now passes synchro requests from an obsolete term without nopifying them.

Also, now ANY asynchronous request coming from an instance with an obsolete term is treated as a split-brain. Think of it as a synchronous request committed with a malformed quorum.

Closes #5295

NO_DOC=it's literally below

Co-authored-by:
Serge Petrenko <sergepetrenko@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: new error type: ER_SPLIT_BRAIN

If for some reason the cluster had 2 leaders working independently (for example, the user has mistakenly lowered the quorum below N / 2 + 1), then once such leaders and their followers try connecting to each other, they will receive the ER_SPLIT_BRAIN error, and the connection will be aborted. This is done to preserve data integrity. Once the user notices such an error, he or she has to manually inspect the data on both of the split halves, choose a way to restore the data, and rebootstrap one of the halves from the other.
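
The filter_queue_boundaries() rules listed in this commit message can be sketched in Python as follows. This is an illustration under assumptions, not the actual C implementation: the function signature, the `limbo_range` argument and the `SplitBrainError` class (a stand-in for ER_SPLIT_BRAIN) are hypothetical.

```python
from typing import Optional, Tuple


class SplitBrainError(Exception):
    """Hypothetical stand-in for the ER_SPLIT_BRAIN error."""


def filter_queue_boundaries(req_lsn: int, confirmed_lsn: int,
                            limbo_range: Optional[Tuple[int, int]]) -> None:
    """Validate a request LSN against the limbo state.

    limbo_range is (first_lsn, last_lsn), or None for an empty limbo.
    """
    if req_lsn == confirmed_lsn:
        return  # Rule 1: matches the confirmed LSN, safe to process.
    if req_lsn < confirmed_lsn:
        # Rule 2: this LSN has been processed already.
        raise SplitBrainError("request LSN is below the confirmed LSN")
    if limbo_range is None:
        # Rule 3a: nothing is queued, so there is nothing to finalize.
        raise SplitBrainError("request LSN is ahead of an empty limbo")
    first_lsn, last_lsn = limbo_range
    if not first_lsn <= req_lsn <= last_lsn:
        # Rule 3b: the LSN must fall inside the queued [first; last] range.
        raise SplitBrainError("request LSN is outside the limbo range")
```

For example, with a confirmed LSN of 10 and a limbo holding LSNs 11 to 15, a request for LSN 13 passes, while requests for LSN 9 or LSN 20 are rejected.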
-
Serge Petrenko authored
Change return types of txn_limbo_req_prepare, txn_limbo_process, txn_limbo_write_promote, txn_limbo_write_demote from void to int. This is a preparation for when these functions start returning errors. Part-of #5295 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
Serge Petrenko authored
Make box_issue_promote and box_issue_demote return a return code. For now it's always 0, but soon they will return errors. Part-of #5295 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
Serge Petrenko authored
limbo->confirmed_lsn was only filled on limbo owner in txn_limbo_write_confirm. Replicas and recovering limbo owner need to track it as well to correctly detect split-brains based on confirmed_lsn. So update confirmed_lsn in txn_limbo_read_confirm. Part-of #5295 NO_DOC=internal change NO_TEST=tested in future commits NO_CHANGELOG=internal change
-
Serge Petrenko authored
It's important for the synchro queue owner to not finalize any of the pending synchronous transactions after restart. Since the node was down for some time, the chances are pretty high that it was deposed by some new leader during its downtime. It means that the node might not know yet that its transactions were already finalized by someone else.

So, any arbitrary finalization might lead to a future split-brain, once the remote PROMOTE finally reaches the local node.

Let's fix this by adding a new reason for the limbo to be frozen: a queue owner has recovered but has not issued a new PROMOTE locally and hasn't received any PROMOTE requests from the remote nodes.

Once the first PROMOTE is issued or received, it's safe to return to the old mode of operation.

So, now the synchro queue owner starts in the "frozen" state and can't CONFIRM, ROLLBACK or issue new transactions until either issuing a PROMOTE or receiving a PROMOTE from some remote node.

This also required modifying the box.ctl.promote() behaviour: it's no longer a no-op on a synchro queue owner when elections are disabled and the queue is frozen due to restart.

Also fix the tests which assumed the queue owner is writeable after a restart. The gh-5298 test was partially deleted, because it became pointless.

And while we are at it, remove the double run of the gh-5288 test. It is storage engine agnostic, so there's no point in running it for both memtx and vinyl.

Part-of #5295

NO_CHANGELOG=covered by previous commit

@TarantoolBot document
Title: ER_READONLY error receives new reasons

When box.info.ro_reason is "synchro" and some operation throws an ER_READONLY error, this error now might include the following reason:
```
Can't modify data on a read-only instance - synchro queue with term 2 belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen due to fencing
```
This means that the current instance is indeed the synchro queue owner, but it has noticed that someone else in the cluster might start new elections or might overtake the synchro queue soon. This may also be detected by `box.info.election.term` becoming greater than `box.info.synchro.queue.term` (this is the case for the second error message).

There is also a slightly different error message:
```
Can't modify data on a read-only instance - synchro queue with term 2 belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen until promotion
```
This means that the node simply cannot guarantee that it is still the synchro queue owner (for example, after a restart, when a node still thinks it is the queue owner, but someone else in the cluster has already overtaken the queue).
-
Serge Petrenko authored
Receiving a raft term greater than the current queue term means that someone has either already written PROMOTE (in case elections are disabled), or is going to write PROMOTE once he wins the elections (in case they are enabled). In both cases the queue owner in an old term should freeze the limbo until queue term catches up with raft term. Unfreezing happens automatically once synchro queue term catches up. Part-of #5295 NO_DOC=covered by next commit
-
Serge Petrenko authored
Soon there will be more reasons for a transaction limbo to be frozen. Let's make the limbo->frozen flag a bitmap and rename it to limbo->frozen_reasons. The first bit, named frozen_due_to_fencing, represents the only current reason for the limbo to be frozen.

While we are at it, rename txn_limbo_(un)freeze to txn_limbo_(un)fence to better reflect the situation.

Part-of #5295

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
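
The idea of turning a single boolean into a bitmap of reasons can be sketched in Python. This is only an illustration: the `FrozenReason` and `Limbo` names are hypothetical, and the second flag is an assumed example of a later reason, not a name taken from the source tree.

```python
from enum import IntFlag


class FrozenReason(IntFlag):
    FENCING = 1          # models the frozen_due_to_fencing bit
    UNTIL_PROMOTION = 2  # assumed second reason, for illustration only


class Limbo:
    def __init__(self) -> None:
        self.frozen_reasons = FrozenReason(0)

    def fence(self) -> None:
        self.frozen_reasons |= FrozenReason.FENCING

    def unfence(self) -> None:
        # Clear only the fencing bit; other reasons stay in effect.
        self.frozen_reasons &= ~FrozenReason.FENCING

    @property
    def is_frozen(self) -> bool:
        # The limbo is frozen while at least one reason bit is set.
        return bool(self.frozen_reasons)
```

With a bitmap, removing one reason does not unfreeze the limbo if another reason still holds, which a single boolean flag could not express.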
-
Serge Petrenko authored
Previously we assumed that every PROMOTE request changes limbo owner, and thus limbo should have confirmed_lsn = 0 after the request is processed, because new confirmed lsn is yet unknown. This is not true for PROMOTE requests coming in JOIN or saved in snapshot: such requests don't change limbo owner: they are like savepoints, they notify the instance of the current limbo state. Such promotions may be detected by the rule replica_id (old limbo owner) == origin_id (new limbo owner) So, for the sake of correct split-brain detection, confirmed_lsn should be nonzero after such promotions. Part-of #5295 NO_DOC=internal change NO_TEST=tested in future commits NO_CHANGELOG=internal change
-
Serge Petrenko authored
The test involves creating a manual split-brain between nodes r1 and r2. After the split-brain detection introduction it's impossible to reuse the nodes in the next test without recreating them. Let's fix that by switching nodes r1 and r3. Now there's a split-brain between (r1, r2) and r3, and r3 isn't used in the following tests and may be safely deleted. Follow-up #5295 NO_DOC=refactoring NO_CHANGELOG=refactoring Signed-off-by:
Serge Petrenko <sergepetrenko@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Serge Petrenko authored
Fix two issues with sent_raft_term calculations:

* first of all, it doesn't matter during initial and final join, so set it to UINT64_MAX.
* secondly, it's nullified after a successful dispatch from the tx thread. This might make the relay stall forever. For example, when elections are disabled.

NO_DOC=bugfix
NO_TEST=tested in next commit
-
- Jun 16, 2022
-
-
Ilya Verbin authored
Currently it's possible to wake up a fiber that is waiting for a child process to terminate, using the Tarantool C API. This will leave a zombie process behind. This patch reworks `coio_waitpid` in such a way that it yields until `cw.data` is set to NULL in the process status change callback.

Part of #7166

NO_DOC=refactoring
NO_CHANGELOG=refactoring
-
Ilya Verbin authored
Currently it's possible to wake up a fiber that is waiting for task completion, using the Tarantool C API. This will cause a "wrong fiber woken" panic. This patch reworks `cord_cojoin` in such a way that it yields until a completion flag is set.

Part of #7166

NO_DOC=refactoring
NO_CHANGELOG=refactoring
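
The "yield until a completion flag is set" pattern that these #7166 commits apply can be sketched with a Python generator standing in for a fiber. The `wait_for_completion` function and the `task` dictionary are hypothetical; the real code is C and uses fiber yields.

```python
def wait_for_completion(task):
    """Generator-style coroutine: keeps yielding until task["done"] is set."""
    while not task["done"]:
        # Being resumed with the flag still unset is a spurious wakeup:
        # go back to waiting instead of treating it as completion.
        yield
    return task["result"]


# Drive the coroutine by hand to show how a spurious wakeup is absorbed.
task = {"done": False, "result": None}
waiter = wait_for_completion(task)
next(waiter)                        # the waiter starts waiting
next(waiter)                        # spurious wakeup: it keeps waiting
task.update(done=True, result=42)   # the task really completes
try:
    next(waiter)
except StopIteration as stop:
    final_result = stop.value       # the value returned by the waiter
```

The key point is that the waiter re-checks the flag after every resume, so only the completion callback, which sets the flag, can end the wait.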
-
Ilya Verbin authored
Currently it's possible to wake up a `hot_standby_f` fiber from Lua. This does not lead to any error, but it results in redundant `recover_remaining_wals` calls. This patch handles such spurious wakeups in `hot_standby_f`.

Part of #7166

NO_DOC=refactoring
NO_TEST=refactoring
NO_CHANGELOG=refactoring
-
Ilya Verbin authored
Spurious wakeups are already handled correctly. Part of #7166 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
Georgiy Lebedev authored
Currently, there is a fundamental logical inconsistency with read-view and conflicted states of transactions. Conflicted transactions see all prepared changes (e.g., #7238), because they are handled differently than read-view ones. At the same time, one does not know the state of the transaction until `box.commit` is called. A similar problem arises with read-view transactions: if such transactions do any DML statements, they are de facto conflicted, but this will only be determined at the preparation stage: https://github.com/tarantool/tarantool/blob/79245573dabf3c1eb4eb904fd80ee84270360476/src/box/txn.c#L1006-L1013

Fix this inconsistency by the following changes:

1. Conflict "read-view" transactions immediately on an attempt to perform DML statements, and guarantee this with an assertion at the preparation stage.
2. Make conflicted transactions unconditionally throw the "Transaction has been aborted by conflict" error on any CRUD operations (including read-only ones) until they are either rolled back (which will return no error) or committed (which will return the same error).

Closes #7238
Closes #7239
Closes #7240

@TarantoolBot document
Title: new behaviour of "conflicted" transactions

"Conflicted" transactions now return the "Transaction has been aborted by conflict" error on any CRUD operations (including read-only ones), until they are either rolled back (which will return no error) or committed (which will return the same error).
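
The documented behaviour of conflicted transactions can be modelled with a small Python sketch. The `Txn` class and its methods are hypothetical and only mirror the rules stated above; they are not Tarantool's API.

```python
class TransactionConflict(Exception):
    pass


class Txn:
    """Toy transaction that models only the "conflicted" state."""

    def __init__(self) -> None:
        self.conflicted = False
        self.finished = False

    def _check(self) -> None:
        if self.conflicted:
            raise TransactionConflict(
                "Transaction has been aborted by conflict")

    def select(self, key):
        self._check()  # even read-only operations fail once conflicted
        return None

    def replace(self, row) -> None:
        self._check()

    def commit(self) -> None:
        self.finished = True
        self._check()  # commit reports the same error

    def rollback(self) -> None:
        self.finished = True  # rollback never reports the conflict
```

In this model the caller learns about the conflict at the first operation after it happened rather than only at commit time.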
-
Sergey Bronnikov authored
NO_CHANGELOG=internal NO_DOC=internal NO_TEST=internal
-
Pavel Balaev authored
Regular expression now works on versions: alpha, beta, rc and so on. NO_DOC=bugfix NO_TEST=bugfix NO_CHANGELOG=bugfix
-
Pavel Balaev authored
Tabs were replaced with spaces to bypass checkpatch. NO_DOC=bugfix NO_TEST=bugfix NO_CHANGELOG=bugfix
-
- Jun 14, 2022
-
-
Vladimir Davydov authored
The `cmp_tuple` helper function is broken - it assumes that all tuple fields, including the payload, are numeric. It isn't true - the payload field is either nil or string. This results in a false-positive test failure:
```
error: '[string "function cmp_tuple(t1, t2) for i = 1, PAY..."]:1: attempt to compare nil with string'
```

Closes #6336

NO_DOC=test
NO_CHANGELOG=test
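
A comparison helper that tolerates a missing payload can be sketched as follows. The original helper is Lua and its fixed version is not shown here, so this Python function is only an assumption about one reasonable approach; in particular, ordering a missing payload (`None`) before any string is a choice made for the example.

```python
def cmp_tuple(t1, t2) -> int:
    """Compare two tuples field by field; return -1, 0 or 1.

    Key fields are numbers; the last (payload) field is None or a string.
    Assumption for this sketch: None sorts before any string.
    """
    for a, b in zip(t1, t2):
        if a == b:
            continue
        if a is None:
            return -1
        if b is None:
            return 1
        return -1 if a < b else 1
    return 0
```

Checking for `None` before using `<` avoids the "attempt to compare nil with string" class of errors.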
-
Yaroslav Lobankov authored
Bump test-run to new version with the following improvements:

- Fix issue with not detecting successful server start [1]

[1] tarantool/test-run#343

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff
-
- Jun 09, 2022
-
-
Vladimir Davydov authored
With the default shutdown timeout of 3 seconds, a test that leaves behind asynchronous requests would still pass, but it would take longer to finish, because the server instance started by Tarantool would have to wait for the dangling requests to complete. Setting the timeout to infinity will result in a hang, making us fix the test. Infinite timeout is also good for catching bugs like #7225 and #7256. We don't set the timeout for diff and TAP tests because those are deprecated and shouldn't be used for writing new tests. Nevertheless, I manually checked that none of them hangs if the timeout is set to infinity. Closes #6820 NO_DOC=test NO_CHANGELOG=test
-
Vladimir Davydov authored
If a socket fd is shared by a child process, closing it in the parent will not shut down the underlying connection. As a result, the server may hang executing the graceful shutdown protocol. Fix this problem by explicitly shutting down the connection socket fd before closing it. This is a recommended way to terminate a Unix socket connection, see http://www.faqs.org/faqs/unix-faq/socket/#:~:text=2.6.%20%20When%20should%20I%20use%20shutdown()%3F Closes #7256 NO_DOC=bug fix
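
The shutdown-before-close approach can be shown with Python's standard `socket` module. This is a sketch of the general technique, not the patched Tarantool code; the `close_connection` helper name is hypothetical.

```python
import socket


def close_connection(sock: socket.socket) -> None:
    """Terminate the connection even if another process shares the fd."""
    try:
        # shutdown() acts on the connection itself, so the peer sees EOF
        # even while a duplicate of the descriptor is still open elsewhere.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # the socket is not connected or the peer is already gone
    finally:
        sock.close()
```

Calling `close()` alone only drops one reference to the descriptor, which is why a child process holding a copy can keep the connection alive.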
-
Ilya Verbin authored
It's possible to wake up a fiber that is waiting for WAL write completion, using the Tarantool C API. This results in an error like:
```
main/118/lua F> Journal result code -1 can't be converted to an error
```
This patch introduces a flag, which is set when the WAL write is finished, that allows fibers to yield until the flag is set.

Closes #6506

NO_DOC=bugfix
-
Yaroslav Lobankov authored
Bump test-run to new version with the following improvements:

- Fail *.test.py tests in case of server start errors [1]

[1] tarantool/test-run#333

NO_DOC=testing stuff
NO_TEST=testing stuff
NO_CHANGELOG=testing stuff
-
- Jun 08, 2022
-
-
Mergen Imeev authored
This patch fixes format building when an ephemeral space was used in ORDER BY and ORDER BY uses at least two variables from the list of selected columns. Closes #7042 NO_DOC=Bugfix
-
Serge Petrenko authored
There was an assertion failure when inserting a decimal into an index which contained double Inf or NaN. The reason for that was never checking decimal_from_*() return values, and decimal_from_double() not being able to handle NaN or Inf, because these values are not representable in decimal numbers. Start handling decimal_from_<type> return values and fix decimal comparison with Inf, NaN. Closes #6377 NO_DOC=bugfix
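
The idea of checking a conversion result instead of assuming success can be sketched in Python. The function names below are hypothetical, and the ordering of NaN below every number is an assumption made for the example. Note that Python's own `Decimal` can hold Inf and NaN, unlike the decimal type described in the commit, so the sketch rejects them explicitly.

```python
import math
from decimal import Decimal
from typing import Optional


def decimal_from_double(value: float) -> Optional[Decimal]:
    """Return None when the double has no finite decimal representation."""
    if math.isnan(value) or math.isinf(value):
        return None
    return Decimal(value)


def compare_double_with_decimal(lhs: float, rhs: Decimal) -> int:
    """Return -1, 0 or 1 without ever converting NaN or Inf."""
    converted = decimal_from_double(lhs)
    if converted is None:
        if math.isnan(lhs):
            return -1  # assumption: NaN sorts below every number
        return 1 if lhs > 0 else -1  # +Inf is above, -Inf below
    return (converted > rhs) - (converted < rhs)
```

The caller handles the `None` result explicitly, which is the part that corresponds to "start handling return values" in the commit message.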
-
- Jun 07, 2022
-
-
Yaroslav Lobankov authored
To reduce the chance to encounter the tarantool/test-run#141 issue in replication-py/swim tests, let's switch to using unix sockets instead of TCP ports for tarantool console. NO_DOC=testing stuff NO_TEST=testing stuff NO_CHANGELOG=testing stuff
-
- Jun 06, 2022
-
-
Ilya Verbin authored
Currently it's possible to wake up a fiber that is waiting for `cbus_call` completion, using the Tarantool C API. This will cause a misleading `TimedOut` error. This patch reworks `cbus_call` in such a way that it yields until a completion flag is set.

Part of #7166

NO_DOC=refactoring
NO_CHANGELOG=refactoring
-
Ilya Verbin authored
Part of #7166 NO_DOC=refactoring NO_TEST=refactoring NO_CHANGELOG=refactoring
-
Timur Safin authored
Simplify/shorten `interval_to_string()` implementation. Part of #7045 NO_CHANGELOG=refactoring NO_DOC=refactoring NO_TEST=refactoring
-
Timur Safin authored
Do not even try to make the output of secs/nsec more readable; report them as is, without any [de]normalization.

Not the prior way:
```
tarantool> dt.interval.new{min=1, sec=59, nsec=2e9+1}
---
- +1 minutes, 61.000000001 seconds
...
```

But instead as:
```
tarantool> dt.interval.new{min=1, sec=59, nsec=2e9+1}
---
- +1 minutes, 59 seconds, 2000000001 nanoseconds
...
```

Closes #7045

NO_DOC=internal
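
Printing the fields without normalization can be sketched in Python. This is a simplified illustration, not the C `interval_to_string()` implementation: it covers only three fields, and both the sign placement and the output for an all-zero interval are assumptions.

```python
def interval_to_string(minutes: int = 0, sec: int = 0, nsec: int = 0) -> str:
    """Format each non-zero field as stored, with no carry between fields."""
    parts = []
    fields = ((minutes, "minutes"), (sec, "seconds"), (nsec, "nanoseconds"))
    for value, unit in fields:
        if value == 0:
            continue
        # Assumption: only the first printed field carries an explicit "+".
        text = f"{value:+d}" if not parts else f"{value:d}"
        parts.append(f"{text} {unit}")
    # Assumption: an empty interval is rendered as "0 seconds".
    return ", ".join(parts) if parts else "0 seconds"
```

Because no field is folded into another, 2000000001 nanoseconds stays as entered instead of becoming two extra seconds.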
-
Vladimir Davydov authored
The graceful shutdown protocol works as follows:

1. The server sends a shutdown request (the box.shutdown event) to all its clients that subscribed to it.
2. Upon receiving a shutdown request, a client is supposed to close its connection.
3. The server waits for all clients subscribed to the box.shutdown event to exit.
4. The server exits.

In net.box, the box.shutdown event is processed by `remote._callback`. The problem is that `remote._callback` may be garbage collected while the `remote` object isn't. If this happens, the shutdown request will never get processed, and the server won't exit until the `remote` object is garbage collected, which may take forever.

Let's fix this issue by breaking the worker loop if we see that the callback was garbage collected.

Closes #7225

NO_DOC=bug fix
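
The fix, stopping the worker loop once the callback has been collected, can be sketched with Python's `weakref` module. The net.box code is Lua, so the `Callback` class and `worker_loop` function here are hypothetical stand-ins for the idea only.

```python
import weakref


class Callback:
    """Collects the events it was called with."""

    def __init__(self) -> None:
        self.events = []

    def __call__(self, event) -> None:
        self.events.append(event)


def worker_loop(callback_ref, events) -> int:
    """Dispatch events; stop as soon as the callback has been collected.

    Returns the number of events that were actually handled.
    """
    handled = 0
    for event in events:
        callback = callback_ref()  # dereference the weak reference
        if callback is None:
            break  # the callback is gone, so the loop must not linger
        callback(event)
        handled += 1
    return handled
```

A typical use would be `worker_loop(weakref.ref(cb), events)`: once the last strong reference to `cb` is dropped, the next iteration sees `None` and the loop ends instead of waiting forever.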
-
- Jun 02, 2022
-
-
Boris Stepanenko authored
NixOS (and probably some other distributions) places the zoneinfo directory outside /usr/share (in /etc, for example) and sets TZDIR accordingly. Currently zoneinfo is looked up in /usr/share, disregarding the TZDIR environment variable. This commit adds a compile definition for TZDIR if that environment variable is defined, which fixes zoneinfo lookup on NixOS.

NO_CHANGELOG=build
NO_DOC=build
NO_TEST=build
-
Vladimir Davydov authored
This reverts commit 33830978. Follow-up #6477 NO_DOC=ci NO_TEST=ci NO_CHANGELOG=ci
-
Vladimir Davydov authored
This reverts commit 9d1f9f0e. Follow-up #6477 NO_DOC=ci NO_TEST=ci NO_CHANGELOG=ci
-
Vladimir Davydov authored
Two things we need to do to fix the build with OpenSSL 3.0:

1. Use EVP_MAC_* functions instead of HMAC_*
   https://www.openssl.org/docs/man3.0/man3/HMAC_CTX_new.html
2. Load the Legacy provider to enable legacy algorithms, such as MD4
   https://wiki.openssl.org/index.php/OpenSSL_3.0#Programming_in_OpenSSL_3.0

Closes #6477

NO_DOC=build fix
NO_TEST=build fix
NO_CHANGELOG=build fix
-