- Nov 23, 2020
-
-
Vladislav Shpilevoy authored
The worker fiber is used by the raft library to perform yielding tasks like WAL writes, and simply long tasks like network broadcasts. That allows the raft state machine not to be blocked, and multiple updates to be collected during an event loop iteration and flushed all at once.

While the worker fiber was inside the raft library, it wasn't possible to use it for anything else. And that is exactly what is going to be needed. The reason chain is quite long. It all starts from the fact that eliminating all box appearances from the raft library also includes relocating box_update_ro_summary(). The only place it can be moved to is the box_raft_on_update trigger. The trigger is currently called from the raft worker fiber, which means there is a yield between a raft state update and the trigger invocation. If box_update_ro_summary() were blindly moved to the trigger, users could sometimes observe miracles like the instance role being 'follower' while the node is still writable (if it was a leader before), because box_raft_on_update wasn't invoked yet and didn't update the RO summary.

Assume the on_update triggers are invoked by raft not in the worker fiber, but right from the state machine. Then box_update_ro_summary() would always follow a state change without a yield. However, that creates another problem: the trigger also calls box_clear_synchro_queue(), which yields. But on_update triggers must not yield so as not to block the state machine. This is easily solved if box_clear_synchro_queue() can be scheduled from the on_update trigger to be executed later. After this patch it becomes possible, because the worker fiber can now be used not only for the raft library's async work, but also for box-raft async work, like the synchro queue clearance.

Part of #5303
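Below is a minimal sketch of the deferral pattern this commit enables, assuming the core fiber API (fiber_wakeup()/fiber_yield()/fiber_is_cancelled()); the box_raft_* names and the flag are illustrative, not the actual implementation.

```
#include <stdarg.h>
#include <stdbool.h>
#include "fiber.h"

static struct fiber *box_raft_worker;		/* Shared worker fiber. */
static bool box_raft_has_deferred_work;	/* E.g. synchro queue clearance. */

/* Called from the on_update trigger: must not yield. */
static void
box_raft_schedule_async(void)
{
	box_raft_has_deferred_work = true;
	/* Waking a fiber up does not yield, so it is safe in a trigger. */
	fiber_wakeup(box_raft_worker);
}

/* Worker fiber body: runs the yielding job outside of the trigger. */
static int
box_raft_worker_f(va_list args)
{
	(void)args;
	while (!fiber_is_cancelled()) {
		if (box_raft_has_deferred_work) {
			box_raft_has_deferred_work = false;
			/* The yielding box_clear_synchro_queue() goes here. */
		}
		fiber_yield();
	}
	return 0;
}
```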
-
Vladislav Shpilevoy authored
This is a general practice throughout the code. If a fiber is not cancellable, it always means a system fiber which can't be woken up or canceled until its work is done. The pattern is used in gc (around xlogs), recovery, vinyl, and WAL, at least. Before, raft used the flag raft.is_write_in_progress. But that won't work for long, because the worker fiber will move to box/raft.c, where it would be incorrect to rely on deeply internal parts of struct raft, such as is_write_in_progress. Hence this patch makes raft use the more traditional way of spurious wakeup avoidance. Part of #5303
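A minimal sketch of that traditional pattern, assuming the core fiber API (fiber_set_cancellable()/fiber_yield()); raft_wait_write_done() is an illustrative name, not the real function.

```
#include <stdbool.h>
#include "fiber.h"

static void
raft_wait_write_done(void)
{
	/*
	 * While the fiber is non-cancellable it is treated as a system
	 * fiber, so only the journal completion callback wakes it up.
	 */
	bool cancellable = fiber_set_cancellable(false);
	fiber_yield();
	fiber_set_cancellable(cancellable);
}
```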
-
Vladislav Shpilevoy authored
Raft used to depend on xrow, because it used raft_request as its communication and persistence unit. Xrow is a part of the src/box library set, so it blocked raft extraction into src/lib/raft. This patch makes raft not depend on xrow. For that, raft introduces a new communication and persistence unit - struct raft_msg. Interestingly, throughout its source code raft already uses the term 'message' to describe requests, so this patch also restores consistency. The raft_request name was only used to be consistent with the other *_request structs in xrow.h. Now raft does not depend on them and can use its own name.

Struct raft_msg repeats raft_request literally, but that actually makes sense. When raft is extracted into a new library, it may start evolving independently. Its raft_msg may be populated with new members, or their behaviour may change depending on how the algorithm evolves. Inside box it will still be possible to tweak and extend the unit whenever necessary, via struct raft_request, without changing the basic library. For instance, in the future we may want to make nodes forward the messages to each other during voting to speed the process up, and for that we may want to add an explicit 'source' field to raft_request, while it won't be necessary on the level of raft_msg.

There is a new compatibility layer in src/box/raft.h which hides raft_msg details from the rest of the box code, and does the msg <-> request conversions.

Part of #5303
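A sketch of the conversion idea, to make the compatibility layer concrete; the field sets below are assumptions for illustration and not the exact definitions of either struct (the real struct raft_request lives in src/box/xrow.h).

```
#include <stdint.h>

struct vclock;

/* Library-side unit: what src/lib/raft speaks (field set is assumed). */
struct raft_msg {
	uint64_t term;
	uint32_t vote;
	uint32_t state;
	const struct vclock *vclock;
};

/* Box-side xrow unit, shown here only to illustrate the glue. */
struct raft_request {
	uint64_t term;
	uint32_t vote;
	uint32_t state;
	const struct vclock *vclock;
};

/* The compatibility layer in src/box/raft.h does conversions like this. */
static inline void
raft_msg_to_request(const struct raft_msg *msg, struct raft_request *req)
{
	req->term = msg->term;
	req->vote = msg->vote;
	req->state = msg->state;
	req->vclock = msg->vclock;
}
```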
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/. The patch makes raft stop using the replicaset and journal objects. They were used to broadcast messages to all the other nodes, and to persist updates. Now raft does the same through a vtab, which is configured by box. Broadcast still sends messages via relays, and the disk write still uses the journal. But raft does not depend on any specific journal or network API. Part of #5303
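A sketch of what such a vtab could look like, purely to illustrate the dependency inversion; the member and function names here are assumptions, not the library's actual interface.

```
struct raft_msg;

/* Callbacks supplied by box: relays for broadcast, journal for writes. */
struct raft_vtab {
	/* Send the message to all other cluster members. */
	void (*broadcast)(struct raft *raft, const struct raft_msg *msg);
	/* Persist the message; may yield until the write is finished. */
	void (*write)(struct raft *raft, const struct raft_msg *msg);
};

struct raft {
	/* ... the rest of the state machine ... */
	const struct raft_vtab *vtab;
};

/* The library never touches replicaset or journal directly. */
static inline void
raft_broadcast(struct raft *raft, const struct raft_msg *msg)
{
	raft->vtab->broadcast(raft, msg);
}
```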
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/. The patch makes raft stop using replicaset.vclock. Instead, it has a new option 'vclock'. It is stored inside struct raft by pointer and should be configured using raft_cfg_vclock(). Box configures it to point at replicaset.vclock, like before, but now the raftlib code does not depend on it explicitly. The vclock is stored in Raft by pointer instead of by value so as not to update it on each transaction. That would be too high a price to pay for raft independence from box. Part of #5303
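A minimal sketch of the by-pointer configuration, following the raft_cfg_*() naming described above; the struct layout is illustrative.

```
struct vclock;

struct raft {
	/* ... */
	/* Not owned by raft: box points this at replicaset.vclock. */
	const struct vclock *vclock;
};

void
raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
{
	raft->vclock = vclock;
}
```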
-
Vladislav Shpilevoy authored
Raft is never supposed to change a vclock, neither the stored one nor the received ones. The patch makes that checked during compilation. It is mostly motivated by the next patch, which makes Raft use an externally configured vclock that can't be changed. Since Raft uses raft_request to carry the vclock in a few places, the request's vclock also must become const. Part of #5303
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/. The patch makes raft stop using instance_id. Instead, it has a new option 'instance_id'. It is stored inside struct raft as 'self', and should be configured using raft_cfg_instance_id(). The configuration is done when bootstrap ends and the instance_id is either recovered successfully, or the instance is anonymous.

While working on this, I also considered introducing a new function raft_boot() instead of raft_cfg_instance_id(), which I would also use to configure the vclock later. Raft_boot() would be meant to be called only once, with the non-dynamic parameters instance_id and vclock. But then I decided to keep adding new raft_cfg_*() functions, because:

- it is more consistent with the existing options;
- it does not require thinking about too many different functions like raft_create(), raft_boot(), raft_cfg_*() and in which order to call them.

I was also thinking of introducing a single raft_cfg(), like I did in swim with swim_cfg(), to reduce the number of raft_cfg_*() functions, but decided it would be even worse with so many options.

Part of #5303
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/, including global replication parameters such as replication_synchro_quorum. The patch makes raft stop using replication_synchro_quorum. Instead, it has a new option 'election_quorum'. Note that this is just the raft API. The box API still uses replication_synchro_quorum, but it is used to calculate the final quorum in src/box/raft, not in src/box/raftlib, and to pass it to the base raft implementation. Part of #5303
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/, including global replication parameters such as replication_timeout, and functions like replication_disconnect_timeout(). The patch makes raft stop using replication_disconnect_timeout(). Instead, it stores the death timeout in struct raft. It is configured by box simultaneously with replication_timeout. Part of #5303
-
Vladislav Shpilevoy authored
The commit moves raft functions and objects specific to box from src/box/box and src/box/raftlib to src/box/raft. The goal is to gradually eliminate all box dependencies from src/box/raftlib and move it to src/lib/raft. It makes the compilation work again after the previous commit broke it. Part of #5303
-
Vladislav Shpilevoy authored
The commit renames raft.h and raft.c to raftlib.h and raftlib.c. This is done to prepare for the raft split into src/box/ and src/lib/raft. The commit is not atomic: the build won't work here. If raft were renamed to raftlib and new raft.c and raft.h were added in the same commit, git would think the original file was changed, ruining the git history. By splitting the move of raft to raftlib and the introduction of box/raft into 2 commits, the git history is preserved. Part of #5303
-
- Nov 22, 2020
-
-
Sergey Ostanevich authored
A number of places in sql.c use direct access to box_process_rw(), which does not check the read-only setting. Fixed by using the intended interface, box_process1(). Closes #5231
-
- Nov 20, 2020
-
-
Vladislav Shpilevoy authored
swim_test_indirect_ping() failed with random seed 1605651752.

The test created a cluster with 3 swim nodes and broke the network connection between node-1 and node-2. Then it ran the cluster for 10 seconds and ensured that both node-1 and node-2 are eventually alive despite sometimes being suspected.

node1 <-> node3 <-> node2

'Alive' means that a node is considered alive on all the other nodes. The test spun for 10 seconds, giving the nodes a chance to become suspected. Then it checked that node-1 is either still alive, or is suspected but will be restored in at most 3 seconds. The same was checked for node-2. They were supposed to interact via node-3. The 3 seconds were used assuming that the worst that could happen is node-1 being suspected on node-3 from the beginning of this three-second interval, because it was suspected by node-2 and disseminated to node-3. Then node-3 might need 1 second to finish its current dissemination round by sending a ping to node-2, 1 second to start a new round randomly again from node-2, and only then send a ping to node-1. So 3 seconds total.

But it could also happen that in the beginning of the three-second interval node-1 is already suspected on node-2. On the next step node-2 shares the suspicion with node-3. And then the scenario above happens. So the test case needed at least 4 seconds. And actually it could go on infinitely, because while the test waits for 3 seconds of gossip refutation about node-1 on node-3, node-2 can suspect it again. And so on.

Also, the test would pass even without indirect pings, because node-3 has access to both node-1 and node-2. So even if, say, node-1 suspects node-2, it will tell node-3 about it. Node-3 will ping node-2, get an ack, and will refute the gossip. The refutation will then be sent back to node-1. It means indirect pings don't matter here.

The patch makes a new test which won't pass without indirect pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY, which allows turning off all the SWIM components except failure detection, so only pings and acks are being sent. Then, without proper indirect pings, node-1 and node-2 would suspect each other and eventually declare each other dead. The new test checks that this does not happen.

Closes #5399
-
- Nov 19, 2020
-
-
Alexander Turenko authored
The following commits land in the submodule with this update:

- 96dea99 Enable luacheck on Travis CI
- daff045 Add luacheck config and fix warnings found by luacheck

Most of the changes are about test-run's Makefile and CI, but there is one change in the testing framework itself:

| diff --git a/test_run.lua b/test_run.lua
| index 2a8d713..06fbf6c 100644
| --- a/test_run.lua
| +++ b/test_run.lua
| @@ -163,7 +163,6 @@ local function wait_fullmesh(self, servers)
|              log.info("%s: bootstrapped", server)
|              break
|          end
| -        local info = self:eval(server, "box.info")
|          fiber.sleep(0.01)
|      end
|      -- wait all for full mesh

This change removes a call whose result is not used. At first glance, it should not change any behaviour. However, `test_run:eval()` performs a network request, and it is good to get rid of it. The change is tiny and should not affect anything in tarantool testing, but I decided to commit it separately from the others so as to not hide anything and to ease bisecting if we observe any positive or negative impact. IOW, just in case.
-
Alexander Turenko authored
There are two situations that may lead to a NULL dereference. First, when an attempt to create a merge source fails. Now only OOM may trigger this code path. Reproducing it in a test is tricky, so I left it without a test case. Second, when an attempt to create a merger fails. The reason for the failure may be a 'wrong' key_def: when we are unable to create a tuple format from it. Say, when the key_def defines several key parts using the same field, but marks them with different field types. Fixes #5450
-
- Nov 18, 2020
-
-
Roman Khabibov authored
Enable adding a column to an existing space with the <ALTER TABLE ADD [COLUMN]> statement. The column definition can be supplemented with the four types of constraints: <DEFAULT>, <COLLATE>, <[NOT] NULL> clauses and AUTOINCREMENT.

Closes #2349, #3075

@TarantoolBot document
Title: Add columns to existing tables in SQL

Now it is possible to add columns to existing empty spaces using the <ALTER TABLE table_name ADD [COLUMN] column_name column_type ...> statement. The column definition is the same as in the <CREATE TABLE> statement.

* Space emptiness is Tarantool's restriction. The possibility to add a column to a non-empty space will be implemented later.

For example:
```
tarantool> box.execute("CREATE TABLE test (a INTEGER PRIMARY KEY)")
---
- row_count: 1
...
tarantool> box.execute([[ALTER TABLE test ADD COLUMN b TEXT
         > CHECK (LENGTH(b) > 1)
         > NOT NULL
         > DEFAULT ('aa')
         > COLLATE "unicode_ci"
         > ]])
---
- row_count: 1
...
```
-
Roman Khabibov authored
Allocate memory for the "index" array of an ephemeral space on the parser's region instead of the heap, as it was before. This fixes a memory leak: the memory realloc()ed for the index array of an ephemeral space was never released. Needed for #2349
-
Roman Khabibov authored
Just add box_space_field_MAX to the _space fields enum. Needed for #3075
-
Roman Khabibov authored
Move ck, fk constraint lists from struct create_table_def into new defs and autoincrement into struct Parse to make the code more reusable when implementing <ALTER TABLE ADD COLUMN>. Needed for #3075
-
Roman Khabibov authored
Rename TK_COLUMN used for tokens treated as a column name to TK_COLUMN_REF. It is needed to allow the typing of COLUMN keyword in <ALTER TABLE ADD COLUMN> statement. Needed for #3075
-
- Nov 17, 2020
-
-
Nikita Pettik authored
After the upsert rework in 5a61c471 (issue #5107), update operations are applied consistently with the upserts they belong to: if an update operation from a single upsert fails, all update operations from that upsert are skipped. But the rest of the update operations related to other upserts (after squashing two upserts) are applied. So let's update the #4957 test case: now an upsert operation can't be processed only if it contains more than BOX_UPDATE_OP_CNT_MAX (4000) operations on its own (before squashing with any other upsert). Follow-up #4957 Follow-up #5107
-
- Nov 13, 2020
-
-
Cyrill Gorcunov authored
We have the VCLOCK_MASK constant which limits the number of regular nodes in a replication cluster. This limit is bound to the vclock_map_t bitset type. Thus, when we're tracking the voting process in a Raft node, we'd better use this type for the vote_mask member (otherwise there is room for error if we ever change VCLOCK_MASK and extend the width of the bitset).
Suggested-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
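A tiny illustration of the point, under the assumption that vclock_map_t is the bitset type mentioned above; the struct layout shown is not the actual one.

```
#include "vclock.h"

struct raft {
	/* ... */
	/*
	 * Bitmap of cluster members that voted for this instance in the
	 * current term; sized by the same type that VCLOCK_MASK is
	 * bound to, so the two limits can never diverge.
	 */
	vclock_map_t vote_mask;
};
```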
-
- Nov 12, 2020
-
-
Kirill Yukhin authored
The variable `old`, which contains a tuple reference, wasn't unreferenced at all. Fix this.
-
- Nov 11, 2020
-
-
Mary Feofanova authored
A regression caused by #2866 forbade specifying index options in key parts, which is actually right. Tests were fixed, and the behaviour was extended to the other way of specifying index parts. Closes #5473
-
- Nov 10, 2020
-
-
Vladislav Shpilevoy authored
Vclock is used in raft, which is going to be moved to src/lib. That means vclock also should be moved there. It is easy, because vclock does not depend on anything in box/. Needed for #5303
-
Vladislav Shpilevoy authored
Since box_raft is now initialized at runtime and is used from several subsystems (memtx for snapshots, applier for accepting rows, box.info for monitoring), it is easy to get the initialization order wrong and accidentally use the not yet initialized global raft object. This patch adds a sanity check ensuring that does not happen. The raft state is set to 0 at program start. Then any access to the global raft object first checks that the state is not 0. The initialization order will get trickier when raft stops using globals from replication and from box, and is used from them more extensively. Part of #5303
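A minimal sketch of such a check, built around the box_raft() accessor introduced in the commit below; the zero-state convention is as described above, the rest of the layout is illustrative.

```
#include <assert.h>

struct raft {
	int state;	/* Stays 0 until raft initialization (illustrative). */
	/* ... */
};

extern struct raft box_raft_global;

static inline struct raft *
box_raft(void)
{
	/* Catch any use of the raft object before it is initialized. */
	assert(box_raft_global.state != 0);
	return &box_raft_global;
}
```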
-
Vladislav Shpilevoy authored
All raft functions worked with a global raft object. That would make it impossible to move raft to a separate module, where it could be properly unit-tested with multiple raft nodes in each test. The patch adds an explicit raft pointer argument to each raft function as the first part of moving raft to a separate library. The global object is renamed to box_raft_global to emphasize that this is a global box object, not one from the future raft library. Access to it should now go through the box_raft() function, which will get some sanity checks in the next commit. Part of #5303
-
Vladislav Shpilevoy authored
Struct fiber has a member va_list f_data. It is used to forward arguments to the fiber function when fiber_start() is called, right from the caller's stack. But it is useless when a fiber is started asynchronously, with fiber_new + fiber_wakeup, and there is no way to pass anything into such a fiber. This patch adds a new member 'void *f_arg', which shares memory with va_list f_data and can be used to pass something into the fiber. The feature is going to be used by raft. Currently the raft worker fiber works only with global variables, but soon it will need to have its own pointer to a struct raft object. And it can't be started with fiber_start(), because raft code does not yield anywhere in its state machine. Needed for #5303
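A sketch of how the new member could be used, assuming the core fiber API (fiber_new()/fiber_wakeup()/fiber()); the box_raft_* names are illustrative.

```
#include <stdarg.h>
#include "fiber.h"

struct raft;

static int
box_raft_worker_f(va_list args)
{
	(void)args;
	/* The argument set before the wakeup is taken from the fiber. */
	struct raft *raft = fiber()->f_arg;
	(void)raft;
	return 0;
}

static int
box_raft_start_worker(struct raft *raft)
{
	struct fiber *worker = fiber_new("raft_worker", box_raft_worker_f);
	if (worker == NULL)
		return -1;
	/*
	 * fiber_start() would run the fiber right here, but the caller
	 * must not yield, so the fiber is only scheduled for later.
	 */
	worker->f_arg = raft;
	fiber_wakeup(worker);
	return 0;
}
```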
-
Vladislav Shpilevoy authored
The raft state machine crashed if it was configured to be a candidate during a WAL write with a known leader. It tried to start waiting for the leader's death, but should have waited for the WAL write to end first. The code tried to handle it, but the order of 'if' conditions was wrong. WAL write being in progress was checked last, but should have been checked first. Closes #5506
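A sketch of the corrected ordering; the function, field, and condition names here are assumptions for illustration, not the actual code.

```
#include <stdbool.h>
#include <stdint.h>

struct raft {
	bool is_write_in_progress;	/* Raft state WAL write in flight. */
	uint32_t leader;		/* Known leader id, 0 if none. */
};

static void
raft_sm_become_candidate(struct raft *raft)
{
	if (raft->is_write_in_progress) {
		/* Wait for the WAL write to end first, then re-evaluate. */
	} else if (raft->leader != 0) {
		/* A leader is known: wait for its death before elections. */
	} else {
		/* No leader and nothing in flight: start a new election. */
	}
}
```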
-
Vladislav Shpilevoy authored
The raft state machine crashed if it was restarted while a WAL write was in progress. When the machine was started, it didn't assume there could still be an unfinished WAL write from the time it was enabled earlier. The patch makes it continue waiting for the write to end. Part of #5506
-
Vladislav Shpilevoy authored
Raft didn't broadcast its state when the state machine was started. It could lead to the state never being sent until some other node generated a term number bigger than the local one. That happened when a node participated in some elections, accumulated a big term number, then the election was turned off, and a new replica was connected in a 'candidate' state. Then the first node was configured to be a 'voter'. The first node didn't send anything to the replica, because at the moment of its connection the election was off. So the replica started from term 1, tried to start elections in this term, but was ignored by the first node. It waited for the election timeout, bumped the term to 2, and the process was repeated until the replica reached the first node's term + 1. It could take a very long time. The patch fixes it so that Raft now broadcasts its state when it is enabled, to cover the replicas connected while it was disabled. Closes #5499
-
Vladislav Shpilevoy authored
The typo led to the election timeout not being reset to the default value. It was left at 1000, and as a result the next election tests could take extremely long. Part of #5499
-
Vladislav Shpilevoy authored
The raft worker fiber does all the heavy and yielding jobs. There are 2 of them: disk write and network broadcast. The disk write yields. The network broadcast is slow, so it happens at most once per event loop iteration. On each iteration the worker checks whether any of these 2 jobs is active, and if not, it goes to sleep until an explicit wakeup. But there was a bug: before going to sleep it did a yield + a check that there is nothing to do. However, during the yield new tasks could appear, and the check failed, leading to a crash. The patch reorganizes this part of the code so that the worker does not yield between checking for new tasks and going to sleep. No test, because it is extremely hard to reproduce, and I don't want to clog this part of the code with error injections.
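A sketch of the reorganized loop idea: the pending work is re-checked and the sleep happens with no yield in between. The helper names and flags are assumptions for illustration, not the actual code.

```
#include <stdarg.h>
#include <stdbool.h>
#include "fiber.h"

struct raft;

/* Both handlers may yield; they return to the loop when done. */
void raft_worker_handle_io(struct raft *raft);
void raft_worker_handle_broadcast(struct raft *raft);
bool raft_has_io(const struct raft *raft);
bool raft_has_broadcast(const struct raft *raft);

static int
raft_worker_f(va_list args)
{
	(void)args;
	struct raft *raft = fiber()->f_arg;
	while (!fiber_is_cancelled()) {
		bool is_idle = true;
		if (raft_has_io(raft)) {
			raft_worker_handle_io(raft);
			is_idle = false;
		}
		if (raft_has_broadcast(raft)) {
			raft_worker_handle_broadcast(raft);
			is_idle = false;
		}
		if (!is_idle)
			continue;	/* The handlers yielded: re-check. */
		/* No yield since the checks above, so sleeping is safe. */
		fiber_yield();
	}
	return 0;
}
```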
-
- Nov 03, 2020
-
-
Sergey Ostanevich authored
A static buffer overflow in the thread local pool causes random failures on the OSX platform. This was caused by an incorrect use of the allocator result. Fixes #5312
Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
-
Vladislav Shpilevoy authored
"Too long WAL write" is supposed to warn a user that either the disk write was too long, or the event loop is too slow, maybe due to certain fibers not doing yields often enough. It was printed by the code doing the transaction commit. As a result, for synchronous transactions the check also included the replication time, often overflowing a threshold and printing "too long WAL write" even when it had nothing to do with a WAL write or the event loop being too slow. The patch makes so the warning is checked and printed after WAL write right away, not after commit. Closes #5139
-
Vladislav Shpilevoy authored
txn_complete used to handle all the transaction outcomes:

- manual rollback;
- error at WAL write;
- successful WAL write and commit;
- successful WAL write and waiting for synchronization with replicas.

The code became a mess after synchronous replication was introduced. This patch splits txn_complete's code into multiple pieces. Now WAL write success and failure are handled exclusively by txn_on_journal_write(). It also runs the WAL write triggers; it was very strange to call them from txn_complete(). txn_on_journal_write() also checks whether the transaction is synchronous, and if it is not, completes it with txn_complete_success(), whose code is simple now and only works on committing the changes. In case of failure the transaction always ends up in txn_complete_fail(). These success and fail functions are now used by the limbo as well. It appeared that all the places finishing a transaction always know whether they want to fail it or complete it successfully. This should also remove a few ifs from the hot path of transaction commit. A sketch of the resulting flow is given after this message.

The patch simplifies the code in order to fix the false warning about a too long WAL write for synchronous transactions, which is currently printed not at WAL write, but at commit. These two events are far from each other for synchro requests.

Part of #5139
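The sketch below uses the function names from the commit message; the journal-entry accessors and the sync check are illustrative assumptions, not the real helpers.

```
#include <stdbool.h>
#include <stdint.h>

struct txn;
struct journal_entry;

void txn_complete_success(struct txn *txn);
void txn_complete_fail(struct txn *txn);

/* Illustrative accessors; the real code reads the entry/txn directly. */
int64_t journal_entry_res(const struct journal_entry *entry);
struct txn *journal_entry_txn(struct journal_entry *entry);
bool txn_is_sync(const struct txn *txn);
void txn_run_wal_write_triggers(struct txn *txn);

/* Journal completion callback: the single place handling a WAL result. */
static void
txn_on_journal_write(struct journal_entry *entry)
{
	struct txn *txn = journal_entry_txn(entry);
	if (journal_entry_res(entry) < 0) {
		txn_complete_fail(txn);
		return;
	}
	txn_run_wal_write_triggers(txn);
	/* Synchronous txns wait in the limbo for replica confirmation. */
	if (txn_is_sync(txn))
		return;
	txn_complete_success(txn);
}
```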
-
Vladislav Shpilevoy authored
The function is called only by the journal when a write is finished. Besides, it may not complete the transaction: in case of synchronous replication a WAL write is not enough for completion. It means it can't have 'complete' in its name. Also, the function is never used outside of txn.c, so it is removed from txn.h and is now static. The patch is a preparation for not spamming "too long WAL write" on synchronous transactions, because it is simply misleading. Part of #5139
-
- Nov 02, 2020
-
-
Cyrill Gorcunov authored
We should never trust the request data; it can carry anything, so let's make sure we're not decoding some trash into a string.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
The size should match the real size of the buffer, otherwise there is room for a mistake. At the same time, make sure we're not overrunning the buffer.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
- Nov 01, 2020
-
-
Alexander V. Tikhonov authored
Found that the vinyl/gh.test.lua test, previously fixed in commit 94dc5bdd ('test: gh test hangs after gh-4957-too-many-upserts') by adding a fiber.sleep(1) workaround to avoid an error raised by the previously run vinyl/gh-4957-too-many-upserts.test.lua test, can be changed in another way. On the one hand, the new change leaves the found issue untouched, so it can be resolved within the opened issue on github. On the other hand, it lets the test-run tool avoid this issue using the fragile list feature, preserving the stability of testing, since the found issue is flaky and can pass on reruns. The current fix changes the forever-waiting loop to the test_run:wait_cond() routine, created especially for such situations, which has a timeout in it to avoid hanging the test until the global timeout occurs. It lets the testing continue even after the fail. Needed for #5141
-