- Oct 15, 2020
-
-
Alexander V. Tikhonov authored
Found on GCC 4.8.5 on CentOS 7 issue: build/usr/src/debug/tarantool-2.6.0.144/src/box/txn.c:944:30: error: ‘limbo_entry’ may be used uninitialized in this function [-Werror=maybe-uninitialized] if (txn_limbo_wait_complete(&txn_limbo, limbo_entry) < 0) Set limbo_entry variable to NULL on initialization. Needed for #4941
-
- Oct 14, 2020
-
-
Nikita Pettik authored
After previous commit, there's no need in these functions. However, someday we may want to reconsider it squash optimization and make it work without breaking upsert associativity rule. So to not lost initial squash implementation, let's put its removal in a separate patch. Follow-up #5107
-
Nikita Pettik authored
Previous upsert implementation had a few drawbacks which led to number of various bugs and issues. Issue #5092 (redundant update operations execution) In a nutshell, application of upsert(s) (on top of another upsert) consists of two actions (see vy_apply_upsert()): execute and squash. Consider example: insert({1, 1}) -- terminal statement, stored on disk upsert({1}, {{'-', 2, 20}}) -- old ups1 upsert({1}, {{'+', 2, 10}}) -- new ups2 'Execute' step takes update operations from the new upsert and combines them with key of the old upsert. {1} + {'+', 2, 10} can't be evaluated since key consists of only one field. Note that in case upsert doesn't fold into insert the upsert's tuple and the tuple stored in index can be different. In our particular case, tuple stored on disk has two fields ({1, 1}), so first upsert's update operation can be applied to it: {1, 1} + {'+', 2, 10} --> {1, 11}. If upsert's operation can't be executed using key of old upsert, we simply continue processing squash step. In turn 'squash' is a combination of update operations: arithmetic operations are combined so we don't have to store actions over the same field; the rest operations - are merged into single array. As a result, we get one upsert with squashed operations: upsert({1}, {{'+', 2, -10}}). Then vy_apply_upsert() is called again to apply new upsert on the top of terminal statement - insert{1, 1}. Since now tuple has second field, update operations can be executed and corresponding result is {1, -9}. It is the final result of upsert application procedure. Now imagine that we have following upserts: upsert({1, 1}, {{'-', 2, 20}}) -- old ups1 upsert({1}, {{'+', 2, 10}}) -- new ups2 In this case execution successfully finishes and modifies old upsert's tuple: {1, 1} + {'+', 2, 10} --> {1, 11} However, we still have to squash/accumulate update operations since they may be applied on tuple stored on disk later. After all, we have following upsert: upsert({2, 11}, {{'+', 2, -10}}). Then it is applied on the top of insert({1, 1}) and we get the same result as in the first case - {1, -9}. The only difference is that upsert's tuple was modified. As one can see, execution of update operations applied to upsert's tuple is redundant in the case index already contains tuple with the same key (i.e. when upserts turns into update). Instead, we are able to accumulate/squash update operations only. When the last upsert is being applied, we can either execute all update operation on tuple fetched from index (i.e. upsert is update) OR on tuple specified in the first upsert (i.e. first upsert is insert). Issue #5105 (upsert doesn't follow associative property) Secondly, current approach breaks associative property: after upsert's update operations are merged into one array, part of them (related to one upsert) can be skipped, meanwhile the rest - is applied. For instance: -- Index is over second field. i = s:create_index('pk', {parts={2, 'uint'}}) s:replace{1, 2, 3, 'default'} s:upsert({2, 2, 2}, {{'=', 4, 'upserted'}}) -- First update operation modifies primary key, so upsert must be ignored. s:upsert({2, 2, 2}, {{'#', 1, 1}, {'!', 3, 1}}) After merging two upserts we get the next one: upsert({2, 2, 2}, {{'=', 4, 'upserted'}, {'#', 1, 1}, {'!', 3, 1}} While we executing update operations, we don't distinguish operations from different upserts. Thus, if one operation fails, the rest are ignored as well. As a result, first (in general case - all preceding squashed upserts) upsert won't be applied, even despite the fact it is absolutely correct. What is more, user gets no error/warning concerning this fact. Issue #1622 (no upsert result validation) After upsert application, there's no check verifying that result satisfies space's format: number of fields, their types, overflows etc. Due to this tuples violating format may appear in the space, which in turn may lead to unpredictable consequences. To resolve these issues, let's group update operations of each upsert into separate array. So that operations related to particular upsert are stored in single array. In terms of previous example we will get: upsert({2, 2, 2}, {{{'=', 4, 'upserted'}}, {{'#', 1, 1}, {'!', 3, 1}}} Also note that we don't longer have to apply update operations on tuple in vy_apply_upsert() when it comes for two upserts: it can be done once we face terminal statement; or if there's no underlying statement (i.e. it is delete statement or no statement at all) we apply all update arrays except the first one on upsert's tuple. In case one of operations from array fail, we skip the rest operations from this array and process to the next array. After successful application of update operations of each array, we check that the resulting tuple fits into space format. If they aren't, we rollback applied operations, log error and moving to the next group of operations. Finally, arithmetic operations are not longer able to be combined: it is requirement which is arises from #5105 issue. Otherwise, result of upserts combination might turn out to be inapplicable to the tuple stored on disk (e.g. result applied on tuple leads to integer overflow - in this case only last upsert leading to overflow must be ignored). Closes #1622 Closes #5105 Closes #5092 Part of #5107
-
Vladislav Shpilevoy authored
Order of LSNs from different instances can be any. It means, that if the synchronous transaction limbo changed its owner, the old value of maximal confirmed LSN does not make sense. The new owner means new LSNs, even less than the previously confirmed LSN of an old owner. The patch resets the confirmed LSN when the owner is changed. No specific test for that, as this is a hotfix - tests start fail on every run after a new election test was added in the previous commit. A test for this bug will be added later. Part of #5395
-
Vladislav Shpilevoy authored
According to Raft, when a new leader is elected, it should finish transactions of the old leader. In Raft this is done via adding a new transaction originated from the new leader. In case of Tarantool this can be done without a new transaction due to WAL format specifics, and the function doing it is called box_clear_synchro_queue(). Before the patch, when a node was elected as a leader, it didn't finish the pending transactions. The queue clearance was expected to be done by a user. There was no any issue with that, just technical debt. The patch fixes it. Now when a node becomes a leader, it finishes synchronous transactions of the old leader. This is done a bit differently than in the public box.ctl.clear_synchro_queue(). The public box.ctl.clear_synchro_queue() tries to wait for CONFIRM messages, which may be late. For replication_synchro_timeout * 2 time. But when a new leader is elected, the leader will ignore all rows from all the other nodes, as it thinks it is the only source of truth. Therefore it makes no sense to wait for CONFIRMs here, and the waiting is omitted. Closes #5339
-
Vladislav Shpilevoy authored
Raft state machine now has a trigger invoked each time when any of the visible Raft attributes is changed: state, term, vote. The trigger is needed to commit synchronous transactions of an old leader, when a new leader is elected. This is done via a trigger so as not to depend on box in raft code too much. That would make it harder to extract it into a new module later. The trigger is executed in the Raft worker fiber, so as not to stop the state machine transitions anywhere, which currently don't contain a single yield. And the synchronous transaction queue clearance requires a yield, to write CONFIRM and ROLLBACK records to WAL. Part of #5339
-
Vladislav Shpilevoy authored
When a raft node was configured to be a candidate via election_mode, it didn't do anything if there was an active leader. But it should have started monitoring its health in order to initiate a new election round when it dies. The patch fixes this bug. It does not contain a test, because will be covered by a test for #5339. Needed for #5339
-
Vladislav Shpilevoy authored
Raft has a worker fiber to perform async tasks such as WAL write, state broadcast. The worker was created and woken up from 2 places, leading at least to code duplication. The patch wraps it into a new function raft_worker_wakeup(), and uses it. The patch is not need for anything functional, but was created while working on #5339 and trying ideas. The patch seems to be good refactoring making the code simpler, and therefore it is submitted.
-
- Oct 13, 2020
-
-
Ilya Kosarev authored
key_def didn't support key definitions with array, map, varbinary & any fields. Thus they couldn't be extracted with key_def_object:extract_key(). Since the restriction existed due to impossibility of such types comparison, this patch removes the restriction for the fields extraction and only leaves it for comparison. Closes #4538
-
- Oct 12, 2020
-
-
Vladislav Shpilevoy authored
The new option can be one of 3 values: 'off', 'candidate', 'voter'. It replaces 2 old options: election_is_enabled and election_is_candidate. These flags looked strange, that it was possible to set candidate true, but disable election at the same time. Also it would not look good if we would ever decide to introduce another mode like a data-less sentinel node, for example. Just for voting. Anyway, the single option approach looks easier to configure and to extend. - 'off' means the election is disabled on the node. It is the same as election_is_enabled = false in the old config; - 'voter' means the node can vote and is never writable. The same as election_is_enabled = true + election_is_candidate = false in the old config; - 'candidate' means the node is a full-featured cluster member, which eventually may become a leader. The same as election_is_enabled = true + election_is_candidate = true in the old config. Part of #1146
-
- Oct 07, 2020
-
-
Aleksandr Lyapunov authored
space:fselect and index:fselect fetch data like ordinal select, but formats the result like mysql does - with columns, column names etc. fselect converts tuple to strings using json, extending with spaces and cutting tail if necessary. It is designed for visual analysis of select result and shouldn't be used stored procedures. index:fselect(<key>, <opts>, <fselect_opts>) space:fselect(<key>, <opts>, <fselect_opts>) There are some options that can be specified in different ways: - among other common options (<opts>) with 'fselect_' prefix. (e.g. 'fselect_type=..') - in special <fselect_opts> map (with or without prefix). - in global variables with 'fselect_' prefix. The possible options are: - type: - 'sql' - like mysql result (default). - 'gh' (or 'github' or 'markdown') - markdown syntax, for copy-pasting to github. - 'jira' - jira table syntax (for copy-pasting to jira). - widths: array with desired widths of columns. - max_width: limit entire length of a row string, longest fields will be cut if necessary. Set to 0 (default) to detect and use screen width. Set to -1 for no limit. - print: (default - false) - print each line instead of adding to result. - use_nbsp: (default - true) - add invisible spaces to improve readability in YAML output. Not applicabble when print=true. There is also a pair of shortcuts: index/space:gselect - same as fselect, but with type='gh'. index/space:jselect - same as fselect, but with type='jira'. See test/engine/select.test.lua for examples. Closes #5161
-
Sergey Kaplun authored
In case when we build without `ENABLE_FIBER_TOP` neither `struct fiber` contains `clock_stat` field nor `FIBER_TIME_RES` constant is defined. This patch adds corresponding ifdef directive to avoid compilation errors.
-
- Oct 06, 2020
-
-
Kirill Yukhin authored
File mp_error.cc was using a C99 feature called "designated initializers" which was catched by Clang v12. Remove this usage.
-
Kirill Yukhin authored
Variable `delete` wasn't free()-ed in case of error. Free it properly. Found by static analyzer.
-
Igor Munkin authored
While running GC hook (i.e. __gc metamethod) garbage collector engine is "stopped": the memory penalty threshold is set to LJ_MAX_MEM and incremental GC step is not triggered as a result. Ergo, yielding the execution at the finalizer body leads to further running platform with disabled LuaJIT GC. It is not re-enabled until the yielded fiber doesn't get the execution back. This changeset extends <cord_on_yield> routine with the check whether GC hook is active. If the switch-over occurs in scope of __gc metamethod the platform is forced to stop its execution with EXIT_FAILURE and calls panic routine before the exit. Relates to #4518 Follows up #4727 Reviewed-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Reviewed-by:
Sergey Ostanevich <sergos@tarantool.org> Signed-off-by:
Igor Munkin <imun@tarantool.org>
-
- Oct 02, 2020
-
-
Igor Munkin authored
Since Tarantool fibers don't respect Lua coroutine switch mechanism, JIT machinery stays unnotified when one lua_State substitutes another one. As a result if trace recording hasn't been aborted prior to fiber switch, the recording proceeds using the new lua_State and leads to a failure either on any further compiler phase or while the compiled trace is executed. This changeset extends <cord_on_yield> routine aborting trace recording when the fiber switches to another one. If the switch-over occurs while mcode is being run the platform finishes its execution with EXIT_FAILURE code and calls panic routine prior to the exit. Closes #1700 Fixes #4491 Reviewed-by:
Sergey Ostanevich <sergos@tarantool.org> Reviewed-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Signed-off-by:
Igor Munkin <imun@tarantool.org>
-
Igor Munkin authored
Tarantool integrates several complex environments together and there are issues occurring at their junction leading to the platform failures. E.g. fiber switch-over is implemented outside the Lua world, so when one lua_State substitutes another one, main LuaJIT engines, such as JIT and GC, are left unnotified leading to the further platform misbehaviour. To solve this severe integration drawback <cord_on_yield> function is introduced. This routine encloses the checks and actions to be done when the running fiber yields the execution. Unfortunately the way callback is implemented introduces a circular dependency. Considering linker symbol resolving methods for static build an auxiliary translation unit is added to the particular tests mocking (i.e. exporting) <cord_on_yield> undefined symbol. Part of #1700 Relates to #4491 Reviewed-by:
Sergey Ostanevich <sergos@tarantool.org> Reviewed-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Signed-off-by:
Igor Munkin <imun@tarantool.org>
-
- Oct 01, 2020
-
-
Cyrill Gorcunov authored
There is a mixture of types and clang prefer explicit conversion (since @value is a double). Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
d is "double" thus placate clang. Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
pIn3->u.r is a "double", thus placate clang. Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
The @r is "double" value thus use explicit conversion to placate clang compiler. Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
- Sep 30, 2020
-
-
Kirill Yukhin authored
Without explicit cast we're getting warnings. Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
- Sep 29, 2020
-
-
Vladislav Shpilevoy authored
Box.info.election returns a table of form: { state: <string>, term: <number>, vote: <instance ID>, leader: <instance ID> } The fields correspond to the same named Raft concepts one to one. This info dump is supposed to help with the tests, first of all. And with investigation of problems in a real cluster. The API doesn't mention 'Raft' on purpose, to keep it not depending specifically on Raft, and not to confuse users who don't know anything about Raft (even that it is about leader election and synchronous replication). Part of #1146
-
Vladislav Shpilevoy authored
The commit is a core part of Raft implementation. It introduces the Raft state machine implementation and its integration into the instance's life cycle. The implementation follows the protocol to the letter except a few important details. Firstly, the original Raft assumes, that all nodes share the same log record numbers. In Tarantool they are called LSNs. But in case of Tarantool each node has its own LSN in its own component of vclock. That makes the election messages a bit heavier, because the nodes need to send and compare complete vclocks of each other instead of a single number like in the original Raft. But logic becomes simpler. Because in the original Raft there is a problem of uncertainty about what to do with records of an old leader right after a new leader is elected. They could be rolled back or confirmed depending on circumstances. The issue disappears when vclock is used. Secondly, leader election works differently during cluster bootstrap, until number of bootstrapped replicas becomes >= election quorum. That arises from specifics of replicas bootstrap and order of systems initialization. In short: during bootstrap a leader election may use a smaller election quorum than the configured one. See more details in the code. Part of #1146
-
sergepetrenko authored
The patch introduces a new type of system message used to notify the followers of the instance's raft status updates. It's relay's responsibility to deliver the new system rows to its peers. The notification system reuses and extends the same row type used to persist raft state in WAL and snapshot. Part of #1146 Part of #5204
-
Vladislav Shpilevoy authored
The new options are: - election_is_enabled - enable/disable leader election (via Raft). When disabled, the node is supposed to work like if Raft does not exist. Like earlier; - election_is_candidate - a flag whether the instance can try to become a leader. Note, it can vote for other nodes regardless of value of this option; - election_timeout - how long need to wait until election end, in seconds. The options don't do anything now. They are added separately in order to keep such mundane changes from the main Raft commit, to simplify its review. Option names don't mention 'Raft' on purpose, because - Not all users know what is Raft, so they may not even know it is related to leader election; - In future the algorithm may change from Raft to something else, so better not to depend on it too much in the public API. Part of #1146
-
Vladislav Shpilevoy authored
The patch introduces a sceleton of Raft module and a method to persist a Raft state in snapshot, not bound to any space. Part of #1146
-
Vladislav Shpilevoy authored
Struct replicaset didn't store a number of registered replicas. Only an array, which was necessary to fullscan each time when want to find the count. That is going to be needed in Raft to calculate election quorum. The patch makes the count tracked so as it could be found for constant time by simply reading an integer. Needed for #1146
-
Vladislav Shpilevoy authored
Relay.cc and box.cc obtained box.cfg.wal_dir value using cfg_gets() call. To initialize WAL and create struct recovery objects. That is not only a bit dangerous (cfg_gets() uses Lua API and can throw a Lua error) and slow, but also not necessary - wal_dir parameter is constant, it can't be changed after instance start. It means, the value can be stored somewhere one time and then used without Lua. Main motivation is that the WAL directory path will be needed inside relay threads to restart their recovery iterators in the Raft patch. They can't use cfg_gets(), because Lua lives in TX thread. But can access a constant global variable, introduced in this patch (it existed before, but now has a method to get it). Needed for #1146
-
Vladislav Shpilevoy authored
An instance is writable if box.cfg.read_only is false, and it is not orphan. Update of the final read-only state of the instance needs to fire read-only update triggers, and notify the engines. These 2 flags were easy and cheap to check on each operation, and the triggers were easy to use since both flags are stored and updated inside box.cc. That is going to change when Raft is introduced. Raft will add 2 more checks: - A flag if Raft is enabled on the node. If it is not, then Raft state won't affect whether the instance is writable; - When Raft is enabled, it will allow writes on a leader only. It means a check for being read-only would look like this: is_ro || is_orphan || (raft_is_enabled() && !raft_is_leader()) This is significantly slower. Besides, Raft somehow needs to access the read-only triggers and engine API - this looks wrong. The patch introduces a new flag is_ro_summary. The flag incorporates all the read-only conditions into one flag. When some subsystem may change read-only state of the instance, it needs to call box_update_ro_summary(), and the function takes care of updating the summary flag, running the triggers, and notifying the engines. Raft will use this function when its state or config will change. Needed for #1146
-
Vladislav Shpilevoy authored
Applier is going to need its numeric ID in order to tell the future Raft module who is a sender of a Raft message. An alternative would be to add sender ID to each Raft message, but this looks like a crutch. Moreover, applier still needs to know its numeric ID in order to notify Raft about heartbeats from the peer node. Needed for #1146
-
Sergey Kaplun authored
Found and fixed not closed va_list 'ap' with cppcheck: [src/httpc.c:190]: (error) va_list 'ap' was opened but not closed by va_end().
-
- Sep 28, 2020
-
-
Roman Khabibov authored
Ban ability to modify view on box level. Since a view is a named select, and not a table, in fact, altering view is not a valid operation.
-
Sergey Kaplun authored
Found and fixed Null pointer dereference with cppcheck: [src/box/alter.cc:395]: (error) Null pointer dereference
-
Sergey Kaplun authored
[src/lua/fiber.c:245] -> [src/lua/fiber.c:217]: (warning) Either the condition 'if(func)' is redundant or there is possible null pointer dereference: func.
-
- Sep 23, 2020
-
-
Aleksandr Lyapunov authored
Use mvcc transaction engine in memtx if the engine is enabled. Closes #4897
-
Aleksandr Lyapunov authored
If a tuple fetched from an index is dirty - it must be clarified. Let's fix all fetched from indexeds in that way. Also fix a snapshot iterator - it must save a part of history along with creating a read view in order to clean tuple during iteration from another thread. Part of #4897
-
Aleksandr Lyapunov authored
When memtx snapshot iterator is created it could contain some amount of dirty tuples that should be clarified before writing to WAL file. Implement special snapshot cleaner for this purpose. Part of #4897
-
Aleksandr Lyapunov authored
Memtx story is a part of a history of a value in space. It's a story about a tuple, from the point it was added to space to the point when it was deleted from the space. All stories are linked into a list of stories of the same key of each index. Part of #4897
-
Aleksandr Lyapunov authored
There are situations when we have to track that if some TX is committed then some others must be aborted due to conflict. The common case is that one r/w TX have read some value while the second is about to overwrite the value; if the second is committed, the first must be aborted. Thus we have to store many-to-many TX relations between breaker TX and victim TX. The patch implements that. Part of #4897
-