- Sep 29, 2020
-
-
Vladislav Shpilevoy authored
The new options are:
- election_is_enabled - enable/disable leader election (via Raft). When disabled, the node is supposed to work as if Raft did not exist, like before;
- election_is_candidate - a flag whether the instance can try to become a leader. Note, it can vote for other nodes regardless of the value of this option;
- election_timeout - how long to wait until the election ends, in seconds.

The options don't do anything now. They are added separately in order to keep such mundane changes out of the main Raft commit, to simplify its review.

Option names don't mention 'Raft' on purpose, because:
- Not all users know what Raft is, so they may not even know it is related to leader election;
- In the future the algorithm may change from Raft to something else, so it is better not to depend on it too much in the public API.

Part of #1146
-
Vladislav Shpilevoy authored
The patch introduces a skeleton of the Raft module and a method to persist a Raft state in a snapshot, not bound to any space. Part of #1146
-
Vladislav Shpilevoy authored
Struct replicaset didn't store the number of registered replicas, only an array, which had to be fully scanned each time the count was needed. The count is going to be needed in Raft to calculate the election quorum. The patch makes the count tracked so that it can be found in constant time by simply reading an integer. Needed for #1146
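
A minimal Python sketch of the idea, not the actual C code: the class, its fields, and the slot-array layout are hypothetical stand-ins for struct replicaset. The counter is updated on every register/unregister, so reading it never requires a scan.

```python
class ReplicaSet:
    """Toy model: a slot array indexed by replica ID plus a tracked count."""

    def __init__(self, max_id=32):
        # One slot per possible replica ID; None means "not registered".
        self.by_id = [None] * max_id
        # Maintained incrementally, so reading it is O(1).
        self.registered_count = 0

    def register(self, replica_id, replica):
        if self.by_id[replica_id] is None:
            self.registered_count += 1
        self.by_id[replica_id] = replica

    def unregister(self, replica_id):
        if self.by_id[replica_id] is not None:
            self.registered_count -= 1
        self.by_id[replica_id] = None

    def count_by_scan(self):
        # The old approach: a full scan of the array on every request.
        return sum(1 for replica in self.by_id if replica is not None)
```

Both ways of counting must always agree; the tracked counter only trades a little bookkeeping on registration for a cheap read.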
-
Vladislav Shpilevoy authored
Relay.cc and box.cc obtained the box.cfg.wal_dir value using a cfg_gets() call, to initialize WAL and create struct recovery objects. That is not only a bit dangerous (cfg_gets() uses Lua API and can throw a Lua error) and slow, but also not necessary - the wal_dir parameter is constant, it can't be changed after instance start. It means the value can be stored somewhere once and then used without Lua. The main motivation is that the WAL directory path will be needed inside relay threads to restart their recovery iterators in the Raft patch. They can't use cfg_gets(), because Lua lives in the TX thread. But they can access a constant global variable (it existed before, but this patch adds a method to get it). Needed for #1146
-
Vladislav Shpilevoy authored
An instance is writable if box.cfg.read_only is false, and it is not an orphan. An update of the final read-only state of the instance needs to fire read-only update triggers and notify the engines. These 2 flags were easy and cheap to check on each operation, and the triggers were easy to use since both flags are stored and updated inside box.cc.

That is going to change when Raft is introduced. Raft will add 2 more checks:
- A flag if Raft is enabled on the node. If it is not, then Raft state won't affect whether the instance is writable;
- When Raft is enabled, it will allow writes on a leader only.

It means a check for being read-only would look like this:

    is_ro || is_orphan || (raft_is_enabled() && !raft_is_leader())

This is significantly slower. Besides, Raft somehow needs to access the read-only triggers and engine API - this looks wrong.

The patch introduces a new flag is_ro_summary. The flag incorporates all the read-only conditions into one flag. When some subsystem may change the read-only state of the instance, it needs to call box_update_ro_summary(), and the function takes care of updating the summary flag, running the triggers, and notifying the engines. Raft will use this function when its state or config changes.

Needed for #1146
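
The summary-flag idea can be modelled roughly as follows. This is a Python sketch under assumptions, not the box.cc implementation: the class, the trigger list, and the "do nothing when the value is unchanged" shortcut are illustrative choices, and engine notification is folded into the same callback list.

```python
class Instance:
    """Toy model of a cached read-only summary flag."""

    def __init__(self):
        self.is_ro = True
        self.is_orphan = False
        self.raft_is_enabled = False
        self.raft_is_leader = False
        # Cached result of all the conditions above.
        self.is_ro_summary = True
        # Stand-in for read-only triggers and engine notifications.
        self.on_ro_change = []

    def update_ro_summary(self):
        # Must be called by any subsystem that changed one of the inputs.
        new_value = (self.is_ro or self.is_orphan or
                     (self.raft_is_enabled and not self.raft_is_leader))
        if new_value == self.is_ro_summary:
            return  # assumption: nothing to announce if the state is the same
        self.is_ro_summary = new_value
        for callback in self.on_ro_change:
            callback(new_value)

    def is_writable(self):
        # The hot path reads a single cached flag.
        return not self.is_ro_summary
```

The per-operation check stays a single flag read, and the subsystems that own the inputs never touch triggers or engines directly.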
-
Vladislav Shpilevoy authored
Applier is going to need its numeric ID in order to tell the future Raft module who is a sender of a Raft message. An alternative would be to add sender ID to each Raft message, but this looks like a crutch. Moreover, applier still needs to know its numeric ID in order to notify Raft about heartbeats from the peer node. Needed for #1146
-
Sergey Kaplun authored
Found and fixed an unclosed va_list 'ap' with cppcheck: [src/httpc.c:190]: (error) va_list 'ap' was opened but not closed by va_end().
-
- Sep 28, 2020
-
-
Roman Khabibov authored
Ban the ability to modify a view on the box level. Since a view is a named select and not a table, altering a view is not a valid operation.
-
Alexander V. Tikhonov authored
Added for tests with issues:
app/fiber.test.lua gh-5341
app-tap/debug.test.lua gh-5346
app-tap/http_client.test.lua gh-5346
app-tap/inspector.test.lua gh-5346
box/gh-2763-session-credentials-update.test.lua gh-5363
box/hash_collation.test.lua gh-5247
box/lua.test.lua gh-5351
box/net.box_connect_triggers_gh-2858.test.lua gh-5247
box/net.box_incompatible_index-gh-1729.test.lua gh-5360
box/net.box_on_schema_reload-gh-1904.test.lua gh-5354
box/protocol.test.lua gh-5247
box/update.test.lua gh-5247
box-tap/net.box.test.lua gh-5346
replication/autobootstrap.test.lua gh-4533
replication/autobootstrap_guest.test.lua gh-4533
replication/ddl.test.lua gh-5337
replication/gh-3160-misc-heartbeats-on-master-changes.test.lua gh-4940
replication/gh-3247-misc-iproto-sequence-value-not-replicated.test.lua.test.lua gh-5357
replication/gh-3637-misc-error-on-replica-auth-fail.test.lua gh-5343
replication/long_row_timeout.test.lua gh-4351
replication/on_replace.test.lua gh-5344, gh-5349
replication/prune.test.lua gh-5361
replication/qsync_advanced.test.lua gh-5340
replication/qsync_basic.test.lua gh-5355
replication/replicaset_ro_mostly.test.lua gh-5342
replication/wal_rw_stress.test.lua gh-5347
replication-py/multi.test.py gh-5362
sql/prepared.test.lua test gh-5359
sql-tap/selectG.test.lua gh-5350
vinyl/ddl.test.lua gh-5338
vinyl/gh-3395-read-prepared-uncommitted.test.lua gh-5197
vinyl/iterator.test.lua gh-5336
vinyl/write_iterator_rand.test.lua gh-5356
xlog/panic_on_wal_error.test.lua gh-5348
-
Sergey Kaplun authored
Found and fixed Null pointer dereference with cppcheck: [src/box/alter.cc:395]: (error) Null pointer dereference
-
Sergey Kaplun authored
[src/lua/fiber.c:245] -> [src/lua/fiber.c:217]: (warning) Either the condition 'if(func)' is redundant or there is possible null pointer dereference: func.
-
- Sep 26, 2020
-
-
Alexander Turenko authored
Updated test_run:wait_upstream() and test_run:wait_downstream() to wait until box is configured and an instance with the given ID appears in box.info.replication. See https://github.com/tarantool/test-run/issues/221 Fixes #5317 Fixes #5329
-
- Sep 25, 2020
-
-
Alexander Turenko authored
Justify columns in the output. https://github.com/tarantool/test-run/pull/222
-
Alexander V. Tikhonov authored
Removed dust line from merge.
-
Alexander V. Tikhonov authored
Implemented in test-run a new format of the fragile lists. It is based on JSON and is set as the 'fragile' option in the 'suite.ini' file of each suite:

    fragile = {
        "retries": 10,
        "tests": {
            "bitset.test.lua": {
                "issues": [ "gh-4095" ],
                "checksums": [ "050af3a99561a724013995668a4bc71c", "f34be60193cfe9221d3fe50df657e9d3" ]
            }
        }}

Added the ability to check the checksum of the result file when a test fails and to compare it with the checksums of the known issues mentioned in the fragile list. Also added the ability to set the 'retries' option, which sets the number of accepted reruns of failed tests from the 'fragile' list whose failures have known checksums.

Closes #5050
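
A rough Python sketch of how such a configuration could be consulted. It is not the test-run code: the function name is hypothetical, and it assumes the checksums are MD5 hex digests of the result file contents (the values above are 32 hex characters long, which matches MD5).

```python
import hashlib
import json

# The JSON value of the 'fragile' option from the commit message.
FRAGILE = json.loads("""
{
    "retries": 10,
    "tests": {
        "bitset.test.lua": {
            "issues": ["gh-4095"],
            "checksums": ["050af3a99561a724013995668a4bc71c",
                          "f34be60193cfe9221d3fe50df657e9d3"]
        }
    }
}
""")


def is_known_failure(test_name, result_bytes, fragile=FRAGILE):
    """Return True if the failed output matches a checksum of a known issue."""
    entry = fragile["tests"].get(test_name)
    if entry is None:
        return False  # the test is not in the fragile list at all
    digest = hashlib.md5(result_bytes).hexdigest()  # assumption: MD5
    return digest in entry.get("checksums", [])
```

A test that is absent from the list, or whose failed output hashes to an unknown value, is treated as an ordinary failure.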
-
Alexander V. Tikhonov authored
Found flaky issues when running the replication/anon.test.lua test multiple times on a single worker:

[007] --- replication/anon.result Fri Jun 5 09:02:25 2020
[007] +++ replication/anon.reject Mon Jun 8 01:19:37 2020
[007] @@ -55,7 +55,7 @@
[007]
[007] box.info.status
[007] | ---
[007] - | - running
[007] + | - orphan
[007] | ...
[007] box.info.id
[007] | ---

[094] --- replication/anon.result Sat Jun 20 06:02:43 2020
[094] +++ replication/anon.reject Tue Jun 23 19:35:28 2020
[094] @@ -154,7 +154,7 @@
[094] -- Test box.info.replication_anon.
[094] box.info.replication_anon
[094] | ---
[094] - | - count: 1
[094] + | - count: 2
[094] | ...
[094] #box.info.replication_anon()
[094] | ---
[094]

It happened because replications may stay active from previous runs on the common tarantool instance at the test-run worker. To avoid it, the tarantool instance is now restarted at the very start of the test.

Closes #5058
-
Alexander V. Tikhonov authored
Set opensuse jobs to the test group to be sure that they will be run with artifacts collecting and without extra parallelization of gitlab-ci jobs.
-
Alexander V. Tikhonov authored
Added an artifacts saver to all gitlab-ci jobs with testing. Gitlab-ci jobs save their result files in the following paths:
1. base jobs for testing different features:
   - test/var/artifacts
2. OSX jobs:
   - ${OSX_VARDIR}/artifacts
3. pack/deploy jobs:
   - build/usr/src/*/tarantool-*/test/var/artifacts
4. VBOX jobs (freebsd_12) on virtual host:
   - ~/tarantool/test/var/artifacts

In the gitlab-ci configuration, added an 'after_script' section with a script which collects the 'artifacts' directories created by the test-run tool from the different test places. It saves the 'artifacts' directories as the root path in artifacts packages. A user will be able to download these packages using either the gitlab-ci GUI or the API.

Additionally added the OSX_VARDIR environment variable to be able to set up a common path for artifacts and OSX shell scripts options:

    OSX_VARDIR: /tmp/tnt

Part of #5050
-
Sergey Bronnikov authored
Running Jepsen tests creates a directory with Terraform state and a directory with Jepsen tests source code in the build directory. Everything is ok when using an out-of-source build in a separate directory, but when building in the project root directory these directories appear in `git status` output. This patch adds ignores for these directories.
-
Sergey Bronnikov authored
For running Jepsen tests we need to check out an external repository with tests source code at the build stage. This behaviour breaks a Tarantool build under Gentoo. The WITH_JEPSEN option enables the targets only when they are needed. Closes #5325
-
- Sep 24, 2020
-
-
Alexander Turenko authored
Retry a failed test when it is marked as fragile (and several other conditions are met, see below).

The test-run already allows to set a list of fragile tests. They are run one-by-one after all parallel ones in order to eliminate possible resource starvation and fit timings to ones when the tests pass. See [1].

In practice this approach does not help much against our problem with flaky tests. We decided to retry failed tests when they are known as fragile. See [2].

The core idea is to split responsibility: known flaky fails will not distract a developer, but each fragile test will be marked explicitly, tracked in the issue tracker and analyzed by the quality assurance team.

The default behaviour is not changed: each test from the fragile list will be run once after all parallel ones. But now it is possible to set the number of retries.

Beware: the implementation does not allow to just set the retries count, it also requires to provide an md5sum of a failed test output (the so-called reject file). The idea here is to ensure that we retry the test only in case of a known fail, not some other fail within the test.

This approach has a limitation: in case of a fail a test may output information that varies from run to run or depends on a base directory. We should always verify the output before putting its checksum into the configuration file.

Despite doubts regarding this approach, it looks simple and we decided to try it and revisit it if there is a need.

See a configuration example in [3].

[1]: https://github.com/tarantool/test-run/issues/187
[2]: https://github.com/tarantool/test-run/issues/189
[3]: https://github.com/tarantool/test-run/pull/217

Part of #5050
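
The retry rule described above can be sketched in Python as follows. This is an illustration under assumptions, not the test-run implementation: `run_fragile` and the `run_once` callback are hypothetical names, and the callback is assumed to return a pass flag together with the bytes of the reject file.

```python
import hashlib


def run_fragile(run_once, known_checksums, retries=0):
    """Run a fragile test, retrying only failures with a known checksum.

    run_once() returns (passed, reject_bytes). Returns (passed, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        passed, reject = run_once()
        if passed:
            return True, attempts
        # Retry only a known fail, never some other fail within the test.
        known = hashlib.md5(reject).hexdigest() in known_checksums
        if not known or attempts > retries:
            return False, attempts
```

With the default `retries=0` the test runs exactly once, which matches the unchanged default behaviour; a failure with an unknown checksum is reported immediately regardless of the retries count.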
-
- Sep 23, 2020
-
-
Aleksandr Lyapunov authored
Closes #4897
-
Aleksandr Lyapunov authored
txn_proxy is a special utility for transaction tests. Formerly it was used only for vinyl tests and thus was placed in vinyl folder. Now the time has come to test memtx transactions and the utility must be placed amongst other utils - in box/lua. Needed for #4897
-
Aleksandr Lyapunov authored
Use mvcc transaction engine in memtx if the engine is enabled. Closes #4897
-
Aleksandr Lyapunov authored
If a tuple fetched from an index is dirty, it must be clarified. Let's fix all fetches from indexes in that way. Also fix the snapshot iterator - it must save a part of history along with creating a read view in order to clean tuples during iteration from another thread. Part of #4897
-
Aleksandr Lyapunov authored
When memtx snapshot iterator is created it could contain some amount of dirty tuples that should be clarified before writing to WAL file. Implement special snapshot cleaner for this purpose. Part of #4897
-
Aleksandr Lyapunov authored
Memtx story is a part of a history of a value in space. It's a story about a tuple, from the point it was added to space to the point when it was deleted from the space. All stories are linked into a list of stories of the same key of each index. Part of #4897
-
Aleksandr Lyapunov authored
There are situations when we have to track that if some TX is committed then some others must be aborted due to a conflict. The common case is that one r/w TX has read some value while a second one is about to overwrite the value; if the second is committed, the first must be aborted. Thus we have to store many-to-many TX relations between breaker TX and victim TX. The patch implements that. Part of #4897
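
A small Python model of the breaker/victim relation, assuming transactions can be represented by hashable IDs. The class and method names are hypothetical; the real conflict manager is C code inside memtx.

```python
from collections import defaultdict


class ConflictTracker:
    """Toy many-to-many relation between breaker and victim transactions."""

    def __init__(self):
        self.victims_of = defaultdict(set)   # breaker -> its victims
        self.breakers_of = defaultdict(set)  # victim -> its breakers
        self.aborted = set()

    def add(self, breaker, victim):
        # Record: if `breaker` commits, `victim` must be aborted.
        self.victims_of[breaker].add(victim)
        self.breakers_of[victim].add(breaker)

    def commit(self, breaker):
        # Abort every victim of the committed transaction.
        victims = self.victims_of.pop(breaker, set())
        for victim in victims:
            self.aborted.add(victim)
            self.breakers_of[victim].discard(breaker)
        return victims
```

For the common case from the message: a reader `tx1` and a writer `tx2` of the same value are linked with `add("tx2", "tx1")`, and `commit("tx2")` then marks `tx1` as aborted.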
-
Aleksandr Lyapunov authored
Define memtx TX manager. It will store data for MVCC and conflict manager. Define also 'memtx_use_mvcc_engine' in config that enables that MVCC engine. Part of #4897
-
Aleksandr Lyapunov authored
Prepare sequence number (PSN) is a monotonically increasing ID that is assigned to any prepared transaction. This ID is suitable for serialization order resolution: the bigger the ID, the later the transaction is in the serialization order of transactions.

Note that transaction IDs have quite a different order when a transaction can yield - a younger (bigger id) transaction can prepare/commit first (lower psn) while an older tx sleeps in vain.

Also it should be mentioned that LSN has the same order as PSN, but there are two general differences:
1. The LSN sequence has no holes, i.e. it is a natural number sequence. This property is useless for the transaction engine.
2. The LSN sequence is provided by the WAL writer and thus LSN is not available for a TX that was prepared and hasn't been committed yet. That makes PSN a more suitable sequence for transactions, as it allows to order prepared but not committed transactions and allows, for example, to create a read view between prepared transactions.

Part of #4897
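
The difference between the transaction ID order and the PSN order can be shown with a tiny Python model. The `TxManager` class and the dictionary layout are hypothetical and only illustrate that the two counters advance at different moments.

```python
import itertools


class TxManager:
    """Toy model: IDs are assigned at begin, PSNs at prepare."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._next_psn = itertools.count(1)

    def begin(self):
        # A transaction gets its ID when it starts; it has no PSN yet.
        return {"id": next(self._next_id), "psn": None}

    def prepare(self, txn):
        # The PSN reflects the order of preparation, not of start.
        txn["psn"] = next(self._next_psn)
        return txn["psn"]


mgr = TxManager()
older = mgr.begin()
younger = mgr.begin()
mgr.prepare(younger)  # the younger transaction prepares first
mgr.prepare(older)    # the older one was sleeping in the meantime
```

Here `older` has the smaller ID but the bigger PSN, so ordering by PSN places `younger` first in the serialization order.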
-
Aleksandr Lyapunov authored
That flag is needed for the transactional conflict manager - if any other transaction commits a replacement of old_tuple before the current one and the flag is set, the current transaction will be aborted. For example, REPLACE just replaces a key, no matter what tuple lies in the index, and thus does_require_old_tuple = false. In contrast, UPDATE makes a new tuple using old_tuple and thus the statement will require old_tuple (does_require_old_tuple = true). INSERT also has does_require_old_tuple = true because it requires old_tuple to be NULL. Part of #4897
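
The rule can be written down as a lookup, sketched here in Python. The table and the function name are hypothetical; only the three flag values come from the message above.

```python
# Whether a statement depends on the tuple it found in the index.
DOES_REQUIRE_OLD_TUPLE = {
    "REPLACE": False,  # overwrites the key whatever was stored there
    "UPDATE": True,    # builds the new tuple from old_tuple
    "INSERT": True,    # requires old_tuple to be NULL
}


def must_abort(operation, old_tuple_replaced_by_other_commit):
    """True if the current transaction has to be aborted."""
    return (DOES_REQUIRE_OLD_TUPLE[operation]
            and old_tuple_replaced_by_other_commit)
```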
-
Aleksandr Lyapunov authored
Transaction engine (see further commits) needs to distinguish and manipulate transactions by their status. The status describes the lifetime point of a transaction (inprogress, prepared, committed) and its abilities (conflicted, read view). Part of #4897 Part of #5108
-
Aleksandr Lyapunov authored
Unlike other vinyl objects, which are named with the "vy_" prefix, its transaction manager (tx_manager) has no such prefix. It should have one in order to avoid conflicts with the global tx manager. Needed for #4897
-
Kirill Yukhin authored
cord_ptr variable is calloc()-ated in coio_on_start() and is not free()-ed, which triggers ASAN. free() it in coio_on_stop(). Closes #5308
-
- Sep 18, 2020
-
-
Vladislav Shpilevoy authored
The test tried to start a replica whose box.cfg would hang, with replication_connect_quorum = 0 to make it return immediately. But the quorum parameter was added and removed during work on 44421317 ("replication: do not register outgoing connections"). Instead, to start the replica without blocking on box.cfg it is necessary to pass 'wait=False' with the test_run:cmd('start server') command. Closes #5311
-
Sergey Bronnikov authored
Added a new stage with a single job to run Jepsen tests. The job is not started automatically by default, one needs to trigger it manually. The directory with test results (logs, graphs, operations history) is published to artifacts. Closes #5277
-
Sergey Bronnikov authored
Main script that handles creation of a set of virtual machines using Terraform, setup for remote connection, running Jepsen tests, and teardown of the test environment. Part of #5277
-
Sergey Bronnikov authored
Added targets 'make jepsen-single' and 'make jepsen-cluster' to run Jepsen tests on a single Tarantool instance and cluster of Tarantool instances. Part of #5277
-
Sergey Bronnikov authored
For testing Tarantool with Jepsen we use virtual machines as they provide better resource isolation in comparison to containers. Jepsen tests may need a single instance or a set of instances for testing a cluster. To set up virtual machines we use Terraform [1]. The patch adds a set of configuration files for Terraform that can create the required number of virtual machines in MCS and output IP addresses to stdout.

Terraform needs some parameters before run. They are:
- id, identifier of a test stand that should be specific for this run, id also is a part of the virtual machine name
- keypair_name, name of the keypair used in a cloud, the public SSH key of that key pair will be placed on the virtual machine
- instance_count, number of virtual machines in a test stand
- ssh_key, SSH private key, used to access a virtual machine
- user_name
- password
- tenant_id
- user_domain_id

These parameters can be passed via environment variables with the TF_VAR_ prefix (like TF_VAR_id) or via command-line parameters.

To demonstrate the full lifecycle of a test stand with Terraform one needs to perform these commands:

    terraform init extra/tf
    terraform apply extra/tf
    terraform output instance_names
    terraform output instance_ips
    terraform destroy extra/tf

1. https://www.terraform.io/

Part of #5277
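
As an illustration of the TF_VAR_ convention, a wrapper script could prepare the environment for Terraform roughly like this. The helper below is a hypothetical Python sketch, not part of the patch, and the parameter values in the usage line are made up.

```python
def terraform_env(params, base_env=None):
    """Map parameter names to TF_VAR_-prefixed environment variables."""
    env = dict(base_env or {})
    for name, value in params.items():
        # Environment values are always strings.
        env["TF_VAR_" + name] = str(value)
    return env


# Example values are placeholders, not real settings.
env = terraform_env({"id": "run-1", "instance_count": 3})
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run()` when invoking the `terraform` commands listed above.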
-
Cyrill Gorcunov authored
There is a bug in systemd-209 source code: it returns ENOENT when no more entries are left in a password database. Later the issue was fixed, but we still meet systems where it hits. The problem affects getpwent/getgrent calls only, thus we can expect them to return the buggy error code and skip it.

Notes:

1) See systemd's commit where the issue was fixed

 | commit 06202b9e659e5cc72aeecc5200155b7c012fccbc
 | Author: Yu Watanabe <watanabe.yu+github@gmail.com>
 | Date: Sun Jul 15 23:00:00 2018 +0900
 |
 | nss: do not modify errno when NSS_STATUS_NOTFOUND or NSS_STATUS_SUCCESS

2) Another option is to call getpwall on Tarantool startup unconditionally, where we could simply ignore any errors. This is a very bad choice since traversing a password database might introduce significant lags if the backend does some network activity or has expired caches. Thus drop the unconditional getpwall() call and run it iff a user makes an explicit request.

Fixes #5034

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-