Commits · 43482eedc3e139e24e9f12d55761f88ea7f9926c · core / tarantool

Sep 24, 2020

Retry a failed test when it is marked as fragile (and several other
conditions are met, see below).

test-run already allows setting a list of fragile tests. They are run
one by one after all parallel ones in order to eliminate possible
resource starvation and to bring timings close to those under which the
tests pass. See [1].

In practice this approach does not help much against our problem with
flaky tests. We decided to retry failed tests when they are known to be
fragile. See [2].

The core idea is to split responsibility: known flaky fails will not
distract a developer, but each fragile test will be marked explicitly,
tracked and analyzed by the quality assurance team.

The default behaviour is not changed: each test from the fragile list
will be run once after all parallel ones. But now it is possible to set
the number of retries.

Beware: the implementation does not allow setting just the retry count;
it also requires an md5sum of the failed test's output (the so-called
reject file). The idea here is to ensure that we retry the test only in
case of a known fail, not some other fail within the test.

This approach has a limitation: in case of a fail a test may output
information that varies from run to run or depends on the base
directory. We should always verify the output before putting its
checksum into the configuration file.
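
The retry rule described above can be sketched roughly as follows. This
is an illustration in Python, not test-run's actual code: the helper
names `reject_checksum` and `should_retry`, their arguments, and the
sample outputs are assumptions made for the example.

```python
import hashlib


def reject_checksum(reject_output: bytes) -> str:
    """Return the md5 hex digest of a failed test's output (reject file)."""
    return hashlib.md5(reject_output).hexdigest()


def should_retry(reject_output: bytes, known_checksums: set,
                 attempt: int, retries: int) -> bool:
    """Retry only when attempts remain and the fail is a known one.

    Hypothetical helper: `known_checksums` stands for the md5sums listed
    in the configuration for this fragile test.
    """
    if attempt >= retries:
        return False
    return reject_checksum(reject_output) in known_checksums


# Made-up reject outputs, used only to exercise the helpers.
known = {reject_checksum(b"- stopped\n")}
print(should_retry(b"- stopped\n", known, attempt=0, retries=3))
print(should_retry(b"- error: other\n", known, attempt=0, retries=3))
```

Because the checksum covers the whole output, any byte that differs
between runs (a timestamp, an absolute path) produces a different
md5sum, which is exactly the limitation mentioned above.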

Despite doubts regarding this approach, it looks simple, and we decided
to try it and revisit it if the need arises.

See a configuration example in [3].

[1]: https://github.com/tarantool/test-run/issues/187
[2]: https://github.com/tarantool/test-run/issues/189
[3]: https://github.com/tarantool/test-run/pull/217

Part of #5050

43482eed

Sep 23, 2020

txm: add a test · 0018398d

Aleksandr Lyapunov authored 4 years ago

Closes #4897

0018398d

test: move txn_proxy.lua to box/lua · 6f9f57fa

Aleksandr Lyapunov authored 4 years ago

txn_proxy is a special utility for transaction tests.
Formerly it was used only for vinyl tests and thus was placed in
vinyl folder.
Now the time has come to test memtx transactions and the utility
must be placed amongst other utils - in box/lua.

Needed for #4897

6f9f57fa

txm: use new tx manager in memtx · 7c2a0c18

Aleksandr Lyapunov authored 4 years ago

Use mvcc transaction engine in memtx if the engine is enabled.

Closes #4897

7c2a0c18

txm: clarify all fetched tuples · ee8ed065

Aleksandr Lyapunov authored 4 years ago

If a tuple fetched from an index is dirty, it must be clarified.
Let's fix all tuples fetched from indexes in that way.
Also fix the snapshot iterator: it must save a part of history
along with creating a read view in order to clean tuples during
iteration from another thread.

Part of #4897

ee8ed065

txm: introduce snapshot cleaner · ef47de0f

Aleksandr Lyapunov authored 4 years ago

When a memtx snapshot iterator is created, it could contain some
dirty tuples that should be clarified before writing
to WAL file.
Implement a special snapshot cleaner for this purpose.

Part of #4897

ef47de0f

txm: introduce memtx_story · c4205758

Aleksandr Lyapunov authored 4 years ago

Memtx story is a part of a history of a value in space.
It's a story about a tuple, from the point it was added to space
to the point when it was deleted from the space.
All stories are linked into a list of stories of the same key of
each index.

Part of #4897

c4205758

txm: introduce conflict tracker · 518fb9d8

Aleksandr Lyapunov authored 4 years ago

There are situations when we have to track that if some TX is
committed then some others must be aborted due to a conflict.
The common case is that one r/w TX has read some value while the
second is about to overwrite the value; if the second is committed,
the first must be aborted.
Thus we have to store many-to-many TX relations between breaker
TXs and victim TXs.
The patch implements that.
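
The breaker/victim relation can be pictured with a small Python toy
model. It only illustrates the many-to-many bookkeeping; the class and
method names are invented for this sketch and do not reflect the C
structures in the patch.

```python
from collections import defaultdict


class ConflictTracker:
    """Toy many-to-many relation between breaker and victim TXs."""

    def __init__(self):
        self.victims_of = defaultdict(set)   # breaker -> victims
        self.breakers_of = defaultdict(set)  # victim -> breakers

    def track(self, breaker, victim):
        """Record that committing `breaker` must abort `victim`."""
        self.victims_of[breaker].add(victim)
        self.breakers_of[victim].add(breaker)

    def forget(self, tx):
        """Drop every relation of a finished transaction."""
        for victim in self.victims_of.pop(tx, set()):
            self.breakers_of[victim].discard(tx)
        for breaker in self.breakers_of.pop(tx, set()):
            self.victims_of[breaker].discard(tx)

    def commit(self, breaker):
        """Return the transactions to abort when `breaker` commits."""
        aborted = sorted(self.victims_of.get(breaker, set()))
        self.forget(breaker)
        for victim in aborted:
            self.forget(victim)
        return aborted


tracker = ConflictTracker()
# tx1 has read a value that tx2 is about to overwrite.
tracker.track(breaker="tx2", victim="tx1")
print(tracker.commit("tx2"))  # tx1 has to be aborted
```

Keeping both directions of the relation makes it cheap to clean up
after either side finishes first.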

Part of #4897

518fb9d8

txm: introduce memtx tx manager · bd1ed6dd

Aleksandr Lyapunov authored 4 years ago

Define memtx TX manager. It will store data for MVCC and conflict
manager. Define also 'memtx_use_mvcc_engine' in config that
enables that MVCC engine.

Part of #4897

bd1ed6dd

txm: introduce prepare sequence number · ef5c293c

Aleksandr Lyapunov authored 4 years ago

Prepare sequence number (PSN) is a monotonically increasing ID that is
assigned to any prepared transaction. This ID is suitable for
serialization order resolution: the bigger the ID, the later the
transaction is in the serialization order of transactions.

Note that transaction ids have quite a different order when a
transaction can yield: a younger (bigger id) transaction
can prepare/commit first (lower psn) while an older tx sleeps in vain.

Also it should be mentioned that LSN has the same order as PSN,
but it has two general differences:
1. The LSN sequence has no holes, i.e. it is a natural number
sequence. This property is useless for the transaction engine.
2. The LSN sequence is provided by the WAL writer and thus LSN is not
available for a TX that was prepared and hasn't been committed yet.
That feature makes psn a more suitable sequence for transactions as
it allows ordering prepared but not committed transactions and
allows, for example, creating a read view between prepared
transactions.
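
The difference between transaction id order and PSN order can be shown
with a toy model in Python. The `TxnManager` class below is made up for
this illustration; it only mimics "id is assigned at begin, psn is
assigned at prepare".

```python
import itertools


class TxnManager:
    """Toy model: ids are assigned at begin, PSNs at prepare."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._next_psn = itertools.count(1)

    def begin(self):
        return {"id": next(self._next_id), "psn": None}

    def prepare(self, txn):
        txn["psn"] = next(self._next_psn)
        return txn


mgr = TxnManager()
older = mgr.begin()    # id 1
younger = mgr.begin()  # id 2
mgr.prepare(younger)   # the younger tx prepares first: psn 1
mgr.prepare(older)     # the older tx yielded and prepares later: psn 2

# The serialization order follows psn, not id.
order = sorted([older, younger], key=lambda t: t["psn"])
print([t["id"] for t in order])  # [2, 1]
```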

Part of #4897

ef5c293c

txm: save does_require_old_tuple flag in txn_stmt · 61bce613

Aleksandr Lyapunov authored 4 years ago

That flag is needed for the transactional conflict manager: if any
other transaction commits a replacement of old_tuple before the
current one and the flag is set, the current transaction will
be aborted.
For example, REPLACE just replaces a key, no matter what tuple
lies in the index, and thus does_require_old_tuple = false.
In contrast, UPDATE makes a new tuple using old_tuple and thus
the statement will require old_tuple (does_require_old_tuple = true).
For INSERT, does_require_old_tuple = true as well, because it requires
old_tuple to be NULL.
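
The three cases above can be summarized in a short Python sketch. The
table and the `must_abort` helper are illustrative only; other request
types are omitted because the commit message does not describe them.

```python
# Flag values as described in the commit message.
DOES_REQUIRE_OLD_TUPLE = {
    "REPLACE": False,  # replaces a key no matter what tuple is in the index
    "UPDATE": True,    # builds the new tuple from old_tuple
    "INSERT": True,    # requires old_tuple to be NULL
}


def must_abort(request_type, old_tuple_replaced_by_other_commit):
    """Hypothetical helper: abort when the statement depended on an
    old_tuple that another transaction has replaced and committed."""
    return (DOES_REQUIRE_OLD_TUPLE[request_type]
            and old_tuple_replaced_by_other_commit)


for request in ("REPLACE", "UPDATE", "INSERT"):
    print(request, must_abort(request, True))
```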

Part of #4897

61bce613

txm: add TX status · 070a0cd4

Aleksandr Lyapunov authored 4 years ago

Transaction engine (see further commits) needs to distinguish and
manipulate transactions by their status. The status describes the
lifetime point of a transaction (inprogress, prepared, committed)
and its abilities (conflicted, read view).

Part of #4897
Part of #5108

070a0cd4

vinyl: rename tx_manager -> vy_tx_manager · 363169a2

Aleksandr Lyapunov authored 4 years ago

Unlike other vinyl objects, which are named with the "vy_" prefix,
its transaction manager (tx_manager) has no such prefix.
It should have one in order to avoid conflicts with the global tx
manager.

Needed for #4897

363169a2

coio: fix cord leak on stop · 8477b6c0

Kirill Yukhin authored 4 years ago

cord_ptr variable is calloc()-ated in coio_on_start()
and is not free()-ed, which triggers ASAN. free() it
in coio_on_stop().

Closes #5308

8477b6c0

Sep 18, 2020

tests: fix replication/prune.test.lua hang · f7bcdf4c

Vladislav Shpilevoy authored 4 years ago

The test tried to start a replica whose box.cfg would hang, with
replication_connect_quorum = 0 to make it return immediately.

But the quorum parameter was added and removed during work on
44421317 ("replication: do not register outgoing connections").
Instead, to start the replica without blocking on box.cfg it is
necessary to pass 'wait=False' with the test_run:cmd('start server')
command.

Closes #5311

f7bcdf4c

ci: integrate Jepsen tests to GitLab CI · a8e89b77

Sergey Bronnikov authored 4 years ago

Added a new stage with a single job to run Jepsen tests.
The job is not started automatically by default, one needs to
trigger it manually. The directory with test results
(logs, graphs, operations history) is published to artifacts.

Closes #5277

a8e89b77

tools: add script to run Jepsen tests · 49bca315

Sergey Bronnikov authored 4 years ago

Main script that handles creation of a set of virtual machines
using Terraform, setup for remote connection, running
Jepsen tests and teardown of the test environment.

Part of #5277

49bca315

cmake: add targets to run Jepsen tests · a42f8993

Sergey Bronnikov authored 4 years ago

Added targets 'make jepsen-single' and 'make jepsen-cluster'
to run Jepsen tests on a single Tarantool instance and
cluster of Tarantool instances.

Part of #5277

a42f8993

extra: add Terraform config files · 0b59bc93

Sergey Bronnikov authored 4 years ago

For testing Tarantool with Jepsen we use virtual machines as they provide
better resource isolation in comparison to containers. Jepsen tests may need a
single instance or a set of instances for testing a cluster. To set up virtual
machines we use Terraform [1]. The patch adds a set of configuration files for
Terraform that can create the required number of virtual machines in MCS and
output IP addresses to stdout.

Terraform needs some parameters before run. They are:

- id, identifier of a test stand that should be specific for this run, id
also is a part of virtual machine name
- keypair_name, name of keypair used in a cloud, public SSH key of that key pair
will be placed to virtual machine
- instance_count, number of virtual machines in a test stand
- ssh_key, SSH private key, used to access a virtual machine
- user_name
- password
- tenant_id
- user_domain_id

These parameters can be passed via environment variables with TF_VAR_ prefix
(like TF_VAR_id) or via command-line parameters.
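
As an illustration of the environment variable form, a wrapper script
could build the environment like this. The Python helper
`terraform_env` and the sample values are assumptions for the sketch;
only the TF_VAR_ prefix and the parameter names come from the text
above.

```python
import os


def terraform_env(params, base_env=None):
    """Return an environment with each parameter exported as TF_VAR_<name>."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env["TF_VAR_" + name] = str(value)
    return env


# Sample values are made up for the example.
env = terraform_env({"id": "stand-1", "instance_count": 3}, base_env={})
print(env)
```

The resulting dictionary can be handed to whatever launches Terraform,
for example as the `env` argument of `subprocess.run()`.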

To demonstrate full lifecycle of a test stand with Terraform one needs to
perform these commands:

  terraform init extra/tf
  terraform apply extra/tf
  terraform output instance_names
  terraform output instance_ips
  terraform destroy extra/tf

1. https://www.terraform.io/

Part of #5277

0b59bc93

lua/pwd: workaround the systemd bug · ab3ff23f

Cyrill Gorcunov authored 4 years ago


There is a bug in systemd-209 source code: it returns
ENOENT when there are no more entries left in a password database.

The issue was fixed later, but we still meet systems
where it hits. The problem affects getpwent/getgrent calls
only, thus we can expect them to return the buggy error code
and skip it.

Notes:

1) See systemd's commit where the issue was fixed

   | commit 06202b9e659e5cc72aeecc5200155b7c012fccbc
   | Author: Yu Watanabe <watanabe.yu+github@gmail.com>
   | Date:   Sun Jul 15 23:00:00 2018 +0900
   |
   |     nss: do not modify errno when NSS_STATUS_NOTFOUND or NSS_STATUS_SUCCESS

2) Another option is to call getpwall on Tarantool startup
   unconditionally where we could simply ignore any errors. This
   is a very bad choice since traversing a password database might
   introduce significant lags if the backend does some network activity
   or has expired caches. Thus drop the unconditional getpwall() call
   and run it only if a user makes an explicit request.

Fixes #5034

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

ab3ff23f

lua/errno: shrink memory usage on error declaration · 8603da36

Cyrill Gorcunov authored 4 years ago


There is no need to allocate 32 bytes for each string:
the Lua backend copies the string internally, thus a
plain pointer is enough here and no redundant memory
needs to be allocated.

Part-of #5034

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

8603da36

lua/errno: use lengthof helper · 0f062df1

Cyrill Gorcunov authored 4 years ago


No need for ending empty entry.

Part-of #5034

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

0f062df1

Sep 17, 2020

replication: do not register outgoing connections · 44421317

Vladislav Shpilevoy authored 4 years ago

Replication protocol's first stage for non-anonymous replicas is
that the replica should be registered in _cluster to get a unique
ID number.

That happens when a replica connects to a writable node, which
performs the registration. So registration always happens on
the master node when an *incoming* request for it appears,
explicitly asking for a registration. Only relay can do that.

That wasn't the case for bootstrap. If box.cfg.replication wasn't
empty on the master node doing the cluster bootstrap, it
registered all the outgoing connections in _cluster. Note, the
target node could be even anonymous, but still was registered.

That breaks the protocol, and leads to registration of anon
replicas sometimes. The patch drops it.

Another motivation here is Raft cluster bootstrap specifics.
During Raft bootstrap it is going to be very important that
non-joined replicas should not be registered in _cluster. A
replica can only register after its JOIN request was accepted, and
its snapshot download has started.

Closes #5287
Needed for #1146

44421317

replication: add is_anon flag to ballot · 0fd72560

Vladislav Shpilevoy authored 4 years ago

Ballot is a message sent in response to a vote request, which is
sent by applier first thing after connection establishment.

It contains basic info about the remote instance such as whether
it is read only, if it is still loading, and more.

The ballot didn't contain a flag whether the instance is
anonymous. That led to a problem, when applier was connected to a
remote instance, was added to struct replicaset inside a struct
replica object, but it was unknown whether it is anonymous. It was
added as not anonymous by default.

If the remote instance was in fact anonymous and sent a subscribe
response back to the first instance with the anon flag = true,
then it looked like the remote instance was not anonymous, and
suddenly became such, without even a reconnect. It could lead to
an assertion.

The bug is hidden behind another bug, because of which the leader
instance on bootstrap registers all replicas listed in its
box.cfg.replication, even anonymous ones.

The patch makes the ballot contain the anon flag. Now both relay
and applier send whether their host is anonymous. Relay does it by
sending the ballot, applier sends it in scope of subscribe
request. By the time a replica gets UUID and is added into struct
replicaset, its anon flag is determined.

Also the patch makes anon_count updated on each replica hash table
change. Previously it was only updated when something related to
relay was done. Now anon is updated by applier actions too, and
it is not ok to update the counter on relay-specific actions.

The early registration bug is a subject for a next patch.

Part of #5287

@TarantoolBot document
Title: IPROTO_BALLOT_IS_ANON flag

There is a request type IPROTO_BALLOT, with code 0x29. It has
fields IPROTO_BALLOT_IS_RO (0x01), IPROTO_BALLOT_VCLOCK (0x02),
IPROTO_BALLOT_GC_VCLOCK (0x03), IPROTO_BALLOT_IS_LOADING (0x04).

Now it gets a new field IPROTO_BALLOT_IS_ANON (0x05). The field
is a boolean equal to box.cfg.replication_anon of the
sender.
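
For reference, the codes listed above can be written down as constants.
This Python sketch only maps names to the numeric keys from the text;
the `ballot_fields` helper is hypothetical and does not perform the
real wire encoding.

```python
IPROTO_BALLOT = 0x29

IPROTO_BALLOT_IS_RO = 0x01
IPROTO_BALLOT_VCLOCK = 0x02
IPROTO_BALLOT_GC_VCLOCK = 0x03
IPROTO_BALLOT_IS_LOADING = 0x04
IPROTO_BALLOT_IS_ANON = 0x05  # the new field


def ballot_fields(is_ro, vclock, gc_vclock, is_loading, is_anon):
    """Hypothetical helper: ballot fields keyed by their numeric codes."""
    return {
        IPROTO_BALLOT_IS_RO: is_ro,
        IPROTO_BALLOT_VCLOCK: vclock,
        IPROTO_BALLOT_GC_VCLOCK: gc_vclock,
        IPROTO_BALLOT_IS_LOADING: is_loading,
        IPROTO_BALLOT_IS_ANON: is_anon,
    }


# Sample values are made up; is_anon mirrors box.cfg.replication_anon.
print(ballot_fields(False, {1: 3}, {1: 0}, False, True))
```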

0fd72560

replication: retry in case of XlogGapError · f1a507b0

Vladislav Shpilevoy authored 4 years ago

Previously XlogGapError was considered a critical error stopping
the replication. That may be not as good as it looks.

XlogGapError is a perfectly fine error, which should not kill the
replication connection. It should be retried instead.

Here is an example where the gap can be recovered on its
own. Consider the case: node1 is a leader, it is booted with
vclock {1: 3}. Node2 connects and fetches snapshot of node1, it
also gets vclock {1: 3}. Then node1 writes something and its
vclock becomes {1: 4}. Now node3 boots from node1, and gets the
same vclock. Vclocks now look like this:

  - node1: {1: 4}, leader, has {1: 3} snap.
  - node2: {1: 3}, booted from node1, has only snap.
  - node3: {1: 4}, booted from node1, has only snap.

If the cluster is a fullmesh, node2 will send subscribe requests
with vclock {1: 3}. If node3 receives it, it will respond with
xlog gap error, because it only has a snap with {1: 4}, nothing
else. In that case node2 should retry connecting to node3, and in
the meantime try to get newer changes from node1.
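
The three-node example can be replayed with a simplified Python model.
It assumes each node can serve rows starting from the vclock of its
oldest stored snapshot; the function names are invented for the sketch
and the real recovery logic is more involved.

```python
def has_xlog_gap(subscriber_vclock, oldest_stored_vclock):
    """True when the subscriber needs rows older than anything stored."""
    return any(subscriber_vclock.get(replica_id, 0) < lsn
               for replica_id, lsn in oldest_stored_vclock.items())


def split_peers(subscriber_vclock, peers):
    """Split peers into ones usable now and ones to retry later."""
    usable, retry_later = [], []
    for name, oldest in sorted(peers.items()):
        if has_xlog_gap(subscriber_vclock, oldest):
            retry_later.append(name)
        else:
            usable.append(name)
    return usable, retry_later


node2_vclock = {1: 3}
peers = {
    "node1": {1: 3},  # leader, has the {1: 3} snap
    "node3": {1: 4},  # has only the {1: 4} snap
}
print(split_peers(node2_vclock, peers))  # (['node1'], ['node3'])
```

Once node2 receives the missing row from node1 and reaches {1: 4}, a
retried subscribe to node3 no longer hits the gap.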

The example is totally valid. However it is unreachable now
because master registers all replicas in _cluster before allowing
them to make a join. So they all bootstrap from a snapshot
containing all their IDs. This is a bug, because such
auto-registration leads to registration of anonymous replicas, if
they are present during bootstrap. Also it blocks Raft, which
can't work if there are registered, but not yet joined nodes.

Once the registration problem is solved in the next commit, the
XlogGapError will strike quite often during bootstrap. This patch
won't allow that to happen.

Needed for #5287

f1a507b0

xlog: introduce an error code for XlogGapError · fc8e2297

Vladislav Shpilevoy authored 4 years ago

XlogGapError object didn't have a code in ClientError code space.
Because of that it was not possible to handle the gap error
together with client errors in some switch-case statement.

Now the gap error has a code.

This is going to be used in applier code to handle XlogGapError
among other errors using its code instead of RTTI.

Needed for #5287

fc8e2297

Sep 15, 2020

gitlab-ci: save sources to new S3 location · c1f72aeb

Alexander V. Tikhonov authored 4 years ago

Changed the S3 location for source tarballs. Also added the ability to
create the S3 directory for the tarballs if it did not exist.

c1f72aeb

gitlab-ci: fix deployment of tagged commits · 5aa1a1df

Alexander V. Tikhonov authored 4 years ago

Found that the deployment gitlab-ci jobs were not run for tagged
commits. To fix it, added the 'tags' label for deployment and
performance jobs. Also found that after a commit is tagged, its tag
label has the format 'x^0', while all previous commits back to the
previous tag get tags in the format 'x~<commits before>', like 'x~1'
or 'x~2'. So the check

if git name-rev --name-only --tags --no-undefined HEAD ; then

always passed, and previous commits could start to deploy on rerun.
To fix it, the gitlab-ci environment variable 'CI_COMMIT_TAG' is used:
it shows whether the current commit really has a tag and has to be
deployed.
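
The idea of the new check can be expressed as a short Python sketch.
The actual fix lives in the CI scripts; `has_commit_tag` is a
hypothetical helper and the tag value is made up.

```python
import os


def has_commit_tag(env=None):
    """Return True when the CI job runs for a tagged commit."""
    env = os.environ if env is None else env
    return bool(env.get("CI_COMMIT_TAG"))


print(has_commit_tag({"CI_COMMIT_TAG": "x"}))  # tagged commit
print(has_commit_tag({}))                      # ordinary commit
```

Unlike `git name-rev`, which can name any commit relative to a later
tag, the variable is set only for pipelines started for a tag.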

Part of #3745

5aa1a1df

test: flaky replication/gh-3704-misc-* · db3dd8dd

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts the following issue was found:

  [037] --- replication/gh-3704-misc-replica-checks-cluster-id.result	Thu Sep 10 18:05:22 2020
  [037] +++ replication/gh-3704-misc-replica-checks-cluster-id.reject	Fri Sep 11 11:09:38 2020
  [037] @@ -25,7 +25,7 @@
  [037]  ...
  [037]  box.info.replication[2].downstream.status
  [037]  ---
  [037] -- follow
  [037] +- stopped
  [037]  ...
  [037]  -- change master's cluster uuid and check that replica doesn't connect.
  [037]  test_run:cmd("stop server replica")

It happened because the replication downstream status check occurred
too early, when the status was still 'stopped'. To give the replication
a chance to reach the needed 'follow' state, the test now waits for it
using the test_run:wait_downstream() routine.

Closes #5293

db3dd8dd

build: refactor static build process · 800e5ed6

HustonMmmavr authored 4 years ago


Refactored static build process to use static-build/CMakeLists.txt
instead of Dockerfile.staticbuild (this makes it possible to support
static build on macOS). The following third-party dependencies for
static build are installed via cmake `ExternalProject_Add`:
  - OpenSSL
  - Zlib
  - Ncurses
  - Readline
  - Unwind
  - ICU

* Added static build support for macOS
* Fixed `CONFIGURE_COMMAND` while building bundled libcurl for static
  build at file cmake/BuildLibCURL.cmake:
    - disable building shared libcurl libraries (by setting
      `--disable-shared` option)
    - disable hiding libcurl symbols (by setting
      `--disable-symbol-hiding` option)
    - prevent linking libcurl with system libz (by setting
      `--with-zlib=${FOUND_ZLIB_ROOT_DIR}` option)
* Removed Dockerfile.staticbuild
* Added new gitlab.ci jobs to test new style static build:
  - static_build_cmake_linux
  - static_build_cmake_osx_15
* Removed static_docker_build gitlab.ci job

Closes #5095

Co-authored-by: Yaroslav Dynnikov <yaroslav.dynnikov@gmail.com>

800e5ed6

Sep 14, 2020

memtx: force async snapshot transactions · c620735c

Vladislav Shpilevoy authored 4 years ago

Snapshot rows do not contain real LSNs. Instead their LSNs are
signatures, ordinal numbers. Rows in the snap have LSNs from 1 to
the number of rows. This is because LSNs are not stored with every
tuple in the storages, and there is no way to store real LSNs in
the snapshot.

These artificial LSNs broke the synchronous replication limbo.
After snap recovery is done, limbo vclock was broken - it
contained numbers not related to reality, and affected by rows
from local spaces.

Also the recovery could get stuck because ACKs in the limbo stopped
working after a first row - the vclock was set to the final
signature right away.

This patch makes all snapshot recovered rows async, because they
are confirmed by definition. So now the limbo is not involved in
the snapshot recovery.

Closes #5298

c620735c

test: update test-run · a33cd1bd

Alexander Turenko authored 4 years ago

Fixed formatting of reproduce files with recent pyyaml versions.

Background: test-run generates so called reproduce files in the
test/var/reproduce/ directory and accepts them as the argument of the
--reproduce option. It is convenient to share a reproducer for a problem
that appears when specific tests are run in a specific order.

https://github.com/tarantool/test-run/pull/220

a33cd1bd

Sep 13, 2020

test: update test-run · 49ac0eb5

Alexander Turenko authored 4 years ago

Copy working directories, logs and reproduce files of workers with
failed tests to the test/var/artifacts directory. It is a prerequisite
for exposing the artifacts via CI.

https://github.com/tarantool/test-run/issues/90

49ac0eb5

Sep 12, 2020

limbo: don't wake self fiber on CONFIRM write · a0477827

Vladislav Shpilevoy authored 4 years ago

During recovery WAL writes end immediately, without yields.
Therefore WAL write completion callback is executed in the
currently active fiber.

Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
appears to be the same as the active fiber during recovery.

That breaks the fiber scheduler, because apparently it is not safe
to wake the currently active fiber unless it is going to call
fiber_yield() immediately after. See a comment in fiber_wakeup()
implementation about that way of usage.

The patch simply stops waking the waiting fiber, if it is the
currently active one.

Closes #5288
Closes #5232

a0477827

Sep 11, 2020

test: replication/status.test.lua fails on Debug · 008e732c

Alexander V. Tikhonov authored 4 years ago


Found 2 issues on Debug build:

  [009] --- replication/status.result	Fri Sep 11 10:04:53 2020
  [009] +++ replication/status.reject	Fri Sep 11 13:16:21 2020
  [009] @@ -174,7 +174,8 @@
  [009]  ...
  [009]  test_run:wait_downstream(replica_id, {status == 'follow'})
  [009]  ---
  [009] -- true
  [009] +- error: '[string "return test_run:wait_downstream(replica_id, {..."]:1: variable
  [009] +    ''status'' is not declared'
  [009]  ...
  [009]  -- wait for the replication vclock
  [009]  test_run:wait_cond(function()                    \
  [009] @@ -226,7 +227,8 @@
  [009]  ...
  [009]  test_run:wait_upstream(master_id, {status == 'follow'})
  [009]  ---
  [009] -- true
  [009] +- error: '[string "return test_run:wait_upstream(master_id, {sta..."]:1: variable
  [009] +    ''status'' is not declared'
  [009]  ...
  [009]  master.upstream.lag < 1
  [009]  ---

It happened because of the change introduced in commit [1], where
wait_upstream()/wait_downstream() were mistakenly used as:

  test_run:wait_*stream(*_id, {status == 'follow'})

with status set using '==' instead of '='. We are unable to read the
status variable when strict mode is enabled. It is enabled by default
on Debug builds.

Follows up #5110
Closes #5297

Reviewed-by: Alexander Turenko <alexander.turenko@tarantool.org>
Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>

[1] - a08b4f3a ("test: flaky replication/status.test.lua status")

008e732c

lua: fix panic in case when log.cfg.log incorrectly specified · 85f19a87

Oleg Babin authored 4 years ago

This patch makes log.cfg{log = ...} behaviour the same as in
box.cfg{log = ...} and fixes panic if "log" is incorrectly
specified. For this purpose we export the "say_parse_logger_type"
function and use it for logger type validation and logger type
parsing.

Closes #5130

85f19a87

asan: leak unit/swim.test:swim_test_encryption · ee9e3aed

Alexander V. Tikhonov authored 4 years ago


Found leak issue:

  [001] +==41031==ERROR: LeakSanitizer: detected memory leaks
  [001] +
  [001] +Direct leak of 96 byte(s) in 2 object(s) allocated from:
  [001] +    #0 0x4d8e53 in __interceptor_malloc (/tnt/test/unit/swim.test+0x4d8e53)
  [001] +    #1 0x53560f in crypto_codec_new /source/src/lib/crypto/crypto.c:239:51
  [001] +    #2 0x5299c4 in swim_scheduler_set_codec /source/src/lib/swim/swim_io.c:700:30
  [001] +    #3 0x511fe6 in swim_cluster_set_codec /source/test/unit/swim_test_utils.c:251:2
  [001] +    #4 0x50b3ae in swim_test_encryption /source/test/unit/swim.c:767:2
  [001] +    #5 0x50b3ae in main_f /source/test/unit/swim.c:1123
  [001] +    #6 0x544a3b in fiber_loop /source/src/lib/core/fiber.c:869:18
  [001] +    #7 0x5a13d0 in coro_init /source/third_party/coro/coro.c:110:3
  [001] +
  [001] +SUMMARY: AddressSanitizer: 96 byte(s) leaked in 2 allocation(s).

Prepared minimal issue reproducer:

  static void
  swim_test_encryption(void)
  {
          swim_start_test(3);
          struct swim_cluster *cluster = swim_cluster_new(2);
          swim_cluster_set_codec(cluster, CRYPTO_ALGO_AES128, CRYPTO_MODE_CBC,
                                 "1234567812345678", CRYPTO_AES128_KEY_SIZE);
          swim_cluster_delete(cluster);
          swim_finish_test();
  }

Found that the memory allocated for the codec by crypto_codec_new()
via swim_cluster_set_codec() was never freed in the test. Added
crypto_codec_delete() in the swim_scheduler_destroy() function for it.

After this fix, removed the memory leak suppression for unit/swim.test.

Closes #5283

Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

ee9e3aed

test: flaky replication/gh-5195-qsync-* · a43414a5

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts the following issue was found:

   box.cfg{replication_synchro_quorum = 2}
    | ---
  + | - error: '[string "test_run:wait_cond(function()                ..."]:1: attempt to
  + |     index field ''vclock'' (a nil value)'
    | ...

The reported output was misleading because of a wrong output list. The
real command that caused the initial issue was the previous command:

  test_run:wait_cond(function()                                                   \
          local info = box.info.replication[replica_id]                           \
          local lsn = info.downstream.vclock[replica_id]                          \
          return lsn and lsn >= replica_lsn                                       \
  end)

It happened because the replication vclock field did not exist at the
moment of its check. To fix the issue, the test waits for the vclock
field to become available using the test_run:wait_cond() routine.
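
The waiting pattern can be sketched in Python as a generic polling
loop. This is an analogue of the idea behind test_run:wait_cond(), not
its implementation; the timeout and delay defaults and the `info`
structure are assumptions for the example.

```python
import time


def wait_cond(cond, timeout=60.0, delay=0.001):
    """Poll `cond` until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay)


info = {"downstream": {}}  # the 'vclock' field does not exist yet


def vclock_reached(replica_id=2, replica_lsn=5):
    """Guard against the missing field instead of indexing it blindly."""
    vclock = info["downstream"].get("vclock")
    lsn = vclock.get(replica_id) if vclock else None
    return lsn is not None and lsn >= replica_lsn


print(wait_cond(vclock_reached, timeout=0.01))  # False: no vclock yet
info["downstream"]["vclock"] = {2: 5}
print(wait_cond(vclock_reached, timeout=0.01))  # True
```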

Closes #5230

a43414a5

test: flaky replication/wal_off.test.lua test · ad4d0564

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts the following issue was found:

  [035] --- replication/wal_off.result	Fri Jul  3 04:29:56 2020
  [035] +++ replication/wal_off.reject	Mon Sep  7 15:32:46 2020
  [035] @@ -47,6 +47,8 @@
  [035]  ...
  [035]  while box.info.replication[wal_off_id].upstream.message ~= check do fiber.sleep(0) end
  [035]  ---
  [035] +- error: '[string "while box.info.replication[wal_off_id].upstre..."]:1: attempt to
  [035] +    index field ''upstream'' (a nil value)'
  [035]  ...
  [035]  box.info.replication[wal_off_id].upstream ~= nil
  [035]  ---

It happened because the replication upstream status check occurred too
early, when its state was not set yet. To give the replication a chance
to reach the needed 'stopped' state, the test now waits for it using
the test_run:wait_upstream() routine.

Closes #5278

ad4d0564

test: flaky replication/status.test.lua status · a08b4f3a

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts the following 3 issues were found:

line 174:

 [026] --- replication/status.result	Thu Jun 11 12:07:39 2020
 [026] +++ replication/status.reject	Sun Jun 14 03:20:21 2020
 [026] @@ -174,15 +174,17 @@
 [026]  ...
 [026]  replica.downstream.status == 'follow'
 [026]  ---
 [026] -- true
 [026] +- false
 [026]  ...

It happened because the replication downstream status check occurred
too early. To give the replication a chance to reach the needed
'follow' state, the test now waits for it using the
test_run:wait_downstream() routine.

line 178:

[024] --- replication/status.result	Mon Sep  7 00:22:52 2020
[024] +++ replication/status.reject	Mon Sep  7 00:36:01 2020
[024] @@ -178,11 +178,13 @@
[024]  ...
[024]  replica.downstream.vclock[master_id] == box.info.vclock[master_id]
[024]  ---
[024] -- true
[024] +- error: '[string "return replica.downstream.vclock[master_id] =..."]:1: attempt to
[024] +    index field ''vclock'' (a nil value)'
[024]  ...
[024]  replica.downstream.vclock[replica_id] == box.info.vclock[replica_id]
[024]  ---
[024] -- true
[024] +- error: '[string "return replica.downstream.vclock[replica_id] ..."]:1: attempt to
[024] +    index field ''vclock'' (a nil value)'
[024]  ...
[024]  --
[024]  -- Replica

It happened because the replication vclock field did not exist at the
moment of its check. To fix the issue, the test waits for the vclock
field to become available using the test_run:wait_cond() routine. Also
the replication downstream data has to be read at the same moment.

line 224:

[014] --- replication/status.result	Fri Jul  3 04:29:56 2020
[014] +++ replication/status.reject	Mon Sep  7 00:17:30 2020
[014] @@ -224,7 +224,7 @@
[014]  ...
[014]  master.upstream.status == "follow"
[014]  ---
[014] -- true
[014] +- false
[014]  ...
[014]  master.upstream.lag < 1
[014]  ---

It happened because the replication upstream status check occurred too
early. To give the replication a chance to reach the needed 'follow'
state, the test now waits for it using the test_run:wait_upstream()
routine.

Removed the test from the 'fragile' list of the test-run tool to run it
in parallel.

Closes #5110

a08b4f3a