  1. Oct 13, 2020
    • Alexander V. Tikhonov's avatar
      Add flaky tests checksums to fragile 2nd part · 3bc455f7
      Alexander V. Tikhonov authored
      Added checksums for tests with known issues:
      
        app/socket.test.lua				gh-4978
        box/access.test.lua				gh-5411
        box/access_misc.test.lua			gh-5401
        box/gh-5135-invalid-upsert.test.lua		gh-5376
        box/hash_64bit_replace.test.lua		gh-5410
        box/hash_replace.test.lua			gh-5400
        box/huge_field_map_long.test.lua		gh-5375
        box/net.box_huge_data_gh-983.test.lua		gh-5402
        replication/anon.test.lua			gh-5381
        replication/autobootstrap.test.lua		gh-4933
        replication/box_set_replication_stress.test.lua gh-4992
        replication/election_basic.test.lua		gh-5368
        replication/election_qsync.test.lua		gh-5395
        replication/gh-3247-misc-iproto-sequence-value-not-replicated.test.lua gh-5380
        replication/gh-3711-misc-no-restart-on-same-configuration.test.lua gh-5407
        replication/gh-5287-boot-anon.test.lua	gh-5412
        replication/gh-5298-qsync-recovery-snap.test.lua gh-5379
        replication/show_error_on_disconnect.test.lua	gh-5371
        replication/status.test.lua			gh-5409
        swim/swim.test.lua				gh-5403
        unit/swim.test				gh-5399
        vinyl/gc.test.lua				gh-5383
        vinyl/gh-4864-stmt-alloc-fail-compact.test.lua gh-5408
        vinyl/gh-4957-too-many-upserts.test.lua	gh-5378
        vinyl/gh.test.lua				gh-5141
        vinyl/quota.test.lua				gh-5377
        vinyl/snapshot.test.lua			gh-4984
        vinyl/stat.test.lua				gh-4951
        vinyl/upsert.test.lua				gh-5398
      3bc455f7
    • Alexander V. Tikhonov's avatar
      test: enable flaky tests on FreeBSD 12 · 8bcb6409
      Alexander V. Tikhonov authored
      Some tests were previously disabled on FreeBSD 12 to avoid flaky
      failures. Now test-run can recognize such failures by matching the
      checksums of their results against open issues, so 7 tests are
      added back to testing on FreeBSD 12.
      
      Closes #4271
      8bcb6409
    • Alexander V. Tikhonov's avatar
      test: move error messages into logs gh-5383 · fa66c295
      Alexander V. Tikhonov authored
      Moved the error message to the log output in the test:
      
        vinyl/gc.test.lua
      fa66c295
    • Alexander V. Tikhonov's avatar
      test: move error messages into logs gh-4984 · e95aec95
      Alexander V. Tikhonov authored
      Moved the error message to the log output in the test:
      
        vinyl/snapshot.test.lua
      e95aec95
    • Alexander V. Tikhonov's avatar
      test: move error messages into logs gh-5366 · c34d1d67
      Alexander V. Tikhonov authored
      Moved the error message to the log output in the test:
      
        replication/gh-4402-info-errno.test.lua
      c34d1d67
    • Alexander V. Tikhonov's avatar
      test: move error messages into logs gh-4985 · ca0c2799
      Alexander V. Tikhonov authored
      Moved the error message to the log output in the test:
      
        replication/replica_rejoin.test.lua
      ca0c2799
    • Alexander V. Tikhonov's avatar
      test: move error messages into logs gh-4940 · 1433ed8e
      Alexander V. Tikhonov authored
      Moved the error message to the log output in the test:
      
        replication/gh-3160-misc-heartbeats-on-master-changes.test.lua
      1433ed8e
  2. Oct 12, 2020
    • Vladislav Shpilevoy's avatar
      raft: introduce election_mode configuration option · 24974f36
      Vladislav Shpilevoy authored
      The new option can be one of 3 values: 'off', 'candidate',
      'voter'. It replaces the 2 old options election_is_enabled and
      election_is_candidate. Those flags looked strange: it was
      possible to set candidate to true but disable the election at
      the same time. They also would not scale well if we ever decided
      to introduce another mode, such as a data-less sentinel node
      that exists just for voting.
      
      Anyway, the single-option approach is easier to configure and
      to extend.
      
      - 'off' means the election is disabled on the node. It is the same
        as election_is_enabled = false in the old config;
      
      - 'voter' means the node can vote and is never writable. The same
        as election_is_enabled = true + election_is_candidate = false in
        the old config;
      
      - 'candidate' means the node is a full-featured cluster member,
        which eventually may become a leader. The same as
        election_is_enabled = true + election_is_candidate = true in the
        old config.
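      
      For illustration, a minimal sketch of configuring the new option
      (the listen port is an arbitrary assumption, not part of this
      commit):
      
        -- full-featured member: votes and may become a leader
        box.cfg{listen = 3301, election_mode = 'candidate'}
        -- read-only participant that only votes:
        -- box.cfg{election_mode = 'voter'}
        -- node that does not take part in the election at all:
        -- box.cfg{election_mode = 'off'}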
      
      Part of #1146
      24974f36
  3. Oct 07, 2020
    • Aleksandr Lyapunov's avatar
      Introduce fselect - formatted select · 0dc72812
      Aleksandr Lyapunov authored
      space:fselect and index:fselect fetch data like an ordinary
      select, but format the result the way mysql does - with columns,
      column names etc. fselect converts tuples to strings using json,
      padding them with spaces and cutting the tail if necessary. It is
      designed for visual analysis of a select result and shouldn't be
      used in stored procedures.
      
      index:fselect(<key>, <opts>, <fselect_opts>)
      space:fselect(<key>, <opts>, <fselect_opts>)
      
      There are some options that can be specified in different ways:
       - among other common options (<opts>) with 'fselect_' prefix.
         (e.g. 'fselect_type=..')
       - in special <fselect_opts> map (with or without prefix).
       - in global variables with 'fselect_' prefix.
      
      The possible options are:
       - type:
          - 'sql' - like mysql result (default).
          - 'gh' (or 'github' or 'markdown') - markdown syntax, for
            copy-pasting to github.
          - 'jira' - jira table syntax (for copy-pasting to jira).
       - widths: array with desired widths of columns.
       - max_width: limit entire length of a row string, longest fields
         will be cut if necessary. Set to 0 (default) to detect and use
         screen width. Set to -1 for no limit.
       - print: (default - false) - print each line instead of adding
         to result.
       - use_nbsp: (default - true) - add invisible spaces to improve
         readability in YAML output. Not applicable when print=true.
      
      There is also a pair of shortcuts:
      index/space:gselect - same as fselect, but with type='gh'.
      index/space:jselect - same as fselect, but with type='jira'.
      
      See test/engine/select.test.lua for examples.
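      
      For illustration, a rough usage sketch (the space, its contents
      and the exact rendering are hypothetical; only the call shapes
      follow the description above):
      
        s = box.schema.space.create('t')
        s:create_index('pk')
        s:replace{1, 'one'}
        s:replace{2, 'two'}
        -- sql-style table, the default type
        s:fselect()
        -- markdown table for copy-pasting to github, rows cut at 80 chars
        s:fselect(nil, nil, {type = 'gh', max_width = 80})
        -- the same via the shortcut
        s:gselect()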
      
      Closes #5161
      0dc72812
  4. Oct 06, 2020
  5. Oct 02, 2020
    • Igor Munkin's avatar
      lua: abort trace recording on fiber yield · 2711797b
      Igor Munkin authored
      
      Since Tarantool fibers don't respect the Lua coroutine switch
      mechanism, the JIT machinery stays unnotified when one lua_State
      substitutes another one. As a result, if trace recording hasn't
      been aborted prior to the fiber switch, the recording proceeds
      using the new lua_State and leads to a failure either in a
      further compiler phase or while the compiled trace is executed.
      
      This changeset extends the <cord_on_yield> routine to abort trace
      recording when the fiber switches to another one. If the
      switch-over occurs while mcode is being run, the platform finishes
      its execution with EXIT_FAILURE code and calls the panic routine
      prior to the exit.
      
      Closes #1700
      Fixes #4491
      
      Reviewed-by: Sergey Ostanevich <sergos@tarantool.org>
      Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Signed-off-by: Igor Munkin <imun@tarantool.org>
      2711797b
    • Igor Munkin's avatar
      fiber: introduce a callback for fibers switch-over · a390ec55
      Igor Munkin authored
      
      Tarantool integrates several complex environments, and there are
      issues occurring at their junction that lead to platform failures.
      E.g. fiber switch-over is implemented outside the Lua world, so
      when one lua_State substitutes another one, the main LuaJIT
      engines, such as JIT and GC, are left unnotified, leading to
      further platform misbehaviour.
      
      To solve this severe integration drawback the <cord_on_yield>
      function is introduced. This routine encloses the checks and
      actions to be done when the running fiber yields execution.
      
      Unfortunately, the way the callback is implemented introduces a
      circular dependency. Considering linker symbol resolution for the
      static build, an auxiliary translation unit is added to the
      particular tests mocking (i.e. exporting) the <cord_on_yield>
      undefined symbol.
      
      Part of #1700
      Relates to #4491
      
      Reviewed-by: Sergey Ostanevich <sergos@tarantool.org>
      Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Signed-off-by: Igor Munkin <imun@tarantool.org>
      a390ec55
  6. Oct 01, 2020
  7. Sep 29, 2020
    • Vladislav Shpilevoy's avatar
      raft: add tests · cf799645
      Vladislav Shpilevoy authored
      Part of #1146
      cf799645
    • Vladislav Shpilevoy's avatar
      raft: introduce box.info.election · 15fc8449
      Vladislav Shpilevoy authored
      Box.info.election returns a table of the form:
      
          {
              state: <string>,
              term: <number>,
              vote: <instance ID>,
              leader: <instance ID>
          }
      
      The fields correspond one to one to the same-named Raft concepts.
      This info dump is supposed to help with the tests, first of all,
      and with the investigation of problems in a real cluster.
      
      The API doesn't mention 'Raft' on purpose, to avoid tying it to
      Raft specifically and to avoid confusing users who don't know
      anything about Raft (even that it is about leader election and
      synchronous replication).
      
      Part of #1146
      15fc8449
    • Vladislav Shpilevoy's avatar
      raft: introduce box.cfg.election_* options · 1d329f0b
      Vladislav Shpilevoy authored
      The new options are:
      
      - election_is_enabled - enable/disable leader election (via
        Raft). When disabled, the node is supposed to work as if Raft
        did not exist, like before;
      
      - election_is_candidate - a flag telling whether the instance can
        try to become a leader. Note, it can vote for other nodes
        regardless of the value of this option;
      
      - election_timeout - how long to wait for the election to end,
        in seconds.
      
      The options don't do anything yet. They are added separately in
      order to keep such mundane changes out of the main Raft commit,
      to simplify its review.
      
      Option names don't mention 'Raft' on purpose, because
      - not all users know what Raft is, so they may not even know it
        is related to leader election;
      - in the future the algorithm may change from Raft to something
        else, so better not to depend on it too much in the public API.
      
      Part of #1146
      1d329f0b
  8. Sep 28, 2020
    • Roman Khabibov's avatar
      box: disallow to alter SQL view · c5cb8d31
      Roman Khabibov authored
      Ban the ability to modify a view at the box level. Since a view
      is in fact a named select, not a table, altering a view is not a
      valid operation.
      c5cb8d31
    • Alexander V. Tikhonov's avatar
      Add flaky tests checksums to fragile · 75ba744b
      Alexander V. Tikhonov authored
      Added checksums for tests with known issues:
        app/fiber.test.lua				gh-5341
        app-tap/debug.test.lua			gh-5346
        app-tap/http_client.test.lua			gh-5346
        app-tap/inspector.test.lua			gh-5346
        box/gh-2763-session-credentials-update.test.lua gh-5363
        box/hash_collation.test.lua			gh-5247
        box/lua.test.lua				gh-5351
        box/net.box_connect_triggers_gh-2858.test.lua	gh-5247
        box/net.box_incompatible_index-gh-1729.test.lua gh-5360
        box/net.box_on_schema_reload-gh-1904.test.lua gh-5354
        box/protocol.test.lua				gh-5247
        box/update.test.lua				gh-5247
        box-tap/net.box.test.lua			gh-5346
        replication/autobootstrap.test.lua		gh-4533
        replication/autobootstrap_guest.test.lua	gh-4533
        replication/ddl.test.lua			gh-5337
        replication/gh-3160-misc-heartbeats-on-master-changes.test.lua gh-4940
        replication/gh-3247-misc-iproto-sequence-value-not-replicated.test.lua gh-5357
        replication/gh-3637-misc-error-on-replica-auth-fail.test.lua gh-5343
        replication/long_row_timeout.test.lua		gh-4351
        replication/on_replace.test.lua		gh-5344, gh-5349
        replication/prune.test.lua			gh-5361
        replication/qsync_advanced.test.lua		gh-5340
        replication/qsync_basic.test.lua		gh-5355
        replication/replicaset_ro_mostly.test.lua	gh-5342
        replication/wal_rw_stress.test.lua		gh-5347
        replication-py/multi.test.py			gh-5362
        sql/prepared.test.lua			gh-5359
        sql-tap/selectG.test.lua			gh-5350
        vinyl/ddl.test.lua				gh-5338
        vinyl/gh-3395-read-prepared-uncommitted.test.lua gh-5197
        vinyl/iterator.test.lua			gh-5336
        vinyl/write_iterator_rand.test.lua	gh-5356
        xlog/panic_on_wal_error.test.lua		gh-5348
      75ba744b
  9. Sep 25, 2020
    • Alexander V. Tikhonov's avatar
      test: fix mistake in replication/suite.ini · 6b98017e
      Alexander V. Tikhonov authored
      Removed a stray line left over from a merge.
      6b98017e
    • Alexander V. Tikhonov's avatar
      Enable test reruns on failed fragiled tests · 74328386
      Alexander V. Tikhonov authored
      test-run now implements a new format of the fragile lists: JSON,
      set as the 'fragile' option in the 'suite.ini' file of each suite:
      
         fragile = {
              "retries": 10,
              "tests": {
                  "bitset.test.lua": {
                      "issues": [ "gh-4095" ],
                      "checksums": [ "050af3a99561a724013995668a4bc71c", "f34be60193cfe9221d3fe50df657e9d3" ]
                  }
              }}
      
      Added the ability to compute the checksum of the result file when
      a test fails and to compare it with the checksums of the known
      issues mentioned in the fragile list.
      
      Also added the ability to set the 'retries' option, which defines
      the number of accepted reruns for tests from the 'fragile' list
      that failed with a known checksum.
      
      Closes #5050
      74328386
    • Alexander V. Tikhonov's avatar
      test: flaky replication/anon.test.lua test · bb856247
      Alexander V. Tikhonov authored
      Found flaky failures when running the replication/anon.test.lua
      test multiple times on a single worker:
      
       [007] --- replication/anon.result	Fri Jun  5 09:02:25 2020
       [007] +++ replication/anon.reject	Mon Jun  8 01:19:37 2020
       [007] @@ -55,7 +55,7 @@
       [007]
       [007]  box.info.status
       [007]   | ---
       [007] - | - running
       [007] + | - orphan
       [007]   | ...
       [007]  box.info.id
       [007]   | ---
      
       [094] --- replication/anon.result       Sat Jun 20 06:02:43 2020
       [094] +++ replication/anon.reject       Tue Jun 23 19:35:28 2020
       [094] @@ -154,7 +154,7 @@
       [094]  -- Test box.info.replication_anon.
       [094]  box.info.replication_anon
       [094]   | ---
       [094] - | - count: 1
       [094] + | - count: 2
       [094]   | ...
       [094]  #box.info.replication_anon()
       [094]   | ---
       [094]
      
      It happened because replication connections could stay active
      from previous runs on the common tarantool instance of the
      test-run worker. To avoid this, the tarantool instance is now
      restarted at the very start of the test, as sketched below.
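      
      A sketch of that kind of guard, assuming the usual test_run
      'restart server default' command (the exact invocation in the
      test may differ):
      
        test_run = require('test_run').new()
        -- drop any replication state left by previous runs on this worker
        test_run:cmd('restart server default')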
      
      Closes #5058
      bb856247
  10. Sep 23, 2020
    • Aleksandr Lyapunov's avatar
      txm: add a test · 0018398d
      Aleksandr Lyapunov authored
      Closes #4897
      0018398d
    • Aleksandr Lyapunov's avatar
      test: move txn_proxy.lua to box/lua · 6f9f57fa
      Aleksandr Lyapunov authored
      txn_proxy is a special utility for transaction tests. Formerly
      it was used only for vinyl tests and thus was placed in the vinyl
      folder. Now the time has come to test memtx transactions, and the
      utility must be placed amongst the other utils - in box/lua.
      
      Needed for #4897
      6f9f57fa
    • Aleksandr Lyapunov's avatar
      txm: introduce memtx tx manager · bd1ed6dd
      Aleksandr Lyapunov authored
      Define the memtx TX manager. It will store data for MVCC and the
      conflict manager. Also define the 'memtx_use_mvcc_engine' config
      option that enables the MVCC engine.
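      
      For illustration, a minimal sketch of enabling it (the option is
      assumed to be set in the initial box.cfg call; the space below is
      hypothetical):
      
        box.cfg{memtx_use_mvcc_engine = true}
        s = box.schema.space.create('test')
        s:create_index('pk')
        -- memtx transactions now go through the TX manager
        box.begin()
        s:replace{1}
        box.commit()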
      
      Part of #4897
      bd1ed6dd
  11. Sep 18, 2020
    • Vladislav Shpilevoy's avatar
      tests: fix replication/prune.test.lua hang · f7bcdf4c
      Vladislav Shpilevoy authored
      The test tried to start a replica whose box.cfg would hang, with
      replication_connect_quorum = 0 to make it return immediately.
      
      But the quorum parameter was added and removed during work on
      44421317 ("replication: do not
      register outgoing connections"). Instead, to start the replica
      without blocking on box.cfg it is necessary to pass 'wait=False'
      with the test_run:cmd('start server') command.
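      
      A sketch of the resulting call, assuming the replica server name
      used by the test:
      
        -- do not block the test on the replica's box.cfg
        test_run:cmd('start server replica with wait=False')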
      
      Closes #5311
      f7bcdf4c
  12. Sep 17, 2020
    • Vladislav Shpilevoy's avatar
      replication: do not register outgoing connections · 44421317
      Vladislav Shpilevoy authored
      Replication protocol's first stage for non-anonymous replicas is
      that the replica should be registered in _cluster to get a unique
      ID number.
      
      That happens when the replica connects to a writable node, which
      performs the registration. It means registration always happens
      on the master node when an *incoming* request appears, explicitly
      asking for a registration. Only the relay can do that.
      
      That wasn't the case for bootstrap. If box.cfg.replication wasn't
      empty on the master node doing the cluster bootstrap, it
      registered all the outgoing connections in _cluster. Note, the
      target node could even be anonymous, but it was still registered.
      
      That breaks the protocol, and leads to registration of anon
      replicas sometimes. The patch drops it.
      
      Another motivation here is Raft cluster bootstrap specifics.
      During Raft bootstrap it is going to be very important that
      non-joined replicas should not be registered in _cluster. A
      replica can only register after its JOIN request was accepted, and
      its snapshot download has started.
      
      Closes #5287
      Needed for #1146
      44421317
    • Vladislav Shpilevoy's avatar
      replication: retry in case of XlogGapError · f1a507b0
      Vladislav Shpilevoy authored
      Previously XlogGapError was considered a critical error stopping
      the replication. That may be not so good as it looks.
      
      XlogGapError is a perfectly fine error, which should not kill the
      replication connection. It should be retried instead.
      
      Here is an example where the gap can be recovered on its own.
      Consider the case: node1 is a leader, it is booted with
      vclock {1: 3}. Node2 connects and fetches snapshot of node1, it
      also gets vclock {1: 3}. Then node1 writes something and its
      vclock becomes {1: 4}. Now node3 boots from node1, and gets the
      same vclock. Vclocks now look like this:
      
        - node1: {1: 4}, leader, has {1: 3} snap.
        - node2: {1: 3}, booted from node1, has only snap.
        - node3: {1: 4}, booted from node1, has only snap.
      
      If the cluster is a fullmesh, node2 will send subscribe requests
      with vclock {1: 3}. If node3 receives it, it will respond with
      xlog gap error, because it only has a snap with {1: 4}, nothing
      else. In that case node2 should retry connecting to node3, and in
      the meantime try to get newer changes from node1.
      
      The example is totally valid. However it is unreachable now,
      because the master registers all replicas in _cluster before
      allowing them to join. So they all bootstrap from a snapshot
      containing all their IDs. This is a bug, because such
      auto-registration leads to registration of anonymous replicas if
      they are present during bootstrap. It also blocks Raft, which
      can't work if there are registered but not yet joined nodes.
      
      Once the registration problem is solved in the next commit,
      XlogGapError will strike quite often during bootstrap. This patch
      won't allow that to happen.
      
      Needed for #5287
      f1a507b0
    • Vladislav Shpilevoy's avatar
      xlog: introduce an error code for XlogGapError · fc8e2297
      Vladislav Shpilevoy authored
      XlogGapError object didn't have a code in ClientError code space.
      Because of that it was not possible to handle the gap error
      together with client errors in some switch-case statement.
      
      Now the gap error has a code.
      
      This is going to be used in applier code to handle XlogGapError
      among other errors using its code instead of RTTI.
      
      Needed for #5287
      fc8e2297
  13. Sep 15, 2020
    • Alexander V. Tikhonov's avatar
      test: flaky replication/gh-3704-misc-* · db3dd8dd
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [037] --- replication/gh-3704-misc-replica-checks-cluster-id.result	Thu Sep 10 18:05:22 2020
        [037] +++ replication/gh-3704-misc-replica-checks-cluster-id.reject	Fri Sep 11 11:09:38 2020
        [037] @@ -25,7 +25,7 @@
        [037]  ...
        [037]  box.info.replication[2].downstream.status
        [037]  ---
        [037] -- follow
        [037] +- stopped
        [037]  ...
        [037]  -- change master's cluster uuid and check that replica doesn't connect.
        [037]  test_run:cmd("stop server replica")
      
      It happened because the replication downstream status check
      occurred too early, when the downstream was still in the 'stopped'
      state. To let the status check reach the needed 'follow' state,
      the test needs to wait for it using the test_run:wait_downstream()
      routine.
      
      Closes #5293
      db3dd8dd
  14. Sep 14, 2020
    • Vladislav Shpilevoy's avatar
      memtx: force async snapshot transactions · c620735c
      Vladislav Shpilevoy authored
      Snapshot rows do not contain real LSNs. Instead, their LSNs are
      signatures, ordinal numbers. Rows in the snap have LSNs from 1 to
      the number of rows. This is because LSNs are not stored with every
      tuple in the storages, and there is no way to store real LSNs in
      the snapshot.
      
      These artificial LSNs broke the synchronous replication limbo.
      After snap recovery was done, the limbo vclock was broken - it
      contained numbers not related to reality, and it was affected by
      rows from local spaces.
      
      Also the recovery could get stuck because ACKs in the limbo
      stopped working after the first row - the vclock was set to the
      final signature right away.
      
      This patch makes all snapshot-recovered rows async, because they
      are confirmed by definition. So now the limbo is not involved in
      the snapshot recovery.
      
      Closes #5298
      c620735c
  15. Sep 12, 2020
    • Vladislav Shpilevoy's avatar
      limbo: don't wake self fiber on CONFIRM write · a0477827
      Vladislav Shpilevoy authored
      During recovery WAL writes end immediately, without yields.
      Therefore WAL write completion callback is executed in the
      currently active fiber.
      
      Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
      appears to be the same as the active fiber during recovery.
      
      That breaks the fiber scheduler, because apparently it is not safe
      to wake the currently active fiber unless it is going to call
      fiber_yield() immediately after. See a comment in fiber_wakeup()
      implementation about that way of usage.
      
      The patch simply stops waking the waiting fiber, if it is the
      currently active one.
      
      Closes #5288
      Closes #5232
      a0477827
  16. Sep 11, 2020
    • Alexander V. Tikhonov's avatar
      test: replication/status.test.lua fails on Debug · 008e732c
      Alexander V. Tikhonov authored
      
      Found 2 issues on a Debug build:
      
        [009] --- replication/status.result	Fri Sep 11 10:04:53 2020
        [009] +++ replication/status.reject	Fri Sep 11 13:16:21 2020
        [009] @@ -174,7 +174,8 @@
        [009]  ...
        [009]  test_run:wait_downstream(replica_id, {status == 'follow'})
        [009]  ---
        [009] -- true
        [009] +- error: '[string "return test_run:wait_downstream(replica_id, {..."]:1: variable
        [009] +    ''status'' is not declared'
        [009]  ...
        [009]  -- wait for the replication vclock
        [009]  test_run:wait_cond(function()                    \
        [009] @@ -226,7 +227,8 @@
        [009]  ...
        [009]  test_run:wait_upstream(master_id, {status == 'follow'})
        [009]  ---
        [009] -- true
        [009] +- error: '[string "return test_run:wait_upstream(master_id, {sta..."]:1: variable
        [009] +    ''status'' is not declared'
        [009]  ...
        [009]  master.upstream.lag < 1
        [009]  ---
      
      It happened because of the change introduced in commit [1], where
      wait_upstream()/wait_downstream() were mistakenly called as:
      
        test_run:wait_*stream(*_id, {status == 'follow'})
      
      i.e. with the status set using '==' instead of '='. The status
      variable cannot be read when the strict mode is enabled, and it
      is enabled by default on Debug builds.
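      
      For reference, the corrected calls pass the expected status as a
      table field:
      
        -- wrong: reads an undeclared global 'status' under strict mode
        -- test_run:wait_downstream(replica_id, {status == 'follow'})
        -- right:
        test_run:wait_downstream(replica_id, {status = 'follow'})
        test_run:wait_upstream(master_id, {status = 'follow'})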
      
      Follows up #5110
      Closes #5297
      
      Reviewed-by: Alexander Turenko <alexander.turenko@tarantool.org>
      Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
      
      [1] - a08b4f3a ("test: flaky replication/status.test.lua status")
      008e732c
    • Oleg Babin's avatar
      lua: fix panic in case when log.cfg.log incorrecly specified · 85f19a87
      Oleg Babin authored
      This patch makes the log.cfg{log = ...} behaviour the same as in
      box.cfg{log = ...} and fixes a panic when "log" is incorrectly
      specified. For this purpose we export the "say_parse_logger_type"
      function and use it for logger type validation and parsing.
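      
      For illustration, a hedged sketch of the intended behaviour (the
      logger strings below are made-up examples):
      
        log = require('log')
        -- a valid logger specification is accepted, as with box.cfg
        log.cfg{log = 'tarantool.log'}
        -- an invalid logger type now raises a Lua error instead of
        -- panicking the process
        ok, err = pcall(function() log.cfg{log = 'badtype:foo'} end)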
      
      Closes #5130
      85f19a87
    • Alexander V. Tikhonov's avatar
      test: flaky replication/gh-5195-qsync-* · a43414a5
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
         box.cfg{replication_synchro_quorum = 2}
          | ---
        + | - error: '[string "test_run:wait_cond(function()                ..."]:1: attempt to
        + |     index field ''vclock'' (a nil value)'
          | ...
      
      The failure shown above is attributed to the wrong statement
      because of a wrong output list; the command that actually caused
      the issue was the previous one:
      
        test_run:wait_cond(function()                                                   \
                local info = box.info.replication[replica_id]                           \
                local lsn = info.downstream.vclock[replica_id]                          \
                return lsn and lsn >= replica_lsn                                       \
        end)
      
      It happened because the replication vclock field did not yet
      exist at the moment of the check. To fix the issue, the test has
      to wait for the vclock field to become available using the
      test_run:wait_cond() routine.
      
      Closes #5230
      a43414a5
    • Alexander V. Tikhonov's avatar
      test: flaky replication/wal_off.test.lua test · ad4d0564
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [035] --- replication/wal_off.result	Fri Jul  3 04:29:56 2020
        [035] +++ replication/wal_off.reject	Mon Sep  7 15:32:46 2020
        [035] @@ -47,6 +47,8 @@
        [035]  ...
        [035]  while box.info.replication[wal_off_id].upstream.message ~= check do fiber.sleep(0) end
        [035]  ---
        [035] +- error: '[string "while box.info.replication[wal_off_id].upstre..."]:1: attempt to
        [035] +    index field ''upstream'' (a nil value)'
        [035]  ...
        [035]  box.info.replication[wal_off_id].upstream ~= nil
        [035]  ---
      
      It happened because the replication upstream status check
      occurred too early, when its state was not set yet. To let the
      status check reach the needed 'stopped' state, the test needs to
      wait for it using the test_run:wait_upstream() routine.
      
      Closes #5278
      ad4d0564
    • Alexander V. Tikhonov's avatar
      test: flaky replication/status.test.lua status · a08b4f3a
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following 3 issues were found:
      
      line 174:
      
       [026] --- replication/status.result	Thu Jun 11 12:07:39 2020
       [026] +++ replication/status.reject	Sun Jun 14 03:20:21 2020
       [026] @@ -174,15 +174,17 @@
       [026]  ...
       [026]  replica.downstream.status == 'follow'
       [026]  ---
       [026] -- true
       [026] +- false
       [026]  ...
      
      It happened because the replication downstream status check
      occurred too early. To let the status check reach the needed
      'follow' state, the test needs to wait for it using the
      test_run:wait_downstream() routine.
      
      line 178:
      
      [024] --- replication/status.result	Mon Sep  7 00:22:52 2020
      [024] +++ replication/status.reject	Mon Sep  7 00:36:01 2020
      [024] @@ -178,11 +178,13 @@
      [024]  ...
      [024]  replica.downstream.vclock[master_id] == box.info.vclock[master_id]
      [024]  ---
      [024] -- true
      [024] +- error: '[string "return replica.downstream.vclock[master_id] =..."]:1: attempt to
      [024] +    index field ''vclock'' (a nil value)'
      [024]  ...
      [024]  replica.downstream.vclock[replica_id] == box.info.vclock[replica_id]
      [024]  ---
      [024] -- true
      [024] +- error: '[string "return replica.downstream.vclock[replica_id] ..."]:1: attempt to
      [024] +    index field ''vclock'' (a nil value)'
      [024]  ...
      [024]  --
      [024]  -- Replica
      
      It happened because the replication vclock field did not yet
      exist at the moment of the check. To fix the issue, the test has
      to wait for the vclock field to become available using the
      test_run:wait_cond() routine. The replication downstream data
      also has to be read at that same moment.
      
      line 224:
      
      [014] --- replication/status.result	Fri Jul  3 04:29:56 2020
      [014] +++ replication/status.reject	Mon Sep  7 00:17:30 2020
      [014] @@ -224,7 +224,7 @@
      [014]  ...
      [014]  master.upstream.status == "follow"
      [014]  ---
      [014] -- true
      [014] +- false
      [014]  ...
      [014]  master.upstream.lag < 1
      [014]  ---
      
      It happened because the replication upstream status check
      occurred too early. To let the status check reach the needed
      'follow' state, the test needs to wait for it using the
      test_run:wait_upstream() routine.
      
      Removed the test from the test-run 'fragile' list so that it runs
      in parallel.
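      
      A sketch of the vclock wait described for the second issue
      (variable names follow the diffs above):
      
        -- wait until the downstream vclock appears before comparing it
        test_run:wait_cond(function()
            local downstream = box.info.replication[replica_id].downstream
            return downstream ~= nil and downstream.vclock ~= nil
        end)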
      
      Closes #5110
      a08b4f3a
    • Alexander V. Tikhonov's avatar
      test: flaky replication/gh-4606-admin-creds test · 11ba3322
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [021] --- replication/gh-4606-admin-creds.result	Wed Apr 15 15:47:41 2020
        [021] +++ replication/gh-4606-admin-creds.reject	Sun Sep  6 20:23:09 2020
        [021] @@ -36,7 +36,42 @@
        [021]   | ...
        [021]  i.replication[i.id % 2 + 1].upstream.status == 'follow' or i
        [021]   | ---
        [021] - | - true
        [021] + | - version: 2.6.0-52-g71a24b9f2
        [021] + |   id: 2
        [021] + |   ro: false
        [021] + |   uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
        [021] + |   package: Tarantool
        [021] + |   cluster:
        [021] + |     uuid: f27dfdfe-2802-486a-bc47-abc83b9097cf
        [021] + |   listen: unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/replica_auth.socket-iproto
        [021] + |   replication_anon:
        [021] + |     count: 0
        [021] + |   replication:
        [021] + |     1:
        [021] + |       id: 1
        [021] + |       uuid: a07cad18-d27f-48c4-8d56-96b17026702e
        [021] + |       lsn: 3
        [021] + |       upstream:
        [021] + |         peer: admin@unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/master.socket-iproto
        [021] + |         lag: 0.0030207633972168
        [021] + |         status: disconnected
        [021] + |         idle: 0.44824500009418
        [021] + |         message: timed out
        [021] + |         system_message: Operation timed out
        [021] + |     2:
        [021] + |       id: 2
        [021] + |       uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
        [021] + |       lsn: 0
        [021] + |   signature: 3
        [021] + |   status: running
        [021] + |   vclock: {1: 3}
        [021] + |   uptime: 1
        [021] + |   lsn: 0
        [021] + |   sql: []
        [021] + |   gc: []
        [021] + |   vinyl: []
        [021] + |   memory: []
        [021] + |   pid: 40326
        [021]   | ...
        [021]  test_run:switch('default')
        [021]   | ---
      
      It happened because the replication upstream status check
      occurred too early, when the upstream was still in the
      'disconnected' state. To let the status check reach the needed
      'follow' state, the test needs to wait for it using the
      test_run:wait_upstream() routine.
      
      Closes #5233
      11ba3322
    • Alexander V. Tikhonov's avatar
      test: flaky replication/gh-4402-info-errno.test.lua · 2b1f8f9b
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [004] --- replication/gh-4402-info-errno.result	Wed Jul 22 06:13:34 2020
        [004] +++ replication/gh-4402-info-errno.reject	Wed Jul 22 06:41:14 2020
        [004] @@ -32,7 +32,39 @@
        [004]   | ...
        [004]  d ~= nil and d.status == 'follow' or i
        [004]   | ---
        [004] - | - true
        [004] + | - version: 2.6.0-10-g8df49e4
        [004] + |   id: 1
        [004] + |   ro: false
        [004] + |   uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
        [004] + |   package: Tarantool
        [004] + |   cluster:
        [004] + |     uuid: 6ec7bcce-68e7-41a4-b84b-dc9236621579
        [004] + |   listen: unix/:(socket)
        [004] + |   replication_anon:
        [004] + |     count: 0
        [004] + |   replication:
        [004] + |     1:
        [004] + |       id: 1
        [004] + |       uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
        [004] + |       lsn: 52
        [004] + |     2:
        [004] + |       id: 2
        [004] + |       uuid: 8a989231-177a-4eb8-8030-c148bc752b0e
        [004] + |       lsn: 0
        [004] + |       downstream:
        [004] + |         status: stopped
        [004] + |         message: timed out
        [004] + |         system_message: Connection timed out
        [004] + |   signature: 52
        [004] + |   status: running
        [004] + |   vclock: {1: 52}
        [004] + |   uptime: 27
        [004] + |   lsn: 52
        [004] + |   sql: []
        [004] + |   gc: []
        [004] + |   vinyl: []
        [004] + |   memory: []
        [004] + |   pid: 99
        [004]   | ...
        [004]
        [004]  test_run:cmd('stop server replica')
      
      It happened because the replication downstream status check
      occurred too early, when the downstream was still in the 'stopped'
      state. To let the status check reach the needed 'follow' state,
      the test needs to wait for it using the test_run:wait_downstream()
      routine.
      
      Closes #5235
      2b1f8f9b
    • Alexander V. Tikhonov's avatar
      test: flaky replication/gh-4928-tx-boundaries test · 5410e592
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [089] --- replication/gh-4928-tx-boundaries.result	Wed Jul 29 04:08:29 2020
        [089] +++ replication/gh-4928-tx-boundaries.reject	Wed Jul 29 04:24:02 2020
        [089] @@ -94,7 +94,7 @@
        [089]   | ...
        [089]  box.info.replication[1].upstream.status
        [089]   | ---
        [089] - | - follow
        [089] + | - disconnected
        [089]   | ...
        [089]
        [089]  box.space.glob:select{}
      
      It happened because the replication upstream status check
      occurred too early, when the upstream was still in the
      'disconnected' state. To let the status check reach the needed
      'follow' state, the test needs to wait for it using the
      test_run:wait_upstream() routine.
      
      Closes #5234
      5410e592