Commits · 5aa1a1dfe58f39d9024b6e66000a71458a26f3ff · core / tarantool

Sep 15, 2020

gitlab-ci: fix deployment of tagged commits · 5aa1a1df

Alexander V. Tikhonov authored 4 years ago

Found that the deployment gitlab-ci jobs were not run for tagged
commits. To fix it, added the 'tags' label to the deployment and
performance jobs. Also found that once a commit is tagged, it gets a
tag label in the format 'x^0', and all earlier commits back to the
previous tag get labels in the format 'x~<commits before>', such as
'x~1' or 'x~2'. So the check

if git name-rev --name-only --tags --no-undefined HEAD ; then

always passed, and earlier commits could be deployed on rerun. To fix
it, the gitlab-ci environment variable 'CI_COMMIT_TAG' is used: it
shows whether the current commit really has a tag and has to be
deployed.

Part of #3745

5aa1a1df

test: flaky replication/gh-3704-misc-* · db3dd8dd

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

  [037] --- replication/gh-3704-misc-replica-checks-cluster-id.result	Thu Sep 10 18:05:22 2020
  [037] +++ replication/gh-3704-misc-replica-checks-cluster-id.reject	Fri Sep 11 11:09:38 2020
  [037] @@ -25,7 +25,7 @@
  [037]  ...
  [037]  box.info.replication[2].downstream.status
  [037]  ---
  [037] -- follow
  [037] +- stopped
  [037]  ...
  [037]  -- change master's cluster uuid and check that replica doesn't connect.
  [037]  test_run:cmd("stop server replica")

It happened because the replication downstream status was checked too
early, while it was still in the 'stopped' state. To let the status
reach the needed 'follow' state, the test has to wait for it using the
test_run:wait_downstream() routine.
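
The fix pattern described above (poll until the state is reached instead of checking it once) can be sketched in C. This is only an illustration under stated assumptions: `wait_cond()`, its parameters and the stub `status_is_follow()` are hypothetical names and do not reproduce the test-run implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical condition callback: returns true once the state is reached. */
typedef bool (*cond_f)(void *arg);

/*
 * Poll cond() until it returns true or until max_attempts checks were
 * made, sleeping delay_ms milliseconds after every failed check.
 * Returns true if the condition was met, false otherwise.
 */
static bool
wait_cond(cond_f cond, void *arg, int max_attempts, long delay_ms)
{
	struct timespec delay = {
		.tv_sec = delay_ms / 1000,
		.tv_nsec = (delay_ms % 1000) * 1000000L,
	};
	for (int i = 0; i < max_attempts; i++) {
		if (cond(arg))
			return true;
		nanosleep(&delay, NULL);
	}
	return false;
}

/* Stub standing in for "status is 'follow'": true from the 3rd check on. */
static bool
status_is_follow(void *arg)
{
	int *checks = arg;
	return ++*checks >= 3;
}
```

A single early check corresponds to `max_attempts == 1`, which is exactly the case that fails on a slow host.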

Closes #5293

db3dd8dd

build: refactor static build process · 800e5ed6

HustonMmmavr authored 4 years ago


Refactored static build process to use static-build/CMakeLists.txt
instead of Dockerfile.staticbuild (this makes it possible to support
static build on macOS). The following third-party dependencies for
static build are installed via cmake `ExternalProject_Add`:
  - OpenSSL
  - Zlib
  - Ncurses
  - Readline
  - Unwind
  - ICU

* Added static build support for macOS
* Fixed `CONFIGURE_COMMAND` while building bundled libcurl for static
  build at file cmake/BuildLibCURL.cmake:
    - disable building shared libcurl libraries (by setting
      `--disable-shared` option)
    - disable hiding libcurl symbols (by setting
      `--disable-symbol-hiding` option)
    - prevent linking libcurl with system libz (by setting
      `--with-zlib=${FOUND_ZLIB_ROOT_DIR}` option)
* Removed Dockerfile.staticbuild
* Added new gitlab.ci jobs to test new style static build:
  - static_build_cmake_linux
  - static_build_cmake_osx_15
* Removed static_docker_build gitlab.ci job

Closes #5095

Co-authored-by: Yaroslav Dynnikov <yaroslav.dynnikov@gmail.com>

800e5ed6

Sep 14, 2020

memtx: force async snapshot transactions · c620735c

Vladislav Shpilevoy authored 4 years ago

Snapshot rows do not contain real LSNs. Instead their LSNs are
signatures, ordinal numbers. Rows in the snap have LSNs from 1 to
the number of rows. This is because LSNs are not stored with every
tuple in the storages, and there is no way to store real LSNs in
the snapshot.

These artificial LSNs broke the synchronous replication limbo.
After snap recovery is done, limbo vclock was broken - it
contained numbers not related to reality, and affected by rows
from local spaces.

Also the recovery could get stuck because ACKs in the limbo stopped
working after a first row - the vclock was set to the final
signature right away.

This patch makes all rows recovered from a snapshot async, because
they are confirmed by definition. So now the limbo is not involved in
the snapshot recovery.

Closes #5298

c620735c

test: update test-run · a33cd1bd

Alexander Turenko authored 4 years ago

Fixed formatting of reproduce files with recent pyyaml versions.

Background: test-run generates so called reproduce files in the
test/var/reproduce/ directory and accepts them as the argument of the
--reproduce option. It is convenient to share a reproducer for a problem
that appears when specific tests are run in a specific order.

https://github.com/tarantool/test-run/pull/220

a33cd1bd

Sep 13, 2020

test: update test-run · 49ac0eb5

Alexander Turenko authored 4 years ago

Copy working directories, logs and reproduce files of workers with
failed tests to the test/var/artifacts directory. It is a prerequisite
for exposing the artifacts via CI.

https://github.com/tarantool/test-run/issues/90

49ac0eb5

Sep 12, 2020

limbo: don't wake self fiber on CONFIRM write · a0477827

Vladislav Shpilevoy authored 4 years ago

During recovery WAL writes end immediately, without yields.
Therefore WAL write completion callback is executed in the
currently active fiber.

Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
appears to be the same as the active fiber during recovery.

That breaks the fiber scheduler, because apparently it is not safe
to wake the currently active fiber unless it is going to call
fiber_yield() immediately after. See the comment in the
fiber_wakeup() implementation about this usage.

The patch simply stops waking the waiting fiber, if it is the
currently active one.
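
The guard described above can be sketched as follows. This is a minimal illustration, not the actual limbo code: `struct fiber_sketch`, `current_fiber` and `complete_and_wake()` are hypothetical names, and a real wakeup does far more than set a flag.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a fiber. */
struct fiber_sketch {
	bool is_ready; /* set when the fiber is scheduled to run */
};

/* The fiber that is executing right now (hypothetical global). */
static struct fiber_sketch *current_fiber;

/*
 * Completion callback sketch: wake the waiter only if it is not the
 * fiber that is already running. Returns true if a wakeup was issued.
 */
static bool
complete_and_wake(struct fiber_sketch *waiter)
{
	if (waiter == current_fiber)
		return false; /* already running, nothing to wake */
	waiter->is_ready = true;
	return true;
}
```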

Closes #5288
Closes #5232

a0477827

Sep 11, 2020

test: replication/status.test.lua fails on Debug · 008e732c

Alexander V. Tikhonov authored 4 years ago


Found 2 issues on Debug build:

  [009] --- replication/status.result	Fri Sep 11 10:04:53 2020
  [009] +++ replication/status.reject	Fri Sep 11 13:16:21 2020
  [009] @@ -174,7 +174,8 @@
  [009]  ...
  [009]  test_run:wait_downstream(replica_id, {status == 'follow'})
  [009]  ---
  [009] -- true
  [009] +- error: '[string "return test_run:wait_downstream(replica_id, {..."]:1: variable
  [009] +    ''status'' is not declared'
  [009]  ...
  [009]  -- wait for the replication vclock
  [009]  test_run:wait_cond(function()                    \
  [009] @@ -226,7 +227,8 @@
  [009]  ...
  [009]  test_run:wait_upstream(master_id, {status == 'follow'})
  [009]  ---
  [009] -- true
  [009] +- error: '[string "return test_run:wait_upstream(master_id, {sta..."]:1: variable
  [009] +    ''status'' is not declared'
  [009]  ...
  [009]  master.upstream.lag < 1
  [009]  ---

It happened because of the change introduced in commit [1], where
wait_upstream()/wait_downstream() were mistakenly called as:

  test_run:wait_*stream(*_id, {status == 'follow'})

with status compared using '==' instead of being set using '='. The
undeclared status variable cannot be read when the strict mode is
enabled, and it is enabled by default on Debug builds.

Follows up #5110
Closes #5297

Reviewed-by: Alexander Turenko <alexander.turenko@tarantool.org>
Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>

[1] - a08b4f3a ("test: flaky replication/status.test.lua status")

008e732c

lua: fix panic in case when log.cfg.log incorrectly specified · 85f19a87

Oleg Babin authored 4 years ago

This patch makes log.cfg{log = ...} behaviour the same as in
box.cfg{log = ...} and fixes the panic if "log" is incorrectly
specified. For this purpose we export the "say_parse_logger_type"
function and use it for logger type validation and logger type
parsing.

Closes #5130

85f19a87

asan: leak unit/swim.test:swim_test_encryption · ee9e3aed

Alexander V. Tikhonov authored 4 years ago


Found leak issue:

  [001] +==41031==ERROR: LeakSanitizer: detected memory leaks
  [001] +
  [001] +Direct leak of 96 byte(s) in 2 object(s) allocated from:
  [001] +    #0 0x4d8e53 in __interceptor_malloc (/tnt/test/unit/swim.test+0x4d8e53)
  [001] +    #1 0x53560f in crypto_codec_new /source/src/lib/crypto/crypto.c:239:51
  [001] +    #2 0x5299c4 in swim_scheduler_set_codec /source/src/lib/swim/swim_io.c:700:30
  [001] +    #3 0x511fe6 in swim_cluster_set_codec /source/test/unit/swim_test_utils.c:251:2
  [001] +    #4 0x50b3ae in swim_test_encryption /source/test/unit/swim.c:767:2
  [001] +    #5 0x50b3ae in main_f /source/test/unit/swim.c:1123
  [001] +    #6 0x544a3b in fiber_loop /source/src/lib/core/fiber.c:869:18
  [001] +    #7 0x5a13d0 in coro_init /source/third_party/coro/coro.c:110:3
  [001] +
  [001] +SUMMARY: AddressSanitizer: 96 byte(s) leaked in 2 allocation(s).

Prepared minimal issue reproducer:

  static void
  swim_test_encryption(void)
  {
          swim_start_test(3);
          struct swim_cluster *cluster = swim_cluster_new(2);
          swim_cluster_set_codec(cluster, CRYPTO_ALGO_AES128, CRYPTO_MODE_CBC,
                                 "1234567812345678", CRYPTO_AES128_KEY_SIZE);
          swim_cluster_delete(cluster);
          swim_finish_test();
  }

Found that the memory allocated for the codec in crypto_codec_new(),
called via swim_cluster_set_codec(), was never freed in the test.
Added a crypto_codec_delete() call to swim_scheduler_destroy() for it.
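
The ownership rule behind this fix can be sketched as below. The types and functions are hypothetical stand-ins (they are not the real swim or crypto structures), and the `live_codecs` counter exists only to make the leak visible in the sketch.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; not the real swim or crypto structures. */
struct codec_sketch { int algo; };
struct scheduler_sketch { struct codec_sketch *codec; };

static int live_codecs; /* codecs that were allocated but not freed yet */

static int
scheduler_set_codec(struct scheduler_sketch *s, int algo)
{
	struct codec_sketch *c = malloc(sizeof(*c));
	if (c == NULL)
		return -1;
	c->algo = algo;
	live_codecs++;
	/* Release the previous codec, if any, before replacing it. */
	if (s->codec != NULL) {
		free(s->codec);
		live_codecs--;
	}
	s->codec = c;
	return 0;
}

static void
scheduler_destroy(struct scheduler_sketch *s)
{
	/* The step that was missing: free the codec the scheduler owns. */
	if (s->codec != NULL) {
		free(s->codec);
		live_codecs--;
		s->codec = NULL;
	}
}
```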

After this fix, removed the memory leak suppression for unit/swim.test.

Closes #5283

Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>

ee9e3aed

test: flaky replication/gh-5195-qsync-* · a43414a5

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

   box.cfg{replication_synchro_quorum = 2}
    | ---
  + | - error: '[string "test_run:wait_cond(function()                ..."]:1: attempt to
  + |     index field ''vclock'' (a nil value)'
    | ...

The reported output was misleading because the error was attached to
the wrong command. The command that really caused the issue was the
previous one:

  test_run:wait_cond(function()                                                   \
          local info = box.info.replication[replica_id]                           \
          local lsn = info.downstream.vclock[replica_id]                          \
          return lsn and lsn >= replica_lsn                                       \
  end)

It happened because the replication vclock field did not exist yet at
the moment of the check. To fix the issue, the test has to wait until
the vclock field is available using the test_run:wait_cond() routine.

Closes #5230

a43414a5

test: flaky replication/wal_off.test.lua test · ad4d0564

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

  [035] --- replication/wal_off.result	Fri Jul  3 04:29:56 2020
  [035] +++ replication/wal_off.reject	Mon Sep  7 15:32:46 2020
  [035] @@ -47,6 +47,8 @@
  [035]  ...
  [035]  while box.info.replication[wal_off_id].upstream.message ~= check do fiber.sleep(0) end
  [035]  ---
  [035] +- error: '[string "while box.info.replication[wal_off_id].upstre..."]:1: attempt to
  [035] +    index field ''upstream'' (a nil value)'
  [035]  ...
  [035]  box.info.replication[wal_off_id].upstream ~= nil
  [035]  ---

It happened because the replication upstream status was checked too
early, when its state was not set yet. To let the status reach the
needed 'stopped' state, the test has to wait for it using the
test_run:wait_upstream() routine.

Closes #5278

ad4d0564

test: flaky replication/status.test.lua status · a08b4f3a

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following 3 issues:

line 174:

 [026] --- replication/status.result	Thu Jun 11 12:07:39 2020
 [026] +++ replication/status.reject	Sun Jun 14 03:20:21 2020
 [026] @@ -174,15 +174,17 @@
 [026]  ...
 [026]  replica.downstream.status == 'follow'
 [026]  ---
 [026] -- true
 [026] +- false
 [026]  ...

It happened because the replication downstream status was checked too
early. To let the status reach the needed 'follow' state, the test has
to wait for it using the test_run:wait_downstream() routine.

line 178:

[024] --- replication/status.result	Mon Sep  7 00:22:52 2020
[024] +++ replication/status.reject	Mon Sep  7 00:36:01 2020
[024] @@ -178,11 +178,13 @@
[024]  ...
[024]  replica.downstream.vclock[master_id] == box.info.vclock[master_id]
[024]  ---
[024] -- true
[024] +- error: '[string "return replica.downstream.vclock[master_id] =..."]:1: attempt to
[024] +    index field ''vclock'' (a nil value)'
[024]  ...
[024]  replica.downstream.vclock[replica_id] == box.info.vclock[replica_id]
[024]  ---
[024] -- true
[024] +- error: '[string "return replica.downstream.vclock[replica_id] ..."]:1: attempt to
[024] +    index field ''vclock'' (a nil value)'
[024]  ...
[024]  --
[024]  -- Replica

It happened because the replication vclock field did not exist yet at
the moment of the check. To fix the issue, the test has to wait until
the vclock field is available using the test_run:wait_cond() routine.
Also, the replication downstream data has to be read at the same
moment.

line 224:

[014] --- replication/status.result	Fri Jul  3 04:29:56 2020
[014] +++ replication/status.reject	Mon Sep  7 00:17:30 2020
[014] @@ -224,7 +224,7 @@
[014]  ...
[014]  master.upstream.status == "follow"
[014]  ---
[014] -- true
[014] +- false
[014]  ...
[014]  master.upstream.lag < 1
[014]  ---

It happened because the replication upstream status was checked too
early. To let the status reach the needed 'follow' state, the test has
to wait for it using the test_run:wait_upstream() routine.

Removed the test from the test-run 'fragile' list to run it in parallel.

Closes #5110

a08b4f3a

test: flaky replication/gh-4606-admin-creds test · 11ba3322

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

  [021] --- replication/gh-4606-admin-creds.result	Wed Apr 15 15:47:41 2020
  [021] +++ replication/gh-4606-admin-creds.reject	Sun Sep  6 20:23:09 2020
  [021] @@ -36,7 +36,42 @@
  [021]   | ...
  [021]  i.replication[i.id % 2 + 1].upstream.status == 'follow' or i
  [021]   | ---
  [021] - | - true
  [021] + | - version: 2.6.0-52-g71a24b9f2
  [021] + |   id: 2
  [021] + |   ro: false
  [021] + |   uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
  [021] + |   package: Tarantool
  [021] + |   cluster:
  [021] + |     uuid: f27dfdfe-2802-486a-bc47-abc83b9097cf
  [021] + |   listen: unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/replica_auth.socket-iproto
  [021] + |   replication_anon:
  [021] + |     count: 0
  [021] + |   replication:
  [021] + |     1:
  [021] + |       id: 1
  [021] + |       uuid: a07cad18-d27f-48c4-8d56-96b17026702e
  [021] + |       lsn: 3
  [021] + |       upstream:
  [021] + |         peer: admin@unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/master.socket-iproto
  [021] + |         lag: 0.0030207633972168
  [021] + |         status: disconnected
  [021] + |         idle: 0.44824500009418
  [021] + |         message: timed out
  [021] + |         system_message: Operation timed out
  [021] + |     2:
  [021] + |       id: 2
  [021] + |       uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
  [021] + |       lsn: 0
  [021] + |   signature: 3
  [021] + |   status: running
  [021] + |   vclock: {1: 3}
  [021] + |   uptime: 1
  [021] + |   lsn: 0
  [021] + |   sql: []
  [021] + |   gc: []
  [021] + |   vinyl: []
  [021] + |   memory: []
  [021] + |   pid: 40326
  [021]   | ...
  [021]  test_run:switch('default')
  [021]   | ---

It happened because the replication upstream status was checked too
early, while it was still in the 'disconnected' state. To let the
status reach the needed 'follow' state, the test has to wait for it
using the test_run:wait_upstream() routine.

Closes #5233

11ba3322

test: flaky replication/gh-4402-info-errno.test.lua · 2b1f8f9b

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

  [004] --- replication/gh-4402-info-errno.result	Wed Jul 22 06:13:34 2020
  [004] +++ replication/gh-4402-info-errno.reject	Wed Jul 22 06:41:14 2020
  [004] @@ -32,7 +32,39 @@
  [004]   | ...
  [004]  d ~= nil and d.status == 'follow' or i
  [004]   | ---
  [004] - | - true
  [004] + | - version: 2.6.0-10-g8df49e4
  [004] + |   id: 1
  [004] + |   ro: false
  [004] + |   uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
  [004] + |   package: Tarantool
  [004] + |   cluster:
  [004] + |     uuid: 6ec7bcce-68e7-41a4-b84b-dc9236621579
  [004] + |   listen: unix/:(socket)
  [004] + |   replication_anon:
  [004] + |     count: 0
  [004] + |   replication:
  [004] + |     1:
  [004] + |       id: 1
  [004] + |       uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
  [004] + |       lsn: 52
  [004] + |     2:
  [004] + |       id: 2
  [004] + |       uuid: 8a989231-177a-4eb8-8030-c148bc752b0e
  [004] + |       lsn: 0
  [004] + |       downstream:
  [004] + |         status: stopped
  [004] + |         message: timed out
  [004] + |         system_message: Connection timed out
  [004] + |   signature: 52
  [004] + |   status: running
  [004] + |   vclock: {1: 52}
  [004] + |   uptime: 27
  [004] + |   lsn: 52
  [004] + |   sql: []
  [004] + |   gc: []
  [004] + |   vinyl: []
  [004] + |   memory: []
  [004] + |   pid: 99
  [004]   | ...
  [004]
  [004]  test_run:cmd('stop server replica')

It happened because the replication downstream status was checked too
early, while it was still in the 'stopped' state. To let the status
reach the needed 'follow' state, the test has to wait for it using the
test_run:wait_downstream() routine.

Closes #5235

2b1f8f9b

test: flaky replication/gh-4928-tx-boundaries test · 5410e592

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

  [089] --- replication/gh-4928-tx-boundaries.result	Wed Jul 29 04:08:29 2020
  [089] +++ replication/gh-4928-tx-boundaries.reject	Wed Jul 29 04:24:02 2020
  [089] @@ -94,7 +94,7 @@
  [089]   | ...
  [089]  box.info.replication[1].upstream.status
  [089]   | ---
  [089] - | - follow
  [089] + | - disconnected
  [089]   | ...
  [089]
  [089]  box.space.glob:select{}

It happened because the replication upstream status was checked too
early, while it was still in the 'disconnected' state. To let the
status reach the needed 'follow' state, the test has to wait for it
using the test_run:wait_upstream() routine.

Closes #5234

5410e592

Sep 09, 2020

test: fix status at replication/gh-4424-misc* test · 5a9b79fa

Alexander V. Tikhonov authored 4 years ago

Fixed flaky status check:

  [016] @@ -73,11 +73,11 @@
  [016]  ...
  [016]  box.info.status
  [016]  ---
  [016] -- running
  [016] +- orphan
  [016]  ...
  [016]  box.info.ro
  [016]  ---
  [016] -- false
  [016] +- true
  [016]  ...
  [016]  box.cfg{                                                        \
  [016]      replication = {},                                           \
  [016]

The test was changed to use a wait condition for the status check,
since the status has to change from 'orphan' to 'running'. On heavily
loaded hosts this may take some additional time, and the wait
condition routine helped to fix it.

Closes #5271

5a9b79fa

test: flaky replication/gh-3642-misc-* test · 2569ba54

Alexander V. Tikhonov authored 4 years ago

On heavily loaded hosts found the following issue:

  [036] --- replication/gh-3642-misc-no-socket-leak-on-replica-disconnect.result	Sun Sep  6 23:49:57 2020
  [036] +++ replication/gh-3642-misc-no-socket-leak-on-replica-disconnect.reject	Mon Sep  7 04:07:06 2020
  [036] @@ -63,7 +63,7 @@
  [036]  ...
  [036]  box.info.replication[1].upstream.status
  [036]  ---
  [036] -- follow
  [036] +- disconnected
  [036]  ...
  [036]  test_run:cmd('switch default')
  [036]  ---

It happened because the replication upstream status was checked too
early, while it was still in the 'disconnected' state. To let the
status reach the needed 'follow' state, the test has to wait for it
using the test_run:wait_upstream() routine.

Closes #5276

2569ba54

test: remove asan suppression for unit/msgpack · 35f99e66

Alexander V. Tikhonov authored 4 years ago

ASAN showed an issue in the msgpuck repository, in file
test/msgpuck.c, which was the cause of the failure in the unit/msgpack
test. The issue was fixed in the msgpuck repository and the ASAN
suppression for it was removed. Also removed the skip condition file,
which blocked the test when it failed.

Part of #4360

35f99e66

lsan: app-tap/http_client.test.lua suppressions · 8d616ade

Alexander V. Tikhonov authored 4 years ago

Removed lsan suppressions that were not reproduced.

Part of #4360

8d616ade

Sep 08, 2020

msgpack: print mp_exp type as signed integer · 2a01ce91

Ilya Kosarev authored 4 years ago

MsgPack extension types allow applications to define
application-specific types. They consist of an 8-bit signed integer and
a byte array, where the integer represents the kind of type and the
byte array represents the data. Types from 0 to 127 are
application-specific types and types from -128 to -1 are reserved for
predefined types. However, extension types were printed as unsigned
integers. Now it is fixed and extension types are printed correctly as
signed integers. Also the typo in the word "Unsupported" was fixed. A
corresponding test case is introduced.
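
A small sketch of the difference between the two interpretations of the type byte. The helper names are hypothetical and this is not the msgpuck printing code; it only shows how a raw byte maps to the signed range described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Interpret the raw type byte of a MsgPack extension as a signed 8-bit
 * integer, without relying on implementation-defined conversions.
 */
static int
ext_type_as_signed(uint8_t raw)
{
	return raw >= 128 ? (int)raw - 256 : (int)raw;
}

/* Print the type as a signed number; returns snprintf()'s result. */
static int
format_ext_type(char *buf, size_t size, uint8_t raw)
{
	return snprintf(buf, size, "%d", ext_type_as_signed(raw));
}
```

With the unsigned interpretation the byte 0xff would be shown as 255, although it falls into the reserved range from -128 to -1.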

Closes #5016

2a01ce91

rtree: add comments on ignored rtree_search() return value · 4883f19b

Ilya Kosarev authored 4 years ago

rtree_search() has a return value, which is ignored in some cases.
Although this is totally fine, it seems reasonable to comment those
cases, since such usage might look questionable.

Closes #2052

4883f19b

Divide replication/misc.test.lua · 867e6b3d

Alexander V. Tikhonov authored 4 years ago

To fix flaky issues of replication/misc.test.lua the test had to be
divided into smaller tests to be able to localize the flaky results:

  gh-2991-misc-asserts-on-update.test.lua
  gh-3111-misc-rebootstrap-from-ro-master.test.lua
  gh-3160-misc-heartbeats-on-master-changes.test.lua
  gh-3247-misc-iproto-sequence-value-not-replicated.test.lua
  gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
  gh-3606-misc-crash-on-box-concurrent-update.test.lua
  gh-3610-misc-assert-connecting-master-twice.test.lua
  gh-3637-misc-error-on-replica-auth-fail.test.lua
  gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
  gh-3704-misc-replica-checks-cluster-id.test.lua
  gh-3711-misc-no-restart-on-same-configuration.test.lua
  gh-3760-misc-return-on-quorum-0.test.lua
  gh-4399-misc-no-failure-on-error-reading-wal.test.lua
  gh-4424-misc-orphan-on-reconfiguration-error.test.lua

Needed for #4940

867e6b3d

msgpuck: bump a new version · 77e03451

Kirill Yukhin authored 4 years ago

- test: correct buffer size to fix ASAN error

77e03451

lua: return back import of table.clear() method · 09aa8135

Sergey Bronnikov authored 4 years ago

Import of the `table.clear` module was removed to fix a luacheck
warning about an unused variable in commit 3af79e70
('Fix luacheck warnings in src/lua/'), and the `table.clear()` method
became unavailable in Tarantool. This commit brings that import back,
as some applications depend on it (the bug was found with a Cartridge
application), and adds a regression test for table.clear(). Note:
`table.clear` is not available until an explicit
`require('table.clear')` call.

Closes #5210

09aa8135

Aug 31, 2020

update_repo: correct fix for missing metadata RPMs · 71a24b9f

Alexander V. Tikhonov authored 4 years ago

When the update_repo tool is run with the option to delete some RPMs,
it needs to remove all files matching the given pattern. The loop
checking the metadata deletes files, but only those listed in it.
However, it is possible that a broken update left orphan files: they
are present in the storage, but are not mentioned in the metadata.

71a24b9f

test: concurrent tuple update segfault on bitset index iteration · c5d7e139

Ilya Kosarev authored 4 years ago

Concurrent tuple update could segfault on BITSET_ALL_NOT_SET iterator
usage. Fixed in 850054b2. This patch introduces the corresponding
test.

Closes #1088

c5d7e139

gitlab-ci: add openSUSE packages build jobs · d07e5f96

Alexander V. Tikhonov authored 5 years ago

Implemented openSUSE packages build with testing for images:
opensuse-leap:15.[0-2]

Added %{sle_version} checks in Tarantool spec file according to
https://en.opensuse.org/openSUSE:Packaging_for_Leap#RPM_Distro_Version_Macros

Added opensuse-leap versions 15.1 and 15.2 to the Gitlab-CI package
building/deploying jobs.

Closes #4562

d07e5f96

vinyl: fix check vinyl_dir existence at bootstrap · 9600b895

Alexander V. Tikhonov authored 4 years ago


While implementing the openSUSE build with testing, the
box-tap/cfg.test.lua test failed. Found that when memtx_dir didn't
exist, vinyl_dir existed and errno was set to ENOENT, box configuration
succeeded, although it shouldn't have. The reason for this wrong
behavior was that not all of the failure paths in xdir_scan() set
errno, but the caller assumed they did.

Debugging the issue showed that after xdir_scan() there was an
incorrect check of errno when it returned a negative value. xdir_scan()
is not a system call, and a negative return value from it doesn't mean
that errno is set too. When errno was left over from earlier calls and
xdir_scan() returned a negative value on its own, the check gave the
wrong result.

The previous, broken logic of the check was to catch the ENOENT error
set in the xdir_scan() function to handle the situation when vinyl_dir
did not exist. It failed because, when checking ENOENT outside the
xdir_scan() function, we had to be sure that ENOENT really came from
the xdir_scan() call and not from some function called before it. A
possible fix would be to reset errno before the xdir_scan() call,
because errno could be left over from any function called earlier.

As mentioned above, xdir_scan() is not a system call: it can be changed
in any way and can return any result value without setting errno. So
an errno check outside of this function could break.

To avoid that, we must not check errno after calling the function. A
better solution is to use a flag in xdir_scan() that tells whether the
directory should exist. So the errno check was removed and replaced
with a check for vinyl_dir existence using the flag.
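
A minimal sketch of the flag-based approach, assuming a POSIX system. `scan_dir_sketch()` and its `is_dir_required` parameter are hypothetical and do not reproduce the real xdir_scan() signature. The point is that errno is inspected only inside the function, right after the call that sets it, and the caller relies on the return value alone.

```c
#include <dirent.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical scan routine. The caller states up front whether a
 * missing directory is an error, so nobody inspects errno afterwards.
 * Returns 0 on success or on an allowed missing directory, -1 otherwise.
 */
static int
scan_dir_sketch(const char *path, bool is_dir_required)
{
	DIR *dir = opendir(path);
	if (dir == NULL) {
		/* errno is reliable here: opendir() has just set it. */
		if (errno == ENOENT && !is_dir_required)
			return 0;
		return -1;
	}
	/* A real scan would read the directory entries here. */
	closedir(dir);
	return 0;
}
```

A stale ENOENT left in errno by an earlier call no longer affects the result, because a successful scan never looks at errno.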

Closes #4594
Needed for #4562

Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>

9600b895

Aug 25, 2020

tuple: drop extra restrictions for multikey index · bfeb61b3

Ilya Kosarev authored 4 years ago

Multikey index did not work properly with a nullable root field in
tuple_raw_multikey_count(). Now it is fixed and the corresponding
restrictions are dropped. This also means that we can drop the implicit
nullability update for array/map fields and make all fields nullable
by default, as it was until e1d3fe8a
(tuple format: don't allow null where array/map is expected), since
default non-nullability itself doesn't solve any real problems while
providing confusing behavior (gh-5027).

Follow-up #5027
Closes #5192

bfeb61b3

Aug 24, 2020

box: introduce space:alter() · 8c965989

Vladislav Shpilevoy authored 4 years ago

There was no way to change certain space parameters without
recreating the space or manually updating the internal system space
_space, even though some of them were legal to update: field_count,
owner, the temporary flag, the is_sync flag.

The patch introduces function space:alter(), which accepts a
subset of parameters from box.schema.space.create which are
mutable, and 'name' parameter. There is a method space:rename(),
but still the parameter is added to space:alter() too, to be
consistent with index:alter(), which also accepts a new name.

Closes #5155

@TarantoolBot document
Title: New function space:alter(options)

Space objects in Lua (stored in `box.space` table) now have a new
method: `space:alter(options)`.

The method accepts a table with parameters `field_count`, `user`,
`format`, `temporary`, `is_sync`, and `name`. All parameters have
the same meaning as in `box.schema.space.create(name, options)`.

Note, `name` parameter in `box.schema.space.create` is separated
from `options` table. It is not so in `space:alter(options)` -
here all parameters are specified in the `options` table.

The function does not return anything in case of success, and
throws an error on failure.

From the 'Synchronous replication' page, section 'Limitations and
known problems', it is necessary to delete the note about "no way to
enable synchronous replication for existing spaces". Instead it
is necessary to say that it can be enabled using
`space:alter({is_sync = true})`, and can be disabled by setting
`is_sync = false`.
https://www.tarantool.io/en/doc/2.5/book/replication/repl_sync/#limitations-and-known-problems

The function will appear in >= 2.5.2.

8c965989

xrow: drop xrow_header_dup_body · 9dd2e2e4

Cyrill Gorcunov authored 4 years ago


We no longer use it.

Closes #5129

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

9dd2e2e4

txn: txn_add_redo -- drop synchro processing · 1d7e256b

Cyrill Gorcunov authored 4 years ago


Since we no longer use txn engine for synchro
packets processing this code is never executed.

Part-of #5129

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

1d7e256b

applier: process synchro requests without txn engine · cfccfd44

Cyrill Gorcunov authored 4 years ago


Transaction processing code is very heavy simply because
transactions carry various data and involve a number
of other mechanisms to proceed.

In turn, when we receive a confirm or rollback packet from
another node in a cluster we just need to inspect the limbo
queue and write this packet into a WAL journal. So calling
a bunch of txn engine helpers is simply a waste of cycles.

Thus let's rather handle them in a special light way:

 - allocate a synchro_entry structure which would carry
   the journal entry itself and the encoded message
 - process the limbo queue to mark confirmed/rolled back
   messages
 - finally write this synchro_entry into a journal

Which is way simpler.

Part-of #5129

Suggested-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

cfccfd44

qsync: direct write of CONFIRM/ROLLBACK into a journal · 41b31ff0

Cyrill Gorcunov authored 4 years ago


When we need to write a CONFIRM or ROLLBACK message (which is
a binary record in msgpack format) into a journal, we use txn code
to allocate a new transaction, encode the message there and make it
walk the long txn path before it hits the journal. This is not
only a waste of resources but also somewhat strange from an
architectural point of view.

Instead let's encode a record on the stack and write it to the journal
directly.

Part-of #5129

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

41b31ff0

qsync: provide a binary form of syncro entries · 7e1ce153

Cyrill Gorcunov authored 4 years ago


These msgpack entries will be needed to write them
down to a journal without involving the txn engine. At the same
time we would like to be able to allocate them on the stack,
so the binary form is predefined.
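
The idea of a predefined binary form can be illustrated with a hand-written MsgPack encoder that uses only fixed-width integer formats, so the encoded size is a compile-time constant and the buffer can live on the stack. This is a sketch under stated assumptions: the key codes, `SYNCHRO_BODY_SIZE` and `encode_synchro_body()` are hypothetical and may not match the real protocol constants or the layout chosen in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key codes; the real protocol constants may differ. */
enum { KEY_REPLICA_ID = 0x02, KEY_LSN = 0x03 };

/* fixmap header + (key + uint32) + (key + uint64) */
enum { SYNCHRO_BODY_SIZE = 1 + (1 + 5) + (1 + 9) };

/* Store value big-endian, as MsgPack requires; return the next position. */
static uint8_t *
store_be(uint8_t *pos, uint64_t value, int bytes)
{
	for (int i = bytes - 1; i >= 0; i--)
		*pos++ = (uint8_t)(value >> (8 * i));
	return pos;
}

/*
 * Encode {KEY_REPLICA_ID: replica_id, KEY_LSN: lsn} using fixed-width
 * integers, so the result always takes SYNCHRO_BODY_SIZE bytes.
 */
static size_t
encode_synchro_body(uint8_t *buf, uint32_t replica_id, uint64_t lsn)
{
	uint8_t *pos = buf;
	*pos++ = 0x82;           /* fixmap with 2 pairs */
	*pos++ = KEY_REPLICA_ID; /* positive fixint key */
	*pos++ = 0xce;           /* uint32 marker */
	pos = store_be(pos, replica_id, 4);
	*pos++ = KEY_LSN;        /* positive fixint key */
	*pos++ = 0xcf;           /* uint64 marker */
	pos = store_be(pos, lsn, 8);
	return (size_t)(pos - buf);
}
```

A caller would declare `uint8_t body[SYNCHRO_BODY_SIZE];` as a local variable and pass it to the encoder, with no heap allocation involved.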

Part-of #5129

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

7e1ce153

journal: add journal_entry_create helper · 580abaee

Cyrill Gorcunov authored 4 years ago


To create raw journal entries. We will use it
to write confirm/rollback entries.

Part-of #5129

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

580abaee

journal: bind asynchronous write completion to an entry · fd145ed5

Cyrill Gorcunov authored 4 years ago


In commit 77ba0e35 we've redesigned
wal journal operations such that asynchronous write completion
is a single instance per journal.

It turned out that such a simplification is too tight and doesn't
allow us to pass entries into the journal with custom completions.

Thus let's bring this ability back. We will need it to be able
to write "confirm" records into wal directly without touching
transactions code at all.
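
Binding the completion to an entry can be sketched like this. The structure and function names are hypothetical and much simpler than the real journal API; the sketch only shows that each entry carries its own callback instead of the journal having a single one.

```c
#include <stddef.h>

struct journal_entry_sketch;
typedef void (*journal_complete_f)(struct journal_entry_sketch *entry);

/* Hypothetical, simplified entry: each one carries its own completion. */
struct journal_entry_sketch {
	int res;                        /* write result, < 0 on error */
	journal_complete_f on_complete; /* called when the write finishes */
	void *complete_data;            /* context for on_complete */
};

/* Pretend the write has finished and notify the entry's own callback. */
static void
journal_finish_write(struct journal_entry_sketch *entry, int res)
{
	entry->res = res;
	if (entry->on_complete != NULL)
		entry->on_complete(entry);
}

/* Example completion: count successful writes in the caller's counter. */
static void
count_confirm(struct journal_entry_sketch *entry)
{
	if (entry->res >= 0)
		++*(int *)entry->complete_data;
}
```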

Part-of #5129

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

fd145ed5

Aug 20, 2020

asan/lsan: cleanup suppression lists · a1021237

Alexander V. Tikhonov authored 4 years ago

Removed asan/lsan suppressions for issues that were not reproduced.
Removed skip condition files for tests that passed testing.

Part of #4360

a1021237

Aug 17, 2020

xrow: introduce struct synchro_request · ee07eab4

Vladislav Shpilevoy authored 4 years ago

All requests saved to WAL and transmitted through network have
their own request structure with parameters:
- struct request for DML;
- struct call_request for CALL/EVAL;
- struct auth_request for AUTH;
- struct ballot for VOTE;
- struct sql_request for SQL;
- struct greeting for greeting.

It is done for a reason - not to pass all the request parameters
into each function one by one, and manage them all at once
instead.

For synchronous requests IPROTO_CONFIRM and IPROTO_ROLLBACK it was
not done, because so far it was not too hard to carry just 2
parameters from their body: lsn and replica_id.

But it will be changed in #5129. Because in fact these requests
have more parameters, but they were filled by txn module, since
synchro requests were saved to WAL via transactions (due to lack
of alternative API to access WAL).

After #5129 it will be necessary to save LSN and replica_id of the
request author. This patch introduces struct synchro_request to
simplify extension of the synchro parameters.
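
The benefit of a dedicated request structure can be sketched as follows. The field set, the type codes and the validity check are hypothetical and are not taken from the patch; the point is only that functions accept one pointer, so adding a field later does not change their signatures.

```c
#include <stdint.h>

/* Hypothetical request type codes, for illustration only. */
enum synchro_type_sketch { SYNCHRO_CONFIRM, SYNCHRO_ROLLBACK };

/* One structure instead of passing each parameter separately. */
struct synchro_request_sketch {
	enum synchro_type_sketch type;
	uint32_t replica_id; /* replica the request refers to */
	int64_t lsn;         /* LSN the request refers to */
};

/* Hypothetical sanity check: takes the whole request at once. */
static int
synchro_request_is_valid(const struct synchro_request_sketch *req)
{
	return req->replica_id != 0 && req->lsn > 0;
}
```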

Closes #5151
Needed for #5129

ee07eab4