  1. Sep 11, 2020
    • test: flaky replication/wal_off.test.lua test · ad4d0564
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [035] --- replication/wal_off.result	Fri Jul  3 04:29:56 2020
        [035] +++ replication/wal_off.reject	Mon Sep  7 15:32:46 2020
        [035] @@ -47,6 +47,8 @@
        [035]  ...
        [035]  while box.info.replication[wal_off_id].upstream.message ~= check do fiber.sleep(0) end
        [035]  ---
        [035] +- error: '[string "while box.info.replication[wal_off_id].upstre..."]:1: attempt to
        [035] +    index field ''upstream'' (a nil value)'
        [035]  ...
        [035]  box.info.replication[wal_off_id].upstream ~= nil
        [035]  ---
      
      It happened because the replication upstream status check occurred
      too early, before the upstream state was set. To give the upstream
      a chance to reach the needed 'stopped' state, the test has to wait
      for it using the test_run:wait_upstream() routine.
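
      A minimal sketch of the fix pattern, assuming test-run's
      wait_upstream() takes the replica id plus a table of expected
      upstream fields (the message_re key is an assumption here):

        -- before: a busy loop that races with upstream initialization
        -- while box.info.replication[wal_off_id].upstream.message ~= check do fiber.sleep(0) end
        -- after: wait until the upstream reaches the expected state
        test_run:wait_upstream(wal_off_id, {status = 'stopped', message_re = check})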
      
      Closes #5278
    • test: flaky replication/status.test.lua status · a08b4f3a
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following three issues were found:
      
      line 174:
      
       [026] --- replication/status.result	Thu Jun 11 12:07:39 2020
       [026] +++ replication/status.reject	Sun Jun 14 03:20:21 2020
       [026] @@ -174,15 +174,17 @@
       [026]  ...
       [026]  replica.downstream.status == 'follow'
       [026]  ---
       [026] -- true
       [026] +- false
       [026]  ...
      
      It happened because the replication downstream status check occurred
      too early. To give the downstream a chance to reach the needed
      'follow' state, the test has to wait for it using the
      test_run:wait_downstream() routine.
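
      A sketch of this pattern, assuming wait_downstream() takes the
      replica id plus a table of expected downstream fields:

        -- before: a one-shot check racing with the relay
        -- replica.downstream.status == 'follow'
        -- after: wait until the downstream reaches 'follow'
        test_run:wait_downstream(replica_id, {status = 'follow'})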
      
      line 178:
      
      [024] --- replication/status.result	Mon Sep  7 00:22:52 2020
      [024] +++ replication/status.reject	Mon Sep  7 00:36:01 2020
      [024] @@ -178,11 +178,13 @@
      [024]  ...
      [024]  replica.downstream.vclock[master_id] == box.info.vclock[master_id]
      [024]  ---
      [024] -- true
      [024] +- error: '[string "return replica.downstream.vclock[master_id] =..."]:1: attempt to
      [024] +    index field ''vclock'' (a nil value)'
      [024]  ...
      [024]  replica.downstream.vclock[replica_id] == box.info.vclock[replica_id]
      [024]  ---
      [024] -- true
      [024] +- error: '[string "return replica.downstream.vclock[replica_id] ..."]:1: attempt to
      [024] +    index field ''vclock'' (a nil value)'
      [024]  ...
      [024]  --
      [024]  -- Replica
      
      It happened because the replication vclock field did not yet exist
      at the moment of the check. To fix the issue, the test has to wait
      until the vclock field becomes available, using the
      test_run:wait_cond() routine, and read the downstream replication
      data at the same moment, inside the condition.
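
      A sketch of the wait_cond() approach; the variable names follow the
      test excerpt above, and reading the downstream inside the condition
      keeps all the data from one and the same moment:

        test_run:wait_cond(function()
            local downstream = box.info.replication[replica_id].downstream
            return downstream ~= nil and downstream.vclock ~= nil
                and downstream.vclock[master_id] == box.info.vclock[master_id]
        end)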
      
      line 224:
      
      [014] --- replication/status.result	Fri Jul  3 04:29:56 2020
      [014] +++ replication/status.reject	Mon Sep  7 00:17:30 2020
      [014] @@ -224,7 +224,7 @@
      [014]  ...
      [014]  master.upstream.status == "follow"
      [014]  ---
      [014] -- true
      [014] +- false
      [014]  ...
      [014]  master.upstream.lag < 1
      [014]  ---
      
      It happened because the replication upstream status check occurred
      too early. To give the upstream a chance to reach the needed
      'follow' state, the test has to wait for it using the
      test_run:wait_upstream() routine.
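
      The same wait pattern, sketched for the upstream side of the check
      above:

        -- instead of immediately asserting master.upstream.status == "follow":
        test_run:wait_upstream(master_id, {status = 'follow'})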
      
      Removed the test from the test-run 'fragile' list to run it in
      parallel.
      
      Closes #5110
    • test: flaky replication/gh-4606-admin-creds test · 11ba3322
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [021] --- replication/gh-4606-admin-creds.result	Wed Apr 15 15:47:41 2020
        [021] +++ replication/gh-4606-admin-creds.reject	Sun Sep  6 20:23:09 2020
        [021] @@ -36,7 +36,42 @@
        [021]   | ...
        [021]  i.replication[i.id % 2 + 1].upstream.status == 'follow' or i
        [021]   | ---
        [021] - | - true
        [021] + | - version: 2.6.0-52-g71a24b9f2
        [021] + |   id: 2
        [021] + |   ro: false
        [021] + |   uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
        [021] + |   package: Tarantool
        [021] + |   cluster:
        [021] + |     uuid: f27dfdfe-2802-486a-bc47-abc83b9097cf
        [021] + |   listen: unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/replica_auth.socket-iproto
        [021] + |   replication_anon:
        [021] + |     count: 0
        [021] + |   replication:
        [021] + |     1:
        [021] + |       id: 1
        [021] + |       uuid: a07cad18-d27f-48c4-8d56-96b17026702e
        [021] + |       lsn: 3
        [021] + |       upstream:
        [021] + |         peer: admin@unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/master.socket-iproto
        [021] + |         lag: 0.0030207633972168
        [021] + |         status: disconnected
        [021] + |         idle: 0.44824500009418
        [021] + |         message: timed out
        [021] + |         system_message: Operation timed out
        [021] + |     2:
        [021] + |       id: 2
        [021] + |       uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
        [021] + |       lsn: 0
        [021] + |   signature: 3
        [021] + |   status: running
        [021] + |   vclock: {1: 3}
        [021] + |   uptime: 1
        [021] + |   lsn: 0
        [021] + |   sql: []
        [021] + |   gc: []
        [021] + |   vinyl: []
        [021] + |   memory: []
        [021] + |   pid: 40326
        [021]   | ...
        [021]  test_run:switch('default')
        [021]   | ---
      
      It happened because the replication upstream status check occurred
      too early, when the upstream was still in the 'disconnected' state.
      To give the upstream a chance to reach the needed 'follow' state,
      the test has to wait for it using the test_run:wait_upstream()
      routine.
      
      Closes #5233
    • test: flaky replication/gh-4402-info-errno.test.lua · 2b1f8f9b
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [004] --- replication/gh-4402-info-errno.result	Wed Jul 22 06:13:34 2020
        [004] +++ replication/gh-4402-info-errno.reject	Wed Jul 22 06:41:14 2020
        [004] @@ -32,7 +32,39 @@
        [004]   | ...
        [004]  d ~= nil and d.status == 'follow' or i
        [004]   | ---
        [004] - | - true
        [004] + | - version: 2.6.0-10-g8df49e4
        [004] + |   id: 1
        [004] + |   ro: false
        [004] + |   uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
        [004] + |   package: Tarantool
        [004] + |   cluster:
        [004] + |     uuid: 6ec7bcce-68e7-41a4-b84b-dc9236621579
        [004] + |   listen: unix/:(socket)
        [004] + |   replication_anon:
        [004] + |     count: 0
        [004] + |   replication:
        [004] + |     1:
        [004] + |       id: 1
        [004] + |       uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
        [004] + |       lsn: 52
        [004] + |     2:
        [004] + |       id: 2
        [004] + |       uuid: 8a989231-177a-4eb8-8030-c148bc752b0e
        [004] + |       lsn: 0
        [004] + |       downstream:
        [004] + |         status: stopped
        [004] + |         message: timed out
        [004] + |         system_message: Connection timed out
        [004] + |   signature: 52
        [004] + |   status: running
        [004] + |   vclock: {1: 52}
        [004] + |   uptime: 27
        [004] + |   lsn: 52
        [004] + |   sql: []
        [004] + |   gc: []
        [004] + |   vinyl: []
        [004] + |   memory: []
        [004] + |   pid: 99
        [004]   | ...
        [004]
        [004]  test_run:cmd('stop server replica')
      
      It happened because the replication downstream status check occurred
      too early, when the downstream was still in the 'stopped' state. To
      give the downstream a chance to reach the needed 'follow' state, the
      test has to wait for it using the test_run:wait_downstream() routine.
      
      Closes #5235
    • test: flaky replication/gh-4928-tx-boundaries test · 5410e592
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [089] --- replication/gh-4928-tx-boundaries.result	Wed Jul 29 04:08:29 2020
        [089] +++ replication/gh-4928-tx-boundaries.reject	Wed Jul 29 04:24:02 2020
        [089] @@ -94,7 +94,7 @@
        [089]   | ...
        [089]  box.info.replication[1].upstream.status
        [089]   | ---
        [089] - | - follow
        [089] + | - disconnected
        [089]   | ...
        [089]
        [089]  box.space.glob:select{}
      
      It happened because the replication upstream status check occurred
      too early, when the upstream was still in the 'disconnected' state.
      To give the upstream a chance to reach the needed 'follow' state,
      the test has to wait for it using the test_run:wait_upstream()
      routine.
      
      Closes #5234
  2. Sep 09, 2020
    • test: fix status at replication/gh-4424-misc* test · 5a9b79fa
      Alexander V. Tikhonov authored
      Fixed flaky status check:
      
        [016] @@ -73,11 +73,11 @@
        [016]  ...
        [016]  box.info.status
        [016]  ---
        [016] -- running
        [016] +- orphan
        [016]  ...
        [016]  box.info.ro
        [016]  ---
        [016] -- false
        [016] +- true
        [016]  ...
        [016]  box.cfg{                                                        \
        [016]      replication = {},                                           \
        [016]
      
      The test was changed to use a wait condition for the status check:
      the status has to change from 'orphan' to 'running'. On heavily
      loaded hosts this transition may take additional time, and the
      wait-condition routine fixes the race.
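
      A sketch of such a wait condition, using test-run's generic
      wait_cond() helper:

        test_run:wait_cond(function() return box.info.status == 'running' end)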
      
      Closes #5271
    • test: flaky replication/gh-3642-misc-* test · 2569ba54
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [036] --- replication/gh-3642-misc-no-socket-leak-on-replica-disconnect.result	Sun Sep  6 23:49:57 2020
        [036] +++ replication/gh-3642-misc-no-socket-leak-on-replica-disconnect.reject	Mon Sep  7 04:07:06 2020
        [036] @@ -63,7 +63,7 @@
        [036]  ...
        [036]  box.info.replication[1].upstream.status
        [036]  ---
        [036] -- follow
        [036] +- disconnected
        [036]  ...
        [036]  test_run:cmd('switch default')
        [036]  ---
      
      It happened because the replication upstream status check occurred
      too early, when the upstream was still in the 'disconnected' state.
      To give the upstream a chance to reach the needed 'follow' state,
      the test has to wait for it using the test_run:wait_upstream()
      routine.
      
      Closes #5276
    • test: remove asan suppression for unit/msgpack · 35f99e66
      Alexander V. Tikhonov authored
      ASAN showed an issue in the msgpuck repository, in file
      test/msgpuck.c, which was the cause of the unit/msgpack test
      failure. The issue was fixed in the msgpuck repository, so the ASAN
      suppression for it was removed. Also removed the skip-condition
      file, which blocked the test when it failed.
      
      Part of #4360
    • lsan: app-tap/http_client.test.lua suppressions · 8d616ade
      Alexander V. Tikhonov authored
      Removed LSAN suppressions that were no longer reproducible.
      
      Part of #4360
  3. Sep 08, 2020
    • msgpack: print mp_ext type as signed integer · 2a01ce91
      Ilya Kosarev authored
      MsgPack extension types allow applications to define
      application-specific types. They consist of an 8-bit signed integer
      and a byte array, where the integer identifies the kind of type and
      the byte array carries the data. Types from 0 to 127 are
      application-specific types, and types from -128 to -1 are reserved
      for predefined types. However, extension types were printed as
      unsigned integers. Now this is fixed and extension types are
      printed correctly, as signed integers. Also the typo in the word
      "Unsupported" was fixed. A corresponding test case is introduced.
      
      Closes #5016
    • rtree: add comments on ignored rtree_search() return value · 4883f19b
      Ilya Kosarev authored
      rtree_search() has a return value, which is ignored in some cases.
      Although this is totally fine, it is reasonable to comment those
      cases, since such usage might otherwise look questionable.
      
      Closes #2052
    • Divide replication/misc.test.lua · 867e6b3d
      Alexander V. Tikhonov authored
      To fix the flaky issues of replication/misc.test.lua, the test was
      divided into smaller tests, making it possible to localize the
      flaky results:
      
        gh-2991-misc-asserts-on-update.test.lua
        gh-3111-misc-rebootstrap-from-ro-master.test.lua
        gh-3160-misc-heartbeats-on-master-changes.test.lua
        gh-3247-misc-iproto-sequence-value-not-replicated.test.lua
        gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
        gh-3606-misc-crash-on-box-concurrent-update.test.lua
        gh-3610-misc-assert-connecting-master-twice.test.lua
        gh-3637-misc-error-on-replica-auth-fail.test.lua
        gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
        gh-3704-misc-replica-checks-cluster-id.test.lua
        gh-3711-misc-no-restart-on-same-configuration.test.lua
        gh-3760-misc-return-on-quorum-0.test.lua
        gh-4399-misc-no-failure-on-error-reading-wal.test.lua
        gh-4424-misc-orphan-on-reconfiguration-error.test.lua
      
      Needed for #4940
    • msgpuck: bump a new version · 77e03451
      Kirill Yukhin authored
      - test: correct buffer size to fix ASAN error
    • lua: return back import of table.clear() method · 09aa8135
      Sergey Bronnikov authored
      The import of the `table.clear` module was removed in commit
      3af79e70 ('Fix luacheck warnings in src/lua/') to fix a luacheck
      warning about an unused variable, which made the `table.clear()`
      method unavailable in Tarantool. This commit brings the import
      back, since some applications depend on it (the bug was found with
      a Cartridge application), and adds a regression test for
      table.clear(). Note: `table.clear` is not available until an
      explicit `require('table.clear')` call.
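
      A minimal usage sketch of the restored method:

        require('table.clear') -- the explicit require is mandatory
        local t = {1, 2, 3, key = 'value'}
        table.clear(t) -- empties t in place, keeping the same table object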
      
      Closes #5210
  4. Aug 31, 2020
    • update_repo: correct fix for missing metadata RPMs · 71a24b9f
      Alexander V. Tikhonov authored
      When the update_repo tool is run with the option to delete some
      RPMs, all files matching the given pattern must be removed. The
      loop that checks the metadata deletes files, but only those
      mentioned in the metadata. However, a broken update may leave
      orphan files: present in the storage, but not mentioned in the
      metadata.
    • test: concurrent tuple update segfault on bitset index iteration · c5d7e139
      Ilya Kosarev authored
      Concurrent tuple update could segfault on BITSET_ALL_NOT_SET
      iterator usage. It was fixed in 850054b2. This patch introduces the
      corresponding test.
      
      Closes #1088
    • gitlab-ci: add openSUSE packages build jobs · d07e5f96
      Alexander V. Tikhonov authored
      Implemented openSUSE package builds with testing for the images:
      opensuse-leap:15.[0-2]

      Added %{sle_version} checks in the Tarantool spec file according to
      https://en.opensuse.org/openSUSE:Packaging_for_Leap#RPM_Distro_Version_Macros

      Added opensuse-leap versions 15.1 and 15.2 to the GitLab CI package
      building/deploying jobs.
      
      Closes #4562
    • vinyl: fix check vinyl_dir existence at bootstrap · 9600b895
      Alexander V. Tikhonov authored
      
      During implementation of the openSUSE build with testing, the
      box-tap/cfg.test.lua test failed. It turned out that when memtx_dir
      did not exist while vinyl_dir did, and errno was set to ENOENT, box
      configuration succeeded, although it should not have. The reason
      for this wrong behavior was that not all of the failure paths in
      xdir_scan() set errno, while the caller assumed they did.
      
      Debugging the issue showed that the errno check performed after
      xdir_scan() returned a negative value was incorrect. xdir_scan()
      is not a system call, and a negative return value from it does not
      mean that errno was set. When errno was left over from commands
      executed before xdir_scan(), and xdir_scan() returned a negative
      value on its own, the check produced a wrong result.
      
      The intent of the failed check was to catch ENOENT, which
      xdir_scan() sets to handle the situation when vinyl_dir does not
      exist. It failed because, when checking ENOENT outside of
      xdir_scan(), we had to be sure that ENOENT indeed came from the
      xdir_scan() call and not from some function executed earlier. A
      possible fix was to reset errno before the xdir_scan() call, since
      errno could have been set by any function called before it.
      
      As mentioned above, xdir_scan() is not a system call: it can be
      changed in any way and may return any value without setting errno,
      so any errno check outside of this function is fragile.
      
      To avoid that, we must not check errno after the call at all. The
      better solution is to pass a flag to xdir_scan() telling whether
      the directory must exist. So the errno check was removed, and a
      vinyl_dir existence check based on this flag was added instead.
      
      Closes #4594
      Needed for #4562
      
      Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
  5. Aug 25, 2020
    • tuple: drop extra restrictions for multikey index · bfeb61b3
      Ilya Kosarev authored
      The multikey index did not work properly with a nullable root
      field in tuple_raw_multikey_count(). Now it is fixed and the
      corresponding restrictions are dropped. This also means that we can
      drop the implicit nullability update for array/map fields and make
      all fields nullable by default, as it was until e1d3fe8a
      (tuple format: don't allow null where array/map is expected), since
      default non-nullability doesn't solve any real problems while
      providing confusing behavior (gh-5027).
      
      Follow-up #5027
      Closes #5192
  6. Aug 24, 2020
    • box: introduce space:alter() · 8c965989
      Vladislav Shpilevoy authored
      There was no way to change certain space parameters without
      recreating the space or manually updating the internal system space
      _space, even though some of them are legal to update: field_count,
      owner, the flag of being temporary, the is_sync flag.

      The patch introduces the function space:alter(), which accepts the
      mutable subset of the parameters of box.schema.space.create(), plus
      a 'name' parameter. There is already a method space:rename(), but
      the parameter is added to space:alter() too, to be consistent with
      index:alter(), which also accepts a new name.
      
      Closes #5155
      
      @TarantoolBot document
      Title: New function space:alter(options)
      
      Space objects in Lua (stored in `box.space` table) now have a new
      method: `space:alter(options)`.
      
      The method accepts a table with parameters `field_count`, `user`,
      `format`, `temporary`, `is_sync`, and `name`. All parameters have
      the same meaning as in `box.schema.space.create(name, options)`.
      
      Note that the `name` parameter in `box.schema.space.create` is
      separate from the `options` table. It is not so in
      `space:alter(options)` - here all parameters, including the name,
      are specified in the `options` table.

      The function does not return anything in case of success, and
      throws an error on failure.
      
      On the 'Synchronous replication' page, in 'Limitations and known
      problems', the note about there being no way to enable synchronous
      replication for existing spaces should be deleted. Instead it
      should say that synchronous replication can be enabled using
      `space:alter({is_sync = true})` and disabled by setting
      `is_sync = false`.
      https://www.tarantool.io/en/doc/2.5/book/replication/repl_sync/#limitations-and-known-problems
      
      The function will appear in >= 2.5.2.
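
      A usage sketch based on the description above (the space name and
      option values are illustrative):

        local s = box.schema.space.create('data', {temporary = true})
        -- enable synchronous replication on the existing space
        s:alter({is_sync = true})
        -- several mutable parameters may be changed at once
        s:alter({name = 'data_sync', temporary = false})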
    • xrow: drop xrow_header_dup_body · 9dd2e2e4
      Cyrill Gorcunov authored
      
      We no longer use it.
      
      Closes #5129
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • txn: txn_add_redo -- drop synchro processing · 1d7e256b
      Cyrill Gorcunov authored
      
      Since we no longer use the txn engine for synchro packet
      processing, this code is never executed.
      
      Part-of #5129
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • applier: process synchro requests without txn engine · cfccfd44
      Cyrill Gorcunov authored
      
      Transaction processing code is very heavy, simply because
      transactions carry various data and involve a number of other
      mechanisms to proceed.

      In turn, when we receive a confirm or rollback packet from another
      node in the cluster, we just need to inspect the limbo queue and
      write this packet into the WAL journal. So calling a bunch of txn
      engine helpers is simply a waste of cycles.
      
      Thus let's rather handle them in a special lightweight way:

       - allocate a synchro_entry structure which carries
         the journal entry itself and the encoded message
       - process the limbo queue to mark confirmed/rolled-back
         messages
       - finally, write this synchro_entry into the journal

      Which is much simpler.
      
      Part-of #5129
      
      Suggested-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • qsync: direct write of CONFIRM/ROLLBACK into a journal · 41b31ff0
      Cyrill Gorcunov authored
      
      When we need to write a CONFIRM or ROLLBACK message (which is a
      binary record in msgpack format) into the journal, we use txn code
      to allocate a new transaction, encode the message there, and make
      it walk the long txn path before it hits the journal. This is not
      only a waste of resources but also somewhat strange from an
      architectural point of view.

      Instead, let's encode the record on the stack and write it to the
      journal directly.
      
      Part-of #5129
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • qsync: provide a binary form of synchro entries · 7e1ce153
      Cyrill Gorcunov authored
      
      These msgpack entries will be needed to write them down to the
      journal without involving the txn engine. At the same time, we
      would like to be able to allocate them on the stack; for this sake,
      the binary form is predefined.
      
      Part-of #5129
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • journal: add journal_entry_create helper · 580abaee
      Cyrill Gorcunov authored
      
      Add a helper to create raw journal entries. We will use it to
      write confirm/rollback entries.
      
      Part-of #5129
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • journal: bind asynchronous write completion to an entry · fd145ed5
      Cyrill Gorcunov authored
      
      In commit 77ba0e35 we redesigned WAL journal operations so that
      asynchronous write completion is a single instance per journal.

      It turned out that such a simplification is too tight and doesn't
      allow us to pass entries into the journal with custom completions.

      Thus let's bring that ability back. We will need it to be able to
      write "confirm" records into the WAL directly, without touching
      the transaction code at all.
      
      Part-of #5129
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
  7. Aug 20, 2020
  8. Aug 17, 2020
    • xrow: introduce struct synchro_request · ee07eab4
      Vladislav Shpilevoy authored
      All requests saved to WAL and transmitted through network have
      their own request structure with parameters:
      - struct request for DML;
      - struct call_request for CALL/EVAL;
      - struct auth_request for AUTH;
      - struct ballot for VOTE;
      - struct sql_request for SQL;
      - struct greeting for greeting.
      
      This is done for a reason: to avoid passing all the request
      parameters into each function one by one, and to manage them all
      at once instead.

      For the synchronous requests IPROTO_CONFIRM and IPROTO_ROLLBACK
      this was not done, because so far it was not too hard to carry just
      two parameters from their body: lsn and replica_id.

      But this will change in #5129, because in fact these requests have
      more parameters, which were filled in by the txn module, since
      synchro requests were saved to the WAL via transactions (due to
      the lack of an alternative API to access the WAL).
      
      After #5129 it will be necessary to save LSN and replica_id of the
      request author. This patch introduces struct synchro_request to
      simplify extension of the synchro parameters.
      
      Closes #5151
      Needed for #5129
  9. Aug 15, 2020
    • applier: drop a couple of unnecessary arguments · 7e16e45e
      Vladislav Shpilevoy authored
      Applier on_rollback and on_wal_write don't need any arguments -
      they either work with a global state, or with the signaled applier
      stored inside the trigger.

      However, a transaction object was passed into on_wal_write() and
      on_rollback(), unused.

      Even if it were used, it would have to be fixed, because soon these
      triggers will be fired not only for traditional 'txn' transactions.
      They will also be used by the synchro request WAL writes - and
      those don't have 'transactions'.
      
      Part of #5129
  10. Aug 13, 2020
    • Ensure all curl symbols are exported · 29ec6289
      Yaroslav Dynnikov authored
      In the recent update of libcurl (2.5.0-278-g807c7fa58) its layout has
      changed: private function `Curl_version_init()` which used to fill-in
      info structure was eliminated. As a result, no symbols for
      `libcurl_la-version.o` remained used, so it wasn't included in tarantool
      binary. And `curl_version` and `curl_version_info` symbols went missing.
      
      According to libcurl naming conventions all exported symbols are named
      as `curl_*`. This patch lists them all explicitly in `exprots.h` and
      adds the test.
      
      Close #5223
  11. Aug 12, 2020
    • tuple: fix access by JSON path starting from '[*]' · 718267aa
      Vladislav Shpilevoy authored
      Tuple JSON field access crashed when '[*]' was used as the first
      part of the JSON path. The patch makes it treated like 'field not
      found'.
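
      A hypothetical illustration of the fixed behavior (the space layout
      and field name are assumptions):

        local s = box.schema.space.create('test')
        s:create_index('pk')
        local t = s:replace{1, {a = 1}}
        t['[*].a'] -- now returns nil ('field not found') instead of crashing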
      
      Follow-up #5224
    • tuple: fix multikey field JSON access crash · 5c15df68
      Vladislav Shpilevoy authored
      When a tuple had format with multikey indexes in it, any attempt
      to get a multikey indexed field by a JSON path from Lua led to a
      crash.
      
      That was because of incorrect interpretation of offset slot value
      in tuple's field map.
      
      Tuple field map is an array stored before the tuple's MessagePack
      data. Each element is a 4 byte offset to an indexed value to be
      able to get it for O(1) time without MessagePack decoding of all
      the previous fields.
      
      At least it was so before multikeys. Now tuple field map is not
      just an array. It is rather a 2-level array, somehow similar to
      ext4 FS. Some elements of the root array are positive numbers
      pointing at data. Some elements point at a second 'indirect'
      array, so called 'extra', size of which is individual for each
      tuple. These second arrays are used by multikey indexes to store
      offsets to each multikey indexed value in a tuple.
      
      It means that if there is an offset slot, it can't just be used as
      is. That is allowed only if the field is not multikey. Otherwise it
      is necessary to somehow get an index in the second 'indirect'
      array.
      
      This is what was happening - a multikey field was found, its
      offset slot was valid, but it was pointing at an 'indirect' array,
      not at the data. JSON tuple field access tried to use it as a data
      offset.
      
      The patch makes JSON field access degrade to fullscan when a field
      is multikey, but no multikey array index is provided.
      
      Closes #5224
  12. Aug 11, 2020
    • box: snapshot should not include rolled back data · 6f70020d
      Vladislav Shpilevoy authored
      Box.snapshot() could include rolled back data in case synchronous
      transaction ROLLBACK arrived during WAL rotation in preparation of
      a checkpoint.
      
      More specifically, snapshot consists of fixating the engines'
      content (creation of a read-view), doing WAL rotation, and writing
      the snapshot itself. All data changes after content fixation won't
      go into the snap. So if ROLLBACK arrives during WAL rotation, the
      fixated content will have rolled back data, not present in the
      newest dataset.
      
      The patch makes it fail if during WAL rotation anything was rolled
      back. The bug sometimes appeared in an existing test about qsync
      snapshots, but with a very poor reproducibility. In a new test
      file it is reproduced 100% without the patch.
      
      Closes #5167
  13. Jul 31, 2020
    • txn_limbo: handle duplicate ACKs · 6e11674d
      Vladislav Shpilevoy authored
      A replica can send the same ACK multiple times. This is relatively
      easy to achieve.

      An ACK is a message from the replica containing its vclock, sent on
      each replica vclock update. An update does not necessarily mean
      that the master's LSN has changed: the replica could write
      something locally, with its own instance_id. The vclock changes and
      is sent to the master, but from the limbo's point of view it looks
      like a duplicated ACK, because the received master LSN didn't
      change.
      
      The patch makes limbo ignore duplicated ACKs.
      
      Closes #5195
      Part of #5219
    • txn_limbo: handle CONFIRM during ROLLBACK · e7559bfe
      Vladislav Shpilevoy authored
      The limbo could try to CONFIRM an LSN whose ROLLBACK is in
      progress. This is how it could happen:
      
      - A synchronous transaction is created, written to WAL;
      - The fiber sleeps in the limbo waiting for CONFIRM or timeout;
      - Timeout happens. ROLLBACK for this and all next LSNs is sent to
        WAL;
      - Replica receives the transaction, sends ACK;
      - Master receives ACK, starts writing CONFIRM for the LSN, whose
        ROLLBACK is in progress right now.
      
      Another case - attempt to lower synchro quorum during ROLLBACK
      write. It also could try to write CONFIRM.
      
      The patch skips CONFIRM if there is a ROLLBACK in progress. Not
      even necessary to check LSNs. Because ROLLBACK always reverts the
      entire limbo queue, so it will cancel all pending transactions
      with all LSNs, and new commits are rolled back even before they
      try to go to WAL. CONFIRM can't help here with anything already.
      
      Part of #5185
  14. Jul 30, 2020
    • txn_limbo: handle ROLLBACK during CONFIRM · 849ba7dc
      Vladislav Shpilevoy authored
      The limbo could try to ROLLBACK an LSN whose CONFIRM is in
      progress. This is how it could happen:
      
      - A synchronous transaction is created, written to WAL;
      - The fiber sleeps in the limbo waiting for CONFIRM or timeout;
      - Replica receives the transaction, sends ACK;
      - Master receives ACK, starts writing CONFIRM;
      - The first fiber times out and tries to write ROLLBACK for the
        LSN, whose CONFIRM is in progress right now.
      
      The patch adds more checks to the 'timed out' code path to see if
      it isn't too late to write ROLLBACK. If CONFIRM is in progress,
      the fiber will wait for its finish.
      
      Part of #5185
    • txn_limbo: panic when synchro WAL write fails · 61385877
      Vladislav Shpilevoy authored
      CONFIRM and ROLLBACK go to the WAL. Their WAL write can fail just
      like any other WAL write. However, it is not clear what to do in
      that case, especially when the ROLLBACK write fails.

      The patch adds a panic() stub so as to at least terminate the
      instance. Before the patch it would proceed as if nothing had
      happened, with undefined behaviour.
      
      Closes #5159
  15. Jul 29, 2020
    • txn_limbo: reduce fiber_set_cancellable() calls · a6ab3771
      Vladislav Shpilevoy authored
      The calls were added before and after each cond_wait() so that the
      fiber couldn't be woken up externally, for example from Lua.

      But it is not necessary to flip the flag on each wait() call. It is
      enough to do it twice: forbid cancellation at the beginning of
      txn_limbo_wait_complete(), and restore the old value at the end.
    • txn_limbo: don't duplicate confirmations in WAL · 4920782e
      Vladislav Shpilevoy authored
      When an ACK was received for an already confirmed transaction
      whose CONFIRM WAL write is in progress, it produced a second
      CONFIRM in WAL with the same LSN.
      
      That was unnecessary work, taking time and disk space for WAL
      records. Although it didn't lead to any bugs, it was very
      inefficient.

      This patch makes the confirmation LSN grow monotonically. When more
      ACKs are received for an already confirmed LSN, its confirmation is
      not written a second time.
      
      Closes #5144