  1. Sep 18, 2020
    • extra: add Terraform config files · 0b59bc93
      Sergey Bronnikov authored
      For testing Tarantool with Jepsen we use virtual machines, as they provide
      better resource isolation than containers. Jepsen tests may need a single
      instance or a set of instances for testing a cluster. To set up virtual
      machines we use Terraform [1]. This patch adds a set of configuration files
      for Terraform that create the required number of virtual machines in MCS and
      output their IP addresses to stdout.
      
      Terraform needs some parameters before it runs. They are:
      
      - id, identifier of a test stand that should be specific to this run; id
      is also a part of the virtual machine name
      - keypair_name, name of the key pair used in the cloud; the public SSH key of
      that key pair will be placed on the virtual machine
      - instance_count, number of virtual machines in a test stand
      - ssh_key, SSH private key used to access a virtual machine
      - user_name
      - password
      - tenant_id
      - user_domain_id
      
      These parameters can be passed via environment variables with the TF_VAR_ prefix
      (like TF_VAR_id) or via command-line parameters.
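
As a minimal sketch of the environment-variable route, the snippet below builds the TF_VAR_-prefixed environment for a Terraform run. The helper name `terraform_env` and the sample values are hypothetical; only the parameter names and the TF_VAR_ prefix come from the description above.

```python
import os


def terraform_env(params, base_env=None):
    """Return a copy of base_env with each parameter exported as TF_VAR_<name>."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env["TF_VAR_" + name] = str(value)
    return env


# Hypothetical values, for illustration only.
env = terraform_env({"id": "jepsen-42", "instance_count": 3}, base_env={})
print(env)
```

The resulting mapping could be passed as the `env` argument of `subprocess.run` when invoking the Terraform commands.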
      
      To walk through the full lifecycle of a test stand with Terraform, run
      these commands:
      
      terraform init extra/tf
      terraform apply extra/tf
      terraform output instance_names
      terraform output instance_ips
      terraform destroy extra/tf
      
      1. https://www.terraform.io/
      
      Part of #5277
    • lua/pwd: workaround the systemd bug · ab3ff23f
      Cyrill Gorcunov authored
      
      There is a bug in the systemd-209 source code: it returns
      ENOENT when no more entries are left in a password database.
      
      The issue was fixed later, but we still meet systems where
      it hits. The problem affects getpwent/getgrent calls only,
      so we can expect them to return the buggy error code and
      skip it.
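
The idea of skipping the buggy error code can be modelled roughly as follows. This is a Python simulation, not the actual Tarantool code: `read_all_entries` and the `next_entry` callable are hypothetical stand-ins for a loop around getpwent/getgrent that also reports the errno value it saw.

```python
import errno
import os


def read_all_entries(next_entry):
    """Collect entries until the database is exhausted.

    next_entry() is a hypothetical callable returning (entry, err):
    entry is None when nothing was read, err is the errno value seen.
    """
    entries = []
    while True:
        entry, err = next_entry()
        if entry is not None:
            entries.append(entry)
            continue
        # err == 0 is the normal end of the database; ENOENT is what the
        # buggy systemd versions report in the same situation, so skip it.
        if err in (0, errno.ENOENT):
            return entries
        raise OSError(err, os.strerror(err))


# Simulated database that ends with the buggy ENOENT code.
fake_db = iter([("root", 0), ("daemon", 0), (None, errno.ENOENT)])
print(read_all_entries(lambda: next(fake_db)))
```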
      
      Notes:
      
      1) See systemd's commit where the issue was fixed
      
         | commit 06202b9e659e5cc72aeecc5200155b7c012fccbc
         | Author: Yu Watanabe <watanabe.yu+github@gmail.com>
         | Date:   Sun Jul 15 23:00:00 2018 +0900
         |
         |     nss: do not modify errno when NSS_STATUS_NOTFOUND or NSS_STATUS_SUCCESS
      
      2) Another option is to call getpwall on Tarantool startup
         unconditionally, where we could simply ignore any errors. This
         is a very bad choice since traversing a password database might
         introduce significant lags if the backend does some network activity
         or has expired caches. Thus drop the unconditional getpwall() call
         and run it only if a user makes an explicit request.
      
      Fixes #5034
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • lua/errno: shrink memory usage on error declaration · 8603da36
      Cyrill Gorcunov authored
      
      There is no need to allocate 32 bytes for each string:
      the Lua backend copies the string internally, so a plain
      pointer is enough here and no redundant memory needs to
      be allocated.
      
      Part-of #5034
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • lua/errno: use lengthof helper · 0f062df1
      Cyrill Gorcunov authored
      
      No need for a terminating empty entry.
      
      Part-of #5034
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
  2. Sep 17, 2020
    • replication: do not register outgoing connections · 44421317
      Vladislav Shpilevoy authored
      Replication protocol's first stage for non-anonymous replicas is
      that the replica should be registered in _cluster to get a unique
      ID number.
      
      That happens when a replica connects to a writable node, which
      performs the registration. So registration always happens on
      the master node when an *incoming* request for it appears,
      explicitly asking for a registration. Only relay can do that.
      
      That wasn't the case for bootstrap. If box.cfg.replication wasn't
      empty on the master node doing the cluster bootstrap, it
      registered all the outgoing connections in _cluster. Note, the
      target node could be even anonymous, but still was registered.
      
      That breaks the protocol, and leads to registration of anon
      replicas sometimes. The patch drops it.
      
      Another motivation here is Raft cluster bootstrap specifics.
      During Raft bootstrap it is going to be very important that
      non-joined replicas should not be registered in _cluster. A
      replica can only register after its JOIN request was accepted, and
      its snapshot download has started.
      
      Closes #5287
      Needed for #1146
    • replication: add is_anon flag to ballot · 0fd72560
      Vladislav Shpilevoy authored
      Ballot is a message sent in response to a vote request, which is
      sent by applier first thing after connection establishment.
      
      It contains basic info about the remote instance such as whether
      it is read only, if it is still loading, and more.
      
      The ballot didn't contain a flag whether the instance is
      anonymous. That led to a problem, when applier was connected to a
      remote instance, was added to struct replicaset inside a struct
      replica object, but it was unknown whether it is anonymous. It was
      added as not anonymous by default.
      
      If the remote instance was in fact anonymous and sent a subscribe
      response back to the first instance with the anon flag = true,
      then it looked like the remote instance was not anonymous, and
      suddenly became such, without even a reconnect. It could lead to
      an assertion.
      
      The bug is hidden behind another bug, because of which the leader
      instance on bootstrap registers all replicas listed in its
      box.cfg.replication, even anonymous ones.
      
      The patch makes the ballot contain the anon flag. Now both relay
      and applier send whether their host is anonymous. Relay does it by
      sending the ballot, applier sends it in scope of subscribe
      request. By the time a replica gets UUID and is added into struct
      replicaset, its anon flag is determined.
      
      Also the patch makes anon_count updated on each replica hash table
      change. Previously it was only updated when something related to
      relay was done. Now anon is updated by applier actions too, and
      it is not ok to update the counter on relay-specific actions.
      
      The early registration bug is a subject for a next patch.
      
      Part of #5287
      
      @TarantoolBot document
      Title: IPROTO_BALLOT_IS_ANON flag
      
      There is a request type IPROTO_BALLOT, with code 0x29. It has
      fields IPROTO_BALLOT_IS_RO (0x01), IPROTO_BALLOT_VCLOCK (0x02),
      IPROTO_BALLOT_GC_VCLOCK (0x03), IPROTO_BALLOT_IS_LOADING (0x04).
      
      Now it gets a new field IPROTO_BALLOT_IS_ANON (0x05). The field
      is a boolean, and is equal to box.cfg.replication_anon of the
      sender.
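
The field codes listed above can be written down as constants. The sketch below is a Python illustration only: the helper `ballot_is_anon`, the representation of a decoded ballot as a dict keyed by field code, and the choice to treat a missing field as False are assumptions of this example, not part of the protocol description.

```python
# Request type and field codes as listed in the description above.
IPROTO_BALLOT = 0x29

IPROTO_BALLOT_IS_RO = 0x01
IPROTO_BALLOT_VCLOCK = 0x02
IPROTO_BALLOT_GC_VCLOCK = 0x03
IPROTO_BALLOT_IS_LOADING = 0x04
IPROTO_BALLOT_IS_ANON = 0x05


def ballot_is_anon(ballot):
    """Read the anon flag from a decoded ballot body (dict keyed by field code).

    Treating a missing field as False is an assumption of this sketch.
    """
    return bool(ballot.get(IPROTO_BALLOT_IS_ANON, False))


# Hypothetical decoded ballot of an anonymous, read-only instance.
ballot = {IPROTO_BALLOT_IS_RO: True, IPROTO_BALLOT_IS_ANON: True}
print(ballot_is_anon(ballot))
```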
    • replication: retry in case of XlogGapError · f1a507b0
      Vladislav Shpilevoy authored
      Previously XlogGapError was considered a critical error stopping
      the replication. That may be not so good as it looks.
      
      XlogGapError is a perfectly fine error, which should not kill the
      replication connection. It should be retried instead.
      
      Here is an example where the gap can be recovered on its
      own. Consider the case: node1 is a leader, it is booted with
      vclock {1: 3}. Node2 connects and fetches snapshot of node1, it
      also gets vclock {1: 3}. Then node1 writes something and its
      vclock becomes {1: 4}. Now node3 boots from node1, and gets the
      same vclock. Vclocks now look like this:
      
        - node1: {1: 4}, leader, has {1: 3} snap.
        - node2: {1: 3}, booted from node1, has only snap.
        - node3: {1: 4}, booted from node1, has only snap.
      
      If the cluster is a fullmesh, node2 will send subscribe requests
      with vclock {1: 3}. If node3 receives it, it will respond with
      xlog gap error, because it only has a snap with {1: 4}, nothing
      else. In that case node2 should retry connecting to node3, and in
      the meantime try to get newer changes from node1.
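
The three-node example can be checked with a simplified model. The function `has_xlog_gap` below is a hypothetical Python helper, not Tarantool's implementation: it assumes a node can serve a subscribe request only if the oldest data it still keeps is not newer than the requested vclock.

```python
def has_xlog_gap(requested, oldest_available):
    """Return True if rows right after `requested` are no longer available.

    Both arguments map replica id -> LSN. `oldest_available` is the vclock
    of the oldest snapshot or xlog the serving node still keeps.
    """
    return any(
        oldest_lsn > requested.get(replica_id, 0)
        for replica_id, oldest_lsn in oldest_available.items()
    )


node2_request = {1: 3}
# node3 has only a snapshot taken at {1: 4}.
print(has_xlog_gap(node2_request, {1: 4}))
# node1 still has its {1: 3} snapshot and the row written after it.
print(has_xlog_gap(node2_request, {1: 3}))
```

In this model node3 reports a gap while node1 does not, which matches the case where node2 retries node3 and meanwhile fetches newer changes from node1.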
      
      The example is totally valid. However it is unreachable now
      because master registers all replicas in _cluster before allowing
      them to make a join. So they all bootstrap from a snapshot
      containing all their IDs. This is a bug, because such
      auto-registration leads to registration of anonymous replicas, if
      they are present during bootstrap. Also it blocks Raft, which
      can't work if there are registered, but not yet joined nodes.
      
      Once the registration problem is solved in the next commit,
      XlogGapError would strike quite often during bootstrap. This patch
      won't let that happen.
      
      Needed for #5287
    • xlog: introduce an error code for XlogGapError · fc8e2297
      Vladislav Shpilevoy authored
      XlogGapError object didn't have a code in ClientError code space.
      Because of that it was not possible to handle the gap error
      together with client errors in some switch-case statement.
      
      Now the gap error has a code.
      
      This is going to be used in applier code to handle XlogGapError
      among other errors using its code instead of RTTI.
      
      Needed for #5287
  3. Sep 15, 2020
    • gitlab-ci: save sources to new S3 location · c1f72aeb
      Alexander V. Tikhonov authored
      Changed the S3 location for source tarballs. Also added the ability to
      create the S3 directory for the tarballs if it does not exist.
    • gitlab-ci: fix deployment of tagged commits · 5aa1a1df
      Alexander V. Tikhonov authored
      Found that the deployment gitlab-ci jobs were not run for tagged commits.
      To fix it, added the 'tags' label to the deployment and performance jobs.
      Also found that once a commit is tagged, it gets a tag label in the format
      'x^0', and all previous commits down to the previous tag get labels in the
      format 'x~<commits before>', like 'x~1' or 'x~2', etc. So the check
      
        if git name-rev --name-only --tags --no-undefined HEAD ; then
      
      always passed, and previous commits could begin to deploy on rerun.
      To fix it, the gitlab-ci environment variable 'CI_COMMIT_TAG' is used; it
      shows whether the current commit really has a tag and has to be deployed.
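
The decision based on 'CI_COMMIT_TAG' can be sketched as below. This is a Python illustration with a hypothetical helper `should_deploy`; the actual CI scripts are shell, and the tag value in the example is made up.

```python
def should_deploy(environ):
    """Deploy only when GitLab CI reports a tag for the current commit."""
    return bool(environ.get("CI_COMMIT_TAG"))


# A tagged pipeline has CI_COMMIT_TAG set; a rerun of an older,
# untagged commit does not, so it is not deployed.
print(should_deploy({"CI_COMMIT_TAG": "2.6.0"}))
print(should_deploy({}))
```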
      
      Part of #3745
    • test: flaky replication/gh-3704-misc-* · db3dd8dd
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [037] --- replication/gh-3704-misc-replica-checks-cluster-id.result	Thu Sep 10 18:05:22 2020
        [037] +++ replication/gh-3704-misc-replica-checks-cluster-id.reject	Fri Sep 11 11:09:38 2020
        [037] @@ -25,7 +25,7 @@
        [037]  ...
        [037]  box.info.replication[2].downstream.status
        [037]  ---
        [037] -- follow
        [037] +- stopped
        [037]  ...
        [037]  -- change master's cluster uuid and check that replica doesn't connect.
        [037]  test_run:cmd("stop server replica")
      
      It happened because the replication downstream status check occurred too
      early, when the status was still 'stopped'. To give the replication
      status check a chance to reach the needed 'follow' state, the test
      needs to wait for it using the test_run:wait_downstream() routine.
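
The waiting approach boils down to polling a condition until it holds or a timeout expires. The sketch below is a generic Python model of that idea; the name `wait_cond`, its parameters, and the simulated status sequence are assumptions of this example and not the test-run implementation.

```python
import time


def wait_cond(cond, timeout=5.0, delay=0.01):
    """Poll cond() until it returns a true value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay)


# Simulated downstream status that needs a few polls to reach 'follow'.
statuses = iter(["stopped", "stopped", "follow"])
print(wait_cond(lambda: next(statuses) == "follow"))
```

A single immediate check would have seen 'stopped' here, which is the failure shown in the diff above.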
      
      Closes #5293
    • build: refactor static build process · 800e5ed6
      HustonMmmavr authored
      
      Refactored the static build process to use static-build/CMakeLists.txt
      instead of Dockerfile.staticbuild (this makes it possible to support
      static build on macOS). The following third-party dependencies for the
      static build are installed via cmake `ExternalProject_Add`:
        - OpenSSL
        - Zlib
        - Ncurses
        - Readline
        - Unwind
        - ICU
      
      * Added static build support for macOS
      * Fixed `CONFIGURE_COMMAND` while building bundled libcurl for static
        build at file cmake/BuildLibCURL.cmake:
          - disable building shared libcurl libraries (by setting
            `--disable-shared` option)
          - disable hiding libcurl symbols (by setting
            `--disable-symbol-hiding` option)
          - prevent linking libcurl with system libz (by setting
            `--with-zlib=${FOUND_ZLIB_ROOT_DIR}` option)
      * Removed Dockerfile.staticbuild
      * Added new gitlab.ci jobs to test new style static build:
        - static_build_cmake_linux
        - static_build_cmake_osx_15
      * Removed static_docker_build gitlab.ci job
      
      Closes #5095
      
      Co-authored-by: Yaroslav Dynnikov <yaroslav.dynnikov@gmail.com>
  4. Sep 14, 2020
    • memtx: force async snapshot transactions · c620735c
      Vladislav Shpilevoy authored
      Snapshot rows do not contain real LSNs. Instead their LSNs are
      signatures, ordinal numbers. Rows in the snap have LSNs from 1 to
      the number of rows. This is because LSNs are not stored with every
      tuple in the storages, and there is no way to store real LSNs in
      the snapshot.
      
      These artificial LSNs broke the synchronous replication limbo.
      After snap recovery is done, limbo vclock was broken - it
      contained numbers not related to reality, and affected by rows
      from local spaces.
      
      Also the recovery could get stuck because ACKs in the limbo stopped
      working after a first row - the vclock was set to the final
      signature right away.
      
      This patch makes all snapshot recovered rows async, because they
      are confirmed by definition. So now the limbo is not involved in
      the snapshot recovery.
      
      Closes #5298
    • test: update test-run · a33cd1bd
      Alexander Turenko authored
      Fixed formatting of reproduce files with recent pyyaml versions.
      
      Background: test-run generates so-called reproduce files in the
      test/var/reproduce/ directory and accepts them as the argument of the
      --reproduce option. It is convenient to share a reproducer for a problem
      that appears when specific tests are run in a specific order.
      
      https://github.com/tarantool/test-run/pull/220
  5. Sep 13, 2020
  6. Sep 12, 2020
    • limbo: don't wake self fiber on CONFIRM write · a0477827
      Vladislav Shpilevoy authored
      During recovery WAL writes end immediately, without yields.
      Therefore WAL write completion callback is executed in the
      currently active fiber.
      
      Txn limbo on CONFIRM WAL write wakes up the waiting fiber, which
      appears to be the same as the active fiber during recovery.
      
      That breaks the fiber scheduler, because apparently it is not safe
      to wake the currently active fiber unless it is going to call
      fiber_yield() immediately after. See the comment in the fiber_wakeup()
      implementation about this kind of usage.
      
      The patch simply stops waking the waiting fiber, if it is the
      currently active one.
      
      Closes #5288
      Closes #5232
  7. Sep 11, 2020
    • test: replication/status.test.lua fails on Debug · 008e732c
      Alexander V. Tikhonov authored
      
      Found 2 issues on Debug build:
      
        [009] --- replication/status.result	Fri Sep 11 10:04:53 2020
        [009] +++ replication/status.reject	Fri Sep 11 13:16:21 2020
        [009] @@ -174,7 +174,8 @@
        [009]  ...
        [009]  test_run:wait_downstream(replica_id, {status == 'follow'})
        [009]  ---
        [009] -- true
        [009] +- error: '[string "return test_run:wait_downstream(replica_id, {..."]:1: variable
        [009] +    ''status'' is not declared'
        [009]  ...
        [009]  -- wait for the replication vclock
        [009]  test_run:wait_cond(function()                    \
        [009] @@ -226,7 +227,8 @@
        [009]  ...
        [009]  test_run:wait_upstream(master_id, {status == 'follow'})
        [009]  ---
        [009] -- true
        [009] +- error: '[string "return test_run:wait_upstream(master_id, {sta..."]:1: variable
        [009] +    ''status'' is not declared'
        [009]  ...
        [009]  master.upstream.lag < 1
        [009]  ---
      
      It happened because of the change introduced in commit [1], where
      wait_upstream()/wait_downstream() were mistakenly used as:
      
        test_run:wait_*stream(*_id, {status == 'follow'})
      
      with status set using '==' instead of '='. We are unable to read the
      status variable when the strict mode is enabled. It is enabled by
      default on Debug builds.
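
The same class of mistake can be reproduced in Python as a rough analogy (the test itself is Lua, and the function names here are hypothetical). Writing a comparison where a key assignment was intended makes the interpreter read a variable named `status`, which fails when no such variable is declared, much like the strict-mode error above.

```python
def build_condition_ok():
    # Intended form: a mapping with the key 'status' set to 'follow'.
    return dict(status="follow")


def build_condition_typo():
    # Mistaken form: '==' compares an undeclared variable `status`
    # with 'follow', so evaluating it raises NameError.
    return {status == "follow"}  # noqa: F821


print(build_condition_ok())
try:
    build_condition_typo()
except NameError as exc:
    print("error:", exc)
```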
      
      Follows up #5110
      Closes #5297
      
      Reviewed-by: Alexander Turenko <alexander.turenko@tarantool.org>
      Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
      
      [1] - a08b4f3a ("test: flaky replication/status.test.lua status")
    • lua: fix panic in case when log.cfg.log incorrecly specified · 85f19a87
      Oleg Babin authored
      This patch makes log.cfg{log = ...} behave the same as
      box.cfg{log = ...} and fixes the panic when "log" is incorrectly
      specified. For this purpose we export the "say_parse_logger_type"
      function and use it for logger type validation and logger type
      parsing.
      
      Closes #5130
    • asan: leak unit/swim.test:swim_test_encryption · ee9e3aed
      Alexander V. Tikhonov authored
      
      Found leak issue:
      
        [001] +==41031==ERROR: LeakSanitizer: detected memory leaks
        [001] +
        [001] +Direct leak of 96 byte(s) in 2 object(s) allocated from:
        [001] +    #0 0x4d8e53 in __interceptor_malloc (/tnt/test/unit/swim.test+0x4d8e53)
        [001] +    #1 0x53560f in crypto_codec_new /source/src/lib/crypto/crypto.c:239:51
        [001] +    #2 0x5299c4 in swim_scheduler_set_codec /source/src/lib/swim/swim_io.c:700:30
        [001] +    #3 0x511fe6 in swim_cluster_set_codec /source/test/unit/swim_test_utils.c:251:2
        [001] +    #4 0x50b3ae in swim_test_encryption /source/test/unit/swim.c:767:2
        [001] +    #5 0x50b3ae in main_f /source/test/unit/swim.c:1123
        [001] +    #6 0x544a3b in fiber_loop /source/src/lib/core/fiber.c:869:18
        [001] +    #7 0x5a13d0 in coro_init /source/third_party/coro/coro.c:110:3
        [001] +
        [001] +SUMMARY: AddressSanitizer: 96 byte(s) leaked in 2 allocation(s).
      
      Prepared minimal issue reproducer:
      
        static void
        swim_test_encryption(void)
        {
                swim_start_test(3);
                struct swim_cluster *cluster = swim_cluster_new(2);
                swim_cluster_set_codec(cluster, CRYPTO_ALGO_AES128, CRYPTO_MODE_CBC,
                                       "1234567812345678", CRYPTO_AES128_KEY_SIZE);
                swim_cluster_delete(cluster);
                swim_finish_test();
        }
      
      Found that the memory allocated for the codec by crypto_codec_new()
      via swim_cluster_set_codec() was never freed in the test. Added
      crypto_codec_delete() to the swim_scheduler_destroy() function for it.
      
      After this fix, removed the memory leak suppression for unit/swim.test.
      
      Closes #5283
      
      Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      
      Co-authored-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
    • test: flaky replication/gh-5195-qsync-* · a43414a5
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
         box.cfg{replication_synchro_quorum = 2}
          | ---
        + | - error: '[string "test_run:wait_cond(function()                ..."]:1: attempt to
        + |     index field ''vclock'' (a nil value)'
          | ...
      
      The issue output was not correct because of the wrong output list. The
      real command that caused the initial issue was the previous command:
      
        test_run:wait_cond(function()                                                   \
                local info = box.info.replication[replica_id]                           \
                local lsn = info.downstream.vclock[replica_id]                          \
                return lsn and lsn >= replica_lsn                                       \
        end)
      
      It happened because the replication vclock field did not exist at the
      moment of its check. To fix the issue, the test has to wait for the
      vclock field to become available using the test_run:wait_cond() routine.
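
A condition that tolerates a missing field can be sketched as follows. This is a Python model with a hypothetical helper `lsn_reached`; it represents a box.info.replication entry as a plain dict and simply reports "not yet" while the downstream vclock is absent, so a polling loop can keep waiting instead of failing.

```python
def lsn_reached(info, replica_id, target_lsn):
    """Return True once the downstream vclock exists and has reached target_lsn."""
    vclock = (info.get("downstream") or {}).get("vclock")
    if vclock is None:
        # The field is not available yet: keep waiting instead of failing.
        return False
    lsn = vclock.get(replica_id)
    return lsn is not None and lsn >= target_lsn


# Simulated entries: before and after the vclock becomes available.
print(lsn_reached({"downstream": {"status": "follow"}}, 2, 5))
print(lsn_reached({"downstream": {"vclock": {2: 5}}}, 2, 5))
```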
      
      Closes #5230
    • test: flaky replication/wal_off.test.lua test · ad4d0564
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [035] --- replication/wal_off.result	Fri Jul  3 04:29:56 2020
        [035] +++ replication/wal_off.reject	Mon Sep  7 15:32:46 2020
        [035] @@ -47,6 +47,8 @@
        [035]  ...
        [035]  while box.info.replication[wal_off_id].upstream.message ~= check do fiber.sleep(0) end
        [035]  ---
        [035] +- error: '[string "while box.info.replication[wal_off_id].upstre..."]:1: attempt to
        [035] +    index field ''upstream'' (a nil value)'
        [035]  ...
        [035]  box.info.replication[wal_off_id].upstream ~= nil
        [035]  ---
      
      It happened because the replication upstream status check occurred too
      early, when its state was not set yet. To give the replication status
      check a chance to reach the needed 'stopped' state, the test needs
      to wait for it using the test_run:wait_upstream() routine.
      
      Closes #5278
    • test: flaky replication/status.test.lua status · a08b4f3a
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following 3 issues were found:
      
      line 174:
      
       [026] --- replication/status.result	Thu Jun 11 12:07:39 2020
       [026] +++ replication/status.reject	Sun Jun 14 03:20:21 2020
       [026] @@ -174,15 +174,17 @@
       [026]  ...
       [026]  replica.downstream.status == 'follow'
       [026]  ---
       [026] -- true
       [026] +- false
       [026]  ...
      
      It happened because the replication downstream status check occurred too
      early. To give the replication status check a chance to reach the
      needed 'follow' state, the test needs to wait for it using the
      test_run:wait_downstream() routine.
      
      line 178:
      
      [024] --- replication/status.result	Mon Sep  7 00:22:52 2020
      [024] +++ replication/status.reject	Mon Sep  7 00:36:01 2020
      [024] @@ -178,11 +178,13 @@
      [024]  ...
      [024]  replica.downstream.vclock[master_id] == box.info.vclock[master_id]
      [024]  ---
      [024] -- true
      [024] +- error: '[string "return replica.downstream.vclock[master_id] =..."]:1: attempt to
      [024] +    index field ''vclock'' (a nil value)'
      [024]  ...
      [024]  replica.downstream.vclock[replica_id] == box.info.vclock[replica_id]
      [024]  ---
      [024] -- true
      [024] +- error: '[string "return replica.downstream.vclock[replica_id] ..."]:1: attempt to
      [024] +    index field ''vclock'' (a nil value)'
      [024]  ...
      [024]  --
      [024]  -- Replica
      
      It happened because the replication vclock field did not exist at the
      moment of its check. To fix the issue, the test has to wait for the
      vclock field to become available using the test_run:wait_cond() routine.
      Also the downstream replication data has to be read at the same moment.
      
      line 224:
      
      [014] --- replication/status.result	Fri Jul  3 04:29:56 2020
      [014] +++ replication/status.reject	Mon Sep  7 00:17:30 2020
      [014] @@ -224,7 +224,7 @@
      [014]  ...
      [014]  master.upstream.status == "follow"
      [014]  ---
      [014] -- true
      [014] +- false
      [014]  ...
      [014]  master.upstream.lag < 1
      [014]  ---
      
      It happened because the replication upstream status check occurred too
      early. To give the replication status check a chance to reach the
      needed 'follow' state, the test needs to wait for it using the
      test_run:wait_upstream() routine.
      
      Removed the test from the 'fragile' list of the test-run tool to run it in parallel.
      
      Closes #5110
    • test: flaky replication/gh-4606-admin-creds test · 11ba3322
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [021] --- replication/gh-4606-admin-creds.result	Wed Apr 15 15:47:41 2020
        [021] +++ replication/gh-4606-admin-creds.reject	Sun Sep  6 20:23:09 2020
        [021] @@ -36,7 +36,42 @@
        [021]   | ...
        [021]  i.replication[i.id % 2 + 1].upstream.status == 'follow' or i
        [021]   | ---
        [021] - | - true
        [021] + | - version: 2.6.0-52-g71a24b9f2
        [021] + |   id: 2
        [021] + |   ro: false
        [021] + |   uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
        [021] + |   package: Tarantool
        [021] + |   cluster:
        [021] + |     uuid: f27dfdfe-2802-486a-bc47-abc83b9097cf
        [021] + |   listen: unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/replica_auth.socket-iproto
        [021] + |   replication_anon:
        [021] + |     count: 0
        [021] + |   replication:
        [021] + |     1:
        [021] + |       id: 1
        [021] + |       uuid: a07cad18-d27f-48c4-8d56-96b17026702e
        [021] + |       lsn: 3
        [021] + |       upstream:
        [021] + |         peer: admin@unix/:/Users/tntmac02.tarantool.i/tnt/test/var/014_replication/master.socket-iproto
        [021] + |         lag: 0.0030207633972168
        [021] + |         status: disconnected
        [021] + |         idle: 0.44824500009418
        [021] + |         message: timed out
        [021] + |         system_message: Operation timed out
        [021] + |     2:
        [021] + |       id: 2
        [021] + |       uuid: 3921679b-d994-4cf0-a6ef-1f6a0d96fc79
        [021] + |       lsn: 0
        [021] + |   signature: 3
        [021] + |   status: running
        [021] + |   vclock: {1: 3}
        [021] + |   uptime: 1
        [021] + |   lsn: 0
        [021] + |   sql: []
        [021] + |   gc: []
        [021] + |   vinyl: []
        [021] + |   memory: []
        [021] + |   pid: 40326
        [021]   | ...
        [021]  test_run:switch('default')
        [021]   | ---
      
      It happened because the replication upstream status check occurred too
      early, when the status was still 'disconnected'. To give the
      replication status check a chance to reach the needed 'follow' state,
      the test needs to wait for it using the test_run:wait_upstream() routine.
      
      Closes #5233
    • test: flaky replication/gh-4402-info-errno.test.lua · 2b1f8f9b
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [004] --- replication/gh-4402-info-errno.result	Wed Jul 22 06:13:34 2020
        [004] +++ replication/gh-4402-info-errno.reject	Wed Jul 22 06:41:14 2020
        [004] @@ -32,7 +32,39 @@
        [004]   | ...
        [004]  d ~= nil and d.status == 'follow' or i
        [004]   | ---
        [004] - | - true
        [004] + | - version: 2.6.0-10-g8df49e4
        [004] + |   id: 1
        [004] + |   ro: false
        [004] + |   uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
        [004] + |   package: Tarantool
        [004] + |   cluster:
        [004] + |     uuid: 6ec7bcce-68e7-41a4-b84b-dc9236621579
        [004] + |   listen: unix/:(socket)
        [004] + |   replication_anon:
        [004] + |     count: 0
        [004] + |   replication:
        [004] + |     1:
        [004] + |       id: 1
        [004] + |       uuid: 41c4e3bf-cc3b-443d-88c9-39a9a8fe2df9
        [004] + |       lsn: 52
        [004] + |     2:
        [004] + |       id: 2
        [004] + |       uuid: 8a989231-177a-4eb8-8030-c148bc752b0e
        [004] + |       lsn: 0
        [004] + |       downstream:
        [004] + |         status: stopped
        [004] + |         message: timed out
        [004] + |         system_message: Connection timed out
        [004] + |   signature: 52
        [004] + |   status: running
        [004] + |   vclock: {1: 52}
        [004] + |   uptime: 27
        [004] + |   lsn: 52
        [004] + |   sql: []
        [004] + |   gc: []
        [004] + |   vinyl: []
        [004] + |   memory: []
        [004] + |   pid: 99
        [004]   | ...
        [004]
        [004]  test_run:cmd('stop server replica')
      
      It happened because the replication downstream status check occurred too
      early, when the status was still 'stopped'. To give the replication
      status check a chance to reach the needed 'follow' state, the test
      needs to wait for it using the test_run:wait_downstream() routine.
      
      Closes #5235
    • test: flaky replication/gh-4928-tx-boundaries test · 5410e592
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [089] --- replication/gh-4928-tx-boundaries.result	Wed Jul 29 04:08:29 2020
        [089] +++ replication/gh-4928-tx-boundaries.reject	Wed Jul 29 04:24:02 2020
        [089] @@ -94,7 +94,7 @@
        [089]   | ...
        [089]  box.info.replication[1].upstream.status
        [089]   | ---
        [089] - | - follow
        [089] + | - disconnected
        [089]   | ...
        [089]
        [089]  box.space.glob:select{}
      
      It happened because the replication upstream status check occurred too
      early, when the status was still 'disconnected'. To give the
      replication status check a chance to reach the needed 'follow' state,
      the test needs to wait for it using the test_run:wait_upstream() routine.
      
      Closes #5234
  8. Sep 09, 2020
    • test: fix status at replication/gh-4424-misc* test · 5a9b79fa
      Alexander V. Tikhonov authored
      Fixed flaky status check:
      
        [016] @@ -73,11 +73,11 @@
        [016]  ...
        [016]  box.info.status
        [016]  ---
        [016] -- running
        [016] +- orphan
        [016]  ...
        [016]  box.info.ro
        [016]  ---
        [016] -- false
        [016] +- true
        [016]  ...
        [016]  box.cfg{                                                        \
        [016]      replication = {},                                           \
        [016]
      
      The test now uses a wait condition for the status check: the status
      should change from 'orphan' to 'running'. On heavily loaded hosts the
      change may take some additional time, and the wait condition routine
      accounts for it.
      
      Closes #5271
      5a9b79fa
    • Alexander V. Tikhonov's avatar
      test: flaky replication/gh-3642-misc-* test · 2569ba54
      Alexander V. Tikhonov authored
      On heavily loaded hosts the following issue was found:
      
        [036] --- replication/gh-3642-misc-no-socket-leak-on-replica-disconnect.result	Sun Sep  6 23:49:57 2020
        [036] +++ replication/gh-3642-misc-no-socket-leak-on-replica-disconnect.reject	Mon Sep  7 04:07:06 2020
        [036] @@ -63,7 +63,7 @@
        [036]  ...
        [036]  box.info.replication[1].upstream.status
        [036]  ---
        [036] -- follow
        [036] +- disconnected
        [036]  ...
        [036]  test_run:cmd('switch default')
        [036]  ---
      
      It happened because the replication upstream status check occurred
      too early, when the upstream was still in the 'disconnected' state.
      To give the upstream time to reach the needed 'follow' state, the
      test has to wait for it using the test_run:wait_upstream() routine.
      
      Closes #5276
      2569ba54
    • Alexander V. Tikhonov's avatar
      test: remove asan suppression for unit/msgpack · 35f99e66
      Alexander V. Tikhonov authored
      ASAN showed an issue in the msgpuck repository, in the file
      test/msgpuck.c, which was the cause of the unit/msgpack test failure.
      The issue was fixed in the msgpuck repository, so the ASAN suppression
      for it was removed. Also removed the skip condition file, which
      blocked the test when it failed.
      
      Part of #4360
      35f99e66
    • Alexander V. Tikhonov's avatar
      lsan: app-tap/http_client.test.lua suppressions · 8d616ade
      Alexander V. Tikhonov authored
      Removed lsan suppressions that were not reproduced.
      
      Part of #4360
      8d616ade
  9. Sep 08, 2020
    • Ilya Kosarev's avatar
      msgpack: print mp_exp type as signed integer · 2a01ce91
      Ilya Kosarev authored
      MsgPack extension types allow applications to define
      application-specific types. They consist of an 8-bit signed integer and
      a byte array, where the integer represents the kind of type and the
      byte array represents the data. Types from 0 to 127 are
      application-specific and types from -128 to -1 are reserved for
      predefined types. However, extension types were printed as unsigned
      integers. Now this is fixed and extension types are printed correctly,
      as signed integers. Also, the typo in the word "Unsupported" was fixed.
      A corresponding test case is introduced.
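
A minimal sketch of the difference described above, assuming the type byte arrives as a raw `uint8_t`. The helper names are hypothetical and this is not the actual msgpuck or Tarantool printing code; it only illustrates why a reserved type such as -1 used to show up as 255.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: formats the extension type byte as an unsigned
 * integer, which is the behaviour the commit message calls wrong.
 */
static int
format_ext_type_unsigned(char *buf, size_t size, uint8_t raw)
{
    return snprintf(buf, size, "%u", (unsigned)raw);
}

/*
 * Hypothetical helper: interprets the same byte as an 8-bit signed
 * integer, so values 128..255 map to -128..-1 (the reserved range).
 */
static int
format_ext_type_signed(char *buf, size_t size, uint8_t raw)
{
    int value = raw < 128 ? (int)raw : (int)raw - 256;
    return snprintf(buf, size, "%d", value);
}
```

With these helpers, the byte 0xFF is rendered as 255 by the unsigned variant and as -1 by the signed one, while application-specific types (0 to 127) look the same in both.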
      
      Closes #5016
      2a01ce91
    • Ilya Kosarev's avatar
      rtree: add comments on ignored rtree_search() return value · 4883f19b
      Ilya Kosarev authored
      rtree_search() returns a value that is ignored in some cases.
      Although this is totally fine, it seems reasonable to comment those
      cases, since such usage might look questionable.
      
      Closes #2052
      4883f19b
    • Alexander V. Tikhonov's avatar
      Divide replication/misc.test.lua · 867e6b3d
      Alexander V. Tikhonov authored
      To fix flaky issues of replication/misc.test.lua the test had to be
      divided into smaller tests to be able to localize the flaky results:
      
        gh-2991-misc-asserts-on-update.test.lua
        gh-3111-misc-rebootstrap-from-ro-master.test.lua
        gh-3160-misc-heartbeats-on-master-changes.test.lua
        gh-3247-misc-iproto-sequence-value-not-replicated.test.lua
        gh-3510-misc-assert-replica-on-applier-disconnect.test.lua
        gh-3606-misc-crash-on-box-concurrent-update.test.lua
        gh-3610-misc-assert-connecting-master-twice.test.lua
        gh-3637-misc-error-on-replica-auth-fail.test.lua
        gh-3642-misc-no-socket-leak-on-replica-disconnect.test.lua
        gh-3704-misc-replica-checks-cluster-id.test.lua
        gh-3711-misc-no-restart-on-same-configuration.test.lua
        gh-3760-misc-return-on-quorum-0.test.lua
        gh-4399-misc-no-failure-on-error-reading-wal.test.lua
        gh-4424-misc-orphan-on-reconfiguration-error.test.lua
      
      Needed for #4940
      867e6b3d
    • Kirill Yukhin's avatar
      msgpuck: bump a new version · 77e03451
      Kirill Yukhin authored
      - test: correct buffer size to fix ASAN error
      77e03451
    • Sergey Bronnikov's avatar
      lua: return back import of table.clear() method · 09aa8135
      Sergey Bronnikov authored
      The import of the `table.clear` module was removed in commit 3af79e70
      ('Fix luacheck warnings in src/lua/') to fix a luacheck warning about
      an unused variable, and the `table.clear()` method became unavailable
      in Tarantool. This commit brings the import back, as some applications
      depend on it (the bug was found with a Cartridge application), and adds
      a regression test for table.clear(). Note: `table.clear` is not
      available until an explicit `require('table.clear')` call.
      
      Closes #5210
      09aa8135
  10. Aug 31, 2020
    • Alexander V. Tikhonov's avatar
      update_repo: correct fix for missing metadata RPMs · 71a24b9f
      Alexander V. Tikhonov authored
      When the update_repo tool is run with the option to delete some RPMs,
      it needs to remove all files matching the given pattern. The loop that
      checks the metadata deletes files, but only those listed in it.
      However, it is possible that some broken update left orphan files:
      they are present in the storage, but are not mentioned in the metadata.
      71a24b9f
    • Ilya Kosarev's avatar
      test: concurrent tuple update segfault on bitset index iteration · c5d7e139
      Ilya Kosarev authored
      A concurrent tuple update could segfault on BITSET_ALL_NOT_SET
      iterator usage. Fixed in 850054b2. This patch
      introduces the corresponding test.
      
      Closes #1088
      c5d7e139
    • Alexander V. Tikhonov's avatar
      gitlab-ci: add openSUSE packages build jobs · d07e5f96
      Alexander V. Tikhonov authored
      Implemented openSUSE packages build with testing for images:
      opensuse-leap:15.[0-2]
      
      Added %{sle_version} checks in Tarantool spec file according to
      https://en.opensuse.org/openSUSE:Packaging_for_Leap#RPM_Distro_Version_Macros
      
      Added opensuse-leap versions 15.1 and 15.2 to the Gitlab-CI package
      building/deploying jobs.
      
      Closes #4562
      d07e5f96
    • Alexander V. Tikhonov's avatar
      vinyl: fix check vinyl_dir existence at bootstrap · 9600b895
      Alexander V. Tikhonov authored
      
      While implementing the openSUSE build with testing, the
      box-tap/cfg.test.lua test failed. It turned out that when memtx_dir
      didn't exist, vinyl_dir existed, and errno was set to ENOENT, box
      configuration succeeded, although it shouldn't. The reason for this
      wrong behavior was that not all of the failure paths in xdir_scan()
      set errno, but the caller assumed they did.
      
      Debugging showed that after xdir_scan() there was an incorrect check
      of errno when the function returned a negative value. xdir_scan() is
      not a system call, and a negative return value from it doesn't mean
      that errno is set too. When errno was left over from commands
      preceding xdir_scan() and xdir_scan() returned a negative value by
      itself, the check gave a wrong result.
      
      The previous, flawed logic of the check was to catch the ENOENT error
      set in the xdir_scan() function in order to handle the situation when
      vinyl_dir did not exist. It failed because, when checking for ENOENT
      outside of xdir_scan(), we had to be sure that ENOENT indeed came from
      the xdir_scan() call and not from some function called before it. One
      possible fix would be to reset errno before the xdir_scan() call,
      since errno could be left over from any function called earlier.
      
      As mentioned above, xdir_scan() is not a system call: it can be
      changed in any possible way and can return any result without having
      to set errno. So a check of errno outside of this function could
      break.
      
      To avoid that, we must not check errno after the function call. A
      better solution is to pass a flag to xdir_scan() that tells whether
      the directory should exist. So the errno check was removed and
      replaced with a check for vinyl_dir existence using the flag.
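
The pitfall and the flag-based fix described above can be sketched as follows. This is only an illustration under stated assumptions: `scan_dir()`, `enum dir_state` and `fragile_treats_as_missing()` are hypothetical names, the directory state is simulated instead of read from disk, and none of this is the real xdir_scan() code.

```c
#include <errno.h>
#include <stdbool.h>

/* Simulated directory state, so the sketch does not touch the file system. */
enum dir_state { DIR_OK, DIR_MISSING, DIR_BROKEN };

/*
 * Hypothetical stand-in for a non-syscall scan routine. The caller says
 * via is_dir_required whether a missing directory is an error.
 */
static int
scan_dir(enum dir_state state, bool is_dir_required)
{
    switch (state) {
    case DIR_MISSING:
        if (!is_dir_required)
            return 0;      /* A missing optional directory is fine. */
        errno = ENOENT;
        return -1;
    case DIR_BROKEN:
        return -1;         /* Fails WITHOUT setting errno. */
    default:
        return 0;
    }
}

/*
 * Fragile caller: assumes errno describes the scan_dir() failure. A stale
 * ENOENT left by earlier code makes a broken directory look like a
 * missing one.
 */
static bool
fragile_treats_as_missing(enum dir_state state)
{
    return scan_dir(state, true) != 0 && errno == ENOENT;
}
```

If errno already holds ENOENT from earlier code, `fragile_treats_as_missing(DIR_BROKEN)` reports a missing directory even though the failure was unrelated. Calling `scan_dir(state, false)` instead gives the same answer regardless of what errno contained before the call: 0 for a missing optional directory and -1 for a real failure.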
      
      Closes #4594
      Needed for #4562
      
      Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>
      9600b895
  11. Aug 25, 2020
    • Ilya Kosarev's avatar
      tuple: drop extra restrictions for multikey index · bfeb61b3
      Ilya Kosarev authored
      The multikey index did not work properly with a nullable root field in
      tuple_raw_multikey_count(). Now it is fixed and the corresponding
      restrictions are dropped. This also means that we can drop the implicit
      nullability update for array/map fields and make all fields nullable
      by default, as it was until e1d3fe8a
      (tuple format: don't allow null where array/map is expected), since
      default non-nullability itself doesn't solve any real problems while
      causing confusing behavior (gh-5027).
      
      Follow-up #5027
      Closes #5192
      bfeb61b3
  12. Aug 24, 2020
    • Vladislav Shpilevoy's avatar
      box: introduce space:alter() · 8c965989
      Vladislav Shpilevoy authored
      There was no way to change certain space parameters without
      recreating the space or manually updating the internal system space
      _space, even though some of them were legal to update: field_count,
      owner, the flag of being temporary, the is_sync flag.
      
      The patch introduces function space:alter(), which accepts a
      subset of parameters from box.schema.space.create which are
      mutable, and 'name' parameter. There is a method space:rename(),
      but still the parameter is added to space:alter() too, to be
      consistent with index:alter(), which also accepts a new name.
      
      Closes #5155
      
      @TarantoolBot document
      Title: New function space:alter(options)
      
      Space objects in Lua (stored in `box.space` table) now have a new
      method: `space:alter(options)`.
      
      The method accepts a table with parameters `field_count`, `user`,
      `format`, `temporary`, `is_sync`, and `name`. All parameters have
      the same meaning as in `box.schema.space.create(name, options)`.
      
      Note, `name` parameter in `box.schema.space.create` is separated
      from `options` table. It is not so in `space:alter(options)` -
      here all parameters are specified in the `options` table.
      
      The function does not return anything in case of success, and
      throws an error if it fails.
      
      On the 'Synchronous replication' page, in the 'Limitations and known
      problems' section, the note about "no way to enable synchronous
      replication for existing spaces" needs to be deleted. Instead, the
      page should say that synchronous replication can be enabled using
      `space:alter({is_sync = true})` and disabled by setting
      `is_sync = false`.
      https://www.tarantool.io/en/doc/2.5/book/replication/repl_sync/#limitations-and-known-problems
      
      The function will appear in >= 2.5.2.
      8c965989