  1. Apr 05, 2021
    • Vladislav Shpilevoy's avatar
      box: remove is_local_recovery variable · 625941cf
      Vladislav Shpilevoy authored
      It was used to recover synchronous auto-commit transactions in an
      async way (without blocking the fiber). But it is no longer
      necessary since #5874 was fixed, because recovery does not use
      auto-commit transactions anymore.
      
      Closes #5194
      625941cf
    • Vladislav Shpilevoy's avatar
      recovery: make it transactional · 9311113d
      Vladislav Shpilevoy authored
      Recovery used to be performed row by row. That was fine because all
      persisted rows are supposed to be committed and should not hit any
      problems during recovery, so a transaction could safely be applied
      partially.
      
      That stopped being true after synchronous replication was
      introduced. Synchronous transactions might be in the log but can be
      followed by a ROLLBACK record which is supposed to delete them.
      
      During row-by-row recovery, each synchro row first turned into its
      own sync transaction, which is probably fine. But rows on non-sync
      spaces which were part of a sync transaction could be applied right
      away, bypassing the limbo, leading to all kinds of errors such as
      duplicate keys or inconsistency of a partially applied transaction.
      
      The patch makes recovery transactional. Either an entire
      transaction is recovered, or it is rolled back, which normally
      happens only for synchro transactions followed by a ROLLBACK.
      
      In force recovery of a broken log, consistency is still not
      guaranteed, though.
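      
      For illustration, a transaction of the following shape (a minimal
      Lua sketch, space names are made up) mixes a synchronous and a
      plain space. With row-by-row recovery the plain-space row could be
      applied immediately while the synchronous part still sits in the
      limbo or even gets rolled back; transactional recovery keeps both
      rows together:
      
      ``` Lua
      -- A sync space and a plain space touched in one transaction.
      box.schema.space.create('orders', {is_sync = true})
      box.space.orders:create_index('pk')
      box.schema.space.create('orders_log')
      box.space.orders_log:create_index('pk')
      
      box.begin()
      box.space.orders:insert{1, 'pending'}
      box.space.orders_log:insert{1, 'created'}
      box.commit()
      ```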
      
      Closes #5874
      9311113d
    • Vladislav Shpilevoy's avatar
      vinyl: handle multi-statement recovery txns · 59fed2d1
      Vladislav Shpilevoy authored
      During recovery and xlog replay vinyl skips the statements already
      stored in runs - re-inserting them into the mems would otherwise
      lead to them being dumped a second time.
      
      But that results in an issue: the recovery transactions in vinyl
      don't get a write set - their tx->log stays empty. Nonetheless,
      they are still added to the writer list (xm->writers), probably so
      as not to scatter "skip if in recovery" checks all over the code.
      
      That works fine with single-statement transactions, but would break
      on multi-statement ones, because the decision whether the tx needs
      to be added to the writer list was based on the emptiness of its
      log. During recovery the log is always empty, so the transaction
      could be added to the list twice and corrupt its list-link member.
      
      The patch bases that decision on the emptiness of the list-link
      member itself instead of the log, so it works both during recovery
      and during normal operation.
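      
      For illustration, a multi-statement vinyl transaction like the
      following (a minimal Lua sketch, names made up) is exactly the kind
      of WAL content that used to break on replay:
      
      ``` Lua
      box.schema.space.create('events', {engine = 'vinyl'})
      box.space.events:create_index('pk')
      
      box.begin()
      box.space.events:replace{1, 'a'}
      box.space.events:replace{2, 'b'}
      box.commit()
      -- On restart both rows are replayed as one recovery transaction,
      -- the case that could previously corrupt the writer list.
      ```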
      
      Needed for #5874
      59fed2d1
    • Serge Petrenko's avatar
      replication: do not ignore replica vclock on register · f42fee5a
      Serge Petrenko authored
      There was a bug in box_process_register: it decoded the replica's
      vclock but never used it when sending the registration stream, so
      the replica might lose the data in the range (replica_vclock,
      start_vclock).
      
      Follow-up #5566
      f42fee5a
    • Serge Petrenko's avatar
      replication: tolerate synchro rollback during final join · 3ec0e87f
      Serge Petrenko authored
      Both box_process_register and box_process_join had guards ensuring
      that not a single rollback occurred for transactions residing in
      the WAL around the replica's _cluster registration.
      Both functions would error on a rollback and make the replica retry
      the final join.
      
      The reason was that the replica couldn't process synchronous
      transactions correctly during final join, because it applied the
      final join stream row by row.
      
      This retry path was a dead end: even if the master managed to
      receive no ROLLBACK messages around the N-th retry of
      box.space._cluster:insert{}, the replica would still have to
      receive and process all the data dating back to its first _cluster
      registration attempt.
      In other words, the guard against sending synchronous rows to the
      replica didn't work.
      
      Let's remove the guard altogether, since the replica is now capable
      of processing synchronous txs in the final join stream and even of
      retrying the final join in case the _cluster registration is rolled
      back.
      
      Closes #5566
      3ec0e87f
    • Serge Petrenko's avatar
      applier: make final join transactional · eceec305
      Serge Petrenko authored
      The applier now assembles rows into transactions not only during
      the subscribe stage, but also during final join / register.
      
      This is necessary for correct handling of rolled back synchronous
      transactions in the final join stream.
      
      Part of #5566
      eceec305
    • Serge Petrenko's avatar
      applier: remove excess last_row_time update from subscribe loop · efdd14f4
      Serge Petrenko authored
      applier->last_row_time is updated in applier_read_tx_row, which is
      called at least once per subscribe loop iteration, so there is no
      need for a separate last_row_time update inside the loop body
      itself.
      
      Part of #5566
      efdd14f4
    • Serge Petrenko's avatar
      applier: fix not releasing the latch on apply_synchro_row() fail · 9ad1bd15
      Serge Petrenko authored
      Once apply_synchro_row() failed, applier_apply_tx() would simply
      raise an error without unlocking the replica latch. This led to all
      the appliers hanging indefinitely while trying to lock the latch
      for this replica.
      
      In scope of #5566
      9ad1bd15
    • Serge Petrenko's avatar
      applier: extract plain tx application from applier_apply_tx() · caab8d52
      Serge Petrenko authored
      The new routine, called apply_plain_tx(), may be used not only by
      applier_apply_tx(), but also by final join once we make it
      transactional, and by recovery once it is also turned
      transactional.
      
      Also, while we're at it, remove the excess fiber_gc() call from the
      applier_subscribe loop and instead make sure fiber_gc() is called
      on every return from applier_apply_tx().
      
      Prerequisite #5874
      Part of #5566
      caab8d52
    • Serge Petrenko's avatar
      applier: extract tx boundary checks from applier_read_tx into a separate routine · cb30cc4c
      Serge Petrenko authored
      Introduce a new routine, set_next_tx_row(), which checks for tx
      boundary violations and appends the new row to the current tx when
      everything is ok.
      
      set_next_tx_row() is extracted from applier_read_tx() because it is
      a common part of transaction assembly for both recovery and the
      applier.
      
      The only difference for recovery is that the routine responsible
      for tx assembly won't read rows itself. It will be a callback run
      on each new row read from the WAL.
      
      Prerequisite #5874
      Part-of #5566
      cb30cc4c
    • Serge Petrenko's avatar
      replication: fix a hang on final join retry · eb908469
      Serge Petrenko authored
      Since the introduction of synchronous replication it became
      possible for the final join to fail on the master side due to not
      being able to gather acks for some tx around the _cluster
      registration.
      
      A replica receives an error in this case: either ER_SYNC_ROLLBACK
      or ER_SYNC_QUORUM_TIMEOUT. The errors lead to the applier retrying
      the final join, but with the wrong state, APPLIER_REGISTER, which
      should be used only on an anonymous replica. This led to a hang in
      the fiber executing box.cfg, because it waited for the
      APPLIER_JOINED state, which was never entered.
      
      Part-of #5566
      eb908469
    • Vladislav Shpilevoy's avatar
      swim: check types in __serialize methods · 1d121c12
      Vladislav Shpilevoy authored
      In the swim Lua code none of the __serialize methods checked the
      argument type, assuming that nobody would call them directly and
      mess with the types. But it happened, and it is not hard to fix, so
      the patch does it.
      
      The serialization functions are sanitized for the swim object,
      swim member, and member event.
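      
      A minimal Lua sketch of the kind of direct call the patch now
      handles, assuming the __serialize method is reachable through the
      object's metatable (the configuration values are illustrative):
      
      ``` Lua
      local swim = require('swim')
      local s = swim.new({uri = '127.0.0.1:0', uuid = require('uuid').str()})
      local serialize = getmetatable(s).__serialize
      serialize(s)                          -- normal use
      local ok = pcall(serialize, 'not a swim object')
      assert(not ok)                        -- a clear error, not misbehaviour
      ```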
      
      Closes #5952
      1d121c12
    • Vladislav Shpilevoy's avatar
      swim: fix crash on bad member_by_uuid() call · fe33a108
      Vladislav Shpilevoy authored
      In Lua, the swim object's method member_by_uuid() could crash if
      called with no arguments: the UUID was then passed as NULL and
      dereferenced.
      
      The patch makes member_by_uuid() treat NULL like a nil UUID and
      return NULL (member not found). The reason is that
      swim_member_by_uuid() can't fail: it can only return a member or
      not, and it never sets a diag error.
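      
      A minimal Lua sketch of the fixed behaviour (the swim configuration
      values are illustrative):
      
      ``` Lua
      local swim = require('swim')
      local s = swim.new({uri = '127.0.0.1:0', uuid = require('uuid').str()})
      -- Before the patch this could crash on a NULL dereference;
      -- now it simply reports "member not found".
      assert(s:member_by_uuid() == nil)
      ```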
      
      Closes #5951
      fe33a108
    • Alexander V. Tikhonov's avatar
      github-ci: drop not needed tag on branches commits · e23e7ac2
      Alexander V. Tikhonov authored
      We should remove a tag after fetching a remote repository.
      Ported the following commits:
      
        0f575e01 ('gitlab-ci: fix tag removal for a branch push job')
        0f564f34 ('gitlab-ci: remove tag from pushed branch commit')
      
      Drop a tag that points to the current commit (if any) on a job
      triggered by pushing to a branch (as opposed to pushing a tag).
      Otherwise we may get two jobs for the same x.y.z-0-gxxxxxxxxx
      build: one run by pushing a branch and another by pushing a tag.
      The idea is to hide the new tag from the branch job, as if the tag
      were pushed strictly after all branch jobs for the same commit.
      
      Closes tarantool/tarantool-qa#103
      e23e7ac2
    • Alexander Turenko's avatar
      lua: fix tuple leak in <key_def>.compare_with_key · db766c52
      Alexander Turenko authored
      The key difference between lbox_encode_tuple_on_gc() and
      luaT_tuple_encode() is that the latter never raises a Lua error, but
      passes an error using the diagnostics area.
      
      Aside from the tuple leak, the patch fixes a fiber region memory
      'leak' (till fiber_gc()). Before the patch, the memory used for
      serialization of the key was not freed (region_truncate()) when the
      serialization failed. This is verified in the gh-5388-<...> test.
      
      While I'm here, I added a test case that just verifies correct
      behaviour in case of a key serialization failure (added to
      key_def.test.lua). The case does not verify whether a tuple leaks;
      it passes both before and after this patch. I didn't find a simple
      way to check the tuple leak within a test, so it was verified
      manually using the reproducer from the linked issue.
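      
      A minimal Lua sketch of a key serialization failure on this code
      path (a key containing a Lua function is just one way to make the
      msgpack encoding fail):
      
      ``` Lua
      local key_def = require('key_def')
      local kd = key_def.new({{fieldno = 1, type = 'unsigned'}})
      local tuple = box.tuple.new({1, 'one'})
      -- Encoding a function into msgpack fails; before the patch the
      -- tuple reference (and the fiber region memory) leaked on this
      -- error path.
      local ok, err = pcall(kd.compare_with_key, kd, tuple, {function() end})
      assert(not ok)
      ```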
      
      Fixes #5388
      db766c52
    • Alexander V. Tikhonov's avatar
      github-ci: use vardir option in tests runs · 474eda49
      Alexander V. Tikhonov authored
      Got a warning message:
      
        [014] WARGING: unix socket's "/home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var/014_box/gh-5422-broken_snapshot.socket-iproto" path has length 108 symbols that is longer than 107. That likely will cause failing of tests.
      
      It caused the following fail:
      
        [038] Starting instance autobootstrap_guest1...
        [038] Start failed: builtin/box/console.lua:865: failed to create server unix/:/home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var/038_replication/autobootstrap_guest1.socket-admin: No buffer space available
      
      To avoid this, use the vardir option in test runs to decrease path
      lengths.
      
      Closes tarantool/tarantool-qa#104
      474eda49
    • Alexander V. Tikhonov's avatar
      github-ci: avoid of use container tags in actions · 58fe0fcb
      Alexander V. Tikhonov authored
      Changed the following workflows:
      
        luacheck
        debug_coverage
        release*
        static_build
        static_build_cmake_linux
      
      The OS in which the tests run was changed from Debian to Ubuntu.
      The way this OS is booted was changed as well: before the change it
      was booted as a Docker container using the GitHub Actions
      'container' tag from inside the workflows, which caused all the
      workflow steps to run inside it. After the change no container is
      run on the host anymore; the GitHub Actions host now runs with its
      native OS set in the 'runs-on' tag. It was decided to use the
      latest OS, `ubuntu-20.04`, which is already the default for the
      'ubuntu-latest' tag.
      
      This change gives us the ability to:
       - Remove the extra container step in the workflow.
       - Switch off swap using the 'swapoff' command.
       - Use the same OS as GitHub Actions uses by default.
       - Set up our local hosts using the GitHub Actions image snapshot.
       - Enable use of actions/checkout@v2.3.4, which is better than v1.
       - Lightly bootstrap packages in the local .*.mk makefile for:
           build: libreadline-dev libunwind-dev
           tests: pip install -r test-run/requirements.txt
      
      Closes tarantool/tarantool-qa#101
      58fe0fcb
  2. Apr 02, 2021
    • Nikita Pettik's avatar
      vinyl: remove vylog newer than snap in casual recovery · 33254d91
      Nikita Pettik authored
      As a follow-up to the previous patch, let's also check whether the
      vylog being removed is empty. During vylog rotation all entries are
      squashed (e.g. "delete range" annihilates "insert range") and
      written to the new vylog, and a SNAPSHOT marker is placed at its
      end. If the last entry in the vylog is SNAPSHOT, we can safely
      remove it without hesitation, so it is OK to remove it even during
      normal (non-force) recovery. However, if it contains rows after the
      SNAPSHOT marker, removing the vylog may cause data loss. In this
      case we can still remove it only in force_recovery mode.
      
      Follow-up #5823
      33254d91
    • Nikita Pettik's avatar
      vinyl: skip vylog if it's newer than snap · 149ccce9
      Nikita Pettik authored
      When there is data in different engines, the checkpoint process is
      handled this way:
       - wait_checkpoint memtx
       - wait_checkpoint vinyl
       - commit_checkpoint memtx
       - commit_checkpoint vinyl
      
      In contrast to commit_checkpoint, which does not tolerate failures
      (if something goes wrong, e.g. renaming the snapshot file, the
      instance simply crashes), wait_checkpoint may fail. As a part of
      wait_checkpoint for the vinyl engine, vy_log rotation takes place:
      the old vy_log is closed and a new one is created. At this moment,
      wait_checkpoint of the memtx engine has already created a new
      *inprogress* snapshot featuring a bumped vclock. While recovering
      from this configuration, the vclock of the latest snapshot is used
      as a reference.
      
      At the initial recovery stage (vinyl_engine_begin_initial_recovery),
      we check that the snapshot's vclock matches the vylog's (they
      should be the same since normally the vylog is rotated along with
      the snapshot). In the directory, however, we have the old snapshot
      and the new vylog (and the new .inprogress snapshot). In such a
      situation recovery (even in force mode) was aborted, and the only
      way out of this dead end was for the user to manually delete the
      last vy_log file.
      
      Let's apply the same resolution automatically when the user runs in
      force_recovery mode: delete the last vy_log file and update the
      vclock value. If the user runs normal recovery, let's print a
      verbose message explaining how to fix the situation manually.
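      
      A minimal Lua sketch of the force_recovery side of the resolution
      (the scenario itself, a vylog newer than the latest snapshot, has
      to be produced by a crash between the two checkpoint phases):
      
      ``` Lua
      -- With force recovery the stale vy_log is removed automatically
      -- and the vclock is updated; without it, recovery stops with a
      -- verbose message explaining how to delete the file manually.
      box.cfg{force_recovery = true}
      ```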
      
      Closes #5823
      149ccce9
    • Nikita Pettik's avatar
      errinj: introduce ERROR_INJECT_TERMINATE() macro · a240e019
      Nikita Pettik authored
      It is a conditional injection that terminates execution by calling
      assert(0) if the given condition is true. It is quite useful since
      it allows us to emulate situations when the instance is suddenly
      shut down, due to SIGKILL for example.
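      
      From a test, such an injection would be toggled through the usual
      error-injection interface (debug builds only); the injection name
      below is hypothetical:
      
      ``` Lua
      -- Hypothetical injection name, for illustration only; real names
      -- are defined in errinj.h.
      box.error.injection.set('ERRINJ_SOME_TERMINATE', true)
      -- The next code path that evaluates ERROR_INJECT_TERMINATE() with
      -- a true condition aborts the instance, emulating a sudden
      -- shutdown.
      ```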
      
      Needed for #5823
      a240e019
    • Nikita Pettik's avatar
      xlog: introduce xdir_remove_file_by_vclock() function · 4d6e2b73
      Nikita Pettik authored
      Needed for #5823
      4d6e2b73
    • Nikita Pettik's avatar
      vinyl: make vy_log_begin_recovery() take force_recovery param · f729343c
      Nikita Pettik authored
      Needed for #5823
      f729343c
    • Mergen Imeev's avatar
      sql: ignore \0 in string passed to Lua-function · 22e2e4ea
      Mergen Imeev authored
      Prior to this patch, a string passed to a user-defined Lua function
      from SQL was cropped if it contained '\0'. At the same time, it
      wasn't cropped when passed to the function from BOX. After this
      patch the string is no longer cropped when passed from SQL even if
      it contains '\0'.
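      
      A minimal Lua sketch of the fixed behaviour (the function name and
      body are illustrative):
      
      ``` Lua
      box.schema.func.create('check_len', {
          language = 'LUA',
          body = [[function(s) return #s end]],
          param_list = {'string'},
          returns = 'integer',
          exports = {'LUA', 'SQL'},
      })
      -- The bound string contains an embedded '\0'; after the fix the
      -- Lua function receives all 3 bytes instead of a cropped string.
      box.execute([[SELECT "check_len"(?);]], {'a\0b'})
      ```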
      
      Closes #5938
      22e2e4ea
    • Mergen Imeev's avatar
      sql: ignore \0 in string passed to C-function · fa7e6f7d
      Mergen Imeev authored
      Prior to this patch, a string passed to a user-defined C function
      from SQL was cropped if it contained '\0'. At the same time, it
      wasn't cropped when passed to the function from BOX. Now it isn't
      cropped when passed from SQL.
      
      Part of #5938
      fa7e6f7d
  3. Mar 31, 2021
    • Alexander V. Tikhonov's avatar
      github-ci: switch off swap use · fd6ee6d5
      Alexander V. Tikhonov authored
      Github Actions provides hosts for Linux base runners in the following
      configurations:
      
        2 Cores
        7 Gb memory
        4 Gb swap memory
      
      To avoid issues with hanging/slow tests under high memory use,
      like [1], host configurations must avoid using swap memory. All of
      the test workflows run inside Docker containers. This patch sets
      memory limits in the docker run configurations based on the current
      GitHub Actions hosts: 7 GB of memory and no swap on top of it.
      
      Checked 10 full runs (29 workflows in each run used the change) and
      got a single failed test, in the gevent() routine in test-run. This
      result is much better than without this patch, when 3-4 workflows
      failed on each full run.
      
      It could happen because swappiness is set to the default value:
      
        cat /sys/fs/cgroup/memory/memory.swappiness
        60
      
      From documentation on swappiness [2]:
      
        This control is used to define the rough relative IO cost of swapping
        and filesystem paging, as a value between 0 and 200. At 100, the VM
        assumes equal IO cost and will thus apply memory pressure to the page
        cache and swap-backed pages equally; lower values signify more
        expensive swap IO, higher values indicates cheaper.
        Keep in mind that filesystem IO patterns under memory pressure tend to
        be more efficient than swap's random IO. An optimal value will require
        experimentation and will also be workload-dependent.
      
      We may try to tune how often anonymous pages are swapped using the
      swappiness parameter, but our goal is to stabilize timings (and
      make them as predictable as possible), so the best option is to
      disable swap entirely and work on decreasing memory consumption in
      huge tests.
      
      For GitHub Actions host configurations with 7 GB of RAM this means
      that swapping began once about 2.8 GB of RAM was used. In testing
      we have tests that use up to 2.5 GB of RAM, like
      'box/net_msg_max.test.lua', and memory fragmentation after such a
      test run could trigger swap use [3].
      
      It was also found that the disk cache could consume some RAM, which
      likewise led to fast memory use and the start of swapping. It can
      be periodically dropped from memory [4] via the 'drop_caches'
      system value, but that won't fix the overall issue with swap use.
      
      After freeing cached pages in RAM, another kernel option can be
      tuned [5][6]: 'vfs_cache_pressure'. This percentage value controls
      the tendency of the kernel to reclaim the memory used for caching
      directory and inode objects. Increasing it significantly beyond the
      default value of 100 may have a negative performance impact: the
      reclaim code needs to take various locks to find freeable directory
      and inode objects, and with 'vfs_cache_pressure=1000' it will look
      for ten times more freeable objects than there are. This patch does
      not make that change, but it can be done as a follow-up.
      
      To fix the issue, the following changes were made:
      
       - For jobs that run tests, use actions/environment, and don't use
         the GitHub Actions container tag, the 'sudo swapoff -a' command
         was added to the actions/environment action.
      
       - For jobs that run tests and use the GitHub Actions container tag
         the previous solution doesn't work. It was decided to hard-code
         the memory value based on the memory size found on GitHub
         Actions hosts, 7 GB. It was set for the GitHub container tag as
         additional options:
           options: '--init --memory=7G --memory-swap=7G'
         These changes are temporary until the container tags are removed
         while resolving tarantool/tarantool-qa#101 for the following
         workflows:
           debug_coverage
           release
           release_asan_clang11
           release_clang
           release_lto
           release_lto_clang11
           static_build
           static_build_cmake_linux
      
       - For VMware VMs, such as the FreeBSD ones, the 'sudo swapoff -a'
         command was added before the build commands.
      
       - For macOS on GitHub Actions hosts swapping is already disabled:
           sysctl vm.swapusage
           vm.swapusage: total = 0.00M  used = 0.00M  free = 0.00M  (encrypted)
         Manually switching off swap is currently not possible, because
         System Integrity Protection (SIP) would have to be disabled [7],
         and we don't have such access on GitHub Actions hosts. For local
         hosts it must be done manually with [8]:
           sudo nvram boot-args="vm_compressor=2"
         Added a swap status check to be sure the host is configured
         correctly:
           sysctl vm.swapusage
      
      Closes tarantool/tarantool-qa#99
      
      [1]: https://github.com/tarantool/tarantool-qa/issues/93
      [2]: https://github.com/torvalds/linux/blob/1e43c377a79f9189fea8f2711b399d4e8b4e609b/Documentation/admin-guide/sysctl/vm.rst#swappiness
      [3]: https://unix.stackexchange.com/questions/2658/why-use-swap-when-there-is-more-than-enough-free-space-in-ram
      [4]: https://kubuntu.ru/node/13082
      [5]: https://www.kernel.org/doc/Documentation/sysctl/vm.txt
      [6]: http://devhead.ru/read/uskorenie-raboty-linux
      [7]: https://osxdaily.com/2010/10/08/mac-virtual-memory-swap/
      [8]: https://gist.github.com/dan-palmer/3082266#gistcomment-3667471
      fd6ee6d5
    • Cyrill Gorcunov's avatar
      test: box-tap/gc -- add test for is_paused field · 83ec719c
      Cyrill Gorcunov authored
      
      Once a simple bootstrap is complete and there are no replicas in
      use, we should run with gc unpaused.
      
      Part-of #5806
      
      Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      83ec719c
    • Cyrill Gorcunov's avatar
      test: add a test for wal_cleanup_delay option · 5437afe2
      Cyrill Gorcunov authored
      
      Part-of #5806
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      5437afe2
    • Cyrill Gorcunov's avatar
      gc/xlog: delay xlog cleanup until relays are subscribed · 2fd51aea
      Cyrill Gorcunov authored
      
      In case a replica managed to fall far behind the master node (so
      there are a number of xlog files present after the last master
      snapshot), then once the master node gets restarted it may clean up
      the xlogs needed by the replica to subscribe in a fast way, and
      instead the replica will have to rejoin, reading a lot of data
      back.
      
      Let's try to address this by delaying xlog file cleanup until the
      replicas are subscribed and the relays are up and running. For this
      sake we start with the cleanup fiber spinning in a nop cycle
      ("paused" mode) and use a delay counter which the relays decrement
      as they subscribe.
      
      This implies that if the `_cluster` system space is not empty upon
      restart and a registered replica somehow vanished completely and
      won't ever come back, then the node administrator has to drop this
      replica from `_cluster` manually.
      
      Note that this delayed cleanup start doesn't prevent the WAL engine
      from removing old files if there is no space left on the storage
      device. The WAL will simply drop old data without asking.
      
      Some administrators might not need this functionality at all; for
      their sake we introduce the "wal_cleanup_delay" configuration
      option, which allows enabling or disabling the delay.
      
      Closes #5806
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      
      @TarantoolBot document
      Title: Add wal_cleanup_delay configuration parameter
      
      The `wal_cleanup_delay` option defines a delay in seconds before
      write-ahead log files (`*.xlog`) start being pruned after a node
      restart.
      
      This option is ignored if a node is running as an anonymous replica
      (`replication_anon = true`). Similarly, if replication is unused or
      there are no plans to use it at all, this option should not be
      considered.
      
      The initial problem to solve is the case where a node operates so
      fast that its replicas do not manage to reach its state, and if the
      node is restarted at this moment (for various reasons, for example
      due to a power outage), `*.xlog` files might be pruned during the
      restart. As a result, replicas will not find these files on the
      main node and will have to re-read all the data back, which is a
      very expensive procedure.
      
      Since replicas are tracked via the `_cluster` system space, we use
      its content to count subscribed replicas, and when all of them are
      up and running the cleanup procedure is enabled automatically even
      if `wal_cleanup_delay` has not expired.
      
      The `wal_cleanup_delay` should be set to:
      
       - `0` to disable the cleanup delay;
       - `> 0` to wait for the specified number of seconds.
      
      By default it is set to `14400` seconds (i.e. `4` hours).
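      
      A minimal Lua sketch of setting the option (the values are
      illustrative):
      
      ``` Lua
      box.cfg{wal_cleanup_delay = 3600}  -- wait up to one hour for replicas
      box.cfg{wal_cleanup_delay = 0}     -- or disable the delay entirely
      ```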
      
      If a registered replica is lost forever and the timeout is
      effectively set to infinity, the preferred way to enable the
      cleanup procedure is not to set a small timeout value, but rather
      to delete this replica from the `_cluster` space manually.
      
      Note that the option does *not* prevent the WAL engine from
      removing old `*.xlog` files if there is no space left on the
      storage device; the WAL engine can remove them forcibly.
      
      The current state of the `*.xlog` garbage collector can be found in
      the `box.info.gc()` output. For example:
      
      ``` Lua
       tarantool> box.info.gc()
       ---
         ...
         is_paused: false
      ```
      
      The `is_paused` field shows whether the cleanup fiber is paused or
      not.
      2fd51aea
  4. Mar 26, 2021
    • Alexander Turenko's avatar
      test: update test-run (--memtx-allocator) · 6d6a153b
      Alexander Turenko authored
      Added the `--memtx-allocator <string>` test-run option, which just
      sets the `MEMTX_ALLOCATOR` environment variable. The variable is
      available in the testing code (including instance files) and is
      supposed to be used to verify the upcoming
      `box.cfg{memtx_allocator = <small|system>}` option.
      
      Alternatively, one can just set the `MEMTX_ALLOCATOR` environment
      variable manually.
      
      Beware: the option does not set the allocator in tarantool
      automatically in any way. A test should read the variable and set
      the box.cfg option itself.
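      
      A minimal Lua sketch of how an instance file might consume the
      variable (the fallback value is illustrative):
      
      ``` Lua
      box.cfg{
          memtx_allocator = os.getenv('MEMTX_ALLOCATOR') or 'small',
      }
      ```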
      
      [1]: https://github.com/tarantool/test-run/pull/281
      
      Part of #5419
      6d6a153b