  1. Jul 24, 2024
    • Ilya Verbin's avatar
      test: do not test errinj.info() output · 00e15340
      Ilya Verbin authored
      There is not much sense in testing it, but it is sensitive to source code
      changes, especially `ERRINJ_*_COUNTDOWN` injections, e.g. see commit
      697123d0 ("box: use maximal space id instead of _schema.max_id").
      
      Needed for tarantool/tarantool-ee#712
      
      NO_DOC=test
      NO_CHANGELOG=test
      
      (cherry picked from commit dc0fd81c)
      00e15340
  2. Jul 23, 2024
    • Vladimir Davydov's avatar
      vinyl: do not log dump if index was dropped · 37eea2b9
      Vladimir Davydov authored
      An index can be dropped while a memory dump is in progress. If the vinyl
      garbage collector happens to delete the index from the vylog by the time
      the memory dump completes, the dump will log an entry for a deleted
      index, resulting in an error next time we try to recover the vylog,
      like:
      
      ```
      ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 2 committed after deletion
      ```
      
      or
      
      ```
      ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Deleted range 9 has run slices
      ```
      
      We already fixed a similar issue with compaction in commit 29e2931c
      ("vinyl: fix race between compaction and gc of dropped LSM"). Let's fix
      this one in exactly the same way: discard the new run without logging it
      to the vylog on a memory dump completion if the index was dropped while
      the dump was in progress.
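
      The completion rule described above can be sketched in C as follows. This is only an illustration: `lsm_sketch`, `dump_complete_sketch()` and the `vylog_records_sketch` counter are hypothetical stand-ins and not the actual vinyl code.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical stand-in for a vinyl index (LSM tree). */
      struct lsm_sketch {
          bool is_dropped; /* set when the index is dropped */
          int run_count;   /* runs installed in the index */
      };

      /* Hypothetical stand-in for the vylog: counts logged run records. */
      static int vylog_records_sketch;

      /*
       * Called when a memory dump finishes. Returns true if the new run
       * was logged and installed, false if it was discarded.
       */
      static bool
      dump_complete_sketch(struct lsm_sketch *lsm)
      {
          assert(lsm != NULL);
          if (lsm->is_dropped) {
              /* The index is gone: discard the run, log nothing. */
              return false;
          }
          vylog_records_sketch++;
          lsm->run_count++;
          return true;
      }
      ```

      With this rule, a dump that completes for a dropped index leaves the log untouched, so recovery never sees a run committed after deletion.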
      
      Closes #10277
      
      NO_DOC=bug fix
      
      (cherry picked from commit ae6a02eb)
      37eea2b9
  3. Jul 22, 2024
    • Vladimir Davydov's avatar
      tuple: allocate formats table statically · bfbf5a10
      Vladimir Davydov authored
      The tuple formats table may be accessed with `tuple_format_by_id()` from
      any thread, not just tx. For example, it's accessed by a vinyl writer
      thread when it deletes a tuple. If a thread happens to access the table
      while it's being reallocated by tx, see `tuple_format_register()`,
      the accessing thread may crash with a use-after-free or NULL pointer
      dereference bug, like the one below:
      
      ```
       # 1  0x64bd45c09e22 in crash_signal_cb+162
       # 2  0x76ce74e45320 in __sigaction+80
       # 3  0x64bd45ab070c in vy_run_writer_append_stmt+700
       # 4  0x64bd45ada32a in vy_task_write_run+234
       # 5  0x64bd45ad84fe in vy_task_f+46
       # 6  0x64bd45a4aba0 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+16
       # 7  0x64bd45c13e66 in fiber_loop+70
       # 8  0x64bd45e83b9c in coro_init+76
      ```
      
      To avoid that, let's make the tuple formats table statically allocated.
      This shouldn't increase actual memory usage because system memory is
      allocated lazily, on page fault. The max number of tuple formats isn't
      that big (64K) to care about the increase in virtual memory usage.
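
      A minimal sketch of the idea in C, assuming a limit of 64K entries as stated above. The names (`format_sketch`, `formats_sketch`, `format_register_sketch()`, `format_by_id_sketch()`) are hypothetical and do not reproduce the real tuple format code; the sketch only shows that a fixed-size table never moves, so a lookup cannot race with a reallocation.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Assumed limit: the text above says the maximum is 64K formats. */
      enum { FORMAT_ID_MAX_SKETCH = UINT16_MAX };

      struct format_sketch {
          uint16_t id;
      };

      /*
       * Statically allocated table. Its address never changes, so there
       * is no old array that could be freed under a concurrent reader.
       */
      static struct format_sketch *formats_sketch[FORMAT_ID_MAX_SKETCH + 1];
      static uint32_t formats_used_sketch;

      /* Registers a format; returns 0 on success, -1 if the table is full. */
      static int
      format_register_sketch(struct format_sketch *format)
      {
          if (formats_used_sketch > (uint32_t)FORMAT_ID_MAX_SKETCH)
              return -1;
          format->id = (uint16_t)formats_used_sketch;
          formats_sketch[formats_used_sketch++] = format;
          return 0;
      }

      /* Lookup by id: a plain array read. */
      static struct format_sketch *
      format_by_id_sketch(uint16_t id)
      {
          return formats_sketch[id];
      }
      ```

      The sketch does not address how individual slots are published to other threads; it only removes the reallocation that made the table pointer itself unstable.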
      
      Closes #10278
      
      NO_DOC=bug fix
      NO_TEST=mt race
      
      (cherry picked from commit a2da1de7)
      bfbf5a10
    • Vladislav Shpilevoy's avatar
      applier: drop apply_final_join_tx · 596d56f7
      Vladislav Shpilevoy authored
      We can use the regular applier_apply_tx() instead, since both do the
      same. The latter is just more protective, but it doesn't matter much
      in this case if the code takes a few latch locks.
      
      The patch also drops an old test about double-received row panic
      during final join. The logic is that absolutely the same situation
      could happen during subscribe, but it was always filtered out by
      checking replicaset.applier.vclock and skipping duplicate rows.
      
      There doesn't seem to be a reason why final join must be any
      different. It is, after all, same subscribe logic but the received
      rows go into replica's initial snapshot instead of xlogs. Now it
      even uses the same txn processing function applier_apply_tx().
      
      The patch also moves `replication_skip_conflict` option setting
      after bootstrap is finished. In theory, final join could deliver
      a conflicting row and it must not be ignored. The problem is that
      it can't be reproduced anyhow without illegal error injection
      (which would corrupt something in an unrealistic way). But let's
      move it below bootstrap anyway for clarity.
      
      Follow-up #10113
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      
      (cherry picked from commit da158b9b)
      596d56f7
    • Vladislav Shpilevoy's avatar
      box: make instance_vclock const · a62da4ee
      Vladislav Shpilevoy authored
      No code besides box.cc can now update instance's vclock
      explicitly. That is a protection against hacks like #9916.
      
      Closes #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      
      (cherry picked from commit 19b2cc20)
      a62da4ee
    • Vladislav Shpilevoy's avatar
      box: make final join vclock update only in box.cc · 15f4482c
      Vladislav Shpilevoy authored
      The goal is to make sure that no files except box.cc can change
      instance_vclock_storage directly. That leads to all sorts of hacks
      which in turn lead to bugs - #9916 is a good example.
      
      Now applier on final join only sends rows into the journal. The
      journal then is handled by box.cc where vclock is properly
      updated.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      
      (cherry picked from commit fe338ed4)
      15f4482c
    • Vladislav Shpilevoy's avatar
      journal: extract journal_write_row from limbo · 972d909b
      Vladislav Shpilevoy authored
      The function writes a single xrow into the journal in a blocking
      way. It isn't so simple, so makes sense to keep as a function,
      especially given that it will be used more in the next commit.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      
      (cherry picked from commit 7d10096c)
      972d909b
    • Vladislav Shpilevoy's avatar
      box: move recovery_journal creation · f4438449
      Vladislav Shpilevoy authored
      Recovery journal uses word "recovery" to say that it works with
      xlogs. For snapshot recovery there is bootstrap_journal. Let's use
      it during local snapshot recovery.
      
      The reasoning is that while right now there is no difference, in
      next commits the recovery_journal will do more.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      
      (cherry picked from commit 2620eb9e)
      f4438449
    • Vladislav Shpilevoy's avatar
      box: move replicaset.vclock into instance_vclock · 15cf2419
      Vladislav Shpilevoy authored
      Storing vclock of the instance in replicaset.vclock wasn't right.
      It wasn't vclock of the whole replicaset. It was local to this
      instance. There is no such thing as "replicaset vclock".
      
      The patch moves it to box.h/cc.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      
      (cherry picked from commit f1e8e4e1)
      15cf2419
    • Vladislav Shpilevoy's avatar
      applier: treat register txns like regular ones · d7d31846
      Vladislav Shpilevoy authored
      Applier during the registration waiting (for registering a new ID
      or a name) could keep doing the master txns received before the
      registration was started. They could still be inside WAL doing a
      disk write, when the replica sends a register request.
      
      Before this commit, it could cause an assertion failure in debug
      and a double LSN error in release.
      
      The reason was that during the registration waiting the applier
      treated all incoming txns as "final join" txns. I.e. it wasn't
      checking if those txns were already received, but not committed
      yet.
      
      During normal subscribe process the appliers (potentially
      multiple) protect themselves from that by keeping track of the
      vclocks which are already applied and also being applied right now
      (replicaset.applier.vclock).
      
      Such protection ensures that receiving the same row from 2 appliers
      wouldn't result in a double write. It also protects from the
      case when a txn was received, goes to WAL, but then the applier
      reconnects, resubscribes, and gets the same txn again - it
      shouldn't be applied.
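
      The duplicate filtering described above can be sketched in C roughly like this. The `vclock_sketch` type, its size, and `applier_accept_row_sketch()` are hypothetical simplifications, not the real applier code.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdint.h>

      /* Assumed number of vclock components, for illustration only. */
      enum { VCLOCK_MAX_SKETCH = 32 };

      /* Simplified vector clock: one LSN per replica id. */
      struct vclock_sketch {
          int64_t lsn[VCLOCK_MAX_SKETCH];
      };

      /*
       * Decides whether a received row should be applied. The vclock
       * covers rows that are already applied and rows that are being
       * applied right now (for example, still being written to WAL).
       * A row at or below the tracked LSN is a duplicate and is skipped;
       * otherwise the tracked LSN is advanced before the write starts.
       */
      static bool
      applier_accept_row_sketch(struct vclock_sketch *applier_vclock,
                                uint32_t replica_id, int64_t lsn)
      {
          assert(replica_id < VCLOCK_MAX_SKETCH);
          if (lsn <= applier_vclock->lsn[replica_id])
              return false;
          applier_vclock->lsn[replica_id] = lsn;
          return true;
      }
      ```

      Because the LSN is advanced before the write completes, a second applier or a resubscribed connection delivering the same row is rejected even while the first copy is still in flight.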
      
      The patch makes it so that the registration waiting after recovery
      works like subscribe. Registration during recovery would mean
      bootstrap via join. And outside of recovery it means the instance
      is already running.
      
      Closes #9916
      
      NO_DOC=bugfix
      
      (cherry picked from commit 51751f87)
      d7d31846
  4. Jul 18, 2024
    • Vladimir Davydov's avatar
      vinyl: wake up waiters after clearing checkpoint_in_progress flag · 84b2d213
      Vladimir Davydov authored
      The function `vy_space_build_index`, which builds a new index on DDL,
      calls `vy_scheduler_dump` on completion. If there's a checkpoint in
      progress, the latter will wait on `vy_scheduler::dump_cond` until
      `vy_scheduler::checkpoint_in_progress` is cleared. The problem is
      `vy_scheduler_end_checkpoint` doesn't broadcast `dump_cond` when it
      clears the flag. Usually, everything works fine because the condition
      variable is broadcast on any dump completion, and vinyl checkpoint
      implies a dump, but under certain conditions this may lead to a fiber
      hang. Let's broadcast `dump_cond` in `vy_scheduler_end_checkpoint`
      to be on the safe side.
      
      While we are at it, let's also inject a dump delay to the original
      test to make it more robust.
      
      Closes #10267
      Follow-up #10234
      
      NO_DOC=bug fix
      
      (cherry picked from commit fc3196dc)
      84b2d213
    • Nikita Zheleztsov's avatar
      applier: fix assertion failure after split brain · 73e6d02f
      Nikita Zheleztsov authored
      After receiving an async transaction from an old term, applier_apply_tx
      exits without unlocking the latch. If the same applier then tries to
      subscribe for replication, it fails with an assertion, as the latch is
      already locked.
      
      Let's fix the function that raises the error so that it just sets
      diag and returns -1.
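
      The pattern can be sketched in C as follows: the check reports failure through its return value, so the caller always reaches the unlock. All names here (`latch_sketch`, `diag_sketch`, `check_term_sketch()`, `apply_tx_sketch()`) are hypothetical and only model the control flow.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct latch_sketch {
          bool locked;
      };

      /* Hypothetical stand-in for the diagnostics area. */
      static const char *diag_sketch;

      static void
      latch_lock_sketch(struct latch_sketch *l)
      {
          assert(!l->locked); /* models the assertion on double lock */
          l->locked = true;
      }

      static void
      latch_unlock_sketch(struct latch_sketch *l)
      {
          assert(l->locked);
          l->locked = false;
      }

      /* Sets diag and returns -1 instead of raising an error. */
      static int
      check_term_sketch(unsigned row_term, unsigned current_term)
      {
          if (row_term < current_term) {
              diag_sketch = "transaction from an old term";
              return -1;
          }
          return 0;
      }

      static int
      apply_tx_sketch(struct latch_sketch *latch, unsigned row_term,
                      unsigned current_term)
      {
          int rc = 0;
          latch_lock_sketch(latch);
          if (check_term_sketch(row_term, current_term) != 0) {
              rc = -1;
              goto out;
          }
          /* ... apply the transaction here ... */
      out:
          latch_unlock_sketch(latch);
          return rc;
      }
      ```

      A failed call leaves the latch unlocked, so a later attempt by the same applier can lock it again without tripping the assertion.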
      
      Closes #10073
      
      NO_DOC=bugfix
      NO_CHANGELOG=no crash on release version
      
      (cherry picked from commit 5ce010c5)
      73e6d02f
  5. Jul 16, 2024
    • Lev Kats's avatar
      sio: fix error message displaying bind address · b9a2a87d
      Lev Kats authored
      Now the `sio_bind` function prints the address into the error message
      directly instead of relying on the `fd` used in the `bind` call that
      failed.
      
      `sio_bind` used `sio_socketname_to_buffer` for the error message,
      effectively attempting to print the address bound to `fd` while there
      actually was an error in binding that address to that socket in the
      first place.
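
      A minimal sketch of formatting the address straight from the `sockaddr` that was passed to `bind`, without querying the socket. It handles IPv4 only, and `addr_to_str_sketch()` is a hypothetical helper, not a function from sio.

      ```c
      #include <arpa/inet.h>
      #include <assert.h>
      #include <netinet/in.h>
      #include <stdio.h>
      #include <string.h>

      /*
       * Formats an IPv4 address as "host:port" from the sockaddr itself.
       * Asking the socket for its name would not help after a failed
       * bind(), because the address was never bound to it.
       * Returns 0 on success, -1 on failure.
       */
      static int
      addr_to_str_sketch(const struct sockaddr_in *addr, char *buf, size_t size)
      {
          char host[INET_ADDRSTRLEN];
          if (inet_ntop(AF_INET, &addr->sin_addr, host, sizeof(host)) == NULL)
              return -1;
          int n = snprintf(buf, size, "%s:%u", host,
                           (unsigned)ntohs(addr->sin_port));
          return (n < 0 || (size_t)n >= size) ? -1 : 0;
      }
      ```

      The resulting string can then be placed into the error message for the failed `bind` call.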
      
      Fixes #5925
      
      NO_DOC=bugfix
      NO_CHANGELOG=minor
      
      (cherry picked from commit a5214bfc)
      b9a2a87d
    • Nikita Zheleztsov's avatar
      test: cover split-brain during promote · 252cad12
      Nikita Zheleztsov authored
      This test checks that when a PROMOTE from the previous term is
      encountered, we immediately notice the split-brain situation and break
      replication without corrupting data.
      
      Closes #9943
      
      NO_DOC=test
      NO_CHANGELOG=test
      
      (cherry picked from commit 06b87e27)
      252cad12
    • Georgiy Lebedev's avatar
      box: refactor synchro quorum update on deletion from `_cluster` space · 096e453e
      Georgiy Lebedev authored
      For symmetry with the update of the synchronous replication quorum on
      insertion into the `_cluster` space, let's reuse the
      `on_replace_cluster_update_quorum` on_commit trigger.
      
      Follow-up #10087
      
      NO_CHANGELOG=<refactoring>
      NO_DOC=<refactoring>
      NO_TEST=<refactoring>
      
      (cherry picked from commit 9b63ced3)
      096e453e
    • Georgiy Lebedev's avatar
      box: update synchro quorum in on_commit trigger instead of on_replace · 737dc12a
      Georgiy Lebedev authored
      Currently, we update the synchronous replication quorum from the
      `on_replace` trigger of the `_cluster` space when registering a new
      replica. However, during the join process, the replica cannot ack its own
      insertion into the `_cluster` space. In the scope of #9723, we are going to
      enable synchronous replication for most of the system spaces, including the
      `_cluster` space. There are several problems with this:
      
      1. Joining a replica to a 1-member cluster without manually changing the
      quorum won't work: it is impossible to commit the insertion into the
      `_cluster` space with only 1 node, since the quorum will be equal to 2
      right after the insertion.
      
      2. Joining a replica to a 3-member cluster may fail: the quorum will become
      equal to 3 right after the insertion, and the newly joined replica cannot
      ACK its own insertion into the `_cluster` space, so if one of the original
      3 nodes fails, the reconfiguration will fail.
      
      Generally speaking, it will be impossible to join a new replica to the
      cluster, if a quorum, which includes the newly added replica (which cannot
      ACK), cannot be gathered.
      
      To solve these problems, let's update the quorum in the `on_commit`
      trigger. This way we'll be able to insert a node regardless of the current
      configuration. This somewhat contradicts the Raft specification, which
      requires application of all configuration changes in the `on_replace`
      trigger (i.e., as soon as they are persisted in the WAL, without quorum
      confirmation), but still forbids several reconfigurations at the same time.
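
      The arithmetic behind the two numbered cases can be checked with a small C sketch. It assumes a majority quorum of `N / 2 + 1`; the function names are hypothetical and only model who is able to ACK the insertion.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Majority quorum for `n` registered members (assumed formula). */
      static int
      quorum_sketch(int n)
      {
          return n / 2 + 1;
      }

      /*
       * Quorum raised right away, as with `on_replace`: the insertion
       * needs a quorum of the new cluster size, but the joining replica
       * can't ACK, so only `alive_old` nodes of the old cluster count.
       */
      static bool
      join_commits_on_replace_sketch(int old_size, int alive_old)
      {
          return alive_old >= quorum_sketch(old_size + 1);
      }

      /* Quorum updated in `on_commit`: the old quorum confirms the insertion. */
      static bool
      join_commits_on_commit_sketch(int old_size, int alive_old)
      {
          return alive_old >= quorum_sketch(old_size);
      }
      ```

      For a 1-member cluster the first variant needs 2 ACKs from 1 node and never commits, while the second needs only 1. For a 3-member cluster with one node down, the first variant needs 3 ACKs from 2 nodes, and the second needs 2.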
      
      Closes #10087
      
      NO_DOC=<no special documentation page devoted to cluster reconfiguration>
      
      (cherry picked from commit 29d1c0fa)
      737dc12a
  6. Jul 15, 2024
    • Vladimir Davydov's avatar
      vinyl: use broadcast instead of signal to notify about dump completion · 04347ee7
      Vladimir Davydov authored
      There may be more than one fiber waiting on `vy_scheduler::dump_cond`:
      
      ```
      box.snapshot
        vinyl_engine_wait_checkpoint
          vy_scheduler_wait_checkpoint
      
      space.create_index
        vinyl_space_build_index
          vy_scheduler_dump
      ```
      
      To avoid a hang, we should use `fiber_cond_broadcast`.
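
      The difference between the two wake-up calls can be modeled with a toy C sketch that only counts waiters. It does no real scheduling, and `cond_sketch` with its two functions is hypothetical, not the fiber library API.

      ```c
      #include <assert.h>

      /* Toy condition variable: tracks the number of waiting fibers. */
      struct cond_sketch {
          int waiters;
      };

      /* Wakes at most one waiter; returns the number of fibers woken. */
      static int
      cond_signal_sketch(struct cond_sketch *c)
      {
          if (c->waiters == 0)
              return 0;
          c->waiters--;
          return 1;
      }

      /* Wakes every waiter; returns the number of fibers woken. */
      static int
      cond_broadcast_sketch(struct cond_sketch *c)
      {
          int woken = c->waiters;
          c->waiters = 0;
          return woken;
      }
      ```

      With two waiters, as in the call chains above, a single signal leaves one of them asleep until some later event, while a broadcast wakes both.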
      
      Closes #10233
      
      NO_DOC=bug fix
      
      (cherry picked from commit 30547157)
      04347ee7
    • Lev Kats's avatar
      small: bump new version with UBSan fixes · cf278f56
      Lev Kats authored
      This patch bumps small to a new version that does not trigger
      UBSan with the *_entry* macros and should support the new oss-fuzz builder.
      
      New commits:
      
      * rlist: make its methods accept const arguments
      * lsregion: introduce lsregion_to_iovec method
      * rlist: make foreach_entry_* macros not to use UB
      
      Fixes: #10143
      
      NO_DOC=small submodule bump
      NO_TEST=small submodule bump
      NO_CHANGELOG=small submodule bump
      
      (cherry picked from commit 3e183044)
      cf278f56
    • Lev Kats's avatar
      trivia: use __builtin* for offsetof macro · 25146985
      Lev Kats authored
      Changed the default tarantool `offsetof` macro implementation so that it
      doesn't access members of a null pointer in typeof, which triggers UBSan.
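
      For illustration, here is a C sketch contrasting a classic hand-written offset macro with the compiler builtin available in GCC and Clang. The macro names are hypothetical, and the null-pointer form is a generic textbook variant, not necessarily the definition that was replaced.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /*
       * Classic hand-written form: computes a member address through a
       * null pointer, which is the kind of access UBSan reports.
       * Shown for comparison only.
       */
      #define OFFSETOF_NULLPTR_SKETCH(type, member) \
          ((size_t)&((type *)0)->member)

      /* Builtin form: the compiler computes the offset itself. */
      #define OFFSETOF_BUILTIN_SKETCH(type, member) \
          __builtin_offsetof(type, member)

      struct pair_sketch {
          char tag;
          long value;
      };
      ```

      The builtin form yields the same values as the standard `offsetof` from `<stddef.h>` without evaluating any pointer expression.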
      
      Needed for #10143
      
      NO_DOC=bugfix
      NO_CHANGELOG=minor
      NO_TEST=tested manually with fuzzer
      
      (cherry picked from commit 27e94824)
      25146985
  7. Jul 08, 2024
    • Nikolay Shirokovskiy's avatar
      fiber: prohibit fiber self join · 2131743e
      Nikolay Shirokovskiy authored
      In this case join will just hang. Instead let's raise an error in case
      of Lua API and panic in case of C API.
      
      Closes #10196
      
      NO_DOC=minor
      
      (cherry picked from commit 1e1bf36d)
      2131743e
    • Magomed Kostoev's avatar
      fiber: make the concurrent fiber_join safer · cf3def52
      Magomed Kostoev authored
      Prior to this patch a bunch of illegal conditions were possible:
      1. The joinability of a fiber could be changed while the fiber is
         being joined by someone. This could lead to double recycling:
         the first one happened on the fiber finish, and the second one
         in the fiber join.
      2. The joinability of a dead joinable fiber could be altered, which
         made it impossible to join the dead fiber and free its resources.
      3. A running fiber could be joined concurrently by two or more
         fibers, so the fiber could be recycled more than once (once
         per each concurrent join).
      4. A dead recycled fiber could be made joinable and joined leading
         to the double recycle.
      
      Fixed these issues by adding a new FIBER_JOIN_BEEN_INVOKED flag: now
      the `fiber_set_joinable` and `fiber_join_timeout` functions detect
      the double join. Because of the API limitations both of them panic
      when an invalid condition is met:
      - The `fiber_set_joinable` was not designed to report errors.
      - The `fiber_join_timeout` can't raise any error unless a timeout
        is met, because the `fiber_join` users don't expect to receive
        any error from this function at all (except the one generated
        by the joined fiber).
      
      It's still possible that a fiber join is performed on a struct which
      has been recycled and, if the new fiber is joinable too, this can't
      be detected. The current fiber API does not allow fixing this, so
      this is the user's responsibility: users should be warned that a
      double join of the same fiber is illegal.
      
      Closes #7562
      
      @TarantoolBot document
      Title: `fiber_join`, `fiber_join_timeout` and `fiber_set_joinable`
      behave differently now.
      
      `fiber_join` and `fiber_join_timeout` now panic if a double
      join of the given fiber is detected.
      
      `fiber_set_joinable` now panics if the given fiber is dead or is
      joined already. This prevents some amount of error conditions that
      could happen when using the API in an unexpected way, including:
      - Making a dead joinable fiber non-joinable could lead to a memory
        leak: one can't join the fiber anymore.
      - Making a dead joinable fiber joinable again is a sign of attempt
        to join the fiber later. That means the fiber struct may be joined
        later, when it's been recycled and reused. This could lead to a
        very hard to debug double join.
      - Making an alive joined fiber non-joinable would lead to the double
        free: once on the fiber function finish, and secondly in the active
        fiber join finish. Risks of making it joinable are described above.
      - Making a dead and recycled fiber joinable allowed to join the fiber
        once again leading to a double free.
      
      Any `struct fiber` given by the API should only be joined once. If a
      fiber is joined after the first join on it has finished, the behavior
      is undefined: it can either be a panic or an incidental join to a
      totally foreign fiber.
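
      The flag-based checks can be sketched in C as follows. The flag and function names carry a `_SKETCH`/`_sketch` suffix because they are hypothetical; the real functions panic on an invalid condition, while the sketch returns `false` so that the rules can be exercised.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      enum {
          FIBER_IS_JOINABLE_SKETCH = 1 << 0,
          FIBER_IS_DEAD_SKETCH = 1 << 1,
          FIBER_JOIN_BEEN_INVOKED_SKETCH = 1 << 2,
      };

      struct fiber_sketch {
          unsigned flags;
      };

      /* Returns true if a join may proceed and marks the fiber as joined. */
      static bool
      fiber_join_check_sketch(struct fiber_sketch *f)
      {
          if ((f->flags & FIBER_IS_JOINABLE_SKETCH) == 0)
              return false; /* assumption: non-joinable fibers are rejected */
          if ((f->flags & FIBER_JOIN_BEEN_INVOKED_SKETCH) != 0)
              return false; /* double join */
          f->flags |= FIBER_JOIN_BEEN_INVOKED_SKETCH;
          return true;
      }

      /* Joinability may not change once the fiber is dead or joined. */
      static bool
      fiber_set_joinable_check_sketch(struct fiber_sketch *f, bool yesno)
      {
          unsigned frozen = FIBER_IS_DEAD_SKETCH | FIBER_JOIN_BEEN_INVOKED_SKETCH;
          if ((f->flags & frozen) != 0)
              return false;
          if (yesno)
              f->flags |= FIBER_IS_JOINABLE_SKETCH;
          else
              f->flags &= ~(unsigned)FIBER_IS_JOINABLE_SKETCH;
          return true;
      }
      ```

      As noted above, a flag stored in the fiber struct cannot detect a join on a struct that has already been recycled and reused for another joinable fiber.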
      
      (cherry picked from commit 44401529)
      cf3def52
    • Sergey Kaplun's avatar
      luajit: bump new version · 03d9038c
      Sergey Kaplun authored
      * Correct fix for stack check when recording BC_VARG.
      * test: remove inline suppressions of _TARANTOOL
      * FFI: Fix ffi.alignof() for reference types.
      * FFI: Fix sizeof expression in C parser for reference types.
      * FFI: Allow ffi.metatype() for typedefs with attributes.
      * FFI: Fix ffi.metatype() for non-raw types.
      * Maintain chain invariant in DCE.
      * build: introduce option LUAJIT_ENABLE_TABLE_BUMP
      * ci: add tablebump flavor for exotic builds
      * test: allow `jit.parse` to return aborted traces
      * Handle all types of errors during trace stitching.
      * Use generic trace error for OOM during trace stitching.
      * Check for IR_HREF vs. IR_HREFK aliasing in non-nil store check.
      * cmake: set cmake_minimum_required only once
      * cmake: fix warning about minimum required version
      * ci: add a workflow for testing with AVX512 enabled
      * test: introduce a helper read_file
      * OSX/iOS/ARM64: Fix generation of Mach-O object files.
      * OSX/iOS/ARM64: Fix bytecode embedding in Mach-O object file.
      * build: introduce LUAJIT_USE_UBSAN option
      * ci: enable UBSan for sanitizers testing workflow
      * cmake: add the build directory to the .gitignore
      * Prevent sanitizer warning in snap_restoredata().
      * Avoid negation of signed integers in C that may hold INT*_MIN.
      * Show name of NYI bytecode in -jv and -jdump.
      
      Closes #9924
      Closes #8473
      
      NO_DOC=LuaJIT submodule bump
      NO_TEST=LuaJIT submodule bump
      03d9038c
  8. Jul 04, 2024
    • Nikolay Shirokovskiy's avatar
      fiber: fix leak on dead joinable fiber search · e97b01f6
      Nikolay Shirokovskiy authored
      When a fiber is accessed from Lua, we create a userdata object and keep
      the reference for future accesses. The reference is cleared when the
      fiber is stopped. But if the fiber is joinable, it can still be found
      with `fiber.find`. In this case we create the userdata object again.
      Unfortunately, as the fiber is already stopped, we fail to clear the
      reference. The memory of the trigger that clears the reference is also
      leaked, as is the fiber storage if it is accessed after the fiber is
      stopped.
      
      Let's add `on_destroy` trigger to fiber and clear the references there.
      
      Note that with current set of LSAN suppressions the trigger memory leak
      of the issue is not reported.
      
      Closes #10187
      
      NO_DOC=bugfix
      
      (cherry picked from commit 7db4de75)
      e97b01f6
  9. Jun 26, 2024
    • Nikolay Shirokovskiy's avatar
      box: fix memleak on functional index drop · 432789dc
      Nikolay Shirokovskiy authored
      Currently, we don't free functional index keys on functional index drop.
      Let's handle key deletion as in the case of primary index drop, i.e.,
      let's drop these keys in the background.
      
      We should set `use_hint` to `true` in case of MEMTX_TREE_VTAB_DISABLED
      tree index methods because `memtx_tree_disabled_index_vtab` uses
      `memtx_tree_index_destroy<true>`. Otherwise, on destroy of a stub
      functional index, we get a read outside of the index structure for the
      newly introduced `is_func` field (which is reported by ASAN).
      
      Closes #10163
      
      NO_DOC=bugfix
      
      (cherry picked from commit 319357d5)
      432789dc
  10. Jun 25, 2024
  11. Jun 22, 2024
    • Vladislav Shpilevoy's avatar
      sio: use kern.ipc.somaxconn for listen() on Mac · 23e58efb
      Vladislav Shpilevoy authored
      listen() on Mac used to take SOMAXCONN as the backlog size. It is
      just 128, which is too small when connections are incoming too
      fast. They get rejected.
      
      Increase of the queue size wasn't possible, because the limit was
      hardcoded. But now sio takes the runtime limit from
      kern.ipc.somaxconn sysctl setting.
      
      One weird thing is that when set too high, it seems to have no
      effect, as if nothing was changed. Specifically, values above
      32767 are not doing anything, even though they stay visible in
      kern.ipc.somaxconn.
      
      It seems listen() on Mac internally might be using 'short' or
      int16_t to store the queue size and it gets broken when anything
      above INT16_MAX is used. The code truncates the queue size to this
      value if the given one is too high.
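
      The truncation step can be sketched in C like this. `listen_backlog_sketch()` is a hypothetical helper; the fallback for non-positive input is an assumption added for the sketch, and reading the sysctl itself is left out.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Clamps a backlog value obtained from the system (for example,
       * from the kern.ipc.somaxconn sysctl) to INT16_MAX, following the
       * observation that larger values appear to have no effect on Mac.
       * Non-positive input falls back to `fallback` (assumed behavior).
       */
      static int
      listen_backlog_sketch(int64_t system_limit, int fallback)
      {
          if (system_limit <= 0)
              return fallback;
          if (system_limit > INT16_MAX)
              return INT16_MAX;
          return (int)system_limit;
      }
      ```

      The returned value would then be passed as the second argument of `listen()`.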
      
      Closes #8130
      
      NO_DOC=bugfix
      NO_TEST=requires root privileges for testing
      
      (cherry picked from commit 7e9a872f)
      23e58efb
  12. Jun 20, 2024
    • Nikolay Shirokovskiy's avatar
      ci: add workflow to check downgrade versions · 4ab1dcfd
      Nikolay Shirokovskiy authored
      Tarantool has a hardcoded list of versions it can downgrade to. This list
      should consist of all the released versions less than the Tarantool
      version. This workflow helps to make sure we update the list before
      release.
      
      It runs when a release tag is pushed to the repo, checks the list and
      fails if it misses some released version less than the current one. In
      this case we are supposed to update the downgrade list (with the required
      downgrade code) and update the release tag.
      
      Closes #8319
      
      NO_TEST=ci
      NO_CHANGELOG=ci
      NO_DOC=ci
      
      (cherry picked from commit 6d856347)
      4ab1dcfd
  13. Jun 14, 2024
  14. Jun 13, 2024
    • Serge Petrenko's avatar
      ci: followup fix RPM package builds on aarch64 runners · 74223a2d
      Serge Petrenko authored
      Commit 715abaaf ("ci: fix RPM package builds on aarch64 runners")
      limited the number of parallel jobs to 6 on these runners to fix the
      OOM, but it turns out this isn't enough: the almalinux_9_aarch64 workflow
      fails constantly even with this setting. Let's try to reduce the number
      of jobs to 4.
      
      NO_CHANGELOG=ci
      NO_TEST=ci
      NO_DOC=ci
      74223a2d
    • Vladislav Shpilevoy's avatar
      relay: do not report vclock[0] anywhere · 4f2e67f5
      Vladislav Shpilevoy authored
      Remote replica's vclock is given to master to send data starting
      from that position. The master does that, but, in order to find
      the relevant position in local WAL to start from, the master must
      ignore the local rows. Consider them all already "sent". For that
      the master replaces the remote vclock[0] with the local vclock[0].
      That makes xlog cursor skip all the local rows.
      
      The problem is that this vclock was taken by relay as is, as if
      it was truly reported by the replica. It was even saved as the
      "last received ACK". Which clearly isn't the case.
      
      When a real ACK was received, it didn't contain anything in
      vclock[0], and yet relay "saw" that the previous ACK has
      vclock[0] > 0. That looked like the replica went backwards without
      even closing connection, which isn't possible. That made the relay
      crash from cringe (on assert).
      
      The fix is not to save the local vclock[0] in the last received
      ACK.
      
      For GC and xlog cursor the hack is still needed. An option how to
      make it easier was to set vclock[0] to INT64_MAX to just never
      even bother with any local rows, but that didn't work. Some
      assumptions in other places seem to depend on having a proper
      local LSN in these places.
      
      Closes #10047
      
      NO_CHANGELOG=the bug wasn't released
      NO_DOC=bugfix
      
      (cherry picked from commit 1f75231a)
      4f2e67f5
    • Vladislav Shpilevoy's avatar
      relay: rename vclock args and make const · 49b374f9
      Vladislav Shpilevoy authored
      It wasn't clear which of them are inputs and which are outputs.
      The patch explicitly marks the input vclocks as const. It makes
      the code a bit easier to read inside of relay.cc knowing that
      these vclocks shouldn't change.
      
      Alongside "replica_clock" in subscribe is renamed to
      "start_vclock". To make it consistent with relay_final_join(), and
      to signify that technically it doesn't have to be a replica
      vclock. It isn't really. Box.cc alters the replica's vclock before
      giving it to relay, which means it is no longer "replica clock".
      
      In scope of #10047
      
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      NO_DOC=refactoring
      
      (cherry picked from commit 5ebbed77)
      49b374f9
    • Vladislav Shpilevoy's avatar
      relay: move gc subscriber creation out of it · 605752e5
      Vladislav Shpilevoy authored
      GC consumer creation and destroy seemed to only happen in box.cc
      with one exception in relay_subscribe(). Let's move it out for
      consistency. Now relay can only notify GC consumers, but can't
      manage them.
      
      That also makes it harder to misuse the GC by passing some wrong
      vclock to it, similar to what was happening in #10047.
      
      In scope of #10047
      
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      NO_DOC=refactoring
      
      (cherry picked from commit 4dc0c1ea)
      605752e5
    • Vladislav Shpilevoy's avatar
      box: introduce box_localize_vclock · 149fc1f7
      Vladislav Shpilevoy authored
      The function takes the burden of explaining why this hack about
      setting local component in a remote vclock is needed. It also
      creates a new vclock, not alters an existing one. This is to
      signify that the vclock is no longer what was received from a
      remote host.
      
      Otherwise it is too easy to actually mistreat this mutant vclock as
      a remote vclock. That btw did happen and is fixed in following
      commits.
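
      The idea of producing a new vclock instead of altering the received one can be sketched in C as follows. `vclock_model` and `localize_vclock_sketch()` are hypothetical simplifications and do not reproduce the signature of `box_localize_vclock`.

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* Assumed number of vclock components, for illustration only. */
      enum { VCLOCK_SIZE_SKETCH = 32 };

      struct vclock_model {
          int64_t lsn[VCLOCK_SIZE_SKETCH];
      };

      /*
       * Builds a new vclock from a remote one, with component 0 (local
       * rows) taken from the local vclock. The remote vclock is const
       * and stays untouched, so it still says what the remote host
       * actually reported.
       */
      static struct vclock_model
      localize_vclock_sketch(const struct vclock_model *remote,
                             const struct vclock_model *local)
      {
          struct vclock_model result = *remote;
          result.lsn[0] = local->lsn[0];
          return result;
      }
      ```

      Keeping the two values in separate objects makes it harder to pass the localized copy to code that expects the vclock received from the remote host.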
      
      In scope of #10047
      
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      NO_DOC=refactoring
      
      (cherry picked from commit b8463960)
      149fc1f7
    • Nikolay Shirokovskiy's avatar
      ci: add a workflow to check for entrypoint tags · 426bff55
      Nikolay Shirokovskiy authored
      See the comment in check-entrypoint.sh for an explanation of what an
      entrypoint tag is. The workflow fails if the current branch does not have
      the most recent entrypoint tag that it should have.
      
      Part of #8319
      
      NO_TEST=ci
      NO_CHANGELOG=ci
      NO_DOC=ci
      
      (cherry picked from commit c06d0d14)
      426bff55
    • Vladimir Davydov's avatar
      vinyl: fix gc vs vylog race leading to duplicate record · 085279aa
      Vladimir Davydov authored
      Vinyl run files aren't always deleted immediately after compaction,
      because we need to keep run files corresponding to checkpoints for
      backups. Such run files are deleted by the garbage collection procedure,
      which performs the following steps:
      
       1. Loads information about all run files from the last vylog file.
       2. For each loaded run record that is marked as dropped:
          a. Tries to remove the run files.
          b. On success, writes a "forget" record for the dropped run,
             which will make vylog purge the run record on the next
             vylog rotation (checkpoint).
      
      (see `vinyl_engine_collect_garbage()`)
      
      The garbage collection procedure writes the "forget" records
      asynchronously using `vy_log_tx_try_commit()`, see `vy_gc_run()`.
      This procedure can be successfully executed during vylog rotation,
      because it doesn't take the vylog latch. It simply appends records
      to a memory buffer which is flushed either on the next synchronous
      vylog write or vylog recovery.
      
      The problem is that the garbage collection doesn't necessarily load
      the latest vylog file because the vylog file may be rotated between
      its calls to `vy_log_signature()` and `vy_recovery_new()`. This may
      result in a "forget" record written twice to the same vylog file
      for the same run file, as follows:
      
        1. GC loads last vylog N.
        2. GC starts removing dropped run files.
        3. CHECKPOINT starts vylog rotation.
        4. CHECKPOINT loads vylog N.
        5. GC writes a "forget" record for run A to the buffer.
        6. GC is completed.
        7. GC is restarted.
        8. GC finds that the last vylog is N and blocks on the vylog latch
           trying to load it.
        9. CHECKPOINT saves vylog M (M > N).
       10. GC loads vylog N. This triggers flushing the "forget" record for
           run A to vylog M (not to vylog N), because vylog M is the last
           vylog at this point in time.
       11. GC starts removing dropped run files.
       12. GC writes a "forget" record for run A to the buffer again,
           because in vylog N it's still marked as dropped and not forgotten.
           (The previous "forget" record was written to vylog M).
       13. Now we have two "forget" records for run A in vylog M.
      
      Such duplicate run records aren't tolerated by the vylog recovery
      procedure, resulting in a permanent error on the next checkpoint:
      
      ```
      ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run XXXX forgotten but not registered
      ```
      
      To fix this issue, we move the `vy_log_signature()` call into
      `vy_recovery_new()`, under the vylog latch. This makes sure that GC
      sees the vylog records that it wrote during the previous execution.
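
The race and the fix can be illustrated with a toy model. This is only a sketch and not Tarantool code: the `VyLog` class, `gc_load()`, `gc_forget()`, and `forget_records()` are hypothetical names, vylog N and M are modeled as files 1 and 2, and the latch is modeled by running the checkpoint's final step at the point where GC would block.

```python
class VyLog:
    """Toy vylog: numbered files plus an in-memory record buffer."""

    def __init__(self):
        self.files = {1: [("drop", "A")]}  # vylog N = 1, run A is dropped
        self.last = 1                      # signature of the last vylog file
        self.buffer = []                   # records written asynchronously

    def signature(self):
        return self.last

    def flush(self):
        # Buffered records always go to the last vylog file.
        self.files[self.last].extend(self.buffer)
        self.buffer.clear()

    def load(self, sig):
        # Loading a file flushes the buffer first, then reads file `sig`.
        self.flush()
        return list(self.files[sig])

    def begin_rotation(self):
        # Checkpoint loads the last vylog file.
        return self.load(self.last)

    def finish_rotation(self, snapshot):
        # Checkpoint saves a new file, purging forgotten runs.
        forgotten = {run for op, run in snapshot if op == "forget"}
        self.last += 1
        self.files[self.last] = [r for r in snapshot if r[1] not in forgotten]


def gc_load(log, signature_under_latch, while_blocked=None):
    if not signature_under_latch:
        sig = log.signature()      # old behaviour: read before the latch
    if while_blocked is not None:
        while_blocked()            # checkpoint completes while GC waits
    if signature_under_latch:
        sig = log.signature()      # fixed behaviour: read under the latch
    return log.load(sig)


def gc_forget(log, records):
    dropped = {run for op, run in records if op == "drop"}
    forgotten = {run for op, run in records if op == "forget"}
    for run in sorted(dropped - forgotten):
        log.buffer.append(("forget", run))


def forget_records(signature_under_latch):
    """Replay the interleaving and count "forget" records for run A."""
    log = VyLog()
    records = gc_load(log, signature_under_latch)   # GC loads vylog N
    snapshot = log.begin_rotation()                 # checkpoint loads vylog N
    gc_forget(log, records)                         # GC buffers "forget A"
    records = gc_load(log, signature_under_latch,   # GC restarts and blocks,
                      lambda: log.finish_rotation(snapshot))  # vylog M saved
    gc_forget(log, records)
    log.flush()
    return log.files[log.last].count(("forget", "A"))
```

In this model, `forget_records(False)` replays the old ordering and leaves two "forget" records for run A in the last file, while `forget_records(True)` reads the signature only after the rotation has finished and leaves one.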
      
      Catching this race in a functional test would require a bunch of ugly
      error injections, so let's assume that it'll be tested by fuzzing.
      
      Closes #10128
      
      NO_DOC=bug fix
      NO_TEST=tested manually with fuzzer
      
      (cherry picked from commit 9d3859b2)
      085279aa
    • Georgiy Lebedev's avatar
      box: prevent demoted leader from being a candidate in the next elections · 22a9cfd8
      Georgiy Lebedev authored
      
      Currently, the demoted leader sees that nobody has requested a vote in the
      newly persisted term (because it has just written it without voting, and
      nobody had time to see the new term yet), and hence votes for itself,
      becoming the most probable winner of the next elections.
      
      To prevent this from happening, let's forbid the demoted leader from
      being a candidate in the next elections using `box_raft_leader_step_off`.
      
      Closes #9855
      
      NO_DOC=<bugfix>
      
      Co-authored-by: Serge Petrenko <sergepetrenko@tarantool.org>
      (cherry picked from commit 05d03a1c)
      22a9cfd8
    • Georgiy Lebedev's avatar
      box: refactor `box_demote` to make it more comprehensible · 49747a4b
      Georgiy Lebedev authored
      
      Suggested by Nikita Zheleztsov in the scope of #9855.
      
      Needed for #9855
      
      NO_CHANGELOG=<refactoring>
      NO_DOC=<refactoring>
      NO_TEST=<refactoring>
      
      Co-authored-by: Nikita Zheleztsov <n.zheleztsov@proton.me>
      (cherry picked from commit ff010fe9)
      49747a4b
    • Vladislav Shpilevoy's avatar
      election: fix box.ctl.demote() nop in off-mode · 42631d5b
      Vladislav Shpilevoy authored
      box.ctl.demote() used to do nothing with election_mode='off'
      if the synchro queue didn't belong to the caller in the same term
      as the election state.
      
      The reason could be that if the synchro queue term is "outdated",
      there is no guarantee that some other instance doesn't own it in
      the latest term right now.
      
      The "problem" is that this could be worked around easily by just
      calling promote + demote together.
      
      There isn't much sense in fixing it for the off-mode because the
      only reasons off-mode exists are 1) for people who don't use
      synchro at all, 2) for people who did use it and want to stop.
      Hence they need demote just to disown the queue.
      
      The patch "legalizes" the mentioned workaround by allowing demote
      to be performed in off-mode even if the synchro queue term is old.
      
      Closes #6860
      
      NO_DOC=bugfix
      
      (cherry picked from commit 1afe2274)
      42631d5b