Skip to content
Snippets Groups Projects
  1. Jul 23, 2024
    • Vladimir Davydov's avatar
      vinyl: do not log dump if index was dropped · ae6a02eb
      Vladimir Davydov authored
      An index can be dropped while a memory dump is in progress. If the vinyl
      garbage collector happens to delete the index from the vylog by the time
      the memory dump completes, the dump will log an entry for a deleted
      index, resulting in an error next time we try to recover the vylog,
      like:
      
      ```
      ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run 2 committed after deletion
      ```
      
      or
      
      ```
      ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Deleted range 9 has run slices
      ```
      
      We already fixed a similar issue with compaction in commit 29e2931c
      ("vinyl: fix race between compaction and gc of dropped LSM"). Let's fix
      this one in exactly the same way: discard the new run without logging it
      to the vylog on a memory dump completion if the index was dropped while
      the dump was in progress.
      
      Closes #10277
      
      NO_DOC=bug fix
      ae6a02eb
    • Vladimir Davydov's avatar
      errinj: log error injection value · 019bacbe
      Vladimir Davydov authored
      Let's log the new value when an error injection is set in orer to ease
      debugging in tests.
      
      NO_DOC=logging
      NO_TEST=logging
      NO_CHANGELOG=logging
      019bacbe
  2. Jul 22, 2024
    • Alexander Turenko's avatar
      config/schema: support field deletion using :set() · 8a0013ea
      Alexander Turenko authored
      This commit implements the `<schema object>:set()` algorithm in a more
      accurate way and it solves several drawbacks of the previous
      implementation.
      
      * It was impossible to set a field that is nested to a record or a map
        that has the box.NULL value (#10190).
      * It was impossible to set a field to the box.NULL value (#10193).
      * It was impossible to delete a field, now `nil` RHS value means the
        deletion (#10194).
      
      Fixes #10190
      Fixes #10193
      Fixes #10194
      
      NO_DOC=Included into https://github.com/tarantool/doc/issues/4279
      8a0013ea
    • Georgy Moiseev's avatar
      lua: bump metrics module · e79982e7
      Georgy Moiseev authored
      Bump metrics package submodule. Commits from PRs [1-4] affect
      Tarantool, the other ones are related to module infrastructure.
      
      1. https://github.com/tarantool/metrics/pull/482
      2. https://github.com/tarantool/metrics/pull/483
      3. https://github.com/tarantool/metrics/pull/484
      4. https://github.com/tarantool/metrics/pull/491
      
      NO_DOC=doc is a part of submodule
      e79982e7
    • Vladimir Davydov's avatar
      tuple: allocate formats table statically · a2da1de7
      Vladimir Davydov authored
      The tuple formats table may be accessed with `tuple_format_by_id()` from
      any thread, not just tx. For example, it's accessed by a vinyl writer
      thread when it deletes a tuple. If a thread happens to access the table
      while it's being reallocated by tx, see `tuple_format_register()`,
      the accessing thread may crash with a use-after-free or NULL pointer
      dereference bug, like the one below:
      
      ```
       # 1  0x64bd45c09e22 in crash_signal_cb+162
       # 2  0x76ce74e45320 in __sigaction+80
       # 3  0x64bd45ab070c in vy_run_writer_append_stmt+700
       # 4  0x64bd45ada32a in vy_task_write_run+234
       # 5  0x64bd45ad84fe in vy_task_f+46
       # 6  0x64bd45a4aba0 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+16
       # 7  0x64bd45c13e66 in fiber_loop+70
       # 8  0x64bd45e83b9c in coro_init+76
      ```
      
      To avoid that, let's make the tuple formats table statically allocated.
      This shouldn't increase actual memory usage because system memory is
      allocated lazily, on page fault. The max number of tuple formats isn't
      that big (64K) to care about the increase in virtual memory usage.
      
      Closes #10278
      
      NO_DOC=bug fix
      NO_TEST=mt race
      a2da1de7
    • Alexander Turenko's avatar
      config/schema: allow to index 'any' type in :get() · 0f8c715d
      Alexander Turenko authored
      `<schema object>:get()` now can access a field inside the `any` type if
      it is a `table` or `nil`/`box.NULL`.
      
      `config:get()` now can access fields inside `app.cfg.<key>` and
      `roles_cfg.<key>`.
      
      Fixes #10205
      
      NO_DOC=The `<schema object>:get()` update is included into
             https://github.com/tarantool/doc/issues/4279.
             The `config:get()` reference on the website doesn't mention the
             constraint, so it doesn't need an update.
      0f8c715d
  3. Jul 18, 2024
    • Vladimir Davydov's avatar
      vinyl: wake up waiters after clearing checkpoint_in_progress flag · fc3196dc
      Vladimir Davydov authored
      The function `vy_space_build_index`, which builds a new index on DDL,
      calls `vy_scheduler_dump` on completion. If there's a checkpoint in
      progress, the latter will wait on `vy_scheduler::dump_cond` until
      `vy_scheduler::checkpoint_in_progress` is cleared. The problem is
      `vy_scheduler_end_checkpoint` doesn't broadcast `dump_cond` when it
      clears the flag. Usually, everything works fine because the condition
      variable is broadcast on any dump completion, and vinyl checkpoint
      implies a dump, but under certain conditions this may lead to a fiber
      hang. Let's broadcast `dump_cond` in `vy_scheduler_end_checkpoint`
      to be on the safe side.
      
      While we are at it, let's also inject a dump delay to the original
      test to make it more robust.
      
      Closes #10267
      Follow-up #10234
      
      NO_DOC=bug fix
      fc3196dc
  4. Jul 17, 2024
    • Nikita Zheleztsov's avatar
      iproto: introduce FETCH_SNAPSHOT_CURSOR feature · 62c49367
      Nikita Zheleztsov authored
      This commit introduces FETCH_SNAPSHOT_CURSOR feature, which is available
      only in EE. The feature is not returned in response to IPROTO_ID and is
      not shown in box.iproto.protocol_features in Community Edition. Its id
      is shown only in box.iproto.feature, which is a list of all available
      features in the current version.
      
      Needed for tarantool/tarantool-ee#741
      
      NO_CHANGELOG=minor
      
      @TarantoolBot document
      Title: Document iproto feature FETCH_SNAPSHOT_CURSOR
      
      Root document: https://www.tarantool.io/en/doc/latest/reference/reference_lua/net_box/#net-box-connect
      
      FETCH_SNAPSHOT_CURSOR feature requires cursor FETCH_SNAPSHOT on the
      server. Its ID is IPROTO_FEATURE_FETCH_SNAPSHOT_CURSOR. IPROTO version
      is 8 or more, Enterprise Edition is also required.
      62c49367
    • Nikita Zheleztsov's avatar
      engine: introduce stubs for checkpoint FETCH_SNAPSHOT · 2fca5c13
      Nikita Zheleztsov authored
      This commit introduces engine stubs that enable a new method
      of fetching snapshots for anonymous replicas. Instead of using
      the traditional read-view join approach, this update allows
      file snapshot fetching. Note that file snapshot fetching
      is only available in Tarantool EE.
      
      Checkpoint fetching is done via IPROTO_IS_CHECKPOINT_JOIN,
      IPROTO_CHECKPOINT_VCLOCK and IPROTO_CHECKPOINT_LSN fields.
      
      If IPROTO_CHECKPOINT_JOIN is set to true, join will be done from
      files: .snap for memtx, .run for vinyl, if false - from read view.
      
      Checkpoint join allows to continue from the place, where client
      stopped in case of snapshot fetching error. This allows to avoid
      rebootstrap of an anonymous client. This can be done by specifying
      CHECKPOINT_VCLOCK, which says from which file server should continue
      join, client gets vclock at the beginning of the join. Specifying
      CHECKPOINT_LSN allows to continue from some position in checkpoint.
      Server sends all data >= CHECKPOINT_LSN.
      
      If CHECKPOINT_VCLOCK is not specified, fetching is done from the latest
      available checkpoint. If CHECKPOINT_LSN is not specified - start from
      the beginning of the snap. So, specifying only IS_CHECKPOINT_JOIN
      triggers fetching the latest checkpoint from files.
      
      Needed for tarantool/tarantool-ee#741
      
      NO_DOC=ee
      NO_TEST=ee
      NO_CHANGELOG=ee
      2fca5c13
    • Nikita Zheleztsov's avatar
      engine: send vclock with 0th component during join · 56058393
      Nikita Zheleztsov authored
      This commit makes engine to send vclock without ignoring 0th component
      during join, which is needed for checkpoint FETCH SNAPSHOT.
      
      Currently engine join functions are invoked only from
      relay_initial_join, which is done during JOIN or FETCH SNAPSHOT.
      They respond with vclock of the read view we're going to send.
      
      In the following commit checkpoint FETCH SNAPSHOT will be introduced,
      which responds with vclock of the checkpoint, we're going to send.
      Such vclock may include 0th component and it's crucial to send it to
      a client, as in case of connection failure, client will send us the
      same vclock and we'll have to use its signature to figure out, which
      checkpoint client wants.
      
      So, we have to send and receive 0th component of the vclock during
      FETCH_SNAPSHOT. This commit also introduces decoding vclocks without
      ignoring 0th component, as they'll be used in the following commit too.
      
      Needed for tarantool/tarantool-ee#741
      
      NO_DOC=internal
      NO_TEST=ee
      NO_CHANGELOG=internal
      56058393
    • Nikita Zheleztsov's avatar
      xrow: rename xrow_encode_vclock · 313bd730
      Nikita Zheleztsov authored
      This commit renames xrow_encode_vlock to xrow_encode_vclock_ignore0
      since the next commit will introduce encoding vclock without ignoring
      0th component, which is needed during sending the response to fetch
      snapshot request.
      
      This commit also removes internal field inside the replication_request
      structure, as the following commit will use 'vclock' for
      encoding/decoding vclock without ignoring component.
      
      Needed for tarantool/tarantool-ee#741
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      313bd730
    • Nikita Zheleztsov's avatar
      relay: refactor relay_initial_join · 72cc2b3e
      Nikita Zheleztsov authored
      From now on during initial join memtx engine prepares vclock, raft and
      limbo states, it also sends them during memtx_engine_join.
      
      It's done in order to simplify the code of initial join, as in the
      consequent commit checkpoint initial join will be introduced and we want
      relay code to handle it the same as read-view join without confusing
      conditions.
      
      Needed for tarantool/tarantool-ee#741
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      72cc2b3e
    • Nikita Zheleztsov's avatar
      engine: move raft and limbo states after system data in checkpoint · 3da31b83
      Nikita Zheleztsov authored
      Before this commit raft and limbo states were written at the end of the
      checkpoint, which makes it very costly to access them.
      
      Checkpoint join needs to access limbo and raft state in order to send
      them during JOIN_META stage. We cannot use the latest states, like it's
      done for read-view snapshot fetching: states may be far ahead of the
      data, written to the checkpoint, which we're going to send.
      
      This commit moves raft and limbo states after data from the system
      spaces but before user data. We cannot put them right at the beginning
      of the snapshot, because then we'll have to patch recovery process,
      which currently strongly relies on the fact, that system spaces are
      at the beginning of the snapshot (this was done in order to apply force
      recovery only for user data). If we patch recovery process, then old
      versions, where it's unpatched, won't be able to recover from the
      snapshots done by the newer version, compatibility of snapshots will be
      broken.
      
      The current change is not breaking, old Tarantool versions can restore
      from the snapshot made by the newer one.
      
      Needed for tarantool/tarantool-ee#741
      
      NO_DOC=internal
      NO_CHANGELOG=internal
      3da31b83
  5. Jul 16, 2024
    • Nikita Zheleztsov's avatar
      applier: fix assertion failure after split brain · 5ce010c5
      Nikita Zheleztsov authored
      After receiving async transaction from an old term applier_apply_tx
      exits without unlocking the latch. If the same applier tries to
      subscribe for replication, it fails with assertion, as the latch is
      already locked.
      
      Let's fix the function, which raises error so that it just sets
      diag and returns -1.
      
      Closes #10073
      
      NO_DOC=bugfix
      NO_CHANGELOG=no crash on release version
      5ce010c5
    • Ilya Verbin's avatar
      error: introduce ERRINJ_TUPLE_ALLOC_COUNTDOWN · 52926402
      Ilya Verbin authored
      Needed for tarantool/tarantool-ee#712
      
      NO_DOC=internal
      NO_TEST=internal
      NO_CHANGELOG=internal
      52926402
  6. Jul 15, 2024
    • Vladislav Shpilevoy's avatar
      applier: drop apply_final_join_tx · da158b9b
      Vladislav Shpilevoy authored
      Can use the regular applier_apply_tx(), they do the same. The
      latter is just more protective, but doesn't matter much in this
      case if the code does a few latch locks.
      
      The patch also drops an old test about double-received row panic
      during final join. The logic is that absolutely the same situation
      could happen during subscribe, but it was always filtered out by
      checking replicaset.applier.vclock and skipping duplicate rows.
      
      There doesn't seem to be a reason why final join must be any
      different. It is, after all, same subscribe logic but the received
      rows go into replica's initial snapshot instead of xlogs. Now it
      even uses the same txn processing function applier_apply_tx().
      
      The patch also moves `replication_skip_conflict` option setting
      after bootstrap is finished. In theory, final join could deliver
      a conflicting row and it must not be ignored. The problem is that
      it can't be reproduced anyhow without illegal error injection
      (which would corrupt something in an unrealistic way). But lets
      anyway move it below bootstrap for clarity.
      
      Follow-up #10113
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      da158b9b
    • Vladislav Shpilevoy's avatar
      box: make instance_vclock const · 19b2cc20
      Vladislav Shpilevoy authored
      No code besides box.cc can now update instance's vclock
      explicitly. That is a protection against hacks like #9916.
      
      Closes #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      19b2cc20
    • Vladislav Shpilevoy's avatar
      box: make final join vclock update only in box.cc · fe338ed4
      Vladislav Shpilevoy authored
      The goal is to make sure that no files except box.cc can change
      instance_vclock_storage directly. That leads to all sorts of hacks
      which in turn lead to bugs - #9916 is a good example.
      
      Now applier on final join only sends rows into the journal. The
      journal then is handled by box.cc where vclock is properly
      updated.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      fe338ed4
    • Vladislav Shpilevoy's avatar
      journal: extract journal_write_row from limbo · 7d10096c
      Vladislav Shpilevoy authored
      The function writes a single xrow into the journal in a blocking
      way. It isn't so simple, so makes sense to keep as a function,
      especially given that it will be used more in the next commit.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      7d10096c
    • Vladislav Shpilevoy's avatar
      box: move recovery_journal creation · 2620eb9e
      Vladislav Shpilevoy authored
      Recovery journal uses word "recovery" to say that it works with
      xlogs. For snapshot recovery there is bootstrap_journal. Lets use
      it during local snapshot recovery.
      
      The reasoning is that while right now there is no difference, in
      next commits the recovery_journal will do more.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      2620eb9e
    • Vladislav Shpilevoy's avatar
      box: move replicaset.vclock into instance_vclock · f1e8e4e1
      Vladislav Shpilevoy authored
      Storing vclock of the instance in replicaset.vclock wasn't right.
      It wasn't vclock of the whole replicaset. It was local to this
      instance. There is no such thing as "replicaset vclock".
      
      The patch moves it to box.h/cc.
      
      Part of #10113
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      f1e8e4e1
    • Vladislav Shpilevoy's avatar
      applier: treat register txns like regular ones · 51751f87
      Vladislav Shpilevoy authored
      Applier during the registration waiting (for registering a new ID
      or a name) could keep doing the master txns received before the
      registration was started. They could still be inside WAL doing a
      disk write, when the replica sends a register request.
      
      Before this commit, it could cause an assertion failure in debug
      and a double LSN error in release.
      
      The reason was that during the registration waiting the applier
      treated all incoming txns as "final join" txns. I.e. it wasn't
      checking if those txns were already received, but not committed
      yet.
      
      During normal subscribe process the appliers (potentially
      multiple) protect themselves from that by keeping track of the
      vclocks which are already applied and also being applied right now
      (replicaset.applier.vclock).
      
      Such protection ensures that receiving same row from 2 appliers
      wouldn't result into its double write. It also protects from the
      case when a txn was received, goes to WAL, but then the applier
      reconnects, resubscribes, and gets the same txn again - it
      shouldn't be applied.
      
      The patch makes so that the registration waiting after recovery
      works like subscribe. Registration during recovery would mean
      bootstrap via join. And outside of recovery it means the instance
      is already running.
      
      Closes #9916
      
      NO_DOC=bugfix
      51751f87
    • Nikolay Shirokovskiy's avatar
      raft: finish worker fiber on Tarantool shutdown · ab386e65
      Nikolay Shirokovskiy authored
      Let's make sure box raft worker fiber is finished on Tarantool shutdown as
      we are going to free fibers stacks. If fiber is not finished it's stack
      may have references to objects on heap. Thus as fiber stack will be freed
      we will have FP memory leaks under ASAN.
      
      Part of #9722
      
      NO_TEST=rely on existing tests
      NO_CHANGELOG=internal
      NO_DOC=internal
      ab386e65
    • Nikolay Shirokovskiy's avatar
      net.box: finish worker fiber on Tarantool shutdown · 50ce6ebe
      Nikolay Shirokovskiy authored
      Let's make sure net.box system fiber is finished on Tarantool shutdown as
      we are going to free fibers stacks. If fiber is not finished it's stack
      may have references to objects on heap. Thus as fiber stack will be freed
      we will have FP memory leaks under ASAN.
      
      Part of #9722
      
      NO_CHANGELOG=internal
      NO_DOC=internal
      50ce6ebe
    • Nikolay Shirokovskiy's avatar
      core: fiber pool shutdown · e513d4ce
      Nikolay Shirokovskiy authored
      Fiber pool shutdown is finishing all idle fibers. Any message processing
      is finished earlier on client fiber shutdown.
      
      We need some changes in shutdown order to make fiber pool shutdown.
      First we need to move stopping of iproto threads from free to shutdown
      step. The issue is we want to destroy "tx" endpoint which requires all
      pipes connected to it to be destroyed first. There are pipes in iproto
      threads that connected to "tx". Currently we destroy pipes in free step
      and at this point as there is no event loop in tx thread
      `cbus_endpoint_destroy` can't receive notifications that pipes are
      destroyed.
      
      Originally we put stopping of iproto threads to the free step because we
      don't have client fibers shutdown. So it was convenient to have working
      `net_pipe` so that client fibers can use iproto API without adding extra
      logic to them. Now I guess it make sense to stop client fibers before
      iproto shutdown. This is the second change in shutdown order.
      
      There is another reason why we have iproto shutdown before client fiber
      shutdown. In the process of iproto shutdown we close connections first
      and then cancel all requests in progress. This way client do not receive
      unexpected `FiberIsCancelled` errors in the process of server shutdown.
      After the patch it not so. Well we may close connections as an extra
      step before client fibers shutdown. But let's leave it this way.  Good
      clients can subscribe to servere shutdown and prepare for it.  Otherwise
      they may receive `FiberIsCancelled` for theier request which looks
      sensible.
      
      It is also makes sense now to move watcher and client fiber shutdown
      to `box_shutdown` as we can both use watcher and create client fibers
      without starting a box.
      
      While at it also drop a note in code why we shutdown watcher before even
      fiber clients shutdown.
      
      Part of #9722
      
      NO_CHANGELOG=internal
      NO_DOC=internal
      e513d4ce
    • Nikolay Shirokovskiy's avatar
      memtx: shutdown memtx gc fiber · f944953e
      Nikolay Shirokovskiy authored
      As we are going to check for memory leaks in ASAN we need to wait while
      memtx gc finishes freeing tuples of dropped primary index or keys of
      dropped functional indexes.
      
      Part of #9722
      
      NO_CHANGELOG=internal
      NO_DOC=internal
      f944953e
    • Vladimir Davydov's avatar
      vinyl: use broadcast instead of signal to notify about dump completion · 30547157
      Vladimir Davydov authored
      There may be more than one fiber waiting on `vy_scheduler::dump_cond`:
      
      ```
      box.snapshot
        vinyl_engine_wait_checkpoint
          vy_scheduler_wait_checkpoint
      
      space.create_index
        vinyl_space_build_index
          vy_scheduler_dump
      ```
      
      To avoid hang, we should use `fiber_cond_broadcast`.
      
      Closes #10233
      
      NO_DOC=bug fix
      30547157
    • Lev Kats's avatar
      trivia: use __builtin* for offsetof macro · 27e94824
      Lev Kats authored
      Changed default tarantool `offsetof` macro implementation so it don't
      access members of null pointer in typeof that triggers UBsan.
      
      Needed for #10143
      
      NO_DOC=bugfix
      NO_CHANGELOG=minor
      NO_TEST=tested manually with fuzzer
      27e94824
  7. Jul 09, 2024
  8. Jul 03, 2024
    • Alexander Turenko's avatar
      config: expose experimental.config.utils.schema · ef716a3e
      Alexander Turenko authored
      The module is renamed from `internal.config.utils.schema` to
      `experimental.config.utils.schema` without changes.
      
      It is useful for validation of configuration data in roles and
      applications.
      
      Also, it provides a couple of methods that aim to simplify usual tasks
      around processing of hierarchical configuration data. For example,
      
      * get/set a nested value
      * apply defaults from the schema
      * filter data based on annotations from the schema
      * transform a hierarchical data using a function
      * merge two hierarchical values
      * parse environment variable according to its type in the schema
      
      See https://github.com/tarantool/doc/issues/4279 for an in-depth
      description.
      
      Fixes #10117
      
      NO_DOC=https://github.com/tarantool/doc/issues/4279
      ef716a3e
  9. Jun 26, 2024
    • Nikolay Shirokovskiy's avatar
      box: fix memleak on functional index drop · 319357d5
      Nikolay Shirokovskiy authored
      We just don't free functional index keys on functional index drop now.
      Let's approach keys deletion as in the case of primary index drop ie
      let's drop these keys in background.
      
      We should set `use_hint` to `true` in case of MEMTX_TREE_VTAB_DISABLED
      tree index methods because `memtx_tree_disabled_index_vtab` uses
      `memtx_tree_index_destroy<true>`. Otherwise we get read outside of index
      structure for stub functional index on destroy for introduced `is_func`
      field (which is reported by ASAN).
      
      Closes #10163
      
      NO_DOC=bugfix
      319357d5
  10. Jun 24, 2024
  11. Jun 20, 2024
    • Nikolay Shirokovskiy's avatar
      ci: add workflow to check downgrade versions · 6d856347
      Nikolay Shirokovskiy authored
      Tarantool has hardcoded list of versions it can downgrade to. This list
      should consist of all the released versions less than Tarantool version.
      This workflow helps to make sure we update the list before release.
      
      It is run on pushing release tag to the repo, checks the list and fails
      if it misses some released version less than current. In this case we
      are supposed to update downgrade list (with required downgrade code) and
      update the release tag.
      
      Closes #8319
      
      NO_TEST=ci
      NO_CHANGELOG=ci
      NO_DOC=ci
      6d856347
  12. Jun 18, 2024
  13. Jun 17, 2024
    • DerekBum's avatar
      box: feature `tuple:format` to get a format of a tuple · 6d5f1db5
      DerekBum authored
      This patch adds `tuple:format()` method to get a format
      of a tuple.
      
      Closes #10005
      
      @TarantoolBot document
      Title: New `format` method for `box.tuple`
      Product: Tarantool
      Since: 3.2
      
      The `tuple:format` method returns a format of a tuple.
      6d5f1db5
  14. Jun 13, 2024
    • Vladimir Davydov's avatar
      vinyl: fix gc vs vylog race leading to duplicate record · 9d3859b2
      Vladimir Davydov authored
      Vinyl run files aren't always deleted immediately after compaction,
      because we need to keep run files corresponding to checkpoints for
      backups. Such run files are deleted by the garbage collection procedure,
      which performs the following steps:
      
       1. Loads information about all run files from the last vylog file.
       2. For each loaded run record that is marked as dropped:
          a. Tries to remove the run files.
          b. On success, writes a "forget" record for the dropped run,
             which will make vylog purge the run record on the next
             vylog rotation (checkpoint).
      
      (see `vinyl_engine_collect_garbage()`)
      
      The garbage collection procedure writes the "forget" records
      asynchronously using `vy_log_tx_try_commit()`, see `vy_gc_run()`.
      This procedure can be successfully executed during vylog rotation,
      because it doesn't take the vylog latch. It simply appends records
      to a memory buffer which is flushed either on the next synchronous
      vylog write or vylog recovery.
      
      The problem is that the garbage collection isn't necessarily loads
      the latest vylog file because the vylog file may be rotated between
      it calls `vy_log_signature()` and `vy_recovery_new()`. This may
      result in a "forget" record written twice to the same vylog file
      for the same run file, as follows:
      
        1. GC loads last vylog N
        2. GC starts removing dropped run files.
        3. CHECKPOINT starts vylog rotation.
        4. CHECKPOINT loads vylog N.
        5. GC writes a "forget" record for run A to the buffer.
        6. GC is completed.
        7. GC is restarted.
        8. GC finds that the last vylog is N and blocks on the vylog latch
           trying to load it.
        9. CHECKPOINT saves vylog M (M > N).
       10. GC loads vylog N. This triggers flushing the forget record for
           run A to vylog M (not to vylog N), because vylog M is the last
           vylog at this point of time.
       11. GC starts removing dropped run files.
       12. GC writes a "forget" record for run A to the buffer again,
           because in vylog N it's still marked as dropped and not forgotten.
           (The previous "forget" record was written to vylog M).
       13. Now we have two "forget" records for run A in vylog M.
      
      Such duplicate run records aren't tolerated by the vylog recovery
      procedure, resulting in a permanent error on the next checkpoint:
      
      ```
      ER_INVALID_VYLOG_FILE: Invalid VYLOG file: Run XXXX forgotten but not registered
      ```
      
      To fix this issue, we move `vy_log_signature()` under the vylog latch
      to `vy_recovery_new()`. This makes sure that GC will see vylog records
      that it's written during the previous execution.
      
      Catching this race in a function test would require a bunch of ugly
      error injections so let's assume that it'll be tested by fuzzing.
      
      Closes #10128
      
      NO_DOC=bug fix
      NO_TEST=tested manually with fuzzer
      9d3859b2
    • Vladimir Davydov's avatar
      tuple: don't use offset_slot_cache in vinyl threads · 19d1f1cc
      Vladimir Davydov authored
      `key_part::offset_slot_cache` and `key_part::format_epoch` are used for
      speeding up tuple field lookup in `tuple_field_raw_by_part()`. These
      structure members are accessed and updated without any locks, assuming
      this code is executed exclusively in the tx thread. However, this isn't
      necessarily true because we also perform tuple field lookups in vinyl
      read threads. Apparently, this can result in unexpected races and bugs,
      for example:
      
      ```
        #1  0x590be9f7eb6d in crash_collect+256
        #2  0x590be9f7f5a9 in crash_signal_cb+100
        #3  0x72b111642520 in __sigaction+80
        #4  0x590bea385e3c in load_u32+35
        #5  0x590bea231eba in field_map_get_offset+46
        #6  0x590bea23242a in tuple_field_raw_by_path+417
        #7  0x590bea23282b in tuple_field_raw_by_part+203
        #8  0x590bea23288c in tuple_field_by_part+91
        #9  0x590bea24cd2d in unsigned long tuple_hint<(field_type)5, false, false>(tuple*, key_def*)+103
        #10 0x590be9d4fba3 in tuple_hint+40
        #11 0x590be9d50acf in vy_stmt_hint+178
        #12 0x590be9d53531 in vy_page_stmt+168
        #13 0x590be9d535ea in vy_page_find_key+142
        #14 0x590be9d545e6 in vy_page_read_cb+210
        #15 0x590be9f94ef0 in cbus_call_perform+44
        #16 0x590be9f94eae in cmsg_deliver+52
        #17 0x590be9f9583e in cbus_process+100
        #18 0x590be9f958a5 in cbus_loop+28
        #19 0x590be9d512da in vy_run_reader_f+381
        #20 0x590be9cb4147 in fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*)+34
        #21 0x590be9f8b697 in fiber_loop+219
        #22 0x590bea374bb6 in coro_init+120
      ```
      
      Fix this by skipping this optimization for threads other than tx.
      
      No test is added because reproducing this race is tricky. Ideally, bugs
      like this one should be caught by fuzzing tests or thread sanitizers.
      
      Closes #10123
      
      NO_DOC=bug fix
      NO_TEST=tested manually with fuzzer
      19d1f1cc
    • Vladimir Davydov's avatar
      vinyl: fix cache iterator skipping tuples in read view · 7b72080d
      Vladimir Davydov authored
      The tuple cache doesn't store older tuple versions so if a reader is
      in a read view, it must skip tuples that are newer than the read view,
      see `vy_cache_iterator_stmt_is_visible()`. A reader must also ignore
      cached intervals if any of the tuples used as a boundary is invisible
      from the read view, see `vy_cache_iterator_skip_to_read_view()`.
      There's a bug in `vy_cache_iterator_restore()` because of which such
      an interval may be returned to the reader: when we step backwards
      from the last returned tuple we consider only one of the boundaries.
      As a result, if the other boundary is invisible from the read view,
      the reader will assume there's nothing in the index between the
      boundaries and skip reading older sources (memory, disk). Fix this by
      always checking if the other boundary is visible.
      
      Closes #10109
      
      NO_DOC=bug fix
      7b72080d
    • Vladimir Davydov's avatar
      vinyl: fix run iterator skipping tuples following non-terminal statement · 72763f94
      Vladimir Davydov authored
      If a run iterator is positioned at a non-terminal statement (UPSERT or
      UPDATE), `vy_run_iterator_next()` will iterate over older statements
      with the same key using `vy_run_iterator_next_lsn()` to build the key
      history. While doing so, it may reach the end of the run file (if the
      current key is the last in the run). This would stop iteration
      permanently, which is apparently wrong for reverse iterators (LE or LT):
      if this happens the run iterator won't return any keys preceding the
      last one in the run file. Fix this by removing `vy_run_iterator_stop()`
      from `vy_run_iterator_next_lsn()`.
      
      Part of #10109
      
      NO_DOC=bug fix
      NO_CHANGELOG=next commit
      72763f94
  15. Jun 10, 2024
Loading