  1. Aug 24, 2020
  2. Aug 20, 2020
  3. Aug 17, 2020
    • xrow: introduce struct synchro_request · ee07eab4
      Vladislav Shpilevoy authored
      All requests saved to WAL and transmitted through network have
      their own request structure with parameters:
      - struct request for DML;
      - struct call_request for CALL/EVAL;
      - struct auth_request for AUTH;
      - struct ballot for VOTE;
      - struct sql_request for SQL;
      - struct greeting for greeting.
      
      It is done for a reason - not to pass all the request parameters
      into each function one by one, and manage them all at once
      instead.
      
      This was not done for the synchronous requests IPROTO_CONFIRM and
      IPROTO_ROLLBACK, because so far it was not too hard to carry just
      2 parameters from their body: lsn and replica_id.
      
      But that will change in #5129. In fact these requests have more
      parameters, but they were filled in by the txn module, since
      synchro requests were saved to WAL via transactions (due to the
      lack of an alternative API to access WAL).
      
      After #5129 it will be necessary to save LSN and replica_id of the
      request author. This patch introduces struct synchro_request to
      simplify extension of the synchro parameters.
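
As a rough illustration of why grouping the parameters helps, here is a minimal Python model (not Tarantool's C code). The class name, the numeric type codes and the `describe` helper are assumptions made for this sketch; only the `lsn` and `replica_id` parameters come from the commit message.

```python
from dataclasses import dataclass

# Hypothetical numeric codes, chosen only for this sketch.
IPROTO_CONFIRM = 1
IPROTO_ROLLBACK = 2


@dataclass
class SynchroRequest:
    """Groups the parameters of a CONFIRM/ROLLBACK request in one object."""
    type: int        # IPROTO_CONFIRM or IPROTO_ROLLBACK
    replica_id: int  # replica_id carried in the request body
    lsn: int         # LSN carried in the request body


def describe(req: SynchroRequest) -> str:
    # A consumer takes one argument instead of a growing parameter list.
    kind = "CONFIRM" if req.type == IPROTO_CONFIRM else "ROLLBACK"
    return f"{kind} replica_id={req.replica_id} lsn={req.lsn}"
```

Adding another field later (for example the request author's LSN) would then touch the structure and the code that fills it, not every function signature in between.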
      
      Closes #5151
      Needed for #5129
      ee07eab4
  4. Aug 15, 2020
    • applier: drop a couple of unnecessary arguments · 7e16e45e
      Vladislav Shpilevoy authored
      Applier on_rollback and on_wal_write don't need any arguments -
      they either work with a global state, or with the signaled applier
      stored inside the trigger.
      
      However, the transaction object was passed into on_wal_write()
      and on_rollback() anyway, unused.
      
      Even if it were used, this would have to be fixed, because soon
      these triggers will be fired not only for traditional 'txn'
      transactions. They will be used by the synchro request WAL writes
      too, and those don't have 'transactions'.
      
      Part of #5129
      7e16e45e
  5. Aug 13, 2020
    • Ensure all curl symbols are exported · 29ec6289
      Yaroslav Dynnikov authored
      In the recent update of libcurl (2.5.0-278-g807c7fa58) its layout has
      changed: private function `Curl_version_init()` which used to fill-in
      info structure was eliminated. As a result, no symbols for
      `libcurl_la-version.o` remained used, so it wasn't included in tarantool
      binary. And `curl_version` and `curl_version_info` symbols went missing.
      
      According to libcurl naming conventions all exported symbols are named
      as `curl_*`. This patch lists them all explicitly in `exports.h` and
      adds the test.
      
      Close #5223
      29ec6289
  6. Aug 12, 2020
    • tuple: fix access by JSON path starting from '[*]' · 718267aa
      Vladislav Shpilevoy authored
      Tuple JSON field access crashed when '[*]' was used as the first
      part of the JSON path. The patch makes it be treated as 'field
      not found'.
      
      Follow-up #5224
      718267aa
    • tuple: fix multikey field JSON access crash · 5c15df68
      Vladislav Shpilevoy authored
      When a tuple had format with multikey indexes in it, any attempt
      to get a multikey indexed field by a JSON path from Lua led to a
      crash.
      
      That was because of incorrect interpretation of offset slot value
      in tuple's field map.
      
      Tuple field map is an array stored before the tuple's MessagePack
      data. Each element is a 4 byte offset to an indexed value, which
      makes it possible to get the value in O(1) time without
      MessagePack decoding of all the previous fields.
      
      At least it was so before multikeys. Now tuple field map is not
      just an array. It is rather a 2-level array, somewhat similar to
      ext4 FS. Some elements of the root array are positive numbers
      pointing at data. Some elements point at a second 'indirect'
      array, so called 'extra', size of which is individual for each
      tuple. These second arrays are used by multikey indexes to store
      offsets to each multikey indexed value in a tuple.
      
      It means that if there is an offset slot, it can't be just used
      as is. That is allowed only if the field is not multikey. Otherwise
      it is necessary to somehow get an index in the second 'indirect'
      array.
      
      This is what was happening - a multikey field was found, its
      offset slot was valid, but it was pointing at an 'indirect' array,
      not at the data. JSON tuple field access tried to use it as a data
      offset.
      
      The patch makes JSON field access degrade to fullscan when a field
      is multikey, but no multikey array index is provided.
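
The lookup rule described above can be sketched with a small Python model. It is not the real field map layout: here a root slot is either an integer data offset or a Python list standing in for the 'indirect' array, and the function name `lookup_offset` is made up for the illustration. Returning `None` stands for "fall back to a full scan".

```python
from typing import List, Optional, Union

# A root slot is either a data offset (int) or an 'indirect' array of
# offsets, one per multikey item. This encoding is an assumption of the
# sketch, not the real binary layout.
Slot = Union[int, List[int]]


def lookup_offset(root: List[Slot], slot: int,
                  multikey_idx: Optional[int] = None) -> Optional[int]:
    entry = root[slot]
    if isinstance(entry, int):
        # Plain field: the slot value is the data offset itself.
        return entry
    if multikey_idx is None:
        # Multikey field, but no index into the indirect array was given:
        # the slot must not be used as a data offset.
        return None
    if multikey_idx >= len(entry):
        return None
    return entry[multikey_idx]
```

For `root = [10, [20, 27, 31]]`, slot 0 resolves directly, while slot 1 resolves only when a multikey index is supplied.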
      
      Closes #5224
      5c15df68
  7. Aug 11, 2020
    • box: snapshot should not include rolled back data · 6f70020d
      Vladislav Shpilevoy authored
      box.snapshot() could include rolled back data if a synchronous
      transaction ROLLBACK arrived during WAL rotation in preparation of
      a checkpoint.
      
      More specifically, snapshot consists of fixating the engines'
      content (creation of a read-view), doing WAL rotation, and writing
      the snapshot itself. All data changes after content fixation won't
      go into the snap. So if ROLLBACK arrives during WAL rotation, the
      fixated content will have rolled back data, not present in the
      newest dataset.
      
      The patch makes the snapshot fail if anything was rolled back
      during WAL rotation. The bug sometimes appeared in an existing
      test about qsync snapshots, but with very poor reproducibility.
      In a new test file it is reproduced 100% of the time without the
      patch.
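
One way to picture the check is a rollback counter compared before and after WAL rotation. The Python sketch below is a toy model under that assumption; the `Engine` class, the counter and the `checkpoint` function are hypothetical and do not mirror the actual implementation.

```python
class Engine:
    """Toy storage engine that counts rollbacks."""

    def __init__(self):
        self.data = {}
        self.rollback_count = 0

    def rollback(self, key):
        self.data.pop(key, None)
        self.rollback_count += 1


def checkpoint(engine, rotate_wal):
    # Step 1: fix the engine content (a read-view in the real system).
    read_view = dict(engine.data)
    seen = engine.rollback_count
    # Step 2: WAL rotation; a ROLLBACK may arrive while it is running.
    rotate_wal()
    # Step 3: refuse to write a snapshot that contains rolled back data.
    if engine.rollback_count != seen:
        raise RuntimeError("rollback during WAL rotation, snapshot aborted")
    return read_view
```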
      
      Closes #5167
      6f70020d
  8. Jul 31, 2020
    • txn_limbo: handle duplicate ACKs · 6e11674d
      Vladislav Shpilevoy authored
      Replica can send the same ACK multiple times. This is relatively
      easy to achieve.
      
      ACK is a message from the replica containing its vclock. It is
      sent on each update of the replica's vclock. The update does not
      necessarily mean that master's LSN was changed. Replica could
      write something locally, with its own instance_id. Vclock is
      changed and sent to the master, but from the limbo's point of
      view it looks like a duplicated ACK, because the received
      master's LSN didn't change.
      
      The patch makes limbo ignore duplicated ACKs.
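
A minimal Python model of the idea, assuming a vclock is a plain `{instance_id: lsn}` dictionary. The class and method names are invented for the sketch.

```python
class AckTracker:
    """Remembers the last master LSN acknowledged by each replica."""

    def __init__(self, master_id):
        self.master_id = master_id
        self.acked_lsn = {}  # replica_id -> last acknowledged master LSN

    def ack(self, replica_id, vclock):
        """Return True if the ACK carries new information for the limbo."""
        lsn = vclock.get(self.master_id, 0)
        if lsn <= self.acked_lsn.get(replica_id, 0):
            # The master's component did not grow: a duplicate ACK.
            return False
        self.acked_lsn[replica_id] = lsn
        return True
```

A replica that only wrote something locally sends a vclock whose master component is unchanged, so the second call below is ignored: `ack(2, {1: 5, 2: 0})` is accepted, `ack(2, {1: 5, 2: 1})` is not.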
      
      Closes #5195
      Part of #5219
      6e11674d
    • txn_limbo: handle CONFIRM during ROLLBACK · e7559bfe
      Vladislav Shpilevoy authored
      Limbo could try to CONFIRM LSN whose ROLLBACK is in progress. This
      is how it could happen:
      
      - A synchronous transaction is created, written to WAL;
      - The fiber sleeps in the limbo waiting for CONFIRM or timeout;
      - Timeout happens. ROLLBACK for this and all next LSNs is sent to
        WAL;
      - Replica receives the transaction, sends ACK;
      - Master receives ACK, starts writing CONFIRM for the LSN, whose
        ROLLBACK is in progress right now.
      
      Another case is an attempt to lower the synchro quorum during a
      ROLLBACK write. It also could try to write CONFIRM.
      
      The patch skips CONFIRM if there is a ROLLBACK in progress. It is
      not even necessary to check LSNs, because ROLLBACK always reverts
      the entire limbo queue: it cancels all pending transactions with
      all LSNs, and new commits are rolled back even before they try to
      go to WAL. CONFIRM can't help with anything at that point.
      
      Part of #5185
      e7559bfe
  9. Jul 30, 2020
    • txn_limbo: handle ROLLBACK during CONFIRM · 849ba7dc
      Vladislav Shpilevoy authored
      Limbo could try to ROLLBACK LSN whose CONFIRM is in progress. This
      is how it could happen:
      
      - A synchronous transaction is created, written to WAL;
      - The fiber sleeps in the limbo waiting for CONFIRM or timeout;
      - Replica receives the transaction, sends ACK;
      - Master receives ACK, starts writing CONFIRM;
      - The first fiber times out and tries to write ROLLBACK for the
        LSN, whose CONFIRM is in progress right now.
      
      The patch adds more checks to the 'timed out' code path to see if
      it isn't too late to write ROLLBACK. If CONFIRM is in progress,
      the fiber will wait for its finish.
      
      Part of #5185
      849ba7dc
    • txn_limbo: panic when synchro WAL write fails · 61385877
      Vladislav Shpilevoy authored
      CONFIRM and ROLLBACK go to WAL. Their WAL write can fail just like
      any other WAL write. However it is not clear what to do in that
      case, especially when a ROLLBACK write fails.
      
      The patch adds a panic() stub so as to at least terminate the
      instance. Before the patch it would continue as if nothing had
      happened, with undefined behaviour.
      
      Closes #5159
      61385877
  10. Jul 29, 2020
    • txn_limbo: reduce fiber_set_cancellable() calls · a6ab3771
      Vladislav Shpilevoy authored
      The calls were added before and after each cond_wait() so that the
      fiber couldn't be woken up externally. For example, from Lua.
      
      But it is not necessary to flip the flag on each wait() call. It
      is enough to do it twice: forbid cancellation at the beginning of
      txn_limbo_wait_complete(), and restore the old value at the end.
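
The "flip once around the whole wait loop" pattern can be sketched in Python with a context manager. The `Fiber` class here is a stand-in that only counts flag flips; it is an assumption of this sketch and not the fiber API.

```python
from contextlib import contextmanager


class Fiber:
    """Stand-in fiber that records how often the flag is flipped."""

    def __init__(self):
        self.cancellable = True
        self.flips = 0

    def set_cancellable(self, value):
        old = self.cancellable
        self.cancellable = value
        self.flips += 1
        return old


@contextmanager
def non_cancellable(fiber):
    # Forbid cancellation once, restore the old value once.
    old = fiber.set_cancellable(False)
    try:
        yield
    finally:
        fiber.set_cancellable(old)


def wait_complete(fiber, waits):
    with non_cancellable(fiber):
        for _ in range(waits):
            pass  # stand-in for one cond_wait() call
    return fiber.flips
```

Regardless of how many waits happen inside, the flag is touched exactly twice.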
      a6ab3771
    • txn_limbo: don't duplicate confirmations in WAL · 4920782e
      Vladislav Shpilevoy authored
      When an ACK was received for an already confirmed transaction
      whose CONFIRM WAL write is in progress, it produced a second
      CONFIRM in WAL with the same LSN.
      
      That was unnecessary work taking time and disk space for WAL
      records. It didn't lead to any bugs, but was very inefficient.
      
      This patch makes the confirmation LSN grow monotonically. If more
      ACKs are received for an already confirmed LSN, its confirmation
      is not written a second time.
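
A minimal Python model of a monotonically growing confirmation LSN. The WAL is represented by a plain list, and the class name is hypothetical.

```python
class ConfirmWriter:
    """Writes a CONFIRM record only when the confirmed LSN advances."""

    def __init__(self):
        self.confirmed_lsn = 0
        self.wal = []  # stand-in for WAL records

    def confirm(self, lsn):
        if lsn <= self.confirmed_lsn:
            # Already confirmed (or the CONFIRM is in flight): skip.
            return False
        self.confirmed_lsn = lsn
        self.wal.append(("CONFIRM", lsn))
        return True
```

Note that `confirmed_lsn` is advanced before the write completes, so an ACK arriving while the CONFIRM is still being written does not produce a second record.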
      
      Closes #5144
      4920782e
    • lua/utils: improve luaT_newthread performance · 604b98ee
      Igor Munkin authored
      
      <luaT_newthread> created a new GCfunc object for the helper invoked in a
      protected <lua_cpcall> frame (i.e. <luaT_newthread_wrapper>) on each
      call. The change introduces a static reference to a GCfunc object for
      <luaT_newthread_wrapper> to be initialized on Tarantool startup to
      reduce Lua GC memory usage.
      
      Furthermore, since <lua_cpcall> yields nothing on guest stack, the newly
      created Lua coroutine needs to be pushed back to prevent its sweep. So
      to reduce guest stack manipulations <lua_cpcall> is replaced with
      <lua_pcall> and the resulting Lua thread is obtained via guest stack.
      
      Part of #5201
      
      Reviewed-by: Sergey Ostanevich <sergos@tarantool.org>
      Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Signed-off-by: Igor Munkin <imun@tarantool.org>
      604b98ee
  11. Jul 28, 2020
    • test: fix flaky qsync_with_anon.test.lua again · cd292adb
      Vladislav Shpilevoy authored
      One of the test cases had 2 problems.
      
      - The same as in the previous commit - it started a sync
        transaction on master, switched to replica assuming it sees
        everything up to this sync transaction, but it still can see
        data from the previous test case;
      
      - The test case tried to write a sync transaction on master, got
        timeout, switched to replica to ensure the data is removed there
        too, but since dirty reads are possible, it could happen that
        the data was delivered to replica and ROLLBACK was not yet. On
        the replica the rolled back data still could be visible.
      
      The first issue is solved by flushing master's state to replica
      via making a successful sync transaction.
      
      The second issue is fixed by splitting it into more steps, not
      depending on timeouts (1000 is considered infinity).
      
      Closes #5196
      cd292adb
    • test: fix flaky qsync_snapshots.test.lua again · 3290d48a
      Vladislav Shpilevoy authored
      One of the test cases started a sync transaction on master,
      switched to replica, and tried to do some actions assuming that
      the latest master data has arrived here.
      
      But in fact the replica could be far behind the master. It could
      still contain data from the previous test case. That led to a
      bug, when it looked as if the replica had some data committed
      on it, but not committed on master - this was just data from the
      previous test case.
      
      The issue is solved by flushing master's state to replica via
      making a successful sync transaction.
      
      Closes #5167
      3290d48a
    • box: remove unnecessary bootstrap. file · 47551843
      Timur Safin authored
      `src/box/bootstrap.` file (please pay attention to the trailing dot)
      is preventing checkout of Tarantool sources on some file-systems
      (e.g. Windows NTFS) which disallow creation of such files.
      
      The funny part is that this file is unnecessary for the build process
      and can be simply deleted.
      
      Closes #4781
      47551843
    • gitlab-ci: restore lto testing on OSX · 412fd7b9
      Alexander V. Tikhonov authored
      Found that after commit:
      
        7faa1abe "gitlab-ci: implement OSX 10.14 testing on mac mini"
      
      the environment variable CMAKE_EXTRA_PARAMS, which turns on the LTO
      flag for OSX compilation, was mistakenly not updated, so LTO was in
      fact disabled in OSX testing. Some later commits then introduced LTO
      errors that went unnoticed, which the current commit fixes.
      
      Closes #5160
      412fd7b9
    • lua: introduce function to check that passed value is uuid · ab3e332c
      Oleg Babin authored
      
      We already have the is_decimal function that checks whether a value
      is a decimal. Now that tarantool supports the UUID type, it will
      quite often be necessary to check that some value has the UUID type
      as well. This patch introduces the "is_uuid" function for this purpose.
      
      Closes #5171
      
      Signed-off-by: Oleg Babin <olegrok@tarantool.org>
      Reviewed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Reviewed-by: Leonid Vasiliev <lvasiliev@tarantool.org>
      
      @TarantoolBot document
      Title: uuid.is_uuid
      
      is_uuid function returns "true" if specified value has uuid type
      and "false" otherwise.
      ab3e332c
    • lua/log: add missing exports · 625095d9
      Cyrill Gorcunov authored
      
      When building with LTO, non-exported symbols might
      be discarded even though they are used implicitly
      in log.lua code.
      
      Introduced by a94a9b3f
      
      Fixes #5160
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      625095d9
    • lua: return getaddrinfo() errors · acec5a82
      Roman Khabibov authored
      Add getaddrinfo() errors to several functions of the socket module.
      Now getaddrinfo() returns a pair of values (nil and error message)
      in case of error.
      
      Closes #4138
      
      @TarantoolBot document
      Title: socket API changes
      
      * socket.getaddrinfo()
      
      Return error message as the second return value.
      
      Example:
      tarantool> socket.getaddrinfo('non_exists_hostname', 3301)
      ---
      - null
      - 'getaddrinfo: nodename nor servname provided, or not known'
      ...
      
      * socket.tcp_connect()
      
      Return error message as the second return value.
      
      Example:
      tarantool> socket.tcp_connect('non_exists_hostname', 3301)
      ---
      - null
      - 'getaddrinfo: nodename nor servname provided, or not known'
      ...
      
      * socket.bind()
      
      Return error message as the second return value.
      
      Example:
      tarantool> socket.tcp_connect('non_exists_hostname', 3301)
      ---
      - null
      - 'getaddrinfo: nodename nor servname provided, or not known'
      ...
      
      * socket.tcp_server()
      
      Return error message as the second return value.
      
      Example:
      tarantool> socket.tcp_connect('non_exists_hostname', 3301)
      ---
      - null
      - 'getaddrinfo: nodename nor servname provided, or not known'
      ...
      acec5a82
    • coio/say: fix getaddrinfo error handling on macOS · eb73eda1
      Roman Khabibov authored
      Before this patch, the branch handling getaddrinfo() error codes
      couldn't be reached on macOS, because the codes are greater than 0
      there (the assumption "rc < 0" in commit ea1da04d is incorrect for
      macOS).
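
The difference between the two checks can be shown in a few lines of Python. The function names are made up for the sketch, and the sample values -2 and 8 are used as examples of a negative (glibc-style) and a positive (macOS-style) EAI_* code; the exact numbers are an assumption of the illustration.

```python
def failed_old(rc: int) -> bool:
    # The earlier assumption: errors are always negative.
    return rc < 0


def failed(rc: int) -> bool:
    # getaddrinfo() returns 0 on success and a nonzero code on error,
    # so the portable check is "not equal to zero".
    return rc != 0
```

With a positive error code the old check reports success, which is the unreachable branch the commit message describes.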
      
      Note: diag_log() in say.c was added, because otherwise the error
      would be hidden in the case of panic(). Also, two diag_set() calls
      were added in syslog_connect_unix() to avoid asserts in this
      diag_log().
      
      Needed for #4138
      eb73eda1
  12. Jul 27, 2020
    • txn_limbo: introduce cascading rollback · dcaae66e
      Vladislav Shpilevoy authored
      Cascading rollback is a state when existing transactions are being
      rolled back right now, and newer transactions can't be committed
      either, in order to preserve the 'reversed rollback order' rule.
      
      WAL writer can enter such state when something goes wrong with
      writing to disk. Limbo didn't have that feature until now.
      
      Consider an example why limbo should be able to turn on cascading
      rollback. Without cascading rollback it can happen that a
      transaction is seemingly rolled back, but after restart it is
      committed and visible. The scenario:
      
          * Master writes a sync transaction to WAL with LSN1;
      
          * It starts waiting for ACKs;
      
          * No ACKs arrive within the timeout, so it starts writing the
            command ROLLBACK(LSN1) to WAL, to roll back everything with
            LSN >= LSN1 but < LSN of the ROLLBACK record itself;
      
          * Another fiber starts a new transaction, while ROLLBACK is in
            progress;
      
          * Limbo is not empty, so the new transaction is added there.
            Then it also starts writing itself to WAL;
      
          * ROLLBACK finishes WAL write. It rolls back all the
            transactions in the limbo to conform with the 'reversed
            rollback order' rule. Including the latest transaction;
      
          * The latest transaction finished its WAL write with LSN2 and
            sees that it was rolled back by the limbo already.
      
      All seems to be fine, but actually what happened is that
      ROLLBACK(LSN1) is written to WAL *before* the latest transaction
      with LSN2. Now when restart happens, ROLLBACK(LSN1) is replayed
      first, and then the latest LSN2 transaction is replayed second -
      it will be committed successfully, and will be visible.
      
      In summary: the transaction's rollback was undone after instance
      restart. The expected behaviour is that while ROLLBACK is in
      progress, newer transactions should not even try going to WAL.
      They should be rolled back immediately.
      
      The patch implements the cascading rollback for the limbo.
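
The scenario above can be replayed with a toy Python model. It assumes the WAL is a FIFO list of records, that ROLLBACK(LSN) drops every earlier record with an LSN greater than or equal to it, and that a replayed transaction becomes visible; all class and function names are invented for the sketch.

```python
def replay(wal):
    """Replay WAL records in order; return the LSNs visible afterwards."""
    visible = []
    for kind, lsn in wal:
        if kind == "TXN":
            visible.append(lsn)
        elif kind == "ROLLBACK":
            # Drops only the transactions that precede it in the WAL.
            visible = [v for v in visible if v < lsn]
    return visible


class ToyLimbo:
    def __init__(self, cascading):
        self.cascading = cascading
        self.is_in_rollback = False
        self.wal = []

    def begin_rollback(self, lsn):
        # The ROLLBACK record is queued to WAL; the write is in progress.
        self.is_in_rollback = True
        self.wal.append(("ROLLBACK", lsn))

    def finish_rollback(self):
        self.is_in_rollback = False

    def submit(self, lsn):
        if self.cascading and self.is_in_rollback:
            return False  # rolled back at once, never reaches WAL
        self.wal.append(("TXN", lsn))
        return True


def scenario(cascading):
    limbo = ToyLimbo(cascading)
    limbo.submit(1)          # sync transaction with LSN1
    limbo.begin_rollback(1)  # timeout: ROLLBACK(LSN1) goes to WAL
    limbo.submit(3)          # newer transaction while ROLLBACK is in progress
    limbo.finish_rollback()
    return replay(limbo.wal)
```

Without cascading rollback the newer transaction survives the replay; with it, nothing is visible after restart.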
      
      Closes #5140
      dcaae66e
  13. Jul 24, 2020
  14. Jul 22, 2020
    • txn: remove TXN_IS_DONE check from txn_commit() · 8c50e069
      Vladislav Shpilevoy authored
      
      TXN_IS_DONE was used in txn_commit() to finalize the transaction
      in case it is not finished yet. But this happens only in rare
      cases - during bootstrap and recovery. During normal operation
      the transaction is always finished when WAL thread returns it to
      TX thread after a disk write.
      
      So this is a matter of the journal, and should be solved there,
      not in txn code with some crutch, especially in such a hot path.
      
      This commit makes it so that after journal_write() the transaction
      is always already done, i.e. txn_complete_async() was called.
      Nothing changes for the normal operation mode except that there
      is one 'if' fewer.
      
      Also the commit disables snap_quorum_delay.test, which uses
      internal API of replication and txn modules assuming the journal
      is initialized somewhere inside. But now it is not, and it can't
      be fixed in a sane way inside the test. It will be
      deleted/rewritten later.
      
      Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
      8c50e069
    • txn_limbo: single function to confirm transactions · 71863597
      Vladislav Shpilevoy authored
      
      Before this patch there were some structural problems with the
      limbo code (still are now, but less).
      
      It wasn't clear when lifetime of a transaction ends. When it is
      committed, when rolled back, when completion is called. With
      rollback things are more or less bearable, but not with commit.
      
      Transaction's completion could be done at the same time with
      setting limbo_entry.is_commit flag. Or with setting the flag +
      yield + completion. This led to having a weird crutch in
      txn_limbo_wait_complete() to make async transactions explicitly
      wait until the previous sync transactions are written to WAL.
      
      The commit flag could be set in 3!!! different places - parameters
      update handler, ACK handler, and confirmation handler. In these 3
      places there were various assumptions making it really hard to
      understand what is happening here and there, what is the
      difference. Not counting how much code was duplicated.
      
      This patch makes the commit path always come through confirmation
      reader, as the most logical place for that, and already covering
      almost all the tricky cases.
      
      Now there is a guarantee that a transaction is completed if it is
      removed from the limbo queue. No need to check TXN_IS_DONE, wait
      for anything, whatsoever.
      
      Also the patch is a preparation for removal of TXN_IS_DONE check
      from the main path of txn_commit(). txn_limbo_wait_complete()
      shouldn't ever return a not finished transaction for that.
      
      Part of #5143
      
      Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
      71863597
    • stailq: provide better names for args · be84b6de
      Cyrill Gorcunov authored
      
      Use self explanatory dest and src (like in strcat).
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      be84b6de
    • luajit: bump new version · bc58c067
      Kirill Yukhin authored
      * gdb: fix the extension to be loaded with Python 2
      bc58c067
    • gitlab-ci: setup local cleanup/checkout processes · fca966a2
      Alexander V. Tikhonov authored
      Set up Docker-based cleanup for all of the jobs to avoid failures
      when a previous job used Docker and reassigned the temporary
      files in the working directory. If Docker is not available, then
      there is no need to use it for cleanup, and the cleanup runs
      using shell. Gitlab-ci clean flags are disabled and the cleanup
      is reorganized locally as shown at [1].
      
      Disabled the default gitlab-ci checkout process as shown at [2] to
      be able to fix the repository before the checkout. Found that
      previously run gitlab-ci jobs could change permissions of files in
      the repository, which broke the checkouts in the next jobs. Used
      [3] for the checkout strategy and [4] for the submodule update
      strategy. The local submodules update routine in the .gitlab.mk
      file became unneeded and was removed.
      
      List of steps made locally instead of gitlab-ci preparations:
      
      1. Check/clone the Tarantool repository with submodules.
      2. For shell based jobs change ownership of all the sources
         to 'gitlab-runner' user. (NOTE: in Docker based jobs the
         'gitlab-runner' user is not known.)
      3. Fetch Tarantool sources with branches and force checkout
         of the testing commit.
      4. Update submodules recursively (use force where supported).
      5. Clean the source tree of all files that are not part of the
         repository.
      
      [1] https://docs.gitlab.com/ee/ci/yaml/README.html#git-clean-flags
      [2] https://docs.gitlab.com/ee/ci/yaml/README.html#git-strategy
      [3] https://docs.gitlab.com/ee/ci/yaml/README.html#git-checkout
      [4] https://docs.gitlab.com/ee/ci/yaml/README.html#git-submodule-strategy
      
      Follows up #5036
      fca966a2
  15. Jul 21, 2020
    • txn: single failure point for WAL and TX async commit errors · 8df49e47
      Vladislav Shpilevoy authored
      The same as the previous commit, but applied to the async
      transaction commit (txn_commit_async()). There were 6 failure
      points. After the patch the failures are handled in one place.
      
      Follow up #5146
      8df49e47
    • txn: single failure point for WAL and TX commit errors · 91f0ec9b
      Vladislav Shpilevoy authored
      txn_commit() uses journal_write() function to send requests to WAL
      thread. journal_write() can return 0/-1, but these codes have
      nothing to do with the actual write result. journal_write() only
      signals whether the interaction with WAL thread finished
      successfully. It tells nothing about what happened in this
      interaction.
      
      To check the WAL write result one needs to look at the
      journal_entry.res field. As a result, there were 2 failure points
      to handle. One of them wasn't handled for synchronous
      transactions. Not counting 4 failure points in other places:
      
      - When can't prepare the transaction before commit;
      - When can't allocate a journal entry;
      - When can't append a new entry to the qsync limbo;
      - When synchronous transaction completion wait fails.
      
      This patch merges all the failure points into one place for
      txn_commit().
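
The shape of a single failure point can be sketched in Python. The functions below reuse the names `journal_write` and `txn_commit` only as labels for stand-ins; their parameters, the exception type and the returned strings are assumptions of this illustration and not the C API.

```python
class JournalEntry:
    def __init__(self):
        self.res = -1  # result of the disk write, negative on failure


def journal_write(entry, submit_ok, disk_ok):
    """Return 0/-1 for the interaction with the WAL thread only."""
    if not submit_ok:
        return -1
    # The actual write result is reported separately, in entry.res.
    entry.res = 1 if disk_ok else -1
    return 0


def txn_commit(prepare_ok=True, submit_ok=True, disk_ok=True):
    entry = JournalEntry()
    try:
        if not prepare_ok:
            raise RuntimeError("prepare failed")
        if journal_write(entry, submit_ok, disk_ok) != 0:
            raise RuntimeError("journal interaction failed")
        if entry.res < 0:
            raise RuntimeError("WAL write failed")
    except RuntimeError as e:
        # Single failure point: every error path ends up here.
        return f"rolled back: {e}"
    return "committed"
```

The return code of the write call and the result stored in the entry are checked separately, but both lead to the same handler.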
      
      Closes #5146
      91f0ec9b
  16. Jul 20, 2020