  1. Jun 17, 2022
    • txn_limbo: track CONFIRM lsn on replicas · 129d83e9
      Serge Petrenko authored
      limbo->confirmed_lsn was only filled on limbo owner in
      txn_limbo_write_confirm. Replicas and recovering limbo owner need to track
      it as well to correctly detect split-brains based on confirmed_lsn.
      
      So update confirmed_lsn in txn_limbo_read_confirm.
      
      Part-of #5295
      
      NO_DOC=internal change
      NO_TEST=tested in future commits
      NO_CHANGELOG=internal change
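The idea of this change can be illustrated with a minimal sketch. It is not the Tarantool code: `struct toy_limbo` and both functions are hypothetical stand-ins, and ignoring a stale CONFIRM is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the limbo. */
struct toy_limbo {
    /* LSN of the latest synchronous transaction known to be confirmed. */
    int64_t confirmed_lsn;
};

/* Owner path: the owner writes a CONFIRM for everything up to `lsn`. */
static void
toy_limbo_write_confirm(struct toy_limbo *limbo, int64_t lsn)
{
    assert(lsn >= limbo->confirmed_lsn);
    limbo->confirmed_lsn = lsn;
}

/*
 * Replica and recovery path: a CONFIRM is read from the replication
 * stream or from the WAL. Recording the LSN here as well is the point
 * of the change. Assumption: a stale CONFIRM never moves the LSN back.
 */
static void
toy_limbo_read_confirm(struct toy_limbo *limbo, int64_t lsn)
{
    if (lsn > limbo->confirmed_lsn)
        limbo->confirmed_lsn = lsn;
}
```

With both paths recording the value, a replica and the owner agree on `confirmed_lsn`, which is what a later split-brain check can compare.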
    • txn_limbo: do not confirm/rollback anything after restart · 6cc1b1f2
      Serge Petrenko authored
      It's important for the synchro queue owner to not finalize any of the
      pending synchronous transactions after restart.
      
      Since the node was down for some time, chances are high that it was
      deposed by a new leader during its downtime. This means the node
      might not yet know that its transactions were already finalized by someone
      else.
      
      So, any arbitrary finalization might lead to a future split-brain, once the
      remote PROMOTE finally reaches the local node.
      
      Let's fix this by adding a new reason for the limbo to be frozen - a
      queue owner has recovered but has not issued a new PROMOTE locally and
      hasn't received any PROMOTE requests from the remote nodes.
      
      Once the first PROMOTE is issued or received, it's safe to return to the
      old mode of operation.
      
      So, now the synchro queue owner starts in "frozen" state and can't
      CONFIRM, ROLLBACK or issue new transactions until either issuing a
      PROMOTE or receiving a PROMOTE from some remote node.
      
      This also required modifying box.ctl.promote() behaviour: it's no
      longer a no-op on a synchro queue owner, when elections are disabled and
      the queue is frozen due to restart.
      
      Also fix the tests, which assumed the queue owner is writeable after a
      restart. gh-5298 test was partially deleted, because it became pointless.
      
      And while we are at it, remove the double run of gh-5288 test. It is
      storage engine agnostic, so there's no point in running it for both
      memtx and vinyl.
      
      Part-of #5295
      
      NO_CHANGELOG=covered by previous commit
      
      @TarantoolBot document
      Title: ER_READONLY error receives new reasons
      
      When box.info.ro_reason is "synchro" and some operation throws an
      ER_READONLY error, this error now might include the following reason:
      ```
      Can't modify data on a read-only instance - synchro queue with term 2
      belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen due to
      fencing
      ```
      This means that the current instance is indeed the synchro queue owner,
      but it has noticed that someone else in the cluster might start new
      elections or might overtake the synchro queue soon.
      This may also be detected by `box.info.election.term` becoming greater than
      `box.info.synchro.queue.term` (this is the case for the second error
      message).
      There is also a slightly different error message:
      ```
      Can't modify data on a read-only instance - synchro queue with term 2
      belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen until
      promotion
      ```
      This means that the node simply cannot guarantee that it is still the
      synchro queue owner (for example, after a restart, when a node still thinks
      it is the queue owner, but someone else in the cluster has already
      overtaken the queue).
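The restart rule from this commit can be sketched as a tiny state machine. This is a hedged illustration only: `struct toy_queue` and its functions are hypothetical names, and treating id 0 as "no instance" is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a node's view of the synchro queue. */
struct toy_queue {
    uint32_t self_id;
    uint32_t owner_id;
    bool frozen_until_promotion;
};

/* After a restart the recovered owner starts frozen. */
static void
toy_queue_recover(struct toy_queue *q, uint32_t self_id, uint32_t owner_id)
{
    q->self_id = self_id;
    q->owner_id = owner_id;
    q->frozen_until_promotion = (owner_id == self_id);
}

/* The first PROMOTE, issued locally or received, lifts the freeze. */
static void
toy_queue_on_promote(struct toy_queue *q, uint32_t new_owner_id)
{
    assert(new_owner_id != 0); /* assumption: 0 means "no instance" */
    q->owner_id = new_owner_id;
    q->frozen_until_promotion = false;
}

/* CONFIRM and ROLLBACK are allowed only for an unfrozen owner. */
static bool
toy_queue_can_finalize(const struct toy_queue *q)
{
    return q->owner_id == q->self_id && !q->frozen_until_promotion;
}
```

A node that recovers as owner cannot finalize anything until a PROMOTE arrives. If that PROMOTE names another node, the freeze is lifted but the node is no longer the owner, so it still cannot finalize.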
    • txn_limbo: fence upon receiving raft term greater than queue term · 0e48475d
      Serge Petrenko authored
      Receiving a raft term greater than the current queue term means that
      someone has either already written PROMOTE (in case elections are
      disabled), or is going to write PROMOTE once it wins the elections (in
      case they are enabled).
      
      In both cases the queue owner in an old term should freeze the limbo
      until queue term catches up with raft term.
      
      Unfreezing happens automatically once synchro queue term catches up.
      
      Part-of #5295
      
      NO_DOC=covered by next commit
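The fencing condition boils down to comparing two terms. A minimal sketch, assuming a hypothetical `struct toy_terms` that holds both values (none of these names come from the Tarantool sources):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of terms tracked by a queue owner. */
struct toy_terms {
    uint64_t raft_term;
    uint64_t queue_term;
};

/* A newer raft term was received from the cluster. */
static void
toy_on_raft_term(struct toy_terms *t, uint64_t term)
{
    if (term > t->raft_term)
        t->raft_term = term;
}

/* A PROMOTE for `term` was applied to the queue. */
static void
toy_on_promote(struct toy_terms *t, uint64_t term)
{
    assert(term >= t->queue_term);
    t->queue_term = term;
    if (term > t->raft_term)
        t->raft_term = term;
}

/* Fenced while the queue term lags behind the raft term. */
static bool
toy_is_fenced(const struct toy_terms *t)
{
    return t->raft_term > t->queue_term;
}
```

Because the fenced state is derived from the two terms, no explicit "unfence" call is needed in this sketch: the state clears as soon as the queue term catches up.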
    • txn_limbo: rework limbo->frozen flag · ce0a83eb
      Serge Petrenko authored
      Soon there will be more reasons for a transaction limbo to be frozen.
      Let's make the limbo->frozen flag a bitmap and rename it to
      limbo->frozen_reasons.
      The first bit, named frozen_due_to_fencing, represents the only current
      reason for the limbo to be frozen.
      While we are at it, rename txn_limbo_(un)freeze to txn_limbo_(un)fence
      to better reflect the situation.
      
      Part-of #5295
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
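Why a bitmap rather than a boolean: with several independent reasons, clearing one must not unfreeze the limbo while another still holds. A hedged sketch with hypothetical names follows; the second bit is invented purely for illustration and is not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum toy_frozen_reason {
    TOY_FROZEN_DUE_TO_FENCING = 1 << 0,
    /* Hypothetical extra reason, present only to show the benefit. */
    TOY_FROZEN_OTHER_REASON = 1 << 1,
};

struct toy_limbo_flags {
    /* Bitmap of currently active freeze reasons. */
    uint8_t frozen_reasons;
};

static void
toy_limbo_freeze(struct toy_limbo_flags *l, enum toy_frozen_reason r)
{
    assert(r != 0);
    l->frozen_reasons |= (uint8_t)r;
}

static void
toy_limbo_unfreeze(struct toy_limbo_flags *l, enum toy_frozen_reason r)
{
    l->frozen_reasons &= (uint8_t)~r;
}

/* Frozen while at least one reason is still set. */
static bool
toy_limbo_is_frozen(const struct toy_limbo_flags *l)
{
    return l->frozen_reasons != 0;
}
```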
    • txn_limbo: preserve confirmed_lsn after reading a PROMOTE · 896a20e4
      Serge Petrenko authored
      Previously we assumed that every PROMOTE request changes limbo owner,
      and thus limbo should have confirmed_lsn = 0 after the request is
      processed, because new confirmed lsn is yet unknown.
      
      This is not true for PROMOTE requests coming in JOIN or saved in
      snapshot: such requests don't change limbo owner: they are like
      savepoints, they notify the instance of the current limbo state.
      
      Such promotions may be detected by the rule
      replica_id (old limbo owner) == origin_id (new limbo owner)
      
      So, for the sake of correct split-brain detection, confirmed_lsn should
      be nonzero after such promotions.
      
      Part-of #5295
      
      NO_DOC=internal change
      NO_TEST=tested in future commits
      NO_CHANGELOG=internal change
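The detection rule `replica_id == origin_id` can be sketched as follows. The structure and function are hypothetical, and storing the LSN carried by the request is an assumption of this example; the commit message only states that `confirmed_lsn` must end up nonzero in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified limbo state. */
struct promo_limbo {
    uint32_t owner_id;
    int64_t confirmed_lsn;
};

/*
 * Apply a PROMOTE request. `replica_id` is the old limbo owner and
 * `origin_id` is the new one, as in the rule quoted above.
 */
static void
promo_limbo_read_promote(struct promo_limbo *limbo, uint32_t replica_id,
                         uint32_t origin_id, int64_t lsn)
{
    assert(lsn >= 0);
    if (replica_id == origin_id) {
        /*
         * Savepoint-like PROMOTE (JOIN or snapshot): the owner does
         * not change, so the confirmed LSN stays known. Assumption:
         * the request's LSN is the value to keep.
         */
        limbo->confirmed_lsn = lsn;
    } else {
        /* Real ownership change: the new confirmed LSN is unknown. */
        limbo->confirmed_lsn = 0;
    }
    limbo->owner_id = origin_id;
}
```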
    • test: refactor gh_6036_qsync_order test · 978731b3
      Serge Petrenko authored
      
      The test involves creating a manual split-brain between nodes r1 and r2.
      After the split-brain detection introduction it's impossible to reuse
      the nodes in the next test without recreating them.
      
      Let's fix that by switching nodes r1 and r3. Now there's a split-brain
      between (r1, r2) and r3, and r3 isn't used in the following tests and
      may be safely deleted.
      
      Follow-up #5295
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      
      Signed-off-by: Serge Petrenko <sergepetrenko@tarantool.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • relay: fix PROMOTE and raft term ordering · 67090419
      Serge Petrenko authored
      Fix two issues with sent_raft_term calculations:
      * first of all, it doesn't matter during initial and final join, so set it
        to UINT64_MAX.
      * secondly, it's nullified after a successful dispatch from the tx
        thread. This might make the relay stall forever. For example, when
        elections are disabled.
      
      NO_DOC=bugfix
      NO_TEST=tested in next commit
  2. Jun 16, 2022
    • core: allow spurious wakeups in coio_waitpid · 7a582646
      Ilya Verbin authored
      Currently it's possible to wake up a fiber that is waiting for a
      child process termination using the Tarantool C API. This will leave
      a zombie process behind. This patch reworks `coio_waitpid` in such
      a way that it yields until `cw.data` is set to NULL in the process
      status change callback.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
    • core: allow spurious wakeups in cord_cojoin · 87e7d312
      Ilya Verbin authored
      Currently it's possible to wake up a fiber that is waiting for task
      completion using the Tarantool C API. This will cause a "wrong fiber woken"
      panic. This patch reworks `cord_cojoin` in such a way that it yields
      until a completion flag is set.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
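The "yield until a completion flag is set" pattern used in this series can be sketched without fibers. Everything below is hypothetical: `toy_yield` merely simulates a yield that may return early, and the task is modeled as completing after a fixed number of wakeups.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task that a waiter joins. */
struct toy_task {
    bool is_done;
    int wakeups_until_done;
    int wakeups_seen;
};

/*
 * Stand-in for a fiber yield: it returns whenever the waiter is woken
 * up, whether or not the task has finished. In this toy model the
 * task finishes on the last of `wakeups_until_done` wakeups.
 */
static void
toy_yield(struct toy_task *task)
{
    task->wakeups_seen++;
    if (--task->wakeups_until_done <= 0)
        task->is_done = true;
}

/*
 * Spurious-wakeup-safe join: a wakeup alone is not treated as
 * completion, only the flag is.
 */
static void
toy_join(struct toy_task *task)
{
    assert(task->wakeups_until_done > 0);
    while (!task->is_done)
        toy_yield(task);
}
```

The same loop shape applies to any wait primitive that can return early: re-check the condition after every wakeup instead of assuming the wakeup means the event happened.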
    • core: get rid of fiber_set_cancellable in hot_standby_f · 65470cb4
      Ilya Verbin authored
      Currently it's possible to wake up a `hot_standby_f` fiber from Lua.
      This does not lead to any error, but it results in redundant
      `recover_remaining_wals` calls.
      This patch handles such spurious wakeups in `hot_standby_f`.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
    • core: get rid of fiber_set_cancellable in gc_checkpoint_fiber_f · 6e5b89e0
      Ilya Verbin authored
      Spurious wakeups are already handled correctly.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
    • box: fix transaction "read-view" and "conflicted" states · 4d52199e
      Georgiy Lebedev authored
      Currently, there is a fundamental logical inconsistency with read-view and
      conflicted states of transactions.
      
      Conflicted transactions see all prepared changes (e.g., #7238), because
      they are handled differently than read-view ones. At the same time, one
      does not know the state of the transaction until `box.commit` is called.
      
      A similar problem arises with read-view transactions: if such transactions
      do any DML statements, they are de-facto conflicted, but this will only be
      determined at preparation stage:
      https://github.com/tarantool/tarantool/blob/79245573dabf3c1eb4eb904fd80ee84270360476/src/box/txn.c#L1006-L1013
      
      Fix this inconsistency by the following changes:
      1. Conflict "read-view" transactions on attempt to perform DML statements
      immediately — guarantee this with an assertion at preparation stage.
      2. Make conflicted transactions unconditionally throw "Transaction has been
      aborted by conflict" error on any CRUD operations (including read-only
      ones) until they are either rolled back (which will return no error) or
      committed (which will return the same error).
      
      Closes #7238
      Closes #7239
      Closes #7240
      
      @TarantoolBot document
      Title: new behaviour of "conflicted" transactions
      
      "Conflicted" transactions now return the "Transaction has been aborted by conflict"
      error on any CRUD operations (including read-only ones), until they are
      either rolled back (which will return no error) or committed (which will
      return the same error).
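The two rules above can be modeled as a small state machine. This is only a sketch: the enum, the structure and the functions are hypothetical, and errors are reduced to a `-1` return code.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_txn_status {
    TOY_TXN_INPROGRESS,
    TOY_TXN_IN_READ_VIEW,
    TOY_TXN_CONFLICTED,
    TOY_TXN_COMMITTED,
    TOY_TXN_ROLLED_BACK,
};

struct toy_txn {
    enum toy_txn_status status;
};

static bool
toy_txn_is_open(const struct toy_txn *txn)
{
    return txn->status == TOY_TXN_INPROGRESS ||
           txn->status == TOY_TXN_IN_READ_VIEW ||
           txn->status == TOY_TXN_CONFLICTED;
}

/* Read-only operation: fails once the transaction is conflicted. */
static int
toy_txn_read(struct toy_txn *txn)
{
    assert(toy_txn_is_open(txn));
    return txn->status == TOY_TXN_CONFLICTED ? -1 : 0;
}

/* DML: a read-view transaction becomes conflicted immediately. */
static int
toy_txn_write(struct toy_txn *txn)
{
    assert(toy_txn_is_open(txn));
    if (txn->status == TOY_TXN_IN_READ_VIEW)
        txn->status = TOY_TXN_CONFLICTED;
    return txn->status == TOY_TXN_CONFLICTED ? -1 : 0;
}

/* Commit of a conflicted transaction reports the same error. */
static int
toy_txn_commit(struct toy_txn *txn)
{
    assert(toy_txn_is_open(txn));
    if (txn->status == TOY_TXN_CONFLICTED) {
        txn->status = TOY_TXN_ROLLED_BACK;
        return -1;
    }
    txn->status = TOY_TXN_COMMITTED;
    return 0;
}

/* Rollback always succeeds, even for a conflicted transaction. */
static int
toy_txn_rollback(struct toy_txn *txn)
{
    assert(toy_txn_is_open(txn));
    txn->status = TOY_TXN_ROLLED_BACK;
    return 0;
}
```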
    • tutorial: use https links · 2f80fbf0
      Sergey Bronnikov authored
      NO_CHANGELOG=internal
      NO_DOC=internal
      NO_TEST=internal
    • tools: fix gdb.sh revision regex · 375ceaaa
      Pavel Balaev authored
      Regular expression now works on versions: alpha, beta, rc and so on.
      
      NO_DOC=bugfix
      NO_TEST=bugfix
      NO_CHANGELOG=bugfix
    • tools: edit gdb.sh code formatting · 83e8c50f
      Pavel Balaev authored
      Tabs were replaced with spaces to bypass checkpatch.
      
      NO_DOC=bugfix
      NO_TEST=bugfix
      NO_CHANGELOG=bugfix
  3. Jun 14, 2022
    • test: fix flaky vinyl/tx_gap_lock test · b3f462bf
      Vladimir Davydov authored
      The `cmp_tuple` helper function is broken - it assumes that all tuple
      fields, including the payload, are numeric. It isn't true - the payload
      field is either nil or string. This results in a false-positive test
      failure:
      
      ```
      error: '[string "function cmp_tuple(t1, t2)     for i = 1, PAY..."]:1:
             attempt to compare nil with string'
      ```
      
      Closes #6336
      
      NO_DOC=test
      NO_CHANGELOG=test
    • test-run: bump to new version · e6e73423
      Yaroslav Lobankov authored
      Bump test-run to new version with the following improvements:
      
        - Fix issue with not detecting successful server start [1]
      
      [1] tarantool/test-run#343
      
      NO_DOC=testing stuff
      NO_TEST=testing stuff
      NO_CHANGELOG=testing stuff
  4. Jun 09, 2022
    • test: set shutdown timeout to infinity for default luatest instance · ede831d3
      Vladimir Davydov authored
      With the default shutdown timeout of 3 seconds, a test that leaves
      behind asynchronous requests would still pass, but it would take longer
      to finish, because the server instance started by Tarantool would have
      to wait for the dangling requests to complete. Setting the timeout to
      infinity will result in a hang, making us fix the test.
      
      Infinite timeout is also good for catching bugs like #7225 and #7256.
      
      We don't set the timeout for diff and TAP tests because those are
      deprecated and shouldn't be used for writing new tests. Nevertheless,
      I manually checked that none of them hangs if the timeout is set to
      infinity.
      
      Closes #6820
      
      NO_DOC=test
      NO_CHANGELOG=test
    • iostream: shutdown socket fd before close · 9cf03555
      Vladimir Davydov authored
      If a socket fd is shared by a child process, closing it in the parent
      will not shut down the underlying connection. As a result, the server
      may hang executing the graceful shutdown protocol. Fix this problem by
      explicitly shutting down the connection socket fd before closing it.
      
      This is a recommended way to terminate a Unix socket connection, see
      http://www.faqs.org/faqs/unix-faq/socket/#:~:text=2.6.%20%20When%20should%20I%20use%20shutdown()%3F
      
      Closes #7256
      
      NO_DOC=bug fix
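The shutdown-before-close idiom uses only standard POSIX calls. A minimal sketch follows; `conn_close` is a hypothetical helper name, not the function changed by the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Terminate a connection even if another process (for example a
 * forked child) still holds a copy of the descriptor.
 */
static int
conn_close(int fd)
{
    assert(fd >= 0);
    /*
     * shutdown() acts on the connection itself, so the peer sees EOF
     * right away. Its result is ignored here: it may fail harmlessly,
     * e.g. when the socket is already disconnected.
     */
    (void)shutdown(fd, SHUT_RDWR);
    /* close() only drops this process's reference to the socket. */
    return close(fd);
}
```

With a plain `close()`, the peer would keep waiting for as long as any duplicate of the descriptor stayed open elsewhere.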
    • wal: allow spurious wakeups in wal_write · 4bf52367
      Ilya Verbin authored
      It's possible to wake up a fiber that is waiting for WAL write
      completion using the Tarantool C API. This results in an error like:
      ```
      main/118/lua F> Journal result code -1 can't be converted to an error
      ```
      
      This patch introduces a flag, which is set when WAL write is
      finished, that allows fibers to yield until the flag is set.
      
      Closes #6506
      
      NO_DOC=bugfix
    • test-run: bump to new version · 0dc60b5f
      Yaroslav Lobankov authored
      Bump test-run to new version with the following improvements:
      
        - Fail *.test.py tests in case of server start errors [1]
      
      [1] tarantool/test-run#333
      
      NO_DOC=testing stuff
      NO_TEST=testing stuff
      NO_CHANGELOG=testing stuff
  5. Jun 08, 2022
    • sql: fix wrong ephemeral space format · a6818acc
      Mergen Imeev authored
      This patch fixes format building when an ephemeral space was used in
      ORDER BY and ORDER BY uses at least two variables from the list of
      selected columns.
      
      Closes #7042
      
      NO_DOC=Bugfix
    • decimal: fix index comparison with Inf, NaN · 22fc1f94
      Serge Petrenko authored
      There was an assertion failure when inserting a decimal into an index
      which contained double Inf or NaN.
      
      The reason for that was never checking decimal_from_*() return values,
      and decimal_from_double() not being able to handle NaN or Inf, because
      these values are not representable in decimal numbers.
      
      Start handling decimal_from_<type> return values and fix decimal
      comparison with Inf, NaN.
      
      Closes #6377
      
      NO_DOC=bugfix
  6. Jun 07, 2022
    • test: use unix socket in replication-py/swim tests · cb6fc4a3
      Yaroslav Lobankov authored
      To reduce the chance to encounter the tarantool/test-run#141 issue in
      replication-py/swim tests, let's switch to using unix sockets instead
      of TCP ports for tarantool console.
      
      NO_DOC=testing stuff
      NO_TEST=testing stuff
      NO_CHANGELOG=testing stuff
  7. Jun 06, 2022
    • core: allow spurious wakeups in cbus_call · bd6fb06a
      Ilya Verbin authored
      Currently it's possible to wake up a fiber that is waiting for `cbus_call`
      completion using the Tarantool C API. This will cause a misleading `TimedOut`
      error. This patch reworks `cbus_call` in such a way that it yields until
      a completion flag is set.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
    • core: get rid of unused cbus_flush · e568e7f0
      Ilya Verbin authored
      Part of #7166
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
    • datetime: refactor interval_to_string · b7ff1615
      Timur Safin authored
      Simplify/shorten `interval_to_string()` implementation.
      
      Part of #7045
      
      NO_CHANGELOG=refactoring
      NO_DOC=refactoring
      NO_TEST=refactoring
    • datetime: do not mess with nsec in interval · 36bc6f83
      Timur Safin authored
      Do not even try to make more readable output of secs/nsec,
      but rather report them as is, without any [de]normalization.
      
      Previously:
      ```
      tarantool> dt.interval.new{min=1, sec=59, nsec=2e9+1}
      --
      - +1 minutes, 61.000000001 seconds
      ...
      ```
      
      Now:
      ```
      tarantool> dt.interval.new{min=1, sec=59, nsec=2e9+1}
      --
      - +1 minutes, 59 seconds, 2000000001 nanoseconds
      ...
      ```
      
      Closes #7045
      
      NO_DOC=internal
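"Report them as is" means each component is printed from its own field, with no carry from nanoseconds into seconds. A hedged sketch: `toy_interval_format` is a hypothetical function, and its format string is modeled only on the second output shown above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Print the components exactly as stored, without normalization.
 * Returns what snprintf() returns: the length of the full text.
 */
static int
toy_interval_format(char *buf, size_t size, int min, int sec,
                    long long nsec)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size,
                    "%+d minutes, %d seconds, %lld nanoseconds",
                    min, sec, nsec);
}
```

For `min=1, sec=59, nsec=2000000001` this yields `+1 minutes, 59 seconds, 2000000001 nanoseconds`, matching the new output rather than the folded `61.000000001 seconds`.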
    • net.box: fix hang in graceful shutdown protocol · 79245573
      Vladimir Davydov authored
      The graceful shutdown protocol works as follows:
      
       1. The server sends a shutdown request (the box.shutdown event) to all
          its clients that subscribed to it.
       2. Upon receiving a shutdown request, a client is supposed to close its
          connection.
       3. The server waits for all clients subscribed to box.shutdown event to
          exit.
       4. The server exits.
      
      In net.box, the box.shutdown event is processed by `remote._callback`.
      The problem is it may occur that `remote._callback` is garbage collected
      while the `remote` object isn't. If this happens, the shutdown request
      will never get processed, and the server won't exit until the `remote`
      object is garbage collected, which may take forever.
      
      Let's fix this issue by breaking the worker loop if we see that the
      callback was garbage collected.
      
      Closes #7225
      
      NO_DOC=bug fix
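The four protocol steps and the effect of the fix can be simulated in a few lines. This is a toy model with hypothetical names; in particular, modeling "breaking the worker loop" as the connection getting closed is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical client state as seen by the simulation. */
struct toy_client {
    bool subscribed;   /* subscribed to the shutdown event */
    bool connected;
    bool has_callback; /* false once the callback was collected */
};

/* Step 2: the client reacts to a shutdown request. */
static void
toy_client_on_shutdown(struct toy_client *c, bool break_on_lost_callback)
{
    /*
     * With a live callback the client closes its connection. Without
     * one, only the fixed worker loop gives up and closes it.
     */
    if (c->has_callback || break_on_lost_callback)
        c->connected = false;
}

/*
 * Steps 1 and 3: notify the subscribers and count those the server
 * still has to wait for. Step 4 (exit) is possible when it returns 0.
 */
static size_t
toy_server_shutdown(struct toy_client *clients, size_t count,
                    bool break_on_lost_callback)
{
    assert(clients != NULL || count == 0);
    size_t still_waiting = 0;
    for (size_t i = 0; i < count; i++) {
        if (!clients[i].subscribed)
            continue;
        toy_client_on_shutdown(&clients[i], break_on_lost_callback);
        if (clients[i].connected)
            still_waiting++;
    }
    return still_waiting;
}
```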
  8. Jun 02, 2022
    • build: define TZDIR for tzcode build · b9c9a7b0
      Boris Stepanenko authored
      NixOS (and probably some other distributions) places the zoneinfo
      directory somewhere other than /usr/share (in /etc, for example) and
      sets TZDIR accordingly. Currently zoneinfo is looked up in /usr/share,
      disregarding the TZDIR environment variable.
      
      This commit adds a compile definition for TZDIR if such an environment
      variable is defined. This fixes zoneinfo lookup on NixOS.
      
      NO_CHANGELOG=build
      NO_DOC=build
      NO_TEST=build
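On the C side, a compile definition like this is typically consumed through a guarded default. The sketch below is an assumption about the general pattern, not a copy of the tzcode sources: `toy_zone_path` is hypothetical, and the build is assumed to pass something like `-DTZDIR="..."` to the compiler.

```c
#include <assert.h>
#include <stdio.h>

/* Fall back to the conventional location unless the build set TZDIR. */
#ifndef TZDIR
#define TZDIR "/usr/share/zoneinfo"
#endif

/* Build the path of a zone file under the configured directory. */
static int
toy_zone_path(char *buf, size_t size, const char *zone)
{
    assert(buf != NULL && zone != NULL);
    return snprintf(buf, size, "%s/%s", TZDIR, zone);
}
```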
    • Revert "github-ci: use openssl@1.1" · 6605de25
      Vladimir Davydov authored
      This reverts commit 33830978.
      
      Follow-up #6477
      
      NO_DOC=ci
      NO_TEST=ci
      NO_CHANGELOG=ci
    • Revert "ci: fix RPM spec to build packages for Fedora 36" · 7e1df16e
      Vladimir Davydov authored
      This reverts commit 9d1f9f0e.
      
      Follow-up #6477
      
      NO_DOC=ci
      NO_TEST=ci
      NO_CHANGELOG=ci
    • crypto: OpenSSL 3.0 support · e3bf73c8
      Vladimir Davydov authored
      Two things we need to do to fix build with OpenSSL 3.0:
      
      1. Use EVP_MAC_* functions instead of HMAC_*
         https://www.openssl.org/docs/man3.0/man3/HMAC_CTX_new.html
      
      2. Load the Legacy provider to enable legacy algorithms, such as MD4
         https://wiki.openssl.org/index.php/OpenSSL_3.0#Programming_in_OpenSSL_3.0
      
      Closes #6477
      
      NO_DOC=build fix
      NO_TEST=build fix
      NO_CHANGELOG=build fix
    • ssl: move OpenSSL library initialization code to separate file · f9739160
      Vladimir Davydov authored
      We redefine ssl_init and ssl_free in the EE build, because we need to do
      some extra work there. Currently, it's fine to duplicate the bulk of the
      OpenSSL library initialization code between EE and CE repositories, but
      with the introduction of OpenSSL 3.0 it's going to become more
      complicated so duplicating would look bad. Let's move the common code to
      ssl_init_impl() and ssl_free_impl() helper functions.
      
      Needed for #6477
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
    • crypto: use ERR_reason_error_string instead of ERR_error_string · 9cc130f0
      Vladimir Davydov authored
      ERR_error_string adds some extra information that depends on the OpenSSL
      library version (code, module, method). This information says nothing to
      the end user, and it results in different test results after updating to
      OpenSSL 3.0. Let's use ERR_reason_error_string instead, which just
      prints a human-readable error message.
      
      Part of #6477
      
      NO_DOC=minor change in error message
      NO_CHANGELOG=minor change in error message
    • crypto: fix openssl_err_str · f72662c5
      Vladimir Davydov authored
      openssl_err_str is used for reporting OpenSSL errors. It calls
      crypto_ERR_* functions using FFI. There's a typo in the code:
      ffi.crypto_ERR_error_string is used instead of ffi.C.*.
      
      We don't normally step on this, because OpenSSL doesn't return errors
      in our configuration, but if it did for some reason (e.g. a cipher was
      disabled in the library), we'd get a confusing error message.
      
      NO_DOC=bug fix
      NO_TEST=occur only on internal error
      NO_CHANGELOG=occur only on internal error
  9. Jun 01, 2022
    • ci: clean workspace on self-hosted runners · 7ff87404
      artembo authored
      Added 'tarantool/actions/cleanup' action to each job which uses
      self-hosted runners.
      
      The action cleans the workspace directory of a self-hosted runner after
      the previous run. The main reason to add this action is the 'Need a single
      revision' error [1] caused by a conflict of submodule versions;
      the standard 'actions/checkout' action fails with this error. It's a
      well-known problem, and the related issue [2] is still open.
      
      [1] https://github.com/tarantool/tarantool-qa/issues/145
      [2] https://github.com/actions/checkout/issues/418
      
      NO_DOC=ci
      NO_TEST=ci
      NO_CHANGELOG=ci
      
      Closes tarantool/tarantool-qa#145
    • core: introduce clock_lowres · 37d5ac5a
      Andrey Saranchin authored
      This patch introduces a low-resolution monotonic clock based on an
      interval timer. The clock is not thread-safe: it should be used only
      by the thread that initialized it.
      
      Part of #6085
      
      NO_CHANGELOG=internal feature
      NO_DOC=internal feature
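One way to build such a clock from standard POSIX pieces is a periodic `SIGALRM` that refreshes a cached timestamp. The sketch below is an assumption about the general technique, with hypothetical `toy_*` names; it is not the implementation from the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>

/* Cached time in seconds: written by the handler, read by one thread. */
static volatile double toy_lowres_now;

static double
toy_monotonic(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Timer tick: refresh the cached value. */
static void
toy_lowres_tick(int signo)
{
    (void)signo;
    toy_lowres_now = toy_monotonic();
}

/* Arm a periodic timer; the cache is refreshed on every tick. */
static int
toy_clock_lowres_init(long resolution_usec)
{
    assert(resolution_usec > 0 && resolution_usec < 1000000);
    struct sigaction sa;
    sa.sa_handler = toy_lowres_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;
    toy_lowres_now = toy_monotonic();
    struct itimerval it;
    it.it_interval.tv_sec = 0;
    it.it_interval.tv_usec = resolution_usec;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* Cheap read: no system call, at most one tick behind real time. */
static double
toy_clock_lowres(void)
{
    return toy_lowres_now;
}

/* Disarm the timer. */
static void
toy_clock_lowres_stop(void)
{
    struct itimerval off = {{0, 0}, {0, 0}};
    setitimer(ITIMER_REAL, &off, NULL);
}
```

The trade-off is precision for speed: reads are a plain memory load, and the value may lag by up to one timer interval. The unsynchronized global is why such a clock suits only the thread that set it up.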
    • replace sigprocmask() with pthread_sigmask() · 50107cf2
      Andrey Saranchin authored
      Since the use of sigprocmask() is unspecified in a multithreaded
      process, we should use pthread_sigmask() instead. This patch
      replaces all sigprocmask() calls with their pthread analogue.
      
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      NO_DOC=refactoring
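The replacement is mechanical because both calls take the same arguments. A minimal sketch using the standard POSIX API; `block_signal_in_this_thread` is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>
#include <signal.h>

/*
 * Block `signo` for the calling thread only. The previous mask is
 * stored in `old` so the caller can restore it later. Returns 0 on
 * success or an error number, as pthread_sigmask() does.
 */
static int
block_signal_in_this_thread(int signo, sigset_t *old)
{
    assert(old != NULL);
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    return pthread_sigmask(SIG_BLOCK, &set, old);
}
```

Unlike `sigprocmask()`, `pthread_sigmask()` reports failures through its return value rather than `errno`.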
    • ci: don't overwrite job artifacts in pkg workflows · 37759699
      Yaroslav Lobankov authored
      To ensure that regular and GC64 jobs in packaging workflows don't
      overwrite artifacts of each other, we need to use a different artifact
      name per job.
      
      NO_DOC=ci
      NO_TEST=ci
      NO_CHANGELOG=ci