  1. Jun 20, 2022
  2. Jun 18, 2022
    • Igor Munkin's avatar
      luajit: bump new version · b1953b59
      Igor Munkin authored
      * ci: add job for build using Ninja on Linux/x86_64
      * build: create file lists outside of CMake commands
      * build: use unique names for CMake targets
      * Revert "test: disable PUC-Rio tests for several -l options"
      * ci: make GitHub workflows more CMake-ish
      * test: adapt PUC-Rio tests for debug line hook
      * test: adapt PUC-Rio test for tail calls debug info
      * test: adapt PUC-Rio test with reversed function
      
      Closes #5693
      Closes #5702
      Closes #5782
      Follows up #5747
      
      NO_DOC=LuaJIT submodule bump
      NO_TEST=LuaJIT submodule bump
      NO_CHANGELOG=LuaJIT submodule bump
      b1953b59
  3. Jun 17, 2022
    • Cyrill Gorcunov's avatar
      fiber: don't crash on wakeup with dead fibers · 206137e7
      Cyrill Gorcunov authored
      
When a fiber has finished its work, it ends up in one of two cases:
1) If the "joinable" attribute is not set, the fiber is
   simply recycled.
2) Otherwise it keeps hanging around, waiting to be
   joined.

Our API allows calling fiber_wakeup() for dead but joinable
fibers (case 2). In release builds this has no side effects:
such fibers are simply ignored. In debug builds, however, it
triggers an assertion. We can't change the API for backward
compatibility's sake, but at the same time we must not keep
different behaviour between release and debug builds, since
that brings inconsistency. Thus let's get rid of the assertion
and allow calling fiber_wakeup() on dead fibers in debug
builds as well.
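For illustration, a minimal Lua sketch of the scenario (the flow and names
are illustrative, not part of the patch):

```
local fiber = require('fiber')

-- A joinable fiber that has already finished is woken up: this is now a
-- harmless no-op in debug builds as well.
local f = fiber.new(function() end)
f:set_joinable(true)
fiber.sleep(0)     -- let the fiber run to completion; it now waits to be joined
f:wakeup()         -- previously could hit an assertion in debug builds
f:join()
```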
      
      Fixes #5843
      
      NO_DOC=bug fix
      
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      206137e7
    • Serge Petrenko's avatar
      replication: unify replication filtering with and without elections · deca9749
      Serge Petrenko authored
      Once the split-brain detection is in place, it's fine to nopify obsolete
      data even on a node with elections disabled. Let's not keep a bug around
      anymore.
      
This behaviour change leads to changing
"gh_6842_qsync_applier_order_test.lua" a bit. It actually relied on the old
and buggy behaviour: it assumed old transactions would not be nopified
and would trigger a replication error.
      
      This doesn't happen anymore, because nopify works correctly, and the
      transactions are not followed by a conflicting CONFIRM.
      
The test for this commit simply alters
gh_5295_split_brain_detection_test.lua to work with elections disabled.
      
      Closes #6133
      Follow-up #5295
      
      NO_DOC=internal change
      NO_CHANGELOG=internal change
      deca9749
    • Cyrill Gorcunov's avatar
      txn_limbo: filter incoming synchro requests · af7d703f
      Cyrill Gorcunov authored
      
When we receive synchro requests, we can't just apply them blindly,
because in the worst case they may come from a split-brain configuration
(where a cluster has split into several clusters, each one with its own
elected leader, and the clusters then try to merge back into the original
one). We need to do our best to detect such disunity and force these
nodes to rejoin from scratch for the sake of data consistency.

Thus, when processing requests, we first pass them to a packet filter,
which validates their contents and refuses to apply them if they violate
consistency.
      
Depending on the request type, each packet traverses an appropriate chain.

filter_generic(): a common chain for any synchro packet.
 1) request:replica_id = 0 is allowed for PROMOTE requests only.
 2) request:replica_id should match limbo:owner_id, i.e. the
    limbo migration should be noticed by all instances in the
    cluster.

filter_confirm_rollback(): a chain for CONFIRM | ROLLBACK packets.
 1) A zero LSN is disallowed for such requests.

filter_promote_demote(): a chain for PROMOTE | DEMOTE packets.
 1) The request should come in with a nonzero term, otherwise
    the packet is corrupted.
 2) The request's term should not be less than the maximal known
    one, i.e. it should not come from nodes which didn't notice
    the raft epoch changes and are living in the past.

filter_queue_boundaries(): a common finalization chain.
 1) If the LSN of the request matches the current confirmed LSN, the packet
    is obviously correct to process.
 2) If the LSN is less than the confirmed LSN, the request is wrong:
    we have already processed the requested LSN.
 3) If the LSN is greater than the confirmed LSN, then
    a) if the limbo is empty we can't do anything, since the data is
       already processed, and we should issue an error;
    b) if there is some data in the limbo, the requested LSN should
       be in the range of the limbo's [first; last] LSNs, so that the
       request is able to confirm or roll back entries in the limbo queue.
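A hypothetical Lua sketch of these rules (the real filters are C functions
inside txn_limbo; the function and field names below are made up purely for
illustration):

```
-- Illustrative only: an encoding of the filtering rules described above.
local function filter_generic(limbo, req)
    if req.replica_id == 0 and req.type ~= 'PROMOTE' then
        return false  -- replica_id == 0 is allowed for PROMOTE only
    end
    -- the limbo migration must have been noticed by this instance
    return req.replica_id == 0 or req.replica_id == limbo.owner_id
end

local function filter_confirm_rollback(limbo, req)
    return req.lsn ~= 0  -- a zero LSN makes no sense for CONFIRM / ROLLBACK
end

local function filter_promote_demote(limbo, req)
    if req.term == 0 then
        return false  -- corrupted packet
    end
    return req.term >= limbo.greatest_term  -- no requests from the past
end

local function filter_queue_boundaries(limbo, req)
    if req.lsn == limbo.confirmed_lsn then
        return true                   -- nothing new, safe to process
    elseif req.lsn < limbo.confirmed_lsn then
        return false                  -- this LSN was already processed
    elseif limbo:is_empty() then
        return false                  -- nothing left to confirm or roll back
    else
        -- the LSN must point inside the queue so the request can confirm
        -- or roll back a part of it
        return req.lsn >= limbo:first_lsn() and req.lsn <= limbo:last_lsn()
    end
end
```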
      
Note that the filtering is disabled during the initial configuration, when
we apply requests from the only source of truth (either the remote master
or our own journal), so no split-brain is possible.

In order to make split-brain checks work, the applier nopify filter now
passes synchro requests from an obsolete term without nopifying them.

Also, now ANY asynchronous request coming from an instance with an obsolete
term is treated as a split-brain. Think of it as a synchronous
request committed with a malformed quorum.
      
      Closes #5295
      
      NO_DOC=it's literally below
      
Co-authored-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      
      @TarantoolBot document
      Title: new error type: ER_SPLIT_BRAIN
      
If for some reason the cluster had two leaders working independently (for
example, the user mistakenly lowered the quorum below N / 2 + 1), then
once such leaders and their followers try connecting to each other, they
will receive the ER_SPLIT_BRAIN error, and the connection will be
aborted. This is done to preserve data integrity. Once the user notices
such an error, he or she has to manually inspect the data on both
split halves, choose a way to restore the data, and rebootstrap one of
the halves from the other.
      af7d703f
    • Serge Petrenko's avatar
      txn_limbo: change function return types · 9eab2868
      Serge Petrenko authored
      Change return types of txn_limbo_req_prepare, txn_limbo_process,
      txn_limbo_write_promote, txn_limbo_write_demote from void to int.
      This is a preparation for when these functions start returning errors.
      
      Part-of #5295
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      9eab2868
    • Serge Petrenko's avatar
      box: change box_issue_promote(demote) return type · fd5e1439
      Serge Petrenko authored
      Make box_issue_promote and box_issue_demote return a return code.
      For now it's always 0, but soon they will return errors.
      
      Part-of #5295
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      fd5e1439
    • Serge Petrenko's avatar
      txn_limbo: track CONFIRM lsn on replicas · 129d83e9
      Serge Petrenko authored
limbo->confirmed_lsn was only filled on the limbo owner in
txn_limbo_write_confirm. Replicas and a recovering limbo owner need to track
it as well to correctly detect split-brains based on confirmed_lsn.
      
      So update confirmed_lsn in txn_limbo_read_confirm.
      
      Part-of #5295
      
      NO_DOC=internal change
      NO_TEST=tested in future commits
      NO_CHANGELOG=internal change
      129d83e9
    • Serge Petrenko's avatar
      txn_limbo: do not confirm/rollback anything after restart · 6cc1b1f2
      Serge Petrenko authored
It's important for the synchro queue owner not to finalize any of the
pending synchronous transactions after a restart.

Since the node was down for some time, the chances are pretty high it was
deposed by some new leader during its downtime. It means that the node
might not know yet that its transactions were already finalized by someone
else.
      
      So, any arbitrary finalization might lead to a future split-brain, once the
      remote PROMOTE finally reaches the local node.
      
      Let's fix this by adding a new reason for the limbo to be frozen - a
      queue owner has recovered but has not issued a new PROMOTE locally and
      hasn't received any PROMOTE requests from the remote nodes.
      
      Once the first PROMOTE is issued or received, it's safe to return to the
      old mode of operation.
      
      So, now the synchro queue owner starts in "frozen" state and can't
      CONFIRM, ROLLBACK or issue new transactions until either issuing a
      PROMOTE or receiving a PROMOTE from some remote node.
      
This also required modifying box.ctl.promote() behaviour: it's no
longer a no-op on a synchro queue owner when elections are disabled and
the queue is frozen due to a restart.
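For example, on a restarted queue owner with elections disabled, the queue
can now be unfrozen explicitly (an illustration of the behaviour change, not
a snippet from the patch):

```
-- After a restart the owner stays frozen ("frozen until promotion") until a
-- PROMOTE is issued or received; with elections disabled, issue it manually.
box.ctl.promote()
```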
      
Also fix the tests, which assumed the queue owner is writeable after a
restart. The gh-5298 test was partially deleted, because it became pointless.
      
      And while we are at it, remove the double run of gh-5288 test. It is
      storage engine agnostic, so there's no point in running it for both
      memtx and vinyl.
      
      Part-of #5295
      
      NO_CHANGELOG=covered by previous commit
      
      @TarantoolBot document
      Title: ER_READONLY error receives new reasons
      
      When box.info.ro_reason is "synchro" and some operation throws an
      ER_READONLY error, this error now might include the following reason:
      ```
      Can't modify data on a read-only instance - synchro queue with term 2
      belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen due to
      fencing
      ```
This means that the current instance is indeed the synchro queue owner,
but it has noticed that someone else in the cluster might start new
elections or might overtake the synchro queue soon.
This may also be detected by `box.info.election.term` becoming greater than
`box.info.synchro.queue.term` (this is the case for the second error
message).
      There is also a slightly different error message:
      ```
      Can't modify data on a read-only instance - synchro queue with term 2
      belongs to 1 (06c05d18-456e-4db3-ac4c-b8d0f291fd92) and is frozen until
      promotion
      ```
      This means that the node simply cannot guarantee that it is still the
      synchro queue owner (for example, after a restart, when a node still thinks
      it is the queue owner, but someone else in the cluster has already
      overtaken the queue).
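A hypothetical way to check for the fenced state from the Lua console, using
fields that already exist in box.info:

```
-- Illustration only: the instance still owns the synchro queue, but a newer
-- raft term is already visible, so writes are fenced.
if box.info.ro and box.info.ro_reason == 'synchro'
   and box.info.election.term > box.info.synchro.queue.term then
    print('queue owner is fenced; waiting for the new term owner to promote')
end
```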
      6cc1b1f2
    • Serge Petrenko's avatar
      txn_limbo: fence upon receiving raft term greater than queue term · 0e48475d
      Serge Petrenko authored
Receiving a raft term greater than the current queue term means that
someone has either already written PROMOTE (in case elections are
disabled), or is going to write PROMOTE once it wins the elections (in
case they are enabled).

In both cases the queue owner in an old term should freeze the limbo
until the queue term catches up with the raft term.

Unfreezing happens automatically once the synchro queue term catches up.
      
      Part-of #5295
      
      NO_DOC=covered by next commit
      0e48475d
    • Serge Petrenko's avatar
      txn_limbo: rework limbo->frozen flag · ce0a83eb
      Serge Petrenko authored
      Soon there will be more reasons for a transaction limbo to be frozen.
Let's make the limbo->frozen flag a bitmap and rename it to
limbo->frozen_reasons.
      The first bit, named frozen_due_to_fencing, represents the only current
      reason for the limbo to be frozen.
      While we are at it, rename txn_limbo_(un)freeze to txn_limbo_(un)fence
      to better reflect the situation.
      
      Part-of #5295
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      ce0a83eb
    • Serge Petrenko's avatar
txn_limbo: preserve confirmed_lsn after reading a PROMOTE · 896a20e4
      Serge Petrenko authored
Previously we assumed that every PROMOTE request changes the limbo owner,
and thus the limbo should have confirmed_lsn = 0 after the request is
processed, because the new confirmed LSN is not yet known.

This is not true for PROMOTE requests coming in during JOIN or saved in a
snapshot: such requests don't change the limbo owner; they are like
savepoints, notifying the instance of the current limbo state.
      
Such promotions may be detected by the rule:
replica_id (old limbo owner) == origin_id (new limbo owner)
      
      So, for the sake of correct split-brain detection, confirmed_lsn should
      be nonzero after such promotions.
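A hypothetical sketch of that detection rule (the field names are purely
illustrative):

```
-- Illustration only: a PROMOTE that keeps the limbo owner unchanged acts
-- like a savepoint rather than an ownership change.
local function promote_keeps_owner(req)
    return req.replica_id == req.origin_id
end
```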
      
      Part-of #5295
      
      NO_DOC=internal change
      NO_TEST=tested in future commits
      NO_CHANGELOG=internal change
      896a20e4
    • Serge Petrenko's avatar
      test: refactor gh_6036_qsync_order test · 978731b3
      Serge Petrenko authored
      
The test involves creating a manual split-brain between nodes r1 and r2.
After the introduction of split-brain detection it's impossible to reuse
the nodes in the next test without recreating them.
      
      Let's fix that by switching nodes r1 and r3. Now there's a split-brain
      between (r1, r2) and r3, and r3 isn't used in the following tests and
      may be safely deleted.
      
      Follow-up #5295
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      
Signed-off-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      978731b3
    • Serge Petrenko's avatar
      relay: fix PROMOTE and raft term ordering · 67090419
      Serge Petrenko authored
Fix two issues with sent_raft_term calculations:
* first of all, it doesn't matter during the initial and final join, so set it
  to UINT64_MAX.
* secondly, it's nullified after a successful dispatch from the tx
  thread, which might make the relay stall forever, for example, when
  elections are disabled.
      
      NO_DOC=bugfix
      NO_TEST=tested in next commit
      67090419
  4. Jun 16, 2022
    • Ilya Verbin's avatar
      core: allow spurious wakeups in coio_waitpid · 7a582646
      Ilya Verbin authored
Currently it's possible to wake up a fiber that is waiting for
child process termination using the Tarantool C API. This will leave
a zombie process behind. This patch reworks `coio_waitpid` in such
a way that it yields until `cw.data` is set to NULL in the process
status change callback.
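The general pattern, sketched here in Lua rather than the actual C code of
`coio_waitpid`, is to re-check a completion condition in a loop, so that a
stray wakeup is harmless (the names below are illustrative):

```
local fiber = require('fiber')

-- 'done' stands in for the completion flag (cw.data being reset in C).
local done = false
local waiter = fiber.create(function()
    while not done do
        fiber.sleep(math.huge)  -- a spurious wakeup just re-checks the flag
    end
end)
waiter:set_joinable(true)
-- ... later, from the completion callback:
done = true
waiter:wakeup()
waiter:join()
```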
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      7a582646
    • Ilya Verbin's avatar
      core: allow spurious wakeups in cord_cojoin · 87e7d312
      Ilya Verbin authored
Currently it's possible to wake up a fiber that is waiting for task
completion using the Tarantool C API. This will cause a "wrong fiber woken"
panic. This patch reworks `cord_cojoin` in such a way that it yields
until a completion flag is set.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      87e7d312
    • Ilya Verbin's avatar
      core: get rid of fiber_set_cancellable in hot_standby_f · 65470cb4
      Ilya Verbin authored
Currently it's possible to wake up a `hot_standby_f` fiber from Lua.
This does not lead to any error, but it results in redundant
`recover_remaining_wals` calls.
This patch handles such spurious wakeups in `hot_standby_f`.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      65470cb4
    • Ilya Verbin's avatar
      core: get rid of fiber_set_cancellable in gc_checkpoint_fiber_f · 6e5b89e0
      Ilya Verbin authored
      Spurious wakeups are already handled correctly.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      6e5b89e0
    • Georgiy Lebedev's avatar
      box: fix transaction "read-view" and "conflicted" states · 4d52199e
      Georgiy Lebedev authored
      Currently, there is a fundamental logical inconsistency with read-view and
      conflicted states of transactions.
      
Conflicted transactions see all prepared changes (e.g., #7238), because
they are handled differently from read-view ones. At the same time, one
does not know the state of the transaction until `box.commit` is called.
      
A similar problem arises with read-view transactions: if such transactions
do any DML statements, they are de facto conflicted, but this will only be
determined at the preparation stage:
      https://github.com/tarantool/tarantool/blob/79245573dabf3c1eb4eb904fd80ee84270360476/src/box/txn.c#L1006-L1013
      
Fix this inconsistency with the following changes:
1. Conflict "read-view" transactions immediately on an attempt to perform DML
statements, and guarantee this with an assertion at the preparation stage.
2. Make conflicted transactions unconditionally throw the "Transaction has been
aborted by conflict" error on any CRUD operations (including read-only
ones) until they are either rolled back (which will return no error) or
committed (which will return the same error).
      
      Closes #7238
      Closes #7239
      Closes #7240
      
      @TarantoolBot document
Title: new behaviour of "conflicted" transactions

"Conflicted" transactions now return a "Transaction has been aborted by
conflict" error on any CRUD operations (including read-only ones), until they
are either rolled back (which will return no error) or committed (which will
return the same error).
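A hypothetical illustration of the documented behaviour (it assumes
`box.cfg{memtx_use_mvcc_engine = true}` and a space named 's', neither of
which is part of this commit):

```
-- Illustration only: a transaction that became conflicted keeps returning
-- the same error on any CRUD operation until it ends.
box.begin()
-- ... a concurrent fiber commits a conflicting change here ...
local ok, err = pcall(box.space.s.select, box.space.s, {})
-- ok == false, err is "Transaction has been aborted by conflict"
box.rollback()  -- returns no error and ends the transaction
```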
      4d52199e
    • Sergey Bronnikov's avatar
      tutorial: use https links · 2f80fbf0
      Sergey Bronnikov authored
      NO_CHANGELOG=internal
      NO_DOC=internal
      NO_TEST=internal
      2f80fbf0
    • Pavel Balaev's avatar
      tools: fix gdb.sh revision regex · 375ceaaa
      Pavel Balaev authored
The regular expression now works on versions such as alpha, beta, rc, and so on.
      
      NO_DOC=bugfix
      NO_TEST=bugfix
      NO_CHANGELOG=bugfix
      375ceaaa
    • Pavel Balaev's avatar
      tools: edit gdb.sh code formatting · 83e8c50f
      Pavel Balaev authored
      Tabs were replaced with spaces to bypass checkpatch.
      
      NO_DOC=bugfix
      NO_TEST=bugfix
      NO_CHANGELOG=bugfix
      83e8c50f
  5. Jun 14, 2022
    • Vladimir Davydov's avatar
test: fix flaky vinyl/tx_gap_lock test · b3f462bf
      Vladimir Davydov authored
The `cmp_tuple` helper function is broken: it assumes that all tuple
fields, including the payload, are numeric. That isn't true: the payload
field is either nil or a string. This results in a false-positive test
failure:
      
      ```
      error: '[string "function cmp_tuple(t1, t2)     for i = 1, PAY..."]:1:
             attempt to compare nil with string'
      ```
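A hypothetical sketch of a type-tolerant comparison for such a field (the
real helper lives in the test; the names below are made up):

```
-- Illustration only: compare the payload field without assuming it is numeric.
local function cmp_payload(a, b)
    if type(a) ~= type(b) then
        return false          -- e.g. nil vs string: simply not equal
    end
    return a == b
end
```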
      
      Closes #6336
      
      NO_DOC=test
      NO_CHANGELOG=test
      b3f462bf
    • Yaroslav Lobankov's avatar
      test-run: bump to new version · e6e73423
      Yaroslav Lobankov authored
      Bump test-run to new version with the following improvements:
      
        - Fix issue with not detecting successful server start [1]
      
      [1] tarantool/test-run#343
      
      NO_DOC=testing stuff
      NO_TEST=testing stuff
      NO_CHANGELOG=testing stuff
      e6e73423
  6. Jun 09, 2022
    • Vladimir Davydov's avatar
      test: set shutdown timeout to infinity for default luatest instance · ede831d3
      Vladimir Davydov authored
      With the default shutdown timeout of 3 seconds, a test that leaves
      behind asynchronous requests would still pass, but it would take longer
      to finish, because the server instance started by Tarantool would have
      to wait for the dangling requests to complete. Setting the timeout to
      infinity will result in a hang, making us fix the test.
      
      Infinite timeout is also good for catching bugs like #7225 and #7256.
      
      We don't set the timeout for diff and TAP tests because those are
      deprecated and shouldn't be used for writing new tests. Nevertheless,
      I manually checked that none of them hangs if the timeout is set to
      infinity.
      
      Closes #6820
      
      NO_DOC=test
      NO_CHANGELOG=test
      ede831d3
    • Vladimir Davydov's avatar
      iostream: shutdown socket fd before close · 9cf03555
      Vladimir Davydov authored
      If a socket fd is shared by a child process, closing it in the parent
      will not shut down the underlying connection. As a result, the server
      may hang executing the graceful shutdown protocol. Fix this problem by
      explicitly shutting down the connection socket fd before closing it.
      
      This is a recommended way to terminate a Unix socket connection, see
      http://www.faqs.org/faqs/unix-faq/socket/#:~:text=2.6.%20%20When%20should%20I%20use%20shutdown()%3F
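A minimal Lua sketch of the same teardown order using Tarantool's socket
module (the address and flow are illustrative; the actual fix lives in the C
iostream code):

```
local socket = require('socket')

-- Illustration only: shut the connection down before closing the fd, so the
-- peer sees EOF even if the descriptor is also inherited by a child process.
local s = socket.tcp_connect('127.0.0.1', 3301)
if s ~= nil then
    s:shutdown(socket.SHUT_RDWR)  -- terminate the connection itself
    s:close()                     -- release this process's descriptor
end
```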
      
      Closes #7256
      
      NO_DOC=bug fix
      9cf03555
    • Ilya Verbin's avatar
      wal: allow spurious wakeups in wal_write · 4bf52367
      Ilya Verbin authored
It's possible to wake up a fiber that is waiting for WAL write
completion using the Tarantool C API. This results in an error like:
      ```
      main/118/lua F> Journal result code -1 can't be converted to an error
      ```
      
This patch introduces a flag that is set when the WAL write is
finished; the waiting fiber yields until the flag is set.
      
      Closes #6506
      
      NO_DOC=bugfix
      4bf52367
    • Yaroslav Lobankov's avatar
      test-run: bump to new version · 0dc60b5f
      Yaroslav Lobankov authored
      Bump test-run to new version with the following improvements:
      
        - Fail *.test.py tests in case of server start errors [1]
      
      [1] tarantool/test-run#333
      
      NO_DOC=testing stuff
      NO_TEST=testing stuff
      NO_CHANGELOG=testing stuff
      0dc60b5f
  7. Jun 08, 2022
    • Mergen Imeev's avatar
      sql: fix wrong ephemeral space format · a6818acc
      Mergen Imeev authored
This patch fixes format building for the case when an ephemeral space is
used for ORDER BY and the ORDER BY clause references at least two of the
selected columns.
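A hypothetical query shape that exercises this path (the table and column
names are made up):

```
-- Illustration only: the result set is sorted through an ephemeral space,
-- and ORDER BY references two of the selected columns.
box.execute([[SELECT a, b FROM t ORDER BY b, a;]])
```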
      
      Closes #7042
      
      NO_DOC=Bugfix
      a6818acc
    • Serge Petrenko's avatar
      decimal: fix index comparison with Inf, NaN · 22fc1f94
      Serge Petrenko authored
There was an assertion failure when inserting a decimal into an index
which contained a double Inf or NaN.

The reason was that the decimal_from_*() return values were never checked,
and decimal_from_double() cannot handle NaN or Inf, because
these values are not representable as decimal numbers.

Start handling decimal_from_<type>() return values and fix decimal
comparison with Inf and NaN.
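A hypothetical reproducer shape in Lua (the space and index names are made
up; only the field types matter):

```
local decimal = require('decimal')

-- Illustration only: a 'number' index already holding a double Inf, then a
-- decimal inserted into the same index goes through the fixed comparison.
local s = box.schema.space.create('nums', {if_not_exists = true})
s:create_index('pk', {parts = {{1, 'number'}}, if_not_exists = true})
s:replace{math.huge}          -- double +Inf stored in the index
s:replace{decimal.new('42')}  -- used to hit an assertion when compared with Inf
```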
      
      Closes #6377
      
      NO_DOC=bugfix
      22fc1f94
  8. Jun 07, 2022
    • Yaroslav Lobankov's avatar
      test: use unix socket in replication-py/swim tests · cb6fc4a3
      Yaroslav Lobankov authored
To reduce the chance of encountering the tarantool/test-run#141 issue in
replication-py/swim tests, let's switch to using Unix sockets instead
of TCP ports for the tarantool console.
      
      NO_DOC=testing stuff
      NO_TEST=testing stuff
      NO_CHANGELOG=testing stuff
      cb6fc4a3
  9. Jun 06, 2022
    • Ilya Verbin's avatar
      core: allow spurious wakeups in cbus_call · bd6fb06a
      Ilya Verbin authored
Currently it's possible to wake up a fiber that is waiting for `cbus_call`
completion using the Tarantool C API. This will cause a misleading `TimedOut`
error. This patch reworks `cbus_call` in such a way that it yields until
a completion flag is set.
      
      Part of #7166
      
      NO_DOC=refactoring
      NO_CHANGELOG=refactoring
      bd6fb06a
    • Ilya Verbin's avatar
      core: get rid of unused cbus_flush · e568e7f0
      Ilya Verbin authored
      Part of #7166
      
      NO_DOC=refactoring
      NO_TEST=refactoring
      NO_CHANGELOG=refactoring
      e568e7f0
    • Timur Safin's avatar
      datetime: refactor interval_to_string · b7ff1615
      Timur Safin authored
      Simplify/shorten `interval_to_string()` implementation.
      
      Part of #7045
      
      NO_CHANGELOG=refactoring
      NO_DOC=refactoring
      NO_TEST=refactoring
      b7ff1615
    • Timur Safin's avatar
      datetime: do not mess with nsec in interval · 36bc6f83
      Timur Safin authored
Do not even try to make the secs/nsec output more readable; rather,
report them as is, without any [de]normalization.

Previously the output was:
      ```
      tarantool> dt.interval.new{min=1, sec=59, nsec=2e9+1}
      --
      - +1 minutes, 61.000000001 seconds
      ...
      ```
      
Now it is reported as:
      ```
      tarantool> dt.interval.new{min=1, sec=59, nsec=2e9+1}
      --
      - +1 minutes, 59 seconds, 2000000001 nanoseconds
      ...
      ```
      
      Closes #7045
      
      NO_DOC=internal
      36bc6f83
    • Vladimir Davydov's avatar
      net.box: fix hang in graceful shutdown protocol · 79245573
      Vladimir Davydov authored
      The graceful shutdown protocol works as follows:
      
       1. The server sends a shutdown request (the box.shutdown event) to all
          its clients that subscribed to it.
       2. Upon receiving a shutdown request, a client is supposed to close its
          connection.
       3. The server waits for all clients subscribed to box.shutdown event to
          exit.
       4. The server exits.
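A hypothetical Lua sketch of the client side of this protocol using the
net.box watcher API (the address is made up; net.box performs this
subscription internally):

```
local net_box = require('net.box')

-- Illustration only: subscribe to the box.shutdown event and close the
-- connection when the server announces shutdown (step 2 of the protocol).
local conn = net_box.connect('127.0.0.1:3301')
conn:watch('box.shutdown', function(key, value)
    if value then
        conn:close()
    end
end)
```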
      
In net.box, the box.shutdown event is processed by `remote._callback`.
The problem is that `remote._callback` may be garbage collected
while the `remote` object isn't. If this happens, the shutdown request
will never get processed, and the server won't exit until the `remote`
object is garbage collected, which may take forever.
      
      Let's fix this issue by breaking the worker loop if we see that the
      callback was garbage collected.
      
      Closes #7225
      
      NO_DOC=bug fix
      79245573
  10. Jun 02, 2022