  1. Nov 19, 2018
    • Olga Arkhangelskaia's avatar
      box: fix comparison of config tables · 3de37476
      Olga Arkhangelskaia authored
      box.cfg() updates only those options that have actually changed.
      However, this does not always hold for replication: box.cfg{replication = x}
      and box.cfg{replication = {x}} are treated as different values, and as
      a result replication is restarted. The patch fixes this behaviour.
      
      Closes #3711
      3de37476
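A minimal sketch of the idea behind 3de37476, assuming a simplified model in which a replication option is just a list of URI strings. The struct and function names are hypothetical and this is not the actual patch; it only shows that comparing by content makes `replication = x` and `replication = {x}` equal.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a 'replication' option value. */
struct repl_option {
	const char *const *uris; /* array of URI strings */
	size_t count;            /* number of URIs */
	bool is_scalar;          /* true for replication = x, false for {x} */
};

/*
 * Compare two option values by content only, ignoring whether a single
 * URI was written as a scalar or as a one-element table.
 */
bool
repl_option_equal(const struct repl_option *a, const struct repl_option *b)
{
	if (a->count != b->count)
		return false;
	for (size_t i = 0; i < a->count; i++) {
		if (strcmp(a->uris[i], b->uris[i]) != 0)
			return false;
	}
	return true;
}
```

With such a comparison, a reconfiguration that only changes the spelling of the option is not seen as a change, so there is no reason to restart replication.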
  2. Nov 13, 2018
    • Serge Petrenko's avatar
      box: ensure fiber processing box_cfg doesn't process messages from iproto · 031aea10
      Serge Petrenko authored
      In box_cfg() we have a call to gc_set_wal_watcher(), which creates pipes
      between 'wal' and 'tx' under the hood using cbus_pair().
      While pipes are being created, the fiber calling gc_set_wal_watcher()
      will process all the messages coming to 'tx' thread from iproto. This is
      wrong, since we have a separate fiber pool to handle iproto messages,
      and background fibers shouldn't participate in these messages
      processing. For example, this causes occasional credential corruption in
      the fiber executing box_cfg().
      
      Since tx fiber pool is already created at the time gc_set_wal_watcher()
      is called, we may forbid message processing for the fiber which calls
      the function; one of the tx fiber pool fibers will wake us up when the
      pipes are created.
      
      Closes #3779
      031aea10
  3. Nov 05, 2018
  4. Nov 03, 2018
  5. Nov 02, 2018
    • Mergen Imeev's avatar
      box: wrong is_nullable for multiple indexes · 52b84d2e
      Mergen Imeev authored
      If a field isn't defined by the space format, then in case of multiple
      indexes the field option is_nullable was the same as it was for the last
      index that defines it. This is wrong, as it should be 'true' only
      if it is 'true' for all indexes that define it.
      
      Closes #3744.
      52b84d2e
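The rule stated in 52b84d2e can be sketched as a logical AND over all index parts that cover the field. The helper name and the flat array of flags are assumptions made for illustration, not the real data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: a field that is not described by the space format
 * is nullable only if every index part that covers it is nullable.
 */
bool
field_is_nullable(const bool *part_is_nullable, size_t part_count)
{
	bool is_nullable = true;
	for (size_t i = 0; i < part_count; i++)
		is_nullable = is_nullable && part_is_nullable[i];
	return is_nullable;
}
```

The buggy behaviour corresponds to returning the flag of the last part only, so {false, true} would wrongly yield 'true'.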
  6. Nov 01, 2018
    • Georgy Kirichenko's avatar
      Show names of Lua functions in backtraces · 5b00d646
      Georgy Kirichenko authored
      Trace corresponding Lua state as well as normal C stack frames while
      fiber backtracing. This might be useful for debugging purposes.
      
      Fixes: #3538
      5b00d646
    • Georgy Kirichenko's avatar
      Use fiber lua state for triggers if possible · 7d040848
      Georgy Kirichenko authored
      Lua trigger invocation reuses the fiber's Lua state if it exists instead
      of creating a new one for each invocation. This is needed for Lua
      stack reconstruction during backtracing.
      
      Relates: #3538
      7d040848
    • Georgy Kirichenko's avatar
      Proper unwind for currently executing fiber · 98b2cdfa
      Georgy Kirichenko authored
      Each yielded fiber has its coro state preserved in a corresponding
      variable, whereas an executing fiber keeps its volatile state in CPU
      registers (stack pointer, instruction pointer and non-volatile registers),
      so the value of its context-storing variable is invalid.
      For an already yielded fiber we have to use a special asm-written handler
      to temporarily switch to the preserved state and capture the execution
      context, which is not needed for the executing fiber.
      After the patch, NULL is passed to the backtrace function as the coro
      context for the executing fiber, so the backtrace function can decide
      whether to use the special context-switching handler or just call
      unw_getcontext from the unwind library.
      98b2cdfa
    • Georgy Kirichenko's avatar
      Do not inline coro unwind function · 85878bcf
      Georgy Kirichenko authored
      Do not inline coro_unwcontext because the unwind handler expects
      a separate stack frame.
      85878bcf
    • Vladimir Davydov's avatar
      httpc: fix compilation with libcurl >= 7.62.0 · 02da15f7
      Vladimir Davydov authored
      Starting from libcurl 7.62.0, CURLE_SSL_CACERT is defined as a macro
      alias to CURLE_PEER_FAILED_VERIFICATION, see
      
        https://github.com/curl/curl/commit/3f3b26d6feb0667714902e836af608094235fca2
      
      This breaks compilation:
      
        httpc.c:337:7: error: duplicate case value 'CURLE_PEER_FAILED_VERIFICATION'
                case CURLE_PEER_FAILED_VERIFICATION:
                     ^
        httpc.c:336:7: note: previous case defined here
                case CURLE_SSL_CACERT:
                     ^
        curl.h:589:26: note: expanded from macro 'CURLE_SSL_CACERT'
        #define CURLE_SSL_CACERT CURLE_PEER_FAILED_VERIFICATION
                                 ^
      
      Fix this by using CURLE_SSL_CACERT only if libcurl version is less
      than 7.62.0.
      
      Note, we can't use CURL_AT_LEAST_VERSION to check libcurl version,
      because it isn't available in libcurl shipped with CentOS 6.
      02da15f7
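A sketch of a version check based on LIBCURL_VERSION_NUM, which libcurl encodes as 0xXXYYZZ (major, minor, patch). The helper macro and function names are hypothetical; LIBCURL_VERSION_NUM is given a fallback value here only so that the sketch compiles without libcurl headers.

```c
/*
 * LIBCURL_VERSION_NUM normally comes from <curl/curlver.h>. The fallback
 * below is an assumption: it pretends we build against libcurl 7.62.0.
 */
#ifndef LIBCURL_VERSION_NUM
#define LIBCURL_VERSION_NUM 0x073E00
#endif

/* Hypothetical helper: pack major.minor.patch the way libcurl does. */
#define HTTPC_CURL_VERSION(major, minor, patch) \
	(((major) << 16) | ((minor) << 8) | (patch))

/* Nonzero if a separate CURLE_SSL_CACERT case label is still needed. */
int
httpc_needs_ssl_cacert_case(void)
{
#if LIBCURL_VERSION_NUM < HTTPC_CURL_VERSION(7, 62, 0)
	return 1;
#else
	return 0;
#endif
}
```

Unlike CURL_AT_LEAST_VERSION, this relies only on LIBCURL_VERSION_NUM, and comparing the packed number keeps working across major version bumps.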
    • Vladimir Davydov's avatar
      httpc: fix curl version check in httpc_set_keepalive · 74e20755
      Vladimir Davydov authored
      Obviously, the version check as it is now won't work once libcurl 8.0.0
      is released. Use LIBCURL_VERSION_NUM to correctly check libcurl version.
      
      Note, we can't use CURL_AT_LEAST_VERSION to check libcurl version,
      because it isn't available in libcurl shipped with CentOS 6.
      
      Fixes commit 7e62ac79 ("Add HTTP client based on libcurl").
      74e20755
  7. Oct 29, 2018
    • Vladimir Davydov's avatar
      replication: keep header when request is modified by before_replace · 480c55b6
      Vladimir Davydov authored
      When space.before_replace trigger modifies the result of a remote
      operation, we clear the request header so that it gets rebuilt on
      commit. This is incorrect, because as a result we don't bump the
      master's component of the replica's vclock, which leads to the request
      being applied again when the replica reconnects. The issue manifests
      itself in sporadic replication/before_replace test failures.
      
      Fix it by updating the request header rather than clearing it so that
      replica id and lsn get preserved.
      
      Closes #3722
      480c55b6
    • Serge Petrenko's avatar
      hot_standby: reflect amount of recovered rows in box.info · 85299d97
      Serge Petrenko authored
      To be able to switch to a hot_standby instance with minimal downtime, we
      need to know how far it is behind the primary instance, i.e. up to what
      vclock we have recovered. Previously this was impossible because
      box.info.vclock always referenced replicaset.vclock, which isn't updated
      during hot_standby.
      
      Introduce a pointer to relevant vclock: either recovery vclock (during
      local recovery) or replicaset.vclock (at all other times) and use it in
      box.info.vclock, box.info.lsn and box.info.signature.
      
      @locker: renamed last_row_vclock to box_vclock and constified it.
      
      Closes #3002
      85299d97
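The pointer switch described in 85299d97 might look roughly like this. The vclock is reduced to a single signature value and all names carry a _sketch suffix or are otherwise hypothetical; only the idea of one pointer that selects the relevant vclock is taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a vector clock: only the signature is kept. */
struct vclock_sketch {
	int64_t signature;
};

struct vclock_sketch recovery_vclock;   /* advanced during local recovery */
struct vclock_sketch replicaset_vclock; /* used at all other times */

/* Points at whichever vclock is relevant right now. */
const struct vclock_sketch *box_vclock_sketch = &replicaset_vclock;

void
set_local_recovery(bool in_progress)
{
	box_vclock_sketch = in_progress ? &recovery_vclock : &replicaset_vclock;
}

/* What a box.info.signature-like getter would read. */
int64_t
info_signature(void)
{
	return box_vclock_sketch->signature;
}
```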
  8. Oct 26, 2018
    • Georgy Kirichenko's avatar
      lua: fix tuple cdata collecting · 022a3c50
      Georgy Kirichenko authored
      In some cases luajit does not collect cdata objects which were
      transformed with ffi.cast as tuple_bless does. In consequence, internal
      table with gc callback overflows and then lua crashes. There might be an
      internal luajit issue because it fires only for jitted code. But assigning
      a gc callback before transformation fixes the problem.
      
      Closes #3751
      022a3c50
    • Vladimir Davydov's avatar
      vinyl: do not account bloom filters to runtime quota · e4338cc5
      Vladimir Davydov authored
      Back when bloom filters were introduced, neither box.info.memory() nor
      box.stat.vinyl().memory existed, so bloom filters were accounted to
      box.runtime.info().used for lack of a better place. Now, there's no
      point in accounting them there. In fact, it's confusing, because bloom
      filters are allocated with malloc(), not from the runtime arena, so
      let's drop it.
      e4338cc5
    • Vladimir Davydov's avatar
      vinyl: fix memory leak in slice stream · 0066457c
      Vladimir Davydov authored
      If a tuple read from a run by a slice stream happens to be out of the
      slice bounds, it will never be freed. Fix it.
      
      The leak was introduced by commit c174c985 ("vinyl: implement new
      simple write iterator").
      0066457c
  9. Oct 25, 2018
    • Serge Petrenko's avatar
      replication: make join stage more informative · a6a22f1b
      Serge Petrenko authored
      This patch logs the number of rows received by the applier during the
      join stage, the same way recovery does.
      
      Closes #3165
      a6a22f1b
    • Kirill Yukhin's avatar
      schema: refactor space cache API · c3dd46c5
      Kirill Yukhin authored
      Remove the function that deletes from the cache and make replace more
      general: it can now be used for insertions, deletions and replacements.
      Also, add an assertion to the replace routine that the space pointer
      found in the cache equals the old one.
      c3dd46c5
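A sketch of a single replace entry point in the spirit of c3dd46c5, assuming a toy cache indexed by space id. The fixed-size array and all names are hypothetical; the real cache is a different data structure.

```c
#include <assert.h>
#include <stddef.h>

struct space_sketch {
	unsigned id;
};

#define CACHE_SIZE 16
struct space_sketch *cache[CACHE_SIZE];

/*
 * Single entry point:
 *   old_space == NULL, new_space != NULL: insertion
 *   old_space != NULL, new_space == NULL: deletion
 *   both non-NULL:                        replacement
 */
void
cache_replace(struct space_sketch *old_space, struct space_sketch *new_space)
{
	assert(old_space != NULL || new_space != NULL);
	unsigned id = new_space != NULL ? new_space->id : old_space->id;
	assert(id < CACHE_SIZE);
	/* The entry currently cached must be the one the caller expects. */
	assert(cache[id] == old_space);
	cache[id] = new_space;
}
```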
    • Vladimir Davydov's avatar
      wal: delete old wal files when running out of disk space · 8a1bdc82
      Vladimir Davydov authored
      Now if the WAL thread fails to preallocate disk space needed to commit
      a transaction, it will delete old WAL files until it succeeds or it
      deletes all files that are not needed for local recovery from the oldest
      checkpoint. After it deletes a file, it notifies the garbage collector
      via the WAL watcher interface. The latter then deactivates consumers
      that would need deleted files.
      
      The user doesn't see an ENOSPC error if the WAL thread successfully
      allocates disk space after deleting old files. Here's what's printed
      to the log when this happens:
      
        wal/101/main C> ran out of disk space, try to delete old WAL files
        wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000005.xlog
        wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000006.xlog
        wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000007.xlog
        main/105/main C> deactivated WAL consumer replica 82d0fa3f-6881-4bc5-a2c0-a0f5dcf80120 at {1: 5}
        main/105/main C> deactivated WAL consumer replica 98dce0a8-1213-4824-b31e-c7e3c4eaf437 at {1: 7}
      
      Closes #3397
      8a1bdc82
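The retry loop from 8a1bdc82 can be illustrated with a toy disk model. Everything below is hypothetical (no real files are touched); it only mirrors the stated policy: delete old WAL files, oldest first, until the reservation succeeds or only files needed for local recovery from the oldest checkpoint remain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a WAL directory; every name here is hypothetical. */
struct wal_dir_sketch {
	size_t free_bytes;       /* free space on the device */
	size_t old_file_size[8]; /* sizes of WAL files, oldest first */
	size_t first_needed;     /* first file needed for local recovery */
	size_t first_existing;   /* oldest file that is not deleted yet */
};

/* Reserve 'need' bytes, deleting old WAL files if space runs out. */
bool
wal_reserve(struct wal_dir_sketch *dir, size_t need)
{
	while (dir->free_bytes < need) {
		if (dir->first_existing >= dir->first_needed)
			return false; /* nothing more may be deleted: ENOSPC */
		dir->free_bytes += dir->old_file_size[dir->first_existing++];
		/* This is where the garbage collector would be notified. */
	}
	dir->free_bytes -= need;
	return true;
}
```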
    • Vladimir Davydov's avatar
      wal: add event_mask to wal_watcher · b073b017
      Vladimir Davydov authored
      In order to implement WAL auto-deletion, we need a notification channel
      through which the WAL thread could notify TX that a WAL file was deleted
      so that the latter can shoot off stale replicas. We will reuse existing
      wal_watcher API for this. Currently, wal_watcher invokes the registered
      callback on each WAL write so using it as is would be inefficient. To
      avoid that, let's allow the caller to specify events of interest when
      registering a wal_watcher.
      
      Needed for #3397
      b073b017
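A sketch of event filtering as described in b073b017: the watcher states which events it is interested in, and everything else is dropped before any notification is sent. The event names and the struct layout are assumptions; only the field names event_mask and pending_events follow the commit titles in this log.

```c
/* Hypothetical event bits. */
enum {
	WATCH_EVENT_WRITE  = 1 << 0, /* a row was written to the WAL */
	WATCH_EVENT_ROTATE = 1 << 1, /* a new WAL file was opened */
	WATCH_EVENT_GC     = 1 << 2, /* an old WAL file was deleted */
};

struct watcher_sketch {
	unsigned event_mask;     /* events the subscriber asked for */
	unsigned pending_events; /* events waiting to be delivered */
};

/*
 * Record an event only if the watcher subscribed to it.
 * Returns nonzero if the watcher has to be notified.
 */
int
watcher_notify(struct watcher_sketch *w, unsigned events)
{
	events &= w->event_mask;
	if (events == 0)
		return 0;
	w->pending_events |= events;
	return 1;
}
```

A watcher that only cares about deleted files is then never woken up by ordinary WAL writes.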
    • Vladimir Davydov's avatar
      wal: rename wal_watcher->events to pending_events · da9c8c14
      Vladimir Davydov authored
      We will add another event bitmap to wal_watcher. To avoid confusion
      between them, let's rename wal_watcher->events.
      da9c8c14
    • Vladimir Davydov's avatar
      wal: pass wal_watcher_msg to wal_watcher callback · 7077341e
      Vladimir Davydov authored
      This should make it easier to pass some extra information along with the
      event mask. For example, we will use it to pass the vclock of the oldest
      stored WAL, which is needed for WAL auto-deletion.
      
      Needed for #3397
      7077341e
    • Vladimir Davydov's avatar
      wal: preallocate disk space before writing rows · 76110901
      Vladimir Davydov authored
      This patch introduces a new xlog method xlog_fallocate() that makes
      sure that the requested amount of disk space is available at the current
      write position. It does that with posix_fallocate(). The new method is
      called before writing anything to WAL, see wal_fallocate(). In order not
      to invoke the system call too often, wal_fallocate() allocates disk
      space in big chunks (1 MB).
      
      The reason why I'm doing this is that I want to have a single and
      clearly defined point in the code to handle ENOSPC errors, where
      I could delete old WALs and retry (this is what #3397 is about).
      
      Needed for #3397
      76110901
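The chunked preallocation from 76110901 comes down to rounding the missing space up to a whole number of 1 MB chunks. The helper below is a hypothetical illustration of that arithmetic only; it does not call posix_fallocate() itself.

```c
#include <stddef.h>

enum { WAL_CHUNK_SKETCH = 1024 * 1024 }; /* 1 MB, as in the commit message */

/*
 * Hypothetical helper: 'reserved' is how many preallocated bytes remain
 * past the write position, 'need' is the size of the next write.
 * Returns the number of bytes to request from posix_fallocate(), or 0
 * if the existing reservation already covers the write.
 */
size_t
wal_fallocate_size(size_t reserved, size_t need)
{
	if (reserved >= need)
		return 0;
	size_t missing = need - reserved;
	size_t chunks = (missing + WAL_CHUNK_SKETCH - 1) / WAL_CHUNK_SKETCH;
	return chunks * WAL_CHUNK_SKETCH;
}
```

Most writes are much smaller than a chunk, so the system call is issued only once per megabyte or so of WAL data.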
    • Vladimir Davydov's avatar
      vinyl: fix memory leak in write iterator · f4ae714b
      Vladimir Davydov authored
      Memory allocated for vy_write_iterator::src_heap is never freed. Fix it.
      The leak was introduced by commit c174c985 ("vinyl: implement new
      simple write iterator").
      f4ae714b
  10. Oct 24, 2018
    • Vladimir Davydov's avatar
      xlog: turn use_coio argument of xdir_collect_garbage to flags · a198e273
      Vladimir Davydov authored
      So that we can add more flags.
      a198e273
    • Vladimir Davydov's avatar
      vinyl: account disk statements of each type · b2f85642
      Vladimir Davydov authored
      This patch adds a new entry to per index statistics reported by
      index.stat():
      
        disk.statement
          inserts
          replaces
          deletes
          upserts
      
      It shows the number of statements of each type stored in run files.
      The new statistics are persisted in index files. We will need this
      information so that we can force major compaction when there are too
      many DELETE statements accumulated in run files.
      
      Needed for #3225
      b2f85642
    • Vladimir Davydov's avatar
      vinyl: remove useless local var from vy_range_update_compact_priority · 0a158a4d
      Vladimir Davydov authored
      Local variable total_size equals total_stmt_count.bytes_compressed so we
      don't really need it.
      0a158a4d
    • Vladimir Davydov's avatar
      tuple: zap tuple_extra · e65ba254
      Vladimir Davydov authored
      tuple_extra() allows storing arbitrary metadata inside tuples.
      To use it, one should set extra_size when creating a tuple_format.
      It was introduced for storing UPSERT counter or column mask inside
      vinyl statements. Turned out that it wasn't really needed as UPSERT
      counter can be stored on lsregion while column mask doesn't need to
      be stored at all.
      
      Actually, the whole idea of tuple_extra() is rather crooked: why
      would we need it if we can inherit struct tuple instead, as we do
      in case of memtx_tuple and vy_stmt? Accessing an inherited struct
      is much more convenient than using tuple_extra().
      
      So this patch gets rid of tuple_extra(). To do that, it partially
      reverts the following commits:
      
      6c0842e0 vinyl: refactor vy_stmt_alloc()
      74ff46d8 vinyl: add special format for tuples with column mask
      11eb7816 Add extra size to tuple_format->field_map_size
      e65ba254
    • Vladimir Davydov's avatar
      tuple: zap tuple_format_dup · 9b8c3949
      Vladimir Davydov authored
      This function was only used for creating a format for tuples with column
      mask in vinyl. Not needed anymore and can be removed.
      
      Anyway, it doesn't make much sense to duplicate a tuple format, because
      it can be referenced instead. Besides, once JSON indexes are introduced,
      duplicating a tuple format will be really painful. One more reason to
      drop it now.
      9b8c3949
    • Vladimir Davydov's avatar
      vinyl: zap vy_stmt_column_mask and mem_format_with_colmask · 08afd57f
      Vladimir Davydov authored
      Finally, these atrocities are not used anywhere and can be removed.
      08afd57f
    • Vladimir Davydov's avatar
      vinyl: explicitly pass column mask to vy_check_is_unique · dae21083
      Vladimir Davydov authored
      This patch is a preparation for removing vy_stmt_column_mask.
      dae21083
    • Vladimir Davydov's avatar
      vinyl: explicitly pass column mask to vy_tx_set · 3a0ab1e1
      Vladimir Davydov authored
      This patch is a preparation for removing vy_stmt_column_mask.
      3a0ab1e1
    • Vladimir Davydov's avatar
      vinyl: do not use column mask as trigger for turning REPLACE into INSERT · 4b96c8a9
      Vladimir Davydov authored
      If a REPLACE statement was generated by an UPDATE operation that updated
      a column indexed by a secondary key, we can turn it into INSERT when the
      secondary index is dumped, because there can't be an older statement
      with the same key other than DELETE. Currently, we use the statement
      column mask to detect such REPLACEs in the write iterator, but I'm
      planning to get rid of vy_stmt_column_mask so let's instead introduce
      a new statement flag to mark such REPLACEs.
      4b96c8a9
    • Vladimir Davydov's avatar
      vinyl: factor out common code of UPDATE and UPSERT · c6985874
      Vladimir Davydov authored
      This patch introduces a helper function vy_perform_update() that
      performs operations common for UPDATE and UPSERT, namely replaces
      a tuple in a transaction write set.
      c6985874
    • Vladimir Davydov's avatar
      vinyl: move update optimization from write iterator to tx · 9d0ccd66
      Vladimir Davydov authored
      An UPDATE operation is written as DELETE + REPLACE to secondary indexes.
      We write those statements to the memory level even if the UPDATE doesn't
      actually update columns indexed by a secondary key. We filter them out
      in the write iterator when the memory level is dumped. That's what we
      use vy_stmt_column_mask for.
      
      Actually, there's no point in keeping those statements until dump - we
      could as well filter them out when the transaction is committed. This
      would even save some memory. This wouldn't hurt read operations, because
      point lookup doesn't work for secondary indexes by design and so we have
      to read all sources, including disk, on every read from a secondary
      index.
      
      That said, let's move update optimization from the write iterator to
      vy_tx_commit. This is a step towards removing vy_stmt_column_mask.
      9d0ccd66
  11. Oct 23, 2018
    • Alexander Turenko's avatar
      xlog: fix sync_is_async xlog option · 55dcde00
      Alexander Turenko authored
      The behaviour change was introduced in cda3cb55: the sync_is_async
      option was no longer propagated from xdir; sync_interval was dropped
      too, but was restored in 1900c58b.
      
      The commit fixes a performance regression of around 6-14% in average RPS
      on the default nosqlbench workload with a 30-second duration. Additional
      information about benchmarking can be found in #3747.
      
      Thanks to Vladimir Davydov (@locker) for the investigation of the
      cda3cb55 changes.
      
      Closes #3747
      
      (cherry picked from commit cd9cc4c5)
      55dcde00
  12. Oct 13, 2018
    • Vladimir Davydov's avatar
      replication: fix rebootstrap crash in case master has replica's rows · d4ce7447
      Vladimir Davydov authored
      During SUBSCRIBE the master sends only those rows originating from the
      subscribed replica that aren't present on the replica. Such rows may
      appear after a sudden power loss in case the replica doesn't issue
      fdatasync() after each WAL write, which is the default behavior. This
      means that a replica can write some rows to WAL, relay them to another
      replica, then stop without syncing WAL file. If this happens we expect
      the replica to read its own rows from other members of the cluster upon
      restart. For more details see commit eae84efb ("replication: recover
      missing local data from replica").
      
      Obviously, this feature only makes sense for SUBSCRIBE. During JOIN
      we must relay all rows. This is how it initially worked, but commit
      adc28591 ("replication: do not delete relay on applier disconnect")
      witlessly removed the corresponding check from relay_send_row() so that
      now we don't send any rows originating from the joined replica:
      
        @@ -595,8 +630,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
                 * it). In the latter case packet's LSN is less than or equal to
                 * local master's LSN at the moment it received 'SUBSCRIBE' request.
                 */
        -       if (relay->replica == NULL ||
        -           packet->replica_id != relay->replica->id ||
        +       if (packet->replica_id != relay->replica->id ||
                    packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
                                              packet->replica_id)) {
                        relay_send(relay, packet);
      
      (relay->local_vclock_at_subscribe is initialized to 0 on JOIN)
      
      This only affects the case of rebootstrap, automatic or manual, because
      when a new replica joins a cluster there can't be any rows on the master
      originating from it. On manual rebootstrap, i.e. when the replica files
      are deleted by the user and the replica is restarted from an empty
      directory with the same UUID (set via box.cfg.instance_uuid), this isn't
      critical - the replica will still receive those rows it should have
      received during JOIN once it subscribes. However, in case of automatic
      rebootstrap this can result in broken order of xlog/snap files, because
      the replica directory still contains old xlog/snap files created before
      rebootstrap. The rebootstrap logic expects them to have strictly smaller
      vclocks than new files, but if JOIN stops prematurely, this condition
      may not hold, leading to a crash when the vclock of a new xlog/snap is
      inserted into the corresponding xdir.
      
      This patch fixes this issue by restoring pre eae84efb behavior: now
      we create a new relay for FINAL JOIN instead of reusing the one attached
      to the joined replica so that relay_send_row() can detect JOIN phase and
      relay all rows in this case. It also adds a comment so that we don't
      make such a mistake in future.
      
      Apart from fixing the issue, this patch also fixes a relay leak in
      relay_initial_join() in case engine_join_xc() fails, which was also
      introduced by the above mentioned commit.
      
      A note about xlog/panic_on_broken_lsn test. Now the relay status isn't
      reported by box.info.replication if FINAL JOIN failed and the replica
      never subscribed (this is how it worked before commit eae84efb) so
      we need to tweak the test a bit to handle this.
      
      Closes #3740
      d4ce7447
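The condition restored by d4ce7447 can be read as a small predicate. The sketch below follows the three-part check shown in the diff above (the lines with the removed `relay->replica == NULL` test), but the structs are simplified stand-ins: the vclock lookup is reduced to a single LSN and all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct replica_sketch {
	uint32_t id;
};

struct relay_sketch {
	const struct replica_sketch *replica; /* NULL during JOIN */
	/* Stand-in for vclock_get(&local_vclock_at_subscribe, replica id). */
	int64_t lsn_at_subscribe;
};

/*
 * A row is relayed if this is a JOIN (no replica attached), if it does
 * not originate from the subscribed replica, or if the master already
 * had it at the moment of SUBSCRIBE.
 */
bool
relay_should_send(const struct relay_sketch *relay,
		  uint32_t row_replica_id, int64_t row_lsn)
{
	return relay->replica == NULL ||
	       row_replica_id != relay->replica->id ||
	       row_lsn <= relay->lsn_at_subscribe;
}
```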
  13. Oct 12, 2018
    • Vladimir Davydov's avatar
      vinyl: implement basic transaction throttling · c0d8063b
      Vladimir Davydov authored
      If the rate at which transactions are ready to write to the database is
      greater than the dump bandwidth, memory will get depleted before the
      previously scheduled dump is complete and all newer transactions will
      have to wait, which may take seconds or even minutes:
      
        W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec
      
      This patch set implements basic transaction throttling that is supposed
      to help avoid unpredictably long stalls. Now the transaction write rate
      is always capped by the observed dump bandwidth, because it doesn't make
      sense to consume memory at a greater rate than it can be freed. On top
      of that, when a dump begins, we estimate the amount of time it is going
      to take and limit the transaction write rate accordingly.
      
      Note, this patch doesn't take into account compaction when setting the
      rate limit so compaction threads may still fail to keep up with dumps,
      increasing the read amplification. It will be addressed later.
      
      Closes #1862
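A rough sketch of the two limits described in c0d8063b. The formula used while a dump is in progress (free memory divided by the estimated remaining dump time) is an assumption made for illustration, as are all names; the commit message only says that the rate is limited according to the estimated dump time.

```c
#include <stddef.h>

/*
 * Hypothetical rate limit, in bytes per second. The write rate is always
 * capped by the observed dump bandwidth. Assumption: while a dump is in
 * progress, writers may additionally consume at most the memory that is
 * still free before the dump is expected to finish.
 */
size_t
tx_rate_limit(size_t dump_bandwidth, size_t free_memory,
	      double dump_seconds_left)
{
	size_t limit = dump_bandwidth;
	if (dump_seconds_left > 0) {
		double during_dump = (double)free_memory / dump_seconds_left;
		if (during_dump < (double)limit)
			limit = (size_t)during_dump;
	}
	return limit;
}
```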
    • Vladimir Davydov's avatar
      vinyl: fix memory dump trigger · 45d61b66
      Vladimir Davydov authored
      vy_quota_signal() doesn't wake up a consumer if it won't be able to
      proceed because of the memory limit. This is OK, but it doesn't attempt
      to trigger memory dump in this case either. As a result, it may occur
      that dump isn't triggered and all waiting consumers are aborted by
      timeout.  E.g. this happens if memory dump releases no memory, which is
      possible because memory is allocated and freed in 16 MB chunks. This
      results in occasional vinyl/quota_timeout test failures.
      
      Fix this by moving the dump trigger right in vy_quota_may_use() so that
      it's called whenever we consider a consumer for wakeup.
      45d61b66
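The fix in 45d61b66 moves the dump trigger into the check itself. A minimal sketch, assuming a quota with a single limit and a flag that stands in for the real trigger callback; the names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

struct quota_sketch {
	size_t limit;        /* hard memory limit */
	size_t used;         /* memory currently in use */
	bool dump_requested; /* set when a dump should be started */
};

/*
 * Check whether a consumer may take 'size' bytes. If the limit is in
 * the way, request a dump as a side effect, so that waiting consumers
 * are not left to time out without any dump being scheduled.
 */
bool
quota_may_use(struct quota_sketch *q, size_t size)
{
	if (q->used + size > q->limit) {
		q->dump_requested = true;
		return false;
	}
	return true;
}
```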