Skip to content
Snippets Groups Projects
  1. Oct 12, 2018
    • Kirill Yukhin's avatar
      Add compile_commands.json to git ignore · e0017ad6
      Kirill Yukhin authored
      e0017ad6
    • Vladimir Davydov's avatar
      vinyl: implement basic transaction throttling · c0d8063b
      Vladimir Davydov authored
      If the rate at which transactions are ready to write to the database is
      greater than the dump bandwidth, memory will get depleted before the
      previously scheduled dump is complete and all newer transactions will
      have to wait, which may take seconds or even minutes:
      
        W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec
      
      This patch set implements basic transaction throttling that is supposed
      to help avoid unpredictably long stalls. Now the transaction write rate
      is always capped by the observed dump bandwidth, because it doesn't make
      sense to consume memory at a greater rate than it can be freed. On top
      of that, when a dump begins, we estimate the amount of time it is going
      to take and limit the transaction write rate accordingly.
      
      Note, this patch doesn't take into account compaction when setting the
      rate limit so compaction threads may still fail to keep up with dumps,
      increasing the read amplification. It will be addressed later.
      
      Closes #1862
    • Vladimir Davydov's avatar
      vinyl: fix memory dump trigger · 45d61b66
      Vladimir Davydov authored
      vy_quota_signal() doesn't wake up a consumer if it won't be able to
      proceed because of the memory limit. This is OK, but it doesn't attempt
      to trigger memory dump in this case either. As a result, it may occur
      that dump isn't triggered and all waiting consumers are aborted by
      timeout.  E.g. this happens if memory dump releases no memory, which is
      possible because memory is allocated and freed in 16 MB chunks. This
      results in occasional vinyl/quota_tmeout test failures.
      
      Fix this by moving the dump trigger right in vy_quota_may_use() so that
      it's called whenever we consider a consumer for wakeup.
      45d61b66
    • Vladimir Davydov's avatar
      vinyl: do not account small dumps for bandwidth estimation · e351b3e6
      Vladimir Davydov authored
      Small dumps (e.g. triggered by box.snapshot) have too high overhead
      associated with file creation so taking them into account for bandwidth
      estimation may result in erroneous transaction throttling. Let's ignore
      dumps of size less than 1 MB.
      
      Needed for #1862
      e351b3e6
    • Vladimir Davydov's avatar
      vinyl: do not try to trigger dump in regulator if already in progress · b80f437f
      Vladimir Davydov authored
      This is pointless since trigger_dump_cb callback will return right away
      in such a case. Let's wrap trigger_dump_cb in vy_regulator_trigger_dump
      method, which will actulally invoke the callback only if the previous
      dump has already completed (i.e. vy_regulator_dump_complete was called).
      
      This also gives us a definite place in code where we can adjust the rate
      limit so as to guarantee that a triggered memory dump will finish before
      we hit the hard memory limit (this will be done later).
      
      Needed for #1862
      b80f437f
    • Vladimir Davydov's avatar
      vinyl: bypass format validation for statements loaded from disk · 3846d9b2
      Vladimir Davydov authored
      When the format of a space is altered, we walk over all tuples stored in
      the primary index and check them against the new format. This doesn't
      guarantee that all *statements* stored in the primary index conform to
      the new format though, because the check isn't performed for deleted or
      overwritten statements, e.g.
      
        s = box.schema.space.create('test', {engine = 'vinyl'})
        s:create_index('primary')
        s:insert{1}
        box.snapshot()
        s:delete{1}
      
        -- The following command will succeed, because the space is empty,
        -- however one of the runs contains REPLACE{1}, which doesn't conform
        -- to the new format.
        s:create_index('secondary', {parts = {2, 'unsigned'}})
      
      This is OK as we will never return such overwritten statements to the
      user, however we may still need to read them. Currently, this leads
      either to an assertion failure or to a read error in
      
        vy_stmt_decode
         vy_stmt_new_with_ops
          tuple_init_field_map
      
      We could probably force major compaction of the primary index to purge
      such statements, but it is complicated as there may be a read view
      preventing the write iterator from squashing such a statement, and
      currently there's no way to force destruction of a read view.
      
      So this patch simply disables format validation for all tuples loaded
      from disk (actually we already skip format validation for all secondary
      index statements and for DELETE statements in primary indexes so this
      isn't as bad as it may seem). To do that, it adds a boolean parameter to
      tuple_init_field_map() that disables format validation, and then makes
      vy_stmt_new_with_ops(), which is used for constructing vinyl statements,
      set it to false. This is OK as all statements inserted into a vinyl
      space are validated explicitly with tuple_validate() anyway.
      
      This is rather a workaround for the lack of a better solution.
      
      Closes #3540
      3846d9b2
    • Vladimir Davydov's avatar
      test: fix spurious box/sql test failure · 23e71c6e
      Vladimir Davydov authored
      For some reason this test uses 555 for space id, which may be taken by
      a previously created space:
      
      Test failed! Result content mismatch:
      --- box/sql.result        Fri Oct  5 17:23:25 2018
      +++ box/sql.reject        Fri Oct 12 19:38:51 2018
      @@ -12,12 +12,14 @@
       ...
       _ = box.schema.space.create('test1', { id = 555 })
       ---
      +- error: Duplicate key exists in unique index 'primary' in space '_space'
       ...
      
      Reproduce file:
      
      ---
      - [box/rtree_point.test.lua, null]
      - [box/transaction.test.lua, null]
      - [box/tree_pk.test.lua, null]
      - [box/access.test.lua, null]
      - [box/cfg.test.lua, null]
      - [box/admin.test.lua, null]
      - [box/lua.test.lua, null]
      - [box/bitset.test.lua, null]
      - [box/role.test.lua, null]
      - [box/sql.test.lua, null]
      ...
      
      Remove { id = 555 } to make sure it never happens.
      23e71c6e
    • Alexander Turenko's avatar
      Add Linux/clang CI target · 181bb3e7
      Alexander Turenko authored
      Replaced targets generation using a matrix expansion + exclusion list
      with the explicit targets list. Gave meagingful names for targets.
      
      Fixes #3673.
      181bb3e7
    • Vladimir Davydov's avatar
      xlog: fix filename in error messages · 4a464f8a
      Vladimir Davydov authored
       - xlog_rename() doesn't strip xlog->filename of inprogress suffix so
         write errors will mistakenly report the filename as inprogress.
       - xlog_create() uses a name without inprogress suffix for error
         reporting while it actually creates an inprogress file.
      4a464f8a
  2. Oct 10, 2018
    • Georgy Kirichenko's avatar
      socket: fix polling in case of spurious wakeup · e6bd7748
      Georgy Kirichenko authored
      socket_writable/socket_readable handles socket.iowait spurious wakeup
      until event is happened or timeout is exceeded.
      
      Closes #3344
      e6bd7748
    • Vladimir Davydov's avatar
      vinyl: fix for deferred DELETE overwriting newer statement · 63912c30
      Vladimir Davydov authored
      A deferred DELETE may be generated after a newer statement for the same
      key was inserted into a secondary index and hence land in a newer run.
      Since the read iterator assumes that newer sources always contain newer
      statements for the same key, we mark all deferred DELETE statements with
      VY_STMT_SKIP_READ flag, which makes run/mem iterators ignore them. The
      flag must be persisted when a statement is written to disk, but it is
      not. Fix this.
      
      Fixes commit 504bc805 ("vinyl: do not store meta in secondary index
      runs").
      63912c30
    • Alexander Turenko's avatar
      test: disable feedback daemon test on Mac OS in CI · ab868a6b
      Alexander Turenko authored
      The fail is known and should not have any influence on our CI results.
      
      The test should be enabled back after a fix of #3558.
      ab868a6b
  3. Oct 08, 2018
    • Vladimir Davydov's avatar
      cmake: fix sync_file_range detection · e1aa1a3d
      Vladimir Davydov authored
      sync_file_range is declared only if _GNU_SOURCE macro is defined.
      Also, in order to be used in a source file, HAVE_SYNC_FILE_RANGE
      must be present in config.h.cmake.
      
      Fixes commit caae99e5 ("Refactor xlog writer").
      e1aa1a3d
  4. Oct 06, 2018
  5. Oct 05, 2018
    • Vladimir Davydov's avatar
      replication: ref checkpoint needed to join replica · bae6f037
      Vladimir Davydov authored
      Before joining a new replica we register a gc_consumer to prevent
      garbage collection of files needed for join and following subscribe.
      Before commit 9c5d851d ("replication: remove old snapshot files not
      needed by replicas") a consumer would pin both checkpoints and WALs so
      that would work as expected. However, the above mentioned commit
      introduced consumer types and marked a consumer registered on replica
      join as WAL-only so if the garbage collector was invoked during join, it
      could delete files corresponding to the relayed checkpoint resulting in
      replica join failure. Fix this issue by pinning the checkpoint used for
      joining a replica with gc_ref_checkpoint and unpinning once join is
      complete.
      
      The issue can only be reproduced if there are vinyl spaces, because
      deletion of an open snap file doesn't prevent the relay from reading it.
      The existing replication/gc test would catch the issue if it triggered
      compaction on the master so we simply tweak it accordingly instead of
      adding a new test case.
      
      Closes #3708
      bae6f037
    • Vladimir Davydov's avatar
      gc: call gc_run unconditionally when consumer is advanced · a3542586
      Vladimir Davydov authored
      gc_consumer_unregister and gc_consumer_advance don't call gc_run in case
      the consumer in question isn't leftmost. This code was written back when
      gc_run was kinda heavy and would call engine/wal callbacks even if it
      wouldn't really need to. Today gc_run will bail out shortly, without
      making any complex computation, let alone invoking garbage collection
      callbacks, in case it has nothing to do so those optimizations are
      pointless. Let's remove them.
      a3542586
    • Vladimir Davydov's avatar
      gc: separate checkpoint references from wal consumers · 28c97041
      Vladimir Davydov authored
      Initially, gc_consumer object was used for pinning both checkpoint and
      WAL files, but commit 9c5d851d ("replication: remove old snapshot
      files not needed by replicas") changed that. Now whether a consumer pins
      WALs or checkpoints or both depends on gc_consumer_type. This was done
      so that replicas wouldn't prevent garbage collection of checkpoint
      files, which they don't need after initial join is complete.
      
      The way the feature was implemented is rather questionable though:
       - Since consumers of both types are stored in the same binary search
         tree, we have to iterate through the tree to find the leftmost
         checkpoint consumer, see gc_tree_first_checkpoint. This looks
         inefficient and ugly.
       - The notion of advancing a checkpoint consumer (gc_consumer_advance)
         is dubious: there's no point to move on to the next checkpoint after
         reading one - instead the consumer needs incremental changes, i.e.
         WALs.
      
      To eliminate those questionable aspects and make the code easier for
      understanding, let's separate WAL and checkpoint consumers. We do this
      by removing gc_consumer_type and making gc_consumer track WALs only.
      For pinning the files corresponding to a checkpoint a new object class
      is introduced, gc_checkpoint_ref. To pin a checkpoint, gc_ref_checkpoint
      needs to be called. It is passed the gc_checkpoint object to pin, the
      consumer name, and the gc_checkpoint_ref to store the ref in. To unpin a
      previously pinned checkpoint, gc_checkpoint_unref should be called.
      
      References are listed by box.info.gc() for each checkpoint under
      'references' key.
      28c97041
    • Vladimir Davydov's avatar
      gc: improve box.info.gc output · eba06790
      Vladimir Davydov authored
      Report vclocks in addition to signatures. When box.info.gc was first
      introduced we used signatures in gc. Now we use vclocks so there's no
      reason not to report them. This is consistent with box.info output
      (there's vclock and signature).
      
      Report the vclock and signature of the oldest WAL row available on the
      instance under box.info.gc().vclock. Without this information the user
      would have to figure it out by looking at box.info.gc().consumers.
      eba06790
    • Vladimir Davydov's avatar
      gc: cleanup garbage collection procedure · 668ec2aa
      Vladimir Davydov authored
      Do some refactoring intended to make the code of gc_run() easier for
      understanding:
       - Remove gc_state::checkpoint_vclock. It was used to avoid rerunning
         engine gc callback in case no checkpoint was deleted. Since we
         maintain a list of all available checkpoints, we don't need it for
         this anymore - we can run gc only if a checkpoint was actually
         removed from the list.
       - Rename gc_state::wal_vclock back to gc_state::vclock.
       - Use bool variables with descriptive names instead of comparing
         vclock signatures.
       - Add some comments.
      668ec2aa
    • Vladimir Davydov's avatar
      gc: keep track of available checkpoints · a55f86fd
      Vladimir Davydov authored
      Currently, the checkpoint iterator is in fact a wrapper around
      memtx_engine::snap_dir while the garbage collector knows nothing about
      checkpoints. This feels like encapsulation violation. Let's keep track
      of all available checkpoints right in the garbage collector instead
      and export gc_ API to iterate over checkpoints.
      a55f86fd
    • Vladimir Davydov's avatar
      gc: rename checkpoint_count to min_checkpoint_count · 1162743a
      Vladimir Davydov authored
      Because it's the minimal number of checkpoints that must not be deleted,
      not the actual number of preserved checkpoints. Do it now, in a separate
      patch so as to ease review of the next patch.
      
      While we are at it, fix the comment to gc_set_(min_)checkpoint_count()
      which got outdated by commit 5512053f ("box: gc: do not remove files
      being backed up").
      1162743a
    • Vladimir Davydov's avatar
      gc: format consumer name in gc_consumer_register · 1ed5f1e9
      Vladimir Davydov authored
      It's better than using tt_snprintf at call sites.
      1ed5f1e9
    • Vladimir Davydov's avatar
      gc: fold gc_consumer_new · 7b42ed21
      Vladimir Davydov authored
      gc_consumer_new is used in gc_consumer_register. Let's fold it to make
      the code flow more straightforward.
      7b42ed21
    • Vladimir Davydov's avatar
      gc: use fixed length buffer for storing consumer name · 961938b0
      Vladimir Davydov authored
      The length of a consumer name never exceeds 64 characters so no use to
      allocate a string. This is a mere code simplification.
      961938b0
    • Vladimir Davydov's avatar
      gc: make gc_consumer and gc_state structs transparent · 5231bb6b
      Vladimir Davydov authored
      It's exasperating to write trivial external functions for each member of
      an opaque struct (gc_consumer_vclock, gc_consumer_name, etc) while we
      could simply access those fields directly if we made those structs
      transparent. Since we usually define structs as transparent if we need
      to use them outside a source file, let's do the same for gc_consumer and
      gc_state and remove all those one-line wrappers.
      5231bb6b
    • Vladimir Davydov's avatar
      vinyl: force deletion of runs left from unfinished indexes on restart · eb6280d0
      Vladimir Davydov authored
      If an instance is restarted while building a new vinyl index, there will
      probably be some run files left. Currently, we won't delete such files
      until box.snapshot() is called, even though there's no point in keeping
      them around. Let's tweak vy_gc_lsm() so that it marks all runs that
      belong to an unfinished index as incomplete to force vy_gc() to remove
      them immediately after recovery is complete.
      
      This also removes files left from a failed rebootstrap attempt so we can
      remove a call to box.snapshot() from vinyl/replica_rejoin.test.lua.
      eb6280d0
    • Vladimir Davydov's avatar
      vinyl: fix master crash on replica join failure · 626dfb2c
      Vladimir Davydov authored
      This patch fixes a trivial error on vy_send_range() error path which
      results in a master crash in case a file needed to join a replica is
      missing or corrupted.
      
      See #3708
      626dfb2c
    • Georgy Kirichenko's avatar
      Set lua state for main fiber too · 9039df9c
      Georgy Kirichenko authored
      The main fiber should have a lua state as any other lua fiber.
      
      Needed for #3538
      9039df9c
    • Alexander Turenko's avatar
      Add -Werror for CI (1.10 part) · da505ee7
      Alexander Turenko authored
      Added MAKE_BUILD_TYPE=RelWithDebInfoWError option, which means enabling
      -DNDEBUG=1, -O2 and -Wall -Wextra -Werror. This ensures we have clean
      release build without warnings.
      
      Fixed found -Wunused-variable and -Wunused-parameter warnings.
      
      Part of #3238.
      da505ee7
  6. Oct 03, 2018
    • Vladislav Shpilevoy's avatar
      utf8: allow empty strings in utf8.upper/lower · 129099bc
      Vladislav Shpilevoy authored
      Closes #3709
      129099bc
    • Olga Arkhangelskaia's avatar
      replication: fix assertion with duplicate connection · 03a9bb1a
      Olga Arkhangelskaia authored
      Patch fixes behavior when replica tries to connect to the same master
      more than once. In case when it is initial configuration we raise the
      exception. If it in not initial config we print the error and disconnect
      the applier.
      
      @locker: minor test cleanup.
      
      Closes #3610
      03a9bb1a
    • Vladimir Davydov's avatar
      vinyl: zap vy_env::memory, read_threads, and write_threads · 69a4b786
      Vladimir Davydov authored
      They are only used to set corresponding members of vy_quota, vy_run_env,
      and vy_scheduler when vy_env is created. No point in keeping them around
      all the time.
      69a4b786
    • Vladimir Davydov's avatar
      vinyl: enable quota upon recovery completion explicitly · 6af3d41a
      Vladimir Davydov authored
      Currently, we create a quota object with the limit maximized, and only
      set the configured limit when local recovery is complete, so as to make
      sure that no dump is triggered during recovery. As a result, we have to
      store the configured limit in vy_env::memory, which looks ugly, because
      this member is never used afterwards. Let's introduce a new method
      vy_quota_enable to enable quota so that we can set the limit right on
      quota object construction. This implies that we add a boolean flag to
      vy_quota and only check the limit if it is set.
      
      There's another reason to add such a method. Soon we will implement
      quota consumption rate limiting. Rate limiting requires a periodic timer
      that would replenish quota. It only makes sense to start such a timer
      upon recovery completion, which again leads us to an explicit method for
      enabling quota.
      
      vy_env::memory will be removed by the following patch along with a few
      other pointless members of vy_env.
      
      Needed for #1862
      6af3d41a
    • Vladimir Davydov's avatar
      vinyl: implement quota wait queue without fiber_cond · fac38eec
      Vladimir Davydov authored
      Using fiber_cond as a wait queue isn't very convenient, because:
       - It doesn't allow us to put a spuriously woken up fiber back to the
         same position in the queue where it was, thus violating fairness.
       - It doesn't allow us to check whether we actually need to wake up a
         fiber or it will have to go back to sleep anyway as it needs more
         memory than currently available.
       - It doesn't allow us to implement a multi-queue approach where fibers
         that have different priorities are put to different queues.
      
      So let's rewrite the wait queue with plain rlist and fiber_yield.
      
      Needed for #1862
      fac38eec
    • Vladimir Davydov's avatar
      vinyl: move transaction size sanity check to quota · a94d2770
      Vladimir Davydov authored
      There's a sanity check in vinyl_engine_prepare, which checks if the
      transaction size is less than the configured limit and fails without
      waiting for quota if it isn't. Let's move this check to vy_quota_use,
      because it's really a business of the quota object. This implies that
      vy_quota_use has to set diag to differentiate this error from timeout.
      a94d2770
    • Vladimir Davydov's avatar
      vinyl: minor refactoring of quota methods · ee4ea944
      Vladimir Davydov authored
      The refactoring is targeted at facilitating introduction of rate
      limiting within the quota class. It moves code blocks around, factors
      out some blocks in functions, and improves comments. No functional
      changes.
      
      Needed for #1862
      ee4ea944
    • Vladimir Davydov's avatar
      vinyl: factor load regulator out of quota · 90ffaa8d
      Vladimir Davydov authored
      Turned out that throttling isn't going to be as simple as maintaining
      the write rate below the estimated dump bandwidth, because we also need
      to take into account whether compaction keeps up with dumps. Tracking
      compaction progress isn't a trivial task and mixing it in a module
      responsible for resource limiting, which vy_quota is, doesn't seem to be
      a good idea. Let's factor out the related code into a separate module
      and call it vy_regulator. Currently, the new module only keeps track of
      the write rate and the dump bandwidth and sets the memory watermark
      accordingly, but soon we will extend it to configure throttling as well.
      
      Since write rate and dump bandwidth are now a part of the regulator
      subsystem, this patch renames 'quota' entry of box.stat.vinyl() to
      'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether,
      because memory usage is reported under 'memory.level0' while the limit
      can be read from box.cfg.vinyl_memory, and renames 'use_rate' to
      'write_rate', because the latter seems to be a more appropriate name.
      
      Needed for #1862
      90ffaa8d
  7. Oct 02, 2018
  8. Sep 26, 2018
    • Vladimir Davydov's avatar
      replication: don't stop syncing on configuration errors · 4baa71bc
      Vladimir Davydov authored
      When replication is restarted with the same replica set configuration
      (i.e. box.cfg{replication = box.cfg.replication}), there's a chance that
      an old relay will be still running on the master at the time when a new
      applier tries to subscribe. In this case the applier will get an error:
      
        main/152/applier/localhost:62649 I> can't join/subscribe
        main/152/applier/localhost:62649 xrow.c:891 E> ER_CFG: Incorrect value for
            option 'replication': duplicate connection with the same replica UUID
      
      Such an error won't stop the applier - it will keep trying to reconnect:
      
        main/152/applier/localhost:62649 I> will retry every 1.00 second
      
      However, it will stop synchronization so that box.cfg() will return
      without an error, but leave the replica in the orphan mode:
      
        main/151/console/::1:42606 C> failed to synchronize with 1 out of 1 replicas
        main/151/console/::1:42606 C> entering orphan mode
        main/151/console/::1:42606 I> set 'replication' configuration option to
          "localhost:62649"
      
      In a second, the stray relay on the master will probably exit and the
      applier will manage to subscribe so that the replica will leave the
      orphan mode:
      
        main/152/applier/localhost:62649 C> leaving orphan mode
      
      This is very annoying, because there's no need to enter the orphan mode
      in this case - we could as well keep trying to synchronize until the
      applier finally succeeds to subscribe or replication_sync_timeout is
      triggered.
      
      So this patch makes appliers enter "loading" state on configuration
      errors, the same state they enter if they detect that bootstrap hasn't
      finished yet. This guarantees that configuration errors, like the one
      above, won't break synchronization and leave the user gaping at the
      unprovoked orphan mode.
      
      Apart from the issue in question (#3636), this patch also fixes spurious
      replication-py/multi test failures that happened for exactly the same
      reason (#3692).
      
      Closes #3636
      Closes #3692
      4baa71bc
    • Vladimir Davydov's avatar
      replication: fix recoverable error reporting · 98449ced
      Vladimir Davydov authored
      First, we print "will retry every XX second" to the log after an error
      message only for socket and system errors although we keep trying to
      establish a replication connection after configuration errors as well.
      Let's print this message for those errors too to avoid confusion.
      
      Second, in case we receive an error in reply to SUBSCRIBE command, we
      log "can't read row" instead of "can't join/subscribe". This happens,
      because we switch an applier to SYNC/FOLLOW state before receiving a
      reply to SUBSCRIBE command. Fix this by updating an applier state only
      after successfully subscribing.
      
      Third, we detect duplicate connections coming from the same replica on
      the master only after sending a reply to SUBSCRIBE command, that is in
      relay_subscribe rather than in box_process_subscribe. This results in
      "can't read row" being printed to the replica's log even though it's
      actually a SUBSCRIBE error. Fix this by moving the check where it
      actually belongs.
      98449ced
Loading