  1. Sep 14, 2018
    • Fix http error test · 76a8bd32
      AKhatskevich authored
      The test expected that http:get yields; however, in the case of
      a very fast unix_socket and parallel test execution, a context
      switch during the call led to the absence of a yield and to an
      instant reply. That caused an error during `fiber:cancel`.
      
      The problem is solved by increasing http server response time.
      
      Closes #3480
  2. Sep 13, 2018
    • json: add options to json.encode() · 1663bdc4
      Roman Khabibov authored
      Add an ability to pass options to json.encode()/decode().
      
      Closes: #2888.
      
      @TarantoolBot document
      Title: json.encode() json.decode()
      Add an ability to pass options to
      json.encode() and json.decode().
      These are the same options that
      are used globally in json.cfg().
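      For illustration, a minimal sketch of the new call form (option
      names are taken from json.cfg(); the values are arbitrary):

        local json = require('json')
        -- per-call options override the global json.cfg() settings
        json.encode({a = {b = {c = 1}}}, {encode_max_depth = 2})
        json.decode('{"a": {"b": {"c": 1}}}', {decode_max_depth = 2})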
  3. Sep 10, 2018
    • Fix libgomp linking for static build · 0a3186c4
      Kirill Yukhin authored
      Since adding -fopenmp to the compiler flags also means
      adding -lgomp at the link stage, pass -fno-openmp
      to the linking stage in the case of a static build. In that
      case the OMP functions are statically linked into libmisc.

      Also, emit an error if trying to perform a static build
      using clang.
  4. Sep 09, 2018
    • vinyl: add global memory stats · e78ebb77
      Vladimir Davydov authored
      box.info.memory() gives you some insight into what memory is used for,
      but it's very coarse. For vinyl we need finer-grained global memory
      statistics.

      This patch adds them: they are reported under box.stat.vinyl().memory
      and consist of the following entries:
      
       - level0: sum size of level-0 of all LSM trees.
       - tx: size of memory used by tx write and read sets.
       - tuple_cache: size of memory occupied by tuple cache.
       - page_index: size of memory used for storing page indexes.
       - bloom_filter: size of memory used for storing bloom filters.
      
      It also removes box.stat.vinyl().cache, as the size of cache is now
      reported under memory.tuple_cache.
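      A quick way to inspect the new counters from the console (the
      numbers below are purely illustrative):

        tarantool> box.stat.vinyl().memory
        ---
        - tuple_cache: 14157056
          tx: 0
          level0: 262144
          page_index: 48288
          bloom_filter: 69920
        ...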
    • vinyl: fix accounting of secondary index cache statements · 16faada1
      Vladimir Davydov authored
      Since commit 0c5e6cc8 ("vinyl: store full tuples in secondary index
      cache"), we store primary index tuples in secondary index cache, but we
      still account them as separate tuples. Fix that.
      
      Follow-up #3478
      Closes #3655
    • vinyl: set box.cfg.vinyl_write_threads to 4 by default · fe1e4694
      Vladimir Davydov authored
      Any LSM-based database design implies a high level of write amplification,
      so there should be more compaction threads than dump threads. With the
      current default value of 2 for box.cfg.vinyl_write_threads, we start
      only one compaction thread. Let's increase the default to 4 so that
      three compaction threads are started by default, which fits the
      LSM-based design better.
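      The option still has to be set in the initial box.cfg{} call; a sketch
      of overriding the default (the value 8 is arbitrary):

        box.cfg{vinyl_write_threads = 8}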
    • vinyl: don't start scheduler fiber until local recovery is complete · 7069eab5
      Vladimir Davydov authored
      We must not schedule any background jobs during local recovery, because
      they may disrupt yet to be recovered data stored on disk. Since we start
      the scheduler fiber as soon as the engine is initialized, we have to
      pull some tricks to make sure it doesn't schedule any tasks: the
      scheduler fiber function yields immediately upon startup; we assume
      that it won't be woken up until local recovery is complete, because we
      don't set the memory limit until then.
      
      This looks rather flimsy, because the logic is spread among several
      seemingly unrelated functions: the scheduler fiber (vy_scheduler_f),
      the quota watermark callback (vy_env_quota_exceeded_cb), and the engine
      recovery callback (vinyl_engine_begin_initial_recovery), where we leave
      the memory limit unset until recovery is complete. The latter isn't even
      mentioned in comments, which makes the code difficult to follow. Think
      how everything would fall apart should we try to wake up the scheduler
      fiber somewhere else for some reason.
      
      This patch attempts to make the code more straightforward by postponing
      startup of the scheduler fiber until recovery completion. It also moves
      the comment explaining why we can't schedule tasks during local recovery
      from vy_env_quota_exceeded_cb to vinyl_engine_begin_initial_recovery,
      because this is where we actually omit the scheduler fiber startup.
      
      Note, since the scheduler fiber now goes straight to business once
      started, we can't start worker threads in the fiber function as we used
      to, because then worker threads would be running even if vinyl was
      unused. So we move this code to vy_worker_pool_get, which is called when
      a worker is actually needed to run a task.
    • vinyl: zap vy_worker_pool::idle_worker_count · 0ff58856
      Vladimir Davydov authored
      It is not used anywhere anymore.
    • vinyl: use separate thread pools for dump and compaction tasks · 3e76f7b9
      Vladimir Davydov authored
      Using the same thread pool for both dump and compaction tasks makes
      estimation of dump bandwidth unstable. For instance, if we have four
      worker threads, then the observed dump bandwidth may vary from X if
      there's high compaction demand and all worker threads tend to be busy
      with compaction tasks to 4 * X if there's no compaction demand. As a
      result, we can overestimate the dump bandwidth and trigger dump when
      it's too late, which will result in hitting the limit before dump is
      complete and hence stalling write transactions, which is unacceptable.
      
      To avoid that, let's separate thread pools used for dump and compaction
      tasks. Since LSM tree based design typically implies high levels of
      write amplification, let's allocate 1/4th of all threads for dump tasks
      and use the rest exclusively for compaction.
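      With the new default of box.cfg.vinyl_write_threads = 4 this yields one
      dump thread and three compaction threads. A hypothetical sketch of the
      split (the exact rounding used in the source may differ):

        -- illustrative only: 1/4 of the workers go to dump, the rest to compaction
        local function split_workers(total)
            local dump = math.max(1, math.floor(total / 4))
            return dump, total - dump
        end
        print(split_workers(4))  -- 1   3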
    • vinyl: move worker allocation closer to task creation · ba7abf6f
      Vladimir Davydov authored
      Call vy_worker_pool_get() from vy_scheduler_peek_{dump,compaction} so
      that we can use different worker pools for dump and compaction tasks.
    • vinyl: factor out worker pool from scheduler struct · 49595189
      Vladimir Davydov authored
      A worker pool is an independent entity that provides the scheduler with
      worker threads on demand. Let's factor it out so that we can introduce
      separate pools for dump and compaction tasks.
    • vinyl: don't use mempool for allocating background tasks · 661763ed
      Vladimir Davydov authored
      Background tasks are allocated infrequently, not more often than once
      per several seconds, so using mempool for them is unnecessary and only
      clutters vy_scheduler struct. Let's allocate them with malloc().
    • vinyl: add helper to check whether dump is in progress · 04a735b2
      Vladimir Davydov authored
      Needed solely to improve code readability. No functional changes.
  5. Sep 06, 2018
    • Tarantool static build ability · cb1c72da
      Georgy Kirichenko authored
      Add the ability to build tarantool with its library dependencies included.
      Use the flag -DBUILD_STATIC=ON to build statically against curl, readline,
      ncurses, icu and z.
      Use the flag -DOPENSSL_USE_STATIC_LIBS=ON to build with static
      openssl.
      
      Changes:
        * Add FindOpenSSL.cmake because some distributions do not support the use of
        openssl static libraries.
        * Find libssl before curl because of a build dependency.
        * Catch the API of all bundled libraries and export it in the case
        of a static build.
        * Rename crc32 internal functions to avoid a name clash with linked libraries.
      
      Notes:
        * Bundled libyaml is not properly exported, use the system one.
        * A Dockerfile to build statically with docker is included.
      
      Fixes #3445
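      Under these assumptions, a static build boils down to something like
      (flags as listed above; the exact invocation may differ per platform):

        cmake . -DBUILD_STATIC=ON -DOPENSSL_USE_STATIC_LIBS=ON
        make -j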
  6. Sep 04, 2018
    • Merge branch '1.9' into 1.10 · 8bf936f7
      Vladimir Davydov authored
    • box: sync on replication configuration update · 113ade24
      Vladimir Davydov authored
      Now box.cfg() doesn't return until 'quorum' appliers are in sync not
      only on initial configuration, but also on replication configuration
      update. If it fails to synchronize within replication_sync_timeout,
      box.cfg() returns without an error, but the instance enters 'orphan'
      state, which is basically read-only mode. In the meantime, appliers
      will keep trying to synchronize in the background, and the instance
      will leave 'orphan' state as soon as enough appliers are in sync.
      
      Note, this patch also changes logging a bit:
       - 'ready to accept request' is printed on startup before syncing
         with the replica set, because although the instance is read-only
         at that time, it can indeed accept all sorts of ro requests.
       - For 'connecting', 'connected', 'synchronizing' messages, we now
         use the 'info' logging level, not 'verbose' as before, because
         those messages are important as they give the admin an idea of
         what's going on with the instance, and they can't flood the logs.
       - 'sync complete' message is also printed as 'info', not 'crit',
         because there's nothing critical about it (it's not an error).
      
      Also note that we only enter 'orphan' state if we failed to synchronize.
      In particular, if the instance manages to synchronize with all replicas
      within the timeout, it will jump from 'loading' straight into 'running',
      bypassing 'orphan' state. This is done for the sake of consistency
      between initial configuration and reconfiguration.
      
      Closes #3427
      
      @TarantoolBot document
      Title: Sync on replication configuration update
      The behavior of box.cfg() on replication configuration update is
      now consistent with initial configuration: box.cfg() will not
      return until it synchronizes with as many masters as specified by
      the replication_connect_quorum configuration option, or until the
      timeout specified by replication_sync_timeout occurs. On timeout,
      it will return without an error, but the instance will enter
      'orphan' state.
      It will leave 'orphan' state as soon as enough appliers have synced.
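      A reconfiguration sketch of the behavior described above (URIs and
      values are placeholders):

        box.cfg{
            replication = {'replica1:3301', 'replica2:3301'},
            replication_connect_quorum = 2,
            replication_sync_timeout = 30,
        }
        -- if syncing did not finish within replication_sync_timeout, the
        -- call returns anyway and the instance stays read-only until
        -- enough appliers are in sync:
        box.info.status  -- 'orphan', then 'running' once synced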
    • box: add replication_sync_timeout configuration option · ca9fc33a
      Olga Arkhangelskaia authored
      In the scope of #3427 we need a timeout in case an instance waits for
      synchronization for too long, or even forever. The default value is
      300 seconds.
      
      Closes #3674
      
      @locker: moved dynamic config check to box/cfg.test.lua; code cleanup
      
      @TarantoolBot document
      Title: Introduce new configuration option replication_sync_timeout
      After the initial bootstrap or after replication configuration changes
      we need to sync up with the replication quorum. Sometimes syncing can
      take too long, or replication_sync_lag can be smaller than the network
      latency, in which case the replica would get stuck in a sync loop that
      can't be cancelled. To avoid such situations, replication_sync_timeout
      can be used. When the time set in replication_sync_timeout has passed,
      the replica enters orphan state. The option can be set dynamically.
      The default value is 300 seconds.
    • box: make replication_sync_lag option dynamic · 5eb5c181
      Olga Arkhangelskaia authored
      In #3427 replication_sync_lag should be taken into account during
      replication reconfiguration. In order to configure replication properly
      this parameter is made dynamic and can be changed on demand.
      
      @locker: moved dynamic config check to box/cfg.test.lua
      
      @TarantoolBot document
      Title: replication_sync_lag option can be set dynamically
      box.cfg.replication_sync_lag can now be set at any time.
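      Both sync-related options can now be adjusted on a running instance,
      e.g. (values are arbitrary):

        box.cfg{replication_sync_lag = 0.5}
        box.cfg{replication_sync_timeout = 120}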
  7. Sep 03, 2018
  8. Aug 31, 2018
  9. Aug 30, 2018
    • replication: rename a return pipe from tx to tx_prio · de80c262
      Konstantin Belyavskiy authored
      There are two different pipes: 'tx' and 'tx_prio'. The latter
      does not support yield(). Rename it to avoid misunderstanding.
      
      Needed for #3397
    • Update test-run · 6004bd9d
      Vladimir Davydov authored
      The new version marks more file descriptors used by test-run
      internals as CLOEXEC. Needed to make replication/misc test pass
      (it lowers RLIMIT_NOFILE).
    • applier: ignore ER_UNKNOWN_REQUEST_TYPE for IPROTO_VOTE · 009e50ec
      Vladimir Davydov authored
      IPROTO_VOTE command (successor of IPROTO_REQUEST_VOTE) was introduced in
      Tarantool 1.10.1. It is sent by an applier to its master only if the
      master is running Tarantool 1.10.1 or newer. However, the master may be
      running a 1.10.1 build that isn't yet aware of IPROTO_VOTE, in which
      case the applier will fail to connect with an ER_UNKNOWN_REQUEST_TYPE
      error.
      
      Let's fix this issue by ignoring ER_UNKNOWN_REQUEST_TYPE received in
      reply to IPROTO_VOTE command.
    • socket: evaluate buffer size in recv / recvfrom · 11fb3ab9
      Alexander Turenko authored
      When the size parameter is not passed to socket:recv() or
      socket:recvfrom(), they call a) or b) on the socket to evaluate the
      size of the buffer needed to store the incoming datagram. Before this
      commit a datagram was truncated to 512 bytes in that case.

      a) Linux: recv(fd, NULL, 0, MSG_TRUNC | MSG_PEEK)
      b) Mac OS: getsockopt(fd, SOL_SOCKET, SO_NREAD, &val, &len)

      It is recommended to set the 'size' parameter (the size of the input
      buffer) explicitly, based on the known message format and known network
      conditions (say, set it to less than the MTU to prevent IP
      fragmentation, which can be inefficient), or to pass it from a
      configuration option of a library / an application. The reason is that
      an explicitly provided buffer size allows avoiding an extra syscall to
      evaluate the necessary buffer size.

      When the 'size' parameter is set explicitly for recv / recvfrom on a
      UDP socket and the next datagram is larger than that size, the returned
      message will be truncated to the size provided and the rest of the
      datagram will be discarded. Of course, the tail will not be discarded
      in the case of a TCP socket and will be available for reading by the
      next recv / recvfrom call.
      
      Fixes #3619.
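      A minimal UDP sketch of the recommendation (address, port and size are
      arbitrary):

        local socket = require('socket')
        local s = socket('AF_INET', 'SOCK_DGRAM', 'udp')
        s:bind('127.0.0.1', 3311)
        -- without a size the buffer is auto-sized via a) or b) above;
        -- an explicit size avoids the extra syscall:
        local msg, from = s:recvfrom(1400)  -- stays below a typical MTU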
  10. Aug 29, 2018
  11. Aug 28, 2018
    • vinyl: use lower bound percentile estimate for dump bandwidth · ac17838c
      Vladimir Davydov authored
      Use a lower bound estimate in order not to overestimate dump bandwidth.
      For example, if an observation of 12 MB/s falls in bucket 10 .. 15, we
      should use 10 MB/s to avoid stalls.
    • histogram: add function for computing lower bound percentile estimate · 9a998466
      Vladimir Davydov authored
      The value returned by histogram_percentile() is an upper bound estimate.
      This is fine for measuring latency, because we are interested in the
      worst, i.e. highest, observations, but it doesn't suit particularly well
      if we want to keep track of the lowest observations, as is the case
      with bandwidth. So this patch introduces histogram_percentile_lower(),
      a function that is similar to histogram_percentile(), but returns a
      lower bound estimate of the requested percentile.
    • vinyl: use snap_io_rate_limit for initial dump bandwidth estimate · b646fbd9
      Vladimir Davydov authored
      The user can limit dump bandwidth with box.cfg.snap_io_rate_limit to a
      value that is less than the current estimate. To avoid stalls caused
      by overestimating dump bandwidth, we must take the limit into account
      for the initial guess and forget all observations whenever it changes.
    • vinyl: do not add initial guess to dump bandwidth histogram · afac9b3a
      Vladimir Davydov authored
      Do not add the initial guess to the histogram, because otherwise it
      takes more than 10 dumps to converge to the real dump bandwidth if the
      initial value is lower (we use the 10th percentile).
    • vinyl: cache dump bandwidth for timer invocation · 20b4777a
      Vladimir Davydov authored
      We don't need to compute a percentile of dump bandwidth histogram on
      each invocation of quota timer callback, because it may only be updated
      on dump completion. Let's cache it. Currently, it isn't that important,
      because the timer period is set to 1 second. However, once we start
      using the timer for throttling, we'll have to make it run more often and
      so caching the dump bandwidth value will make sense.
    • vinyl: rename vy_quota::dump_bw to dump_bw_hist · ae31708d
      Vladimir Davydov authored
      The next patch will store a cached bandwidth value in vy_quota::dump_bw.
      Let's rename dump_bw to dump_bw_hist here in order not to clog that patch.
    • vinyl: tune dump bandwidth histogram buckets · 9ac103a8
      Vladimir Davydov authored
      Typically, dump bandwidth varies from 10 MB to 100 MB per second, so
      let's use 5 MB bucket granularity in this range. Values less than
      10 MB/s can also be observed, because the user can limit the disk rate
      with box.cfg.snap_io_rate_limit, so use 1 MB granularity between 1 MB
      and 10 MB, and 100 KB granularity between 100 KB and 1 MB. A write rate
      greater than 100 MB/s is unlikely in practice, even on very fast disks,
      since dump bandwidth is also limited by CPU, so use 100 MB granularity
      there.
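      A hypothetical reconstruction of the resulting bucket boundaries, just
      to visualize the ranges above (the actual array in the source may
      differ):

        -- illustrative only, values in bytes
        local KB, MB = 1024, 1024 * 1024
        local buckets = {}
        for kb = 100, 900, 100 do table.insert(buckets, kb * KB) end  -- 100 KB .. 1 MB
        for mb = 1, 9, 1 do table.insert(buckets, mb * MB) end        -- 1 MB .. 10 MB
        for mb = 10, 95, 5 do table.insert(buckets, mb * MB) end      -- 10 MB .. 100 MB
        for mb = 100, 1000, 100 do table.insert(buckets, mb * MB) end -- 100 MB and up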
    • vinyl: wake up fibers waiting for quota one by one · 285852f6
      Vladimir Davydov authored
      Currently, we wake up all fibers whenever we free some memory. This
      is inefficient, because it might occur that all available quota gets
      consumed by a few fibers while the rest will have to go back to sleep.
      This is also kinda unfair, because waking up all fibers breaks the order
      in which the fibers were put to sleep. This works now, because we free
      memory and wake up fibers infrequently (on dump) and there normally
      shouldn't be any fibers waiting for quota (if there were, the latency
      would rocket sky high because of the absence of any kind of throttling).
      However, once throttling is introduced, fibers waiting for quota will
      become the norm. So let's wake up fibers one by one: whenever we free
      memory we wake up the first fiber in the line, which will wake up the
      next fiber on success and so forth.
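      The real implementation lives in C (vy_quota), but the pattern itself
      can be sketched in Lua with fiber.cond(); the names and the quota
      arithmetic here are illustrative only:

        local fiber = require('fiber')
        local cond = fiber.cond()
        local used, limit = 0, 1024 * 1024

        local function quota_use(size)
            while used + size > limit do
                cond:wait()    -- sleep; fibers are woken one at a time
            end
            used = used + size
            cond:signal()      -- on success, pass the baton to the next waiter
        end

        local function quota_release(size)
            used = used - size
            cond:signal()      -- wake only the first fiber in the line
        end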
    • vinyl: fix fiber leak in worker thread · 497fd351
      Vladimir Davydov authored
      We mark all fibers that execute dump/compaction tasks as joinable, but
      we only join a fiber at exit. As a result, fibers leak, which
      eventually leads to a crash:
      
        src/lib/small/small/slab_arena.c:58: munmap_checked: Assertion `false' failed.
      
      Here's the stack trace:
      
        munmap_checked
        mmap_checked
        slab_map
        slab_get_with_order
        mempool_alloc
        fiber_new_ex
        fiber_new
        cord_costart_thread_func
        cord_thread_func
        start_thread
        clone
      
      Let's fix this issue by marking a fiber as joinable only at exit, before
      joining it. The fiber is guaranteed to be alive at that time, because it
      clears vy_worker::task before returning, while we join it only if
      vy_worker::task is not NULL.
      
      Fixes commit 43b4342d ("vinyl: fix worker crash at exit").
    • box: don't destroy latch/fiber_cond that may have waiters at exit · c0102b73
      Vladimir Davydov authored
      fiber_cond_destroy() and latch_destroy() are no-ops in release builds,
      while in debug builds they check that there are no fibers waiting on
      the destroyed object. This results in the following assertion failures
      occasionally hit by some tests:
      
        src/latch.h:81: latch_destroy: Assertion `l->owner == NULL' failed.
        src/fiber_cond.c:49: fiber_cond_destroy: Assertion `rlist_empty(&c->waiters)' failed.
      
      We can't do anything about that, because the event loop isn't running at
      exit and hence we can't stop those fibers. So let's not "destroy" those
      global objects that may have waiters at exit, namely
      
        gc.latch
        ro_cond
        replicaset.applier.cond