  1. Feb 14, 2019
  2. Feb 13, 2019
    • Vladimir Davydov's avatar
      tuple: fix integer boundaries check in tuple_hash_field · 47bf2393
      Vladimir Davydov authored
      exp() is the base-e exponential function. Apparently, we must use
      exp2() here to correctly check 64-bit integer boundaries.
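      A minimal sketch of the corrected check, with a hypothetical helper
      name (the actual code lives in tuple_hash_field): exp2(63) is 2^63,
      the int64_t boundary, whereas exp(63) is e^63 and would wrongly
      accept out-of-range values.

```c
#include <math.h>
#include <stdbool.h>

/* Check whether a double lies within the int64_t range.
 * exp2(63) == 2^63; exp(63) == e^63, which is far larger. */
static bool
double_fits_int64(double value)
{
	return value >= -exp2(63) && value < exp2(63);
}
```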
      
      Fixes commit 0dfd99c4 ("tuple: fix hashing of integer numbers").
      
      Follow-up #3907
      47bf2393
    • Vladimir Davydov's avatar
      vinyl: fix double to size_t conversion in vy_regulator_update_rate_limit · 87994f22
      Vladimir Davydov authored
      Fixes commit adb78d55 ("vinyl: throttle tx to ensure compaction
      keeps up with dumps").
      
      Follow-up #3721
      87994f22
    • Vladimir Davydov's avatar
      vinyl: throttle tx to ensure compaction keeps up with dumps · adb78d55
      Vladimir Davydov authored
      Every byte of data written to a vinyl database eventually gets compacted
      with data written to the database earlier. The ratio of the size of data
      actually written to disk to the size of data written to the database is
      called write amplification. Write amplification depends on the LSM tree
      configuration and the workload parameters and varies in a wide range,
      from 2-3 to 10-20 or even higher in some extreme cases. If the database
      engine doesn't manage to write those extra data, LSM tree shape will get
      distorted, which will result in increased read and space amplification,
      which, in turn, will lead to slowing down reads and wasting disk space.
      That's why it's so important to ensure the database engine has enough
      compaction power.
      
      One way to ensure that is to increase the number of compaction
      threads by tuning the box.cfg.vinyl_write_threads configuration knob,
      but one can't increase it beyond the capacity of the server running
      the instance. So the database engine must throttle writes if it
      detects that compaction threads are struggling to keep up. This patch
      implements a very simple algorithm to achieve that: it keeps track of
      the recently observed write amplification and data compaction speed,
      uses them to calculate the max transaction rate that the database
      engine can handle while steadily maintaining the current level of
      write amplification, and sets the rate limit to 0.75 of that so as to
      give the engine enough room to increase write amplification if needed.
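      The rule above can be sketched as follows (function and parameter
      names are illustrative, not the actual vy_regulator API): the max
      sustainable transaction rate is the observed compaction rate divided
      by the observed write amplification, and the limit is 0.75 of that.

```c
#include <stdint.h>

/* Sketch of the rate limit rule: compaction_rate is bytes/sec of
 * compaction output the engine sustains; write_amplification is the
 * recently observed ratio of bytes written to disk to bytes written
 * to the database. 0.75 leaves headroom for write amplification to
 * grow. */
static uint64_t
compute_rate_limit(uint64_t compaction_rate, double write_amplification)
{
	if (write_amplification < 1.0)
		write_amplification = 1.0;
	double max_tx_rate = compaction_rate / write_amplification;
	return (uint64_t)(0.75 * max_tx_rate);
}
```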
      
      The algorithm is obviously pessimistic: it undervalues the transaction
      rate the database can handle after write amplification has steadied. But
      this is compensated by its simplicity and stability - there shouldn't be
      any abrupt drops or peaks in RPS due to its decisions. Besides, it
      adapts fairly quickly to increases in write amplification when a database
      is filled up. If one finds that the algorithm is being too cautious by
      undervaluing the limit, it's easy to fix by simply increasing the number
      of compaction threads - the rate limit will scale proportionately if the
      system is underloaded.
      
      The current value of the rate limit set by the algorithm is reported
      by box.stat.vinyl() under the regulator.rate_limit section.
      
      Thanks to @kostja for the great comment explaining the logic behind the
      rate limiting algorithm.
      
      Closes #3721
      adb78d55
    • Vladimir Davydov's avatar
      vinyl: don't consume quota if wait queue isn't empty · 9b11bf9f
      Vladimir Davydov authored
      vy_quota_use only checks if there's enough quota available for a
      consumer to proceed, but that's not enough, because there may be
      fibers already waiting for the resource. Bypassing them may result in
      starvation, which manifests itself as "waited for vinyl memory quota
      for too long" warnings. To ensure fairness and avoid starvation, let's
      go to sleep if the wait queue is not empty.
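      A minimal sketch of the fairness rule, assuming a hypothetical struct
      layout (the real one is vy_quota): even if enough quota is available,
      a consumer must queue up behind earlier waiters.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical quota state: bytes used, the hard limit, and the
 * number of fibers already queued for quota. */
struct quota {
	size_t used;
	size_t limit;
	int n_waiters;
};

/* A consumer may proceed only if nobody queued before it and the
 * request fits under the limit; otherwise it must go to sleep. */
static bool
quota_may_use(const struct quota *q, size_t size)
{
	if (q->n_waiters > 0)
		return false; /* don't starve queued fibers */
	return q->used + size <= q->limit;
}
```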
      9b11bf9f
    • Vladimir Davydov's avatar
      vinyl: remove extra quota check from vy_quota_use · abdb1291
      Vladimir Davydov authored
      Before waking up a fiber that is waiting for quota, we always first
      check if it can actually consume it, see vy_quota_signal. Hence the
      extra check in vy_quota_use is only needed to prevent spurious
      wakeups. It doesn't seem wise to add such a check to a hot path as a
      counter-measure to such an unlikely scenario. Let's remove it - after
      all, it isn't critical if a spuriously woken up fiber exceeds the
      limit.
      abdb1291
    • Vladimir Davydov's avatar
      vinyl: introduce quota consumer types · 3196be5e
      Vladimir Davydov authored
      Currently, we only limit quota consumption rate so that writers won't
      hit the hard limit before memory dump is complete. However, it isn't
      enough, because we also need to consider compaction: if it doesn't keep
      up with dumps, read and space amplification will grow uncontrollably.
      
      The problem is that compaction may be a quota consumer itself, as it
      may generate deferred DELETE statements for secondary indexes. We
      can't ignore quota completely there, because if we do, we may hit the
      memory limit and stall all writers, which is unacceptable, but we do
      want to ignore the rate limit imposed to make sure that compaction
      keeps up with dumps, otherwise compaction won't benefit from such
      throttling.
      
      To tackle this problem, this patch introduces the concept of quota
      consumer types and resources. Now vy_quota maintains one rate limit
      per resource and one wait queue per consumer type. There are two
      types of consumers, compaction jobs and usual transactions, and there
      are two resources managed by vy_quota, disk and memory. The
      memory-based rate limit ensures that transactions won't hit the hard
      memory limit and stall before a memory dump is complete. It is
      respected by all types of consumers. The disk-based rate limit is
      supposed to be set when compaction doesn't keep up with dumps. It is
      only used by usual transactions and ignored by compaction jobs.
      
      Since now there are two wait queues, we need to balance wakeups
      between them in case consumers in both queues are ready to proceed.
      To ensure there's no starvation, we maintain a monotonically growing
      counter and assign its value to each consumer put to sleep (a
      ticket). We use it to wake up the consumer that has waited longest
      when both queues are ready.
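      The ticket-based arbitration can be sketched like this (names are
      illustrative, not the actual vy_quota symbols): when the heads of
      both queues are ready, the smaller ticket has waited longer and wins.

```c
#include <stdint.h>

enum { QUEUE_TX, QUEUE_COMPACTION };

/* Given the tickets of the consumers at the head of each wait queue,
 * pick the queue whose head has waited longest, i.e. the one with
 * the smaller (earlier) ticket from the monotonic counter. */
static int
queue_to_wake(uint64_t tx_head_ticket, uint64_t compaction_head_ticket)
{
	return tx_head_ticket <= compaction_head_ticket ?
	       QUEUE_TX : QUEUE_COMPACTION;
}
```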
      
      Note, the patch doesn't implement the logic of disk-based throttling in
      the regulator module. It is still left for future work.
      
      Needed for #3721
      3196be5e
    • Шипицын Анатолий's avatar
      Add option interface for set source interface in http client · 99272d7b
      Шипицын Анатолий authored
      @TarantoolBot document
      Title: 'interface' http.client option
      It allows setting the source network interface for an outgoing
      connection using the interface name or IP address.
      For additional info see https://curl.haxx.se/libcurl/c/CURLOPT_INTERFACE.html
      99272d7b
    • Nikita Pettik's avatar
      sql: clean-up SQLite mentions in codebase · 73e85ae5
      Nikita Pettik authored
      Replace all usages of the sqlite3_, sqlite, and SQLite prefixes with
      a simple sql_ prefix. All other occurrences of SQLite are substituted
      with the word SQL. The SQL test suite is purified as well.
      73e85ae5
  3. Feb 12, 2019
    • Шипицын Анатолий's avatar
      httpc: increase max outgoing header size to 8 KiB · 6b79d50a
      Шипицын Анатолий authored
      The limit is set to this value because the default Apache / nginx
      maximum header size is 8 KiB.
      
      Added a check to raise an error when a header is bigger than the
      limit.
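      The check can be sketched as follows (the constant and function
      names are hypothetical, not the actual httpc module API):

```c
#include <stddef.h>

/* 8 KiB, matching the default Apache / nginx maximum header size. */
enum { HTTPC_MAX_HEADER_SIZE = 8192 };

/* Return 0 if an outgoing header fits under the limit, -1 otherwise
 * (the caller would raise an error in the latter case). */
static int
httpc_check_header_size(size_t header_len)
{
	return header_len <= HTTPC_MAX_HEADER_SIZE ? 0 : -1;
}
```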
      
      Fixes #3959.
      6b79d50a
    • Ilya Markov's avatar
      replication: Add rfc on vclock implementation · 3c9b70c7
      Ilya Markov authored
      Add description of possible redesigning of vector clocks.
      3c9b70c7
    • Vladimir Davydov's avatar
      vinyl: randomize range compaction to avoid IO load spikes · c9e7baed
      Vladimir Davydov authored
      Since all ranges constituting an LSM tree have the same configuration,
      they tend to get compacted at approximately the same time. This entails
      IO load spikes, which, in turn, lead to deviation of the LSM tree from
      the target shape and hence increased read amplification. To prevent this
      from happening, this patch implements compaction randomization: with
      10% probability we defer compaction at each LSM tree level, i.e. if
      the number of runs at a level exceeds the configured
      run_count_per_level, the level will be compacted with 90% probability,
      but with 10% probability it won't - compaction will be deferred until
      another run is added to the level.
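      The randomized trigger above can be sketched like this (illustrative
      names; the random draw is passed in so the rule itself stays
      deterministic and testable):

```c
#include <stdbool.h>

/* Decide whether to compact a level: below the configured threshold
 * never compact; above it, compact with 90% probability and defer
 * with 10% probability until another run is added. rnd is a uniform
 * random number in [0, 1). */
static bool
should_compact(int run_count, int run_count_per_level, double rnd)
{
	if (run_count <= run_count_per_level)
		return false;
	return rnd >= 0.1; /* defer in 10% of cases */
}
```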
      
      Our simulations show that such a simple algorithm performs fairly well:
      it randomizes compaction pace among ranges, spreading IO load evenly in
      time, while the write amplification is increased by not more than 5-10%,
      which seems to be a reasonable price for elimination of IO load spikes.
      
      Closes #3944
      c9e7baed
    • Vladimir Davydov's avatar
      vinyl: set range size automatically · bedb1d94
      Vladimir Davydov authored
      The key space of a vinyl index consists of multiple ranges that can be
      compacted independently. This design was initially invented to enable
      parallel compaction, so the range size is configured statically, by the
      range_size index option, which equals 1 GB by default. However, it turns
      out that ranges can also be useful for smoothing IO load: if we compact
      approximately the same number of ranges after each dump, we will avoid
      IO bursts, which is good, because IO bursts can distort the LSM tree
      shape, resulting in increased read amplification.
      
      To achieve that, we need to maintain at least as many ranges as the
      number of dumps it takes to trigger major compaction of a range. With
      the default range size, this condition will hold only if the index is
      huge (tens to hundreds of gigabytes). If the database isn't that big or
      consists of many small indexes, the range count will never even approach
      that number. So this patch makes the range size scale dynamically to
      satisfy that condition.
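      Under the condition above, a sketch of the scaling rule (an
      illustrative formula, not the exact one vinyl uses): keep at least
      as many ranges as the number of dumps it takes to trigger major
      compaction, so the range size is at most the index size divided by
      that count, clamped to a minimum.

```c
#include <stdint.h>

/* Pick a range size so that index_size / range_size >=
 * dumps_per_compaction, i.e. there are enough ranges to spread
 * compaction IO evenly across dumps. */
static int64_t
auto_range_size(int64_t index_size, int dumps_per_compaction,
		int64_t min_range_size)
{
	int64_t size = index_size / dumps_per_compaction;
	return size > min_range_size ? size : min_range_size;
}
```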
      
      The range size configuration options, both global and per index, aren't
      removed though. The patch just changes box.cfg.vinyl_range_size default
      value to nil, which enables automatic range sizing for all new indexes
      created without passing range_size explicitly. All existing indexes will
      still use the range size stored in index options (we don't want to alter
      the behavior of an existing production setup). We are not planning to
      drop the range_size option altogether - it can still be useful for testing
      and performance analysis.
      
      The actual range size value is now reported in index.stat().
      
      Needed for #3944
      bedb1d94
    • Vladimir Davydov's avatar
      vinyl: keep track of dumps per compaction for each LSM tree · e4f5476c
      Vladimir Davydov authored
      This patch adds dumps_per_compaction metric to per index statistics. It
      shows the number of dumps it takes to trigger a major compaction of a
      range in a given LSM tree. We need it to automatically choose the
      optimal number of ranges that would smooth out the load generated by
      range compaction.
      
      To calculate this metric, we assign dump_count to each run. It shows how
      many dumps it took to create the run. If a run was created by a memory
      dump, it is set to 1. If a run was created by a minor compaction, it is
      set to the sum of dump counts of compacted ranges. If a run was created
      by a major compaction, it is set to the sum of dump counts of compacted
      ranges minus dump count of the last level run. The dump_count is stored
      in vylog.
      
      This allows us to estimate the number of dumps that triggers compaction
      in a range as dump_count of the last level run stored in the range.
      Finally, we report dumps_per_compaction of an LSM tree as the average
      dumps_per_compaction among all ranges constituting the tree.
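      The dump_count bookkeeping described above can be sketched as one
      helper (illustrative; the real accounting is spread across vinyl and
      vylog): sum the dump counts of the compacted runs, and for a major
      compaction exclude the last-level run's contribution.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute dump_count for a run produced by compaction. counts[] are
 * the dump counts of the n compacted runs; for a major compaction
 * the last-level run's dump count is subtracted back out. A run
 * produced by a memory dump simply gets dump_count = 1. */
static uint32_t
run_dump_count(const uint32_t *counts, int n, bool is_major,
	       uint32_t last_level_count)
{
	uint32_t sum = 0;
	for (int i = 0; i < n; i++)
		sum += counts[i];
	if (is_major)
		sum -= last_level_count;
	return sum;
}
```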
      
      Needed for #3944
      e4f5476c
    • Vladimir Davydov's avatar
      vinyl: cancel reader and writer threads on shutdown · e463128e
      Vladimir Davydov authored
      Currently, vinyl won't shutdown until all reader and writer threads
      gracefully complete all their pending requests, which may take a while,
      especially for writer threads that may happen to be doing compaction at
      the time. This is annoying - there's absolutely no reason to delay
      termination in such a case. Let's forcefully cancel all threads, like we
      do in case of relay threads.
      
      This should fix sporadic vinyl/replica_quota test hang.
      
      Closes #3949
      e463128e
  4. Feb 11, 2019
    • Alexander Turenko's avatar
      sql: clean up lemon acttab_free() a bit · 8acf2939
      Alexander Turenko authored
      Nikita Pettik pointed out to me that free(NULL) is a no-op according
      to POSIX.
      
      This is a follow-up to commit 9dbcaa3a.
      8acf2939
    • Konstantin Belyavskiy's avatar
      replication: do not fetch records twice · ae938677
      Konstantin Belyavskiy authored
      This is a draft paper covering the following topics:
      1. A draft protocol for discovering and maintaining network topology
      in case of a large arbitrary network.
      2. A list of required changes to support this feature.
      3. Open questions and alternatives.
      
      Changes in V2:
      Based on Vlad's review:
      1. Rewrote a couple of sections to make them clearer.
      2. Clarified with more details and added examples.
      3. Fixed an error.
      
      RFC for #3294
      ae938677
    • Vladimir Davydov's avatar
      vinyl: fix compaction priority calculation · d5ceb204
      Vladimir Davydov authored
      When computing the number of runs that need to be compacted for a range
      to conform to the target LSM tree shape, we use the newest run size for
      the size of the first LSM tree level. This isn't quite correct for two
      reasons.
      
      First, the size of the newest run is unstable - it may vary in a
      relatively wide range from dump to dump. This leads to frequent changes
      in the target LSM tree shape and, as a result, unpredictable compaction
      behavior. In particular this breaks compaction randomization, which is
      supposed to smooth out IO load generated by compaction.
      
      Second, this can increase space amplification. We trigger compaction at
      the last level when there's more than one run, irrespective of the value
      of run_count_per_level configuration option. We expect this to keep
      space amplification below 2 provided run_count_per_level is not greater
      than (run_size_ratio - 1). However, if the newest run happens to have
      such a size that multiplying it by run_size_ratio several times gives us
      a value only slightly less than the size of the oldest run, we can
      accumulate up to run_count_per_level more runs that are approximately as
      big as the last level run without triggering compaction, thus increasing
      space amplification by up to run_count_per_level.
      
      To fix these problems, let's use the oldest run size for computing
      the size of the first LSM tree level - simply divide it by
      run_size_ratio while the result still exceeds the size of the newest
      run.
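      A sketch of that computation (illustrative names; the actual code is
      in the compaction priority calculation): starting from the oldest
      run, keep dividing by run_size_ratio as long as the quotient is
      still at least the newest run size.

```c
#include <stdint.h>

/* Derive the first-level target size from the oldest run so it is
 * stable from dump to dump, unlike the newest run size. */
static int64_t
first_level_size(int64_t oldest_run_size, int64_t newest_run_size,
		 double run_size_ratio)
{
	int64_t size = oldest_run_size;
	while (size / run_size_ratio >= newest_run_size)
		size /= run_size_ratio;
	return size;
}
```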
      
      Follow-up #3657
      d5ceb204
    • Vladimir Davydov's avatar
      box: enable WAL before making initial checkpoint · c6743038
      Vladimir Davydov authored
      While a replica is bootstrapped from a remote master, vinyl engine
      may need to perform compaction, which means that it may write to
      the _vinyl_deferred_delete system space. Compaction proceeds fully
      asynchronously, i.e. a write may occur after the join stage is
      complete, but before the WAL is initialized, in which case the new
      replica will crash. To make sure a race like that won't happen, let's
      set up the WAL before making the initial checkpoint. The WAL writer is now
      initialized right before starting the WAL thread and so we don't need
      to split WAL struct into the thread and the writer anymore.
      
      Closes #3968
      c6743038
  5. Feb 08, 2019
    • Vladimir Davydov's avatar
      test: fix xlog/panic_on_broken_lsn spurious failure · f1bd33a8
      Vladimir Davydov authored
      If this test is executed after some other test that bumps LSN, then
      the output line gets truncated differently, because greater LSNs may
      increase its length. Fix this by filtering out the LSN manually.
      
      Closes #3970
      f1bd33a8
    • Ivan Koptelov's avatar
      sql: raise an err on CHECK constraint with ON CONFLICT action · 6aa0ef4a
      Ivan Koptelov authored
      Currently all ON CONFLICT actions are silently ignored for CHECK
      constraints. This patch adds an explicit parse-time error.
      
      Closes #3345
      6aa0ef4a
    • Nikita Pettik's avatar
      Remove affinity from field definition · b9afef16
      Nikita Pettik authored
      Closes #3698
      b9afef16
    • Nikita Pettik's avatar
      sql: clean-up affinity from SQL source code · 037e2e44
      Nikita Pettik authored
      Replace the remains of affinity usage in the SQL parser, query
      optimizer and VDBE. Don't add affinity to the field definition when a
      table is encoded into msgpack. Remove the field type <-> affinity
      converters, since now we can operate directly on field types.
      
      Part of #3698
      037e2e44
    • Nikita Pettik's avatar
      sql: replace affinity with field type in struct Expr · 2dd8444f
      Nikita Pettik authored
      Also, this patch resolves an issue with wrong query plans during
      selects on spaces created from Lua: in most cases a table scan was
      used instead of an index search. This was due to the fact that an
      index was checked for affinity compatibility with the space format.
      So, if a space was created without affinity in its format, its
      indexes wouldn't be used. However, now all checks are based on field
      types, and as a result the query optimizer is able to choose the
      correct index.
      
      Closes #3886
      Part of #3698
      2dd8444f
    • Nikita Pettik's avatar
      sql: replace affinity with field type for VDBE runtime · 5a561326
      Nikita Pettik authored
      This stage of affinity removal requires introducing an auxiliary
      intermediate function to convert an array of affinity values to
      field type values. The rest of the job done in this commit is
      straightforward refactoring.
      
      Part of #3698
      5a561326
    • Nikita Pettik's avatar
      sql: replace affinity with field type for func · 00758981
      Nikita Pettik authored
      Let's use field_type instead of affinity as the type of the return
      value of a user function registered in SQL. Moreover, let's assign
      the return value type to the expression representing the function
      call. This allows taking it into account during derived type
      calculation.
      
      Part of #3698
      00758981
    • Nikita Pettik's avatar
      sql: remove numeric affinity · 758ab1a4
      Nikita Pettik authored
      Numeric affinity in SQLite means the same as real, except that it
      forces floating point values into integer representation in case
      they can be converted without loss (e.g. 2.0 -> 2).
      Since in the Tarantool core there is no difference between numeric
      and real values (both are stored as values of the Tarantool type
      NUMBER), let's remove numeric affinity and use real instead.
      
      The only real pitfall is the implicit conversion mentioned above. We
      can't pass *.0 as an iterator value since our fast comparators
      (TupleCompare, TupleCompareWithKey) are designed to work only with
      values of the same MP_ type. They do not use the slow
      tuple_compare_field(), which is able to compare a double and an
      integer. The solution to this problem is simple: let's always
      attempt to encode floats as ints if the conversion is lossless. This
      is a straightforward approach, but to implement it we need to care
      about the reverse (decoding) situation.
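      The lossless float-to-int encoding rule can be sketched like this
      (an illustrative helper; the actual check sits in the msgpack
      encoding path): encode a double as an integer only when the round
      trip loses nothing.

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to encode a double as int64_t. Succeeds only when the value is
 * in range and converts back to exactly the same double. */
static bool
encode_double_as_int(double value, int64_t *out)
{
	if (isnan(value))
		return false;
	if (value < -exp2(63) || value >= exp2(63))
		return false;
	int64_t i = (int64_t)value;
	if ((double)i != value)
		return false; /* fractional part would be lost */
	*out = i;
	return true;
}
```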
      
      OP_Column fetches a field with a given number from msgpack and
      stores it as a native VDBE memory object. The type of that memory is
      based on the type of the msgpack value. So, if a space field is of
      type NUMBER and holds the value 1, the type of the VDBE memory will
      be INT (after decoding), not float 1.0. As a result, further
      calculations may be wrong: for instance, instead of floating point
      division, we could get integer division. To cope with this problem,
      let's add an auxiliary conversion to the decoding routine which uses
      the space format of the tuple to be decoded. It is worth mentioning
      that ephemeral spaces don't feature a space format, so we are going
      to rely on the types of key parts. Finally, the internal VDBE merge
      sorter also operates on entries encoded into msgpack. To fix this
      case, we check the types of ORDER BY/GROUP BY arguments: if they are
      of type float, we emit an additional opcode OP_AffinityReal to force
      the float type after encoding.
      
      Part of #3698
      758ab1a4
    • Nikita Pettik's avatar
      sql: use field type instead of affinity for type_def · 82298c55
      Nikita Pettik authored
      Also, this allows delaying affinity assignment to the field def
      until encoding of the table format.
      
      Part of #3698
      82298c55
    • Nikita Pettik's avatar
      sql: remove SQLITE_ENABLE_UPDATE_DELETE_LIMIT define · 43ed060f
      Nikita Pettik authored
      The code under this define is dead. What is more, it uses affinity,
      so let's remove it along with the tests related to it.
      
      Needed for #3698
      43ed060f
    • Georgy Kirichenko's avatar
      replication: promote tx vclock only after successful wal write · 056deb2c
      Georgy Kirichenko authored
      The applier used to promote the vclock prior to applying a row. This
      led to a situation where a master's row would be skipped forever in
      case there was an error trying to apply it. However, some errors are
      transient, and we might be able to successfully apply the same row
      later.
      
      While we're at it, make the wal writer the only one responsible for
      advancing the replicaset vclock. It was already doing so for rows
      coming from the local instance; besides, it makes the code cleaner,
      since now the vclock is advanced directly from the wal batch reply,
      and lets us get rid of unnecessary checks of whether the applier or
      the wal has already advanced the vclock.
      
      Closes #2283
      Prerequisite #980
      056deb2c
    • Georgy Kirichenko's avatar
      wal: do not promote wal vclock for failed writes · 066b929b
      Georgy Kirichenko authored
      The wal used to promote the vclock prior to writing a row. This led
      to a situation where a master's row would be skipped forever in case
      there was an error trying to write it. However, some errors are
      transient, and we might be able to successfully apply the same row
      later. So we do not promote the writer vclock, in order to be able
      to restart replication from the failing point.
      
      Obsoletes xlog/panic_on_lsn_gap.test.
      
      Needed for #2283
      066b929b
  6. Feb 07, 2019
  7. Feb 06, 2019
    • Vladimir Davydov's avatar
      vinyl: use uncompressed run size for range split/coalesce/compaction · 3313009d
      Vladimir Davydov authored
      Historically, when considering splitting or coalescing a range or
      updating compaction priority, we use sizes of compressed runs (see
      bytes_compressed). This makes the algorithms dependent on whether
      compression is used or not and how effective it is, which is weird,
      because compression is a way of storing data on disk - it shouldn't
      affect the way data is partitioned. E.g. if we turned off compression
      at the first LSM tree level, which would make sense, because it's
      relatively small, we would affect the compaction algorithm because
      of this.
      
      Therefore, let's use uncompressed run sizes when considering range
      tree transformations.
      3313009d
    • Serge Petrenko's avatar
      Fix tarantool -e "os.exit()" hang · 3a851430
      Serge Petrenko authored
      After the patch which made os.exit() execute on_shutdown triggers
      (see commit 6dc4c8d7), we relied on on_shutdown triggers to break
      the ev_loop and exit tarantool. However, there is an auxiliary event
      loop which is run in tarantool_lua_run_script() to reschedule the
      fiber executing chunks of code passed by the -e option and running
      the interactive mode. This event loop is only started to run the
      interactive mode and doesn't exist during execution of -e chunks.
      Make sure we don't start it if os.exit() was already executed in one
      of the chunks.
      
      Closes #3966
      3a851430
    • Serge Petrenko's avatar
      Fix fiber_join() hang in case fiber_cancel() was called · d69c149f
      Serge Petrenko authored
      In case a fiber joining another fiber gets cancelled, it stays suspended
      forever and never finishes joining. This happens because fiber_cancel()
      wakes the fiber and removes it from all execution queues.
      Fix this by adding the fiber back to the wakeup queue of the joined
      fiber after each yield.
      
      Closes #3948
      d69c149f
    • Serge Petrenko's avatar
      replication: downstream status reporting in box.info · fcf43533
      Serge Petrenko authored
      Start showing downstream status for relays in "follow" state.
      Also refactor lbox_pushrelay to unify code for different relay
      states.
      
      Closes #3904
      fcf43533