  1. May 04, 2017
  2. May 03, 2017
      vinyl: pin run slice in iterator · d40d611d
      Vladimir Davydov authored
      Currently, we take a reference to vy_slice while waiting for IO in the
      run iterator to avoid use-after-free. Since a slice references a run, we
      also need a reference counter in vy_run. We can't use the same reference
      counter for counting the number of active slices, because it also covers
      deleted slices that stay allocated only because they are pinned by
      iterators, so on top of that we add vy_run->slice_count. And all this
      machinery exists solely for the sake of the run iterator!
      
      This patch reworks this as follows. It removes vy_run->refs and
      vy_slice->refs, leaving only vy_run->slice_count since it is needed for
      detecting unused runs. Instead, it adds vy_slice->pin_count, similar to
      vy_mem->pin_count. As long as pin_count > 0, the slice can't be deleted.
      Whoever wants to delete the slice (compaction, split, index removal) has
      to wait until the slice is unpinned. The run iterator pins the slice
      while waiting for IO. All in all, this should make the code easier
      to follow.
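
      A minimal sketch of the pinning scheme, with the struct layout and
      helper names simplified (they are illustrative assumptions, not the
      actual vinyl source):

          #include <assert.h>
          #include <stdbool.h>

          /* Simplified stand-in for the real vy_slice. */
          struct vy_slice {
              int pin_count;   /* > 0 while a run iterator waits for IO */
          };

          static void
          vy_slice_pin(struct vy_slice *slice)
          {
              slice->pin_count++;
          }

          static void
          vy_slice_unpin(struct vy_slice *slice)
          {
              assert(slice->pin_count > 0);
              slice->pin_count--;
              /* The real code would wake up a fiber waiting to delete
               * the slice here. */
          }

          /* Compaction, split and index removal may delete the slice only
           * when this returns true; otherwise they have to wait. */
          static bool
          vy_slice_may_delete(const struct vy_slice *slice)
          {
              return slice->pin_count == 0;
          }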
      vinyl: tx_serial.test: do not append very long strings in lua. · 4b5c6d2b
      Alexandr Lyapunov authored
      Patch f57151941ab9abc103c1d5f79d24c48238ab39cc introduced
      generation of reproduction code and dumping it to the log. The
      problem is that the code was built up in one big Lua string by
      repeated concatenation in a loop, which performs very poorly
      because Lua strings are immutable and every concatenation copies
      the whole accumulated string. Avoid repeated concatenation of Lua
      strings in tx_serial.test.
      vinyl: improve tx_conflict.test · 1c543ab6
      Vladimir Davydov authored
      The test now generates Lua code that reproduces the found problem.
      The generated code is saved in the log.
      
      Copied from tx_serial.test
      vinyl: use stailq instead of rlist for linking log records · b6047032
      Vladimir Davydov authored
      We don't need a doubly-linked list for this. Singly-linked will do.
      vinyl: yield while adding run slices to ranges on dump · 10f0a1b5
      Vladimir Davydov authored
      The loop over all ranges can take a long time, so we should yield once
      in a while in order not to stall the TX thread. The problem is that we
      can't delete dumped in-memory trees until we've added a slice of the
      new run to each range, so if we yield while adding slices, a concurrent
      fiber may see a range with a slice containing statements that are still
      present in the in-memory trees, which breaks the merge iterator's
      assumption that its sources don't have duplicates. Handle this by
      filtering out newly dumped
      runs by LSN in vy_read_iterator_add_disk().
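
      The yield itself might look like the sketch below; the batch size,
      the vy_range_add_slice() helper, and the use of fiber_sleep(0) are
      illustrative assumptions:

          #include <tarantool/module.h>   /* fiber_sleep() */

          struct vy_range;
          /* Hypothetical helper: insert a slice of the new run into one
           * range. */
          void vy_range_add_slice(struct vy_range *range);

          /* Add a slice to each of `count` ranges, yielding every 128
           * ranges so that a huge index doesn't stall the TX thread. */
          static void
          add_run_slices(struct vy_range **ranges, int count)
          {
              for (int i = 0; i < count; i++) {
                  vy_range_add_slice(ranges[i]);
                  if ((i + 1) % 128 == 0)
                      fiber_sleep(0);   /* let other fibers run */
              }
          }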
      vinyl: don't create empty slices on dump · 7a18727a
      Vladimir Davydov authored
      Adding an empty slice to a range is pointless; besides, it triggers
      compaction for no reason, which is especially harmful in the case of a
      time-series-like workload. On dump we can omit creating slices for
      ranges that are not intersected by the new run. Note how it affects the
      coalesce test: now we have to insert a statement into each range to
      trigger compaction, not just into the first one.
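
      The intersection check itself reduces to two key comparisons. In
      this sketch, key_compare() and the NULL-means-unbounded convention
      are assumptions:

          #include <stdbool.h>
          #include <stddef.h>

          struct key;
          /* Hypothetical three-way key comparison. */
          int key_compare(const struct key *a, const struct key *b);

          /* Only ranges intersected by the new run [run_min, run_max] get
           * a slice; NULL range bounds stand for infinity. */
          static bool
          run_intersects_range(const struct key *run_min,
                               const struct key *run_max,
                               const struct key *begin,
                               const struct key *end)
          {
              if (end != NULL && key_compare(run_min, end) >= 0)
                  return false;   /* run begins at or after range end */
              if (begin != NULL && key_compare(run_max, begin) < 0)
                  return false;   /* run ends before the range begins */
              return true;
          }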
      vinyl: write index dump lsn to metadata log · aace2e14
      Vladimir Davydov authored
      When replaying local WAL, we filter out statements that were written to
      disk before restart by checking stmt->lsn against run->max_lsn: if the
      latter is greater, the statement was dumped. Although it is undoubtedly
      true, this check isn't quite correct. The thing is, run->max_lsn might
      be less than the actual LSN at the time the run was dumped, because max_lsn
      is computed as the maximum among all statements present in the run file,
      which doesn't include deleted statements. If this happens, we might
      replay some statements for nothing: they will cancel each other anyway.
      This may be dangerous, because the number of such statements can be
      huge. Suppose, a whole run consists of deleted statements, i.e. there's
      no run file at all. Then we replay all statements in-memory, which might
      result in OOM, because the scheduler isn't started until the local
      recovery is completed.
      
      To avoid that, introduce a new record type in the metadata log,
      VY_LOG_DUMP_INDEX, which is written on each index dump, even if no file
      is created, and contains the LSN of the dump. Use this LSN on recovery
      to detect statements that don't need to be replayed.
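
      With the new record, the replay-time filter becomes a single
      comparison (names in this sketch are illustrative):

          #include <stdbool.h>
          #include <stdint.h>

          /* dump_lsn comes from the last VY_LOG_DUMP_INDEX record; a
           * statement needs replaying only if it is newer than the last
           * dump. */
          static bool
          stmt_needs_replay(int64_t stmt_lsn, int64_t dump_lsn)
          {
              return stmt_lsn > dump_lsn;
          }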
      vinyl: delete empty runs right away · f5645474
      Vladimir Davydov authored
      This reverts commit a366b5bb ("vinyl: keep track of empty runs").
      
      The former single memory level design required knowledge of max LSN of
      each run. Since this information can't be extracted from the run file in
      general (the newest key might have been deleted by compaction), we added
      it to the metadata log. Since we can get an empty run (i.e. a run w/o a
      file on disk) as a result of compaction or dump, we had to add a special
      per-run flag to the log, is_empty, so that we could store a run record
      while skipping loading of the run file. Thanks to the concept of slices,
      this is not needed any more, so we can move min/max LSN back to the
      index file and remove the is_empty flag from the log. This patch starts
      by removing the is_empty flag.
      vinyl: allocate records for metadata log dynamically · b82d5ac8
      Vladimir Davydov authored
      Currently, we use a fixed-size buffer, which can accommodate up to 64
      records. With the single memory level it can easily overflow, as we
      create a slice for each range on dump in a single transaction, i.e. if
      there are > 64 ranges in an index, we may get a panic. So this patch
      makes vylog use a list of dynamically allocated records instead of a
      static array.
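
      A sketch of the list-based scheme, using the BSD sys/queue.h STAILQ
      macros to stand in for Tarantool's stailq:

          #include <stdlib.h>
          #include <sys/queue.h>

          /* A dynamically allocated record instead of a slot in a fixed
           * 64-entry array: a transaction is now limited only by memory. */
          struct vy_log_record {
              int type;   /* e.g. an insert-slice or delete-slice record */
              STAILQ_ENTRY(vy_log_record) in_tx;
          };

          STAILQ_HEAD(vy_log_tx, vy_log_record);

          static int
          vy_log_tx_append(struct vy_log_tx *tx, int type)
          {
              struct vy_log_record *record = malloc(sizeof(*record));
              if (record == NULL)
                  return -1;
              record->type = type;
              STAILQ_INSERT_TAIL(tx, record, in_tx);
              return 0;
          }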
      63ceb698
      Export box_index_key_def() to public C API · f42e1c15
      Roman Tsisyk authored
      Closes #2386
      net.box: minor renames · 0a0bc4d6
      Roman Tsisyk authored
      Rename `remote_check` to `check_remote_arg` to follow conventions
      in schema.lua
      net.box: remove varargs from call() and eval() · 386df3d3
      Roman Tsisyk authored
      Change conn:call() and conn:eval() API to accept Lua table instead of
      varargs for function/expression arguments:
      
          conn:call(func_name, arg1, arg2, ...)
            =>
          conn:call(func_name, {arg1, arg2, ...}, opts)
      
          conn:eval(expr, arg1, arg2, ...)
            =>
          conn:eval(expr, {arg1, arg2, ...}, opts)
      
      This breaking change is needed to extend the call() and eval() API with
      per-request options, like `timeout` and `buffer` (see #2195):
      
          c:call("echo", {1, 2, 3}, {timeout = 0.2})
      
          c:call("echo", {1, 2, 3}, {buffer = ibuf})
          ibuf.rpos, result = msgpack.ibuf_decode(ibuf.rpos)
          result
      
      Tarantool 1.6.x behaviour can be turned on by the `call_16` per-connection option:
      
          c = net.connect(box.cfg.listen, {call_16 = true})
          c:call('echo', 1, 2, 3)
      
      This is a breaking change for 1.7.x.
      
      Needed for #2285
      Closes #2195
      Add support for space:format() to net.box · 905d44b0
      Konstantin Nazarov authored
      Getting the space format should be safe, as it is tied to schema_id,
      and net.box makes sure that schema_id stays consistent.
      
      It means that when you receive a tuple from net.box, you may be sure
      that its space format is consistent with the remote.
      
      Fixes #2402
      Add a test case for space:format() · 9e7749eb
      Roman Tsisyk authored
      Fixes #2391
      Fix error when setting space format · 75b8cc9b
      Konstantin Nazarov authored
      Previously the format in space:format() wasn't allowed to be nil.
      
      In context of #2391
  3. May 02, 2017
      vinyl: implement single memory level · de27e278
      Vladimir Davydov authored
       - In-memory trees are now created per index, not per range as before.
       - Dump is scheduled per index and writes the whole in-memory tree to a
         single run file. Upon completion it creates a slice for each range of
         the index.
       - Compaction is scheduled per range as before, but now it doesn't
         include in-memory trees, only on-disk runs (via slices). Compaction
         and dump of the same index can happen simultaneously.
       - Range split, just like coalescing, is done immediately by creating
         new slices and doesn't require long-term operations involving disk
         writes.
      vinyl: teach memory iterator to skip statements on start · 1e154ede
      Vladimir Davydov authored
      With a single in-memory tree per index, the read iterator will reopen
      the memory iterator for each range, as it already does for the txw and
      cache iterators, so the memory iterator needs to learn to skip to the
      statement following the last key returned by the read iterator. This
      patch adds a new parameter to the memory iterator, before_first, which,
      if not NULL, makes it start iteration from the first statement
      following the key of before_first.
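
      Conceptually, before_first selects the start position as in the
      sketch below; the tree API here is a simplified assumption:

          #include <stddef.h>

          struct key;
          struct stmt;
          struct mem_tree;

          /* Hypothetical lookups into the ordered in-memory tree. */
          struct stmt *mem_tree_first(struct mem_tree *tree);
          struct stmt *mem_tree_upper_bound(struct mem_tree *tree,
                                            const struct key *key);

          /* With before_first set, begin at the first statement strictly
           * after it, so the read iterator can reopen this source for
           * every range without seeing the same keys twice. */
          static struct stmt *
          mem_iterator_start(struct mem_tree *tree,
                             const struct key *before_first)
          {
              if (before_first == NULL)
                  return mem_tree_first(tree);
              return mem_tree_upper_bound(tree, before_first);
          }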
      vinyl: move write_iterator->key unref from cleanup to delete · f9c0b99e
      Vladimir Davydov authored
      The key is created in the main cord so there's absolutely no point in
      deleting it in a worker thread. Moving key unref from cleanup to delete
      will simplify some of the workflows of the single memory level patch.
      vinyl: drop only_disk read iterator argument · da3f11a5
      Vladimir Davydov authored
      This parameter was needed for replication before it was redesigned.
      Currently, it is always false.
      vinyl: sort slices by lsn on recovery · 3a6c2ff3
      Vladimir Davydov authored
      To ease recovery, vy_recovery_iterate() iterates over slices of the same
      range in the chronological order. It is easy to do, because we always
      log slices of the same range in the chronological order, as there can't
      be concurrent dump and compaction of the same range. However, this will
      not hold once the single memory level is introduced: a dump, which adds
      new slices to all ranges, may occur while compaction is in progress, so
      when compaction finishes, the record corresponding to the slice created
      by compaction will appear after the record for the slice created by
      dump, although the latter is newer. To prevent this from breaking the
      assumption made by iterators that newer slices are closer to the head of
      vy_range->slices list, let's sort the list on recovery/join.
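
      The sort itself is an ordinary descending sort by LSN; the lsn field
      on the slice is an assumption of this sketch:

          #include <stdint.h>
          #include <stdlib.h>

          struct vy_slice {
              int64_t lsn;   /* LSN of the dump/compaction that created
                              * the slice (field name is an assumption) */
          };

          /* Newer slices must end up closer to the head of
           * vy_range->slices, hence descending LSN order. */
          static int
          slice_cmp(const void *a, const void *b)
          {
              const struct vy_slice *sa = *(const struct vy_slice **)a;
              const struct vy_slice *sb = *(const struct vy_slice **)b;
              return sa->lsn < sb->lsn ? 1 : sa->lsn > sb->lsn ? -1 : 0;
          }

          static void
          sort_slices(struct vy_slice **slices, size_t count)
          {
              qsort(slices, count, sizeof(*slices), slice_cmp);
          }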
      vinyl: don't recover the same run for each its slice · 10a739b5
      Vladimir Davydov authored
      Currently, on recovery we create and load a new vy_run for each slice,
      so if more than one slice was created for a run, we will have the same
      run duplicated in memory. To avoid that, maintain a hash of all runs
      loaded during recovery of the current index, and look the run up there
      when a slice is created instead of creating a new run.
      
      Note, we don't need to do anything like this on initial join, as we
      delete the run right after sending it to the replica, so we can just
      create a new run each time we make a slice.
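
      The lookup-or-create pattern might look as follows; the real patch
      uses a hash table, while this sketch scans a plain array for
      brevity:

          #include <stddef.h>
          #include <stdint.h>

          struct vy_run {
              int64_t id;
              /* ... */
          };

          /* Look up a run already loaded during recovery of the current
           * index; on a miss the caller loads the file and registers the
           * new run. */
          static struct vy_run *
          recovery_lookup_run(struct vy_run **runs, size_t count,
                              int64_t run_id)
          {
              for (size_t i = 0; i < count; i++) {
                  if (runs[i]->id == run_id)
                      return runs[i];   /* reuse: don't load it twice */
              }
              return NULL;
          }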
      vinyl: store run slices in metadata log · f18dbce6
      Vladimir Davydov authored
      In order to recover run slices, we need to store info about them in the
      metadata log, so this patch introduces two new records:
       - VY_LOG_INSERT_SLICE: takes IDs of the slice, the range to insert the
         slice into, and the run the slice is for. It also takes the slice
         boundaries, because after coalescing two ranges a slice inserted into
         the resulting range may be narrower than the range.
       - VY_LOG_DELETE_SLICE: takes ID of the slice to delete.
      
      Also, it renames VY_LOG_INSERT_RUN and VY_LOG_DELETE_RUN to
      VY_LOG_CREATE_RUN and VY_LOG_DROP_RUN.
      
      Note, we no longer need to keep deleted ranges (and slices) in the log
      until garbage collection wipes them away, because they are not needed
      by the deleted run records that garbage collection targets.
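
      A sketch of the payloads the two records carry; field names are
      illustrative, not the actual vy_log_record layout:

          #include <stdint.h>

          struct key;

          struct vy_log_insert_slice {
              int64_t slice_id;
              int64_t range_id;          /* range the slice goes into */
              int64_t run_id;            /* run the slice is for */
              const struct key *begin;   /* slice bounds: may be narrower */
              const struct key *end;     /* than the range after coalescing */
          };

          struct vy_log_delete_slice {
              int64_t slice_id;
          };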
      vinyl: rename range_{begin,end} keys to {begin,end} in vy_log · b80a2cf8
      Vladimir Davydov authored
      The same keys will be used to specify slice boundaries, so let's call
      them in a neutral way. No functional changes.
      vinyl: count number of slices per run · d716f54f
      Vladimir Davydov authored
      Currently, there can't be more than one slice per run, but this will
      change once the single memory level is introduced. Then we will have to
      count the number of slices per run so that we don't unaccount the same
      run more than once as slices are deleted. Unfortunately, we can't use
      vy_run->refs to count the number of slices created for each run,
      because, although vy_run->refs is incremented for each slice
      allocated for the run, it also covers slices that were removed from
      ranges and stay allocated only because they are pinned by open
      iterators. So we add one more counter to vy_run, slice_count, and
      introduce new helpers to be used for slice creation/destruction,
      vy_run_make_slice() and vy_run_destroy_slice(), which inc/dec the
      counter.
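
      A minimal sketch of the two helpers, with memory management and
      error handling simplified:

          #include <assert.h>
          #include <stdlib.h>

          struct vy_run {
              int slice_count;   /* slices created and not yet destroyed */
          };

          struct vy_slice {
              struct vy_run *run;
          };

          /* All slice creation/destruction goes through these helpers so
           * that the per-run counter stays exact. */
          static struct vy_slice *
          vy_run_make_slice(struct vy_run *run)
          {
              struct vy_slice *slice = calloc(1, sizeof(*slice));
              if (slice == NULL)
                  return NULL;
              slice->run = run;
              run->slice_count++;
              return slice;
          }

          static void
          vy_run_destroy_slice(struct vy_slice *slice)
          {
              assert(slice->run->slice_count > 0);
              slice->run->slice_count--;   /* run unused when this is 0 */
              free(slice);
          }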
      vinyl: make check for empty range on split more thorough · 09d56944
      Vladimir Davydov authored
      There's a sanity check in vy_range_needs_split() that assures the
      resulting ranges are not going to be empty: it checks the split key
      against the oldest run's min key. The check is not enough for the slice
      concept, because even if the split key is > the min key, it can still
      be < the beginning of the slice.
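
      The extra condition amounts to one more comparison; key_compare()
      and the NULL-means-unbounded convention are assumptions of this
      sketch:

          #include <stdbool.h>
          #include <stddef.h>

          struct key;
          /* Hypothetical three-way key comparison. */
          int key_compare(const struct key *a, const struct key *b);

          /* The split key must lie strictly inside the oldest slice:
           * being above the run's min key is not enough, because the
           * slice may begin later than the run. */
          static bool
          split_key_is_sane(const struct key *split_key,
                            const struct key *slice_begin)
          {
              return slice_begin == NULL ||
                     key_compare(split_key, slice_begin) > 0;
          }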
      vinyl: add slice size estimate · 3740ff9d
      Vladimir Davydov authored
      We use run->info.keys to estimate the size of a new run's bloom filter.
      We use run->info.size to trigger range split/coalescing. If a range
      contains a slice that spans only a part of a run, we can't use run->info
      stats, so this patch introduces the following slice stats: number of
      keys (for the bloom filter) and the size on disk (for split/coalesce).
      These two counters are not accurate; they are only estimates, because
      calculating exact numbers would require disk reads. Instead we simply
      take the corresponding run's stat and multiply it by
      
          slice page count / run page count
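
      In code the estimate is a simple proportion (names are
      illustrative; run_pages is assumed to be non-zero):

          #include <stdint.h>

          struct vy_run_info {
              uint64_t keys;   /* number of keys in the run */
              uint64_t size;   /* run size on disk */
          };

          /* Scale the run's stats by the fraction of its pages the slice
           * spans; exact numbers would require disk reads. */
          static void
          slice_estimate_stats(const struct vy_run_info *run_info,
                               uint32_t slice_pages, uint32_t run_pages,
                               uint64_t *keys, uint64_t *size)
          {
              *keys = run_info->keys * slice_pages / run_pages;
              *size = run_info->size * slice_pages / run_pages;
          }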
      vinyl: separate accounting of ranges and runs · fccaa3f1
      Vladimir Davydov authored
      There will be more than one slice per run, i.e. the same run will be
      used jointly by multiple ranges. To make sure that a run isn't accounted
      twice, separate run accounting from range accounting.
      vinyl: teach run iterator to respect slice boundaries · c0bb544d
      Vladimir Davydov authored
      Make sure that we start iteration within the given slice and end it as
      soon as the current position leaves the slice boundaries. Note, the
      overhead caused by extra comparisons is only incurred if the slice has
      non-NULL boundaries, which is only the case if the run is shared among
      ranges.
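
      For forward iteration the boundary check amounts to the sketch
      below; key_compare() and the NULL-means-unbounded convention are
      assumptions:

          #include <stdbool.h>
          #include <stddef.h>

          struct key;
          /* Hypothetical three-way key comparison. */
          int key_compare(const struct key *a, const struct key *b);

          /* NULL bounds skip the comparisons entirely, so runs that are
           * not shared among ranges pay no extra cost. */
          static bool
          key_within_slice(const struct key *key,
                           const struct key *begin, const struct key *end)
          {
              if (begin != NULL && key_compare(key, begin) < 0)
                  return false;   /* before the slice begins */
              if (end != NULL && key_compare(key, end) >= 0)
                  return false;   /* reached the slice end: stop */
              return true;
          }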