Commits · 0066457cfecd442b80658ab75be03c0501c76510 · core / tarantool

Oct 26, 2018

vinyl: fix memory leak in slice stream · 0066457c

If a tuple read from a run by a slice stream happens to be out of the
slice bounds, it will never be freed. Fix it.

The leak was introduced by commit c174c985 ("vinyl: implement new
simple write iterator").

0066457c

Oct 25, 2018

Merge remote-tracking branch 'origin/1.9' into 1.10 · 211cef71
Alexander Turenko authored 6 years ago

Unverified

211cef71
Merge remote-tracking branch 'origin/1.7' into 1.9 · 3736c379
Alexander Turenko authored 6 years ago

Unverified

3736c379
Merge remote-tracking branch 'origin/1.6' into 1.7 · 7ef5be2e
Alexander Turenko authored 6 years ago

Unverified

7ef5be2e

Upload tarballs to the new S3 buckets · a5da60bd

Alexander Turenko authored 6 years ago

Upload tarballs of alpha and beta tarantool versions (*.0 and *.1
branches) into 2x (3x, 4x...) buckets. See more details about the
release process in the documentation: [1].

[1]: https://tarantool.io/en/doc/2.0/dev_guide/release_management/

Unverified

a5da60bd

test: rewrote xlog/checkpoint_daemon.test.lua · 6f6a3040

Alexander Turenko authored 6 years ago

Updated the test case for #2780 to check a last snapshot file
modification time instead of search log messages.

The test was flaky, because of small timeouts on Linux, but now we
spinning on a condition check to achieve both stable results and fast
execution.

Follows up #2780.
Fixes #3684.

6f6a3040

test: update test-run · b4d24341

Alexander Turenko authored 6 years ago

* added more details about hung tests (#107);
* added show_reproduce_content option (#113);
* fixed inspector error reporting for a failed app test;
* expand action of use_unix_socket option to non-default servers;
* updated tarantool-python submodule (#126);
* added test_run:wait_cond() and test_run:wail_log().

Updated box-py/call.test.py result file, because tarantool-python now
uses CALL 1.7 convention by default and slightly changed yaml output
formatting. See [1] and [3] for more information.

Updated replication-py/cluster.test.py, because of changed
tarantool-python internals, see commit [2].

Updated box-py/iproto.test.py because it uses tarantool-python internals
that was rewritten in [2]. Updated its result file according to CALL 1.7
response format that was set as default with [1] and yaml output
formatting changed within [3].

Updated replication-py/swap.test.py result file, because of yaml output
formatting that was slightly changed within [3].

[1]: https://github.com/tarantool/tarantool-python/issues/82
[2]: https://github.com/tarantool/tarantool-python/commit/4639d9ae1c48f1608bd599c6d93ed6bfca48fbf9
[3]: https://github.com/tarantool/tarantool-python/issues/90

b4d24341

vinyl: fix memory leak in write iterator · f4ae714b

Vladimir Davydov authored 6 years ago

Memory allocated for vy_write_iterator::src_heap is never freed. Fix it.
The leak was introduced by commit c174c985 ("vinyl: implement new
simple write iterator").

f4ae714b

Oct 23, 2018

xlog: fix sync_is_async xlog option · 55dcde00

Alexander Turenko authored 6 years ago

The behaviour change was introduced in cda3cb55: sync_is_async option
was forgotten to be updated from xdir; sync_interval was forgotten too,
but was restored in 1900c58b.

The commit fixes the performance regression around 6-14% for average RPS
on default nosqlbench workload with 30 seconds duration. The additional
information about benchmarking can be found in #3747.

Thanks to Vladimir Davydov (@locker) for the investigation of the
cda3cb55 changes.

Closes #3747

(cherry picked from commit cd9cc4c5)

55dcde00

Oct 13, 2018

replication: fix rebootstrap crash in case master has replica's rows · d4ce7447

Vladimir Davydov authored 6 years ago

During SUBSCRIBE the master sends only those rows originating from the
subscribed replica that aren't present on the replica. Such rows may
appear after a sudden power loss in case the replica doesn't issue
fdatasync() after each WAL write, which is the default behavior. This
means that a replica can write some rows to WAL, relay them to another
replica, then stop without syncing WAL file. If this happens we expect
the replica to read its own rows from other members of the cluster upon
restart. For more details see commit eae84efb ("replication: recover
missing local data from replica").

Obviously, this feature only makes sense for SUBSCRIBE. During JOIN
we must relay all rows. This is how it initially worked, but commit
adc28591 ("replication: do not delete relay on applier disconnect"),
witlessly removed the corresponding check from relay_send_row() so that
now we don't send any rows originating from the joined replica:

  @@ -595,8 +630,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
           * it). In the latter case packet's LSN is less than or equal to
           * local master's LSN at the moment it received 'SUBSCRIBE' request.
           */
  -       if (relay->replica == NULL ||
  -           packet->replica_id != relay->replica->id ||
  +       if (packet->replica_id != relay->replica->id ||
              packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
                                        packet->replica_id)) {
                  relay_send(relay, packet);

(relay->local_vclock_at_subscribe is initialized to 0 on JOIN)

This only affects the case of rebootstrap, automatic or manual, because
when a new replica joins a cluster there can't be any rows on the master
originating from it. On manual rebootstrap, i.e. when the replica files
are deleted by the user and the replica is restarted from an empty
directory with the same UUID (set via box.cfg.instance_uuid), this isn't
critical - the replica will still receive those rows it should have
received during JOIN once it subscribes. However, in case of automatic
rebootstrap this can result in broken order of xlog/snap files, because
the replica directory still contains old xlog/snap files created before
rebootstrap. The rebootstrap logic expects them to have strictly less
vclocks than new files, but if JOIN stops prematurely, this condition
may not hold, leading to a crash when the vclock of a new xlog/snap is
inserted into the corresponding xdir.

This patch fixes this issue by restoring pre eae84efb behavior: now
we create a new relay for FINAL JOIN instead of reusing the one attached
to the joined replica so that relay_send_row() can detect JOIN phase and
relay all rows in this case. It also adds a comment so that we don't
make such a mistake in future.

Apart from fixing the issue, this patch also fixes a relay leak in
relay_initial_join() in case engine_join_xc() fails, which was also
introduced by the above mentioned commit.

A note about xlog/panic_on_broken_lsn test. Now the relay status isn't
reported by box.info.replication if FINAL JOIN failed and the replica
never subscribed (this is how it worked before commit eae84efb) so
we need to tweak the test a bit to handle this.

Closes #3740

d4ce7447

Oct 12, 2018

Add compile_commands.json to git ignore · e0017ad6
Kirill Yukhin authored 6 years ago

e0017ad6

vinyl: implement basic transaction throttling · c0d8063b

Vladimir Davydov authored 6 years ago

If the rate at which transactions are ready to write to the database is
greater than the dump bandwidth, memory will get depleted before the
previously scheduled dump is complete and all newer transactions will
have to wait, which may take seconds or even minutes:

W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec

This patch set implements basic transaction throttling that is supposed
to help avoid unpredictably long stalls. Now the transaction write rate
is always capped by the observed dump bandwidth, because it doesn't make
sense to consume memory at a greater rate than it can be freed. On top
of that, when a dump begins, we estimate the amount of time it is going
to take and limit the transaction write rate accordingly.

Note, this patch doesn't take into account compaction when setting the
rate limit so compaction threads may still fail to keep up with dumps,
increasing the read amplification. It will be addressed later.

Closes #1862

c0d8063b

vinyl: fix memory dump trigger · 45d61b66

Vladimir Davydov authored 6 years ago

vy_quota_signal() doesn't wake up a consumer if it won't be able to
proceed because of the memory limit. This is OK, but it doesn't attempt
to trigger memory dump in this case either. As a result, it may occur
that dump isn't triggered and all waiting consumers are aborted by
timeout. E.g. this happens if memory dump releases no memory, which is
possible because memory is allocated and freed in 16 MB chunks. This
results in occasional vinyl/quota_tmeout test failures.

Fix this by moving the dump trigger right in vy_quota_may_use() so that
it's called whenever we consider a consumer for wakeup.

45d61b66

vinyl: do not account small dumps for bandwidth estimation · e351b3e6

Vladimir Davydov authored 6 years ago

Small dumps (e.g. triggered by box.snapshot) have too high overhead
associated with file creation so taking them into account for bandwidth
estimation may result in erroneous transaction throttling. Let's ignore
dumps of size less than 1 MB.

Needed for #1862

e351b3e6

vinyl: do not try to trigger dump in regulator if already in progress · b80f437f

Vladimir Davydov authored 6 years ago

This is pointless since trigger_dump_cb callback will return right away
in such a case. Let's wrap trigger_dump_cb in vy_regulator_trigger_dump
method, which will actulally invoke the callback only if the previous
dump has already completed (i.e. vy_regulator_dump_complete was called).

This also gives us a definite place in code where we can adjust the rate
limit so as to guarantee that a triggered memory dump will finish before
we hit the hard memory limit (this will be done later).

Needed for #1862

b80f437f

vinyl: bypass format validation for statements loaded from disk · 3846d9b2

Vladimir Davydov authored 6 years ago

When the format of a space is altered, we walk over all tuples stored in
the primary index and check them against the new format. This doesn't
guarantee that all *statements* stored in the primary index conform to
the new format though, because the check isn't performed for deleted or
overwritten statements, e.g.

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('primary')
  s:insert{1}
  box.snapshot()
  s:delete{1}

  -- The following command will succeed, because the space is empty,
  -- however one of the runs contains REPLACE{1}, which doesn't conform
  -- to the new format.
  s:create_index('secondary', {parts = {2, 'unsigned'}})

This is OK as we will never return such overwritten statements to the
user, however we may still need to read them. Currently, this leads
either to an assertion failure or to a read error in

  vy_stmt_decode
   vy_stmt_new_with_ops
    tuple_init_field_map

We could probably force major compaction of the primary index to purge
such statements, but it is complicated as there may be a read view
preventing the write iterator from squashing such a statement, and
currently there's no way to force destruction of a read view.

So this patch simply disables format validation for all tuples loaded
from disk (actually we already skip format validation for all secondary
index statements and for DELETE statements in primary indexes so this
isn't as bad as it may seem). To do that, it adds a boolean parameter to
tuple_init_field_map() that disables format validation, and then makes
vy_stmt_new_with_ops(), which is used for constructing vinyl statements,
set it to false. This is OK as all statements inserted into a vinyl
space are validated explicitly with tuple_validate() anyway.

This is rather a workaround for the lack of a better solution.

Closes #3540

3846d9b2

test: fix spurious box/sql test failure · 23e71c6e

Vladimir Davydov authored 6 years ago

For some reason this test uses 555 for space id, which may be taken by
a previously created space:

Test failed! Result content mismatch:
--- box/sql.result        Fri Oct  5 17:23:25 2018
+++ box/sql.reject        Fri Oct 12 19:38:51 2018
@@ -12,12 +12,14 @@
 ...
 _ = box.schema.space.create('test1', { id = 555 })
 ---
+- error: Duplicate key exists in unique index 'primary' in space '_space'
 ...

Reproduce file:

---
- [box/rtree_point.test.lua, null]
- [box/transaction.test.lua, null]
- [box/tree_pk.test.lua, null]
- [box/access.test.lua, null]
- [box/cfg.test.lua, null]
- [box/admin.test.lua, null]
- [box/lua.test.lua, null]
- [box/bitset.test.lua, null]
- [box/role.test.lua, null]
- [box/sql.test.lua, null]
...

Remove { id = 555 } to make sure it never happens.

23e71c6e

Add Linux/clang CI target · 181bb3e7

Alexander Turenko authored 6 years ago

Replaced targets generation using a matrix expansion + exclusion list
with the explicit targets list. Gave meagingful names for targets.

Fixes #3673.

181bb3e7

xlog: fix filename in error messages · 4a464f8a

Vladimir Davydov authored 6 years ago

 - xlog_rename() doesn't strip xlog->filename of inprogress suffix so
   write errors will mistakenly report the filename as inprogress.
 - xlog_create() uses a name without inprogress suffix for error
   reporting while it actually creates an inprogress file.

4a464f8a

Oct 10, 2018

socket: fix polling in case of spurious wakeup · e6bd7748

Georgy Kirichenko authored 6 years ago

socket_writable/socket_readable handles socket.iowait spurious wakeup
until event is happened or timeout is exceeded.

Closes #3344

e6bd7748

vinyl: fix for deferred DELETE overwriting newer statement · 63912c30

Vladimir Davydov authored 6 years ago

A deferred DELETE may be generated after a newer statement for the same
key was inserted into a secondary index and hence land in a newer run.
Since the read iterator assumes that newer sources always contain newer
statements for the same key, we mark all deferred DELETE statements with
VY_STMT_SKIP_READ flag, which makes run/mem iterators ignore them. The
flag must be persisted when a statement is written to disk, but it is
not. Fix this.

Fixes commit 504bc805 ("vinyl: do not store meta in secondary index
runs").

63912c30

test: disable feedback daemon test on Mac OS in CI · ab868a6b

Alexander Turenko authored 6 years ago

The fail is known and should not have any influence on our CI results.

The test should be enabled back after a fix of #3558.

ab868a6b

Oct 08, 2018

cmake: fix sync_file_range detection · e1aa1a3d

Vladimir Davydov authored 6 years ago

sync_file_range is declared only if _GNU_SOURCE macro is defined.
Also, in order to be used in a source file, HAVE_SYNC_FILE_RANGE
must be present in config.h.cmake.

Fixes commit caae99e5 ("Refactor xlog writer").

e1aa1a3d

Oct 06, 2018

vinyl: fix dump_bandwidth init in case snap_io_rate_limit is maxed out · ed6097f6

Vladimir Davydov authored 6 years ago

box.cfg{snap_io_rate_limit = 0} means that the limit is maxed out hence
we must set the dump bandwidth estimate to the default value. Instead
we set it to 0, which may resulting in invalid transaction throttling.
Fix this.

Fixes commit b646fbd9 ("vinyl: use snap_io_rate_limit for initial
dump bandwidth estimate").

ed6097f6

Oct 05, 2018

replication: ref checkpoint needed to join replica · bae6f037

Vladimir Davydov authored 6 years ago

Before joining a new replica we register a gc_consumer to prevent
garbage collection of files needed for join and following subscribe.
Before commit 9c5d851d ("replication: remove old snapshot files not
needed by replicas") a consumer would pin both checkpoints and WALs so
that would work as expected. However, the above mentioned commit
introduced consumer types and marked a consumer registered on replica
join as WAL-only so if the garbage collector was invoked during join, it
could delete files corresponding to the relayed checkpoint resulting in
replica join failure. Fix this issue by pinning the checkpoint used for
joining a replica with gc_ref_checkpoint and unpinning once join is
complete.

The issue can only be reproduced if there are vinyl spaces, because
deletion of an open snap file doesn't prevent the relay from reading it.
The existing replication/gc test would catch the issue if it triggered
compaction on the master so we simply tweak it accordingly instead of
adding a new test case.

Closes #3708

bae6f037

gc: call gc_run unconditionally when consumer is advanced · a3542586

Vladimir Davydov authored 6 years ago

gc_consumer_unregister and gc_consumer_advance don't call gc_run in case
the consumer in question isn't leftmost. This code was written back when
gc_run was kinda heavy and would call engine/wal callbacks even if it
wouldn't really need to. Today gc_run will bail out shortly, without
making any complex computation, let alone invoking garbage collection
callbacks, in case it has nothing to do so those optimizations are
pointless. Let's remove them.

a3542586

gc: separate checkpoint references from wal consumers · 28c97041

Vladimir Davydov authored 6 years ago

Initially, gc_consumer object was used for pinning both checkpoint and
WAL files, but commit 9c5d851d ("replication: remove old snapshot
files not needed by replicas") changed that. Now whether a consumer pins
WALs or checkpoints or both depends on gc_consumer_type. This was done
so that replicas wouldn't prevent garbage collection of checkpoint
files, which they don't need after initial join is complete.

The way the feature was implemented is rather questionable though:
 - Since consumers of both types are stored in the same binary search
   tree, we have to iterate through the tree to find the leftmost
   checkpoint consumer, see gc_tree_first_checkpoint. This looks
   inefficient and ugly.
 - The notion of advancing a checkpoint consumer (gc_consumer_advance)
   is dubious: there's no point to move on to the next checkpoint after
   reading one - instead the consumer needs incremental changes, i.e.
   WALs.

To eliminate those questionable aspects and make the code easier for
understanding, let's separate WAL and checkpoint consumers. We do this
by removing gc_consumer_type and making gc_consumer track WALs only.
For pinning the files corresponding to a checkpoint a new object class
is introduced, gc_checkpoint_ref. To pin a checkpoint, gc_ref_checkpoint
needs to be called. It is passed the gc_checkpoint object to pin, the
consumer name, and the gc_checkpoint_ref to store the ref in. To unpin a
previously pinned checkpoint, gc_checkpoint_unref should be called.

References are listed by box.info.gc() for each checkpoint under
'references' key.

28c97041

gc: improve box.info.gc output · eba06790

Vladimir Davydov authored 6 years ago

Report vclocks in addition to signatures. When box.info.gc was first
introduced we used signatures in gc. Now we use vclocks so there's no
reason not to report them. This is consistent with box.info output
(there's vclock and signature).

Report the vclock and signature of the oldest WAL row available on the
instance under box.info.gc().vclock. Without this information the user
would have to figure it out by looking at box.info.gc().consumers.

eba06790

gc: cleanup garbage collection procedure · 668ec2aa

Vladimir Davydov authored 6 years ago

Do some refactoring intended to make the code of gc_run() easier for
understanding:
 - Remove gc_state::checkpoint_vclock. It was used to avoid rerunning
   engine gc callback in case no checkpoint was deleted. Since we
   maintain a list of all available checkpoints, we don't need it for
   this anymore - we can run gc only if a checkpoint was actually
   removed from the list.
 - Rename gc_state::wal_vclock back to gc_state::vclock.
 - Use bool variables with descriptive names instead of comparing
   vclock signatures.
 - Add some comments.

668ec2aa

gc: keep track of available checkpoints · a55f86fd

Vladimir Davydov authored 6 years ago

Currently, the checkpoint iterator is in fact a wrapper around
memtx_engine::snap_dir while the garbage collector knows nothing about
checkpoints. This feels like encapsulation violation. Let's keep track
of all available checkpoints right in the garbage collector instead
and export gc_ API to iterate over checkpoints.

a55f86fd

gc: rename checkpoint_count to min_checkpoint_count · 1162743a

Vladimir Davydov authored 6 years ago

Because it's the minimal number of checkpoints that must not be deleted,
not the actual number of preserved checkpoints. Do it now, in a separate
patch so as to ease review of the next patch.

While we are at it, fix the comment to gc_set_(min_)checkpoint_count()
which got outdated by commit 5512053f ("box: gc: do not remove files
being backed up").

1162743a

gc: format consumer name in gc_consumer_register · 1ed5f1e9
Vladimir Davydov authored 6 years ago
```
It's better than using tt_snprintf at call sites.
```
1ed5f1e9

gc: fold gc_consumer_new · 7b42ed21

Vladimir Davydov authored 6 years ago

gc_consumer_new is used in gc_consumer_register. Let's fold it to make
the code flow more straightforward.

7b42ed21

gc: use fixed length buffer for storing consumer name · 961938b0

Vladimir Davydov authored 6 years ago

The length of a consumer name never exceeds 64 characters so no use to
allocate a string. This is a mere code simplification.

961938b0

gc: make gc_consumer and gc_state structs transparent · 5231bb6b

Vladimir Davydov authored 6 years ago

It's exasperating to write trivial external functions for each member of
an opaque struct (gc_consumer_vclock, gc_consumer_name, etc) while we
could simply access those fields directly if we made those structs
transparent. Since we usually define structs as transparent if we need
to use them outside a source file, let's do the same for gc_consumer and
gc_state and remove all those one-line wrappers.

5231bb6b

vinyl: force deletion of runs left from unfinished indexes on restart · eb6280d0

Vladimir Davydov authored 6 years ago

If an instance is restarted while building a new vinyl index, there will
probably be some run files left. Currently, we won't delete such files
until box.snapshot() is called, even though there's no point in keeping
them around. Let's tweak vy_gc_lsm() so that it marks all runs that
belong to an unfinished index as incomplete to force vy_gc() to remove
them immediately after recovery is complete.

This also removes files left from a failed rebootstrap attempt so we can
remove a call to box.snapshot() from vinyl/replica_rejoin.test.lua.

eb6280d0

vinyl: fix master crash on replica join failure · 626dfb2c

Vladimir Davydov authored 6 years ago

This patch fixes a trivial error on vy_send_range() error path which
results in a master crash in case a file needed to join a replica is
missing or corrupted.

See #3708

626dfb2c

Set lua state for main fiber too · 9039df9c
Georgy Kirichenko authored 6 years ago
```
The main fiber should have a lua state as any other lua fiber.

Needed for #3538
```
9039df9c

Add -Werror for CI (1.10 part) · da505ee7

Alexander Turenko authored 6 years ago

Added MAKE_BUILD_TYPE=RelWithDebInfoWError option, which means enabling
-DNDEBUG=1, -O2 and -Wall -Wextra -Werror. This ensures we have clean
release build without warnings.

Fixed found -Wunused-variable and -Wunused-parameter warnings.

Part of #3238.

da505ee7

Oct 03, 2018
- utf8: allow empty strings in utf8.upper/lower · 129099bc
  Vladislav Shpilevoy authored 6 years ago
  
  Closes #3709
  129099bc