- Oct 25, 2018
-
-
AKhatskevich authored
-
AKhatskevich authored
Added -DENABLE_LTO=ON/OFF cmake option, OFF by default. LTO speeds up cpu-intensive workloads by up to 20% (see [1] and [2]). Requirements to enable LTO: - cmake >= 3.9; - Linux: ld.bfd / ld.gold from binutils >= 2.31 (or later 2.30) (gold >= 1.15); - Mac OS: xcode >= 8 (earlier versions are not tested). The requirement of the recent ld version is due to bug with exporting symbols from dynamic list when LTO is enabled, see [3]. Note: -Wlto-type-mismatch on GCC (enabled by default with -flto) gives namy warnings. Filed [4] to investigate it. Note: LuaJIT will be compiled w/o LTO despite the option set, filed [5]. [1]: https://github.com/tarantool/tarantool/wiki/performance-research [2]: https://gist.github.com/Khatskevich/31a2da6ab46ce903120e7a03d65966db [3]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84901 [4]: https://github.com/tarantool/tarantool/issues/3742 [5]: https://github.com/tarantool/tarantool/issues/3743 Closes #3117 f
-
AKhatskevich authored
In case of very aggressive optimizations the compiler can optimize guard-breaker function away and the `unit/guard` test would fail.
-
Kirill Yukhin authored
2.0 was renamed to 2.1.
-
- Oct 23, 2018
-
-
Alexander Turenko authored
The behaviour change was introduced in cda3cb55: sync_is_async option was forgotten to be updated from xdir; sync_interval was forgotten too, but was restored in 1900c58b. The commit fixes the performance regression around 6-14% for average RPS on default nosqlbench workload with 30 seconds duration. The additional information about benchmarking can be found in #3747. Thanks to Vladimir Davydov (@locker) for the investigation of the cda3cb55 changes. Closes #3747
-
- Oct 19, 2018
-
-
Kirill Yukhin authored
-
- Oct 18, 2018
-
-
lenkis authored
-
Vladimir Davydov authored
During SUBSCRIBE the master sends only those rows originating from the subscribed replica that aren't present on the replica. Such rows may appear after a sudden power loss in case the replica doesn't issue fdatasync() after each WAL write, which is the default behavior. This means that a replica can write some rows to WAL, relay them to another replica, then stop without syncing WAL file. If this happens we expect the replica to read its own rows from other members of the cluster upon restart. For more details see commit eae84efb ("replication: recover missing local data from replica"). Obviously, this feature only makes sense for SUBSCRIBE. During JOIN we must relay all rows. This is how it initially worked, but commit adc28591 ("replication: do not delete relay on applier disconnect"), witlessly removed the corresponding check from relay_send_row() so that now we don't send any rows originating from the joined replica: @@ -595,8 +630,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet) * it). In the latter case packet's LSN is less than or equal to * local master's LSN at the moment it received 'SUBSCRIBE' request. */ - if (relay->replica == NULL || - packet->replica_id != relay->replica->id || + if (packet->replica_id != relay->replica->id || packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe, packet->replica_id)) { relay_send(relay, packet); (relay->local_vclock_at_subscribe is initialized to 0 on JOIN) This only affects the case of rebootstrap, automatic or manual, because when a new replica joins a cluster there can't be any rows on the master originating from it. On manual rebootstrap, i.e. when the replica files are deleted by the user and the replica is restarted from an empty directory with the same UUID (set via box.cfg.instance_uuid), this isn't critical - the replica will still receive those rows it should have received during JOIN once it subscribes. However, in case of automatic rebootstrap this can result in broken order of xlog/snap files, because the replica directory still contains old xlog/snap files created before rebootstrap. The rebootstrap logic expects them to have strictly less vclocks than new files, but if JOIN stops prematurely, this condition may not hold, leading to a crash when the vclock of a new xlog/snap is inserted into the corresponding xdir. This patch fixes this issue by restoring pre eae84efb behavior: now we create a new relay for FINAL JOIN instead of reusing the one attached to the joined replica so that relay_send_row() can detect JOIN phase and relay all rows in this case. It also adds a comment so that we don't make such a mistake in future. Apart from fixing the issue, this patch also fixes a relay leak in relay_initial_join() in case engine_join_xc() fails, which was also introduced by the above mentioned commit. A note about xlog/panic_on_broken_lsn test. Now the relay status isn't reported by box.info.replication if FINAL JOIN failed and the replica never subscribed (this is how it worked before commit eae84efb) so we need to tweak the test a bit to handle this. Closes #3740
-
- Oct 17, 2018
-
-
Kirill Yukhin authored
Main branch was renamed to master. Update .travis.yml accordingly.
-
- Oct 13, 2018
-
-
Vladimir Davydov authored
-
- Oct 12, 2018
-
-
Kirill Yukhin authored
-
Vladimir Davydov authored
If the rate at which transactions are ready to write to the database is greater than the dump bandwidth, memory will get depleted before the previously scheduled dump is complete and all newer transactions will have to wait, which may take seconds or even minutes: W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec This patch set implements basic transaction throttling that is supposed to help avoid unpredictably long stalls. Now the transaction write rate is always capped by the observed dump bandwidth, because it doesn't make sense to consume memory at a greater rate than it can be freed. On top of that, when a dump begins, we estimate the amount of time it is going to take and limit the transaction write rate accordingly. Note, this patch doesn't take into account compaction when setting the rate limit so compaction threads may still fail to keep up with dumps, increasing the read amplification. It will be addressed later. Closes #1862
-
Vladimir Davydov authored
vy_quota_signal() doesn't wake up a consumer if it won't be able to proceed because of the memory limit. This is OK, but it doesn't attempt to trigger memory dump in this case either. As a result, it may occur that dump isn't triggered and all waiting consumers are aborted by timeout. E.g. this happens if memory dump releases no memory, which is possible because memory is allocated and freed in 16 MB chunks. This results in occasional vinyl/quota_tmeout test failures. Fix this by moving the dump trigger right in vy_quota_may_use() so that it's called whenever we consider a consumer for wakeup.
-
Vladimir Davydov authored
Small dumps (e.g. triggered by box.snapshot) have too high overhead associated with file creation so taking them into account for bandwidth estimation may result in erroneous transaction throttling. Let's ignore dumps of size less than 1 MB. Needed for #1862
-
Vladimir Davydov authored
This is pointless since trigger_dump_cb callback will return right away in such a case. Let's wrap trigger_dump_cb in vy_regulator_trigger_dump method, which will actulally invoke the callback only if the previous dump has already completed (i.e. vy_regulator_dump_complete was called). This also gives us a definite place in code where we can adjust the rate limit so as to guarantee that a triggered memory dump will finish before we hit the hard memory limit (this will be done later). Needed for #1862
-
Vladimir Davydov authored
When the format of a space is altered, we walk over all tuples stored in the primary index and check them against the new format. This doesn't guarantee that all *statements* stored in the primary index conform to the new format though, because the check isn't performed for deleted or overwritten statements, e.g. s = box.schema.space.create('test', {engine = 'vinyl'}) s:create_index('primary') s:insert{1} box.snapshot() s:delete{1} -- The following command will succeed, because the space is empty, -- however one of the runs contains REPLACE{1}, which doesn't conform -- to the new format. s:create_index('secondary', {parts = {2, 'unsigned'}}) This is OK as we will never return such overwritten statements to the user, however we may still need to read them. Currently, this leads either to an assertion failure or to a read error in vy_stmt_decode vy_stmt_new_with_ops tuple_init_field_map We could probably force major compaction of the primary index to purge such statements, but it is complicated as there may be a read view preventing the write iterator from squashing such a statement, and currently there's no way to force destruction of a read view. So this patch simply disables format validation for all tuples loaded from disk (actually we already skip format validation for all secondary index statements and for DELETE statements in primary indexes so this isn't as bad as it may seem). To do that, it adds a boolean parameter to tuple_init_field_map() that disables format validation, and then makes vy_stmt_new_with_ops(), which is used for constructing vinyl statements, set it to false. This is OK as all statements inserted into a vinyl space are validated explicitly with tuple_validate() anyway. This is rather a workaround for the lack of a better solution. Closes #3540
-
Vladimir Davydov authored
For some reason this test uses 555 for space id, which may be taken by a previously created space: Test failed! Result content mismatch: --- box/sql.result Fri Oct 5 17:23:25 2018 +++ box/sql.reject Fri Oct 12 19:38:51 2018 @@ -12,12 +12,14 @@ ... _ = box.schema.space.create('test1', { id = 555 }) --- +- error: Duplicate key exists in unique index 'primary' in space '_space' ... Reproduce file: --- - [box/rtree_point.test.lua, null] - [box/transaction.test.lua, null] - [box/tree_pk.test.lua, null] - [box/access.test.lua, null] - [box/cfg.test.lua, null] - [box/admin.test.lua, null] - [box/lua.test.lua, null] - [box/bitset.test.lua, null] - [box/role.test.lua, null] - [box/sql.test.lua, null] ... Remove { id = 555 } to make sure it never happens.
-
Alexander Turenko authored
Replaced targets generation using a matrix expansion + exclusion list with the explicit targets list. Gave meagingful names for targets. Fixes #3673.
-
Vladimir Davydov authored
- xlog_rename() doesn't strip xlog->filename of inprogress suffix so write errors will mistakenly report the filename as inprogress. - xlog_create() uses a name without inprogress suffix for error reporting while it actually creates an inprogress file.
-
- Oct 10, 2018
-
-
Georgy Kirichenko authored
socket_writable/socket_readable handles socket.iowait spurious wakeup until event is happened or timeout is exceeded. Closes #3344
-
Vladimir Davydov authored
A deferred DELETE may be generated after a newer statement for the same key was inserted into a secondary index and hence land in a newer run. Since the read iterator assumes that newer sources always contain newer statements for the same key, we mark all deferred DELETE statements with VY_STMT_SKIP_READ flag, which makes run/mem iterators ignore them. The flag must be persisted when a statement is written to disk, but it is not. Fix this. Fixes commit 504bc805 ("vinyl: do not store meta in secondary index runs").
-
Alexander Turenko authored
The fail is known and should not have any influence on our CI results. The test should be enabled back after a fix of #3558.
-
- Oct 08, 2018
-
-
Vladimir Davydov authored
sync_file_range is declared only if _GNU_SOURCE macro is defined. Also, in order to be used in a source file, HAVE_SYNC_FILE_RANGE must be present in config.h.cmake. Fixes commit caae99e5 ("Refactor xlog writer").
-
- Oct 06, 2018
-
-
Vladimir Davydov authored
-
Vladimir Davydov authored
box.cfg{snap_io_rate_limit = 0} means that the limit is maxed out hence we must set the dump bandwidth estimate to the default value. Instead we set it to 0, which may resulting in invalid transaction throttling. Fix this. Fixes commit b646fbd9 ("vinyl: use snap_io_rate_limit for initial dump bandwidth estimate").
-
- Oct 05, 2018
-
-
Vladimir Davydov authored
Before joining a new replica we register a gc_consumer to prevent garbage collection of files needed for join and following subscribe. Before commit 9c5d851d ("replication: remove old snapshot files not needed by replicas") a consumer would pin both checkpoints and WALs so that would work as expected. However, the above mentioned commit introduced consumer types and marked a consumer registered on replica join as WAL-only so if the garbage collector was invoked during join, it could delete files corresponding to the relayed checkpoint resulting in replica join failure. Fix this issue by pinning the checkpoint used for joining a replica with gc_ref_checkpoint and unpinning once join is complete. The issue can only be reproduced if there are vinyl spaces, because deletion of an open snap file doesn't prevent the relay from reading it. The existing replication/gc test would catch the issue if it triggered compaction on the master so we simply tweak it accordingly instead of adding a new test case. Closes #3708
-
Vladimir Davydov authored
gc_consumer_unregister and gc_consumer_advance don't call gc_run in case the consumer in question isn't leftmost. This code was written back when gc_run was kinda heavy and would call engine/wal callbacks even if it wouldn't really need to. Today gc_run will bail out shortly, without making any complex computation, let alone invoking garbage collection callbacks, in case it has nothing to do so those optimizations are pointless. Let's remove them.
-
Vladimir Davydov authored
Initially, gc_consumer object was used for pinning both checkpoint and WAL files, but commit 9c5d851d ("replication: remove old snapshot files not needed by replicas") changed that. Now whether a consumer pins WALs or checkpoints or both depends on gc_consumer_type. This was done so that replicas wouldn't prevent garbage collection of checkpoint files, which they don't need after initial join is complete. The way the feature was implemented is rather questionable though: - Since consumers of both types are stored in the same binary search tree, we have to iterate through the tree to find the leftmost checkpoint consumer, see gc_tree_first_checkpoint. This looks inefficient and ugly. - The notion of advancing a checkpoint consumer (gc_consumer_advance) is dubious: there's no point to move on to the next checkpoint after reading one - instead the consumer needs incremental changes, i.e. WALs. To eliminate those questionable aspects and make the code easier for understanding, let's separate WAL and checkpoint consumers. We do this by removing gc_consumer_type and making gc_consumer track WALs only. For pinning the files corresponding to a checkpoint a new object class is introduced, gc_checkpoint_ref. To pin a checkpoint, gc_ref_checkpoint needs to be called. It is passed the gc_checkpoint object to pin, the consumer name, and the gc_checkpoint_ref to store the ref in. To unpin a previously pinned checkpoint, gc_checkpoint_unref should be called. References are listed by box.info.gc() for each checkpoint under 'references' key.
-
Vladimir Davydov authored
Report vclocks in addition to signatures. When box.info.gc was first introduced we used signatures in gc. Now we use vclocks so there's no reason not to report them. This is consistent with box.info output (there's vclock and signature). Report the vclock and signature of the oldest WAL row available on the instance under box.info.gc().vclock. Without this information the user would have to figure it out by looking at box.info.gc().consumers.
-
Vladimir Davydov authored
Do some refactoring intended to make the code of gc_run() easier for understanding: - Remove gc_state::checkpoint_vclock. It was used to avoid rerunning engine gc callback in case no checkpoint was deleted. Since we maintain a list of all available checkpoints, we don't need it for this anymore - we can run gc only if a checkpoint was actually removed from the list. - Rename gc_state::wal_vclock back to gc_state::vclock. - Use bool variables with descriptive names instead of comparing vclock signatures. - Add some comments.
-
Vladimir Davydov authored
Currently, the checkpoint iterator is in fact a wrapper around memtx_engine::snap_dir while the garbage collector knows nothing about checkpoints. This feels like encapsulation violation. Let's keep track of all available checkpoints right in the garbage collector instead and export gc_ API to iterate over checkpoints.
-
Vladimir Davydov authored
Because it's the minimal number of checkpoints that must not be deleted, not the actual number of preserved checkpoints. Do it now, in a separate patch so as to ease review of the next patch. While we are at it, fix the comment to gc_set_(min_)checkpoint_count() which got outdated by commit 5512053f ("box: gc: do not remove files being backed up").
-
Vladimir Davydov authored
It's better than using tt_snprintf at call sites.
-
Vladimir Davydov authored
gc_consumer_new is used in gc_consumer_register. Let's fold it to make the code flow more straightforward.
-
Vladimir Davydov authored
The length of a consumer name never exceeds 64 characters so no use to allocate a string. This is a mere code simplification.
-
Vladimir Davydov authored
It's exasperating to write trivial external functions for each member of an opaque struct (gc_consumer_vclock, gc_consumer_name, etc) while we could simply access those fields directly if we made those structs transparent. Since we usually define structs as transparent if we need to use them outside a source file, let's do the same for gc_consumer and gc_state and remove all those one-line wrappers.
-
Vladimir Davydov authored
If an instance is restarted while building a new vinyl index, there will probably be some run files left. Currently, we won't delete such files until box.snapshot() is called, even though there's no point in keeping them around. Let's tweak vy_gc_lsm() so that it marks all runs that belong to an unfinished index as incomplete to force vy_gc() to remove them immediately after recovery is complete. This also removes files left from a failed rebootstrap attempt so we can remove a call to box.snapshot() from vinyl/replica_rejoin.test.lua.
-
Vladimir Davydov authored
This patch fixes a trivial error on vy_send_range() error path which results in a master crash in case a file needed to join a replica is missing or corrupted. See #3708
-
Georgy Kirichenko authored
The main fiber should have a lua state as any other lua fiber. Needed for #3538
-
Alexander Turenko authored
Fixed warnings from -Wunused-parameter, -Wunused-variable, -Wunused-but-set-variable. Fixed -Wsometimes-uninitialized on clang in where.c. Removed -Wall -Wextra -Werror from SQL CMakeLists.txt, because it is set by tarantool itself and is redundant. Fixes #3238.
-