- Oct 25, 2018
-
-
Vladimir Davydov authored
Now if the WAL thread fails to preallocate the disk space needed to commit a transaction, it deletes old WAL files until it succeeds or until it has deleted all files that are not needed for local recovery from the oldest checkpoint. After it deletes a file, it notifies the garbage collector via the WAL watcher interface. The latter then deactivates consumers that would need the deleted files.

The user doesn't see an ENOSPC error if the WAL thread successfully allocates disk space after deleting old files. Here's what's printed to the log when this happens:

    wal/101/main C> ran out of disk space, try to delete old WAL files
    wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000005.xlog
    wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000006.xlog
    wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000007.xlog
    main/105/main C> deactivated WAL consumer replica 82d0fa3f-6881-4bc5-a2c0-a0f5dcf80120 at {1: 5}
    main/105/main C> deactivated WAL consumer replica 98dce0a8-1213-4824-b31e-c7e3c4eaf437 at {1: 7}

Closes #3397
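The retry loop described in this commit can be modelled roughly as follows. This is only an illustrative Python sketch, not the actual WAL thread code: `preallocate_with_cleanup`, `try_preallocate`, `first_needed` and `on_removed` are hypothetical names, and WAL files are represented by their integer signatures.

```python
import errno


def preallocate_with_cleanup(try_preallocate, wal_files, first_needed, on_removed):
    """Retry preallocation, deleting old WAL files on ENOSPC.

    wal_files    -- list of WAL file signatures, oldest first (mutated in place)
    first_needed -- signature of the oldest file still required for local
                    recovery from the oldest checkpoint (hypothetical input)
    on_removed   -- callback standing in for the garbage collector notification
    """
    while True:
        try:
            try_preallocate()
            return
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise
            # Give up once only files needed for local recovery remain.
            if not wal_files or wal_files[0] >= first_needed:
                raise
            removed = wal_files.pop(0)
            on_removed(removed)
```

The error reaches the caller only when nothing more can be deleted, which mirrors the statement that the user does not see ENOSPC if space is freed successfully.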
-
- Oct 24, 2018
-
-
Vladimir Davydov authored
This patch adds a new entry to the per-index statistics reported by index.stat():

    disk.statement
        inserts
        replaces
        deletes
        upserts

It shows the number of statements of each type stored in run files. The new statistics are persisted in index files. We will need this information so that we can force major compaction when there are too many DELETE statements accumulated in run files.

Needed for #3225
-
Vladimir Davydov authored
tuple_extra() allows storing arbitrary metadata inside tuples. To use it, one should set extra_size when creating a tuple_format. It was introduced for storing the UPSERT counter or the column mask inside vinyl statements. It turned out that it wasn't really needed, as the UPSERT counter can be stored on lsregion while the column mask doesn't need to be stored at all. Actually, the whole idea of tuple_extra() is rather crooked: why would we need it if we can inherit struct tuple instead, as we do in the case of memtx_tuple and vy_stmt? Accessing an inherited struct is much more convenient than using tuple_extra().

So this patch gets rid of tuple_extra(). To do that, it partially reverts the following commits:

    6c0842e0 vinyl: refactor vy_stmt_alloc()
    74ff46d8 vinyl: add special format for tuples with column mask
    11eb7816 Add extra size to tuple_format->field_map_size
-
Vladimir Davydov authored
Finally, these atrocities are not used anywhere and can be removed.
-
Vladimir Davydov authored
An UPDATE operation is written as DELETE + REPLACE to secondary indexes. We write those statements to the memory level even if the UPDATE doesn't actually update columns indexed by a secondary key. We filter them out in the write iterator when the memory level is dumped. That's what we use vy_stmt_column_mask for. Actually, there's no point in keeping those statements until dump: we might as well filter them out when the transaction is committed. This would even save some memory. This wouldn't hurt read operations, because point lookup doesn't work for secondary indexes by design and so we have to read all sources, including disk, on every read from a secondary index. That said, let's move the update optimization from the write iterator to vy_tx_commit. This is a step towards removing vy_stmt_column_mask.
-
- Oct 13, 2018
-
-
Vladimir Davydov authored
During SUBSCRIBE the master sends only those rows originating from the subscribed replica that aren't present on the replica. Such rows may appear after a sudden power loss in case the replica doesn't issue fdatasync() after each WAL write, which is the default behavior. This means that a replica can write some rows to WAL, relay them to another replica, then stop without syncing the WAL file. If this happens we expect the replica to read its own rows from other members of the cluster upon restart. For more details see commit eae84efb ("replication: recover missing local data from replica").

Obviously, this feature only makes sense for SUBSCRIBE. During JOIN we must relay all rows. This is how it initially worked, but commit adc28591 ("replication: do not delete relay on applier disconnect") witlessly removed the corresponding check from relay_send_row() so that now we don't send any rows originating from the joined replica:

    @@ -595,8 +630,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
              * it). In the latter case packet's LSN is less than or equal to
              * local master's LSN at the moment it received 'SUBSCRIBE' request.
              */
    -        if (relay->replica == NULL ||
    -            packet->replica_id != relay->replica->id ||
    +        if (packet->replica_id != relay->replica->id ||
                 packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
                                           packet->replica_id)) {
                     relay_send(relay, packet);

(relay->local_vclock_at_subscribe is initialized to 0 on JOIN)

This only affects the case of rebootstrap, automatic or manual, because when a new replica joins a cluster there can't be any rows on the master originating from it. On manual rebootstrap, i.e. when the replica files are deleted by the user and the replica is restarted from an empty directory with the same UUID (set via box.cfg.instance_uuid), this isn't critical: the replica will still receive those rows it should have received during JOIN once it subscribes.
However, in case of automatic rebootstrap this can result in broken order of xlog/snap files, because the replica directory still contains old xlog/snap files created before rebootstrap. The rebootstrap logic expects them to have strictly less vclocks than new files, but if JOIN stops prematurely, this condition may not hold, leading to a crash when the vclock of a new xlog/snap is inserted into the corresponding xdir. This patch fixes this issue by restoring pre eae84efb behavior: now we create a new relay for FINAL JOIN instead of reusing the one attached to the joined replica so that relay_send_row() can detect JOIN phase and relay all rows in this case. It also adds a comment so that we don't make such a mistake in future. Apart from fixing the issue, this patch also fixes a relay leak in relay_initial_join() in case engine_join_xc() fails, which was also introduced by the above mentioned commit. A note about xlog/panic_on_broken_lsn test. Now the relay status isn't reported by box.info.replication if FINAL JOIN failed and the replica never subscribed (this is how it worked before commit eae84efb) so we need to tweak the test a bit to handle this. Closes #3740
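The filtering rule restored by this patch can be expressed as a small predicate. The sketch below is an illustrative Python model of the condition from relay_send_row(), not the C implementation: `should_relay` is a hypothetical name, `relay_replica_id=None` stands in for `relay->replica == NULL` (the JOIN case), and the vclock is modelled as a plain dict.

```python
def should_relay(row_replica_id, row_lsn, relay_replica_id, vclock_at_subscribe):
    """Decide whether a row must be sent to the peer.

    relay_replica_id is None while the relay serves JOIN, i.e. no replica
    is attached to it yet; in that case every row is relayed.
    """
    if relay_replica_id is None:
        return True
    # Rows that originate from other instances are always relayed.
    if row_replica_id != relay_replica_id:
        return True
    # Rows that originate from the subscriber itself are relayed only if the
    # master already had them at the moment it received SUBSCRIBE.
    return row_lsn <= vclock_at_subscribe.get(row_replica_id, 0)
```

Without the first branch, and with the vclock component left at 0 on JOIN, the last comparison is false for every row of the joined replica, which is the behavior this commit fixes.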
-
- Oct 12, 2018
-
-
Vladimir Davydov authored
If the rate at which transactions are ready to write to the database is greater than the dump bandwidth, memory will get depleted before the previously scheduled dump is complete and all newer transactions will have to wait, which may take seconds or even minutes:

    W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec

This patch set implements basic transaction throttling that is supposed to help avoid unpredictably long stalls. Now the transaction write rate is always capped by the observed dump bandwidth, because it doesn't make sense to consume memory at a greater rate than it can be freed. On top of that, when a dump begins, we estimate the amount of time it is going to take and limit the transaction write rate accordingly.

Note, this patch doesn't take compaction into account when setting the rate limit, so compaction threads may still fail to keep up with dumps, increasing the read amplification. This will be addressed later.

Closes #1862
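One way to read the two throttling rules is sketched below in Python. The exact formula used by vinyl is not given in this message, so treat the second rule as an assumption: here the remaining memory is simply spread over the estimated dump time. The names `write_rate_limit`, `memory_left` and `dump_size` are hypothetical.

```python
def write_rate_limit(dump_bandwidth, dump_in_progress=False,
                     memory_left=0, dump_size=0):
    """Return the maximum transaction write rate in bytes per second."""
    # Rule 1: never consume memory faster than a dump can free it.
    limit = dump_bandwidth
    if dump_in_progress and dump_size > 0:
        # Rule 2 (assumed form): the memory that is still free must last
        # until the running dump is expected to complete.
        estimated_dump_time = dump_size / dump_bandwidth
        limit = min(limit, memory_left / estimated_dump_time)
    return limit
```

For example, with a bandwidth of 100 bytes/s, a 200-byte dump in progress and 50 bytes of free memory, this model allows writers 25 bytes/s until the dump completes.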
-
Vladimir Davydov authored
When the format of a space is altered, we walk over all tuples stored in the primary index and check them against the new format. This doesn't guarantee that all *statements* stored in the primary index conform to the new format though, because the check isn't performed for deleted or overwritten statements, e.g.

    s = box.schema.space.create('test', {engine = 'vinyl'})
    s:create_index('primary')
    s:insert{1}
    box.snapshot()
    s:delete{1}
    -- The following command will succeed, because the space is empty,
    -- however one of the runs contains REPLACE{1}, which doesn't conform
    -- to the new format.
    s:create_index('secondary', {parts = {2, 'unsigned'}})

This is OK as we will never return such overwritten statements to the user, however we may still need to read them. Currently, this leads either to an assertion failure or to a read error in

    vy_stmt_decode
     vy_stmt_new_with_ops
      tuple_init_field_map

We could probably force major compaction of the primary index to purge such statements, but it is complicated as there may be a read view preventing the write iterator from squashing such a statement, and currently there's no way to force destruction of a read view.

So this patch simply disables format validation for all tuples loaded from disk (actually we already skip format validation for all secondary index statements and for DELETE statements in primary indexes so this isn't as bad as it may seem). To do that, it adds a boolean parameter to tuple_init_field_map() that disables format validation, and then makes vy_stmt_new_with_ops(), which is used for constructing vinyl statements, set it to false. This is OK as all statements inserted into a vinyl space are validated explicitly with tuple_validate() anyway.

This is rather a workaround for the lack of a better solution.

Closes #3540
-
Vladimir Davydov authored
For some reason this test uses 555 for space id, which may be taken by a previously created space:

    Test failed! Result content mismatch:
    --- box/sql.result Fri Oct 5 17:23:25 2018
    +++ box/sql.reject Fri Oct 12 19:38:51 2018
    @@ -12,12 +12,14 @@
     ...
     _ = box.schema.space.create('test1', { id = 555 })
     ---
    +- error: Duplicate key exists in unique index 'primary' in space '_space'
     ...

    Reproduce file:
    ---
    - [box/rtree_point.test.lua, null]
    - [box/transaction.test.lua, null]
    - [box/tree_pk.test.lua, null]
    - [box/access.test.lua, null]
    - [box/cfg.test.lua, null]
    - [box/admin.test.lua, null]
    - [box/lua.test.lua, null]
    - [box/bitset.test.lua, null]
    - [box/role.test.lua, null]
    - [box/sql.test.lua, null]
    ...

Remove { id = 555 } to make sure it never happens.
-
- Oct 10, 2018
-
-
Georgy Kirichenko authored
socket_writable/socket_readable now handle spurious wakeups of socket.iowait: they keep waiting until the event occurs or the timeout expires. Closes #3344
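The usual way to tolerate spurious wakeups is to loop against a deadline. The Python sketch below illustrates that pattern only; it is not the Lua socket module code. `wait_event` and the `iowait` callable (returning a true value once the event has occurred and a false value on a spurious wakeup or timeout) are hypothetical stand-ins.

```python
import time


def wait_event(iowait, timeout, clock=time.monotonic):
    """Wait until iowait reports the event or the timeout expires.

    iowait(remaining) may return early without the event (spurious wakeup);
    in that case we wait again, but only for the time that is left.
    """
    deadline = clock() + timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # timed out
        if iowait(remaining):
            return True   # the event occurred
```

Recomputing the remaining time on every iteration keeps the total wait bounded by the original timeout, no matter how many spurious wakeups occur.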
-
Vladimir Davydov authored
A deferred DELETE may be generated after a newer statement for the same key was inserted into a secondary index and hence land in a newer run. Since the read iterator assumes that newer sources always contain newer statements for the same key, we mark all deferred DELETE statements with VY_STMT_SKIP_READ flag, which makes run/mem iterators ignore them. The flag must be persisted when a statement is written to disk, but it is not. Fix this. Fixes commit 504bc805 ("vinyl: do not store meta in secondary index runs").
-
Alexander Turenko authored
The failure is known and should not have any influence on our CI results. The test should be re-enabled once #3558 is fixed.
-
- Oct 06, 2018
-
-
Vladimir Davydov authored
box.cfg{snap_io_rate_limit = 0} means that the limit is maxed out, hence we must set the dump bandwidth estimate to the default value. Instead we set it to 0, which may result in invalid transaction throttling. Fix this. Fixes commit b646fbd9 ("vinyl: use snap_io_rate_limit for initial dump bandwidth estimate").
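The intended mapping from the option to the initial estimate can be sketched as follows. This is a hypothetical Python model: the constant `DEFAULT_DUMP_BANDWIDTH` and its value are placeholders rather than the value used by vinyl, and the option is assumed to be given in megabytes per second.

```python
# Placeholder default in bytes per second; the real default is not stated
# in this commit message.
DEFAULT_DUMP_BANDWIDTH = 10 * 1024 * 1024


def initial_dump_bandwidth(snap_io_rate_limit):
    """Map snap_io_rate_limit (assumed MB/s, 0 = unlimited) to bytes/s."""
    if snap_io_rate_limit > 0:
        return snap_io_rate_limit * 1024 * 1024
    # 0 means "no limit", so fall back to the default estimate instead of
    # propagating a zero bandwidth into the throttling code.
    return DEFAULT_DUMP_BANDWIDTH
```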
-
- Oct 05, 2018
-
-
Vladimir Davydov authored
Before joining a new replica we register a gc_consumer to prevent garbage collection of files needed for join and following subscribe. Before commit 9c5d851d ("replication: remove old snapshot files not needed by replicas") a consumer would pin both checkpoints and WALs so that would work as expected. However, the above mentioned commit introduced consumer types and marked a consumer registered on replica join as WAL-only so if the garbage collector was invoked during join, it could delete files corresponding to the relayed checkpoint resulting in replica join failure. Fix this issue by pinning the checkpoint used for joining a replica with gc_ref_checkpoint and unpinning once join is complete. The issue can only be reproduced if there are vinyl spaces, because deletion of an open snap file doesn't prevent the relay from reading it. The existing replication/gc test would catch the issue if it triggered compaction on the master so we simply tweak it accordingly instead of adding a new test case. Closes #3708
-
Vladimir Davydov authored
If an instance is restarted while building a new vinyl index, there will probably be some run files left. Currently, we won't delete such files until box.snapshot() is called, even though there's no point in keeping them around. Let's tweak vy_gc_lsm() so that it marks all runs that belong to an unfinished index as incomplete to force vy_gc() to remove them immediately after recovery is complete. This also removes files left from a failed rebootstrap attempt so we can remove a call to box.snapshot() from vinyl/replica_rejoin.test.lua.
-
Vladimir Davydov authored
This patch fixes a trivial error on vy_send_range() error path which results in a master crash in case a file needed to join a replica is missing or corrupted. See #3708
-
- Oct 03, 2018
-
-
Vladislav Shpilevoy authored
Closes #3709
-
Olga Arkhangelskaia authored
This patch fixes the behavior when a replica tries to connect to the same master more than once. If this happens during initial configuration, we raise an exception. If it is not the initial configuration, we print the error and disconnect the applier. @locker: minor test cleanup. Closes #3610
-
Vladimir Davydov authored
They are only used to set corresponding members of vy_quota, vy_run_env, and vy_scheduler when vy_env is created. No point in keeping them around all the time.
-
Vladimir Davydov authored
Turned out that throttling isn't going to be as simple as maintaining the write rate below the estimated dump bandwidth, because we also need to take into account whether compaction keeps up with dumps. Tracking compaction progress isn't a trivial task and mixing it in a module responsible for resource limiting, which vy_quota is, doesn't seem to be a good idea. Let's factor out the related code into a separate module and call it vy_regulator. Currently, the new module only keeps track of the write rate and the dump bandwidth and sets the memory watermark accordingly, but soon we will extend it to configure throttling as well. Since write rate and dump bandwidth are now a part of the regulator subsystem, this patch renames 'quota' entry of box.stat.vinyl() to 'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether, because memory usage is reported under 'memory.level0' while the limit can be read from box.cfg.vinyl_memory, and renames 'use_rate' to 'write_rate', because the latter seems to be a more appropriate name. Needed for #1862
-
- Sep 26, 2018
-
-
Vladimir Davydov authored
When replication is restarted with the same replica set configuration (i.e. box.cfg{replication = box.cfg.replication}), there's a chance that an old relay will be still running on the master at the time when a new applier tries to subscribe. In this case the applier will get an error:

    main/152/applier/localhost:62649 I> can't join/subscribe
    main/152/applier/localhost:62649 xrow.c:891 E> ER_CFG: Incorrect value for option 'replication': duplicate connection with the same replica UUID

Such an error won't stop the applier - it will keep trying to reconnect:

    main/152/applier/localhost:62649 I> will retry every 1.00 second

However, it will stop synchronization so that box.cfg() will return without an error, but leave the replica in the orphan mode:

    main/151/console/::1:42606 C> failed to synchronize with 1 out of 1 replicas
    main/151/console/::1:42606 C> entering orphan mode
    main/151/console/::1:42606 I> set 'replication' configuration option to "localhost:62649"

In a second, the stray relay on the master will probably exit and the applier will manage to subscribe so that the replica will leave the orphan mode:

    main/152/applier/localhost:62649 C> leaving orphan mode

This is very annoying, because there's no need to enter the orphan mode in this case - we could as well keep trying to synchronize until the applier finally succeeds in subscribing or replication_sync_timeout is triggered.

So this patch makes appliers enter "loading" state on configuration errors, the same state they enter if they detect that bootstrap hasn't finished yet. This guarantees that configuration errors, like the one above, won't break synchronization and leave the user gaping at the unprovoked orphan mode.

Apart from the issue in question (#3636), this patch also fixes spurious replication-py/multi test failures that happened for exactly the same reason (#3692).

Closes #3636
Closes #3692
-
- Sep 25, 2018
-
-
Serge Petrenko authored
In some cases no-ops are written to xlog. They have no effect but are needed to bump lsn. Some time ago (see commit 89e5b784) such ops were made bodiless, and empty body requests are not handled in xrow_header_decode(). This leads to recovery errors in a special case: when we have a multi-statement transaction containing no-ops written to xlog, upon recovering from such an xlog, all data from the end of the no-op to the start of the next transaction will become the no-op's body, so, effectively, it will be ignored. Here's an example of `tarantoolctl cat` output showing this (BODY contains the next request's data):

    ---
    HEADER:
      lsn: 5
      replica_id: 1
      type: NOP
      timestamp: 1536656270.5092
    BODY:
      type: 3
      timestamp: 1536656270.5092
      lsn: 6
      replica_id: 1
    ---
    HEADER:
      type: 0
    ...

This patch handles no-ops correctly in xrow_header_decode().

@locker: refactored the test case so as not to restart the server for a second time.

Closes #3678
-
Serge Petrenko authored
If space.before_replace returns the old tuple, the operation turns into no-op, but is still written to WAL as IPROTO_NOP for the sake of replication. Such a request doesn't have a body, and tarantoolctl failed to parse such requests in `tarantoolctl cat` and `tarantoolctl play`. Fix this by checking whether a request has a body. Also skip such requests in `play`, since they have no effect, and, while we're at it, make sure `play` and `cat` do not read excess rows with lsn>=to in case these rows are skipped. Closes #3675
-
- Sep 22, 2018
-
-
Vladimir Davydov authored
There are a few tests that create files in the system tmp directory and don't delete them. This is contemptible - tests shouldn't leave any traces on the host. Fix those tests. Closes #3688
-
Vladimir Davydov authored
Due to a missing privilege revocation in box/errinj, box/access_sysview fails if executed after it. Fixes commit af6b554b ("test: remove universal grants from tests").
-
- Sep 21, 2018
-
-
Sergei Voronezhskii authored
Until the bug in #3420 is fixed
-
- Sep 20, 2018
-
-
Alexander Turenko authored
The problem is that clang does not support the -Wno-cast-function-type flag. It is a regression from 8c538963. Follow up of #3685. Fixes #3701.
-
Serge Petrenko authored
This patch rewrites all tests to grant only necessary privileges, not privileges to universe. This was made possible by bugfixes in access control, patches #3516, #3574, #3524, #3530. Follow-up #3530
-
- Sep 19, 2018
-
-
Vladimir Davydov authored
This patch adds some essential disk statistics that are already collected and reported on a per-index basis to box.stat.vinyl(). The new statistics are shown under the 'disk' section and currently include the following fields:

  - data: size of data stored on disk.
  - index: size of index stored on disk.
  - dump.in: size of dump input.
  - dump.out: size of dump output.
  - compact.in: size of compaction input.
  - compact.out: size of compaction output.
  - compact.queue: size of compaction queue.

All the counters are given in bytes without taking into account disk compression. Dump/compaction in/out counters can be reset with box.stat.reset().
-
Vladimir Davydov authored
Currently, there's no way to figure out whether compaction keeps up with dumps or not while this is essential for implementing transaction throttling. This patch adds a metric that is supposed to help answer this question. This is the compaction queue size. It is calculated per range and per LSM tree as the total size of slices awaiting compaction. We update the metric along with the compaction priority of a range, in vy_range_update_compact_priority(), and account it to an LSM tree in vy_lsm_acct_range(). For now, the new metric is reported only on per index basis, in index.stat() under disk.compact.queue.
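The accounting described here amounts to two nested sums. The following Python sketch is a simplified model under stated assumptions: each range is represented by a list of slice sizes in bytes ordered newest first, and `compact_priority` is taken to be the number of leading slices scheduled for compaction. The function names are hypothetical and do not mirror vy_range_update_compact_priority() or vy_lsm_acct_range().

```python
def range_compact_queue(slice_sizes, compact_priority):
    """Total size of the slices of one range that await compaction."""
    return sum(slice_sizes[:compact_priority])


def lsm_compact_queue(ranges):
    """Sum the per-range queue sizes over all ranges of an LSM tree.

    ranges -- iterable of (slice_sizes, compact_priority) pairs
    """
    return sum(range_compact_queue(sizes, prio) for sizes, prio in ranges)
```

A queue that keeps growing over time would indicate that compaction is not keeping up with dumps, which is the question this metric is meant to answer.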
-
Vladimir Davydov authored
There's no reason not to report pages and bytes_compressed under disk.stat.dump.out and disk.stat.compact.{in,out} apart from using the same struct for dump and compaction statistics (vy_compact_stat). The statistics are going to differ anyway once compaction queue size is added to disk.stat.compact so let's zap struct vy_compact_stat and report as much info as we can.
-
- Sep 17, 2018
-
-
Serge Petrenko authored
If some error occurred during execution of a function called from box.session.su(), we assumed that the fiber diagnostics area was not empty, and tried to print an error message using data from the diagnostics. However, this assumption is not true when some Lua error happens. Imagine such a case:

    box.session.su('admin', function(x) return #x end, 3)

A Lua error would be pushed on the stack but the diagnostics would be empty, and we would get an assertion failure when trying to print the error message. Handle this by using lua_error() instead of luaT_error().

Closes #3659
-
- Sep 15, 2018
-
-
Alexander Turenko authored
Fixed false positive -Wimplicit-fallthrough in http_parser.c by adding a break. The code jumps anyway, so the execution flow is not changed.

Fixed false positive -Wparentheses in reflection.h by removing the parentheses. The argument 'method' of the macro 'type_foreach_method' is just the name of the loop variable and is passed to the macro for readability reasons.

Fixed false positive -Wcast-function-type triggered by reflection.h by adding -Wno-cast-function-type for sources and unit tests. We cast a pointer to a member function to another pointer to member function to store it in a structure, but we cast it back before making a call. It is legal and does not lead to undefined behaviour.

Fixes #3685.
-
- Sep 14, 2018
-
-
AKhatskevich authored
The test expected that http:get yields. However, in case of a very fast unix_socket and parallel test execution, a context switch during the call led to the absence of a yield and to an instant reply. That caused an error during `fiber:cancel`. The problem is solved by increasing the http server response time. Closes #3480
-
- Sep 13, 2018
-
-
Roman Khabibov authored
Add an ability to pass options to json.encode()/decode(). Closes: #2888. @TarantoolBot document Title: json.encode() json.decode() Add an ability to pass options to json.encode() and json.decode(). These are the same options that are used globally in json.cfg().
-
- Sep 09, 2018
-
-
Vladimir Davydov authored
box.info.memory() gives you some insight on what memory is used for, but it's very coarse. For vinyl we need finer grained global memory statistics. This patch adds such: they are reported under box.stat.vinyl().memory and consist of the following entries:

  - level0: sum size of level-0 of all LSM trees.
  - tx: size of memory used by tx write and read sets.
  - tuple_cache: size of memory occupied by tuple cache.
  - page_index: size of memory used for storing page indexes.
  - bloom_filter: size of memory used for storing bloom filters.

It also removes box.stat.vinyl().cache, as the size of cache is now reported under memory.tuple_cache.
-
Vladimir Davydov authored
Since commit 0c5e6cc8 ("vinyl: store full tuples in secondary index cache"), we store primary index tuples in secondary index cache, but we still account them as separate tuples. Fix that. Follow-up #3478 Closes #3655
-
Vladimir Davydov authored
Any LSM-based database design implies a high level of write amplification, so there should be more compaction threads than dump threads. With the current default of 2 for box.cfg.vinyl_write_threads, we start only one compaction thread. Let's increase the default to 4 so that three compaction threads are started by default, which better fits an LSM-based design.
-
- Sep 04, 2018
-
-
Vladimir Davydov authored
Now box.cfg() doesn't return until 'quorum' appliers are in sync not only on initial configuration, but also on replication configuration update. If it fails to synchronize within replication_sync_timeout, box.cfg() returns without an error, but the instance enters 'orphan' state, which is basically read-only mode. In the meantime, appliers will keep trying to synchronize in the background, and the instance will leave 'orphan' state as soon as enough appliers are in sync.

Note, this patch also changes logging a bit:

  - 'ready to accept request' is printed on startup before syncing with the replica set, because although the instance is read-only at that time, it can indeed accept all sorts of ro requests.
  - For 'connecting', 'connected', 'synchronizing' messages, we now use 'info' logging level, not 'verbose' as they used to be, because those messages are important as they give the admin an idea of what's going on with the instance, and they can't flood logs.
  - 'sync complete' message is also printed as 'info', not 'crit', because there's nothing critical about it (it's not an error).

Also note that we only enter 'orphan' state if we failed to synchronize. In particular, if the instance manages to synchronize with all replicas within the timeout, it will jump from 'loading' straight into 'running', bypassing 'orphan' state. This is done for the sake of consistency between initial configuration and reconfiguration.

Closes #3427

@TarantoolBot document
Title: Sync on replication configuration update
The behavior of box.cfg() on replication configuration update is now consistent with initial configuration, that is box.cfg() will not return until it synchronizes with as many masters as specified by the replication_connect_quorum configuration option or the timeout specified by replication_sync_timeout occurs. On timeout, it will return without an error, but the instance will enter 'orphan' state. It will leave 'orphan' state as soon as enough appliers have synced.
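The state transitions described in this commit can be summarised with a small decision function. This is an illustrative Python model, not Tarantool code: `instance_state` and its parameters are hypothetical, with `synced` counting appliers that are in sync and `quorum` standing for replication_connect_quorum.

```python
def instance_state(synced, quorum, elapsed, sync_timeout):
    """Return 'loading', 'running' or 'orphan' for the given sync progress."""
    if synced >= quorum:
        # Enough appliers are in sync: go (or come back) to normal operation.
        return "running"
    if elapsed >= sync_timeout:
        # box.cfg() returns without an error, the instance stays read-only
        # while appliers keep synchronizing in the background.
        return "orphan"
    return "loading"
```

Because the quorum check comes first, an instance that synchronizes in time goes from 'loading' straight to 'running', and an orphan leaves that state as soon as the quorum is reached.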
-
Olga Arkhangelskaia authored
In the scope of #3427 we need a timeout in case an instance waits for synchronization for too long, or even forever. The default value is 300.

Closes #3674

@locker: moved dynamic config check to box/cfg.test.lua; code cleanup

@TarantoolBot document
Title: Introduce new configuration option replication_sync_timeout
After initial bootstrap or after a replication configuration change, we need to sync up with the replication quorum. Sometimes sync can take too long, or replication_sync_lag can be smaller than the network latency, in which case the replica gets stuck in a sync loop that can't be cancelled. To avoid such situations, replication_sync_timeout can be used. When the time set in replication_sync_timeout has passed, the replica enters orphan state. The option can be set dynamically. The default value is 300 seconds.
-