- Oct 03, 2018
-
-
Vladimir Davydov authored
It turned out that throttling isn't going to be as simple as maintaining the write rate below the estimated dump bandwidth, because we also need to take into account whether compaction keeps up with dumps. Tracking compaction progress isn't a trivial task, and mixing it into vy_quota, the module responsible for resource limiting, doesn't seem like a good idea. Let's factor the related code out into a separate module and call it vy_regulator. Currently, the new module only keeps track of the write rate and the dump bandwidth and sets the memory watermark accordingly, but soon it will be extended to configure throttling as well.

Since the write rate and the dump bandwidth are now part of the regulator subsystem, this patch renames the 'quota' entry of box.stat.vinyl() to 'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether, because memory usage is reported under 'memory.level0' while the limit can be read from box.cfg.vinyl_memory, and renames 'use_rate' to 'write_rate', because the latter seems to be a more appropriate name.

Needed for #1862
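For illustration, a console sketch of where the renamed counters end up after this change; the field names follow the message above, while the numbers and the exact set of neighbouring fields are purely hypothetical:

    tarantool> box.stat.vinyl().regulator
    ---
    - write_rate: 524288        -- formerly box.stat.vinyl().quota.use_rate
      dump_bandwidth: 10485760
    ...
    tarantool> box.cfg.vinyl_memory  -- the former quota.limit is read from here
    ---
    - 134217728
    ...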
-
- Oct 02, 2018
-
-
Vladimir Davydov authored
There are three places where we start the scheduler fiber and enable the configured memory quota limit: local bootstrap, remote bootstrap, and local recovery completion. I'm planning to add more code there so let's factor it out now.
-
- Sep 26, 2018
-
-
Vladimir Davydov authored
When replication is restarted with the same replica set configuration (i.e. box.cfg{replication = box.cfg.replication}), there's a chance that an old relay will still be running on the master at the time when a new applier tries to subscribe. In this case the applier will get an error:

  main/152/applier/localhost:62649 I> can't join/subscribe
  main/152/applier/localhost:62649 xrow.c:891 E> ER_CFG: Incorrect value for option 'replication': duplicate connection with the same replica UUID

Such an error won't stop the applier - it will keep trying to reconnect:

  main/152/applier/localhost:62649 I> will retry every 1.00 second

However, it will stop synchronization so that box.cfg() will return without an error, but leave the replica in orphan mode:

  main/151/console/::1:42606 C> failed to synchronize with 1 out of 1 replicas
  main/151/console/::1:42606 C> entering orphan mode
  main/151/console/::1:42606 I> set 'replication' configuration option to "localhost:62649"

In a second, the stray relay on the master will probably exit and the applier will manage to subscribe, so the replica will leave orphan mode:

  main/152/applier/localhost:62649 C> leaving orphan mode

This is very annoying, because there's no need to enter orphan mode in this case - we could as well keep trying to synchronize until the applier finally succeeds in subscribing or replication_sync_timeout is triggered.

So this patch makes appliers enter the "loading" state on configuration errors, the same state they enter if they detect that bootstrap hasn't finished yet. This guarantees that configuration errors, like the one above, won't break synchronization and leave the user gaping at the unprovoked orphan mode.

Apart from the issue in question (#3636), this patch also fixes spurious replication-py/multi test failures that happened for exactly the same reason (#3692).

Closes #3636
Closes #3692
-
Vladimir Davydov authored
First, we print "will retry every XX second" to the log after an error message only for socket and system errors, although we keep trying to establish a replication connection after configuration errors as well. Let's print this message for those errors too to avoid confusion.

Second, in case we receive an error in reply to a SUBSCRIBE command, we log "can't read row" instead of "can't join/subscribe". This happens because we switch an applier to the SYNC/FOLLOW state before receiving a reply to the SUBSCRIBE command. Fix this by updating the applier state only after successfully subscribing.

Third, we detect duplicate connections coming from the same replica on the master only after sending a reply to the SUBSCRIBE command, that is in relay_subscribe rather than in box_process_subscribe. This results in "can't read row" being printed to the replica's log even though it's actually a SUBSCRIBE error. Fix this by moving the check where it actually belongs.
-
- Sep 25, 2018
-
-
Serge Petrenko authored
In some cases no-ops are written to xlog. They have no effect but are needed to bump lsn. Some time ago (see commit 89e5b784) such ops were made bodiless, and empty body requests are not handled in xrow_header_decode(). This leads to recovery errors in a special case: when we have a multi-statement transaction containing no-ops written to xlog, upon recovering from such an xlog, all data from the end of the no-op to the start of the next transaction becomes the no-op's body, so, effectively, it is ignored. Here's example `tarantoolctl cat` output showing this (BODY contains the next request's data):

  ---
  HEADER:
    lsn: 5
    replica_id: 1
    type: NOP
    timestamp: 1536656270.5092
  BODY:
    type: 3
    timestamp: 1536656270.5092
    lsn: 6
    replica_id: 1
  ---
  HEADER:
    type: 0
  ...

This patch handles no-ops correctly in xrow_header_decode().

@locker: refactored the test case so as not to restart the server for a second time.

Closes #3678
-
Serge Petrenko authored
If space.before_replace returns the old tuple, the operation turns into no-op, but is still written to WAL as IPROTO_NOP for the sake of replication. Such a request doesn't have a body, and tarantoolctl failed to parse such requests in `tarantoolctl cat` and `tarantoolctl play`. Fix this by checking whether a request has a body. Also skip such requests in `play`, since they have no effect, and, while we're at it, make sure `play` and `cat` do not read excess rows with lsn>=to in case these rows are skipped. Closes #3675
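A minimal Lua sketch of the scenario described above; the space name and trigger body are hypothetical, but returning the old tuple from before_replace is what produces the bodiless IPROTO_NOP row that `cat` and `play` must now handle:

    local s = box.schema.space.create('test')
    s:create_index('pk')
    -- returning the old tuple from a before_replace trigger cancels the
    -- change; a bodiless NOP row is still written to the WAL for replication
    s:before_replace(function(old, new)
        if old ~= nil then
            return old
        end
    end)
    s:insert{1, 'first'}    -- a real insert
    s:replace{1, 'second'}  -- turned into a NOP, the tuple stays {1, 'first'}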
-
- Sep 22, 2018
-
-
Vladimir Davydov authored
There are a few tests that create files in the system tmp directory and don't delete them. This is contemptible - tests shouldn't leave any traces on the host. Fix those tests. Closes #3688
-
Vladimir Davydov authored
fio.rmtree should use lstat instead of stat, otherwise it won't be able to remove a directory if there's a symbolic link pointing to a non-existent file. The test case will be added to app/fio.test.lua by the following commit, which is aimed at cleaning up the /tmp directory after running tests.
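A rough Lua reproduction of the failure mode, assuming a dangling symlink sits inside the directory being removed (the paths are made up for the example):

    local fio = require('fio')
    local dir = fio.pathjoin(fio.tempdir(), 'rmtree_test')
    fio.mktree(dir)
    -- a symlink whose target does not exist: stat() on it fails,
    -- while lstat() examines the link itself and succeeds
    fio.symlink('/no/such/target', fio.pathjoin(dir, 'dangling'))
    local ok, err = fio.rmtree(dir)  -- used to fail before this fix
    print(ok, err)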
-
Vladimir Davydov authored
Due to a missing privilege revocation in box/errinj, box/access_sysview fails if executed after it. Fixes commit af6b554b ("test: remove universal grants from tests").
-
Vladimir Davydov authored
Closes #3311
-
Vladimir Davydov authored
Currently, there are two ways of creating a new key definition object apart from copying (key_def_dup): either use key_def_new_with_parts, which takes definitions of all key parts and returns a ready-to-use key_def, or allocate an empty key_def with key_def_new and then fill it up with key_def_set_part. The latter method is rather awkward: because of its existence key_def_set_part has to detect if all parts have been set and initialize comparators if so. It is only used in schema_init, which could as well use key_def_new_with_parts without making the code any more difficult to read than it is now. That being said, let us:

- Make schema_init use key_def_new_with_parts.
- Delete key_def_new and bequeath its name to key_def_new_with_parts.
- Simplify key_def_set_part: now it only initializes the given part while comparators are set by the caller once all parts have been set.

These changes should also make it easier to add a JSON path to key_part.
-
- Sep 21, 2018
-
-
Vladimir Davydov authored
This reverts commit ea3a2b5f. Once we finally implement json path indexes, more fields that are calculated at run time will have to be added to struct key_part, like path hash or field offset. So it was actually a mistake to remove the key_part_def struct, as key_part will grow more and more different from it. Conceptually, having separate key_part_def and key_part is consistent with other structures, e.g. tuple_field and field_def. That said, let's bring key_part_def back. Sorry for the noise.
-
Sergei Voronezhskii authored
Until the bug in #3420 is fixed
-
Vladimir Davydov authored
The only difference between struct key_part_def and struct key_part is that the former stores only the id of a collation while the latter also stores a pointer to speed up tuple comparisons. It isn't worth keeping a separate struct just because of that. Let's use struct key_part everywhere and assume that key_part->coll is NULL if the part is needed solely for storing a decoded key part definition and isn't NULL if it is used for tuple comparisons (i.e. is attached to a key_def).
-
Kirill Shcherbatov authored
Start using the tuple_field_by_part(_raw) routine in the *extract, *compare and *hash functions. This new function uses a key_part to retrieve the field data the part refers to. For now it is just a wrapper around tuple_field_raw, but with the introduction of JSON paths it will work differently. Needed for #1012
-
Kirill Shcherbatov authored
To introduce JSON indexes we need a changeable key_def containing key_part definitions that, in the following patches, will store a JSON path, an offset slot and a slot epoch. Needed for #1012
-
Kirill Yukhin authored
-
- Sep 20, 2018
-
-
Alexander Turenko authored
The problem is that clang does not support the -Wno-cast-function-type flag. It is a regression from 8c538963. Follow up of #3685. Fixes #3701.
-
Serge Petrenko authored
This patch rewrites all tests to grant only necessary privileges, not privileges to universe. This was made possible by bugfixes in access control, patches #3516, #3574, #3524, #3530. Follow-up #3530
-
Kirill Yukhin authored
-
- Sep 19, 2018
-
-
Vladimir Davydov authored
This patch adds some essential disk statistics that are already collected and reported on a per-index basis to box.stat.vinyl(). The new statistics are shown under the 'disk' section and currently include the following fields:

- data: size of data stored on disk.
- index: size of index stored on disk.
- dump.in: size of dump input.
- dump.out: size of dump output.
- compact.in: size of compaction input.
- compact.out: size of compaction output.
- compact.queue: size of compaction queue.

All the counters are given in bytes without taking into account disk compression. Dump/compaction in/out counters can be reset with box.stat.reset().
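For reference, a console sketch of how the new section might look; the field names are the ones listed above, while the values and exact nesting are illustrative only:

    tarantool> box.stat.vinyl().disk
    ---
    - data: 36871
      index: 4821
      dump:
        in: 0
        out: 0
      compact:
        in: 0
        out: 0
        queue: 0
    ...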
-
Vladimir Davydov authored
So that we can easily extend them to account the stats not only per LSM tree, but also globally, in vy_lsm_env.
-
Vladimir Davydov authored
Currently, there's no way to figure out whether compaction keeps up with dumps, while this is essential for implementing transaction throttling. This patch adds a metric that is supposed to help answer this question: the compaction queue size. It is calculated per range and per LSM tree as the total size of slices awaiting compaction. We update the metric along with the compaction priority of a range, in vy_range_update_compact_priority(), and account it to an LSM tree in vy_lsm_acct_range(). For now, the new metric is reported only on a per-index basis, in index.stat() under disk.compact.queue.
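A hypothetical look at the per-index metric (the space and index names are made up, and the value shown is a plain illustrative byte count, since the message doesn't spell out the exact output shape):

    tarantool> box.space.test.index.pk:stat().disk.compact.queue
    ---
    - 0
    ...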
-
Vladimir Davydov authored
Currently, we call memset() on vy_stmt_counter and vy_disk_stmt_counter directly, but that looks rather ugly, especially when a counter has a long name. Let's introduce helper functions for that.
-
Vladimir Davydov authored
There's no reason not to report pages and bytes_compressed under disk.stat.dump.out and disk.stat.compact.{in,out} apart from using the same struct for dump and compaction statistics (vy_compact_stat). The statistics are going to differ anyway once compaction queue size is added to disk.stat.compact so let's zap struct vy_compact_stat and report as much info as we can.
-
Vladimir Davydov authored
The code is difficult to follow when there are nested info tables, because info_table_end() doesn't refer to the table name. Let's annotate info_table_end() with a comment to make it easier to follow. No functional changes.
-
Vladimir Davydov authored
When a few ranges are coalesced, we "force" compaction of the resulting range by raising its compaction priority to max (slice count). There's actually no point in that, because as long as the shape of the resulting LSM tree is OK, we don't need to do extra compaction work. Moreover, it actually doesn't work if a new slice is added to the resulting range by dump before it gets compacted, which is fairly likely, because then its compaction priority will be recalculated as usual. So let's simply call vy_range_update_compact_priority() for the resulting range.

When a range is split, the produced ranges inherit its compaction priority. This is actually incorrect, because range split may change the shape of the tree, so let's recalculate the priority for each part the usual way, i.e. by calling vy_range_update_compact_priority().

After this patch, there's only one place where we update the compaction priority of a range - vy_range_update_compact_priority().
-
Vladimir Davydov authored
This patch addresses a few problems index.compact() is suffering from, namely:

- When a range is split or coalesced, it should inherit the value of the needs_compaction flag from the source ranges. Currently, the flag is cleared, so the resulting range may not be compacted.
- If a range has no slices, we shouldn't set the needs_compaction flag for it, because obviously it can't be compacted, but we do.
- The needs_compaction flag should be cleared as soon as we schedule a range for compaction, not when all slices have been compacted into one, as we presently expect, because the latter may never happen under a write-intensive load.
-
- Sep 17, 2018
-
-
Serge Petrenko authored
If some error occurred during execution of a function called from box.session.su(), we assumed that the fiber diagnostics area was not empty and tried to print an error message using data from the diagnostics. However, this assumption does not hold when a Lua error happens. Imagine such a case:

  box.session.su('admin', function(x) return #x end, 3)

A Lua error would be pushed on the stack but the diagnostics would be empty, and we would get an assertion failure when trying to print the error message. Handle this by using lua_error() instead of luaT_error().

Closes #3659
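The failing call from the message, wrapped in pcall for illustration; before the fix this tripped an assertion, after it the plain Lua error ('attempt to get length of a number') is reported normally (the exact error text may differ):

    -- #x raises a plain Lua error, so nothing is set in the fiber diagnostics
    local ok, err = pcall(box.session.su, 'admin', function(x) return #x end, 3)
    print(ok, err)  -- false, <lua error message>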
-
- Sep 15, 2018
-
-
Alexander Turenko authored
Fixed a false positive -Wimplicit-fallthrough in http_parser.c by adding a break. The code jumps anyway, so the execution flow is not changed.

Fixed a false positive -Wparenthesis in reflection.h by removing the parentheses. The argument 'method' of the macro 'type_foreach_method' is just the name of the loop variable and is passed to the macro for readability reasons.

Fixed a false positive -Wcast-function-type triggered by reflection.h by adding -Wno-cast-function-type for sources and unit tests. We cast a pointer to a member function to another pointer to member function to store it in a structure, but we cast it back before making the call. It is legal and does not lead to undefined behaviour.

Fixes #3685.
-
- Sep 14, 2018
-
-
AKhatskevich authored
The test expected that http:get yields; however, in the case of a very fast unix_socket and parallel test execution, a context switch during the call led to an instant reply without a yield. That caused an error during `fiber:cancel`. The problem is solved by increasing the http server response time. Closes #3480
-
- Sep 13, 2018
-
-
Roman Khabibov authored
Add an ability to pass options to json.encode()/decode().

Closes: #2888

@TarantoolBot document
Title: json.encode() json.decode()
Add an ability to pass options to json.encode() and json.decode(). These are the same options that are used globally in json.cfg().
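A small usage sketch; per the message above, the per-call options are the same ones json.cfg() accepts globally, and the particular options picked here are just examples:

    local json = require('json')
    -- limit floating point precision for this call only
    json.encode({pi = math.pi}, {encode_number_precision = 4})
    -- allow inf/nan literals for this call only
    json.decode('{"x": inf}', {decode_invalid_numbers = true})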
-
- Sep 10, 2018
-
-
Kirill Yukhin authored
Since adding -fopenmp to the compiler flags also means adding -lgomp at the link stage, pass -fno-openmp to the linking stage in case of a static build. In that case the OMP functions are statically linked into libmisc. Also, emit an error if trying to perform a static build using clang.
-
- Sep 09, 2018
-
-
Vladimir Davydov authored
box.info.memory() gives you some insight into what memory is used for, but it's very coarse. For vinyl we need finer-grained global memory statistics. This patch adds them: they are reported under box.stat.vinyl().memory and consist of the following entries:

- level0: sum size of level-0 of all LSM trees.
- tx: size of memory used by tx write and read sets.
- tuple_cache: size of memory occupied by the tuple cache.
- page_index: size of memory used for storing page indexes.
- bloom_filter: size of memory used for storing bloom filters.

It also removes box.stat.vinyl().cache, as the size of the cache is now reported under memory.tuple_cache.
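An illustrative console view of the new section; the entry names are taken from the list above, the values are made up:

    tarantool> box.stat.vinyl().memory
    ---
    - level0: 262144
      tx: 0
      tuple_cache: 14337
      page_index: 1050
      bloom_filter: 1917
    ...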
-
Vladimir Davydov authored
Since commit 0c5e6cc8 ("vinyl: store full tuples in secondary index cache"), we store primary index tuples in secondary index cache, but we still account them as separate tuples. Fix that. Follow-up #3478 Closes #3655
-
Vladimir Davydov authored
Any LSM-based database design implies a high level of write amplification, so there should be more compaction threads than dump threads. With the default value of 2 for box.cfg.vinyl_write_threads, which we have now, we start only one compaction thread. Let's increase the default up to 4 so that there are three compaction threads started by default, because it better fits the LSM-based design.
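In config terms, a sketch of what the new default amounts to (the option is not dynamic, so it takes effect at instance startup; the 1+3 split follows the message above):

    -- with the new default, plain box.cfg{} behaves roughly like:
    box.cfg{vinyl_write_threads = 4}  -- about 1 dump thread + 3 compaction threads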
-
Vladimir Davydov authored
We must not schedule any background jobs during local recovery, because they may disrupt yet-to-be-recovered data stored on disk. Since we start the scheduler fiber as soon as the engine is initialized, we have to pull some tricks to make sure it doesn't schedule any tasks: the scheduler fiber function yields immediately upon startup; we assume that it won't be woken up until local recovery is complete, because we don't set the memory limit until then.

This looks rather flimsy, because the logic is spread among several seemingly unrelated functions: the scheduler fiber (vy_scheduler_f), the quota watermark callback (vy_env_quota_exceeded_cb), and the engine recovery callback (vinyl_engine_begin_initial_recovery), where we leave the memory limit unset until recovery is complete. The latter isn't even mentioned in comments, which makes the code difficult to follow. Think how everything would fall apart should we try to wake up the scheduler fiber somewhere else for some reason.

This patch attempts to make the code more straightforward by postponing startup of the scheduler fiber until recovery completion. It also moves the comment explaining why we can't schedule tasks during local recovery from vy_env_quota_exceeded_cb to vinyl_engine_begin_initial_recovery, because this is where we actually omit the scheduler fiber startup.

Note, since the scheduler fiber now goes straight to business once started, we can't start worker threads in the fiber function as we used to, because then worker threads would be running even if vinyl was unused. So we move this code to vy_worker_pool_get, which is called when a worker is actually needed to run a task.
-
Vladimir Davydov authored
It is not used anywhere anymore.
-
Vladimir Davydov authored
Using the same thread pool for both dump and compaction tasks makes estimation of dump bandwidth unstable. For instance, if we have four worker threads, then the observed dump bandwidth may vary from X, if there's high compaction demand and all worker threads tend to be busy with compaction tasks, to 4 * X, if there's no compaction demand. As a result, we can overestimate the dump bandwidth and trigger dump when it's too late, which will result in hitting the limit before dump is complete and hence stalling write transactions, which is unacceptable.

To avoid that, let's separate the thread pools used for dump and compaction tasks. Since LSM tree based design typically implies high levels of write amplification, let's allocate 1/4th of all threads for dump tasks and use the rest exclusively for compaction.
-
Vladimir Davydov authored
Call vy_worker_pool_get() from vy_scheduler_peek_{dump,compaction} so that we can use different worker pools for dump and compaction tasks.
-