- Sep 09, 2018
-
-
Vladimir Davydov authored
A worker pool is an independent entity that provides the scheduler with worker threads on demand. Let's factor it out so that we can introduce separate pools for dump and compaction tasks.
-
Vladimir Davydov authored
Background tasks are allocated infrequently, not more often than once per several seconds, so using mempool for them is unnecessary and only clutters vy_scheduler struct. Let's allocate them with malloc().
-
Vladimir Davydov authored
Needed solely to improve code readability. No functional changes.
-
- Sep 06, 2018
-
-
Konstantin Osipov authored
-
Georgy Kirichenko authored
Add the possibility to build tarantool with bundled library dependencies. Use the flag -DBUILD_STATIC=ON to build statically against curl, readline, ncurses, icu and z. Use the flag -DOPENSSL_USE_STATIC_LIBS=ON to build with static openssl.
Changes:
* Add FindOpenSSL.cmake because some distributions do not support the use of openssl static libraries.
* Find libssl before curl because of the build dependency.
* Catch the API of all bundled libraries and export it in case of a static build.
* Rename crc32 internal functions to avoid a name clash with linked libraries.
Notes:
* Bundled libyaml is not properly exported, use the system one.
* A Dockerfile for building statically with docker is included.
Fixes #3445
-
- Sep 04, 2018
-
-
Vladimir Davydov authored
-
Vladimir Davydov authored
Now box.cfg() doesn't return until 'quorum' appliers are in sync not only on initial configuration, but also on replication configuration update. If it fails to synchronize within replication_sync_timeout, box.cfg() returns without an error, but the instance enters 'orphan' state, which is basically read-only mode. In the meantime, appliers will keep trying to synchronize in the background, and the instance will leave 'orphan' state as soon as enough appliers are in sync.
Note, this patch also changes logging a bit:
- 'ready to accept requests' is printed on startup before syncing with the replica set, because although the instance is read-only at that time, it can indeed accept all sorts of ro requests.
- For the 'connecting', 'connected', 'synchronizing' messages, we now use the 'info' logging level, not 'verbose' as they used to be, because those messages are important - they give the admin an idea of what's going on with the instance - and they can't flood logs.
- The 'sync complete' message is also printed as 'info', not 'crit', because there's nothing critical about it (it's not an error).
Also note that we only enter 'orphan' state if we failed to synchronize. In particular, if the instance manages to synchronize with all replicas within the timeout, it will jump from 'loading' straight into 'running', bypassing 'orphan' state. This is done for the sake of consistency between initial configuration and reconfiguration.
Closes #3427
@TarantoolBot document
Title: Sync on replication configuration update
The behavior of box.cfg() on replication configuration update is now consistent with initial configuration: box.cfg() will not return until it synchronizes with as many masters as specified by the replication_connect_quorum configuration option, or until the timeout specified by replication_sync_timeout expires. On timeout, it returns without an error, but the instance enters 'orphan' state. It will leave 'orphan' state as soon as enough appliers have synced.
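A minimal Lua sketch of the documented behaviour (URIs, credentials and values are illustrative):

```lua
box.cfg{
    replication = {'replicator:password@192.168.0.101:3301',
                   'replicator:password@192.168.0.102:3301'},
    replication_connect_quorum = 2,
    replication_sync_timeout = 300,
}
-- box.cfg() has returned: either the instance synced with the quorum,
-- or the timeout expired and it entered the read-only 'orphan' state.
print(box.info.status)  -- 'running' or 'orphan'
print(box.info.ro)      -- true while the instance is an orphan
```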
-
Olga Arkhangelskaia authored
In the scope of #3427 we need a timeout in case an instance waits for synchronization for too long, or even forever. The default value is 300.
Closes #3674
@locker: moved dynamic config check to box/cfg.test.lua; code cleanup
@TarantoolBot document
Title: Introduce new configuration option replication_sync_timeout
After the initial bootstrap or a replication configuration change we need to sync up with the replication quorum. Sometimes the sync can take too long, or replication_sync_lag can be smaller than the network latency, in which case the replica would be stuck in the sync loop forever. To avoid such situations replication_sync_timeout can be used: once the time set in replication_sync_timeout has passed, the replica enters orphan state. The option can be set dynamically. The default value is 300 seconds.
-
Olga Arkhangelskaia authored
In #3427 replication_sync_lag should be taken into account during replication reconfiguration. In order to configure replication properly, this parameter is made dynamic and can be changed on demand.
@locker: moved dynamic config check to box/cfg.test.lua
@TarantoolBot document
Title: replication_sync_lag option can be set dynamically
box.cfg.replication_sync_lag can now be set at any time.
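Both options from this and the previous patch are now dynamic, for example:

```lua
-- No restart required; the new values take effect immediately.
box.cfg{replication_sync_lag = 0.5}      -- acceptable lag, in seconds
box.cfg{replication_sync_timeout = 120}  -- give up syncing after 2 minutes
```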
-
- Sep 03, 2018
-
-
Konstantin Osipov authored
Ensure box.ctl.wait_ro() and box.ctl.wait_rw() produce meaningful results even when invoked before box.cfg{}: wait for box.cfg{} to complete and the server to enter the right state. Add a test case. In scope of gh-3159
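A small sketch of the new behaviour (assuming a default read-write configuration):

```lua
local fiber = require('fiber')

-- Calling box.ctl.wait_rw() before box.cfg{} is now valid: it waits for
-- box.cfg{} to complete and for the instance to become writable.
fiber.create(function()
    box.ctl.wait_rw()
    print('instance is writable')
end)

box.cfg{}  -- the waiting fiber resumes once the server enters rw mode
```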
-
- Aug 31, 2018
-
-
Konstantin Osipov authored
-
Vladimir Davydov authored
-
- Aug 30, 2018
-
-
Konstantin Belyavskiy authored
There are two different pipes: 'tx' and 'tx_prio'. The latter does not support yield(). Rename it to avoid misunderstanding. Needed for #3397
-
Vladimir Davydov authored
The new version marks more file descriptors used by test-run internals as CLOEXEC. Needed to make replication/misc test pass (it lowers RLIMIT_NOFILE).
-
Vladimir Davydov authored
IPROTO_VOTE command (successor of IPROTO_REQUEST_VOTE) was introduced in Tarantool 1.10.1. It is sent by an applier to its master only if the master is running Tarantool 1.10.1 or newer. However, the master may be running a 1.10.1 build that predates IPROTO_VOTE and so isn't aware of it, in which case the applier will fail to connect with an ER_UNKNOWN_REQUEST_TYPE error. Let's fix this issue by ignoring ER_UNKNOWN_REQUEST_TYPE received in reply to the IPROTO_VOTE command.
-
Alexander Turenko authored
When the size parameter is not passed to socket:recv() or socket:recvfrom(), they call a) or b) on the socket to evaluate the size of the buffer needed to store the incoming datagram. Before this commit the datagram was truncated to 512 bytes in that case.
a) Linux: recv(fd, NULL, 0, MSG_TRUNC | MSG_PEEK)
b) Mac OS: getsockopt(fd, SOL_SOCKET, SO_NREAD, &val, &len)
It is recommended to set the 'size' parameter (the size of the input buffer) explicitly, based on the known message format and network conditions (say, set it below the MTU to prevent IP fragmentation, which can be inefficient), or to take it from a configuration option of a library / an application. Providing the buffer size explicitly avoids the extra syscall needed to evaluate the necessary buffer size.
When the 'size' parameter is set explicitly for recv / recvfrom on a UDP socket and the next datagram is longer than the size, the returned message is truncated to the provided size and the rest of the datagram is discarded. The tail is, of course, not discarded in the case of a TCP socket and remains available to the next recv / recvfrom call.
Fixes #3619.
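A short usage sketch (the port and buffer size are arbitrary):

```lua
local socket = require('socket')

local s = socket('AF_INET', 'SOCK_DGRAM', 'udp')
s:bind('127.0.0.1', 3344)

-- No size given: the buffer size is evaluated with an extra syscall
-- (MSG_TRUNC | MSG_PEEK on Linux, SO_NREAD on Mac OS).
local msg = s:recv()

-- Explicit size: no extra syscall; a longer datagram is truncated to
-- 1400 bytes and its tail is discarded.
local msg2, from = s:recvfrom(1400)
```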
-
- Aug 29, 2018
-
-
Sergei Kalashnikov authored
Aid the debugging of replication issues related to out-of-order requests: add the details of the request/tuple to the diagnostic message whenever possible. Closes #3105
-
Vladimir Davydov authored
-
Georgy Kirichenko authored
It is an error to throw an error out of a cbus message handler because it breaks cbus message delivery. In the case of replication, a thrown error prevented iproto from closing the replication socket. Closes #3642
-
Vladimir Davydov authored
So that instances started by test-run don't inherit file descriptors corresponding to the logs and sockets of all running instances. Needed for testing #3642
-
- Aug 28, 2018
-
-
Vladimir Davydov authored
Use a lower bound estimate in order not to overestimate dump bandwidth. For example, if an observation of 12 MB/s falls in bucket 10 .. 15, we should use 10 MB/s to avoid stalls.
-
Vladimir Davydov authored
The value returned by histogram_percentile() is an upper bound estimate. This is fine for measuring latency, because we are interested in the worst, i.e. highest, observations, but it doesn't suit particularly well if we want to keep track of the lowest observations, as is the case with bandwidth. So this patch introduces histogram_percentile_lower(), a function that is similar to histogram_percentile(), but returns a lower bound estimate of the requested percentile.
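A toy Lua sketch of the difference between the two estimates (illustration only; the real implementation is in C and its bucket handling differs):

```lua
-- Toy bucketed histogram: one observation of 12 MB/s recorded in the
-- 10..15 MB/s bucket (bucket i covers (bounds[i-1], bounds[i]]).
local bounds = {5, 10, 15, 20}   -- MB/s
local counts = {0, 0, 1, 0}

local function percentile(pct, lower)
    local total = 0
    for _, c in ipairs(counts) do total = total + c end
    local acc = 0
    for i, c in ipairs(counts) do
        acc = acc + c
        if acc * 100 >= pct * total then
            -- upper estimate: right edge of the bucket;
            -- lower estimate: left edge of the same bucket.
            return lower and (bounds[i - 1] or 0) or bounds[i]
        end
    end
end

print(percentile(10, false)) -- 15: fine for latency, where the worst case matters
print(percentile(10, true))  -- 10: safe lower bound, used for dump bandwidth
```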
-
Vladimir Davydov authored
The user can limit dump bandwidth with box.cfg.snap_io_rate_limit to a value that is less than the current estimate. To avoid stalls caused by overestimating dump bandwidth, we must take the limit into account for the initial guess and forget all observations whenever it changes.
-
Vladimir Davydov authored
Do not add the initial guess to the histogram, because otherwise it takes more than 10 dumps to arrive at the real dump bandwidth in case the initial value is less than the real one (we use the 10th percentile).
-
Vladimir Davydov authored
We don't need to compute a percentile of dump bandwidth histogram on each invocation of quota timer callback, because it may only be updated on dump completion. Let's cache it. Currently, it isn't that important, because the timer period is set to 1 second. However, once we start using the timer for throttling, we'll have to make it run more often and so caching the dump bandwidth value will make sense.
-
Vladimir Davydov authored
The next patch will store a cached bandwidth value in vy_quota::dump_bw. Let's rename dump_bw to dump_bw_hist here so as not to clog the next patch.
-
Vladimir Davydov authored
Typically, dump bandwidth varies from 10 MB to 100 MB per second so let's use 5 MB bucket granularity in this range. Values less than 10 MB/s can also be observed, because the user can limit disk rate with box.cfg.snap_io_rate_limit so use 1 MB granularity between 1 MB and 10 MB and 100 KB granularity between 100 KB and 1 MB. A write rate greater than 100 MB/s is unlikely in practice, even on very fast disks, since dump bandwidth is also limited by CPU, so use 100 MB granularity there.
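As a rough illustration (the exact boundary array in the source may differ), the described granularity could be generated like this:

```lua
-- Illustrative sketch of the bucket boundaries described above, in
-- bytes per second; not the literal array from the source.
local KB, MB = 1024, 1024 * 1024
local buckets = {}
for v = 100 * KB, 900 * KB, 100 * KB do table.insert(buckets, v) end   -- 100 KB .. 1 MB, 100 KB step
for v = 1 * MB, 9 * MB, 1 * MB do table.insert(buckets, v) end         -- 1 MB .. 10 MB, 1 MB step
for v = 10 * MB, 95 * MB, 5 * MB do table.insert(buckets, v) end       -- 10 MB .. 100 MB, 5 MB step
for v = 100 * MB, 1000 * MB, 100 * MB do table.insert(buckets, v) end  -- above 100 MB, 100 MB step
```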
-
Vladimir Davydov authored
Currently, we wake up all fibers whenever we free some memory. This is inefficient, because it might occur that all available quota gets consumed by a few fibers while the rest will have to go back to sleep. This is also kinda unfair, because waking up all fibers breaks the order in which the fibers were put to sleep. This works now, because we free memory and wake up fibers infrequently (on dump) and there normally shouldn't be any fibers waiting for quota (if there were, the latency would rocket sky high because of the absence of any kind of throttling). However, once throttling is introduced, fibers waiting for quota will become the norm. So let's wake up fibers one by one: whenever we free memory, we wake up the first fiber in the line, which will wake up the next fiber on success, and so forth.
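A toy Lua sketch of the chained wakeup (illustration only, not the vy_quota C code):

```lua
local fiber = require('fiber')

-- Waiters line up in FIFO order; freeing memory wakes only the head of
-- the line, and each waiter that got its share wakes the next one.
local waiters = {}

local function quota_wait()
    local cond = fiber.cond()
    table.insert(waiters, cond)
    cond:wait()
    -- ...consume the freed quota here...
    -- On success, pass the baton to the next waiter instead of
    -- broadcasting to everyone.
    local next_cond = table.remove(waiters, 1)
    if next_cond ~= nil then
        next_cond:signal()
    end
end

local function quota_release()
    -- Memory was freed: wake up only the first fiber in the line.
    local head = table.remove(waiters, 1)
    if head ~= nil then
        head:signal()
    end
end
```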
-
Vladimir Davydov authored
We join a fiber that executes a dump/compaction task only at exit, even though we mark all fibers as joinable. As a result, fibers leak, which eventually leads to a crash:

src/lib/small/small/slab_arena.c:58: munmap_checked: Assertion `false' failed.

Here's the stack trace:
munmap_checked
mmap_checked
slab_map
slab_get_with_order
mempool_alloc
fiber_new_ex
fiber_new
cord_costart_thread_func
cord_thread_func
start_thread
clone

Let's fix this issue by marking a fiber as joinable only at exit, before joining it. The fiber is guaranteed to be alive at that time, because it clears vy_worker::task before returning, while we join it only if vy_worker::task is not NULL. Fixes commit 43b4342d ("vinyl: fix worker crash at exit").
-
Vladimir Davydov authored
fiber_cond_destroy() and latch_destroy() are no-op on release builds, while on debug builds they check that there are no fibers waiting on the destroyed object. This results in the following assertion failures occasionally hit by some tests:

src/latch.h:81: latch_destroy: Assertion `l->owner == NULL' failed.
src/fiber_cond.c:49: fiber_cond_destroy: Assertion `rlist_empty(&c->waiters)' failed.

We can't do anything about that, because the event loop isn't running at exit and hence we can't stop those fibers. So let's not "destroy" those global objects that may have waiters at exit, namely gc.latch, ro_cond, and replicaset.applier.cond.
-
- Aug 27, 2018
-
-
Vladimir Davydov authored
-
Vladimir Davydov authored
So that there's a single place where we can wait for quota. It should make it easier to implement quota throttling.
-
Vladimir Davydov authored
Watermark calculation is a private business of vy_quota. Let's move related stuff from vy_env to vy_quota. This will also make it easier to implement throttling opaque to the caller.
-
Vladimir Davydov authored
None of the vy_quota methods are called from a hot path - even the most frequently called ones, vy_quota_try_use and vy_quota_commit_use, are only invoked once per transaction. So there's no need to clog the header with the method implementations.
-
Vladimir Davydov authored
Let's introduce this helper to avoid code duplication and keep comments regarding quota consumption protocol in one place.
-
Vladimir Davydov authored
Since we check the uniqueness constraint before inserting anything into the transaction write set, we have to deal with the situation when a secondary index doesn't get updated. For example, suppose there's a tuple {1, 1, 1} stored in a space with the primary index over the first field and a unique secondary index over the second field. Then, when processing REPLACE {1, 1, 2}, we will find {1, 1, 1} in the secondary index, but that doesn't mean there's a duplicate key error - since the primary key parts of the old and new tuples coincide, the secondary index doesn't in fact get updated, hence there's no conflict. However, if the operation were INSERT {1, 1, 2}, there would be a conflict - by the primary index. Normally, we would detect such a conflict when checking the uniqueness constraint of the primary index, i.e. in vy_check_is_unique_primary(), but there's a case when this doesn't happen: we can optimize out the primary index uniqueness check when the primary index key parts contain all parts of a unique secondary index, see #3154. In such a case we must fail vy_check_is_unique_secondary() even if the conflicting tuple has the same primary key parts. Fixes commit fc3834c0 ("vinyl: check key uniqueness before modifying tx write set") Closes #3643
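The REPLACE vs INSERT distinction above, sketched in Lua (space and index names are illustrative; the optimized-out primary check itself is internal to vinyl):

```lua
s = box.schema.space.create('test', {engine = 'vinyl'})
s:create_index('pk', {parts = {1, 'unsigned'}})
s:create_index('sk', {parts = {2, 'unsigned'}, unique = true})
s:insert{1, 1, 1}

-- REPLACE finds {1, 1, 1} via the unique secondary index, but the
-- primary key parts of the old and new tuples coincide, so the
-- secondary index is not actually updated and no duplicate key error
-- is raised.
s:replace{1, 1, 2}

-- INSERT of a tuple with the same primary key, by contrast, must fail
-- with a duplicate key error.
local ok = pcall(function() s:insert{1, 1, 2} end)
assert(ok == false)
```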
-
Serge Petrenko authored
Sometimes the test failed with output similar to the one below:

[001] engine/ddl.test.lua memtx [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- engine/ddl.result Mon Aug 27 09:35:19 2018
[001] +++ engine/ddl.reject Mon Aug 27 11:12:47 2018
[001] @@ -1932,7 +1932,7 @@
[001] ...
[001] s.index.pk:select()
[001] ---
[001] -- - [1, 1, 11]
[001] +- - [1, 1, 8]
[001] ...
[001] s.index.sk:select()
[001] ---

This happened due to a race condition in a test case added for issue #3578. To fix it we need to move c:get() above s.index.pk:select() to make sure we actually wait for the fiber function to complete before checking results. Follow-up #3578.
-
- Aug 26, 2018
-
-
Vladimir Davydov authored
From time to time box/net.box test fails like this:

Test failed! Result content mismatch:
--- box/net.box.result Sat Aug 25 18:41:35 2018
+++ box/net.box.reject Sat Aug 25 18:49:17 2018
@@ -2150,7 +2150,7 @@
...
disconnected -- false
---
-- false
+- true
...
ch2:put(true)
---

The 'disconnected' variable is changed from false to true by box.session.on_disconnect trigger. If there happens to be a dangling connection that wasn't explicitly closed, it might occur that the Lua garbage collector closes it, which will result in spurious trigger execution and test failure. Fix this by collecting all dangling connections before installing the trigger.
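The gist of the fix, sketched in Lua (the trigger body is illustrative):

```lua
-- Close any dangling connections left to the Lua garbage collector
-- before installing the trigger, so the trigger can't fire spuriously.
collectgarbage('collect')
collectgarbage('collect')  -- second pass lets finalizers actually run

local disconnected = false
box.session.on_disconnect(function() disconnected = true end)
```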
-
Vladimir Davydov authored
The test has a number of problems:
- It requires a huge number of file descriptors (> 10000), which is typically unavailable due to ulimit (1024).
- Even if it is permitted to open the required number of file descriptors, it may still fail due to net.box.connect failing to connect by timeout.
- If it fails, it silently hangs, which makes it difficult to investigate the failure.
Actually, there's no need to create that huge number of file descriptors to test depletion of the tx fiber pool anymore, because IPROTO_MSG_MAX was made tunable. Let's lower it down to 64 (default is 768) for the test. Then we can create only 400 fibers and open ~800 file descriptors. Also, let's wrap net.box.connect with pcall and limit the max wait time to 10 seconds so that the test will terminate on any failure instead of hanging forever. Now, the test only takes ~2 seconds to complete, which doesn't qualify as 'long'. Remove it from the list of long-run tests. While we are at it, also fix indentation (it uses tabs instead of spaces in Lua). Closes #2473
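A sketch of the approach (the URI is illustrative; net_msg_max is the box.cfg knob that exposes IPROTO_MSG_MAX):

```lua
box.cfg{net_msg_max = 64}  -- default is 768; fewer fds needed to deplete the pool

local net_box = require('net.box')

-- Bound the connect time and wrap it in pcall so a failure fails the
-- test instead of hanging it forever.
local ok, conn = pcall(net_box.connect, 'localhost:3301',
                       {wait_connected = 10})
```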
-
- Aug 25, 2018
-
-
Vladimir Davydov authored
There's a case in the test that expects that when we dump a new run, vinyl will trigger compaction. Typically this happens, but sometimes, presumably due to compression, the new run turns out to be a bit smaller than the old one, so compaction isn't triggered and the test hangs. Fix this by forcing compaction with the index.compact() knob.
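For reference, forcing compaction from a test looks roughly like this (assuming a space 's' with a 'pk' index):

```lua
s.index.pk:compact()  -- schedule compaction of all runs of the index
```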
-