- Sep 09, 2018
-
-
Vladimir Davydov authored
A worker pool is an independent entity that provides the scheduler with worker threads on demand. Let's factor it out so that we can introduce separate pools for dump and compaction tasks.
-
Vladimir Davydov authored
Background tasks are allocated infrequently, not more often than once per several seconds, so using mempool for them is unnecessary and only clutters vy_scheduler struct. Let's allocate them with malloc().
-
Vladimir Davydov authored
Needed solely to improve code readability. No functional changes.
-
- Sep 06, 2018
-
-
Konstantin Osipov authored
-
Georgy Kirichenko authored
Add the possibility to build tarantool with bundled library dependencies. Use the flag -DBUILD_STATIC=ON to build statically against curl, readline, ncurses, icu and z. Use the flag -DOPENSSL_USE_STATIC_LIBS=ON to build with static openssl.
Changes:
* Add FindOpenSSL.cmake because some distributions do not support the use of openssl static libraries.
* Find libssl before curl because of the build dependency.
* Catch the API of all bundled libraries and export it in case of a static build.
* Rename crc32 internal functions to avoid a name clash with linked libraries.
Notes:
* Bundled libyaml is not properly exported, use the system one.
* A Dockerfile for building statically with docker is included.
Fixes #3445
-
- Sep 04, 2018
-
-
Vladimir Davydov authored
-
Vladimir Davydov authored
Now box.cfg() doesn't return until 'quorum' appliers are in sync not only on initial configuration, but also on replication configuration update. If it fails to synchronize within replication_sync_timeout, box.cfg() returns without an error, but the instance enters 'orphan' state, which is basically read-only mode. In the meantime, appliers will keep trying to synchronize in the background, and the instance will leave 'orphan' state as soon as enough appliers are in sync.
Note, this patch also changes logging a bit:
- 'ready to accept requests' is printed on startup before syncing with the replica set, because although the instance is read-only at that time, it can indeed accept all sorts of ro requests.
- For the 'connecting', 'connected', 'synchronizing' messages, we now use the 'info' logging level, not 'verbose' as they used to be, because those messages are important - they give the admin an idea of what's going on with the instance - and they can't flood logs.
- The 'sync complete' message is also printed as 'info', not 'crit', because there's nothing critical about it (it's not an error).
Also note that we only enter 'orphan' state if we failed to synchronize. In particular, if the instance manages to synchronize with all replicas within the timeout, it will jump from 'loading' straight into 'running', bypassing 'orphan' state. This is done for the sake of consistency between initial configuration and reconfiguration.
Closes #3427
@TarantoolBot document
Title: Sync on replication configuration update
The behavior of box.cfg() on replication configuration update is now consistent with initial configuration: box.cfg() will not return until it synchronizes with as many masters as specified by the replication_connect_quorum configuration option, or until the timeout specified by replication_sync_timeout expires. On timeout, it returns without an error, but the instance enters 'orphan' state. It will leave 'orphan' state as soon as enough appliers have synced.
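A minimal Lua sketch of the documented behaviour (URIs, credentials and values are illustrative):

```lua
box.cfg{
    replication = {'replicator:password@192.168.0.101:3301',
                   'replicator:password@192.168.0.102:3301'},
    replication_connect_quorum = 2,
    replication_sync_timeout = 300,
}
-- box.cfg() has returned: either the instance synced with the quorum,
-- or the timeout expired and it entered the read-only 'orphan' state.
print(box.info.status)  -- 'running' or 'orphan'
print(box.info.ro)      -- true while the instance is an orphan
```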
-
Olga Arkhangelskaia authored
In the scope of #3427 we need a timeout in case an instance waits for synchronization for too long, or even forever. The default value is 300.
Closes #3674
@locker: moved dynamic config check to box/cfg.test.lua; code cleanup
@TarantoolBot document
Title: Introduce new configuration option replication_sync_timeout
After the initial bootstrap or a replication configuration change we need to sync up with the replication quorum. Sometimes the sync can take too long, or replication_sync_lag can be smaller than the network latency, in which case the replica would be stuck in the sync loop forever. To avoid such situations replication_sync_timeout can be used: once the time set in replication_sync_timeout has passed, the replica enters orphan state. The option can be set dynamically. The default value is 300 seconds.
-
Olga Arkhangelskaia authored
In #3427 replication_sync_lag should be taken into account during replication reconfiguration. In order to configure replication properly, this parameter is made dynamic and can be changed on demand.
@locker: moved dynamic config check to box/cfg.test.lua
@TarantoolBot document
Title: replication_sync_lag option can be set dynamically
box.cfg.replication_sync_lag can now be set at any time.
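Both options from this and the previous patch are now dynamic, for example:

```lua
-- No restart required; the new values take effect immediately.
box.cfg{replication_sync_lag = 0.5}      -- acceptable lag, in seconds
box.cfg{replication_sync_timeout = 120}  -- give up syncing after 2 minutes
```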
-
- Sep 03, 2018
-
-
Konstantin Osipov authored
Ensure box.ctl.wait_ro() and box.ctl.wait_rw() produce meaningful results even when invoked before box.cfg{}: wait for box.cfg{} to complete and the server to enter the right state. Add a test case. In scope of gh-3159
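A small sketch of the new behaviour (assuming a default read-write configuration):

```lua
local fiber = require('fiber')

-- Calling box.ctl.wait_rw() before box.cfg{} is now valid: it waits for
-- box.cfg{} to complete and for the instance to become writable.
fiber.create(function()
    box.ctl.wait_rw()
    print('instance is writable')
end)

box.cfg{}  -- the waiting fiber resumes once the server enters rw mode
```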
-
- Aug 31, 2018
-
-
Konstantin Osipov authored
-
Vladimir Davydov authored
-
- Aug 30, 2018
-
-
Konstantin Belyavskiy authored
There are two different pipes: 'tx' and 'tx_prio'. The latter does not support yield(). Rename it to avoid misunderstanding. Needed for #3397
-
Vladimir Davydov authored
The new version marks more file descriptors used by test-run internals as CLOEXEC. Needed to make replication/misc test pass (it lowers RLIMIT_NOFILE).
-
Vladimir Davydov authored
IPROTO_VOTE command (successor of IPROTO_REQUEST_VOTE) was introduced in Tarantool 1.10.1. It is sent by an applier to its master only if the master is running Tarantool 1.10.1 or newer. However, the master may be running a 1.10.1 build that predates IPROTO_VOTE and so isn't aware of it, in which case the applier will fail to connect with an ER_UNKNOWN_REQUEST_TYPE error. Let's fix this issue by ignoring ER_UNKNOWN_REQUEST_TYPE received in reply to the IPROTO_VOTE command.
-
Alexander Turenko authored
When the size parameter is not passed to socket:recv() or socket:recvfrom(), they call a) or b) on the socket to evaluate the size of the buffer needed to store the incoming datagram. Before this commit the datagram was truncated to 512 bytes in that case.
a) Linux: recv(fd, NULL, 0, MSG_TRUNC | MSG_PEEK)
b) Mac OS: getsockopt(fd, SOL_SOCKET, SO_NREAD, &val, &len)
It is recommended to set the 'size' parameter (the size of the input buffer) explicitly, based on the known message format and network conditions (say, set it below the MTU to prevent IP fragmentation, which can be inefficient), or to take it from a configuration option of a library / an application. Providing the buffer size explicitly avoids the extra syscall needed to evaluate the necessary buffer size.
When the 'size' parameter is set explicitly for recv / recvfrom on a UDP socket and the next datagram is longer than the size, the returned message is truncated to the provided size and the rest of the datagram is discarded. The tail is, of course, not discarded in the case of a TCP socket and remains available to the next recv / recvfrom call.
Fixes #3619.
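A short usage sketch (the port and buffer size are arbitrary):

```lua
local socket = require('socket')

local s = socket('AF_INET', 'SOCK_DGRAM', 'udp')
s:bind('127.0.0.1', 3344)

-- No size given: the buffer size is evaluated with an extra syscall
-- (MSG_TRUNC | MSG_PEEK on Linux, SO_NREAD on Mac OS).
local msg = s:recv()

-- Explicit size: no extra syscall; a longer datagram is truncated to
-- 1400 bytes and its tail is discarded.
local msg2, from = s:recvfrom(1400)
```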
-
- Aug 29, 2018
-
-
Sergei Kalashnikov authored
Aid the debugging of replication issues related to out-of-order requests: add the details of the request/tuple to the diagnostic message whenever possible. Closes #3105
-
Vladimir Davydov authored
-
Georgy Kirichenko authored
It is an error to throw an error out of a cbus message handler because it breaks cbus message delivery. In the case of replication, a thrown error prevented iproto from closing the replication socket. Closes #3642
-
Vladimir Davydov authored
So that instances started by test-run don't inherit file descriptors corresponding to the logs and sockets of all running instances. Needed for testing #3642
-
- Aug 28, 2018
-
-
Vladimir Davydov authored
Use a lower bound estimate in order not to overestimate dump bandwidth. For example, if an observation of 12 MB/s falls in bucket 10 .. 15, we should use 10 MB/s to avoid stalls.
-
Vladimir Davydov authored
The value returned by histogram_percentile() is an upper bound estimate. This is fine for measuring latency, because we are interested in the worst, i.e. highest, observations, but it doesn't suit particularly well if we want to keep track of the lowest observations, as is the case with bandwidth. So this patch introduces histogram_percentile_lower(), a function that is similar to histogram_percentile(), but returns a lower bound estimate of the requested percentile.
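A toy Lua sketch of the difference between the two estimates (illustration only; the real implementation is in C and its bucket handling differs):

```lua
-- Toy bucketed histogram: one observation of 12 MB/s recorded in the
-- 10..15 MB/s bucket (bucket i covers (bounds[i-1], bounds[i]]).
local bounds = {5, 10, 15, 20}   -- MB/s
local counts = {0, 0, 1, 0}

local function percentile(pct, lower)
    local total = 0
    for _, c in ipairs(counts) do total = total + c end
    local acc = 0
    for i, c in ipairs(counts) do
        acc = acc + c
        if acc * 100 >= pct * total then
            -- upper estimate: right edge of the bucket;
            -- lower estimate: left edge of the same bucket.
            return lower and (bounds[i - 1] or 0) or bounds[i]
        end
    end
end

print(percentile(10, false)) -- 15: fine for latency, where the worst case matters
print(percentile(10, true))  -- 10: safe lower bound, used for dump bandwidth
```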
-
Vladimir Davydov authored
The user can limit dump bandwidth with box.cfg.snap_io_rate_limit to a value that is less than the current estimate. To avoid stalls caused by overestimating dump bandwidth, we must take the limit into account for the initial guess and forget all observations whenever it changes.
-
Vladimir Davydov authored
Do not add the initial guess to the histogram, because otherwise it takes more than 10 dumps to arrive at the real dump bandwidth in case the initial value is less than the real one (we use the 10th percentile).
-
Vladimir Davydov authored
We don't need to compute a percentile of dump bandwidth histogram on each invocation of quota timer callback, because it may only be updated on dump completion. Let's cache it. Currently, it isn't that important, because the timer period is set to 1 second. However, once we start using the timer for throttling, we'll have to make it run more often and so caching the dump bandwidth value will make sense.
-
Vladimir Davydov authored
The next patch will store a cached bandwidth value in vy_quota::dump_bw. Let's rename dump_bw to dump_bw_hist here so as not to clog the next patch.
-
Vladimir Davydov authored
Typically, dump bandwidth varies from 10 MB to 100 MB per second so let's use 5 MB bucket granularity in this range. Values less than 10 MB/s can also be observed, because the user can limit disk rate with box.cfg.snap_io_rate_limit so use 1 MB granularity between 1 MB and 10 MB and 100 KB granularity between 100 KB and 1 MB. A write rate greater than 100 MB/s is unlikely in practice, even on very fast disks, since dump bandwidth is also limited by CPU, so use 100 MB granularity there.
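As a rough illustration (the exact boundary array in the source may differ), the described granularity could be generated like this:

```lua
-- Illustrative sketch of the bucket boundaries described above, in
-- bytes per second; not the literal array from the source.
local KB, MB = 1024, 1024 * 1024
local buckets = {}
for v = 100 * KB, 900 * KB, 100 * KB do table.insert(buckets, v) end   -- 100 KB .. 1 MB, 100 KB step
for v = 1 * MB, 9 * MB, 1 * MB do table.insert(buckets, v) end         -- 1 MB .. 10 MB, 1 MB step
for v = 10 * MB, 95 * MB, 5 * MB do table.insert(buckets, v) end       -- 10 MB .. 100 MB, 5 MB step
for v = 100 * MB, 1000 * MB, 100 * MB do table.insert(buckets, v) end  -- above 100 MB, 100 MB step
```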
-
Vladimir Davydov authored
Currently, we wake up all fibers whenever we free some memory. This is inefficient, because it might occur that all available quota gets consumed by a few fibers while the rest will have to go back to sleep. This is also kinda unfair, because waking up all fibers breaks the order in which the fibers were put to sleep. This works now, because we free memory and wake up fibers infrequently (on dump) and there normally shouldn't be any fibers waiting for quota (if there were, the latency would rocket sky high because of the absence of any kind of throttling). However, once throttling is introduced, fibers waiting for quota will become the norm. So let's wake up fibers one by one: whenever we free memory, we wake up the first fiber in the line, which will wake up the next fiber on success, and so forth.
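A toy Lua sketch of the chained wakeup (illustration only, not the vy_quota C code):

```lua
local fiber = require('fiber')

-- Waiters line up in FIFO order; freeing memory wakes only the head of
-- the line, and each waiter that got its share wakes the next one.
local waiters = {}

local function quota_wait()
    local cond = fiber.cond()
    table.insert(waiters, cond)
    cond:wait()
    -- ...consume the freed quota here...
    -- On success, pass the baton to the next waiter instead of
    -- broadcasting to everyone.
    local next_cond = table.remove(waiters, 1)
    if next_cond ~= nil then
        next_cond:signal()
    end
end

local function quota_release()
    -- Memory was freed: wake up only the first fiber in the line.
    local head = table.remove(waiters, 1)
    if head ~= nil then
        head:signal()
    end
end
```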
-
Vladimir Davydov authored
We join a fiber that executes a dump/compaction task only at exit, even though we mark all fibers as joinable. As a result, fibers leak, which eventually leads to a crash:

src/lib/small/small/slab_arena.c:58: munmap_checked: Assertion `false' failed.

Here's the stack trace:
munmap_checked
mmap_checked
slab_map
slab_get_with_order
mempool_alloc
fiber_new_ex
fiber_new
cord_costart_thread_func
cord_thread_func
start_thread
clone

Let's fix this issue by marking a fiber as joinable only at exit, before joining it. The fiber is guaranteed to be alive at that time, because it clears vy_worker::task before returning, while we join it only if vy_worker::task is not NULL. Fixes commit 43b4342d ("vinyl: fix worker crash at exit").
-
Vladimir Davydov authored
fiber_cond_destroy() and latch_destroy() are no-op on release builds, while on debug builds they check that there are no fibers waiting on the destroyed object. This results in the following assertion failures occasionally hit by some tests:

src/latch.h:81: latch_destroy: Assertion `l->owner == NULL' failed.
src/fiber_cond.c:49: fiber_cond_destroy: Assertion `rlist_empty(&c->waiters)' failed.

We can't do anything about that, because the event loop isn't running at exit and hence we can't stop those fibers. So let's not "destroy" those global objects that may have waiters at exit, namely gc.latch, ro_cond, and replicaset.applier.cond.
-
- Aug 27, 2018
-
-
Vladimir Davydov authored
-
Vladimir Davydov authored
So that there's a single place where we can wait for quota. It should make it easier to implement quota throttling.
-
Vladimir Davydov authored
Watermark calculation is a private business of vy_quota. Let's move related stuff from vy_env to vy_quota. This will also make it easier to implement throttling opaque to the caller.
-
Vladimir Davydov authored
None of the vy_quota methods are called from a hot path - even the most frequently called ones, vy_quota_try_use and vy_quota_commit_use, are only invoked once per transaction. So there's no need to clog the header with the method implementations.
-
Vladimir Davydov authored
Let's introduce this helper to avoid code duplication and keep comments regarding quota consumption protocol in one place.
-
Vladimir Davydov authored
Since we check the uniqueness constraint before inserting anything into the transaction write set, we have to deal with the situation when a secondary index doesn't get updated. For example, suppose there's a tuple {1, 1, 1} stored in a space with the primary index over the first field and a unique secondary index over the second field. Then, when processing REPLACE {1, 1, 2}, we will find {1, 1, 1} in the secondary index, but that doesn't mean there's a duplicate key error - since the primary key parts of the old and new tuples coincide, the secondary index doesn't in fact get updated, hence there's no conflict. However, if the operation were INSERT {1, 1, 2}, there would be a conflict - by the primary index. Normally, we would detect such a conflict when checking the uniqueness constraint of the primary index, i.e. in vy_check_is_unique_primary(), but there's a case when this doesn't happen: we can optimize out the primary index uniqueness check when the primary index key parts contain all parts of a unique secondary index, see #3154. In such a case we must fail vy_check_is_unique_secondary() even if the conflicting tuple has the same primary key parts. Fixes commit fc3834c0 ("vinyl: check key uniqueness before modifying tx write set") Closes #3643
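The REPLACE vs INSERT distinction above, sketched in Lua (space and index names are illustrative; the optimized-out primary check itself is internal to vinyl):

```lua
s = box.schema.space.create('test', {engine = 'vinyl'})
s:create_index('pk', {parts = {1, 'unsigned'}})
s:create_index('sk', {parts = {2, 'unsigned'}, unique = true})
s:insert{1, 1, 1}

-- REPLACE finds {1, 1, 1} via the unique secondary index, but the
-- primary key parts of the old and new tuples coincide, so the
-- secondary index is not actually updated and no duplicate key error
-- is raised.
s:replace{1, 1, 2}

-- INSERT of a tuple with the same primary key, by contrast, must fail
-- with a duplicate key error.
local ok = pcall(function() s:insert{1, 1, 2} end)
assert(ok == false)
```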
-
Serge Petrenko authored
Sometimes the test failed with output similar to the one below:

[001] engine/ddl.test.lua memtx [ fail ]
[001]
[001] Test failed! Result content mismatch:
[001] --- engine/ddl.result Mon Aug 27 09:35:19 2018
[001] +++ engine/ddl.reject Mon Aug 27 11:12:47 2018
[001] @@ -1932,7 +1932,7 @@
[001] ...
[001] s.index.pk:select()
[001] ---
[001] -- - [1, 1, 11]
[001] +- - [1, 1, 8]
[001] ...
[001] s.index.sk:select()
[001] ---

This happened due to a race condition in a test case added for issue #3578. To fix it we need to move c:get() above s.index.pk:select() to make sure we actually wait for the fiber function to complete before checking results. Follow-up #3578.
-
- Aug 26, 2018
-
-
Vladimir Davydov authored
From time to time box/net.box test fails like this:

Test failed! Result content mismatch:
--- box/net.box.result Sat Aug 25 18:41:35 2018
+++ box/net.box.reject Sat Aug 25 18:49:17 2018
@@ -2150,7 +2150,7 @@
...
disconnected -- false
---
-- false
+- true
...
ch2:put(true)
---

The 'disconnected' variable is changed from false to true by box.session.on_disconnect trigger. If there happens to be a dangling connection that wasn't explicitly closed, it might occur that the Lua garbage collector closes it, which will result in spurious trigger execution and test failure. Fix this by collecting all dangling connections before installing the trigger.
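The gist of the fix, sketched in Lua (the trigger body is illustrative):

```lua
-- Close any dangling connections left to the Lua garbage collector
-- before installing the trigger, so the trigger can't fire spuriously.
collectgarbage('collect')
collectgarbage('collect')  -- second pass lets finalizers actually run

local disconnected = false
box.session.on_disconnect(function() disconnected = true end)
```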
-
Vladimir Davydov authored
The test has a number of problems:
- It requires a huge number of file descriptors (> 10000), which is typically unavailable due to ulimit (1024).
- Even if it is permitted to open the required number of file descriptors, it may still fail due to net.box.connect failing to connect by timeout.
- If it fails, it silently hangs, which makes it difficult to investigate the failure.
Actually, there's no need to create that huge number of file descriptors to test depletion of the tx fiber pool anymore, because IPROTO_MSG_MAX was made tunable. Let's lower it down to 64 (default is 768) for the test. Then we can create only 400 fibers and open ~800 file descriptors. Also, let's wrap net.box.connect with pcall and limit the max wait time to 10 seconds so that the test will terminate on any failure instead of hanging forever. Now, the test only takes ~2 seconds to complete, which doesn't qualify as 'long'. Remove it from the list of long-run tests. While we are at it, also fix indentation (it uses tabs instead of spaces in Lua). Closes #2473
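A sketch of the approach (the URI is illustrative; net_msg_max is the box.cfg knob that exposes IPROTO_MSG_MAX):

```lua
box.cfg{net_msg_max = 64}  -- default is 768; fewer fds needed to deplete the pool

local net_box = require('net.box')

-- Bound the connect time and wrap it in pcall so a failure fails the
-- test instead of hanging it forever.
local ok, conn = pcall(net_box.connect, 'localhost:3301',
                       {wait_connected = 10})
```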
-
- Aug 25, 2018
-
-
Vladimir Davydov authored
There's a case in the test that expects that when we dump a new run, vinyl will trigger compaction. Typically this happens, but sometimes, presumably due to compression, the new run turns out to be a bit smaller than the old one, so compaction isn't triggered and the test hangs. Fix this by forcing compaction with the index.compact() knob.
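For reference, forcing compaction from a test looks roughly like this (assuming a space 's' with a 'pk' index):

```lua
s.index.pk:compact()  -- schedule compaction of all runs of the index
```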
-