- Nov 16, 2017
-
-
Georgy Kirichenko authored
Start the applier->writer fiber only after SUBSCRIBE. Otherwise the writer will send an ACK during FINAL JOIN and break the replication protocol. Fixes #2726
-
- Nov 15, 2017
-
-
Vladimir Davydov authored
Make sure the master receives an ack from the replica and performs garbage collection before checking the checkpoint count.
-
Vladimir Davydov authored
We remove old xlog files as soon as we have sent them to all replicas. However, the fact that we have successfully sent something to a replica doesn't necessarily mean the replica has received it. If a replica fails to apply a row (for instance, it is out of memory), replication will stop, but the data files will have already been deleted on the master, so that when the replica is back online, the master won't find an appropriate xlog to feed to the replica and replication will stop again. The user-visible effect is the following error message in the log and in the replica status:
Missing .xlog file between LSN 306 {1: 306} and 311 {1: 311}
There is no way to recover from this but to re-bootstrap the replica from scratch. The issue was introduced by commit ba09475f ("replica: advance gc state only when xlog is closed"), which aimed to make the status update procedure as lightweight and fast as possible and so moved gc_consumer_advance() from tx_status_update() to a special gc message. A gc message is created and sent to TX as soon as an xlog is relayed. Let's rework this so that gc messages are appended to a special queue first and scheduled only when the relay receives the receipt confirmation from the replica. Closes #2825
-
Vladimir Davydov authored
Engine callbacks that perform garbage collection may sleep, because they use coio for removing files to avoid blocking the TX thread. If garbage collection is called concurrently from different fibers (e.g. from relay fibers), we may attempt to delete the same file multiple times. What is worse, xdir_collect_garbage(), used by engine callbacks to remove files, isn't safe against concurrent execution - it first unlinks a file via coio, which involves a yield, and only then removes the corresponding vclock from the directory index. This opens a race window for another fiber to read the same vclock and yield; in the interim the vclock can be freed by the first fiber:
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f105ceda3fa in __GI_abort () at abort.c:89
#2  0x000055e4c03f4a3d in sig_fatal_cb (signo=11) at main.cc:184
#3  <signal handler called>
#4  0x000055e4c066907a in vclockset_remove (rbtree=0x55e4c1010e58, node=0x55e4c1023d20) at box/vclock.c:215
#5  0x000055e4c06256af in xdir_collect_garbage (dir=0x55e4c1010e28, signature=342, use_coio=true) at box/xlog.c:620
#6  0x000055e4c0417dcc in memtx_engine_collect_garbage (engine=0x55e4c1010df0, lsn=342) at box/memtx_engine.c:784
#7  0x000055e4c0414dbf in engine_collect_garbage (lsn=342) at box/engine.c:155
#8  0x000055e4c04a36c7 in gc_run () at box/gc.c:192
#9  0x000055e4c04a38f2 in gc_consumer_advance (consumer=0x55e4c1021360, signature=342) at box/gc.c:262
#10 0x000055e4c04b4da8 in tx_gc_advance (msg=0x7f1028000aa0) at box/relay.cc:250
#11 0x000055e4c04eb854 in cmsg_deliver (msg=0x7f1028000aa0) at cbus.c:353
#12 0x000055e4c04ec871 in fiber_pool_f (ap=0x7f1056800ec0) at fiber_pool.c:64
#13 0x000055e4c03f4784 in fiber_cxx_invoke(fiber_func, typedef __va_list_tag __va_list_tag *) (f=0x55e4c04ec6d4 <fiber_pool_f>, ap=0x7f1056800ec0) at fiber.h:665
#14 0x000055e4c04e6816 in fiber_loop (data=0x0) at fiber.c:631
#15 0x000055e4c0687dab in coro_init () at /home/vlad/src/tarantool/third_party/coro/coro.c:110
Fix this by serializing concurrent execution of garbage collection callbacks with a latch.
-
Vladimir Davydov authored
Currently, box.schema.upgrade() is called automatically after box.cfg() if the upgrade is considered safe (currently, only upgrade to 1.7.5 is "safe"). However, no upgrade is safe in case replication is configured, because it can easily result in replication conflicts. Let's disable auto upgrade if the 'replication' configuration option is set. Closes #2886
-
- Nov 13, 2017
-
-
Vladimir Davydov authored
Before commit 29d00dca ("alter: forbid to drop space with truncate record"), a space record was removed before the corresponding record in the _truncate system space, so we should disable the check that the space being dropped doesn't have a record in _truncate when recovering data generated by tarantool < 1.7.6. Closes #2909
-
- Nov 06, 2017
-
-
Roman Tsisyk authored
-
Roman Tsisyk authored
The Bloom filter depends on the hash function, which depends on the ICU version, which may vary.
-
Roman Tsisyk authored
-
Roman Tsisyk authored
-
Roman Tsisyk authored
Don't use id=0 for collations. Follow up #2649
-
Vladimir Davydov authored
Fix tuple_hash_field() to handle the following cases properly:
- Nullable string field (crash in vinyl on dump).
- Scalar field with collation enabled (crash in memtx hash index).
Add corresponding test cases.
-
Vladimir Davydov authored
First, unique but nullable indexes are not rebuilt when the primary key is altered although they should be, because they can contain multiple NULLs. Second, when rebuilding such indexes we use a wrong key def (index_def->key_def instead of cmp_def), which results in lost stable order after recovery. Fix both these issues and add a test case.
-
Vladimir Davydov authored
Needed to check that the key definition loaded from vylog (used to send initial data to a replica) has its collation properly recovered.
-
Vladimir Davydov authored
It isn't stored currently, but this doesn't break anything, because the primary key, which is the only key whose definition is used after having been loaded from vylog, can't be nullable. Let's store it there just in case. Update the vinyl/layout test to check that.
-
Vladimir Davydov authored
Collations were disabled in vinyl by commit 2097908f ("Fix collation test on some platforms and disable collation in vinyl"), because a key_def referencing a collation could not be loaded from vylog on recovery (collation objects are created after vylog is recovered). Now, it isn't a problem anymore, because the decoding procedure, key_def_decode_parts(), deals with struct key_part_def, which references a collation by id and hence doesn't need a collation object to be created. So we can enable collations in vinyl. This patch partially reverts the aforementioned commit (it can't do a full revert, because that commit also fixed some tests along the way). Closes #2822
-
Vladimir Davydov authored
We can't use key_def_decode_parts() when recovering vylog if key_def has a collation, because vylog is recovered before the snapshot, i.e. when collation objects haven't been created yet, while key_def_decode_parts() tries to look up the collation by id. As a result, we can't enable collations for vinyl indexes. To fix this, let's rework the decoding procedure so that it works with struct key_part_def instead of key_part. The only difference between the two structures is that the former references the collation by id while the latter by pointer. Needed for #2822
-
Georgy Kirichenko authored
The writer fiber should be stopped before reconnecting, to avoid sending unwanted IPROTO_OK replication acknowledgements. Fixes #2726
-
Georgy Kirichenko authored
The SUBSCRIBE command is not multiplexed in the binary protocol. When the relay exits with an error during subscribe, the remote replica still continues to send IPROTO_OK replication acknowledgements to the master. These packets are unwanted by the IPROTO decoder. Close the socket on errors during SUBSCRIBE. Fixes #2726
- Nov 04, 2017
-
-
Vladimir Davydov authored
The corresponding comparator is missing, which leads to a crash. Fix it and add a test case checking that nullable indexes work fine with all available types.
-
Georgy Kirichenko authored
Symbol resolving can be expensive. Introduce an option for fiber.info():
fiber.info({ backtrace = true })
fiber.info({ bt = true })
Fixes #2878
-
Ilya authored
Change the signature of access_check_func(): it now returns a status instead of a function. Close #2816
-
- Nov 03, 2017
-
-
Vladimir Davydov authored
> src/tarantool/src/box/vinyl.c:2111:33: error: initializer element is not a compile-time constant
>     static const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
>                                       ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Remove the "static" qualifier. It is not really needed, as any sane compiler will pre-calculate the value of 'weight' at compile time (checked on gcc 6.3.0 with -O0).
-
Vladimir Davydov authored
- Remove vy_stat::rmean statistics, which were left from Sophia, as now we have per-index statistics which are much more verbose than those.
- Move vy_stat::dump_bw to vy_env and remove struct vy_stat as there's nothing left in it.
- Move quota statistics from box.info.vinyl().performance.memory to box.info.vinyl().quota. Remove 'ratio', which equals used / limit, as this kind of calculation should be done by a script aggregating statistics. Report 'use_rate' and 'dump_bandwidth' there.
- Report 'limit' in cache statistics to make them consistent with 'quota' statistics, where 'limit' is reported. Rename 'cache.count' to 'cache.tuples'. Remove vy_cache_env::cache_count, use mempool stats instead.
- Move 'tx_allocated', 'txv_allocated', 'read_interval', 'read_view' from box.info.vinyl().performance to box.info.vinyl().tx and name them 'transactions', 'statements', 'gap_locks', and 'read_views', respectively. Remove vy_tx_stat::active and 'tx.active' as the same value is shown by 'tx.transactions', extracted from the mempool.
- Zap box.info.vinyl().performance - there's nothing left there.
Now global statistics look like:
tarantool> box.info.vinyl()
---
- cache:
    limit: 134217728
    tuples: 32344
    used: 34898794
  tx:
    conflict: 1
    commit: 324
    rollback: 13
    statements: 10
    transactions: 3
    gap_locks: 4
    read_views: 1
  quota:
    dump_bandwidth: 10000000
    watermark: 119488351
    use_rate: 1232703
    limit: 134217728
    used: 34014634
...
Closes #2861
-
Vladimir Davydov authored
We have a timer for updating the watermark every second. Let's reuse it for quota use rate calculation. This will allow us to get rid of legacy vinyl statistics. Also, let's use an EWMA (exponentially weighted moving average) for calculating the average. It is a more efficient and common method, which makes it easy to tune the period over which the value is averaged.
-
Roman Tsisyk authored
Follow up #1557
-
- Nov 02, 2017
-
-
Alexandr Lyapunov authored
Collation was simply ignored for non-string parts, which could confuse a potential user. Generate a readable error in this case. Fix #2862 part 2
-
Alexandr Lyapunov authored
Now collation is silently ignored for type='scalar' parts. Use collation for string scalar fields. Fix #2862 part 1
-
Alexandr Lyapunov authored
Show collation name (if present) in space.index.name.parts[no]. Fix #2862 part 4
-
Alexandr Lyapunov authored
test:create_index('unicode_s1', {parts = {{1, 'STR', collation = 'UNICODE'}}}) will work now. Fix #2862 part 3
-
Vladislav Shpilevoy authored
If a field is not indexed and there are no indexed or non-nullable fields after it, allow it to be skipped on insertion. Such a field's value looks like MP_NIL, but MP_NIL is not explicitly stored. Named access to this field in Lua returns nil. Example:
format = {{'field1'}, {'field2'}, {'field3', is_nullable = true}, {'field4', is_nullable = true}}
t = space:insert{1, 2} -- ok. t.field1 == 1, t.field2 == 2, t.field3 == nil, t.field4 == nil
Closes #2880
-
Vladislav Shpilevoy authored
Some users store their own custom keys in format fields, but the current opts parser does not allow storing unknown keys. Let's allow it. Example:
format = {}
format[1] = {name = 'field1', type = 'unsigned', custom_field = 'custom_value'}
s = box.schema.create_space('test', {format = format})
s:format()[1].custom_field == 'custom_value'
Closes #2839
-
Vladimir Davydov authored
Using DML/DDL on a Vinyl index with wal_mode = 'none' is likely to result in unrecoverable errors like:
F> can't initialize storage: Invalid VYLOG file: Index 512/0 created twice
To avoid data corruption in case the user tries to use an existing Vinyl database in conjunction with wal_mode = 'none', let's explicitly forbid it until we figure out how to fix it. Workaround #2278
-
Vladimir Davydov authored
During initial join, a replica receives all data accumulated on the master for its whole lifetime, which may be quite a lot. If the network connection is fast enough, the replica might fail to keep up with dumps, in which case replication fails with ER_VY_QUOTA_TIMEOUT. To avoid that, let's ignore the quota timeout until bootstrap is complete. Note, replication may still fail during the 'subscribe' stage for the same reason, but it's unlikely, because the rate at which the master sends data is limited by the number of requests served by the master per unit of time, and it should become nearly impossible once throttling is introduced (see #1862). Closes #2873
-
Vladimir Davydov authored
If the user sets snap_dir to an empty directory by mistake while leaving vinyl_dir the same, tarantool will still bootstrap, but there are likely to be errors like:
vinyl.c:835 E> 512/0: dump failed: file './512/0/00000000000000000001.run' already exists
vy_log.c:1095 E> failed to rotate metadata log: file './00000000000000000005.vylog' already exists
Even worse, it may eventually fail to restart with:
vy_log.c:886 E> ER_MISSING_SNAPSHOT: Can't find snapshot
To avoid that, let's check the vinyl_dir on bootstrap and abort if it contains vylog files left from previous setups. Closes #2872
-
Vladimir Davydov authored
The only reason why it was allocated is that struct vy_scheduler was defined after struct vy_env, which is not a problem any more. Embedding it allows us to drop the extra argument to vy_scheduler_need_dump_f().
-
Vladimir Davydov authored
It's a big independent entity, let's isolate its code in a separate file. While we are at it, add missing comments to vy_scheduler struct members.
-
Vladimir Davydov authored
Instead of storing a pointer to vy_env in vy_scheduler, let's:
- Add pointers to tx_manager::read_views and vy_env::run_env to the vy_scheduler struct. They are needed to create a write iterator for a dump/compaction task.
- Add a callback to struct vy_scheduler that is called upon dump completion to free memory. This allows us to eliminate accesses to vy_env::quota and vy_env::allocator from vy_scheduler code.
- Move the assert that assures the scheduler isn't started during local recovery from vy_scheduler_f() to the vy_env_quota_exceeded_cb() callback, so that we don't need to access vy_env::status from the scheduler code. Note, after this change we have to set vy_env::status to VINYL_ONLINE before calling vy_quota_set_limit(), because the latter might schedule a dump.
- Check if we have anything to dump from vy_begin_checkpoint() instead of vy_scheduler_begin_checkpoint(). This will allow us to isolate the scheduler code in a separate file.
-
Vladimir Davydov authored
Currently, dump is triggered (by bumping the memory generation) by the scheduler fiber while quota consumers just wake it up. As a result, the scheduler depends on the quota - it has to access the quota to check if it needs to trigger dump. In order to move the scheduler to a separate source file, we need to get rid of this dependency. Let's rework this code as follows:
- Remove vy_scheduler_trigger_dump() from vy_scheduler_peek_dump(). The scheduler fiber now just dumps all indexes eligible for dump and completes dump by bumping dump_generation. It doesn't trigger dump by bumping generation anymore. As a result, it doesn't need to access the quota.
- Make quota consumers call vy_scheduler_trigger_dump() instead of just waking up the scheduler. This function will become a public one once the scheduler is moved out of vinyl.c. The function's logic is changed a bit. First, besides bumping the generation, it now also wakes up the scheduler fiber. Second, it does nothing if dump is already in progress or can't be scheduled because of a concurrent checkpoint. In the latter case it sets a special flag, though, that will force the scheduler to trigger dump upon checkpoint completion.
- vy_scheduler_begin_checkpoint() can't use vy_scheduler_trigger_dump() anymore due to the additional checks added to the function, so it bumps the generation directly. This looks fine.
- This design has a subtlety regarding how quota consumers notify the scheduler and how they are notified back about available quota. In extreme cases, the quota released by a dump may not be enough to satisfy all consumers, in which case we need to reschedule dump. Since the scheduler doesn't check the quota anymore and doesn't reschedule dump, it has to be done by the remaining consumers. So consumers have to call the quota_exceeded_cb callback (which now triggers a dump) every time they wake up and see there's not enough quota. vy_quota_use() is reworked accordingly. Also, since quota usage may exceed the limit (because of vy_quota_force_use()), usage may remain above the limit after a dump completes, in which case vy_quota_release() doesn't wake up consumers and again there's no one to trigger another dump. So we must wake up all consumers every time vy_quota_release() is called.
-