Commits · 4a464f8a41651d35a7e05e32b6c18b654c127d6b · core / tarantool

Oct 12, 2018

xlog: fix filename in error messages · 4a464f8a

 - xlog_rename() doesn't strip xlog->filename of inprogress suffix so
   write errors will mistakenly report the filename as inprogress.
 - xlog_create() uses a name without inprogress suffix for error
   reporting while it actually creates an inprogress file.

4a464f8a

Oct 10, 2018

socket: fix polling in case of spurious wakeup · e6bd7748

Georgy Kirichenko authored 6 years ago

socket_writable/socket_readable now handle spurious wakeups of socket.iowait:
they keep waiting until the event happens or the timeout expires.

Closes #3344
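
A minimal sketch of the retry loop described above, written in Python for illustration only. The names wait_ready and iowait are hypothetical stand-ins and are not the actual Lua socket module code; the assumption is that the wait primitive returns 0 when it wakes up without an event.

```python
import time


def wait_ready(iowait, timeout, clock=time.monotonic):
    """Wait until `iowait` reports an event or `timeout` seconds pass.

    `iowait(remaining)` is a hypothetical stand-in for socket.iowait: it
    returns a truthy event mask, or 0 if it woke up without an event.
    """
    deadline = clock() + timeout
    remaining = timeout
    while True:
        if iowait(max(remaining, 0)):
            return True            # a real event arrived
        # Spurious wakeup: recompute how much of the timeout is left.
        remaining = deadline - clock()
        if remaining <= 0:
            return False           # the timeout is exhausted
```

The point of the sketch is that the deadline is fixed once, so repeated spurious wakeups cannot extend the total wait beyond the requested timeout.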

e6bd7748

vinyl: fix for deferred DELETE overwriting newer statement · 63912c30

Vladimir Davydov authored 6 years ago

A deferred DELETE may be generated after a newer statement for the same
key was inserted into a secondary index and hence land in a newer run.
Since the read iterator assumes that newer sources always contain newer
statements for the same key, we mark all deferred DELETE statements with
VY_STMT_SKIP_READ flag, which makes run/mem iterators ignore them. The
flag must be persisted when a statement is written to disk, but it is
not. Fix this.

Fixes commit 504bc805 ("vinyl: do not store meta in secondary index
runs").

63912c30

test: disable feedback daemon test on Mac OS in CI · ab868a6b

Alexander Turenko authored 6 years ago

The failure is known and should not have any influence on our CI results.

The test should be re-enabled once #3558 is fixed.

ab868a6b

Oct 08, 2018

cmake: fix sync_file_range detection · e1aa1a3d

Vladimir Davydov authored 6 years ago

sync_file_range is declared only if _GNU_SOURCE macro is defined.
Also, in order to be used in a source file, HAVE_SYNC_FILE_RANGE
must be present in config.h.cmake.

Fixes commit caae99e5 ("Refactor xlog writer").

e1aa1a3d

Oct 06, 2018

vinyl: fix dump_bandwidth init in case snap_io_rate_limit is maxed out · ed6097f6

Vladimir Davydov authored 6 years ago

box.cfg{snap_io_rate_limit = 0} means that the limit is maxed out hence
we must set the dump bandwidth estimate to the default value. Instead
we set it to 0, which may result in invalid transaction throttling.
Fix this.

Fixes commit b646fbd9 ("vinyl: use snap_io_rate_limit for initial
dump bandwidth estimate").
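
The intended behavior can be sketched as follows. This is an illustrative Python model, not the vinyl code: the function name and the default constant are assumptions, and snap_io_rate_limit is assumed to be expressed in megabytes per second.

```python
# Hypothetical default; the real constant lives in the vinyl sources.
DEFAULT_DUMP_BANDWIDTH = 10 * 1024 * 1024  # bytes per second


def initial_dump_bandwidth(snap_io_rate_limit_mb):
    """Pick the initial dump bandwidth estimate, in bytes per second.

    A limit of 0 means "no limit", so the default estimate is used
    rather than a bandwidth of 0.
    """
    if snap_io_rate_limit_mb > 0:
        return int(snap_io_rate_limit_mb * 1024 * 1024)
    return DEFAULT_DUMP_BANDWIDTH
```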

ed6097f6

Oct 05, 2018

replication: ref checkpoint needed to join replica · bae6f037

Vladimir Davydov authored 6 years ago

Before joining a new replica we register a gc_consumer to prevent
garbage collection of files needed for join and following subscribe.
Before commit 9c5d851d ("replication: remove old snapshot files not
needed by replicas") a consumer would pin both checkpoints and WALs so
that would work as expected. However, the above mentioned commit
introduced consumer types and marked a consumer registered on replica
join as WAL-only so if the garbage collector was invoked during join, it
could delete files corresponding to the relayed checkpoint resulting in
replica join failure. Fix this issue by pinning the checkpoint used for
joining a replica with gc_ref_checkpoint and unpinning once join is
complete.

The issue can only be reproduced if there are vinyl spaces, because
deletion of an open snap file doesn't prevent the relay from reading it.
The existing replication/gc test would catch the issue if it triggered
compaction on the master so we simply tweak it accordingly instead of
adding a new test case.

Closes #3708

bae6f037

gc: call gc_run unconditionally when consumer is advanced · a3542586

Vladimir Davydov authored 6 years ago

gc_consumer_unregister and gc_consumer_advance don't call gc_run in case
the consumer in question isn't leftmost. This code was written back when
gc_run was kinda heavy and would call engine/wal callbacks even if it
wouldn't really need to. Today gc_run will bail out shortly, without
making any complex computation, let alone invoking garbage collection
callbacks, in case it has nothing to do so those optimizations are
pointless. Let's remove them.

a3542586

gc: separate checkpoint references from wal consumers · 28c97041

Vladimir Davydov authored 6 years ago

Initially, gc_consumer object was used for pinning both checkpoint and
WAL files, but commit 9c5d851d ("replication: remove old snapshot
files not needed by replicas") changed that. Now whether a consumer pins
WALs or checkpoints or both depends on gc_consumer_type. This was done
so that replicas wouldn't prevent garbage collection of checkpoint
files, which they don't need after initial join is complete.

The way the feature was implemented is rather questionable though:
 - Since consumers of both types are stored in the same binary search
   tree, we have to iterate through the tree to find the leftmost
   checkpoint consumer, see gc_tree_first_checkpoint. This looks
   inefficient and ugly.
 - The notion of advancing a checkpoint consumer (gc_consumer_advance)
   is dubious: there's no point in moving on to the next checkpoint after
   reading one - instead the consumer needs incremental changes, i.e.
   WALs.

To eliminate those questionable aspects and make the code easier to
understand, let's separate WAL and checkpoint consumers. We do this
by removing gc_consumer_type and making gc_consumer track WALs only.
For pinning the files corresponding to a checkpoint a new object class
is introduced, gc_checkpoint_ref. To pin a checkpoint, gc_ref_checkpoint
needs to be called. It is passed the gc_checkpoint object to pin, the
consumer name, and the gc_checkpoint_ref to store the ref in. To unpin a
previously pinned checkpoint, gc_checkpoint_unref should be called.

References are listed by box.info.gc() for each checkpoint under
'references' key.
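
The separation described above can be modeled roughly like this. It is a toy Python sketch, not the C implementation: the class and method names are hypothetical, WAL consumers are left out, and the rule that collection stops at the first pinned checkpoint is an assumption made for the example.

```python
class Checkpoint:
    def __init__(self, signature):
        self.signature = signature
        self.refs = []           # names of holders pinning this checkpoint


class GC:
    """Toy model: checkpoints carry their own references."""

    def __init__(self, min_checkpoint_count=1):
        self.min_checkpoint_count = min_checkpoint_count
        self.checkpoints = []    # oldest first

    def add_checkpoint(self, signature):
        cp = Checkpoint(signature)
        self.checkpoints.append(cp)
        return cp

    def ref_checkpoint(self, cp, name):
        cp.refs.append(name)     # pin the checkpoint files

    def unref_checkpoint(self, cp, name):
        cp.refs.remove(name)     # unpin them again

    def run(self):
        """Drop the oldest unpinned checkpoints beyond the minimum."""
        removed = []
        while len(self.checkpoints) > self.min_checkpoint_count:
            if self.checkpoints[0].refs:
                break            # assumed rule: stop at a pinned checkpoint
            removed.append(self.checkpoints.pop(0).signature)
        return removed
```

In this model a replica join would call ref_checkpoint before relaying the checkpoint and unref_checkpoint once the join is complete, so run() cannot remove the files in between.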

28c97041

gc: improve box.info.gc output · eba06790

Vladimir Davydov authored 6 years ago

Report vclocks in addition to signatures. When box.info.gc was first
introduced we used signatures in gc. Now we use vclocks so there's no
reason not to report them. This is consistent with box.info output
(there's vclock and signature).

Report the vclock and signature of the oldest WAL row available on the
instance under box.info.gc().vclock. Without this information the user
would have to figure it out by looking at box.info.gc().consumers.

eba06790

gc: cleanup garbage collection procedure · 668ec2aa

Vladimir Davydov authored 6 years ago

Do some refactoring intended to make the code of gc_run() easier to
understand:
 - Remove gc_state::checkpoint_vclock. It was used to avoid rerunning
   engine gc callback in case no checkpoint was deleted. Since we
   maintain a list of all available checkpoints, we don't need it for
   this anymore - we can run gc only if a checkpoint was actually
   removed from the list.
 - Rename gc_state::wal_vclock back to gc_state::vclock.
 - Use bool variables with descriptive names instead of comparing
   vclock signatures.
 - Add some comments.

668ec2aa

gc: keep track of available checkpoints · a55f86fd

Vladimir Davydov authored 6 years ago

Currently, the checkpoint iterator is in fact a wrapper around
memtx_engine::snap_dir while the garbage collector knows nothing about
checkpoints. This feels like encapsulation violation. Let's keep track
of all available checkpoints right in the garbage collector instead
and export gc_ API to iterate over checkpoints.

a55f86fd

gc: rename checkpoint_count to min_checkpoint_count · 1162743a

Vladimir Davydov authored 6 years ago

Because it's the minimal number of checkpoints that must not be deleted,
not the actual number of preserved checkpoints. Do it now, in a separate
patch so as to ease review of the next patch.

While we are at it, fix the comment to gc_set_(min_)checkpoint_count()
which got outdated by commit 5512053f ("box: gc: do not remove files
being backed up").

1162743a

gc: format consumer name in gc_consumer_register · 1ed5f1e9

Vladimir Davydov authored 6 years ago

It's better than using tt_snprintf at call sites.

1ed5f1e9

gc: fold gc_consumer_new · 7b42ed21

Vladimir Davydov authored 6 years ago

gc_consumer_new is used in gc_consumer_register. Let's fold it to make
the code flow more straightforward.

7b42ed21

gc: use fixed length buffer for storing consumer name · 961938b0

Vladimir Davydov authored 6 years ago

The length of a consumer name never exceeds 64 characters so there's no
need to allocate a string. This is a mere code simplification.

961938b0

gc: make gc_consumer and gc_state structs transparent · 5231bb6b

Vladimir Davydov authored 6 years ago

It's exasperating to write trivial external functions for each member of
an opaque struct (gc_consumer_vclock, gc_consumer_name, etc) while we
could simply access those fields directly if we made those structs
transparent. Since we usually define structs as transparent if we need
to use them outside a source file, let's do the same for gc_consumer and
gc_state and remove all those one-line wrappers.

5231bb6b

vinyl: force deletion of runs left from unfinished indexes on restart · eb6280d0

Vladimir Davydov authored 6 years ago

If an instance is restarted while building a new vinyl index, there will
probably be some run files left. Currently, we won't delete such files
until box.snapshot() is called, even though there's no point in keeping
them around. Let's tweak vy_gc_lsm() so that it marks all runs that
belong to an unfinished index as incomplete to force vy_gc() to remove
them immediately after recovery is complete.

This also removes files left from a failed rebootstrap attempt so we can
remove a call to box.snapshot() from vinyl/replica_rejoin.test.lua.

eb6280d0

vinyl: fix master crash on replica join failure · 626dfb2c

Vladimir Davydov authored 6 years ago

This patch fixes a trivial error on vy_send_range() error path which
results in a master crash in case a file needed to join a replica is
missing or corrupted.

See #3708

626dfb2c

Set lua state for main fiber too · 9039df9c

Georgy Kirichenko authored 6 years ago

The main fiber should have a lua state as any other lua fiber.

Needed for #3538

9039df9c

Add -Werror for CI (1.10 part) · da505ee7

Alexander Turenko authored 6 years ago

Added the CMAKE_BUILD_TYPE=RelWithDebInfoWError option, which enables
-DNDEBUG=1, -O2 and -Wall -Wextra -Werror. This ensures that the release
build is free of warnings.

Fixed the -Wunused-variable and -Wunused-parameter warnings it found.

Part of #3238.

da505ee7

Oct 03, 2018

utf8: allow empty strings in utf8.upper/lower · 129099bc

Vladislav Shpilevoy authored 6 years ago

Closes #3709

129099bc

replication: fix assertion with duplicate connection · 03a9bb1a

Olga Arkhangelskaia authored 6 years ago

This patch fixes the behavior when a replica tries to connect to the same
master more than once. If this happens during the initial configuration,
we raise an exception. Otherwise we print the error and disconnect
the applier.

@locker: minor test cleanup.

Closes #3610

03a9bb1a

vinyl: zap vy_env::memory, read_threads, and write_threads · 69a4b786

Vladimir Davydov authored 6 years ago

They are only used to set corresponding members of vy_quota, vy_run_env,
and vy_scheduler when vy_env is created. No point in keeping them around
all the time.

69a4b786

vinyl: enable quota upon recovery completion explicitly · 6af3d41a

Vladimir Davydov authored 6 years ago

Currently, we create a quota object with the limit maximized, and only
set the configured limit when local recovery is complete, so as to make
sure that no dump is triggered during recovery. As a result, we have to
store the configured limit in vy_env::memory, which looks ugly, because
this member is never used afterwards. Let's introduce a new method
vy_quota_enable to enable quota so that we can set the limit right on
quota object construction. This implies that we add a boolean flag to
vy_quota and only check the limit if it is set.

There's another reason to add such a method. Soon we will implement
quota consumption rate limiting. Rate limiting requires a periodic timer
that would replenish quota. It only makes sense to start such a timer
upon recovery completion, which again leads us to an explicit method for
enabling quota.

vy_env::memory will be removed by the following patch along with a few
other pointless members of vy_env.

Needed for #1862
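
The idea of an explicitly enabled quota can be sketched as follows. This is a toy Python model with hypothetical names, not vy_quota itself: the limit is known from construction but only checked once enable() has been called.

```python
class Quota:
    """Toy memory quota whose limit is enforced only once enabled."""

    def __init__(self, limit):
        self.limit = limit        # the configured limit, set up front
        self.used = 0
        self.is_enabled = False   # stays off while recovery is running

    def enable(self):
        """Start enforcing the limit, e.g. when recovery is complete."""
        self.is_enabled = True

    def use(self, size):
        """Account for `size` bytes; return False if over the limit."""
        if self.is_enabled and self.used + size > self.limit:
            return False
        self.used += size
        return True
```

With this shape the caller no longer has to remember the configured limit somewhere else until recovery ends, which is the motivation given above for dropping vy_env::memory.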

6af3d41a

vinyl: implement quota wait queue without fiber_cond · fac38eec

Vladimir Davydov authored 6 years ago

Using fiber_cond as a wait queue isn't very convenient, because:
 - It doesn't allow us to put a spuriously woken up fiber back to the
   same position in the queue where it was, thus violating fairness.
 - It doesn't allow us to check whether we actually need to wake up a
   fiber or it will have to go back to sleep anyway as it needs more
   memory than currently available.
 - It doesn't allow us to implement a multi-queue approach where fibers
   that have different priorities are put to different queues.

So let's rewrite the wait queue with plain rlist and fiber_yield.

Needed for #1862
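
A rough illustration of the first two points, as a toy Python model rather than the rlist and fiber_yield code. All names are hypothetical, and for simplicity the quota is granted at wakeup time instead of being taken by the woken fiber itself.

```python
from collections import deque


class QuotaWaitQueue:
    """Toy FIFO wait queue: waiters keep their position until served."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0
        self.waiters = deque()    # (name, size) pairs, oldest first

    def use(self, name, size):
        """Take `size` bytes now, or queue up; return True if granted."""
        if not self.waiters and self.used + size <= self.limit:
            self.used += size
            return True
        self.waiters.append((name, size))
        return False

    def release(self, size):
        """Free `size` bytes; return the names of waiters to wake."""
        self.used -= size
        woken = []
        # Wake strictly in order, and only if the request actually fits.
        while self.waiters and self.used + self.waiters[0][1] <= self.limit:
            name, need = self.waiters.popleft()
            self.used += need
            woken.append(name)
        return woken
```

A waiter whose request still does not fit is never woken, and it never loses its place at the head of the queue, which is the behavior fiber_cond could not provide.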

fac38eec

vinyl: move transaction size sanity check to quota · a94d2770

Vladimir Davydov authored 6 years ago

There's a sanity check in vinyl_engine_prepare, which checks if the
transaction size is less than the configured limit and fails without
waiting for quota if it isn't. Let's move this check to vy_quota_use,
because it's really a business of the quota object. This implies that
vy_quota_use has to set diag to differentiate this error from timeout.

a94d2770

vinyl: minor refactoring of quota methods · ee4ea944

Vladimir Davydov authored 6 years ago

The refactoring is targeted at facilitating introduction of rate
limiting within the quota class. It moves code blocks around, factors
out some blocks in functions, and improves comments. No functional
changes.

Needed for #1862

ee4ea944

vinyl: factor load regulator out of quota · 90ffaa8d

Vladimir Davydov authored 6 years ago

It turned out that throttling isn't going to be as simple as maintaining
the write rate below the estimated dump bandwidth, because we also need
to take into account whether compaction keeps up with dumps. Tracking
compaction progress isn't a trivial task and mixing it in a module
responsible for resource limiting, which vy_quota is, doesn't seem to be
a good idea. Let's factor out the related code into a separate module
and call it vy_regulator. Currently, the new module only keeps track of
the write rate and the dump bandwidth and sets the memory watermark
accordingly, but soon we will extend it to configure throttling as well.

Since write rate and dump bandwidth are now a part of the regulator
subsystem, this patch renames 'quota' entry of box.stat.vinyl() to
'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether,
because memory usage is reported under 'memory.level0' while the limit
can be read from box.cfg.vinyl_memory, and renames 'use_rate' to
'write_rate', because the latter seems to be a more appropriate name.

Needed for #1862

90ffaa8d

Oct 02, 2018

vinyl: add helper to start scheduler and enable quota on startup · 7d298e53

Vladimir Davydov authored 6 years ago

There are three places where we start the scheduler fiber and enable the
configured memory quota limit: local bootstrap, remote bootstrap, and
local recovery completion. I'm planning to add more code there so let's
factor it out now.

7d298e53

Sep 26, 2018

replication: don't stop syncing on configuration errors · 4baa71bc

Vladimir Davydov authored 6 years ago

When replication is restarted with the same replica set configuration
(i.e. box.cfg{replication = box.cfg.replication}), there's a chance that
an old relay will be still running on the master at the time when a new
applier tries to subscribe. In this case the applier will get an error:

  main/152/applier/localhost:62649 I> can't join/subscribe
  main/152/applier/localhost:62649 xrow.c:891 E> ER_CFG: Incorrect value for
      option 'replication': duplicate connection with the same replica UUID

Such an error won't stop the applier - it will keep trying to reconnect:

  main/152/applier/localhost:62649 I> will retry every 1.00 second

However, it will stop synchronization so that box.cfg() will return
without an error, but leave the replica in the orphan mode:

  main/151/console/::1:42606 C> failed to synchronize with 1 out of 1 replicas
  main/151/console/::1:42606 C> entering orphan mode
  main/151/console/::1:42606 I> set 'replication' configuration option to
    "localhost:62649"

In a second, the stray relay on the master will probably exit and the
applier will manage to subscribe so that the replica will leave the
orphan mode:

  main/152/applier/localhost:62649 C> leaving orphan mode

This is very annoying, because there's no need to enter the orphan mode
in this case - we could as well keep trying to synchronize until the
applier finally manages to subscribe or replication_sync_timeout is
triggered.

So this patch makes appliers enter "loading" state on configuration
errors, the same state they enter if they detect that bootstrap hasn't
finished yet. This guarantees that configuration errors, like the one
above, won't break synchronization and leave the user gaping at the
unprovoked orphan mode.

Apart from the issue in question (#3636), this patch also fixes spurious
replication-py/multi test failures that happened for exactly the same
reason (#3692).

Closes #3636
Closes #3692

4baa71bc

replication: fix recoverable error reporting · 98449ced

Vladimir Davydov authored 6 years ago

First, we print "will retry every XX second" to the log after an error
message only for socket and system errors although we keep trying to
establish a replication connection after configuration errors as well.
Let's print this message for those errors too to avoid confusion.

Second, in case we receive an error in reply to SUBSCRIBE command, we
log "can't read row" instead of "can't join/subscribe". This happens,
because we switch an applier to SYNC/FOLLOW state before receiving a
reply to SUBSCRIBE command. Fix this by updating an applier state only
after successfully subscribing.

Third, we detect duplicate connections coming from the same replica on
the master only after sending a reply to SUBSCRIBE command, that is in
relay_subscribe rather than in box_process_subscribe. This results in
"can't read row" being printed to the replica's log even though it's
actually a SUBSCRIBE error. Fix this by moving the check where it
actually belongs.

98449ced

Sep 25, 2018

recovery: fix incorrect handling of empty-body requests. · f8956e05

Serge Petrenko authored 6 years ago

In some cases no-ops are written to xlog. They have no effect but are
needed to bump lsn.

Some time ago (see commit 89e5b784) such
ops were made bodiless, but empty-body requests are not handled in
xrow_header_decode(). This leads to recovery errors in a special case:
when a multi-statement transaction containing no-ops is written to
xlog, then upon recovery from that xlog all data from the end of the
no-op up to the start of the next transaction becomes the no-op's body
and so is effectively ignored. Here's an example of `tarantoolctl cat`
output showing this (BODY contains the next request's data):

    ---
    HEADER:
      lsn: 5
      replica_id: 1
      type: NOP
      timestamp: 1536656270.5092
    BODY:
      type: 3
      timestamp: 1536656270.5092
      lsn: 6
      replica_id: 1
    ---
    HEADER:
      type: 0
    ...

This patch handles no-ops correctly in xrow_header_decode().

@locker: refactored the test case so as not to restart the server for
a second time.

Closes #3678

f8956e05

tarantoolctl: fix cat and play for empty body requests · 24a87ff2

Serge Petrenko authored 6 years ago

If space.before_replace returns the old tuple, the operation turns into
no-op, but is still written to WAL as IPROTO_NOP for the sake of
replication. Such a request doesn't have a body, and tarantoolctl failed
to parse such requests in `tarantoolctl cat` and `tarantoolctl play`.
Fix this by checking whether a request has a body. Also skip such
requests in `play`, since they have no effect, and, while we're at it,
make sure `play` and `cat` do not read excess rows with lsn>=to in case
these rows are skipped.

Closes #3675
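
The filtering rules described above can be sketched like this. It is an illustrative Python model, not the tarantoolctl Lua code: the row shape and the function name are assumptions, and rows are assumed to arrive in increasing lsn order.

```python
def rows_to_play(rows, lsn_from=0, lsn_to=float("inf")):
    """Yield the rows worth replaying from an iterable of rows.

    Each row is a hypothetical {"HEADER": {...}, "BODY": {...}} mapping;
    a no-op row simply has no "BODY" key.
    """
    for row in rows:
        lsn = row["HEADER"]["lsn"]
        # Check the upper bound first so that skipped rows cannot make
        # us read past it.
        if lsn >= lsn_to:
            break
        if lsn < lsn_from:
            continue
        if row.get("BODY") is None:
            continue              # bodiless no-op: nothing to apply
        yield row
```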

24a87ff2

Sep 22, 2018

test: remove files created in system tmp directory · 4e705085

Vladimir Davydov authored 6 years ago

There are a few tests that create files in the system tmp directory
and don't delete them. This is contemptible - tests shouldn't leave
any traces on the host. Fix those tests.

Closes #3688

4e705085

fio: fix fio.rmtree not removing invalid symbolic link · 6d188576

Vladimir Davydov authored 6 years ago

fio.rmtree should use lstat instead of stat, otherwise it won't be
able to remove a directory if there's a symbolic link pointing to a
non-existent file.

The test case will be added to app/fio.test.lua by the following commit,
which is aimed at cleaning up /tmp directory after running tests.
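
The difference between stat and lstat can be shown with a small Python analogue. This is not the fio.rmtree Lua code, only a sketch of the same idea using the standard os module: lstat describes the link itself, so a link to a non-existent file can still be inspected and removed.

```python
import os
import stat


def rmtree(path):
    """Recursively remove `path` without following symbolic links."""
    for name in os.listdir(path):
        child = os.path.join(path, name)
        # os.stat would raise FileNotFoundError for a link whose target
        # does not exist; os.lstat looks at the link itself.
        mode = os.lstat(child).st_mode
        if stat.S_ISDIR(mode):
            rmtree(child)
        else:
            os.unlink(child)
    os.rmdir(path)
```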

6d188576

test: fix spurious box/access_sysview test failure · 144c58b3

Vladimir Davydov authored 6 years ago

Due to a missing privilege revocation in box/errinj, box/access_sysview
fails if executed after it.

Fixes commit af6b554b ("test: remove universal grants from tests").

144c58b3

Remove key_def_new_with_parts from exports · fbd80f70

Vladimir Davydov authored 6 years ago

Closes #3311

fbd80f70

box: unify key_def constructing procedure · 64263d26

Vladimir Davydov authored 6 years ago

Currently, there are two ways of creating a new key definition object
apart from copying (key_def_dup): either use key_def_new_with_parts,
which takes definition of all key parts and returns a ready to use
key_def, or allocate an empty key_def with key_def_new and then fill it
up with key_def_set_part. The latter method is rather awkward: because
of its existence key_def_set_part has to detect if all parts have been
set and initialize comparators if so. It is only used in schema_init,
which could as well use key_def_new_with_parts without making the code
any more difficult to read than it is now.

That being said, let us:
 - Make schema_init use key_def_new_with_parts.
 - Delete key_def_new and bequeath its name to key_def_new_with_parts.
 - Simplify key_def_set_part: now it only initializes the given part
   while comparators are set by the caller once all parts have been set.

These changes should also make it easier to add json path to key_part.

64263d26

Sep 21, 2018

Revert "box: zap key_part_def struct" · e34638b3

Vladimir Davydov authored 6 years ago

This reverts commit ea3a2b5f.

Once we finally implement json path indexes, more fields that are
calculated at run time will have to be added to struct key_part, like
path hash or field offset. So it was actually a mistake to remove
key_part_def struct, as it will grow more and more different from
key_part. Conceptually having separate key_part_def and key_part is
consistent with other structures, e.g. tuple_field and field_def.
That said, let's bring key_part_def back. Sorry for the noise.

e34638b3