- Apr 18, 2019
-
-
Vladimir Davydov authored
If a tarantool instance exits while checkpointing is in progress, the memtx checkpoint thread, which writes the snap file, can access already freed data resulting in a crash. Let's fix this the same way we did for relay and vinyl threads - simply cancel the thread forcefully and wait for it to terminate. Closes #4170
-
Vladimir Davydov authored
- Add REQUESTS.current to report the number of requests currently in flight, because it's useful for understanding whether we need to increase box.cfg.net_msg_max.
- Add REQUESTS.{rps,total}, because knowing the number of requests processed per second can come in handy for performance analysis.
- Add CONNECTIONS.{rps,total} that show the number of connections opened per second and in total. Those are not really necessary, but without them the output looks kinda lopsided.

Closes #4150

@TarantoolBot document
Title: Document new box.stat.net fields

Here's the list of the new fields:
- `CONNECTIONS.rps` - number of connections opened per second recently (for the last 5 seconds).
- `CONNECTIONS.total` - total number of connections opened so far.
- `REQUESTS.current` - number of requests in flight (this is what's limited by `box.cfg.net_msg_max`).
- `REQUESTS.rps` - number of requests processed per second recently (for the last 5 seconds).
- `REQUESTS.total` - total number of requests processed so far.

`CONNECTIONS.rps`, `CONNECTIONS.total`, `REQUESTS.rps`, `REQUESTS.total` are reset by `box.stat.reset()`.

Example of the new output:
```
---
- SENT:
    total: 5344924
    rps: 840212
  CONNECTIONS:
    current: 60
    rps: 148
    total: 949
  REQUESTS:
    current: 17
    rps: 1936
    total: 12139
  RECEIVED:
    total: 240882
    rps: 38428
...
```
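The "per second recently (for the last 5 seconds)" semantics can be illustrated with a minimal sliding-window counter. This is a conceptual sketch in Python, not Tarantool's actual C implementation; the class and method names are made up for illustration.

```python
import collections
import time

class RpsCounter:
    """Sliding-window rate counter: rps() is the average event rate
    over the last `window` seconds; total counts all events ever seen."""

    def __init__(self, window=5):
        self.window = window
        self.total = 0
        self.events = collections.deque()  # timestamps of recent events

    def hit(self, now=None):
        """Record one event (a request processed, a connection opened)."""
        now = time.monotonic() if now is None else now
        self.total += 1
        self.events.append(now)

    def rps(self, now=None):
        """Events per second over the last `window` seconds."""
        now = time.monotonic() if now is None else now
        # Expire timestamps that fell out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) / self.window
```

With one hit per second for 10 seconds, `total` is 10 while `rps` reflects only the last 5 seconds' worth of events.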
-
- Apr 17, 2019
-
-
Konstantin Osipov authored
There can be a lot of small files with vinyl.
-
Konstantin Osipov authored
Increment the min range size to 128MB to reduce the number of open files per process in a typical install.
-
d.sharonov authored
Fixes #4165
-
- Apr 16, 2019
-
-
Vladimir Davydov authored
In contrast to vinyl_iterator, vinyl_index_get doesn't take a reference to the LSM tree while reading from it. As a result, if the LSM tree is dropped in the meantime, vinyl_index_get will crash. Fix this issue by surrounding vy_get with vy_lsm_ref/unref. Closes #4109
-
Vladimir Davydov authored
To propagate changes applied to a space while a new index is being built, we install an on_replace trigger. In case the on_replace trigger callback fails, we abort the DDL operation. The problem is the trigger may yield, e.g. to check the unique constraint of the new index. This opens a time window for the DDL operation to complete and clear the trigger. If this happens, the trigger will try to access the outdated build context and crash:

| #0 0x558f29cdfbc7 in print_backtrace+9
| #1 0x558f29bd37db in _ZL12sig_fatal_cbiP9siginfo_tPv+1e7
| #2 0x7fe24e4ab0e0 in __restore_rt+0
| #3 0x558f29bfe036 in error_unref+1a
| #4 0x558f29bfe0d1 in diag_clear+27
| #5 0x558f29bfe133 in diag_move+1c
| #6 0x558f29c0a4e2 in vy_build_on_replace+236
| #7 0x558f29cf3554 in trigger_run+7a
| #8 0x558f29c7b494 in txn_commit_stmt+125
| #9 0x558f29c7e22c in box_process_rw+ec
| #10 0x558f29c81743 in box_process1+8b
| #11 0x558f29c81d5c in box_upsert+c4
| #12 0x558f29caf110 in lbox_upsert+131
| #13 0x558f29cfed97 in lj_BC_FUNCC+34
| #14 0x558f29d104a4 in lua_pcall+34
| #15 0x558f29cc7b09 in luaT_call+29
| #16 0x558f29cc1de5 in lua_fiber_run_f+74
| #17 0x558f29bd30d8 in _ZL16fiber_cxx_invokePFiP13__va_list_tagES0_+1e
| #18 0x558f29cdca33 in fiber_loop+41
| #19 0x558f29e4e8cd in coro_init+4c

To fix this issue, let's recall that when a DDL operation completes, all pending transactions that affect the altered space are aborted by the space_invalidate callback. So to avoid the crash, we just need to bail out early from the on_replace trigger callback if we detect that the current transaction has been aborted.

Closes #4152
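The bail-out guard described above can be sketched as follows. This is an illustrative Python model with hypothetical names, not the real vy_build_on_replace C code: the point is only that the trigger checks the transaction's aborted flag before touching the (possibly stale) build context.

```python
class Txn:
    """Minimal stand-in for a transaction; space_invalidate sets
    is_aborted when the DDL operation completes."""
    def __init__(self):
        self.is_aborted = False

def build_on_replace(txn, build_ctx, stmt):
    """On-replace trigger propagating a statement into the index being
    built. All names here are illustrative."""
    # The fix: if the DDL completed while this transaction yielded,
    # the transaction has been aborted and build_ctx may be stale --
    # bail out early instead of dereferencing freed state.
    if txn.is_aborted:
        return "skipped"
    build_ctx["index"].append(stmt)
    return "applied"
```

Once the transaction is aborted, further trigger invocations become no-ops rather than crashes.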
-
Cyrill Gorcunov authored
We use sizeof as a function in most of the code; fix this nit.
-
Roman Khabibov authored
Added a check that box.cfg() is called within an instance file. If box.cfg() is missing, tell the user the reason of the failure explicitly. Before this commit the error looked like: /usr/bin/tarantoolctl:541: attempt to index a nil value Closes #3953
-
- Apr 12, 2019
-
-
Vladislav Shpilevoy authored
The next patch on payloads needs to drop only packets containing certain sections, such as anti-entropy or dissemination. The new SWIM test transport filters make this easy to implement. Part of #3234
-
Vladislav Shpilevoy authored
At this moment the SWIM test harness implements its own fake file descriptor table, which is used unawares by the real SWIM code. Each fake fd has send and recv queues, and can delay and drop packets with a certain probability. But this is not going to be enough for new tests. We want to be able to drop packets with a specified content, from and to a specified direction. For that, the patch implements a filtering mechanism. Each fake fd now has a list of filters, applied one by one to each packet. If at least one filter wants to drop a packet, it is dropped. The filters know the packet content and direction: outgoing or incoming. For now only one filter exists - drop rate. It existed even before the patch, but now it is ported to the new API. Part of #3234
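The filter chain described above can be modeled in a few lines. This is a hedged Python sketch of the idea, not the C test harness's real API; `FakeFd`, `should_drop`, and `drop_rate_filter` are invented names.

```python
import random

OUTGOING, INCOMING = "out", "in"

class FakeFd:
    """Fake file descriptor with a chain of packet filters.
    Each filter is a callable (packet, direction) -> True to drop."""

    def __init__(self):
        self.filters = []

    def add_filter(self, flt):
        self.filters.append(flt)

    def should_drop(self, packet, direction):
        # A packet is dropped if at least one filter wants to drop it.
        return any(flt(packet, direction) for flt in self.filters)

def drop_rate_filter(rate, rng=random.random):
    """The pre-existing drop-rate behaviour, ported to the filter API:
    drop roughly `rate` percent of packets, in both directions."""
    return lambda packet, direction: rng() * 100 < rate
```

A content-and-direction filter is then just another callable, e.g. `lambda pkt, d: d == OUTGOING and b"dissemination" in pkt`.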
-
Vladislav Shpilevoy authored
Move the 'update' logic into a separate function, because in the next commits it is going to become more complicated due to the payload introduction, and it would be undesirable to clog the upsert() function with payload-specific code. Part of #3234
-
Vladislav Shpilevoy authored
The event_bin and member_bin binary packet structures were designed separately for different purposes. Initially event_bin was conceived as having the same fields as the passport + an optional old UUID + an optional payload. On the other hand, member_bin was supposed to store the passport + a mandatory payload. But the old UUID was cut off in favour of another way of UUID update, and the payload turned out to be optional in both anti-entropy and dissemination. This means that member_bin and event_bin are not needed anymore as separate structures. This commit replaces them with the passport completely. Part of #3234
-
Vladislav Shpilevoy authored
The new function is swim_decode_bin(), and it is going to be used to safely decode payloads - arbitrary binary data disseminated alongside all the other SWIM member attributes. Part of #3234
-
Alexander Turenko authored
Before this commit it always returns false. Fixes #4091.
-
Cyrill Gorcunov authored
Open-coded constants are not good for long-term support; make them named constants. Moreover, there was a typo in a comment: fid = 0 is reserved as well.
-
Cyrill Gorcunov authored
The constant is leftover from 08585902
-
Serge Petrenko authored
The test is flaky under high load (e.g. when it is run in parallel with a lot of workers). Make it less dependent on arbitrary timeouts to improve stability. Part of #4134
-
Serge Petrenko authored
This part of the test is flaky when tests are run in parallel, besides, it is quite big on its own, so extract it into a separate file to add more flexibility in running tests and to make finding problems easier. Part of #4134
-
Nikita Pettik authored
Before this patch an SQL statement which involves FK constraint creation or drop didn't increment rowcount:

box.execute("ALTER TABLE t ADD CONSTRAINT fk1 FOREIGN KEY (b) REFERENCES parent (a);")
---
- rowcount: 0
...

This patch fixes the misbehaviour: VDBE accidentally didn't enable counting changes during ALTER TABLE ADD/DROP constraint. Closes #4130
-
Cyrill Gorcunov authored
When we allocate a new fiber we clear the whole structure right after, so there is no need to call memset again: the coro context is already full of zeros. Note the coro context is close to 1K in size, and the redundant memset here is a real penalty.
-
avtikhon authored
Disabled the wal_off/iterator_lt_gt.test.lua test: this performance test needs to be reorganized into a separate mode on a standalone host. Currently the test doesn't show any real issue, but it breaks the testing from time to time with errors like:

[010] wal_off/iterator_lt_gt.test.lua [ fail ]
[010]
[010] Test failed! Result content mismatch:
[010] --- wal_off/iterator_lt_gt.result Fri Apr 12 10:30:43 2019
[010] +++ wal_off/iterator_lt_gt.reject Fri Apr 12 10:36:30 2019
[010] @@ -79,7 +79,9 @@
[010] ...
[010] too_longs
[010] ---
[010] -- []
[010] +- - 'Some of the iterators takes too long to position: 0.074278'
[010] + - 'Some of the iterators takes too long to position: 0.11786'
[010] + - 'Some of the iterators takes too long to position: 0.053848'
[010] ...
[010] s:drop()
[010] ---
[010]
[010] Last 15 lines of Tarantool Log file [Instance "wal"][/tarantool/test/var/010_wal_off/wal.log]:

See #2539
-
Konstantin Osipov authored
When cfg.readahead is large, iproto_reset_input() has a tendency to leave all input buffers large for a long time. On the other hand, the input buffer is not recycled until its maximal size is reached. This leads to a case when we keep shifting the read position towards the end of the buffer, fragmenting memory and growing the buffer to readahead size, even if input packets and batches are actually small. Suggested by Alexander Turenko.
-
Vladimir Davydov authored
When initiating memory dump, print how much memory is going to be dumped, the expected dump rate, ETA, and the recent write rate. Upon dump completion, print the observed dump rate in addition to dump size and duration. This should help debugging stalls on memory quota. Example:

| 2019-04-12 12:03:25.092 [30948] main/115/lua I> dumping 39659424 bytes, expected rate 6.0 MB/s, ETA 6.3 s, recent write rate 4.2 MB/s
| 2019-04-12 12:03:25.101 [30948] main/106/vinyl.scheduler I> 512/1: dump started
| 2019-04-12 12:03:25.102 [30948] vinyl.dump.0/104/task I> writing `./512/1/00000000000000000008.run'
| 2019-04-12 12:03:26.487 [30948] vinyl.dump.0/104/task I> writing `./512/1/00000000000000000008.index'
| 2019-04-12 12:03:26.547 [30948] main/106/vinyl.scheduler I> 512/1: dump completed
| 2019-04-12 12:03:26.551 [30948] main/106/vinyl.scheduler I> 512/0: dump started
| 2019-04-12 12:03:26.553 [30948] vinyl.dump.0/105/task I> writing `./512/0/00000000000000000010.run'
| 2019-04-12 12:03:28.026 [30948] vinyl.dump.0/105/task I> writing `./512/0/00000000000000000010.index'
| 2019-04-12 12:03:28.100 [30948] main/106/vinyl.scheduler I> 512/0: dump completed
| 2019-04-12 12:03:28.100 [30948] main/106/vinyl.scheduler I> dumped 33554332 bytes in 3.0 s, rate 10.6 MB/s
-
Vladimir Davydov authored
After we retrieve a statement from a secondary index, we always do a lookup in the primary index to get the full tuple corresponding to the found secondary key. It may turn out that the full tuple doesn't match the secondary key, which means the key was overwritten, but the DELETE statement hasn't been propagated yet (aka deferred DELETE). Currently, there's no way to figure out how often this happens as all tuples read from an LSM tree are accounted under 'get' counter. So this patch splits 'get' in two: 'get', which now accounts only tuples actually returned to the user, and 'skip', which accounts skipped tuples.
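The get/skip split described above can be illustrated with a toy secondary-index lookup. This is a hedged Python sketch of the accounting logic, with invented names; it is not Tarantool's vinyl C code.

```python
def secondary_get(secondary, primary, key, stats):
    """Look up full tuples by a secondary key.

    secondary maps secondary key -> list of primary keys; primary maps
    primary key -> full tuple (a dict with an 'sk' field). A stale
    secondary entry is one whose full tuple no longer matches the key:
    the key was overwritten but its DELETE is deferred."""
    results = []
    for pk in secondary.get(key, []):
        full = primary.get(pk)
        if full is None or full["sk"] != key:
            # Tuple was overwritten; the deferred DELETE hasn't been
            # propagated to the secondary index yet. Account as 'skip'.
            stats["skip"] += 1
            continue
        # Only tuples actually returned to the user count as 'get'.
        stats["get"] += 1
        results.append(full)
    return results
```

A high `skip`/`get` ratio would then indicate many deferred DELETEs being skipped on reads.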
-
- Apr 11, 2019
-
-
Vladislav Shpilevoy authored
It turned out that it is not called. But probably it should be, in order to catch more errors.
-
Vladimir Davydov authored
Currently, latency accounting and warning live in vy_point_lookup and vy_read_iterator_next. As a result, we don't take into account the lookup of the full tuple by a partial one, while it can take quite a while, especially if there are lots of deferred DELETE statements we have to skip. So this patch moves latency accounting to the upper level, namely to vy_get and vinyl_iterator_{primary,secondary}_next. Note, as a side effect, we now always print full tuples to the log in the "too long" warning. Besides, we strip the LSN and statement type, as those don't make much sense anymore.
-
Vladimir Davydov authored
box.stat.SELECT accounts index.get and index.select, but not index.pairs, which is confusing since pairs() may be used even more often than select() in a Lua application.
-
Vladislav Shpilevoy authored
During a SWIM round a message is being handed out consisting of at most 4 sections. Parts of the message change rarely along with a member attribute update, or with removal of a member. So it is possible to cache the message and send it during several round steps in a row. Or even do not rebuild it the whole round. Part of #3234
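The caching idea above can be sketched as a build-on-demand message with explicit invalidation. This is an illustrative Python model with made-up names, not the SWIM C implementation: the cached message survives round steps and is rebuilt only after a member attribute changes.

```python
class RoundMessage:
    """Cache the round message across round steps; rebuild it only when
    the member table changed (attribute update, member removal)."""

    def __init__(self, members):
        self.members = members
        self._cached = None
        self.builds = 0  # how many times the message was (re)built

    def invalidate(self):
        """Called on any member update or removal."""
        self._cached = None

    def message(self):
        if self._cached is None:
            self.builds += 1
            # Stand-in for encoding the up-to-4 message sections.
            self._cached = ";".join(sorted(self.members))
        return self._cached
```

Repeated round steps reuse the cached encoding; only an invalidation forces a rebuild.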
-
Konstantin Osipov authored
SQL is still using a sqlite legacy enum and not enum field_type from NoSQL to identify types. This creates a mess with type identification, where the original column/literal type is lost during expression evaluation. Until we have proper type arithmetic and preserve field_type in expressions, coerce the string return value of the typeof() function, which queries the SQL expression value type, to the closest NoSQL type name. Rename:

real -> number
text -> string
blob -> scalar
-
Vladislav Shpilevoy authored
After turning on a spell checker there were found lots of typos. The commit fixes them.
-
Vladislav Shpilevoy authored
During a merge it was accidentally set to too low a number. Follow up 8fe05fdd (swim: expose ping broadcast API)
-
Vladislav Shpilevoy authored
The previous commit has introduced an API to broadcast SWIM packets. This commit harnesses it in order to allow the user to do initial discovery in a cluster, when member tables are empty and UUIDs aren't ready at hand. Part of #3234
-
Vladislav Shpilevoy authored
When a cluster is just created, no one knows anyone. Broadcast helps to establish some initial relationships between members. This commit introduces only an interface to create broadcast tasks from SWIM code. The next commit uses this interface to implement ping broadcast. Part of #3234
-
Vladislav Shpilevoy authored
In the original SWIM paper the incarnation is just a way of refuting old statuses, nothing more. It is not designed as a versioning system for a member and its non-status attributes. But Tarantool harnesses the incarnation for a wider range of tasks. In Tarantool's implementation the incarnation (in theory) refutes old statuses, old payloads, and old addresses. But it turned out that before the patch an address update did not touch the incarnation. Because of that it was possible to overwrite a new address with the old one. The patch fixes it with a mere increment of the incarnation on each address update. The fix is simple because the current SWIM implementation always carries the tuple {incarnation, status, address} together, as one big attribute. It is not so for payloads, so for them an analogous fix will be much more tricky. Follow-up for f510dc6f (swim: introduce failure detection component)
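The refutation rule and the fix can be sketched as follows. This is a hedged Python model of the mechanism, not the SWIM C code; the function names and dict layout are invented.

```python
def apply_update(member, incoming):
    """Accept an incoming {incarnation, status, address} tuple only if
    it carries a strictly newer incarnation; otherwise it is stale
    information still circulating in the cluster and is refuted."""
    if incoming["incarnation"] <= member["incarnation"]:
        return False
    member.update(incoming)
    return True

def change_address(member, new_addr):
    # The fix: bump the incarnation on each address update, so that
    # the old address, carried in older packets, can no longer
    # overwrite the new one via apply_update().
    member["incarnation"] += 1
    member["address"] = new_addr
```

Without the increment in change_address(), a stale packet with the old address and the same incarnation could win.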
-
Vladimir Davydov authored
Before rebootstrapping a replica, the admin may delete it from the _cluster space on the master. If he doesn't make a checkpoint after that, rebootstrap will fail with

E> ER_LOCAL_INSTANCE_ID_IS_READ_ONLY: The local instance id 2 is read-only

This is sort of unexpected. Let's fix this issue by allowing replicas to change their id during join.

A note about the replication/misc test. The test used this error to check that a master instance doesn't crash in case a replica fails to bootstrap. However, we can simply set mismatching replicaset UUIDs to get the same effect.

Closes #4107
-
Vladimir Davydov authored
Currently, the garbage collector works with vclock signatures and doesn't take into account vclock components. This works as long as the caller (i.e. relay) makes sure that it doesn't advance a consumer associated with a replica unless its acknowledged vclock is greater than or equal to the vclock of a WAL file fed to it. The bug is that it does not - it only compares vclock signatures. As a result, if a replica has some local changes or changes pulled from other members of the cluster, which render its signature greater, the master may remove files that are still needed by the replica, permanently breaking replication and requiring rebootstrap. I guess the proper fix would be teaching the garbage collector to operate on vclock components rather than signatures, but it's rather difficult to implement. This patch is a quick fix, which simply replaces the vclock signature comparison in relay with vclock_compare. Closes #4106
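Why the signature comparison is unsound can be shown with a small sketch. This is an illustrative Python model of the two comparisons under the standard vclock definition (a map of instance id to LSN, signature = sum of components); it is not Tarantool's vclock C API.

```python
def signature(vclock):
    """Scalar signature: the sum of all components."""
    return sum(vclock.values())

def vclock_compare(a, b):
    """Component-wise partial order on vclocks:
    -1 if a < b, 1 if a > b, 0 if equal, None if incomparable."""
    keys = set(a) | set(b)
    le = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    ge = all(a.get(k, 0) >= b.get(k, 0) for k in keys)
    if le and ge:
        return 0
    if le:
        return -1
    if ge:
        return 1
    return None  # incomparable: neither dominates
```

E.g. a replica that acknowledged {1: 5, 2: 3} has a greater signature (8) than a WAL covering {1: 7} (7), yet it still needs rows 6-7 from instance 1: the vclocks are incomparable, so the consumer must not be advanced.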
-
Vladimir Davydov authored
This reverts commit b5b4809c. The commit reverted by this patch made relay advance the consumer associated with a replica right on subscribe. This is wrong, because the current implementation of the garbage collector operates on vclock signatures so that if a replica reconnects with a greater signature than it had when it last disconnected (e.g. due to replica local changes or changes pulled from other members of the cluster), the garbage collector may delete WAL files still needed by the replica, breaking replication. There are two ways to fix this problem. The first and the most difficult way is to teach the garbage collector to work with vclocks, i.e. rather than simply sorting all consumers by signature and using the smallest signature for garbage collection, maintain a vclock each component of which is the minimum among corresponding components of all registered consumers. The second (easy) way is to advance a consumer only if its acknowledged vclock is greater than or equal to the vclock of a WAL fed to it. This way the garbage collector still works with vclock signatures and it's a responsibility of the caller (i.e. relay) to ensure that consumers are advanced correctly. I took on the second way for now, because I couldn't figure out an efficient way to implement the first. This implies reverting the above mentioned commit and reopening #4034 - sporadic replication/gc.test.lua failure - which will have to be fixed some other way. See the next patch for the rest of the fix and the test. Needed for #4106
-
Vladimir Davydov authored
L1 runs are usually the most frequently read and smallest runs at the same time so we gain nothing by compressing them. Closes #2389
-
Vladimir Davydov authored
The way xlog write options (sync_interval and others) are set is a mess: if an xlog is created with xlog_create(), we overwrite them explicitly; if an xlog is created with xdir_create_xlog(), we inherit parameters from the xdir, which sets them depending on the xdir type (SNAP, XLOG, or VYLOG), but sometimes we overwrite them explicitly as well. The more options we add, the worse it gets. To clean it up, let's add an auxiliary structure combining all xlog write options and pass it to xlog_create() and xdir_create() everywhere.
-