- Apr 05, 2018
-
-
Vladimir Davydov authored
If the primary key is modified, we schedule rebuild of all non-unique (including nullable) secondary TREE indexes. This is valid for memtx, but is not quite right for vinyl. For vinyl we have to rebuild all secondary indexes, because they are all non-clustered (i.e. point to tuples via primary key parts). This doesn't result in any bugs for now, because rebuild of vinyl indexes is not supported, but hopefully this is going to change soon. So let's introduce a new virtual index method, index_vtab::depends_on_pk, which returns true iff the index needs to be updated if the primary key changes, and define this new method for vinyl and memtx TREE indexes.
-
Vladimir Davydov authored
The new method is called after successful update of index definition. It is passed the signature of the WAL record that committed the operation. It will be used by Vinyl to update key definition in vylog.
-
Konstantin Osipov authored
-
Ilya Markov authored
* Remove rewriting of the default logger format in case of the syslog option. * Add facility option parsing and use the parsed result to format messages according to RFC 3164. Possible values and the default value of the syslog facility are taken from nginx (https://nginx.ru/en/docs/syslog.html). * Move initialization of the logger type and format function before initialization of the descriptor in log_XXX_init, so that we can test the format function of the syslog logger. Closes gh-3244.
-
- Apr 04, 2018
-
-
Vladimir Davydov authored
The only difference between format of UPSERT statements and format of other DML statements of the same index is that the former reserves one byte for UPSERT counter, which is needed to schedule UPSERT squashing. Since we store UPSERT counter on lsregion now, we don't need a special format for UPSERTs anymore. Remove it.
-
Vladimir Davydov authored
Currently, we store upsert counter in tuple metadata (that's what upsert_format is for), but since it's only relevant for tuples of the memory level, we can store it on lsregion, right before tuple data. Let's do it now so that we can get rid of upsert_format.
-
Kirill Yukhin authored
-
Alexander Turenko authored
Filed gh-3311 to remove this export soon. Fixes #3310.
-
- Apr 03, 2018
-
-
Konstantin Osipov authored
-
Vladimir Davydov authored
If the size of a transaction is greater than the configured memory limit (box.cfg.vinyl_memory), the transaction will hang on commit for 60 seconds (box.cfg.vinyl_timeout) and then fail with the following error message: "Timed out waiting for Vinyl memory quota". This is confusing. Let's fail such transactions immediately with an OutOfMemory error. Closes #3291
-
- Apr 02, 2018
-
-
Arseny Antonov authored
-
Arseny Antonov authored
-
- Mar 30, 2018
-
-
Konstantin Belyavskiy authored
In case of a sudden power loss, if data was not written to the WAL but was already sent to a remote replica, the local instance can't recover properly and the datasets diverge. Fix it by using the remote replica's data and LSN comparison. Based on @GeorgyKirichenko's proposal and @locker's race-free check. Closes #3210
-
Konstantin Belyavskiy authored
Stay in orphan (read-only) mode while the local vclock is lower than the master's to make sure that datasets are the same across the replica set. Update the replication/catch test to reflect the change. Suggested by @kostja. Needed for #3210
-
Vladimir Davydov authored
Closes #3148
-
Vladislav Shpilevoy authored
The text console tried to detect SIGPIPE before it could be raised, by doing a read before each write. If a socket is readable but read() returns 0, then it is closed, and writing to it can raise SIGPIPE. But Tarantool ignores SIGPIPE, so the process will not be terminated; write() just returns -1. The original code checked for SIGPIPE because, when Tarantool is run under a debugger (gdb or lldb), the debugger by default sets its own signal handlers, and SIGPIPE terminates the process. But debugger settings can be changed to ignore SIGPIPE too, so let's remove this overengineering from the console code.
-
Vladislav Shpilevoy authored
If a remote host is unreachable on the first connection attempt, and reconnect_after is set, then netbox state machine enters error state, but it must enter error_reconnect. Do it. The bug was introduced by me in d2468dac.
-
Vladimir Davydov authored
EV_USE_REALTIME and EV_USE_MONOTONIC, which force libev to use clock_gettime, are enabled automatically on Linux, but not on OS X. We used to forcefully enable them for performance reasons, but this broke compilation on certain OS X versions and so was disabled by commit d36ba279 ("Fix gh-1777: clock_gettime detected but unavailable in macos"). Today we need these features enabled not just for performance, but also to avoid crashes when the time changes on the host - see issue #2527 and commit a6c87bf9 ("Use ev_monotonic_now/time instead of ev_now/time for timeouts"). Fortunately, we have the cmake-defined macro HAVE_CLOCKGETTIME_DECL, which is set if clock_gettime is available. Let's enable EV_USE_REALTIME and EV_USE_MONOTONIC if this macro is defined. Closes #3299
-
- Mar 29, 2018
-
-
Vladislav Shpilevoy authored
-
Vladimir Davydov authored
When a vylog transaction is rolled back, we always reset vy_log.tx_size. Generally speaking, this is incorrect, as rollback doesn't necessarily remove all pending records from the tx buffer - there may still be records committed with vy_log_tx_try_commit() that were left in the buffer due to write errors. We don't roll back such records, but we still reset tx_size, which leads to a discrepancy between vy_log.tx_size and the actual length of the vy_log.tx list, which further on results in an assertion failure: src/box/vy_log.c:698: vy_log_flush: Assertion `i < vy_log.tx_size' failed. We need vy_log.tx_size to allocate an xrow_header array of a proper size so that we can flush pending vylog records to disk. This isn't a hot path, because vylog operations are rare. Besides, we iterate over all records anyway to fill the xrow_header array. So let's remove vy_log.tx_size altogether and instead calculate the vy_log.tx list length right in place.
-
Vladimir Davydov authored
Currently, we use mh_foreach, but each object is on an rlist, which is better suited for iteration.
-
Vladimir Davydov authored
The new method is called if index creation failed, either due to WAL write error or build error. It will be used by Vinyl to purge prepared LSM tree from vylog.
-
Vladislav Shpilevoy authored
-
Konstantin Osipov authored
-
Ilya Markov authored
The bug was that, when logging, we passed to the write function a number of bytes that may be greater than the size of the buffer. This may happen because, when formatting the log string, we use vsnprintf, which returns the number of bytes that would have been written to the buffer, not the actual number. Fix this by limiting the number of bytes passed to the write function. Closes #3248
-
Konstantin Osipov authored
-
Vladimir Davydov authored
To facilitate performance analysis, let's report not only 99th percentile, but also 50th, 75th, 90th, and 95th. Also, let's add microsecond-granular buckets to the latency histogram. Closes #3207
-
Ilya Markov authored
* Refactor tests. * Add ev_async and fiber_cond for thread-safe log_rotate usage. Follow up #3015
-
Ilya Markov authored
Fix a race condition in the test on log_rotate. The test opened a file that must be created by log_rotate and read from it. But as log_rotate is executed in a separate thread, the file may not be created, or the log line may not be written yet, by the time the test opens it. Fix this by waiting for the file to be created and for the line to be read.
-
Vladislav Shpilevoy authored
Print a warning about that. After a while the console support will be deleted from netbox.
-
Vladislav Shpilevoy authored
Netbox console support complicates both netbox and console. Let's use sockets directly for the text protocol. Part of #2677
-
Vladislav Shpilevoy authored
It is needed to create a binary console connection, when a socket is already created and a greeting is read and decoded.
-
Vladimir Davydov authored
As was pointed out earlier, the bloom spectrum concept is rather dubious: its overhead for a reasonable false positive rate is about 10 bytes per record, while storing all hashes in an array takes only 4 bytes per record. So one can stash all hashes and count records first, then create the optimal bloom filter and add all hashes to it.
-
Vladimir Davydov authored
When we check if a multi-part key is hashed in a bloom filter, we check all its sub keys as well, so the resulting false positive rate will be equal to the product of the false positive rates of the bloom filters created for each sub key. The false positive rate of a bloom filter is given by the formula: f = (1 - exp(-kn/m)) ^ k where m is the number of bits in the bloom filter, k is the number of hash functions, and n is the number of elements hashed in the filter. By varying n, we can estimate the false positive rate of an existing bloom filter when used for a greater number of elements; in other words, we can estimate the false positive rate of a bloom filter created for checking sub keys when used for checking full keys. Knowing this, we can adjust the target false positive rate of a bloom filter used for checking keys of a particular length based on the false positive rates of the bloom filters used for checking its sub keys. This will reduce the number of hash functions required to conform to the configured false positive rate and hence the bloom filter size. Follow-up #3177
-
Vladimir Davydov authored
Currently, we store and use bloom only for full-key lookups. However, there are use cases when we can also benefit from maintaining bloom filters for partial keys as well - see #3177 for example. So this patch replaces the current full-key bloom filter with a multipart one, which is basically a set of bloom filters, one per each partial key. Old bloom filters stored on disk will be recovered as is, so users will see the benefit of this patch only after major compaction takes place. When a key or tuple is checked against a multipart bloom filter, we check all its partial keys to reduce the false positive rate. Nevertheless, there's no size optimization for now. E.g. even if the cardinality of a partial key is the same as of the full key, we will still store two full-sized bloom filters, although we could probably save some space in this case by assuming that checking against the bloom corresponding to a partial key would reduce the false positive rate of full-key lookups. This is addressed later in the series. Before this patch we used a bloom spectrum object to construct a bloom filter. A bloom spectrum is basically a set of bloom filters ranging in size. The point of using a spectrum is that we don't know what the run size will be while we are writing it, so we create 10 bloom filters and choose the best of them after we are done. With the default bloom fpr of 0.05 this is a 10-byte overhead per record, which seems to be OK. However, if we try to optimize other parameters as well, e.g. the number of hash functions, the cost of a spectrum will become prohibitive. The funny thing is that a tuple hash is only 4 bytes long, which means that if we stored all hashes in an array and built a bloom filter after we'd written a run, we would reduce the memory footprint by more than half! And that would only slightly increase the run write time, as scanning a memory map of hashes and constructing a bloom filter is cheap in comparison to merging runs. Putting it all together, we stop using bloom spectrum in this patch; instead, we stash all hashes in a new bloom builder object and use them to build a perfect bloom filter after the run has been written and we know the cardinality of each partial key. Closes #3177
-
Vladimir Davydov authored
Suggested by @kostja
-
Vladimir Davydov authored
There's absolutely no point in using mmap() instead of malloc() for bitmap allocation - malloc() will fall back on mmap() anyway, provided the allocation is large enough. Note about the unit test: since we don't round the bloom filter size up to a multiple of the page size anymore, we have to use a more sophisticated hash function for the test to pass.
-
Vladimir Davydov authored
We filter out bloom filters from the test output, because they depend on the ICU version and hence the output may vary from one platform to another (see commit 0a37ccad "Filter out bloom_filter in vinyl/layout.test.lua"). However, using test_run for this is unreliable, because a bloom string can contain newline characters and hence be split into multiple lines in console output, in which case the filter won't work. Fix this by filtering bloom_filter manually.
-
Vladislav Shpilevoy authored
-
Kirill Shcherbatov authored
Netbox does not need nullability or collation info, but some customers do. Let's fill index parts with these fields. Fixes #3256
-