  1. Feb 08, 2019
    • replication: promote tx vclock only after successful wal write · 056deb2c
      Georgy Kirichenko authored
      Applier used to promote the vclock prior to applying the row. This
      led to a situation where a master's row would be skipped forever if
      an error occurred while applying it. However, some errors are
      transient, and we might be able to successfully apply the same row
      later.
      
      While we're at it, make the wal writer the only one responsible for
      advancing the replicaset vclock. It was already doing so for rows
      coming from the local instance. Besides, this makes the code
      cleaner, since we now advance the vclock directly from the wal
      batch reply, and it lets us get rid of unnecessary checks of
      whether the applier or wal has already advanced the vclock.
      
      Closes #2283
      Prerequisite #980
    • wal: do not promote wal vclock for failed writes · 066b929b
      Georgy Kirichenko authored
      Wal used to promote the vclock prior to writing the row. This led
      to a situation where a master's row would be skipped forever if an
      error occurred while writing it. However, some errors are
      transient, and we might be able to successfully apply the same row
      later. So we do not promote the writer vclock, in order to be able
      to restart replication from the point of failure.
      
      Obsoletes xlog/panic_on_lsn_gap.test.
      
      Needed for #2283
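      
      A minimal sketch of the write-then-promote ordering shared by this
      patch and the applier patch above; vclock_follow mirrors the vclock
      API, while write_row and the toy vclock struct are illustrative:
      ```
      #include <stdint.h>
      
      /* Toy vclock: one lsn per replica id; the real one lives in
       * src/box/vclock.h. */
      struct vclock { int64_t lsn[32]; };
      
      static void
      vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
      {
              vclock->lsn[replica_id] = lsn;
      }
      
      /* Hypothetical durable write: 0 on success, -1 on failure. */
      static int
      write_row(const char *row)
      {
              (void)row;
              return 0;
      }
      
      static int
      process_row(struct vclock *vclock, uint32_t replica_id, int64_t lsn,
                  const char *row)
      {
              /* Old order: vclock_follow() first, then write. A failed
               * write left the vclock advanced, so the row was skipped
               * forever on retry. */
              if (write_row(row) != 0)
                      return -1; /* vclock untouched: safe to retry later */
              /* New order: promote only after the write has succeeded. */
              vclock_follow(vclock, replica_id, lsn);
              return 0;
      }
      ```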
  2. Feb 07, 2019
  3. Feb 06, 2019
    • vinyl: use uncompressed run size for range split/coalesce/compaction · 3313009d
      Vladimir Davydov authored
      Historically, when considering splitting or coalescing a range or
      updating compaction priority, we used the sizes of compressed runs
      (see bytes_compressed). This makes the algorithms dependent on
      whether compression is used and how effective it is, which is
      weird, because compression is a way of storing data on disk - it
      shouldn't affect the way data is partitioned. E.g. if we turned off
      compression at the first LSM tree level, which would make sense
      because that level is relatively small, we would thereby affect the
      compaction algorithm.
      
      So let's use uncompressed run sizes when considering range tree
      transformations.
    • Fix tarantool -e "os.exit()" hang · 3a851430
      Serge Petrenko authored
      After the patch that made os.exit() execute on_shutdown triggers
      (see commit 6dc4c8d7), we relied on the on_shutdown triggers to
      break the ev_loop and exit tarantool. However, there is an
      auxiliary event loop run in tarantool_lua_run_script() to
      reschedule the fiber that executes the chunks of code passed via
      the -e option and then runs interactive mode. This event loop is
      started only to serve interactive mode, after the -e chunks have
      been executed. Make sure we don't start it if os.exit() was
      already called in one of the chunks.
      
      Closes #3966
    • Fix fiber_join() hang in case fiber_cancel() was called · d69c149f
      Serge Petrenko authored
      In case a fiber joining another fiber gets cancelled, it stays suspended
      forever and never finishes joining. This happens because fiber_cancel()
      wakes the fiber and removes it from all execution queues.
      Fix this by adding the fiber back to the wakeup queue of the joined
      fiber after each yield.
      
      Closes #3948
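      
      Schematically, the fix can be pictured like this; wake_queue_add,
      fiber_yield and fiber_is_dead are simplified stand-ins for the real
      fiber API:
      ```
      #include <stdbool.h>
      
      struct fiber;
      /* Placeholders for the real fiber API. */
      void wake_queue_add(struct fiber *joined, struct fiber *waiter);
      void fiber_yield(void);
      bool fiber_is_dead(struct fiber *f);
      
      void
      fiber_join_sketch(struct fiber *joined, struct fiber *self)
      {
              while (!fiber_is_dead(joined)) {
                      /* Re-subscribe on every iteration: a fiber_cancel()
                       * aimed at us removes us from all execution queues,
                       * so a one-shot subscription could strand us
                       * suspended forever. */
                      wake_queue_add(joined, self);
                      fiber_yield();
              }
              /* The joined fiber is dead; reap its result as usual. */
      }
      ```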
    • replication: downstream status reporting in box.info · fcf43533
      Serge Petrenko authored
      Start showing the downstream status for relays in the "follow"
      state.
      Also refactor lbox_pushrelay to unify the code for different relay
      states.
      
      Closes #3904
  4. Feb 05, 2019
  5. Feb 04, 2019
    • Update msgpuck library to fix compilation on clang · 76edf94b
      Vladimir Davydov authored
      The patch adds the missing -fPIC option for clang, without which
      the msgpuck library might fail to compile.
    • box: specify indexes in user-friendly form · a754980d
      Kirill Shcherbatov authored
      Implemented a more convenient interface for creating an index
      by JSON path. Instead of specifying the fieldno and a relative
      path, it is now possible to pass the full JSON path to the data.
      
      Closes #1012
      
      @TarantoolBot document
      Title: Indexes by JSON path
      Sometimes field data may have a complex document structure.
      When this structure is consistent across the whole space,
      you can create an index by JSON path.
      
      Example:
      s = box.schema.space.create('sample')
      format = {{'id', 'unsigned'}, {'data', 'map'}}
      s:format(format)
      -- explicit JSON index creation
      age_idx = s:create_index('age', {parts = {{2, 'number', path = "age"}}})
      -- user-friendly syntax for JSON index creation
      parts = {{'data.FIO["fname"]', 'str'}, {'data.FIO["sname"]', 'str'},
               {'data.age', 'number'}}
      info_idx = s:create_index('info', {parts = parts})
      s:insert({1, {FIO={fname="James", sname="Bond"}, age=35}})
    • box: introduce offset_slot cache in key_part · e2df0af2
      Kirill Shcherbatov authored
      tuple_field_by_part looks up the tuple_field corresponding to the
      given key part in tuple_format in order to quickly retrieve the
      offset of indexed data from the tuple field map. For regular
      indexes this operation is blazing fast; however, for JSON indexes
      it is not, as we have to parse the path to the data and then do
      multiple lookups in a JSON tree. Since tuple_field_by_part is used
      by comparators, we should strive to make this routine as fast as
      possible for all kinds of indexes.
      
      This patch introduces an optimization that is supposed to make
      tuple_field_by_part for JSON indexes as fast as it is for regular
      indexes in most cases. We do that by caching the offset slot right in
      key_part. There's a catch here, however: we create a new format
      whenever an index is dropped or created, and we don't reindex old
      tuples. As a result, there may be several generations of tuples in
      the same space, all using different formats, while there is only
      one key_def used for comparison.
      
      To overcome this problem, we introduce the notion of a tuple_format
      epoch. This is a counter incremented each time a new format is
      created. We store it in tuple_format and key_def, and we only use
      the offset slot cached in a key_def if its epoch coincides with the
      epoch of the tuple format. If they don't match, we look up the
      tuple_field as before, and then update the cached value and epoch
      from the tuple format.
      
      Part of #1012
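      
      A condensed sketch of the fast path; the names are illustrative,
      while the real fields live in key_part and tuple_format:
      ```
      #include <stdint.h>
      
      struct key_part_sketch {
              uint64_t format_epoch;     /* epoch the cached slot is valid for */
              int32_t offset_slot_cache; /* cached field-map slot */
      };
      
      /* slow_lookup stands in for the JSON-path walk over the format tree. */
      static int32_t
      offset_slot_get(struct key_part_sketch *part, uint64_t format_epoch,
                      int32_t (*slow_lookup)(void))
      {
              if (part->format_epoch == format_epoch)
                      return part->offset_slot_cache; /* fast path: epochs match */
              /* Slow path: resolve the JSON path, then refresh the cache
               * and the epoch it is valid for. */
              part->offset_slot_cache = slow_lookup();
              part->format_epoch = format_epoch;
              return part->offset_slot_cache;
      }
      ```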
    • box: introduce has_json_paths flag in templates · 8e091047
      Kirill Shcherbatov authored
      Introduced a has_json_paths flag for the compare, hash and extract
      function templates (which are really hot) to make it possible not
      to look at the path field for flat indexes without any JSON paths.
      
      Part of #1012
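      
      In miniature, the selection might look as follows; the names are
      illustrative, not the actual template machinery:
      ```
      #include <stdbool.h>
      
      typedef int (*tuple_compare_f)(const void *tuple_a, const void *tuple_b);
      
      /* Two instantiations of the same template: one never touches
       * key_part->path, the other handles JSON paths. Stubs here. */
      static int compare_flat(const void *a, const void *b);
      static int compare_with_paths(const void *a, const void *b);
      
      /* Chosen once, when the key_def is built, so the hot comparator
       * path never branches on the flag. */
      static tuple_compare_f
      tuple_compare_create(bool has_json_paths)
      {
              return has_json_paths ? compare_with_paths : compare_flat;
      }
      ```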
    • box: introduce JSON Indexes · 4273ec52
      Kirill Shcherbatov authored
      New JSON indexes allow indexing document content.
      First, introduced new key_part fields path and path_len
      representing the JSON path string specified by the user. The
      modified tuple_format_use_key_part routine constructs the
      corresponding tuple_field chain in the tuple_format::fields tree
      down to the indexed data. The resulting tree is used for type
      checking and for allocating offset slots for indexed fields.
      
      Then, the refined tuple_init_field_map routine parses the tuple
      msgpack in depth, using a stack allocated on the region, and
      initializes the field map with the corresponding
      tuple_format::field, if any.
      Finally, to size memory allocations for vinyl's secondary keys,
      which are restored from extracted keys loaded from disk without
      traversing the field tree, introduced the format::min_tuple_size
      field - the size of a tuple of this format as if all its leaf
      fields were zero.
      
      Example:
      To create a new JSON index, specify the path to the document data
      as a part of key_part:
      parts = {{3, 'str', path = '.FIO.fname', is_nullable = false}}
      idx = s:create_index('json_idx', {parts = parts})
      idx:select("Ivanov")
      
      Part of #1012
    • box: introduce tuple_field_raw_by_path routine · e4a565db
      Kirill Shcherbatov authored
      Introduced a new function, tuple_field_raw_by_path, used to get
      tuple fields by field index and a relative JSON path. This routine
      uses tuple_format's field_map when possible. It will be further
      extended to use JSON indexes.
      The old tuple_field_raw_by_path routine, which used to work with
      full JSON paths, is renamed to tuple_field_raw_by_full_path. Its
      return value type is changed to const char * because the other
      similar functions, tuple_field_raw and tuple_field_by_part_raw,
      use this convention.
      Got rid of reporting the error position for the 'invalid JSON
      path' error in lbox_tuple_field_by_path, because extending the
      other routines to behave this way would make the API inconsistent;
      moreover, such errors are useless and confusing.
      
      Needed for #1012
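      
      A sketch of how the two entry points might look after the split;
      these prototypes are illustrative and may differ from the actual
      signatures in tuple.h:
      ```
      #include <stdint.h>
      
      struct tuple_format;
      
      /* Field number plus a JSON path relative to that field; can use
       * the format's field_map for a fast offset lookup. */
      const char *
      tuple_field_raw_by_path(struct tuple_format *format, const char *tuple,
                              const uint32_t *field_map, uint32_t fieldno,
                              const char *path, uint32_t path_len);
      
      /* Full JSON path starting from the field itself; returns
       * const char * for consistency with tuple_field_raw and
       * tuple_field_by_part_raw. */
      const char *
      tuple_field_raw_by_full_path(struct tuple_format *format,
                                   const char *tuple,
                                   const uint32_t *field_map,
                                   const char *path, uint32_t path_len);
      ```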
    • Update msgpuck library · c4f2ffb8
      Kirill Shcherbatov authored
      The msgpuck dependency has been updated because the new version
      introduces the new mp_stack class, which we will use to parse
      tuples without recursion when initializing the field map.
      
      Needed for #1012
  6. Jan 30, 2019
    • box: get rid of atexit() for calling cleanup routines · 1bc1fcda
      Serge Petrenko authored
      Move the call to tarantool_free() to the end of main(). We don't
      need to call atexit() at all anymore: since we've implemented
      on_shutdown triggers and patched os.exit(), control always reaches
      the call to tarantool_free() unless we are exiting due to a fatal
      signal (in which case no cleanup routines are called anyway).
    • lua: patch os.exit() to execute on_shutdown triggers. · 6dc4c8d7
      Serge Petrenko authored
      Make os.exit() call tarantool_exit(), just like the signal handler
      does. Now the only case in which on_shutdown triggers are not run
      is when a fatal signal is received.
      
      Closes #1607
      
      @TarantoolBot document
      Title: Document box.ctl.on_shutdown triggers
      on_shutdown triggers may be set similarly to space:on_replace triggers:
      ```
      box.ctl.on_shutdown(new_trigger, old_trigger)
      ```
      The triggers will be run when tarantool exits due to receiving one
      of the signals `SIGTERM`, `SIGINT`, `SIGHUP`, or when the user
      executes `os.exit()`.
      
      Note that the triggers will not be run if tarantool receives a
      fatal signal: `SIGSEGV`, `SIGABRT` or any other signal causing
      immediate program termination.
    • box: implement on_shutdown triggers · 72e25b7c
      Serge Petrenko authored
      Add on_shutdown triggers, which are run by a preallocated fiber on
      shutdown, and make it possible to register them via
      box.ctl.on_shutdown(). Make use of the new triggers: dedicate an
      on_shutdown trigger to breaking the event loop instead of doing it
      explicitly from the signal handler. This trigger is run last, so
      that all other on_shutdown triggers may yield, sleep and so on.
      Also make sure we can register lbox_triggers without a push_event
      function in case we don't need one.
      
      Part of #1607
    • sql: prohibit type_def keywords in the VALUES statement · a10b1549
      Stanislav Zudin authored
      The "box.sql.execute('values(blob)')" causes an accert in the
      expression processing, because the parser doesn't distinguish the
      keyword "BLOB" from the binary value (in the form X'hex').
      
      This fix adds an additional checks in the SQL grammar.
      Thus the expressions such as "VALUES(BLOB)", "SELECT FLOAT"
      and so on are treated as a syntax errors.
      
      Closes #3888
  7. Jan 29, 2019
    • iproto: move map creation to sql_response_dump() · 7609069c
      Mergen Imeev authored
      Currently, the function sql_response_dump() puts data into an already
      created map. Moving the map creation to sql_response_dump()
      simplifies the code and allows us to use sql_response_dump() as
      one of the port_sql methods.
      
      Needed for #3505
    • tuple: fix on-stack buffer allocation in tuple_hash_field · c267b0b8
      Vladimir Davydov authored
      The buffer is defined in a nested {} block. This gives the compiler the
      liberty to overwrite it once the block has been executed, which would be
      incorrect since the content of the buffer is used outside the {} block.
      This results in box/hash and vinyl/bloom test failures when
      tarantool is compiled in release mode. Fix this by moving the
      buffer definition to the beginning of the function.
      
      Fixes commit 0dfd99c4 ("tuple: fix hashing of integer numbers").
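      
      The bug pattern, distilled into a standalone example;
      hash_field_sketch is hypothetical, not the actual tuple_hash_field
      code:
      ```
      #include <string.h>
      
      /* Buggy shape: 'buf' dies at the end of its {} block, so reading
       * it through 'data' afterwards is undefined behavior; a
       * release-mode compiler is free to reuse the stack slot. */
      static void
      hash_field_sketch(double value, char *out)
      {
              const char *data;
              {
                      char buf[sizeof(value)];
                      memcpy(buf, &value, sizeof(value));
                      data = buf; /* dangles once the block ends */
              }
              memcpy(out, data, sizeof(value)); /* may read garbage */
      }
      
      /* Fix: define the buffer at function scope so it outlives every use. */
      static void
      hash_field_fixed(double value, char *out)
      {
              char buf[sizeof(value)];
              memcpy(buf, &value, sizeof(value));
              memcpy(out, buf, sizeof(value));
      }
      ```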
    • tuple: fix hashing of integer numbers · 0dfd99c4
      Vladimir Davydov authored
      Integer numbers stored in tuples as MP_FLOAT/MP_DOUBLE are hashed
      differently from integer numbers stored as MP_INT/MP_UINT. This breaks
      select() for memtx hash indexes and vinyl indexes (the latter use bloom
      filters). Fix this by converting MP_FLOAT/MP_DOUBLE to MP_INT/MP_UINT
      before hashing if the value can be stored as an integer. This is
      consistent with the behavior of tuple comparators, which treat MP_FLOAT
      and MP_INT as equal in case they represent the same number.
      
      Closes #3907
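      
      A sketch of the normalization for the non-negative case; hash_uint
      and hash_double are hypothetical helpers, and real code would route
      negative integral values through the MP_INT path analogously:
      ```
      #include <math.h>
      #include <stdint.h>
      
      uint32_t hash_uint(uint64_t v);  /* same path MP_UINT values take */
      uint32_t hash_double(double v);
      
      static uint32_t
      hash_number(double value)
      {
              double intpart;
              /* Integral and representable as uint64_t: hash it like
               * MP_UINT so that, e.g., 1.0 and 1 hash identically. */
              if (modf(value, &intpart) == 0.0 &&
                  value >= 0.0 && value < ldexp(1.0, 64))
                      return hash_uint((uint64_t)value);
              return hash_double(value);
      }
      ```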
  8. Jan 25, 2019
    • wal: remove old xlog files asynchronously · 8e429f4b
      Vladimir Davydov authored
      In contrast to the TX thread, the WAL thread performs garbage
      collection synchronously, blocking all concurrent writes. We
      expected file removal to happen instantly, so we didn't bother to
      offload this job to eio threads. However, it turned out that
      sometimes removing a single xlog file can take 50 or even 100 ms.
      If there are a dozen files to be removed, this means a one-second
      delay and 'too long WAL write' warnings.
      
      To fix this issue, let's make WAL garbage collection fully
      asynchronous: simply submit a job to eio and assume it will
      complete successfully sooner or later. This means that if unlink()
      fails for some reason, we will log an error and never retry the
      file removal until the server is restarted. Not a big deal - we
      can live with it, assuming unlink() doesn't normally fail.
      
      Closes #3938
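      
      Schematically, the fire-and-forget removal could look like this,
      assuming libeio's eio_unlink as bundled with tarantool; real code
      would copy the path instead of passing the caller's pointer
      through:
      ```
      #include <stdio.h>
      #include "eio.h"
      
      /* Completion callback: log the failure and move on - no retries. */
      static int
      unlink_cb(eio_req *req)
      {
              if (req->result < 0)
                      fprintf(stderr, "failed to unlink %s\n",
                              (const char *)req->data);
              return 0;
      }
      
      static void
      xlog_remove_async(const char *path)
      {
              /* 0 = default priority; path is passed through for logging. */
              eio_unlink(path, 0, unlink_cb, (void *)path);
      }
      ```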
    • gc: do not abort garbage collection if failed to unlink snap file · 783662fb
      Vladimir Davydov authored
      We build the checkpoint list from the list of memtx snap files. So to
      ensure that it is always possible to recover from any checkpoint present
      in box.info.gc() output, we abort garbage collection if we fail to
      unlink a snap file. This introduces extra complexity to the garbage
      collection code, which makes it difficult to make WAL file removal fully
      asynchronous.
      
      Actually, it looks like we are being overly cautious here: unlink()
      doesn't normally fail, so an error while removing a snap file is
      highly unlikely to occur. Besides, even if it does happen, it won't
      be critical, because we never delete the last checkpoint, which is
      usually used for backups/recovery. So let's simplify the code by
      removing that check.
      
      Needed for #3938
    • Allow to reuse tuple_formats for ephemeral spaces · dbbd9317
      Kirill Yukhin authored
      Since ephemeral spaces may be used extensively under heavy load
      with SQL queries, it is possible to run out of tuple_formats for
      such spaces. This occurs because a tuple_format is not deleted
      immediately when an ephemeral space is dropped; its removal is
      postponed instead and triggered only when tuple memory is
      exhausted.
      Since there is no way to alter an ephemeral space's format, let's
      reuse formats across multiple ephemeral spaces when they are
      identical.
      
      Closes #3924
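      
      One way to picture the reuse, with a hypothetical cache keyed by a
      hash of the space definition; all names here are illustrative:
      ```
      #include <stdint.h>
      
      struct tuple_format;
      /* Hypothetical cache and constructor. */
      struct tuple_format *format_cache_find(uint32_t def_hash);
      void format_cache_put(uint32_t def_hash, struct tuple_format *format);
      struct tuple_format *tuple_format_new_sketch(void);
      void tuple_format_ref(struct tuple_format *format);
      
      struct tuple_format *
      ephemeral_format_get(uint32_t def_hash)
      {
              struct tuple_format *format = format_cache_find(def_hash);
              if (format == NULL) {
                      format = tuple_format_new_sketch();
                      format_cache_put(def_hash, format);
              }
              /* Identical ephemeral spaces share one refcounted format. */
              tuple_format_ref(format);
              return format;
      }
      ```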
  9. Jan 24, 2019
    • sql: set error type in case of ephemeral space creation failure · 65cd0b11
      Kirill Yukhin authored
      This is a trivial patch which sets the error kind when ephemeral
      spaces cannot be created due to Tarantool's backend (e.g. there is
      no more memory or no formats available).
    • Set is_temporary flag for formats of ephemeral spaces · 7225084f
      Kirill Yukhin authored
      Before this patch, when an ephemeral space was created, the
      is_temporary flag was set only after the space had actually been
      created, which in turn led to the corresponding flag of the
      tuple_format being set to `false`.
      So heavy load using ephemeral spaces (almost any SQL query)
      combined with snapshotting could lead to OOM, since tuples of
      ephemeral spaces were not marked as temporary and were not gc-ed.
      The patch sets the flag in the space definition instead.
    • Pass necessary fields to tuple_format constructor · 3a18f81d
      Kirill Yukhin authored
      There were three tuple_format fields which were set up after the
      format was created. Fix that by extending the tuple_format
      constructor with three new arguments: engine, is_temporary,
      exact_field_count.
    • vinyl: ignore unknown .run, .index and .vylog keys · 7bd128ae
      Vladimir Davydov authored
      Currently, if we encounter an unknown key while parsing a .run,
      .index, or .vylog file, we raise an error. As a result, if we add
      a new key to any of those entities, we will break forward
      compatibility although there's actually no reason for that. To
      avoid that, let's silently ignore unknown keys, as we do in the
      case of xrow header keys.
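      
      The decode-loop change, sketched with msgpuck's mp_next; the key
      constants are placeholders:
      ```
      #include "msgpuck.h"
      
      static void
      decode_key(const char **pos, uint64_t key)
      {
              switch (key) {
              /* case VY_KEY_...: decode the known keys here. */
              default:
                      /* Unknown key: skip its value instead of raising
                       * an error, the same way xrow header keys are
                       * handled. */
                      mp_next(pos);
                      break;
              }
      }
      ```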
    • vinyl: update lsm->range_heap in one go on dump completion · b07ad8b7
      Vladimir Davydov authored
      Upon LSM tree dump completion, we iterate over all ranges of the
      LSM tree to update their priority and position in the compaction
      heap. Since we typically need to update all ranges, we had better
      use the update_all heap method instead of updating the heap
      entries one by one.
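      
      Why a single update_all beats updating entries one by one: a
      bottom-up rebuild of a binary heap costs O(n), while n single-entry
      updates cost O(n log n). A generic sketch, with sift_down declared
      as a placeholder for the usual heap primitive:
      ```
      #include <stddef.h>
      
      void sift_down(long *heap, size_t size, size_t pos); /* placeholder */
      
      static void
      heap_update_all_sketch(long *heap, size_t size)
      {
              /* Nodes in the bottom half are leaves, hence already valid
               * one-element heaps; sift the rest down, last parent first. */
              for (size_t i = size / 2; i-- > 0; )
                      sift_down(heap, size, i);
      }
      ```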