- Feb 11, 2019
-
-
Alexander Turenko authored
Nikita Pettik pointed out to me that free(NULL) is a no-op according to POSIX. This is a follow-up to 9dbcaa3a.
-
Konstantin Belyavskiy authored
This is a draft paper covering the following topics:
1. A draft protocol for discovering and maintaining network topology in the case of a large arbitrary network.
2. A list of required changes to support this feature.
3. Open questions and alternatives.
Changes in V2, based on Vlad's review:
1. Rewrote a couple of sections to make them clearer.
2. Added more details and examples for clarity.
3. Fixed an error.
RFC for #3294
-
Vladimir Davydov authored
When computing the number of runs that need to be compacted for a range to conform to the target LSM tree shape, we use the newest run size as the size of the first LSM tree level. This isn't quite correct, for two reasons.

First, the size of the newest run is unstable - it may vary in a relatively wide range from dump to dump. This leads to frequent changes in the target LSM tree shape and, as a result, to unpredictable compaction behavior. In particular, this breaks compaction randomization, which is supposed to smooth out the IO load generated by compaction.

Second, this can increase space amplification. We trigger compaction at the last level when there's more than one run, irrespective of the value of the run_count_per_level configuration option. We expect this to keep space amplification below 2, provided run_count_per_level is not greater than (run_size_ratio - 1). However, if the newest run happens to have such a size that multiplying it by run_size_ratio several times gives us a value only slightly less than the size of the oldest run, we can accumulate up to run_count_per_level more runs that are approximately as big as the last-level run without triggering compaction, thus increasing space amplification by up to run_count_per_level.

To fix these problems, let's use the oldest run size for computing the size of the first LSM tree level - simply divide it by run_size_ratio for as long as the result still exceeds the size of the newest run.

Follow-up #3657
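
A sketch of the new first-level size computation (hypothetical Lua rendering of the rule above, not the actual vinyl code):

    -- Scale the oldest run size down by run_size_ratio while the
    -- result still exceeds the newest run size; what remains is the
    -- target size of the first LSM tree level.
    local function first_level_size(oldest_run_size, newest_run_size,
                                    run_size_ratio)
        local size = oldest_run_size
        while size / run_size_ratio >= newest_run_size do
            size = size / run_size_ratio
        end
        return size
    end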
-
Vladimir Davydov authored
While a replica is being bootstrapped from a remote master, the vinyl engine may need to perform compaction, which means that it may write to the _vinyl_deferred_delete system space. Compaction proceeds fully asynchronously, i.e. a write may occur after the join stage is complete but before the WAL is initialized, in which case the new replica will crash. To make sure a race like that can't happen, let's set up the WAL before making the initial checkpoint. The WAL writer is now initialized right before starting the WAL thread, so we don't need to split the WAL struct into the thread and the writer parts anymore. Closes #3968
-
- Feb 08, 2019
-
-
Vladimir Davydov authored
If this test is executed after some other test that bumps the LSN, the output line gets truncated differently, because greater LSNs may increase its length. Fix this by filtering out the LSN manually. Closes #3970
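
A sketch of the kind of fix involved, using the test-run filter mechanism (the exact filter pattern used by the test may differ):

    test_run = require('test_run').new()
    -- mask the variable-length LSN so the output no longer depends
    -- on how many tests ran before this one
    test_run:cmd("push filter 'lsn: [0-9]+' to 'lsn: <lsn>'")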
-
Ivan Koptelov authored
Currently, all ON CONFLICT actions are silently ignored for CHECK constraints. This patch adds an explicit parse-time error. Closes #3345
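
Illustrative DDL of the kind affected (table name is hypothetical; the exact error text is not shown in this message):

    box.sql.execute([[CREATE TABLE t1 (id INT PRIMARY KEY,
                                       a INT CHECK (a > 0) ON CONFLICT REPLACE)]])
    -- before: the ON CONFLICT REPLACE clause was silently ignored
    -- after: the statement fails at parse time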
-
Nikita Pettik authored
Closes #3698
-
Nikita Pettik authored
Replace the remains of affinity usage in the SQL parser, query optimizer and VDBE. Don't add affinity to the field definition when a table is encoded into msgpack. Remove the field type <-> affinity converters, since now we can operate directly on field types. Part of #3698
-
Nikita Pettik authored
This patch also resolves an issue with wrong query plans for selects on spaces created from Lua: in most cases a table scan was used instead of an index search. This happened because indexes were checked for affinity compatibility with the space format, so if a space was created without affinity in its format, its indexes were never used. Now all checks are based on field types, and as a result the query optimizer is able to choose the correct index. Closes #3886 Part of #3698
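
A sketch of the previously affected scenario (space and index names are illustrative; box.sql.execute is the SQL entry point of that period):

    -- a space created from Lua, with no affinity in its format
    s = box.schema.space.create('T',
        {format = {{'A', 'unsigned'}, {'B', 'unsigned'}}})
    s:create_index('PK', {parts = {{'A', 'unsigned'}}})
    -- before this patch the planner tended to fall back to a table
    -- scan here; now the index is used for the lookup
    box.sql.execute('EXPLAIN QUERY PLAN SELECT * FROM "T" WHERE "A" = 1')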
-
Nikita Pettik authored
This stage of affinity removal requires introducing an auxiliary intermediate function to convert an array of affinity values to field type values. The rest of the job done in this commit is straightforward refactoring. Part of #3698
-
Nikita Pettik authored
Let's use field_type instead of affinity as the type of the return value of a user function registered in SQL. Moreover, let's assign the return value type to the expression representing the function, which allows taking it into account during derived type calculation. Part of #3698
-
Nikita Pettik authored
Numeric affinity in SQLite means the same as real, except that it forces floating point values into integer representation when they can be converted without loss (e.g. 2.0 -> 2). Since in the Tarantool core there is no difference between numeric and real values (both are stored as values of the Tarantool type NUMBER), let's remove numeric affinity and use real instead. The only real pitfall is the implicit conversion mentioned above: we can't pass *.0 as an iterator value, since our fast comparators (TupleCompare, TupleCompareWithKey) are designed to work only with values of the same MP_ type. They do not use the slow tuple_compare_field(), which is able to compare a double and an integer. The solution to this problem is simple: let's always attempt to encode floats as ints when the conversion is lossless.

This is a straightforward approach, but to implement it we need to take care of the reverse (decoding) situation. OP_Column fetches the msgpack field with the given number and stores it as a native VDBE memory object. The type of that memory object is based on the type of the msgpack value. So, if a space field is of type NUMBER and holds the value 1, the type of the VDBE memory will be INT (after decoding), not float 1.0. As a result, further calculations may be wrong: for instance, instead of floating point division we could get integer division. To cope with this problem, let's add an auxiliary conversion to the decoding routine, which uses the space format of the tuple being decoded. It is worth mentioning that ephemeral spaces don't have a space format, so for them we rely on the types of the key parts.

Finally, the internal VDBE merge sorter also operates on entries encoded into msgpack. To fix this case, we check the types of the ORDER BY/GROUP BY arguments: if they are of type float, we emit an additional opcode, OP_AffinityReal, to force the float type after encoding.

Part of #3698
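
A minimal sketch of the lossless float-to-int encoding rule described above (illustrative Lua, not the actual VDBE serialization code):

    -- 2.0 can be represented exactly as an integer, so it is encoded
    -- as the msgpack integer 2; 2.5 cannot, so it stays a double
    local function encode_sql_number(x)
        if math.floor(x) == x then
            return math.floor(x)  -- encoded as MP_INT/MP_UINT
        end
        return x                  -- encoded as MP_DOUBLE
    end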
-
Nikita Pettik authored
Also, this allows delaying affinity assignment to the field def until the table format is encoded. Part of #3698
-
Nikita Pettik authored
Code under this define is dead. What is more, it uses affinity, so let's remove it along with the tests related to it. Needed for #3698
-
Georgy Kirichenko authored
The applier used to promote the vclock prior to applying a row. This led to a situation where a master's row would be skipped forever if an error occurred while trying to apply it. However, some errors are transient, and we might be able to successfully apply the same row later. While we're at it, make the WAL writer the only one responsible for advancing the replicaset vclock. It was already doing so for rows coming from the local instance; besides, this makes the code cleaner, since now we advance the vclock directly from the WAL batch reply, and it lets us get rid of unnecessary checks of whether the applier or the WAL has already advanced the vclock. Closes #2283 Prerequisite #980
-
Georgy Kirichenko authored
The WAL used to promote the vclock prior to writing a row. This led to a situation where a master's row would be skipped forever if an error occurred while trying to write it. However, some errors are transient, and we might be able to successfully apply the same row later. So we no longer promote the writer vclock, in order to be able to restart replication from the failing point. Obsoletes xlog/panic_on_lsn_gap.test. Needed for #2283
-
- Feb 07, 2019
-
-
Vladimir Davydov authored
Follow-up dd30970e ("replication: log replica_id in addition to lsn on conflict").
-
Vladimir Davydov authored
Without replica_id, lsn of the conflicting row doesn't make much sense.
-
Serge Petrenko authored
On replica subscribe, the master checks that the replica's cluster id matches the master's own, and disallows replication in case of a mismatch. This behaviour blocks the implementation of anonymous replicas, which shouldn't pollute the _cluster space and could accumulate changes from multiple clusters at once. So let's move the check to the replica, to let it decide which action to take in case of a mismatch. Needed for #3186 Closes #3704
-
Stanislav Zudin authored
The VDBE returns an error if a LIMIT or OFFSET expression is cast to a negative integer value. If the expression in the LIMIT clause can't be converted to an integer without data loss, the VDBE returns SQL_TARANTOOL_ERROR with the message "Only positive integers are allowed in the LIMIT clause" instead of SQLITE_MISMATCH. The same holds for the OFFSET clause. Closes #3467
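
Illustrative statements that now fail with the new error (table name is hypothetical):

    box.sql.execute('SELECT * FROM t LIMIT -1')   -- negative integer: error
    box.sql.execute('SELECT * FROM t LIMIT 2.5')  -- lossy cast: error
    -- "Only positive integers are allowed in the LIMIT clause"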
-
- Feb 06, 2019
-
-
Vladimir Davydov authored
Historically, when considering splitting or coalescing a range or updating compaction priority, we have used the sizes of compressed runs (see bytes_compressed). This makes the algorithms dependent on whether compression is used and on how effective it is, which is weird, because compression is a way of storing data on disk - it shouldn't affect the way data is partitioned. E.g. if we turned off compression at the first LSM tree level, which would make sense because that level is relatively small, we would thereby affect the compaction algorithm. So let's use uncompressed run sizes when considering range tree transformations.
-
Serge Petrenko authored
After the patch that made os.exit() execute on_shutdown triggers (see commit 6dc4c8d7), we relied on the on_shutdown triggers to break the ev_loop and exit tarantool. However, there is an auxiliary event loop run in tarantool_lua_run_script() to reschedule the fiber that executes the chunks of code passed via the -e option and runs interactive mode. This event loop is started only to execute interactive mode and doesn't exist during execution of the -e chunks. Make sure we don't start it if os.exit() was already executed in one of the chunks. Closes #3966
-
Serge Petrenko authored
In case a fiber joining another fiber gets cancelled, it stays suspended forever and never finishes joining. This happens because fiber_cancel() wakes the fiber and removes it from all execution queues. Fix this by adding the fiber back to the wakeup queue of the joined fiber after each yield. Closes #3948
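
A reproducer sketch of the fixed scenario (uses the standard fiber API; timings are illustrative):

    fiber = require('fiber')
    -- the worker exits shortly; the joiner waits on it
    worker = fiber.create(function() fiber.sleep(0.1) end)
    worker:set_joinable(true)
    joiner = fiber.create(function()
        -- before the fix, a cancelled joiner stayed suspended here forever
        worker:join()
    end)
    joiner:cancel()  -- with the fix, the joiner is woken up again and
                     -- finishes joining once the worker exits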
-
Serge Petrenko authored
Start showing downstream status for relays in "follow" state. Also refactor lbox_pushrelay to unify code for different relay states. Closes #3904
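
After this change, a relay that is actively feeding a replica reports its state in box.info (the output shape below is illustrative):

    tarantool> box.info.replication[2].downstream
    ---
    - status: follow
      vclock: {1: 42}
    ...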
-
- Feb 05, 2019
-
-
Konstantin Osipov authored
Initially, the tuple_field_* getters were placed in tuple_format.h to avoid including tuple_format.h in tuple.h. Now we include tuple_format.h in tuple.h anyway, so move the code where it belongs. Besides, a bunch of new getters have been added to tuple.h since then, so the code has rotted a bit. This is a preparation for an overhaul of the tuple_field_* getter naming.
-
Konstantin Osipov authored
-
Konstantin Osipov authored
-
Konstantin Osipov authored
We use the tuple_field_raw_ prefix for other similar members.
-
Konstantin Osipov authored
Use cached tuple data and format in tuple_hash.c.
-
Konstantin Osipov authored
-
Konstantin Osipov authored
Add a comment explaining the logic behind intermediate lookups in the json_tree_lookup_path() function.
-
- Feb 04, 2019
-
-
Konstantin Osipov authored
-
Vladimir Davydov authored
The patch adds the missing -fPIC option for clang, without which the msgpuck library might fail to compile.
-
Kirill Shcherbatov authored
Implemented a more convenient interface for creating an index by JSON path. Instead of specifying a fieldno and a relative path, it is now possible to pass the full JSON path to the data. Closes #1012

@TarantoolBot document
Title: Indexes by JSON path
Sometimes field data has a complex document structure. When this structure is consistent across the whole space, you can create an index by JSON path. Example:

    s = box.schema.space.create('sample')
    format = {{'id', 'unsigned'}, {'data', 'map'}}
    s:format(format)
    -- explicit JSON index creation
    age_idx = s:create_index('age', {parts = {{2, 'number', path = "age"}}})
    -- user-friendly syntax for JSON index creation
    parts = {{'data.FIO["fname"]', 'str'}, {'data.FIO["sname"]', 'str'},
             {'data.age', 'number'}}
    info_idx = s:create_index('info', {parts = parts})
    s:insert({1, {FIO={fname="James", sname="Bond"}, age=35}})
-
Kirill Shcherbatov authored
tuple_field_by_part looks up the tuple_field corresponding to the given key part in the tuple_format in order to quickly retrieve the offset of indexed data from the tuple field map. For regular indexes this operation is blazing fast; however, for JSON indexes it is not, as we have to parse the path to the data and then do multiple lookups in a JSON tree. Since tuple_field_by_part is used by comparators, we should strive to make this routine as fast as possible for all kinds of indexes.

This patch introduces an optimization that is supposed to make tuple_field_by_part for JSON indexes as fast as it is for regular indexes in most cases. We do that by caching the offset slot right in the key_part. There's a catch here, however - we create a new format whenever an index is dropped or created, and we don't reindex old tuples. As a result, there may be several generations of tuples in the same space, all using different formats, while there's only one key_def used for comparison.

To overcome this problem, we introduce the notion of a tuple_format epoch. This is a counter incremented each time a new format is created. We store it in both tuple_format and key_def, and we only use the offset slot cached in a key_def if its epoch coincides with the epoch of the tuple format. If they don't match, we look up the tuple_field as before, and then update the cached value along with the epoch of the tuple format. Part of #1012
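
A schematic of the lookup fast path described above (Lua-style pseudocode; lookup_offset_slot() and tuple_field_at_slot() are hypothetical helpers standing in for the real C internals):

    local function tuple_field_by_part(format, tuple, part)
        if part.format_epoch == format.epoch then
            -- fast path: reuse the offset slot cached in the key_part
            return tuple_field_at_slot(tuple, part.offset_slot)
        end
        -- slow path: resolve the JSON path, then refresh the cache
        local slot = lookup_offset_slot(format, part)
        part.offset_slot, part.format_epoch = slot, format.epoch
        return tuple_field_at_slot(tuple, slot)
    end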
-
Kirill Shcherbatov authored
Introduced a has_json_path flag for the compare, hash and extract function templates (which are really hot) to make it possible to avoid looking at the path field for flat indexes without any JSON paths. Part of #1012
-
Kirill Shcherbatov authored
New JSON indexes allow indexing document content.

First, new key_part fields path and path_len were introduced, representing the JSON path string specified by the user. The modified tuple_format_use_key_part routine constructs the corresponding chain of tuple_fields in the tuple_format::fields tree leading to the indexed data. The resulting tree is used for type checking and for allocating offset slots for indexed fields.

Then, the refined tuple_init_field_map routine parses the tuple msgpack in depth using a stack allocated on the region, and initializes the field map with the corresponding tuple_format::field, if any.

Finally, to allow memory allocation for vinyl's secondary keys, which are restored from extracted keys loaded from disk without traversing the fields tree, a format::min_tuple_size field was introduced - the size of a tuple of this format as if all its leaf fields were zero.

Example: to create a new JSON index, specify the path to the document data as part of a key_part:

    parts = {{3, 'str', path = '.FIO.fname', is_nullable = false}}
    idx = s:create_index('json_idx', {parts = parts})
    idx:select("Ivanov")

Part of #1012
-
Kirill Shcherbatov authored
Introduced a new function, tuple_field_raw_by_path, used to get tuple fields by field index and a relative JSON path. This routine uses the tuple_format's field_map when possible. It will be further extended to use JSON indexes. The old tuple_field_raw_by_path routine, which used to work with full JSON paths, is renamed to tuple_field_raw_by_full_path. Its return value type is changed to const char *, because the other similar functions, tuple_field_raw and tuple_field_by_part_raw, use this convention. Got rid of reporting the error position for the 'invalid JSON path' error in lbox_tuple_field_by_path, because we can't extend the other routines to behave that way without making the API inconsistent; moreover, such errors are useless and confusing. Needed for #1012
-
Kirill Shcherbatov authored
The msgpack dependency has been updated, because the new version introduces the new mp_stack class, which we will use to parse tuples without recursion when initializing the field map. Needed for #1012
-
- Jan 30, 2019
-
-
Serge Petrenko authored
Move the call to tarantool_free() to the end of main(). We needn't call atexit() at all anymore, since we've implemented on_shutdown triggers and patched os.exit() so that, unless we're exiting due to a fatal signal (in which case no cleanup routines are called anyway), control always reaches the call to tarantool_free().
-