- Mar 13, 2019
-
-
Vladimir Davydov authored
There are three places where we use this expensive function while we could get along with a cheaper one:

- Deferred DELETE space on_replace trigger. Here we can use the simple vy_stmt_new_delete, because the trigger is already passed a surrogate DELETE statement.
- Secondary index build on_replace trigger. Here we can extract the secondary key, set its type to DELETE, and insert it into the index. We don't need any of the other indexed fields.
- Secondary index build recovery procedure. Similarly to the previous case, we can use the extracted key here rather than building a surrogate DELETE statement.
-
Vladimir Davydov authored
This heavy function isn't needed anymore, as we can now insert key statements into the memory level.
-
Vladimir Davydov authored
In contrast to a primary index, which stores full tuples, secondary indexes only store extended (secondary + primary) keys on disk. To make them look like tuples, we fill the missing fields with nulls (aka surrogate tuples). This isn't going to work nicely with multikey indexes though: how would you make a surrogate array from a key? We could special-case multikey index handling, but that would look cumbersome. So this patch removes nulls from secondary tuples restored from disk altogether. To achieve that, it's enough to use key_format for them: the comparators will then detect that a statement is actually a key, not a tuple, and use the appropriate primitive.
-
Vladimir Davydov authored
It's actually only needed to initialize disk streams, so let's pass it to vy_write_iterator_new_slice() instead.
-
Vladimir Davydov authored
By convention, we have two methods in each write iterator stream implementation (including the write iterator itself, as it implements the interface too): 'stop' and 'close'. The 'stop' method is called in a worker thread. It reverses the effect of 'start'. We need it to unreference all tuples referenced during the iteration (we must do it in the worker thread, where the tuples were referenced in the first place, so as not to unreference tuple formats, see vy_tuple_delete). The 'close' method is called from the tx thread to unreference tuple formats if necessary and release memory. The write iterator itself follows this convention. However, individual sources do not - for the vy_slice_stream source, to be more exact, the write iterator calls both 'stop' and 'close' from its own 'stop' method. Let's clean up this mess and make the write iterator follow the convention. We'll need it in the next patch.
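A minimal sketch of the convention described above, with simplified types; the real interface has more methods (e.g. 'next') and may be laid out differently:

```
struct vy_stmt_stream;

/*
 * Trimmed-down stream interface: 'start' and 'stop' run in a
 * worker thread and (un)reference tuples there; 'close' runs in
 * the tx thread and releases memory (and tuple formats, if
 * necessary).
 */
struct vy_stmt_stream_iface {
    int  (*start)(struct vy_stmt_stream *stream); /* worker thread */
    void (*stop)(struct vy_stmt_stream *stream);  /* worker thread */
    void (*close)(struct vy_stmt_stream *stream); /* tx thread */
};

struct vy_stmt_stream {
    const struct vy_stmt_stream_iface *iface;
};
```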
-
Vladimir Davydov authored
Use the format of the given statement instead. Passing the format is a legacy from the time when we had a separate format for UPSERTs. Nowadays it only obfuscates the code.
-
Vladimir Davydov authored
A Vinyl statement may be either a key or a tuple. We must use different functions for the two kinds when working with a bloom filter. Let's introduce helpers incorporating that logic. Notes:

- Currently, we never add keys to bloom filters, but after the next patch we will, so this patch adds the tuple_bloom_builder_add_key helper.
- According to the function protocol, tuple_bloom_builder_add may fail with out-of-memory, but we never checked that. Fix that while we are at it.
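A sketch of what such a helper could look like; the prototypes are simplified and the dispatching helper's name is hypothetical, but it shows the key-vs-tuple dispatch and the out-of-memory check the note refers to:

```
#include <stdbool.h>

struct tuple;
struct key_def;
struct tuple_bloom_builder;

/* Primitives mentioned in the message (prototypes simplified). */
int tuple_bloom_builder_add(struct tuple_bloom_builder *builder,
                            struct tuple *tuple, struct key_def *key_def);
int tuple_bloom_builder_add_key(struct tuple_bloom_builder *builder,
                                struct tuple *key, struct key_def *key_def);
bool vy_stmt_is_key(struct tuple *stmt); /* format check, see later patches */

/*
 * Hypothetical helper: one entry point regardless of the statement
 * kind. Returns -1 on out-of-memory, which callers must now check.
 */
static inline int
vy_stmt_bloom_builder_add(struct tuple_bloom_builder *builder,
                          struct tuple *stmt, struct key_def *key_def)
{
    return vy_stmt_is_key(stmt) ?
           tuple_bloom_builder_add_key(builder, stmt, key_def) :
           tuple_bloom_builder_add(builder, stmt, key_def);
}
```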
-
Vladimir Davydov authored
No functional changes, just moving a piece of code so as not to mix it into the next patch.
-
Vladimir Davydov authored
A tuple bloom filter is an array of bloom filters, one per partial key length, so that it covers lookups by all possible partial keys. To optimize the overall bloom filter size, we need to know how many unique elements there are for each partial key. To achieve that, we require the caller to pass the number of key parts that have been hashed for the given tuple. Here's how it looks in Vinyl:

    uint32_t hashed_parts = writer->last_stmt == NULL ? 0 :
            tuple_common_key_parts(stmt, writer->last_stmt,
                                   writer->key_def);
    tuple_bloom_builder_add(writer->bloom, stmt, writer->key_def,
                            hashed_parts);

Actually, there's no need for such a requirement: instead we can calculate the hash value for the given tuple, compare it with the hash of the tuple added last time, and add the new hash only if the two values differ. This is accurate enough while allowing us to get rid of the cumbersome tuple_common_key_parts helper. Note, such a check only works if tuples are added in the order defined by the key definition, but that already holds - one wouldn't be able to use tuple_common_key_parts either if it weren't true. While we are at it, refresh the obsolete comment on tuple_bloom_builder.
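A standalone model of the trick, with a toy hash standing in for the real tuple field hash; it shows why comparing against the previous hash counts unique elements when the input is sorted:

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy FNV-1a hash standing in for the real tuple field hash. */
static uint64_t
toy_hash(const char *data)
{
    uint64_t h = 14695981039346656037ULL;
    for (; *data != '\0'; data++) {
        h ^= (unsigned char)*data;
        h *= 1099511628211ULL;
    }
    return h;
}

int
main(void)
{
    /* Partial keys in key_def order, duplicates adjacent. */
    const char *keys[] = {"a", "a", "b", "b", "b", "c"};
    uint64_t last_hash = 0;
    bool has_last = false;
    int unique = 0;
    for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
        uint64_t h = toy_hash(keys[i]);
        /* Add the hash only if it differs from the previous one. */
        if (!has_last || h != last_hash)
            unique++;
        last_hash = h;
        has_last = true;
    }
    printf("unique partial keys: %d\n", unique); /* prints 3 */
    return 0;
}
```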
-
Vladimir Davydov authored
To differentiate between key and tuple statements in comparators, we set the IPROTO_SELECT type for key statements. As a result, we can't use key statements in the run iterator directly, although secondary index runs do store statements in key format. Instead, we create surrogate tuples, filling missing fields with NULLs. This won't play nicely with multikey indexes, so we need to teach iterators to deal with statements in key format. The first step in this direction is dropping IPROTO_SELECT in favor of identifying key statements by format.
-
Vladimir Davydov authored
Currently, it's called vy_stmt_new_select, but soon a key statement will be allowed to have any type, not just IPROTO_SELECT. So let's rename it to vy_key_new.
-
Vladimir Davydov authored
Store tuple_format_vtab, max_tuple_size, and key_format there. This will allow us to determine a statement type (key or tuple) by checking its format against key_format.
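A sketch of the resulting check, with the struct layouts assumed for illustration (the real statement type is richer than shown):

```
#include <stdbool.h>

struct tuple_format;

/* Assumed layout of the shared statement environment. */
struct vy_stmt_env {
    /* tuple_format_vtab and max_tuple_size live here too */
    struct tuple_format *key_format; /* shared by all key statements */
};

struct vy_stmt {
    struct tuple_format *format;
};

/* A statement is a key iff it was created with the key format. */
static inline bool
vy_stmt_is_key(const struct vy_stmt_env *env, const struct vy_stmt *stmt)
{
    return stmt->format == env->key_format;
}
```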
-
Vladimir Davydov authored
A vinyl statement (vy_stmt struct) may represent either a tuple or a key. We differentiate between the two kinds by statement type - we use SELECT for keys and other types for tuples. It was done that way so that we could pass both tuples and keys to a read iterator as a search key. To avoid branching in comparators when the types of compared statements are known in advance, we provide several comparators, each of which expects certain statement types, e.g. a tuple and a key.

Actually, such a micro-optimization looks like overkill, because a typical comparator is called by function pointer and has a lot of comparisons in its code, see tuple_compare_slowpath for instance. Eliminating one branch will hardly make the code perform better. At the same time, it makes the code more difficult to write. Besides, once we remove nils from statements read from disk (aka surrogate tuples), which will ease the implementation of multikey indexes, the number of places where the types of compared statements are known will diminish drastically. So let's remove the optimized comparators and always use vy_stmt_compare, which checks the types of compared statements and calls the appropriate comparator.
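A sketch of the branching that such a unified comparator ends up doing; the comparator names and prototypes below are simplified stand-ins, not the exact Vinyl ones:

```
#include <stdbool.h>

struct vy_stmt;
struct key_def;

bool vy_stmt_is_key(const struct vy_stmt *stmt); /* format check */
int  vy_key_compare(const struct vy_stmt *a, const struct vy_stmt *b,
                    struct key_def *key_def);
int  vy_tuple_compare(const struct vy_stmt *a, const struct vy_stmt *b,
                      struct key_def *key_def);
int  vy_tuple_compare_with_key(const struct vy_stmt *tuple,
                               const struct vy_stmt *key,
                               struct key_def *key_def);

/* One comparator for all statement kinds: a couple of cheap branches
 * instead of a zoo of specialized entry points. */
static inline int
vy_stmt_compare(const struct vy_stmt *a, const struct vy_stmt *b,
                struct key_def *key_def)
{
    bool a_is_key = vy_stmt_is_key(a);
    bool b_is_key = vy_stmt_is_key(b);
    if (a_is_key && b_is_key)
        return vy_key_compare(a, b, key_def);
    if (a_is_key)
        return -vy_tuple_compare_with_key(b, a, key_def);
    if (b_is_key)
        return vy_tuple_compare_with_key(a, b, key_def);
    return vy_tuple_compare(a, b, key_def);
}
```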
-
Vladimir Davydov authored
We advance replica->gc state only when an xlog file is fully recovered, see recovery_close_log and relay_on_close_log_f. It may turn out that an xlog file is fully recovered but isn't closed properly by the relay (i.e. recovery_close_log isn't called), because the replica closes the connection for some reason (e.g. a timeout). If this happens, the old xlog file won't be removed when the replica reconnects, because we don't advance replica->gc state on reconnect, so the useless xlog file won't be removed until the next xlog file is relayed. This results in occasional replication/gc.test.lua failures. Fix this by updating replica->gc on reconnect with the current replica vclock. Closes #4034
-
Kirill Shcherbatov authored
In the case of a replace operation rollback, the function on_replace_trigger_rollback was called with an incorrect argument, as a result of which memory that was still in use was freed.
-
Kirill Shcherbatov authored
The set_system_triggers and erase routines in upgrade.lua did not process the _fk_constraint space.
-
- Mar 12, 2019
-
-
Kirill Shcherbatov authored
Reworked the memtx_tree class to use the memtx_tree_data structure as a tree node. This makes it possible to extend it with a service field to implement tuple hints and multikey indexes in subsequent patches. Needed for #3961
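A sketch of the node layout this rework enables; the service field shown in the comment is what the follow-up patches are expected to add, its exact name and type are assumptions:

```
struct tuple;

/*
 * Tree node: a structure instead of a bare 'struct tuple *'.
 * Wrapping the pointer costs nothing now, but leaves room for a
 * service field, e.g. a comparison hint or a multikey array index.
 */
struct memtx_tree_data {
    struct tuple *tuple;
    /* uint64_t hint; -- to be added by subsequent patches (#3961) */
};
```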
-
Kirill Shcherbatov authored
The http library intelligently sets the "Accept", "Connection", and "Keep-Alive" headers. However, when the user explicitly specified them in the headers section of the call options, they could be written to the HTTP request twice. We postponed the auto headers setup until right before the request is executed: now they are set only if they were not set by the user. Closes #3955
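A toy model of the described behaviour, under the assumption that "set only if not set by the user" boils down to a case-insensitive name lookup in the user-provided headers (the real code works on the curl request in C):

```
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <strings.h>

/* Does the user-provided header list already contain 'name'? */
static bool
has_header(const char **user, int n, const char *name)
{
    for (int i = 0; i < n; i++) {
        if (strncasecmp(user[i], name, strlen(name)) == 0)
            return true;
    }
    return false;
}

int
main(void)
{
    const char *user[] = {"Connection: close"};
    const char *defaults[] = {"Accept: */*", "Connection: Keep-Alive",
                              "Keep-Alive: 115"};
    for (int i = 0; i < 3; i++) {
        char name[32];
        sscanf(defaults[i], "%31[^:]", name);
        if (!has_header(user, 1, name))
            printf("auto: %s\n", defaults[i]);
        else
            printf("user wins: skip %s\n", name); /* no duplicates */
    }
    return 0;
}
```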
-
- Mar 11, 2019
-
-
Nikita Pettik authored
When we allowed using a HAVING clause without GROUP BY (b40f2443), one possible combination was left untested:

    SELECT 1 FROM te40 HAVING SUM(s1) < 0;
    -- And SUM(s1) >= 0, i.e. the HAVING condition is false.

In other words, the result set contains no aggregates, but the HAVING clause does, and its condition is false. In this case, no byte code related to aggregate execution is emitted at all. Hence, the query above is compiled to a simple SELECT 1; and, unfortunately, returns the same result even when the condition under the HAVING clause is unsatisfied. To fix this behaviour, it is enough to tell the byte-code generator to analyze aggregates not only in ORDER BY clauses, but also in the HAVING clause. Closes #3932 Follow-up #2364
-
Nikita Pettik authored
Functions such as trim(), substr(), etc. should return a result with the collation derived from their arguments. So let's add a flag to the SQL function definition indicating that the collation of the first argument must be applied to the function's result. Using this flag, we can derive the appropriate collation in sql_expr_coll(). Part of #3932
-
Georgy Kirichenko authored
Form a separate transaction for local changes in the case of replication. This is important because we should be able to replicate such changes (e.g. made within an on_replace trigger) back. Otherwise, local changes would be incorporated into the originating transaction and skipped by the originator replica. Needed for #2798
-
- Mar 07, 2019
-
-
Vladimir Davydov authored
Fixes commit 8031071e ("Lightweight vclock_create and vclock_copy"). Closes #4033
-
Nikita Pettik authored
The BLOB column type is represented by the SCALAR field type in NoSQL terms. We attempted to emulate BLOB behaviour, but those efforts turned out not to be good enough, so we've decided to abandon them and replace BLOB with the SCALAR column type. SCALAR acts in the same way as it does in NoSQL: it is an aggregator type for the INTEGER, NUMBER, and STRING types, so a column declared with this type can contain values of these three (available in SQL) types. It is worth mentioning that the CAST operator in this case does nothing. Still, we consider BLOB values as entries encoded in msgpack with the MP_BIN format. To produce them, values should be represented in the BLOB form x'...' (e.g. x'000000'). What is more, there are two built-in functions returning BLOBs: randomblob() and zeroblob(). On the other hand, columns with the STRING NoSQL type don't accept BLOB values.
Closes #4019
Closes #4023

@TarantoolBot document Title: SQL types changes

There are a couple of recently introduced changes connected with SQL types.

Firstly, we've removed support of the DATE/TIME types from the parser due to the confusing behaviour of these types: they were mapped to the NUMBER NoSQL type and had nothing in common with generally accepted DATE/TIME types (like in other DBs). In addition, all built-in functions related to these types (julianday(), date(), time(), datetime(), current_time(), current_date(), etc.) are disabled until we reimplement TIME-like types as native NoSQL ones (see issue #3694).

Secondly, we've removed the CHAR type (i.e. the alias to VARCHAR and TEXT). The reason is that according to ANSI SQL, CHAR(len) must accept only strings whose length is exactly equal to the one given in the type definition. Obviously, we don't provide such checks now. The VARCHAR and TEXT types are still legal. For the same reason, we've removed the NUMERIC and DECIMAL types, which were aliases to the NUMBER NoSQL type. REAL, FLOAT, and DOUBLE still exist as aliases.

Finally, we've renamed the BLOB column type to SCALAR. We've decided that all our attempts to emulate BLOB behaviour using the SCALAR NoSQL type aren't good enough: without a native NoSQL BLOB type there will always be inconsistency, especially taking into account possible NoSQL-SQL interactions. In SQL, the SCALAR type works exactly the same way as in NoSQL: it can store values of the INTEGER, FLOAT, and TEXT SQL types at the same time. Also, with this change, the behaviour of the CAST operator has been slightly corrected: a cast to SCALAR now doesn't affect the type of the value at all. A couple of examples:

    CREATE TABLE t1 (a SCALAR PRIMARY KEY);
    INSERT INTO t1 VALUES ('1');
    SELECT * FROM t1 WHERE a = 1;
    -- []

The result is an empty set, since column "a" contains the string literal value '1', not the integer value 1.

    CAST(123 AS SCALAR);   -- Returns 123 (integer)
    CAST('abc' AS SCALAR); -- Returns 'abc' (string)

Note that in NoSQL, values of the BLOB type are defined as ones encoded in msgpack with the MP_BIN format. In SQL, there are still a few ways to force this format: declaring a literal in the "BLOB" format (x'...') or using one of the two built-in functions (randomblob() and zeroblob()). The TEXT and VARCHAR SQL types don't accept BLOB values:

    CREATE TABLE t (a TEXT PRIMARY KEY);
    INSERT INTO t VALUES (randomblob(5));
    ---
    - error: 'Tuple field 1 type does not match one required: expected string'
    ...

BLOB itself is going to be reimplemented in the scope of #3650.
-
Nikita Pettik authored
NUMERIC and DECIMAL were allowed to be specified as column types, but in fact they were just synonyms for the FLOAT type and mapped to the NUMBER Tarantool NoSQL type. So, we've decided to remove these types from the parser and bring them back when NUMERIC is implemented as a native type. Part of #4019
-
Nikita Pettik authored
Since no checks connected with the string length are currently performed, it might be misleading to allow specifying this type. Instead, users must rely on the VARCHAR type. Part of #4019
-
Nikita Pettik authored
Currently, there are no native (in Tarantool terms) types to represent time-like values. So, until we add an implementation of those types, it makes no sense to allow specifying them in a table definition. Note that previously they were mapped to the NUMBER type. For the same reason, all built-in functions connected with DATE/TIME are disabled as well. Part of #4019
-
Vladislav Shpilevoy authored
SWIM - Scalable Weakly-consistent Infection-style Process Group Membership Protocol. It consists of two components: events dissemination and failure detection, and stores in memory a table of known remote hosts - members. Also, some SWIM implementations have an additional component: anti-entropy - a periodical broadcast of a random subset of the members table.

The dissemination component spreads over the cluster the changes that occurred to members. Failure detection constantly searches for failed (dead) members. Anti-entropy just sends all known information about a member at once, so as to synchronize it among all other members in case some events were not disseminated (UDP problems).

Anti-entropy is the most vital component, since it can work without dissemination and failure detection, but they cannot work properly without it. Consider an example: two SWIM nodes, both alive. Nothing happens, so the events list is empty and only pings are sent periodically. Then a third node appears. It knows about one of the existing nodes. How should it learn about the other one? Sure, its known counterpart can try to notify the other one, but it is UDP, so this event can get lost. Anti-entropy is an extra simple component: it just piggybacks a random part of the members table on each regular round message. In the example above, the new node will sooner or later learn about the remaining node via anti-entropy messages from its known counterpart. This is why anti-entropy is the first implemented component. Part of #3234
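A toy model of the anti-entropy piggyback, assuming a flat member array; the real implementation packs members into UDP-sized packets and tracks incarnations per member:

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct member {
    const char *uri;
    int incarnation;
};

/*
 * Piggyback a random subset of the members table onto a round
 * message: even if an event was lost, every peer's full record
 * is eventually delivered this way.
 */
static void
piggyback_anti_entropy(const struct member *table, int size, int batch)
{
    for (int i = 0; i < batch && size > 0; i++) {
        const struct member *m = &table[rand() % size];
        printf("  anti-entropy: %s (incarnation %d)\n",
               m->uri, m->incarnation);
    }
}

int
main(void)
{
    srand((unsigned)time(NULL));
    struct member table[] = {
        {"127.0.0.1:3301", 1},
        {"127.0.0.1:3302", 3},
        {"127.0.0.1:3303", 2},
    };
    printf("round message to a random peer, plus:\n");
    piggyback_anti_entropy(table, 3, 2);
    return 0;
}
```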
-
Kirill Shcherbatov authored
In order to give a user the ability to use the delimiter symbol within code, the real delimiter is the user-provided 'delim' plus "\n". Since telnet sends "\r\n" on a line break, the expression delim + "\n" could not be found in the sequence data + delim + "\r\n", so the delimiter feature did not work at all. Added a delim + "\r" check along with the delim + "\n" one, which solves the described problem and does not violate backward compatibility. Closes #2027
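A standalone model of the fixed check, assuming the "buffer ends with delimiter" logic described above (the real code lives in the console input handling):

```
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool
ends_with(const char *buf, size_t len, const char *suffix)
{
    size_t slen = strlen(suffix);
    return len >= slen && memcmp(buf + len - slen, suffix, slen) == 0;
}

/*
 * A chunk is complete when it ends with delim + "\n" or, for
 * telnet-style clients that send "\r\n", with delim + "\r" right
 * before the trailing "\n".
 */
static bool
chunk_is_complete(const char *buf, size_t len, const char *delim)
{
    if (len == 0 || buf[len - 1] != '\n')
        return false;
    len--; /* strip "\n" */
    if (ends_with(buf, len, delim))
        return true;
    /* Tolerate the "\r" telnet inserts before "\n". */
    return len > 0 && buf[len - 1] == '\r' &&
           ends_with(buf, len - 1, delim);
}

int
main(void)
{
    printf("%d\n", chunk_is_complete("foo!\n", 5, "!"));   /* 1 */
    printf("%d\n", chunk_is_complete("foo!\r\n", 6, "!")); /* 1: telnet */
    printf("%d\n", chunk_is_complete("foo!", 4, "!"));     /* 0 */
    return 0;
}
```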
-
Georgy Kirichenko authored
Remove the xstream dependency and use the direct box interface to apply all replication rows. This is a refactoring needed for transactional replication. Needed for #2798
-
Mergen Imeev authored
The module table.c is not used and should be removed.
-
- Mar 06, 2019
-
-
Vladimir Davydov authored
The test creates a space, but doesn't drop it, which leads to box-tap/on_schema_init failure:

    | box-tap/trigger_yield.test.lua [ pass ]
    | box-tap/on_schema_init.test.lua [ fail ]
    | Test failed! Output from reject file box-tap/on_schema_init.reject:
    | TAP version 13
    | 1..7
    | ok - on_schema_init trigger set
    | ok - system spaces are accessible
    | ok - before_replace triggers
    | ok - on_replace triggers
    | ok - set on_replace trigger
    | ok - on_schema_init trigger works
    |
    | Last 15 lines of Tarantool Log file [Instance "app_server"][/Users/travis/build/tarantool/tarantool/test/var/002_box-tap/on_schema_init.test.lua.tarantool.log]:
    | 2019-03-06 17:00:12.057 [87410] main/102/on_schema_init.test.lua F> Space 'test' already exists

Fix this.
-
Serge Petrenko authored
This patch introduces an on_schema_init trigger. The trigger may be set before box.cfg() is called and is fired during box.cfg() right after prototypes of system spaces, such as _space, are created. This makes it possible to set triggers on system spaces before any other non-system data is recovered. For example, it is possible to set an on_replace trigger on _space, which will work even during recovery. Part of #3159

@TarantoolBot document Title: document box.ctl.on_schema_init triggers

on_schema_init triggers are set before the first call to box.cfg() and are fired during box.cfg() before user data recovery starts. To set the trigger, say

```
box.ctl.on_schema_init(new_trig, old_trig)
```

where `old_trig` may be omitted. This will replace `old_trig` with `new_trig`. Such triggers let you set triggers on system spaces before recovery of any data, so that the triggers are fired even during recovery. For example, they make it possible to change a specific space's storage engine or make a replicated space replica-local on a freshly bootstrapped replica. If you want to change the storage engine of the space `space_name` to `vinyl`, you may say:

```
function trig(old, new)
    if new[3] == 'space_name' and new[4] ~= 'vinyl' then
        return new:update{{'=', 4, 'vinyl'}}
    end
end
```

Such a trigger may be set on `_space` as a `before_replace` trigger. Thanks to `on_schema_init` triggers, it will happen before any non-system spaces are recovered, so the trigger will work for all user-created spaces:

```
box.ctl.on_schema_init(function()
    box.space._space:before_replace(trig)
end)
```

Note that the above steps are done before the initial `box.cfg{}` call. Otherwise the spaces will already be recovered by the time you set any triggers. Now you can say `box.cfg{replication='master_uri', ...}`, and the replica will have the space `space_name` with the same contents as on the master, but on the `vinyl` storage engine.
-
Vladislav Shpilevoy authored
SWIM wants to allow binding to zero ports so that the kernel can choose any free port automatically. This is needed mainly for tests. A zero port means that the real port is known only after bind() has been called, and getsockname() should be used to get it. SWIM uses the sio library for such low-level API calls, which is why that function is added to sio. Needed for #3234
-
Kirill Shcherbatov authored
Before commit d9f82b17 ("More than one row in fixheader. Zstd compression"), xrow_header_decode treated everything up to 'end' as the packet body, while currently it allows a packet to end before 'end'. iproto_msg_decode may receive invalid msgpack, but it still assumes that xrow_header_decode sets an error in such a case and uses an assert to test it, but this is not so. Introduced a new boolean flag to control the routine's behaviour. When the flag is set, xrow_header_decode should raise a 'packet body' error unless the packet ends exactly at 'end'. @locker: renamed ensure_package_read to end_is_exact; fixed comments. Closes #3900
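A sketch of the resulting contract; the prototype is simplified, check xrow.h for the real one:

```
#include <stdbool.h>

struct xrow_header;

/*
 * Decode a packet from [*pos, end). If end_is_exact is true, a
 * packet that does not end exactly at 'end' gets a "packet body"
 * error set and -1 returned, so callers like iproto_msg_decode
 * may rely on the diagnostic being there for invalid msgpack.
 */
int
xrow_header_decode(struct xrow_header *header, const char **pos,
                   const char *end, bool end_is_exact);
```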
-
- Mar 05, 2019
-
-
Vladimir Davydov authored
A Vinyl transaction may yield while having a non-empty write set. This opens a time window for the instance to switch to read-only mode. Since we check ro flag only before executing a DML request, the transaction would successfully commit in such a case, breaking the assumption that no writes are possible on an instance after box.cfg{read_only=true} returns. In particular, this breaks master-replica switching logic. Fix this by aborting all local rw transactions before switching to read-only mode. Note, remote rw transactions must not be aborted, because they ignore ro flag. Closes #4016
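A minimal model of the fix, assuming simplified types: walk the list of rw transactions and abort only the local ones, since remote (applier) transactions deliberately ignore the ro flag:

```
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a Vinyl transaction on the writers list. */
struct vy_tx {
    bool is_applier;    /* remote tx replicated from the master */
    bool is_aborted;    /* commit will fail with a read-only error */
    struct vy_tx *next; /* stand-in for the real list link */
};

/* Called right before box.cfg{read_only=true} returns. */
static void
tx_manager_abort_writers(struct vy_tx *writers)
{
    for (struct vy_tx *tx = writers; tx != NULL; tx = tx->next) {
        if (!tx->is_applier)
            tx->is_aborted = true;
    }
}
```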
-
Vladimir Davydov authored
We will use this callback to abort rw transactions in Vinyl when an instance is switched to read-only mode. Needed for #4016
-
Vladimir Davydov authored
Currently, we add a transaction to the list of writers when executing a DML request, i.e. in vy_tx_set. The problem is a transaction can yield on read before calling vy_tx_set, e.g. to check a uniqueness constraint, which opens a time window when a transaction is not yet on the list, but it will surely proceed to DML after it continues execution. If we need to abort writers in this time window, we'll miss it. To prevent this, let's add a transaction to the list of writers in vy_tx_begin_statement. Note, after this patch, when a transaction is aborted for DDL, it may have an empty write set - it happens if tx_manager_abort_writers is called between vy_tx_begin_statement and vy_tx_set. Hence we have to remove the corresponding assertion from tx_manager_abort_writers. Needed for #4016
-
Vladimir Davydov authored
Rename vy_tx_rollback_to_savepoint to vy_tx_rollback_statement and vy_tx_savepoint to vy_tx_begin_statement, because soon we will do some extra work there. Needed for #4016
-
- Mar 04, 2019
-
-
Stanislav Zudin authored
Adds collation analysis to the construction of a composite key for index tuples. The key parts of a secondary index consist of the parts defined for the index itself combined with the parts defined for the primary key, with duplicate parts ignored. But the search for duplicates didn't take collation into consideration: if a non-unique secondary index contained primary key columns, the corresponding parts of the primary key were omitted even though their collations could differ. This caused an issue. @locker: comments, renames. Closes #3537
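A sketch of the corrected duplicate test, with simplified types: a primary-key part is skipped only if the secondary index already has a part with both the same field and the same collation:

```
#include <stdbool.h>
#include <stdint.h>

struct coll;

struct key_part {
    uint32_t fieldno;
    struct coll *coll; /* NULL means binary collation */
};

/* Does the part list already contain an equivalent part? */
static bool
key_def_contains_part(const struct key_part *parts, uint32_t part_count,
                      const struct key_part *part)
{
    for (uint32_t i = 0; i < part_count; i++) {
        if (parts[i].fieldno == part->fieldno &&
            parts[i].coll == part->coll) /* collation now matters */
            return true;
    }
    return false;
}
```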
-
Cyrill Gorcunov authored
When building "tags" target we scan the whole working directory which is redundant. In particular .git,.pc,patches directories should not be scanned for sure.
-