Commits · 13acfe474cc7242ef55c58886a766ca1aa6b4796 · core / tarantool

Aug 01, 2018

txn: add helper to detect transaction boundaries · 13acfe47

Add txn_is_first_statement() function, which returns true if this is the
first statement of the transaction. The function is supposed to be used
from on_replace trigger to detect transaction boundaries.

Needed for #2129

13acfe47

vinyl: rename vy_task::status to is_failed · 21eed04c

Vladimir Davydov authored 6 years ago

vy_task::status stores the return code of the ->execute method. There
are only two codes in use: 0 - success and -1 - failure. So let's change
this to a boolean flag.

21eed04c

vinyl: zap vy_scheduler::is_worker_pool_running · d77b4dc9

Vladimir Davydov authored 6 years ago

This flag is set iff worker_pool != NULL hence it is pointless.

d77b4dc9

vinyl: use cbus for communication between scheduler and worker threads · f4625e64

Vladimir Davydov authored 6 years ago

We need cbus for forwarding deferred DELETE statements generated in a
worker thread during primary index compaction to the tx thread where
they can be inserted into secondary indexes. Since pthread mutex/cond
and cbus are incompatible by their nature, let's rework communication
channel between the tx and worker threads using cbus.

Needed for #2129

f4625e64

vinyl: rename some members of vy_scheduler and vy_task struct · 46f50aad

Vladimir Davydov authored 6 years ago

I'm planning to add some new members and remove some old members from
those structs. For this to play nicely, let's do some renames:

  vy_scheduler::workers_available => idle_worker_count
  vy_scheduler::input_queue       => pending_tasks
  vy_scheduler::output_queue      => processed_tasks
  vy_task::link                   => in_pending, in_processed

46f50aad

vinyl: store pointer to scheduler in struct vy_task · 1331d232

Vladimir Davydov authored 6 years ago

Currently, we don't really need it, but once we switch communication
channel between the scheduler and workers from pthread mutex/cond to
cbus (needed for #2129), tasks won't be completed on behalf of the
scheduler fiber and hence we will need a back pointer from vy_task to
vy_scheduler.

Needed for #2129

1331d232

vinyl: do not free pending tasks on shutdown · 15c28b75

Vladimir Davydov authored 6 years ago

This is a prerequisite for switching scheduler-worker communication from
pthread mutex/cond to cbus, which in turn is needed to generate and send
deferred DELETEs from workers back to tx (#2129).

After this patch, pending tasks will be leaked on shutdown. This is OK,
as we leak a lot of objects on shutdown anyway. The proper way of fixing
this leak would be to rework shutdown without atexit() so that we can
use cbus till the very end.

Needed for #2129

15c28b75

vinyl: store full tuples in secondary index cache · 0c5e6cc8

Vladimir Davydov authored 6 years ago

Currently, both vy_read_iterator_next() and vy_point_lookup() add the
returned tuple to the tuple cache. As a result, we store partial tuples
in a secondary index tuple cache although we could store full tuples
(we have to retrieve them anyway when reading a secondary index). This
means wasting memory. Besides, when #2129 gets implemented, there
will be tuples in a secondary index that have to be skipped as they have
been overwritten in the primary index. Caching them would be inefficient
and error prone. So let's call vy_cache_add() from the upper level and
add only full tuples to the cache.

Closes #3478
Needed for #2129

0c5e6cc8

Jul 31, 2018

vinyl: refactor unique check · 85608344

Vladimir Davydov authored 6 years ago

For the sake of further patches, let's do some refactoring:
 - Rename vy_check_is_unique to vy_check_is_unique_primary and use it
   only for checking the unique constraint of primary indexes. Also,
   make it return immediately if the primary index doesn't need
   uniqueness check, like vy_check_is_unique_secondary does.
 - Open-code uniqueness check in vy_check_is_unique_secondary instead of
   using vy_check_is_unique.
 - Reduce indentation level of vy_check_is_unique_secondary by inverting
   the if statement.

85608344

vinyl: fold vy_delete_impl · f88a0bd1

Vladimir Davydov authored 6 years ago

vy_delete_impl helper is only used once in vy_delete and it is rather
small so inlining it definitely won't hurt. On the contrary, it will
consolidate DELETE logic in one place, making the code easier to follow.

f88a0bd1

vinyl: fold vy_replace_one and vy_replace_impl · 1dfeb601

Vladimir Davydov authored 6 years ago

There's no point in separating REPLACE path between the cases when
the space has secondary indexes and when it only has the primary
index, because they are quite similar. Let's fold vy_replace_one
and vy_replace_impl into vy_replace to remove code duplication.

1dfeb601

vinyl: always get full tuple from pk after reading from secondary index · 5ceca76c

Vladimir Davydov authored 6 years ago

Currently, we don't always need a full tuple. Sometimes (e.g. for
checking uniqueness constraint), a partial tuple read from a secondary
index is enough. So we have vy_lsm_get() which reads a partial tuple
from an index. However, once the optimization described in #2129 is
implemented, it might happen that a tuple read from a secondary index
was overwritten or deleted in the primary index, but DELETE statement
hasn't been propagated to the secondary index yet, i.e. we will have to
read the primary index anyway, even if we don't need a full tuple.

That said, let us:

 - Make vy_lsm_get() always fetch a full tuple, even for secondary
   indexes, and rename it to vy_get().

 - Rewrite vy_lsm_full_by_key() as a wrapper around vy_get() and rename
   it to vy_get_by_raw_key().

 - Introduce vy_get_by_secondary_tuple() which gets a full tuple given a
   tuple read from a secondary index. For now, it's basically a call to
   vy_point_lookup(), but it'll become a bit more complex once #2129 is
   implemented.

 - Prepare vy_get() for the fact that a tuple read from a secondary
   index may be absent in the primary index, in which case it should
   try the next matching one.

Needed for #2129

5ceca76c

vinyl: simplify vy_squash_process · 128503ea

Vladimir Davydov authored 6 years ago

Since vy_point_lookup() now guarantees that it returns the newest
tuple version, we can remove the code that squashes UPSERTs from
vy_squash_process().

128503ea

vinyl: make point lookup always return the latest tuple version · 6d85c35c

Vladimir Davydov authored 6 years ago

Currently, vy_point_lookup(), in contrast to vy_read_iterator, doesn't
rescan the memory level after reading disk, so if the caller doesn't
track the key before calling this function, the caller won't be sent to
a read view in case the key gets updated during yield and hence will
be returned a stale tuple. This is OK now, because we always track the
key before calling vy_point_lookup(), either in the primary or in a
secondary index. However, for #2129 we need it to always return the
latest tuple version, no matter if the key is tracked or not.

The point is that in the scope of #2129 we won't write DELETE statements
to secondary indexes corresponding to a tuple replaced in the primary
index. Instead, after reading a tuple from a secondary index, we will
check whether it matches the tuple corresponding to it in the primary
index: if it does not, it means that the tuple read from the secondary
index was overwritten and should be skipped. E.g. suppose we have the
primary index over the first field and a secondary index over the second
field and the following statements in the space:

  REPLACE{1, 10}
  REPLACE{1, 20}

Then reading {10} from the secondary index will return REPLACE{1, 10}, but
lookup of {1} in the primary index will return REPLACE{1, 20} which
doesn't match REPLACE{1, 10} read from the secondary index hence the
latter was overwritten and should be skipped.
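
The check described above can be illustrated with a minimal sketch. This is a toy in-memory model, not vinyl code: the function name, the dict/list representation of the indexes and the tuple layout are all assumptions made for illustration.

```python
# Hypothetical in-memory model; names are illustrative, not vinyl's API.
def read_via_secondary(primary, secondary, sk):
    """Return full tuples for secondary key `sk`, skipping overwritten ones.

    primary:   dict mapping primary key -> latest full tuple
    secondary: list of tuples as written to the secondary index, with no
               DELETEs for versions replaced in the primary index
    """
    result = []
    for tup in secondary:
        if tup[1] != sk:
            continue
        latest = primary.get(tup[0])
        # The secondary entry is valid only if it matches the newest
        # version stored under the same primary key.
        if latest == tup:
            result.append(latest)
    return result


primary = {}
secondary = []
for stmt in [(1, 10), (1, 20)]:   # REPLACE{1, 10}; REPLACE{1, 20}
    primary[stmt[0]] = stmt       # overwrite in the primary index
    secondary.append(stmt)        # no DELETE for the old secondary entry

print(read_via_secondary(primary, secondary, 10))  # []
print(read_via_secondary(primary, secondary, 20))  # [(1, 20)]
```

The sketch only works if the primary lookup returns the newest version, which is exactly the guarantee this patch adds to the point lookup.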

The problem is that in the example above we don't want to track key {1} in
the primary index before lookup, because we don't actually read its
value. So for the check to work correctly, we need the point lookup to
guarantee that the returned tuple is always the newest one. It's fairly
easy to do - we just need to rescan the memory level after yielding on
disk if its version changed.

Needed for #2129

6d85c35c

Add tarantoolctl rocks pack/unpack subcommands · 0746fdb4

Konstantin Nazarov authored 6 years ago

The subcommands are used to create binary rock distributions.
In context of #3525

0746fdb4

Jul 30, 2018

vinyl: implement rebootstrap support · 06658416

Vladimir Davydov authored 6 years ago

If vy_log_bootstrap() finds a vylog file in the vinyl directory, it
assumes it has to be rebootstrapped and calls vy_log_rebootstrap().
The latter scans the old vylog file to find the max vinyl object id,
from which it will start numbering objects created during rebootstrap to
avoid conflicts with old objects, then it writes VY_LOG_REBOOTSTRAP
record to the old vylog to denote the beginning of a rebootstrap
section. After that initial join proceeds as usual, writing information
about new objects to the old vylog file after VY_LOG_REBOOTSTRAP marker.
Upon successful rebootstrap completion, checkpoint, which is always
called right after bootstrap, rotates the old vylog and marks all
objects created before the VY_LOG_REBOOTSTRAP marker as dropped in the
new vylog. The old objects will be purged by the garbage collector as
usual.

In case rebootstrap fails and checkpoint never happens, local recovery
writes VY_LOG_ABORT_REBOOTSTRAP record to the vylog. This marker
indicates that the rebootstrap attempt failed and all objects created
during rebootstrap should be discarded. They will be purged by the
garbage collector on checkpoint. Thus even if rebootstrap fails, it is
possible to recover the database to the state that existed right before
a failed rebootstrap attempt.
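
The effect of the two markers can be sketched as a replay over a list of records. This is a simplified model under stated assumptions: the record tuples, the function name and the returned (live, dropped) pair are hypothetical and do not reflect the real vylog format.

```python
# Toy model of the rebootstrap section in vylog; record names are
# hypothetical stand-ins for VY_LOG_REBOOTSTRAP / VY_LOG_ABORT_REBOOTSTRAP.
def state_after_checkpoint(records):
    """Replay records and return (live, dropped) sets of object ids."""
    old, new, dropped = set(), set(), set()
    in_rebootstrap = False
    for rec in records:
        kind = rec[0]
        if kind == "create":
            (new if in_rebootstrap else old).add(rec[1])
        elif kind == "rebootstrap":
            # Everything created from here on belongs to the new bootstrap.
            in_rebootstrap = True
        elif kind == "abort_rebootstrap":
            # Failed attempt: discard objects created during rebootstrap.
            dropped |= new
            new = set()
            in_rebootstrap = False
    if in_rebootstrap:
        # Rebootstrap completed: objects created before the marker are
        # marked as dropped and left to the garbage collector.
        return new, dropped | old
    return old, dropped


ok = [("create", 1), ("create", 2), ("rebootstrap",), ("create", 3)]
failed = ok + [("abort_rebootstrap",)]
print(state_after_checkpoint(ok))      # ({3}, {1, 2})
print(state_after_checkpoint(failed))  # ({1, 2}, {3})
```

In the model, new ids are simply chosen above the old ones by hand; the patch achieves the same by scanning the old vylog for the max object id.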

Closes #461

06658416

vinyl: simplify vylog recovery from backup · 8e710090

Vladimir Davydov authored 6 years ago

Since we don't create snapshot files for vylog, but instead append
records written after checkpoint to the same file, we have to use the
previous vylog file for backup (see vy_log_backup_path()). So when
recovering from a backup we need to rotate the last vylog to keep vylog
and checkpoint signatures in sync. Currently, we do it on recovery
completion and we use vy_log_create() instead of vy_log_rotate() for it.
This is done so that we can reuse the context that was used for recovery
instead of rereading vylog for rotation. Actually, there's no point in
this micro-optimization, because we rotate vylog only when recovering
from a backup. Let's remove it and use vy_log_rotate() for this.

Needed for #461

8e710090

replication: print master uuid when (re)bootstrapping · 71cec841

Vladimir Davydov authored 6 years ago

Currently only the remote address is printed. Let's also print the UUID,
because replicas are identified by UUID everywhere in tarantool, not by
the address. An example of the output is below:

I> can't follow eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083: required {1: 8} available {1: 12}
C> replica is too old, initiating rebootstrap
I> bootstrapping replica from eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083

I> can't follow eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083: required {1: 17, 2: 1} available {1: 20}
I> can't rebootstrap from eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083: replica has local rows: local {1: 17, 2: 1} remote {1: 23}
I> recovery start

Suggested by @kostja.

Follow-up ea69a0cd ("replication: rebootstrap instance on startup
if it fell behind").

71cec841

vinyl: zap tx_manager_vlsn · 5a772639

Vladimir Davydov authored 6 years ago

This function is not used anywhere since commit a1e005d8
("vinyl: write_iterator merges vlsns subsequnces")

5a772639

Jul 26, 2018

Merge branch '1.9' into 1.10 · fe07ada1

Konstantin Osipov authored 6 years ago

fe07ada1

lua: fix fio.rmtree to work with non empty dirs · 9917edc7

Konstantin Belyavskiy authored 6 years ago

Fix 'fio.rmtree' to remove non-empty directories.
Also update the test.

Closes #3258

9917edc7

lua: fix fio.rmtree to work with non empty dirs · 564a053c

Konstantin Belyavskiy authored 6 years ago

Fix 'fio.rmtree' to remove non-empty directories.
Also update the test.

Closes #3258

564a053c

Make access_check_ddl check for entity privileges. · d2e70f18

Serge Petrenko authored 6 years ago

Function access_check_ddl checked only for universal access, thus
granting entity or single object access to a user would have no effect
in scope of this function.
Fix this by adding entity access checks.

Also, attaching an existing sequence to a space checked for
create privilege on both space and sequence
(instead of read + write on sequence). Fix it and change the tests
accordingly.

Closes #3516

d2e70f18

Jul 24, 2018

Allow to mix blackhole statements in other engines' transactions · d512174a

Vladimir Davydov authored 6 years ago

Blackhole doesn't need transaction control as it doesn't actually store
anything so we can mark it with ENGINE_BYPASS_TX.

d512174a

Merge branch '1.9' into 1.10 · b9fd0b3b

Vladimir Davydov authored 6 years ago

b9fd0b3b

Jul 23, 2018

replication: rebootstrap instance on startup if it fell behind · ea69a0cd

Vladimir Davydov authored 6 years ago

If a replica fell too much behind its peers in the cluster and xlog
files needed for it to get up to speed have been removed, it won't be
able to proceed without rebootstrap. This patch makes the recovery
procedure detect such cases and initiate rebootstrap procedure if
necessary.

Note, rebootstrap is currently only supported by memtx engine. If there
are vinyl spaces on the replica, rebootstrap will fail. This is fixed by
the following patches.
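
A rough sketch of the decision, assuming vclocks are compared component-wise and that a missing component counts as zero. The function names and return values are hypothetical; the second call uses the vclocks from the sample log output of the follow-up commit 71cec841, while the remote vclock in the first call is made up.

```python
# Hypothetical model of the startup decision; not the actual recovery code.
def vclock_le(a, b):
    """True if vclock a <= b in every component (missing component = 0)."""
    return all(lsn <= b.get(rid, 0) for rid, lsn in a.items())


def recovery_action(local, oldest_available, remote):
    """Decide what a replica does on startup.

    local:            vclock the replica has reached
    oldest_available: oldest vclock the master can still serve rows from
    remote:           current vclock of the master
    """
    if vclock_le(oldest_available, local):
        return "follow"        # needed xlogs are still there
    if not vclock_le(local, remote):
        return "recover"       # replica has local rows, can't rebootstrap
    return "rebootstrap"       # too old and nothing local to lose


print(recovery_action({1: 8}, {1: 12}, {1: 15}))         # rebootstrap
print(recovery_action({1: 17, 2: 1}, {1: 20}, {1: 23}))  # recover
```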

Part of #461

ea69a0cd

tx: exclude sysview engine from transaction control · 0ecabde8

Vladimir Davydov authored 6 years ago

Sysview is a special engine that is used for filtering out objects that
a user can't access due to lack of privileges. Since it's treated as a
separate engine by the transaction manager, we can't query sysview
spaces from a memtx/vinyl transaction. In particular, if called from a
transaction space:format() will return

  error: A multi-statement transaction can not use multiple storage engines

which is inconvenient.

To fix this, let's mark sysview engine with a new ENGINE_BYPASS_TX flag
and make the transaction manager skip binding a transaction to an engine
in case this flag is set.
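
The idea can be sketched with a toy transaction manager. Only the flag name ENGINE_BYPASS_TX and the error message come from this log; the classes, the flag value and the begin_stmt() method are assumptions for illustration.

```python
ENGINE_BYPASS_TX = 1 << 0  # flag value is an assumption


class Engine:
    def __init__(self, name, flags=0):
        self.name = name
        self.flags = flags


class Txn:
    def __init__(self):
        self.engine = None  # engine this transaction is bound to

    def begin_stmt(self, engine):
        if engine.flags & ENGINE_BYPASS_TX:
            return  # do not bind the transaction to this engine
        if self.engine is None:
            self.engine = engine
        elif self.engine is not engine:
            raise RuntimeError("A multi-statement transaction can not "
                               "use multiple storage engines")


memtx = Engine("memtx")
vinyl = Engine("vinyl")
sysview = Engine("sysview", ENGINE_BYPASS_TX)

txn = Txn()
txn.begin_stmt(memtx)
txn.begin_stmt(sysview)      # allowed: sysview bypasses binding
try:
    txn.begin_stmt(vinyl)    # still rejected: a second real engine
except RuntimeError as e:
    print("error:", e)
```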

Closes #3528

0ecabde8

Introduce blackhole engine · cdf3ed8f

Vladimir Davydov authored 6 years ago

Blackhole is a very simple engine that allows to create spaces that may
be written to, but not read from. It only supports INSERT/REPLACE requests.
It doesn't support any indexes hence SELECT is impossible. It does check
space format though and supports on_replace and before_replace triggers.

The whole purpose of this new engine is writing arbitrary rows to WAL
without storing them anywhere. In particular, we need this engine to
write deferred DELETEs generated for vinyl spaces to WAL.

Needed for #2129

cdf3ed8f

space: call before_replace trigger even if space has no indexes · 00204b6a

Vladimir Davydov authored 6 years ago

Needed for blackhole spaces, which don't support indexes per se, but
still may have a before_replace trigger installed.

00204b6a

Jul 22, 2018

replication: unregister replica with gc if deleted from cluster · ea28a925

Vladimir Davydov authored 6 years ago

When a replica is removed from the cluster table, the corresponding
replica struct isn't destroyed unless both the relay and the applier
attached to it are stopped, see replica_clear_id(). Since replica struct
is a holder of the garbage collection state, this means that in case an
evicted replica has an applier or a relay that fails to exit for some
reason, garbage collection will hang.

A relay thread stops as soon as the replica it was started for receives
a row that tries to delete it from the cluster table (because this isn't
allowed by the cluster space trigger, see on_replace_dd_cluster()).
If a replica isn't running, the corresponding relay can't run either,
because writing to a closed socket isn't allowed. That said, a relay
can't block garbage collection.

An applier, however, is deleted only when replication is reconfigured.
So if a replica that was evicted from the cluster was configured as a
master, its replica struct will hang around blocking garbage collection
for as long as the replica remains in box.cfg.replication. This is what
happens in #3546.

Fix this issue by forcefully unregistering a replica with the garbage
collector when it is deleted from the cluster table. This is OK as it
won't be able to resubscribe and so we don't need to keep WALs for it
any longer. Note, the relay thread may still be running when a replica
is deleted from the cluster table, in which case we can't unregister it
with the garbage collector right away, because the relay may need to
access the garbage collection state. In such a case, leave the job to
replica_clear_relay, which is called as soon as the relay thread exits.

Closes #3546

ea28a925

Jul 21, 2018

txn: unify txn_stmt tuples reference counting rules · efed5d7f

Vladimir Davydov authored 6 years ago

Currently, the way txn_stmt::old_tuple and new_tuple are referenced
depends on the engine. For vinyl, the rules are straightforward: if
txn_stmt::{old_tuple,new_tuple} is not NULL, then the reference to the
corresponding tuple is elevated. Hence when a transaction is committed
or rolled back, vinyl calls tuple_unref on both txn_stmt::old_tuple and
new_tuple. For memtx, things are different: the engine doesn't
explicitly increment the reference counter of the tuples - it simply
sets them to the newly inserted tuple and the replaced tuple. On commit,
the reference counter of the old tuple is decreased to delete the
replaced tuple, while on rollback the reference counter of the new tuple
is decreased to delete the new tuple.

Because of this, we can't implement the blackhole engine (aka /dev/null)
without implementing commit and rollback engine methods - even though
such an engine doesn't store anything it still has to set the new_tuple
for on_replace trigger and hence it is responsible for releasing it on
commit or rollback. Since commit/rollback are rather inappropriate for
this kind of engine, let's instead unify txn_stmt reference counting
rules and make txn.c unreference the tuples no matter what the engine is.
This doesn't change vinyl, because it already conforms. For memtx, this
means that we need to increase the reference counter when we insert a
new tuple into a space - not a big deal as tuple_ref is almost free.
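
A minimal sketch of the unified rule, using a blackhole-like engine as the example. All names here are hypothetical stand-ins (the real code is C and uses tuple_ref/tuple_unref on struct tuple); the point is only that the engine takes a reference when it sets a tuple, and the transaction code drops it at the end.

```python
# Toy reference-counting model; classes and functions are illustrative.
class Tuple:
    def __init__(self, data):
        self.data = data
        self.refs = 0


class Stmt:
    def __init__(self):
        self.old_tuple = None
        self.new_tuple = None


def blackhole_replace(stmt, tup):
    """Engine side: store nothing, but expose the tuple to triggers."""
    tup.refs += 1
    stmt.new_tuple = tup


def txn_finish(stmts):
    """Transaction side: on commit or rollback, unreference every tuple
    set in a statement, regardless of the engine."""
    for stmt in stmts:
        for tup in (stmt.old_tuple, stmt.new_tuple):
            if tup is not None:
                tup.refs -= 1


t = Tuple((1, "row"))
stmt = Stmt()
blackhole_replace(stmt, t)
print(t.refs)        # 1 while the statement is in progress
txn_finish([stmt])
print(t.refs)        # 0 afterwards, without engine commit/rollback hooks
```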

efed5d7f

Rework memtx replace function · d361b1f7

Nikita Pettik authored 7 years ago

Now the replace function takes the new tuple and the old tuple as arguments
instead of a single txn_stmt. This avoids abusing txn_stmt, whose only
use was extracting the tuples from it.
As a result, this function can be used by ephemeral tables
without any patching.

(cherry picked from commit 880712c9)

d361b1f7

Merge sysview_index.[hc] and sysview_engine.[hc] · 44fc192d

Vladimir Davydov authored 6 years ago

They are fairly small and closely related so let's merge them and call
the result sysview.[hc].

44fc192d

Add generic engine, space, index method stubs · 38a27423

Vladimir Davydov authored 6 years ago

This should reduce maintenance burden and help us introduce a new
engine.

38a27423

Include oldest vclock available on the instance in IPROTO_BALLOT · 989bb8f0

Vladimir Davydov authored 6 years ago

It will be used to check if a replica fell too much behind its peers and
so needs to be rebootstrapped.

Needed for #461

989bb8f0

Get rid of IPROTO_SERVER_IS_RO · 0ade0880

Vladimir Davydov authored 6 years ago

Not needed anymore as we now use the new IPROTO_VOTE command instead of
IPROTO_VOTE_DEPRECATED. Let's remove it altogether and reuse its code
for IPROTO_BALLOT (they are never decoded together so no conflict should
happen). The worst that can happen is that we choose a read-only master when
bootstrapping an older version of tarantool.

0ade0880

IPROTO_VOTE command - follow-up fixes · 42a0ebfa

Vladimir Davydov authored 6 years ago

This patch contains some follow-up fixes for fe8ae607
("Introduce IPROTO_VOTE command"):
 - Rename 'status' to 'ballot' everywhere in the comments.
 - Rename IPROTO_REQUEST_VOTE to IPROTO_VOTE_DEPRECATED and
   iproto_reply_request_vote to iproto_reply_vote_deprecated
   to emphasize the fact that this iproto command has been
   deprecated and IPROTO_VOTE should be used instead.
 - Only send an IPROTO_VOTE request to a master if it is
   running tarantool 1.10.1 or newer.

42a0ebfa

Jul 20, 2018

Introduce IPROTO_VOTE command · fe8ae607

Vladimir Davydov authored 6 years ago

The new command is supposed to supersede IPROTO_REQUEST_VOTE, which is
difficult to extend, because it uses the global iproto key namespace.
The new command returns a map (IPROTO_BALLOT), to which we can add
various information without polluting the global namespace. Currently,
the map contains IPROTO_BALLOT_IS_RO and IPROTO_BALLOT_VCLOCK keys,
but info needed for the replica rebootstrap feature will be added soon.

Needed for #461

fe8ae607

Jul 19, 2018

Merge branch '1.9' into 1.10 · 712108d2

Kirill Yukhin authored 6 years ago

712108d2

say: fix invalid arguments · 1046f851

Kirill Shcherbatov authored 6 years ago

_say function was called with invalid arguments. Thanks to @sorc1 for
the patch. Closes #3433.

1046f851