  1. Mar 21, 2017
    • box: minor cleanups · 37534436
      Konstantin Osipov authored
      * reduce the scope of wal_stream in box_cfg_xc()
      * remove unused start_offset/end_offset from wal_request
      * remove unused rmean_wal_tx_bus
    • recovery: avoid using the global recovery->vclock · 010662df
      Konstantin Osipov authored
      Do not use recovery->vclock in lua/info.cc and for truncate.
      
      Maintenance of recovery->vclock after initial recovery is finished
      is an artefact of the Tarantool 1.5 architecture and will be removed
      in the future.
    • vinyl: start scheduler for remote recovery · e0dc453f
      Vladimir Davydov authored
      The amount of data sent when bootstrapping a replica is limited only by
      the master's disk size, which can exceed the size of available memory by
      orders of magnitude. So we need the scheduler to be up and running while
      bootstrapping from the remote host so that it could schedule dumps when
      the quota limit is hit.
    • vinyl: rework replication · 2993a758
      Vladimir Davydov authored
      Currently, on initial join we send the current vinyl state. To do that,
      we open a read iterator over a space's primary index and send statements
      returned by it. Such an approach has a number of inherent problems:
      
       - An open read iterator blocks compaction, which is unacceptable for
         such a long operation as join. To avoid blocking compaction, we open
         the iterator in the dirty mode, i.e. it skims over the tops. This,
         however, introduces a different kind of problem: it makes the
         boundary between the initial and final join phases hazy - statements
         sent on final join may or may not be among those sent during the
         initial join, and there's no efficient way to differentiate between
         them w/o sending extra information.
      
       - The replica expects LSNs to be growing monotonically. This constraint
         is imposed by the lsregion allocator used for storing statements in
         memory, but read iterator returns statements ordered by key, not by
         LSN. Currently, the replica simply crashes if statements happen to be
         sent in an order different from chronological, which renders vinyl
         replication unusable. In the scope of the current model, we can't fix
         this by assigning fake LSNs to statements received on initial join,
         because there's no strict LSN threshold between initial and final
         join phases (see the previous paragraph).
      
       - In the initial join phase, the replica is only aware of spaces that were
         created before the last snapshot, while vinyl sends statements from
         spaces that exist now. As a result, if a space was created after the
         most recent snapshot, the replica won't be able to receive its tuples
         and will fail.
      
      To address the above-mentioned problems, we make vinyl initial join send
      the latest snapshot, just like in the case of memtx. We implement this by
      loading the vinyl state from the last snapshot of the metadata log and
      sending statements of all runs from the snapshot as is (including
      deletes and updates), to be applied by the replica. To make lsregion at
      the receiving end happy, we assign fake monotonically growing LSNs to
      statements received on initial join. This is OK, because
      
        any LSN from final join > max real LSN from initial join
        max real LSN from initial join >= max fake LSN
      
      hence
      
        any LSN from final join > any fake LSN from initial join
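
      A minimal Lua sketch of the fake-LSN scheme (illustrative only; the
      actual implementation lives in C):

        -- Assign fake, monotonically growing LSNs to statements
        -- received on initial join, as the lsregion allocator requires.
        local fake_lsn = 0
        local function assign_fake_lsn(stmt)
            fake_lsn = fake_lsn + 1
            stmt.lsn = fake_lsn
            return stmt
        end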
      
      Besides fixing vinyl replication, this patch also enables the
      replication test suite for the vinyl engine (except for hot_standby)
      and makes engine/replica_join cover the following test cases:
       - secondary indexes
       - delete and update statements
       - keys added in an order different from LSN
       - recreate space after checkpoint
      
      Closes #1911
      Closes #2001
    • txn: remove recovery vclock promotion from recovery_fill_lsn() · ac87c66b
      Konstantin Osipov authored
      * promote recovery vclock in recover_xlog(), after we apply
        the recovered row
      * remove unnecessary promotions from WAL and relays: they
        should not use recovery vclock going forward.
      * update the format specifier for error code ER_UNKNOWN_REPLICA
        to expect a string rather than an integer, since it's passed
        a string for the replica id, not an integer
      * remove unused code in relay.cc
    • recovery: move check for alien xrows to applier · 7bdaff31
      Konstantin Osipov authored
      Assume we can trust everything we read from the local
      recovery - the xlog signatures and checksums are on guard for this.
    • Implement space:bsize(). Issue #2043. · 7a2bb91b
      Roman Tokarev authored
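      Hypothetical usage in Lua (the space and data are examples; bsize()
      returns the size, in bytes, of the tuples stored in the space):

        local s = box.schema.space.create('test')
        s:create_index('pk')
        s:replace{1, 'abc'}
        s:bsize() -- total size of the space's tuples, in bytes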
    • info: don't use wal_checkpoint() to display the current vclock · b44c1fd9
      Konstantin Osipov authored
      Using wal_checkpoint() incurs extra, unnecessary overhead in the
      tx thread. Monitoring inquiries are supposed to be quick and to
      incur close to no overhead on the tx thread.
      
      The state of replicaset vclock is good enough for monitoring.
      
      Revert the test results broken by f2bccc18 by @rtsisyk.
      The original results were correct, as indicated by the comment
      left unchanged by f2bccc18.
    • 8ff0a6a8
      Konstantin Osipov authored
    • wal: initialize row timestamp in wal_request_write(). · fa148512
      Konstantin Osipov authored
      Initialize the row timestamp before writing the row, in
      wal_request_write(). The row timestamp does not bear any semantics;
      it's for informational purposes only. Initialize it in the WAL
      thread. This fixes timestamps of the xctl file as well, which were
      up until now left uninitialized.
      
      The patch prepares for removal of recovery_fill_lsn().
    • recovery: remove recovery.h from applier.cc · 38313524
      Konstantin Osipov authored
      Use replicaset_vclock in SUBSCRIBE. Ensure it is correctly initialized
      after local recovery, so that it contains the LSNs of remote
      servers even after their rows were saved in the local WAL and
      recovered from it.
  2. Mar 17, 2017
    • xctl: fix xlog fd leak on rotation · 81405c33
      Vladimir Davydov authored
      Follow up 5102bfbc
    • Fix flaky app/ipc.test.lua · 0d5b047c
      Roman Tsisyk authored
      Closes #1845
    • Add missing replication/prune.result · 5cc226ce
      Roman Tsisyk authored
      Follow up f232756b
    • box: fix LSN assigned on final join · d45bcc5a
      Vladimir Davydov authored
      The number of rows sent during initial join may be greater than the LSN
      of the checkpoint sent by the master, because there are rows that do not
      contribute to LSN (system spaces, etc). If this happens, LSNs assigned
      on final join will be greater than LSNs assigned after bootstrapping is
      complete, which breaks Vinyl logic. Fix that by resetting recovery
      vclock to the checkpoint LSN before getting to final join.
    • Consolidate garbage collection · 666c0337
      Vladimir Davydov authored
      Currently, old snapshots and xlogs are deleted by the snapshot daemon
      while vinyl files are removed from engine_commit_checkpoint().
      For the sake of backups and replication, which need to temporarily
      disable garbage collection, we should bring all garbage collection
      routines together in one place. That's why this patch introduces
      the box.internal.gc() method, which takes the LSN of the latest
      snapshot to keep as its argument. When called, it deletes all xlog
      files as well as engine-specific files (memtx snapshots, vinyl runs)
      that are not required to recover from a snapshot with an LSN greater
      than or equal to the given one. For removal of engine-specific files,
      a new engine callback is introduced, Engine::collectGarbage. The
      snapshot daemon now calls box.internal.gc() to clean up instead of
      deleting snap and xlog files by itself.
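
      An illustrative call in Lua (the LSN value is an example):

        -- Delete everything not needed to recover from a snapshot
        -- with an LSN greater than or equal to snap_lsn.
        local snap_lsn = 12345 -- example: LSN of the latest snapshot to keep
        box.internal.gc(snap_lsn)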
    • xctl: separate gc from metadata log rotation · 6c5329af
      Vladimir Davydov authored
      This patch is a preparation for centralized garbage collection. It
      extracts the code doing garbage collection from xctl_rotate() and
      places it in a separate function, xctl_collect_garbage(). The latter
      takes a signature that determines the minimal age an object must have
      to be deleted: the function only removes files left over from objects
      that were deleted before the log received the given signature. This
      is needed to make vinyl respect box.cfg.snapshot_count.
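
      A Lua model of the predicate (field names are made up; the real
      code is C):

        -- An object's files may be collected only if the object was
        -- deleted before the log received the given signature.
        local function can_collect(object, gc_signature)
            return object.is_deleted and object.delete_signature < gc_signature
        end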
    • vinyl: make snapshot consistent · bc6457a1
      Vladimir Davydov authored
      A set of run files created by a snapshot is inconsistent: without
      replaying xlog, it is not guaranteed to contain the database state
      that existed when the snapshot was taken. This is because we dump all
      ranges independently and each range as a whole, so that if a statement
      happens to be inserted into a range after the snapshot was started and before
      the range is dumped, it will be included in the dump. This peculiarity
      stands in the way of backups and replication, both of which require a
      consistent database state.
      
      To make the snapshot consistent, let's force rotation of all in-memory
      trees on snapshot and make the dump task dump only the trees that need
      to be snapshotted while a snapshot is in progress. The rotation is done
      lazily, on insertion to the tree, similarly to how we handle DDL. The
      difference is that instead of sc_version we check vy_mem->min_lsn
      against checkpoint_lsn.
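
      A Lua model of the lazy rotation check on insertion (field names
      follow the message; the logic is a sketch, not the real C code):

        local function mem_needs_rotation(mem, sc_version, checkpoint_lsn)
            if mem.sc_version ~= sc_version then
                return true -- DDL: schema changed since the mem was created
            end
            -- snapshot: the mem holds statements older than the checkpoint
            return checkpoint_lsn ~= nil and mem.min_lsn <= checkpoint_lsn
        end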
    • vinyl: remember which mems are dumped by lsn · c1b25490
      Vladimir Davydov authored
      Since range->mem can be rotated while dump is in progress, we have to
      remember which mems we are dumping. Commit 818208c4 ("vinyl: fix
      unwritten mem dropped if ddl") does this by remembering the number of
      frozen mems at the time of dump preparation on the dump task. Currently,
      this works fine, because we always dump all frozen mems. However, this
      condition won't hold when consistent snapshot is introduced. The point
      is that in order to make snapshot consistent, we need to dump only
      in-memory trees which were created before WAL checkpoint during
      snapshot. These mems are not even guaranteed to be at the end of the
      range->frozen list because of range coalescing. So in this patch we use
      the current LSN to remember which mems are going to be dumped: all mems
      created after the dump task was created will have min_lsn > the LSN of
      task creation, so on task completion we should delete only mems with
      min_lsn <= that LSN.
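
      A Lua model of the completion step (illustrative names):

        -- On dump completion, drop only the mems covered by the task
        -- (min_lsn <= the LSN taken at task creation); keep newer ones.
        local function release_dumped_mems(frozen, dump_lsn)
            local kept = {}
            for _, mem in ipairs(frozen) do
                if mem.min_lsn > dump_lsn then
                    table.insert(kept, mem)
                end
            end
            return kept
        end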
    • vinyl: make sure all statements with LSN <= snapshot LSN are dumped · db180cde
      Vladimir Davydov authored
      In contrast to the memtx engine, which populates in-memory trees from
      Engine::prepare(), in the case of Vinyl statements are inserted into
      in-memory trees after the WAL write, from the Engine::commit() callback.
      Therefore, to make sure all statements inserted before snapshot are
      dumped, we must initiate checkpoint after WAL rotation. Currently, it is
      not true - checkpoint is initiated from Engine::beginCheckpoint(). To
      make Vinyl snapshots consistent (not requiring xlog replay), we have to
      fix that, so introduce a new callback, Engine::prepareWaitCheckpoint(),
      which is called right after WAL rotation, and trigger Vinyl checkpoint
      from it.
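
      An illustrative Lua model of the resulting sequence (the real
      callbacks are C++ engine methods; only beginCheckpoint and
      prepareWaitCheckpoint come from this message):

        local function checkpoint(engines, wal)
            for _, e in ipairs(engines) do e:beginCheckpoint() end
            wal:rotate() -- rows committed before this point are in memory
            -- vinyl triggers its checkpoint here, right after WAL rotation:
            for _, e in ipairs(engines) do e:prepareWaitCheckpoint() end
        end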
    • vinyl: zap vy_mem_update_formats() + cleanup · 95b9af65
      Vladimir Davydov authored
      vy_mem_update_formats() is used to update mem formats when mem rotation
      is skipped, because the mem is empty. This doesn't work as expected,
      because vy_mem_update_formats() does not update mem->sc_version, so that
      in case of ddl the next insertion will rotate it anyway. Instead of
      updating sc_version in vy_mem_update_formats(), let's fix this by
      zapping the helper altogether and simply recreating mem - it isn't a big
      deal, because this does not happen often. While we are at it, let's
      also:
       - reorder arguments of vy_mem_new() to keep key_def close to format
       - remove extra arguments of vy_range_rotate_mem() - we can get all
         of them right there (in fact we already do in case of ->format).
    • vinyl: add frozen mems of splitting range to read iterator · ee447dd0
      Vladimir Davydov authored
      Currently, if the range is splitting, we only add active in-memory
      indexes of the resulting ranges to the read iterator, see
      vy_read_iterator_add_mem(). This is because until recently a mem could
      only be frozen on dump/compaction task preparation, which is disabled
      while split is in progress. However, this is no longer true - a mem can
      be rotated on txn_commit() in case of DDL, hence we must always add all
      in-memory indexes, including frozen ones, when opening a read iterator.
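
      A Lua sketch of the fixed behavior (illustrative field names):

        -- When opening a read iterator over a range, add the active
        -- in-memory index and every frozen one.
        local function read_iterator_add_mems(itr, range)
            itr:add(range.mem)
            for _, mem in ipairs(range.frozen) do
                itr:add(mem)
            end
        end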