  1. May 30, 2018
    • Allow to increase box.cfg.vinyl_memory and memtx_memory at runtime · 30492862
      Vladimir Davydov authored
      The slab arena can grow dynamically, so all we need to do is increase the
      quota limit. Decreasing the limits is still explicitly prohibited,
      because the slab arena never unmaps slabs.
      
      Closes #2634
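
      For illustration, raising the limits on a running instance might look
      like this (a minimal sketch; the sizes are arbitrary):

          box.cfg{memtx_memory = 512 * 1024 * 1024} -- grow memtx quota to 512 MB
          box.cfg{vinyl_memory = 256 * 1024 * 1024} -- grow vinyl quota to 256 MB
          -- Decreasing a limit is still prohibited (the slab arena never unmaps
          -- slabs), so e.g. box.cfg{memtx_memory = 64 * 1024 * 1024} raises an error.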
    • vinyl: update recovery context with records written during recovery · d135f39c
      Vladimir Davydov authored
      During recovery, we may write VY_LOG_CREATE_LSM and VY_LOG_DROP_LSM
      records we failed to write before restart (because those records are
      written after WAL and hence may not make it to vylog). Right after
      recovery we invoke garbage collection to drop incomplete runs. Once
      VY_LOG_PREPARE_LSM record is introduced, we will also collect incomplete
      LSM trees there (those we failed to build). However, there may be LSM
      trees we managed to build but failed to write VY_LOG_CREATE_LSM for.
      This is OK as we will retry vylog write, but currently it isn't
      reflected in the recovery context used for garbage collection. To avoid
      purging such LSM trees, let's update the recovery context with records
      written during recovery.
      
      Needed for #1653
  2. May 29, 2018
  3. May 25, 2018
    • replication: display downstream status at upstream · 3db1dee9
      Konstantin Belyavskiy authored
      This fix improves the 'box.info.replication' output. If the downstream
      fails and thus disconnects from the upstream, improve logging by
      printing 'status: disconnected' and the error message on both sides
      (master and replica).
      
      Closes #3365
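
      A sketch of inspecting the new output on the master (the 'message'
      field is an assumption based on the description above):

          for id, r in pairs(box.info.replication) do
              if r.downstream ~= nil and r.downstream.status == 'disconnected' then
                  print(id, r.downstream.message) -- error message added by this fix
              end
          end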
    • replication: do not delete relay on applier disconnect · adc28591
      Konstantin Belyavskiy authored
      This is part of a more complex task aiming to improve logging.
      Do not destroy the relay, since it stores the last error, which can
      be useful for diagnostics.
      Now the relay is created together with the replica and always exists,
      so several NULL checks can be removed.
      Add relay_state { OFF, FOLLOW, STOPPED } to track replica presence:
      once connected, the relay is either FOLLOW or STOPPED until the
      master is reset.
      Updated with @kostja's proposal.
      
      Used for #3365.
    • vinyl: purge dropped indexes from vylog on garbage collection · a2d1d2a2
      Vladimir Davydov authored
      Currently, when an index is dropped, we remove all ranges/slices
      associated with it and mark all runs as dropped in vylog immediately.
      To find ranges/slices/runs, we use vy_lsm struct, see vy_log_lsm_prune.
      
      The problem is that the vy_lsm struct may be inconsistent with the state
      stored in vylog if an index drop races with compaction: we first write
      the changes made by a compaction task to vylog and only then update the
      vy_lsm struct, see vy_task_compact_complete. Since writing to vylog
      yields, this opens a time window during which the index can be dropped.
      If this happens, objects that were created by the compaction but haven't
      been logged yet (such as new runs, slices, and ranges) will be deleted
      from vylog by the index drop, which permanently breaks vylog and makes
      recovery impossible.
      
      To fix this issue, let's rework garbage collection of objects associated
      with dropped indexes as follows. Now when an index is dropped, we write
      a single record to vylog, VY_LOG_DROP_LSM, i.e. just mark the index as
      dropped without deleting associated objects. Actual index cleanup takes
      place in the garbage collection procedure, see vy_gc, which purges all
      ranges/slices linked to marked indexes from vylog and marks all their
      runs as dropped. When all runs are actually deleted from disk and
      "forgotten" in vylog, we remove the index record from vylog by writing
      VY_LOG_FORGET_LSM record. Since garbage collection procedure uses vylog
      itself instead of vy_lsm struct for iterating over vinyl objects, no
      race between index drop and dump/compaction can now lead to broken
      vylog.
      
      Closes #3416
    • vinyl: store lsn of index drop record in vylog · 264f7e3f
      Vladimir Davydov authored
      This is required to rework garbage collection in vinyl.
    • alter: pass lsn of index drop record to engine · 1af04afe
      Vladimir Davydov authored
      We pass the lsn of index alter/create records; let's pass the lsn of
      the drop record as well, for consistency. This is also needed by vinyl
      to store it in vylog (see the next patch).
    • vinyl: do not reuse lsm objects during recovery from vylog · 31ab8e03
      Vladimir Davydov authored
      If an index was dropped and then recreated, then while replaying vylog
      we reuse the vy_lsm_recovery_info object corresponding to it. There is
      no reason to do that instead of simply allocating a new object: the
      amount of memory saved is negligible, while the code looks more complex.
      Let's simplify the code: whenever we see VY_LOG_CREATE_LSM, create a
      new vy_lsm_recovery_info object and replace the old incarnation, if
      any, in the hash map.
    • test: update replication_connect_timeout in tests to a lower value · e9bf00fc
      Konstantin Osipov authored
      replication: make replication_connect_timeout dynamic
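
      Since the option is now dynamic, tests can lower it on a running
      instance, e.g. (the value is arbitrary):

          box.cfg{replication_connect_timeout = 0.5} -- previously only settable at bootstrap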
    • test: rework test case for memtx async garbage collection · c5f98b91
      Vladimir Davydov authored
      Do not use errinj as it is unreliable. Check that:
       - No memory is freed immediately after the space drop (WAL is off).
       - All memory is freed asynchronously after a yield.
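
      A rough sketch of the checked pattern (assumes a filled memtx space s
      and box.cfg{wal_mode = 'none'}; using box.slab.info() for the
      bookkeeping is an assumption):

          local fiber = require('fiber')
          local used = box.slab.info().items_used
          s:drop()
          -- right after the drop, items_used is still (nearly) unchanged:
          -- the tuples were only handed over to the background gc fiber
          fiber.sleep(0) -- yield so the gc fiber can run
          -- after one or more yields, items_used drops back down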
    • replication: fix log message in case of sync failure · 6c35bf9b
      Vladimir Davydov authored
      replicaset_sync() returns not only when the instance has synchronized
      with the connected replicas, but also when some replicas have
      disconnected and the quorum can no longer be formed. Nevertheless, it
      always prints that sync has been completed. Fix it.
      
      See #3422
    • replication: do not stop syncing if replicas are loading · 1785e79c
      Vladimir Davydov authored
      If a replica disconnects while sync is in progress, box.cfg{} may stop
      syncing, leaving the instance in 'orphan' mode. This will happen if not
      enough replicas are connected to form a quorum. This makes sense e.g.
      on a network error, but not when a replica is loading, because in the
      latter case it should be up and running quite soon. Let's account for
      replicas that disconnected because they haven't completed initial
      configuration yet, and continue syncing if connected + loading > quorum.
      
      Closes #3422
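
      For context, this affects configurations like the following sketch (the
      URIs are hypothetical): if a listed peer is merely still loading,
      box.cfg{} now keeps syncing instead of giving up on the quorum:

          box.cfg{
              replication = {'host1:3301', 'host2:3301'},
              replication_connect_quorum = 2,
          }
          -- box.info.status is left at 'orphan' only if the quorum
          -- really cannot be formed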
    • replication: use applier_state to check quorum · ca53ab91
      Konstantin Belyavskiy authored
      Small refactoring: remove 'enum replica_state' and instead reuse a
      subset of the applier state machine ('enum applier_state') to check
      whether we have achieved replication quorum and hence can leave
      read-only mode.
    • replication: change default replication_connect_timeout to 30 seconds · 06a63686
      Konstantin Osipov authored
      The default of 4 seconds is too low to bootstrap a large cluster.
    • iproto: 'iproto_msg_max' -> 'net_msg_max' in message · 020fb77f
      Vladislav Shpilevoy authored
      Closes #3425
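
      The option named in the corrected message (a sketch; the value is
      arbitrary):

          box.cfg{net_msg_max = 1024} -- the limit the error message refers to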
  4. May 24, 2018
    • replication: add strict ordering for appliers operating in a full mesh · edd76a2a
      Georgy Kirichenko authored
      In some cases, when applier processing yielded, another applier could
      start a conflicting operation and break replication and database
      consistency.
      Now an applier locks a per-server-id latch before processing a
      transaction. This guarantees that at any given moment there is only
      one applier request in progress for each server.

      The problem was very rare until full mesh topologies in vinyl
      became commonplace.
      
      Fixes gh-3339
    • memtx: run garbage collection on demand · 39c8b526
      Vladimir Davydov authored
      When a memtx space is dropped or truncated, we delegate freeing the
      tuples stored in it to a background fiber so as not to block the caller
      (and the tx thread) for too long. It turns out this doesn't work well
      for ephemeral spaces, which share the destruction code with normal
      spaces: the user might issue a lot of complex SQL SELECT statements
      that create many ephemeral spaces without yielding, and hence never
      give the garbage collection fiber a chance to clean up. There's a test
      that emulates this,
      2.0:test/sql-tap/gh-3083-ephemeral-unref-tuples.test.lua.
      For this test to pass, let's run the garbage collection procedure on
      demand, i.e. whenever a memtx allocation function fails to allocate
      memory.
      
      Follow-up #3408
    • memtx: rework background garbage collection procedure · cc0e5b4c
      Vladimir Davydov authored
      Currently, the engine has no control over yields issued during
      asynchronous index destruction. As a result, it can't force gc when
      there's not enough memory. To fix that, let's make the gc callback
      stateful: it is now supposed to free some objects and return true if
      there are more objects left to free, or false otherwise. Yields are
      now performed by the memtx engine itself after each gc callback
      invocation.
  5. May 22, 2018
  6. May 21, 2018
    • Remove unused FDGuard · f57fd113
      Vladislav Shpilevoy authored
    • memtx: free tuples asynchronously when primary index is dropped · 2a1482f3
      Vladimir Davydov authored
      When a memtx space is dropped or truncated, we have to unreference all
      tuples stored in it. Currently, we do this synchronously, thus blocking
      the tx thread. If a space is big, the tx thread may remain blocked for
      several seconds, which is unacceptable. This patch makes drop/truncate
      hand the actual work to a background fiber.
      
      Before this patch, drop of a space with 10M 64-byte records took more
      than 0.5 seconds. After this patch, it takes less than 1 millisecond.
      
      Closes #3408
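
      A sketch of measuring the effect (assumes a big memtx space s; clock
      is Tarantool's built-in module):

          local clock = require('clock')
          local t0 = clock.monotonic()
          s:drop()
          print(clock.monotonic() - t0) -- returns almost immediately; the
                                        -- tuples are unreferenced later by
                                        -- a background fiber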
    • vinyl: implement index compact method · db9e214a
      Vladimir Davydov authored
      Force major compaction of all ranges when index.compact() is called.
      Note that the function only triggers compaction; it does not wait
      until compaction is complete.
      
      Closes #3139
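
      Usage sketch (the space and index names are hypothetical):

          box.space.test.index.primary:compact() -- schedules major compaction
                                                 -- of all ranges and returns
                                                 -- without waiting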
    • index: add compact method · 9abd0192
      Vladimir Davydov authored
      This patch adds the index.compact() Lua method. The new method is
      backed by index_vtab::compact. Currently, it's a no-op for all kinds
      of indexes. It will be used by the Vinyl engine to trigger major
      compaction.
      
      Part of #3139
  7. May 19, 2018
    • replication: stability fix for test recover_missing_xlog · 73354bb7
      Konstantin Belyavskiy authored
      This test fails from time to time because an .xlog file may have a
      different number in its name (and using box.info.lsn is not an option
      here).
      Since the setup consists of two masters, there may be one or two xlogs
      in the folder, so first get a list of all matching files and then
      delete the last one.
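
      The approach, sketched with Tarantool's built-in fio module (the
      wal_dir-based path is an assumption):

          local fio = require('fio')
          local xlogs = fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog'))
          table.sort(xlogs)
          fio.unlink(xlogs[#xlogs]) -- delete the newest xlog, whatever its number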
  8. May 18, 2018