- May 31, 2018
-
-
Vladislav Shpilevoy authored
Session salt is 32 random bytes that are used to encode the password when a user is authenticated. The salt is not used in non-binary sessions, so it can be moved to the iproto connection.
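For context, the salt travels to the client in the IPROTO greeting and is only needed for binary-protocol authentication, which is why it belongs to the connection rather than to the session. A rough client-side sketch, assuming the standard greeting layout (a 64-byte version line followed by a base64-encoded salt line) and an illustrative address:

    local socket = require('socket')
    local digest = require('digest')

    local s = socket.tcp_connect('127.0.0.1', 3301)
    local greeting = s:read(128)  -- fixed-size greeting
    local salt = digest.base64_decode(greeting:sub(65, 108)):sub(1, 20)
    -- The client then sends IPROTO_AUTH with a scramble computed as
    -- xor(sha1(password), sha1(salt .. sha1(sha1(password)))).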
-
Vladislav Shpilevoy authored
Yaml.decode tag_only option allows decoding only the tag of a YAML document. For #2677 it is needed to detect different push types in the text console: print pushes via console.print(), and actual pushes via box.session.push(). YAML tags will be used to distinguish them: for each message, a client console will try to find a tag.

If a tag is absent, the message is a plain response to a request.

If the tag is !print!, the document consists of a single string that must be printed. Such a document must be decoded to get the printed string, so the call sequence is yaml.decode(tag_only) + yaml.decode. The reason a print message must be decoded is that the result of print() on the server side may not be well-formatted YAML, so it must be encoded into YAML to be sent correctly. For example, when I do something like this on the server side: console.print('very bad YAML string') - the result of print() is not a YAML document, and to be sent it must be encoded into YAML on the server side.

If the tag is !push!, the document was sent via box.session.push and must not be decoded; it can be simply printed or ignored.

Needed for #2677
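On the client side this could look roughly like the sketch below; the exact return value of the tag-only decode and the exact tag spelling are assumptions made for illustration:

    local yaml = require('yaml')

    local function handle_console_message(text)
        local tag = yaml.decode(text, {tag_only = true})
        if tag == nil then
            -- An ordinary response to a request.
            return yaml.decode(text)
        elseif tag:find('print') then
            -- console.print() payload: a single string that has to be
            -- decoded before it can be written to stdout.
            io.write(yaml.decode(text)[1], '\n')
        else
            -- box.session.push() message: show it as is, do not decode.
            io.write(text)
        end
    end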
-
Vladislav Shpilevoy authored
Encode_tagged is a workaround for the inability to pass options to yaml.encode(). Before the patch yaml.encode() in fact had this signature: yaml.encode(...), so it was impossible to add any options to this function - all of them would be treated as values to encode. But the documentation https://tarantool.io/en/doc/1.9/reference/reference_lua/yaml.html?highlight=yaml#lua-function.yaml.encode says that the function has this signature: yaml.encode(value). I hope that anyone who uses yaml.encode() does so according to the documentation, so I can add the {tag_prefix, tag_handle} options to yaml.encode() and remove the yaml.encode_tagged() workaround.
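A hedged sketch of how the new options might be used once yaml.encode() accepts them (the prefix string below is made up for illustration):

    local yaml = require('yaml')

    -- Attach one global tag to the encoded document.
    local doc = yaml.encode({'hello'}, {
        tag_handle = '!push!',
        tag_prefix = 'tag:example.org/push,2018', -- illustrative prefix
    })
    -- The resulting document carries a %TAG directive mapping the !push!
    -- handle to the prefix above, followed by the encoded value.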
-
- May 30, 2018
-
-
Vladislav Shpilevoy authored
Encode_tagged allows defining one global YAML tag for a document. Tagged YAML documents are going to be used for console text pushes to distinguish actual box.session.push() from console.print(): the former will have the tag !push, and the latter - !print.
-
Vladimir Davydov authored
None of engine_wait_checkpoint, engine_commit_checkpoint, engine_join, engine_backup needs to modify the vclock argument.
-
Vladimir Davydov authored
Slab arena can grow dynamically, so all we need to do is increase the quota limit. Decreasing the limit is still explicitly prohibited, because the slab arena never unmaps slabs. Closes #2634
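Assuming the quota limit in question is the one controlled by box.cfg.memtx_memory (per #2634), the user-visible effect is roughly:

    box.cfg{memtx_memory = 256 * 1024 * 1024}
    box.cfg{memtx_memory = 512 * 1024 * 1024} -- ok: the limit may only grow
    box.cfg{memtx_memory = 128 * 1024 * 1024} -- error: shrinking is prohibited,
                                              -- the slab arena never unmaps slabs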
-
Vladimir Davydov authored
During recovery, we may write VY_LOG_CREATE_LSM and VY_LOG_DROP_LSM records we failed to write before restart (because those records are written after WAL and hence may not make it to vylog). Right after recovery we invoke garbage collection to drop incomplete runs. Once the VY_LOG_PREPARE_LSM record is introduced, we will also collect incomplete LSM trees there (those we failed to build). However, there may be LSM trees we managed to build but failed to write VY_LOG_CREATE_LSM for. This is OK, as we will retry the vylog write, but currently it isn't reflected in the recovery context used for garbage collection. To avoid purging such LSM trees, let's update the recovery context with records written during recovery. Needed for #1653
-
- May 29, 2018
-
-
Vladimir Davydov authored
Allocation of vy_lsm_recovery_info::key_parts is a part of the struct initialization, which is handled by vy_recovery_do_create_lsm().
-
Konstantin Osipov authored
Update the error message for dynamic changes of instance_uuid and replicaset_uuid
-
Konstantin Osipov authored
-
Georgy Kirichenko authored
Handle the case when instance_uuid and replicaset_uuid are present in box.cfg{} and have the same values as the ones already set. Fixes #3421
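A minimal sketch of the case being fixed - reconfiguring with the UUIDs that are already in effect must be a no-op rather than an error:

    -- On an already bootstrapped instance:
    box.cfg{
        instance_uuid = box.info.uuid,
        replicaset_uuid = box.info.cluster.uuid,
    }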
-
- May 25, 2018
-
-
Konstantin Belyavskiy authored
This fix improves the 'box.info.replication' output. If a downstream fails and thus disconnects from the upstream, improve logging by printing 'status: disconnected' and the error message on both sides (master and replica). Closes #3365
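What the improved output might look like on the master side; the error text is made up and the field layout is an approximation of box.info.replication:

    tarantool> box.info.replication[2].downstream
    ---
    - status: disconnected
      message: 'unexpected EOF when reading from socket'
    ...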
-
Konstantin Belyavskiy authored
This is a part of a more complex task aiming to improve logging. Do not destroy the relay, since it stores the last error, which can be useful for diagnostics. Now the relay is created together with the replica and always exists, so several NULL checks are removed as well. Add relay_state { OFF, FOLLOW, STOPPED } to track replica presence: once connected, it is either FOLLOW or STOPPED until the master is reset. Updated with @kostja's proposal. Used for #3365.
-
Konstantin Osipov authored
-
Vladimir Davydov authored
Currently, when an index is dropped, we remove all ranges/slices associated with it and mark all runs as dropped in vylog immediately. To find ranges/slices/runs, we use vy_lsm struct, see vy_log_lsm_prune.

The problem is vy_lsm struct may be inconsistent with the state stored in vylog if index drop races with compaction, because we first write changes done by compaction task to vylog and only then update vy_lsm struct, see vy_task_compact_complete. Since write to vylog yields, this opens a time window during which the index can be dropped. If this happens, objects that were created by compaction but haven't been logged yet (such as new runs, slices, ranges) will be deleted from vylog by index drop, and this will permanently break vylog, making recovery impossible.

To fix this issue, let's rework garbage collection of objects associated with dropped indexes as follows. Now when an index is dropped, we write a single record to vylog, VY_LOG_DROP_LSM, i.e. just mark the index as dropped without deleting associated objects. Actual index cleanup takes place in the garbage collection procedure, see vy_gc, which purges all ranges/slices linked to marked indexes from vylog and marks all their runs as dropped. When all runs are actually deleted from disk and "forgotten" in vylog, we remove the index record from vylog by writing VY_LOG_FORGET_LSM record.

Since garbage collection procedure uses vylog itself instead of vy_lsm struct for iterating over vinyl objects, no race between index drop and dump/compaction can now lead to broken vylog.

Closes #3416
-
Vladimir Davydov authored
This is required to rework garbage collection in vinyl.
-
Vladimir Davydov authored
We pass the LSN of index alter/create records; let's pass the LSN of the drop record as well, for consistency. This is also needed by vinyl to store it in vylog (see the next patch).
-
Vladimir Davydov authored
If an index was dropped and then recreated, then while replaying vylog we reuse the vy_lsm_recovery_info object corresponding to it. There's no reason to do that instead of simply allocating a new object - the amount of memory saved is negligible, while the code becomes more complex. Let's simplify the code: whenever we see VY_LOG_CREATE_LSM, create a new vy_lsm_recovery_info object and replace the old incarnation, if any, in the hash map.
-
Konstantin Osipov authored
replication: make replication_connect_timeout dynamic
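Being dynamic, the timeout can now be adjusted at runtime, e.g. before reconfiguring replication on a large cluster (the value is just an example):

    box.cfg{replication_connect_timeout = 30}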
-
Konstantin Osipov authored
-
Vladimir Davydov authored
Do not use errinj as it is unreliable. Check that:
- no memory is freed immediately after space drop (WAL is off);
- all memory is freed asynchronously after a yield.
-
Vladimir Davydov authored
replicaset_sync() returns not only when the instance has synchronized with the connected replicas, but also when some replicas have disconnected and the quorum can't be formed any more. Nevertheless, it always prints that sync has been completed. Fix it. See #3422
-
Vladimir Davydov authored
If a replica disconnects while sync is in progress, box.cfg{} may stop syncing, leaving the instance in 'orphan' mode. This will happen if not enough replicas are connected to form a quorum. This makes sense e.g. on a network error, but not when a replica is still loading, because in the latter case it should be up and running quite soon. Let's account for replicas that disconnected because they haven't completed initial configuration yet, and continue syncing while connected + loading > quorum. Closes #3422
-
Konstantin Belyavskiy authored
Small refactoring: remove 'enum replica_state' and instead reuse a subset of the applier state machine states to check whether we have achieved replication quorum and hence can leave read-only mode.
-
Konstantin Osipov authored
The default of 4 seconds is too low to bootstrap a large cluster.
-
Vladislav Shpilevoy authored
Closes #3425
-
- May 24, 2018
-
-
Georgy Kirichenko authored
In some cases, when applier processing yielded, another applier might start a conflicting operation and break replication and database consistency. Now an applier locks a per-server-id latch before processing a transaction. This guarantees that there is only one applier request in progress for each server at any given moment. The problem was very rare until full-mesh topologies in vinyl became commonplace. Fixes gh-3339
-
Vladimir Davydov authored
When a memtx space is dropped or truncated, we delegate freeing the tuples stored in it to a background fiber so as not to block the caller (and the tx thread) for too long. Turns out this doesn't work out well for ephemeral spaces, which share the destruction code with normal spaces: the problem is that the user might issue a lot of complex SQL SELECT statements that create a lot of ephemeral spaces and do not yield, and hence don't give the garbage collection fiber a chance to clean up. There's a test that emulates this, 2.0:test/sql-tap/gh-3083-ephemeral-unref-tuples.test.lua. For this test to pass, let's run the garbage collection procedure on demand, i.e. whenever any of the memtx allocation functions fails to allocate memory. Follow-up #3408
-
Vladimir Davydov authored
Currently, the engine has no control over yields issued during asynchronous index destruction. As a result, it can't force gc when there's not enough memory. To fix that, let's make the gc callback stateful: it is now supposed to free some objects and return true if there are still more objects to free, or false otherwise. Yields are now done by the memtx engine itself after each gc callback invocation.
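A minimal sketch, in Lua for illustration only (the real callback lives in C inside the memtx engine), of the stateful callback contract described above: each invocation frees a batch of objects and reports whether more work remains, and the engine yields between invocations:

    local fiber = require('fiber')

    local function run_background_gc(gc_step)
        while gc_step() do -- frees one batch; true means "call me again"
            fiber.yield()  -- let other fibers run between batches
        end
    end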
-
- May 22, 2018
-
-
Konstantin Osipov authored
Avoid goto, a follow up on gh-3257.
-
Konstantin Belyavskiy authored
Another broken case: adding a new replica to the cluster while the leader is chosen by the condition if (replica->applier->remote_is_ro && replica->applier->vclock.signature == 0). In this case we may get ER_READONLY, since the signature is not 0. So leader election now has two phases: 1. Select among read-write replicas. 2. If no such replica is found, fall back to the old algorithm for backward compatibility (the case when all replicas already exist in the cluster table). Closes #3257
-
Konstantin Osipov authored
-
Vladimir Davydov authored
No point in this level of indirection. We embed the bps tree implementation into memtx_tree_index, so why don't we do the same for the hash index? A good side effect is that we can now define iterators in headers for both memtx_tree_index and memtx_hash_index, which is required to improve the memtx garbage collection mechanism.
-
Vladimir Davydov authored
Since it is created when the memtx engine is initialized, we should destroy it on engine shutdown.
-
Vladimir Davydov authored
All functions that need them are now explicitly passed the engine, so we can consolidate all variables related to the memtx engine state in one place.
-
Vladimir Davydov authored
We need this so that we can force garbage collection when we are short on memory. There are two such functions: one is used for allocating index extents, the other for allocating tuples. The index allocation function has an opaque context, so we simply reuse it for passing the memtx engine. To pass the memtx engine to the tuple allocation function, we add an opaque engine-specific pointer to tuple_format and set it to memtx_engine for memtx spaces.
-
Vladimir Davydov authored
The two files are too closely related: memtx_arena is defined and used in memtx_engine.c, but initialized in memtx_tuple.cc. Since memtx_tuple.cc is small, let's fold it into memtx_engine.c.
-
Vladimir Davydov authored
Postponing it until a memtx index is created for the first time saves us no memory or CPU; it only makes the code more difficult to follow.
-
- May 21, 2018
-
-
Vladislav Shpilevoy authored
-
Vladimir Davydov authored
When a memtx space is dropped or truncated, we have to unreference all tuples stored in it. Currently, we do this synchronously, thus blocking the tx thread. If a space is big, the tx thread may remain blocked for several seconds, which is unacceptable. This patch makes drop/truncate hand the actual work over to a background fiber. Before this patch, dropping a space with 10M 64-byte records took more than 0.5 seconds; after this patch, it takes less than 1 millisecond. Closes #3408
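A rough way to observe the effect from Lua (the timings come from the message above; the space setup is illustrative):

    local clock = require('clock')
    local s = box.schema.space.create('test')
    s:create_index('pk')
    -- ... fill the space with ~10M small tuples ...
    local t = clock.monotonic()
    s:drop() -- returns almost immediately; tuples are freed in the background
    print(clock.monotonic() - t)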
-