Commits · a5cd73aee7912d5c1b23e5de9b000da6678c361e · core / tarantool

Apr 14, 2021

box/func: module_reload -- drop redundant argument · a5cd73ae

Cyrill Gorcunov authored 4 years ago


The only purpose of the module argument is to
notify the caller that the module doesn't exist.
Let's simplify the code and drop this argument.

Part-of #4642

Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

a5cd73ae

box/func: fix modules functions restore · b9f2bf4e

Cyrill Gorcunov authored 3 years ago


In commit 96938faf (Add hot function reload for C procedures)
the ability to hot reload modules was introduced.
When a module is reloaded its functions are resolved to
new symbols, but if something goes wrong the old symbols
are supposed to be restored from the old module.

However, the current code restores only one function and may
crash if there are several functions to restore. Let's fix it.

Fixes #5968

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

b9f2bf4e
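
The restore path described in b9f2bf4e can be illustrated with a minimal sketch. Everything here is hypothetical: `struct func`, `funcs_reload()`, and the toy resolver are simplified stand-ins, not Tarantool's real structures. The point is only that a failed reload walks back over every function already switched, rather than restoring a single one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a C stored function. */
struct func {
	const char *name;
	void *addr;     /* currently resolved symbol */
	int module_id;  /* module copy the symbol came from */
};

/* Resolver callback: returns NULL if the symbol is missing. */
typedef void *(*resolve_f)(int module_id, const char *name);

/*
 * Re-resolve every function against new_id. If any lookup fails,
 * restore ALL functions switched so far from old_id, not only one.
 * Returns 0 on success, -1 if the reload was rolled back.
 */
int
funcs_reload(struct func *funcs, size_t count, int old_id, int new_id,
	     resolve_f resolve)
{
	for (size_t i = 0; i < count; i++) {
		void *addr = resolve(new_id, funcs[i].name);
		if (addr == NULL) {
			for (size_t j = 0; j < i; j++) {
				funcs[j].addr = resolve(old_id, funcs[j].name);
				/* The old module resolved these symbols before. */
				assert(funcs[j].addr != NULL);
				funcs[j].module_id = old_id;
			}
			return -1;
		}
		funcs[i].addr = addr;
		funcs[i].module_id = new_id;
	}
	return 0;
}

/* Toy symbol table for demonstration: [module][function]. */
char demo_symbols[2][3];

/* Toy resolver: module 1 lacks the third function, "c". */
void *
demo_resolve(int module_id, const char *name)
{
	int idx = name[0] - 'a';
	if (module_id == 1 && idx == 2)
		return NULL;
	return &demo_symbols[module_id][idx];
}
```

With three functions "a", "b", "c" and a new module that lacks "c", the call fails and all three entries point at the old module again.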

applier: process synchro rows after WAL write · b259e930

Vladislav Shpilevoy authored 3 years ago

Applier used to process synchronous rows CONFIRM and ROLLBACK
right after receipt before they are written to WAL.

That led to a bug: the confirmed data became visible and could be
accessed by user requests, then the node restarted before CONFIRM
finished its WAL write, and the data was no longer visible. That
is the same as if it had been rolled back, which is not acceptable.

Another case - CONFIRM WAL write could simply fail due to any
reason (no disk space, OOM), but the transactions would remain
confirmed anyway.

Also that produced some hacks in the limbo's code to support the
confirmation and rollback of transactions not yet written to WAL.

The patch makes the synchro rows processed only after they are
written to WAL. Although the 'rollback' case above might still
happen if the xlogs were in the kernel caches, and the machine was
powered off before they were flushed to disk. But that is not
related to qsync specifically.

To handle the synchro rows after WAL write the patch makes them go
to WAL in a blocking way (journal_write() instead of
journal_write_try_async()). Otherwise it could happen that a
CONFIRM/ROLLBACK is being written to WAL and would clear the limbo
afterwards, but a new transaction arrives with a different owner,
and it conflicts with the current limbo owner.

Closes #5213

b259e930

update: allow update(delete) absent nullable fields · 64514113

mechanik20051988 authored 3 years ago

The previous patch allowed the update(insert) operation for absent
nullable fields. This patch allows the update(delete) operation for
absent nullable fields as well.
Closes #3378

64514113

update: allow update absent nullable fields · 2bb373b9

Mary Feofanova authored 4 years ago

Update operations could not insert with gaps. This patch changes
the behavior so that the update operation fills the missing fields
with nulls.
Part of #3378

@TarantoolBot document
Title: Allow update absent nullable fields
Update operations could not insert with gaps. Changed the behavior
so that the update operation fills the missing fields with nulls.
For example, create a space `s = box.schema.create_space('s')`,
then create an index for it `pk = s:create_index('pk')`, and
insert a tuple `s:insert{1, 2}`. Then try to update this tuple
with `s:update({1}, {{'!', 5, 6}})`. In previous versions this
operation failed with the ER_NO_SUCH_FIELD_NO error; now it
succeeds and the space contains the tuple [1, 2, null, null, 6].

2bb373b9
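
The gap-filling behaviour from 2bb373b9 can be sketched on a flat array of integers. This is an illustration under stated assumptions, not the real update code: `FIELD_NULL` is a made-up marker standing in for a MsgPack nil, `tuple_insert_with_gaps()` is a hypothetical name, and the sketch ignores the space format (in Tarantool only nullable fields may be filled this way).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker standing in for a MsgPack nil. */
enum { FIELD_NULL = -1 };

/*
 * Insert `value` at 1-based position `pos` into fields[0..*count).
 * If pos lies beyond the current end, the gap is filled with
 * FIELD_NULL. Returns 0 on success, -1 on a bad position or when
 * the result would not fit into `capacity` slots.
 */
int
tuple_insert_with_gaps(int *fields, size_t *count, size_t capacity,
		       size_t pos, int value)
{
	assert(fields != NULL && count != NULL);
	if (pos < 1)
		return -1;
	size_t idx = pos - 1;
	size_t new_count = (idx > *count ? idx : *count) + 1;
	if (new_count > capacity)
		return -1;
	if (idx >= *count) {
		/* Pad the missing fields with nulls. */
		for (size_t i = *count; i < idx; i++)
			fields[i] = FIELD_NULL;
	} else {
		/* Shift the tail right to make room. */
		for (size_t i = *count; i > idx; i--)
			fields[i] = fields[i - 1];
	}
	fields[idx] = value;
	*count = new_count;
	return 0;
}
```

Applied to the fields {1, 2} with position 5 and value 6, the sketch yields 1, 2, FIELD_NULL, FIELD_NULL, 6, mirroring the `s:update({1}, {{'!', 5, 6}})` example from the commit message.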

Apr 13, 2021

iproto: implement ability to run multiple iproto threads · 2ede3be3

mechanik20051988 authored 3 years ago

Some users have specific workloads where the iproto thread
is the throughput bottleneck: the iproto thread's core is 100% loaded
while the TX thread's core is not. For such cases it would be nice to
be able to create several iproto threads.

Closes #5645

@TarantoolBot document
Title: implement ability to run multiple iproto threads
Implement the ability to run multiple iproto threads, which is useful
in some specific workloads where the iproto thread is the throughput
bottleneck. To specify the number of iproto threads, use the
`iproto_threads` option in box.cfg. For example, to start
8 iproto threads, call `box.cfg{iproto_threads=8}`. The default
number of iproto threads is 1. This option is not dynamic, so it
cannot be changed after it is first set, until the server restarts.
Distribution of connections among threads is managed by the OS kernel.

2ede3be3

iproto: fix error with struct rmean · 440a4f30

mechanik20051988 authored 3 years ago

There were two problems with struct rmean:
- For correct access to the rmean struct fields, the struct should be
  created in the tx thread.
- When rmean_new returned NULL in net_cord_f, tarantool hung
  and could not be terminated by anything except SIGKILL.
Also the net_slabc cache was not destroyed. Moved allocation and deallocation
of the rmean structure to iproto_init/iproto_free respectively. Added
slab_cache_destroy for net_slabc for graceful resource release.

440a4f30

Fix a potential error with the rmean structure fields access · 91d8e7ae

mechanik20051988 authored 4 years ago

The fields of the rmean structure can be accessed from
multiple threads, so we must use atomic operations to get/set
fields in this structure. Also, the comments to the functions now
state in which threads they should be called to correctly access
the fields of the rmean structure.

91d8e7ae
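
The idea behind 91d8e7ae, a statistics field updated from one thread and read from another, can be sketched with C11 atomics. This is an assumption-laden illustration: `struct toy_stat` and its functions are hypothetical, and the use of `<stdatomic.h>` with relaxed ordering is a choice made for the sketch; the source does not say which atomic primitives the patch uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter shared between threads. */
struct toy_stat {
	_Atomic int64_t total;
};

void
toy_stat_init(struct toy_stat *s)
{
	assert(s != NULL);
	atomic_init(&s->total, 0);
}

/* Writer side: add a sample without a data race. */
void
toy_stat_collect(struct toy_stat *s, int64_t value)
{
	atomic_fetch_add_explicit(&s->total, value, memory_order_relaxed);
}

/* Reader side: may run in another thread than the writer. */
int64_t
toy_stat_total(struct toy_stat *s)
{
	return atomic_load_explicit(&s->total, memory_order_relaxed);
}
```

Relaxed ordering is enough here because the counter is a standalone statistic; it does not publish any other memory to the reader.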

box: change ER_TUPLE_FOUND message · d11fb306

Iskander Sagitov authored 4 years ago

The ER_TUPLE_FOUND message shows only the space and index; let's also
show the old tuple and the new tuple.

This commit changes the error message in the code and in the tests. The
tests sql/checks and sql-tap/aler remain the same due to problems with
showing their old and new tuples in the error message.

Closes #5567

d11fb306

box: add field name to field mismatch errors · 12b7155d

Iskander Sagitov authored 3 years ago

Add field name to field mismatch error message.

Part of #4707

12b7155d

box: add info to mismatch errors · 24b90815

Iskander Sagitov authored 4 years ago

Add got type to field mismatch error message.

Part of #4707

24b90815

box: fix uint32_t overflow bug · dea91629

Iskander Sagitov authored 3 years ago

Previously tuple_field_u32 and tuple_next_u32 stored uint64_t value in
uint32_t field. This commit fixes it.

Part of #4707

dea91629
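
The class of bug fixed in dea91629 is a 64-bit value silently truncated when stored into a 32-bit field. A minimal sketch of a checked narrowing follows; `toy_u64_to_u32()` is a hypothetical helper, not the actual signature of `tuple_field_u32` or `tuple_next_u32`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store `value` into *out only if it fits into 32 bits.
 * Returns 0 on success, -1 if the value is out of range;
 * *out is left untouched on failure.
 */
int
toy_u64_to_u32(uint64_t value, uint32_t *out)
{
	assert(out != NULL);
	if (value > UINT32_MAX)
		return -1;
	*out = (uint32_t)value;
	return 0;
}
```

Without the range check, a value such as `UINT32_MAX + 1` would wrap to 0 instead of being reported as an error.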

Apr 12, 2021

feedback_daemon: count and report some events · aa97a185

Serge Petrenko authored 3 years ago

Bump `feedback_version` to 7 and introduce a new field: `feedback.events`.
It holds a counter for every event we may choose to register later on.

Currently the possible events are "create_space", "drop_space",
"create_index", "drop_index".

All the registered events and corresponding counters are sent in a
report in `feedback.events` field.

Also, the first registered event triggers the report sending right away.
So we can track events such as "first space/index created/dropped".

Closes #5750

aa97a185

feedback_daemon: generate report right before sending · e9c9832a

Serge Petrenko authored 3 years ago

The feedback daemon used to generate a report before waiting (for an hour
by default) until it was time to send it. It is better to keep the reports
up to date and generate them right when it's time to send them.

Part of #5750

e9c9832a

feedback_daemon: send feedback on server start · bc15e0f0

Serge Petrenko authored 3 years ago

Send the first report as soon as instance's initial configuration
finishes.

Part of #5750

bc15e0f0

feedback_daemon: rename `send_test` to `send` · 670acf0d

Serge Petrenko authored 3 years ago

feedback_daemon.send() will come in handy once we implement triggers to
dispatch feedback after some events, for example, right on initial
instance configuration.

So, it's not a testing method anymore, hence the new name.

Part of #5750

670acf0d

feedback_daemon: include server uptime in the report · c5d595bc

Serge Petrenko authored 3 years ago

We are going to send feedback right after initial `box.cfg{}` call, so
include server uptime in the report to filter out short-living CI
instances.

Also, while we're at it, fix a typo in feedback_daemon test.

Prerequisite #5750

c5d595bc

qsync: provide box.info.synchro interface for monitoring · bce3b581

Cyrill Gorcunov authored 4 years ago


In commit 14fa5fd8 (cfg: support symbolic evaluation of
replication_synchro_quorum) we implemented support for
symbolic evaluation of the `replication_synchro_quorum` parameter,
but there is no easy way to obtain its current run-time value,
i.e. the evaluated number.

Moreover, we would like to fetch the queue length of the transaction
limbo for tests and extend these statistics in the future. Thus
let's add them.

Closes #5191

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Provide `box.info.synchro` interface

The `box.info.synchro` leaf provides information about details of
synchronous replication.

In particular, `quorum` represents the current value of the synchronous
replication quorum defined by the `replication_synchro_quorum`
configuration parameter, because it can be set as a dynamic formula
such as `N/2+1`, where the value depends on the current number of replicas.

Since synchronous replication does not commit data immediately
but waits for its propagation to replicas, the data sits in a queue
gathering `commit` responses from remote nodes. The current number of
entries waiting in the queue is shown via the `queue.len` member.

A typical output is the following:

``` Lua
tarantool> box.info.synchro
---
- queue:
    len: 0
  quorum: 1
...
```

The `len` member shows current number of entries in the queue.
And the `quorum` member shows an evaluated value of
`replication_synchro_quorum` parameter.

bce3b581

Apr 05, 2021

box: remove is_local_recovery variable · 625941cf

Vladislav Shpilevoy authored 3 years ago

It was used to recover synchronous auto-commit transactions
in an async way (not blocking the fiber). But it is no longer
necessary since #5874 was fixed, because recovery does not use
auto-commit transactions anymore.

Closes #5194

625941cf

recovery: make it transactional · 9311113d

Vladislav Shpilevoy authored 3 years ago

Recovery used to be performed row by row. It was fine because
all the persisted rows were supposed to be committed anyway and
should not hit any problems during recovery, so applying a
transaction partially was not a concern.

But that stopped being true after synchronous replication was
introduced. Synchronous transactions might be in the log, but
can be followed by a ROLLBACK record which is supposed to delete
them.

During row-by-row recovery, firstly, each synchro row turned
into a sync transaction, which is probably fine. But the rows on
non-sync spaces which were a part of a sync transaction could be
applied right away, bypassing the limbo and leading to all kinds
of errors such as duplicate keys or inconsistency of a partially
applied transaction.

The patch makes the recovery transactional. Either an entire
transaction is recovered, or it is rolled back which normally
happens only for synchro transactions followed by ROLLBACK.

In force recovery of a broken log the consistency is not
guaranteed though.

Closes #5874

9311113d

vinyl: handle multi-statement recovery txns · 59fed2d1

Vladislav Shpilevoy authored 3 years ago

During recovery and xlog replay vinyl skips the statements already
stored in runs. Indeed, their re-insertion into the mems would
lead to their second dump otherwise.

But that results in an issue: the recovery transactions in
vinyl don't have a write set, their tx->log is empty. On the
other hand they are still added to the write set (xm->writers),
probably so as not to have too many "skip if in recovery" checks
all over the code.

It works fine with single-statement transactions, but would break
on multi-statement transactions, because the decision whether
to add the transaction to the write set was based on the
emptiness of the tx's log. It is always empty, and so the
transaction could be added to the write set twice and corrupt its
list-link member.

The patch makes the decision about being added to the write set
based on emptiness of the list-link member instead of the log so
it works fine both during recovery and normal operation.

Needed for #5874

59fed2d1
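
The fix in 59fed2d1 relies on a property of intrusive circular lists: a link that is not in any list points to itself. A minimal sketch, assuming hypothetical `toy_*` names rather than vinyl's real types, shows why checking the link (instead of the log size, which is always zero during recovery) prevents a double insertion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Intrusive circular doubly linked list node. */
struct toy_link {
	struct toy_link *prev, *next;
};

void
toy_link_create(struct toy_link *l)
{
	l->prev = l->next = l;
}

/* A link that is not in any list points to itself. */
bool
toy_link_is_empty(const struct toy_link *l)
{
	return l->next == l;
}

void
toy_list_add_tail(struct toy_link *head, struct toy_link *item)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical transaction: log_size stays 0 during recovery. */
struct toy_tx {
	struct toy_link in_writers;
	size_t log_size;
};

/* Add tx to the writers list at most once, guarded by the link. */
void
toy_tx_track_writer(struct toy_link *writers, struct toy_tx *tx)
{
	assert(writers != NULL && tx != NULL);
	if (toy_link_is_empty(&tx->in_writers))
		toy_list_add_tail(writers, &tx->in_writers);
}

size_t
toy_list_length(const struct toy_link *head)
{
	size_t n = 0;
	for (const struct toy_link *l = head->next; l != head; l = l->next)
		n++;
	return n;
}
```

Calling `toy_tx_track_writer()` twice for the same transaction leaves exactly one entry in the list, regardless of `log_size`.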

replication: do not ignore replica vclock on register · f42fee5a

Serge Petrenko authored 4 years ago

There was a bug in box_process_register. It decoded replica's vclock but
never used it when sending the registration stream. So the replica might
lose the data in range (replica_vclock, start_vclock).

Follow-up #5566

f42fee5a

replication: tolerate synchro rollback during final join · 3ec0e87f

Serge Petrenko authored 4 years ago

Both box_process_register and box_process_join had guards ensuring that
not a single rollback occurred for transactions residing in WAL around
replica's _cluster registration.
Both functions would error on a rollback and make the replica retry
final join.

The reason for that was that replica couldn't process synchronous
transactions correctly during final join, because it applied the final
join stream row-by-row.

This path with retrying final join was a dead end, because even if
master manages to receive no ROLLBACK messages around N-th retry of
box.space._cluster:insert{}, replica would still have to receive and
process all the data dating back to its first _cluster registration
attempt.
In other words, the guard against sending synchronous rows to the
replica didn't work.

Let's remove the guard altogether, since now replica is capable of
processing synchronous txs in final join stream and even retrying final
join in case the _cluster registration was rolled back.

Closes #5566

3ec0e87f

applier: make final join transactional · eceec305

Serge Petrenko authored 4 years ago

Now applier assembles rows into transactions not only on subscribe
stage, but also during final join / register.

This was necessary for correct handling of rolled back synchronous
transactions in final join stream.

Part of #5566

eceec305

applier: remove excess last_row_time update from subscribe loop · efdd14f4

Serge Petrenko authored 4 years ago

applier->last_row_time is updated in applier_read_tx_row, which is called
at least once per subscribe loop iteration. So there's no need to
have a separate last_row_time update inside the loop body itself.

Part of #5566

efdd14f4

applier: fix not releasing the latch on apply_synchro_row() fail · 9ad1bd15

Serge Petrenko authored 4 years ago

Once apply_synchro_row() failed, applier_apply_tx() would simply raise
an error without unlocking the replica latch. This led to all the appliers
hanging indefinitely on trying to lock the latch for this replica.

In scope of #5566

9ad1bd15
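
The bug class fixed in 9ad1bd15, an early error return that skips the unlock, is commonly avoided in C with a single exit path. The sketch below is not the applier code: `struct toy_latch`, `apply_tx_locked()`, and the demo callbacks are hypothetical stand-ins used only to show the pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy non-recursive lock standing in for the replica latch. */
struct toy_latch {
	bool locked;
};

void
toy_latch_lock(struct toy_latch *l)
{
	assert(!l->locked);
	l->locked = true;
}

void
toy_latch_unlock(struct toy_latch *l)
{
	assert(l->locked);
	l->locked = false;
}

/* Hypothetical row handler: returns 0 on success, -1 on failure. */
typedef int (*apply_row_f)(void *row);

int
apply_tx_locked(struct toy_latch *latch, apply_row_f apply, void *row)
{
	int rc = 0;
	toy_latch_lock(latch);
	if (apply(row) != 0) {
		rc = -1;
		goto out;
	}
	/* ... further steps of applying the transaction ... */
out:
	/* Single exit: the latch is released on success and on error. */
	toy_latch_unlock(latch);
	return rc;
}

/* Demo callbacks for trying the function out. */
int
demo_apply_ok(void *row)
{
	(void)row;
	return 0;
}

int
demo_apply_fail(void *row)
{
	(void)row;
	return -1;
}
```

After a failing call the latch is free again, so a second caller can take it instead of hanging.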

applier: extract plain tx application from applier_apply_tx() · caab8d52

Serge Petrenko authored 4 years ago

The new routine, called apply_plain_tx(), may be used not only by
applier_apply_tx(), but also by final join, once we make it
transactional, and recovery, once it's also turned transactional.

Also, while we're at it, remove the excess fiber_gc() call from the
applier_subscribe loop. It is better to make sure fiber_gc() is called on
any return from applier_apply_tx().

Prerequisite #5874
Part of #5566

caab8d52

applier: extract tx boundary checks from applier_read_tx into a separate routine · cb30cc4c

Serge Petrenko authored 4 years ago

Introduce a new routine, set_next_tx_row(), which checks tx boundary
violation and appends the new row to the current tx in case everything
is ok.

set_next_tx_row() is extracted from applier_read_tx() because it's a
common part of transaction assembly both for recovery and applier.

The only difference for recovery will be that the routine which is
responsible for tx assembly won't read rows. It'll be a callback run on
each new row being read from WAL.

Prerequisite #5874
Part-of #5566

cb30cc4c
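
The commit message for cb30cc4c does not show set_next_tx_row() itself, so the following is only a guess at the general shape of a transaction boundary check. It assumes, hypothetically, that each row carries a transaction sequence number (`tsn`) and a commit flag; the names `toy_row`, `toy_tx_builder`, and `toy_tx_add_row()` are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row header fields relevant to tx boundaries. */
struct toy_row {
	int64_t tsn;     /* id shared by all rows of one transaction */
	bool is_commit;  /* set on the last row of the transaction */
};

/* State of the transaction currently being assembled. */
struct toy_tx_builder {
	int64_t tsn;
	int row_count;
	bool open;
};

/*
 * Append a row to the transaction being assembled.
 * Returns 1 when the transaction is complete, 0 when more rows
 * are expected, -1 on a boundary violation (a row from another
 * transaction arrived before the current one was committed).
 */
int
toy_tx_add_row(struct toy_tx_builder *b, const struct toy_row *row)
{
	assert(b != NULL && row != NULL);
	if (!b->open) {
		b->tsn = row->tsn;
		b->row_count = 0;
		b->open = true;
	} else if (row->tsn != b->tsn) {
		return -1;
	}
	b->row_count++;
	if (row->is_commit) {
		b->open = false;
		return 1;
	}
	return 0;
}
```

Because the function takes a row rather than reading one, the same check can be driven by a network reader or by a callback invoked for each row read from WAL, which is the reuse the commit message describes.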

replication: fix a hang on final join retry · eb908469

Serge Petrenko authored 4 years ago

Since the introduction of synchronous replication it became possible for
final join to fail on master side due to not being able to gather acks
for some tx around _cluster registration.

A replica receives an error in this case: either ER_SYNC_ROLLBACK or
ER_SYNC_QUORUM_TIMEOUT. The errors lead to applier retrying final join,
but with wrong state, APPLIER_REGISTER, which should be used only on an
anonymous replica. This led to a hang in the fiber executing box.cfg,
because it waited for APPLIER_JOINED state, which was never entered.

Part-of #5566

eb908469

swim: check types in __serialize methods · 1d121c12

Vladislav Shpilevoy authored 3 years ago

In swim Lua code none of the __serialize methods checked the
argument type assuming that nobody would call them directly and
mess with the types. But it happened, and is not hard to fix, so
the patch does it.

The serialization functions are sanitized for the swim object,
swim member, and member event.

Closes #5952

1d121c12

swim: fix crash on bad member_by_uuid() call · fe33a108

Vladislav Shpilevoy authored 3 years ago

In Lua swim object's method member_by_uuid() could crash if called
with no arguments. UUID was then passed as NULL, and dereferenced.

The patch makes member_by_uuid() treat NULL like nil UUID and
return NULL (member not found). The reason is that
swim_member_by_uuid() can't fail. It can only return a member or
not. It never sets a diag error.

Closes #5951

fe33a108
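
The approach taken in fe33a108, treating a missing UUID as the nil UUID so that the lookup simply finds nothing, can be sketched as follows. The `toy_*` types and the linear search are assumptions made for the illustration; the real swim member table is not described in the source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_uuid {
	unsigned char bytes[16];
};

struct toy_member {
	struct toy_uuid uuid;
};

/* All-zero UUID used when the caller passes no UUID at all. */
const struct toy_uuid toy_uuid_nil = {{0}};

/*
 * Find a member by UUID. A NULL uuid is treated as the nil UUID
 * instead of being dereferenced, so the lookup cannot crash; it
 * returns NULL when no member matches.
 */
const struct toy_member *
toy_member_by_uuid(const struct toy_member *members, size_t count,
		   const struct toy_uuid *uuid)
{
	assert(members != NULL || count == 0);
	if (uuid == NULL)
		uuid = &toy_uuid_nil;
	for (size_t i = 0; i < count; i++) {
		if (memcmp(&members[i].uuid, uuid, sizeof(*uuid)) == 0)
			return &members[i];
	}
	return NULL;
}
```

The function keeps its "member or NULL" contract and never needs to report an error, which matches the reasoning in the commit message.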

lua: fix tuple leak in <key_def>.compare_with_key · db766c52

Alexander Turenko authored 4 years ago

The key difference between lbox_encode_tuple_on_gc() and
luaT_tuple_encode() is that the latter never raises a Lua error, but
passes an error using the diagnostics area.

Aside from the tuple leak, the patch fixes the fiber region's memory
'leak' (till fiber_gc()). Before the patch, the memory that is used for
serialization of the key is not freed (region_truncate()) when the
serialization fails. It is verified in the gh-5388-<...> test.

While I'm here, added a test case that just verifies correct behaviour
in case of a key serialization failure (added into key_def.test.lua).
The case does not verify whether a tuple leaks, and it passes both
before and after this patch. I didn't find a simple way to
check for the tuple leak within a test. Verified manually using the
reproducer from the linked issue.

Fixes #5388

db766c52

Apr 02, 2021

vinyl: remove vylog newer than snap in casual recovery · 33254d91

Nikita Pettik authored 4 years ago

As a follow-up to the previous patch, let's also check the emptiness of
the vylog being removed. During vylog rotation all entries are squashed
(e.g. "delete range" annihilates "insert range") and written to the new
vylog, and a SNAPSHOT marker is placed at the end of the new vylog. If the
last entry in the vylog is SNAPSHOT, the vylog can be removed safely,
so it is OK to remove it even during the casual recovery
process. However, if it contains rows after the SNAPSHOT marker, removal
of the vylog may cause data loss. In this case we can still remove it only
in force_recovery mode.

Follow-up #5823

33254d91
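
The removal rule from 33254d91 reduces to a small predicate. The sketch below is hypothetical: the record enum and `toy_vylog_can_remove()` are invented names, and treating an empty record list as "not safe in casual recovery" is an assumption the source does not confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced set of vylog record kinds. */
enum toy_vylog_record {
	TOY_INSERT_RANGE,
	TOY_DELETE_RANGE,
	TOY_SNAPSHOT,
};

/*
 * Decide whether a vylog that is newer than the snapshot may be
 * removed. It is safe in any mode if the log ends with SNAPSHOT,
 * i.e. holds nothing written after the rotation. Otherwise removal
 * may lose data and is allowed only with force_recovery.
 */
bool
toy_vylog_can_remove(const enum toy_vylog_record *records, size_t count,
		     bool force_recovery)
{
	assert(records != NULL || count == 0);
	if (force_recovery)
		return true;
	/* Assumption: an empty log is not removed in casual mode. */
	return count > 0 && records[count - 1] == TOY_SNAPSHOT;
}
```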

vinyl: skip vylog if it's newer than snap · 149ccce9

Nikita Pettik authored 4 years ago

When data is stored in different engines, the checkpoint process is handled this way:
 - wait_checkpoint memtx
 - wait_checkpoint vinyl
 - commit_checkpoint memtx
 - commit_checkpoint vinyl

In contrast to commit_checkpoint, which does not tolerate failures (if
something goes wrong, e.g. renaming of the snapshot file, the instance
simply crashes), wait_checkpoint may fail. As a part of wait_checkpoint for
the vinyl engine, vy_log rotation takes place: the old vy_log is closed and
a new one is created. At this moment, wait_checkpoint of the memtx engine
has already created a new *inprogress* snapshot featuring a bumped vclock.
While recovering from this configuration, the vclock of the latest snapshot
is used as a reference.

At the initial recovery stage (vinyl_engine_begin_initial_recovery),
we check that the snapshot's vclock matches the vylog's one (they should be
the same since normally the vylog is rotated along with the snapshot). In
the situation described above, however, the directory contains the old
snapshot and the new vylog (and the new .inprogress snapshot). In such a
situation recovery (even in force mode) was aborted. The only way out of
this dead end was for the user to manually delete the last vy_log file.

Let's apply the same resolution automatically when the user runs in
force_recovery mode: delete the last vy_log file and update the vclock
value. If the user uses casual recovery, let's print a verbose message
explaining how to fix this situation manually.

Closes #5823

149ccce9

errinj: introduce ERROR_INJECT_TERMINATE() macro · a240e019

Nikita Pettik authored 4 years ago

It is a conditional injection that terminates execution by calling
assert(0) if the given condition is true. It is quite useful since it
allows us to emulate situations when an instance is suddenly shut down,
for example due to SIGKILL.

Needed for #5823

a240e019

xlog: introduce xdir_remove_file_by_vclock() function · 4d6e2b73

Nikita Pettik authored 4 years ago

Needed for #5823

4d6e2b73

vinyl: make vy_log_begin_recovery() take force_recovery param · f729343c

Nikita Pettik authored 4 years ago

Needed for #5823

f729343c

sql: ignore \0 in string passed to Lua-function · 22e2e4ea

Mergen Imeev authored 3 years ago

Prior to this patch, a string passed to a user-defined Lua function from
SQL was cropped if it contained '\0'. At the same time, it was not
cropped if it was passed to the function from BOX. After this patch the
string is not cropped when passed from SQL, even if it contains '\0'.

Closes #5938

22e2e4ea

sql: ignore \0 in string passed to C-function · fa7e6f7d

Mergen Imeev authored 3 years ago

Prior to this patch, a string passed to a user-defined C function from SQL
was cropped if it contained '\0'. At the same time, it was not
cropped if it was passed to the function from BOX. Now it is not cropped
when passed from SQL either.

Part of #5938

fa7e6f7d
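
The cropping described in fa7e6f7d and 22e2e4ea is what typically happens when a binary-safe string is handled as a NUL-terminated C string. The sketch below only illustrates that general cause; the commit messages do not say which calls were involved, and `length_until_nul()` and `copy_with_length()` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length as a NUL-terminated C string: stops at the first '\0'. */
size_t
length_until_nul(const char *s)
{
	return strlen(s);
}

/*
 * Copy exactly `len` bytes, embedded '\0' included.
 * Returns 0 on success, -1 if dst is too small.
 */
int
copy_with_length(char *dst, size_t dst_size, const char *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	if (len > dst_size)
		return -1;
	memcpy(dst, src, len);
	return 0;
}
```

For the five-byte string `"ab\0cd"`, `length_until_nul()` reports 2, while `copy_with_length()` called with an explicit length of 5 preserves the bytes after the embedded zero.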

Mar 31, 2021

gc/xlog: delay xlog cleanup until relays are subscribed · 2fd51aea

Cyrill Gorcunov authored 4 years ago


If a replica has fallen far behind the master node
(so a number of xlog files have accumulated since the
master's last snapshot), then once the master node is restarted
it may clean up the xlogs the replica needs to subscribe
quickly, and the replica will instead have to rejoin,
reading a large amount of data back.

Let's address this by delaying xlog file cleanup
until replicas are subscribed and relays are up
and running. For this we start with the cleanup fiber
spinning in a nop cycle ("paused" mode) and use a delay
counter to wait until relays decrement it.

This implies that if `_cluster` system space is not empty
upon restart and the registered replica somehow vanished
completely and won't ever come back, then the node
administrator has to drop this replica from `_cluster`
manually.

Note that this delayed cleanup start doesn't prevent
WAL engine from removing old files if there is no
space left on a storage device. The WAL will simply
drop old data without a question.

We need to take into account that some administrators
might not need this functionality at all. For this
purpose we introduce the "wal_cleanup_delay" configuration
option, which allows enabling or disabling the delay.

Closes #5806

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add wal_cleanup_delay configuration parameter

The `wal_cleanup_delay` option defines a delay in seconds
before pruning of write-ahead log files (`*.xlog`) starts
after a node restart.

This option is ignored if a node is running as
an anonymous replica (`replication_anon = true`). Similarly,
if replication is unused and there are no plans to use
replication at all, this option should not be considered.

The initial problem to solve is the case where a node operates
so fast that its replicas cannot keep up with its state.
If the node is restarted at that moment (for various
reasons, for example due to a power outage), `*.xlog` files might
be pruned during the restart. As a result, replicas will not find these
files on the main node and will have to reread all the data, which
is a very expensive procedure.

Since replicas are tracked via the `_cluster` system space, its
content is used to count subscribed replicas, and when all of them are
up and running the cleanup procedure is automatically enabled even
if `wal_cleanup_delay` has not expired.

The `wal_cleanup_delay` should be set to:

 - `0` to disable the cleanup delay;
 - `> 0` to wait for the specified number of seconds.

By default it is set to `14400` seconds (i.e. `4` hours).

If a registered replica is lost forever and the timeout is set to
infinity, the preferred way to enable the cleanup procedure is not to set
a small timeout value but rather to delete this replica from the `_cluster`
space manually.

Note that the option does *not* prevent the WAL engine from removing
old `*.xlog` files if there is no space left on a storage device;
the WAL engine can remove them forcibly.

The current state of the `*.xlog` garbage collector can be found in
the `box.info.gc()` output. For example:

``` Lua
 tarantool> box.info.gc()
 ---
   ...
   is_paused: false
```

The `is_paused` shows if cleanup fiber is paused or not.

2fd51aea
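
The rules stated in the `wal_cleanup_delay` description can be summarised as a small predicate. This is a sketch of the documented behaviour only, not of the actual cleanup fiber: `wal_cleanup_is_paused()` and its parameters are hypothetical, and the anonymous-replica case (where the option is ignored) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether xlog cleanup should still be paused after restart.
 *   delay      - wal_cleanup_delay in seconds, 0 disables the delay
 *   elapsed    - seconds passed since the node started
 *   registered - replicas listed in _cluster
 *   subscribed - replicas that have subscribed so far
 */
bool
wal_cleanup_is_paused(double delay, double elapsed,
		      int registered, int subscribed)
{
	assert(delay >= 0 && elapsed >= 0);
	assert(registered >= 0 && subscribed >= 0);
	if (delay == 0)
		return false;  /* the delay is disabled */
	if (subscribed >= registered)
		return false;  /* every known replica is back */
	return elapsed < delay;  /* otherwise wait for the timeout */
}
```

With the default of 14400 seconds, a node with two registered replicas and one subscribed stays paused until either the second replica subscribes or four hours pass.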