- Nov 23, 2020
-
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/. The patch makes raft stop using replicaset.vclock. Instead, it has a new option 'vclock'. It is stored inside struct raft by pointer and should be configured using raft_cfg_vclock(). Box configures it to point at replicaset.vclock like before, but now the raftlib code does not depend on it explicitly.

Vclock is stored in Raft by pointer instead of by value so as not to update it on each transaction. That would be too high a price to pay for raft's independence from box.

Part of #5303
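The pointer-based configuration described above can be sketched as follows. This is a minimal sketch with hypothetical, simplified types; the real definitions live in Tarantool's raft and vclock sources:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified vclock: one LSN per possible node. */
struct vclock {
	long long lsn[32];
};

struct raft {
	/*
	 * Stored by pointer, not by value: box keeps updating
	 * replicaset.vclock on every transaction, and raft sees
	 * the updates for free, without copying.
	 */
	const struct vclock *vclock;
};

static void
raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
{
	raft->vclock = vclock;
}
```

Box would call raft_cfg_vclock() once, pointing raft at replicaset.vclock, after which every transaction's vclock bump is visible to raft with no per-transaction work.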
-
Vladislav Shpilevoy authored
Raft is never supposed to change a vclock: neither the stored one nor the received ones. The patch makes this checked at compilation time. It is mostly motivated by the next patch, which makes Raft use an externally configured vclock that can't be changed. Since Raft uses raft_request to carry a vclock in a few places, the request's vclock also must become const. Part of #5303
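The compile-time guarantee can be illustrated with a hypothetical simplified shape of the request; the point is that raft only borrows vclocks through const pointers, so any write through them fails to compile:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified vclock. */
struct vclock {
	long long lsn[32];
};

struct raft_request {
	/* Borrowed, read-only: raft must never modify it. */
	const struct vclock *vclock;
};

static long long
raft_request_lsn_sum(const struct raft_request *req)
{
	/* req->vclock->lsn[0] = 1; -- would not compile: read-only. */
	long long sum = 0;
	for (int i = 0; i < 32; i++)
		sum += req->vclock->lsn[i];
	return sum;
}
```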
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/. The patch makes raft stop using instance_id. Instead, it has a new option 'instance_id'. It is stored inside struct raft as 'self', and should be configured using raft_cfg_instance_id(). The configuration is done when bootstrap ends and the instance_id is either recovered successfully, or the instance is anonymous.

While working on this, I also considered introducing a new function raft_boot() instead of raft_cfg_instance_id(), which I would also use to configure vclock later. Raft_boot() would be meant to be called only one time, with the non-dynamic parameters instance_id and vclock. But then I decided to keep adding new raft_cfg_*() functions, because:
- it is more consistent with the existing options;
- it does not require thinking about too many different functions like raft_create(), raft_boot(), raft_cfg_*() and the order in which to call them.

I also thought about introducing a single raft_cfg(), like I did in swim with swim_cfg(), to reduce the number of raft_cfg_*() functions, but decided it would be even worse with so many options.

Part of #5303
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/, including global replication parameters such as replication_synchro_quorum. The patch makes raft stop using replication_synchro_quorum. Instead, it has a new option 'election_quorum'. Note that this is just the raft API. The box API still uses replication_synchro_quorum, but it is used in src/box/raft, not in src/box/raftlib, to calculate the final quorum and pass it to the base raft implementation. Part of #5303
-
Vladislav Shpilevoy authored
Raft is being moved to a separate library in src/lib. It means it can't depend on anything from box/, including global replication parameters such as replication_timeout and functions like replication_disconnect_timeout(). The patch makes raft stop using replication_disconnect_timeout(). Instead, it stores the death timeout in struct raft. It is configured by box simultaneously with replication_timeout. Part of #5303
-
Vladislav Shpilevoy authored
The commit moves raft functions and objects specific to box from src/box/box and src/box/raftlib to src/box/raft. The goal is to gradually eliminate all box dependencies from src/box/raftlib and move it to src/lib/raft. It makes the compilation work again after the previous commit broke it. Part of #5303
-
Vladislav Shpilevoy authored
The commit renames raft.h and raft.c to raftlib.h and raftlib.c. This is done to prepare for the raft split into src/box/ and src/lib/raft. The commit is not atomic: the build won't work here. If raft were renamed to raftlib and new raft.c and raft.h were added in the same commit, git would think the original file was changed, ruining the git history. By splitting the move of raft to raftlib and the introduction of box/raft into 2 commits, the git history is preserved. Part of #5303
-
- Nov 22, 2020
-
-
Sergey Ostanevich authored
A number of places in sql.c use box_process_rw() directly, which does not check the read-only setting. Fixed by using the intended interface, box_process1(). Closes #5231
-
- Nov 20, 2020
-
-
Vladislav Shpilevoy authored
swim_test_indirect_ping() failed with random seed 1605651752. The test created a cluster with 3 swim nodes and broke the network connection between node-1 and node-2:

node1 <-> node3 <-> node2

Then it ran the cluster for 10 seconds and ensured that both node-1 and node-2 were eventually alive despite sometimes being suspected. 'Alive' means that a node is considered alive on all the other nodes.

The test spun for 10 seconds, giving the nodes a chance to become suspected. Then it checked that node-1 is either still alive, or is suspected but will be restored in at most 3 seconds. The same was checked for node-2. They were supposed to interact via node-3. The 3 seconds assumed the worst case: node-1 is suspected on node-3 from the beginning of this three-second interval, because it was suspected by node-2 and disseminated to node-3. Then node-3 might need 1 second to finish its current dissemination round by sending a ping to node-2, 1 second to start a new round randomly again from node-2, and only then send a ping to node-1. So 3 seconds total.

But it could also happen that at the beginning of the three-second interval node-1 is already suspected on node-2. On the next step node-2 shares the suspicion with node-3, and then the scenario above happens. So the test case needed at least 4 seconds. And actually it could go on infinitely: while the test waits 3 seconds for the gossip about node-1 to be refuted on node-3, node-2 can suspect it again. And so on.

Also the test would pass even without indirect pings, because node-3 has access to both node-1 and node-2. So even if, say, node-1 suspects node-2, it will tell node-3 about it. Node-3 will ping node-2, get an ack, and refute the gossip. The refutation is then sent back to node-1. It means indirect pings don't matter here.

The patch makes a new test, which won't pass without indirect pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY, which allows turning off all the SWIM components except failure detection, so only pings and acks are sent. Then, without proper indirect pings, node-1 and node-2 would suspect each other and eventually declare each other dead. The new test checks that this does not happen. Closes #5399
-
- Nov 19, 2020
-
-
Alexander Turenko authored
The following commits landed into the submodule with this update:

- 96dea99 Enable luacheck on Travis CI
- daff045 Add luacheck config and fix warnings found by luacheck

Most of the changes are about test-run's Makefile and CI, but there is one change in the testing framework itself:

```diff
diff --git a/test_run.lua b/test_run.lua
index 2a8d713..06fbf6c 100644
--- a/test_run.lua
+++ b/test_run.lua
@@ -163,7 +163,6 @@ local function wait_fullmesh(self, servers)
                 log.info("%s: bootstrapped", server)
                 break
             end
-            local info = self:eval(server, "box.info")
             fiber.sleep(0.01)
         end
         -- wait all for full mesh
```

This change removes a call whose result is not used. At first glance, it should not change any behaviour. However, test_run:eval() performs a network request, and it is good to get rid of it. The change is tiny and should not affect anything in tarantool testing, but I decided to commit it separately from the others to not hide anything and to ease bisect if we observe any positive or negative impact. IOW, just in case.
-
Alexander Turenko authored
There are two situations that may lead to a NULL dereference. First, when an attempt to create a merge source fails. Now only OOM may trigger this code path. Reproducing it in a test is tricky, so I left it without a test case. Second, when an attempt to create a merger fails. The reason for the failure may be a 'wrong' key_def: when we are unable to create a tuple format from it. Say, when the key_def defines several key parts using the same field, but marks them with different field types. Fixes #5450
-
- Nov 18, 2020
-
-
Roman Khabibov authored
Enable adding a column to an existing space with the <ALTER TABLE ADD [COLUMN]> statement. The column definition can be supplemented with the four types of constraints: <DEFAULT>, <COLLATE>, <[NOT] NULL> and AUTOINCREMENT clauses. Closes #2349, #3075

@TarantoolBot document
Title: Add columns to existing tables in SQL
Now it is possible to add columns to existing empty spaces using the <ALTER TABLE table_name ADD [COLUMN] column_name column_type ...> statement. The column definition is the same as in the <CREATE TABLE> statement.

* Space emptiness is Tarantool's restriction. The possibility to add a column to a non-empty space will be implemented later.

For example:
```
tarantool> box.execute("CREATE TABLE test (a INTEGER PRIMARY KEY)")
---
- row_count: 1
...

tarantool> box.execute([[ALTER TABLE test ADD COLUMN b TEXT
         > CHECK (LENGTH(b) > 1)
         > NOT NULL
         > DEFAULT ('aa')
         > COLLATE "unicode_ci"
         > ]])
---
- row_count: 1
...
```
-
Roman Khabibov authored
Allocate memory for the "index" array of an ephemeral space on the parser's region instead of the heap, which was used before. This fixes a memory leak: the memory realloc()ed for the index array of an ephemeral space was never released. Needed for #2349
-
Roman Khabibov authored
Just add box_space_field_MAX to the _space fields enum. Needed for #3075
-
Roman Khabibov authored
Move the ck and fk constraint lists from struct create_table_def into new defs, and autoincrement into struct Parse, to make the code more reusable when implementing <ALTER TABLE ADD COLUMN>. Needed for #3075
-
Roman Khabibov authored
Rename TK_COLUMN, used for tokens treated as a column name, to TK_COLUMN_REF. It is needed to allow typing the COLUMN keyword in the <ALTER TABLE ADD COLUMN> statement. Needed for #3075
-
- Nov 17, 2020
-
-
Nikita Pettik authored
After the upsert rework in 5a61c471 (issue #5107), update operations are applied consistently with respect to the upserts they belong to: if an update operation from a single upsert fails, all update operations from that upsert are skipped. But the rest of the update operations, related to other upserts (after squashing two upserts), are applied. So let's update the #4957 test case: now an upsert operation can't be processed only if it contains more than BOX_UPDATE_OP_CNT_MAX (4000) operations before squashing with any other upsert. Follow-up #4957 Follow-up #5107
-
- Nov 13, 2020
-
-
Cyrill Gorcunov authored
We have the VCLOCK_MASK constant, which limits the number of regular nodes in a replication cluster. This limit is bound to the vclock_map_t bitset type. Thus, when we're tracking the voting process in a Raft node, we'd better use this type for the vote_mask member (otherwise there is room for error if we ever change VCLOCK_MASK and extend the width of the bitset). Suggested-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
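The idea of tying the vote mask to the bitset type can be sketched like this. This is a hypothetical condensed version (names and widths are illustrative, not the real Tarantool definitions); the point is that if the bitset type ever widens, vote_mask widens with it automatically:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: one bit per possible cluster node. */
typedef uint32_t vclock_map_t;

struct raft_votes {
	vclock_map_t vote_mask; /* one bit per node that voted for us */
	int vote_count;
};

static void
raft_register_vote(struct raft_votes *v, int node_id)
{
	vclock_map_t bit = (vclock_map_t)1 << node_id;
	if ((v->vote_mask & bit) != 0)
		return; /* a duplicate vote from the same node is ignored */
	v->vote_mask |= bit;
	v->vote_count++;
}
```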
-
- Nov 12, 2020
-
-
Kirill Yukhin authored
Variable `old`, which contains a tuple reference, wasn't unrefed at all. Fix this.
-
- Nov 11, 2020
-
-
Mary Feofanova authored
The degradation caused by #2866 forbade specifying index options in key parts, which is actually right. The tests were fixed, and the behaviour was extended to the other way of specifying index parts. Closes #5473
-
- Nov 10, 2020
-
-
Vladislav Shpilevoy authored
Vclock is used in raft, which is going to be moved to src/lib. That means vclock also should be moved there. It is easy, because vclock does not depend on anything in box/. Needed for #5303
-
Vladislav Shpilevoy authored
Since box_raft is now initialized at runtime and is used from several subsystems (memtx for snapshots, applier for accepting rows, box.info for monitoring), it is easy to get the initialization order wrong and accidentally use the not yet initialized global raft object. This patch adds a sanity check ensuring that does not happen. The raft state is set to 0 at program start. Then any access to the global raft object first checks that the state is not 0. The initialization order will get trickier when raft stops using globals from replication and from box, and is used from them more extensively. Part of #5303
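The state-is-not-0 check described above might look roughly like this. A hypothetical condensed form (the real enum values and accessor live in box/raft); state 0 is never a valid raft state, so it doubles as the "not initialized" marker:

```c
#include <assert.h>

enum raft_state {
	RAFT_STATE_NONE = 0,     /* program start: not initialized */
	RAFT_STATE_FOLLOWER = 1, /* a valid post-init state */
};

struct raft {
	enum raft_state state;
};

/* Zero-initialized at program start, so state == RAFT_STATE_NONE. */
static struct raft box_raft_global;

static void
box_raft_init(void)
{
	box_raft_global.state = RAFT_STATE_FOLLOWER;
}

static struct raft *
box_raft(void)
{
	/* Any access before box_raft_init() trips this assertion. */
	assert(box_raft_global.state != RAFT_STATE_NONE);
	return &box_raft_global;
}
```

Every subsystem then goes through box_raft() instead of touching the global directly, so a wrong initialization order crashes loudly in debug builds instead of silently using garbage.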
-
Vladislav Shpilevoy authored
All raft functions worked with a global raft object. That would make it impossible to move raft to a separate module, where it could be properly unit-tested with multiple raft nodes in each test. The patch adds an explicit raft pointer argument to each raft function as the first part of moving raft to a separate library. The global object is renamed to box_raft_global to emphasize that it is a global box object, not one from the future raft library. Access to it now goes through the box_raft() function, which will get some sanity checks in the next commit. Part of #5303
-
Vladislav Shpilevoy authored
Struct fiber has a member va_list f_data. It is used to forward arguments to the fiber function when fiber_start() is called, right from the caller's stack. But it is useless when a fiber is started asynchronously, with fiber_new + fiber_wakeup, and there is no way to pass anything into such a fiber. This patch adds a new member 'void *f_arg', which shares memory with va_list f_data and can be used to pass something into the fiber. The feature is going to be used by raft. Currently the raft worker fiber works only with global variables, but soon it will need to have its own pointer to the struct raft object. And it can't be started with fiber_start(), because raft code does not yield anywhere in its state machine. Needed for #5303
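The memory-sharing trick can be sketched with a hypothetical simplified struct (the real struct fiber is much bigger): f_arg overlays the va_list in a union, so the struct does not grow, and only one of the two members is used per fiber depending on how it was started.

```c
#include <assert.h>
#include <stdarg.h>

struct fiber_sketch {
	union {
		va_list f_data; /* fiber_start(): args from caller's stack */
		void *f_arg;    /* fiber_new() + fiber_wakeup(): one pointer */
	};
};

/* Hypothetical setter used before waking an asynchronously started fiber. */
static void
fiber_set_arg(struct fiber_sketch *f, void *arg)
{
	f->f_arg = arg;
}
```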
-
Vladislav Shpilevoy authored
The Raft state machine crashed if it was configured to be a candidate during a WAL write with a known leader. It tried to start waiting for the leader's death, but should have waited for the WAL write to end first. The code tried to handle this, but the order of the 'if' conditions was wrong: a WAL write being in progress was checked last, but should have been checked first. Closes #5506
-
Vladislav Shpilevoy authored
The Raft state machine crashed if it was restarted while a WAL write was in progress. When the machine was started, it didn't account for an unfinished WAL write possibly remaining from the time it was enabled earlier. The patch makes it continue waiting for the write to end. Part of #5506
-
Vladislav Shpilevoy authored
Raft didn't broadcast its state when the state machine was started. It could lead to the state never being sent until some other node generated a term number bigger than the local one. That happened when a node participated in some elections and accumulated a big term number, then the election was turned off, and a new replica was connected in a 'candidate' state. Then the first node was configured to be a 'voter'. The first node didn't send anything to the replica, because at the moment of its connection the election was off. So the replica started from term 1 and tried to start elections in this term, but was ignored by the first node. It waited for the election timeout, bumped the term to 2, and the process was repeated until the replica reached the first node's term + 1. It could take a very long time. The patch fixes this: now Raft broadcasts its state when it is enabled, to cover the replicas connected while it was disabled. Closes #5499
-
Vladislav Shpilevoy authored
The typo led to the election timeout not being reset to the default value. It was left at 1000, and as a result the subsequent election tests could run extremely long. Part of #5499
-
Vladislav Shpilevoy authored
The Raft worker fiber does all the heavy and yielding jobs. There are 2 of them: disk write and network broadcast. The disk write yields. The network broadcast is slow, so it happens at most once per event loop iteration. The worker on each iteration checks whether any of these 2 jobs is active, and if not, it goes to sleep until an explicit wakeup. But there was a bug: before going to sleep it did a yield plus a check that there is nothing to do. During the yield new tasks could appear, the check failed, and that led to a crash. The patch reorganizes this part of the code so that the worker does not yield between checking for new tasks and going to sleep. No test, because it is extremely hard to reproduce, and I don't want to clog this part of the code with error injections.
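The fixed control flow can be sketched as below. This is a hypothetical condensed form with plain booleans standing in for the fiber machinery: the essential property is that the "any work left?" check and the decision to sleep happen back to back, with no yield point between them where new tasks could sneak in.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the worker's two job flags and its sleep state. */
static bool has_disk_write;
static bool has_broadcast;
static bool worker_is_sleeping;

static void
raft_worker_step(void)
{
	/* Keep draining jobs, including any queued while we worked. */
	while (has_disk_write || has_broadcast) {
		has_disk_write = false; /* stands for the yielding WAL write */
		has_broadcast = false;  /* stands for the network broadcast */
	}
	/* No yield between the loop condition above and sleeping here. */
	worker_is_sleeping = true;
}

static void
raft_worker_wakeup(bool disk, bool net)
{
	has_disk_write = disk;
	has_broadcast = net;
	worker_is_sleeping = false;
}
```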
-
- Nov 03, 2020
-
-
Sergey Ostanevich authored
A static buffer overflow in the thread-local pool caused random failures on the OSX platform. It was caused by an incorrect use of the allocator result. Fixes #5312 Co-authored-by:
Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
-
Vladislav Shpilevoy authored
"Too long WAL write" is supposed to warn a user that either the disk write was too long, or the event loop is too slow, maybe due to certain fibers not yielding often enough. It was printed by the code doing the transaction commit. As a result, for synchronous transactions the check also included the replication time, often exceeding the threshold and printing "too long WAL write" even when it had nothing to do with a WAL write or the event loop being too slow. The patch makes the warning be checked and printed right after the WAL write, not after commit. Closes #5139
-
Vladislav Shpilevoy authored
txn_complete used to handle all the transaction outcomes:
- manual rollback;
- error at WAL write;
- successful WAL write and commit;
- successful WAL write and waiting for synchronization with replicas.

The code became a mess after synchronous replication was introduced. This patch splits txn_complete's code into multiple pieces. Now WAL write success and failure are handled exclusively by txn_on_journal_write(). It also runs the WAL write triggers; it was very strange to call them from txn_complete(). txn_on_journal_write() also checks whether the transaction is synchronous, and if it is not, completes it with txn_complete_success(), whose code is simple now and only commits the changes. In case of failure the transaction always ends up in txn_complete_fail().

These success and fail functions are now used by the limbo as well. It turned out that all the places finishing a transaction always know whether they want to fail it or complete it successfully. This should also remove a few ifs from the hot code of the transaction commit.

The patch simplifies the code in order to fix the false warning about a too long WAL write for synchronous transactions, which is currently printed not at WAL write but at commit. These two events are far from each other for synchro requests. Part of #5139
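The resulting dispatch can be sketched as follows. This is a hypothetical condensed mirror of the described split (the real txn struct and helpers are far richer); the shape to note is that the journal callback either fails the transaction, commits it, or hands a synchronous one to the limbo, which later finishes it via the same two helpers:

```c
#include <assert.h>
#include <stdbool.h>

enum txn_outcome {
	TXN_PENDING,
	TXN_COMMITTED,
	TXN_FAILED,
	TXN_IN_LIMBO, /* sync txn: waiting for replica acks */
};

struct txn_sketch {
	bool is_sync;
	enum txn_outcome outcome;
};

static void
txn_complete_success(struct txn_sketch *txn)
{
	txn->outcome = TXN_COMMITTED;
}

static void
txn_complete_fail(struct txn_sketch *txn)
{
	txn->outcome = TXN_FAILED;
}

static void
txn_on_journal_write(struct txn_sketch *txn, bool write_ok)
{
	/*
	 * This is also where the "too long WAL write" warning belongs:
	 * right after the write, before any waiting for replicas.
	 */
	if (!write_ok) {
		txn_complete_fail(txn);
		return;
	}
	if (txn->is_sync) {
		/* The limbo finishes it later via the same two helpers. */
		txn->outcome = TXN_IN_LIMBO;
		return;
	}
	txn_complete_success(txn);
}
```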
-
Vladislav Shpilevoy authored
The function is called only by the journal when a write is finished. Besides, it may not complete the transaction: in case of synchronous replication a finished write is not enough for completion. It means it can't have 'complete' in its name. Also the function is never used outside of txn.c, so it is removed from txn.h and is now static. The patch is a preparation for not spamming "too long WAL write" on synchronous transactions, because that is simply misleading. Part of #5139
-
- Nov 02, 2020
-
-
Cyrill Gorcunov authored
We should never trust the request data: it can carry anything. Thus let's make sure we're not decoding some trash into a string. Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
The size should match the real size of the buffer, otherwise there is room for a mistake. At the same time, make sure we're not overrunning the buffer. Signed-off-by:
Cyrill Gorcunov <gorcunov@gmail.com>
-
- Nov 01, 2020
-
-
Alexander V. Tikhonov authored
It was found that the previously fixed vinyl/gh.test.lua test from commit 94dc5bdd ('test: gh test hangs after gh-4957-too-many-upserts'), which added a fiber.sleep(1) workaround to avoid an error raised by the previously run vinyl/gh-4957-too-many-upserts.test.lua test, can be changed in a different way. On one side, the new change leaves the found issue untouched, so it can be resolved within the issue opened on github. On the other side, it lets the test-run tool avoid the issue using the fragile list feature, preserving the stability of testing, since the found issue is flaky and can pass on reruns. The current fix changes the forever-waiting loop to the test_run:wait_cond() routine, created especially for such situations, which has a timeout to avoid hanging the test until the global timeout occurs. This lets testing continue even after the fail. Needed for #5141
-
Alexander V. Tikhonov authored
Added a restart of the current server to resolve issue #5141, reproduced in the test vinyl/gh-5141-invalid-vylog-file.test.lua. Added a test-run filter on the box.snapshot error message 'Invalid VYLOG file: Slice [0-9]+ deleted but not registered' to avoid printing changing data in the results file, so that its checksums can be used in the test-run fragile list to rerun it as a flaky issue. Part of #5141
-
Alexander V. Tikhonov authored
Created a stable reproducer for issue #5141:

```
box.snapshot()
---
-- ok
+- error: 'Invalid VYLOG file: Slice <NUM> deleted but not registered'
...
```

which flakily occurred in vinyl/ suite tests when running after the test vinyl/gh-4957-too-many-upserts.test.lua. The reproducer is a new standalone test, vinyl/gh-5141-invalid-vylog-file.test.lua, based on that test. Since the issue is not reproduced on FreeBSD 12, the test was blocked there with vinyl/gh-5141-invalid-vylog-file.skipcond. Needed for #5141
-
Alexander V. Tikhonov authored
Added a test-run filter on the box.snapshot error message 'Invalid VYLOG file: Slice [0-9]+ deleted but not registered' to avoid printing changing data in the results file, so that its checksums can be used in the test-run fragile list to rerun it as a flaky issue. Also added checksums to the fragile list for the following tests:
- vinyl/iterator.test.lua (gh-5141)
- vinyl/snapshot.test.lua (gh-4984)

Needed for #5141
Needed for #4984
-
Alexander V. Tikhonov authored
Sometimes it is convenient to use the default compiler on CentOS 7. Added a test job which builds with the default compiler: CC=/usr/bin/gcc CXX=/usr/bin/g++ Closes #4941
-