Commits · 3c0b800a8411f844510515def35616535a6f96fc · core / tarantool

Oct 28, 2022

txn: account started, committed, and rolled back transactions · 70969d1d

This commit fixes BEGIN, COMMIT, and ROLLBACK counters in the box.stat()
output. Before this commit, they always showed 0. Now, they report
the number of started, committed, and rolled back transactions,
respectively.

Closes #7583

NO_DOC=bug fix

box: panic if snapshot has no system spaces during recovery · e0a9aed4

Ilya Verbin authored 2 years ago

Currently, if a snapshot contains some correct entries but doesn't
include system spaces, Tarantool crashes with a segmentation fault or,
for a Debug build, with: void diag_raise(): Assertion `e != NULL' failed.
This happens because memtx_engine_recover_snapshot returns -1 while
diag is not set. Let's panic instead of crashing.

Closes #7800

NO_DOC=bugfix

Oct 26, 2022

msgpack: fix crash on decode of 0xc1 · ced405af

Vladimir Davydov authored 2 years ago

0xc1 isn't a valid MsgPack header, but it was allowed by mp_check.
As a result, msgpack.decode crashed while trying to decode it.
This commit updates the msgpuck library to fix this issue.
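The rule the fix enforces can be illustrated with a header classifier. This is a minimal sketch, not msgpuck's code: `mp_kind_of` and `enum mp_kind` are hypothetical names, and the byte ranges follow the MsgPack specification, in which 0xc1 is the only "never used" header byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coarse classification of MsgPack header bytes. */
enum mp_kind {
	MP_K_NIL, MP_K_BOOL, MP_K_INT, MP_K_FLOAT, MP_K_STR,
	MP_K_BIN, MP_K_ARRAY, MP_K_MAP, MP_K_EXT, MP_K_INVALID,
};

static enum mp_kind
mp_kind_of(uint8_t c)
{
	if (c <= 0x7f) return MP_K_INT;   /* positive fixint */
	if (c <= 0x8f) return MP_K_MAP;   /* fixmap */
	if (c <= 0x9f) return MP_K_ARRAY; /* fixarray */
	if (c <= 0xbf) return MP_K_STR;   /* fixstr */
	if (c >= 0xe0) return MP_K_INT;   /* negative fixint */
	switch (c) {
	case 0xc0: return MP_K_NIL;
	case 0xc1: return MP_K_INVALID;   /* "never used": must be rejected */
	case 0xc2: case 0xc3: return MP_K_BOOL;
	case 0xc4: case 0xc5: case 0xc6: return MP_K_BIN;
	case 0xca: case 0xcb: return MP_K_FLOAT;
	case 0xd9: case 0xda: case 0xdb: return MP_K_STR;
	case 0xdc: case 0xdd: return MP_K_ARRAY;
	case 0xde: case 0xdf: return MP_K_MAP;
	default: break;
	}
	if (c >= 0xcc && c <= 0xd3)
		return MP_K_INT;          /* uint8..uint64, int8..int64 */
	return MP_K_EXT;                  /* 0xc7-0xc9 ext, 0xd4-0xd8 fixext */
}
```

A checker built on such a table reports an error for `MP_K_INVALID` instead of letting the decoder run into an unexpected header.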

Closes #7818

NO_DOC=bug fix

Oct 25, 2022

box: expose box.info() before box.cfg() · ad420846

Nikolay Shirokovskiy authored 2 years ago

So that one can easily check the current box status.

NO_DOC=minor change

Closes #7255

fiber: initialize thread-local cord on demand · 508138b7

Vladimir Davydov authored 2 years ago

We're planning to introduce a basic C API for user read views (EE-only).
Like all other box C API functions, the new API functions will use the
existing box error C API for reporting errors. The problem is that
a read view created using C API should be usable from user threads
(started with the pthread lib) while the box error C API doesn't work
in user threads, because those threads don't have the cord pointer
initialized (a diagnostic area is stored in a cord object).

To address this issue, let's create a new cord object automatically on
first use of cord() if it wasn't created explicitly. An automatically
created object is destroyed at thread exit (to achieve that, we use
the C++ RAII concept).

Closes #7814

NO_DOC=The C API documentation doesn't say anything about threads.
       Let's keep it this way for now. We're planning to introduce
       a new C API to work with threads in C modules. We'll update
       the doc when it's ready.

security: make os.getenv safe · dd7d46af

Serge Petrenko authored 2 years ago

Closes #7797

NO_DOC=security fix
NO_TEST=security fix

security: check size boundaries for getenv() returns · b86395ff

Serge Petrenko authored 2 years ago

getenv() return values cannot be trusted, because an attacker might set
them. For instance, we shouldn't expect that getenv() returns a value
of some sane size.

Another problem is that getenv() returns a pointer to one of
`char **environ` members, which might change upon next setenv().

Introduce a wrapper, getenv_safe(), which returns the value only when
it fits in a buffer of a specified size, and copies the value onto the
buffer. Use this wrapper everywhere in our code.
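A wrapper with the behavior described above could look roughly like this. This is a hedged sketch: the signature and details are assumptions for illustration and may differ from the real getenv_safe() in the Tarantool tree.

```c
#define _POSIX_C_SOURCE 200809L /* for setenv()/unsetenv() in callers */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copies the value of environment variable `name` into `buf`
 * (`buf_size` bytes) and returns `buf`. Returns NULL if the variable
 * is unset or the value plus its terminating NUL does not fit, so the
 * caller never keeps a pointer into `environ`, which setenv() may change.
 */
static char *
getenv_safe(const char *name, char *buf, size_t buf_size)
{
	const char *value = getenv(name);
	if (value == NULL || buf == NULL || buf_size == 0)
		return NULL;
	size_t len = strlen(value);
	if (len >= buf_size)
		return NULL; /* refuse oversized, possibly hostile values */
	memcpy(buf, value, len + 1);
	return buf;
}
```

Callers pick a buffer size that is sane for the variable (for example `PATH_MAX` for a directory) and treat NULL as "not usable".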

Below's a slightly decorated output of `grep -rwn getenv ./src --include
*.c --include *.h --include *.cc --include *.cpp --include *.hpp
--exclude *.lua.c` as of 2022-10-14.
`-` marks invalid occurrences (comments, for example),
`*` marks the places that are already guarded before this patch,
`X` marks the places guarded in this patch, and
`^` marks places fixed in the next commit:

NO_WRAP
```
* ./src/lib/core/coio_file.c:509:	const char *tmpdir = getenv("TMPDIR");
X ./src/lib/core/errinj.c:75: const char *env_value = getenv(inj->name);
- ./src/proc_title.c:202: * that might try to hang onto a getenv() result.)
- ./src/proc_title.c:241:	* is mandatory to flush internal libc caches on getenv/setenv
X ./src/systemd.c:54: sd_unix_path = getenv("NOTIFY_SOCKET");
* ./src/box/module_cache.c:300: const char *tmpdir = getenv("TMPDIR");
X ./src/box/sql/os_unix.c:1441: azDirs[0] = getenv("SQL_TMPDIR");
X ./src/box/sql/os_unix.c:1446: azDirs[1] = getenv("TMPDIR");
* ./src/box/lua/console.c:394: const char *envvar = getenv("TT_CONSOLE_HIDE_SHOW_PROMPT");
^ ./src/box/lua/console.lua:771: local home_dir = os.getenv('HOME')
^ ./src/box/lua/load_cfg.lua:1007: local raw_value = os.getenv(env_var_name)
X ./src/lua/init.c:575: const char *path = getenv(envname);
X ./src/lua/init.c:592: const char *home = getenv("HOME");
* ./src/find_path.c:77: snprintf(buf, sizeof(buf) - 1, "%s", getenv("_"));
```
NO_WRAP

Part-of #7797

NO_DOC=security

Oct 24, 2022

sql: fix another cursor invalidation · 5a38c5c9

Mergen Imeev authored 2 years ago

This patch fixes the issue described in #5310 for the case when the tuple
format has more fields than the space format. This solution is more
general than the one in 89057a21.

Follow-up #5310
Closes #4666

NO_DOC=bugfix

Oct 21, 2022

build: use relative paths in diagnostics and debugging information · 256da010

Georgiy Lebedev authored 2 years ago

Since our diagnostics use the `__FILE__` macro, they contain absolute
paths, which is redundant and inconsistent: replace them with
relative ones.

As for debugging information, replacing absolute paths with relative ones
also requires an extra command to tell the debugger where to find the
source files, which is inconvenient for developers: provide a new
`DEV_BUILD` option (off by default); when it is off, absolute paths in
debugging information are replaced with relative ones.

Strip the prefix map flags from the compiler flags exported to tarantool via
`src/trivia/config.h`.

Closes #7808

NO_DOC=<verbosity>
NO_TEST=<verbosity>

Oct 20, 2022

box: unify errors about mismatch of password and login during auth · 5c62f01b

Andrey Saranchin authored 2 years ago

If we raise different errors for an invalid password and for the
login of a non-existent user during authentication, an unauthorized
person can use the difference to enumerate users.
So let's raise the same error in both cases.
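The idea can be sketched as a lookup that folds both failure cases into one result. Everything here is hypothetical (the `users` table, `authenticate`, plaintext passwords kept only for brevity), and the sketch unifies the reported error only; it does not try to equalize timing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy user registry; real systems store password hashes, not plaintext. */
struct user {
	const char *name;
	const char *password;
};

static const struct user users[] = {
	{ "admin", "secret" },
	{ "alice", "wonderland" },
};

enum auth_result { AUTH_OK, AUTH_FAILED };

static enum auth_result
authenticate(const char *login, const char *password)
{
	const struct user *found = NULL;
	for (size_t i = 0; i < sizeof(users) / sizeof(users[0]); i++) {
		if (strcmp(users[i].name, login) == 0) {
			found = &users[i];
			break;
		}
	}
	/*
	 * An unknown login and a wrong password produce the same result,
	 * so the client cannot tell which of the two happened.
	 */
	if (found == NULL || strcmp(found->password, password) != 0)
		return AUTH_FAILED;
	return AUTH_OK;
}
```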

Closes #tarantool/security#16

NO_DOC=security fix

Oct 19, 2022

debugger: prevent running from Tarantool REPL · ace88542

Timur Safin authored 2 years ago

At the moment, the debugger is not compatible with readline
support in the Tarantool console. Detect this situation
when the debugger starts and bail out.

NO_TEST=interactive
NO_DOC=Markdown updated

debugger: console debugger changelog and doc · a2ba5013

Timur Safin authored 2 years ago

NO_TEST=see it elsewhere

Part of #7593

@TarantoolBot document
Title: Console debugger for Lua

Console debugger luadebug.lua
==============================

Module `luadebug.lua` is available as a console debugger for Lua scripts.
It's activated via:

```
local debugger = require 'luadebug'
debugger()
```

It was originally based on third-party code from slembcke/debugger.lua,
but has been significantly refactored since then.

Currently available console shell commands are:
```
    c|cont|continue
    - continue execution
    d|down
    - move down the stack by one frame
    e|eval $expression
    - execute the statement
    f|finish|step_out
    - step forward until exiting the current function
    h|help|?
    - print this help message
    l|locals
    - print the function arguments, locals and upvalues
    n|next|step_over
    - step forward by one line (skipping over functions)
    p|print $expression
    - execute the expression and print the result
    q|quit
    - exit debugger
    s|st|step|step_into
    - step forward by one line (into functions)
    t|trace|bt
    - print the stack trace
    u|up
    - move up the stack by one frame
    w|where $linecount
    - print source code around the current line
```

Console debugger `luadebug.lua` can show the sources of builtin
Tarantool modules (e.g. `@builtin/datetime.lua`) using the new function
`tarantool.debug.getsources()`, introduced for that purpose. Any external
GUI debugger (e.g. VSCode or JetBrains IDEs) could use this function to
show the sources of builtin modules while debugging them.

> Please see third_party/lua/README-luadebug.md for a fuller description
> of an original luadebug.lua implementation.

box: fix format of _vfunc · 707da125

Mergen Imeev authored 2 years ago

The _vfunc system space is the sysview for the _func system space.
However, the _vfunc format is different from the _func format. This
patch makes the _vfunc format the same as the _func format.

Closes #7822

NO_DOC=bugfix

Oct 18, 2022

uri: optimize allocation of parameters and their values dynamic arrays · 5cad0759

Georgiy Lebedev authored 2 years ago

The dynamic arrays of URI parameters and their values are allocated
inefficiently: they are reallocated each time a new parameter or parameter
value is added. Grow them exponentially instead.
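Exponential growth can be sketched with a doubling dynamic array. The struct below is hypothetical and does not match the layout of `struct uri_param`; the `realloc_count` field exists only to show that n appends cost O(log n) reallocations instead of n.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical dynamic array of string pointers. */
struct str_array {
	const char **items;
	int size;
	int capacity;
	int realloc_count; /* for illustration only */
};

/* Appends `item`, doubling the capacity when the array is full. */
static int
str_array_push(struct str_array *a, const char *item)
{
	if (a->size == a->capacity) {
		int new_capacity = a->capacity == 0 ? 4 : a->capacity * 2;
		const char **items =
			realloc(a->items, new_capacity * sizeof(*items));
		if (items == NULL)
			return -1;
		a->items = items;
		a->capacity = new_capacity;
		a->realloc_count++;
	}
	a->items[a->size++] = item;
	return 0;
}
```

Appending 100 items triggers 6 reallocations (capacities 4, 8, 16, 32, 64, 128) rather than 100.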

`struct uri_param` and `struct uri` are exposed in Lua via FFI
(see src/lua/uri.lua): add warnings about the necessity of reflecting
changes to them in `ffi.cdecl`.

Closes #7155

NO_DOC=optimization
NO_TEST=optimization

datetime: datetimes subtractions ignored timezone · 0daed8d5

Timur Safin authored 2 years ago

We used to ignore timezone difference (in `tzoffset`) for
datetime subtraction operation:

```
tarantool> datetime.new{tz='MSK'} - datetime.new{tz='UTC'}
---
- +0 seconds
...

tarantool> datetime.new{tz='MSK'}.timestamp -
           datetime.new{tz='UTC'}.timestamp
---
- -10800
...
```

Now we accumulate the tzoffset difference in the minute component
of the resulting interval:

```
tarantool> datetime.new{tz='MSK'} - datetime.new{tz='UTC'}
---
- -180 minutes
...
```
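The arithmetic behind the new result can be checked with a simplified model. This sketch is not Tarantool's `struct datetime`: it assumes a value is a wall-clock time in its own zone (`local_secs`) plus `tzoffset` in minutes east of UTC, with MSK taken as +180.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified datetime: wall-clock seconds + offset (minutes). */
struct dt {
	int64_t local_secs;
	int32_t tzoffset;
};

struct interval {
	int64_t sec;
	int64_t min;
};

static int64_t
dt_utc_secs(const struct dt *d)
{
	return d->local_secs - (int64_t)d->tzoffset * 60;
}

/*
 * a - b: the wall-clock difference goes to `sec`, and the tzoffset
 * difference is accumulated in `min`, so the total duration equals
 * the difference of the UTC timestamps.
 */
static struct interval
dt_sub(const struct dt *a, const struct dt *b)
{
	struct interval iv;
	iv.sec = a->local_secs - b->local_secs;
	iv.min = (int64_t)b->tzoffset - a->tzoffset;
	return iv;
}
```

For `{0, 180}` (MSK) minus `{0, 0}` (UTC) this gives 0 seconds and -180 minutes, i.e. -10800 seconds, matching the timestamp difference shown above.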

Closes #7698

NO_DOC=bugfix

datetime: fix interval arithmetic for DST · 6ca07285

Timur Safin authored 2 years ago

We did not take into account that, as a result of date/time
arithmetic, we could end up with a different timezone offset
if a DST boundary was crossed during the operation.

```
tarantool> datetime.new{year=2008, month=1, day=1,
                        tz='Europe/Moscow'} +
           datetime.interval.new{month=6}
---
- 2008-07-01T01:00:00 Europe/Moscow
...
```

Now we resolve tzoffset at the end of the operation if
tzindex is not 0.

Fixes #7700

NO_DOC=bugfix

box: forbid DDL operations until box.schema.upgrade · 38f88795

Ilya Verbin authored 2 years ago

Currently, in case of recovery from an old snapshot, Tarantool allows
performing DDL operations on an instance with a non-upgraded schema.
It leads to various unpredictable errors (because the DDL code assumes
that the schema is already upgraded). This patch forbids the following
operations unless the user has the most recent schema version:
- box.schema.space.create
- box.schema.space.drop
- box.schema.space.alter
- box.schema.index.create
- box.schema.index.drop
- box.schema.index.alter
- box.schema.sequence.create
- box.schema.sequence.drop
- box.schema.sequence.alter
- box.schema.func.create
- box.schema.func.drop

Closes #7149

NO_DOC=bugfix

Oct 14, 2022

sql: fix assertion in JOIN using unsupported index · fd780129

Mergen Imeev authored 2 years ago

This patch fixes an assertion failure when JOIN uses an index of an unsupported type.

Closes #5678

NO_DOC=bugfix

vinyl: implement transaction isolation levels · 588170a7

Vladimir Davydov authored 2 years ago

This commit adds support for transaction isolation levels introduced
earlier for memtx mvcc by commit ec750af6 ("txm: introduce
transaction isolation levels"). The isolation levels work exactly in
the same way as in memtx:

 - Unless a transaction explicitly specifies the 'read-committed'
   isolation level, it'll skip prepared statements, even if they are
   visible from its read view. The background for this was implemented
   in the previous patches, which added the is_prepared_ok flag to
   cache and mem iterators.

 - If a transaction skips a prepared statement, which would otherwise be
   visible from its read view, it's sent to the most recent read view
   preceding the prepared statement LSN. Note, older prepared statements
   are still visible from this read view and can actually be selected if
   committed later.

 - A transaction using the 'best-effort' isolation level (default) is
   switched to 'read-committed' when it executes the first write
   statement.

The implementation is tested by the existing memtx mvcc tests that were
made multi-engine in the scope of this commit. However, we add one more
test case - the one that checks that a 'best-effort' read view is
properly updated in case there is more than one prepared transaction.
Also, there are a few tests that relied upon the old implementation and
assumed that select from Vinyl may return unconfirmed tuples. We update
those tests here as well.

Closes #5522

NO_DOC=already documented

Oct 13, 2022

replication: send raft terms in applier heartbeats · 54495510

Vladislav Shpilevoy authored 2 years ago

There was a bug where an instance could ack a transaction from an
old Raft term, thus allowing the old leader to CONFIRM it, even if
that first instance knew there was a newer Raft term going on.

As a result, the old leader could write CONFIRM even if there was
already a new leader elected and the synchro quorum was > half.
That led to split-brain when the bad txn reached the new leader and
PROMOTE reached the old leader.

Split-brain here is totally unnecessary. If the quorum is correct,
the synchro timeout is infinite, and there are no async transactions,
then split-brain should never happen.

The fix is as simple as attaching the current Raft term number to
applier heartbeats.

In the test case above, if terms are attached, the old leader gets
ACK + new term. That causes the old leader to freeze even if the
pending txn got quorum. The old leader can neither CONFIRM nor ROLLBACK
its pending txns until a new leader is elected.
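The leader-side effect of the new field can be sketched as follows. The struct and functions are hypothetical simplifications, not Tarantool's synchro queue: an ACK still counts toward the quorum, but once an ACK carries a newer term the old leader refuses to finalize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified old-leader state for one pending sync txn. */
struct leader_state {
	uint64_t term;  /* term in which this node became leader */
	bool frozen;    /* set once a newer term is observed */
	int acks;       /* ACKs collected for the pending txn */
	int quorum;
};

/* Handle an applier ACK that carries the replica's Raft term. */
static void
on_ack(struct leader_state *s, uint64_t replica_term)
{
	if (replica_term > s->term)
		s->frozen = true; /* a newer term exists somewhere */
	s->acks++;
}

/* CONFIRM is allowed only with a quorum and no newer term seen. */
static bool
can_confirm(const struct leader_state *s)
{
	return !s->frozen && s->acks >= s->quorum;
}
```

In this model the pending txn reaches quorum, yet `can_confirm()` stays false, which mirrors the freeze described above.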

The freeze is guaranteed, because if a new leader was elected, then it
got votes from more than half of the cluster. It means more than half of
the nodes have the new term. That in turn means the old leader, while
collecting ACKs for its "new" txn, will get the new term number from at
least one replica.

When the new leader finishes writing PROMOTE, it either confirms
or rolls back the txn of the old leader (depending on whether the txn
has reached the new leader before promotion). Neither result
causes split-brain. The rollback only causes a non-critical error
on the old leader, raised by the bad txn's commit attempt.

Some alternatives were considered. One of the most promising
was to make instances reject txns if they see these txns
coming from an instance having an old Raft term. It would help in
the test provided above, but it wouldn't work in a more complicated
test where a third node gets the bad transaction,
then gets its local term bumped, and then replicates to any other
instance. Others would accept that bad txn, because the sender has
a newer Raft term, even though the txn author is still in the old
term. Tracking the term of a txn's author is impossible in too many
cases to rely on it.

Closes #7253

@TarantoolBot document
Title: New iproto field in applier -> relay ACKs
The applier->relay channel (from replica back to master) is used
only for sending ACKs. Replication data goes the other way
(relay->applier).

These ACKs had 2 fields: `IPROTO_VCLOCK (0x26)` and
`IPROTO_VCLOCK_SYNC (0x5a)`.

Now they have a new field: `IPROTO_TERM (0x53)`. It is an unsigned
number containing `box.info.election.term` of the sender node
(applier, replica).

box: forbid non-string types in key_def.new() · 5215f3f3

Ilya Verbin authored 2 years ago

Currently, if a non-string type is passed to luaT_key_def_set_part,
lua_tolstring returns NULL as type_name, and this NULL pointer is passed to
a string formatting function in diag_set.

Closes #5222

NO_DOC=bugfix

box: strengthen field type check · 2dbaf9c2

Ilya Verbin authored 2 years ago

Don't accept an empty string or leading part of "str" or "num" as a
valid field type.
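The difference between prefix matching and exact matching can be sketched as below. The name table is an illustrative subset, and treating "str" and "num" as accepted aliases is an assumption based on the wording above; a prefix test such as `strncmp(name, entry, strlen(name)) == 0` would also accept "", "s", "st", "n" or "nu", which is the kind of bug being fixed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative subset of field type names, including assumed aliases. */
static const char *const field_type_names[] = {
	"any", "unsigned", "string", "str", "number", "num",
	"integer", "boolean", "scalar",
};

/* Returns the index of the exactly matching name, or -1. */
static int
field_type_by_name(const char *name)
{
	size_t count = sizeof(field_type_names) / sizeof(field_type_names[0]);
	for (size_t i = 0; i < count; i++) {
		/* Exact comparison: no empty strings or partial names. */
		if (strcmp(name, field_type_names[i]) == 0)
			return (int)i;
	}
	return -1;
}
```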

Closes #5940

NO_DOC=Partial field types weren't documented

Co-authored-by: Alexander Turenko <alexander.turenko@tarantool.org>

box: revoke access of guest to LUA function · 815788c8

Aleksandr Lyapunov authored 2 years ago

Since the function is actually an eval, by default there should
be no execute access right to it in the public role.

Closes tarantool/security#14

NO_DOC=bugfix

box: drop 'execute' field from uninitialized box · d960476d

Mergen Imeev authored 2 years ago

Prior to this patch, it was possible to call box.execute() before box
was initialized, i.e. before calling box.cfg(). This, however, caused
box.cfg() to be called automatically, which could be problematic as some
parameters could not be changed after box.cfg() was called. After this
patch, box.execute() will only be available when the box has been
initialized.

Closes #4726

@TarantoolBot document
Title: box.execute() now available only after initialization of box

Previously, it was possible to call box.execute() before the box was
configured, in which case the box was configured automatically, which
could lead to problems with box parameters. Now box.execute() can only
be called after the box has been properly configured.

It is also forbidden to set the language to SQL in a console with an
unconfigured box.

Oct 12, 2022

box: fix creation of CK and FK constraints on the same field · 6e450424

Aleksandr Lyapunov authored 2 years ago

Fix a simple typo that caused the problem.

Closes #7645

NO_DOC=bugfix

Oct 11, 2022

sql: change rules used to determine NULLIF() type · 805cbaa7

Mergen Imeev authored 2 years ago

This patch introduces new rules to determine the type of the NULLIF()
built-in function.

Closes #6990

@TarantoolBot document
Title: New rules to determine type of result of NULLIF

The type of the result of NULLIF() function now matches the type of the
first argument.

sql: change rules used to determine CASE type · 90f64460

Mergen Imeev authored 2 years ago

This patch introduces new rules to determine the type of the CASE operation.

Part of #6990

@TarantoolBot document
Title: New rules to determine type of result of CASE

New rules are applied to determine the type of the CASE operation. If
all values are NULL with no type, or if a bind variable exists among
the possible results, then the type of CASE is ANY. Otherwise, all NULL
values with no type are ignored, and the type of CASE is determined
using the following rules:
1) if all values are of the same type, then the type of CASE is this type;
2) otherwise, if any of the possible results is of one of the
incomparable types, then the type of CASE is ANY;
3) otherwise, if any of the possible results is of one of the
non-numeric types, then the type of CASE is SCALAR;
4) otherwise, if any of the possible results is of type NUMBER, then the
type of CASE is NUMBER;
5) otherwise, if any of the possible results is of type DECIMAL, then
the type of CASE is DECIMAL;
6) otherwise, if any of the possible results is of type DOUBLE, then the
type of CASE is DOUBLE;
7) otherwise the type of CASE is INTEGER.
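The rules above can be transcribed into a small resolver. This is a sketch, not the SQL compiler's code: the enum is a hypothetical subset, and which types count as "incomparable" (ANY, MAP, ARRAY here) or "non-numeric" is an assumption for illustration. `T_NULL` and `T_BIND` are markers for an untyped NULL and a bind variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sql_type {
	T_NULL, T_BIND,                                     /* markers */
	T_ANY, T_MAP, T_ARRAY,                              /* incomparable (assumed) */
	T_BOOLEAN, T_STRING, T_VARBINARY, T_UUID, T_SCALAR, /* non-numeric */
	T_NUMBER, T_DECIMAL, T_DOUBLE, T_INTEGER, T_UNSIGNED, /* numeric */
};

static bool is_incomparable(enum sql_type t) { return t == T_ANY || t == T_MAP || t == T_ARRAY; }
static bool is_numeric(enum sql_type t) { return t >= T_NUMBER; }

static enum sql_type
case_result_type(const enum sql_type *types, size_t n)
{
	enum sql_type first = T_NULL;
	bool all_same = true, incomparable = false, non_numeric = false;
	bool has_number = false, has_decimal = false, has_double = false;
	for (size_t i = 0; i < n; i++) {
		enum sql_type t = types[i];
		if (t == T_BIND)
			return T_ANY;   /* a bind variable makes CASE ANY */
		if (t == T_NULL)
			continue;       /* untyped NULLs are ignored */
		if (first == T_NULL)
			first = t;
		else if (t != first)
			all_same = false;
		incomparable |= is_incomparable(t);
		non_numeric |= !is_numeric(t);
		has_number |= t == T_NUMBER;
		has_decimal |= t == T_DECIMAL;
		has_double |= t == T_DOUBLE;
	}
	if (first == T_NULL) return T_ANY;   /* only untyped NULLs */
	if (all_same) return first;          /* rule 1 */
	if (incomparable) return T_ANY;      /* rule 2 */
	if (non_numeric) return T_SCALAR;    /* rule 3 */
	if (has_number) return T_NUMBER;     /* rule 4 */
	if (has_decimal) return T_DECIMAL;   /* rule 5 */
	if (has_double) return T_DOUBLE;     /* rule 6 */
	return T_INTEGER;                    /* rule 7 */
}
```

For example, INTEGER mixed with DOUBLE yields DOUBLE, and INTEGER mixed with STRING yields SCALAR.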

Oct 06, 2022

replication: make ER_READONLY non-retriable for applier · 09c18907

Serge Petrenko authored 2 years ago

The commit c1c77782 ("replication: fix bootstrap failing with
ER_READONLY") made the applier retry the connection indefinitely upon
receiving an ER_READONLY error on join. At the time of writing that commit,
this was the only way to make join retriable, because there were no retries
in the scope of bootstrap_from_master: the join either succeeded or failed.

Later on, bootstrap_from_master was made retriable in commit
f2ad1dee ("replication: retry join automatically"). Now when
bootstrap_from_master fails, the replica reconnects to all the remote nodes,
thus updating their ballots, chooses a new (possibly different from the
previous attempt) bootstrap leader, and retries booting from it.

The second approach is preferable, and here's why. Imagine
bootstrapping a cluster of 3 nodes, A, B and C, in a full-mesh topology.
B and C connect to all the remote peers almost instantly, and both
independently decide that B will be the bootstrap leader (meaning it
has the smallest uuid among A, B, C).

At the same time, A can't connect to C. B bootstraps the cluster and
joins C. After C is joined, A finally connects to C. Now A can choose a
bootstrap leader. It has B's old ballot (smallest uuid, but not yet
booted) and C's ballot (already booted). This is because C's ballot is
received after the cluster bootstrap, and B's ballot was received earlier
than that. So A believes C is a better bootstrap leader and tries to
boot from it.

A fails to join C, because at the same time C tries to sync with
everyone, including A, and thus stays read-only. Since A retries joining
the same instance over and over again, A and C get stuck forever.

Let's retry ER_READONLY on another level: instead of trying to join
the same bootstrap leader over and over, choose a new bootstrap
leader and boot from it.

In the situation described above, this means that A would try to join
C once, fail due to ER_READONLY, re-fetch new ballots from everyone and
choose B as a join master (now it has the smallest uuid and is booted).
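The retry strategy can be sketched with the scenario above. Everything here is a hypothetical simplification: a ballot has only an integer stand-in for the uuid plus two flags, the leader rule is reduced to "prefer booted nodes, then the smallest uuid", and `rounds[r]` holds the ballots A fetches on attempt `r`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified ballot. */
struct ballot {
	int uuid;          /* stand-in for a real UUID */
	bool is_booted;
	bool is_readonly;
};

/* Prefer booted nodes, then the smallest uuid (simplified rule). */
static int
choose_leader(const struct ballot *b, size_t n)
{
	int best = -1;
	for (size_t i = 0; i < n; i++) {
		if (best < 0 ||
		    (b[i].is_booted && !b[best].is_booted) ||
		    (b[i].is_booted == b[best].is_booted &&
		     b[i].uuid < b[best].uuid))
			best = (int)i;
	}
	return best;
}

/*
 * On a read-only leader (ER_READONLY), do not retry the same node:
 * move on to freshly fetched ballots and choose the leader again.
 */
static int
bootstrap(const struct ballot *const *rounds, size_t n_rounds, size_t n)
{
	for (size_t r = 0; r < n_rounds; r++) {
		int leader = choose_leader(rounds[r], n);
		if (leader >= 0 && !rounds[r][leader].is_readonly)
			return leader; /* join succeeds */
	}
	return -1;
}
```

With B's stale ballot A first picks C, gets a read-only refusal, and on the next round picks B.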

The issue was discovered due to linearizable_test.lua hanging
occasionally with the following output:
NO_WRAP
```
No output during 40 seconds. Will abort after 320 seconds without output. List of workers not reporting the status:
- 059_replication-luatest [replication-luatest/linearizable_test.lua, None] at /tmp/t/059_replication-luatest/linearizable.result:0
[059] replication-luatest/linearizable_test.lua                       [ fail ]
[059] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/linearizable.reject:
[059] TAP version 13
[059] 1..6
[059] # Started on Thu Sep 29 10:30:45 2022
[059] # Starting group: linearizable-read
[059] not ok 1	linearizable-read.test_wait_others
[059] #   ....11.0~entrypoint.531.dev/test/luatest_helpers/server.lua:104: Waiting for "readiness" on server server_1-q7berSRY4Q_E (PID 53608) timed out
[059] #   stack traceback:
[059] #   	....11.0~entrypoint.531.dev/test/luatest_helpers/server.lua:104: in function 'wait_for_readiness'
[059] #   	...11.0~entrypoint.531.dev/test/luatest_helpers/cluster.lua:92: in function 'start'
[059] #   	...t.531.dev/test/replication-luatest/linearizable_test.lua:50: in function <...t.531.dev/test/replication-luatest/linearizable_test.lua:20>
[059] #   	...
[059] #   	[C]: in function 'xpcall'
```
NO_WRAP

Part-of #7737

NO_DOC=bugfix

sql: fix assertion during INDEXED BY · 22c65f96

Mergen Imeev authored 2 years ago

This patch fixes an assertion failure when using INDEXED BY with the
third or any subsequent index of a space.

Closes #5976

NO_DOC=bugfix

sql: fix cursor invalidation · 89057a21

Mergen Imeev authored 2 years ago

If the length of a tuple is greater than the number of fields in the
format, the cursor in the VDBE may be overwritten
with zeros.

Closes #5310

NO_DOC=bugfix

luajit: bump new version · b805d4a3

Igor Munkin authored 2 years ago

* FFI: Always fall back to metamethods for cdata length/concat.
* FFI: Add tonumber() specialization for failed conversions.
* build: introduce LUAJIT_ENABLE_CHECKHOOK option
* Fix overflow check in unpack().
* gdb: refactor iteration over frames while dumping stack
* gdb: adjust to support Python 2 (CentOS 7)

Closes #7458
Closes #7655
Needed for #7762
Part of #7230

NO_DOC=LuaJIT submodule bump
NO_TEST=LuaJIT submodule bump

Oct 04, 2022

memtx: deprecate HASH index 'GT' iterator type · 302d91cf

Georgiy Lebedev authored 2 years ago

For reasons described in #7231, the HASH index 'GT' iterator type is
deprecated: print a deprecation warning exactly once.
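A warn-once check is typically a static flag guarding the message. This is a hypothetical sketch (names, message, and the counter are illustrative; Tarantool would use its own logger) and assumes the check runs in a single thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum iterator_type { ITER_EQ, ITER_GT, ITER_ALL };

static int deprecation_warnings; /* counts printed warnings, for illustration */

/* Prints the deprecation warning only on the first 'GT' use. */
static void
hash_iterator_check(enum iterator_type type)
{
	static bool warned = false;
	if (type == ITER_GT && !warned) {
		warned = true;
		deprecation_warnings++;
		fprintf(stderr, "HASH index 'GT' iterator type is deprecated\n");
	}
}
```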

Closes #7231

@TarantoolBot document
Title: memtx HASH index 'GT' iterator deprecation

memtx HASH index 'GT' iterator is deprecated since Tarantool 2.11
(tarantool/tarantool#7231) and will be removed in a future release of
Tarantool: the user will get a warning when using it.

Sep 30, 2022

memtx: fix loss of committed tuple in secondary index · 7b0baa57

Georgiy Lebedev authored 2 years ago

Concurrent transactions can try to insert tuples that intersect only by
parts of a secondary index. In this case, when one of them gets prepared,
the others get conflicted, but the committed story is not retained
(because, unlike in the primary index, the conflicting statements are not
added to the committed story's delete statement list) and is lost after
garbage collection. Retain stories if there is a newer uncommitted story
in the secondary indexes' history chain.

Closes #7712

NO_DOC=bugfix

Sep 29, 2022

gc: replace vclockset_psearch with _match in wal_collect_garbage_f · c63bfb9a

Serge Petrenko authored 2 years ago

When using vclockset_psearch, the resulting vclock may be incomparable
to the search key. For example, with a vclock set { } (empty vclock),
{0: 1, 1: 10}, {0: 2, 1: 11}, vclockset_psearch(set, {0: 2, 1: 9}) might
return {0: 1, 1: 10}, and not { }.
This is known and avoided in other places, for example
recover_remaining_wals(), where vclockset_match() is used instead.
vclockset_match() starts with the same result as vclockset_psearch() and
then unwinds the result until the first vclock which is less than or equal
to the search key is found.
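The psearch-versus-match difference can be reproduced on the example above. This sketch assumes, as a simplification, that the set is a sorted array ordered by vclock signature (the sum of components) and that vclocks have only two components; it is not Tarantool's vclockset implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VCLOCK_MAX 2 /* tiny for the example */

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

static int64_t
vclock_sum(const struct vclock *v)
{
	int64_t s = 0;
	for (int i = 0; i < VCLOCK_MAX; i++)
		s += v->lsn[i];
	return s;
}

/* Componentwise a <= b: the partial order on vclocks. */
static bool
vclock_le(const struct vclock *a, const struct vclock *b)
{
	for (int i = 0; i < VCLOCK_MAX; i++)
		if (a->lsn[i] > b->lsn[i])
			return false;
	return true;
}

/* psearch-like: last element whose signature does not exceed the key's. */
static int
set_psearch(const struct vclock *set, int n, const struct vclock *key)
{
	int res = -1;
	for (int i = 0; i < n && vclock_sum(&set[i]) <= vclock_sum(key); i++)
		res = i;
	return res;
}

/* match-like: unwind from psearch's result to the first element <= key. */
static int
set_match(const struct vclock *set, int n, const struct vclock *key)
{
	int i = set_psearch(set, n, key);
	while (i >= 0 && !vclock_le(&set[i], key))
		i--;
	return i;
}
```

Any vclock that is componentwise less than or equal to the key also has a signature no greater than the key's, so unwinding from the psearch position cannot skip the correct answer.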

Having vclockset_psearch in wal_collect_garbage_f could lead to issues
even before local space changes became written to 0-th vclock component.
Once a replica subscribes, its gc consumer is set to the vclock which
the replica sent in the subscribe request. This vclock might be incomparable
with xlog vclocks of the master, leading to the same issue of
potentially deleting a needed xlog during gc.

Closes #7584

NO_DOC=bugfix

Sep 28, 2022

console: don't mix stdout/stderr with readline prompt · 66ca6252

Alexander Turenko authored 2 years ago

The idea is borrowed from [1]: hide and save prompt, user's input and
cursor position before writing to stdout/stderr and return everything
back afterwards.

Not every stdout/stderr write is handled this way: only tarantool's
logger (when it writes to stderr) and tarantool's print() Lua function
perform the prompt hide/show actions. For example, an
`io.stdout:write(<...>)` Lua call or a `write(STDOUT_FILENO, <...>)` C
call may mix readline's prompt with actual output. However, the logger
and print() are likely enough for the vast majority of usages.

The readline's interactive search state (usually invoked by Ctrl+R) is
not covered by this patch. Sadly, I didn't find a way to properly save
and restore readline's output in this case.

Implementation details
----------------------

A few words about the allocation strategy. At first glance it may
look worthwhile to pre-allocate a buffer to store the prompt and user's
input data and reallocate it on demand. However, rl_set_prompt() already
performs free() plus malloc() at each call[^1], so avoiding malloc()
on our side would not change the picture much. Moreover, this code
interacts with a human, who is many orders of magnitude slower than
a machine and will not notice a difference. So I decided to keep the
code simpler.

[^1]: Verified on readline 8.1 sources. However, it is worth noting that
      rl_replace_line() keeps the buffer and performs realloc() on
      demand.

The code is organized so that the say and print modules call some
callbacks without knowing their origin and without depending on the
console module (or whatever other module would implement this interaction
with readline). The downside here is that console needs to know all
places to set the callbacks. OTOH, it offers an explicit list of such
callbacks in one place and, on the whole, keeps the relevant code together.

We could redefine the print() function from any place in the code, but I
prefer to make it as explicit as possible, so I added the new internal
print.lua module.

We could redefine _G.print on demand instead of setting callbacks for a
function assigned to _G.print once. The downside here is that if a user
saves/captures the old _G.print value, they'll use the raw print() directly
instead of our replacement. The current implementation seems to be
safer.

Alternatives considered
-----------------------

I guess we could clear readline's prompt and user input manually without
letting readline know that something was changed (and restore the
prompt/user input afterwards). It would save allocations and string
copying, but would likely lean on readline internals too much and repeat
some of its functionality. I considered this option unstable and
declined it.

We can redefine behavior for all writes to stdout and stderr. There are
different ways to do so:

1. Redefine libc's write() with our own implementation, which will call
   the original libc's write()[^2]. It is defined as a weak symbol in
   libc (at least in glibc), so there is no problem doing so.
2. Use pipe(), dup() and dup2() to execute our own code at
   STDOUT_FILENO, STDERR_FILENO writes.

[^2]: There is a good article about pitfalls on this road: [2]. It is
      about LD_PRELOAD, but I guess everything is very similar with
      wrapping libc's function from an executable.

In my opinion, those options are dangerous, because they implicitly
change the behavior of a lot of code, which is unlikely to expect anything
of this kind. The second option (use pipe()) adds more user space/kernel
space context switches and more copying, and would also add a possible
implicit fiber yield at any `write(STD*_FILENO, <...>)` call; it is
unlikely that all user code is ready for that.

Fixes #7169

[1]: https://metacpan.org/dist/AnyEvent-ReadLine-Gnu/source/Gnu.pm
[2]: https://tbrindus.ca/correct-ld-preload-hooking-libc/

NO_DOC=this patch prevents mixing of output streams on a terminal and it
       is what a user actually expects; no reason to describe how bad
       would be his/her life without it

memtx: rework transaction rollback · 56cf737c

Georgiy Lebedev authored 2 years ago

When we roll back a transaction statement, we relink its read trackers
to a newer story in the history chain, if present (6c990a7b), but we do not
handle the case when there is no newer story.

If there is an older story in the history chain, we can relink the
rollbacked story's readers to it, but if the rollbacked story is the
only one left, we need to retain it, because it stores the reader list
needed for conflict resolution. Such stories are distinguished by the
rollbacked flag, and there can be no more than one such story, located
strictly at the end of a given history chain (which means a story can be
fully unlinked from some indexes and present at the end of others).

There are several nuances we need to account for:

Firstly, such rollbacked stories must be impossible to read from an index:
this is ensured by `memtx_tx_story_is_visible`.

Secondly, rollbacked transactions need to be treated as prepared with
stories that have `add_psn == del_psn`, so that they are correctly deleted
during garbage collection.

After this logical change we have the following partially ordered set over
tuple stories:
```
-----------------------------------------------------------------> serialization time
| No more than one | Committed | Prepared | In-progress | One dirty story
| rollbacked story |           |          |             | in index
-----------------------------------------------------------------
```

Closes #7343

NO_DOC=bugfix

core: introduce linearizable transactions · 70bf99c8

Serge Petrenko authored 2 years ago

Linearizability is a property of operations such that an operation
performed on any node sees all the operations performed earlier on any
other node of the cluster.

More strictly, it's a property demanding that if a response
for some write request arrived earlier than some read request was made,
this read request must see the results of that (or any earlier) write
request.

This patch introduces a new transaction isolation level: 'linearizable'.
When the option is set, box.begin() is stalled until the node receives the
latest data from at least one member of the quorum. This is needed to
make sure that the node sees all the writes committed on a quorum.
The transaction is served only after the node sees the relevant data,
thus implementing linearizable semantics.

The node working on a linearizable request uses its relays' vclock sync
mechanism to learn the fresh vclock of remote nodes.
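As an illustration of the waiting condition, a sketch assuming that the vclock obtained from a remote peer is compared componentwise with the local one: the node may proceed once it has applied at least as many rows from every replica ID as the peer had when its vclock was sampled. `SKETCH_VCLOCK_SIZE` and `vclock_covers()` are hypothetical; Tarantool has its own vclock type and comparison helpers, which this does not reproduce:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed vclock size for this sketch only. */
#define SKETCH_VCLOCK_SIZE 4

/*
 * True when `local` has caught up with `target` in every component,
 * i.e. the node has applied at least as many rows from each replica ID
 * as the remote peer had when its vclock was sampled.
 */
static bool
vclock_covers(const int64_t local[SKETCH_VCLOCK_SIZE],
	      const int64_t target[SKETCH_VCLOCK_SIZE])
{
	for (int i = 0; i < SKETCH_VCLOCK_SIZE; i++) {
		if (local[i] < target[i])
			return false;
	}
	return true;
}
```

In this model, `box.begin()` would keep waiting while `vclock_covers(local, peer)` is false.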

Closes #6707

@TarantoolBot document
Title: New transaction isolation level - linearizable

There is a new transaction isolation level - linearizable.
You may call `box.begin` with `txn_isolation = 'linearizable'`, but you
can't set the default transaction isolation level to 'linearizable'.

Linearizable transactions may only perform requests to synchronous,
local or temporary memtx spaces (vinyl engine support will be added
later).

Starting a linearizable transaction requires
`box.cfg.memtx_use_mvcc_engine` to be on.

Note: starting a linearizable transaction requires that the node is the
replication **source** for at least N - Q + 1 remote replicas. Here `N` is
the count of registered nodes in the cluster and `Q` is
`replication_synchro_quorum` value (the same as
`box.info.synchro.quorum`). This is an implementation limitation. For
example, you may start linearizable transactions on any node of a
cluster in full-mesh topology, but you can't perform linearizable
transactions on anonymous replicas, because no one replicates **from**
them.
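Why N - Q + 1: any set of N - Q + 1 remote replicas and any quorum of Q nodes together have N + 1 > N members, so they must share at least one node. Hence at least one downstream replica is guaranteed to belong to whichever quorum confirmed a write. A small C sketch of the check; the function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Minimum number of downstream (remote) replicas needed for linearizable
 * transactions: N - Q + 1, where N is the number of registered nodes and
 * Q is replication_synchro_quorum.
 */
static int
linearizable_min_downstreams(int n_registered, int quorum)
{
	assert(quorum >= 1 && quorum <= n_registered);
	return n_registered - quorum + 1;
}

/* Can a node with `downstreams` remote replicas start such a transaction? */
static bool
can_begin_linearizable(int n_registered, int quorum, int downstreams)
{
	return downstreams >= linearizable_min_downstreams(n_registered, quorum);
}
```

In a 3-node full mesh with Q = 2, every node has 2 downstreams and meets the bound of 2; an anonymous replica has 0 and does not.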

When a transaction is linearizable, it sees the latest changes performed
on the quorum of nodes in the cluster. For example, if you use
linearizable transactions to read data on a replica, such a transaction
will never read stale data: all the committed writes performed on the
master will be seen by the transaction.

Making a transaction linearizable requires waiting until the node
receives all the committed data. If the node can't contact enough
remote peers to determine which data is committed, an error is returned.

Waiting for committed data may time out: if the data isn't received
within the timeout specified by the `timeout` option of `box.begin()`, an
error is returned.

When called with `{txn_isolation = 'linearizable'}`, `box.begin()`
yields until the instance receives enough data from remote peers to be
sure that the transaction is linearizable.

70bf99c8

Sep 26, 2022

xrow: fix crash on nested map/array update ops · 8425ebfc

Vladislav Shpilevoy authored 2 years ago

If an update operation tried to insert a new key into a map or an
array which was created by a previous update operation, then the
process would fail an assertion.

That was because the first operation was stored as a bar update.
The second operation tried to branch it, assuming that the entire JSON
path of the bar update must exist, which was not true for the newly
created part of the path.

The solution is to fall back to branching before the end of the bar
path if the next part of the path can't be found.
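A heavily simplified C sketch of the idea, assuming JSON paths are pre-split into string parts and that the existence of each bar path part in the original document is already known. `branch_depth()` and its parameters are hypothetical and do not mirror Tarantool's xrow update code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * bar_path: JSON path parts of the stored bar update.
 * exists[i]: whether bar_path[i] can be found in the original document
 *            (false for parts that the bar update itself created).
 * new_path: JSON path parts of the next update operation.
 * Returns how many leading parts to descend before branching: stop at
 * the first diverging part or at the first part that can't be found,
 * whichever comes first.
 */
static size_t
branch_depth(const char *const *bar_path, const bool *exists, size_t bar_len,
	     const char *const *new_path, size_t new_len)
{
	size_t i = 0;
	while (i < bar_len && i < new_len && exists[i] &&
	       strcmp(bar_path[i], new_path[i]) == 0)
		i++;
	return i;
}
```

For a bar path `a.b.c` whose parts `b` and `c` were created by the first operation, a second operation on `a.b.d` branches after `a`, not after `a.b`.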

Closes #7705

NO_DOC=bugfix

8425ebfc

Sep 23, 2022

memtx: track `index:random` reads and clarify result · 1b82beb2

Georgiy Lebedev authored 2 years ago

TREE (HASH) indexes implement the `random` method. If the space is empty
from the transaction's perspective, meaning that nothing must be returned,
add gap tracking of the whole range (full scan tracking), since this result
is equivalent to that of `index:select{}`. Otherwise, repeatedly call
`random` and clarify the result until we get a non-empty one.
We do not care about performance here, since all operations in the context
of transaction management currently have O(number of dirty tuples)
complexity.
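A C sketch of the two branches described above, over a hypothetical array-backed index in which `visible[i]` is the already clarified visibility of slot `i` for the current transaction. The struct, its fields and `sketch_index_random()` are illustrative only, and the emptiness check is a plain linear scan chosen for simplicity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical index model, not a Tarantool memtx index. */
struct sketch_index {
	const int *tuples;
	const bool *visible;     /* clarified visibility of each slot */
	size_t size;
	bool full_range_tracked; /* whole-range gap tracker was added */
};

/* Returns a random visible tuple, or NULL if none is visible. */
static const int *
sketch_index_random(struct sketch_index *idx)
{
	size_t visible_count = 0;
	for (size_t i = 0; i < idx->size; i++) {
		if (idx->visible[i])
			visible_count++;
	}
	if (visible_count == 0) {
		/* Empty result is equivalent to select{}: track the full range. */
		idx->full_range_tracked = true;
		return NULL;
	}
	/* Retry until a random slot clarifies to a visible tuple. */
	for (;;) {
		size_t i = (size_t)rand() % idx->size;
		if (idx->visible[i])
			return &idx->tuples[i];
	}
}
```

The retry loop terminates because at least one slot is known to be visible.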

Closes #7670

NO_DOC=bugfix

1b82beb2

memtx: fix TREE index `get` check for part count · bfcd8ca7

Georgiy Lebedev authored 2 years ago

If the TREE index `get` result is empty, the key part count is incorrectly
compared with the tree's `cmp_def->part_count`, though it should be compared
with `cmp_def->unique_part_count`. In fact, we can assume that by the time
we reach the index's `get` method, the part count equals the unique part
count (partial keys are rejected, and `get` is not supported for non-unique
indexes), so replace the check with the correct assertion.
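A C sketch contrasting the old comparison with the asserted invariant. `struct sketch_cmp_def` is a hypothetical stand-in holding just the two counts. The example assumes, presumably as for a unique secondary index whose comparison definition also carries primary key parts, that `part_count` exceeds `unique_part_count`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two key definition fields involved. */
struct sketch_cmp_def {
	uint32_t part_count;
	uint32_t unique_part_count;
};

/* The old comparison: wrong whenever part_count > unique_part_count. */
static bool
old_key_check(const struct sketch_cmp_def *def, uint32_t key_part_count)
{
	return key_part_count == def->part_count;
}

/*
 * The fix: callers already reject partial keys and never call get on a
 * non-unique index, so the condition is an invariant and is asserted.
 */
static void
tree_get_precondition(const struct sketch_cmp_def *def, uint32_t key_part_count)
{
	assert(key_part_count == def->unique_part_count);
	(void)def;            /* silence unused warnings under NDEBUG */
	(void)key_part_count;
}
```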

Closes #7685

NO_DOC=bugfix

bfcd8ca7