- Jun 28, 2018
-
-
Konstantin Osipov authored
A minor follow-up on the fix for gh-3452 (the http.client timeout bug)
-
Ilya Markov authored
The current implementation of http.client relies on a fiber_cond that the requesting fiber waits on after the request has been registered; it doesn't account for the possibility that the response is handled before the fiber starts waiting on the cond. So the following situation is possible:
1. Register the request in libcurl (curl_multi_add_handle in curl_execute).
2. Receive and process the response; fiber_cond_signal is called on a cond var that no one is waiting on.
3. fiber_cond_wait on the already-signaled cond, which sleeps until the timeout fires.
In this case the user has to wait out the timeout even though the data was received earlier. Fix this by adding an extra in_progress flag to the curl_request struct: set the flag to true before registering the request in libcurl, and set it to false, just before fiber_cond_signal, when the request finishes. When the in_progress flag is false, don't wait on the cond variable at all. Add 1 error injection. Closes #3452
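A minimal sketch of that shape in C, with simplified names (the real http.client structures and helpers differ):

```c
#include <stdbool.h>

/* Prototypes of the real fiber_cond API, declared here for brevity. */
struct fiber_cond;
void fiber_cond_signal(struct fiber_cond *cond);
int fiber_cond_wait_timeout(struct fiber_cond *cond, double timeout);

struct curl_request {
	struct fiber_cond *cond;
	bool in_progress;	/* true while libcurl owns the request */
};

/* Response handler: mark completion before signaling. */
void
curl_request_complete(struct curl_request *req)
{
	req->in_progress = false;
	fiber_cond_signal(req->cond);
}

/* Requesting fiber, after curl_multi_add_handle(). Fibers are
 * cooperative, so checking the flag and starting to wait cannot be
 * interleaved with the handler above. */
int
curl_request_wait(struct curl_request *req, double timeout)
{
	if (!req->in_progress)
		return 0;	/* response already handled, don't sleep */
	return fiber_cond_wait_timeout(req->cond, timeout);
}
```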
-
- Jun 27, 2018
-
-
Konstantin Osipov authored
schema_version must be passed to perform_request in 1.9
-
Vladislav Shpilevoy authored
When a connection is closed, some long-poll requests may still be in the TX thread with non-discarded input. If a connection is closed and its input is then discarded, the connection must not try to read new data. The bug was introduced in f4d66dae by me. Closes #3400
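The guard's shape, with hypothetical names (the real iproto code in src/box/iproto.cc tracks connection state differently):

```c
#include <stdbool.h>

/* Hypothetical simplified connection; the real state machine differs. */
struct iproto_connection {
	bool is_closed;
};

void iproto_connection_resume_input(struct iproto_connection *con);

/* Discarding input of a long-poll request must not restart reading
 * on a connection that has been closed in the meantime. */
void
net_discard_input(struct iproto_connection *con)
{
	if (!con->is_closed)
		iproto_connection_resume_input(con);
}
```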
-
- Jun 25, 2018
-
-
Vladimir Davydov authored
If called on a unix socket, bind(2) creates a new file, see unix(7). When we stop a unix tcp server, we should remove that file. Currently, we do it from the tcp server fiber, after the server loop is broken, which happens when the socket is closed, see tcp_server_loop(). This opens a time window for another tcp server to reuse the same path:

  main fiber                                  tcp server loop
  ----------                                  ---------------
  -- Start a tcp server.
  s = socket.tcp_server('unix/', sock_path, ...)
  -- Stop the server.
  s:close()
                                              socket_readable? => no, break loop
  -- Start a new tcp server. Use the same path as before.
  -- This function succeeds, because the socket is closed
  -- so tcp_server_bind_addr() will clean up by itself.
  s = socket.tcp_server('unix/', sock_path, ...)
    tcp_server_bind
      tcp_server_bind_addr
        socket_bind => EADDRINUSE
        tcp_connect => ECONNREFUSED
        -- Remove dead unix socket.
        fio.unlink(addr.port)
        socket_bind => success
                                              -- Deletes unix socket used
                                              -- by the new server.
                                              fio.unlink(addr.port)

In particular, the race results in sporadic failures of the app-tap/console test, which restarts a tcp server using the same file path. To fix this issue, let's close the socket only after removing the socket file. This is absolutely legit on any UNIX system, and it eliminates the race shown above, because a new server that tries to bind on the same path as the one already used by a dying server will not receive ECONNREFUSED until the socket fd is closed, by which point the file has already been removed.

A note about the app-tap/console test. After this patch is applied, socket.close() takes a little longer for a unix tcp server, because it yields twice: once for removing the socket file and once for closing the socket file descriptor. As a result, the on_disconnect() trigger left over from the previous test case has time to run after the session.type() check. Actually, those triggers have already been tested and we should have cleared them before proceeding to the next test case. So instead of adding two new on_disconnect checks to the test plan, let's clear the triggers before the session.type() test case and remove 3 on_connect and 5 on_auth checks from the test plan. Closes #3168
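The C-level essence of the reordering (a sketch; the actual change lives in the Lua socket module):

```c
#include <unistd.h>

/* Remove the socket file while the fd is still open: a concurrent
 * binder keeps connecting successfully (no ECONNREFUSED), so it never
 * unlinks a file it doesn't own; by the time it sees ECONNREFUSED,
 * the file is already gone. */
static void
unix_tcp_server_stop(int fd, const char *sock_path)
{
	unlink(sock_path);	/* first: remove the file created by bind(2) */
	close(fd);		/* then: break the server loop */
}
```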
-
Vladislav Shpilevoy authored
Consider this packet:

  msgpack = require('msgpack')
  data = msgpack.encode(18400000000000000000)..'aaaaaaa'

Tarantool interprets 18400000000000000000 as the size of an incoming iproto request and, without any checks, tries to allocate a buffer of that size. It calculates the needed capacity like this:

  capacity = start_value;
  while (capacity < size)
          capacity *= 2;

Here it is possible that on the i-th iteration 'capacity' < 'size', but 'capacity * 2' overflows 64 bits and becomes < 'size' again, so the loop never ends and occupies 100% CPU. Strictly speaking, the overflow is undefined behavior; on the original system it nullified 'capacity'. Such a size is improbable for a real packet, but it can appear as a result of parsing an invalid packet whose first bytes accidentally happen to be a valid MessagePack uint. This is how the bug emerged on a real system. Let's restrict the maximal packet size to 2GB. Closes #3464
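A sketch of the loop made safe by the cap (identifier names are assumptions, not the actual ones):

```c
#include <assert.h>
#include <stddef.h>

#define IPROTO_PACKET_SIZE_MAX (2ULL * 1024 * 1024 * 1024)	/* 2GB */

static size_t
ibuf_next_capacity(size_t start_value, size_t size)
{
	/* With 'size' capped at 2GB, doubling can no longer wrap
	 * around 64 bits before reaching 'size', so the loop always
	 * terminates. */
	assert(start_value > 0);
	assert(size <= IPROTO_PACKET_SIZE_MAX);
	size_t capacity = start_value;
	while (capacity < size)
		capacity *= 2;
	return capacity;
}
```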
-
- Jun 14, 2018
-
-
Vladimir Davydov authored
Since tuples stored in temporary spaces are never written to disk, we can always delete them immediately, even when a snapshot is in progress. Closes #3432
-
Vladimir Davydov authored
-
- Jun 08, 2018
-
-
Alexander Turenko authored
It fixes the following errors during tarantool installation from packages on Debian / Ubuntu:
```
Unpacking tarantool (1.9.1.23.gacbd91c-1) ...
dpkg: error processing archive /var/cache/apt/archives/tarantool_1.9.1.23.gacbd91c-1_amd64.deb (--unpack):
 trying to overwrite '/lib/systemd/system/tarantool.service', which is also in package tarantool-common 1.9.1.23.gacbd91c-1
```
The problem is that the tarantool.service file was shipped with both the tarantool-common and tarantool packages. It is a regression introduced by 8925b862. The way to avoid installing / enabling the service file within the tarantool package is to pass the `--name` option to dh_systemd_enable but not pass the service file name: in that case dh_systemd_enable does not find the service file and does not enforce its existence. Hopefully there is a less hacky way to do this, but I haven't found one at the moment.
-
Georgy Kirichenko authored
Use the volatile asm modifier to prevent unwanted and awkward optimizations that caused a segfault while backtracing
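For illustration only (not the actual backtrace code): without volatile, GCC may reorder or drop an asm statement whose outputs look unused.

```c
/* x86-64, GCC/Clang extended asm. The volatile qualifier pins the
 * statement in place; an optimizer that moves or elides it would
 * yield a bogus frame pointer and a segfault during unwinding. */
static void *
capture_frame_pointer(void)
{
	void *fp;
	__asm__ volatile ("movq %%rbp, %0" : "=r" (fp));
	return fp;
}
```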
-
- Jun 07, 2018
-
-
Alexander Turenko authored
* added --verbose to show output of successful TAP13 tests (#73)
* allow calling create_cluster() and drop_cluster() multiple times (#83)
* support configurations (*.cfg files) in core = app tests
* added the return_listen_uri = <boolean> option for create_cluster()
* save the tarantool log and print it on failure for core = app tests (#87)
-
Alexander Turenko authored
It is necessary for building on Ubuntu Bionic.
Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=881481
Debhelper commit: https://github.com/Debian/debhelper/commit/740c628a1e571acded7e2aac5d6e7058e61da37f
-
lifemaker authored
-
- Jun 02, 2018
-
-
Konstantin Osipov authored
Fix a compiler warning with clang 6
-
- Jun 01, 2018
-
-
Vladimir Davydov authored
The callback invoked upon compaction completion uses checkpoint_last() to determine whether compacted runs may be deleted: if the max LSN stored in a compacted run (run->dump_lsn) is greater than the LSN of the last checkpoint (gc_lsn), then the run doesn't belong to the last checkpoint and hence is safe to delete, see commit 35db70fa ("vinyl: remove runs not referenced by any checkpoint immediately"). The problem is that checkpoint_last() isn't synced with vylog rotation: it returns the signature of the last successfully created memtx snapshot and is updated in memtx_engine_commit_checkpoint() after vylog is rotated. If a compaction task completes after vylog is rotated but before the snap file is renamed, it will assume that the compacted runs do not belong to the last checkpoint, although they do (as they have been appended to the rotated vylog), and delete them. To eliminate this race, let's use the vylog signature instead of the snap signature in vy_task_compact_complete(). Closes #3437
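A sketch of the changed condition (vy_log_signature() is real; the surrounding names are simplified):

```c
#include <stdbool.h>
#include <stdint.h>

int64_t vy_log_signature(void);	/* signature of the current vylog */

struct vy_run {
	int64_t dump_lsn;	/* max LSN stored in the run */
};

/* A compacted run is safe to delete right away only if it is newer
 * than the last checkpoint. Using the vylog signature instead of the
 * snap signature closes the window between vylog rotation and the
 * snap rename. */
static bool
vy_run_not_in_last_checkpoint(const struct vy_run *run)
{
	return run->dump_lsn > vy_log_signature();
}
```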
-
- May 31, 2018
-
-
Vladimir Davydov authored
latch_destroy() and fiber_cond_destroy() are basically no-ops. All they do is check that the latch/cond is not in use. When a global latch or cond object is destroyed at exit, it may still have users, and this is OK as we don't stop fibers at exit. In vinyl this results in the following false-positive assertion failures:

  src/latch.h:81: latch_destroy: Assertion `l->owner == NULL' failed.
  src/fiber_cond.c:49: fiber_cond_destroy: Assertion `rlist_empty(&c->waiters)' failed.

Remove the "destruction" of vy_log::latch to suppress the first one. Wake up all fibers waiting on vy_quota::cond before destruction to suppress the second one. Add some test cases. Closes #3412
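A sketch of the second fix (fiber_cond_broadcast()/fiber_cond_destroy() are the real API; the vy_quota layout is simplified):

```c
/* Simplified stand-ins; the real declarations live in fiber_cond.h. */
struct fiber_cond { int waiters; };
void fiber_cond_broadcast(struct fiber_cond *cond);
void fiber_cond_destroy(struct fiber_cond *cond);

struct vy_quota {
	struct fiber_cond cond;	/* fibers wait here for memory quota */
};

static void
vy_quota_destroy(struct vy_quota *q)
{
	/* Wake every waiter first, so the rlist_empty(&c->waiters)
	 * assertion in fiber_cond_destroy() holds. */
	fiber_cond_broadcast(&q->cond);
	fiber_cond_destroy(&q->cond);
}
```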
-
- May 29, 2018
-
-
Georgy Kirichenko authored
Handle the case when instance_uuid and replicaset_uuid are present in box.cfg and have the same values as those already set. Fixes #3421
-
- May 25, 2018
-
-
Konstantin Osipov authored
replication: make replication_connect_timeout dynamic
-
Konstantin Osipov authored
-
Vladimir Davydov authored
replicaset_sync() returns not only when the instance has synchronized with the connected replicas, but also when some replicas have disconnected and the quorum can no longer be formed. Nevertheless, it always prints that sync has been completed. Fix it. See #3422
-
Vladimir Davydov authored
If a replica disconnects while sync is in progress, box.cfg{} may stop syncing, leaving the instance in 'orphan' mode. This happens if not enough replicas remain connected to form a quorum. It makes sense e.g. on a network error, but not when a replica is loading, because in the latter case it should be up and running quite soon. Let's count replicas that disconnected only because they haven't completed initial configuration yet, and continue syncing if connected + loading > quorum. Closes #3422
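The condition as a standalone predicate (a sketch of the message's formula, not the actual code):

```c
#include <stdbool.h>

/* A replica that is loading hasn't completed initial configuration
 * yet and should be up shortly, so it still counts towards reaching
 * the quorum. */
static bool
replicaset_keep_syncing(int connected, int loading, int quorum)
{
	return connected + loading > quorum;
}
```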
-
Konstantin Belyavskiy authored
Small refactoring: remove 'enum replica_state', reusing a subset of the applier state machine states instead to check whether we have achieved replication quorum and hence can leave read-only mode.
-
Konstantin Osipov authored
The default of 4 seconds is too low to bootstrap a large cluster.
-
- May 24, 2018
-
-
Georgy Kirichenko authored
In some cases, when applier processing yielded, another applier might start a conflicting operation and break replication and database consistency. Now an applier locks a per-server-id latch before processing a transaction. This guarantees that at any given moment there is at most one applier request in progress per server. The problem was very rare until full-mesh topologies in vinyl became commonplace. Fixes gh-3339
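A minimal sketch of the serialization scheme (latch_lock()/latch_unlock() are real; the rest is simplified):

```c
#include <stdint.h>

/* Simplified stand-ins for the real latch and replica registry. */
struct latch { int locked; };
void latch_lock(struct latch *l);
void latch_unlock(struct latch *l);

struct replica {
	struct latch order_latch;	/* one latch per server id */
};

struct xrow_header;
struct replica *replica_by_id(uint32_t replica_id);
int apply_row(struct xrow_header *row);

/* Even if apply_row() yields, no other applier can start a
 * conflicting operation on behalf of the same server id. */
static int
applier_apply_tx(uint32_t replica_id, struct xrow_header *row)
{
	struct latch *l = &replica_by_id(replica_id)->order_latch;
	latch_lock(l);
	int rc = apply_row(row);
	latch_unlock(l);
	return rc;
}
```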
-
- May 22, 2018
-
-
Konstantin Osipov authored
Avoid goto, a follow-up on gh-3257.
-
Konstantin Belyavskiy authored
Another broken case. Adding a new replica to the cluster:

  + if (replica->applier->remote_is_ro &&
  +     replica->applier->vclock.signature == 0)

In this case we may get an ER_READONLY, since the signature is not 0. So leader election now has two phases:
1. Select among read-write replicas.
2. If no such replica is found, try the old algorithm for backward compatibility (the case when all replicas exist in the cluster table).
Closes #3257
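The two phases as a sketch (helper names are hypothetical):

```c
#include <stddef.h>

struct replica;

struct replica *leader_among_rw_replicas(void);	/* phase 1, hypothetical */
struct replica *leader_by_old_algorithm(void);	/* phase 2, hypothetical */

/* Prefer a read-write replica so bootstrap can't pick a read-only
 * master and fail with ER_READONLY; fall back to the old algorithm
 * for backward compatibility. */
static struct replica *
replicaset_choose_leader(void)
{
	struct replica *leader = leader_among_rw_replicas();
	if (leader != NULL)
		return leader;
	return leader_by_old_algorithm();
}
```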
-
Konstantin Osipov authored
-
- May 17, 2018
-
-
Vladimir Davydov authored
If a compacted run was created after the last checkpoint, it is not needed to recover from any checkpoint and hence can be deleted right away to save disk space. Closes #3407
-
- May 15, 2018
-
-
Vladimir Davydov authored
Improve the test by decreasing range_size so that it creates a lot of ranges for test indexes, not just one. This helped find bugs causing the crash described in #3393. Follow-up #3393
-
Vladimir Davydov authored
Although the bug in vy_task_dump_complete() due to which a tuple could be lost during dump has been fixed, there may still be affected deployments, as the bug's effects were persisted on disk. To avoid occasional crashes on such deployments, let's make vinyl_iterator_secondary_next() skip tuples that are present in a secondary index but missing from the primary one. Closes #3393
-
Vladimir Davydov authored
vy_task_dump_complete() creates a slice for each range overlapping with the newly written run. It uses vy_range_tree_psearch(min_key) to find the first overlapping range and nsearch(max_key) to find the range immediately following the last overlapping range. This is incorrect, as the nsearch rb-tree method returns the element matching the search key if it is present in the tree. That is, if the max key written to a run turns out to be equal to the beginning of a range, no slice will be created for that range and it will be silently and persistently lost. The issue manifests itself as a crash in vinyl_iterator_secondary_next(), when we fail to find the tuple in the primary index corresponding to a statement found in a secondary index. Part of #3393
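One plausible shape of the fix (types and the comparator are stand-ins):

```c
#include <stddef.h>

struct vy_range { const char *begin; };
struct vy_range_tree;

struct vy_range *vy_range_tree_nsearch(struct vy_range_tree *t,
				       const char *key);
struct vy_range *vy_range_tree_next(struct vy_range_tree *t,
				    struct vy_range *r);
int key_compare(const char *a, const char *b);

/* nsearch(max_key) returns the *matching* range when max_key equals
 * its begin, yet such a range still overlaps the run; step over it so
 * it gets a slice instead of being silently dropped. */
static struct vy_range *
first_range_after_run(struct vy_range_tree *t, const char *max_key)
{
	struct vy_range *end = vy_range_tree_nsearch(t, max_key);
	if (end != NULL && key_compare(end->begin, max_key) == 0)
		end = vy_range_tree_next(t, end);
	return end;
}
```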
-
Vladimir Davydov authored
vy_run_iterator_seek() is supposed to check that the resulting statement matches the search key in case of ITER_EQ, but if the search key lies at the beginning of the slice, it doesn't. As a result, vy_point_lookup() may fail to find an existing tuple, as demonstrated below. Suppose we are looking for key {10} in the primary index, which consists of an empty mem and two runs:

  run 1: DELETE{15}
  run 2: INSERT{10}

vy_run_iterator_next() returns DELETE{15} for run 1 because of the missing EQ check, so vy_point_lookup() stops at run 1 (since a terminal statement is found) and mistakenly returns NULL. The issue manifests itself as a crash in vinyl_iterator_secondary_next(), when we fail to find the tuple in the primary index corresponding to a statement found in a secondary index. Part of #3393
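The missing check, sketched with assumed signatures:

```c
#include <stddef.h>

enum iterator_type { ITER_EQ /* , ... */ };

struct tuple;
struct key_def;
int vy_stmt_compare(const struct tuple *a, const struct tuple *b,
		    const struct key_def *key_def);

/* After seek() positions the iterator, an ITER_EQ lookup must verify
 * that the found statement really equals the search key, even at the
 * beginning of a slice; otherwise DELETE{15} answers a lookup for
 * {10}. */
static struct tuple *
run_iterator_check_eq(enum iterator_type type, struct tuple *found,
		      struct tuple *key, const struct key_def *def)
{
	if (type == ITER_EQ && found != NULL &&
	    vy_stmt_compare(found, key, def) != 0)
		return NULL;	/* no match in this run */
	return found;
}
```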
-
Alexander Turenko authored
Follows up #3396.
-
- May 14, 2018
-
-
Alexander Turenko authored
It prevents the result from being rewritten by another thread after coio_call(), but before lua_pushlstring(). Such a case is possible because libeio uses a thread pool internally, and static __thread storage can be reused before lua_pushlstring() if many parallel digest.pbkdf2() calls are in flight. Fixes #3396.
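The gist of the fix, sketched (coio_call() is the real API; the rest is simplified):

```c
#include <stdarg.h>
#include <sys/types.h>

ssize_t coio_call(ssize_t (*func)(va_list ap), ...);	/* real API */

enum { PBKDF2_DIGEST_LEN = 32 };	/* illustrative length */

static ssize_t
pbkdf2_cb(va_list ap)
{
	char *out = va_arg(ap, char *);
	/* ... expensive PBKDF2 computation writes into 'out' ... */
	(void)out;
	return 0;
}

/* Hand the eio worker a caller-owned buffer: a concurrent
 * digest.pbkdf2() on another pool thread then can't clobber the
 * result between coio_call() and lua_pushlstring(). */
static int
pbkdf2_compute(char out[PBKDF2_DIGEST_LEN])
{
	return coio_call(pbkdf2_cb, out) < 0 ? -1 : 0;
}
```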
-
- May 08, 2018
-
-
Ilya Markov authored
In a sequential launch of app-tap/console.test, tests failed with "User exists" and binding errors. Make the socket paths relative. Add user cleanup. Relates #3168
-
- May 07, 2018
-
-
Georgy Kirichenko authored
Any DDL is prohibited in a multistatement transaction, so there is no reason to try to lock the ddl latch in this case. Locking an already locked latch would cause a yield and a silent transaction rollback, which would crash or assert the tarantool server. Fixes #2783
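The guard's shape (box_txn() is real; the latch stub and error helper are hypothetical):

```c
#include <stdbool.h>

struct latch { int locked; };	/* stub for the real schema latch */
void latch_lock(struct latch *l);
bool box_txn(void);		/* is a transaction active? */
void raise_ddl_in_txn_error(void);	/* hypothetical error raise */

static struct latch schema_lock;

/* Reject DDL inside a multistatement transaction *before* touching
 * the latch: locking an already locked latch yields, and a yield
 * silently rolls the transaction back, crashing later code paths. */
static void
ddl_latch_lock(void)
{
	if (box_txn())
		raise_ddl_in_txn_error();
	latch_lock(&schema_lock);
}
```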
-
- May 05, 2018
-
-
Vladislav Shpilevoy authored
Spurious wakeups are possible in the console; they make readline think that there is data on stdin. A woken-up readline returns garbage instead of a string, which crashes the server on an assertion in Lua. Closes #3343
-
- May 03, 2018
-
-
Vladislav Shpilevoy authored
Any option passed to base64 leads to urlsafe encoding. This is wrong and is caused by incorrect flag checking. Fix it. Closes #3358
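The bug class, illustrated (the flag values here are made up):

```c
#include <stdbool.h>

enum {
	BASE64_NOPAD   = 1 << 0,
	BASE64_NOWRAP  = 1 << 1,
	BASE64_URLSAFE = 1 << 2,
};

/* Decide on the urlsafe alphabet by testing the URLSAFE bit itself,
 * not by checking whether *any* option bit happens to be set. */
static bool
use_urlsafe_alphabet(unsigned options)
{
	return (options & BASE64_URLSAFE) != 0;	/* not: options != 0 */
}
```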
-
Konstantin Osipov authored
* rename request_limit.test.lua to net_msg_max.test.lua
* make net_msg_max.test.lua stable (courtesy of @Gerold103)
* exclude disconnect messages from iproto_msg_max limit
* add a separate warning for throttling based on readahead buffer overflow
-
Vladislav Shpilevoy authored
Starting with 1.9, a CALL request which yields releases the input buffer in the net thread before the CALL is complete. A release trigger is fired when the CALL fiber yields. The problem is that by default the input socket is not included in the poll() list of the event loop: thanks to an optimization by @kostja for the strict request/response scenario, the socket is included in the poll() list only after the response is sent to the client. Thus, the following could happen:
* a client sends a long-polling request;
* the request yields and maybe never finishes;
* the socket is not read until the long-polling request finishes.
The patch is to explicitly feed an EV_READ event to the event loop on the client socket whenever we release the input buffer for a long-polling request, as sketched below. We may remove iproto_resume() from net_discard_input() along with this patch, since iproto_resume() will be called by iproto_connection_on_input().
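The kick itself, sketched against libev's real API (the connection layout is simplified):

```c
#include <ev.h>

/* Simplified; the real struct iproto_connection is larger. */
struct iproto_connection {
	struct ev_loop *loop;
	struct ev_io input;	/* watcher for the client socket */
};

/* Feeding EV_READ makes the loop run the input callback even though
 * the socket is not in the poll set while a response is pending. */
static void
iproto_connection_feed_input(struct iproto_connection *con)
{
	ev_feed_event(con->loop, &con->input, EV_READ);
}
```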
-