- Apr 07, 2021
-
-
Igor Munkin authored
LuaJIT submodule is bumped to introduce the following changes:
* test: fix dynamic module loading on macOS
* test: make utils.selfrun usage easier
* test: remove excess dependency for tests target
Within this changeset SIP issues are worked around and dynamic module loading on macOS is fixed. As a result, LuaJIT tests can be enabled for the static build target on macOS.
Closes #5959
Follows up #4862
Reviewed-by: Alexander V. Tikhonov <avtikhon@tarantool.org>
Reviewed-by: Sergey Kaplun <skaplun@tarantool.org>
Signed-off-by: Igor Munkin <imun@tarantool.org>
-
Alexander V. Tikhonov authored
Found that running the package removal command from S3:
./tools/update_repo.sh -o=fedora -d=30 -b=s3://tarantool_repo/live/1.10 -r=tarantool-1.10.9.108
it searched for:
Searching to remove: s3://tarantool_repo/live/1.10/fedora/30/SRPMS/Packages/tarantool-1.10.9.108-1.fedora30.src.rpm
while it should have been:
Searching to remove: s3://tarantool_repo/live/1.10/fedora/30/SRPMS/Packages/tarantool-1.10.9.108-1.fc30.src.rpm
that is, with 'fc30' instead of 'fedora30'. This patch fixes it.
-
Alexander V. Tikhonov authored
For self-hosted runners, where the work path with sources is preserved between workflow runs, we need to make sure that all permissions are correct for the running user, so that a later sources checkout by the 'actions/checkout' action, run by the same user, won't fail on sources cleanup. Closes tarantool/tarantool-qa#102
-
Alexander V. Tikhonov authored
The 'coverity' workflow produces results and uploads them to the 'coverity.com' web site. A message about this can be posted in the PR for each completed run. This patch adds the ability to send the message when a PR is available; otherwise the step is skipped. Part of #5644
-
mechanik20051988 authored
The test checks the possibility of recovery with the force_recovery option. For this purpose the snapshot is damaged in the test and recovery is attempted. In the previous version the snapshot size was too small, so sometimes the system data needed for recovery got corrupted too. Also moved common code into functions and deleted one of two identical test cases (in the previous version we wrote different amounts of garbage into the middle of the snapshot, which is meaningless, because the result is almost the same: only the amount of data that can be read from the snapshot differs). Follow-up #5422
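A minimal sketch of the scenario, with a hypothetical snapshot name and offset (the real test derives them from the instance's vardir):
```lua
-- Damage a snapshot past the system data, then check that the
-- instance still recovers with force_recovery. The file name and
-- offset here are hypothetical.
local fio = require('fio')

local snap = fio.open('00000000000000000005.snap', {'O_RDWR'})
snap:seek(65536, 'SEEK_SET')          -- keep the system data intact
snap:write(string.rep('\255', 1024))  -- overwrite a chunk with garbage
snap:close()

-- On restart, recovery reads what it can and skips the damaged tail.
box.cfg{force_recovery = true}
```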
-
Alexander Turenko authored
List of changes in test-run:
* readme: add memcached to users ([PR #285][1])
* test: add integration tests ([PR #273][2])
* test: enable in continuous integration ([PR #273][2])
* Fix slowdown on Python 3 ([PR #290][3])
Updated .luacheckrc to exclude newly added 'core = tarantool' tests (ones that check test-run itself).
[1]: https://github.com/tarantool/test-run/pull/285
[2]: https://github.com/tarantool/test-run/pull/273
[3]: https://github.com/tarantool/test-run/pull/290
-
- Apr 05, 2021
-
-
Vladislav Shpilevoy authored
It was used to recover synchronous auto-commit transactions in an async way (without blocking the fiber). But it became unnecessary once #5874 was fixed, because recovery does not use auto-commit transactions anymore. Closes #5194
-
Vladislav Shpilevoy authored
Recovery used to be performed row by row. That was fine because all the persisted rows are supposed to be committed anyway and should not meet any problems during recovery, so a transaction could be applied partially. But this stopped being true after the introduction of synchronous replication. Synchronous transactions might be in the log, yet followed by a ROLLBACK record which is supposed to delete them. During row-by-row recovery each synchro row was turned into an individual sync transaction, which is probably fine. But the rows of non-sync spaces which were part of a sync transaction could be applied right away, bypassing the limbo, leading to all kinds of errors such as duplicate keys or inconsistency of a partially applied transaction. The patch makes the recovery transactional: either an entire transaction is recovered, or it is rolled back, which normally happens only for synchro transactions followed by ROLLBACK. In force recovery of a broken log the consistency is not guaranteed though. Closes #5874
-
Vladislav Shpilevoy authored
During recovery and xlog replay vinyl skips the statements already stored in runs; indeed, their re-insertion into the mems would otherwise lead to their second dump. But this results in an issue: the recovery transactions in vinyl don't have a write set - their tx->log is empty. On the other hand, they are still added to the write set (xm->writers), probably so as not to have too many "skip if in recovery" checks all over the code. That works fine with single-statement transactions, but would break on multi-statement ones, because the decision whether a transaction needs to be added to the write set was based on the emptiness of the tx's log. During recovery the log is always empty, so the transaction could be added to the write set twice and corrupt its list-link member. The patch bases the decision on the emptiness of the list-link member instead of the log, so it works fine both during recovery and during normal operation. Needed for #5874
-
Serge Petrenko authored
There was a bug in box_process_register: it decoded the replica's vclock but never used it when sending the registration stream. So the replica might lose the data in the range (replica_vclock, start_vclock). Follow-up #5566
-
Serge Petrenko authored
Both box_process_register and box_process_join had guards ensuring that not a single rollback occurred for transactions residing in WAL around the replica's _cluster registration. Both functions would error on a rollback and make the replica retry the final join. The reason was that the replica couldn't process synchronous transactions correctly during final join, because it applied the final join stream row by row. This path of retrying the final join was a dead end: even if the master managed to receive no ROLLBACK messages around the N-th retry of box.space._cluster:insert{}, the replica would still have to receive and process all the data dating back to its first _cluster registration attempt. In other words, the guard against sending synchronous rows to the replica didn't work. Let's remove the guard altogether, since the replica is now capable of processing synchronous txs in the final join stream and even of retrying the final join in case the _cluster registration was rolled back. Closes #5566
-
Serge Petrenko authored
The applier now assembles rows into transactions not only during the subscribe stage, but also during final join / register. This is necessary for correct handling of rolled back synchronous transactions in the final join stream. Part of #5566
-
Serge Petrenko authored
applier->last_row_time is updated in applier_read_tx_row, which is called at least once per subscribe loop iteration. So there's no need for a separate last_row_time update inside the loop body itself. Part of #5566
-
Serge Petrenko authored
Once apply_synchro_row() failed, applier_apply_tx() would simply raise an error without unlocking the replica latch. This led to all the appliers hanging indefinitely on trying to lock the latch for this replica. In scope of #5566
-
Serge Petrenko authored
The new routine, called apply_plain_tx(), may be used not only by applier_apply_tx(), but also by final join, once we make it transactional, and by recovery, once it's also turned transactional. Also, while we're at it, remove the excess fiber_gc() call from the applier_subscribe loop: let's instead make sure fiber_gc() is called on any return from applier_apply_tx(). Prerequisite #5874 Part of #5566
-
Serge Petrenko authored
Introduce a new routine, set_next_tx_row(), which checks for tx boundary violations and appends the new row to the current tx if everything is ok. set_next_tx_row() is extracted from applier_read_tx() because it's a common part of transaction assembly for both recovery and applier. The only difference for recovery is that the routine responsible for tx assembly won't read rows: it'll be a callback run on each new row read from WAL. Prerequisite #5874 Part-of #5566
-
Serge Petrenko authored
Since the introduction of synchronous replication it became possible for the final join to fail on the master side due to not being able to gather acks for some tx around the _cluster registration. The replica receives an error in this case: either ER_SYNC_ROLLBACK or ER_SYNC_QUORUM_TIMEOUT. The errors lead to the applier retrying the final join, but with a wrong state, APPLIER_REGISTER, which should be used only on an anonymous replica. This led to a hang in the fiber executing box.cfg, because it waited for the APPLIER_JOINED state, which was never entered. Part-of #5566
-
Vladislav Shpilevoy authored
In swim Lua code none of the __serialize methods checked the argument type, assuming that nobody would call them directly and mess with the types. But it happened, and it is not hard to fix, so the patch does it. The serialization functions are sanitized for the swim object, swim member, and member event. Closes #5952
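A hedged sketch of the misuse being sanitized (the handler extraction below is illustrative; the patch covers the swim object, member, and event serializers):
```lua
local swim_mod = require('swim')
local s = swim_mod.new()
-- Fetch the __serialize handler and call it with a wrong 'self'.
-- After the patch this raises a clear usage error instead of
-- misbehaving on the unexpected type.
local serialize = getmetatable(s).__serialize
local ok, err = pcall(serialize, 'not a swim object')
assert(not ok)
```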
-
Vladislav Shpilevoy authored
In Lua swim object's method member_by_uuid() could crash if called with no arguments. UUID was then passed as NULL, and dereferenced. The patch makes member_by_uuid() treat NULL like nil UUID and return NULL (member not found). The reason is that swim_member_by_uuid() can't fail. It can only return a member or not. It never sets a diag error. Closes #5951
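A sketch of the fixed behaviour, assuming a configured swim instance:
```lua
local swim_mod = require('swim')
local s = swim_mod.new({uuid = '00000000-0000-0000-0000-000000000001',
                        uri = 0})
-- No argument means nil UUID: before the fix this dereferenced NULL
-- and crashed; now it simply finds no member.
assert(s:member_by_uuid() == nil)
```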
-
Alexander V. Tikhonov authored
We should remove a tag after fetching a remote repository. Ported the following commits: 0f575e01 ('gitlab-ci: fix tag removal for a branch push job'), 0f564f34 ('gitlab-ci: remove tag from pushed branch commit'). Drop a tag that points to the current commit (if any) on a job triggered by pushing a branch (as opposed to pushing a tag). Otherwise we may get two jobs for the same x.y.z-0-gxxxxxxxxx build: one run by pushing a branch and another by pushing a tag. The idea is to hide the new tag from the branch job, as if the tag had been pushed strictly after all branch jobs for the same commit. Closes tarantool/tarantool-qa#103
-
Alexander Turenko authored
The key difference between lbox_encode_tuple_on_gc() and luaT_tuple_encode() is that the latter never raises a Lua error, but passes an error using the diagnostics area. Aside from the tuple leak, the patch fixes a fiber region memory 'leak' (till fiber_gc()). Before the patch, the memory used for serialization of the key was not freed (region_truncate()) when the serialization failed. It is verified in the gh-5388-<...> test. While I'm here, added a test case that just verifies correct behaviour in case of a key serialization failure (added into key_def.test.lua). The case does not verify whether a tuple leaks, and it succeeds before this patch as well as after it. I don't see a simple way to check the tuple leak within a test; verified manually using the reproducer from the linked issue. Fixes #5388
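A hedged sketch of the key serialization failure case, similar in spirit to the case added to key_def.test.lua (the exact key is illustrative):
```lua
local key_def = require('key_def')
local kd = key_def.new({{fieldno = 1, type = 'unsigned'}})
-- A Lua function can't be encoded to msgpack, so serialization of
-- this key fails. Before the patch the failure path leaked a tuple
-- reference and fiber-region memory; now only the error is raised.
local ok, err = pcall(kd.compare_with_key, kd,
                      box.tuple.new({1}), {function() end})
assert(not ok)
```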
-
Alexander V. Tikhonov authored
Got a warning message:
[014] WARGING: unix socket's "/home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var/014_box/gh-5422-broken_snapshot.socket-iproto" path has length 108 symbols that is longer than 107. That likely will cause failing of tests.
It caused the following fail:
[038] Starting instance autobootstrap_guest1...
[038] Start failed: builtin/box/console.lua:865: failed to create server unix/:/home/ubuntu/actions-runner/_work/tarantool/tarantool/test/var/038_replication/autobootstrap_guest1.socket-admin: No buffer space available
To avoid this, use the vardir option in test runs to decrease path lengths. Closes tarantool/tarantool-qa#104
-
Alexander V. Tikhonov authored
Changed the following workflows: luacheck, debug_coverage, release*, static_build, static_build_cmake_linux. The OS in which the tests run was changed from Debian to Ubuntu. Also changed is the way this OS is booted: before the change it was booted as a Docker container using the GitHub Actions 'container' tag from inside the workflows, which caused all the workflow steps to run inside it. After the change no container is run on the host anymore; the GitHub Actions host now runs its native OS set via the 'runs-on' tag. It was decided to use the latest OS, 'ubuntu-20.04', which is already the default for the 'ubuntu-latest' tag. This change gives us the ability to:
- Remove the extra container step in workflows.
- Switch off swap using the 'swapoff' command.
- Use the same OS as GitHub Actions uses by default.
- Set up our local hosts using the GitHub Actions image snapshot.
- Enable use of actions/checkout@v2.3.4, which is better than v1.
- Lightly bootstrap packages in the local .*.mk makefile: for build, libreadline-dev and libunwind-dev; for tests, pip install -r test-run/requirements.txt.
Closes tarantool/tarantool-qa#101
-
- Apr 02, 2021
-
-
Nikita Pettik authored
As a follow-up to the previous patch, let's also check the emptiness of the vylog being removed. During vylog rotation all entries are squashed (e.g. "delete range" annihilates "insert range"), written to the new vylog, and at the end of the new vylog a SNAPSHOT marker is placed. If the last entry in the vylog is SNAPSHOT, we can safely remove it without hesitation, so it is OK to remove it even during a casual recovery process. However, if the vylog contains rows after the SNAPSHOT marker, its removal may cause data loss. In this case we still can remove it, but only in force_recovery mode. Follow-up #5823
-
Nikita Pettik authored
With data in different engines, the checkpoint process is handled this way:
- wait_checkpoint memtx
- wait_checkpoint vinyl
- commit_checkpoint memtx
- commit_checkpoint vinyl
In contrast to commit_checkpoint, which does not tolerate fails (if something goes wrong, e.g. renaming of the snapshot file, the instance simply crashes), wait_checkpoint may fail. As a part of wait_checkpoint for the vinyl engine, vy_log rotation takes place: the old vy_log is closed and a new one is created. At this moment, wait_checkpoint of the memtx engine has already created a new *inprogress* snapshot featuring a bumped vclock. While recovering from this configuration, the vclock of the latest snapshot is used as a reference. At the initial recovery stage (vinyl_engine_begin_initial_recovery), we check that the snapshot's vclock matches the vylog's one (they should be the same, since normally vylog is rotated along with the snapshot). On the other hand, in the directory we have the old snapshot and the new vylog (and the new .inprogress snapshot). In such a situation recovery (even in force mode) was aborted, and the only way out of this dead end was for the user to manually delete the last vy_log file. Let's proceed with the same resolution when the user runs in force_recovery mode: delete the last vy_log file and update the vclock value. If the user uses casual recovery, let's print a verbose message on how to fix this situation manually. Closes #5823
-
Nikita Pettik authored
It is a conditional injection that terminates execution by calling assert(0) if the given condition is true. It is quite useful since it allows us to emulate situations when an instance is suddenly shut down: due to SIGKILL, for example. Needed for #5823
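A hedged usage sketch from the test side (debug builds only; the injection name here is hypothetical):
```lua
-- Arm the conditional injection: once its condition is met inside
-- the server, execution terminates via assert(0), emulating a sudden
-- shutdown (e.g. SIGKILL) at exactly the interesting moment.
box.error.injection.set('ERRINJ_SNAP_COMMIT_FAIL', true)
box.snapshot() -- the instance is expected to die mid-checkpoint
```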
-
Nikita Pettik authored
Needed for #5823
-
Nikita Pettik authored
Needed for #5823
-
Mergen Imeev authored
Prior to this patch a string passed to a user-defined Lua function from SQL was cropped if it contained '\0'. At the same time, it wasn't cropped if it was passed to the function from BOX. After this patch the string won't be cropped when passed from SQL even if it contains '\0'. Closes #5938
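A sketch of the behaviour under test; the function name and body are illustrative:
```lua
-- A user-defined Lua function exported to SQL.
box.schema.func.create('STRLEN', {
    language = 'LUA',
    body = [[function(s) return #s end]],
    param_list = {'string'},
    returns = 'integer',
    exports = {'LUA', 'SQL'},
})
-- CHAR(65, 0, 66) is the string 'A\0B'. Before the fix SQL passed
-- only 'A' (length 1) to the function; now the whole string arrives
-- and the result is 3.
box.execute([[SELECT STRLEN(CHAR(65, 0, 66));]])
```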
-
Mergen Imeev authored
Prior to this patch a string passed to a user-defined C function from SQL was cropped if it contained '\0'. At the same time, it wasn't cropped if it was passed to the function from BOX. Now it isn't cropped when passed from SQL. Part of #5938
-
- Mar 31, 2021
-
-
Alexander V. Tikhonov authored
GitHub Actions provides hosts for Linux-based runners in the following configuration: 2 cores, 7 GB memory, 4 GB swap. To avoid issues with tests hanging or slowing down under high memory use, like [1], host configurations must avoid swap use. All of the test workflows run inside Docker containers. This patch sets memory limits in the docker run configurations based on the current GitHub Actions hosts: 7 GB of memory with no swap increase. Checked 10 full runs (29 workflows in each run used the change) and got a single failed test, on a gevent() routine in test-run. This result is much better than without the patch, when 3-4 workflows failed on each full run. It could happen because swappiness is set to the default value:
cat /sys/fs/cgroup/memory/memory.swappiness
60
From the documentation on swappiness [2]: This control is used to define the rough relative IO cost of swapping and filesystem paging, as a value between 0 and 200. At 100, the VM assumes equal IO cost and will thus apply memory pressure to the page cache and swap-backed pages equally; lower values signify more expensive swap IO, higher values indicates cheaper. Keep in mind that filesystem IO patterns under memory pressure tend to be more efficient than swap's random IO. An optimal value will require experimentation and will also be workload-dependent.
We may try to tune how often anonymous pages are swapped using the swappiness parameter, but our goal is to stabilize timings (and make them as predictable as possible), so the best option is to disable swap entirely and work on decreasing memory consumption for huge tests. For GitHub Actions host configurations with 7 GB of RAM this means that swap began to be used after 2.8 GB of RAM was taken. In testing we have tests that use 2.5 GB of RAM, like 'box/net_msg_max.test.lua', and memory fragmentation could cause swap use after such a test run [3]. Also found that the disk cache could take some RAM, which likewise led to fast memory exhaustion and swapping. It can be dropped from memory periodically [4] using the 'drop_caches' system value, but that won't fix the overall issue with swap use. After freeing cached pages in RAM, another kernel option can be tuned [5][6]: 'vfs_cache_pressure'. This percentage value controls the tendency of the kernel to reclaim the memory used for caching directory and inode objects. Increasing it significantly beyond the default value of 100 may have a negative performance impact: reclaim code needs to take various locks to find freeable directory and inode objects, and with 'vfs_cache_pressure=1000' it will look for ten times more freeable objects than there are. This patch doesn't make that change, but it can be done as a next step.
To fix the issue the following changes were made:
- For jobs that run tests, use actions/environment, and don't use the GitHub Actions 'container' tag, the 'sudo swapoff -a' command was added to the actions/environment action.
- For jobs that run tests and use the GitHub Actions 'container' tag the previous solution doesn't work. It was decided to hard-code the memory value based on the 7 GB memory size found on GitHub Actions hosts. It was set for the GitHub 'container' tag as additional options: options: '--init --memory=7G --memory-swap=7G'. These changes are temporary, until the 'container' tags are removed while resolving the tarantool/tarantool-qa#101 issue, for the workflows: debug_coverage, release, release_asan_clang11, release_clang, release_lto, release_lto_clang11, static_build, static_build_cmake_linux.
- For VMware VMs, like the FreeBSD one, the 'sudo swapoff -a' command was added before the build commands.
- On OSX GitHub Actions hosts swapping is already disabled:
sysctl vm.swapusage
vm.swapusage: total = 0.00M used = 0.00M free = 0.00M (encrypted)
Manually switching swap off there is currently not possible, because System Integrity Protection (SIP) must be disabled first [7], and we don't have such access on GitHub Actions hosts. For local hosts it must be done manually with [8]: sudo nvram boot-args="vm_compressor=2". Added a swap status check to be sure the host is correctly configured: sysctl vm.swapusage.
Closes tarantool/tarantool-qa#99
[1]: https://github.com/tarantool/tarantool-qa/issues/93
[2]: https://github.com/torvalds/linux/blob/1e43c377a79f9189fea8f2711b399d4e8b4e609b/Documentation/admin-guide/sysctl/vm.rst#swappiness
[3]: https://unix.stackexchange.com/questions/2658/why-use-swap-when-there-is-more-than-enough-free-space-in-ram
[4]: https://kubuntu.ru/node/13082
[5]: https://www.kernel.org/doc/Documentation/sysctl/vm.txt
[6]: http://devhead.ru/read/uskorenie-raboty-linux
[7]: https://osxdaily.com/2010/10/08/mac-virtual-memory-swap/
[8]: https://gist.github.com/dan-palmer/3082266#gistcomment-3667471
-
Cyrill Gorcunov authored
Once simple bootstrap is complete and there are no replicas used, we should run with gc unpaused. Part-of #5806
Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
Part-of #5806
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
-
Cyrill Gorcunov authored
In case a replica managed to fall far behind the master node (so that there are a number of xlog files present after the last master snapshot), then once the master node gets restarted it may clean up the xlogs needed by the replica to subscribe in a fast way, and instead the replica will have to rejoin, re-reading a lot of data back. Let's try to address this by delaying xlog file cleanup until replicas get subscribed and relays are up and running. For this sake we start with the cleanup fiber spinning in a nop cycle ("paused" mode) and use a delay counter, waiting until relays decrement it. This implies that if the `_cluster` system space is not empty upon restart and a registered replica has vanished completely and won't ever come back, then the node administrator has to drop this replica from `_cluster` manually. Note that this delayed cleanup start doesn't prevent the WAL engine from removing old files if there is no space left on the storage device: the WAL will simply drop old data without a question. We need to take into account that some administrators might not need this functionality at all; for this sake we introduce the "wal_cleanup_delay" configuration option which allows enabling or disabling the delay. Closes #5806
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
@TarantoolBot document
Title: Add wal_cleanup_delay configuration parameter
The `wal_cleanup_delay` option defines a delay in seconds before write ahead log files (`*.xlog`) start to be pruned after a node restart. This option is ignored if the node is running as an anonymous replica (`replication_anon = true`). Similarly, if replication is unused or there are no plans to use replication at all, this option should not be considered.
The initial problem to solve is the case where a node operates so fast that its replicas do not manage to reach the node's state, and if the node is restarted at this moment (for various reasons, for example due to a power outage), the `*.xlog` files might be pruned during restart. As a result the replicas will not find these files on the main node and will have to re-read all the data back, which is a very expensive procedure.
Since replicas are tracked via the `_cluster` system space, we use its content to count subscribed replicas, and once all of them are up and running the cleanup procedure is enabled automatically even if `wal_cleanup_delay` has not expired yet.
The `wal_cleanup_delay` should be set to:
- `0` to disable the cleanup delay;
- `> 0` to wait for the specified number of seconds.
By default it is set to `14400` seconds (i.e. `4` hours). In case a registered replica is lost forever, the preferred way to enable the cleanup procedure is not to set a small delay value but rather to delete this replica from the `_cluster` space manually.
Note that the option does *not* prevent the WAL engine from removing old `*.xlog` files if there is no space left on the storage device; the WAL engine can remove them forcibly.
The current state of the `*.xlog` garbage collector can be found in the `box.info.gc()` output. For example:
```lua
tarantool> box.info.gc()
---
...
  is_paused: false
```
The `is_paused` field shows whether the cleanup fiber is paused.
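A short usage sketch of the new option:
```lua
-- Give replicas up to an hour after restart to subscribe before old
-- xlogs may be pruned; 0 would disable the delay entirely.
box.cfg{wal_cleanup_delay = 3600}
-- Check whether the cleanup fiber is still paused:
box.info.gc().is_paused
```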
-
- Mar 29, 2021
-
-
Igor Munkin authored
* tools: make memprof parser output user-friendly Closes #5811 Part of #5657
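For context, a minimal profile collection sketch whose dump the parser renders (the misc.memprof API shipped with tarantool's LuaJIT; the file path is arbitrary):
```lua
local memprof = require('misc').memprof
-- Write allocation events into a binary profile.
memprof.start('/tmp/memprof.bin')
local t = {}
for i = 1, 1e5 do t[i] = tostring(i) end
memprof.stop()
-- The binary profile is then rendered offline by the parser whose
-- output this commit makes user-friendly.
```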
-
Igor Munkin authored
* memprof: report stack resizing as internal event Closes #5842 Follows up #5442
-
mechanik20051988 authored
Changed aligned_alloc to posix_memalign, because on some macOS systems the aligned_alloc function is not available.
-
- Mar 26, 2021
-
-
Alexander Turenko authored
Added the `--memtx-allocator <string>` test-run option, which just sets the `MEMTX_ALLOCATOR` environment variable. The variable is available in the testing code (including instance files) and supposed to be used to verify the upcoming `box.cfg{memtx_allocator = <small|system>}` option. Alternatively one can just set the `MEMTX_ALLOCATOR` environment variable manually. Beware: The option does not set the allocator in tarantool automatically in some way. Nope. A test should read the variable and set the box.cfg option. [1]: https://github.com/tarantool/test-run/pull/281 Part of #5419
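A sketch of the intended use in an instance file, applying the upcoming box.cfg option mentioned above:
```lua
-- The test-run option only exports MEMTX_ALLOCATOR; the instance
-- file itself is responsible for applying it.
box.cfg{
    memtx_allocator = os.getenv('MEMTX_ALLOCATOR') or 'small',
}
```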
-
- Mar 25, 2021
-
-
HustonMmmavr authored
* Remove unnecessary `#include "tt_static.h"` from src/ssl_cert_paths_discover.c
* Fix typo at test/app-tap/ssl-cert-paths-discover.test.lua: call `os.exit` instead of `os:exit`
A follow-up on #5615
-
Sergey Ostanevich authored
Resolves #5857
Reviewed-by: Igor Munkin <imun@tarantool.org>
Signed-off-by: Igor Munkin <imun@tarantool.org>
-