  1. Dec 28, 2020
  2. Dec 27, 2020
    • lua: fix running init lua script · a8f3a6cb
      Artem Starshov authored
      When tarantool is launched with the -e flag and the script passed
      after it raises an error, the program hangs. This happens because
      the sched fiber launches a separate fiber for the user's init script
      and starts an auxiliary event loop. The script fiber is supposed to
      stop this loop, but when the script fails, it tries to stop the loop
      before the loop has even started.
      
      Added a flag that tracks whether the loop has started, so that when
      the fiber calls `ev_break()` we can be sure the loop is already
      running.
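      
      A minimal sketch of the idea on top of plain libev (the flag, watcher
      and function names are illustrative, not the actual tarantool
      symbols):
      
        /* Guard ev_break() with a flag that is only set once the
         * auxiliary event loop is really running. */
        #include <ev.h>
        #include <stdbool.h>
        
        static bool aux_loop_started = false;
        static ev_prepare loop_started_watcher;
        
        static void
        loop_started_cb(struct ev_loop *loop, ev_prepare *w, int revents)
        {
                (void)loop; (void)w; (void)revents;
                aux_loop_started = true;  /* ev_run() has been entered */
        }
        
        /* Started by the sched fiber. */
        static void
        start_aux_loop(struct ev_loop *loop)
        {
                ev_prepare_init(&loop_started_watcher, loop_started_cb);
                ev_prepare_start(loop, &loop_started_watcher);
                ev_run(loop, 0);
        }
        
        /* Called by the script fiber when the init script finishes or fails. */
        static void
        stop_aux_loop(struct ev_loop *loop)
        {
                if (aux_loop_started)
                        ev_break(loop, EVBREAK_ALL);
                /* otherwise the break request would be lost, which is
                 * exactly the hang described above */
        }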
      
      Fixes #4983
    • test: update test-run (pass timeouts via env) · a598f3f5
      Alexander Turenko authored
      The following variables now control timeouts (if corresponding command
      line options are not passed): TEST_TIMEOUT, NO_OUTPUT_TIMEOUT,
      REPLICATION_SYNC_TIMEOUT. See [1] for details.
      
      I set the following values in the GitLab CI web interface:
      
      | Variable                 | Value                                                   |
      | ------------------------ | ------------------------------------------------------- |
      | REPLICATION_SYNC_TIMEOUT | 300                                                     |
      | TEST_TIMEOUT             | 310                                                     |
      | NO_OUTPUT_TIMEOUT        | 320                                                     |
      | PRESERVE_ENVVARS         | REPLICATION_SYNC_TIMEOUT,TEST_TIMEOUT,NO_OUTPUT_TIMEOUT |
      
      See packpack change [2] and the commit 'ci: preserve certain environment
      variables' regarding the PRESERVE_ENVVARS variable.
      
      The reason why we need to increase the timeouts comes from the
      following facts:
      
      - We use self-hosted runners to serve GitLab CI jobs. So, the machine
        resources are limited.
      - We run testing with a high level of parallelism to speed it up.
      - We have a bunch of vinyl tests, which use the disk intensively.
      
      Disk accesses may take quite a long time within this infrastructure,
      and the obvious way to work around the problem is to increase the
      timeouts.
      
      In the long term we should scale resources depending on the testing
      needs. We'll try to use GitHub hosted runners or, if we reach some
      limits, set up GitHub runners on the Mail.Ru Cloud Solutions
      infrastructure.
      
      [1]: https://github.com/tarantool/test-run/issues/258
      [2]: https://github.com/packpack/packpack/pull/135
    • ci: preserve certain environment variables · d2f4bd68
      Alexander Turenko authored
      We want to increase testing timeouts for GitLab CI, where we use our own
      runners and observe stalls and high disk pressure when several vinyl
      tests are run in parallel. The idea is to set variables in GitLab CI web
      interface and read them from test-run (see [1]).
      
      First, we need to pass the variables into inner environments. GitLab CI
      jobs run the testing using packpack, Docker or VirtualBox.
      
      Packpack already preserves environment variables that are listed in the
      PRESERVE_ENVVARS variable (see [2]).
      
      This commit passes the variables that are listed in the PRESERVE_ENVVARS
      variable into the Docker and VirtualBox environments. So, all jobs will
      have the given variables in the environment. (Also dropped the unused
      EXTRA_ENV variable.)
      
      The next commit will update the test-run submodule with support of
      setting timeouts using environment variables.
      
      [1]: https://github.com/tarantool/test-run/issues/258
      [2]: https://github.com/packpack/packpack/pull/135
  3. Dec 26, 2020
    • test: remove obvious part in rpm spec for Travis · d9c25b7a
      Alexander V. Tikhonov authored
      Removed the obvious part in the RPM spec for Travis-CI, since it is no
      longer in use.
      
      ---- Comments from @Totktonada ----
      
      This change is a kind of reversion of commit
      d48406d5 ('test: add more tests to
      packaging testing'), which closed #4599.
      
      Here I describe the story: why the change was made and why it is
      reverted now.
      
      We run testing during an RPM package build: it may catch some
      distribution-specific problem. We had reduced the quantity of tests and
      switched to single-thread test execution to keep the testing stable and
      not break package build and deployment due to known fragile tests.
      
      Our CI used to run on Travis CI, but we were in transition to GitLab CI
      to use our own machines and not hit the Travis CI limit of five jobs
      running in parallel.
      
      We moved package builds to GitLab CI, but kept build+deploy jobs on
      Travis CI for a while: GitLab CI was new for us and we wanted to make
      this transition smooth for users of our APT / YUM repositories.
      
      After enabling package builds on GitLab CI, we wanted to enable more
      tests (to catch more problems) and parallel test execution to speed up
      testing (and reduce the amount of time a developer waits for results).
      
      We observed that if we enabled more tests and parallel execution on
      Travis CI, the testing results would become much less stable, so we
      would often have holes in the deployed packages and a red CI.
      
      So, we decided to keep the old way of testing on Travis CI and apply
      all changes (more tests, more parallelism) only to GitLab CI.
      
      We guessed that we had enough machine resources and would be able to do
      some load balancing to overcome flaky failures on our own machines, but
      in fact we picked another approach later (see below).
      
      That's the whole story behind #4599. What has changed since those days?
      
      We moved deployment jobs to GitLab CI[^1] and have now completely
      disabled Travis CI (see #4410 and #4894). All jobs were moved either to
      GitLab CI or straight to GitHub Actions[^2].
      
      We revisited our approach to improving the stability of testing.
      Attempts to do some load balancing while keeping the execution time
      not too large failed: we should increase parallelism for speed, but
      decrease it for stability at the same time. There is no optimal
      balance.
      
      So we decided to track flaky failures in the issue tracker and restart
      a test after a known failure (see details in [1]). This way we don't
      need to exclude tests and disable parallelism in order to get stable
      and fast testing[^3]. At least in theory. We're on the way to verifying
      this guess, but hopefully we'll stick with some adequate defaults that
      will work everywhere[^4].
      
      To sum up, there are several reasons to remove the old workaround, which
      was implemented in the scope of #4599: no Travis CI, no foreseeable
      reasons to exclude tests and reduce parallelism depending on a CI
      provider.
      
      Footnotes:
      
      [^1]: This is a simplification. Travis CI deployment jobs were not
            moved as is. GitLab CI jobs push packages to the new repositories
            backend (#3380). Travis CI jobs were disabled later (as part of
            #4947), after proof that the new infrastructure works fine.
            However, this is another story.
      
      [^2]: Now we're going to use GitHub Actions for all jobs, mainly because
            GitLab CI is poorly integrated with GitHub pull requests (when
            the source branch is in a forked repository).
      
      [^3]: Some work in this direction still has to be done:
      
            First, the 'replication' test suite is still excluded from the
            testing during an RPM package build. It seems we should just
            enable it back; this is tracked by #4798.
      
            Second, there is the issue [2] to get rid of ancient traces of
            the old attempts to keep the testing stable (on the test-run
            side). It'll give us more parallelism in testing.
      
      [^4]: Of course, we investigate flaky failures and fix the code and
            testing problems they reveal. However, this appears to be a
            long-running activity.
      
      References:
      
      [1]: https://github.com/tarantool/test-run/pull/217
      [2]: https://github.com/tarantool/test-run/issues/251
  4. Dec 25, 2020
    • test: integrate with OSS Fuzz · 7680948f
      Sergey Bronnikov authored
      To run Tarantool fuzzers on the OSS Fuzz infrastructure it is necessary
      to pass the $LIB_FUZZING_ENGINE library to the linker and to use the
      external CFLAGS and CXXFLAGS. A full description of how to integrate
      with OSS Fuzz is in [1] and [2].
      
      The patch to the OSS Fuzz repository [3] is ready to merge.
      
      We need to pass the "-fsanitize=fuzzer" options in two places
      (cmake/profile.cmake and test/fuzz/CMakeLists.txt) because:
      
      - cmake/profile.cmake applies to the project source files; the
        -fsanitize=fuzzer-no-link option instruments the project sources for
        fuzzing, but libFuzzer will not replace main() in these files.
      
      - test/fuzz/CMakeLists.txt uses -fsanitize=fuzzer rather than
        -fsanitize=fuzzer-no-link because we want the automatically
        generated main() for each fuzzer.
      
      1. https://google.github.io/oss-fuzz/getting-started/new-project-guide/
      2. https://google.github.io/oss-fuzz/advanced-topics/ideal-integration/
      3. https://github.com/google/oss-fuzz/pull/4723
      
      Closes #1809
    • travis: build tarantool with ENABLE_FUZZER · af126b90
      Sergey Bronnikov authored
      OSS Fuzz has a limited number of runs per day, and currently it is 4
      runs. The ENABLE_FUZZER option is enabled to make sure that building
      the fuzzers is not broken.
      
      Part of #1809
    • test: add corpus to be used with fuzzers · 8c1bb620
      Sergey Bronnikov authored
      Fuzzing tools use evolutionary algorithms. Supplying a seed corpus
      consisting of good sample inputs is one of the best ways to improve a
      fuzz target's coverage. The patch adds corpora that can be used with
      the existing fuzzers. The name of each file in a corpus is the SHA-1
      checksum of its contents.
      
      The corpus with HTTP headers was added from [2] and [3].
      
      1. https://google.github.io/oss-fuzz/getting-started/new-project-guide/
      2. https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
      3. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers
      
      libFuzzer allows minimizing a corpus with the help of the `-merge`
      flag: when `-merge=1` is passed, any corpus inputs from the 2nd, 3rd,
      etc. corpus directories that trigger new code coverage are merged into
      the first corpus directory; minimizing an existing corpus amounts to
      merging it into a fresh empty directory.
      
      All corpora provided in the patch were minimized.
      
      Part of #1809
    • test: add fuzzers and support for fuzzing testing · 2ad7caca
      Sergey Bronnikov authored
      There are a number of bugs related to parsing and encoding/decoding
      data. Examples:
      
      - csv: #2692, #4497
      - uri: #585
      
      One of the effective methods to find such issues is fuzzing testing.
      The patch introduces a CMake flag to enable building fuzzers
      (ENABLE_FUZZER) and adds fuzzers based on libFuzzer [1] to the csv,
      http_parser and uri modules. Note that fuzzers must return only a 0
      exit code; other exit codes are not supported [2].
      
      NOTE: libFuzzer requires the Clang compiler.
      
      1. https://llvm.org/docs/LibFuzzer.html
      2. http://llvm.org/docs/LibFuzzer.html#id22
      
      How-To Use:
      
      $ mkdir build && cd build
      $ cmake -DENABLE_FUZZER=ON \
      	-DENABLE_ASAN=ON \
      	-DCMAKE_BUILD_TYPE=Debug \
      	-DCMAKE_C_COMPILER="/usr/bin/clang" \
      	-DCMAKE_CXX_COMPILER="/usr/bin/clang++" ..
      $ make -j
      $ ./test/fuzz/csv_fuzzer -workers=4 ../test/static/corpus/csv
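      
      For reference, each fuzzer is built around a single
      LLVMFuzzerTestOneInput() entry point; a toy target (illustrative, not
      one of the actual tarantool fuzzers) looks like:
      
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>
        
        /* libFuzzer generates main() and calls this for every input;
         * the function must always return 0. */
        int
        LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
        {
                char buf[64];
                if (size == 0 || size >= sizeof(buf))
                        return 0;
                memcpy(buf, data, size);
                buf[size] = '\0';
                /* here the real fuzzers feed buf to csv/http/uri parsing */
                return 0;
        }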
      
      Part of #1809
    • luacheck: remove unneeded comment · e0417b4d
      Sergey Bronnikov authored
      The serpent module was dropped in commit
      b53cb2ae
      ("console: drop unused serpent module"), but the comment that belonged
      to the module was left in the luacheck config.
    • test: fix box/error · 038c8abe
      Serge Petrenko authored
      Follow-up #5435
    • txn_limbo: ignore CONFIRM/ROLLBACK for a foreign master · cab99888
      Serge Petrenko authored
      We designed the limbo so that it errors on receiving a CONFIRM or
      ROLLBACK for another instance's data. Actually, this error is
      pointless, and even harmful. Here's why:
      
      Imagine you have 3 instances, 1, 2 and 3.
      First 1 writes some synchronous transactions, but dies before writing CONFIRM.
      
      Now 2 has to write CONFIRM instead of 1 to take limbo ownership.
      From now on 2 is the limbo owner, and under a high enough load it
      constantly has some data in the limbo.
      
      Once 1 restarts, it first recovers its xlogs, and fills its limbo with
      its own unconfirmed transactions from the previous run. Now replication
      between 1, 2 and 3 is started and the first thing 1 sees is that 2 and 3
      ack its old transactions. So 1 writes CONFIRM for its own transactions
      even before the same CONFIRM written by 2 reaches it.
      Once the CONFIRM written by 1 is replicated to 2 and 3, they error and
      stop replication, since their limbo contains entries from 2, not from 1.
      Actually, there's no need to error, since it's just a really old CONFIRM
      which has already been processed by both 2 and 3.
      
      So, ignore CONFIRM/ROLLBACK when it references a wrong limbo owner.
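      
      In other words, the handler's check becomes roughly the following
      (a sketch with illustrative types and names, not the exact tarantool
      code):
      
        #include <stdint.h>
        
        struct limbo {
                uint32_t owner_id; /* instance id that owns the queue */
        };
        
        static int
        limbo_process_confirm(struct limbo *limbo, uint32_t origin_id,
                              int64_t lsn)
        {
                if (origin_id != limbo->owner_id) {
                        /* A stale CONFIRM/ROLLBACK from a previous owner:
                         * already applied everywhere, so just ignore it
                         * instead of raising an error. */
                        return 0;
                }
                /* ... confirm or roll back entries up to lsn ... */
                (void)lsn;
                return 0;
        }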
      
      The issue was discovered with test replication/election_qsync_stress.
      
      Follow-up #5435
    • test: fix replication/election_qsync_stress test · bf0fbf3a
      Serge Petrenko authored
      The test involves writing synchronous transactions on one node and
      making other nodes confirm these transactions after its death.
      In order for the test to work properly we need to make sure the old
      node replicates all its transactions to peers before killing it.
      Otherwise, once the node is resurrected, it'll have newer data not
      present on the other nodes, which leads to their vclocks being
      incompatible, no one becoming the new leader, and the test hanging.
      
      Follow-up #5435
    • box: rework clear_synchro_queue to commit everything · 5c7dae44
      Serge Petrenko authored
      
      It is possible that a new leader (elected either via raft or manually or
      via some user-written election algorithm) loses the data that the old
      leader has successfully committed and confirmed.
      
      Imagine such a situation: there are N nodes in a replicaset, the old
      leader, denoted A, tries to apply some synchronous transaction. It is
      written on the leader itself and N/2 other nodes, one of which is B.
      The transaction has thus gathered quorum, N/2 + 1 acks.
      
      Now A writes CONFIRM and commits the transaction, but dies before the
      confirmation reaches any of its followers. B is elected the new leader
      and it sees that A's last transaction is present on N/2 nodes, so it
      doesn't have a quorum (A was one of the N/2 + 1).
      
      The current `clear_synchro_queue()` implementation makes B roll the
      transaction back, leading to a rollback after commit, which is
      unacceptable.
      
      To fix the problem, make `clear_synchro_queue()` wait until all the rows from
      the previous leader gather `replication_synchro_quorum` acks.
      
      In case the quorum isn't achieved within replication_synchro_timeout,
      roll back nothing and wait for the user's intervention.
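      
      Schematically, the new flow looks like the sketch below, with
      hypothetical helpers wait_for_acks() and confirm_all() standing in for
      the real machinery inside box_clear_synchro_queue():
      
        #include <stdbool.h>
        #include <stdint.h>
        
        struct limbo_view {
                int64_t last_lsn;       /* last pending synchro row */
                int64_t confirmed_lsn;  /* already confirmed */
        };
        
        /* Hypothetical helpers: block the current fiber until either the
         * quorum of acks is collected or the timeout expires. */
        extern bool wait_for_acks(struct limbo_view *limbo, int quorum,
                                  double timeout);
        extern void confirm_all(struct limbo_view *limbo);
        
        static int
        clear_synchro_queue_sketch(struct limbo_view *limbo, int quorum,
                                   double timeout)
        {
                if (limbo->last_lsn == limbo->confirmed_lsn)
                        return 0; /* nothing pending */
                if (!wait_for_acks(limbo, quorum, timeout)) {
                        /* Quorum not reached within the timeout: roll back
                         * nothing and leave the queue for the user. */
                        return -1;
                }
                /* Quorum reached: write CONFIRM for the old leader's rows. */
                confirm_all(limbo);
                return 0;
        }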
      
      Closes #5435
      
      Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
    • txn_limbo: introduce txn_limbo_last_synchro_entry method · 618e8269
      Serge Petrenko authored
      It'll be useful for box_clear_synchro_queue rework.
      
      Prerequisite #5435
    • replication: introduce on_ack trigger · 0941aaa1
      Vladislav Shpilevoy authored
      The trigger is fired every time any of the relays notifies the tx
      thread of a change in the replica's known vclock.
      
      The trigger will be used to collect the synchronous transaction quorum
      for the old leader's transactions.
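      
      Conceptually the new hook looks like this (illustrative declarations
      only; the actual implementation lives in tarantool's replication
      code):
      
        #include <stdint.h>
        
        /* What an ack carries: which replica confirmed data up to which
         * vclock/lsn (simplified to a single lsn here). */
        struct ack_event {
                uint32_t replica_id;
                int64_t  acked_lsn;
        };
        
        typedef void (*on_ack_f)(const struct ack_event *event, void *arg);
        
        struct on_ack_trigger {
                on_ack_f run;
                void    *arg;
        };
        
        /* Fired in tx every time a relay reports a new replica vclock;
         * clear_synchro_queue() can register such a trigger to count acks
         * for the old leader's rows. */
        static void
        on_ack_run(struct on_ack_trigger *t, const struct ack_event *event)
        {
                t->run(event, t->arg);
        }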
      
      Part of #5435
    • box: add a single execution guard to clear_synchro_queue · 05f7ff7c
      Serge Petrenko authored
      Clear_synchro_queue isn't meant to be called multiple times on a single
      instance.
      
      Multiple simultaneous invocations of clear_synchro_queue() shouldn't
      hurt now, since clear_synchro_queue() simply exits on an empty limbo,
      but may be harmful in the future, when clear_synchro_queue() is
      reworked.
      
      Prohibit such misuse by introducing an execution guard and raising an
      error once a duplicate invocation is detected.
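      
      A sketch of such a guard (illustrative names; as said above, the real
      code raises an error instead of printing):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        static bool in_clear_synchro_queue = false;
        
        static int
        clear_synchro_queue_guarded(void)
        {
                if (in_clear_synchro_queue) {
                        /* The real code raises an error here. */
                        fprintf(stderr,
                                "clear_synchro_queue() is already running\n");
                        return -1;
                }
                in_clear_synchro_queue = true;
                /* ... wait for the limbo to empty ... */
                in_clear_synchro_queue = false;
                return 0;
        }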
      
      Prerequisite #5435
    • test: remove dead code in Python tests and extra newlines · cdb5f603
      Sergey Bronnikov authored
      Closes #5538
    • test: get rid of iteritems() · 6394a997
      Sergey Bronnikov authored
      For Python 3, PEP 3106 changed the design of the dict builtin and the
      mapping API in general to replace the separate list based and iterator
      based APIs in Python 2 with a merged, memory efficient set and multiset
      view based API. This new style of dict iteration was also added to the
      Python 2.7 dict type as a new set of iteration methods. PEP-0469 [1]
      recommends replacing d.iteritems() with iter(d.items()) to make the
      code compatible with Python 3.
      
      1. https://www.python.org/dev/peps/pep-0469/
      
      Part of #5538
    • test: make strings compatible with Python 3 · 5c24c5ae
      Sergey Bronnikov authored
      The largest change in Python 3 is the handling of strings.
      In Python 2, the str type was used for two different
      kinds of values - text and bytes, whereas in Python 3,
      these are separate and incompatible types.
      The patch converts strings to byte strings where required to make the
      tests compatible with Python 3.
      
      Part of #5538
    • test: make dict.items() compatible with Python 3.x · e97dc044
      Sergey Bronnikov authored
      In Python 2.x calling items() makes a copy of the keys that you can
      iterate over while modifying the dict. This doesn't work in Python 3.x
      because items() returns a view instead of a list and Python 3 raises
      the exception "dictionary changed size during iteration". To work
      around it, one can use list() to force a copy of the keys to be made.
      
      Part of #5538
    • test: convert print to function and make quote usage consistent · a113e43c
      Sergey Bronnikov authored
      - Convert the print statement to a function. In Python 3 'print'
      becomes a function, see [1]. The patch makes 'print' in the regression
      tests compatible with Python 3.
      - According to PEP 8, mixing double quotes and single quotes in a
      project looks inconsistent. The patch makes the quoting of strings
      consistent.
      - Use "format()" instead of "%" everywhere.
      
      1. https://docs.python.org/3/whatsnew/3.0.html#print-is-a-function
      
      Part of #5538
    • feedback_daemon: add operation statistics reporting · 781ae4f4
      Serge Petrenko authored
      Report box.stat().*.total, box.stat.net().*.total and
      box.stat.net().*.current via feedback daemon report.
      Accompany this data with the time when the report was generated so
      that it's possible to calculate RPS from this data on the feedback
      server.
      
      `box.stat().OP_NAME.total` resides in `feedback.stats.box.OP_NAME.total`,
      while `box.stat.net().OP_NAME.total` resides in
      `feedback.stats.net.OP_NAME.total`. The time of report generation is
      located at `feedback.stats.time`.
      
      Closes #5589
  5. Dec 24, 2020
    • crash: report crash data to the feedback server · f132aa9b
      Cyrill Gorcunov authored
      
      We have a feedback server which gathers information about a running
      instance. While general info is enough for now, we may lose precious
      information about crashes (such as the call backtrace which caused the
      issue, the type of build, etc).
      
      In this commit we add support for sending this kind of information to
      the feedback server. Internally we gather the reason for the failure,
      pack it into base64 form and then run another Tarantool instance which
      sends it out.
      
      A typical report might look like
      
       | {
       |   "crashdump": {
       |     "version": "1",
       |     "data": {
       |       "uname": {
       |         "sysname": "Linux",
       |         "release": "5.9.14-100.fc32.x86_64",
       |         "version": "#1 SMP Fri Dec 11 14:30:38 UTC 2020",
       |         "machine": "x86_64"
       |       },
       |       "build": {
       |         "version": "2.7.0-115-g360565efb",
       |         "cmake_type": "Linux-x86_64-Debug"
       |       },
       |       "signal": {
       |         "signo": 11,
       |         "si_code": 0,
       |         "si_addr": "0x3e800004838",
       |         "backtrace": "#0  0x630724 in crash_collect+bf\n...",
       |         "timestamp": "2020-12-23 14:42:10 MSK"
       |       }
       |     }
       |   }
       | }
      
      There is no simple way to test this so I did it manually:
      1) Run instance with
      
      	box.cfg{log_level = 8, feedback_host="127.0.0.1:1500"}
      
      2) Run listener shell as
      
      	while true ; do nc -l -p 1500 -c 'echo -e "HTTP/1.1 200 OK\n\n $(date)"'; done
      
      3) Send SIGSEGV
      
      	kill -11 `pidof tarantool`
      
      Once SIGSEGV is delivered, the crashinfo data is generated and sent
      out. For debugging purposes this data is also printed to the terminal
      at the debug log level.
      
      Closes #5261
      
      Co-developed-by: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
      
      @TarantoolBot document
      Title: Configuration update, allow to disable sending crash information
      
      For better analysis of program crashes, the information associated
      with the crash, such as
      
       - utsname (similar to `uname -a` output except the network name)
       - build information
       - reason for a crash
       - call backtrace
      
      is sent to the feedback server. To disable this, set
      `feedback_crashinfo` to `false`.
    • crash: move fatal signal handling in · a0a443bd
      Cyrill Gorcunov authored
      
      When SIGSEGV or SIGFPE reaches tarantool, we try to gather all
      information related to the crash and print it out to the
      console (well, stderr actually). Still, there is a request
      to not just show this info locally but also send it out to the
      feedback server.
      
      Thus, to keep the gathering of crash-related information in one
      module, we move fatal signal handling into a separate crash.c file.
      This allows us to collect the data we need in one place and reuse it
      when we need to send reports to stderr (and to the feedback server,
      which will be implemented in the next patch).
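      
      A standalone sketch of such a module's shape (a SA_SIGINFO handler
      installed for the fatal signals; illustrative, not the actual
      crash.c):
      
        #include <signal.h>
        #include <string.h>
        #include <unistd.h>
        
        static void
        fatal_signal_cb(int signo, siginfo_t *info, void *context)
        {
                (void)info; (void)context;
                /* collect signo, info->si_code, info->si_addr, backtrace...
                 * using only async-signal-safe calls */
                static const char msg[] = "fatal signal caught\n";
                write(STDERR_FILENO, msg, sizeof(msg) - 1);
                /* restore the default action and re-raise to get a core */
                signal(signo, SIG_DFL);
                raise(signo);
        }
        
        static void
        crash_signal_init(void)
        {
                struct sigaction sa;
                memset(&sa, 0, sizeof(sa));
                sa.sa_flags = SA_SIGINFO;
                sa.sa_sigaction = fatal_signal_cb;
                sigaction(SIGSEGV, &sa, NULL);
                sigaction(SIGFPE, &sa, NULL);
        }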
      
      Part-of #5261
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • backtrace: allow to specify destination buffer · e3503bc2
      Cyrill Gorcunov authored
      
      This will allow reusing this routine in crash reports.
      
      Part-of #5261
      
      Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • util: introduce strlcpy helper · 125d444f
      Cyrill Gorcunov authored
      
      Very convenient to have this string extension.
      We will use it in crash handling.
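      
      For reference, the conventional BSD strlcpy() semantics (copy at most
      size - 1 bytes, always NUL-terminate when size > 0, return strlen(src)
      so truncation can be detected) can be sketched as follows; the
      tarantool helper may differ in details:
      
        #include <stddef.h>
        #include <string.h>
        
        size_t
        strlcpy_sketch(char *dst, const char *src, size_t size)
        {
                size_t src_len = strlen(src);
                if (size != 0) {
                        size_t copy_len =
                                src_len < size ? src_len : size - 1;
                        memcpy(dst, src, copy_len);
                        dst[copy_len] = '\0';
                }
                return src_len; /* >= size means the copy was truncated */
        }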
      
      Acked-by: Serge Petrenko <sergepetrenko@tarantool.org>
      Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
    • lua/key_def: fix compare_with_key() part count check · 37b15af9
      Sergey Nikiforov authored
      Added a corresponding test.
      
      Fixes: #5307
    • update_repo: add Fedora 32 · 78c1de32
      Alexander V. Tikhonov authored
      A Fedora 32 gitlab-ci packaging job was added in commit:
        507c47f7a829581cc53ba3c4bd6a5191d088cdf ("gitlab-ci: add packaging for Fedora 32")
      
      but it also has to be enabled in the update_repo tool to make it
      possible to save packages in S3 buckets.
      
      Follows up #4966