Ten-year experience in DBMS testing
Posted on February 4, 2022
Author: Sergey Bronnikov
Hi, my name is Sergey Bronnikov, and I work on the Tarantool database. Once I joined, I started taking notes regarding Tarantool development. Now I've decided to rewrite these notes as an article. It might be of interest to C/C++ testers or Tarantool users who want to know how much effort we put into preventing potential issues in new versions.
How SQLite Is Tested by Richard Hipp is a similar article, which is quite popular. But SQLite specifics make it hard to reuse its tools in other projects. This stems from the commitment of the SQLite development team to maintain the library until at least 2050. Hence, they write all the tools from scratch to reduce external dependencies (e.g., the test runner, the mutation testing tool, Fossil SCM). There are no such requirements for us, so we are not limited in our choice of tools and can use anything beneficial. And if any tool appeals to you, you can easily bring it into your C/C++ project. If that's your cup of tea, you should read the entire article.
As you know, testing is part of development. In this article, I will talk about our approach to Tarantool development, which helps us catch the vast majority of bugs before the final release. For us, testing is inseparable from the development itself, and everyone in the team is responsible for quality. I couldn't fit everything into a single article, so I have provided links to other supporting articles at the very end.
Tarantool's core consists of code written entirely by us plus external components and libraries (some of those components and libraries were also written by us). This matters because we test most of the third-party components only indirectly, during integration testing.
In most cases, the external components are of good quality, but there was one exception: the libcurl library, which occasionally caused memory corruption. Therefore, libcurl became a Git submodule in our repository rather than a runtime dependency.
LuaJIT provides Lua language support, including both the language runtime and the tracing JIT compiler. Our LuaJIT has long differed from the vanilla version by a set of patches adding features, such as the profiler, and new tests. That is why we test our fork thoroughly to prevent regressions. The LuaJIT source code is open and distributed under a free license, but it does not include regression tests. Therefore, we have assembled our own regression test suite from the PUC-Rio Lua tests, the test suite by François Perrad, tests for other LuaJIT forks, and, of course, our own tests.
Other external libraries are the following:
MsgPuck to serialize MessagePack data.
libcoro to implement fibers.
libev to provide asynchronous I/O.
c-ares to resolve DNS names asynchronously.
libcurl to work with the HTTP protocol.
icu4c to support Unicode.
OpenSSL for cryptography, libunwind for stack unwinding, and zstd to compress data.
small — our set of specialized memory allocators.
lua-cjson to work with JSON, plus lua-yaml, luarocks, xxHash, PMurHash, and others.
The bulk of the project is written in C, smaller parts in C++ (a total of 36 KLOC) and an even smaller part in Lua (14 KLOC).
Detailed cloc statistics are provided below:
     767 text files.
     758 unique files.
      82 files ignored.

github.com/AlDanial/cloc v 1.82  T=0.78 s (881.9 files/s, 407614.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                              274          12649          40673         123470
C/C++ Header                   287           7467          36555          40328
C++                             38           2627           6923          24269
Lua                             41           1799           2059          14384
yacc                             1            191            342           1359
CMake                           33            192            213           1213
...
-------------------------------------------------------------------------------
SUM:                           688          25079          86968         205933
-------------------------------------------------------------------------------
The other languages are related to the project infrastructure or tests: CMake, Make, Python (we don't use Python anymore to write tests, but some older tests were written in it).
Detailed cloc output regarding languages used in tests is provided below:
    2076 text files.
    2006 unique files.
     851 files ignored.

github.com/AlDanial/cloc v 1.82  T=2.70 s (455.5 files/s, 116365.0 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Lua                            996          31528          46858         194972
C                               89           2536           2520          14937
C++                             21            698            355           4990
Python                          57           1131           1209           4500
C/C++ Header                    11            346            629           1939
SQL                              4            161            120           1174
...
-------------------------------------------------------------------------------
SUM:                          1231          37120          51998         225336
-------------------------------------------------------------------------------
This distribution of programming languages means that testing focuses mainly on identifying problems related to manual memory management (stack overflow, heap buffer overflow, use-after-free, etc.). Our continuous integration system handles this pretty well; I will tell you more about it in the next section.
Continuous integration
We have several branches of Tarantool in development: the main branch (master) and one branch for each version (1.10.x, 2.1.x, 2.2.x, etc.). Minor versions introduce new features, while bug fixes land in all the branches. When merging a branch, we run the entire cycle of regression tests with different compilers and build options. We also build packages for various platforms and do much more; see the details below. Everything runs automatically in a single pipeline, and patches only make it to the main branch after passing the entire pipeline. For now, we apply patches to the main branch manually, but we're aiming for automation.
Currently, we have about 870 integration tests, and they run for 10 minutes in five parallel threads. It may not seem like much, but CI covers different OS families and versions, architectures, and various compilers and build options, so the total testing time can reach half an hour.
We run tests for a large number of operating systems:
Six Ubuntu versions
Three Debian versions
Five Fedora versions
Two CentOS versions
Two OpenSUSE and two FreeBSD versions
Two macOS versions
Some configurations also depend on compiler versions and options. Some platforms (e.g., macOS) are supported only formally and are used mainly by developers. Others, such as FreeBSD, are actively tested, but I haven't heard of any cases of the Tarantool FreeBSD port being used in production. Still others, such as Linux, are widely used in production by Tarantool customers and users.
Therefore, the last group of platforms gets the most attention in development. Running tests on different operating systems improves the project quality: different OS families have different memory allocators and diverse libc implementations, and such variation also helps us find bugs.
The primary architecture is amd64; we recently added support for ARM64, and it is also represented in CI. Running tests on processors with different architectures makes the code more portable by separating platform-dependent and platform-independent code. It helps detect bugs related to byte order (big-endian vs. little-endian), instruction execution speed, differing results of mathematical functions, or such rarities as negative zero. This type of testing also makes it easier to port the code to a new architecture if necessary. LuaJIT is the most platform-dependent part, since it makes heavy use of assembler and generates machine code from Lua code.
Back when there were not so many cloud CI systems, we used Jenkins, just like many other projects. Then Travis CI appeared with its GitHub integration, and we migrated to it. Later our testing matrix grew a lot, and the free version of Travis CI didn't allow us to connect our own servers, so we moved to GitLab CI. Due to integration issues with GitHub pull requests, we gradually migrated to GitHub Actions as soon as it appeared. Now we use it in all our projects (and we have several hundred of them in our GitHub organization).
We use GitHub as our platform for the whole development cycle: task planning, code hosting, and testing new changes. All the testing is done there. For this purpose, we use both our own physical servers and virtual machines in VK Cloud Solutions, as well as the virtual machines provided by GitHub Actions. GitHub is not flawless: sometimes it's unavailable, sometimes it glitches, but it is good value for money.
Portable code has to be tested on different operating systems and architectures.
Code review
Like all civilized projects with a good development culture, we submit every patch for a thorough review by two other developers. The code review procedure is described in this open document. It contains style and self-check guidelines to follow before submitting a patch for review. I won't retell the whole document; I will just list the points related to testing:
Every bug-fixing patch should have a test to reproduce the issue.
Every feature-introducing patch should have one, or better yet, many tests covering the feature.
The test can't pass without the patch.
The test shouldn't be flaky — it has to produce the same result each time it is run.
The test shouldn't be slow, so that the overall test run stays short. Long-running tests are executed with a separate test runner option.
Code review allows us to check changes with another pair of eyes.
Static and dynamic analysis
We use static analysis to maintain a consistent programming style and to search for errors. The code should follow our style guides for Lua, Python, and C. The C style guide is similar to the Linux kernel coding style in many ways, and the Lua style guide follows the default luacheck style, except for some warnings that we usually turn off. This lets us maintain a unified code style and improves readability.
In the CMake build files, we use compiler flags that enable extra checks at build time, and we aim for a clean build with no warnings. Besides the static analysis built into the compilers, we use Coverity. We ran PVS-Studio once, and it detected several non-critical errors in Tarantool itself and in the tarantool-c connector. We have also used cppcheck occasionally, though it didn't find many bugs.
The Tarantool codebase contains a lot of Lua code, and we decided to fix all the warnings luacheck reported. Most of them were programming style violations; it found only four real errors in the source code and a single error in the test code. So if you write in Lua, don't disregard luacheck, and use it from the beginning.
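For illustration, here is a minimal .luacheckrc sketch (the real Tarantool configuration is more elaborate; the globals, paths, and ignored warning codes below are only an assumption for the example):

-- Minimal .luacheckrc sketch; globals, paths, and ignored codes are illustrative.
std = 'luajit'                      -- LuaJIT standard globals
globals = {'box'}                   -- Tarantool built-in global
ignore = {
    '212/self',                     -- unused argument 'self'
}
include_files = {'**/*.lua'}
exclude_files = {'build/**/*.lua'}
max_line_length = 120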
All new changes are tested on builds with dynamic analyzers that detect memory errors in C/C++ (AddressSanitizer) and undefined behavior in C/C++ (UndefinedBehaviorSanitizer). Since these sanitizers affect application performance, the flags enabling them are disabled by default. AddressSanitizer has proved itself well in CI, but its overhead is still too high for canary builds. I tried the Firefox Nightly build back when Mozilla introduced ASan, and it wasn't comfortable to use (let alone a DBMS with strict performance requirements). GWP-ASan, however, has a much smaller overhead, and we are thinking about enabling it in nightly build packages.
While the sanitizers detect issues at the code level, the asserts in our code reveal violations of invariants. Technically, assert() is a macro from the standard C library: it checks the passed expression and aborts the program if the result is zero. There are about 5,000 of these checks; they are enabled only in debug builds and disabled in release builds.
The build system also supports Valgrind, but code runs much slower under it than with the sanitizers, so this configuration is not tested in CI.
Functional regression tests
Since the Lua interpreter is built into Tarantool and the DBMS interface is implemented through the Lua API, using Lua for tests seems quite reasonable. Most of our regression tests are written in Lua using the built-in Tarantool modules. One of them is the TAP module for testing Lua code. It implements a set of primitives to check code and structure tests, a convenient minimum that is enough for testing Lua applications. Many of the modules and applications we develop use only this module for testing. As the name suggests, it reports results in the TAP (Test Anything Protocol) format, probably the oldest test reporting format. Some of the tests are parameterized (e.g., run with two engines), so if we count every configuration, the number of tests grows by half.
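For example, a minimal TAP test with the built-in module might look like this (a sketch for illustration, not an actual test from our suite):

local tap = require('tap')
local test = tap.test('example')
test:plan(3)
test:ok(1 + 1 == 2, 'ok() checks a boolean condition')
test:is(('%d'):format(42), '42', 'is() compares two values')
test:isnil(rawget(_G, 'no_such_global'), 'isnil() checks for nil')
os.exit(test:check() and 0 or 1)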
Most Tarantool functions are available through the Lua API, and the rest can be accessed using the FFI. The FFI is convenient when a C function should not be part of the Lua API but is needed for a test; the main requirement is that the function is not declared static. Here is an example of calling C code from Lua with the FFI (isn't it concise?):
local ffi = require "ffi"
ffi.cdef [[
int printf(const char *fmt, ...);
]]
ffi.C.printf("Hello %s!", "world")
Some parts of Tarantool, such as raft, http_parser, csv, msgpuck, swim, uuid, vclock, and other self-contained libraries, have unit tests. To write them, we use a header-only C library that produces TAP-style output.
We use our own tool, test-run.py, to run tests. Nowadays it may not seem reasonable to write a test runner from scratch, but it already exists, and we maintain it. There are different types of tests in the project; unit tests are written in C and run as standalone binaries. As with the Lua TAP tests, test-run.py parses their TAP-format output to decide whether the test passed:
TAP version 13
1..20
ok 1 - trigger is fired
ok 2 - is not deleted
ok 3 - ctx.member is set
ok 4 - ctx.events is set
ok 5 - self payload is updated
ok 6 - self is set as a member
ok 7 - both version and payload events are presented
ok 8 - suspicion fired a trigger
ok 9 - status suspected
ok 10 - death fired a trigger
ok 11 - status dead
ok 12 - drop fired a trigger
ok 13 - status dropped
ok 14 - dropped member is not presented in the member table
ok 15 - but is in the event context
ok 16 - yielding trigger is fired
ok 17 - non-yielding still is not
ok 18 - trigger is not deleted until all currently sleeping triggers are finished
Some tests compare the actual test output with a reference output: the expected output is saved to a file, and on subsequent runs the actual output is compared against it. This approach is quite popular in SQL tests (both in MySQL and in PostgreSQL): you write the necessary SQL statements, run the script, make sure the output is correct, and save it to a file. You just have to make sure that the input is always deterministic; otherwise, you'll end up with more flaky tests. The output may depend on the operating system locale (NO_LOCALE=1 helps), on error messages, on the time and date in the output, and so on.
We use this approach in SQL and replication tests because it's convenient for debugging: you can paste the test directly into the console and switch between instances. You can experiment interactively and then use that code as a snippet for a ticket or turn it into a test.
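A hedged sketch of such a diff-based test (the file name is hypothetical): the script below would live in, say, sql/count.test.lua, and test-run.py would compare the console output it produces with the recorded sql/count.result file.

-- Hypothetical sql/count.test.lua: the recorded console output becomes the reference file.
box.execute([[CREATE TABLE t (id INT PRIMARY KEY);]])
box.execute([[INSERT INTO t VALUES (1), (2), (3);]])
box.execute([[SELECT COUNT(*) FROM t;]])
box.execute([[DROP TABLE t;]])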
test-run.py allows us to run all types of tests in a uniform way and to generate a single report.
For testing Lua projects, we have a different framework, luatest. It is a fork of another good framework, luaunit. Forking gave us tighter integration with Tarantool (e.g., we added Tarantool-specific fixtures) and let us implement many new features independently of luaunit development: integration with luacov, XFail status support, and so on.
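A minimal luatest group might look like the sketch below (the file, group, and test names are made up; real tests usually start a Tarantool instance in the fixtures):

-- Hypothetical test/example_test.lua
local t = require('luatest')
local g = t.group('example')

g.test_addition = function()
    t.assert_equals(1 + 1, 2)
end

g.test_type_error = function()
    t.assert_error(function() return nil + 1 end)
end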
The history of SQL tests in Tarantool is fascinating. We adopted part of the SQLite code, namely the SQL query parser and the VDBE bytecode engine. One of the main reasons was that the SQLite code has almost 100% test coverage. However, its tests are written in Tcl, which we don't use at all. So we wrote a Tcl-to-Lua converter, ported the tests, cleaned up the resulting code, and imported it into our code base. We still use these tests and add new ones when necessary.
Fault tolerance is one of the requirements for server software. Hence, many of our tests rely on error injection at the Tarantool code level. For this purpose, the source code provides a set of macros and a Lua API to enable them. Suppose, for example, that we want to add an error that emulates a delay when writing to the WAL. We add a line to the ERRINJ_LIST in src/lib/core/errinj.h:
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -151,7 +151,8 @@ struct errinj {
_(ERRINJ_VY_TASK_COMPLETE, ERRINJ_BOOL, {.bparam = false}) \
_(ERRINJ_VY_WRITE_ITERATOR_START_FAIL, ERRINJ_BOOL, {.bparam = false})\
_(ERRINJ_WAL_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \
+ _(ERRINJ_WAL_DELAY, ERRINJ_BOOL, {.bparam = false}) \
_(ERRINJ_WAL_DELAY_COUNTDOWN, ERRINJ_INT, {.iparam = -1}) \
_(ERRINJ_WAL_FALLOCATE, ERRINJ_INT, {.iparam = 0}) \
_(ERRINJ_WAL_IO, ERRINJ_BOOL, {.bparam = false}) \
Then we add this error injection to the code responsible for writing to the WAL:
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -670,7 +670,8 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
}
vclock_copy(&msg->vclock, &writer->vclock);
msg->wal_size = writer->checkpoint_wal_size;
+ ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
return 0;
}
After that, we can enable the delayed logging in a debug build using the following Lua API calls:
$ tarantool
Tarantool 2.8.0-104-ga801f9f35
type 'help' for interactive help
tarantool> box.error.injection.get('ERRINJ_WAL_DELAY')
---
- false
...
tarantool> box.error.injection.set('ERRINJ_WAL_DELAY', true)
---
- true
...
We have added a total of 90 error injections to different parts of Tarantool, and each of them is exercised by at least one functional test.
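To give an idea of how such injections are used, here is a hedged sketch of a functional test built around the ERRINJ_WAL_DELAY example above (the space name and timeouts are made up, and this only works in a debug build):

-- Sketch of an error-injection test; space name and timeouts are illustrative.
local tap = require('tap')
local fiber = require('fiber')

box.cfg{}
local s = box.schema.space.create('test', {if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

local test = tap.test('wal delay injection')
test:plan(2)

box.error.injection.set('ERRINJ_WAL_DELAY', true)
local done = false
fiber.create(function()
    s:replace{1}                      -- blocks while the WAL delay is injected
    done = true
end)
fiber.sleep(0.1)
test:ok(not done, 'write is stuck while the delay is injected')

box.error.injection.set('ERRINJ_WAL_DELAY', false)
while not done do fiber.sleep(0.01) end
test:ok(done, 'write completes after the injection is disabled')

os.exit(test:check() and 0 or 1)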
Ecosystem integration testing
The Tarantool ecosystem consists of a large number of connectors for different programming languages and auxiliary libraries implementing popular architectural patterns (e.g., a cache or a persistent queue). There are also products written in Lua on top of Tarantool: Tarantool DataGrid and Tarantool Cartridge. We test backward compatibility by running the tests of these modules and products against pre-release versions of Tarantool.
Randomized testing
I want to pay special attention to some tests that differ from the standard ones: their data is generated automatically and randomly. There are no such tests in the standard regression suite; they are run separately.
Tarantool's core is written mostly in C, and even careful development doesn't prevent memory management issues such as use-after-free, heap buffer overflow, or NULL pointer dereference. Such issues are utterly undesirable for server software. Fortunately, recent advances in dynamic analysis and fuzz testing make it possible to reduce their number.
I have already mentioned that Tarantool uses third-party libraries. Many of them already use fuzz testing: the curl, c-ares, zstd, and OpenSSL projects are regularly tested in the OSS-Fuzz infrastructure. Tarantool has many places where input gets parsed (e.g., SQL or HTTP requests) or MsgPack gets decoded, and such code may be prone to memory management bugs. The good news is that fuzz testing detects such issues quickly. Tarantool is also integrated with OSS-Fuzz, but there are not many fuzz tests yet, and so far we have found a single bug, in the http_parser library. The number of such tests might eventually grow, and we have detailed instructions for anyone who wants to add a new one.
In 2020, we added support for synchronous replication and MVCC. We had to test this functionality, so we decided to write tests powered by the Jepsen framework: we check consistency by analyzing transaction histories. But the story of testing with Jepsen is big enough for a separate article, so we'll cover it next time.
Load and performance testing
One of the reasons people tend to choose Tarantool is its high performance. It would be strange not to test this feature. We have an informal test of inserting 1 million tuples per second on common hardware. Anyone can run it on their machine and get 1 Mops on Tarantool. This snippet in Lua might be a good benchmark to run with synchronous replication:
sergeyb@pony:~/sources$ tarantool relay-1mops.lua 2
making 1000000 operations, 10 operations per txn using 50 fibers
starting 1 replicas
master done 1000009 ops in time: 1.156930, cpu: 2.701883
master speed 864363 ops/sec
replicas done 1000009 ops in time: 3.263066, cpu: 4.839174
replicas speed 306463 ops/sec
sergeyb@pony:~/sources$
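The real relay-1mops.lua is more involved (it starts a replica and groups operations into transactions), but the general idea can be sketched like this; the fiber count, space name, and output format below are only an assumption for the example:

-- Minimal sketch: insert a million tuples from several fibers and measure throughput.
local fiber = require('fiber')
local clock = require('clock')

box.cfg{}
local s = box.schema.space.create('bench', {if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

local n_fibers, n_ops = 50, 1000000
local per_fiber = n_ops / n_fibers
local done = fiber.channel(n_fibers)

local start = clock.monotonic()
for f = 1, n_fibers do
    fiber.create(function()
        local base = (f - 1) * per_fiber
        for i = 1, per_fiber do
            s:replace{base + i}
        end
        done:put(true)
    end)
end
for _ = 1, n_fibers do done:get() end

local elapsed = clock.monotonic() - start
print(('done %d ops in %.3f sec, %.0f ops/sec'):format(n_ops, elapsed, n_ops / elapsed))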
For performance testing, we also run common benchmarks: the popular YCSB (Yahoo! Cloud Serving Benchmark), NoSQLBench, LinkBench, SysBench, TPC-H, and TPC-C. We also run C Bench, our own Tarantool API benchmark. Its primitive operations are written in C, and the test scenarios are described in Lua.
Metrics
We collect information to evaluate the code coverage of the regression tests. So far, we have covered 83% of all lines and 51% of all branches, which is not bad. We use Coveralls to visualize the covered areas. There is nothing new about collecting C/C++ coverage: instrument the code with the --coverage option, run the tests, and generate a report with gcov and lcov. When it comes to Lua, though, the situation is somewhat worse: there is a primitive profiler, and luacov provides information only about line coverage. It's a little frustrating.
Release checklist
Each new release involves a bunch of different tasks handled by different teams. These tasks include release tagging in the repository, publishing packages and builds, publishing documentation to the website, checking functional and performance testing results, checking for open bugs, triaging for the next milestone, etc. A version release can easily become chaotic, or some steps may be forgotten. To prevent this from happening, we have described the release process in the form of a checklist, and we follow it before releasing a new version.
Conclusion
As the saying goes, there is always room for improvement. Over time, the processes and technologies for detecting bugs improve, the bugs themselves become more complicated, and the more thorough the testing and QA system, the fewer bugs reach users.
Useful links
Video recording and slides of Roberto Ierusalimschy's lecture on testing the Lua interpreter (I highly recommend it!).
How SQLite Is Tested is a popular article on SQLite testing by Richard Hipp.
The Untold Story of SQLite With Richard Hipp is an interview in which Richard Hipp shares his thoughts on SQLite testing, among other things.