Ten-year experience in DBMS testing

Posted on February 4, 2022

Author: Sergey Bronnikov

Hi, my name is Sergey Bronnikov, and I work on the Tarantool database. Once I joined, I started taking notes regarding Tarantool development. Now I've decided to rewrite these notes as an article. It might be of interest to C/C++ testers or Tarantool users who want to know how much effort we put into preventing potential issues in new versions. 

How SQLite Is Tested by Richard Hipp is a similar article, which is quite popular. But SQLite specifics make it hard to reuse its tools in other projects. This stems from the commitment of the SQLite development team to maintain the library until at least 2050. Hence, they write all the tools from scratch to reduce external dependencies (e.g., the test runner, the mutation testing tool, Fossil SCM). There are no such requirements for us, so we are not limited in our choice of tools and can use anything beneficial. And if any tool appeals to you, you can easily bring it into your C/C++ project. If that's your cup of tea, you should read the entire article.

As you know, testing is part of development. In this article, I will talk about our approach to Tarantool development, which helps us catch the vast majority of bugs before the final release. For us, testing is inseparable from the development itself, and everyone in the team is responsible for quality. I couldn't fit everything into a single article, so I have provided links to other supporting articles at the very end.

Tarantool's core consists of code written entirely by us as well as external components and libraries; some of those components and libraries were also written by us. This is important because we test most of the third-party components only indirectly, during integration testing.

In most cases, external components are of good quality, but there was an exception: the libcurl library sometimes caused memory corruption. That is why libcurl became a Git submodule in our repository rather than a runtime dependency.

LuaJIT provides Lua language support, including both the language runtime and the tracing JIT compiler. Our LuaJIT has long differed from the vanilla version by a set of patches adding features, such as the profiler, and new tests. That is why we test our fork thoroughly to prevent regressions. The LuaJIT source code is open and distributed under a free license, but it does not include regression tests. Therefore, we have assembled our own regression suite from the PUC-Rio Lua tests, the test suite by François Perrad, tests for other LuaJIT forks, and, of course, our own tests.

Other external libraries are the following:

  • MsgPuck to serialize MessagePack data.

  • libcoro to implement fibers.

  • libev to provide asynchronous I/O.

  • c-ares to resolve DNS names asynchronously.

  • libcurl to work with the HTTP protocol.

  • icu4c to support Unicode.

  • OpenSSL for cryptography, libunwind for stack unwinding, and zstd to compress data.

  • small — our set of specialized memory allocators.

  • lua-cjson to work with JSON, plus lua-yaml, luarocks, xxHash, PMurHash, and others.

The bulk of the project is written in C, a smaller part in C++ (36 KLOC in total), and an even smaller part in Lua (14 KLOC).


Detailed cloc statistics are provided below:

767 text files.
758 unique files.
82 files ignored.
github.com/AlDanial/cloc v 1.82 T=0.78 s (881.9 files/s, 407614.4 lines/s)

-------------------------------------------------------------------------------
Language      files   blank   comment   code
-------------------------------------------------------------------------------
C             274     12649   40673     123470
C/C++ Header  287     7467    36555     40328
C++           38      2627    6923      24269
Lua           41      1799    2059      14384
yacc          1       191     342       1359
CMake         33      192     213       1213
...
-------------------------------------------------------------------------------
SUM:          688     25079   86968     205933
-------------------------------------------------------------------------------

The other languages are related to the project infrastructure or tests: CMake, Make, and Python (we no longer write tests in Python, but some older tests are still written in it).


Detailed cloc output regarding languages used in tests is provided below:

2076 text files.
2006 unique files.
851 files ignored.

github.com/AlDanial/cloc v 1.82 T=2.70 s (455.5 files/s, 116365.0 lines/s)

-------------------------------------------------------------------------------
Language      files   blank   comment   code
-------------------------------------------------------------------------------
Lua           996     31528   46858     194972
C             89      2536    2520      14937
C++           21      698     355       4990
Python        57      1131    1209      4500
C/C++ Header  11      346     629       1939
SQL           4       161     120       1174
...
-------------------------------------------------------------------------------
SUM:         1231     37120   51998     225336
-------------------------------------------------------------------------------

This distribution of programming languages means that testing focuses mainly on problems related to manual memory management (stack overflow, heap buffer overflow, use-after-free, etc.). Our continuous integration system handles this pretty well; I will tell you more about it in the next section.

Continuous integration

We have several Tarantool branches in development: the main branch (master) and one branch per release series (1.10.x, 2.1.x, 2.2.x, etc.). Minor versions introduce new features, while bug fixes land in all the branches. When merging a branch, we run the entire cycle of regression tests with different compilers and build options, build packages for various platforms, and do much more (see details below). Everything runs automatically in a single pipeline, and patches reach the main branch only after passing the entire pipeline. For now, we apply patches to the main branch manually, but we are aiming for automation.

Currently, we have about 870 integration tests, and a run takes about 10 minutes in five parallel jobs. That may not seem like much, but CI covers different OS families and versions, architectures, and various compilers with different options, so the total testing time can reach half an hour.

We run tests for a large number of operating systems: 

  • Six Ubuntu versions

  • Three Debian versions

  • Five Fedora versions 

  • Two CentOS versions

  • Two versions each of OpenSUSE and FreeBSD

  • Two macOS versions 

Some configurations also depend on compiler versions and options. Some platforms are supported only formally (e.g., macOS) and are used mainly by developers. Others, such as FreeBSD, are actively tested, although I haven't heard of the Tarantool FreeBSD port being used in production. Still others, such as Linux, are widely used in production by Tarantool customers and users.

Naturally, those last platforms get the most attention in development. Running tests on different operating systems also improves project quality: different OS families ship different memory allocators and may have different libc implementations, and such variations help us find bugs.

The primary architecture is amd64; we recently added support for ARM64, and it is also represented in CI. Running tests on processors with different architectures makes the code more portable by separating platform-dependent and platform-independent code. It helps detect bugs related to byte order (big-endian vs. little-endian), instruction execution speed, differing results of mathematical functions, or such rarities as negative zero. This kind of testing also makes it easier to port the code to a new architecture if necessary. LuaJIT is the most platform-dependent component, since it contains a lot of assembly and generates machine code from Lua code.

Back when there were not so many cloud CI systems, we used Jenkins, just like many other projects. Then Travis CI appeared with its GitHub integration, and we migrated to it. As our testing matrix grew, the free tier of Travis CI did not allow us to connect our own servers, so we moved to GitLab CI. Then, due to integration issues with GitHub pull requests, we gradually migrated to GitHub Actions as soon as it appeared. Now we use it in all our projects, and there are several hundred of them in our GitHub organization.

We use GitHub as our platform for the whole development cycle: task scheduling, the code repository, and testing new changes. All testing is done there. For this purpose, we use both our own physical servers and virtual machines in VK Cloud Solutions, as well as the virtual machines provided by GitHub Actions. GitHub is not flawless: sometimes it is unavailable, sometimes it glitches, but it is good value for money.

Portable code has to be tested on different operating systems and architectures. 

Code review

Like all civilized projects with a good development culture, we submit every patch for a thorough review by two other developers. The code review procedure is described in this open document, which contains style and self-check guidelines to follow before submitting a patch for review. I won't retell the whole document; here are just the points related to testing:

  • Every bug-fixing patch should have a test to reproduce the issue.

  • Every feature-introducing patch should have one, or better yet, many tests covering the feature.

  • The test can't pass without the patch.

  • The test shouldn't be flaky — it has to produce the same result each time it is run.

  • The test shouldn't be slow, so the overall test run stays short. Long-running tests are executed with a separate test runner option.

Code review allows us to check changes with another pair of eyes.

Static and dynamic analysis

We use static analysis to maintain a consistent programming style and to search for errors. The code should follow our style guides for Lua, Python, and C. The C style guide is similar to the Linux kernel coding style in many ways, and the Lua style guide follows the default luacheck style, except for some warnings that we turn off. This keeps the code style unified and improves readability.
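For illustration, here is roughly what a luacheck configuration with some warnings turned off looks like. This is a hypothetical .luacheckrc sketch, not Tarantool's actual configuration:

-- hypothetical .luacheckrc sketch, not Tarantool's actual configuration
std = "luajit"                 -- LuaJIT standard globals
read_globals = {"box"}         -- Tarantool's built-in module is a known global
ignore = {
    "212/self",                -- unused argument 'self'
    "431",                     -- shadowing an upvalue
}
exclude_files = {"build/**/*.lua", "third_party/**/*.lua"}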

In the CMake build files, we use compiler flags that enable extra checks at build time, and we keep the build free of compiler warnings. Besides the static analysis built into compilers, we use Coverity. We tried PVS-Studio once, and it detected several non-critical errors in Tarantool itself and in the tarantool-c connector. We have also occasionally run cppcheck, though it did not find many bugs.

The Tarantool codebase contains a lot of Lua code, and we decided to fix all the warnings luacheck reported. Most of them were programming style violations; it found only four real errors in the source code and a single error in the test code. So if you're writing in Lua, don't disregard luacheck; use it from the beginning.

All new changes are also tested in builds with dynamic analyzers that detect memory errors in C/C++ (AddressSanitizer) and undefined behavior in C/C++ (UndefinedBehaviorSanitizer). Since these analyzers affect application performance, the flags that enable them are disabled by default. AddressSanitizer has proved itself well in CI, but its overhead is still considerable for canary builds: I tried the Firefox Nightly build when Mozilla introduced ASan, and it wasn't comfortable to use, let alone a DBMS with high performance requirements. GWP-ASan, however, has a much smaller overhead, and we are thinking about enabling it in nightly build packages.

While the sanitizers detect issues at the code level, the asserts in our code reveal invariant violations. Technically, assert() is a macro from the standard C library: it checks the passed expression and terminates the process if the result is zero. There are about 5,000 of these checks; they are enabled only in debug builds and disabled in release builds.

The build system also supports Valgrind, but code runs much more slowly under it than with the sanitizers, so this build is not tested in CI.

Functional regression tests

Since the Lua interpreter is built into Tarantool and the DBMS interface is implemented through the Lua API, using Lua for tests is quite natural. Most of our regression tests are written in Lua using built-in Tarantool modules. One of them is the tap module for testing Lua code. It implements a set of primitives to check code and structure tests, a convenient minimum that is enough for testing Lua applications; many of the modules and applications we make use only this module for testing. As the name suggests, it reports results in the TAP (Test Anything Protocol) format, probably the oldest test reporting format. Some of the tests are parameterized (e.g., run with two storage engines), so if we count every configuration, the number of tests is about one and a half times larger.
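For example, a minimal TAP-style test might look like this (the space name and the checks are made up for illustration):

-- a minimal sketch of a test using Tarantool's built-in tap module
local tap = require('tap')
local test = tap.test('demo space basics')   -- hypothetical test name

box.cfg{}
local s = box.schema.space.create('demo', {if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

test:plan(2)                                 -- declare the number of checks
s:replace({1, 'hello'})
test:is(s:count(), 1, 'one tuple is stored')
test:is_deeply(s:get(1):totable(), {1, 'hello'}, 'the tuple round-trips')

os.exit(test:check() and 0 or 1)             -- non-zero exit code on failure

Running it with tarantool prints the familiar ok/not ok lines in TAP format.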

Most Tarantool functions are available through the Lua API, and the rest can be reached via the FFI. The FFI is convenient when a C function should not be part of the Lua API but is still needed for a test; the only requirement is that the function is not declared static. Here is an example of calling C code from Lua with the FFI (isn't it concise?):

local ffi = require "ffi"

ffi.cdef [[
int printf(const char *fmt, ...);
]]
ffi.C.printf("Hello %s!", "world")

Some parts of Tarantool, such as raft, http_parser, csv, msgpuck, swim, uuid, vclock, and other self-contained libraries, have unit tests. To write them, we use a header-only C library in the TAP-test style.

We use our own tool, test-run.py, to run tests. Nowadays, writing a test runner from scratch may not seem reasonable, but it already exists, and we maintain it. The project has different types of tests; unit tests are written in C and run as binaries. As with the TAP tests, test-run.py parses their TAP-format output to determine whether the test script succeeded:

TAP version 13
1..20
ok 1 - trigger is fired
ok 2 - is not deleted
ok 3 - ctx.member is set
ok 4 - ctx.events is set
ok 5 - self payload is updated
ok 6 - self is set as a member
ok 7 - both version and payload events are presented
ok 8 - suspicion fired a trigger
ok 9 - status suspected
ok 10 - death fired a trigger
ok 11 - status dead
ok 12 - drop fired a trigger
ok 13 - status dropped
ok 14 - dropped member is not presented in the member table
ok 15 - but is in the event context
ok 16 - yielding trigger is fired
ok 17 - non-yielding still is not
ok 18 - trigger is not deleted until all currently sleeping triggers are finished

Some tests compare the actual test output with a reference output: the expected output is saved to a file, and on every run the actual output is compared with it. This approach is quite popular for SQL tests (both in MySQL and PostgreSQL): you write the necessary SQL statements, run the script, make sure the output is correct, and save it to a file. You just have to keep the output deterministic; otherwise, you will end up with flaky tests. The output may depend on the operating system locale (NO_LOCALE=1 helps), on error messages, on the time and date in the output, and so on.

We use this approach in the SQL and replication tests because it is convenient for debugging: you can paste statements directly into the console and switch between instances, experiment interactively, and then use this code as a snippet for a ticket or turn it into a test.
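As an illustration, a reference-output SQL test might look like the following; the file names are hypothetical, and box.execute() is the Lua entry point for SQL statements:

-- hypothetical sql/demo.test.lua: the console output of these statements is
-- stored in sql/demo.result and compared with the actual output on every run
box.execute([[CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT);]])
box.execute([[INSERT INTO t VALUES (1, 'a'), (2, 'b');]])
box.execute([[SELECT COUNT(*) FROM t;]])
box.execute([[DROP TABLE t;]])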

test-run.py lets us run all types of tests in a uniform way and generate a single report.

For testing Lua projects, we have a separate framework, luatest. It started as a fork of another good framework, luaunit. Forking gave us tighter integration with Tarantool (e.g., we added Tarantool-specific fixtures) and let us implement many new features independently of luaunit development: integration with luacov, XFail status support, and so on.
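Here is a small sketch of what a luatest-based test looks like; the group name and the checks are hypothetical:

-- a minimal sketch of a luatest test (hypothetical group and checks)
local t = require('luatest')
local g = t.group('demo')

g.before_each(function()
    -- per-test setup goes here
end)

g.test_addition = function()
    t.assert_equals(1 + 1, 2)
end

g.test_error_message = function()
    local ok, err = pcall(error, 'boom')
    t.assert_not(ok)
    t.assert_str_contains(tostring(err), 'boom')
end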

The history of SQL tests in Tarantool is fascinating. We adopted part of the SQLite code, namely the SQL query parser and the VDBE bytecode compiler. One of the main reasons was that the SQLite code has almost 100% test coverage. However, the tests were written in Tcl, which we don't use at all, so we had to write a Tcl-to-Lua converter, port the tests, and import them into the codebase after cleaning up the resulting code. We still use these tests and add new ones when necessary.

Fault tolerance is one of the requirements for server software, so many of our tests use error injection at the Tarantool code level. For this purpose, the source code has a set of macros and a Lua API to enable them. For example, suppose we want to add an injection that emulates a delay when writing to the WAL. We add a line to the ERRINJ_LIST in src/lib/core/errinj.h:

--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -151,6 +151,7 @@ struct errinj {
 	_(ERRINJ_VY_TASK_COMPLETE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_VY_WRITE_ITERATOR_START_FAIL, ERRINJ_BOOL, {.bparam = false})\
 	_(ERRINJ_WAL_BREAK_LSN, ERRINJ_INT, {.iparam = -1}) \
+	_(ERRINJ_WAL_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_WAL_DELAY_COUNTDOWN, ERRINJ_INT, {.iparam = -1}) \
 	_(ERRINJ_WAL_FALLOCATE, ERRINJ_INT, {.iparam = 0}) \
 	_(ERRINJ_WAL_IO, ERRINJ_BOOL, {.bparam = false}) \

Then we insert this error injection into the code responsible for writing to the WAL:

--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -670,5 +670,6 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
 	}
 	vclock_copy(&msg->vclock, &writer->vclock);
 	msg->wal_size = writer->checkpoint_wal_size;
+	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
 	return 0;
 }

After that, in a debug build, we can enable the delayed WAL write from Lua:

$ tarantool
Tarantool 2.8.0-104-ga801f9f35
type 'help' for interactive help
tarantool> box.error.injection.get('ERRINJ_WAL_DELAY')

---
- false
...
tarantool> box.error.injection.set('ERRINJ_WAL_DELAY', true)

---
- true
...

We have added a total of 90 error injections to different parts of Tarantool, and each of them is covered by at least one functional test.
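A functional test built on top of such an injection might look roughly like this; it is a sketch that assumes a debug build and a space named test with a primary index:

-- a sketch of a test that uses ERRINJ_WAL_DELAY (debug build only)
local fiber = require('fiber')

box.error.injection.set('ERRINJ_WAL_DELAY', true)     -- WAL writes now hang

local f = fiber.create(function()
    box.space.test:replace({1, 'value'})               -- blocks on the WAL write
end)
f:set_joinable(true)

fiber.sleep(0.1)
assert(f:status() ~= 'dead', 'the write must not finish while the WAL is delayed')

box.error.injection.set('ERRINJ_WAL_DELAY', false)     -- let the write complete
f:join()
assert(box.space.test:get(1) ~= nil, 'the tuple is finally written')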

Ecosystem integration testing

The Tarantool ecosystem includes a large number of connectors for different programming languages and auxiliary libraries implementing popular architectural patterns (e.g., a cache or a persistent queue). There are also products written in Lua on top of Tarantool: Tarantool DataGrid and Tarantool Cartridge. We test backward compatibility by running extra tests against pre-release versions of Tarantool that involve these modules and products.

Randomized testing

I want to pay special attention to some tests, because they differ from the standard ones: their data is generated automatically and at random. Such tests are not part of the standard regression set and are run separately.

Tarantool's core is written mostly in C, and even careful development does not prevent memory management issues such as use-after-free, heap buffer overflow, or NULL pointer dereference. Such issues are utterly undesirable in server software. Fortunately, recent advances in dynamic analysis and fuzz testing make it possible to reduce their number.

I have already mentioned that Tarantool uses third-party libraries, and many of them already rely on fuzz testing: curl, c-ares, zstd, and OpenSSL are regularly tested in the OSS-Fuzz infrastructure. Tarantool itself has many places where input is parsed (e.g., SQL or HTTP queries) or MsgPack is decoded, and such code is prone to memory management bugs. The good news is that fuzz testing detects them quickly. Tarantool is also integrated with OSS-Fuzz, although there are not many fuzz tests yet; so far, we have found a single bug in the http_parser library. The number of such tests should eventually grow, and we have detailed instructions for those who want to add a new one.

In 2020, we added support for synchronous replication and MVCC. We had to test this functionality, so we decided to write tests powered by the Jepsen framework, which checks consistency by analyzing the transaction history. The story of testing with Jepsen is big enough for a separate article, so we'll cover it next time.

Load and performance testing

One of the reasons people choose Tarantool is its high performance, so it would be strange not to test it. We have an informal benchmark of inserting 1 million tuples per second on commodity hardware; anyone can run it on their machine and get 1 Mops out of Tarantool. The Lua snippet relay-1mops.lua also works as a benchmark for synchronous replication:

sergeyb@pony:~/sources$ tarantool relay-1mops.lua 2
making 1000000 operations, 10 operations per txn using 50 fibers
starting 1 replicas
master done 1000009 ops in time: 1.156930, cpu: 2.701883
master speed    864363  ops/sec
replicas done 1000009 ops in time: 3.263066, cpu: 4.839174
replicas speed  306463  ops/sec
sergeyb@pony:~/sources
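For a rough idea of what such a benchmark does, here is a minimal sketch of a batched insert loop spread across fibers. It is not the actual relay-1mops.lua, just an illustration:

-- a minimal sketch of a batched insert benchmark (not the actual relay-1mops.lua)
local clock = require('clock')
local fiber = require('fiber')

box.cfg{}
local s = box.schema.space.create('bench', {if_not_exists = true})
s:create_index('pk', {if_not_exists = true})

local n_fibers, per_fiber, batch = 50, 20000, 10
local done = fiber.channel(n_fibers)
local start = clock.monotonic()

for f = 1, n_fibers do
    fiber.create(function()
        local base = (f - 1) * per_fiber
        for i = 1, per_fiber, batch do
            box.begin()                        -- group several inserts per transaction
            for j = 0, batch - 1 do
                s:replace({base + i + j, 'payload'})
            end
            box.commit()
        end
        done:put(true)
    end)
end
for _ = 1, n_fibers do done:get() end          -- wait for all fibers to finish

local total = n_fibers * per_fiber
print(string.format('%d ops in %.3f sec', total, clock.monotonic() - start))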

For performance testing, we also run common benchmarks: the popular YCSB (Yahoo! Cloud Serving Benchmark), NoSQLBench, LinkBench, SysBench, TPC-H, and TPC-C. We also run C Bench, our own Tarantool API benchmark; its primitive operations are written in C, and the scenarios are described in Lua.

Metrics

We collect code coverage information for the regression tests. Right now we cover 83% of all lines and 51% of all branches, which is not bad. We use Coveralls to visualize the covered areas. There is nothing new about collecting C/C++ coverage: instrument the code with the -coverage option, run the tests, and generate a report with gcov and lcov. With Lua the situation is worse: there is only a primitive profiler, and luacov reports only line coverage, which is a little frustrating.
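For reference, collecting Lua line coverage with luacov boils down to something like the following sketch; it assumes the luacov rock is installed and is not our exact CI invocation:

-- a sketch of collecting Lua line coverage with luacov
require('luacov')                         -- start recording line hits into luacov.stats.out
dofile('test/app-tap/example.test.lua')   -- hypothetical test file
-- after the run, the `luacov` command turns luacov.stats.out into
-- luacov.report.out with per-line hit counts; branch coverage is not reported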

Release checklist

Each new release involves a bunch of different tasks handled by different teams. These tasks include release tagging in the repository, publishing packages and builds, publishing documentation to the website, checking functional and performance testing results, checking for open bugs, triaging for the next milestone, etc. A version release can easily become chaotic, or some steps may be forgotten. To prevent this from happening, we have described the release process in the form of a checklist, and we follow it before releasing a new version.

Conclusion

As the saying goes, there is always room for improvement. Over time, the processes and technologies for catching bugs get better, the remaining bugs become more complicated, and the more sophisticated the testing and QA system, the fewer bugs reach users.

Useful links
