YugabyteDB: where did my fsync() go?

This blogpost is about the investigation of a warning message saying Fsync took a long time that went away in recent versions of YugabyteDB.

Summary

(spoiler alert!) The short version is that we still perform fsync(), because that is the guaranteed way for any application to tell the operating system and the hardware to persist all the dirty blocks/extents for a given inode. However, performing fsync() on an inode is now delegated to a dedicated thread (the log-sync thread), and only if this thread cannot perform its work fast enough performing fsync(), it will fallback and be performed by the original thread that performs the logging called the append thread.

The logging of fsync() when it exceeds the set threshold of 50 milliseconds still happens, however in the log-sync thread the severity has been changed from WARNING to INFO, which means that if you filter on the WARNING severity you don't see it. The INFO level messages can become quite a lot, which is why it makes sense to filter it. Also, the message additionally is throttled to be shown at a maximum rate of 1 per second.

Past situation

When a batch load is performed, and I filter for took a long time, this is what I see with YugabyteDB version 2.13.2.0:

➜ yb_stats --tail-log --hostname-match 9000 --stat-name-match 'took a long time'
192.168.66.82:9000   2023-02-09 13:31:59.697006 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.060s   user 0.001s sys 0.000s
192.168.66.80:9000   2023-02-09 13:32:04.205387 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.051s   user 0.000s sys 0.001s
192.168.66.80:9000   2023-02-09 13:32:08.263603 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.057s   user 0.001s sys 0.001s
192.168.66.81:9000   2023-02-09 13:32:08.980399 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.079s   user 0.000s sys 0.000s
192.168.66.82:9000   2023-02-09 13:32:09.025662 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.107s   user 0.000s sys 0.001s
192.168.66.80:9000   2023-02-09 13:32:09.323518 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.206s   user 0.001s sys 0.000s
192.168.66.81:9000   2023-02-09 13:32:09.325979 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.065s   user 0.000s sys 0.000s
192.168.66.80:9000   2023-02-09 13:32:09.336320 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.209s   user 0.000s sys 0.000s
192.168.66.82:9000   2023-02-09 13:32:09.336520 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.174s   user 0.000s sys 0.002s
192.168.66.81:9000   2023-02-09 13:32:11.481073 UTC W log.cc:961           Time spent Fsync log took a long time: real 0.051s   user 0.000s sys 0.001s

Current situation

When a batch load is performed with a recent version, and I filter for took a long time (I am using 2.17.1.0), I see some message for log roll:

❯ yb_stats --tail-log --hostname-match 9000 --stat-name-match 'took a long time'
Tail log ready, showing severities: WEF
192.168.66.80:9000   2023-02-09 14:07:32.139693 UTC W log.cc:731           Time spent T 08372aa6c40c4a0c89dd3692f0b09563 P 1f8a43d3668d4235a990e6045d4f59db: Log roll took a long time: real 0.051s user 0.000s sys 0.000s
192.168.66.82:9000   2023-02-09 14:07:32.147421 UTC W log.cc:731           Time spent T 58d3266d12b0427c9a893c4529506d43 P cea6e8a0388749d18ea9496401608c4e: Log roll took a long time: real 0.053s user 0.001s sys 0.001s
192.168.66.82:9000   2023-02-09 14:07:32.157840 UTC W log.cc:731           Time spent T 08372aa6c40c4a0c89dd3692f0b09563 P cea6e8a0388749d18ea9496401608c4e: Log roll took a long time: real 0.052s user 0.000s sys 0.001s

So, the first thing to look for is whether we are still doing fsync() or not.

Finding the fsync() call again

Seeing if the fsync() system call is still performed is not very hard. Traditionally looking at system calls is done using the linux strace command, or the equivalents on the other operating systems. Today, this can be done too by perf, using: perf trace

Using it is as simple as executing:

$ perf trace -e fsync

And running a batch load for which a tablet is touched that has its leader on the local machine on which the perf trace execute is run. This is how that looks like:

$ sudo perf trace -e fsync
     0.000 ( 3.612 ms): append [worker/8859 fsync(fd: 124)                                                        = 0
    17.115 ( 1.835 ms): append [worker/7164 fsync(fd: 85)                                                         = 0
    64.119 (14.810 ms): append [worker/8858 fsync(fd: 282)                                                        = 0
    82.983 ( 7.721 ms): append [worker/8870 fsync(fd: 283)                                                        = 0
   104.238 (17.324 ms): append [worker/8871 fsync(fd: 298)                                                        = 0
   269.817 ( 6.451 ms): log-sync [work/7162 fsync(fd: 282)                                                        = 0
   300.490 (12.274 ms): log-sync [work/7162 fsync(fd: 283)                                                        = 0
   326.252 (10.503 ms): log-sync [work/7162 fsync(fd: 298)                                                        = 0
   486.770 (10.359 ms): log-sync [work/7162 fsync(fd: 282)                                                        = 0
   523.031 (17.058 ms): log-sync [work/7162 fsync(fd: 298)                                                        = 0
   552.080 (29.121 ms): log-sync [work/7162 fsync(fd: 283)                                                        = 0
   762.369 (         ): log-sync [work/7162 fsync(fd: 283)                                                     ...
   769.427 (         ): log-sync [work/8873 fsync(fd: 298)                                                     ...
   762.369 (54.809 ms): log-sync [work/7162  ... [continued]: fsync())                                            = 0
   769.427 (49.362 ms): log-sync [work/8873  ... [continued]: fsync())                                            = 0
   779.805 (45.496 ms): log-sync [work/8874 fsync(fd: 282)                                                        = 0

Simple conclusion is we are still performing fsync(), that becomes very much clear from the perf trace output. If you look at the latency times, it's also visible that on this system (a tiny test system) the latency still can exceed the threshold of 50ms. So why don't I see a message indicating this?

Digging deeper: understanding where the fsync happens

To try to find out more, we need to see what is going on in the code. This is possible, because YugabyteDB is open source. However, we need to figure out where to look.

There are several ways of doing this. One way that is reasonably simple to do, is to let perf trace print a backtrace/stacktrace with each occurrence of the system call. Once we

 17865.004 (16.523 ms): log-sync [work/9892 fsync(fd: 309)                                                        = 0
                                       __fsync_nocancel (inlined)
                                       yb::(anonymous namespace)::PosixWritableFile::Sync (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
                                       yb::log::Log::DoSync (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
                                       yb::log::Log::DoSyncAndResetTaskInQueue (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
                                       yb::ThreadPool::DispatchThread (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
                                       yb::Thread::SuperviseThread (/opt/yugabyte/yugabyte-2.17.1.0/bin/yb-server)
                                       start_thread (/opt/yugabyte/yugabyte-2.17.1.0/linuxbrew/lib/libpthread-2.23.so)
                                       __clone (/opt/yugabyte/yugabyte-2.17.1.0/linuxbrew/lib/libc-2.23.so)

The function DoSyncAndResetTaskInQueue can be found here, which calls DoSync, for which the relevant part is this:

  LOG_SLOW_EXECUTION_EVERY_N_SECS(INFO, /* log at most one slow execution every 1 sec */ 1,
                                  50, "Fsync log took a long time") {
    SCOPED_LATENCY_METRIC(metrics_, sync_latency);
    status = active_segment_->Sync();
  }

A lot of things to see and say here:

LOG_SLOW_EXECUTION_EVERY_N_SECS: this means it's not actually logging every occurrence, but rather 'EVERY_N_SECS', which is set to 1 here.
INFO: Ah! The log level has been set from WARNING to INFO for the message.
50: the threshold is still 50ms, this didn't change from the previous fsync that happened inside the critical path.
status = active_segment_->Sync(): this is the actual call to fsync.

Displaying the Fsync took a long time messages again

With what we've learned from visiting the code, it's rather easy now to display the Fsync took a long time again:

➜ yb_stats --tail-log --hostname-match 9000 --log-severity IWEF --stat-name-match 'took a long time'
Tail log ready, showing severities: IWEF
192.168.66.80:9000   2023-02-09 15:24:25.850107 UTC W log.cc:731           Time spent T 58d3266d12b0427c9a893c4529506d43 P 1f8a43d3668d4235a990e6045d4f59db: Log roll took a long time: real 0.057s user 0.000s sys 0.001s
192.168.66.80:9000   2023-02-09 15:24:25.931238 UTC I log.cc:1001          Time spent Fsync log took a long time: real 0.061s   user 0.000s sys 0.002s
192.168.66.80:9000   2023-02-09 15:24:36.632426 UTC I log.cc:1001          Time spent Fsync log took a long time: real 0.065s   user 0.000s sys 0.001s
192.168.66.81:9000   2023-02-09 15:24:39.925721 UTC I log.cc:1001          Time spent Fsync log took a long time: real 0.055s   user 0.000s sys 0.002s
192.168.66.81:9000   2023-02-09 15:24:45.339032 UTC W log.cc:731           Time spent T 58d3266d12b0427c9a893c4529506d43 P c4ba4bb2cea04a2eade78bed94406fb9: Log roll took a long time: real 0.052s user 0.000s sys 0.001s

It's simply adding the I to the log severities to be displayed. However, there is another thing that we've learned from reading the code, which is that not all fsync() calls that took more than 50ms are displayed. So there could be more, or not.

Blog