Alex Dzyoba
Posted on July 9, 2018
ftrace
Ftrace is a framework for tracing and profiling the Linux kernel with the following features:
- Kernel functions tracing
- Call graph tracing
- Tracepoints support
- Dynamic tracing via kprobes
- Statistics for kernel functions
- Statistics for kernel events
Essentially, ftrace is built around a smart lockless ring buffer implementation (see Documentation/trace/ring-buffer-design.txt). That buffer stores all ftrace info and is exposed via debugfs [1] in /sys/kernel/debug/tracing/. All manipulations are done with simple file operations in this directory.
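For example, assuming debugfs is mounted in the usual place, you can poke around the interface with ordinary shell commands (the files below are the standard ones described in Documentation/trace/ftrace.txt):
cd /sys/kernel/debug/tracing
ls                      # current_tracer, available_tracers, trace, set_event, ...
cat available_tracers   # which tracers this kernel was built with
cat current_tracer      # usually "nop" until you enable something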
How ftrace works
As I’ve just said, ftrace is a framework, meaning that it provides only the ring buffer; all the real work is done by so-called tracers. Currently, ftrace includes the following tracers:
- function – default tracer;
- function_graph – constructs call graph;
- irqsoff, preemptoff, preemptirqsoff, wakeup, wakeup_rt – latency tracers. These are the origins of ftrace; they were introduced in the -rt kernel. I won’t give you a lot of info on this topic because it’s more about realtime, scheduling and hardware stuff;
- nop – you guessed it.
Also, as additional features you’ll get:
- kernel tracepoints support;
- kprobes support;
- blktrace support, though it’s going to be deleted.
Now let’s look at specific tracers.
Function tracing
The main ftrace feature is, well, function tracing (the function and function_graph tracers). To achieve this, kernel functions are instrumented with mcount calls, just like with gprof. But the kernel mcount, of course, is totally different from the userspace one, because it’s architecture dependent. This dependency is required to be able to build call graphs, and more specifically to get the caller address from the previous stack frame.
This mcount call is inserted in the function prologue; when tracing is turned off it does nothing, but when it’s turned on it calls an ftrace function that, depending on the current tracer, writes different data to the ring buffer.
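To get a feel for it, here is a minimal sketch of driving the function tracer by hand through the debugfs files (standard ftrace control files; adjust the path if your debugfs is mounted elsewhere):
cd /sys/kernel/debug/tracing
echo function > current_tracer   # or function_graph
echo 1 > tracing_on              # start writing to the ring buffer
sleep 1                          # let it collect something
echo 0 > tracing_on              # stop writing
head -20 trace                   # the ring buffer rendered as text
echo nop > current_tracer        # back to doing nothing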
Events tracing
Events tracing is done with the help of tracepoints. You set an event via the set_event file in /sys/kernel/debug/tracing and then it will be traced into the ring buffer. For example, to trace kmalloc, just issue:
echo kmalloc > /sys/kernel/debug/tracing/set_event
and now you can see it in trace:
tail-7747 [000] .... 12584.876544: kmalloc: call_site=c06c56da ptr=e9cf9eb0 bytes_req=4 bytes_alloc=8 gfp_flags=GFP_KERNEL|GFP_ZERO
and it’s the same format as defined in include/trace/events/kmem.h, meaning it’s just a tracepoint.
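Events can also be enabled per event file and filtered on their fields; the filter syntax below is the standard one from the events documentation, and the 256-byte threshold is just an arbitrary example:
cd /sys/kernel/debug/tracing
echo 1 > events/kmem/kmalloc/enable                   # equivalent to echo kmalloc > set_event
echo 'bytes_req > 256' > events/kmem/kmalloc/filter   # log only bigger allocations
echo > set_event                                      # disable all events again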
kprobes tracing
In kernel 3.10, support for kprobes and kretprobes was added to ftrace. Now you can do dynamic tracing without writing your own kernel module. But, unfortunately, there is not much you can get out of it, just:
- Registers values
- Memory dumps
- Symbols values
- Stack values
- Return values (kretprobes)
And again, this output is written to the ring buffer. Also, you can calculate some statistics over it.
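As a minimal sketch (before the retprobe example below), here is a plain kprobe that dumps the stack address on entry to a function; do_sys_open is just a convenient example symbol and myprobe is a made-up name, while the fetch-arg syntax comes from Documentation/trace/kprobetrace.txt:
cd /sys/kernel/debug/tracing
echo 'p:myprobe do_sys_open $stack' >> kprobe_events   # probe at function entry
echo 1 > events/kprobes/myprobe/enable
head -10 trace
echo 0 > events/kprobes/myprobe/enable
echo '-:myprobe' >> kprobe_events                      # remove just this probe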
Let’s trace something that doesn’t have a tracepoint – something not from the core kernel but from a kernel module.
On my Samsung N210 laptop I have the ath9k WiFi module, which most likely doesn’t have any tracepoints. To check this, just grep available_events:
[tracing]# grep ath available_events
cfg80211:rdev_del_mpath
cfg80211:rdev_add_mpath
cfg80211:rdev_change_mpath
cfg80211:rdev_get_mpath
cfg80211:rdev_dump_mpath
cfg80211:rdev_return_int_mpath_info
ext4:ext4_ext_convert_to_initialized_fastpath
Let’s see what functions we can put a probe on:
[tracing]# grep "\[ath9k\]" /proc/kallsyms | grep ' t ' | grep rx
f82e6ed0 t ath_rx_remove_buffer [ath9k]
f82e6f60 t ath_rx_buf_link.isra.25 [ath9k]
f82e6ff0 t ath_get_next_rx_buf [ath9k]
f82e7130 t ath_rx_edma_buf_link [ath9k]
f82e7200 t ath_rx_addbuffer_edma [ath9k]
f82e7250 t ath_rx_edma_cleanup [ath9k]
f82f3720 t ath_debug_stat_rx [ath9k]
f82e7a70 t ath_rx_tasklet [ath9k]
f82e7310 t ath_rx_cleanup [ath9k]
f82e7800 t ath_calcrxfilter [ath9k]
f82e73e0 t ath_rx_init [ath9k]
(The first grep filters symbols from the ath9k module, the second grep filters functions that reside in the text section, and the last grep filters receiver functions.)
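As an alternative to digging through /proc/kallsyms, ftrace publishes the list of functions it can hook in available_filter_functions, so something like this should give a similar list (assuming the module is loaded):
grep "\[ath9k\]" /sys/kernel/debug/tracing/available_filter_functions | grep rx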
For example, we will trace the ath_get_next_rx_buf function.
[tracing]# echo 'r:ath_probe ath9k:ath_get_next_rx_buf $retval' >> kprobe_events
This command is not from the top of my head – check Documentation/trace/kprobetrace.txt.
It puts a retprobe on our function and fetches the return value (it’s just a pointer).
After we’ve added the probe we must enable it:
[tracing]# echo 1 > events/kprobes/enable
And then we can look for output in the trace file, and here it is:
midori-6741 [000] d.s. 3011.304724: ath_probe: (ath_rx_tasklet+0x35a/0xc30 [ath9k] <- ath_get_next_rx_buf) arg1=0xf6ae39f4
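When you’re done, it’s a good idea to disable and clear the dynamic probes (clearing kprobe_events removes all of them):
[tracing]# echo 0 > events/kprobes/enable
[tracing]# echo > kprobe_events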
Example (block_hasher)
By default, ftrace collects info about all kernel functions and that’s huge. But, being a sophisticated kernel mechanism, ftrace has a lot of features, options and tunable params that I’m not going to cover here, because there are plenty of manuals and articles on LWN (see the To read section). Hence, it’s no wonder that we can, for example, filter by PID. Here is the script:
#!/bin/sh
DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'`

# Reset and enable collection of per-function statistics
echo 0 > $DEBUGFS/tracing/function_profile_enabled
echo 1 > $DEBUGFS/tracing/function_profile_enabled

# Trace only this shell's PID (preserved across the exec below)
echo $$ > $DEBUGFS/tracing/set_ftrace_pid
echo function > $DEBUGFS/tracing/current_tracer

exec "$@"
function_profile_enabled turns on collection of statistical info (hit counts and timings) for traced functions.
Launch our magic script:
./ftrace-me ./block_hasher -d /dev/md127 -b 1048576 -t10 -n10000
get per-CPU statistics from the files in tracing/trace_stat/:
head -n50 tracing/trace_stat/function* > ~/trace_stat
and see the top 5:
==> function0 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 444425 8653900277 us 19472.12 us
schedule_timeout 36019 813403521 us 22582.62 us
do_IRQ 8161576 796860573 us 97.635 us
do_softirq 486268 791706643 us 1628.128 us
__do_softirq 486251 790968923 us 1626.667 us
==> function1 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 1352233 13378644495 us 9893.742 us
schedule_hrtimeout_range 11853 2708879282 us 228539.5 us
poll_schedule_timeout 7733 2366753802 us 306058.9 us
schedule_timeout 176343 1857637026 us 10534.22 us
schedule_timeout_interruptible 95 1637633935 us 17238251 us
==> function2 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 1260239 9324003483 us 7398.599 us
vfs_read 215859 884716012 us 4098.582 us
do_sync_read 214950 851281498 us 3960.369 us
sys_pread64 13136 830103896 us 63193.04 us
generic_file_aio_read 14955 830034649 us 55502.14 us
(Don’t pay attention to schedule – those are just calls to the scheduler.)
Most of the time is spent in schedule, do_IRQ, schedule_hrtimeout_range and vfs_read, meaning that we are either waiting for reads or waiting for some timeout. Now that’s strange! To make it clearer, we can disable the so-called graph time so that child functions won’t be counted. Let me explain: by default, ftrace counts a function’s time as the time of the function itself plus all the subroutines it calls. That’s the graph_time option in ftrace.
Turn it off:
echo 0 > options/graph_time
And collect the profile again:
==> function0 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 34129 6762529800 us 198146.1 us
mwait_idle 50428 235821243 us 4676.394 us
mempool_free 59292718 27764202 us 0.468 us
mempool_free_slab 59292717 26628794 us 0.449 us
bio_endio 49761249 24374630 us 0.489 us
==> function1 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 958708 9075670846 us 9466.564 us
mwait_idle 406700 391923605 us 963.667 us
_spin_lock_irq 22164884 15064205 us 0.679 us
__make_request 3890969 14825567 us 3.810 us
get_page_from_freelist 7165243 14063386 us 1.962 us
Now we see the amusing mwait_idle that somebody is somehow calling. We can’t tell how that happens.
Maybe we should get a function graph! We know that it all starts with pread, so let’s try to trace down the function calls from pread.
By that moment, I had grown tired of reading and writing debugfs files and started to use the CLI interface to ftrace, which is trace-cmd.
Using trace-cmd is dead simple – first, you record with trace-cmd record and then analyze it with trace-cmd report.
Record:
trace-cmd record -p function_graph -o graph_pread.dat -g sys_pread64 \
./block_hasher -d /dev/md127 -b 1048576 -t10 -n100
Look:
trace-cmd report -i graph_pread.dat | less
And it’s disappointing.
block_hasher-4102 [001] 2764.516562: funcgraph_entry: | __page_cache_alloc() {
block_hasher-4102 [001] 2764.516562: funcgraph_entry: | alloc_pages_current() {
block_hasher-4102 [001] 2764.516562: funcgraph_entry: 0.052 us | policy_nodemask();
block_hasher-4102 [001] 2764.516563: funcgraph_entry: 0.058 us | policy_zonelist();
block_hasher-4102 [001] 2764.516563: funcgraph_entry: | __alloc_pages_nodemask() {
block_hasher-4102 [001] 2764.516564: funcgraph_entry: 0.054 us | _cond_resched();
block_hasher-4102 [001] 2764.516564: funcgraph_entry: 0.063 us | next_zones_zonelist();
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | SyS_pread64() {
block_hasher-4102 [001] 2764.516564: funcgraph_entry: | get_page_from_freelist() {
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | __fdget() {
block_hasher-4102 [001] 2764.516565: funcgraph_entry: 0.052 us | next_zones_zonelist();
block_hasher-4109 [000] 2764.516565: funcgraph_entry: | __fget_light() {
block_hasher-4109 [000] 2764.516565: funcgraph_entry: 0.217 us | __fget();
block_hasher-4102 [001] 2764.516565: funcgraph_entry: 0.046 us | __zone_watermark_ok();
block_hasher-4102 [001] 2764.516566: funcgraph_entry: 0.057 us | __mod_zone_page_state();
block_hasher-4109 [000] 2764.516566: funcgraph_exit: 0.745 us | }
block_hasher-4109 [000] 2764.516566: funcgraph_exit: 1.229 us | }
block_hasher-4102 [001] 2764.516566: funcgraph_entry: | zone_statistics() {
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | vfs_read() {
block_hasher-4102 [001] 2764.516566: funcgraph_entry: 0.064 us | __inc_zone_state();
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | rw_verify_area() {
block_hasher-4109 [000] 2764.516567: funcgraph_entry: | security_file_permission() {
block_hasher-4102 [001] 2764.516567: funcgraph_entry: 0.057 us | __inc_zone_state();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: 0.048 us | cap_file_permission();
block_hasher-4102 [001] 2764.516567: funcgraph_exit: 0.907 us | }
block_hasher-4102 [001] 2764.516567: funcgraph_entry: 0.056 us | bad_range();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: 0.115 us | __fsnotify_parent();
block_hasher-4109 [000] 2764.516568: funcgraph_entry: 0.159 us | fsnotify();
block_hasher-4102 [001] 2764.516568: funcgraph_entry: | mem_cgroup_bad_page_check() {
block_hasher-4102 [001] 2764.516568: funcgraph_entry: | lookup_page_cgroup_used() {
block_hasher-4102 [001] 2764.516568: funcgraph_entry: 0.052 us | lookup_page_cgroup();
block_hasher-4109 [000] 2764.516569: funcgraph_exit: 1.958 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 0.435 us | }
block_hasher-4109 [000] 2764.516569: funcgraph_exit: 2.487 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 0.813 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 4.666 us | }
First of all, there is no straight function call chain; it’s constantly interrupted and transferred to another CPU. Secondly, there is a lot of noise, e.g. __inc_zone_state and __page_cache_alloc calls. And finally, there are neither mdraid functions nor mwait_idle calls!
And the reasons are ftrace’s default sources (tracepoints) and the async/callback nature of kernel code. You won’t see a direct function call chain from sys_pread64 – the kernel doesn’t work that way.
But what if we set up kprobes on the mdraid functions? No problem! Just add return probes for mwait_idle and md_make_request:
# echo 'r:md_make_request_probe md_make_request $retval' >> kprobe_events
# echo 'r:mwait_probe mwait_idle $retval' >> kprobe_events
Repeat the routine with trace-cmd to get the function graph:
# trace-cmd record -p function_graph -o graph_md.dat -g md_make_request -e md_make_request_probe -e mwait_probe -F \
./block_hasher -d /dev/md0 -b 1048576 -t10 -n100
-e enables a particular event. Now, look at the function graph:
block_hasher-28990 [000] 10235.125319: funcgraph_entry: | md_make_request() {
block_hasher-28990 [000] 10235.125321: funcgraph_entry: | make_request() {
block_hasher-28990 [000] 10235.125322: funcgraph_entry: 0.367 us | md_write_start();
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | bio_clone_mddev() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | bio_alloc_bioset() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | mempool_alloc() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: 0.178 us | _cond_resched();
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | mempool_alloc_slab() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | kmem_cache_alloc() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | cache_alloc_refill() {
block_hasher-28990 [000] 10235.125325: funcgraph_entry: 0.275 us | _spin_lock();
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 1.072 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 1.721 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 2.085 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 2.865 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_entry: 0.187 us | bio_init();
block_hasher-28990 [000] 10235.125327: funcgraph_exit: 3.665 us | }
block_hasher-28990 [000] 10235.125327: funcgraph_entry: 0.229 us | __bio_clone();
block_hasher-28990 [000] 10235.125327: funcgraph_exit: 4.584 us | }
block_hasher-28990 [000] 10235.125328: funcgraph_entry: 1.093 us | raid5_compute_sector();
block_hasher-28990 [000] 10235.125330: funcgraph_entry: | blk_recount_segments() {
block_hasher-28990 [000] 10235.125330: funcgraph_entry: 0.340 us | __blk_recalc_rq_segments();
block_hasher-28990 [000] 10235.125331: funcgraph_exit: 0.769 us | }
block_hasher-28990 [000] 10235.125331: funcgraph_entry: 0.202 us | _spin_lock_irq();
block_hasher-28990 [000] 10235.125331: funcgraph_entry: 0.194 us | generic_make_request();
block_hasher-28990 [000] 10235.125332: funcgraph_exit: + 10.613 us | }
block_hasher-28990 [000] 10235.125332: funcgraph_exit: + 13.638 us | }
Much better! But for some reason, it doesn’t have mwait_idle calls, and it just calls generic_make_request. I’ve tried to record a function graph for generic_make_request (the -g option). Still no luck. I’ve extracted all functions containing 'wait' and here is the result:
# grep 'wait' graph_md.graph | cut -f 2 -d'|' | awk '{print $1}' | sort -n | uniq -c
18 add_wait_queue()
2064 bit_waitqueue()
1 bit_waitqueue();
1194 finish_wait()
28 page_waitqueue()
2033 page_waitqueue();
1222 prepare_to_wait()
25 remove_wait_queue()
4 update_stats_wait_end()
213 update_stats_wait_end();
(cut separates the function names, awk prints only the names, and sort with uniq reduces them to unique names with counts.)
Nothing related to timeouts. I’ve tried to grep for timeout and, damn, nothing suspicious.
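That grep was along the same lines as the previous one, presumably something like:
# grep 'timeout' graph_md.graph | cut -f 2 -d'|' | awk '{print $1}' | sort -n | uniq -c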
So, right now I’m going to stop because it’s not going anywhere.
Conclusion
Well, it was really fun! ftrace is such a powerful tool, but it’s made for debugging, not profiling. I was able to get a kernel function call graph, get statistics for kernel execution at the source code level (can you believe it?), and trace some previously unknown functions – and all of that happened thanks to ftrace. Bless it!
To read
- Debugging the kernel using Ftrace - part 1, part 2
- Secrets of the Ftrace function tracer
- trace-cmd
- Dynamic probes with ftrace
- Dynamic event tracing in Linux kernel
[1] This is how debugfs is mounted:
mount -t debugfs none /sys/kernel/debug