Satoru Takeuchi
Posted on March 24, 2023
Introduction
Those of you who use Linux probably execute various commands on Linux on a daily basis. You might use the term "command name" to identify these, but depending on the context, the meaning of this term can vary. This article explains what the Linux kernel considers a command name.
First, I will present a brief conclusion, followed by a detailed explanation, and finally, I will describe the motivation for this investigation and the subsequent research process.
TL;DR
From the Linux kernel's perspective, the command name is the first 15 bytes of the basename of the executable file name (the file name without the directory part).
It is stored as a null-terminated string in a 16-byte field called comm within a structure called task_struct, which exists for each process in the kernel's memory (more precisely, for each kernel-level thread).
This lets the kernel identify processes at low cost and in a more readable form than a pid.
This command name is used in kernel logs and by commands such as ps and pgrep in the procps package. Longer command names are truncated due to the 15-byte limit mentioned above.
Investigation Process
Software versions used for the investigation
- Linux kernel: v5.15
- Procps: 3.3.17
Motivation
The motivation for investigating what is described in the "TL;DR" section was that the pgrep command I used in my custom program did not work as I expected. The pgrep command treats the string given as an argument as a regular expression and prints the pids of running processes whose names match it. For example, the following runs an infinitely sleeping script called "foo.sh" and then uses pgrep to display its pid.
$ cat foo.sh
#!/bin/bash
sleep infinity
$ ./foo.sh &
[2] 1086408
$ pgrep "foo\.sh"
1086408
However, when I tried the same thing with a script called "foo-bar-baz-hoge-huga.sh" that does exactly the same thing as "foo.sh", pgrep did not display anything.
$ cat foo-bar-baz-hoge-huga.sh
#!/bin/bash
sleep infinity
$ ./foo-bar-baz-hoge-huga.sh &
[2] 1086868
$ pgrep "foo-bar-baz-hoge-huga\.sh"
$
I thought it was odd, so I looked at man pgrep and found the following description.
The process name used for matching is limited to the 15 characters present in the output of /proc/pid/stat.
In fact, when I looked at the /proc/pid/stat file for "foo-bar-baz-hoge-huga.sh", I got the following output.
$ cat /proc/601235/stat
601235 (foo-bar-baz-hog) S 593786 601235 593786 34817 601419 4194304 224 0 0 0 0 0 0 0 20 0 1 0 5735606 8617984 900 18446744073709551615 94266299658240 94266300571405 140732967030208 0 0 0 65536 4 65538 1 0 0 17 1 0 0 0 0 0 94266300816048 94266300864080 94266304847872 140732967036675 140732967036712 140732967036712 140732967038941 0
The string displayed inside the parentheses in the second field, which shows the command name, did indeed match only the first 15 characters of the script name, not the entire name.
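As a small aside, the second field can be pulled out of such a line programmatically. The following is a minimal Python sketch, not code taken from procps: the function name parse_stat_comm is hypothetical, and the sample is an abbreviated copy of the line above. It splits on the first "(" and the last ")", on the assumption that a command name may itself contain spaces or ")" characters.

```python
def parse_stat_comm(stat_line):
    """Return (pid, comm) from one line of /proc/<pid>/stat.

    The comm field is wrapped in parentheses and may itself contain spaces
    or ')' characters, so we split on the first '(' and the LAST ')'.
    """
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    pid = int(stat_line[:open_idx])
    comm = stat_line[open_idx + 1:close_idx]
    return pid, comm


# Abbreviated copy of the stat line shown above (only the first few fields).
sample = "601235 (foo-bar-baz-hog) S 593786 601235 593786 34817"
pid, comm = parse_stat_comm(sample)
print(pid, comm, len(comm))
```

Splitting on whitespace alone would work for this sample, but it would break for any command name that contains a space.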
Although I understood the specification itself and realized that my usage of pgrep was incorrect, I decided to verify where this 15-character limit came from.
Reading the procfs Manual
The files under the /proc/ directory are provided by a file system called procfs. Unlike file systems such as ext4 or XFS that manage data on disk, procfs exists for users to obtain kernel information and modify the kernel state through files. We will not go into the details of procfs here.
First, let's check the specifications of the /proc/pid/stat file. The specifications of files under procfs are described in man procfs. The following is an excerpt of the relevant part:
/proc/[pid]/stat
Status information about the process. This is used by ps(1). It is defined in the kernel source file fs/proc/array.c.
...
(2) comm %s
The filename of the executable, in parentheses. Strings longer than TASK_COMM_LEN (16) characters (including the terminating null byte) are silently truncated. This is visible whether or not the executable is swapped out.
We can see that the second field of the /proc/pid/stat file contains the name of the executable file in parentheses, and that anything beyond 16 bytes, including the terminating null byte, is dropped. Subtracting 1 byte for the null byte from 16 bytes gives us 15 bytes, which matches the information written in the pgrep manual.
Identifying the handler for the /proc/pid/stat file
Next, I looked at the kernel source to see where this string is actually output and where the data is stored. The procfs manual states that the /proc/pid/stat file is defined in the fs/proc/array.c file in the kernel source, so I first looked at this file.
The relevant code seems to be in the following part of the do_task_stat() function:
https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L562-L564
When the seq_puts() function is called, it outputs the specified string to a file. In the code above, lines 562 and 564 output "(" and ")", so we can infer that the command name is output to the file by the proc_task_name() function on line 563.
Before looking at the contents of proc_task_name(), I decided to first check whether the do_task_stat() function is actually called when the /proc/pid/stat file is read. I traced the callers of the do_task_stat() function and found that it is called from two functions, proc_tid_stat() and proc_tgid_stat().
https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L646-L656
In the kernel, tid refers to the thread ID, and tgid refers to the thread group ID, which is what user space calls the process ID (pid), so we can guess that the proc_tgid_stat() function is probably the caller. procfs also exposes per-thread state under the /proc/pid/task directory, so the proc_tid_stat() function is probably the handler for the /proc/pid/task/tid/stat file.
Tracing the call stack of these functions further back, I found that the fs/proc/base.c file, which registers the handlers called when users read and write files in procfs, registers the proc_tgid_stat() function for the /proc/tgid/stat file, or in other words, the /proc/<pid>/stat file.
https://github.com/torvalds/linux/blob/v5.15/fs/proc/base.c#L3168-L3202
In summary, I found the following:
- The user reads the /proc/pid/stat file
- The proc_tgid_stat() function is called
- The do_task_stat() function is called
- The proc_task_name() function is called to output the command name to the file
Identifying the source of the command name information
The implementation of the proc_task_name() function looks like this:
https://github.com/torvalds/linux/blob/v5.15/fs/proc/array.c#L99-L112
I will omit the details, but when the process indicated by the pid is a regular program, the evaluation result of the if statement on line 103 is false. This evaluation result is true only in the case of special processes created within the kernel.
Furthermore, since the escape argument of the proc_task_name() function is true when called via the proc_tgid_stat() function, the evaluation result of the if statement on line 108 is true. Therefore, we can see that the data obtained by the __get_task_comm() function (probably a null-terminated string) is used as the output for the /proc/pid/stat file on line 109 within the proc_task_name() function. The seq_escape_str() function on line 109 escapes special characters and spaces, but I will not explain the details here as it is not important for this article.
Now, let's look at the contents of the __get_task_comm() function.
https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1209-L1215
We can see that the value of tsk->comm, or more precisely, the value of the comm field of a structure named task_struct, is the source of the command name information. The task_struct structure exists for each thread. Let's take a look at the definition of the task_struct structure.
https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L727-L1063
https://github.com/torvalds/linux/blob/master/include/linux/sched.h#L276-L282
We can see that the comm field is an array of char with a length of 16. The procfs manual also mentioned that the length of TASK_COMM_LEN is 16 bytes.
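To make the layout concrete, here is a rough Python sketch that models only this one field with ctypes. TaskStructModel is a hypothetical stand-in, not the kernel's definition (the real task_struct has many more fields); the only values taken from the text are the field name comm and the length 16.

```python
import ctypes

TASK_COMM_LEN = 16  # value quoted from the procfs manual


class TaskStructModel(ctypes.Structure):
    """Hypothetical stand-in for task_struct holding only the comm field."""
    _fields_ = [("comm", ctypes.c_char * TASK_COMM_LEN)]


task = TaskStructModel()          # zero-initialized, so comm is all null bytes
task.comm = b"foo.sh"             # short names fit with room to spare

field_size = ctypes.sizeof(task)  # size of the model: just the 16-byte array
raw = bytes(task)                 # the 16 raw bytes, including null padding
max_name_len = TASK_COMM_LEN - 1  # one byte is reserved for the terminator

print(field_size, max_name_len, raw)
```

The raw bytes show the name followed by null padding, which is why a 16-byte array leaves room for at most 15 bytes of name.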
Confirming where the value of task_struct->comm is set
The __set_task_comm() function sets the value of task_struct->comm:
https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1223-L1230
On the exec path, the __set_task_comm() function is called from the begin_new_exec() function:
https://github.com/torvalds/linux/blob/v5.15/fs/exec.c#L1238-L1357
This function is called during the execve() system call, which replaces the program run by the calling process with a new one. bprm->filename contains the name of the executable file as a null-terminated string. Here, we can see that the name of the executable file is processed by the kbasename() function and then saved in task->comm. The kbasename() function, similar to the basename() function in the standard C library, returns the file name with the directory part removed. Therefore, if the executable file name is "./foo.sh", "foo.sh" is stored in task_struct->comm, and if it is "./foo-bar-baz-hoge-huga.sh", "foo-bar-baz-hog" is stored. Finally, I understood the definition of the "command name" in the /proc/pid/stat file, or, in other words, as referred to by the Linux kernel.
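The two steps described above (take the basename, then keep what fits in the 16-byte field) can be sketched in a few lines of Python. This is a model of the behavior as described in this article, not a translation of the kernel source; kbasename_model and comm_for are hypothetical names.

```python
TASK_COMM_LEN = 16  # size of the comm field, including the terminator


def kbasename_model(path):
    """Return the part of path after the last '/', or path itself if none."""
    return path.rsplit("/", 1)[-1]


def comm_for(path):
    """Model the stored command name: basename, cut to 15 bytes."""
    name = kbasename_model(path).encode("utf-8")
    return name[:TASK_COMM_LEN - 1]


for path in ("./foo.sh", "./foo-bar-baz-hoge-huga.sh"):
    print(path, "->", comm_for(path).decode("utf-8"))
```

Note that the model works on bytes rather than characters: with a multibyte encoding such as UTF-8, a long file name could be cut in the middle of a character.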
Examining the procps source code
Lastly, by reading the procps source code, I confirmed that the string pgrep matches against is, as described in the man page, at most the 15 characters between the "(" and ")" of the second field of the /proc/pid/stat file.
Since there is nothing particularly interesting going on there, I will omit the details.
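To tie this back to the failure in the "Motivation" section, the following Python sketch models the matching step only. It is not how procps is implemented: pgrep_model is a hypothetical name, Python's re module stands in for the regular expression engine pgrep uses, and the process table is hard-coded from the examples earlier in this article.

```python
import re


def pgrep_model(pattern, comms):
    """Return the sorted pids whose command name contains a match for pattern.

    comms maps pid -> command name as the kernel stores it (at most 15 bytes).
    """
    regex = re.compile(pattern)
    return sorted(pid for pid, comm in comms.items() if regex.search(comm))


# Command names as they would appear in the second field of /proc/pid/stat.
processes = {
    1086408: "foo.sh",
    1086868: "foo-bar-baz-hog",  # truncated from foo-bar-baz-hoge-huga.sh
}

print(pgrep_model(r"foo\.sh", processes))
print(pgrep_model(r"foo-bar-baz-hoge-huga\.sh", processes))
print(pgrep_model(r"foo-bar-baz-hog", processes))
```

The full script name never matches because the text being searched was already truncated; a pattern of 15 characters or fewer does match. The pgrep manual also documents a -f option that matches against the full command line instead of the process name, which may be the better fit for long script names.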
Column: Considering the Definition of Command Names
We now understand that the command name, as referred to by the Linux kernel, is the first 15 bytes of the basename of the executable file. However, why is it processed with the basename, and why is it truncated to a maximum of 15 bytes? The reasons are probably as follows:
To identify processes through kernel logs and other means, it is convenient to have easily accessible information in the form of a string, separate from the process ID (pid). The name of the executable file can be used for this purpose. However, storing the full executable file name in the task_struct structure may consume a large amount of kernel memory and could potentially create a security vulnerability if a malicious user executed a program with an excessively long file name. Therefore, storing the entire file name is not feasible.
One might think that it would be sufficient to look at the value of the executable file name stored in the process memory. However, this is not necessarily true. When accessing the process memory from the kernel, if the relevant memory might be swapped out, it is necessary to swap it back in before reading, which can be cumbersome. Moreover, this approach cannot be used in situations where the system is running out of memory, for instance, when the kernel needs to log the lack of memory. It is not possible to increase memory usage when there is already a shortage.
The reason for using the basename, such as "foo.sh" instead of the file name or full path specified at runtime like "./foo.sh", is likely due to the decision that the basename still provides sufficiently high visibility. In most cases, the basename is enough to recognize and identify the process without using the full path.
Conclusion
In this article, I described why the command name specification in the Linux kernel is as it is. Additionally, I wrote about the process of finding answers to small questions that arise while using a computer by reading source code, allowing readers to relive the experience of source code reading. Neither of these provides immediately useful knowledge, but I hope they can serve as tidbits of information.