Tracing the Arm64 Linux System Call Path
Leesoo Ahn
Posted on August 13, 2024
An Arm64 system has two types of exceptions,
- Synchronous
- Asynchronous
and four exception levels, whose names start with el (short for exception level):
- el0 (userspace)
- el1 (kernel)
- el2 (hypervisor)
- el3 (secure monitor)
System calls are the best-known kind of synchronous exception, while hardware interrupts are described as asynchronous exceptions in the Arm whitepaper. The latter is off-topic for this article.
A process runs in el0 and raises its hand whenever it needs a system resource. This is a system call, and it switches the CPU's exception level from el0 to el1. The kernel takes over the CPU and does the work on behalf of the process. Once it's done, it hands the CPU back to the process.
The following code is one of the (real) system-call wrappers from musl, a well-known libc implementation.
#define __asm_syscall(...) do { \
    __asm__ __volatile__ ( "svc 0" \
    : "=r"(x0) : __VA_ARGS__ : "memory", "cc"); \
    return x0; \
    } while (0)

static inline long __syscall0(long n)
{
    register long x8 __asm__("x8") = n;
    register long x0 __asm__("x0");
    __asm_syscall("r"(x8));
}
Imagine that the process mentioned above is about to call fork(). The API doesn't take any arguments, and therefore it maps to __syscall0(..).
The key thing to keep in mind about this code is the svc instruction (short for supervisor call), which switches from el0 to el1 while the x8 register holds the system-call number.
When svc is raised, the exception vector table, which describes what to do for each exception, calls el0t_64_sync_handler in el1. The handler reads the esr system register, which holds the syndrome information used to recognize the exception class (also known as the exception reason), and jumps to el0_svc(..).
el0t_64_sync_handler(struct pt_regs *regs)
{
    unsigned long esr = read_sysreg(esr_el1);

    switch (ESR_ELx_EC(esr)) {
    case ESR_ELx_EC_SVC64:
        el0_svc(regs);
        break;
    ...
    }
}
From here on, a code diagram is easier to follow than words. (The code is based on v5.15.)
el0_svc(struct pt_regs *regs)
{
    ...
    do_el0_svc(regs);
    ...   |
}         |
    +-----+
    |
    V
do_el0_svc(struct pt_regs *regs)
{
    ...
    el0_svc_common(regs, regs->regs[8],
           |       __NR_syscalls,
    ...    |       sys_call_table);
}          |
    +------+
    |
    V
el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
               const syscall_fn_t syscall_table[])
{
    ...
    invoke_syscall(regs, scno, sc_nr, syscall_table);
    ...
}
We're almost at our destination now. scno comes from the x8 register (again, it holds the system-call number), and invoke_syscall(..) looks up the system-call function in syscall_table using scno as the index. Finally, it calls that function to carry out what was requested.
invoke_syscall(struct pt_regs *regs, unsigned int scno,
               unsigned int sc_nr,
               const syscall_fn_t syscall_table[])
{
    ...
    if (scno < sc_nr) {
        syscall_fn_t syscall_fn;
        syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)];
        ret = __invoke_syscall(regs, syscall_fn);
    }                |
    ...              |
}                    |
    +----------------+
    |
    V
__invoke_syscall(struct pt_regs *regs, syscall_fn_t syscall_fn)
{
    return syscall_fn(regs);
}
You may wonder: each system call takes a different number of parameters, yet syscall_fn(..) takes only one, regs. Let's look at two cases in code, one that takes no parameters and another that takes five.
fork() takes no parameters, so the struct pt_regs object passed to syscall_fn is unused.
#define SYSCALL_DEFINE0(sname) \
...
asmlinkage long __arm64_sys_##sname(const struct pt_regs *__unused)
On the other hand, clone() takes five parameters, so the struct pt_regs object is expanded into that many arguments by SC_ARM64_REGS_TO_ARGS(..) and __MAP(..).
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
             |  int __user *, parent_tidptr,
             |  unsigned long, tls,
             |  int __user *, child_tidptr)
             |
    +--------+
    |
    V
#define __SYSCALL_DEFINEx(x, name, ...) \
    ...
    __arm64_sys##name(const struct pt_regs *regs) \
    { \
        return __se_sys##name(SC_ARM64_REGS_TO_ARGS(x,__VA_ARGS__)); \
    } \          |
        +--------+
        |
        V
    __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
    { \
        long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
        ...             |
        return ret; \   |
    } \                 |
        +---------------+
        |
        V
    __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
We have walked through the system-call path from el0 to el1. It wasn't a long journey, but it wasn't an easy one either. I hope this tiny map (I like using metaphors) guides you to where you want to be.
Happy hacking!