Alex Dzyoba
Posted on July 9, 2018
The other day I decided to tackle the popular problem: how do you sort 1 million integers in 1 MiB of RAM?
But before I even started, another question came up – how do I restrict a process's memory to 1 MiB in the first place? Will it even work? So, here are the answers.
Process virtual memory
What you have to know before diving into the various methods is how a process's virtual memory is structured. Hands down, the best article you could ever find about that is Gustavo Duarte's "Anatomy of a Program in Memory". His whole blog is a treasure.
After reading Gustavo's article I can propose 2 possible options for restricting memory – reducing the virtual address space and restricting the heap size.
The first is to limit the whole virtual address space of the process. This is nice and easy but not entirely correct: we can't limit the virtual address space to 1 MiB, because then we won't be able to map the kernel and shared libraries.
The second is to limit the heap size. This is not so easy, and it seems like nobody even tries, because the only reasonable way to do it is by playing with the linker. But for limiting available memory to values as small as 1 MiB it would be absolutely correct.
Also, I will look at other methods, like monitoring memory consumption by intercepting memory-management-related library and system calls, and changing the program environment with emulation and sandboxing.
For testing and illustration I will use this little program, big_alloc, that allocates (and frees) 100 MiB:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

// 1000 allocations of 100 KiB each = 100 000 KiB ≈ 100 MiB
#define NALLOCS 1000
#define ALLOC_SIZE 1024*100 // 100 KiB

int main(int argc, const char *argv[])
{
    int i = 0;
    int **pp;
    bool failed = false;

    // calloc so that unused slots stay NULL for the cleanup loop below.
    pp = calloc(NALLOCS, sizeof(int *));

    for (i = 0; i < NALLOCS; i++)
    {
        pp[i] = malloc(ALLOC_SIZE);
        if (!pp[i])
        {
            perror("malloc");
            printf("Failed after %d allocations\n", i);
            failed = true;
            break;
        }

        // Touch some bytes in memory to trick copy-on-write.
        memset(pp[i], 0xA, 100);

        printf("pp[%d] = %p\n", i, pp[i]);
    }

    if (!failed)
        printf("Successfully allocated %d bytes\n", NALLOCS * ALLOC_SIZE);

    for (i = 0; i < NALLOCS; i++)
    {
        if (pp[i])
            free(pp[i]);
    }

    free(pp);

    return 0;
}
All the sources are on GitHub.
ulimit
It’s the first thing that old unix hacker can think of when asked to limit program memory. ulimit
is a bash utility that allows you to restrict program resources, and is just interface for setrlimit
.
We can set the limit to resident memory size.
$ ulimit -m 1024
Now check:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7802
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) 1024
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
We set the limit to 1024 kbytes (-m), i.e. 1 MiB. But when we run our program it doesn't fail. Setting the limit to something more reasonable like 30 MiB still lets our program allocate 100 MiB. ulimit simply doesn't work: despite the resident set size limit of 1024 kbytes, top shows the resident memory of my program at 4872 KiB.
The reason is that Linux doesn't honor this limit, and the man page says so directly:
ulimit [-HSTabcdefilmnpqrstuvx [limit]]
...
-m The maximum resident set size (many systems do not honor this limit)
...
There is also ulimit -d, which the kernel does respect, but the program still works because malloc falls back to mmap (see the Linker chapter).
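For completeness: since ulimit is just a front end for setrlimit, you can impose the same (non-)limit programmatically. Here is a minimal sketch that sets RLIMIT_RSS – which Linux ignores, just like ulimit -m – and then execs a target program:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    // 1 MiB resident set size limit - the same thing `ulimit -m 1024` does.
    // Linux ignores RLIMIT_RSS, so this is a no-op in practice.
    struct rlimit rl = { 1024 * 1024, 1024 * 1024 };

    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    if (setrlimit(RLIMIT_RSS, &rl) < 0)
        perror("setrlimit");

    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}

Swap in RLIMIT_DATA or RLIMIT_AS and you get the programmatic equivalents of ulimit -d and ulimit -v.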
QEMU
When you want to modify a program's environment, QEMU is a natural tool for this kind of task. It has a -R option to limit the guest virtual address space. But as I said earlier, you can't restrict the address space to small values – there would be no space to map libc and the kernel.
Look:
$ qemu-i386 -R 1048576 ./big_alloc
big_alloc: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory
Here, -R 1048576 reserves 1 MiB for the guest virtual address space.
For the whole virtual address space we have to set something more reasonable, like 20 MB. Look:
$ qemu-i386 -R 20M ./big_alloc
malloc: Cannot allocate memory
Failed after 100 allocations
It successfully fails[1] after 100 allocations (10 MB).
So, QEMU is the first winner at restricting a program's memory, though you have to play with the -R value to get the right limit.
Container
Another option after QEMU is to launch the application in a container and restrict its resources. To do this you have several options:
- Use fancy high-level docker.
- Use regular usermode tools from the lxc package.
- Go hardcore and write your own script with libvirt.
- You name it…
But in the end, resources will be restricted by the native Linux subsystem called cgroups. You can poke it directly (see the sketch below), but I suggest using lxc. I would like to use docker, but it works only on 64-bit machines, and my box is a small Intel Atom netbook which is i386.
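If you do want to poke cgroups directly, the idea is roughly the following. This is a minimal sketch assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory, a made-up group name restrict-demo, root privileges, and almost no error handling – tooling like lxc essentially does the same thing for you:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

#define CG "/sys/fs/cgroup/memory/restrict-demo"

static void write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(value, f);
    fclose(f);
}

int main(int argc, char *argv[])
{
    char pid[32];

    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    mkdir(CG, 0755); // create the cgroup

    write_file(CG "/memory.limit_in_bytes", "2097152");       // 2 MiB
    write_file(CG "/memory.memsw.limit_in_bytes", "2097152"); // needs swap accounting enabled

    // Move ourselves into the cgroup; the exec'ed program inherits it.
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file(CG "/tasks", pid);

    execvp(argv[1], &argv[1]);
    perror("execvp");
    return 1;
}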
Ok, quick info. LXC stands for LinuX Containers. It's a collection of userspace tools and libraries for managing kernel facilities to create containers – isolated and secure environments for an application or a whole system.
Kernel facilities that provide such environment are:
- Control groups (cgroups)
- Kernel namespaces
- chroot
- Kernel capabilities
- SELinux, AppArmor
- Seccomp policies
You can find nice documentation on the official site, on the author's blog and all over the internet.
To simply run an application in a container you have to provide a config to lxc-execute, where you configure your container. Every sane person should start from the examples in /usr/share/doc/lxc/examples. The man page recommends starting with lxc-macvlan.conf. Ok, let's do this:
# cp /usr/share/doc/lxc/examples/lxc-macvlan.conf lxc-my.conf
# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
Successfully allocated 102400000 bytes
It works!
Now let’s limit memory. This is what cgroup for. LXC allows you to configure memory subsystem for container’s cgroup by setting memory limits.
You can find available tunable parameters for memory subsystem in this fine RedHat manual. I’ve found 2:
- memory.limit_in_bytes – sets the maximum amount of user memory (including file cache)
- memory.memsw.limit_in_bytes – sets the maximum amount for the sum of memory and swap usage
Here is what I added to lxc-my.conf:
lxc.cgroup.memory.limit_in_bytes = 2M
lxc.cgroup.memory.memsw.limit_in_bytes = 2M
Launch again:
# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
#
Nothing happened – looks like this is way too little memory. Let's try to launch a shell in the container instead.
# lxc-execute -n foo -f ./lxc-my.conf /bin/bash
#
Looks like bash failed to launch. Let's try /bin/sh:
# lxc-execute -n foo -f ./lxc-my.conf -l DEBUG -o log /bin/sh
sh-4.2# ./dev/big_alloc/big_alloc
Killed
Yay! We can see this nice act of killing in dmesg:
[15447.035569] big_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[15447.035779] Task in /lxc/foo
[15447.035785] killed as a result of limit of
[15447.035789] /lxc/foo
[15447.035795] memory: usage 3072kB, limit 3072kB, failcnt 127
[15447.035800] memory+swap: usage 3072kB, limit 3072kB, failcnt 0
[15447.035805] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[15447.035808] Memory cgroup stats for /lxc/foo: cache:32KB rss:3040KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:1588KB active_anon:1448KB inactive_file:16KB active_file:16KB unevictable:0KB
[15447.035836] [pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[15447.035963] [9225] 0 9225 942 308 10 0 0 init.lxc
[15447.035971] [9228] 0 9228 833 698 6 0 0 sh
[15447.035978] [9252] 0 9252 16106 843 36 0 0 big_alloc
[15447.035983] Memory cgroup out of memory: Kill process 9252 (big_alloc) score 1110 or sacrifice child
[15447.035990] Killed process 9252 (big_alloc) total-vm:64424kB, anon-rss:2396kB, file-rss:976kB
Though we haven’t seen an error message from big_alloc
about malloc failure and how much memory we were able to get, I think we’ve successfully restricted memory via container technology and can stop with it for now.
Linker
Now, let’s try to modify binary image limiting space available for the heap.
Linking is the final part of building a program and it implies using linker and linker script. Linker script is a description of program sections in memory along with its attributes and stuff.
Here is a simple linker script:
ENTRY(main)
SECTIONS
{
    . = 0x10000;
    .text : { *(.text) }
    . = 0x8000000;
    .data : { *(.data) }
    .bss : { *(.bss) }
}
The dot is the current location counter. What this script tells us is that the .text section starts at address 0x10000, and then, starting from 0x8000000, we have 2 subsequent sections, .data and .bss. The entry point is main.
Nice and sweet, but it won't work for any useful application. The reason is that the main function you write in C programs is not actually the first function called. There is a whole lot of initialization and cleanup code; that code comes with the C runtime (shortened to crt) and is spread across the crt*.o object files in /usr/lib.
You can see the exact details if you launch gcc with the -v option. You'll see that it first invokes cc1 to create assembly, then translates it to an object file with as, and finally combines everything into an ELF file with collect2. collect2 is an ld wrapper. It takes your object file and 5 additional object files to create the final binary image:
- /usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crt1.o
- /usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crti.o
- /usr/lib/gcc/i686-redhat-linux/4.8.3/crtbegin.o
- /tmp/ccEZwSgF.o <-- This one is our program's object file
- /usr/lib/gcc/i686-redhat-linux/4.8.3/crtend.o
- /usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crtn.o
It’s really complicated so instead of writing my own script I’ll modify default linker script. Get default linker script passing -Wl,-verbose
to gcc
:
gcc big_alloc.c -o big_alloc -Wl,-verbose
Now let’s figure out how to modify it. Let’s see how our binary is built by default. Compile it and look for .data
section address. Here is objdump -h
output
big_alloc
Sections:
Idx Name Size VMA LMA File off Algn
...
12 .text 000002e4 080483e0 080483e0 000003e0 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
...
23 .data 00000004 0804a028 0804a028 00001028 2**2
CONTENTS, ALLOC, LOAD, DATA
24 .bss 00000004 0804a02c 0804a02c 0000102c 2**2
ALLOC
The .text, .data and .bss sections are located near 128 MiB.
Now, let’s see where is the stack with help of gdb:
[restrict-memory]$ gdb big_alloc
...
Reading symbols from big_alloc...done.
(gdb) break main
Breakpoint 1 at 0x80484fa: file big_alloc.c, line 12.
(gdb) r
Starting program: /home/avd/dev/restrict-memory/big_alloc
Breakpoint 1, main (argc=1, argv=0xbffff164) at big_alloc.c:12
12 int i = 0;
Missing separate debuginfos, use: debuginfo-install glibc-2.18-16.fc20.i686
(gdb) info registers
eax 0x1 1
ecx 0x9a8fc98f -1701852785
edx 0xbffff0f4 -1073745676
ebx 0x42427000 1111650304
esp 0xbffff0a0 0xbffff0a0
ebp 0xbffff0c8 0xbffff0c8
esi 0x0 0
edi 0x0 0
eip 0x80484fa 0x80484fa <main+10>
eflags 0x286 [PF SF IF]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
esp points to 0xbffff0a0, which is near 3 GiB. So we have ~2.9 GiB for the heap.
In the real world, the stack top address is randomized; you can see it, for example, in the output of
# cat /proc/self/maps
As we all know, the heap grows up from the end of .data towards the stack. What if we move the .data section to the highest possible address?
Let’s put data segment 2 MiB before stack. Take stack top, subtract 2 MiB:
0xbffff0a0 - 0x200000 = 0xbfdff0a0
Now shift all sections starting with .data to that address:
. = 0xbfdff0a0;
.data :
{
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
}
Compile it:
$ gcc big_alloc.c -o big_alloc -Wl,-T hack.lst
-Wl is a gcc option that passes what follows to the linker, and -T hack.lst is the linker option itself. It tells the linker to use hack.lst as the linker script.
Now, if we look at the section headers, we'll see:
Sections:
Idx Name Size VMA LMA File off Algn
...
23 .data 00000004 bfdff0a0 bfdff0a0 000010a0 2**2
CONTENTS, ALLOC, LOAD, DATA
24 .bss 00000004 bfdff0a4 bfdff0a4 000010a4 2**2
ALLOC
So .data now sits right below the stack, leaving almost no room for the classic heap. But nevertheless, the program successfully allocates its 100 MiB. How? That's really neat. When I looked at the pointer values that malloc returns, I saw that allocation starts somewhere past the end of the .data section, around 0xbf8b7000, continues for a while with increasing pointers, and then the pointers reset to a lower address like 0xb7676000. From that address it allocates for a while, pointers increasing, and then the pointers reset again to an even lower address like 0xb5e76000. Eventually it looks like the heap is growing down!
But if you think about it for a minute, it isn't really that strange. I examined some glibc sources and found out that when brk fails, malloc uses mmap instead. So glibc asks the kernel to map some pages; the kernel sees that the process has lots of holes in its virtual address space and maps pages from that space; and finally glibc returns pointers from those pages.
Running big_alloc under strace confirmed the theory. Just look at the normal binary:
brk(0) = 0x8135000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77df000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77c7000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77c6000
mprotect(0x42425000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x42269000, 4096, PROT_READ) = 0
munmap(0xb77c7000, 95800) = 0
brk(0) = 0x8135000
brk(0x8156000) = 0x8156000
brk(0) = 0x8156000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77de000
brk(0) = 0x8156000
brk(0x8188000) = 0x8188000
brk(0) = 0x8188000
brk(0x81ba000) = 0x81ba000
brk(0) = 0x81ba000
brk(0x81ec000) = 0x81ec000
...
brk(0) = 0x9c19000
brk(0x9c4b000) = 0x9c4b000
brk(0) = 0x9c4b000
brk(0x9c7d000) = 0x9c7d000
brk(0) = 0x9c7d000
brk(0x9caf000) = 0x9caf000
...
brk(0) = 0xe29c000
brk(0xe2ce000) = 0xe2ce000
brk(0) = 0xe2ce000
brk(0xe300000) = 0xe300000
brk(0) = 0xe300000
brk(0) = 0xe300000
brk(0x8156000) = 0x8156000
brk(0) = 0x8156000
+++ exited with 0 +++
And now the modified binary:
brk(0) = 0xbf896000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778f000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7777000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7776000
mprotect(0x42425000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x42269000, 4096, PROT_READ) = 0
munmap(0xb7777000, 95800) = 0
brk(0) = 0xbf896000
brk(0xbf8b7000) = 0xbf8b7000
brk(0) = 0xbf8b7000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778e000
brk(0) = 0xbf8b7000
brk(0xbf8e9000) = 0xbf8e9000
brk(0) = 0xbf8e9000
brk(0xbf91b000) = 0xbf91b000
brk(0) = 0xbf91b000
brk(0xbf94d000) = 0xbf94d000
brk(0) = 0xbf94d000
brk(0xbf97f000) = 0xbf97f000
...
brk(0) = 0xbff8e000
brk(0xbffc0000) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0xbfff2000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7676000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7576000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7476000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7376000
...
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1c76000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1b76000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1a76000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
...
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
+++ exited with 0 +++
That being said, shifting the .data section up towards the stack (thus reducing the space for the heap) is pointless, because the kernel will map pages for malloc from the empty areas of the virtual address space.
Sandbox
The other way to restrict program memory is sandboxing. The difference from emulation is that we're not really emulating anything; instead, we track and control certain aspects of the program's behavior. Sandboxing is usually used in security research, when you have some kind of malware and need to analyze it without harming your system.
I've come up with several sandboxing methods and implemented the most promising ones.
LD_PRELOAD trick
LD_PRELOAD is a special environment variable that, when set, makes the dynamic linker use the "preloaded" library before any other library, including libc. It's used in a lot of scenarios, from debugging to, well, sandboxing. This trick is also infamously used by some malware.
I have written a simple memory management sandbox that intercepts malloc/free calls, does memory usage accounting and returns ENOMEM if the memory limit is exceeded.
To do this, I wrote a shared library with my own malloc/free wrappers that increment a counter by the malloc'ed size and decrement it when free is called. This library is preloaded with LD_PRELOAD when running the application under test.
Here is my malloc implementation:
void *malloc(size_t size)
{
    void *p = NULL;

    if (libc_malloc == NULL)
        save_libc_malloc();

    if (mem_allocated <= MEM_THRESHOLD)
    {
        p = libc_malloc(size);
    }
    else
    {
        errno = ENOMEM;
        return NULL;
    }

    if (!no_hook)
    {
        no_hook = 1;
        account(p, size);
        no_hook = 0;
    }

    return p;
}
libc_malloc is a pointer to the original malloc from libc. no_hook is a thread-local flag; it's used to be able to call malloc inside the malloc hooks and avoid recursive calls – an idea taken from Tetsuyuki Kobayashi's presentation.
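The original malloc pointer is presumably obtained with dlsym(RTLD_NEXT, ...). Here is a minimal sketch of how save_libc_malloc might look, following the names above, with error handling reduced to a bare minimum:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

/* Pointer to the real malloc from libc, filled in lazily. */
static void *(*libc_malloc)(size_t size);

static void save_libc_malloc(void)
{
    /* RTLD_NEXT means "the next object in the search order after us",
     * which is libc when this library is preloaded. */
    libc_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    if (!libc_malloc)
        abort();
}

Such a library is typically built as a shared object with something like gcc -shared -fPIC -o libmemrestrict.so memrestrict.c -ldl (file names here are made up).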
malloc is used implicitly in the account function by the uthash hash table library. Why use a hash table? Because when you call free you only pass it the pointer, so inside free you don't know how much memory had been allocated. That's why I keep a hash table with the pointer as the key and the allocated size as the value. Here is what I do on malloc:
struct malloc_item *item, *out;

item = malloc(sizeof(*item));
item->p = ptr;
item->size = size;

HASH_ADD_PTR(HT, p, item);

mem_allocated += size;

fprintf(stderr, "Alloc: %p -> %zu\n", ptr, size);
mem_allocated is the static variable that is compared against the threshold in malloc.
Now, when free is called, here is what happens:
struct malloc_item *found;

HASH_FIND_PTR(HT, &ptr, found);
if (found)
{
    mem_allocated -= found->size;
    fprintf(stderr, "Free: %p -> %zu\n", found->p, found->size);
    HASH_DEL(HT, found);
    free(found);
}
else
{
    fprintf(stderr, "Freeing unaccounted allocation %p\n", ptr);
}
Yep, just decrement mem_allocated. It's that simple.
But the really cool thing is that it works rock solid[2].
[restrict-memory]$ LD_PRELOAD=./libmemrestrict.so ./big_alloc
pp[0] = 0x25ac210
pp[1] = 0x25c5270
pp[2] = 0x25de2d0
pp[3] = 0x25f7330
pp[4] = 0x2610390
pp[5] = 0x26293f0
pp[6] = 0x2642450
pp[7] = 0x265b4b0
pp[8] = 0x2674510
pp[9] = 0x268d570
pp[10] = 0x26a65d0
pp[11] = 0x26bf630
pp[12] = 0x26d8690
pp[13] = 0x26f16f0
pp[14] = 0x270a750
pp[15] = 0x27237b0
pp[16] = 0x273c810
pp[17] = 0x2755870
pp[18] = 0x276e8d0
pp[19] = 0x2787930
pp[20] = 0x27a0990
malloc: Cannot allocate memory
Failed after 21 allocations
The full source code for the library is on GitHub.
So, LD_PRELOAD is a great way to restrict memory!
ptrace
ptrace is another feature that can be used to build a memory sandbox. ptrace is a system call that allows you to control the execution of another process. It's available in various POSIX operating systems including, of course, Linux.
ptrace is the foundation of tracers like strace and ltrace, of almost every sandboxing tool like systrace, sydbox and mbox, and of all debuggers including gdb itself.
I have built a custom tool with ptrace. It traces brk calls and tracks the distance between the initial program break value and the new value set by each subsequent brk call.
The tool forks and becomes 2 processes. The parent process is the tracer and the child process is the tracee. In the child I call ptrace(PTRACE_TRACEME) and then execv. In the parent, I use ptrace(PTRACE_SYSCALL) to stop on syscall entry and filter out brk calls from the child, and then another ptrace(PTRACE_SYSCALL) to get the brk return value.
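Stripped of the brk bookkeeping, the tracer skeleton looks roughly like this – an i386-only sketch, not the tool's actual code, with error handling omitted:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    pid_t pid;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    pid = fork();
    if (pid == 0) {
        // Child: ask to be traced and exec the target program.
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execv(argv[1], &argv[1]);
        perror("execv");
        return 1;
    }

    waitpid(pid, NULL, 0); // child is stopped after execv

    for (;;) {
        int status;
        struct user_regs_struct regs;

        // Run the child until the next syscall entry or exit.
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
        waitpid(pid, &status, 0);
        if (WIFEXITED(status))
            break;

        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        if (regs.orig_eax == SYS_brk) {
            // Run to the syscall exit: eax now holds the brk return value,
            // which can be inspected or overwritten with PTRACE_SETREGS.
            ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
            waitpid(pid, &status, 0);
            if (WIFEXITED(status))
                break;
            ptrace(PTRACE_GETREGS, pid, NULL, &regs);
            printf("brk returned 0x%08lx\n", (long)regs.eax);
        }
    }

    return 0;
}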
When brk exceeds the threshold, I set -ENOMEM as the brk return value. It is returned in the eax register, so I just overwrite it with ptrace(PTRACE_SETREGS). Here is the meaty part:
// Get return value
if (!syscall_trace(pid, &state))
{
    dbg("brk return: 0x%08X, brk_start 0x%08X\n", state.eax, brk_start);

    if (brk_start) // We have start of brk
    {
        diff = state.eax - brk_start;

        // If child process exceeded threshold
        // replace brk return value with -ENOMEM
        if (diff > THRESHOLD || threshold)
        {
            dbg("THRESHOLD!\n");
            threshold = true;
            state.eax = -ENOMEM;
            ptrace(PTRACE_SETREGS, pid, 0, &state);
        }
        else
        {
            dbg("diff 0x%08X\n", diff);
        }
    }
    else
    {
        dbg("Assigning 0x%08X to brk_start\n", state.eax);
        brk_start = state.eax;
    }
}
Also, I intercept mmap/mmap2 calls, because libc is smart enough to call them when brk fails. So when the threshold is exceeded and I see mmap calls, I just fail them with ENOMEM.
It works!
[restrict-memory]$ ./ptrace-restrict ./big_alloc
pp[0] = 0x8958fb0
pp[1] = 0x8971fb8
pp[2] = 0x898afc0
pp[3] = 0x89a3fc8
pp[4] = 0x89bcfd0
pp[5] = 0x89d5fd8
pp[6] = 0x89eefe0
pp[7] = 0x8a07fe8
pp[8] = 0x8a20ff0
pp[9] = 0x8a39ff8
pp[10] = 0x8a53000
pp[11] = 0x8a6c008
pp[12] = 0x8a85010
pp[13] = 0x8a9e018
pp[14] = 0x8ab7020
pp[15] = 0x8ad0028
pp[16] = 0x8ae9030
pp[17] = 0x8b02038
pp[18] = 0x8b1b040
pp[19] = 0x8b34048
pp[20] = 0x8b4d050
malloc: Cannot allocate memory
Failed after 21 allocations
But… I don’t really like it. It’s ABI specific, i.e. it has to use rax
instead of eax
on 64-bit machine, so either I make a different version of that tool or use #ifdef
to cope with ABI differences or make you build it with -m32
option. But that’s not usable. Also, it probably won’t work on other POSIX like systems, because they might have different ABI.
Other
There are also other things one may try, which I rejected for various reasons:
- malloc hooks. Deprecated, as the man page says, so I didn't bother trying them.
- Seccomp and prctl with PR_SET_MM_START_BRK. This might work, but as the seccomp filtering kernel documentation says, it's not sandboxing but a "mechanism for minimizing the exposed kernel surface". So I guess it would be even more awkward than using ptrace by hand. Though I might look at it sometime.
- libvirt-sandbox. Nope, it's just a wrapper over lxc and qemu.
- SELinux sandbox. Nope, it just doesn't work, though it uses cgroups.
Recap
In the end, I'd like to recap:
- There are a lot of ways to restrict memory:
  - Resource limiting with ulimit and cgroups
  - Running under an emulator like QEMU
  - Sandboxing with LD_PRELOAD and ptrace
  - Modifying segments in the binary image
- But not all of them work:
  - ulimit doesn't work.
  - cgroups kinda work – by crashing the application.
  - Emulation works – by crashing the application.
  - LD_PRELOAD works amazingly!
  - ptrace works well enough, but is ABI dependent.
  - Linker magic doesn't work, because the ingenious libc falls back to mmap.
References
- Gustavo Duarte’s article again.
- Limiting time and memory consumption of a program in Linux.
- Linux sandboxing
- I think I’ve just invented new term for QA guys. [return]
- Unless application itself uses LD_PRELOAD :-\ [return]