Disclaimer: the proposed approach uses dirty hacks & patches and was tested on x86_64 only, so use it at your own risk. Also, no ChatGPT or any other Artificial Idiots were used for this research

Let's assume that we have a shared library (for example an R extension or a Python module) and we want to know where and why it is spending many hours and consuming megawatts of electricity. There is even a semi-official way to do this:

compile the shared library with the -pg option
set envvar LD_PROFILE_OUTPUT to the directory where you want to store profiling data
set envvar LD_PROFILE to the filename of the library to profile
run your program. Well, it sounds like there are lots of things to do before this step, and you can't set up profiling dynamically
run sprof on the profiling log

Unfortunately this method just doesn't work - sprof fails with a cryptic message
Inconsistency detected by ld.so: dl-open.c: 890: _dl_open: Assertion `_dl_debug_initialize (0, args.nsid)->r_state == RT_CONSISTENT' failed!

It seems this long-lived bug has been known since 2017 and is still not fixed

Let's try to find a more reliable way, starting with an inspection of the code generated for profiling

gcc -pg

You can see in the disassembly that almost every function has a call to _mcount right after the prologue. It is called via the GOT, thus avoiding PLT black magic. Interesting to note that it has no args - the prototype is just void mcount(void) - and this leads to two consequences

it extracts the return address from the stack
and so it cannot be called as an ordinary C function - you need to add an asm thunk like "jmp [mcount_ptr]" in your wrapper

Besides, ELF executables have a couple more profile-related calls in the __gmon_start__ function: __monstartup, and _mcleanup registered with atexit.

Prototype for monstartup:
void monstartup (char *low_pc, char *high_pc)
It is called to start collecting profiling data, and all mcount calls must be located within the range low_pc .. high_pc

Function _mcleanup writes profiler logs to disk and has the prototype

void _mcleanup (void)

Shared libraries have only the mcount calls, and this is why you can't start profiling right after loading a shared library - you need some code to start/stop profiling

As it happens, -pg is not the only option for profiling - there is a second one:

gcc -finstrument-functions

For FUNCTION_DECL gcc has the attribute DECL_NO_INSTRUMENT_FUNCTION_ENTRY_EXIT

to indicate that function entry and exit should be instrumented with calls to support routines

These "support routines" are

void __cyg_profile_func_enter(void *this_fn, void *call_site)

void __cyg_profile_func_exit(void *this_fn, void *call_site)

It seems you can even write your own profiler from scratch! It is important to note that, at least with the x86_64 ABI, you can patch the GOT pointer of __cyg_profile_func_enter to point to mcount, because the first 2 arguments are passed in registers (and mcount takes what it needs from the stack anyway)
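A minimal sketch of such a home-made profiler, assuming all you want is the number of calls and the deepest nesting level. Only the two hook prototypes come from gcc; the counters prof_calls, prof_depth and prof_max_depth are names made up for this example. The hooks are marked no_instrument_function so that they are not instrumented themselves when the file is built with -finstrument-functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters, not part of any gcc/glibc API. */
static unsigned long prof_calls;      /* total number of function entries */
static unsigned long prof_depth;      /* current nesting depth */
static unsigned long prof_max_depth;  /* deepest nesting seen so far */

/* Called by instrumented code right after function entry. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    prof_calls++;
    if (++prof_depth > prof_max_depth)
        prof_max_depth = prof_depth;
}

/* Called by instrumented code right before function exit. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (prof_depth > 0)   /* tolerate unbalanced exits instead of wrapping */
        prof_depth--;
}
```

A real profiler would probably key its statistics by this_fn and take timestamps; this sketch only shows where such code would live.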

making API for profiling

So the main idea is to write some code to

check if the desired shared library can be profiled (e.g. has mcount or __cyg_profile_func_enter in GOT) - I used ELFIO but perhaps this can be done with LIEF too
patch GOT
find addresses of the loaded shared library - this can be done with dl_iterate_phdr
call monstartup to start profiling
run your code from the profiled library
call _mcleanup to stop profiling and store the results
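The address lookup step from the list above could be sketched like this. It is not the author's code: struct mod_range, phdr_cb and find_module are hypothetical names, and the sketch assumes glibc on Linux, where dl_iterate_phdr needs _GNU_SOURCE. It records the load base and the range covered by executable PT_LOAD segments of the first loaded object whose path contains a given substring.

```c
#define _GNU_SOURCE   /* for dl_iterate_phdr; must come before any #include */
#include <assert.h>
#include <link.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct mod_range {
    const char *needle;            /* substring of the library path to look for */
    uintptr_t base;                /* load base (dlpi_addr) */
    uintptr_t text_low, text_high; /* range covered by executable PT_LOAD segments */
    int found;
};

static int phdr_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct mod_range *r = data;
    (void)size;
    if (!info->dlpi_name || !strstr(info->dlpi_name, r->needle))
        return 0;                  /* not our module, keep iterating */
    r->base = info->dlpi_addr;
    r->text_low = UINTPTR_MAX;
    r->text_high = 0;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_LOAD || !(ph->p_flags & PF_X))
            continue;              /* only executable segments matter here */
        uintptr_t start = info->dlpi_addr + ph->p_vaddr;
        uintptr_t end = start + ph->p_memsz;
        if (start < r->text_low)
            r->text_low = start;
        if (end > r->text_high)
            r->text_high = end;
    }
    r->found = r->text_high > r->text_low;
    return 1;                      /* non-zero stops the iteration */
}

/* Returns 1 and fills *r if a loaded object whose path contains needle was found. */
static int find_module(const char *needle, struct mod_range *r)
{
    assert(needle && r);
    memset(r, 0, sizeof(*r));
    r->needle = needle;
    dl_iterate_phdr(phdr_cb, r);
    return r->found;
}
```

Under these assumptions text_low and text_high are the values one would hand to monstartup, and base is the number gprof needs to know later.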

Surprisingly, now we have a gmon.out file! Let's run

gprof path2your_shared_library gmon.out

and see sad results - gprof refuses to show anything. This is because it doesn't know at which address your shared library was loaded (and worse - this address differs on each run due to ASLR). So let's

patch gprof

It sounds scary, but in reality the most difficult thing was to find an unused letter in the getopt string. I chose -U to pass the image base, and the passed value is subtracted in function gmon_io_read_vma. Patch

Because you now also need to know the base address for gprof, I just add it to the name of gmon.out by setting GMON_OUT_PREFIX. Run the patched gprof:

gprof -b -U 7F9E01224000 ./libprelf.so gmon.7F9E01224000.3926320

Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 0.00      0.00     0.00    12552     0.00     0.00  ELFIO::endianess_convertor::operator()(unsigned long) const
 0.00      0.00     0.00     4608     0.00     0.00  ELFIO::section_impl<ELFIO::Elf64_Shdr>::get_size() const
 0.00      0.00     0.00     4398     0.00     0.00  ELFIO::section_impl<ELFIO::Elf64_Shdr>::get_entry_size() const
 0.00      0.00     0.00     2963     0.00     0.00  ELFIO::endianess_convertor::operator()(unsigned int) const
 0.00      0.00     0.00     2935     0.00     0.00  ELFIO::section_impl<ELFIO::Elf64_Shdr>::get_data() const

...
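What the -U option boils down to can be sketched in a few lines. This is not the actual gprof patch: base_from_gmon_name and rebase_vma are hypothetical helpers, and the sketch assumes the file name has the form shown in the command above, i.e. a prefix, a dot, the hex image base, and then the pid.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Extract the hex image base from a name like "gmon.7F9E01224000.3926320".
   Returns 0 on success, -1 if the name does not have the expected form. */
static int base_from_gmon_name(const char *name, uint64_t *base)
{
    assert(name && base);
    const char *p = strrchr(name, '/');      /* skip the directory part, if any */
    p = p ? p + 1 : name;
    const char *dot = strchr(p, '.');
    if (!dot)
        return -1;
    char *end;
    unsigned long long v = strtoull(dot + 1, &end, 16);
    if (end == dot + 1 || (*end != '.' && *end != '\0'))
        return -1;                           /* no hex digits, or trailing junk */
    *base = v;
    return 0;
}

/* Turn an address recorded at run time into an address relative to the image. */
static uint64_t rebase_vma(uint64_t runtime_addr, uint64_t image_base)
{
    return runtime_addr - image_base;
}
```

With the base subtracted, the addresses in the profile line up with the symbol table of the library on disk, no matter where ASLR placed it.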

Netfilter hooks can be used to run a shell when some magic packet is received: 1 2 3. As usual there is no tool to show installed netfilter hooks, so I added dumping of them (and at the same time of netfilter loggers) to my lkcd

Let's check where these hooks live inside the kernel. As a starting point we can review the source of the main function for installing hooks, nf_register_net_hooks, which leads to nf_hook_entry_head. We can notice that there are lots of locations for hooks:

field nf_hooks_ingress in net_device (when CONFIG_NETFILTER_INGRESS is enabled)
on newer kernels also field nf_hooks_egress in net_device (when CONFIG_NETFILTER_EGRESS is enabled)
lots of fields in struct netns_nf:
- hooks_ipv4
- hooks_ipv6
- hooks_arp (CONFIG_NETFILTER_FAMILY_ARP)
- hooks_bridge (CONFIG_NETFILTER_FAMILY_BRIDGE)
- hooks_decnet (CONFIG_NETFILTER_FAMILY_DECNET)
Also on old kernels (before 4.16) there was a single array hooks in netns_nf

results

lkmem -c -n ../unpacked/101 /boot/System.map-5.15.0-101-generic

...

2 nf hooks:
   [0] type 02 IPV4 idx 0 0xffffffffa7b84dd0 - kernel!apparmor_ipv4_postroute
   [1] type 10 IPV6 idx 0 0xffffffffa7b84e10 - kernel!apparmor_ipv6_postroute

As you may know, gcc always places string literals in section .rodata. Let's assume that we want to change this outrageous behavior - for example for ~~shellcode~~ literals used in function init_module (contained in section .init.text)

We can start by dumping gcc RTL - for something like printf("some string") the RTL will be symbol_ref <var_decl addr *.LC1> and in the .S file this looks like

.section .rodata
.LC1:
.string "some string"

That unnamed VAR_DECL has the attribute DECL_IN_CONSTANT_POOL. It is probably possible to make a gcc plugin that collects such literals referenced from functions inside a specific section and, instead of DECL_IN_CONSTANT_POOL, patches their section attribute. However this requires too much labour, so let's try something lazier

A possible solution is to explicitly set the section via gcc __attribute__:

#define RSection __attribute__ ((__section__ (".init.text")))
#define _RN(name) static const char rn_##name##__[] RSection =
#define _GN(name) rn_##name##__
...
_RN(dummy_str) "some string";
printf("%s\n", _GN(dummy_str));

It looks very ugly, especially because gcc cannot expand a macro like "##name##" inside double quotes. And even worse - this raises a compilation error:

error: ‘rn_dummy_str__’ causes a section type conflict with ‘init_module’
   11 | #define _RN(name) static const char rn_##name##__[] __attribute__ ((section (".init.text"))) =

How can we fix this problem? My first thought was to write a quick and dirty Perl script to scan sources for _RN markers and produce a .S file where all strings are placed in the right section. But then I decided to overcome my laziness and made a patch for gcc - it just checks if the passed declaration is initialized with a STRING_CST value. Surprisingly, it works!
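For contrast, here is a user-space sketch where the same attribute works without any gcc patch, because the target section holds only data and so there is nothing to conflict with. The section name init_strings and all identifiers are made up for the example. It also assumes an ELF target linked with GNU ld or lld, which synthesize __start_/__stop_ symbols for sections whose names are valid C identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical data-only section; "used" keeps the strings from being dropped. */
#define STR_SECTION __attribute__((section("init_strings"), used))

static const char msg_hello[] STR_SECTION = "some string";
static const char msg_bye[]   STR_SECTION = "another string";

/* Bounds of the section, provided by the linker. */
extern const char __start_init_strings[];
extern const char __stop_init_strings[];

/* Returns 1 if p points inside the custom section. */
static int in_init_strings(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)__start_init_strings &&
           a <  (uintptr_t)__stop_init_strings;
}
```

The conflict from the error message appears only when data with read-only flags is pushed into a section that already holds code, which is exactly the .init.text case.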

However, returning to the original task - all of this was in vain because the linux kernel (unlike Windows) cannot discard .init.text sections after driver loading. I wrote a simple Perl script to gather some stats about sections of loaded modules and it gave me

perl ms.pl | grep init.text
116 .init.text

opensource is disgusting as usual

As you may know there are two methods to inject a shared library into a process

1) using LD_PRELOAD. It doesn't work if you want to inject into an already running (perhaps even for many days) process

2) ptrace. It has the following inherent disadvantages

the target process can already be ptraced by somebody else
the victim program can detect ptrace
you just want to avoid something like this in the logs: ptrace attach of "./a.out"[PID] was attempted by "XXX"

So I developed very rough analogs of the famous VirtualAllocEx/VirtualProtectEx + a simple hook to hijack execution onto ~~shell~~code written in assembly that calls dlopen/dlsym. Currently only x86_64 is supported because I am too lazy to rewrite this asm stub

Prerequisites
You must have root privileges and be able to build and load kernel modules. I tested the code on kernels 6.8 and 5.15; it probably also works on 4.x, but I am not sure about older versions

Let's shed light on the dirty details in reverse order

Asm code in target process

Since my goal was to load an arbitrary shared library, the target process must at least have ld.so, so statically linked processes are immune to this injection (greetings to Go binaries). The asm code is pretty straightforward - it just calls dlopen, then dlsym("inject"), and calls what was returned

execution hijacking

Now we need to somehow intercept the normal execution flow and force the asm stub to run. As usual there are several methods

the good old malloc_hook trick. Actually I used free_hook too in my implementation
patch PLT/GOT. The presence of RELRO makes this method slightly harder, but don't forget that you have code in the kernel - a couple of additional mprotect calls can fix this
for libstdc++ there is also the set_new_handler function

run kernel code in context of right process

As far as I understand, the main reason the linux kernel never had analogs of VirtualAllocEx/VirtualProtectEx is that functions do_mmap/do_mprotect_pkey always operate on the current process. Ok, no big deal - we just must find a way to execute our code in the context of the right process. As usual there are plenty of ways to do this

kprobe

a probabilistic method - if you know which kernel functions are called most often by the target process, you can register a kprobe on one of them. The obvious drawback is that your kprobe will fire for all processes and this can lead to performance degradation

task_struct.restart_block

Unfortunately it doesn't work. I've added a -p option to my lkmem to show task details; for normal processes it looks like

PID 557580 at 0xffff90e2507b99c0
 thread.flags: 0
 flags: 400000
 sched_class: 0xffffffff976f4818 - kernel!fair_sched_class
 restart_block.fn: 0xffffffff960cfbc0 - kernel!do_no_restart_syscall

With patched restart_block.fn:

PID 557580 at 0xffff90e2507b99c0
 thread.flags: 0
 flags: 400000
 sched_class: 0xffffffff976f4818 - kernel!fair_sched_class
 restart_block.fn: 0xffffffffc12d9a80 - lkcd!main_horror

For some unknown reason restart_block.fn was never called

preempt_notifier_register

Don't ask me why, but the registered notifier was never called

task_works

I've chosen it for my implementation. Just a couple of functions: task_work_add & task_work_cancel. lkmem -p shows for such processes something like

PID 582439 at 0xffff90e151b8b380
 thread.flags: 0
 flags: 400000
 works_count: 1
  work[0] 0xffffffffc12d9a80 - lkcd!main_horror

process death notification

and finally the last piece - we should be able to cancel our injection in case the process suddenly dies (for many reasons - we are buggy, the process voluntarily decided to leave this cruel world, or it met the OOM killer etc). Otherwise we could wait for the result forever.

Actually this was the hardest part - until kernel 5.17 the grass was greener and the world was simpler. You could just call profile_event_register(PROFILE_TASK_EXIT). Then these evil clowns calling themselves "maintainers" killed it and instead ask you to use trace_sched_process_exit, which cannot even be found at elixir.bootlin, or another piece of dead code, register_trace_prio_sched_process_free/register_trace_prio_sched_process_exit (and they still can't decide which one is more trve). Anyway I've solved this problem ~~and am very proud of it~~

Putting all together

Let's inject some shared library into a test process
./test -t

pid 593502

on another console load the driver and run

./a.out -p 593502 -i `pwd`/test.so
dlopen 0x7f8747ea1e48 0x7f8747ea1e48
/usr/lib/x86_64-linux-gnu/libc-2.31.so base 7F34E1643000

wait

...

injected at 0x7f34e1ad2000

Now first console shows
[+] greeting from injected, addr 0x7f34e1ad2000

And yet another couple of proofs:

dmesg | tail

patch 7F34E182FB70 to 7F34E1AD2000 and 7F34E1831E48 to 7F34E1AD2009

grep 7f34e1ad2000 /proc/593502/maps

7f34e1aca000-7f34e1ad2000 r--p 00024000 103:02 20187524                  /usr/lib/x86_64-linux-gnu/ld-2.31.so
7f34e1ad2000-7f34e1ad3000 r-xp 00000000 00:00 0

grep test.so /proc/593502/maps

7f34e1875000-7f34e1876000 r--p 00000000 08:02 4456719                  /home/redp/lkcd/inject/test.so
7f34e1876000-7f34e1877000 r-xp 00001000 08:02 4456719                  /home/redp/lkcd/inject/test.so
7f34e1877000-7f34e1878000 r--p 00002000 08:02 4456719                  /home/redp/lkcd/inject/test.so
7f34e1878000-7f34e1879000 r--p 00002000 08:02 4456719                  /home/redp/lkcd/inject/test.so
7f34e1879000-7f34e187a000 rw-p 00003000 08:02 4456719                  /home/redp/lkcd/inject/test.so

Let's check what this stub should do after being injected into some linux process via __malloc_hook/__free_hook (btw this implicitly means that you cannot use this dirty hack for processes linked with musl or uClibc - they just don't have those hooks)

because our stub can be called from two different hooks, we should store somewhere which entry point we were called through
restore the old hook values
call dlopen/dlsym and then the target function (and pass it the address of the injection stub for delayed munmap. No, you can't free that memory directly in your target function - try to guess why)
get the right old hook and jump to it if it was installed, or just return to the code that called __malloc_hook somewhere in libc

So I collected all parameters needed for the job in a table dtab consisting of 6 pointers

__malloc_hook address
old value of __malloc_hook
__free_hook address
old value of __free_hook
pointer to dlopen
pointer to dlsym

After that table we also have a couple of string constants for the full path of injected.so and the function name. Also, because we must set up 2 entry points, I decided to put 1 byte with the distance between the first and the second one (to make the injection logic more universal) right after dtab. Sounds easy, so let's check how this logic can be implemented on some still living processors (given that alpha, sparc, hp-pa etc are RIP)
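The parameter block described above might be laid out like this. It is a sketch under assumptions: the field names, the helper build_params and the exact order of the two strings after the distance byte are guesses for illustration. Only "6 pointers, then 1 byte, plus two strings" comes from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The six pointers from the list above. Field names are made up. */
struct dtab {
    void *malloc_hook_addr;   /* address of __malloc_hook */
    void *old_malloc_hook;    /* its value before patching */
    void *free_hook_addr;     /* address of __free_hook */
    void *old_free_hook;      /* its value before patching */
    void *dlopen_ptr;
    void *dlsym_ptr;
};

/* Serialize dtab, the 1-byte distance between the two entry points and
   both strings into buf. Returns the total size, or 0 if buf is too small. */
static size_t build_params(uint8_t *buf, size_t cap, const struct dtab *t,
                           uint8_t entry_distance,
                           const char *so_path, const char *func_name)
{
    assert(buf && t && so_path && func_name);
    size_t path_len = strlen(so_path) + 1;    /* keep the trailing NUL */
    size_t name_len = strlen(func_name) + 1;
    size_t total = sizeof(*t) + 1 + path_len + name_len;
    if (total > cap)
        return 0;
    uint8_t *p = buf;
    memcpy(p, t, sizeof(*t));
    p += sizeof(*t);
    *p++ = entry_distance;                    /* right after dtab */
    memcpy(p, so_path, path_len);
    p += path_len;
    memcpy(p, func_name, name_len);
    return total;
}
```

Keeping everything in one position-independent blob means the stub needs only PC-relative access to find its parameters, which is why the per-architecture sections below care so much about PC-relative addressing.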

arm64

for some unknown reason they call it aarch64

ABI

Source, size 209 bytes with BTI c in prologues

arm64 has lots of really amazing features:

pre- & post-indexing, like in stp x29, x30, [sp, -16]! - first sp is decreased by 16 bytes and then 2 registers are pushed to the stack at once
cbz/cbnz instructions can compare a value with zero and branch in one operation
support for PC-relative addressing and, even better, if the desired address is located within ±1Mb you can use just one adr instruction, which leads to code size reduction/better cache utilization

Unfortunately the last feature is almost ignored by GCC, because for starters it just can't place constants into the .text section. I made a very primitive patch for it but this is just the beginning of the story. Let's check when GCC decides to use the short adr form. Constraint "Usa" means logical AND of

In practice this means almost never

PS: Visual Studio can take advantage of short PC-relative addressing by placing all constants inside pool behind function

mips32

This port was inspired by this cool X post. I dislike mips asm because of its strange out-of-order execution (branch delay slots) - you must think hard about which instruction you can place after literally each j/jal/bXX opcode, and this makes my brain (fostered on ~~z80~~ i386) seethe. Also mips doesn't have direct access to PC, so I used the old-school trick with jal and arithmetic on the $ra register

ABI

Source, size 234 bytes

loongarch

I just couldn't ignore it since a couple of years ago I made an IDA Pro processor module for loongarch. It suffers from strange opcodes for PC-relative addressing - e.g. you have to use the pair pcalau12i/addi even if the desired address is located within 12-bit range. And it's better for you not to even know how it loads a full 64-bit address

ABI

Source, size 242 bytes

Try to convince me that input_register_handle is not the best place for installing a keylogger; it's even strange that they were too embarrassed to connect their holy ~~cow~~ eBPF there. Long story short - there are 3 structures in the linux kernel for servicing input devices:

input_dev, chained in the (non-exported, sure) list input_dev_list
input_handler, chained in the list input_handler_list
input_handle with a pointer to input_handler, attached to input_dev (in list h_list)

So a keylogger could

just call input_register_handle
to be more stealthy - patch function pointers in an already registered input_handler (very convenient that sysrq_handler lacks the event method)
attach its own input_handle to the desired input_dev but without registering a corresponding input_handler - yes, this is perfectly legal
patch function pointers directly in input_dev

Guess in three tries what exactly you can extract from sysfs?
So I added dumping of all the above-mentioned structures to my lkcd. Sample of output

input handlers count: 7
 [0] input_handler at addr: 0xffffffff921dac40 - kernel!rfkill_handler
  Name: rfkill
  event: 0xffffffff90c91300 - kernel!rfkill_event
  connect: 0xffffffff90c91200 - kernel!rfkill_connect
  disconnect: 0xffffffff90c911d0 - kernel!rfkill_disconnect
  start: 0xffffffff90c915b0 - kernel!rfkill_start
 [1] input_handler at addr: 0xffffffff920faa60 - kernel!kbd_handler
  Name: kbd
  event: 0xffffffff907f5890 - kernel!kbd_event
  match: 0xffffffff907f3b80 - kernel!kbd_match
  connect: 0xffffffff907f3120 - kernel!kbd_connect
  disconnect: 0xffffffff907f30f0 - kernel!kbd_disconnect
  start: 0xffffffff907f39b0 - kernel!kbd_start
 [2] input_handler at addr: 0xffffffff920f9300 - kernel!sysrq_handler
  Name: sysrq
  filter: 0xffffffff907ef4f0 - kernel!sysrq_filter
  connect: 0xffffffff907eed20 - kernel!sysrq_connect
  disconnect: 0xffffffff907eeb60 - kernel!sysrq_disconnect
 [3] input_handler at addr: 0xffffffff921749e0 - kernel!mousedev_handler
  Name: mousedev
  event: 0xffffffff909e3360 - kernel!mousedev_event
  connect: 0xffffffff909e3d30 - kernel!mousedev_connect
  disconnect: 0xffffffff909e3c80 - kernel!mousedev_disconnect
 [4] input_handler at addr: 0xffffffff92174e40 - kernel!evdev_handler
  Name: evdev
  event: 0xffffffff909e63a0 - kernel!evdev_event
  events: 0xffffffff909e62e0 - kernel!evdev_events
  connect: 0xffffffff909e4e80 - kernel!evdev_connect
  disconnect: 0xffffffff909e4e20 - kernel!evdev_disconnect
 [5] input_handler at addr: 0xffffffffc075c0c0 - input_leds!input_leds_handler
  Name: leds
  event: 0xffffffffc075a000 - input_leds!input_leds_event
  connect: 0xffffffffc075a0f0 - input_leds!input_leds_connect
  disconnect: 0xffffffffc075a010 - input_leds!input_leds_disconnect
 [6] input_handler at addr: 0xffffffffc081d580 - joydev!joydev_handler
  Name: joydev
  event: 0xffffffffc0817d60 - joydev!joydev_event
  match: 0xffffffffc0817bf0 - joydev!joydev_match
  connect: 0xffffffffc08181a0 - joydev!joydev_connect
  disconnect: 0xffffffffc0818140 - joydev!joydev_disconnect
input devs count: 20
...
 [2] input_dev at addr: 0xffffa0bc453e5800
  name: AT Translated Set 2 keyboard
  phys: isa0060/serio0/input0
  handlers: 4
   [0] 0xffffffff920f9300 sysrq
   [1] 0xffffffff920faa60 kbd
   [2] 0xffffffff92174e40 evdev
   [3] 0xffffffffc075c0c0 leds
  setkeycode: 0xffffffff909dcca0 - kernel!input_default_setkeycode
  getkeycode: 0xffffffff909dd240 - kernel!input_default_getkeycode
  event: 0xffffffff909e7420 - kernel!atkbd_event

The linux kernel allows you to have discardable sections in an LKM, and this creates the problem of links between two kinds of memory. As you can guess, keeping a pointer to an already freed area can be very dangerous, so I made a simple tool, kotest, to check such links. It divides the sections of an ELF file into two categories and checks all relocations - relocs between areas of the same type are considered ok. To keep track of whether some symbol from the persistent area is used only from discardable sections I also use a couple of reference counts
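The bookkeeping described above can be modelled in a few lines. This is only an illustration of the idea and not kotest itself: the types, field names and helpers are hypothetical, and relocations are reduced to "which kind of section refers to which symbol".

```c
#include <assert.h>
#include <stddef.h>

enum area { AREA_PERSISTENT, AREA_DISCARDABLE };

/* One relocation: kind of the referring section and index of the target symbol. */
struct reloc {
    enum area from;
    size_t target_sym;
};

/* Per-symbol counters (hypothetical names). */
struct sym_refs {
    enum area home;          /* area the symbol itself lives in */
    unsigned from_persist;   /* references coming from persistent sections */
    unsigned from_discard;   /* references coming from discardable sections */
};

static void count_refs(struct sym_refs *syms, size_t nsyms,
                       const struct reloc *rels, size_t nrels)
{
    assert(syms && rels);
    for (size_t i = 0; i < nrels; i++) {
        if (rels[i].target_sym >= nsyms)
            continue;                        /* malformed relocation, skip it */
        if (rels[i].from == AREA_DISCARDABLE)
            syms[rels[i].target_sym].from_discard++;
        else
            syms[rels[i].target_sym].from_persist++;
    }
}

/* Persistent symbol used only from discardable code: candidate for moving. */
static int can_move_to_discardable(const struct sym_refs *s)
{
    return s->home == AREA_PERSISTENT && s->from_discard > 0 && s->from_persist == 0;
}

/* Discardable symbol referenced from persistent code: potential dangling pointer. */
static int is_dangerous_link(const struct sym_refs *s)
{
    return s->home == AREA_DISCARDABLE && s->from_persist > 0;
}
```

Two counters per symbol are enough because only the mix of referrers matters, not who exactly they are.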

command line options

-b take into account variables in .bss
-h make hexdump of found vars
-v verbose mode

To run on lots of LKMs use something like

find path_to_kernel_root -type f -name "*.ko" | xargs kotest

To get a summary you can run awk -f total.awk on the output of the previous command

is it reliable to use only fixups for analysis?

No - there are false positives. Consider excerpt from ip_vs.ko, function ip_vs_register_nl_ioctl:

.init.text:0000000000016155   mov     rdi, offset ip_vs_genl_family
.init.text:000000000001615C   mov     cs:ip_vs_genl_family.module, offset __this_module
.init.text:0000000000016167   mov     cs:ip_vs_genl_family.ops, offset ip_vs_genl_ops
.init.text:0000000000016172   mov     cs:ip_vs_genl_family.mcgrps, 0
.init.text:000000000001617D   mov     qword ptr cs:ip_vs_genl_family.n_ops, 10h
.init.text:0000000000016188   call    __genl_register_family

it turns out that ip_vs_genl_ops (located in the .rodata section) is referenced only from function ip_vs_register_nl_ioctl in .init.text, but actually it cannot be moved to the discardable area because it was registered with genl_register_family. Kotest cannot analyze the usage of addresses and so gives a FP:

.rodata + 5A0 (ip_vs_genl_ops) rref 1 xref 0 add size 768

Another issue is string merging by ld. Let's assume that we have a couple of strings: "foobar" referenced from some function(s) in the .text section and "bar" referenced from code in .init.text. The linker can (and usually does) put only the string "foobar" into .rodata, and the fixup for string "bar" will point to the middle of this single string "foobar"
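The condition for such a merge is simply that one string is a tail of the other. A tiny sketch (tail_merge_offset is a made-up helper, not something ld or kotest exposes) that computes where the fixup for the shorter string would land:

```c
#include <assert.h>
#include <string.h>

/* If shorter is a tail of longer, return the offset inside longer where it
   starts, otherwise -1. */
static long tail_merge_offset(const char *longer, const char *shorter)
{
    assert(longer && shorter);
    size_t ll = strlen(longer), sl = strlen(shorter);
    if (sl > ll)
        return -1;
    return strcmp(longer + (ll - sl), shorter) == 0 ? (long)(ll - sl) : -1;
}
```

For "foobar" and "bar" the result is 3, so a relocation-only tool sees a reference into the middle of an object that belongs to persistent code.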

So consider the output of kotest an estimated upper bound of the memory which can potentially be saved by moving data into the discardable area

why not use famous objtool?

Because ~~of NIH syndrome~~ objtool employs a disassembler and as a consequence it is slow and supports only a few architectures. Kotest is based on ELFIO and can process both 32 & 64 bit ELF files from any arch (and it is very fast)

LKM loading

starts in function load_module. It's a surprisingly huge amount of ~~buggy~~ code, so I briefly describe only the most important parts

layout_and_allocate collects sections and allocates persistent and discardable module memory in function layout_sections
find_module_sections is the most important because it fills the module structure with lots of pointers to section contents for further processing
post_relocation, from where the arch-specific module_finalize is called
do_init_module calls the init function of the module and frees discardable memory by inserting a new task into init_free_list

There is a nasty bug - sysfs shows all sections (including freed ones). So sometimes my lkcd shows amazing results like:

Mod[60] 0xffffffffc0454300 base 0xffffffffc0451000 serio_raw
 init: 0xffffffffc037e000 - nls_iso8859_1!uni2char
 exit: 0xffffffffc0451b8a - serio_raw!serio_raw_drv_exit

The init field now points somewhere into the middle of module nls_iso8859_1. This happened because the .init section of serio_raw was freed and the memory is now occupied by some other module. Despite this, according to the kernel, it is still listed as part of serio_raw:
ls -1a /sys/module/serio_raw/sections | grep init

.init.text

This bug is caused by function mod_sysfs_setup, which knows nothing about discardable sections (and perhaps should call within_module_init to filter out some sections and also save some memory on several module_sect_attr items)

Which sections does the kernel consider discardable?

Simple answer - those whose names start with ".init". More detailed answer - each architecture can have its own version of function module_init_section
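The simple answer restated as user-space C, so it can be applied to section names read from an ELF file. The name is_init_section is made up here, and arch-specific overrides are deliberately not modelled.

```c
#include <assert.h>
#include <string.h>

/* Default rule only: a section is discardable if its name starts with ".init". */
static int is_init_section(const char *name)
{
    assert(name);
    return strncmp(name, ".init", 5) == 0;
}
```

Note that this is a plain prefix match, so ".init.array" qualifies while ".ctors" and ".altinstructions" do not.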

For example see arm specific sections

The problem is that this list is not exhaustive - some other sections could be moved to the discardable area because their content is not used after module initialization. Just to name a few:

".altinstructions" - processed inside apply_alternatives, called from module_finalize <- post_relocation <- load_module before do_init_module
under RISC-V for some unknown reason it has the name ".alternative"
from the same function module_finalize also ".retpoline_sites", ".return_sites" etc

However this is not all. Let's check function do_mod_ctors. Field module->ctors can point to section ".init.array" (which is considered discardable) or to ".ctors" (which is not). Logic? Haven't heard of it

Which data from discardable sections is the kernel able to clean up?

As you can see, function do_init_module calls ftrace_free_mem and trim_init_extable

The last one has a very weird comment (a glaring example of the fact that sometimes no comment is much better):

If the exception table is sorted, any entries referring to the module init will be at the beginning or the end.

The problem is that the exception table (stored in module->extable) is always sorted in post_relocation

So, as you can guess, the content of section "__ex_table" is cleaned up before the init sections are freed

As for the first - ftrace call sites are initially loaded from the section named FTRACE_CALLSITE_SECTION. And I always believed that having ftrace entries for init functions is an idea of questionable usefulness. Sure, you can manually mark each init function with __attribute__((__no_instrument_function__)). If you are as lazy as me - entrust this task to gcc with my patch

Results

On 6.8.8 for aarch64 we have 37140 bytes of data referenced only from discardable sections (remember about FPs) and 245988 bytes in sections which could be moved to the discardable area

kotest refused to count string literals for MIPS kernel modules. The reason is that gcc does not emit sizes/object types for unnamed string literals - in asm files it looks like

$LC0:
        .ascii  "const string %f\012\000"

At the same time it does for named literals:
.type fmt_msg, @object
.size fmt_msg, 15
fmt_msg:
.ascii "enter with %d\012\000"

I am too lazy to investigate which ancient specification from the past century it follows. Fortunately this is an easily repairable problem - just calculate the size of a symbol as the distance to the next one (or to the end of the section). Since I suspect that this is not the only architecture with such gcc behavior, I added the -f option to do this kind of size recalculation
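The recalculation itself is short. Here is a sketch assuming the symbols of one section were already collected into an array; struct sym and fix_sizes are hypothetical names and say nothing about how the -f option is really implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Symbol of a single section: offset inside the section and size (0 = unknown). */
struct sym {
    uint64_t offset;
    uint64_t size;
};

static int cmp_offset(const void *a, const void *b)
{
    const struct sym *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Sort symbols by offset and give every zero-sized symbol the distance to
   the next symbol, or to the end of the section for the last one. */
static void fix_sizes(struct sym *syms, size_t n, uint64_t section_size)
{
    if (n == 0)
        return;
    assert(syms);
    qsort(syms, n, sizeof(*syms), cmp_offset);
    for (size_t i = 0; i < n; i++) {
        if (syms[i].size != 0)
            continue;                         /* size already known, keep it */
        uint64_t next = (i + 1 < n) ? syms[i + 1].offset : section_size;
        if (next > syms[i].offset)
            syms[i].size = next - syms[i].offset;
    }
}
```

The result may include alignment padding between symbols, so sizes obtained this way are slightly pessimistic.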

Some results for mips32 kernel 6.0:
find ~/linux60/ -type f -name "*.ko" | xargs ./kotest | awk -f total.awk
1890293 1497092

potential memory savings are almost 1.9MB from moving string literals used only in .init.text + almost 1.5MB more from unloading some unnecessary sections

As you might suspect, the stack size in the kernel is quite meager, so it's very important to know how much stack your driver occupies. So I conducted inhumane experiments on my own driver lkcd to check if the stack frame size can be extracted from DWARF debug info. The biggest function in my driver is lkcd_ioctl, so let's explore it

mips

Prolog of lkcd_ioctl looks like:

addiu sp,sp,-688

Let's try to find this number in the output of objdump -g

<1><3ca44>: Abbrev Number: 258 (DW_TAG_subprogram)

    <3ca46>   DW_AT_name        : (indirect string, offset: 0x1356f): lkcd_ioctl
    <3ca4a>   DW_AT_decl_file   : 1
    <3ca4b>   DW_AT_decl_line   : 1654
    <3ca4d>   DW_AT_decl_column : 13
    <3ca4e>   DW_AT_prototyped  : 1
    <3ca4e>   DW_AT_type        : <0x1ef>
    <3ca52>   DW_AT_low_pc      : 0x1cdc
    <3ca56>   DW_AT_high_pc     : 0xc8e4
    <3ca5a>   DW_AT_frame_base  : 1 byte block: 9c      (DW_OP_call_frame_cfa)
    <3ca5c>   DW_AT_GNU_all_tail_call_sites: 1
    <3ca5c>   DW_AT_sibling     : <0x5405d>

Whut? Just DW_OP_call_frame_cfa? Next, check section .debug_frame for pc=1cdc:

00000450 00000038 00000000 FDE cie=00000000 pc=00001cdc..0000e5c0
  DW_CFA_advance_loc: 4 to 00001ce0
  DW_CFA_def_cfa_offset: 688
  ...

Ok, for mips it was easy
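Reading that number programmatically means walking the call frame instructions of the FDE. Below is a deliberately incomplete sketch, not the code from any real tool: read_uleb128 and max_cfa_offset are made-up names, and only the opcodes that commonly show up in such dumps are handled, everything else is reported as unknown. Opcode values are the standard DWARF ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one unsigned LEB128 value, advancing *pos. */
static uint64_t read_uleb128(const uint8_t *buf, size_t len, size_t *pos)
{
    uint64_t val = 0;
    unsigned shift = 0;
    while (*pos < len) {
        uint8_t b = buf[(*pos)++];
        if (shift < 64)
            val |= (uint64_t)(b & 0x7f) << shift;
        shift += 7;
        if (!(b & 0x80))
            break;
    }
    return val;
}

/* Walk the call frame instructions of one FDE and return the largest CFA
   offset seen, or -1 on an opcode this sketch does not know. */
static int64_t max_cfa_offset(const uint8_t *insn, size_t len)
{
    assert(insn || len == 0);
    int64_t best = 0;
    size_t pos = 0;
    while (pos < len) {
        uint8_t op = insn[pos++];
        switch (op >> 6) {                    /* opcodes kept in the high 2 bits */
        case 1: continue;                                 /* DW_CFA_advance_loc */
        case 2: read_uleb128(insn, len, &pos); continue;  /* DW_CFA_offset */
        case 3: continue;                                 /* DW_CFA_restore */
        }
        switch (op) {
        case 0x00: break;                                 /* DW_CFA_nop */
        case 0x02: pos += 1; break;                       /* DW_CFA_advance_loc1 */
        case 0x03: pos += 2; break;                       /* DW_CFA_advance_loc2 */
        case 0x04: pos += 4; break;                       /* DW_CFA_advance_loc4 */
        case 0x0a: case 0x0b: break;                      /* remember/restore_state */
        case 0x0d: read_uleb128(insn, len, &pos); break;  /* DW_CFA_def_cfa_register */
        case 0x2d: break;                                 /* DW_CFA_GNU_window_save */
        case 0x0c:                                        /* DW_CFA_def_cfa: reg, offset */
            read_uleb128(insn, len, &pos);
            /* fall through */
        case 0x0e: {                                      /* DW_CFA_def_cfa_offset */
            int64_t off = (int64_t)read_uleb128(insn, len, &pos);
            if (off > best)
                best = off;
            break;
        }
        default:
            return -1;
        }
    }
    return best;
}
```

The returned value equals the frame size only while the CFA is tracked relative to the stack pointer, which is an assumption of this sketch and not something DWARF guarantees.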

aarch64

Prolog:

paciasp
sub     sp, sp, #0x350 ; 848

output of objdump -g
<1><66a8>: Abbrev Number: 63 (DW_TAG_subprogram)
    <66a9>   DW_AT_name        : (indirect string, offset: 0x4056): lkcd_ioctl
    <66ad>   DW_AT_decl_file   : 1
    <66ae>   DW_AT_decl_line   : 1654
    <66b0>   DW_AT_decl_column : 13
    <66b1>   DW_AT_prototyped  : 1
    <66b1>   DW_AT_type        : <0x124>
    <66b5>   DW_AT_low_pc      : 0x1d7c
    <66bd>   DW_AT_high_pc     : 0xa8e0
    <66c5>   DW_AT_frame_base  : 1 byte block: 9c      (DW_OP_call_frame_cfa)
    <66c7>   DW_AT_GNU_all_tail_call_sites: 1
    <66c7>   DW_AT_sibling     : <0x19b6a>

Again check section .debug_frame:

000007f0 00000000000009dc 00000000 FDE cie=00000000 pc=0000000000001d7c..000000000000c65c
  DW_CFA_advance_loc: 4 to 0000000000001d80
  DW_CFA_GNU_window_save
  DW_CFA_advance_loc: 4 to 0000000000001d84
  DW_CFA_def_cfa_offset: 848

It may seem that we have found a universal reliable solution, right?

x86_64

Prolog (I prefer to use option -M intel for Intel syntax):

17b0:       e8 00 00 00 00          call   17b5 <lkcd_ioctl+0x5>
17b5:       55                      push   rbp
17b6:       48 89 e5                mov    rbp,rsp
17b9:       41 57                   push   r15
17bb:       41 56                   push   r14
17bd:       41 55                   push   r13
17bf:       41 54                   push   r12
17c1:       49 89 d4                mov    r12,rdx
17c4:       53                      push   rbx
17c5:       48 81 ec d8 02 00 00    sub    rsp,0x2d8

output of objdump -g

<1><37b4c>: Abbrev Number: 254 (DW_TAG_subprogram)
    <37b4e>   DW_AT_name        : (indirect string, offset: 0x11d24): lkcd_ioctl
    <37b52>   DW_AT_decl_file   : 1
    <37b53>   DW_AT_decl_line   : 1654
    <37b55>   DW_AT_decl_column : 13
    <37b56>   DW_AT_prototyped  : 1
    <37b56>   DW_AT_type        : <0x1dc>
    <37b5a>   DW_AT_ranges      : 0x553
    <37b5e>   DW_AT_frame_base  : 1 byte block: 9c      (DW_OP_call_frame_cfa)
    <37b60>   DW_AT_call_all_tail_calls: 1
    <37b60>   DW_AT_sibling     : <0x4e456>

Wait, where is the address of the function? Well, everyone loves DWARF for its simplicity, consistency & unambiguity, he-he. Let's try to search in section .debug_rnglists with the value from DW_AT_ranges:

00000553 00000000000017b0 000000000000e14d
0000055f 00000000000002fd 00000000000004d6

I see familiar numbers! It's unclear why there are two address ranges. Check section .debug_frame again for both:

000008c0 000000000000003c 00000000 FDE cie=00000000 pc=00000000000017b0..000000000000e14d
  DW_CFA_advance_loc: 6 to 00000000000017b6
  DW_CFA_def_cfa_offset: 16
  DW_CFA_offset: r6 (rbp) at cfa-16
  DW_CFA_advance_loc: 3 to 00000000000017b9
  DW_CFA_def_cfa_register: r6 (rbp)
  DW_CFA_advance_loc: 8 to 00000000000017c1
  DW_CFA_offset: r15 (r15) at cfa-24
  DW_CFA_offset: r14 (r14) at cfa-32
  DW_CFA_offset: r13 (r13) at cfa-40
  DW_CFA_offset: r12 (r12) at cfa-48
  DW_CFA_advance_loc: 11 to 00000000000017cc
  DW_CFA_offset: r3 (rbx) at cfa-56
  DW_CFA_advance_loc2: 3124 to 0000000000002400
  DW_CFA_remember_state
  DW_CFA_restore: r3 (rbx)
  DW_CFA_advance_loc: 2 to 0000000000002402
  DW_CFA_restore: r12 (r12)
  DW_CFA_advance_loc: 2 to 0000000000002404
  DW_CFA_restore: r13 (r13)
  DW_CFA_advance_loc: 2 to 0000000000002406
  DW_CFA_restore: r14 (r14)
  DW_CFA_advance_loc: 2 to 0000000000002408
  DW_CFA_restore: r15 (r15)
  DW_CFA_advance_loc: 1 to 0000000000002409
  DW_CFA_restore: r6 (rbp)
  DW_CFA_def_cfa: r7 (rsp) ofs 8
  DW_CFA_advance_loc: 5 to 000000000000240e
  DW_CFA_restore_state

Can you find something similar to 0x2d8/728? I can't - DW_CFA_def_cfa_offset is 16 and obviously this is the wrong value. Numbers like 17c1 are addresses in code, but for 17c5 there are no records at all! Ok, check the address from the second range:

00000900 0000000000000014 00000368 FDE cie=00000368 pc=00000000000002fd..00000000000004d6
And this is all - no more lines for this range

Let's continue our wandering in endless dead end

GCC has struct stack_usage and even a field su inside struct function. And the latter is accessible via the global var cfun. So it should be easy to patch dwarf2out.cc one more time, for example to extract the stack size (like function output_stack_usage_1 does) and put it inside the DW_AT_frame_base block, right?

NO

As you can see function allocate_stack_usage_info called only when

flag_callgraph_info is set - corresponding to option -fcallgraph-info
flag_stack_usage_info is set - corresponding to options -fstack-usage

In other cases field function->su is zero
There should probably be heartbreaking conclusion about quality of opensource in general and gcc and in particular...

Today I added dumping of stack frame sizes to my dwarfdump (well, where they exist). The format of the .debug_frame section was obviously invented by Martian misanthropes, so the patch is huge and ugly

Sample of output for some random function from mips kernel:

// Addr 0x183D27C .text
// Frame Size 18
// FileName: drivers/char/random.c
// LocalVars:
// LVar0, tag 6965A71
// int ret
// LVar1, tag 6965A8F
// bool branch
ssize_t random_read_iter(struct kiocb* kiocb,struct iov_iter* iter);

and the same in JSON format:

"110516780":{"type":"function","file":"drivers/char/random.c","type_id":"110427157","name":"random_read_iter","addr":"25416316","section":".text","frame_size":"24","params":[{"name":"kiocb","id":"110516807","type_id":"110466611"},{"name":"iter","id":"110516828","type_id":"110470510"}],"lvars":["110516849":{"type":"var","file":"drivers/char/random.c","owner":"110516780","type_id":"110426600","name":"ret"}, "110516879":{"type":"var","file":"drivers/char/random.c","owner":"110516780","type_id":"110427073","name":"branch"}]}

Unfortunately dwarfdump can't work with kernel modules bcs they are actually just object files, and for this sad reason they have relocations even for debug sections. So to properly deal with these files I need to apply relocations first, and this is an arch-specific action (which I prefer to avoid)

Binutils has 2 solutions to this problem:

objcopy calls bfd_simple_get_relocated_section_contents from libbfd.so, which means the tool must depend on it
readelf has its own relocation code in apply_relocations and this is a huge pile of code

And I really don't like either of the above approaches
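Just to show what "apply relocations" means in the simplest case, here is a minimal sketch for x86_64 only. It assumes the relocations are already parsed into (offset, type, symbol value, addend) tuples. That tuple layout and the function name are hypothetical, and only the two absolute types R_X86_64_64 and R_X86_64_32 are handled. Whether that is enough for all debug sections is an assumption on my side.

```python
import struct

R_X86_64_64 = 1    # S + A, written as 64-bit value
R_X86_64_32 = 10   # S + A, written as unsigned 32-bit value

def apply_rela(section, relocs):
    """Return a copy of `section` with RELA-style relocations applied.

    relocs: iterable of (offset, rtype, sym_value, addend) tuples.
    """
    data = bytearray(section)
    for offset, rtype, sym_value, addend in relocs:
        value = sym_value + addend
        if rtype == R_X86_64_64:
            struct.pack_into("<Q", data, offset, value & 0xFFFFFFFFFFFFFFFF)
        elif rtype == R_X86_64_32:
            if not 0 <= value <= 0xFFFFFFFF:
                raise ValueError("R_X86_64_32 overflow at offset %#x" % offset)
            struct.pack_into("<I", data, offset, value)
        else:
            raise NotImplementedError("relocation type %d" % rtype)
    return bytes(data)
```

Every other architecture needs its own table of types and formulas, which is exactly the part I would like to avoid.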

Very funny article about the (im)possible future of ebpf. Given that right now 8 small BPF scripts with only 7 opcodes occupy 1Mb, a whole kernel on ebpf will require exabytes of RAM, he-he

Anyway there is another case of info hiding in linux - struct_ops mentioned in the article has type btf_kind_operations, which cannot be found in include (as usual). So today I added dumping of it to my lkcd. Sample of output:

NR_BTF_KINDS: 17
btf_ops[1] at 0xffffffffaea3b240 - kernel!int_ops
  check_meta: 0xffffffffada6f320 - kernel!btf_int_check_meta
  resolve: 0xffffffffada6f150 - kernel!btf_df_resolve
  check_member: 0xffffffffada6fb10 - kernel!btf_int_check_member
  check_kflag_member: 0xffffffffada6fa30 - kernel!btf_int_check_kflag_member
  log_details: 0xffffffffada6e110 - kernel!btf_int_log
  show: 0xffffffffada70ac0 - kernel!btf_int_show
btf_ops[2] at 0xffffffffaf7fe4c0 - kernel!ptr_ops
  check_meta: 0xffffffffada72180 - kernel!btf_ref_type_check_meta
  resolve: 0xffffffffada74c60 - kernel!btf_ptr_resolve
  check_member: 0xffffffffada6f9d0 - kernel!btf_ptr_check_member
  check_kflag_member: 0xffffffffada6f7c0 - kernel!btf_generic_check_kflag_member
  log_details: 0xffffffffada6e030 - kernel!btf_ref_type_log
  show: 0xffffffffada708b0 - kernel!btf_ptr_show
btf_ops[3] at 0xffffffffaf7fe440 - kernel!array_ops
  check_meta: 0xffffffffada6f1e0 - kernel!btf_array_check_meta
  resolve: 0xffffffffada749a0 - kernel!btf_array_resolve
  check_member: 0xffffffffada748f0 - kernel!btf_array_check_member
  check_kflag_member: 0xffffffffada6f7c0 - kernel!btf_generic_check_kflag_member
  log_details: 0xffffffffada6e0e0 - kernel!btf_array_log
  show: 0xffffffffada736e0 - kernel!btf_array_show
btf_ops[4] at 0xffffffffaf7fe400 - kernel!struct_ops
  check_meta: 0xffffffffada72440 - kernel!btf_struct_check_meta
  resolve: 0xffffffffada71470 - kernel!btf_struct_resolve
  check_member: 0xffffffffada6f960 - kernel!btf_struct_check_member
  check_kflag_member: 0xffffffffada6f7c0 - kernel!btf_generic_check_kflag_member
  log_details: 0xffffffffada6e090 - kernel!btf_struct_log
  show: 0xffffffffada71ea0 - kernel!btf_struct_show
btf_ops[5] at 0xffffffffaf7fe400 - kernel!struct_ops
  check_meta: 0xffffffffada72440 - kernel!btf_struct_check_meta
  resolve: 0xffffffffada71470 - kernel!btf_struct_resolve
  check_member: 0xffffffffada6f960 - kernel!btf_struct_check_member
  check_kflag_member: 0xffffffffada6f7c0 - kernel!btf_generic_check_kflag_member
  log_details: 0xffffffffada6e090 - kernel!btf_struct_log
  show: 0xffffffffada71ea0 - kernel!btf_struct_show
btf_ops[6] at 0xffffffffaf7fe3c0 - kernel!enum_ops
  check_meta: 0xffffffffada722c0 - kernel!btf_enum_check_meta
  resolve: 0xffffffffada6f150 - kernel!btf_df_resolve
  check_member: 0xffffffffada6f9c0 - kernel!btf_enum_check_member
  check_kflag_member: 0xffffffffada6f8d0 - kernel!btf_enum_check_kflag_member
  log_details: 0xffffffffada6e0c0 - kernel!btf_enum_log
  show: 0xffffffffada70620 - kernel!btf_enum_show
btf_ops[7] at 0xffffffffaf7fe480 - kernel!fwd_ops
  check_meta: 0xffffffffada72230 - kernel!btf_fwd_check_meta
  resolve: 0xffffffffada6f150 - kernel!btf_df_resolve
  check_member: 0xffffffffada6f1b0 - kernel!btf_df_check_member
  check_kflag_member: 0xffffffffada6f180 - kernel!btf_df_check_kflag_member
  log_details: 0xffffffffada6e050 - kernel!btf_fwd_type_log
  show: 0xffffffffada6e200 - kernel!btf_df_show
btf_ops[8] at 0xffffffffaf7fe500 - kernel!modifier_ops
  check_meta: 0xffffffffada72180 - kernel!btf_ref_type_check_meta
  resolve: 0xffffffffada74700 - kernel!btf_modifier_resolve
  check_member: 0xffffffffada74620 - kernel!btf_modifier_check_member
  check_kflag_member: 0xffffffffada74540 - kernel!btf_modifier_check_kflag_member
  log_details: 0xffffffffada6e030 - kernel!btf_ref_type_log
  show: 0xffffffffada72a70 - kernel!btf_modifier_show
btf_ops[9] at 0xffffffffaf7fe500 - kernel!modifier_ops
  check_meta: 0xffffffffada72180 - kernel!btf_ref_type_check_meta
  resolve: 0xffffffffada74700 - kernel!btf_modifier_resolve
  check_member: 0xffffffffada74620 - kernel!btf_modifier_check_member
  check_kflag_member: 0xffffffffada74540 - kernel!btf_modifier_check_kflag_member
  log_details: 0xffffffffada6e030 - kernel!btf_ref_type_log
  show: 0xffffffffada72a70 - kernel!btf_modifier_show
btf_ops[10] at 0xffffffffaf7fe500 - kernel!modifier_ops
  check_meta: 0xffffffffada72180 - kernel!btf_ref_type_check_meta
  resolve: 0xffffffffada74700 - kernel!btf_modifier_resolve
  check_member: 0xffffffffada74620 - kernel!btf_modifier_check_member
  check_kflag_member: 0xffffffffada74540 - kernel!btf_modifier_check_kflag_member
  log_details: 0xffffffffada6e030 - kernel!btf_ref_type_log
  show: 0xffffffffada72a70 - kernel!btf_modifier_show
btf_ops[11] at 0xffffffffaf7fe500 - kernel!modifier_ops
  check_meta: 0xffffffffada72180 - kernel!btf_ref_type_check_meta
  resolve: 0xffffffffada74700 - kernel!btf_modifier_resolve
  check_member: 0xffffffffada74620 - kernel!btf_modifier_check_member
  check_kflag_member: 0xffffffffada74540 - kernel!btf_modifier_check_kflag_member
  log_details: 0xffffffffada6e030 - kernel!btf_ref_type_log
  show: 0xffffffffada72a70 - kernel!btf_modifier_show
btf_ops[12] at 0xffffffffaf7fe340 - kernel!func_ops
  check_meta: 0xffffffffada720f0 - kernel!btf_func_check_meta
  resolve: 0xffffffffada6f150 - kernel!btf_df_resolve
  check_member: 0xffffffffada6f1b0 - kernel!btf_df_check_member
  check_kflag_member: 0xffffffffada6f180 - kernel!btf_df_check_kflag_member
  log_details: 0xffffffffada6e030 - kernel!btf_ref_type_log
  show: 0xffffffffada6e200 - kernel!btf_df_show
btf_ops[13] at 0xffffffffaf7fe380 - kernel!func_proto_ops
  check_meta: 0xffffffffada71890 - kernel!btf_func_proto_check_meta
  resolve: 0xffffffffada6f150 - kernel!btf_df_resolve
  check_member: 0xffffffffada6f1b0 - kernel!btf_df_check_member
  check_kflag_member: 0xffffffffada6f180 - kernel!btf_df_check_kflag_member
  log_details: 0xffffffffada6ecf0 - kernel!btf_func_proto_log
  show: 0xffffffffada6e200 - kernel!btf_df_show
btf_ops[14] at 0xffffffffaea3b200 - kernel!var_ops
  check_meta: 0xffffffffada72000 - kernel!btf_var_check_meta
  resolve: 0xffffffffada74330 - kernel!btf_var_resolve
  check_member: 0xffffffffada6f1b0 - kernel!btf_df_check_member
  check_kflag_member: 0xffffffffada6f180 - kernel!btf_df_check_kflag_member
  log_details: 0xffffffffada6e000 - kernel!btf_var_log
  show: 0xffffffffada6d8e0 - kernel!btf_var_show
btf_ops[15] at 0xffffffffaea3b1c0 - kernel!datasec_ops
  check_meta: 0xffffffffada72680 - kernel!btf_datasec_check_meta
  resolve: 0xffffffffada740e0 - kernel!btf_datasec_resolve
  check_member: 0xffffffffada6f1b0 - kernel!btf_df_check_member
  check_kflag_member: 0xffffffffada6f180 - kernel!btf_df_check_kflag_member
  log_details: 0xffffffffada6e0d0 - kernel!btf_datasec_log
  show: 0xffffffffada70380 - kernel!btf_datasec_show
btf_ops[16] at 0xffffffffaea3b180 - kernel!float_ops
  check_meta: 0xffffffffada71800 - kernel!btf_float_check_meta
  resolve: 0xffffffffada6f150 - kernel!btf_df_resolve
  check_member: 0xffffffffada6f850 - kernel!btf_float_check_member
  check_kflag_member: 0xffffffffada6f7c0 - kernel!btf_generic_check_kflag_member
  log_details: 0xffffffffada6dfe0 - kernel!btf_float_log
  show: 0xffffffffada6e200 - kernel!btf_df_show

Recently I've done a small research to repurpose overvalued ebpf into something useful and even achieved some modest results. It seems that at least you can use ebpf maps in your old-school native drivers without writing a single line of code for ebpf progs. You may ask - c'mon, there are tons of ways to communicate with a driver under linux, just to name a few:

ioctls
read from driver/write to driver, possibly with polling
file in kernfs
futex in shared memory
io_uring
netlink like auditd or XFRM do

etc

Let's look at a typical situation - your IT crew was bitten by a violent adherent of the totalitarian sect of ~~rust~~ ebpf Witnesses and as a consequence you now have hundreds of ebpfs (and not a single person who knows how this pile of spaghetti code works)

Probably now it is much better to integrate new drivers with ebpf, right? Considering that ebpf progs have very serious limitations (like you can't read the content of linked lists like raw_notifier_head with the right locking), it would be very nice to write the output of your EDR drivers directly to ebpf maps

So I made a simple POC to show how you can do it. Source of driver & userland test program

Let's dive into the gory details and see what other non-obvious advantages can be obtained from such heretical crossbreeding

From userland to kernel

One possible scenario - replacing kernel module params. Your userland code creates an ebpf map, fills it with params (including binary blobs - for example code for reverse shell, he-he) and then calls the driver with an ioctl, passing the file descriptor of the ebpf map into the driver
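A rough userland sketch of this scenario (not the test program from my POC): create an array map with a raw bpf(2) syscall and hand its fd to a driver. The syscall number 321 is valid for x86_64 only. The device path and the ioctl request number are hypothetical and must come from your driver. Passing a zero-padded 128-byte attr is my assumption about what the kernel accepts for BPF_MAP_CREATE.

```python
import ctypes
import fcntl
import os
import struct

BPF_MAP_CREATE = 0       # bpf(2) command number
BPF_MAP_TYPE_ARRAY = 2   # from enum bpf_map_type
SYS_BPF = 321            # syscall number of bpf(2), x86_64 only

def map_create_attr(map_type, key_size, value_size, max_entries, size=128):
    """Pack the first four fields of union bpf_attr, zero padded to `size`."""
    head = struct.pack("=IIII", map_type, key_size, value_size, max_entries)
    return head.ljust(size, b"\0")

def create_array_map(value_size, max_entries):
    """Create a BPF array map (needs root or CAP_BPF) and return its fd."""
    raw = map_create_attr(BPF_MAP_TYPE_ARRAY, 4, value_size, max_entries)
    attr = (ctypes.c_char * len(raw)).from_buffer_copy(raw)
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    libc.syscall.argtypes = [ctypes.c_long, ctypes.c_long,
                             ctypes.c_void_p, ctypes.c_ulong]
    fd = libc.syscall(SYS_BPF, BPF_MAP_CREATE, attr, len(raw))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd

def pass_map_to_driver(dev_path, request, map_fd):
    """Hand the map fd to a driver; dev_path and request are hypothetical."""
    dev = os.open(dev_path, os.O_RDWR)
    try:
        fcntl.ioctl(dev, request, struct.pack("=i", map_fd))
    finally:
        os.close(dev)
```

Filling the map with params is left out here; any existing userland bpf library can do that part.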

From kernel to userland

You could replace lots of files in /sys or /proc with ebpf maps. Really - you could have data updated in real time in structured binary form and avoid tons of text parsers. Just think about it seriously

And finally let's check how to work with ebpf maps in your driver

Gathering maps

The first thing is to locate the necessary maps. I've implemented 3 ways to do this

IOCTL_FROM_FD - by file descriptor returned from bpf_create_map_name. The official way is to call bpf_map_get and it is even marked as EXPORT_SYMBOL. However I got
ERROR: modpost: "bpf_map_get" [/home/redp/lkcd/bmc/bmc.ko] undefined!
Well, this is no big deal - we can employ good old symbol lookup
IOCTL_BY_ID. All ebpf maps are stored in the tree map_idr, so it just calls idr_find
IOCTL_BY_NAME. Traverses all nodes in the map_idr tree and finds the map by name

CRUD operations

The official way to do this is to call functions (non-exported, of course) like bpf_map_update_value etc, but they require a file descriptor and we may not have one. So we can just skip most of their wrapper logic and call the corresponding map->ops methods (don't forget to take the necessary RCU lock):

for insert/update ops->map_update_elem
for read ops->map_lookup_elem (or even ops->map_lookup_and_delete_elem). See how bpf_map_copy_value implements this
for delete ops->map_delete_elem, see map_delete_elem

Notifications

Surprisingly the hardest part was to notify userland from driver code that an ebpf map was updated. Theoretically their file descriptors support polling - see function bpf_map_poll, it calls the map->ops->map_poll method if it is present. Unfortunately the only map type implementing it is BPF_MAP_TYPE_RINGBUF

It was very tempting to patch map->ops with a copy of the original one and just add your own map_poll method. DON'T DO THAT!

The main threat here is that your driver can be unloaded at an arbitrary moment. There is a high probability that for rare events (for example, events that happen a couple of times a year) a return address into unloaded code will be present in the stack of some thread waiting in poll - even if you revert all your patched maps before unloading the driver

We should poll on some other file descriptor - in my case there is only one ebpf map so I decided to use the FD of the driver. For many maps the driver could create a bunch of files somewhere in kernfs (for example with kobject_uevent) and return them to userland
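The userland side of such a notification is trivial. A minimal sketch, assuming the driver implements poll in its file_operations and reports POLLIN when the map was updated (the function name is hypothetical, and any pollable fd works for trying it out):

```python
import select

def wait_for_update(fd, timeout_ms):
    """Block until fd reports POLLIN or the timeout expires.

    Returns True if the fd became readable, False on timeout.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)
    return any(ev & select.POLLIN for _, ev in events)
```

After a successful wait the program re-reads the ebpf map through its own map fd, so the driver fd carries only the "something changed" signal.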

As kotest shows we can achieve a reduction in the size of non-discardable sections of LKMs (and thus reduce the total size of memory occupied by the kernel) by moving constants used only by functions in .init.text to discardable sections like .init.rodata. As usual we can do it manually or invent some dirty hacks to automate this boring process. And the first thought that comes to mind is to employ

gcc

Let's assume that we have some string literal referred from a set of functions F1 ... FN. We can safely move it to a discardable section only if all these functions are themselves located in discardable sections. This leads to two consequences

in case of gcc plugin we must operate on RTL bcs some functions can be inlined (or even fully eliminated) - you just cannot do it in GIMPLE
we need some tracking of cross-references between functions and literals
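The safety condition itself is a one-liner once the cross-references exist. A minimal sketch, assuming we already have a map from each literal to the set of functions referencing it (the function name and the data layout are hypothetical):

```python
def movable_literals(xrefs, discardable):
    """Return literals referenced only from discardable functions.

    xrefs: dict literal -> set of names of functions referencing it
    discardable: set of names of functions living in discardable sections
    """
    return sorted(
        literal
        for literal, funcs in xrefs.items()
        # An unreferenced literal is skipped: nothing proves it is safe to move.
        if funcs and funcs <= discardable
    )
```

A single reference from a non-discardable function is enough to keep the literal where it is.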

Unfortunately gcc does not have such tracking - all string literals are stored in a string pool. To make life even worse, sometimes the string pool belongs to a function. From this comment

Only a few targets need per-function constant pools. Most can use one per-file pool

we can conclude that actually nobody understands when per-function constant pools are used. Probably I could patch my gcc plugin to add tracking of string literals, find candidates for eviction to a discardable section and mark them with the __section__ attribute. However debugging RTL gcc plugins is a real nightmare

Patching obj files

The next logical step is to try to move string literals directly in object files (for example with LIEF). kotest is already able to find candidates so the rest is simply to move them, right? Personally I believe that this is possible but there are several problems with this approach:

currently LIEF (or binutils) cannot do it
we need to patch not only relocs but also DWARF debug info and the symbol table - bcs after removing some string from .rodata the addresses of all entries below it change

In general patching binaries is not a perfect idea, especially if there is a good alternative

Patching asm files

gcc can generate them with the -S option and they are just plain text. Everybody is able to parse text files, aren't they?

So I wrote a perl script to parse the .S file, collect functions and literals, find candidates and move them. The obvious drawback of this method is that it is very processor specific. Across processors not only the instructions for loading string literals differ but even the asm directives. So currently only a limited set of processors is supported:

x86_64
aarch64
mips

Also for aarch64 it tries to track the distance between the adrp/add pair and the current size of the function - if it is less than 2Kb the script will move the literal inside the function (and so emulate a per-function constant pool like the ones produced by VC++)
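To give an idea of the parsing involved, here is a heavily simplified sketch for x86_64 only. It is written in Python and is not my perl script. It assumes gcc-style output where functions are announced with .type name, @function, literals are labelled .LCn, and they are loaded with RIP-relative lea. All names in it are hypothetical.

```python
import re

SECTION_RE = re.compile(r"^\s*\.section\s+([^,\s]+)")
TEXT_RE = re.compile(r"^\s*\.text\b")
TYPE_RE = re.compile(r"^\s*\.type\s+([\w.$]+),\s*@function")
LABEL_RE = re.compile(r"^([\w.$]+):")
LEA_RE = re.compile(r"\blea[ql]?\s+(\.LC\d+)\(%rip\)")

def collect_xrefs(asm):
    """Return (function -> section, literal -> set of referencing functions)."""
    section, current = ".text", None
    functions = set()
    func_section = {}
    xrefs = {}
    for line in asm.splitlines():
        m = SECTION_RE.match(line)
        if m:                          # explicit .section switch
            section, current = m.group(1), None
            continue
        if TEXT_RE.match(line):        # bare .text directive
            section, current = ".text", None
            continue
        m = TYPE_RE.match(line)
        if m:                          # remember which labels are functions
            functions.add(m.group(1))
            continue
        m = LABEL_RE.match(line)
        if m:
            name = m.group(1)
            if name.startswith(".LC"):
                xrefs.setdefault(name, set())
            elif name in functions:
                current = name
                func_section[name] = section
            continue
        m = LEA_RE.search(line)
        if m and current is not None:  # literal loaded inside a function
            xrefs.setdefault(m.group(1), set()).add(current)
    return func_section, xrefs
```

Even this toy version shows why the approach is processor specific: the last regexp must be rewritten for every architecture.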

command line options

-F - file with names of functions considered as discardable
-f - dump found functions
-g - try all symbols not marked as global
-l - dump found string literals
-r - name of output discardable section, usually this should be .init.rodata
-s - name of discardable section where functions lie - usually .init.text
-v - verbose mode
-w - for debugging only, don't rewrite original .S file

to select processor:

-i for x86_64
-M for mips

by default processor is aarch64

since everyone (including "professional C++ programmers") has already written about it, I will also write it down (for memory)

This was not first time: proof

The official report is very vague but we can conclude that the root cause was a bug in the config parser leading to dereference of a bad pointer (it's unclear if it contained zero or was just uninitialized), as I assumed

Well, this is no big deal - I have caused thousands of bsods/kernel panics. There is usually a long path from a developer's machine to production, right?

Then I read the second report - and this is pure madness. They DIDN'T TEST IT AT ALL!!! Proof is very simple - let's assume we have a bug with probability of bsod P. Then for N test machines the probability of not catching this bug is (1-P) ^ N. For P = 0.5 (bsod happens or not, he-he) and 8 test machines this probability is 0.4%. Given that in this incident the probability of bsod was close to 1, the size of their test cluster was exactly 0
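For those who want to play with the numbers, the same formula as a tiny sketch (it assumes the test machines fail independently, and the function name is hypothetical):

```python
def miss_probability(p_bsod, machines):
    """Probability that none of `machines` independent test boxes hits the bug."""
    if not 0.0 <= p_bsod <= 1.0:
        raise ValueError("p_bsod must be within [0, 1]")
    return (1.0 - p_bsod) ** machines

# miss_probability(0.5, 8) -> 0.00390625, i.e. about 0.4%
```

With P close to 1 even a single test machine makes a miss practically impossible.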

Instead they used some kind of ~~spell-checkar~~ Content Validator. I'm almost sure that they were bitten by adherents of the totalitarian eBPF sect with their “20,000 lines of code” verifier (which ~~exists solely to drink blood and ruin lives~~ is also unable to prevent cases like this)

Well, these are already serious problems but still not a disaster - you could deploy to only a very small percentage of users, employ a fuse to prevent the spreading of a malicious update (for example by tracking heartbeats from updated machines) and so on. Right? ...

So the right recipe for catastrophe from a group of rebels:

never test anything - static validator is enough
deploy on Friday
deploy to all users at the same time

Let's dissect some typical ebpf spyware. It sets up uprobes on

SSL_read_ex
SSL_read
SSL_write_ex
SSL_write
gotls_write_register
gotls_read_register
gotls_exit_read_register

and uses bpf functions probe_read_user & probe_read_user_str to steal data and map_update_elem & ringbuf_submit to store data in bpf maps

How can we mitigate this?

The official way is to use LSM - function __sys_bpf calls security_bpf so we could use security_add_hooks to register an LSM hook with index bpf. This effectively prevents loading of the ebpf program, which is sometimes not what we want - for example in the case of honeypots there is a high chance that the usermode program will just exit after the failed ebpf program load, and you can't monitor which connections it would try to establish

Another way is to patch bpf_func_proto for selected functions, like I did. However this is a brutal method and affects all ebpf programs (I still believe that some are not spyware, he-he)

Luckily there is a way to blind only some types of ebpf programs - method get_func_proto in bpf_verifier_ops. I made a PoC to blind the aforementioned 4 functions for BPF_PROG_TYPE_TRACING & BPF_PROG_TYPE_KPROBE only

Now we have another problem - how to check the integrity of bpf_verifier_ops? I've also added this check to my lkcd. Example of output when PoC ublind is loaded:

[24] type BPF_PROG_TYPE_TRACING at 0xffffffffc1357720 - ublind!s_trace_patched
  get_func_proto: 0xffffffffc13551e0 - ublind!my_func_proto
  is_valid_access: 0xffffffffaee24e20 - kernel!tracing_prog_is_valid_access

The standard method to find rootkits like this (or like this) is cross-scanning: walk the PTEs of memory without the NX bit, then remove pages belonging to LKMs - the set difference is hidden executable memory. Let's check how we can scan PTEs under linux

disclaimer

this article is not a digest of Intel or linux documentation - I'll just describe how you can traverse page tables from an LKM. Also code

Let's start with some simple things:

cat /boot/config-$(uname -r) | grep -E 'X86_5LEVEL|PGTABLE_LEVELS'
CONFIG_PGTABLE_LEVELS=5
CONFIG_X86_5LEVEL=y

So my kernel has 5 levels and this exactly corresponds to the hardware:

pte_t - PTE, 9 bits of page address (size of page is 4096 bytes - low 12 bits)
pmd_t - PDE, another 9 bits of address
pud_t - PDPTE, 9 bits
p4d_t - PML4, 9 bits
pgd_t - PML5, 9 bits

Total 12 + 5 * 9 = 57 bits
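The same arithmetic as a small sketch that splits a virtual address into the five table indexes plus the page offset (pure bit fiddling, no kernel involved; the function name is hypothetical):

```python
PAGE_SHIFT = 12                                # low 12 bits: offset in 4096-byte page
BITS_PER_LEVEL = 9                             # 512 entries per table
LEVELS = ("pgd", "p4d", "pud", "pmd", "pte")   # from top level to bottom

def split_address(addr):
    """Split a virtual address into per-level table indexes and page offset."""
    parts = {"offset": addr & ((1 << PAGE_SHIFT) - 1)}
    shift = PAGE_SHIFT + BITS_PER_LEVEL * (len(LEVELS) - 1)   # 48 for pgd
    for name in LEVELS:
        parts[name] = (addr >> shift) & ((1 << BITS_PER_LEVEL) - 1)
        shift -= BITS_PER_LEVEL
    return parts
```

Bits above 57 are ignored here; on real hardware they must be a sign extension of bit 56.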

It's really amazing how differently memory management is implemented in different operating systems running on the same hardware. For example in Windows all PTEs are located in a huge contiguous sparse array and you can get the address of the PTE for some memory address with the very simple function MiGetPteAddress. Let MiGetPteAddress(addr1) be addr2. We can then continue this process for all paging levels - get MiGetPteAddress(addr2) and so on - to find out whether all 5 parts of the address are valid. And this can be used in the reverse direction - skip scanning of huge PTE areas if they are not present in memory

Unfortunately in linux PTEs are not stored in one huge contiguous array. So we need to start from the top level (from PGD) and scan all tables on lower levels. The root pgd_t is stored in init_mm.pgd. As usual, var init_mm is not exported

<sarcasm>Linux widely known for the consistency, completeness and backward-compatibility of its API and being developers-friendly in general</sarcasm>

Next we need a way to find valid pXX_t. It seems that there are functions pXX_present, pXX_bad and so on. The right sequence of calls is

pXX_none
pXX_leaf - this is damn good name for functions to check for large pages
pXX_bad
and finally pXX_offset to get item for next level

Unfortunately there are also so-called hugeTLB pages (enabled with CONFIG_HUGETLB_PAGE):

grep 'HUGETLB_PAGE' /boot/config-$(uname -r)
CONFIG_HUGETLB_PAGE=y
CONFIG_HUGETLB_PAGE_FREE_VMEMMAP=y

As you may expect, functions pmd_huge & pud_huge are non-exported too (and p4d_huge & pgd_huge are just dumb macros)

Finally we need to check if some page is executable. This is very hardware specific - for example

Arc has flag _PAGE_EXECUTE
aarch64 has flag _PAGE_KERNEL_EXEC
powerpc has _PAGE_EXEC
s390 has _PAGE_NOEXEC

so for some archs there is function pte_exec, while for others pte_no_exec. Also it's curious that there are no analogs for pud/pmd etc - so actually I have zero ideas how to check executability for large & huge pages

However, this is not the end of suffering. Quick check:

grep address /proc/cpuinfo
address sizes : 39 bits physical, 48 bits virtual

shows that they lie - my hardware actually supports only 48-bit addresses, so the kernel should have only 4 levels of paging. Try to guess how they swept the trash under the carpet?

I'll give you a hint:
page_size 1000, translation level 5 pgd_shift 39 p4d_shift 39 pud_shift 30 pmd_shift 21
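One way to read the hint, as a sketch: when two neighbouring levels report the same shift, they index the same address bits, so together they count as one real level. The helper below is hypothetical and just counts the levels that actually consume address bits.

```python
def effective_levels(shifts, page_shift=12):
    """Count paging levels that consume address bits.

    shifts: shift of each table level from top to bottom (pgd, p4d, pud, pmd);
    the PTE level indexes the bits right above page_shift.
    """
    levels = 0
    prev = None
    for shift in list(shifts) + [page_shift]:
        if shift != prev:   # a repeated shift means a folded level
            levels += 1
        prev = shift
    return levels
```

For the shifts printed above (39, 39, 30, 21) it gives 4, which matches the 48 bits reported by cpuinfo.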

In part 1 I've described how memory is managed by hardware. Now let's dig into how the kernel sees memory. Not surprisingly, we should check the same structures that malicious drivers update while hiding

Modules

List of module structures with head in modules and lock module_mutex. It has a projection onto file /proc/modules but the sizes in that file are sloppy - function module_total_size calculates the total size of the driver (including discarded sections!). So we should use only some selected fields:

on kernel >= 6.4 mem[MOD_TEXT].base & mem[MOD_TEXT].size
on kernel < 4.5 module_core & core_text_size
otherwise core_layout.base & core_layout.text_size

vmap_area_list

It has a projection onto file /proc/vmallocinfo, which requires root access. Sure, sophisticated rootkits can intercept it but that's ok since we use it for cross-scan only
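The cross-scan itself boils down to a set difference. A minimal sketch, assuming we already collected the addresses of executable pages from the PTE walk and the (base, size) ranges from the module list and vmap areas (the function name and the data layout are hypothetical):

```python
def unexplained_exec_pages(exec_pages, known_ranges, page_size=4096):
    """Return executable pages not covered by any known (base, size) range."""
    mask = ~(page_size - 1)
    hidden = []
    for page in exec_pages:
        covered = any(
            # Align base down so a partially used first page still counts.
            (base & mask) <= page < base + size
            for base, size in known_ranges
        )
        if not covered:
            hidden.append(page)
    return hidden
```

Everything left in the result is either a false positive or something that does not want to be found.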

False positives

As you can guess not every executable page belongs to some driver - there are a couple of exceptions

kprobes

under x86 they are usually allocated within function arch_ftrace_update_trampoline. Funny fact - they are "frozen in amber" - even after uninstalling some kprobe the page for its trampoline remains marked as executable, so probably you can even recover (with a disassembler) which functions kprobes were installed on - this can be very valuable for forensics

ebpf JIT

Oh yeah, this is my favorite! Let's check what happens on my old ubuntu:

bpftool prog show

951: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
952: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
953: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
954: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
955: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
956: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
957: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B
958: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2024-09-05T07:28:09+0300 uid 0 xlated 64B jited 54B memlock 4096B

As you can see it has 8 very small ebpf programs, each JITed into 54 bytes of code. Try to guess how much memory they actually take up? lkcd shows

Memory summary:
 1150976 0xffffffffaee3856e bpf_jit_alloc_exec+E
 8192 0xffffffffaec89d04 arch_ftrace_update_trampoline+124

1.15Mb! We can even check content of those pages:

lkcd -M ... | grep bpf_jit_alloc_exec
[158] PTE 0xffff936bc6b664f0 14CC1B161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC169E000 alloced by bpf_jit_alloc_exec+E
[190] PTE 0xffff936bc6b665f0 160CC7161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16BE000 alloced by bpf_jit_alloc_exec+E
[192] PTE 0xffff936bc6b66600 1627BD161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16C0000 alloced by bpf_jit_alloc_exec+E
[194] PTE 0xffff936bc6b66610 1627C5161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16C2000 alloced by bpf_jit_alloc_exec+E
[196] PTE 0xffff936bc6b66620 1627B7161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16C4000 alloced by bpf_jit_alloc_exec+E
[198] PTE 0xffff936bc6b66630 149C7F161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16C6000 alloced by bpf_jit_alloc_exec+E
[200] PTE 0xffff936bc6b66640 1BB539161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16C8000 alloced by bpf_jit_alloc_exec+E
[201] PTE 0xffff936bc6b66648 14CDD4161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16C9000 alloced by bpf_jit_alloc_exec+E
[203] PTE 0xffff936bc6b66658 1F2C8D161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16CB000 alloced by bpf_jit_alloc_exec+E
[204] PTE 0xffff936bc6b66660 1CF8F0161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC16CC000 alloced by bpf_jit_alloc_exec+E
[277] PTE 0xffff936bc6b668a8 16548B161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC1715000 alloced by bpf_jit_alloc_exec+E
[279] PTE 0xffff936bc6b668b8 165774161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC1717000 alloced by bpf_jit_alloc_exec+E
[281] PTE 0xffff936bc6b668c8 165778161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC1719000 alloced by bpf_jit_alloc_exec+E
[282] PTE 0xffff936bc6b668d0 165779161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC171A000 alloced by bpf_jit_alloc_exec+E
[283] PTE 0xffff936bc6b668d8 16577A161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC171B000 alloced by bpf_jit_alloc_exec+E
[285] PTE 0xffff936bc6b668e8 16577C161 addr FFFFFFFFC1600000 final_addr FFFFFFFFC171D000 alloced by bpf_jit_alloc_exec+E
[210] PTE 0xffff936bc9c20690 205CC3161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18D2000 alloced by bpf_jit_alloc_exec+E
[211] PTE 0xffff936bc9c20698 18E437161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18D3000 alloced by bpf_jit_alloc_exec+E
[218] PTE 0xffff936bc9c206d0 10F890161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18DA000 alloced by bpf_jit_alloc_exec+E
[219] PTE 0xffff936bc9c206d8 1142FB161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18DB000 alloced by bpf_jit_alloc_exec+E
[221] PTE 0xffff936bc9c206e8 1EB948161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18DD000 alloced by bpf_jit_alloc_exec+E
[222] PTE 0xffff936bc9c206f0 17943B161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18DE000 alloced by bpf_jit_alloc_exec+E
[224] PTE 0xffff936bc9c20700 113F59161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18E0000 alloced by bpf_jit_alloc_exec+E
[225] PTE 0xffff936bc9c20708 1E08E0161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18E1000 alloced by bpf_jit_alloc_exec+E
[239] PTE 0xffff936bc9c20778 2E63D2161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18EF000 alloced by bpf_jit_alloc_exec+E
[240] PTE 0xffff936bc9c20780 1CE54A161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18F0000 alloced by bpf_jit_alloc_exec+E
[242] PTE 0xffff936bc9c20790 1F01EE161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18F2000 alloced by bpf_jit_alloc_exec+E
[243] PTE 0xffff936bc9c20798 1F01EF161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18F3000 alloced by bpf_jit_alloc_exec+E
[245] PTE 0xffff936bc9c207a8 3115CA161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18F5000 alloced by bpf_jit_alloc_exec+E
[246] PTE 0xffff936bc9c207b0 3115CB161 addr FFFFFFFFC1800000 final_addr FFFFFFFFC18F6000 alloced by bpf_jit_alloc_exec+E

kdps FFFFFFFFC169E000 6

FFFFFFFFC169E000: 01 00 00 00 CC CC CC CC ........ 0xcccccccc00000001
FFFFFFFFC169E008: CC CC CC CC CC CC CC CC ........ 0xcccccccccccccccc
FFFFFFFFC169E010: CC CC CC CC CC CC CC CC ........ 0xcccccccccccccccc
FFFFFFFFC169E018: CC CC CC CC CC CC CC CC ........ 0xcccccccccccccccc
FFFFFFFFC169E020: CC CC CC CC CC CC CC CC ........ 0xcccccccccccccccc
FFFFFFFFC169E028: CC CC CC CC CC CC CC CC ........ 0xcccccccccccccccc
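Taking the numbers from the memory summary and bpftool at face value, a tiny sketch of the overhead arithmetic (the function name is hypothetical, the inputs are the figures printed above):

```python
PAGE_SIZE = 4096

def jit_overhead(total_bytes, programs, jited_bytes_each):
    """Return (pages allocated, bytes of real code, allocated/code ratio)."""
    pages = total_bytes // PAGE_SIZE
    code = programs * jited_bytes_each
    return pages, code, total_bytes / code

pages, code, ratio = jit_overhead(1150976, 8, 54)
```

That is 281 pages for 432 bytes of JITed code, an overhead of more than 2600 times.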

I think ebpf should be illegal in every self-respecting country

It seems that gcc does not always put COMPONENT_REF when accessing fields of structures passed by reference. For example today I added a simple static function append_name(aux_type_clutch &clutch, const char *name)

It has a reference to field txt of structure aux_type_clutch but the RTL looks like:

(insn 20 19 21 4 (set (reg/f:DI 0 ax [89])
        (mem/f/c:DI (plus:DI (reg/f:DI 6 bp)
                (const_int -8 [0xfffffffffffffff8])) [258 clutch+0 S8 A64])) "gptest.cpp":1364:24 80 {*movdi_internal}
     (nil))
(insn 21 20 22 4 (parallel [
            (set (reg/f:DI 0 ax [orig:83 _2 ] [83])
                (plus:DI (reg/f:DI 0 ax [89])
                    (const_int 8 [0x8])))
            (clobber (reg:CC 17 flags))
        ]) "gptest.cpp":1364:24 230 {*adddi_1}
     (expr_list:REG_EQUAL (plus:DI (mem/f/c:DI (plus:DI (reg/f:DI 19 frame)
                    (const_int -8 [0xfffffffffffffff8])) [258 clutch+0 S8 A64])
            (const_int 8 [0x8]))
        (nil)))

The first instruction just loads parm_decl (of type aux_type_clutch) into register RAX, like

mov rax, [rbp+clutch]

and the second just adds a constant 0x8 (offset of field txt) to RAX:

add rax, 8

it's impossible from RTL to track this constant back to the offset in COMPONENT_REF

What is even stranger - for methods you can track field accesses for parameters passed by reference (like this) - for example in the constructor of the same aux_type_clutch:

(insn 12 11 13 2 (set (mem:SI (plus:DI (reg/f:DI 0 ax [94])
                (const_int 40 [0x28])) [4 this_12(D)->level+0 S4 A64])
        (const_int 0 [0])) "gptest.cpp":465:4 81 {*movsi_internal}
     (nil))

I am continuing to improve my gcc plugin for collecting cross-references: 1, 2, 3, 4, 5, 6 & 7. This week I decided to see if I can extract the source of complex types like records, and the most prominent kind of them is function arguments - they are easy to identify in asm (but not so easy to bind in gcc RTL expressions)

Having function declaration fdecl we can extract arguments with something like:

for (tree arg = DECL_ARGUMENTS (fdecl); arg; arg = DECL_CHAIN (arg))
{
  auto a = DECL_RTL_IF_SET (arg);
  if ( a && REG_EXPR(a) )
  {
    // do something with argument in REG_EXPR(a)
  }
}

I need only arguments that are a pointer or reference to record/union so I filter them in method check_arg

That was the easiest part; now let's try to find how arguments can be tracked in RTL.

We can see arguments used in expressions like

insn 7 6 8 2 (set (reg/f:DI 0 ax [orig:82 _1 ] [82])
        (mem/f:DI (plus:DI (reg/f:DI 0 ax [85])
                (const_int 88 [0x58])) [170 this_3(D)->m_outfp+0 S8 A64])) "my_plugin.h":53:14 80 {*movdi_internal}

They come from an RTX of type MEM with tree code COMPONENT_REF, where the left part in TREE_OPERAND (expr, 0) is an SSA_NAME (and the right part in TREE_OPERAND (expr, 1) is in this case a FIELD_DECL)

You can check if an SSA name has a name with SSA_NAME_IDENTIFIER and extract the name with IDENTIFIER_POINTER. But linking just by name is a bad idea - you could legally have a local variable with the same name as an argument, so it's better to extract the underlying declaration (and its type) with SSA_NAME_VAR

Unfortunately this nice and simple method does not work for the cases described in my previous post - we have type info scattered across several RTXes:

SET REG RMEM without COMPONENT_REF - instead we have a nameless SSA_NAME pointing to the type of the referred field, so we cannot extract the type of the class and the field name
EXPR_LIST (seems that it always has index 6) with note REG_EQUAL (the kind of note can be extracted with GET_MODE) containing an expression like PLUS MEM CONST_INT where MEM is PARM_DECL but the offset is just a constant integer

So I couldn't come up with a more elegant solution than:

identify that the currently processed RTX has note REG_EQUAL
while processing the MEM expression within REG_EQUAL store a reference to the argument and its type
and finally while processing CONST_INT check that it has both the base type and expression PLUS above it in the stack, and then manually find the field at that offset in the base type (see method add_fref_from_equal). The offset value can be extracted with XWINT. An open question is what to do if the base type is a union - in this case we can have several fields at the same offset

Looks very ugly and clumsy but it works - for method append_name it extracted two references to aux_type_clutch.txt from first argument of function
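The "find field at offset" step and the union problem can be modelled in a few lines. This is a toy model in Python and not the code of add_fref_from_equal. The record layout is given as plain (name, offset) pairs, where in the plugin the analogous loop would presumably walk TYPE_FIELDS of the base type. The offsets 8 and 40 are the ones visible in the RTL above, everything else is hypothetical.

```python
from typing import List, NamedTuple

class Field(NamedTuple):
    name: str
    offset: int   # byte offset from the start of the record

def fields_at(fields: List[Field], offset: int) -> List[str]:
    """Return names of all fields starting exactly at `offset`.

    For a struct this gives at most one name, for a union it can give several.
    """
    return [f.name for f in fields if f.offset == offset]

# Known offsets of aux_type_clutch taken from the RTL dumps above
clutch = [Field("txt", 8), Field("level", 40)]
# Hypothetical union: every member lives at offset 0
some_union = [Field("as_int", 0), Field("as_ptr", 0)]
```

For clutch and offset 8 the answer is unambiguous, while for the union both members match and the plugin has no way to tell which one the code really meant.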