Let's look at what this stub should do once it is injected into some Linux process via __malloc_hook/__free_hook (btw, this implicitly means that you cannot use this dirty hack on processes linked with musl or uClibc: they just don't have those hooks)
- because our stub can be called from two different hooks, it has to store somewhere which entry point it was called through
- restore the old hook values
- call dlopen/dlsym and then the target function (passing it the address of the injection stub for a delayed munmap. No, you can't free that memory directly in your target function - try to guess why)
- pick the right old hook and jump to it if one was installed, or just return to the libc code that called __malloc_hook
So I collected all the parameters needed for the job in a table, dtab, consisting of 6 pointers
- __malloc_hook address
- old value of __malloc_hook
- __free_hook address
- old value of __free_hook
- pointer to dlopen
- pointer to dlsym
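To make the steps and the table layout concrete, here is a rough C model of the same logic. It is only a sketch, not the actual stub (which is hand-written asm): the field names, stub_body, the entry enum and the fake_* stand-ins are all made up for illustration, and the hook slots are treated as plain void* instead of the real glibc hook types.

```c
#include <stddef.h>

typedef void *(*dlopen_fn)(const char *path, int flags);
typedef void *(*dlsym_fn)(void *handle, const char *name);
typedef void (*target_fn)(void *stub_addr);

/* The six pointers, in the order listed above (names are hypothetical). */
struct dtab {
    void    **malloc_hook_addr; /* address of __malloc_hook    */
    void     *old_malloc_hook;  /* its value before injection  */
    void    **free_hook_addr;   /* address of __free_hook      */
    void     *old_free_hook;    /* its value before injection  */
    dlopen_fn dl_open;          /* pointer to dlopen           */
    dlsym_fn  dl_sym;           /* pointer to dlsym            */
};

/* Step 1: remember which hook brought us here. */
enum entry { ENTRY_MALLOC = 0, ENTRY_FREE = 1 };

/* Returns the old hook to continue with, or NULL if none was installed
 * (in that case the real stub just returns to its caller in libc). */
static void *stub_body(const struct dtab *t, enum entry via, void *stub_addr,
                       const char *lib, const char *sym)
{
    /* Step 2: restore the old hook values. */
    *t->malloc_hook_addr = t->old_malloc_hook;
    *t->free_hook_addr   = t->old_free_hook;

    /* Step 3: dlopen/dlsym, then call the target with the stub address
     * so it can schedule a delayed munmap. 2 is RTLD_NOW on glibc. */
    void *h = t->dl_open(lib, 2);
    if (h != NULL) {
        void *f = t->dl_sym(h, sym);
        if (f != NULL)
            ((target_fn)f)(stub_addr);
    }

    /* Step 4: pick the old hook that matches the entry point. */
    return via == ENTRY_MALLOC ? t->old_malloc_hook : t->old_free_hook;
}

/* Stand-ins so the sketch can be exercised without a real loader. */
static void *last_stub_seen;
static void  fake_target(void *stub_addr) { last_stub_seen = stub_addr; }
static void *fake_dlopen(const char *path, int flags)
{ (void)path; (void)flags; return &last_stub_seen; }
static void *fake_dlsym(void *handle, const char *name)
{ (void)handle; (void)name; return (void *)fake_target; }
```

Filling a struct dtab with fake_dlopen/fake_dlsym and two local void* variables playing the role of the hook slots is enough to check that both slots get restored and that the returned pointer matches the entry point.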
arm64
for some unknown reason they call it aarch64
Source, size 209 bytes with BTI c in prologues
arm64 has lots of really amazing features:
- pre- and post-indexed addressing: in stp x29, x30, [sp, -16]! sp is first decreased by 16 bytes and then both registers are stored to the stack at once
- the cbz/cbnz instructions compare a value with zero and branch in a single operation
- support for PC-relative addressing and, even better, if the desired address is close enough (adr reaches ±1 MiB from the instruction) you can use just one adr instruction, which leads to smaller code and better cache utilization
- symbol_ref,label_ref
- aarch64_symbolic_address_p
- aarch64_mov_operand_p
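To see how much a single adr can reach, here is a small C sketch that range-checks an offset and builds the instruction word. It assumes the standard A64 encoding of adr (a signed 21-bit byte offset split into immlo/immhi); the function names are mine and not from any toolchain.

```c
#include <stdbool.h>
#include <stdint.h>

/* adr carries a signed 21-bit byte offset: -1 MiB .. +1 MiB - 1. */
static bool adr_in_range(int64_t off)
{
    return off >= -(1 << 20) && off < (1 << 20);
}

/* Build "adr x<rd>, pc+off". Returns 0 if rd or off is not encodable. */
static uint32_t encode_adr(unsigned rd, int64_t off)
{
    if (rd > 31 || !adr_in_range(off))
        return 0;
    uint32_t imm   = (uint32_t)off & 0x1FFFFFu; /* 21-bit two's complement */
    uint32_t immlo = imm & 3u;                  /* goes to bits 30:29 */
    uint32_t immhi = imm >> 2;                  /* goes to bits 23:5  */
    return 0x10000000u | (immlo << 29) | (immhi << 5) | rd;
}
```

Anything outside that window needs the two-instruction adrp/add pair instead.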
PS: Visual Studio can take advantage of short PC-relative addressing by placing all constants in a pool behind the function
mips32
This port was inspired by this cool X post. I dislike MIPS asm because of its strange out-of-order execution (branch delay slots): you have to think hard about which instruction to place after literally every j/jal/bXX opcode, and this makes my brain (raised on z80 and i386) seethe. Also, MIPS has no direct access to the PC, so I used the old-school trick with jal and arithmetic on the $ra register
Source, size 234 bytes
loongarch
I just couldn't ignore it, since a couple of years ago I made an IDA Pro processor module for loongarch. It suffers from strange opcodes for PC-relative addressing: you have to use the pcalau12i/addi pair even if the desired address is within 12-bit range. And it's better for you not to even know how it loads a full 64-bit address
Source, size 242 bytes
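For the curious, this is roughly how a target address gets split between the pcalau12i/addi pair. It is a C sketch under my reading of the instruction semantics (pcalau12i adds a 20-bit page delta to the PC and clears the low 12 bits, addi.d then adds a sign-extended 12-bit immediate); the struct and function names are invented for illustration.

```c
#include <stdint.h>

struct pcala {
    int32_t hi20; /* immediate for pcalau12i */
    int32_t lo12; /* immediate for addi.d    */
};

/* Split target into the two immediates for code located at pc. */
static struct pcala split_pcala(uint64_t pc, uint64_t target)
{
    struct pcala r;
    int32_t lo = (int32_t)(target & 0xFFF);
    if (lo >= 0x800)        /* addi.d sign-extends its 12-bit immediate */
        lo -= 0x1000;
    /* Page that pcalau12i must produce so that adding lo hits target. */
    uint64_t page  = target - (uint64_t)(int64_t)lo;
    int64_t  delta = (int64_t)(page - (pc & ~0xFFFull));
    r.hi20 = (int32_t)(delta / 4096); /* delta is a multiple of 4096 */
    r.lo12 = lo;
    return r;
}

/* What the pair computes at run time. */
static uint64_t apply_pcala(uint64_t pc, struct pcala p)
{
    uint64_t page = (pc & ~0xFFFull) + (uint64_t)((int64_t)p.hi20 * 4096);
    return page + (uint64_t)(int64_t)p.lo12;
}
```

Note that even a target a few bytes away from pc still goes through both halves, with hi20 simply coming out as 0 (or 1 when the low part is negative). The sketch does not check that hi20 fits into 20 signed bits.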