Channel: windows deep internals

custom attributes in gcc and dwarf


Let's check whether we can add our own attributes (if Google can afford it, why should it be forbidden to mere mortals?). For example, I want gcc and DWARF to carry a flag describing the direction of function/method parameters - whether a parameter is IN or OUT. I chose 0x28ff as the value of this DWARF attribute.

It's fairly obvious that we can add our own custom attribute in gcc - they even have an example of how to do this. But what about the DWARF producer? Long story short - it seems you cannot do it from a plugin. The only DWARF-related pass available to plugins is pass_dwarf2_frame. So we need to patch gcc. But before that we need to

build gcc from sources

At the moment of writing the latest stable version of gcc was 12, so run

git clone --branch releases/gcc-12 https://github.com/gcc-mirror/gcc.git

and then follow the build instructions

patch gcc

Let's see how gcc produces DWARF output. All debug-info formatters implement gcc_debug_hooks, and currently gcc has three (btw there are patches for mingw to produce PDB, so in theory you could have vmlinux.pdb):

So let's add a function add_param_direction in dwarf2out.cc:
bool add_param_direction(tree decl, dw_die_ref parm_die)
{
  bool pa1 = lookup_attribute ("param_in", DECL_ATTRIBUTES (decl));
  bool pa2 = lookup_attribute ("param_out", DECL_ATTRIBUTES (decl));
  if ( !(pa1 ^ pa2) )
    return false;
  unsigned char pa_value = 0;
  // seems that you can't have a flag with value 1 - see gcc_assert at line 9599
  if ( pa1 )
    pa_value = 2;
  if ( pa2 )
    pa_value = 3;
  add_AT_flag(parm_die, (dwarf_attribute)0x28ff, pa_value);
  return true;
}
It first checks whether the parameter has the attribute param_in or param_out (but not both at the same time, because that would be senseless) and adds the custom DWARF flag attribute via an add_AT_flag call. Then we just need to call this function from gen_formal_parameter_die.
 
Now we can register our custom attributes - this can be done via a plugin, but I preferred to patch c-family/c-attribs.cc:
 
tree handle_param_in_attribute (tree *node, tree name, tree ARG_UNUSED (args),
                         int ARG_UNUSED(flags), bool *no_add_attrs)
{
  if ( !DECL_P (*node) )
  {
    warning (OPT_Wattributes, "%qE attribute can apply to params declarations only", name);
    *no_add_attrs = true;
    return NULL_TREE;
  }
  tree decl = *node;
  if (TREE_CODE (decl) != PARM_DECL)
  {
    warning (OPT_Wattributes, "%qE attribute can apply to params only", name);
    *no_add_attrs = true;
  } else {
    // check presence of param_out
    if ( lookup_attribute ("param_out", DECL_ATTRIBUTES (decl)) )
    {
      warning (OPT_Wattributes, "%qE attribute useless when param_out was used", name);
      *no_add_attrs = true;
      DECL_ATTRIBUTES (decl) = remove_attribute("param_out", DECL_ATTRIBUTES (decl));
    }
  }
  return NULL_TREE;
}

The function handle_param_in_attribute checks that this attribute is attached to a function/method parameter. Then it checks that the same parameter doesn't already have the param_out attribute - if it does, both attributes are removed.
 
All patches are located here

results

Let's define a couple of new keywords:

#define IN    __attribute__((param_in))
#define OUT    __attribute__((param_out))

and mark some arguments as IN or OUT - for example for the method bool PlainRender::dump_type(uint64_t key, OUT std::string &res, named *n, int level)

After rebuilding a debug version with our patched gcc we can see in the objdump output something like this:

<1><10d8ac>: Abbrev Number: 18 (DW_TAG_subprogram)
    <10d8ad>   DW_AT_specification: <0x103579>
    <10d8b1>   DW_AT_object_pointer: <0x10d8cc>
    <10d8b5>   DW_AT_low_pc      : 0x4301a6
    <10d8bd>   DW_AT_high_pc     : 0x4317e7
    <10d8c5>   DW_AT_frame_base  : 1 byte block: 9c     (DW_OP_call_frame_cfa)
    <10d8c7>   DW_AT_GNU_all_tail_call_sites: 1
    <10d8c8>   DW_AT_sibling     : <0x10db29>
 <2><10d8cc>: Abbrev Number: 10 (DW_TAG_formal_parameter)
    <10d8cd>   DW_AT_name        : (indirect string, offset: 0x107d3): this
    <10d8d1>   DW_AT_type        : <0x1035e5>
    <10d8d5>   DW_AT_artificial  : 1
    <10d8d6>   DW_AT_location    : 3 byte block: 91 98 7c       (DW_OP_fbreg: -488)
 <2><10d8da>: Abbrev Number: 45 (DW_TAG_formal_parameter)
    <10d8db>   DW_AT_name        : key
    <10d8df>   DW_AT_decl_file   : 1
    <10d8e0>   DW_AT_decl_line   : 123
    <10d8e1>   DW_AT_decl_column : 38
    <10d8e2>   DW_AT_type        : <0xfb4be>
    <10d8e6>   DW_AT_location    : 3 byte block: 91 90 7c       (DW_OP_fbreg: -496)
 <2><10d8ea>: Abbrev Number: 234 (DW_TAG_formal_parameter)
    <10d8ec>   Unknown AT value: 28ff: 3
    <10d8ed>   DW_AT_name        : res


estimation of maximum clique size


Definition 1.1 from the really cool book "The Design of Approximation Algorithms":

An α-approximation algorithm for an optimization problem is a polynomial-time algorithm that for all instances of the problem produces a solution whose value is within a factor of α of the value of an optimal solution

So you first need to estimate at least the size of the possible optimal solution, right?

Surprisingly, I was unable to find one for maximal clique. stackexchange offers a very simple formula (spoiler: the actual size is a couple of orders of magnitude smaller). Python networkX offers a method with complexity O(1.4422^n) - but only to find the maximal clique itself. Cool. Let's invent such an algorithm ourselves.

From wikipedia:

A clique, C, in an undirected graph G = (V, E) is a subset of the vertices, C ⊆ V, such that every two distinct vertices are adjacent

In other words, this means that a graph with a maximal clique of size K must contain at least K vertices with degree K - 1 or bigger. So we can sort vertices by degree and find some degree S where the number of vertices with degree S or bigger is >= S. But this is a very rough estimate, and it can be refined with the following observation - we can remove all edges to vertices not belonging to this subgraph. So the algorithm is:

  1. calculate the degrees of all vertices and sort them in descending order
  2. find the first degree S where the number of vertices with degree S or bigger is >= S
  3. put all such vertices in a subgraph SD
  4. remove from SD all edges to vertices not belonging to SD
  5. recalculate the degrees of all vertices in SD
  6. find another degree S in SD where the number of vertices with degree S or bigger is >= S; this is the result R

Next we can repeat steps 2-6 until we have enumerated all degrees or some degree becomes less than the previously found result R.

Complexity

Let N be the number of vertices and M the number of edges. The cycle can run at most N times, and in each cycle we can remove fewer than M edges (on average M/2), so the worst-case complexity is O(MN/2).

Results 

On a graph with N = 20000 and M = 170000:

19 on degree 19, 7216 vertices in SD
predicted with stackexchange: 590.205223 

On a graph with N = 200000 and M = 1700000:

20 on degree 19, 69489 vertices in SD
predicted with stackexchange: 1846.546113
 

The strangest thing here is that the number of vertices in SD shrank by a factor of ~3 compared to the whole original graph - like the degree in the Bron–Kerbosch algorithm.

yet another maximal clique algorithm


It seems that most known algorithms for maximal clique try to add as many vertices as possible, evolving towards ever more complex heuristics for vertex ordering. But there is an opposite way - we can remove some vertices from the neighbor set, right?

Let's assume we have sorted all vertices of a graph with M vertices and N edges by degree in descending order and want to check whether some vertex with degree K can contain a clique. We can check whether all of its neighbors are mutually connected and find one or several most loosely connected vertices - let's name such a vertex L. This check requires K - 1 accesses to the adjacency matrix for the first vertex, K - 2 for the second, etc. - on average (K^2) / 2. If no unconnected vertices were found, all surviving neighbors form a clique. See a sample implementation in the function naive_remove.

Now we should decide what we can do with L, and there are only 2 variants:

  1. we can remove it from the set of neighbors
  2. we can keep it and remove from the set of neighbors all vertices not connected with L

Notice that in both cases the number of neighbors decreases by at least 1. Now we can recursively repeat this process, with L removed and with L kept, at most K times, so the complexity is O((K^2) / 2 * 2^K).

We can repeat this process for all vertices with degree bigger than the maximal size of a previously found clique - in the worst case M times, so the overall complexity of this algorithm is O(M * (K^2) / 2 * 2^K).

On average, K = N / M.

Well, not a very good result, but the processing of each vertex can be done in parallel.

We can share the adjacency matrix (or even make it read-only) between all worker threads, and this recursive function requires the following memory at each step:

  1. a bitset of surviving neighbors - K / 8 bytes, where bit i is 1 if vertex i still belongs to the neighbor set and 0 if it was removed
  2. an array of unconnected-vertex counts of size K

Given that the recursion depth does not exceed K, the overall stack space used is

S = K * (K / 8 + K * sizeof(index))

Now let's check whether we can run this algorithm on a

gpu

Disclaimer: I read a book about CUDA programming almost 10 years ago, so I may be wrong.
Let's look at some modern GPUs and find the maximal size of their per-thread local memory:
 
NVidia GTX TITAN
6 GB of memory / 2688 cores = 2.2 MB per core
If an index fits in a short (2 bytes), we can process vertices with degree up to 1028
$1000 / 1028 = ~$1 per unit of vertex degree
 
40 GB / 4608 cores = 10.4 MB per core
vertices can have degree up to 2237
$10000 / 2237 = ~$4.5 per unit of vertex degree

ctf-like task based on maximal clique problem


Sources

There is an undirected graph with 1024 vertices and 100909 edges (so the average degree is 98.5). It is known that the graph contains a clique of size 16. You can pass the indexes of the clique's vertices on the command line like

./ctf 171 345

./ctf 171 346
too short clique 

These clique vertices are then used to derive an AES key and decrypt a short string.

Can you solve this?

gcc plugin to collect cross-references, part 1


Every user of IDA Pro likes cross-references - they are very useful, but applicable to objects in global memory only. What if I want cross-references for virtual methods and class/record fields - like which functions some specific virtual method was called from? Unfortunately IDA Pro cannot show this - partially because this information is not stored in debug info, and also due to its weak algorithm for type propagation. A call to a virtual method typically looks similar to

 mov     rax, [rbp+var_8] ; this
 mov     rax, [rax]       ; this._vptr
 add     rax, 10h
 mov     rcx, [rax]       ; load method from vtable, why not mov rcx, [rax+0x10]?
 call    rcx              ; or even better just call [rax+0x10]?

Let's think about where we can get this kind of cross-reference - surely the compiler must have it somewhere inside in order to generate native code, right? So generally speaking the compiler is your next best friend (right after the disassembler and debugger).

Run gcc with the -c -fdump-final-insns options on a simple C++ test file and check what a virtual method call looks like:
(call_insn # 0 0 2 (set (reg:DI 0 ax)
        (call (mem:QI (reg/f:DI 1 dx [orig:85 _4 ] [85]) [ *OBJ_TYPE_REF(_4;this_7(D)->3B) S1 A8])
            (const_int 0 [0]))) "vtest.cc":31:21# {*call_value}

What? What is _4, what type does this have, and what does ->3B mean instead of a method name? Looking ahead, I can say that all the needed information really is stored in RTL, though the function dump_generic_node (from tree-pretty-print.cc) is just too lazy to show it properly. It seems we can develop a gcc plugin to extract these cross-references (in fact, for the first couple of months of development this was not at all obvious).

why gcc?

Because gcc is the de-facto standard and you can expect to be able to build almost any sources with it (usually after numerous loud curses you finally read the documentation), even on Windows. Nevertheless, let's consider other popular alternatives.

visual c++

Unlike .NET with its excellent Roslyn, Microsoft doesn't allow you to make plugins for their C++ compiler, nor does it even describe its internal IR.

clang

Don't write a clang plugin (c)

In fact you probably could do this. If I remember correctly, LLVM IR has full support for virtual methods. Another question is to what extent the final native code matches the IR - some pieces of code could be removed by dead-code elimination, merged in a GCSE pass, and so on.

Besides, llvm still has some problems with producing a bootable linux kernel.

Anyway, I am too stupid to understand the zoo of llvm classes with their hellish templates, and too impatient to wait several hours to recompile clang every time I insert a couple of debug prints.

gcc plugin basics

There are several good examples of gcc plugins: 1 & 2.
The first has a recurring bug when dumping plugin parameters - plugin_info->argv[i].value should be checked against NULL, because according to the documentation of the gcc command-line option
-fplugin-arg-NAME-<key>[=<value>]
the value is not mandatory.
So I just wrote a class derived from rtl_opt_pass, registered it for the pass "final" (sounds fatal and extremely intimidating), and now in the virtual method execute we can open the gates to hell.

walking the RTL tree

RTL has quite a simple structure compared to GENERIC/GIMPLE - for example, the maximal value of the rtx_code enum is 0x98 (LAST_AND_UNUSED_RTX_CODE) vs tree_code's 0x175 (MAX_TREE_CODES).
 
gcc doesn't provide a ready-to-use iterator for RTL, but you can cut&paste some code from the function rtx_writer::print_rtx. As you can see in rtx_writer::print_rtx_operand, you can extract the format string with GET_RTX_FORMAT (GET_CODE (in_rtx)) and then call a format-specific dumper for each operand. Some of these dumpers in turn call print_rtx again, so the process is recursive, and I had to add a stack to track all parent RTL nodes (so I can know, for example, that the current expression is part of an assignment to some memory if I see set mem:0 in this stack).
 
But the most important things here are the calls to print_mem_expr, scattered in arbitrary places - a thin wrapper around print_generic_expr. This is where you finally get access to the real gcc internals - the tree_node union contains everything, starting from the structure tree_base.
This is a huge topic and I hope it will be described in the next part.

gcc plugin to collect cross-references, part 2

Because I am still fighting with endless variants of unnamed types while processing the linux kernel, let's talk about persistence.
 
The final goal of this plugin is to build, from the sources, a database mapping functions to the methods they call and the fields they use. After that you can find the set of functions referring to some field/method and investigate them later with a disassembler, for example.
 
So of course the plugin must store its results somewhere.
A graph database is probably better suited for such data - you could store symbols as vertices and references as edges; then all references to some symbol are just its incoming edges. But I am too lazy to install a JVM and Neo4j, so I used SQLite (and simple YAML-like files for debugging). You can connect your own storage by implementing the interface FPersistence.
Details of the most important methods from this interface:

int connect(const char *path, const char *username, const char *password)

Called during plugin initialization to connect to the database.
path is passed via the plugin parameter -fplugin-arg-gptest-db
username is passed via the plugin parameter -fplugin-arg-gptest-user
password is passed via the plugin parameter -fplugin-arg-gptest-password
The method must return 0 if the connection was successfully established.

void cu_start(const char *filename) 

Called from the PLUGIN_START_UNIT callback, so the passed filename can be relative. It's probably better to use the function realpath to get the full name of the file.

int func_start(const char *funcname)

Called when the plugin starts processing the next function; funcname is the mangled name of the function. If you don't want to process this function, the method can just return a non-zero value.

void bb_start(int idx)

Basic blocks in gcc differ from what you can see in IDA Pro - they end at the first jump; if this is a conditional jump, the block will have two outgoing edges - one to some other block and a so-called FALLTHRU edge to the next block.

idx is just the index of the basic block.

void add_xref(xref_kind, const char *symname)

The main method, called when the plugin finds in the currently processed function a cross-reference to something (usually even with a symbol name).

void report_error(const char *err_msg)

I decided it would be convenient to put errors in the log/database, so this method is called when the plugin was unable to find something (like the name of a nameless structure when an access to its field occurred, and so on).

gcc plugin to collect cross-references, part 3

Part 1 & 2
Let's start climbing the TREEs. The main sources for reference are tree.h, tree-core.h & print-tree.cc.
 
Caution: because we are traveling during an RTL pass, some of the tree types have already been removed, so you are unlikely to meet TYPE_BINFO/BINFO_VIRTUALS etc.

The main structure is tree_base; it is included as the first field in all other types - for example, tree_type_non_common has a first field of type tree_type_with_lang_specific,
which has a field common of type tree_type_common,
which again has a field common of type tree_common,
which has a field typed of type tree_typed,
which has a field base of type tree_base.
A kind of ancient inheritance in pure C.

Caution 2: many fields have totally different meanings for concrete types, so GCC strongly encourages you to use the macros from tree.h to access all fields.

The type of a TREE can be obtained with TREE_CODE and the name of the code with the function get_tree_code_name.
TREE_CODE returns an enum tree_code, and it has a lot of values - MAX_TREE_CODES equals 0x175. So let's check only an important subset.

 

XXX_CST

represents some constant
  • STRING_CST - a literal constant; the length can be obtained with TREE_STRING_LENGTH and the content with TREE_STRING_POINTER. Warning - types and declaration names are not literal constants!
  • INTEGER_CST - an integer constant; the value can be obtained with a wi::to_wide call
  • POLY_INT_CST - also an integer constant, but split into several elements. The count of such values can be extracted with NUM_POLY_INT_COEFFS and each value via POLY_INT_CST_COEFF

 

BLOCK

represents some block (usually containing variable & label declarations and inlined functions). Variables can be obtained with BLOCK_VARS and BLOCK_NONLOCALIZED_VAR (don't ask me why there are two kinds of vars), embedded blocks with BLOCK_SUBBLOCKS and the next block with BLOCK_CHAIN (there should be some joke about cryptocurrencies here).
See a sample of walking a function tree in dump_func_tree.

 

XXX_DECL

represents the declaration of a structure, union, var etc. It often has a name - you should use DECL_NAME, which returns another tree with code IDENTIFIER_NODE, or NULL_TREE. In the first case you can get the name via IDENTIFIER_POINTER and check whether the name is really present via DECL_NAMELESS.
 
Variables can have initial values - you can check this with DECL_INITIAL.
The type of a declaration can be extracted with TREE_TYPE.
 
For functions you can access the first argument with DECL_ARGUMENTS (and the remaining ones with DECL_CHAIN) and the result with DECL_RESULT.
 
Fields of a record can be extracted with TYPE_FIELDS.
The offset of a field can be extracted with DECL_FIELD_OFFSET.
 
One remarkable declaration is

 

LABEL_DECL

has type tree_label_decl and contains the rtx for some code label, which can be extracted with DECL_RTL_IF_SET.

 

XXX_TYPE

represents some type. This is a huge topic, so I cover only the basics.
You can get a type's name with TYPE_NAME. It again returns another tree with code IDENTIFIER_NODE, or NULL_TREE.
 
You can quickly check whether some type is a function/record/union with the XXX_TYPE_P macros like RECORD_OR_UNION_TYPE_P, FUNC_OR_METHOD_TYPE_P, POINTER_TYPE_P etc.
 
The size of a type can be extracted with TYPE_SIZE - it returns a tree, usually with code INTEGER_CST.
 
Enum values can be extracted with TYPE_VALUES.
Fields of a record/union can be extracted with TYPE_FIELDS.
For pointers and references you can extract the base type with TREE_TYPE.

gcc plugin to collect cross-references, part 4

Let's apply the priceless knowledge from the previous part - for example, to extract string literals and insert polymorphic decryption.
A typical call to printf/printk in RTL usually looks like

(insn 57 56 58 9 (set (reg:DI 5 di)
        (symbol_ref/f:DI ("*.LC0") [flags 0x2] <var_decl 0x7f480e9edea0 *.LC0>)) "swtest.c":17:7 80 {*movdi_internal} 

(call_insn 59 58 191 9 (set (reg:SI 0 ax)
        (call (mem:QI (symbol_ref:DI ("printf") [flags 0x41] <function_decl 0x7f480e8c7000 printf>) [0 __builtin_printf S1 A8])
            (const_int 0 [0]))) "swtest.c":17:7 909 {*call_value}
     (nil)
    (expr_list (use (reg:QI 0 ax))
        (expr_list:DI (use (reg:DI 5 di))
            (nil))))

Translation for mere mortals:
The first instruction sets register 5 to the address (via symbol_ref) of some known symbol with the cool name "*.LC0".
The second instruction calls another known symbol, "printf"; the arguments for this call are stored in an expression list - it uses register 0 as the result, and the argument, in another nested expression list, is the earlier-loaded register 5.
 
So we can record 2 items for this function in the cross-references database - the loading of some symbol and the call. Sadly the name "*.LC0" is probably totally useless. Let's check if we can go deeper.

We can extract the TREE item for a symbol_ref rtl with SYMBOL_REF_DECL and check its code - for a variable it is VAR_DECL.
Then we can check whether this variable is initialized via DECL_INITIAL.
A string literal has code STRING_CST.
And finally we can get the content of this literal with TREE_STRING_POINTER and its length with TREE_STRING_LENGTH.
 
I implemented this logic in the method is_cliteral.
Also, for storing literals I added a new virtual method to the persistence interface.
 
You may ask - is it possible to extract integer constants like sizeof? Unfortunately no, mainly because they were converted to RTL const_int during expression evaluation, and RTL has no mechanism for tracking why some const_int has its value. A typical use of sizeof may look like:

... =(some_struct *)kmalloc(sizeof(some_struct) + strlen(s) + 1, GFP_KERNEL);

In RTL this code is converted to a call to strlen, then an addition of some const_int to the result register (its value is the evaluated expression 1 + sizeof(some_struct)), and then passing it in a register or on the stack into kmalloc via an expr_list. Btw, the second argument is also passed via a const_int, so it's impossible to recover the value back to the GFP_KERNEL constant.

gcc plugin to collect cross-references, part 5


Part 1, 2, 3 & 4

Let's check how RTL describes jump tables. I made a simple test, and the output of gcc -fdump-final-insns looks like:

(jump_insn # 0 0 8 (parallel [
            (set (pc)
                (reg:DI 0 ax [93]))
            (use (label_ref #))
        ]) "swtest.c":14:3# {*tablejump_1}
     (nil)
 -> 8)
(barrier # 0 0)
(code_label # 0 0 8 (nil) [2 uses])
(jump_table_data # 0 0 (addr_vec:DI [
            (label_ref:DI #)
            (label_ref:DI #)
...
        ]))

As you can see, jump_insn uses the opcode tablejump_1 referring to label 8. Right after this label is located an RTL node with code jump_table_data - it's probably a bad idea to assume this will always be true, so it's better to use the function jump_table_for_label. Also, for some unknown reason the option -fdump-final-insns does not show the content of jump tables. So let's at least try to find the jump_table_data nodes from a plugin.

Surprisingly, you cannot find them when iterating over instructions within each block (using the FOR_ALL_BB_FN/FOR_BB_INSNS macros). I suspect this is due to the fact that both the label and the jump_table belong to the block with index 0. So I used another loop:
for ( insn = get_insns(); insn; insn = NEXT_INSN(insn) )
Then we can check whether the current RTL instruction is a jump table with JUMP_TABLE_DATA_P. Jump tables have an addr_vec in the element with index 3, and each element is a label_ref. The length of the vector can be obtained from the field num_elem. Pretty easy, so what can we do with this knowledge?
Well, we could at least put jump tables with their sizes into the debug info.

Let's check how this can be done.
All debug info in gcc is produced via the structure gcc_debug_hooks, and it even has a field with the neat name label. The problem is that in dwarf2out.cc this field points to debug_nothing_rtx_code_label, which, as you can guess, really does nothing. Oops.
 
Actually, labels are inserted into the dwarf2 output somewhere inside the function gen_label_die, called from gen_decl_die. So we need to
  1. find the jump tables
  2. check that they are unnamed - the corresponding code_label does not have a LABEL_DECL
  3. add a LABEL_DECL with some name
  4. and link it to the RTL with SET_DECL_RTL
And yes - we can now see our newly inserted labels in the debug info. Without a size, for sure, because the code in gen_label_die needs to be patched to do this. Again.
 
And finally run objdump -g to see
 <2><11a>: Abbrev Number: 14 (DW_TAG_label)
    <11b>   DW_AT_name        : (indirect string, offset: 0x5f): main_jt52
    <11f>   DW_AT_decl_file   : 1
    <120>   DW_AT_decl_line   : 4
    <121>   DW_AT_decl_column : 5
    <122>   DW_AT_byte_size   : 10
    <123>   DW_AT_low_pc      : 0x402050

gcc plugin to collect cross-references, part 6

Part 1, 2, 3, 4 & 5
Finally I was able to compile and collect cross-references for big enough open-source projects like the linux kernel and botan:
wc -l botan.db
2108274 botan.db
grep Err: botan.db | wc -l
540

So let's check how we can extract accesses to record fields. If you take a quick look at tree.def you will notice the very prominent type COMPONENT_REF:
Value is structure or union component.
 Operand 0 is the structure or union (an expression).
 Operand 1 is the field (a node of type FIELD_DECL).
 Operand 2, if present, is the value of DECL_FIELD_OFFSET

 

Sounds easy? "In theory there is no difference between theory and practice." In practice you can encounter many other types in arbitrary combinations, like in this relatively simple RTL:
(call_insn:TI 1482 1481 2856 35 (call (mem:QI (mem/f:DI (plus:DI (reg/f:DI 0 ax [orig:340 MEM[(struct Server_Hello_13 *)_325].D.264452.D.264115._vptr.Handshake_Message ] [340])
                    (const_int 24 [0x18])) [744 MEM[(int (*) () *)_199 + 24B]+0 S8 A64]) [0 *OBJ_TYPE_REF(_200;&MEM[(struct _Uninitialized *)&D.349029].D.305525._M_storage->3B) S1 A8])
        (const_int 0 [0])) "/usr/local/include/c++/12.2.1/bits/stl_construct.h":88:18 898 {*call}
     (expr_list:REG_CALL_ARG_LOCATION (expr_list:REG_DEP_TRUE (concat:DI (reg:DI 5 di)
                (reg/f:DI 41 r13 [386]))
            (nil))
        (expr_list:REG_DEAD (reg:DI 5 di)
            (expr_list:REG_DEAD (reg/f:DI 0 ax [orig:340 MEM[(struct Server_Hello_13 *)_325].D.264452.D.264115._vptr.Handshake_Message ] [340])
                (expr_list:REG_EH_REGION (const_int 0 [0])
                    (expr_list:REG_CALL_DECL (nil)
                        (nil))))))
    (expr_list:DI (use (reg:DI 5 di))
        (nil)))

So I'll briefly describe some TREE types and how to deal with them to extract something useful.

 

COMPONENT_REF

Usually several component references form a chain of fields - for SomeStruct.f1.f2.f3 there will be 3:
  1. a COMPONENT_REF whose Op1 contains the FIELD_DECL for field f3 and whose Op0 references
  2. a COMPONENT_REF whose Op1 contains the FIELD_DECL for field f2 and whose Op0 references
  3. a COMPONENT_REF whose Op1 contains the FIELD_DECL for field f1 and whose Op0 finally references the RECORD_TYPE/UNION_TYPE for SomeStruct

Pretty easy? Actually not - there are at least two problems:

  • Both Op0 & Op1 can contain any other types - for example SSA_NAME
  • The record in each chain can be nameless. For C++ you can find the enclosing class with the function get_containing_scope, but in C all nested nameless structures actually have the scope TRANSLATION_UNIT_DECL - in that case there is a chance that the chain will be broken

A dirty hack - you don't even need the RECORD_TYPE for each field, because you can extract it with DECL_CONTEXT.

 

SSA_NAME

Just a reference to some other TREE - it can be extracted with TREE_TYPE. See the function dump_ssa_name.


MEM_REF

The type of the MEM_REF is the type the bytes at the memory location are interpreted as.
   MEM_REF <p, c> is equivalent to ((typeof(c))p)->x... where x... is a
   chain of component references offsetting p by c

Type can be extracted with TMR_BASE and offset with TMR_OFFSET.

Well, it would be good to find the field at this offset, right? The first field can be extracted with TYPE_FIELDS and the next with TREE_CHAIN. See the function dump_mem_ref for details.


ADDR_EXPR

& in C.  Value is the address at which the operand's value resides
The value can be extracted with TREE_OPERAND(expr, 0) and again can be any of the TREE types. See the function dump_addr_expr for details.

 

OBJ_TYPE_REF

Used to represent lookup in a virtual method table which is dependent on
   the runtime type of an object.  Operands are:
   OBJ_TYPE_REF_EXPR: An expression that evaluates the value to use.
   OBJ_TYPE_REF_OBJECT: Is the object on whose behalf the lookup is
   being performed.  Through this the optimizers may be able to statically
   determine the dynamic type of the object.
   OBJ_TYPE_REF_TOKEN: An integer index to the virtual method table.
   The integer index should have as type the original type of
   OBJ_TYPE_REF_OBJECT
 

This is the main source for collecting virtual method calls.

 

So now we can collect all kinds of accesses to class/structure fields and methods. The only uncovered case is a pointer to method - can it be tracked? Unfortunately no - neither where the offset to the method is assigned, nor where it is called. I wrote a simple test, and the method get_ref looks in disasm like:

      mov     eax, 33 ; just some const
      mov     edx, 0
      pop     rbp

and in RTL:
(insn 6 3 7 2 (set (reg:DI 0 ax [orig:82 D.3252 ] [82])
        (const_int 33 [0x21])) "vtest.cc":40:24 80 {*movdi_internal}
     (nil))


For some unknown reason there are no OFFSET_REF & PTRMEM_CST in RTL.

dwarf5 from clang 14


It seems that clang version 14 utilizes more advanced features from DWARF5, so I added support for them to my dwarfdump. IMHO the most exciting features are:

Section .debug_line_str

In old versions of DWARF, filenames were duplicated in each compilation unit. Since DWARF version 5 they are stored in a separate section and thus shared, saving some space. Obviously this space reduction is negligible compared to the overhead from type duplication.

Section .debug_str_offsets

Also to reduce space, each compilation unit has a so-called base index for strings, passed via DW_AT_str_offsets_base. But there is a problem - some attributes can already have names before DW_AT_str_offsets_base occurs:


  <0><c>: Abbrev Number: 1 (DW_TAG_compile_unit)
    <d>   DW_AT_producer    : (indexed string: 0): clang version 14.0.6 (git@github.com:github/semmle-code 5c87e7737f331823ed8ed280883888566f08cdea)
    <e>   DW_AT_language    : 33        (C++14)
    <10>   DW_AT_name        : (indexed string: 0x1): c/extractor/src/extractor.cpp
    <11>   DW_AT_str_offsets_base: 0x8


As you can see, here 2 attributes have names before we have the value of the string base. Much harder to parse in one pass now.

New locations format

I think this is the coolest and most useful feature - now each variable and parameter has a set of locations linked to address ranges (which is often the case for highly optimized code). Sample:

   Offset Entry 2077
    0024ef56 00000000000006b4 (index into .debug_addr) 004fb3c500000000 (base address)
    0024ef59 0000000000000000 000000000000001c DW_OP_reg5 (rdi)

This cryptic message means that starting from address 0x4fb3c5 (note - most tools like objdump or llvm-dwarfdump cannot correctly show these new locations; in this case objdump showed the address in a bad format) some local variable is located in register rdi until the next address range. It seems that neither IDA Pro nor Binary Ninja can use this debug information:
.text:00004FB3C5     mov     rdi, cs:compilation_tf
.text:00004FB3CC     cmp     dword ptr [rdi+0Ch], 0

Global var compilation_tf has type a_trap_file_ptr - a pointer to a_trap_file. IDA Pro has this type information from the debug info but still cannot show the access to the field of a_trap_file at offset 0xC in the next instruction

 
As a result of all my patches I can now, for example, inspect IL structures from the Microsoft CodeQL C++ extractor:
// Size 0x28
// FileName: c/extractor/edg/src/il_def.h
struct a_name_reference {
// Offset 0x0
a_name_reference_ptr next;
// Offset 0x8
a_name_qualifier_ptr qualifier;
// Offset 0x10
union {
    // Offset 0x0
    a_type_ptr destructor_type;
    // Offset 0x0
    a_property_or_event_descr_ptr property_or_event_descr;
  } variant;
// Offset 0x18
long num_template_arguments;
// Offset 0x20
enum a_special_function_kind special_kind;
// Offset 0x20
a_bit_field is_global_qualified_name:23:1;
// Offset 0x20
a_bit_field is_template_id:22:1;
// Offset 0x20
a_bit_field is_super_qualified:21:1;
// Offset 0x20
a_bit_field is_decltype_qualified:20:1;
// Offset 0x20
a_bit_field used_in_primary_declarator:19:1;
// Offset 0x20
a_bit_field from_prototype_instantiation:18:1;
};

gcc plugin to collect cross-references, part 7


Part 1, 2, 3, 4, 5 & 6

Lets check if we can extract another kind of constants - numerical. Theoretically there are no problems - they have types INTEGER_CST, REAL_CST, COMPLEX_CST and so on. And you can even meet them - mostly in programs written in fortran
In most code they are usually replaced with RTX equivalents like
  • INTEGER_CST - const_int (or const_wide_int)
  • REAL_CST - const_double

const_double is the easy case, but const_ints are really ubiquitous - they can appear in RTX even when they do not occur in operands of the assembler code. So the main task is to select only a small subset of them. Let's consider what we can filter out

fields offsets

Luckily this hard part has already been solved in the previous part

local variable offsets on the stack

RTX has field frame_related:
1 in an INSN or a SET if this rtx is related to the call frame, either changing how we compute the frame address or saving and restoring registers in the prologue and epilogue

this flag affects both parts of a set; loading something from the stack looks like:
set (reg:DI 0 ax [83])
        (mem/f/c:DI (plus:DI (reg/f:DI 6 bp)
                (const_int -8 [0xfffffffffffffff8]))

and storing to the stack like:
set (mem/f/c:DI (plus:DI (reg/f:DI 6 bp)
                (const_int -8 [0xfffffffffffffff8])) [4 this+0 S8 A64])
        (reg:DI 5 di [ _0 ]))

conditions

Yes, if_then_else almost always follows a compare:
(set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 2 cx [orig:83 _2 ] [83])
            (const_int 0 [0]))) "vtest.cc":44:19 5 {*cmpqi_ccno_1}
(jump_insn 10 9 11 2 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 16)
            (pc))) "vtest.cc":44:19 891 {*jcc}
All these bulky constructions will be translated to just jz, so no const_int 0 will appear in the output

EH block index

like the one accompanying each function call:

(expr_list:REG_EH_REGION (const_int 0 [0])

Now the output looks much better, but probably you would like more control over the operation types. For this purpose I added the option -fplugin-arg-gptest-ic=config.file to my plugin

Lets assume that you are compiling some crypto library and want to extract integer constants only for some operations - like assignments, shift and xor but not and. You can put RTX code names into config.file and prefix unneeded ones with '-':

set
xor
ashift
lshiftrt
-and

location lists from dwarf5

During the past weekend I added support for var location lists from DWARF5 (located in the separate section .debug_loclists) to my dwarfdump. As usual, lots of bugs were found

First - they are present only for functions but not for methods. Probably this is a real bug and can have serious impact when debugging

Second - the generated expressions are not optimal. Lets see an example:

locx 53e
4FB5DF - 4FB5F4: DW_OP_piece 0x8, DW_OP_reg0 RAX, DW_OP_piece 0x8, DW_OP_breg0 RAX+0, DW_OP_lit3, DW_OP_lit8, DW_OP_mul, DW_OP_plus, DW_OP_stack_value, DW_OP_piece 0x8
4FB5F4 - 4FB60F: DW_OP_piece 0x8, DW_OP_reg2 RCX, DW_OP_piece 0x8, DW_OP_breg0 RAX+0, DW_OP_lit3, DW_OP_lit8, DW_OP_mul, DW_OP_plus, DW_OP_stack_value, DW_OP_piece 0x8

As you can see these expressions are the same but cover adjacent address ranges. Why not use a single expression for the range 4FB5DF - 4FB60F?

DW_OP_mul just pops a couple of values from the stack and pushes back the result of their multiplication (see the evaluation logic in the method execute_stack_op), so the sub-expression DW_OP_lit3, DW_OP_lit8, DW_OP_mul can be rewritten as just DW_OP_lit24

Also it's curious to check which other compilers support the subject:

gcc

Yes, with options -g -gdwarf-5 -fvar-tracking

golang

No - they don't even support DWARF5 at all

openwatcom v2

No - judging by the funny comments

WHEN YOU FIGURE OUT WHAT THIS FILE DOES, PLEASE DESCRIBE IT HERE!

kssp library


I've tried to solve the CSES task "Visiting Cities"

Looks like you can use a kind of brute force - get the first shortest path, then all remaining paths with the same cost, and make a union of the cities in each path. Neither an elegant nor a smart algorithm, just to estimate whether this approach works at all

I remembered Yen's algo for finding the K-th shortest path from my university course on graph theory, so I chose to use some ready (and hopefully well-tested) implementation - the kssp library from INRIA (yep - the famous place where OCaml was invented)

And then real madness happened - different algos gave me different results and moreover - they didn't match the "correct" results from CSES! Lets see what I got (all results for test 7, compilation options -O3 -DTIME -DNDEBUG):

  • Yen - 470s, 30157 cities
  • node classification - 4.37s, 30140 cities
  • postponed node classification - 3.83s, 30140 cities
  • postponed node classification with star - 3.76s, 30140 cities
  • sidetrack based - consumed 13Gb of memory and met with OOM killer
  • parsimonious sidetrack based - OOM again, perhaps bcs not enough parsimonious :-)

Source code

kssp library, part 2


Previous part 

I was struck by an idea of how to reduce the size of the graph before enumerating all K shortest paths: we can use cut-points. By definition, if some shortest path contains a cut-point, that vertex must always be visited - otherwise you can't reach the destination. So the algo is simple:
  1. find the first shortest path for the whole graph with Dijkstra's algo
  2. find all cut-points in the whole graph
  3. iterate over the found shortest path - whenever the current vertex is a cut-point, run brute force from the previous cut-point to the current one

The results are crazy anyway - on the same test 7:

  • Yen: 13.86s, 29787, 3501 cycles
  • Node Classification: 118.4s, 30013, 3501 cycles
  • Postponed Node Classification: 104.95s, 30013, 3501 cycles
  • PNC*: 120.33s, 30013, 3501 cycles
  • Parsimonious Sidetrack Based: 4.66s, 29980, 3501 cycles
  • Parsimonious Sidetrack Based v2: 4.79s, 29980, 3501 cycles
  • Parsimonious Sidetrack Based v3: 4.85s, 29980, 3501 cycles
  • Sidetrack Based: 4.18s, 29980, 3501 cycles
  • Sidetrack Based with update: 4.34s, 29980, 3501 cycles
This time all algos ran to completion and again gave different results...
Source code

my solutions for couple CSES tasks


CSES has two tasks that are very similar by description but have completely different solutions: "Critical Cities" (218 accepted solutions at the time of writing) and "Visiting Cities" (381 accepted solutions)

Critical Cities

We are given a directed unweighted graph and it seems we need to find its dominators, for example using the Lengauer-Tarjan algo (with complexity O((V+E)log(V+E)))
Then we could check each vertex in this dominator tree to see if it leads to the target node, so the overall complexity is O(V * (V+E)log(V+E))
 
This doesn't look very impressive IMHO. Lets try something completely different (c) Monty Python's Flying Circus. For example we could run a wave (also known as Lee algorithm) from source to target and get some path with complexity O(V+E). Note that in the worst case this path can contain all vertices. Lets mark all vertices in this path
Next we continue to run waves, this time ignoring edges from marked nodes, and see which marked vertices are still reachable. For example, on some step k we run a wave from Vs and reach vertices Vi and Vj. We can conclude that all vertices on the earlier found path between Vs and Vj are NOT critical cities. So we can repeat the next step starting from Vj
This process is repeated at worst V times, so the overall complexity is O(V*(V+E))
 
My solution is here

 

Visiting Cities

This time we are given a directed weighted graph and it seems the simplest solution is to find all K shortest paths (for example with Yen's algo) and make a union of their vertices. Because I'm very lazy I decided to reuse some ready and presumably well-tested implementation of this algo. You can read about the fabulous results here
 

After that I plunged into long thoughts until I decided to count how many paths of minimal length go through each vertex. We can run Dijkstra in both directions - from source to target and from target to source - counting the number of minimal-length paths. Then we select from the first found shortest path the vertices where the product of the forward count and the reverse count equals the forward count at the target (or the reverse count at the source) - it's pretty obvious that you can't avoid such vertices in any shortest path. The complexity of this solution is two runs of Dijkstra's algo (depending on implementation O(V^2), or O(V * log(V) + E * log(V)) using some kind of heap) plus at worst V checks for the vertices in the first found shortest path

My solution is here

Filling Trominos


IMHO this is a very hard task - only 104 accepted solutions. My solution is here

Google gives lots of links for trominoes but they are all for a totally different task from Project Euler - in our case we have only L-shapes. So lets think about a possible algorithm

It's pretty obvious that we can make a 2 x 3 or 3 x 2 rectangle with a couple of L-trominoes. So the naive solution is just to check if one side is divisible by 2 and the other by 3

However, with pen and paper you can quickly realize that you can, for example, fill a 5 x 6 rectangle:

aabaab
abbabb
ccddee
dcdced
ddccdd

The algo can look like this (see function check2x3):

  • if one side of the rectangle is divisible by 6, then the other minus 2 should be divisible by 3
  • if one side of the rectangle is divisible by 6, then the other minus 3 should be divisible by 2

Submit our solution and from the failed tests suddenly discover that you can also have a 9 x 5 rectangle. Some details on how this happens

So we can have at most 3 groups of different shapes:

  • a 9 x 5 rectangle (or even several, if the sides are multiples of 5 & 9) - in my solution it is stored in the field has_95
  • 1 or 2 groups of 2 x 3 rectangles below the 9 x 5 shape: 1 if you can fill this area with 2 x 3 shapes of the same orientation, and 2 if you must mix vertical and horizontal rectangles - field trom
  • the same 1 or 2 groups to the right of the 9 x 5 shape - field right

Now the only remaining problem is coloring

A 9 x 5 rectangle needs 5 different colors, but it is possible to arrange the trominoes so that the borders show only 4 colors and the 5th is inside. For the groups of 2 x 3 rectangles you need 4 colors if the group size is 1 and another 4 if the size is 2. In the worst case the number of colors is 4 for the 9 x 5 + 2 * 2 * 4 = 20 - so we can fit in A-Z

Architecture and Design of the Linux Storage Stack


Not a perfect but a suitable book, considering the small number of books about linux internals. IMHO the most useful is chapter 10, so below is a brief summary of the presented tools

And I have a stupid question - has anyone already merged all this zoo into some cmdlet/package for linux powershell to have a common API? At least I was unable to find anything similar on powershellgallery

Distinct Colors


I've solved yet another very funny CSES task - it looks very similar to another task called "Reachable Nodes" (my solution for it). The only difference is that we are asked to count not unique nodes but colors of nodes. What can go wrong?

And this is where the funny part begins - my patched solution crashed. gdb didn't show anything interesting. However, I remembered a scary cryptic command to show stack usage:

print (char *)_environ - (char *)$sp
$1 = 8384904

Very close to the default 8Mb (check ulimit -s). Wait, WHAT? Do we really have stack exhaustion? Lets check: 8 * 1024 * 1024 = 8388608 bytes. The tree can have 200000 nodes, and 8388608 / 200000 = ~42 bytes for each recursive DFS call. Seems to be true - in each call we store the return address + stack frame RBP + 3 registers holding args (this, indexes of node and parent) - so at least 5 * 8 = 40 bytes. It so happened that some tests contain a tree with a very long stem from root to end, so yes - recursive DFS cannot visit all nodes in such a tree. The solution is simple - we can emulate recursion with std::stack. As a bonus, for all nodes in the stack we can use a single bit mask to save space

Another unpleasant observation is that the trees in the tests aren't BINARY trees. One picture is worth a thousand words:

The degree of node 2 is 4. This is the main reason why the function dfs has a separate branch for processing joint nodes with only 2 descendants - bcs initially the method is_fork returned only left and right

Source

failed attempts to draw graphs


CSES has several really hard graph-related tasks, for example

It would be a good idea to visualize those graphs. One of the well-known tools for this is Graphviz, so I wrote a simple perl script to render a graph from CSES plain text into their DSL. On small graphs all goes well and we can enjoy something like

But it seems that on big graphs with 200k nodes dot just can't finish rendering, and after ~2 hours of hard work it met the OOM killer. Lets think how we can reduce the size of the graph

merging nodes with degree 2

Look at the picture above. We can notice that vertex 5 has 2 edges, to v2 and v7, and can be replaced with just 1 edge between 2 & 7. This process can be repeated until no vertices with degree 2 remain. For directed graphs we can merge a vertex when it has 1 in and 1 out edge

The new picture (use option -c of my script) with merged nodes has only 19 edges vs 38 in the old one:

You can notice that nodes 2 and 6 now have a single edge - nodes 1 & 8 were removed. The vertices in blue are so-called cut-points, and this leads us to

Condensation on Strongly Connected Components

For directed graphs this can be done with Kosaraju's algorithm
For undirected graphs we can just merge all nodes between cut-points. My script supports the -k option for such graph condensation:

Here, for example, box v4 contains vertices 4, 5, 7. And now see what happens with both -c & -k options:

Only 10 edges remain

Unfortunately this is still too much for graphviz - now it can draw 3 shapes:

  1. very long straight line from nodes and lots of edges between them (using dot)
  2. black square of Malevich (using neato)
  3. black ball using circo

So it`s time to look for another tools for graph plotting. Like

R igraph package

It has a convenient method read.csv, so I added option -r to my script to produce a couple of CSV files:
  • nodes.csv & dedges.csv for directed graphs
  • nodes.csv & edges.csv for undirected graphs

Use the simple R script for graph loading - it expects to see this pair of .csv files in the current directory. You can install the igraph package with the command
install.packages("igraph")

Again the result looks like a hairball :-(
