Motivation

Reverse Engineering

The problem originated in reverse engineering, where we decompile binary code to recover source code. Two versions of a function are involved: the original function before compilation, $f_{1}$, and the function recovered by decompilation, $f_{2}$. Does $f_{1} = f_{2}$ hold? In most cases the two do not look identical, and they may also behave differently in some edge cases. Undefined behavior (UB), different machines, and different compilers can all make the executables differ at the binary level.

Bug Detection

Suppose we have two versions of OpenSSL, and some functions changed in the update. A careless engineer made a small typo that no one noticed, and the code seemed to behave well. But in production, for some edge case, say $n = 114514$, the program crashed! Before publishing the new version, we would like to know whether the new function and the old function do the same thing, and if not, where the new function improves or regresses.

KLEE: A Symbolic Execution Engine

The traditional way of tackling this stupid $f_{1} = f_{2}$ problem is static code analysis. Here we propose a new method based on symbolic execution, built on KLEE.

Implementation

Pipeline

Create Docker Container
-> Read Code Bases
-> Copy Code into Container
-> Compile and Link
-> Pass It to KLEE
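
The pipeline steps above can be sketched as a list of shell commands. This is only an illustration and not the project's actual code: the function name `build_pipeline_commands`, the working directory `/tmp/work`, the image name `klee/klee`, and the output name `linked.bc` are all assumptions, and the sketch only builds the commands without running them. File ownership inside the container may also need adjusting in a real setup.

```python
from __future__ import annotations


def build_pipeline_commands(
    container: str,
    host_src_dir: str,
    sources: list[str],
    image: str = "klee/klee",  # assumption: the public KLEE image
) -> list[list[str]]:
    """Return the commands for one run; nothing is executed here."""
    workdir = "/tmp/work"  # hypothetical location inside the container
    commands = [
        # 1. Create the container and keep it alive.
        ["docker", "run", "-d", "--name", container, image, "sleep", "infinity"],
        # 2./3. Copy the code base from the host into the container.
        ["docker", "cp", host_src_dir, f"{container}:{workdir}"],
    ]
    bitcode_files = []
    for src in sources:
        bc = src.rsplit(".", 1)[0] + ".bc"
        # 4a. Compile each C file to LLVM bitcode.
        commands.append(
            ["docker", "exec", "-w", workdir, container,
             "clang", "-emit-llvm", "-c", "-g", "-O0", src, "-o", bc]
        )
        bitcode_files.append(bc)
    # 4b. Link all bitcode files into one module.
    commands.append(
        ["docker", "exec", "-w", workdir, container,
         "llvm-link", *bitcode_files, "-o", "linked.bc"]
    )
    # 5. Run KLEE on the linked module.
    commands.append(["docker", "exec", "-w", workdir, container, "klee", "linked.bc"])
    return commands
```

Each returned command could then be passed to `subprocess.run(cmd, check=True)`, so that a failing step stops the run early.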

TODO

Test of LLM Reliability

The aim of these tests is to judge whether the code generated by the LLM can be relied upon. Here something interesting happens:

Our initial task is to test whether $f_{1} = f_{2}$.
We send the signatures of both functions to the LLM and receive main.c #1. In the test, we compare it with a hand-written harness, main.c #2. The task now becomes judging whether $\mathrm{main}_{1} = \mathrm{main}_{2}$.
The intuitive approach is to count the assertion failures in the output of ktest-tool.

The problem is that if the main function contains no actual code, an assertion failure can never happen, so the count of assertion failures is 0. But if $f_{1}$ and $f_{2}$ happen to be the same, we also expect no assertion failures. The count alone therefore cannot tell an empty harness from a correct one, and in this case the test is incorrect!

To make the test results more trustworthy, we studied KLEE's output and found files with the .ktest suffix. The number of these files reflects the number of branches in our input function. Consequently, if main is empty, there will be no .ktest files in the output directory.
At least in this case, the count of .ktest files tells us something.

We also use the content of a single .ktest file. It contains the values of the variables when a certain branch is reached.

Note that these checks still do not guarantee that the test works in every possible scenario, but they should suffice for simple cases.
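
A minimal sketch of how the two criteria could be combined is shown below. The function names `count_ktest_files` and `judge_run` and the three verdict strings are hypothetical. The assertion-failure count is assumed to have been extracted elsewhere and is passed in as a plain integer.

```python
from pathlib import Path


def count_ktest_files(klee_out_dir: str) -> int:
    """Count the files ending in .ktest directly inside a KLEE output directory."""
    return sum(1 for _ in Path(klee_out_dir).glob("*.ktest"))


def judge_run(klee_out_dir: str, assertion_failures: int) -> str:
    """Combine both criteria into a coarse verdict (labels are illustrative)."""
    if count_ktest_files(klee_out_dir) == 0:
        # No test cases at all: the harness probably did nothing,
        # so a failure count of 0 carries no information.
        return "inconclusive"
    if assertion_failures > 0:
        return "difference found"
    return "no difference found"
```

The point of the design is that zero assertion failures is only reported as "no difference found" when the run also produced at least one .ktest file.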

TODOs:
[x] - test ktest output.
[x] - implement new test_main. try catch
[x] - clean docker container: remove_running_container
[] - preping test
[x] - test directory relative path -> absolute path

Future Aspects

Can we make it parallel? Say, compare a bunch of functions at once. (The find-thing step should then search for several signatures at once.)
Does the number of .ktest files always lie in the interval [max{#branch1, #branch2}, #branch1 * #branch2]?
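
The interval question could be checked empirically against measured runs with a small helper such as the one below. The function name and parameters are hypothetical, and the helper only tests the conjectured bounds on given numbers. It does not prove anything.

```python
def within_conjectured_bounds(n_ktest: int, n_branch_1: int, n_branch_2: int) -> bool:
    """Check max{#branch1, #branch2} <= #ktest <= #branch1 * #branch2."""
    lower = max(n_branch_1, n_branch_2)
    upper = n_branch_1 * n_branch_2
    return lower <= n_ktest <= upper
```

A single measured run for which this returns False would be a counterexample to the conjecture.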

Automatic JSON schema generation

Current behaviour of the program

1. Start a container.
2. Search the full code bases for only 2 function signatures.
3. Run the tests.
4. Stop and remove the container.

In the future we will likely want to compare many functions. The current pipeline then becomes very expensive, at least in step 2, and it also incurs the extra cost of creating and removing a container in steps 1 and 4.

My proposed changes

Change the function-name entries to an array of tuples, functions-to-be-compared: {(f1,g1), (f2,g2), ...}.
In test_find_thing.py, add a new function find_function_signatures_in_files(file_names: list[str], function_names: list[str]). It searches the files for function signatures and returns a dictionary mapping each function name to its signature or None. Two more new functions, find_function_signatures_in_file and find_function_signatures_in_string, will be needed for this.
Pass the 2 resulting dictionaries to a function that generates an array of tuples {(sig_f1,sig_g1), (sig_f2,sig_g2), ...}, dropping any tuple that contains None.
Pass this array to the LLM as a whole, or pass each tuple separately, producing one main.c or a set of main.c files.
Redesign the rest of the code so that it can run multiple tests at the same time. How to implement this is to be discussed later.
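
The proposed functions could look roughly like the sketch below. It keeps the function names from the proposal, while `pair_signatures` and the regular expression are assumptions made for illustration. The regex is deliberately naive: it only handles simple C declarations and definitions and will miss, for example, parameters that are function pointers.

```python
from __future__ import annotations

import re
from pathlib import Path


def find_function_signatures_in_string(
    source: str, function_names: list[str]
) -> dict[str, str | None]:
    """Map each requested name to its first matching signature, or None."""
    found: dict[str, str | None] = {}
    for name in function_names:
        # Return type tokens, then the name, then a parameter list, then '{' or ';'.
        # The lookahead keeps call statements like 'return f(x);' from matching.
        pattern = re.compile(
            r"^[ \t]*((?:(?!return\b)[A-Za-z_]\w*[ \t*]+)+"
            + re.escape(name)
            + r"[ \t]*\([^;{)]*\))\s*[{;]",
            re.MULTILINE,
        )
        match = pattern.search(source)
        # Collapse whitespace so multi-line signatures come out on one line.
        found[name] = " ".join(match.group(1).split()) if match else None
    return found


def find_function_signatures_in_file(
    file_name: str, function_names: list[str]
) -> dict[str, str | None]:
    text = Path(file_name).read_text(encoding="utf-8", errors="replace")
    return find_function_signatures_in_string(text, function_names)


def find_function_signatures_in_files(
    file_names: list[str], function_names: list[str]
) -> dict[str, str | None]:
    """Search every file once for all names that are still missing."""
    result: dict[str, str | None] = {name: None for name in function_names}
    for file_name in file_names:
        missing = [name for name, sig in result.items() if sig is None]
        if not missing:
            break
        result.update(find_function_signatures_in_file(file_name, missing))
    return result


def pair_signatures(
    pairs: list[tuple[str, str]],
    sigs_a: dict[str, str | None],
    sigs_b: dict[str, str | None],
) -> list[tuple[str, str]]:
    """Build (sig_f, sig_g) tuples and drop any pair with a missing signature."""
    paired = []
    for f, g in pairs:
        sig_f, sig_g = sigs_a.get(f), sigs_b.get(g)
        if sig_f is not None and sig_g is not None:
            paired.append((sig_f, sig_g))
    return paired
```

With this shape, each code base is read once for all requested names, instead of once per function pair.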

For the Job Search

Since the project is not very big yet, our team consists of only 3 people at this stage. All of us do some programming and propose features.

The motivation of the project is quite simple. Given a set of function pairs, we want to tell which pairs are essentially the same. In reverse engineering we encounter this problem quite often. Usually we have the original source code, compile it into an executable, and then decompile it back into source code. Naturally, we want to know whether the decompiler works correctly by comparing these two versions of the code. Luckily, another team has already implemented a symbolic execution engine, KLEE. Our main job in this project is to pass the names of the functions we want to compare to the program and let the LLM automatically generate the test harnesses, so that we can do comparisons in bulk.

We ran into several technical difficulties during implementation. I can go into more detail if you are interested.

KLEE works only with a specific version of clang, so managing the environment can be a little tricky. We use a Docker container to solve this problem.
Compiling a large C project can be a headache. We are still brainstorming how to solve this problem.
We introduced a set of criteria to make the test results more reliable, because a single criterion can give a ridiculous result when the LLM's output does not even work.
I proposed a feature to reduce the time complexity and enable our program to do some of the work in parallel.

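This article is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). Please credit the source when reposting.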
