The Story

In the second half of 2017, Byoungyoung and I were discussing how we could make fuzzing more efficient at flipping conditional branches. He suggested we could use dynamic taint analysis to find the input bytes that affect a target branch, so we could focus on mutating a smaller set of bytes instead of the whole input. One challenge we faced, however, was how to further improve the efficiency of finding a satisfying assignment to those bytes so we could flip the branch. One idea we considered was to collect a bunch of input/output pairs and leverage the universal approximation theorem of deep neural networks (DNNs) to train a model (specific to the target branch); then we could use this model to find the satisfying input. At that time, I was reading some papers on program synthesis and learned that this idea may not work very well because (1) the neural network may not be able to approximate complex branch constraints, (2) even if it can, training the model may require many input/output pairs, and (3) the trained model usually cannot be reused to solve other branches. Instead, I suggested we use those input/output pairs to synthesize the symbolic formula of the branch constraints and then ask SMT solvers to solve it. This was inspired by some research at that time (e.g., DeepCoder), which suggested that if we want to approximate an unknown function f based on some input/output pairs, it is probably a better idea to use machine learning to guide the synthesis of an approximation function f' that can generate the same set of input/output pairs than to directly approximate f with a DNN. So we started building SynFuzz. Unfortunately, we didn't manage to publish SynFuzz, as many taint-guided fuzzers were released around the same time, including Angora and RedQueen.

However, when we were preparing the resubmission of the paper in the summer of 2019, I realized we could actually trace the full symbolic formula using the same taint analysis framework, thus avoiding the cumbersome synthesis process. More importantly, this resulted in a very promising dynamic symbolic execution engine, because:

  • The symbolic formulas are collected at a speed similar to that of taint analysis, which is much faster than previous interpretation-based approaches like KLEE and angr.

  • The symbolic formulas are collected at the LLVM IR level, so they are simpler than those collected at the binary level and easier to solve.

In this blog post, I’m going to explain some technical details of Kirenenko.

Since the implementation is quite simple (a small modification of the data-flow sanitizer from LLVM), I didn't think it would be worth writing a paper about it. So I just open-sourced the code and hope people will build more interesting tools based on it.

Problem Formulation

Background: Dynamic Data-Flow Analysis

Dynamic data-flow analysis aims to track additional properties of program variables according to their runtime data and control dependencies. To facilitate this, an analysis framework associates each program variable with a label (a.k.a. metadata) that represents its properties. A particular dynamic data-flow analysis needs to define four policies:

  • Label interpretation defines what properties are being tracked. Typical interpretations include whether the data is trustworthy and whether the data contains sensitive (e.g., personal health) information.

  • Label sources define where a non-default label is introduced. Examples include APIs where tainted, untrusted, or sensitive data is read.

  • Label propagation defines how different labels are combined or transformed by an executed instruction. For example, when tracking the propagation of untrusted input data from the internet, combining an untrusted label with any other label always results in an untrusted label.

  • Label sinks define where and how the additional runtime properties are used. For example, a policy to detect private data leakage would check data that leaves the local host (e.g., data sent to the network) to ensure its label is not sensitive.

One of the most common forms of dynamic data-flow analysis used in security is dynamic taint analysis, where we have two labels: tainted and untainted. As mentioned above, tainted could mean untrustworthy, like data from the network or from user space; it could also mean sensitive information. Dynamic taint analysis has been used in so many security applications that I'm not going to enumerate them.
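
To make these four policies concrete, here is a minimal sketch of a two-label (tainted/untainted), byte-granularity policy; all names here are illustrative rather than any real framework's API:

#include <stdint.h>
#include <string.h>

typedef uint8_t label_t;          // label interpretation: 0 = untainted, 1 = tainted
static label_t shadow[1 << 20];   // one label per tracked byte

// Label source: mark n bytes as tainted, e.g., after read() from an untrusted fd.
void taint_source(size_t off, size_t n) {
  memset(&shadow[off], 1, n);
}

// Label propagation: combining a tainted label with anything yields tainted.
label_t taint_union(label_t a, label_t b) {
  return a | b;
}

// Label sink: before send(), report whether any outgoing byte is tainted.
int taint_check(size_t off, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (shadow[off + i]) return 1;
  return 0;
}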

Background: Dynamic Symbolic Execution

Symbolic execution treats program inputs as symbolic values instead of concrete values. Program variables (including memory and register contents) can then be represented as symbolic expressions. A symbolic execution engine maintains (1) a symbolic state, which maps program variables to their symbolic expressions, and (2) a set of path constraints, which is a quantifier-free first-order formula over symbolic expressions. To generate a concrete input that would allow the program to follow the same execution trace, the symbolic execution engine uses the path constraints to query an SMT solver for satisfiability and a feasible assignment to the symbolic values (i.e., the input bytes).
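
As a tiny worked example, consider the following fragment with the argument x marked symbolic; the function and the values are made up for illustration, and the comments show what the engine would track:

// Symbolic state: x -> X, where X is a fresh symbolic value.
int check(int x) {
  int y = x * 2 + 1;   // symbolic state: y -> 2*X + 1
  if (y == 25)         // path constraint for the true branch: 2*X + 1 == 25
    return 1;          // an SMT solver returns the model X = 12
  return 0;
}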

Observation: Formula Construction as Data-Flow Analysis

If we think carefully, we can see that maintaining the symbolic state is a special form of dynamic data-flow analysis:

  • Label interpretation: in symbolic execution, the label of a variable is its symbolic expression.

  • Label source: in test case generation, we mark input bytes as symbolic.

  • Label propagation: when labels (symbolic expressions) merge, we create a new expression that combines the operands according to the operation. If expressions are stored as abstract syntax trees (ASTs), then merging means creating a new AST node with the operand ASTs as child nodes.

  • Label sink: at conditional branches (or other interesting places), we use the collected symbolic expressions to ask an SMT solver whether certain constraints are feasible (e.g., whether the branch condition can evaluate to true); if so, we request a model (an assignment).

Kirenenko is built on exactly this observation.

Technical Details

Background: DFSan

Kirenenko is built upon the DataFlowSanitizer (DFSan) from the LLVM project. DFSan is a byte-granularity dynamic data-flow analysis framework. It performs source code (actually IR) level instrumentation to track runtime data-flow. This is done at two levels. For data in LLVM registers (IR values), it uses shadow registers to store the labels. For data in memory, it uses highly optimized shadow memory to map each memory byte to its corresponding label. The most attractive design of DFSan, compared with other dynamic taint analysis frameworks, is that DFSan does not store any information in the label itself. Instead, labels are indices into another metadata structure called the union table, which stores the real input dependencies. Moreover, to track data-flow at byte granularity, the union table is already structured as trees: when merging two labels (i.e., two sets of input dependencies), DFSan allocates a new label (union table entry) and assigns the two labels as child nodes.

struct dfsan_label_info {
  dfsan_label l1;     // first child label (0 if none)
  dfsan_label l2;     // second child label (0 if none)
  const char *desc;   // human-readable description of the label
  void *userdata;     // user-provided metadata
};

dfsan_label __dfsan_union(dfsan_label l1, dfsan_label l2) {
  // simplified
  dfsan_label label =
    atomic_fetch_add(&__dfsan_last_label, 1, memory_order_relaxed) + 1;
  __dfsan_label_info[label].l1 = l1;
  __dfsan_label_info[label].l2 = l2;
  return label;
}

Note that the whole union table is allocated during initialization as a big array, so allocating a new entry is very cheap: just an atomic_fetch_add.
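
As a side note, the shadow memory lookup mentioned above is also just a constant-time address transformation. A simplified sketch of the idea (the mask constant is illustrative; the real layout differs across platforms and DFSan versions):

#include <stdint.h>

typedef uint16_t dfsan_label;

// Each application byte maps to a 2-byte label, so the masked address
// is scaled by 2 (the left shift).
static inline dfsan_label *shadow_for(const void *addr) {
  return (dfsan_label *)(((uintptr_t)addr & ~0x700000000000ULL) << 1);
}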

Extending the Union Table

Following our problem formulation, it is fairly straightforward to see that, to track symbolic expressions instead of just input dependencies, we can extend the union table entry into an AST node and track more information when merging two labels (two sub-expressions).

struct dfsan_label_info {
  dfsan_label l1; // symbolic sub-expression, 0 if the operand is concrete
  dfsan_label l2; // symbolic sub-expression, 0 if the operand is concrete
  u64 op1;        // concrete operand, 0 if the operand is symbolic
  u64 op2;        // concrete operand, 0 if the operand is symbolic
  u16 op;         // the operation of the node, using LLVM IR operations
  u16 size;       // size of the result
};

dfsan_label __taint_union(dfsan_label l1, dfsan_label l2, u16 op, u16 size,
                          u64 op1, u64 op2) {
  // omitted  
}
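
The omitted body is essentially the same cheap allocation as before, just filling in more fields. A minimal sketch (the real implementation also deduplicates expressions first, as discussed below):

dfsan_label __taint_union(dfsan_label l1, dfsan_label l2, u16 op, u16 size,
                          u64 op1, u64 op2) {
  // Sketch only: allocate a new AST node and record the operation.
  dfsan_label label =
    atomic_fetch_add(&__dfsan_last_label, 1, memory_order_relaxed) + 1;
  dfsan_label_info *info = &__dfsan_label_info[label];
  info->l1 = l1;   info->l2 = l2;    // symbolic operands (0 if concrete)
  info->op1 = op1; info->op2 = op2;  // concrete operands (0 if symbolic)
  info->op = op;                     // LLVM IR opcode of this node
  info->size = size;                 // bit width of the result
  return label;
}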

Conceptually, it is as simple as this. Of course, we need additional modifications to the instrumentation pass so we can pass the additional information to the runtime. We also need to modify how load and store are handled: when storing an expression larger than one byte, we need to break it down using the extract operation; and when loading an expression larger than one byte, we need to concatenate the per-byte sub-expressions (see the sketch below). Finally, we also need a hash table to deduplicate symbolic expressions so they don't blow up the union table.
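
For instance, a 4-byte load can assemble the loaded expression from the four per-byte labels roughly as follows; Concat is a hypothetical opcode name, and the real runtime also handles mixed concrete/symbolic bytes:

// Sketch: concatenate the byte labels ls[0..3] of a 4-byte load,
// least-significant byte first (assuming a little-endian layout).
dfsan_label load_label_32(dfsan_label *ls) {
  dfsan_label l = ls[0];
  for (int i = 1; i < 4; i++)
    l = __taint_union(ls[i], l, Concat, (i + 1) * 8, 0, 0);
  return l;
}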

Generating New Test Inputs

To generate new test inputs, we instrument conditional branch instructions and switch instructions (i.e., the corresponding LLVM IR instructions) to invoke a callback function in the runtime. Inside the callback, we traverse the union table entries (i.e., walk the AST) to translate the LLVM-IR-level AST into a Z3 AST, then consult the Z3 SMT solver to check the feasibility of a branch target; if it is feasible, we get a model and generate a new input. For this step, we follow what QSYM does: track nested branches, and if solving with the nested branch constraints fails, fall back to optimistic solving.
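
The translation itself is a straightforward recursive walk over the union table. Here is a hedged sketch using the Z3 C++ API, with illustrative opcode values and leaf encoding rather than Kirenenko's exact scheme:

#include <z3++.h>
#include <string>

// Illustrative opcode values; the real union table stores LLVM IR opcodes.
enum { OpInput = 0, OpAdd = 1 /* , ... */ };

z3::expr serialize(z3::context &ctx, dfsan_label label) {
  dfsan_label_info *info = &__dfsan_label_info[label];
  switch (info->op) {
  case OpInput: // leaf node: an input byte, identified here by its offset in op1
    return ctx.bv_const(("input_" + std::to_string(info->op1)).c_str(), 8);
  case OpAdd: { // each operand is either another expression or a constant
    z3::expr lhs = info->l1 ? serialize(ctx, info->l1)
                            : ctx.bv_val(info->op1, info->size);
    z3::expr rhs = info->l2 ? serialize(ctx, info->l2)
                            : ctx.bv_val(info->op2, info->size);
    return lhs + rhs;
  }
  // ... the remaining LLVM IR opcodes are handled the same way
  default:
    throw z3::exception("unsupported op");
  }
}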

External Libraries

The biggest limitation of Kirenenko is its support for third-party libraries, which it inherits from DFSan. Specifically, DFSan performs source code (IR) level instrumentation, so it cannot propagate data-flow through dynamic libraries that are not instrumented. One example is libc. To support such libraries, we either need to write wrapper functions that specify how data-flow should be propagated (take a look at dfsan_custom.cc) or compile the library with instrumentation. While other source-code-based symbolic execution tools like KLEE and SymCC face the same problem, it indeed makes them less convenient than binary-based tools like QSYM.
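
For reference, a DFSan custom wrapper for a function foo is a function named __dfsw_foo that receives one extra label argument per parameter, plus a pointer for the return label. A simplified sketch in the spirit of dfsan_custom.cc:

#include <string.h>

size_t __dfsw_strlen(const char *s, dfsan_label s_label,
                     dfsan_label *ret_label) {
  size_t ret = strlen(s);
  // A precise policy would make the result depend on the labels of
  // s[0..ret]; the simplest (lossy) policy just clears the return label.
  *ret_label = 0;
  return ret;
}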

Performance

Kirenenko is fast for three reasons:

  1. The symbolic constraint collection is done at native speed, or, to use SymCC's term, it is compilation-based instead of interpretation-based.

  2. The symbolic constraints are collected at the LLVM IR level, so they are simpler and faster to solve, as shown in the evaluation of SymCC.

  3. It is built upon the highly optimized sanitizer infrastructure: accessing a label is fast, accessing each AST node is fast, and allocating each AST node is also fast.

So how fast is it really? Here are some quick test results I got; please feel free to grab the code and try it yourself.

Program   Native    w/o Solving   w/ Solving
objdump   0.0009s   0.2665s       3.5206s
size      0.0010s   0.2461s       3.3909s
readelf   0.0009s   0.4068s       4.9730s
tcpdump   0.0030s   0.0090s       0.8770s
readpng   0.0010s   0.0060s       90.775s

As we can see, although it's still much slower than native execution, collecting the symbolic constraints themselves is very fast now, and the bottleneck is constraint solving. I didn't include numbers for QSYM and SymCC; if you're interested, you can test them yourself.

Here are some FuzzBench results when pairing Kirenenko with AFL++.

[Figures: FuzzBench coverage growth and pairwise comparison plots for libjpeg, libpng, and sqlite]