Sunday, May 27, 2012

joern: Fuzzy AST and CFG Extraction

Today, I'm releasing joern, a new tool for robust analysis of source code. When pointed to a directory, it parses anything that "kind of looks like" C/C++ code and constructs Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) as well as a searchable index. Joern allows you to implement quick-and-dirty language aware static analysis tools without requiring a complicated API to be learned or a plugin for an existing IDE to be written. Joern simply mirrors the existing directory structure and exposes information via text files and serialized Python objects. You can download the source distribution here.
  
There is lots of impressive work on static analysis of source code for bug finding, particularly from the software engineering and compiler construction communities. Unfortunately, vulnerability researchers rarely benefit from these results in practice, and it's pretty clear why: most of these techniques require code to be translated into an intermediate format such as Abstract Syntax Trees or Control Flow Graphs to allow code structure and control flow to be analyzed.
During program development, this is not a problem. A working build environment needs to be available anyway, since otherwise code can't be compiled and tested. In contrast, for vulnerability research, setting up a build environment is an optional step and the time it can take to get it working is often simply not justified. Particularly, if you want to perform code analysis on a large scale, covering hundreds of different projects, this can demand a large amount of your time.
To make things worse, vulnerability research often happens long after program development: the code may require adaption to build with current versions of compilers, it may have become difficult to obtain the versions of libraries used and in the worst case, not all the required source code is actually still available. Eventually, you will hear yourself shouting "Just give me the CFG!!", and it is at that time that joern steps in.
Joern will attempt to extract ASTs and CFGs from whatever it can get. To this end, it performs the following steps in succession.
  1. Fuzzy Parsing: Each source file is first parsed. In contrast to a compiler frontend, this parser assumes that the code conforms to the language specification but may not be complete.
  2. AST Creation: The results obtained in the parsing step are now used to construct Abstract Syntax Trees for all functions of the code base. These trees provide a detailed hierarchical representation of the code structure.
  3. CFG Creation: Finally, Control Flow Graphs are created for each function from the parser results. The nodes of this directed graph are basic blocks, i.e. blocks of code always executed in succession, and edges are labeled with conditions.
In addition, joern allows you to filter ASTs and CFGs to restrict your analysis to certain aspects of the code. Let's take a look at a small example to illustrate what this looks like in practice. Consider the following function:

int testFunction(int x, int y)
{
  int s = 0, i = x;
  if(x >= y) return 0;
  while(i < y){
    s += y;
    i++;
  }
  return s;
}

The image on the left shows the unfiltered control flow graph generated from this code. The basic blocks contain the corresponding rows generated by the fuzzy parser, which cover the entire code. As the picture shows, the edges are labeled with the conditions, which must hold so that control is transferred to the destination basic block. The image on the right shows the filtered CFG generated by joern_filter_cfgs, which contains only certain types of nodes such as parameters, declarations and conditions.

Unfiltered CFG
Filtered CFG


Joern uses the output of the fuzzy parser CodeSensor released earlier this year to generate ASTs containing all C/C++ constructs it recognizes in its input stream and CFGs for all functions. For example, for the file /code/subdir/file1.c containing the functions func1 and func2 on lines $line1 and $line2 respectively, executing ...

$ joern_parse code/; joern_filter_asts .code/; joern_filter_cfgs .code/;

... will generate the following directory:

.code/file1.c/ast.csv
.code/file1.c/ast.pickl
.code/file1.c/source

.code/file1.c/func1_$line1:$pos1/cfg.pickl
.code/file1.c/func1_$line1:$pos1/prunedAst.pickl
.code/file1.c/func1_$line1:$pos1/prunedCfg.pickl

.code/file1.c/func2_$line2:$pos2/cfg.pickl
.code/file1.c/func2_$line2:$pos2/prunedAst.pickl
.code/file1.c/func2_$line2:$pos2/prunedCfg.pickl

Behind the scenes, joern_parse walks the code directory and runs CodeSensor on each file to generate a corresponding textual representation of its Abstract Syntax Tree (ast.csv). To ease programmatic access, a corresponding pickle'd python tree data structure is then stored in ast.pickl. Next, a unique directory is created for each function identified by the parser, and the Abstract Syntax Tree of the function is converted into a corresponding Control Flow Graph saved in cfg.pickl.

joern_filter_asts and joern_filter_cfgs then generates pruned versions of both ASTs and CFGs, and stores them in prunedAst.pickl and prunedCfg.pickl respectively. You can specify an existing filter or implement your own. This is particularly useful if you want to focus subsequent analysis only on certain aspects of the code, for example, as part of a feature extraction step for machine learning.

Finally, by running ...

$ joern_index .code/

... you can generate different indexes to perform fast lookups of the locations of code constructs such as functions, function calls and conditions. For debugging purposes, joern_plot also implements a simple graphviz-based plotting utility to visualize ASTs and CFGs.

For more information on how to use the tool, check out the README and run the tools with the --help flag. If you have any more questions about joern, feel free to contact me by mail. If you use joern in your research, I'd also be very interested to chat about that.