Sunday, May 27, 2012

joern: Fuzzy AST and CFG Extraction

Today, I'm releasing joern, a new tool for robust analysis of source code. When pointed to a directory, it parses anything that "kind of looks like" C/C++ code and constructs Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) as well as a searchable index. Joern allows you to implement quick-and-dirty language aware static analysis tools without requiring a complicated API to be learned or a plugin for an existing IDE to be written. Joern simply mirrors the existing directory structure and exposes information via text files and serialized Python objects. You can download the source distribution here.
  
There is lots of impressive work on static analysis of source code for bug finding, particularly from the software engineering and compiler construction communities. Unfortunately, vulnerability researchers rarely benefit from these results in practice, and it's pretty clear why: most of these techniques require code to be translated into an intermediate format such as Abstract Syntax Trees or Control Flow Graphs to allow code structure and control flow to be analyzed.
During program development, this is not a problem. A working build environment needs to be available anyway, since otherwise code can't be compiled and tested. In contrast, for vulnerability research, setting up a build environment is an optional step and the time it can take to get it working is often simply not justified. Particularly, if you want to perform code analysis on a large scale, covering hundreds of different projects, this can demand a large amount of your time.
To make things worse, vulnerability research often happens long after program development: the code may require adaption to build with current versions of compilers, it may have become difficult to obtain the versions of libraries used and in the worst case, not all the required source code is actually still available. Eventually, you will hear yourself shouting "Just give me the CFG!!", and it is at that time that joern steps in.
Joern will attempt to extract ASTs and CFGs from whatever it can get. To this end, it performs the following steps in succession.
  1. Fuzzy Parsing: Each source file is first parsed. In contrast to a compiler frontend, this parser assumes that the code conforms to the language specification but may not be complete.
  2. AST Creation: The results obtained in the parsing step are now used to construct Abstract Syntax Trees for all functions of the code base. These trees provide a detailed hierarchical representation of the code structure.
  3. CFG Creation: Finally, Control Flow Graphs are created for each function from the parser results. The nodes of this directed graph are basic blocks, i.e. blocks of code always executed in succession, and edges are labeled with conditions.
In addition, joern allows you to filter ASTs and CFGs to restrict your analysis to certain aspects of the code. Let's take a look at a small example to illustrate what this looks like in practice. Consider the following function:

int testFunction(int x, int y)
{
  int s = 0, i = x;
  if(x >= y) return 0;
  while(i < y){
    s += y;
    i++;
  }
  return s;
}

The image on the left shows the unfiltered control flow graph generated from this code. The basic blocks contain the corresponding rows generated by the fuzzy parser, which cover the entire code. As the picture shows, the edges are labeled with the conditions, which must hold so that control is transferred to the destination basic block. The image on the right shows the filtered CFG generated by joern_filter_cfgs, which contains only certain types of nodes such as parameters, declarations and conditions.

Unfiltered CFG
Filtered CFG


Joern uses the output of the fuzzy parser CodeSensor released earlier this year to generate ASTs containing all C/C++ constructs it recognizes in its input stream and CFGs for all functions. For example, for the file /code/subdir/file1.c containing the functions func1 and func2 on lines $line1 and $line2 respectively, executing ...

$ joern_parse code/; joern_filter_asts .code/; joern_filter_cfgs .code/;

... will generate the following directory:

.code/file1.c/ast.csv
.code/file1.c/ast.pickl
.code/file1.c/source

.code/file1.c/func1_$line1:$pos1/cfg.pickl
.code/file1.c/func1_$line1:$pos1/prunedAst.pickl
.code/file1.c/func1_$line1:$pos1/prunedCfg.pickl

.code/file1.c/func2_$line2:$pos2/cfg.pickl
.code/file1.c/func2_$line2:$pos2/prunedAst.pickl
.code/file1.c/func2_$line2:$pos2/prunedCfg.pickl

Behind the scenes, joern_parse walks the code directory and runs CodeSensor on each file to generate a corresponding textual representation of its Abstract Syntax Tree (ast.csv). To ease programmatic access, a corresponding pickle'd python tree data structure is then stored in ast.pickl. Next, a unique directory is created for each function identified by the parser, and the Abstract Syntax Tree of the function is converted into a corresponding Control Flow Graph saved in cfg.pickl.

joern_filter_asts and joern_filter_cfgs then generates pruned versions of both ASTs and CFGs, and stores them in prunedAst.pickl and prunedCfg.pickl respectively. You can specify an existing filter or implement your own. This is particularly useful if you want to focus subsequent analysis only on certain aspects of the code, for example, as part of a feature extraction step for machine learning.

Finally, by running ...

$ joern_index .code/

... you can generate different indexes to perform fast lookups of the locations of code constructs such as functions, function calls and conditions. For debugging purposes, joern_plot also implements a simple graphviz-based plotting utility to visualize ASTs and CFGs.

For more information on how to use the tool, check out the README and run the tools with the --help flag. If you have any more questions about joern, feel free to contact me by mail. If you use joern in your research, I'd also be very interested to chat about that.


Friday, February 3, 2012

Release: CodeSensor 0.1

Today, I'm happy to release CodeSensor, a tool I have been working on for a while: CodeSensor may be useful for you to extract facts from C/C++ code in situations where you do not have a working build-environment. Its goal is to return meta information about source code in a format suitable for further processing using UNIX command line tools and simple scripts.

Here are the results of an example run of CodeSensor:

InputOutput
#include <stdio.h>

int main(int argc, char **argv)
{
  if(argc >= 1){
   
    char *firstArg = argv[1];
   
    while(*firstArg){
     
      if(*firstArg == 'A')
        printf("A!\n");
     
      firstArg++;
    }
  } 
  return 0;
}



func       3:0    18:0     0    int    main
param    3:9    3:17     0    3:0    main    int    argc
param    3:19  3:30     0    3:0    main    char * *    argv
if             5:2    16:2   0    (argc>=1)
decl        7:4    7:28    1    char    *firstArg
while      9:4    15:4     1    (*firstArg)
if           11:6  12:15    2    (*firstArg=='A')
call        12:1  12:14    3    printf    (("A!\n"))
arg        12:8  12:14    3    12:1    printf    "A!\n"
return   17:2   17:10    0    0

As you can see, the output contains several constructs CodeSensor has recognized, displaying the construct-type as well as start- and end- positions in the first three columns. Furthermore, it expresses the nesting of selection- and iteration-statements such as if- and while-statements. To do so, the nesting-depth of each element is displayed in the fourth column. Depending on the type of code construct, further information such as the return type of functions or the conditions imposed by if statements are printed in the remaining columns.

There can be several reasons why a working build environment is not available. For example, the code may be incomplete because important header files are missing. Even if all code is available, the build procedure may be very complex and getting it to work requires more time than you want to invest. Furthermore, the interfaces offered by your compiler to receive abstract syntax trees are usually not particularly beautiful to deal with and may require you to write a lot of post-processing code to get the information into a suitable format.

Having dealt with such problems, I thought it would be nice to have a simple parser for C/C++ that accepts fragments of code and attempts to return information about its grammatical structure in a simple to process format. What is interesting about this problem is that in the general case, incomplete C/C++-code cannot be parsed for several reasons. Consider for example that anything can hide behind a preprocessor macro. For example, you often see function definitions such as:

returnType myFunction(MACRO){ ... }

In the situation where MACRO cannot be resolved because the header file where it is defined is missing, we cannot decide whether this is a function definition or not. Furthermore, the language contains constructs where it is necessary to know whether a sequence of alphanumerics represents a type or an identifier to create the correct parse tree (see this post).

CodeSensor takes a different perspective on this problem. Classically, parsing is a decision problem: You need to decide whether a sequence of tokens corresponds to the language specification. Instead, CodeSensor assumes that the code is syntactically correct. This is a valid assumption for code known to be in production, because if this was not the case, it would not have been possible to produce a binary from this code. CodeSensor now simply seeks a plausible sequence of grammar invocations that is most likely to have led to this code.

It does so by accepting a superset of the language and defining a couple of disambiguation rules. Of course, it cannot guarantee that the produced parse tree is correct, but remember, that's impossible anyway for incomplete code, at least in the general case.

What I personally enjoy most about CodeSensor is its aesthetic formulation. You may or may not agree with me that if something does not at least have a tad of beauty, then there really is not much reason to deal with it. What's beautiful about CodeSensor is that it is entirely composed of a single grammar file written for the ANTLRv3 parser generator and a bit of Java code used to post-process abstract syntax trees. It is a proof-of-concept for the fact that Island Grammars as proposed by Moonen are not just an academic concept but can actually be used to construct approximate parsers for languages as ugly as C++.

You can download a JAR-file here. To run it, simply type

$ java -jar CodeSensor.jar <filename.cpp>
Keep in mind that CodeSensor is pretty alpha though, so it may crash and burn. CodeSensor is released under the GPLv3, so checkout the source-code at GitHub:

https://github.com/fabsx00/codesensor

I would also like to take this release as an opportunity to thank the guys at Recurity Labs and Gregor Kopf in particular for supporting this research. I hope that this tool will be useful to you guys.

Thursday, February 2, 2012

Welcome!

Welcome to my new blog on machine learning for vulnerability identification! Starting from today, I will be working on developing new methods for finding bugs and vulnerabilities in large code bases, mainly based on the idea of exploiting patterns in code to make it more consumable. I will be blogging about newly released tools written during my research, present bug analyses and hopefully even some 0-day found in the process. If this seems interesting to you, check this blog every once in a while for updates.