Friday, February 3, 2012

Release: CodeSensor 0.1

Today, I'm happy to release CodeSensor, a tool I have been working on for a while. CodeSensor may be useful if you need to extract facts from C/C++ code in situations where you do not have a working build environment. Its goal is to return meta information about source code in a format suitable for further processing using UNIX command line tools and simple scripts.

Here are the results of an example run of CodeSensor:

Input:

#include <stdio.h>

int main(int argc, char **argv)
{
  if(argc >= 1){

    char *firstArg = argv[1];

    while(*firstArg){

      if(*firstArg == 'A')
        printf("A!\n");

      firstArg++;
    }
  }
  return 0;
}

Output:

func     3:0    18:0    0    int     main
param    3:9    3:17    0    3:0     main      int         argc
param    3:19   3:30    0    3:0     main      char * *    argv
if       5:2    16:2    0    (argc>=1)
decl     7:4    7:28    1    char    *firstArg
while    9:4    15:4    1    (*firstArg)
if       11:6   12:15   2    (*firstArg=='A')
call     12:1   12:14   3    printf  (("A!\n"))
arg      12:8   12:14   3    12:1    printf    "A!\n"
return   17:2   17:10   0    0

As you can see, the output lists the constructs CodeSensor has recognized, one per line. The first three columns give the construct type and its start and end positions (as line:column). The fourth column gives the nesting depth of each element, which expresses how selection and iteration statements such as if and while are nested. Depending on the type of construct, the remaining columns hold further information, such as the return type of a function or the condition of an if statement.
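To give an idea of the kind of simple script this format is meant for, here is a minimal Python sketch that reads such rows and pulls out the conditions of all if and while statements. It assumes that the columns are separated by tabs, which the listing above does not prove. The names Row, parse_rows and SAMPLE are made up for this illustration and are not part of CodeSensor.

```python
from collections import namedtuple

# Hypothetical record type for one output row (not part of CodeSensor).
Row = namedtuple("Row", "kind start end depth rest")


def parse_position(text):
    """Turn a 'line:column' string such as '12:14' into an (int, int) pair."""
    line, column = text.split(":")
    return int(line), int(column)


def parse_rows(output):
    """Parse CodeSensor-style rows, ASSUMING tab-separated columns."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        kind, start, end, depth = fields[:4]
        rows.append(Row(kind, parse_position(start), parse_position(end),
                        int(depth), fields[4:]))
    return rows


# A few rows from the example run above, re-typed with tabs as separators.
SAMPLE = "\n".join([
    "func\t3:0\t18:0\t0\tint\tmain",
    "if\t5:2\t16:2\t0\t(argc>=1)",
    "while\t9:4\t15:4\t1\t(*firstArg)",
    "if\t11:6\t12:15\t2\t(*firstArg=='A')",
    'call\t12:1\t12:14\t3\tprintf\t(("A!\\n"))',
])

rows = parse_rows(SAMPLE)

# Conditions of all selection and iteration statements, in source order.
conditions = [(row.kind, row.rest[0]) for row in rows
              if row.kind in ("if", "while")]

# Deepest nesting level seen in the file.
max_depth = max(row.depth for row in rows)
```

If the real separator turns out to be something else, only the split call in parse_rows needs to change.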

There can be several reasons why a working build environment is not available. For example, the code may be incomplete because important header files are missing. Even if all code is available, the build procedure may be so complex that getting it to work requires more time than you want to invest. Furthermore, the interfaces compilers offer for obtaining abstract syntax trees are usually not particularly pleasant to deal with and may require you to write a lot of post-processing code to get the information into a suitable format.

Having dealt with such problems, I thought it would be nice to have a simple parser for C/C++ that accepts fragments of code and attempts to return information about their grammatical structure in a simple-to-process format. What is interesting about this problem is that in the general case, incomplete C/C++ code cannot be parsed, for several reasons. Consider, for example, that anything can hide behind a preprocessor macro. You often see function definitions such as:

returnType myFunction(MACRO){ ... }

In the situation where MACRO cannot be resolved because the header file where it is defined is missing, we cannot decide whether this is a function definition or not. Furthermore, the language contains constructs where it is necessary to know whether a sequence of alphanumerics represents a type or an identifier to create the correct parse tree (see this post).
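The type-versus-identifier problem can be shown with the classic statement "a * b;". The following toy Python sketch is only an illustration of the ambiguity, not anything CodeSensor does: the function classify and the set known_types are hypothetical, and the sketch handles nothing but statements of exactly this shape.

```python
def classify(statement, known_types):
    """Classify a statement of the form 'X * Y;'.

    If X is a known type name, the statement declares Y as a pointer to X.
    Otherwise it is read as the expression 'X times Y'.
    """
    left, _, _right = statement.rstrip(";").partition("*")
    if left.strip() in known_types:
        return "declaration"
    return "expression"


# With the header that declares 'a' as a type available:
with_header = classify("a * b;", known_types={"a"})

# With that header missing, the very same tokens read as a multiplication:
without_header = classify("a * b;", known_types=set())
```

The tokens are identical in both calls. Only the knowledge about "a", which would normally come from a header file, decides which parse tree is the right one.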

CodeSensor takes a different perspective on this problem. Classically, parsing is a decision problem: you need to decide whether a sequence of tokens conforms to the language specification. Instead, CodeSensor assumes that the code is syntactically correct. This is a valid assumption for code known to be in production, because if this were not the case, it would not have been possible to produce a binary from it. CodeSensor then simply seeks a plausible sequence of grammar rule invocations that is most likely to have produced this code.

It does so by accepting a superset of the language and defining a couple of disambiguation rules. Of course, it cannot guarantee that the produced parse tree is correct, but remember, that's impossible anyway for incomplete code, at least in the general case.

What I personally enjoy most about CodeSensor is its aesthetic formulation. You may or may not agree with me that if something does not at least have a tad of beauty, then there really is not much reason to deal with it. What's beautiful about CodeSensor is that it is entirely composed of a single grammar file written for the ANTLRv3 parser generator and a bit of Java code used to post-process abstract syntax trees. It is a proof-of-concept for the fact that Island Grammars as proposed by Moonen are not just an academic concept but can actually be used to construct approximate parsers for languages as ugly as C++.
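To make the island idea a bit more concrete, here is a deliberately crude Python sketch. It is not CodeSensor's ANTLR grammar and not Moonen's formalism, just a toy under stated assumptions: the "islands" are if and while keywords followed by a balanced parenthesized condition, and everything else is "water" that is skipped without being understood. The names find_islands and balanced_end are invented for this example, and the sketch ignores string literals and comments entirely.

```python
import re

# Islands of interest: the keywords 'if' and 'while' as whole words.
KEYWORD = re.compile(r"\b(if|while)\b")


def balanced_end(text, open_index):
    """Return the index just past the ')' matching the '(' at open_index.

    Returns -1 if the parentheses never balance.
    """
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    return -1


def find_islands(code):
    """Collect (keyword, condition) pairs and skip everything else as water."""
    islands = []
    pos = 0
    while True:
        match = KEYWORD.search(code, pos)
        if match is None:
            break
        # Skip whitespace between the keyword and a possible '('.
        open_index = match.end()
        while open_index < len(code) and code[open_index].isspace():
            open_index += 1
        if open_index < len(code) and code[open_index] == "(":
            end = balanced_end(code, open_index)
            if end != -1:
                islands.append((match.group(1), code[open_index:end]))
                pos = end
                continue
        # Not a well-formed island: treat the keyword as water and move on.
        pos = match.end()
    return islands


# The unresolved MACRO is simply water, so it does not stop the scan.
fragment = "returnType myFunction(MACRO){ if(argc >= 1){ while(*p){ p++; } } }"
islands = find_islands(fragment)
```

The point of the sketch is that the scanner never has to decide what "returnType myFunction(MACRO)" means. It only commits to the constructs it is sure about, which is the same trade-off an approximate parser makes on a much larger scale.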

You can download a JAR-file here. To run it, simply type

$ java -jar CodeSensor.jar <filename.cpp>

Keep in mind that CodeSensor is pretty alpha though, so it may crash and burn. CodeSensor is released under the GPLv3, so check out the source code on GitHub:

https://github.com/fabsx00/codesensor

I would also like to take this release as an opportunity to thank the guys at Recurity Labs and Gregor Kopf in particular for supporting this research. I hope that this tool will be useful to you guys.

Thursday, February 2, 2012

Welcome!

Welcome to my new blog on machine learning for vulnerability identification! Starting today, I will be working on new methods for finding bugs and vulnerabilities in large code bases, mainly based on the idea of exploiting patterns in code to make it more consumable. I will be blogging about newly released tools written during my research, presenting bug analyses and hopefully even some 0-days found in the process. If this seems interesting to you, check this blog every once in a while for updates.