Monday, September 29, 2008

Improving Binary Comparison (and it's implication for malware classification)

I am at Virus Bulletin in Ottawa -- if anyone wants to meet to see our new stuff, please drop mail to ! :)

It has been a while since I posted here -- partially because I had a lot of work to finish, partially because, after having finished all this work, I took my first long vacation in a ... very long while.

So I am back, and there are a number of things that I am happy to blog about. First of all, I now have in writing that I am officially an MSc in Mathematics. For those that care about obscure things like extending the euclidian algorithm to the ring of boolean functions, you can check the thesis here:

For those that are less crazy about weird computational algebra: Our team here at zynamics has made good progress on improving the core algorithms behind BinDiff further. Our stated goal was to make BinDiff more useful for symbol porting: If you have an executable and you suspect that it might contain a statically linked library for which you have source access (or which you have analyzed before), we want BinDiff to be able to port the symbols into the executable you have, even if the compiler versions and build environments differ significantly, and even if the versions of the library are not quite the same.

Why is this important ? Let's say you're disassembling some piece of network hardware, and you find an OpenSSL-string somewhere in the disassembled image. Let's say you're disassembling an old PIX image (6.34 perhabs) and see the string

OpenSSL 0.9.5a 1 Apr 2000

This implies that PIX contains OpenSSL, and that the guys at Cisco probably backported any fixes to OpenSSL to the 0.9.5a version. Now, it would be fantastic if we could do the following: Compile OpenSSL 0.9.5a with full symbols on our own machine, and then "pull-in" these symbols into our PIX disassembly.

While this was sometimes possible with the BinDiff v2.0 engine (and v2.1, which is still essentially the same engine), the results were often lacking in both speed and accuracy. A few months back, Soeren and I went back to the drawing board and thought about the next generation of our diffing engine -- with specific focus on the ability to compare executables that are "far from each other", that differ significantly in build environments etc. and that only share small parts of their code. The resulting engine (dubbed "DiffDeluxe" by Soeren) is significantly stronger at this task.

Why did the original BinDiff v2 engine perform poorly ? There are a number of reasons to this, but primarily because of the devastating impact that a "false match" can have on further matches in the diffing process, and due to the fact that in the described scenarios, most of the executable is completely different, and only small portions match. The old engine had a tendency to match a few of the "unrelated components" of each executable, and these initial incorrect matches led to further bad matching down the road.

This doesn't mean the BinDiff v2 engine isn't probably the best all-round diffing engine you can find (I think it is, even if some early builds of the v2.0 suffered from silly performance issues -- those of you that are still plagued by this please contact support@ for a fix !) -- but for this particular problem some old architectural assumptions had to be thrown overboard.

Anyhow, to cut a long story short: While the results generated by DiffDeluxe aren't perfect yet, they are very promising. Let's follow our PIX/OpenSSL scenario:

DiffDeluxe operates with two "fuzzy" values for each function match: "Similarity" and "Confidence". Similarity indiciates how successful the matching algorithm was in matching basic blocks and instructions within the two functions, and confidence indicates how "certain" DiffDeluxe is that this match is a correct one. This is useful to sort the "good" and "bad" matches, and to inspect results before porting comments/names. Anyhow, let's look at some high-confidence matches:

Well, one doesn't need to be a rocket scientist to see that these functions match. But in many situations, the similarity between two functions is not 100% evident: The following is a matched function with only 72% similarity (but 92% confidence):

So what is the overall result ? Out of the 3977 functions which we had in, we were able to match 1780 in our Pix disassembly -- but with a big caveat: A significant number of these have very low similarity and confidence scores. This isn't surprising: The differences between the compiler used upon compile time of our Pix image (sometime 6 years ago ?) and the compiler we used (gcc 4.1, -O3) is drastic. All in all, we end up with around 250 high-confidence matches -- which is not too bad considering that we don't know how many functions from OpenSSL the Pix code actually contains.

In order to have a more clear idea of how well these algorithms perform, we need an example of which we know that essentially the entire library has been statically linked in. For this, luckily, we have Adobe Reader :-)

With all the Adobe patches coming up, let's imagine we'd like to have a look at the Javascript implementation in Acrobat Reader. It can be found in Escript.api. Now, I always presume that everybody else is as lazy as me, so I can't imagine Adobe wrote their own Javascript implementation. But when Adobe added Javascript to Acrobat Reader, there were few public implementations of Javascript around -- essentially only the engine that is nowadays known as "SpiderMonkey", e.g. the Mozilla Javascript engine. So I compiled SpiderMonkey into "" on my Linux machine and disassembled Escript.api. Then I ran DiffDeluxe. The result:

Escript contains about 9100 functions, contains about 1900. After running the diff, we get 1542 matches. Let's start verifying how "good" these matches are. As discussed above, DiffDeluxe uses a "similarity" and "confidence" score to rate matches. We get 203 matches with similarity and confidence above 90% -- for these functions, we can more or less blindly assume the matches are correct. If we have any doubts, we can inspect them:

Well, there is little question that this match was accurate.

The interesting question is really: How low can we go similarity- and confidence-wise before the results start deteriorating too badly ? Let's go low -- for similarities below 40%. For example the js_ConcatStrings match.

Manual inspection of the screenshot on the right will show that the code performs equivalent tasks, but that hardly any instructions remain identical.

Proceeding further down the list of matches, it turns out that results start deteriorating once both confidence and similarity drop below 0.3 -- but we have around 950 matches with higher scores, e.g. we have successfully identified 950 functions in Escript.api. While this is signifcantly less than the 1900 functions that we perhabs could have identified, it is still pretty impressive: After all, we do not know which exact version of SpiderMonkey was used to compile Escript.api, and significant changes could have been made to the code.

Clearly, we're a long way from matching 95% -- but we're very close to the 50% barrier, and will work hard to improve the 50% to 75% and beyond :-)

Anyhow, what does all this have to do with automatic classification and correlation of malware ?

I think the drastic differences induced by platform/compiler changes make it pretty clear that statistical measures that do not focus on the structure and semantics of the executable, but on some "simple" measure like instruction frequencies, fail. All the time. Behaviorial methods might have a role to play, but they will not help you one bit if you acquire memory from a compromised machine, and are trivially obfuscated by adding random noisy OS interaction.

I am happy to kill two birds with one stone: By improving the comparison engine, I am making my life easier when I have to disassemble Pix -- and at the same time, I am improving the our malware classification engine. Yay :-)

Anyhow, as mentioned above: I am at the Virus Bulletin conference -- if anyone wishes to have a chat or have our products demo'ed, please do not hesitate to send mail to


Unknown said...

I'm going to be the annoying one and ask when we can expect to see this behavior in BinDiff or some other publicly available product?

Unknown said...

Halvar, amazing stuff. I like your thesis too, although I've only read a few pages so far.

While you've demoed some libraries that likely match the one used in the target, what about the false positive ratio? It would be interesting to see your comparison of multiple versions of the OpenSSL library against PIX to see which one it is most similar to. Are they really backporting patches or is it stock 0.9.5a?

Also, perhaps compare gnutls to the PIX OpenSSL implementation and see how much various internal functions (say, derive master secret) match even though the libraries are different implementations.

Thierry Zoller said...

Hey Halvar, your posts do have titles now! :)

Ryan Russell said...

Asking out of ignorance, so apologies in advance.

You are already taking advantage of callpath info to bring up neighbor functions? I.e. A calls B. B by itself in both binaries is 0.5 confidence. But A is 0.95. If both A's call both B's in the same place, do you bump B up to 0.725?

halvar.flake said...

@forever: This should be in BinDiff v3.0, due out in q1 2009.

@nate: False positives are usually identifiable -- they tend to have low "confidence" scores in the matching algorithm. Generally, with larger graphs the false positive rate decreases drastically.

As far as I can see, they are backporting patches to 0.9.5a, but I only had a cursory look.

Concerning gnutls: I would be _extremely_ surprised if gnutls vs openssl matches in any way -- but I will try, thanks for the hint ! :)

@ryan: No, not yet -- the confidence of the neighboring matches does not influence the confidence of the current match yet. In debug builts of the code we construct a "reasoning tree" though -- e.g. "this node was matched because of this other node here". This would allow us to base the current confidence on the confidence of the things that "led" the algorithm to decide the way it did.

Unfortunately, keeping this "reasoning tree" in memory is not easily feasible for very large diffs, so we don't build it unless we're trying to diagnose an error...