Tuesday, May 16, 2006

Behavioural classification of malware

Today is a good day: I got my math homework done before it has to be handed in, and that leaves me some free time to blog :-)

Dana Epp has a post referring to an entry by Tony Lee referencing an EICAR paper on automated malware classification using behavioural analysis. I am not totally unbiased on this as we at SABRE have been working a lot on structural classification of malware recently, so take my following criticism with a grain of salt.

I personally think the approach in the paper is suboptimal for the following reasons:
  1. By using behavioural data, we can only classify an executable based on things it does in the observed timeframe. Any time-delayed trigger (e.g. one that fires two months from now) is hard to see, as the application might just sleep until then. How do we classify something that has just entered our networks? We can't classify it until it becomes "active".
  2. It is trivial, even for somebody with only rudimentary programming knowledge, to modify a program so that the modified program has only a few (~4?) lines of code more than the original, yet its Levenshtein distance as measured in the paper is arbitrarily large. As it stands, adding file writes in a loop should be enough, and the Levenshtein distance can be increased arbitrarily by adding more loop iterations.
  3. The paper cites on-access deobfuscation as a principal problem that static analysis cannot easily deal with -- but from (2) it follows that on-access deobfuscation can be trivially coupled with Levenshtein-distance-maximizing code, breaking the approach the paper proposes as superior. The claim that runtime analysis effectively bypasses the need to deal with obfuscation is simply not true once the application targets the event collection by 'junking' it with bogus events.
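To make point (2) concrete, here is a minimal Python sketch. The event names and sequences are hypothetical (the paper's actual event vocabulary isn't reproduced here); the point is only that pure insertions from a bogus-event loop grow the edit distance linearly with the iteration count, while the real behaviour stays identical.

```python
# Hypothetical behavioural profiles as event sequences.
# A trivial "bogus event" loop inflates the Levenshtein distance
# between two otherwise behaviourally identical samples arbitrarily.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Original sample's observed event trace (made-up event names).
original = ["reg_read", "connect", "send", "recv"]

def variant(n_bogus):
    # Same behaviour, plus n_bogus junk file writes from one added loop.
    return original + ["file_write"] * n_bogus

print(levenshtein(original, variant(10)))    # 10
print(levenshtein(original, variant(1000)))  # 1000
```

A four-line source change (a loop of file writes) is all the attacker needs; the distance between the two variants themselves grows the same way.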
Taken together, this means the approach presented in the paper can be trivially foiled with very minor high-level-language modifications to the program's source, whereas the static structural comparisons we use can only be foiled by a strong obfuscation component -- which, if done moderately cleverly, would also foil the approach from the paper.

I'm not saying the paper isn't good or doesn't touch on valid points, but behaviour is so trivially randomized, even at the high-level-language level, that the approach in the paper is next to pointless once malware authors start targeting it.

On to something kinda different:

A more general question we have to ask ourselves: do we really want to measure the accuracy of new, automated malware classification algorithms by comparing them to the results of the manual classification AV vendors have done so far, which had neither efficient information sharing nor any clear methodology for naming malware? Any machine learning based on AV-industry-provided datasets needs to be very resilient to partially incorrect input data, as a good number of bots seem to be more or less arbitrarily named.

Anyhow, time to go to sleep and read Courtois' eprint paper.


Ryan Russell said...

So, I admit to preferring your scheme (code similarity) on paper. I'm curious about a couple of things:

-How do you count two pieces of identical code (same or equivalent code segments) that have different strings? I.e. they have a different list of IRC servers, channels, passwords, processes to kill, etc. Same or different?

-Any thoughts on someone purposely designing their malware to trigger a 99.9% match in your scheme, when in fact they have made some critical changes?

sp said...

I haven't read the EICAR paper yet (maybe it's addressed there), but I think the third point you raise in particular might cause serious problems.

Many years ago, virus writers who wanted to delay the analysis of their viruses by anti-virus companies began to develop techniques like self-modifying or polymorphic code.

If behavioural analysis becomes widespread I think we're going to see exactly the same security-by-obfuscation concept re-introduced on a higher level.

What's going to stop malware authors from adding a thread to their applications that does nothing but create randomized bogus events? I'm talking about random HTTP connections, random IRC connections, or whatever else is common for malware and likely to be detected by behavioural analysis.

If you get thousands of randomized events and one actual event that is valuable for classification, how do you pick out the relevant one?

Well, I guess you could first distinguish between malware families that generate these bogus events and those that don't. Among those that do, you could check which events they actually generate (not all viruses might create registry key accesses or FTP connections, for instance). You could also use some statistics: assuming a normal distribution of randomly generated events across n types, a piece that randomly generates events of 8 types will have a different distribution than one that generates events of 10 types. If the distribution isn't normal, you could use the distribution itself to classify viruses.
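The statistical idea above can be sketched in a few lines of Python. Everything here is hypothetical (made-up event names, a uniform rather than normal distribution for simplicity): even when two samples both spam random noise, the distribution over event types -- here just the number of distinct types observed -- can still separate a generator drawing from 8 types from one drawing from 10.

```python
# Sketch: distinguishing noise generators by their event-type distribution.
# Assumes uniform random bogus events; real traffic would be messier.

import random
from collections import Counter

def random_events(n_types, count, seed):
    # Simulate a bogus-event thread emitting `count` uniform random events.
    rng = random.Random(seed)
    return [f"event_{rng.randrange(n_types)}" for _ in range(count)]

def type_distribution(events):
    # Relative frequency of each event type in the observed trace.
    c = Counter(events)
    total = sum(c.values())
    return {k: v / total for k, v in c.items()}

sample_a = random_events(n_types=8, count=10_000, seed=1)
sample_b = random_events(n_types=10, count=10_000, seed=2)

# With enough observations, the number of distinct event types
# alone already separates the two generators.
print(len(type_distribution(sample_a)))  # 8
print(len(type_distribution(sample_b)))  # 10
```

Of course, nothing stops the malware author from also randomizing the number and mix of event types per build, which is exactly the arms race Halvar's post is worried about.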

Oh well, I think I've entered the rambling stage so I'm just going to stop and read the actual paper.

halvar.flake said...

Different strings and identical code mean identical species in our current classification, and I don't really intend to change that. A horse is a horse, irrespective of its name tag ;)

Concerning point two: if two pieces of malware differ only very slightly, they will still be grouped together. We're not interpreting such changes and don't intend to -- the mere similarity to an existing piece of malware means "bad".

yoda said...

looks like Halvar needs to read the MS paper more carefully:

On-access obfuscation is NOT a "principal problem" for static analysis -- the paper doesn't say so.

Rather the paper says

"Code/data obfuscation ... also poses considerable challenges ..., for example, in the case of on-access obfuscation..."

Also, the paper clearly states the proper approach to the problem:
"These challenges aforementioned suggest the importance of using data from runtime analysis in addition to static analysis to achieve optimal classification results."