Saturday, April 29, 2006

More on automated malware classification and naming

So after having posted some graphs without further explanation yesterday, I think it is a good idea to actually explain what these graphs were all about.

We at SABRE have worked on automated classification of malware over the last few weeks. Essentially, thanks to Rolf's relentless optimization efforts and a few algorithmic tricks, the BinDiff2 engine is blazingly fast. As an example, we can now diff a router image with more than 30,000 functions in under 8 minutes on my laptop, and this includes reading the data from the IDA database. That means that we can afford to run a couple of hundred thousand diffs on a collection of malware samples and then work with the results -- malware rarely exceeds 1000 functions in size, and anything of that size is diffed in a few seconds.

So we were provided with a few thousand samples of bots that Thorsten Holz had collected. The samples themselves were only marked with their MD5sum.

We ran the first few hundred through a very cheap unpacking tool and then disassembled the results. We then diffed each sample against every other. Afterwards, we ran a phylogenetic clustering algorithm on top of the results and ended up with this graph:

The connected components have been colored, and a hi-res variant of this image can be looked at here.
The labels on the edges are measures of graph-theoretic similarity -- a value of 1.000 means that the executables are identical, lower values give percentage-wise similarity. For this graph, we decided to keep everything with a similarity of 50% or greater in one family and to cut off everything else.

So what can we read from this graph ? First of all, it is quite obvious that although we have ~200 samples, we only have two large families, three small families, two pairs of siblings and a few isolated samples. Secondly, even the most "distant relatives" in the cyan-colored cluster are 75% similar, and the most "distant relatives" in the green cluster on the right are still 58% similar. If we cut the green cluster on the right into two subclusters, the most distant relatives are 90% similar.

Now, in order to double-check our results with what AV folks already know, Thorsten Holz provided us with the results of a run of ClamAV on the samples, and we re-generated the graphs with the names of those samples that were recognized by ClamAV.

The result looks like this:

(check this graphic for a hi-res version which you will need to make sense of the following ;)

What can we learn from this graph ? First of all, we see that the various members of the GoBot family are so similar to the GhostBot branch that we should probably consider GoBot and GhostBot to be of the same family. The same holds for the IrcBot and GoBot-3 samples and for the Gobot.R and Downloader.Delf-35 samples -- why do we have such heavily differing names when the software in question is so similar ?

We seem to consider Sasser.B and Sasser.D to be of the same family with a similarity of 69%, but Gobot.R and Downloader.Delf-35, which are a lot more similar, have their own families ?

What we can also see is that we have two samples in the green cluster (GoBot, IRCBot, GhostBot, Downloader.Delf) that have not been recognized by ClamAV and that seem to have their own branch in the family tree:

Apparently, these two specimens are new variants within the above family tree.

Miscellaneous other things we can learn from this (admittedly small) graph:

  • We have 6 variants of Trojan.Crypt.B here that go undetected by ClamAV
  • We have an undetected variant of Worm.Korgo.Z
  • PadoBot and Korgo.H are very closely related, whereas Korgo.Z does not appear to be very similar. Generally, the PadoBot and the Korgo.H/Korgo.Y/Korgo.AJ family seem to be just one family.

So much fun.
Anyhow, if you guys find this stuff interesting, drop me a mail about this at halvar.flake@sabre-security.com ...

Friday, April 28, 2006

Also, Rolf has created a Shockwave movie in which he shows how to use BinDiff to find GPL violations.

Check it here...
Automated classification of malware is easier than it seems once you have the right infrastructure -- in our case consisting of the BinDiff2 engine, a few generic unpacking tools and a graph layout program.

I have uploaded some graphics:

This is just a collection of arbitrary malware whose members were identified by use of AV programs. We then BinDiff'ed all samples and used a phylogenetic clustering algorithm to draw this diagram. The results are quite neat, although we did not filter library functions, so some very simple viruses have high similarity simply because 95% of their code is statically linked library code.

This is a collection of a few hundred bots. They were collected on Thorsten Holz's Honeynet, and we auto-unpacked them and then did the BinDiffing/tree generation. This time, we did filter libraries as well as we could. The 184 samples here all have different MD5sums, but the large majority belongs to essentially two families. All in all, we have ~5 "families", two pairs of "siblings" and 9 isolated species here. Fun.

Wednesday, April 26, 2006

I am in a state of brainfry. Which I guess is good. Exhaustion can be fun. I think I haven't worked as hard as I am working at the moment in quite a while, and while the results move slowly (due to fighting on many fronts), they move, and in a good direction. Now I will sleep. Ah, for those into this sort of stuff, here is what claims to be a significant improvement of the FMS attack on RC4. Good read, although nothing world-moving, and I haven't fully verified the results yet.

Wednesday, April 19, 2006

Quote from the Matasano Blog:
"We are so funded now"
^^ Classic :)
Language in thought and action or "what's correct vs. what's right"

Everybody should IMO, at some point in their life, have read Hayakawa's "Language in Thought and Action". It is a book about semantics, about words and their meanings. Semantics pop up in very amusing situations, for example in code analysis/review, and a few other places.

A small thing to take out of that book is that the entry in a dictionary is not the definitive meaning of a word, but that the meaning of the word is (somewhat circularly) defined as what the collective mind thinks this word means.

There's currently a thread on bugtraq about the way that GCC treats certain 'invalid language' constructs. While I find very few things more amusing than compiler changes breaking formerly working code, I also find the reaction of everybody except Fefe on that list to be very amusing: instead of relying on what everybody thinks the meaning of that line of code is, they refer to the written standard. In essence, they look up the meaning of a word in a dictionary.

Yes, standards are important. But there is a difference between the standard on paper and the 'standard', meaning the way everybody perceives (and writes in) that language. And due to the fact that C's standard has been in some ways unenforced for 20+ years there are lots of existing 'standard' idioms (such as Fefe's int wrap check) that are not valid by the written standard.
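The wrap check is the classic case in point. What follows is my paraphrase of the idiom (function names made up for illustration), not Fefe's exact code: the 'standard' way everybody writes it adds first and tests afterwards, which the written standard allows the compiler to delete, while the standards-clean version tests before adding.

```c
#include <limits.h>

/* The traditional idiom: detect overflow *after* the fact.
   Signed overflow is undefined behaviour per the written standard,
   so GCC is allowed to optimize this test away entirely. */
int add_len_idiom(int a, int b) {     /* assumes a, b >= 0 */
    int sum = a + b;
    if (sum < a)      /* "cannot happen" according to the standard */
        return -1;    /* wrapped */
    return sum;
}

/* The standard-compliant version: test with values that cannot
   overflow, *before* performing the addition. */
int add_len_portable(int a, int b) {  /* assumes a, b >= 0 */
    if (a > INT_MAX - b)
        return -1;    /* would wrap */
    return a + b;
}
```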

What we're trying to do here is the equivalent of not correcting a child's grammar for 20 years, letting it play with other kids with broken grammar, and always understanding what the child wanted to say when it used broken grammar. And then, once the child turns 20, we cross out the constructs with bad grammar from everything it says and interpret the rest.

If the construct is so invalid, have the compiler warn at least :)
This sounds like a sure recipe for disaster to me :)

I had a discussion with Rolf yesterday about GCC's ghastliness, and the fact that we're spending a good bit of our time fighting compiler woes instead of coding. Then again, we're doing something rather odd: we're trying to write C++ code with GCC, which, frankly, doesn't work. I am very inclined to switch to Intel's CPP.

Oh, while I am at ranting: on which CPU is GCC's weird optimization of passing arguments by subtracting from ESP and then issuing several 'mov [esp+const], reg' instructions faster than push/call ? Sure, on a cycle-by-cycle basis each instruction is faster, but has anyone ever considered the impact of a 3x increase in code size on CPU caching ?

I'll shut up now before this deteriorates into more ranting.

Monday, April 17, 2006

http://teh-win.blogspot.com/ has (as usual) an amusing read up, which at one step harps on a point that I can't support enough: 0days != hacking. Almost all "real" hacking is done via transitive trust (and thus the same goes for pentests). 0days allow you to more quickly get _some_ trust to exploit transitively, but the "real" work is done on transitive trust. And transitive trust and "real" hacking get too little credit at security conferences, mainly because any "real" research here is by direct implication illegal ("... I wrote this worm that exploits transitive trust ... and I have some empirical data on its spreading capabilities *cough* ...").

Now I just need to find a dictionary that explains to me what "branler la nouille en mode noyau" means ;)
Publication Economics and Cryptography Research

Something I cannot stop wondering about is why historically there has been so little published research on the cryptanalysis of block ciphers. There seem to be millions of articles describing "turning some math guy's favourite mathematical problem into an asymmetric crypto algorithm" and a similar flood of "fair coin flipping if all participants are drunk cats and the coin is a ball of yarn"-sort of papers. All in all, there have been perhaps fewer than 20 REALLY important papers in the analysis of symmetric crypto in ... uhm ... the last 10 years (I count hashes as symmetric crypto here).

What's the reason for this ?

First of all, symmetric crypto tends not to have a "nice" mathematical structure. This changed somewhat with AES, but almost everything on the symmetric side is rather ugly to look at. Sure, everything can be written as large multivariate polynomials over GF(2), but that's just a prettier way of writing a large boolean formula. So it's hard for anybody in a math department to justify working on something that is "like a ring, but not quite, or like a group, but not quite".

Secondly, starting to build a protocol or proposing a new asymmetric cipher is something that a sane researcher (who has not earned tenure yet) can do in a "short" window of time. Setting out to break a significant crypto algorithm could very easily lead to "10+ years in the wilderness and a botched academic career due to a lack of publications". The result: if you haven't earned tenure yet and want to work in crypto, you work on the constructive side.

I find this to be a bit frustrating. I'd like to work on ciphers, specifically on BREAKING ciphers. I could seriously never get myself excited about defense. I wouldn't mind spending a few years of my life on one cipher. But academically, and career-wise, this is clear suicide.

Perhaps we should value "destructive" research more. From my personal viewpoint, a break of a significant cipher is worth more than 20 papers on fair coin flipping in the absence of gravity. But when it comes to giving out tenure, it almost seems that the 20 papers outweigh the one.

Ahwell. Enough ranting, back to work.

Friday, March 31, 2006

http://metasploit.blogspot.com/2006/03/few-msrt-graph-illustrations.html

This is cool.

Wednesday, March 15, 2006

My latest travel madness is over, and after 4 weeks on the road I am home again. The last week of that travelling I spent mostly sick -- that means I went to BlueHat for all the work and then had to skip all the fun. Furthermore, I learnt a bunch of interesting things concerning sinusitis and pressure changes during intercontinental flights.

Those that know me know I am a graph nerd (and by extension an OBDD nerd), so check
this paper for some more OBDD fun.
While you are doing your brain some good, you might as well read this paper -- it is fascinating and very useful at the same time.

Now, I am just enjoying the second day of my first "weekend" in a month -- I decided not to work for two days, and spent the first 34 hours of these two days in bed. Now I am cleaning my flat and will be reading some fun shit later on (compact Riemann surfaces, perhaps?). Or I will call Ero and see if he is up for a round of Go.

I saw that Ilfak reads my BH presentations -- I am flattered and thus will need to try harder with them the next time.

One of the highlights of BlueHat was meeting a bunch of MS folks that I respect A LOT: Jon, Brandon, Alex, Pete, Jan -- thanks for being awesome.

I read many good things about OCaml recently, and a bunch of very interesting programs (ASTREE and SLAM/SDV) seem to be written in it. Can someone offer some comments on it ?

Sunday, February 26, 2006

My travel schedule remains insane, and I notice that thinking about 'hard' problems is very difficult while travelling. Sigh. Anyhow, those that had beer with me in the last year and a half or so know that I have a severe interest in OBDD-like structures and their applications to cryptanalysis.
So I just wanted to put a reference here to
http://eprint.iacr.org/2006/072.pdf

Saturday, February 18, 2006

We at SABRE will be giving a 3-day advanced reverse engineering course this fall in Frankfurt. Check this link if you are interested in that sorta stuff.

Tuesday, February 07, 2006

History is important. It is VERY cool to see DaveG posting old 8LGM memorabilia. Thanks !

I have to hit the train now. Will read this
on the train -- skimmed it in the subway, and it looks ridiculously interesting (if you happen to be into solving systems of multivariate polynomial equations or into classical algebraic geometry in general). It claims to contain a bridge between multivariate polys over finite fields and univariate polynomials over certain extensions of the same field. We'll see how good it actually is.
Hans Dobbertin died of cancer on the 2nd of February. I did not know him well - I had switched universities in order to write my thesis under his supervision, but shortly after I arrived at the new university he fell ill. The few times that we talked it was a lot of fun though, and I respected him greatly. There are only a limited number of professors I have met in my life that really understood and seemed to appreciate what I am doing, and it did not take much to explain it to Prof. Dobbertin (which might have to do with his work prior to becoming a university professor). My condolences to his family and those that were close to him.

Aside from the personal sadness I feel it is a setback for my Diplomarbeit. But well, that is nothing in the big picture really.

Sunday, January 15, 2006

My schedule in the next weeks sucks. Wednesday I am flying to London. The following Saturday to DC. Coming back the next Sunday. Possibly need to go to Paris the day after. Need to fly to Dresden on Tuesday night, and come back on Thursday morning (in time for lectures). I hope studies don't suffer too much.

It's damn cool to have Rolf & Ero here. Shit is moving fast. We're getting lots done. :-)

Saturday, December 17, 2005

Microsoft is moving the GUI code back out of the kernel in Vista according to this article. This is bad news: finding local privilege escalation bugs might become hard in the future.

The move of the GUI into the kernel (done with NT 4.0) was a misguided attempt at increasing performance in order to get people to switch from Win9x to NT - something that did not really work until Windows 2000 / XP. A lot of headaches (outside of the usual out-of-bounds-memory-access bugs) were created by that move (shatter attacks etc). From the defender's standpoint, moving it back out totally makes sense. I have a feeling that security-wise the gulf between MS and all other closed-source vendors (which have to operate under market conditions and thus can't pump a few billion into security) is widening.

Coming back to audit "random" closed source code after having worked on MS binaries is a bit like auditing a "random" open-source project after having spent time on well-audited bits of OpenSSH. You're surprised that things can be so easy.

Tuesday, December 13, 2005

Blogging is strange. You write down a few lines of half-coherent something under the delusion that nobody is reading the blog, and all of a sudden you show up cross-referenced in blogs that you read yourself regularly. With such a large crew blogging at Matasano (what used to be Thomas Ptacek's blog), they have a blog-update frequency that makes their blog about as productivity-destroying as slashdot.

I am seriously flattered to be mentioned there (and scared that my rants are actually read).

One of today's posts there mentions DJB's crypto algorithms, specifically Salsa20. Now, I am not a cryptographer, but I do not trust Salsa, for a variety of reasons:
  • It looks too much like MD4/MD5.
  • We have very limited understanding of why a wild mixture of ADD/XOR/ROL would produce equation systems that are hard to solve. Yes, nonlinearity over GF(2)^32 and over Z/2^32Z is given by mixing boolean functions and addition, but this paper gives some pretty neat insight into how just mixing ADD/XOR (without the ROL) is trivially solvable. I don't trust a single rotation that much.
  • Avoiding integer multiplication (whose representing BDD can grow exponentially with the number of bits and is thus hard to model using the methods in the paper) is something which I would not do - I know DJB cares a lot about timing, but given the choice between potentially leaking a few cycles and making the output of an operation ridiculously complicated (while at the same time tackling the problem of weak differential propagation in the high-order bits), I would choose the latter.
  • DJB might be over-emphasizing timing. His AES S-Box timing attack stamps RDTSC output into packets, which is many orders of magnitude more precise than any measurement you will get IMHO. True, caching issues (and cache alignment issues) can easily eat up 100 cycles, but that is still a lot less than a timer tick, the measure that in the most optimistic scenarios you'd be likely to get.
All in all, I do not trust systems built on just mixing ADD/XOR/ROL. There is a reason for the name of this blog.
Alright, I have 8 minutes of free time before I need to run to the computational algebra lecture, and I will spend them by dropping a few thoughts about Dan Geer's "login" article advocating moving away from a monoculture.

My two points on his proposed 'artificial diversity':
1) It will increase resistance against total extinction. A worm will need more than one bug to wipe all harddisks.
2) It will also make sure that skilled attackers will get their hands on useful information.

So please do it. Listen to Dr. Geer.

The (brief) reasoning: let's take the pool of computers in an organisation. Let's also take a useful piece of information (for example, a source tarball) and distribute it randomly on a small subset of the computers in the organisation. In the monoculture example, I would need an exploit for the monoculture OS. In the diversity example, I need an exploit for any one of the OSs on which the information that I want is stored. Joy. Please diversify !

Saturday, December 10, 2005

One of the lectures I am attending, "Algebra and Algebraic Geometry from an algorithmic perspective", is, while often interesting, also insanely frustrating: the professor's assistant poses the exercise problems, but does not synchronize progress with the professor (who lectures, but frequently drifts off on a tangent, leaving the students confused as to whether they are "on a tangent" or "on the lecture subject"). This puts us in the situation that we usually get the theoretical background needed to solve a given exercise sheet one to two weeks after the sheet had to be handed in. After some somewhat-fruitless long hours on the last sheet, I was advised that the proof I am looking for can be found in Grothendieck's "EGA IV". Trying to find EGA IV via google, I stumbled over Grothendieck's nonmathematical writings. They are surprisingly interesting to read:

http://acm.math.spbu.ru/RS/

Tuesday, November 15, 2005

My procrastination skills have reached fearsome levels. It is 10:20 pm, I am supposed
to present some paper tomorrow morning at 8:30, and I am here blogging instead of
reading it for the first time. Great.

On a different note, I found a pretty amusing blog:

http://teh-win.blogspot.com/

It's very refreshing to see a blog talking "cleartext", and to know that I am not alone
in some of my thoughts. Then again, I am definitely guilty of talking the same tired
shit over and over again ;)

A friend once said after I talked to him about getting lost doing some crypto math for
a week that "crypto is fun, eh ? Like doing morphine...". He was right. I have been digging
into some crypto shit again, and it's extremely hard to not lose yourself in it.

There's a bunch of exciting shit going on though: My two coworkers, Ero & Rolf, both
seem to be moving here, which means (even) less sleep and more cool work.