Automated classification of malware is easier than it seems once you have the right infrastructure -- in our case consisting of the BinDiff2 engine, a few generic unpacking tools and a graph layout program.
I have uploaded some graphics:
This is just a collection of arbitrary malware whose members were identified using AV programs. We then BinDiff'ed all samples against each other and used a phylogenetics algorithm to draw this diagram. The results are quite neat, although we did not filter library functions, so some very simple viruses show high similarity simply because 95% of their code is statically linked library code.
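For readers who want to play with the idea: the step from pairwise similarity scores to a tree can be sketched in a few lines. This is not our actual tooling. It is a minimal sketch that assumes you already have a similarity score between 0 and 1 for every pair of samples, and it uses plain average-linkage clustering (UPGMA) as a stand-in for the tree-building algorithm. The sample names and scores are made up.

```python
from itertools import combinations


def upgma(names, similarity):
    """Build a tree (nested tuples) by average-linkage clustering.

    names: list of sample labels.
    similarity: dict mapping frozenset({a, b}) -> score in [0, 1]
                (hypothetical input format, one entry per pair).
    """
    # Each cluster is (subtree, member_list); start with one per sample.
    clusters = [(name, [name]) for name in names]

    def distance(c1, c2):
        # Average pairwise distance (1 - similarity) between members.
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(1.0 - similarity[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find and merge the two closest clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Made-up scores: two closely related bots and one unrelated worm.
names = ["bot_a", "bot_b", "worm_x"]
similarity = {
    frozenset({"bot_a", "bot_b"}): 0.9,
    frozenset({"bot_a", "worm_x"}): 0.1,
    frozenset({"bot_b", "worm_x"}): 0.2,
}
tree = upgma(names, similarity)
print(tree)
```

The two bots end up as siblings in the tree and the worm hangs off the root. Note that a sketch like this inherits the library-code problem mentioned above: if the scores are inflated by shared library code, so is the tree.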
This is a collection of a few hundred bots. They were collected on Thorsten Holz's Honeynet, and we auto-unpacked them and then did the BinDiffing/tree generation. This time, we did filter libraries as well as we could. The 184 samples here all have different MD5sums, but the vast majority belong to essentially two families. All in all, we have ~5 "families", two pairs of "siblings" and 9 isolated species here. Fun.
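If you would rather count families programmatically than read them off a diagram, one rough way is to link every pair of samples whose similarity exceeds a threshold and then look at the connected groups. Again, this is only a sketch under assumptions: the 0.5 threshold is arbitrary, and calling a group of three or more a "family", a group of two "siblings" and a singleton "isolated" is my shorthand here, not a formal definition. The scores are made up.

```python
from collections import Counter


def group_samples(names, similarity, threshold=0.5):
    """Count families, sibling pairs and isolated samples.

    similarity: dict mapping frozenset({a, b}) -> score in [0, 1];
    pairs missing from the dict are treated as unrelated.
    """
    # Union-find over sample names.
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    # Link every pair that is similar enough.
    for pair, score in similarity.items():
        if score >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    # Classify each connected group by its size.
    sizes = Counter(find(name) for name in names)
    summary = {"families": 0, "sibling_pairs": 0, "isolated": 0}
    for size in sizes.values():
        if size >= 3:
            summary["families"] += 1
        elif size == 2:
            summary["sibling_pairs"] += 1
        else:
            summary["isolated"] += 1
    return summary


# Made-up scores for six samples.
names = ["a", "b", "c", "d", "e", "f"]
similarity = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.8,
    frozenset({"d", "e"}): 0.7,
    frozenset({"a", "f"}): 0.1,
}
summary = group_samples(names, similarity)
print(summary)
```

Single-link grouping like this chains easily: one spurious high score can glue two families together, so the threshold needs tuning against samples you already understand.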