First Adventure in Malware Data Science
mattmatt
Posted on December 7, 2018
I haven’t been a software developer for very long, but I enjoy it. In my day job I do backend work in Flask, but I was looking for an exciting project to work on at home, something that would stretch my skills and teach me some new things. And maybe let me get a little dirty. Since I was a teen I’ve been fascinated by malware. However, it’s a curiosity I haven’t explored as much as I’d like. So when No Starch Press had a sale, I found and pre-ordered Malware Data Science by Joshua Saxe and Hillary Sanders.
I just started reading chapter 4, and I’m enjoying it so far. Maybe I’ll write a review when I’m done. In the meantime, I’d like to blog real quick about a small issue I had working through chapter 2, and how I fixed it (it was my problem, not the authors’).
All the authors’ code is in Python 2.7, and I’m more used to Python 3. No big deal though, right? I’m a professional, I can suss this out. No, this wasn’t my problem.
What I had missed was a small detail in the name of the sample code and malware download. The zip file is titled malware_data_science_entrypoints_redacted.zip.
The task in chapter 2 I was attempting was to print out disassembled malware, starting at the entrypoint’s address. I completed my port of their Python script (it was dodgy putting parentheses into that print statement) and ran it. And nothing happened.
What did I do wrong, I wondered? I checked the file. I read through the readme on pefile’s GitHub. I checked capstone’s documentation. I was doing everything right. I went so far as to download the authors’ helpful Ubuntu VM, with all the code and data on it already. It worked there. But not on my Fedora VM.
This frustrated me for far longer than it should have. I don’t remember when ‘entrypoints_redacted’ caught my eye, but I felt a little silly when it did. I altered my script a little to print out the entrypoint’s address, and it is the adorable 0xcc00ffee. When you feed this address to the disassembler you get nothing, because that address (it’s an offset from where the image is loaded, a relative virtual address in PE terms) points about 3.4 GB in, while the file itself is almost 631 KB.
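Here is a minimal sketch of why the script stays silent instead of crashing. It assumes the script grabs its code bytes by slicing the loaded image at the entrypoint, as the book's example does. The zero-filled buffer and the 631,000-byte size are stand-ins for illustration, not the real sample, and the helper names are made up.

```python
REDACTED_ENTRY_POINT = 0xCC00FFEE  # placeholder value from the redacted sample
APPROX_IMAGE_SIZE = 631_000        # assumed stand-in for the "almost 631k" file


def code_at(image: bytes, entry_point: int, length: int = 100) -> bytes:
    """Return up to `length` bytes of `image` starting at `entry_point`."""
    # Slicing past the end of a bytes object returns b"" rather than raising,
    # so a disassembly loop over the result simply has nothing to iterate.
    return image[entry_point:entry_point + length]


def entry_point_in_range(image: bytes, entry_point: int) -> bool:
    """True if the entrypoint lands inside the image at all."""
    return 0 <= entry_point < len(image)


# Zero-filled placeholder buffer; no real malware involved.
image = bytes(APPROX_IMAGE_SIZE)

print(f"{REDACTED_ENTRY_POINT:#x} is {REDACTED_ENTRY_POINT / 1e9:.1f} GB into the image")
print("bytes at redacted entrypoint:", len(code_at(image, REDACTED_ENTRY_POINT)))
print("bytes at 0x121ba:", len(code_at(image, 0x121BA)))
print("redacted entrypoint in range?", entry_point_in_range(image, REDACTED_ENTRY_POINT))
```

A range check like `entry_point_in_range` before disassembling would have turned the silent nothing into an obvious error.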
I ran their VM and printed out the entrypoint address from there. A lowly 0x121ba (74170 in decimal). I entered that into my script and voila, I got disassembled code. It’s not exactly what the book says it should be, but it is exactly what the working code on the authors’ VM says it is, so I guess I did something right.
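For anyone hitting the same wall, here is a rough Python 3 sketch of the workaround: disassemble from a known-good entrypoint instead of the redacted one in the header. It is modeled on the pefile and capstone calls the book's example relies on, not copied from the authors' code. The function names, the `override` parameter, and the 100-byte window are my own assumptions, and it assumes a 32-bit x86 sample.

```python
from typing import Iterator, Optional


def pick_entry_point(header_value: int, override: Optional[int] = None) -> int:
    """Use the header's entrypoint unless a known-good override is supplied."""
    return header_value if override is None else override


def disassemble(path: str, override: Optional[int] = None,
                length: int = 100) -> Iterator[str]:
    """Yield disassembled instructions starting at the chosen entrypoint."""
    # Third-party imports stay local so pick_entry_point works without them.
    import pefile
    from capstone import Cs, CS_ARCH_X86, CS_MODE_32

    pe = pefile.PE(path)
    entry = pick_entry_point(pe.OPTIONAL_HEADER.AddressOfEntryPoint, override)
    image = pe.get_memory_mapped_image()
    code = image[entry:entry + length]
    if not code:
        # Fail loudly instead of silently printing nothing.
        raise ValueError(f"entrypoint {entry:#x} is outside the {len(image)}-byte image")

    # Capstone wants the address the code would have once loaded in memory.
    load_address = pe.OPTIONAL_HEADER.ImageBase + entry
    disassembler = Cs(CS_ARCH_X86, CS_MODE_32)  # assumes 32-bit x86
    for ins in disassembler.disasm(code, load_address):
        yield f"{ins.address:#x}\t{ins.mnemonic}\t{ins.op_str}"
```

Usage would look something like `for line in disassemble("sample.exe", override=0x121BA): print(line)`, where `sample.exe` is a hypothetical path to one of the redacted samples. Only do this inside a VM you are happy to throw away.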
I’m looking forward to digging deeper in this book.