Applied Data Analysis

Assignment goal

You will conduct a real-world data analysis task on real data. This will include preprocessing and filtering, aggregating and interpreting the data, and visualizing it through tables and graphs. The ultimate goal is to produce a “debriefing” document that explains what happened during a denial of service attack. This assignment is due Tuesday, February 25 before class.

Assignment dataset

You will receive a packet capture file containing a trace of a distributed denial of service attack. The attack was launched by the Storm botnet against a machine at UC San Diego in 2007.
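If you have not worked with packet captures before, it may help to see how little structure the file itself has. The sketch below reads packet records using only the Python standard library. It assumes the trace is in the classic libpcap format with microsecond timestamps (not pcapng), which you should verify yourself; the function name `iter_pcap` is made up for this example, and in practice tools such as tcpdump do this parsing for you.

```python
import io
import struct

def iter_pcap(stream):
    """Yield (timestamp_seconds, original_length, captured_bytes) per packet.

    Assumes the classic libpcap layout: a 24-byte global header followed by
    records that each start with a 16-byte per-packet header.
    """
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("truncated pcap global header")
    magic = header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":      # 0xa1b2c3d4 written little-endian
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":    # 0xa1b2c3d4 written big-endian
        endian = ">"
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    record = struct.Struct(endian + "IIII")
    while True:
        raw = stream.read(record.size)
        if len(raw) < record.size:
            return                         # clean end of file or truncated tail
        ts_sec, ts_usec, incl_len, orig_len = record.unpack(raw)
        data = stream.read(incl_len)
        if len(data) < incl_len:
            return                         # last packet was cut off
        yield ts_sec + ts_usec / 1e6, orig_len, data

# Synthetic one-packet capture, built in memory purely for illustration.
demo_header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
demo_record = struct.pack("<IIII", 10, 500000, 3, 60) + b"abc"
demo_packets = list(iter_pcap(io.BytesIO(demo_header + demo_record)))
```

Note that the captured bytes may be shorter than the original packet length if the capture was taken with a snapshot limit, so use the original length when computing bandwidth.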

Suggested analysis dimensions

The ultimate goal, again, is simply a “debriefing” - explain to someone with no prior knowledge of the event what happened, where the attack came from, and anything else of interest that you can infer from the data. Not all of the tasks below are required, and adding relevant ones not listed here would be helpful as well. Sample analysis tasks include:

Determine whether source IP addresses have been spoofed by the attacker.
Characterize “early” attackers vs. ones which show up later in the trace - do they have any identifying features, e.g. network location, attack strategy, or bandwidth?
Visualize the attack, either geographically or “topologically” - the gold standard for these can be found here.
Characterize the network location of “heavy hitters” vs. low-bandwidth attackers. Do high-bandwidth attackers come from residential, commercial, or educational/research networks?
Are attackers rate limited by their upload bandwidth, CPU power, some code-based internal limit, or some other effect? You might want to refer back to the Witty worm paper to see how they did this analysis.
More generally, characterize the distribution of attacker bandwidth as a function of AS, /24, /16, or some other granularity.
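As one possible starting point for the bandwidth-by-granularity task, the sketch below sums bytes per source prefix. It assumes you have already extracted (source IP, byte count) pairs from the trace; the function name `bytes_by_prefix` and the sample records (which use documentation-only address ranges) are made up for illustration. Grouping by AS would additionally require an external prefix-to-AS mapping, which is not shown here.

```python
import ipaddress
from collections import Counter

def bytes_by_prefix(records, prefix_len):
    """Sum bytes per source prefix.

    records: iterable of (source_ip_string, n_bytes) pairs.
    prefix_len: e.g. 24 for /24 or 16 for /16.
    """
    totals = Counter()
    for src, n_bytes in records:
        # strict=False masks off the host bits instead of raising an error
        net = ipaddress.ip_network(f"{src}/{prefix_len}", strict=False)
        totals[str(net)] += n_bytes
    return totals

# Made-up records for illustration only.
sample = [("192.0.2.10", 1500), ("192.0.2.77", 500), ("198.51.100.3", 40)]
by_24 = bytes_by_prefix(sample, 24)
by_16 = bytes_by_prefix(sample, 16)
```

Building an `ip_network` object per packet is convenient but likely to be slow on a trace of this size; aggregating per source address first and then grouping the (much smaller) set of sources by prefix is one way around that.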

The final deliverable is a report, including text and graphs/graphics, explaining what happened during the attack along with any inferred understanding. You should include basic information about the size of the attack, the duration, the target, and a characterization of the sources. You can use the graphing library of your choice, but we recommend boomslang, R, or gnuplot.
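The basic facts requested above (size, duration, target, and number of sources) can all be gathered in a single pass over the trace. The following is a minimal sketch, assuming you have already parsed each packet into a (timestamp, source, destination, length) tuple; the function name `summarize` and the sample packets are hypothetical.

```python
from collections import Counter

def summarize(packets):
    """Compute headline numbers from (timestamp, src, dst, length) tuples."""
    first = last = None
    total_bytes = 0
    n_packets = 0
    sources = set()
    bytes_per_target = Counter()
    for ts, src, dst, length in packets:
        first = ts if first is None else min(first, ts)
        last = ts if last is None else max(last, ts)
        total_bytes += length
        n_packets += 1
        sources.add(src)
        bytes_per_target[dst] += length
    if n_packets == 0:
        raise ValueError("no packets to summarize")
    duration = last - first
    return {
        "packets": n_packets,
        "bytes": total_bytes,
        "duration_s": duration,
        # Mean rate is undefined when all packets share one timestamp.
        "mean_bits_per_s": 8 * total_bytes / duration if duration > 0 else None,
        "distinct_sources": len(sources),
        "top_target": bytes_per_target.most_common(1)[0][0],
    }

# Made-up packets for illustration only.
sample_packets = [
    (0.0, "192.0.2.1", "203.0.113.5", 100),
    (2.0, "192.0.2.2", "203.0.113.5", 300),
    (4.0, "192.0.2.1", "203.0.113.5", 400),
]
summary = summarize(sample_packets)
```

Keeping a set of every source address in memory is fine for a sketch, but you may need a more compact structure if the number of distinct sources turns out to be very large.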

Resources

The best resources for inspiration regarding analysis or presentation are the papers we have covered in class - take a look at them and their analysis approach to help you do similar things for this project. The professors are also a resource for this assignment. You are highly encouraged to bring rough drafts of your plots to the instructors; we are attempting to recreate the paper-writing environment, in which students generate graphs and improve them over several iterations. It is very unlikely that you will receive a satisfactory grade if you do not iterate multiple times.

Your other resources are computation and data on Amazon Web Services. Each student will receive login instructions for a virtual machine with “top of the line” performance (purchased outright, this would be approximately a $5,000 machine). While local data analysis is certainly possible, a powerful remote server can be very helpful for chewing through a large amount of data quickly. The bzipped packet capture is available on S3 - it is 32 GB uncompressed. Please treat this data as private and do not publicize the data itself or any of your analysis outside of this class community. After you have finished the assignment, you will be free to publicize/share your final report.

To download the data from S3 to your virtual machine (and prepare to analyze it), you can run:

sudo apt-get install python-pip apparmor-utils
sudo aa-complain /usr/sbin/tcpdump
sudo pip install awscli
aws --region us-east-1 s3 cp s3://uicbits.net/classdata/ddostrace.070804.pcap.bz2 ddostrace.070804.pcap.bz2

Your instance is provisioned with 64 GB of storage as the root partition, which is enough space to hold the decompressed trace file but not much more.
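Because the root partition has little room to spare, one option is to never write the decompressed trace to disk at all and instead read the compressed file as a stream. The sketch below uses Python's standard `bz2` module to walk through a compressed file in fixed-size chunks; here it only counts the decompressed bytes, but the same file-like object could be handed to any parser that accepts a stream. The function name `uncompressed_size` and the in-memory demo data are made up for this example.

```python
import bz2
import io

def uncompressed_size(source, chunk_size=1 << 20):
    """Return the decompressed size of a bzip2 file without storing it.

    source: a filename or a binary file object containing bzip2 data.
    """
    total = 0
    with bz2.open(source, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)   # at most chunk_size bytes in memory
            if not chunk:
                break
            total += len(chunk)
    return total

# Small in-memory example; on your instance you would pass the trace's path.
demo = io.BytesIO(bz2.compress(b"x" * 5000))
demo_size = uncompressed_size(demo, chunk_size=1024)
```

The trade-off is that every pass over the data pays the decompression cost again, so if you expect to make many passes, decompressing once may still be the better choice.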

If you need more space or your analysis becomes I/O-bound, you can also use 160 GB of high-performance (SSD-backed) “scratch” space, which you make available by running this command:

sudo mount /dev/xvdb /mnt

Two important things to remember:

The “scratch” space is ephemeral storage - any data on it will be lost when your machine shuts down.
Your machine will be automatically turned off if you do not use it for 6 hours in a row. You can log in to the web interface to turn it back on if this happens - make sure to note the IP when logging back in as it will most likely have changed.

Remember: while the /mnt directory is good for storing temporary or intermediate files, it will disappear after your machine shuts down, and there is absolutely no way to get it back. I suggest writing all code in your home directory (or better yet, a directory under revision control) and keeping raw or computed data on the scratch disk.
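One pitfall worth guarding against: a manual mount does not persist across a restart, so after your machine has been shut down and turned back on, /mnt is just an ordinary directory on the root partition until you mount the scratch disk again. The sketch below picks a directory for intermediate files and falls back to the system temporary directory when the scratch path is not a writable mount point. The function name `pick_workdir` and the subdirectory name are made up for this example.

```python
import os
import tempfile

def pick_workdir(scratch="/mnt", subdir="ddos-intermediate"):
    """Return a directory for intermediate files, creating it if needed.

    Uses the scratch disk only if it is currently mounted and writable;
    otherwise falls back to the system temporary directory.
    """
    if os.path.ismount(scratch) and os.access(scratch, os.W_OK):
        base = scratch
    else:
        base = tempfile.gettempdir()
    path = os.path.join(base, subdir)
    os.makedirs(path, exist_ok=True)
    return path

# With a path that is not a mount point, the fallback is used.
fallback_dir = pick_workdir(scratch="/no/such/mount")
```

Depending on how the scratch disk is set up, you may need to adjust ownership or permissions on the mount point before your user can write to it.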