CS 594: Empirical Analysis

Homework 1: Skills assessment

This course, much like empirical research, requires the ability to rapidly ingest, digest, and present data. This assignment tests your ability to perform these tasks.

You will turn in two flavors of data: first, an English-language writeup that explains how you accomplished your task; second, the source code that you used to answer the questions throughout the assignment. You are welcome to create each of these as individual files; however, Jupyter notebooks are an extra handy method for combining the first and second flavors.

You are not required to ensure that your writeup is instantly and easily “gradeable”; in most cases I will not be running your submitted code. However, you must submit all of the code you used to complete the assignment: submitting pseudocode, or only part of the code you used, is not acceptable. You must also keep all of the data required to run your analysis, and I reserve the right to work through your assignment to verify its correctness.

All homework assignments are individual efforts. Feel free to ask for help on the class Piazza.

Step 1: basic data ingestion

Download Alexa’s list of 1 million top websites. Note that this list is updated daily, so good data hygiene suggests that you name the downloaded file with the current date. For each website within the top 1000, identify its category, as determined by OpenDNS: this information is available at https://domain.opendns.com/DOMAIN, where one replaces DOMAIN with the domain name of the site in question. In your writeup, explain how you extracted that information.
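As a rough illustration of the bookkeeping in this step, the sketch below builds a date-stamped file name, reads the first 1000 rows of the list, and fills in the OpenDNS URL pattern given above. It assumes the downloaded list is a header-less CSV with one "rank,domain" row per line; the file-naming scheme and all function names are hypothetical. Extracting the category from the returned page is deliberately left out, since that depends on the page's markup.

```python
import csv
import io
from datetime import date

# URL pattern taken from the assignment text; DOMAIN is replaced per site.
OPENDNS_URL = "https://domain.opendns.com/{domain}"


def dated_filename(day, stem="top-1m", ext="csv"):
    """Build a name like 'top-1m-2016-08-22.csv' (the naming scheme is an assumption)."""
    return f"{stem}-{day.isoformat()}.{ext}"


def top_domains(csv_text, limit=1000):
    """Return the first `limit` (rank, domain) pairs.

    Assumes every row looks like 'rank,domain' and that there is no header row.
    """
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((int(row[0]), row[1].strip()))
        if len(pairs) >= limit:
            break
    return pairs


def opendns_url(domain):
    """Fill the domain into the OpenDNS lookup URL."""
    return OPENDNS_URL.format(domain=domain)
```

Calling dated_filename(date.today()) at download time keeps each day's copy of the list distinct.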

Step 2: basic data creation

Find the end-to-end latency (using a tool like ping) for each site, and download the entirety of each website’s home page. You may use whichever tools you find appropriate: something like wget or Python’s Requests library is sufficient, and you don’t need a tool like Selenium or PhantomJS that runs all of the JavaScript on the page as a normal browser would. Record the number of bytes sent and the amount of time that it takes to complete downloading the index document for each page (hereafter referred to as download delay).
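One possible way to collect these measurements is sketched below, using only the Python standard library. It assumes ping prints lines containing "time=12.3 ms" (the usual Linux and macOS format), and it counts only the bytes of the response body, not the HTTP headers. The function names are hypothetical, and the fetcher is passed in as a parameter so that wget, Requests, or anything else can be swapped in.

```python
import re
import time
from urllib.request import urlopen

# Matches the per-packet round-trip time in typical ping output,
# e.g. "time=11.6 ms" (also tolerates "time<1ms").
_PING_RTT = re.compile(r"time[=<](\d+(?:\.\d+)?)\s*ms")


def parse_ping_rtts(ping_output):
    """Return the round-trip times, in milliseconds, found in ping's text output."""
    return [float(value) for value in _PING_RTT.findall(ping_output)]


def timed_download(url, fetch):
    """Call fetch(url), which must return bytes, and time the call.

    Returns (number_of_body_bytes, elapsed_seconds).
    """
    start = time.perf_counter()
    body = fetch(url)
    elapsed = time.perf_counter() - start
    return len(body), elapsed


def fetch_with_urllib(url, timeout=10.0):
    """Download the document at `url` and return its body as bytes.

    The 10-second timeout is an arbitrary choice for this sketch.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

For example, timed_download("http://example.com/", fetch_with_urllib) returns the body size and the download delay for one page. Note that the measured time includes any redirects the fetcher follows.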

Step 3: basic data analysis

Perform some basic exploratory analysis of the size and download time for each website. In your writeup, you must explain and justify an answer to each of these questions:

Construct a histogram of each of the bandwidth, size, latency, and download delay for your dataset. If there are any artifacts in the distribution, mention them, but you don’t need to dig into them.
Is there a statistically significant difference in the amount of time it took to download sites from different categories? Explain which test you used to determine statistical significance.
Construct and answer at least two other independent questions about your data. The easiest type of question to answer has the form “does X have an effect on Y?” Use data and figures to justify your answers.
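The choice of significance test is left to you. Purely as an illustration of what such a test can look like, here is a minimal sketch of a permutation test, which is one option among several and not a required or recommended method. It assumes the download delays have already been grouped into one list per category; the function names and the default of 2000 permutations are arbitrary choices for this sketch.

```python
import random
from statistics import mean


def between_group_spread(groups):
    """Sum over groups of size * (group mean - grand mean)^2."""
    grand_mean = mean(value for group in groups for value in group)
    return sum(len(group) * (mean(group) - grand_mean) ** 2 for group in groups)


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Estimate how often randomly relabeled data looks at least as separated
    as the observed grouping.

    Returns a p-value in (0, 1]; small values suggest the groups differ.
    """
    rng = random.Random(seed)  # fixed seed so that reruns give the same estimate
    observed = between_group_spread(groups)
    pooled = [value for group in groups for value in group]
    sizes = [len(group) for group in groups]

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Cut the shuffled values back into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if between_group_spread(shuffled) >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

A permutation test makes no assumption about the shape of the delay distribution, which may matter if your histograms are heavily skewed. Whichever test you pick, state it and justify it in your writeup.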

Regarding the independent questions, you must come up with your own questions. This will be done on the honor system: if two students independently come up with the same question, that is okay. If you share your question with another person (before they independently come up with it), they can’t use it. As this assignment is meant as calibration for the class, any sharing of code is strictly prohibited.

Notes

Make sure to do all of your measurements from one point of view: for example, scraping 500 pages at home and 500 at UIC will bias your data.

This assignment is due Monday, August 29th at 11:59 PM.