Overview

This class explores techniques and considerations for conducting computer science research rooted in empirical observation. Topics include measurement methodology, metadata, use of external datasets, assessing data quality, calibration, sampling, statistical summaries, visualization techniques, goodness-of-fit, hypothesis testing, structuring the analysis process, and presentation of results. Ultimately, the goal is to foster analysis of empirical data that is both sound and illuminating. Students will “bring their own data” for exploration (via presentation) during class meetings. Ideally, this data will come from their current or previous empirical research efforts; otherwise, students can instead (or in addition) select empirical studies from the literature or publicly available datasets for presentation and analysis. The number of presentations will depend on class size, but will not exceed 2 or 3. The course will also include occasional reading and/or data analysis assignments.

Method of Instruction

This class will consist of four components:

  • Lectures on the foundations of empirical research.
  • Reading and discussing previous research. By default, readings will focus on topics in data science, networking, and security. However, readings will be chosen based on the research interests of students. All students will read all papers, and students will take turns leading discussion on papers in their own interest area. Each student will present at least once, and students will be required to critique the presentations of others.
  • Homework assignments covering the basics of data collection, summarization, and analysis.
  • Data analysis projects, in which small groups of students apply the skills learned in class to conduct and present their own new data analysis, either of datasets they bring themselves (and work on in their own research) or of datasets collected or provided during class.

Student Deliverables

Students will be expected to read approximately one paper per session, present at least one previously completed empirical analysis over the semester, complete between 2 and 4 homework assignments, and conduct one group analysis project, which will include both a written report and a presentation. The final project will be graded on the correctness, thoroughness, clarity, and soundness of the analysis.

Prerequisites

Programming skill amenable to rapid ingestion and analysis of datasets (in a high-level language like MATLAB, Python, or R) is required, as is completion of the student skills and interest survey. Active empirical research is strongly encouraged. Students who are not thesis-option MS students or PhD students are encouraged to contact the instructor prior to enrollment. CS 590 and CS 418 are also recommended.

Course Outline

  • Weeks 1-3
    • Lectures on fundamentals:
      • Data collection
      • Dataset evaluation and cleaning
      • Exploratory analysis and visualization
      • Statistical analysis
      • Presentation-quality analysis and visualization
    • Initial paper discussions led by instructor
    • Administrivia: select later readings reflecting class interests
    • Students propose final projects
  • Weeks 4-13
    • Group discussions of completed empirical analysis research led by students
  • Weeks 14-15
    • Final project presentations

Grading Mechanism

This class is very much about students getting the most out of the experience, tuned to their individual research interests and goals. Your final grade will consist of:

  • 25% from homeworks. Each homework will be graded sufficient/insufficient. Any homework graded insufficient can be turned in again once. One homework may be turned in a third time. For homeworks returned as insufficient, I’ll provide feedback on what to do to improve; please see me if you have any further questions.
  • 25% from participation. Participation consists of answering questions during class on the required readings, presenting your own research or research in your area, and engaging with other students' presentations. Students can miss three class periods without losing participation points. This does not extend to scheduled presentation days; please contact me ASAP if you need to miss a day on which you are scheduled to present.
  • 50% from final project. The final project can be completed individually or in groups of two, and will consist of an in-class presentation and a written report.

Suggested Datasets for Final Project

  • Enron email corpus
  • CRAWDAD wireless networking repository datasets
  • CAIDA internet measurements
  • DNS zone files
  • Wikipedia
  • Stack Overflow
  • City of Chicago open data
  • FTC fraud complaint list