Lecture time: MW 4:30-5:45pm
Location: LC D5
Instructor: Prof. Elena Zheleva
Office hours: Tue 3-5pm, SEO 1140
TA1: Usman Shahid
Office hours: TBD
TA2: Mao Li
Office hours: TBD
"Our ability to collect, manipulate, analyze, and act on vast amounts of data is having a profound impact on all aspects of society. This transformation has led to the emergence of data science as a new discipline. The explosive growth of interest in this area has been driven by research in social, natural, and physical sciences with access to data at an unprecedented scale and variety, by industry assembling huge amounts of operational and behavioral information to create new services and sources of revenue, and by government, social services and non-profits leveraging data for social good. This emerging discipline relies on a novel mix of mathematical and statistical modeling, computational thinking and methods, data representation and management, and domain expertise."
--Committee on Data Science, Computing Research Association
This course provides an in-depth overview of data science from a computer science perspective. Topics include modeling, storage, manipulation, integration, classification, analysis, visualization, information extraction, and big data. The course is programming-intensive and an emphasis will be placed on tying data science concepts to specific real-world applications through hands-on experience.
Working knowledge of probability, data structures and algorithms, and ability to (learn to) program in Python.
We will use Piazza for the course schedule, discussions, and materials, and Gradescope for grading.
Python is the programming language used for homework assignments. We will use Google Cloud for big data computing thanks to a generous grant that Google Cloud provided for this course.
Programming-based homework assignments - 30%
Midterm exam - 20%
Bi-weekly quizzes - 15%
Class project - 35%
No textbook is required. Readings will be assigned, using multiple online sources, including:
[PTDS] Principles and techniques of data science. Lau, Gonzalez, Nolan.
[MMD] Mining of massive datatasets. Leskovec, Rajaraman, Ullman.
[FDV] Fundamentals of data visualization. Wilke.
[CIT] Computational and Inferential thinking. Adhikari, DeNero.
[CIML] A course in machine learning [Errata]. Hal Daume III.
|1/14||Welcome to Data Science||Quiz 0 out|
|1/16||Data science lifecycle, data design and sampling||Syllabus, PTDS: Chapters 1-2||Quiz 0 due 11:59pm|
|1/21||MLK Day - no class|
|1/23||Hypothesis testing||CIT: Chapter 11||Pre-lab assignment out|
|1/28||Lab 1: Data processing with pandas||PTDS: Chapter 3||
Project Requirements out
Quiz 1 out
|1/30||Class canceled due to inclement weather||Quiz 1 due 11:59pm, Jan. 30|
|2/4||Lab 2: Web scraping and data collection||PTDS: Chapter 7|
|2/6||Data cleaning and exploratory data analysis||PTDS: Chapters 4-5||HW 1 (the two labs) due 11:59pm, Feb. 10|
|2/11||Data visualization||PTDS: Chapter 6||Quiz 2 out|
|2/13||Data visualization||FDV: Chapters 1-2||
Quiz 2 due 2/13 11:59pm
Project proposal due 2/14 11:59pm
HW 2 out
|2/18||Models and estimation||PTDS: Chapter 10||Sign up for project check-in slot on 3/5|
|2/20||Probability and generalization||PTDS: Chapter 12|
|2/25||Supervised learning: decision trees||CIT: Chapter 17||Quiz 3 out|
|2/27||Supervised learning: geometric view||CIML: Chapter 1||
Quiz 3 due 11:59pm 2/27
HW 2 due 11:59pm 2/28
|3/4||Supervised learning: linear and non-linear models||PTDS: Chapter 13||Project check-in 3/5|
|3/6||Supervised learning: bias-variance tradeoff, ensembles||PTDS: Chapter 8 CIML: 5.6||Post-check-in proposal slides due 11:59pm 3/8|
|3/11||Supervised learning: practical issues, regression||PTDS: Chapter 13 CIML: Chapter 5.5, 6.2||Quiz 4 out|
|3/13||Unsupervised learning||CIML: 3.4||Quiz 4 due 11:59pm HW 3 out|
|4/1||Databases and SQL||PTDS: Chapter 9|
Data science in the real world
Invited talk by Dr. Plamen Petrov, VP of AI for Anthem
|HW 3 due 11:59pm 4/3|
|4/8||Databases and SQL||PTDS: Chapter 9|
|4/10||Large-scale data processing||MMD: Chapter 2||HW 4 out Project progress report due 11:59pm, April 11|
|4/15||Recommender systems||MMD: Chapter 9||Quiz 5 out|
|4/17||A/B testing||PTDS: Chapter 18||Quiz 5 due 11:59pm|
|4/22||Ethics in data science
Guest lecture by Dr. Emanuelle Burton
Who's Using Your Face - The Ugly Truth About Facial Recognition
Why You Can No Longer Get Lost in the Crowd
Microsoft Denied Police Facial Recognition Tech Over Human Rights Concerns
NYPD Claws Back Documents On Facial Recognition
|4/24||Social network analysis||MMD: Chapter 10||HW 4 due 11:59pm 4/25|
|4/29||Final project presentations||Presentations due at noon Quiz 6 out|
|5/1||Final project presentations||Quiz 6 due 11:59pm|
|5/7||Final project due||Final project due 11:59pm|