Course Description
Provenance and explanations are essential tools for building trustworthy, secure, transparent, and fair data-intensive systems and machine learning pipelines. These tools are used to debug analysis results, to comprehend the results of complex queries, to explore the impact of hypothetical changes to data and/or policies, to audit sensitive computations, and to justify and understand predictions made by machine learning models. This course provides a comprehensive overview of algorithms, systems, and techniques for capturing and managing data provenance, i.e., tracking the origin and creation process of data, as well as for generating explanations for data-intensive computations such as declarative queries and machine learning.
The goal of this course is to provide students with the necessary tools to build provenance-enabled systems and develop automated solutions for generating explanations.
Course Topics
The following topics will be covered in the course:
- Provenance & Explanations - Introduction
- Motivation & use cases
- Provenance graphs
- Explanations for query answers
- Provenance models
- Hypothetical reasoning: what-if and how-to
- Incremental view maintenance / what-if queries
- View update & how-to
- Explanations
- Counterfactual explanations
- Explanations as (provenance) summarization
- Attribution and degrees of responsibility (including game theoretic notions of attribution)
- Explaining missing answers
- Provenance capture & management
- How to compute provenance efficiently?
- Storage and computation trade-offs
- Building provenance-aware & explanation-ready systems
- Strategies for capturing and managing provenance
- How to compute explanations efficiently?
Course Organization
Materials
The following overview articles and textbooks will be helpful, but are optional.
Data Provenance - Origins, Applications, Algorithms, and Models. Boris Glavic. Foundations and Trends® in Databases, vol. 9 (3-4), 209-441, 2021. http://www.cs.uic.edu/%7ebglavic/dbgroup/assets/pdfpubls/G21.pdf
Trends in Explanations: Understanding and Debugging Data-Driven Systems. Boris Glavic, Alexandra Meliou, Sudeepa Roy. Foundations and Trends® in Databases, vol. 11 (3), 226-318, 2021. http://www.cs.uic.edu/%7ebglavic/dbgroup/assets/pdfpubls/GMR21.pdf
Principles of Data Integration, 1st Edition, Doan, Halevy, and Ives, Morgan Kaufmann, 2012
Depending on your background, a standard database textbook may be useful:
Elmasri and Navathe. Fundamentals of Database Systems, 6th Edition, Addison-Wesley, 2003
Ramakrishnan and Gehrke. Database Management Systems, 3rd Edition, McGraw-Hill, 2002
Silberschatz, Korth, and Sudarshan. Database System Concepts, 6th Edition, McGraw Hill, 2010
Garcia-Molina, Ullman, and Widom. Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2008
Grading
- Project: 40%
- Paper review and presentation: 40%
- Homework assignments & quizzes: 10%
- Active participation in class: 10%
Class organization
- The first half of the class will consist mostly of lectures given by the instructor to introduce students to the necessary background in databases, provenance, and explanations.
- In the second half, students will read and present research papers related to the topics covered in the course.
- After the first few classes, students will have to decide on a research project. In this project, students will either implement and/or evaluate an existing technique from a state-of-the-art research paper or work on novel research. The results of these projects will be presented towards the end of the semester. Where appropriate, the instructor will provide guidance to students who are interested in publishing work developed in this course.
Workload
In this course, students will …
- Work on a semester-long research project that either implements provenance or explanation techniques from a research paper or develops new techniques.
- Review and present a state-of-the-art research paper from the field.
- Actively participate in class.
- Complete homework assignments and quizzes.
Prerequisites
No formal prerequisites, but some background in databases (roughly equivalent to CS480) is expected.
Overview
This course teaches you about provenance and explanations as fundamental tools for responsible data science and AI, auditing, building trust and understanding analysis results, and debugging analysis workflows.
- Lecture: overview of content covered in the lectures: here
- Project: information about the project: here
- Literature review: information about the literature review: here
Important Dates
- Select a paper to review: 09/12
- Submit a full draft of the report: 12/02
- Submit the report review report: 12/11
- Select a project: 09/22
- Meet to discuss project design: 10/15
- Meet to discuss progress and final steps: week of 11/11
- Finish project implementation: 12/03 and 12/05
Workload and Grading Scheme
Grading Policy:
- Project: 40%
- Paper Review: 40%
- Homework assignments & quizzes: 10%
- Active class participation: 10%
Grading scheme:
- 80+ = A
- 50+ = B
- 35+ = C
- <35 = E