Literature Review

Overview

During the course, you will read and summarize several research papers covering state-of-the-art techniques in Big Data processing. Each student will present one of these papers in class. All students must read the papers that are marked as required reading.

For your presentation, you can select any paper.

Please select a paper by 09/12.

The papers are available through Google Drive.

Presentation and Report

Please prepare a 20-25 minute talk with slides to present the paper you have been assigned. The whole presentation, including Q&A, should take 30-35 minutes. In addition, you need to write a report explaining and critiquing the presented techniques.

The schedule for presentations is shown below.

A full draft of the report is due on 12/02.

The final report is due on 12/11.

Help for writing the report, preparing slides, and giving a talk

How to give a presentation and prepare slides:

Page giving information on how to give a talk and prepare slides.
http://www.eecs.berkeley.edu/~messer/Bad_talk.html - Tutorial on how to give a (bad) good talk.
Other slides on how to give a good talk

How to write a scientific article:

Page on how to write a CS article. Also comments on some general writing rules.
Simon Peyton Jones's slides and video on how to write a great research paper

Presentation Schedule

The presentation schedule will be announced once papers have been assigned.

Student	Presentation Date	Paper	Slides
Leonardo Borgioli	10/15	Falling Rule Lists	pdf
Karthik Ragi	10/17	Why Should I Trust You?: Explaining the Predictions of Any Classifier	pdf
Sri Keshav Katragadda	10/22	The W3C PROV Family of Specifications for Modelling Provenance Metadata	pdf
Gustavo Moreira	10/24	Noworkflow: A Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts	pdf
Chenjie Li	10/29	Causality-Based Explanation of Classification Outcomes	pdf
Huy Truong	10/31	A Unified Approach to Interpreting Model Predictions	pdf
Revathi Dhotre	11/05	Summarizing Provenance of Aggregate Query Results in Relational Databases	pdf
Amy Byrnes	11/12	HypeR: Hypothetical Reasoning with What-If and How-to Queries Using a Probabilistic Causal Approach	pdf
Jyotsna Rajaraman	11/14	Complaint-Driven Training Data Debugging for Query 2.0

List of Papers

Provenance Models

The Semiring Framework for Database Provenance, Todd J Green, Val Tannen, Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 93–99, 2017
The W3C PROV Family of Specifications for Modelling Provenance Metadata, Paolo Missier, Khalid Belhajjame, James Cheney, Proceedings of the 16th International Conference on Extending Database Technology, pp. 773–776, 2013

Provenance-Aware Systems

GProM - a Swiss Army Knife for Your Provenance Needs, Bahareh Arab, Su Feng, Boris Glavic, Seokki Lee, Xing Niu, Qitian Zeng, IEEE Data Engineering Bulletin 41 (1), 51–62, 2018
Smoke: Fine-Grained Lineage at Interactive Speed, Fotis Psallidas, Eugene Wu, Proc. VLDB Endow. 11 (6), 719–732, 2018
You Say ‘What’, I Hear ‘Where’ and ‘Why’ - (Mis-) Interpreting SQL to Derive Fine-Grained Provenance, Tobias Müller, Benjamin Dietrich, Torsten Grust, Proceedings of the VLDB Endowment, 11 (11), 2018
Noworkflow: A Tool for Collecting, Analyzing, and Managing Provenance from Python Scripts, João Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, Juliana Freire, Proc. VLDB Endow. 10 (12), 1841–1844, 2017
Fine-Grained Lineage for Safer Notebook Interactions, Stephen Macke, Aditya G. Parameswaran, Hongpu Gong, Doris Jung Lin Lee, Doris Xin, Andrew Head, Proc. VLDB Endow. 14 (6), 1093–1101, 2021

Explainable Machine Learning Models

Why Should I Trust You?: Explaining the Predictions of Any Classifier, Marco Túlio Ribeiro, Sameer Singh, Carlos Guestrin, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 1135–1144, 2016
Falling Rule Lists, Fulton Wang, Cynthia Rudin, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015

Attribution and Intervention-based Explanations for Machine Learning

Causality-Based Explanation of Classification Outcomes, Leopoldo E. Bertossi, Jordan Li, Maximilian Schleich, Dan Suciu, Zografoula Vagena, Proceedings of the Fourth Workshop on Data Management for End-To-End Machine Learning, In conjunction with the 2020 ACM SIGMOD/PODS Conference, DEEM@SIGMOD 2020, Portland, OR, USA, June 14, 2020, pp. 6:1–6:10, 2020
A Unified Approach to Interpreting Model Predictions, Scott M. Lundberg, Su-In Lee, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 4765–4774, 2017
Complaint-Driven Training Data Debugging for Query 2.0, Weiyuan Wu, Lampros Flokas, Eugene Wu, Jiannan Wang, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 1317–1334, 2020

Explanations as Provenance Summarization

Summarizing Provenance of Aggregate Query Results in Relational Databases, Omar AlOmeir, Eugenie Yujing Lai, Mostafa Milani, Rachel Pottinger, 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021, pp. 1955–1960, 2021
A Formal Approach to Finding Explanations for Database Queries, Sudeepa Roy, Dan Suciu, SIGMOD, pp. 1579–1590, 2014
Approximate Summaries for Why and Why-Not Provenance, Seokki Lee, Bertram Ludäscher, Boris Glavic, Proceedings of the VLDB Endowment 13 (6), 912–924, 2020
Putting Things into Context: Rich Explanations for Query Answers Using Join Graphs, Chenjie Li, Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy, Proceedings of the 46th International Conference on Management of Data, pp. 1051–1063, 2021

Hypothetical Reasoning: What-if and How-to

Tiresias: The Database Oracle for How-to Queries, Alexandra Meliou, Dan Suciu, Proceedings of the 2012 International Conference on Management of Data, pp. 337–348, 2012
HypeR: Hypothetical Reasoning with What-If and How-to Queries Using a Probabilistic Causal Approach, Sainyam Galhotra, Amir Gilad, Sudeepa Roy, Babak Salimi, SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 1598–1611, 2022
Efficient Answering of Historical What-If Queries, Felix Campbell, Bahareh Arab, Boris Glavic, Proceedings of the 48th International Conference on Management of Data, pp. 1556–1569, 2022
Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines, Stefan Grafberger, Paul Groth, Sebastian Schelter, Proc. ACM Manag. Data 1 (2), 128:1–128:26, 2023

Attribution and Intervention-based Explanations for Databases

Computing the Shapley Value of Facts in Query Answering, Daniel Deutch, Nave Frost, Benny Kimelfeld, Mikaël Monet, SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 1570–1583, 2022
The Shapley Value in Database Management, Leopoldo E. Bertossi, Benny Kimelfeld, Ester Livshits, Mikaël Monet, SIGMOD Rec. 52 (2), 6–17, 2023
ShapGraph: An Holistic View of Explanations through Provenance Graphs and Shapley Values, Susan B. Davidson, Daniel Deutch, Nave Frost, Benny Kimelfeld, Omer Koren, Mikaël Monet, SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, pp. 2373–2376, 2022
Tracing Data Errors with View-Conditioned Causality, Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu, SIGMOD Conference, pp. 505–516, 2011

Instructor

Boris Glavic
