Project
Overview
In the project, you will either extend a provenance or explanation system or implement a technique from a state-of-the-art research paper. Concrete project ideas will be posted at the beginning of the course.
Organization & Dates
The first step is to identify a project you want to work on. Projects either implement and evaluate techniques from a state-of-the-art research paper or pursue original research. Example project ideas are listed below, but suggesting your own ideas is encouraged. Once you have selected a project, you should come up with an implementation plan and then meet with the instructor to present your approach. The finished project is due at the end of the semester, together with a short written summary of what you have done. Project code will be submitted through GitHub; we will create a repository for each student. Finally, on the last days of class (12/03 and 12/05), we will hold short project presentations (5-10 minutes per project).
In summary, the timeline for the project is:
- Select a project: 09/22
- Meet to discuss project design: 10/15
- Meet to discuss progress and final steps: week of 11/11
- Finish project implementation: 12/09
Note that both individual and group projects are possible. However, the expectations for a group project will be higher.
Project Presentations
All presentations will be done in class on 12/03 and 12/05. Each presentation will be 10 minutes long, followed by a 5-minute discussion.
Date | Student | Timeslot | Project |
---|---|---|---|
12/03 | Borgioli, Leonardo | 12:30 | |
12/03 | Byrnes, Amy | 12:45 | |
12/03 | Dhotre, Revathi | 1:00 | |
12/03 | Katragadda, Sri Keshav | 1:15 | |
12/03 | Li, Chenjie | 1:30 | |
12/05 | Moreira, Gustavo | 12:30 | |
12/05 | Ragi, Karthik | 12:45 | |
12/05 | Rajaraman, Jyotsna | 1:00 | |
12/05 | Truong, Huy | 1:15 | |
Example Project Ideas
As mentioned above, feel free to propose your own project idea. These are just meant as examples to give you an idea of what is expected.
Support Spark or DuckDB as a Backend for the GProM Provenance System (1-2 students)
Data provenance is information about how a piece of data was derived from input data. In this project, you will extend GProM, an existing provenance management system that translates provenance requests into SQL queries and can run on top of multiple database systems. Specifically, you will extend GProM to use Spark or DuckDB as a backend.
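To give a feel for the rewrite-based approach and how little glue is needed to target DuckDB from Python, here is a minimal sketch. The rewritten query is hand-written for illustration; GProM generates such rewrites automatically, and the SQL it actually emits will differ.

```python
# Sketch: a provenance request is answered by rewriting the query into
# plain SQL that carries the ids of contributing input tuples along as an
# extra column, then running that SQL on the backend (here DuckDB).
import duckdb

con = duckdb.connect()  # in-memory database
con.execute("CREATE TABLE sales(id INT, item TEXT, price INT)")
con.execute("INSERT INTO sales VALUES (1,'a',10),(2,'b',20),(3,'a',5)")

# Original query: SELECT item, SUM(price) FROM sales GROUP BY item
# Hand-written provenance-style rewrite (illustrative only):
rewritten = """
SELECT item, SUM(price) AS total, LIST(id) AS prov_sales_id
FROM sales
GROUP BY item
"""
print(con.execute(rewritten).fetchall())
```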
Extend the Pandas Library with Provenance Tracking (1-3 students)
In this project, you will extend the Pandas library (https://pandas.pydata.org/) data frame API to track fine-grained provenance using an existing provenance model such as the ones discussed in class, e.g., Why-provenance. Towards this goal, you will identify a subset of the data frame API that corresponds to relational algebra operators and then implement provenance semantics for this subset. Provenance tracking can be implemented either by extending the implementations of Pandas API operations or by a rewrite-based approach that uses existing Pandas DataFrame operations to implement provenance tracking.
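To make the rewrite-based option concrete, here is a minimal sketch of Why-provenance tracking for selection and join, implemented purely with existing Pandas operations. The annotate/select/join helpers and the prov_ column naming are assumptions of this sketch, not an established API.

```python
# Sketch: each input row carries a set of (relation, row-id) pairs; selection
# keeps these sets unchanged, and a joined row's Why-provenance is the union
# of the provenance of the rows it was built from.
import pandas as pd

def annotate(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Attach a provenance column holding each input row's identity."""
    out = df.copy()
    out[f"prov_{name}"] = [{(name, i)} for i in df.index]
    return out

def select(df: pd.DataFrame, mask) -> pd.DataFrame:
    """Selection: surviving rows keep their provenance as-is."""
    return df[mask]

def join(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    """Join: union the provenance sets of the joined input rows."""
    out = left.merge(right, on=on)
    prov_cols = [c for c in out.columns if c.startswith("prov_")]
    out["prov"] = out[prov_cols].apply(lambda r: set().union(*r), axis=1)
    return out.drop(columns=prov_cols)

r = annotate(pd.DataFrame({"a": [1, 2], "b": [3, 4]}), "R")
s = annotate(pd.DataFrame({"a": [1, 1], "c": [5, 6]}), "S")
print(join(select(r, r["a"] == 1), s, on="a"))
```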
Experimental Comparison of Database Provenance Systems (1-3 students)
In this project, you will conduct an experimental comparison of multiple existing provenance systems for relational databases such as GProM and ProvSQL to evaluate their performance, resource requirements, and capabilities.
Implement Provenance-based Data Skipping on Spark (1-2 students)
Provenance-based data skipping is a technique that speeds up queries by adding data-dependent filter conditions that remove data that is irrelevant for answering the query. At its core, the approach relies on provenance sketches, which encode, for a range partition of the input data, which fragments contain relevant data. Depending on how many students work on this project, you could either build just the filtering mechanism, which takes as input the list of ranges of the partitioning that contain provenance and rewrites the query to filter out data that does not belong to these ranges, or also implement the mechanism for creating such sketches for a query.
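A minimal sketch of the filtering half in PySpark, assuming a sketch is represented as a list of (lo, hi) ranges on a partitioning attribute (this representation and the table used are illustrative assumptions):

```python
# Sketch: inject a disjunction of range predicates derived from a provenance
# sketch so that Spark can prune fragments that contain no relevant data.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prov-skipping").getOrCreate()
orders = spark.createDataFrame(
    [(5, "a", 10.0), (42, "b", 7.5), (97, "a", 3.0)],
    ["custkey", "item", "price"],
)

# Fragments of a range partition on custkey known to contain provenance.
sketch = [(0, 10), (90, 100)]

# Build "custkey BETWEEN lo AND hi OR ..." and prepend it to the query.
cond = reduce(lambda a, b: a | b,
              [orders.custkey.between(lo, hi) for lo, hi in sketch])
orders.where(cond).groupBy("item").sum("price").show()
```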
Integrate Data Discovery Techniques into CaJaDe (1-2 students)
CaJaDe is a system that explains the results of aggregate queries by automatically mining relevant context related to the data accessed by the query. CaJaDe does this by joining the input data of a query with other tables in the database based on a join graph that encodes viable options for joining the tables in the database. In this project, you will integrate CaJaDe with data discovery techniques that automatically find possible ways to join datasets in a data lake (see, e.g., this paper).
Shapley-based Explanations for Weakly Supervised Training Data Generation (1-2 students)
Data programming techniques like Snorkel enable the rapid generation of training data by replacing human labeling with automated labeling through so-called labeling functions designed by domain experts. In this project, you will implement a Shapley-value-based attribution technique to identify which labeling functions are most responsible for the predictions made by Snorkel.
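As a starting point, here is a sketch of a permutation-based Monte Carlo estimate of Shapley values over labeling functions. The value function used here (majority-vote accuracy of a subset of labeling functions on a small labeled dev set) is a simplifying assumption; a real project would plug in Snorkel's label model instead.

```python
# Sketch: phi[lf] averages the marginal contribution of labeling function lf
# over random orderings of all labeling functions.
import numpy as np

rng = np.random.default_rng(0)
L = rng.choice([-1, 0, 1], size=(200, 4))   # label matrix: 0 = abstain
y = rng.choice([-1, 1], size=200)           # dev-set gold labels

def v(subset):
    """Majority-vote accuracy of the given set of labeling functions."""
    if not subset:
        return 0.0
    votes = np.sign(L[:, list(subset)].sum(axis=1))
    return float((votes == y).mean())

def shapley(n_lfs, n_perms=500):
    phi = np.zeros(n_lfs)
    for _ in range(n_perms):
        perm, coalition = rng.permutation(n_lfs), set()
        for lf in perm:
            before = v(coalition)
            coalition.add(lf)
            phi[lf] += v(coalition) - before   # marginal contribution
    return phi / n_perms

print(shapley(L.shape[1]))
```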
Improving Intervention-based Explanations for How Training Data Affects Predictions (1-2 students)
Interventions can be used to determine the impact a subset of the training data has on the predictions of an ML model: remove part of the training data, retrain the model, and compare the predictions of the old and new models. However, this is expensive, as it requires repeated retraining of the model. In this project, you will employ a recent technique for efficiently generating a set of models from a set of training datasets to improve the performance of intervention-based techniques.
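For reference, here is a sketch of the naive retrain-and-compare baseline the project aims to speed up. The model choice (logistic regression) and the intervened subset are illustrative assumptions.

```python
# Sketch: delete a subset of the training data, retrain, and measure the
# fraction of test predictions that flip as the subset's impact.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
X_tr, y_tr, X_te = X[:400], y[:400], X[400:]

full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

subset = np.arange(50)                      # training rows to intervene on
keep = np.setdiff1d(np.arange(len(X_tr)), subset)
retrained = LogisticRegression(max_iter=1000).fit(X_tr[keep], y_tr[keep])

flips = (full.predict(X_te) != retrained.predict(X_te)).mean()
print(f"fraction of test predictions changed: {flips:.3f}")
```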
Uncertainty Quantification as Explanations for Data Quality Issues
Data quality issues like missing values, outliers, and constraint violations can significantly impact the behavior of ML models trained over the affected data. One approach for dealing with data quality issues is to “clean” the data. However, we typically do not have sufficient information to determine the correct ground-truth clean version of a dataset. In this project, you will employ uncertainty quantification techniques to model the variability of a model across the possible alternative repairs of a dirty dataset. A high-level description of this variability will then serve as an explanation of the impact the data quality issues have on a model trained over the data.
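A minimal sketch of the idea, assuming that alternative repairs are generated by a few imputation strategies for missing values (the set of repair strategies and the model are illustrative assumptions):

```python
# Sketch: train one model per plausible repair of a dirty dataset and report
# per-example prediction variability as the explanation of data-quality impact.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
y = (X["a"] + X["b"] > 0).astype(int)
X.loc[rng.choice(300, 60, replace=False), "b"] = np.nan  # inject missing values

# A few plausible repairs of the dirty dataset.
repairs = [X.fillna(X.mean()), X.fillna(X.median()), X.fillna(0.0)]

preds = np.array([
    LogisticRegression(max_iter=1000).fit(r, y).predict_proba(r)[:, 1]
    for r in repairs
])
variability = preds.std(axis=0)  # high std = prediction sensitive to repair
print("most repair-sensitive rows:", np.argsort(-variability)[:5])
```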