2025-06-04 - Paper accepted at EDBT 2026
Incremental Maintenance of Provenance Sketches Under Updates
Congrats to Pengyuan on his accepted EDBT 2026 paper. In this work [1] in collaboration with Dieter, Vasudha, Zhen, and Danica, we develop an in-memory engine for incremental maintenance of provenance sketches.
Incremental Maintenance for Provenance-based Data Skipping
Provenance-based data skipping [2] compactly over-approximates the provenance of a query using provenance sketches and utilizes such sketches to speed-up the execution of subsequent queries by skipping irrelevant data. However, a sketch captured at some time in the past may become stale if the data has been updated subsequently. Thus, there is a need to maintain provenance sketches. In this work, we introduce In-Memory incremental Maintenance of Provenance sketches (IMP), a framework for maintaining sketches incrementally under updates. At the core of IMP is an incremental query engine for data annotated with sketches that exploits the coarse-grained nature of sketches to enable novel optimizations.
Relevance-based Data Management
This work is part of our group’s work on Relevance-based Data Management where we utilize information on what data is relevant for answering a query to improve a wide range of data management tasks. A fundamental idea in relevance-based data management is to use dynamic relevance analysis techniques that determine at runtime what data is needed for answering a query. Dynamic relevance analysis is much more powerful than the static relevance analysis a database uses to determine what data can be pruned based on selection conditions. For instance, it enables us to determine what data is needed for answering top-k and aggregation queries with HAVING
where the relevance of input data can only be determined at runtime. For example, consider the following query that returns the three departments with the most employees.
SELECT dept, count(*) AS numemp
FROM employee
GROUP BY dept
ORDER BY numemp DESC
LIMIT 3
Only data from the three departments with the most employees is needed to answer this query, but the identity of these departments will only be discovered at runtime!
-
In-memory Incremental Maintenance of Provenance Sketches
Pengyuan Li, Boris Glavic, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu, Danica Porobic and Xing Niu
Proceedings of the International Conference on Extending Database Technology (2026).@inproceedings{LG26, author = {Li, Pengyuan and Glavic, Boris and Gawlick, Dieter and Krishnaswamy, Vasudha and Liu, Zhen Hua and Porobic, Danica and Niu, Xing}, booktitle = {Proceedings of the International Conference on Extending Database Technology}, title = {In-memory Incremental Maintenance of Provenance Sketches}, pages = {}, keywords = {Provenance; Data Skipping; Relevance-based Data Management; Incremental Maintenance}, projects = {Relevance-based Data Management}, pdfurl = {http://www.cs.uic.edu/%7ebglavic/dbgroup/assets/pdfpubls/LG26.pdf}, longversionurl = {https://arxiv.org/pdf/2505.20683}, venueshort = {EDBT}, istoappear = true, year = {2026} }
-
Provenance-based Data Skipping
Xing Niu, Ziyu Liu, Pengyuan Li, Boris Glavic, Dieter Gawlick, Vasudha Krishnaswamy, Zhen Hua Liu and Danica Porobic
Proceedings of the VLDB Endowment. 15, 3 (2021) , 451–464.@article{NL21, author = {Niu, Xing and Liu, Ziyu and Li, Pengyuan and Glavic, Boris and Gawlick, Dieter and Krishnaswamy, Vasudha and Liu, Zhen Hua and Porobic, Danica}, keywords = {Provenance, Data Skipping, Relevance-based Data Management}, title = {Provenance-based Data Skipping}, journal = {Proceedings of the VLDB Endowment}, projects = {Relevance-based Data Management}, pages = {451 - 464}, volume = {15}, issue = {3}, year = {2021}, doi = {10.14778/3494124.3494130}, venueshort = {{PVLDB}}, pdfurl = {https://vldb.org/pvldb/vol15/p451-niu.pdf}, longversionurl = {https://arxiv.org/pdf/2104.12815} }
Database systems use static analysis to determine upfront which data is needed for answering a query and use indexes and other physical design techniques to speed-up access to that data. However, for important classes of queries, e.g., HAVING and top-k queries, it is impossible to determine up-front what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps.