UIC Database Group

Header bar

2024-11-30 - Papers accepted at NeurIPS and SIGMOD

Learning From Uncertain Data (NeurIPS 2024)

In this work [1] lead by UCSD Ph.D. student Jiongli Li, Boris together with Babak Salimi from UCSD and former DBGroup member Su Feng investigate how to track the effect of uncertainty caused by a wide range of data quality issues in training and test data on the parameters and predictions of machine learning models.

Our system, ZORRO, trains a model for each possible variation of the dataset (errors, missing data, and biases) and combines them to predict a range, always including the true prediction.

Since training a model for every possible world is impractical, we leverage abstract interpretation and zonotopes (a type of convex polytope) to represent all dataset variations compactly, enabling efficient uncertainty handling.

We also developed an abstract version of gradient descent, transforming input zonotopes into another zonotope that over-approximates all possible model parameters—essentially running gradient descent across all possible worlds simultaneously and symbolically.

ZORRO returns a prediction range guaranteed to include the ground truth prediction (from the model trained on the ground truth data). It helps decision-makers assess whether individual predictions are robust, determining if they should trust or abstain from using the model’s output.

For linear regression with Ridge regularization, we also provide a non-trivial closed-form solution to make computations more efficient and enable reasoning about predictions and parameters under uncertainty.

Using Approximate Query Processing to Approximate Probabilistic Queries (SIGMOD 2025)

In this work [2] lead by Aaron Huber, Ph.D. student at SUNY Buffalo, Oliver Kennedy, Atri Rudra, and Zhuoyue Zhao, and Boris did investigate how to efficiently approximate the expected multiplicity of result tuples for queries over probabilistic database with bag semantics.

While the data complexity of evaluating such queries to return expected multiplicities is in /PTIME/, the overhead over deterministic query processing can still be significant. We identify the construction of lineage formulas, polynomials that record how an output tuple of a query is derived from the inputs, as a major bottleneck of both exact and approximate computation of expected multiplicities.

The main insight of this work is that instead of constructing the full lineage polynomial for every result tuple, one can instead adopt approximate query processing techniques (/AQP/) for estimating aggregation results on top of joins to sample monomials

  1. FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds
    Aaron Huber, Oliver Kennedy, Atri Rudra, Zhuoyue Zhao, Su Feng and Boris Glavic
    SIGMOD. 3, 1 (2025) , 41:1–41:25.
    details
  2. Learning from Uncertain Data: From Possible Worlds to Possible Models
    Jiongli Zhu, Su Feng, Boris Glavic and Babak Salimi
    Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024 (2024).
    details