2025-07-24 - Paper on stress-testing ML pipelines accepted at VLDB 2026
Stress-testing ML pipelines with Adversarial Data Corruption
In this work [1], led by Jiongli, Boris together with Jiongli’s advisor Babak from UCSD investigates how systematic data quality issues impact the performance (accuracy, fairness, …) of machine learning models. Furthermore, we explore how data cleaning techniques applied during preprocessing affect a downstream model trained over the cleaned data, and how robust machine learning techniques such as conformal prediction fail to provide their claimed guarantees when training data errors are systematic.
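To give a flavor of the conformal prediction failure mode mentioned above, here is a minimal self-contained sketch (not the paper's code; the toy model, data generator, and corruption are illustrative assumptions). Split conformal prediction calibrates an interval width from held-out residuals; if a systematic error shrinks the calibration residuals toward the model's predictions, the intervals silently narrow and the claimed coverage is lost:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cal, n_test, alpha = 1000, 1000, 0.1  # target coverage 1 - alpha = 90%

# Toy setup: data follows y = x + N(0, 1) and the "model" predicts y_hat = x.
x_cal, x_test = rng.normal(size=n_cal), rng.normal(size=n_test)
y_cal = x_cal + rng.normal(size=n_cal)
y_test = x_test + rng.normal(size=n_test)

def interval_coverage(cal_labels):
    """Split conformal: calibrate interval width from absolute residuals,
    then measure empirical coverage of y_hat +/- q on the test set."""
    scores = np.abs(cal_labels - x_cal)                    # nonconformity scores
    q_level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal   # finite-sample correction
    q = np.quantile(scores, q_level)
    return np.mean(np.abs(y_test - x_test) <= q)

cov_clean = interval_coverage(y_cal)

# Systematic corruption: calibration labels pulled 90% of the way toward the
# model's predictions, so residuals look 10x smaller than they really are.
y_cal_corrupt = x_cal + 0.1 * (y_cal - x_cal)
cov_corrupt = interval_coverage(y_cal_corrupt)

print(f"coverage clean: {cov_clean:.2f}, corrupted: {cov_corrupt:.2f}")
```

On clean calibration data the empirical coverage sits near the promised 90%; with the systematically corrupted calibration set it collapses well below it, because the exchangeability assumption between calibration and test points no longer holds.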
The main contribution of the work is a framework to inject adversarial, but realistic, data errors into clean data to evaluate the sensitivity of ML pipelines and robust machine learning techniques to such errors. The framework is general in that we treat the ML pipeline as a black box and use black-box optimization techniques to search for adversarial data corruptions. Using this framework, we were able to gain new insights into the impact of data quality issues, demonstrating that: (i) prior experimental studies that use random or manually crafted error injection often fail to identify vulnerabilities of ML pipelines, (ii) often a small number of errors is enough to severely degrade model performance, and (iii) robust machine learning techniques fail in the presence of systematic data errors as their assumptions (e.g., exchangeability) are violated.
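The black-box search idea can be sketched in a few lines. The following is a hypothetical toy version, not the authors' implementation: the "pipeline" is an opaque function from training data to test accuracy (here a nearest-centroid classifier), and plain random search stands in for the paper's black-box optimizer, hunting for a label-flip corruption of a fixed budget that degrades accuracy the most:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian classes in 2D.
X_train = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_train = np.array([0] * 100 + [1] * 100)
X_test = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_test = np.array([0] * 100 + [1] * 100)

def pipeline_accuracy(X, y):
    """Opaque pipeline: train a nearest-centroid classifier on (X, y),
    return its accuracy on the clean test set."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (np.linalg.norm(X_test - c1, axis=1)
            < np.linalg.norm(X_test - c0, axis=1)).astype(int)
    return (pred == y_test).mean()

clean_acc = pipeline_accuracy(X_train, y_train)

# Black-box search over corruptions: flip the labels of k training points,
# keep the flip set that hurts downstream accuracy the most.
k, best_acc, best_idx = 40, clean_acc, None
for _ in range(50):
    idx = rng.choice(len(y_train), size=k, replace=False)
    y_corrupt = y_train.copy()
    y_corrupt[idx] = 1 - y_corrupt[idx]
    acc = pipeline_accuracy(X_train, y_corrupt)
    if acc < best_acc:
        best_acc, best_idx = acc, idx

print(f"clean accuracy: {clean_acc:.2f}, worst found under corruption: {best_acc:.2f}")
```

Because the search only queries the pipeline's output, the same loop works unchanged for any preprocessing-plus-training pipeline; the paper's framework replaces random search with more sophisticated black-box optimization and constrains corruptions to be realistic.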
-
Stress-Testing ML Pipelines with Adversarial Data Corruption
Jiongli Zhu, Geyang Xu, Felipe Lorenzi, Boris Glavic and Babak Salimi
Proc. VLDB Endow. 18, 11 (2025), 4668–4681. doi: 10.14778/3749646.3749721

@article{ZX25,
  author         = {Zhu, Jiongli and Xu, Geyang and Lorenzi, Felipe and Glavic, Boris and Salimi, Babak},
  title          = {Stress-Testing {ML} Pipelines with Adversarial Data Corruption},
  journal        = {Proc. {VLDB} Endow.},
  year           = {2025},
  volume         = {18},
  number         = {11},
  pages          = {4668 - 4681},
  doi            = {10.14778/3749646.3749721},
  keywords       = {Data Cleaning; Machine Learning},
  pdfurl         = {https://www.vldb.org/pvldb/vol18/p4668-zhu.pdf},
  longversionurl = {https://arxiv.org/pdf/2506.01230},
  venueshort     = {{PVLDB}}
}