Perm: Efficient Provenance Support for Relational Databases

Authors

Boris Glavic

Materials

Abstract

In many application areas like scientific computing, data-warehousing, and data integration detailed information about the origin of data is required. This kind of information is often referred to as data provenance. The provenance of a piece of data, a so-called data item, includes information about the source data from which it is derived and the transformations that lead to its creation and current representation. In the context of relational databases, provenance has been studied both from a theoretical and algorithmic perspective. Yet, in spite of the advances made, there are very few practical systems available that support generating, querying and storing provenance information (We refer to such systems as provenance management systems or PMS). These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested sub-queries, aggregation and/or user defined functions. Without support for these constructs, a provenance management system is of limited use. Furthermore, existing approaches use different data models to represent provenance and the data for which provenance is computed (normal data). This has the intrinsic disadvantage that a new query language has to be developed for querying provenance information. Naturally, such a query language is not as powerful and mature as, e.g., SQL. In this thesis we present Perm, a novel relational provenance management system that addresses the shortcoming of existing approaches discussed above. The underlying idea of Perm is to represent provenance information as standard relations and to generate and query it using standard SQL queries; ”Use SQL to compute and query the provenance of SQL queries”. Perm is implemented on top of PostgreSQL extending its SQL dialect with provenance features that are implemented as query rewrites. This approach enables the system to take full benefit from the advanced query optimizer of PostgreSQL and provide full SQL query support for provenance information. Several important steps were necessary to realize our vision of a ”purely relational” provenance management system that is capable of generating provenance information for complex SQL queries. We developed new notions of provenance that handle SQL constructs not covered by the standard definitions of provenance. Based on these provenance definitions rewrite rules for relational algebra expressions are defined for transforming an algebra expression q into an algebra expression that computes the provenance of q (These rewrites rules are proven to produce correct and complete results). The implementation of Perm, based on this solid theoretical foundation, applies a variety of novel optimization techniques that reduce the cost of some intrinsically expensive provenance operations. By applying the Perm system to schema mapping debugging - a prominent use case for provenance - and extensive performance measurements we confirm the feasibility of our approach and the superiority of Perm over alternative approaches.

bibtex

@phdthesis{G10a,
  author = {Glavic, Boris},
  date-added = {2012-12-14 18:55:49 +0000},
  date-modified = {2012-12-14 18:55:49 +0000},
  keywords = {Provenance; Perm},
  pdfurl = {http://www.cs.uic.edu/%7ebglavic/dbgroup/assets/pdfpubls/G10a.pdf},
  projects = {Perm},
  school = {University of Zurich},
  title = {{Perm: Efficient Provenance Support for Relational Databases}},
  venueshort = {PhD Thesis},
  year = {2010},
  bdsk-url-1 = {http://www.cs.uic.edu/%7ebglavic/dbgroup/assets/pdfpubls/G10a.pdf}
}

Reference

Perm: Efficient Provenance Support for Relational Databases Boris Glavic University of Zurich.