Mark Grechanik Ph.D., University of Texas at Austin
PISTIS Project
© Copyright Mark Grechanik 2012
Summary
Testing software applications that use nontrivial databases is increasingly outsourced to test centers in order to achieve lower cost and higher quality. Not only do different data privacy laws prevent organizations from sharing this data with test centers because databases contain sensitive information, but it is also very time consuming and difficult to anonymize, distribute, and test with large databases. Removing and sanitizing data often leads to significantly worse test coverage and fewer faults uncovered, thereby reducing the quality of software applications. We created a novel approach for Protecting and mInimizing databases for Software TestIng taSks (PISTIS) that both sanitizes a database and minimizes it. PISTIS uses a weight-based data clustering algorithm that partitions data in the database using information from program analysis that indicates how this data is used by the application. For each cluster, a centroid object is computed that represents different persons or entities in the cluster, and we use associative rule mining to compute and use constraints to ensure that the centroid objects are representative of the general population of data in the cluster. Doing so also sanitizes information, since these centroid objects replace the original data, making it difficult for attackers to infer sensitive information. Thus, we reduce a large database to a few centroid objects, and we show in our experiments with four applications that test coverage stays close to its original level. The entire code, experimental setup, and video can be obtained from here.
The Problem
Testing software applications that use nontrivial databases is increasingly outsourced to test centers in order to achieve lower cost and higher quality.
Not only do different data privacy laws prevent organizations from sharing this data with test centers because databases contain sensitive information, but it is also very time consuming and difficult to anonymize, distribute, and test applications with large databases. Removing and sanitizing data often leads to significantly worse test coverage and fewer faults uncovered, thereby reducing the quality of software applications. Database-centric applications (DCAs) are common in enterprise computing, and they use nontrivial databases. DCAs are increasingly tested by specialized third-party software service providers, also called test centers, which offer lower cost and higher quality than in-house testing. When releasing these proprietary DCAs to external test centers, DCA owners should make their databases available to test engineers, so that they can perform testing using original data. However, since sensitive information cannot be disclosed to external organizations, testing is often performed with synthetic input data. For instance, if values of the field Nationality are replaced with the generic value “Human,” DCAs may execute some paths that result in exceptions or miss certain paths. As a result, test centers report worse test coverage (such as code coverage) and fewer faults uncovered, thereby reducing the quality of DCAs and negating the benefits of test outsourcing.
This situation is aggravated by big data: collections of large data sets that contain patterns that may be useful for some tasks. To perform these tasks using big data, DCAs use databases whose sizes are measured in hundreds of terabytes on the low end. Our interviews with contractors who use industry-strength tools like IBM Optim reveal that sanitizing large databases often takes many weeks and requires significant resources. In addition, maintaining and resetting states of large databases when testing DCAs is difficult.
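The Nationality example can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the records, the `process` function standing in for DCA logic, and the helper names are hypothetical, and the grouping step is plain exact-match grouping on one attribute, not the weight-based clustering or rule mining that PISTIS uses. It only shows why a generic replacement value can collapse the set of executed paths, while keeping one representative record per group of the branch-relevant attribute can preserve it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; names, ages, and country codes are invented.
ORIGINAL = [
    {"name": "Alice", "age": 34, "nationality": "US"},
    {"name": "Bob", "age": 45, "nationality": "US"},
    {"name": "Chloe", "age": 29, "nationality": "CA"},
    {"name": "Diego", "age": 52, "nationality": "MX"},
    {"name": "Emi", "age": 41, "nationality": "JP"},
    {"name": "Farid", "age": 38, "nationality": "JP"},
]


def process(record):
    """Toy stand-in for DCA logic; each outcome represents one execution path."""
    nationality = record["nationality"]
    if nationality == "Human":
        # A generic placeholder is not a value the application expects.
        raise ValueError("unrecognized nationality")
    if nationality == "US":
        return "domestic"
    if nationality in {"CA", "MX"}:
        return "neighbor"
    return "international"


def paths_covered(records):
    """Return the set of paths (including the exception path) that records exercise."""
    covered = set()
    for record in records:
        try:
            covered.add(process(record))
        except ValueError:
            covered.add("exception")
    return covered


def generic_sanitize(records):
    """Naive sanitization: overwrite identifying values with generic ones."""
    return [{**r, "name": "REDACTED", "nationality": "Human"} for r in records]


def centroids(records, key):
    """Keep one representative per distinct value of `key`.

    Simplified assumption: `key` is the attribute the application branches on,
    the name is replaced by a placeholder, and age is averaged over the group.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    reduced = []
    for i, (value, members) in enumerate(sorted(groups.items())):
        reduced.append({
            "name": f"entity-{i}",
            "age": round(mean(m["age"] for m in members)),
            key: value,
        })
    return reduced
```

Calling `paths_covered(generic_sanitize(ORIGINAL))` yields only the exception path, whereas `paths_covered(centroids(ORIGINAL, "nationality"))` matches the paths exercised by the original six records while using four. In this toy version a group with a single member offers little protection, which is one reason a real approach needs more than exact-match grouping.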
Ideally, the size of a database should be reduced to ease testing without sacrificing the quality of testing. Sanitizing and Minimizing (S&M) databases are loosely connected tasks: in some cases, removing data that describes persons or entities from a database may hide sensitive information about them. However, in general, minimizing databases does not hide sensitive information, and sanitizing data does not result in smaller databases. A fundamental problem in test outsourcing is how to allow a DCA owner to release a smaller subset of its private data with guarantees that the entities in this data (e.g., people, organizations) are protected at a certain level while retaining testing efficacy. Ideally, sanitized (or anonymized) data should induce execution paths that are similar to the ones that are induced by the original data. In other words, when databases are S&Med, information about how DCAs use this data should be taken into consideration.
Downloads and Experimental Results
To reproduce the results of our experiments with PISTIS, you need to obtain the following components and follow these steps. Subject DCAs can be obtained from here. The movie that shows how the PISTIS plugin for Eclipse is used is available here. The source code for PISTIS is available here. To compute rankings of attributes, you need the toolkit from our previous work on PRIEST, which can be downloaded here. The package that contains deliverables for PISTIS is available here. The results of experiments with subject DCAs are available here.
People
PISTIS was created at the Advanced Research In Software Engineering (ARISE) lab at the Department of Computer Science of the University of Illinois at Chicago, where Mark Grechanik leads a research team, and at the Software Engineering Maintenance and Evolution Research Unit at the College of William and Mary, headed by Denys Poshyvanyk.
We thank Daniel Graham from the College of William and Mary for participating in the early stages of the work on PISTIS.
Boyang Li and Mark Grechanik, Project Leads
Emails: bli01[at]email.wm.edu and drmark[at]uic.edu
Denys Poshyvanyk
Email: denys[at]cs.wm.edu