Mark Grechanik, Ph.D., University of Texas at Austin
PRIME Project
Copyright © Mark Grechanik 2012
Summary

We reformulated the question of whether software quality can be accurately predicted using internal software metrics alone as a supervised machine learning problem. We conducted a large-scale empirical study with 3,392 open-source projects using six different classifiers. Further, we performed feature selection to determine whether some subset of these metrics could predict quality while guarding against noise and irrelevant attributes. Our results show that the accuracy of software quality prediction stays below 61%, with Cohen's and Shah's kappa << 0.1, leading us to suggest that comprehensive sets of internal software metrics alone CANNOT accurately predict software quality in general. The entire code, experimental setup, and video can be obtained from here.

The paper is published in the proceedings of the 12th International Conference on Machine Learning and Data Mining (MLDM 2016), July 16-21, 2016, New York, USA. You can read the paper here.

The Problem

The quality of a software application is a set of measures that describe how well the application is expected to fulfill some need and meet some standards. There are many such measures, including but not limited to correctness, reliability, performance, robustness, maintainability, and usability. We interviewed many dozens of software managers and found that the single most reliable measure of the quality of a software application is the phase of testing (PoT) assigned to it at a given time, i.e., alpha, beta, or production. The concept of PoT was introduced in the 1950s and is widely used in software development to indicate the quality of software applications.

When a software application is built and tested within a development organization, it is assigned the alpha phase, where testing is performed by dedicated teams within that organization. The alpha phase indicates the lowest level of quality for software applications. During work on the alpha version of the application, software engineers fix bugs and add and change features (i.e., units of functionality), among many other things. When the collective quality of the application improves to a certain level, stakeholders assign it to the beta phase, where the application is shipped to selected customers who use it and give detailed feedback to the development organization. At some point, software engineers further improve the quality of the application, and stakeholders assign it the production phase. We observed that in many companies and organizations PoT is used to describe both the internal and external qualities of software.

In our interviews with software managers, we asked why they do not use other indicators of software quality, for example, the number of bugs or software metrics computed from the source code of applications. They stated that none of these metrics alone can describe the quality of a software application. For example, it is not just the number of outstanding bugs that matters but also their severity and the average time it takes to resolve a bug. In general, major and critical software defects constitute approximately 31% of all defects; the remaining 69% are minor and cosmetic bugs over the lifecycle of a software application. Customers do not care about many of the cosmetic and minor bugs, and they remain unresolved without any detriment to the quality of the software application.

Contrary to a single metric, a PoT phase is collectively agreed on by multiple stakeholders, who assign it to a software application based on multiple metrics and on feedback from customers. As such, PoT reflects an ultimate judgement on the quality of software applications. We address the following problem: can software quality be accurately predicted using source code software metrics alone? Collecting various software metrics from the source code of applications is easy; determining subsets of these metrics that are useful for building a good predictor is very difficult. This problem is an instance of a bigger problem in ML, namely a supervised ML problem: verifying whether these software metrics, taken collectively, are predictors of software quality as judged by the PoT criterion. The problem centers on constructing and selecting subsets of attributes that are important for creating high-quality prediction models. Its root is the difficulty of determining the individual predictive power of attributes under a specific ML algorithm. For a small number of attributes it is possible to train classifiers with all subsets from the powerset of these attributes; however, this brute-force solution quickly becomes computationally prohibitive once the number of attributes grows to a couple of dozen. Here we deal with over 90 different software metrics, and trying all their subsets is not feasible. Instead, we perform feature selection to determine whether a subset of these metrics suffices while guarding against noise and irrelevant attributes. The problem is not only in training predictors using all subsets of attributes: different ML algorithms may perform differently, so multiple ML algorithms should be used to build predictors from these subsets. The parameter spaces of these ML algorithms should also be explored as part of a sensitivity analysis, to check whether varying the values of these parameters significantly changes the precision of the predictor. These and many other variables are explored as part of addressing our problem; the sketches below illustrate the formulation and the scale of the search space.
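To make the formulation concrete, here is a minimal sketch of how PoT prediction can be framed as supervised classification. Our actual experiments were run as RapidMiner projects (see Downloads below); the scikit-learn code here is only illustrative, the file and column names are hypothetical, and any of the six classifiers from the study could stand in for the random forest used here.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, cohen_kappa_score
    from sklearn.model_selection import train_test_split

    # Each row is one project: ~90 metric columns plus its PoT label.
    data = pd.read_csv("project_metrics.csv")   # hypothetical file name
    X = data.drop(columns=["pot_phase"])        # internal software metrics
    y = data["pot_phase"]                       # alpha / beta / production

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    # Accuracy alone is misleading when classes are skewed; kappa corrects
    # for chance agreement, which is why it is reported alongside accuracy.
    print("accuracy:", accuracy_score(y_test, pred))
    print("Cohen's kappa:", cohen_kappa_score(y_test, pred))

A kappa near zero means the classifier agrees with the true PoT labels barely more often than chance would, no matter how respectable the raw accuracy looks.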
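The scale argument against brute-force subset search, and the kind of cheaper alternatives we have in mind, can also be sketched. Again, this is illustrative scikit-learn code rather than our RapidMiner setup: SelectKBest with mutual information and a grid search are stand-ins for the feature selection and the parameter sensitivity analysis described above.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import GridSearchCV

    # Why brute force is hopeless: one classifier per non-empty subset
    # of 90 metrics would require about 1.24e+27 training runs.
    n_metrics = 90
    print(f"non-empty subsets: {2 ** n_metrics - 1:.3e}")

    # A tractable alternative: score each metric's relevance to the PoT
    # label and keep only the k best, discarding noisy and irrelevant
    # attributes. X and y are the metrics and labels from the sketch above.
    def select_metrics(X, y, k=10):
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
        X_reduced = selector.fit_transform(X, y)
        return X_reduced, list(X.columns[selector.get_support()])

    # Parameter sensitivity can be probed with a sweep over a classifier's
    # parameter grid, watching how much the score moves across settings.
    def sweep_parameters(X, y):
        grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
        search = GridSearchCV(RandomForestClassifier(random_state=42),
                              grid, scoring="accuracy", cv=5)
        search.fit(X, y)
        return search.best_params_, search.best_score_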
Downloads and Experimental Results

To reproduce the results of our experiments with PRedicting software qualIty with Minimum fEatures (PRIME), you need to obtain the following components. Software metrics for the subject applications are available in this CSV file. The source code archive for the subjects is available upon request. The experimental setup for PRIME, as an archive of RapidMiner projects, is available here.

People

PRIME was carried out at the Advanced Research In Software Engineering (ARISE) lab at the Department of Computer Science of the University of Illinois at Chicago, where Mark Grechanik leads a research team, and at the Software Engineering Maintenance and Evolution Research Unit at the College of William and Mary, headed by Denys Poshyvanyk. This project is done in collaboration with Mohak Shah of GE Research.

Mark Grechanik, Project Lead, Email: drmark[at]uic.edu
Nitin Prabhu, Email: nprabh3[at]uic.edu
Denys Poshyvanyk, Email: denys[at]cs.wm.edu
Daniel Graham, Email: dggraham[at]email.wm.edu
Mohak Shah, Email: mohak.shah[at]gmail.com