Analyzing Software Measurement Data with Clustering Techniques
Shi Zhong, Taghi M. Khoshgoftaar, and Naeem Seliya, Florida Atlantic University
Software quality estimation using supervised-learning approaches is difficult without software fault measurement data from similar projects or earlier system releases. Cluster analysis with expert input is a viable unsupervised-learning solution for predicting software modules’ fault proneness and potential noisy modules.
For software quality estimation, software development practitioners typically construct quality-classification or fault prediction models using software metrics and fault data from a previous system release or a similar software project.1 Engineers then use these models to predict the fault proneness of software modules in development.
This lets them track and detect potential software faults early on, which is critical in many high-assurance systems.
However, building accurate quality-estimation models is challenging because noisy data usually degrades trained models’ performance. Two general types of noise exist in software metrics and quality data. One relates to mislabeled software modules, caused by software engineers failing to detect, forgetting to report, or simply ignoring existing software faults. The other pertains to deficiencies in some collected software metrics, which can leave two software modules that are similar in terms of the given metrics with different fault proneness labels. Removing such noisy instances can significantly improve the performance of calibrated software quality-estimation models.2 Therefore, pinpointing the problematic software modules before calibrating any software quality-estimation models is desirable.
1094-7167/04/$20.00 © 2004 IEEE Published by the IEEE Computer Society
Another main challenge is that, in real-world software projects, software fault measurements (such as fault proneness labels) might not be available for training a software quality-estimation model. This happens when an organization is dealing with a software project type it’s never dealt with before. In addition, it might not have recorded or collected software fault data from a previous system release. So, how does the quality assurance team predict a software project’s quality without collected software quality metrics? The team can’t take a supervised learning approach without software quality metrics such as the risk-based class or number of faults. The estimation task then falls on the analyst (expert), who must decide the labels for each software module. Cluster analysis, an exploratory data analysis tool, naturally addresses these two challenges.
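As a rough illustration of how cluster analysis could address both challenges, the following Python sketch groups modules by their metric values, asks an expert to label one representative module per cluster, propagates that label to the whole cluster, and flags modules whose recorded label disagrees with it as potentially noisy. Everything specific here is an assumption made for illustration and is not taken from the article: the use of k-means, the two metrics (lines of code and cyclomatic complexity), the toy values, the `fp`/`nfp` label names, and the `expert_label` rule that stands in for a human analyst.

```python
import math

def dist(a, b):
    """Euclidean distance between two metric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100):
    """Plain k-means (an assumed choice); returns (centroids, cluster index per point)."""
    # Deterministic farthest-point initialization, chosen only for reproducibility.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new_assign == assign:  # assignments stopped changing: converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assign

def cluster_and_label(points, k, expert_label):
    """Label every module with the expert's label for its cluster's representative."""
    centroids, assign = kmeans(points, k)
    cluster_label = {}
    for j in range(k):
        members = [i for i, a in enumerate(assign) if a == j]
        if not members:
            continue
        # Representative module: the cluster member closest to the centroid.
        rep = min(members, key=lambda i: dist(points[i], centroids[j]))
        cluster_label[j] = expert_label(points[rep])
    return [cluster_label[a] for a in assign], assign

def flag_noise(recorded, predicted):
    """Indices of modules whose recorded label disagrees with the cluster label."""
    return [i for i, (r, p) in enumerate(zip(recorded, predicted)) if r != p]

# Hypothetical toy data: (lines of code, cyclomatic complexity) per module.
modules = [(20, 2), (25, 3), (30, 2), (22, 1),
           (200, 15), (220, 18), (210, 20), (190, 16)]
# Hypothetical recorded labels, used only when such labels exist.
recorded = ["nfp", "nfp", "fp", "nfp", "fp", "fp", "fp", "nfp"]

def expert_label(metrics):
    """Stand-in for a human expert's judgment (a made-up rule, not from the article)."""
    return "fp" if metrics[1] > 10 else "nfp"

labels, assignment = cluster_and_label(modules, 2, expert_label)
suspects = flag_noise(recorded, labels)
print(labels)
print("potentially noisy module indices:", suspects)
```

In practice, the metrics would usually be standardized first so that no single metric dominates the distance, and the expert would consult descriptive statistics for each cluster rather than apply a fixed rule.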
Unsupervised learning methods such as clustering techniques are a natural choice for analyzing software quality in the absence of fault proneness labels. Clustering algorithms can group the software modules according to the values of their software metrics. The underlying software-engineering assumption is that fault-prone software modules will have similar software measurements and so will likely form clusters. Similarly, not-fault-prone modules will likely group together. When the clustering analysis is complete, a software engineering expert inspects each cluster and labels it fault prone or not fault prone.
A clustering approach offers practical benefits to the expert who must decide the labels. Instead of inspecting and labeling software modules one at a time, the expert can inspect and label a given cluster as a whole; he or she can assign all the modules in the cluster the same quality label. Such a strategy eases the tediousness of the labeling task, which is compounded when modules are numerous. For each cluster, the clustering algorithm can provide a representative software module, which the expert can inspect for labeling all modules in that cluster (aided by other descriptive data statistics). Moreover, when actual labels for...