Software Defect Prediction Based on Optimized Machine Learning Models: A Comparative Study

Software defect prediction is crucial for detecting possible defects in software before they manifest. While machine learning models have become more prevalent in software defect prediction, their effectiveness may vary based on the dataset and hyperparameters of the model. Difficulties arise in determining the most suitable hyperparameters for the model, as well as identifying the prominent features that serve as input to the classifier. This research aims to evaluate various traditional machine learning models that are optimized for software defect prediction on NASA MDP (Metrics Data Program) datasets. The datasets were classified using k-nearest neighbors (k-NN), decision trees, logistic regression, linear discriminant analysis (LDA), single hidden layer multilayer perceptron (SHL-MLP), and Support Vector Machine (SVM). The hyperparameters of the models were fine-tuned using random search, and the feature dimensionality was reduced using principal component analysis (PCA). The synthetic minority oversampling technique (SMOTE) was implemented to oversample the minority class in order to correct the class imbalance. k-NN was found to be the most suitable for software defect prediction on several datasets, while SHL-MLP and SVM were also effective on certain datasets. It is noteworthy that logistic regression and LDA did not perform as well as the other models. Moreover, the optimized models outperform the baseline models in terms of classification accuracy. The choice of model for software defect prediction should be based on the specific characteristics of the dataset. Furthermore, hyperparameter tuning can improve the accuracy of machine learning models in predicting software defects.


I. INTRODUCTION
As technology has advanced and consumer expectations for software have risen, the software development process has gotten increasingly intricate [1]. As a result, software engineers must now focus on improving their ability to detect and prevent software defects [2]. Software Defect Prediction (SDP) is a crucial technique that identifies potential software defects before they occur. In software engineering, SDP is an important and challenging task. Better software quality and reduced development costs are both linked to early defect detection in software development [3], [4].
Recently, machine learning models have been widely used to detect defects in software because they can automatically learn patterns from data that indicate defects [5]. Several studies have demonstrated the usefulness of machine learning models for predicting software defects, including decision tree (DT) [6], Naïve Bayes (NB) [7], k-nearest neighbors (k-NN) [8], [9], Artificial Neural Network (ANN) [10], and Support Vector Machine (SVM) [11]. However, model performance can vary widely across datasets and hyperparameter settings, and selecting the optimal hyperparameters is a common challenge in machine learning. Despite this, almost all SDP studies using machine learning models did not perform hyperparameter tuning to obtain the optimal model hyperparameters.
Another issue in SDP using a machine learning model is selecting prominent features to use as input to the classifier. Several feature selection techniques have been used to choose the optimal feature subset and thus avoid the decline in classification performance caused by redundant and irrelevant features, as reported in [12]-[14]. However, the quality of the results obtained by feature selection techniques depends strongly on the dataset. Principal Component Analysis (PCA) is another approach that can be used to reduce irrelevant features. In machine learning, PCA reduces data dimensionality while maintaining as much information as possible to find patterns [15]. To achieve optimal classification performance, however, it is necessary to determine the optimal number of selected components.
TEKNIKA, Volume 12 (2)
This paper compares optimized machine learning models for SDP on the NASA MDP (Metrics Data Program) datasets [16]. Six traditional machine learning models were used to classify 12 NASA MDP datasets: k-nearest neighbors (k-NN), decision trees (DT), logistic regression (LR), linear discriminant analysis (LDA), single hidden layer multilayer perceptron (SHL-MLP), and support vector machine (SVM). The hyperparameters of each model were optimized using random search [17] to obtain the best classifier for each dataset. Before being input to the classifier, the dimensionality of the features was reduced using PCA. The number of selected components was also optimized using random search. The synthetic minority oversampling technique (SMOTE) was used as an oversampling strategy for the minority class to deal with the unbalanced samples in the NASA MDP datasets [18].
The remaining sections of the paper are structured as follows. Section II provides a theoretical foundation for the study by reviewing the relevant literature. Section III explains the methods used in this study. The findings and a discussion of their implications are presented in Section IV. Finally, Section V summarizes the key findings and suggests avenues for future research.

II. LITERATURE REVIEW
Iqbal et al. [5] analyzed the effectiveness of several machine learning models for SDP using NASA MDP datasets. The performance of ten machine learning models was measured using a variety of evaluation metrics. These models included k-NN, DT, LR, MLP, SVM, radial basis function (RBF), one rule (OneR), K* (kStar), PART, and random forest (RF). The findings indicate that model performance under the evaluation metrics varies with the dataset, except for the ROC area score. Based on the ROC area score, RF achieved higher performance than the other models.
A novel method for SDP based on a weighted naive Bayes classifier was proposed by Ji et al. [6]. The authors leverage the concept of information diffusion to assign weights to the features used in the classifier, resulting in improved prediction accuracy compared to traditional naive Bayes classifiers. While the approach shows promise, the authors' experimental evaluation could benefit from larger and more diverse datasets to better demonstrate the method's effectiveness.
Marian et al. [7] proposed a new approach to predicting software defects using fuzzy decision trees. The authors claim that this approach outperforms standard DT in AUC scores. While the concept of fuzzy decision trees is intriguing, the study lacks sufficient information about the implementation and evaluation processes, such as the selection of input features and the selection criteria for the best model. In addition, the dataset used in the experiments is limited to two software projects, which casts doubt on the generalizability of the proposed method.
Hammad et al. [8] presented a machine learning approach to predict software faults using k-NN. The authors conducted experiments and achieved promising results, with an accuracy rate of up to 87%. Kumar et al. [9] proposed a new approach for predicting software defects in Aspect-Oriented Programming (AOP). The authors use a combination of fuzzy c-means clustering with genetic algorithms (FCM-GM) and k-NN. The experimental results show that the proposed FCM-GM outperforms traditional FCM and k-NN. However, the classification models in [8], [9] were only evaluated using five datasets and one dataset from NASA MDP datasets, respectively.
Rong et al. [10] proposed a new method for software defect prediction using SVM and a bat algorithm with centroid strategy (CBA). CBA was used to optimize the parameters of SVM to enhance the accuracy of the prediction model. The central concept of the optimization algorithm involves treating SVM parameters as particles in CBA, which then undergoes self-updating until the algorithm achieves its final condition. The experimental results show that the proposed method outperforms other classifiers, including standard SVM. However, the proposed method was only evaluated with four datasets from the NASA MDP datasets.
Jayanthi et al. [11] proposed a method for SDP that uses an ANN and an enhanced version of PCA. Their improvement involves merging PCA with maximum likelihood in order to minimize the PCA reconstruction error. The experimental results reveal that the proposed approach surpasses other existing models, reaching an AUC of 97.20% and substantially improving classification accuracy. However, although the authors claim that their method achieves good performance, they did not explain how the number of principal components or the ANN architecture was chosen for each dataset.
Nevertheless, most studies using traditional machine learning models for SDP did not perform hyperparameter tuning to obtain the best classifier. Nor was the number of components used in PCA optimized to achieve the best performance of each model. Therefore, this research compares the performance of several traditional machine learning models by tuning the hyperparameters of both the classifier and PCA to obtain the best classification performance.

III. METHODS
This research involved multiple steps to conduct a comparative analysis of optimized traditional machine learning models. These steps consisted of gathering a dataset, oversampling the minority class, using PCA for dimensionality reduction, training various traditional machine learning models for classification with hyperparameter tuning, and evaluating the models, as illustrated in Figure 1.

A. Dataset
This research employed the dataset from the NASA Metrics Data Program (MDP) to evaluate the classification models. The NASA MDP is a dataset about software defects in different NASA projects. It includes information such as how many defects were found in each project, how big the code base is, and how much effort it took to build the software. Software engineers frequently use this dataset to examine how different software metrics relate to software defects. The MDP dataset comprises both public and confidential data. The former provides information on 24 NASA software projects, while the latter contains additional project information that is accessible only to authorized users [19]. This research used the cleaned version of the NASA MDP datasets from the D'' collection, as described in [16]. The collection used here comprises datasets from 12 projects, namely CM1, JM1, KC1, KC3, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. The number of features varies across datasets, but every dataset has the same two classes, namely defective Y and defective N. Table 1 shows the detailed description of the 12 datasets used in this research.

B. Oversampling Strategy
As can be seen in Table 1, the number of defective instances is significantly smaller than the number of non-defective instances in all datasets. This condition is known as the class imbalance problem. If it is not addressed, it will degrade the performance of the machine learning model. Therefore, this research applied an oversampling strategy to the minority class using the synthetic minority oversampling technique (SMOTE) [18]. SMOTE is a popular oversampling technique used in machine learning to address the class imbalance problem. SMOTE works by creating synthetic instances of the minority class by interpolating between existing minority class instances. Specifically, SMOTE selects a minority class instance and finds its k nearest neighbors in the feature space. It then creates a synthetic instance by randomly selecting one of the k neighbors and interpolating between the minority sample and the selected neighbor. This creates a new instance that is similar to the minority class but is not an exact copy of any existing instance.
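The interpolation step described above can be illustrated with a minimal plain-Python sketch. It is not the imbalanced-learn implementation used in this research; the function name `smote_sketch`, the toy minority samples, and the values of `k` and `n_new` are assumptions chosen only for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation.

    Illustrative only: a simplified stand-in for SMOTE, not the
    imbalanced-learn implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        # Random position on the line segment from base to neighbor
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical two-feature minority samples (not taken from NASA MDP)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing instances.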

C. Dimensionality Reduction
The presence of numerous features for training a machine learning model does not guarantee its good performance. Furthermore, a high number of features can also increase the amount of time and computational resources needed during the training process [20]. Therefore, this research employed principal component analysis (PCA) [15] to reduce the dimensionality of features. PCA is a commonly used technique in machine learning for dimensionality reduction. PCA is an unsupervised dimensionality reduction method that can reduce the high dimensionality of features into fewer significant and uncorrelated principal components while retaining the essential information of the original features [20]. PCA seeks to identify the most significant features or variables in a dataset and depict them in a space with fewer dimensions.
Feature dimensions are reduced by transforming the features into new variables that are not correlated with each other but can still explain as much of the variation in the original data as possible. These variables are called principal components. PCA uses the covariance matrix of the original data to determine the principal components and the amount of variance explained by each component by calculating the eigenvectors and the eigenvalues of the matrix, respectively. The dataset can be projected onto a lower n-dimensional space by choosing only the top n eigenvectors, where n is the desired number of dimensions (principal components). In this research, the value of $n$ was determined using a random search over $\{n \in \mathbb{N} \mid 5 \le n \le m\}$ such that the best performance of the model is achieved, where $m$ is the number of features. Before being input to the classifier, the selected components were scaled to the interval [0, 1] using equation (1) to avoid the dominance of certain components, where $x'$ is the scaled component, $x$ is the original component, and $\min(x)$ and $\max(x)$ are the minimum and maximum of the selected component, respectively.

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$
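As a rough illustration of the two steps above (projection onto the leading eigenvectors of the covariance matrix, followed by min-max scaling of each selected component), the following plain-Python sketch may help. It is an assumption-laden toy: the eigenvectors are approximated with power iteration and deflation rather than the full eigen-decomposition a library such as Scikit-learn would perform, and all function names and the sample data are hypothetical.

```python
import math
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, m = len(centered), len(centered[0])
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
            for i in range(m)]


def mat_vec(a, v):
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def top_eigenpairs(cov, n_components, iters=500, seed=0):
    """Approximate leading (eigenvalue, eigenvector) pairs.

    Assumption: power iteration with deflation is used here only to keep
    the sketch dependency-free; it is not the solver used in the paper.
    """
    rng = random.Random(seed)
    m = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(n_components):
        v = [rng.random() + 0.1 for _ in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            v = [x / norm for x in w]
        eigenvalue = sum(x * y for x, y in zip(v, mat_vec(work, v)))
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found from the matrix
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(m)]
                for i in range(m)]
    return pairs


def min_max_scale(column):
    """Equation (1): scale one component to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant component: assumed convention, map to 0
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]


def pca_then_scale(rows, n_components):
    centered = center(rows)
    pairs = top_eigenpairs(covariance(centered), n_components)
    columns = [[sum(c_j * v_j for c_j, v_j in zip(c, v)) for c in centered]
               for _, v in pairs]
    scaled_columns = [min_max_scale(col) for col in columns]
    return [list(row) for row in zip(*scaled_columns)], pairs


# Hypothetical data lying exactly on the line y = 2x
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scaled, pairs = pca_then_scale(rows, n_components=1)
```

In this toy case all variance lies along a single direction, so one component captures the data completely; on real datasets the number of retained components is a tunable quantity, which is why it is included in the random search.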

D. Classification
In this research, six traditional machine learning models were employed to classify 12 datasets from NASA MDP datasets into two classes: defective Y and defective N. The models used include k-nearest neighbors (k-NN), logistic regression (LR), decision tree (DT), linear discriminant analysis (LDA), support vector machine (SVM), and single hidden layer multi-layer perceptron (SHL-MLP). Every model has parameters that are not learned during training but are instead set by the user beforehand; these are called hyperparameters. The hyperparameters of a model must therefore be tuned to obtain optimal performance. This research employed random search [17] to determine the optimum hyperparameters for each model. Random search is a straightforward and efficient way to find the optimal value of a function. Several points are generated randomly within predetermined intervals, and the function value is evaluated at these points to determine the best point. This process is repeated until the stopping criterion is met. Random search is easy to implement because it requires little information about the function being optimized; in particular, it does not need function derivatives. The tuned hyperparameters and the search domains for each model are tabulated in Table 2.
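The random search procedure described above can be sketched in a few lines of plain Python. The search space, the toy objective, and the fixed iteration budget used as the stopping criterion are all assumptions made for illustration; they are not the domains listed in Table 2, and in practice the objective would be a validation score of the trained model.

```python
import random


def random_search(objective, space, n_iter=50, seed=0):
    """Sample n_iter random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter from its search domain
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical k-NN-style search space (NOT the domains from Table 2)
space = {
    "n_neighbors": lambda rng: rng.randint(1, 30),
    "weights": lambda rng: rng.choice(["uniform", "distance"]),
    "n_components": lambda rng: rng.randint(5, 20),
}


def toy_objective(params):
    """Stand-in for a validation accuracy; purely illustrative."""
    score = 1.0
    score -= 0.01 * abs(params["n_neighbors"] - 7)
    score -= 0.01 * abs(params["n_components"] - 12)
    if params["weights"] == "distance":
        score += 0.02
    return score


best_params, best_score = random_search(toy_objective, space, n_iter=200)
```

Note that the number of PCA components is sampled together with the classifier hyperparameters, mirroring the joint tuning described in this section. No derivative of the objective is needed; only its value at each sampled point.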

E. Model Evaluation
To evaluate the classification models, each dataset was divided into training data and testing data with a ratio of 70:30 using stratified random subsampling [22]. Stratified random subsampling ensured that every class was represented in both the training and testing data in the same proportion as in the original dataset. The main objective of employing this method was to ensure that the training and testing data reflected the dataset as a whole. The model was then trained on the training data and evaluated on the testing data. Several metrics were employed to evaluate the performance of the models on the testing data, including accuracy, precision, recall, and F1 score. Equation (2) was used to determine the classification accuracy, which is the proportion of correct predictions among all instances, where $TP$ is the number of instances that are actually positive and classified as positive, $TN$ is the number of instances that are actually negative and classified as negative, $FP$ is the number of instances that are actually negative but classified as positive, and $FN$ is the number of instances that are actually positive but classified as negative.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

Precision measures the proportion of true positive predictions out of all positive predictions and is calculated using equation (3).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

Recall measures the proportion of true positive predictions out of all actual positive instances and is calculated using equation (4).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{4}$$

Lastly, the F1 score is the harmonic mean of precision and recall and is calculated using equation (5).

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
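As a small worked example, the four metrics can be computed directly from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The counts below are hypothetical and are not taken from the result tables of this paper; returning 0 when a denominator is zero is an assumed convention.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a 100-instance test set
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and the F1 score is 80/95, which shows how a model can have noticeably different precision and recall at the same accuracy.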
In this research, all models were trained and evaluated using the Python programming language with several open source Python libraries, namely pandas [23], imbalanced-learn [24], and Scikit-learn [25]. Pandas was used to import the dataset from a source file. Imbalanced-learn was used to oversample the minority class using SMOTE. Scikit-learn was used for training and testing all machine learning models, as well as for tuning the hyperparameters.
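Independently of these libraries, the stratified 70:30 split used for model evaluation can be illustrated with a short plain-Python sketch. The function name `stratified_split`, the rounding rule for the per-class test size, and the toy labels are assumptions for illustration and do not reproduce the exact procedure of any library.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 80 non-defective (N), 20 defective (Y)
labels = ["N"] * 80 + ["Y"] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately keeps the share of defective instances the same in both parts (20% here), which a purely random split does not guarantee for a small minority class.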

IV. RESULTS AND DISCUSSION
This section presents the results of the experiments that aimed to compare the performance of various traditional machine learning models in predicting software defects. The accuracy of the optimized models on all datasets is shown in Table 3.
As can be seen in Table 3, k-NN achieved the highest accuracy on seven datasets: JM1 (77.91%), KC1 (79.31%), KC3 (91.58%), MC1 (98.97%), MC2 (83.67%), PC3 (93.46%), and PC5 (83.06%). In second place, SVM achieved the highest accuracy on four datasets: CM1 (97.66%), KC3 (91.58%), MW1 (98.53%), and PC4 (93.69%). In third place, SHL-MLP achieved the highest accuracy on three datasets: MC1 (98.97%), PC1 (96.12%), and PC3 (93.46%). Logistic regression and decision tree each achieved the highest accuracy on only one dataset, MC2 (83.67%) and PC2 (98.86%), respectively, while LDA never achieved the highest accuracy. Several of these best scores are ties, which is why some datasets appear under more than one model. Based on the accuracy scores in Table 3, it can be concluded that k-NN performs well across most datasets. Additionally, SVM and SHL-MLP also demonstrate competitive performance in terms of accuracy.
Table 4 shows the precision of the optimized models on all datasets. As can be seen in Table 4, k-NN achieved the highest precision on seven datasets: JM1 (78.51%), KC1 (79.70%), KC3 (91.64%), MC1 (98.99%), MC2 (83.86%), PC3 (93.70%), and PC5 (83.28%). In second place, SVM achieved the highest precision on four datasets: CM1 (97.68%), KC3 (91.64%), MW1 (98.57%), and PC4 (94.05%). In third place, SHL-MLP achieved the highest precision on two datasets: MC1 (98.99%) and PC1 (96.27%). The decision tree achieved the highest precision on only one dataset, PC2 (98.86%), while logistic regression and LDA never achieved the highest precision. Based on the precision values in Table 4, it can be concluded that k-NN achieves high precision across most datasets. These results indicate the effectiveness of k-NN in correctly identifying positive instances among all positive predictions. In addition, SVM and SHL-MLP also perform well on most of the datasets. In comparison, LR, DT, and LDA show relatively lower precision values.
Table 5 shows the recall of the optimized models on all datasets.
As can be seen in Table 5, k-NN most often achieves the highest recall, followed by SVM and SHL-MLP. Logistic regression, DT, and LDA generally show lower recall than k-NN, SVM, and SHL-MLP. These results indicate that k-NN is more effective in correctly identifying positive instances among all actual positive instances.
Table 6 shows the F1 score of the optimized models on all datasets. As can be seen in Table 6, k-NN generally performs well, while LR and LDA achieve lower scores. In addition, SVM and SHL-MLP demonstrate good performance on most datasets but are not consistently superior to the other algorithms. These results indicate that k-NN performs well in terms of both correctly identifying positive instances and capturing all positive instances.
The results presented in Tables 3-6 provide insights into the performance of various optimized traditional machine learning models for software defect prediction. It can be observed that k-NN, SVM, and SHL-MLP are the most effective algorithms for predicting software defects, based on the highest accuracy, precision, recall, and F1 scores achieved on several datasets. k-NN achieved the highest accuracy, precision, recall, and F1 score on seven datasets: JM1, KC1, KC3, MC1, MC2, PC3, and PC5. In some cases it tied with another model, for example with SVM on KC3 and with SHL-MLP on MC1 in both accuracy and precision. This suggests that k-NN is a suitable algorithm for software defect prediction on these datasets. SVM achieved the highest accuracy, precision, recall, and F1 score on four datasets: CM1, KC3, MW1, and PC4. SHL-MLP achieved the highest scores on three datasets (MC1, PC1, and PC3), although its precision was the highest on only two of them (MC1 and PC1). This suggests that SVM and SHL-MLP are other effective models for software defect prediction and can be used in scenarios where k-NN may not be the best choice.
It is also interesting to note that LDA never achieved the highest accuracy, precision, recall, or F1 score on any of the datasets, and logistic regression did so only once, tying k-NN for the highest accuracy on MC2. This suggests that these algorithms may not be the best choice for software defect prediction, at least in the context of the datasets used in this research.
Furthermore, the classification accuracy of most of the optimized machine learning models is considerably higher than that of the baseline models reported in [5], which do not employ hyperparameter tuning, as tabulated in Table 7. For example, the accuracy of the optimized k-NN ranges from 77.91% to 98.97%, whereas the accuracy achieved by k-NN in [5] ranges from 69.34% to 97.27%. Similarly, the optimized SVM achieved accuracy between 79.31% and 98.53%, compared with between 62.16% and 90.82% for the SVM reported in [5]. Therefore, on these datasets, the optimized models are generally more accurate than the baseline models.

V. CONCLUSION
This paper presents a comparative study of optimized machine learning models for software defect prediction on NASA MDP datasets. The hyperparameters of models were optimized using random search to obtain the best classifier for each dataset. Before input to the classifier, the dimensionality of features was reduced using PCA, and the number of selected components was also optimized using random search. The