Imbalanced Data in classification: A case study of credit scoring



MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH


Ho Chi Minh City - 2024


STATEMENT OF AUTHENTICATION

I certify that the Ph.D. dissertation, “Imbalanced data in classification: A case study of credit scoring”, is solely my own research.

This dissertation is used only for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH), and no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.

Ho Chi Minh City, April 2, 2024

ACKNOWLEDGEMENTS

First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Dr. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the process of conducting this Ph.D. dissertation.

I sincerely thank the teachers of the UEH doctoral training program for imparting valuable knowledge, and the teachers at the Department of Mathematics and Statistics, UEH, for their sincere comments on my dissertation.

I sincerely thank Dr. Le Thi Thanh An for her moral and academic support so that I could complete the research. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.

Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.

Ho Chi Minh City, April 2, 2024

TABLE OF CONTENTS

1.3 Research gap identifications
1.3.1 Gaps in credit scoring
1.3.2 Gaps in the approaches to solving imbalanced data
1.3.3 Gaps in Logistic regression with imbalanced data
1.4 Research objectives, research subjects, and research scopes


1.6 Contributions of the dissertation
1.7 Dissertation outline

2 LITERATURE REVIEW OF IMBALANCED DATA
2.1 Imbalanced data in classification
2.1.1 Description of imbalanced data
2.1.2 Obstacles in imbalanced classification
2.1.3 Categories of imbalanced data
2.2 Performance measures for imbalanced data
2.2.1 Performance measures for labeled outputs
2.2.1.1 Single metrics
2.2.1.2 Complex metrics
2.2.2 Performance measures for scored outputs
2.2.2.1 Area under the Receiver Operating Characteristic curve
2.3.3.1 Integration of algorithm-level method and ensemble classifier algorithm


2.3.3.2 Integration of data-level method and ensemble classifier algorithm
2.3.3.3 Comments on ensemble-based approach
2.3.4 Conclusions of approaches to imbalanced data
2.4 Credit scoring
2.4.1 Meaning of credit scoring
2.4.2 Inputs for credit scoring models
2.4.3 Interpretability of credit scoring models
2.4.4 Approaches to imbalanced data in credit scoring
2.4.5 Recent credit scoring ensemble models
2.5 Chapter summary

3 IMBALANCED DATA IN CREDIT SCORING
3.1 Classifiers for credit scoring
3.1.1.6 Support vector machine
3.1.1.7 Artificial neural network
3.1.2 Ensemble classifiers
3.1.2.1 Heterogeneous ensemble classifiers
3.1.2.2 Homogeneous ensemble classifiers
3.1.3 Conclusions of statistical models for credit scoring
3.2 The proposed credit scoring ensemble model based on Decision tree
3.2.1 The proposed algorithms
3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm
3.2.2 Empirical data sets


3.2.3 Computation process
3.2.4 Empirical results
3.2.4.1 The optimal Decision tree ensemble classifier
3.2.4.2 Performance of the proposed model on the Vietnamese data sets
3.2.4.3 Performance of the proposed model on the public data sets
3.2.4.4 Evaluations
3.2.5 Conclusions of the proposed credit scoring ensemble model based on Decision tree
3.3 The proposed algorithm for imbalanced and overlapping data
3.3.1 The proposed algorithms
3.3.1.1 Algorithm for dealing with noise, overlapping, and imbalanced data
3.3.1.2 Algorithm for constructing ensemble model
3.3.2 Empirical data sets


4.2.1 Prior correction
4.2.2 Weighted likelihood estimation (WLE)
4.2.3 Penalized likelihood regression (PLR)
4.3 The proposed works
4.3.1 The modification of the cross-validation procedure
4.3.2 The modification of Logistic regression
4.4.6 Important variables for output
4.4.6.1 Important variables for F-LLR fitted model
4.4.6.2 Important variables of the Vietnamese data set
4.5 Discussions and Conclusions

5.1.1 The interpretable credit scoring ensemble classifier
5.1.2 The technique for imbalanced data, noise, and overlapping samples


C.1 German credit data set (GER)
C.2 Vietnamese 1 data set (VN1)
C.3 Vietnamese 2 data set (VN2)
C.4 Taiwanese credit data set (TAI)
C.5 Bank personal loan data set (BANK)
C.6 Hepatitis C patients data set (HEPA)
C.7 The Loan schema data from lending club (US)
C.8 Vietnamese 3 data set (VN3)
C.9 Australian credit data set (AUS)
C.10 Credit risk data set (Credit 1)
C.11 Credit card data set (Credit 2)
C.12 Credit default data set (Credit 3)
C.13 Vietnamese 4 data set (VN4)


LIST OF ABBREVIATIONS

FN, FNR False negative, False negative rate

LIST OF FIGURES

2.7 Illustration of ROS technique
2.8 Illustration of SMOTE technique
2.9 Approaches to imbalanced data in classification
3.1 Illustration of a Decision tree
3.2 Illustration of a decision boundary of SVM
3.3 Illustration of a two-hidden-layer ANN
3.4 Importance level of features of the Vietnamese data sets
3.5 Computation protocol of the proposed ensemble classifier
4.1 Illustration of F-CV
4.2 Illustration of F-LLR


LIST OF TABLES

1.1 General implementation protocol in the dissertation
2.1 Confusion matrix
2.2 Representatives employing the algorithm-level approach to ID
2.3 Cost matrix in Cost-sensitive learning
2.4 Summary of SMOTE algorithm
2.5 Representatives employing the data-level approach to ID
2.6 Representatives employing the ensemble-based approach to ID
3.1 Representatives of classifiers in credit scoring
3.2 OUS(B) algorithm
3.3 DTE(B) algorithm
3.4 Description of empirical data sets
3.5 Computation protocol of empirical study on DTE
3.6 Performance measures of DTE(B) on the Vietnamese data sets
3.7 Performance of ensemble classifiers on the Vietnamese data sets
3.8 Performance of ensemble classifiers on the German data set
3.9 Performance of ensemble classifiers on the Taiwanese data set
3.10 TOUS(B) algorithm
3.11 TOUS-F(B) algorithm
3.12 Description of empirical data sets
3.13 Average testing AUC of the proposed ensembles
3.14 Average testing AUC of the models based on LLR
3.15 Average testing AUC of the tree-based ensemble classifiers
4.1 Cross-validation procedure for Lasso Logistic regression
4.2 F-measure-oriented Cross-Validation Procedure
4.3 Algorithm for F-LLR classifier
4.4 Description of empirical data sets


4.5 Implementation protocol of empirical study
4.6 Average testing performance measures of classifiers
4.7 Average testing performance measures of classifiers (cont.)
4.8 The number of wins of F-LLR on empirical data sets
4.9 Important features of the Vietnamese data set
4.10 Important features of the Vietnamese data set (cont.)
B.1 Algorithm of Bagging classifier
B.2 Algorithm of Random Forest
B.3 Algorithm of AdaBoost
C.1 Summary of the German credit data set
C.2 Summary of the Vietnamese 1 data set
C.3 Summary of the Vietnamese 2 data set
C.4 Summary of the Taiwanese credit data set (a)
C.5 Summary of the Taiwanese credit data set (b)
C.6 Summary of the Bank personal loan data set
C.7 Summary of the Hepatitis C patients data set
C.8 Summary of the Loan schema data from lending club (a)
C.9 Summary of the Loan schema data from lending club (b)
C.10 Summary of the Loan schema data from lending club (c)
C.11 Summary of the Vietnamese 3 data set
C.12 Summary of the Australian credit data set
C.13 Summary of the Credit 1 data set
C.14 Summary of the Credit 2 data set
C.15 Summary of the Credit 3 data set
C.16 Summary of the Vietnamese 4 data set

ABSTRACT

In classification, imbalanced data occurs when there is a great difference in the quantities of the classes of the training data set. This problem frequently arises in various fields, for example, credit scoring and medical diagnosis. With imbalanced data, predictive modeling for real-world applications poses a challenge because most machine learning algorithms are designed for balanced data sets. Therefore, addressing imbalanced data has attracted much attention from researchers and practitioners.

In this dissertation, we propose solutions for imbalanced classification. Furthermore, these solutions are applied to a credit scoring case study. The solutions are derived from three papers published in scientific journals.

• The first paper presents an interpretable decision tree ensemble model for imbalanced credit scoring data sets.

• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.

• The final paper proposes a modification of Logistic regression focusing on the optimization of the F-measure, a popular metric in imbalanced classification.

These classifiers have been trained on a range of public and private data sets with highly imbalanced status and overlapping classes. The primary results demonstrate that the proposed works outperform both traditional and some recent models.


TÓM TẮT (ABSTRACT IN VIETNAMESE)

In classification problems, imbalanced data occurs when the classes of the training set differ considerably in the number of samples. This problem is common in many fields, such as credit scoring and medical diagnosis. With imbalanced data, building predictive models for real-world applications poses a great challenge because most machine learning algorithms are designed for balanced data. Therefore, handling imbalanced data in classification has been attracting much attention from researchers and practitioners.

In this dissertation, we propose several solutions for classification with imbalanced data. These solutions are applied to a case study of credit scoring. The new results of the dissertation are drawn from three papers published in scientific journals, including:

• The first paper proposes an interpretable model, an ensemble of decision tree models, applied to credit scoring.

• The second paper introduces a novel technique for imbalanced data, especially in cases of overlapping classes and noise.

• The third paper proposes a modification of the Logistic regression model. This modification focuses on maximizing the F-measure, a performance metric popular in imbalanced classification problems.

These classification models were tested on public and private data sets with imbalanced and overlapping classes. The results demonstrate that our models outperform both traditional and recently proposed models.


Chapter 1

1.1 Overview of imbalanced data in classification

Nowadays, classification plays a crucial role in several fields, for example, medicine (cancer diagnosis), finance (fraud detection), business administration (customer churn prediction), information retrieval (oil spill tracking, telecommunication fraud), image identification (face recognition), and so on. Classification is the problem of predicting a class label for a given sample. On training data sets that comprise samples with different label types, classification algorithms learn samples' features to recognize the labels' patterns. After that, these patterns, now represented as a fitted classification model, are used to predict the labels of new samples.

Classification is categorized into two types, binary and multi-classification. Binary classification, which is the basic type, focuses on two-class label problems. In contrast, multi-classification solves tasks with several class labels. Multi-classification is sometimes reduced to binary classification with two classes: one class corresponding to the label of concern, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive class is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.

Definition 1.1.1. Let S = X × Y be the set of samples, where X ⊂ R^k is the domain of samples' features and Y = {0, 1} is the set of labels.

The subset of samples labeled 1 is called the positive class, denoted S+. The remaining subset is called the negative class, denoted S−. A sample s ∈ S+ is called a positive sample; otherwise, it is called a negative sample.


Definition 1.1.2. A binary classifier is a function f mapping the domain of features X to the set of labels {0, 1}.

With a given sample s0 = (x0, y0) ∈ S, there are four possibilities as follows:

• If f(x0) = y0 = 1, s0 is called a true positive sample.
• If f(x0) = y0 = 0, s0 is called a true negative sample.
• If f(x0) = 1 and y0 = 0, s0 is called a false positive sample.
• If f(x0) = 0 and y0 = 1, s0 is called a false negative sample.

The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.

Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).

Accuracy = (TP + TN) / (TP + TN + FP + FN),
TPR = TP / (TP + FN), TNR = TN / (TN + FP),
FPR = FP / (FP + TN), FNR = FN / (FN + TP).
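As a small illustration, these criteria can be computed directly from the four counts; the sketch below (function names and sample values are ours, purely illustrative) mirrors the standard formulas:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def rates(tp, tn, fp, fn):
    """Accuracy and the four rates defined above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)  # true positive rate (sensitivity, recall)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return accuracy, tpr, tnr, fpr, fnr

# Toy example: 3 positive and 3 negative samples
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
# tp=2, tn=2, fp=1, fn=1
accuracy, tpr, tnr, fpr, fnr = rates(tp, tn, fp, fn)
```

Note that TPR and FNR (likewise TNR and FPR) always sum to one, so reporting one of each pair suffices.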


In many application domains where the positive and negative classes are balanced, accuracy is the first target of classifiers. However, the class of interest (the positive class) sometimes consists of unusual or rare events. The number of samples in the positive class is then too small for classifiers to recognize the positive patterns. In such situations, if classifiers make mistakes in the positive class, the cost of loss is very heavy. Therefore, accuracy is no longer the most important performance criterion; instead, criteria related to TP, such as the TPR, take its place.

For example, in fraud detection, the customers are divided into “bad” and “good” classes. Since the credit regulations are made public and the customers have preliminarily been screened before applying for a loan, a credit data set often includes a majority class of good customers and a minority class of bad ones. The loss of misclassifying the “bad” as “good” is often far greater than the loss of misclassifying the “good” as “bad”. Hence, identifying the bad customers is often considered the more crucial task. Consider a list of credit customers consisting of 95% good and 5% bad. If pursuing a high accuracy, we can choose a trivial classifier assigning the good label to all customers. The accuracy of this classifier is 95%, but its TPR is 0%. In other words, this classifier is unable to identify bad customers. Instead, another classifier with a lower accuracy but a greater TPR should be preferred over this trivial classifier.
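The trivial-classifier arithmetic can be checked numerically; a minimal sketch with a synthetic list of 95 good (label 0) and 5 bad (label 1) customers:

```python
# 95 good customers (negative class, label 0) and 5 bad customers (positive, label 1)
y_true = [0] * 95 + [1] * 5

# Trivial classifier: predict "good" (0) for every customer
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
tpr = tp / (tp + fn)

print(accuracy)  # 0.95
print(tpr)       # 0.0
```

High accuracy coexists with a TPR of zero, which is exactly why accuracy alone is misleading here.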

Another example of rare-event classification is cancer diagnosis. In this case, the data set has two classes, “malignant” and “benign”. The number of malignant patients is always much smaller than that of benign ones. However, malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.

The phenomenon of skew distribution in training data sets for classification is known as imbalanced data. Let S+ and S− denote the positive and negative classes, respectively. If the quantity of S+ is far less than that of S−, S is called an imbalanced data set. Besides, the imbalanced ratio (IR) of S is defined as the ratio of the quantities of the negative and positive classes:

IR = |S−| / |S+|.
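Continuing the 95/5 credit example above, the IR would be computed as follows (a trivial sketch, variable names ours):

```python
n_negative = 95  # |S-|, good customers (majority class)
n_positive = 5   # |S+|, bad customers (minority class)

# Imbalanced ratio: quantity of the negative class over the positive class
ir = n_negative / n_positive
print(ir)  # 19.0
```

An IR well above 1 signals an imbalanced data set; the larger the IR, the more severe the imbalance.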

When a training data set is imbalanced, simple classifiers usually have a very high accuracy but a low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by type I and type II errors (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class (the positive class) is usually ignored, since common classifiers often treat it as noise or outliers. Hence, the target of recognizing the patterns of the positive class fails, although identifying the positive samples is often the crucial task of imbalanced classification. Therefore, imbalanced data is a challenge in classification.

Besides, experimental studies showed that as the imbalanced ratio increases, the overall model performance decreases (Brown & Mues, 2012). Furthermore, some authors stated that imbalanced data is not the only reason for poor performance: noise and overlapping samples also degrade the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of their data sets to handle them correctly.

A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance-sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including on-balance-sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020.¹ Although bad customers account for a very small part of the credit customers, the consequences of bad debt for the bank are extremely heavy. In countries where most economic activities rely on the banking system, an increase in the bad debt ratio may not only threaten the operation of the banking system but also push the economy toward a series of collapses. Therefore, it is important to identify the bad customers in credit scoring.

In Vietnam, the credit market is tightly controlled by regulations of the State Bank. Commercial banks now consciously manage credit risk by strictly applying credit appraisal processes before funding. In the field of academic research, credit scoring has attracted many authors (Bình & Anh, 2021; Hưng & Trang, 2018; Quỳnh, Anh, & Linh, 2018; Thắng, 2022). However, few works have addressed the imbalanced data issue (Mỹ, 2021).

¹ https://sbv.gov.vn/webcenter/portal/vi/links/cm255?dDocName=SBV489213


These facts prompted us to study imbalanced classification deeply. The dissertation, titled “Imbalanced data in classification: A case study of credit scoring”, aims to find suitable solutions for imbalanced data and related issues, with a particular case study of credit scoring in Vietnam.

1.3 Research gap identifications

1.3.1 Gaps in credit scoring

In the dissertation, we choose credit scoring as a case study of imbalanced classification.

Credit scoring is an arithmetical representation based on the analysis of the creditworthiness of customers (Louzada, Ara, & Fernandes, 2016). Credit scoring provides valuable information to banks and financial institutions, not only to hedge credit risk but also to standardize regulations on credit management. Therefore, credit scoring classifiers have to meet two significant requirements:

i) The ability to accurately classify the bad customers;

ii) The ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been addressed through the development of methods to improve the performance of credit scoring models. They include traditional statistical methods (K-nearest neighbors, Discriminant analysis, and Logistic regression) and popular machine learning models (Decision tree, Artificial neural network, and Support vector machine) (Baesens et al., 2003; Brown & Mues, 2012; Louzada et al., 2016). These are called single classifiers. The effectiveness of a single classifier is not consistent across data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, according to Baesens et al. (2003), Support vector machine was better than Logistic regression, while Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an insignificant difference among Support vector machine, Logistic regression, and Linear discriminant analysis. In summary, empirical credit scoring studies lead to the important conclusion that there is no single classifier that is best for all data sets.

With the development of computational software and programming languages, there has been a shift from single classifiers to ensemble ones. The term “ensemble classifier” or “ensemble model” refers to a combination of multiple classifier algorithms. Ensemble models work by leveraging the collective power of multiple sub-classifiers for decision-making. In the credit scoring literature, empirical studies concluded that ensemble models had superior performance to single ones (Brown & Mues, 2012; Dastile, Celik, & Potsane, 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.
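The idea of leveraging the collective power of sub-classifiers can be illustrated with a simple majority-vote scheme (a generic sketch of ensemble combination, not the dissertation's proposed ensemble):

```python
def majority_vote(predictions):
    """Combine 0/1 label predictions from several sub-classifiers by majority vote.

    predictions: a list of lists, one inner list of labels per sub-classifier,
    all of the same length.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = sum(clf_preds[i] for clf_preds in predictions)
        # Predict 1 when more than half of the sub-classifiers vote 1
        combined.append(1 if votes > len(predictions) / 2 else 0)
    return combined

# Three sub-classifiers, four samples
preds = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
print(majority_vote(preds))  # [1, 1, 1, 0]
```

Each sub-classifier errs on a different sample, yet the combined vote can still recover the right label; this is the intuition behind the superior performance of ensembles.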

While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons for the classification results, which form the framework for assessing, managing, and hedging credit risk. For example, customers' features are nowadays collected into empirical data sets more and more diversely, but not all of them are useful for credit scoring. Administrators need to know which inputs of the classification model influence the likelihood of default in order to set transparent credit standards. There is usually a trade-off between the effectiveness and transparency of classifiers (Brown & Mues, 2012). As performance measures increase, explaining the predicted results becomes more difficult. For example, single classifiers such as Discriminant analysis, Logistic regression, and Decision tree are interpretable, but they usually work far less effectively than Support vector machine and Artificial neural network, which are representatives of “black box” classifiers. Another case is ensemble classifiers: most of them operate in an incomprehensible process although they have outstanding performance. Even for popular ensemble classifiers such as Bagging Tree, Random Forest, or AdaBoost, which do not have very complicated structures, their interpretability is rarely discussed. According to Dastile et al. (2020), in the credit scoring application, only 8% of studies proposed new models with a discussion of interpretability.

Therefore, building a credit-scoring ensemble classifier that satisfies both requirements is an essential task.

In Vietnam, credit data sets usually suffer from imbalance, noise, and overlapping issues. Although the economy is under the influence of the digital transformation process and credit scoring models have developed rapidly, Vietnamese commercial banks still apply traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016), Support vector machine (Nhâm, 2021), Random forest (Ha, Nguyen, & Nguyen, 2016), and ensemble models (Luu & Hung, 2021). The aim of these studies is to support the application of advanced methods in credit scoring, but they are not concerned with the imbalanced issue and interpretability. Very few studies dealt with the imbalance issue (Mỹ, 2021; Toàn, Lịch, Hương, & Thọ, 2017). However, these works only addressed imbalanced data and ignored noise and overlapping samples.

In summary, it is necessary to build a credit-scoring ensemble classifier that can tackle imbalanced data and related issues such as noise and overlapping samples to raise the performance measures, especially on Vietnamese data sets. Furthermore, the proposed model should point out the important features for predicting credit risk status.

1.3.2 Gaps in the approaches to solving imbalanced data

There are three popular approaches to imbalanced classification in the literature: the algorithm-level, data-level, and ensemble-based approaches (Galar et al., 2011).

The algorithm-level approach solves imbalanced data by modifying classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge of the intrinsic classifiers, which users usually lack. In addition, designing specific corrections or modifications for given classifier algorithms makes this approach not versatile. A representative of the algorithm-level approach is the Cost-sensitive learning method, which imposes or corrects the costs of loss upon misclassifications and requires minimizing the total loss of the classification process (Xiao, Xie, He, & Jiang, 2012; Xiao et al., 2020). However, the values of the costs of losses are usually assigned at the researchers' discretion. In short, the algorithm-level approach is inflexible and unwieldy.

The data-level approach balances training data sets by applying re-sampling techniques, which belong to three main groups: over-sampling, under-sampling, and hybrids of over- and under-sampling. Over-sampling techniques increase the quantity of the minority class, while under-sampling techniques decrease that of the majority class. This approach is easy to implement and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to a poor classification model. For instance, random over-sampling techniques increase the computation time and may replicate noise and overlapping samples, thus probably leading to an over-fitted classification model. Some hierarchical methods of over-sampling can cause other problems; for example, the Synthetic Minority Over-sampling Technique (SMOTE) can exacerbate the overlapping issue. In contrast, under-sampling techniques may miss useful information about the majority class, especially on severely imbalanced data (Baesens et al., 2003; Sun, Lang, Fujita, & Li, 2018).
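The two basic re-sampling directions can be sketched as follows (a generic illustration of random over- and under-sampling only; SMOTE and the dissertation's algorithms are not implemented here, and all names are ours):

```python
import random

def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority samples until both classes have equal size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Randomly drop majority samples until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy imbalanced data: 95 negative samples and 5 positive samples
majority = [(x, 0) for x in range(95)]  # (feature, label 0)
minority = [(x, 1) for x in range(5)]   # (feature, label 1)

maj_over, min_over = random_oversample(majority, minority)
# both classes now have 95 samples (duplicates appear in the minority class)

maj_under, min_under = random_undersample(majority, minority)
# both classes now have 5 samples (most majority samples are discarded)
```

The trade-off described above is visible directly: over-sampling keeps all information but duplicates minority points (possibly noise), while under-sampling discards 90 of the 95 majority samples.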

The third is the ensemble-based approach, which integrates ensemble classifier algorithms with algorithm-level or data-level approaches. This approach exploits the advantage of ensemble classifiers to improve the performance criteria. The ensemble-based approach seems to be the trend in dealing with imbalanced data (Abdoli, Akbari, & Shahrabi, 2023; Shen, Zhao, Kou, & Alsaadi, 2021; Yang, Qiao, Huang, Wang, & Wang, 2021; Zhang, Yang, & Zhang, 2021). However, the ensemble-based approach often produces complex models whose results are too difficult to interpret. This is a concern that must be fully addressed.


In summary, although there are many methods for imbalanced classification, each of them has drawbacks. Some hybrid methods are complex and inaccessible. Moreover, there are very few studies dealing with imbalance together with noise and overlapping samples. With the available studies, on some data sets, the methods do not raise the performance measures as high as expected. Hence arises the idea of a new algorithm that can deal with imbalance, noise, and overlapping to increase the performance measures on the positive class.

1.3.3 Gaps in Logistic regression with imbalanced data

Logistic regression (LR) is one of the most popular single classifiers, especially in credit scoring (Onay & Öztürk, 2018). LR provides an understandable output, namely the conditional probability of belonging to the positive class. This probability is the reference for predicting the sample's label by comparing it with a given threshold: the sample is classified into the positive class if and only if its conditional probability is greater than this threshold. This characteristic of LR can be extended to multi-classification. Besides, the computation process of LR, which employs the maximum likelihood estimator, is quite simple. It does not take much time since there are several available packages in statistical software and programming languages. Furthermore, LR can show the impact of predictors on the output by evaluating the statistical significance of the parameters corresponding to the predictors. In other words, LR provides an interpretable and affordable model.
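The probability-plus-threshold mechanism can be made concrete with a small sketch (the coefficient values are illustrative, not from any fitted model):

```python
import math

def logistic_prob(x, beta0, beta):
    """Conditional probability P(y = 1 | x) under an LR model with
    intercept beta0 and coefficient vector beta."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, beta0, beta, threshold=0.5):
    # Positive label iff the conditional probability exceeds the threshold
    return 1 if logistic_prob(x, beta0, beta) > threshold else 0

# Illustrative coefficients for two predictors
beta0, beta = -1.0, [0.8, 0.5]

p = logistic_prob([2.0, 1.0], beta0, beta)  # sigmoid(-1 + 1.6 + 0.5) = sigmoid(1.1)
label = classify([2.0, 1.0], beta0, beta)   # 1, since p > 0.5
```

Lowering the threshold below 0.5 is one simple way to make such a model more sensitive to the positive class, which foreshadows the threshold-related issues on imbalanced data discussed next.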

However, LR is ineffective on imbalanced data sets (Firth, 1993; King & Zeng, 2001); specifically, the conditional probability of positive samples is underestimated. Therefore, positive samples are likely to be misclassified. Besides, the statistical significance of predictors is usually assessed by the parameter testing procedure, which uses the p-value criterion as a framework. Meanwhile, the p-value has recently been criticized in the statistical community because of its widespread misinterpretation (Goodman, 2008). These issues limit the application fields of LR although it has several advantages.

Trang 31

There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them belong to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some PLR methods are too sensitive to initial values in the computation process of the maximum likelihood estimation. Furthermore, some PLR methods address only the biased parameter estimates, not the biased conditional probability (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature on LR with imbalanced data. Such hybrid methods can exploit the advantages of each individual method and directly solve imbalanced data for LR.
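As an illustration of the effort these methods demand, a prior correction of the intercept in the spirit of King & Zeng (2001) can be sketched as below; it assumes the population positive rate tau is known, which, as noted above, is rarely the case in practice (Python sketch with hypothetical values):

```python
import math

def prior_correct_intercept(beta0_hat, tau, ybar):
    # Prior correction in the spirit of King & Zeng (2001): shift the fitted
    # intercept so predicted probabilities reflect the population positive
    # rate tau instead of the sample positive rate ybar.
    return beta0_hat - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# Hypothetical values: 5% positives in the population, 50% in the (balanced) sample.
b0 = prior_correct_intercept(beta0_hat=0.3, tau=0.05, ybar=0.5)
```

The corrected intercept is pushed downward here because positives are much rarer in the population than in the training sample.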

In summary, LR for imbalanced data needs a modified computation process that combines data-level and algorithm-level approaches. The modification can deal with imbalanced data and still retain the ability to show the impacts of the predictors on the response without the p-value criterion.

1.4 Research objectives, research subjects, and research scopes

1.4.1 Research objectives

In this dissertation, we aim to achieve the following objectives.

The first objective is to propose a new ensemble classifier that satisfies two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform traditional classification models and popular balancing methods such as the Bagging tree, Random forest, and AdaBoost combined with random over-sampling (ROS), random under-sampling (RUS), SMOTE, and Adaptive synthetic sampling (ADASYN). Furthermore, the proposed model can identify the significance of input features in predicting the credit risk status.

The second objective is to propose a novel technique to address the challenges of imbalanced data, noise, and overlapping samples. This technique leverages the strengths of re-sampling methods and ensemble models to tackle these critical issues in classification. Subsequently, this technique can be applied to credit scoring and other imbalanced classification applications, for example, medical diagnosis.

The final objective is to modify the computation process of Logistic regression to address imbalanced data and mitigate the issue of overlapping samples. This modification directly targets the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balancing methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.

1.4.2 Research subjects

This dissertation investigates the phenomenon of imbalanced data and related issues such as noise and overlapping samples in classification. We examine various balancing methods, encompassing algorithm-level, data-level, and ensemble-based approaches, in a case study of credit scoring. Within these approaches, the data-level and ensemble-based ones receive more attention than the algorithm-level one. Additionally, Lasso-Logistic regression, a penalized version of Logistic regression, is studied in two application contexts: as the base learner of an ensemble classifier and as an individual classifier.

1.4.3 Research scopes

The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques, such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and the Neighborhood Cleaning Rule, are investigated in this study. In addition, popular performance criteria suitable for imbalanced classification, such as AUC (Area Under the Receiver Operating Characteristic Curve), KS (Kolmogorov-Smirnov statistic), F-measure, G-mean, and H-measure, are used to evaluate the effectiveness of the considered classifiers.
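Two of these criteria, the F-measure and the G-mean, can be computed directly from confusion-matrix counts; a minimal Python sketch with hypothetical counts (the dissertation's own computations use R):

```python
import math

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall on the positive class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fp, fn, tn):
    # Geometric mean of the per-class accuracies (sensitivity and specificity).
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical confusion counts from an imbalanced test set.
f1 = f_measure(tp=30, fp=20, fn=10)
gm = g_mean(tp=30, fp=20, fn=10, tn=940)
```

Both measures stay low whenever the positive class is poorly recognized, regardless of how many negatives are classified correctly, which is why they suit imbalanced classification.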

1.5 Research data and research methods

1.5.1 Research data

The case study of credit scoring uses six secondary data sets. Three of them are from the UCI machine learning repository: the German, Taiwan, and Bank personal loan data sets. These data sets are very popular in credit scoring research and are used as benchmarks in the literature. Besides, three private data sets are collected from commercial banks in Vietnam. All the Vietnamese data sets are highly imbalanced, at different levels. Furthermore, to justify the ability of the proposed works to improve the performance measures, the empirical study uses one data set from the medical field, the Hepatitis data set, which is available in the UCI machine learning repository.

The case study of Logistic regression employs nine data sets. Four of them, the German, Taiwanese, Bank personal loan, and Hepatitis data sets, are also used in the case study of credit scoring. The others are easy to access through the Kaggle website and the UCI machine learning repository.

1.5.2 Research methods

The dissertation applies the quantitative research method to clarify the effectiveness of the proposed works: the credit-scoring ensemble classifier, the algorithm for balancing and overlap-free data, and the modification of Logistic regression.

The general implementation protocol of the proposed works follows the steps in Table 1.1. This implementation protocol is applied in all computation processes in the dissertation. However, in each case, the content of Step 2 may vary in some ways. The computation processes are conducted in the programming language R, which has been widely used in the machine learning community.

Table 1.1: General implementation protocol in the dissertation

Steps  Contents
2      Constructing the new model with different hyper-parameters to find the optimal model on the training data.
…      … and classifier algorithms on the same training data.
…      … testing data, then calculating their performance measures.

1.6 Contributions of the dissertation

The dissertation contributes three methods to the literature on credit scoring and imbalanced classification. The proposed methods were published in three articles:

(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol. 45, No. 6, 10853–10864, 2023.

(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision, and Control, Vol. 429, 595–612, 2022, Springer.

(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.

Regarding the literature on credit scoring, the dissertation suggests an interpretable ensemble classifier which can address imbalanced data. The proposed model, which uses the Decision tree as its base learner, has more specific advantages than the popular approaches, namely higher performance measures and interpretability. The proposed model corresponds to the first article.


Regarding the literature on imbalanced data, the dissertation proposes a method for balancing, de-noising, and removing overlapping samples thanks to the ensemble-based approach. This method outperforms the integration of re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and the Neighborhood Cleaning Rule) with popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article.

Regarding the literature on Logistic regression, the dissertation provides a modification to its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the importance level of input features without using the p-value. This modification is in the third article.

1.7 Dissertation outline

The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.

• Chapter 1 Introduction

• Chapter 2 Literature review of imbalanced data

• Chapter 3 Imbalanced data in credit scoring

• Chapter 4 A modification of Logistic regression with imbalanced data

• Chapter 5 Conclusions

Chapter 1 is the introduction, which briefly introduces the contents of the dissertation. This chapter presents an overview of imbalanced data in classification. Besides, the other contents are the motivations, research gap identifications, objectives, subjects, scopes, data, methods, contributions, and the dissertation outline.

Chapter 2 is the literature review on imbalanced data in classification. This chapter provides the definition, obstacles, and related issues of imbalanced data, for example, overlapping classes. Besides, this chapter presents in depth the performance measures for imbalanced data. The most important section is the review of approaches to imbalanced data, including the algorithm-level, data-level, and ensemble-based ones. Chapter 2 also examines the basic background and recently proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods. That is the framework for developing the new balancing methods in the dissertation.

Chapter 3 is the case study of imbalanced classification, credit scoring. This chapter is based on the main contents of the first and second articles referred to in Section 1.6. We propose an ensemble classifier that can address imbalanced data and provide the importance level of predictors. Furthermore, we innovate the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. Empirical studies are conducted to verify the effectiveness of the proposed algorithms.

Chapter 4 is another study on imbalanced data, related to Logistic regression. This chapter proposes a modification of the inner and outer parts of the computation process of Logistic regression. The inner part is a change in the performance criterion used to estimate the score, and the outer part is a selective application of re-sampling techniques to re-balance the training data. The experiments are conducted on nine data sets to verify the performance of the modification. Chapter 4 corresponds to the third article referred to in Section 1.6.

Chapter 5 is the conclusion, which summarizes the dissertation, discusses the applications of the proposed works, and refers to some further studies.


Chapter 2

LITERATURE REVIEW OF IMBALANCED DATA

2.1 Imbalanced data in classification

2.1.1 Description of imbalanced data

According to Definition 1.1.4, any data set with a skewed quantity of samples in two classes is technically imbalanced data (ID). In other words, any two-class data set with an imbalance ratio (IR) greater than one is considered ID. There is no conventional IR threshold for concluding that a data set is imbalanced. Most authors simply define ID as a data set in which one class has a much greater (or lower) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the class of interest has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.
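Under this description, the IR itself is a one-line computation; a minimal Python sketch on a hypothetical label vector:

```python
def imbalance_ratio(labels):
    # IR = (#majority-class samples) / (#minority-class samples); IR > 1 means imbalance.
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return max(pos, neg) / min(pos, neg)

# A 90/10 split gives IR = 9.
ir = imbalance_ratio([1] * 10 + [0] * 90)
```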

2.1.2 Obstacles in imbalanced classification

In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to obtain the highest global accuracy, with very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to the ignorance of minority samples (they may be treated as noise) in favor of the more general patterns of the majority class. As a consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.
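The accuracy bias can be illustrated with a toy sketch: on a hypothetical data set with 5% positives, a degenerate classifier that always predicts the majority class attains 95% accuracy while identifying no positive sample at all.

```python
# Toy data set: 5 positives among 100 samples.
labels = [1] * 5 + [0] * 95

# A degenerate classifier that always predicts the majority (negative) class.
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
positive_recall = sum(1 for p, y in zip(predictions, labels) if y == 1 and p == 1) / 5

# Global accuracy is 0.95, yet not a single positive sample is identified.
```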

The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, which was the proportion of the performance difference between ID and the balanced data, became significant when the IR was 90/10 or greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of the IR.

In short, IR is the factor that reduces the effectiveness of standard classifiers.

2.1.3 Categories of imbalanced data

In real applications, combinations of ID and other phenomena make classification processes more difficult. Some authors even claim that ID is not the only main reason for poor performance; overlapping, small sample size, small disjuncts, and borderline, rare, and outlier samples are also causes of the low effectiveness of popular classifier algorithms (Batista et al., 2004; Fernández et al., 2018; Napierala & Stefanowski, 2016; Sun et al., 2009).

• Overlapping or class separability (Fig. 2.1b) is the phenomenon of an unclear decision boundary between two classes. It also means that some samples of the two classes are blended. On data sets with overlapping, standard classifier algorithms such as the Decision tree, Support vector machine, or K-nearest neighbors become harder to perform well. Batista et al. (2004) stated that the IR was less important than the degree of overlap between classes. Similarly, Fernández et al. (2018) believed that any simple classifier algorithm could perform classification independently of the IR in the case of no overlapping.

• Small sample size: Learning algorithms need a sufficient number of samples to generalize the rules for discriminating classes. Without large training sets, a classifier not only cannot generalize the characteristics of the data but can also produce an over-fitting model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that, for a fixed IR, the more samples of the minority class, the lower the error rate of classifiers.

Figure 2.1: Examples of circumstances of imbalanced data.

Source: Galar et al. (2011)

• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space. Therefore, small disjuncts provide classifiers with a smaller number of positive samples than large disjuncts. In other words, small disjuncts cover rare samples that are too hard to find in the data sets, and learning algorithms often ignore rare samples to set general classification rules. This leads to a higher error rate on small disjuncts (Prati, Batista, & Monard, 2004; Weiss, 2009).

• The characteristics of positive samples, such as borderline, rare, and outlier samples, affect the performance of standard classifiers. In fact, borderline samples are always difficult to recognize. In addition, rare samples and outliers are extremely hard to identify. According to Napierala and Stefanowski (2016) and Van Hulse and Khoshgoftaar (2009), an imbalanced data set with many borderline, rare, or outlier samples makes standard classifiers less efficient.

In summary, studying ID should pay attention to related issues such as overlapping, small sample size, small disjuncts, and the characteristics of the positive samples.


2.2 Performance measures for imbalanced data

The quality of a classifier is evaluated by inspecting how effectively it performs on testing data. This means the outputs of the classifier are compared with the true labels of the testing data, which are hidden during the construction of the classifier. There are two types of outputs: labeled and scored. Depending on the type, certain metrics are used to analyze the performance of classifiers. In ID, there are some notes on the choice of performance measures.

2.2.1 Performance measures for labeled outputs

Most learning algorithms produce labeled outputs, for example, K-nearest neighbors, the Decision tree, Decision-tree-based ensemble classifiers, and so on. A convenient way to present the performance of labeled-output classifiers is a cross-tabulation between actual and predicted labels, known as the confusion matrix.

Table 2.1: Confusion matrix

                 Predicted positive  Predicted negative  Total
Actual positive  TP                  FN                  POS
Actual negative  FP                  TN                  NEG
Total            PPOS                PNEG                N

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of actual positive and negative samples in the testing data, respectively; PPOS and PNEG are the numbers of predicted positive and negative samples, respectively; and N is the total number of samples.

From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types: single and complex metrics.

2.2.1.1 Single metrics

The most popular single metric is accuracy, or its complement, the error rate. Accuracy is the proportion of correct outputs, and the error rate is the proportion of incorrect ones. Therefore, the higher the accuracy (or the lower the error rate), the better the classifier.
