Imbalanced Data in classification: A case study of credit scoring



MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY


Ho Chi Minh City - 2024


STATEMENT OF AUTHENTICATION

I certify that the Ph.D. dissertation, “Imbalanced data in classification: A case study of credit scoring”, is solely my own research.

This dissertation is only used for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH), and no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.

Ho Chi Minh City, April 2, 2024


First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Dr. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the research process. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.

Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.

Ho Chi Minh City, April 2, 2024


1.3.2 Gaps in the approaches to solving imbalanced data

1.3.3 Gaps in Logistic regression with imbalanced data

1.4 Research objectives, research subjects, and research scopes


2.3.3.2 Integration of data-level method and ensemble

3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm

3.2.2 Empirical data sets


4.2.1 Prior correction

4.2.2 Weighted likelihood estimation (WLE)

4.2.3 Penalized likelihood regression (PLR)

4.3 The proposed works

4.3.1 The modification of the cross-validation procedure

4.3.2 The modification of Logistic regression

4.4.6 Important variables for output

4.4.6.1 Important variables for F-LLR fitted model

4.4.6.2 Important variables of the Vietnamese data set

5.1.1 The interpretable credit scoring ensemble classifier

5.1.2 The technique for imbalanced data, noise, and overlapping samples


C.5 Bank personal loan data set (BANK)

C.6 Hepatitis C patients data set (HEPA)

C.7 The Loan schema data from lending club (US)

C.8 Vietnamese 3 data set (VN3)

C.9 Australian credit data set (AUS)

C.10 Credit risk data set (Credit1)

C.11 Credit card data set (Credit2)

C.12 Credit default data set (Credit3)

C.13 Vietnamese 4 data set (VN4)


LIST OF ABBREVIATIONS

ADASYN Adaptive synthetic sampling
ANN Artificial neural network
AUC Area under the ROC curve
AUS Australian credit data set
BANK Bank personal loan data set

CART Classification and regression tree algorithm
CHAID Chi-square automatic interaction detector algorithm
CNN Condensed nearest neighbors

Credit1 Credit risk data set
Credit2 Credit card data set
Credit3 Credit default data set

FLAC Firth’s logistic regression with added covariate
FLIC Firth’s logistic regression with intercept-correction
F-LLR F-measure-oriented Lasso-Logistic regression
FIR Firth-type - a version of Penalized likelihood regression
FN, FNR False negative, False negative rate


FP, FPR False positive, False positive rate
GER German credit data set

HEPA Hepatitis C patients data set

HEOM Heterogeneous Euclidean-Overlap metric
HVDM Heterogeneous value difference metric
ID Imbalanced data

IR Imbalanced ratio

KNN K-nearest neighbor classifier
KS Kolmogorov-Smirnov statistic
LDA Linear discriminant analysis

LLE Lasso-Logistic regression ensemble classifier
LR Logistic regression

LLR Lasso-Logistic regression
MLE Maximum likelihood estimate
NCL Neighborhood cleaning rule
OSS One-side selection

OUS Over-Under sampling - the proposed algorithm for balancing data
PLR Penalized likelihood regression

QDA Quadratic discriminant analysis

ROC Receiver operating characteristic curve
ROS Random over-sampling

RPART Recursive Partitioning and Regression Tree algorithm
RUS Random under-sampling
SMOTE Synthetic Minority Over-sampling technique


UCI University of California, Irvine
US Loan schema data set from lending club
VAMC Vietnam Asset Management Company
VN1 Vietnamese credit 1 data set

VN2 Vietnamese credit 2 data set
VN3 Vietnamese credit 3 data set
VN4 Vietnamese credit 4 data set

WLE Weighted likelihood estimation


2.8 Illustration of SMOTE technique

2.9 Approaches to imbalanced data in classification

3.1 Illustration of a Decision tree

3.2 Illustration of a decision boundary of SVM

3.3 Illustration of a two-hidden-layer ANN

3.4 Importance level of features of the Vietnamese data sets

3.5 Computation protocol of the proposed ensemble classifier

4.1 Illustration of F-CV

4.2 Illustration of F-LLR


3.9 Performance of ensemble classifiers on the Taiwanese data set
3.10 TOUS(B) algorithm


4.5 Implementation protocol of empirical study

4.6 Average testing performance measures of classifiers

4.7 Average testing performance measures of classifiers (cont.)

4.8 The number of wins of F-LLR on empirical data sets

4.9 Important features of the Vietnamese data set

4.10 Important features of the Vietnamese data set (cont.)

B.1 Algorithm of Bagging classifier

B.2 Algorithm of Random Forest

B.3 Algorithm of AdaBoost

C.1 Summary of the German credit data set

C.2 Summary of the Vietnamese 1 data set

C.3 Summary of the Vietnamese 2 data set

C.4 Summary of the Taiwanese credit data set (a)

C.5 Summary of the Taiwanese credit data set (b)

C.6 Summary of the Bank personal loan data set

C.7 Summary of the Hepatitis C patients data set

C.8 Summary of the Loan schema data from lending club (a)

C.9 Summary of the Loan schema data from lending club (b)

C.10 Summary of the Loan schema data from lending club (c)

C.11 Summary of the Vietnamese 3 data set

C.12 Summary of the Australian credit data set

C.13 Summary of the Credit 1 data set

C.14 Summary of the Credit 2 data set

C.15 Summary of the Credit 3 data set

C.16 Summary of the Vietnamese 4 data set


In classification, imbalanced data occurs when there is a great difference in the quantities of the classes of the training data set, as is typical of credit scoring data sets.

• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.

• The final paper proposes a modification of Logistic regression focusing on the optimization of the F-measure, a popular metric in imbalanced classification.

These classifiers have been trained on a range of public and private data sets with highly imbalanced status and overlapping classes. The primary results demonstrate that the proposed works outperform both traditional and some recent models.




Binary classification, which is the basic type, focuses on two-class label problems. In contrast, multi-classification solves tasks with several class labels. Multi-classification is sometimes reduced to binary classification with two classes: one class corresponding to the label of concern, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive class is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.

Definition 1.1.1. A data set with k input features for binary classification is the

The subset of samples labeled 1 is called the positive class, denoted S+.



Definition 1.1.2. A binary classifier is a function mapping the

Definition 1.1.3. Considering a data set S and a classifier f : X → {0, 1}. With a

• If f(s0) = 1 and y0 = 0, s0 is called a false positive sample.
• If f(s0) = 0 and y0 = 1, s0 is called a false negative sample.

The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.

Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).
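As a quick illustration (the counts below are invented, not taken from the dissertation's data sets), these rates follow directly from the four confusion counts:

```python
def rates(tp, tn, fp, fn):
    """Accuracy, TPR, TNR, FPR, and FNR from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "TPR": tp / (tp + fn),  # true positive rate (recall / sensitivity)
        "TNR": tn / (tn + fp),  # true negative rate (specificity)
        "FPR": fp / (fp + tn),  # false positive rate = 1 - TNR
        "FNR": fn / (fn + tp),  # false negative rate = 1 - TPR
    }

# Hypothetical counts for a credit-scoring classifier on 1,000 customers.
m = rates(tp=20, tn=900, fp=30, fn=50)
```

Note that accuracy can stay high (here 0.92) even when TPR is poor (20/70 ≈ 0.29), which is exactly the weakness of the accuracy metric on imbalanced data.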


the loss of misclassifying the “good” into “bad”. Hence, identifying the bad customers is often considered more crucial than the other task. Similarly, detecting malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to rely on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.

The phenomenon of skewed distribution in training data sets for classification is known as imbalanced data (ID).

Definition 1.1.4. Let S = S+ ∪ S− be the data set, where S+ and S− are

When a training data set is imbalanced, simple classifiers usually have a very high accuracy but low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by error type I and error type II (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class
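A minimal sketch of this bias (the class sizes are made up): a trivial classifier that always predicts the majority class achieves high global accuracy while missing every minority sample.

```python
# Hypothetical imbalanced training labels: 950 negatives (0), 50 positives (1).
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # trivial classifier: always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.95
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tpr = tp / (tp + fn)  # 0.0 - every positive sample is missed
```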


…performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers and practitioners should deeply understand the nature of data sets to handle them correctly.

A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance-sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including on-balance-sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020. Although bad customers account for a very small part of the credit customers, the consequences of their bad debt are extremely heavy for the banks. In countries where most economic activities rely on the banking system, an increase in the bad debt ratio may not only threaten the execution of the banking system but also push the economy to a series of collapses. Therefore, it is important to identify the bad customers.


These facts prompted us to study imbalanced classification deeply. The dissertation titled “Imbalanced data in classification: A case study of credit scoring” aims to find suitable solutions for imbalanced data and related issues, especially in a case study of credit scoring in Vietnam.

ii) The ability to easily explain the predicted results of the classifiers.

Over the two recent decades, the first requirement has been solved with the development of methods to improve the performance of credit scoring models on empirical data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another result concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, according to Baesens et al. (2003), Support vector machine was better than Logistic regression, while Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an


insignificant difference among Support vector machine, Logistic regression, and other classifiers. Ensemble models had superior performance to the single ones (Brown & Mues, 2012; 2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.

While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons for the classification results, which is the framework for assessing, managing, and hedging credit risk. For example, nowadays, customers’ features are collected into empirical data sets more and more diversely, but not all classifiers … Another case is ensemble classifiers. Most of them operate in an incomprehensible process although they have outstanding performance. Even with popular ensemble classifiers such as Bagging Tree, Random Forest, or AdaBoost, which do not have very complicated structures, their interpretability is not discussed. According to Dastile et al. (2020), in the credit


banks have still applied traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016) and Support vector machine

The algorithm-level approach solves imbalanced data by modifying the classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge about the intrinsic classifiers, which users usually lack.


In addition, designing specific corrections or modifications for the given classifier algorithms makes this approach not versatile. A representative of the algorithm-level approach is the Cost-sensitive learning method, which imposes or corrects the costs of loss upon misclassifications and requires the minimal total loss of the classification process (Xiao, Xie, He, & Jiang, 2012). In contrast, the data-level approach uses re-sampling techniques to increase the number of minority-class samples or decrease the one of the majority class. This approach is easy to implement and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to


In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, there are very few studies dealing with both imbalance and noise and overlapping samples. … Logistic regression uses the conditional probability as the reference to predict the sample’s label by comparing it with a given threshold. The sample is classified into the positive class if and only if its conditional probability is greater than this threshold.
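As a sketch (the threshold and probabilities below are illustrative), the decision rule is a one-line comparison; with imbalanced data, lowering the threshold below the default 0.5 is a simple way to trade false positives for a higher TPR:

```python
def classify(prob, threshold=0.5):
    """Label a sample positive iff its conditional probability exceeds the threshold."""
    return 1 if prob > threshold else 0

# Hypothetical conditional probabilities produced by a fitted LR model.
probs = [0.08, 0.35, 0.61, 0.92]
default_labels = [classify(p) for p in probs]       # threshold 0.5
lowered_labels = [classify(p, 0.3) for p in probs]  # threshold 0.3 flags more positives
```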

This characteristic of LR can be extended to multi-classification. Besides, the computation process of LR employs the maximum likelihood estimation.

Meanwhile, the p-value has recently been criticized in the statistical community because of its misunderstanding (Goodman, 2008). These issues limit the application fields of LR although it has several advantages.


There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them are related to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some methods of PLR are too sensitive to initial values in the computation process of the maximum likelihood estimation. Furthermore, some methods of PLR were just for the biased parameter estimates, not for the biased conditional probability (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature.

The first objective is to propose a new ensemble classifier that satisfies the two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform the traditional classification models and popular balanced


This modification directly impacts the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balanced methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.
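A minimal sketch of the two simplest re-sampling techniques named here, ROS and RUS (the feature vectors are invented):

```python
import random

def random_over_sample(minority, majority, seed=0):
    """ROS: duplicate randomly chosen minority samples until the classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_under_sample(minority, majority, seed=0):
    """RUS: randomly discard majority samples until the classes match."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

pos = [(1.0, 2.0), (1.5, 1.8), (0.9, 2.2)]                          # minority class
neg = [(5.0, 5.0), (5.5, 4.8), (6.0, 5.1), (5.2, 5.3), (4.9, 5.5)]  # majority class

pos_ros, neg_ros = random_over_sample(pos, neg)   # both classes now have 5 samples
pos_rus, neg_rus = random_under_sample(pos, neg)  # both classes now have 3 samples
```

SMOTE differs from ROS in that it interpolates new synthetic minority samples between nearest neighbors instead of duplicating existing ones.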

1.4.2 Research subjects

This dissertation investigates the phenomenon of imbalanced data and other related issues such as noise and overlapping samples in classification. We examine various balancing methods, encompassing algorithm-level, data-level,

The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and Neighborhood Cleaning Rule are


investigated in this study. In addition, popular performance criteria, which are suitable for imbalanced classification, such as AUC (Area

the empirical study used one data set belonging to the medical field, the Hepatitis data. This data set was available on the UCI machine learning repository.


Table 1.1: General implementation protocol in the dissertation

Steps Contents

algorithms on the same training data

calculating their performance measures.

(1) An interpretable decision tree ensemble model for imbalanced credit scoring datasets, Journal of Intelligent and Fuzzy Systems, Vol 45, No 6, 10853–10864, 2023.

(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision, and Control, Vol 429, 595–612, 2022, Springer.

(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.

Regarding the literature on credit scoring, the dissertation suggests the interpretable ensemble classifier which can address imbalanced data. The proposed model, which uses Decision tree as the base learner, has more specific advantages than the popular approaches, such as higher performance measures and interpretability. The proposed model corresponds to the first article.


Regarding the literature on imbalanced data, the dissertation proposes a method for balancing, de-noising, and removing overlapping samples, belonging to the ensemble-based approach. This method outperforms the integration of the re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and Neighborhood Cleaning Rule) and popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article.

Regarding the literature on Logistic regression, the dissertation provides a modification to its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the importance level of input features without using the p-value. This modification is in the third article.

1.7 Dissertation outline

The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.

• Chapter 1. Introduction

• Chapter 2. Literature review of imbalanced data

• Chapter 3. Imbalanced data in credit scoring

• Chapter 4. A modification of Logistic regression with imbalanced data


review of approaches to imbalanced data, including the algorithm-level, data-level, and ensemble-based levels. Chapter 2 also examines the basic background and recently proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of existing balancing methods. … provide the importance level of predictors. Furthermore, we innovate the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. The empirical studies are conducted to verify the effectiveness of the


There are no conventional definitions of the IR threshold to conclude that a data set is imbalanced. Most authors simply define ID as a situation where one class has a much greater (or lower) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the interest class has significantly fewer samples than the other and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered as ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.
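Under the common convention that IR is the ratio of the majority-class size to the minority-class size (the counts below are made up), the check is immediate:

```python
def imbalance_ratio(n_majority, n_minority):
    """IR: majority-class sample count divided by minority-class sample count."""
    return n_majority / n_minority

# Hypothetical credit data set: 9,500 good customers and 500 bad customers.
ir = imbalance_ratio(9500, 500)  # 19 good customers per bad one
```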

2.1.2 Obstacles in imbalanced classification

In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to get the highest global accuracy, with very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to the ignorance of minority samples (they may be treated as noise) in favor of the more general patterns of the majority class. As a


consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.

The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, which was the proportion of the performance difference between ID and the balanced data, became significant when the IR was 90/10 and greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of IR.

In short, IR is the factor that reduces the effectiveness of standard classifiers.

…On data sets with overlapping classes, the standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors become harder to perform. Batista et al. (2004) stated that … the data, but it can also produce an over-fitting model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data


Figure 2.1: Examples of circumstances of imbalanced data

Source: Galar et al (2011)

sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that, when fixing the IR, the more samples of the minority class, the lower the error rate of classifiers.

• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space.

• The characteristics of positive samples, such as borderline, rare, and outlier samples, affect the performance of standard classifiers. In fact, borderline samples are always too difficult to be recognized. In addition, the rare and outlier samples are extremely hard to be identified. According to Napierala and Stefanowski (2016); Van Hulse and Khoshgoftaar (2009),


2.2 Performance measures for imbalanced data

…between actual and predicted labels, known as the confusion matrix.

Table 2.1: Confusion matrix

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of the actual positive and negative samples in the training data, respectively. PPOS and PNEG are the numbers of the predicted positive and negative samples, respectively. N is the total number of samples.
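A small sketch (the labels are invented) of tallying these quantities from actual and predicted labels:

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN, TN and the margins POS, NEG, PPOS, PNEG, and N."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "POS": tp + fn, "NEG": fp + tn,    # actual class sizes
            "PPOS": tp + fp, "PNEG": fn + tn,  # predicted class sizes
            "N": len(y_true)}

cm = confusion_counts(y_true=[1, 0, 1, 0, 0, 1], y_pred=[1, 0, 0, 1, 0, 1])
```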

From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types: single and complex metrics.


Firstly, on an imbalanced data set with a very high IR, standard classifiers often get a very high accuracy and a low error rate, while the number of positive samples classified correctly is small despite their crucial role in the classification task. Secondly, the error rate considers the costs of misclassifying the positive class and the negative class equally, whereas in ID, the misclassification of a positive sample is often more costly than that of a negative one. Therefore, imbalanced classification studies use some single metrics that focus on a specific class, such as TPR (or recall), FPR, TNR, FNR, and precision.

TPR is the proportion of the positive samples classified correctly Other


instead of accuracy, TPR is the most favored metric because of the importance of the positive class. However, in credit scoring and cancer diagnosis, if only focusing on the TPR and ignoring the FPR, a trivial classifier can assign all samples the positive label. In other words, such a classifier cannot identify … of a classifier, especially in ID. It leads to combinations of the above

single metrics. The Fβ-measure, which combines precision and recall into a single measure, is one of the most popular complex metrics.

The parameter β is set greater than 1 if and only if FN is of more concern than FP. F1 is the special case of Fβ when the importance of precision and recall is equal.
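As a sketch using the standard formula Fβ = (1 + β²)·precision·recall / (β²·precision + recall), with invented counts:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion counts; beta > 1 weights recall (i.e. FN) more."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(tp=20, fp=30, fn=50)          # beta = 1: precision and recall weighted equally
f2 = f_beta(tp=20, fp=30, fn=50, beta=2)  # beta = 2: recall weighted more heavily
```

Here recall (20/70) is lower than precision (20/50), so F2 comes out below F1, reflecting the heavier penalty put on false negatives.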
