Imbalanced Data in classification: A case study of credit scoring
MINISTRY OF EDUCATION AND TRAINING
UNIVERSITY OF ECONOMICS HO CHI MINH CITY
Ho Chi Minh City - 2024
STATEMENT OF AUTHENTICATION
I certify that the Ph.D. dissertation, "Imbalanced data in classification: A case study of credit scoring", is solely my own research.
This dissertation is only used for the Ph.D. degree at the University of Economics Ho Chi Minh City (UEH), and no part of it has been submitted to any other university or organization to obtain any other degree. Any studies of other authors used in this dissertation are properly cited.
Ho Chi Minh City, April 2, 2024
First of all, I would like to express my deepest gratitude to my supervisors, Assoc. Prof. Dr. Le Xuan Truong and Dr. Ta Quoc Bao, for their scientific direction and dedicated guidance throughout the research process. Besides, I really appreciate the interest and help of my colleagues at Ho Chi Minh City University of Banking.
Finally, I am grateful for the unconditional support that my mother and my family have given to me on my educational path.
Ho Chi Minh City, April 2, 2024
1.3.2 Gaps in the approaches to solving imbalanced data
1.3.3 Gaps in Logistic regression with imbalanced data
1.4 Research objectives, research subjects, and research scopes
2.3.3.2 Integration of data-level method and ensemble
3.2.1.1 Algorithm for balancing data - OUS(B) algorithm
3.2.1.2 Algorithm for constructing ensemble classifier - DTE(B) algorithm
3.2.2 Empirical data sets
4.2.1 Prior correction
4.2.2 Weighted likelihood estimation (WLE)
4.2.3 Penalized likelihood regression (PLR)
4.3 The proposed works
4.3.1 The modification of the cross-validation procedure
4.3.2 The modification of Logistic regression
4.4.6 Important variables for output
4.4.6.1 Important variables for the F-LLR fitted model
4.4.6.2 Important variables of the Vietnamese data set
5.1.1 The interpretable credit scoring ensemble classifier
5.1.2 The technique for imbalanced data, noise, and overlapping samples
C.5 Bank personal loan data set (BANK)
C.6 Hepatitis C patients data set (HEPA)
C.7 The Loan schema data from lending club (US)
C.8 Vietnamese 3 data set (VN3)
C.9 Australian credit data set (AUS)
C.10 Credit risk data set (Credit1)
C.11 Credit card data set (Credit2)
C.12 Credit default data set (Credit3)
C.13 Vietnamese 4 data set (VN4)
LIST OF ABBREVIATIONS

ADASYN   Adaptive synthetic sampling
ANN      Artificial neural network
AUC      Area under the ROC curve
AUS      Australian credit data set
BANK     Bank personal loan data set
CART     Classification and regression tree algorithm
CHAID    Chi-square automatic interaction detector algorithm
CNN      Condensed nearest neighbors
Credit1  Credit risk data set
Credit2  Credit card data set
Credit3  Credit default data set
FLAC     Firth's logistic regression with added covariate
FLIC     Firth's logistic regression with intercept-correction
F-LLR    F-measure-oriented Lasso-Logistic regression
FIR      Firth-type, a version of Penalized likelihood regression
FN, FNR  False negative, False negative rate
FP, FPR  False positive, False positive rate
GER      German credit data set
HEPA     Hepatitis patient data set
HEOM     Heterogeneous Euclidean-Overlap metric
HVDM     Heterogeneous value difference metric
ID       Imbalanced data
IR       Imbalanced ratio
KNN      K-nearest neighbor classifier
KS       Kolmogorov-Smirnov statistic
LDA      Linear discriminant analysis
LLE      Lasso-Logistic regression ensemble classifier
LR       Logistic regression
LLR      Lasso-Logistic regression
MLE      Maximum likelihood estimate
NCL      Neighborhood cleaning rule
OSS      One-side selection
OUS      Over-Under sampling, the proposed algorithm for balancing data
PLR      Penalized likelihood regression
QDA      Quadratic discriminant analysis
ROC      Receiver Operating Characteristic curve
ROS      Random over-sampling
RPART    Recursive Partitioning and Regression Tree algorithm
RUS      Random under-sampling
SMOTE    Synthetic Minority Over-sampling Technique
UCI      University of California, Irvine
US       Loan schema data set from lending club
VAMC     Vietnam Asset Management Company
VN1      Vietnamese credit 1 data set
VN2      Vietnamese credit 2 data set
VN3      Vietnamese credit 3 data set
VN4      Vietnamese credit 4 data set
WLE      Weighted likelihood estimation
2.8 Illustration of SMOTE technique
2.9 Approaches to imbalanced data in classification
3.1 Illustration of a Decision tree
3.2 Illustration of a decision boundary of SVM
3.3 Illustration of a two-hidden-layer ANN
3.4 Importance level of features of the Vietnamese data sets
3.5 Computation protocol of the proposed ensemble classifier
4.1 Illustration of F-CV
4.2 Illustration of F-LLR
3.9 Performance of ensemble classifiers on the Taiwanese data set
3.10 TOUS(B) algorithm
4.5 Implementation protocol of empirical study
4.6 Average testing performance measures of classifiers
4.7 Average testing performance measures of classifiers (cont.)
4.8 The number of wins of F-LLR on empirical data sets
4.9 Important features of the Vietnamese data set
4.10 Important features of the Vietnamese data set (cont.)
B.1 Algorithm of Bagging classifier
B.2 Algorithm of Random Forest
B.3 Algorithm of AdaBoost
C.1 Summary of the German credit data set
C.2 Summary of the Vietnamese 1 data set
C.3 Summary of the Vietnamese 2 data set
C.4 Summary of the Taiwanese credit data set (a)
C.5 Summary of the Taiwanese credit data set (b)
C.6 Summary of the Bank personal loan data set
C.7 Summary of the Hepatitis C patients data set
C.8 Summary of the Loan schema data from lending club (a)
C.9 Summary of the Loan schema data from lending club (b)
C.10 Summary of the Loan schema data from lending club (c)
C.11 Summary of the Vietnamese 3 data set
C.12 Summary of the Australian credit data set
C.13 Summary of the Credit 1 data set
C.14 Summary of the Credit 2 data set
C.15 Summary of the Credit 3 data set
C.16 Summary of the Vietnamese 4 data set
In classification, imbalanced data occurs when there is a great difference in the quantities of the classes of the training data set. This situation is typical of credit scoring data sets.
• The second paper introduces a novel technique for addressing imbalanced data, particularly in the cases of overlapping and noisy samples.
• The final paper proposes a modification of Logistic regression focusing on the optimization of the F-measure, a popular metric in imbalanced classification.
These classifiers have been trained on a range of public and private data sets with a highly imbalanced status and overlapping classes. The primary results demonstrate that the proposed works outperform both traditional and some recent models.
in the case of data with overlapping classes and noise.
• The third paper proposes a modification of the Logistic regression model.
Binary classification, which is the basic type, focuses on two-class label problems. In contrast, multi-classification solves tasks with several class labels. Multi-classification is sometimes considered binary with two classes: one class corresponding to the label of concern, and the other representing the remaining labels. In binary classification, data sets are partitioned into positive and negative classes. The positive is the class of interest, which has to be identified in the classification task. In this dissertation, we focus on binary classification. For convenience, we define some concepts as follows.
Definition 1.1.1. A data set with k input features for binary classification is the set S = {(x_i, y_i) : x_i ∈ X ⊆ R^k, y_i ∈ {0, 1}, i = 1, ..., n}, where X is the feature space and y_i is the label of the sample.
The subset of samples labeled 1 is called the positive class, denoted S+. The subset of samples labeled 0 is called the negative class, denoted S−.
Definition 1.1.2. A binary classifier is a function mapping the feature space X to the label set {0, 1}.
Definition 1.1.3. Consider a data set S and a classifier f: X → {0, 1}. For a sample s0 = (x0, y0) of S:
• If f(s0) = 1 and y0 = 1, s0 is called a true positive sample.
• If f(s0) = 0 and y0 = 0, s0 is called a true negative sample.
• If f(s0) = 1 and y0 = 0, s0 is called a false positive sample.
• If f(s0) = 0 and y0 = 1, s0 is called a false negative sample.
The numbers of true positive, true negative, false positive, and false negative samples are denoted TP, TN, FP, and FN, respectively.
Some popular criteria used to evaluate the performance of a classifier are accuracy, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR).
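These criteria follow directly from the four counts in Definition 1.1.3. As a minimal illustrative sketch (the counts below are made-up and are not from the dissertation's data sets):

```python
def rates(tp, tn, fp, fn):
    """Evaluation criteria computed from the counts of Definition 1.1.3."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "TPR": tp / (tp + fn),  # true positive rate
        "TNR": tn / (tn + fp),  # true negative rate
        "FPR": fp / (fp + tn),  # false positive rate
        "FNR": fn / (fn + tp),  # false negative rate
    }

# Made-up counts: 50 actual positives, 950 actual negatives.
m = rates(tp=20, tn=940, fp=10, fn=30)
# m["accuracy"] = 0.96 while m["TPR"] = 0.4
```

Note how the accuracy stays high even though most positive samples are missed, which is exactly the weakness discussed next.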
the loss of misclassifying the "good" into "bad". Hence, identifying the bad customers is often considered more crucial than the other task. A similar situation holds in medicine: identifying malignancy is the first target of any cancer diagnosis process because of the heavy consequences of missing cancer patients. Therefore, it is unreasonable to base on the accuracy metric to evaluate the performance of cancer diagnosis classifiers.
The phenomenon of skew distribution in training data sets for classification is known as imbalanced data (ID).
Definition 1.1.4. Let S = S+ ∪ S− be the data set, where S+ and S− are the positive and negative classes, respectively. The imbalanced ratio of S is IR = |S−| / |S+|.
When a training data set is imbalanced, simple classifiers usually have a very high accuracy but a low TPR. These classifiers aim to maximize the accuracy (sometimes called global accuracy), thus equating the losses caused by the type I and type II errors (Shen, Zhao, Li, Li, & Meng, 2019). Therefore, the classification results are often biased toward the majority class (the negative class) (Galar, Fernandez, Barrenechea, Bustince, & Herrera, 2011; Haixiang et al., 2017). In the case of a rather high imbalanced ratio, the minority class
strongly affects the performance of learning methods (Batista, Prati, & Monard, 2004; Haixiang et al., 2017). Thus, researchers or practitioners should deeply understand the nature of data sets to handle them correctly.
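The accuracy bias can be reproduced with a toy example: a classifier that always predicts the majority class scores a high global accuracy while identifying no positive sample. A hypothetical sketch (the class sizes are made up):

```python
# Hypothetical training set: 990 negative and 10 positive samples (IR = 99).
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a trivial classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tpr = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)
# accuracy = 0.99 although no positive sample is identified (tpr = 0.0)
```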
A typical case study of imbalanced classification is credit scoring. This issue is reflected in the bad debt ratio of commercial banks. For example, in Vietnam, the on-balance sheet bad debt ratio was 1.9% in 2021 and 1.7% in 2020. Besides, the gross bad debt ratio (including the on-balance sheet bad debt, unresolved bad debt sold to VAMC, and potential bad debt from restructuring) was 7.3% in 2021 and 5.1% in 2020. Although bad customers account for a very small part of the credit customers, the consequences of the bad debt of the bank are extremely heavy. In countries where most economic activities rely on the banking system, the increase in the bad debt ratio may not only threaten the execution of the banking system but also push the economy to a series of collapses. Therefore, it is important to identify the bad customers.
These facts prompted us to study imbalanced classification deeply. The dissertation titled "Imbalanced data in classification: A case study of credit scoring" aims to find suitable solutions for imbalanced data and related issues, especially in a case study of credit scoring in Vietnam.
ii) The ability to easily explain the predicted results of the classifiers.
Over the two recent decades, the first requirement has been solved with the development of methods to improve the performance of credit scoring models on empirical data sets. For example, some studies showed that Logistic regression outperformed Decision tree (Marqués, García, & Sánchez, 2012; Wang, Ma, Huang, & Xu, 2012), but another result concluded that Logistic regression worked worse than Decision tree (Bensic, Sarlija, & Zekic-Susac, 2005). Besides, according to Baesens et al. (2003), Support vector machine was better than Logistic regression, while Li et al. (2019) and Van Gestel et al. (2006) indicated that there was an
insignificant difference among Support vector machine, Logistic regression, and other classifiers. Several studies also showed that ensemble models had superior performance to the single ones (Brown & Mues, 2012;
2020; Lessmann, Baesens, Seow, & Thomas, 2015; Marqués et al., 2012). However, ensemble algorithms do not directly handle the imbalanced data issue.
While the second requirement of a credit scoring model often attracts less attention than the first, its role is equally important. It provides the reasons for the classification results, which is the framework for assessing, managing, and hedging credit risk. For example, nowadays, customers' features are collected into empirical data sets more and more diversely, but not all classifiers can show how these features contribute to the predicted results. Another case is ensemble classifiers. Most of them operate in an incomprehensible process although they have outstanding performance. Even with popular ensemble classifiers such as Bagging Tree, Random Forest, or AdaBoost, which do not have very complicated structures, their interpretability is not discussed. According to Dastile et al. (2020), in the credit
scoring practice, banks have still applied traditional methods such as Logistic regression and Discriminant analysis. Some studies used machine learning methods such as Artificial neural network (Kiều, Diệp, Nga, & Nam, 2017; Nguyen & Nguyen, 2016; Thịnh & Toàn, 2016) and Support vector machine.
The algorithm-level approach solves imbalanced data by modifying the classifier algorithms to reduce the bias toward the majority class. This approach needs deep knowledge about the intrinsic classifiers, which users usually lack.
In addition, designing specific corrections or modifications for the given classifier algorithms makes this approach not versatile. A representative of the algorithm-level approach is the Cost-sensitive learning method, which imposes or corrects the costs of loss upon misclassifications and requires the minimal total loss of the classification process (Xiao, Xie, He, & Jiang, 2012). The data-level approach, in contrast, re-samples the training data: over-sampling techniques increase the number of samples of the minority class, while under-sampling techniques decrease the one of the majority class. This approach implements easily and performs independently of the classifier algorithms. However, re-sampling techniques change the distribution of the training data set, which may lead to
In summary, although there are many methods for imbalanced classification, each of them has some drawbacks. Some hybrid methods are complex and inaccessible. Moreover, there are very few studies dealing with imbalance together with noise and overlapping samples. Logistic regression uses the estimated conditional probability as the reference to predict the sample's label by comparing it with a given threshold: the sample is classified into the positive class if and only if its conditional probability is greater than this threshold.
This characteristic of LR can be extended to multi-classification. Besides, the computation process of LR employs the maximum likelihood estimation (MLE).
Meanwhile, the p-value has recently been criticized in the statistical community because of its misunderstanding (Goodman, 2008). These issues lead to the limitation in the application fields of LR although it has several advantages.
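For background, the MLE fit of a one-feature Logistic regression can be sketched by gradient ascent on the log-likelihood. This is a generic textbook illustration (the learning rate, epoch count, and data are made up), not the dissertation's proposed modification:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=3000):
    """Plain MLE via gradient ascent on the log-likelihood (one feature + intercept)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(w * x + b)  # gradient of the log-likelihood
            gw += err * x
            gb += err
        w += lr * gw / n
        b += lr * gb / n
    return w, b

w, b = fit_logistic([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0], [0, 0, 0, 1, 1, 1])
# sigmoid(w * x + b) is the estimated conditional probability for a new x
```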
There are multiple methods to deal with imbalanced data for LR, such as prior correction (Cramer, 2003; King & Zeng, 2001), weighted likelihood estimation (WLE) (Maalouf & Trafalis, 2011; Manski & Lerman, 1977; Ramalho & Ramalho, 2007), and penalized likelihood regression (PLR) (Firth, 1993; Greenland & Mansournia, 2015; Puhr, Heinze, Nold, Lusa, & Geroldinger, 2017). All of them are related to the algorithm-level approach, which requires much effort from the users. For example, prior correction and WLE need the ratio of the positive class in the population, which is usually unavailable in real-world applications. Besides, some methods of PLR are too sensitive to initial values in the computation process of the maximum likelihood estimation. Furthermore, some methods of PLR were just for the biased parameter estimates, not for the biased conditional probability (Firth, 1993). A hybrid of these methods and re-sampling techniques has not been considered in the literature. The first objective is to propose a new ensemble classifier that satisfies two key requirements of a credit-scoring model. This ensemble classifier is expected to outperform the traditional classification models and popular balanced methods.
This modification directly impacts the F-measure, which is commonly used to evaluate the performance of classifiers in imbalanced classification. The proposed work can compete with popular balanced methods for Logistic regression such as weighted likelihood estimation, penalized likelihood regression, and re-sampling techniques, including ROS, RUS, and SMOTE.
1.4.2 Research subjects
This dissertation investigates the phenomenon of imbalanced data and other related issues, such as noise and overlapping samples, in classification. We examine various balancing methods, encompassing the algorithm-level, data-level, and ensemble-based approaches.
The dissertation focuses on binary classification problems for imbalanced data sets and their application in credit scoring. Interpretable classifiers, including Logistic regression, Lasso-logistic regression, and Decision trees, are considered. To deal with imbalanced data, the dissertation concentrates on the data-level approach and the integration of data-level methods and ensemble classifier algorithms. Some popular re-sampling techniques such as ROS, RUS, SMOTE, ADASYN, Tomek-link, and Neighborhood Cleaning Rule are investigated in this study. In addition, popular performance criteria, which are suitable for imbalanced classification, such as AUC (Area under the ROC curve), are employed.
The empirical study also used one data set belonging to the medical field, the Hepatitis data. This data set was available on the UCI machine learning repository.
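As background for the re-sampling techniques named in the scope above, ROS and RUS can be sketched in a few lines. This is a generic illustration (the function names and data are hypothetical), not the dissertation's OUS algorithm:

```python
import random

def random_over_sample(pos, neg, seed=0):
    """ROS: duplicate randomly chosen minority (positive) samples until |S+| = |S-|."""
    rng = random.Random(seed)
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    return pos + extra, neg

def random_under_sample(pos, neg, seed=0):
    """RUS: keep a random subset of the majority (negative) class of size |S+|."""
    rng = random.Random(seed)
    return pos, rng.sample(neg, len(pos))

pos, neg = [1, 2], [3, 4, 5, 6, 7, 8]
p1, n1 = random_over_sample(pos, neg)   # both classes now have 6 samples
p2, n2 = random_under_sample(pos, neg)  # both classes now have 2 samples
```

SMOTE, ADASYN, Tomek-link, and NCL are more elaborate: instead of copying or dropping samples, they synthesize new minority samples or remove borderline majority ones.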
Table 1.1: General implementation protocol in the dissertation

Steps   Contents
…       … algorithms on the same training data.
…       Calculating their performance measures.
(1) An interpretable decision tree ensemble model for imbalanced credit scoring data sets, Journal of Intelligent and Fuzzy Systems, Vol. 45, No. 6, 10853–10864, 2023.
(2) TOUS: A new technique for imbalanced data classification, Studies in Systems, Decision, and Control, Vol. 429, 595–612, 2022, Springer.
(3) A modification of Logistic regression with imbalanced data: F-measure-oriented Lasso-logistic regression, ScienceAsia, 49S, 68–77, 2023.
Regarding the literature on credit scoring, the dissertation suggests the interpretable ensemble classifier which can address imbalanced data. The proposed model, which uses Decision tree as the base learner, has more specific advantages than the popular approaches, such as higher performance measures and interpretability. The proposed model corresponds to the first article.
Regarding the literature on imbalanced data, the dissertation proposes a method for balancing data and removing noisy and overlapping samples, which belongs to the ensemble-based approach. This method outperforms the integration of the re-sampling techniques (ROS, RUS, SMOTE, Tomek-link, and Neighborhood Cleaning Rule) and popular ensemble classifier algorithms (Bagging tree, Random forest, and AdaBoost). This work corresponds to the second article.
Regarding the literature on Logistic regression, the dissertation provides a modification to its computation process. The proposed work makes Logistic regression more effective than the existing methods for Logistic regression with imbalanced data and retains the ability to show the importance level of input features without using the p-value. This modification is in the third article.
1.7 Dissertation outline
The dissertation “Imbalanced data in classification: A case study of credit scoring” has five chapters.
• Chapter 1. Introduction
• Chapter 2. Literature review of imbalanced data
• Chapter 3. Imbalanced data in credit scoring
• Chapter 4. A modification of Logistic regression with imbalanced data
• Chapter 5. Conclusions
review of approaches to imbalanced data, including the algorithm-level, data-level, and ensemble-based levels. Chapter 2 also examines the basic background and recent proposed works of credit scoring. The detailed discussion of previous studies clarifies the pros and cons of the existing balancing methods and is the basis for the proposed works. Chapter 3 suggests an interpretable credit scoring ensemble classifier that can provide the importance level of predictors. Furthermore, we innovate the algorithm of this credit-scoring ensemble classifier to handle overlapping and noise before dealing with imbalanced data. The empirical studies are conducted to verify the effectiveness of the proposed works.
There are no conventional definitions of the IR threshold to conclude that a data set is imbalanced. Most authors simply define ID as the situation where one class has a much greater (or lower) number of samples than the other (Brown & Mues, 2012; Haixiang et al., 2017). Other authors assess a data set as imbalanced if the class of interest has significantly fewer samples than the other, and ordinary classifier algorithms encounter difficulty in distinguishing the two classes (Galar et al., 2011; López, Fernández, García, Palade, & Herrera, 2013; Sun, Wong, & Kamel, 2009). Therefore, a data set is considered as ID when its IR is greater than one and most samples of the minority class cannot be identified by standard classifiers.
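Under the convention of Definition 1.1.4 (1 marks the positive, minority class), the IR can be computed as follows; the function and label vector here are illustrative, not from the dissertation:

```python
def imbalanced_ratio(labels):
    """IR = |S-| / |S+|, where 1 marks the positive (minority) class."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return n_neg / n_pos

ir = imbalanced_ratio([0] * 90 + [1] * 10)  # ir = 9.0
```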
2.1.2 Obstacles in imbalanced classification
In ID, the minority class is usually misclassified since there is too little information about its patterns. Besides, standard classifier algorithms often operate according to the rule of maximizing the accuracy metric. Hence, the classification results are usually biased toward the majority class to get the highest global accuracy, with very low accuracy for the minority class. On the other hand, the patterns of the minority class are often specific, especially in extreme ID, which leads to the ignorance of minority samples (they may be treated as noise) in favor of the more general patterns of the majority class. As a consequence, the minority class, which is the object of interest in the classification process, is usually misclassified in ID.
The above analyses are also supported by empirical studies. Brown and Mues (2012) concluded that the higher the IR, the lower the performance of classifiers. Furthermore, Prati, Batista, and Silva (2015) found that the expected performance loss, which was the proportion of the performance difference between ID and the balanced data, became significant when IR was 90/10 or greater. Prati et al. also pointed out that the performance loss tended to increase quickly for higher values of IR.
In short, IR is the factor that reduces the effectiveness of standard classifiers.
On data sets with overlapping, standard classifier algorithms such as Decision tree, Support vector machine, or K-nearest neighbors become harder to perform (Batista et al., 2004). Over-sampling may enrich the data, but it can also produce an over-fitting model (Cui, Davis, Cheng, & Bai, 2004; Wasikowski & Chen, 2009). On imbalanced and small data
Figure 2.1: Examples of circumstances of imbalanced data
Source: Galar et al. (2011)
sets, the lack of information about the positive class becomes more serious. Krawczyk and Woźniak (2015) stated that when fixing the IR, the more samples of the minority class, the lower the error rate of classifiers.
• Small disjuncts (Fig. 2.1c): This problem occurs when the minority class consists of several sub-spaces in the feature space.
• The characteristics of positive samples, such as borderline, rare, and outlier, affect the performance of standard classifiers. The fact is that borderline samples are always difficult to recognize. In addition, rare samples and outliers are extremely hard to identify. According to Napierala and Stefanowski (2016); Van Hulse and Khoshgoftaar (2009),
2.2 Performance measures for imbalanced data
Most performance measures are derived from the cross table between actual and predicted labels, known as the confusion matrix.
Table 2.1: Confusion matrix

                   Predicted positive   Predicted negative   Total
Actual positive    TP                   FN                   POS
Actual negative    FP                   TN                   NEG
Total              PPOS                 PNEG                 N

In Table 2.1, TP, FP, FN, and TN follow Definition 1.1.3. Besides, POS and NEG are the numbers of the actual positive and negative samples in the training data, respectively. PPOS and PNEG are the numbers of the predicted positive and negative samples, respectively. N is the total number of samples.
From the confusion matrix, several metrics are built to provide a framework for analyzing many aspects of a classifier. These metrics can be divided into two types: single and complex metrics.
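The four cells of Table 2.1 can be counted directly from label lists. A small illustrative sketch (the labels are made up, not from the dissertation's data sets):

```python
def confusion_counts(y_true, y_pred):
    """Counts of the four cells of Table 2.1 (positive label = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
# tp=1, tn=2, fp=1, fn=1; note POS = tp + fn, NEG = tn + fp, N = 5
```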
Firstly, on an imbalanced data set with very high IR, standard classifiers often get a very high accuracy and low error rate even though the number of positive samples classified correctly is small, despite their crucial role in the classification task. Secondly, the error rate considers the cost of misclassifying the positive class and the negative one equally, whereas in ID, the misclassification of a positive sample is often more costly than that of a negative one. Therefore, imbalanced classification studies use some single metrics that focus on a specific class, such as TPR (or recall), FPR, TNR, FNR, and precision.
TPR is the proportion of the positive samples classified correctly Other
instead of accuracy, TPR is the most favored metric because of the importance of the positive class. However, in credit scoring and cancer diagnosis, if only focusing on the TPR and ignoring the FPR, a trivial classifier will assign all samples the positive label. In other words, a single metric cannot fully evaluate the performance of a classifier, especially in ID. It leads to combinations of the above single metrics into complex ones. The F-measure is one of the most popular complex metrics. The parameter β is set greater than 1 if and only if FN is more concerned than FP. F1 is the special case of Fβ when the importance of precision and recall is equal (β = 1).
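For reference, the Fβ-measure combines precision and recall as Fβ = (1 + β²)·precision·recall / (β²·precision + recall). A generic sketch with made-up counts (this is the standard formula, not the dissertation's F-LLR method):

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from precision and recall; beta > 1 weights recall (i.e., FN) more."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to TPR
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(tp=20, fp=10, fn=30)          # F1 = 0.5
f2 = f_beta(tp=20, fp=10, fn=30, beta=2)  # weights recall more heavily
```

With these counts, recall (0.4) is lower than precision (about 0.67), so increasing β pulls the score down, reflecting the heavier penalty on false negatives.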