Data Mining Lecture: Ensemble Models

Trịnh Tấn Đạt
Faculty of Information Technology (Khoa CNTT), Saigon University (Đại Học Sài Gòn)
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Contents
- Introduction
- Voting
- Bagging
- Boosting
- Stacking and Blending

Introduction

Definition
- An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples.
- Ensembles are often much more accurate than the individual classifiers that make them up.

Learning Ensembles
- Learn multiple alternative definitions of a concept using different training data or different learning algorithms.
- Combine the decisions of the multiple definitions, e.g. by voting.
- [Diagram: the training data is split into Data 1 ... Data K; each subset is fed to Learner 1 ... Learner K, producing Model 1 ... Model K; a model combiner merges these into the final model.]

Necessary and Sufficient Condition
- For the idea to work, the classifiers should be accurate and diverse.
- Accurate: each classifier has an error rate better than random guessing on new instances.
- Diverse: the classifiers make different errors on new data points.

Why Do They Work?
- Suppose there are 25 base classifiers.
- Each classifier has an error rate ε = 0.35.
- Assume the classifiers are independent.
- The ensemble (majority vote) makes a wrong prediction only if at least 13 of the 25 classifiers are wrong, which happens with probability
  \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06
- Marquis de Condorcet (1785): this is the same argument as Condorcet's jury theorem, which gives the probability that a majority vote is wrong.
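To make the 0.06 figure concrete, here is a minimal sketch (not part of the original slides) that evaluates the sum above using only the Python standard library; the variable names are illustrative.

from math import comb

n = 25        # number of independent base classifiers
eps = 0.35    # error rate of each base classifier

# The majority vote is wrong only when at least 13 of the 25 classifiers err.
p_ensemble_wrong = sum(
    comb(n, i) * eps**i * (1 - eps)**(n - i)
    for i in range(13, n + 1)
)

print(round(p_ensemble_wrong, 3))  # ~0.06, far below the individual error rate of 0.35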
Value of Ensembles
- When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
- Human ensembles are demonstrably better, e.g. "How many jelly beans are in the jar?": individual estimates vs. the group average.

A Motivating Example
- Suppose that you are a patient with a set of symptoms.
- Instead of taking the opinion of just one doctor (classifier), you decide to take the opinions of a few doctors.
- Is this a good idea? Indeed it is: consult many doctors and, based on their diagnoses, you can get a fairly accurate idea of the true diagnosis.

The Wisdom of Crowds
- The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and it can be harnessed by voting.

Stacked Learners: First Attempt

Stacking (example)
- Step 1: The train set is split into 10 parts.
- Step 2: A base model (say, a decision tree) is fitted on 9 parts and predictions are made for the 10th part.
- Step 3: Using this model, predictions are made on the test set.
- Step 4: Steps 2 and 3 are repeated for another base model (say, kNN), giving another set of predictions for the train set and the test set.
- Step 5: The predictions on the train set are used as features to build a new model (e.g. logistic regression).
- Step 6: This model is used to make the final predictions on the test-set predictions (the meta-features).

Blending
- Blending follows the same approach as stacking, but uses only a holdout (validation) set from the train set to make predictions.
- In other words, unlike stacking, the predictions are made on the holdout set only.
- The holdout set and the predictions on it are used to build a model, which is then run on the test set.

Blending (steps)
- Step 1: The train set is split into training and validation sets.
- Step 2: Model(s) are fitted on the training set.
- Step 3: Predictions are made on the validation set and on the test set.
- Step 4: The validation set and its predictions are used as features to build a new model.
- Step 5: This model is used to make the final predictions on the test set and its meta-features.

Blending, simple example code (cleaned up; it assumes x_train, y_train, x_val, y_val, x_test, y_test have already been prepared):

import pandas as pd
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Base model 1: decision tree, predicting on the validation and test sets
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = pd.DataFrame(model1.predict(x_val))
test_pred1 = pd.DataFrame(model1.predict(x_test))

# Base model 2: k-nearest neighbours
model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = pd.DataFrame(model2.predict(x_val))
test_pred2 = pd.DataFrame(model2.predict(x_test))

# Meta-features: original features plus the base models' predictions
# (indices are reset so the columns align when concatenating)
df_val = pd.concat([x_val.reset_index(drop=True), val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test.reset_index(drop=True), test_pred1, test_pred2], axis=1)

# Meta-model: logistic regression trained on the validation meta-features
model = LogisticRegression()
model.fit(df_val, y_val)
print(model.score(df_test, y_test))
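For comparison with the blending snippet above, here is a minimal stacking sketch (not from the original slides). It uses scikit-learn's StackingClassifier, which generates the out-of-fold train-set predictions of Steps 1-4 automatically; x_train, y_train, x_test, y_test are assumed to be prepared as in the blending example.

from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Level-0 base models and the level-1 meta-model
estimators = [
    ('dt', DecisionTreeClassifier()),
    ('knn', KNeighborsClassifier()),
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=10,  # 10-fold out-of-fold predictions, matching Step 1 of the stacking recipe
)

stack.fit(x_train, y_train)
print(stack.score(x_test, y_test))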
Netflix Challenge - 1 million USD (2006-2009)
- Netflix is an online DVD-rental and video-streaming service.
- Task: predict a user's ratings of films from the ratings given by other users.
- Goal: improve the existing method by 10%.
- Winner's solution: an ensemble of over 500 heterogeneous models, aggregated with gradient-boosted decision trees.
- Ensembles based on blending/stacking were key approaches used in the Netflix competition.

Conclusions
- Ensemble methods combine several hypotheses into one prediction.
- They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).
- Bagging is mainly a variance-reduction technique, useful for complex hypotheses.
- Boosting focuses on the harder examples and gives a weighted vote to the hypotheses.
- Boosting works by reducing bias and increasing the classification margin.
- Stacking is a generic approach to ensembling various models and performs very well in practice.

Exercises (Bài Tập)
1) Implement ensemble-model techniques for the diabetes prediction problem (Diabetes Predictions); a starter sketch is given after the reference list below.
- All patients are females at least 21 years old of Pima Indian heritage.
- Number of instances: 768
- Number of attributes: 8 plus the class
- Missing attribute values: none
- Class distribution (class value 1 is interpreted as "tested positive for diabetes"):
  - 0 (tested_negative): 500 instances
  - 1 (tested_positive): 268 instances

Apply the following techniques:
- Voting: hard voting, soft voting, weighted voting
- Comparison of ensemble models: bagging, boosting, voting

Diabetes Predictions
- Dataset: https://www.openml.org/d/37

Voting
- Base models (scikit-learn): KNN, RandomForest, logistic regression
- Voting: hard, soft, weighted - example accuracies: 0.7835, 0.7879, 0.7922

Comparison of Ensemble Models
- Base models:
  rf = RandomForestClassifier()
  et = ExtraTreesClassifier()
  knn = KNeighborsClassifier()
  svc = SVC()
  rg = RidgeClassifier()
- Compare: bagging, boosting, voting

Reference
- https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
- https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
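As a starting point for the exercise, here is a minimal sketch (not from the original slides) of hard, soft, and weighted voting on the OpenML diabetes dataset (d/37) with the three base models listed above. The weights and the 70/30 split are illustrative assumptions, so the resulting accuracies will differ from the figures quoted in the slides.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Load the Pima Indians diabetes data from OpenML (dataset id 37)
X, y = fetch_openml(data_id=37, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

base = [
    ('knn', KNeighborsClassifier()),
    ('rf', RandomForestClassifier(random_state=42)),
    ('lr', LogisticRegression(max_iter=1000)),
]

# Hard voting: majority of the predicted class labels
hard = VotingClassifier(estimators=base, voting='hard').fit(X_train, y_train)
# Soft voting: average of the predicted class probabilities
soft = VotingClassifier(estimators=base, voting='soft').fit(X_train, y_train)
# Weighted (soft) voting: give selected models a larger say (weights chosen arbitrarily here)
weighted = VotingClassifier(estimators=base, voting='soft',
                            weights=[1, 2, 1]).fit(X_train, y_train)

for name, clf in [('hard', hard), ('soft', soft), ('weighted', weighted)]:
    print(name, clf.score(X_test, y_test))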
