Ensemble Models

Trịnh Tấn Đạt
Faculty of Information Technology – Saigon University
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Contents
❖ Introduction
❖ Voting
❖ Bagging
❖ Boosting
❖ Stacking and Blending

Introduction
❖ Definition: an ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples.
❖ Ensembles are often much more accurate than the individual classifiers that make them up.

Learning Ensembles
❖ Learn multiple alternative definitions of a concept using different training data or different learning algorithms.
❖ Combine the decisions of the multiple definitions, e.g., by voting.
[Diagram: the training data is split into Data 1, ..., Data K; each part is fed to Learner 1, ..., Learner K, producing Model 1, ..., Model K; a Combiner merges these into the Final Model.]

Necessary and Sufficient Condition
For the idea to work, the classifiers should be accurate and diverse:
❖ Accurate: each has an error rate better than random guessing on new instances.
❖ Diverse: they make different errors on new data points.

Why Do They Work?
❖ Suppose there are 25 base classifiers, each with an error rate $\varepsilon = 0.35$, and assume the classifiers are independent.
❖ The majority vote is wrong only when at least 13 of the 25 classifiers err, so the probability that the ensemble classifier makes a wrong prediction is
$$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$$
❖ Marquis de Condorcet (1785): for $n$ independent voters who are each wrong with probability $\varepsilon < 0.5$, the majority vote is wrong with probability $\sum_{i > n/2} \binom{n}{i} \varepsilon^{i} (1-\varepsilon)^{n-i}$, which shrinks as $n$ grows.
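To sanity-check the 0.06 figure, here is a minimal numerical sketch (the SciPy dependency is an assumption; any binomial pmf would do):

from scipy.stats import binom

n, eps = 25, 0.35
# The majority vote of 25 classifiers is wrong when 13 or more of them err.
p_wrong = sum(binom.pmf(i, n, eps) for i in range(13, n + 1))
print(round(p_wrong, 3))  # ~0.06, far below the individual error rate of 0.35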
Value of Ensembles
❖ When combining multiple independent and diverse decisions, each of which is at least more accurate than random guessing, random errors cancel each other out and correct decisions are reinforced.
❖ Human ensembles are demonstrably better. How many jelly beans are in the jar? Compare individual estimates with the group average.

A Motivating Example
❖ Suppose that you are a patient with a set of symptoms. Instead of taking the opinion of just one doctor (classifier), you decide to take the opinion of a few doctors!
❖ Is this a good idea? Indeed it is: consult many doctors, and based on their diagnoses you can get a fairly accurate idea of the true diagnosis.

The Wisdom of Crowds
❖ The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and it can be harnessed by voting.

Stacked Learners: A First Attempt

Stacking
❖ Example (see the sketch after these steps):
▪ Step 1: The train set is split into 10 parts.
▪ Step 2: A base model (say, a decision tree) is fitted on 9 parts and predictions are made for the 10th part; this is done for each part of the train set.
▪ Step 3: Using this model, predictions are made on the test set.
▪ Step 4: Steps 2 to 3 are repeated for another base model (say, kNN), resulting in another set of predictions for the train set and the test set.
▪ Step 5: The predictions from the train set are used as features to build a new model (e.g., logistic regression).
▪ Step 6: This model is used to make final predictions on the test prediction set.
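A minimal sketch of these six steps with scikit-learn; the dataset, the cv=10 split, and the choice to refit each base model on the full train set for Step 3 are assumptions, and cross_val_predict stands in for the manual 10-part loop:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

train_meta, test_meta = [], []
for model in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    # Steps 1-2 (and 4): out-of-fold predictions over 10 parts of the train set.
    train_meta.append(cross_val_predict(model, x_train, y_train, cv=10))
    # Step 3: refit on the full train set, then predict the test set.
    test_meta.append(model.fit(x_train, y_train).predict(x_test))

# Steps 5-6: logistic regression built on the base models' predictions.
meta_model = LogisticRegression().fit(np.column_stack(train_meta), y_train)
print(meta_model.score(np.column_stack(test_meta), y_test))

scikit-learn also ships a ready-made sklearn.ensemble.StackingClassifier that wraps essentially this recipe.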
Blending
❖ Blending follows the same approach as stacking, but uses only a holdout (validation) set from the train set to make predictions. In other words, unlike stacking, the predictions are made on the holdout set only. The holdout set and its predictions are then used to build a model which is run on the test set.
▪ Step 1: The train set is split into training and validation sets.
▪ Step 2: Model(s) are fitted on the training set.
▪ Step 3: Predictions are made on the validation set and on the test set.
▪ Step 4: The validation set and its predictions are used as features to build a new model.
▪ Step 5: This model is used to make final predictions on the test set and meta-features.

Blending example (simple code, with imports added and the prediction columns named so that pd.concat aligns cleanly; it assumes x_train, y_train, x_val, y_val, x_test, y_test come from a prior train/validation/test split as in Step 1):

import pandas as pd
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Base model 1: a decision tree fitted on the training split.
model1 = tree.DecisionTreeClassifier()
model1.fit(x_train, y_train)
val_pred1 = pd.DataFrame(model1.predict(x_val), columns=["pred1"])
test_pred1 = pd.DataFrame(model1.predict(x_test), columns=["pred1"])

# Base model 2: kNN.
model2 = KNeighborsClassifier()
model2.fit(x_train, y_train)
val_pred2 = pd.DataFrame(model2.predict(x_val), columns=["pred2"])
test_pred2 = pd.DataFrame(model2.predict(x_test), columns=["pred2"])

# Meta-features: the holdout features plus both base models' predictions.
df_val = pd.concat([x_val.reset_index(drop=True), val_pred1, val_pred2], axis=1)
df_test = pd.concat([x_test.reset_index(drop=True), test_pred1, test_pred2], axis=1)

# Meta-model: logistic regression trained on the blended validation features.
model = LogisticRegression()
model.fit(df_val, y_val)
print(model.score(df_test, y_test))

Netflix Challenge – 1 million USD (2006–2009)
❖ Netflix, an online DVD-rental and video streaming service, set the task: predict users' ratings of films from the ratings given by other users.
❖ Goal: improve the existing method by 10%.
❖ Winner's solution: an ensemble of over 500 heterogeneous models, aggregated with gradient boosted decision trees.
❖ Ensembles based on blending/stacking were key approaches used in the Netflix competition.

Conclusions
❖ Ensemble methods combine several hypotheses into one prediction.
❖ They work better than the best individual hypothesis from the same class because they reduce bias or variance (or both).
❖ Bagging is mainly a variance-reduction technique, useful for complex hypotheses.
❖ Boosting focuses on the harder examples and gives a weighted vote to the hypotheses; it works by reducing bias and increasing the classification margin.
❖ Stacking is a generic approach for ensembling various models and performs very well in practice.

Exercises
1) Implement ensemble-model techniques for the diabetes prediction problem (Diabetes Predictions, https://www.openml.org/d/37).
❖ All patients are females at least 21 years old of Pima Indian heritage.
❖ Number of instances: 768
❖ Number of attributes: 8, plus the class
❖ Missing attribute values: none
❖ Class distribution (class value 1 is interpreted as "tested positive for diabetes"):
▪ 0 (tested_negative): 500 instances
▪ 1 (tested_positive): 268 instances

❖ Apply the following techniques:
▪ Voting: hard voting, soft voting, weighted voting
▪ Comparison of ensemble models: bagging, boosting, voting

Voting
❖ Base models (scikit-learn): KNN, RandomForest, logistic regression.
❖ Sample accuracies on Diabetes Predictions:
▪ Hard voting: 0.7835497835497836
▪ Soft voting: 0.7878787878787878
▪ Weighted voting: 0.7922077922077922
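A minimal sketch of the three voting schemes with scikit-learn's VotingClassifier on the OpenML diabetes data; the train/test split, the scaling pipelines, and the weights are assumptions, so the scores will differ slightly from the figures above:

from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = fetch_openml(data_id=37, as_frame=False, return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("rf", RandomForestClassifier(random_state=0)),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
]

# Hard voting: majority of the predicted class labels.
hard = VotingClassifier(estimators, voting="hard").fit(x_train, y_train)
# Soft voting: average of the predicted class probabilities.
soft = VotingClassifier(estimators, voting="soft").fit(x_train, y_train)
# Weighted voting: soft voting with per-model weights (weights assumed here).
weighted = VotingClassifier(estimators, voting="soft",
                            weights=[1, 2, 1]).fit(x_train, y_train)

for name, clf in (("hard", hard), ("soft", soft), ("weighted", weighted)):
    print(name, clf.score(x_test, y_test))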
Comparison of Ensemble Models
❖ Base models:
rf = RandomForestClassifier()
et = ExtraTreesClassifier()
knn = KNeighborsClassifier()
svc = SVC()
rg = RidgeClassifier()
❖ Compare the ensemble models: bagging, boosting, voting (see the sketch below).
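One way to run this comparison, sketched with 10-fold cross-validation (the wrapped estimators, n_estimators, and fold count are assumptions):

from sklearn.datasets import fetch_openml
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = fetch_openml(data_id=37, as_frame=False, return_X_y=True)

models = {
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100),
    "boosting": AdaBoostClassifier(n_estimators=100),
    "voting": VotingClassifier([
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("dt", DecisionTreeClassifier()),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ]),
}
# 10-fold cross-validated accuracy for each ensemble strategy.
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=10).mean())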
Reference
❖ https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de
❖ https://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/