Data Mining Lecture: Dimensionality Reduction and Feature Selection


Trịnh Tấn Đạt
Faculty of Information Technology – Saigon University (Khoa CNTT – Đại Học Sài Gòn)
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Contents
- Introduction: dimensionality reduction and feature selection
- Dimensionality Reduction
- Principal Component Analysis (PCA)
- Fisher's Linear Discriminant Analysis (LDA)
- Example: Eigenface
- Feature Selection
- Homework

Introduction
- High-dimensional data often contain redundant features, which
  - reduce the accuracy of data classification algorithms,
  - slow down the classification process,
  - are a problem for storage and retrieval, and
  - are hard to interpret (visualize).
- Why do we need dimensionality reduction?
  - To avoid the "curse of dimensionality" (https://en.wikipedia.org/wiki/Curse_of_dimensionality)
  - To reduce feature measurement cost
  - To reduce computational cost

Introduction
- Dimensionality reduction is one of the most popular techniques for removing noisy (i.e., irrelevant) and redundant features.
- Dimensionality reduction techniques: feature extraction vs. feature selection.
  - Feature extraction: given N features (set X), extract M new features (set Y) by a linear or nonlinear combination of all N features (e.g., PCA, LDA).
  - Feature selection: choose the best subset of M highly discriminant features from the available N features (e.g., Information Gain, ReliefF, Fisher Score).

Dimensionality Reduction

Principal component analysis (PCA)
❖ Variance vs. Covariance
- Variance: the variance of a random variable is a measure of its statistical dispersion, indicating how far its values typically lie from the expected value.
  (Figure: examples of low variance vs. high variance)
- Covariance: a measure of how two random variables vary together (as opposed to the variance, which measures the variability of a single variable):

  Cov(X, Y) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})

Principal component analysis (PCA)
- Mean (expected value): the "expected" value, representing the average value of a variable.
- Standard deviation: a statistical measure of the variability of the values; it shows how far the values deviate from the mean.

Principal component analysis (PCA)
- Representing the covariance between all pairs of dimensions as a matrix:
  - cov(x, y) = cov(y, x), hence the matrix is symmetric about the diagonal.
  - N-dimensional data results in an N×N covariance matrix.

Principal component analysis (PCA)
- What is the interpretation of covariance calculations?
  - Example: 2-dimensional data, where x is the number of hours spent studying a course and y is the grade obtained in that course.
  - Covariance value ≈ 104.53: what does this value mean?
  - The positive covariance indicates that as the number of study hours increases, the grade increases as well (see the sketches below).
Algorithms for Flat Features
❖ Fisher Score
- Features of high quality should assign similar values to instances in the same class and different values to instances from different classes.
- The score S_i of the i-th feature is commonly computed by the Fisher Score as

  S_i = \frac{\sum_{j=1}^{c} n_j (\mu_{ij} - \mu_i)^2}{\sum_{j=1}^{c} n_j \sigma_{ij}^2}

  where n_j is the number of instances in class j, \mu_i is the overall mean of feature i, and \mu_{ij}, \sigma_{ij}^2 are the mean and variance of feature i within class j.

Algorithms for Flat Features
❖ Mutual Information based Methods
- Information gain is used to measure the dependence between features and labels.
- The information gain between the i-th feature f_i and the class labels C is the reduction in entropy of C once f_i is observed:

  IG(f_i, C) = H(C) - H(C \mid f_i)

- A feature is relevant if it has a high information gain. (A scikit-learn sketch of this kind of filter-based selection appears at the end of this section.)

Algorithms for Flat Features
❖ Wrapper Models
- Wrapper models utilize a specific classifier to evaluate the quality of selected features, and offer a simple and powerful way to address the problem of feature selection, regardless of the chosen learning machine.
- A typical wrapper model performs the following steps:
  - Step 1: search for a subset of features;
  - Step 2: evaluate the selected subset of features by the performance of the classifier;
  - Step 3: repeat Step 1 and Step 2 until the desired quality is reached.
- Search strategies: hill climbing, best-first, genetic algorithms (GA), ...

Algorithms for Flat Features
❖ Embedded Models
- Embedded models embed feature selection within classifier construction, and thus have the advantages of (1) wrapper models, in that they capture the interaction with the classification model, and (2) filter models, in that they are far less computationally intensive than wrapper methods.
- Three types of embedded methods:
  - Pruning methods first use all features to train a model and then attempt to eliminate some features by setting the corresponding coefficients to 0 while maintaining model performance, such as recursive feature elimination using a support vector machine.
  - The second type are models with a built-in mechanism for feature selection, such as ID3 and C4.5.
  - The third type are regularization models, whose objective functions minimize fitting errors while forcing the coefficients to be small or exactly zero; features whose coefficients are close to zero are then eliminated. Methods: Lasso regularization, Adaptive Lasso, Bridge regularization, Elastic Net regularization.
- (A sketch contrasting a wrapper model and an embedded model appears at the end of this section.)

Algorithms for Flat Features
❖ Regularization methods
- Classifier induction and feature selection are achieved simultaneously by estimating the weight vector w with properly tuned penalties.
- Feature selection is achieved because only features with nonzero coefficients in w will be used in the classifier.
- Lasso regularization is based on the l1-norm of the coefficient vector w, i.e., the penalty term α‖w‖₁ is added to the classification loss. l1 regularization can produce an estimate of w with exactly zero coefficients; in other words, there are zero entries in w, meaning that the corresponding features are eliminated during the classifier learning process. Therefore, it can be used for feature selection.

Algorithms for Structured Features
- For many real-world applications, the features exhibit certain intrinsic structures, e.g., spatial or temporal smoothness, disjoint/overlapping groups, trees, and graphs.
- Incorporating knowledge about the structure of the features may significantly improve the classification performance and help identify the important features.
- We focus on linear classifiers with structured features that minimize an empirical error penalized by a structure-aware regularization term of the form loss(w; X, y) + α · penalty(w, G), where G denotes the structure of the features and α controls the trade-off between data fitting and regularization.
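As a concrete illustration of the filter-style methods above (scoring each feature by its mutual information with the labels and keeping the top ones), here is a small scikit-learn sketch. The iris dataset and the choice k = 2 are arbitrary assumptions made for the example, not part of the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# A small labelled dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Score every feature by its mutual information with the class labels
# and keep the k best ones (k = 2 is an arbitrary choice here).
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("scores per feature:     ", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:           ", X_selected.shape)   # (150, 2)
```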
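Along the same lines, a wrapper model (recursive feature elimination wrapped around a linear SVM) and an embedded model (an l1-penalized, Lasso-style logistic regression) might be sketched as follows. The breast-cancer dataset, the classifiers, and the hyperparameters are illustrative assumptions; the lecture does not prescribe these choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE, SelectFromModel

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale features for the linear models

# Wrapper model: RFE repeatedly trains the classifier and drops the least
# important features (the search/evaluate/repeat loop of Steps 1-3 above).
svm = LinearSVC(C=0.1, dual=False, max_iter=10000)
rfe = RFE(estimator=svm, n_features_to_select=10)
rfe.fit(X, y)
print("RFE-selected features:", rfe.get_support(indices=True))

# Embedded model: the l1 penalty drives many coefficients to exactly zero,
# so feature selection happens while the classifier itself is being trained.
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
embedded = SelectFromModel(lasso_like).fit(X, y)
print("L1-selected features: ", embedded.get_support(indices=True))
```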
Research directions
- Algorithms for Structured Features
- Algorithms for Streaming Features

Discussions and Challenges
❖ Scalability
- With the growth of dataset sizes, the scalability of current algorithms may be in jeopardy, especially in domains that require an online classifier.
- The scalability of feature selection algorithms is a big problem: they usually require a sufficient number of samples to obtain statistically adequate results.

Discussions and Challenges
❖ Stability
- Stability is defined as the sensitivity of the selection process to data perturbation in the training set.
- The underlying characteristics of the data can greatly affect the stability of an algorithm.
- These characteristics include the dimensionality m, the sample size n, and different data distributions across different folds; the stability issue tends to be data dependent.

Discussions and Challenges
❖ Linked Data
- Linked data has become ubiquitous in real-world applications such as tweets on Twitter (tweets linked through hyperlinks), social networks on Facebook (people connected by friendships), and biological networks (protein interaction networks).
- Linked data is patently not independent and identically distributed (i.i.d.).
- Feature selection methods for linked data need to solve the following immediate challenges:
  - how to exploit relations among data instances; and
  - how to take advantage of these relations for feature selection.
- Two attempts to handle linked data with respect to feature selection for classification are LinkedFS and FSNet.

Homework
1) Apply feature selection to the Boston dataset.
   Ref: https://datasciencebeginners.com/2018/11/26/using-scikit-learn-in-python-for-feature-selection/
2) Apply PCA as dimensionality reduction for iris recognition (a minimal starting sketch follows below).
   Ref: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
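For homework 2, a minimal starting point in the spirit of the referenced tutorial (but not copied from it, and with arbitrary parameter choices) could be:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Standardize the features, then project onto the first two principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_pca.shape)   # (150, 2)
```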
