Data Mining Lecture: Data Preprocessing

Trịnh Tấn Đạt
Khoa CNTT (Faculty of Information Technology), Saigon University (Đại Học Sài Gòn)
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Outline
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty:
  - Incomplete: lacking attribute values or lacking certain attributes of interest, e.g., occupation = ""
  - Noisy: containing errors or outliers, e.g., Salary = "-10"
  - Inconsistent: containing discrepancies in codes or names, e.g., Age = "42" while Birthday = "03/07/1997"; ratings that were "1, 2, 3" are now "A, B, C"; discrepancies between duplicate records

Why Is Data Dirty?
- Incomplete data may come from:
  - "Not applicable" data values at collection time
  - Different considerations between the time the data was collected and the time it is analyzed
  - Human, hardware, or software problems
- Noisy data (incorrect values) may come from:
  - Faulty data collection instruments
  - Human or computer errors at data entry
  - Errors in data transmission
- Inconsistent data may come from:
  - Different data sources
  - Functional dependency violations (e.g., modifying some linked data)
- Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility

Data type
- Numeric: the most commonly used type; the stored content is a number
- Characters and strings: strings are arrays of characters
- Boolean: for binary data with true/false values
- Time series data: data with time- or sequence-related properties
  - Sequential data: the data itself has a sequential relationship
  - Time series data: each value changes with time
- Spatial data: data with spatially related attributes, e.g., Google Maps, integrated circuit design layouts, wafer exposure layouts, the Global Positioning System (GPS)
- Text data: paragraph descriptions such as patent reports and diagnostic reports
  - Structured data: library bibliographic data, credit card data
  - Semi-structured data: email, Extensible Markup Language (XML)
  - Unstructured data: social media data such as Facebook messages
- Multimedia data: pictures, audio, video, etc.; their volumes are massive compared with other data types, so they need compression for storage
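To make these symptoms concrete, the following is a minimal sketch (not from the original slides) of how such dirty records and their data types might be inspected; it assumes pandas is available, and the column names, toy values, and the 2023 reference year are illustrative.

    import pandas as pd

    # Toy records reproducing the problems listed above: an empty occupation
    # (incomplete), an impossible salary (noisy), and an age that contradicts
    # the birthday (inconsistent).
    df = pd.DataFrame({
        "occupation": ["engineer", "", "teacher"],
        "salary":     [52000, -10, 48000],
        "age":        [35, 42, 29],
        "birthday":   ["1988-05-01", "1997-03-07", "1994-11-20"],
    })

    print(df.dtypes)                                                 # data type of each column

    print(df[(df["occupation"] == "") | df["occupation"].isna()])    # incomplete values
    print(df[df["salary"] < 0])                                      # noisy values

    # Inconsistent: age does not match the year of birth (measured against 2023).
    birth_year = pd.to_datetime(df["birthday"]).dt.year
    print(df[(2023 - birth_year - df["age"]).abs() > 1])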
Data scale
"A proxy attribute is a variable that is used to represent or stand in for another variable or attribute that is difficult to measure directly. A proxy attribute is typically used in situations where it is not possible or practical to measure the actual attribute of interest. For example, in a study of income, the amount of money a person earns per year may be difficult to determine accurately. In such a case, a proxy attribute, such as education level or occupation, may be used instead." (ChatGPT)
- Each variable has a corresponding attribute and scale used to quantify and measure its level:
  - a natural quantitative scale, or
  - a qualitative scale
- When it is hard to find a corresponding attribute for a variable, a proxy attribute can be used instead as a measurement
- Common scales: nominal scale, categorical scale, ordinal scale, interval scale, ratio scale, and absolute scale

Six common scales
- Nominal scale: values are used only as codes and have no meaning for mathematical operations
- Categorical scale: each category is marked with a numeric code indicating the category to which the value belongs
- Ordinal scale: expresses the ranking and ordering of the data without establishing the degree of variation between values
- Interval scale: also called a distance scale; describes numerical differences between values in a meaningful way
- Ratio scale: different values can be compared to each other by ratio
- Absolute scale: the measured numbers have absolute meaning

Data Compression
- String compression:
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - Only limited manipulation is possible without expansion
- Audio/video compression:
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (unlike audio) are typically short and vary slowly with time
- [Diagram: lossless compression recovers the original data exactly, while lossy compression yields an approximation of it]

Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possibly outliers); example: regression models
- Non-parametric methods: do not assume a model; the major families are histograms, clustering, and sampling

Sampling: with or without Replacement
- [Figure: drawing samples from the raw data with and without replacement; a code sketch is given below, after the discretization notes]

Discretization
- Three types of attributes:
  - Nominal: values from an unordered set, e.g., color, profession
  - Ordinal: values from an ordered set, e.g., military or academic rank
  - Continuous: numeric values, e.g., integers or real numbers
- Discretization:
  - Divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization process
- Discretization methods can be characterized along several aspects:
  - Supervised vs. unsupervised
  - Dynamic vs. static
  - Global vs. local
  - Splitting vs. merging
  - Direct vs. incremental
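As a concrete illustration of dividing a continuous attribute's range into intervals, here is a minimal sketch of unsupervised, equal-width binning (not from the original slides; the toy values and the choice of 4 bins are arbitrary):

    import numpy as np

    values = np.array([3.1, 7.4, 1.0, 9.9, 5.5, 6.2, 2.8, 8.3])  # toy continuous attribute
    n_bins = 4                                                    # arbitrary number of intervals

    # Equal-width binning: split the attribute's range into n_bins equally wide intervals.
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    bin_index = np.digitize(values, edges[1:-1])   # interval index 0 .. n_bins-1 for each value

    for v, b in zip(values, bin_index):
        print(f"{v:4.1f} -> bin {b}  [{edges[b]:.2f}, {edges[b + 1]:.2f}]")

Equal-frequency binning, or supervised methods such as entropy-based splitting, follow the same pattern but choose the cut points differently.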
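And for the sampling step pictured above, a small sketch (again not from the slides; the pool of 10 values and the sample size of 5 are arbitrary) contrasting sampling with and without replacement:

    import numpy as np

    rng = np.random.default_rng(seed=0)   # fixed seed so this toy example is reproducible
    raw_data = np.arange(1, 11)           # toy "raw data": the integers 1..10

    # Without replacement: a drawn element cannot be drawn again,
    # so no value appears twice in the sample.
    print(rng.choice(raw_data, size=5, replace=False))

    # With replacement: every draw is made from the full pool,
    # so the same value may appear several times.
    print(rng.choice(raw_data, size=5, replace=True))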
Data partition
- Data is partitioned into training data, testing data, and validation data
- Different partition methods lead to different mining results; each partition should preserve the original information as much as possible
  - Method 1: hold-out, e.g., 70% for training, 10% for validation, and 20% for testing
  - Method 2: k-fold cross-validation
- Training data: used to build the model
- Validation data: examples used to tune the hyperparameters of the model (a part of the training data)
- Testing data: used to evaluate the robustness of the model
- K-fold cross-validation: a resampling procedure that splits the data into k subsets and fits the same statistical method k times, each time using a different subset of the data (a small sketch is given after the evaluation notes below)

Model evaluation for classification
- Two aspects of evaluating the results of a classification model:
  - Use the results on the testing data set to select the better model
  - Find the best model with input from domain experts
- Classification accuracy:
  - Accuracy or error rate is calculated from the classification results
  - It assumes equal cost for all classes
  - It is misleading on unbalanced datasets
  - It does not differentiate between different types of errors
- A binary classifier predicts each instance of a test dataset as either positive or negative; this prediction produces four outcomes:
  - True positive (TP): correct positive prediction
  - False positive (FP): incorrect positive prediction
  - True negative (TN): correct negative prediction
  - False negative (FN): incorrect negative prediction
- Confusion matrix: an error matrix, presented as a table in which the predicted class is compared with the actual class

Receiver Operating Characteristic (ROC) curve
- The ROC curve visualizes the performance of a binary classifier in terms of its FP rate and TP rate
  - TP rate: the larger the better
  - FP rate: the smaller the better
  - 1 - FP rate is the true negative rate (specificity): as the FP rate increases, the specificity decreases
  - The FP rate changes depending on the threshold setting
  - The larger the area under the curve, the better the model

Area Under the Curve (AUC)
- The AUC is directly connected to model performance: models that perform better have higher AUC values
- A random model has an AUC of 0.5, while a perfect classifier has an AUC of 1.0
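To tie the evaluation ideas together, here is a minimal sketch (my own illustration, not from the slides) that scores a trivial always-negative classifier on an unbalanced test set: the confusion-matrix counts, accuracy, TP rate, and FP rate follow directly from the definitions above, and the 95% accuracy despite a TP rate of zero shows why accuracy alone is misleading on unbalanced data.

    # Confusion-matrix counts and basic rates for a binary classifier,
    # computed directly from the definitions (positive = 1, negative = 0).
    def evaluate(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        accuracy = (tp + tn) / len(y_true)
        tp_rate = tp / (tp + fn) if (tp + fn) else 0.0   # a.k.a. recall / sensitivity
        fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
        return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
                "accuracy": accuracy, "TP rate": tp_rate, "FP rate": fp_rate}

    # Unbalanced toy test set: 95 negatives, 5 positives.
    y_true = [0] * 95 + [1] * 5
    # A useless classifier that always predicts "negative" ...
    y_pred = [0] * 100
    # ... still reaches 95% accuracy, even though its TP rate is 0.
    print(evaluate(y_true, y_pred))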
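And for the k-fold cross-validation partition mentioned above, a plain-Python sketch (not from the slides; 10 samples and k = 5 are arbitrary choices) of how the index sets can be generated, each fold serving once as the held-out subset:

    # Sketch of k-fold cross-validation index generation: the data is split into
    # k folds, and each fold is held out once while the rest is used for training.
    # In practice the data is usually shuffled (and often stratified) beforehand.
    def k_fold_indices(n_samples, k):
        fold_size = n_samples // k
        indices = list(range(n_samples))
        for i in range(k):
            start = i * fold_size
            end = (i + 1) * fold_size if i < k - 1 else n_samples
            test_idx = indices[start:end]
            train_idx = indices[:start] + indices[end:]
            yield train_idx, test_idx

    # Example: 10 samples, k = 5 -> five train/test splits of 8/2 samples each.
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
        print(f"fold {fold}: train={train_idx} test={test_idx}")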
Summary
- Data preparation, or preprocessing, is a big issue for both data warehousing and data mining
- Descriptive data summarization is needed for quality data preprocessing
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but data preprocessing remains an active area of research