Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 80

39 Data Stream Mining
Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy

[Fig. 39.5. ANNCAD Framework]

tion sequence. The FOCUS framework uses the difference between data mining models as a measure of the deviation between data sets.

Ferrer-Troyano et al. (Ferrer-Troyano et al., 2004) have proposed a scalable classification algorithm for numerical data streams, termed Scalable Classification Algorithm by Learning decisiOn Patterns (SCALLOP). The algorithm starts by reading a user-specified number of labeled records. A number of rules are created for each class from these records. For each record read after creating these rules, there are three cases:

a) Positive covering: a new record that strengthens a currently discovered rule.
b) Possible expansion: a new record that is associated with at least one rule but is not covered by any discovered rule.
c) Negative covering: a new record that weakens a currently discovered rule.

For each of the above cases, a different procedure is used as follows:

a) Positive covering: the positive support and confidence of the existing rule are updated.
b) Possible expansion: the rule is extended if it satisfies two conditions:
   1. It stays within user-specified growth bounds, to avoid a possible wrong expansion of the rule.
   2. There is no intersection between the expanded rule and any already discovered rule associated with the same class label.
c) Negative covering: the negative support and confidence are updated. If the confidence falls below a user-specified minimum threshold, a new rule is added.

After a user-defined number of records has been read, a rule refining process takes place. In its first step, rules of the same class that lie within a user-defined acceptable distance are merged, provided that the merged rule does not intersect with rules associated with other class labels. The resulting hypercube should also be within the growth bounds of the rules.
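The three cases listed above for each incoming record can be pictured with a small sketch. It is not the SCALLOP implementation: the `Rule` layout (an axis-parallel box per rule), the single `max_width` growth bound, the confidence formula, and the function names are all assumptions made for illustration. Where the text names rules of the same class label in the intersection check, the sketch conservatively checks every other rule. The step that adds a new rule when confidence drops below the threshold is left out.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Rule:
    """Axis-parallel box with a class label (hypothetical layout)."""
    lower: List[float]
    upper: List[float]
    label: str
    pos: int = 0  # positive support
    neg: int = 0  # negative support

    def covers(self, x: Sequence[float]) -> bool:
        return all(lo <= v <= hi for lo, v, hi in zip(self.lower, x, self.upper))

    def confidence(self) -> float:
        # Assumed definition: share of covered records that agree with the label.
        total = self.pos + self.neg
        return self.pos / total if total else 0.0


def boxes_intersect(lo1, hi1, lo2, hi2) -> bool:
    """Two boxes intersect when their intervals overlap in every dimension."""
    return all(a <= d and c <= b for a, b, c, d in zip(lo1, hi1, lo2, hi2))


def process(rules: List[Rule], x: Sequence[float], label: str, max_width: float) -> str:
    """Update the rules for one labeled record and report which case applied."""
    # Simplification: only the first covering rule is considered.
    for rule in rules:
        if rule.covers(x):
            if rule.label == label:
                rule.pos += 1
                return "positive covering"
            rule.neg += 1
            return "negative covering"

    # Not covered: try to grow a rule of the same class around the record.
    for rule in rules:
        if rule.label != label:
            continue
        lo = [min(a, v) for a, v in zip(rule.lower, x)]
        hi = [max(b, v) for b, v in zip(rule.upper, x)]
        within_bounds = all(h - l <= max_width for l, h in zip(lo, hi))
        clash = any(
            boxes_intersect(lo, hi, other.lower, other.upper)
            for other in rules
            if other is not rule
        )
        if within_bounds and not clash:
            rule.lower, rule.upper = lo, hi
            rule.pos += 1
            return "possible expansion"
    return "uncovered"
```

A record that falls in none of the three cases is reported as "uncovered"; how such records seed new rules is outside this sketch.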
The second step of the refining stage releases uninteresting rules from the current model. Rules that have less than the minimum positive support are released from the model. Rules that are not covered by at least one of the most recently received records (a user-defined number of them) are also released from the classifier. Figure 39.6 illustrates the basic process of using SCALLOP to build a data stream classifier. Finally, a voting-based classification technique is used to classify unlabelled records. If a rule covers the current record, the label associated with that rule is used as the classifier output; otherwise, a vote over the current rules within the growth bounds is used to infer the class label.

[Fig. 39.6. Basic SCALLOP Process]

Papadimitriou et al. (Papadimitriou et al., 2003) have proposed AWSOM (Arbitrary Window Stream mOdeling Method) for discovering interesting patterns from sensor data. They developed a one-pass algorithm to incrementally update the patterns. Their method requires only O(log N) memory, where N is the length of the sequence. They conducted experiments with real and synthetic data sets. They use wavelet coefficients as a compact information representation and for correlation structure detection, and then apply a linear regression model in the wavelet domain. The system relies on creating a compact representation to address the high-speed streaming problem. The experimental results show its efficiency in detecting correlation.

Gaber et al. (Gaber et al., 2005) have developed Lightweight Classification (LWClass). It is a variation of LWC and is also an AOG-based technique. The idea is to use K-nearest neighbors while updating the frequency of class occurrence given the data stream features. In case of contradiction between the incoming stream and the stored summary of the cases, the frequency is reduced.
If the frequency reaches zero, all the cases represented by this class are released from memory.

39.4 Frequent Pattern Mining Techniques

Frequency counting is the process of identifying the most frequent items. It can be used as a stand-alone technique to discover the heavy hitters (Cormode and Muthukrishnan, 2003). It can also be used as a step towards finding association rules. The main idea is to find data items with a probability greater than or equal to a pre-specified minimum threshold, known in the context of frequent items as the item support (Dunham, 2003). The item support is calculated by dividing the number of times the observed item appears by the total number of records.

Giannella et al. (Giannella et al., 2003) have proposed and implemented a frequent itemset mining algorithm over data streams. They use tilted windows to calculate the frequent patterns for the most recent transactions, based on the fact that users are more interested in the most recent streaming information than in older data. They have developed an incremental algorithm to maintain the FP-stream, a tree data structure that represents and discovers frequent itemsets in data streams. FP-stream is based on the FP-tree, which was first introduced by Han et al. (Han et al., 2000) as a graphical representation for discovering frequent itemsets. A number of experiments have been conducted to demonstrate the efficiency of the algorithm. The results show that, with limited memory, the algorithm can discover the frequent itemsets with approximate support.

Manku and Motwani (Manku and Motwani, 2002) have proposed and implemented an approximate frequency counting algorithm in data streams. The implemented algorithm uses all the previous historical data to calculate the frequent patterns incrementally.
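To make approximate frequency counting over the whole stream history concrete, here is a minimal sketch of lossy counting, one of the algorithms from Manku and Motwani (2002). The function names, the dictionary layout and the parameter names `epsilon` and `support` are choices made for this illustration, not taken from the chapter.

```python
import math
from typing import Dict, Hashable, Iterable, List, Tuple


def lossy_count(stream: Iterable[Hashable], epsilon: float) -> Tuple[Dict[Hashable, List[int]], int]:
    """Return ({item: [count, max_undercount]}, number_of_records_seen)."""
    width = math.ceil(1 / epsilon)       # records per bucket
    entries: Dict[Hashable, List[int]] = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)    # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been seen and pruned before; bucket - 1
            # bounds how many occurrences could have been missed.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            stale = [k for k, (f, d) in entries.items() if f + d <= bucket]
            for k in stale:
                del entries[k]
    return entries, n


def frequent_items(entries: Dict[Hashable, List[int]], n: int,
                   support: float, epsilon: float) -> Dict[Hashable, int]:
    """Items whose stored count is at least (support - epsilon) * n."""
    return {k: f for k, (f, _) in entries.items() if f >= (support - epsilon) * n}
```

The stored count of an item never exceeds its true count and falls short of it by at most `epsilon * n`, which is why the reporting threshold is lowered from `support` to `support - epsilon`.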
Two algorithms have been introduced: sticky sampling and lossy counting. Although the first algorithm should analytically perform better because it has a better worst-case bound, the experimental studies have shown that the lossy counting algorithm performs better in practice. The sticky sampling algorithm samples in such a way that new records matching already existing entries have a higher probability of being sampled. The lossy counting algorithm groups the incoming records into buckets and maintains a single counter for each tracked item.

Cormode and Muthukrishnan (Cormode and Muthukrishnan, 2003) have developed an algorithm for counting frequent items. The algorithm uses group testing to find the hottest k items. It can process the turnstile data stream model, which allows addition as well as deletion of data records. A randomized approximation algorithm is used to approximately discover the most frequent items. The algorithm can recall the frequent items with a given item support and probability. It is worth mentioning that the turnstile data stream model is the hardest to analyze. The time series and cash register models are easier. The former does not allow increments and decrements, and the latter allows only increments.

Jin et al. (Jin et al., 2003) have proposed the hCount algorithm for discovering frequent items in data streams. This algorithm also deals with the turnstile data stream model, where insertion into and deletion from the data are allowed. The algorithm works dynamically with any range of data and does not need any prior knowledge about the data. It is classified as an approximation technique: it keeps the number of counters that, analytically, guarantees that the final approximate output deviates from the exact one by no more than a user-given error threshold.

Gaber et al.
(Gaber et al., 2005) have developed one more AOG-based algorithm: Lightweight Frequency counting (LWF). It finds an approximate solution to the most frequent items in the incoming stream by using adaptation and by regularly releasing the least frequent items in order to count the more frequent ones.

39.5 Time Series Analysis

Time series analysis is concerned with discovering patterns in attribute values that vary over time. Three main functions are performed in time series mining: clustering of similar time series, predicting future values in a time series, and classifying the behavior of a time series (Dunham, 2003).

Indyk et al. (Indyk et al., 2000) have proposed approximate solutions with probabilistic error bounds to two problems in time series analysis: relaxed periods and average trends. The algorithms use dimensionality reduction sketching techniques. The process starts with computing the sketches over an arbitrarily chosen time window. This creates a so-called sketch pool. Sketching is the process of random projection over a number of attributes. Using this pool of sketches, relaxed periods and average trends are computed. Relaxed periods refer to those periods in a time series that are repeated over time. Since exact repetition is rare, periods that are similar according to a distance function are acceptable. The average trend is the mean value of a subsequence of observations of a pre-specified length in a time series. The algorithms have been shown experimentally to be efficient in running time and accuracy.

Perlman and Java (Perlman and Java, 2003) have proposed an approach to mine astronomical time series streams. The technique starts with handling missing data using interpolation. A normalization process then takes place, completing a two-phase preprocessing step. Finding frequently occurring shapes in the time series using time windows is the first processing step. Then, clustering the discovered patterns of shapes is the second step.
Rule extraction and filtering over the created clusters represent the final step in the approach. The limitation of the implemented system is that it can process only one time series at a time. Figure 39.7 shows a simple flow chart of the approach.

[Fig. 39.7. Astronomical Time Series Analysis]

Zhu and Shasha (Zhu and Shasha, 2003) have proposed techniques to compute a set of statistical measures over time series data streams. The proposed techniques use the discrete Fourier transform to create a synopsis data structure. The system is called StatStream and is able to compute approximate, error-bounded correlations and inner products. The system works over an arbitrarily chosen sliding window.

Keogh et al. (Keogh et al., 2003) have shown empirically that the most cited algorithms for clustering time series data streams proposed so far in the literature produce meaningless results in subsequence clustering. They have proposed a solution that uses k-motifs to choose the subsequences that the algorithm can work on. The 1-motif is the subsequence that has the highest count of non-trivial matches in a time series. Thus, the k-motifs are the top k subsequences in terms of the count of non-trivial matches. Experimental results show the success of the technique in extracting meaningful time series clustering results.

Lin et al. (Lin et al., 2003) have proposed a symbolic representation of time series data streams, termed Symbolic Aggregate approXimation (SAX). This representation allows dimensionality and numerosity reduction. Numerosity reduction refers to reducing the number of records. They have demonstrated the applicability of the proposed representation by applying it to clustering, classification, indexing and anomaly detection. The approach has two main stages.
The first stage transforms the time series data into a Piecewise Aggregate Approximation, and the second stage transforms that output into a string of discrete symbols.

Chen et al. (Chen et al., 2002) have proposed the application of so-called regression cubes to data streams. Given the success of OnLine Analytical Processing (OLAP) technology on static stored data, they proposed using multidimensional regression analysis to create a compact cube that can be used for answering aggregate queries over the incoming data streams. This research has been extended and adopted in the ongoing project Mining Alarming Incidents in Data Streams (MAIDS). The technique has been shown experimentally to be efficient in analyzing time series data streams.

39.6 Systems and Applications

Systems and applications that deal with mining data streams have recently been developed. The systems are application-oriented, except for MAIDS, developed by Cai et al. (Cai et al., 2004), which represents the first attempt to develop a generic data stream mining system. The following list introduces these systems and applications with short descriptions.

Burl et al. (Burl et al., 1999) have developed Diamond Eye for NASA and JPL. The aim of the project is to enable remote systems as well as scientists to extract patterns from spatial objects in real-time image streams. The success of this project will enable "a new era of exploration using highly autonomous spacecraft, rovers, and sensors" (Burl et al., 1999). The system uses a high-performance computational facility for processing the data mining requests. The scientist uses a web interface with Java applets to connect to the server, which requests the images in order to perform the image mining process.

Kargupta et al. (Kargupta et al., 2002) have developed the first ubiquitous data stream mining system, termed MobiMine.
It is a client/server PDA-based distributed data mining application for financial data streams. The system prototype has been developed using a single data source and multiple mobile clients; however, the system is designed to handle multiple data sources. The server functionalities in the proposed system are data collection from different financial web sites and storage, selection of active stocks using common statistical methods, and application of online data mining techniques to the stock data. The client functionalities are portfolio management, using a mobile micro-database to store portfolio data and information about the user's preferences, and construction of the WatchList, which is the first point of interaction between the client and the server. The server computes the most active stocks in the market, and the client in turn selects a subset of this list to construct the personalized WatchList according to an optimization module. The second point of interaction between the client and the server is that the server performs online data mining, transforms the results using a Fourier transformation, and finally sends them to the client. The client in turn visualizes the results on the PDA screen. It is worth pointing out that the data mining process in MobiMine is performed at the server side, given the resource constraints of a mobile device.

With the increasing need for onboard data mining in resource-constrained computing environments, Kargupta et al. (Kargupta, 2004) have developed onboard mining techniques for a different application: mining vehicle sensory data streams. Their Vehicle Data Stream Mining System (VEDAS) is a ubiquitous data stream mining system that allows continuous monitoring and pattern extraction from data streams generated on board a moving vehicle. The mining component is located on the PDA. VEDAS uses online incremental clustering for modeling driving behavior.
Tanner et al. (Tanner et al., 2002) have developed the EnVironment for On-Board Processing (EVE) for astronomical data streams. The system analyzes data streams continuously generated from the measurements of different on-board sensors. Only interesting patterns are sent to the ground stations for further analysis, which preserves the limited bandwidth.

Srivastava and Stroeve (Srivastava and Stroeve, 2003) have worked on a NASA project for onboard detection of geophysical processes such as snow, ice and clouds. They use kernel clustering methods for data compression, in order to preserve the limited bandwidth needed to send image streams to the ground centers. The kernel methods were chosen for their low computational complexity.

Cai et al. (Cai et al., 2004) have developed an integrated mining and querying system. The system can classify, cluster, count frequency and query over data streams. Mining Alarming Incidents of Data Streams (MAIDS) is currently under development, and the project team has recently demonstrated its prototype implementation. Sequential pattern mining and hidden network mining are currently under development.

Pirttikangas et al. (Pirttikangas et al., 2001) have implemented a mobile agent-based ubiquitous data mining system for a context-aware health club for cyclists. The system is called Genie of the Net. The process starts by collecting information from sensors and databases in order to identify the information needed for the specific application. This information includes the user's context and other required information collected by mobile agents. The main scenario for the health club system is that the user has a plan for an exercise. All the relevant health information, such as heart rate, is recorded during the exercise. This information is analyzed using data mining techniques to advise the user after each exercise.
Having discussed the state of the art in mining data streams in terms of developed techniques as well as systems used in different applications, we can use this review as a basis for classifying these techniques into generic categories.

39.7 Taxonomy of Data Stream Mining Approaches

The research problems and challenges in mining data streams discussed earlier have been addressed using well-established statistical and computational approaches. We can categorize these solutions into data-based and task-based ones. In data-based solutions, the idea is to examine only a subset of the whole dataset, or to transform the data vertically or horizontally into an approximate, smaller data representation. In task-based solutions, on the other hand, techniques from computational theory have been adopted to achieve time- and space-efficient solutions. In this section we review these theoretical foundations.

39.7.1 Data-based Techniques

Data-based techniques refer to summarizing the whole dataset or choosing a subset of the incoming stream to be analyzed. Sampling, load shedding and sketching techniques represent the latter. Synopsis data structures and aggregation represent the former. The following subsections outline the basics of these techniques, with pointers to their applications in the context of data stream mining.

Sampling

Sampling refers to the process of probabilistically choosing whether a data item is to be processed (Toivonen, 1996). Sampling is an old statistical technique that has long been used in the context of conventional data mining for large databases. In the context of data stream mining, bounds on the error rate of the computation are given as a function of the sampling rate or size. Very Fast Machine Learning techniques (Domingos and Hulten, 2000) have used the Hoeffding bound (Hoeffding, 1963) to determine the sample size according to a loss function derived for the running mining algorithm.
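For reference, the Hoeffding bound mentioned above is usually stated as follows. The symbols R (range of the observed variable), n (number of independent observations), δ (allowed failure probability) and ε (error margin) are notation chosen for this note rather than taken from the chapter, and the numbers in the example are purely illustrative.

```latex
% With probability at least 1 - \delta, the true mean of a random variable
% with range R is no smaller than the mean of n independent observations
% minus \epsilon, where
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}

% Solving for the sample size that reaches a target \epsilon:
n \ge \frac{R^{2}\,\ln(1/\delta)}{2\,\epsilon^{2}}

% Example with R = 1, \delta = 0.05, \epsilon = 0.1:
n \ge \frac{\ln 20}{2 \cdot 0.01} \approx 149.8
\quad\Rightarrow\quad n = 150
```

The bound does not depend on the distribution of the data, which is what makes it usable when the stream's distribution is not known in advance.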
The problem with using sampling in the context of data stream analysis is that the dataset size is unknown. Thus the treatment of a data stream requires a special analysis to find the error bounds. Another problem is that some applications of data stream mining, such as surveillance analysis, need to check for anomalies. Sampling is not the right choice for such applications, since anomalous records may not be sampled. Sampling also does not address the problem of fluctuating data rates. It would be worth investigating the relationship among three parameters: data rate, sampling rate and error bounds.

Load Shedding

Load shedding (Babcock et al., 2003, Tatbul et al., 2003, Tatbul et al., 2003) refers to the process of dropping a sequence of data streams. Load shedding has been used successfully in querying data streams. It has the same problems as sampling. Load shedding is difficult to use with mining algorithms because it drops chunks of data streams that could be used in structuring the generated models, or that might represent a pattern of interest in time series analysis. However, it has recently been used for classification with acceptable accuracy in an algorithm developed by Chi et al. (Chi et al., 2005). The algorithm, termed Loadstar, represents the first attempt to use load shedding in high-speed data stream classification problems.

Sketching

Sketching (Babcock et al., 2002, Muthukrishnan, 2003) is the process of randomly projecting a subset of the features. It amounts to vertically sampling the incoming stream. Sketching has been applied in comparing different data streams and in aggregate queries. The major drawback of sketching is accuracy, which makes it hard to use in the context of data stream mining. Principal Component Analysis (PCA), which has been applied in streaming applications (Kargupta, 2004), would be a better solution.
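One common way to read sketching as random projection is shown below: each incoming record with d features is multiplied by a fixed random matrix to give k < d values. This is only a sketch under stated assumptions. The ±1/sqrt(k) entries, the function names and the fixed seed are choices made for this illustration and are not taken from the cited work.

```python
import math
import random
from typing import List, Sequence


def make_projection(d: int, k: int, seed: int = 0) -> List[List[float]]:
    """Build a k x d matrix whose entries are +1/sqrt(k) or -1/sqrt(k) at random."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.choice((-scale, scale)) for _ in range(d)] for _ in range(k)]


def sketch(matrix: List[List[float]], x: Sequence[float]) -> List[float]:
    """Project one d-dimensional record onto k dimensions."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


# Usage: the same matrix is reused for every record in the stream, so
# sketches of different records (or different streams) stay comparable.
projection = make_projection(d=8, k=3, seed=42)
record = [0.5, 1.0, 0.0, 2.0, 1.5, 0.0, 3.0, 1.0]
compact = sketch(projection, record)   # list of 3 floats
```

Because the projection is linear, the sketch of a sum of records equals the sum of their sketches, which is what allows sketches to be maintained incrementally. The accuracy drawback noted above shows up as distortion of distances that grows as k shrinks.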
Synopsis Data Structures

Creating a synopsis of data refers to the process of applying summarization techniques that are capable of summarizing the incoming stream for further analysis. Wavelet analysis (Gilbert et al., 2003), histograms, quantiles and frequency moments (Babcock et al., 2002) have been proposed as synopsis data structures. Since a synopsis does not represent all the characteristics of the dataset, approximate answers are produced when using such data structures.

Aggregation

Aggregation is the process of computing statistical measures, such as means and variances, that summarize the incoming data stream. The aggregated data can then be used by the data mining algorithm. The problem with aggregation is that it does not perform well with highly fluctuating data distributions. Merging online aggregation with offline mining has been studied in (Aggarwal et al., 2003, Aggarwal et al., 2004, Aggarwal et al., 2004) for clustering and classification of data streams.

Definitions, advantages and disadvantages of all of the above data-based approaches are given in Table 39.2.

39.7.2 Task-based Techniques

Task-based techniques are those methods that modify existing techniques or develop new ones in order to address the computational challenges of data stream processing. Approximation algorithms, sliding window techniques and algorithm output granularity represent this category. In the following subsections, we examine each of these techniques and its application in the context of data stream analysis.

Approximation algorithms

Approximation algorithms (Muthukrishnan, 2003) have their roots in algorithm design. They are concerned with designing algorithms for computationally hard problems. These algorithms can produce an approximate solution with error bounds.
The idea is that data stream mining problems are considered computationally hard, given the continuity and speed of the streams and the resource-constrained computational environment. Approximation algorithms have attracted researchers as a direct solution to data stream mining problems. However, the problem of data rates relative to the available resources cannot be solved using approximation algorithms alone. Other tools should be used along with these algorithms in order to adapt to the available resources. Approximation algorithms have been used in (Cormode and Muthukrishnan, 2003, Jin et al., 2003) for discovering frequent items.

Sliding Window

The inspiration behind sliding window techniques is that the user is more concerned with the analysis of the most recent data. Thus, the detailed analysis is done over the most recent data items and over summarized versions of the older ones. This idea has been adopted in many techniques in the ongoing comprehensive data stream mining system MAIDS (Dong et al., 2003). The main issue with sliding window techniques is how to remove expired results from the current model.

Algorithm Output Granularity

The algorithm output granularity (AOG) (Gaber et al., 2005, Gaber et al., 2004) introduces the first resource-aware data analysis approach that can cope with fluctuating, very high data rates according to the available memory and the processing speed, represented as time constraints. The AOG performs the local data analysis on a resource

Table 39.2. Data-based Techniques

Sampling
  Definition: The process of choosing a subset of a dataset for the sake of analysis using probability theory.
  Pros:
  • Well-established techniques.
  • Error bounds guaranteed.
  Cons:
  • Poor for anomaly detection.

Load Shedding
  Definition: The process of ignoring a continuous chunk of streaming data.
  Pros:
  • Proven efficiency with data stream querying.
  • Used recently with success in data stream mining.
  Cons:
  • Very poor for anomaly detection.

Sketching
  Definition: Random projection of a set of features to be analyzed.
  Pros:
  • Considerably improves the running time.
  Cons:
  • Some unselected features might be of great importance.

Synopsis Data Structure
  Definition: Quick transformation of the incoming stream into a summarized, compressed form.
  Pros:
  • Analysis task independent.
  Cons:
  • Might not be sufficient with high data rates.

Aggregation
  Definition: Calculating statistical measures that capture the features of data.
  Pros:
  • Analysis task independent.
  Cons:
  • Aggregation measures do not capture all the required features of data.
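As a small illustration of the aggregation entry in Table 39.2, the following sketch maintains the mean and variance of a stream in a single pass with constant memory. It uses Welford's online algorithm, which is a choice made for this example and is not something the chapter prescribes. The class and attribute names are hypothetical.

```python
class RunningStats:
    """One-pass mean and variance of a numeric stream (Welford's method)."""

    def __init__(self) -> None:
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new value into the aggregates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


# Usage: feed values as they arrive; no past values are stored.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
```

Only three numbers are kept regardless of stream length. This also shows the drawback listed in the table: two very different streams can share the same mean and variance.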
