Time Series Data Mining Based on Feature Extraction with Middle Points and Clipping Method (Khai thác dữ liệu chuỗi thời gian dựa vào rút trích đặc trưng bằng phương pháp điểm giữa và kỹ thuật xén)


I HC QUC GIA TP H CH MINH TRNG I HC BCH KHOA NGUYN THNH SN KHAI PH D LIU CHUI THI GIAN DA VO RT TRCH C TRNG BNG PHNG PHP IM GIA V K THUT XẫN (TIME SERIES DATA MINING BASED ON FEATURE EXTRACTION WITH MIDDLE POINTS AND CLIPPING METHOD) LUN N TIN S KHOA HC MY TNH TP H CH MINH NM 2014 I HC QUC GIA TP HCM TRNG I HC BCH KHOA NGUYN THNH SN KHAI PH D LIU CHUI THI GIAN DA VO RT TRCH C TRNG BNG PHNG PHP IM GIA V K THUT XẫN (TIME SERIES DATA MINING BASED ON FEATURE EXTRACTION WITH MIDDLE POINTS AND CLIPPING METHOD) Chuyờn ngnh: Khoa hc mỏy tớnh Mó s chuyờn ngnh: 62.48.01.01 Phn bin c lp 1: TS Nguyn c Dng Phn bin c lp 2: TS V Tuyt Trinh Phn bin 1: PGS TS Nguyn Th Kim Anh Phn bin 2: PGS TS Phỳc Phn bin 3: PGS TS Qun Thnh Th NGI HNG DN KHOA HC PGS TS Dng Tun Anh LI CAM OAN Tỏc gi xin cam oan õy l cụng trỡnh nghiờn cu ca bn thõn tỏc gi Cỏc kt qu nghiờn cu v cỏc kt lun lun ỏn ny l trung thc, v khụng chộp t bt k mt ngun no v di bt k hỡnh thc no Vic tham kho cỏc ngun ti liu (nu cú) ó c thc hin trớch dn v ghi ngun ti liu tham kho ỳng theo yờu cu Tỏc gi lun ỏn Nguyn Thnh Sn i TểM TT khc phc c im lng ln ca d liu chui thi gian, nhiu phng phỏp thu gim s chiu da vo rỳt trớch c trng ó c xut v s dng Tuy nhiờn cú khụng ớt phng phỏp thu gim s chiu mc phi hai nhc im quan trng: mt s phng phỏp thu gim s chiu khụng chng minh c bng toỏn hc tha iu kin chn di v mt s phng phỏp khỏc khụng xut c cu trỳc ch mc thớch hp i kốm h tr vic tỡm kim tng t hu hiu úng gúp th nht ca lun ỏn ny l xut mt phng phỏp thu gim s chiu mi da vo im gia v k thut xộn, cú tờn l MP_C (Middle points and Clipping), v kt hp phng phỏp ny vi ch mc ng chõn tri h tr vic tỡm kim tng t mt cỏch hu hiu Qua lý thuyt v thc nghim, chỳng tụi chng minh c phng phỏp MP_C tha iu kin chn di, l iu kin nhm m bo khụng xy li tỡm sút tỡm kim tng t Thc nghim cũn cho thy phng phỏp MP_C hiu qu hn mt phng phỏp c a chung, phng phỏp xp x gp tng on (PAA- Piecewise Aggregate Approximation), v phng phỏp xộn d liu (Clipping) v c ba tiờu chớ: cht chn di, t l thu 
gim truy xut v thi gian thc thi Lun ỏn cũn cho thy phng phỏp MP_C cú th s dng hiu qu cho bi toỏn tỡm kim tng t trờn d liu chui thi gian dng lung, mt bi toỏn rt thi s, ó v ang c quan tõm nghiờn cu thi gian gn õy, da vo cỏch tớnh toỏn gia tng phng phỏp MP_C v chớnh sỏch cp nht ch mc trỡ hoón (deferred update policy) úng gúp th hai ca lun ỏn ny l vic ng dng thnh cụng phng phỏp thu gim s chiu MP_C v cu trỳc ch mc ng chõn tri vo ba bi toỏn quan trng khai phỏ d liu chui thi gian: gom cm, phỏt hin motif v d bỏo trờn d liu chui thi gian Vi bi toỏn gom cm, chỳng tụi dng tớnh cht a mc phõn gii ca phng phỏp MP_C cú th s dng gii thut I-k-Means gom cm d liu chui thi gian v xut thờm cỏch s dng kd-tree xỏc nh cỏc trung tõm cm ban u cho gii thut I-k-Means nhm khc phc nhc im ca gii thut ny chn cỏc trung tõm cm mc ng mt cỏch ngu nhiờn Vi bi toỏn phỏt hin motif, chỳng tụi xut hai gii thut phỏt hin motif xp x trờn d liu chui thi gian: (1) gii thut s dng R*-tree kt hp vi ý tng t b sm tớnh toỏn ii khong cỏch Euclid v (2) gii thut dng phng phỏp thu gim s chiu MP_C kt hp vi cu trỳc ch mc ng chõn tri Trong hai gii thut ny, gii thut th hai t cú hiu qu cao hn Vi bi toỏn d bỏo d liu chui thi gian, chỳng tụi dng phng phỏp thu gim s chiu MP_C kt hp vi cu trỳc ch mc ng chõn tri vo phng phỏp d bỏo tỡm kim k lõn cn gn nht (k-NN) v thc nghim cho thy phng phỏp ny cho kt qu d bỏo chớnh xỏc cao hn v thi gian d bỏo nhanh hn so vi mụ hỡnh mng n ron nhõn to (ANN) d bỏo vi d liu cú tớnh hay xu hng iii ABSTRACT To overcome high dimensionality of time series data, several dimensionality reduction methods, which is based on feature extraction, have been proposed and used However, a number of these methods did not provide any formal proof that they satisfy the lower bounding condition while many of them did not go with any multidimensional index structure which helps in fast retrieval The first contribution of this thesis is a new dimensionality reduction method based on Middle points and Clipping, called 
MP_C, which performs effectively with the support of Skyline index Through formal proof and experiments on benchmark datasets, we show that MP_C satisfies the lower bounding condition which guarantees no false dismissals Experimental results also reveal that MP_C is more effective than the popular dimensionality reduction method, Piecewise Aggregate Approximation (PAA) and the Clipping method in terms of tightness of lower bound, pruning ratio and running time We also proposed the extension of MP_C in Kontaki framework which can be applied effectively for similarity search in streaming time series The second contribution of this thesis is the application of MP_C method to the three important time series data mining tasks: clustering, motif detection and time series prediction As for clustering, we exploit the multi-resolution property of MP_C in using I-k-Means algorithm for time series clustering and propose the use of kd-tree in choosing initial centroids for I-k-Means algorithm in order to overcome the drawback of randomly determining the initial centroids in the first level of I-k-Means As for motif discovery, we propose two algorithms for finding approximate motif in time series data: (1) the algorithm that uses R*-tree combined with the idea of early abandoning in Euclidean distance computation and (2) the algorithm using MP_C associated with Skyline index; and between the two algorithms, the latter is more effective than the former As for time series prediction, we propose the use of MP_C with Skyline index in a prediction approach based on a k-nearest-neighbors algorithm and experiments show that the proposed method performs better than artificial neural network model in terms of prediction accuracy and computation time, especially for seasonal and trend time series iv LI CM N Xin by t lũng bit n sõu sc n Thy PGS TS Dng Tun Anh ó tn tỡnh hng dn, ng viờn, ch bo v úng gúp ý kin cho vic nghiờn cu v hon thnh Lun ỏn Tin s ny Tụi xin gi li cm n n cỏc Thy, Cụ khoa 
Khoa hc v K thut Mỏy tớnh trng i hc Bỏch khoa Tp H Chớ Minh, cỏc bn nhúm nghiờn cu v khai phỏ d liu chui thi gian ó úng gúp nhiu ý kin quớ bỏu cho vic nghiờn cu lun ỏn Tụi cng xin cm n cỏc ng nghip v bn bố khoa Cụng ngh Thụng tin trng i hc S phm K thut Tp H Chớ Minh ó luụn ng viờn, khớch l v to iu kin thun li giỳp tụi hon thnh lun ỏn ỳng hn Cm n ụng Nguyn Quang Chõu, Vit kiu M, ó h tr mt phn kinh phớ tụi cú th cụng b v thuyt trỡnh cụng trỡnh ca mỡnh ti hi ngh ACIIDS 2012 Cm n Giỏo s Tin s H Tỳ Bo (Vin Nghiờn cu Cao Cp Khoa hc v Cụng ngh Nht Bn) ó h tr kinh phớ tụi cú th d hi ngh ComManTel 2013 Tp H Chớ Minh, thỏng nm 2013 Tỏc gi Nguyn Thnh Sn v MC LC DANH MC CC HèNH NH ix DANH MC BNG BIU xiv DANH MC CC T VIT TT xvi CHNG GII THIU 1.1 D liu chui thi gian v cỏc bi toỏn khai phỏ d liu liờn quan 1.2 Mc tiờu, i tng v phm vi nghiờn cu 1.3 Nhim v v hng tip cn ca lun ỏn 1.4 Túm tt kt qu t c .7 1.5 Cu trỳc ca lun ỏn CHNG C S Lí THUYT V CC CễNG TRèNH LIấN QUAN 10 2.1 Cỏc o tng t 10 2.1.1 o Euclid 10 2.1.2 o xon thi gian ng 11 2.2 Thu gim s chiu chui thi gian 12 2.2.1 iu kin chn di 12 2.2.2 Cỏc phng phỏp thu gim s chiu da vo rỳt trớch c trng 13 2.2.3 V tớnh ỳng n v tớnh kh ch mc ca cỏc phng phỏp thu gim s chiu 21 2.3 Ri rc húa chui thi gian 22 2.4 Cu trỳc ch mc 23 2.4.1 R-tree .23 2.4.2 Ch mc ng chõn tri 25 2.5 Tỡm kim tng t trờn d liu chui thi gian 27 2.5.1 í tng tng quỏt 27 2.5.2 So trựng ton chui v so trựng chui 27 2.5.3 o khong cỏch nhúm v iu kin chn di nhúm 28 2.5.4 Cỏc phng phỏp tỡm kim tng t liờn quan 28 2.6 Tỡm kim tng t trờn chui thi gian dng lung 29 2.7 Phỏt hin motif trờn chui thi gian 32 2.7.1 Cỏc khỏi nim c bn v motif .32 2.7.2 Tng quan v mt s phng phỏp phỏt hin motif tiờu biu 36 2.8 Gom cm d liu chui thi gian 41 2.8.1 Gii thiu 41 2.8.2 Gii thut K-Means 42 vi 2.8.3 Gom cm bng thut toỏn I-k-Means 43 CHNG THU GIM S CHIU CHUI THI GIAN BNG PHNG PHP MP_C 46 3.1 Phng phỏp thu gim s chiu MP_C (Middle Points_Clipping) 46 3.2 o tng t khụng gian c trng MP_C 49 3.3 phc ca 
gii thut thu gim s chiu theo phng phỏp MP_C 52 3.4 Cu trỳc ch mc ng chõn tri cho cỏc chui thi gian c biu din bng MP_C 53 3.4.1 Vựng bao MP_C (MP_C_BR) 53 3.4.2 Hm tớnh khong cỏch gia chui truy Q v MP_C_BR 54 3.4.3 Ch mc ng chõn tri cho phng phỏp biu din MP_C 56 3.4.4 X lý cỏc cõu truy cú chiu di khỏc .58 3.5 Tỡm kim tng t trờn d liu chui thi gian dng lung da vo phng phỏp MP_C v ch mc ng chõn tri 60 3.6 Kt qu thc nghim 61 3.6.1 Thc nghim v bi toỏn tỡm kim tng t trờn d liu chui thi gian 62 3.6.2 Thc nghim v tỡm kim tng t trờn d liu chui thi gian dng lung 74 CHNG PHT HIN MOTIF DA VO CU TRC CH MC A CHIU HOC CH MC NG CHN TRI 78 4.1 Phng phỏp phỏt hin motif da vo cu trỳc ch mc a chiu v k thut t b sm 78 4.2 Phỏt hin motif xp x da trờn phng phỏp MP_C vi s h tr ca ch mc ng chõn tri 84 4.3 Thc nghim v bi toỏn phỏt hin motif 87 4.3.1 Thc nghim 1: So sỏnh ba gii thut dựng R*-tree, RP v R*-tree kt hp vi t b sm 88 4.3.2 Thc nghim 2: So sỏnh ba gii thut dựng R*-tree, RP v MP_C kt hp vi ch mc ng chõn tri 91 CHNG GOM CM CHUI THI GIAN C THU GIM THEO PHNG PHP MP_C BNG GII THUT I-K-MEANS 97 5.1 Túm tt mt s k thut chn trung tõm cm ng thut toỏn k-Means 97 5.2 Biu din chui thi gian nhiu mc xp x theo phng phỏp MP_C 99 5.3 Kd-tree 99 5.4 Dựng kd-tree to cỏc trung tõm cm ng cho thut toỏn I-k-Means 100 5.5 Dựng CF-tree to cỏc trung tõm cm ng cho thut toỏn I-k-Means 103 vii 5.5.1 c trng cm v CF-tree (Cluster Feature tree) 103 5.5.2 Dựng CF-tree to cỏc trung tõm cm cho thut toỏn I-k-Means 105 5.6 Thc nghim v bi toỏn gom cm 106 5.6.1 Cỏc tiờu chun ỏnh giỏ cht lng ca gii thut gom cm .106 5.6.2 D liu dựng thc nghim 108 5.6.3 Kt qu thc nghim v bi toỏn gom cm 109 CHNG D BO D LIU CHUI THI GIAN Cể TNH XU HNG HOC MA BNG PHNG PHP SO TRNG MU 115 6.1 Cỏc cụng trỡnh liờn quan 115 6.2 Xu hng v tớnh d liu chui thi gian 117 6.3 Hai phng phỏp d bỏo d liu chui thi gian 118 6.3.1 D bỏo chui thi gian bng mng n ron nhõn to .118 6.3.2 Phng phỏp xut: k-lõn cn gn nht .121 6.4 ỏnh giỏ bng thc nghim .123 CHNG KT LUN V 
HNG PHT TRIN 131 7.1 Cỏc úng gúp chớnh ca lun ỏn 131 7.2 Hn ch ca lun ỏn 132 7.3 Hng phỏt trin 133 CC TI LIU CễNG B CA TC GI 134 Cỏc cụng trỡnh liờn quan trc tip n lun ỏn .134 Cỏc cụng trỡnh liờn quan giỏn tip n lun ỏn 135 TI LIU THAM KHO 136 Ph lc A Chng minh o DMP_C(Q, C) tha cỏc tớnh cht ca mt khụng gian metric 148 viii TI LIU THAM KHO [1] R Hyndman Time Series Data Library [Online] http://www.datamarket.com [2] E Keogh, "A Tutorial on Indexing and Mining Time Series Data," in The IEEE International Conference on Data Mining (ICDM 2001), San Jose, USA, November 29, 2001 [3] S Babu and J Widom, "Continuous queries over data streams," ACM SIGMOD Record 30(3), pp 109-120, 2001 [4] M Kontaki, A N Papadopoulos, "Efficient similarity search in streaming Time sequences," in Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), Santorini, Greece, 2004, pp 63-72 [5] M Kontaki, A N Papadopoulos, Y Manolopoulos, "Adaptive similarity search in streaming time series with sliding windows," in Data and Knowledge Engineering, vol 16, no 6, 2007, pp 478-502 [6] J R Chen, "Useful clustering outcomes from meaningful time series clustering," in AusDM' 07 Proceedings of the sixth Australasian conference on Data mining and Analytics, 2007, pp 101-109 [7] C Gruber, M Coduro, B Sick, "Signature Verification with Dynamic RBF Networks and Time Series Motifs," in Proc of 10th Int Workshop on Frontiers in Handwriting Recognition, p 2006 [8] X Xi, E Keogh, L Wei, A Mafra-Neto, "Finding Motifs in a Database of Shapes," in Proc of SIAM, 2007, pp 249-270 [9] Y Jiang, C Li, J Han, "Stock temporal prediction based on time series motifs," in Proc of 8th Int Conf on Machine Learning and Cybernetics, 2009 [10] L Phu and D T Anh, "Motif-based Method for Initialization k-Means Clustering of Time Series Data ," in Proc of 24th Australasian Joint Conference (AI 2011), Dianhui Wang, Mark Reynolds (Eds.), LNAI 7106, Springer-Verlag, Perth, 
Australia, Dec 5-8, 2011, pp 11-20 [11] K Buza and L S Thieme, "Motif-based Classification of Time Series with Bayesian Networks and SVMs," in A Fink et al (eds.) Advances in Data Analysis, Data Handling and Business Intelligences, Studies in Classification, Data Analysis, Knowledge Organization Springer-Verlag, 2010, pp 105-114 136 [12] B Chiu, E Keogh, S Lonardi, "Probabilistic discovery of time series motifs," in Proc of the 9th International Conference on Knowledge Discovery and Data mining (KDD'03), 2003, pp 493-498 [13] P Beaudoin, S Coros, M van de Panne, Pierre Poulin, "Motion-motif graphs in time series," in SCA '08 Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2008, pp 117126 [14] J Meng, J Yuan, M Hans and Y Wu, "Mining Motifs from Human Motion," in Proc of EUROGRAPHICS, 2008 [15] D Minnen, C L Isbell, I Essa, T Starner, "Discovering Multivariate Motifs using Subsequence Density Estimation and Greedy Mixture Learning," in AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence, 2007, pp 615-620 [16] S Rombo and G Terracina, "Discovering representative models in large time series databases," in Proc of the 6th International Conference on Flexible Query Answering Systems, 2004, pp 84-97 [17] D Yankov, E Keogh, J Medina, B Chiu, V Zordan, "Detecting Motifs Under Uniform Scaling," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp 844-853 [18] Y Tanaka, K Iwamoto and K Uehara, "Discovery of Time Series Motif from Multi-Dimensional Data Based on MDL Principle," in Machine Learning 58, 2005, pp 269-300 [19] G P Zhang and M Qi, "Neural Network Forecasting for Seasonal and Trend Time Series," European Journal of Operational Research, vol 160, pp 501514, 2005 [20] R Nayak and P te Braak, "Temporal Pattern Matching for the Prediction of Stock Prices," in Ong, K.-L and Li, W and Gao, J., Eds Proceedings 2nd International Workshop on Integrating 
Artificial Intelligence and Data Mining (AIDM 2007) CRPIT, 84, Gold Coast, 2007, pp 99-107 [21] Q Yang and X Wu, "10 Challenging Problems in Data Mining Research," International Journal of Information Technology and Decision Making, vol 5, pp 597-604, 2006 [22] C.-S Perng, H Wang, S R Zhang, D S Parker, "Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases," in Proceedings of the IEEE Sixteenth International Conference on Data Engineering, 2000, p 33 137 42 [23] F.L Chung, T.C Fu, R Luk, V Ng, "Flexible Time Series Pattern Matching Based on Perceptually Important Points," in International Joint Conference on Artificial Intelligence Workshop on Learning from Temporal and Spatial Data, 2001, pp 1-7 [24] T.C Fu, F L Chung, R Luk, C M Ng, "A Specialized Binary Tree for Financial Time Series Representation," in The10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Temporal Data Mining, 2004 [25] E Fink, K B Pratt, "Indexing of compressing time series," in Mark Last, Abraham Kandel and Horst Bunke, editors Data mining in time series Databases, World Scientific, Singapore., 2003 [26] E Fink, H S Gandhi, "Compression of time series by extracting major extrema," Journal of Experimental & Theoretical Artificial Intelligence, vol 23, no 2, pp 255-270, Jun 2011 [27] A Ratanamahatana, E Keogh, A J Bagnall, S Lonardi, "A Novel Bit Level Time Series Representation with Implications for Similarity Seach and Clustering," in Proc 9th Pacific-Asian Int Conf on Knowledge Discovery and Data Mining (PAKDD05), Hanoi, Vietnam, 2005, pp 51-65 [28] S.-S Choi, S.-H Cha, C C Tappert, "A Survey of Binary Similarity and Distance Measures," Journal of Systemics, Cybernetics and Informatics, vol 8, no 1, pp 43-48, 2010 [29] S.-H Cha, "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions," International Journal of Mathematical Models and Methods in Applied Sciences, vol 1, no 4, pp 
300-307, 2007 [30] D Berndt and J Clifford, "Finding Patterns in time series: a dynamic programming approach," Journal of advances in Knowledge Discovery and Data Mining, pp 229-248, 1996 [31] M Vlachos, D Gunopulos, G Das, "Indexing Time Series under Condition of Noise," in M Last, A Kandel & H Bunke (Eds.), Data Mining in Time Series Databases World Scientific Publishing, 2004 [32] E Keogh, "Mining Shape and Time Series Databases with Symbolic Representations," in Tutorial of the 13rd ACM International Conference on Knowledge Discovery and Data mining (KDD 2007), 2007, pp 12-15 138 [33] J Han and M Kamber, Data Mining: Concepts and Techniques, Second Edition ed Morgan Kaufmann publishers, 2006 [34] E Keogh and C A Ratanamahatana, "Exact Indexing of Dynamic Time Warping," in VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases , 2002, pp 406-417 [35] C Faloutsos, M Ranganathan, Y Manolopoulos, "Fast Subsequence Matching in Time Series Databases," in Proceedings of the ACM SIGMOD International Conference on Management of Data, Minneapolis, NM, 1994, pp 419-429 [36] R Agrawal, C Faloutsos, A Swami , "Efficient similarity search in sequence databases," in Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, Chicago, 1993, pp 69-84 [37] K Chan and A W Fu, "Efficient Time Series Matching by Wavelets," in Proceedings of the 15th IEEE Int'l Conference on Data Engineering, Sydney, Australia, 1999, pp 126-133 [38] E Keogh, K Chakrabarti , M Pazzani , S Mehrotra , "Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases," in Proceedings of Conference on Knowledge and Information Systems, 2000, pp 263-286 [39] E Keogh, K Chakrabarti, S Mehrotra, M Pazzani, "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases," in Proceedings of ACM SIGMOD Conference on Management of Data, Santa Barbara, CA, 2001, pp 151-162 [40] I Popivanov, R J Miller, 
"Efficient Similarity Queries Over Time Series Data Using Wavelets," in Proceedings of the 18th International Conference on Data Engineering, San Jose, California, USA, 2002, pp 212-221 [41] B Lkhagva, Y Suzuki, and K Kawagoe, "New Time Series Data Representation ESAX for Financial Applications," in Proceedings of the International Special Workshop on Databases for Next-Generation Researchers (SWOD 2006) in conjunction with International Conference on Data Engineering, ICDE 2006 , Georgia, USA, 2006, pp 17-22 [42] E Keogh and M Pazzani, "An Enhanced Representation of Time Series which allows fast and accurate classification, clustering and relevance feedback," in Proc of KDD, 1998 [43] E Keogh, E S Chu, D Hart, M Pazzani, "An online Algorithm for Segmenting Time Series," in Proc of the IEEE International Conference on Data Mining, 139 California, USA, 2001, pp 289-296 [44] Q Chen, L Chen, X Lian, Y Liu and J X Yu, "Indexable PLA for Efficient Similarity Search," in Proc of Int Conf on Very Large Database (VLDB'07), Vienna, Austria, September 23-28, 2007, pp 435-446 [45] J Lin, E Keogh, S Leonardi, B Chiu, "A symbolic Representation of Time Series with Implications for Streaming Algorithms," in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, 2003, pp 2-11 [46] J Shieh and E Keogh, "iSAX: indexing and mining terabyte sized time series," in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp 623-631 [47] A Guttman, "R-trees: a Dynamic Index Structure for Spatial Searching," in Proc of the ACM SIGMOD Int Conf on Management of Data, 1984, pp 47-57 [48] N Beckmann, H Kriegel, R Schneider, B Seeger, "The R*-tree: An efficient and robust access method for points and rectangles," in Proc of 1990 ACM SIGMOD Conf., Atlantic City, NJ, 1990, pp 322-331 [49] Q Li, I Lúpez, B Moon, "Skyline Index for Time Series Data," in IEEE Trans on Knowledge and Data 
Engineering, vol 16, 2004, pp 669-684 [50] K Chakrabarti and S Mehrotra, "The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces," in Proceedings of 15th International Conference on Data Engineering, 1999, pp 440-447 [51] P Ciaccia, M Patella and P Zezula, "M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces," in Proceedings of the 23rd Conference on Very Large Databases (VLDB) , 1997, pp 426-435 [52] I Assent, M Wichterich, R Krieger, H Kremer, T Seidl, "Anticipatory DTW for Efficient Similarity Search in Time Series Databases," in Proc 35th International Conference on Very Large Data Bases (VLDB 2009), Lyon, France, 2009, pp 826-837 [53] K Kawagoe, T Bernecker, H Kriegel, M Renz, A Zimek,A Zufle , "Similarity Search in Time Series of Dynamical Model-based Systems," in Database and Expert Systems Applications (DEXA), 2010 Workshop on, Bilbao, 2010, pp 110-114 [54] J Nang, J Park, J Yang, S Kim, "A Hierarchical Bitmap Indexing Method for Similarity Search in High-Dimensional Multimedia Databases," Journal of 140 Information Science and Engineering, pp 393-407, 2010 [55] M Suntinger, H Obweger, J Schiefer, P Limbeck, G Raidl, "Trend-based similarity search in time-series data," in Proceedings of the Second International Conference on Advances in Database, Knowledge, and Data Applications DBKDA 2010, 2010, pp 97-106 [56] Y Yang, Y Xia, F Ge, Y Meng, H Yu, "A Trend Based Similarity Calculation Approach for Mining Time Series Data," in Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, 2012 [57] B Babcock, M Datar, R Motwani, L O'Callaghan , "Maintaining variance and k-medians over data stream windows," in PODS '03 Proceedings of the twentysecond ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, New York, NY, USA, 2003, pp 234-243 [58] X Liu and H Ferhatosmanoglu , "Efficient k-NN Search on Streaming Data Series," in Proceedings of the 8th International Symposium 
on Spatial and Temporal Databases (SSTD2003), Santorini, Greece, 2003, pp 83-101 [59] H Wu, B Salzberg, D Zhang, "Online event-driven subsequence matching over financial data streams," in SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, 2004, pp 23-34 [60] Y Chen, M A Nascimento, B Chin, O Anthony, K H Tung , "Spade: On shape-based pattern detection in streaming time series," in IEEE 23rd International Conference on Data Engineering ( ICDE 2007), 2007, pp 786795 [61] X Lian, L Chen, "Efficient Similarity Join over Multiple Stream Time Series," IEEE Transactions on Knowledge and Data Engineering, vol 21, no 11, pp 1544-1558, Nov 2009 [62] M Kontaki, A N Papadopoulos, Y Manolopoulos, "Continuous Processing of Preference Queries in Data Streams," in SOFSEM '10 Proceedings of the 36th Conference on Current Trends in Theory and Practice of Computer Science, 2010, pp 47-60 [63] A Marascu, S A Khan, T Palpanas, "Scalable Similarity Matching in Streaming Time Series," in PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, 2012, pp 218-230 [64] L Gao and X Wang, "Continually Evaluating Similarity-Based Pattern Queries on a Streaming Time Series," in SIGMOD '02 Proceedings of the 2002 ACM 141 SIGMOD international conference on Management of data, 2002, pp 370-381 [65] L Gao, Z Yao, X Wang, "Evaluating continuous nearest neighbor queries for streaming time series via pre-fetching," in CIKM '02 Proceedings of the eleventh international conference on Information and knowledge management, 2002, pp 485-492 [66] L Gao , X Wang , "Improving the Performance of Continuous Queries on Fast Data Streams: Time Series Case," in Proceedings of the ACM SIGMOD DMKD Workshop, Madison, WI, 2002 [67] L Gao and X Wang, "Continuous Similarity-Based Queries on Streaming Time Series," IEEE Transactions on Knowledge and Data Engineering, vol 17, No.10, 2005 [68] H Wu, B Salzberg, G 
Sharp, S Jiang, H Shirato, D Kaeli, "Subsequence Matching on Structured Time Series Data," in SIGMOD '05 Proceedings of the 2005 ACM SIGMOD international conference on Management of data, June 1416, 2005, pp 682-693 [69] X Lian and L Chen, "Efficient similarity search over future stream time series," in IEEE Transactions on Knowledge and Data Engineering, vol 20, no 1, 2008, pp 40-54 [70] J Lin, E Keogh, S Lonardi, P Patel , "Finding Motifs in Time Sries," in Proc 2nd Workshop on Temporal Data Mining, Edmonton, Alberta, Canada, 2002 [71] A Mueen, E Keogh , Q Zhu , S Cash, "Exact Discovery of Time Series Motifs," in Proc of SIAM Int on Data Mining, 2009, pp 473-484 [72] D Minnen, C Isbell, I Essa, T Starner, "Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery," in Seventh IEEE International Conference on Data Mining, 2007, pp 601-606 [73] P Ferreira, P Azevedo, C Silva, R Brito, "Mining approximate motifs in time series," in proc of the 9th Int Conf on Discovery Science., 2006, pp 89-101 [74] N Castro and P Azevedo, "Multiresolution Motif Discovery in Time Series," in Proceedings of the SIAM International Conference on Data Mining (SDM 2010), Columbus, Ohio, USA, 2010, pp 665-676 [75] H Tang and S S Liao, "Discovering original motifs with different lengths from time series," Know.-Based Syst 21,7, pp 666-671, Oct 2008 [76] A Metwally, D Agrawal, A El Abbadi , "Efficient Computation of Frequent and Top-k Elements in Data Streams," in Proceedings of the 10th International 142 Conference on Database Theory, 2005, pp 398-412 [77] Y Lin, M D McCool, A A Ghorbani, "Motif and Anomaly Discovery of TimeSeries Based on Subseries Join," in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS 2010), Hong Kong, 2010 [78] T Warren Liao, "Clustering of time series data - a survey," in Pattern Recognition Society, 2005 [79] J Lin, M Vlachos, E Keogh, D Gunopulos, "Iterative incremental 
clustering of time series," in Proceedings of 9th International Conference on Extending Database Technology, 2004, pp 106-122 [80] H Zhang, T B Ho, Y Zhang, M S Lin., "Unsupervised Feature Extraction for Time Series Clustering using Orthogonal Wavelet Transform," Journal Informatica, vol 30, No 3, pp 305-319, 2006 [81] J McQueen, "Some Methods for Classification and Analysis of Multivariate Observation," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol 1, Berkeley, CA, 1967, pp 281-297 [82] P.Bradley, U Fayyad and C Reina, "Scaling Clustering Algorithms to Large Databases," in Proceedings of the 4th Int'l Conference on Knowledge Discovery and Data Mining, New York, Aug 27-31, 1998, pp 9-15 [83] M Fayyad, C A Reina, P S Bradley, U Fayyad, C Reina, P S Bradley, "Initialization of Iterative Refinement Clustering Algorithms," in Prroceedings of the 4th International Coference on Knowledge Discovery and Data Mining, Aug 27-31, 1998, pp 194-198 [84] T Fu, "A Review on Time Series Data Mining," Engineering Applications of Artificial Intelligence, vol 24, pp 164-181, 2011 [85] E Keogh UCR time series Datamining http://www.cs.ucr.edu/~eamonn/time_series_data/ [86] PhysioBank Archive http://www.physionet.org/physiobank/database Archive Index [Online] [Online] [87] Economic Time Series Page [Online] http://www.economagic.com/ [88] Economics web Institute http://www.economicswebinstitute.org/ecdata.htm 143 [Online] [89] E Keogh and S Kasetty, "On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration," in the 8th ACM SIGKDD and Data Mining, Edmonton, Alberta, Canada, July 23-26, 2007, pp 102-111 [90] P Schọfer and M Hửgqvist, "SFA: A symbolic fourier approximation and index for similarity search in high dimensional datasets," in EDBT '12 Proceedings of the 15th International Conference on Extending Database Technology, 2012, pp 516-527 [91] M Vlachos, M Hadjieleftheriou, D Gunopulos, E Keogh, "Indexing 
multidimensional time-series with support for multiple distance measures," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003, pp. 216-225.
[92] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, E. Keogh, "Querying and mining of time series data: experimental comparison of representations and distance measures," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1542-1552, Aug. 2008.
[93] S. J. Redmond and C. Heneghan, "A method for initialising the k-Means clustering algorithm using kd-tree," in Pattern Recognition, vol. 28, 2007, pp. 965-973.
[94] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, Canada, 1990.
[95] G. P. Babu and M. N. Murty, "A near-optimal initial seed value selection in k-Means algorithm using a genetic algorithm," in Pattern Recognition Letters, vol. 14, no. 10, 1993, pp. 763-769.
[96] I. Katsavounidis, C.-C. Jay Kuo, and Zhen Zhang, "A new initialization technique for generalised Lloyd iteration," in IEEE Signal Processing Letters, no. 10, 1994, pp. 144-146.
[97] A. Likas, N. Vlassis, J. Verbeek, "The global k-Means clustering algorithm," in Pattern Recognition, vol. 36, 2003, pp. 451-461.
[98] J. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509-517, 1975.
[99] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: A new data clustering algorithm and its applications," Journal of Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141-182, 1997.
[100] V. L. Q. Nhon and D. T. Anh, "A BIRCH-based Clustering Method for Large Time Series Databases," in New Frontiers in Applied Data Mining - PAKDD 2011 International Workshops, Shenzhen, China, May 24-27, 2011, LNAI 7104, Springer-Verlag, pp. 148-159.
[101] M. Halkidi, Y. Batistakis, M. Vazirgiannis, "On clustering validation techniques," Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107-145, 2001.
[102] Y. Xiong and D. Y. Yeung, "Time series clustering with ARMA mixtures," Pattern Recognition, vol. 37, no. 8, pp. 1675-1689, 2004.
[103] K. K. Dhiral, K. Kalpakis, D. Gada, V. Puttagunta, "Distance measures for effective clustering of ARIMA time-series," in Proceedings of the 2001 IEEE International Conference on Data Mining, 2001, pp. 273-280.
[104] A. Strehl and J. Ghosh, "A knowledge reuse framework for combining multiple partitions," Journal of Machine Learning Research, vol. 3, no. 3, pp. 583-617, 2002.
[105] X. Z. Fern and C. E. Brodley, "Solving cluster ensemble problems by bipartite graph partitioning," in Proceedings of the 21st International Conference on Machine Learning, Article No. 36, 2004.
[106] S. Gelper, R. Fried, C. Croux, "Robust forecasting with exponential and Holt-Winters smoothing," Journal of Forecasting, vol. 29, no. 3, pp. 285-300, 2010.
[107] C. Chatfield, Time-series forecasting. New York, NY: Chapman and Hall, Inc., 2000.
[108] I.-B. Kang, "Multi-period forecasting using different models for different horizons: An application to U.S. economic time series data," International Journal of Forecasting, vol. 19, pp. 387-400, 2003.
[109] J. H. Kim, "Forecasting autoregressive time series with bias corrected parameter estimators," International Journal of Forecasting, vol. 19, pp. 493-502, 2003.
[110] S. D. Balkin and J. K. Ord, "Automatic neural network modeling for univariate time series," International Journal of Forecasting, vol. 16, pp. 509-515, 2000.
[111] E. Cadenas and W. Rivera, "Short-term wind speed forecasting in La Venta, Oaxaca, México, using artificial neural network," Renewable Energy, vol. 34, pp. 274-278, 2009.
[112] M. Ghiassi, H. Saidane and D. K. Zimbra, "A dynamic artificial neural network for forecasting series events," International Journal of Forecasting, vol. 21, pp. 341-362, 2005.
[113] S. Heravi and C. R. Birchenhall, "Linear versus neural network forecasting for European industrial production series," International Journal of Forecasting, vol. 20, pp. 435-446, 2004.
[114] G. Tkacz, "Neural network forecasting of Canadian GDP growth," International Journal of Forecasting, vol. 17, pp. 57-69, 2001.
[115] A. K. Palit and D. Popovic, Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications. Springer-Verlag London, 2005.
[116] K. J. Kim, "Financial time series forecasting using support vector machines," Neurocomputing, vol. 55, pp. 307-319, 2003.
[117] M. A. Mohandes, T. O. Halawani, S. Rehman, A. A. Hussain, "Support vector machine for wind speed prediction," Renewable Energy, vol. 29, pp. 938-947, 2004.
[118] A. T. Lora, J. R. Santos, J. C. R. Santos, A. G. Expósito, J. L. M. Ramos, "Time series prediction: Application to the short term electric energy demand," in Lecture Notes in Artificial Intelligence, 2004, pp. 577-586.
[119] F. M. Álvarez, A. T. Lora, J. C. Riquelme, J. S. Aguilar Ruiz, "Energy Time Series Forecasting Based on Pattern Sequence Similarity," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 8, pp. 1230-1243, Aug. 2011.
[120] A. Sorjamaa, J. Hao and A. Lendasse, "Mutual information and k-nearest neighbors approximator for time series prediction," in Artificial Neural Networks: Biological Inspirations, ICANN 2005: 15th International Conference, Warsaw, Poland, 2005, pp. 553-558.
[121] A. T. Lora, J. M. R. Santos, A. G. Expósito, J. L. M. Ramos, J. C. R. Santos, "Electricity Market Price Forecasting Based on Weighted Nearest Neighbors Techniques," IEEE Transactions on Power Systems, vol. 22, no. 3, pp. 1294-1301, Aug. 2007.
[122] Z. Huang and M. L. Shyu, "k-NN Based LS-SVM Framework for Long-Term Time Series Prediction," in The 11th IEEE International Conference on Information Reuse and Integration (IRI 2010), Tuscany Suites & Casino, Las Vegas, Nevada, USA, 2010, pp. 69-74.
[123] Z. Huang and M.-L. Shyu, "Long-Term Time Series Prediction using k-NN Based LS-SVM Framework with Multi-Value Integration," in Recent Trends in Information Reuse and Integration, T. Özyer, K. Kianmehr, and M. Tan, Eds. Springer Vienna, 2012, ch. 9, pp. 191-209.
[124] Z. Huang, M. L. Shyu, J. M. Tien, "Multi-Model Integration for Long-Term Time
Series Prediction," in The 13th IEEE International Conference on Information Reuse and Integration (IRI 2012), Tuscany Suites & Casino, Las Vegas, Nevada, USA, 2012.
[125] G. Zhang, B. E. Patuwo, M. Y. Hu, "Forecasting with artificial neural networks: The state of the art," International Journal of Forecasting, vol. 14, pp. 35-62, 1998.
[126] T. Kolarik, G. Rudorfer, "Time series forecasting using neural networks," ACM SIGAPL APL Quote Quad, vol. 25, no. 1, pp. 86-94, 1994.
[127] C. Hamzaçebi, "Improving artificial neural networks performance in seasonal time series forecasting," Information Sciences, vol. 178, pp. 4550-4559, 2008.
[128] T. Ash, "Dynamic node creation in backpropagation networks," Connection Science, vol. 1, no. 4, pp. 365-375, 1989.
[129] Spice-Neuro Neural Network Program [Online]. http://www.spice.ci.ritsumei.ac.jp/~thang/programs

Appendix A

Proof that the measure $D_{MP\_C}(Q, C)$ satisfies the properties of a metric space

It is easy to see that $D_{MP\_C}(Q, C)$ satisfies the following properties:
- $D_{MP\_C}(Q, C) \ge 0$,
- $D_{MP\_C}(C, C) = 0$,
- $D_{MP\_C}(Q, C) = D_{MP\_C}(C, Q)$.

Proof that $D_{MP\_C}(Q, C)$ satisfies the triangle inequality

We need to prove that $D_{MP\_C}(Q, C) \le D_{MP\_C}(Q, S) + D_{MP\_C}(S, C)$ for all sequences $Q'$, $C'$, $S'$ in the MP_C feature space corresponding to the original time series $Q$, $C$ and $S$.

- First, we prove that $D_2(Q', C') \le D_2(Q', S') + D_2(S', C')$.

We have

\[
D_2(Q', S') + D_2(S', C') = \sum_{j=1}^{N} \sum_{i=1}^{l} \left[ d_j^2(q'_{ji}, b^s_{ji}) + d_j^2(s'_{ji}, b^c_{ji}) \right]
\]

with

\[
d_j(q'_{ji}, b^s_{ji}) =
\begin{cases}
q'_{ji} & \text{if } (q'_{ji} > 0 \wedge b^s_{ji} = 0) \vee (q'_{ji} \le 0 \wedge b^s_{ji} = 1) \\
0 & \text{otherwise}
\end{cases}
\]

\[
d_j(s'_{ji}, b^c_{ji}) =
\begin{cases}
s'_{ji} & \text{if } (s'_{ji} > 0 \wedge b^c_{ji} = 0) \vee (s'_{ji} \le 0 \wedge b^c_{ji} = 1) \\
0 & \text{otherwise}
\end{cases}
\]

where $q'_{ji} = q_{ji} - \mu_{qj}$, with $q_{ji}$ the $i$-th chosen value in segment $j$ of time series $Q$; $s'_{ji} = s_{ji} - \mu_{sj}$, with $s_{ji}$ the $i$-th chosen value in segment $j$ of time series $S$; $\mu_{qj}$ and $\mu_{sj}$ are the mean values of segment $j$ of $Q$ and $S$ respectively; and $b^c_j$ and $b^s_j$ are the binary representations of the chosen points in segment $j$ of $C$ and $S$ respectively.

Thus

\[
d_j^2(q'_{ji}, b^s_{ji}) + d_j^2(s'_{ji}, b^c_{ji}) = q'^2_{ji} + s'^2_{ji}
\]

if $[(q'_{ji} > 0 \wedge b^s_{ji} = 0) \vee (q'_{ji} \le 0 \wedge b^s_{ji} = 1)] \wedge [(s'_{ji} > 0 \wedge b^c_{ji} = 0) \vee (s'_{ji} \le 0 \wedge b^c_{ji} = 1)]$, and […] in the other cases.

Apply the distributive property of the two operators $\wedge$ and $\vee$. In addition, if $b^s_{ji} = 0$ then $s'_{ji} = s_{ji} - \mu_{sj} \le 0$; conversely, if $b^s_{ji} = 1$ then $s'_{ji} > 0$. This gives a case expression in terms of $q'_{ji}$, $s'_{ji}$, $b^s_{ji}$ and $b^c_{ji}$ […]. Since $s'^2_{ji} \ge 0$, and by the properties of $\wedge$ and $\vee$,

\[
d_j^2(q'_{ji}, b^s_{ji}) + d_j^2(s'_{ji}, b^c_{ji}) \ge
\begin{cases}
q'^2_{ji} & \text{if } (q'_{ji} > 0 \wedge b^c_{ji} = 0) \vee (q'_{ji} \le 0 \wedge b^c_{ji} = 1) \\
0 & \text{otherwise}
\end{cases}
= d_j^2(q'_{ji}, b^c_{ji}).
\]

Conclusion: $D_2(Q', C') \le D_2(Q', S') + D_2(S', C')$.

- Next, we prove that $D_{MP\_C}(Q, C) \le D_{MP\_C}(Q, S) + D_{MP\_C}(S, C)$, that is,

\[
\sqrt{w \sum_{i=1}^{N} (\mu_{qi} - \mu_{ci})^2 + D_2(Q', C')} \le \sqrt{w \sum_{i=1}^{N} (\mu_{qi} - \mu_{si})^2 + D_2(Q', S')} + \sqrt{w \sum_{i=1}^{N} (\mu_{si} - \mu_{ci})^2 + D_2(S', C')}
\]

Let $A_i = \mu_{qi} - \mu_{si}$ and $B_i = \mu_{si} - \mu_{ci}$. Since both sides are non-negative, the inequality above is equivalent to:

\[
w \sum_{i=1}^{N} (\mu_{qi} - \mu_{ci})^2 + D_2(Q', C') \le w \sum_{i=1}^{N} A_i^2 + D_2(Q', S') + w \sum_{i=1}^{N} B_i^2 + D_2(S', C') + 2 \sqrt{w \sum_{i=1}^{N} A_i^2 + D_2(Q', S')} \sqrt{w \sum_{i=1}^{N} B_i^2 + D_2(S', C')} \tag{1}
\]

We have

\[
\begin{aligned}
w \sum_{i=1}^{N} (\mu_{qi} - \mu_{ci})^2 + D_2(Q', C')
&= w \sum_{i=1}^{N} (\mu_{qi} - \mu_{si} + \mu_{si} - \mu_{ci})^2 + D_2(Q', C') \\
&= w \sum_{i=1}^{N} (A_i + B_i)^2 + D_2(Q', C') \\
&= w \sum_{i=1}^{N} (A_i^2 + B_i^2 + 2 A_i B_i) + D_2(Q', C') \\
&= w \sum_{i=1}^{N} A_i^2 + w \sum_{i=1}^{N} B_i^2 + 2w \sum_{i=1}^{N} A_i B_i + D_2(Q', C') \\
&\le w \sum_{i=1}^{N} A_i^2 + D_2(Q', S') + w \sum_{i=1}^{N} B_i^2 + D_2(S', C') + 2w \sum_{i=1}^{N} A_i B_i
\end{aligned} \tag{2}
\]

(since $D_2(Q', C') \le D_2(Q', S') + D_2(S', C')$).

On the other hand,

\[
\begin{aligned}
2 \sqrt{w \sum_{i=1}^{N} A_i^2 + D_2(Q', S')} \sqrt{w \sum_{i=1}^{N} B_i^2 + D_2(S', C')}
&= 2 \sqrt{\Big(w \sum_{i=1}^{N} A_i^2 + D_2(Q', S')\Big) \Big(w \sum_{i=1}^{N} B_i^2 + D_2(S', C')\Big)} \\
&= 2 \sqrt{w^2 \sum_{i=1}^{N} A_i^2 \sum_{i=1}^{N} B_i^2 + D_2(Q', S')\, w \sum_{i=1}^{N} B_i^2 + D_2(S', C')\, w \sum_{i=1}^{N} A_i^2 + D_2(Q', S')\, D_2(S', C')} \\
&\ge 2w \sqrt{\sum_{i=1}^{N} A_i^2 \sum_{i=1}^{N} B_i^2}
\end{aligned}
\]

(since $w \ge 0$, $\sum_{i=1}^{N} B_i^2 \ge 0$, $\sum_{i=1}^{N} A_i^2 \ge 0$, $D_2(Q', S') \ge 0$ and $D_2(S', C') \ge 0$)

\[
\ge 2w \sqrt{\Big(\sum_{i=1}^{N} A_i B_i\Big)^2}
\]

(applying the Bunyakovsky-Cauchy-Schwarz inequality)

\[
\ge 2w \sum_{i=1}^{N} A_i B_i \tag{3}
\]

From (1), (2) and (3) it follows that $D_{MP\_C}(Q, C) \le D_{MP\_C}(Q, S) + D_{MP\_C}(S, C)$.

[...] ... in time series data mining is to carry them out in the feature space of the data. Thus, the first and most fundamental step before tackling time series data mining problems is to represent the time series in the feature space using some dimensionality reduction technique. The data mining tasks are then carried out in the feature ...
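The measure used in Appendix A can be illustrated with a minimal Python sketch. It assumes that the weight w equals the segment length and that the chosen points of a segment are evenly spaced; the helper names `pick_middle_points`, `mp_c_reduce` and `d_mp_c` are hypothetical and are not taken from the thesis. The case rule inside `d_mp_c` follows the definition of d_j given in the appendix.

```python
import math


def pick_middle_points(segment, n_points):
    # Assumption: the chosen points are the midpoints of n_points equal sub-segments.
    length = len(segment)
    return [segment[(2 * i + 1) * length // (2 * n_points)] for i in range(n_points)]


def mp_c_reduce(series, n_segments, n_points):
    """Return per-segment means and clipped bits of the chosen points."""
    seg_len = len(series) // n_segments
    means, bits = [], []
    for j in range(n_segments):
        seg = series[j * seg_len:(j + 1) * seg_len]
        mu = sum(seg) / seg_len
        means.append(mu)
        # Clipping: bit is 1 when the chosen value lies above the segment mean.
        bits.append([1 if v > mu else 0 for v in pick_middle_points(seg, n_points)])
    return means, bits


def d_mp_c(query, means_c, bits_c):
    """Distance between a raw query and the MP_C representation of a series C."""
    n_segments = len(means_c)
    seg_len = len(query) // n_segments
    d1 = 0.0  # w * sum of squared differences of segment means (w = seg_len, assumed)
    d2 = 0.0  # sum of d_j^2 over the chosen points
    for j in range(n_segments):
        seg = query[j * seg_len:(j + 1) * seg_len]
        mu_q = sum(seg) / seg_len
        d1 += seg_len * (mu_q - means_c[j]) ** 2
        chosen = pick_middle_points(seg, len(bits_c[j]))
        for q, b in zip(chosen, bits_c[j]):
            q_prime = q - mu_q
            # d_j = q' when the query point and the clipped bit disagree in sign.
            if (q_prime > 0 and b == 0) or (q_prime <= 0 and b == 1):
                d2 += q_prime ** 2
    return math.sqrt(d1 + d2)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


c = [4, 3, 2, 1, 1, 2, 3, 4]
q = [1, 2, 3, 4, 4, 3, 2, 1]
means, bits = mp_c_reduce(c, n_segments=2, n_points=2)
print(d_mp_c(q, means, bits), euclidean(q, c))
```

Under these assumptions the first printed value does not exceed the second, which is the lower-bounding behaviour the abstract of the thesis claims for MP_C.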
... time series data mining is ranked third among the 10 research directions expected to be the most important and challenging [21]. When studying time series data mining problems, researchers usually draw on techniques from fields such as data mining, machine learning, databases, pattern recognition, signal processing, bioinformatics, and so on. However, because time series data are usually very large, the mining algorithms ...
... In this chapter, we present an overview of time series and the important problems in time series data mining. Next come the objectives, subjects and scope of the thesis, and a summary of the results obtained. Finally, we describe the structure of the thesis. 1.1 Time series data and related data mining problems. A time series is a sequence of points ...
... streaming time series based on the idea of incremental dimensionality reduction and deferred index update. 1.4 Summary of results. For the first task of the thesis, we proposed a dimensionality reduction technique for time series data based on the middle-point method combined with clipping, called MP_C (Middle Points and Clipping). This technique is carried out by dividing the time series ...
... such as global extrema. Figure 2.8 illustrates this method. Figure 2.8: (a) the original data set, (b) the data set represented by landmarks, and (c) the data set represented by landmarks after the smoothing phase [22]. - Extreme-point method. In 2003, Fink and Pratt proposed a dimensionality reduction technique based on extracting the important points of a time series [25] ...
... the foundation of many other problems in time series data mining. This is a hard problem because time series data are usually large and because we cannot index time series data as easily as in traditional database systems. Some example applications of similarity search on time series are: - Find past periods in which the number of ...
... cannot work effectively with time series data. A motif in a time series is the pattern that occurs most frequently. Figure 1.2 shows an example of a motif: a subsequence that appears three times in a longer time series. Since it was formalized in 2002, motif discovery in time series data has been used to solve problems in many different application domains, for example ...
... in the number of sequences in the time series database or the length of the time series from which the subsequences are extracted. For that reason, many approximate motif discovery algorithms have been introduced ([13], [14], [12], [15], [16], [17]). These approaches usually have computational complexity O(n) or O(n log n), where n is the number of sequences in the time series database or the length of the time series from which ...
... continuously and are appended to the end of sequence C in time order. Because a streaming time series contains a large number of values, the similarity between two sequences is usually computed over the last W values (W is the sliding window length). Hence, if W = 1024, each sequence is regarded as a point in a 1024-dimensional space. The problems commonly studied in time series data mining include ...
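The sliding-window idea for streaming time series can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `last_w_values` and `euclidean` are hypothetical, and W is set to 4 instead of 1024 to keep the example small.

```python
import math
from collections import deque


def last_w_values(stream, w):
    """Keep only the most recent w values of a stream (the sliding window)."""
    window = deque(maxlen=w)  # older values fall out automatically
    for value in stream:
        window.append(value)
    return list(window)


def euclidean(a, b):
    """Euclidean distance between two windows of equal length."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


W = 4  # sliding window length (1024 in the text's example)
window_c = last_w_values([1, 2, 3, 4, 5, 6], W)
window_q = last_w_values([9, 9, 3, 4, 5, 8], W)
print(window_c, window_q, euclidean(window_c, window_q))
```

Each window is then a point in a W-dimensional space, so comparing two streams reduces to comparing two such points.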
... important problems in mining time series of equal length, namely: similarity search, clustering, motif discovery, and forecasting on time series data, of which similarity search is the fundamental problem. Because the Euclidean distance is widely used and easy to implement, in this thesis we study the above time series data mining problems only with the Euclidean distance. 1.3 Tasks and directions ...
... representation of time series ([1]). A streaming time series C is a time series whose values arrive continuously and are appended to the end of C in time order. Because a time series ...
... Time series data and related data mining problems. A time series is a sequence of data points measured at successive time intervals at a uniform time frequency. Figure 1.1 illustrates an example of a time series.

Date posted: 26/02/2016, 20:11
