A Fast Parallel Algorithm for Discovering Frequent Patterns

Kawuu W. Lin
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
linwc@cc.kuas.edu.tw

Yu-Chin Luo
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
kim-x@yahoo.com.tw

Abstract

Fast discovery of frequent patterns is among the most extensively discussed problems in data mining because of its wide range of applications. As the size of the database increases, the computation time and the required memory increase severely. The difficulty of mining large databases has motivated research on parallel and distributed algorithms. Most past studies parallelize the computation by dividing the database and distributing the parts to other nodes for mining. This approach may leak data and is therefore unsuitable for sensitive domains such as health care. In this paper, we propose a novel data mining algorithm named FD-Mine that efficiently utilizes the nodes to discover frequent patterns in cloud computing environments while preserving data privacy. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Keywords: Data mining; cloud computing; association rule mining; frequent pattern mining; privacy preserved

I. Introduction

With the progress of information technology, data mining techniques have been applied extensively in many domains. The goal of data mining is to discover hidden, useful information in large databases. The discovered information can support decision processes, aid commercial promotion, and so forth. Data mining includes four main topics: association rule mining [2], sequential pattern mining [3], clustering [11] and classification [5]. Among these, frequent pattern mining, i.e. association rule mining and sequential pattern mining, is the most widely discussed because of its wide range of applications. The basic idea of frequent pattern mining is to discover the patterns whose frequency of appearance in the database is greater than a specified threshold. An association rule has the form X => Y, where X and Y are sets of items. Association rule mining discovers the sets of items that tend to be associated with others in the database.

Studies on association rule mining can be classified into two types: 1) the generate-and-test (Apriori-like) approach [2] and 2) the frequent pattern growth (FP-growth-like) approach [6]. Apriori-like methods iteratively generate candidate itemsets of size (k+1) from frequent itemsets of size k and scan the database repeatedly to test the frequency of each candidate itemset. Apriori-like methods therefore suffer from a large number of candidate itemsets, especially when the support threshold is small. For this reason, Han et al. [6] proposed a novel data structure, the frequent pattern tree (FP-tree), in which the transactions are compressed and stored. A mining algorithm, FP-growth, was also proposed for discovering the frequent patterns from the FP-tree. FP-growth needs only two scans of the physical database and therefore greatly improves execution time.

As the size of the database increases, the computation time and the required memory increase severely. Many studies on association rule mining have aimed mainly at improving efficiency in terms of execution time. In the past decades, parallel and distributed computing (PDC) techniques have attracted extensive attention for their ability to manage and process large amounts of data.
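To make the generate-and-test idea concrete, the following is a minimal Python sketch of an Apriori-like loop. It is an illustration only, not code from the cited work: the function name `apriori`, the parameter `min_count` (an absolute support count, treated here as inclusive) and the toy database are assumptions made for this example.

```python
from itertools import combinations


def apriori(db, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in db]
    # Level-1 candidates: every single item that occurs in the database.
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        # Test step: one full database scan per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Generate step: join frequent k-itemsets into (k+1)-candidates and
        # keep only those whose k-subsets are all frequent.
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        level = {
            c for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy database used only for illustration.
db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
result = apriori(db, min_count=2)
```

Each pass of the `while` loop rescans all transactions, which is the repeated-scan cost that motivates the FP-tree approach.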
The difficulty of mining large databases has motivated research on parallel and distributed algorithms [7], [8], [10], [13], [14]. The main approach of existing studies is to divide the database and distribute each part to nodes or processors for mining, with the goal of distributing the computation load. During the mining process, the nodes exchange required transactions with each other. The data exchange workload among nodes becomes heavy when the average transaction length is long or the database is large. Although many algorithms have been proposed, the execution efficiency of frequent pattern mining remains a challenge because of the data explosion. In addition to the exchange workload, data privacy is a major concern, since this kind of algorithm duplicates the database to every node in the PDC architecture. This approach is evidently unsuitable for sensitive domains such as health care.

In this paper, we propose a novel data mining method named FD-Mine that efficiently utilizes the cloud nodes to discover frequent patterns quickly in cloud computing environments while preserving data privacy. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

The rest of the paper is organized as follows. We briefly review related work in Section 2. In Section 3, we propose the architecture and present the data mining algorithm. The empirical evaluation is presented in Section 4. Conclusions are given in Section 5.

Authorized licensed use limited to: LA TROBE UNIVERSITY. Downloaded on June 13, 2010 at 08:00:36 UTC from IEEE Xplore. Restrictions apply.

II. Related Work

To improve the performance of association rule mining, many researchers have tried to distribute the mining computation over more than one processor or node.
In [9], the authors proposed a parallel algorithm named Parallel FP-tree (PFP-tree), based on the FP-tree data structure, for mining frequent patterns on message-passing multiprocessor systems. The algorithm divides the database into several non-overlapping parts according to the number of available processors and lets each processor construct its FP-tree by exchanging the necessary information with the other processors. Because the algorithm runs on a single node, the data exchange takes place within that node, so the overhead might not be severe.

To parallelize frequent pattern mining, past studies relied mainly on the database dividing method [4], [15]. The database is divided equally or by some criterion, and each part is sent to a node for mining. An approach that duplicates the database to other nodes risks leaking the data, so data privacy cannot be preserved. Note that in cloud computing environments network latency is an important issue that should be considered carefully. The targeted database in mining applications is generally large, and transmitting the database and exchanging large amounts of data over the internet greatly degrades performance.

In [12], the proposed method, named QFP-growth, divides the database equally and constructs FP-trees from the assigned parts of the database. The FP-trees are then merged into a single FP-tree to complete the mining task. The data transmission overhead was studied in [14]. The authors observed that the time spent exchanging transactions is much greater than the mining time. To exchange transactions efficiently among nodes in the database dividing approach, TPFP-tree was proposed, which uses a transaction identification set (Tidset) to select transactions directly instead of scanning the physical database.
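A Tidset can be pictured as an inverted index from each item to the IDs of the transactions that contain it. The sketch below is a simplified illustration, not the data structure of [14]: the names `build_tidset` and `select_transactions` are hypothetical, and list positions are assumed to serve as transaction IDs.

```python
from collections import defaultdict


def build_tidset(db):
    """Map each item to the set of transaction IDs (list positions) containing it."""
    tidset = defaultdict(set)
    for tid, transaction in enumerate(db):
        for item in transaction:
            tidset[item].add(tid)
    return dict(tidset)


def select_transactions(db, tidset, item):
    """Fetch the transactions containing `item` by ID, without scanning `db`."""
    return [db[tid] for tid in sorted(tidset.get(item, ()))]


# Hypothetical toy database used only for illustration.
db = [["a", "b"], ["b", "c"], ["a", "c", "d"]]
tidset = build_tidset(db)
selected = select_transactions(db, tidset, "a")
```

In this sketch the index stores one ID per item occurrence, so its total size grows with the size of the indexed database.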
The Tidset is a table recording the IDs of the transactions that contain a certain item, so the memory required by the Tidset is the same size as the assigned partial database. Therefore, TPFP-tree is bounded by the size of the targeted database.

To balance the computing load of TPFP-tree, the authors of [15] proposed the BTP-tree algorithm, a balanced Tidset-based parallel FP-tree algorithm for mining frequent patterns. The algorithm divides the database equally into p parts, where p is the number of nodes. The partial databases are sent to the nodes individually. Each node establishes the Tidset and header table for its assigned database. A global header table, named GHT, is derived by gathering the header tables of all nodes into one table and filtering out the items whose support is smaller than the threshold. Before executing the mining task, the BTP-tree algorithm calculates a performance index for each node and records the sum of the performance indexes. The mining task is then separated into p sub-tasks, where the load of each sub-task is measured by the number of items in the header table. The task assignment is decided by the performance indexes. After the task assignment, each node constructs its Tidset for fast selection. The required transactions are exchanged among nodes to generate the new sub-databases by referring to the items of the header tables. Finally, FP-growth is performed on each node to discover the frequent patterns, and the frequent patterns are gathered from all nodes to obtain the complete set of frequent patterns.

III. Proposed Algorithm: FD-Mine

In this section, we describe the proposed algorithm, which efficiently distributes the computation in cloud computing environments. The cloud architecture for mining frequent patterns is introduced in Section 3.1. In Section 3.2, we formulate the problem. The details of the proposed algorithm are described in Section 3.3.
3.1 Proposed Cloud Architecture for Frequent Pattern Mining

Data privacy is an important issue in cloud computing environments. Since the clouds are physically distributed and each cloud node provides only its computation ability, the trustworthiness of the nodes cannot be guaranteed. Therefore, to preserve data privacy, only a node that is known to be safe, rather than every node, can access the database. In our architecture, we call this node the trusted node or kernel node, and the cloud in which it is located the kernel cloud. For efficient data transmission among clouds, each cloud is designed to have only one node that connects to other clouds, named the connection node and abbreviated as conn-node. If a node N needs data from the trusted node, N asks the conn-node of its cloud whether the conn-node has the data. If it does, N downloads the data from the conn-node via the intranet. Otherwise, the data is first duplicated to the conn-node via the internet, and N then downloads it from the conn-node via the intranet. With this transmission policy, network latency can be minimized.

Figure 1. Proposed architecture for frequent pattern mining. (Legend: physical machine; database; trusted node, a virtual machine; connection node, a virtual machine; computing node, a virtual machine.)

In this architecture, each conn-node maintains a table that records the status of the nodes in its cloud. The recorded information for each node contains the node's ID and its availability. All of the tables are gathered at the kernel node, so that the kernel node has complete information about the computation capacity in terms of available nodes. The information is updated periodically.
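The transmission policy and the status tables described in this section can be sketched as follows. This is a simplified in-process model written for illustration: the class names `TrustedNode` and `ConnNode`, the function `gather_status` and the node IDs are hypothetical, and real network transfers are replaced by method calls and a counter.

```python
class TrustedNode:
    """Holds the data and counts how often it is sent over the internet."""

    def __init__(self, data):
        self.data = data
        self.internet_sends = 0

    def send(self):
        self.internet_sends += 1
        return self.data


class ConnNode:
    """Single gateway of a cloud: caches data and tracks node availability."""

    def __init__(self, trusted):
        self.trusted = trusted
        self.cache = None
        self.status = {}  # node ID -> availability flag

    def fetch(self):
        # Only the first request from this cloud crosses the internet.
        if self.cache is None:
            self.cache = self.trusted.send()
        # Later requests are served from the cached copy (intranet transfer).
        return self.cache


def gather_status(conn_nodes):
    """Kernel-node view: IDs of all nodes currently marked available."""
    return sorted(nid for cn in conn_nodes for nid, ok in cn.status.items() if ok)


trusted = TrustedNode(data=b"payload")
cloud_a, cloud_b = ConnNode(trusted), ConnNode(trusted)
cloud_a.status = {"a1": True, "a2": False}
cloud_b.status = {"b1": True}
for conn in (cloud_a, cloud_a, cloud_b):  # three node requests from two clouds
    conn.fetch()
available = gather_status([cloud_a, cloud_b])
```

In this model the number of internet transfers depends on the number of clouds that request the data, not on the number of requesting nodes.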
3.2 Mining frequent patterns in cloud computing environments

One characteristic of the proposed algorithm is that data privacy is preserved. Unlike the parallel Apriori-like algorithms [4], which need to duplicate the database to remote nodes, or the BTP-tree algorithm [15], which distributes parts of the database directly to cloud nodes, only the kernel node is permitted to access the database in our architecture and algorithms. Besides the privacy leakage of the conventional algorithms, the time required to duplicate the physical database is considerable.

The data structure used by the proposed algorithms is based on that of FP-growth. The FP-tree is a data structure that stores the frequent items in compressed form. Because the items with support smaller than the support threshold are filtered out and the filtered transactions are merged into the FP-tree, retrieving the complete transaction of any user from the FP-tree is impossible. Moreover, because the FP-tree is often implemented as a linked list and our algorithm additionally compresses the FP-tree with ZIP to reduce the transmission time, the transactions will not be reversed. Data privacy can thus be preserved.

3.3 FD-Mine algorithm

The purpose of FD-Mine is fast mining. In cloud computing environments, distributing the mining computation entails data transmission over the network. In BTP-tree [15], the database is divided equally into several parts that are sent to the available nodes. The nodes then request the required data from each other to finish the mining task. In practice, the database is often large. This approach therefore not only leaks the data but also incurs a large amount of data transmission over the network, and its performance is expected to be poor. An intuitive way to save time is to minimize the amount of data transmission.
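As a rough illustration of what is transmitted instead of the database, the sketch below builds a small FP-tree with the usual two scans, drops infrequent items, and compresses the serialized tree. It rests on several assumptions that are not taken from the paper: the tree is a nested dictionary rather than a linked structure, items are strings, JSON is the serialization format, and Python's `zlib` (DEFLATE) stands in for the ZIP compression mentioned above. It does not by itself demonstrate the privacy claim.

```python
import json
import zlib
from collections import Counter


def build_fp_tree(db, min_count):
    """Two scans: count items, then insert filtered, frequency-ordered transactions."""
    counts = Counter(item for t in db for item in set(t))  # scan 1
    frequent = [i for i in counts if counts[i] >= min_count]
    ordered = sorted(frequent, key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    root = {"count": 0, "children": {}}
    for t in db:  # scan 2
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            # Shared prefixes reuse existing child nodes and only bump counts.
            node = node["children"].setdefault(item, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def pack(tree):
    """Serialize and compress the tree before transmission."""
    return zlib.compress(json.dumps(tree, sort_keys=True).encode("utf-8"))


def unpack(blob):
    """Decompress and deserialize a received tree."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical toy database; "x" and "y" are infrequent and get filtered out.
db = [["a", "b", "c"], ["a", "b"], ["a", "c", "x"], ["b", "c"], ["a", "b", "c", "y"]]
tree = build_fp_tree(db, min_count=2)
blob = pack(tree)
restored = unpack(blob)
```

On realistic data the compressed tree would be compared against the raw database size; the toy input here is too small for such a comparison to be meaningful.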
Our proposed FD-Mine is designed to transmit as little data as possible in order to save the time spent on network latency and disk I/O. The algorithm is presented in Figure 2 and described in detail below.

The trusted node TN follows the FP-tree construction algorithm, scanning the database twice, and constructs the corresponding FP-tree, which is stored in TN (line 1). The next step is to obtain the header table HT (line 2) and to divide HT into |N| disjoint sets, stored in IS (line 3). Since the frequent patterns are not predictable, HT is divided randomly with the goal of balancing the load of each node. For execution efficiency, the most important issue is that the amount of data transmission should be minimized. To this end, the FP-tree constructed on TN is duplicated to each idle node. In cloud computing environments, we also consider the problem of network latency. Since internet latency is always larger than intranet latency, the FP-tree duplication should be done over the intranet.

Algorithm FD-Mine
Input: A transaction database DB, a minimum support threshold ξ, the trusted node TN, and a set of nodes N with cloud architecture C
Output: The complete set of frequent patterns, FP
1  TN.FPT ← constructFPTree(DB, ξ)   // TN reads the DB and constructs the corresponding FP-tree
2  HT ← getHT(FPT)   // Obtain the header table of FPT
3  IS ← divideHT(|N|)   // Randomly divide the items of HT into |N| disjoint sets
4  FOR i = 1 TO |IS|
5    n ← selectNode(N, i)   // Select the ith node
6    cn ← selectConnNode(n, C)   // Select the conn-node of n
7    IF (isExistFPT(cn) == FALSE)
8      cn.FPT ← TN.FPT   // Duplicate FPT from TN if cn does not have FPT
9    ENDIF
10   n.FPT ← cn.FPT   // Duplicate FPT from the conn-node of n
11   is_i ← getSet(IS, i)   // Obtain the ith set of IS
12   fp_i ← N_i.BatchFPGrowth(is_i)   // Batch-run FP-growth for each conditional item in is_i to mine the frequent patterns
13   FP ← FP ∪ fp_i
14 ENDFOR
15 RETURN FP
Figure 2. FD-Mine Algorithm.
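The control flow of Figure 2 can be sketched in Python as below. This is a single-process illustration under stated assumptions, not the authors' Java implementation: the FP-tree is represented by the list of filtered, frequency-ordered transactions it would encode, `batch_growth` is a plain pattern-growth stand-in for BatchFPGrowth that works on prefix lists, the `clouds` dictionary layout and all function names are hypothetical, and node execution is sequential rather than parallel.

```python
import random
from collections import Counter


def build_compact_db(db, min_count):
    """Stand-in for FP-tree construction: filtered, frequency-ordered transactions."""
    counts = Counter(item for t in db for item in set(t))
    header = sorted((i for i in counts if counts[i] >= min_count),
                    key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(header)}
    compact = [tuple(sorted((i for i in set(t) if i in rank), key=rank.get))
               for t in db]
    return [t for t in compact if t], header


def grow(base, suffix, min_count, out):
    """Pattern growth over a conditional base of (prefix, count) pairs."""
    counts = Counter()
    for prefix, n in base:
        for item in prefix:
            counts[item] += n
    for item, n in counts.items():
        if n >= min_count:
            pattern = suffix | {item}
            out[pattern] = n
            # Conditional base of the extended pattern: what precedes `item`.
            cond = [(p[:p.index(item)], m) for p, m in base if item in p]
            grow(cond, pattern, min_count, out)


def batch_growth(compact, item_set, min_count):
    """Mine all frequent patterns whose last item (in header order) is in item_set."""
    out = {}
    for item in item_set:
        base = [(t[:t.index(item)], 1) for t in compact if item in t]
        out[frozenset([item])] = len(base)
        grow(base, frozenset([item]), min_count, out)
    return out


def fd_mine(db, min_count, clouds, seed=0):
    """clouds: {conn-node name: [computing node names]} (hypothetical layout)."""
    fpt, header = build_compact_db(db, min_count)                  # lines 1-2
    nodes = [(cn, n) for cn, members in clouds.items() for n in members]
    items = list(header)
    random.Random(seed).shuffle(items)
    item_sets = [items[i::len(nodes)] for i in range(len(nodes))]  # line 3
    conn_copies, internet_copies, patterns = {}, 0, {}
    for (cn, _node), item_set in zip(nodes, item_sets):            # lines 4-6
        if cn not in conn_copies:                                  # lines 7-9
            conn_copies[cn] = fpt
            internet_copies += 1
        node_copy = conn_copies[cn]                                # line 10
        patterns.update(batch_growth(node_copy, item_set, min_count))  # lines 11-13
    return patterns, internet_copies                               # line 15


# Hypothetical toy input: six transactions, two clouds, three computing nodes.
db = [["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "c"],
      ["a", "b", "c", "d"], ["e"]]
clouds = {"conn1": ["n1", "n2"], "conn2": ["n3"]}
patterns, internet_copies = fd_mine(db, 2, clouds)
```

Because every frequent pattern has exactly one last item in header order, the patterns mined for disjoint item sets do not overlap, and their union is the complete result regardless of how the header table is divided.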
Because the duplication should stay within the intranet wherever possible, the FP-tree duplication proceeds as follows. First, the algorithm selects an idle node n (line 5) and selects the connection node cn of n from the cloud architecture C (line 6). If cn has no duplicated FP-tree, TN duplicates one to cn (lines 7 to 9). Note that, to minimize the transmission overhead, the FP-tree should be compressed in advance. Afterwards, node n obtains the compressed FP-tree via the intranet and decompresses it (line 10). After receiving the FP-tree, node n is assigned a subset of IS (line 11) and batch-runs FP-growth for each conditional item in the subset to mine the frequent patterns (lines 12 and 13). Each node thus needs only one data transmission, namely the FP-tree duplication, and this transmission takes place over the intranet to minimize network latency. After all of the |N| disjoint sets have been processed, the frequent patterns are returned (line 15).

IV. Experimental Results

To evaluate the performance of the proposed algorithm, we use IBM's Quest Synthetic Data Generator [1] to generate the workload data for mining. The experiments were conducted on a cloud system with three clouds. The first cloud contains four nodes, including the kernel node, each equipped with an E8400 2.4 GHz CPU, 1 GB of available RAM and 320 GB of disk storage. The second and third clouds contain four and three nodes respectively, each equipped with a P8600 2.4 GHz CPU, 1 GB of available RAM and 160 GB of disk storage. Note that the kernel node is responsible for receiving the requests and is not used for mining. Therefore, a total of ten nodes can be used for mining in the system.

Because very few parallel, privacy-preserving frequent pattern mining algorithms exist and most existing parallel algorithms follow the database dividing approach, we select BTP-tree for comparison, which is one of the most efficient algorithms that can parallelize the mining task on grid systems. Both FD-Mine and BTP-tree were implemented in Java, with message passing among nodes and remote function calls implemented using Java RMI.

4.1 Effects of varying the number of cloud nodes

In the following experiments, we investigate the performance of FD-Mine in terms of execution time by varying the number of cloud nodes from 1 to 10. The results for database T20.I5.N100K.D100K are described first. The support threshold is set to 0.03%, a very small value, in order to verify the performance of both algorithms, FD-Mine and BTP-tree.

Figure 3. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D100K. (Plot: execution time versus number of nodes.)

Figure 3 shows that the required execution time of FD-Mine and BTP-tree decreases as the number of nodes increases. The execution time of FD-Mine is almost the same as that of BTP-tree when only one node is available. This is expected because both perform FP-growth on a single node. The execution time of FD-Mine is slightly more than that of BTP-tree when the number of nodes is 2 or 3, because the time spent on FP-tree compression and decompression is more than the time needed to transmit the divided parts of the database directly. With more than 3 nodes, FD-Mine shows the advantage of sending after compression: less time is required to complete the whole mining task.

Figure 4. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T40.I5.N100K.D100K. (Plot: execution time versus number of nodes.)

Figure 4 shows the impact on execution time when the average transaction length is increased to 40. FD-Mine delivers better performance than BTP-tree when the number of nodes is greater than 2. The reason is that BTP-tree, as a database dividing approach, needs to exchange transactions among nodes, and its performance suffers from the large number of exchanged transactions.

Figure 5. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D200K. (Plot: execution time versus number of nodes.)

Figure 5 shows the performance of FD-Mine and BTP-tree when the number of transactions is set to 200K. In this experiment, FD-Mine outperforms BTP-tree when the number of nodes is greater than 2, which demonstrates the intrinsic drawback of the database dividing approach.

Across this series of experiments, FD-Mine not only preserves data privacy but also delivers better performance than BTP-tree in terms of execution time, especially when the database is large.

4.2 Effects of varying the parameters of dataset

In this section, we study the effects of varying the support threshold and the data generator parameters, namely the number of transactions and the average transaction length. The two algorithms compared are FD-Mine and BTP-tree.

Figure 6. The execution time for FD-Mine and BTP-tree with support threshold varied on dataset T20.I5.N100K.D100K. (Plot: execution time versus support threshold in %.)

In Figure 6, we explore the impact on execution time of varying the support threshold from 0.05% to 0.01% with ten cloud nodes. FD-Mine always requires less time than BTP-tree. The efficiency of FD-Mine in execution time is achieved mainly by reducing the transmission overhead and the disk I/O time. In this experiment, the time required by FD-Mine is on average only about 82% of the execution time of BTP-tree.

V. Conclusions

In this paper, we have presented an algorithm named FD-Mine that efficiently utilizes the cloud nodes to discover frequent patterns in cloud computing environments while preserving data privacy. The proposed FD-Mine is composed of two algorithms, namely HD-Mine and FD-Mine. Conventional algorithms for mining datasets with a large number of frequent patterns are limited by the available memory. The proposed HD-Mine is able to discover the frequent patterns from this kind of dataset by merging the memory of several nodes. The proposed FD-Mine focuses on the fast discovery of frequent patterns by utilizing the cloud nodes and is useful for applications that emphasize real-time mining. Another important characteristic of the proposed algorithms is that data privacy is preserved. Unlike the parallel Apriori-like algorithms, which need to duplicate the database to remote nodes, or the BTP-tree algorithm, which distributes parts of the database directly to cloud nodes, in our architecture and algorithms the database is never duplicated and only the kernel node is permitted to access it. Through empirical evaluations under various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Acknowledgement

This research was partially supported by National Science Council, Taiwan, ROC under Grant No. 97-2218-E-151-003-MY2.

References

[1] R. Agrawal and R. Srikant, Quest Synthetic Data Generator, IBM Almaden Research Center, San Jose, California, http://www.almaden.ibm.com/cs/quest/syndata.html.
[2] R. Agrawal, T. Imielinski, A. Swami, "Mining association rules between sets of items in large databases," in: Proc. ACM SIGMOD Intl. Conf. Management of Data, 1993.
[3] R. Agrawal, R. Srikant, "Mining Sequential Patterns," in: Proc. of the 11th Int'l Conf. on Data Engineering, 1995, pp. 3-14.
[4] R. Agrawal, John C. Shafer, "Parallel Mining of Association Rules," IEEE Transactions on Knowledge and Data Engineering, December 1996.
[5] R. J. Bayardo, Jr., "Brute-force mining of high-confidence classification rules," in: Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, USA.
[6] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without Candidate Generation," Proc. of ACM Int. Conf. on Management of Data (SIGMOD), 2000, pp. 1-12.
[7] J.D. Holt, S.M. Chung, "Parallel mining of association rules from text databases on a cluster of workstations," Proceedings of 18th International Symposium on Parallel and Distributed Processing, 2004, pp. 86.
[8] P. Iko and M. Kitsuregawa, "Shared Nothing Parallel Execution of FP-growth," DBSJ Letters, Volume 2, No. 1, 2003, pp. 43-46.
[9] A. Javed, A. Khokhar, "Frequent Pattern Mining on Message Passing Multiprocessor Systems," Distributed and Parallel Databases, Volume 16, Issue 3, 2004, pp. 321-334.
[10] T. Li, S. Zhu, M. Ogihara, "A New Distributed Data Mining Model Based on Similarity," Symposium on Applied Computing, 2003, pp. 432-436.
[11] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
[12] Y. Qiu, Y. J. Lan and Q. S. Xie, "An improved algorithm of mining from FP-tree," Proceedings of the Third International Conference on Machine Learning and Cybernetics, 2004, pp. 26-29.
[13] E.-H. S. Han, G. Karypis, and V. Kumar, "Scalable parallel data mining for association rules," IEEE Transactions on Knowledge and Data Engineering, 12(3):352-377, 2000.
[14] J. Zhou, K.-M. Yu, "Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining Problem on PC Clusters," Lecture Notes in Computer Science 5036, 2008, pp. 18-28.
[15] J. Zhou, K.-M. Yu, "Balanced Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining on Grid System," Fourth International Conference on Semantics, Knowledge and Grid, 2008.
