DSpace at VNU: The lattice-based approaches for mining association rules: a review


Advanced Review

The lattice-based approaches for mining association rules: a review

Tuong Le (1,2) and Bay Vo (3,4)*

*Correspondence to: bayvodinh@gmail.com
(1) Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(2) Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
(3) Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
(4) College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea
Conflict of interest: The authors have declared no conflicts of interest for this article.

The traditional methods for mining association rules (ARs) include two phases: mining frequent itemsets (FIs)/frequent closed itemsets (FCIs)/frequent maximal itemsets (FMIs) and generating ARs from FIs/FCIs/FMIs. Lattice-based approaches (LBAs) for mining ARs are newer approaches that also include two phases: building a frequent itemset lattice (FIL)/frequent closed itemset lattice (FCIL) and generating ARs from the lattice. LBAs have a lower total mining time than the traditional methods for mining ARs. Besides, the most important advantage of LBAs is that the algorithms build the lattice only once and can then mine ARs with many different confidences or many different minimum supports (the thresholds have to be greater than or equal to the threshold used to build the lattice) without mining FIs/FCIs again. In this article, we describe a number of existing LBAs for mining ARs on static databases, including lattice building and rule generation. In addition, in today's online systems, the data often change through operations such as insert, delete, and update. Hence, a number of LBAs for mining ARs on dynamic databases are also covered. Finally, the complexity of the LBAs for mining ARs is analyzed.

© 2016 John Wiley & Sons, Ltd
How to cite this article: WIREs Data Mining Knowl Discov 2016, 6:140–151. doi: 10.1002/widm.1181
WIREs Data Mining and Knowledge Discovery, Volume 6, July/August 2016

INTRODUCTION

Data mining is the process of analyzing data to find knowledge for use in intelligent systems. Many data mining problems have been studied, such as mining association rules (ARs), classification [1–6], clustering [7,8], text mining [9], and their applications [2,10]. Mining ARs, including ARs, minimal non-redundant association rules (MNARs), and most generalization association rules (MGARs), is a model widely used in market basket analysis, online e-commerce such as Amazon, Alibaba, and so on, and several other recommendation systems.

Traditional approaches for mining ARs consist of two steps: mining frequent itemsets (FIs)/frequent closed itemsets (FCIs)/frequent maximal itemsets (FMIs) [11–13], and generating rules from those itemsets. Several variants of FIs have been proposed:

- high utility itemsets (itemsets whose utility satisfies a given threshold) [14–27];
- top-k high utility itemsets (the k itemsets with the highest utility) [28];
- weighted patterns (patterns with weighted items) [29–31];
- erasable itemsets (itemsets that can be eliminated without greatly affecting the factory's profit) [32–34];
- weighted erasable patterns (erasable itemsets that take the distinct weight of each item into account) [35,36].

Besides, several types of representations that limit the number of FIs have also been proposed, such as FCIs [37–41], FMIs [42–47], top-k FIs [48,49], top-rank-k FIs [50,51], and FIs with constraints [52].

In traditional approaches for mining ARs, researchers usually focus on the first phase (mining FIs/FCIs/FMIs). However, the second phase (rule generation) takes a lot of time when the number of FIs/FCIs/FMIs is large. Therefore, lattice-based approaches (LBAs) for mining ARs have been proposed to overcome this weakness. Generally, these approaches build a frequent itemset lattice (FIL) or a frequent closed itemset lattice (FCIL) in the first
phase. In the next phase, they only traverse the lattice to generate ARs. Because generating rules from a lattice has lower complexity than traditional approaches such as Apriori or hash tables, LBAs have a lower total mining time than the traditional methods for mining ARs. Moreover, the largest advantage of LBAs is that the algorithms build the lattice only once and can then mine ARs with many different confidences or many different minimum supports (the thresholds have to be greater than or equal to the threshold used to build the lattice) without mining FIs/FCIs/FMIs again. Therefore, LBAs are now extensively used to mine ARs.

In addition, in today's online systems, especially e-commerce systems, the data often change through operations such as add, delete, and update, which raises the need to adapt AR mining methods to the new requirements. There have been several studies on mining patterns/rules on dynamic databases.

In this article, we review LBAs for mining ARs, including the lattice building and rule generation phases. Furthermore, a number of LBAs for mining ARs on dynamic databases are also surveyed. Next, the complexity of LBAs for mining ARs is discussed. Finally, some challenges of the LBAs and their potential applications in the near future are introduced.

The rest of the article is organized as follows. The section "Classical Approaches for Mining ARs" presents the classical approaches for mining FIs/FCIs and mining ARs/MNARs/MGARs. In the section "The FIL/FCIL Building," we report the existing approaches for building FIL and FCIL. Next, the section "LBAs for Mining ARs" presents a number of LBAs for mining ARs. Some incremental LBAs for mining ARs are subsequently presented in the section "LBAs for Mining ARs on Dynamic Databases." Then, the section "Complexity Analysis" analyzes the complexity of LBAs for mining ARs. The conclusion is presented in the section "Conclusion and Future Researches."

CLASSICAL APPROACHES FOR MINING ARS

Consider a database (DB) comprising n transactions, each of which contains a number of items. The transaction database DBe in Table 1 is presented as an example and will be used for illustrative purposes throughout this article.

TABLE 1 | A Transaction Database (DBe) Example

Transaction   Items
1             A, C, T, W
2             C, D, W
3             A, C, T, W
4             A, C, D, W
5             A, C, D, T, W, E
6             C, D, T, E

The support of an itemset X, denoted by σ(X), is the number of transactions in DB that contain all items of X. An itemset X is an FI if and only if σ(X) ≥ ⌈minSup × n⌉, in which minSup is a user-given minimum support threshold. Currently, there are many algorithms for mining FIs, which may be divided into three main groups:

(1) Methods that use a candidate generate-and-test strategy: they generate frequent 1-itemsets, which are then used to generate candidate 2-itemsets, and so on until no more candidates can be generated. Apriori [53] and BitTableFI [54] are exemplar algorithms.
(2) Methods that adopt a divide-and-conquer strategy: they compress DB into a tree structure and mine FIs from this tree by using a divide-and-conquer strategy. FP-Growth [55] and FP-Growth* [56] are exemplar algorithms.
(3) Methods that use a hybrid approach: these methods use vertical data formats to compress DB and also mine FIs by using a divide-and-conquer strategy. Eclat [57], dEclat [58], Index-BitTableFI [59], DBVFI [60], PrePost [61], FIN [62], NSFI [63], and PrePost+ [64] are some examples.

An FI is called an FCI if none of its supersets has the same support. For instance, consider DBe and minSup = 50%. The itemsets AW and ACW are both FIs because σ(AW) = σ(ACW) = 4 > ⌈minSup × n⌉ = ⌈50% × 6⌉ = 3. However, AW is not an FCI because ACW is its superset and has the same support as AW. Of the two, only ACW is an FCI.

Most of the previously proposed algorithms for mining FCIs can be categorized as being either (1) generate-and-test, (2) divide-and-conquer, or (3) hybrid methods. The generate-and-test (Apriori-based) approach
uses a level-wise search to mine FCIs. A well-known algorithm is Close [65]. The divide-and-conquer approach adopts a divide-and-conquer strategy and uses some compact data structures to efficiently mine FCIs. Examples are CLOSET [39] and CLOSET+ [56]. The hybrid approaches integrate the previous two. Typically, the database is first transformed into a vertical data format or a compressed format. The approach then utilizes some pruning properties to quickly prune nonclosed itemsets. Examples are CHARM, dCHARM [58], DBV-Miner [41], DCI_PLUS [66], and NAFCP [37].

An AR is an implication expression of the form X → Y, where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an AR can be measured in terms of its confidence. The confidence of a rule (c) determines how frequently items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y)/σ(X). Each frequent k-itemset, XY, can produce up to 2^k − 2 ARs, ignoring rules that have empty antecedents or consequents (∅ → XY or XY → ∅). An AR can be extracted by partitioning the itemset XY into two nonempty subsets, X and Y, such that X → Y satisfies the confidence threshold (minConf). Note that all such rules must have already met the support threshold because they are generated from an FI.

Because rule generation from FIs is simple, there are relatively few studies on this stage; many studies focused on the stage of mining FIs/FCIs. Agrawal and Srikant [53] introduced the following property to reduce the search space: if c(AB → CD) ≥ minConf, then c(ABC → D) and c(ABD → C) are also at least minConf. This holds because moving items from the consequent to the antecedent cannot increase the antecedent's support, e.g., σ(ABC) ≤ σ(AB). Equivalently, if c(ABC → D) < minConf, then c(AB → CD) < minConf, so such rules need not be generated. An algorithm based on this property has been proposed to efficiently mine ARs from the FIs/FCIs generated in the first stage. This method has been used to mine ARs from FIs/FCIs so far.

Let X be an FCI. An itemset Y is a generator of X if and only if Y ⊂ X and σ(X) = σ(Y). For example, AW is a generator of ACW, because AW ⊂ ACW and σ(AW) = σ(ACW) = 4. Similarly, A and AC are also generators of ACW. Let G(X) be the set of X's generators. Y ∈ G(X) is a minimal generator if and only if no proper subset of Y is in G(X). For example, G(ACW) = {A, AC, AW}; therefore, the set of minimal generators of ACW is mG(ACW) = {A}.

An association rule R1: X1 → Y1 is an MNAR if there is no other AR R2: X2 → Y2 with σ(X1 ∪ Y1) = σ(X2 ∪ Y2), c(R1) = c(R2), X2 ⊆ X1, and Y2 ⊇ Y1. There are two kinds of MNARs:

(1) exact rules (confidence = 100%): the rules have the form X′ → X, where X is an FCI and X′ ∈ mG(X);
(2) approximate rules (confidence < 100%): the rules have the form X′ → Y, in which X and Y are FCIs, X′ ∈ mG(X), and X ⊂ Y.

Assume that there are two rules R1: X1 → Y1 and R2: X2 →
Y2. Rule R1 is said to be more general than R2 (R1 / R2) if and only if X1 ⊆ X2 and Y2 ⊆ Y1. Let R = {R1, R2, …, Rn} be the set of rules that satisfy the conditions of minSup and minConf. A rule Ri is said to have a higher precedence than another rule Rj, denoted as Ri > Rj, if Ri / Rj and one of the following conditions holds: (1) c(Ri) > c(Rj); (2) c(Ri) = c(Rj) and σ(Ri) > σ(Rj). Let RMG be the set of the MGARs of R: RMG = {Rj ∈ R | ¬∃ Ri ∈ R: Ri > Rj}.

THE FIL/FCIL BUILDING

LBAs for mining ARs are divided into two phases: (1) building the lattice and (2) generating ARs from the lattice. This section presents the existing approaches for building lattices. Some of the existing approaches for mining ARs from the lattices are then introduced in the section "LBAs for Mining ARs."

The FIL Building

In 2009, Vo and Le [67] proposed an algorithm for building an FIL (denoted FIL-2009) directly from the database (Table 2). In FIL-2009, each node in the lattice is a tuple ⟨X, Tidset, Children⟩, where X is a k-itemset, Tidset is the set of IDs of the transactions containing X, and Children = {Y | Y is a (k + 1)-itemset and X ⊂ Y}. FIL-2009 built for DBe in Table 1 with minSup = 50% is presented in Figure 1.

Although mining ARs from FIL-2009 is very effective, FIL-2009 is not an effective structure for mining MNARs. Therefore, in 2011, Vo and Le [68] extended the structure of FIL-2009 (denoted FIL-2011) by adding one field that records whether or not a lattice node is a minimal generator, and another field that records whether or not a lattice node is an FCI. These values are determined directly during lattice building. The structure is then used to effectively mine MNARs, as presented in the "LBAs for Mining ARs" section. With DBe in Table 1 and minSup = 50%, FIL-2011 is presented in Figure 2. In the figure, bold nodes and dashed nodes indicate FCIs and minimal generators, respectively.

When a node XA in an FIL-2009 (or FIL-2011) is created, FIL-Building-2009 (or FIL-Building-2011) has to find all the nodes that
are the children of XA to update the lattice. This process first visits all children of X (Y ∈ X.Children). With each Y, the process visits all children of Y (YB ∈ Y.Children). With each YB, if XA ⊂ YB, the process then adds YB to the children of XA (YB ∈ XA.Children).

TABLE 2 | Existing Approaches for Building Frequent Itemset Lattice (FIL)

No.   Name of Algorithm         Year   Structure
1     FIL-Building-2009 [67]    2009   FIL-2009
2     FIL-Building-2011 [68]    2011   FIL-2011
3     TFIL [69]                 2014   FIL-2014

[FIGURE 1 | FIL-2009 for DBe with minSup = 50%]

[FIGURE 2 | FIL-2011 for DBe with minSup = 50%]

Considering FIL-2009 in Figure 1, when the algorithm creates the node TC, it has to consider all the child nodes associated with T, which consist of AT and TW. Next, the algorithm has to consider all the child nodes associated with AT and TW, which are {ATW, ATC} and {ATW}. However, the process of considering all child nodes of TW does not find any node that is a child node of TC. The node ATW is a duplicate, which makes the process of considering all child nodes associated with TW unnecessary.

To overcome this weakness, Vo et al. [69] proposed a new structure for an FIL (denoted FIL-2014) and the TFIL algorithm for building FIL-2014. Each node on the lattice has the form ⟨Itemset, Tidset, ChildrenEC, ChildrenL⟩, in which ChildrenEC contains the child nodes based on the equivalence class feature associated with Itemset, and ChildrenL contains the child nodes based on the lattice feature associated with Itemset. Because this
algorithm does not scan all the child nodes of XA to update the lattice, the TFIL algorithm needs less time to build FIL-2014 than FIL-Building-2009 needs to build FIL-2009 or FIL-Building-2011 needs to build FIL-2011. For DBe in Table 1 and minSup = 50%, FIL-2014 is presented in Figure 3.

[FIGURE 3 | FIL-2014 for DBe with minSup = 50%]

The FCIL Building

In 2005, Zaki and Hsiao [58] proposed CHARM-L to create FCIL-2005 (Table 3). The FCIL-2005 created by CHARM-L for DBe with minSup = 50% is shown in Figure 4. However, MNARs and MGARs cannot be generated directly from FCIL-2005. Mining MNARs and MGARs from FCIL-2005 requires using a level-wise approach to generate generators; therefore, it is inefficient in terms of the mining time.

TABLE 3 | Existing Approaches for Building Frequent Closed Itemset Lattice (FCIL)

No.   Name of Algorithm          Year   Name of FCIL
1     CHARM-L [58]               2005   FCIL-2005
2     FCIL-Building-2013 [70]    2013   FCIL-2013
3     Snow-Touch [71]            2014   FCIL-2013

In 2013, Vo et al. [70] proposed FCIL-Building-2013 to build an FCIL (denoted FCIL-2013) effectively. First, FCIs with their minimal generators are mined using MG-CHARM [67]. Then, FCIL-Building-2013 inserts the FCIs into FCIL-2013 with O(n × k) complexity, where n is the number of FCIs and k is the average number of child nodes on the lattice. Since k
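
The definitions from the section on classical approaches (support, FI, FCI, minimal generator, confidence) and the ⟨X, Tidset, Children⟩ node layout described for FIL-2009 can be checked on DBe with a small Python sketch. This is a brute-force illustration under stated assumptions, not any of the surveyed algorithms: the names `DB`, `tidset`, `frequent`, `children`, `closed`, `minimal_generators`, and `rules` are hypothetical, and the transaction IDs 1 to 6 are assumed from the tidsets shown in the lattice figures.

```python
from itertools import combinations
from math import ceil

# DBe; transaction IDs 1..6 are an assumption taken from the figure tidsets.
DB = {
    1: {"A", "C", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "C", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W", "E"},
    6: {"C", "D", "T", "E"},
}
MIN_SUP = 0.5
threshold = ceil(MIN_SUP * len(DB))  # minimum number of supporting transactions


def tidset(itemset):
    """IDs of the transactions that contain every item of `itemset`."""
    return frozenset(tid for tid, items in DB.items() if set(itemset) <= items)


def support(itemset):
    """sigma(X): number of transactions containing all items of X."""
    return len(tidset(itemset))


# Brute-force enumeration of all frequent itemsets (only viable for tiny data).
items = sorted(set().union(*DB.values()))
frequent = {
    frozenset(c): tidset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= threshold
}

# Children as in the FIL-2009 node tuple: frequent (k+1)-itemsets containing X.
children = {
    x: sorted("".join(sorted(y)) for y in frequent
              if len(y) == len(x) + 1 and x < y)
    for x in frequent
}

# An FI is closed if no frequent proper superset has the same tidset.
closed = {x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)}


def minimal_generators(x):
    """Proper subsets of x with the same support that contain no other generator."""
    gens = [y for y in frequent if y < x and frequent[y] == frequent[x]]
    return {y for y in gens if not any(z < y for z in gens)}


def rules(min_conf):
    """Rules X -> (Z minus X) read from the stored tidsets, without re-mining."""
    return [(x, z - x, len(frequent[z]) / len(frequent[x]))
            for z in frequent for x in frequent
            if x < z and len(frequent[z]) / len(frequent[x]) >= min_conf]
```

Once `frequent` has been built, `rules(0.8)` and `rules(1.0)` can be called in turn without touching `DB` again, which mirrors the "build once, query with many confidence thresholds" advantage described for LBAs. A real lattice would follow the child links instead of comparing every pair of nodes as this sketch does.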

Date posted: 16/12/2017, 09:00

Table of contents

  • The lattice-based approaches for mining association rules: a review

    • INTRODUCTION

    • CLASSICAL APPROACHES FOR MINING ARS

    • THE FIL/FCIL BUILDING

      • The FIL Building

      • The FCIL Building

    • LBAS FOR MINING ARS

      • LBA for Mining ARs

      • LBA for Mining ARs with Interestingness Measures

      • LBA for Mining MNARs

      • LBA for Mining MGARs

    • LBAs FOR MINING ARs ON DYNAMIC DATABASES

    • COMPLEXITY ANALYSIS

    • CONCLUSION AND FUTURE RESEARCHES

    • REFERENCES
