Managing and Mining Graph Data

Cheng et al. [11, 12] treat $A \rightarrow D$ as an R-join (analogous to a $\theta$-join) and process a graph pattern matching query as a sequence of R-joins. The issue is how to select the join order. In [11] they propose a dynamic programming algorithm to determine the R-join order. In [12] they propose an R-join/R-semijoin approach. The basic idea is to divide the join-index based approach into two steps, namely filter and fetch. The filter step is similar to a semijoin, and the fetch step performs the join. Cheng et al. study in [12] how to select the R-join/R-semijoin order by interleaving R-joins with R-semijoins, again using dynamic programming.

Wang et al. [35] propose to process a query graph $G_q$ with a hash join based approach, and consider how to share the processing cost when several $\mathit{Alist}$ and $\mathit{Dlist}$ need to be processed simultaneously. Wang et al. propose three basic join operators, namely IT-HGJoin, T-HGJoin, and Bi-HGJoin. The IT-HGJoin processes a subgraph of a query with one descendant and multiple ancestors, for example $A \rightarrow D \wedge B \rightarrow D$. The T-HGJoin processes a subgraph of a query with one ancestor and multiple descendants, for example $A \rightarrow C \wedge A \rightarrow D$. The Bi-HGJoin processes a complete bipartite subgraph of a query with multiple ancestors and multiple descendants, for example $A \rightarrow C \wedge A \rightarrow D \wedge B \rightarrow C \wedge B \rightarrow D$. A general query graph $G_q$ is processed as a set of subgraph queries using IT-HGJoin, T-HGJoin, and Bi-HGJoin.

11. Conclusions and Summary

In this chapter, we presented a survey on reachability queries. We discussed several coding-based approaches using traversal, dual-labeling, tree cover, chain cover, path-tree cover, 2-hop cover, and 3-hop cover. We also addressed how to support distance-aware queries, such as finding the shortest distance between two nodes in a large directed graph using the 2-hop cover, and how to support graph pattern matching using the existing graph-based coding schema.
As future work, it will be important to study how the graph-based coding schema can support more real-world applications on large graphs.

References

[1] R. Agrawal, A. Borgida, and H. V. Jagadish. Efficient management of transitive relationships in large data and knowledge bases. In Proceedings of the 1989 ACM SIGMOD international conference on Management of data (SIGMOD 1989), 1989.
[2] K. Anyanwu and A. Sheth. ρ-queries: enabling querying for semantic associations on the semantic web. In Proceedings of the 12th international conference on World Wide Web (WWW 2003), 2003.
Graph Reachability Queries: A Survey
[3] B. Berendt and M. Spiliopoulou. Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9(1), 2000.
[4] R. Bramandia, J. Cheng, B. Choi, and J. X. Yu. Updating recursive XML views without transitive closure. To appear in VLDB J., 2009.
[5] R. Bramandia, B. Choi, and W. K. Ng. On incremental maintenance of 2-hop labeling of graphs. In Proceedings of the 17th international conference on World Wide Web (WWW 2008), 2008.
[6] D. Brickley and R. V. Guha. Resource Description Framework (RDF) Schema Specification 1.0. W3C Recommendation, 2000.
[7] N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD international conference on Management of data (SIGMOD 2002), 2002.
[8] L. Chen, A. Gupta, and M. E. Kurul. Stack-based algorithms for pattern matching on dags. In Proceedings of the 31st international conference on Very large data bases (VLDB 2005), 2005.
[9] Y. Chen and Y. Chen. An efficient algorithm for answering graph reachability queries. In Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), 2008.
[10] J. Cheng and J. X. Yu. On-line exact shortest distance query processing. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT 2009), 2009.
[11] J. Cheng, J. X. Yu, and B. Ding. Cost-based query optimization for multi reachability joins. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA 2007), 2007.
[12] J. Cheng, J. X. Yu, B. Ding, P. S. Yu, and H. Wang. Fast graph pattern matching. In Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), 2008.
[13] J. Cheng, J. X. Yu, X. Lin, H. Wang, and P. S. Yu. Fast computation of reachability labeling for large graphs. In Proceedings of the 10th International Conference on Extending Database Technology (EDBT 2006), 2006.
[14] J. Cheng, J. X. Yu, X. Lin, H. Wang, and P. S. Yu. Fast computing reachability labelings for large graphs with high compression rate. In Proceedings of the 11th International Conference on Extending Database Technology (EDBT 2008), 2008.
[15] J. Cheng, J. X. Yu, and N. Tang. Fast reachability query processing. In Proceedings of the 11th International Conference on Database Systems for Advanced Applications (DASFAA 2006), 2006.
[16] Y. J. Chu and T. H. Liu. On the shortest arborescence of a directed graph. Science Sinica, 14:1396–1400, 1965.
[17] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick. Reachability and distance queries via 2-hop labels. In Proceedings of the 13th annual ACM-SIAM symposium on Discrete algorithms (SODA 2002), 2002.
[18] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT Press, 2001.
[19] S. DeRose, E. Maler, and D. Orchard. XML linking language (XLink) version 1.0. 2001.
[20] S. DeRose, E. Maler, and D. Orchard. XML pointer language (XPointer) version 1.0. 2001.
[21] J. Edmonds. Optimum branchings. J. Research of the National Bureau of Standards, 71B:233–240, 1967.
[22] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for a web-site management system. SIGMOD Rec., 26(3), 1997.
[23] H. He, H. Wang, J. Yang, and P. S. Yu. Compact reachability labeling for graph-structured data. In Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management (CIKM 2005), pages 594–601, 2005.
[24] H. V. Jagadish. A compression technique to materialize transitive closure. ACM Trans. Database Syst., 15(4):558–598, 1990.
[25] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry. 3-HOP: A high-compression indexing scheme for reachability query. In Proceedings of the 2009 ACM SIGMOD international conference on Management of data (SIGMOD 2009), 2009.
[26] R. Jin, Y. Xiang, N. Ruan, and H. Wang. Efficiently answering reachability queries on very large directed graphs. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD 2008), 2008.
[27] D. S. Johnson. Approximation algorithms for combinatorial problems. In Proceedings of the 5th annual ACM symposium on Theory of computing (STOC 1973), 1973.
[28] L. Roditty and U. Zwick. A fully dynamic reachability algorithm for directed graphs with an almost linear update time. In Proceedings of the 36th annual ACM symposium on Theory of computing (STOC 2004), 2004.
[29] R. Schenkel, A. Theobald, and G. Weikum. Hopi: An efficient connection index for complex XML document collections. In Proceedings of the 9th International Conference on Extending Database Technology (EDBT 2004), 2004.
[30] R. Schenkel, A. Theobald, and G. Weikum. Efficient creation and incremental maintenance of the HOPI index for complex XML document collections. In Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), 2005.
[31] K. Simon. An improved algorithm for transitive closure on acyclic digraphs. Theor. Comput. Sci., 58(1-3):325–346, 1988.
[32] S. Trißl and U. Leser. Fast and practical indexing and querying of very large graphs. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD 2007), 2007.
[33] J. van Helden, A. Naim, R. Mancuso, M. Eldridge, L. Wernisch, D. Gilbert, and S. Wodak. Representing and analysing molecular and cellular function using the computer. Journal of Biological Chemistry, 381(9-10), 2000.
[34] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu. Dual labeling: Answering graph reachability queries in constant time. In Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), 2006.
[35] H. Wang, J. Li, J. Luo, and H. Gao. Hash-base subgraph query processing method for graph-structured XML documents. Proceedings VLDB Endowment, 1(1), 2008.
[36] H. Wang, W. Wang, X. Lin, and J. Li. Labeling scheme and structural joins for graph-structured XML data. In Proceedings of the 7th Asia-Pacific Web Conference on Web Technologies Research and Development (APWeb 2005), 2005.

Chapter 7

EXACT AND INEXACT GRAPH MATCHING: METHODOLOGY AND APPLICATIONS

Kaspar Riesen
Institute of Computer Science and Applied Mathematics, University of Bern
Neubrückstrasse 10, CH-3012 Bern, Switzerland
riesen@iam.unibe.ch

Xiaoyi Jiang
Department of Mathematics and Computer Science, University of Münster
Einsteinstrasse 62, D-48149 Münster, Germany
xjiang@math.uni-muenster.de

Horst Bunke
Institute of Computer Science and Applied Mathematics, University of Bern
Neubrückstrasse 10, CH-3012 Bern, Switzerland
bunke@iam.unibe.ch

Abstract
Graphs provide us with a powerful and flexible representation formalism which can be employed in various fields of intelligent information processing. The process of evaluating the similarity of graphs is referred to as graph matching. Two approaches to this task exist, viz. exact and inexact graph matching. The former approach aims at finding a strict correspondence between two graphs to be matched, while the latter is able to cope with errors and measures the difference of two graphs in a broader sense.
The present chapter reviews some fundamental concepts of both paradigms and shows two recent applications of graph matching in the fields of information retrieval and pattern recognition.

Keywords: Exact and Inexact Graph Matching, Graph Edit Distance, Information Retrieval by means of Graph Matching, Graph Embedding via Graph Matching

C.C. Aggarwal and H. Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_7, © Springer Science+Business Media, LLC 2010

1. Introduction

After many years of research, the fields of pattern recognition, machine learning and data mining have reached a high level of maturity [4]. Powerful methods for classification, clustering, information retrieval, and other tasks have become available. However, the vast majority of these approaches rely on object representations given in terms of feature vectors. Such object representations have a number of useful properties. For instance, the dissimilarity, or distance, of two objects can be easily computed by means of the Euclidean distance. Moreover, a large number of well-established methods for data mining, information retrieval, and related tasks in intelligent information processing are available.

Recently, however, a growing interest in graph-based object representation can be observed [16]. Graphs are powerful and universal data structures able to explicitly model networks of relationships between substructures of a given object. Thereby, the size as well as the complexity of a graph can be adapted to the size and complexity of a particular object (in contrast to vectorial approaches, where the number of features has to be fixed beforehand).

Yet, after the initial enthusiasm induced by the “smartness” and flexibility of graph representations in the late seventies, a number of problems became evident.
First, working with graphs is considerably more challenging than working with feature vectors, as even basic mathematical operations cannot be defined in a standard way, but must be provided depending on the specific application. Hence, almost none of the common methods for data mining, machine learning, or pattern recognition can be applied to graphs without significant modifications.

Second, graphs suffer from their own flexibility. For instance, computing the distance between a pair of objects, which is an important task in many areas, is linear in the number of data items in the case where vectors are employed. The same task for graphs, however, is much more complex, since one cannot simply compare the sets of nodes and edges, which are generally unordered and of different size. More formally, when computing graph dissimilarity or similarity one has to identify common parts of the graphs by considering all of their subgraphs. Given that there are $O(2^n)$ subgraphs of a graph with $n$ nodes, the inherent difficulty of graph comparisons becomes obvious.

Despite adverse mathematical and computational conditions in the graph domain, various procedures for evaluating proximity, i.e. similarity or dissimilarity, of graphs have been proposed in the literature [15]. The process of evaluating the similarity of two graphs is commonly referred to as graph matching. The overall aim of graph matching is to find a correspondence between the nodes and edges of two graphs that satisfies some more or less stringent constraints. That is, by means of the graph matching process, similar substructures in one graph are mapped to similar substructures in the other graph. Based on this matching, a dissimilarity or similarity score can eventually be computed indicating the proximity of two graphs.

Graph matching has been the topic of numerous studies in computer science over the last decades.
Roughly speaking, there are two categories of tasks in graph matching, viz. exact matching and inexact matching. In the former case, for a matching to be successful, it is required that a strict correspondence is found between the two graphs being matched, or at least among their subparts. In the latter approach this requirement is substantially relaxed, since matchings between completely non-identical graphs are also possible. That is, inexact matching algorithms are endowed with a certain tolerance to errors and noise, enabling them to detect similarities in a more general way than the exact matching approach. Therefore, inexact graph matching is also referred to as error-tolerant graph matching.

For an extensive review of graph matching methods and applications, the reader is referred to [15]. In this chapter, basic notations and definitions are introduced (Sect. 2) and an overview of standard techniques for exact as well as error-tolerant graph matching is given (Sect. 3 and 4). In Sect. 3, dissimilarity models derived from graph isomorphism, subgraph isomorphism, and maximum common subgraph are discussed for exact graph matching. In Sect. 4, inexact graph matching and in particular the paradigm of edit distance applied to graphs is discussed. Finally, two recent applications of graph matching are reviewed. First, in Sect. 5 an algorithmic framework for information retrieval based on graph matching is described. This approach is based on both exact and inexact graph matching procedures and aims at querying large database graphs. Secondly, a graph embedding procedure based on graph matching is reviewed in Sect. 6. This framework aims at an explicit embedding of graphs in real vector spaces, which establishes access to the rich repository of algorithmic tools for classification, clustering, regression, and other tasks, originally developed for vectorial representations.

2.
Basic Notations

Various definitions for graphs can be found in the literature, depending upon the considered application. It turns out that the definition given below is sufficiently flexible for a large variety of tasks.

Definition 7.1 (Graph). Let $L_V$ and $L_E$ be finite or infinite label alphabets for nodes and edges, respectively. A graph $g$ is a four-tuple $g = (V, E, \mu, \nu)$, where
$V$ is the finite set of nodes,
$E \subseteq V \times V$ is the set of edges,
$\mu : V \rightarrow L_V$ is the node labeling function, and
$\nu : E \rightarrow L_E$ is the edge labeling function.

The number of nodes of a graph $g$ is denoted by $|g|$, while $\mathcal{G}$ represents the set of all graphs over the label alphabets $L_V$ and $L_E$.

Definition 7.1 allows us to handle arbitrarily structured graphs with unconstrained labeling functions. For example, the labels for both nodes and edges can be given by the set of integers $L = \{1, 2, 3, \ldots\}$, the vector space $L = \mathbb{R}^n$, or a set of symbolic labels $L = \{\alpha, \beta, \gamma, \ldots\}$. If the nodes and/or the edges are labeled, the graphs are referred to as labeled graphs. Unlabeled graphs are obtained as a special case by assigning the same label $\varepsilon$ to all nodes and edges, i.e. $L_V = L_E = \{\varepsilon\}$.

Edges are given by pairs of nodes $(u, v)$, where $u \in V$ denotes the source node and $v \in V$ the target node of a directed edge. Commonly, the two nodes $u$ and $v$ connected by an edge $(u, v)$ are referred to as adjacent. A graph is termed complete if all pairs of nodes are adjacent. Directed graphs directly correspond to the definition above. In addition, the class of undirected graphs can be modeled by inserting a reverse edge $(v, u) \in E$ for each edge $(u, v) \in E$ with identical labels, i.e. $\nu(u, v) = \nu(v, u)$. In Fig. 7.1 some graphs (directed/undirected, labeled/unlabeled) are shown.

Figure 7.1. Different kinds of graphs: (a) undirected and unlabeled, (b) directed and unlabeled, (c) undirected with labeled nodes (different shades of gray refer to different labels), (d) directed with labeled nodes and edges.

Definition 7.2 (Subgraph). Let $g_1 = (V_1, E_1, \mu_1, \nu_1)$ and $g_2 = (V_2, E_2, \mu_2, \nu_2)$ be graphs. Graph $g_1$ is a subgraph of $g_2$, denoted by $g_1 \subseteq g_2$, if
(1) $V_1 \subseteq V_2$,
(2) $E_1 \subseteq E_2$,
(3) $\mu_1(u) = \mu_2(u)$ for all $u \in V_1$, and
(4) $\nu_1(e) = \nu_2(e)$ for all $e \in E_1$.

By replacing condition (2) in Definition 7.2 by the more stringent condition
(2') $E_1 = E_2 \cap (V_1 \times V_1)$,
$g_1$ becomes an induced subgraph of $g_2$. If $g_2$ is a subgraph of $g_1$, graph $g_1$ is called a supergraph of $g_2$.

Obviously, a subgraph $g_1$ is obtained from a graph $g_2$ by removing some nodes and their incident edges, as well as possibly some additional edges, from $g_2$. For $g_1$ to be an induced subgraph of $g_2$, some nodes and only their incident edges are removed from $g_2$, i.e. no additional edge removal is allowed. Fig. 7.2(b) and 7.2(c) show an induced and a non-induced subgraph of the graph in Fig. 7.2(a), respectively.

Figure 7.2. Graph (b) is an induced subgraph of (a), and graph (c) is a non-induced subgraph of (a).

3. Exact Graph Matching

The aim in exact graph matching is to determine whether two graphs, or at least parts of them, are identical in terms of structure and labels. A common approach to describe the structure of a graph is to define the adjacency matrix $\mathbf{A} = (a_{ij})_{n \times n}$ of graph $g = (V, E, \mu, \nu)$ with $|g| = n$. In this matrix the entry $a_{ij}$ is equal to 1 if there is an edge $(v_i, v_j) \in E$ connecting the $i$-th node $v_i \in V$ with the $j$-th node $v_j \in V$, and 0 otherwise.

Generally, for the nodes (and also the edges) of a graph there is no unique canonical order. Thus, for a single graph with $n$ nodes, up to $n!$ different adjacency matrices exist, since there are $n!$ possibilities to order the nodes of $g$. Consequently, for checking two graphs for structural identity, we cannot simply compare their adjacency matrices.
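The subgraph conditions of Definition 7.2 and the dependence of the adjacency matrix on the node order can be illustrated with a minimal Python sketch. The `Graph` class and the helpers `is_subgraph`, `adjacency_matrix`, and `unlabeled` are hypothetical names chosen only for this illustration; they are not taken from the chapter.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Graph:
    """Illustrative container for g = (V, E, mu, nu); not from the chapter."""
    nodes: set   # V
    edges: set   # E, a set of directed (u, v) pairs
    mu: dict     # node labeling function, node -> label
    nu: dict     # edge labeling function, (u, v) -> label


def is_subgraph(g1, g2, induced=False):
    """Check conditions (1)-(4) of Definition 7.2, using (2') if induced=True."""
    if not g1.nodes <= g2.nodes:                       # condition (1)
        return False
    if induced:                                        # condition (2')
        kept = {(u, v) for (u, v) in g2.edges
                if u in g1.nodes and v in g1.nodes}
        if g1.edges != kept:
            return False
    elif not g1.edges <= g2.edges:                     # condition (2)
        return False
    return (all(g1.mu[u] == g2.mu[u] for u in g1.nodes)        # condition (3)
            and all(g1.nu[e] == g2.nu[e] for e in g1.edges))   # condition (4)


def adjacency_matrix(g, order):
    """Adjacency matrix of g for one particular ordering of its nodes."""
    return tuple(tuple(1 if (u, v) in g.edges else 0 for v in order)
                 for u in order)


def unlabeled(nodes, edges):
    """Build an unlabeled graph: every node and edge gets the label 'eps'."""
    return Graph(set(nodes), set(edges),
                 {u: "eps" for u in nodes}, {e: "eps" for e in edges})


g = unlabeled("abc", [("a", "b"), ("b", "c"), ("a", "c")])
g_induced = unlabeled("ab", [("a", "b")])    # keeps every edge among a, b
g_plain = unlabeled("abc", [("a", "b")])     # drops edges between kept nodes

print(is_subgraph(g_induced, g, induced=True))   # True
print(is_subgraph(g_plain, g))                   # True
print(is_subgraph(g_plain, g, induced=True))     # False

# Each of the 3! node orderings of g gives a different matrix here.
matrices = {adjacency_matrix(g, order)
            for order in permutations(sorted(g.nodes))}
print(len(matrices))                             # 6
```

The example graph has no symmetries, so all six orderings produce distinct matrices. For graphs with symmetries some orderings coincide and fewer than $n!$ distinct matrices result.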
The identity of two graphs $g_1$ and $g_2$ is commonly established by defining a function, termed graph isomorphism, that maps $g_1$ to $g_2$.

Definition 7.3 (Graph Isomorphism). Let us consider two graphs denoted by $g_1 = (V_1, E_1, \mu_1, \nu_1)$ and $g_2 = (V_2, E_2, \mu_2, \nu_2)$, respectively. A graph isomorphism is a bijective function $f : V_1 \rightarrow V_2$ satisfying
(1) $\mu_1(u) = \mu_2(f(u))$ for all nodes $u \in V_1$,
(2) for each edge $e_1 = (u, v) \in E_1$, there exists an edge $e_2 = (f(u), f(v)) \in E_2$ such that $\nu_1(e_1) = \nu_2(e_2)$, and
(3) for each edge $e_2 = (u, v) \in E_2$, there exists an edge $e_1 = (f^{-1}(u), f^{-1}(v)) \in E_1$ such that $\nu_1(e_1) = \nu_2(e_2)$.

Two graphs are called isomorphic if there exists an isomorphism between them.

Figure 7.3. Graph (b) is isomorphic to (a), and graph (c) is isomorphic to a subgraph of (a). Node attributes are indicated by different shades of gray.

Obviously, isomorphic graphs are identical in both structure and labels. That is, a one-to-one correspondence between each node of the first graph and each node of the second graph has to be found such that the edge structure is preserved and node and edge labels are consistent.

Unfortunately, no polynomial runtime algorithm is known for the problem of graph isomorphism [25]. That is, in the worst case, the computational complexity of any of the available algorithms for graph isomorphism is exponential in the number of nodes of the two graphs. However, since the scenarios encountered in practice often differ from the worst case, and furthermore, the labels of both nodes and edges very often help to substantially reduce the complexity of the search, the actual computation time can still be manageable. Polynomial algorithms for graph isomorphism have been developed for special kinds of graphs, such as trees [1], ordered graphs [38], planar graphs [34], bounded-valence graphs [45], and graphs with unique node labels [18].
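Definition 7.3 can be checked mechanically for a candidate mapping. The following Python sketch represents a graph as a plain tuple (V, E, mu, nu) and uses two hypothetical helpers, `is_isomorphism` and `find_isomorphism`, which are not taken from the chapter. The search simply tries every bijection, so it illustrates the exponential worst case rather than any of the specialized polynomial algorithms cited above.

```python
from itertools import permutations


def is_isomorphism(f, g1, g2):
    """Check whether the dict f: V1 -> V2 satisfies Definition 7.3."""
    V1, E1, mu1, nu1 = g1
    V2, E2, mu2, nu2 = g2
    # f must be a bijection from V1 onto V2.
    if (set(f) != set(V1) or set(f.values()) != set(V2)
            or len(V1) != len(V2)):
        return False
    # Condition (1): node labels are preserved.
    if any(mu1[u] != mu2[f[u]] for u in V1):
        return False
    # Condition (2): every edge of g1 has an equally labeled image in g2.
    for (u, v) in E1:
        e2 = (f[u], f[v])
        if e2 not in E2 or nu1[(u, v)] != nu2[e2]:
            return False
    # Condition (3): every edge of g2 has an equally labeled preimage in g1.
    inverse = {w: u for u, w in f.items()}
    for (u, v) in E2:
        e1 = (inverse[u], inverse[v])
        if e1 not in E1 or nu1[e1] != nu2[(u, v)]:
            return False
    return True


def find_isomorphism(g1, g2):
    """Brute-force search over all bijections; exponential in |V|."""
    V1, V2 = list(g1[0]), list(g2[0])
    if len(V1) != len(V2):
        return None
    for image in permutations(V2):
        f = dict(zip(V1, image))
        if is_isomorphism(f, g1, g2):
            return f
    return None


# Two directed, labeled paths that differ only in the names of their nodes.
g1 = ({1, 2, 3}, {(1, 2), (2, 3)},
      {1: "x", 2: "y", 3: "x"}, {(1, 2): "e", (2, 3): "e"})
g2 = ({"a", "b", "c"}, {("c", "a"), ("a", "b")},
      {"c": "x", "a": "y", "b": "x"}, {("c", "a"): "e", ("a", "b"): "e"})
# Same structure as g2, but node "b" carries a different label.
g3 = (g2[0], g2[1], {"c": "x", "a": "y", "b": "z"}, g2[3])

print(find_isomorphism(g1, g2) == {1: "c", 2: "a", 3: "b"})   # True
print(find_isomorphism(g1, g3))                               # None
```

In the example the node labels already force node 2 onto node a, which hints at why labels often reduce the search effort in practice.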
Standard procedures for testing graphs for isomorphism are based on tree search techniques with backtracking. The basic idea is that a partial node matching, which assigns nodes from the two graphs to each other, is iteratively expanded by adding new node-to-node correspondences. This expansion is repeated until either the edge structure constraint is violated or node or edge labels are inconsistent. In this case a backtracking procedure is initiated, i.e. the last node mappings are iteratively undone until a partial node mapping is found for which an alternative extension is possible. Obviously, if there is no further possibility for expanding the partial node matching without violating the constraints, the algorithm terminates, indicating that there is no isomorphism between the considered graphs. Conversely, finding a complete node-to-node correspondence without violating any of the structure or label constraints proves that the investigated graphs are isomorphic. In Fig. 7.3 (a) and (b) two isomorphic graphs are shown.

A well-known, and despite its age still very popular, algorithm implementing the idea of a tree search with backtracking for graph isomorphism is described in [89]. A more recent algorithm for graph isomorphism, also based on the idea of tree search, is the VF algorithm and its successor VF2 [17]. Here the