Managing and Mining Graph Data part 58 ppsx

A Survey of Graph Mining Techniques for Biological Datasets

Two of the topological structures discovered with their toolkit are depicted in Figure 18.2. Such large structures cannot be obtained by using standard motif mining algorithms. As noted by the authors, the identified topological structures are mainly composed of polar (N, T, S), charged (K), and aromatic (W) residues, which is in agreement with the biophysics literature.

Figure 18.2. Frequent Topological Structures Discovered by TSMiner

3.2 Motif Discovery in Biological Networks

In addition to subgraphs that are frequent across many networks, substructures that are repeated frequently within a single large network can be useful for knowledge discovery. A motif of a graph is a substructure that is repeated considerably often inside the graph. There are two main approaches, frequency-based and statistical, for determining the significance of this repetition. The frequency-based approach considers a subgraph a motif if it occurs more than a threshold number of times. The statistical approach, on the other hand, labels a subgraph a motif if it occurs more often than expected with respect to random networks.

Network motifs can be particularly effective in understanding the modularity and the global structure of biological networks. For example, in the case of PPI networks, motifs can be useful for identifying protein complexes and other protein groupings that are related to the mechanics of the living organism. In the case of regulatory networks, motifs help in understanding gene regulation mechanisms, and they also enable researchers to develop models and experiments to study these mechanisms.

Milo et al. were the first to define network motifs and to find them in networks from biochemistry, neurobiology, ecology, and engineering [78]. They defined network motifs as patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks.
Their analysis revealed some common (and diverse) motifs across fields. As an example, they showed that the directed triangle motif, known as the feed-forward loop, exists in both transcription-regulatory and neural networks, whereas four-node feedback loops only emerge in electric circuits but not in biological systems. To identify such motifs, Milo et al. exhaustively enumerated all subgraphs of 𝑛 nodes in the studied networks, where 𝑛 is limited to 3 and 4. They then generated random networks while keeping the number of nodes, the number of links, and the degree distribution unchanged. Subgraphs of these random networks are counted, and these counts are used to determine which subgraphs are motifs.

As an alternative to exact counting, in a follow-up work they proposed a sampling method for subgraph counting [61]. Instead of enumerating subgraphs exhaustively, subgraphs are sampled to estimate their relative frequency. The method starts by picking a random edge from the network and then expands the corresponding subgraph iteratively by picking random neighboring edges. At each iteration, a list of all candidate edges is generated for the next random pick. The subgraph is expanded until it reaches a pre-defined size. Although it improves on the exhaustive search, this algorithm is also limited to finding small motifs. In the transcription network of E. coli, subgraph samples of sizes 3 to 8 have been reported. Higher-order motifs composed of five and six nodes in this network are tabulated in their study [61].

Protein-protein interaction networks accumulate pairwise or group-wise physical interactions of proteins into a network structure. Motifs of these networks can be utilized to characterize and better understand the group-level relations.
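The exact-counting procedure of Milo et al. described earlier in this section can be illustrated with a minimal Python sketch. It counts feed-forward loops in a small directed network and compares the count with counts from degree-preserving randomizations. The function names, the edge-swap randomization, the toy network, and the use of a z-score as the comparison statistic are assumptions made for illustration; the text above only states that counts in randomized networks are used to determine motifs.

```python
import random
from statistics import mean, pstdev


def count_ffl(edges):
    """Count feed-forward loops (a->b, b->c, a->c) in a directed edge list."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    count = 0
    for a, outs in succ.items():
        for b in outs:
            for c in succ.get(b, ()):
                if c != a and c in outs:
                    count += 1
    return count


def rewire(edges, n_swaps, rng):
    """Randomize by swapping edge targets; in- and out-degrees are preserved."""
    edges = list(edges)
    present = set(edges)
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or (a, d) in present or (c, b) in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def motif_zscore(edges, n_random=100, seed=0):
    """Return (real count, mean random count, z-score) for the feed-forward loop."""
    rng = random.Random(seed)
    real = count_ffl(edges)
    counts = [count_ffl(rewire(edges, 10 * len(edges), rng)) for _ in range(n_random)]
    mu, sigma = mean(counts), pstdev(counts)
    if sigma == 0:
        z = 0.0 if real == mu else float("inf") if real > mu else float("-inf")
    else:
        z = (real - mu) / sigma
    return real, mu, z


# Hypothetical toy network, not taken from any dataset in the text.
network = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w"), ("w", "v"),
           ("x", "w"), ("y", "w"), ("v", "u"), ("u", "x")]
real, expected, z = motif_zscore(network, n_random=50, seed=1)
```

A large positive z-score would suggest that the feed-forward loop is over-represented relative to the randomized ensemble. Exhaustive counting of this kind only stays practical for very small subgraph sizes, which is the limitation the sampling method addresses.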
For the identification of large motifs in Protein-Protein Interaction (PPI) networks, a scalable algorithm, NEtwork MOtif FINDER (NEMOFINDER) [19], has been proposed as an extension to subgraph mining algorithms. The algorithm first forms frequent trees of sizes 2 to 𝑘, which are then used to partition the graph into a set of graphs such that each graph embeds a size-𝑘 tree. In the next step, frequent size-𝑘 graphs are generated by performing graph join operations. The frequency of these size-𝑘 graphs can be counted in randomized networks. NEMOFINDER describes frequent subgraphs that are also unique as network motifs. The uniqueness of a size-𝑘 graph is calculated from its number of occurrences in the real input graph and in the randomized graphs, that is, from the number of times the subgraph is more frequent in the real graph than in the randomized graphs. Existing Apriori-based algorithms are not able to capture interesting network motifs that are both repeated and unique. The authors build their algorithm as an extension to the SPIN [54] algorithm, with the possibility of overlapping subgraphs. The input to the NEMOFINDER algorithm is a PPI network and user-defined thresholds for frequency, uniqueness, and maximal network size. The algorithm outputs network motifs that are frequent and unique with respect to the defined thresholds. Employing their algorithm on the PPI network of budding yeast, they discovered motifs up to size 12.

They later proposed an extension to NEMOFINDER, named LaMoFinder, which takes the labels of nodes into consideration [20]. When applying LaMoFinder to discover PPI network motifs, they used Gene Ontology terms as node labels [20]. They first mine an unannotated network for motifs. Next, the motifs are labeled with Gene Ontology functions. Their analysis showed that by incorporating labels they are able to capture not only the topological shapes but also the biological context of motifs.
Labeled motifs extracted from a real-world PPI network are employed for protein function prediction.

In a more recent work, Grochow and Kellis [39] proposed an algorithm that avoids the limitations of motif mining algorithms based on exact counting and subgraph sampling. Their algorithm works by exhaustively searching for instances of a single query graph in a network. They proposed a motif-centric alternative to existing methods, which is based on an improved isomorphism test that uses symmetry breaking. The algorithm identifies all instances of a query graph 𝐻, given a target network 𝐺. They extended the isomorphism test based on the most constrained neighbor concept: among the neighbors of the already-mapped nodes, the most constrained neighbor is the one with the fewest possible nodes it can be mapped to. They also introduced and enforced several symmetry-breaking conditions to make sure that there is a unique map from the query graph 𝐻 to each instance of 𝐻 in 𝐺. They utilized their algorithm to find motifs in two biological networks: the PPI network of S. cerevisiae and the transcriptional network of S. cerevisiae. The former is composed of 1379 nodes and 2473 edges, and motifs of 15 and 20 nodes can be identified in it with the proposed algorithm. From the latter, which has 685 nodes and 1052 edges, a 29-node motif that corresponds to the cellular transcription machinery has been identified. In addition to scaling to larger motifs, this algorithm also enables exploring motif clustering and querying a particular subgraph. Moreover, the algorithm is easy to parallelize by counting each subgraph on a separate processor.

4. Mining Graphs for the Discovery of Modules

Different forms of real-life associations between biological entities have been detected by various technologies, and these associations have been accumulated in public databases in the form of complex networks.
Understanding these complex structures often requires breaking them into small components and identifying the interactions between these components. These components are composed of nodes that are more related to each other than to outside nodes, and they are commonly referred to as communities or modules. Decomposition of a given graph into its modules can also be very effective in the analysis of biological networks. Some biological networks are naturally decomposed into such components and are commonly referred to as modular networks. Some examples of biological modules are transcriptional modules, protein complexes, gene functional groups, and signaling pathways.

The most well-known biological modular network is the Protein-Protein Interaction (PPI) network. The number and coverage of public databases that collect experimental data on the physical binding of proteins in diverse organisms have been increasing with the advancements in high-throughput techniques. Although there is no established standard database of PPIs today, there have been efforts to integrate existing interactions in publicly available databases. As of today, the Human Protein Reference Database (HPRD, http://www.hprd.org) includes 34,624 protein-protein interactions between human proteins that are derived from a number of platforms such as mass spectrometric analysis, yeast two-hybrid based protein-protein interaction, and co-immunoprecipitation and mass spectrometry-based protein-protein interaction. Similarly, another freely accessible database, BIOGRID [100], includes more than 238,634 raw interactions from various organisms including Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Homo sapiens. These large collections of protein interactions are naturally represented in the form of networks to facilitate the process of knowledge discovery.
The modular nature of these networks has been investigated with different algorithms, and the identified modules have been utilized for a better characterization of unknown proteins.

Gene co-expression networks are another example of biological networks that exhibit modular structure [15, 102]. In these networks, nodes represent genes, and edges connect genes that are expressed similarly over the studied conditions. A group of genes with a similar expression pattern can be defined as a gene module, whose members are likely to share a function [91, 102].

Regulatory networks are another modular biological network that has been extensively studied. They model the activation (or suppression) of a gene by specific DNA binding proteins in the form of a directed graph. Modules that can be deduced from regulatory networks correspond to sets of co-regulated genes together with their common regulators.

Given all these application areas, effective identification of modules from diverse biological networks has great potential for a better understanding of the studied organisms. In this section we discuss different methodologies that have been proposed for the detection of network modules or communities in biological graphs. Here, a community can be defined as a densely connected group of nodes, where only a few connections exist between different communities [80]. First, we look at algorithms that extract community structures from networks. Next, we discuss clustering algorithms that have been proposed to decompose the whole structure into subgroups, where similarity within groups is maximized and similarity between groups is minimized.
4.1 Extracting Communities

In the analysis of PPI networks, many scientists are particularly interested in isolating densely interacting regions, also known as communities, since they are presumed to be protein complexes or functional modules. A protein complex can be defined as a set of proteins that bind to each other in order to accomplish a cellular-level task. Identification of these structures is useful for understanding cell functioning and for predicting the functionality of unknown proteins. The interest in their identification is motivated by the fact that proteins that interact heavily with one another usually participate in the same biological processes. Thus, the discovery of dense subgraphs in PPI networks is recognized as an important task for the identification of protein complexes. Based on this underlying principle, a set of algorithms that employ locally dense regions of PPI networks to discover putative complexes have been proposed.

Bader et al. [8] proposed a three-step algorithm, Molecular COmplex DEtection (MCODE), to identify clusters of proteins that are heavily interacting. MCODE starts by weighting each node of the network based on the density of its local neighborhood. Next, nodes with high weights are assigned as seeds and, starting from these seed nodes, initial clusters are obtained by iteratively including neighboring nodes in the cluster. Finally, an optional third step is proposed to filter proteins according to a connectivity criterion. They evaluated MCODE on an integrated dataset of budding yeast that is composed of 9088 protein-protein interactions among 4379 proteins from the MIPS, YPD, and PreBIND databases. They predicted 166 complexes from this network, 52 of which matched known protein complexes in the MIPS database. MCODE is based on the observation that proteins share functions with their immediate neighbors.
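The first two MCODE steps described above (node weighting and seed expansion) can be sketched in a few lines of Python. This is a simplified illustration and not the published algorithm: the weight used here is the density of a node's closed neighborhood multiplied by the neighborhood size, and the `threshold` and `min_size` parameters, the function names, and the toy graph are assumptions introduced for the example. The optional third filtering step is omitted.

```python
from itertools import combinations


def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_weight(adj, v):
    """Density of the closed neighborhood of v, scaled by its size (assumed weighting)."""
    nodes = adj[v] | {v}
    n = len(nodes)
    if n < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nodes, 2) if b in adj[a])
    density = links / (n * (n - 1) / 2)
    return density * n


def mcode_like(adj, threshold=0.2, min_size=3):
    """Grow clusters from high-weight seeds, adding neighbors with comparable weight."""
    weight = {v: node_weight(adj, v) for v in adj}
    seen, clusters = set(), []
    for seed in sorted(adj, key=weight.get, reverse=True):
        if seed in seen:
            continue
        cutoff = weight[seed] * (1 - threshold)
        cluster, frontier = {seed}, [seed]
        seen.add(seed)
        while frontier:
            u = frontier.pop()
            for w in adj[u]:
                if w not in seen and weight[w] >= cutoff:
                    seen.add(w)
                    cluster.add(w)
                    frontier.append(w)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters


# Hypothetical toy graph: two 4-cliques joined through a bridge node "x".
clique_a = ["a1", "a2", "a3", "a4"]
clique_b = ["b1", "b2", "b3", "b4"]
edges = list(combinations(clique_a, 2)) + list(combinations(clique_b, 2))
edges += [("a4", "x"), ("x", "b1")]
clusters = mcode_like(build_adj(edges))
```

On this toy graph the two cliques are recovered as separate clusters, and the sparsely connected bridge node is left unassigned because its weight falls below the cutoff of either seed.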
In a more recent work, Chua et al. utilized another observation based on level-2 interactions in PPI networks [22]. They derived a topological weighting scheme, the Functional Similarity Weight (FS-Weight), that enables weighting both direct and indirect (i.e., 'level-2') interactions. FS-Weight makes use of the estimated reliability of each interaction to reduce the impact of noise. The reliability of each experimental source is estimated by the fraction of unique interactions in which at least one level-4 Gene Ontology term is shared. FS-Weight also favors two proteins that share many common neighbors from a reliable source. The number of non-common neighbors is also included in the calculation in order to reduce potential false positive inferences. Based on FS-Weights, the studied PPI network is expanded with 'level-2' interactions and filtered by eliminating interactions with small FS-Weights. After this preprocessing step, they identify cliques in the modified PPI network and iteratively merge cliques to form larger subgraphs that are still dense.

More recently, Li et al. [73] proposed an algorithm named DECAFF (Dense Neighborhood Extraction using Connectivity and conFidence measures), which employs the Hub Removal algorithm [86]. DECAFF initially identifies local dense neighborhoods of each protein by iteratively removing nodes with low degrees from the local neighborhoods. These local cliques are merged with the dense subgraphs detected by the Hub Removal algorithm [86] based on a Neighborhood Affinity criterion. The Neighborhood Affinity of two subgraphs is calculated based on their size and the number of their common neighbors. Finally, DECAFF improves the quality of the final clusters by removing subgraphs with low reliability scores.
The reliability of a subgraph is defined as the average reliability of all interactions of that subgraph, where the reliability of an interaction is deduced from the functional relevance of its two interacting proteins.

In addition to PPI networks, scientists are also interested in identifying community structures in gene co-expression networks. Expression profiles obtained through microarray studies can be transformed into gene co-expression networks, where nodes represent genes and two nodes are linked if the corresponding genes behave significantly similarly across different samples (i.e., co-expression). Scientists are particularly interested in the problem of identifying gene subnetworks that have similar expression patterns under different conditions [103], since they have been theorized to have the same cellular function [30].

To find gene groups that have similar expression patterns, Hartuv and Shamir proposed an algorithm that recursively splits the weighted co-expression graph into its highly connected components [43]. A highly connected component is defined as a subnetwork that includes at least two nodes, i.e., 𝑛 > 1, and that can only be disconnected by the removal of more than 𝑛/2 edges. Their algorithm, Highly Connected Subgraphs (HCS), splits the network into subgraphs at each iteration until a highly connected component is identified.

Shamir and Sharan [94] proposed an extension of the HCS algorithm, CLICK (CLuster Identification via Connectivity Kernels). In each step of their algorithm, a minimum cut of the input graph is computed, which yields two subgraphs. Subgraphs that satisfy a certain criterion are labeled as kernels. Each kernel is attributed with a fingerprint similarity that is calculated based on its elements.
After all the kernels are identified, nodes that are not part of any kernel are analyzed further; the ones that are similar to a kernel are included in that kernel, and the kernel's fingerprint is re-calculated. This is the adoption step of the algorithm. Next, kernels that are similar enough are merged and the adoption operation is repeated. The adoption and kernel merging steps are repeated until there are no more changes in the kernel structures. The final kernels are output as the gene clusters obtained by the CLICK algorithm. They have shown that their algorithm outperforms existing clustering algorithms when applied to gene expression datasets originating from various studies, such as the yeast cell cycle dataset or the response of human fibroblasts to serum.

Regulatory modules can be inferred from diverse datasets including ChIP-chip, motif, and gene expression datasets. A regulatory module is composed of a set of genes that are co-regulated by a common set of regulators. In order to identify such modules from ChIP-chip data and gene expression profiles, the GRAM algorithm has been proposed [10]. A set of genes that are bound by the same regulator set is obtained from the ChIP-chip binding p-values with an exhaustive search. Subsequently, a subset of this set whose genes are similarly expressed is selected to serve as a seed. Then, the algorithm identifies genes that are expressed similarly to the seed genes and that are connected to the same set of transcription factors based on a relaxed binding criterion.

Lemmens et al. improved the GRAM algorithm by incorporating motif data as an additional source [71]. In the seed discovery step, they discover seeds composed of genes that are co-expressed (deduced from mRNA measurements), that bind to the same regulators (deduced from ChIP-chip data), and that have the same motifs in their intergenic regions (deduced from motif data).
They employed an Apriori-like algorithm in order to identify such seeds, and a p-value is assigned to assess the quality of each seed. In the second step, seed extension, the gene content of the seeds is extended. For this purpose, each gene is ranked according to its correlation with the mean expression profile of the seed genes, and the genes that are similar enough (according to a cut-off) are included in the module. They employed their algorithm for the discovery of budding yeast regulatory modules by integrating ChIP-chip, motif, and gene expression datasets.

4.2 Clustering

Clustering algorithms can also be effective in identifying the modules of biological networks. In contrast to community discovery approaches, clustering (or graph partitioning) decomposes the whole network structure into groups. A clustering algorithm places every node of the graph into a community or a module.

To elucidate gene functions at a global scale, clustering of gene co-expression networks has been investigated. Since genes that are on the same pathways or belong to the same functional complexes are often co-regulated, they often exhibit similar expression patterns under diverse conditions. Thus, identifying and studying groups of highly interacting genes in co-expression networks is an important step towards characterizing genes at a global scale. For this purpose, a variety of existing graph partitioning algorithms can be leveraged. Spectral methods that target weighted cuts [96] form an important class of such algorithms. Multi-level graph partitioning algorithms such as Metis [60] and Graclus [27] are well known to scale well for large networks. Divisive/agglomerative approaches have also been popular in network analysis [80], but they are expensive and do not scale well [16].
Markov Clustering (MCL) [28], a graph clustering algorithm based on (stochastic) flow simulation, has proved to be highly effective at clustering biological networks [14]. A variant of this algorithm known as MLR-MCL [89] has been proposed recently to address the scalability of the MCL algorithm. In addition to these diverse graph partitioning algorithms, other classical clustering algorithms have also been employed, e.g., hierarchical clustering [99], k-means clustering [76], and self-organizing maps [65].

Besides the application of standard clustering algorithms, clustering algorithms that are more suitable for the specific task have been studied. Among these are the biclustering algorithms, which identify a group of genes that behave similarly only for a subset of all conditions. Given a gene expression matrix of samples and genes, biclustering algorithms perform clustering in two dimensions simultaneously [21]. The identified biclusters are statistically significant sub-matrices defined by a subset of genes and a subset of samples. Cheng and Church proposed a greedy approach to find maximal-sized biclusters that satisfy a certain condition on the residue scores [21]. Their algorithm identifies each bicluster separately by iteratively removing rows and columns until the mean squared residue score of the sub-matrix (an assessment of the quality of the bicluster) is smaller than a threshold, and then by iteratively adding rows and columns while the quality assessment score does not exceed the threshold. Each run of the algorithm identifies one sub-matrix (bicluster), and the next bicluster is identified after the found sub-matrix is masked by randomization. Using this algorithm, they identified biclusters in gene expression datasets of human and yeast. Later, Koyuturk et al. proposed an approach that associates statistical significance with the extracted biclusters.
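The mean squared residue score used by Cheng and Church, together with the row and column deletion phase described above, can be sketched as follows. For a sub-matrix with row set I and column set J, the residue of an entry is its value minus its row mean, minus its column mean, plus the overall mean of the sub-matrix. The sketch covers only the deletion phase; the addition phase and the masking by randomization are omitted, and the function names, the `delta` parameter name, and the toy matrix are assumptions made for illustration.

```python
def mean_squared_residue(matrix, rows, cols):
    """Return the mean squared residue of a sub-matrix and the squared residue of each entry."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    residue = {
        (i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    }
    score = sum(residue.values()) / (len(rows) * len(cols))
    return score, residue


def delete_until_coherent(matrix, delta):
    """Greedily drop the worst row or column until the score is at most delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while True:
        score, residue = mean_squared_residue(matrix, rows, cols)
        if score <= delta or len(rows) <= 1 or len(cols) <= 1:
            return rows, cols, score
        row_score = {i: sum(residue[i, j] for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(residue[i, j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Hypothetical expression matrix: rows 0-2 are shifted copies of one pattern,
# row 3 does not follow the pattern.
expression = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [9.0, 1.0, 5.0],
]
kept_rows, kept_cols, final_score = delete_until_coherent(expression, delta=0.01)
```

Rows that differ only by an additive shift have a residue of zero, so the deletion phase removes the row that breaks the pattern and keeps the remaining rows and all columns.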
To discover binary biclusters in a quantized gene expression matrix, they formulate the task as an optimization problem based on a statistical significance objective. Fast heuristics are proposed to solve this optimization problem in a scalable manner. The algorithm is tested on a quantized breast tumor gene expression matrix [67].

Tanay et al. converted the biclustering problem into a graph-theoretic problem using bipartite modeling [106]. Initially, the expression data is converted into a bipartite graph of genes and samples. More formally, a graph 𝐺(𝑉, 𝑆, 𝐸) is constructed, where 𝑉 is the set of genes, 𝑆 is the set of conditions, and there exists an edge between 𝑣 and 𝑠, (𝑣, 𝑠) ∈ 𝐸, if gene 𝑣 is expressionally responsive in sample 𝑠. This modeling reduces the biclustering problem to the problem of finding the densest subgraphs in 𝐺. Since the identification of the heaviest bi-clique is an NP-complete problem, the authors restricted the search space by assuming a degree bound on one side of the bipartite graph. Later, Tanay applied the SAMBA algorithm to the gene expression dataset of 96 human tissue samples [105]. In that work, they compared their approach against Cheng and Church's algorithm [21] and observed that the biclusters from SAMBA are better in terms of their statistical significance.

Figure 18.3. Benefits of Ensemble Strategy for Community Discovery in PPI networks in comparison to community detection algorithm MCODE and clustering algorithm MCL. The Y-axis represents -log(p-value).

An ensemble clustering algorithm has also been studied on biological networks to generate a more robust clustering compared to individual clustering algorithms [6]. Cluster ensembles can be defined as a mapping from a set of clusterings generated by a variety of sources into a single consensus clustering arrangement. Asur et al. proposed an ensemble clustering for the PPI decomposition problem.
First, different topological weighting schemes are proposed to generate different views of the unweighted PPI network. Next, these different views are clustered with different algorithms to obtain a set of base clusterings of the network. These clusterings are integrated into a Cluster Membership Matrix, which is reduced in size using PCA to eliminate redundancy and to scale the consensus determination problem. Subsequently, standard hierarchical clustering algorithms are utilized for computing the consensus clustering (recursive bisections (PCA-rbr) and agglomerative clustering (PCA-agglo)). When compared with existing community detection and clustering algorithms, they observed that their algorithm is able to produce topologically and biologically more significant clusters (as shown in Figure 18.3). The Y-axis represents the distribution of Gene Ontology enrichment p-values. Smaller p-values represent groups that are more significantly enriched with a particular Gene Ontology term.

In addition to biclustering and ensemble clustering strategies, scientists have also studied soft clustering algorithms for biological networks, which enable assigning multiple cluster memberships to multi-faceted biological entities. To enable multiple cluster membership for proteins while identifying PPI clusters, Asur et al. [6] proposed a soft ensemble clustering technique that goes a step further than their PCA-based consensus clustering. After obtaining the initial consensus clustering, this adapted algorithm iteratively calculates the strength of each protein's membership in each consensus cluster based on shortest path distances. Proteins that have a high propensity towards multiple membership are then assigned to their alternate clusters. To test the efficacy of this soft clustering algorithm, they compared their algorithm with the original ensemble clustering.
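The consensus step described above, in which base clusterings are combined through a Cluster Membership Matrix, can be illustrated with a small Python sketch. It builds a binary node-by-cluster matrix from several base clusterings and then merges nodes with average-linkage agglomerative clustering on the matrix rows. The PCA reduction is omitted for brevity, and the binary encoding, the squared Euclidean distance, the linkage choice, the function names, and the toy clusterings are all assumptions made for illustration rather than details taken from the original method.

```python
def membership_matrix(nodes, clusterings):
    """One row per node; one 0/1 column per cluster of every base clustering."""
    rows = {v: [] for v in nodes}
    for clustering in clusterings:
        for cluster in clustering:
            for v in nodes:
                rows[v].append(1 if v in cluster else 0)
    return rows


def consensus(nodes, clusterings, k):
    """Average-linkage agglomerative clustering of membership rows into k groups."""
    rows = membership_matrix(nodes, clusterings)
    groups = [[v] for v in nodes]

    def distance(g, h):
        total = sum(
            sum((a - b) ** 2 for a, b in zip(rows[u], rows[v]))
            for u in g
            for v in h
        )
        return total / (len(g) * len(h))

    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups)))
        i, j = min(pairs, key=lambda p: distance(groups[p[0]], groups[p[1]]))
        groups[i] += groups[j]
        del groups[j]
    return groups


# Hypothetical base clusterings of six proteins that partly disagree.
proteins = ["a", "b", "c", "d", "e", "f"]
base_clusterings = [
    [{"a", "b", "c"}, {"d", "e", "f"}],
    [{"a", "b"}, {"c", "d", "e", "f"}],
    [{"a", "b", "c"}, {"d", "e"}, {"f"}],
]
consensus_groups = consensus(proteins, base_clusterings, k=2)
```

Nodes that are co-clustered in most base clusterings have similar rows and are merged early, so the consensus reflects the majority arrangement even when individual base clusterings disagree.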
As can be seen in Figure 18.4, they observed that allowing multiple membership for proteins improves the overall accuracy of the clustering, as is evident from the smaller p-values of the GO enrichment analysis.

Figure 18.4. Soft Ensemble Clustering improves the quality of extracted clusters. The Y-axis represents -log(p-value). (Panel: Biological Process; X-axis: Significant Clusters; series: Base_metis, Hub-duplication, PCA-softagglo.)

A soft bi-clustering algorithm (MF-PINCoC), an extension to the PINCoC algorithm, has been proposed recently to identify overlapping dense subgraphs by using a local search technique [84]. The PINCoC algorithm applies a greedy search strategy in order to find the locally optimal sub-matrices in terms of a quality function. More recently, Avogadri et al. proposed an ensemble fuzzy clustering for decomposing gene expression datasets into overlapping clusters [7]. They first generate multiple views of the data by using random projections. A random projection maps data from a high-dimensional space to a lower-dimensional space. On these views, they applied the fuzzy k-means algorithm, and the resulting fuzzy clustering arrangements are combined into a similarity matrix. They again employed fuzzy k-means on this similarity matrix to identify a fuzzy consensus clustering [7]. This algorithm is applied to four different microarray datasets and compared against different ensemble strategies.

5. Discussion

In this article we surveyed the principal results in the field of graph mining that relate to the application domain of bioinformatics. We examined these results along three directions: i) from the perspective of mining tree-structured data; ii) from the perspective of mining multiple graphs or networks; and iii)