Medical report: "The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics"

Gianoulis et al. Genome Biology 2011, 12:R32
http://genomebiology.com/2011/12/3/R32

METHOD Open Access

The CRIT framework for identifying cross patterns in systems biology and application to chemogenomics

Tara A Gianoulis1,2, Ashish Agarwal3,4, Michael Snyder5 and Mark B Gerstein3,4,6*

Abstract

Biological data is often tabular, but finding statistically valid connections between entities in a sequence of tables can be problematic: for example, connecting particular entities in a drug property table to gene properties in a second table, using a third table associating genes with drugs. Here we present an approach (CRIT) to find connections such as these and show how it can be applied in a variety of genomic contexts, including chemogenomics data.

Background

Understanding the relationship between two or more variables is a driving motivation of many biological questions. The past several decades have seen a rapid increase in our ability to discern such relationships at multiple levels, from molecular to cellular to whole populations. However, our ability to understand the relationships between different scales and different types of data is still limited [1]. Here we introduce the Cross Pattern Identification Technique (CRIT) as a means of integrating at least three matrices that do not all share the same index. The goal of CRIT is to systematically combine information from multiple tables with different indices, allowing one not only to stack features in a single dimension but also to span across multiple ones. Thus, CRIT captures a new type of relationship between different types of data (for example, drugs and their protein targets), which we term a ‘cross pattern.’

What is a cross pattern, and how does it differ from more traditional integration methods?
There are two main differences: (1) it preserves the underlying structure of the individual datasets, allowing for greater transparency, and, more importantly, (2) it does not rely on a single index for querying. In other words, cross patterns are conceptually related to correlation but are not correlations, as there is no obvious way to correlate two differently indexed objects.

* Correspondence: mark.gerstein@yale.edu. Department of Computer Science, Yale University, 51 Prospect St, New Haven, CT 06511, USA. Full list of author information is available at the end of the article.

To better illustrate these differences, in Figure 1 we are given three pieces of information: the properties of a set of drugs, the properties of a set of proteins, and which drugs targeted which proteins. Our goal is to determine whether any properties of the drugs are related to any property of their protein targets. As a test query, in Figure 1b, we narrow our question to: which types of proteins are disrupted by aromatic drugs?
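The three inputs in this example can be pictured as tables with different row and column indices. The sketch below uses made-up identifiers and values (every name and number is hypothetical, not data from the paper) to show that the two property tables have no index in common and can only be linked through the drug-protein table.

```python
# Hypothetical toy tables; all identifiers and values are invented for illustration.

# Drug-property table: indexed by drug.
drug_properties = {
    "D1": {"aromatic": True},
    "D2": {"aromatic": True},
    "D3": {"aromatic": False},
    "D4": {"aromatic": False},
}

# Connector table: drugs (rows) by proteins (columns), e.g. a fitness score.
drug_by_protein = {
    "D1": {"T1": 0.2, "T2": 0.9},
    "D2": {"T1": 0.3, "T2": 1.0},
    "D3": {"T1": 0.9, "T2": 0.9},
    "D4": {"T1": 1.0, "T2": 1.0},
}

# Protein-property table: indexed by protein.
protein_properties = {
    "T1": {"charge": 1.5},
    "T2": {"charge": -0.5},
}

drug_index = set(drug_properties)
protein_index = set(protein_properties)

# The two property tables share no index, so they cannot be correlated directly.
shared_index = drug_index & protein_index

# The connector shares its row index with one table and its column index with the other.
connector_rows = set(drug_by_protein)
connector_cols = {protein for row in drug_by_protein.values() for protein in row}

print("shared index between property tables:", shared_index)
print("connector rows match drug table:", connector_rows == drug_index)
print("connector columns match protein table:", connector_cols == protein_index)
```

Any path from a drug property to a protein property therefore has to pass through the connector table, which is the situation the rest of the text addresses.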
Understanding these types of relationships could provide additional details about general mechanisms of drug-protein binding and about how to design drugs to disrupt a particular function. Investigating this question, though, would require integration across two different object types: proteins and drugs. As shown in Figure 1a, principal component analysis (PCA) captures the set of drug properties with the most variance, but without further collapsing of the tables it is not possible to discern what types of proteins are most affected by aromatic drugs. Similarly, both canonical correlation analysis (CCA) and biclustering can define relationships amongst datasets that share the same index [2,3]. Namely, they can identify relationships between either drug properties and their protein targets or protein properties and their drug targets, but they cannot span across a differently indexed dataset. Although methods are available for integrating more than three matrices when all share the same index variable (see discussion in [4]), how to integrate features when they do not all share the same index remains an open question. We suggest that cross patterns provide the flexibility and intuitiveness to allow for the formal definition of these types of relationships.

In the remainder of the text, we describe CRIT and apply it to three different types of problems: breast cancer gene expression, yeast regulatory networks, and a further explication of the above example in chemogenomics data. Example datasets, code, and documentation for CRIT can be found at [5].

© 2011 Gianoulis et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

[Figure: Difference between CRIT and previous techniques. (a) Data in a single matrix can be investigated using techniques such as PCA. Techniques such as CCA are applicable to two matrices with a common index. CRIT allows working with three or more matrices that do not share a common index. (b) An overview of CRIT. (c) A simple example showing how proteins can be labeled as sensitive to a particular drug property. Panel annotations: Labeler, transfers the label on the columns of the previous dataset to the rows of the new dataset; Slicer, partitions rows into dark and light green slices; Discriminator, returns a label for the columns based on whether the slices (from the rows) are significantly different. See text for more details.]

Algorithm

Cross-integration (CRIT)

Figure 1b shows an overview of the entire method, and Figure 1c illustrates the individual functions of CRIT. CRIT has three generic types of functions: a labeler, a slicer, and a discriminator. The labeler transfers a label from one dataset to another (rows to columns or the reverse). The slicer partitions this new dataset into separate ‘slices’ on the basis of the label generated in the previous step. Finally, the discriminator applies a statistical test to the slices to generate a new set of labels. More generally, the discriminator determines whether any features in the second dataset ‘discriminate’ among the labeled slices based on the parameter in the first dataset. The entire process is iterated until all of the matrices
have been used.

In the instance in Figure 1b, c, the first label is generated by simply assigning each drug to be aromatic or not aromatic. Next, this label is transferred via the labeler to the second matrix, which contains the drugs and their associated protein targets. The slicer partitions this matrix into two slices (aromatic and non-aromatic drug treatments). Finally, the discriminator examines whether the label is meaningful for any of the protein targets. If aromaticity were significant in determining the disruptiveness of a particular drug to that protein, one should see two distinct fitness populations, as shown in Figure 1b. However, should this label be non-discriminatory, that is, should the aromaticity of the drug not be a factor in determining its effectiveness on the protein of interest, the label should not split the drug treatments into distinct populations. Those proteins that showed sensitivity to the aromaticity of the drug are then labeled aro-sensitive, and this label is propagated to the next matrix, and so on.

Results and Discussion

Overview

Below, we applied CRIT to three different types of problems: extracting general trends from properties of transcription factors and their associated targets in the yeast regulatory network; relationships between gene properties, such as expression and binding status, and breast cancer type; and finally, using chemogenomics, chemoinformatics, and functional genomics data, the relationship between properties of drugs and properties of their associated targets. In all cases, we differentiate between three different levels of significance in discussing the individual cross patterns. The level of confidence in each cross pattern is further distinguished by the thickness of the line, as shown in each of the three result figures (see Additional file for investigation of method robustness using synthetic datasets).

Regulation: transcription factors and their target properties

Cis-regulatory elements as a means of
regulating gene expression have been extensively studied. However, beyond such motifs, are there inherent properties of the targets themselves that make them more or less likely to be regulated by a given class of transcription factors (TFs)? As an example, do essential transcription factors preferentially regulate essential targets? Are there genome composition features, such as GC or codon bias, that influence which targets are regulated by which TFs? There is no meaningful way of correlating properties of TFs on top of properties of their downstream targets, as the number of targets of each TF is variable: these two objects do not share the same index. However, despite the dissimilarity of object types, such integration is critical to identify principles governing transcriptional regulatory evolution, as such patterns would not be observable from looking at just a single TF or a single set of targets.

Datasets

Nineteen transcription factor and gene target properties were taken from an extensive meta-analysis in [6] (Additional file 2). A genome-wide mapping of transcription factors and targets, as defined in [7], was used as the connector matrix. The intersection between the TFs mapped by Harbison et al. and the TF and protein properties from Xia et al. resulted in 201 TFs and 5,125 gene targets.

Evaluating significance

For each TF property, TFs were labeled as either above or below the median value (given the number of TFs, a breakdown into finer classes yielded numbers too small to perform meaningful statistics). This label was then transferred to the connector matrix, where the rows represented the individual transcription factors and the columns the potential gene targets. Each element of this matrix was a score of how likely the TF would be to regulate the specific target. The rows of this matrix were then partitioned via the labeling, generating two different distributions of gene target scores. The likelihood that the scores were obtained from the same distribution was evaluated using Welch’s t-test
and q values were generated through FDR correction of the associated P values. Targets with q < 0.05 were considered more likely to be regulated by one type of TF than another and were defined as TF-property-sensitive (for example, essentiality-sensitive) targets. This label (sensitive/insensitive) was applied to the columns of the TF/target matrix and propagated to the rows of the target/target-property matrix. The process was then repeated: the target/target-property matrix was partitioned on the basis of sensitivity, and those target properties that were able to discriminate between the TF property-sensitive targets and the TF property-insensitive targets were identified. The end result was a set of cross patterns connecting a specific property of a transcription factor to a specific property of a target.

Results

In total, we identified 13 significant cross patterns relating properties of TFs and properties of targets, suggesting an overall pattern of these TFs exhibiting ‘preferences’ or ‘sensitivities’ to particular attributes of targets (Figure 2). Many of these cross patterns were between the physicochemical and composition properties of TFs and targets, suggesting that the composition and evolutionary history of the gene target may be a useful complement to the presence or absence of a given motif in predicting transcription factor binding. As an example, we identified a subset of seven transcription factors that exhibited a strong preference for either essential or inessential targets (q < 0.05, FDR-corrected). A total of 135 targets were preferentially regulated by either an essential or a nonessential TF. The number of protein-protein interaction partners of a given TF was connected to the level of gene duplication of the genes the TF targeted. In addition, TF expression was also connected to the level of gene duplication.

Breast cancer: ER status and ER binding

In our second application, we applied
CRIT to a well-characterized system. Estrogen receptor (ER) activation is one of the primary molecular features used to differentiate breast cancer subtypes through immunohistochemical staining. Activation of this receptor results in a strikingly different cancer phenotype due to extensive downstream remodeling of transcriptional programs, and the genes and molecular mechanisms affected by this dichotomy are of particular interest. Identification of gene signatures of specific tumor types is critical in the development of more targeted therapeutics. van’t Veer and colleagues identified two breast cancer subtypes distinguished by differences in the immunohistochemical stain for ER. Further, through supervised methods they identified 550 additional genes that were signatures of this status [8].

Datasets

Maps of ER to target genes were obtained from [9], and targets were defined as in [9]. ER status, microarray data, and patient metadata were all taken from [8].

Evaluating significance

A slight modification of CRIT was required to accommodate binary features. We used the hypergeometric distribution to calculate the significance of the overlap between the genes differentially expressed between ER+ and ER- tumors and the ER-binding genes. To be explicit, the problem can be described as determining the probability of drawing x white balls from an urn of m white balls and n black balls after taking out k balls. Thus, we regard the ER-binding genes as the white balls (m) and the non-binding genes as the black balls (n). The differentially expressed genes (ER+ vs ER-) represent the sample withdrawn (k), and x of these are also ER targets (that is, sampled white balls). We therefore calculate the significance of the overlap as the upper-tail probability P(X >= x), summed over all overlaps of at least x.

Results

We applied CRIT to the van’t Veer patient metadata, signature genes, and estrogen binding information from Carroll et al. [9] (Figure 3a). In this manner, we were able to recapitulate the observed relationship between ER (+) tumors
and the expression of genes that are bound by estrogen (P < × 10^-4) (Figure 3b). Although this application serves as an important validation, the result is already well known. To show the potential of CRIT, we applied it to a more complex problem domain.

Chemogenomics: drug properties and target properties

To investigate more complex, non-obvious connections, we applied CRIT to identify relationships between small molecule properties and properties of their protein targets (Figure 4a). Numerous papers have attempted to find relationships between particular drugs and particular targets [10-12]. Here, we investigated a slightly different question. Rather than looking at individual drugs and individual targets, we examined whether there are classes of drugs that are particularly disruptive to a class of proteins. As an example, we tested the hypothesis that the subset of proteins bound or more indirectly affected by a structural parameter may also share physicochemical or other types of properties by posing questions of the form: do positively charged proteins exhibit a tendency to interact with negatively charged compounds?
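A question of this form can be sketched as two passes of the labeler, slicer, and discriminator described in the Algorithm section. The code below is a minimal illustration on synthetic data, not the authors' implementation: all function names, identifiers, and numbers are hypothetical, the significance test for chemogenomics is assumed to mirror the TF application, and a permutation test on the difference in means stands in for Welch's t-test so that the sketch needs only the Python standard library. Multiple testing is handled with a Benjamini-Hochberg correction as one common form of FDR correction.

```python
import random
from statistics import median


def label_by_median(values):
    """Labeler: True for entities whose property value is above the median."""
    cut = median(values.values())
    return {key: value > cut for key, value in values.items()}


def slice_rows(matrix, labels):
    """Slicer: partition the rows of a row-keyed matrix by their label."""
    labeled = [row for key, row in matrix.items() if labels[key]]
    unlabeled = [row for key, row in matrix.items() if not labels[key]]
    return labeled, unlabeled


def permutation_p(a, b, rng, n_perm=1000):
    """Two-sided permutation P value for a difference in means (stand-in test)."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted q values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        qvals[i] = running
    return qvals


def discriminate(labeled, unlabeled, columns, rng, alpha=0.05):
    """Discriminator: flag each column whose two slices differ at q < alpha."""
    pvals = [
        permutation_p([row[c] for row in labeled], [row[c] for row in unlabeled], rng)
        for c in columns
    ]
    return {c: q < alpha for c, q in zip(columns, benjamini_hochberg(pvals))}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
compounds = [f"C{i}" for i in range(20)]
genes = [f"G{j}" for j in range(20)]

# Table 1 (compound property): half the compounds negative, half positive.
compound_charge = {c: i - 9.5 for i, c in enumerate(compounds)}

# Table 2 (connector): synthetic fitness of each strain under each compound.
# G0..G9 grow poorly under negatively charged compounds; G10..G19 are unaffected.
fitness = {
    c: {
        g: (0.2 if j < 10 and compound_charge[c] < 0 else 1.0) + 0.01 * (i % 10)
        for j, g in enumerate(genes)
    }
    for i, c in enumerate(compounds)
}

# Table 3 (protein properties): only charge tracks the sensitive genes.
protein_props = {
    g: {"charge": (1.0 if j < 10 else -1.0) + 0.01 * (j % 10), "length": 300.0 + j % 10}
    for j, g in enumerate(genes)
}

# Pass 1: label compounds, slice the connector, discriminate per gene.
compound_label = label_by_median(compound_charge)
positive, negative = slice_rows(fitness, compound_label)
gene_label = discriminate(positive, negative, genes, rng)

# Pass 2: carry the gene label to the protein-property table and repeat.
sensitive, insensitive = slice_rows(protein_props, gene_label)
cross_pattern = discriminate(sensitive, insensitive, ["charge", "length"], rng)
print(cross_pattern)
```

In this toy setup the compound charge is linked to protein charge but not to protein length, because only the charge column separates the genes flagged as sensitive in the first pass. Each table keeps its own index throughout; only labels move between tables.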
Datasets

Hillenmeyer et al. tested 291 unique compounds on the heterozygous yeast deletion collection under a number of different concentrations (Additional file 1). We selected profiles generated using the minimum drug concentration, since specificity decreases as drug concentrations approach toxicity. Small molecules were converted to text strings called SMILES [13] (Additional file 3), and small molecule properties were computed [14] (Additional files 4 and 5). Only compounds with no missing values were kept, resulting in 281 unique compounds.

[Figure: panels (a) to (c) relating the 19 TF properties (201 TFs) through the connector matrix to the 19 gene target properties (5,125 genes). Properties named in the panels include charge, coil, disorder, essentiality, expression, gene duplication, number of interactors, TM helix, codon adaptation index (CAI), and codon bias.]

Table of contents

Abstract

Background

Algorithm

Cross-integration (CRIT)

Results and Discussion

Overview

Regulation: transcription factors and their target properties

Datasets

Evaluating significance

Results

Breast cancer: ER status and ER binding

Datasets

Evaluating significance

Results

Chemogenomics: drug properties and target properties

Datasets

Evaluating significance

Results

Direct properties of small molecules are sometimes mirrored by those of their protein targets

Localization constrains physicochemical properties of drugs

GO-specific disruption

Environmental stress response

Guilt by association to predict function or mechanism of compound action

Generality of CRIT

Conclusions

Materials and methods

Formal definition of CRIT
