Modules, multidomain proteins and organismic complexity

Hedvig Tordai, Alinda Nagy, Krisztina Farkas, László Bányai and László Patthy

Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, Hungary

FEBS Journal 272 (2005) 5064–5078 © 2005 FEBS
(Received 9 May 2005, revised 9 August 2005, accepted 12 August 2005)
doi:10.1111/j.1742-4658.2005.04917.x

Keywords: domain; exon-shuffling; module; multidomain protein; organismic complexity

Correspondence: L. Patthy, Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, Budapest, POBox 7, H-1518, Hungary. Fax: +361 4665465. Tel: +361 2093537. E-mail: patthy@enzim.hu

Abbreviations: TSP1, thrombospondin type I.

Originally the term ‘protein module’ was coined to distinguish mobile domains that frequently occur as building blocks of diverse multidomain proteins from ‘static’ domains that usually exist only as stand-alone units of single-domain proteins. Despite the widespread use of the term ‘mobile domain’, the distinction between static and mobile domains is rather vague, as it is not easy to quantify the mobility of domains. In the present work we show that the most appropriate measure of the mobility of domains is the number of types of local environments in which a given domain is present. Ranking of domains with respect to this parameter in different evolutionary lineages highlighted marked differences in the propensity of domains to form multidomain proteins. Our analyses have also shown that there is a correlation between domain size and domain mobility: smaller domains are more likely to be used in the construction of multidomain proteins, whereas larger domains are more likely to be static, stand-alone domains. It is also shown that shuffling of a limited set of modules was facilitated by intronic recombination in the metazoan lineage and that this has contributed significantly to the emergence of novel complex multidomain proteins, novel functions and increased organismic complexity of metazoa.

The average size of a protein domain of known crystal structure is about 175 residues; proteins that are larger than 200–300 residues usually consist of multiple protein folds [1]. The individual structural domains of such multidomain proteins are defined as compact folds that are relatively independent inasmuch as the interactions within one domain are more significant than those with other domains. The individual domains of multidomain proteins usually fold independently of the other domains.

Some multidomain proteins contain multiple copies of a single type of structural domain, indicating that internal duplication of a gene segment encoding a domain has given rise to such proteins. Many multidomain proteins contain different types of domains (i.e. domains that are not homologous to each other). The genes of such multidomain proteins were created by joining two or more gene segments that encode different protein domains. Such multidomain proteins, consisting of multiple domains of independent evolutionary origin, are frequently referred to as mosaic proteins.

Multidomain proteins have some unique features that endow them with major evolutionary significance. In multidomain proteins a large number of functions (different binding activities, catalytic activities) may coexist, making such proteins indispensable constituents of regulatory or structural networks where multiple interactions (protein–protein, protein–ligand, protein–DNA, etc.) are essential. For example, the domains that constitute multidomain proteins of the intracellular and extracellular signaling pathways mediate multiple interactions with other components of the signaling pathways. Similarly, the coexistence of different domains with different binding specificities is also essential for the biological function of multidomain proteins of the extracellular matrix: the multiple, specific interactions among matrix constituents are indispensable for the proper architecture of the extracellular matrix. As a corollary of their involvement in multiple interactions, formation of novel multidomain proteins is likely to contribute significantly to the evolution of increased organismic complexity, since the latter reflects the complexity of interactions among genes, proteins, cells, tissues and organs [2].

Despite such valuable properties of large, complex multidomain proteins, the vast majority of proteins contain only one domain [3–5]. Furthermore, recent studies have revealed that the majority of multidomain proteins tend to have very few domains. Wolf et al. [4] counted the number of different folds in each protein of proteomes of archaea, bacteria and eukarya and calculated the average fraction of the proteins with each given number of domains. It has been concluded from these analyses that the distribution of single-, two-, three-domain, etc., proteins in archaea, bacteria and eukarya is such that each successive class (e.g. two-domain proteins vs. single-domain proteins, three-domain proteins vs. two-domain proteins, etc.) contains significantly fewer entries than the previous one. More recent mathematical analyses of the distribution of multidomain proteins according to the number of different constituent domains have revealed that their distribution follows a power law, i.e. single-domain proteins are the most abundant, whereas proteins containing larger numbers of domain-types are increasingly less frequent. This type of distribution is consistent with a random recombination (joining and breaking) model of evolution of multidomain architectures [6]. The observation of Wolf et al.
[4] that the size distribution of multidomain proteins was very similar in eukaryotes and prokaryotes apparently contradicted the notion that evolution of complex eukaryotes favored (and benefited from) the formation of more and larger multidomain proteins as they contributed to their increased organismic complexity. Recent analyses, however, provided evidence that there may be a connection between the propensity of protein domains to form multidomain architectures and organismic complexity. For example, Koonin et al. [6] have shown that, although in all proteomes the domain distribution is compatible with a random recombination model of the evolution of multidomain architectures, the likelihood of domain joining appears to increase in the order Archaea < Bacteria < Eukaryotes, and there is a significant excess of larger multidomain proteins in Eukaryotes. Similarly, Wuchty [5] has shown that higher organisms tend to have more complex multidomain proteins. Using graph theory-based tools to survey and compare protein domain organizations of different organisms, Ye and Godzik [7] have shown that the number of domains, the number of domain combinations, and the size of the largest connected component of domain-combination networks (measured by the number of domains it consists of) of each organism increase with the complexity of the organisms.

The propensity of different domain types to form multidomain proteins shows great variation, ranging from ‘static’ domains that rarely or never occur in multidomain proteins, to ‘mobile’ domains (usually referred to as modules) that frequently participate in gene-rearrangements to build multidomain proteins.
Various analyses of the number of multidomain architectures in which different domain-types are involved have shown that their distribution also follows a power law: a minority of domain-types (the ‘mobile’ modules) occur in numerous multidomain proteins, whereas the majority of domains belong to categories that are rarely used in multidomain proteins [5,6,8]. Such a power law distribution indicates that the chance of a domain to be used in the construction of novel multidomain proteins is proportional to the number of times it has already been used. As for any other type of genetic change, the frequency of joining a given domain-type to other domains to create novel multidomain architectures reflects the probability of such a genetic change and the probability of its fixation. In other words, the propensity of a domain to form multidomain proteins is a function of the frequency of genetic events that can lead to such gene-fusions and the selective value of the resulting chimeric proteins. Accordingly, it is likely that the most mobile modules have acquired this status as a result of a combination of special structural, functional and genomic features [9]. First, certain structural features of domains may facilitate their preferential proliferation in multidomain proteins. For example, the stability and folding autonomy of domains in multidomain proteins may be of utmost importance for their mobility as this minimizes the influence of neighboring domains [9]. Folding autonomy can ensure that folding of the domain is not deranged when inserted into a novel protein environment. It seems thus very likely that the most widely used domains have been selected according to the rate, robustness and autonomy of folding [10]. 
It is noteworthy in this respect that multidomain proteins are under-represented in Archaea compared with the other two kingdoms of life and this fact is thought to be related to the lower stability of multidomain proteins in the hyperthermophilic environments where most archaeal species live [6].

Second, functional aspects may also contribute to the proliferation of certain domains. For example, in complex cellular signaling pathways there is a greater demand for domains that mediate interaction with other constituents of the pathways (e.g. protein kinase domain); thus selection may have favored the spread of these modules to other multidomain proteins.

Finally, special genomic features of certain genes (gene-segments) may have significantly facilitated their combination with other domains.

To gain further insight into the factors that influence the mobility of domains and control the creation of multidomain proteins, in the present work we have compared the propensity of different domains to form multidomain proteins in several major groups of organisms (Bacteria, Archaea, Protozoa, Plants, Fungi, Metazoa) as well as in individual proteomes of some representative species. The specific questions we have addressed were: (a) What is the most appropriate parameter that reflects the evolutionary mobility of protein domains? (b) Are there significant differences in the propensity to form multidomain proteins in different evolutionary lineages? (c) How do structural and functional properties of domains influence their mobility? (d) Is there reliable evidence for the notion that intronic recombination has significantly contributed to the remarkable mobility of some domain-types in metazoa?
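The introduction notes that a power-law distribution of domain usage is what one expects if the chance of a domain being reused is proportional to the number of times it has already been used. A minimal sketch of that idea follows, assuming a simple preferential-attachment toy model; the function name, the `p_new` parameter and all numbers are illustrative assumptions and are not taken from the study.

```python
import random
from collections import Counter


def simulate_domain_reuse(n_events, p_new=0.1, seed=0):
    """Toy model (an assumption, not the authors' method).

    Each event recruits one domain into a newly formed protein.
    With probability p_new a never-used domain type is recruited;
    otherwise an existing type is chosen with probability
    proportional to the number of times it has already been used.
    """
    rng = random.Random(seed)
    uses = [0]       # one entry per use; the value is the domain-type id
    n_types = 1
    for _ in range(n_events):
        if rng.random() < p_new:
            uses.append(n_types)           # a brand-new domain type
            n_types += 1
        else:
            # Picking uniformly from the list of past uses selects each
            # type in proportion to its previous usage.
            uses.append(rng.choice(uses))
    return Counter(uses)


counts = simulate_domain_reuse(5000)
ranked = sorted(counts.values(), reverse=True)
print("most used:", ranked[:5], "median:", ranked[len(ranked) // 2])
```

In runs of this toy model a few domain types typically end up heavily reused while most are used only once or twice, which is the qualitative pattern the power-law argument describes.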
Results and discussion

Differences in the propensity to form multidomain proteins in different evolutionary lineages

As shown in Table 1, different evolutionary groups show significant differences in the propensity to form multidomain proteins: the proportion of multidomain proteins decreases in the order metazoa > plants > fungi ≈ protozoa > bacteria > archaea. At one extreme we find archaea, where only 23% of the entries contain more than one Pfam-A domain, while metazoa represent the other extreme, where 39% of the entries correspond to multidomain proteins. It is also clear from Table 1 that in metazoa a larger proportion of Pfam-A domains participates in the construction of multidomain proteins than in archaea. Furthermore, the multidomain proteins of metazoa tend to be larger than those in archaea: multidomain proteins with more than 10 Pfam-A domains are nine times more frequent in metazoa than in archaea (Table 2). This observation is in harmony with earlier conclusions that the average protein length is considerably greater in eukaryotes than in prokaryotes [11].

These differences between different evolutionary lineages are unlikely to be due to differences in annotation coverage. As shown recently by Ekman et al. [12], the Pfam-A domain coverage is similar for archaea, bacteria and eukarya: in each group about 70% of the proteins have at least one Pfam-A domain. In agreement with this conclusion, our analyses have also shown that Pfam-A coverage is similar for bacteria, archaea, protozoa, plants, fungi and metazoa (Table 3).

Table 1. Domains and multidomain proteins in different groups of organisms. (a) Proteins containing at least one Pfam-A domain; (b) Pfam-A domains; (c) proteins containing at least two Pfam-A domains; (d) domains occurring in at least one multidomain protein; (e) domains occurring only as stand-alone domains in single-domain proteins.

Group      Proteins (a)   Domains (b)   Multidomain proteins (c)   Mobile domains (d)   Static domains (e)
                                        (% of proteins)            (% of domains)       (% of domains)
Bacteria   273 859        4079          73 076 (27%)               1974 (48%)           2105 (52%)
Archaea    23 728         1725          5529 (23%)                 776 (45%)            949 (55%)
Protozoa   16 756         1967          5298 (32%)                 932 (47%)            1035 (53%)
Plants     57 620         2562          20 359 (35%)               1305 (51%)           1257 (49%)
Fungi      20 371         2249          6434 (32%)                 1102 (49%)           1147 (51%)
Metazoa    129 881        3272          51 085 (39%)               1748 (53%)           1524 (47%)

Table 2. Percentage of multidomain proteins containing more than N Pfam-A domains in different groups of organisms.

N     Bacteria   Archaea   Protozoa   Plants   Fungi   Metazoa
1     26.67      23.30     31.62      35.33    31.58   39.33
2     8.88       7.01      14.95      14.28    12.98   17.97
3     3.94       3.40      8.84       8.00     6.88    11.11
4     1.96       1.66      5.72       5.33     4.15    7.66
5     1.24       1.13      3.94       3.84     2.72    5.62
10    0.27       0.19      1.14       1.22     0.39    1.74

To gain a deeper insight into the factors controlling the frequency and size of multidomain proteins in different groups of organisms we have plotted the number of multidomain proteins vs. the number of constituent domains. Earlier studies have pointed out that such distributions usually fit the power law $P(i) \cong c\,i^{-\gamma}$, where P(i) is the number of multidomain proteins containing exactly i domains, c is a normalization constant and γ is a parameter that typically assumes values between 1 and 3 [13]. In double-logarithmic plots, P(i) as a function of i is a straight line with slope −γ. As shown in Fig. 1, in the case of each evolutionary group the data closely follow straight lines in double-logarithmic plots, consistent with power-law dependence. The distribution of values of metazoan multidomain proteins was found to be significantly different from those of multidomain proteins of plants (P = 0.0002), bacteria (P < 0.0001), fungi (P < 0.0001) or archaea (P < 0.0001). The fact that the slopes of the curves in Fig.
1 are increasingly steeper in the order metazoa → plants → bacteria ≈ fungi ≈ archaea (Table 4) indicates that the likelihood of domain joining is greater in metazoa than in prokaryotes, plants and fungi. Surprisingly, the slope in protozoa is similar to that observed for metazoa. A possible explanation for the unusual abundance of larger multidomain proteins in protozoa is that parasitic protists have acquired metazoan-like multidomain proteins through lateral gene transfer. Recently it has been shown that different lineages of apicomplexan protozoa (e.g. Plasmodium, Cryptosporidium) have acquired distinct but overlapping sets of multidomain surface proteins constructed from adhesion domains typical of animal proteins, although in no case do they share multidomain architectures identical to those of animals [14,15]. Some of these proteins contain conserved adhesion domains such as the epidermal growth factor-like domain (EGF domain), thrombospondin type I (TSP1) domain, the von Willebrand factor A (vWA) domain and the PAN/APPLE domain that are typically abundant in animal surface proteins but are absent or rarely present in surface adhesion molecules in other eukaryotic lineages. A systematic analysis of the C. parvum proteome has identified 32 widely conserved surface domains distributed in 51 proteins, including 24 noncatalytic protein- or carbohydrate-interacting domains and seven catalytic domains. Most strikingly, 10 of these domains, namely TSP1, sushi/CCP, Notch/Lin (NL1), NEC (neurexin-collagen domain), fibronectin type 2 (FN2), pentraxin, MAM domain (a domain present in meprin, A5, receptor protein tyrosine phosphatase mu), ephrin-receptor EGF-like domain, the animal signaling protein hedgehog-type HINT domain and the scavenger domain, have thus far been found, outside apicomplexans, only in the surface proteins of animals. The remaining domains, such as the EGF, LCCL domain (a domain first found in Limulus factor C, Coch-5b2 and Lgl1), Kringle and SCP domain, are seen in some other eukaryotes but are found predominantly in animals. In phylogenetic analyses specific affinities between apicomplexan and animal versions were recovered [16], making horizontal gene transfer from animals, followed by selective retention of functionally relevant proteins involved in adhesion, the most parsimonious explanation for these observations.

Table 3. Percentage of positions in domain-triplet types occupied by Pfam-A domains vs. Nterm, Cterm and Unknown regions in multidomain proteins of different groups of organisms. For definition of domain-triplet type, Nterm, Cterm and Unknown regions in domain-triplets see Methods.

Region    Bacteria   Archaea   Protozoa   Plants   Fungi   Metazoa
Pfam-A    71         70        68         68       67      71
Nterm     8          9         6          6        7       6
Cterm     8          8         7          7        7       6
Unknown   13         13        19         18       19      16

Fig. 1. Distribution of multidomain proteins with respect to the number of constituent domains. The figure shows the number of constituent domains (x axis, log 10 scale) compared with the number of multidomain proteins that have that number of domains (y axis, log 10 scale) in bacteria (A), archaea (B), protozoa (C), plants (D), fungi (E) and metazoa (F). Parameters of the plots are compiled in Table 4.

Table 4. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of multidomain proteins and i is the number of constituent domains.

      Bacteria   Archaea    Protozoa   Plants     Fungi      Metazoa
γ     2.9343     3.1744     2.7101     2.8356     3.0635     2.7457
R     0.9597     0.9822     0.9785     0.9737     0.9755     0.9865
N     65         21         34         36         23         40
P     < 0.0001   < 0.0001   < 0.0001   < 0.0001   < 0.0001   < 0.0001

It thus appears that metazoa favor the formation of larger multidomain proteins than archaea, bacteria, fungi and plants.
To test whether this is related to the fact that the world of extracellular (and some transmembrane) multidomain proteins has significantly expanded in metazoa [2,9], we have analyzed the size distribution of extracellular, transmembrane and intracellular multidomain proteins of metazoa separately.

Differences in the propensity to form extracellular, intracellular and transmembrane multidomain proteins in metazoa

Double-logarithmic plots of the number of extracellular, intracellular and transmembrane multidomain proteins vs. the number of constituent domains have revealed that in each case the data follow straight lines consistent with power-law dependence. The distribution of values for extracellular multidomain proteins, however, differed significantly from those of intracellular multidomain proteins (P < 0.0001), of transmembrane multidomain proteins (P = 0.0010) or of total metazoan multidomain proteins (P < 0.0001). The slope for extracellular multidomain proteins is shallower than the value for intracellular multidomain proteins, for transmembrane proteins or for total metazoan multidomain proteins (Table 5). These observations indicate that the ratio of domain joining/breaking is greater for extracellular than for intracellular multidomain proteins of metazoa. To test whether this reflects the fact that exon-shuffling of class 1–1 modules contributed primarily to the creation of extracellular (and extracellular parts of some transmembrane) multidomain proteins of metazoa [2,9], we have analyzed the size distribution of multidomain proteins assembled from class 1–1 modules (irrespective of their subcellular localization).

Power law distribution of metazoan multidomain proteins assembled by exon-shuffling from class 1–1 modules

Analysis of the double-logarithmic plot of the number of multidomain proteins assembled from class 1–1 modules vs.
the number of constituent domains has revealed that the distribution of values differs significantly from those for total metazoan multidomain proteins (P < 0.0001) or for intracellular metazoan multidomain proteins (P < 0.0001). The slope for multidomain proteins assembled from class 1–1 modules is shallower than the values for intracellular or total metazoan multidomain proteins (Table 5). This observation is consistent with the notion that exon-shuffling of class 1–1 modules has favored the creation of larger (primarily extracellular) multidomain proteins of metazoa.

Table 5. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of extracellular, transmembrane, intracellular or class 1–1 multidomain proteins and i is the number of constituent domains.

      Total      Extracellular   Transmembrane   Intracellular   Class 1–1
γ     2.7457     2.1107          2.7479          2.8071          2.4031
R     0.9865     0.9362          0.9684          0.9781          0.9714
N     40         35              34              37              39
P     < 0.0001   < 0.0001        < 0.0001        < 0.0001        < 0.0001

Domain size and propensity to form multidomain proteins

By plotting the size of multidomain proteins as a function of the number of constituent Pfam-A domains we obtained a linear relationship ($Y = A + BX$), where X is the number of domains, B is the average size (in amino acid residues) of Pfam-A domains actually used to build multidomain proteins and Y is the size of the multidomain proteins (Fig. 2). The value of B was found to be 80 amino acid residues, much smaller than the average size of Pfam-A domains (178 residues) present in the Pfam-A database. This observation suggests that smaller domains are more likely to be used in the construction of multidomain proteins. As the value of A is 302 amino acid residues, this also suggests that Pfam-A domains larger than average are more likely to be static, stand-alone domains.
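As a worked illustration of this linear relationship, the fitted parameters reported for Fig. 2 (A = 302.3, B = 80.1) can be evaluated for a one-domain and a ten-domain protein. This treats the fitted line as exact, which it is not given the scatter in the data, so the numbers are indicative only.

```latex
\begin{align*}
Y(X)  &= A + B\,X, \qquad A = 302.3,\quad B = 80.1 \\
Y(1)  &= 302.3 + 80.1 \times 1  = 382.4 \\
Y(10) &= 302.3 + 80.1 \times 10 = 1103.3 \\
\frac{Y(10)}{10} &\approx 110.3
\end{align*}
```

On this reading each additional domain adds about 80 residues, whereas a protein with a single Pfam-A domain is predicted to be about 382 residues long in total (a figure that includes any regions not assigned to a Pfam-A domain).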
Figure 2 thus suggests that larger Pfam-A domains predominate in single- and oligodomain proteins, whereas larger multidomain proteins are constructed from smaller Pfam-A domains. It is noteworthy in this respect that the most versatile mobile modules (e.g. the EGF, ig, fn1, TSP_1, Sushi, Ldl_recept_a, SH3–1, SH2 modules, kringles) are less than 100 amino acid residues long. A possible explanation for this phenomenon is that smaller, compact domains are more likely to satisfy the folding autonomy criterion that is crucial for their structural integrity in multidomain proteins. This explanation is supported by the fact that the rate of protein folding of single-domain proteins is inversely proportional to protein length [17,18].

Measuring the evolutionary mobility of protein domains

It has long been known that the propensity of individual domains to form multidomain architectures shows significant differences: whereas the majority of domains are rarely observed in multidomain proteins, some domains are extremely widely used [18]. Nevertheless, the distinction between static and mobile domains is rather vague since it is not simple to measure domain mobility. The frequent reuse (‘mobility’) of a protein domain increases several types of parameters, such as (a) the number of proteins in proteome(s) in which it is present; (b) the number of copies of the domain in proteome(s); (c) the number of other domain-types with which the given domain co-occurs to form multidomain proteins; and (d) the number of multidomain protein architectures (linear sequence of domains, domain-organizations) in which the given domain occurs. Parameters (a) and (b) are rarely used to illustrate differences in the mobility of protein domains, as it is clear that these parameters may also be affected by, and may have more to do with, gene duplications or domain duplications than with domain mobility.
In recent years the mobility of a domain was most frequently measured by the number of other domain-types with which the given domain co-occurs (to which it is ‘connected’) in multidomain proteins [5,7]. An obvious problem with this ‘co-occurrence’ or ‘connectivity’ approach is that a domain may co-occur with a large number of other domains in large families of multidomain proteins in which the given domain is always in the same local context, i.e. it shows no sign of mobility (Fig. 3). We face a similar problem if we wish to use the number of multidomain protein architectures to measure mobility of domains: a domain may occur in a large number of different architectures in which the given domain is always in the same local context (Fig. 3). As illustrated in this figure, during evolution of multidomain protein families, domain insertions distant from the given domain may lead to marked changes in the number of architectures in which a given domain is present and in the number of co-occurring domains, even though the given domain is present in the same local environment.

To assess the significance of these problems, in the present work we have introduced the number of local architecture-types in which the given domain occurs as a measure of its mobility. Local architecture (local context) is defined as the ‘triplet’ consisting of the closest upstream (if any) and downstream (if any) domain neighbors of the given domain. As illustrated in Table 6, ranking domains with respect to the number of types of domains co-occurring with the domain in metazoan multidomain proteins (CO-OCCURRENCE), the number of types of metazoan multidomain protein architectures in which the domain is present (ARCHITECTURE) and the number of local architectures (TRIPLETS) in which the domain is present gives very different results.

Fig. 2. Pfam-A domain number and protein size of metazoan proteins. The line shows a linear fit according to equation $Y = A + BX$, where Y is the size of proteins with a given number of Pfam-A domains and X is the number of constituent Pfam-A domains (N = 129881; B = 80.1 ± 0.3425; A = 302.3 ± 1.243; r² = 0.2987, P < 0.0001, 95% confidence interval). The figure shows only the data for proteins containing fewer than 25 domains; the squares represent the average size of proteins with a given number of Pfam-A domains.

Fig. 3. Domain organization of representative multidomain proteins containing the GPS-domain (G-protein-coupled receptor proteolytic site domain). The rectangles highlight the two types of local environments in which the GPS-domain occurs. The multidomain proteins shown represent 10 distinct architectures and contain 18 types of co-occurring domains. Note that GPS-containing multidomain proteins have diverse architecture due to the relatively high number of co-occurring domain-types, although the local environment of the GPS domain is mostly unchanged: it is present in only two triplet types. Proteins shown (with accession numbers): Probable G protein-coupled receptor 97 precursor (Q8R0T6); CD97 antigen precursor (P48960); Brain-specific angiogenesis inhibitor 1 (Q8CGM0); Latrophilin-like protein LAT-2 (AAQ84879); Probable G protein-coupled receptor 126 precursor (Q86SQ4); Probable G protein-coupled receptor 125 precursor (Q8IWK6); Probable G protein-coupled receptor 116 precursor (Q8IZF2); Latrophilin-1 (O97830); Receptor for egg jelly 3 protein (Q95V80); Polycystic kidney disease and receptor for egg jelly related protein precursor (Q9Z0T6).

Table 6. Ranking of Pfam-A domains in metazoa with respect to parameters reflecting their evolutionary mobility. Only the top-ranking 20 are shown; the domains are listed in the order of decreasing mobility. The domain names correspond to those used by the Pfam database (http://www.sanger.ac.uk/Software/Pfam/) [22]. Class 1–1 modules are highlighted in bold.

Rank   Number of types of      Number of types of   Number of types of
       co-occurring domains    architectures        ‘triplets’
1      Pkinase                 EGF                  EGF
2      EGF                     I-set                Pkinase
3      Ank                     fn1                  ig
4      PH                      ig                   PH
5      zf-C3HC4                LRR                  fn3
6      zf-C2H2                 Pkinase              EGF_CA
7      fn1                     EGF_CA               I-set
8      EGF_CA                  Ank                  SH3–1
9      ig                      zf-C2H2              CUB
10     SH3–1                   PH                   Ldl_recept_a
11     I-set                   SH3–1                zf-C2H2
12     efhand                  Ldl_recept_a         TSP_1
13     PDZ                     Laminin_G_2          Ank
14     LRR                     Collagen             Sushi
15     WD40                    Sushi                zf-C3HC4
16     Lectin_C                efhand               efhand
17     Ldl_recept_a            PDZ                  PDZ
18     IQ                      IQ                   zf-CCHC
19     TSP_1                   CUB                  C1–1
20     Helicase_C              zf-C3HC4             SH2

Fig. 4. Comparison of the number of triplet types, number of architecture-types and number of co-occurring domain-types in metazoan multidomain proteins. (A) The figure shows a linear fit according to equation $Y = A + BX$, where Y is the number of triplet types and X is the number of architecture-types containing a given domain (N = 1748; B = 0.3673; R = 0.9156, P < 0.0001). (B) The figure shows a linear fit according to equation $Y = A + BX$, where Y is the number of triplet types containing a given domain and X is the number of domain-types co-occurring with that domain (N = 1748; B = 1.082; R = 0.9144, P < 0.0001). Class 1–1 modules showing greatest mobility (present in more than 15 triplet types) are highlighted in red.

The similarities and differences of the information content of ‘TRIPLET’ vs. ‘ARCHITECTURE’ and ‘CO-OCCURRENCE’ are illustrated in Fig. 4. As shown in Fig. 4(A), there is a clear linear relationship ($Y = BX$; R = 0.9156, P < 0.0001) between the number of architecture-types (X) and the number of triplet types (i.e. local architecture-types, Y) in which a given domain is present, but the slope of the line (B = 0.3673) indicates that a given local architecture-type may be found in several different global architectures, i.e.
there is a uniform tendency that changes in architecture occur at distant regions. Furthermore, examination of the data reveals that domains (repeats) more prone to duplication than to shuffling (LRR, Ank, etc.) are the ones that deviate from this linear relationship most significantly. Similarly, there is a linear relationship (R = 0.9144, P < 0.0001) between the number of domain types with which a given domain co-occurs (connectivity) and the number of triplet types in which it is present (Fig. 4B). Nevertheless, examination of the data reveals that the majority of mobile class 1–1 modules known to have been shuffled by exon-shuffling [20] deviate from this linear relationship most significantly, inasmuch as they have higher triplet numbers than expected from the linear relationship, i.e. they lie above the line calculated by linear regression analysis (Fig. 4B). This is also reflected in the fact that in Table 6, class 1–1 modules (e.g. CUB, TSP1, Ldl_recept_a) occupy more prominent positions in the TRIPLET column than in the ARCHITECTURE or CO-OCCURRENCE columns.

On the other hand, domains [e.g. GPS (G-protein-coupled receptor proteolytic site domain), Fig. 3] that are present in almost invariable local environments of a vast variety of multidomain protein architectures are present in a much lower number of triplet types than expected if we assume a perfect linear relation. As illustrated in Fig. 3, the high number of domains co-occurring with the GPS domain and the high number of architecture-types in which it is present reflect domain-shuffling events distant from the GPS domain and have little to do with mobility of the GPS domain. It thus appears that the number of local architecture-types (‘triplets’) in which the given domain is present is a more relevant parameter to reflect the ‘shuffling’ or ‘insertion’ of a mobile domain into different environments.
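The difference between the three measures can be sketched on a toy data set. The code below is a minimal illustration, assuming each protein is given as an ordered list of domain names; the domain labels, the `"Nterm"`/`"Cterm"` placeholders for a missing neighbor and the function name are hypothetical, and regions without Pfam-A assignment are ignored here.

```python
def mobility_measures(proteins, domain):
    """Return (co-occurring types, architecture types, triplet types)
    for `domain` across multidomain proteins (hypothetical helper)."""
    cooccurring, architectures, triplets = set(), set(), set()
    for arch in proteins:
        if len(arch) < 2 or domain not in arch:
            continue  # only multidomain proteins containing the domain
        architectures.add(tuple(arch))
        cooccurring.update(d for d in arch if d != domain)
        for i, d in enumerate(arch):
            if d == domain:
                # closest upstream / downstream neighbor, if any
                up = arch[i - 1] if i > 0 else "Nterm"
                down = arch[i + 1] if i + 1 < len(arch) else "Cterm"
                triplets.add((up, d, down))
    return len(cooccurring), len(architectures), len(triplets)


# Toy architectures (invented): domain "X" always sits in one of two
# local contexts, while insertions elsewhere change the architecture.
toy = [
    ["A", "X", "T"],
    ["B", "A", "X", "T"],
    ["C", "D", "A", "X", "T"],
    ["E", "X", "T"],
    ["F", "C", "E", "X", "T"],
]
result = mobility_measures(toy, "X")
print(result)  # → (7, 5, 2)
```

In this toy case the focal domain co-occurs with seven domain types in five architectures but appears in only two triplet types, mirroring the situation described for the GPS domain.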
Ranking of domains according to the number of triplet types has revealed that the best known mobile modules (EGF, PH, ig, I-set, SH3–1, fn1, EGF_CA, CUB, TSP_1, Ldl_recept_a, sushi, etc.) occupy most of the top 20 positions in the TRIPLET column of Table 6.

Domain size and domain mobility

We have used the number of triplet types in which a Pfam-A domain occurs as a measure of its mobility to investigate whether mobility correlates with domain size. As shown in Fig. 5, there is a significant inverse correlation between domain size and domain mobility. This observation is in harmony with the data shown in Fig. 2, which also suggest that smaller domains are more likely to be used in the construction of multidomain proteins, whereas larger domains are more likely to be static, stand-alone domains. There are a few noteworthy exceptions to the generalization that the domains showing greatest mobility are small. One of these exceptions is the protein kinase domain, which, with an average size of 228 amino acids, shows the second greatest mobility in metazoan multidomain proteins (Fig. 5 and Table 6). It seems likely that its mobility reflects primarily the great demand for this domain in signaling networks.

Power law distribution of domain mobility

It is evident from Fig. 5 that the majority of domains occur in a relatively small number of local architecture-types, whereas a small minority of domains serves as versatile building blocks of multidomain proteins. This is in agreement with recent observations that power laws describe the distribution of domains with respect to the number of multidomain architectures in which they occur [5,6,8].

Fig. 5. Domain size and domain mobility. The number of domain-triplet types in which a Pfam-A domain occurs in metazoan multidomain proteins is plotted as a function of the average size of the given Pfam-A domain family (in amino acid residues).
Note that there is an inverse correlation between domain size and domain mobility (number of pairs = 1748, Pearson r = −0.1507, P < 0.0001).

H. Tordai et al. Mobile domains. FEBS Journal 272 (2005) 5064–5078 © 2005 FEBS

To analyze the factors that influence domain mobility in different groups of organisms, we have plotted the number of domain-types as a function of their ‘mobility’, mobility being expressed either as the number of domain-types co-occurring with the given domain or as the number of triplet types in which a domain occurs. In the case of the co-occurrence approach, the data follow straight lines in double-logarithmic plots, consistent with power-law dependence. The distribution of values for metazoa was found to be significantly different from those of bacteria (P < 0.0001), archaea (P = 0.0002), plants (P < 0.0001) and fungi (P < 0.0001). The slopes of the curves increase in the order metazoa < bacteria < plants < archaea < fungi (Table 7A). To test whether this is related to the fact that shuffling of (class 1–1) modules and creation of extracellular and transmembrane multidomain proteins was significantly facilitated by intronic recombination in metazoa [9,21], we have analyzed the domain co-occurrence plots for extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa separately. The distribution of values for extracellular multidomain proteins differed significantly from that of intracellular multidomain proteins (P < 0.0051). The slope for extracellular multidomain proteins is shallower than that for intracellular multidomain proteins (Table 8A). Furthermore, the values for multidomain proteins assembled from class 1–1 modules differed significantly from those for total metazoan multidomain proteins (P < 0.0001). The slope for class 1–1 multidomain proteins is shallower than that for total metazoan multidomain proteins (Table 8A).
These observations are consistent with the notion that intronic recombination greatly increased the mobility of class 1–1 modules in metazoa, and that this facilitated the creation of novel extracellular multidomain proteins of animals. Analysis of domain mobility of metazoan proteins with the triplet approach has also revealed that the data follow straight lines in double-logarithmic plots. In the case of the triplet plots the distribution of values in metazoa is also significantly different from those of bacteria (P < 0.0001), archaea (P < 0.0001), plants (P < 0.0001), protozoa (P < 0.001) and fungi (P < 0.0001). Comparison of the slopes of co-occurrence vs. triplet plots in different groups of organisms (Tables 7A and B) has revealed that in each case the slopes of the triplet plots are steeper than those of the co-occurrence plots (values of the power-law exponent γ; metazoa: 1.6170 vs. 2.0125; bacteria: 1.8207 vs. 2.2851; archaea: 2.0278 vs. 2.4554; protozoa: 1.9690 vs. 2.7118; plants: 1.8616 vs. 2.5508; fungi: 2.2128 vs. 2.8692). It seems likely that this is due to the difference between the two approaches: the ‘global’ co-occurrence approach tends to overestimate the mobility of domains as opposed to the ‘local’ triplet approach (Fig. 3). Nevertheless, the results of the two analyses are similar inasmuch as metazoan domains display the shallowest slopes. It is interesting to point out that the mobility distribution of domains of protozoa is very similar to those of plants and fungi (Table 7B), whereas the size-frequency distribution of protozoan multidomain proteins is more similar to that of metazoa (Table 4). A possible explanation for this apparent contradiction is that lateral gene transfer of multidomain proteins from animal hosts affects the size distribution of the multidomain protein pool of parasitic protozoa, but the domains thus acquired have lost their mobility in the intron-poor genomes of protists.
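The slopes compared above come from straight-line fits in double-logarithmic plots. A minimal sketch of such a fit is given below, assuming the power-law form P(i) = c·i^(−γ) named in the captions of Tables 7 and 8 and an ordinary least-squares line through the log-transformed points. The function name `fit_power_law` and the synthetic input are hypothetical, and the authors' exact fitting procedure is not stated in this section.

```python
import math


def fit_power_law(counts):
    """Fit log10 P(i) = log10 c - gamma * log10 i by least squares.

    counts: mapping i -> P(i), where P(i) is the number of domains
            that have mobility value i.
    Returns (gamma, c, r), with r the Pearson correlation of the log-log points.
    """
    pts = [(math.log10(i), math.log10(p))
           for i, p in counts.items() if i > 0 and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)   # negative for a decaying distribution
    return -slope, 10 ** intercept, r


# Synthetic, noise-free data generated with gamma = 2 and c = 1000.
synthetic = {i: 1000 * i ** -2.0 for i in range(1, 21)}
gamma, c, r = fit_power_law(synthetic)
```

A smaller γ corresponds to a shallower line, i.e. relatively more domains with high mobility values, which is the sense in which the metazoan distributions are described as having the shallowest slopes.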
In the case of the triplet plots the distribution of values for multidomain proteins assembled from class 1–1 modules is significantly different from those of extracellular proteins (P < 0.0001), transmembrane multidomain proteins (P < 0.0001), intracellular multidomain proteins (P < 0.0001) or total metazoan multidomain proteins (P < 0.0001). The slope of the triplet plot for multidomain proteins assembled from class 1–1 modules is shallower than that for extracellular proteins, for transmembrane proteins, for intracellular multidomain proteins or for total metazoan multidomain proteins (Table 8). Comparison of co-occurrence plots (Table 8A) and triplet plots (Table 8B) for extracellular, intracellular, transmembrane and class 1–1 multidomain proteins has also revealed that the order of the slopes is similar in the two approaches: class 1–1 < extracellular < transmembrane < intracellular multidomain proteins.

Table 7. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of domains.

(A) i is the number of types of domains with which they co-occur in multidomain proteins

     Bacteria   Archaea    Protozoa   Plants     Fungi      Metazoa
γ    1.8207     2.0278     1.9690     1.8616     2.2128     1.6170
R    0.9755     0.9316     0.9596     0.9498     0.9688     0.9588
N    43         16         21         31         22         50
P    < 0.0001   < 0.0001   < 0.0001   < 0.0001   < 0.0001   < 0.0001

(B) i is the number of domain-triplet types (local architectures) in which they occur in multidomain proteins

     Bacteria   Archaea    Protozoa   Plants     Fungi      Metazoa
γ    2.2851     2.4554     2.7118     2.5508     2.8692     2.0125
R    0.9726     0.9357     0.9852     0.9624     0.9734     0.9568
N    35         16         11         25         16         42
P    ≤ 0.0001   ≤ 0.0001   ≤ 0.0001   ≤ 0.0001   ≤ 0.0001   ≤ 0.0001

Global and local domain co-occurrence networks

Power law distributions are intimately related to the so-called scale-free networks: networks in which the frequency distribution of node degrees (i.e. the number of other nodes to which a given node is connected) follows a power law.
Accordingly, power law distributions are frequently analyzed and visualized through scale-free networks. The basis of the scale-free behavior of network evolution (and of power law distributions) is that the probability of a node acquiring a new connection is proportional to the number of links that node already has: there is a greater likelihood of new nodes being linked to pre-existing hubs. For example, the fact that the casting of actors in movies and the distribution of people according to their wealth follow a power law is a manifestation of ‘the rich get richer’ principle [5,6]. By analogy, the fact that the distribution of domains according to the frequency with which they are used to build multidomain proteins follows a power law indicates that the chance of a domain being used is proportional to the number of times it has already been used. In the present work domain co-occurrence networks and triplet networks were used to illustrate and quantify the mobility of domains and the complexity of multidomain protein networks. The number of vertices, the number of connectivities (edges) and the size of the largest connected component were used to characterize the complexity of the domain networks of different groups of organisms (Table 9). The size (the number of vertices) of the largest connected component increases linearly (with a slope of 1.0529) with the number of total vertices (as we proceed from prokaryotes to higher eukaryotes), with a ‘lag’ of about 500 vertices (Fig. 6A). A possible explanation for this phenomenon is that some ancient domains formed ancient multidomain proteins but apparently no longer participate in novel domain combinations, instead remaining ‘islands’ separated from the largest connected component of the domain network.
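The network parameters named above (vertices, edges and the largest connected component) can be sketched for a toy data set as follows. This is an illustration only: it assumes that two domain types are joined by an edge whenever they occur in the same protein, the function names are hypothetical, and the architectures are invented, although the domain names are taken from the text.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_network(architectures):
    """Build an undirected graph: domain types are vertices, co-occurrence gives edges."""
    graph = defaultdict(set)
    for arch in architectures:
        kinds = sorted(set(arch))
        for d in kinds:
            graph[d]  # register the vertex even if it has no partner
        for a, b in combinations(kinds, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def largest_connected_component(graph):
    """Return the vertex set of the largest connected component (depth-first search)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Invented architectures: four shuffled modules plus two 'island' pairs.
toy = [
    ["EGF", "CUB"],
    ["CUB", "sushi"],
    ["sushi", "EGF", "TSP_1"],
    ["Enolase_N", "Enolase_C"],
    ["Ribonuc_red_lgN", "Ribonuc_red_lgC"],
]
network = cooccurrence_network(toy)
n_vertices = len(network)
n_edges = sum(len(neighbours) for neighbours in network.values()) // 2
lcc = largest_connected_component(network)
```

Here the two domain pairs that combine only with each other stay outside the largest connected component, which is the ‘island’ behaviour the text proposes as an explanation for the lag in Fig. 6A.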
An illustrative example of this group is the ancient multidomain protein RNA polymerase Rpb1, constructed from domains RNA_pol_Rpb1–1, RNA_pol_Rpb1–2, RNA_pol_Rpb1–3, RNA_pol_Rpb1–4 and RNA_pol_Rpb1–5, which combine only with each other. The number of architecture types also increases with the number of total vertices, and the correlation is best described by a semilogarithmic plot (Fig. 6B), consistent with a model in which domains combine at random. It is noteworthy, however, that in the linear fit to the equation Y = A + B·X the value of A suggests that there are frozen ancient multidomain architectures, the constituent domains of which do not participate in the construction of novel multidomain proteins. It appears that this is another manifestation of what was noted above in connection with the set of vertices excluded from the largest connected component of domain networks: some ancient domains form ancient multidomain proteins with permanent domain partners (and conserved architectures) but are no longer used in the construction of novel multidomain architectures. A comparison of the list of domains excluded from the largest connected components in all organisms with the list of domains in conserved multidomain architectures shared by all organisms has revealed significant similarities. For example, ancient domains/multidomain proteins (fulfilling basic functions) such as Enolase_C and Enolase_N of enolase, [...] Ribonuc_red_lgN and Ribonuc_red_lgC of ribonucleotide reductases are present in both groups.

Table 8. Parameters of the linear fit of the double logarithmic plots for $P(i) = c\,i^{-\gamma}$, where P(i) is the number of domains.

(A) i is the number of types of domains with which they co-occur in extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa

     Total      Extracellular   Transmembrane   Intracellular   Class 1–1
γ    1.6170     1.3193          1.4542          1.7839          1.0984
R    0.9588     0.8879          0.9337          0.9425          0.8951
N    50         17              19              25              37
P    < 0.0001   < 0.0001        < 0.0001        < 0.0001        < 0.0001

(B) i is the number of domain-triplet types (local architectures) in which they occur in extracellular, transmembrane, intracellular and class 1–1 multidomain proteins of metazoa

     Total      Extracellular   Transmembrane   Intracellular   Class 1–1
γ    2.0125     1.6241          1.8397          2.3389          0.9233
R    0.9568     0.8668          0.9265          0.9638          0.9101
N    42         13              17              19              28
P    ≤ 0.0001   ≤ 0.0001        ≤ 0.0001        ≤ 0.0001        ≤ 0.0001

Domain networks and organismic complexity

As illustrated in Figs 6, 7 and 8 and summarized in Table 9, the total number of vertices and edges, the size of the largest connected component of domain networks, and the number of architecture types all increase parallel with the evolution of higher organisms of greater organismic ...

Identification of extracellular, intracellular and transmembrane multidomain proteins of metazoa

Extracellular, transmembrane and intracellular multidomain proteins of metazoa were identified on the basis of the subcellular location information of database entries. Extracellular proteins were identified as those annotated as extracellular, secreted or plasma proteins, intracellular proteins were identified as those ...
through multidomain transmembrane proteins such as receptor kinases, G-protein coupled receptors, etc. Comparison of domain-networks of different eukaryotes thus confirms that the evolution of increased organismic complexity in metazoa is intimately associated with the generation of novel extracellular and transmembrane multidomain proteins that mediate the interactions among their cells, tissues and organs ... (extracellular, intracellular and transmembrane multidomain proteins) has revealed that extracellular domains used in the construction of extracellular proteins (and extracellular parts of transmembrane proteins) of metazoa are particularly enriched in domains of greater mobility. Among the extracellular domains the so-called class 1–1 modules, i.e. domains which have been shuffled by ... containing extracellular domains but lacking intracellular domains and transmembrane domains. Intracellular proteins were identified as those containing intracellular domains but lacking extracellular domains and transmembrane domains. Transmembrane multidomain proteins were identified as those containing intracellular and/or extracellular domains and transmembrane domains. Pfam-A domains were assigned a subcellular ... 1–1 domains in metazoa, thereby facilitating the construction of extracellular and transmembrane multidomain proteins unique for metazoa [2,9].

Methods

Databases of multidomain proteins

Fig. 6. Correlation of the number of total vertices of domain networks with the number of vertices in LCC, the largest connected component, and with the number of architecture types. (A) The figure shows the linear fit according ...
Pfam-A domain) and there was no Pfam-A domain within this region, then the upstream region was defined as Nterm, the downstream region was defined as Cterm. To assess the number of contexts in which a given domain (domain Di) can occur in multidomain proteins we have listed all domain triplets Du-Di-Dd, where Di is the domain analyzed and Du and Dd are the domains flanking domain Di at its N- and C-terminal ... increase parallel with the evolution of higher organisms of greater organismic complexity. At one extreme we find Archaea with the lowest values for the parameters reflecting the complexity of the world of multidomain proteins. Conversely, metazoa, particularly Chordates, have the highest values in all these parameters. Figures 7 and 8 also show that significant changes occurred in the structural organization ... Gp_dh_C and Gp_dh_N of glyceraldehyde 3-phosphate dehydrogenase, Ldh_1_C and Ldh_1_N of lactate/malate dehydrogenase, FGGY_N and FGGY_C of the FGGY family of carbohydrate kinases, THF_DHG_CYH and THF_DHG_CYH_C of tetrahydrofolate dehydrogenase/cyclohydrolases, RNA_pol_Rpb1–1, RNA_pol_Rpb1–2, RNA_pol_Rpb1–3, RNA_pol_Rpb1–4, RNA_pol_Rpb1–5 of RNA polymerase Rpb1, DNA_photolyase and FAD_binding_7 ... nuclear, mitochondrial, cytoskeletal proteins, transmembrane proteins were identified as those annotated as membrane proteins. The correct assignment of proteins to these categories was also checked by the presence or absence of annotated transmembrane domains and the presence or absence of extracellular or intracellular (cytoplasmic, nuclear) Pfam-A domains. Extracellular proteins were identified as those containing ... extracellular) multidomain proteins of metazoa.

Domain size and propensity to form multidomain proteins

By plotting the size of multidomain proteins as a ...
