Domain deletions and substitutions in the modular protein evolution

January Weiner 3rd, Francois Beaussart and Erich Bornberg-Bauer

Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Germany

Keywords
domain loss; fission; fusion; protein domains; protein evolution

Correspondence
E. Bornberg-Bauer, Division of Bioinformatics, School of Biological Sciences, The Westfalian Wilhelms University of Münster, Schlossplatz 4, D48149 Münster, Germany
Fax: +49 251 8321631
Tel: +49 251 8321630
E-mail: ebb@uni-muenster.de

(Received 5 December 2005, revised 13 February 2006, accepted 9 March 2006)

doi:10.1111/j.1742-4658.2006.05220.x

FEBS Journal 273 (2006) 2037–2047 © 2006 The Authors Journal compilation © 2006 FEBS

The main mechanisms shaping the modular evolution of proteins are gene duplication, fusion and fission, recombination and loss of fragments. While a large body of research has focused on duplications and fusions, we concentrated, in this study, on how domains are lost. We investigated motif databases and introduced a measure of protein similarity that is based on domain arrangements. Proteins are represented as strings of domains and comparison was based on the classic dynamic alignment scheme. We found that domain losses and duplications were more frequent at the ends of proteins. We showed that losses can be explained by the introduction of start and stop codons which render the terminal domains nonfunctional, such that further shortening, until the whole domain is lost, is not evolutionarily selected against. We demonstrated that domains which also occur as single-domain proteins are less likely to be lost at the N terminus and in the middle than at the C terminus. We conclude that fission/fusion events with single-domain proteins occur mostly at the C terminus. We found that domain substitutions are rare, in particular in the middle of proteins. We also showed that many cases of substitutions or losses result from erroneous annotations, but we were also able to find courses of evolutionary events where domains vanish over time. This is explained by a case study on the bacterial formate dehydrogenases.

Abbreviations
Domain ID, domain identification number; EGF, epidermal growth factor; FDHF, formate dehydrogenase H.

Proteins are well known to evolve not only by point mutations, but also by modular rearrangements [1–3]. By and large, these rearrangements occur at the level of domains, which are independent folding units and have been proposed to represent the unit of modular evolution [3,4]. Most domains always form the same combinations; that is, they are always found next to the same neighbours. For example, domains found in ribosomal proteins are not found elsewhere and are always present in the same context. Also, it has been reported that many domains appear in a very much conserved order (supradomains) [5], and that the frequent occurrence of certain modular arrangements (arrangements of modules along a sequence) across phyla is the result of conservation [6].

While few domains co-occur with many others at least once in the same protein, most domains have few partner domains, or are even always singletons [3,7–9]. Well-known examples of highly linked domains occurring in many different combinations are the P-loop nucleotide triphosphate hydrolase domain, the epidermal growth factor (EGF) domain, the SH3 domain, the P-kinase domain and the domains involved in the blood clotting cascade [1,10].

The phenomenon of differential arrangements has often been termed domain mobility [11]. However, this term may be misleading as it implies that single modules or small arrangements are being transferred from one protein to another. Considering that often two modules or larger arrangements as such are fused into one protein, it becomes difficult to define which of the modules is ‘mobile’ and which is ‘static’. Therefore, it has been suggested that the term versatility should be used instead of domain mobility [3,12]. Independently of the perspective taken, the underlying mechanisms of modular rearrangements are mostly gene fusion and domain loss and, probably to a lesser extent, domain shuffling of exons and recombination [13–17].

While the emergence of domain combinations is well documented [4,6,7,18–21], relatively little is known about domain losses.

In this article, we focus on how domains are lost. Ultimately, this question is difficult to separate from the recruitment of domains because, in comparing two proteins, phylogenetic analysis is required to detect whether a domain has been recruited in one protein or lost in the other. To deal with this problem, we investigated the possible genetic mechanisms that can cause a domain to be lost or gained.

As usual in sequence analysis, information on the history of evolution can only be assumed a posteriori, meaning that disadvantageous mutations (frameshifts, domain deletions, etc.) have been weeded out by negative selection. Thus, we only observe events of modular rearrangements that are either beneficial or neutral. For the sake of comprehensiveness, we used the ProDom database [22], which records conserved sequence fragments. However, these fragments are not always identical to structural domains.
To conform with the general definition of domains [3], all key results were confirmed using Pfam, which largely agrees with structural domain definitions [23].

In the following study we first investigated whether the relative frequencies of deletions (or recruitments) depend on whether a domain is at the end or in the middle of a protein. Unless explicitly stated, we used the term ‘deletion’ as synonymous for deletions and recruitments. We then investigated whether eliminations are more frequently observed at the boundaries of domains and whether or not domain substitutions are frequent. For that purpose, we categorized and described misannotations of domains to discern them from real substitutions or deletions of domains. Next, we studied whether some domains are more often lost and whether frequencies of domain deletions depend on domain versatility. Finally, we discussed the implications of our results for a wider understanding of modular protein evolution and the possibilities for generating a model in which modular protein evolution is formally described in terms of module edit operations and cost functions.

Results and Discussion

Single domain deletions

The first question we asked was whether the probability of a domain deletion is evenly distributed throughout a protein. The null hypothesis was that genetic mechanisms which lead to domain deletions (for example, deletions and insertions of sequence fragments, intron recombinations, etc.) do not depend on the position within the sequence. However, two factors could cause a bias. First, any point mutation that creates a premature stop codon will cause a C-terminal deletion of a protein. Likewise, a mutation leading to the emergence of an alternative transcription or translation start will cause an N-terminal deletion.
Second, a fission producing two genes from one will result in the deletion of a terminal fragment from a protein or, vice versa, a fusion of two smaller proteins into one will result in the observed pattern.

We first grouped proteins by the number of domains they have (see the Materials and methods). For each protein, we searched for deletion events, that is, a protein which has exactly the same domain arrangement, except for a single domain missing anywhere in the arrangement. Then we calculated the frequency of the deletion at each domain position within the group of proteins containing a given number of domains.

We found that the domain deletions are more common at either of the protein termini, and that their occurrence is slightly higher at one of the termini, depending on the number of domains in the protein and the database selected (Fig. 1). The prevalence of terminal deletions did not depend on the number of domains in proteins, and the results for the Pfam and ProDom databases were similar. In only a few cases were slightly increased frequencies of domain deletions observed at a central position.

In line with our predictions, this suggests that the genetic mechanism of domain deletions acts predominantly on sequence termini. Therefore, we tentatively propose that the insertions of new transcription start and stop codons, as well as gene fusion and fission, are more likely to occur than, for example, intron mobility caused by exon shuffling.

Multiple domain deletions

We supported the previous findings by analysing cases where one or more domains were deleted from a protein. We considered only deletions in which at least half of the domains of the full-length arrangement were preserved, to ensure that homologous arrangements were being compared.
The results were similar to those of single domain deletions, in that the terminal deletions were prevalent (see the Supplementary material).

In many cases, a deleted domain is a part of a larger, deleted fragment. We have found that fragments deleted at either terminus are, in general, much longer than fragments deleted within a protein sequence. The deletions within the protein are much more often single domain deletions (Fig. 2). The total number of deletions that concern only one, single domain, is higher for the positions between the termini. However, the number of major deletions (deletions that span more than one domain) is higher at terminal positions. This supports the view that the deletions generally involve the protein termini.

[Fig. 1: four bar-chart panels. x-axis, Position; y-axis, Proportion of domains deleted.]
Fig. 1. Statistics of single domain deletions in the whole SwissProt/TrEMBL set of proteins. The figure shows the relative proportion of domain deletions at different positions within the proteins of length 4, 6, 10 and 11 domains. Dark grey, Pfam; light grey, ProDom.

[Fig. 2: x-axis, Length of the deleted fragment (in domains); y-axis, Number of occurrences.]
Fig. 2. Number of occurrences of domain deletions as a function of the length (in domains) of the deleted fragment. Diamonds, N-terminal deletions; squares, deletions within the protein; circles, C-terminal deletions. Single domain losses occur preferentially on one of the middle positions, whereas longer fragments tend to be deleted at the termini.

In-detail analysis of the deletion events

During our analyses, we noted that some of the apparent domain deletions are actually just misannotations. A lack of a domain identifier at a given position in a protein annotation does not necessarily mean that the corresponding domain is physically deleted. Likewise, a different identifier does not necessarily signify a physical substitution. To address this problem, we constructed clusters of similar proteins that contained at least six ProDom domains. We aligned the domain arrangements within a cluster using a simple progressive multiple alignment algorithm [24], based on pairwise alignments generated using the Needleman-Wunsch algorithm [25] (Supplementary material).

Table 1. Criteria used to distinguish between various types of sequence rearrangements and annotation artefacts that result in a disappearance of a domain in the domain string of a protein.

Evolutionary events
  physical deletion   a domain is physically deleted from the protein sequence, and only a short (<20 amino acids) fragment can be found between the neighbouring domains
  substitution        a domain is replaced by another domain that bears no similarity with the original domain
  shadow domain       at a given position, in one protein there is a ProDom domain; at the same position in another protein there is an amino acid sequence which is not similar to the given domain and which does not correspond to a ProDom ID
Annotation artefacts
  camouflage          although there are two different ProDom domains at the same position in two proteins, they are significantly similar (E << 1)
  erosion             the domain is not annotated in ProDom, but there is at this position a similar amino acid sequence

[Fig. 3: schematic with five panels. Domain-wise evolutionary events: (A) Substitution, E-value (B,D) ~ 1; (B) Shadow domain, E-value (B,seq) ~ 1; (C) Physical deletion. Annotation artefacts: (D) Camouflage, E-value (B,D) << 1; (E) Erosion, E-value (B,seq) << 1.]
Fig. 3. Classification of domain-wise events observed in the domain databases. Different evolutionary events (A, B, C) and annotation artefacts (D, E) result in an apparent ‘deletion’ of a ProDom domain from a protein annotated in terms of ProDom domains. Domain and dot plots can be found in the Supplementary material.

We were able to distinguish five types of phenomena that resulted in an apparent deletion from the domain arrangement (Table 1, Fig. 3). The first two were real substitutions and physical deletions of domains. In some cases, at the site where the domain annotation was missing, there was, in fact, a sequence similar to the sequence of this domain. However, because of length or large evolutionary distance, this sequence was not annotated by the automatic annotation mechanism of ProDom (‘erosion’). In other cases, if there is a high sequence variation between the instances of the domains with a given identification number (ID), homologous sequences can be assigned different ProDom identifiers (‘camouflage’). Yet, in other cases, although the annotation (ProDom ID) of a given domain is missing, there is no physical deletion or misannotation.
Instead, the amino acid sequence at this position is not similar to the given ProDom domain; therefore, it is a case of a real substitution. We call this case a ‘shadow domain’.

For each of these events, we counted its occurrence in the constructed protein clusters (see the Materials and methods for details), at each position in each protein cluster, as follows. If a domain was found to be deleted from an arrangement in a cluster, the amino acid sequences occurring in all the sequences of the cluster at the given position were analysed. We have applied the criteria from Table 1 to distinguish between the three types of real evolutionary events (physical domain deletion, substitution and shadow domains) and two types of annotation artefacts (camouflage and erosion). In the case of physical deletions, shadow domains and erosions, the numbers of these events were simply counted. However, in the case of substitutions and camouflage, it is not reasonable to count the number of occurrences of such an event without inferring a direction of the substitution. For example, if at a certain position in a cluster, domain A occurs in two sequences, and each of the domains B and C occurs five times, then what frequency of the substitutions should be assumed here? We have used the following routine: all possible pairwise combinations of domains from different proteins occurring at the same domain position in a cluster were analysed. If the two domains in a pair were different, then an event (substitution or camouflage) was recorded.
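The pairwise counting routine just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a cluster position is represented as a plain list holding one domain ID per protein, and the sketch does not decide whether a differing pair is a substitution or a camouflage (that would need the E-value criteria of Table 1).

```python
from collections import Counter
from itertools import combinations


def count_pairwise_events(column):
    """Count protein pairs whose domains differ at one cluster position.

    `column` holds one domain ID per protein (hypothetical representation).
    """
    return sum(1 for a, b in combinations(column, 2) if a != b)


def count_pairwise_events_fast(column):
    """Same count without enumerating pairs: all pairs minus identical pairs."""
    counts = Counter(column)
    n = sum(counts.values())
    all_pairs = n * (n - 1) // 2
    identical_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return all_pairs - identical_pairs


# The example from the text: A occurs twice, B and C five times each.
column = ["A"] * 2 + ["B"] * 5 + ["C"] * 5
print(count_pairwise_events(column))       # 45 (10 A-B + 10 A-C + 25 B-C pairs)
print(count_pairwise_events_fast(column))  # 45
```

Because every pair of proteins is counted, the number of recorded events grows roughly with the square of the number of sequences at a position rather than with the number of underlying evolutionary changes.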
Therefore, the calculated numbers of substitution and camouflage events cannot be used to infer any conclusions on the actual substitution rate of domains; however, because at all domain positions the number of camouflage and substitution events have been calculated in the same way, relative frequencies of the camouflage and substitution events at different positions can be inferred.

The relative frequencies of physical domain deletions, substitutions and shadow domains are all higher at the termini. The average domain deletion frequency is 9% overall: 7% at the nonterminal positions and 20% at the termini (Table 3). This trend cannot be seen in the case of annotation artefacts (Fig. 4, Table 3). Furthermore, annotation artefacts are 10 times rarer than real, physical events (Table 3). Therefore, our previous results for single-domain and multiple deletions are scarcely affected by inaccuracies of the database annotations and reflect real evolutionary events. This supports the aforementioned finding that the majority of deletions are caused by the physical deletions of protein termini.

We repeated this analysis to test whether there are differences between prokaryotes and eukaryotes; however, we did not find significant differences (see the Supplementary material).

Distribution of termini length in proteins

We have further pursued the question of whether the terminal deletions can be regarded as truly modular events; that is, to what extent evolution preserves domain boundaries upon domain deletion. The null hypothesis is that in the case of nearly neutral evolution, the domains are depleted gradually, and partially deleted domain fragments are common. In such a case, the evolution of proteins cannot be modelled by the approximation of domains or modules. However, several factors can make the situation different. First, selection pressure could rapidly eliminate the truncated fragments, because unnecessary biosynthesis of the nonfunctional protein fragments should reduce fitness.
Second, if domain deletions are caused by genetic mechanisms preserving domain boundaries (such as gene fusions), partial domains will be rare. If this is the case, amino acid sequence deletions can be simplified to domain deletion events, and thus protein evolution could be abstracted to the level of modules.

[Fig. 4: bar chart with two groups, Evolutionary events and Annotation artefacts.]
Fig. 4. Results of the protein clusters analysis: relative percentages of different evolutionary events and annotation artefacts at different domain positions within the analysed sequences. Error bars indicate the standard error of the calculated proportion. The values for the ‘Middle position’ were averaged from the values for all nonterminal positions.

We tackled this problem as follows. We have constructed clusters of proteins. Each cluster contained proteins with the same domain arrangement, or with an arrangement shortened by a terminal domain deletion, either N terminal or C terminal. We recorded the length of the N- or C-terminal amino acid sequence and plotted the distribution of its length (see the Materials and methods for details). The lengths were normalized for every protein cluster and then averaged for evaluation. A length of 0 corresponds to the case when the terminal domain is completely deleted, and 100 to the average length of the terminal domain in the whole cluster. Furthermore, we refined these results by counting only the protein sequence fragments that are similar, at the amino acid sequence level, to the remaining sequence of the deleted domain, given one of two E-value thresholds. These E-values between those fragments and the intact domain were recorded and put in three bins, each for a different range of E-values (any E-value; 0 ≤ E ≤ 0.01; 0 ≤ E ≤ 1 × 10⁻⁵).

The distributions of termini lengths are shown in Fig. 5.
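As a rough illustration of the normalization and binning described above, the following Python sketch rescales a terminal fragment length so that 0 means a fully deleted terminal domain and 100 means the average intact domain length in the cluster, and assigns an E-value to the three (nested) bins. The function names and bin labels are hypothetical, and how the fragment length itself is measured is an assumption not spelled out in the text.

```python
def normalised_terminus_length(fragment_length, intact_domain_lengths):
    """Rescale a terminal fragment length to percent of the mean intact domain.

    fragment_length: residues remaining at the terminus (assumed input).
    intact_domain_lengths: lengths of the terminal domain in the cluster
    members that still carry it (must be non-empty).
    """
    if not intact_domain_lengths:
        raise ValueError("need at least one intact terminal domain")
    mean_length = sum(intact_domain_lengths) / len(intact_domain_lengths)
    return 100.0 * fragment_length / mean_length


def evalue_bins(e_value):
    """Return every bin a fragment falls into; the bins are nested."""
    bins = ["any"]
    if e_value <= 0.01:
        bins.append("E<=0.01")
    if e_value <= 1e-5:
        bins.append("E<=1e-5")
    return bins


print(normalised_terminus_length(0, [100, 120]))    # 0.0, domain fully deleted
print(normalised_terminus_length(110, [100, 120]))  # 100.0, full average length
print(evalue_bins(1e-7))  # ['any', 'E<=0.01', 'E<=1e-5']
```

Values above 100 are possible under this scaling whenever a remaining terminus is longer than the average intact domain.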
The distributions show that complete domains are much more likely to be present in proteins, and that partial domains are rare at the terminal ends. These distributions hold also for sets of data in which sequences containing three or fewer domains were removed, and also in the case of Pfam domains (Fig. 5, bottom). If an E-value was applied (only fragments similar to the given domain were considered), the shorter sequences with a terminal fragment that was completely lost were eliminated from the histogram. This was not necessarily because the fragments were not homologous, but because the fragments were too short to show any significant similarity. However, the right part of the distribution, corresponding to sequence fragments of > 50% of the average domain length, did not change significantly (grey bars in Fig. 5).

[Fig. 5: four histogram panels (ProDom, N-terminus; ProDom, C-terminus; Pfam, N-terminus; Pfam, C-terminus). x-axis, Length of the N (or C) terminus in % of the deleted domain; y-axis, Number of occurrences.]
Fig. 5. Length distributions of the remaining fragment from a terminal domain. Distribution of the length of the terminal sequences is based on comparison of domain arrangement alignments. Left, distribution on the N-termini; right, distribution on the C-termini. The lengths are relative to the size of the deleted domain (= 100%). White bars, all terminal fragments; light grey, terminal fragments similar to the deleted domain (E < 0.01); dark grey, terminal fragments significantly similar to the deleted domain (E < 1 × 10⁻⁵). Top, results for the ProDom database; bottom, results for the PfamA dataset.

Table 2. Deleted domains and domain versatility.

Position                Fraction as single   Fraction as single   Average NN            Average NN
                        for all (a)          for deleted (b)      for all (c)           for deleted (d)
Total for all domains   3.00% ± 0.04         1.82% ± 0.10         2.50 ± 0.02           2.40 ± 0.08
N-terminus              2.46% ± 0.07         1.81% ± 0.16         1.67 ± 0.02           2.13 ± 0.12
middle                  2.40% ± 0.05         0.96% ± 0.13         3.65 (1.83) ± 0.03    4.12 (2.06) ± 0.29
C-terminus              3.20% ± 0.09         3.64% ± 0.27         1.70 ± 0.02           2.72 ± 0.19

(a) Overall fraction of domains that were found to form single-domain proteins; (b) fraction of deleted domains that were found to form single-domain proteins; (c) average number of neighbours for all domains in the protein clusters ± standard error; (d) average number of neighbours for the deleted proteins ± standard error. As each of the domains in a middle position has two neighbours, the values in parentheses are the averages divided by two. The results are based on a dataset with proteins having 3 or more domains.

Domain deletions and domain versatility

Finally, we investigated whether the domain deletion events were connected to the properties of the deleted domains themselves. Specifically, we wished to establish whether the versatility of a domain plays a role in domain deletions. Furthermore, we considered that domains can, in general, fold autonomously. Therefore, we recorded how often domains that are lost form single-domain proteins.

First, we calculated the fraction of domains that also occur as single-domain genes in the sets of domains that are deleted at an N-terminal, C-terminal or central position. We found that the domains which also occur as single-domain proteins are found two to four times more frequently at the termini, and twice as frequently at the C terminus as at the N terminus (Table 2).
Surprisingly, the average fraction of domains that also occur as single-domain genes is lower for the domains that partake in deletion events than the average for all domains.

The ability of a domain to form autonomous, single-domain proteins may be related to its versatility. We have therefore calculated the domain connectivity and found that it is highest for the nonterminal domains. However, as the domains at a nonterminal position have, on average, two neighbours, whereas the terminal domains have only one, the averages for this type of domain must be halved. In that case, the percentages of domains that form autonomous, single-domain proteins are higher for domains that undergo deletions at the termini, and lower for domains that undergo deletions at a nonterminal position (Table 2). Again, the numbers of domains that form autonomous, single-domain proteins are highest for the domains that are deleted at the C terminus.

We conclude that the elevated rates of domain deletions at the terminal regions are partly related to domain versatility and their ability to function outside a multidomain protein (to form single-domain proteins). The events involving domain acquisition/loss are twice as frequent at the C terminus as at the N terminus (Table 2).

Case study: bacterial formate dehydrogenases

An exemplary cluster of bacterial formate dehydrogenase proteins is shown in Fig. 6. This cluster illustrates several modular events, including domain deletion, a substitution by a diverged sequence fragment, and erosion (Fig. 6B). A multiple alignment of the protein sequences can be found in the Supplementary material. For some of the proteins the structure is known [26]. We analysed the phylogeny of the cluster, as derived from whole protein sequences (Fig. 6C). The obtained phylogenetic tree is consistent with the modifications of the domain arrangements (Fig. 6D), and the revealed events can be associated with the tree nodes.
Significant rearrangements take place at the sixth position of the cluster where, in different proteins, we found two different ProDom domains, shadow domains and, at one position (in the protein O59078), a complete deletion. Further rearrangements are found at the protein C terminus: two proteins have two additional domains. The shadow domains may either be the result of a substitution by another sequence, or of such a high accumulation of mutations in a domain that it is no longer similar to the original sequence.

There are three variable regions in the domain arrangement of the protein cluster. First, at position 6 in the arrangement, in some proteins there are similar sequences that were not annotated in ProDom (‘erosion’) or domains which were annotated differently because of high sequence divergence (‘camouflage’). Next, at position 8, there is a substitution in two of the sequences. Finally, the C-terminal part is missing, truncated or eroded in many sequences, for example in the illustrated structure (Fig. 6A,B).

Conclusions

Our main conclusions are as follows: (a) domain deletion events occur frequently at either of the termini, (b) the deletions occur domain-wise; that is, in most of the cases the whole domain is lost, (c) domain losses correlate with domain versatility (i.e. the number of different combinations in which a domain occurs), (d) versatile domains are more frequently found at the C terminus and (e) clear definitions can be given to distinguish misannotations from physical deletions. Ultimately, the question ‘What is the probability of a domain deletion?’ can only be answered using domain phylogenies. However, our study shows that the deletion events are quite frequent; in the collected protein clusters, the frequencies of proteins in a cluster with a domain deleted at either of the termini were ≈ 9% (Table 3), which provides a rough estimate for the frequency of deletion events in protein–protein comparisons.

Table 3. Results of the analysis of protein clusters for the ProDom database. Numbers in the table correspond to the absolute numbers of events recorded (% of the events recorded is given in parentheses).

Event                      Average (%)     N-terminus      middle          C-terminus
Total number of domains    152105          14520           123065          14520
Real events:
  Deletions                13925 (9.2)     2998 (20.6)     8077 (6.6)      2850 (19.6)
  Substitutions            3034 (2.0)      546 (3.8)       2000 (1.6)      488 (3.4)
  Shadow domains           8770 (5.8)      1811 (12.5)     5399 (4.4)      1560 (10.7)
Annotation artefacts:
  Camouflage               1557 (1.0)      110 (0.8)       1391 (1.1)      56 (0.4)
  Erosion                  1235 (0.8)      82 (0.6)        1001 (0.8)      152 (1.0)

The fact that the domain deletions are not uniformly distributed along a protein, but that they nonetheless follow a distinct pattern of domain deletions, is an important conclusion in the context of constructing algorithms for sequence alignments that take into account domain arrangements of proteins. It also provides a biological justification for choosing a lower end-gap penalty in sequence alignment algorithms, such as clustalw [27].

In conclusion, by analysing the versatility of deleted domains and their ability to form single-domain proteins, we have found that, while gene fusion and fission indeed play a significant role in the deletion events at the termini, the introduction of new start and stop codons also plays a major role. The fraction of the deleted domains that can be found as single-domain proteins was twice as high at the C terminus (Table 2), as was the connectivity of the C-terminally deleted domains. This suggests that in a gene fusion or fission event, the versatile, single-domain protein is more likely to be found at the C terminus.
This may be explained by the fact that in a gene fusion/fission event, or in the case of introduction of new start and stop codons, the N-terminal part of the coding sequence remains connected to its promoter region and regulatory sites. Thus, a versatile domain that is fused with the C terminus of a much larger protein will not have an effect on the regulation of the whole protein, because it will not modify the promoter region and regulatory sites. Our results suggest such a selective disequilibrium: the function (and regulation) of the protein is connected to its N-terminal part, and therefore the fusion/fission events involving smaller, versatile domains will occur more frequently at the C terminus.

Moreover, we have found that the event of domain deletion occurs mostly in a modular manner. This can have two explanations. First, the apparent domain deletion can be caused by gene fusion or fission. Second, a domain fragment truncated (e.g. by a nonsense mutation) that is no longer functional may be rapidly eliminated by natural selection. Either way, the domain deletions effectively respect domain boundaries. These results have further supported the emerging view that, by and large, the modular evolution of proteins is dominated by two major types of events: fusion, on the one hand, and deletion and fission on the other [3,4,21,28]. Exon shuffling and recombination seem to be rare.

[Fig. 6: four panels, A to D.]
Fig. 6. Cluster of the bacterial formate dehydrogenases. (A,B) The structure of formate dehydrogenase H (FDHF) from Escherichia coli. (C) Phylogeny of the analysed proteins obtained by the parsimony method with 100 bootstraps. (D) The corresponding domain arrangements of the analysed proteins. Colour code: (A) is coloured according to the ProDom annotation, with one colour for every domain. Colours and arrows on (B) indicate events identified by analysis of a cluster of related proteins and correspond to the coloured arrows on (C) and (D). The symbols on (C) show a possible attribution of the events to tree nodes. sub, substitution; del, deletion/insertion; colours of the symbols correspond to the colours on (B). The coloured boxes on (D) correspond to different ProDom domains and are the same as on (A). The black thin boxes on position 6 correspond to ‘shadow domains’.

Materials and methods

For the analyses, ProDom [22] version 2004.1 was used. The main results were confirmed using Pfam, release 17 [29]. Each database contains a number of domain arrangements, that is, proteins annotated in terms of domains. All supplementary materials can be found on our web page (http://www.uni-muenster.de/Bioinformatics/services/domdel/).

Overall single deletion statistics

Proteins from the ProDom database and, separately, from the Pfam database, were divided into sets according to the number of domains. Each set contained all proteins with a fixed number of domains, for example ‘set6’ contained proteins with six domains.

Each protein from a given set containing proteins of length N domains was compared with each protein from the set containing proteins of length N−1 domains. For example, a protein with six domains was compared with all proteins that have five domains. If the shorter arrangement was identical to the longer one, with the exception of a single, missing domain, a deletion was registered. The position of the deletion within the domain arrangement was recorded. For example, given the five-domain arrangement ABDEF (where A to F are domains), it is identical to the six-domain arrangement, ABCDEF, with the exception of the deleted domain C.

The average deletion frequency was calculated as the number of all deletion events divided by the total number of domains in all the examined sequences.
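The comparison of N-domain and (N−1)-domain arrangements described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, arrangements are assumed to be sequences of domain identifiers, and positions are reported 1-based. When a domain is repeated (for example AAB versus AB), more than one position fits and this sketch simply reports the first.

```python
from typing import List, Optional, Sequence


def single_deletion_position(longer: Sequence[str],
                             shorter: Sequence[str]) -> Optional[int]:
    """Return the 1-based position of the one domain missing from `shorter`.

    Returns None unless `shorter` equals `longer` with exactly one domain
    removed. With repeated domains the first matching position is returned.
    """
    if len(longer) != len(shorter) + 1:
        return None
    longer, shorter = list(longer), list(shorter)
    for i in range(len(longer)):
        if longer[:i] + longer[i + 1:] == shorter:
            return i + 1
    return None


def deletion_positions(set_n, set_n_minus_1) -> List[int]:
    """Compare every N-domain arrangement with every (N-1)-domain one."""
    positions = []
    for longer in set_n:
        for shorter in set_n_minus_1:
            pos = single_deletion_position(longer, shorter)
            if pos is not None:
                positions.append(pos)
    return positions


# The example from the text: ABDEF is ABCDEF without domain C (position 3).
print(single_deletion_position(list("ABCDEF"), list("ABDEF")))  # 3
```

Single letters stand in for domain identifiers here; real ProDom or Pfam IDs would be passed as lists of strings in the same way.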
The relative domain deletion frequency at a given domain position in a set of proteins of a given length was defined as the number of deletions at this position, divided by the total number of deletions in this set.

These investigations were repeated with a nonredundant data set, in which each arrangement was represented only once. That is, from a set of proteins which had the same domain arrangement, only one representative was kept.

Overall multiple deletion statistics
For each domain arrangement given, all other arrangements that would be obtained by removal from the given arrangement of one or more domains were considered. For example, if A to F are domains, and ABCDEF is the given arrangement, then we would consider the arrangements ABCDE, BCD, ABEF, etc.

Similarity of protein arrangements
For the purpose of constructing multiple domain arrangement alignments and domain arrangement-based phylogenies, we implemented the Needleman-Wunsch global alignment algorithm [25] for protein domains, with the parameters as defined previously [17]: match = 10, mismatch = −5, gap = −1.

Construction of protein clusters
We constructed clusters of proteins with similarity in their domain arrangement of > 80%. Only clusters that had at least six domains were considered. For each protein from the ProDom database, all proteins were considered that had one domain less than the given protein. If a given protein matched the examined arrangement by all but one domain, a deletion event was recorded. Starting with a single protein, a number of hits was recorded and added to the cluster; furthermore, these proteins were used to obtain the next set of hits (i.e. proteins that have one domain less than the protein that was used in the search). The procedure stopped for a given cluster when no further similar domain arrangements were found.
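The iterative search described above can be read as a breadth-first expansion from a seed arrangement. The sketch below is only an illustration under stated assumptions: the database is modelled as a Python set of distinct arrangements rather than individual proteins, `one_deletion_variants` and `grow_cluster` are hypothetical names, and the > 80% similarity criterion and the minimum-size filters are left out.

```python
from collections import deque


def one_deletion_variants(arrangement):
    """All arrangements obtained by removing exactly one domain."""
    return {
        arrangement[:i] + arrangement[i + 1:]
        for i in range(len(arrangement))
    }


def grow_cluster(seed, known_arrangements):
    """Collect arrangements reachable from `seed` by repeated single deletions.

    Each hit (an arrangement present in `known_arrangements` that has one
    domain less than the query) joins the cluster and is itself used as the
    next query. The search stops when no new hits are found.
    """
    cluster = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for variant in one_deletion_variants(current):
            if variant in known_arrangements and variant not in cluster:
                cluster.add(variant)
                queue.append(variant)
    return cluster


# Toy database of arrangements; single letters stand in for domain identifiers.
known = {"ABCDEF", "ABDEF", "ABDE", "ABCD", "XYZ"}
print(sorted(grow_cluster("ABCDEF", known)))  # ['ABCDEF', 'ABDE', 'ABDEF']
```

Generating the shorter variants and looking them up in a set avoids comparing the query against every shorter arrangement in the database. ABCD is not reached in the toy example because no intermediate five-domain arrangement connects it to the seed.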
Only clusters containing at least 10 proteins and 10 ProDom domains were used for further analysis. Additionally, the amino acid sequences of all the proteins in the cluster were collected. The resulting clusters were subsequently aligned with a simple multiple-domain arrangement alignment algorithm (progressive alignment). The length (in terms of domains) of a cluster was defined as the length of the multiple-domain arrangement alignment.

Calculation of the relative event frequency at different domain positions in protein clusters
For each of the events, e, and for each of the sets of clusters of a given length, l, the frequency of the event at a position, k, was defined as:

$$f_{e,k} = \frac{n_{e,k}}{\sum_{i=1}^{l} n_{e,i}},$$

where $n_{e,i}$ is the number of occurrences of the event e at the domain position i. The average frequency at the middle positions (that is, all domain positions except the N- and C termini) was calculated as:

$$n_{e,\mathrm{middle}} = \frac{\sum_{i=2}^{l-1} f_{e,i}}{l-2}.$$

Finally, the N-terminal, C-terminal and central position frequencies for each event were averaged for all sets of clusters.

Distribution of amino acid sequence length of the termini
For each of the databases ProDom and Pfam, two sets of alignments were created: one for N-terminal deletions, and one for C-terminal deletions. In each set, an alignment contained sequences that had one of the two types of arrangements: either a complete arrangement, or one in which a terminal domain was missing from the ProDom description. Alignments were constructed from the whole ProDom database. Only alignments which contained at least one complete sequence and one sequence with a missing domain (depending on the set, either N- or C terminal) were considered.

For each alignment in each set, the average size of the deleted domain was calculated for the proteins with the complete arrangement.
To take into account the variability of the length of the complete domain, the length of the N-terminal fragment was defined as the length of the amino acid sequence preceding the next domain in the arrangements, expressed as the percentage of the calculated average length of the deleted domain in this alignment. Finally, the distribution of these values throughout all of the analysed alignments was calculated.

References
1 Patthy L (1999) Protein Evolution. Blackwell Science, Oxford.
2 Liu J & Rost B (2004) CHOP: parsing proteins into structural domains. Nucleic Acids Res 32, W569–W571.
3 Bornberg-Bauer E, Beaussart F, Kummerfeld S, Teichmann S & Weiner J 3rd (2005) The evolution of domain arrangements in proteins and interaction networks. Cell Mol Life Sci 62, 435–445.
4 Vogel C, Teichmann S & Pereira-Leal J (2005) The relationship between domain duplication and recombination. J Mol Biol 346, 355–365.
5 Vogel C, Berzuini C, Bashton M, Gough J & Teichmann S (2004) Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 336, 809–823.
6 Gough J (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21, 1464–1471.
7 Apic G, Gough J & Teichmann S (2001) An insight into domain combinations. Bioinformatics 17 (Suppl. 1), S83–S89.
8 Wuchty S (2001) Scale-free behavior in protein domain networks. Mol Biol Evol 18, 1694–1702.
9 Bornberg-Bauer E (2002) Randomness, structural uniqueness, modularity, and neutral evolution in sequence space of model proteins. Z Phys Chem 216, 139–154.
10 Madera M, Vogel C, Kummerfeld S, Chothia C & Gough J (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32, D235–D239.
11 Doolittle R & Bork P (1993) Evolutionarily mobile modules in proteins. Sci Am 269, 50–56.
12 Apic G, Huber W & Teichmann S (2003) Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J Struct Funct Genomics 4, 67–78.
13 Ponting C & Russell R (1995) Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci 20, 179–180.
14 Uliel S, Fliess A & Unger R (2001) Naturally occurring circular permutations in proteins. Prot Eng 14, 533–542.
15 Fliess A, Motro B & Unger R (2002) Swaps in protein sequences. Proteins 48, 377–387.
16 Bujnicki J (2002) Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol Biol 2, 3.
17 Weiner J 3rd, Thomas G & Bornberg-Bauer E (2005) Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics 21, 932–937.
18 Apic G, Gough J & Teichmann S (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311–325.
19 Bashton M & Chothia C (2002) The geometry of domain combination in proteins. J Mol Biol 315, 927–939.
20 Vogel C, Bashton M, Kerrison N, Chothia C & Teichmann S (2004) Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14, 208–216.
21 Kummerfeld S & Teichmann S (2005) Relative rates of gene fusion and fission in multi-domain proteins. Trends Genet 21, 25–30.
22 Corpet F, Servant F, Gouzy J & Kahn D (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res 28, 267–269.
23 Zhang Y, Chandonia J, Ding C & Holbrook S (2005) Comparative mapping of sequence-based and structure-based protein domains. BMC Bioinformatics 6, 77.
24 Feng D & Doolittle R (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25, 351–360.
25 Needleman S & Wunsch C (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443–453.
26 Boyington J, Gladyshev V, Khangulov S, Stadtman T & Sun P (1997) Crystal structure of formate dehydrogenase H: catalysis involving Mo, molybdopterin, selenocysteine, and an Fe4S4 cluster. Science 275, 1305–1308.
27 Thompson J, Higgins D & Gibson T (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673–4680.
28 Weiner J 3rd & Bornberg-Bauer E (2006) Evolution of circular permutations in multi-domain proteins. Mol Biol Evol 23, 734–743.
29 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy S, Griffiths-Jones S, Howe K, Marshall M & Sonnhammer E (2002) The Pfam protein families database. Nucleic Acids Res 30, 276–280.

[...]

Supplementary material
The following supplementary material is available online:
Fig. S1. Statistics for single domain deletions.
Fig. S2. Statistics for multiple domain deletions.
Fig. S3. Detailed results of the cluster analysis.
Fig. S4. Results for the comparison of eukaryotes and prokaryotes.
Fig. S5. Pairwise multiple alignment algorithm for domain arrangements.
This material is available as part of the online article from http://www.blackwell-synergy.com