A study on genomic distribution and sequence features of human long inverted repeats reveals species-specific intronic inverted repeats

Yong Wang and Frederick C. C. Leung
School of Biological Sciences and Genome Research Centre, The University of Hong Kong, China

Keywords: human; intron; long inverted repeat; primates; stem-loop

Correspondence: Y. Wang, School of Biological Sciences, The University of Hong Kong, Hong Kong, China. Fax: +852 2857 4672. Tel: +852 2299 0825. E-mail: wangyong@graduate.hku.hk

(Received 11 December 2008, revised 19 January 2009, accepted 23 January 2009)

doi:10.1111/j.1742-4658.2009.06930.x

The inverted repeats present in a genome play dual roles. They can induce genomic instability and, on the other hand, regulate gene expression. In the present study, we report the distribution and sequence features of recombinogenic long inverted repeats (LIRs) that are capable of forming stable stem-loops or palindromes within the human genome. A total of 2551 LIRs were identified, and 37% of them were located in long introns (largely > 10 kb) of genes. Their distribution appears to be random in introns and is not restrictive, even for regions near intron–exon boundaries. Almost half of them comprise TG/CA-rich repeats, inversely arranged Alu repeats and MADE1 mariners. The remaining LIRs are mostly unique in their sequence features. Comparative studies of human, chimpanzee, rhesus monkey and mouse orthologous genes reveal that human genes have more recombinogenic LIRs than other orthologs, and over 80% are human-specific. The human genes associated with the human-specific LIRs are involved in the pathways of cell communication, development and the nervous system, as based on significantly over-represented Gene Ontology terms. The functional pathways related to the development and functions of the nervous system are not enriched in chimpanzee and mouse orthologs. The findings of the present study provide insight into the role of intronic LIRs in gene regulation and
primate speciation.

Abbreviations: FDR, false discovery rate; GO, Gene Ontology; LIR, long inverted repeats; siRNA, small interference RNA; TIR, terminal inverted repeat.

FEBS Journal 276 (2009) 1986–1998 © 2009 The Authors. Journal compilation © 2009 FEBS

An inverted repeat consists of two repeat copies (hereafter termed arms) that are approximately complementary to each other. Generally, there is a spacer between the arms, and the full structure of an inverted repeat can form a stem-loop or palindrome. The potential to form a stable stem-loop is determined by the arm size, spacer size and the matching degree of the arms [1,2]. For example, a relatively large spacer makes it difficult for the two arms to form a stem.

Studies of inverted repeats show that they may cause instability in a genome and, on the other hand, regulate gene expression in both prokaryotes and eukaryotes. Being capable of forming secondary structures [3], inverted repeats can induce genomic instability via gene amplification, recombination, DNA double-strand breaks and rearrangement [1,2,4–8]. Moreover, inverted repeats provide sites for the integration of viruses into eukaryotic genomes [9,10] and also comprise replication stall sites, as shown in a recent study in which evidence obtained in vivo demonstrated replication stalling by hairpins formed by inverted repeats in bacteria, yeast and mammalian cells [11]. As a result, they are restricted in a genome to some extent. For example, neighboring repetitive elements, such as Alu repeats, are generally found to occur in the same direction, and those in the styles of head-to-head and tail-to-tail are rarely observed, particularly when the spacer between them is small [1,12]. In a mouse transgene experiment, the introduction of a large palindrome was followed by numerous rearrangements, which were assumed to comprise a solution for attenuating the impact of the palindrome in the progeny [13].

Inverted repeats also regulate gene
expression. The stem-loops and palindromes constructed by inverted repeats are involved in RNA interference, transcription initiation of genes, initiation of DNA replication and alternative splicing of exons. The small interference (si)RNA genes active in RNA interference comprise inverted repeats capable of forming a stem-loop motif longer than 22 bp. Some are derived from miniature inverted-repeat transposable elements [14]. At present, studies have identified siRNA genes from Caenorhabditis elegans to humans. RNA interference was initially discovered as an efficient mechanism for inhibiting the expression of specific genes [15,16], and later was found to be responsible for developmental regulation [17,18] and heterochromatin maintenance [19,20]. In promoters, inverted repeats can facilitate the recognition process and the subsequent binding of RNA polymerase during gene transcription [21,22]. Moreover, the inverted repeats in a cruciform structure will attract mediators of second messenger-directed transcription, hence altering the transcriptional response [23]. Many studies also show that inverted repeats are essential for the initiation of DNA replication in plasmids, bacteria, eukaryotic viruses and mammalian cells [24]. The inverted repeats in introns are able to affect the alternative splicing of exons [25,26] and the removal efficiency of introns [27,28]. For example, alternative splicing of an exon in the COL2A1 gene was mediated by a stem-loop adjacent to the exon–intron boundary [25].

Because inverted repeats are both unstable and functional elements in a genome, they are expected to be distributed in intergenic regions or large introns of genes. In the yeast genome, almost 100% of large palindromes (> 25 bp) are far away from coding regions [29]. Any insertion approaching conserved transcribed sites will ultimately be erased unless its presence provides an evolutionary advantage and, thus, is under positive selection. One line of evidence for this is that, compared
to introns, upstream regions of genes have more palindromes, which probably developed for the initiation of transcription [30]. A recent study shows that Caenorhabditis lineages have conserved inverted repeats in intergenic regions [31], which were suggested to be functional and therefore actively maintained in the lineages. In the human genome, there are many such motifs, although we have little knowledge of their fine-scale distribution, sequence features and potential functions at present [32,33].

Human inverted repeats were investigated in a previous study, in which a majority of them were found to be weak with respect to their capacity to form a simple stem-loop or hairpin in terms of their structural features [32]. Genome-wide distribution of human palindromes has also been surveyed, and a database has been created for public use [30]. However, the palindromes with mismatches and indels were not collected in the database. In the present study, we first located all the long inverted repeats (LIRs) characterized by long arms, high arm similarity and a short internal spacer in the human genome. They were termed recombinogenic LIRs in our previous study on human chromosomes 21 and 22 [33], although their distribution and frequency had not been fully surveyed in the whole human genome. The present study aims to provide a panoramic view of recombinogenic LIRs. On the basis of evidence obtained in vivo [1,2,11], the LIRs identified in the present study can easily form stem-loops or palindromes. Their presence in the human genome by itself implies that they are functional in some manner. The results obtained showed that 37% of the LIRs were located in intronic regions and some were primate-specific. TG/CA-rich repeats are the most frequently observed feature in LIR arms. Considering that the LIRs probably have essential functions and drive the speciation of primates, we studied the degree of conservation and species specification of
the LIRs among orthologous genes from the mouse (Mus musculus), rhesus monkey (Macaca mulatta), chimpanzee (Pan troglodytes) and human. The results obtained demonstrate that human orthologs have relatively more LIRs, most of which are human-specific. These human-specific LIRs are probably essential for the development of the advanced functions of the human nervous system in light of the Gene Ontology (GO) profile of human orthologs.

Results

Characters and distribution of LIRs in human autosomes

We identified 2551 LIRs in human autosomes and approximately 87% of them have a short spacer (0–9 bp) and arm (31–59 bp) (Fig. 1). By contrast, the mismatch rate between the arms of an LIR varies from 0 to 0.15, showing a relatively lower standard deviation with respect to the numbers of LIRs in different ranges (Fig. 1). These results indicate that a majority of the LIRs are able to form a stem-loop with a stem of 31–59 bp and a tiny loop (or none for a palindrome).

Fig. 1. Characteristics of human LIRs. The 2551 LIRs were classified according to spacer size, mismatch rate and arm length.

Fig. 2. Distribution of LIRs in the human genome. The density is represented by the number of LIRs per Mb of sequence. The shortest bars denote one LIR per Mb.

The genomic distribution of the LIRs shows that the density of LIRs selected by our criteria is quite low (Fig. 2). The highest and lowest LIR densities were observed in chromosomes (1.2/Mb) and 22 (0.44/Mb), respectively. Interestingly, the LIR density negatively co-varies with gene density among the chromosomes (t = 19.8; P < 10⁻⁴) (Fig. 3). The point denoting chromosome 19 is notably far from the regression line, which is accounted for by the gene clusters that contribute to the two-fold higher gene density of chromosome 19 compared to the genomic average [34]. We found that the negative correlation is due to the
high frequency of LIRs in long genes. A total of 956 LIRs (in 702 genes) were located in genic regions, and 1595 in intergenic regions. In other words, 37% of the LIRs were found within genes. However, our calculation of the coverage of genes in the human genome was 26.9%, which is consistent with the value reported previously [35]. When introns were taken into account, the percentage was 25.1%. This implies that the distribution of the LIRs is not random. Statistical analysis performed on the results shows that the presence of LIRs is significantly biased to be within genes (chi-square test; P < 0.0001). The LIRs that have long arms (> 400 bp) and the associated genes are listed in Table S1. Surprisingly, over half of them were found within genes.

Fig. 3. Negative correlation between gene density and LIR frequency. The black dots show the correlation between gene density and LIR density in the 22 chromosomes. [Plot: gene density (/Mb) against S-LIR density (/Mb); regression line y = –23.8x + 33.6, R² = 0.53; the point for chr. 19 is labeled.]

There are two cases of partial overlap between LIRs and exons. In one case, the left arm of an intronic LIR extends into an exon of c14orf165 and, in the other case, an LIR on chromosome 17 partially overlaps an exon of a putative gene. The frequency is much lower than expected. We did not find any LIRs overlapping either the start or end site of a gene.

Further results confirmed that LIRs tend to reside in large intronic and intergenic regions. Only five LIRs were found in introns < kb (the smallest intron was 757 bp), and none in intergenic regions < kb. The median sizes for the introns and the intergenic regions are 46 and 386 kb, respectively. Moreover, most of the LIR-containing intergenic regions are > 10 kb. Correspondingly, a chromosome that has more long genes will show a lower gene density, in agreement with the above
negative correlation between LIR density and gene density (Fig. 2).

We then studied the positions of the LIRs in introns and intergenic regions. A short distance to the exon–intron boundary or transcription starting point is an indication that an LIR is functioning in the gene. A ratio of 0–0.5 was applied to denote the relative distance to the boundaries, or to the center, and was divided into five ranges. We calculated the percentage of LIRs falling within the ranges and observed a small percentage difference between the ranges, suggesting a random distribution of LIRs in both intronic and intergenic regions (see Fig. S1). We also considered the effect of the length of these regions on the distribution. The intronic and intergenic regions were then classified on the basis of their lengths. Within each of the length groups, the numbers of LIRs in the ratio ranges do not show any significant difference (chi-square tests; d.f. = 4; P > 0.1) (see Fig. S1). Therefore, the LIRs do not avoid approaching the boundaries of exonic or genic regions. The median distance to the exon boundaries is 7.8 kb, and that to the gene boundaries is 69 kb.

Strikingly, pseudogenes were frequently found around the intergenic LIRs. A total of 803 intergenic LIRs (50%) have one or two neighboring pseudogenes, of which 422 are RNA pseudogenes. According to the annotation in the Ensembl database (http://www.ensembl.org), approximately 27% of the human genes are pseudogenes. The occurrence of pseudogenes adjacent to LIRs is statistically significant (chi-square test; P < 0.0001).

Sequence features of the human LIRs

We found that over half (51%) of the identified LIRs could be packed into groups consisting of at least three members on the basis of sequence similarity. The group members comprise simple repeats, known repetitive elements, amplified genes or duplicated genomic fragments. The largest group consists of LIRs formed by stretches of TG/CA dinucleotides and interspersed TA dinucleotides. We defined
them as TG/CA-rich LIRs, accounting for 33% and 39% of all the LIRs in the intronic and intergenic regions, respectively (Fig. 4). By contrast, we also identified TC/GA-rich LIRs that occupy only 3% in both of the regions. Thus, the frequency of TG/CA-rich LIRs is at least 11-fold higher than that of TC/GA-rich ones (for intronic LIRs: 11-fold; for intergenic LIRs: 13-fold). The difference is statistically significant (chi-square test; P < 0.0001). On average, the combination of TG/CA-rich and TC/GA-rich LIRs occupies 38% of the identified LIRs. Additionally, we could not identify any LIRs constructed by simple repeats in longer repeat units (> bp).

Fig. 4. Composition of LIRs in the human genome. The LIRs in POTE and NPS-Septin families occupy 3% of all the intronic LIRs. The 'other' LIRs, occupying 49% of all LIRs, refer to those with unique sequence features.

The second largest group comprises the LIRs involved in known human repetitive elements. We found 145 MADE1 mariners and 108 inverted Alu repeats in our LIR collection. The mariner has a short spacer and long terminal inverted repeats (TIRs). In the present study, they were considered as LIRs in cases of high identity between TIRs. Within both intronic and intergenic LIRs, they occupy 6% in total. Alu repeats in the LIRs are mostly in a partial structure and found to be in the styles of head-to-head or tail-to-tail. In some large LIRs, more than one Alu was included in one arm, and the complete structure of Alu could be retained therein. The proportion of inverted Alus within the LIRs is 6% for intronic regions and 3% for intergenic regions.

The grouping of the LIRs is also a result of gene amplification or fragmental duplication, although the numbers of such groups and the members inside the groups are not large. We identified 20 LIRs in genes encoding a novel protein
similar to septin (NPS-Septin) and eight in genes encoding POTE. The genes belong to gene families and their duplication is coupled with the spread of the LIRs inside the gene. The remaining LIRs aside from the above groups show similarity either to one or none of the others. They are labeled as rare LIRs, accounting for 49% of all the LIRs in the human genome.

We explored the LIRs in the NPS-Septin gene family in more detail. A BLAT search in the University of California Santa Cruz (UCSC) browser (http://genome.ucsc.edu) was used to confirm the association of the LIRs with the gene family. The longest gene, LOC400807, is approximately 107 kb, and an NPS-Septin LIR is positioned at approximately 10.6 kb. In addition, we also found more NPS-Septin LIRs on the Y chromosome, although they were not present in NPS-Septin genes. Sequence alignment displays highly identical arms but diverse spacers for the 20 NPS-Septin LIRs (Fig. 5). They are able to form variant stem-loop structures where both the stem and loop are of different sizes. Except for those on chromosomes 3, 10 and Y, all the LIRs were located at subtelomeric regions (see Table S2). The proximal LIRs show similar spacer motifs; for example, the three LIRs on chromosome 1p (no. 1–3) and the two LIRs on the Y chromosome (Fig. 5). This is evidence for inverted duplication of the fragments at these regions. We also noted that sequence similarity at the flanking regions of the LIRs declines gradually at all sites.

Species-specific LIRs inside orthologous genes

To obtain species specification of the LIRs, we detected LIRs in mouse, rhesus monkey, chimpanzee and human orthologous genes. Among 12 723 groups of orthologous genes, we identified 546 LIRs for human orthologs, 481 for chimpanzee orthologs, 201 for mouse orthologs and 130 for rhesus monkey orthologs. For species specification of the LIRs, 421 (77%) are human-specific, 355 (74%) are chimpanzee-specific, 180 (90%) are mouse-specific and 107 (82%) are rhesus monkey-specific.
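The species-specific percentages quoted above follow directly from the reported counts. As a minimal sketch (the dictionary and function names are illustrative and are not part of the original study), the arithmetic can be reproduced as follows:

```python
# Counts taken from the text: (LIRs found in orthologs, species-specific LIRs).
LIR_COUNTS = {
    "human": (546, 421),
    "chimpanzee": (481, 355),
    "mouse": (201, 180),
    "rhesus monkey": (130, 107),
}


def specific_percentages(counts):
    """Return the rounded percentage of species-specific LIRs per species."""
    return {
        species: round(100 * specific / total)
        for species, (total, specific) in counts.items()
    }


percentages = specific_percentages(LIR_COUNTS)
for species, pct in percentages.items():
    total, specific = LIR_COUNTS[species]
    print(f"{species}: {specific}/{total} = {pct}% species-specific")
```

The rounded values agree with the 77%, 74%, 90% and 82% given in the text.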
For the nonspecific LIRs, 13 groups of orthologs from the three primate species all have at least one LIR, and 104 ortholog pairs from humans and chimpanzees possess LIR(s). This suggests that most of the nonspecies-specific LIRs are shared by the primates, and some LIRs were specifically developed in the primate lineage.

We next obtained the biological profile of the human orthologs that have human-specific LIR(s). Compared to randomly-selected human genes, the orthologs are significantly enriched with GO terms within the categories of development, binding, membrane, cell communication and signal transduction (Table 1). An important finding is that a number of the terms are related to the nervous system, including neurotransmitter receptor activity (GO:0030594), central nervous system development (GO:0007417), GABA receptor activity (GO:0016917), axonogenesis (GO:0007409), projection, generation, differentiation and development of neurons (GO:0043005, GO:0048699, GO:0030182, GO:0048666), synapse (GO:0045202), and so on. The GO term that is under-represented in these genes is GO:0006955 for immune response [false discovery rate (FDR) = 0.048].

Fig. 5. Alignment of the LIRs mostly located in genes encoding a novel protein similar to septin. The locations of the LIRs are listed in Table S2. Essentially, the arms of the LIRs are approximately 1–48 and 97–143 bp, and can be extended into the spacers in some LIRs.

We also performed the same test on 104 orthologs with human- and chimpanzee-specific LIRs. The GO terms from the human orthologs were used for comparison with those from randomly-selected human genes, showing that the above over-represented GO terms related to the nervous system were largely not assigned to these orthologs (see Table S3). Only the term GO:0045202 is related to synapse. Basically, the terms for binding, membrane,
signal transduction and cell communication are retained in the list.

As a control, we obtained over-represented GO terms from the mouse genes associated with mouse-specific LIRs by comparison with randomly-selected mouse orthologs. A part of the result shown in Table S4 is similar to that shown in Table S3 (e.g. binding and signal transduction). The difference is that the result for the mouse orthologs includes GO terms for the regulation of transcription, the RNA biosynthetic process and the phosphate metabolic process. We found one over-represented term (GO:0007399) in a pathway for nervous system development (FDR = 0.0368).

Table 1. GO terms over-represented in human genes having human-specific LIRs. The genes are human orthologs that have at least one human-specific LIR. Reference genes are randomly selected from the list of orthologs, and are used for comparison with the test human genes with specific LIRs. The GO terms in 352 test genes were compared with those in 296 reference genes, using Fisher's exact test in BLAST2GO. FDR was applied to obtain significantly over-represented (FDR < 0.05) GO terms in the test genes. Several GO terms belonging to levels or are not included.

GO term | Name | FDR
GO:0060089 | Molecular transducer activity | 6.49E-05
GO:0007165 | Signal transduction | 6.49E-05
GO:0005886 | Plasma membrane | 1.22E-04
GO:0044425 | Membrane part | 2.44E-04
GO:0007154 | Cell communication | 2.65E-04
GO:0032501 | Multicellular organismal process | 2.65E-04
GO:0045202 | Synapse | 3.10E-04
GO:0004888 | Transmembrane receptor activity | 7.24E-04
GO:0031224 | Intrinsic to membrane | 7.24E-04
GO:0016021 | Integral to membrane | 8.83E-04
GO:0032502 | Developmental process | 0.001316
GO:0030695 | GTPase regulator activity | 0.001397
GO:0007275 | Multicellular organismal development | 0.001397
GO:0044459 | Plasma membrane part | 0.001397
GO:0051179 | Localization | 0.001397
GO:0030182 | Neuron differentiation | 0.001526
GO:0030054 | Cell junction | 0.001544
GO:0043167 | Ion binding | 0.001544
GO:0007186 | G-protein coupled receptor protein | 0.002093
GO:0004872 | Receptor activity | 0.002643
GO:0007166 | Cell surface receptor | 0.003784
GO:0000166 | Nucleotide binding | 0.003784
GO:0031420 | Alkali metal ion binding | 0.004149
GO:0045211 | Postsynaptic membrane | 0.004149
GO:0006811 | Ion transport | 0.004408
GO:0007399 | Nervous system development | 0.004741
GO:0043169 | Cation binding | 0.005509
GO:0046872 | Metal ion binding | 0.005536
GO:0030554 | Adenyl nucleotide binding | 0.005536
GO:0007155 | Cell adhesion | 0.005536
GO:0048666 | Neuron development | 0.005868
GO:0005083 | Small GTPase regulator activity | 0.005868
GO:0006814 | Sodium ion transport | 0.005868
GO:0005509 | Calcium ion binding | 0.006231
GO:0016773 | Phosphotransferase activity | 0.006277
GO:0017076 | Purine nucleotide binding | 0.00637
GO:0022008 | Neurogenesis | 0.006378
GO:0048699 | Generation of neurons | 0.006378
GO:0000902 | Cell morphogenesis | 0.007098
GO:0032989 | Cellular structure morphogenesis | 0.007098
GO:0030234 | Enzyme regulator activity | 0.007765
GO:0051056 | Regulation of small GTPase | 0.009344
GO:0048869 | Cellular developmental process | 0.009344
GO:0030154 | Cell differentiation | 0.009344
GO:0005215 | Transporter activity | 0.009344
GO:0032559 | Adenyl ribonucleotide binding | 0.009362
GO:0032555 | Purine ribonucleotide binding | 0.01047
GO:0032553 | Ribonucleotide binding | 0.01047
GO:0031226 | Intrinsic to plasma membrane | 0.011022
GO:0005524 | ATP binding | 0.011768
GO:0048731 | System development | 0.013028
GO:0043412 | Biopolymer modification | 0.013028
GO:0005887 | Integral to plasma membrane | 0.013136
GO:0005230 | Extracellular ion channel activity | 0.014854
GO:0031175 | Neurite development | 0.014854
GO:0031402 | Sodium ion binding | 0.014854
GO:0030030 | Cell projection organization | 0.016226
GO:0022804 | Active transmembrane transporter | 0.016354
GO:0004672 | Protein kinase activity | 0.017032
GO:0048856 | Anatomical structure development | 0.017199
GO:0043687 | Post-translational protein modification | 0.017406
GO:0015075 | Ion transmembrane transporter | 0.017424
GO:0006464 | Protein modification process | 0.017424
GO:0016301 | Kinase activity | 0.018271
GO:0051234 | Establishment of localization | 0.018271
GO:0005856 | Cytoskeleton | 0.018624
GO:0005216 | Ion channel activity | 0.019349
GO:0022803 | Passive transmembrane transporter | 0.019349
GO:0022838 | Substrate-specific channel activity | 0.019349
GO:0004713 | Protein-tyrosine kinase activity | 0.020021
GO:0004930 | G-protein coupled receptor activity | 0.020021
GO:0022857 | Transmembrane transporter activity | 0.020021
GO:0050793 | Regulation of developmental process | 0.021558
GO:0008509 | Anion transmembrane transporter | 0.021558
GO:0007409 | Axonogenesis | 0.023565
GO:0048812 | Neurite morphogenesis | 0.023565
GO:0019199 | Transmembrane protein kinase activity | 0.023565
GO:0048667 | Neuron morphogenesis | 0.023565
GO:0046578 | Ras protein signal transduction | 0.023565
GO:0016917 | GABA receptor activity | 0.023565
GO:0004714 | Kinase activity | 0.023565
GO:0048468 | Cell development | 0.025037
GO:0022891 | Transmembrane transporter activity | 0.026669
GO:0009790 | Embryonic development | 0.028903
GO:0006793 | Phosphorus metabolic process | 0.030622
GO:0022892 | Substrate-specific transporter | 0.030713
GO:0043005 | Neuron projection | 0.033582
GO:0005096 | GTPase activator activity | 0.033582
GO:0015698 | Inorganic anion transport | 0.033582
GO:0065007 | Biological regulation | 0.034756
GO:0005089 | Rho guanyl-nucleotide exchange factor | 0.041165
GO:0005088 | Ras guanyl-nucleotide exchange factor | 0.041165
GO:0007010 | Cytoskeleton organization | 0.041874
GO:0007417 | Central nervous system development | 0.043131
GO:0030594 | Neurotransmitter receptor activity | 0.043131
GO:0004674 | Protein serine/threonine kinase | 0.045995
GO:0008092 | Cytoskeletal protein binding | 0.046358
GO:0005515 | Protein binding | 0.047541

The 104 LIRs common in human and chimpanzee orthologs were studied, aiming to uncover the mechanism of their formation. We searched the arm sequences of the LIRs in the UCSC genome browser for homologous
fragments in other mammalian genomes. The species specification of the LIRs was demonstrated in several cases, where we found half-sized LIRs in the rhesus monkey genome. One case is the LIR in the human gene c9orf52 that has 19 ORFs and four transcription variants. Positioned between exons 17 and 18 (5.3 kb to exon 18; 36.59 kb to exon 17) (Fig. 6), it has a homolog in the chimpanzee genome. However, all homologous fragments from the rhesus monkey correspond to one arm of the LIR. Moreover, motif conservation was exhibited at the flanking sequences of the LIR in primates (Fig. 6). In other words, the half-sized LIR represents the ancestral status, and the full-sized LIR was developed in the chimpanzee and human lineages. We did not find fragments homologous to the LIR in the mouse genome. Instead, a half-sized LIR was observed in the dog genome, suggesting that nonprimate genomes also lack the full-sized LIR. This also serves as additional solid evidence for the presence of the half-sized LIR in the rhesus monkey. These results imply that some LIRs were derived by inverted duplication of one arm.

Fig. 6. An intronic LIR in c9orf52 and the flanking conserved sequence. The arrow denotes the intronic LIR, positioned between exons 17 and 18. The large arrows with an opposing orientation indicate the two arms of the LIR. Rhesus monkey (Rhesus macaque) and dog (Canis familiaris) genes possess half-sized LIRs.

Discussion

A survey of recombinogenic LIRs across the human genome

In the present study, we identified LIRs in the human genome, and provide a fine map of the distribution of human LIRs. Due to a strong capability for forming a stem-loop, the LIRs are recombinogenic and account for only approximately 0.4% of all human LIRs, as suggested previously [33]. Our algorithm allows the presence of mismatches and insertions in the stem part of the
secondary structure, and also provides settings for spacer size, arm size and arm similarity. Due to variant internal structures, inverted repeats are different in their efficiency with respect to the induction of instability. Evidence is available suggesting that arm size, arm similarity and internal spacer size are all important factors [1,2]. Therefore, the inverted repeats identified in the present study are generally associated with a high potential for stem-loop or palindrome formation. This is partially supported by the fact that approximately 87% of our LIRs have a short spacer of < 10 bp. Nonetheless, we cannot preclude the possibility that some of the LIRs experience difficulty regarding the formation of a stem-loop, such as the reversely duplicated genes and those extremely large LIRs with a huge spacer (see Table S1).

In previous studies, the methods employed for inverted repeat identification could not search the inverted repeats by freely defining arm similarity, spacer size and indels [30,32,36]. Thus, the map of the LIRs obtained in the present study provides a more detailed distribution of stem-loops in the human genome, and confirms that the LIRs are mostly located in long introns and intergenic regions. Furthermore, the inverted repeats in the present study are more likely to be functional than those of previous studies because functional inverted repeats such as siRNA genes are rarely palindromes showing 100% arm similarity [30,32,36].

Because of the difficulties encountered in the design of the algorithm for LIR searching at the genome level and the complex folding structures of inverted repeats, we could not target all the inverted repeats with a strong potential to form a stem-loop or a palindrome. In particular, there are a large number of AT-rich regions in the human genome, and the frequency of (TA)n repeats is 19.4 per Mb [37]. The self-complementary (TA)n repeats can by themselves form variant secondary structures. To remove these simple repeats, we
set the GC content of the arm sequences at > 20%. This step, however, unavoidably deleted AT-rich LIRs, and some of them have been implicated as the mediator of constitutional t(11;22) translocation in humans [38]. Although there are also a large number of (CA)n and (GA)n repeats in the human genome, the frequency of their complementary repeats (TG)n and (CT)n is much lower [37]. Therefore, the presence of the TG/CA-rich and TC/GA-rich LIRs is not a result of the enrichment of (CA)n and (GA)n repeats.

Probable functions of the LIRs

The results obtained in the present study show that a considerable proportion of the LIRs are within genes and tend to be located in large introns of long genes. The LIRs in the large introns, although still unstable, will not greatly disturb the coding parts of the genes. Knowing the genomic distribution and sequence features of the LIRs enables us to speculate about the biological functions of the LIRs. First, there are a large number of TG/CA-rich LIRs in our collection, and these intronic TG and CA tracts are probably involved in the alternative splicing of genes. One study revealed that intronic TG tracts, particularly in hairpin structure, are important in the intron knockout process and help to create complicated splicing patterns [39]. On the other hand, CA-tracts and CA-rich sequences are confirmed to be regulators for alternative splicing. One study showed that the insertion of a CA repeat into different intronic places will result in variant splicing patterns in a human gene [40]. Perhaps splicing sites at intron–exon boundaries can be recognized easily by a signal of secondary structure. Taken together, this allows us to propose that the TG/CA-rich LIRs are regulators in human genes.

Second, approximately half of the LIRs are unique in sequence features, and some of them are
probably unidentified siRNA genes. In the present study, the LIRs are longer than the minimal length required for an siRNA. Although arm similarity is higher than that observed in most siRNAs, some of them are still candidates for siRNA genes. We used emboss sirna (http://emboss.sourceforge.net/apps/cvs/sirna.html) to identify the candidates with a threshold score of 8, and found that 267 of the LIRs are potential siRNA genes. The validity of these motifs in gene silencing requires further empirical examination.

LIRs and recombination hotspots are not related

The question remains as to whether the recombinogenic LIRs identified in the present study are frequently associated with recombination hotspots in the human genome. Almost 47% of the human genome is composed of repeats [37], and direct repeats are predominant over inverted repeats in the human genome, partially because inverted repeats are able to induce instability five-fold more efficiently than direct repeats [41]. The UCSC browser provides the recombination rate data for the human genome. Essentially, recombination hotspots concentrate on subtelomeric regions [42]. The regions, however, do not have more LIRs than other regions (Fig. 2) and, instead, some chromosomal LIR-rich regions are located at the inner part of the chromosomes. The lack of an association between LIRs and recombination hotspots is also suggested by a recent study on human recombination hotspots on the basis of a computational simulation using single nucleotide polymorphism data, which showed that inverted repeats were not overabundant in the hotspots [43]. In the present study, we did not detect over-represented LIRs in the hotspots (results not shown). Therefore, the contribution of LIRs to recombination hotspots is not supported, and the recombination-inducing effect of the recombinogenic LIRs probably acts only on specific genomic regions.

LIRs spreading via fragmental duplications

NPS-Septin genes are spread in the human genome possibly due to interchromosomal recombination and fragmental duplication. One study showed that interchromosomal recombination frequently occurs at the subtelomeric regions in humans [42]. NPS-Septin was assumed to be one of the gene families that amplified themselves by this mechanism. The result of gene amplification is concurrent duplication of the intronic LIRs, as observed in chromosomes and Y in the present study. By contrast, only two chimpanzee NPS-Septin LIRs were identified, which is in accordance with the low frequency of subtelomeric duplications in the chimpanzee genome [42]. Regarding the spread of LIRs in the POTE family, at least those on chromosome subtelomeres are most likely the result of intrachromosomal recombination, as inferred from genomic locations. Similarly, the chimpanzee genome has two POTE homologs: one on chromosome 12 and another one on chromosome 22. In addition, the LIRs in POTE and NPS-Septin families were entirely absent from other mammals in current genomic assemblies.

Probable role of the LIRs in primate speciation

Among the orthologous genes, we found that human and chimpanzee genes contain more LIRs than rhesus monkey and mouse orthologs. Our data suggest that most of the LIRs shared by human and chimpanzee orthologs were developed and maintained by the common ancestor of humans and chimpanzees. However, the difference in LIR frequency among the primates could be narrowed to some extent. Our search for LIRs in rhesus monkey orthologs probably missed a proportion of the LIRs, although the similarity between arms was lessened to 75%. In the case that the similarity between arms was lower than 75%, some of the LIRs shared by all primates could not be visualized. Indeed, we observed higher mismatch rates in the stems formed by monkey LIRs, and the corresponding chimpanzee and human LIRs have undergone compensating mutations that help to
improve the stability of the stem-loops for human and chimpanzee LIRs relative to those for rhesus monkey (results not shown). The compensating mutations comprise one line of evidence for the functional role and adaptive evolution of the primate LIRs.

The biological profiling of the orthologs with human-specific LIRs implies their association with the development of the central nervous system. Moreover, GO terms in pathways such as cell communication and transmembrane signal transduction are enriched in these orthologs. The number of genes in eukaryotic genomes is not as different as previously considered [44,45], and the morphological and physiological differences among eukaryotes are considered to be the result of the different regulation levels of the existing genes. The intronic LIRs in the present study are probably novel, essential regulatory motifs that enable a complex expression profile and the fine regulation of human genes, as suggested previously [46]. The appearance of the LIRs probably provides humans with an evolutionary advantage and contributes to the speciation of primates.

Experimental procedures

Identification of LIRs

The human genome (Build 35) was downloaded from the NCBI (http://www.ncbi.nlm.nih.gov/) and the locations of all human genes and their exons (for protein-coding genes) were obtained from the Ensembl database (http://www.ensembl.org). From the gene list, we obtained the locations of the boundaries of the genic and nongenic regions. Exons belonging to the same genes were sorted again according to their genomic locations, and the introns were defined as the intervals between the exons. From the list, the boundaries of exons and introns were determined.

We first searched for inverted repeats across the human genome using bespoke software [33]. The settings for this step were: arm length > 30 bp; arm identity > 85%; and spacer < kb. In addition, inverted repeats with a GC content of the arms of < 20% were filtered out. This aimed to exclude an
abundance of inverted repeats formed by (TA)n simple repeats as shown in our primary study. A (TA)n by itself is an inverted repeat, and can form variant secondary structures rather than an exclusive and stable stem-loop. Therefore, (TA)n repeats were not the required typical inverted repeats. As a result, we removed them from the dataset in the present study. Several types of redundancies were removed, as described previously [33]. To define the recombinogenic LIRs, we screened the collection with a new criterion. The ratio of arm length to spacer length must be larger than mismatch, where the mismatch is equal to 100% minus identity. Therefore, the LIRs in our dataset were recombinogenic LIRs [33].

The LIRs within genes were identified and the ratio of their relative distance to exon–intron boundaries was calculated. Here, a ratio approaching to indicates the relative distance to the closest exon–intron boundary and that close to 0.5 means that the LIR is positioned close to the centre of an intron, no matter in what direction. Pseudogenes were not used in this survey. For those LIRs in intergenic regions, the same ratios were also measured. The difference was that the ratios in that case represent the relative distance to the closest neighboring genes. We also attempted to identify cases of partial overlapping between LIRs and genes or exons.

Classification of LIRs

We selected the LIRs that were basically constructed by dinucleotide repeats. In the case where TG + TA + CA > 80% of the arm of an LIR, it was considered as TG/CA-rich; in the case where TC + TA + GA > 80%, it was considered as TC/GA-rich. The remaining LIRs were classified on the basis of similarity. We first used consensus motifs of common human repetitive elements (from the RepBase: http://www.girinst.org/) as templates. An LIR was considered to be formed by a known repetitive element if the identity of the homologous part (> 20 bp) was higher than 75%. For the results
obtained, LIRs formed by inverted Alu repeats were further confirmed by repeatmasker (http://repeatmasker.org). Second, LIRs homologous to each other were searched. Similarly, the criteria were: homologous part > 20 bp and identity of homologous part > 75%. Put simply, the algorithm for searching the homologous part aimed to find an identical seed of bp and then extend the seed at both ends until two consecutive mismatches occur at both sides.

LIRs in mammalian orthologous genes

We obtained orthologous genes for human–chimpanzee, human–rhesus monkey and human–mouse species pairs from the BIOMART database (http://www.ensembl.org/biomart/), which employs the Ensembl 42 Homology Database. By searching the same human gene IDs in the three ortholog tables, we created a new ortholog table containing 12 723 groups of orthologous genes from the four species. In the BIOMART database, some orthologous genes are of the types 'one-to-many' and 'many-to-many' that denote a multiple orthologous relationship between the genes. In the present study, we kept the orthologs in the 'one-to-one' type to allow an easier comparison among the orthologs in terms of the absence/presence of the LIRs. The criteria for searching inverted repeats among the orthologs were: arm length > 30 bp; arm identity > 75%; spacer < 500 bp. Again, the inverted repeats with a GC content of the arms of < 20% were excluded. We then selected recombinogenic LIRs among the collection using the same criterion.

Next, we tried to uncover the specification in the biological profile of genes having species-specific LIRs. We have three sets of test genes, which include the human orthologs having human-specific LIRs, the human orthologs having human- and chimpanzee-specific LIRs (i.e. rhesus monkey and mouse orthologs do not have specific LIRs) and the mouse orthologs having
mouse-specific LIRs. To make a reference dataset, we randomly selected human and mouse genes from the ortholog table. The GO terms were obtained for the genes from the Ensembl database (http://www.ensembl.org/). In the reference set, 296 human orthologs and 164 mouse orthologs were assigned with GO term(s) in the Ensembl database; in the test sets, 352 in the first gene set, 91 in the second gene set and 164 in the third gene set were found to have GO terms. We did not perform the test for chimpanzee and rhesus monkey orthologs with species-specific LIRs due to concerns of insufficient GO terms being assigned to their genes. blast2go [47] was used to compare the frequency of the GO terms in the reference genes and the test genes. Using gossip software (http://gossip.gene-groups.net) in blast2go, the significance of the difference in the term frequency was determined by the FDR (P < 0.05) [48]. Hence, we obtained a list of GO terms that were significantly over-represented in the test genes, revealing the pathways in which the LIR-containing orthologs are involved.

References

1 Lobachev KS, Stenger JE, Kozyreva OG, Jurka J, Gordenin DA & Resnick MA (2000) Inverted Alu repeats unstable in yeast are excluded from the human genome. EMBO J 19, 3822–3830.
2 Lobachev KS, Shor BM, Tran HT, Taylor W, Keen JD, Resnick MA & Gordenin DA (1998) Factors affecting inverted repeat stimulation of recombination and deletion in Saccharomyces cerevisiae. Genetics 148, 1507–1524.
3 Gordenin DA, Lobachev KS, Degtyareva NP, Malkova AL, Perkins E & Resnick MA (1993) Inverted DNA repeats: a source of eukaryotic genomic instability. Mol Cell Biol 13, 5315–5322.
4 Tanaka H, Tapscott SJ, Trask BJ & Yao M-C (2002) Short inverted repeats initiate gene amplification through the formation of a large DNA palindrome in mammalian cells. Proc Natl Acad Sci USA 99, 8772–8777.
5 Lin C-T, Lin W-H, Lyu YL & Whang-Peng J (2001) Inverted repeats as genetic elements for promoting DNA inverted duplication: implications in gene
amplification. Nucleic Acids Res 29, 3529–3538.
6 Gotter AL, Nimmakayalu MA, Jalali GR, Hacker AM, Vorstman J, Conforto Duffy D, Medne L & Emanuel BS (2007) A palindrome-driven complex rearrangement of 22q11.2 and 8q24.1 elucidated using novel technologies. Genome Res 17, 470–481.
7 Nag DK & Kurst A (1997) A 140-bp-long palindromic sequence induces double strand breaks during meiosis in the yeast Saccharomyces cerevisiae. Genetics 146, 835–847.
8 Collick A, Drew J, Penberth J, Bois P, Luckett J, Scaerou F, Jeffreys A & Reik W (1996) Instability of long inverted repeats within mouse transgenes. EMBO J 15, 1163–1171.
9 Katz RA, Gravuer K & Skalka AM (1998) A preferred target DNA structure for retroviral integrase in vitro. J Biol Chem 273, 24190–24195.
10 Inagaki K, Lewis SM, Wu X, Ma C, Munroe DJ, Fuess S, Storm TA, Kay MA & Nakai H (2007) DNA palindromes with a modest arm length of >20 base pairs are a significant target for recombinant adeno-associated virus vector integration in the liver, muscles, and heart in mice. J Virol 81, 11290–11303.
11 Voineagu I, Narayanan V, Lobachev KS & Mirkin SM (2008) Replication stalling at unstable inverted repeats: interplay between DNA hairpins and fork stabilizing proteins. Proc Natl Acad Sci USA 105, 9936–9941.
12 Stenger JE, Lobachev KS, Gordenin D, Darden TA, Jurka J & Resnick MA (2001) Biased distribution of inverted and direct Alus in the human genome: Implications for insertion, exclusion, and genome stability. Genome Res 11, 12–27.
13 Akgun E, Zahn J, Baumes S, Brown G, Liang F, Romanienko PJ, Lewis S & Jasin M (1997) Palindrome resolution and recombination in the mammalian germ line. Mol Cell Biol 17, 5559–5570.
14 Piriyapongsa J (2007) A family of human microRNA genes from miniature inverted-repeat transposable elements. PLoS ONE 2, e203.
15 Montgomery MK, Xu S & Fire A (1998) RNA as a target of double-stranded RNA-mediated genetic interference in Caenorhabditis elegans. Proc Natl Acad Sci USA 95, 15502–15507.
16 Bartel DP (2004) MicroRNAs: genomics,
biogenesis, mechanism, and function. Cell 116, 281.
17 Moss EG, Lee RC & Ambros V (1997) The cold shock domain protein LIN-28 controls developmental timing in C elegans and is regulated by the lin-4 RNA. Cell 88, 637–646.
18 Reinhart BJ, Slack FJ, Basson M, Pasquinelli AE, Bettinger JC, Rougvie AE, Horvitz HR & Ruvkun G (2000) The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403, 901–906.
19 Schramke V, Sheedy DM, Denli AM, Bonila C, Ekwall K, Hannon GJ & Allshire RC (2005) RNA-interference-directed chromatin modification coupled to RNA polymerase II transcription. Nature 435, 1275–1279.
20 Volpe TA, Kidner C, Hall IM, Teng G, Grewal SIS & Martienssen RA (2002) Regulation of heterochromatic silencing and histone H3 lysine-9 methylation by RNAi. Science 297, 1833–1837.
21 Glucksmann MA, Markiewicz P, Malone C & Rothman-Denes LB (1992) Specific sequences and a hairpin structure in the template strand are required for N4 virion RNA polymerase promoter recognition. Cell 70, 491–500.
22 Kim EL, Peng H, Esparza FM, Maltchenko SZ & Stachowiak MK (1998) Cruciform-extruding regulatory element controls cell-specific activity of the tyrosine hydroxylase gene promoter. Nucleic Acids Res 26, 1793–1800.
23 Spiro C & McMurray CT (1997) Switching of DNA secondary structure in proenkephalin transcriptional regulation. J Biol Chem 272, 33145–33152.
24 Christopher E, Pearson CE, Zorbas H, Price GB & Zannis-Hadjopoulos M (1996) Inverted repeats, stem-loops, and cruciforms: significance for initiation of DNA replication. J Cell Biochem 63, 1–22.
25 McAlinden A, Havlioglu N, Liang L, Davies SR & Sandell LJ (2005) Alternative splicing of type II procollagen exon is regulated by the combination of a weak 5′ splice site and an adjacent intronic stem-loop cis element. J Biol Chem 280, 32700–32711.
26 Baraniak AP, Lasda EL, Wagner EJ &
Garcia-Blanco MA (2003) A stem structure in fibroblast growth factor receptor transcripts mediates cell-type-specific splicing by approximating intronic control elements. Mol Cell Biol 23, 9327–9337.
27 Miyaso H, Okumura M, Kondo S, Higashide S, Miyajima H & Imaizumi K (2003) An intronic splicing enhancer element in survival motor neuron (SMN) pre-mRNA. J Biol Chem 278, 15825–15831.
28 Chen Y & Stephan W (2003) Compensatory evolution of a precursor messenger RNA secondary structure in the Drosophila melanogaster Adh gene. Proc Natl Acad Sci USA 100, 11499–11504.
29 Lisnić B, Svetec I-K, Šarić H, Nikolić I & Zgaga Z (2005) Palindrome content of the yeast Saccharomyces cerevisiae genome. Curr Genet 47, 289–297.
30 Lu L, Jia H, Dröge P & Li J (2007) The human genome-wide distribution of DNA palindromes. Funct Integr Genomics 7, 221–227.
31 Zhao G, Chang KY, Varley K & Stormo GD (2007) Evidence for active maintenance of inverted repeat structures identified by a comparative genomic approach. PLoS ONE 2, e262.
32 Warburton PE, Giordano J, Cheung F, Gelfand Y & Benson G (2004) Inverted repeat structure of the human genome: the X-chromosome contains a preponderance of large, highly homologous inverted repeats that contain testes genes. Genome Res 14, 1861–1869.
33 Wang Y & Leung FCC (2006) Long inverted repeats in eukaryotic genomes: recombinogenic motifs determine genomic plasticity. FEBS Lett 580, 1277–1284.
34 Grimwood J, Gordon LA, Olsen A, Terry A, Schmutz J, Lamerdin J, Hellsten U, Goodstein D, Couronne O, Tran-Gyamfi M et al (2004) The DNA sequence and biology of human chromosome 19. Nature 428, 529–539.
35 Hüttenhofer A, Schattner P & Polacek N (2005) Non-coding RNAs: hope or hype?
Trends Genet 21, 289–297.
36 Achaz G, Netter P & Coissac E (2001) Study of intrachromosomal duplications among the eukaryote genomes. Mol Biol Evol 18, 2280–2288.
37 International Human Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921.
38 Edelmann L, Spiteri E, Koren K, Pulijaal V, Bialer MG, Shanske A, Goldberg R & Morrow BE (2001) AT-rich palindromes mediate the constitutional t(11;22) translocation. Am J Hum Genet 68, 1–13.
39 Hefferon TW, Groman JD, Yurk CE & Cutting GR (2004) A variable dinucleotide repeat in the CFTR gene contributes to phenotype diversity by forming RNA secondary structures that alter splicing. Proc Natl Acad Sci USA 101, 3504–3509.
40 Hui J, Hung L-H, Heiner M, Schreiner S, Neumüller N, Reither G, Haas SA & Bindereif A (2005) Intronic CA-repeat and CA-rich elements: a new class of regulators of mammalian alternative splicing. EMBO J 24, 1988–1998.
41 Waldman AS, Tran H, Goldsmith EC & Resnick MA (1999) Long inverted repeats are an at-risk motif for recombination in mammalian cells. Genetics 153, 1873–1883.
42 Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM & Trask BJ (2005) Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature 437, 94–100.
43 Myers S, Bottolo L, Freeman C, McVean G & Donnelly P (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310, 321–324.
44 Bird AP (1995) Gene number, noise reduction and biological complexity. Trends Genet 11, 94–100.
45 Claverie J-M (2001) Gene number: what if there are only 30,000 human genes?
Science 291, 1255–1257.
46 Bacolla A, Larson JE, Collins JR, Li J, Milosavljevic A, Stenson PD, Cooper DN & Wells RD (2008) Abundance and length of simple repeats in vertebrate genomes are determined by their structural properties. Genome Res 18, 1545–1553.
47 Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M & Robles M (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21, 3674–3676.
48 Blüthgen N, Brand K, Cajavec B, Swat M, Herzel H & Beule D (2005) Biological profiling of gene groups utilizing gene ontology. Genome Inform 16, 106–115.

Supporting information

The following supplementary material is available:
Fig. S1. Evenly-distributed LIRs in intergenic and intronic regions.
Table S1. LIRs with arms longer than 400 bp.
Table S2. LIRs in genes encoding novel proteins similar to septin.
Table S3. GO terms over-represented in human genes having human- and chimpanzee-specific LIRs.
Table S4. GO terms over-represented in mouse genes having mouse-specific LIRs.

This supplementary material can be found in the online version of this article.

Please note: Wiley-Blackwell is not responsible for the content or functionality of any supplementary materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

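The interval and filtering steps described under "Identification of LIRs" can be illustrated with a short Python sketch. It is not the authors' software (the search itself was done with the bespoke program of ref. [33]); it only shows one plausible reading of three steps in the text: defining introns as the intervals between sorted exons, computing the relative distance of an LIR to the closest exon–intron boundary, and applying the stated arm thresholds (length > 30 bp, identity > 85%, GC content of at least 20%). The function names, the 0-based end-exclusive coordinates and the use of the LIR midpoint are assumptions made for this example. The spacer limit is left out because its value is missing from this copy of the text.

```python
from typing import List, Tuple

# (start, end) on one chromosome; 0-based, end-exclusive is an assumed convention.
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Sort one gene's exons by position and return the gaps between them."""
    if not exons:
        return []
    ordered = sorted(exons)
    introns = []
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > prev_end:           # a real gap between exons, i.e. an intron
            introns.append((prev_end, start))
        prev_end = max(prev_end, end)  # tolerate overlapping exon records
    return introns


def relative_boundary_distance(lir: Interval, intron: Interval) -> float:
    """Distance from the LIR midpoint to the nearer intron end, divided by
    intron length. The value lies between 0 (at a boundary) and 0.5 (centre)."""
    mid = (lir[0] + lir[1]) / 2
    start, end = intron
    return min(mid - start, end - mid) / (end - start)


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_arm_filter(arm: str, identity_pct: float) -> bool:
    """Thresholds quoted in the text for the human genome search:
    arm length > 30 bp, arm identity > 85%, arms with GC < 20% removed."""
    return len(arm) > 30 and identity_pct > 85.0 and gc_fraction(arm) >= 0.20


if __name__ == "__main__":
    exons = [(300, 400), (100, 200), (500, 600)]  # made-up coordinates
    print(introns_from_exons(exons))
    print(relative_boundary_distance((145, 155), (100, 200)))
    print(passes_arm_filter("AT" * 20, 90.0))     # a (TA)n-like arm
```

Using the midpoint makes the ratio symmetric, so the same value is returned whichever boundary is nearer, which matches the "no matter in what direction" wording in the text.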
Date posted: 07/03/2014, 00:20
