A Linkage Disequilibrium Map of the Human Major Histocompatibility Complex in Singapore Chinese: Conserved Extended Haplotypes and Ancestral Blocks

CHAPTER 1: INTRODUCTION

1.1 The Human Major Histocompatibility Complex (MHC)

1.1.1 Features of the Human MHC

The human Major Histocompatibility Complex (MHC) is a gene-dense segment on the short arm of human chromosome 6. It houses over 200 gene loci in a 3.6Mb region, and more than 40% of the expressed genes here are known to have an immunity-related function (The MHC Sequencing Consortium, 1999). The highly polymorphic human leukocyte antigen (HLA) genes, key loci for histocompatibility matching in organ transplants, are also located within this complex.

Traditionally, the MHC is defined as the region bordered by the HLA-F and the RPS18 loci (Campbell and Trowsdale, 1993), and is divided into sub-regions to reflect the clustering of different classes of genes. The class I region lies at the telomeric end and carries a cluster of HLA class I genes such as the classical HLA-A, -B, -C, the non-classical HLA-E, -F, -G and the class I-like MICA and MICB. The class II region is at the centromeric end, comprising HLA class II antigens such as HLA-DR, -DP and -DQ. The class III region is sandwiched in between and comprises varied gene families such as the tumour necrosis factors (TNF, LTA, LTB) and the complement cascade (CFB, C2, C4A, C4B). However, the discovery of HLA-like genes in the regions flanking the classical MHC, as well as extended regions of conserved synteny, led to the hypothesis of an extended MHC in humans (Malfroy et al 1997, Yoshino et al 1997, Stephens et al 1999), and these additional regions have been termed the extended class I and extended class II regions. A gene map of the MHC can be seen in Figure 1.1.

Figure 1.1 Gene Map of the Major Histocompatibility Complex. The gene map between 29.0Mb and 33.8Mb of chromosome 6p is shown above. Only gene loci that are known to be expressed are included. The coloured backgrounds mark out sub-regions of the MHC as described in a recent review of the extended MHC (Horton et al 2004). Yellow –
Extended class I region; Blue – Class I region; Green – Class III region; Orange – Class II region; Grey – Extended Class II region. The locations of the classical HLA class I genes (HLA-A, HLA-B, HLA-C) and HLA class II genes (HLA-DRB1, HLA-DQB1, HLA-DPB1) are indicated in blue and red blocks respectively. All gene annotations were taken from the Vertebrate Genome Annotation Database (Vega) (Wilming et al 2008). The distances in this figure are not to scale, but the physical map locations are approximately indicated for easy reference.

A hallmark of the MHC is the duplication that results in the formation of large gene clusters across the region. Besides the HLA genes, very large clusters of tRNA genes (157 in total have been found) and histone genes (55 expressed loci) are located here, mostly within the extended class I sub-region. This same sub-region also contains 34 olfactory receptor-coding loci, of which about half are potentially functional (Younger et al 2001, Horton et al 2004). Within the classical MHC, other gene clusters include the tripartite motif-containing (TRIM) zinc-finger genes, heat shock protein genes and lymphocyte antigen genes (LY6).

The gene clusters across the MHC are believed to be the result of both gene-level and larger segmental duplication events, and to be maintained because of functional requirements. For example, several immune system genes function in tandem, such as HLA-DQA and HLA-DQB, whose products dimerize, and co-localization can ensure that these genes are expressed in similar quantities for heterodimer formation. Similarly, genes involved in antigen processing, such as the peptide transporters (TAP1/2), the immunoproteasome components (PSMB8/9) and the peptide chaperone TAPBP, could benefit from clustering together for the coordination of expression profiles (Horton et al 2004). Other gene families, such as the tRNA and histone genes, are thought to exist in clusters in order to maximise transcription levels (Mungall et al 2003).

Extreme polymorphism is
another characteristic of the MHC, with diversity an order of magnitude higher than the genome average (Stewart et al 2004). Most of this variation can be attributed to the HLA class I and class II genes, which carry between 200 and 1000 different alleles at each locus (Robinson et al 2003). These HLA genes encode cell surface glycoproteins that present endogenous and exogenous peptides to T-cells, subsequently initiating the adaptive immune response. The extreme diversity of HLA genes is believed to be a result of pathogen-driven balancing selection favouring increased variety against evolving pathogens (Meyer and Thompson 2001). As for variation across the entire MHC, there are an estimated 15,000 single-nucleotide polymorphisms (SNPs) between the MHC sequences of any two individuals, or about 3.2 SNPs for every kb (Horton et al 2008). This is somewhat similar to the genome average (ENCODE Project Consortium, 2004).

Polymorphism in the MHC is, however, not restricted to SNPs. Large tracts of deletion/insertion polymorphisms (DIPs) have been catalogued between several full-length MHC sequences (Horton et al 2008). The majority of DIPs are a result of copy-number variations of repetitive transposable elements such as short interspersed elements (SINE), long interspersed elements (LINE), Alu elements and human endogenous retroviruses (HERV). These DIPs contribute to sequence length variations between different MHC haplotypes. Additionally, there also exist gene-specific copy number variations, such as the two large regions of complex polymorphism in the MHC: the RCCX module and the HLA-DRB locus.

The RCCX module is a segment of DNA that includes part of the STK19 (RP) gene, C4A/B, CYP21 and part of the TNXB gene (Chung et al 2002). The C4 gene encodes a component of the complement cascade and exists in either a long or a short variant. Different MHC haplotypes carry between one and four modular units of a combination of C4A and/or C4B as a result of segmental duplication, and consequently
different length variants of the RCCX module correlate with dosage variation of the C4A and C4B proteins. Dosage variation of the C4A and C4B proteins is associated with systemic lupus erythematosus (SLE), possibly through increased production of the C4 protein in local tissues during the inflammatory process, exacerbating complement-mediated tissue injuries (Rupert et al 2002).

The HLA-DRB locus also displays length variation between MHC haplotypes, with five main arrangements of this locus: HLA-DR1, -DR8, -DR51, -DR52 and -DR53 (Figure 1.2).

Figure 1.2 Organisation of HLA-DR Haplotypes. This is a schematic representation of the major DR haplotypes found in the MHC, shown telomeric to centromeric. All haplotypes carry the DRB9 pseudogene at the telomeric end and the DRB1 gene at the centromeric end. Each, however, varies in the complement of DRB paralogs in between. Coloured blocks are expressed genes, while grey patterned blocks are pseudogenes. Distances are not drawn to scale. This figure is adapted from Svensson and Andersson, 1997.

All arrangements carry a DRB9 pseudogene at the telomeric end and a DRB1 gene at the centromeric end. In between, the DR subtypes carry different arrangements of the DRB3, DRB4 and DRB5 genes, as well as the DRB6, DRB7 and DRB8 pseudogenes (Bergstrom et al 1999). Based on the analysis of lineage-specific insertion/deletion elements between syntenic regions in humans, chimpanzees and gorillas, the DR51, DR52 and DR53 arrangements were found to be ancient in the hominid clade, while the DR1 and DR8 arrangements evolved from DR51 and DR52 more recently (Svensson and Andersson 1997).

1.1.2 MHC and Disease

Given that the MHC is at the heart of the human immune system, it is not surprising that the MHC is associated with the majority of autoimmune and infectious diseases, and is thus the focus of many disease gene-mapping studies (Lechler and Warrens, 2000). Most disease associations with the MHC are first identified as a significant difference in the
frequency of a particular HLA allele in a patient group compared to an ethnically matched group of healthy individuals. An example of such an association is that of HLA-B27 alleles and ankylosing spondylitis (AS), a chronic inflammatory rheumatic disease. This association is the strongest described among autoimmune associations with MHC molecules, with over 94% of AS patients being HLA-B27-positive compared to fewer than 10% of healthy individuals, translating to an odds ratio of over 170 (Lechler and Warrens, 2000). This association is also remarkable for the fact that it is robust in every population examined. The main hypothesis for the B27 and AS association is that HLA-B27 molecules have the unique ability to bind to a set of “arthritogenic self-peptides”. This hypothesis has been supported by epidemiological and functional studies of HLA-B27 subtypes, in which associated subtypes like B*2705 are able to present a self-peptide in different conformations, while non-associated subtypes like B*2706 and B*2709 do not share this ability (Ren et al 1997, Hülsmeyer et al 2004). However, even in the face of such strong epidemiological support, the precise molecular basis of the HLA-B27 association with AS has not yet been determined conclusively.

One of the very few disease genes mapped conclusively to the MHC region illustrates the complexity of MHC disease associations. Hereditary haemochromatosis is an autosomal recessive disorder of iron metabolism that leads to an accumulation of excessive iron in the body and ultimately to multi-organ dysfunction (Feder et al 1996). The association of haemochromatosis with the MHC was first identified in 1976, with the segregation of HLA-A3 alleles with haemochromatosis patients of Caucasian descent throughout Europe (Simon et al 1976). Numerous fine-mapping studies followed in the decades after that, with strong association signals coming from the 1-2Mb region around the HLA-A locus. Eventually a mutation in an HLA class I-like
gene, HFE, was conclusively linked to hereditary haemochromatosis, and shown to disrupt the role that HFE plays in regulating iron absorption and distribution (Feder et al 1996, Townsend and Drakesmith 2002). Physical mapping of HFE places it at 26.2Mb on chromosome 6p, a distance of 3.8Mb away from HLA-A. MHC haplotypes carrying the HLA-A3 allele in Caucasians are highly conserved in the stretch between the HFE and HLA-A loci, providing an explanation for the strong but ultimately spurious association signal for HLA-A3 in hereditary haemochromatosis.

For the majority of MHC-associated diseases, establishing a causative relationship between a disease and an MHC gene has been difficult. In most studies where associations with HLA alleles are found, such as the link between insulin-dependent diabetes mellitus (IDDM) and HLA-DRB1*03/*04, these are often population dependent, with little reproducibility in other ethnic groups (Lechler and Warrens, 2000). There are a few reasons for this lack of success. First is the incomplete knowledge of the variation in the MHC outside of the classical HLA loci, especially in non-Caucasian populations. Second, strong stretches of allele conservation due to linkage disequilibrium (exemplified by the tripartite association of HLA-A3, haemochromatosis and the HFE gene) complicate the discovery of disease loci. Classical HLA alleles associated with diseases are thought to be only markers in linkage disequilibrium with the actual disease loci within the MHC (Dawkins et al 1999). Third, most MHC-associated diseases are believed to be complex and polygenic in nature, belonging to the common disease/common variant (CDCV) class of diseases, in which the genetic risk for common diseases will often be due to disease-predisposing alleles with relatively high frequencies (Reich and Lander, 2001). In order to map CDCV diseases, genotyping informative markers selected with knowledge of the linkage disequilibrium of
the targeted region is crucial (Zondervan and Cardon, 2004).

Within the Singaporean Chinese population, a range of diseases has been found to be associated with the MHC (Table 1.1). These include adverse drug reactions, cancers, renal diseases and autoimmune conditions. Many of these associations, such as that of HLA-B*4601 with nasopharyngeal carcinoma, have been replicated in other Chinese populations (Hildesheim et al 2002). However, without complete knowledge of the variation and linkage disequilibrium of the MHC in Chinese, the identification of a causative or disease-associated locus in each of these diseases has been elusive thus far.

1.2 Linkage Disequilibrium

Linkage disequilibrium (LD) refers to the non-independence of alleles at different loci. As an example, consider two adjacent loci, one with alleles A/a and the other with alleles B/b, giving rise to four possible haplotypes: AB, ab, Ab and aB. If the loci are independent, the frequency of seeing a particular haplotype should not deviate from the expected frequency, which is the product of the individual allele frequencies. If

[...]

and 16kb respectively), while the African haplotype blocks were half as long (7kb). On average, haplotype blocks cover 81% and 87% of the Oriental and Caucasian genomes respectively, compared to only 67% in the African population. Diversity of haplotype blocks was also lower in non-African populations (4 common haplotypes per block) compared to 5.6 common haplotypes per block in the African samples. Haplotype block definitions were seen to be robust across populations, especially between the Caucasian and Oriental populations, with both common and rare haplotypes shared across populations. The low diversity of haplotype blocks also meant that a typical SNP is highly correlated with many nearby partners; 80% of the SNPs are perfectly correlated with at least one other partner, and over 90% of the SNPs are highly correlated with at least one other SNP. This suggests
that there is high redundancy among SNPs and that the actual number needed for association studies can be determined from localised LD patterns. Using an r² threshold of 0.8, the HapMap project predicted that for a non-African population approximately 250,000 tag SNPs are sufficient to represent the common SNP variation in the entire human genome, while 500,000 tag SNPs are required for African populations (International HapMap Consortium 2005).

The HapMap data lay the foundation for identifying genetic factors that influence disease phenotypes by providing an understanding of variation and LD patterns that can be utilized for comprehensive association studies. It is, however, important to address the robustness and transferability of the variation data, LD patterns, haplotype blocks and tag SNPs identified in the HapMap project to other distinct populations such as the Singaporean Chinese.

1.2.4 The Structure of LD in the MHC

The LD structure of the MHC has also been described recently with the use of SNP variation maps of various densities, mostly in Caucasian populations (Walsh et al 2003, Stenzel et al 2004, Miretti et al 2005, de Bakker et al 2006). The descriptions of LD presented by each of these teams are very similar, but the report published by Miretti and co-workers provides the most detailed description of LD across the MHC and is briefly summarised here. Using a map of 2,300 SNPs over a 4.46Mb region containing the MHC, Miretti and colleagues described 202 haplotype blocks across the MHC, covering 81.95% of the underlying sequence. Haplotype blocks have an average size of 18kb and 6.4 haplotypes per block (Miretti et al 2005). These data show that LD patterns within the MHC do not differ from the rest of the genome, displaying the same block-like structure with no significant differences in block lengths, block coverage and haplotype diversity.

1.2.5 Haplotype-Specific LD in the MHC

In a separate study, Ahmad and co-workers used a different strategy to
investigate LD patterns of the MHC in Caucasian samples (Ahmad et al 2003). They constructed gene-specific allelic haplotypes at 25 genes across a 3.5Mb segment of the MHC from HLA-A to TAPBP, using sets of SNPs at each of the 25 loci. These haplotypes were then linked together to construct “surrogate inter-gene haplotypes”. The LD of each of these surrogate haplotypes was analysed using HLA-B as an anchor point, and from these a clear pattern emerged: LD varies in a haplotype-specific manner. Haplotypes carrying the B*0801, B*4403, B*5701 and B*1302 alleles were seen to exhibit stronger LD across the region studied, whereas alleles such as B*1501, B*1801, B*4001 and B*1401 have much lower LD values across the MHC.

1.2.6 Conserved Extended Haplotypes in the MHC

The short, discrete haplotype blocks seen in the population-based SNP LD maps appear to run contrary to the understanding of LD patterns in the MHC that has been described by various groups for some time (Alper et al 1992, Degli-Esposti et al 1992b). Rather than the 18kb average blocks identified in the SNP maps, these studies suggested a more complicated structure of LD in the MHC that varies between different MHC haplotypes (Yunis et al 2003), similar to the HLA haplotype-specific LD described by Ahmad et al 2003.

From pedigree analysis as well as allele and haplotype counting methods, the MHC haplotype compositions of various populations were previously determined in family studies. Rather than using SNPs as markers, these haplotypes were constructed with alleles at the classical HLA loci (HLA-A, -B, -C, -DR, -DQ), the composition of the complement genes (C2, CFB, C4A/B), as well as other polymorphic markers such as microsatellites across the MHC (Alper et al 1992, Degli-Esposti et al 1992b, Gaudieri et al 1997). These haplotypes show that blocks of conserved DNA sequence, containing specific allele combinations of two or more loci within the MHC, are shared between unrelated individuals in the
same population. The lengths of these conserved sequences vary between haplotypes: shorter ones are less than 100kb long, but longer ones stretch across the entire MHC and were termed conserved extended haplotypes (“CEHs”, Alper et al 1992) or ancestral haplotypes (“AH”, Degli-Esposti et al 1992b). Furthermore, these conserved regions show “genetic fixity”; unrelated individuals who share common conserved regions would be identical (or nearly identical) in the entire DNA sequence within them. It was hypothesized that all MHC haplotypes seen in populations today are either entire CEHs or consist of shuffled fragments of CEHs. Conserved extended haplotypes are also thought to occur at widely different frequencies, not only across the different human ethnic groups, but also within subsets of each group.

In a recent review of conserved extended haplotypes (Alper et al 2006), it was estimated that up to a third of common (frequency > 0.01) European Caucasian MHC haplotypes are fixed from HLA-B to HLA-DR, a distance of more than 1Mb. Using data from 2000 normal family controls, they also reported the most common HLA-A, -B and -DRB1 CEHs in the Caucasian population, and these are listed in Table 1.2.

Table 1.2: Common Conserved Extended Haplotypes Found in a Caucasian Population

HLA-A     HLA-B     HLA-DRB1     Frequency
A*0101    B*0801    DRB1*0301    8.60%
A*0301    B*0702    DRB1*1501    6.90%
A*0201    B*4402    DRB1*0401    2.60%
A*2601    B*3801    DRB1*0402    1.20%
A*0201    B*1501    DRB1*0401    1.00%
A*3101    B*4001    DRB1*0401    0.90%
A*3001    B*1801    DRB1*0301    0.60%
A*0201    B*1501    DRB1*0401    0.50%

1.2.7 CEHs and Disease Associations

Given the genetic fixity of CEHs, it is perhaps not surprising that most CEHs are associated with a large number of diseases. In particular, the A1-B8-DR3 Caucasian haplotype has frequently been referred to as the “autoimmune haplotype” because of its association with a myriad of autoimmune conditions including, but not limited to, insulin-dependent diabetes mellitus, SLE and myasthenia gravis
(Price et al 1999). Another example of a disease-associated CEH in the Caucasian population is the A3-B7-DR15 haplotype, which is associated with SLE and hereditary haemochromatosis (Dawkins et al 1999). It is assumed that there are one or more genes within these CEHs that contribute to genetic susceptibility, but the strong LD of conserved haplotypes makes it difficult to identify the genes on the haplotype that are responsible for the aetiology of the diseases.

One approach that utilises the fixity of these CEHs to identify disease susceptibility loci is a modified form of recombinant mapping, which compares different CEHs that share an association with a particular disease (Degli-Esposti et al 1992a, Price et al 1999). As an example, the A1-B8-DR3 and A3-B7-DR15 haplotypes are both associated with SLE, and if the allelic content of both these CEHs is known, one can theoretically identify alleles that are shared between them, and these become potential candidates for targeted association studies. To this end, the MHC Haplotype Project was initiated to completely sequence full-length Caucasian MHC haplotypes (Allcock et al 2002), and the data from this resource were made available very recently (Horton et al 2008).

In the Singaporean Chinese population two HLA haplotypes, A2-B46-DR9 and A33-B58-DR3, appear most frequently in disease associations (see Table 1.1). It is likely that these haplotypes are CEHs in the Chinese population, and a detailed knowledge of the extent and allelic content of these haplotypes will be tremendously useful in dissecting the disease-associated loci on them.

1.3 Recombination Hotspots

Homologous recombination during meiosis results in the formation of crossover chromosomes, and if it occurs frequently across a specific genomic region (a recombination hotspot), this leads to a breakdown of linkage disequilibrium between polymorphic loci flanking the recombination hotspot. As such, understanding patterns of LD will be incomplete
without taking into account how recombination varies across a region. The distribution of recombination events across the human genome has become better understood in recent years: recombination is known to occur in localized hotspots, while the bulk of the human genome is “cold”, with few signs of historical recombination. Fine mapping of hotspots has also consistently indicated that crossover locations cluster within windows of 1-2kb (Jeffreys et al 1998, 2000, 2001, Kauppi et al 2004). The characteristics and locations of recombination hotspots are briefly covered in this section.

1.3.1 Meiotic Recombination and Crossover Locations

To maintain genome size with each generation, a single round of DNA replication precedes two rounds of chromosome segregation during meiosis. In order to ensure that proper chromosomal segregation occurs, bivalents are formed between homologous pairs of sister chromatids and are held together by the formation of chiasmata. Chiasmata are formed through crossover events between homologous chromosomes, with at least one chiasma (an average of two to three) seen between each pair of homologous chromosomes (Alberts et al 2004). Chromosomal segregation and resolution may lead to genetic exchange between homologous chromosomes, forming crossovers or reciprocal homologous recombination exchanges.

The molecular mechanism of meiotic recombination has been described most thoroughly in the budding yeast Saccharomyces cerevisiae, and a similar mechanism is thought to occur in mammals (Neale and Keeney 2006). The heart of meiotic recombination is the formation and subsequent repair of double strand breaks (DSBs) in DNA, which are intentionally introduced into the DNA by a topoisomerase-like protein, Spo11, that generates transient, covalent protein-DNA linkages to the 5´ ends on either side of the break (Neale and Keeney 2006) (see Figure 1.3). Spo11 is subsequently removed and the 5´ ends are degraded by exonucleases to form variable-length 3´ single-strand
tails that are up to 1kb long (Step III in Figure 1.3). DNA strand exchange proteins, including the RecA homologs Rad51 and Dmc1, bind to the 3´ tails, forming helical nucleoprotein filaments that search for targets on homologous chromosomes and catalyse strand exchange (Neale and Keeney 2006). The second 3´ tail is subsequently captured to form a heteroduplex region in which the strands from two different DNA molecules have begun to pair. This region is subsequently extended with the aid of enzymes (RuvA and RuvB) to establish further strand exchanges between the chromosomes in a process called branch migration (Alberts et al 2004). In this manner, a recombination intermediate called the double Holliday junction is formed (Step V, Figure 1.3). The resolution of the double Holliday junction leads to the formation of either crossover or non-crossover chromosomes. This DSB pathway is believed to be responsible for meiotic recombination in humans as well, based on the high level of conservation of Spo11 (Romanienko and Camerini-Otero 1999).

[Figure 1.3 panel labels: I Intact Chromosome; II DSB Formation by Spo11; III Degradation from 5´ Ends; IV DNA Strand Exchange; V Second End Capture; VI Resolution of Holliday Junction. Plot labels: Crossover Frequency, Initiation Zone.]

Figure 1.3 DNA Events in Meiotic Recombination. Steps I to III: The topoisomerase Spo11 cleaves DNA to form a double strand break (DSB), and the 5´ ends on both sides of the break are degraded to leave 3´ single-stranded segments up to 1kb long. Steps IV to V: Rad51 and Dmc1 bind to the 3´ ends and invade a homologous chromosome, leading to DNA strand exchange between the chromosomes. The second 3´ strand is captured and branch migration occurs, yielding a double Holliday junction. Step VI: Resolution of the Holliday junctions may lead to either crossover (as shown here) or non-crossover products. The mapping of crossover products reveals that crossover events occur mostly within a 1-2kb window centred on the
assumed initiation zone of the DSB events. This figure is adapted from Neale and Keeney 2006.

1.3.2 Recombination Hotspots are Preferential Sites for DSB Formation

Studies in budding yeast also indicate that the most important factor in crossover locations is the distribution of the DSBs that initiate meiotic recombination (Kauppi et al 2004). DSBs are known to cluster in hotspots and are virtually absent in coldspots, and although the factors that influence whether a DNA sequence will be a hotspot in budding yeast are not completely understood, there appear to be several guiding principles. First, all known DSB hotspots are nuclease-hypersensitive, which most probably reflects that an open chromatin configuration is necessary for DSB formation (Wu and Lichten 1994). However, not all nuclease-hypersensitive sites are DSB hotspots, and there is no correlation between the degree of nuclease-hypersensitivity and the frequency of DSBs. DSBs also tend to occur in promoter regions, but this is likely a reflection of the open-chromatin configuration associated with such regions. On a larger scale, DSB hotspots tend to be located in regions of relatively elevated G+C content, but the molecular explanation for this association is not yet known.

1.3.3 Mapping Recombination Hotspots in the Human Genome

In general there are three methods of mapping recombination hotspots in the human genome: analysis of pedigrees, genotyping of recombinant sperm, and computational inference from SNP variation maps.

Recombination Hotspot Mapping with Pedigree Analysis

In multi-generation family studies, each offspring provides information on the paternal and maternal products of a single meiotic event. Traditionally, polymorphic microsatellites and variable number short tandem repeats (STRs) are used to construct recombination maps by identifying crossover events in pedigrees (for example Cullen et al 1997 and Kong et al 2002). The accuracy of a recombination map is a function of the number of meiotic
events detected, in other words proportional to the number of families studied. The resolution of a recombination map will depend on the number of polymorphic markers genotyped. An ideal high-resolution, genome-wide recombination map from family studies using a dense map of SNPs would require approximately 10^5 to 10^6 meiotic events in an impractically large number of families, which makes this approach unfeasible for the fine mapping of recombination hotspots. However, such family-based genetic maps can reveal differences between the recombination rates in females and males.

Recombination Hotspot Mapping with Sperm Typing

Sperm typing is currently the most accurate way of genetically defining recombination hotspots. Sperm from a single individual allows thousands of meiotic products to be screened, making it possible to detect and quantify many recombination events (Kauppi et al 2004). There are generally two approaches to sperm typing, the first of which involves single-sperm isolation by fluorescence-activated cell sorting followed by whole genome DNA amplification and genotyping of polymorphic markers. By analysing the genotype at each locus in many individual sperm, the phase of the donor can be deduced and each individual sperm identified as a crossover or non-crossover. The crossover junction is also narrowed to the interval between the polymorphic loci. An example of this technique is the identification of recombination hotspots across the MHC using microsatellite markers (Cullen et al 2002). Single-sperm techniques are generally expensive and laborious, and to date have been used to identify hotspots at resolutions from 11kb to over 300kb (Arnheim et al 2003, 2007).

A higher-resolution sperm-typing technique involves pooling together hundreds to thousands of sperm from a single individual and using nested, allele-specific PCR primers to selectively identify and isolate crossover locations. By using many pools, it is potentially possible to identify
recombination hotspots based on millions of meioses. This technique also has the advantage of being able to detect and quantify very low-frequency recombination events. The resolution of the crossover-breakpoint mapping is limited by the density of the polymorphic markers within the crossover region, although with the use of appropriately chosen SNPs, this technique has successfully mapped tens of recombination hotspots at a resolution of 2kb or less, including within the MHC class II locus (Jeffreys et al 2000, 2001, Kauppi et al 2004). This method is, however, technically challenging due to the stringency of designing allele-specific PCR primers, and is also limited to scanning for crossovers across intervals of less than 10kb (a reasonable limit for PCR reactions). Only 200kb of the human genome has actually been tested for recombination using pooled sperm PCR (Arnheim et al 2007).

Both techniques of sperm typing are, however, limited to male-specific homologous recombination, and hence are unable to account for the female-specific recombination that is known to exist (Broman et al 1998, Kong et al 2002). Due to practicalities, each study is usually limited to a handful of sperm donors and is therefore also unlikely to identify individual-specific recombination rates (Arnheim et al 2007).

Recombination Hotspots Inferred from Computational Approaches

The last method of mapping recombination hotspots is through in-silico predictions from population data, and has been made possible by recent improvements in computational power, algorithms and SNP genotyping technologies that have made available large population genetics datasets such as those of the HapMap project. Based on multi-SNP haplotype patterns, intervals of high LD and low LD can be identified, and the low LD regions provide a signal that recombination events have occurred in the history of the samples. Consequently, recombination hotspots inferred from population genetic data are “historical” hotspots, although it does not necessarily
indicate that the hotspots are no longer active in the current population.

Most algorithms are based on the coalescence model, which states that in the absence of recombination, all alleles of the genetic markers on a chromosome can be traced back to a single ancestral chromosome, and all modern-day variation results from stepwise mutation of that ancestral chromosome. Under the coalescence model, the recombination rate can be estimated from LD values between markers, and recombination hotspots can thus be inferred by identifying regions where the rate is significantly higher than in the flanking regions (Pritchard and Przeworski 2001, Arnheim et al 2007). These methods have been tested extensively on family-based data and shown to be very accurate in locating historical recombination hotspots (Marchini et al 2006). However, computationally inferred hotspots only narrow the crossover location to a resolution limited by the SNP markers used; fine mapping and verifying these hotspots will require further detailed analysis of the individual recombinant DNA molecules.

1.3.4 Sperm-Mapped Recombination Hotspots in the MHC

At present there are six recombination hotspots mapped in the MHC using the pooled sperm technique and another six mapped using the single-sperm technique. The former are seen to cluster around locations in the class II region of the MHC, while the latter are mapped to hotspot regions of between 35 and 105kb, with the resolution limited by the microsatellite and STR markers used in the genotyping. The locations of these 12 hotspots are listed in Table 1.3.

Table 1.3: Location of Sperm Mapped Recombination Hotspots in the MHC

Hotspot               Location (Mb)             Reference                  Method
TAP2                  32.912 - 32.913           Jeffreys et al 2000, 2001  Pooled Sperm
DMB2                  33.007 - 33.009           Jeffreys et al 2000, 2001  Pooled Sperm
DMB1                  33.010 - 33.012           Jeffreys et al 2000, 2001  Pooled Sperm
DNA3                  33.073 - 33.074           Jeffreys et al 2000, 2001  Pooled Sperm
DNA2                  33.081 - 33.082           Jeffreys et al 2000, 2001  Pooled Sperm
DNA1                  33.085 - 33.086 33.086    Jeffreys et al 2000, 2001  Pooled Sperm
Telomeric to HLA-F    29.753 - 29.799           Cullen et al 2002          Single Sperm
HLA-G - HLA-A         29.914 - 29.995           Cullen et al 2002          Single Sperm
LTA - BAT2            31.643 - 31.686           Cullen et al 2002          Single Sperm
TSBP - DRA            32.406 - 32.511           Cullen et al 2002          Single Sperm
DQB1 - DQB3           32.778 - 32.813           Cullen et al 2002          Single Sperm
RING3 - DPB1          33.071 - 33.172           Cullen et al 2002          Single Sperm

1.4 Aims of This Study

Within the Singaporean Chinese population the MHC is associated with a number of autoimmune diseases such as systemic lupus erythematosus, rheumatoid arthritis, psoriasis and myasthenia gravis, as well as other conditions such as adverse drug reactions and nasopharyngeal carcinoma. It is believed that most of these associations are due to common variations within the MHC, and it is only through high-powered association studies that these disease loci can be identified. The design of such association studies will benefit from an in-depth knowledge of the genetic variation and linkage disequilibrium patterns of the MHC in the local Chinese population. While LD maps of the human genome have been made publicly available by the HapMap project, it is not known how well that data translates to the local Chinese population.

The MHC is also known to carry conserved extended haplotypes (CEHs) that stretch over megabases and obfuscate disease association studies, but an accurate description of the frequency and extent of these CEHs in the local Chinese population is not available. An accurate catalogue of the CEHs in the Chinese population will also add to the currently available data, which is limited to the Caucasian population, and allow comparisons to be drawn between the two. Of special interest are the segments shared between CEHs of different ethnic groups that have similar disease associations.

Recombination hotspots are known to occur in 1-2kb windows, and if an accurate map of active recombination hotspots across the human genome can be determined, the cis-acting DNA elements that promote DSB and/or recombination hotspot formation can be identified. With this knowledge, it will be possible to understand how recombination rates vary among individuals, which will potentially affect marker selection in association
studies. While computationally inferred recombination hotspots have been shown to be highly accurate, pooled-sperm mapping is currently the only method that can identify and verify crossover locations at base-pair resolution. However, this method suffers from the laborious need to fine-tune allele-specific PCR assays, which makes it impossible to scale up. Furthermore, sperm mapping provides only a male-biased investigation of recombination and cannot address the sex-specific recombination rate variation that is known to exist. Alternative approaches, such as the haplotyping of individual DNA molecules across putative crossover hotspots using single-molecule imaging, have been suggested (Kauppi et al 2004). Another approach is to identify recombinant haplotypes at crossover hotspots from a population genetics dataset and re-sequence them to fine-map the crossover region in these recombinant haplotypes. The 35-105kb wide active recombination hotspots in the MHC identified by Cullen et al 2002 provide an opportunity to test this approach.

With these in mind, the objectives for this study of the Singaporean Chinese population are as follows:

i. To construct SNP variation maps of chromosome 6p, with particular emphasis on the MHC
ii. To describe the patterns of linkage disequilibrium and haplotype structure across chromosome 6p and the MHC
iii. To identify the conserved extended haplotypes present in the MHC
iv. To compare conserved extended haplotypes in the Chinese population with those of the Caucasian population
v. To utilise the SNP variation data for identifying recombinant haplotypes, and to re-sequence these to fine-map the single-sperm recombination hotspot regions
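The pairwise LD values referred to above, from which population-based methods infer recombination and which objective ii sets out to describe, can be illustrated with a small calculation. The following is a minimal sketch rather than the method used in this study: it assumes that phased two-SNP haplotypes are already available (unphased genotype data would first need haplotype inference), and the function name pairwise_ld and the example haplotype counts are hypothetical. It computes the standard statistics D, D' and r^2 for a pair of biallelic SNPs.

```python
from collections import Counter


def pairwise_ld(haplotypes):
    """Return (D, D', r^2) for two biallelic SNPs from phased haplotypes.

    Each haplotype is a 2-character string: the allele at SNP 1 followed by
    the allele at SNP 2 (for example "AB", "Ab", "aB" or "ab").
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    counts = Counter(haplotypes)

    # Use the alleles carried by the first haplotype as reference alleles.
    ref1, ref2 = haplotypes[0][0], haplotypes[0][1]
    p1 = sum(c for h, c in counts.items() if h[0] == ref1) / n
    p2 = sum(c for h, c in counts.items() if h[1] == ref2) / n
    p12 = counts[ref1 + ref2] / n

    if p1 == 1.0 or p2 == 1.0:
        raise ValueError("both SNPs must be polymorphic in the sample")

    # D: observed haplotype frequency minus the frequency expected
    # if the two SNPs were independent.
    d = p12 - p1 * p2

    # D': |D| scaled by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = abs(d) / d_max

    # r^2: squared correlation between the alleles at the two SNPs.
    r2 = d * d / (p1 * (1 - p1) * p2 * (1 - p2))
    return d, d_prime, r2


# Hypothetical sample of 100 phased haplotypes (not data from this study).
sample = ["AB"] * 60 + ["Ab"] * 10 + ["aB"] * 10 + ["ab"] * 20
d, d_prime, r2 = pairwise_ld(sample)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

In this toy sample the two SNPs are in partial LD. A run of adjacent marker pairs with D' close to 1 would be consistent with little historical recombination between them, while a sharp drop in LD between flanking markers is the kind of signal that coalescent-based methods model formally when inferring hotspots.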
