In part one we discussed what is meant by genetic variants and traits, and the relationship between the two. Here we will discuss how to locate the genetic variants in the genome which make individuals more likely to possess certain traits, including predispositions to certain behaviours. If you haven't already, you may wish to read part one to gain an understanding of some key terms and concepts before continuing with this article.
Genotyping, as opposed to whole genome sequencing, does not identify the nucleotide present at every position in the genome. It is used because whole genome sequencing is expensive, and the large sample sizes required to detect associations with complex traits make sequencing every participant financially infeasible. Instead, a set of single nucleotide polymorphisms (SNPs) called tag SNPs is chosen from across the genome. These give us enough information to impute, with a high degree of accuracy, the nucleotides present at many other variable positions in the genome. This works by taking advantage of a feature of the genome called linkage disequilibrium.
To understand what this is we need to take a step back and look at how the genome of an organism is formed from the DNA passed down to it from its parents. Each sexually reproducing organism has two versions of each chromosome in its genome, called homologous chromosomes: one from each parent. But it is not the case that you simply inherit one of your father's chromosomes and one of your mother's chromosomes intact. The DNA is reshuffled during a stage of reproduction called meiosis.
Meiosis occurs during the production of the parent's sex cells. Inside the nucleus of the germ cell (the cell which produces sex cells), each pair of homologous chromosomes comes together and swaps chunks of DNA, with each chunk moving to the equivalent position on the other chromosome of the pair. This process is called crossing over and is akin to swapping pages between the two versions of a chapter of the genomic instruction manual. Before crossing over, each chromosome consists of two identical copies of the string of DNA which makes up the chromosome, called sister chromatids. Because crossing over happens at different positions on different chromatids, it results in four non-identical chromatids (two for each homologous chromosome).
The cell then divides to form two cells, each containing one copy of every newly shuffled chromosome. A second division (meiosis II) then follows, in which these cells divide once again to produce four cells, each containing one chromatid from every chromosome. The four cells produced by meiosis, each with a unique genome, are called gametes and take the form of sperm cells in males and ova (egg cells) in females. When a sperm cell meets an ovum during reproduction, the chromatids from each cell become the chromosomes of a completely new genome: one set of chromosomes made of chunks of DNA from both of the mother's sets of chromosomes, and one set made of chunks of DNA from both of the father's sets.
Meiosis forms 4 sex cells with unique chromosomes
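The reshuffling described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: each chromosome is a string of letters, `M` and `P` mark DNA from the two versions of a chromosome carried by the parent, and exactly one crossover happens at a single hypothetical `breakpoint` between one chromatid of each version. Real meiosis involves several crossovers at varying positions, and the function names here are purely illustrative.

```python
def cross_over(chromatid_a, chromatid_b, breakpoint):
    """Swap the DNA after `breakpoint` between two chromatids."""
    if len(chromatid_a) != len(chromatid_b):
        raise ValueError("chromatids must be the same length")
    new_a = chromatid_a[:breakpoint] + chromatid_b[breakpoint:]
    new_b = chromatid_b[:breakpoint] + chromatid_a[breakpoint:]
    return new_a, new_b


def meiosis(chromosome_1, chromosome_2, breakpoint):
    """Return the four chromatids that end up in the four gametes."""
    # Each chromosome is first copied into two identical chromatids.
    c1_first, c1_second = chromosome_1, chromosome_1
    c2_first, c2_second = chromosome_2, chromosome_2
    # Toy assumption: one chromatid from each chromosome crosses over once.
    c1_second, c2_first = cross_over(c1_second, c2_first, breakpoint)
    # Two rounds of division leave one chromatid in each gamete.
    return [c1_first, c1_second, c2_first, c2_second]


gametes = meiosis("MMMMMMMM", "PPPPPPPP", breakpoint=3)
for gamete in gametes:
    print(gamete)
```

With a breakpoint of 3, the four gametes carry four different chromatids: two unchanged and two recombinant.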
Meiosis is relevant to genotyping because, although chunks of DNA are swapped between the two versions of a chromosome at varying positions, genetic variants are not shuffled independently of one another. Variants which sit close together on a chromosome are rarely separated by a crossover, so some chunks of DNA stay together more often than would be expected by chance, a feature of genomes called linkage disequilibrium. Genetic variants which are more likely to be found in the same genome than would be expected if they were inherited independently, even after many generations of crossing over, are therefore said to be in linkage disequilibrium with each other. The higher the likelihood of staying together, the higher the degree of linkage disequilibrium.
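"More likely to be found together than expected by chance" can be put into numbers. The sketch below computes two standard measures that this article does not itself name: D, the observed frequency of two alleles appearing together minus the frequency expected if they were independent, and r squared, which rescales D squared to lie between 0 and 1. The function name, the two-letter haplotype encoding and the sample counts are assumptions made up for illustration.

```python
def linkage_disequilibrium(haplotypes, allele_1, allele_2):
    """Return (D, r_squared) for two biallelic SNPs.

    `haplotypes` is a list of 2-character strings: the allele at the
    first SNP followed by the allele at the second SNP.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n
    p12 = sum(h[0] == allele_1 and h[1] == allele_2 for h in haplotypes) / n
    # D: how much more often the two alleles co-occur than chance predicts.
    d = p12 - p1 * p2
    denominator = p1 * (1 - p1) * p2 * (1 - p2)
    if denominator == 0:
        raise ValueError("both SNPs must vary in the sample")
    return d, d * d / denominator


# Hypothetical sample of 100 chromosomes: A usually travels with B.
sample = ["AB"] * 40 + ["ab"] * 40 + ["Ab"] * 10 + ["aB"] * 10
d, r_squared = linkage_disequilibrium(sample, "A", "B")
print(d, r_squared)
```

For this made-up sample D is about 0.15 and r squared about 0.36. If all four combinations were equally common, both values would be 0.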
The above image is a visualisation of linkage disequilibrium called an LD heatmap. The black line at the bottom represents a stretch of DNA, and the intensity of each red point represents how likely the two SNPs which intersect at that point are to be found together. Each black triangle marks a stretch of DNA whose SNPs are unlikely to be separated during crossing over and are therefore in linkage disequilibrium with each other.
Chunks of DNA housing genetic variants which stick together like this are referred to as haplotype blocks (blocks of DNA inherited from the same parent). Now that we understand haplotype blocks, we can explain how genotyping data can be used to impute the SNPs present in the genome from reading only a few selected bases. Imagine a sentence in the genomic instruction manual which is known to be a haplotype block, and that four different versions of this sentence are known to exist. Reading only a few well-chosen letters is then enough to tell which version is present.
The same can also be said of sequences of DNA. Take this example of four versions of the same haplotype block in the sequences below. To find which version of this haplotype block is present, it is enough to read only the SNPs present at the fourth and tenth positions, even though the sixth and twelfth positions also vary. It is for this reason that the SNPs in red are chosen as tag SNPs.
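The idea can be made concrete with a short Python sketch. The four sequences below are hypothetical stand-ins invented for this example, not real data: they vary at the fourth, sixth, tenth and twelfth positions, as in the example above. The helper names are also assumptions. The sketch treats each sample as a single known haplotype, whereas real genotype data contains two haplotypes per person and needs statistical phasing and imputation tools.

```python
# Hypothetical haplotype block: four known versions, 12 bases each.
KNOWN_VERSIONS = [
    "ACTAGCTACGAT",
    "ACTAGCTACTAC",
    "ACTGGTTACGAT",
    "ACTGGTTACTAC",
]


def variable_positions(versions):
    """1-based positions at which the versions differ."""
    return [
        i + 1
        for i, bases in enumerate(zip(*versions))
        if len(set(bases)) > 1
    ]


def build_tag_index(versions, tag_positions):
    """Map the alleles seen at the tag positions to the full sequence."""
    index = {}
    for version in versions:
        key = tuple(version[p - 1] for p in tag_positions)
        if key in index:
            # Two versions look the same at these positions: bad tag choice.
            raise ValueError(f"tag positions {tag_positions} are ambiguous")
        index[key] = version
    return index


index = build_tag_index(KNOWN_VERSIONS, tag_positions=(4, 10))
imputed = index[("G", "T")]  # alleles read at positions 4 and 10
print(variable_positions(KNOWN_VERSIONS))
print(imputed, imputed[5], imputed[11])  # full block, then positions 6 and 12
```

Reading G at position 4 and T at position 10 is enough to recover the whole block, including the unread bases at positions 6 and 12. Choosing position 4 alone as the only tag would raise an error, because two versions share each allele there.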
The existence of linkage disequilibrium creates both difficulties and opportunities when using genotyping to identify genetic variants associated with certain traits. The difficulty is that association studies, which look for statistical associations between traits and genetic variants, most often identify several SNPs which are in linkage disequilibrium with each other rather than a single variant. This is a problem because the association results alone cannot tell us which of the identified SNPs is the causal SNP and which were identified only because they are in linkage disequilibrium with it.
The above image is part of a graph which has the positions of individual SNPs from a section of a chromosome along the x-axis. Each dot represents a SNP, and the higher it appears on the y-axis, the stronger its association with the trait. Only one of these SNPs is likely to be the cause of the association found at this location in the genome, but as a result of linkage disequilibrium an association is also found with many other nearby SNPs.
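The effect described above can be reproduced in a toy simulation. Everything here is an assumption chosen for illustration: the trait depends only on a "causal" SNP plus random noise, a "linked" SNP carries the same allele as the causal one 90% of the time, and an "unlinked" SNP is independent of both. A plain Pearson correlation stands in for the association statistics a real study would use.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, ld=0.9, seed=0):
    """Correlate a trait with a causal, a linked and an unlinked SNP."""
    rng = random.Random(seed)
    causal, linked, unlinked, trait = [], [], [], []
    for _ in range(n):
        a = rng.randint(0, 1)                    # allele at the causal SNP
        b = a if rng.random() < ld else 1 - a    # usually inherited with it
        c = rng.randint(0, 1)                    # independent SNP
        causal.append(a)
        linked.append(b)
        unlinked.append(c)
        trait.append(a + rng.gauss(0, 1))        # only the causal SNP matters
    return {
        "causal": pearson(causal, trait),
        "linked": pearson(linked, trait),
        "unlinked": pearson(unlinked, trait),
    }


results = simulate()
for name, r in results.items():
    print(name, round(r, 3))
```

The linked SNP shows a clear association with the trait even though it has no effect on it, while the unlinked SNP shows almost none.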
Despite this difficulty, genotyping-based genome-wide association studies (GWAS) are highly useful. Tag SNPs showing a significant association with the trait of interest, combined with databases containing the DNA sequences of all known haplotype blocks, make it possible to identify the region of the genome in which the causal variant is located and to impute the identity of all common genetic variants in that region. This is achieved at a fraction of the cost of whole genome sequencing.
Researchers can then use this information to inform inquiry into the biological processes behind the trait under investigation. For example, they might choose to look more closely at the genes which are found within or near the identified locations in the genome and create hypotheses about how variation in the expression or composition of these genes might influence the trait based on prior knowledge of the function of these genes.
The next article will discuss the statistical methods which are used to identify associations between SNPs and traits.


