Saturday, November 21, 2020

Polygenic Scores - Genome-Wide Association Studies Explained Simply Part 4




Welcome to part 4 of this video series on genome-wide association studies. In previous videos we have discussed the relationship between genetic variants and traits, linkage disequilibrium and the statistical methodology used to detect associations between genetic variants and complex traits. In this video we will discuss one of the uses of GWAS, which is the development of polygenic scores.

There are certain traits which are determined by the presence or absence of genetic variants in a single gene. These are called monogenic traits and include traits such as lactose tolerance and, peculiarly enough, which thumb goes on top when you interlock your fingers. Most complex traits however are a result of the combined influence of a multitude of genetic variants affecting thousands of genes in your genome, each responsible for only a fraction of the overall genetic contribution to the trait. We refer to these as polygenic traits. 

As we have discussed in previous videos, the genetic variants associated with complex traits can be identified using genome-wide association studies. We can use this information to make predictions about some proportion of the variance in the traits of a population in the form of polygenic scores. Polygenic scores are generated using algorithms which add up the trait-associated alleles carried by an individual, each weighted by its effect size. The value produced by the polygenic score can be used to make predictions about the relative likelihood of a dichotomous trait, such as having a diagnosis of major depression or signing up for skydiving lessons, or the level of a continuous trait, such as height or extraversion scores.
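To make the weighted sum concrete, here is a minimal sketch in Python. The SNP names and effect sizes are made up for illustration and do not come from any real GWAS; the function name is also just a placeholder.

```python
# Hypothetical per-allele effect sizes from a GWAS (illustrative values only).
effect_sizes = {"rs_a": 0.12, "rs_b": -0.05, "rs_c": 0.30}


def polygenic_score(genotype, weights):
    """Add up the effect-allele counts (0, 1 or 2), each weighted by its effect size."""
    return sum(weights[snp] * genotype.get(snp, 0) for snp in weights)


# One hypothetical individual: number of effect alleles carried at each SNP.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}
score = polygenic_score(person, effect_sizes)
print(round(score, 2))
```

In practice the score only becomes meaningful when it is compared with the scores of other people in the same population, which is why the graphs in this video group individuals into deciles.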



For those of you who have watched the previous video the phrase "proportion of the variance" might ring a bell. Proportion of the variance is the same thing as the co-efficient of determination which is measured by r squared. R squared can be thought of as how closely the values or likelihood of the trait for all individuals in the sample hug the line of best fit. 

For example, take a graph which shows polygenic score values as deciles along the X-axis and values for a continuous trait along the Y-axis. We can determine what percentage of the variance is accounted for by our polygenic score by plotting the phenotype values for a validation sample of individuals against their polygenic score, drawing a line of best fit on the resulting graph and calculating an r squared value. A quick side note: the validation sample must be composed of individuals whose data were not used in the GWAS on which the polygenic score algorithm was based. This is necessary to prevent bias.



Most biologically influenced complex traits, including behavioural traits, have some degree of what is called "heritability". Heritability is the proportion of the variability in the level or likelihood of a trait which is influenced by the genetic variants passed on to the individual from their parents as opposed to that which is influenced by environmental factors. Heritability estimates are calculated using twin and adoption studies. These types of studies will be discussed in another video.

Let us take IQ score as an example of a continuous trait. IQ has an estimated heritability of about 50%. This means that 50% of the variability in IQ scores is determined by the genetic variants people are born with. Heritability acts like a kind of upper limit to what can be predicted by polygenic scores. So researchers can strive to be able to predict up to 50% of the variance in IQ scores by looking at the genome. 

In reality of course we aren't there yet due to limitations with statistical power and methodological challenges. This is known as the missing heritability problem. At the moment we can only account for between 20% and 50% of the heritable portion of the variance in IQ with the results of GWAS. Genomic research into polygenic traits is, however, still a very young field and the percentage of the variance which can be accounted for is growing as sample sizes become larger and methodologies become more precise.

One method of representing the explanatory power of a polygenic score graphically is by drawing confidence intervals above and below the mean level of a trait at each decile. A confidence interval is a pair of values between which the mean value of the trait is likely to be found in some proportion of validation samples. Normally this is set at 95%. This is not to be confused with the proportion of individuals at each decile who fall between these values. Rather, it means that if you were to test this model on 100 different validation samples, the mean Y-axis value at each decile would be expected to fall within the confidence interval in about 95 of them.
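As a rough sketch of how such an interval could be computed for one decile, the Python below uses the common normal approximation (mean plus or minus 1.96 standard errors). The trait values are invented, and real studies may use a different method for their intervals.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * standard_error, mean + z * standard_error


# Hypothetical trait values for the individuals in one decile of a validation sample.
decile_values = [98, 104, 110, 101, 95, 107, 112, 99, 103, 106]
low, high = mean_confidence_interval(decile_values)
print(round(low, 1), round(high, 1))
```

Because the standard error shrinks as the number of individuals grows, a larger validation sample gives a narrower interval around the same mean.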

Confidence intervals for a polygenic score with high explanatory power might look like this, whereas for a polygenic score with low explanatory power the confidence intervals may look like this. Note that the predicted value of the trait at each decile is the same for both polygenic scores. However, the one with the smaller confidence intervals is more useful as we can have more confidence in the predicted mean value of the trait.



Let us look at the results of a polygenic score model for IQ alongside the distribution of IQ scores in the population. IQ scores are normally distributed with the 50th centile being defined as an IQ of 100 and other scores are shown at each standard deviation from the mean on this diagram. This polygenic score has an r-squared value of 0.11. In other words, this model accounts for 11% of the variation in IQ scores. Individuals in the highest decile have a mean IQ which is about half a standard deviation above the mean whereas those in the bottom decile have a mean IQ about half a standard deviation below the mean. By comparing this to the distribution of IQ scores we can see that this is equal to a mean IQ of about 108 at the highest decile and a mean IQ of about 92 at the lowest decile.
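As a rough consistency check on those numbers, the sketch below estimates the expected mean of the top decile from the r-squared value alone. It assumes the polygenic score and the trait are jointly normally distributed and that IQ has a standard deviation of 15; both are assumptions made for this illustration, not details taken from the study behind the graph.

```python
import math
from statistics import NormalDist


def expected_top_decile_shift(r_squared, top_fraction=0.10):
    """Expected trait mean (in SD units) for the top fraction of polygenic scores,
    assuming score and trait are jointly normal with correlation sqrt(r_squared)."""
    standard_normal = NormalDist()
    cutoff = standard_normal.inv_cdf(1 - top_fraction)
    # Mean of a standard normal variable, given that it lies above the cutoff.
    mean_score_in_top = standard_normal.pdf(cutoff) / top_fraction
    return math.sqrt(r_squared) * mean_score_in_top


shift_sd = expected_top_decile_shift(0.11)
top_decile_iq = 100 + 15 * shift_sd  # assumes an IQ standard deviation of 15
print(round(shift_sd, 2), round(top_decile_iq, 1))
```

Under these assumptions the shift comes out a little over half a standard deviation, which is in the same region as the values read off the graph.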



It may be difficult to grasp how much of a difference that actually is for a trait like IQ which cannot be seen, so let us visualise it by replacing IQ with a more tangible trait such as height. Half a standard deviation below the mean for height in adult males in the USA is equivalent to about 69 inches, that is 5'9" or 175cm, and half a standard deviation above the mean is equivalent to about 71 inches, that is 5'11" or 180cm. To put that in perspective, according to this chart of celebrity heights that is the difference between Tom Hardy and Michael Fassbender.



The same method can be used for the prediction of dichotomous traits. However, instead of predicting the level of the trait, polygenic scores are used to predict the likelihood of the trait occurring. For example, here is a graph showing the deciles of a polygenic score for diagnosis of major depression in a sample of the Danish population. Rather than a quantitative value, what is referred to here as a hazard ratio is plotted along the Y-axis. The hazard ratio assigns a value of 1 to the decile with the smallest proportion of individuals having a diagnosis of major depression and a value to each other decile corresponding to the ratio of cases relative to the first decile. From this graph we can see that individuals in the tenth decile are just over two and a half times as likely to have a diagnosis of major depression as those in the first decile.



Information about the actual proportion of diagnoses of major depression at each decile is not available from this graph. However, we know the overall proportion of individuals with a diagnosis of major depression in Denmark is 3%. Therefore, assuming that the 5th and 6th deciles are approximately equivalent to the population risk, we can extrapolate that individuals in the first decile have about a 1.7% chance of developing major depression whereas those in the tenth decile have about a 4.3% chance. To put this into perspective, the proportion of individuals in the first decile who will develop depression is like the proportion of mines in this game of minesweeper and for the tenth decile it is like this game of minesweeper.



From the examples we have looked at you might be wondering whether these polygenic scores are actually useful at all. Those with a polygenic score in the top decile for IQ are by no means guaranteed to be the next Einstein, neither are those with scores in the top decile for height much more likely than average to have a career in basketball. Those with polygenic scores in the top decile for risk of major depression do not have much more to worry about than the average person either. There are however many uses of polygenic scores beyond merely making predictions about the traits of individuals.

For example, there are certain diseases which can be prevented or mitigated with early detection and treatment. However, health organisations rarely have the funds to screen every individual at risk. Polygenic scores can be used to inform the screening for these diseases and therefore increase the number of cases which are caught early from screening the same number of individuals. Polygenic scores can also be used to shed light on the biological causes of diseases with high comorbidity by finding to what degree, if any, the polygenic score for one disease predicts the other. Stratified and precision medicine, which are hot topics in science right now, are another use of polygenic scores. Complex diseases such as bipolar disorder and schizophrenia can be treated by a number of different drugs and finding out which one works best for the individual is often a case of trial and error. Polygenic scores for drug response can be used to predict which drug is more likely to be effective for the individual and minimise the amount of trial and error required to find the right treatment. In the future it may even be possible to tailor-make drugs based on the individual's genome.




As well as in medicine, polygenic scores also have their uses in psychology. Disentangling the effects of nature and nurture has been one of the biggest problems in psychology since its birth as a scientific field of study. Polygenic scores go part of the way towards solving this problem. Studies have found that children with more books in their household do better in school, but is this because the books are making them better learners or is it because their parents have genes which make them enjoy learning and therefore buy a lot of books as well as passing on these genes to their children? Polygenic scores which predict educational attainment serve as a control to help us decipher how much, if any, of the correlation between books in the house and school results is actually caused by the books. In some cases polygenic scores are also relevant to debates on social policy. Does consumption of cannabis increase the risk of schizophrenia? Or are people who are genetically predisposed to schizophrenia also genetically more likely to consume cannabis?





So this concludes this video series on Genome-Wide Association Studies. I hope you all found it accessible and informative. If there is anything we discussed in these videos you would like to see discussed in more detail please leave a comment below and I may make another video about it in the future. If you enjoyed this video that means you have a high polygenic score for hitting the like button and subscribing to the behavioural genomics YouTube channel. 



Thursday, November 12, 2020

Genome-Wide Association Studies Explained Simply - Linear and Logistic Regression


Hello everyone and welcome back to this video series on Genome-Wide Association Studies. 

In the previous video we discussed P-values and the multiple testing problem that must be accounted for in modern genomics.

Now that we understand P-values we can look at how the P-values for associations between genetic variants and traits are calculated. GWAS most often use a type of statistical test called regression analysis. Regression analysis is used to estimate the relationship between a dependent variable and a predictor variable. In this case these are the trait and a genetic variant respectively.

For example, if we wanted to find if a genetic variant is associated with a continuous trait, like scores on a measure of extraversion, we could graph the extraversion scores of the participants along the y-axis and the genotype of the participants along the x-axis. As we have seen in previous videos people have two versions of each chromosome in their genome. Therefore participants can be categorised as having two, one or no copies of the minor allele - that is, the less common variant of the nucleotide at that position. A regression model that assumes additive allelic effects assumes that having two copies of an associated variant has a larger effect on the trait than having only one. The x-axis can therefore be ordered by the number of minor alleles in the genotype: zero, one or two.




A line of best fit - that is, a straight line which is as close to every point on the graph as possible - is drawn across the graph. In the linear model assuming additive allelic effects we have just described this would look like a line which is as close as possible to the mean y-axis value, which is in this case the participants' extraversion scores, for each genotype. The slope of the line determines the direction of the effect of the genetic variant on the extraversion scores of the participants with an incline denoting a positive correlation (having the genetic variant makes you more extraverted) and a decline denoting a negative correlation (having the genetic variant makes you more introverted). 



A steeper slope is indicative of a larger impact of the genetic variant on the phenotype. This is one way of measuring what is called the effect size. The effect size is found by calculating the gradient of the line of best fit and measures the amount of change in extraversion per copy of the minor allele. A larger positive gradient indicates a stronger positive effect and a larger negative gradient indicates a stronger negative effect. A flat, horizontal line would have a gradient of zero and indicate no effect of the genetic variant on extraversion.
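Here is a minimal sketch of how the gradient of the line of best fit could be calculated with ordinary least squares. The genotypes and extraversion scores are invented for illustration, and real GWAS software would also adjust for covariates that are left out here.

```python
def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least squares line of best fit."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: minor allele count (0, 1 or 2) and extraversion score per person.
minor_allele_counts = [0, 0, 0, 1, 1, 1, 2, 2, 2]
extraversion_scores = [10, 12, 11, 13, 14, 12, 15, 16, 14]

slope, intercept = least_squares_line(minor_allele_counts, extraversion_scores)
print(slope, intercept)
```

In this made-up sample the slope is the effect size: the estimated change in extraversion score for each extra copy of the minor allele.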




The proportion of the variance in extraversion which can be explained by the genotype of this SNP is called the co-efficient of determination. A higher co-efficient of determination usually looks like the points in the graph hug the line of best fit closer whereas with a lower co-efficient of determination the points are more dispersed. The co-efficient of determination is measured by a value called r-squared. It is found in three steps. First, calculate the sum of the squared distances between all the points on the graph and the line of best fit. Second, subtract this from the sum of the squared differences between all the values for extraversion and the average value for extraversion in the whole sample. Third, divide the result by that same sum of squared differences from the average. In other words, r squared measures the variation in extraversion explained by the genetic variant divided by the variation in extraversion in general.
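The calculation can be sketched in a few lines of Python using the standard definition of r squared for a straight-line fit. The data are invented for illustration.

```python
def r_squared(x, y):
    """Proportion of the variance in y explained by a straight-line fit on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared distances between each point and the line of best fit.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Squared differences between each value and the sample average.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return (ss_total - ss_residual) / ss_total


minor_allele_counts = [0, 0, 0, 1, 1, 1, 2, 2, 2]    # hypothetical genotypes
extraversion_scores = [10, 12, 11, 13, 14, 12, 15, 16, 14]  # hypothetical scores
r2 = r_squared(minor_allele_counts, extraversion_scores)
print(round(r2, 2))
```

A single SNP explaining this much of a behavioural trait would be extraordinary; real values for individual variants are typically tiny, which is why the toy data should not be read as realistic.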




The P-value can be calculated by finding a value called F. F is calculated by finding the variation in extraversion explained by genotype divided by the variation in extraversion not explained by genotype, with each of these scaled by its degrees of freedom. The P-value can be found from the F-value by comparing the F-value found in the regression to a probability distribution of F-values which could have been found assuming there was no relationship between the genotype and the trait.

The P-value is calculated by finding the proportion of F-values at least as large as the one found in the regression in the distribution of all possible F-values from this sample. Another way of describing it graphically is the P-value is equal to the area under the curve with an F-value higher than or equal to that found in the regression divided by the area under the entire curve.
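One way to see this idea in action without any statistical tables is a permutation test: shuffle the trait values so that any real link with genotype is broken, recompute F each time, and count how often the shuffled F is at least as large as the observed one. This is a stand-in for the theoretical F distribution described above, not the method GWAS software normally uses, and the data and function names are made up for illustration.

```python
import random


def f_statistic(x, y):
    """F for a one-predictor linear regression: explained vs unexplained variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_explained = (sxy / sxx) * sxy      # 1 degree of freedom
    ss_residual = ss_total - ss_explained  # n - 2 degrees of freedom
    if ss_residual <= 0:
        return float("inf")
    return ss_explained / (ss_residual / (n - 2))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Proportion of shuffled data sets with an F at least as large as the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(x, y)
    shuffled = list(y)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if f_statistic(x, shuffled) >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


minor_allele_counts = [0, 0, 0, 1, 1, 1, 2, 2, 2]    # hypothetical genotypes
extraversion_scores = [10, 12, 11, 13, 14, 12, 15, 16, 14]  # hypothetical scores
observed_f = f_statistic(minor_allele_counts, extraversion_scores)
p_value = permutation_p_value(minor_allele_counts, extraversion_scores)
print(round(observed_f, 1), p_value)
```

The permutation count limits how small the estimated P-value can be, so this approach is only practical for illustration; thresholds as small as those used in GWAS need the theoretical distribution.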




The type of regression used for continuous traits such as extraversion is called a linear regression. For dichotomous traits, such as whether or not a person has ever gone skydiving, values cannot be assigned to the trait in the same way as can be done for continuous traits like extraversion scores. Therefore a different method called logistic regression is used for finding associations with dichotomous traits.
 
Rather than modelling how a genetic variant increases or decreases the level of a trait, logistic regression models whether a genetic variant increases or decreases the likelihood that an individual would possess the trait. Instead of using a straight line of best fit, logistic regression uses the logistic function. The logistic function is an s-shaped curve which, in this case, models the probability of an individual having ever gone skydiving based on their genotype. It is the log odds of the trait, rather than the probability itself, that change in a straight line with the number of minor alleles. The method for generating the logistic curve is slightly more complicated so we won't be going into it in this video.
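For anyone curious what the s-shaped curve looks like in numbers, here is a minimal sketch. The intercept and per-allele coefficient are hypothetical values chosen for illustration, not estimates from any study.

```python
import math


def logistic(log_odds):
    """Convert log odds into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds))


intercept = -2.0   # hypothetical log odds of skydiving with no minor alleles
per_allele = 0.5   # hypothetical change in log odds per copy of the minor allele

probabilities = [logistic(intercept + per_allele * count) for count in (0, 1, 2)]
for count, probability in zip((0, 1, 2), probabilities):
    print(count, round(probability, 3))
```

A positive per-allele coefficient gives a low to high curve, and a negative one gives a high to low curve.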






In a similar way to linear regression, where the slope of the line indicates the direction of the effect, the direction of the curve indicates the direction of the effect of the minor allele on the likelihood of the trait. A high to low curve indicates a negative effect of the minor allele on the likelihood of the trait and a low to high curve indicates a positive impact of the minor allele on the likelihood of the trait. 

There are many different methods for calculating r-squared for logistic regression. The method which is most similar to that used in linear regression, often called McFadden's r-squared, calculates r-squared by finding the log likelihood of having gone skydiving after taking genotype into account, subtracting this from the overall log likelihood of having gone skydiving without taking genotype into account, and dividing the result by that overall log likelihood. In other words, r squared measures the likelihood of skydiving explained by the genetic variant over the likelihood of skydiving in general.

The P-value for our logistic regression is found by calculating a chi-squared value. The chi-squared value is calculated simply by multiplying by two the difference between the log likelihood of having gone skydiving after taking genotype into account and the overall log likelihood of having gone skydiving. The number of degrees of freedom is the number of extra terms added to the model, which in the additive model is one, for genotype. The chi-squared value found in the regression can then be compared to a probability distribution of chi-squared values which could have been found assuming no relationship between genotype and the likelihood of having gone skydiving. We can find our P-value by dividing the area under the curve with a chi-squared value higher than or equal to the one found in the regression by the area under the entire curve.
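The sketch below puts the logistic regression steps together on invented data: it fits an additive model, compares its log likelihood with that of a model that ignores genotype, and turns the difference into an r-squared (McFadden's version) and a chi-squared P-value with one degree of freedom. The fitting routine is a plain Newton-Raphson loop written for this illustration and is not taken from any particular GWAS package.

```python
import math


def probability(b0, b1, x):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))


def fit_logistic(x, y, iterations=25):
    """Fit intercept b0 and per-allele log odds b1 by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [probability(b0, b1, xi) for xi in x]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(x, y, b0, b1):
    total = 0.0
    for xi, yi in zip(x, y):
        p = probability(b0, b1, xi)
        total += math.log(p) if yi else math.log(1 - p)
    return total


# Hypothetical sample: 10 people per genotype; 2, 5 and 8 of them have gone skydiving.
genotypes = [0] * 10 + [1] * 10 + [2] * 10
skydived = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2

b0, b1 = fit_logistic(genotypes, skydived)
ll_model = log_likelihood(genotypes, skydived, b0, b1)

# Model without genotype: everyone gets the overall proportion of skydivers.
overall = sum(skydived) / len(skydived)
ll_null = log_likelihood(genotypes, skydived, math.log(overall / (1 - overall)), 0.0)

mcfadden_r2 = (ll_null - ll_model) / ll_null
chi_squared = 2 * (ll_model - ll_null)
# Upper tail area of a chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_squared / 2))
print(round(mcfadden_r2, 3), round(chi_squared, 2), round(p_value, 4))
```

The `math.erfc` shortcut only works for one degree of freedom; a model with more added terms would need the general chi-squared distribution.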




In GWAS we use high performance computers to run software which carries out these tests to calculate the P-value for every SNP we analyse in the GWAS. Once we have calculated the P-values we can visualise the results in a graph which looks like this. This is called a Manhattan plot. It is called this because it usually resembles the skyline of Manhattan. Across the x-axis we have the position of each SNP within each of the chromosomes, which are separated by colour. The P-values for each SNP are graphed along the y-axis after being converted to their negative base-10 logarithm, -log10(P), so that the often highly disparate values are more easily visible on the graph and so that the lower P-values appear higher up in the graph.
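The transformation used for the y-axis is simple to reproduce. In this sketch the SNP names and P-values are invented, and the threshold of 5 × 10⁻⁸ is the genome-wide significance level discussed in the previous video.

```python
import math

# Hypothetical P-values for three SNPs (illustrative only).
p_values = {"rs_a": 0.2, "rs_b": 3e-5, "rs_c": 4e-9}

# Height of each point on the Manhattan plot: smaller P-values sit higher up.
plot_heights = {snp: -math.log10(p) for snp, p in p_values.items()}

# Height of the genome-wide significance line.
significance_line = -math.log10(5e-8)

significant_snps = [snp for snp, height in plot_heights.items() if height > significance_line]
print(round(significance_line, 2), significant_snps)
```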




We can then draw a line on our graph at the P-value for genome wide significance. This way we can easily spot which locations in the genome contain genetic variants which are associated with the trait by looking for points in the graph which lie above the genome-wide significance line. You may have noticed from this image that SNPs which appear above this line usually cluster together and form "towers" around a single position in the genome. This is a result of linkage disequilibrium which has been discussed in a previous video.   




So this is how researchers find statistical associations between genetic variants and continuous and dichotomous traits. Once these results have been gathered researchers can then zoom in on the positions in the genome where these peaks are found and look more closely at the genes within or near this location. Another thing we can do with these results is develop algorithms called polygenic risk scores which can predict an individual's traits based on their genotype. If you want to find out how to predict whether or not you're gonna go skydiving one day, subscribe to behavioural genomics and look out for the next video.





Sunday, November 1, 2020

Genome-Wide Association Studies Explained Simply - Statistical Tests: P-Values and Multiple Testing



Welcome to part 3 of this video series on Genome-Wide Association Studies. In previous videos we have discussed the relationship between genetic variants and traits and how researchers use genotyping to identify the location of genetic variants associated with complex traits by taking advantage of linkage disequilibrium.

This video will discuss the role of P-values in genome-wide association studies as well as how researchers use a genome-wide p-value threshold to account for the multiple testing problem. 

First, we need to understand how researchers define a statistically significant association. GWAS use P-values to measure the statistical significance of an association between a genetic variant and a trait. P-values measure the likelihood that an association at least as strong as the observed association would be found if there were in fact no real connection between the trait and the genetic variant.

P-values are scored on a scale where the closer the score is to one the weaker the evidence is for a real association and the smaller the score the stronger the evidence for a real association. For example, a P-value of 0.01 means that finding a false positive - that is, finding an association where there is no real relationship between the trait and the genetic variant - where the association is of this strength would be expected once for every 100 trials in samples of the same size. 


Researchers use an arbitrarily agreed upon P-value threshold to decide how statistically significant the association needs to be to be confident that the association is real. Traditionally this has been agreed upon as P<=0.05 - that is, a one in twenty chance of finding an association this strong when there is no real relationship. For GWAS however this threshold is far too lenient.



The reason that p<=0.05 has been acceptable historically in genetics but is not acceptable in modern genomics is that GWAS carry out far more than 20 tests. If every known common variant is tested for association there would be more than 3 million tests carried out per GWAS. A P-value threshold of p<=0.05 would therefore result in more than 150 thousand spurious associations being deemed statistically significant.
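The arithmetic behind that figure is just the number of tests multiplied by the threshold, as this short sketch shows. It assumes for simplicity that none of the tested variants has a real effect.

```python
def expected_false_positives(number_of_tests, p_value_threshold):
    """Expected number of chance findings if no tested variant has a real effect."""
    return number_of_tests * p_value_threshold


spurious = expected_false_positives(3_000_000, 0.05)
print(round(spurious))
```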




Researchers usually correct for the multiple testing problem by instead using a genome-wide significance threshold which is normally accepted as p < 5 × 10⁻⁸. This number is arrived at by dividing the traditionally accepted significance threshold of 0.05 by approximately the number of independent common SNPs across the human genome, which is roughly one million.
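This kind of correction, dividing the usual threshold by the number of tests, can be written as a one-line function. The figure of one million independent tests used below is the round number implied by dividing 0.05 by 5 × 10⁻⁸, used here as an approximation.

```python
def corrected_threshold(traditional_threshold, number_of_independent_tests):
    """Divide the usual significance threshold by the number of independent tests."""
    return traditional_threshold / number_of_independent_tests


genome_wide_threshold = corrected_threshold(0.05, 1_000_000)
print(genome_wide_threshold)
```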

Correcting for multiple testing in this way solves the problem of too many false positives but there is always a trade-off to be made between rejecting false positives and the power to detect real associations. If the size of the effect that a genetic variant has on the trait is small, and for complex traits like behaviour this is almost always the case, it is necessary for sample sizes - that is, the number of participants who have volunteered their genotype data - to be very large in order to be able to detect the effect at genome-wide significance. For this reason, genomic researchers most often share databases of large amounts of genotyping data. The UK Biobank for example is a cohort of over 500,000 participants who have volunteered their genomic data to be used in GWAS and other types of studies.

If you want to find out how researchers use these data to find associations between genetic variants and traits subscribe to behavioural genomics and look out for the next video.