Thursday, November 12, 2020

Genome-Wide Association Studies Explained Simply - Linear and Logistic Regression


Hello everyone and welcome back to this video series on Genome-Wide Association Studies. 

In the previous video we discussed P-values and the multiple testing problem that must be accounted for in modern genomics.

Now that we understand P-values we can look at how the P-values for associations between genetic variants and traits are calculated. GWAS most often use a type of statistical test called regression analysis. Regression analysis is used to estimate the relationship between a dependent variable and a predictor variable. In this case these are the trait and a genetic variant respectively. 

For example, if we wanted to find out whether a genetic variant is associated with a continuous trait, like scores on a measure of extraversion, we could graph the extraversion scores of the participants along the y-axis and the genotype of the participants along the x-axis. As we have seen in previous videos, people have two versions of each chromosome in their genome. Therefore participants can be categorised as having two, one or no copies of the minor allele - that is, the less common variant of the nucleotide at that position. A regression model that assumes additive allelic effects assumes that having two copies of an associated variant has a larger effect on the trait than having only one. Therefore the x-axis can be plotted as the number of minor alleles in the genotype: zero, one or two.
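To make the x-axis concrete, here is a minimal sketch of turning genotypes into a count of minor alleles. The genotypes, the choice of 'G' as the minor allele and the function name `minor_allele_count` are all made up for illustration.

```python
def minor_allele_count(genotype: str, minor_allele: str) -> int:
    """Count copies of the minor allele in a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected a genotype with exactly two alleles")
    return genotype.count(minor_allele)


# Hypothetical genotypes at one SNP, assuming 'G' is the minor allele.
genotypes = ["AA", "AG", "GA", "GG"]
dosages = [minor_allele_count(g, "G") for g in genotypes]
print(dosages)
```

Each participant ends up with a value of 0, 1 or 2, which is the number the additive model uses as its predictor.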




A line of best fit - that is, a straight line which is as close to every point on the graph as possible - is drawn across the graph. In the linear model assuming additive allelic effects we have just described, this would look like a line which is as close as possible to the mean y-axis value, which is in this case the participants' extraversion scores, for each genotype. The slope of the line indicates the direction of the effect of the genetic variant on the extraversion scores of the participants, with an incline denoting a positive correlation (having the genetic variant is associated with being more extraverted) and a decline denoting a negative correlation (having the genetic variant is associated with being more introverted). 



A steeper slope is indicative of a larger impact of the genetic variant on the phenotype. This is one way of measuring what is called the effect size. The effect size is found by calculating the gradient of the line of best fit and measures the amount of change in extraversion per copy of the minor allele. A larger positive gradient indicates a stronger positive effect and a larger negative gradient indicates a stronger negative effect. A flat, horizontal line would have a gradient of zero and indicate no effect of the genetic variant on extraversion. 
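As a rough sketch of how the gradient could be calculated, the example below fits a least-squares line to nine made-up participants. The data and the function name `fit_line` are assumptions for illustration only; real GWAS software also adjusts for covariates, which is left out here.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares line of best fit: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: minor allele count and extraversion score per participant.
dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
score = [10, 12, 11, 13, 14, 12, 15, 16, 14]

slope, intercept = fit_line(dosage, score)
print(slope, intercept)  # 2.0 11.0
```

In this toy data the mean scores for zero, one and two copies are 11, 13 and 15, so the effect size is 2 points of extraversion per copy of the minor allele.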




The proportion of the variance in extraversion which can be explained by the genotype of this SNP is called the coefficient of determination. With a higher coefficient of determination the points in the graph usually hug the line of best fit more closely, whereas with a lower coefficient of determination the points are more dispersed. The coefficient of determination is measured by a value called r-squared. It is found by calculating the sum of the squares of the distances between all the points on the graph and the line of best fit, subtracting this from the sum of the squares of the differences between all the values for extraversion and the average value for extraversion in the whole sample, and dividing this number by that same sum of squared differences from the average. In other words, r-squared measures the variation in extraversion explained by the genetic variant divided by the variation in extraversion in general.
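The r-squared calculation can be sketched in a few lines. This reuses the same made-up participants as a toy example; the names `r_squared`, `dosage` and `score` are illustrative and not taken from any GWAS package.

```python
from statistics import mean


def r_squared(x, y):
    """Proportion of the variation in y explained by a straight-line fit on x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    # Squared distances between each point and the line of best fit.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Squared distances between each point and the overall average.
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return (ss_total - ss_residual) / ss_total


# Hypothetical data: minor allele count and extraversion score per participant.
dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
score = [10, 12, 11, 13, 14, 12, 15, 16, 14]

r2 = r_squared(dosage, score)
print(round(r2, 3))  # 0.8
```

Here the total variation is 30 and the variation left over around the line is 6, so 24 out of 30, or 80%, is explained by genotype. Real SNPs typically explain far less than this.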




The P-value can be calculated by finding a value called F. F is calculated by finding the variation in extraversion explained by genotype divided by the variation in extraversion not explained by genotype, with each first divided by its degrees of freedom. The P-value can be found from the F-value by comparing the F-value found in the regression to a probability distribution of F-values which could have been found assuming there was no relationship between the genotype and the trait. 

The P-value is calculated by finding the proportion of F-values at least as large as the one found in the regression in the distribution of F-values that would be expected for this sample if there were no relationship. Another way of describing it graphically is that the P-value is equal to the area under the curve with an F-value higher than or equal to that found in the regression divided by the area under the entire curve.
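One way to get a feel for this is to build the "no relationship" distribution by brute force. The sketch below shuffles the extraversion scores many times, which breaks any link with genotype, and counts how often the shuffled F-value is at least as large as the real one. This is a permutation approximation chosen for illustration; GWAS software normally uses the theoretical F distribution instead. The data, the seed and the number of shuffles are all made up.

```python
import random
from statistics import mean


def f_statistic(x, y):
    """F = (explained variation / 1) / (unexplained variation / (n - 2))."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    if ss_residual == 0:
        return float("inf")  # perfect fit, nothing left unexplained
    return (ss_total - ss_residual) / (ss_residual / (len(y) - 2))


# Hypothetical data: minor allele count and extraversion score per participant.
dosage = [0, 0, 0, 1, 1, 1, 2, 2, 2]
score = [10, 12, 11, 13, 14, 12, 15, 16, 14]
f_observed = f_statistic(dosage, score)

rng = random.Random(2020)  # fixed seed so the sketch is repeatable
n_permutations = 2000
shuffled = score[:]
at_least_as_large = 0
for _ in range(n_permutations):
    rng.shuffle(shuffled)  # scores are now unrelated to genotype
    # Small tolerance guards against floating-point rounding in the comparison.
    if f_statistic(dosage, shuffled) >= f_observed - 1e-9:
        at_least_as_large += 1

p_value = at_least_as_large / n_permutations
print(f_observed, p_value)
```

The observed F-value here is 28, and only a small fraction of the shuffled datasets reach it, which is what a small P-value means.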




The type of regression used for continuous traits such as extraversion is called a linear regression. For dichotomous traits, such as whether or not a person has ever gone skydiving, values cannot be assigned to the trait in the same way as can be done for continuous traits like extraversion scores. Therefore a different method called logistic regression is used for finding associations with dichotomous traits.
 
Rather than modelling how a genetic variant increases or decreases the level of a trait, logistic regression models whether a genetic variant increases or decreases the likelihood that an individual would possess the trait. Instead of using a straight line of best fit, logistic regression uses the logistic function. The logistic function is an S-shaped curve which, in this case, models the probability of an individual having ever gone skydiving based on their genotype. The curve comes from a model that is a straight line on the log odds scale, which becomes S-shaped when it is converted back to a probability. The method for generating the logistic curve is slightly more complicated so we won't be going into it in this video.






In a similar way to linear regression, where the slope of the line indicates the direction of the effect, the direction of the curve indicates the direction of the effect of the minor allele on the likelihood of the trait. A high to low curve indicates a negative effect of the minor allele on the likelihood of the trait and a low to high curve indicates a positive impact of the minor allele on the likelihood of the trait. 
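To show what the curve does with genotype, here is a small sketch of the logistic function turning log odds into a probability. The intercept and per-allele effects are hypothetical numbers picked for illustration, not estimates from any real study.

```python
import math


def logistic(log_odds: float) -> float:
    """Convert log odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def skydiving_probabilities(intercept: float, effect_per_allele: float):
    """Probability of the trait for 0, 1 and 2 copies of the minor allele."""
    return [logistic(intercept + effect_per_allele * copies) for copies in (0, 1, 2)]


# Hypothetical values on the log odds scale.
rising = skydiving_probabilities(intercept=-2.0, effect_per_allele=0.7)
falling = skydiving_probabilities(intercept=-2.0, effect_per_allele=-0.7)

print([round(p, 3) for p in rising])   # low to high: positive effect
print([round(p, 3) for p in falling])  # high to low: negative effect
```

A positive effect per allele gives probabilities that climb with each copy of the minor allele, and a negative effect gives probabilities that drop.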

There are many different methods of calculating r-squared for logistic regression. The method which is most similar to that used in linear regression, often called McFadden's r-squared, takes the log likelihood of the skydiving data when genotype is ignored, subtracts the log likelihood of the data after taking genotype into account, and divides the result by the log likelihood when genotype is ignored. In other words, r-squared measures how much better the model fits the skydiving data once the genetic variant is included, as a proportion of the fit without it.

The P-value for our logistic regression is found by calculating a chi-squared value. The chi-squared value is calculated simply by multiplying by two the difference between the log likelihood of the skydiving data after taking genotype into account and the log likelihood of the data when genotype is ignored. The number two is part of the definition of this test, which is called a likelihood ratio test, and is not the number of degrees of freedom. For a single SNP under the additive model there is one degree of freedom. The chi-squared value found in the regression can then be compared to a probability distribution of chi-squared values which could have been found assuming no relationship between genotype and the likelihood of having gone skydiving. We can find our P-value by dividing the area under the curve with a chi-squared value higher than or equal to the one found in the regression by the area under the entire curve.
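Putting the logistic pieces together, the sketch below fits an additive logistic model to made-up skydiving counts, then works out the log likelihoods, a McFadden-style r-squared, the chi-squared value and its P-value with one degree of freedom. The fitting routine is a bare-bones Newton-Raphson loop written for this example and the counts are invented; real analyses would use established software.

```python
import math


def probability(b0, b1, x):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def fit_logistic(x, y, iterations=25):
    """Fit intercept b0 and per-allele effect b1 by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = probability(b0, b1, xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient of the log likelihood
            g1 += (yi - p) * xi
            h00 += w                # information (negative Hessian) matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(b0, b1, x, y):
    return sum(math.log(probability(b0, b1, xi)) if yi
               else math.log(1.0 - probability(b0, b1, xi))
               for xi, yi in zip(x, y))


# Hypothetical counts: copies of minor allele -> (people, number who have skydived).
counts = {0: (100, 10), 1: (80, 16), 2: (20, 8)}
x, y = [], []
for copies, (n_people, n_skydivers) in counts.items():
    x += [copies] * n_people
    y += [1] * n_skydivers + [0] * (n_people - n_skydivers)

b0, b1 = fit_logistic(x, y)
ll_fit = log_likelihood(b0, b1, x, y)

overall = sum(y) / len(y)  # overall proportion who have skydived
ll_null = log_likelihood(math.log(overall / (1.0 - overall)), 0.0, x, y)

mcfadden_r2 = (ll_null - ll_fit) / ll_null
chi_squared = max(2.0 * (ll_fit - ll_null), 0.0)
# Upper-tail area of a chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_squared / 2.0))

print(round(b1, 3), round(mcfadden_r2, 3), round(chi_squared, 2), p_value)
```

Note that the r-squared here comes out small even though the P-value is small, which is common for logistic models: a variant can be clearly associated with a trait while explaining only a little of it.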




In GWAS we use high performance computers to run software which carries out these tests to calculate the P-value for every SNP we analyse in the GWAS. Once we have calculated the P-values we can visualise the results in a graph which looks like this. This is called a Manhattan plot. It is called this because it usually resembles the skyline of Manhattan. Across the x-axis we have the position of each SNP within each of the chromosomes, which are separated by colour. The P-value for each SNP is graphed on the y-axis after being converted to -log10 of the P-value, so that the often highly disparate values are more easily visible on the graph and so that the lower P-values appear higher up in the graph.




We can then draw a line on our graph at the P-value for genome wide significance. This way we can easily spot which locations in the genome contain genetic variants which are associated with the trait by looking for points in the graph which lie above the genome-wide significance line. You may have noticed from this image that SNPs which appear above this line usually cluster together and form "towers" around a single position in the genome. This is a result of linkage disequilibrium which has been discussed in a previous video.   
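Here is a small sketch of the numbers behind a Manhattan plot: converting P-values to -log10 and checking which SNPs sit above the significance line. The SNP names, positions and P-values are invented, and the threshold of 5e-8 is the commonly used convention for genome-wide significance, assumed here rather than taken from this video.

```python
import math

# Assumed conventional genome-wide significance threshold.
GENOME_WIDE_THRESHOLD = 5e-8

# Hypothetical results: (SNP name, chromosome, position, P-value).
results = [
    ("snp_a", 1, 1_200_000, 3e-2),
    ("snp_b", 2, 5_400_000, 4e-9),
    ("snp_c", 2, 5_410_000, 2e-10),
    ("snp_d", 7, 900_000, 1e-5),
]

line_height = -math.log10(GENOME_WIDE_THRESHOLD)  # where the line is drawn

hits = []
for name, chromosome, position, p_value in results:
    height = -math.log10(p_value)  # y-axis value: smaller P-values plot higher
    if height > line_height:
        hits.append(name)
    print(name, chromosome, position, round(height, 2))

print(hits)  # ['snp_b', 'snp_c']
```

The two SNPs above the line sit close together on chromosome 2, which is the kind of "tower" described above.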




So this is how researchers find statistical associations between genetic variants and continuous and dichotomous traits. Once these results have been gathered, researchers can then zoom in on the positions in the genome where these peaks are found and look more closely at the genes within or near this location. Another thing we can do with these results is develop algorithms called polygenic risk scores which can predict an individual's traits based on their genotype. If you want to find out how to predict whether or not you're gonna go skydiving one day, subscribe to behavioural genomics and look out for the next video. 




