Welcome to part 4 of this video series on genome-wide association
studies. In previous videos we discussed the relationship between genetic
variants and traits, linkage disequilibrium, and the statistical methodology
used to detect associations between genetic variants and complex traits. In
this video we will discuss one of the uses of GWAS: the development of
polygenic scores.
There are certain traits which are determined by the presence or absence of genetic variants in a single gene. These are called monogenic traits and include traits such as lactose tolerance and, peculiarly enough, which thumb goes on top when you interlock your fingers. Most complex traits, however, are the result of the combined influence of a multitude of genetic variants affecting thousands of genes in your genome, each responsible for only a fraction of the overall genetic contribution to the trait. We refer to these as polygenic traits.
As we have discussed in previous videos, the genetic variants associated
with complex traits can be identified using genome-wide association studies. We
can use this information to make predictions about some proportion of the
variance in the traits of a population in the form of polygenic scores.
Polygenic scores are generated using algorithms which sum together all the
genetic variants possessed by an individual which are known to influence the
likelihood or level of a trait, each weighted by its effect size. The value
produced by the polygenic score can be used to make predictions about the
relative likelihood of a dichotomous trait, such as having a diagnosis of major depression or
signing up for skydiving lessons, or the level of a continuous trait, such as
height or extraversion scores.
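To make the "weighted sum" idea concrete, here is a minimal sketch in Python. It assumes each variant is coded as the number of copies of the effect allele a person carries (0, 1 or 2), which is a common convention but is not stated in this video. The variant names, effect sizes and genotype are made up purely for illustration.

```python
def polygenic_score(allele_counts, effect_sizes):
    """Sum each variant's effect-allele count multiplied by its effect size.

    Variants with no known effect size are ignored.
    """
    return sum(
        effect_sizes[variant] * count
        for variant, count in allele_counts.items()
        if variant in effect_sizes
    )


# Hypothetical per-allele effect sizes, as might be taken from a GWAS
effect_sizes = {"variant_A": 0.25, "variant_B": -0.5, "variant_C": 0.125}

# Hypothetical genotype: copies of the effect allele carried at each variant
person = {"variant_A": 2, "variant_B": 0, "variant_C": 1}

score = polygenic_score(person, effect_sizes)
print(score)  # 0.25*2 + (-0.5)*0 + 0.125*1 = 0.625
```

Real scoring tools also have to decide which variants to include and how to adjust the weights for linkage disequilibrium, which this sketch leaves out.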
For those of you who have watched the previous video, the phrase
"proportion of the variance" might ring a bell. Proportion of the
variance is the same thing as the coefficient of determination, which is
measured by r squared. R squared can be thought of as how closely the values or
likelihood of the trait for all individuals in the sample hug the line of best fit.
For example, take a graph which models polygenic risk score values as
deciles across the X-axis and values for a continuous trait along the Y-axis.
We can determine what percentage of the variance is accounted for by our
polygenic score by plotting the phenotype values for a validation sample of
individuals against their polygenic score, drawing a line of best fit on the
resulting graph and calculating an r squared value. A quick side note: the
validation sample must be composed of individuals whose data were not used in
the GWAS on which the polygenic score algorithm was based. This is necessary to
prevent bias, because testing the score on the same individuals used to derive
it would overestimate how well it predicts.
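As a rough sketch of the calculation just described, the snippet below computes r squared for a straight line of best fit. With a single predictor, this equals the squared correlation between score and trait. The function name and the tiny validation sample are assumptions made for illustration, not real data.

```python
def r_squared(scores, trait_values):
    """Proportion of trait variance explained by a straight-line fit on the score."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(trait_values) / n
    # Sums of squares and cross-products around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, trait_values))
    sxx = sum((x - mean_x) ** 2 for x in scores)
    syy = sum((y - mean_y) ** 2 for y in trait_values)
    return sxy ** 2 / (sxx * syy)


# Made-up validation sample: polygenic scores and observed trait values
validation_scores = [1, 2, 3, 4]
validation_trait = [1, 3, 2, 4]

explained = r_squared(validation_scores, validation_trait)
print(round(explained, 2))  # 0.64, i.e. 64% of the variance in this toy sample
```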
Most biologically influenced complex traits, including behavioural
traits, have some degree of what is called "heritability".
Heritability is the proportion of the variability in the level or likelihood of
a trait which is influenced by the genetic variants passed on to the
individual from their parents as opposed to that which is influenced by
environmental factors. Heritability estimates are calculated using twin and
adoption studies. These types of studies will be discussed in another video.
Let us take IQ score as an example of a continuous trait. IQ has an
estimated heritability of about 50%. This means that 50% of the variability in
IQ scores is determined by the genetic variants people are born with.
Heritability acts like a kind of upper limit to what can be predicted by
polygenic scores. So researchers can strive to be able to predict up to 50% of
the variance in IQ scores by looking at the genome.
In reality, of course, we aren't there yet due to limitations with
statistical power and methodological challenges. This is known as the missing
heritability problem. At the moment we can only account for between 20% and 50%
of the heritable portion of the variance in IQ with the results of GWAS. Given a
heritability of about 50%, that is roughly 10% to 25% of the total variance. Genomic research
into polygenic traits is, however, still a very young field, and the percentage of
the variance which can be accounted for is growing as sample sizes become
larger and methodologies become more precise.
One method of representing the explanatory power of a polygenic score graphically is by drawing confidence intervals above and below the mean level of a trait at each decile. Confidence intervals are values between which the mean value of the trait is likely to be found in some proportion of validation samples. Normally this is set at 95%. This is not to be confused with the proportion of individuals at each decile who fall between these values. Rather, it means that if you were to test this model on 100 different validation samples, the mean Y-axis value at each decile would be expected to fall within the confidence interval in about 95 of them.
Confidence intervals for a polygenic score with high explanatory power might look like this, whereas for a polygenic score with low explanatory power the confidence intervals may look like this. Note that the predicted value of the trait for each decile is the same for both polygenic scores; however, the one with the smaller confidence intervals is more useful, as we can have more confidence in the predicted mean value of the trait.
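For anyone who wants to see where such intervals come from, here is a minimal sketch that computes the mean and an approximate 95% confidence interval for the people in one decile. It assumes a normal approximation (mean plus or minus 1.96 standard errors), which is only one of several ways to build the interval. The trait values are made up.

```python
import math
import statistics


def mean_with_ci(values, z=1.96):
    """Return (lower, mean, upper) using a normal approximation to the 95% CI."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean - z * standard_error, mean, mean + z * standard_error


# Made-up trait values for people in the same decile of two different scores
tight_decile = [99, 100, 101, 100, 99, 101, 100, 100]
wide_decile = [90, 100, 110, 100, 90, 110, 100, 100]

tight = mean_with_ci(tight_decile)
wide = mean_with_ci(wide_decile)

# Both deciles have the same mean, but the second interval is much wider
print(tight)
print(wide)
```

In this toy example the two groups share the same mean, so the only difference is how much confidence we can place in it.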
It may be difficult to grasp how much of a difference that actually is for a trait that is not physically visible, like IQ, so let us visualise it by replacing IQ with a more tangible trait such as height. Half a standard deviation below the mean for height in adult males in the USA is equivalent to about 69 inches, that is 5'9" or 175 cm, and half a standard deviation above the mean is equivalent to about 71 inches, that is 5'11" or 180 cm. To put that in perspective, according to this chart of celebrity heights, that is the difference between Tom Hardy and Michael Fassbender.
The same method can be used for the prediction of dichotomous traits; however, instead of predicting the level of the trait, polygenic scores are used to predict the likelihood of the trait occurring. For example, here is a graph showing the deciles of a polygenic score for diagnosis of major depression in a sample of the Danish population. Rather than a quantitative value, what is referred to here as a hazard ratio is plotted along the Y-axis. The hazard ratio assigns a value of 1 to the decile with the smallest proportion of individuals having a diagnosis of major depression, and assigns each other decile a value corresponding to its ratio of cases relative to the first decile. From this graph we can see that individuals in the tenth decile are just over two and a half times as likely to have a diagnosis of major depression as those in the first decile.
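The ratio described above can be sketched as follows. Note that this is the simple version explained in this video, dividing each group's proportion of cases by the lowest group's proportion. Hazard ratios in published studies are usually estimated with survival models that also account for follow-up time. The case counts and group sizes here are made up, and only three groups are shown to keep the example short.

```python
def ratios_to_lowest(cases, group_sizes):
    """Express each group's proportion of cases relative to the lowest-risk group."""
    proportions = [c / n for c, n in zip(cases, group_sizes)]
    reference = min(proportions)
    return [p / reference for p in proportions]


# Hypothetical number of diagnosed individuals in three equally sized groups
cases = [20, 30, 50]
group_sizes = [1000, 1000, 1000]

ratios = ratios_to_lowest(cases, group_sizes)
print(ratios)  # approximately [1.0, 1.5, 2.5]
```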
Information about the actual proportion of diagnoses of major depression at each decile is not available from this graph. However, we know that the overall proportion of individuals with a diagnosis of major depression in Denmark is 3%. Therefore, assuming that the 5th and 6th deciles are approximately equivalent to the population risk, we can extrapolate that individuals in the first decile have roughly a 1.7% chance of developing major depression, whereas those in the tenth decile have a 4.3% chance. To put this into perspective, the proportion of individuals in the first decile who will develop depression is like the proportion of mines in this game of minesweeper, and for the tenth decile it is like this game of minesweeper.
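The extrapolation above can be sketched in a few lines. The idea is to treat the average ratio of the 5th and 6th deciles as corresponding to the 3% population risk, work back to the risk in the first decile, and then scale every other decile from there. The ten ratios below are hypothetical values chosen only to be roughly in line with the figures quoted in this video. They were not read from the actual graph.

```python
def absolute_risks(ratios, population_risk, anchor_deciles=(5, 6)):
    """Convert per-decile ratios into absolute risks.

    Assumes the anchor deciles (numbered from 1) sit at the population risk.
    """
    anchor_ratio = sum(ratios[d - 1] for d in anchor_deciles) / len(anchor_deciles)
    baseline_risk = population_risk / anchor_ratio  # risk in the reference decile
    return [baseline_risk * ratio for ratio in ratios]


# Hypothetical ratios for deciles 1 to 10 (decile 1 is the reference)
ratios = [1.0, 1.2, 1.4, 1.6, 1.7, 1.8, 1.9, 2.0, 2.2, 2.5]

risks = absolute_risks(ratios, population_risk=0.03)
print(f"first decile: {risks[0]:.1%}, tenth decile: {risks[9]:.1%}")
```

With these illustrative inputs the first decile comes out at about 1.7% and the tenth at about 4.3%.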
From the examples we have looked at, you might be wondering whether these polygenic scores are actually useful at all. Those with a polygenic score in the top decile for IQ are by no means guaranteed to be the next Einstein, nor are those with scores in the top decile for height much more likely than average to have a career in basketball. Those with polygenic scores in the top decile for risk of major depression do not have much more to worry about than the average person either. There are, however, many uses of polygenic scores beyond merely making predictions about the traits of individuals.
For example, there are certain diseases which can be prevented or mitigated with early detection and treatment; however, health organisations rarely have the funds to screen every individual at risk. Polygenic scores can be used to inform the screening for these diseases and therefore increase the number of cases which are caught early from screening the same number of individuals. Polygenic scores can also be used to shed light on the biological causes of diseases with high comorbidity by finding to what degree, if any, the polygenic score for one disease predicts the other. Stratified and precision medicine, which are hot topics in science right now, are another use of polygenic scores. Complex diseases such as bipolar disorder and schizophrenia can be treated by a number of different drugs, and finding out which one works best for the individual is often a case of trial and error. Polygenic scores for drug response can be used to predict which drug is more likely to be effective for the individual and minimise the amount of trial and error required to find the right treatment. In the future it may even be possible to tailor-make drugs based on the individual's genome.
As well as in medicine, polygenic scores also have their uses in psychology. Disentangling the effects of nature and nurture has been one of the biggest problems in psychology since its inception as a scientific field of study. Polygenic scores go part of the way towards solving this problem. Studies have found that children with more books in their household do better in school, but is this because the books are making them better learners, or is it because their parents have genes which make them enjoy learning and therefore buy a lot of books, as well as passing on these genes to their children? Polygenic scores which predict educational attainment serve as a control to help us decipher how much, if any, of the correlation between books in the house and school results is actually caused by the books. In some cases polygenic scores are also relevant to debates on social policy. Does consumption of cannabis increase the risk of schizophrenia? Or are people who are genetically predisposed to schizophrenia also genetically more likely to consume cannabis?
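To illustrate what using a polygenic score "as a control" means in the books example, here is a small simulation sketch. Everything in it is made up: the data are randomly generated, and the effect sizes (0.5, 0.6 and 0.1) are arbitrary assumptions chosen so that the polygenic score influences both the number of books and school grades. The sketch compares the books-grades slope before and after removing the part of each variable that the polygenic score predicts.

```python
import random


def slope(x, y):
    """Slope of the least-squares line of best fit of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after subtracting the line of best fit on x."""
    b = slope(x, y)
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return [yi - (mean_y + b * (xi - mean_x)) for xi, yi in zip(x, y)]


random.seed(0)  # fixed seed so the simulation is repeatable
n = 5000

# Simulated (not real) data: the score affects both books and grades,
# while books have only a small direct effect of 0.1 on grades.
pgs = [random.gauss(0, 1) for _ in range(n)]
books = [0.5 * g + random.gauss(0, 1) for g in pgs]
grades = [0.6 * g + 0.1 * b + random.gauss(0, 1) for g, b in zip(pgs, books)]

unadjusted = slope(books, grades)
adjusted = slope(residuals(pgs, books), residuals(pgs, grades))

print(f"unadjusted slope: {unadjusted:.2f}")
print(f"slope after controlling for the score: {adjusted:.2f}")
```

In this simulation the unadjusted slope is noticeably larger than the direct effect built into the data, and controlling for the score brings it back down towards 0.1. Real studies are messier, since a polygenic score captures only part of the genetic influence.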
So this concludes this video series on Genome-Wide Association Studies. I hope you all found it accessible and informative. If there is anything we discussed in these videos you would like to see discussed in more detail please leave a comment below and I may make another video about it in the future. If you enjoyed this video that means you have a high polygenic score for hitting the like button and subscribing to the behavioural genomics YouTube channel.










