Monday, March 29, 2021

The Greatest Discovery in Psychology of all Time? - Twin and Adoption Studies

The cause of individual differences in behaviour and personality is one of the most hotly debated subjects in the history of psychology. Over time theories have shifted between those which emphasise the role of the environment in moulding the character of a person and those which propose that differences between people are largely innate. At one extreme lies environmentalism, which posits that people are like "blank slates" on which their character is written by the environment, and at the other genetic determinism, the idea that individual differences in psychological characteristics are determined completely by genetics.

Francis Galton, grandson of Erasmus Darwin and half-cousin of Charles Darwin, was the first to attempt to study the inheritance of behaviour empirically. Inspired by the magnitude of his cousin's scientific breakthrough, he became obsessed with studying, in his own not so humble words, the "heredity of greatness". Among his observations, he noticed that some sets of twins were identical whereas others were only as similar as regular full siblings. He took several measures of physical and psychological characteristics of identical and non-identical twins, and from his results he estimated that identical twins were around twice as similar as non-identical twins. Since the environment was assumed to be equally similar for identical and non-identical twins, Galton inferred that around half of the variance in behaviour can be accounted for by heredity. Although his experiments had holes in their methodology, his estimates turned out to be impressively accurate for someone who had no knowledge of DNA.

Unfortunately, for all of Galton's genius, inherited or otherwise, he had one very bad idea which tainted his legacy: eugenics. Galton believed that civilisation could be bettered by "improving the human stock" by giving "...the more suitable races or strains of blood a better chance of prevailing speedily over the less suitable than they otherwise would have had". This idea spread and gained popularity among political parties across Europe and North America. Eugenics programmes provided positive enforcement, such as financial incentives to those deemed most fit for reproduction, and negative enforcement, including forced sterilisation of those deemed unfit, such as those with disabilities, low IQ scores, criminal records and minority racial identities. This was carried out in countries including the USA and, most notoriously, Germany under Nazi rule.

As a result of the horrors of the 20th century, the public and academic institutions alike, quite understandably, developed a distaste for studies into the heritability of psychological traits and the subject was shelved. This created an academic vacuum ready to be filled with environmentalist theories of human nature, such as Sigmund Freud's theory of psychosexual development and B.F. Skinner's radical behaviourism. Psychology remained in the grip of environmentalism until the 1960s, when this consensus was challenged by a new sub-field of genetics called "behavioural genetics" and a novel methodology which was beginning to stack up mountains of evidence to the contrary.

Behavioural genetics is a field of study which aims to discover the influence of genetics on behaviour. In this regard it is similar to the related field of evolutionary psychology. However, where evolutionary psychology investigates the reasons why differing behaviours evolve in different species, behavioural genetics aims to quantify the contribution of genetics to individual differences in behaviour within the same species.

The genetic influence on individual differences is estimated by a statistic called heritability. Heritability describes the proportion of the variance of a trait within a population which can be attributed to genetics. This is not the same as saying that heritability measures how much of the trait is caused by genetics; rather, it describes how much of the difference in a trait between individuals in a population is caused by genetics. Heritability estimates are just that, estimates, the accuracy of which can be affected by factors such as genes with a non-additive effect on a trait, homogeneity of environment and the age at which the trait is tested.

Heritability estimates paint a picture of the genetic contribution to individual differences in traits of the population they were tested in. However, heritability estimates of the same trait may vary between populations in different environments. An example of this is the fact that heritability of body weight is higher in more affluent countries than in poorer countries. This is because, in poorer countries, people are more restricted in their food consumption because of lack of availability, so their body weight will be more dependent upon how much nutrition they have access to than on their genetics. In affluent countries, where almost everyone can afford to eat more than they need, body weight will depend more on genetically influenced factors such as metabolism, frame and appetite.
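The body weight example can be put into numbers with a minimal sketch. It assumes the simplest possible model, in which total variance is just genetic variance plus environmental variance with no interaction between the two, and the variances themselves are made up purely for illustration.

```python
def heritability(genetic_variance: float, environmental_variance: float) -> float:
    """Proportion of total trait variance attributable to genetic differences.

    Assumes total variance = genetic + environmental (no interaction term).
    """
    total = genetic_variance + environmental_variance
    if total <= 0:
        raise ValueError("total variance must be positive")
    return genetic_variance / total


# Hypothetical variances of body weight in arbitrary units (illustrative only).
# The genetic variance is held constant; only the environment differs.
affluent = heritability(genetic_variance=8.0, environmental_variance=2.0)
poorer = heritability(genetic_variance=8.0, environmental_variance=8.0)

print(f"affluent population: {affluent:.2f}")  # 0.80
print(f"poorer population:   {poorer:.2f}")    # 0.50
```

The same genetic variance gives a different heritability in each population simply because the amount of environmental variation around it has changed.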

Another caveat is that heritability shows correlation of genetics with traits, not causality. Heritability tells us nothing about the pathway which leads from genetic variants to behaviour and it often takes tortuously indirect routes. For example, attractive people tend to score higher on measures of self-confidence than average, presumably as a result of being more well received socially. Part of the heritability of self-confidence can therefore be explained by genetic variants which, rather than affecting brain areas associated with self-confidence as one might assume, result in prominent cheekbones and a cute button nose.

Heritability is estimated by dividing the correlation of a trait in related individuals by their co-efficient of relatedness. This is easier to express as a formula:

T = Correlation of trait

G = Co-efficient of relatedness

Heritability = T / G

Correlation of the trait refers to how similar the trait is in genetic relatives compared to the population as a whole. For dichotomous traits, characteristics which a person either has or does not have, this is calculated simply by finding how likely someone is to have the trait, such as a diagnosis of schizophrenia or a penchant for skydiving, given that a genetic relative, such as a sibling, cousin or parent, also expresses the trait.

For quantitative traits, characteristics which every individual has but at varying levels, correlation of the trait is calculated by how accurately you can predict the level of a trait, such as height or IQ scores, in an individual given that you know the level of the trait in a genetic relative compared to how much the trait varies from mean average in the population as a whole.

The strength of the correlation is denoted by a statistic called r-squared, the value of which can vary between 0 and 1. A correlation of r-squared=0 means that the pairs of genetic relatives are no more similar to each other than any pair of individuals in the population as a whole, whereas a correlation of 1 means that the trait of one of the pair of genetic relatives perfectly predicts the trait in the other. It is extremely unlikely that r-squared would ever be exactly 0 or 1 for any trait; rather, it usually falls somewhere in between.

The co-efficient of relatedness is a measure of how closely related two people are and is roughly equivalent to the proportion of genetic variants which they have inherited from the same ancestor. So identical twins will have a co-efficient of relatedness of 1, full siblings and parent-child pairs will have 0.5, grandparent and grandchild pairs, half-siblings, and uncles or aunts and their nephews or nieces will have 0.25, first cousins will have 0.125 and so on.

This formula for estimating heritability works because the correlation of the trait describes how similar the genetically related pairs are compared to the population as a whole and, all else being equal, this similarity must be caused by their shared DNA. So, we can divide this similarity in the trait by the co-efficient of relatedness of the genetically related pairs to find out what proportion of the variance of the trait is attributable to genetics.

To give a hypothetical example, if full siblings had a correlation of r-squared=0.25 in their extraversion scores, you could estimate that extraversion was 50% heritable by following the formula for estimating heritability: 0.25/0.5 = 0.5 = 50%. You could reach the same conclusion if the extraversion scores for grandparent and grandchild pairs had a correlation of r-squared=0.125 or if the correlation was r-squared=0.5 for identical twins.
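The formula Heritability = T / G is simple enough to sketch in a few lines. The function and dictionary names below are my own, the correlations are the hypothetical extraversion values rather than real data, and the grandparent and grandchild row is an illustrative choice of relative with a co-efficient of relatedness of 0.25.

```python
# Co-efficients of relatedness (G) for a few kinds of relative pair.
RELATEDNESS = {
    "identical twins": 1.0,
    "full siblings": 0.5,
    "grandparent and grandchild": 0.25,
}


def estimate_heritability(trait_correlation: float, relationship: str) -> float:
    """Apply Heritability = T / G for the given kind of relative pair."""
    return trait_correlation / RELATEDNESS[relationship]


# Hypothetical trait correlations (T) for extraversion scores.
examples = [
    ("identical twins", 0.5),
    ("full siblings", 0.25),
    ("grandparent and grandchild", 0.125),
]

for relationship, correlation in examples:
    h = estimate_heritability(correlation, relationship)
    print(f"{relationship}: {correlation} / {RELATEDNESS[relationship]} = {h:.0%}")
```

All three kinds of pair point to the same 50% estimate, which is what you would expect if the similarity between relatives really does scale with the DNA they share.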

Obviously, there are confounding factors which need to be considered when designing experiments for determining the heritability of traits. The most obvious is that related people tend to share similar environments, as they usually grow up in the same family households, so the similarity in traits could equally have been caused by similar upbringing. This problem is tackled in part by the use of adoption studies.

Adoption studies work by comparing the similarity in traits between adoptees and their biological family members with that of their adoptive family. For example, a study might measure extraversion in adoptees and compare the similarity between their extraversion scores and those of their birth parents with the similarity between their extraversion scores and those of their adoptive parents. A similar method is comparing the similarity of adoptive siblings to that of biological siblings. Adoption studies mimic the well-established animal research technique of cross fostering, where animal offspring are removed from their biological parents at birth and raised by surrogates to investigate genetically linked behaviours and physical traits.

Another method of addressing the shared environment issue is by studying twins. Identical twins have been called god’s gift to genetics because they share 100% of their DNA, allowing the opportunity for a natural experimental control. Twin studies work by comparing the similarity of pairs of monozygotic (identical) twins with that of dizygotic (non-identical, aka fraternal) twins. Because both kinds of twin share a very similar environment, including gestating in the same womb and being born on the same day, any difference between the similarity of monozygotic twins and that of dizygotic twins must for the most part be caused by genetics.
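One standard way of turning this comparison into a number is Falconer's formula, which the text does not name, so treat it as my own choice of illustration. Identical twins share all of their DNA and non-identical twins about half, so the gap between the two correlations reflects roughly half of the genetic effect, and doubling it gives a heritability estimate. The correlations below are hypothetical.

```python
def falconer_heritability(r_identical: float, r_fraternal: float) -> float:
    """Falconer's estimate: twice the gap between the two twin correlations.

    Assumes both kinds of twin share their environment to the same degree.
    """
    return 2.0 * (r_identical - r_fraternal)


# Hypothetical trait correlations for twin pairs raised together.
r_mz = 0.5   # monozygotic (identical) twins
r_dz = 0.25  # dizygotic (non-identical) twins

print(f"estimated heritability: {falconer_heritability(r_mz, r_dz):.0%}")  # 50%
```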

Neither of these methods is infallible on its own. For example, in twin studies, there is the assumption that non-identical twins are treated as similarly by their parents as identical twins are, despite some evidence to suggest that this is not the case, and, in adoption studies, middle class families are overrepresented among adoptive families relative to the general population, potentially resulting in an overestimation of heritability since environmental variance is restricted. But by compiling the evidence gathered from several methods, researchers can account for the weaknesses of each with the strengths of the others and triangulate upon more and more accurate estimates of heritability.

A situation which offers the best of both of the aforementioned methods is that of identical twins separated at birth. Heritability can be estimated by comparing the similarity of a trait between identical twins separated at birth with that of identical twins raised together. As monozygotic twins are genetically identical, any difference in similarity between the twins raised apart and the twins raised together must be caused by the environment. Cases of identical twins separated at birth are unfortunate for the people involved and are thankfully vanishingly rare. They are also, however, very useful when it comes to the study of nature and nurture, so behavioural geneticists try to document as many such cases as they can find.

You might have already heard about twin studies or anecdotes about the remarkable similarity of twins. Twins separated at birth are often discussed in the media, with reports highlighting astounding similarities in behavioural quirks between separated twins. The Jim twins (so called because their adoptive parents both named them Jim), who were separated at birth and reunited in 1979 at the age of 39, are among the most famous examples. As well as being remarkably similar in their appearance, they also both enjoyed mechanical drawing as a hobby, preferred the same subjects at school, had both married women named Linda, divorced them, remarried women named Betty, and both had sons whom they named James Allen. Anecdotes like these are often remarkable; however, they do not tell us much about the genetics of human psychology. Not much that is quantifiable anyway. The real discoveries from twin studies are to be found in traits measured in objective and standardised pen and paper tests carried out on large sample sizes.

One such category of psychological traits which can be objectively measured is personality. Personality traits are habitual patterns of behaviour, thought, and emotion that are stable properties of an individual's psychology. There are five major dimensions of personality agreed upon by most personality researchers: openness, conscientiousness, extraversion, agreeableness, and neuroticism. Intelligence is another measurable aspect of psychology. As well as quantitative dimensions of psychology, yes/no measures of life events which are in large part a result of choices made by the individual, such as obtaining a university degree, being divorced, having a criminal record, or seeking support for mental health, form another route to investigating individual differences in psychology.

The psychologist Thomas Bouchard, who had heard about the Jim twins and invited them to be subjects of his research, started receiving correspondence from other such separated twins soon after publishing the results of his case study. He eventually collected data from over 100 pairs of twins. Having a sample of this size created the perfect opportunity for the genetics of individual differences in psychology to be studied. By providing a reference point of genetic similarity, and removing shared environment from the equation, twin and adoption studies had, for the first time in the history of science, made it possible to disentangle and empirically measure the respective contributions of nature and nurture, and nature was coming up larger than anyone had expected.

The findings of twin and adoption studies have been summarised in the three laws of behavioural genetics:

1. "All behavioural traits are heritable".

2. "The effect of genetics is always greater than the effect of shared environment".

3. "There is a substantial proportion of the variance in behaviour not explained by shared genes or families".

The most startling discovery brought about by twin and adoption studies, described in the first law, is that there is not a single known, measurable behavioural trait that does not show any genetic influence. Every psychological characteristic that has been investigated in twin and adoption studies has been found to be correlated between family members, and the strength of these correlations increases with the percentage of shared DNA.

Not only are all behavioural traits heritable; the second law tells us that genetic factors always have a greater effect on behaviour than the home environment. Behavioural traits in adoptees are always, on average, more similar to those of their biological family than to those of their adoptive family. Therefore, the effect of genetics on behaviour must be greater than that of the shared environment.

The third law describes the fact that there are no behavioural traits which are accounted for completely by the combined influence of heritability and the home environment. Identical twins who grow up in the same home are very similar but not psychologically identical. Therefore, there must be other factors affecting behaviour independent of genetics or shared environment. Behavioural geneticists refer to this as the non-shared environment; the things which we experience in life which are unique to us as individuals.

The non-heritable factors which influence our behaviour can therefore be categorised into the shared environment and the non-shared environment. The influence of the shared environment can be estimated by subtracting heritability from the correlation of a trait in identical twins raised in the same home; whatever is left over, once that correlation is subtracted from 1, is attributed to the non-shared environment. It turns out that the effect of the shared environment is close to negligible. This is evidenced by the fact that, in measures of behaviour, siblings raised apart are no less similar than siblings raised together and by the fact that adopted siblings are no more similar than strangers. The most generous estimates of the impact of the shared environment are around 10%; however, most estimates are closer to 0%.
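One common way to split the variance of a trait into these three parts can be sketched as follows. It assumes the standard model used in twin research, in which identical twins raised together resemble each other because of their genes plus their shared environment, and differ only because of the non-shared environment. The function name and the input values are hypothetical.

```python
def decompose_variance(heritability: float, r_identical_together: float) -> dict:
    """Split trait variance into genetic, shared and non-shared environment.

    Assumes the correlation between identical twins raised together equals
    heritability plus the shared-environment share of the variance.
    """
    shared = r_identical_together - heritability
    non_shared = 1.0 - r_identical_together
    return {
        "genetic": heritability,
        "shared environment": shared,
        "non-shared environment": non_shared,
    }


# Hypothetical values: 50% heritability, identical twins correlated at 0.55.
parts = decompose_variance(heritability=0.5, r_identical_together=0.55)
for name, share in parts.items():
    print(f"{name}: {share:.0%}")
```

With these made-up inputs the shared environment comes out at 5% and the non-shared environment at 45%, which is the general shape of result described here: a small slice for the home and a large one for everything unique to the individual.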

The bulk of the non-genetic influence on behaviour is therefore made up by the non-shared environment. The non-shared environment consists of the non-systematic events that happen to us in our life which are unique to us as individuals: catching a virus as a new-born, a neighbour who teaches you to play chess, a sporting injury, a musical instrument discovered in the loft and everything in between. Their effect on our behaviour is like that of a pinball game, nudging us in different directions as we move through our life. The non-shared environment is mysterious, hard to identify and measure, non-stable and random. It is for these reasons that, despite explaining a substantial proportion of the variance between individuals, the non-shared environment is not a useful predictor of behaviour; it is just too unique to each individual to be quantified.

The findings of behavioural genetics have been solid, despite the harsh scrutiny the studies have been placed under due to the controversy surrounding the topic. In fact, the statistical rigour with which behavioural genetics studies have had to be carried out to stand up to such scrutiny resulted in the findings of behavioural genetics being among the modern psychological theories which survived the so-called "replication crisis", in which more than half of the findings of many papers published in highly regarded academic journals turned out to be statistical flukes.

We must bear in mind, however, the caveats about heritability mentioned earlier. Heritability estimates are only applicable to the population in which they were calculated. So, although being raised in a different home environment within the same culture is not likely to make much of a difference to your behaviour, that is not to say that being raised in a radically different culture would not have an impact. It is entirely possible that a person would have ended up with a different personality had they grown up in the Amazon jungle rather than the city of London. Another point to bear in mind is that estimates of the impact of the shared environment are made under the assumption that there is at least adequate nurture in the home environment. Cases of abuse, neglect and deprivation do have an impact on normal psychological development.

Despite these reservations, the findings from twin and adoption studies are transformative for the way we think about ourselves. The finding that DNA is a much more important and stable influencing factor on who we are than the way we were raised delivered a fatal blow to the doctrine of radical behaviourism, which had psychology in its grip for much of the 20th century. Behavioural genetics teaches us that people are not blank slates upon which talents and proclivities can be written by social conditioning, or blocks of clay ready to be moulded by parenting or schooling. People have their own genetic selves which, given adequate nurture, they may flourish into, just as a gardener may create the right environment for flowers to bloom but cannot change the type of flower a seed will become.

Despite the overwhelming evidence for the heritability of behaviour, academia was not immediately willing to accept theories which challenged the blank slate model. The impact which so-called "social Darwinism" and eugenics had had on far-right political opinions had left people with a distaste for the idea of biologically inherited differences between people. Around the time when behavioural genetics was beginning to be studied, it was dangerous both personally and professionally to publish about the genetic influence on behaviour. Researchers who published studies on behavioural genetics, those that could get published at all, had on several occasions been branded as eugenicists by the media, demonstrated against by student movements and threatened with physical violence. These incidents arose from concerns about the implications of behavioural genetics for inequality, but the people involved had not considered the harm which had been caused by the assumption that we are all inherently the same.

For example, if we are all born psychologically the same then mental illness must always be a result of the environment, and the blame often landed on mothers. Freudian thinking on mental health proposed that "cold" mothering styles, which were asserted to be a sign of the mother's own mental illness, were the cause of psychological diseases such as schizophrenia. Adoption studies have shown that the bulk of the risk of developing a mental illness is explained by genetic factors rather than trivial differences in parenting styles. These studies were carried out by observing the rate of mental illnesses such as schizophrenia and major depression in adoptees whose biological parents had one of these conditions. It was found that children of parents with a family history of mental illness were at an approximately equally increased risk of developing a mental illness whether they were raised by their birth parents or not. Similar studies have shown that adoptees with no family history of mental illness who were adopted by parents who did have one had only a very slightly increased risk of developing a mental illness themselves. This is not to say that parenting doesn't matter. Again, growing up in situations where there is abuse, neglect or malnourishment will increase the risk of mental illness, but the majority of people who struggle with their mental health did not have traumatic childhoods. The flipside of this observation is also true. Troubled upbringings are, for most people, not a life sentence.

Criticisms of behavioural genetics often make accusations of eugenics and genetic determinism and raise the question of free will. Critics of this sort have a backwards view of the implications of these findings. For one thing, the non-shared environment has been found to be as important as genetics for many aspects of personality. This finding alone is enough to clear the name of any behavioural geneticist on trial for ushering in a new era of eugenics. Even if it were the case that differences between individuals were caused purely by genetics, eugenics could only come about under authoritarian regimes which make value judgements about which traits are more desirable and what kinds of people have greater and lesser rights to live. No such judgements are made within the realms of science.

As for the matter of free will, let us say that it was instead found that there was no relationship between genetics and personality. That a person’s personality was carved out purely by the environment no matter how complex and unpredictable the relationship between personality and environmental factors. Would the person be any more free if this were the case? Is it not that, under such circumstances, a person would become whatever their environment dictates like a leaf blown through life by the winds of circumstance without direction? If a person’s character is determined to any degree by their genetics it can at least be that some of the forces which move them through life come from within. The guidebook that has been passed down to them by countless generations of ancestors, written in the DNA at the core of every cell in their body. If anything can be called a self which is free to will why not this?

Most people today have come to accept that the environmentalist view of human nature is too extreme, in large part because they could not help but notice the effect of genetics on behaviour within their own families with their own eyes. Despite the gradual shift of opinion within academia, folk theories of behavioural genetics remain not quite in sync with the science. Most people accept that genetics has some influence on behaviour in a person’s young life, but it is commonly believed that these influences get ironed out as a child grows up and settles into their environment. Counterintuitively, one of the findings of behavioural genetics is that adoptees become more similar to their biological family members as they get older, even if they have never met each other.

This finding suggests that the heritability of behaviour increases with age and that genetics encroaches more and more on to the territory of the environment as the individual gets older. It is as if everyone has a genetic destination for their personality towards which they circle closer and closer as they move through life. This echoes humanistic theories of psychology such as Maslow's self-actualisation and Jungian psychoanalytic theories of individuation. Perhaps the influence we see of non-shared environment on behaviour is a result of the individual moulding themselves into a shape which might fit into some niche in the impossibly complicated puzzle which is their life. As they move through life, they accumulate understanding of both the world and of themselves, an understanding we might call wisdom. With each gain in wisdom a person may pivot into positions which better suit their genetic proclivities and aversions, strengths and weaknesses.

We know from twin and adoption studies that our behaviour is influenced by genetics. But where are the genes and how do they result in behavioural traits? The answer is not so straightforward. The hunt for the specific genes which influence behavioural traits, as well as for most other complex phenotypes, has been largely fruitless and the few successful finds only explain a fraction of the variance in behaviour. What this shows us is that the heritability of complex traits identified in twin and adoption studies, rather than resulting from a few genes with high impact as was previously believed to be the case, is a result of the cumulative effect of genetic variants across the entire genome, each with a marginal contribution to the overall heritability of the trait. The term for such traits is polygenic traits.

The physiological pathways from genetic variants to polygenic traits such as behaviour are currently not well understood. With modern genomics, however, we are at least one step closer to understanding the genetic architecture of behaviour. The rise of a technology called microarrays has made it possible to look at variants across the whole genome rather than just sequencing a few single genes at a time. Massive studies using microarray technology carry out genome-wide searches in hundreds of thousands of individuals to identify statistical associations between traits and clusters of genetic variants. These studies, called genome-wide association studies (GWAS), are not able to point exactly to the specific genes involved in influencing behaviour; however, with large enough sample sizes, they are able to point to the regions within the genome where correlated genetic variants can be found.

One of the applications of the findings of genome-wide association studies is the production of predictive models called polygenic scores. Polygenic scores can make predictions about a trait in an individual by summing together the estimated effect sizes of all the genetic variants possessed by that individual which were found in GWAS to have an association with the trait. For example, by scanning the genome of an individual we can use polygenic scores to predict with some degree of certainty their likelihood of developing major depression at some point in their life or whereabouts they would be likely to score for the different dimensions on a personality test. As more people's genomes are analysed and the sample sizes of genome-wide association studies become larger and larger, the predictions made using polygenic scores become more accurate and ever-increasing proportions of the heritability estimated by twin and adoption studies can be accounted for.

The technology is still in its early stages; however, the potential for its use is massive. As behavioural genetics transitions towards behavioural genomics, we can build upon the foundations of twin and adoption studies and go further than family resemblance in investigating the genetic influence on behaviour by looking directly at a person’s genome. This is an exciting moment in the history of science, as behavioural genomics promises to provide us with another avenue towards knowing ourselves.


Saturday, November 21, 2020

Polygenic Scores - Genome-Wide Association Studies Explained simply part 4




Welcome to part 4 of this video series on genome-wide association studies. In previous videos we have discussed the relationship between genetic variants and traits, linkage disequilibrium and the statistical methodology used to detect associations between genetic variants and complex traits. In this video we will discuss one of the uses of GWAS, which is the development of polygenic scores.

There are certain traits which are determined by the presence or absence of genetic variants in a single gene. These are called monogenic traits and include traits such as lactose tolerance and, peculiarly enough, which thumb goes on top when you interlock your fingers. Most complex traits however are a result of the combined influence of a multitude of genetic variants affecting thousands of genes in your genome, each responsible for only a fraction of the overall genetic contribution to the trait. We refer to these as polygenic traits. 

As we have discussed in previous videos, the genetic variants associated with complex traits can be identified using genome-wide association studies. We can use this information to make predictions about some proportion of the variance in the traits of a population in the form of polygenic scores. Polygenic scores are generated using algorithms which sum together all the genetic variants possessed by an individual which are known to influence the likelihood or level of a trait, weighted by their effect size. The value produced by the polygenic score can be used to make predictions about the relative likelihood of a dichotomous trait, such as having a diagnosis of major depression or signing up for skydiving lessons, or the level of a continuous trait, such as height or extraversion scores.
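The weighted sum described above can be sketched in a few lines. This is a bare-bones illustration rather than how any particular polygenic score tool works: the variant names and effect sizes are invented, and each person is represented simply by how many copies (0, 1 or 2) of each trait-associated variant they carry.

```python
def polygenic_score(allele_counts: dict, effect_sizes: dict) -> float:
    """Sum of variant copy counts (0, 1 or 2), each weighted by its effect size."""
    return sum(
        effect_sizes[variant] * allele_counts.get(variant, 0)
        for variant in effect_sizes
    )


# Hypothetical GWAS effect sizes; a negative value lowers the predicted trait.
effect_sizes = {"variant_A": 0.5, "variant_B": -0.25, "variant_C": 0.125}

# Hypothetical individual: two copies of A, one copy of B, no copies of C.
person = {"variant_A": 2, "variant_B": 1, "variant_C": 0}

score = polygenic_score(person, effect_sizes)
print(score)  # 0.5*2 + (-0.25)*1 + 0.125*0 = 0.75
```

A real score would sum over many thousands of variants, but the arithmetic is the same.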



For those of you who have watched the previous video the phrase "proportion of the variance" might ring a bell. Proportion of the variance is the same thing as the co-efficient of determination which is measured by r squared. R squared can be thought of as how closely the values or likelihood of the trait for all individuals in the sample hug the line of best fit. 

For example, take a graph which models polygenic risk score values as deciles across the X-axis and values for a continuous trait along the Y-axis. We can determine what percentage of the variance is accounted for by our polygenic score by plotting the phenotype values for a validation sample of individuals against their polygenic score, drawing a line of best fit on the resulting graph and calculating an r squared value. A quick side note: the validation sample must be composed of individuals whose data were not used in the GWAS on which the polygenic score algorithm was based. This is necessary to prevent bias.
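For a straight line of best fit, r squared is just the square of the ordinary (Pearson) correlation between the two variables, so it can be calculated without drawing anything. The sketch below uses five made-up individuals from an imaginary validation sample; the function name and the numbers are illustrative only.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. r squared for a straight-line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation ** 2 / (variation_x * variation_y)


# Hypothetical validation sample: polygenic scores and measured trait values.
scores = [1, 2, 3, 4, 5]
phenotypes = [2, 1, 4, 3, 5]

print(f"variance accounted for: {r_squared(scores, phenotypes):.0%}")  # 64%
```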



Most biologically influenced complex traits, including behavioural traits, have some degree of what is called "heritability". Heritability is the proportion of the variability in the level or likelihood of a trait which is influenced by the genetic variants passed on to the individual from their parents as opposed to that which is influenced by environmental factors. Heritability estimates are calculated using twin and adoption studies. These types of studies will be discussed in another video.

Let us take IQ score as an example of a continuous trait. IQ has an estimated heritability of about 50%. This means that 50% of the variability in IQ scores is determined by the genetic variants people are born with. Heritability acts like a kind of upper limit to what can be predicted by polygenic scores. So researchers can strive to be able to predict up to 50% of the variance in IQ scores by looking at the genome. 

In reality of course we aren't there yet due to limitations with statistical power and methodological challenges. This is known as the missing heritability problem. At the moment we can only account for between 20% and 50% of the heritable portion of the variance in IQ with the results of GWAS. Genomic research into polygenic traits is, however, still a very young field and the percentage of the variance which can be accounted for is growing as sample sizes become larger and methodologies become more precise. 

One method of representing the explanatory power of a polygenic score graphically is by drawing confidence intervals above and below the mean level of a trait at each decile. Confidence intervals are values between which the mean value of the trait is likely to be found in some proportion of validation samples. Normally this is set as 95%. This is not to be confused with the proportion of individuals at each decile who fall between these values. Rather, it means that if you were to test this model on 100 different validation samples, the mean Y-axis value at each decile would be expected to fall between the confidence intervals in about 95 of them. 
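Here is a minimal sketch of how a 95% confidence interval for the mean of one decile could be worked out. It uses a simple normal approximation (mean plus or minus 1.96 standard errors), which is an assumption on my part, and the trait values are invented.

```python
from statistics import mean, stdev

# Hypothetical trait values for the individuals in one decile of a validation sample.
decile_values = [98, 102, 105, 99, 101, 107, 103, 100]

n = len(decile_values)
decile_mean = mean(decile_values)
# The standard error shrinks as the number of individuals in the decile grows.
standard_error = stdev(decile_values) / n ** 0.5

# 1.96 standard errors either side of the mean gives an approximate 95% interval.
lower = decile_mean - 1.96 * standard_error
upper = decile_mean + 1.96 * standard_error
print(f"mean {decile_mean:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```

With only a handful of individuals per decile a t-distribution would be more appropriate than the 1.96 used here; the point is simply that narrower intervals come from less scatter and larger samples.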

Confidence intervals for a polygenic score with high explanatory power might look like this, whereas for a polygenic score with low explanatory power confidence intervals may look like this. Note that the predicted value of the trait for each decile is the same for both polygenic scores; however, the one with the smaller confidence intervals is more useful as we can have more confidence in the predicted mean value of the trait.



Let us look at the results of a polygenic score model for IQ alongside the distribution of IQ scores in the population. IQ scores are normally distributed with the 50th centile being defined as an IQ of 100 and other scores are shown at each standard deviation from the mean on this diagram. This polygenic score has an r-squared value of 0.11. In other words, this model accounts for 11% of the variation in IQ scores. Individuals in the highest decile have a mean IQ which is about half a standard deviation above the mean whereas those in the bottom decile have a mean IQ about half a standard deviation below the mean. By comparing this to the distribution of IQ scores we can see that this is equal to a mean IQ of about 108 at the highest decile and a mean IQ of 92 at the lowest decile.
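As a sanity check on those figures, here is a rough sketch of what an r-squared of 0.11 would imply for the top decile. It assumes the polygenic score and IQ are jointly normally distributed and that one standard deviation of IQ is 15 points; both are assumptions for illustration rather than something read off the graph.

```python
from statistics import NormalDist

standard_normal = NormalDist()
r2 = 0.11          # variance in IQ accounted for by the polygenic score
iq_mean, iq_sd = 100, 15  # assumed scaling of IQ scores

# Mean polygenic score (in standard deviations) of people above the 90th percentile.
cutoff = standard_normal.inv_cdf(0.9)
mean_score_top_decile = standard_normal.pdf(cutoff) / 0.1

# The expected trait shift is the score shift scaled by the correlation (sqrt of r2).
shift_in_sd = (r2 ** 0.5) * mean_score_top_decile
iq_top = iq_mean + iq_sd * shift_in_sd
iq_bottom = iq_mean - iq_sd * shift_in_sd
print(f"top decile: {iq_top:.1f}, bottom decile: {iq_bottom:.1f}")
```

Under these assumptions the expected shift comes out at a little over half a standard deviation, which is roughly in line with the values read from the graph.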



It may be difficult to grasp how much of a difference that actually is for a non-physically visible trait like IQ so let us visualise it by replacing IQ with a more tangible trait such as height. Half a standard deviation below the mean for height in adult males in the USA is equivalent to about 69 inches, that is 5'9" or 175cm, and half a standard deviation above the mean is equivalent to about 71 inches, that is 5'11" or 180cm. To put that in perspective, according to this chart of celebrity heights that is the difference between Tom Hardy and Michael Fassbender.



The same method can be used for the prediction of dichotomous traits; however, instead of predicting the level of the trait, polygenic scores are used to predict the likelihood of a trait occurring. For example, here is a graph showing the deciles of a polygenic score for diagnosis of major depression in a sample of the Danish population. Rather than a quantitative value, what is referred to here as a hazard ratio is plotted along the Y-axis. The hazard ratio assigns a value of 1 to the decile with the smallest proportion of individuals having a diagnosis of major depression and a value to each other decile corresponding to the ratio of cases relative to the first decile. From this graph we can see that individuals in the tenth decile are just over two and a half times as likely to have a diagnosis of major depression as those in the first decile. 



Information about the actual proportion of diagnoses of major depression at each decile is not available from this graph; however, we know the overall proportion of individuals with a diagnosis of major depression in Denmark is 3%. Therefore, assuming that the 5th and 6th deciles are approximately equivalent to the population risk, we can extrapolate that individuals in the first decile have about a 1.7% chance of developing major depression whereas those in the tenth decile will have a 4.3% chance. To put this into perspective, the proportion of individuals in the first decile who will develop depression is like the proportion of bombs in this game of minesweeper and for the tenth decile it is like this game of minesweeper. 



From the examples we have looked at you might be wondering whether these polygenic scores are actually useful at all. Those with a polygenic score in the top decile for IQ are by no means guaranteed to be the next Einstein, neither are those with scores in the top decile for height much more likely than average to have a career in basketball. Those with polygenic scores in the top decile for risk of major depression do not have much more to worry about than the average person either. There are however many uses of polygenic scores beyond merely making predictions about the traits of individuals.

For example, there are certain diseases which can be prevented or mitigated with early detection and treatment; however, health organisations rarely have the funds to screen every individual at risk. Polygenic scores can be used to inform the screening for these diseases and therefore increase the number of cases which are caught early from screening the same number of individuals. Polygenic scores can also be used to shed light on the biological causes of diseases with high comorbidity by finding to what degree, if any, the polygenic score for one disease predicts the other. Stratified and precision medicine, which are hot topics in science right now, are another use of polygenic scores. Complex diseases such as bipolar disorder and schizophrenia can be treated by a number of different drugs and finding out which one works best for the individual is often a case of trial and error. Polygenic scores for drug response can be used to predict which drug is more likely to be effective for the individual and minimise the amount of trial and error required to find the right treatment. In the future it may even be possible to tailor-make drugs based on the individual's genome. 




As well as in medicine, polygenic scores also have their uses in psychology. Disentangling the effects of nature and nurture has been one of the biggest problems in psychology since its naissance as a scientific field of study. Polygenic scores go part of the way towards solving this problem. Studies have found that children with more books in their household do better in school, but is this because the books are making them better learners or is it because their parents have genes which make them enjoy learning and therefore buy a lot of books as well as passing on these genes to their children? Polygenic scores which predict educational attainment serve as a control to help us decipher how much, if any, of the correlation between books in the house and school results is actually caused by the books. In some cases polygenic scores are also relevant to debates on social policy. Does consumption of cannabis increase the risk of schizophrenia? Or are people who are genetically predisposed to schizophrenia also genetically more likely to consume cannabis? 





So this concludes this video series on Genome-Wide Association Studies. I hope you all found it accessible and informative. If there is anything we discussed in these videos you would like to see discussed in more detail please leave a comment below and I may make another video about it in the future. If you enjoyed this video that means you have a high polygenic score for hitting the like button and subscribing to the behavioural genomics YouTube channel. 



Thursday, November 12, 2020

Genome-Wide Association Studies Explained Simply - Linear and Logistic Regression


Hello everyone and welcome back to this video series on Genome-Wide Association Studies. 

In the previous video we discussed P-values and the multiple testing problem that must be accounted for in modern genomics.

Now that we understand P-values we can look at how the P-values for associations between genetic variants and traits are calculated. GWAS most often use a type of statistical test called regression analysis. Regression analysis is used to estimate the relationship between a dependent variable and a predictor variable. In this case these are the trait and a genetic variant respectively. 

For example, if we wanted to find if a genetic variant is associated with a continuous trait, like scores on a measure of extraversion, we could graph the extraversion scores of the participants along the y-axis and the genotype of the participants along the x-axis. As we have seen in previous videos people have two versions of each chromosome in their genome. Therefore participants can be categorised as having two, one or no copies of the minor allele - that is, the least common variant of the nucleotide at that position. In a regression model that assumes additive allelic effects it is assumed that having two copies of an associated variant would have a larger effect on the trait than having only one. Therefore the x-axis can be plotted as the number of minor alleles in the genotype.  




A line of best fit - that is, a straight line which is as close to every point on the graph as possible - is drawn across the graph. In the linear model assuming additive allelic effects we have just described this would look like a line which is as close as possible to the mean y-axis value, which is in this case the participants' extraversion scores, for each genotype. The slope of the line determines the direction of the effect of the genetic variant on the extraversion scores of the participants with an incline denoting a positive correlation (having the genetic variant makes you more extraverted) and a decline denoting a negative correlation (having the genetic variant makes you more introverted). 



A steeper slope is indicative of a larger impact of the genetic variant on the phenotype. This is one way of measuring what is called the effect size. The effect size is found by calculating the gradient of the line of best fit and measures the amount of change in extraversion per copy of the minor allele. A larger positive gradient indicates a stronger positive effect and a larger negative gradient indicates a stronger negative effect. A flat, horizontal line would have a gradient of zero and indicate no effect of the genetic variant on extraversion. 
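To see how the gradient is found in practice, here is a small sketch using ordinary least squares on made-up data for six participants. The genotypes are coded as the number of copies of the minor allele, and the extraversion scores are invented.

```python
# Hypothetical data: copies of the minor allele and extraversion scores.
genotypes = [0, 0, 1, 1, 2, 2]
extraversion = [10, 12, 13, 15, 16, 18]

n = len(genotypes)
mean_x = sum(genotypes) / n
mean_y = sum(extraversion) / n

# Least-squares gradient: how x and y vary together, divided by how x varies.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, extraversion))
sxx = sum((x - mean_x) ** 2 for x in genotypes)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"effect size: {slope} points of extraversion per copy of the minor allele")
```

In this toy example each extra copy of the minor allele goes with three extra points of extraversion, so the line slopes upwards.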




The proportion of the variance in extraversion which can be explained by the genotype of this SNP is called the co-efficient of determination. A higher co-efficient of determination usually looks like the points in the graph hug the line of best fit closer whereas with a lower co-efficient of determination the points are more dispersed. The co-efficient of determination is measured by a value called r-squared and is found by calculating the sum of the squares of the distances between all the points on the graph and the line of best fit, subtracting this from the sum of the squares of the differences between all the values for extraversion and the average value for extraversion in the whole sample, and dividing the result by that same sum of squared differences from the average. In other words, r squared measures the variation in extraversion explained by the genetic variant divided by the variation in extraversion in general.
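The r-squared recipe can be sketched in a few lines. The data are the same kind of invented extraversion scores as before, and the line of best fit (intercept 11, slope 3) is the least-squares line for this particular toy data set.

```python
# Hypothetical data: copies of the minor allele and extraversion scores.
genotypes = [0, 0, 1, 1, 2, 2]
extraversion = [10, 12, 13, 15, 16, 18]

# Line of best fit for this toy data (intercept 11, slope 3).
predicted = [11 + 3 * g for g in genotypes]
mean_score = sum(extraversion) / len(extraversion)

# Squared distances between each point and the line of best fit.
ss_residual = sum((y - p) ** 2 for y, p in zip(extraversion, predicted))
# Squared distances between each point and the overall average.
ss_total = sum((y - mean_score) ** 2 for y in extraversion)

r_squared = (ss_total - ss_residual) / ss_total
print(f"r squared = {r_squared:.2f}")
```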




The P-value can be calculated by finding a value called F.  F is calculated by finding the variation in extraversion explained by genotype divided by the variation in extraversion not explained by genotype. The P-value can be found from the F-value by comparing the F-value found in the regression to a probability distribution of F-values which could have been found assuming there was no relationship between the genotype and the trait. 

The P-value is calculated by finding the proportion of F-values at least as large as the one found in the regression in the distribution of all possible F-values from this sample. Another way of describing it graphically is the P-value is equal to the area under the curve with an F-value higher than or equal to that found in the regression divided by the area under the entire curve.
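One way to get a feel for the "distribution of F-values assuming no relationship" is to build it directly by shuffling. The sketch below is a permutation test, which is not what GWAS software normally does (it looks the F-value up in a theoretical F distribution), but it illustrates the same idea using invented data. The F-value here scales the unexplained variation by its degrees of freedom, the number of participants minus two.

```python
from itertools import permutations

# Hypothetical data: copies of the minor allele and extraversion scores.
genotypes = [0, 0, 1, 1, 2, 2]
extraversion = [10, 12, 13, 15, 16, 18]


def f_statistic(x, y):
    """Variation explained by the line of best fit divided by the
    variation left unexplained (scaled by its degrees of freedom)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_explained = sxy ** 2 / sxx
    ss_unexplained = ss_total - ss_explained
    return ss_explained / (ss_unexplained / (n - 2))


f_observed = f_statistic(genotypes, extraversion)

# Every possible reshuffling of the scores breaks any real link with genotype,
# so these F-values show what could be found if there were no relationship.
null_f_values = [f_statistic(genotypes, list(p)) for p in permutations(extraversion)]
at_least_as_large = sum(f >= f_observed - 1e-9 for f in null_f_values)
p_value = at_least_as_large / len(null_f_values)
print(f"F = {f_observed:.1f}, P = {p_value:.3f}")
```

Listing every reshuffle is only feasible for a toy sample like this one; with thousands of participants the theoretical distribution does the same job far more cheaply.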




The type of regression used for continuous traits such as extraversion is called a linear regression. For dichotomous traits, such as whether or not a person has ever gone skydiving, values cannot be assigned to the trait in the same way as can be done for continuous traits like extraversion scores. Therefore a different method called logistic regression is used for finding associations with dichotomous traits.
 
Rather than modelling how a genetic variant increases or decreases the level of a trait, logistic regression models whether a genetic variant increases or decreases the likelihood that an individual would possess the trait. Instead of using a straight line of best fit, logistic regression uses the logistic function. The logistic function is an s shaped curve which, in this case, models the probability of an individual having ever gone sky diving based on their genotype; it is the log odds of the trait which are modelled as a straight-line function of genotype. The method for generating the logistic curve is slightly more complicated so we won't be going into it in this video.
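For anyone curious what the s shaped curve looks like as a formula, here is a small sketch. The intercept and slope are invented numbers, not the result of fitting any real data.

```python
import math


def logistic(log_odds):
    """Turn log odds (any number) into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical model: log odds of having gone skydiving = intercept + slope * copies.
intercept, slope = -2.0, 0.5

probabilities = [logistic(intercept + slope * copies) for copies in (0, 1, 2)]
for copies, probability in enumerate(probabilities):
    print(f"{copies} copies of the minor allele: {probability:.1%} chance")
```

A positive slope gives a low to high curve and a negative slope gives a high to low curve.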






In a similar way to linear regression, where the slope of the line indicates the direction of the effect, the direction of the curve indicates the direction of the effect of the minor allele on the likelihood of the trait. A high to low curve indicates a negative effect of the minor allele on the likelihood of the trait and a low to high curve indicates a positive impact of the minor allele on the likelihood of the trait. 

There are many different methods of calculating r-squared for logistic regression. The method which is most similar to that used in linear regression calculates r-squared by finding the log likelihood of the model after taking genotype into account, subtracting this from the log likelihood of a model based only on the overall likelihood of having gone skydiving, and dividing the result by that overall log likelihood. In other words, r squared measures the likelihood of skydiving explained by the genetic variant over the likelihood of skydiving in general.

The P-value for our logistic regression is found by calculating a chi-squared value. The chi-squared value is calculated simply by multiplying the difference between the log likelihood after taking genotype into account and the log likelihood based only on the overall likelihood of having gone skydiving by two. The number two comes from the definition of the likelihood ratio test, which makes the value follow a chi-squared distribution when there is no real relationship; the number of degrees of freedom is one, for the single genotype term added to the model. The chi-squared value found in the regression can then be compared to a probability distribution of chi-squared values which could have been found assuming no relationship between genotype and the likelihood of having gone skydiving. We can find our P-value by dividing the area under the curve with a chi-squared value higher than or equal to the one found in the regression by the area under the entire curve.
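The steps above can be sketched end to end on a small invented sample of 30 people. The fitting routine below is a bare-bones Newton-Raphson loop written for illustration (real GWAS software is far more careful), and the P-value uses the fact that for one degree of freedom the chi-squared tail area can be written with the complementary error function.

```python
import math

# Hypothetical sample: 10 people per genotype; 2, 4 and 7 of them have gone skydiving.
genotypes = [0] * 10 + [1] * 10 + [2] * 10
skydived = [1] * 2 + [0] * 8 + [1] * 4 + [0] * 6 + [1] * 7 + [0] * 3


def log_likelihood(b0, b1, x, y):
    """Log likelihood of the outcomes under log odds = b0 + b1 * genotype."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        total += math.log(p) if yi else math.log(1 - p)
    return total


def fit_logistic(x, y, steps=25):
    """Newton-Raphson updates for the intercept b0 and slope b1."""
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(genotypes, skydived)
ll_model = log_likelihood(b0, b1, genotypes, skydived)

# Model ignoring genotype: everyone gets the overall proportion of skydivers.
overall = sum(skydived) / len(skydived)
ll_null = log_likelihood(math.log(overall / (1 - overall)), 0.0, genotypes, skydived)

r_squared = (ll_null - ll_model) / ll_null
chi_squared = 2 * (ll_model - ll_null)
# Tail area of a chi-squared distribution with one degree of freedom.
p_value = math.erfc(math.sqrt(chi_squared / 2))
print(f"r squared = {r_squared:.3f}, chi-squared = {chi_squared:.2f}, P = {p_value:.3f}")
```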




In GWAS we use high performance computers to run software which carries out these tests to calculate the p-value for every SNP we analyse in the GWAS. Once we have calculated the P-values we can visualise the results in a graph which looks like this. This is called a Manhattan plot. It is called this because it usually resembles the skyline of Manhattan. Across the x-axis we have the position of each SNP within each of the chromosomes which are separated by colour. The P-values for each SNP are graphed along the y-axis after being transformed to -log10 of the P-value so that the often highly disparate values are more easily visible on the graph and so that the lower p-values appear higher up in the graph.




We can then draw a line on our graph at the P-value for genome wide significance. This way we can easily spot which locations in the genome contain genetic variants which are associated with the trait by looking for points in the graph which lie above the genome-wide significance line. You may have noticed from this image that SNPs which appear above this line usually cluster together and form "towers" around a single position in the genome. This is a result of linkage disequilibrium which has been discussed in a previous video.   
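Here is a tiny sketch of the -log10 transformation and the significance line. The SNP names and P-values are invented, and the genome-wide significance threshold of 5 × 10^-8 is the one discussed in the previous video.

```python
import math

# Hypothetical P-values for three SNPs.
p_values = {"rs_a": 0.03, "rs_b": 5e-8, "rs_c": 1e-12}

# Height of each point on the Manhattan plot: smaller P-values sit higher up.
heights = {snp: -math.log10(p) for snp, p in p_values.items()}

# Height of the genome-wide significance line.
significance_line = -math.log10(5e-8)

# SNPs whose points lie above the line (P smaller than the threshold).
significant = [snp for snp, height in heights.items() if height > significance_line]
print(significant)
```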




So this is how researchers find statistical associations between genetic variants and continuous and dichotomous traits. Once these results have been gathered researchers can then zoom in on the positions in the genome where these peaks are found and look more closely at the genes within or near this location. Another thing we can do with these results is develop algorithms called polygenic risk scores which can predict an individual's traits based on their genotype. If you want to find out how to predict whether or not you're gonna go skydiving one day, subscribe to behavioural genomics and look out for the next video. 





Sunday, November 1, 2020

Genome-Wide Association Studies Explained Simply - Statistical Tests: P-Values and Multiple Testing



Welcome to part 3 of this video series on Genome-Wide Association Studies. In previous videos we have discussed the relationship between genetic variants and traits and how researchers use genotyping to identify the location of genetic variants associated with complex traits by taking advantage of linkage disequilibrium.

This video will discuss the role of P-values in genome-wide association studies as well as how researchers use a genome-wide p-value threshold to account for the multiple testing problem. 

First, we need to understand how researchers define a statistically significant association. GWAS uses P-values to measure the statistical significance of an association between a genetic variant and a trait. P-values measure the likelihood that an association at least as strong as the observed association would be found if there was in fact no real connection between the trait and the genetic variant. 

P-values are scored on a scale where the closer the score is to one the weaker the evidence is for a real association and the smaller the score the stronger the evidence for a real association. For example, a P-value of 0.01 means that finding a false positive - that is, finding an association where there is no real relationship between the trait and the genetic variant - where the association is of this strength would be expected once for every 100 trials in samples of the same size. 


Researchers use an arbitrarily agreed upon P-value threshold to decide how statistically significant the association needs to be to be confident that the association is real. Traditionally this had been agreed upon as P<=0.05 - that is, an association this strong would turn up by chance in one in twenty tests where there is no real relationship. For GWAS however this threshold is far too lenient.



The reason that p<=0.05 had been acceptable historically in genetics but is not acceptable in modern genomics is that GWAS carry out way more than 20 tests. If every known independent common variant is tested for association there would be more than 3 million tests carried out per GWAS. A P-value threshold of p<=0.05 would therefore result in more than 150 thousand spurious associations being deemed statistically significant.




Researchers usually correct for the multiple testing problem by instead using a genome-wide significance threshold which is normally accepted as p < 5 × 10^-8. This number is arrived at by dividing the traditionally accepted significance threshold of 0.05 by approximately the number of independent common SNPs across the human genome.
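The arithmetic behind these numbers is simple enough to check in a few lines. The variable names are my own; the figures are the ones quoted above.

```python
traditional_threshold = 0.05

# Testing 3 million variants at P <= 0.05 when none is truly associated.
tests_in_example = 3_000_000
expected_false_positives = traditional_threshold * tests_in_example
print(f"{expected_false_positives:,.0f} spurious associations expected")

# Working backwards from the genome-wide threshold: dividing 0.05 by the
# threshold shows how many independent tests it corrects for.
genome_wide_threshold = 5e-8
implied_independent_tests = traditional_threshold / genome_wide_threshold
print(f"corrects for about {implied_independent_tests:,.0f} independent tests")
```

Dividing the threshold by the number of tests in this way is known as a Bonferroni correction.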

Correcting for multiple testing in this way solves the problem of too many false positives but there is always a trade off to be made between rejecting false positives and the power to detect real associations. If the size of the effect that a genetic variant has on the trait is small, and for complex traits like behaviour this is almost always the case, it is necessary for sample sizes - that is, the number of participants who have volunteered their genotype data - to be very large in order to be able to detect the effect at genome wide significance. For this reason, genomic researchers most often share databases of large amounts of genotyping data. The UK BioBank for example is a cohort of over 500,000 participants who have volunteered their genomic data to be used in GWAS and other types of studies.

If you want to find out how researchers use these data to find associations between genetic variants and traits subscribe to behavioural genomics and look out for the next video.


Monday, October 5, 2020

Genome-Wide Association Studies Part 2 - Linkage Disequilibrium


In part one we discussed what is meant by genetic variants and traits and the relationship between the two. Here we will discuss how we can find the location within the genome of genetic variants which make individuals more likely to possess certain traits, including predispositions to certain behaviours. If you haven't already you may wish to read part one to gain an understanding of some key terms and concepts before continuing with this article.

In GWAS, information about an individual's genome is obtained using a process called genotyping. I won't go into the specific details of how this works (if you are interested you can read about it here) but essentially the nucleotide present at different positions in the participant's genome is identified from a sample of their DNA. To carry on the metaphor from part 1, genotyping is like skimming the pages of a person's genomic instruction manual.

Genotyping, as opposed to whole genome sequencing, does not identify every nucleotide present at each position in the genome. This is because the cost of whole genome sequencing combined with the large sample sizes required to detect associations with complex traits renders whole genome sequencing studies financially infeasible. Instead, single nucleotides called tag SNPs are chosen from across the genome which give us enough information to impute with a high degree of accuracy the nucleotides present at all other positions in the genome. This works by taking advantage of a feature of the genome called linkage disequilibrium. 

To understand what this is we need to take a step back and look at how the genome of an organism is formed from the DNA passed down to it from its parents. Each organism has two versions of each chromosome in its genome, one from each parent. These are properly called homologous chromosomes; they are referred to here as sister chromosomes. But it is not the case that you simply have one of your father's chromosomes and one of your mother's chromosomes. There is a point of reshuffling DNA during a process of reproduction called meiosis.

Meiosis occurs during the production of the parent's sex cells. Pairs of chromosomes come together at the centre of the nucleus of the germ cell (the cell which produces sex cells) and undergo a process of swapping chunks of DNA from each chromosome to the equivalent position on its sister chromosome. This process is called crossing over and is akin to swapping pages between the two versions of the chapters of the genomic instruction manual. Before crossing over, both versions of each chromosome are comprised of two identical copies of the string of DNA which makes up the chromosome, called chromatids. However, as a result of the non-uniform nature of the crossing over process, crossing over results in 4 non identical chromatids (two for each sister chromosome). 

Sister chromosomes pair up at the nucleus of the cell.



The crossing over process


The cell then divides to form two cells each containing one copy of every newly shuffled chromosome. Following this the second division of meiosis occurs in which the cells divide once again to produce four cells each containing a unique copy of each chromatid. The four cells, with unique genomes, produced by this process of meiosis are called gametes and take the form of sperm cells in males and ova (egg cells) in females. When a sperm cell meets an ovum during the reproduction process the chromatids from each cell become chromosomes, resulting in a completely new genome made up of a set of chromosomes made of chunks of DNA from both of the mother's sets of chromosomes and a set of chromosomes made of chunks of DNA from both of the father's sets of chromosomes.

Meiosis forms 4 sex cells with unique chromosomes





Fertilisation creates an embryo with a unique genome, made up of genetic variants from both parents


The relevance which meiosis has to genotyping is due to the fact that, although chunks of DNA are swapped between versions of chromosomes non-uniformly, the crossing over process is not completely random. Some chunks of DNA within the chromosomes are more likely to stay together during the crossing over process than would be expected by chance, a feature of genomics called linkage disequilibrium. Genetic variants which are more likely to be found in the same genome than would be expected if crossing over occurred randomly, even after multiple generations of crossing over, are therefore said to be in linkage disequilibrium with each other. The higher the likelihood of staying together, the higher the degree of linkage disequilibrium.


The above image is a visualisation of linkage disequilibrium called an LD heatmap. The black line at the bottom represents a stretch of DNA and the intensity of the red points represents the probability that the two SNPs which intersect at that point would be found together. The black triangles represent stretches of DNA which are significantly unlikely to be separated during crossing over and are therefore in linkage disequilibrium with each other.

Chunks of DNA, housing genetic variants, which stick together like this are referred to as haplotype blocks (blocks of DNA inherited from the same parent). Now that we understand haplotype blocks we can explain how genotyping data can be used to impute the SNPs present in the genome from reading only a few selected bases. Imagine a sentence in the genomic instruction manual which was known to be a haplotype block and that 4 different versions of this sentence are known to exist.


If we were to look at this sentence in someone's genomic instruction manual and read only the fourth and twelfth letters (red) this would be enough to impute which version of the sentence was present. Even though the sixth and the fifteenth letters (orange) also vary between the haplotype blocks it is not necessary to read these to confirm which letter is present at these positions.

The same can also be said of sequences of DNA. Take this example of four versions of the same haplotype block in the sequences below. To find which version of this haplotype block is present it is enough to read only the SNPs present at the fourth and tenth positions even though the sixth and the twelfth positions also vary. It is for this reason that the SNPs in red are chosen as tag SNPs.
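The logic of imputing from tag SNPs can be sketched as a simple lookup. The four sequences below are invented for illustration (they are not the ones in the image), but they follow the same pattern: the fourth, sixth, tenth and twelfth positions vary, and reading the fourth and tenth is enough to tell the versions apart.

```python
# Four hypothetical versions of the same 12-base haplotype block.
known_haplotypes = [
    "ACGTACGTACGT",
    "ACGAAGGTACGC",
    "ACGTAGGTAGGT",
    "ACGAACGTAGGC",
]

# Tag SNPs: the 4th and 10th positions (indices 3 and 9 when counting from zero).
TAG_INDICES = (3, 9)

# Build a lookup from the bases at the tag SNPs to the full haplotype.
lookup = {tuple(h[i] for i in TAG_INDICES): h for h in known_haplotypes}


def impute(base_at_4, base_at_10):
    """Return the full haplotype implied by the two genotyped tag SNPs."""
    return lookup[(base_at_4, base_at_10)]


# Genotyping reads only two bases, yet the 6th and 12th can be filled in.
haplotype = impute("A", "G")
print(haplotype, "-> 6th base:", haplotype[5], ", 12th base:", haplotype[11])
```

Real imputation works with probabilities rather than a perfect lookup, since haplotype blocks are not always inherited intact.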





The existence of linkage disequilibrium creates both difficulties and opportunities when using genotyping to identify genetic variants associated with certain traits. The difficulty is that association studies used to identify statistical associations between traits and genetic variants most often identify several SNPs which are in linkage disequilibrium with each other rather than identifying single variants. This is a problem as there is no way of knowing which of the identified SNPs is the causal SNP amongst the several other SNPs which were identified only because they are in linkage disequilibrium with the causal SNP. 


The above image is part of a graph which has the position of individual SNPs from a section of a chromosome across the x-axis. The higher up the SNP, represented by dots, appears on the y-axis the stronger its association with the trait. Only one of these SNPs is likely to be the cause of the association found at this location in the genome but as a result of linkage disequilibrium an association is also found with many other nearby SNPs.  

Despite this difficulty, genotyping based GWAS is a highly useful methodology, as the information it gathers from tag SNPs showing a significant association with the trait of interest, combined with databases containing the DNA sequences of all known haplotype blocks, makes it possible to identify the location of the genome in which the causal variant is located and impute the identity of all common genetic variants in this part of the genome. This is achieved at a fraction of the cost of whole genome sequencing.

Researchers can then use this information to inform inquiry into the biological processes behind the trait under investigation. For example, they might choose to look more closely at the genes which are found within or near the identified locations in the genome and create hypotheses about how variation in the expression or composition of these genes might influence the trait based on prior knowledge of the function of these genes.

The next article will discuss the statistical methods which are used to identify associations between SNPs and traits.