Approved research

Validating genomic trait prediction algorithms for commercial use with the UK Biobank resource

Allelica S.r.l.

Lay summary

Since the human genome was fully sequenced almost 20 years ago, the scientific community has made great strides in understanding the genetic basis of many human traits and diseases. Much of this progress has come from the generation of large datasets, made possible by the falling cost of DNA sequencing, and from the development of sophisticated statistical methods for discovering patterns in these large, complex genetic datasets. A particularly fruitful avenue of enquiry has been genome-wide association studies (GWAS), which compare DNA between people with and without a specific disease, or between people with differing values of a continuous trait. These analyses have generated lists of genetic variants that are robustly associated with human traits and diseases and, crucially, offer an indication of the strength and size of each association. An individual's overall risk of getting a disease, like diabetes, or their value of a trait, like height, is a complex combination of many different environmental, lifestyle, and genetic factors. Genetics on its own will never be able to provide a complete assessment of an individual's risk. However, because many diseases and traits do have a genetic component, and the variants involved are being identified through GWAS, we are now in a position to use this information to better predict the genetic component of an individual's risk, so that interventions can be better targeted and lifestyles can potentially be modified. Our research project aims to translate the results of GWAS into usable information about an individual's genetic liability for a trait or disease. To achieve this, we have built a pipeline that turns an individual's genetic sequence information into a genetic risk score. We would like to use the UK Biobank dataset to test the utility and generalizability of our pipeline.
Because this dataset has genetic information on half a million individuals, together with matched measurements of a wide variety of traits and disease outcomes, it is a unique resource for assessing our pipeline. By using this resource, we will be able to iterate on and improve our algorithms to generate a product that can take anyone's genetic data and turn it into a genetic risk prediction. We anticipate that this product will be of use to genetic screening programs as well as to individual consumers hoping to learn about the effect of their genome on their bodies, ultimately leading to better disease risk prediction, prognosis, and stratification.
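
To make the idea of a genetic risk score more concrete, the sketch below shows one common and very simple form: a weighted sum in which each variant's allele count is multiplied by the effect size reported for it in a GWAS. This is an illustration under stated assumptions, not a description of the pipeline itself. The function name, the variant identifiers, and the effect sizes are all hypothetical.

```python
def genetic_risk_score(dosages, effect_sizes):
    """Return a simple additive genetic risk score.

    dosages: maps a variant ID to the number of copies of the effect
        allele the individual carries (0, 1, or 2).
    effect_sizes: maps a variant ID to the per-allele effect size
        reported by a GWAS.
    Variants present in only one of the two inputs are ignored.
    """
    shared_variants = dosages.keys() & effect_sizes.keys()
    return sum(dosages[v] * effect_sizes[v] for v in shared_variants)


# Hypothetical effect sizes and genotypes, made up for illustration only.
effect_sizes = {"variant_a": 0.25, "variant_b": -0.125, "variant_c": 0.5}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = genetic_risk_score(dosages, effect_sizes)
print(score)
```

A raw score like this has little meaning on its own. It becomes informative when compared with the distribution of scores in a large reference population, which is one reason a dataset of this size is useful for validation.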