
Approved Research

Develop statistical genetic methods with hidden subgroups

Principal Investigator: Professor Wei Li
Approved Research ID: 78898
Approval date: January 26th 2023

Lay summary

In this study, we will develop novel methods to identify which subgroups of people have a higher risk of disease (e.g., breast cancer) than others.

Previously, scientists have shown that some people are born with variations in their DNA (known as "risk variants") that increase their lifetime risk of some diseases (e.g., breast cancer). Using these risk variants, scientists defined the polygenic risk score (PRS) to predict whether a person has a high risk of developing a certain disease (e.g., breast cancer). However, previous studies do not account for differences between subgroups of people. In this study, we will develop novel methods that take these population differences into account to obtain more accurate disease risk predictions.
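As an illustration of the PRS idea described above, a score is commonly computed as a weighted sum of a person's risk-allele counts. The following is a minimal sketch under that assumption, not the method this project will develop; the variant names, weights and dosages are hypothetical values invented for the example.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of allele dosages over variants that have a weight.

    dosages: risk-allele count per variant for one person (0, 1 or 2).
    weights: per-variant effect size (hypothetical values in this sketch).
    Variants missing from the person's data are skipped.
    """
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical variants and effect sizes, for illustration only.
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 1.0}
person = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = polygenic_risk_score(person, weights)
print(score)
```

A higher score indicates a higher predicted genetic risk relative to other people scored with the same weights.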

We will focus mainly on brain diseases and breast cancer first. If the results are promising, we will extend our methods to other human diseases. The whole project is expected to last about three years.

We believe that our study will make significant contributions to disease (e.g., breast cancer and brain diseases) risk prediction. From a public health perspective, if we can identify the subgroup of people who have high disease risk, then we can provide appropriate care for them.

Scope extension, April 2024:

We aim to develop (1) a novel polygenic risk score (PRS) prediction model and (2) a novel 3'UTR alternative polyadenylation (APA) transcriptome-wide association study (3'aTWAS) model, both of which incorporate subgroups of patients. Specifically, we will apply K-lines, a statistical method we recently developed, to individual-level GWAS data to identify hidden subgroups of patients. We will then build subgroup-specific prediction models of PRS and APA values. We will also use the UK Biobank data to verify the computational efficiency of our algorithm.
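The idea of subgroup-specific prediction models can be sketched as follows. This example assumes the subgroup labels are already available (it does not implement K-lines, whose details are not given here) and swaps in a simple one-variable least-squares fit per subgroup as a stand-in for the actual prediction model. All names and data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_by_subgroup(
    labels: Sequence[str], xs: Sequence[float], ys: Sequence[float]
) -> Dict[str, Tuple[float, float]]:
    """Fit one line per subgroup label instead of one line for everyone."""
    grouped: Dict[str, Tuple[List[float], List[float]]] = defaultdict(lambda: ([], []))
    for label, x, y in zip(labels, xs, ys):
        grouped[label][0].append(x)
        grouped[label][1].append(y)
    return {label: fit_line(gx, gy) for label, (gx, gy) in grouped.items()}


# Hypothetical data: the predictor relates to the outcome differently in each subgroup.
labels = ["g1", "g1", "g1", "g2", "g2", "g2"]
xs = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0, 3.0, 2.0, 1.0]

models = fit_by_subgroup(labels, xs, ys)
pooled = fit_line(xs, ys)
```

In this toy data the two subgroups have opposite slopes, so a single pooled model would miss the relationship that the per-subgroup models recover.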

We will focus mainly on brain disorders and breast cancer first. If the results are promising, we will extend our methods to other human diseases.

With our proposed subgroup-PRS model, we expect to improve PRS-based risk stratification for disease screening and targeted intervention. With our proposed subgroup-based 3'aTWAS model, we hope to increase the prediction accuracy of APA usage and, consequently, to explain a large fraction of GWAS risk SNPs enriched in 3'UTRs and gene downstream regions.

We will develop a novel polygenic risk score with hidden subgroups by combining SNP and tandem repeat information. To do this, we will first use the UK Biobank data to genotype hidden variants, such as tandem repeats, and build a resource database presenting tandem repeat allele/length frequencies in diverse ancestries. We will use hidden variants, such as the length groups of tandem repeats, to define hidden subgroups of patients/individuals. We will then carry out association studies, including GWAS, RWAS and PheWAS, to find associations between phenotypes and both classical SNPs and new hidden variants, such as tandem repeats. Finally, we will develop a novel polygenic risk score with hidden subgroups by combining SNPs and hidden variants, such as tandem repeats.
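The two tandem repeat steps described above, tabulating length frequencies by ancestry and assigning individuals to length groups, might look like the following minimal sketch. The ancestry labels, repeat lengths and the cut-offs for "short", "medium" and "long" are all hypothetical and chosen only for illustration; the project's actual grouping rules are not specified here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def length_frequencies(records: Iterable[Tuple[str, int]]) -> Dict[str, Dict[int, float]]:
    """Frequency of each repeat length within each ancestry.

    records: (ancestry, repeat_length) pairs, one per observed allele.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for ancestry, length in records:
        counts[ancestry][length] += 1
    return {
        ancestry: {length: n / sum(c.values()) for length, n in c.items()}
        for ancestry, c in counts.items()
    }


def length_group(length: int, short_max: int = 10, long_min: int = 20) -> str:
    """Bin a repeat length into a group; the cut-offs are hypothetical."""
    if length <= short_max:
        return "short"
    if length >= long_min:
        return "long"
    return "medium"


# Hypothetical alleles for two placeholder ancestry labels.
records = [("A", 8), ("A", 8), ("A", 12), ("A", 25), ("B", 12), ("B", 25)]

freqs = length_frequencies(records)
groups = [length_group(length) for _, length in records]
```

The resulting group labels could then serve as the hidden subgroup assignment in a subgroup-aware analysis.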