How large is large enough? Are small datasets still valuable in the shadow of huge biobank data?
Principal Investigator:
Professor Cathy Fann
Approved Research ID:
46789
Approval date:
April 18th 2019
Lay summary
Our research project addresses two issues. 1) Apart from markers of disease susceptibility that are specific to particular ethnic groups, are smaller biobank datasets still valuable now that large biobank datasets exist? 2) Huge datasets provide great opportunities for mapping disease-associated markers, but how large is large enough? In this project, we will use machine learning techniques to 'oversample' small biobank data and compare the results with those obtained from UK Biobank data across different phenotypes (diseases). The oversampling will be carried out with machine learning methods such as SMOTE (Synthetic Minority Over-Sampling Technique), GUN (Generative Unadversarial Networks) and GAN (Generative Adversarial Networks). The purpose of oversampling is to improve statistical power, which is usually lower for small datasets.
To gain a thorough understanding, we plan to select the phenotypes common to the two biobank datasets (Taiwan Biobank and UK Biobank), about 24 in total, such as hypertension, asthma and cancers. We examine different phenotypes because the genetic contributions to these diseases differ, so the association results may differ as well for a fixed sample size. For diseases with a higher genetic contribution, a modest dataset might be sufficient to identify susceptibility markers.
Most traditional statistical models rest on a few assumptions. For example, the ratio of cases to controls should not be too far from one; in large biobank data, however, the ratio can be 0.01 or even lower, which might distort the results. By using oversampling techniques, we can observe how the test statistics perform under various ratios and thereby identify more appropriate ratios for association tests.
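To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is an illustration only, not the project's implementation: the function name `smote`, its parameters, and the toy feature vectors are all hypothetical. Each synthetic case is placed at a random position on the line segment between an existing case and one of its nearest neighbouring cases.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (hypothetical data).
    k: number of nearest minority neighbours considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy example (made-up numbers): 4 cases and 40 controls, a ratio of 0.1.
cases = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2]]
n_controls = 40
new_cases = smote(cases, n_new=n_controls - len(cases), k=3)
print(len(cases) + len(new_cases), "cases vs", n_controls, "controls")
```

Changing `n_new` gives different case-control ratios, which is one way to study how a test statistic behaves as the ratio varies. Note that the interpolated values are continuous, so features coded as discrete genotypes (0, 1 or 2) would become fractional under this scheme.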
Overall, by using the UK and Taiwan biobank datasets, our study aims to identify important parameters, such as case-control ratio, disease prevalence and effect size (the difference in allele frequency between cases and controls), and to assess their impact on association test results using machine learning techniques.
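As a rough illustration of how these parameters interact, the sketch below computes the allele-frequency difference between cases and controls and a standard Pearson chi-square statistic (1 degree of freedom) on the 2x2 table of allele counts. The function names and all counts are hypothetical, and the allelic chi-square test is used here only as a familiar example of an association test; the summary does not specify which tests the project will use.

```python
def allele_freq(alt_alleles, n_people):
    """Frequency of the alternative allele; each person carries two alleles."""
    return alt_alleles / (2 * n_people)


def allelic_chi_square(case_alt, n_cases, control_alt, n_controls):
    """Pearson chi-square (1 df) on the 2x2 table of allele counts."""
    a, b = case_alt, 2 * n_cases - case_alt            # cases: alt, ref
    c, d = control_alt, 2 * n_controls - control_alt   # controls: alt, ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom


# Made-up counts: allele frequency 0.30 in cases and 0.25 in controls.
effect = allele_freq(60, 100) - allele_freq(50, 100)
print("effect size:", effect)

# Same effect size, 2000 people in total, different case-control ratios.
balanced = allelic_chi_square(600, 1000, 500, 1000)   # ratio 1:1
unbalanced = allelic_chi_square(12, 20, 990, 1980)    # ratio about 0.01
print("chi-square, balanced:", balanced)
print("chi-square, unbalanced:", unbalanced)
```

With the effect size held fixed, the statistic grows with the sample size and shrinks as the case-control ratio moves away from one, which is the kind of dependence the study sets out to quantify.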