For many important complex traits, genome-wide association studies (GWAS) have recovered only a small proportion of the variance in disease prevalence known to be caused by genetics. The most common explanation is the presence of multiple rare mutations, each contributing a small amount to the prevalence of the trait. GWAS lacks the statistical power to identify such rare mutations. Even common mutations with small effects are difficult to identify if more than one is required for the trait. These mutations, rare or common, may be concentrated in relatively few genes, as is the case for many known Mendelian diseases, where the mutations are often compound heterozygous (CH), that is, two different mutations, one on each copy of the same gene. Because each mutation contributes little by itself to the prevalence of the disease, GWAS lacks the power to identify genes contributing to a CH-trait.
In this paper, we address the problem of finding genes that are causal for CH-traits by introducing a discrete optimization problem called the Phenotypic Distance Problem. We show that it can be solved efficiently on realistically sized simulated CH-data using integer linear programming (ILP). The empirical results strongly validate this approach. Joint work with Rasmus Nielsen (UC Berkeley).