Treelets-based approaches to estimating sparse fine-scale population structure from genetic data

Supervisors: Heejung Shim and David Balding

Available for: MSc/PhD and undergraduate research projects.

Location: Melbourne Integrative Genomics, University of Melbourne

Project title: Treelets-based approaches to estimating sparse fine-scale population structure from genetic data

Background: Methods for analyzing population structure from genetic data are widely used to understand human history and to correct for population structure in genome-wide association studies. The most common approaches are admixture-based models [1] and principal components analysis (PCA) [2]. This project will develop new approaches to estimating sparse fine-scale population structure from genetic data. Our methods build on treelets [3], a multi-scale method that extends wavelets [4] to the analysis of unordered data. Treelets simultaneously construct a data-driven hierarchical tree of individuals and a multi-scale orthonormal basis on that tree, both of which capture sparse structure in the genetic data [3].
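To give a feel for the treelet construction described in [3], the following is a minimal Python sketch, not the project's software. It repeatedly merges the two most correlated active variables with a local two-variable PCA (a Jacobi rotation), keeps the higher-variance "sum" variable in the tree, and sets aside the "difference" variable. The function name treelet_basis, the toy data, and the use of arctan2 to choose the rotation angle are assumptions made for illustration. The angle convention may differ from the one in [3], and the sketch assumes no variable has zero variance.

```python
import numpy as np


def treelet_basis(X):
    """Illustrative treelet-style decomposition of the columns of X.

    X is an (observations x variables) array. Returns an orthonormal
    basis B (variables x variables) and the list of merges, each given
    as (sum_index, difference_index).
    """
    C = np.cov(X, rowvar=False)      # covariance between variables
    p = C.shape[0]
    B = np.eye(p)                    # start from the standard basis
    active = list(range(p))          # variables still available to merge
    merges = []

    for _ in range(p - 1):
        # Absolute correlations computed from the current covariance.
        sd = np.sqrt(np.diag(C))
        R = np.abs(C / np.outer(sd, sd))

        # Find the most correlated pair among the active variables.
        best, pair = -1.0, None
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                if R[i, j] > best:
                    best, pair = R[i, j], (i, j)
        a, b = pair

        # Jacobi rotation that zeroes the covariance between a and b.
        theta = 0.5 * np.arctan2(2.0 * C[a, b], C[a, a] - C[b, b])
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(p)
        J[a, a], J[a, b], J[b, a], J[b, b] = c, -s, s, c

        C = J.T @ C @ J              # covariance in the rotated coordinates
        B = B @ J                    # accumulate the orthonormal basis

        # Keep the higher-variance (sum) variable active; drop the other.
        if C[a, a] < C[b, b]:
            a, b = b, a
        active.remove(b)
        merges.append((a, b))

    return B, merges


# Toy data: six variables, with variables 0 and 1 strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

B, merges = treelet_basis(X)
print("first merge:", merges[0])
print("basis is orthonormal:", np.allclose(B.T @ B, np.eye(6)))
```

In this toy example the variables are generic columns. For population structure, the tree would be built over individuals, so the individuals would play the role of the variables being merged.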

Proposed projects: The specific project will depend on the student’s interests and background. Options are 1) developing software for the new methods, 2) contributing to the development of the new methods, or 3) benchmarking admixture-based models and PCA against our treelets-based approaches using real human genetic data and computer simulations.

Learning outcomes: software development, statistics / machine learning, programming using C/C++ (or Python) and R, statistical analysis of complex and large-scale genomic data, data visualization.

References:

[1] Pritchard, Jonathan K., Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics 155.2 (2000): 945-959.

[2] Novembre, John, et al. Genes mirror geography within Europe. Nature 456.7218 (2008): 98-101.

[3] Lee, Ann B., Boaz Nadler, and Larry Wasserman. Treelets: an adaptive multi-scale basis for sparse unordered data. The Annals of Applied Statistics (2008): 435-471.

[4] Shim, Heejung, and Matthew Stephens. Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. The Annals of Applied Statistics (2015): 665-686.

More Information

Heejung Shim

8344 0707