Model-based inference Inference of population structure

Inference of ancestry from multilocus genotype
data
March 2014
Outline
I
Statistical methods for estimating, visualizing and interpreting
population genetic structure
I
PCA and STRUCTURE
I
Geographically explicit approaches to the inference of
individual admixture proportions
I
How many ancestral populations are there in a sample?
Population genetic structure
I
Many organisms form genetically differentiated subpopulations
(herds, colonies, schools, prides, packs).
I
Importance of geographic scales (regional, local)
Population genetic structure
I
Natural populations are not random mating populations.
I
There is a geographic range within which individuals are more
closely related to one another than to those far apart.
I
The inference of population structure is influenced by the
demographic history of a species, past events of population
fission and fusion, migrations, etc.
I
A clear understanding of population structure is useful for
detecting genes under selection.
Regional and local scales
I
At the regional scale, clusters and clines are the consequences
of demographic processes such as colonization, admixture,
fragmentation, reproductive isolation, or selective processes
like local adaptation.
I
At the local scale, restricted dispersal creates local patches of
identical genotypes, spatial autocorrelation of allele
frequencies, and long-range isolation by distance patterns
Genetic clusters and clines
I
Clines: Large-scale spatial trends in allele frequencies or
genetic diversity
Patterns of Isolation by distance
Clusters and clines are not mutually exclusive patterns
Hewitt 2000
Back to population genetics: Hardy-Weinberg Equilibrium
I
Allele and genotype frequencies in a population remain
constant from generation to generation.
I
It assumes an infinitely large population size, random mating,
no mutation, no migration, and selective neutrality.
I
The genotype frequencies are deduced from the allele
frequencies.
            0 (q)    1 (p)
0 (q)       q^2      pq
1 (p)       pq       p^2

Genotype frequencies at a bi-allelic locus (Single Nucleotide
Polymorphism). Heterozygosity H = 2pq.
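As a quick numerical check of the table, the sketch below computes the expected genotype frequencies from the frequency p of allele 1. The function name `hw_genotype_freqs` is illustrative and does not come from the slides.

```python
def hw_genotype_freqs(p):
    """Expected Hardy-Weinberg genotype frequencies at a bi-allelic locus.

    p is the frequency of allele 1 and q = 1 - p the frequency of allele 0.
    Returns (frequency of 00, frequency of the 01 heterozygote, frequency of 11).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return q * q, 2.0 * p * q, p * p


f00, f01, f11 = hw_genotype_freqs(0.3)
print(f00, f01, f11)
```

The three frequencies always sum to one, and the middle value is the heterozygosity H = 2pq.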
Linkage Disequilibrium (LD)
I
Non-random association of alleles at two or more loci
I
Considering two bi-allelic loci, A and B, LD can be measured
by
$$D = p_{AB} - p_A p_B = \mathrm{cov}(A, B)$$
I
In the absence of evolutionary forces other than random
mating, the linkage disequilibrium measure D converges to zero
at a rate equal to the recombination rate between the two loci.
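The decay can be made concrete with the standard recursion under random mating, D_t = (1 - r)^t D_0, where r is the recombination rate between the two loci. The helper name `ld_after` is an illustrative choice, not something from the slides.

```python
def ld_after(d0, r, generations):
    """Linkage disequilibrium after a number of generations of random mating.

    d0: initial value of D
    r: recombination rate between the two loci (0.5 for unlinked loci)
    """
    return d0 * (1.0 - r) ** generations


# Unlinked loci (r = 0.5): D is halved at every generation.
d_unlinked = ld_after(0.25, 0.5, 3)
# Tightly linked loci (r = 0.01): D decays much more slowly.
d_linked = ld_after(0.25, 0.01, 3)
print(d_unlinked, d_linked)
```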
Population structure creates LD at unlinked loci
I
Suppose our sample contains two populations in equal
proportions.
I
In population 1, we have $p_{A1} = 1$ and $p_{B1} = 0$ ($D = 0$).
I
In population 2, we have $p_{A2} = 0$ and $p_{B2} = 1$ ($D = 0$).
I
In the sample, $p_A = 1/2$ and $p_B = 1/2$.
Thus, because $p_{AB} = 0$, the linkage disequilibrium measure is
non-zero, and maximal in absolute value: $|D| = 1/4$.
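The same example can be checked numerically. The sketch below codes each haplotype as a pair of 0/1 alleles (A, B) and computes D within each population and in the pooled sample. The sample size of 50 haplotypes per population is an arbitrary assumption.

```python
import numpy as np


def ld_coefficient(haplotypes):
    """D = p_AB - p_A * p_B for an array of shape (n_haplotypes, 2) of 0/1 alleles."""
    a = haplotypes[:, 0]
    b = haplotypes[:, 1]
    return np.mean(a * b) - np.mean(a) * np.mean(b)


pop1 = np.tile([1, 0], (50, 1))   # allele A fixed, allele B absent
pop2 = np.tile([0, 1], (50, 1))   # allele A absent, allele B fixed
pooled = np.vstack([pop1, pop2])  # equal proportions of both populations

d_pop1 = ld_coefficient(pop1)
d_pop2 = ld_coefficient(pop2)
d_pooled = ld_coefficient(pooled)
print(d_pop1, d_pop2, d_pooled)
```

Within each population D is zero, but pooling the two gives D = -1/4, even though nothing was assumed about physical linkage between the loci.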
Population structure creates Hardy-Weinberg disequilibrium
I
Suppose our population contains two subpopulations in equal
proportions, each in HW equilibrium.
I
Consider a bi-allelic (0/1) locus and let $p_1$ and $p_2$ denote the
allele frequencies in subpopulations 1 and 2 (frequencies of 1).
I
In the total population, we have
$$p = \frac{p_1 + p_2}{2} \quad \text{and} \quad H = p_1 q_1 + p_2 q_2$$
Thus, $H \neq 2pq$ whenever $p_1 \neq p_2$.
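The size of the gap between H and 2pq follows from a short expansion, using only the definitions on this slide with $q_k = 1 - p_k$ and $q = 1 - p$:

```latex
\begin{align*}
H    &= p_1(1 - p_1) + p_2(1 - p_2) = 2p - (p_1^2 + p_2^2), \\
2pq  &= 2p(1 - p) = 2p - 2p^2, \\
2pq - H &= p_1^2 + p_2^2 - \frac{(p_1 + p_2)^2}{2} = \frac{(p_1 - p_2)^2}{2} \ge 0 .
\end{align*}
```

The pooled population therefore shows a deficit of heterozygotes relative to HW expectations, and the deficit grows with the difference in allele frequencies between the two subpopulations.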
Bayesian clustering algorithms
I
Assume K unknown subpopulations, n individuals genotyped
at L loci.
I
No-admixture (fission) model: For each individual i and each
cluster k, compute the probability that individual i originates
in cluster k.
I
Admixture (fission and fusion) model: For each individual i
and each cluster k, compute the fraction of the genome of
individual i that originates in cluster k.
Genetic structure of human populations
I
Each individual is represented by a segment of length 1.
I
Each cluster is represented by a color.
Building a likelihood
I
Clusters minimize HW and LD disequilibria
I
Data: $(y_{i\ell})_{i \le n, \ell \le L}$ is a matrix of 0 and 1, where each
individual i is coded with 2 rows.
I
Clusters: $(z_i)_{i \le n}$, $z_i \in \{1, \dots, K\}$
I
Allele frequencies: Given $z_i = k$,
$$\Pr(y_{i\ell} = 1 \mid z_i = k, p) = p_{\ell,k}$$
and
$$\Pr(y \mid z, p) = \prod_{i=1}^{n} \prod_{\ell=1}^{L} \prod_{j=1}^{2} p_{\ell,z_i}^{\,y_{i\ell}^{j}} \, (1 - p_{\ell,z_i})^{1 - y_{i\ell}^{j}}$$
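The likelihood above can be evaluated directly on the log scale. The sketch below assumes a particular array layout that is not specified in the slides: `y` has shape (n, 2, L) with the two allele copies of each individual on the second axis, cluster labels are coded 0 to K-1 rather than 1 to K, and all frequencies lie strictly between 0 and 1.

```python
import numpy as np


def log_likelihood(y, z, p):
    """log Pr(y | z, p) for the no-admixture model.

    y: array (n, 2, L) of 0/1 allele copies (two copies per individual)
    z: array (n,) of cluster labels in {0, ..., K-1}
    p: array (L, K), p[l, k] = frequency of allele 1 at locus l in cluster k
    """
    # Allele frequencies of each individual's cluster, shape (n, 1, L),
    # so that they broadcast over the two allele copies.
    freq = p[:, z].T[:, None, :]
    return np.sum(y * np.log(freq) + (1 - y) * np.log1p(-freq))


# Toy data: 2 individuals, 2 loci, 2 clusters.
y = np.array([[[1, 0], [1, 1]],
              [[0, 0], [0, 1]]])
z = np.array([0, 1])
p = np.array([[0.9, 0.2],
              [0.5, 0.5]])
ll = log_likelihood(y, z, p)
print(ll)
```

In this toy case the likelihood is (0.9 x 0.5)^2 x (0.8 x 0.5)^2 = 0.0324, so `ll` equals log(0.0324).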
Prior distributions
I
Independent allele frequencies are sampled at each locus $\ell$ and
in each cluster k
$$p_{\ell,k} \sim \mathrm{beta}(\lambda_1, \lambda_2)$$
(default value $\lambda_i = 1$).
I
Uniform distribution on individual cluster labels
$$\Pr(z_i = k) = \frac{1}{K}, \quad k = 1, \dots, K.$$
MCMC algorithm
I
Gibbs sampler: Start from an initial configuration of allele
frequencies and cluster labels. Then repeat the following steps
I
Step 1: Update allele frequencies given the clusters
$$p^{t} \sim \Pr(p \mid y; z^{t-1})$$
I
Step 2: Update clusters given the allele frequencies
$$z^{t} \sim \Pr(z \mid y; p^{t})$$
I
Special case of a Metropolis-Hastings algorithm (without
rejection).
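A minimal sketch of the two steps for the no-admixture model, assuming the beta prior and uniform label prior from the previous slide. It is not the code of the structure program: the array layout (n, 2, L), the 0-based labels, the symmetric prior parameter `lam` and the toy data are all assumptions made for illustration.

```python
import numpy as np


def gibbs_no_admixture(y, K, n_iter=200, lam=1.0, seed=0):
    """Gibbs sampler sketch. y: array (n, 2, L) of 0/1 allele copies.

    Returns the last sampled cluster labels z (n,) and frequencies p (L, K).
    """
    rng = np.random.default_rng(seed)
    n = y.shape[0]
    ones = y.sum(axis=1)              # (n, L): copies of allele 1 per individual
    z = rng.integers(K, size=n)       # random initial cluster labels
    for _ in range(n_iter):
        # Step 1: p | y, z. The beta prior is conjugate, so the posterior is beta.
        onehot = np.eye(K)[z]         # (n, K) cluster membership indicators
        c1 = ones.T @ onehot          # (L, K) allele-1 counts per cluster
        c0 = (2 - ones).T @ onehot    # (L, K) allele-0 counts per cluster
        p = np.clip(rng.beta(lam + c1, lam + c0), 1e-12, 1 - 1e-12)
        # Step 2: z | y, p. With a uniform prior the posterior is
        # proportional to the likelihood of each individual in each cluster.
        logw = ones @ np.log(p) + (2 - ones) @ np.log1p(-p)   # (n, K)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=w_i) for w_i in w])
    return z, p


# Toy data: two strongly differentiated clusters of 20 individuals, 50 loci.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 20)
true_freq = np.where(truth[:, None, None] == 0, 0.95, 0.05)
y = (rng.random((40, 2, 50)) < true_freq).astype(int)
z, p = gibbs_no_admixture(y, K=2)
# Cluster labels are arbitrary, so compare up to a swap of the two labels.
agreement = max(np.mean(z == truth), np.mean(z != truth))
print(agreement)
```

In practice one would keep all sampled values after a burn-in period and average them, rather than return only the last state as this sketch does.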
Statistical mixture model
I
Long known in statistics as the latent class model
(Lazarsfeld and Henry 1968, Goodman 1974)
I
Bayesian implementation popularized by the software
structure (Pritchard et al 2000)
I
Important innovation in structure: it includes admixture
models (next slides).
Admixture models
I
Genetic admixture is the process by which a hybrid population
is formed from contributions by two or more parental (or
ancestral) populations
I
In an admixed population, individual genomes are themselves
(to a greater or a lesser extent) admixed.
I
Bayesian clustering methods are capable of calculating
individual admixture proportions where the ancestral
populations are not imposed by the sampling process.
Divergence vs Admixture of populations
Admixture of two ancestral populations
What changes in the model?
I
Introduction of additional parameters: Q-matrix (n × K
dimensions)
$$q_{i,k} = \text{proportion of individual } i\text{'s genome from population } k$$
I
One cluster for each allele copy
$$z_{i,\ell} = \text{population of origin of allele copy } y_{i,\ell}$$
and
$$\Pr(z_{i,\ell} = k \mid p, q) = q_{i,k}$$
where
$$q_{i,\cdot} \sim \mathcal{D}(\alpha_1, \dots, \alpha_K)$$
is a Dirichlet distribution.
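To see how the admixture model generates data, the sketch below simulates it forward: draw each individual's admixture proportions from the Dirichlet prior, draw a population of origin for every allele copy, then draw the allele from that population's frequency. The function name, the (n, 2, L) layout and the 0-based population labels are illustrative assumptions.

```python
import numpy as np


def simulate_admixture(p, alpha, n, seed=0):
    """Simulate genotypes under the admixture model.

    p: (L, K) allele-1 frequencies; alpha: (K,) Dirichlet parameters.
    Returns q (n, K), z (n, 2, L) and y (n, 2, L).
    """
    rng = np.random.default_rng(seed)
    L, K = p.shape
    q = rng.dirichlet(alpha, size=n)             # admixture proportions
    # Population of origin of each allele copy: Pr(z = k) = q[i, k],
    # drawn by comparing a uniform number with the cumulative proportions.
    u = rng.random((n, 2, L))
    cum = np.cumsum(q, axis=1)
    z = (u[..., None] > cum[:, None, None, :]).sum(axis=-1)
    z = np.minimum(z, K - 1)                     # guard against rounding in cum
    # Allele copy: 1 with probability p[l, z[i, j, l]].
    freq = p[np.arange(L), z]
    y = (rng.random((n, 2, L)) < freq).astype(int)
    return q, z, y


p = np.column_stack([np.full(100, 0.9), np.full(100, 0.1)])  # L = 100, K = 2
q, z, y = simulate_admixture(p, alpha=np.ones(2), n=5)
print(q.round(2))
```

Small values of the Dirichlet parameters concentrate the proportions near 0 or 1 (mostly unadmixed individuals), whereas large values give individuals with similar, intermediate proportions.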
How can we account for geography?
I
Gradients in gene frequencies are created by the contact of
two or more populations.
I
Their shape is sigmoidal (Barton and Hewitt 1986)
Modeling the cline
Extension of the algorithm: including spatial information
I
Population genetic structure varies across space. Let $x_i$ be
the spatial coordinates of individual i. We assume that
$$\log \alpha_{i,\cdot} = f(x_i)^T \beta_{\cdot} + \epsilon_{i,\cdot}$$
where the prior distribution on $\beta$ is non-informative and $\epsilon$ is
spatially autocorrelated Gaussian noise.
I
tess computer program (Durand et al 2009).
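The regression on the Dirichlet parameters can be illustrated as follows. This is only a sketch and not the tess implementation: the linear trend f(x) = (1, longitude, latitude), the exponential covariance of the noise, and the parameters `sigma` and `theta` are all assumptions chosen for illustration.

```python
import numpy as np


def spatial_dirichlet_params(coords, beta, sigma=0.5, theta=5.0, seed=0):
    """Dirichlet parameters alpha (n, K) from a spatial log-linear model.

    coords: (n, 2) spatial coordinates; beta: (3, K) regression coefficients.
    """
    rng = np.random.default_rng(seed)
    n = coords.shape[0]
    K = beta.shape[1]
    design = np.column_stack([np.ones(n), coords])   # assumed f(x_i) = (1, lon, lat)
    # Assumed noise model: covariance decays exponentially with distance.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = sigma ** 2 * np.exp(-dist / theta)
    eps = rng.multivariate_normal(np.zeros(n), cov, size=K).T   # (n, K)
    return np.exp(design @ beta + eps)


rng = np.random.default_rng(2)
coords = rng.uniform([-10, 35], [30, 65], size=(30, 2))   # longitude, latitude
beta = np.array([[0.0, 0.0],
                 [0.1, -0.1],    # opposite longitudinal trends for the 2 clusters
                 [0.0, 0.0]])
alpha = spatial_dirichlet_params(coords, beta)
print(alpha.shape)
```

Individuals that are close in space receive similar Dirichlet parameters, and hence similar prior admixture proportions.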
Geographic admixture of 2 parental populations (simple model)
Results of structure and tess
I
FST (of the pooled parental populations) is a measure that
quantifies the departure from HWE in the ancestral population
Choosing K: The Deviance Information Criterion
I
Let $\hat{\theta}$ be a point estimate and define the deviance as
$$D(\theta, y) = -2 \log p(y \mid \theta)$$
I
The predicted (or expected) deviance is
$$\bar{D} = E_{\theta \mid y}[D(\theta, y)]$$
and the effective number of parameters is $p_D = \bar{D} - D(\hat{\theta}, y)$.
I
Define
$$\mathrm{DIC} = \bar{D} + p_D$$
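Given the output of an MCMC run, the criterion is a few lines of arithmetic. The sketch below assumes the deviance has already been evaluated at each sampled parameter value and at the point estimate. The numbers in the usage example are made up for illustration.

```python
import numpy as np


def dic(deviance_samples, deviance_at_estimate):
    """Deviance Information Criterion from MCMC output.

    deviance_samples: D(theta, y) evaluated at each posterior sample of theta
    deviance_at_estimate: D(theta_hat, y) at the point estimate theta_hat
    Returns (DIC, mean deviance, effective number of parameters).
    """
    d_bar = float(np.mean(deviance_samples))   # posterior mean deviance
    p_d = d_bar - deviance_at_estimate         # effective number of parameters
    return d_bar + p_d, d_bar, p_d


dic_value, d_bar, p_d = dic([102.0, 98.0, 100.0], 95.0)
print(dic_value, d_bar, p_d)
```

To choose K, one would compute this value for runs with different numbers of clusters and favor the models with smaller DIC.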
Choosing K: DIC curves
[Figure: information criterion versus number of clusters (1 to 6) for Model 1 and Model 2, with labels K1 and K2.]
SWITCH TO POWER POINT
A realistic scenario for a contact zone (Non equilibrium
stepping-stone simulation)
[Figure: map with axes Longitude (−10 to 30) and Latitude (35 to 65).]
SWITCH TO POWER POINT
Results
Results of inference
[Figure: map with axes Longitude (−10 to 30) and Latitude (35 to 65).]
Application to Fundulus heteroclitus
Application to Gypsophila repens
Conclusion messages
I
Bayesian algorithms detect population structure and
individual admixture levels without a need to predefine
ancestral populations
I
Admixture models should be preferred to models without
admixture unless we have evidence of pure subpopulation
divergence
I
The choice of the number of clusters is a very difficult
problem but heuristic criteria seem to perform well in many
cases
Bibliography and resources
I
Hartl DL, Clark AG (2007) Principles of Population Genetics,
Fourth Edition. Sinauer
I
Pritchard JK, Stephens M, Donnelly P (2000) Inference of
population structure using multilocus genotype data. Genetics
155: 945-959.
I
Durand E, Jay F, Gaggiotti OE, François O (2009) Spatial
inference of admixture proportions and secondary contact
zones. Molecular Biology and Evolution 26:1963-1973.