The fine-scale genetic structure of the French population.
Aude Saint Pierre,1,2,3,4 Céline Bellenguez,5,6,7 Sébastien Letort,1,2,3 Luc Letenneur,8,9
Claudine Berr,10,11 Carole Dufouil,8,9 Claude Férec,1,2,3 Philippe Amouyel,5,6,7,12
Emmanuelle Génin,1,2,3
1 Inserm, UMR1078, 29200 Brest, France
2 Université Bretagne Occidentale, 29200 Brest, France
3 Centre Hospitalier Régional Universitaire, 29200 Brest, France
4 Département Hommes Natures Sociétés, UMR7206, Musée de l'Homme, 75016 Paris, France
5 Inserm, U744, 59000 Lille, France
6 Université Lille 2, 59000 Lille, France
7 Institut Pasteur, 59000 Lille, France
8 Inserm U897, 33076 Bordeaux, France
9 Université Bordeaux 2, 33000 Bordeaux, France
10 Inserm, U1061, 34093 Montpellier, France
11 Université Montpellier 1, 34967 Montpellier, France
12 Centre Hospitalier Régional Universitaire de Lille, 59000 Lille, France
Abstract
The existence of population stratification is a major problem in case-control association
studies and there is a need for a better assessment of allele frequency variation within
populations at all geographic scales. Such efforts have been conducted in different
European countries where strong patterns of geographic variations were found. The
genome-wide extent of variations in allele frequencies of common variants has however
never been documented at the scale of France. In this study, we describe these patterns
of variation using genome-wide SNP chip data from 4,433 individuals, recruited as part
of the Three-City study and whose places of birth in France were available. We show
that there is a strong correlation between the top three principal components extracted
from the genetic data and the latitude and longitude of birth places. Using multiple
linear regression models, we were able to determine the birth places within less than
197 km of the reported origin for 50% of the individuals. Using model-based clustering
with seven main geographic regions, we found that individuals were assigned in
majority to their true region of origin. However, we found that information on ancestry
could not be retrieved by using a small panel of Ancestry-Informative Markers (AIMs).
Introduction
The existence of allele frequency differences between populations living in
different geographic areas is a major concern for genetic association studies as it could
lead to false positive results or failure to detect a true association. This problem, often
referred to as “population stratification”, has long been pointed out by geneticists. It
has led to the development of alternative study designs relying on related rather
than unrelated individuals 1; 2.
However, there have been many debates regarding the real impact of population
stratification on association studies. Some authors argued that if the study was well
designed with individuals sampled within the same continent, the risk of false positives
was rather limited 3-6.
With the advent of genome-wide association studies and the genotyping of large
samples of cases and controls with ancestries in different populations, it has become
easier to measure the real impact of population stratification on association studies, and
calls for caution have been raised 7; 8. Indeed, allele frequency differences are detectable
at all geographic scales. They were found between countries within Europe 9-12 but also
within countries 13-18. A recent study revealed patterns of fine-scale genetic
differentiation within the UK 18.
While some European populations have been extensively studied, this is not the case
for the French population. Although the existence of strong regional differences in HLA
allele distribution is rather well documented 19-21, genome-wide geographic variations
have been far less studied. Only one recent study so far has investigated this issue, with a
focus on the western part of the country, where several SNPs were found to exhibit allele
frequency differences between districts 16.
In this study, we provide a descriptive analysis of the pattern of population
structure at the scale of France with the objective of better understanding the genetic
variation of the French population and informing future large scale genetic mapping
studies. For this purpose, we analyse genome-wide SNP data on a large sample of 4,433
individuals, recruited as part of the Three-City study 22 and for whom detailed
information on their places of birth was available.
Using principal component analysis (PCA), we characterize fine-scale
population structure and assess genetic differentiation between regions. We also address
the question of the prediction of places of birth from genetic data using multiple linear
regression and classification algorithms. Finally, we explore the extent to which the
geographic location of an individual within France can be predicted solely from
small subsets of selected ancestry informative markers.
Material and Methods
Samples
We used samples from the Three-City Study (3C Study) that was designed to examine
the relationship between vascular diseases and dementia in 9,294 persons aged 65 years
and over. For more details on the study, see http://www.three-city-study.com/the-threecity-study.php.
All analyses were performed on individuals who had not developed dementia or
cognitive impairment by the time their blood sample was taken and who were
genotyped in the study by Lambert et al. 22. The geographical locations of individuals
were defined according to the latitude and longitude of their place of birth, declared at
the moment of enrolment in the cohort. Individuals with missing place of birth were
excluded. We kept individuals born in metropolitan France and excluded individuals
born in Corsica because of their low number (n=5). The study population included
4,659 unrelated individuals from seven geographical regions: “Grand-Ouest” (GO,
n=356), “Grand-Est” (GE, n=2,432), “Nord” (NO, n=131), “Île-de-France” (IDF,
n=370), “Rhône-Alpes” (RA, n=241), “Méditerranée” (MED, n=249) and “Sud-Ouest”
(SO, n=880).
Genotyping data
Samples were genotyped with Illumina Human610-Quad BeadChip in the Centre
National de Génotypage as described elsewhere 22. Standard quality control procedures
implemented in PLINK version 1.7 23 were used to remove low quality individuals and
low quality SNPs. We followed the recommendations from Anderson et al. 24.
Individuals were removed if they had a call rate < 97%. A heterozygosity threshold of
± 3 standard deviations (SD) from the mean was used to remove individuals with an
excessive or reduced proportion of heterozygote genotypes. A Principal Component
Analysis (PCA) was performed to exclude individuals with non-European ancestry.
PCA was performed on the combined set of the 3C Study sample with the 1000
Genomes individuals 25. We conducted the analysis using the CEU, FIN (Finnish), GBR
(British), IBS (Spanish) and TSI (Italian) samples. The data set included 379 individuals:
85 CEU, 93 FIN, 89 GBR, 14 IBS and 98 TSI. Only SNPs that were present in our
GWAS data set and common with the 1000G data were kept in further analysis.
Analysis included 477,640 autosomal SNPs and 4,812 individuals. Plots were visually
inspected and 5 presumably non-European individuals were excluded. Identity by
descent (IBD) statistics were calculated with the --genome option in PLINK to identify
related samples. An exclusion threshold of 0.1875 was used. Applying all these QC
filters led to the removal of 192 individuals. All samples failing sample-level QC were
removed prior to performing SNP QC. Markers were removed if they had a genotype
missing rate > 1%, a minor allele frequency < 1% or departed from Hardy–Weinberg
proportions (P ≤ 10^-8). The final study population included 4,467 unrelated individuals
and 493,065 autosomal SNPs.
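The heterozygosity filter described above can be illustrated with a minimal Python sketch. It is not the PLINK procedure that was actually run; the function name `heterozygosity_outliers` and the toy rates are hypothetical and serve only to show the ± 3 SD rule.

```python
from statistics import mean, stdev

def heterozygosity_outliers(het_rates, n_sd=3.0):
    """Return the indices of samples whose heterozygosity rate lies more
    than n_sd sample standard deviations away from the sample mean."""
    mu = mean(het_rates)
    sd = stdev(het_rates)
    return [i for i, h in enumerate(het_rates) if abs(h - mu) > n_sd * sd]

# Toy example (hypothetical values): 19 typical samples and one with excess heterozygosity.
rates = [0.32] * 19 + [0.40]
flagged = heterozygosity_outliers(rates)
```

Because the mean and SD are computed on the same samples that are being filtered, a single extreme sample can only be flagged when the sample is reasonably large, which is not a concern at the scale of several thousand individuals.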
Evidence for fine-scale population stratification in France
Principal Component Analysis (PCA)
PCA was carried out using the Eigenstrat software 26. The default procedure was used
for outlier removal, with up to 5 PCA iterations, each removing individuals lying more
than 6 standard deviations from the mean on any of the top 10 PCs.
Outlier individuals were removed prior to performing further analyses. SNPs located in
known regions of long range linkage disequilibrium (LD) in European populations 27
were not included in the analysis. SNPs in strong LD were also pruned out using the
--indep-pairwise option in PLINK with an r2 threshold of 0.2, a window size of 50 SNPs
and a step of 5 SNPs.
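To make the idea of LD pruning concrete, here is a minimal Python sketch. It is a simplified greedy variant and an assumption on our part, not PLINK's exact sliding-window algorithm: a SNP is kept only if its squared genotype correlation with every already-kept SNP among the preceding `window` SNPs does not exceed the threshold. The function names and toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as 0 here
    return sxy * sxy / (sxx * syy)

def greedy_ld_prune(genotypes, window=50, r2_max=0.2):
    """genotypes: list of per-SNP genotype vectors in genomic order.
    Returns the indices of the SNPs that are kept."""
    kept = []
    for j, g in enumerate(genotypes):
        recent = [k for k in kept if j - k < window]
        if all(r_squared(genotypes[k], g) <= r2_max for k in recent):
            kept.append(j)
    return kept

# Toy example: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
snps = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], [1, 0, 1, 1, 2, 1]]
kept = greedy_ld_prune(snps)
```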
Geographic relevance of PCs
We applied three tests to determine if the PCs were correlated with geography in
France. (i) Using Spearman’s rank correlation coefficient, we tested for significance of
association between latitude, longitude and PC coordinates (‘cor.test’ function in R).
(ii) We applied a Procrustes test using the ‘vegan’ package in R with 1,000,000
permutations to assess the similarity between PCA maps of genetic variation and
geographic maps of population locations. The similarity between two maps is quantified
by a Procrustes similarity statistic named t0. (iii) We performed a Kruskal-Wallis test to
evaluate if the distribution of PC scores varied significantly between regions
(‘kruskal.test’ function in R) followed by a post-hoc test implemented in the
‘kruskalmc’ function from the R package ‘pgirmess’.
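The Procrustes statistic and its permutation test can be sketched in pure Python for two-dimensional maps. This is an illustration under stated assumptions, not the ‘vegan’ implementation: both configurations are centred and scaled to unit sum of squares (a symmetric analysis), and t0 is taken as the sum of the singular values of their 2x2 cross-product matrix. The function names and the default of 999 permutations are hypothetical.

```python
import math
import random

def procrustes_t0(X, Y):
    """Symmetric Procrustes correlation between two 2-D configurations,
    given as equally long lists of (x, y) pairs with matching rows."""
    def standardise(P):
        n = len(P)
        cx = sum(p[0] for p in P) / n
        cy = sum(p[1] for p in P) / n
        centred = [(p[0] - cx, p[1] - cy) for p in P]
        norm = math.sqrt(sum(a * a + b * b for a, b in centred))
        return [(a / norm, b / norm) for a, b in centred]

    A, B = standardise(X), standardise(Y)
    # Entries of the 2x2 cross-product matrix M = A^T B
    m00 = sum(a[0] * b[0] for a, b in zip(A, B))
    m01 = sum(a[0] * b[1] for a, b in zip(A, B))
    m10 = sum(a[1] * b[0] for a, b in zip(A, B))
    m11 = sum(a[1] * b[1] for a, b in zip(A, B))
    frob2 = m00 ** 2 + m01 ** 2 + m10 ** 2 + m11 ** 2
    det = m00 * m11 - m01 * m10
    # For a 2x2 matrix, (s1 + s2)^2 = ||M||_F^2 + 2 |det M|
    return math.sqrt(frob2 + 2 * abs(det))

def permutation_p_value(X, Y, n_perm=999, seed=0):
    """Permute the rows of Y and count how often t0 reaches the observed value."""
    rng = random.Random(seed)
    observed = procrustes_t0(X, Y)
    shuffled = list(Y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if procrustes_t0(X, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: Y is X rotated by 90 degrees, doubled in size and shifted.
X = [(0, 0), (1, 0.2), (2, 1.5), (3, 0.7), (0.5, 2.2), (1.7, 3.1), (2.9, 2.4), (4.2, 1.1)]
Y = [(-2 * y + 5, 2 * x - 3) for x, y in X]
t0, p = permutation_p_value(X, Y, n_perm=199, seed=1)
```

With this construction the smallest attainable p-value is 1 / (n_perm + 1), which is why a very large number of permutations is needed to report p-values as small as 10^-6.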
Fst and runs of homozygosity (ROH)
We investigated local genetic differentiation based on the pairwise Fst 28 matrix
between geographical regions of France. The Fst was calculated for every SNP and then
averaged across the genome to obtain a genome-wide estimate of the genetic distance
between regions. All 493,065 SNPs were used.
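The per-SNP calculation followed by genome-wide averaging can be sketched as below. The estimator actually used is the one given in reference 28 and is not restated in this section; as an explicit assumption, the sketch uses the simple Wright definition Fst = (H_T - H_S) / H_T for two equally weighted populations, computed from allele frequencies. The function names are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Wright's Fst for one biallelic SNP, from the allele frequencies p1 and p2
    of two equally weighted populations (an assumed, simplified estimator)."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_t == 0:
        return None                         # SNP monomorphic in both populations
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return (h_t - h_s) / h_t

def genome_wide_fst(freqs1, freqs2):
    """Average the per-SNP Fst values over all informative SNPs."""
    values = [fst_two_pops(a, b) for a, b in zip(freqs1, freqs2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)
```

For example, frequencies of 0.4 and 0.6 give H_T = 0.5 and H_S = 0.48, hence Fst = 0.04, far larger than the values below 0.1% reported between French regions.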
ROH were estimated separately for each region (PLINK, --homozyg option). To
compensate for unequal sampling in the different regions of interest, a subset of 700
individuals was considered, 100 randomly selected in each region. Default parameters
were used to reconstruct ROHs: a 5 Mb window size, a minimum of 50 SNPs per
window and allowance for 1 heterozygous and 5 missing calls per window. The final
segments that were called as homozygous had a minimum number of 100 contiguous
SNPs, a minimum length of 1 Mb and a minimum density of 1 SNP per 50 kb. Two
consecutive SNPs within a segment could be at most 1 Mb apart. We assessed two
features of ROH: 1) the sum of the lengths of all ROH for an
individual, measured in megabases (Mb) (averaged across individuals within a
geographical region); 2) the proportion, FROH, of the autosomal genome falling in ROH.
This latter proportion was computed considering all ROHs longer than 1 Mb (FROH1) and
considering the subset of ROHs longer than 5 Mb (FROH5). All 493,065 SNPs were used
in the Fst and ROH calculations.
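The two ROH summaries can be written down in a few lines. In this hedged sketch, `roh_lengths_mb` is the list of ROH lengths called for one individual and `autosome_length_mb` is the length of the autosomal genome covered by the SNP array; the latter value is not given in the text, so the figure used in the example is purely hypothetical.

```python
def f_roh(roh_lengths_mb, autosome_length_mb, min_length_mb=1.0):
    """Proportion of the autosomal genome lying in ROH of at least min_length_mb."""
    total = sum(length for length in roh_lengths_mb if length >= min_length_mb)
    return total / autosome_length_mb

# Toy individual with three ROH segments and a hypothetical 2,000 Mb genome.
segments = [1.2, 6.0, 2.8]
froh1 = f_roh(segments, 2000.0, min_length_mb=1.0)  # all segments count
froh5 = f_roh(segments, 2000.0, min_length_mb=5.0)  # only the 6 Mb segment counts
```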
Prediction of geographical location
In order to assess the predictive power of the genotypic data to determine the
geographic areas of ancestry of individuals, cross validation was used.
Selection of a training and test data set
Two training sets of 700 individuals were selected. The first training set, referred to as
the “random set”, was composed of the 100 individuals selected at random in each
region that were used in the ROH analysis. The second training set, referred to as the
“extremePC set”, included individuals selected based on their coordinates on the first
axis of variation (PC1): individuals with the 50 top and the 50 bottom PC1 values in
each of the 7 regions were included. In each case, all remaining individuals (n=3,733)
were included in the test set.
Linear regression model of PCs
We assigned each individual to a specific geographic location by fitting independent
linear models for latitude and longitude 12. To perform assignment, we first estimated
the coefficients of regression on the training set of individuals and then used the
estimated coefficients of the linear model to predict latitude and longitude of individuals
from the test set on the basis of their PC scores. PC scores were computed
using only the random set of individuals, and individuals from
the test set were then projected onto those principal components using the -w
option in smartpca. We used the rotated PC1 and PC2 scores estimated from the
Procrustes analysis in the whole sample because they correlate more strongly with
latitude and longitude.
Starting from the full model with PC1 and PC2, two quadratic terms and one interaction
term, we performed stepwise regression in both directions (backward elimination and
forward selection) with the ‘step’ function in R. Models were compared according to their
Akaike information
criterion (AIC). The model with the lowest AIC was selected and the estimated
coefficients of regression were then used to estimate the latitude and the longitude of
individuals in the test sample. Distances in kilometers between estimated and observed
geographical coordinates were calculated using the Haversine formula.
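The distance calculation can be illustrated with a short Python sketch of the Haversine formula. The Earth radius of 6,371 km is an assumed value that the text does not specify, and the helper `median_error_km` is a hypothetical name for the summary used to report prediction accuracy.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed, not stated in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def median_error_km(predicted, observed):
    """Median distance between predicted and observed (latitude, longitude) pairs."""
    return median(haversine_km(p[0], p[1], o[0], o[1])
                  for p, o in zip(predicted, observed))

# Approximate city-centre coordinates, used here only as an illustration.
paris_bordeaux = haversine_km(48.8566, 2.3522, 44.8378, -0.5792)
```

The Paris to Bordeaux distance obtained this way is close to 500 km, which gives a sense of scale for the median prediction error reported in this study.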
Classification of individuals into the seven regions
To study how individuals can be assigned to the seven regions using the genetic data,
we used two alternative methods: the model-based clustering method implemented in
admixture 29 and the K-nearest-neighbour (KNN) algorithm. Admixture was run in a
supervised mode with the 7 reference populations defined by the random training data
set. Each individual in the test sample was assigned to the population with the highest
posterior probability. For KNN, we used the approach proposed by Huckins et al. 10 to
assign an ancestry to each individual. Briefly, ancestry was assigned based on
the majority vote on the regions of origin of the 5 nearest neighbours of
each individual, as determined with the PLINK --neighbour option. In cases where the
five nearest neighbours did not reach a majority vote, only the four nearest neighbours
were selected, and a majority vote was taken again. If this was still unsuccessful, only
the top three neighbours were used. If still no majority vote was reached, the sample
was classed as ‘unclassified’. The analyses were based on the pruned subset of SNPs
after the removal of the SNPs located in regions of long range LD and SNPs in strong
LD.
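The fallback voting rule used for KNN can be sketched as follows. The text does not define "majority" precisely; as an assumption, the sketch requires a strict majority, that is, more than half of the k neighbours sharing the same region. The function name is hypothetical and the neighbour search itself (done with PLINK in the study) is not reproduced.

```python
from collections import Counter

def knn_region_vote(neighbour_regions, ks=(5, 4, 3)):
    """neighbour_regions: regions of origin of the nearest neighbours,
    ordered from closest to farthest. Tries k = 5, then 4, then 3."""
    for k in ks:
        label, count = Counter(neighbour_regions[:k]).most_common(1)[0]
        if count > k / 2:  # strict majority (assumed definition)
            return label
    return "unclassified"

clear_vote = knn_region_vote(["GE", "GE", "GE", "SO", "NO"])    # settled with k = 5
fallback_vote = knn_region_vote(["GE", "GE", "SO", "SO", "NO"])  # settled with k = 3
no_vote = knn_region_vote(["GE", "SO", "NO", "RA", "MED"])       # never settled
```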
Evaluation of the quality of classification
We assessed the performance of the classification by comparing the population
assignments obtained with admixture and KNN with the regions of origin. Only
individuals from the test data set were used for assessing performances of admixture
whereas for KNN, we considered the overall sample as in Huckins et al. 10. We
calculated the Correct Classification Rate (CCR), which is defined as the proportion of
individuals that were correctly assigned to their region of origin. This indicator
considers only the one-to-one relationship between estimated groups and true
populations and might not reflect the similarity between partitions well. A measure of
similarity between two partitions can be evaluated using the Rand index 30. This index
measures the proportion of similar assignments of pairs of points; i.e. the proportion of
pairs of points that are placed either together in a group or in different groups in both
partitions. The Rand index ranges between 0 and 1. It takes the value of 1 when the two
classifications are identical and 0 when no pair of points appears either in the same
group or in different groups in both partitions. Because the expected value of the Rand
index between two random partitions can vary, Hubert and Arabie 31 suggested using a
corrected version of the Rand index, the Adjusted Rand Index (ARI). Based on the
contingency table (Table 1), the ARI is of the form:
\[
\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{N}{2}}
{\frac{1}{2} \left[ \sum_{i} \binom{a_i}{2} + \sum_{j} \binom{b_j}{2} \right] - \left[ \sum_{i} \binom{a_i}{2} \sum_{j} \binom{b_j}{2} \right] \Big/ \binom{N}{2}}
\]
where a_i and b_j are the numbers of individuals in the i-th group of partition U and the
j-th group of partition V respectively, n_ij is the number of individuals belonging to both
the i-th group of U and the j-th group of V, and N is the total number of individuals
(Table 1). The ARI is bounded above by 1, which indicates perfect agreement between the
two partitions, and takes the value 0 when the index equals its expected value.
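The three indicators can be computed directly from two label vectors. The sketch below is a plain Python illustration of the definitions given above, with hypothetical function names; it builds the contingency counts n_ij, a_i and b_j with `collections.Counter` and uses `math.comb` for the binomial coefficients.

```python
from collections import Counter
from math import comb

def correct_classification_rate(true_labels, assigned_labels):
    """Proportion of individuals assigned to their true region."""
    hits = sum(t == a for t, a in zip(true_labels, assigned_labels))
    return hits / len(true_labels)

def _pair_sums(u, v):
    """Sums of C(n_ij, 2), C(a_i, 2), C(b_j, 2) and the total number of pairs C(N, 2)."""
    s_n = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    s_a = sum(comb(c, 2) for c in Counter(u).values())
    s_b = sum(comb(c, 2) for c in Counter(v).values())
    return s_n, s_a, s_b, comb(len(u), 2)

def rand_index(u, v):
    """Proportion of pairs treated the same way (together or apart) in both partitions."""
    s_n, s_a, s_b, total = _pair_sums(u, v)
    return (total + 2 * s_n - s_a - s_b) / total

def adjusted_rand_index(u, v):
    """Rand index corrected for chance (Hubert and Arabie)."""
    s_n, s_a, s_b, total = _pair_sums(u, v)
    expected = s_a * s_b / total
    maximum = (s_a + s_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions consist of a single group
    return (s_n - expected) / (maximum - expected)

truth = [0, 0, 1, 1]
guess = [0, 0, 1, 2]
ccr = correct_classification_rate(truth, guess)
ri = rand_index(truth, guess)
ari = adjusted_rand_index(truth, guess)
```

Unlike the CCR, the Rand index and the ARI do not depend on how the groups are named: relabelling the groups of one partition leaves them unchanged, which is why they complement the CCR when comparing partitions.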
Selection of Ancestry Informative Markers (AIMs)
Derivation of AIMs
To obtain different subsets of AIMs with different sizes N, we identified the top N
SNPs that correlate with the latitude and longitude variation. We performed two
independent genome-wide association studies in the random sample, with latitude
and longitude as the respective phenotypes, assuming an additive genetic model.
In each analysis, SNPs were ranked according to their p-value from the association test
and the most significant SNPs were selected. We picked subsets of
equal size from the two association analyses performed with latitude and
longitude respectively. The subsets of best SNPs from both association studies were then merged
together. SNPs located in known regions of long range LD were removed and SNPs in
strong LD were pruned out to obtain the final set of AIMs. In this study, we aimed to
select SNPs with the strongest pattern of differentiation for geographical variation so we
did not adjust the association test for top PCs scores to avoid washing-out the effect of
population stratification.
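The ranking and merging step can be sketched as below. This is only an illustration: the SNP identifiers and p-values are made up, the function name `select_aims` is hypothetical, and the subsequent removal of long range LD regions and LD pruning is not shown.

```python
def select_aims(p_lat, p_lon, n_per_axis):
    """p_lat, p_lon: dicts mapping SNP identifiers to association p-values for
    latitude and longitude. Returns the union of the n_per_axis most
    significant SNPs from each analysis, without duplicates."""
    top_lat = sorted(p_lat, key=p_lat.get)[:n_per_axis]
    top_lon = sorted(p_lon, key=p_lon.get)[:n_per_axis]
    return list(dict.fromkeys(top_lat + top_lon))  # order-preserving union

# Hypothetical p-values for four SNPs.
p_lat = {"rs1": 1e-8, "rs2": 0.5, "rs3": 1e-3, "rs4": 0.2}
p_lon = {"rs1": 0.3, "rs2": 1e-6, "rs3": 1e-4, "rs4": 0.9}
aims = select_aims(p_lat, p_lon, n_per_axis=2)
```

Because a SNP can rank highly in both analyses, the merged panel may contain fewer than twice `n_per_axis` markers.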
Validation of AIMs
Different criteria were used to study the relevance of the different panels of selected
AIMs.
First, to determine how these AIMs performed at assigning individuals to the different
regions, we ran admixture in a supervised mode, taking the random training set as the
reference population. Classification performance was evaluated using the Correct
Classification Rate (CCR) and the Adjusted Rand Index (ARI).
Second, for each panel of AIMs we assigned each individual to a specific geographic
location by fitting independent linear models for latitude and longitude. Prediction
performance was assessed based on the mean and median distance in km
between the predicted and reported origin of individuals in the test sample.
The performances calculated with various numbers of AIMs were compared to the
performances calculated with the entire set of SNPs.
Results
Evidence for fine-scale population stratification in France
Even though they were sampled in only three cities (Bordeaux, Montpellier and Dijon),
the genotyped individuals in the 3C study were in fact born all over France and the
sample includes representatives from the seven major regions of France (Figure 1 and
Table S1). Among the 4,467 individuals, 34 individuals were tagged as possible outliers
after running iteratively Eigenstrat and were removed from all subsequent analyses that
were thus based on a sample of 4,433 individuals. The first three PCs of the PCA
performed on the genotypes of these individuals were found to account for 0.07%,
0.04% and 0.04% of the total variation in the data, respectively (Figure 2). The first
principal component roughly differentiates the northern and north-eastern regions (NO, GE)
from the southern and south-western regions (SO, MED). The second PC mainly differentiates
western (GO, NO) from south-eastern (MED) individuals. Some individuals from GE
cluster with the south-eastern group whereas others fall into the western group. The
third PC shows more subtle patterns of differentiation, suggesting a separation of
the north-eastern and south-western regions from the central band of France (Figure
3). The Europe-wide PCA (Figure S1) with 1000G samples confirmed this
trend with a slightly clearer pattern of differentiation on PC1 and PC2, which accounted
for 0.10% and 0.06% of the total variation respectively.
The top three PCs were significantly correlated with geographical axes (p-value ≤ 10^-16).
The Spearman correlation coefficients for PC1, PC2 and PC3 were respectively -0.60,
0.23 and -0.13 with latitude and 0.57, -0.19 and -0.26 with longitude. The Procrustes test
comparing the maps obtained from PCs and from geographical coordinates was
significant (p-value ≤ 10^-6), with a Procrustes similarity statistic t0 of 0.55
for PC1 and PC2 versus latitude and longitude, 0.48 for PC1 and PC3 versus latitude
and longitude, and 0.29 for PC2 and PC3 versus latitude and longitude. The PC score
distributions differed between regions (Kruskal-Wallis test p-value ≤
10^-16) and all pairwise comparisons, except for GO versus IDF, were significant at a
global significance level of 5%.
Genetic distances between regions measured by pairwise Fst statistics revealed subtle
differences between regions. The largest difference was observed between NO and SO
(Fst = 0.068%) followed by GE and SO (Fst = 0.049%) and NO and MED (Fst =
0.046%). The genetic differentiation between regions increased with the geographic
distances in France (Figure 4 and Table S2) and a significant correlation was found
between the distances in kilometers and the Fst values (Pearson correlation = 0.68,
p-value = 6 × 10^-4). The IDF region showed very little differentiation from the GE, GO and
RA regions (Fst ~ 0.004%). This could be expected due to the more cosmopolitan
characteristics of IDF compared to other regions.
In the random subset of 700 individuals (100 individuals from each region), 13,336 runs
of homozygosity (ROH) of at least 1 Mb were detected (on average 19 per individual),
among which 94 (on average 0.13 per individual) had a length of 5 Mb or greater. Their
distribution varied between regions. The average proportion of the genome in
ROHs was largest for individuals from the SO region (FROH1 = 1.10%), followed by
NO and RA (FROH1 between 1.06% and 1.08%), while IDF and MED had the lowest proportions
(FROH1 between 0.96% and 1.00%) (Figure 5). The SO region was slightly different from the
others for ROH with, on average, an increased proportion of the genome in ROH of at
least 1 Mb and an increased summed length of ROH segments. Focusing on longer
ROH segments of at least 5 Mb, which are likely to be identical by descent and to signal
some level of inbreeding, we found that the RA region had the largest proportion of
the genome in ROH of at least 5 Mb (FROH5 = 0.10%) followed by NO, GE and SO
(FROH5 ~ 0.06%). Again, IDF had the lowest proportion of the genome in ROH (FROH5
= 0.01%). Similar patterns were observed for the summed segment sizes (Figure S2).
Overall, IDF was found to have the lowest proportion of the genome in ROH (as
measured by ROH over 1 and 5 Mb in length) and SO and RA the highest proportions.
Prediction of geographical location
The rotated PC1 and PC2 scores estimated from the Procrustes analysis in the whole
sample are shown in Figure S3. The best linear regression model to predict latitude in
the training dataset was found to include the rotated PC1, the rotated PC2, one quadratic
term for the rotated PC2 and the interaction between the rotated PC1 and the rotated
PC2 (AIC=751.09). The best model for longitude included the rotated PC1, the rotated
PC2 and one quadratic term for the rotated PC1. The AIC value was greater
(AIC=994.27) suggesting that latitude was better predicted by PCs scores than
longitude. Using these fitted models on the test sample, 50% of the individuals could be
located within 197 km of their reported origin and 90% within 332 km (Figure 6). The
scatter plot of individuals showed a variation from northeast to southwest. The origins
of individuals from the GE and RA regions were best predicted, with the lowest mean
and median distances between their predicted and reported origins (mean=185
km, median=167 km for GE; mean=180 km, median=185 km for RA). Conversely, the
origins of individuals from the NO and MED regions were poorly predicted (mean=360
km, median=366 km for NO; mean=345 km, median=336 km for MED).
To assess the impact of the training sample selection on linear prediction, we also
performed the training on 100 individuals in each region selected on the top and bottom
tail of the PC1 coordinates for the region (extremePC training set). The AIC values
from multiple linear regression in this extremePC training sample increased to 1115.18
for longitude and to 945.00 for latitude. Selecting individuals randomly might therefore
better reflect the overall ancestry of our sample than selecting individuals on the
extremes of the first PC scores. This was confirmed by an increase in the distances
between the observed and the predicted geographical origins.
When running admixture on the test sample, we found that individuals were assigned in
majority to their true region of origin (Figure 7A, Figure S4). This was true for all
regions except IDF but, for IDF, misclassified individuals were usually assigned to a
neighbouring region (GO or NO). The CCR and ARI were 0.31 and 0.10 respectively. The
KNN method based on the majority vote from the nearest neighbours performed
better on these indices, with a slightly higher ARI (0.11) and a
much larger CCR of 0.5. However, this higher value was obtained
because of the over-representation of two regions in the sample (GE and SO), into which
the KNN method tended to cluster all individuals (Figure 7B and Figure S5). For
individuals originating outside of these two regions, the proportion
assigned to GE was higher than 55%, suggesting that the overall classification
performance of KNN was very poor compared to admixture.
Ancestry Informative Markers (AIMs)
From the results of the association tests performed against latitude and longitude, sets of
between 127 and 101,386 AIMs were selected. Their performance at assigning individuals
to their regions of origin with admixture was evaluated by measuring the relative
increase of the CCR and ARI indexes when using each subset of AIMs compared to the
full panel of SNPs. The classification performance with small panels of AIMs was
quite low. The ARI obtained with the entire set of SNPs was only matched
when using 94% of the SNPs (~95,000 SNPs), and the CCR
when using 24% of the SNPs (~24,000 SNPs) (Figure S6).
Small panels of AIMs were also found to perform poorly for assigning individuals to
their region of origin (Figure S7).
Discussion
In this study, we investigated population stratification in France using a large
sample of 4,433 individuals for whom detailed information on birth places was
available. This is one of the largest studies performed so far to investigate genome-wide
patterns of variation in France. Interestingly, even if they were sampled only in three
cities in France, the places of birth of individuals from the 3C sample were in fact
evenly distributed across the French territory. This was one of the strengths of the 3C
sample to investigate the performance of assignment at the scale of France. Moreover,
the fact that the 3C study only includes elderly people born before 1935 at a period in
time where migrations were rather limited was also an advantage for this study. Their
place of birth was, except perhaps for the IDF region, a good indicator of the region of
origin of their ancestors. This was not the case however for the places where they were
sampled that could not be used to trace back their origin. This raises serious concerns on
the studies that used sampling places as surrogate for geographical origins.
Using a simple PCA, it was possible to differentiate northern from southern
regions on PC1 and western from south-eastern regions on PC2, similar to what was
observed within different European countries 14; 17. Combining the 1000 Genomes
individuals with the French population confirmed this trend, with the West (GO) and
North (NO) regions of France clustering with the CEU and GBR populations from 1000
Genomes. The first three PCs in the 3C sample were significantly correlated with
geographic axes and PC scores varied significantly between regions, with PC1 showing
the highest correlation with geography and the strongest differences between North and
South regions.
Similarly, the genetic differentiation was well correlated with geographical
distances with the strongest Fst values found between North-East and South-West
regions (Fst=0.0007 between NO and SO). The IDF region of Paris, on the other hand,
exhibited the smallest levels of differentiation with the other regions and especially the
neighbouring regions (GO and GE). This was consistent with the long history of
migrations of individuals from the different regions of France to the French capital to
find a job. ROH analyses told a similar story with individuals from the SO and NO
regions showing more homozygous segments than individuals from the IDF region. The
RA region, which includes the Alps, stood out when focusing on the longest ROHs
of at least 5 Mb, since FROH5 was highest in this region. This could be due to the
existence of mountains in this region that have, until quite recently, been natural barriers
against population movements 32. Indeed, ROHs of 5 Mb and longer are most likely due
to inbreeding and thus a good indicator of population isolation 33.
In agreement with previous studies performed at wider geographic scales 12; 34; 35,
we found that genetic data can be used to gain some information on the origin of
individuals within France. PCs were found to be good predictors of the longitude and
latitude of the places of birth of individuals. Using multiple linear regression, one can
place 50% of the individuals within 197 km of their reported origin and 90% within 332
km. This is better than the performance achieved within Europe by Novembre et al. 12.
Indeed, in this latter study, 50% of individuals could be assigned within 310 km of their
reported origin and 90% within 700 km of their origin. The lower distances of
assignment might be due to the finer-scale information on origin available for each
individual in our sample. In our sample, the distances of assignment from the reported
origins were the lowest for regions located on the Northeast-Southwest axis. To avoid
sample-size-associated bias, principal components were estimated using only the subset
of individuals picked at random in equal numbers from each region. Individuals from the
test sample were then projected onto those principal components.
Using the model-based clustering method implemented in admixture in a
supervised mode with a reference population of 100 individuals picked at random in
each region, we found that individual ancestries are mainly distributed among neighbouring
regions such as SO and MED or RA and MED. Performances of both the
model-based clustering and the linear regression approaches vary by region with, as
expected from the PCA plots, individuals from the SO region in the South-West being
usually better classified than individuals from the other regions. Interestingly, for these
other regions, the two approaches perform differently, with admixture performing
better than the linear regression in MED and GO.
Rather than using a model-based clustering approach, it was suggested that a
simpler method based on the K-nearest neighbour (KNN method) could also be efficient
for the prediction of geographic coordinates 10; 34. In particular, it was shown that within
Europe, the proportion of individuals correctly assigned (CCR) to their region of origin
was high, reaching 80% for several populations 10. This number is much higher than the
value found here since only 50% of individuals were well classified in their region of
origin. However, we were stricter in our evaluation compared to Huckins et al. 10 as they
considered that an individual was correctly classified if he/she was assigned to his/her
country of origin or a country with a Fst value of less than 0.001 as this Fst value was
considered as the threshold below which populations may not be considered genetically
distinct. All the pairwise Fst values we computed here between the 7 regions were
below 0.001, but we could still detect some differences between the regions, and the fact
that 50% of the individuals are assigned to their true region of origin is a rather good
achievement. However, this relatively good performance of the KNN method compared to
admixture in terms of correct classification rates hides the fact that the results are quite
different from one region to another. The KNN method performed well for the two
major regions (GE and SO), where the majority of individuals come from, but performed
very badly for the other regions, with CCRs of less than 10%. This was not the case with
admixture, which gave better results in these other regions with, for example, a CCR above
50% in the MED region.
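To make the contrast between an overall CCR and per-region CCRs concrete, here is a minimal sketch of KNN classification on PC coordinates followed by a region-by-region CCR. The coordinates, the choice of k and the function names are illustrative assumptions; only the region labels GE and SO are taken from the text, and the numbers do not reproduce the study's results.

```python
import numpy as np
from collections import Counter

def knn_predict(train_pcs, train_labels, test_pcs, k):
    """Assign each test individual the majority region among its k nearest neighbours."""
    train_labels = np.asarray(train_labels)
    predictions = []
    for point in test_pcs:
        distances = np.linalg.norm(train_pcs - point, axis=1)
        nearest = np.argsort(distances, kind="stable")[:k]
        votes = Counter(train_labels[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

def ccr_by_region(true_labels, predicted_labels):
    """Correct classification rate computed separately for each true region."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return {
        str(region): float(np.mean(predicted_labels[true_labels == region] == region))
        for region in np.unique(true_labels)
    }

# Toy PC coordinates (hypothetical) for two regions.
train_pcs = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
train_labels = ["GE", "GE", "GE", "SO", "SO", "SO"]
test_pcs = np.array([[0.5, 0.5], [10.5, 10.5], [9.0, 9.0]])
test_labels = ["GE", "SO", "GE"]

predicted = knn_predict(train_pcs, train_labels, test_pcs, k=3)
overall_ccr = float(np.mean(predicted == np.asarray(test_labels)))
per_region_ccr = ccr_by_region(test_labels, predicted)
```

In this toy case two of the three test individuals are correctly classified overall, yet the per-region rates differ (0.5 for GE, 1.0 for SO), which is the kind of heterogeneity that a single pooled CCR can hide.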
Several studies proposed panels of Ancestry Informative Markers (AIMs) to
infer ancestry for samples of European origin 10; 34; 36; 37. A small number of these AIMs
may be used to perform population classification. We obtained a list of AIMs from their
association with longitude and latitude variations. We explored whether an optimal set
of AIMs could be used to infer ancestry of individuals. Our results show that more than
24% of the full set of SNPs are needed to obtain classification performances similar to
those obtained with the full set of SNPs. This analysis illustrates the fact that the
information contained in the full set of SNPs cannot easily be summarized using a smaller
number of representative AIMs. We found that small panels of AIMs designed to detect
fine-scale population stratification perform poorly in the 3C sample.
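One way to rank SNPs by their association with geography and keep a top fraction as candidate AIMs is sketched below. The text does not specify the association statistic, so the per-SNP R-squared from a linear regression of genotype on latitude and longitude is an assumption made for illustration, as are the simulated data and the function names.

```python
import numpy as np

def rank_snps_by_geography(genotypes, latitude, longitude):
    """Rank SNPs by the R^2 of a regression of genotype on latitude and longitude."""
    n = genotypes.shape[0]
    design = np.column_stack([np.ones(n), latitude, longitude])
    # One least-squares fit per SNP, solved jointly for all columns.
    coef, _, _, _ = np.linalg.lstsq(design, genotypes, rcond=None)
    residuals = genotypes - design @ coef
    ss_res = (residuals ** 2).sum(axis=0)
    ss_tot = ((genotypes - genotypes.mean(axis=0)) ** 2).sum(axis=0)
    # Monomorphic SNPs (zero variance) get R^2 = 0.
    r_squared = np.where(ss_tot > 0, 1.0 - ss_res / np.maximum(ss_tot, 1e-12), 0.0)
    order = np.argsort(-r_squared, kind="stable")
    return order, r_squared

def top_fraction(order, fraction):
    """Keep the best-ranked fraction of SNPs as the candidate AIM panel."""
    n_keep = max(1, int(round(fraction * len(order))))
    return order[:n_keep]

# Simulated data (hypothetical): 200 individuals, 50 SNPs, one SNP following a latitude cline.
rng = np.random.default_rng(1)
n_individuals, n_snps = 200, 50
latitude = rng.uniform(43.0, 51.0, n_individuals)
longitude = rng.uniform(-4.0, 8.0, n_individuals)
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps)).astype(float)
genotypes[:, 7] = np.digitize(latitude, [45.67, 48.33])  # SNP 7 tracks latitude

order, r_squared = rank_snps_by_geography(genotypes, latitude, longitude)
aim_panel = top_fraction(order, fraction=0.24)
```

A panel built this way can then be passed to any of the classifiers above in place of the full SNP set, and the classification performance compared as the retained fraction is varied.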
In summary, our study revealed fine-scale genetic structure within France. This
is a unique population genetics study in terms of resolution, because of the large sample
distributed all around France with detailed information available on the place of birth.
Our results confirm the tight correlation between genes and geography and thus the
importance of considering stratification in association studies, even when analysing
supposedly homogeneous populations.
Supplemental Data
Supplemental Data include seven figures and two tables.
Acknowledgements:
This work was supported by a grant from the Brittany Region (dispositif SAD stratégie
d’attractibilité durable-projet STATEX) and from Association Gaetan Saleun.
Conflict of Interest: The authors declare no conflict of interest.
Web resources:
1000 Genomes: http://browser.1000genomes.org
References
1. Gauderman, W.J., Witte, J.S., and Thomas, D.C. (1999). Family-based association
studies. J Natl Cancer Inst Monogr, 31-37.
2. Schaid, D.J., and Rowland, C. (1998). Use of parents, sibs, and unrelated controls for
detection of associations between genetic markers and disease. Am J Hum Genet
63, 1492-1506.
3. Wacholder, S., Rothman, N., and Caporaso, N. (2000). Population stratification in
epidemiologic studies of common genetic variants and cancer: quantification of
bias. J Natl Cancer Inst 92, 1151-1158.
4. Wacholder, S., Rothman, N., and Caporaso, N. (2002). Counterpoint: bias from
population stratification is not a major threat to the validity of conclusions from
epidemiological studies of common polymorphisms and cancer. Cancer
Epidemiol Biomarkers Prev 11, 513-520.
5. Khlat, M., Cazes, M.H., Genin, E., and Guiguet, M. (2004). Robustness of
case-control studies of genetic factors to population stratification: magnitude of bias
and type I error. Cancer Epidemiol Biomarkers Prev 13, 1660-1664.
6. Heiman, G.A., Hodge, S.E., Gorroochurn, P., Zhang, J., and Greenberg, D.A. (2004).
Effect of population stratification on case-control association studies. I.
Elevation in false positive rates and comparison to confounding risk ratios (a
simulation study). Hum Hered 58, 30-39.
7. Freedman, M.L., Reich, D., Penney, K.L., McDonald, G.J., Mignault, A.A.,
Patterson, N., Gabriel, S.B., Topol, E.J., Smoller, J.W., Pato, C.N., et al. (2004).
Assessing the impact of population stratification on genetic association studies.
Nat Genet 36, 388-393.
8. Marchini, J., Cardon, L.R., Phillips, M.S., and Donnelly, P. (2004). The effects of
human population structure on large genetic association studies. Nat Genet 36,
512-517.
9. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova, E.,
Foretova, L., Georges, M., Janout, V., Kabesch, M., et al. (2008). Investigation
of the fine structure of European populations with applications to disease
association studies. Eur J Hum Genet 16, 1413-1429.
10. Huckins, L.M., Boraska, V., Franklin, C.S., Floyd, J.A., Southam, L., Sullivan, P.F.,
Bulik, C.M., Collier, D.A., Tyler-Smith, C., Zeggini, E., et al. (2014). Using
ancestry-informative markers to identify fine structure across 15 populations of
European origin. Eur J Hum Genet 22, 1190-1200.
11. Moskvina, V., Smith, M., Ivanov, D., Blackwood, D., Stclair, D., Hultman, C.,
Toncheva, D., Gill, M., Corvin, A., O'Dushlaine, C., et al. (2010). Genetic
Differences between Five European Populations. Hum Hered 70, 141-149.
12. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A.,
King, K.S., Bergmann, S., Nelson, M.R., et al. (2008). Genes mirror geography
within Europe. Nature 456, 98-101.
13. Babron, M.C., de Tayrac, M., Rutledge, D.N., Zeggini, E., and Genin, E. (2012).
Rare and low frequency variant stratification in the UK population: description
and impact on association tests. PLoS One 7, e46519.
14. Abdellaoui, A., Hottenga, J.J., de Knijff, P., Nivard, M.G., Xiao, X., Scheet, P.,
Brooks, A., Ehli, E.A., Hu, Y., Davies, G.E., et al. (2013). Population structure,
migration, and diversifying selection in the Netherlands. Eur J Hum Genet 21,
1277-1285.
15. Esko, T., Mezzavilla, M., Nelis, M., Borel, C., Debniak, T., Jakkula, E., Julia, A.,
Karachanak, S., Khrunin, A., Kisfali, P., et al. (2013). Genetic characterization
of northeastern Italian population isolates in the context of broader European
genetic diversity. Eur J Hum Genet 21, 659-665.
16. Karakachoff, M., Duforet-Frebourg, N., Simonet, F., Le Scouarnec, S., Pellen, N.,
Lecointe, S., Charpentier, E., Gros, F., Cauchi, S., Froguel, P., et al. (2014).
Fine-scale human genetic structure in Western France. Eur J Hum Genet.
17. O'Dushlaine, C.T., Morris, D., Moskvina, V., Kirov, G., Consortium, I.S., Gill, M.,
Corvin, A., Wilson, J.F., and Cavalleri, G.L. (2010). Population structure and
genome-wide patterns of variation in Ireland and Britain. Eur J Hum Genet 18,
1248-1254.
18. Leslie, S., Winney, B., Hellenthal, G., Davison, D., Boumertit, A., Day, T., Hutnik,
K., Royrvik, E.C., Cunliffe, B., Lawson, D.J., et al. (2015). The fine-scale
genetic structure of the British population. Nature 519, 309-314.
19. Prevost, P., Busson, M., and Marcelli-Barge, A. (1984). Distribution of HLA-A,B
alleles in 13 panels of blood donors in France. Tissue Antigens 23, 301-307.
20. Lonjou, C., Clayton, J., Cambon-Thomsen, A., and Raffoux, C. (1995). HLA-A, -B,
-DR haplotype frequencies in France--implications for recruitment of
potential bone marrow donors. Transplantation 60, 375-383.
21. Degioanni, A., Darlu, P., and Raffoux, C. (2003). Analysis of the French National
Registry of unrelated bone marrow donors, using surnames as a tool for
improving geographical localisation of HLA haplotypes. Eur J Hum Genet 11,
794-801.
22. Lambert, J.C., Heath, S., Even, G., Campion, D., Sleegers, K., Hiltunen, M.,
Combarros, O., Zelenika, D., Bullido, M.J., Tavernier, B., et al. (2009).
Genome-wide association study identifies variants at CLU and CR1 associated
with Alzheimer's disease. Nat Genet 41, 1094-1099.
23. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D.,
Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set
for whole-genome association and population-based linkage analyses. Am J
Hum Genet 81, 559-575.
24. Anderson, C.A., Pettersson, F.H., Clarke, G.M., Cardon, L.R., Morris, A.P., and
Zondervan, K.T. (2010). Data quality control in genetic case-control association
studies. Nat Protoc 5, 1564-1573.
25. Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker,
R.E., Kang, H.M., Marth, G.T., and McVean, G.A. (2012). An integrated map of
genetic variation from 1,092 human genomes. Nature 491, 56-65.
26. Patterson, N., Price, A.L., and Reich, D. (2006). Population structure and
eigenanalysis. PLoS Genet 2, e190.
27. Price, A.L., Weale, M.E., Patterson, N., Myers, S.R., Need, A.C., Shianna, K.V.,
Ge, D., Rotter, J.I., Torres, E., Taylor, K.D., et al. (2008). Long-range LD can
confound genome scans in admixed populations. Am J Hum Genet 83, 132-135;
author reply 135-139.
28. Weir, B.S., and Cockerham, C.C. (1984). Estimating F-Statistics for the Analysis of
Population Structure. Evolution 38, 1358-1370.
29. Alexander, D.H., Novembre, J., and Lange, K. (2009). Fast model-based estimation
of ancestry in unrelated individuals. Genome Res 19, 1655-1664.
30. Rand, W.M. (1971). Objective Criteria for the Evaluation of Clustering Methods.
Journal of the American Statistical Association 66, 846-850.
31. Hubert, L., and Arabie, P. (1985). Comparing Partitions. Journal of Classification,
193-218.
32. Vernay, M. (2000). Trends in inbreeding, isonymy, and repeated pairs of surnames
in the Valserine Valley, French Jura, 1763-1972. Hum Biol 72, 675-692.
33. McQuillan, R., Leutenegger, A.L., Abdel-Rahman, R., Franklin, C.S., Pericic, M.,
Barac-Lauc, L., Smolej-Narancic, N., Janicijevic, B., Polasek, O., Tenesa, A., et
al. (2008). Runs of homozygosity in European populations. Am J Hum Genet
83, 359-372.
34. Drineas, P., Lewis, J., and Paschou, P. (2010). Inferring geographic coordinates of
origin for Europeans using small panels of ancestry informative markers. PLoS
One 5, e11892.
35. Hoggart, C.J., O'Reilly, P.F., Kaakinen, M., Zhang, W., Chambers, J.C., Kooner,
J.S., Coin, L.J., and Jarvelin, M.R. (2012). Fine-scale estimation of location of
birth from genome-wide single-nucleotide polymorphism data. Genetics 190,
669-677.
36. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L., Scarnicci, F.,
Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P., et al. (2008).
Discerning the ancestry of European Americans in genetic association studies.
PLoS Genet 4, e236.
37. Bauchet, M., McEvoy, B., Pearson, L.N., Quillen, E.E., Sarkisian, T.,
Hovhannesyan, K., Deka, R., Bradley, D.G., and Shriver, M.D. (2007).
Measuring European population stratification with microarray genotype data.
Am J Hum Genet 80, 948-956.
Figure titles and legends:
Figure 1: The seven geographical regions of France according to the geographical
coordinates latitude and longitude. Individuals are coloured according to the region
where they were born.
Figure 2: Scatter plot of the first three PCs from the PCA performed on the SNP
genotype data of the 4,433 individuals from the Three-City (3C) study. Individuals are
coloured according to the region where they were born.
Figure 3: Distribution of PC values according to the geographical coordinates latitude
and longitude. The colour of the points indicates the range of PC values: a) PC1, b) PC2,
c) PC3. Red colours indicate negative values while green colours indicate positive
values.
Figure 4: Fst distribution according to geographical distance in kilometres (km).
Genetic distance correlates with geographical distance (Pearson correlation = 0.68). We
computed pairwise Fst values between all regions of France, and compared them to
geographic distance in kilometres between the midpoints across individuals of each
region.
Figure 5: Mean FROH1 (entire bars) and FROH5 (gray parts of the bars) in the seven
regions of France.
Figure 6: Prediction of the geographic location of individuals from the test set (n=3,733)
using the multiple linear regression model. A) Expectation: The seven geographical regions
of France according to the geographical coordinates of individuals in the test sample; B)
Prediction of geographical coordinates according to the multiple linear regression
model.
Figure 7:
(A) Proportion of individuals assigned to each group by admixture in each region.
Colours of the seven groups were inferred from the admixture proportion estimated in
the reference sample. Only individuals from the test sample are included in the analysis.
(B) Proportion of individuals assigned to each group by the classification algorithm
KNN in each region.
Table titles and legends:
Table 1: The contingency table between partitions U and V
U / V    V1     V2     ...    VC     Sums
U1       n11    n12    ...    n1C    a1
U2       n21    n22    ...    n2C    a2
...      ...    ...    ...    ...    ...
UR       nR1    nR2    ...    nRC    aR
Sums     b1     b2     ...    bC     N
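A contingency table of this form is the usual starting point for the Rand index 30 and its chance-corrected version, the adjusted Rand index of Hubert and Arabie 31. Assuming that is the use intended here, the sketch below builds the table from two label vectors and computes the adjusted index from the cell counts nij, the row sums ai, the column sums bj and the total N. The function names and example labels are hypothetical.

```python
import numpy as np

def contingency_table(u_labels, v_labels):
    """Cross-tabulate two partitions of the same individuals (rows: U, columns: V)."""
    u_values, u_idx = np.unique(u_labels, return_inverse=True)
    v_values, v_idx = np.unique(v_labels, return_inverse=True)
    table = np.zeros((len(u_values), len(v_values)), dtype=np.int64)
    np.add.at(table, (u_idx, v_idx), 1)
    return table

def pairs(counts):
    """Number of unordered pairs, 'count choose 2', applied elementwise."""
    counts = np.asarray(counts, dtype=np.int64)
    return counts * (counts - 1) // 2

def adjusted_rand_index(table):
    """Adjusted Rand index from a contingency table (undefined for trivial partitions)."""
    a = table.sum(axis=1)              # row sums a_i
    b = table.sum(axis=0)              # column sums b_j
    n = table.sum()                    # total N
    sum_nij = pairs(table).sum()
    sum_a = pairs(a).sum()
    sum_b = pairs(b).sum()
    expected = sum_a * sum_b / pairs(n)    # agreement expected by chance
    maximum = 0.5 * (sum_a + sum_b)        # largest attainable agreement
    return float((sum_nij - expected) / (maximum - expected))

# Same grouping with different label names: the index equals 1.
same = adjusted_rand_index(contingency_table([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
# Groupings that cross each other completely: the index is negative.
crossed = adjusted_rand_index(contingency_table([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is 1 when the two partitions agree up to relabelling, is close to 0 when agreement is no better than chance, and can be negative when agreement is worse than chance.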