The fine-scale genetic structure of the French population

Aude Saint Pierre,1,2,3,4 Céline Bellenguez,5,6,7 Sébastien Letort,1,2,3 Luc Letenneur,8,9 Claudine Berr,10,11 Carole Dufouil,8,9 Claude Férec,1,2,3 Philippe Amouyel,5,6,7,12 Emmanuelle Génin,1,2,3

1 Inserm, UMR1078, 29200 Brest, France
2 Université Bretagne Occidentale, 29200 Brest, France
3 Centre Hospitalier Régional Universitaire, 29200 Brest, France
4 Département Hommes Natures Sociétés, UMR7206, Musée de l'Homme, 75016 Paris, France
5 Inserm, U744, 59000 Lille, France
6 Université Lille 2, 59000 Lille, France
7 Institut Pasteur, 59000 Lille, France
8 Inserm U897, 33076 Bordeaux, France
9 Université Bordeaux 2, 33000 Bordeaux, France
10 Inserm, U1061, 34093 Montpellier, France
11 Université Montpellier 1, 34967 Montpellier, France
12 Centre Hospitalier Régional Universitaire de Lille, 59000 Lille, France

Abstract

The existence of population stratification is a major problem in case-control association studies and there is a need for a better assessment of allele frequency variation within populations at all geographic scales. Such efforts have been conducted in different European countries where strong patterns of geographic variations were found. The genome-wide extent of variations in allele frequencies of common variants has however never been documented at the scale of France. In this study, we describe these patterns of variation using genome-wide SNP chip data from 4,433 individuals, recruited as part of the Three-City study and whose places of birth in France were available. We show that there is a strong correlation between the top three principal components extracted from the genetic data and the latitude and longitude of birth places. Using multiple linear regression models, we were able to determine the birth places within less than 197 km of the reported origin for 50% of the individuals.
Using model-based clustering with seven main geographic regions, we found that individuals were most often assigned to their true region of origin. However, we found that information on ancestry could not be retrieved by using a small panel of Ancestry-Informative Markers (AIMs).

Introduction

The existence of allele frequency differences between populations living in different geographic areas is a major concern for genetic association studies as it could lead to false positive results or failure to detect a true association. This problem, often referred to as “population stratification”, has long been pointed out by geneticists. It has led to the development of alternative study designs relying on related rather than unrelated individuals [1; 2]. However, there have been many debates regarding the real impact of population stratification on association studies. Some authors argued that if the study was well designed, with individuals sampled within the same continent, the risk of false positives was rather limited [3-6]. With the advent of genome-wide association studies and the genotyping of large samples of cases and controls with ancestries in different populations, it has become easier to measure the real impact of population stratification on association studies, and calls for caution have been raised [7; 8]. Indeed, allele frequency differences are detectable at all geographic scales. They were found between countries within Europe [9-12] but also within countries [13-18]. A recent study revealed a pattern of fine-scale genetic differentiation within the UK [18].

Although some European populations have been extensively studied, this is not the case for the French population. Although the existence of strong regional differences in HLA allele distribution is rather well documented [19-21], genome-wide geographic variations have been far less studied.
Only one recent study so far has investigated this issue, with a focus on the western part of the country where several SNPs were found to exhibit allele frequency differences between districts [16].

In this study, we provide a descriptive analysis of the pattern of population structure at the scale of France with the objective of better understanding the genetic variation of the French population and informing future large scale genetic mapping studies. For this purpose, we analyse genome-wide SNP data on a large sample of 4,433 individuals, recruited as part of the Three-City study [22] and for whom detailed information on their places of birth was available. Using principal component analysis (PCA), we characterize fine-scale population structure and assess genetic differentiation between regions. We also address the question of the prediction of places of birth from genetic data using multiple linear regression and classification algorithms. Finally, we explore the extent to which the geographic location of an individual within France can be predicted solely from small subsets of selected ancestry informative markers.

Material and Methods

Samples

We used samples from the Three-City Study (3C Study) that was designed to examine the relationship between vascular diseases and dementia in 9,294 persons aged 65 years and over. For more details on the study, see http://www.three-city-study.com/the-threecity-study.php. All analyses were performed on individuals who had not developed dementia or cognitive impairment by the time their blood sample was taken and who were genotyped in the study by Lambert et al. [22]. The geographical locations of individuals were defined according to the latitude and longitude of their place of birth, declared at the moment of enrolment in the cohort. Individuals with missing place of birth were excluded.
We kept individuals born in metropolitan France and excluded individuals born in Corsica because of their low number (n=5). The study population included 4,659 unrelated individuals from seven geographical regions: “Grand-Ouest” (GO, n=356), “Grand-Est” (GE, n=2,432), “Nord” (NO, n=131), “Île-de-France” (IDF, n=370), “Rhône-Alpes” (RA, n=241), “Méditerranée” (MED, n=249) and “Sud-Ouest” (SO, n=880).

Genotyping data

Samples were genotyped with Illumina Human610-Quad BeadChip in the Centre National de Génotypage as described elsewhere [22]. Standard quality control procedures implemented in PLINK version 1.7 [23] were used to remove low quality individuals and low quality SNPs. We followed the recommendations from Anderson et al. [24]. Individuals were removed if they had a call rate < 97%. A heterozygosity threshold of ± 3 standard deviations (SD) from the mean was used to remove individuals with an excessive or reduced proportion of heterozygote genotypes.

A Principal Component Analysis (PCA) was performed to exclude individuals with non-European ancestry. PCA was performed on the combined set of the 3C Study sample with the 1000 Genomes individuals [25]. We conducted the analysis using the CEU, FIN (Finnish), GBR (British), IBS (Spanish) and TSI (Italian) samples. The data set included 379 individuals: 85 CEU, 93 FIN, 89 GBR, 14 IBS and 98 TSI. Only SNPs that were present in both our GWAS data set and the 1000G data were kept in further analysis. The analysis included 477,640 autosomal SNPs and 4,812 individuals. Plots were visually inspected and 5 presumably non-European individuals were excluded.

Identity by descent (IBD) statistics were calculated with the genome command in PLINK to identify related samples. An exclusion threshold of 0.1875 was used. Applying all these QC filters led to the removal of 192 individuals. All samples failing sample-level QC were removed prior to performing SNP QC.
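As an illustration only, the two per-sample filters described above (call rate below 97% and heterozygosity beyond ± 3 SD of the mean) could be sketched as follows. The function name, the dictionary inputs and the use of the population standard deviation are assumptions made for this sketch; the study itself applied these filters with PLINK, and the IBD and PCA exclusions are not covered here.

```python
from statistics import mean, pstdev


def failing_samples(call_rate, het_rate, min_call=0.97, n_sd=3.0):
    """Return the IDs of samples failing the call-rate or heterozygosity filter.

    call_rate and het_rate are hypothetical inputs: dictionaries mapping a
    sample ID to its genotype call rate and to its proportion of
    heterozygous genotypes, respectively.
    """
    # Mean and SD of heterozygosity across all samples (population SD is an
    # assumption; the text only states "3 standard deviations from the mean").
    mu = mean(het_rate.values())
    sd = pstdev(het_rate.values())

    failed = set()
    # Filter 1: call rate strictly below the threshold (97% in the text).
    for sample_id, rate in call_rate.items():
        if rate < min_call:
            failed.add(sample_id)
    # Filter 2: excessive or reduced heterozygosity.
    for sample_id, het in het_rate.items():
        if abs(het - mu) > n_sd * sd:
            failed.add(sample_id)
    return failed
```

Returning a set of IDs rather than filtering in place keeps the two criteria easy to audit separately before samples are dropped.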
Markers were removed if they had a genotype missing rate > 1%, a minor allele frequency < 1% or departed from Hardy–Weinberg proportions (P ≤ 10⁻⁸). The final study population included 4,467 unrelated individuals and 493,065 autosomal SNPs.

Evidence for fine-scale population stratification in France

Principal Component Analysis (PCA)

PCA was carried out using the Eigenstrat software [26]. The default procedure was used for outlier removal, with up to 5 PCA iterations, each run after removing individuals for whom any of the top 10 PCs departed by more than 6 standard deviations from the mean. Outlier individuals were removed prior to performing further analyses. SNPs located in known regions of long range linkage disequilibrium (LD) in European populations [27] were not included in the analysis. SNPs in strong LD were also pruned out using the indep-pairwise command in PLINK with r² = 0.2, a window size of 50 SNPs and 5 SNPs to shift the window at each step.

Geographic relevance of PCs

We applied three tests to determine if the PCs were correlated with geography in France.
(i) Using Spearman’s rank correlation coefficient, we tested for significance of association between latitude, longitude and PC coordinates (‘cor.test’ function in R).
(ii) We applied a Procrustes test using the ‘vegan’ package in R with 1,000,000 permutations to assess the similarity between PCA maps of genetic variation and geographic maps of population locations. The similarity between two maps is quantified by a Procrustes similarity statistic named t0.
(iii) We performed a Kruskal-Wallis test to evaluate whether the distribution of PC scores varied significantly between regions (‘kruskal.test’ function in R), followed by a post-hoc test implemented in the ‘kruskalmc’ function from the R package ‘pgirmess’.

Fst and runs of homozygosity (ROH)

We investigated local genetic differentiation based on the pairwise Fst [28] matrix between geographical regions of France.
The Fst was calculated for every SNP and then averaged across the genome to obtain a genome-wide estimate of the genetic distance between regions. All 493,065 SNPs were used.

ROH were estimated separately for each region (PLINK, --homozyg option). To compensate for unequal sampling in the different regions of interest, a subset of 700 individuals was considered, 100 randomly selected in each region. Default parameters were used to reconstruct ROHs: a 5 Mb window size, a minimum of 50 SNPs per window and allowance for 1 heterozygous and 5 missing calls per window. The final segments that were called as homozygous had a minimum number of 100 contiguous SNPs, a minimum length of 1 Mb and a minimum density of 1 SNP per 50 kb. The maximum gap between two consecutive SNPs within a segment was less than 1 Mb. We assessed two features of ROH: 1) the sum of the lengths of all ROH for an individual, measured in megabases (Mb) (average calculated across individuals within a geographical region); 2) the proportion, FROH, of the autosomal genome falling in ROH. This latter proportion was computed considering all ROHs longer than 1 Mb (FROH1) and considering the subset of ROHs longer than 5 Mb (FROH5). All 493,065 SNPs were used in the Fst and ROH calculations.

Prediction of geographical location

In order to assess the predictive power of the genotypic data to determine the geographic areas of ancestry of individuals, cross validation was used.

Selection of a training and test data set

Two training sets of 700 individuals were selected. The first training set, referred to as the “random set”, was composed of the 100 individuals selected at random in each region that were used in the ROH analysis. The second training set, referred to as the “extremePC set”, included individuals selected based on their coordinates on the first axis of variation (PC1): individuals with the 50 top and the 50 bottom PC1 values in each of the 7 regions were included.
In each case, all remaining individuals (n=3,733) were included in the test set.

Linear regression model of PCs

We assigned each individual to a specific geographic location by fitting independent linear models for latitude and longitude [12]. To perform the assignment, we first estimated the coefficients of regression on the training set of individuals and then used the estimated coefficients of the linear model to predict the latitude and longitude of individuals from the test set on the basis of their PC scores. PC scores were computed using only the random set of individuals, and individuals from the test set were projected onto those principal components. PC scores were inferred using the -w option in smartpca. We used the rotated PC1 and PC2 scores estimated from the Procrustes analysis in the whole sample because they correlate more strongly with latitude and longitude. Starting from the full model with PC1 and PC2, two quadratic terms and one interaction term, we performed stepwise regression with backward and forward elimination with the ‘step’ function in R. Models were compared according to their Akaike information criterion (AIC). The model with the lowest AIC was selected and the estimated coefficients of regression were then used to estimate the latitude and the longitude of individuals in the test sample. Distances in kilometers between estimated and observed geographical coordinates were calculated using the Haversine formula.

Classification of individuals into the seven regions

To study how individuals can be assigned into the seven regions using the genetic data, we used two alternative methods: the model-based clustering method implemented in admixture [29] and the K-nearest-neighbour (KNN) algorithm. Admixture was run in a supervised mode with the 7 reference populations defined by the random training data set. Each individual in the test sample was assigned to the population with the highest posterior probability.
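The distance computation named in the regression step above (the Haversine formula) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the Earth radius of 6,371 km are assumptions, since the text does not state which radius was used.

```python
import math

# Mean Earth radius in km; an assumed constant, not stated in the text.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

Applied to the predicted and reported coordinates of each test individual, a function of this kind would yield the per-individual errors whose mean and median are summarised by region.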
For KNN, we used the approach proposed by Huckins et al. [10] to assign an ancestry to each individual. Briefly, ancestry was assigned based on the results of a majority vote on the regions of origin of the 5 nearest neighbours of each individual, as determined by running the PLINK neighbour option. Where the five nearest neighbours did not reach a majority vote, only the four nearest neighbours were selected, and a majority vote was taken again. If this was still unsuccessful, only the top three neighbours were used. If still no majority vote was reached, the sample was classed as ‘unclassified’. The analyses were based on the pruned subset of SNPs after the removal of the SNPs located in regions of long range LD and SNPs in strong LD.

Evaluation of the quality of classification

We assessed the performance of the classification by comparing the population assignments obtained with admixture and KNN with the regions of origin. Only individuals from the test data set were used for assessing the performance of admixture whereas for KNN, we considered the overall sample as in Huckins et al. [10]. We calculated the Correct Classification Rate (CCR), which is defined as the proportion of individuals that were correctly assigned to their region of origin. This indicator considers only the one-to-one relationship between estimated groups and true populations and might not reflect well the similarity between partitions. The similarity between two partitions can be measured using the Rand index [30]. This index measures the proportion of similar assignments of pairs of points, i.e. the proportion of pairs of points that are placed either together in a group or in different groups in both partitions. The Rand index ranges between 0 and 1. It takes the value of 1 when the two classifications are identical and 0 when no pair of points appears either in the same group or in different groups in both partitions.
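To make the two indicators above concrete, here is a minimal sketch that computes the CCR and the Rand index from two lists of labels, together with the chance-corrected Adjusted Rand Index of Hubert and Arabie that is introduced in the next paragraph. The function names and the list-based inputs are assumptions for illustration; they are not taken from the software used in the study.

```python
from collections import Counter
from math import comb


def correct_classification_rate(true_labels, assigned_labels):
    """Proportion of individuals assigned to their true region."""
    hits = sum(t == a for t, a in zip(true_labels, assigned_labels))
    return hits / len(true_labels)


def rand_indices(u, v):
    """Return (Rand index, Adjusted Rand Index) for two partitions u and v.

    u and v are equal-length sequences of group labels, one per individual.
    """
    pairs_total = comb(len(u), 2)
    # Pairs placed together in both partitions (cells of the contingency table).
    sum_nij = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    # Pairs placed together in u, and pairs placed together in v (table margins).
    sum_a = sum(comb(c, 2) for c in Counter(u).values())
    sum_b = sum(comb(c, 2) for c in Counter(v).values())

    # Agreeing pairs = together in both partitions + apart in both partitions.
    agree = pairs_total + 2 * sum_nij - sum_a - sum_b
    rand = agree / pairs_total

    expected = sum_a * sum_b / pairs_total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Both partitions are trivial and identical (one group, or all singletons).
        return rand, 1.0
    return rand, (sum_nij - expected) / (max_index - expected)
```

Both indices depend only on which individuals are grouped together, so they are unchanged when group labels are permuted, whereas the CCR requires the estimated groups to be matched to the true regions.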
Because the expected value of the Rand index between two random partitions can vary, Hubert and Arabie [31] suggested using a corrected version of the Rand index, the Adjusted Rand Index (ARI). Based on the contingency table (Table 1), the ARI is of the form:

$$
\mathrm{ARI} = \frac{\sum_{i}\sum_{j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right] \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2} + \sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right] \Big/ \binom{N}{2}}
$$

where a_i and b_j are the numbers of individuals in the i-th group of partition U and the j-th group of partition V respectively, n_ij is the number of individuals in both the i-th group of U and the j-th group of V, and N is the total number of individuals (Table 1). The ARI is bounded above by 1, meaning a perfect clustering, and takes on the value 0 when the index equals its expected value.

Selection of Ancestry Informative Markers (AIMs)

Derivation of AIMs

To obtain different subsets of AIMs with different sizes N, we identified the top N SNPs that correlate with the latitude and longitude variation. We performed two independent genome-wide association studies in the random sample, with latitude and longitude as the respective phenotypes, assuming an additive genetic model. In each analysis, SNPs were ranked according to their p-value from the association test and the subsets of most significant SNPs were then selected. We picked subsets of equal size from the two association analyses performed with latitude and longitude respectively. The subsets of best SNPs from both association studies were then merged together. SNPs located in known regions of long range LD were removed and SNPs in strong LD were pruned out to obtain the final set of AIMs. In this study, we aimed to select SNPs with the strongest pattern of differentiation for geographical variation, so we did not adjust the association test for top PC scores, to avoid washing out the effect of population stratification.

Validation of AIMs

Different criteria were used to study the relevance of the different panels of selected AIMs.
First, to determine how these AIMs performed at assigning individuals to the different regions, we ran admixture in a supervised mode, taking the random training set as the reference population. Classification performance was evaluated using the Correct Classification Rate (CCR) and the Adjusted Rand Index (ARI). Second, for each panel of AIMs we assigned each individual to a specific geographic location by fitting independent linear models for latitude and longitude. Prediction performance was assessed based on the mean and median of the number of km between the expected and reported origin of individuals in the test sample. The performance obtained with various numbers of AIMs was compared to the performance obtained with the entire set of SNPs.

Results

Evidence for fine-scale population stratification in France

Even though they were sampled in only three cities (Bordeaux, Montpellier and Dijon), the genotyped individuals in the 3C study were in fact born all over France and the sample includes representatives from the seven major regions of France (Figure 1 and Table S1). Among the 4,467 individuals, 34 individuals were tagged as possible outliers after running Eigenstrat iteratively and were removed from all subsequent analyses, which were thus based on a sample of 4,433 individuals. The first three PCs of the PCA performed on the genotypes of these individuals were found to account for 0.07%, 0.04% and 0.04% of the total variation in the data, respectively (Figure 2). The first principal component roughly differentiates the northern and north-eastern regions (NO, GE) from the southern and south-western regions (SO, MED). The second PC mainly differentiates western (GO, NO) from south-eastern (MED) individuals. Some individuals from GE cluster with the south-eastern group whereas others fall into the western group.
The third PC shows more subtle patterns of differentiation, suggesting a separation of the north-eastern and south-western regions from the middle band of France (Figure 3). The Europe-wide PCA analysis (Figure S1) with 1000G samples confirmed this trend with a slightly clearer pattern of differentiation on PC1 and PC2, which accounted for 0.10% and 0.06% of the total variation respectively.

The top three PCs were significantly correlated with geographical axes (p-value ≤ 10⁻¹⁶). The Spearman correlation coefficients for PC1, PC2 and PC3 were respectively -0.60, 0.23 and -0.13 with latitude and 0.57, -0.19 and -0.26 with longitude. The Procrustes test comparing the maps obtained from PCs and from geographical coordinates was significant (p-value ≤ 10⁻⁶), with a Procrustes similarity statistic t0 of 0.55 for PC1 and PC2 versus latitude and longitude, 0.48 for PC1 and PC3 versus latitude and longitude, and 0.29 for PC2 and PC3 versus latitude and longitude. The distributions of PC scores differed between regions (Kruskal-Wallis test p-value ≤ 10⁻¹⁶) and all pairwise comparisons, except for GO versus IDF, were significant at a global significance level of 5%.

Genetic distances between regions measured by pairwise Fst statistics revealed subtle differences between regions. The largest difference was observed between NO and SO (Fst = 0.068%) followed by GE and SO (Fst = 0.049%) and NO and MED (Fst = 0.046%). The genetic differentiation between regions increased with the geographic distances in France (Figure 4 and Table S2) and a significant correlation was found between the distances in kilometers and the Fst values (Pearson correlation = 0.68, p-value = 6×10⁻⁴). The IDF region showed very little differentiation from GE, GO and RA (Fst ~ 0.004%). This could be expected given the more cosmopolitan character of IDF compared to other regions.
In the random subset of 700 individuals (100 individuals from each region), 13,336 runs of homozygosity (ROH) of at least 1 Mb were detected (on average 19 per individual), among which 94 (on average 0.13 per individual) had a length of 5 Mb or greater. Their distribution was found to vary by region. Indeed, the average proportion of the genome in ROHs for individuals from the SO region was the largest (FROH1 = 1.10 %) followed by NO and RA (FROH1 = 1.08 % - 1.06 %) while IDF and MED had the lowest proportion (FROH1 = 0.96 % - 1.00%) (Figure 5). The SO region was slightly different from the others for ROH with, on average, an increased proportion of the genome in ROH of at least 1 Mb and an increased summed length of ROH segments. Focusing on longer ROH segments of at least 5 Mb, which are likely to be identical by descent and to signal some level of inbreeding, we found that the RA region had the largest proportion of the genome in ROH of at least 5 Mb (FROH5 = 0.10 %) followed by NO, GE and SO (FROH5 ~ 0.06 %). Again, IDF had the lowest proportion of the genome in ROH (FROH5 = 0.01 %). Similar patterns were observed for the summed segment sizes (Figure S2). Overall, IDF was found to have the lowest proportion of the genome in ROH (as measured by ROH over 1 and 5 Mb in length) and SO and RA the highest proportions.

Prediction of geographical location

The rotated PC1 and PC2 scores estimated from the Procrustes analysis in the whole sample are shown in Figure S3. The best linear regression model to predict latitude in the training dataset was found to include the rotated PC1, the rotated PC2, one quadratic term for the rotated PC2 and the interaction between the rotated PC1 and the rotated PC2 (AIC=751.09). The best model for longitude included the rotated PC1, the rotated PC2 and one quadratic term for the rotated PC1. The AIC value was greater (AIC=994.27), suggesting that latitude was better predicted by PC scores than longitude.
Using these fitted models on the test sample, 50% of the individuals could be located within 197 km of their reported origin and 90% within 332 km (Figure 6). The scatter plot of individuals showed a variation from northeast to southwest. The origins of individuals from the GE and RA regions were best predicted, with the lowest mean and median of the number of km between their expected and reported origin (mean=185 km; median=167 km for GE; mean=180 km; median=185 km for RA). Conversely, the origins of individuals from the NO and MED regions were poorly predicted (mean=360 km, median=366 km for NO; mean=345 km, median=336 km for MED).

To assess the impact of the training sample selection on linear prediction, we also performed the training on 100 individuals in each region selected from the top and bottom tails of the PC1 coordinates for the region (extremePC training set). The AIC values from multiple linear regression in this extremePC training sample increased to 1115.18 for longitude and to 945.00 for latitude. Selecting individuals randomly might therefore better reflect the overall ancestry of our sample than selecting individuals on the extremes of the first PC scores. This was confirmed by an increase in the distances between the observed and the predicted geographical origins.

When running Admixture on the test sample, we found that individuals were assigned in majority to their true region of origin (Figure 7A, Figure S4). This was true for all regions except IDF but, for IDF, misclassified individuals were usually assigned to a close region (GO and NO). The CCR and ARI were 0.31 and 0.10 respectively. The KNN method based on the majority vote from the nearest neighbours was found to perform better on these indices, with a slightly higher ARI value (ARI=0.11) and a much larger CCR, reaching a value of 0.5.
However, this higher value was obtained because of the over-representation of two regions in the sample (GE and SO) in which the KNN method tended to cluster all individuals (Figure 7B and Figure S5). For individuals originating outside of these two regions, the proportion of individuals assigned to GE was higher than 55%, suggesting that the overall classification performance of KNN was very poor compared to admixture.

Ancestry Informative Markers (AIMs)

From the results of the association tests performed against latitude and longitude, sets of between 127 and 101,386 AIMs were selected. Their ability to assign individuals to their regions of origin with admixture was evaluated by measuring the relative increase of the CCR and ARI indexes when using each subset of AIMs compared to the full panel of SNPs. The classification performance with small panels of AIMs was quite low. It was only possible to obtain the same value of the ARI index as with the entire set of SNPs when using 94% of the SNPs (~95,000 SNPs) and the same value of the CCR index when using 24% of the SNPs (~24,000 SNPs) (Figure S6). Small panels of AIMs were also found to perform poorly for assigning individuals into their region of origin (Figure S7).

Discussion

In this study, we investigated population stratification in France using a large sample of 4,433 individuals for whom detailed information on birth places was available. This is one of the largest studies performed so far to investigate genome-wide patterns of variation in France. Interestingly, even though they were sampled in only three cities in France, the places of birth of individuals from the 3C sample were in fact evenly distributed across the French territory. This was one of the strengths of the 3C sample for investigating the performance of assignment at the scale of France.
Moreover, the fact that the 3C study only includes elderly people born before 1935, at a time when migrations were rather limited, was also an advantage for this study. Their place of birth was, except perhaps for the IDF region, a good indicator of the region of origin of their ancestors. This was not the case however for the places where they were sampled, which could not be used to trace back their origin. This raises serious concerns about studies that used sampling places as a surrogate for geographical origins.

Using a simple PCA, it was possible to differentiate northern from southern regions on PC1 and western from south-eastern regions on PC2, similar to what was observed within different European countries [14; 17]. Combining the 1000 Genomes individuals with the French population confirmed this trend, with the West (GO) and North (NO) regions of France clustering with the CEU and GBR populations from 1000 Genomes. The first three PCs in the 3C sample were significantly correlated with geographic axes and PC scores varied significantly between regions, with PC1 showing the highest correlation with geography and the strongest differences between North and South regions.

Similarly, the genetic differentiation was well correlated with geographical distances, with the strongest Fst values found between North-East and South-West regions (Fst=0.0007 between NO and SO). The IDF region of Paris, on the other hand, exhibited the smallest levels of differentiation from the other regions and especially the neighbouring regions (GO and GE). This was consistent with the long history of migrations of individuals from the different regions of France to the French capital to find a job. ROH analyses told a similar story, with individuals from the SO and NO regions showing more homozygous segments than individuals from the IDF region.
The RA region, which includes the Alps, stood out when focusing on the longest ROHs of at least 5 Mb, since FROH5 is the highest in this region. This could be due to the existence of mountains in this region that have, until quite recently, been natural barriers against population movements [32]. Indeed, ROHs of 5 Mb and longer are most likely due to inbreeding and are thus a good indicator of population isolation [33].

In agreement with previous studies performed at wider geographic scales [12; 34; 35], we found that genetic data can be used to gain some information on the origin of individuals within France. PCs were found to be good predictors of the longitude and latitude of the places of birth of individuals. Using multiple linear regression, one can place 50% of the individuals within 197 km of their reported origin and 90% within 332 km. This is better than the performance achieved within Europe by Novembre et al. [12]. Indeed, in this latter study, 50% of individuals could be assigned within 310 km of their reported origin and 90% within 700 km of their origin. The lower distances of assignment might be due to the finer scale information on origin available for each individual in our sample. In our sample, the distances of assignment from the reported origins were the lowest for regions located on the Northeast-Southwest axis. To avoid bias associated with unequal sample sizes, principal components were estimated using only the random subset of individuals drawn in equal numbers from each region. Individuals from the test sample were then projected onto those principal components.

Using the model-based clustering method implemented in admixture in a supervised mode with a reference population of 100 individuals picked at random in each region, we found that individual ancestries are mainly distributed among their closest regions, for example SO and MED or RA and MED.
The performance of both the model-based clustering and the linear regression approaches varies by region with, as expected from the PCA plots, individuals from the SO region in the South-West being usually better classified than individuals from the other regions. Interestingly, for these other regions, the two approaches perform differently, with admixture performing better in MED and GO than the linear regression.

Rather than using a model-based clustering approach, it was suggested that a simpler method based on the K-nearest neighbour (KNN method) could also be efficient for the prediction of geographic coordinates [10; 34]. In particular, it was shown that within Europe, the proportion of individuals correctly assigned (CCR) to their region of origin was high, reaching 80% for several populations [10]. This number is much higher than the value found here since only 50% of individuals were well classified in their region of origin. However, we were stricter in our evaluation than Huckins et al. [10], as they considered that an individual was correctly classified if he/she was assigned to his/her country of origin or a country with a Fst value of less than 0.001, this Fst value being considered as the threshold below which populations may not be considered genetically distinct. All the pairwise Fst values we computed here between the 7 regions were below 0.001, but we could still detect some differences between the regions, and the fact that 50% of the individuals are assigned to their true region of origin is a rather good achievement. However, this relatively good performance of the KNN method compared to admixture in terms of correct classification rates hides the fact that the results are quite different from one region to another. The KNN method performed well for the two major regions (GE and SO) where the majority of individuals come from but performed very badly for the other regions, with a CCR of less than 10%.
This was not the case with admixture, which gave better results in these other regions with, for example, a CCR above 50% in the MED region.

Several studies have proposed panels of Ancestry Informative Markers (AIMs) to infer ancestry for samples of European origin 10; 34; 36; 37. A small number of these AIMs may be used to perform population classification. We obtained a list of AIMs from their association with longitude and latitude variations, and explored whether an optimal set of AIMs could be used to infer the ancestry of individuals. Our results show that more than 24% of the full set of SNPs is needed to obtain classification performance similar to that of the full set. This analysis illustrates that the information contained in the full set of SNPs can hardly be summarized using a smaller number of representative AIMs. We found that small panels of AIMs perform poorly at detecting fine-scale population stratification in the 3C sample.

In summary, our study revealed fine-scale genetic structure within France. This is a unique population genetics study in terms of resolution, because of the large sample distributed all around France, with detailed information available on the place of birth. Our results confirm the tight correlation between genes and geography and thus the importance of considering stratification in association studies, even when analysing supposedly homogeneous populations.

Supplemental Data
Supplemental Data include seven figures and two tables.

Acknowledgements: This work was supported by a grant from the Brittany Region (dispositif SAD stratégie d’attractibilité durable-projet STATEX) and from Association Gaetan Saleun.

Conflict of Interest: The authors declare no conflict of interest.

Web resources: 1000 Genomes: http://browser.1000genomes.org

References
1. Gauderman, W.J., Witte, J.S., and Thomas, D.C. (1999). Family-based association studies. J Natl Cancer Inst Monogr, 31-37.
2. Schaid, D.J., and Rowland, C. (1998).
Use of parents, sibs, and unrelated controls for detection of associations between genetic markers and disease. Am J Hum Genet 63, 1492-1506.
3. Wacholder, S., Rothman, N., and Caporaso, N. (2000). Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst 92, 1151-1158.
4. Wacholder, S., Rothman, N., and Caporaso, N. (2002). Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev 11, 513-520.
5. Khlat, M., Cazes, M.H., Genin, E., and Guiguet, M. (2004). Robustness of case-control studies of genetic factors to population stratification: magnitude of bias and type I error. Cancer Epidemiol Biomarkers Prev 13, 1660-1664.
6. Heiman, G.A., Hodge, S.E., Gorroochurn, P., Zhang, J., and Greenberg, D.A. (2004). Effect of population stratification on case-control association studies. I. Elevation in false positive rates and comparison to confounding risk ratios (a simulation study). Hum Hered 58, 30-39.
7. Freedman, M.L., Reich, D., Penney, K.L., McDonald, G.J., Mignault, A.A., Patterson, N., Gabriel, S.B., Topol, E.J., Smoller, J.W., Pato, C.N., et al. (2004). Assessing the impact of population stratification on genetic association studies. Nat Genet 36, 388-393.
8. Marchini, J., Cardon, L.R., Phillips, M.S., and Donnelly, P. (2004). The effects of human population structure on large genetic association studies. Nat Genet 36, 512-517.
9. Heath, S.C., Gut, I.G., Brennan, P., McKay, J.D., Bencko, V., Fabianova, E., Foretova, L., Georges, M., Janout, V., Kabesch, M., et al. (2008). Investigation of the fine structure of European populations with applications to disease association studies. Eur J Hum Genet 16, 1413-1429.
10.
Huckins, L.M., Boraska, V., Franklin, C.S., Floyd, J.A., Southam, L., Sullivan, P.F., Bulik, C.M., Collier, D.A., Tyler-Smith, C., Zeggini, E., et al. (2014). Using ancestry-informative markers to identify fine structure across 15 populations of European origin. Eur J Hum Genet 22, 1190-1200.
11. Moskvina, V., Smith, M., Ivanov, D., Blackwood, D., Stclair, D., Hultman, C., Toncheva, D., Gill, M., Corvin, A., O'Dushlaine, C., et al. (2010). Genetic Differences between Five European Populations. Hum Hered 70, 141-149.
12. Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A.R., Auton, A., Indap, A., King, K.S., Bergmann, S., Nelson, M.R., et al. (2008). Genes mirror geography within Europe. Nature 456, 98-101.
13. Babron, M.C., de Tayrac, M., Rutledge, D.N., Zeggini, E., and Genin, E. (2012). Rare and low frequency variant stratification in the UK population: description and impact on association tests. PLoS One 7, e46519.
14. Abdellaoui, A., Hottenga, J.J., de Knijff, P., Nivard, M.G., Xiao, X., Scheet, P., Brooks, A., Ehli, E.A., Hu, Y., Davies, G.E., et al. (2013). Population structure, migration, and diversifying selection in the Netherlands. Eur J Hum Genet 21, 1277-1285.
15. Esko, T., Mezzavilla, M., Nelis, M., Borel, C., Debniak, T., Jakkula, E., Julia, A., Karachanak, S., Khrunin, A., Kisfali, P., et al. (2013). Genetic characterization of northeastern Italian population isolates in the context of broader European genetic diversity. Eur J Hum Genet 21, 659-665.
16. Karakachoff, M., Duforet-Frebourg, N., Simonet, F., Le Scouarnec, S., Pellen, N., Lecointe, S., Charpentier, E., Gros, F., Cauchi, S., Froguel, P., et al. (2014). Fine-scale human genetic structure in Western France. Eur J Hum Genet.
17. O'Dushlaine, C.T., Morris, D., Moskvina, V., Kirov, G., Consortium, I.S., Gill, M., Corvin, A., Wilson, J.F., and Cavalleri, G.L. (2010). Population structure and genome-wide patterns of variation in Ireland and Britain. Eur J Hum Genet 18, 1248-1254.
18.
Leslie, S., Winney, B., Hellenthal, G., Davison, D., Boumertit, A., Day, T., Hutnik, K., Royrvik, E.C., Cunliffe, B., Lawson, D.J., et al. (2015). The fine-scale genetic structure of the British population. Nature 519, 309-314.
19. Prevost, P., Busson, M., and Marcelli-Barge, A. (1984). Distribution of HLA-A,B alleles in 13 panels of blood donors in France. Tissue Antigens 23, 301-307.
20. Lonjou, C., Clayton, J., Cambon-Thomsen, A., and Raffoux, C. (1995). HLA -A, B, -DR haplotype frequencies in France--implications for recruitment of potential bone marrow donors. Transplantation 60, 375-383.
21. Degioanni, A., Darlu, P., and Raffoux, C. (2003). Analysis of the French National Registry of unrelated bone marrow donors, using surnames as a tool for improving geographical localisation of HLA haplotypes. Eur J Hum Genet 11, 794-801.
22. Lambert, J.C., Heath, S., Even, G., Campion, D., Sleegers, K., Hiltunen, M., Combarros, O., Zelenika, D., Bullido, M.J., Tavernier, B., et al. (2009). Genome-wide association study identifies variants at CLU and CR1 associated with Alzheimer's disease. Nat Genet 41, 1094-1099.
23. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81, 559-575.
24. Anderson, C.A., Pettersson, F.H., Clarke, G.M., Cardon, L.R., Morris, A.P., and Zondervan, K.T. (2010). Data quality control in genetic case-control association studies. Nat Protoc 5, 1564-1573.
25. Abecasis, G.R., Auton, A., Brooks, L.D., DePristo, M.A., Durbin, R.M., Handsaker, R.E., Kang, H.M., Marth, G.T., and McVean, G.A. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65.
26. Patterson, N., Price, A.L., and Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet 2, e190.
27.
Price, A.L., Weale, M.E., Patterson, N., Myers, S.R., Need, A.C., Shianna, K.V., Ge, D., Rotter, J.I., Torres, E., Taylor, K.D., et al. (2008). Long-range LD can confound genome scans in admixed populations. Am J Hum Genet 83, 132-135; author reply 135-139.
28. Weir, B.S., and Cockerham, C.C. (1984). Estimating F-Statistics for the Analysis of Population Structure. Evolution 38, 1358-1370.
29. Alexander, D.H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19, 1655-1664.
30. Rand, W.M. (1971). Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association 66, 846-850.
31. Hubert, L., and Arabie, P. (1985). Comparing Partitions. Journal of Classification 2, 193-218.
32. Vernay, M. (2000). Trends in inbreeding, isonymy, and repeated pairs of surnames in the Valserine Valley, French Jura, 1763-1972. Hum Biol 72, 675-692.
33. McQuillan, R., Leutenegger, A.L., Abdel-Rahman, R., Franklin, C.S., Pericic, M., Barac-Lauc, L., Smolej-Narancic, N., Janicijevic, B., Polasek, O., Tenesa, A., et al. (2008). Runs of homozygosity in European populations. Am J Hum Genet 83, 359-372.
34. Drineas, P., Lewis, J., and Paschou, P. (2010). Inferring geographic coordinates of origin for Europeans using small panels of ancestry informative markers. PLoS One 5, e11892.
35. Hoggart, C.J., O'Reilly, P.F., Kaakinen, M., Zhang, W., Chambers, J.C., Kooner, J.S., Coin, L.J., and Jarvelin, M.R. (2012). Fine-scale estimation of location of birth from genome-wide single-nucleotide polymorphism data. Genetics 190, 669-677.
36. Price, A.L., Butler, J., Patterson, N., Capelli, C., Pascali, V.L., Scarnicci, F., Ruiz-Linares, A., Groop, L., Saetta, A.A., Korkolopoulou, P., et al. (2008). Discerning the ancestry of European Americans in genetic association studies. PLoS Genet 4, e236.
37.
Bauchet, M., McEvoy, B., Pearson, L.N., Quillen, E.E., Sarkisian, T., Hovhannesyan, K., Deka, R., Bradley, D.G., and Shriver, M.D. (2007). Measuring European population stratification with microarray genotype data. Am J Hum Genet 80, 948-956.

Figure titles and legends:

Figure 1: The seven geographical regions of France according to the geographical coordinates latitude and longitude. Individuals are coloured according to the region where they were born.

Figure 2: Scatter plot of the first three PCs from the PCA performed on the SNP genotype data of the 4,433 individuals from the Three-City (3C) study. Individuals are coloured according to the region where they were born.

Figure 3: Distribution of PC values according to the geographical coordinates latitude and longitude. The colour of the points indicates the range of PC values: a) PC1, b) PC2, c) PC3. Red colours indicate negative values while green colours indicate positive values.

Figure 4: Fst distribution according to geographical distance in kilometres (km). Genetic distance correlates with geographical distance (Pearson correlation = 0.68). We computed pairwise Fst values between all regions of France and compared them to the geographic distance in kilometres between the midpoints across individuals of each region.

Figure 5: Mean FROH1 (entire bars) and FROH5 (gray parts of the bars) in the seven regions of France.

Figure 6: Prediction of the geographic location of individuals from the test set (n=3,733) using the multiple linear regression model. A) Expectation: the seven geographical regions of France according to the geographical coordinates of individuals in the test sample; B) Prediction of geographical coordinates according to the multiple linear regression model.

Figure 7: (A) Proportion of individuals assigned to each group by admixture in each region. Colours of the seven groups were inferred from the admixture proportion estimated in the reference sample.
Only individuals from the test sample are included in the analysis. (B) Proportion of individuals assigned to each group by the classification algorithm KNN in each region.

Table titles and legends:

Table 1: The contingency table between partitions U and V

U / V   V1    V2    ...   VC    Sums
U1      n11   n12   ...   n1C   a1
U2      n21   n22   ...   n2C   a2
...     ...   ...   ...   ...   ...
UR      nR1   nR2   ...   nRC   aR
Sums    b1    b2    ...   bC    N
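As a hedged illustration of how the counts in Table 1 can be used, the sketch below computes the adjusted Rand index of Hubert and Arabie 31 from two label vectors. It assumes that Table 1 serves this purpose, which the references to Rand 30 and Hubert and Arabie 31 suggest but which is not stated in this section. The function name and the example labels are hypothetical; n_ij, a_i, b_j and N follow the notation of Table 1.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(u, v):
    """Adjusted Rand index between two partitions given as label sequences.

    n_ij are the cell counts of the contingency table, a_i the row sums,
    b_j the column sums and N the total number of individuals.
    """
    if len(u) != len(v):
        raise ValueError("both partitions must label the same individuals")
    total_pairs = comb(len(u), 2)  # C(N, 2)
    if total_pairs == 0:
        raise ValueError("at least two individuals are required")

    n_ij = Counter(zip(u, v))  # cell counts
    a = Counter(u)             # row sums
    b = Counter(v)             # column sums

    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())

    expected = sum_a * sum_b / total_pairs  # value expected by chance
    maximum = (sum_a + sum_b) / 2           # value for identical partitions
    if maximum == expected:
        # Degenerate case (for example, one single cluster in both partitions).
        return 1.0
    return (sum_ij - expected) / (maximum - expected)


# Hypothetical example: regions of birth (U) against inferred clusters (V).
regions = ["GE", "GE", "SO", "SO", "MED", "MED"]
clusters = [0, 0, 1, 1, 2, 2]
print(adjusted_rand_index(regions, clusters))
```

The index equals 1 when the two partitions are identical up to relabelling, and is close to 0 (possibly negative) when the agreement is no better than expected by chance.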