CONTRIBUTIONS À LA STATISTIQUE COMPUTATIONNELLE ET À LA CLASSIFICATION NON SUPERVISÉE

soutenue le 12 décembre 2014 à la Faculté des Sciences
Institut de Mathématiques et Modélisation de Montpellier (I3M)
École Doctorale Information, Structures et Systèmes (I2S)
Université Montpellier 2 (UM2)

pour l'obtention d'une Habilitation à Diriger des Recherches
par Pierre Pudlo

devant le jury composé de :
Prof Mark Beaumont, University of Bristol, rapporteur
Prof Gérard Biau, Université Pierre et Marie Curie, rapporteur
Dr Gilles Celeux, INRIA, président du jury
Dr Arnaud Estoup, INRA, examinateur
Prof Jean-Michel Marin, Université Montpellier 2, coordinateur
Prof Didier Piau, Université Joseph Fourier, examinateur

Montpellier, UM2, 2014

“Et tout d'un coup le souvenir m'est apparu. Ce goût, c'était celui du petit morceau de madeleine que le dimanche matin à Combray (parce que ce jour-là je ne sortais pas avant l'heure de la messe), quand j'allais lui dire bonjour dans sa chambre, ma tante Léonie m'offrait après l'avoir trempé dans son infusion de thé ou de tilleul. La vue de la petite madeleine ne m'avait rien rappelé avant que je n'y eusse goûté ; peut-être parce que, en ayant souvent aperçu depuis, sans en manger, sur les tablettes des pâtissiers, leur image avait quitté ces jours de Combray pour se lier à d'autres plus récents ; peut-être parce que, de ces souvenirs abandonnés si longtemps hors de la mémoire, rien ne survivait, tout s'était désagrégé ; les formes — et celle aussi du petit coquillage de pâtisserie, si grassement sensuel, sous son plissage sévère et dévot — s'étaient abolies, ou, ensommeillées, avaient perdu la force d'expansion qui leur eût permis de rejoindre la conscience. Mais, quand d'un passé ancien rien ne subsiste, après la mort des êtres, après la destruction des choses, seules, plus frêles mais plus vivaces, plus immatérielles, plus persistantes, plus fidèles, l'odeur et la saveur restent encore longtemps, comme des âmes, à se rappeler, à attendre, à espérer, sur la ruine de tout le reste, à porter sans fléchir, sur leur gouttelette presque impalpable, l'édifice immense du souvenir.”
— À la recherche du temps perdu, Marcel Proust

Remerciements

L'écriture de ce mémoire d'habilitation à diriger des recherches était l'occasion de faire le bilan de mes travaux de recherche depuis ma thèse. Il doit beaucoup à l'ensemble de mes co-auteurs, à mes nombreuses rencontres dans la communauté statistique, et au-delà ; cette page est l'occasion de les remercier tous, ainsi que ma famille et mes amis.

Je voudrais d'abord exprimer ma profonde gratitude à Mark Beaumont, Gérard Biau, Chris Holmes et Adrian Raftery, qui ont accepté de rapporter cette habilitation et se sont acquittés de cette lourde tâche dans les délais impartis par les nombreuses contraintes administratives françaises. Je remercie vivement mes deux premiers rapporteurs, Mark Beaumont et Gérard Biau, ainsi que Gilles Celeux, Arnaud Estoup et Didier Piau, d'avoir bien voulu faire partie de ce jury, malgré leurs emplois du temps chargés.

Ces travaux de recherche doivent beaucoup à Jean-Michel Marin, qui coordonne cette habilitation. Sa curiosité scientifique, sa disponibilité et son enthousiasme constants m'ont porté depuis son arrivée à Montpellier. J'ai découvert avec lui les méthodes de Monte-Carlo et la statistique bayésienne. Ses compétences et son rayonnement scientifique m'ont guidé jusqu'à ce mémoire.
Naturellement, je dois aussi un grand merci à Bruno Pelletier avec qui tout a commencé. J'ai bénéficié d'excellentes conditions de travail à l'Institut de Mathématiques et Modélisation de Montpellier, au sein de son équipe de probabilités et statistique. Ils ont su me faire confiance en me recrutant après ma thèse en probabilités appliquées. C'est aussi l'occasion de réitérer mes remerciements à Didier Piau, qui a encadré ma thèse et m'a beaucoup appris, y compris dans ses cours de licence et de maîtrise.

Je remercie tous les membres passés et présents de mon équipe. Je n'ose pas me lancer ici dans une liste exhaustive au risque d'en oublier un ; ils ont tous contribué.

Je remercie chaleureusement l'INRA, et le Centre de Biologie pour la Gestion des Populations qui m'a accueilli pendant deux ans. J'y ai trouvé des conditions idéales pour mener des travaux de recherche en statistique appliquée à la génétique des populations et d'excellents co-auteurs, Jean-Marie Cornuet, Alex Dehne-Garcia, Arnaud Estoup, Mathieu Gautier, Raphaël Leblois et Renaud Vitalis, auxquels s'ajoute François Rousset. C'est Jean-Michel qui m'a présenté à cette équipe, ainsi qu'à Christian Robert, avec qui j'ai eu la chance de travailler. Merci Christian d'avoir accepté de faire partie de ce jury, mais tes nombreux talents n'incluent pas encore le don d'ubiquité.

Enfin, de nombreuses structures ont financé ces travaux de recherche. Je pense en particulier à l'ANR, au LabEx NUMEV et à l'Institut de Biologie Computationnelle, que je remercie. Et un grand merci aux étudiants qui se sont lancés dans une thèse sous ma co-direction, sans peut-être savoir ce qui les attendait : Mohammed Sedki, Julien Stoehr, Coralie Merle et Paul-Marie Grollemund. Je leur souhaite un brillant avenir.

Montpellier, le 4 décembre 2014
Pierre Pudlo

Contents

Remerciements . . . i
List of figures . . . v
1 Résumé en français . . . 1
  1 Classification non supervisée . . . 1
  2 Statistique computationnelle et applications à la génétique des populations . . . 3
    2.1 Échantillonnage préférentiel et algorithme SIS . . . 4
    2.2 Algorithmes ABC . . . 5
    2.3 Premiers pas vers les jeux de données moléculaires issus des technologies NGS . . . 6
2 Graph-based clustering . . . 9
  1 Spectral clustering . . . 12
    1.1 Graph-based clustering . . . 12
    1.2 Consistency results . . . 15
  2 Consistency of the graph cut problem . . . 17
  3 Selecting the number of groups . . . 20
  4 Perspectives . . . 21
3 Computational statistics and intractable likelihoods . . . 23
  1 Approximate Bayesian computation . . . 25
    1.1 Recap . . . 25
    1.2 Auto-calibrated SMC sampler . . . 26
    1.3 ABC model choice . . . 28
  2 Bayesian inference via empirical likelihood . . . 31
  3 Importance sampling . . . 33
    3.1 Computing the likelihood . . . 34
    3.2 Sample from the posterior with AMIS . . . 36
  4 Inference in neutral population genetics . . . 39
    4.1 A single, isolated population at equilibrium . . . 40
    4.2 Complex models . . . 42
    4.3 Perspectives . . . 44
Bibliography . . . 56
A Published papers . . . 57
B Preprints . . . 59
C Curriculum vitæ . . . 61
  Thèmes de recherche . . . 62
  Liste de publications . . . 64
List of Figures

2.1 Hartigan (1975)'s definition of clusters in terms of connected components of the t-level set of the density: at the chosen threshold, there are two clusters, C1 and C2. If t is much larger, only the right cluster remains, and if t is smaller than the value on the plot, both clusters merge into a single group. . . . 11
2.2 Results of our algorithm on a toy example. (left) the partition given by the spectral clustering procedure (points in black remain unclassified); (right) the spectral representation of the dataset, i.e., the ρ(X_i)'s: the relatively large spread of the points on the top of the plot is due to poor mixing properties of the random walk on the red points. . . . 16
2.3 Graph-cut on a toy example. The red line represents the bottleneck, the h-neighborhood graph is in gray. The partition returned by the spectral clustering method with the h-neighborhood graph corresponds to the color of the crosses (blue or orange). . . . 19
2.4 The graph-based selection of k: (left) the datapoints and the spectral clustering output; (right) the eigenvalues of the matrix Q: the eigengap is clearly between λ_3 and λ_4. . . . 21
3.1 Gene genealogy of a sample of five genes numbered from 1 to 5. The inter-coalescence times T_5, . . . , T_2 are represented on the vertical time axis. . . . 41
3.2 Simulation of the genotypes of a sample of eight genes. As for microsatellite loci with the stepwise mutation model, the set of alleles is an interval of integer numbers A ⊂ N. The mutation process Q_mut adds +1 or −1 to the genotype with equal probability. Once the genealogy has been drawn, the MRCA is genotyped at random, here 100, and we run the mutation Markov process along the vertical lines of the dendrogram. For instance, the red and green lines are the lineages from the MRCA to gene number 2 and 4 respectively. . . . 42
3.3 Example of an evolutionary scenario: four populations Pop1, . . . , Pop4 have been sampled at time t = 0. Branches of the history can be considered as tubes in which the gene genealogy should be drawn. The historical model includes two unobserved populations (Pop5 and Pop6) and fifteen parameters: six dates t_1, . . . , t_6, seven population sizes Ne 1, . . . , Ne 6 and Ne 40, and two admixture rates r, s. . . . 45
1 Résumé en français

Mes travaux de recherche ont touché à des thématiques diversifiées, allant de résultats théoriques en probabilités au développement d'algorithmes d'inférence et à leur mise en œuvre. Deux caractéristiques dominantes se dégagent : (1) l'omniprésence d'algorithmes (depuis l'algorithme glouton de ma thèse jusqu'aux algorithmes de Monte-Carlo dans mes derniers travaux), et (2) leur adaptation en biologie, principalement en génétique des populations. Du fait de la taille croissante des données produites, notamment en génomique, les méthodes statistiques doivent gagner en efficacité sans perdre le détail de l'information incluse dans ces grandes bases de données. Mes travaux détaillés ci-dessous ont fourni aussi bien des contributions importantes à l'analyse et à la compréhension des performances de ces algorithmes qu'à la conception de nouveaux algorithmes gagnant en précision ou en efficacité d'estimation.

1 Classification non supervisée

Publications. (A2), (A3) et (A4), voir page 64. Un brevet international.
Mots clés. Machine learning, classification spectrale, théorèmes asymptotiques, constante de Cheeger, graph-cut, graphes de voisinage, théorie spectrale d'opérateurs.

Après ma thèse en probabilités appliquées, je me suis tourné à Montpellier vers la statistique. J'ai ainsi obtenu plusieurs résultats en Machine Learning, sur des problèmes de classification non supervisée (voir, par exemple, Hastie et al., 2001). Cette méthode d'analyse de données consiste à partitionner les individus (taxons) d'un échantillon en groupes relativement similaires et homogènes, sans aucune information sur l'appartenance des individus aux groupes, ni même sur le nombre de groupes. De plus, les informations organisées selon un réseau, ou graphe, comme les réseaux d'interaction, sont de plus en plus fréquentes, et les méthodes permettant d'analyser la structure de tels réseaux de plus en plus nécessaires. Ces méthodes se doivent d'être algorithmiquement efficaces du fait de la taille croissante des données. Les techniques de classification spectrale (von Luxburg, 2007) auxquelles je me suis intéressé occupent une place importante dans ce champ de recherche. Comme dans les méthodes à noyau (Shawe-Taylor and Cristianini, 2004), nous utilisons une comparaison par paires des individus, mais au travers d'une fonction de similarité qui associe un nombre positif à chaque paire d'observations, reflétant leur proximité. L'un des avantages de cette méthode récente est sa grande maniabilité et sa faculté d'adaptation à de nombreux types de données. Elle permet en effet de détecter des groupes d'observations de forme quelconque, contrairement aux k-means ou aux méthodes de mélange, qui ne détectent que des groupes convexes. L'algorithme étudié (Ng et al., 2002) s'appuie sur une marche aléatoire qui se déplace sur les individus à classer proportionnellement à leur similarité. On reconstruit alors les groupes homogènes en cherchant des ensembles d'états dont la marche aléatoire sort avec une faible probabilité. Récemment, von Luxburg et al. (2008) ont montré que de tels algorithmes convergent. Mais l'obtention d'une caractérisation géométrique simple du partitionnement limite restait une question largement ouverte. Pourtant, dans le contexte de la classification non supervisée, Hartigan (1975) a proposé une définition précise et intuitive d'un groupe en termes d'ensembles de niveau de la densité sous-jacente.
En modifiant l'algorithme de classification spectrale, nous avons montré, avec Bruno Pelletier (PR, Rennes 2), que le partitionnement limite coïncide avec cette définition ((A02) Pelletier and Pudlo, 2011). Cette démonstration repose sur un mode de convergence fort d'opérateurs associés aux matrices de similarité de l'échantillon. La méthode de graphe de voisinage (Biau et al., 2007) s'interprète comme une classification spectrale dont la fonction de similarité est binaire. Avec Benoît Cadre (PR, ENS Cachan, Rennes) et Bruno Pelletier (PR, Rennes 2), nous avons largement amélioré les résultats sur l'estimation du nombre de groupes k dans ce cadre particulier ((A04) Cadre, Pelletier, and Pudlo, 2013). L'estimateur de k est la multiplicité de la valeur propre nulle du laplacien de graphe. Ce résultat est une première justification formelle de l'heuristique de trou spectral (von Luxburg, 2007) dans le cadre général. Enfin, l'algorithme de classification spectrale peut se voir comme une approximation du problème NP-difficile de graph-cut, ou de détection de goulets d'étranglement dans ces graphes ou réseaux d'interaction (von Luxburg, 2007). La constante de Cheeger et le problème de minimisation associé détectent ces goulets d'étranglement dans des graphes valués, comme par exemple les graphes de voisinage. Avec Ery Arias-Castro (A-PR, University of California, San Diego) et Bruno Pelletier (PR, Rennes 2), nous avons étudié la constante de Cheeger sur de tels graphes aléatoires construits par échantillonnage ((A03) Arias-Castro, Pelletier, and Pudlo, 2012). Nous avons obtenu des résultats asymptotiques donnant la limite continue lorsque la taille de l'échantillon grandit, en adaptant la fonction de similarité binaire à cette taille.

Récemment, nous avons été contactés par le CHU, avec André Mas (PR UM2, I3M), pour traiter des jeux de données d'analyse sanguine par cytomètre en flux. La question initiale était de détecter un petit cluster de cellules rares (des cellules circulantes à cause d'un cancer, par exemple) parmi un très grand nombre d'observations. L'algorithme que nous avons développé avec l'équipe Tatoo du LIRMM (UMR d'informatique de l'UM2) a été partiellement breveté (brevet international, voir page 64). Avec la société d'accélération du transfert de technologies de Montpellier (SATT AxLR), nous cherchons un partenaire industriel pour valoriser ce brevet. Cette innovation fait l'objet d'une demande de financement de maturation auprès de la SATT AxLR ; elle soulève aussi des questions de transfert de méthodologies statistiques bien connues, mais peu utilisées par l'industrie fournissant des logiciels d'analyse de données issues de cytomètres en flux.

2 Statistique computationnelle et applications à la génétique des populations

Publications. (A5), (A6), (A7), (A8), (A9), (A12), (A13), (A14) et (A15), ainsi que les prépublications (A10), (A11), (A16) et (A17), voir page 64.
Mots clés. Méthodes de Monte Carlo, statistique computationnelle, méthodes ABC, échantillonnage préférentiel, vraisemblance empirique, génétique des populations.

Depuis 2010, je me suis intéressé à la statistique computationnelle pour la génétique des populations. Cette thématique de recherche m'a permis de mettre en valeur mes compétences en statistique, en probabilités, ainsi que sur les questions d'implémentation informatique.
J'ai souhaité profiter de l'environnement scientifique exceptionnel de Montpellier en génétique des populations pour en faire un axe majeur de mes travaux de recherche. Sous neutralité, l'évolution génétique est modélisée par des processus stochastiques complexes (notamment la diffusion de Kimura (1968) et le coalescent de Kingman (1982)) prenant en compte simultanément les mutations et la dérive génétique. Répondre à des questions d'intérêt biologique (quantifier un taux de migration, une réduction de taille de population, dater des fondations de populations, ou retracer les voies d'invasion d'espèces étudiées au CBGP) est un problème méthodologique délicat. Une modélisation fine permet de distinguer des effets confondants comme, par exemple, sélection vs. variations démographiques. La généalogie (i.e., les liens de parenté de l'échantillon de copies de gènes étudié), les dates des mutations et les génotypes ancestraux, décrits par le modèle stochastique, ne sont pas observés directement. On parle de processus latents ou cachés. La vraisemblance des données s'obtient alors en sommant sur toutes les possibilités, ce qui n'est pas faisable en temps fini.

On peut citer deux classes d'algorithmes de Monte-Carlo qui permettent d'inférer les paramètres ou le modèle évolutif sous-jacent (on parle plus couramment de scénario d'évolution) malgré la présence de ce processus stochastique latent. La première classe de méthodes repose sur un échantillonnage préférentiel séquentiel (Sequential Importance Sampling ou SIS, voir Stephens and Donnelly, 2000; De Iorio and Griffiths, 2004a,b; De Iorio et al., 2005). Elles attaquent directement le calcul de la somme en tirant aléatoirement (échantillonnage) le processus latent. Ce tirage séquentiel (qui remonte le temps progressivement) est dirigé par une loi (dite loi d'importance) qui charge les généalogies supposées contribuer le plus à cette somme (d'où l'adjectif préférentiel). Cette première famille de méthodes est la plus précise, dans le sens où elle minimise l'erreur d'estimation dans un modèle donné. Mais elle est aussi la plus gourmande en temps de calcul et la plus restreinte dans le champ des scénarii d'évolution couverts. En effet, elle nécessite d'adapter la loi d'importance, qui échantillonne les arbres les plus importants, à chaque situation démographique et historique considérée (voir par exemple (A14) Leblois, Pudlo, Néron, Bertaux, Beeravolu, Vitalis, and Rousset (2014)). La seconde classe de méthodes comprend les méthodes bayésiennes approchées (ABC ou approximate Bayesian computation, voir par exemple Beaumont, 2010; (A05) Marin, Pudlo, Robert, and Ryder, 2012; (A13) Baragatti and Pudlo, 2014), qui contournent le calcul de cette somme en comparant des jeux de données simulés aux données observées au travers de quantités numériques (statistiques résumées) supposées informatives. Les estimations obtenues par ABC sont moins précises que celles obtenues par SIS. Mais elles sont beaucoup plus souples car elles ne reposent que sur notre capacité à (1) simuler suivant le modèle stochastique, et (2) capter l'information importante au travers de statistiques résumées. Pour cette raison, elles sont considérées comme les plus prometteuses pour répondre aux questions complexes de génétique des populations.
Nous avons passé en revue les principes statistiques ainsi que les principaux résultats de cette dernière méthode dans deux articles qui se complètent : (A05) Marin, Pudlo, Robert, and Ryder (2012) et (A13) Baragatti and Pudlo (2014).

2.1 Échantillonnage préférentiel et algorithme SIS

Avec Jean-Michel Marin (Montpellier 2), j'ai co-encadré la thèse en biostatistique de Mohammed Sedki. Nous nous sommes principalement intéressés à l'algorithme d'échantillonnage préférentiel adaptatif et multiple (AMIS pour Adaptive Multiple Importance Sampling), voir Cornuet et al. (2012a). Lorsque la vraisemblance est calculée avec SIS, son évaluation en chaque point de l'espace des paramètres est coûteuse. Dans une perspective bayésienne, le schéma AMIS permet d'échantillonner cet espace des paramètres suivant la loi a posteriori, en ne nécessitant que peu d'appels au calcul de la vraisemblance en un point. D'où son efficacité computationnelle. AMIS approche sa cible par un système de particules pondérées, mis à jour séquentiellement, et recycle l'ensemble des calculs obtenus. Des tests numériques effectués sur des modèles de génétique des populations ont montré les performances numériques de l'algorithme AMIS (efficacité et stabilité). Toutefois, la question de la convergence des estimateurs obtenus par cette technique restait largement ouverte. Nous avons montré ((A16) Marin, Pudlo, and Sedki, 2014) des résultats de convergence d'une version légèrement modifiée de cet algorithme, mais conservant les qualités numériques du schéma original.

Dans le schéma SIS, ce sont les lois d'importance qui sont chargées de proposer des généalogies supposées contribuer le plus à la vraisemblance. Malheureusement, ces lois d'importance sont conçues pour des situations en équilibre démographique. Il est possible de les utiliser dans des situations où la taille d'une population varie au cours du temps, au prix d'un coût de calcul beaucoup plus important pour conserver la même qualité d'approximation. Avec Raphaël Leblois (CR INRA, CBGP) et Renaud Vitalis (DR INRA, CBGP), nous nous sommes attaqués dans (A14) Leblois et al. (2014) au cas d'une unique population panmictique dont la taille varie au cours du temps, et nous avons comparé les résultats obtenus avec d'autres algorithmes de la littérature, moins efficaces ou moins bien justifiés. J'encadre avec Raphaël Leblois la thèse de Coralie Merle (UM2 I3M & INRA CBGP), dont l'un des objectifs est de proposer des pistes pour pallier ce coût de calcul, en prolongement direct de son stage de M2. Nous avons cherché à comprendre comment appliquer un rééchantillonnage (Liu et al., 2001) dont le but est d'apprendre automatiquement quels sont les arbres proposés par la loi d'importance qui contribueront le plus à la vraisemblance. Des premiers résultats à publier montrent que ce rééchantillonnage permet de diviser le coût de calcul par un facteur 10 dans des modèles de dynamique comparables à ceux étudiés dans (A14) Leblois et al. (2014).

2.2 Algorithmes ABC

Avec mon premier étudiant en thèse, Mohammed Sedki, et Jean-Marie Cornuet (DR INRA, CBGP), nous avons également développé un nouvel algorithme d'inférence ABC sur des scénarii évolutifs dans le paradigme bayésien ((A11) Sedki, Pudlo, Marin, Robert, and Cornuet, 2013). Comparé à l'état de l'art, cet algorithme est auto-calibré (auto-tuning) et plus efficace, d'où un gain de temps pour obtenir une réponse de même qualité que l'algorithme standard.
Nous avons illustré cette méthode sur un jeu de données portant sur quatre populations d'abeilles domestiques européennes (Apis mellifera) et un scénario démographique précédemment validé sur une étude de l'ADN mitochondrial.

Au Centre de Biologie pour la Gestion des Populations (UMR INRA SupAgro Cirad IRD, Montpellier), je me suis fortement impliqué dans le codage de la seconde version de DIYABC (Cornuet et al., 2008, 2010), qui vient de sortir ((A12) Cornuet et al., 2014). Ce logiciel condense toute l'expérience acquise au sein de l'ANR EMILE sur les méthodes ABC pour la génétique des populations. En particulier, pour gérer les situations où le nombre de statistiques résumées est important, nous avons proposé avec Arnaud Estoup (DR INRA, CBGP) et Jean-Marie Cornuet (DR INRA, CBGP) d'estimer la probabilité a posteriori d'un modèle via une analyse discriminante linéaire ((A06) Estoup et al., 2012).

Je co-encadre actuellement les travaux de thèse de Julien Stoehr, qui portent sur la sélection de modèles pour des champs de Markov latents, question de difficulté comparable à celle du choix de scénarii d'évolution en génétique des populations. Notons ici que les modèles de champs markoviens ou markoviens cachés intéressent également le département MIA pour l'analyse de réseaux d'interaction. Je pense en particulier à Nathalie Peyrard (MIA Toulouse), qui fait partie du comité de suivi de thèse de Julien, et au champ thématique qu'elle anime sur l'analyse des réseaux. Dans (A15) Stoehr, Pudlo, and Cucala (2014), nous avons mis en place une procédure ABC de choix de modèle qui renonce à l'approximation de la probabilité a posteriori de chacun des modèles (qui représentent différentes structures de dépendance) pour améliorer le taux de mauvaise classification (c'est-à-dire de mauvais choix de modèle), via une procédure des k plus proches voisins parmi les simulations ABC. Nous inférons ensuite, localement autour du jeu de données observé, un taux de mauvaise classification qui fournit un succédané à la probabilité a posteriori pour estimer la difficulté locale du choix. Cette approche nous a permis de diminuer le nombre de simulations ABC nécessaires pour prendre une décision correcte, et permet donc de diminuer de façon importante le temps de calcul, tout en gagnant en qualité de décision. Ce nouvel indicateur nous permet finalement de construire une procédure de sélection de statistiques résumées qui s'adapte au jeu de données observé, et de réduire ainsi la dimension du problème. Avec Arnaud Estoup (DR INRA, CBGP), Jean-Michel Marin (PR I3M, UM2) et Christian P. Robert (PR CEREMADE, Dauphine), nous prolongeons ces travaux en remplaçant la méthode des k plus proches voisins par des forêts aléatoires (Breiman, 2001), entraînées sur les simulations ABC. Ce type de classifieur, qui prédit un modèle en fonction des statistiques résumées, est bien moins sensible à la dimension que la méthode des k plus proches voisins et donne de bien meilleurs résultats en génétique des populations, où le nombre de statistiques résumées est de l'ordre de la centaine (à comparer avec la dizaine de statistiques résumées dans les questions de champs markoviens cachés des travaux de Julien Stoehr).
En outre, nous fournissons un autre succédané à la probabilité a posteriori, qui est un taux de mauvaise classification intégré contre la loi a posteriori de prédiction, intégration que l'on peut facilement réaliser avec une méthode ABC, voir (A17) Pudlo, Marin, Estoup, Gautier, Cornuet, and Robert (2014).

2.3 Premiers pas vers les jeux de données moléculaires issus des technologies NGS

Les données de polymorphisme collectées par séquençage ultra-haut débit (Next Generation Sequencing data ou NGS) fournissent des marqueurs de type SNP (Single Nucleotide Polymorphism) bi-alléliques. L'intérêt de ces données de grande dimension est la quantité d'information qu'elles portent. Pour réduire les coûts financiers, il est possible d'utiliser un génotypage par lots (ou pools d'individus), chacun d'entre eux représentant une population d'intérêt pour l'espèce étudiée. Avec Mathieu Gautier (CR INRA, CBGP) et Arnaud Estoup (DR INRA, CBGP), j'ai participé à des premiers travaux ((A08) Gautier et al., 2013) qui nous permettent de mieux comprendre l'information perdue par ces schémas de génotypage. Le séquençage RAD (Restriction site–associated DNA) est une technique récente basée sur la caractérisation de fragments du génome adjacents à des sites de restriction. Lorsque le site de restriction associé au marqueur d'intérêt a muté et perdu sa fonction, il est impossible de séquencer le fragment d'ADN correspondant pour ce site. Ceci introduit possiblement un biais, connu sous le nom d'Allele Drop Out. J'ai participé, avec les mêmes collaborateurs, à l'étude détaillée ((A09) Gautier et al., 2012) de ce biais : il s'avère relativement faible dans la plupart des cas. Nous proposons en outre une méthode pour filtrer les sites où ce biais est important. Il est essentiel de souligner ici que les données NGS nécessitent un lourd travail pour mettre les algorithmes d'inférence à l'échelle de la dimension des données produites et tirer le meilleur parti de cette information.

Une première piste, finalisée depuis peu, correspond aux travaux que j'ai développés avec Kerrie Mengersen (Queensland University of Technology, Brisbane, Australie) et Christian P. Robert (Paris-Dauphine & IUF). Il s'agit dans (A07) Mengersen, Pudlo, and Robert (2013) d'une utilisation originale de la vraisemblance empirique (Owen, 1988, 2010) pour construire un algorithme de calcul bayésien (BCel pour Bayesian Computation via empirical likelihood). Cette méthode BCel utilise des équations d'estimation intra-locus dérivées du modèle via une vraisemblance composite (Lindsay, 1988) par paires de gènes. Au lieu de résoudre ces équations pour trouver une estimation des paramètres d'intérêt par locus et de réconcilier ces différents estimateurs en prenant leur moyenne ou leur médiane, nous utilisons ces équations d'estimation comme entrée de l'algorithme de vraisemblance empirique. Cette dernière reconstruit alors une fonction de vraisemblance à partir des données et de ces équations. Nous avons testé cette méthodologie dans différentes situations. En particulier, en génétique des populations, sur de gros jeux de données composés de marqueurs microsatellites, nous avons montré que notre méthode fournit des résultats plus précis que les méthodes ABC, pourtant considérées comme la référence étalon dans ce domaine.
En outre, BCel permet de réduire grandement les temps de calcul (plusieurs heures en ABC deviennent ici autant de minutes), donc de traiter des jeux de données de dimension plus grande. Cette piste est donc prometteuse pour l'avenir.

2 Graph-based clustering

Keywords. Machine learning, spectral clustering, asymptotic theorems, Cheeger constant, graph-cut, neighborhood graphs, spectral theory of operators.
Papers. See page 64.
• (A02) Pelletier and Pudlo (2011)
• (A03) Arias-Castro, Pelletier, and Pudlo (2012)
• (A04) Cadre, Pelletier, and Pudlo (2013)

Clustering or cluster analysis

Under this generic term we gather data analysis methods that aim at grouping observations or items in such a way that objects in the same group (also called a cluster) are similar to each other. The precise sense of similar will be defined below, though we should stress here that it is subjective. The grouping that ensues depends on the algorithm as well as on its tuning. The most famous methods to achieve this goal are hierarchical clustering, k-means or other centroid-based methods, and model-based methods such as mixtures of multivariate Gaussian distributions, see e.g. Chapter 14 in Hastie et al. (2009), Chapter 10 in Duda et al. (2012) and McLachlan and Peel (2000). Hierarchical clustering algorithms (at least the bottom-up approach) progressively gather data points, starting from the nearest pair of points, to build a whole hierarchy of groups, usually depicted with a dendrogram. Groups of interest are then produced by cutting the dendrogram at a certain level. k-means clustering recovers groups by minimizing the within-cluster sum of squares. Various methods have been derived from this idea, changing the criterion we minimize. Finally, clustering with mixture distributions recovers the groups by assigning each data point to the mixture component with the highest posterior membership probability, after seeking the maximum likelihood estimator with an EM algorithm. The latter class of methods makes it possible to select the number of groups by resorting to penalized likelihood criteria such as BIC or ICL. Though these popular methods have proved useful in numerous applications, they suffer from some internal limitations, can be unstable (hierarchical methods) or fail to uncover clusters of complex (e.g., non-convex) shapes. As in the case of kernel-based learning (see, e.g., Shawe-Taylor and Cristianini, 2004), the methods I have studied with Bruno Pelletier and other colleagues are based on a pairwise comparison of individuals via a similarity function returning a non-negative real number which reflects their proximity. Often the input of these methods is a pairwise similarity or distance matrix, and they ignore the other details of the dataset. Hence we can also resort to these methods for analyzing networks.

Graph-based methods

I have studied two procedures, namely spectral clustering (von Luxburg, 2007) and a neighborhood graph scheme (Biau et al., 2007), both of which can be seen as graph-based algorithms. The vertices of the undirected graph symbolize the observations, i.e. the items we are trying to gather into groups, and the edges link comparable items. The weights on the edges of the graph indicate the strength of the link, see Subsection 1.1 below. The ideal clusters form a partition of the vertices which tends to minimize the number of edges that have their endpoints in different subsets of the partition. When the edges are weighted, the number of edges is replaced by the sum of the weights.
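To make this objective concrete, here is a minimal Python sketch of the quantity to be minimized, namely the total weight of the edges whose endpoints fall in different groups of a candidate partition; the toy similarity matrix is purely illustrative and not taken from the papers.

```python
import numpy as np

def cut_weight(S, labels):
    """Total weight of the edges joining different groups.

    S      : (n, n) symmetric similarity / weighted adjacency matrix
    labels : length-n array of group labels describing a candidate partition
    """
    labels = np.asarray(labels)
    across = labels[:, None] != labels[None, :]   # True where the two endpoints disagree
    return S[across].sum() / 2.0                  # each undirected edge is counted twice

# toy usage: two tight blocks connected by a single weak edge of weight 0.1
S = np.array([[0, 1, 1, 0.1, 0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0,   0, 0],
              [0.1, 0, 0, 0, 1, 1],
              [0, 0, 0, 1,   0, 1],
              [0, 0, 0, 1,   1, 0]], dtype=float)
print(cut_weight(S, [0, 0, 0, 1, 1, 1]))  # 0.1: the ideal two-group partition
```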
Generally, the above graph cut problem is NP-hard. But the spectral clustering algorithm provides the solution of a relaxed optimization problem in polynomial time (von Luxburg, 2007). The noticeable exception used in Section 3 is when the graph has multiple connected components, in which case the partition can be obtained in linear time with the help of standard algorithms based on a breadth-first or depth-first search over the graph.

Preprocessing

The success of clustering methods in revealing an interesting structure of the dataset often depends on preprocessing steps, such as normalization of covariates or deletion of outliers, as well as on a fine tuning of the algorithm. The results we have obtained in (A02) Pelletier and Pudlo (2011) and in (A04) Cadre, Pelletier, and Pudlo (2013) show that we can learn which part of the data space has low density during a first stage of the algorithms and learn how to cluster the data during a second stage of the algorithms, on the same dataset, without any over-fitting or bias. In both procedures, the observations falling into areas of low density are set apart and we do not attempt to assign these items to any revealed group: they are considered as background noise or outliers. Besides, the preprocessing step stabilizes the spectral clustering algorithm, as discussed in Subsection 1.2. The last advantage of the above preprocessing is to fit the intuitive, geometric definition of clusters given by Hartigan (1975) as connected components of an (upper) level set of the density, see below.

The literature on cluster analysis is vast, and so many clustering algorithms have been developed that a comprehensive review would be beyond the scope of this summary of my own research. The references of this chapter reflect only the way I entered this subject. Cluster analysis always implies a subjective dimension mainly depending on the context of its end-use; evaluating objectively the result of an algorithm is almost impossible. As defended by von Luxburg et al. (2012), when facing a concrete dataset, the sole judge of the accuracy of the revealed partition is whether the partition is useful in practice. Whence the importance of studying some algorithms theoretically and proving that the partition obtained on data converges to a solution that depends only on the underlying distribution. The first result in this direction is the consistency of k-means (Pollard, 1981), whose limit is well characterized. However, the limit of the cluster centers depends on the Euclidean distance used to assess the proximity of data points, thus on a subjective choice of the user. Gaussian mixture clustering (see, e.g., McLachlan and Peel, 2000) is also a consistent method whose limit is independent of a distance choice. The partitions provided straightaway by the mixture components are always convex, though we can merge them (Baudry et al., 2010).

Figure 2.1 – Hartigan (1975)'s definition of clusters in terms of connected components of the t-level set of the density: at the chosen threshold, there are two clusters, C1 and C2. If t is much larger, only the right cluster remains, and if t is smaller than the value on the plot, both clusters merge into a single group.

There is no clear, mathematical definition of a cluster anyone agrees with, but Hartigan (1975) outlined the following. Fix a real, non-negative number t. The clusters are the connected components of the upper level set of the density f, namely L(t) := {x : f(x) ≥ t}. See Figure 2.1.
This definition depends of course on the value of t, but also on the reference measure used to define the density, which is another way of hiding normalization of the covariates, distance issues, etc. Other loose definitions can be found in the literature; an example is given by Friedman and Meulman (2004), whose clusters are defined in terms of proximity on different sets of variables (a set which depends on the group the item belongs to).

1 Spectral clustering

The class of spectral clustering algorithms ((A02) Pelletier and Pudlo, 2011) is a recent alternative that has become one of the most popular modern clustering methods, outperforming classical clustering algorithms, see von Luxburg (2007). As with kernel methods (Shawe-Taylor and Cristianini, 2004), the input of the procedure is a pairwise comparison of the items in the dataset via a similarity function s.

1.1 Graph-based clustering

Consider a dataset X_1, . . . , X_n of size n in R^d. The similarity matrix is

S(i, j) := s(h^{-1}(X_j − X_i)),

where s is a given similarity function whose support is the unit ball (or any convex, open, bounded set including the origin), taking values in [0; +∞), and where h is a scale parameter we have to tune. If s(u) = s(−u) for all u, the above matrix is symmetric and its coefficients are non-negative. Examples are given in von Luxburg (2007). Thus it can be seen as the weighted adjacency matrix of a similarity graph (that is allowed to have self-loops): each node i of the graph represents one data point X_i; if S(i, j) = 0, there is no edge between i and j, and if S(i, j) > 0, there is an edge between i and j with weight equal to S(i, j). Note that, if s is the indicator function of the unit ball, namely s(u) = 1{‖u‖ ≤ 1}, the graph is actually unweighted (entries of the adjacency matrix are either 0 or 1) and is called the h-neighborhood graph. It connects all points of the dataset whose pairwise distances are smaller than h. More generally, a benefit of spectral clustering is its flexibility and its ability to adapt to different kinds of covariates with the help of a symmetric similarity s, though we have assumed here that all covariates are continuous.

The normalized algorithm (Ng et al., 2002) is based on the properties of a random walk on the similarity graph: when the walk is at a vertex i, it will jump to another vertex j with probability proportional to S(i, j). Its transition matrix is then

Q := D^{-1} S,    (2.1)

where D is the diagonal matrix defined by D(i, i) = Σ_j S(i, j). Note that D(i, i) can be interpreted as the weighted degree of node i in the similarity graph. Then the algorithm tries to recover clusters as sets of nodes in which the random walk will be trapped for a long time. Other variants of the spectral clustering algorithm rely on other ways of normalizing the similarity matrix, see von Luxburg (2007) or Maier et al. (2013).

A simplified case

To give intuition on the spectral clustering algorithm, we begin with a simplified case where the similarity graph has more than one connected component, say C_1, . . . , C_k. Then, the set of nodes (i.e., the dataset) can be partitioned into recurrent classes. And, if the random walk starts from one node of the graph, it is trapped forever in the recurrent class where this first state lies. The connected components, thus the clusters, can be recovered with the help of clever graph algorithms (breadth-first search or depth-first search).
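In this simplified case, the clusters can thus be read directly off the graph. A minimal sketch, assuming the observations are stored in an (n, d) numpy array X and delegating the graph traversal to scipy:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def h_neighborhood_components(X, h):
    """Connected components of the unweighted h-neighborhood graph.

    Uses the binary similarity s(u) = 1{||u|| <= 1} at scale h, so two points
    are linked whenever their Euclidean distance is at most h.
    """
    D = cdist(X, X)                          # pairwise Euclidean distances
    A = (D <= h).astype(float)               # 0/1 adjacency matrix (self-loops included, harmless)
    k, labels = connected_components(csr_matrix(A), directed=False)
    return k, labels                         # number of recurrent classes and the induced partition
```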
But this method cannot be applied in the general case. The number of recurrent classes, k, is actually the multiplicity of the eigenvalue 1 of the matrix Q. And the linear subspace of (right) eigenvectors corresponding to this largest eigenvalue of Q represents the set of harmonic functions on the graph. Note that, in this setting, a vector V = (V(1), . . . , V(n)) indexed by {1, . . . , n} can be considered as a function v whose argument varies in {X_1, . . . , X_n}, with v(X_i) = V(i). The k vectors V_1 = (V_1(1), . . . , V_1(n)), . . . , V_k = (V_k(1), . . . , V_k(n)) defined by

V_ℓ(i) = 1 if i ∈ C_ℓ, and 0 otherwise,

form a basis of the eigenspace Ker(Q − I). In other words, the eigenvectors are piecewise constant on the connected components C_ℓ, ℓ = 1, . . . , k. Consider now any basis V_1, . . . , V_k of this eigenspace of dimension k. We have that, for any pair i, j,

• if i and j belong to the same component, then the i-th and j-th coordinates of any vector of the basis are equal: V_1(i) = V_1(j), . . . , V_k(i) = V_k(j) (because the vectors of the basis are harmonic functions);
• if i and j do not belong to the same component, then V_ℓ(i) and V_ℓ(j) differ on at least one vector of the basis, i.e., for at least one value of ℓ (because these vectors form a basis of the eigenspace).

With the help of the basis we can send the original dataset into R^k (where k is the number of clusters) as follows: the i-th item, X_i, is replaced with the vector ρ(X_i) = (V_1(i), . . . , V_k(i)) whose coordinates are the i-th coordinates of each vector in the basis. The two above properties of the basis imply that the representation of the dataset in R^k is composed of exactly k different points, in a one-to-one correspondence with the clusters.

The general algorithm

Recall that we consider here only the normalized algorithm of Ng et al. (2002). When the similarity graph cannot be decomposed into connected components as above, the random walk is irreducible and the multiplicity of the (largest) eigenvalue 1 of Q is then one. As considered in von Luxburg et al. (2008) and in (A02) Pelletier and Pudlo (2011), the general situation can be seen as a perturbation of the simplified case, where some null coefficients of the similarity matrix have been replaced by small ones. Then the k largest eigenvalues of Q (counted with their multiplicity) replace the eigenvalue 1 of multiplicity k, and the corresponding eigenvectors V_1, . . . , V_k are substitutes of the basis described above. Likewise, we can send the dataset into R^k via this basis:

ρ(X_i) = (V_1(i), . . . , V_k(i)).    (2.2)

Because of the perturbation, the eigenvectors are now only almost constant on the clusters. Hence the representation is composed of n different points, but they concentrate around k well separated centers. Finally the algorithm recovers the clusters by applying a k-means procedure on this new representation. This leads to Algorithm 1.

Algorithm 1: Normalized spectral clustering
Input: the similarity matrix S; the number of groups k
  compute the transition matrix Q as in (2.1);
  compute the k largest eigenvalues of Q;
  set V_1, . . . , V_k as the corresponding eigenvectors;
  for i ← 1 to n do
    set ρ_i = (V_1(i), . . . , V_k(i));
  end
  apply a k-means algorithm on the n new data points ρ_1, . . . , ρ_n with k groups;
Output: the partition of the dataset into k groups
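The following Python sketch mirrors Algorithm 1; it is an illustration rather than the implementation used in the papers. The eigenvectors of Q are obtained through the conjugate symmetric matrix discussed in the next paragraph, and the k-means step comes from scipy.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def normalized_spectral_clustering(S, k):
    """Sketch of Algorithm 1 for a symmetric, non-negative similarity matrix S.

    Assumes every node has a positive weighted degree (e.g., self-loops are allowed).
    """
    d = S.sum(axis=1)                                 # weighted degrees D(i, i)
    d_isqrt = 1.0 / np.sqrt(d)
    M = d_isqrt[:, None] * S * d_isqrt[None, :]       # D^{-1/2} S D^{-1/2}, conjugate to Q
    eigval, U = np.linalg.eigh(M)                     # symmetric eigenproblem, eigenvalues ascending
    U = U[:, -k:]                                     # eigenvectors of the k largest eigenvalues
    rho = d_isqrt[:, None] * U                        # back to (right) eigenvectors of Q = D^{-1} S,
                                                      # i.e. the spectral representation rho(X_i)
    _, labels = kmeans2(rho, k, minit='++')           # k-means on the rho_i's
    return labels, eigval[::-1]                       # partition and eigenvalues (descending,
                                                      # convenient for the eigengap heuristic)
```

For the h-neighborhood graph of the previous sketch, S is simply the 0/1 adjacency matrix A.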
It is worth noting that the matrix Q is similar (or conjugate) to a symmetric matrix:

Q = D^{-1/2} (D^{-1/2} S D^{-1/2}) D^{1/2}.

Hence, first, Q is diagonalizable and, second, the spectra of Q and D^{-1/2} S D^{-1/2} are equal. In particular, all eigenvalues of Q are real, and a vector V is an eigenvector of Q if and only if D^{1/2} V is an eigenvector of the symmetric matrix D^{-1/2} S D^{-1/2}. And finally, all eigenvalues of Q lie in [−1; 1] because, for each i,

Σ_j Q(i, j) = Σ_j |Q(i, j)| = 1.

So the computation of the largest eigenvalues and the eigenvectors of Q can be performed with efficient algorithms for symmetric matrices.

Graph cut, conductance, . . .

There is another interpretation of Algorithm 1 in terms of conductance, graph cut and mixing time of the Markov chain defined by the random walk on the graph. To simplify the discussion, let us assume that the number of groups k equals 2. Since the weights of the edges represent the strength of the similarity between data points, it makes sense to partition the dataset by minimizing the weights of the edges with endpoints in different groups. So, if A is a subset of nodes and Ā its complement, we seek a bottleneck, that is, a subset A of nodes not well connected with the rest of the graph, namely Ā. Thus one might be tempted to minimize

S(A, Ā) := Σ_{i∈A, j∈Ā} s(i, j).

Very often, the minimum of this optimization problem is realized when A (or equivalently Ā) is a singleton composed of one observation rather far from the rest of the dataset. To avoid such undesirable solutions, we normalize S(A, Ā) by a statistic that takes into account the sizes of A and Ā. Various normalizations are possible, see e.g. von Luxburg (2007), but we can consider the Cheeger cut, also called the normalized cut, defined as

Cut(A, Ā) := S(A, Ā) / min(S(A), S(Ā)),  where  S(A) := Σ_{i∈A, j∈A} s(i, j).    (2.3)

The minimization of the normalized cut can be transformed into the optimization of a quadratic form on R^n with very complex constraints, and if f = (f(1), . . . , f(n)) is a point where the quadratic form is minimal, the partition can be recovered as A = {i : f(i) > 0} and Ā = {i : f(i) < 0}, see Shi and Malik (2000) and von Luxburg (2007). Due to the constraints, the optimization problem is NP-hard in general, and Algorithm 1 gives an approximate solution by removing some of the constraints (optimization problems of this kind are related to the eigendecomposition of symmetric matrices via the Rayleigh-Ritz theorem). Moreover, the minimum value of Cut(A, Ā) is called the Cheeger constant, which controls how fast the random walk converges to its stationary distribution. Indeed, the presence of a narrow bottleneck between A and Ā will prevent the chain from moving easily from A to its complement. The precise result is a control on the second largest eigenvalue of Q, see e.g. Diaconis and Stroock (1991).

1.2 Consistency results

Previous results from the literature

Assume that the dataset is an iid sample X_1, . . . , X_n from an unknown distribution P on R^d. If the number of groups k is known, von Luxburg et al. (2008) have proven that, when the data size n tends to infinity and when the scale parameter h is fixed, the Markovian transition matrix Q tends (in some weak sense) to a Markovian operator whose state space is the support of the underlying distribution P. Their proof follows the footsteps of Koltchinskii (1998); it has been simplified by Rosasco et al. (2010) a few years later.
As a consequence, with the help of functional analysis results, they can prove that the eigendecomposition of Q converges to the eigendecomposition of the limit operator, which is an integral operator. Yet the limit partition is difficult to interpret, and its geometrical meaning in relation to the unknown distribution P is neither explicit nor discussed in either paper. The geometry of spectral clustering is still a hot topic, see e.g. Schiebinger et al. (2014).

We believed that this is maybe not the correct limit to consider. Indeed, the scale parameter h can be compared to a bandwidth, and the weighted degree of node i,

D(i, i) = Σ_j S(i, j) = Σ_j s(h^{-1}(X_j − X_i)),

is a kernel estimate of the density f of P at point X_i (up to a constant). It is well known that kernel density estimators converge to some function f_h when h is fixed and the size of the dataset n → ∞. But this is a biased estimator and f_h ≠ f. In the similarity graph context, when h remains constant while the number of data points increases, the narrowness of the bottleneck decreases. In other words, there are more points in between clusters, and the random walk defined by Q will jump from one group to another more easily. Our intuition with Bruno Pelletier was that the bottleneck would not be wiped out if h → 0 while n → ∞. But the limit problem becomes harder, see Section 2.

Figure 2.2 – Results of our algorithm on a toy example. (left) the partition given by the spectral clustering procedure (points in black remain unclassified); (right) the spectral representation of the dataset, i.e., the ρ(X_i)'s: the relatively large spread of the points on the top of the plot is due to poor mixing properties of the random walk on the red points.

Consistency proven in (A02) Pelletier and Pudlo (2011)

The cluster analysis becomes much easier if we remove the data points in between groups. This is what we have proposed in (A02) Pelletier and Pudlo (2011). We start with a preprocessing step that sets apart points in areas of low density. This preprocessing is performed with the help of a consistent density estimator f̂_n (note that we have at our disposal a kernel density estimator if we multiply the weighted degrees of the similarity graph by the correct constant). We fix a threshold t and keep the points whose indices lie in

J(t) := {j : f̂_n(X_j) ≥ t}.

The density estimator f̂_n is (of course) computed on the dataset X_1, . . . , X_n we want to study; we do not need to divide the dataset into different folds (one for density estimation, one for the cluster analysis). We then apply Algorithm 1 on the subset of kept observations. Because of the preprocessing, the data points we use as input of the spectral clustering algorithm easily group into well separated areas of R^d whatever the size n. Hence we do not need an intricate calibration of h. An example is given in Figure 2.2. Under mild assumptions not recalled here, we have proven in (A02) Pelletier and Pudlo (2011) that the whole procedure is consistent and that the groups converge to the connected components of the upper level set of the density f defined as L(t) = {x : f(x) ≥ t}, as long as the scale parameter h is below a threshold that does not depend on the size of the dataset! Additionally, the limit corresponds to the definition of a cluster given by Hartigan (1975) and can be easily interpreted. Yet the data points that fall outside J(t) remain unclassified.
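A minimal sketch of this two-stage procedure, reusing the normalized_spectral_clustering function sketched above; the Gaussian kernel density estimator and the calibration of t as a p-quantile (anticipating Section 3) are convenient choices for the illustration, not necessarily those of (A02).

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import cdist

def levelset_spectral_clustering(X, k, h, p=0.05):
    """Discard a fraction p of low-density points, then cluster the kept points.

    X : (n, d) array of observations; k : number of groups; h : scale of the graph.
    Points below the density threshold keep the label -1 (unclassified).
    """
    f_hat = gaussian_kde(X.T)                     # consistent density estimator \hat f_n
    dens = f_hat(X.T)                             # \hat f_n(X_1), ..., \hat f_n(X_n)
    t = np.quantile(dens, p)                      # threshold t as the p-quantile (an order statistic)
    keep = np.where(dens >= t)[0]                 # indices J(t) = {j : \hat f_n(X_j) >= t}
    D = cdist(X[keep], X[keep])
    S = (D <= h).astype(float)                    # h-neighborhood similarity graph on the kept points
    labels_kept, _ = normalized_spectral_clustering(S, k)
    labels = np.full(len(X), -1)                  # low-density points remain unclassified
    labels[keep] = labels_kept
    return labels
```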
Besides, we have proven that the Markovian transition matrices Q converge strongly (in a certain operator norm sense) to a compact integral operator, which is a stronger result than the one obtained by von Luxburg et al. (2008). The trick was to consider the appropriate Banach space

W(L(t)) := {v of class C^1 on L(t)},

equipped with the norm

‖v‖_W := sup_x |v(x)| + Σ_i sup_x |∂v/∂x_i (x)|.

2 Consistency of the graph cut problem

Difficulty of studying the spectral clustering algorithm

As explained above, we believed that a correct calibration of h in the spectral clustering algorithm would lead to a scale h that goes to 0 when n → ∞. In this regime, the matrices Q converge to the identity operator (which is neither an integral operator nor a compact operator anymore) and we have to study the so-called normalized graph Laplacian

(nh^{d+2})^{-1} (I − Q)    (2.4)

(correctly scaled) in order to obtain a non-trivial limit. Indeed, as shown in Belkin and Niyogi (2005) or Giné and Koltchinskii (2006) for the case where the distribution is uniform over an unknown set of the Euclidean space, the limit operator is a Laplace-Beltrami operator (i.e., a second order derivative operator). At least intuitively, this implies that the random walk on the similarity graph converges to a diffusive continuous Markov process on the support of the underlying distribution when time is correctly scaled. A formal proof has not been written, but this could be done resorting to theorems in Ethier and Kurtz (2009). But the spectral decomposition of such unbounded operators is much more difficult to handle, and thus the asymptotic study of the spectral algorithm requires much deeper theorems of functional analysis than the (not trivial!) ones manipulated in von Luxburg et al. (2008), Rosasco et al. (2010) or our own (A02) Pelletier and Pudlo (2011).

Laplace-Beltrami operators

The simplest Laplace operator is the one dimensional ∆ = −∂²/(∂x)², which is the infinitesimal generator of the Brownian motion. On the real line R, any real number λ is an eigenvalue of ∆, and eigenfunctions are either

x ↦ C sin(√|λ| x) + C′ cos(√|λ| x)   or   x ↦ C sinh(√|λ| x) + C′ cosh(√|λ| x),

depending on the sign of λ. Hence the spectrum of ∆ is far from being countable. While on the circle S¹, i.e. the segment [0; 2π] whose endpoints have been glued, the spectrum of ∆ is countable, limited to the squared integers λ = k², k ∈ N, and, up to an additive constant, eigenfunctions are now

x ↦ C sin(√|λ| x) + C′ cos(√|λ| x).

We face here the very difference between the Fourier expansion of a (2π)-periodic function and the Fourier transform of a real function on the real line. To circumvent the functional analysis difficulties, we studied the graph cut problem in (A03) Arias-Castro, Pelletier, and Pudlo (2012). Recall from Section 1.1 that the spectral algorithm can be seen as an approximation of the graph cut optimization. Thus we discard the evaluation of the accuracy of the latter approximation. What remains to study is the convergence of a sequence of optimization problems when n → ∞ and h → 0, although they are practically NP-hard to solve.

Results

The results we have obtained in (A03) Arias-Castro, Pelletier, and Pudlo (2012) are rather limited in their statistical consequences despite the difficulty of the proofs.
Yet we have shown that our intuition on a relevant calibration of the scale h with respect to the data size was correct: in this setting, the correct scaling implies that h → 0 when n → ∞. The main assumptions are as follows:

• the distribution of the iid dataset X_1, . . . , X_n is the uniform distribution on a compact set M of R^d;
• the similarity function is the indicator function of the unit ball of R^d, so that the similarity graph is the h-neighborhood graph (see above in Section 1.1);
• and n → ∞ and h → 0 such that nh^{2d+1} → ∞.

The toy example of Figure 2.3 is a uniform sample of size 1,500, where M is the union of two overlapping circles. Since the graph-cut problem is NP-hard, the approximate solution drawn in Figure 2.3 was computed with the spectral clustering procedure (here without any preprocessing).

Figure 2.3 – Graph-cut on a toy example. The red line represents the bottleneck, the h-neighborhood graph is in gray. The partition returned by the spectral clustering method with the h-neighborhood graph corresponds to the color of the crosses (blue or orange).

To avoid peculiarities of the discrete optimization problem on a finite graph and non-regular solutions, we optimize (2.3) under the constraint that A has a smooth boundary in R^d and that the curvature of the boundary of A is not too large; measured in terms of reach (a mild way to define a radius of curvature), we constrained the reach of ∂A to be above a threshold ρ, with ρ → 0 such that, for some α > 0, when n → ∞,

• hρ^{-α} → 0 and
• nh^{2d+1} ρ^α → ∞.

In a first theorem, we have proven that, when n → ∞, if well normalized, the function defined in (2.3) converges almost surely to

c(A; M \ A) := perimeter(M ∩ ∂A) / min(volume(A ∩ M), volume(M \ A)),    (2.5)

where ∂A denotes the boundary of A. Minimizing such a function of the set A detects the narrowest bottleneck of the compact set M: if c(A; M \ A) is small for some A, then we can divide M into two sets, A and M \ A, of rather large contents (because of the volumes in the denominator of (2.5)) but separated by a hyper-surface of small perimeter. This is known in geometry as the Cheeger problem: the smallest value of c(A; M \ A) is the Cheeger (isoperimetric) constant of the compact set M. The two other theorems we have proven in (A03) Arias-Castro, Pelletier, and Pudlo (2012) show that, with the above constraints, the smallest value of (2.3) converges almost surely to the smallest value of (2.5) up to a normalizing constant, and that the discrete sets we obtain on data can serve as a skeleton to recover the optimal set A at the limit.

3 Selecting the number of groups

Calibration of the threshold t

We mentioned earlier that Hartigan (1975)'s clusters depend on the value of the threshold t we have to set. A clear procedure to fix the value of t is to assume that we want to discard a small part of the dataset, say of probability p. In (A04) Cadre, Pelletier, and Pudlo (2013), we studied consistency of the algorithm estimating t as the p-quantile of f̂_n(X_1), . . . , f̂_n(X_n), where f̂_n is a density estimator trained on X_1, . . . , X_n, a quantile that can be easily computed using an order statistic. Under mild assumptions, we established concentration inequalities for t when n → ∞, depending on the supremum norm sup_x |f̂_n(x) − f(x)|.
Under mild assumptions, we established concentration inequalities for t when n → ∞, depending on the supremum norm sup_x |f̂_n(x) − f(x)|. Moreover, when f̂_n is a kernel density estimator, we controlled the Lebesgue measure of the symmetric difference between the true level set of f of probability (1 − p) and the level set of f̂_n using the calibrated value of t. The exact convergence rate we obtained is proportional to (nε^d)^{-1/2}, where ε is the bandwidth of the kernel density estimator (see also Rigollet and Vert, 2009). Note that ε tends to 0 when n → ∞, so that the rate of convergence is (as is common in nonparametric settings) much slower than n^{-1/2}.

Estimating the number of components in the level set

A few years before, Biau et al. (2007) provided a graph-based method to select the number of groups in such nonparametric settings. Their estimator, inspired by Cuevas et al. (2001), can be described as follows: it is the multiplicity of the eigenvalue 1 of the matrix Q described in Section 1.1 when the similarity function s is the indicator function of the unit ball and when the scale factor h is well calibrated. We recall here that the similarity graph is then the unweighted h-neighborhood graph. The results of (A04) Cadre, Pelletier, and Pudlo (2013) were originally complemented by concentration results on this estimator of k, which were not published in the journal. These results can be found on the French preprint server HAL, see http://hal.archives-ouvertes.fr/hal-00397437. In comparison with the original results in Biau et al. (2007), the concentration inequalities we proved do not require any assumption on the gradient of f̂_n and take into account the estimation of t.

The eigengap heuristic

The idea behind generalizing the above reasoning to the spectral clustering algorithm is still based on perturbation theory. Indeed, if the number of groups is k, we expect to observe k eigenvalues around 1, and a gap between these largest eigenvalues and the (k + 1)-th eigenvalue. Thus, if λ_1 > ⋯ > λ_n are the eigenvalues of Q, we seek the smallest value of k for which λ_k − λ_{k+1} is relatively large compared to λ_ℓ − λ_{ℓ+1}, ℓ = 1, ..., k − 1. An example is given in Figure 2.4.

Figure 2.4 – The graph-based selection of k: (left) the data points and the spectral clustering output; (right) the eigenvalues of the matrix Q: the eigengap is clearly between λ_3 and λ_4.

The eigengap heuristic usually works well when
• the data contain well-pronounced clusters,
• the scale parameter is well calibrated and
• the clusters are not far from being convex.
For instance, the toy example of Figure 2.3 does not present any eigengap. We tried several toy examples with Mohammed Sedki during a graduate internship, but we were unable to obtain a general algorithm producing reasonable results whatever the example; see my paper in the proceedings of the conference (T10), 41ièmes Journées de Statistique, Bordeaux, 2009.

4 Perspectives

Calibration

The major problem with unsupervised learning is calibration, since we cannot resort to cross-validation. As we have seen above, many properties we are able to prove about the spectral clustering algorithm depend on the fact that the scale parameter h is well chosen. A possible road to solving this problem is based on the fact that the weighted degrees of the nodes (i.e., of the observations) form a density estimate. Indeed, we have
\[
(nh^d)^{-1} D(i,i) = (nh^d)^{-1} \sum_j S(i,j) = n^{-1} \sum_j h^{-d}\, s\!\left(h^{-1}(X_j - X_i)\right),
\]
and we can calibrate h using cross-validation, as sketched below.
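The following sketch illustrates one possible cross-validated calibration of h built on this degree-based density estimate; the leave-one-out log-likelihood criterion and the grid search are illustrative choices, not the procedure studied in our papers.

import numpy as np
from math import pi, gamma

def loo_log_likelihood(X, h):
    """Leave-one-out log-likelihood of the degree-based density estimate with the
    indicator similarity s(u) = 1{|u| <= 1}, i.e. on the h-neighborhood graph."""
    n, d = X.shape
    ball_volume = pi ** (d / 2) / gamma(d / 2 + 1) * h ** d
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = (D2 <= h ** 2).astype(float)
    np.fill_diagonal(S, 0.0)                      # leave-one-out: drop the self-similarity
    fhat_loo = S.sum(axis=1) / ((n - 1) * ball_volume)
    if np.any(fhat_loo == 0.0):                   # an isolated point rules this h out
        return -np.inf
    return np.log(fhat_loo).sum()

def calibrate_scale(X, grid):
    """Pick the scale h maximizing the leave-one-out criterion over a grid."""
    scores = [loo_log_likelihood(X, h) for h in grid]
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
h_star = calibrate_scale(X, grid=np.linspace(0.1, 1.5, 15))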
We have no theoretical guarantee that this way of proceeding is the best but, at least, the similarity graph then captures some important features of the data. Another, more standard, procedure to set h is to rely on the stability of the partition returned by the algorithm when we subsample the dataset or add small noise, see e.g. Ben-David et al. (2006) or von Luxburg et al. (2012), but such procedures need careful handling.

Selecting the number of groups

The eigengap heuristic is not optimal for selecting the number of groups. Another possible road is to focus on the concentration into clusters of the spectral representation of the data points, i.e., the ρ(X_i)'s defined in (2.2). From this perspective, we look for the largest value of k such that the within-cluster sum of squares of the ρ(X_i)'s remains small. The difficulty is that the dimension of the spectral representation is k and thus varies with the number of groups.

3 Computational statistics and intractable likelihoods

Keywords. Monte Carlo methods, computational statistics, approximate Bayesian computation, importance sampling, empirical likelihood, population genetics.

Papers. See page 64.
• (A05) Marin, Pudlo, Robert, and Ryder (2012)
• (A06) Estoup, Lombaert, Marin, Robert, Guillemaud, Pudlo, and Cornuet (2012)
• (A07) Mengersen, Pudlo, and Robert (2013)
• (A08) Gautier, Foucaud, Gharbi, Cezard, Galan, Loiseau, Thomson, Pudlo, Kerdelhué, and Estoup (2013)
• (A09) Gautier, Gharbi, Cezard, Foucaud, Kerdelhué, Pudlo, Cornuet, and Estoup (2012)
• (A10) Ratmann, Pudlo, Richardson, and Robert (2011)
• (A11) Sedki, Pudlo, Marin, Robert, and Cornuet (2013)
• (A12) Cornuet, Pudlo, Veyssier, Dehne-Garcia, Gautier, Leblois, Marin, and Estoup (2014)
• (A13) Baragatti and Pudlo (2014)
• (A14) Leblois, Pudlo, Néron, Bertaux, Beeravolu, Vitalis, and Rousset (2014)
• (A15) Stoehr, Pudlo, and Cucala (2014)
• (A16) Marin, Pudlo, and Sedki (2014)
• (A17) Pudlo, Marin, Estoup, Gauthier, Cornuet, and Robert (2014)

Intractable likelihoods

Since 2010 and the arrival of Jean-Michel Marin at Montpellier, I have been interested in computational statistics for Bayesian inference with intractable likelihoods. When conducting a parametric Bayesian analysis, Monte Carlo methods aim at approximating the posterior via the empirical distribution of a simulated sample on the parameter space, Φ say. More precisely, in the regular case, the posterior has density
\[
\pi(\phi \mid x_{\mathrm{obs}}) \propto p_{\mathrm{prior}}(\phi)\, f(x_{\mathrm{obs}} \mid \phi),
\tag{3.1}
\]
where p_prior(φ) is the prior density on Φ and f(x | φ) the density of the stochastic model. But the likelihood may be unavailable for mathematical reasons (no closed form as a function of φ) or for computational reasons (too expensive to calculate). Difficulties are obvious when the stochastic model is based on a latent process u ∈ U, i.e.,
\[
f(x \mid \phi) = \int_U f(x, u \mid \phi)\, du,
\]
and the above integral cannot be computed explicitly, or when the likelihood is known up to a normalising constant depending on φ, i.e.,
\[
f(x \mid \phi) = \frac{g(x, \phi)}{Z_\phi}, \qquad \text{where } Z_\phi = \int g(x, \phi)\, dx.
\]
And some models, such as hidden Markov random fields, suffer from both difficulties, see, e.g., Everitt (2012). Monte Carlo algorithms, for instance Metropolis-Hastings algorithms (see e.g. Robert and Casella, 2004), require numerical evaluations of the likelihood f(x_obs | φ) at many values of φ. Or, using a Gibbs sampler, one needs to be able to simulate from both conditionals f(φ | u, x_obs) and f(u | x_obs, φ).
Moreover, even if the above is often possible, the increase in dimension induced by the data augmentation from φ to (φ, u) may be such that the properties of the MCMC algorithm are too poor for it to be considered. The limits of MCMC (poor mixing in complex models and multi-modal likelihoods) have been reached, for instance, in population genetics. Authors in this field of evolutionary sciences, which has been my favorite application domain in the past years, proposed a new methodology named approximate Bayesian computation (ABC) to deal with intractable likelihoods. A major benefit of this class of algorithms (which contributes to its success) is that computations of the likelihood are replaced by simulations from the stochastic models. Since its introduction by Tavaré et al. (1997), it has been widely used and has provoked a flurry of research. My contributions to ABC methods are described in Section 1, including fast ABC samplers, high performance computing strategies and ABC model choice issues.

When simulating datasets is slow (for instance with high dimensional data), ABC is no longer an option. Thus we have proposed to replace the intractable likelihood with an approximation given by the empirical likelihood of Owen (1988, 2010), see Section 2. On the contrary, when the dimension of the latent process u and of the parameter φ is moderate, we can rely on importance sampling to approximate the relevant integrals; my work on this last class of methods is described in Section 3. We end the chapter with a short presentation of the stochastic models in population genetics and a few perspectives to face the drastic increase in the size of genetic data due to next generation sequencing (NGS) techniques. These models are almost all available in the DIYABC software to which I contributed, see (A12) Cornuet et al. (2014).

1 Approximate Bayesian computation

We have written two reviews on ABC, namely (A05) Marin, Pudlo, Robert, and Ryder (2012), complemented by (A13) Baragatti and Pudlo (2014). Before presenting our original developments, we begin with a short presentation of ABC methods.

1.1 Recap

Basic samplers and their targets

The basic idea underlying ABC is as follows. Using simulations from the stochastic model, we can produce a simulated sample from the joint distribution
\[
\pi(\phi, x) := p_{\mathrm{prior}}(\phi)\, f(x \mid \phi).
\tag{3.2}
\]
The posterior distribution (3.1) is then the conditional distribution of (3.2) knowing that the data x are equal to the observation x_obs. From this simulated sample of the joint distribution, ABC derives approximations of the conditional density π(φ | x_obs) and of other functionals of the posterior such as moments or quantiles.

This elegant idea suffers from two shortcomings. First, the algorithm might be time consuming if simulating from the stochastic model is not straightforward. But the most profound problem is that estimating the conditional distribution of φ knowing x = x_obs requires that some simulations fall into a neighbourhood of x_obs. If the data space X is not of very small dimension, we face the curse of dimensionality, namely that it is almost impossible to get a simulated dataset near the observed one. To solve the problem, ABC schemes perform a (nonlinear) projection of the (observed and simulated) datasets onto a space of reasonable dimension d via some summary statistics η : X → R^d and set a metric δ[s, s′] on R^d. Hence Algorithm 2.
Algorithm 2: ABC acceptation-rejection algorithm with a given threshold ε
for i = 1 → N_prior do
    draw φ_i from the prior p_prior(φ);
    draw x from the likelihood f(x | φ_i);
    compute s_i = η(x);
    store the particle (φ_i, s_i);
end
for i = 1 → N_prior do
    compute the distance δ[s_i, s_obs];
    reject the particle if δ[s_i, s_obs] > ε;
end

Note that the first loop does not depend on the dataset x_obs, hence the simulated particles might be reused to analyse other datasets. In the second loop, the particles can also be weighted by w_i = K_ε(δ[s_i, s_obs]), where K_ε is some smoothing kernel with bandwidth ε. The acceptation-rejection algorithm is then just a particular case of the weighted algorithm, with K_ε(δ) = 1{δ ≤ ε}.
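To fix ideas, here is a minimal Python transcription of Algorithm 2, with the tolerance ε calibrated as a quantile of the simulated distances (as is done in practice, see Section 1.2). The stochastic model, the summary statistics and the toy Gaussian example are placeholders to be supplied by the user, not a model from our applications.

import numpy as np

def abc_rejection(x_obs, prior_sampler, simulator, summary, n_prior, quantile=0.01):
    """ABC rejection sampler: build a reference table, then keep the particles whose
    summaries fall within the quantile-calibrated tolerance of the observed summaries."""
    s_obs = np.asarray(summary(x_obs))
    particles, summaries = [], []
    for _ in range(n_prior):                          # first loop: the reference table
        phi = prior_sampler()
        summaries.append(np.asarray(summary(simulator(phi))))
        particles.append(phi)
    dist = np.linalg.norm(np.array(summaries) - s_obs, axis=1)
    eps = np.quantile(dist, quantile)                 # tolerance = quantile of the distances
    kept = np.array([phi for phi, d in zip(particles, dist) if d <= eps])
    return kept, eps

# toy example: posterior of a Gaussian mean, summarized by the sample mean
rng = np.random.default_rng(3)
x_obs = rng.normal(1.0, 1.0, size=50)
kept, eps = abc_rejection(
    x_obs,
    prior_sampler=lambda: rng.normal(0.0, 5.0),
    simulator=lambda phi: rng.normal(phi, 1.0, size=50),
    summary=lambda x: [np.mean(x)],
    n_prior=20000,
)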
Target of the algorithm and approximations

The holy grail of the ABC scheme is the posterior distribution (3.1). To bypass the curse of dimensionality, ABC introduces the nonlinear projection η : X → R^d. Hence, we cannot recover anything better than the conditional distribution of φ knowing that the simulated summary statistics η(x) are equal to s_obs = η(x_obs), i.e.,
\[
\pi\big(\phi \mid \eta(x) = s_{\mathrm{obs}}\big).
\tag{3.3}
\]
Moreover, the event η(x) = s_obs might have very low probability, if not zero. Hence, the output of the ABC algorithm is a sample from the distribution conditioned on the larger event A_ε = {δ[η(x), s_obs] ≤ ε}, namely
\[
\pi\big(\phi \mid \delta[\eta(x), s_{\mathrm{obs}}] \le \varepsilon\big).
\tag{3.4}
\]
The distribution (3.4) tends to (3.3) when ε → 0. But, if the user wants to control the size of the output, decreasing the threshold ε might be problematic in terms of processing time: indeed, if we want a sample of size N from (3.4), the algorithm requires an average number of simulations N_prior = N/π(A_ε), and the probability of the event A_ε, π(A_ε), can decrease very fast toward 0 when ε → 0. Actually, the error we commit by estimating the density of the output of the algorithm rather than computing (3.3) explicitly has been widely studied in nonparametric statistics since the seminal paper of Rosenblatt (1969). But the discrepancy between (3.3) and the genuine posterior (3.1) remains insufficiently explored.

1.2 Auto-calibrated SMC sampler

Rejection sampling (Algorithm 2) and ABC-MCMC methods (Marjoram et al., 2003) can perform poorly if the tolerance level ε is small. Various sequential Monte Carlo algorithms (see Doucet et al., 2001; Del Moral, 2004; Liu, 2008a, for general references) have been constructed as alternatives to these two methods: Sisson et al. (2007, 2009), Beaumont et al. (2009), Drovandi and Pettitt (2011) and Del Moral et al. (2012). These algorithms start from a large tolerance level ε_0, and at each iteration the tolerance level decreases, ε_t < ε_{t−1}. The simulation problem therefore becomes more and more difficult, whereas the proposal distribution for the parameters φ gets closer and closer to the posterior. In practice, the tolerance level ε used in the rejection sampling algorithm is not fixed in advance but corresponds to a quantile of the distances between the observed dataset and the simulated ones — see the interpretation of this calibration in terms of nearest neighbors in Biau et al. (2012), as well as in Section 1.3.

The algorithm of Beaumont et al. (2009) corrects the bias introduced by Sisson et al. (2007) and is a particular population Monte Carlo scheme (Cappé et al., 2004). It requires fixing a sequence of decreasing tolerance levels ε_0 > ε_1 > ⋯ > ε_T, which is not very realistic for practical problems. In contrast, the proposals of Del Moral et al. (2012) and Drovandi and Pettitt (2011) are adapted likelihood-free versions of the sequential Monte Carlo sampler (Del Moral et al., 2006) and include a self-calibration mechanism for the sequence of decreasing tolerance levels. Sadly, in some situations, these auto-calibrated algorithms do not permit any gain in computation time when all other calculations are negligible compared with the simulation of one dataset from the model, see our test presented in Table 1 of (A11) Sedki, Pudlo, Marin, Robert, and Cornuet (2013). This is typically true for complex models, e.g. for complex scenarios in population genetics.

With my former PhD student Mohammed Sedki, we have transformed the likelihood-free SMC sampler of Del Moral et al. (2012) in order to keep the number of datasets simulated from the stochastic model as low as possible. Indeed, the rejection sampler of Algorithm 2 proposes values of φ according to the prior distribution, hence probably in areas of the parameter space with low posterior density. Almost any value of φ in such areas will produce simulated datasets far from the observed one. The idea of sequential ABC algorithms is to learn the posterior distribution gradually, i.e. to learn in which area of the parameter space we should draw the proposed parameter φ. It is also worth noting that, even if we draw parameters from the (unknown) posterior distribution, the simulated datasets are distributed according to the posterior predictive law, and such simulated datasets do not automatically fall in a small neighborhood of the observed data (measured in terms of summary statistics). Hence, even if we learn perfectly how to sample the parameter space, reducing the value of the threshold ε always induces an irreducible computational cost.

The algorithm we proposed in (A11) Sedki, Pudlo, Marin, Robert, and Cornuet (2013) is based on the above assessments and can be described as follows:
(a) begin with a first run of the rejection sampler proposing N_prior = N_0 particles and accepting N_1 particles (N_1 < N_0, usually N_1 is one tenth of N_0);
(b) use a few iterations of ABC-SMC to update the sample of size N_1 and reduce the threshold ε;
(c) end with a last run of the rejection sampler on the N_1 particles produced by ABC-SMC and return to the user a sample of particles of size N_2 < N_1.
Step (a) produces a crude estimate of the posterior distribution and has proved crucial (in numerical examples) to reduce the computational cost of the whole procedure. Step (b) includes a self-calibration of the sequence of thresholds, resulting from a trade-off between small thresholds and the quality of the sample of size N_1. This step also includes a stopping criterion to detect when ABC-SMC loses efficiency and faces the irreducible computational cost mentioned above. The ABC approximation of the posterior is often far from admissible when the stopping criterion is met: it is designed to detect when it becomes useless to keep learning where to draw parameters φ in order to reduce the threshold ε easily, not to serve as a guarantee that the final sample is a good approximation of the posterior. Hence, we end the whole algorithm with the rejection step (c), which decreases the threshold by keeping only a small proportion of the particles returned by ABC-SMC.
We have illustrated the numerical behavior of the proposed scheme on simulated data and on a challenging real-data example from population genetics, studying the European honeybee Apis mellifera. In the latter example, the computational cost was about half that of Algorithm 2. Finally, in a conference paper (Marin, Pudlo, and Sedki, 2012), we compared different parallelization strategies for our own algorithm and for Algorithm 2. Algorithm 2 is embarrassingly parallel: each draw from the joint distribution can be done independently by a core of the CPU or by a computer of the cluster. On the other hand, between two iterations of ABC-SMC, there is a step (sorting the particles with respect to their distances to s_obs and calibrating ε) that cannot be parallelized. We indeed show that the parallel overhead of sequential ABC samplers prohibits their use on large clusters of computers, while the simplest algorithm can take advantage of the many computers of the cluster without losing time when the relevant information has to be distributed to the computers.

1.3 ABC model choice

There have been several attempts to improve models by interpreting the lack of fit between simulated and observed values of interpretable summary statistics, see our own work (A10) Ratmann, Pudlo, Richardson, and Robert (2011). But ABC model choice is generally conducted with Algorithm 3, which rearranges Algorithm 2 for Bayesian model choice. Posterior probabilities of each model are then estimated by the frequency of each model among the kept particles. If the goal of the Bayesian analysis is the selection of the model that best fits the observed data x_obs (to decide between various possible histories in population genetics for instance), it is performed through the maximum a posteriori (MAP) model number, replacing the unknown probabilities with their ABC approximations.

Algorithm 3: ABC acceptation-rejection algorithm for model choice with a given threshold ε
for i = 1 → N_prior do
    choose a model m_i at random from the prior on the model number;
    draw φ_i from the prior p_{m_i-prior}(φ) of model m_i;
    draw x from the likelihood f(x | φ_i, m_i) of model m_i;
    compute s_i = η(x);
    store the particle (m_i, φ_i, s_i);
end
for i = 1 → N_prior do
    compute the distance δ[s_i, s_obs];
    reject the particle if δ[s_i, s_obs] > ε;
end

Sadly, Robert et al. (2011) and Didelot et al. (2011) have raised important warnings regarding model choice with ABC, since there is a fundamental discrepancy between the genuine posterior probabilities and the ones based on summary statistics.

Moving away from the rejection algorithms

From the standpoint of machine learning, the reference table simulated during the first loop of Algorithm 3 serves as a training database composed of iid replicates drawn from the joint Bayesian distribution (model × parameter × summaries of the dataset), which can be seen as a hierarchical model. To select the best model, we have drifted gradually towards more sophisticated machine learning procedures. With my PhD student Julien Stoehr, we have considered the ABC model choice algorithm as a k-nearest neighbor (knn) method: the calibration of ε in Algorithm 3 is thus transformed into the calibration of k. This knn interpretation was also introduced by Biau et al. (2012) for parameter inference.
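To make this machine-learning view concrete, here is a minimal sketch of ABC model choice as knn classification with scikit-learn. The reference table below is a random placeholder standing in for simulations from Algorithm 3; the number of models, of summaries and of neighbors are arbitrary illustrative values.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
n_ref, n_summaries = 10000, 5
models = rng.integers(0, 3, size=n_ref)            # model index drawn from its prior
summaries = rng.normal(size=(n_ref, n_summaries))  # eta(x) for each simulated dataset
s_obs = rng.normal(size=(1, n_summaries))          # summaries of the observed dataset

# Keeping the k particles closest to s_obs and taking the majority model is exactly
# a k-nearest-neighbor decision at s_obs.
knn = KNeighborsClassifier(n_neighbors=200)
knn.fit(summaries, models)
map_model = knn.predict(s_obs)            # model selected by majority vote
freq = knn.predict_proba(s_obs)           # model frequencies among the k nearest particles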
Indeed, the theoretical results of Marin et al. (2013) require a shift away from the approximation of the posterior probabilities: the only issue that ABC can address with reliability is classification (i.e., deciding in favor of the model that best fits the data). In this first paper, we proposed to calibrate k by minimizing the misclassification error rate of knn on a test reference table, drawn independently from the simulations that have been used to train the knn classifier. Recall that, if m̂ is a classifier, the misclassification error rate is
\[
\tau := P\big(\hat m(X) \neq M\big) = \sum_m p(m) \iint 1\{\hat m(x) \neq m\}\, p_{m\text{-prior}}(\phi)\, f(x \mid \phi, m)\, dx\, d\phi.
\tag{3.5}
\]
With the help of Nadaraya-Watson estimators, we also proposed to disintegrate the misclassification rate so as to get the conditional expected value of the misclassification loss knowing the observed data (or, more precisely, knowing some summaries of the observed data). This conditional error is used to assess the difficulty of the model choice at a given value of x and the confidence we can put in the decision m̂(x). This is crucial since, in the end, the classifier is built to take a decision at a single value of x, which is the observed dataset. More precisely, since ABC relies on summary statistics, we introduced
\[
\tau(s) := P\big(\hat m(\eta_1(X)) \neq M \mid \eta_2(X) = s\big),
\tag{3.6}
\]
where η_1(·) and η_2(·) are two (different or equal) functions we can use to summarize a dataset: the first one is used by the knn classifier, the second one to disintegrate the error rate. And finally, we have proposed an adaptive ABC scheme based on this local assessment of the error to adjust the projection, i.e. the selection of summary statistics, to the data point within knn. This attempt to fight the curse of dimensionality locally around the observed data x_obs contrasts with most projection methods, which are often performed on the whole set of particles (i.e. not adapted to x_obs) and are most often limited to parameter inference (Blum et al., 2012). Yet the methodology (Nadaraya-Watson estimators) is limited to a modest number of summary statistics, as when selecting between dependency structures of discrete hidden Markov random fields.

ABC model choice via random forest

Real population genetics examples are often based on many summary statistics (for example 86 numerical summaries in Lombaert et al. (2011)) and require machine learning techniques that are less sensitive to the dimension than knn. In large dimensional spaces, knn classifiers are often strongly outperformed by other classifiers. Various methods have been proposed to reduce the dimension, see the review of Blum et al. (2013) and our own proposal based on LDA axes ((A06) Estoup et al., 2012). But, when the decision can be taken reliably on a small (but unknown) subset of the summaries (possibly depending on the point of the data space), random forest is a good choice, since its performance depends mainly on the intrinsic dimension of the classification problem (Biau, 2012; Scornet et al., 2014).
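In the same spirit as the knn sketch above, the following lines show the random-forest version; again the reference table is a random placeholder and the forest size is an arbitrary illustrative choice.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
n_ref, n_summaries = 10000, 86                      # many summaries, as in real examples
models = rng.integers(0, 3, size=n_ref)
summaries = rng.normal(size=(n_ref, n_summaries))
s_obs = rng.normal(size=(1, n_summaries))

forest = RandomForestClassifier(n_estimators=500, n_jobs=-1)
forest.fit(summaries, models)
selected_model = forest.predict(s_obs)              # model chosen by the majority of trees
vote_proportions = forest.predict_proba(s_obs)      # proportions of trees per model
importances = forest.feature_importances_           # hints at which summaries matter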
However, machine learning solutions such as random forests miss the distinct advantage of posterior probabilities, namely that the latter evaluate the degree of confidence in the selected model. Indeed, in a well-trained forest, the proportion of trees in favor of a given model is an estimate of the posterior probability that is largely biased toward 0 or 1. We thus developed a new posterior indicator to assess the confidence we can put in the model choice for the observed dataset. This index is the integral of the misclassification loss with respect to the posterior predictive and can be evaluated with an ABC algorithm. It can be written as
\[
\mathrm{error}(s) := \sum_m p\big(m \mid \eta(x) = s\big) \iint 1\big\{\hat m(\eta(y)) \neq m\big\}\, p\big(\phi \mid \eta(x) = s, m\big)\, f\big(y \mid \phi, m\big)\, d\phi\, dy,
\tag{3.7}
\]
and should be compared with τ defined in (3.5). The difference is that the prior probabilities and densities are replaced by the posterior probabilities and densities knowing that η(x) = s. Unlike tests based on posterior predictive p-values (Meng, 1994), our indicator does not commit the sin of “using the data twice” and is simply an exploratory statistic. Additionally, it does not suffer from the drastic limitation in dimension of the conditional error rate (3.6) we developed with Julien Stoehr, which is due to the resort to Nadaraya-Watson estimators (see above). Last but not least, the training of random forests requires a much lower simulation effort than the standard ABC model choice algorithm. The gain in computation time is large: our population genetics examples show that we can divide the number of simulations by ten or twenty.

Conclusion and perspectives

My favorite way to present ABC is now the following two-stage algorithm.

The general ABC algorithm
(A) Generate a large set of particles (m, φ, η(x)) from the joint distribution p(m) p(φ | m) f(x | m, φ);
(B) Use learning methods to infer about m or φ at s_obs = η(x_obs).

If we are only interested in parameter inference, we can drop m from the above algorithm (and use the trivial prior which sets probability 1 on a single model number). We have conducted research to speed up the computation either by improving the standard ABC algorithms in stage (A), see Section 1.2, or in stage (B), see Section 1.3. The gain in computation time we have obtained in the latter case is much larger than the gain we obtained with sequential ABC algorithms (although the setting is not exactly the same). For obvious reasons regarding computer memory, ABC algorithms will never be able to keep track of all the details of the simulated datasets: they commonly save vectors of summary statistics. Thus the error between (3.3) and (3.1) will remain. The only work that studies this discrepancy is Marin et al. (2013), and it is limited to model choice issues. For parameter inference, the problem remains open. Additionally, machine learning methods will be of considerable interest for the statistical processing of the massive SNP datasets whose production is on the increase in the field of population genetics, even if such sequencing methods can introduce some bias (see, e.g., our work in (A09) Gautier et al., 2012; (A08) Gautier et al., 2013). Those powerful methods have not really been tested for parameter inference, which means that there is still room for improvements in stage (B) of the algorithm.

2 Bayesian inference via empirical likelihood

In (A07) Mengersen, Pudlo, and Robert (2013), we have also explored another likelihood-free approximation for parameter inference in the Bayesian paradigm. Instead of bypassing the computation of intractable likelihoods with numerous simulations, the proposed methodology relies on the empirical likelihood of Owen (1988, 2010). The latter defines a pseudo-likelihood on a vector of parameters φ as
\[
L_{\mathrm{el}}(\phi \mid x_{\mathrm{obs}}) = \max\left\{ \prod_{i=1}^{n} p_i \;:\; 0 \le p_i \le 1,\ \sum_i p_i = 1,\ \sum_i p_i\, h(x_i, \phi) = 0 \right\}
\tag{3.8}
\]
when the data x_obs = (x_1, ..., x_n) are composed of iid replicates from an (unknown) distribution P and when, for some known function h, the parameter of interest φ satisfies
\[
\int h(x_i, \phi)\, P(dx_i) = 0.
\tag{3.9}
\]
Note that the above equation might be interpreted as a nonparametric definition of the parameter of interest φ, since there is no assumption on the distribution P. The original framework introduced by Owen (1988) dealt with the mean φ of an unknown distribution P, in which case h(x_i, φ) = x_i − φ. And, with the empirical likelihood ratio test, we can produce confidence intervals on φ in a nonparametric setting, i.e. more robust than confidence intervals derived from a parametric model. Other important examples are moments of various orders or quantiles. In all these cases, the function h in (3.9) is well known.

Bayesian computation via empirical likelihood, BCel

Once the function h in (3.9) is known, we can compute the value of L_el(φ | x_obs) with (3.8) at any value of φ. Solving the optimization problem can be done with an efficient Newton-Lagrange algorithm (the constraint is linear in p). The method we have proposed in (A07) Mengersen, Pudlo, and Robert (2013) replaces the true likelihood with the empirical likelihood in samplers from the posterior distribution. It is well known, see Owen (2010), that empirical likelihoods are not normalized; hence we renounce using them to compute the evidence or the posterior probability of each competing model in the model choice framework. The paper presents two samplers from this approximation of the posterior. The first one draws the parameters from the prior and weights them with the empirical likelihood; the second one relies on AMIS, see below in Section 3.2, to learn gradually how proposed values of φ should be drawn.
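As an illustration of the first sampler in the simplest setting of Owen's original framework (the mean, with h(x_i, φ) = x_i − φ), here is a minimal sketch. Solving (3.8) through its dual with a one-dimensional root finder is one standard approach; the toy data, the prior and the tolerances are placeholders, and this is not the implementation used in (A07).

import numpy as np
from scipy.optimize import brentq

def log_empirical_likelihood(x, phi):
    """log of (3.8) when h(x_i, phi) = x_i - phi: the solution is p_i = 1 / (n (1 + lam (x_i - phi))),
    with lam chosen so that sum_i p_i (x_i - phi) = 0."""
    d = x - phi
    n = len(x)
    if d.min() >= 0.0 or d.max() <= 0.0:
        return -np.inf                              # phi outside the convex hull of the data
    constraint = lambda lam: np.sum(d / (1.0 + lam * d))
    eps = 1e-10
    lam = brentq(constraint, -1.0 / d.max() + eps, -1.0 / d.min() - eps)
    return -np.sum(np.log(n * (1.0 + lam * d)))

# BCel-type sampler: draw phi from the prior and weight it by the empirical likelihood
rng = np.random.default_rng(9)
x_obs = rng.normal(1.0, 2.0, size=100)
phi_prior = rng.normal(0.0, 10.0, size=5000)
log_w = np.array([log_empirical_likelihood(x_obs, phi) for phi in phi_prior])
w = np.exp(log_w - log_w.max())
posterior_mean = np.sum(w * phi_prior) / np.sum(w)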
Empirical likelihood in population genetics

It was a real challenge to use the empirical likelihood in population genetics, even if we assumed that the data are composed of iid blocks, each one corresponding to the genetic data at some locus. Actually, empirical likelihoods have already been used to produce robust confidence intervals on parameters of a likelihood L(φ, x_i). In this case, the function h in the constraint (3.9) is the score function h(x_i, φ) = ∇_φ log L(φ, x_i), whose zero is the maximum likelihood estimator. But, in population genetics, interesting parameters such as dates of divergence between populations, effective population sizes, mutation rates, etc. cannot be expressed as moments of the sampling distribution at a given locus; additionally, they are generally parameters of an intractable likelihood. Hence we rely on another ersatz of the likelihood which is also used in the context of intractable likelihoods: composite likelihoods (Lindsay, 1988) and composite score functions. At the i-th locus, since the information is mainly in the dependency between the individuals of the sample x_i, we rely on the pairwise composite likelihood, which we can compute explicitly. Indeed, the pairwise composite likelihood of the data x_i at the i-th locus is a product of likelihoods of the dataset restricted to pairs of genes from the sample:
\[
L_{\mathrm{comp}}(x_i \mid \phi) = \prod_{a,b} L\big(x_i(a), x_i(b) \mid \phi\big),
\]
where {a, b} ranges over the set of pairs of genes, and x_i(a) (resp. x_i(b)) is the allele carried by gene a (resp. b) at the i-th locus. When there are only two genes in a sample, the combinatorial complexity due to the gene genealogy disappears and each L(x_i(a), x_i(b) | φ) is known explicitly. Thus, we can compute the composite score function explicitly at each locus, then set
\[
h(x_i, \phi) = \sum_{a,b} \nabla_\phi \log L\big(x_i(a), x_i(b) \mid \phi\big)
\]
and rely on the empirical likelihood on the whole dataset composed of n loci.

The paper ((A07) Mengersen, Pudlo, and Robert, 2013) also includes synthetic examples with samples from two or three populations genotyped at one hundred independent microsatellite loci. Replacing the intractable likelihood with (3.8) in Monte Carlo algorithms computing the posterior distribution of φ, we obtained an inference algorithm much faster than ABC (no need to simulate from the model). Bayesian estimators such as the posterior expectation were at least as accurate as those we can obtain with ABC, and likewise for the coverage of credible intervals. In particular, the coverage of credible intervals was equal to or larger than their nominal probability. Hence, in the presence of iid replicates in the data, the empirical likelihood can be used as a black box to adjust the composite likelihood and obtain appropriate inference; in other words, the empirical likelihood methodology appears as a major competitor to the calibration of composite likelihoods proposed by Ribatet et al. (2012).

In the theoretical works of Owen (2010), there are no assumptions on the function h of (3.9). When φ is a vector of parameters, h takes values in a vector space of the same dimension as φ. Quite surprisingly, our simulation studies performed badly when the coordinates of h were not scaled properly with respect to each other. Gradients of criteria such as the log-likelihood or the composite log-likelihood provide a natural way to scale each coordinate of h, and this scaling performs quite well. But we have no theoretical justification for this.

3 Importance sampling

Importance sampling is an old trick to approximate or compute numerically an expected value or an integral in small or moderate dimensional spaces. Suppose that we aim at computing
\[
E\big[\psi(X)\big] = \int \psi(x)\, \Pi(dx),
\tag{3.10}
\]
where Π(·) is the distribution of X under the probability measure P. Let Q(·) denote another distribution on the same space such that |ψ(x)|Π(dx) is absolutely continuous with respect to Q(·), i.e. |ψ|Π ≪ Q. Then the target integral can be written as
\[
\int \psi(x)\, \Pi(dx) = \int \tilde\psi(x)\, Q(dx),
\qquad\text{where}\quad
\tilde\psi(x) = \frac{d(\psi\Pi)}{dQ}(x)
\]
is the Radon-Nikodym derivative of the signed measure ψΠ with respect to Q. Note that, if Π and Q both have densities with respect to a reference measure µ, i.e. Π(dx) = π(x)µ(dx) and Q(dx) = q(x)µ(dx), then we can compute this derivative in practice as ψ̃(x) = ψ(x)π(x)/q(x). If additionally the X_i's are sampled independently from Q(·), then the average
\[
\frac{1}{n}\sum_{i=1}^{n} \tilde\psi(X_i)
\tag{3.11}
\]
is the importance sampling estimator of (3.10) with importance (or instrumental, or proposal, or sampling) distribution Q. This approximation is useful in moderate dimensional spaces to reduce the variance of a crude Monte Carlo estimate based on a sample from Π, or when it is difficult to sample from Π.
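The following toy sketch shows the estimator (3.11) at work on a rare-event probability, where a crude Monte Carlo estimate based on a sample from Π would be very noisy; the shifted Gaussian proposal is an illustrative choice.

import numpy as np
from scipy import stats

# estimate E[psi(X)] with X ~ N(0, 1) and psi(x) = 1{x > 3}, using the proposal Q = N(3, 1)
rng = np.random.default_rng(6)
n = 100000
x = rng.normal(3.0, 1.0, size=n)                                        # sample from Q
weights = stats.norm.pdf(x, 0.0, 1.0) / stats.norm.pdf(x, 3.0, 1.0)     # pi(x) / q(x)
is_estimate = np.mean((x > 3.0) * weights)                              # the average (3.11)
exact = stats.norm.sf(3.0)                                              # exact value, for comparison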
From another viewpoint, maybe the most significant one when Π is the posterior distribution of a Bayesian model, we can also use importance sampling to approximate the distribution Π itself. Indeed, if Π ≪ Q, and if ω(x) is the Radon-Nikodym derivative of Π with respect to Q, then the empirical distribution
\[
\frac{1}{n}\sum_{i=1}^{n} \omega(X_i)\, \delta_{X_i}
\]
provides a Monte Carlo approximation of the distribution Π(·), and integrals as in (3.10) can be estimated by replacing Π with this empirical distribution.

The calibration of the proposal distribution Q is paramount in most implementations of the importance sampling methodology if one wants to avoid estimates with large (if not infinite) variance. As explained in many textbooks, when ψ(x) is non-negative for all x, the best sampling distribution is
\[
Q(dx) = \frac{\psi(x)\, \Pi(dx)}{\int \psi(z)\, \Pi(dz)},
\tag{3.12}
\]
which leads to a zero-variance estimator. Yet this best distribution depends on the unknown integral, and simulating from it is generally more demanding than the original problem of computing (3.10). There exist various classes of adaptive algorithms that learn online how to adapt the proposal distribution; the adaptive multiple importance sampling (AMIS) of Cornuet et al. (2012a) is one of the most efficient when the likelihood is complex to compute, or when it has to be approximated via another Monte Carlo algorithm. Section 3.1 is devoted to importance sampling methods that integrate over the latent processes and approximate the likelihood. Section 3.2 presents theoretical results on AMIS, which can be used with the latter approximation of the likelihood to compute the posterior distribution.

3.1 Computing the likelihood

Framework

Since the seminal paper of Stephens and Donnelly (2000), importance sampling has been used to integrate over all possible realizations of the latent process and compute the likelihood in population genetics. Forward in time, the coalescent-based model can be described as a pure jump, measure-valued Markov process where Z(t) describes the genetic ancestors of the sample at time t, see Section 4. The observed sample (composed of n individuals, or genes) is modeled by the distribution of Z(σ − 0), where σ is the optional time defined as the first time at which Z(t) is composed of (n + 1) genes. Thus, if {X(k)} is the embedded Markov chain of the pure jump process, and τ the stopping time defined as the first time at which X(k) is composed of (n + 1) genes, the likelihood of the parameter φ is
\[
f(x_{\mathrm{obs}} \mid \phi) = P_\phi\big( X(\tau - 1) = x_{\mathrm{obs}} \big),
\tag{3.13}
\]
which can be written as an integral of an indicator function with respect to the distribution of the embedded Markov chain:
\[
f(x_{\mathrm{obs}} \mid \phi) = \sum_{k=1}^{\infty} \sum_{x_0} \cdots \sum_{x_k} 1\big\{ x_{k-1} = x_{\mathrm{obs}},\ \operatorname{card}(x_k) = n+1 \big\}\, p_0(x_0) \left( \prod_{i=0}^{k-1} \Pi(x_i, x_{i+1}) \right),
\tag{3.14}
\]
where the x_i's range over the set of counting measures on the set of possible alleles, p_0 is the initial distribution of the embedded Markov chain and Π its transition matrix.

The best importance distribution (3.12) is the distribution of the embedded chain conditioned on the event {X(τ − 1) = x_obs}. Reversing time, this is the distribution of a process starting from x_obs. But, due to the combinatorial complexity of the coalescent, we are generally unable to simulate from the distribution of the reversed Markov chain. Stephens and Donnelly (2000) proposed an approximation of the conditioned, reversed chain which serves as importance distribution. The latter corresponds exactly to the reversed chain, and thus leads to a zero-variance approximation, in the specific case of “parent independent” mutations. The resulting algorithm falls into the class of sequential importance sampling (SIS), see Algorithm 4.
Nevertheless, the biological framework of the seminal paper is limited, as the authors consider only a single population at equilibrium, i.e., with a constant population size. De Iorio and Griffiths (2004a,b) have extended the biological scope to samples from several populations linked by migration. Among other terms we do not describe precisely here, the importance distribution they proposed relies on a matrix inversion. De Iorio et al. (2005) have replaced this matrix inversion by an explicit formula when the mutation process of microsatellite loci is the stepwise mutation model (SMM).

Algorithm 4: Sequential importance sampling
Input: the transition matrix Q of the importance distribution (backward in time);
    the transition matrix Π of the model (forward in time);
    the equilibrium distribution p_0 of the mutation model;
    the observed data x_obs
set L ← 1;
set x ← x_obs;
while sample size of x ≥ 2 do
    draw y according to Q(x, ·);
    set L ← L × Π(y, x) / Q(x, y);
    set x ← y;
end
set L ← L × p_0(x)
Output: the estimate L

Our work on populations in disequilibrium

When considering population sizes that vary over time, the above Markov process {Z(t)} becomes inhomogeneous: transition rates depend on the population size Ne(t) at time t. In (A14) Leblois et al. (2014) we face the case of a single population whose size grows or shrinks exponentially on some interval of time. The importance distribution is now an inhomogeneous Markov process, backward in time, starting from the observed data. Moreover, the explicit formula of De Iorio et al. (2005) replacing the numerical matrix inversion has been extended to a more complex mutation model for microsatellite loci called the generalized stepwise model (GSM). The motivation for the switch to a GSM is that misspecification of the mutation model can lead to false bottleneck signals, as is also shown in the paper. The efficiency of the algorithm (in terms of variance of the likelihood estimates) decreases when moving away from homogeneity of the process. One sometimes hears the claim that importance sampling is inefficient in high dimensional spaces because the variance of the likelihood estimate blows up. There is certainly some truth to this, and it is a well-known problem of sequential importance sampling for long realizations of a Markov chain. See e.g. Proposition 5.5 in Chapter XIV of Asmussen and Glynn (2007), proving that the variance of the importance sampling estimator grows exponentially with the length of the paths. Hence, to obtain reasonable accuracy, the computational cost increases considerably.

Perspectives

With my student Coralie Merle (during a graduate internship and part of the first year of her PhD scholarship), we are trying to increase the accuracy within the same computational budget by resorting to sequential importance sampling with resampling (SIR; Liu et al., 2001). The core idea is to resample N instances of Algorithm 4 at various iterations, according to a function of the current weights L, and thus get rid of simulations that will not contribute to the average (3.11). Adaptive methods to tune the importance distribution can also help increase the sharpness of the likelihood estimate. Nevertheless, many adaptive algorithms are designed to approximate (3.10) for a large class of functions ψ, while here the integrand ψ is fixed to a given indicator function.
3.2 Sample from the posterior with AMIS

Being able to compute the likelihood is not the final goal of a statistical analysis; in a Bayesian framework, MCMC algorithms provide a Monte Carlo approximation of the posterior distribution, approximations of point estimators such as the posterior expected value or the posterior median, and approximate credible intervals. The pseudo-marginal MCMC algorithm proposed by Beaumont (2003) and studied by Andrieu and Roberts (2009) and Andrieu and Vihola (2012) replaces evaluations of the likelihood with importance sampling estimates as presented in Section 3.1. In particular, the authors proved that, when the estimates of the likelihood are unbiased, this scheme is exact, since it provides samples from the true posterior. But such Metropolis-Hastings algorithms can suffer from poor mixing properties: if the chain is at some value φ of the parameter space and the likelihood has been largely over-estimated there (which can happen because of the large, if not infinite, variance), then the chain stays stuck at this value for a very long time. Of course, one can always improve the accuracy of the likelihood estimate by increasing the number of runs of the SIS algorithm, which also increases the time complexity. But we can also replace MCMC algorithms with other Monte Carlo methods to solve this problem.

AMIS

In general, particle algorithms are less sensitive to such poor mixing properties and can better explore the different modes of posterior distributions. The adaptive multiple importance sampling (AMIS; Cornuet et al., 2012b) is a good example, which combines multiple importance sampling and adaptive techniques to draw parameters from the posterior. The sequential particle scheme is in the same vein as Cappé et al. (2008). But it introduces a recycling of the successive samples generated during the learning process of the importance distribution on the parameter space. In particular, AMIS does not throw away the approximations (obtained via importance sampling in the gene genealogy space) of the likelihood at given values φ of the parameter, and can be less time consuming than other methods. On various numerical experiments where the target is the posterior distribution of some population genetics datasets, Cornuet et al. (2012b) and Sirén et al. (2010) show considerable improvements of AMIS in terms of effective sample size (see, e.g., Liu, 2008b, chapter 2). In such settings, where calculating the posterior density (or an estimate of it) is drastically time consuming, a recycling process makes sense.

In the rest of the section, π(φ) denotes the posterior density, which is the target of AMIS. During the learning process, AMIS tries successive proposal distributions from a parametric family of distributions, say Q(θ̂_1), ..., Q(θ̂_T). Each stage of the iterative process estimates a better proposal Q(θ̂_{t+1}) by minimising a criterion such as, for instance, the Kullback-Leibler divergence between Q(θ) and the target Π, which in our context is the posterior distribution. The novelty of AMIS is the following recycling procedure of all past simulations. At iteration t, AMIS has already produced t samples:
\[
\phi_1^1, \ldots, \phi_{N_1}^1 \sim Q(\hat\theta_1), \qquad
\phi_1^2, \ldots, \phi_{N_2}^2 \sim Q(\hat\theta_2), \qquad \ldots, \qquad
\phi_1^t, \ldots, \phi_{N_t}^t \sim Q(\hat\theta_t),
\]
with respective sizes N_1, N_2, ..., N_t. The scheme then derives a new parameter θ̂_{t+1} from all those past simulations. To that purpose, the weight of φ_i^k (k ≤ t, i ≤ N_k) is updated to
\[
\pi\big(\phi_i^k\big) \Bigg/ \left[ \sum_{\ell=1}^{t} \frac{N_\ell}{\Omega_t}\, q\big(\phi_i^k, \hat\theta_\ell\big) \right],
\tag{3.15}
\]
where π(φ) is the density of the posterior distribution, or an IS estimate of this density, θ̂_1, ..., θ̂_t are the parameters generated throughout the t past iterations, x ↦ q(x, θ) is the density of Q(θ) with respect to the reference measure dx, and Ω_t = N_1 + N_2 + ⋯ + N_t is the total number of past particles. The importance weight (3.15) is inspired by ideas of Veach and Guibas (1995), which had been popularized by Owen and Zhou (2000) to merge several independent importance samples.

Our results

Before our work ((A16) Marin, Pudlo, and Sedki, 2014), no proof of convergence had been provided, neither in Cornuet et al. (2012b) nor elsewhere. It is worth noting that the weight (3.15) introduces long-memory dependence between the samples, and even a bias which was not controlled by theoretical results. The main purpose of (A16) Marin, Pudlo, and Sedki (2014) was to fill this gap and to prove the consistency of the algorithm, at the cost of a slight modification of the adaptive process. We suggested learning the new parameter θ̂_{t+1} on the last sample φ_1^t, ..., φ_{N_t}^t only, weighted with the classical weights π(φ_i^t)/q(φ_i^t, θ̂_t) for all i = 1, ..., N_t. The only recycling procedure is in the final output, which merges all the previously generated samples using (3.15).
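The following schematic sketch mimics this modified scheme on a deliberately trivial one-dimensional target: the proposal family is Gaussian, θ = (mean, std) is adapted by weighted moment matching on the last sample only, and the final output merges all samples with the mixture weights (3.15). Everything here (target, proposal family, sample sizes, adaptation rule) is an illustrative placeholder, not the algorithm of (A16) nor its population genetics applications.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
target = lambda phi: np.exp(-0.5 * ((phi - 2.0) / 0.7) ** 2)    # unnormalized pi(phi)

T = 20
sizes = [200 * (t + 1) for t in range(T)]        # growing sample sizes N_1, ..., N_T
thetas = [(0.0, 5.0)]                            # theta_1: a deliberately vague proposal
samples = []
for t in range(T):
    mean, std = thetas[-1]
    phi = rng.normal(mean, std, size=sizes[t])
    samples.append(phi)
    # adaptation on the last sample only, with the classical weights pi / q(., theta_t)
    w = target(phi) / stats.norm.pdf(phi, mean, std)
    w = w / w.sum()
    new_mean = np.sum(w * phi)
    new_std = np.sqrt(np.sum(w * (phi - new_mean) ** 2))
    thetas.append((new_mean, max(new_std, 1e-3)))

# final recycling: merge all past samples with the multiple-mixture weights of (3.15)
all_phi = np.concatenate(samples)
omega = float(sum(sizes))
mixture = sum((n_l / omega) * stats.norm.pdf(all_phi, m, s)
              for n_l, (m, s) in zip(sizes, thetas[:T]))
weights = target(all_phi) / mixture
posterior_mean = np.sum(weights * all_phi) / weights.sum()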
Contrary to what has been done in the literature dealing with other adaptive schemes, we decided to adopt a more realistic asymptotic setting in our paper. In Douc et al. (2007) for instance, the consistency of the adaptive population Monte Carlo (PMC) schemes is proven assuming that the number of iterations, say T, is fixed and that the number of simulations within each iteration, N = N_1 = N_2 = ⋯ = N_T, goes to infinity. The convergence we have proven in (A16) Marin, Pudlo, and Sedki (2014) holds when N_1, ..., N_T, ... is a growing but fixed sequence and T goes to infinity. Hence the proofs of our theorems provide new insights on adaptive PMC in that last asymptotic regime. The convergence of θ̂_t to the target θ* relies on limit theorems for triangular arrays (see Chopin, 2004; Douc and Moulines, 2008; Cappé et al., 2005, Chapter 9). The consistency of the final merging with weights given by (3.15) is not a straightforward consequence of asymptotic theorems. Its proof requires the introduction of a new weighting,
\[
\pi\big(\phi_i^k\big) \big/ q\big(\phi_i^k, \theta^*\big),
\tag{3.16}
\]
which is simpler to study, although biased and not explicitly computable (because θ* is unknown). Under the set of assumptions given below, this last weighting scheme is consistent (see the proposition in the last part of the paper) and is comparable to the actual weighting given by (3.15), which yields the consistency of the modified AMIS algorithm.

Perspectives

One of the restrictive assumptions that appeared during the proof of consistency is that the sample sizes satisfy
\[
\sum_{t=1}^{\infty} \frac{1}{N_t} < \infty,
\tag{3.17}
\]
which means that the sample size goes to infinity much faster than linearly. Following the proofs, one can see that we used a uniform Chebyshev inequality and uniform square integrability of some family of random variables to control the asymptotic behavior of θ̂_t. These results have practical consequences for the design of the N_t when running AMIS, though we do not claim that this assumption is necessary for consistency. Actually, we believe that we can replace the Chebyshev inequality with Chernoff or large deviation bounds to obtain the consistency of θ̂_t, and then replace (3.17) with a less restrictive assumption. The price to pay is then uniform exponential integrability of the same family of random variables. Besides, the techniques of proof we have developed can lead to new results on adaptive PMC. Nevertheless, the setting of such adaptive algorithms is quite different, as they do not target an ideal proposal distribution Q(θ*) but claim to approximate a sequence of ideal proposal distributions Q(θ*_t) at each stage of the iterative algorithm. The sequence θ*_t is defined by a recurrence, namely
\[
\theta^*_{t+1} = \int h(x, \theta^*_t)\, \Pi(dx),
\]
and the above integral is approximated with an importance sampling estimate at each iteration of the sequential algorithm. Such settings are obviously more complex than the one we considered in (A16) Marin, Pudlo, and Sedki (2014).

4 Inference in neutral population genetics

Under neutrality, genetic evolution is modeled by complex stochastic models (in particular Kimura (1968)'s diffusion and Kingman (1982)'s coalescent) that take into account mutation and genetic drift simultaneously. Answering important biological questions (such as assessing a migration rate or a shrinkage in population size, dating the foundation of a population, or other important events) is often a delicate methodological issue. Such subtle models allow us to discriminate between confounding effects like, for instance, misspecification of the mutation model vs. a shrinkage in population size (a bottleneck). The statistical issues fall into the class of inference on hidden or latent processes, because the genealogy (a graph that represents the genetic kinship of the sample), the mutation dates and the ancestral genotypes are not observed directly. To shed some light on the difficulties that occur in my favorite fields of application of the methods I have developed, we start with a short description of the stochastic models in Sections 4.1 and 4.2 and we end the section with a few perspectives.

Most of the models presented in this last part are implemented in the DIYABC software ((A12) Cornuet et al., 2014). This user-friendly program permits a comprehensive analysis of population history using ABC on genetic data, proposes various panels to set the models and the prior distributions, and includes several ABC analyses. In order to stay as efficient as possible, the Monte Carlo simulations and numerical computations have been implemented in C++. We have made an effort to take advantage of modern multi-core or multi-CPU computer architectures and have parallelized the computations.

4.1 A single, isolated population at equilibrium

Many statistical models (linear regression, mixed models, linear discriminant analysis, time series, Brownian motion, stochastic differential equations...) are based on the Gaussian distribution, the latter being the limit distribution of empirical averages in many settings. When it comes to modeling the kinship of a sample of individuals, Kingman (1982)'s coalescent plays this role of standard distribution. In both cases, the Gaussian distribution and Kingman's coalescent are far from being universal distributions. We can mention the Λ-coalescent, which allows multiple collisions, and refer the reader to the review of Berestycki (2009), though it has rarely been used to analyse genetic datasets.
Algorithm 5: Simulation of the coalescent
Input: the number n of genes in the sample; the effective size Ne of the population
set time t ← 0;
set the ancestral sample size k ← n;
while k ≥ 2 do
    draw T_k from the exponential distribution with rate k(k − 1)/(2Ne);
    set time t ← t + T_k;
    choose the pair of genes that coalesce at random amid the k(k − 1)/2 pairs;
    set k ← k − 1;
end

The coalescent

For the sake of clarity, let us begin with a single closed population of constant size Ne (i.e., at equilibrium) and a sample of n haploid individuals from this population (i.e. a sample of n/2 diploid individuals). As is often done in population genetics, we will call these haploid individuals "genes" (in particular here, the word "gene" does not mean a coding sequence of DNA). In the following paragraphs, we do not intend to give a comprehensive description of genetic material: we neglect sex chromosomes, mitochondrial DNA, etc., and focus on the autosomal genome.

Backward in time, pairs of genes find common ancestors until they reach the most recent common ancestor (MRCA) of the whole sample. Kingman's coalescent stays as simple as possible: it is a memoryless process and genes are exchangeable. Hence the times between coalescences are exponentially distributed and the pair that coalesces is chosen at random amid the k(k − 1)/2 pairs of genes in the ancestral sample of size k ≤ n. Moreover, the population size Ne influences the rate at which pairs coalesce: it is harder to find a common ancestor in a large population than in a small one, and the average time before coalescence for a given pair of genes is Ne. Whence the simulation of Kingman's coalescent in Algorithm 5. A realization of the whole process is often displayed as a dendrogram where the tips represent the observed sample and the root is the most recent common ancestor of the sample, see Figure 3.1.

Figure 3.1 – Gene genealogy of a sample of five genes numbered from 1 to 5. The inter-coalescence times T_5, ..., T_2 are represented on the vertical time axis.

Mutations

Over time, the genes vary in a set A of possible alleles. The mutation process is commonly described with a Markov transition matrix Q_mut and a mutation rate µ per gene per time unit. If we follow a lineage forward in time from the MRCA to a given tip of the dendrogram, we face a pure jump Markov process with intensity matrix Λ = µ(Q_mut − I). When possible, these Markov processes along the lineages of the genealogy are supposed to be at equilibrium, so that the marginal distributions of the allele of the MRCA, as well as of any gene from the observed sample, are all equal to the stationary distribution of Q_mut. Additionally, the stationary distribution does not carry any information (regarding the parameters µ or Ne), and the relevant information of the dataset lies in the dependency between genes. It is worth noting that the above description of the stochastic model directly provides a simulation algorithm, with a first stage backward in time to build the gene genealogy, followed by a second stage forward in time which simulates the Markov processes along the lineages.
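This two-stage simulation can be written in a few lines. The sketch below is a direct transcription of Algorithm 5 followed by a stepwise (+1/−1) mutation stage, as in the microsatellite example of Figure 3.2; the data structures, parameter values and the Poisson placement of mutations along the branches are illustrative choices, not the DIYABC implementation.

import numpy as np

def simulate_locus(n, Ne, mu, rng, ancestral_state=100):
    """Kingman's coalescent backward in time (Algorithm 5), then stepwise mutations
    forward in time along the lineages; returns the n simulated genotypes."""
    # stage 1: build the genealogy backward in time
    node_time = {i: 0.0 for i in range(n)}        # the tips are sampled at time 0
    parent = {}
    active, next_node, t = list(range(n)), n, 0.0
    while len(active) >= 2:
        k = len(active)
        t += rng.exponential(2.0 * Ne / (k * (k - 1)))     # inter-coalescence time T_k
        i, j = rng.choice(k, size=2, replace=False)        # the pair chosen at random
        a, b = active[i], active[j]
        parent[a] = parent[b] = next_node
        node_time[next_node] = t
        active = [v for v in active if v not in (a, b)] + [next_node]
        next_node += 1
    root = active[0]                                       # the MRCA
    # stage 2: mutations forward in time, from the MRCA down to the tips
    genotype = {root: ancestral_state}
    for child in sorted(parent, reverse=True):             # parents always carry larger labels
        branch = node_time[parent[child]] - node_time[child]
        n_mut = rng.poisson(mu * branch)
        steps = rng.choice([-1, 1], size=n_mut).sum() if n_mut else 0
        genotype[child] = genotype[parent[child]] + steps
    return np.array([genotype[i] for i in range(n)])

rng = np.random.default_rng(8)
sample = simulate_locus(n=8, Ne=1000, mu=5e-4, rng=rng)    # e.g. eight genes at one locus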
Yet the whole model can also be described as a measure-valued process {Z(t)} forward in time. In this setting, Z(t) is a counting measure on A which describes the ancestral sample at time t. In particular, Z(0) is a Dirac mass that puts mass one on the allele type of the MRCA. The process {Z(t)} is still a pure jump Markov process, which is explosive because Kingman's coalescent comes down from infinity, see e.g. Berestycki (2009). Yet, if σ is the optional time defined as the first time the total mass of Z(t) hits (n + 1), then Z(σ − 0), the last measure visited before σ, models the genetic sample of size n. The intensity matrix of the forward-in-time process can be written explicitly. Reversing time, the measure-valued process is still a pure jump Markov process, but its intensity matrix cannot be computed explicitly due to the dependency between branches of the genealogy (only pairs of genes carrying the same allele can coalesce).

Figure 3.2 – Simulation of the genotypes of a sample of eight genes. As for microsatellite loci with the stepwise mutation model, the set of alleles is an interval of integer numbers A ⊂ N. The mutation process Q_mut adds +1 or −1 to the genotype with equal probability. Once the genealogy has been drawn, the MRCA is genotyped at random, here 100, and we run the mutation Markov process along the vertical lines of the dendrogram. For instance, the red and green lines are the lineages from the MRCA to genes number 2 and 4 respectively.

Finally, the sample is often composed of multi-locus data, which means that the individuals have been genotyped at different positions of their genome. If these loci are on different chromosomes, or if they are distant enough on the same chromosome, taking recombination into account, we can safely assume independence between the loci. Note that this implies that the gene genealogies of different loci are independent.

4.2 Complex models

Answering important biological questions regarding the evolution of a given species requires much more complex models than the one presented in Section 4.1, though this simplest model serves as a foundation stone for the others.

Varying population size

To study the population size, we have to set Ne(t) as a function of time. Because the time of the most recent common ancestor is unknown, while the date at which we have sampled the individuals from the population is known, the function Ne(t) describes the population size backward in time (t = 0 is the sampling date). The Markov properties of the gene genealogy remain, but the process becomes inhomogeneous as the jump rates depend on time. Algorithm 5 can be adapted with a Gillespie algorithm as long as we can explicitly bound from below the jump rate k(k − 1)/(2Ne(s)) after the current time t, i.e. for any s ≥ t, see Algorithm 6. The function t ↦ Ne(t) can be, for example, a piecewise constant function or, as in (A14) Leblois et al. (2014), a continuous function which remains constant except on a given interval of time where the size increases or decreases exponentially.
More than one population
We consider here the class of models implemented in the software DIYABC ((A12) Cornuet et al., 2014); see also Donnelly and Tavare (1995). An example is given in Figure 3.3.

Figure 3.3 – Example of an evolutionary scenario: four populations Pop1, . . . , Pop4 have been sampled at time t = 0. Branches of the history can be considered as tubes in which the gene genealogy is drawn. The historical model includes two unobserved populations (Pop5 and Pop6) and fifteen parameters: six dates t1, . . . , t6, seven population sizes Ne1, . . . , Ne6 and Ne40, and two admixture rates r and s.

In this setting, the history of the sampled populations is described (backward in time) in terms of punctual events, divergences and admixtures, until a single ancestral population is reached. Locally, in each population of the history, in between those punctual events, the gene genealogy is built up following the Markov process given in Section 4.1 if the population size is constant, and following Algorithm 6 when population sizes vary. Once we reach the date of a punctual event, we have to move ancestral genes from one population to another according to this event, as sketched in the code below.
• If the event says that, forward in time, a new population named B has diverged from a population named A at this date, then, backward in time, the ancestral sample of population B is moved into population A.
• If the event says that, forward in time, a new population named C is an admixture between populations A and B, then, reversing time, the ancestral genes that are in population C are sent to population A with probability r and to population B with probability 1 − r, where r is a parameter of the model.
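The bookkeeping at punctual events can be sketched as follows (again my own illustration with hypothetical population labels and lineage identifiers, not code taken from DIYABC): backward in time, a divergence pours one ancestral sample into its parent population, while an admixture scatters a sample between two source populations according to the admixture rate r.

```python
import random

# Ancestral samples per population: hypothetical labels and lineage identifiers.
samples = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7]}


def apply_divergence(samples, new_pop, parent_pop):
    """Forward in time, new_pop diverged from parent_pop: backward in time,
    the whole ancestral sample of new_pop is poured into parent_pop."""
    samples[parent_pop].extend(samples.pop(new_pop, []))


def apply_admixture(samples, admixed_pop, pop_a, pop_b, r, rng=random):
    """Forward in time, admixed_pop is an admixture of pop_a and pop_b: backward
    in time, each ancestral gene of admixed_pop goes to pop_a with probability r
    and to pop_b with probability 1 - r."""
    for gene in samples.pop(admixed_pop, []):
        (samples[pop_a] if rng.random() < r else samples[pop_b]).append(gene)


# Backward in time: first undo the admixture that created C, then the divergence of B from A.
apply_admixture(samples, "C", "A", "B", r=0.3)
apply_divergence(samples, "B", "A")
print(samples)   # every lineage has been gathered into the ancestral population A
```

Between two successive events, the genealogy inside each population is extended with Algorithm 5 or Algorithm 6, coalescences being allowed only between lineages currently located in the same population.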
4.3 Perspectives
The challenge we face in population genetics is a drastic increase in the size of the data, due to next generation sequencing (NGS). This sequencing technology produces datasets including up to millions of SNP loci. The interest of such high dimensional data is the amount of information they carry. But developing methods that reveal this information requires substantial work on inference algorithms. Gold standard methods such as sequential importance sampling (when the evolutionary scenario is relatively simple) or ABC are limited in the number of loci they can analyse. In the paragraphs below, we sketch a few research directions to scale inference algorithms to NGS data.

Understanding and modelling SNP loci

SNP mutations
Two models coexist in the literature to explain SNP data. The first and simplest model, which can be simulated with Hudson's algorithm, considers that the gene genealogy of a SNP locus is given by Kingman's coalescent and that one and only one mutation event occurs during the past history of the gene sample at a given locus. The second model assumes that, at each base pair of the DNA strand, a gene genealogy is drawn following Kingman's coalescent, independently of the gene genealogies of the other base pairs. Mutations are then placed at random on the branches of the genealogies at some very low rate. Most of the gene genealogies will not carry any mutation event, and the others will carry only one mutation event (hence the presence of bi-allelic SNP loci). The gene genealogies with mutation event(s) are characterized by a total branch length which is larger than that of genealogies without mutation events. Such genealogies carrying a unique mutation are often referred to in the literature as Unique Event Polymorphism (UEP) genealogies (see for instance Markovtsova et al., 2000). The Hudson and UEP models are clearly different. Therefore, simulating data with one model and estimating parameters with the other leads to a bias due to misspecification of the model (independently of the effect of other biases). Markovtsova et al. (2001) discussed the consequences of this misspecification in the particular context of an evolutionary neutrality test assuming a simple demographic scenario with a single population and an infinite-sites mutation model. The Markovtsova et al. (2001) paper provoked a flurry of responses and comments, which globally suggest that the Hudson approximation is correct, at least for the tests that were carried out under infinite-sites models, provided that some conditions on the parameters of the mutation model are satisfied. Additional work is certainly needed to investigate the effect of this misspecification bias on SNP data when more complex demographic histories involving several populations are considered.

Linkage disequilibrium and time scale
Autosomal loci are commonly considered as independent when they are not numerous. Thus each locus has its own gene genealogy, independent of the others, and the likelihood writes as a product over loci. When the number of loci increases, this assumption becomes debatable: genetic markers are then dense along the genome and recombination between neighbouring markers becomes rare. The stochastic model that explains the hidden genealogies along the genome is the ancestral recombination graph (Griffiths and Marjoram, 1997), which has been approximated by Markov processes along the genome (Wiuf and Hein, 1999; McVean and Cardin, 2005; Marjoram and Wall, 2005). But very few inference algorithms are based on these models, and the first attempts to design such methods are limited in the number of individuals they can handle (Hobolth et al., 2007; Li and Durbin, 2011). Yet modelling recombination would make it possible to retrieve a natural time scale on the parameters of interest. Indeed, the models are often over-parametrized: there is no way to set the time scale of a latent genealogy from the current genetic data alone, but a Bayesian model can provide information on this time scale through the prior distribution of the recombination rate.

Infer and predict

Sequential importance sampling (SIS) algorithms
These are currently the most accurate methods for parameter inference, but also the most intensive in terms of computation time, and their implementation is limited to realistic, but still simple, evolutionary scenarios. We face major obstacles in scaling these algorithms to NGS data sizes: the lack of efficient proposal distributions in demographic models which are not at equilibrium, and the slow calculation speed. Additionally, a major drawback of their accuracy is that they are very sensitive to misspecification of the model. The future of such methods on high dimensional data is compromised, except if we make a huge leap forward in computation time.

ABC algorithms
This class of algorithms is currently one of the gold standard methods for conducting a statistical analysis in complex evolutionary settings involving numerous populations. They draw strength and flexibility from the fact that they require only (1) an efficient simulation algorithm for the stochastic model and (2) our ability to summarize the relevant information in numerical statistics of smaller dimension than the original data; a bare-bones rejection sampler illustrating these two ingredients is sketched below. Yet major challenges remain to adapt this class of inference procedures to NGS data, in particular the design and the number of summary statistics, though we have reduced this issue by resorting to random forests for model choice. In particular, machine learning based algorithms such as ABC-EP (Barthelmé and Chopin, 2011) can take advantage of the specific structure of genetic data into independent or Markov-dependent loci. ABC algorithms can also approximate a posterior predictive distribution of the genetic polymorphism under neutrality with complex demographic models. This can help detect loci under selection as outliers of the posterior predictive distribution.
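To fix ideas on these two ingredients, here is a bare-bones ABC rejection sampler in Python — my own illustration, not the algorithm implemented in DIYABC or in the papers cited in this chapter. The prior, the simulator and the summary statistic are deliberately simplistic placeholders for a coalescent-based simulator and population genetics summaries.

```python
import random
import statistics


def prior(rng=random):
    """Placeholder prior on the parameter of interest."""
    return rng.uniform(0.0, 10.0)


def simulate(theta, n=50, rng=random):
    """Placeholder simulator standing in for a coalescent-based model: ABC only
    requires the ability to produce a dataset given a parameter value."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Placeholder summary statistic of smaller dimension than the data."""
    return statistics.mean(data)


def abc_rejection(observed, n_sim=100_000, quantile=0.001, rng=random):
    """Keep the parameter values whose simulated summaries fall closest to the
    observed summary; the tolerance is an empirical quantile of the distances."""
    s_obs = summary(observed)
    draws = []
    for _ in range(n_sim):
        theta = prior(rng)
        draws.append((abs(summary(simulate(theta, rng=rng)) - s_obs), theta))
    draws.sort(key=lambda pair: pair[0])
    kept = draws[: max(1, int(quantile * n_sim))]
    return [theta for _, theta in kept]   # approximate posterior sample


if __name__ == "__main__":
    observed = simulate(theta=3.0)
    posterior = abc_rejection(observed)
    print(len(posterior), statistics.mean(posterior))
```

Plugging the coalescent simulators sketched earlier in place of the toy simulate function, and standard summary statistics in place of the mean, yields the kind of ABC analysis discussed above.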
BCel algorithms
Finally, this class of inference methods is one of the most promising to handle large NGS datasets efficiently, as we have shown in (A07) Mengersen, Pudlo, and Robert (2013). To turn it into a routine procedure for analysing data under complex evolutionary scenarios, we need to (1) better understand SNP mutation models, and recombination, as explained above, and (2) extend the explicit formulas for the pairwise composite likelihood to these models.

Bibliography
(A01) Pudlo, P. (2009). Large deviations and full Edgeworth expansions for finite Markov chains with applications to the analysis of genomic sequences. ESAIM: Probab. and Statis. 14, 435–455.
(A02) Pelletier, B. and P. Pudlo (2011). Operator norm convergence of spectral clustering on level sets. The Journal of Machine Learning Research 12, 385–416.
(A03) Arias-Castro, E., B. Pelletier, and P. Pudlo (2012). The normalized graph cut and Cheeger constant: from discrete to continuous. Advances in Applied Probability 44(4), 907–937.
(A04) Cadre, B., B. Pelletier, and P. Pudlo (2013). Estimation of density level sets with a given probability content. Journal of Nonparametric Statistics 25(1), 261–272.
(A05) Marin, J.-M., P. Pudlo, C. P. Robert, and R. Ryder (2012). Approximate Bayesian computational methods. Statistics and Computing 22(6), 1167–1180.
(A06) Estoup, A., E. Lombaert, J.-M. Marin, C. Robert, T. Guillemaud, P. Pudlo, and J.-M. Cornuet (2012). Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12(5), 846–855.
(A07) Mengersen, K. L., P. Pudlo, and C. P. Robert (2013). Bayesian computation via empirical likelihood. Proc. Natl. Acad. Sci. USA 110(4), 1321–1326.
(A08) Gautier, M., J. Foucaud, K. Gharbi, T. Cezard, M. Galan, A. Loiseau, M. Thomson, P. Pudlo, C. Kerdelhué, and A. Estoup (2013).
Estimation of population allele frequencies from next-generation sequencing data: pooled versus individual genotyping. Molecular Ecology 22(14), 3766–3779.
(A09) Gautier, M., K. Gharbi, T. Cezard, J. Foucaud, C. Kerdelhué, P. Pudlo, J.-M. Cornuet, and A. Estoup (2013). The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Molecular Ecology 22(11), 3165–3178.
(A10) Ratmann, O., P. Pudlo, S. Richardson, and C. P. Robert (2011). Monte Carlo algorithms for model assessment via conflicting summaries. Technical report, arXiv preprint 1106.5919.
(A11) Sedki, M., P. Pudlo, J.-M. Marin, C. P. Robert, and J.-M. Cornuet (2013). Efficient learning in ABC algorithms. Technical report, arXiv preprint 1210.1388.
(A12) Cornuet, J.-M., P. Pudlo, J. Veyssier, A. Dehne-Garcia, M. Gautier, R. Leblois, J.-M. Marin, and A. Estoup (2014). DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data. Bioinformatics 30(8), 1187–1189.
(A13) Baragatti, M. and P. Pudlo (2014). An overview on Approximate Bayesian Computation. ESAIM: Proc. 44, 291–299.
(A14) Leblois, R., P. Pudlo, J. Néron, F. Bertaux, C. R. Beeravolu, R. Vitalis, and F. Rousset (2014). Maximum likelihood inference of population size contractions from microsatellite data. Molecular Biology and Evolution (to appear), 19 pages.
(A15) Stoehr, J., P. Pudlo, and L. Cucala (2014). Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Statistics and Computing, in press, 15 pages.
(A16) Marin, J.-M., P. Pudlo, and M. Sedki (2012, 2014). Consistency of the Adaptive Multiple Importance Sampling. Technical report, arXiv preprint arXiv:1211.2548.
(A17) Pudlo, P., J.-M. Marin, A. Estoup, M. Gautier, J.-M. Cornuet, and C. P. Robert (2014). ABC model choice via Random Forests. Technical report, arXiv preprint 1406.6288.
Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37(2), 697–725.
Andrieu, C. and M. Vihola (2012). Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. Technical report, arXiv preprint arXiv:1210.1484.
Asmussen, S. and P. W. Glynn (2007). Stochastic Simulation: Algorithms and Analysis, Volume 57 of Stochastic Modelling and Applied Probability. New York: Springer.
Barthelmé, S. and N. Chopin (2011). ABC-EP: Expectation Propagation for Likelihood-free Bayesian Computation. In L. Getoor and T. Scheffer (Eds.), ICML 2011 (Proceedings of the 28th International Conference on Machine Learning), pp. 289–296.
Baudry, J.-P., A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo (2010). Combining mixture components for clustering. Journal of Computational and Graphical Statistics 19(2), 332–353.
Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164(3), 1139–1160.
Beaumont, M. A. (2010). Approximate Bayesian Computation in Evolution and Ecology. Annu. Rev. Ecol. Evol. Syst. 41, 379–406.
Beaumont, M. A., J.-M. Cornuet, J.-M. Marin, and C. P. Robert (2009). Adaptive approximate Bayesian computation. Biometrika 96(4), 983–990.
Belkin, M. and P. Niyogi (2005). Towards a theoretical foundation for Laplacian-based manifold methods. In Learning Theory, Volume 3559 of Lecture Notes in Comput. Sci., pp. 486–500. Berlin: Springer.
Ben-David, S., U. von Luxburg, and D. Pal (2006).
A sober look at clustering stability. In G. Lugosi and H. Simon (Eds.), Proceedings of the 19th Annual Conference on Learning Theory (COLT), pp. 5–19. Berlin: Springer.
Berestycki, N. (2009). Recent progress in coalescent theory. Ensaios Matemáticos 16, 1–193.
Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research 13(1), 1063–1095.
Biau, G., B. Cadre, and B. Pelletier (2007). A graph-based estimator of the number of clusters. ESAIM Probab. Stat. 11, 272–280.
Biau, G., F. Cérou, and A. Guyader (2012). New insights into Approximate Bayesian Computation. Technical report, arXiv preprint arXiv:1207.6461.
Blum, M., M. Nunes, D. Prangle, and S. Sisson (2012). A comparative review of dimension reduction methods in approximate Bayesian computation. Technical report, arXiv preprint arXiv:1202.3819.
Blum, M. G. B., M. A. Nunes, D. Prangle, and S. A. Sisson (2013). A comparative review of dimension reduction methods in Approximate Bayesian computation. Statistical Science 28(2), 189–208.
Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.
Cappé, O., A. Guillin, J. M. Marin, and C. P. Robert (2004). Population Monte Carlo. J. Comput. Graph. Statist. 13(4), 907–929.
Cappé, O., A. Guillin, J.-M. Marin, and C. P. Robert (2008). Adaptive importance sampling in general mixture classes. Statistics and Computing 18, 587–600.
Cappé, O., E. Moulines, and T. Rydén (2005). Inference in Hidden Markov Models. New York: Springer.
Chopin, N. (2004). Central Limit Theorem for Sequential Monte Carlo Methods and Its Application to Bayesian Inference. The Annals of Statistics 32(6), 2385–2411.
Cornuet, J.-M., J.-M. Marin, A. Mira, and C. Robert (2012a). Adaptive multiple importance sampling. Scandinavian Journal of Statistics 39(4), 798–812.
Cornuet, J.-M., J.-M. Marin, A. Mira, and C. P. Robert (2012b). Adaptive Multiple Importance Sampling. Scandinavian Journal of Statistics 39(4), 798–812.
Cornuet, J.-M., V. Ravigné, and A. Estoup (2010). Inference on population history and model checking using DNA sequence and microsatellite data with the software DIYABC (v1.0). BMC Bioinformatics 11(1), 401.
Cornuet, J.-M., F. Santos, M. A. Beaumont, C. P. Robert, J.-M. Marin, D. J. Balding, T. Guillemaud, and A. Estoup (2008). Inferring population history with DIYABC: a user-friendly approach to Approximate Bayesian Computation. Bioinformatics 24(23), 2713–2719.
Cuevas, A., M. Febrero, and R. Fraiman (2001). Cluster analysis: a further approach based on density estimation. Comput. Statist. Data Anal. 36, 441–459.
De Iorio, M. and R. C. Griffiths (2004a). Importance sampling on coalescent histories, I. Advances in Applied Probability 36, 417–433.
De Iorio, M. and R. C. Griffiths (2004b). Importance sampling on coalescent histories, II. Advances in Applied Probability 36, 434–454.
De Iorio, M., R. C. Griffiths, R. Leblois, and F. Rousset (2005). Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models. Theoretical Population Biology 68, 41–53.
Del Moral, P. (2004). Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Probability and its Applications. New York: Springer-Verlag.
Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. J. Royal Statist. Society Series B 68(3), 411–436.
Del Moral, P., A. Doucet, and A. Jasra (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing 22(5), 1009–1020.
Diaconis, P. and D. Stroock (1991). Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability 1(1), 36–61.
Didelot, X., R. G. Everitt, A. M. Johansen, D. J. Lawson, et al. (2011). Likelihood-free estimation of model evidence. Bayesian Analysis 6(1), 49–76.
Donnelly, P. and S. Tavare (1995). Coalescents and genealogical structure under neutrality. Annual Review of Genetics 29(1), 401–421.
Douc, R., A. Guillin, J. M. Marin, and C. P. Robert (2007). Convergences of adaptive mixtures of importance sampling schemes. The Annals of Statistics 35(1), 420–448.
Douc, R. and E. Moulines (2008). Limit theorems for weighted samples with applications to Sequential Monte Carlo Methods. The Annals of Statistics 36(5), 2344–2376.
Doucet, A., N. de Freitas, and N. Gordon (2001). Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.
Drovandi, C. C. and A. N. Pettitt (2011). Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics 67(1), 225–233.
Duda, R. O., P. E. Hart, and D. G. Stork (2012). Pattern Classification (2nd ed.). John Wiley & Sons.
Ethier, S. N. and T. G. Kurtz (2009). Markov Processes: Characterization and Convergence, Volume 282 of Wiley Series in Probability and Statistics. John Wiley & Sons.
Everitt, R. G. (2012). Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics 21(4), 940–960.
Friedman, J. H. and J. J. Meulman (2004). Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4), 815–849.
Giné, E. and V. Koltchinskii (2006). Empirical graph Laplacian approximation of Laplace–Beltrami operators: large sample results. In High Dimensional Probability, pp. 238–259. Institute of Mathematical Statistics.
Griffiths, R. C. and P. Marjoram (1997). An ancestral recombination graph. Institute for Mathematics and its Applications 87, 257.
Hartigan, J. (1975). Clustering Algorithms. New York: Wiley.
Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer.
Hobolth, A., O. F. Christensen, T. Mailund, and M. H. Schierup (2007). Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genetics 3(2), e7.
Kimura, M. (1968). Evolutionary rate at the molecular level. Nature 217, 624–626.
Kingman, J. (1982). The coalescent. Stoch. Proc. and Their Applications 13, 235–248.
Koltchinskii, V. I. (1998). Asymptotics of spectral projections of some random matrices approximating integral operators. In High Dimensional Probability (Oberwolfach, 1996), Volume 43 of Progr. Probab., pp. 191–227. Basel: Birkhäuser.
Li, H. and R. Durbin (2011). Inference of human population history from individual whole-genome sequences. Nature 475, 493–496.
Lindsay, B. G. (1988). Composite likelihood methods. In Statistical Inference from Stochastic Processes (Ithaca, NY, 1987), Volume 80 of Contemp. Math., pp. 221–239. Providence, RI: Amer. Math. Soc.
Liu, J. S. (2008a). Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. New York: Springer.
Liu, J. S. (2008b). Monte Carlo Strategies in Scientific Computing.
Springer Series in Statistics. Springer.
Liu, J. S., R. Chen, and T. Logvinenko (2001). A theoretical framework for sequential importance sampling with resampling. In Sequential Monte Carlo Methods in Practice, pp. 225–246. Springer.
Lombaert, E., T. Guillemaud, C. Thomas, et al. (2011). Inferring the origin of populations introduced from a genetically structured native range by Approximate Bayesian Computation: case study of the invasive ladybird Harmonia axyridis. Molecular Ecology 20, 4654–4670.
Maier, M., U. von Luxburg, and M. Hein (2013). How the result of graph clustering methods depends on the construction of the graph. ESAIM: Probability and Statistics 17, 370–418.
Marin, J.-M., N. Pillai, C. P. Robert, and J. Rousseau (2013). Relevant statistics for Bayesian model choice. Journal of the Royal Statistical Society: Series B, Early View (doi:10.1111/rssb.12056), 21 pages.
Marin, J.-M., P. Pudlo, and M. Sedki (2012). Optimal parallelization of a sequential approximate Bayesian computation algorithm. In IEEE Proceedings of the 2012 Winter Simulation Conference, Article number 29, 7 pages.
Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré (2003, December). Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100(26), 15324–15328.
Marjoram, P. and J. Wall (2005). Fast "coalescent" simulation. BMC Genetics 7, 16.
Markovtsova, L., P. Marjoram, and S. Tavaré (2000). The age of a unique event polymorphism. Genetics 156(1), 401–409.
Markovtsova, L., P. Marjoram, and S. Tavare (2001). On a test of Depaulis and Veuille. Molecular Biology and Evolution 18(6), 1132–1133.
McLachlan, G. and D. Peel (2000). Finite Mixture Models. John Wiley & Sons.
McVean, G. A. and N. J. Cardin (2005). Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences 360(1459), 1387–1393.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics 22(3), 1142–1160.
Ng, A., M. Jordan, and Y. Weiss (2002). On spectral clustering: analysis and an algorithm. In T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, Volume 14, pp. 849–856. MIT Press.
Owen, A. and Y. Zhou (2000). Safe and Effective Importance Sampling. Journal of the American Statistical Association 95(449), 135–143.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237–249.
Owen, A. B. (2010). Empirical Likelihood. CRC Press.
Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics 9(1), 135–140.
Ribatet, M., D. Cooley, and A. C. Davison (2012). Bayesian inference from composite likelihoods, with an application to spatial extremes. Statistica Sinica 22, 813–845.
Rigollet, P. and R. Vert (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli 15(4), 1154–1178.
Robert, C. and G. Casella (2004). Monte Carlo Statistical Methods (2nd ed.). New York: Springer.
Robert, C. P., J.-M. Cornuet, J.-M. Marin, and N. Pillai (2011). Lack of confidence in approximate Bayesian computation model choice. Proc. Natl. Acad. Sci. USA 108(37), 15112–15117.
Rosasco, L., M. Belkin, and E. De Vito (2010). On learning with integral operators. Journal of Machine Learning Research 11, 905–934.
Rosenblatt, M. (1969). Conditional probability density and regression estimators. Multivariate Analysis II 25, 31.
Schiebinger, G., M. Wainwright, and B. Yu (2014). The Geometry of Kernelized Spectral Clustering. arXiv preprint arXiv:1404.7552.
Scornet, E., G. Biau, and J.-P. Vert (2014). Consistency of Random Forests. Technical report, arXiv preprint 1405.2881.
Shawe-Taylor, J. and N. Cristianini (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.
Shi, J. and J. Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905.
Sirén, J., P. Marttinen, and J. Corander (2010). Reconstructing population histories from single-nucleotide polymorphism data. Molecular Biology and Evolution 28(1), 673–683.
Sisson, S. A., Y. Fan, and M. Tanaka (2007). Sequential Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 104, 1760–1765.
Sisson, S. A., Y. Fan, and M. Tanaka (2009). Sequential Monte Carlo without likelihoods: errata. Proc. Natl. Acad. Sci. USA 106, 16889.
Stephens, M. and P. Donnelly (2000). Inference in molecular population genetics. J. R. Statist. Soc. B 62, 605–655.
Tavaré, S., D. Balding, R. Griffith, and P. Donnelly (1997). Inferring coalescence times from DNA sequence data. Genetics 145, 505–518.
Veach, E. and L. Guibas (1995, August). Optimally Combining Sampling Techniques for Monte Carlo Rendering. In SIGGRAPH '95 Proceedings, pp. 419–428. Addison-Wesley.
von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416.
von Luxburg, U., M. Belkin, and O. Bousquet (2008). Consistency of spectral clustering. Ann. Statis. 36(2), 555–586.
von Luxburg, U., B. Williamson, and I. Guyon (2012). Clustering: science or art? In ICML Unsupervised and Transfer Learning, JMLR Workshop and Conference Proceedings, Volume 27, pp. 65–80.
Wiuf, C. and J. Hein (1999). Recombination as a point process along sequences. Theoretical Population Biology 55(3), 248–259.

A Published papers
See web page http://www.math.univ-montp2.fr/~pudlo/HDR
• (A2) B. Pelletier and P. Pudlo (2011) Operator norm convergence of spectral clustering on level sets. Journal of Machine Learning Research, 12, pp. 349–380.
• (A3) E. Arias-Castro, B. Pelletier and P. Pudlo (2012) The Normalized Graph Cut and Cheeger Constant: from Discrete to Continuous. Advances in Applied Probability, 44(4), dec. 2012.
• (A4) B. Cadre, B. Pelletier and P. Pudlo (2013) Estimation of density level sets with a given probability content. Journal of Nonparametric Statistics 25(1), pp. 261–272.
• (A5) J.-M. Marin, P. Pudlo, C. P. Robert and R. Ryder (2012) Approximate Bayesian Computational methods. Statistics and Computing 22(6), pp. 1167–1180.
• (A06) Estoup, A., E. Lombaert, J.-M. Marin, C. Robert, T. Guillemaud, P. Pudlo, and J.-M. Cornuet (2012). Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12(5), 846–855.
• (A7) Mengersen, K. L., Pudlo, P. and Robert, C. P. (2013) Bayesian computation via empirical likelihood. Proc. Natl. Acad. Sci. USA 110(4), pp. 1321–1326.
• (A8) Gautier, M., J. Foucaud, K. Gharbi, T. Cezard, M. Galan, A. Loiseau, M. Thomson, P. Pudlo, C. Kerdelhué, and A. Estoup (2013). Estimation of population allele frequencies from next-generation sequencing data: pooled versus individual genotyping. Molecular Ecology 22(14), 3766–3779.
• (A9) Gautier, M., K. Gharbi, T. Cezard, J. Foucaud, C. Kerdelhué, P. Pudlo, J.-M. Cornuet, and A. Estoup (2013). The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Molecular Ecology 22(11), 3165–3178.
• (A12) Cornuet J.-M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.-M., Estoup A. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data. Bioinformatics 30(8), pp. 1187–1189.
• (A13) Baragatti, M. and P. Pudlo (2014). An overview on Approximate Bayesian Computation. ESAIM: Proc. 44, 291–299.
• (A14) Leblois, R., Pudlo, P., Néron, J., Bertaux, F., Beeravolu, C. R., Vitalis, R. and Rousset, F. (2014) Maximum likelihood inference of population size contractions from microsatellite data. Molecular Biology and Evolution, in press.
• (A15) Stoehr, J., Pudlo, P. and Cucala, L. (2014) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Accepted in Statistics and Computing.

B Preprints
See web page http://www.math.univ-montp2.fr/~pudlo/HDR
• (A10) Ratmann, O., P. Pudlo, S. Richardson, and C. P. Robert (2011). Monte Carlo algorithms for model assessment via conflicting summaries. Technical report, arXiv preprint 1106.5919.
• (A11) Sedki, M., P. Pudlo, J.-M. Marin, C. P. Robert, and J.-M. Cornuet (2013). Efficient learning in ABC algorithms. Technical report, arXiv preprint 1210.1388.
• (A16) Marin, J.-M., P. Pudlo, M. Sedki (2013). Consistency of the Adaptive Multiple Importance Sampling. Technical report, arXiv preprint 1211.2548.
• (A17) Pudlo, P., Marin, J.-M., Estoup, A., Gautier, M., Cornuet, J.-M. and Robert, C. P. ABC model choice via random forests. Technical report, arXiv preprint 1406.6288.

C Curriculum vitæ
MAÎTRE DE CONFÉRENCES à l'Université Montpellier 2, Faculté des Sciences
I3M – Institut de Mathématiques et Modélisation de Montpellier, UMR CNRS 5149
Place Eugène Bataillon ; 34095 Montpellier CEDEX, France
Tél. : +33 4 67 14 42 11 / +33 6 85 17 78 46
Email : [email protected]
URL : http://www.math.univ-montp2.fr/~pudlo
Né le 20 septembre 1977 à Villers-Semeuse (08 – Ardennes). Nationalité française.

Études
2001–2004 THÈSE au laboratoire de Probabilités, Combinatoire et Statistique, université Claude Bernard Lyon 1, sous la direction de Didier PIAU : Estimations précises de grandes déviations et applications à la statistique des séquences biologiques
1998–2002 Études à l'École Normale Supérieure de Lyon : MAGISTÈRE mathématiques et applications (MMA) ; LICENCE, MAÎTRISE et DEA à l'université Lyon 1 ; AGRÉGATION de mathématiques (reçu 52ème)

Postes occupés
septembre 2011–août 2013 Délégation INRA au Centre de Biologie pour la Gestion des Populations
septembre 2006 Recrutement MCF à l'université Montpellier 2
2005–2006 ATER, université de Franche-Comté, Laboratoire de Mathématiques de Besançon (UMR 6623)
2002–2005 Allocataire-moniteur, université Lyon 1
1998–2002 Fonctionnaire stagiaire, École Normale Supérieure de Lyon

Thèmes de recherche
Classification : Clustering spectral • Clustering basé sur la densité • Machine learning • Théorèmes asymptotiques • Constante de Cheeger • Graphes aléatoires. Financé par le projet ANR CLARA (2009–2013) : Clustering in High Dimension: Algorithms and Applications, porté par B. PELLETIER (je suis responsable du pôle montpelliérain)
Probabilités numériques et statistique bayésienne : Méthodes de Monte-Carlo • Génétique des populations • ABC (Approximate Bayesian Computation) • Échantillonnage préférentiel • Vraisemblance empirique.
Financé par
• une délégation INRA (département SPE) de deux ans au Centre de Biologie pour la Gestion des Populations (CBGP, UMR INRA SupAgro Cirad IRD, Montpellier) ;
• le projet ANR EMILE (2009–2013) : Études de Méthodes Inférentielles et Logiciels pour l'Évolution, porté par J.-M. CORNUET, puis R. VITALIS ;
• le Labex NUMEV, Montpellier ;
• l'Institut de Biologie Computationnelle (projet Investissement d'Avenir, Montpellier) ;
• le projet PEPS (CNRS) « Comprendre les maladies émergentes et les épidémies : modélisation, évolution, histoire et société », que je porte.

Principaux collaborateurs
Communauté mathématique : Benoît CADRE (Pr, École Normale Supérieure de Cachan, antenne de Rennes), Jean-Michel MARIN (Pr, université Montpellier 2), Kerrie MENGERSEN (Pr, Queensland University of Technology, Brisbane, Australie), Bruno PELLETIER (Pr, université Rennes 2), Didier PIAU (Pr, université Joseph Fourier Grenoble 1), Christian P. ROBERT (Pr, université Paris-Dauphine & IUF).
Communauté biologique : Jean-Marie CORNUET (DR, INRA CBGP), Arnaud ESTOUP (DR, INRA CBGP), Mathieu GAUTIER (CR, INRA CBGP), Raphaël LEBLOIS (CR, INRA CBGP) et François ROUSSET (DR, CNRS ISE-M).

Responsabilités administratives et scientifiques
2013–aujourd'hui Co-responsable de l'axe Algorithmes & Calculs du Labex NUMEV
2008–aujourd'hui Responsable du séminaire de probabilités et statistique de Montpellier
2010–aujourd'hui Membre élu du Conseil de l'UMR I3M
2014 Membre du comité scientifique des Troisièmes Rencontres R à Montpellier (25–27 juin 2014)
2012, -13, -14 Membre des comités d'organisation des Écoles-Ateliers « Mathematical and Computational Evolutionary Biology », juin 2012, 2013 et 2014.
2012 Membre du comité de sélection à l'université Lyon 1 pour le recrutement d'un maître de conférences en statistique.
2010 Membre du comité d'organisation des Journées de Statistique du Sud à Mèze (juin 2010) : « Modelling and Statistics in System Biology ».
2009–2010 Membre de comités de sélection à Montpellier 2.
2008 Membre élu de la commission de spécialistes CNU 26 à Montpellier 2.
2004–2005 Administrateur du serveur du LaPCS (Laboratoire de Probabilités, Combinatoire et Statistique, Lyon 1)

Encadrement
Outre une dizaine d'étudiants en Master 1, j'ai encadré les travaux ci-dessous.
février – juin 2009 Stage de M2 de Mohammed SEDKI.
2009–2012 Thèse de Mohammed SEDKI avec J.-M. MARIN (Pr, Montpellier 2) : Échantillonnage préférentiel adaptatif et méthodes bayésiennes approchées appliquées à la génétique des populations (soutenue le 31 octobre 2012). M. SEDKI est MAÎTRE DE CONFÉRENCES à l'université d'Orsay (faculté de médecine – Le Kremlin-Bicêtre) depuis septembre 2013.
mars – juin 2012 Stage de M2 de Julien STOEHR, élève de l'École Normale Supérieure de Cachan
2012–aujourd'hui Thèse de Julien STOEHR co-encadrée avec Jean-Michel MARIN (Pr, Montpellier 2) et Lionel CUCALA (MCF, Montpellier 2) : Choix de modèles pour les champs de Gibbs (en particulier via ABC)
avril – juillet 2013 Stage de M2 de Coralie MERLE (Master MathSV, Université Paris Sud – École Polytechnique) avec Raphaël LEBLOIS (CR, CBGP).
2013–aujourd'hui Thèse de Coralie MERLE en co-direction effective avec Raphaël LEBLOIS (CR, CBGP) : Nouvelles méthodes d'inférence de l'histoire démographique à partir de données génétiques.
Enseignements
Depuis mon recrutement en 2006, j'ai eu l'occasion d'effectuer de nombreuses heures d'enseignement, parmi lesquelles on peut citer :
PhD Responsable du module doctoral « Programmation orientée objet : modélisation probabiliste & calcul numérique en statistique pour la biologie » (30h)
M2R Responsable du module de Classification Supervisée et Non-Supervisée (20h)
M1 Responsable du module Processus stochastiques / Réseaux et files d'attente (50h)
L3 Responsable du module Traitement de données (50h) pour les licences de Biologie et de Géologie-Biologie-Environnement

Liste de publications
Articles
Pour la notoriété des revues, voir le tableau en page 67.
(A1) P. Pudlo@ (2009) Large deviations and full Edgeworth expansions for finite Markov chains with applications to the analysis of genomic sequences. ESAIM: Probab. and Statis. 14, pp. 435–455.
(A2) B. Pelletier and P. Pudlo@ (2011) Operator norm convergence of spectral clustering on level sets. Journal of Machine Learning Research, 12, pp. 349–380.
(A3) E. Arias-Castro, B. Pelletier and P. Pudlo@ (2012) The Normalized Graph Cut and Cheeger Constant: from Discrete to Continuous. Advances in Applied Probability, 44(4), dec. 2012.
(A4) B. Cadre, B. Pelletier and P. Pudlo@ (2013) Estimation of density level sets with a given probability content. Journal of Nonparametric Statistics 25(1), pp. 261–272.
(A5) J.-M. Marin, P. Pudlo@, C. P. Robert and R. Ryder (2012) Approximate Bayesian Computational methods. Statistics and Computing 22(6), pp. 1167–1180.
(A6) A. Estoup, E. Lombaert, J.-M. Marin, T. Guillemaud, P. Pudlo, C. P. Robert and J.-M. Cornuet (2012) Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12(5), pp. 846–855.
(A7) Mengersen, K. L., Pudlo, P.@ and Robert, C. P. (2013) Bayesian computation via empirical likelihood. Proc. Natl. Acad. Sci. USA 110(4), pp. 1321–1326.
(A8) Gautier, M., Foucaud, J., Gharbi, K., Cezard, T., Galan, M., Loiseau, A., Thomson, M., Pudlo, P., Kerdelhué, C., Estoup, A. (2013) Estimation of population allele frequencies from next-generation sequencing data: pooled versus individual genotyping. Molecular Ecology 22(14), pp. 3766–3779.
(A9) Gautier, M., Gharbi, K., Cezard, T., Foucaud, J., Kerdelhué, C., Pudlo, P., Cornuet, J.-M., Estoup, A. (2013) The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Molecular Ecology 22(11), pp. 3165–3178.
(A12) Cornuet J.-M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.-M., Estoup A. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA sequence and microsatellite data. Bioinformatics, btt763.
(A13) Baragatti, M. and Pudlo, P.@ (2014) An overview on Approximate Bayesian computation. ESAIM: Proc. 44, pp. 291–299.
(A14) Leblois, R., Pudlo, P., Néron, J., Bertaux, F., Beeravolu, C. R., Vitalis, R. and Rousset, F. Maximum likelihood inference of population size contractions from microsatellite data. Molecular Biology and Evolution, in press.
(A15) Stoehr, J., Pudlo, P. and Cucala, L. (2014) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Accepté dans Statistics and Computing. Voir Arxiv:1402.1380

Articles soumis
(A10) Ratmann, O., Pudlo, P., Richardson, S. and Robert, C. P.
(2011) Monte Carlo algorithms for model assessment via conflicting summaries. Voir Arxiv:1106.5919
(A11) Sedki, M., Pudlo, P., J.-M. Marin, C. P. Robert and J.-M. Cornuet (2013) Efficient learning in ABC algorithms. Soumis. Voir Arxiv:1210.1388
(A16) Marin, J.-M., Pudlo, P.@ and Sedki, M. (2012 ; 2014) Consistency of the Adaptive Multiple Importance Sampling. Soumis. Voir Arxiv:1211.2548
(A17) Pudlo, P., Marin, J.-M., Estoup, A., Gautier, M., Cornuet, J.-M. and Robert, C. P. ABC model choice via random forests. Soumis. Voir Arxiv:1406.6288

Brevet international
(B1) Demande PCT n°EP13153512.2 : Process for identifying rare events (2014).

@ Attention, il est d'usage en mathématiques de classer les auteurs par ordre alphabétique. Cela concerne les articles (A1), (A2), (A3), (A4), (A5), (A7), (A10) et (A13), marqués d'un @.

Documents et programmes informatiques à vocation de transfert
(D1) Cornuet J-M, Pudlo P, Veyssier J, Dehne-Garcia A, Estoup A (2013) DIYABC V2.0, a user-friendly package for inferring population history through Approximate Bayesian Computation using microsatellites, DNA sequence and SNP data. Programme disponible sur le site http://www1.montpellier.inra.fr/CBGP/diyabc/
(D2) Cornuet J-M, Pudlo P, Veyssier J, Dehne-Garcia A, Estoup A (2013) DIYABC V2.0, a user-friendly package for inferring population history through Approximate Bayesian Computation using microsatellites, DNA sequence and SNP data. Notice détaillée d'utilisation de 91 pages disponible sur le site http://www1.montpellier.inra.fr/CBGP/diyabc/

Communications orales
(T1) Séminaire de génétique des populations, Vienne, Autriche, avril 2013. http://www.popgen-vienna.at/news/seminars.html
(T2) Journées du groupe Modélisation Aléatoire et Statistique de la SMAI, Clermont-Ferrand, août 2012.
(T3) Mathematical and Computational Evolutionary Biology, Montpellier, juin 2012.
(T4) International Workshop on Applied Probability, Jérusalem, juin 2012.
(T5) Séminaires Statistique Mathématique et Applications, Fréjus, août–septembre 2011.
(T6) 5èmes Journées Statistiques du Sud, Nice, juin 2011.
(T7) Approximate Bayesian Computation in London, mai 2011.
(T8) 3rd conference of the International Biometric Society Channel Network, avril 2011.
(T9) 42èmes Journées de Statistique, Marseille, 2010.
(T10) 41èmes Journées de Statistique, Bordeaux, 2009.
(T11) XXXIVème École d'Été de Probabilités de Saint-Flour, 2004.
(T12) Journées de probabilités, La Rochelle, 2002.
DIVERS SÉMINAIRES EN FRANCE : entre autres, Toulouse (2012), AgroParisTech (2012), Grenoble au LECA (2011), Avignon (2011), Besançon (2010, 2013), Rennes (2010), Grenoble (2008)...

Notoriété des revues

Table C.1 – Référentiel de notoriété
Revue | Facteur d'impact (FI à 5 ans) | Notoriété
Advances in Applied Probability | 0.900 (0.841) | ∗∗
ESAIM: Probab. and Statis. | 0.408 (–) | ∗
Journal of Machine Learning Research | 3.420 (4.284) | ∗∗∗
Journal of Nonparametric Statistics | 0.533 (0.652) | ∗
Molecular Biology and Evolution | 10.353 (11.221) | ∗∗∗∗
Molecular Ecology | 6.275 (6.792) | ∗∗∗
Molecular Ecology Resources | 7.432 (4.150) | ∗∗∗
Proc. Natl. Acad. Sci. USA | 9.737 (10.583) | ∗∗∗∗
Statistics and Computing | 1.977 (2.663) | ∗∗∗
∗∗∗∗ = Exceptionnelle, ∗∗∗ = Excellente, ∗∗ = Correcte, ∗ = Médiocre
La notoriété des revues est tirée du Référentiel de notoriété 2012, Erist de Jouy-en-Josas – Crebi, M. Désiré, M.-H. Magri et A. Solari. Elle provient d'une étude de la distribution des facteurs d'impact par discipline.