Analysis of distributed algorithms for stochastic approximation (Analyse d'algorithmes distribués pour l'approximation stochastique)
2015-ENST-0002
EDITE - ED 130

Doctorat ParisTech

THESIS
submitted to obtain the degree of Doctor awarded by TELECOM ParisTech
Speciality: "Signal et Images"

publicly presented and defended by
Gemma MORRAL ADELL
on January 8, 2015

Analysis of distributed algorithms for stochastic approximation in sensor networks
(Analyse d'algorithmes distribués pour l'approximation stochastique dans les réseaux de capteurs)

Thesis advisor: Pascal BIANCHI
Thesis co-advisor: Jérémie JAKUBOWICZ

Jury:
M. Walid BEN-AMEUR, Professor, SAMOVAR, Télécom SudParis - President
M. Alex OLSHEVSKY, Assistant Professor, University of Illinois - Reviewer
M. Cédric RICHARD, Professor, Laboratoire Lagrange, Université de Nice Sophia-Antipolis - Reviewer
M. Julien HENDRICKX, Assistant Professor, ICTEAM, Ecole Polytechnique de Louvain - Examiner
Mme. Gersende FORT, Research Director, LTCI, Télécom ParisTech - Examiner
M. Claude CHAUDET, Associate Professor, RMS, Télécom ParisTech - Invited member
M. Pascal BIANCHI, Associate Professor, LTCI, Télécom ParisTech - Thesis advisor
M. Jérémie JAKUBOWICZ, Associate Professor, SAMOVAR, Télécom SudParis - Thesis co-advisor

TELECOM ParisTech, a school of Institut Mines-Télécom - member of ParisTech
46 rue Barrault 75013 Paris - (+33) 1 45 81 77 77 - www.telecom-paristech.fr

This work has been supported by DGA (French Armament Procurement Agency) and by Institut Mines-Télécom ("Futur & Ruptures" program).

Acknowledgments

First and foremost, I would like to warmly thank my advisor Pascal Bianchi for trusting me, for believing in my skills, and especially for his academic and scientific support. I really admire him as a brilliant researcher, and I am indebted to him for his patience with me. I am also most grateful to my co-advisor Jérémie Jakubowicz: for his gift, a book on Probabilités that I have read two or three times, and for being available, especially during the first two years. I would like to thank Gersende Fort for her unconditional help, whatever my problem and whenever it arose.
I am really honored to be part of an important joint work with her. More briefly but no less importantly, it was a pleasure to follow the lectures on Probabilités of Jamal Najim at the very beginning of my thesis, to have been invited to present a joint work with Stéphan Clémençon at a Big Data conference, and to receive the pleasant greetings of Eric Moulines most mornings at the laboratory. I had the chance to learn from all of them. It was a pleasure to be in office DA-319, where I could share plenty of moments, chats, confidences and support with such kind people: Marjorie Jala and Amandine Schreck. For almost three years we were copines du bureau, and we spent many hours in the same place. I will keep them in my memory. Briefer acquaintances, but equally kind and nice colleagues, are: Emilie, Cristina, Andrés (with whom I could speak Spanish), Claire, Paul, Miro and Alain. And a really warm thank-you to Amy N. Dieng, with whom I did interesting joint work on the localization topic. Outside my lab's environment, there is one of my best friends and an excellent scientist, my running mate Tommaso. Grazie mille Tommy for your support. I would like to thank Alexander for being my German partner for a short while. I shared many break-time moments with: my swimming mate Jérôme (Massard & Vignerons are our meeting point); my Saturday break mates Stéphane and Eric (discussions chez "Le touche-balles"); my Franco-Catalan partner Gwenola; my Italian mates Mary, Rugge and Peppe (pazienza ragga); my scientist mentors Albert (gràcies for always being available via WhatsApp) and Jérôme (merci for being my cultural source). I would like to thank each of them for being available and for having such patience to wait for me. I also think of my lovely friend Cristina, since I had the good fortune that she found a job in Paris. I am grateful to be part of my association nc with Emmanuel and other such interesting people. I finish with my family.
I would like to thank my parents and my sister for their unconditional help, support and love. They have always been there, even though we are separated by distance. I am fortunate to have them. I have fulfilled one of my wishes: I have lived in Paris to finish my studies, for more than five years. I received the best presents at the end, my doctoral degree and Marco. I will carry them with me for the rest of my life.

Contents

Introduction et présentation des résultats (in French)
  1 Motivations
  2 Context and considered framework
    2.1 Network model
    2.2 Communication model
  3 Position of this thesis
    3.1 Preliminaries: consensus algorithms
    3.2 Distributed optimization
    3.3 Distributed principal component analysis
    3.4 Application of PCA: localization in wireless sensor networks
  4 Thesis outline
  5 Scientific production

1 Introduction
  1.1 Motivations
  1.2 Framework
  1.3 Position of the thesis
    1.3.1 Preliminary: consensus algorithms
    1.3.2 Distributed optimization
    1.3.3 Distributed principal component analysis
  1.4 Thesis outline
  1.5 Publications
I Consensus algorithms

2 Success and failure of adaptation-diffusion algorithms
  2.1 Introduction
    2.1.1 Context and goal
    2.1.2 Related works on distributed optimization
    2.1.3 Contributions
  2.2 Distributed optimization
    2.2.1 Framework
    2.2.2 Results
    2.2.3 Success and failure of convergence
    2.2.4 Enhanced algorithm with weighted step sizes
  2.3 Distributed Robbins-Monro algorithm: general setting
  2.4 Convergence analysis
    2.4.1 Disagreement vector
    2.4.2 Average vector
    2.4.3 Main convergence result
  2.5 Convergence rate
    2.5.1 Assumption
    2.5.2 Main result
    2.5.3 A special case: doubly-stochastic matrices
  2.6 Concluding remarks
  2.7 Numerical results
II Distributed principal component analysis

3 A distributed on-line Oja's algorithm
  3.1 Introduction
    3.1.1 Context and goal
    3.1.2 Related works
    3.1.3 Contributions
  3.2 Case G = GN
    3.2.1 Oja's algorithm
    3.2.2 Communication model: randomized sparsification
    3.2.3 Distributed on-line Oja's algorithm (p = 1, G = GN)
  3.3 General graph and unknown matrix M case
    3.3.1 Network considerations
    3.3.2 Distributed on-line algorithm
    3.3.3 Convergence analysis
  3.4 Extension of Oja's algorithm for p ≥ 1
    3.4.1 Oja's algorithm
    3.4.2 Distributed on-line Oja's algorithm
  3.5 Numerical results
    3.5.1 Principal eigenvector estimation (p = 1)
    3.5.2 Two principal eigenvectors estimation (p = 2)

4 Application to self-localization in WSN
  4.1 Contributions
  4.2 Received signal model and testbed description
    4.2.1 Ranging-based approaches
    4.2.2 Log-normal shadowing model (LNSM)
    4.2.3 Distance estimation
    4.2.4 FIT IoT-LAB: platform of wireless sensor nodes
  4.3 Overview of some localization techniques
    4.3.1 Centralized techniques
    4.3.2 Distributed approaches
  4.4 Distributed MDS-MAP approach
    4.4.1 The framework: centralized batch MDS
    4.4.2 Centralized on-line MDS
    4.4.3 Distributed on-line MDS
    4.4.4 Convergence analysis
    4.4.5 Numerical results
  4.5 Position refinement: distributed maximum likelihood estimator
    4.5.1 Principle: maximum likelihood estimation
    4.5.2 The algorithm: on-line gossip-based implementation
    4.5.3 Numerical results: initialization by do-MDS algorithm
  4.6 A cooperative RSSI-based algorithm for indoor localization in WSN
    4.6.1 Observation model: biased log-normal shadowing model (B-LNSM)
    4.6.2 Initialization: biased maximum likelihood estimator (B-MLE)
    4.6.3 Experimental results after the refinement phase

Conclusions and perspectives

III Appendices

A Application on distributed parameter estimation
  A.1 Introduction
  A.2 Parametric model: exponential families
  A.3 Centralized EM algorithms
    A.3.1 Batch EM
    A.3.2 On-line EM
  A.4 Proposed distributed on-line EM
    A.4.1 Algorithm
  A.5 Convergence w.p.1
  A.6 Numerical results
    A.6.1 Application to Gaussian mixtures
    A.6.2 Simulations

B Application on distributed machine learning
  B.1 Introduction
  B.2 Background
    B.2.1 Objective
    B.2.2 Distributed Learning
  B.3 The Online Learning Gossip Algorithm (OLGA)
  B.4 Performance Analysis
  B.5 Distributed Selection
  B.6 Numerical Results
    B.6.1 Simulation data
    B.6.2 Real data
  B.7 Conclusion

C Examples of gossip models for consensus algorithms
  C.1 Standard gossip averaging
    C.1.1 Communication model description
    C.1.2 Numerical results on distributed optimization
  C.2 Push-sum gossip averaging
    C.2.1 Communication model description
    C.2.2 Algorithm for distributed optimization
    C.2.3 Numerical results on distributed optimization

D Proofs related to Chapter 2
  D.1 Proof of Theorem 2.1
  D.2 Proof of Lemma 2.3
  D.3 Preliminary results on the sequence (φn)n
  D.4 Proof of Proposition 2.2
    D.4.1 Decomposition of ⟨θn+1⟩ − ⟨θn⟩
    D.4.2 Proof of Proposition 2.2
  D.5 Proof of Theorem 2.3
    D.5.1 A preliminary result
    D.5.2 Checking condition C2 of [67, Theorem 2.1]
    D.5.3 Expression of U⋆
    D.5.4 Checking condition C3 of [67, Theorem 2.1]
    D.5.5 Detailed computations for verifying the condition C2

Bibliography

Introduction et présentation des résultats (Introduction and presentation of the results)

This thesis was carried out at the LTCI (Laboratoire Traitement et Communication de l'Information) at Télécom ParisTech, under the supervision of Pascal Bianchi. It was also co-supervised by Jérémie Jakubowicz during half of the thesis, with the close and much appreciated support of Gersende Fort whenever joint works were being developed.
The objective of this thesis is to propose and analyze new distributed strategies, based on stochastic approximation, for consensus and principal component analysis problems in sensor networks. This kind of approach is well suited to wireless sensor networks, which are composed of generally low-cost devices with limited resources. We therefore give priority to the design of simple distributed algorithms that process the data without requiring storage or complex operations, and with only sporadic exchanges between neighboring sensors. The problems addressed in the two parts of this thesis can be cast as a decentralized statistical estimation problem. On the one hand, the estimation comes from the fact that each sensor must estimate an unknown parameter that depends on the environment. On the other hand, the statistical analysis comes from considering this environment as imperfect and subject to a noise modeled as random. Consequently, in this context, the environment is only partially known by the sensors. The goal of this first chapter is to present our results and our main contributions in the two fields that have been treated: consensus algorithms in a first part, and distributed principal component analysis in the following part. They are explained in greater depth in Chapters 2, 3 and 4 respectively, where the details of the proofs and bibliographical precisions are given. First of all, in this chapter we justify the motivation of these contributions by pointing out the applications, trends and current market needs that drive the use of networks of interconnected devices, e.g. wireless sensors. We then introduce in Section 2 the context and framework of this work.
We describe the three main properties shared by the algorithms we have designed for the different problems addressed. Section 3 presents our contributions in the two application contexts considered, i.e. distributed optimization in Section 3.2 and principal component analysis in Section 3.3. We also add two indispensable additional sections. First, Section 3.1 serves as a preliminary step towards understanding consensus algorithms in general, which are a key ingredient of the distributed optimization addressed thereafter. Second, Section 3.4 presents the experimental results from the concrete application of sensor localization using the distributed principal component analysis algorithm proposed in the previous section.

1 Motivations

Remote sensing and environmental monitoring have been an active research area in recent decades. The technological evolution of wireless sensor networks has helped to spread their use, first for military purposes and later for civil and industrial applications, e.g. surveillance and safety management. Moreover, the scientific motivations in statistical signal processing have been an important element of the progress made in the detection, estimation and classification of the data acquired by sensor networks. Although we pay particular attention to sensor networks, the work of this thesis also applies to multi-agent systems more generally, e.g. peer-to-peer networks or multi-robot systems.
Applications related to multi-agent systems include:

• network control and coordination, with the following examples: target or trajectory tracking [132], [45], [123], [152], resource allocation [108], [23];

• Big Data processing, with the following examples: learning of classifiers [149], [160], recommendation by profile estimation [147], PageRank [124];

• environmental monitoring and object recognition in sensor networks, with the following examples: parameter estimation [135], [144], signal detection [120], [90], [102].

In this thesis, we pay particular attention to the localization problem in wireless sensor networks, owing to the current growth of applications and services for which the knowledge of the positions of a group of interconnected sensors or devices is necessary, i.e. location-awareness services. The information on the positions of the nodes/sensors of the network can thus be used for several purposes, such as dynamic routing, which can be controlled and adapted according to the given positions. Furthermore, for environmental monitoring, there is a need for applications requiring geographical information on the observed data, e.g. identifying the temperature or humidity on a map. In a more sophisticated use, the developments in home automation and the Internet of Things (smart home and IoT) contribute to the design of smart houses in which household appliances, energy regulators and any electronic device are activated automatically depending on the position of the host.
While robustness and flexible deployment are two of the main advantages of a distributed network of wireless sensors, possible problems related to privacy, to data security, and to the limitation of energy resources (typically battery), storage and computational complexity must be overcome. We consider a multi-agent system as a group of entities endowed with both interaction and data-processing capabilities, e.g. a wireless sensor network. We use agent as the general word for a computer, a processor, a node/sensor or any other device that constitutes a connected network. Within a specific objective and context, a global task is performed by the network of communicating agents, e.g. temperature measurements in sensor networks or binary classification in machine learning. The agents are autonomous and have their own awareness of the environment, having access to local views of the global scenario without being under the control of a central unit designated in advance. In this thesis, we consider that each agent has a partial view of the global problem to be solved. Typically, in sensor networks each sensor is able to observe its surrounding environment, whereas in distributed machine learning each processor (or computer) is in charge of handling a part of the distributed data set. The objective of our work is the design of distributed algorithms for embedded multi-agent networks. By distributed, we mean that the agents share their local information without any hierarchical architecture or any master agent, i.e. in contrast to centralized processing. These architectures offer two main advantages.
On the one hand, robustness to individual node failures, since the data can be recovered from any other agent. On the other hand, scalability: the network can adapt to changes in its order of magnitude while maintaining performance, since there is no central agent that could create a bottleneck.

2 Context and considered framework

In this thesis, we focus on distributed algorithms from the particular viewpoint of stochastic approximation. We consider the case where the data/observations are processed on-line. A randomly drawn sample is used by an agent to update its local solution and is discarded after use; then a new sample is used, and so on. In this context, we jointly address two frameworks: multi-agent systems and distributed stochastic approximation. Consequently, we are interested in numerical methods that have the following three properties. They are i) iterative - an iterate is updated at each discrete time instant, i.e. iteration; ii) distributed - the agents communicate in order to merge their individual iterates; and iii) on-line - each update is made using the most recent observation or data sample.

2.1 Network model

The network of N agents is represented by an undirected graph G = (V, E), where V = {1, . . . , N} denotes the set of agents (also called nodes in the context of wireless sensor networks) and E denotes the set of undirected edges. Two agents i, j ∈ V are connected if there exists the link {i, j} ∈ E allowing communication between i and j. We will sometimes write i ∼ j to denote the edge {i, j} ∈ E. We will always assume that {i, i} is not an edge, i.e.
the graph G has no self-loops.

2.2 Communication model

Distributed algorithms can be either synchronous or asynchronous. In the synchronous case, all agents are expected to complete their local computations before their output values can be merged; the algorithm only proceeds to the next iteration once all agents have obtained their result. In the asynchronous case, we assume that each agent is likely to complete a local computation at some random time instant, passes its output to other neighboring agents, and proceeds to another local computation without having to wait for the rest of the agents. Thus, the meaning we give to "asynchronous distributed algorithm" is that there is no central agent scheduling the computation instants and that any node may wake up randomly at any time, independently of the other nodes. This mode of operation brings obvious advantages in terms of low complexity and flexibility of implementation. Asynchronous distributed optimization is a promising topic for addressing machine learning problems involving massive data sets (see [32] or the more recent survey [40]).

3 Position of this thesis

3.1 Preliminaries: consensus algorithms

Historically in the literature, a significant amount of work on distributed processing has been devoted to solving the so-called average consensus problems. Although this thesis addresses more general questions, we start with a description of this archetypal problem, since it allows us to present the main ideas underlying many of the algorithms proposed, in particular, for distributed optimization. Let T0,i ∈ R denote some scalar value observed by agent i ∈ V.
The objective of average consensus is, for every agent i, to estimate the average value over the whole set, i.e. the quantity denoted T̄0 = (1/N) ∑_{i=1}^N T0,i. The most widespread approach to solve this problem in a distributed way is the following. At time n, each agent i has an estimate Tn,i of the sought (unknown) average. Each node i receives the current iterates of the other nodes in its neighborhood, i.e. the nodes j such that j ∼ i, and updates its local iterate as a weighted average of its past iterate and the iterates received from its neighbors. Formally, at each iteration n,

∀i ∈ V,  Tn,i = ∑_{j∈V} w(i, j) Tn−1,j    (1)

where the w(i, j) are non-negative weights such that w(i, j) = 0 whenever i and j are not connected. This condition guarantees that the algorithm is indeed distributed over the network graph. For simplicity, assume also that the weights satisfy ∑_j w(i, j) = 1 for all i. We then define the N × N matrix W whose (i, j) coefficient coincides with w(i, j). The above algorithm is simply written Tn = W Tn−1, where Tn is the column vector whose i-th entry coincides with Tn,i. Note that W is a stochastic matrix in the sense that each of its rows sums to one. Distributed average consensus techniques have their origins in applied statistics [48] and computer science [155], [156] (see [16] for a detailed survey on this topic). The conditions under which the above distributed algorithm converges to the sought average T̄0 are well known. Suppose now that

∀i ∈ V,  ∑_{j: j∼i} w(i, j) = 1    (2)

∀j ∈ V,  ∑_{i: i∼j} w(i, j) = 1 ,    (3)

that is, not only do all the rows of the matrix W sum to one, but all its columns sum to one as well. Such a matrix W is called doubly stochastic (or bistochastic).
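The synchronous iteration Tn = W Tn−1 can be sketched in a few lines. Here is a minimal, hypothetical example (not from the thesis) on a 3-node path graph 1 ∼ 2 ∼ 3, using the standard Metropolis weights w(i, j) = 1/(1 + max(deg_i, deg_j)), an assumed choice that makes W doubly stochastic, symmetric and primitive:

```python
import numpy as np

# Doubly stochastic Metropolis weight matrix for the path 1 ~ 2 ~ 3
# (degrees 1, 2, 1): off-diagonal weights 1/3, remainder on the diagonal.
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])

T = np.array([3.0, 6.0, 12.0])   # initial observations T_{0,i}
target = T.mean()                # sought average = 7.0

for _ in range(200):
    T = W @ T                    # each node averages with its neighbours

# Every node's iterate has converged to the network average.
assert np.allclose(T, target)
```

Each row of W sums to one (the update is a weighted average) and each column sums to one (the network sum is preserved), which is exactly the double stochasticity required by (2)-(3).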
If W is moreover a primitive matrix¹, it can be shown, as an immediate application of the Perron-Frobenius theorem (see [81]), that

∀i ∈ V,  lim_{n→∞} Tn,i = T̄0 .    (4)

Thus, the algorithm eventually allows each node to reach a consensus, that is, an agreement on the final value. Moreover, the consensus value coincides with the sought average T̄0. The above algorithm is sometimes called a gossip algorithm by some authors. This gossip algorithm is synchronous, in the sense that all agents must communicate their values at every step of the iterative algorithm, and the matrix W is fixed, the same for all iterations. The authors of [31] propose an asynchronous communication protocol. At time instant n, a given randomly chosen node (say node i) wakes up and selects, uniformly at random, a node in its neighborhood (node j). Nodes i and j then merge their values as:

Tn,i = Tn,j = 0.5 Tn−1,i + 0.5 Tn−1,j ,    (5)

while the other nodes k ∉ {i, j} keep their iterates, Tn,k = Tn−1,k. The algorithm can similarly be written in matrix form as Tn = Wn Tn−1, where (Wn)n is a random sequence of matrices, namely Wn = IN − 0.5 (ei − ej)(ei − ej)ᵀ, where IN is the N × N identity matrix, ei is the i-th vector of the canonical basis of Rᴺ and ᵀ denotes transposition. The convergence (4) is still preserved (in the almost sure sense) under some technical conditions that are specified in detail in [31]. The key feature of the matrices Wn is that they are still doubly stochastic for every n ≠ 0.

¹ There exists m > 0 such that all the coefficients of Wᵐ are strictly positive. The assumption holds for instance if w(i, j) > 0 for all i ∼ j and there exists i such that w(i, i) > 0.
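The pairwise gossip update (5) admits an equally short sketch. The 4-node ring topology and the number of iterations below are assumptions made for illustration, not taken from [31]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pairwise gossip: a random node i wakes up, picks a uniform neighbour j,
# and both replace their values by the pair average, i.e. T_n = W_n T_{n-1}
# with W_n = I_N - 0.5 (e_i - e_j)(e_i - e_j)^T. Sketch on a 4-node ring.
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
T = np.array([1.0, 2.0, 3.0, 10.0])
mean0 = T.mean()

for _ in range(2000):
    i = rng.integers(4)                       # random node wakes up
    j = rng.choice(neighbours[i])             # uniform random neighbour
    T[i] = T[j] = 0.5 * T[i] + 0.5 * T[j]     # pairwise average (5)

# Each W_n is doubly stochastic, so the network average is preserved
# at every step and all nodes converge to it.
assert np.isclose(T.mean(), mean0)
assert np.allclose(T, mean0, atol=1e-6)
```

The invariant to notice is that every single update conserves the sum of the iterates, which is precisely why the double stochasticity of Wn matters.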
The gossip protocol described above (known as pairwise gossip) can be considered asynchronous, since nodes are allowed to be idle at certain instants. However, it still requires some level of coordination between nodes: two nodes must update their values simultaneously at the same time instant. Reducing the need for such bidirectional links (transmission and feedback at once), in order to obtain truly asynchronous protocols, was an important issue in the years following [31]. The authors of [10] propose a fully asynchronous communication model commonly known as broadcast gossip. As in the previous protocol, at each iteration n an agent is activated uniformly at random. The asynchronism is now at the "agent level" instead of the "edge level": the active agent i broadcasts its estimate to all its neighbors without waiting for a return transmission. Unfortunately, it is shown in [10] that convergence to the sought average (4) no longer holds. All one can expect from such an easy-to-implement protocol is convergence in expectation, but not almost sure convergence. We refer to [69] for more detailed considerations on this question. To conclude this preliminary section, the average consensus problem can be solved using a linear gossip algorithm in an asynchronous version, but the double stochasticity of the weight matrices Wn is required at each instant n. However, double stochasticity has some practical drawbacks regarding implementation, since it generally requires feedback links in the network. Alternative methods do not require double stochasticity, at the expense of more complex communication models, e.g. [89], [84], [78].
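A hypothetical broadcast gossip sketch makes the failure mode concrete. The ring topology, the mixing parameter gamma = 0.5 and the update rule below are illustrative assumptions (a simplified reading of the broadcast model of [10], not its exact formulation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Broadcast gossip sketch: the woken node i broadcasts its value and each
# neighbour j moves towards it, T_j <- gamma*T_j + (1-gamma)*T_i, with no
# feedback link. The mixing matrices are only row-stochastic, so the
# network sum is NOT preserved: nodes still agree in the limit, but the
# consensus value is random and equals the true average only in expectation.
gamma = 0.5
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
T = np.array([1.0, 2.0, 3.0, 10.0])
mean0 = T.mean()

for _ in range(500):
    i = rng.integers(4)              # active agent, no return transmission
    for j in neighbours[i]:
        T[j] = gamma * T[j] + (1 - gamma) * T[i]

spread = T.max() - T.min()           # nodes agree...
# ...but typically not on the exact initial average mean0.
```

Since every update is a convex combination, all iterates stay in the interval spanned by the initial values; what is lost is the conservation of their sum.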
For instance, [78] obtains the convergence result (4) using only row-stochastic matrices, i.e. Wn 1 = 1, but assuming that the set of communicating nodes grows at each iteration n. Also worth noting is the contribution of [89], which introduces the push-sum protocol (analyzed in greater generality by [34]). The gossip model of [89] circumvents the convergence issue without feedback links by introducing some extra communication into the model: two variables instead of one are involved in the update step. [84] proposes an asynchronous version of the latter model [89]. The convergence analysis and the convergence rate are provided in [83] by the same authors. We refer to [16] for a more complete and general account of distributed average consensus algorithms.

3.2 Distributed optimization

1) Context

Distributed optimization arises in most of the applications mentioned above in wireless sensor networks and machine learning. The goal of the network is to optimize a global function defined as a sum of private local functions. The global minimization problem reads:

min_{θ∈ℝ} F(θ),  F(θ) = Σ_{i=1}^N f_i(θ)   (6)

where f_i is the private cost function of agent i. We assume throughout this thesis that the functions f_i are differentiable, but not necessarily convex. This thesis also focuses on first-order methods, i.e. algorithms relying only on gradient computations. An illustrative example in the sensor network setting is the following.

Example 1. In sensor networks, one must often estimate a parameter θ (e.g.
temperature, humidity, position of a source) based on a set of random observations X1, …, XN collected by independent sensors whose marginal probability density functions p_{θ,1}, …, p_{θ,N} are indexed by θ. Provided the random variables X_i are independent, the maximum likelihood estimate of θ can be written as a minimizer of (6) with f_i(θ) = −log p_{θ,i}(X_i). In a centralized setting, the random observations are collected by a central unit/agent. All the functions are assumed to be available to this single agent, so a standard gradient descent on the global function F can be applied directly to obtain a minimizer. This thesis, however, is concerned with the design of distributed approaches: the functions f_i are only known locally by the agents and the function F is no longer available. In the literature, there are essentially two distributed first-order approaches to this problem. The first is the incremental approach proposed by [113], [131], [133]. A message containing the current estimate is passed iteratively from node to node through the network: each agent adds its own update, based on its local observation, and forwards the result to the next agent. Although conceptually simple, this approach has some drawbacks. An incremental algorithm generally requires the message to travel along a Hamiltonian cycle, and finding such a path in a graph is known to be an NP-complete problem. Alternatives to the Hamiltonian cycle requirement have been proposed: for instance, [113] assumes that an agent communicates with a single other agent chosen uniformly at random in the network (not necessarily among its neighbors).
However, the approach of [113] still requires a substantial amount of routing. This thesis pursues a second, cooperative approach to distributed computation of the adapt-then-combine form (a terminology coined by [103] in [38]), also known as adaptation-diffusion algorithms. It is an approach based on the consensus techniques of Section 3.1, whose idea stems from the work of [155]. In this setting, agents update their estimates through a local gradient descent step; then some agents communicate and merge their local estimates according to the exchanged information. As introduced in the previous section, these methods are known in the literature as gossip methods. Unlike incremental approaches, each node i holds its own estimate θ_{n,i} at every time instant n, i.e. each agent i generates a sequence of estimates (θ_{n,i})_{n≥1}, which we assume here to be scalar (the vector case raises no difficulty beyond notation; see Chapter 2). At iteration n, the algorithm under study consists of two steps:

[Local step] Agent i generates a temporary estimate θ̃_{n,i} given by:

θ̃_{n,i} = θ_{n−1,i} − γ_n ∇f_i(θ_{n−1,i}),   (7)

where γ_n is a positive deterministic step size, (∇f_i(θ_{n−1,i}))_{n≥1} are the observations available to agent i, and ∇ denotes the gradient.

[Gossip] Agent i observes the values θ̃_{n,j} of some other agents j and forms a weighted average:

θ_{n,i} = Σ_{j=1}^N w_n(i,j) θ̃_{n,j},   (8)

where W_n = (w_n(i,j))_{i,j∈V} is a gossip matrix similar to those described in the previous section (see Section C.1 in Appendix C for more detailed examples). Moreover, it is more common to observe the gradient only up to some random perturbation, which may depend on the history of the algorithm.
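To make the two steps (7)-(8) concrete, here is a minimal Python simulation in a toy setting of our own choosing (not the thesis code): quadratic local costs f_i(θ) = (θ − a_i)²/2, so the global minimizer of Σ_i f_i is the average of the a_i, combined with a doubly stochastic pairwise gossip matrix:

```python
import random

def distributed_gradient_gossip(a, edges, n_iter, seed=0):
    """Algorithm (7)-(8) on local costs f_i(t) = (t - a[i])^2 / 2.
    Local step: gradient descent with decreasing step gamma_n.
    Gossip step: one random edge averages its two temporary estimates,
    i.e. a doubly stochastic W_n."""
    rng = random.Random(seed)
    theta = [0.0] * len(a)
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n ** 0.75
        # [Local step] temporary estimates, eq. (7)
        tmp = [t - gamma * (t - ai) for t, ai in zip(theta, a)]
        # [Gossip] eq. (8) with a pairwise averaging matrix W_n
        i, j = rng.choice(edges)
        tmp[i] = tmp[j] = 0.5 * (tmp[i] + tmp[j])
        theta = tmp
    return theta

# All local estimates approach the global minimizer mean(a) = 2.5.
theta = distributed_gradient_gossip([1.0, 2.0, 3.0, 4.0],
                                    [(0, 1), (1, 2), (2, 3), (3, 0)], 4000)
```

Because the pairwise matrix here is doubly stochastic at every n, the estimates agree on the sought minimizer; the thesis studies precisely what happens when this assumption is relaxed.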
In that case, equation (7) must be replaced by

θ̃_{n,i} = θ_{n−1,i} − γ_n ∇f_i(θ_{n−1,i}) + γ_n ξ_{n,i}   (9)

where ξ_{n,i} is a random perturbation accounting for the fact that the gradient is not perfectly observed at node i.

Example 2. To illustrate this point, consider again the sensor network example given above, where the agents seek to estimate an unknown parameter θ in the maximum likelihood sense from random observations. Suppose each sensor i collects a sequence of random observations (X_{n,i})_{n=1,2,…} instead of a single observation X_i, and that this sequence consists of independent copies of X_i. Then a distributed online estimation of the parameter θ using the above algorithm would read

θ̃_{n,i} = θ_{n−1,i} + γ_n ∇ log p_{θ_{n−1,i},i}(X_{n,i}).

Under some regularity conditions, it can be proved that the above update coincides with (9) upon setting f_i(θ) = −E[log p_{θ,i}(X_i)], where E denotes expectation, and the perturbation ξ_{n,i} is then a martingale increment.

It is hoped that, under suitable assumptions,

∀i ∈ V,  lim_{n→∞} θ_{n,i} = θ★   (10)

where θ★ is some minimizer of F (assumed to exist). We refer to Chapter 2 for a more detailed state of the art on these techniques. Let us mention, however, that convergence is generally proved under rather strong assumptions on the matrices (W_n)_n describing the consensus communication protocol. In general, consensus on the sought optimal value θ★ is obtained under the double stochasticity assumption ([116], [134]), i.e. the (W_n)_n are row- and column-stochastic, meaning W_n 1 = 1 and 1^T W_n = 1^T. Later, in [19], [112], the column-stochasticity condition was relaxed and assumed to hold only in expectation, i.e. 1^T E[W_n] = 1^T.
This allows, for instance, the use of the broadcast gossip model introduced by [10]. Similarly, the authors of [43] propose a diffusion model that only requires row stochasticity, at the price of being synchronous.

2) Results

The objective of this thesis is to obtain convergence results such as (10) for the sequence generated by Algorithm (7)-(8) under weaker conditions on (W_n)_n. We revisit the results of [19] when the (W_n)_n are assumed to be row-stochastic only. Most works additionally assume them column-stochastic, which turns out to be restrictive in terms of implementation and rules out otherwise natural exchange protocols. We address a broader communication framework in which the matrices (W_n)_n may depend on the observations or on the past estimates. By relaxing the double stochasticity assumption, we quantify the performance degradation this relaxation entails. Furthermore, we consider a more general stochastic approximation setting by letting Algorithm (7)-(8) take the form:

θ_n = W_n ( θ_{n−1} + γ_n Y_n ).   (11)

The recursion (11) extends the distributed optimization problem (6) to a more general framework. Indeed, algorithm (11) can be seen as a distributed version of the Robbins-Monro algorithm [139]. Stochastic approximation algorithms were originally designed by [139] to find the zeros (roots) of some function h, called the mean field, in situations where only noisy measurements of that function are available. In this respect, Y_n can be related to an unbiased estimate of the function h(θ) whose roots are sought, i.e. θ ∈ {h(θ) = 0}.
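For reference, the original (centralized) Robbins-Monro recursion can be sketched as follows; this is a textbook illustration with a linear mean field of our own choosing, not an algorithm from the thesis:

```python
import random

def robbins_monro(noisy_h, n_iter, theta0=0.0, seed=0):
    """Robbins-Monro: theta_n = theta_{n-1} + gamma_n * Y_n, where Y_n is a
    noisy, unbiased measurement of the mean field h at theta_{n-1}.
    With steps gamma_n = 1/n the iterates converge a.s. to a root of h."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        theta += (1.0 / n) * noisy_h(theta, rng)
    return theta

# Mean field h(theta) = 2 - theta (root at 2), observed with additive noise.
root = robbins_monro(lambda t, rng: (2.0 - t) + rng.uniform(-1.0, 1.0), 20000)
```

The recursion (11) is this same scheme with the gossip matrix W_n applied after each noisy step.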
Under decreasing step size assumptions on (γ_n)_n, we make the following contributions, which answer the above questions both qualitatively and quantitatively.

• Assuming that the sequence of stochastic matrices (W_n)_{n≥1} is i.i.d., we show under some technical assumptions that Algorithm (7)-(8) converges to a consensus, which we characterize. We show that this agreement value does not necessarily coincide with a critical point of Σ_i f_i. We also provide a variant of the algorithm that recovers the sought points.

• We provide sufficient conditions, either on the communication protocol represented by (W_n)_{n≥1} or on the functions f_i, ensuring that the limit points are the critical points of Σ_i f_i. When no such condition is satisfied, we also propose a simple modification of the algorithm that recovers the desired behavior.

• We extend our results to a broader setting, assuming that the matrices (W_n)_{n≥1} are no longer i.i.d. but may depend both on the current observations and on the past estimates. We also study a general stochastic approximation framework that goes beyond the model (11) and beyond the sole problem of distributed optimization.

• We characterize the convergence rate of the algorithm in the form of a central limit theorem. Unlike [19], we address the case where the sequence (W_n)_{n≥1} is not necessarily doubly stochastic. We show that non-doubly stochastic matrices affect the asymptotic error covariance (even when they are doubly stochastic in expectation, e.g. [10]). On the other hand, we show that when the matrix W_n is doubly stochastic for every n, e.g.
[31], the asymptotic covariance is identical to the one that would be obtained in a centralized (optimal) setting.

Finally, one objective of the thesis is to study the use of the proposed algorithm for statistical inference tasks in sensor networks. We propose a distributed Expectation-Maximization (EM) algorithm inspired by the adaptation-diffusion approach (see Appendix A). We also apply our method to the self-localization problem in sensor networks, including a refinement step for the estimated positions (see Chapter 4).

3.3 Distributed principal component analysis

1) Context

Another problem that can be addressed by means of stochastic approximation is principal component analysis (PCA). The objective in this class of problems is quite different from the one considered previously. Indeed, the goal for the network is no longer to reach a consensus on a common parameter of interest. Here, the goal for each node i is to drive its iterate to the value of the i-th entries of the principal eigenvectors and eigenvalues of a matrix M that depends on the environment and on the configuration of the graph underlying the network. We let M ∈ ℝ^{N×N} be a symmetric positive semidefinite matrix whose entries describe some similarity measure between each pair of agents, e.g. similarities (multidimensional scaling, MDS [95], [27]), distances (localization in sensor networks [57], [143]), customer ratings of consumed products (user profiling [154], [91]), adjacency coefficients of a graph (spectral clustering [29]), or covariances (signal detection [88], [37]).
We assume that a given agent i has only partial information about the matrix M (typically, agent i is only able to observe the i-th row of M). The principal component analysis of M then amounts to finding its spectral decomposition

M = U Λ U^T,  U U^T = I_N   (12)

where U is an orthonormal matrix whose columns are the eigenvectors of M and Λ is a diagonal matrix containing the corresponding eigenvalues (λ_1, …, λ_N), sorted in decreasing order λ_1 ≥ ⋯ ≥ λ_N. We denote the Euclidean norm by ‖·‖. For a given integer p < N, the objective is to evaluate the p largest eigenvalues λ_1, …, λ_p and the corresponding eigenvectors, which we denote u_1, …, u_p. When M is perfectly known and the data is processed in a centralized fashion, several classical methods are known to solve (12) efficiently, such as the power method [73, p. 406] when p = 1, and the QR-factorization [81, p. 114] (called orthogonal iteration in [73, p. 454]) or Gram-Schmidt orthonormalization [73, p. 254] when p > 1. The centralized power method is based on the following recursion (p = 1):

Ũ_n = M U_{n−1}   (13)
U_n = Ũ_n / ‖Ũ_n‖   (14)

where (U_n)_n is the sequence of estimated vectors, which converges to the first eigenvector u_1, and ‖x‖ denotes the Euclidean norm, i.e. ‖x‖² = Σ_i |x(i)|² for x ∈ ℝ^N. From the standpoint of a distributed implementation, the two terms M U_{n−1} and ‖M U_{n−1}‖ (the normalization term) raise complexity issues, i.e. in the number of communications and in the number of operations (sums and multiplications).
For a given agent i, the first matrix product reads as the sum Σ_{j=1}^N M(i,j) U_{n−1}(j), which contains N terms, each involving a separate communication with agent j. Second, for every agent i, (14) reads

U_n(i) = Ũ_n(i) / √( Σ_{j=1}^N Ũ_n(j)² ),

which implies that agent i must query all the other agents j ≠ i for their values Ũ_n(j) in order to perform the normalization update. When N is large, this can induce a prohibitive cost for the network in terms of number of communications. Consequently, several works have sought a decentralized implementation of (13)-(14). A pair of works derive a distributed version of (13)-(14) (see [90], [92]) by introducing a consensus step to compute the normalization term ‖Ũ_n‖, which each agent i then uses to update its coordinate U_n(i) locally. While [90] assumes M to be perfectly known, [92] includes a synchronous sparse model to estimate the vector M U_{n−1}. Unlike [90], where each agent i is able to compute Σ_{j=1}^N M(i,j) U_{n−1}(j), the authors of [92] describe a sparse-matrix model for M in which each agent i transmits M(i,j) U_{n−1}(j) to a small set of randomly chosen neighbors. In this thesis, we aim to design a principal component analysis algorithm that is: distributed, since the nodes cooperate asynchronously to estimate the different entries of the principal eigenvectors separately; and online, since the matrix M is not observed perfectly, but a sequence (M_n)_n of perturbed/noisy versions of M is generated. The sequence (M_n)_n reads M_n = M + ξ_n, where the random noise matrix ξ_n is typically a martingale increment.
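The centralized recursion (13)-(14) that the works above try to decentralize can be written down directly; the sketch below uses plain Python and an example matrix of our own:

```python
def power_method(M, n_iter):
    """Power method (13)-(14): U_n = M U_{n-1} / ||M U_{n-1}||, which
    converges to the leading eigenvector u_1 of a symmetric PSD matrix M."""
    N = len(M)
    u = [1.0] + [0.0] * (N - 1)            # initial vector U_0
    for _ in range(n_iter):
        v = [sum(M[i][j] * u[j] for j in range(N)) for i in range(N)]  # (13)
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]                                      # (14)
    return u

# For M = [[2, 1], [1, 2]] the top eigenpair is lambda_1 = 3, u_1 ∝ (1, 1).
u = power_method([[2.0, 1.0], [1.0, 2.0]], 50)
```

The two problematic terms for a distributed implementation are visible as the matrix-vector product in (13) and the global sum inside the norm in (14).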
In the centralized case, when a sequence of matrices (M_n)_n is observed globally by a central processing unit, an algorithm following a stochastic approximation approach can be used to estimate the principal eigenvectors of M. Oja's algorithm serves this purpose (see [120] for p = 1 and [122] for p > 1). We also refer to [57], [24], [85], where the authors propose alternative approaches to solving (12) based on constrained semidefinite programming. Specifically, in this work we introduce a distributed version of Oja's algorithm. Let U_n = (u_{n,1}, …, u_{n,p})^T denote the p principal components estimated at iteration n. In Oja's algorithm [122], the estimated sequence U_n is generated by the recursion:

U_n = U_{n−1} + γ_n ( M_n U_{n−1} − U_{n−1} (U_{n−1}^T M_n U_{n−1}) ).   (15)

Note that (15) reduces to a Robbins-Monro algorithm [139], as discussed in the previous section, since (15) can be seen as a particular instance of a stochastic approximation step. However, the norm ‖U_n‖ of the generated vector may exceed one, which would make the algorithm unstable. Several variants have been proposed to remedy this instability, either by introducing a normalization term or a projection step (see [29]). Since we aim at distributed implementations of (15) in which each sensor must estimate its own coordinates U_n(i), the recursion can be written at each sensor as:

U_n(i) = U_{n−1}(i) + γ_n ( Σ_{j=1}^N M_n(i,j) U_{n−1}(j) − U_{n−1}(i) (U_{n−1}^T M_n U_{n−1}) ),   (16)

where U_n(i) are the coordinates estimated at sensor i, i.e. the p components corresponding to the i-th row, U_n(i) = (u_{n,1}(i), …, u_{n,p}(i)).
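A minimal centralized illustration of the Oja recursion (15) for p = 1, with a toy matrix and noise model of our own choosing (not the distributed scheme of Chapter 3):

```python
import random

def oja(M, n_iter, noise=0.1, seed=0):
    """Oja's recursion (15) with p = 1 on noisy observations M_n = M + xi_n,
    where xi_n is a symmetric zero-mean noise matrix. U_n tends to align
    with the leading eigenvector of M while staying near unit norm."""
    rng = random.Random(seed)
    N = len(M)
    u = [1.0] + [0.0] * (N - 1)
    for n in range(1, n_iter + 1):
        gamma = 0.5 / n ** 0.7
        # draw a symmetric zero-mean perturbation xi_n
        e = [[rng.uniform(-noise, noise) for _ in range(N)] for _ in range(N)]
        Mn = [[M[i][j] + 0.5 * (e[i][j] + e[j][i]) for j in range(N)]
              for i in range(N)]
        v = [sum(Mn[i][j] * u[j] for j in range(N)) for i in range(N)]  # M_n U
        r = sum(u[i] * v[i] for i in range(N))                          # U^T M_n U
        u = [u[i] + gamma * (v[i] - u[i] * r) for i in range(N)]
    return u

# u aligns (up to sign) with the top eigenvector (1, 1)/sqrt(2) of [[2,1],[1,2]].
u = oja([[2.0, 1.0], [1.0, 2.0]], 5000)
```

The term U^T M_n U computed in `r` is precisely the quantity that, in the distributed setting, no single sensor can evaluate on its own.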
The goal of our work is to design an algorithm in which the two terms involved, Σ_{j=1}^N M_n(i,j) U_{n−1}(j) and U_{n−1}^T M_n U_{n−1}, are estimated in a fully distributed, asynchronous, and online fashion. Different distributed versions of algorithm (15) have been proposed in the literature, often in specific contexts, for example user/customer profiling in [147] as a machine learning application, or signal estimation/detection in [102] as a wireless sensor network application. Beyond the chosen approach, these two works share a common feature in their algorithms: the distributed version of (15) includes several average consensus steps of the form [31] (the average consensus algorithms described in Section 3.1) to estimate certain terms of (15) in a decentralized fashion. Indeed, these approaches require two time scales, i.e. the iteration index n to update U_n and another time index for the number of consensus cycles of the form (5). In particular, the authors of [147] address a machine learning problem in which the observations M correspond to a large matrix containing (binary) ratings given by users on certain consumed products. Under the assumption that M is a low-rank matrix, the objective is to estimate the profile vector of each user. A distributed Oja algorithm is proposed to perform the spectral decomposition of a partially known dataset M_n, i.e. the matrix M_n is assumed to be sparse. A normalization term is included in (15) to avoid stability issues. The term M_n U_{n−1} is computed through a fixed sparse model, i.e.
each agent i observes a small set of values M_n(i,j) U_{n−1}(j) from its neighbors j at each iteration n of the Oja recursion. In a preliminary step, the introduced normalization term and the term U_{n−1}^T M_n U_{n−1} are both computed through several consensus steps before U_n(i) is updated at each agent i. In the article [102], on the other hand, the objective is to find the spectral decomposition of the covariance matrix M of a signal from noisy measurements received within a wireless sensor network, i.e. the standard received-signal model "high-energy signal + zero-mean random noise" is assumed. Finding the p principal eigenvectors of M then amounts to capturing the components carrying the most energy in the received data, in order to detect and estimate the received signal of interest. The authors of [102] assume that the sensors only have access to an estimate of the covariance matrix of the form M_n = n^{−1} Σ_{t=0}^{n−1} r_t r_t^H, where (r_t ∈ ℂ^N)_{t≥0} are the measurements collected by the N sensors of the network. Under these model assumptions, three terms are identified when expanding (15) for its distributed implementation, i.e. r_n^H U_{n−1}, U_{n−1}^H U_{n−1} and U_{n−1}^H r_n r_n^H U_{n−1}. The finally proposed algorithm introduces three consensus steps for these three terms, involving several communication cycles of the form (5). At the end of this phase, each sensor is able to update its own coordinates U_n(i). In conclusion, note that each of these distributed approaches [147], [102], [92] includes one or more consensus phases per iteration/update of the estimated components, which means that the cost in terms of number of computations and communications grows considerably with the number of cycles required by the consensus step.
2) Results

Unlike [147], [102], we propose a distributed Oja algorithm to estimate the principal components of M in a general framework where no explicit model is given for the observations. The observations are defined by an independent sequence of matrices (M_n)_n, i.e. M_n contains imperfect measurements of M corrupted by random noise at time n. Moreover, in this thesis we consider the following model. At each time instant (iteration) n, each node i observes a few noisy random samples of the i-th row of the matrix M_n. Each node i sends and/or receives variables from other nodes j of the network chosen at random (unlike [147], where fixed links between nodes are considered). The matrix products involved in Oja's update equation (see (15)), namely M_n U_{n−1} and U_{n−1}^T M_n U_{n−1}, are computed through an asynchronous communication model different from the consensus scheme [31] required in [147], [102]. Thus, at each sensor i we define two random sequences y_n(i) and z_n(i), corresponding to unbiased estimates of the two terms Σ_j M(i,j) U_{n−1}(j) and U_{n−1}^T M_n U_{n−1}, respectively, which appear in equation (16). In addition, we introduce a projection step at each iteration n that keeps U_n confined to a compact set in order to prevent instabilities of the sequence (U_n)_n. The update at each sensor becomes:

U_n(i) = Π_K [ U_{n−1}(i) + γ_n ( y_n(i) − z_n(i) U_{n−1}(i) ) ]   (17)

where K is an arbitrary compact set whose interior contains [−1, 1]^p and Π_K is the projector onto K. This step is easy to implement at each sensor and requires no additional communications. The estimates y_n(i) and z_n(i) result from an asynchronous communication protocol detailed in Chapter 3.
The convergence of the proposed algorithm is analyzed in the asymptotic regime where n tends to infinity. Although the implementation and the objective differ from algorithm (11), both are tied to the Robbins-Monro theoretical framework. Thus, the convergence analysis of the sequence (U_n)_n generated by the proposed algorithm relies on the existence of a function h(U) representing the mean field, whose roots correspond to the eigenspace of M, i.e. the U ∈ {h(U) = 0} are eigenvectors satisfying the spectral decomposition (12) of M. Hence, similarly to the works [122], [29], the convergence analysis hinges on the following ingredients: the stability of U_n, and the definition of h(U) and of its roots {h(U) = 0}. We make the following contributions:

• We borrow from [92] the idea of sparsification (rare communications), and we share with [102], [147] and [29] the same foundation, namely Oja's algorithm, which we use in a distributed setting as initiated by [147] and [102].

• We provide a general framework and algorithms encompassing both the case where the symmetric matrix M is perfectly known and the case where it is not. In the latter case, we consider instead an i.i.d. sequence of random matrices denoted M_n, n ≥ 0.

• We provide an algorithm involving an asynchronous model for the communications between agents and an online model for the acquisition and processing of data by each agent. This online model is suited to the setting considered, where the measurements (observations) and the communications between agents are modeled as random processes with known parametric distributions.

• We prove the almost sure convergence of the sequence of vector subspaces generated by the distributed algorithms proposed in Chapter 3 to a set of eigenvectors of M.
Then, with a more practical and experimental aim, we study the application of our algorithm to self-localization in wireless sensor networks. This setting was motivated by the possibility of obtaining numerical results from real data collected on actual sensors.

3.4 Application of PCA: localization in wireless sensor networks

In the context of signal processing, an appealing motivation for designing the distributed, asynchronous, and online Oja algorithm described by (15) lies in its application to the self-localization problem in wireless sensor networks (see [57], [27], [143], [24], [92], [41]). Multidimensional scaling (MDS) theory [95] deals with the following general problem: find a configuration of the N objects under study when only data about their similarities/distances are available. In particular, the classical method considered in this thesis, i.e. MDS [27, Chapter 12], takes the Euclidean distances between the N positions under study as similarity measures in a space of dimension p (sensor positions usually live in p = 2 or 3 dimensions). In that case, the classical MDS method performs the principal component analysis (12) of the matrix M defined as follows:

M = −(1/2) J_⊥ D J_⊥   (18)

where the matrix D contains the squared distances and J_⊥ = I − (1/N) 1 1^T. In the context of wireless sensor networks, the positions of the network formed by N sensors are recovered (up to an isometry, i.e. rotation/translation/reflection) by applying the classical MDS method (also known as MDS-MAP [143]). Let z_i denote the position of a sensor/node i of the network and let z̄ denote the barycenter of the network, i.e. the average position z̄ = (1/N) Σ_i z_i.
In the Euclidean case, the entries of the squared-distance matrix D are related to the sensor positions z_i by:

D(i,j) = ‖z_i − z_j‖².   (19)

Then, using (19) in the definition (18) of M implies that M = Z Z^T, where the i-th row of the matrix Z coincides with z_i − z̄. Consequently, the principal component analysis problem (12) applied to (18) in the sensor network context reduces to finding the decomposition M = Z Z^T with Z = U Λ^{1/2} ∈ ℝ^{N×p} (p = 2 or p = 3). The matrices U and Λ are obtained from the spectral decomposition of M. We define the estimated position at each sensor i as Z(i), where Z(i) = (√λ_1 u_1(i), …, √λ_p u_p(i)). We assume the conditions described in the previous section, where the sensors only have access to noisy measurements of the corresponding rows of M, i.e. (M_n)_n is a sequence of unbiased estimates of M. The algorithm proposed in Chapter 4 is an adaptation of (17) to this specific context, in order to obtain the estimated positions Z from the eigenvectors U in a distributed, asynchronous, and online fashion. The centralized localization approach introduced by [143] (a theoretical analysis of which is given in [85]) consists of two main steps: first, obtain the squares of all pairwise distances D(i,j) between the sensors and compute the matrix (18) (a step that involves double-centering the positions at the barycenter of the network); second, find the p principal components of M. It is important to note, however, that in wireless sensor networks D cannot be acquired directly. The distances must instead be estimated from other available measurements, depending on the electronic modules and devices embedded in the sensors, e.g.
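The two-step centralized pipeline just described (double-centering the squared distances, then extracting p principal components) can be sketched as follows, in pure Python with power iteration and deflation, on toy coordinates of our own:

```python
def classical_mds(D, p=2, n_iter=300):
    """Classical MDS (MDS-MAP): build M = -0.5 * J D J from squared
    distances D, eq. (18), then recover Z = U Lambda^{1/2} from the p
    leading eigenpairs of M. Positions are recovered up to an isometry
    (rotation/translation/reflection)."""
    N = len(D)
    # double-centering step, eq. (18)
    row = [sum(D[i]) / N for i in range(N)]
    tot = sum(row) / N
    M = [[-0.5 * (D[i][j] - row[i] - row[j] + tot) for j in range(N)]
         for i in range(N)]
    Z = [[] for _ in range(N)]
    for _ in range(p):
        u = [1.0] + [0.0] * (N - 1)
        for _ in range(n_iter):  # power iteration for the current eigenpair
            v = [sum(M[i][j] * u[j] for j in range(N)) for i in range(N)]
            norm = sum(x * x for x in v) ** 0.5
            u = [x / norm for x in v]
        lam = sum(u[i] * sum(M[i][j] * u[j] for j in range(N))
                  for i in range(N))
        for i in range(N):
            Z[i].append(lam ** 0.5 * u[i])     # column of Z = U Lambda^{1/2}
        # deflate: M <- M - lam * u u^T, then repeat for the next eigenpair
        M = [[M[i][j] - lam * u[i] * u[j] for j in range(N)] for i in range(N)]
    return Z

pts = [(0.0, 0.0), (4.0, 0.0), (1.0, 2.0), (3.0, 3.0)]
D = [[(a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 for b in pts] for a in pts]
Z = classical_mds(D)  # recovers the 4 positions up to an isometry
```

A quick check of success is that the pairwise distances of the recovered Z match the input D, which is all that can be asked given the isometry ambiguity.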
the received power of a signal (the received signal strength indicator, RSSI), the time of arrival (TOA), or the angle of arrival (AOA) (see [126], [105] for a more detailed state of the art). In this thesis, the sensors used in our experimental results, made available by the FIT IoT-LAB platform [1], are equipped with the CC2420 radio chip². In particular, FIT IoT-LAB is a remotely accessible platform: one logs in from a private account and issues command-line instructions through the serial ports of the sensors. The (French national) platform comprises four sites located in France. Our experiments are run on the Rennes site, which hosts 256 WSN430 sensors based on the ZigBee IEEE 802.15.4 standard, operating at 2.4 GHz and deployed in two large, nearly empty storage rooms of dimensions 6 × 15 m². With this radio technology the sensors can obtain RSSI measurements. We therefore define an unbiased estimator of the squared distance based on a parametric signal model with a log-normal distribution (the standard log-normal shadowing model, LNSM, described in [136]). Furthermore, the principal component analysis step of our approach involves a distributed and asynchronous version of Oja's algorithm (15), with an observation model that lets each sensor i estimate the i-th row of M_n using this unbiased distance model from the sporadic RSSI measurements it acquires. The algorithm is fully detailed in Section 4.4.3 of Chapter 4.
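To illustrate the kind of unbiased squared-distance estimator that the LNSM allows, here is a small sketch. The path-loss parameters (P0, d0, n_p, sigma) and the bias-correction derivation below are our own illustrative choices, not the calibrated values or estimator of Chapter 4:

```python
import math
import random

# Log-normal shadowing model (LNSM): RSSI = P0 - 10*NP*log10(d/D0) + X,
# with shadowing X ~ N(0, SIGMA^2). All parameter values are illustrative.
P0, D0, NP, SIGMA = -40.0, 1.0, 3.0, 4.0

def rssi_sample(d, rng):
    """Draw one RSSI measurement at true distance d under the LNSM."""
    return P0 - 10.0 * NP * math.log10(d / D0) + rng.gauss(0.0, SIGMA)

def unbiased_d2(rssi):
    """Unbiased estimator of the squared distance d^2 from one RSSI sample.
    Inverting the LNSM gives D0^2 * 10^((P0 - rssi)/(5 NP)) = d^2 * 10^(-X/(5 NP));
    since E[10^(-X/(5 NP))] = exp(SIGMA^2 ln(10)^2 / (50 NP^2)), dividing by
    this log-normal correction factor removes the bias."""
    naive = D0 ** 2 * 10.0 ** ((P0 - rssi) / (5.0 * NP))
    correction = math.exp((SIGMA * math.log(10.0)) ** 2 / (50.0 * NP ** 2))
    return naive / correction

# Monte Carlo sanity check: the estimator averages to d^2 = 9 for d = 3.
rng = random.Random(1)
est = sum(unbiased_d2(rssi_sample(3.0, rng)) for _ in range(100000)) / 100000
```

Unbiasedness is exactly what the stochastic approximation step (17) needs from its inputs, even though each individual sample remains very noisy.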
Our contributions are the following:

• We adapt the distributed algorithm proposed in Chapter 3 to the self-localization problem in wireless sensor networks, assuming sporadic RSSI measurements acquired by the sensors. The position of each sensor is estimated without any a priori knowledge of landmarks, i.e. of the positions of anchor nodes.

• We obtain numerical results on the accuracy of the position estimates in two cases: when synthetic data are generated according to the LNSM distribution, and when the data are collected from real-world experiments using the sensors of the FIT IoT-LAB platform [1]. We compare our results with classical centralized methods (multilateration [79] (MC), min-max [141], the classical MDS algorithm summarized in Section 4.4.1 and Oja's approach (15)) and with a distributed MDS approach proposed by [45].

• We propose an additional phase consisting of a distributed optimization algorithm of the form (7)-(8), in order to improve the accuracy of the estimated positions and to let each sensor build a local map from its own position and those of its neighboring sensors. This optional step can be particularly useful when some anchor nodes are subsequently present in the network. The refinement algorithm is first implemented on the sensors of the FIT IoT-LAB platform, with the estimated positions initialized by the algorithm proposed in Chapter 3, and then on three different scenarios where they are initialized with the algorithm proposed by [53] (data available at [2] and [3]).

² Detailed specifications at http://www.ti.com/product/cc2420
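For reference, the centralized MDS baseline that the distributed algorithm emulates (double-centering of the squared distances, followed by a rank-p spectral factorization, cf. (18)-(19)) can be sketched in a few lines. The function below is a generic illustration with hypothetical names, not the thesis code.

```python
import numpy as np

def classical_mds(D, p=2):
    """Recover positions (up to isometry) from a squared-distance matrix D.

    D[i, j] = ||z_i - z_j||^2  =>  M = -0.5 * J @ D @ J = Z @ Z.T,
    where J = I - (1/N) 11^T double-centers the positions at the barycenter.
    """
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    M = -0.5 * J @ D @ J                       # similarity matrix, cf. (18)
    lam, U = np.linalg.eigh(M)                 # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:p]            # keep the p largest
    return U[:, idx] * np.sqrt(np.maximum(lam[idx], 0))  # Z = U Lambda^{1/2}

# With exact distances, the recovered map reproduces all pairwise distances.
rng = np.random.default_rng(0)
Z_true = rng.standard_normal((10, 2))
D = ((Z_true[:, None, :] - Z_true[None, :, :]) ** 2).sum(-1)
Z_hat = classical_mds(D, p=2)
```

Since positions are only identifiable up to rotation, reflection and translation, the recovered `Z_hat` should be compared to `Z_true` through pairwise distances rather than coordinates.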
4 Organization of the thesis

The work in this thesis addresses two different problems in wireless sensor networks: consensus and principal component analysis. For both problems we propose algorithms based on distributed stochastic approximation, whose main features were described in Section 2. We therefore split this manuscript into two parts, corresponding to the introductions given in Sections 3.2 and 3.3 respectively. In each part we treat a particular application that provides a context for the problem: on the one hand, distributed optimization is the application of the consensus algorithms; on the other hand, localization is the application of distributed principal component analysis. Figure 1 below shows how this thesis is organized and the relations between the chapters. In the first part, we provide general convergence results on consensus algorithms. We specialize and illustrate these results in the case of distributed optimization. Finally, we adapt this work to a more specific instance of distributed optimization, namely parameter estimation. Our first contribution at the beginning of this thesis was the design of a distributed, on-line version of the well-known Expectation-Maximization algorithm, which is commonly used to estimate the parameters of a signal represented by a Gaussian mixture. The second part is less theoretical than the first and is mainly devoted to experimental results on distributed principal component analysis, obtained from experiments on real wireless sensors of the remotely accessible FIT IoT-LAB platform.
The design of a new algorithm for this purpose is introduced in the first chapter of this part, and it is adapted to localization in sensor networks in the following chapter. We add to the proposed localization algorithm a refinement step for the estimated positions, formulated as a distributed optimization problem; to this end, we use the consensus gradient descent algorithm, as a special case of (7)-(8), to solve this problem distributively. This thesis is thus split into two parts corresponding to the two applications that motivate our work. The manuscript finally contains three chapters, summarized as follows.

Chapter 2 studies the distributed stochastic approximation problem in multi-agent systems. The algorithm under study consists of two steps: a local stochastic approximation step and a diffusion step that drives the network to a consensus on the result. The diffusion step uses row-stochastic matrices to weight the network exchanges. Contrary to previous works, the weight matrices are not assumed to be doubly stochastic, and may also depend on the past estimates.

[Figure 1. Diagram of the framework of this thesis and the relations between the chapters: Chapter 2 (consensus algorithms: gradient descent algorithm, Expectation-Maximization algorithm) leads to parameter estimation (Appendix A); Chapter 3 (principal component analysis: Oja's algorithm) leads to localization (Chapter 4).]

We prove that non-doubly stochastic (non-bistochastic) matrices generally influence the limit points of the algorithm.
However, the limit points are not affected by the choice of the matrices, provided that these are doubly stochastic in expectation. This conclusion legitimizes the use of broadcast-type diffusion protocols, which are easier to implement. Then, by means of a central limit theorem, we prove that doubly stochastic protocols asymptotically achieve the same performance as a centralized algorithm, and we quantify the degradation caused by the use of non-doubly stochastic matrices. Throughout the chapter, particular emphasis is put on the special case of distributed non-convex optimization as an illustration of our results.

Chapter 3 deals with the problem of principal component analysis (PCA) through a distributed and asynchronous implementation. We provide two algorithms suited to different situations depending on the structure of the underlying graph. A sufficiently general framework allows us to analyze all these algorithms at once. Convergence is proved with probability one under suitable assumptions, and numerical experiments illustrate their good behavior. The proposed algorithm allows us to address, in the following chapter (Chapter 4), the self-localization problem in wireless sensor networks on which this context is based.

Chapter 4 considers the localization problem in wireless networks formed by sensors whose positions remain fixed. Each node seeks to estimate its own position from noisy measurements of its distance to other nodes. We assume that the sensors can obtain measurements of the received signal power (Received Signal Strength Indicator, RSSI), which are in turn related to the Euclidean distance through a log-normal statistical model known as the Log-Normal Shadowing Model (LNSM).
In a centralized batch mode, the positions can be recovered (up to an isometry, i.e. rotation, reflection, translation) by a PCA applied to a so-called similarity matrix built from the relative distances between each pair of sensors. In this chapter, we propose a distributed on-line algorithm allowing each node to estimate its own position based on limited information exchange in the network. Our framework encompasses sporadic measurements and links that may randomly fail. We prove the consistency of our algorithm using a convergence analysis similar to the one used in the previous chapter. We also include a refinement step based on a consensus algorithm (Chapter 2) in order to improve the accuracy of the estimated positions. Finally, we provide numerical and experimental results based on real and simulated data. The experiments with real data are carried out on a remotely accessible platform providing wireless sensor networks.

We chose to present in the appendices the detailed proofs related to the analysis of consensus algorithms that are not included in Chapter 2 (see Appendix D). In addition, Appendix C presents a numerical analysis of the best-known communication protocols used for consensus, i.e. gossip protocols, and in particular for distributed optimization. We also include two conference papers resulting from joint work within our department.
Note that throughout this thesis, our contributions contain a more rigorous part devoted to theoretical results and a part devoted to more specific, concrete applications related to current research topics. Indeed, the first algorithm based on distributed stochastic approximation, designed at the beginning of this thesis, is reported in Appendix A: we introduced a new distributed on-line Expectation-Maximization (DEM) algorithm for latent data models, including the well-known Gaussian mixture model as a special case. A second algorithm results from a collaboration in the fields of machine learning and Big Data and is presented in Appendix B: we proposed an on-line learning algorithm with a gossip communication protocol (OLGA) devoted to binary classification in a distributed setting.

5 Scientific production

The research presented in this manuscript is the fruit of collaborations with my thesis advisor Pascal Bianchi, but also with Jérémie Jakubowicz (Télécom SudParis), Gersende Fort (CNRS-Télécom ParisTech) and Stéphan Clémençon (Télécom ParisTech), as well as with Amy N. Dieng (PhD student at the LINCS of Télécom ParisTech with Claude Chaudet) for the topics of the second part of the thesis. The contributions of these collaborations include several different results, presented at both international and national conferences. They are listed below.

Articles in international peer-reviewed journals

1. G. Morral and P. Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", submitted to Elsevier journal on Signal Processing, February 2015, arXiv:1503.05298.

2. G. Morral, P. Bianchi and G.
Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", submitted to IEEE Trans. on Signal Processing, October 2014, arXiv:1410.6956.

Conference papers with proceedings

1. G. Morral, P. Bianchi* and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", the 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, USA, December 2014.

2. G. Morral* and N.A. Dieng, "Cooperative RSSI-based indoor localization: B-MLE and distributed stochastic approximation", the 80th IEEE Vehicular Technology Conference (VTC2014-Fall), Vancouver, Canada, September 2014.

3. G. Morral*, N.A. Dieng and P. Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1110-1114, Florence, Italy, May 2014.

4. P. Bianchi, S. Clémençon, J. Jakubowicz and G. Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", the 1st IEEE International Conference on Big Data (BigData), pp. 6-14, Santa Clara, USA, October 2013.

5. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité", the 24th National Conference on Signal and Image Processing (GRETSI), Brest, France, September 2013.

6. G. Morral, P. Bianchi, and J. Jakubowicz*, "Asynchronous distributed principal component analysis using stochastic approximation", the 51st Annual Conference on Decision and Control (CDC), pp. 1398-1403, Maui, Hawaii, December 2012.

7. G. Morral, P. Bianchi*, G. Fort and J. Jakubowicz, "Distributed stochastic approximation: the price of non-double stochasticity", invited paper, the 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1473-1477, California, USA, November 2012.

8. G. Morral*, P. Bianchi and J.
Jakubowicz, "Gossip-based online distributed expectation maximization", the 2012 IEEE Statistical Signal Processing Workshop (SSP), pp. 305-308, Ann Arbor, USA, August 2012.

Workshops without proceedings

1. G. Morral*, "Analyse d’algorithmes distribués pour l’approximation stochastique dans les réseaux de capteurs", presentation of the results of this thesis at the 4th Journée de restitution des travaux de recherche of the Futur & Ruptures program organized by Fondation Télécom, as a candidate for the 2015 thesis prize, March 2015.

2. P. Bianchi, S. Clémençon, J. Jakubowicz and G. Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", poster presented at the 3rd colloquium « Numérique : Grande échelle Complexité » organized by Institut Mines-Télécom, March 2014.

3. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité", poster presented at the 2nd Journée de restitution des travaux de recherche of the Futur & Ruptures program organized by Fondation Télécom, January 2013.

Chapter 1

Introduction

1.1 Motivations

Remote sensing and environmental monitoring have been active research areas for the last decades. The technological evolution of wireless sensor networks (WSNs) has helped spread their use, first for military purposes, and later in civil and industrial applications, e.g. surveillance and security management. Besides, scientific advances in statistical signal processing have been an important part of the progress made in detection, estimation and classification of the data acquired by WSNs. Although we pay special attention to WSNs, we also address more general multi-agent systems, e.g. peer-to-peer networks or multi-robot systems. The applications related to multi-agent systems include: network control and coordination (e.g.
target or trajectory tracking [132], [45], [123], [152], power and resource allocation [108], [23]), Big Data processing (e.g. classifier training [149], [160], recommendation profiling [147], PageRank [124]) or environmental monitoring and pattern recognition in sensor networks (e.g. parameter estimation [135], [144], signal detection [120], [90], [102]). In this thesis, we pay special attention to the localization problem in wireless sensor networks, since location awareness (or network location awareness) is required in many applications. Information about the positions of sensor nodes may, for instance, be used for routing and querying, which can be adapted or controlled according to those positions. For environmental monitoring, many applications require geographical information about the observed data, e.g. mapping temperature or humidity. In a more challenging use case, developments in home automation aim at making a home activate its appliances automatically depending on the host's position. If robustness and flexible deployment are two of the main advantages of these networks, one also has to overcome the issues related to data privacy and security and to the limited resources, including energy, memory and computational power. We consider a multi-agent system as a group of entities having both interaction and data-processing capabilities, e.g. a wireless sensor network. We use agent as the generic word for a computer, a processor, a sensor node or any other device forming the connected network. Within any specific objective and context, a global task is performed by the network of communicating agents, e.g. temperature measurement in WSNs or binary classification in machine learning. Agents are autonomous and self-aware: they have access to local views of the global scenario and they are not controlled by a designated central unit.
In this thesis, we consider that each agent has a partial view of the global problem to solve. Typically, in WSNs each sensor node is able to observe its surrounding area of the environment, while in distributed machine learning each processor (or computer) is in charge of handling one part of a distributed dataset. The objective of our work is to design distributed algorithms for embedded multi-agent networks. By distributed, we mean that agents share their local information without any hierarchical architecture or master agent, i.e. contrary to centralized processing. These architectures offer two main advantages: robustness to individual node failures, since data can be recovered at each agent, and scalability, since there is no central agent that may become a bottleneck.

1.2 Framework

In this thesis, we focus on distributed algorithms from the special point of view of stochastic approximation. We consider the case where data/observations are handled "on-line": a random sample is used by an agent to update its local solution and then deleted after use; next, another new sample is used, and so on. In this context, we jointly address two frameworks: multi-agent systems and distributed stochastic approximation. As a consequence, we are interested in numerical methods that have the following three properties. They are i) iterative – an iterate is updated at each time instant, ii) distributed – agents communicate to merge their individual iterates, iii) on-line – each update is done using the most recent observation or data sample.

Network model: The network of N agents is represented by an undirected (non-oriented) graph G = (V, E), where V = {1, . . . , N} stands for the set of agents (also referred to as nodes in the WSN context) and E is the set of undirected edges. Two agents i, j ∈ V are connected if there exists a link {i, j} ∈ E that enables communication between i and j. We shall sometimes write i ∼ j to denote the edge {i, j} ∈ E.
We shall always assume that {i, i} is not an edge (G has no self-loop).

Communication model: Distributed algorithms can either be synchronous or asynchronous. In the synchronous case, all agents are expected to complete their local computations before their outputs can be merged; otherwise stated, the algorithm proceeds with the next iteration provided that all agents have returned their result. In the asynchronous case, we assume that each agent is likely to finish a local computation at some random time instant, passes its output to other agents, and proceeds with another local computation with no need to wait for the rest of the agents. Thus, the meaning we give to "distributed asynchronous algorithm" is that there is no central scheduler and that any node can wake up randomly at any moment, independently of the other nodes. This mode of operation brings clear advantages in terms of complexity and flexibility. Asynchronous distributed optimization is a promising framework for scaling up machine learning problems involving massive data sets (see [32] or the more recent survey [40]).

1.3 Position of the thesis

1.3.1 Preliminary: consensus algorithms

Historically, a significant amount of work in distributed processing has been devoted to solving the so-called average consensus problem. Although the thesis addresses more general issues, we start with a description of this archetypal problem, as it allows us to introduce the main ideas underlying many distributed algorithms. We denote by T0,i ∈ R some scalar value observed by agent i ∈ V. The objective of average consensus is, for any agent i, to estimate the average value T̄0 = (1/N) Σ_{i=1}^N T0,i. One of the most widespread approaches to solve the problem distributively is the following. At time instant n, each agent i has an estimate Tn,i of the sought average.
Each node i receives the current iterates of the other nodes in its neighborhood, i.e. nodes j such that j ∼ i, and updates its local iterate as a weighted average of its past iterate and the iterates received from neighbors. Formally, at each time n,

∀i ∈ V,  Tn,i = Σ_{j∈V} w(i,j) Tn−1,j   (1.1)

where the w(i,j) are non-negative weights such that w(i,j) = 0 whenever i and j are not connected. This condition ensures that the algorithm is indeed distributed over the graph. For simplicity, assume also that Σ_j w(i,j) = 1 for all i. Define the N × N matrix W whose coefficient (i,j) coincides with w(i,j). The above algorithm simply writes Tn = W Tn−1, where Tn is the column vector whose i-th entry coincides with Tn,i. Note that W is a stochastic matrix in the sense that all its rows sum to one. Distributed average consensus has its origins in applied statistics [48] and computer science [155], [156] (see [16] for a detailed state of the art on this subject). The conditions under which the above distributed algorithm converges to the sought average T̄0 are well known. Assume that

∀i ∈ V,  Σ_{j:j∼i} w(i,j) = 1   (1.2)
∀j ∈ V,  Σ_{i:i∼j} w(i,j) = 1   (1.3)

that is, not only do all rows of the matrix W sum to one, but all columns sum to one as well. Such a matrix W is called doubly stochastic. If W is moreover a primitive matrix¹, it can be shown as an immediate application of the Perron-Frobenius theorem (see [81]) that

∀i ∈ V,  lim_{n→∞} Tn,i = T̄0 .   (1.4)

¹ There exists m > 0 such that all coefficients of W^m are strictly positive. The assumption holds for instance if w(i,j) > 0 for all i ∼ j and there exists i such that w(i,i) > 0.

Thus, the algorithm allows each node to eventually reach a consensus, i.e. an agreement on the final value. In addition, the value of the consensus coincides with the sought average T̄0. The above algorithm is sometimes referred to by some authors as a gossip algorithm.
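The linear iteration Tn = W Tn−1 and the convergence (1.4) can be checked numerically. A minimal sketch, assuming a ring graph with lazy doubly stochastic weights of our own choosing (each node averages itself and its two neighbors, so W is doubly stochastic and primitive):

```python
import numpy as np

N = 8
# Doubly stochastic weights on a ring: rows and columns both sum to one,
# and w(i, i) > 0 makes W primitive, so (1.4) applies.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

rng = np.random.default_rng(2)
T = rng.standard_normal(N)          # initial values T_{0,i}
target = T.mean()                   # the sought average, cf. (1.4)
for _ in range(500):                # T_n = W T_{n-1}
    T = W @ T
```

After a few hundred iterations every entry of `T` coincides with the initial average, illustrating both the consensus and the fact that double stochasticity pins the consensus value to T̄0.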
The above gossip algorithm is synchronous in the sense that all agents must communicate their value at every step of the algorithm, and the matrix W is fixed across the iterations. The authors of [31] propose an asynchronous communication protocol. At time n, a given node chosen at random (say node i) wakes up and randomly selects a node in its neighborhood (say node j). Nodes i and j merge their values by

Tn,i = Tn,j = 0.5 Tn−1,i + 0.5 Tn−1,j ,   (1.5)

while the other nodes k ∉ {i, j} keep their iterates, Tn,k = Tn−1,k. The algorithm writes Tn = Wn Tn−1, where (Wn)n is a random sequence of matrices, namely Wn = IN − 0.5 (ei − ej)(ei − ej)ᵀ, where IN is the N × N identity matrix, ei is the i-th vector of the canonical basis of Rᴺ and ᵀ stands for transposition. The convergence (1.4) still holds (in the almost sure sense) under technical conditions specified in [31]. A key feature of the matrices Wn is that they are still doubly stochastic for all n ≥ 1. The pairwise protocol described above can be considered asynchronous in the sense that nodes are allowed to be inactive at some instants. However, it still requires some level of coordination among nodes: two nodes must update their values simultaneously. Alleviating the need for such feedback in order to achieve truly asynchronous protocols has been an important stake in the years after [31]. The authors of [10] propose a fully asynchronous communication model called broadcast gossip. Similarly to the previous protocol, at each instant n one agent is randomly activated. The asynchronism is now at "agent level", since the active agent i broadcasts its estimate to all its neighbors without expecting a feedback transmission. Unfortunately, it is shown in [10] that the sought convergence result (1.4) no longer holds. All that can be expected with such a simple protocol is convergence in expectation, but not almost sure convergence. We refer to [69] for more detailed considerations on that matter.
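The pairwise update (1.5) amounts to multiplying by the random doubly stochastic matrix Wn = IN − 0.5 (ei − ej)(ei − ej)ᵀ. A toy simulation (on a complete graph, our simplifying assumption) illustrates that each Wn preserves the network average while driving the values to consensus:

```python
import numpy as np

def pairwise_gossip_matrix(N, i, j):
    """W_n = I_N - 0.5 (e_i - e_j)(e_i - e_j)^T: averages nodes i and j."""
    e = np.zeros(N)
    e[i], e[j] = 1.0, -1.0
    return np.eye(N) - 0.5 * np.outer(e, e)

rng = np.random.default_rng(3)
N = 10
T = rng.standard_normal(N)
avg0 = T.mean()
for _ in range(2000):
    i = rng.integers(N)                 # node that wakes up at time n
    j = (i + rng.integers(1, N)) % N    # a random distinct peer (complete graph)
    # Each W_n is doubly stochastic, so the average is preserved exactly.
    T = pairwise_gossip_matrix(N, i, j) @ T
```

The contrast with broadcast gossip is precisely here: dropping the feedback (i.e. updating only the receivers) would keep Wn row-stochastic but not column-stochastic, and the preserved quantity above would be lost.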
As a conclusion to this preliminary paragraph, the average consensus problem can be solved using a linear gossip algorithm in an asynchronous version, but the double stochasticity of the weighting matrices Wn is required at each n. Double stochasticity comes with some drawbacks regarding practical implementation, as it generally requires feedback in the network. Alternative works on this matter do not require double stochasticity, at the expense of more complex models, e.g. [89], [84], [78]. For instance, [78] establish the convergence result (1.4) using only row-stochastic matrices, i.e. Wn 1 = 1, but letting the set of communicating nodes grow at each time n. It is worth noting the contribution of [89], who introduce the push-sum protocol (analyzed in greater generality by [34]). The gossip model of [89] circumvents the convergence issue without the need for feedback links by introducing some additional communication, i.e. two variables instead of one are involved in the update step. [84] propose an asynchronous version of the latter model [89]; the convergence analysis and convergence speed are provided in [83] by the same authors. We refer to [16] for a more complete description of general distributed average consensus algorithms.

1.3.2 Distributed optimization

Distributed optimization is present in many of the applications mentioned above related to sensor networks and machine learning. The goal of the network is to optimize a global function defined as a sum of local private functions. The minimization problem under study is described as follows:

min_{θ∈R} F(θ) ,   F(θ) = Σ_{i=1}^N fi(θ)   (1.6)

where fi is the private cost function of agent i. We assume in this thesis that the functions fi are differentiable but not necessarily convex. This thesis also puts the focus on first-order methods, i.e. algorithms relying merely on gradient computations.
Let us provide an illustrative example in the context of sensor networks.

Example 1.1. In WSN contexts, it is often the case that one should estimate a parameter θ based on a set of random observations X1, . . . , XN collected by independent sensors and whose marginal probability density functions pθ,1, . . . , pθ,N are indexed by θ. Provided that the Xi's are independent, the maximum likelihood estimate of θ can be written as the minimizer of (1.6) where fi(θ) = − log pθ,i(Xi).

In a centralized setting, random observations are collected at a central unit. All functions are assumed to be available at a single place, and a standard gradient descent on F can be used to obtain a minimizer. This thesis focuses on the distributed setting: the functions fi are only locally known by the agents, and the function F is available nowhere. In the literature, there are mainly two kinds of distributed first-order algorithms for solving this problem. The first one is known as the incremental approach (see [113], [131], [133]). A single iterate travels through the network from node to node. Each node updates the estimate by incrementing the iterate with a scaled version of its negative gradient evaluated at the current point. The approach, although conceptually simple, has some drawbacks. Incremental algorithms generally require the message to go through a Hamiltonian cycle in the network, and finding such a path is known to be an NP-complete problem. Relaxations of the Hamiltonian cycle requirement have been proposed: for instance, [113] only requires that an agent communicates with another agent selected uniformly at random in the network (not necessarily in its neighborhood). However, substantial routing is still needed. This thesis focuses on another cooperative approach of the form adapt-then-combine (following a terminology introduced by [103] in [38]), also known as adaptation-diffusion algorithms.
The idea, which traces back to [155], consists in coupling a local gradient descent at the nodes' side with a gossip communication step that merges the iterates, as explained in the previous subsection. Contrary to incremental approaches, each node i has its own estimate θn,i. At each iteration, the following update holds:

θ̃n,i = θn−1,i − γn ∇fi(θn−1,i) ,   (1.7)
θn,i = Σ_{j=1}^N wn(i,j) θ̃n,j ,   (1.8)

where γn is a deterministic positive step size, ∇ denotes the gradient, and Wn = (wn(i,j))_{i,j∈V} is a gossip matrix similar to the ones described previously (see Section C.1 in Appendix C for detailed examples). In addition, it is sometimes the case that the gradient is observed up to some random perturbation, which may depend on the history of the algorithm. In that case, equation (1.7) must be replaced by

θ̃n,i = θn−1,i − γn ∇fi(θn−1,i) + γn ξn,i   (1.9)

where ξn,i is a perturbation due to the fact that the gradient is not perfectly observed at node i.

Example 1.2. To illustrate this point, consider again the WSN example given previously, where agents seek to estimate an unknown parameter θ in the maximum likelihood sense based on random observations. Consider the case where each sensor i gathers a sequence of random observations (Xn,i)n=1,2,... instead of a single observation Xi. Assume also that the sequence is formed by independent copies of Xi. Then, an on-line distributed estimation of the parameter θ using the above algorithm would read

θ̃n,i = θn−1,i + γn ∇ log p_{θn−1,i}(Xn,i) .

Under some regularity conditions, it can be shown that the above update coincides with (1.9) by letting fi(θ) = −E[log pθ,i(Xi)], where E stands for the expectation, the perturbation ξn,i being a martingale increment.

It is expected that, under some assumptions,

∀i ∈ V,  lim_{n→∞} θn,i = θ⋆   (1.10)

where θ⋆ is some minimizer of F (assumed to exist). We refer to Chapter 2 for a more detailed state of the art on these techniques.
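A minimal synchronous simulation of the adapt-then-combine recursion (1.7)-(1.8), with noiseless quadratic local costs and a fixed doubly stochastic ring matrix (both our illustrative choices, not the general setting analyzed in Chapter 2):

```python
import numpy as np

N = 5
rng = np.random.default_rng(5)
targets = rng.standard_normal(N)     # f_i(theta) = 0.5 * (theta - targets[i])^2
# The minimizer of F = sum_i f_i is theta* = targets.mean().

# Fixed doubly stochastic gossip matrix on a ring; double stochasticity is
# what steers the consensus toward the true minimizer of the sum.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

theta = np.zeros(N)                  # theta_{0,i}
for n in range(1, 5001):
    gamma = 1.0 / n                  # decreasing deterministic step size
    theta_tilde = theta - gamma * (theta - targets)   # local step (1.7)
    theta = W @ theta_tilde                           # combine step (1.8)
```

All agents end up near the common minimizer theta* = targets.mean(), even though no agent ever sees the cost functions of the others.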
However, we mention that convergence is generally proved under rather strong assumptions on the matrices (Wn)n describing the consensus protocol. In general, the sought consensus is achieved under the double-stochasticity assumption ([116], [134]), i.e. the (Wn)n are row- and column-stochastic, meaning that Wn 1 = 1 and 1ᵀ Wn = 1ᵀ. In [19], [112] the column-stochasticity condition is relaxed and only assumed in expectation; this allows, for instance, the use of the broadcast gossip model of [10]. Similarly, the authors of [43] introduce a diffusion model that only requires the row-stochasticity condition, at the expense of a synchronous operation. The objective of this thesis is to derive convergence results such as (1.10) for the sequence generated by Algorithm (1.7)-(1.8) under milder conditions on (Wn)n. We investigate the results of [19] when the (Wn)n are only row-stochastic, and we extend them to a broader communication setting in which (Wn)n may depend on the observations or on the last estimates. In addition, we consider a more general stochastic approximation framework by letting Algorithm (1.7)-(1.8) take the following form:

θn = Wn ( θn−1 + γn Yn ) .   (1.11)

Recursion (1.11) extends the distributed optimization problem (1.6) to a more general framework. Indeed, Algorithm (1.11) can be viewed as a distributed version of the so-called Robbins-Monro algorithm [139]: Yn may be related to an unbiased estimate of a given mean-field function h(θ) whose roots one seeks, i.e. θ ∈ {h(θ) = 0}. We also focus on the convergence rate of this algorithm, along with its asymptotic normality. Finally, an objective of the thesis is to investigate the use of the above algorithm for statistical inference tasks in sensor networks. We propose a distributed Expectation-Maximization algorithm inspired by the adaptation-diffusion approach, and we also apply our method to the sensor self-localization problem.
1.3.3 Distributed principal component analysis

Another problem addressed within the stochastic approximation framework is principal component analysis (PCA). The objective in such problems is rather different from the one considered previously. Indeed, the aim is no longer to find a consensus on a common parameter of interest. Here, the aim is to drive the iterate of each node i to the value of the i-th entries of the principal eigenvectors of a matrix M. We define M ∈ R^{N×N} as a symmetric positive semi-definite matrix whose entries describe some similarity metric between each pair of agents, e.g. similarities (multidimensional scaling [95], [27]), distances (WSN localization [57], [143]), customer ratings (user profiling [154], [91]), adjacency weights (spectral clustering [29]) or covariances (signal detection [88], [37]). We assume that a given agent i has only partial information on the matrix M (typically, it is only able to observe the i-th row of M). Consider the spectral decomposition of M

M = U Λ Uᵀ ,  U Uᵀ = I_N   (1.12)

where U is an orthonormal matrix whose columns are the eigenvectors of M and Λ is a diagonal matrix containing the corresponding eigenvalues (λ1, ..., λN) in decreasing order λ1 ≥ ··· ≥ λN. We denote by ‖·‖ the Euclidean norm. For a given integer p < N, the aim is to evaluate the p largest eigenvalues λ1, ..., λp and the corresponding eigenvectors, which we denote u1, ..., up. When M is perfectly known and data is processed in a centralized manner, several classical methods are known to solve (1.12) efficiently, such as the power method ([73, p. 406]) when p = 1, and the QR-factorization ([81, p. 114], called orthogonal iteration in [73, p. 454]) or the Gram-Schmidt orthonormalization [73, p. 254] when p > 1.
The centralized power method is based on computing the following recursion (when p = 1):

Ũn = M Un−1 ,   (1.13)
Un = Ũn / ‖Ũn‖ ,   (1.14)

where (Un)n is the estimate sequence that converges to the first eigenvector u1 and ‖x‖ stands for the Euclidean norm, i.e. ‖x‖² = ∑i |x(i)|² for x ∈ R^N. From a distributed implementation viewpoint, both terms M Un−1 and ‖M Un−1‖ have drawbacks. For a given agent i, the first matrix product writes as a sum ∑_{j=1}^{N} M(i, j) Un−1(j) that contains N terms, involving a communication with each separate agent j. Second, for any agent i, (1.14) writes Un(i) = Ũn(i) / √(∑_{j=1}^{N} Ũn(j)²), and thus agent i should query all other agents about their values Ũn(j) to implement the normalization update. When N is large, this could incur a prohibitive cost to the network in terms of number of communications. As a consequence, several works have made efforts to derive a decentralized implementation of (1.13)-(1.14). A couple of works deal with a distributed version of (1.13)-(1.14) (see [90], [92]) by introducing consensus averaging to compute the normalization term. While in [90] M is assumed to be perfectly known, [92] includes a synchronous sparse model for M Un−1. Contrary to [90], where each agent i is able to compute ∑_{j=1}^{N} M(i, j) Un−1(j), [92] describes a sparse model in which each agent i transmits M(i, j) Un−1(j) to a small set of randomly chosen neighbors. In this thesis we seek to design an algorithm which is

• distributed: nodes cooperate in order to separately estimate different entries of the principal eigenvectors;

• on-line: matrix M is unobserved, but a sequence (Mn)n of perturbed/noisy versions of M is generated. The sequence (Mn)n is written as Mn = M + Ξn, where the perturbation matrix Ξn is typically a martingale increment.

In the centralized case, when a sequence of matrices (Mn)n is globally observed at a central computing unit, stochastic approximation can be used to estimate the eigenvectors of M.
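For reference, the centralized recursion (1.13)-(1.14) takes only a few lines of numpy; the matrix below is an illustrative randomly generated symmetric positive semi-definite M, not one from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
A = rng.normal(size=(N, N))
M = A @ A.T                   # symmetric positive semi-definite test matrix

U = rng.normal(size=N)        # random initialization
for _ in range(500):
    U_tilde = M @ U                        # (1.13): matrix-vector product
    U = U_tilde / np.linalg.norm(U_tilde)  # (1.14): normalization

# Compare against the leading eigenvector from a direct eigendecomposition
# (the sign of an eigenvector is arbitrary, hence the two norms).
lam, vecs = np.linalg.eigh(M)
u1 = vecs[:, -1]              # eigenvector of the largest eigenvalue
print(min(np.linalg.norm(U - u1), np.linalg.norm(U + u1)) < 1e-6)
```

Both operations in the loop are exactly the ones identified above as costly in a network: the full matrix-vector product and the global normalization.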
Oja's algorithm can be used for that purpose (see [120] for p = 1 and [122] for p > 1). We also refer to [57], [24], [85] for alternative approaches that solve (1.12) by semidefinite programming based on constrained optimization. In this work we introduce a distributed version of Oja's algorithm. We define Un = (un,1, ..., un,p)ᵀ as the p principal components estimated at time n. In Oja's algorithm [122], the estimate sequence Un is generated by:

Un = Un−1 + γn ( Mn Un−1 − Un−1 (Un−1ᵀ Mn Un−1) ) .   (1.15)

Note that (1.15) boils down to a Robbins-Monro algorithm [139]. Due to possible instabilities of the algorithm, several variants have also been proposed, which either introduce a normalization or a projection step (see [29]). Distributed variants of the algorithm have been investigated, often in specific contexts such as user profiling [147] or signal estimation (or detection) in WSN [102]. Both works have two common features: they propose a distributed version of (1.15) and they include average consensus iterates of the form [31] in their algorithm to compute some of the terms in (1.15) distributedly. Indeed, these approaches require two time scales, i.e. the iteration index n used to update Un and another time index corresponding to the number of consensus cycles of the form (1.5). In particular, the authors of [147] address a machine learning context where the observations M correspond to a large matrix containing (binary) user taste ratings of some products. Under the assumption that M is a low-rank matrix, the aim is to estimate the profile vector of each user. A distributed Oja algorithm is proposed to perform the spectral decomposition of a partially known dataset Mn. A normalization term is included in (1.15) to avoid stability issues. The term Mn Un−1 is computed through a fixed sparse model, i.e. each agent i observes a small set of values Mn(i, j) Un−1(j) from its neighbors j at each iteration n of the Oja update.
The introduced normalization term and (Un−1ᵀ Mn Un−1) are both computed by an average consensus run over several rounds before the Oja update Un(i) is computed at each agent i. In contrast, in [102] the goal is to find the eigendecomposition of a signal's covariance matrix M from noisy measurements received within a wireless sensor network, i.e. the received signal model is assumed to be "high energy signal + zero mean random noise". Here, finding the p principal eigenvectors of M amounts to capturing the high-energy components of the received data in order to detect and estimate the incoming signal of interest. The authors of [102] assume an estimated covariance matrix of the form Mn = n⁻¹ ∑_{t=0}^{n−1} rt rtᴴ, where (rt ∈ Cᴺ)t≥0 are the received measurements at the N sensors. Under the latter model assumptions, three terms are identified when rewriting recursion (1.15) so as to define its distributed implementation, i.e. rnᴴ Un−1, Un−1ᴴ Un−1 and Un−1ᴴ rn rnᴴ Un−1. The proposed distributed Oja algorithm introduces three average consensus steps for the latter terms, each involving several rounds. At the end of this step, each sensor node is able to update Un(i). In contrast to [147], [102], we propose a distributed Oja algorithm to estimate the principal components of M in a general setting, without explicitly giving a model for the observations (Mn)n. In this thesis, we consider the following model. At each instant n, each node i observes some random noisy samples of the i-th row of the matrix Mn. Each node i sends and/or receives variables from other nodes j in the network, chosen at random (contrary to [147], which considers fixed links). The matrix products involved in the Oja update, i.e. Mn Un−1 and Un−1ᵀ Mn Un−1, are performed via an asynchronous communication model different from the average consensus model [31] required in [147], [102].
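The centralized recursion (1.15) on which all of these variants build can be sketched as follows for p = 1, with noisy observations Mn = M + Ξn; the matrix, noise level and step sizes are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
A = rng.normal(size=(N, N))
M = A @ A.T / N               # unknown symmetric PSD matrix to be estimated

U = rng.normal(size=N)
U /= np.linalg.norm(U)
for n in range(1, 20001):
    gamma = 1.0 / (n + 10) ** 0.7            # decreasing step size
    Xi = 0.1 * rng.normal(size=(N, N))
    Mn = M + (Xi + Xi.T) / 2                 # noisy observation M_n = M + Xi_n
    # Oja update (1.15) with p = 1: U^T M_n U is a scalar here.
    U = U + gamma * (Mn @ U - U * (U @ Mn @ U))

# The iterate aligns with the principal eigenvector of M (up to sign).
lam, vecs = np.linalg.eigh(M)
u1 = vecs[:, -1]
print(min(np.linalg.norm(U - u1), np.linalg.norm(U + u1)) < 0.1)
```

The distributed question addressed in this thesis is precisely how node i can compute its entry of Mn @ U and the scalar U @ Mn @ U without global communication.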
We define at each sensor i two random sequences yn(i) and zn(i) as unbiased estimates of ∑j M(i, j) Un−1(j) and (Un−1ᵀ Mn Un−1), respectively. Besides, we introduce a projection step at each iteration n that forces Un to remain in a compact set, in order to avoid instabilities of the sequence (Un)n. The convergence of the proposed algorithm is analyzed in the asymptotic regime where n tends to infinity. Although the implementation and the objective are different from those of (1.11), both are related through the Robbins-Monro framework. Thus, the convergence analysis of the sequence generated by the proposed algorithm involves the existence of a mean field function h(U) whose roots correspond to the underlying eigenspace of M, i.e. points U such that h(U) = 0 verify (1.12). Hence, similarly to [122], [29], the convergence analysis mainly consists in addressing: the stability of Un, the definition of h(U), and its set of roots {h(U) = 0}. Next, we investigate the application of our algorithm to self-localization in wireless sensor networks, since numerical results can be provided from real data.

Application: self-localization in wireless sensor networks

In signal processing, an interesting motivation to design a distributed, asynchronous and on-line version of Oja's algorithm (1.15) relies on its application to the localization problem in wireless sensor networks (see [57], [27], [143], [24], [92], [41]). The theory of multidimensional scaling (MDS) [95] deals with the following general problem: find an embedding configuration of N objects when only similarity/distance data are available. In particular, the method referred to as classical MDS [27, Chapter 12] considers Euclidean distances between N positions in a coordinate space of dimension p. In that case, classical MDS performs the PCA (1.12) of M defined as follows:

M = −(1/2) J⊥ D J⊥   (1.16)

where matrix D contains the squared distances and J⊥ = I − (1/N) 11ᵀ.
In the WSN context, classical MDS (also known as MDS-MAP [143]) recovers the positions of a network formed by N sensor nodes (up to a rotation/translation/reflection). We denote by zi the position of any sensor node i and by z̄ the barycenter of the network, i.e. z̄ = (1/N) ∑i zi. In the Euclidean case, the entries of D are related to the sensor node positions by:

D(i, j) = ‖zi − zj‖² .   (1.17)

Then, plugging (1.17) into (1.16) yields M = ZZᵀ, where the i-th row of matrix Z coincides with zi − z̄. Hence, the PCA problem (1.12) applied to (1.16) within the WSN context reduces to finding the factorization M = ZZᵀ with Z = U Λ^{1/2} ∈ R^{N×p} (usually p = 2 or p = 3). Each recovered node position is given by Z(i) = (√λ1 u1(i), ..., √λp up(i)). The centralized localization approach introduced by [143] (with a theoretical analysis in [85]) involves two main steps: first, obtain the squared pairwise distances D(i, j) between the sensor nodes and compute (1.16) (also referred to as double-centering); second, find the p principal components of M. In wireless sensor networks, the direct acquisition of D is not possible. Thus, distances must be estimated from other available measurements depending on the electronics of the sensor node devices, e.g. received signal strength indicator (RSSI), time-of-arrival (TOA) or angle-of-arrival (AOA) (see [126], [105]). In this thesis, we focus on RSSI-based techniques since the wireless sensor nodes considered in our experiments are issued from the FIT IoT-LAB platform [1]. We define an unbiased estimator of the squared distance based on the standard parametric Log-Normal Shadowing Model (LNSM) of [136]. Besides, the PCA step of our approach involves a distributed version of Oja's algorithm (1.15) along with an observation model that enables each sensor node to estimate the i-th row of Mn.

1.4 Thesis outline

Figure 1.1 illustrates the organization of this thesis.
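The centralized classical MDS pipeline described above — squared distances (1.17), double-centering (1.16), then PCA — can be sketched as follows (numpy, with randomly drawn illustrative positions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 2
Z_true = rng.uniform(0, 10, size=(N, p))  # unknown node positions

# Squared pairwise distances (1.17) and double-centering (1.16).
D = ((Z_true[:, None, :] - Z_true[None, :, :]) ** 2).sum(axis=2)
J_perp = np.eye(N) - np.ones((N, N)) / N
M = -0.5 * J_perp @ D @ J_perp            # equals Z Z^T with Z the centered positions

# PCA step: keep the p principal eigenpairs, Z = U Lambda^{1/2}.
lam, U = np.linalg.eigh(M)                # eigh returns ascending eigenvalues
lam, U = lam[::-1][:p], U[:, ::-1][:, :p]
Z_hat = U * np.sqrt(lam)

# Positions are recovered only up to rotation/translation/reflection,
# so we compare the reconstructed distance matrix instead of raw coordinates.
D_hat = ((Z_hat[:, None, :] - Z_hat[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(D, D_hat, atol=1e-6))
```

The thesis replaces both steps with their distributed, on-line counterparts: noisy RSSI-based estimates of D(i, j) and a distributed Oja iteration for the PCA.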
The thesis is divided into two parts according to the two different applications addressed in our work. Chapter 2 investigates the problem of distributed stochastic approximation in multi-agent systems. The algorithm under study consists of two steps: a local stochastic approximation step and a diffusion step which drives the network to a consensus. The diffusion step uses row-stochastic matrices to weight the network exchanges. As opposed to previous works, exchange matrices are not supposed to be doubly stochastic, and may also depend on the past estimate.

Figure 1.1. Scheme of the framework realized in the present thesis and relations between the chapters (Chapter 2: consensus algorithms and the gradient descent algorithm; Chapter 3: principal component analysis and Oja's algorithm; Chapter 4: localization; Appendix A: the Expectation-Maximization algorithm and parameter estimation).

We prove that non-doubly stochastic matrices generally influence the limit points of the algorithm. Nevertheless, the limit points are not affected by the choice of the matrices provided that the latter are doubly stochastic in expectation. This conclusion legitimates the use of broadcast-like diffusion protocols, which are easier to implement. Next, by means of a central limit theorem, we prove that doubly stochastic protocols perform asymptotically as well as centralized algorithms, and we quantify the degradation caused by the use of non-doubly stochastic matrices. Throughout the chapter, a special emphasis is put on the special case of distributed non-convex optimization as an illustration of our results. Chapter 3 addresses the problem of asynchronous distributed Principal Component Analysis (PCA). We provide two algorithms coping with different situations according to the underlying graph structure. A general enough framework allows us to analyze all these algorithms at the same time.
Convergence is proved with probability one under suitable assumptions, and numerical experiments illustrate the good behavior of the algorithms. The proposed framework allows us to address, in the following chapter, the problem of self-localization in Wireless Sensor Networks (WSNs). Chapter 4 considers the localization problem in wireless networks formed by fixed nodes. Each node seeks to estimate its own position based on noisy measurements of the relative distance to other nodes. We assume that sensor nodes are able to obtain RSSI (Received Signal Strength Indicator) measurements, which are related to the Euclidean distance by a Log-Normal Shadowing Model (LNSM). In a centralized batch mode, positions can be retrieved (up to a rigid transformation) by PCA on a so-called similarity matrix built from the relative distances. In this chapter, we propose a distributed on-line algorithm allowing each node to estimate its own position based on limited exchange of information in the network. Our framework encompasses the case of sporadic measurements and random link failures. We prove the consistency of our algorithm using a convergence analysis similar to that of the previous chapter. We also include a refinement step based on a consensus algorithm (Chapter 2) in order to improve the accuracy of the estimated positions. Finally, we provide numerical and experimental results from both simulated and real data. Experiments on real data are conducted on a wireless sensor network testbed. We defer to the appendix the proofs not included in Chapter 2 related to the analysis of consensus algorithms (see Appendix D). Appendix C gives additional numerical analysis on the choice of some known gossip communication protocols for consensus, and in particular for distributed optimization. We also include two of the conference papers that stem from joint works within our department.
It is worth noting that, throughout this thesis, our contributions comprise a rigorous part devoted to obtaining theoretical results and a part devoted to more specific and concrete applications arising from current research topics. Indeed, the first algorithm based on distributed stochastic approximation, designed at the beginning of this thesis, is reported in Appendix A. We introduced a novel on-line Distributed Expectation-Maximization (DEM) algorithm for latent data models including Gaussian mixtures as a special case. A second algorithm, originating from a collaboration within the machine learning and Big Data framework, is presented in Appendix B. We proposed an on-line learning gossip algorithm (OLGA) devoted to binary classification in a distributed setting.

1.5 Publications

The contributions of this work led to several results which have been presented in both international and national meetings. They are enumerated below.

Journal papers
1. G. Morral and P. Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", submitted to Elsevier journal on Signal Processing, February 2015, arXiv:1503.05298.
2. G. Morral, P. Bianchi and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", submitted to IEEE Transactions on Signal Processing journal, October 2014, arXiv:1410.6956.

International conference papers with proceedings
1. G. Morral, P. Bianchi* and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", the 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, USA, December 2014.
2. G. Morral* and N.A. Dieng, "Cooperative RSSI-based indoor localization: B-MLE and distributed stochastic approximation", the 80th IEEE Vehicular Technology Conference (VTC2014-Fall), Vancouver, Canada, September 2014.
3. G. Morral*, N.A. Dieng and P.
Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1110-1114, Florence, Italy, May 2014.
4. P. Bianchi, S. Clémençon, J. Jakubowicz and G. Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", the 1st IEEE International Conference on Big Data (BigData), pp. 6-14, Santa Clara, USA, October 2013.
5. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité" [Distributed stochastic approximation: the cost of non-double-stochasticity], the 24th National Conference on Signal and Image Processing (GRETSI), Brest, September 2013.
6. G. Morral, P. Bianchi, and J. Jakubowicz*, "Asynchronous distributed principal component analysis using stochastic approximation", the 51st Annual Conference on Decision and Control (CDC), pp. 1398-1403, Maui, Hawaii, December 2012.
7. G. Morral, P. Bianchi*, G. Fort and J. Jakubowicz, "Distributed stochastic approximation: the price of non-double stochasticity", invited paper, the 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1473-1477, California, USA, November 2012.
8. G. Morral*, P. Bianchi and J. Jakubowicz, "Gossip-based online distributed expectation maximization", the 2012 IEEE Statistical Signal Processing Workshop (SSP), pp. 305-308, Ann Arbor, USA, August 2012.

National conferences without proceedings
1. G. Morral*, "A study of distributed algorithms for stochastic approximation in wireless sensor networks", presentation of the results related to this thesis during the 4th annual Meeting of the research work granted by the Futur & Ruptures program organized by Fondation Télécom, as candidate for the Thesis Prizes 2015, March 2015.
2. P. Bianchi, S. Clémençon, J. Jakubowicz and G.
Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", poster presented at the 3rd Seminar on Digital Technologies: Scale and Complexity organized by Institut Mines-Télécom, March 2014.
3. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité" [Distributed stochastic approximation: the cost of non-double-stochasticity], poster presented at the 2nd annual Meeting of the research work granted by the Futur & Ruptures program organized by Fondation Télécom, January 2013.

Part I: Consensus algorithms

Chapter 2: Success and failure of adaptation-diffusion algorithms

The first part of this thesis is devoted to the convergence analysis of consensus algorithms in multi-agent systems. The objective of the network is to find an agreement on the estimated value when the environment is partially unknown and only local information is available at each agent. In particular, we focus on distributed algorithms based on adaptation and diffusion schemes. The general algorithm consists of two steps: a local stochastic approximation step and a diffusion step which drives the network to an agreement. The diffusion step uses row-stochastic matrices to weight the network exchanges. As opposed to previous works, exchange matrices are not supposed to be doubly stochastic, and may also depend on the past estimate. We prove that non-doubly stochastic matrices generally influence the limit points of the algorithm. Nevertheless, the limit points are not affected by the choice of the matrices provided that the latter are doubly stochastic in expectation. This conclusion legitimates the use of broadcast-like diffusion protocols, which are easier to implement. Next, by means of a central limit theorem, we prove that doubly stochastic protocols perform asymptotically as well as centralized algorithms, and we quantify the degradation caused by the use of non-doubly stochastic matrices.
Throughout this chapter, a special emphasis is put on the special case of distributed non-convex optimization as an illustration of our results. Appendix A provides an application of this case related to parameter estimation: we design a distributed version of the Expectation-Maximization algorithm for exponential families. We first introduce some useful notation that enables us to define, and later analyze, the algorithms under study in this chapter.

Notation:
- N : positive integer
- x, y, ... : column vectors in R^{dN}
- |x| : Euclidean norm of x, such that |x|² = ∑i |x(i)|²
- ‖A‖, r(A) : spectral norm and spectral radius of matrix A
- A ⊗ B : Kronecker product between matrices A and B
- 1 : N × 1 vector with all entries equal to one
- I_N : N × N identity matrix
- J : orthogonal projector onto the linear span of 1 (consensus space)
- J⊥ : projection matrix orthogonal to J, such that J⊥ = I_N − J (disagreement space)
- P, E : probability and associated expectation operators on a measurable space

2.1 Introduction

2.1.1 Context and goal

During the last thirty years, distributed stochastic approximation has been addressed using different cooperative approaches. In the so-called incremental approach (see for instance [131, 133]) a message containing an estimate of the quantity of interest iteratively travels all over the network. In this chapter we focus on another cooperative approach, based on average consensus techniques, where the estimates computed locally by each agent are combined through the network. This idea traces back to [155], where a network of processors seeks to optimize some objective function known by all agents (possibly up to some additive noise). In our context we consider a network composed of N agents, or nodes. Agents seek to find a consensus on some global parameter by means of local observations and peer-to-peer communications. The aim in this chapter is to analyze the asymptotic behavior of the following distributed algorithm. Agent i (i = 1, .
. . , N) generates an Rᵈ-valued stochastic process (θn,i)n≥0. At time n, the update is obtained in two steps:

[Local step] Node i generates a temporary iterate θ̃n,i given by

θ̃n,i = θn−1,i + γn Yn,i ,   (2.1)

where γn is a deterministic positive step size and where the Rᵈ-valued random process (Yn,i)n≥1 represents the observations made by agent i.

[Gossip step] Agent i is able to observe the values θ̃n,j of some other agents j and computes the weighted average as follows:

θn,i = ∑_{j=1}^{N} wn(i, j) θ̃n,j ,   (2.2)

where the wn(i, j)'s are scalar non-negative random coefficients such that ∑_{j=1}^{N} wn(i, j) = 1 for any i. The sequence of random matrices Wn := [wn(i, j)]_{i,j=1}^{N} represents the time-varying communication network between the nodes. One simply sets wn(i, j) = 0 whenever nodes i and j are unable to communicate at time n. The aim of this chapter is to investigate the almost sure (a.s.) convergence of this algorithm as n tends to infinity, as well as the convergence rate. Our goal is in particular to quantify the effect of the sequence of matrices (Wn)n≥1 on the convergence. The algorithm is initialized at some arbitrary Rᵈ-valued vectors θ0,1, ..., θ0,N. The random variables Wn ∈ R^{N×N} and Yn := (Yn,1ᵀ, ..., Yn,Nᵀ)ᵀ ∈ R^{dN}, n ≥ 1, are defined on the same measurable space equipped with P and E. For any n ≥ 1, define the σ-field Fn := σ(θ0, W1, ..., Wn, Y1, ..., Yn), where θ0 is the (possibly random) initial point of the algorithm. It is assumed that for any i ∈ {1, ..., N}, (θn,i)n≥0 satisfies the update equations (2.1)-(2.2); and we set θn := (θn,1ᵀ, ..., θn,Nᵀ)ᵀ.

2.1.2 Related works on distributed optimization

Many recent applications related to statistical data processing and machine learning can be handled within the framework of distributed optimization. We may refer to applications such as: network control and coordination (e.g.
target or trajectory tracking [132], [123], power and resource allocation [108], [23]), big data processing (e.g. classifier training [151], [160]) or environmental monitoring in sensor networks (e.g. parameter estimation [135], [144]). The algorithm (2.1)-(2.2) under study is not new. The idea behind the algorithm traces back to [155, 156], where a network of processors seeks to optimize some objective function known by all agents (possibly up to some additive noise). More recently, numerous works extended this kind of algorithm to more involved multi-agent scenarios, see [97, 103, 117, 87, 144, 43, 19, 21, 23, 114] as a non-exhaustive list. In this context, one seeks to minimize a sum of local private cost functions fi of the agents:

min_θ ∑_{i=1}^{N} fi(θ) ,   (2.3)

where for all i, the function fi is supposed to be unknown to any other agent j, j ≠ i. To address this question, it is assumed that

Yn,i = −∇fi(θn−1,i) + ξn,i   (2.4)

where ∇ is the gradient operator and ξn,i represents some random perturbation which possibly occurs when observing the gradient. Hence, the distributed algorithm (2.1)-(2.2) is a distributed stochastic gradient algorithm. In this chapter, we handle the case where the functions fi are not necessarily convex. Of course, in that case, there is generally no hope to ensure convergence to a minimizer of (2.3). Instead, a more realistic objective is to reach critical points of the objective function, i.e. points θ such that ∑i ∇fi(θ) = 0. In a machine learning context, fi is typically the risk of a classifier indexed by θ (for more details we refer to [107, 65, 32, 6]). The problem of finding the optimal vector quantizer is addressed in [125] by minimizing a non-convex cost function called the distortion. [125] proposes a distributed and on-line implementation of the k-means algorithm, named the competitive learning vector quantization (CLVQ) algorithm and based on stochastic approximation.
The consistency of the algorithm is proved under suitable assumptions such as row-stochastic matrices and asynchronous weights: the trajectories of the agents reach an asymptotic consensus a.s. and the corresponding agreement vector converges a.s. towards one of the random connected components of the set of critical points. [43], [144] restrict their analysis by considering a linear regression model for the observations and the case of common quadratic functions for the agents. [43] studies the mean square error performance of a distributed stochastic approximation algorithm based on a deterministic diffusion scheme; it is shown that the error variance is bounded and that convergence is achieved in the noise-free case. In [144] these results are obtained when considering, in addition, i.i.d. random noise. In the field of stochastic cooperative games, the work of [13] is focused on the a.s. convergence of bargaining processes when they are allocated in a distributed manner. The proposed algorithm iteratively generates a sequence through two steps: a combining step involving a doubly stochastic time-varying random matrix through which the agents communicate, and a local projection step onto a closed and convex set. The results are: the a.s. convergence towards zero of the nonlinear error due to the projection, and the a.s. convergence of the network towards the sought allocation. Regarding works on statistical data inference, there is a rich literature on distributed estimation and optimization algorithms, see [26], [103], [87], [38], [117], [144] as a non-exhaustive list. Among the first gossip algorithms are those considered in the treatise [18] and in [156]. The case where the gossip matrices are random and the observations are noiseless is considered in [31]. The authors of [117] solve a constrained optimization problem by also using noiseless estimates. The contributions [38] and [144] consider the framework of linear regression models.
In [134], stochastic gradient algorithms are considered in the case where the matrices (Wn)n are doubly stochastic, i.e. Wn 1 = Wnᵀ 1 = 1. This contribution assumes in addition that the gradients are bounded and considers rather stringent assumptions on the conditional variances of the observation noises. Convergence to a global minimizer is shown in [116] assuming convex utility functions and bounded (sub)gradients. The results of [116] are extended in [134] to the stochastic descent case, i.e. when the observation of the utility functions is perturbed by a random noise. More recently, [19] investigated distributed stochastic approximation at large, providing stability conditions for the algorithm (2.1)-(2.2) while relaxing the bounded gradient assumption and including the case of random communication links. In [19], it is also proved under some hypotheses that the estimation error is asymptotically normal: the convergence rate and the asymptotic covariance matrix are characterized. An enhanced averaging algorithm à la Polyak is also proposed to recover the optimal convergence rate. Note that none of the works cited above takes into account the case where (Wn)n depend on the observations (Yn)n in their convergence analysis.

Doubly and non-doubly stochastic matrices. In most works (see for instance [116, 134]), the matrices (Wn)n≥1 are assumed doubly stochastic, meaning that Wnᵀ 1 = Wn 1 = 1, where 1 is the N × 1 vector whose components are all equal to one and where ᵀ denotes transposition. Although row-stochasticity (Wn 1 = 1) is rather easy to ensure in practice, column-stochasticity (Wnᵀ 1 = 1) implies more stringent restrictions on the communication protocol. For instance, in [31], each one-way transmission from an agent i to another agent j requires at the same time a feedback link from j to i. As a matter of fact, double stochasticity prevents from using
natural broadcast schemes, in which a given node may transmit its local estimate to all neighbors without expecting any immediate feedback. Remarkably, although generally assumed, double stochasticity of the matrices Wn is in fact not mandatory. A couple of works (see e.g. [112, 19]) get rid of the column-stochasticity condition, but at the price of assumptions that may not always be satisfied in practice. Other works ([114, 153]) manage to circumvent the use of feedback links by coupling the gradient descent with the so-called push-sum protocol [89]. The latter however introduces an additional communication of weights in the network in order to keep track of some summary of the past transmissions. As a consequence, we address the following questions: What conditions on the sequence (Wn)n≥1 are needed to ensure that Algorithm (2.1)-(2.2) drives all agents to a common critical point of ∑i fi? What happens if these conditions are not satisfied? How is the convergence rate influenced by the communication protocol?

2.1.3 Contributions

We provide the following contributions, which answer the previous questions in both a qualitative and a quantitative manner.

• Assuming that (Wn)n≥1 forms an i.i.d. sequence of stochastic matrices, we prove under some technical hypotheses that Algorithm (2.1)-(2.2) leads the agents to a consensus, which is characterized. It is shown that the latter consensus does not necessarily coincide with a critical point of ∑i fi. We also provide an augmented algorithm which allows one to recover the sought points.

• We provide sufficient conditions, either on the communication protocol (Wn)n≥1 or on the functions fi, which ensure that the limit points are the critical points of ∑i fi. When such conditions are not satisfied, we also propose a simple modification of the algorithm which allows one to recover the sought behavior.
• We extend our results to a broader setting, assuming that the matrices (Wn)n≥1 are no longer i.i.d., but are allowed to depend on both the current observations and the past estimates. We also investigate a general stochastic approximation framework which goes beyond the model (2.4) and beyond the sole problem of distributed optimization.

• We characterize the convergence rate of the algorithm in the form of a central limit theorem. Unlike [19], we address the case where the sequence (Wn)n≥1 is not necessarily doubly stochastic. We show that non-doubly stochastic matrices have an influence on the asymptotic error covariance (even if they are doubly stochastic in average). On the other hand, we prove that when the matrix Wn is doubly stochastic for all n, the asymptotic covariance is identical to the one obtained in a centralized setting.

The chapter is organized as follows: Section 2.2 is a gentle presentation of our results in the special case of distributed optimization (see (2.3)), assuming in addition that the sequence (Wn) is independent and identically distributed (i.i.d.). In Section 2.3 we provide the general setting for the almost sure convergence, which is studied in Section 2.4.1. Section 2.5 investigates convergence rates. Conclusions and numerical results in Section 2.7 and Appendix C complete the discussion on the topic. Proofs are given in a small part in Section D.1 but are mostly deferred to Appendix D.

2.2 Distributed optimization

We first sketch our result in the special case of distributed optimization, i.e., when the "innovation" Yn,i of the algorithm in (2.1) has the form (2.4). For simplicity, the matrix-valued process Wn will be assumed i.i.d. and independent of both processes Yn and θn. This assumption will be relaxed in Section 2.3.

2.2.1 Framework

In this section, we consider the case when Yn,i satisfies (2.4), with the following assumptions.

Assumption 2.1.
1.
fi : Rd → R is differentiable and ∇fi is locally Lipschitz continuous.
2. For any Borel set A of RdN, P[ξn+1 ∈ A | Fn] = νθn(A) almost surely (a.s.), where (νθ)θ∈RdN is a family of probability measures such that
(a) ∫ z dνθ(z) = 0 and,
(b) supθ∈K ∫ |z|² dνθ(z) < ∞ for any compact set K ⊂ RdN.

Assumption 2.2.
1. For any n ≥ 0, conditionally to Fn, Yn+1 and Wn+1 are independent.
2. (Wn)n≥1 is an i.i.d. sequence of row-stochastic matrices (i.e., Wn1 = 1 for any n) with non-negative entries.
3. The spectral radius of the matrix E[W1T J⊥ W1] is strictly lower than 1.

The row-stochasticity assumption is a rather mild condition. It states that Σj wn(i, j) = 1 for any i, i.e., each node i computes a weighted average of the temporary updates of the nodes (with possibly some null weights). In many works, it is usually also assumed that Wn is column-stochastic, i.e., Σi wn(i, j) = 1 for any j. Our weaker framework addresses more general gossip protocols, usually less demanding in terms of scheduling and overall network coordination. Assumption 2.2-3) is a contraction condition which is required to drive the network to a consensus.

Assumption 2.3. The deterministic step-size sequence (γn)n≥1 satisfies γn > 0 and:
1. limn γn+1/γn = 1,
2. Σn γn = +∞ and Σn γn^(1+λ) < ∞ for some λ ∈ (0, 1),
3. Σn |γn − γn−1| < ∞.

Polynomially decreasing sequences γn ∼ γ⋆/n^a as n → ∞, for some a ∈ (1/2, 1] and γ⋆ > 0, satisfy Assumption B.1. Finally, we introduce a stability-like condition.

Assumption 2.4. Almost surely, there exists a compact set K of RdN such that θn ∈ K for any n ≥ 0.

Assumption 2.4 states that the sequence (θn)n≥0 remains in a compact set, where this compact set may depend on the path. It is implied by the stronger assumption "there exists a compact set K of RdN such that with probability one, θn ∈ K for any n ≥ 0". Checking Assumption 2.4 is not always an easy task.
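Before moving on, note that the step-size conditions of Assumption 2.3 are easy to probe numerically for the polynomial sequences just mentioned. The sketch below uses illustrative values a = 0.7, γ⋆ = 0.1 and λ = 0.6 (so that a(1 + λ) = 1.12 > 1, making Σ γn^(1+λ) a convergent p-series); it is a sanity check, not a proof.

```python
# Numerical probe (not a proof) of Assumption 2.3 for gamma_n = gamma_star / n**a.
# a = 0.7, gamma_star = 0.1 and lambda = 0.6 are illustrative choices.
def gamma(n, gamma_star=0.1, a=0.7):
    return gamma_star / n ** a

# 1) gamma_{n+1} / gamma_n -> 1
assert abs(gamma(10**6 + 1) / gamma(10**6) - 1.0) < 1e-5

# 2) sum gamma_n diverges: far-out blocks still contribute substantial mass,
#    while the tail of sum gamma_n**(1 + 0.6) is already negligible
block = sum(gamma(n) for n in range(20001, 40001))
assert block > 1.0
head = sum(gamma(n) ** 1.6 for n in range(1, 10001))
tail = sum(gamma(n) ** 1.6 for n in range(10001, 20001))
assert tail < 0.1 * head

# 3) sum |gamma_n - gamma_{n-1}| telescopes (monotone sequence), hence is finite
variation = sum(abs(gamma(n) - gamma(n - 1)) for n in range(2, 10001))
assert abs(variation - (gamma(1) - gamma(10000))) < 1e-9
```

The third check exploits monotonicity: for a decreasing sequence the total variation equals γ1 minus the last term, so it stays bounded.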
As the main scope of this paper is the analysis of convergence rather than stability, it is taken for granted: we refer to [19] for sufficient conditions implying stability.

2.2.2 Results

The statement of our convergence result is prefaced with the following lemma, which shows that the matrix W := E[W1] admits a unique left Perron eigenvector v; this vector will play a role in the characterization of the limiting points of the algorithm (2.1)-(2.2).

Lemma 2.1. Under Assumptions 2.2-2) and 2.2-3), the RN-valued vector v defined by vT := (1/N) 1T W (IN − J⊥ W)−1 is the unique non-negative vector satisfying vT = vT W and vT 1 = 1.

Proof. By Jensen's inequality, for any x ∈ RN, xT WT J⊥ W x ≤ xT E[W1T J⊥ W1] x. By Assumption 2.2-3), the spectral norm of J⊥ W is therefore strictly lower than one, and IN − J⊥ W is invertible. We prove the two following statements: (i) any vector w satisfying wT 1 = 1 and wT W = wT necessarily satisfies wT = (1/N) 1T W (IN − J⊥ W)−1; (ii) conversely, the vector w defined by wT = (1/N) 1T W (IN − J⊥ W)−1 satisfies wT 1 = 1 and wT W = wT. Together, the two statements imply that w = v. Moreover, since W is a stochastic matrix, its spectral radius is one and, by [81], there exists a non-negative vector w such that wT W = wT and 1T w > 0; upon normalization, statement (i) shows that this vector is v, so that v is indeed non-negative.

Let us start with the first statement. Let w ∈ RN satisfy wT 1 = 1 and wT W = wT. Then

wT = wT W = wT W − wT 1 (1T/N) W + (1T/N) W = wT J⊥ W + (1/N) 1T W ,

which yields wT (IN − J⊥ W) = (1/N) 1T W, and thus wT = (1/N) 1T W (IN − J⊥ W)−1 = vT.

We now prove the second statement: let wT := (1/N) 1T W (IN − J⊥ W)−1. This definition implies that wT (IN − J⊥ W) = (1/N) 1T W.
By multiplying this equality on the right by the vector 1, using the row-stochasticity of W (W1 = 1) and the fact that J⊥1 = 0, it follows that wT 1 = (1/N) 1T W 1 = 1. In the same way, one could apply the matrix inversion lemma (see [81]) twice to show directly that (IN − J⊥ W)−1 1 = 1, as follows:

(IN − J⊥ W)−1 1 = (IN + J⊥ (IN − W J⊥)−1 W) 1 = 1 + J⊥ (IN − W J⊥)−1 1 = 1 + J⊥ (IN + W (IN − J⊥ W)−1 J⊥) 1 = 1 ,

where the last equality uses J⊥ 1 = 0. Secondly, we verify that wT W = wT. Using the equality wT (IN − J⊥ W) = (1/N) 1T W together with the condition wT 1 = 1 just verified,

wT W = wT W − wT 1 (1T/N) W + (1/N) 1T W = wT J⊥ W + wT (IN − J⊥ W) = wT .

This proves the second statement and concludes the proof.

If A is a set, we say that (xn)n converges to A if inf{|xn − y| : y ∈ A} tends to zero as n → ∞.

Theorem 2.1. Let Assumptions 2.1, 2.2, B.1 and 2.4 hold true. Define the function V : Rd → R by

V(θ) := Σ_{i=1}^N vi fi(θ)    (2.5)

where v = (v1, ..., vN) is the vector defined in Lemma 2.1. Assume that the set L = {θ ∈ Rd | ∇V(θ) = 0} of critical points of V is non-empty and included in some level set {θ : V(θ) ≤ C}, and that V(L) has an empty interior. Assume also that the level sets {θ : V(θ) ≤ C} are either empty or compact. The following holds with probability one:
1. The algorithm converges to a consensus, i.e., limn→∞ maxi,j |θn,i − θn,j| = 0.
2. The sequence (θn,1)n≥0 converges to L as n → ∞.

Theorem 2.1 is proved in Appendix D.1. Its proof consists in showing that it is a special case of the more general convergence result given by Theorem B.1.

2.2.3 Success and failure of convergence

The algorithm converges to L, which in general is not the set of critical points of θ ↦ Σi fi(θ). We discuss some special cases where both sets actually coincide.

Scenario 1.
All functions fi are strictly convex and admit a (unique) common minimizer θ⋆. This case is for instance investigated by [43] in the framework of statistical estimation in wireless sensor networks. In this scenario, we may assume without loss of generality that fi(θ) ≥ fi(θ⋆) = 0 for all i (note that Algorithm (2.1)-(2.2) is not modified when fi is translated). Since vi ≥ 0, V is a non-negative strictly convex function such that V(θ⋆) = 0. Therefore, the set of minimizers of V is {θ⋆}. On the other hand, since V is convex, L is the set of minimizers of V. This implies that the set L is formed by the minimizers of Σi fi. Relaxing strict convexity, note that when the functions fi are just convex with a common minimizer and vi > 0 for any i, then L is again formed by the minimizers of Σi fi, and the same conclusion holds.

Scenario 2. W is column-stochastic, i.e., 1T W = 1T.

In this case, the vector v given by Lemma 2.1 is (1/N)1. Consequently, V = (1/N) Σi fi. Here again, L is the set of minimizers of Σi fi. An example of a random communication protocol satisfying 1T W = 1T is the following: at time n, a single node i wakes up at random with probability pi and broadcasts its temporary update θ̃n,i to all its neighbors Ni. Any neighbor j computes the weighted average θn,j = β θ̃n,i + (1 − β) θ̃n,j. On the other hand, any node k which does not belong to the neighborhood of i (including i itself) sets θn,k = θ̃n,k. Then, given that i wakes up, the (k, ℓ)-th entry of Wn is given by:

wn(k, ℓ) =
  1       if k ∉ Ni and k = ℓ ,
  β       if k ∈ Ni and ℓ = i ,
  1 − β   if k ∈ Ni and k = ℓ ,
  0       otherwise.

Here, Wn is not doubly stochastic. However, when nodes wake up according to the uniform distribution (pi = 1/N for all i), it is easily seen that 1T E[Wn] = 1T.

2.2.4 Enhanced algorithm with weighted step sizes

We end this section with a simple modification of the initial algorithm in the case where vi > 0 for all i.
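As a side illustration of Scenario 2 above, the broadcast gossip matrices are easy to instantiate. The sketch below uses illustrative choices (a 5-node ring graph and β = 1/2) and checks that each realization Wn is row- but not column-stochastic, while E[Wn] under uniform wake-up probabilities is doubly stochastic.

```python
import numpy as np

# Broadcast gossip of Scenario 2: node i wakes up and broadcasts; each
# neighbour j averages theta_j <- beta*theta_i + (1-beta)*theta_j.
# The 5-node ring graph and beta = 1/2 are illustrative choices.
N, beta = 5, 0.5
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

def broadcast_matrix(i):
    W = np.eye(N)
    for k in neighbors[i]:
        W[k, k] = 1 - beta   # neighbour keeps (1-beta) of its own update
        W[k, i] = beta       # and takes beta of the broadcast value
    return W

Ws = [broadcast_matrix(i) for i in range(N)]
W_mean = sum(Ws) / N                          # uniform wake-up: p_i = 1/N

for W in Ws:
    assert np.allclose(W.sum(axis=1), 1)      # row-stochastic...
    assert not np.allclose(W.sum(axis=0), 1)  # ...but not column-stochastic
assert np.allclose(W_mean.sum(axis=0), 1)     # E[W_n] is doubly stochastic
```

The last assertion is the key point of Scenario 2: each realization fails column-stochasticity, yet the protocol is doubly stochastic in average.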
Let us replace the local step (2.1) of the algorithm by

θ̃n,i := θn−1,i + γn vi−1 Yn,i    (2.6)

where Yn,i is still given by (2.4). As an immediate corollary of Theorem 2.1, the algorithm (2.6)-(2.2) drives the agents to a consensus located in the set of critical points of Σi fi. Of course, this modification requires each node i to have some prior knowledge of the communication protocol through the coefficient vi (in that case, questions related to a distributed computation of the vi's would be of interest, but are beyond the scope of this paper).

2.3 Distributed Robbins-Monro algorithm: general setting

In this section, we consider the general setting described by Algorithm (2.1)-(2.2) with weaker conditions on the distribution of the observations Yn. We also weaken the assumptions on the conditional distribution of (Yn+1, Wn+1) given the past behavior of the algorithm Fn: our general framework includes the case when the communication protocol is adapted at each time n and takes into account the network observations. We denote by M1 the set of N × N non-negative row-stochastic matrices and we endow M1 with its Borel σ-field.

Assumption 2.5.
1. There exists a collection of distributions (µθ)θ∈RdN on RdN × M1 such that almost surely, for any Borel set A,

P[(Yn+1, Wn+1) ∈ A | Fn] = µθn(A) .

In addition, the application θ ↦ µθ(A) defined on RdN is measurable for any A in the Borel σ-field of RdN × M1.
2. For any compact set K ⊂ RdN, supθ∈K ∫ |y|² dµθ(y, w) < ∞.

Assumption 2.5-1) means that the joint distribution of the r.v.'s Yn+1 and Wn+1 depends on the past Fn only through the last value θn of the vector of estimates. It also implies that Wn is almost surely (a.s.) non-negative and row-stochastic.
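In this setting the update (2.1)-(2.2) can be simulated directly in its stacked form θn = Wn(θn−1 + γn Yn) (this matrix form is made precise in Section 2.4). The following minimal sketch, with d = 1 and illustrative random data, performs one step and illustrates the decomposition of θn into its average and disagreement components.

```python
import numpy as np

# One step of Algorithm (2.1)-(2.2) in stacked form, d = 1 for readability,
# followed by the decomposition theta_n = 1*<theta_n> + J_perp theta_n.
# W, theta, gamma and Y are illustrative stand-ins.
rng = np.random.default_rng(1)
N = 4
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)          # row-stochastic: W 1 = 1
theta = rng.random(N)
gamma, Y = 0.05, rng.standard_normal(N)

theta = W @ (theta + gamma * Y)            # stacked update

avg = theta.mean()                         # <theta_n>, the average vector
J_perp = np.eye(N) - np.ones((N, N)) / N   # projector orthogonal to span(1)
disagreement = J_perp @ theta              # disagreement vector

# the two components recombine exactly, and the disagreement sums to zero
assert np.allclose(np.ones(N) * avg + disagreement, theta)
assert np.isclose(disagreement.sum(), 0.0)
```

The convergence analysis of Section 2.4 studies precisely these two components: the disagreement vector vanishes (consensus) while the average follows a stochastic approximation scheme.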
Since the variables (Yn+1, Wn+1) are not necessarily independent conditionally to the past Fn, and the (Wn)n≥1 are no longer i.i.d., the contraction condition on J⊥W1 is replaced with the following condition:

Assumption 2.6. For any compact set K ⊂ RdN, there exists ρK ∈ (0, 1) such that for all θ ∈ K, φ ∈ RdN and any dN × dN matrix A,

∫ (φ + Ay)T (w ⊗ Id)T J⊥ (w ⊗ Id)(φ + Ay) dµθ(y, w) ≤ ρK ∫ |φ + Ay|² dµθ(y, w) .

We provide some insight on the above condition. Assumption 2.6 is satisfied as soon as the spectral radius r(E[W1T J⊥ W1 | θ0, Y1]) is upper bounded, when θ0 ∈ K, by a constant independent of (θ0, Y1) and strictly lower than one. When (Wn)n≥1 is an i.i.d. sequence, independent of the sequence (Yn)n≥1 and of θ0, the above condition reduces to r(E[W1T J⊥ W1]) < 1.

2.4 Convergence analysis

For any vector x ∈ RdN of the form x = (x1T, ..., xNT)T where xi ∈ Rd, we define the vector of Rd

⟨x⟩ := (x1 + · · · + xN)/N = (1/N)(1T ⊗ Id) x .    (2.7)

We extend the notation to matrices X ∈ RdN×k as ⟨X⟩ = (1/N)(1T ⊗ Id)X. We also define J := J ⊗ Id and J⊥ := J⊥ ⊗ Id. Note that Jx = 1 ⊗ ⟨x⟩. Algorithm (2.1)-(2.2) can be written in matrix form as:

θn = Wn(θn−1 + γn Yn)  where  Wn = Wn ⊗ Id .    (2.8)

We decompose the estimate vector θn into two components, θn = 1 ⊗ ⟨θn⟩ + J⊥θn. In Section 2.4.1, we analyze the asymptotic behavior of the disagreement vector J⊥θn. The study of the average vector ⟨θn⟩ will be addressed in Section 2.4.2. These two sections are prefaced by a result which establishes the dynamics of these sequences. Set

φn := (1/γn+1) J⊥θn    (2.9)
αn := γn/γn+1 .    (2.10)

Lemma 2.2. Let (θn)n≥0 be the sequence given by (2.8). Assume that the (Wn)n≥0 are row-stochastic matrices. It holds that

⟨θn⟩ = ⟨θn−1⟩ + γn ⟨Wn(Yn + φn−1)⟩ ,    (2.11)
φn = αn J⊥ Wn(φn−1 + Yn) .    (2.12)

Proof. Since Wn is row-stochastic, JWnJ = J. Hence, Jθn = Jθn−1 + JWnJ⊥θn−1 + γn JWnYn.
It follows that Jθn = Jθn−1 + γn JWn(Yn + (1/γn) J⊥θn−1), which directly gives (2.11). By projecting θn onto the disagreement subspace, one has J⊥θn = J⊥Wn(θn−1 + γnYn). Since Wn is row-stochastic, J⊥Wn = J⊥WnJ⊥. Then, (2.12) follows.

2.4.1 Disagreement vector

We first begin with a technical lemma, proved in Appendix D.2.

Lemma 2.3. Let Assumptions B.1-1), 2.5 and 2.6 hold. Let (φn)n≥0 be the sequence given by (2.9). For any compact set K ⊂ RdN,

supn E[|φn|² 1{θj ∈ K, ∀j ≤ n−1}] < ∞ .

This lemma implies that for any compact set, there exists C such that for any n ≥ 0, E[|J⊥θn|² 1{θk ∈ Km, ∀k}] ≤ C γn+1².

Proposition 2.1 (Agreement). Let Assumptions B.1-1), B.1-2), 2.4, 2.5 and 2.6 hold. Then, almost surely, limn→∞ J⊥θn = 0.

Proof. Let (Km)m≥0 be an increasing sequence of compact subsets of RdN such that ∪m Km = RdN. Under Assumption 2.4, it is equivalent to prove that for any m ≥ 0,

limn J⊥θn 1{θk ∈ Km, ∀k} = 0 a.s.

Let m ≥ 0. Lemma 2.3 implies that there exists a constant C such that E[|J⊥θn|² 1{θk ∈ Km, ∀k}] ≤ C γn+1² for any n. By Assumption B.1-2), this implies that Σn E[|J⊥θn|² 1{θk ∈ Km, ∀k}] is finite; hence Σn |J⊥θn|² 1{θk ∈ Km, ∀k} is finite a.s., which yields limn |J⊥θn|² 1{θk ∈ Km, ∀k} = 0 a.s.

2.4.2 Average vector

We now study the long-time behavior of the average estimate ⟨θn⟩. Define for any θ ∈ RdN:

Wθ := ∫ (w ⊗ Id) dµθ(y, w) ,    (2.13)
zθ := ∫ (w ⊗ Id) y dµθ(y, w) ,    (2.14)

and let us assume the following regularity-in-θ properties of these quantities.

Assumption 2.7.
There exist λµ ∈ (1/2, 1] and, for any compact set K ⊂ RdN, a constant C > 0 such that for any θ, θ′ ∈ K,

‖Wθ − Wθ′‖ ≤ C |θ − θ′|^λµ ,    (2.15)
|Jzθ − JzJθ| ≤ C |J⊥θ|^λµ ,    (2.16)
|J⊥zθ − J⊥zθ′| ≤ C |θ − θ′|^λµ .    (2.17)

From (2.11) and Assumption 2.5-1), we have

⟨θn⟩ = ⟨θn−1⟩ + γn ⟨Wn(Yn + φn−1)⟩ = ⟨θn−1⟩ + γn E[⟨Wn(Yn + φn−1)⟩ | Fn−1] + γn Ξn

where (Ξn)n is a martingale-increment term and

E[⟨Wn(Yn + φn−1)⟩ | Fn−1] = ⟨zθn−1 + Wθn−1 φn−1⟩ .

Since limn (θn − 1 ⊗ ⟨θn⟩) = 0 almost surely, Assumption 2.7 implies that, roughly speaking,

⟨zθn−1 + Wθn−1 φn−1⟩ ≈ ⟨z1⊗⟨θn−1⟩ + W1⊗⟨θn−1⟩ φn−1⟩ .

In addition, from (2.12) and Assumption 2.5-1) again, the conditional distribution of φn given the past coincides with its conditional distribution given φn−1, and is of the form Pαn,θn−1(φn−1, ·), where Pα,θ is a Markov transition kernel (see Appendix D.3 for an explicit expression of this transition kernel). Each kernel Pα,θ possesses an invariant distribution πα,θ and is ergodic. Therefore, it is natural to define the mean field function h : Rd → Rd by

h(ϑ) = ⟨z1⊗ϑ + W1⊗ϑ m^(1)_{1⊗ϑ}⟩    (2.18)

where m^(1)_{1⊗ϑ} is the expectation of the invariant distribution π1,1⊗ϑ, given by (see Proposition D.2 in Appendix D.3)

m^(1)_θ := (IdN − J⊥Wθ)−1 J⊥zθ .

Note that under Assumption 2.6 this quantity is well defined since, for any compact K ⊂ RdN, supθ∈K ‖J⊥Wθ‖ ≤ √ρK. We establish the limiting behavior of the average sequence (⟨θn⟩)n by verifying the sufficient conditions for the convergence of stochastic approximation schemes given in [9, Theorems 2.2 and 2.3]. To that goal, we assume that there exists a Lyapunov function V for the mean field h.

Assumption 2.8.
1. h : Rd → Rd is continuous.
2. There exists a continuously differentiable function V : Rd → R+ such that
(a) there exists M > 0 such that L = {ϑ ∈ Rd : ∇V(ϑ)T h(ϑ) = 0} ⊂ {V ≤ M}.
In addition, V(L) has an empty interior;
(b) there exists M′ > M such that {V ≤ M′} is a compact subset of Rd;
(c) for any ϑ ∈ Rd \ L, ∇V(ϑ)T h(ϑ) < 0.

Assumptions 2.5, 2.6 and 2.7 imply that ϑ ↦ m^(1)_{1⊗ϑ} is continuous on Rd (see Proposition D.3 in Appendix D.3). Therefore, a sufficient condition for Assumption 2.8-1) is to strengthen the conditions (2.16)-(2.17) of Assumption 2.7 as follows: |zθ − zθ′| ≤ C |θ − θ′|^λµ.

Proposition 2.2. Let Assumptions B.1, 2.4, 2.5, 2.6, 2.7 and 2.8 hold true. Assume in addition that λ ≤ λµ, where λ and λµ are respectively given by Assumptions B.1 and 2.7. The average sequence (⟨θn⟩)n converges almost surely to a connected component of L.

The proof of Proposition 2.2 is given in Appendix D.4. It consists in verifying the assumptions of [9, Theorem 2].

2.4.3 Main convergence result

As a direct consequence of Propositions 2.1 and 2.2, we have:

Theorem 2.2. Let Assumptions B.1, 2.4, 2.5, 2.6, 2.7 and 2.8 hold true. Assume in addition that λ ≤ λµ, where λ and λµ are respectively given by Assumptions B.1 and 2.7. The following holds with probability one:
1. The algorithm converges to a consensus, i.e., limn→∞ J⊥θn = 0;
2. θn,1 converges to a connected component of L.

2.5 Convergence rate

2.5.1 Assumptions

We derive the rate of convergence of the sequence (θn)n≥0 to 1 ⊗ θ⋆ for some θ⋆ satisfying:

Assumption 2.9. θ⋆ is a root of h, i.e., h(θ⋆) = 0. Moreover, h is twice continuously differentiable in a neighborhood of θ⋆. The Jacobian ∇h(θ⋆) is a Hurwitz matrix. Denote by −L, L > 0, the largest real part of its eigenvalues.

The moment conditions on the conditional distributions of the observations Yn and the contraction assumption on the network have to be strengthened as follows:

Assumption 2.10. There exists τ ∈ (0, 2) such that for any compact set K ⊂ RdN,

supθ∈K ∫ |y|^(2+τ) dµθ(y, w) < ∞ .

Assumption 2.11. Let τ be given by Assumption 2.10.
For any compact set K ⊂ RdN, there exists ρ̃K ∈ (0, 1) such that for any φ ∈ RdN,

supθ∈K ∫ |((J⊥w) ⊗ Id) φ|^(2+τ) dµθ(y, w) ≤ ρ̃K |φ|^(2+τ) .

We also go further in the regularity-in-θ of the integrals with respect to µθ. More precisely:

Assumption 2.12. There exist λµ ∈ (1/2, 1] and, for any compact set K ⊂ RdN, a constant C such that
1. for any θ, θ′ ∈ K, |⟨zθ⟩ − ⟨zθ′⟩| ≤ C |θ − θ′|^λµ.
2. Set QA(x, y, w) := (x + y)T (w ⊗ Id)T J⊥ A J⊥ (w ⊗ Id)(x + y) for a dN × dN matrix A. For any θ, θ′ ∈ K, x ∈ RdN and any matrix A such that ‖A‖ ≤ 1,

|∫ QA(x, y, w) dµθ(y, w) − ∫ QA(x, y, w) dµθ′(y, w)| ≤ C |θ − θ′|^λµ (1 + |x|²) .

We finally have to strengthen the conditions on the step-size sequence.

Assumption 2.13. Let τ (resp. λµ) be given by Assumption 2.10 (resp. Assumption 2.12). As n → ∞, γn ∼ γ⋆/n^a for some a ∈ ((1 + λµ)−1 ∨ (1 + τ/2)−1, 1] and γ⋆ > 0. In addition, if a = 1 then γ⋆ > 1/(2L), where L is given by Assumption 2.9.

2.5.2 Main result

Define m⋆^(1) := (IdN − J⊥W1⊗θ⋆)−1 J⊥z1⊗θ⋆ and m⋆^(2) := (Id²N² − Φ⋆)−1 ζ⋆, where zθ is defined in (2.14) and

Φ⋆ := ∫ T(w) dµ1⊗θ⋆(y, w) ,
ζ⋆ := ∫ T(w) vec(yyT + 2 m⋆^(1) yT) dµ1⊗θ⋆(y, w) ,

and where we used the notation T(w) := ((J⊥w) ⊗ Id) ⊗ ((J⊥w) ⊗ Id). As will be seen in the proofs, m⋆^(1) and m⋆^(2) represent the asymptotic first order moment and (vectorized) second order moment of the r.v. φn defined by (2.9). Define also R⋆(w) := (w ⊗ Id) − W1⊗θ⋆ and υ⋆(y, w) := (w ⊗ Id)y − z1⊗θ⋆. Finally, define

A⋆ := (1/N)(1T ⊗ Id)(IdN + W1⊗θ⋆ (IdN − J⊥W1⊗θ⋆)−1 J⊥) ,
R⋆ := ∫ (R⋆(w) ⊗ R⋆(w)) dµ1⊗θ⋆(y, w) ,
T⋆ := ∫ (υ⋆(y, w) ⊗ R⋆(w)) dµ1⊗θ⋆(y, w) ,
S⋆ := ∫ vec(υ⋆(y, w) υ⋆(y, w)T) dµ1⊗θ⋆(y, w) .

We establish in Section D.5 the following result.

Theorem 2.3. Let Assumptions 2.5-1), 2.6, 2.7 and 2.9 to 2.13 hold true. Let U⋆ be the positive-definite matrix given by

vec U⋆ = (A⋆ ⊗ A⋆)(R⋆ m⋆^(2) + 2 T⋆ m⋆^(1) + S⋆) .

Then, conditionally to the event {limn θn = 1 ⊗ θ⋆}, the sequence (γn^(−1/2)(⟨θn⟩ − θ⋆))n≥0 converges in distribution to a zero-mean Gaussian distribution with covariance matrix V, where V is the unique positive-definite matrix satisfying

V ∇h(θ⋆)T + ∇h(θ⋆) V = −U⋆    if a < 1,
V (I + 2γ⋆∇h(θ⋆))T + (I + 2γ⋆∇h(θ⋆)) V = −2γ⋆ U⋆    if a = 1.

2.5.3 A special case: doubly stochastic matrices

In this paragraph, let us investigate the special case when the (Wn)n are N × N doubly stochastic matrices. Note that in this case, (2.11) becomes ⟨θn⟩ = ⟨θn−1⟩ + γn⟨Yn⟩ and the mean field function h is equal to h(ϑ) = ∫ ⟨y⟩ dµ1⊗ϑ(y, w). Along the event {limn θn = 1 ⊗ θ⋆}, it is therefore expected that U⋆ equals the covariance matrix of ⟨Y⟩ when Y ∼ µ1⊗θ⋆ (see e.g. [61, Theorem 2.2.12]). This is exactly what can be retrieved from Theorem 2.3, as shown below. Since Wn is column-stochastic, ∫ w dµ1⊗θ⋆(y, w) is column-stochastic, and we have A⋆ = (1/N)(1T ⊗ Id). Then, it is not difficult to check that A⋆R⋆(w) = 0, which implies that R⋆ = T⋆ = 0. Therefore vec U⋆ = (A⋆ ⊗ A⋆)S⋆, i.e., U⋆ = ∫ ⟨υ⋆(y, w)⟩⟨υ⋆(y, w)⟩T dµ1⊗θ⋆(y, w). This yields the following corollary.

Corollary 2.1. In addition to the assumptions of Theorem 2.3, assume that the (Wn)n are N × N doubly stochastic matrices. Then the matrix U⋆ is given by

U⋆ = ∫ ⟨y − ȳ⋆⟩⟨y − ȳ⋆⟩T dµ1⊗θ⋆(y, w)

where ȳ⋆ = ∫ y dµ1⊗θ⋆(y, w).

2.6 Concluding remarks

In this paragraph, we informally draw some general conclusions from our study. We assimilate the communication protocol to the selection of the sequence Wn, which we assume i.i.d. in this paragraph for simplicity. We say that a protocol is doubly stochastic if Wn is doubly stochastic for each n. We say that a protocol is doubly stochastic in average if E[Wn] is doubly stochastic for each n.

1. Consensus is fast. Theorem 2.3 states that the average estimation error converges to zero at rate √γn.
This result was actually expected, as √γn is the well-known convergence rate of standard stochastic approximation algorithms. On the other hand, Lemma 2.3 suggests that the disagreement vector J⊥θn goes to zero at rate γn, that is, one order of magnitude faster. Asymptotically, the fluctuations of the normalized estimation error (θn − 1 ⊗ θ⋆)/√γn are fully supported by the consensus space. This remark also suggests analyzing non-stationary communication protocols, for which the number of transmissions per unit of time decreases with n. This problem is addressed in [19].

2. Non-doubly stochastic protocols generally influence the limit points. This issue is discussed in Section 2.2.3. The choice of the matrices Wn is likely to have an impact on the set of limit points of the algorithm. This may be inconvenient, especially in distributed optimization tasks.

3. Protocols that are doubly stochastic "in average" all lead to the same limit points. In the framework of distributed optimization, this set of limit points precisely coincides with the sought critical points of the minimization problem. It means that non-doubly stochastic protocols can be used, provided that they are doubly stochastic in average.

4. Asymptotically, doubly stochastic protocols perform as well as a centralized algorithm. By Corollary 2.1, if Wn is chosen to be doubly stochastic for all n, the asymptotic error covariance characterized in Theorem 2.3 does not depend on the specific choice of Wn. In distributed optimization, the asymptotic performance is identical to the performance that would have been obtained by replacing Wn with the orthogonal projector J, which would lead to the centralized update ⟨θn⟩ = ⟨θn−1⟩ + (γn/N) Σ_{i=1}^N Yn,i. By contrast, protocols that are not doubly stochastic generally influence the asymptotic error covariance, even if they are doubly stochastic in average.
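For completeness, the covariance V of Theorem 2.3 (case a < 1) can be computed numerically from U⋆ and ∇h(θ⋆) by vectorizing the Lyapunov equation. In the sketch below, H and U are illustrative stand-ins (with H symmetric negative definite for simplicity, which makes it Hurwitz), not the quantities of this chapter.

```python
import numpy as np

# Solve V H^T + H V = -U (the a < 1 case of Theorem 2.3) by vectorization:
# (H (x) I + I (x) H) vec(V) = -vec(U). H stands in for grad h(theta_star),
# chosen symmetric negative definite, and U stands in for U_star.
d = 3
rng = np.random.default_rng(0)
B = rng.standard_normal((d, d))
H = -(B @ B.T + np.eye(d))             # symmetric negative definite => Hurwitz
M = rng.standard_normal((d, d))
U = M @ M.T + np.eye(d)                # symmetric positive definite

A = np.kron(H, np.eye(d)) + np.kron(np.eye(d), H)
V = np.linalg.solve(A, -U.reshape(-1)).reshape(d, d)

assert np.allclose(V @ H.T + H @ V, -U)               # Lyapunov equation holds
assert (np.linalg.eigvalsh((V + V.T) / 2) > 0).all()  # V is positive definite
```

Since H is Hurwitz and U is positive definite, the solution V is guaranteed to be the unique positive-definite solution, matching the statement of the theorem.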
2.7 Numerical results

We illustrate the convergence results obtained in Section 2.2.2 and discussed in Sections 2.2.3 and 2.6. We consider a particular case of the distributed optimization problem described in Section 2.2: a network of N = 5 agents where, for any i = 1, ..., 5, we define a private cost function fi : R → R. We address the following minimization problem:

min_{θ∈R} F(θ)  where  F(θ) = Σ_{i=1}^5 (1/2)(θ − αi)²    (2.19)

where αT = (−3, 5, 5, 1, −3). The minimizer of (2.19) is θF = ⟨α⟩ = 1. The network is represented by an undirected graph G = (V, E) with vertices {1, ..., N} and 6 fixed edges E. The corresponding adjacency matrix is given by

      0 1 0 1 0
      1 0 1 0 0
A =   0 1 0 1 1    (2.20)
      1 0 1 0 1
      0 0 1 1 0

We choose θ0,i = 0 for each agent i and a step-size sequence of the form γn = 0.1/n^0.7. The observations Yn,i are defined as in (2.4): (ξn,i)n,i is an i.i.d. sequence with Gaussian distribution N(0, σ²) where σ² = 1.

Figure 2.1 illustrates the two results of Theorem 2.1 for different gossip matrices (Wn)n. First, Figure 2.1 (a) addresses the convergence of the sequence (θn,1)n≥0 as a function of n, in order to show the influence of the matrices Wn on the limit points. In particular, the dashed curve corresponds to the algorithm (2.1)-(2.2) when Wn is assumed fixed and deterministic (Wn = W1 for all n); we select W1 in such a way that each agent computes the average of the temporary estimates in its neighborhood. This amounts to setting W1 = (IN + D)−1(IN + A), where D is the diagonal matrix of the degrees, i.e. D(i, i) = Σ_{j=1}^N A(i, j) for each agent i. Note that W1 is not doubly stochastic since 1TW1 ≠ 1T. Computing the left Perron eigenvector defined by Lemma 2.1 yields the minimizer of V = Σi vi fi, namely θV = vTα = 1.24. In that case, the sequence (θn,1)n converges to θ⋆ = θV instead of the desired θ⋆ = θF.
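This deterministic-gossip experiment can be reproduced in a few lines. The sketch below runs the noiseless version of the recursion (ξn ≡ 0) so that the limit point is clearly visible, with the graph (2.20) and W1 = (I + D)−1(I + A) as above; the vector α below is an illustrative choice (different from the chapter's) picked so that the bias vTα ≠ ⟨α⟩ is clearly visible.

```python
import numpy as np

# Noiseless run of Algorithm (2.1)-(2.2) with fixed W1 = (I+D)^{-1}(I+A)
# on the graph (2.20). alpha is an illustrative cost configuration,
# not the one used in the chapter.
A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
N = len(A)
D = np.diag(A.sum(axis=1))
W1 = np.linalg.inv(np.eye(N) + D) @ (np.eye(N) + A)   # row-, not column-stochastic

alpha = np.array([0.0, 0.0, 0.0, 0.0, 10.0])          # illustrative f_i minimizers
theta = np.zeros(N)
for n in range(1, 100001):
    gamma = 0.1 / n ** 0.7
    theta = W1 @ (theta + gamma * (alpha - theta))    # Y_n,i = -(theta_i - alpha_i)

# left Perron eigenvector of W1, via the formula of Lemma 2.1
J_perp = np.eye(N) - np.ones((N, N)) / N
v = np.ones(N) @ W1 @ np.linalg.inv(np.eye(N) - J_perp @ W1) / N

assert theta.max() - theta.min() < 1e-3        # consensus is reached...
assert abs(theta[0] - v @ alpha) < 1e-3        # ...at theta_V = v^T alpha
assert abs(v @ alpha - alpha.mean()) > 0.1     # biased away from the mean <alpha>
```

The run confirms the qualitative message of this section: the agents agree, but on the v-weighted critical point rather than on the desired minimizer of the sum.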
Figure 2.1 (a) also includes the trajectory of θn,1 generated by Algorithm (2.6)-(2.2) with W1 = (IN + D)−1(IN + A). As proposed in Section 2.2.4, when introducing the weighted step sizes γn vi−1, the sequence now converges to the sought value θF. Figure 2.1 (a) also illustrates the convergence behavior of Scenario 2, where the limit point θ⋆ of Algorithm (2.1)-(2.2) coincides with θF. In that case, we consider two standard models for Wn, namely the pairwise gossip of [31] and the broadcast gossip of [10] (we set β = 1/2). Finally, the plain line in Figure 2.1 (a) shows the performance of the algorithm proposed by [115] for distributed optimization, which is based on a synchronous version of the push-sum protocol of [89].

We conclude the illustration of Theorem 2.1 with the results on the consensus convergence for the same examples of Wn considered in Figure 2.1 (a). Figure 2.1 (b) represents the norm of the scaled disagreement vector as a function of n. As expected from Theorem 2.1, consensus is asymptotically achieved independently of the limit point, i.e. θF or θV. Note that the synchronous models of W1 and [115] require N transmissions at each iteration n, whereas the gossip protocols of [31] and [10] only require two and one transmissions respectively, due to their asynchronous nature. This may explain the gap between the curves in Figure 2.1 (b) regarding the convergence rate towards consensus.

[Figure 2.1 (a): Trajectories of θn,1 as a function of n, for Algorithm (2.1)-(2.2) with Wn = (I+D)−1(I+A), Algorithm (2.6)-(2.2) with Wn = (I+D)−1(I+A), Algorithm (2.1)-(2.2) with Wn pairwise [31], Algorithm (2.1)-(2.2) with Wn broadcast [10], and the algorithm of [115].]
[Figure 2.1 (b): √((1/N) Σ_{i=1}^N (θn,i − ⟨θn⟩)²) as a function of n, for the same five algorithms as in (a).]

Figure 2.1. Convergence result of Theorem 2.1 according to different communication schemes for (Wn)n.

The result of Theorem 2.3 is illustrated in Figure 2.2, which leads to the concluding remark 4) of Section 2.6. Figures 2.2 (a) and 2.2 (b) display the asymptotic analysis of the normalized average error γn^(−1/2)(⟨θn⟩ − θ⋆). Indeed, once convergence is achieved, the asymptotic distribution can be characterized by the closed form of the variance U⋆ ∈ R. In this example, Theorem 2.3 states that γn^(−1/2)(⟨θn⟩ − θ⋆) converges in distribution to a r.v. ∼ N(0, V), where ∇h(θ⋆) = −1 and thus the variance is V = U⋆/2. The first boxplot and the first histogram in Figure 2.2 are related to the algorithm implemented in a centralized manner. We consider the distributed algorithm (2.1)-(2.2) with different choices of Wn: the pairwise gossip of [31], the broadcast gossip of [10] and the fixed W1 defined by (IN + D)−1(IN + A). Note that θ⋆ coincides with the sought minimum θF when Algorithm (2.1)-(2.2) uses the pairwise or the broadcast gossip matrices since, in these cases, v is equal to (1/N)1 and thus θV = θF. However, in the fixed W1 case (where W1T1 ≠ 1), the average error sequence γn^(−1/2)(⟨θn⟩ − θ⋆) is computed with respect to θ⋆ = θV, which does not coincide with θF. From Figure 2.2 (b) we observe that the normal distribution obtained in Theorem 2.3 is coherent with the empirical results. The expression of U⋆ defined in (2.21) takes the following form:

U⋆ = (1/N²)[vec(C)T Γ vec(σ²IN + (IN + 2Λ) g(θ⋆1) g(θ⋆1)T) + g(θ⋆1)T C (Λ + IN) g(θ⋆1) + σ² tr(C + 11T)]    (2.21)

where tr(A) denotes the trace of the matrix A and:

C = E[W1T 11T W1] − 11T ,
Γ = (IN² − E[J⊥W1 ⊗ J⊥W1])−1 E[J⊥W1 ⊗ J⊥W1] ,
Λ = (IN − J⊥W)−1 J⊥W ,
g(θ⋆1) = −θ⋆1 + α .

As stated in Corollary 2.1, when the (Wn)n are doubly stochastic, U⋆ corresponds to the same variance as obtained in a centralized setting: since the covariance matrix C is equal to zero and tr(11T) = N, (2.21) reduces to σ²/N = 0.2. Furthermore, we can rewrite the variance (2.21) as the sum of two terms, U⋆ = U⋆opt + U⋆com, where:

U⋆opt = σ²/N ,
U⋆com = (1/N²)[vec(C)T Γ vec(σ²IN + (IN + 2Λ) g(θ⋆1) g(θ⋆1)T) + g(θ⋆1)T C (Λ + IN) g(θ⋆1) + σ² tr(C)] .

The case U⋆ = U⋆opt is displayed by the two first histograms in Figure 2.2 (b). On the contrary, U⋆com > 0 once column-stochasticity fails, since the covariance matrix C is no longer null and all terms in (2.21) influence the asymptotic variance. For instance, this is the case when the matrices (Wn)n follow the broadcast model of [10], since then C = (β²/N) L², where L is the Laplacian matrix of G, i.e. L = D − A. As shown in Figure 2.2 (b), U⋆ is now 18.03. However, regarding the asymptotic variances illustrated by the boxplots in Figure 2.2 (a), it is worth noting that these variances are rather different for the two non-column-stochastic cases, i.e. the deterministic case (fixed Wn = W1 for all n) and the random broadcast case.

[Figure 2.2 (a): Boxplots of the normalized average error for: (1) the centralized algorithm, (2) (Wn)n pairwise [31], (3) (Wn)n broadcast [10], (4) Wn = W1 for all n, and (5) Wn = W1 for all n with weighted step sizes. Boxplots 2, 3 and 4 correspond to Algorithm (2.1)-(2.2), while boxplot 5 corresponds to Algorithm (2.6)-(2.2) (weighted step sizes).]
[Figure 2.2(b) panels: Centralized; Distributed with W_n pairwise [31]; Distributed with W_n broadcast [10].]

(b) Empirical distribution (dark bars) versus theoretical distribution given by Theorem 2.3 (solid line).

Figure 2.2. Asymptotic analysis of the normalized average error γ_n^{-1/2}(⟨θ_n⟩ − θ⋆) according to different communication schemes for (W_n)_n, after n = 30000 iterations and over 100 independent Monte-Carlo runs.

are 0.26 and 0.49 respectively. These values are close to the variance achieved in the optimal cases (0.2), i.e. centralized processing and doubly stochastic matrices. This is due to the low contribution of the variance term U⋆^com. Indeed, this contribution comes from two sources: the moments of the disagreement sequence defined in (2.9), and the covariance matrix C (related to the non-column-stochastic character of the model for W_n). Observing the disagreement trajectories in Figure 2.1(b), there is a gap of nearly two orders of magnitude (10⁻²) between the mean values (first order) of the deterministic case (dashed line) and the broadcast case (plain line with square markers). Besides, the fluctuations (second order) are larger in the broadcast case, while they are almost null in the deterministic cases (dashed line and plain line with circle markers). Nevertheless, due to the convergence rate of the disagreement sequence, the impact on the average error is negligible, i.e. the disagreement vanishes faster than √γ_n, as stated in Proposition 2.1. Thus, the main contribution to U⋆ comes from the covariance matrix C. In the deterministic cases, the entries of C are close to 0, contrary to those obtained with the broadcast gossip (values greater than one).
Part II — Distributed principal component analysis

Chapter 3 — A distributed on-line Oja's algorithm

In this chapter we investigate the spectral decomposition of a real symmetric positive-semidefinite N × N matrix denoted by M. The objective is to compute the principal eigenvectors and the corresponding eigenvalues of M, a task known as principal component analysis (PCA). Consider the spectral decomposition of M:

M = U Λ U^T    (3.1)

where U^T = U^{-1} is an orthogonal matrix and Λ is a diagonal matrix, both of size N × N. We seek to obtain the factorization (3.1), with the column vectors of U having unit norm, by designing a distributed algorithm. For that purpose, we introduce a connected network of N agents with a graph structure, in which each agent partially observes the matrix M and focuses on the computation of its own entries of U and Λ. By means of communications among the agents, we aim at designing an iterative and distributed algorithm whose convergence towards the principal eigenvectors of M can be ensured. Before going into the details, we first introduce the notations used throughout this chapter, summarized in the table below.

N, p — positive integers
x, y, … — real numbers (scalars)
x, y, … — column vectors in R^N
X, Y, … — real matrices in R^{N×N} or R^{N×p}
x ∘ y — Hadamard product
‖x‖ — Euclidean norm, ‖x‖² = Σ_i |x(i)|²
‖X‖ — Frobenius norm, ‖X‖² = tr(X^T X), where tr(·) is the trace
diag(·) — diagonal matrix formed by the elements (·) on its diagonal
P, E — probability and associated expectation operators on a tacit probability space

3.1 Introduction

3.1.1 Context and goal

Eigenvector computation is the main ingredient of the celebrated principal component analysis (PCA) [76], a classical approach to reduce the complexity of systems of high dimension or subject to randomness.
In wireless sensor networks (WSN), PCA is applied to compute the sensors' positions (see our application in Chapter 4 or [92]) and the signal's covariance matrix (see [102]). In cluster learning, eigendecomposition is needed for several applications such as: PageRank [124], the stationary distribution of an ergodic Markov chain (see [33] or [29]), graph clustering [29], spectral analysis of an adjacency matrix for social engineering [90] and, recently, user profiling to personalize services [147]. As highlighted in [90], [147] or [102], it is important to achieve such an embedding at the user/agent level by means of distributed processing.

Let M be in R^{N×N}. We denote by λ_1(M) ≥ ··· ≥ λ_N(M) the eigenvalues of matrix M sorted in non-increasing order. When no ambiguity arises, an eigenvector u_k(M) and its corresponding eigenvalue λ_k(M) are simply denoted by u_k and λ_k, for all k = 1, …, N. Let p be such that p < N. Following the compact matrix notation of (3.1), we define U as the N × p matrix containing the first p eigenvectors, U = (u_1, …, u_p), and Λ as the p × p diagonal matrix containing the corresponding eigenvalues, Λ = diag(λ_1, …, λ_p). Note that if the eigenspace has dimension 1, a unit-norm eigenvector associated with Λ = λ_1 is denoted U = u_1.

Let us assume that N agents are connected and form a network. We denote by G = (V, E) an undirected graph with vertex set V = {1, …, N} and edge set E ⊆ {{i, j} : i, j ∈ V}, and we write i ∼ j whenever {i, j} ∈ E. We denote by N_i the neighborhood of any agent i, i.e. all agents j such that j ∼ i. G_N denotes the complete graph, which has all N(N−1)/2 possible edges. A symmetric matrix M is said to be adapted to a graph G = (V, E) whenever i ≁ j implies M(i, j) = 0. For instance, when G = G_N, any symmetric matrix M is adapted to G. In this section and in Section 3.2, we assume that all agents are connected, so the network they form is represented by the complete graph G_N.
General types of graphs will be addressed in Section 3.3. Let us also assume that each pair of connected agents is given a weight, representing, for instance, the distance between the two agents, a link quality measure or a resource allocation. The weight between a pair of agents i and j is denoted M(i, j). Thus, all weights are simultaneously encoded into the single symmetric matrix M = M^T. Under perfect observation of M, each agent i has access to the i-th row of M, i.e. M(i, 1), …, M(i, N). In addition, the location of each agent i depends on the i-th component of each eigenvector, i.e. the i-th row of U, denoted by U(i) = (u_1(i), …, u_p(i)).

In the first part of this thesis, we addressed the problem of network consensus on some global parameter, external to and independent of the agents. In this second part, we deal with a network configuration problem requiring each agent to infer its own coordinates, which depend on the global subspace spanned by U, i.e. the eigenspace associated with M. It is worth noting that, contrary to Chapter 2, the goal is now to design a distributed algorithm that enables each agent i to obtain an estimate of its coordinates U(i). Similarly to Chapter 2, the process is based on a stochastic approximation approach and is performed by each agent from local information and several communications with its neighbors. Although the agents do not have access to the global information M, the coordinates U(i) of each agent i are nevertheless related to M. Hence, to solve this problem in a distributed manner, we define a communication scheme different from the one considered in Chapter 2, since network consensus is no longer required.

3.1.2 Related works

Consider the spectral analysis problem in one dimension (p = 1) applied to a deterministic (or perfectly known) matrix M; the goal is then to compute u_1.
A well-known centralized algorithm to accomplish this task is the so-called power method (PM) [73, p. 406], described by Algorithm 1.

Algorithm 1: Power method for principal eigenvector estimation (p = 1)
Initialize: set u_0 ∈ R^N randomly.
Update: at each time n = 1, 2, …
[step 1] Compute ũ_n = M u_{n−1}.
[step 2] Normalize u_n = ũ_n / ‖ũ_n‖.

From a distributed implementation viewpoint, both steps 1 and 2 have drawbacks. For a given agent i, step 1 writes as a sum ũ_n(i) = Σ_{j=1}^N M(i, j) u_{n−1}(j) that contains N terms, each involving a communication with a separate agent j. When N is large, this may incur a prohibitive cost to the network. Second, for any agent i, step 2 writes u_n(i) = ũ_n(i) / √(Σ_{j=1}^N ũ_n(j)²), implying that agent i should query all other agents for their values ũ_n(j) to implement this step. Hence, Algorithm 1 has a computational cost (multiplications, additions and divisions) scaling as N², and a constrained communication scheme demanding synchronization among the agents at each iteration n. Note that an additional, more involved step is required when computing more than one eigenvector (p > 1). Indeed, this step relies on the condition that the eigenvectors form an orthonormal basis, which can be enforced by the Gram-Schmidt method or a QR decomposition. The generalization of the PM to the computation of the first p eigenvectors of a matrix M is called orthogonal iteration (OI) [73, p. 454]. As an extension of the PM, the OI includes a QR decomposition at each iteration n to generate the corresponding eigenvector estimates U_n. Note that the computational cost then increases, scaling as pN² (multiplications and additions) plus a factor p³ due to the Cholesky factorization, which includes the inversion of a p × p matrix. Two decentralized algorithms can be found in [90] and [92].
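As a point of reference, the centralized Algorithm 1 can be sketched in a few lines of NumPy (the test matrix, seeds and tolerance below are illustrative choices of ours, not taken from the thesis):

```python
import numpy as np

def power_method(M, n_iter=500, rng=None):
    """Centralized power method (Algorithm 1): estimate the principal
    eigenvector u1 of a symmetric matrix M."""
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(M.shape[0])
    for _ in range(n_iter):
        u_tilde = M @ u                        # step 1: costs O(N^2)
        u = u_tilde / np.linalg.norm(u_tilde)  # step 2: global normalization
    return u

# Toy check on a random positive-semidefinite matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T
u = power_method(M, rng=1)
u1 = np.linalg.eigh(M)[1][:, -1]               # true principal eigenvector
assert abs(abs(u @ u1) - 1.0) < 1e-8
```

Both drawbacks discussed above are visible here: the matrix-vector product needs every entry of a row of M, and the normalization needs a global reduction over all coordinates.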
The first work addresses the general p-eigenvector estimation of a perfectly known adjacency matrix; the authors propose a distributed version of the OI [73, p. 454] by introducing two communication steps at each iteration: one in which each agent i sends its current estimate U_n(i) to all other agents, and a second in which a push-sum averaging gossip phase is used to estimate the p × p matrix related to the Cholesky factorization. Note that this phase implies that all agents communicate during a given number of rounds in order to achieve a certain accuracy. In [92], a distributed gossip-based version of Algorithm 1 is proposed to estimate the first eigenvector; the authors mention that the extension to general p can be performed by a distributed Gram-Schmidt method, whose details are not given, and they consider an example with p = 2 to illustrate their numerical results. The authors of [92] introduce a deterministic and a random sparsification of the matrix M to compute the product of step 1 of Algorithm 1, in which each agent communicates with the other agents with a given probability. The normalization step 2 is then handled by an averaging gossip phase involving, at each iteration, a number of rounds chosen to achieve a given accuracy. Error bounds are given as functions of these two design parameters: the communication probability of each agent, and the desired accuracy, which determines the number of gossip rounds. Thus, [92] does not provide an algorithm that converges to the sought eigenvector; it converges to an approximate solution. In a more general framework, if the matrix M is only partially known, for instance because it is corrupted by random noise, an unbiased estimate M̂ can first be computed by collecting a large number of observations. In that case, Algorithm 1 must be preceded by a batch phase before performing the eigenspace computation, which increases the number of communications.
In such a context, when considering a sequence (M_n)_n such that E[M_n] = M for all n, an alternative is the centralized stochastic approximation approach proposed by Oja in [94] (p = 1) and [122] (p > 1) as an extension of the two previous algorithms. Although, under further hypotheses (large matrices, sparsity, noisy measurements, etc.), different methods can be considered (see [73]), we focus on distributed PCA based on Oja's approach. We design an on-line implementation in which each agent is able to update its local estimate whenever a new observation is obtained, which is well adapted to our context. A distributed version of [122] is presented in [147] for a sparse and perfectly known matrix M. The authors introduce two time scales: a slow one to update the coordinates U_n(i) at each agent, and a fast one performed by the random averaging gossip of [31] to make the global terms U_n^T U_n and U_n^T M U_n available at each agent and at each iteration n. Moreover, [147] proposes two algorithms, respectively synchronous and asynchronous. Unfortunately, for settings where robustness to asynchronism is required, such as wireless sensor networks (WSN), convergence is only guaranteed for the synchronous algorithm. Finally, paper [102] addresses the distributed computation of the principal components of a signal's covariance matrix that is only observed up to some noise in a WSN. Once Oja's recursion is written for each agent i, the authors of [102] identify three terms which require the knowledge of the values of all the other agents at each iteration, namely M U_{n−1}, U_{n−1}^T U_{n−1} and U_{n−1}^T M U_{n−1}. Hence, an average consensus phase (as in [31]) involving a given number of rounds is performed to estimate each term before updating U_n. The associated mean field is proved to be close to Oja's recursion. This result leads them to obtain, under suitable assumptions, convergence towards an equilibrium point lying in a limit set close to the sought principal component.
The chapter is organized as follows. Section 3.2 introduces the framework and details the proposed algorithm in the case of a complete graph and of the first-eigenvector estimation problem. Section A.5 provides the analysis of the algorithm: convergence with probability 1 is proved under suitable assumptions in the one-dimensional case. We provide an extension in Section 3.3, taking into account a general graph context and the computation of several principal components. This latter section leads us to introduce the general algorithm used for the localization application in WSN (see Chapter 4). Numerical results in one and two dimensions are provided in Section 3.5 to conclude this chapter.

3.1.3 Contributions

Note that each of the distributed solutions above includes one or several average consensus phases at each iteration, meaning that the computational and communication costs are governed by the required number of consensus steps. We bring the following contributions:

• We borrow from [92] the idea of sparsification (sparse communications), and we share with [102], [147] and [29] the same foundation, namely Oja's algorithm, which we use in a distributed context as initiated by [147] and [102].

• We provide a general framework and algorithms that encompass both the case where the symmetric matrix M is perfectly known and the case where M is not perfectly known but, instead, a sequence of independent and identically distributed (i.i.d.) samples M_n, n ≥ 0, is observed.

• We provide an asynchronous and on-line implementation based on random measurements or observations and random communications among the agents.

• We prove almost sure (a.s.) convergence of the proposed distributed algorithm to some eigenvector of M.

Since we are interested in on-line processing, we denote by M_n the sample available at the time instant n at which the computation is performed by the agents.
3.2 Case G = G_N

In this section we investigate the following simple case: the estimation of the first eigenvector when the network of N agents forms a complete graph. First, we recall the centralized Oja's algorithm, and then we introduce the communication step that yields the distributed version.

3.2.1 Oja's algorithm

When faced with random matrices (M_n)_n having a given expectation M, the following centralized on-line algorithm, due to Oja [120] and analyzed in [122], converges to the principal eigenvector (p = 1) under suitable assumptions:

u_n = u_{n−1} + γ_n ( M_n u_{n−1} − (u_{n−1}^T M_n u_{n−1}) u_{n−1} ).    (3.2)

However, since we expect convergence to a unit-norm vector, the above recursion is known to suffer from numerical instabilities as soon as the initialization is not well chosen [122]. The work [29] deals with the stabilization issue of [122] by adding a normalization term to the above stochastic approximation equation, which ensures the Lipschitz continuity of the term M_n − M_{n−1} and thus the stability of the generated sequence. Moreover, a white noise term is added to ensure the consistency of the sequence, and convergence is finally proved. Nevertheless, the computation of the normalization term proposed in [29] may be difficult to generalize to a distributed context. The instabilities can instead be easily avoided by introducing a simple projection step (see [23] for instance):

u_n = Π_K [ u_{n−1} + γ_n ( M_n u_{n−1} − (u_{n−1}^T M_n u_{n−1}) u_{n−1} ) ],    (3.3)

where K is any compact convex set whose interior contains the closed unit ball in R^N, and Π_K is the projector onto K. Note that both the standard Oja's algorithm [122] and the variants of [29] and (3.3) are centralized and need a number of operations (products and sums) and of communications scaling as N² and N respectively at each iteration, due to the terms M_n u_{n−1} and (u_{n−1}^T M_n u_{n−1}) u_{n−1}.
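To fix ideas, the projected recursion (3.3) can be sketched as follows, taking K = [−α, α]^N so that Π_K is a coordinate-wise clipping; the noise model, step sizes and tolerance are illustrative assumptions of ours:

```python
import numpy as np

def projected_oja(samples, gamma, alpha=2.0, u0=None):
    """Centralized projected Oja recursion (3.3):
    u_n = Pi_K[u_{n-1} + gamma_n (M_n u_{n-1} - (u^T M_n u) u_{n-1})]."""
    u = u0.copy()
    for n, M_n in enumerate(samples, start=1):
        u = u + gamma(n) * (M_n @ u - (u @ M_n @ u) * u)
        u = np.clip(u, -alpha, alpha)   # projection onto K = [-alpha, alpha]^N
    return u

rng = np.random.default_rng(0)
N = 6
B = rng.standard_normal((N, N))
M = B @ B.T / N                          # target mean: E[M_n] = M
def sym(E):                              # symmetric zero-mean noise
    return (E + E.T) / 2
samples = (M + 0.1 * sym(rng.standard_normal((N, N))) for _ in range(20000))
u = projected_oja(samples, gamma=lambda n: 0.5 / n**0.7,
                  u0=rng.standard_normal(N) / N)
u1 = np.linalg.eigh(M)[1][:, -1]
# the direction of u aligns (up to sign) with the principal eigenvector
assert abs(abs(u / np.linalg.norm(u) @ u1) - 1.0) < 0.1
```

The clipping step is what keeps the iterates bounded during the early, large-step-size phase, exactly the role Π_K plays in (3.3).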
However, in a distributed setting of N agents, a similar amount of complexity is charged to each agent i to update its corresponding entry u_n(i). The first term M_n u_{n−1} implies that each agent i sends u_{n−1}(i) to all other agents j ≠ i and computes Σ_{j=1}^N M_n(i, j) u_{n−1}(j). Subsequently, each agent i sends this latter value to all other agents j ≠ i and is then able to compute the second matrix operation (u_{n−1}^T M_n u_{n−1}) u_{n−1}(i). In order to reduce the number of communications per agent at each iteration, we introduce an asynchronous communication model which enables the agents to perform fewer transmissions while keeping the behavior of the sequence (3.3).

3.2.2 Communication model: randomized sparsification

We define the following asynchronous model.

Definition 3.1 (Asynchronous sparsification matrices). Let q be a real number such that 0 < q < 1. Define i.i.d. uniformly distributed random variables (r.v.) V_n (P[V_n = i] = 1/N for all i ∈ {1, …, N}) and i.i.d. Bernoulli random variables Q_{n,i} (P[Q_{n,i} = 1] = q for all i ∈ {1, …, N}). The following sequence of random matrices A_n are referred to as asynchronous sparsification matrices (ASM):

A_n(i, j) = 1 if i = j; N/q if i ≠ j, j = V_n and Q_{n,i} = 1; 0 otherwise.    (3.4)

Notice that the matrices A_n are not symmetric. The following lemma can be straightforwardly proved.

Lemma 3.1. The ASM defined by (3.4) are i.i.d. random matrices such that:
i) There exists a constant C_q such that ‖A_n‖ ≤ C_q with probability 1.
ii) E[A_n] = N J = 11^T.

The following proposition is an immediate consequence of Lemma 3.1.

Proposition 3.1. Define M_n := A_n ∘ M. The matrix sequence (M_n)_n is uniformly bounded with probability 1 and is an unbiased estimate of M, i.e. E[M_n] = M.

Remark 3.1. Let u be a vector whose i-th component is known by node i only.
The computation of vector (An ◦ M )u can be performed easily in an asynchronous and distributed way: at time n, some agent Vn wakes up, chooses a sparse list of neighbors it is going to contact using head and tail draws Qn,i . Each of the chosen neighbors is awaken by Vn and updates its value only using the value of agent Vn and its own. Notice also that no feedback is needed from the network: agent Vn only has to send its value to some chosen neighbors. In that sense, ASM are analogous to broadcast matrices of [10]. Let us mention one drawback that seems difficult to circumvent: each agent has to know previously the total number N of agents in the network. The Authors of [92] replace the matrix multiplication step un = M un−1 by another multiplication un = M n un−1 where M n is an unbiased and sparse estimation of M . Yet, in [92] it is suggested the use of a simple Bernoulli sparsification scheme studied by [5]: M n = q −1 B n ◦ M where B n are random matrices with i.i.d. binary entries taking value 1 with probability q. However, notice that multiplying by matrix M n cannot be done asynchronously. Indeed, in [92] it is considered that all agents transmit at same time. In addition, as already noticed, the convergence towards the sought eigenspace is not ensured. Following the communication model in Definition 3.1, we now describe the proposed algorithm. 3.2.3 Distributed on-line Oja’s algorithm (p = 1, G = GN ) Since we are interested in a distributed on-line implementation of Oja’s algorithm (3.3), we take the particular case where M n = An ◦ M for some ASM An . Putting aside the term uTn−1 M n un−1 Oja’s algorithm only involves matrix multiplications of the form M n un−1 that could easily be distributed for the reason mentioned in Remark 3.1. Unfortunately, term uTn−1 M n un−1 still remains difficult to evaluate distributively. Our idea is thus to replace the latter with an estimate which is more suitable to distributed computation. 
We set

z_n = Ã_n ( u_{n−1} ∘ (M_n u_{n−1}) ),

where the Ã_n are ASM independent from the A_n. Again, z_n can be computed distributively, each node i being able to evaluate the i-th component z_n(i) of z_n by means of local gossiping with the agents selected by the sparsification matrix Ã_n. Loosely speaking, we interpret z_n(i) as a noisy estimate of the desired term u_{n−1}^T M_n u_{n−1}. We are now in a position to state our first algorithm, directly inspired by the projected Oja's algorithm of Section 3.2.1. The algorithm iteratively generates a random sequence (u_n)_n according to the following updates:

y_n = (A_n ∘ M) u_{n−1}
z_n = Ã_n (u_{n−1} ∘ y_n)    (3.5)
u_n = Π_K [ u_{n−1} + γ_n (y_n − z_n ∘ u_{n−1}) ]

where A_n and Ã_n are two independent ASM. From Remark 3.1, it follows that the matrix multiplication steps in the update (3.5) can easily be implemented in a distributed fashion. Thus, algorithm (3.5) complies with the asynchronism requirement and is fully distributed, as soon as the projector Π_K can be applied distributively. To that end, since the vector u_n has N entries, it is sufficient to choose K as a Cartesian product of the form:

K = [−α, α] × ··· × [−α, α]    (3.6)

with α > 1. Each factor of (3.6) is a real interval whose interior contains [−1, 1], one for each agent i. Note that the algorithm is fully characterized by (3.5). However, in order to give a more detailed description, we also provide a pseudocode version of (3.5) in Algorithm 2 below.

Algorithm 2: Distributed on-line Oja's algorithm (doOja)
Initialize: Set u_0(i) ≠ 0 for any i ∈ V.
Iterate: At each time n = 1, 2, …
The clock of some random agent i ∈ V is ticking.
Agent i sends u_{n−1}(i) to a set of agents N_i ⊂ V randomly selected as in Remark 3.1.
For any agent j ∈ V, do:
y_n(j) = M(j, j) u_{n−1}(j) + (N/q) M(i, j) u_{n−1}(i) if j ∈ N_i;
y_n(j) = M(j, j) u_{n−1}(j) otherwise.
The clock of some random agent l ∈ V is ticking.
Agent l sends u_{n−1}(l) y_n(l) to a set of randomly selected agents N_l ⊂ V.
For any agent i ∈ V, do:
z_n(i) = u_{n−1}(i) y_n(i) + (N/q) u_{n−1}(l) y_n(l) if i ∈ N_l;
z_n(i) = u_{n−1}(i) y_n(i) otherwise.
and finally:
u_n(i) = Π_{[−α,α]} [ u_{n−1}(i) + γ_n (y_n(i) − z_n(i) u_{n−1}(i)) ].

3.3 General graph and unknown matrix M case

The algorithm detailed in the previous Section 3.4 is well suited to the context of perfect connectivity between agents (complete graph setting G_N), known matrix M and centralized processing. In this section, we introduce some necessary notations for the general context of a general connected graph formed by the network of N agents.

3.3.1 Network considerations

We let (W_{n,t})_{n,t≥0} be a doubly infinite i.i.d. sequence of random matrices with common mean E[W_{n,t}] = W. We adopt the random pairwise gossip scheme of [31] for (W_{n,t})_{n,t≥0}. In that case, W_{n,t} = I − (e_i − e_j)(e_i − e_j)^T / 2, where e_i denotes the i-th vector of the canonical basis of R^N, and the edge {i, j} ∈ E is activated with probability (1/N)(d_i^{-1} + d_j^{-1}), where d_i and d_j are the degrees of agents (vertices) i and j respectively. We introduce the following lemma, which will be used later in the convergence analysis. It establishes that a sufficient number of consecutive steps of the gossip scheme of [31] brings the product close, in expectation, to the projector.

Lemma 3.2. Let C_n = N ∏_{t=1}^{φ(n)} W_{n,t} be a product of gossip matrices performed during φ(n) steps, where the (W_{n,t})_{n,t≥0} are the pairwise gossip matrices described above. Then E[C_n] = N J + R_n, where R_n → 0 as n → ∞.

Proof. Upon noting that the (W_{n,t})_{n,t≥0} are i.i.d. with E[W_{n,t}] = W, it is necessary to verify that lim_{n→∞} ‖W^{φ(n)} − J‖ = 0, which follows by induction:
Setting φ(n) = n, and since W J = J and J W^{n−1} = J,

(W − J)^n = (W − J)(W − J)^{n−1} = (W − J)(W^{n−1} − J) = W^n − J.

Thus, lim_{n→∞} (W − J)^n = 0, as the corresponding spectral radius ρ(W − J) is smaller than one (see [31]).

In addition, we make the following considerations:

1) Graph G is not necessarily complete. On the one hand, the fact that G ≠ G_N naturally implies some degree of sparsity of the matrix M, since M(i, j) = 0 for any i ≁ j. One might expect that this natural sparsity of M should facilitate the design of distributed algorithms. On the other hand, however, it is no longer possible to design an ASM sequence as in Algorithm 2. In particular, the computation of z_n in (3.5) becomes irrelevant. Intuitively, the main advantage of using ASM in Section 3.2.3 was that such matrices are equal in expectation to N J. When G ≠ G_N, one can no longer generate a sparse random matrix adapted to G whose expectation is N J. In the sequel, we circumvent this issue by replacing the ASM with a random gossip step inspired from [31]. The idea of using linear gossip methods for PCA has been used previously in [102]. Note however that our algorithm presents significant differences with [102]. First, the authors of [102] focus on the case where the observed matrix has rank one, which allows for useful simplifications. Second, the algorithm of [102] ends up in a neighborhood of the sought eigenspace, whereas we show almost sure convergence of our algorithm. Finally, [102] assumes synchronous deterministic gossip, while we focus on asynchronous random gossip.

2) Matrix M is likely to be imperfectly observed. Instead, we assume that each node i ∈ V associates, at time n, a weight M_n(i, j) to any node j in its neighborhood. Here, (M_n)_n is a random matrix sequence which can be interpreted as a noisy version of a deterministic matrix M (we typically assume E[M_n] = M).
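The gossip product C_n of Lemma 3.2 can be sketched as follows; for simplicity we activate edges uniformly at random rather than with the exact probabilities (1/N)(d_i^{-1} + d_j^{-1}) of [31], and the ring graph, value of φ and tolerance are illustrative choices of ours:

```python
import numpy as np

def gossip_product(edges, N, phi, rng):
    """C_n = N * prod_{t=1}^{phi} W_{n,t}, where each W_{n,t} is a pairwise
    gossip matrix: W = I - (e_i - e_j)(e_i - e_j)^T / 2 averages the
    coordinates of the two endpoints of a randomly activated edge."""
    C = N * np.eye(N)
    for _ in range(phi):
        i, j = edges[rng.integers(len(edges))]  # uniform edge activation
        W = np.eye(N)
        W[[i, i, j, j], [i, j, i, j]] = 0.5     # average entries i and j
        C = C @ W
    return C

# On a connected ring of N = 6 agents, the product approaches N*J = 1 1^T.
rng = np.random.default_rng(0)
N = 6
ring = [(i, (i + 1) % N) for i in range(N)]
C = gossip_product(ring, N, phi=800, rng=rng)
assert np.max(np.abs(C - np.ones((N, N)))) < 0.01
```

This illustrates the role of φ(n): the longer the gossip phase, the closer C_n is to N J = 11^T, which is exactly the quantity the ASM provide in expectation in the complete-graph case.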
We consider a random sequence of matrices (M_n)_n adapted to G. This scenario has various applications. First, this model encompasses the sparsification scheme of Section 3.2: we just set M_n = A_n ∘ M, where A_n is a well-chosen sparse random matrix adapted to G. As another example, our model also encompasses the case where M is the covariance matrix of some i.i.d. random vectors η_n: M = E[η_n η_n^T] is not directly observable by the network, but η_n η_n^T is.

3.3.2 Distributed on-line algorithm

Under the network assumptions described in the previous section, we propose the following iterations as an extension of Algorithm 2 to estimate the first eigenvector. We generalize (3.5) as:

y_n = M_n u_{n−1}
z_n = C_n (u_{n−1} ∘ y_n)    (3.7)
u_n = Π_K [ u_{n−1} + γ_n (y_n − z_n ∘ u_{n−1}) ]

where C_n is defined as in Lemma 3.2 for a general network, or is an ASM (Definition 3.1) in the complete-graph case; φ is a non-decreasing integer-valued function going to infinity with n, and Π_K plays the same role as in (3.3). Remark that this algorithm yields a distributed asynchronous implementation compatible with G, since all matrices are adapted to the graph structure G.

Remark 3.2. It is known from [31] that, for a fixed n, the infinite product of gossip matrices ∏_{t=1}^∞ W_{n,t} converges almost surely to the orthogonal projector J, provided that G is connected. Here, we approximate J using the finite product ∏_{t=1}^{φ(n)} W_{n,t}. Since φ(n) → ∞, the latter product is expected to become closer and closer to the true projector as n increases. It is worth noting that if the product ∏_{t=1}^{φ(n)} W_{n,t} were indeed replaced by J in (3.13), then the recursion (3.13) would coincide with the centralized Oja's algorithm (3.12) as described in Section 3.4.

3.3.3 Convergence analysis

We have seen two distributed on-line algorithms in Sections 3.2.3 and 3.4.2. In this section we provide a convergence result which encompasses both algorithms.
Note that in this case, since p = 1, the upper-case notation U_n is equivalent to the lower-case notation u_n. In order to make precise statements, let us first introduce some assumptions. Recall that G denotes the underlying graph.

Assumption 3.1 (Step size). (γ_n)_n is a decreasing step-size sequence with γ_n > 0 satisfying the following standard properties [28]:
i) Σ_{n≥0} γ_n = +∞.
ii) Σ_{n≥0} γ_n² < ∞.

Let us introduce some sequences of random matrices (M_n)_n, (C_n)_n, (R_n)_n and (R'_n)_n, and denote by F_n the filtration up to time n, i.e. σ(u_0, M_1, …, M_n, C_1, …, C_n). We assume the following conditions.

Assumption 3.2 (M_n).
i) M is a symmetric N × N matrix and λ_1(M) has multiplicity 1.
ii) M_n is adapted to G.
iii) There exists a sequence of matrices R_n such that E[M_n | F_{n−1}] = M + R_n and R_n converges almost surely to 0.
iv) There exists C > 0 such that P[‖M_n‖ < C] = 1.
v) There exists a constant L > 0 such that E[‖M + R_n − M_n‖² | F_{n−1}] < L.

Assumption 3.3 (C_n). Let (C_n)_n be a sequence of matrices such that:
i) C_n is adapted to G.
ii) There exists a sequence of matrices R'_n such that E[C_n | F_n] = N J + R'_n and ‖R'_n‖ converges almost surely to 0.
iii) There exists C > 0 such that P[‖C_n‖ < C | F_{n−1}] = 1.
iv) Conditionally to F_n, C_n and M_n are independent.
v) There exists a constant L' > 0 such that E[‖N J + R'_n − C_n‖² | F_{n−1}] < L'.

Note that the sequence of matrices C_n = N ∏_{t=1}^{φ(n)} W_{n,t} defined in Lemma 3.2 of Section 3.3.1 satisfies Assumption 3.3. As mentioned previously, the technical bottleneck is to ensure stability, i.e. P[(‖u_n‖)_n is bounded] = 1. Hence, we introduce a stability-like condition in order to follow an analysis in the style of stochastic approximation schemes.

Assumption 3.4. For each agent i, there exists a time instant n_i^0 such that, for all n > n_i^0, the sequence u_{n−1}(i) + γ_n (y_n(i) − z_n(i) u_{n−1}(i)) remains in the compact set K almost surely.
The above assumption states that the projector Π_K becomes inactive at each agent i for all n beyond a certain value. Hence, Assumption 3.4 implies that the sequence (u_n)_{n≥0} remains a.s. in the compact set K, i.e. for all n > max_i n_i^0 we have u_n = Π_K[u_n]. The following lemma is simple algebra but allows us to cast the proposed algorithms (3.5) and (3.7) into a stochastic approximation framework. Indeed, the sequence defined in (3.7), with the projection onto the set K of (3.6), can be written in compact matrix form as:

u_n = Π_K [ u_{n−1} + γ_n ( M_n u_{n−1} − (C_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} ) ].    (3.8)

Lemma 3.3. Under Assumptions 3.2 and 3.3, (3.8) can be written as:

u_n = u_{n−1} + γ_n h(u_{n−1}) + γ_n ι_n + γ_n r_n + γ_n e_n    (3.9)

where h(u) = M u − (u^T M u) u is the so-called mean field function, ι_n is a martingale increment, i.e. E[ι_n | F_{n−1}] = 0, and r_n and e_n are two remainder terms.

Proof. Direct computations give the following expressions for the error sequences ι_n, r_n and e_n:

ι_n = (M_n − M − R_n) u_{n−1} + ((N J + R'_n − C_n)(u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} + (u_{n−1}^T (M + R_n − M_n) u_{n−1}) u_{n−1}

r_n = R_n u_{n−1} − (R'_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} − (u_{n−1}^T R_n u_{n−1}) u_{n−1}

γ_n e_n = Π_K [ u_{n−1} + γ_n ( M_n u_{n−1} − (C_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} ) ] − ( u_{n−1} + γ_n ( M_n u_{n−1} − (C_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} ) )

Using Assumption 3.2, one derives that E[(M_n − M − R_n) u_{n−1} | F_{n−1}] = 0 and E[(u_{n−1}^T (M + R_n − M_n) u_{n−1}) u_{n−1} | F_{n−1}] = 0. Then, using Assumption 3.3, one has E[((N J + R'_n − C_n)(u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} | F_{n−1}] = 0, which gives E[ι_n | F_{n−1}] = 0 as stated.

Moreover:

Lemma 3.4. Under Assumptions 3.1, 3.2, 3.3 and 3.4, the remainder terms r_n and e_n tend a.s. to 0 as n → ∞, and Σ_n γ_n ι_n is an almost surely convergent series.

Proof. Under Assumption 3.4, e_n converges a.s. to zero. From the stability condition of Assumption 3.4, one has P[‖u_{n−1}‖ < α] = 1.
Moreover, using Assumption 3.2:

‖rn‖ ≤ ‖Rn‖α + ‖R′n‖Cα³ + α³‖Rn‖

and Assumption 3.3 then implies that rn tends a.s. to 0. Moreover, using the decomposition of Lemma 3.3, one has

E[‖ιn‖² | Fn−1] ≤ Lα² + (L + L′)α⁶.

Using Assumption 3.1, it follows that

Σn γn² E[‖ιn‖² | Fn−1] < K Σn γn² < ∞

for some constant K. Using a standard L²-martingale argument [75, Theorem 2.17, p. 35], this implies a.s. convergence of the series Σn γn ιn.

The following result is standard in the stochastic approximation folklore; see for instance [14] or [28, Corollary 3, p. 17]. Define r′n = rn + en; then:

Theorem 3.1. A discrete dynamical system un = un−1 + γn h(un−1) + γn ιn + γn r′n such that:

i) r′n converges almost surely to 0,

ii) Σn γn ιn < ∞ almost surely,

iii) there exists a function V with lim_{‖u‖→∞} V(u) = +∞ and ⟨∇V(u), h(u)⟩ ≤ 0,

converges almost surely to the critical set of V defined as: H = {u ∈ R^N : ∇V(u) = 0}.

Following [50], we have:

Proposition 3.2. The function V(u) = exp(‖u‖²)/(uᵀ(M + λI)u) is a positive, coercive Lyapunov function for u̇ = Mu − (uᵀMu)u. Its critical set is formed by the eigenvectors of the matrix M.

Combining Proposition 3.2 and Theorem 3.1 gives the main result of this section.

Theorem 3.2. Under Assumptions 3.1, 3.2, 3.3 and 3.4, the column vector un defined by the proposed algorithms (3.5) and (3.7) converges to an eigenvector of M.

3.4 Extension of Oja’s algorithm for p ≥ 1

3.4.1 Oja’s algorithm

We address the natural extension of the proposed algorithm when the goal is to compute more than one eigenvector, i.e., the distributed version of Oja’s algorithm for p ≥ 1 [122]. Namely, the previous algorithms have to be extended to the computation of the N × p matrix U = (u1, . . . , up) (the principal components). There are now p estimates of the p principal components computed simultaneously. The running estimates un,1, . . . , un,p are concatenated into a single N × p matrix Un = (un,1, . . .
, un,p) obeying the same iterations. Note that, while extending the centralized on-line algorithm (3.2) of Section 3.2.1, our objective is to eventually find a point in the set χ of N × p matrices whose columns are orthonormal and span the vector space associated with the p principal eigenvalues of M. When faced with random matrices Mn having a given expectation M (see Proposition 3.1), the principal eigenspace of M can be recovered by the following algorithm, due to Oja [120] and analyzed in [122]. The algorithm generates a sequence (Un)n of N × p matrices according to:

Un = Un−1 + γn(Mn Un−1 − Un−1 Un−1ᵀ Mn Un−1),   (3.10)

where γn > 0 is a step size sequence. The main objective is to identify the principal eigenvectors of M, that is, to find an N × p matrix within the set

χ ≜ {V ∈ R^{N×p} : VᵀV = Ip, Im(V) = Im(U)}   (3.11)

where Im(U) stands for the linear span of the columns of U. In order to gain more insight, it is convenient to interpret (3.10) as a Robbins-Monro algorithm (see Chapter 3 of [51]) of the form Un = Un−1 + γn(h(Un−1) + ξn), where ξn is a martingale increment noise and h is the so-called mean field of the algorithm, given by h(U) = MU − UUᵀMU. It is known that, under adequate stability assumptions and a vanishing step size γn, the algorithm converges to the roots of h (Theorem 2 in [51]). By Theorem 1 of [121], the roots of h are essentially rotations of matrices whose columns are eigenvectors of M, multiplied by some scalar, including zero. Thus, strictly speaking, the algorithm might converge to a broader set than the sought set χ. Fortunately, it is known since [158] that all roots of h outside the set χ are unstable. Undesired points can be avoided by standard avoidance-of-traps methods (see Chapter 4 in [28] and [129]), which consist in artificially adding extra noise inside the parentheses of the right hand side of (3.10).
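The iteration (3.10) can be sketched as follows. This is an illustrative simulation of ours, not taken from the thesis: the spectrum of M, the noise model and the step sizes are arbitrary choices made so that the principal eigenspace is known in advance.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 8, 2
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
M = Q @ np.diag([4.0, 3.0, 1.0, 0.8, 0.5, 0.3, 0.2, 0.1]) @ Q.T  # known spectrum
P_true = Q[:, :p] @ Q[:, :p].T           # projector onto the principal eigenspace

U = 0.1 * rng.standard_normal((N, p))
for n in range(1, 100001):
    gamma = 1.0 / (n + 50)               # vanishing step size
    E = rng.standard_normal((N, N))
    M_n = M + 0.2 * (E + E.T) / 2        # random M_n with expectation M
    # Oja's iteration (3.10): U <- U + gamma*(M_n U - U U^T M_n U)
    U = U + gamma * (M_n @ U - U @ (U.T @ M_n @ U))

print(np.linalg.norm(U.T @ U - np.eye(p)))   # ~ 0: columns become orthonormal
print(np.linalg.norm(U @ U.T - P_true))      # ~ 0: Im(U) matches the principal eigenspace
```

The final checks correspond to membership in the set χ of (3.11): orthonormal columns and the correct column span, up to a rotation of the basis.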
Thus, we introduce a projection step, as proposed in (3.3) for the one-dimensional case, to overcome the numerical instabilities of the general algorithm (3.10):

Un = ΠK[Un−1 + γn(Mn Un−1 − Un−1 Un−1ᵀ Mn Un−1)],   (3.12)

where K is an arbitrary compact convex set whose interior contains χ, and where ΠK is the projector onto K. Similarly to (3.6), we set K = [−α, α]^{N×p}. Thus, the interior of [−α, α]^p contains [−1, 1]^p for each agent i, which allows a distributed implementation.

3.4.2 Distributed on-line Oja’s algorithm

To avoid cluttered notations we write the iterations at the agent level: Un(i) refers to line i of matrix Un (Un(i) hence has size 1 × p). We also extend the compact set defined by (3.6) to the p-dimensional case. Algorithm (3.5) generalizes to the following recursion:

Yn(i) = Σ_{j∈V} Mn(i, j) Un−1(j)
Zn(i) = Σ_{j∈V} Cn(i, j) Un−1(j)ᵀ Yn(j)   (3.13)
Un(i) = Π_{[−α,α]^p}[Un−1(i) + γn(Yn(i) − Un−1(i) Zn(i))]

where the matrix Zn(i) has size p × p.

3.5 Numerical results

3.5.1 Principal eigenvector estimation (p = 1)

We consider the problem where the aim is to find the first eigenvector u1 of a similarity matrix observed partially by a set of N agents forming a network. We study the convergence behavior of the sequence generated by Algorithm 2 when varying the parameters N and q of our model. We run 100 independent trajectories of Algorithm 2 with a decreasing step size sequence (γn)n of the form γ0/n^a. We set γ0 = 1 and a = 1. To illustrate Theorem 3.2, Figure 3.1 shows the convergence towards the principal component of M for different values of q and a fixed network size of N = 10. In order to highlight the impact of the parameter q on the error performance, Figure 3.2 displays the root-mean-square error (RMSE) between u1 and the estimate un as a function of q after n = 1000 iterations and for different network sizes, namely N equal to 10, 50, 100 and 500.
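A minimal sketch of the kind of simulation run here, using the agent-level recursion (3.13), is given below. This is our own illustration, not the exact experimental code: we take an idealized Cn with mean N J and a diagonal M so that the principal eigenspace is span(e1, e2); all parameter values are arbitrary. Rows are treated as 1 × p vectors, so the p × p matrix Zn(i) multiplies the estimate on the right.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, alpha = 8, 2, 2.0
M = np.diag([4.0, 3.0, 1.0, 0.8, 0.5, 0.3, 0.2, 0.1])  # principal eigenspace span(e1, e2)

U = rng.uniform(-1, 1, (N, p))           # U[i] is the 1 x p estimate held by agent i
for n in range(1, 50001):
    gamma = 1.0 / (n + 100)
    E = rng.standard_normal((N, N))
    M_n = M + 0.2 * (E + E.T) / 2        # random M_n with mean M
    C_n = np.ones((N, N))                # idealized C_n with mean N*J
    Y = M_n @ U                                   # Y(i) = sum_j M_n(i,j) U(j)
    Z = np.einsum('ij,jk,jl->ikl', C_n, U, Y)     # Z(i) = sum_j C_n(i,j) U(j)^T Y(j)
    for i in range(N):                            # each agent updates its own row
        U[i] = np.clip(U[i] + gamma * (Y[i] - U[i] @ Z[i]), -alpha, alpha)

print(np.linalg.norm(U.T @ U - np.eye(p)))   # ~ 0: orthonormal estimates
```

With Cn = 1 1ᵀ, every agent's Zn(i) equals Un−1ᵀ Mn Un−1, so the agent-wise updates reproduce the rows of the projected iteration (3.12).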
Note that q is the Bernoulli parameter of the communication model (see Definition 3.1), which describes the probability that a node communicates at iteration n.

Figure 3.1. Convergence of the sequence (unᵀ Mn un)n towards the eigenvalue λ1 (right) and of the norm-one sequence (‖un‖)n (left) for different values of the sparsification parameter q.

One would expect that a decreasing probability q degrades the estimation accuracy, i.e., increases the RMSE. Yet, in Figure 3.2, the performance is similar over the different values of N when q ≤ 0.1 and q ≥ 0.8. In contrast, when q ∈ (0.1, 0.8) a small gap appears in the RMSE values for the different values of N; in general the performance is better for N = 100 than for N = 500.

Figure 3.2. RMSE as a function of the probability parameter q of our proposed distributed Oja’s algorithm for different values of the network size N.

In addition to Figure 3.2, Figure 3.3 illustrates the performance of Algorithm 2 for each network size N ∈ {10, 50, 100, 500} and for three different values of q ∈ {0.5, 0.8, 1}. Similarly to [92], we define the complexity per agent as η = nd, where n is the number of iterations, i.e., the number of updates needed to obtain an estimate, and d is the average number of communications, i.e., d = qN by Definition 3.1. Figure 3.3 illustrates the RMSE as a function of the complexity per node η. Similarly to [92], Figure 3.3 includes the threshold on the complexity per node in
order to emphasize the trade off between the RMSE and the amount of communications and computations required, e.g., by setting different values for the parameter q. Let η1 and η2 be the thresholds such that:

η1 = arg min_η (RMSE_{q=0.5}, RMSE_{q=1})

η2 = arg min_η (RMSE_{q=0.8}, RMSE_{q=1})

Figure 3.3. RMSE on the first eigenvector when using the proposed distributed Oja’s algorithm (Algorithm 2) for different numbers of agents N: (a) N = 10, (b) N = 50, (c) N = 100, (d) N = 500.

Table 1 and Figure 3.4 report the thresholds (η1 and η2 defined above) marked by the vertical black lines in Figure 3.3. The thresholds indicate when the dense computation scheme with q = 1 achieves the same accuracy as the sparse schemes, i.e., by tuning different levels of sparsity with q = 0.5 and q = 0.8. Indeed, Table 1 highlights the threshold on the complexity required by Algorithm 2 beyond which the sparse schemes perform better than the dense scheme. Moreover, Figure 3.4 shows the complexity thresholds (η1, η2) as a function of the network size N; they seem to grow linearly with N, i.e., ηi ∝ βi N for some βi > 1 (i = 1, 2). Table 1 also includes the values of the first eigenvalue λ1. They increase with the dimension N, meaning that a larger number of communications and updates are required to improve the accuracy.
Note that in [102] the convergence is shown to be related to N and λ1 through an expression of the error bound between the estimated and the true eigenvector.

N     η1      η2      λ1
10    880     990     1.75
50    2800    3680    9.6
100   4450    5300    11.96
500   23000   28020   81.9

Table 1. Complexity thresholds η1 and η2 marked in Figure 3.3.

Figure 3.4. Complexity thresholds η1 and η2 as a function of the network size N.

Finally, we show in Figure 4.17 the impact of the parameter q on the variance of the error, which reflects the trade off between the accuracy and the number of communications required. Figure 4.17 summarizes the statistics of the RMSE over the 100 independent runs of the sequence generated by (3.5). We consider N = 100, n = 1000 iterations and the probability q of the Bernoulli variables defined in Definition 3.1 set to 0.05, 0.5 and 0.8. The median and the variance of the RMSE decrease as the probability that an agent communicates with all other agents tends to 1, since this extreme value corresponds to the performance of the centralized processing of algorithm (3.2). The differences are rather larger between a low value (q = 0.05) and a median value (q = 0.5) than between values of q above 0.5, since the differences between the cases q = 0.5 and q = 0.8 are relatively small. Although the best RMSE performance is achieved when q = 0.8, extreme values (outliers) appear more frequently than in the lowest case q = 0.05.

In Figure 3.6 we represent the impact of the Bernoulli parameter q of the sparse communication model of Definition 3.1 on the stopping time of the projection operator. In the present scenario we set the compact set as in (3.6) with [−α, α] = [−1, 1] for each agent i = 1, . . . , N.
We consider a network of N = 10 agents and we compute the mean total number of projections over the 100 independent runs of Algorithm 2 at each iteration time n. We indicate in Figure 3.6 by a black arrow the stopping time of the projection step, i.e., the last iteration n at which the projector is required on average. Since the convergence and the accuracy degrade when decreasing q (see Figures 3.2 and 4.17), the projector remains active for longer. We observe from Figure 3.6 that even if the projector becomes inactive later when decreasing q, it is active (on average) during almost the first 300 iterations for all values of q, which may be due to the randomness of the initialization. In addition, the difference in the stopping time of the projector between q = 0.5 and q = 0.75 is rather larger (a gap of 615 iterations) than the differences between q = 0.25 and q = 0.5 (182 iterations) and between q = 0.75 and q = 0.9 (154 iterations).

3.5.2 Two principal eigenvectors estimation (p = 2)

To illustrate the proposed algorithm described in Section 3.4.2, we consider a complete graph GN with N = 1000. Although the application of localization in WSN is explained at greater length in the following Chapter 4, we use the same scenario as [92] for the agents’ positions as a benchmark.

Figure 3.5. Boxplot of the RMSE values obtained over the 100 independent trajectories and for n = 1000 iterations of Algorithm 2 for different probabilities q.

Consider a network of N = 1000 agents placed uniformly at random in the unit square, each agent i having an unknown fixed position in the plane, i.e., p = 2.
The authors of [92] propose a distributed implementation of the so-called MDS (multidimensional scaling [95]) algorithm (see [143]), whose aim is to compute the coordinates zi of each node i (up to some rotation/translation) based on the measurement of the square distance D(i, j) = ‖zi − zj‖² with the other nodes j. We set M = −(1/2)(IN − J)D(IN − J) and compute the p = 2 principal components of M using the algorithm described in Section 3.4.2. As shown in [143], the position zi can easily be inferred from the i-th entries of these vectors along with the corresponding eigenvalues. We set a decreasing step size sequence (γn)n of the form 1/√n. We run the proposed distributed on-line algorithm (3.13) 100 independent times and until iteration n = 1000. Following [92], we define d as the average number of communications per node at any iteration, which is d = Nq where q is the Bernoulli parameter of the ASM in Definition 3.1. We compare our algorithm with the algorithm of [92] for the same number of communications per agent. Figure 3.7 illustrates the root-mean-square error (RMSE) of both algorithms as a function of the complexity per node, i.e., the iteration time n multiplied by the average number of communications d. The error of the algorithm of [92] vanishes more rapidly during the very first iterations but is then immediately subject to a residual error, while our algorithm eventually converges to the sought solution. The trade off between the accuracy and the communication cost per node is contrasted by comparing the centralized dense matrix case (the power method) and the distributed sparse matrices case ([92] and our proposed algorithm).
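The (centralized) MDS reconstruction just described can be sketched in a few lines. This is our own illustration; it is exact here because the positions are truly planar, so the double-centered matrix M has rank p.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 2
Z = rng.uniform(0, 1, (N, p))                        # true positions in the unit square
D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared distances D(i, j)

J = np.ones((N, N)) / N
M = -0.5 * (np.eye(N) - J) @ D @ (np.eye(N) - J)     # double-centered matrix

lam, V = np.linalg.eigh(M)
Z_hat = V[:, -p:] * np.sqrt(lam[-p:])                # classical MDS coordinates

# Recovered positions match the true ones up to rotation/translation,
# so all pairwise squared distances coincide:
D_hat = ((Z_hat[:, None, :] - Z_hat[None, :, :]) ** 2).sum(-1)
print(np.abs(D - D_hat).max())
```

The distributed algorithm (3.13) estimates exactly the p principal eigenpairs of this matrix M, from which each node can read off its own coordinates.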
Figure 3.6. Mean number of projections per iteration n for different values of the probability q.

Finally, we show a more detailed convergence analysis of algorithm (3.13) for a fixed value of q in the considered scenario. Figure 3.8 illustrates the convergence of the estimated eigenvalues and eigenvectors. In that case, we set q = 0.8. Since we consider the two-dimensional case (p = 2), Figure 3.8 shows the orthonormality of the estimated matrix through UnᵀUn and the two corresponding eigenvalues diag(UnᵀMUn). Indeed, the desired unit norm of the estimated eigenvectors is reported in Figure 3.8 (c) and the orthogonality between the two estimated eigenvectors in Figure 3.8 (d). In addition, we expect convergence towards the true eigenvalues, λ1 = 1.1 and λ2 = 0.57, when regarding the trajectories of un,1ᵀ M un,1 and un,2ᵀ M un,2 respectively.

Figure 3.7. RMSE for the centralized case with the power method of [73, p. 406] against its distributed version [92] and the proposed distributed on-line Oja’s algorithm (3.13) (also called doOja).

(a) Trajectory of the estimated λ1.
(b) Trajectory of the estimated λ2. (c) Trajectories of ‖un,1‖ and ‖un,2‖. (d) Trajectory of un,1ᵀ un,2.

Figure 3.8. Convergence analysis of the estimated eigenspace with the proposed distributed on-line Oja’s algorithm (3.13).

Chapter 4

Application to self-localization in WSN

In this chapter we investigate the problem of localization in wireless sensor networks (WSN) as a particular application of the principal component analysis (PCA) framework described in the previous Chapter 3. The link between PCA and localization, which may not be immediately clear, is given by the following framework: wireless sensor devices are able to obtain received power measurements that can be related to a ranging model depending on the inter-sensor distances and, finally, from square distances one may find the underlying network configuration by applying PCA (the multidimensional scaling method). Thus, we focus on ranging techniques using the received signal strength indicator (RSSI), since it does not require additional hardware and/or synchronization compared to time difference of arrival (TDOA) and angle of arrival (AOA) techniques. Moreover, we focus on RSSI-based techniques that make each sensor node able to localize itself (see the scheme in Figure 4.1). Since WSN are composed of electronic devices (also called motes) characterized by their low cost, size and power consumption, distributed processing based on on-line measurements and asynchronous communications is well adapted to this scenario. The problem of self-localization involving low-cost radio devices in WSN can be viewed as an example of the internet of things (IoT).
The evolution of embedded systems and smart grids over the last 50 years has contributed to enabling WSN to integrate into the emerging IoT ecosystem. Recently, advanced applications handling specific tasks require networking support in order to design cloud-based architectures involving sensor nodes, computers and other remote components. Among the large range of applications, location services can be provided by small devices carried by persons or deployed in a given area. Information about the sensor nodes’ positions may be used for purposes such as routing and querying, which can be adapted or controlled according to the given positions.

Throughout this chapter some variables are used regularly; we summarize their notation in the table below. Then, before the description of the framework, we highlight the main contributions of this chapter.

N, M : number of unknown sensor positions and of anchors/landmarks
zi : row vector in R^{1×p} of any sensor position i (in practice, p = 2 or 3)
xi, yi : abscissa and ordinate values of any unknown sensor position zi
Z : N × p matrix whose row-elements are the positions zi
Ak : row vector in R^{1×p} of any anchor/landmark k
Ni, Mi : set of neighbor sensors of i and of its neighboring anchors
ai, bi : abscissa and ordinate values of any anchor/landmark position i
PL0 : path loss parameter of the RSSI model
η : path loss exponent parameter of the RSSI model
σ² : noise variance parameter of the RSSI model
l × h m² : dimensions of a given indoor area of length l and height h
T, Ti : number of observations (equal for all nodes or different at each node i)
t, n : time and iteration index of the observed or estimated data
dij : distance between any pair of nodes i and j
‖·‖, (·)ᵀ, ⟨·⟩ : Euclidean norm, transpose operator and scalar product

4.1 Contributions

We provide the following contributions.
• We adapt and design the distributed algorithm proposed in Chapter 3 for the WSN localization problem, assuming sparse RSSI measurements across the sensor nodes. The position of each sensor node is estimated without prior knowledge of landmarks, i.e., anchor nodes are not required.

• We provide numerical results on the position accuracy in two cases: when data is simulated following a known distribution, and when data is collected from a real testbed in an indoor scenario (at the FIT IoT-LAB platform [1]).

• We include a distributed refinement phase in order to improve the position accuracy and to lead each sensor node to obtain a local map of itself and its neighbors. This optional step may be especially useful when some landmarks surround the WSN. The refinement algorithm is first implemented on the FIT IoT-LAB testbed [2] with initial positions obtained by the algorithm proposed in this chapter. Then, we test it on three different indoor scenarios ([2] and [3]).

The chapter is organized as follows. In the first section we introduce the framework, the received signal model used in WSN to extract position information, and some experimental results obtained in a real WSN scenario. The following sections recall the classical and more recent techniques dealing with this problem. Then we focus on presenting our distributed on-line algorithm along with numerical results based on simulated and real data.

4.2 Received signal model and testbed description

4.2.1 Ranging-based approaches

Consider N agents (e.g., sensor nodes or other electronic devices) seeking to estimate their respective positions, defined as {z1, · · · , zN}. Note that positions are generally expressed in 2 or 3 dimensions in the wireless sensor context, i.e., zi ∈ R^p with p = 2, 3 for i = 1, . . . , N.
The goal is to design a distributed and on-line algorithm enabling each sensor node to estimate its position zi from noisy measurements of the distances, i.e., a ranging technique. We assume that agents only have access to noisy measurements of their relative RSSI values. The RSSI can be related to the Euclidean distance d through a statistical model. Figure 4.1 summarizes the framework of this chapter and the relation between these quantities, namely RSSI, d and z.

Figure 4.1. Framework of the RSSI-based ranging techniques.

We denote by dij the Euclidean distance between any pair of nodes i, j ∈ {1, · · · , N}, defined as dij = ‖zi − zj‖. Before introducing the RSSI model, it is worth noting that this problem is in fact ill-posed. Since the only input data are distances, exact positions are identifiable only up to a rigid transformation. Indeed, the quantities (dij)∀i,j are preserved when an isometry is applied to the agents’ positions:

i) when the positions are affected by a translation t, the distance between two shifted positions z′i and z′j is: d′ij = ‖z′i − z′j‖ = ‖zi + t − (zj + t)‖ = dij;

ii) when the positions are affected by a rotation/reflection denoted by the p × p orthogonal matrix R (i.e., a unitary matrix such that R⁻¹ = Rᵀ), the distance between two rotated positions z′i and z′j is: d′ij = ‖z′i − z′j‖ = ‖zi R − zj R‖ = dij.

The problem is generally circumvented by assuming a minimum number of anchors, also named landmarks (sensor nodes whose GPS positions are known), e.g., M = 3 or 4 when p = 2, and using this prior knowledge to resolve the indeterminacy. Thus, we divide the existing methods into two groups according to this point: the anchor-based methods and the anchor-free methods (see Section 4.3). The first group gives a position in the absolute reference system (e.g.,
GPS coordinates), and the second group gives a set of positions in a relative reference system (e.g., the origin is taken as the barycenter of the sensor nodes’ positions). Once the relation between positions and Euclidean distances is assumed, we now describe the framework used to estimate any distance d from the measurements available in wireless sensor networks, i.e., the RSSI.

4.2.2 Log-normal shadowing model (LNSM)

We recall the statistical model used to describe the received signal strength indicator (RSSI) data as a function of the distances between the sensor nodes. The log-normal shadowing model (LNSM) is based on the log-distance path loss model (see [8], [46], [136] for details), which describes the average path loss PL(d) at a distance d, expressed in dB, as:

PL(d) = PL0 + 10 η log10(d/d0)   (4.1)

where the parameters η, d0 and PL0 depend on the environment. The parameter η depends on the propagation medium. For instance, η = 2 is considered in free space, while η varies from 1.6 to 6 in indoor and more complex environments (see [136] and [72]). The parameters d0 and PL0 are the reference distance (typically 1 m in indoor environments [136]) and its corresponding path loss value. In order to capture the random shadowing effects that may occur at different locations having the same distance separation, the LNSM adds a Gaussian random variable (r.v.) ε ∼ N(0, σ²) to the average path loss PL(d) given by (4.1). Thus, the RSSI is defined as the noisy received power, given an emitted power PT and an average path loss PL(d) at distance d, expressed in dBm units: P(d) = PT − PL(d) + ε. Since the presented experimental results deal with indoor environments, we assume d0 = 1 m in our RSSI signal model. In addition, the sensor nodes considered in all our experimental campaigns are equipped with the CC2420¹ radio frequency (RF) transceiver (i.e.,
a device comprising both a transmitter and a receiver part) whose transmit power PT is typically 0 dBm. Taking into account these two specifications, the general expression of the LNSM for the RSSI r.v., denoted P, is:

P = −PL0 − 10 η log10 d + ε ∼ N(−PL0 − 10 η log10 d, σ²).   (4.2)

¹ Technical specifications: http://www.ti.com/lit/ds/symlink/cc2420.pdf.

4.2.3 Distance estimation

From the previous distribution, maximum likelihood (ML) estimation can be used to estimate the environmental parameters PL0, η and σ². When collecting several RSSI values associated with different known distances, we obtain the estimates P̂L0 and η̂ related to the mean path loss value, and the estimated variance σ̂² from the corresponding squared residuals. Let K be the number of known distance values (dk)k=1,...,K and let T be the number of RSSI values (Pk(t))t=1,...,T collected for each distance dk. If the KT total samples are drawn from the distribution (4.2), the ML optimization problem is written as follows:

max_{(PL0, η, σ²)} − (KT/2) ln(2πσ²) − (1/(2σ²)) Σ_{k=1}^{K} Σ_{t=1}^{T} (Pk(t) + PL0 + 10 η log10 dk)².   (4.3)

The environmental parameters PL0 and η can be estimated first by a least squares (LS) or maximum likelihood (ML) method, since for the normal distribution both estimators are identical. Indeed, equation (4.2) can be viewed as a linear model of the form y = −PL0 − ηx corrupted by additive noise when setting x equal to 10 log10 d. Writing (4.3) in vector notation, the estimator solves the minimization problem:

(P̂L0, η̂) = arg min_{(PL0, η)} ‖P − Lα‖²

where:

P = ((1/T) Σ_{t=1}^{T} P1(t), . . . , (1/T) Σ_{t=1}^{T} PK(t))ᵀ,
L = (−1, −10 log10 d1 ; . . . ; −1, −10 log10 dK),
α = (PL0, η)ᵀ.

The solution is then obtained as follows:

(P̂L0, η̂)ᵀ = (LᵀL)⁻¹ Lᵀ P.   (4.4)

Finally, the variance σ² is computed by plugging the estimates (4.4) into (4.3):
σ̂² = (1/(KT)) Σ_{k=1}^{K} Σ_{t=1}^{T} (Pk(t) + P̂L0 + 10 η̂ log10 dk)².   (4.5)

Given the environmental parameters estimated by (4.4) and (4.5), and using the model (4.2) for a collection of T RSSI values whose mean value is denoted by P̄, the maximum likelihood estimator of any of the distances involved in (4.3) is a biased estimator:

dˆ1 = 10^{(−P̄ − PL0)/(10η)} = d · 10^{ε̄/(10η)}   where ε̄ ∼ N(0, σ²/T).   (4.6)

Figure 4.2. Biased estimator dˆ1 (4.6) as a function of the true distance d when considering two levels of noise represented by the factor σ/η and when varying the number of collected RSSI samples T.

Due to the distribution of the signal model, the additive average noise term ε̄ becomes a multiplicative random term 10^{ε̄/(10η)} whose mean is not equal to one. For a normal r.v. ε of variance σ² and for any real constant γ, the following property holds:

E[10^{γε}] = 10^{γ²σ² ln(10)/2}   (4.7)

Hence, denoting by C the mean of this multiplicative noise, the bias of the estimator (4.6) is equal to d(C − 1), which tends to zero when T >> σ/η, as shown in Figure 4.2. As expected, the bias increases with the distance but decreases when the number of samples T increases, being nearly negligible for T = 10 and for both noise levels (σ/η = 1.5 and σ/η = 4). Taking into account the noise variance σ², the unbiased estimator of the distance can be defined as follows:

dˆ2 = dˆ1 / C = (d · 10^{ε̄/(10η)}) / C   =⇒   E[dˆ2] = d,   where C = 10^{σ² ln(10)/(2T(10η)²)}.   (4.8)

For the same reason, due to the multiplicative noise, the variance of the estimator grows linearly with the square of the distance. Nevertheless, the environmental parameters σ² and η and the number of RSSI samples T affect the behavior of the bias and the variance through the value of C.
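Equations (4.2) to (4.8) can be chained into a small end-to-end sketch of ours (not from the thesis): simulate RSSI samples from the LNSM with arbitrary "true" parameter values, fit (PL0, η) by least squares as in (4.4), estimate σ² as in (4.5), and apply the bias correction C of (4.8) to a distance estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
PL0, eta, sigma = 40.0, 2.5, 4.0               # assumed "true" environment parameters
d_cal = np.array([1.0, 2.0, 4.0, 8.0, 12.0])   # known calibration distances (K = 5)
T = 50                                         # RSSI samples per distance

# Draw T RSSI samples per distance from the LNSM (4.2): P = -PL0 - 10 eta log10(d) + eps
P = (-PL0 - 10 * eta * np.log10(d_cal))[:, None] + sigma * rng.standard_normal((len(d_cal), T))
P_bar = P.mean(axis=1)

# Least squares fit (4.4): P_bar ~ L @ (PL0, eta)^T
L = np.column_stack([-np.ones_like(d_cal), -10 * np.log10(d_cal)])
PL0_hat, eta_hat = np.linalg.lstsq(L, P_bar, rcond=None)[0]

# Residual variance (4.5)
sigma2_hat = ((P + PL0_hat + 10 * eta_hat * np.log10(d_cal)[:, None]) ** 2).mean()

# Bias-corrected distance estimate (4.8) from T new RSSI samples at an unknown distance
d = 6.0
P_new = -PL0 - 10 * eta * np.log10(d) + sigma * rng.standard_normal(T)
d1 = 10 ** ((-P_new.mean() - PL0_hat) / (10 * eta_hat))       # biased ML estimator (4.6)
C = 10 ** (sigma2_hat * np.log(10) / (2 * T * (10 * eta_hat) ** 2))
d2 = d1 / C                                                   # unbiased estimator (4.8)
print(PL0_hat, eta_hat, d1, d2)
```

With T = 50 the correction factor C is very close to 1, consistent with the regime T >> σ/η in which the bias of (4.6) vanishes.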
The variance of the unbiased estimator (4.8) coincides with its mean square error (MSE):

MSE(dˆ2) = var(dˆ2) = d²(C² − 1) → 0 ⇐⇒ T >> σ/η.   (4.9)

It is easy to see that, when taking the biased version (4.6), its variance is the same as in (4.9) multiplied by a factor C². We summarize the main statistics of both estimators dˆ1 and dˆ2 in the following table:

dˆ     B(dˆ)      var(dˆ)         MSE(dˆ)
dˆ1    d(C − 1)   d²C²(C² − 1)    d²C²(C² − 1) + d²(C − 1)²
dˆ2    0          d²(C² − 1)      d²(C² − 1)

Table 1. Bias (B(dˆ)), variance (var(dˆ)) and mean square error (MSE(dˆ) = var(dˆ) + B(dˆ)²) of each estimator as a function of the distance d and the constant C defined in (4.8).

Note from Table 1 that MSE(dˆ1) = C² MSE(dˆ2) + d²(C − 1)² > MSE(dˆ2).

Figure 4.3 highlights the effect of the trade off between the environmental parameters (σ/η) and the number of RSSI values (T) when choosing the biased or the unbiased estimator of the distance. We draw 1000 samples of (4.6) and (4.8) and compute their empirical MSE values. We consider the values 1.5 and 4 for the environmental factor σ/η, as the values estimated in our real measurement settings (see Table 7) are between 1.6 and 3.

Figure 4.3. Mean square error (MSE) of the biased estimator (4.6) and the unbiased estimator (4.8) as a function of the true distance d when considering two levels of noise represented by the factor σ/η (top σ/η = 1.5 and bottom σ/η = 4) and when varying the number of collected RSSI samples T (left T = 1, middle T = 5 and right column T = 10).

As displayed in Figure 4.3, the variance of the biased estimator is larger than the variance of the unbiased one, inflated by the multiplying factor C² defined in (4.8).
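The closed forms in Table 1 can be checked by Monte Carlo; the sketch below is ours, using the high-noise setting σ/η = 4 with T = 1 and an arbitrary true distance.

```python
import numpy as np

rng = np.random.default_rng(6)
eta, T, d = 2.0, 1, 8.0
sigma = 4 * eta                      # high-noise case sigma/eta = 4
C = 10 ** (sigma**2 * np.log(10) / (2 * T * (10 * eta) ** 2))

# Monte Carlo samples of both estimators at true distance d
eps_bar = sigma / np.sqrt(T) * rng.standard_normal(200000)
d1 = d * 10 ** (eps_bar / (10 * eta))      # biased estimator (4.6)
d2 = d1 / C                                # unbiased estimator (4.8)

mse1 = ((d1 - d) ** 2).mean()
mse2 = ((d2 - d) ** 2).mean()
# Closed forms from Table 1:
mse1_th = d**2 * C**2 * (C**2 - 1) + d**2 * (C - 1) ** 2
mse2_th = d**2 * (C**2 - 1)
print(mse1, mse1_th, mse2, mse2_th)
```

The empirical values match the closed forms, and the sample mean of dˆ2 is close to the true distance d, confirming that the correction by C removes the bias.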
For instance, in the on-line case (\(T = 1\)), the factor \(C^2\) equals 1.1 at the low noise level (\(\frac{\sigma}{\eta} = 1.5\)) and 2 at the high noise level (\(\frac{\sigma}{\eta} = 4\)). For a fixed value of \(\frac{\sigma}{\eta}\), the variances decrease as the number of samples \(T\) increases. Note that in low-noise environments the gap between the biased and the unbiased estimator is rather small even for \(T = 1\), since \(C^2\) is close to one. At higher noise levels, increasing the number of collected samples becomes the way to bring the behavior of the biased estimator close to that of the unbiased one. In conclusion, Figures 4.2 and 4.3 show that for on-line data acquisition (\(T = 1\)) the unbiased estimator should in general be preferred. However, depending on the required accuracy, the biased estimator may still be useful for distances below 5 m in the high-noise case, and for distances up to 15 m if the noise level is relatively low. Note that estimating the parameter \(\sigma^2\) may be restrictive in real scenarios, as it may not remain stable during the processing time, especially in the indoor case. Thus, the choice of distance estimator can be determined by the localization algorithm considered and the scenario involved. As we will see in the overview of existing localization techniques in Section 4.3, some of them require an estimate of the squared distance instead of the plain distance; in such cases the construction is analogous to the estimators (4.6) and (4.8).
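For the squared-distance case just mentioned, the same moment property (4.7) supplies the correction factor: the naive estimator \(10^{(-\bar P - PL_0)/(5\eta)}\) has mean \(d^2C^4\), so dividing by \(C^4\) makes it unbiased. A short simulation (illustrative parameters, not the testbed values) confirms this:

```python
import numpy as np

rng = np.random.default_rng(2)
PL0, eta, sigma, d, T, runs = 40.0, 2.0, 6.0, 8.0, 5, 300_000  # illustrative
C = 10 ** (sigma**2 * np.log(10) / (2 * T * (10 * eta) ** 2))  # as in (4.8)

# Empirical mean of T RSSI samples: P = -PL0 - 10*eta*log10(d) + eps.
P_bar = -PL0 - 10 * eta * np.log10(d) + rng.normal(0, sigma / np.sqrt(T), runs)

D_naive = 10 ** ((-P_bar - PL0) / (5 * eta))   # mean d^2 * C^4 (biased)
D = D_naive / C**4                              # unbiased squared-distance estimate

print(f"E[D_naive]: theory {d**2 * C**4:.2f}, empirical {D_naive.mean():.2f}")
print(f"E[D]:       theory {d**2:.2f}, empirical {D.mean():.2f}")
```

The factor \(C^4\) follows from (4.7) with \(\gamma = \frac{1}{5\eta} = \frac{2}{10\eta}\), whose squared exponent is four times that of \(C\).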
Upon noting the property (4.7) and the definition of the constant \(C\) in (4.8), the unbiased estimator of the squared distance is defined as:
\[
D = \frac{10^{\frac{-\bar P - PL_0}{5\eta}}}{C^4} = \frac{d^2\,10^{\frac{\bar\varepsilon}{5\eta}}}{C^4} \;\Longrightarrow\; \mathbb E[D] = d^2\,, \qquad \mathrm{var}[D] = d^4(C^8-1)\,. \qquad (4.10)
\]
Since the objective of this chapter is to evaluate the theoretical model and the proposed distributed algorithm on data coming from a real WSN deployment, the following section introduces the management of the platform and the procedure used to run an experiment on the set of selected sensor nodes appearing in our numerical results. Before discussing localization techniques, we show some of our numerical results in order to point out the main issues that arise in practice in a real indoor scenario. Indeed, during the last ten years several works have discussed the LNSM and its relevance in different types of environments (see for instance [62], [127] or, more recently, [159], among a long list).

4.2.4 FIT IoT-LAB: platform of wireless sensor nodes

1) Platform description

In order to obtain real RSSI values we make use of the FIT IoT-LAB platform deployed at Rennes, illustrated in Figure 4.4. The 256 WSN430 open nodes available on the platform comply with the IEEE 802.15.4 (ZigBee) standard operating at 2.4 GHz. The sensor nodes are located in two storage rooms of size 6 × 15 m² containing various objects. They are attached to the ceiling, 1.9 m above the floor, in a grid layout, as shown in Figure 4.4 (see more details on the website [1]). The WSN430 nodes support the open-source operating system (OS) Contiki³, which is designed for networked and memory-constrained systems with a particular focus on low-power wireless IoT devices. It is used to design embedded systems requiring IPv4/IPv6 (Internet Protocol) or Rime (a lightweight custom networking protocol) communication features.

² See: https://github.com/iot-lab/iot-lab/wiki/Hardware_Wsn430-node
³ Details: https://github.com/iot-lab/iot-lab/wiki/Contiki-support
Figure 4.4. FIT IoT-LAB platform at Rennes using WSN430 open nodes: (a) view of the platform hosted at Rennes; (b) schematics of the WSN430 node².

Since each sensor node is uniquely identified by a Rime address (a 2-byte assignment), one can make use of the primitives provided by this alternative network stack, which is specialized for low-power wireless systems. Thus, the code available on the website [1] can be adapted and loaded as a firmware file (ihex extension) onto the selected sensor nodes so that they can communicate under a designed protocol. In our experimental results we only required Rime addresses to enable the selected sensor nodes to communicate. Figure 4.5 displays the main steps of an experiment performed on the FIT IoT-LAB platform. Experiments can be launched remotely by selecting the desired sensor nodes available at each moment depending on their status, i.e. free, busy with other users, or unavailable due to technical problems. Once a firmware file has been uploaded to the selected sensor nodes, they are able to communicate and receive packets containing the corresponding RSSI measurements. Once registered, an experiment is run remotely via ssh commands in a terminal window, as shown in Figures 4.5 (c) and 4.5 (d). At the end of the experiment we recover the text file generated by each sensor node, which contains the communications passing through the serial port of the node (see the received RSSI message format in Figure 4.5 (d)).

2) Illustration of RSSI-distance measurements

Figure 4.6 (a) illustrates the behavior of some real measurements from the experimental campaigns on the FIT IoT-LAB testbed (see the experimental results in Section 2) and 2)).
We ran several experiments of 5 min (300 s) each, on different days and at different scheduled times (around 10 am, 3 pm and 10 pm, depending on node availability), involving the 50 nodes shown in Figure 4.7 (a). We set N = 44 and M = 6 in our testbed. Figure 4.6 (a) shows the RSSI values recovered by the anchor node whose Node_id is 157, which received data from 39 neighboring nodes. In order to compare the empirical distribution with the theoretical model recalled in (4.2), we drew 1000 samples from the theoretical LNSM with the parameters estimated by the anchor node LM157 (PL0 = 59.65 dBm, η = 2.3 and σ² = 31.8 dB in Table 7).

Figure 4.5. Workflow to handle an experiment from the user's website profile and to recover the collected data from the user's terminal frontend: (a) create a new experiment; (b) firmware association; (c) experiment's state transitions; (d) recover the data flow.

We compute the maximum and minimum values and plot them in Figure 4.6 (a) along with the real values. Note that the real RSSI values fit the model well for data coming from sensor neighbors close to node LM157, such as nodes 156 and 158 (see the distribution plots in Figures 4.6 (b) and 4.6 (c)). On the contrary, for nodes 247 and 249 a gap of about 10 dB appears between the theoretical mean of the LNSM and the corresponding averaged RSSI values (see Figures 4.6 (d) and 4.6 (e)), as these nodes are considerably further away and placed close to the wall. Such effects are probably related to multi-path propagation and are consequently reproduced on the estimated distances when using the unbiased estimator defined in (4.8). We report in Figure 4.7 the distance estimator based on (4.8) applied to the RSSI data of Figure 4.6 (a). As shown in Figure 4.7 (b), 53.8% of the distances (21 out of 39) are estimated with an error greater than 0.5 m, against 46.2% within that tolerance.
Nevertheless, a third of the 39 estimated distances exhibit an error greater than 1 m (the square markers on the right and the boxed sensor nodes on the left in Figure 4.7).

Figure 4.6. Comparison between the empirical distribution (histogram bars) of the real data displayed in the top figure and the distribution of the model described by (4.2): (a) real RSSI values measured by the anchor node whose Node_id is 157 of the network illustrated in Figure 4.15, together with the maximum and minimum values of the theoretical LNSM (4.2) for each given distance; (b) estimated distance by LM157 from node 156; (c) estimated distance by LM157 from node 158; (d) estimated distance by LM157 from node 247; (e) estimated distance by LM157 from node 249.

These estimated distances correspond to: the sensor nodes located furthest from anchor node 157 and closest to the wall (Node_id 240, 247, 249 and 253); the surrounding sensors of the network from which node 157 received only a low number of RSSI packets (202, 218 and 237); and the sensor nodes located close to the anchor node whose Node_id is 176 (194, 195 and 196), which may directly affect the line-of-sight of node 157.

Figure 4.7. Estimated distances at anchor node LM157 (circled in the left figure) from its neighboring sensor nodes, computed from the real collected data shown in Figure 4.6 (a): (a) network configuration around LM157; (b) estimated distances using (4.8) by LM157. In the right figure, the estimated distances whose error is greater than 1 m are marked and correspond to the sensor nodes emphasized by the boxes in the left figure; the plain line is the true distance (i.e. the distance estimated if σ² = 0) and the dashed lines delimit the estimated distances within an error tolerance of ±0.5 m.

4.3 Overview of some localization techniques

During the last decades the localization problem has attracted a great deal of attention, resulting in an extended list of proposed algorithms within the signal processing community. Indeed, several overview papers dealing with the description and classification of localization techniques have been published in the last ten years (see, for instance, [99], [126] and [105]). In this chapter we recall and summarize the main RSSI-based approaches to recover the network configuration, according to one criterion: whether anchors/landmarks are used (anchor-based) or not (anchor-free). The classification scheme is shown in Figure 4.8. We consider that all of them have a centralized nature, i.e. the problem can be completely described and directly addressed by a central processor.

Figure 4.8. Classification of the existing classical methods for localization.

4.3.1 Centralized techniques

1) Anchor-based methods

The classical techniques solve for a single unknown sensor node position at a time, by means of RSSI values coming from a fixed number M of surrounding anchor nodes (or landmarks) (see the comparisons made by [99] or [71]). We denote the unknown node position by z = (x, y) (two dimensions, p = 2) and any anchor node position by A_i = (a_i, b_i). Since the sensor node only uses information from known positions, z can be expressed in absolute coordinates, i.e. anchor positions in GPS coordinates. In anchor-based methods, the solution (the unknown sensor node position) is obtained from MT measurements, where T denotes the number of RSSI values collected from each of the M anchor nodes. Note that, when dealing with more than one unknown position, e.g. a network of N sensor nodes, the problem may be solved sequentially, i.e. one position at a time. We distinguish two ways to address the problem.

Geometrical point of view

The following methods were first studied for aerospace or aeronautical systems and consider the unknown position as the intersection point of a fixed number M of curves, one per landmark. Although they were originally based on time measurements⁴ (see [137]), the related system of equations can ultimately be described by the set of landmark positions and the distances between the landmarks and the unknown position. If the LNSM is assumed for the RSSI signal, distances can be estimated as described in Section 4.2.1. If distances are perfectly known, the system of equations is described by different geometrical curves: circles (also called trilateration [110]), hyperbolas [137], quadrilaterals (also called min-max [141]) or triangles (known as multilateration [79]). Figure 4.9 illustrates instances of these approaches.

⁴ ToA = Time of Arrival and TDoA = Time Difference of Arrival

Figure 4.9. Classical methods, geometrical point of view: (a) trilateration (M = 3); (b) intersection of three hyperbolas (M = 4); (c) intersection of three quadrilaterals (M = 3); (d) multilateration: the cosine rule applied to 5 triangles, which forms a linear system of 5 equations.

Note that, except in the quadrilaterals case, one landmark is designated as the reference of the coordinate system. As shown in Figure 4.9 (a), trilateration explicitly designates one reference landmark and applies a linear transformation (a translation t and a rotation R, both easily computed) to the whole system in order to solve the problem. In Figures 4.9 (b) and 4.9 (d) the reference landmark is taken into account implicitly in the equations through simple subtractions. Note also that multilateration (geometrically based on the cosine rule) is an extension of trilateration considering M intersecting circles. Indeed, starting from M circle equations, subtracting the equation of the reference landmark from the M − 1 remaining equations and adding the squared distance between the reference landmark and each of the M − 1 remaining landmarks yields the same system of M − 1 equations as the multilateration technique. When considering a noisy scenario, in which the distances between the unknown sensor node and the landmarks are estimated by means of RSSI values following the LNSM, several works have coupled the latter methods with a least squares problem. The most relevant works are those of [118], [140] and [141], which consider multi-hop communications between the sensor nodes: any pair of nodes in the network which are not directly connected can communicate with the help of one or more intermediate/relay nodes. The main drawback of these methods is their dependence on the anchor positions, since the solutions must lie inside the convex hull formed by the set of known positions. In addition, the solution loses robustness since it may be affected by the choice of the reference landmark, e.g. a landmark with an error in its known position or placed too far from the sensor node.
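As an illustration of the subtraction trick just described (a sketch under assumed coordinates, not the thesis code), trilateration with exact distances reduces to a small linear system:

```python
import numpy as np

def trilaterate(anchors, dists):
    """Solve for z = (x, y) from M >= 3 anchor positions and exact distances.

    Subtracting the circle equation of the first (reference) anchor from the
    others yields the linear system 2*(A_i - A_0) . z = d_0^2 - d_i^2 + |A_i|^2 - |A_0|^2.
    """
    anchors = np.asarray(anchors, float)
    d = np.asarray(dists, float)
    a0, d0 = anchors[0], d[0]
    A = 2 * (anchors[1:] - a0)
    b = d0**2 - d[1:]**2 + np.sum(anchors[1:]**2, axis=1) - np.sum(a0**2)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Hypothetical anchors and a target at (3, 4); noiseless distances here.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
target = np.array([3.0, 4.0])
dists = [np.linalg.norm(target - np.array(a)) for a in anchors]
print(trilaterate(anchors, dists))  # approximately [3. 4.]
```

With noisy RSSI-based distances, the same system is solved in the least-squares sense, which is exactly the coupling with a least squares problem mentioned above.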
Statistical point of view

These methods focus on the statistical distribution of the RSSI measurements received from the M landmarks. The goal is to assume a parametric model for the received signal and to apply the maximum likelihood estimator (MLE). Here the LNSM (4.2) is the parametric model assumed for the observed RSSI, and the unknown parameter to estimate is the position z of the sensor node. Let \((P_m(1), \dots, P_m(T))\) be the RSSI values collected from any landmark m. Then the MLE of z (in two dimensions) is:
\[
\hat z = \arg\min_{z=(x,y)} \sum_{m=1}^{M}\sum_{t=1}^{T} \Bigl(P_m(t) + PL_0 + 10\eta \log_{10}\sqrt{(x-a_m)^2 + (y-b_m)^2}\Bigr)^2 \qquad (4.11)
\]
Since (4.11) is a non-convex problem, the solution is affected by the existence of local minima. Iterative algorithms in general need a suitable initialization, e.g. a noisy version of the target position. Indeed, the solution of (4.11) lies at the intersection of the M circles of the form:
\[
(x-a_m)^2 + (y-b_m)^2 = r \qquad \forall m = 1, \dots, M
\]
where r is set to the squared distance (4.10). If the dimensions of the area of interest are known beforehand, one can set a grid of points, evaluate the function in (4.11) at each grid center and set \(\hat z\) to the point where the minimum value is achieved (see for instance [86], [47] or [54]). Note that the accuracy depends on the choice of the grid resolution, and so does the required computational cost. As an alternative, iterative algorithms can be used to solve this non-linear optimization problem, such as the conjugate gradient method used in [128], the Nelder-Mead method used in [42] and, more recently, the Levenberg-Marquardt algorithm in [7]. It is worth noting that the LNSM may not be the most suitable model for more complex indoor scenarios (see [11]). Other works propose the MLE under different statistical models. In [159] the environment is considered to be dynamic and time-varying.
They propose a modified version of the LNSM in which the estimate of the noise variance σ² is progressively adapted. In [53] the modified LNSM considers different environmental parameters (PL0, η and σ²) for each landmark and adds a bias factor to (4.2) in order to model possible outlier/multipath effects. More recently, [55] models the RSSI as a Gaussian mixture of two classes, and the problem (4.11) is solved through an Expectation-Maximization algorithm.

2) Anchor-free techniques

When dealing with distributed sensor networks, e.g. a WSN of N nodes, anchors may be absent or too far away, or the GPS signal may be unavailable, e.g. in indoor scenarios. Nevertheless, sensor nodes can still benefit from the RSSI measurements obtained from their neighbors, whose positions are unknown. The configuration of the network can then be recovered in a relative coordinate system instead of the absolute GPS coordinate system. One should therefore rely on anchor-free methods. When distances between nodes are viewed as similarity metrics, the positioning problem is referred to as multidimensional scaling (MDS). The structure of the distance-like data is related to the underlying geometric configuration of the network, i.e. the goal is to find an embedding of the N nodes such that distances are preserved. For instance, in classical MDS [27, Chapter 12] positions are obtained by principal component analysis (PCA) of the input N × N matrix constructed from the Euclidean distances. If the distances are noisy, e.g. estimated from RSSI measurements as in (4.8), [143] proposes an MDS-MAP algorithm based on the classical MDS problem: the WSN localization problem is solved by enabling each sensor node to infer all the estimated pairwise distances. It is worth noting that the MDS problem for WSN localization is related to rigid graph theory. Indeed, rigid graph theory explores the property of a given network configuration of being an embedding of a graph in a Euclidean space, i.e. the globally rigid property.
This property, also known as fold-freedom (see [77] or [44]), suffers from two problems: non-uniqueness and NP-hard complexity. To overcome these issues, [130] proposed a two-phase algorithm based on a fold-free initial configuration combined with a mass-spring model to refine the estimated positions. Alternative approaches in the localization context are based on optimization techniques. In metric or modern MDS, positions are obtained by the stress majorization algorithm called SMACOF (see [27, Chapter 8] and [101]); the minimization is performed using an auxiliary quadratic function that majorizes the stress function. Semidefinite programming (SDP) with convex constraints can be found in [57] and [24]. Recently, the works [41] and [85] gave further error-bound analyses of the SDP approach. In general, even if these centralized techniques achieve high accuracy and solve the N unknown positions at once, they have a high computational cost and may be especially complex to implement in a real wireless sensor network: classical MDS and SDP require about O(N³) operations.

4.3.2 Distributed approaches

A distributed batch version of the SMACOF algorithm based on a round-robin communication scheme is proposed in [45]. Following a sequential approach, the global function to minimize is computed at each iteration, with each node aggregating its local estimate along a cyclic path through the network. In their approach, all nodes have to broadcast their estimated positions at each iteration before the updating cycle of the stress function, and the anchor positions are required throughout the iterative process. Their numerical results on real data show a root-mean-square error (RMSE) of 2.48 m on the same testbed considered in [127], where a centralized MLE achieved an RMSE of 2.18 m.
The indoor scenario consists of N = 40 sensor nodes and M = 4 anchors deployed over 14 × 14 m², and the estimated parameters are η = 2.3 with a low noise variance σ² = 3.92 dB. Since [45] considers the minimization of the non-convex stress function, the same distributed approach (batch and incremental) is presented in [63], but with a quadratic criterion that bypasses the non-convexity issue. The function to minimize consists of two terms: a first quadratic term pushing each sensor node towards the center of the polygon defined by its neighbors' positions, and a second regularizing term which includes the information from the anchor nodes. The iterative algorithm performs two steps at each cycle: first all nodes carry out their local optimization step, and then they broadcast their estimated positions to all neighbors. The authors consider simulated data from a network composed of N = 72 sensor nodes and M = 8 anchors deployed over 90 × 90 m². They achieve a localization mean error of 5.12 m, an improvement of 31% over the corresponding value of [45]. The authors of [24] propose a distributed implementation of their SDP-based localization algorithm. In [25] the network is divided into several clusters, each with at least two anchor nodes and a large number of sensor nodes (N = 1000-2000 nodes are considered), and the SDP problem is then addressed locally at each cluster. A first step obtains a uniform partition of the network based on the geographical positions of the sensors. Then the convex optimization problem is solved at the cluster level. Depending on the accuracy obtained for their estimated positions, sensor nodes become "anchors" according to a given tolerance. The newly promoted "anchors" are used to update the clusters, and the process of solving local SDPs is repeated for several rounds. Note that [25] is a parallelization of [24], since the core computation is done at cluster level, i.e. one sensor node is designated to solve the problem in each cluster.
Although it is mentioned that neighboring clusters need to communicate which nodes have been positioned at each round, [25] gives no further details about how the computation and communication processes are handled; simulations on synthetic data illustrate the performance of the algorithm. More recently, gossip-based algorithms have been proposed in [39], [152] and [35]. The works [39] and [152] are both based on Kalman filtering applied to noisy measurements coming from anchor and non-anchor nodes. In [39], a diffusion scheme represented by a deterministic matrix, assumed to be column-stochastic, is used to solve the minimization problem; the authors show a nearly zero mean error on simulated data involving a simple network of N = 8 sensor nodes and M = 4 anchors in a 1 × 1 m² square. The work [152] focuses on mobile networks and proposes a cooperative algorithm to solve the convex optimization problem on mobile robots: each robot computes its estimated position at each iteration from all the quantized data coming from the other robots. The algorithm proposed in [35] is based on a distributed gradient descent with a time-varying step size, performed in two stages. First, the step size is computed by an iterative averaging procedure relying on a deterministic, doubly stochastic matrix, in order to reach consensus on the optimal Barzilai-Borwein step size. Once the averaging cycles provide a sufficiently accurate step size, each node receives the latest estimated positions from its neighbors and performs the gradient update locally. Their simulations on a network of N = 10 nodes in a 1 × 1 m² area show a decreasing error on the estimated positions, but require a large total number of communications (several averaging cycles for every distributed gradient descent step).
A distributed version of the spring-model algorithm of [130] can be found in [162], where all sensor nodes broadcast their local estimates at each iteration before each updating step. Alternatively, the two iterative algorithms [92] and [80] introduce a sparsification model enabling each sensor node to select only a small set of observations from its neighbors at each time. In [92] the sparsification model is based on a uniformly random choice of the observations, while the authors of [80] define a threshold determining the level of sparsity and apply it to the incoming data. These works, however, follow two different approaches: [92] introduces a sparsification model on the observations in order to decentralize the PCA algorithm, whereas [80] is based on non-linear kernel learning, where the optimization problem is solved by a gradient descent algorithm with constant step size. Other works address the MDS problem for distributed WSN localization. The MDS-MAP proposed in [143] was later improved in [142] for larger network sizes (experimental results with N = 200). In [142] each sensor node applies the MDS-MAP of [143] to its local map, and the local maps are then merged sequentially to recover the global map. Although the accuracy is improved, the complexity remains high due to the spectral decomposition involved in MDS, i.e. about O(k³), where k is the average number of neighbors per node.

4.4 Distributed MDS-MAP approach

In this section we describe the proposed distributed on-line algorithm for WSN localization based on the MDS-MAP approach. Our algorithm is asynchronous and encompasses the case of random link failures and of random, noisy and sporadic RSSI measurements. First, we briefly recall the MDS-MAP algorithm. Then, the on-line version is obtained by using Oja's algorithm [122, 29] (see Section 3.4 in Chapter 3). The last part of this section is devoted to the asymptotic convergence analysis of the proposed algorithm.
4.4.1 The framework: centralized batch MDS

Define S as the N × N matrix of squared relative distances, i.e. \(S(i,j) = d_{ij}^2\). Define \(\bar z = \frac{1}{N}\sum_{i=1}^N z_i\) as the center of mass (or barycenter) of the agents. Upon noting that \(d_{ij}^2 = \|z_i - \bar z\|^2 + \|z_j - \bar z\|^2 - 2\langle z_i - \bar z,\, z_j - \bar z\rangle\), one has:
\[
S = c\mathbf 1^T + \mathbf 1 c^T - 2ZZ^T \qquad (4.12)
\]
where \(\mathbf 1\) is the N × 1 vector whose components are all equal to one, \(c = (\|z_1-\bar z\|^2, \cdots, \|z_N-\bar z\|^2)^T\), and the i-th row of the matrix Z coincides with the row vector \(z_i - \bar z\). Otherwise stated, the i-th row of Z coincides with the barycentric coordinates of node i. Define \(J = \mathbf 1\mathbf 1^T/N\) as the orthogonal projector onto the linear span of \(\mathbf 1\), and \(J^\perp = I_N - J\) as the projector onto the space of vectors with zero sum, where \(I_N\) is the N × N identity matrix. It is straightforward to verify that \(J^\perp Z = Z\). Thus, introducing the matrix
\[
M \triangleq -\frac{1}{2}\, J^\perp S J^\perp, \qquad (4.13)
\]
equation (4.12) implies that \(M = ZZ^T\). In particular, M is symmetric, non-negative and has rank (at most) p. The agents' coordinates can be recovered from M (up to a rigid transformation) by recovering the principal eigenspace of M, i.e. the vector space spanned by the p principal eigenvectors (see [27, Chapter 12]). Denote by \(\{\lambda_k\}_{k=1}^N\) the eigenvalues of M in decreasing order, i.e. \(\lambda_1 \ge \cdots \ge \lambda_N\). In the sequel we shall always assume that \(\lambda_p > 0\), meaning that M has full column rank p < N. Denote by \(\{u_k\}_{k=1}^p\) the corresponding unit-norm N × 1 eigenvectors and set \(\mathcal Z = (\sqrt{\lambda_1}\,u_1, \cdots, \sqrt{\lambda_p}\,u_p)\). Clearly \(M = \mathcal Z\mathcal Z^T\), and \(Z = R\mathcal Z\) for some matrix R such that \(RR^T = I\). Otherwise stated, \(\mathcal Z\) coincides with the barycentric coordinates Z up to an orthogonal transformation. In practice, the matrix S is usually not perfectly known and must be replaced by an estimate \(\hat S\). This yields Algorithm 3 (see [27, Chapter 12]).

Algorithm 3: Centralized batch MDS for localization
Input: Noisy estimates of the squared distances \(D_{ij}\) defined by (4.10) for all pairs i, j.
1. Compute the matrix \(\hat S = (D_{ij})_{i,j=1,\dots,N}\).
2. Set \(\hat M = -\frac{1}{2} J^\perp \hat S J^\perp\).
3. Find the p principal eigenvectors \(\{u_k\}_{k=1}^p\) and eigenvalues \(\{\lambda_k\}_{k=1}^p\) of \(\hat M\).
Output: \(\hat Z = (\sqrt{\lambda_1}\,u_1, \cdots, \sqrt{\lambda_p}\,u_p)\)

Note that if not all N² pairwise distances can be obtained at once, due to the size of the network and the different connectivity radii of the sensor nodes, the matrix (4.12) is incomplete (it contains null entries) and the eigendecomposition cannot be computed. To overcome this problem, the MDS-MAP algorithm (see [143], and [85] for an error-bound analysis) introduces a first step in which each sensor node, or a central node, computes all pairwise distances using multi-hop communications, finding all shortest paths (by algorithms such as Dijkstra's or Floyd's, whose time complexity is O(N³)) in order to construct the complete similarity matrix (4.12). Since MDS-MAP may be limited by its accuracy and computational cost, the alternative approach proposed in [142] applies the basic MDS Algorithm 3 locally at each sensor node: each sensor node uses the information of all its immediate neighbors, which form a complete network, and is thus able to compute a local complete similarity matrix. Once the local set of relative positions is estimated, an iterative/sequential phase merges the obtained local maps. Thus the accuracy of the local maps is preserved, and the merging phase provides a refinement of the global map positions.

4.4.2 Centralized on-line MDS

In the batch Algorithm 3 above, all measurements are made prior to the estimation of the coordinates. From now on, observations are not stored in the system's memory: they are deleted after use. Thus, agents gather measurements of their relative distances to other agents and simultaneously estimate their positions.

1) Observation model: sparse measurements

We introduce a collection of independent r.v.
\((P_{ij}(n) : i, j = 1, \cdots, N,\ n \in \mathbb N)\) such that each \(P_{ij}(n)\) follows the LNSM (4.2) described in Section 4.2.2. We let \(D_n(i,j)\) be the r.v. associated with each measurement \(P_{ij}(n)\) through the unbiased estimator (4.10) of the squared distance, i.e. \(D_n(i,j)\) is equal to \(10^{\frac{-P_{ij}(n)-PL_0}{5\eta}}/C^4\) and satisfies \(\mathbb E[D_n(i,j)] = d_{ij}^2\). We set \(D_n(i,i) = 0\).

Definition 4.1 (Sparse measurements). At each time instant n, we assume that with probability \(q_{ij}\) an agent i is able to obtain an estimate \(S_n(i,j)\) of the squared distance to another agent \(j \ne i\), and makes no observation otherwise. Thus, one can represent the available observations as the product \(S_n(i,j) = A_n(i,j)D_n(i,j)\), where \((A_n)_n\) is an i.i.d. sequence of random matrices whose components \(A_n(i,j)\) follow the Bernoulli distribution with parameter \(q_{ij}\). Stated otherwise, node i observes the i-th row of the matrix \(A_n \circ D_n\) at time n, where \(\circ\) stands for the Hadamard product.

Lemma 4.1. Assume \(q_{ij} > 0\) for all pairs i, j. Set \(W := [q_{ij}^{-1}]_{i,j=1}^N\) and let \(A_n\), \(D_n\) be defined as above. The matrix
\[
S_n = W \circ A_n \circ D_n \qquad (4.14)
\]
is an unbiased estimate of S, i.e. \(\mathbb E[S_n] = S\).

Proof. Each entry \(S_n(i,j)\) of the matrix \(S_n\) is equal to \(q_{ij}^{-1} A_n(i,j) D_n(i,j)\). As the random variables \(A_n(i,j)\) and \(D_n(i,j)\) are independent, by the above definition of \(D_n\) and since \(\mathbb E[A_n(i,j)] = q_{ij}\), it follows that \(\mathbb E[S_n(i,j)] = d_{ij}^2\).

2) Oja's algorithm for the localization problem

In this section we adapt the projected Oja's algorithm presented in Section 3.4 of Chapter 3 to solve the MDS-based localization problem. The p > 1 largest eigenvectors and eigenvalues can be estimated by an extension of Oja's algorithm [122]. As a consequence of Lemma 4.1, an unbiased estimate of the matrix M defined in (4.13) is simply obtained by \(M_n = -\frac{1}{2} J^\perp S_n J^\perp\).
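To make this pipeline concrete, the following sketch (illustrative parameters; \(D_n\) is taken noiseless here, whereas in the thesis it is itself the RSSI-based estimator (4.10), which only adds independent noise) averages the unbiased sparse estimates \(S_n\) of Lemma 4.1 and feeds the result to the batch MDS of Algorithm 3:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, n_obs = 8, 2, 200_000

# Ground-truth configuration and matrix S of squared pairwise distances.
Z = rng.uniform(0, 10, size=(N, p))
S = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=2)

# Sparse observations S_n = W o A_n o D_n of Lemma 4.1 (here D_n = S).
q = rng.uniform(0.2, 0.8, size=(N, N))
W = 1.0 / q
A = rng.random((n_obs, N, N)) < q            # Bernoulli(q_ij) masks A_n
S_hat = (A * (W * S)).mean(axis=0)           # empirical mean of the S_n

# Algorithm 3: double centering + top-p eigendecomposition.
J_perp = np.eye(N) - np.ones((N, N)) / N
M_hat = -0.5 * J_perp @ S_hat @ J_perp       # eq. (4.13)
vals, vecs = np.linalg.eigh(M_hat)           # ascending eigenvalue order
Z_hat = vecs[:, -p:] * np.sqrt(np.clip(vals[-p:], 0, None))

# The output matches the true configuration up to a rigid transformation,
# so compare the reconstructed inter-node squared distances instead.
S_rec = np.sum((Z_hat[:, None] - Z_hat[None, :]) ** 2, axis=2)
print(f"max abs error on squared distances: {np.abs(S_rec - S).max():.3f}")
```

Since \(\mathbb E[S_n] = S\), the averaged estimate converges to S, and the recovered squared distances approach the true ones while the coordinates remain defined only up to an orthogonal transformation, as stated after Algorithm 3.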
When faced with random matrices Mn having a given expectation M, the principal eigenspace of M can be recovered similarly to algorithm (3.12) as:

Un = ΠK[Un−1 + γn (Mn Un−1 − Un−1 Un−1^T Mn Un−1)],    (4.15)

where K = [−α, α]^p ⊗ · · · ⊗ [−α, α]^p with α > 1 is a compact set whose interior contains χ (the set (3.11) defined in Section 3.4 of Chapter 3), ΠK is the projector onto K, and γn > 0 is a step size. Let un,k denote the k-th column of matrix Un. The p largest eigenvalues can be estimated as a straightforward extension of the above Oja's algorithm. If (un,k)n converges to one of the eigenvectors of M, then the quantity λn,k recursively defined by:

λn,k = λn−1,k + γn (un−1,k^T Mn un−1,k − λn−1,k)    (4.16)

converges to the corresponding eigenvalue (see [122]). Finally, according to step 3 of the batch Algorithm 3, the estimated barycentric coordinates are obtained as:

Ẑn = (√λn,1 un,1, . . . , √λn,p un,p).    (4.17)

4.4.3 Distributed on-line MDS

Since the goal is to implement the previous on-line algorithm (4.15)-(4.16) in a distributed setting, we introduce the communication model that enables sensor nodes to process their data locally. Each sensor node i estimates its position as in (4.17) based on its local measurements (see Definition 4.1) and sparse random communications within its neighborhood.

1) Communication model: sparse asynchronous transmissions

It is clear from the previous section that an unbiased estimate of matrix M is the first step needed to estimate the sought eigenspace. In the centralized setting, this estimate was given by matrix Mn = −(1/2) J⊥ Sn J⊥. As made clear by the observation model (in Definition 4.1), each node i observes the i-th row of matrix Sn. As a consequence, node i has access to the i-th row-average S̄n(i) := (1/N) Σj Sn(i, j). This means that matrix Sn J⊥ can be obtained with no need for further exchange of information in the network.
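The centralized recursion (4.15)-(4.16) above can be sketched as follows. This is a minimal numpy sketch, not the thesis implementation: the matrix M is a hypothetical diagonal matrix with a known spectrum, the Gaussian perturbations stand in for the random unbiased estimates Mn, and the step size, noise level and projection bound α are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, alpha = 6, 2, 1.5
M = np.diag([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])  # known spectrum to check against

U = rng.normal(scale=0.1, size=(N, p))       # current estimate of the p eigenvectors
lam = np.zeros(p)                            # eigenvalue estimates, recursion (4.16)
for n in range(1, 20001):
    gamma = 1.0 / n                          # step size satisfying Assumption 4.1
    M_n = M + rng.normal(scale=0.02, size=(N, N))
    M_n = (M_n + M_n.T) / 2                  # noisy but unbiased symmetric estimate of M
    # projected Oja update (4.15); the projection is an entrywise clip onto [-alpha, alpha]
    U = np.clip(U + gamma * (M_n @ U - U @ (U.T @ M_n @ U)), -alpha, alpha)
    lam = lam + gamma * (np.diag(U.T @ M_n @ U) - lam)

# U spans the principal eigenspace (rows 3..6 vanish) and the estimated
# eigenvalues sum to lambda_1 + lambda_2 = 5
assert np.abs(U[2:]).max() < 0.1
assert abs(lam.sum() - 5.0) < 0.2
```

With this symmetric form of the update, the columns of U converge to an orthonormal basis of the principal eigenspace; the individual-eigenvector statement of the text relies on the extension of [122].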
On the other hand, J⊥ Sn J⊥ requires computing the per-column averages of matrix Sn J⊥, i.e. (1/N) Σj Sn(j, i) for all i. This task is difficult in a distributed setting, as it would require that all nodes share all their observations at any time. A similar obstacle appears in Oja's algorithm when computing matrix products, e.g. Mn Un−1 in (4.15). To circumvent the above difficulties, we introduce the following sparse asynchronous communication framework. In order to derive an unbiased estimate of M, let us first remark that for all i, j,

M(i, j) = (d̄²(i) + d̄²(j))/2 − (d²ij + δ)/2    (4.18)

where we set d̄²(i) := (1/N) Σk d²ik and δ := (1/N) Σi d̄²(i). Note that the terms d²ij and d̄²(i) can be estimated by Sn(i, j) and S̄n(i) respectively. However, additional communication is needed to estimate δ, since it corresponds to the average value over all squared distances. We define

M̂n(i, j) = (S̄n(i) + S̄n(j))/2 − (Sn(i, j) + δn(i))/2    (4.19)

where δn(i) is a quantity that we will define in the sequel, and which represents the estimate of δ at agent i. We are now faced with two problems. First, we must construct δn(i) as an unbiased estimate of δ. Second, we need to avoid computing M̂n(i, j) for all pairs i, j, and restrict the computation to some of them. In order to answer these problems, we instantiate the notion of asynchronous transmission sequence already introduced in the previous chapter. Formally,

Definition 4.2 (Asynchronous transmission sequence (ATS)). Let q be a real number such that 0 < q < 1. We say that the sequence of random vectors Tn = (ιn, Qn,i : i ∈ {1, · · · , N}, n ∈ N) is an asynchronous transmission sequence (ATS) if:
i) all variables (ιn, Qn,i)_{i,n} are independent.
ii) ιn is uniformly distributed on the set {1, · · · , N}.
iii) ∀i ≠ ιn, Qn,i is a Bernoulli variable with parameter q, i.e., P[Qn,i = 1] = q.
iv) Qn,ιn = 0.

Let (Tn)n denote an ATS defined as above.
At time n, we assume that a given node ιn ∈ {1, . . . , N} wakes up and transmits its local row-average S̄n(ιn) to other nodes. All nodes i such that Qn,i = 1 are supposed to receive the message. For any i, we set:

δn(i) = S̄n(i)/N + S̄n(ιn) Qn,i / q.    (4.20)

The following Lemma is a consequence of Definition 4.2 along with Lemma 4.1 and equation (4.13).

Lemma 4.2. Assume that (Tn)n is an ATS independent of (Sn)n. Let (M̂n)n be the sequence of matrices defined by (4.19). Then, E[M̂n] = M.

Proof. By Lemma 4.1, the expectations of the terms S̄n(i), S̄n(j) and Sn(i, j) are respectively d̄²(i), d̄²(j) and d²ij. Moreover, by Definition 4.2 the expectation of the random term δn(i) is equal to

E[δn(i)] = (1/N) E[S̄n(i)] + (1/(qN)) Σ_{j≠i} E[S̄n(j)] q = (1/N) Σ_{i=1}^N d̄²(i),

which coincides with δ. Then, the expectation of each entry of the matrix M̂n in (4.19) is equal to the corresponding M(i, j) defined in (4.18).

2) Preliminaries: constructing unbiased estimates

Now that we have obtained a distributed and unbiased estimate of M, the remaining task is to adapt Oja's algorithm (4.15) accordingly. In this paragraph, we provide the main ideas behind the construction of our algorithm. Assume that we are given a current estimate Un−1 at time n, under the form of an N × p matrix. Assume also that for each i, the i-th row of Un−1 is a variable which is physically handled by node i. We denote by Un−1(i) the i-th row of Un−1. Looking at (4.15) in more detail, Oja's algorithm requires the evaluation of intermediate values, namely unbiased estimates of M Un−1 and Un−1^T M Un−1. We consider the previous ATS (Tn)n involved in (4.19). We assume that the active node ιn (i.e., the one which transmits S̄n(ιn)) is also able to transmit its local estimate Un−1(ιn) at the same time. Thus, with probability 1/N, node ιn sends its former estimate Un−1(ιn) and S̄n(ιn) to all nodes i such that Qn,i = 1.
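The unbiasedness of δn(i) stated in Lemma 4.2 can be checked by simulation. The sketch below makes a simplifying assumption: the row-averages S̄n(i) are taken as fixed numbers rather than random measurements, which is enough to verify the expectation of the ATS-based term in (4.20).

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 5, 0.6
S_bar = rng.uniform(1.0, 3.0, N)     # row-averages, here taken as fixed numbers
delta = S_bar.mean()                 # target: delta = (1/N) sum_i dbar2(i)

def delta_estimate(i, rng):
    iota = rng.integers(N)           # active node iota_n, uniform on {0,...,N-1}
    Q = rng.random(N) < q            # Bernoulli(q) receivers ...
    Q[iota] = False                  # ... with Q_{n,iota} = 0 (Definition 4.2)
    # equation (4.20)
    return S_bar[i] / N + S_bar[iota] * Q[i] / q

est = np.mean([delta_estimate(0, rng) for _ in range(50000)])
assert abs(est - delta) < 0.05       # E[delta_n(i)] = delta (Lemma 4.2)
```

The factor 1/q compensates, on average, for the transmissions that node i does not receive, exactly as 1/qij does in (4.14).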
Then, all nodes compute:

Yn(i) = M̂n(i, i) Un−1(i) + (N/q) M̂n(i, ιn) Un−1(ιn) Qn,i.    (4.21)

As will be made clear below, the N × p matrix Yn whose i-th row coincides with Yn(i) can be interpreted as an unbiased estimate of M Un−1. We now introduce the distributed version of the second term Un−1^T Mn Un−1. Consider a second ATS (T′n)n independent of (Tn)n. At time n, node ι′n wakes up uniformly at random and broadcasts the product Un−1(ι′n)^T Yn(ι′n) to other nodes. Receiving nodes are those i's for which Q′n,i = 1. Then, all nodes are able to compute the estimated p × p matrix as follows:

Λn(i) = Un−1(i)^T Yn(i) + (N/q) Un−1(ι′n)^T Yn(ι′n) Q′n,i.    (4.22)

Lemma 4.3. Let (Tn)n and (T′n)n be two independent ATS. For any n, denote by Fn the σ-field generated by (Tk)k≤n, (T′k)k≤n, (Ak)k≤n and (Dk)k≤n. Let (Un)n be an Fn-measurable N × p random matrix and let Yn, Λn be defined as above. Then,

E[Yn | Fn−1] = M Un−1,
E[Λn(i) | Fn−1] = Un−1^T M Un−1.

Under Lemmas 4.1 and 4.2 and Definition 4.2, the random sequences Yn(i) and Λn(i) are unbiased estimates of Σj M(i, j) Un−1(j) and Un−1^T M Un−1 respectively, given Un−1.

Proof. For each i, we obtain

E[Yn(i) | Fn−1] = M(i, i) Un−1(i) + (N/q) Σ_{j≠i} (q/N) M(i, j) Un−1(j) = Σj M(i, j) Un−1(j),

and

E[Λn(i) | Fn−1] = Un−1(i)^T E[Yn(i) | Fn−1] + (N/q) (1/N) Σ_{j≠i} Un−1(j)^T E[Yn(j) | Fn−1] q = Σi Σj Un−1(i)^T M(i, j) Un−1(j),

which corresponds to the square matrix Un−1^T M Un−1.

3) Main algorithm

We are now ready to state the main algorithm. The algorithm generates iteratively, for any node i, two variables Un(i) and λn(i), according to:

Un(i) = ΠK[Un−1(i) + γn (Yn(i) − Un−1(i) Λn(i))]    (4.23)
λn(i) = λn−1(i) + γn (diag(Λn(i)) − λn−1(i)),    (4.24)

where ΠK is the projector onto the set K := [−α, α]^p, given an arbitrary α > 1.
Finally, as in (4.17), each sensor i obtains its estimated position Ẑn(i) by:

Ẑn(i) = (√λn,1(i) un,1(i), · · · , √λn,p(i) un,p(i))    (4.25)

where we set Un(i) = (un,1(i), . . . , un,p(i)). The proposed algorithm (4.23)-(4.25) is summarized in Algorithm 4 below. Note that, at each iteration n, only two communications are performed, by two randomly selected nodes drawn according to the ATSs Tn and T′n.

Algorithm 4: Distributed on-line MDS for localization (do-MDS)
Update: At each time n = 1, 2, . . .
[Measures]: each sensor node i:
• Makes sparse measurements of its RSSI to obtain (Dn(i, j))j for the j's such that An(i, j) = 1 (Definition 4.1). Set
Sn(i, j) = qij⁻¹ Dn(i, j) if An(i, j) = 1, and Sn(i, j) = 0 otherwise,
and set S̄n(i) = (1/N) Σj Sn(i, j).
[Communication step]:
• A randomly selected node ιn broadcasts Un−1(ιn) and S̄n(ιn) to the nodes i such that Qn,i = 1.
• Each node i computes Yn(i) by (4.21).
• A randomly selected node ι′n broadcasts Un−1(ι′n)^T Yn(ι′n) to the nodes i such that Q′n,i = 1.
• Each node i updates Un(i) by (4.22)-(4.23) and Ẑn(i) by (4.25).

4.4.4 Convergence analysis

We now investigate the convergence of Algorithm 4. We prove that if the sensors' positions are fixed, the algorithm recovers this configuration up to a rigid transformation.

Assumption 4.1. The sequence (γn)n is positive and satisfies:
i) Σn γn = +∞,
ii) Σn γn² < ∞.

The following assumption states the stability of the sequence (Un)n (see Assumption 3.4 in Chapter 3).

Assumption 4.2. For each node i, there exists a time instant n0i such that for all n > n0i the sequence Un−1(i) + γn (Yn(i) − Un−1(i) Λn(i)) remains in the compact set K almost surely.

Roughly speaking, Assumption 4.2 means that the projector ΠK becomes inactive at each sensor node i for all n beyond a certain value.

Proposition 4.1. For any U ∈ R^{N×p}, set h(U) = M U − U U^T M U.
Let Un be defined by (4.23). There exist two random sequences (ξn, en)n such that, almost surely (a.s.), en converges to zero, Σn γn ξn converges and

Un = Un−1 + γn h(Un−1) + γn ξn + γn en.    (4.26)

Proof. Set for each i, (M Un−1)_i = Σj M(i, j) Un−1(j) and

ξn(i) = (Yn(i) − (M Un−1)_i) + Un−1(i) (Un−1^T M Un−1 − Λn(i))    (4.27)
en(i) = ΠK[Yn(i) − Un−1(i) Λn(i)] − (Yn(i) − Un−1(i) Λn(i)).    (4.28)

Then, the sequence generated by each sensor node i is written as:

Un(i) = Un−1(i) + γn ((M Un−1)_i − Un−1(i) Un−1^T M Un−1) + γn ξn(i) + γn en(i).

First we prove that Σn γn ξn(i) converges. For that purpose, we use [75, Theorem 2.17] to prove that for any i, (ξn(i))n is an L²-bounded martingale increment sequence. By Lemma 4.3, E[ξn(i) | Un−1] is equal to zero. Regarding the second moment of (ξn(i))n, we show that supn E[‖ξn(i)‖² | Un−1] < ∞ for any i as follows:

E[‖ξn(i)‖² | Un−1] ≤ E[‖Yn(i)‖² | Un−1] + ‖Un−1(i)‖² E[‖Λn(i)‖² | Un−1] + 2 ‖Un−1(i)‖ E[‖Yn(i) Λn(i)‖ | Un−1].    (4.29)

The first term on the right-hand side (RHS) of (4.29) can be expanded as:

E[‖Yn(i)‖² | Un−1] ≤ E[|M̂n(i, i)|²] ‖Un−1(i)‖² + (N/q) Σ_{j≠i} E[|M̂n(i, j)|²] ‖Un−1(i)‖² + 2 Σ_{j≠i} E[M̂n(i, i) M̂n(i, j)] ‖Un−1(i)‖².

Upon noting that for any i, j, E[Sn(i, j)²] = (1/qij) C⁸ d⁴ij, there exists a constant K > 0, depending on N, qmin = min_{i,j} qij, C defined in (4.8) and max_{i,j} d⁴ij, such that E[|M̂n(i, j)|²] < K. In addition, by Assumption 4.2, Un−1(i) remains in the compact set [−α, α]^p, which depends on α > 1; hence supn ‖Un−1(i)‖² = pα² < ∞. Thus:

E[‖Yn(i)‖² | Un−1] ≤ (1 + N²/qmin + 2N) K α² p = B1.    (4.30)

The second term on the RHS of (4.29) can be expressed as a function of the latter bound B1 as:

E[‖Λn(i)‖² | Un−1] ≤ E[‖Yn(i)‖² | Un−1] ‖Un−1(i)‖² + ((N/q) Σ_{j≠i} E[‖Yn(j)‖² | Un−1] + 2 Σ_{j≠i} E[Yn(i) Yn(j) | Un−1]) ‖Un−1(j)‖²
≤ (1 + N²/qmin + 2N) B1 α² p.    (4.31)

Finally, the cross term E[Yn(i) Λn(i) | Un−1] is also bounded by:

E[‖Yn(i) Λn(i)‖ | Un−1] ≤ E[‖Yn(i)‖² | Un−1] ‖Un−1(i)‖ + Σ_{j≠i} E[Yn(i) Yn(j) | Un−1] ‖Un−1(j)‖ ≤ N B1 α √p.    (4.32)

Hence, using the bounds (4.30)-(4.32), supn E[‖ξn(i)‖² | Un−1] < ∞, and by Assumption 4.1, Σn E[‖γn ξn(i)‖² | Un−1] is bounded almost surely, which concludes the bound on the second moment in (4.29). Finally, by Assumption 4.2, limn en(i) = 0 a.s. for any i.

We now state the main theorem on the convergence of Algorithm 4. The following result is standard in the stochastic approximation folklore, e.g. [14], [51], [28].

Theorem 4.1 (Main result). Let Un be defined by (4.23) and λn,k be defined by (4.24). Under Assumption 4.1, for any k = 1, · · · , p, the k-th column un,k of Un converges to an eigenvector of M with unit norm. Moreover, for each node i, λn,k(i) converges to the corresponding eigenvalue.

Proof. Consider the following Lyapunov function V : R^{N×p} \ {0} → R+:

V(U) = e^{‖U‖²} / (U^T M U).    (4.33)

The following properties hold:
i) assuming that the eigenvalues of the expectation matrix M are bounded as λmin ≤ λk(M) ≤ λmax for all k = 1, . . . , p, and that ‖U‖² ≤ N α² p = b, then: e^b/(λmax b) ≤ V(U) ≤ e^b/(λmin b);
ii) lim_{‖U‖→∞} V(U) = +∞ and its gradient is ∇V(U) = −2 (V(U)/(U^T M U)) h(U);
iii) ⟨∇V(U), h(U)⟩ ≤ 0, and equality holds on the set {U ∈ R^{N×p} | h(U) = 0} ⊂ χ defined in (3.11).

The proof is an immediate consequence of Proposition 4.1 and the existence of (4.33), along with Theorem 2 of [51]. The sequence Un converges a.s. to the roots of h. These roots are characterized in [121]. In particular, h(U) = 0 implies that each column of U is a unit-norm eigenvector of M. Note that Theorem 4.1 might seem incomplete in some respect: one indeed expects that the sequence Un converges to the set χ characterizing the principal eigenspace of M.
Instead, Theorem 4.1 only guarantees that one recovers some eigenspace of M. As discussed in Section 3.4, undesired limit points can be theoretically avoided by introducing an arbitrarily small Gaussian noise inside the parentheses of (4.23) (see Chapter 4 in [28]). As a consequence, the sequence Ẑn converges to Z up to a rigid transformation.

4.4.5 Numerical results

In this section we show the performance of our proposed distributed algorithm on simulated and real data. In both cases, we consider the same network configuration corresponding to a set of N = 50 sensor nodes selected from the FIT IoT-LAB [1] platform at Rennes. Sensor nodes are located within a 5 × 9 m² area, i.e. p = 2. Six of the 50 sensors were set as anchor nodes (or landmarks), as illustrated by Figure 4.7 (a). We compare the performance of our proposed distributed on-line MDS (do-MDS) to other existing algorithms. We consider the distributed batch MDS [45] (dw-MDS) and the classical centralized methods of Section 1), namely multilateration [79] (MC), min-max [141], Algorithm 3 of Section 4.4.1 (the batch MDS) and the Oja's algorithm (4.15)-(4.16) described in Section 4.4.2. The three iterative algorithms (Oja's, dw-MDS and do-MDS) are initialized with randomly chosen positions in the 5 × 9 m² area. The performances are compared through the root-mean-square error (RMSE, expressed in meters) between the true and the estimated positions as a function of the number of communications per iteration (n).

Method / # communications:
MC/min-max / TMN
MDS / TN
Oja / NI
dw-MDS / TN + 2NI
do-MDS / (N + 2)I

Table 2. Number of communications required for each method.

Table 2 summarizes the total number of communications (i.e. information broadcast by each sensor node). For the iterative algorithms (centralized Oja, dw-MDS [45] and Algorithm 4), we consider a fixed number of iterations I. The cost is rather different depending on the approach.
Indeed, for the classical anchor-based methods, TM communications are needed for each unknown position in order to obtain each estimated distance between the node and each anchor node. Similarly, the classical anchor-free MDS algorithm needs all N nodes to broadcast T measurements to compute the estimated matrix Ŝ before computing the double-centered matrix M̂ and its eigendecomposition. The remaining iterative approaches (Oja's, dw-MDS [45] and Algorithm 4) require a different number of communications at each iteration. The batch dw-MDS requires a prior measurement phase of TN communications to obtain the estimated squared distances, while the on-line Oja's approaches (centralized and distributed do-MDS) take one measurement step at each iteration's update. Since dw-MDS [45] needs all nodes to broadcast their estimated positions before updating the global stress function by an incremental cycle, a total of 2N communications is required per iteration. For the centralized Oja's update (4.15), N communications are enough to obtain the estimated matrix Mn (from the observation matrix Sn), while our distributed Oja's update needs two additional communications due to the ATSs involved in the communication step (see Algorithm 4). Note that distributed Oja's is slightly worse in terms of communication cost, but regarding the number of operations (sums and multiplications), the cost is considerably reduced. Indeed, the computational cost scales as N² per iteration for the centralized Oja's, while our algorithm needs N + 2qN operations per iteration, since all N nodes perform at least one multiplication to update and, on average, the 2qN receiving nodes perform an additional multiplication related to the received data.

1) Simulated data

First, we show the results from simulated data drawn according to the observation model defined in Section 4.4.2.
In order to compare our proposed algorithm with the distributed MDS proposed by [45], we set the same environmental context, in which σ/η = 1.7. Figure 4.10 displays the RMSE when running Algorithm 4 over 300 independent runs of the estimated positions for different communication parameters: (qij)_{i,j} (the Bernoulli parameters related to the observation model (4.14)) and q (the Bernoulli parameter related to the ATS in Definition 4.2). Since the variance of the error sequence is upper bounded in terms of the minimum probability value in (4.30)-(4.32), we observe from Figure 4.10 a trade-off between the accuracy and the number of communications, as the RMSE decreases faster when the probability q is closer to 1.

Figure 4.10. RMSE as a function of nN for the two estimated eigenvectors un,1 and un,2, in the noiseless (σ/η = 0) and noisy (σ/η = 1.7) cases and for different values of the parameters ((qij)_{i,j} = q ∈ {0.5, 0.8}, and the centralized case with (qij)_{i,j} = 0.5).

Figure 4.11 shows the comparison of the localization RMSE over 300 independent runs of the overall estimated positions for the three iterative methods: the centralized Oja's (4.15)-(4.16) (co-MDS), the dw-MDS of [45] and our proposed Algorithm 4. The estimated positions after 1000 iterations of the three iterative algorithms are reported in Figure 4.12. Note that, as remarked in Table 2, the result in Figure 4.12 (b) requires at least twice the number of communications compared to the results of both on-line Oja's approaches. Positions close to the barycenter of the network tend to be more accurate than positions in the surrounding area for the three cases.
Nevertheless, Figures 4.12 (a) and 4.12 (c) show that these outer positions are better preserved than in [45]. Indeed, our distributed and asynchronous Oja's algorithm achieves better accuracy in general (for around 65% of the positions), except for the roughly one third of nodes located around the network's boundary, e.g. nodes 11 or 36-37 (see the squared nodes in Figure 4.12 (c)).

Figure 4.11. RMSE as a function of nN for the estimated positions (Ẑn(1), . . . , Ẑn(N)), comparing co-MDS (Algorithm (4.15)-(4.17)), dw-MDS [45], do-MDS (Algorithm 4), min-max, multilateration and the batch MDS (Algorithm 3).

Figure 4.12. Estimated network configuration from simulated data after 1000 iterations: (a) Oja's (4.15)-(4.16) (co-MDS); (b) dw-MDS [45]; (c) Algorithm 4 (do-MDS). Markers (Q) correspond to the estimated values and markers (#) to the true positions. On the right (distributed on-line Oja's), squared positions highlight worse accuracy compared to the centralized case (on the left).

2) Real data

As described in Section 4.2.4, through our user profile created on the FIT IoT-LAB website [1], we ran remotely several experiments involving the sensor nodes illustrated in Figure 4.7 (a). All real data used in this section can be found in [2] (on the research information).
The set of estimated parameters is obtained from the LNSM as in (4.4)-(4.5): σ² = 28.16 dB, PL0 = 61.71 dB and η = 2.44. We set qij = 0.8 ∀i, j, q = 0.85 and γn = 0.015/√n for our proposed algorithm described in Section 4.4.3. Table 3 summarizes the RMSE of the 44 location estimates in the real testbed over 100 independent runs for the six considered algorithms, after 1000 iterations for the three iterative algorithms (centralized on-line Oja's, dw-MDS [45] and Algorithm 4).

Method / over all nodes / at a center node / at an outer node:
MC / 1.87 / 1.76 / 3.53
min-max / 0.8 / 0.55 / 2.42
MDS / 1.98 / 1.02 / 3.2
Oja / 2.18 / 1.12 / 2.5
dw-MDS / 0.86 / 0.71 / 1.91
do-MDS / 1.56 / 0.47 / 1.41

Table 3. RMSE over the 44 estimated positions considering the real data from the FIT IoT-LAB testbed.

Table 3 also includes the results for a given center node (node 12) and a surrounding node (node 36), in order to gain insight into the impact of the sensor node location relative to the barycentric position and the anchors' positions. Although the best overall performance is achieved by min-max and dw-MDS, do-MDS achieves the best accuracy for the center node situated close to the barycentric position of the network, while multilateration and MDS give the worst RMSE values for the positions on the network's boundary.

4.5 Position refinement: distributed maximum likelihood estimator

In the framework of WSN localization, a refinement phase is in general added to obtain the absolute coordinates and/or to improve the accuracy of the estimated positions (see [140], [141], [142] or [45]). Assuming the RSSI model of Section 4.2.2, we propose a distributed and on-line algorithm based on the maximum likelihood estimator (MLE) that can be executed after Algorithm 4 without too much additional complexity. Since the function involved in the ML criterion is not convex, the convergence issues can be avoided by a suitable initialization.
The idea here is to use the estimated positions of the previous Algorithm 4 to initialize the following algorithm. We make use of a consensus-based algorithm which is linked to Chapter 2.

4.5.1 Principle: maximum likelihood estimation

Similarly to the observation model in Section 1), we introduce a collection of independent r.v.'s (Pij(n) : i, j = 1, · · · , N, n ∈ N) such that each Pij(n) follows the LNSM (4.2) described in Section 4.2.2. We let D̂ij(n) be the unbiased estimate of the log-distance obtained from (4.2) as: D̂ij(n) = (−Pij(n) − PL0)/(10η) = log10 dij + εij/(10η). Since εij ~ N(0, σ²), it is easy to verify that D̂ij(n) ∼ N(log10 dij, σ²/(100η²)). Moreover, we define two sets of RSSI measurements collected at each sensor node i: {Pij(n)}∀j∈Ni from its neighboring sensor nodes and {Pik(n)}∀k∈Mi from its neighboring landmarks. For each sensor node i, the set of positions of its neighboring nodes is denoted by (x(i), y(i)) = {(xj, yj)}∀j∈Ni (including its own position zi = (xi, yi)). The aim is to solve the global optimization problem on the function F : (x, y) ⊂ R^{N×2} → R+, defined as a sum of local functions:

min_{(x,y)} F(x, y) = min_{(x,y)} Σ_{i=1}^N fi(x(i), y(i))    (4.34)

where:

fi(x(i), y(i)) = Σ_{j∈Ni} (D̂ij(n) − log10 ‖zi − zj‖)² + Σ_{k∈Mi} (D̂ik(n) − log10 ‖zi − zk‖)².

Note that in a centralized setting, the gradient ∇F(x, y) can be computed for each unknown position zi as:

dF/dxi = 4 Σ_{j∈Ni} ((D̂ij(n) + D̂ji(n))/2 − log10 dij) (xi − xj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (xi − ak)/(d²ik ln 10)
dF/dyi = 4 Σ_{j∈Ni} ((D̂ij(n) + D̂ji(n))/2 − log10 dij) (yi − yj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (yi − bk)/(d²ik ln 10)    (4.35)

When considering zi ≠ zj for each pair i ≠ j ∈ {1, . . . , N}, the solution of (4.34) that cancels (4.35) is written as a system of equations.
A first set of N(N − 1)/2 equations has 2N unknowns and, for each unknown position zi, a set of |Mi| equations of the form (4.11). The solutions lie on the intersection of both sets:

A ∩ B    (4.36)

where:

A = { ∀i, j = 1, . . . , N, i ≠ j | (xi − xj)² + (yi − yj)² = 10^{D̂ij(n)+D̂ji(n)} }
B = { ∀i = 1, . . . , N, ∀k ∈ Mi | (xi − ak)² + (yi − bk)² = 10^{2D̂ik(n)} }

The resulting system (4.36) can be viewed as a separable problem: the first set gives a global set of solutions forming the whole network configuration, while the second set deals with a local trilateration (multilateration) problem solved for each single position inside its reference system of landmarks. Indeed, as explained in Section 1), a first unbiased estimate of each single position can be found from the second set of equations. Then, the first set of equations can be solved in the same way, considering one unknown position at a time while fixing the rest of the neighboring nodes as estimated anchor positions. Since the likelihood function of the LNSM involved in (4.34) is not convex with respect to the unknown positions (i.e. the Hessian matrix obtained by differentiating (4.35) is not positive definite), a standard iterative gradient descent suffers from a poor initialization point. We consider an initial guess obtained by another localization algorithm before the cooperative refinement step, in order to overcome this initialization issue.

4.5.2 The algorithm: on-line gossip-based implementation

In a distributed processing setting, the global function in (4.35) is not perfectly known by each sensor node. However, in this context, the separable nature of problem (4.34) can be exploited to design a distributed implementation consisting of local computations and random communications among the sensor nodes.
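The random communications used below rely on the standard gossip averaging primitive of [31]: two randomly selected nodes average their current values, which preserves the network-wide average while driving all values to consensus. A minimal sketch of this primitive on hypothetical data (the full Algorithm 5 of the next section only averages the positions common to both neighborhoods):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
x = rng.uniform(0.0, 5.0, (N, 2))    # hypothetical local position estimates
mean0 = x.mean(axis=0)

for _ in range(2000):
    i, j = rng.choice(N, size=2, replace=False)  # two random nodes wake up
    x[i] = x[j] = (x[i] + x[j]) / 2              # pairwise gossip average

assert np.allclose(x, mean0)         # the network average is preserved...
assert x.std(axis=0).max() < 1e-8    # ...and all nodes reach consensus
```

Each pairwise exchange is doubly stochastic, which is why the average is conserved at every step.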
Thus, (4.34) is finally solved by means of a distributed on-line stochastic approximation algorithm (DSA) based on the local version of (4.35), ∇fi(x(i), y(i)) (see the general framework of [19] and Chapter 2 for more details on consensus algorithms). The components for each sensor i are composed of its partial derivatives (dfi/dxi, dfi/dyi) and those from its neighbors (dfj/dxi, dfj/dyi), as follows:

dfi/dxi = 2 Σ_{j∈Ni} (D̂ij(n) − log10 dij) (xi − xj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (xi − ak)/(d²ik ln 10)
dfi/dyi = 2 Σ_{j∈Ni} (D̂ij(n) − log10 dij) (yi − yj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (yi − bk)/(d²ik ln 10)    (4.37)
dfj/dxi = 2 (D̂ji(n) − log10 dij) (xi − xj)/(d²ij ln 10),  ∀j ∈ Ni
dfj/dyi = 2 (D̂ji(n) − log10 dij) (yi − yj)/(d²ij ln 10),  ∀j ∈ Ni

Note that each component i in (4.37) involves the knowledge of the set of positions (x(i), y(i)) and the neighborhood components of the gradient function (dfj/dxi, dfj/dyi). Thus, we introduce a consensus-based algorithm in order to drive these local terms towards the global gradient terms (4.35). Similarly to Algorithm (2.1)-(2.2) in Chapter 2, Algorithm 5 consists of a local gradient descent step along with a gossip step based on the pairwise model of [31]. Equations (4.38)-(4.39) generate a sequence of the sets of estimated positions at each sensor i, i.e. its own position and those of its neighbors. Figure 4.13 illustrates an iteration of the proposed algorithm. Indeed, the scheme shown in Figure 4.13 highlights that at any n, each sensor node i obtains an estimate of its local map through the set (x(i)_n, y(i)_n) = {(x(i)_{n,j}, y(i)_{n,j})}∀j∈Ni.

Figure 4.13. Scheme of Algorithm 5 (doMLE) at any time n, when the gossip step is performed by nodes 1 and 2. Both estimated local maps are delimited by two different dashed lines.
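The per-neighbor structure of the local cost and its partial derivatives can be checked against finite differences. This is a minimal numpy sketch with hypothetical positions and log-distance estimates; note that differentiating −log10 ‖zi − zj‖ produces a leading minus sign in front of the residual-times-offset terms.

```python
import numpy as np

def f_i(z, z_nb, D_hat):
    """Local ML cost f_i of (4.34): squared log-distance residuals between
    position z and the (hypothetical) neighbour positions z_nb."""
    d = np.linalg.norm(z - z_nb, axis=1)
    return np.sum((D_hat - np.log10(d)) ** 2)

def grad_i(z, z_nb, D_hat):
    """Analytic partial derivative of f_i with respect to z: residual times
    (z - z_j)/(d_ij^2 ln 10) per neighbour."""
    d2 = np.sum((z - z_nb) ** 2, axis=1)
    r = D_hat - 0.5 * np.log10(d2)       # residual D_hat - log10 d
    return -2.0 * np.sum(r[:, None] * (z - z_nb) / (d2[:, None] * np.log(10)), axis=0)

rng = np.random.default_rng(3)
z = np.array([1.0, 2.0])
z_nb = rng.uniform(0.0, 5.0, (4, 2))     # four hypothetical neighbours
D_hat = rng.normal(0.3, 0.05, 4)         # hypothetical log-distance estimates
g = grad_i(z, z_nb, D_hat)

eps = 1e-6                               # central finite-difference check
num = np.array([(f_i(z + eps * e, z_nb, D_hat) - f_i(z - eps * e, z_nb, D_hat)) / (2 * eps)
                for e in np.eye(2)])
assert np.allclose(g, num, atol=1e-5)
```

The same per-neighbor terms, evaluated with the current position estimates, are what each node plugs into the descent step (4.38).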
Algorithm 5: Distributed on-line MLE (doMLE)
Initialize: for each i, set {(x(i)_0(j), y(i)_0(j))}∀j∈Ni.
Update: at each time n = 1, 2, . . .
[Local step] [156]: each node i obtains {Pij(n)}∀j∈Ni and {Pik(n)}∀k∈Mi. Each sensor i computes (4.37) and a temporary estimate of its position set as:

(x̃(i)_n, ỹ(i)_n) = (x(i)_{n−1}, y(i)_{n−1}) − γn ∇fi(x(i)_{n−1}, y(i)_{n−1})    (4.38)

where (γn)_{n≥1} is a decreasing step sequence such that γn = 1/√n.
[Gossip step] [31]: two uniformly randomly selected nodes i and j exchange their common estimated positions and average their values. The final updates are set as follows:

(x(i)_{n,ℓ}, y(i)_{n,ℓ}) = (x(j)_{n,ℓ}, y(j)_{n,ℓ}) = ((x̃(i)_{n,ℓ}, ỹ(i)_{n,ℓ}) + (x̃(j)_{n,ℓ}, ỹ(j)_{n,ℓ}))/2,  ∀ℓ ∈ Ni ∩ Nj.    (4.39)

Otherwise, ∀ℓ ∉ Ni ∩ Nj and ∀m ≠ i, j,

(x(m)_{n,ℓ}, y(m)_{n,ℓ}) = (x̃(m)_{n,ℓ}, ỹ(m)_{n,ℓ}).

4.5.3 Numerical results: initialization by the do-MDS algorithm

In order to highlight the improvement in accuracy that can be achieved by the refinement phase for given estimated positions, we consider the real data from the FIT IoT-LAB testbed used in Section 4.4.5. We compare the same algorithms by using the estimated positions obtained from each algorithm as the initialization of Algorithm 5. Table 4 shows the RMSE values after the refinement phase. In addition, we include the ratio of the accuracy improvement, considering the RMSE values after and before applying the distributed MLE, and the ratio of positions over the total N that are improved. The best performances are achieved by min-max, dw-MDS and do-MDS in terms of minimum RMSE value over the N estimated positions. Nevertheless, the highest improvement is obtained with the proposed do-MDS, since its RMSE before the refinement phase was higher than the values from min-max and dw-MDS, which do not experience a considerable decrease.
In general, the refinement Algorithm 5 improves almost all the positions for each method, and especially for the anchor-free methods based on the MDS approach. Indeed, the highest values are those of the distributed versions, which can take advantage of the local knowledge of each sensor node.

Method / After refinement / Improvement (%) / Positions improved (%):
MC / 1.05 / 44 / 75
min-max / 0.54 / 32 / 71
MDS / 1.39 / 30 / 80
Oja / 1.37 / 28 / 80
dw-MDS / 0.6 / 30 / 82
do-MDS / 0.51 / 78 / 86

Table 4. RMSE averaged over the 44 estimated positions considering real data.

Table 5 reports the results for the center node (node 12) and the surrounding node (node 36), in order to gain insight into the impact of the sensor's placement relative to the barycentric position and the anchors' area. The improvement in accuracy is rather higher for the outer position for all methods and, in general, the RMSE values for the center and the outer positions are similar, except for multilateration and MDS. Since the distributed MLE focuses on the signal model instead of the network configuration, the resulting RMSE may not depend on the sensor nodes' location.

Center node 12 — Method / After refinement / Improvement (%):
MC / 1.68 / 4
min-max / 0.48 / 12
MDS / 0.75 / 26
Oja / 0.48 / 58
dw-MDS / 0.54 / 22
do-MDS / 0.37 / 27

Outer node 36 — Method / After refinement / Improvement (%):
MC / 0.89 / 75
min-max / 0.45 / 26
MDS / 1.27 / 60
Oja / 0.46 / 76
dw-MDS / 0.33 / 62
do-MDS / 0.23 / 84

Table 5. RMSE location error and improvement after the refinement algorithm for the center node 12 (top) and the outer node 36 (bottom), considering real data and the RMSE values of Table 3.

4.6 A cooperative RSSI-based algorithm for indoor localization in WSN

This section is devoted to presenting the work made in collaboration with N.A. Dieng, a member of the laboratory LINCS(5) at Institut Mines-Télécom.
In order to improve the accuracy achieved by the biased maximum likelihood estimator (B-MLE) proposed in [53], and to let each sensor node build a local map of itself and its neighbors, we use the on-line distributed stochastic approximation (DSA) algorithm described in the previous section. Algorithm 5 was tested in three different indoor environments where several measurement campaigns were held. Apart from the FIT IoT-LAB testbed described in Section 4.2.4, we benefit from two more testbeds available in [3]. In general, when applying the MLE to the LNSM (4.2), the parameters of the propagation model, PL_0, η and σ², are considered equal for all landmarks, even in indoor scenarios (see for instance [145]). When dealing with closed and relatively small spaces, RSSI is not accurate enough, and the effects of multipath, possible blocking objects and antenna orientation may also be included in the propagation model as outliers (see, for instance, [11] and [62]). A B-MLE involving a random factor related to the possible outliers was proposed by [53] and has been experimentally shown to reduce the mean error of the classical MLE. As an anchor-based approach, the optimization problem is defined for each single unknown position, given several RSSI values from a set of surrounding landmarks. Since the average RSSI values measured at different positions do not always decrease with the distance (see the learning phase in [54]), we let the parameters of the propagation model differ from one landmark to another. As a result, the M considered landmarks are not treated equally during the statistical estimation phase (cf. Tables 6 and 7), and a set of M parameters is defined by {(PL_{0,k}, η_k, σ_k²)}_{k=1}^M.

4.6.1 Observation model: biased log-normal shadowing model (B-LNSM)

This section recalls the dynamic method introduced by [53] to estimate the position of a sensor node from a set of landmarks.
The sensor node seeks to reduce the effect of any potentially aberrant landmark whose measurements do not improve the localization accuracy. This effect is compensated by introducing a constant bias, which becomes an additional variable to estimate and replaces the log-normal shadowing model of the measurements associated with this landmark. As for the standard LNSM, we denote by P_L(t) the t-th RSSI sample measured by the sensor node from packets transmitted by a given landmark L. We define PL_{0,L}, η_L and σ_L² as the propagation parameters of landmark L, and we rewrite the general signal model (4.2) as follows:

  P_L(t) = −(PL_{0,L} + 10 η_L log_{10} d_L) I_{L≠O} + β I_{L=O} + N(0, σ_L²),   (4.40)

where d_L is the distance to landmark L, β is the constant bias which replaces the measurements coming from a given landmark O, and I is the indicator function, equal to 1 when the subscript expression is true and 0 otherwise. Abnormal landmarks O can be detected from equation (4.40), and the biased LNSM can then be fully characterized. The aberrant landmark can be identified by comparing the global likelihood values when each landmark is successively considered as the outlier.

Footnote 5: see the "Network, Mobility and Security" research group, http://www.infres.enst.fr/wp/nms/

4.6.2 Initialization: biased maximum likelihood estimator (B-MLE)

Combining all the measured values, we apply the MLE to the proposed model (4.40) to compute the likelihood expressions in the case where landmark O is considered as abnormal.
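A minimal sketch of drawing RSSI samples from the B-LNSM (4.40); the function name and the parameter values used below are illustrative, not taken from the testbeds.

```python
import math
import random

def sample_rssi(d_L, PL0_L, eta_L, sigma_L, is_outlier, beta, rng):
    """One RSSI sample from the B-LNSM (4.40): the log-normal shadowing
    mean is replaced by the constant bias beta when landmark L is the
    outlier O; Gaussian noise N(0, sigma_L^2) is added in both cases."""
    if is_outlier:
        mean = beta
    else:
        mean = -(PL0_L + 10.0 * eta_L * math.log10(d_L))
    return mean + rng.gauss(0.0, sigma_L)
```

Averaging many samples recovers the model mean, −(PL_{0,L} + 10 η_L log_{10} d_L) for a normal landmark and β for the outlier.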
If we denote by T_L the number of samples received from landmark L, the likelihood function for each L ≠ O is written as:

  L_L(x, y) = −T_L log σ_L² − (1/2) Σ_{t=1}^{T_L} ( (P_L(t) + PL_{0,L} + 10 η_L log_{10} d_L) / σ_L )²,   (4.41)

and for the outlier (abnormal) landmark O it becomes:

  L_O(β) = −T_O log σ_O² − (1/2) Σ_{t=1}^{T_O} ( (P_O(t) − β) / σ_O )².   (4.42)

The global likelihood function of the data set is then the sum of (4.41) and (4.42) over the M landmarks:

  L^O(x, y, β) = Σ_{L=1; L≠O}^{M} L_L(x, y) + L_O(β).   (4.43)

Thus, the maximum likelihood criterion applied to (4.43) is used to infer the sensor node's position and the corresponding bias β. Upon noting that (4.43) is a separable problem, the B-MLE solution is given by:

  (x_O, y_O, β_O) = argmax_{x,y} Σ_{L=1; L≠O}^{M} L_L(x, y) + argmax_β L_O(β),   (4.44)

where β_O = (1/T_O) Σ_{t=1}^{T_O} P_O(t).

4.6.3 Experimental results after the refinement phase

The numerical results are obtained by running Algorithm 5 with the positions initialized by the B-MLE (4.44). We consider three wireless sensor networks based on the ZigBee IEEE 802.15.4 standard and operating at 2.4 GHz. The three real testbeds involve different dimensions and low-power devices: two testbeds use TMote Sky nodes (technical details in http://www.eecs.harvard.edu/~konrad/projects/shimmer/references/tmote-sky-datasheet.pdf) and one uses WSN430 nodes; both node types include the CC2420 RF transceiver. For each testbed, the procedure is as follows. Each of the N_L sensor nodes selected for the learning phase broadcasts 100 frames. Then, the M landmarks compute the set of propagation model parameters {(PL_{0,k}, η_k, σ_k²)}_{k=1}^M as detailed in [54]. We set two different sizes N_L of the learning data set: a small set involving the first 10 sensor nodes' positions and a large one involving the first 25.
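The separability of (4.44) means that the bias estimate decouples from the position search: β_O is simply the empirical mean of landmark O's samples, which maximizes the quadratic likelihood (4.42). A minimal sketch (function names are illustrative):

```python
import math

def b_mle_bias(p_samples):
    """Closed-form bias of (4.44): the empirical mean of the RSSI
    samples of the landmark treated as outlier."""
    return sum(p_samples) / len(p_samples)

def likelihood_outlier(beta, p_samples, sigma2):
    """L_O(beta) of (4.42): -T*log(sigma^2) - (1/2) sum ((P(t)-beta)/sigma)^2."""
    T = len(p_samples)
    return -T * math.log(sigma2) \
           - 0.5 * sum((p - beta) ** 2 / sigma2 for p in p_samples)
```

Since L_O is concave and quadratic in β, the empirical mean attains its maximum, which is what makes the B-MLE solution separable.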
The RSSI values are collected from the set of received frames, and the parameters are then estimated by applying the MLE criterion as in (4.4)-(4.5). The remaining N − N_L sensor nodes compute a first estimate of their positions using the B-MLE (4.44), given the parameters and the RSSI values from their corresponding received frames. The refinement phase is subsequently applied, taking the latter positions as the initialization values of the distributed on-line Algorithm 5. At each iteration, a single frame is broadcast by each sensor in order to compute the local estimates, and only two randomly selected sensor nodes exchange their common estimated positions, which are finally updated as the average value.

1) Office scenario

The testbeds considered in this section are the same as those of [55]. The first testbed, located in Paris, is a small semi-furnished office at the LINCS laboratory with dimensions 4 × 3 m². It involves the positions of 48 sensors and 5 landmarks. The second, larger testbed, located at Telecom Bretagne, is a classroom of the Rammus platform hosted by the RSM department, whose dimensions are 8.77 × 6.46 m². It involves the positions of 57 sensor nodes and 8 landmarks. In both testbeds the sensor nodes are placed 1.25 m above the floor. None of these rooms is electromagnetically isolated, and active wireless access points close to the testbeds may create interference. The presence of two moving persons in both rooms may obstruct the line-of-sight between the sensor nodes and the landmarks during the communication phases. Both testbeds are illustrated in Figure 4.14 (available and detailed in [3]).
[Figure 4.14 (two panels): (a) LINCS testbed layout (N = 48 nodes, M = 5 landmarks) with the RSSI values collected at node 15 from the 5 landmarks; (b) Rammus testbed layout (N = 57 nodes, M = 8 landmarks) with the RSSI values collected at node 20 from the 8 landmarks.]

Figure 4.14. Office testbeds and RSSI values collected at the squared nodes from data transmitted by the M landmarks. One marker highlights the real RSSI values; two further markers indicate, respectively, the average and the minimum/maximum values of 100 i.i.d. random samples drawn from the theoretical LNSM (4.2) given the estimated parameters in Table 6.

Regarding the real RSSI data shown in Figure 4.14, some values received at each of the two nodes (node 15 at the LINCS testbed and node 20 at the Rammus testbed) are affected by a bias depending on the landmark. For instance, in Figure 4.14 (a), the data coming from LM3, which has the highest path loss exponent η (see Table 6), suffer a considerable bias with respect to the theoretical mean value (around −42 dBm), since the real values lie around −70 dBm.
From Figure 4.14 (b) we observe a gap of around 20 dBm between the theoretical and the empirical mean value of the RSSI coming from LM1, which is the landmark closest to node 20, possibly due to its proximity to the wall, since its corresponding estimated parameters are rather mild (see Table 6). The estimated parameters for each landmark are summarized in Table 6.

LINCS testbed:
LM_ID   PL0     η      σ²
LM1     40.72   1.47   26.91
LM2     29.97   2.81   49.55
LM3     40.59   3.11   12.09
LM4     57.69   2.06   17.76
LM5     40.38   0.92   31.29

Rammus testbed:
LM_ID   PL0     η      σ²
LM1     52.09   1.31   11.89
LM2     49.23   0.81   26.17
LM3     47.83   1.51   27.51
LM4     41.09   1.97   6.69
LM5     42.67   1.38   26.81
LM6     45.69   1.66   23.08
LM7     71.85   −2.19  12.69
LM8     50.97   0.51   13.19

Table 6. Estimated parameters from the office at the LINCS testbed (top) and the classroom at the Rammus testbed (bottom) when the small data set of 10 sensor nodes is selected.

2) FIT IoT-LAB platform

The testbed at the FIT IoT-LAB platform in Rennes, of size 5 × 9 m², involves the positions of 44 sensor nodes and 6 landmarks; its configuration is shown in Figure 4.15. The WSN430 open nodes available at the platform are located in a big storage room containing various objects, and they are placed on the ceiling, 1.9 m above the floor, in a grid organization. There was no one in the room most of the time, and the only wireless access point was located in the corridor, which is separated by a cinder wall (not electromagnetically isolating). On the right in Figure 4.15 we illustrate the real and the empirical RSSI values collected at the node whose Node_id is 240, coming from the 6 landmarks. Note that the most important bias, of more than 10 dBm, appears on the RSSI corresponding to the closest landmark, LM244, which is situated next to the wall of the room and has the highest path loss value PL0 (see Table 7). The estimated parameters for each landmark are summarized in Table 7.
3) Comparison and discussion

We run both algorithms: the initialization performed by (4.44) and the refinement phase performed by the distributed Algorithm 5.

[Figure 4.15 (two panels): testbed selection at the Rennes FIT IoT-LAB platform (N = 44 nodes, M = 6 landmarks, left) and the RSSI values collected at node 240 from the 6 landmarks (right).]

Figure 4.15. Network configuration of the 50 sensors selected at the FIT IoT-LAB platform in Rennes [1].

LM_ID    PL0     η      σ²
LM157    62.19   1.76   19.06
LM163    63.61   2.83   40.87
LM176    58.4    3.39   37.04
LM214    63.33   1.98   75.62
LM236    58.55   2.80   30.03
LM244    67.67   2.29   18.97

Table 7. Estimated parameters from the FIT IoT-LAB Rennes testbed when the small data set of 10 sensor nodes is selected.

In order to evaluate and quantify the accuracy achieved by these methods, we define the normalized mean deviation (NMD) as the RMSE over the N estimated positions normalized by the testbed's dimensions, i.e. l × h m². It can be defined as:

  NMD = (1/N) Σ_{i=1}^{N} NMD_i = (1/√(l² + h²)) (1/N) Σ_{i=1}^{N} ||ẑ_i − z_i||,

where {ẑ_i}_{∀i} is the set of the N estimated positions. Figure 4.16 illustrates the decrease of the RMSE over the N positions along the iteration index n for each testbed, where each n involves a communication between two randomly selected nodes. Note that, the environment being real, the algorithm converges to an asymptotic error which may depend on the testbed's parameters. Regarding the different curves in Figure 4.16, the earliest testbed to achieve an improvement (after n = 24 iterations, implying 48 communications) is the Rammus testbed.
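The NMD defined above is straightforward to compute; a short sketch (function name illustrative):

```python
import math

def nmd(est, true, l, h):
    """Mean Euclidean location error over the N positions,
    normalized by the testbed diagonal sqrt(l^2 + h^2)."""
    n = len(est)
    mean_err = sum(math.dist(e, t) for e, t in zip(est, true)) / n
    return mean_err / math.sqrt(l * l + h * h)
```

For example, on a 3 × 4 m² testbed, a single position off by the full diagonal and another estimated exactly give an NMD of 0.5.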
However, this testbed achieves the worst accuracy after the refinement phase, since its RMSE curve always remains above the other two. The best accuracy is achieved with the LINCS testbed, even if the convergence is slower: 89 refinement iterations, implying 178 communications, are required to improve the mean localization error. As reported in Table 8, an RMSE of less than 80 cm is obtained for the LINCS testbed. In order to summarize the results for each testbed, Table 8 displays the average error, both the RMSE in meters and the corresponding normalized value NMD. The numerical results are reported for the two sizes of data sets chosen during the learning phase of the B-MLE (N_L = 10 and N_L = 25).

[Figure 4.16: RMSE_n versus the iteration time n (0 to 250) for the DSA and B-MLE methods on the LINCS, Rammus and FIT IoT-LAB testbeds.]

Figure 4.16. Convergence of the RMSE sequence generated by Algorithm 5 along the iteration time n for each testbed when considering the small learning data set (N_L = 10). Markers emphasize the iteration time at which the RMSE of the distributed refinement phase becomes lower than the RMSE computed by the initialization B-MLE.

Small learning data set (N_L = 10):
Testbed       Method   RMSE (m)   NMD    Improvement (%)   Positions improved (%)
LINCS         B-MLE    1.39       0.28
              DSA      0.77       0.16   44.3              76
Rammus        B-MLE    2.73       0.25
              DSA      1.73       0.16   36.7              72
FIT IoT-LAB   B-MLE    1.85       0.18
              DSA      1.3        0.13   29.5              74

Big learning data set (N_L = 25):
Testbed       Method   RMSE (m)   NMD    Improvement (%)   Positions improved (%)
LINCS         B-MLE    1.35       0.27
              DSA      1.31       0.26   3.8               55
Rammus        B-MLE    2.27       0.22
              DSA      1.64       0.15   30.5              63
FIT IoT-LAB   B-MLE    2.27       0.22
              DSA      1.84       0.18   19                68

Table 8. Localization error averaged over the N estimated sensor nodes' positions for each of the three testbeds when using the small data set of 10 positions (top) and the big data set of 25 (bottom).
In addition, we compute the accuracy improvement ratio as the percentage (1 − ρ)%, where ρ is the ratio between the NMD achieved after the refinement phase described in Section 4.5.1 and the one achieved previously by the B-MLE. We also compute the ratio of improved positions after the refinement phase (see the column "Positions improved" in Table 8). From the results reported in Table 8, the best accuracy improvement is in general obtained in the case of the small learning data set, i.e. N_L = 10. The best accuracy, about 70 cm, is achieved for the smallest testbed (LINCS), which at the same time exhibits the biggest noise factor, σ² ≈ 49.55 dB. However, the LINCS testbed requires the highest number of pairwise communications between the sensor nodes during the refinement phase. Our numerical results appear consistent with other experiments involving real indoor scenarios with testbeds of similar dimensions and numbers of sensor nodes: see for instance the accuracy between 1.5 and 2.5 m in the experiments of [145], or the 2.27 m reported in [45].

Figure 4.17. Boxplot of the NMD values obtained over 100 independent runs of the doMLE (Algorithm 5) for each testbed when considering the parameter learning sizes N_L = 10 and N_L = 25.

In addition, Figure 4.17 illustrates the statistical behavior of the localization error (NMD) through the corresponding boxplot representations. Both the LINCS and FIT IoT-LAB testbeds behave similarly: the standard deviation of the NMD error decreases when the learning data size increases (from N_L = 10 to 25), but the mean value of the NMD error increases. Indeed, more outlier values appear in that case, since considering a larger number of positions may add corrupted data to the learning phase, affecting the estimated parameters.
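The improvement ratio (1 − ρ)% used in Table 8 is simple arithmetic; as a sketch (function name illustrative):

```python
def improvement_pct(nmd_before, nmd_after):
    """Accuracy improvement (1 - rho) * 100 with rho = NMD_after / NMD_before."""
    return (1.0 - nmd_after / nmd_before) * 100.0
```

For instance, halving the NMD corresponds to a 50% improvement, and an unchanged NMD to 0%.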
The smallest testbed (LINCS) has the worst error performance, since it gives the highest standard deviation and mean values when considering the big learning data set (N_L = 25), and the highest standard deviation when considering the small one (N_L = 10). On the contrary, the FIT IoT-LAB testbed, which has the biggest dimensions and is rarely occupied by moving people, maintains the lowest standard deviation of the error for both values of N_L and achieves the best NMD accuracy when considering the small learning set. Finally, the middle-sized Rammus testbed has a more regular behavior, since it gives a similar performance independently of the value of N_L. The localization error at each sensor node is detailed in Figure 4.18. We report the NMD values {NMD_i}_{∀i} for each sensor node of each testbed, before (given by the B-MLE) and after the refinement phase (given by the doMLE algorithm), when considering the small and the big learning data sets.

[Figure 4.18 (six panels): (a) LINCS testbed, N_L = 25; (b) LINCS testbed, N_L = 10; (c) Rammus testbed, N_L = 25; (d) Rammus testbed, N_L = 10; (e) FIT IoT-LAB testbed, N_L = 25; (f) FIT IoT-LAB testbed, N_L = 10.]

Figure 4.18. NMD values at each testbed before (blue bars) and after (red bars) the refinement phase (Algorithm 5) for each estimated sensor position. On the left, N_L = 25; on the right, N_L = 10.

Regarding Figure 4.18 and the information summarized in Table 8, the accuracy improvement and the number of improved positions are considerably higher when N_L = 10. For the LINCS testbed about 76% of the positions are improved, while for the Rammus testbed the percentage is 72%, when considering 10 positions for the learning phase. However, these percentages decrease to 55% and 63%, respectively, for the LINCS and Rammus testbeds when 25 positions are considered during the learning phase.
Moreover, comparing the localization errors in Figure 4.18 with the network configurations (see Figures 4.14 and 4.15), the estimated positions whose accuracy does not improve after the refinement phase are those of sensor nodes located in denser areas, or in the middle, surrounded by objects and by the other sensor nodes (see for instance nodes 13, 14, 28 and 30 at the LINCS testbed, nodes 4, 5, 17, 18 and 28 at the Rammus testbed, and nodes 197, 216 and 234 at the FIT IoT-LAB testbed). In some cases, there is no accuracy improvement for nodes located in the corners, such as nodes 252 and 253 at the FIT IoT-LAB testbed or nodes 1 and 42 at the Rammus testbed. In conclusion, after the refinement phase, an accuracy improvement of at least 30% is achieved, and more than 70% of the positions are improved, for different indoor scenarios involving different dimensions and different radio devices.

Conclusions and perspectives

In this thesis, two applications of distributed stochastic approximation in multi-agent systems have been considered: consensus-based distributed optimization and distributed principal component analysis (PCA). Regarding consensus-based methods, we addressed the case where a network of agents seeks to find the global minimizer of an optimization problem. The aim is to drive the local iterates of all agents to a common minimizer. We have concentrated our efforts on the theoretical analysis of an adaptation-diffusion algorithm, where agents iteratively update their local iterates and merge them by communicating with their neighbors. We have demonstrated almost sure convergence under weak conditions on the communication protocols. Although double stochasticity is generally assumed in past works, our convergence result holds even when the matrix W_n characterizing the network exchanges is not doubly stochastic.
This observation opens the possibility of using simple communication schemes between agents, such as the intuitive broadcast protocol, in which agents send information to their neighbors without expecting any instantaneous feedback from the latter. We have also analyzed the convergence rate of the method. More precisely, we have proved that the estimation error between the iterates and the minimizer tends to zero at rate √γ_n, where γ_n is the step size of the algorithm. The normalized estimation error is asymptotically normal, and the limiting covariance matrix has been characterized. As a consequence of our results, we have shown that the price to pay for using simple non-doubly stochastic weight matrices is an increase of the asymptotic variance of the estimation error. We have applied our results and tested our algorithms on the problem of statistical inference in wireless sensor networks, with a special focus on self-localization problems. We have also proposed and analyzed a distributed on-line expectation-maximization algorithm (see Appendix A) which relies on the same principles. Regarding distributed PCA, we addressed the case where the i-th agent seeks to estimate the i-th entry of the principal eigenvectors of a given matrix, based on noisy and distributed measurements of that matrix. We have proposed an iterative algorithm based on sporadic information exchanges in the network and proved its almost sure convergence. The algorithm can be seen as a distributed version of Oja's algorithm, a popular stochastic approximation method for estimating the principal eigenspace of a matrix. We have applied our results to the issue of self-localization in wireless sensor networks. We have considered the case where agents are identified with sensors able to collect noisy measurements of the distance to their neighbors.
We have proposed a distributed version of a multidimensional scaling algorithm based on PCA, which makes it possible to recover the positions of the sensors as the eigenvectors of a so-called similarity matrix computed from inter-sensor distances. In addition, our algorithm encompasses the context where measurements are gathered in an on-line and sporadic fashion. We have also tested our algorithms on a wireless sensor network platform, namely the FIT IoT-LAB platform7. The collaboration with N.A. Dieng has made possible our contributions to the localization framework in WSN (see [56]). Besides, we could show the performance on real testbeds from data acquired in different indoor scenarios (see [3] and [2]). There still remain several open problems in the continuity of this thesis, which have not been addressed due to lack of time. First of all, it would be interesting to analyze the effect of Polyak-Ruppert averaging methods, which are known to increase the convergence rate of stochastic approximation methods (see [98]). Most probably, such methods are effective in a distributed setting as well, as discussed in [19] in the special case of doubly stochastic matrices. Second, this thesis has focused on iterative algorithms with vanishing step sizes. In stochastic contexts where measurements are collected on-line and then deleted, vanishing step sizes are generally required to ensure the convergence of the algorithms. Nevertheless, in signal processing it is also important to consider methods with a constant step size. Such methods are generally not convergent, but they allow the variations of the environment to be tracked adaptively (e.g. adaptive optimization for target tracking [151]). This is particularly well suited to the case of mobile sensor networks, for instance.
Typically, our self-localization algorithm based on distributed PCA would be especially relevant in the case of a constant step size, as it would allow each sensor to adaptively track its own position by collecting noisy measurements of the distance to its neighbors. Indeed, the constant step-size setting could have been considered in the last part of this thesis, since the experiments held at the FIT IoT-LAB platform with wireless sensor nodes made us wonder about the localization problem for mobile nodes, i.e. tracking the trajectories of mobile sensor nodes. Our goal was to design a distributed algorithm that could be tested in real WSN, ideally in both static and mobile contexts. However, we were not able to address the mobile localization problem for two reasons: the lack of time and the lack of resources on real devices. A first period had to be devoted to studying the localization framework and the fixed-position case. In a second stage, we found FIT IoT-LAB to be a suitable open platform allowing remote access to both real static and mobile resources (see [148] for a recent survey on existing testbeds). Unfortunately, the early development stage of the platform affected our testbeds: at the time we ran the experiments, the available nodes had to be progressively replaced by a new generation of wireless sensor nodes with more reliable features, and the mobile nodes were not yet operational. Thus, as future work, the distributed algorithm presented in Chapter 4 could be adapted to enable position tracking and tested on the mobile WSN recently made available at the FIT IoT-LAB platform. More recently, in the distributed optimization framework, several works [32], [153], [150], [157], [114], [22] have proposed other methods to solve the problem described as a sum of private functions (2.3) (see Chapter 2).
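The trade-off between vanishing and constant step sizes in a drifting environment can be illustrated on a scalar toy problem (a sketch under arbitrary drift and noise levels; names are illustrative):

```python
import random

def track(step_fn, drift=0.01, n=3000, seed=0):
    """Stochastic-approximation estimate of a slowly drifting mean.
    step_fn(k) returns the step size used at iteration k."""
    rng = random.Random(seed)
    target, est = 0.0, 0.0
    for k in range(1, n + 1):
        target += drift                       # the environment moves
        y = target + rng.gauss(0.0, 0.5)      # noisy observation of it
        est += step_fn(k) * (y - est)         # SA update
    return est, target
```

With a constant step the estimate stays within a bounded neighborhood of the moving target, while the vanishing step 1/k, which would be appropriate in a stationary setting, lags far behind, illustrating why constant step sizes matter for mobile sensor networks.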
Indeed, distributed gradient methods, such as the adaptation-diffusion algorithm studied in this thesis, perform well in an on-line setting where only noisy versions of the gradients are available: as we have shown, such methods achieve the convergence rate that is usually expected from centralized stochastic approximation methods. Nevertheless, in the absence of stochastic perturbations (i.e. when gradients are assumed to be perfectly observed), a vanishing step size is still needed and no significant improvement of the convergence rate is thus expected. On the other hand, recent works on distributed proximal methods (especially [22]) have shown that it is possible to construct a first-order distributed optimization method (i.e. one which computes only gradients) which does not require a vanishing step size. It would be interesting to investigate the behavior of such methods in a stochastic setting and to study how well they compare to distributed gradient methods. Another aspect is the use of alternative gossip protocols such as the so-called push-sum protocol [89], [34], [84]. Such a protocol, initially designed to avoid bilateral exchanges in average consensus algorithms, can also be cast into a distributed optimization framework, as shown by [153], [114]. Push-sum protocols and their variants are easy to use from a networking point of view, just like the broadcast protocol studied in this thesis. As a consequence, it would be interesting to analyze the advantages and drawbacks of both methods from the point of view of stochastic approximation.

Footnote 7: https://www.iot-lab.info/deployment/
Part III: Appendices

Appendix A. Application to distributed parameter estimation

On-Line Gossip-based Distributed Expectation Maximization Algorithm

This appendix is extracted from the proceedings of the 2012 IEEE Statistical Signal Processing Workshop (SSP).

Abstract. In this paper, we introduce a novel on-line Distributed Expectation-Maximization (DEM) algorithm for latent data models, including Gaussian mixtures as a special case. We consider a network of agents whose mission is to estimate a parameter from the time series locally observed by the agents. Our estimator works on-line and asynchronously: it starts processing data as they arrive, without needing a time-line shared by the network. Agents update some local summary statistics using recent data (E-step), then share these statistics with their neighbors in order to eventually reach a consensus (gossip step), and finally use them to generate individual estimates of the unknown parameter (M-step). Our algorithm is shown to converge under mild conditions on the gossip protocol, freeing the network from feedback communications and hence making this DEM algorithm particularly well suited to Wireless Sensor Networks (WSN).

A.1 Introduction

We address the problem of maximum likelihood estimation for latent data models. This problem is usually addressed by the celebrated Expectation-Maximization (EM) algorithm [52]. A typical use case of a latent data model is the problem of mixture parameter estimation: a feature is observed for several individuals in a population made of M distinct subgroups, but the observer does not know which subgroup a given individual belongs to; the goal is to estimate the feature distribution of each subgroup along with the subgroup proportions. Recently, the Distributed Expectation-Maximization (DEM) algorithm has attracted a great deal of attention.
Consider a network formed by N agents whose mission is to estimate an unknown parameter θ. At time n, each agent i = 1, . . . , N observes a random sample Y_{i,n} which is assumed to be governed by a latent, unobserved random variable X_{i,n}. The main challenge is to propose efficient ways to extend the celebrated EM algorithm to a distributed setting, i.e., in the absence of a fusion center able to collect the data at every time instant. In typical scenarios such as wireless sensor networks, the nodes are generally assumed to be able to process their local observations and to share a limited amount of information with their neighbors. There is a vast literature on the EM algorithm. Most works have been devoted to the standard batch version of the EM algorithm: the data is first collected, stored in memory, and then processed. In [146], and later in [36], the authors use stochastic approximation tools to propose an on-line version of the EM algorithm: the data does not need to be stored, and each new single piece of data can lead to updated estimates. The original EM algorithm is also a centralized algorithm. In [119], a distributed version of the EM algorithm is presented in the case of a Gaussian mixture. The algorithm of [119] uses an incremental approach where a message has to cycle across the network, going through each node once per cycle (Hamiltonian cycle). This is a limitation for at least two reasons. First, finding a Hamiltonian cycle is an NP-complete problem and, second, letting the algorithm depend on a single message traveling across the network lacks robustness. Several alternatives to the latter incremental method have been developed quite recently; see for instance [93, 74] and [64], to cite a few. All of these works investigate a batch context where the data has to be stored in the sensors' memories.
In addition, most of these works only investigate the case of Gaussian mixtures and may be quite demanding in terms of communication protocol between nodes (number of communications between consecutive iterations, existence of feedback links, synchronism). We consider a network of agents whose mission is to estimate a parameter from the time series locally observed by the agents. Our estimator, based on the elegant approach of [36], consists of three main steps: agents update some local summary statistics using recent data (E-step), then share these statistics with their neighbors (gossip step), and finally use them to generate individual estimates of the unknown parameter (M-step). We consider Gaussian mixtures as a special case. The paper is organized as follows. Section A.2 introduces the parametric model; we review the centralized versions of the EM algorithm in Section A.3; our algorithm is introduced in Section A.4; convergence results are established in Section A.5; closed-form expressions are provided in Section A.6 for Gaussian mixtures; Section A.6.2 is devoted to numerical results.

A.2 Parametric model: exponential families

Consider a sensor network formed by N nodes (the terms sensor and node will be used interchangeably). For any i, consider a couple of random variables (r.v.) Z_i := (X_i, Y_i) on a measurable space (Ω, F), where Y_i ∈ Y represents an observed variable and X_i ∈ X is a latent/unobserved variable. We assume that the aim of the network is to estimate a parameter θ of the form θ = (θ̄, α_1, . . . , α_N), where θ̄ ∈ Θ̄ and α_i ∈ A for all i, with Θ̄ and A arbitrary sets (we define Θ := Θ̄ × A^N). One should think of θ̄ as a global parameter and of α_i ∈ A as a local parameter identifiable by node i only. Let (P_θ)_{θ∈Θ} be a collection of probability measures on (Ω, F). We denote by f_θ(z_1, . . . , z_N) the p.d.f. of (Z_1, . . . , Z_N) induced by the model P_θ, w.r.t. some arbitrary reference measures on (X × Y)^N.
We denote by gθ(y1, . . . , yN) the p.d.f. of the observations induced by Pθ. We assume that the observations (Y1, . . . , YN) have an unknown p.d.f. π under some probability Pπ on (Ω, F). Here, Pπ represents the actual probability under which the observed samples are generated. For a better understanding, it might be convenient to think of π as π = g_{θ⋆} for some "true" parameter θ⋆; however, our algorithm and our analysis do not require such a hypothesis. Denote by ⟨·, ·⟩ the inner product in R^p and by |·| the Euclidean norm.

Assumption A.1. For any θ = (θ̄, α1, . . . , αN),

i) For any z1, . . . , zN,
$$f_\theta(z_1, \ldots, z_N) = \prod_{i=1}^{N} f_{i;\bar\theta,\alpha_i}(z_i)$$
where the marginal p.d.f. f_{i;θ̄,αi}(zi) coincides with:
$$h_i(z_i) \exp\left( -\bar\psi(\bar\theta) - \psi_i(\alpha_i) + \langle S_i(z_i), \bar\phi(\bar\theta) + \phi_i(\alpha_i) \rangle \right)$$
where ψ̄ : Θ̄ → R, φ̄ : Θ̄ → R^p, ψi : A → R, φi : A → R^p and Si : X × Y → R^p are some measurable functions, and hi(zi) is a normalization factor.

ii) The r.v. Eθ[Si(Zi)|Yi] is well defined for any i.

In the sequel, we assume that a sequence of independent and identically distributed (i.i.d.) observations is available at each sensor. More precisely, for each i = 1, . . . , N, we introduce a time series Zi,n = (Xi,n, Yi,n) (n = 1, 2, . . . ) such that, under Pθ, (Zi,n)n≥1 is i.i.d. and has the same distribution as Zi. Here, (Yi,n)n≥1 represents the sequence of observations of sensor i, while (Xi,n)n≥1 represents the sequence of hidden r.v.'s.

A.3 Centralized EM algorithms

We review centralized EM algorithms, assuming that a fusion center is able to gather the information of all sensors at each instant n. Although we are interested in on-line algorithms, we first review the usual batch version of the EM algorithm for convenience.

A.3.1 Batch EM

Assume that each sensor i collects T observations Yi,1:T := (Yi,1, . . . , Yi,T).
The so-called intermediate quantity of the EM algorithm plays a central role:
$$Q_T(\theta', \theta) := \frac{1}{NT} \sum_{n=1}^{T} \sum_{i=1}^{N} \mathbb{E}_{\theta'}\!\left[ \log f_{i;\bar\theta,\alpha_i}(Z_{i,n}) \,\middle|\, Y_{i,n} \right] \qquad \text{(A.1)}$$
where θ′, θ ∈ Θ̄ × A^N, θ = (θ̄, α1, . . . , αN) and Eθ is the expectation associated with Pθ. The EM algorithm is an iterative procedure which generates an estimate θ^(k) = (θ̄^(k), α1^(k), . . . , αN^(k)) at each iteration k. The update is done in two steps:

E-step: Compute the function θ ↦ QT(θ^(k), θ);
M-step: Set θ^(k+1) := arg maxθ QT(θ^(k), θ).

In practice, such an algorithm makes sense only if each of the above steps can be realized at low computational cost. Under Assumption A.1, both steps simplify as follows. Consider i = 1, . . . , N, θ = (θ̄, α1, . . . , αN) and θ′ = (θ̄′, α′1, . . . , α′N). Let us introduce a function y ↦ σ_{i;θ̄,αi}(y) defined on Y such that w.p.1:
$$\sigma_{i;\bar\theta,\alpha_i}(Y_i) = \mathbb{E}_{\theta}\left( S_i(Z_i) \mid Y_i \right). \qquad \text{(A.2)}$$
By Assumption A.1, the RHS of the above equality depends on θ only through θ̄ and αi. It is straightforward to show that E_{θ′}[log f_{i;θ̄,αi}(Zi) | Yi] coincides with:
$$-\bar\psi(\bar\theta) - \psi_i(\alpha_i) + \langle \sigma_{i;\bar\theta',\alpha'_i}(Y_i), \bar\phi(\bar\theta) + \phi_i(\alpha_i) \rangle$$
up to an additive random term E_{θ′}(log hi(Zi)|Yi) which does not depend on θ and which shall thus play no role in the M-step. Thus, up to a constant w.r.t. θ, the intermediate function QT(θ^(k), θ) at iteration k coincides with:
$$-\bar\psi(\bar\theta) + \langle \bar s^{(k)}, \bar\phi(\bar\theta) \rangle + \frac{1}{N} \sum_{i=1}^{N} \left( -\psi_i(\alpha_i) + \langle s_i^{(k)}, \phi_i(\alpha_i) \rangle \right) \qquad \text{(A.3)}$$
where s_i^(k) := (1/T) Σ_{n=1}^T σ_{i;θ̄^(k),αi^(k)}(Yi,n) and s̄^(k) := (1/N) Σ_{i=1}^N s_i^(k). We will respectively refer to these quantities as the local and global sufficient statistics. The E-step reduces to the computation of s_i^(k) for any i = 1, . . . , N, and of their average s̄^(k). The maximization of (A.3) can be achieved separately with respect to (w.r.t.) θ̄, α1, . . . , αN.
Assume that the following functions are well defined for any s in a relevant domain, and that their numerical computation is inexpensive:
$$\mathsf{M}(s) := \arg\max_{\bar\theta \in \bar\Theta} \; -\bar\psi(\bar\theta) + \langle s, \bar\phi(\bar\theta) \rangle \qquad \text{(A.4)}$$
$$\mathsf{M}_i(s) := \arg\max_{\alpha \in \mathsf{A}} \; -\psi_i(\alpha) + \langle s, \phi_i(\alpha) \rangle. \qquad \text{(A.5)}$$
The standard batch EM algorithm is summarized below in Algorithm 6.

Algorithm 6: Centralized batch EM algorithm (EM)
Initialize: s_i^(0), s̄^(0) for all i = 1, . . . , N.
Update: At each iteration k ≥ 0 do
  E-step: Compute s_i^(k) for any i, and the average s̄^(k).
  M-step: For all i = 1, . . . , N, set αi^(k+1) := Mi(s_i^(k)). Set θ̄^(k+1) := M(s̄^(k)).

A.3.2 On-line EM

From now on to the end of this paper, we assume that each sensor observes the time series (Yn,i)n≥1. We are interested in on-line algorithms, i.e., algorithms which are able to update the estimate any time new samples come in. The idea behind the on-line EM algorithm of [36] is simply to replace the batch sufficient statistics with their on-line counterparts. In such a case, there is no difference between the indices n and k, and the E-step is computed any time a new observation comes in. Assume that each agent i has access to its time series (Yn,i)n≥1. The algorithm proposed by [36] replaces the E-step, which involves an average among the N collected samples, by an iterative stochastic approximation step, in order to track this average value at the same time as the M-step tracks the estimated parameters. The estimate θn at time n is generated similarly to Algorithm 6, in two steps, after an arbitrary initialization of the values s0,1, . . . , s0,N. The on-line E-step is given by the following recursion:
$$s_{n,i} = s_{n-1,i} + \gamma_n \left( \sigma_{i;\bar\theta_{n-1},\alpha_{n-1,i}}(Y_{n,i}) - s_{n-1,i} \right) \qquad \text{(A.6)}$$
$$\bar s_n = \frac{1}{N} \sum_{i=1}^{N} s_{n,i}, \qquad \text{(A.7)}$$
where γn is a positive step size (gain). We refer to s_{n,i} as a summary statistic. Next, the estimate θn is updated by the following M-step:
$$\bar\theta_n = \mathsf{M}(\bar s_n) \quad \text{and} \quad \forall i, \;\; \alpha_{n,i} = \mathsf{M}_i(s_{n,i}).$$
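To make the recursion (A.6)-(A.8) concrete, the following is a minimal single-sensor sketch for a one-dimensional Gaussian mixture in which only the component means are unknown; the known variances and mixing weights, the step-size exponent, and all numerical values are illustrative assumptions and not the setting analyzed above.

```python
import numpy as np

def online_em_gmm(y_stream, mu0, var, weights, gamma=lambda n: n ** -0.6):
    """Single-sensor on-line EM for a 1-D Gaussian mixture with known
    variance and mixing weights; only the component means are tracked.
    Summary statistic of component m: s[m] = (E[1{X=m}], E[1{X=m} Y])."""
    weights = np.asarray(weights, dtype=float)
    mu = np.asarray(mu0, dtype=float)
    s = np.column_stack([weights, weights * mu])
    for n, y in enumerate(y_stream, start=1):
        # E-step: posterior responsibilities under the current parameters
        w = weights * np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(var)
        w /= w.sum()
        # stochastic approximation step, cf. (A.6)
        s += gamma(n) * (np.column_stack([w, w * y]) - s)
        # M-step, cf. (A.8): the mean of component m is s[m,1] / s[m,0]
        mu = s[:, 1] / s[:, 0]
    return mu
```

Feeding the function a long i.i.d. stream drawn from a well-separated two-component mixture drives the tracked means toward the true ones, with fluctuations controlled by the decaying gain.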
(A.8)

The asymptotic analysis of the above centralized algorithm is available in [36] under the hypothesis of vanishing gains γn such that:

Assumption A.2. The sequence (γn)n≥0 is positive, non-increasing, and satisfies:
i) Σ_n γn = +∞,
ii) Σ_n γn² < ∞.

Remark A.1. The convergence result of [36] holds under the assumption that the algorithm is stable: the sequence of summary statistics remains almost surely in some compact set, strictly included in the domain of definition of the functions M and Mi. Verifying this assumption is not an easy task. Instead, it is common practice in stochastic approximation to force stability by confining the updated sequence (A.6) to a given convex compact set S (see [98, p. 120] for a discussion). Here, we shall follow this approach. We denote by ΠS the Euclidean projector onto the set S. Motivated by Remark A.1, we thus introduce the following assumption.

Assumption A.3. There exists a convex compact set S such that the following holds for any i = 1, . . . , N. The functions M : S → Θ̄ and Mi : S → A are well defined by (A.4) and (A.5), i.e., the argument of the maximum is a singleton.

A.4 Proposed distributed on-line EM

A.4.1 Algorithm

We now assume that no fusion center is available: each sensor i observes (Yn,i)n≥1 but ignores the samples collected by the other nodes j ≠ i. Each node recursively generates a sequence of estimates (αn,i)n≥1 of its local parameter and a sequence (θ̄n,i)n≥1 of estimates of the global parameter. The estimation relies on the recursive computation of on-line summary statistics similar to (A.6) and (A.7). Of course, (A.6) and (A.7) are no longer available in a distributed setting. Thus, (A.6) must be substituted by:
$$s'_{n,i} = \Pi_S\!\left[ s'_{n-1,i} + \gamma_n \left( \sigma_{i;\bar\theta_{n-1,i},\alpha_{n-1,i}}(Y_{n,i}) - s'_{n-1,i} \right) \right] \qquad \text{(A.9)}$$
where we simply replaced θ̄n−1 of (A.6), which is unavailable in the distributed setting, by θ̄n−1,i, and added the projector ΠS as discussed in Remark A.1.
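The projector Π_S in (A.9) is the ordinary Euclidean projection onto a compact convex set; for the simple sets typically used in practice it has a closed form. A minimal sketch follows, where the choice of a box or a centered ball for S is an illustrative assumption:

```python
import numpy as np

def project_box(s, lower, upper):
    """Euclidean projection onto the box S = [lower, upper]^p: for a box,
    the projection decouples across coordinates into a simple clipping."""
    return np.clip(s, lower, upper)

def project_ball(s, radius):
    """Euclidean projection onto the centered ball S = {x : |x| <= radius}:
    points outside the ball are rescaled onto its boundary."""
    s = np.asarray(s, dtype=float)
    norm = np.linalg.norm(s)
    return s if norm <= radius else s * (radius / norm)
```

In both cases the projection is non-expansive, which is what the stability argument of Remark A.1 exploits.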
Second, the computation of the average (A.7) is likely to be difficult. It either requires finding a Hamiltonian cycle in the graph, as in [119], or using gossip-based average consensus techniques such as [31], which require a significant amount of communication before convergence. In practice, assuming a large number of communications during each time slot n can be highly restrictive: it may be that only a couple of nodes communicate at a given time n, or even no nodes at all. In the proposed algorithm, each node replaces the unknown average (A.7) by a r.v. s̄n,i which is updated in two steps at each time n:

[Local E-step] Each node i generates a temporary update s̃n,i based on its local observation:
$$\tilde s_{n,i} = \Pi_S\!\left[ \bar s_{n-1,i} + \gamma_n \left( \sigma_{i;\bar\theta_{n-1,i},\alpha_{n-1,i}}(Y_{n,i}) - \bar s_{n-1,i} \right) \right]. \qquad \text{(A.10)}$$

[Gossip step] The final update s̄n,i of a node i is defined as a weighted sum of its own s̃n,i and the temporary updates received from its neighbors at time n:
$$\bar s_{n,i} = \sum_{j=1}^{N} w_n(i,j) \, \tilde s_{n,j}, \qquad \text{(A.11)}$$
where the wn(i,j) are some non-negative random weights such that Σ_j wn(i,j) = 1 for all i. Of course, wn(i,j) ≠ 0 only if node j communicates with i at time n. Once the above summary statistics have been computed, the estimates θ̄n,i and αn,i are obtained by a standard M-step similar to (A.8). The proposed algorithm is summarized below. The sequence of random matrices Wn := [wn(i,j)]_{i,j=1}^N represents the time-varying communication network between the nodes (see Appendix C). Let 1 denote the N × 1 vector with all components equal to one. Let Eπ denote the expectation under Pπ.

Algorithm 7: On-line distributed EM algorithm (dEM)
Initialize: s′_{0,i}, s̄_{0,i} for all i = 1, . . . , N.
Update: At each time n ≥ 0 do
  Local E-step: For all i = 1, . . . , N, compute s′_{n,i} by (A.9) and s̃_{n,i} by (A.10).
  Gossip step: For all i = 1, . . . , N, compute s̄_{n,i} by (A.11).
  M-step: For all i = 1, . . . , N, set α_{n,i} = Mi(s′_{n,i}) and θ̄_{n,i} = M(s̄_{n,i}).

Assumption A.4. The following holds under Pπ:
i) For any n, Wn is an N × N row-stochastic random matrix with non-negative elements: Wn 1 = 1.
ii) Wn is column-stochastic in expectation: Eπ[Wn]^T 1 = 1.
iii) (Wn, Z1,n, . . . , ZN,n)n≥1 is an i.i.d. sequence.
iv) The matrix W1 is independent of (Z1,1, . . . , ZN,1).
v) The spectral norm ρ of the matrix Eπ[W1^T (I_N − 11^T/N) W1] satisfies ρ ∈ [0, 1).

A.5 Convergence w.p.1

We need the following regularity conditions.

Assumption A.5.
i) The sets Θ̄ and A are convex open subsets of R^κ and R^ι respectively, where κ and ι are integers.
ii) The functions ψ̄, φ̄ (resp. ψ1, . . . , ψN, φ1, . . . , φN) are continuously differentiable on Θ̄ (resp. on A).
iii) S is a C² compact convex set.
iv) The functions M, M1, . . . , MN are well defined and continuously differentiable on the compact convex set S.
v) For all i, sup_{(θ̄,α) ∈ M(S)×Mi(S)} Eπ |σ_{i;θ̄,α}(Yi)|² < ∞.

The notation D(π|gθ) stands for the Kullback-Leibler divergence Eπ[ log( π(Y1, . . . , YN) / gθ(Y1, . . . , YN) ) ], where we recall that π denotes the p.d.f. of a sample (Y1, . . . , YN) under Pπ, whereas gθ denotes the p.d.f. of (Y1, . . . , YN) induced by the model Pθ. We need more notation. We define:
$$s_n := \left( \frac{1}{N} \sum_{i=1}^{N} \bar s_{n,i}, \; s'_{n,1}, \ldots, s'_{n,N} \right)$$
$$\theta_n := \left( \frac{1}{N} \sum_{i=1}^{N} \bar\theta_{n,i}, \; \alpha_{n,1}, \ldots, \alpha_{n,N} \right).$$
For any vector s of the form s = (s0, . . . , sN) ∈ S^{N+1}, we write M(s) := (M(s0), M1(s1), . . . , MN(sN)). We denote by L the set of Karush-Kuhn-Tucker points associated with the problem:
$$\min_{s \in S^{N+1}} D(\pi \,|\, g_{\mathsf{M}(s)})$$
i.e., the set of vectors s such that −∇s D(π|g_{M(s)}) lies in the normal cone to S^{N+1} at the point s [30]. We are now in position to state our main result. Recall that a.s. stands for almost surely.

Theorem A.1. The following holds true under Assumptions A.1-A.5.
i) The network achieves a consensus in the following sense:
$$\lim_{n\to\infty} \max_{i,j} |\bar\theta_{n,i} - \bar\theta_{n,j}| = 0 \quad \text{a.s.}$$
ii) The vectors sn and θn converge a.s. to L and M(L) respectively.
iii) On the event that s′_{n,i} and s̃_{n,i} lie in the interior of S for all n large enough, the sequence θn converges a.s. to the set:
$$\{ \theta \in \Theta : \nabla_\theta D(\pi | g_\theta) = 0 \}.$$
The proof extensively uses the results of [36] along with those of [20].

A.6 Numerical results

A.6.1 Application to Gaussian mixtures

As a leading application, we shall pay particular attention to Gaussian mixtures. For brevity, we focus on the scalar case Y = R, but our statements can be generalized to the vector case without difficulty. Consider the parametric model:
$$\forall i, \quad Y_i \sim \sum_{m=1}^{M} \alpha_i^{(m)} \, \mathcal{N}(\mu^{(m)}, \nu^{(m)}) \quad \text{under } P_\theta, \qquad \text{(A.12)}$$
where M is an integer, the vector αi = (αi^(1), . . . , αi^(M)) represents a set of non-negative weights such that Σ_{m=1}^M αi^(m) = 1, and N(μ^(m), ν^(m)) stands for the real Gaussian distribution with mean μ^(m) and variance ν^(m). In (A.12), we set θ = (θ̄, α1, . . . , αN), where θ̄ = (μ^(1), . . . , μ^(M), ν^(1), . . . , ν^(M)) is the set of global parameters. To be more explicit, for any i, the latent variable Xi ∈ {1, . . . , M} satisfies Pθ(Xi = m) = αi^(m) and represents the class under which the observation Yi is drawn. The distribution of Yi given Xi is N(μ^(Xi), ν^(Xi)). It can be verified that this model satisfies Assumption A.1. In this case, hi(zi) = 1, ψ̄(θ̄) and ψi(αi) are equal to zero, and φi(αi) and φ̄(θ̄) are both 3M × 1 column vectors. Each marginal p.d.f. f_{i;θ̄,αi}(zi) has the following closed form:
$$f_{i;\bar\theta,\alpha_i}(z_i) = \exp\{ \langle S_i(z_i), \bar\phi(\bar\theta) + \phi_i(\alpha_i) \rangle \} \qquad \text{(A.13)}$$
where, with δ_{xi}^(m) := I{xi = m}:
$$S_i(z_i) = \left( \delta^{(1)}_{x_i}, \; \delta^{(1)}_{x_i} y_i, \; \delta^{(1)}_{x_i} y_i^2, \; \ldots, \; \delta^{(M)}_{x_i}, \; \delta^{(M)}_{x_i} y_i, \; \delta^{(M)}_{x_i} y_i^2 \right)^T, \qquad \text{(A.14)}$$
$$\bar\phi(\bar\theta) = \left( -\tfrac{1}{2}\ln 2\pi\nu^{(1)} - \tfrac{(\mu^{(1)})^2}{2\nu^{(1)}}, \;\; \tfrac{\mu^{(1)}}{\nu^{(1)}}, \;\; -\tfrac{1}{2\nu^{(1)}}, \;\; \ldots, \;\; -\tfrac{1}{2}\ln 2\pi\nu^{(M)} - \tfrac{(\mu^{(M)})^2}{2\nu^{(M)}}, \;\; \tfrac{\mu^{(M)}}{\nu^{(M)}}, \;\; -\tfrac{1}{2\nu^{(M)}} \right)^T$$
and
$$\phi_i(\alpha_i) = \left( \ln \alpha_i^{(1)}, \; 0, \; 0, \; \ldots, \; \ln \alpha_i^{(M)}, \; 0, \; 0 \right)^T. \qquad \text{(A.15)}$$
Besides, we can explicitly derive the local summary quantities expressed by the mapping σ_{i;θ̄,α}(y) of (A.2). Here, for any y ∈ Y, the function σ_{i;θ̄,α}(y) involved in the Local E-step of Algorithm 7 is a 3M × 1 column vector. Upon noting that Eθ[δ_{Xi}^(m) | y] = P_{i;θ}[Xi = m | y] in expression (A.14) and applying Bayes' rule, σ_{i;θ̄,α}(y) is given by:
$$\sigma_{i;\bar\theta,\alpha}(y) = \left( \sigma^{(1)}_{i;\bar\theta,\alpha}(y)^T, \ldots, \sigma^{(M)}_{i;\bar\theta,\alpha}(y)^T \right)^T,$$
where, for all m = 1, . . . , M:
$$\sigma^{(m)}_{i;\bar\theta,\alpha}(y) = \left( \omega^{(m)}_{i;\bar\theta,\alpha}(y), \;\; \omega^{(m)}_{i;\bar\theta,\alpha}(y)\, y, \;\; \omega^{(m)}_{i;\bar\theta,\alpha}(y)\, y^2 \right)^T$$
with
$$\omega^{(m)}_{i;\bar\theta,\alpha}(y) = \mathbb{P}_{i;\bar\theta,\alpha}[X_i = m \mid y] = \frac{ \frac{\alpha^{(m)}_i}{\sqrt{2\pi\nu^{(m)}}} \exp\{ -\frac{1}{2\nu^{(m)}} (y-\mu^{(m)})^2 \} }{ \sum_{k=1}^{M} \frac{\alpha^{(k)}_i}{\sqrt{2\pi\nu^{(k)}}} \exp\{ -\frac{1}{2\nu^{(k)}} (y-\mu^{(k)})^2 \} }.$$
We denote by s̃n = (s̃n,1, . . . , s̃n,N)^T ∈ R^{3MN} and s̄n = (s̄n,1, . . . , s̄n,N)^T ∈ R^{3MN} the estimates at iteration n. Then, the Gossip step of Algorithm 7 can be written in matrix form as:
$$\bar s_n = (W_n \otimes I_{3M}) \, \tilde s_n,$$
where I_{3M} is the 3M × 3M identity matrix, ⊗ the Kronecker product and (Wn)n≥1 is an i.i.d. sequence of non-negative matrices satisfying Assumption A.4. After the Gossip step, each node i updates its parameters by the M-step of Algorithm 7. In this particular case, it is easy to show that the solutions of both (A.4) and (A.5) decompose into M separate maximization problems. For each class m and given the 3 × 1 column vectors s′^(m)_{n,i} and s̄^(m)_{n,i}, the solutions for any node i are:
$$\alpha^{(m)}_{n,i} = s'^{(m)}_{n,i}(1), \qquad \mu^{(m)}_{n,i} = \frac{\bar s^{(m)}_{n,i}(2)}{\bar s^{(m)}_{n,i}(1)}, \qquad \nu^{(m)}_{n,i} = \frac{\bar s^{(m)}_{n,i}(3)}{\bar s^{(m)}_{n,i}(1)} - \left( \mu^{(m)}_{n,i} \right)^2.$$

A.6.2 Simulations

In order to validate our algorithm under the conditions of Section A.2, we simulate the well-known example (A.12), that is, a Gaussian mixture of M = 3 classes. The WSN is represented as a random geometric graph G(E, V) with N = 10 nodes, randomly placed according to the uniform distribution on [0, 1] × [0, 1].
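For concreteness, the pairwise scheme of [31] used in the simulations can be sketched as follows: at each time a uniformly drawn pair of nodes averages its statistics, which yields a doubly stochastic realization of W_n, so Assumption A.4 holds. Drawing the pair from the complete graph is an illustrative simplification; in a WSN the pair would be drawn among graph neighbors.

```python
import numpy as np

def pairwise_gossip_matrix(N, rng):
    """One realization of the pairwise-averaging matrix W_n: a uniformly
    drawn pair (i, j) replaces its values by their common average, while
    all other nodes keep theirs. Each realization is doubly stochastic."""
    i, j = rng.choice(N, size=2, replace=False)
    W = np.eye(N)
    W[i, i] = W[j, j] = 0.5
    W[i, j] = W[j, i] = 0.5
    return W

def gossip_step(W, s_tilde):
    """Gossip step (A.11): with the statistics stacked as an (N, d) array,
    s_bar = (W ⊗ I_d) s_tilde reduces to a plain matrix product."""
    return W @ s_tilde
```

Since each realization is doubly stochastic, the network-wide average of the statistics is preserved by every gossip step, which is the key property behind the consensus result of Theorem A.1.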
We set the global parameters θ̄ to μ = (110, 80, 40) and ν = (36, 16, 36), and the local weights αi are randomly chosen on (0.7, 0.25, 0.05) at each node i. Algorithm 7 is run for 20000 iterations over 30 independent realizations, and the step size is chosen as γn = 1/n^0.8. We compare the two gossip strategies described in Appendix C by computing the mean deviation of consensus (MDC), normalized by the sought parameter. For the first parameter μ^(1), it is defined as:
$$\mathrm{MDC}_n = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \frac{ \left( \mu^{(1)}_{n,i} - \langle \mu^{(1)}_n \rangle \right)^2 }{ (\mu^{(1)})^2 } }, \qquad \text{(A.16)}$$
where ⟨μn^(1)⟩ := (1/N) Σ_{j=1}^N μ^(1)_{n,j}. Figure A.1 shows the comparison between the pairwise [31] and broadcast [10] schemes (see Appendix C) over 30 independent runs of Algorithm 7. The consensus is slightly faster in the broadcast case since, at each round, one node is able to communicate its value to all its neighbors. Figure A.2 then illustrates the convergence towards the true value μ^(1) = 110 as box-and-whisker plots over 30 independent runs of Algorithm 7; it reports the asymptotic behavior of the parameter averaged over the N nodes, (1/N) Σ_{i=1}^N μ^(1)_{n,i}, at n = 20000. Note that the asymptotic variance in the pairwise case is almost the same as that obtained in the centralized case. This is consistent with the convergence results derived in the first part of this thesis for consensus algorithms (see Section 2.5 in Chapter 2).

Figure A.1. MDC as a function of the number of iterations n for the pairwise [31] (plain line) and broadcast [10] (plain line with cross markers) models when considering N = 10 nodes.

Figure A.2. Box-and-whisker plots of the parameter estimate μn^(1) from the 30 independent runs of the centralized EM Algorithm 6 and the distributed EM Algorithm 7. The centralized setting corresponds to the first boxplot; the distributed settings correspond to the second (pairwise [31]) and third (broadcast [10]) boxplots respectively.

Similarly, we show the results obtained with a standard "signal + noise" model for the network of N = 10 sensor nodes. The observation model can then be described as a mixture of two Gaussians, i.e., M = 2. We set the following values for the global parameters θ̄: the signal class with μ1 = 1 and ν1 = 1, the noise class with μ2 = 0 and ν2 = 100, and we set the mixing weights to α1 = 0.9 and α2 = 0.1, equal for all nodes i = 1, . . . , N. Figure A.3 illustrates the agreement on the estimated parameters, i.e., all trajectories (θn,i)n go towards the same value ⟨θn⟩ as n → ∞, where ⟨θn⟩ = (1/N) Σ_i θn,i. Figure A.3 shows the trajectory of the MDC over 30 independent trials as a function of the iteration time n. Note that we only include the consensus on the weight of the first class (αn,i,1), since the weight of the second class satisfies αn,i,2 = 1 − αn,i,1 for all i.

Figure A.3. Performance on the consensus of the global parameters θn: convergence of the mean consensus deviation as a function of the iteration time n for each parameter and each class, i.e., (μn,i,j, νn,i,j, αn,i,1) for all i, j.

Once the agreement is achieved, the consensus value may converge to the true value, i.e., the sequence (⟨θn⟩)n tends to the true parameters θ⋆ = (1, 0, 1, 100, 0.9, 0.1). We define the mean θE⋆ and the variance σE²⋆ as:
$$\theta_E^\star = \mathbb{E}_\pi[Y] = \alpha_1 \mu_1 + \alpha_2 \mu_2$$
$$\sigma_E^{2\star} = \mathbb{E}_\pi\!\left[(Y - \mathbb{E}[Y])^2\right] = \alpha_1 \left( (\mu_1 - \theta_E^\star)^2 + \nu_1 \right) + \alpha_2 \left( (\mu_2 - \theta_E^\star)^2 + \nu_2 \right).$$
Figure A.4 illustrates the convergence towards the mean and variance of the observed signal. Figure A.4 (a) refers to the convergence towards the mean value θE⋆ = 0.9, while Figure A.4 (b) refers to the convergence towards the variance value σE²⋆ = 10.99. Each line corresponds to the estimated sequences computed by each node i as:
$$\theta_{E_{n,i}} = \alpha_{n,i,1}\, \mu_{n,i,1} + \alpha_{n,i,2}\, \mu_{n,i,2}$$
$$\sigma^2_{E_{n,i}} = \alpha_{n,i,1} \left( (\mu_{n,i,1} - \theta_{E_{n,i}})^2 + \nu_{n,i,1} \right) + \alpha_{n,i,2} \left( (\mu_{n,i,2} - \theta_{E_{n,i}})^2 + \nu_{n,i,2} \right).$$

Figure A.4. Trajectories at each node i = 1, . . . , N of the estimated mean (a) and variance (b) sequences of the observed signal (colored curves) as a function of the iteration time n, together with the optimal values θE⋆ and σE²⋆ (black lines).

Appendix B

Application on distributed machine learning

On-Line Learning Gossip Algorithm in Multi-Agent Systems with Local Decision Rules¹

This appendix is extracted from the proceedings of BIGDATA 2013.

Abstract

This paper is devoted to the investigation of binary classification in a distributed and on-line setting. In the Big Data era, datasets can be so large that it may be impossible to process them using a single processor. The framework considered accounts for situations where both the training and test phases have to be performed by taking advantage of a network architecture, by means of local computations and the exchange of limited information between neighbor nodes. An online learning gossip algorithm (OLGA) is introduced, together with a variant which implements a node selection procedure.
Beyond a discussion of the practical advantages of the algorithm we promote, the paper proposes an asymptotic analysis of the accuracy of the rules it produces, together with preliminary experimental results.

B.1 Introduction

In most analyses carried out in the field of statistical learning theory, the practical constraints related to the data acquisition/storage/access system and inherent to processing speed, memory and computing capacity have so far been generally ignored, or incorporated into the mathematical framework in a very stylized manner. With the advent of highly complex digital network infrastructures and the pressing necessity of sharing resources and distributing computing power [155, 18], this facet of the machine-learning environment is becoming more and more essential from a technological perspective and is now receiving increasing attention; see [82, 92, 100, 111, 66, 106, 104, 49, 59, 58] for instance.

¹ Proceedings of the 2013 IEEE International Conference on Big Data.

Motivated by the recent developments in the architecture of data repositories and computer systems, the main purpose of this paper is to investigate the binary classification problem, the "flagship" problem in statistical learning, in a distributed and on-line context, accounting for certain real-life situations likely to be encountered more and more often in the near future. Throughout the article, we consider the case where the training data are not stored in some central memory but split into distinct clusters, individually processed by independent agents (e.g. processors). To process Big Data, one generally distributes data subsamples over a network of processors communicating with each other. Precisely, it is assumed that the agents can exchange only a limited amount of information per unit of time, through a communication structure modeled by a graph of which they form the nodes.
Hence, due to these capacity constraints, merging all training sets at any node is unfeasible, and a distributed approach, limiting the network overhead, is required. Here, by "distributed", it is meant that both the learning and prediction stages are performed by means of local computations of the agents and sparse communications between them: each agent simultaneously processes the data set it has been assigned and shares some information with its neighbors in order to build a local classifier. In [66, 106], a specific view of distributed learning has been developed, where the goal is to reach a consensus among local classifiers. In this setting, all agents originally dispose of the same collection of classification rules, and a local gradient descent technique, jointly performed with a gossip step, is used to drive them to a consensus. At the end of the learning procedure, all agents use the consensus classifier to predict the labels assigned to test data, with no need for further communications. The nature of the problem we investigate in this paper is very different: it is not of the "distributed consensus" type. It should be noticed that, unlike most works on "distributed classification", agents are not assumed exchangeable in the framework we consider. First, we assume that the collection of classifiers may vary from one agent to another. This situation encompasses, for instance, the case where each agent is an expert in recognition based on a specific feature of the input observation. Additionally, the issue at stake is not to achieve a consensus between the agents, but to learn how to aggregate the local decisions efficiently, typically through a majority vote or a well-chosen weighted average. Hence, in the classification problem we consider, both the learning and test phases require distributed computations relying on the whole network of agents.
In addition, it is expected that, unlike consensus-based approaches that drive all nodes to a common classifier, our scheme should preserve and take full advantage of the peculiar skills of the local classifiers, being therefore closer to the spirit of ensemble learning algorithms. The paper is organized as follows. Section II describes the specific framework of the learning problem considered. In Section III, the principles of the algorithm promoted are described at length. The performance of the proposed procedure is analyzed in Section IV, while Section V focuses on a specific situation. Finally, numerical experiments are displayed in Section VI, in order to provide some preliminary empirical evidence of the efficiency of the methods proposed in this paper. Section VII collects some concluding remarks, and technical details are deferred to the Appendix section.

B.2 Background

We start off by setting out the notation and describing the key ingredients of the learning problem subsequently analyzed. Here and throughout, the indicator of any event E is denoted by I{E}.

B.2.1 Objective

Suppose we have a "black-box" system where Y is a binary output, taking its values in {−1, +1} say, and X is an input random vector valued in a high-dimensional space X, modeling some (hopefully) useful observation for predicting Y. Based on training data, the goal is to build a prediction rule sign(h(X)), where h : X → R is some measurable function, which minimizes the risk
$$R_\varphi(h) = \mathbb{E}\left[ \varphi(-Y h(X)) \right],$$
where the expectation is taken over the unknown distribution of the pair of r.v.'s (X, Y) and φ : R → [0, +∞) denotes a cost function (i.e. a measurable function such that φ(u) ≥ I{u ≥ 0} for any u ∈ R). For reasons which will appear obvious in the sequel (see Remark 3), we focus on the cost function φ(u) = (u + 1)²/2. Notice that, in this case, the optimal decision function is given by:
$$\forall x \in \mathcal{X}, \quad h^*(x) = 2\,\mathbb{P}\{Y = +1 \mid X = x\} - 1.$$
The classification rule H*(x) = sign(h*(x)) thus coincides with the Bayes classifier. For this specific choice, decision function candidates h(x) will be assumed to be square integrable with respect to X's distribution. The learning environment under study is non-standard. Here we consider a model of distributed classification device composed of a set V of N ≥ 1 connected agents which process independent databases: each agent v ∈ V disposes of a training dataset Dv = {(X1,v, Y1,v), . . . , (X_{nv,v}, Y_{nv,v})} of size nv ≥ 1, made of independent copies of the pair (X, Y). In addition, each agent v ∈ V must select a weak classifier function among a given parametric class possibly depending on v, namely {hv(·, θv)}_{θv ∈ R^{dv}}, where dv ≥ 1. We set D = Σ_v dv. For any vector θ = (θ1, · · · , θN) ∈ Θ = R^{d1} × · · · × R^{dN}, we define the global (soft) classifier as:
$$H(x, \theta) = \sum_{v \in \mathcal{V}} h_v(x, \theta_v) \quad \text{for } x \in \mathcal{X},$$
the label related to an observation X being estimated by sign(H(X, θ)). To lighten notation, we set Rφ(θ) = Rφ(H(·, θ)). This paper investigates the problem of finding a "global classification rule", as defined above, with minimum risk, i.e. the optimization problem
$$\min_{\theta \in \Theta} R_\varphi(\theta), \qquad \text{(B.1)}$$
while fulfilling some capacity constraints, which shall be described in the next subsection.

Remark B.1. (MIXTURE OF EXPERTS.) A typical example of the framework above stipulates that a fixed weak classifier hv : X → {−1, +1} is assigned to each agent v. For any (θv, x) ∈ R × X, we set hv(x, θv) = θv hv(x), and the global classifier then reduces to a weighted sum of the local weak classifiers. In the learning phase, the issue is to determine the optimal weights using a distributed algorithm. In the test phase, it is to compute a weighted sum of the local decisions, by using standard average consensus algorithms such as those studied in [31] for instance.

Remark B.2. (MAJORITY VOTE.) Another useful example is given by the case where each θv corresponds to some local parameter of a local classifier x ↦ hv(x, θv) ∈ R. In this case, the global classifier output sign(H(x, θ)) can be evaluated by a simple majority vote between agents, see [16].

B.2.2 Distributed Learning

In order to give an insight into the approach we propose, we first consider the ideal case where a standard gradient descent for solving (1) could be applied. One would then generate in an iterative manner a sequence θ^(t) = (θ1^(t), · · · , θN^(t)), t ≥ 1, satisfying the following update equation for each v ∈ V:
$$\theta_v^{(t+1)} = \theta_v^{(t)} + \gamma_t \, \mathbb{E}\left[ Y \nabla_v h_v(X, \theta_v^{(t)}) \, \varphi'(-Y H(X, \theta^{(t)})) \right], \qquad \text{(B.2)}$$
where γt > 0 is a step size and ∇v represents the gradient operator w.r.t. the argument θv. Naturally, as (X, Y)'s distribution is unknown, the expectation involved in (2) cannot be computed and must be replaced by a statistical version, in accordance with the Empirical Risk Minimization paradigm. It is assumed that each agent v ∈ V must rely on the local dataset Dv only to update its estimate, in a one-pass fashion: each observation (Xk,v, Yk,v) must be used only once by agent v and is not stored in the agent's memory. This "on-line" framework is especially relevant in the context of large data sets, where it is generally hopeless to process the whole training sequence as a block. It shall also be revealed useful in a distributed optimization setting, as will be discussed later on. The expectation in (2) can then be replaced by the following unbiased estimate:
$$Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_v^{(t)}) \, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \theta^{(t)})).$$
The second issue is related to the distributed setting. In the estimate of the gradient above, the evaluation of the quantity H(X_{t+1,v}, θ^(t)) requires that: i) agent v sends the input X_{t+1,v} to all the other nodes w ≠ v; ii) each node w computes its local decision hw(X_{t+1,v}, θw^(t)) and returns the result to node v.
Needless to say, such a procedure can reveal itself overwhelmingly complex when the number of nodes is significant and/or when the dimensionality of the input X is large. It is therefore crucial to reduce the amount of information exchanged in the network. To formalize this constraint, we define the network throughput τ as the average number of information bits successfully carried by the network during each unit of time. Formally, we require that the sum, over all pairs of agents (v, w), of the number of bits sent by v to w does not exceed τ in expectation.

B.3 The Online Learning Gossip Algorithm (OLGA)

We now describe at length the general algorithm we propose in order to solve the constrained optimization problem (1) in the general on-line and distributed framework described in Section II. Suppose that, at step t ≥ 1, for each agent v ∈ V, the current parameter value is θt,v. Set θt = (θt,1, · · · , θt,N). The update is performed as follows. Agent v observes the pair (X_{t+1,v}, Y_{t+1,v}) and evaluates its local decision hv(X_{t+1,v}, θt,v) using the former value of the parameter θt,v. Next, it obtains an estimate of the global decision H(X_{t+1,v}, θt) by selecting some neighbors at random and sending its training input X_{t+1,v} to the selected nodes. Let δ_{t+1,v} = {δ^w_{t+1,v}}_{w∈V, w≠v} be a collection of N − 1 independent Bernoulli r.v.'s B(p), with parameter p ∈ (0, 1], independent from X_{t+1,v} given θt. Agent v sends the input X_{t+1,v} to node w if and only if δ^w_{t+1,v} = 1. An estimate of the global decision is then given by:
$$\hat Y^{(\mathcal{V})}_{t+1,v} = h_v(X_{t+1,v}, \theta_{t,v}) + p^{-1} \sum_{w \in \mathcal{V} \setminus \{v\}} \delta^w_{t+1,v} \, h_w(X_{t+1,v}, \theta_{t,w}), \qquad \text{(B.3)}$$
where the superscript emphasizes the fact that the estimate is computed by means of communications in the whole network V. It is worth noticing that (3) is an unbiased estimate of the global decision, in the sense that E[Ŷ^(V)_{t+1,v} | X_{t+1,v}, θt] = H(X_{t+1,v}, θt).
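The construction (3) can be checked numerically: averaging the sparsified estimate over many draws of the Bernoulli selection recovers the exact global decision. A minimal sketch follows, in which the decision values are arbitrary illustrative numbers:

```python
import numpy as np

def global_decision_estimate(v, local_decisions, p, rng):
    """Sparsified estimate (3) of the global decision H = sum_w h_w: agent v
    keeps its own decision and adds the decisions of a Bernoulli(p) random
    subset of the other agents, rescaled by 1/p to keep the estimate unbiased."""
    local_decisions = np.asarray(local_decisions, dtype=float)
    delta = rng.random(local_decisions.size) < p   # random neighbor selection
    delta[v] = False                               # agent v never samples itself
    return local_decisions[v] + local_decisions[delta].sum() / p
```

The 1/p rescaling trades communication for variance: smaller p means fewer transmissions per step, but a noisier gradient in the local descent that follows.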
If B represents the number of bits required to represent an arbitrary input x ∈ X, then each link v → w carries on average pB bits per unit of time. In order not to exceed the network throughput τ, one must pick the sampling parameter p so that p ≤ τ/(BN(N − 1)). Finally, agent v performs a local gradient descent as follows:

θ_{t+1,v} = θ_{t,v} + γ_t Y_{t+1,v} ∇_v h_v(X_{t+1,v}, θ_{t,v}) ϕ′(−Y_{t+1,v} Ŷ^(V)_{t+1,v}) .   (B.4)

As mentioned above, we shall pay particular attention to the case ϕ(x) = ½(x + 1)². In that case, the update equation (B.4) boils down to:

θ_{t+1,v} = θ_{t,v} + γ_t ∇_v h_v(X_{t+1,v}, θ_{t,v}) (Y_{t+1,v} − Ŷ^(V)_{t+1,v}) .   (B.5)

Remark B.3. (On the cost function.) The quadratic nature of the cost functional is essential in the subsequent analysis. It guarantees that the OLGA output remains unbiased at each iteration, in spite of its online nature and the randomness incurred by the gossip phase.

The algorithm is summarized in Algorithm 8 below.

Algorithm 8: OLGA
Initialize: Set arbitrary initial values θ_{0,v} for each node v ∈ V.
Update: At each time t = 1, 2, · · · do
  For each v ∈ V do
    Neighbors selection: Draw independent Bernoulli r.v.'s δ^w_{t+1,v} ∼ B(p) for any w ≠ v
    Gossip step: Transmit X_{t+1,v} to all w such that δ^w_{t+1,v} = 1 and obtain h_w(X_{t+1,v}, θ_{t,w}) in return
    Local descent: Update the parameter value θ_{t+1,v} using (B.4)

As the algorithm is single-pass, the number of iterations is necessarily smaller than the size of the full data sample, n = Σ_{v∈V} n_v. Hence, in the asymptotic framework that stipulates t → +∞, it is implicitly assumed that n → +∞.

B.4 Performance Analysis

In this section, we investigate the asymptotic behavior of the predictor output by OLGA as t → ∞. First, we establish its almost-sure convergence to the set of minimizers of R_ϕ(θ). Next, we provide a Central Limit Theorem which characterizes the fluctuations of the excess of risk as t → ∞.
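Before the analysis, Algorithm 8 can be made concrete with a short end-to-end sketch on synthetic data, in the linear-expert case of Remark B.1, h_v(x, θ_v) = θ_v h_v(x), with the quadratic update (B.5). All sizes, step sizes and the data model below are illustrative assumptions, and for brevity one common sample is shared by all agents at each round instead of N local streams.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, p, T = 5, 3, 0.5, 3000           # agents, input dim., gossip prob., iterations

w_true = rng.normal(size=d)            # hypothetical data model: Y = sign(w.x)
a = rng.normal(size=(N, d))            # fixed weak classifiers h_v(x) = sign(a_v.x)

def risk(theta, X, Y):                 # empirical quadratic risk (1/2) E[(H - Y)^2]
    return 0.5 * np.mean((np.sign(X @ a.T) @ theta - Y) ** 2)

X_test = rng.normal(size=(2000, d))
Y_test = np.sign(X_test @ w_true)
theta = np.zeros(N)
r0 = risk(theta, X_test, Y_test)       # risk of the all-zero initialization

for t in range(T):
    gamma = 0.1 / (1 + t) ** 0.6       # decaying step size, Assumption B.1
    x = rng.normal(size=d)
    y = np.sign(x @ w_true)
    hx = np.sign(a @ x)                # h_w(x) for every agent
    for v in range(N):                 # each agent: gossip estimate, then (B.5)
        delta = rng.random(N) < p
        delta[v] = False
        y_hat = theta[v] * hx[v] + (delta * theta * hx).sum() / p   # eq. (B.3)
        theta[v] += gamma * hx[v] * (y - y_hat)                     # eq. (B.5)

print(r0, risk(theta, X_test, Y_test))  # the risk decreases over the pass
```

The gradient ∇_v h_v(x, θ_v) reduces to h_v(x) for linear experts, which is why the update only multiplies the sign feature by the residual.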
This result determines the convergence rate of the algorithm and explicitly characterizes the impact of the "sparsifying" parameter p on the performance of the algorithm. Finally, using the results of [12], we provide a uniform bound on the error probability of the proposed classifier.

The following assumption is rather standard in stochastic approximation.

Assumption B.1. The step size γ_t decays to 0 as t → ∞ and satisfies Σ_{t≥1} γ_t = ∞ and Σ_{t≥1} γ_t² < ∞.

Additionally, some classical regularity conditions on the weak classifier functions h_v are required.

Assumption B.2. The conditions below hold true for any v ∈ {1, · · · , N} and any compact set K ⊂ R^{d_v}.
(a) For any x ∈ X, the function θ_v ↦ h_v(x, θ_v) is continuously differentiable.
(b) For any θ_v ∈ R^{d_v}, E[h_v(X, θ_v)²] < ∞.
(c) We have: E[sup_{θ_v∈K} ‖∇_v h_v(X, θ_v)‖²] < ∞ and E[sup_{θ_v∈K} ‖∇_v h_v²(X, θ_v)‖] < ∞.
(d) The mappings θ_v ↦ h_v(X, θ_v) and θ_v ↦ ∇_v h_v(X, θ_v), seen as functions with values in L²(P), are both continuous.
(e) We have: sup_{θ_v∈K} E[‖∇_v h_v(X, θ_v)‖⁴] < ∞ and sup_{θ_v∈K} E[h_v⁴(X, θ_v)] < ∞.
(f) The set of stationary points L = {θ : ∇R_ϕ(θ) = 0} is finite.

Assumption B.2 is clearly satisfied in the example described in Remark B.1, i.e. when h_v(x, θ_v) = θ_v h_v(x) for some fixed local weak classifier h_v such that the fourth moment of h_v(X) is finite. Recall that the algorithm is said to be stable if the sequence (θ_t)_{t≥1} remains in a compact set with probability one: P{∃K > 0, sup_{t≥1} ‖θ_t‖ < K} = 1. The next result reveals that, provided that it is stable, the algorithm produces a consistent decision rule as the number of iterations grows to infinity.

Theorem B.1. (Consistency) Assume that the algorithm is stable. Under Assumptions B.1 and B.2, the sequence (θ_t)_{t≥1} almost-surely converges to the set of stationary points L of R_ϕ.

The stability condition may not be easy to check in practice. There are several ways to guarantee stability.
A possible approach is to confine the sequence to a predetermined bounded set. This can be achieved by introducing a projection step at each iteration of the stochastic gradient algorithm. Each time an estimate θ_{t,v} falls outside some convex compact set K_v, agent v brings the estimate back into K_v by replacing θ_{t,v} with the nearest point in K_v. In that case, differential inclusion arguments may show that the conclusions of Theorem B.1 remain true: θ_t converges to the set of Karush-Kuhn-Tucker points of the functional R_ϕ(θ) over the set Π_v K_v. Refer to [15] or [96] for further details on projected stochastic approximation algorithms. Alternatively, one can stipulate additional assumptions on the weak classifier functions, see for instance [51]. The following result focuses on the situation described in Remark B.1.

Theorem B.2. (Consistency (bis)) Suppose that, for all v ∈ V, h_v(x, θ_v) = θ_v h_v(x) for some given function h_v such that E[(h_v(X))⁴] < ∞. Then, OLGA is stable and the sequence (θ_t)_{t≥1} almost-surely converges to the set of minimizers of R_ϕ as t → +∞.

In the sequel, notation ∇² (resp. ∇_v²) denotes the Hessian operator w.r.t. θ (resp. θ_v). We also use notation ∇_v¹ for ∇_v, and ∇_v⁰ stands for the identity, i.e., ∇_v⁰ f(θ_v) = f(θ_v). Superscript T represents transposition. Let θ* = (θ*_1, · · · , θ*_N) be an arbitrary point. We make the following assumption.

Assumption B.3. Suppose that θ* ∈ L and that the following conditions hold true for any v ∈ V.
(a) There exists a neighborhood N_v of θ*_v such that for any x ∈ X, the function θ_v ↦ h_v(x, θ_v) is twice continuously differentiable on N_v.
(b) We have E[‖∇_v² h_v(X, θ*_v)‖²] < ∞, where ‖ · ‖ represents any matrix norm.
(c) There exists a square-integrable random variable C(X) s.t. for any i ∈ {0, 1, 2} and θ_v ∈ N_v, ‖∇_v^i h_v(X, θ_v) − ∇_v^i h_v(X, θ*_v)‖ ≤ C(X) ‖θ_v − θ*_v‖.
(d) The matrix −Q*, where Q* = E[∇H(X, θ*)∇^T H(X, θ*)] + E[(H(X, θ*) − Y)∇²H(X, θ*)], is a Hurwitz matrix, i.e.
the largest real part of its eigenvalues is −L, with L > 0.
(e) There exists b > 4 such that for any i ∈ {0, 1}, sup_{θ_v∈N_v} E[‖∇_v^i h_v(X, θ_v)‖^b] < ∞.
(f) The mapping θ ↦ Γ_v(θ) is continuous at point θ*, where Γ_v(θ) is defined by:

Γ_v(θ) = E[(H(X, θ) − Y)² ∇_v h_v(X, θ_v) ∇_v^T h_v(X, θ_v)]
  + (1 − p)/p · Σ_{w∈V\{v}} E[h_w(X, θ_w)² ∇_v h_v(X, θ_v) ∇_v^T h_v(X, θ_v)] .   (B.6)

(g) The block-diagonal matrix Γ* = diag(Γ_v(θ*))_{v∈V} is positive definite.

Observe that the mapping (B.6) is well-defined in a neighborhood of θ*, by virtue of Assumption B.3(e).

Theorem B.3. (A conditional CLT) Suppose that Assumptions B.2 and B.3 hold true and that γ_t = γ_0 t^{−α} for some constants γ_0 > 0 and α ∈ (1/2, 1]. When α = 1, take γ_0 > (2L)^{−1} and η = 1/(2γ_0). Otherwise, set η = 0. Conditioned upon the event {lim_{t→∞} θ_t = θ*}, the sequence γ_t^{−1/2}(θ_t − θ*) converges in distribution to a centered Gaussian distribution N(0, Σ) whose covariance matrix Σ is the unique solution to the Lyapunov equation:

(Q* − ηI)Σ + Σ(Q* − ηI)^T = Γ* .

Theorem B.3 still holds true under milder assumptions on the step size, see [129] for more general conditions. The effect of the Bernoulli sampling parameter p on the asymptotic behavior of the estimation error deserves some attention. The case p = 1 corresponds to a centralized setting where all nodes communicate without restriction at any time. The matrix Γ_v(θ*) then boils down to the first term in (B.6) only. This gives the insight that the second term of (B.6) corresponds to the additional noise covariance induced by the distributed setting, as opposed to a centralized situation. In effect, when p becomes close to zero, i.e. when communications become rare, the second term of (B.6) becomes significant and produces a dramatic increase of the asymptotic covariance matrix Σ. In that sense, Theorem B.3 quantifies the unavoidable tradeoff between throughput and accuracy.

Corollary B.1.
(Error rate) Let U be a D × 1 vector of independent centered Gaussian r.v.'s with unit variance. Under the assumptions of Theorem B.3 and conditioned upon the event {lim_{t→∞} θ_t = θ*}, γ_t^{−1}(R_ϕ(θ_t) − R_ϕ(θ*)) converges in distribution to the χ²-type random variable ½ U^T Σ^{1/2} Q* Σ^{1/2} U.

Remark B.4. (On the cost function (bis).) We recall that the excess of probability of error of a classifier sign(H(x)) is bounded by (R_ϕ(H) − R_ϕ*)^{1/2}, see [12]. However, the damage to the rate of the excess of risk caused by the use of a quadratic convex surrogate for the cost is somehow compensated by the (possibly parametric) rate stated in Corollary B.1.

B.5 Distributed Selection

This section investigates more specifically the situation where, for any v ∈ V, h_v(x, θ_v) = α_v h_v(x, β_v), the local parameter θ_v being of the form θ_v = (α_v, β_v) with α_v ∈ R, β_v ∈ R^{d_v−1} and h_v : X × R^{d_v−1} → R being a local weak classifier function. For any agent v, the aim is to jointly determine the value of β_v parametrizing the local classifier and the weight α_v of agent v in the sum:

H(x, θ) = Σ_{v∈V} α_v h_v(x, β_v) ,   (B.7)

for θ = (θ_1, · · · , θ_N). In this scenario, it is natural to include a non-negativity constraint on the weights: α_v ≥ 0 for any v ∈ V. Clearly, the vector θ can be estimated by using a distributed algorithm as proposed in Section B.3. However, when the number N of nodes is very large, the implementation of OLGA can be difficult or even unfeasible. Indeed, in the learning phase, a significant amount of information must be exchanged by all nodes and, in the test phase, all N nodes are involved in the decision process. It is therefore desirable to restrict the number of nodes in order to simplify both the optimization stage and the prediction process. This remark is also motivated by the fact that, in practice, different nodes might generate quite similar outputs.
For such nodes, it is useless to duplicate the information in the sum (B.7). In this section, we propose an online method to jointly i) learn the parameters θ and ii) withdraw the nodes which are not essential for classification. Note that the withdrawal of a node v can be achieved by setting α_v to zero in (B.7). Based on this remark, we propose to add an ℓ1 penalization term to the initial cost function. For some fixed constant λ > 0, this yields the following optimization problem:

min_{θ∈Θ} R_ϕ(θ) + λ Σ_{v∈V} |α_v| .   (B.8)

The "lasso" penalization term Σ_v |α_v| is introduced so that the minimizers exhibit a certain level of sparsity, i.e. are such that a significant number of coefficients α_v are exactly equal to zero. Here, λ is a tuning parameter which sets the trade-off between the minimization of the cost and the sparsity of the minimizers.

The following modifications should be brought to OLGA in order to produce an efficient distributed online algorithm for selecting the relevant experts. At each iteration t, we assume that certain nodes have been definitively declared as idle, and we denote by S_t ⊂ V the remaining subset of active nodes. Following in the footsteps of the approach described in Section B.3, a given active node v ∈ S_t observing a pair of the training sample (X_{t+1,v}, Y_{t+1,v}) can obtain a noisy estimate of the classifier output by i) drawing independent Bernoulli distributed r.v.'s {δ^w_{t+1,v}}_{w∈S_t\{v}} with parameter p_t ∈ (0, 1], ii) computing the sum:

Ŷ^(S_t)_{t+1,v} = h_v(X_{t+1,v}, θ_{t,v}) + p_t^{−1} Σ_{w∈S_t\{v}} δ^w_{t+1,v} h_w(X_{t+1,v}, θ_{t,w}) .

The Bernoulli parameter p_t should be chosen in such a way that the network throughput does not exceed τ. Thus, if |S_t| denotes the cardinality of the set S_t and B the number of bits required to represent an arbitrary input x ∈ X, the Bernoulli parameter should satisfy:

p_t ≤ τ / (B |S_t|(|S_t| − 1)) .   (B.9)
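A side effect of (B.9) is that the admissible p_t grows as nodes are withdrawn: with τ fixed, shrinking |S_t| leaves more throughput per surviving pair. A small numerical illustration (the values of τ and B below are hypothetical):

```python
# Largest Bernoulli parameter allowed by the throughput constraint (B.9),
# p_t <= tau / (B |S_t| (|S_t| - 1)), capped at 1.
tau = 100_000            # hypothetical network throughput (bits per unit of time)
B = 64                   # hypothetical number of bits per input x

def p_max(n_active):
    return min(1.0, tau / (B * n_active * (n_active - 1)))

for n in (100, 50, 10):  # active nodes remaining as the selection proceeds
    print(n, p_max(n))   # the admissible p_t increases as |S_t| shrinks
```

With these numbers, 100 active nodes force p_t ≈ 0.16, while the last 10 surviving nodes may gossip with p_t = 1.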
Next, agent v updates its local parameters θ_{t,v} = (α_{t,v}, β_{t,v}) using a stochastic gradient descent. Unlike the algorithm of Section B.3, the update should include the ℓ1-penalization term in (B.8) and keep the nonnegativity constraints satisfied. We thus propose the following update equations:

α_{t+1,v} = [α_{t,v} + γ_t h_v(X_{t+1,v}, β_{t,v})(Y_{t+1,v} − Ŷ^(S_t)_{t+1,v}) − λ γ_t sign(α_{t,v})]_+ ,   (B.10)

β_{t+1,v} = β_{t,v} + γ_t α_{t,v} ∇_{β_v} h_v(X_{t+1,v}, β_{t,v})(Y_{t+1,v} − Ŷ^(S_t)_{t+1,v}) ,   (B.11)

with the notation [u]_+ = max(u, 0) and where ∇_{β_v} represents the gradient w.r.t. β_v. Finally, we need a criterion to decide whether agent v should declare itself as idle at step t + 1 or should be kept active. Here, we propose to declare a node as idle at iteration t + 1 if the current value α_{t+1,v} of the parameter α_v is zero for the M-th time, where M is an integer fixed in advance. Formally, a node v declares itself as idle if the sequence (α_{k,v})_{1≤k≤t+1} contains at least M zeros.

B.6 Numerical Results

The proposed algorithms have been tested on toy examples based on simulated data and on public datasets. Due to space limitations, only a few experiments are reported below: OLGA with expert selection is evaluated on a toy example, where its usefulness can be simply illustrated, and OLGA is tested on real data.

B.6.1 Simulation data

We placed ourselves in the mixture-of-experts case, using randomly placed affine experts as weak classifiers, namely h_v(x1, x2; θ_v) = θ_v sign(cos(a_v) x1 + sin(a_v) x2 − ρ_v), where a_v and ρ_v are considered fixed for each agent. We then ran OLGA with expert selection and kept the experts v for which θ_v ≠ 0. Notice below how the algorithm mostly selects affine separators relevant to the distribution of (X, Y).

B.6.2 Real data

In this section, we compare the performances of GentleBoost [70] and OLGA on some benchmark datasets for binary classification: banana and twonorm.
Detailed information about these datasets can be found in [138]; see also [161] for a distributed boosting approach. We split each data sample into a training set and a test set using an 80%-20% rule. For both GentleBoost and OLGA, the classifier is of the form H(x) = Σ_{1≤m≤M} α_m h_m(x, β_m), based on linear combinations of weak classifiers h(x, β), where β are the target parameters for the algorithm. For GentleBoost, the (α_m, β_m)'s are estimated using a stagewise block procedure. This means that α_1 h_1(β_1, ·) is added first, then α_2 h_2(β_2, ·), etc., and, for each α_m h_m(β_m, ·) to be added, a pass over the whole block of training data is required. For OLGA, the algorithm is online and distributed, meaning that each data point is processed only once and then forgotten. In addition, each α_m h_m(β_m, ·) is computed simultaneously by separate agents forming a network.

Figure B.1. Left: the experts (a_v, ρ_v) are represented by lines in red and sampling points (X, Y) by "+" in blue when Y = −1 and by "o" in green when Y = 1. Right: only the experts v having a non-zero weight θ_v ≠ 0 at the end of the iterations are represented.

For GentleBoost, the form of the weak classifier is arbitrary, but a widespread choice is to use stumps, i.e. rules of the form I{x^(j) ≥ β}, where x^(j) denotes the j-th component of x. For OLGA we used "smooth stumps", of the form F(σ(x^(j) − β)) where F(x) = 1 − 2/(1 + exp(−x)). The smoothness is required by the gradient descent approach. In the case of OLGA, weak classifiers are in one-to-one correspondence with the agents: V = {1, . . . , M}. Each agent v starts by uniformly randomly selecting an axis j(v) ∈ {1, . . . , d}, independently from all other agents, and next applies the algorithm described in Section B.5, using θ_{t,v} = (α_{t,v}, β_{t,v}, σ_{t,v}) and h_v(x, θ_{t,v}) = F(σ_{t,v}(x^(j(v)) − β_{t,v})).
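The smooth stump is differentiable in β, which is what the gradient updates of Section B.5 require. A quick sketch (the parameter values are arbitrary) of F, the stump, and its analytic derivative in β, validated against a finite difference:

```python
import math

def F(x):
    """F(x) = 1 - 2/(1 + exp(-x)): a smooth step taking values in (-1, 1)."""
    return 1.0 - 2.0 / (1.0 + math.exp(-x))

def stump(x, beta, sigma):
    """Smooth stump F(sigma * (x - beta)) acting on one input coordinate."""
    return F(sigma * (x - beta))

def dstump_dbeta(x, beta, sigma):
    """d/dbeta F(sigma*(x - beta)) = -sigma * F'(u), with F'(u) = -2 s(1 - s)
    where s = 1/(1 + exp(-u)) is the logistic function."""
    s = 1.0 / (1.0 + math.exp(-sigma * (x - beta)))
    return 2.0 * sigma * s * (1.0 - s)

x, beta, sigma = 0.7, 0.2, 3.0           # arbitrary test point
eps = 1e-6
fd = (stump(x, beta + eps, sigma) - stump(x, beta - eps, sigma)) / (2 * eps)
print(fd, dstump_dbeta(x, beta, sigma))  # the two derivatives agree
```

As σ grows, F(σ(x − β)) sharpens toward a hard stump at β, so σ can itself be learned by the same gradient mechanism, as done above.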
On both examples, one can see that OLGA does not outperform GentleBoost and has a more erratic error curve, due to its stochastic nature. However, it should be emphasized that: 1) both algorithms lead to comparable results and 2) OLGA is online and distributed, thus far less demanding in storage and power capability, which are crucial properties in a wide variety of applications.

B.7 Conclusion

In this paper, two variants of an online learning gossip algorithm (OLGA) for binary classification, very different in nature from "distributed consensus" approaches, are proposed and investigated. The main strength of OLGA lies in its ability to process data "on the fly" and then forget about it forever. Being distributed, datasets can be split and partially processed by several agents, while the network is able to benefit from the whole dataset. On the real datasets tested in this paper, OLGA performs nearly as well as GentleBoost [70], a centralized block-based robust version of boosting that needs to store the whole dataset in order to process it. OLGA is well suited to underlying complete graphs, while sub-sampling edges to perform sparsification [4].

Figure B.2. Comparison between GentleBoost and OLGA on the datasets banana (top) and twonorm (bottom): error rate as a function of the number of weak learners.

Even if this assumption seems realistic for IP networks, one might want to alleviate it and introduce hierarchies. Sophisticated versions of OLGA should thus be designed and analyzed in the near future.

Appendix - Technical details

Proof of Theorem B.1 (sketch)

By Assumption B.2(a), R_ϕ(θ) is finite for any θ. Let us write its derivative.
For any θ, any v ∈ V and any (x, y) ∈ X × {±1},

∇_v ϕ(−yH(x, θ)) = (H(x, θ) − y) ∇_v h_v(x, θ_v)
  = (Σ_{w≠v} h_w(x, θ_w) − y) ∇_v h_v(x, θ_v) + ½ ∇_v h_v²(x, θ_v) .

In particular, for any fixed value of θ and any θ'_v ∈ B(θ_v, 1) := {θ̃ : ‖θ̃ − θ_v‖ ≤ 1}, we obtain that

‖∇_v ϕ(−yH(x, θ'))‖ ≤ sup_{θ̃∈B(θ_v,1)} ‖∇_v h_v(x, θ̃)‖ (1 + Σ_{w≠v} |h_w(x, θ_w)|) + sup_{θ̃∈B(θ_v,1)} ‖∇_v h_v²(x, θ̃)‖ ,

where we set θ' = (θ_1, · · · , θ_{v−1}, θ'_v, θ_{v+1}, · · · , θ_N). Thus, ∇_v ϕ(−yH(x, θ')) is bounded by a r.v. which does not depend on θ'_v and which can be proved to be integrable by a straightforward application of the Cauchy-Schwarz inequality along with Assumptions B.2(b,c). Using Lebesgue's dominated convergence theorem, R_ϕ is differentiable w.r.t. θ_v and its gradient coincides with

∇_v R_ϕ(θ) = E[(H(X, θ) − Y) ∇_v h_v(X, θ_v)] .

The next step is to prove that ∇_v R_ϕ is continuous, and thus that R_ϕ(θ) is continuously differentiable w.r.t. θ. This is a direct consequence of Assumption B.2(d); the proof is left to the reader. We are now in position to prove Theorem B.1. Let us write our algorithm under the form θ_{t+1,v} = θ_{t,v} + γ_t Z_{t+1,v} where we set:

Z_{t+1,v} = ∇_v h_v(X_{t+1,v}, θ_{t,v})(Y_{t+1,v} − Ŷ^(V)_{t+1,v}) .   (B.12)

Let (F_t : t ≥ 0) represent the natural filtration F_t = σ(F_{t−1}, X_{t,1}, · · · , X_{t,N}, Y_{t,1}, · · · , Y_{t,N}). From the previous statement, it follows that E(Z_{t+1,v} | F_t) = −∇_v R_ϕ(θ_t). Using Minkowski's inequality followed by the Cauchy-Schwarz inequality, we obtain:

E[‖Z_{t+1,v}‖² | F_t]^{1/2} ≤ E[‖∇_v h_v(X, θ_v)‖²]^{1/2} + Σ_w E[‖h_w(X, θ_w) ∇_v h_v(X, θ_v)‖²]^{1/2}
  ≤ E[‖∇_v h_v(X, θ_v)‖⁴]^{1/4} (1 + Σ_w E[h_w⁴(X, θ_w)]^{1/4}) .

Therefore, Assumption B.2(e) implies that for any compact set K ⊂ Θ,

sup_{θ∈K} E[‖Z_{t+1,v}‖² | F_t] < ∞ .

The proof is concluded by direct application of [51].

Proof of Theorem B.2 (sketch)

The proof relies on the fact that R_ϕ is a Lyapunov function for the mean field of our algorithm, and that it is well-behaved as ‖θ‖ → ∞.
More precisely, we prove that ∇R_ϕ is Lipschitz continuous and satisfies ‖∇R_ϕ‖² ≤ C(1 + R_ϕ) for some constant C > 0. Using these conditions along with adequate estimates of the conditional moments of the noise sequence ξ_t, standard stochastic approximation results imply that the sequence θ_t remains in a compact set (see for instance [51]). Moreover, R_ϕ is convex under the assumptions of Theorem B.2. Thus the stationary points coincide with the global minimizers.

Proof of Theorem B.3 (sketch)

Define ξ_{t+1} = Z_{t+1} + ∇R_ϕ(θ_t) where Z_{t+1} is the vector whose v-th block-component coincides with Z_{t+1,v} defined by (B.12). As already seen in the proof of Theorem B.1, the sequence ξ_t is a martingale increment sequence adapted to the natural filtration, meaning that E[ξ_{t+1} | F_t] = 0 for any t. The algorithm writes θ_{t+1} = θ_t − γ_t ∇R_ϕ(θ_t) + γ_t ξ_{t+1}. The function −∇R_ϕ is the so-called mean field of the algorithm, whereas ξ_t plays the role of a noise sequence. Theorem B.3 is a consequence of [129, Theorem 1]. We only need to show that the assumptions of [129] are satisfied. To that end, we prove the following two technical lemmas. The first lemma provides some conditions on the mean field of the algorithm. Due to space limitations, its proof is omitted.

Lemma B.1. Set θ* ∈ L. Under Assumptions B.2(b-c) and B.3(a-c), the function R_ϕ is twice continuously differentiable on N := Π_v N_v and satisfies:

∇²R_ϕ(θ) = E[∇H(X, θ)∇^T H(X, θ)] + E[(H(X, θ) − Y)∇²H(X, θ)] .

Moreover, ∇R_ϕ(θ) = Q*(θ − θ*) + O(‖θ − θ*‖²).

The second lemma yields the required conditions on the probabilistic behavior of the noise sequence.

Lemma B.2. Set θ* ∈ L. Let Assumptions B.2(a-d) and B.3(e-f) hold true. Then, we have:

sup_{t≥0} E(‖ξ_{t+1}‖^{b/2} | F_t) I{θ_t ∈ N} < ∞ .

Moreover, almost surely on the event {θ_t → θ*}, E(ξ_{t+1} ξ_{t+1}^T | F_t) → Γ* as t → ∞.

Theorem B.3 directly follows from Lemmas B.1 and B.2 by straightforward application of [129].

Proof of Lemma B.2.
Since ∇R_ϕ is continuous, it is bounded in a neighborhood of θ*. Therefore, it is quite immediate to see that the statement sup_{t≥0} E[‖ξ_{t+1}‖^{b/2} | F_t] I{θ_t ∈ N} < ∞ is in fact equivalent to sup_{t≥0} E[‖Z_{t+1,v}‖^{b/2} | F_t] I{θ_t ∈ N} < ∞ for any v ∈ V. Recalling the definition (B.12) of Z_{t+1,v}, we obtain using the Cauchy-Schwarz inequality:

E[‖Z_{t+1,v}‖^{b/2} | F_t] ≤ E[‖∇_v h_v(X_{t+1,v}, θ_{t,v})‖^b | F_t]^{1/2} E[|Ŷ^(V)_{t+1,v} − Y_{t+1,v}|^b | F_t]^{1/2}
  ≤ C E[‖∇_v h_v(X, θ_{t,v})‖^b]^{1/2} (1 + E[|Ŷ^(V)_{t+1,v}|^b | F_t])^{1/2}

for some constant C > 0 which depends on b. Assumption B.3(e) ensures that the first factor in the right-hand side of the above inequality is bounded uniformly in t when multiplied by the indicator of the event {θ_t ∈ N}. The remaining task is thus to estimate E[|Ŷ^(V)_{t+1,v}|^b | F_t]. Recalling (B.3), one can prove after some algebra that:

E[|Ŷ^(V)_{t+1,v}|^b | F_t]^{1/b} ≤ C' Σ_{w∈V} sup_{θ_w∈N_w} E[‖h_w(X, θ_w)‖^b]^{1/b}

for some constant C' > 0. Using again Assumption B.3(e), we obtain that the right-hand side is bounded. Putting all pieces together, this proves sup_{t≥0} E[‖ξ_{t+1}‖^{b/2} | F_t] I{θ_t ∈ N} < ∞.

Consider the second statement of Lemma B.2. For any v ≠ w, Z_{t+1,v} and Z_{t+1,w} are independent conditionally on F_t. Thus, it is sufficient to study the conditional covariance of ξ_{t+1,v} for a given v. The latter covariance matrix coincides with U_v(θ_t) − ∇_v R_ϕ(θ_t) ∇_v R_ϕ(θ_t)^T where U_v(θ_t) = E[Z_{t+1,v} Z_{t+1,v}^T | F_t]. Upon noting that ∇_v R_ϕ is continuous and ∇_v R_ϕ(θ*) = 0, it is therefore sufficient to show that θ ↦ U_v(θ) is continuous and that U_v(θ*) = Γ_v(θ*), in order to complete the proof of Lemma B.2. Note that U_v(θ_t) coincides with the conditional expectation of (Ŷ^(V)_{t+1,v} − Y_{t+1,v})² ∇_v h_v(X_{t+1,v}, θ_{t,v}) ∇_v h_v(X_{t+1,v}, θ_{t,v})^T given F_t. After some tedious but straightforward derivations, one is able to show that U_v(θ) = Γ_v(θ). By Assumption B.3(f), the proof of Lemma B.2 is complete.
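As an aside, the covariance Σ of Theorem B.3 is only characterized implicitly, but for small dimensions the Lyapunov equation (Q* − ηI)Σ + Σ(Q* − ηI)^T = Γ* can be solved numerically by vectorization, using the column-major identity vec(AΣ + ΣAᵀ) = (I ⊗ A + A ⊗ I) vec(Σ). The sketch below uses arbitrary stand-ins for Q*, η and Γ*:

```python
import numpy as np

def solve_lyapunov(A, G):
    """Solve A @ S + S @ A.T = G via (I kron A + A kron I) vec(S) = vec(G)."""
    n = A.shape[0]
    K = np.kron(np.eye(n), A) + np.kron(A, np.eye(n))
    s = np.linalg.solve(K, G.reshape(-1, order="F"))   # column-major vec
    return s.reshape(n, n, order="F")

# Hypothetical stand-ins: Q with positive-real-part spectrum, eta < L, Gamma > 0.
Q = np.array([[2.0, 0.3], [0.1, 1.5]])
eta = 0.4
Gamma = np.array([[1.0, 0.2], [0.2, 0.8]])

Sigma = solve_lyapunov(Q - eta * np.eye(2), Gamma)
print(Sigma)     # satisfies (Q - eta I) Sigma + Sigma (Q - eta I)^T = Gamma
```

The linear system is uniquely solvable as soon as no two eigenvalues of Q − ηI sum to zero, which holds here since all its eigenvalues have positive real part.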
Proof of Corollary B.1

We use a second-order Taylor-Lagrange expansion of R_ϕ at θ*. As ∇R_ϕ(θ*) = 0, R_ϕ(θ_t) − R_ϕ(θ*) is equal to ½(θ_t − θ*)^T ∇²R_ϕ(θ*)(θ_t − θ*) up to a negligible term. Upon noting that ∇²R_ϕ(θ*) = Q*, the result follows from Theorem B.3.

Pseudo-code - Penalized OLGA

The method proposed in Section B.5 is summarized by Algorithm 9 below.

Algorithm 9: Penalized OLGA
Initialize: Set S = V. For each node v ∈ V, set initial values β_{0,v} and α_{0,v} = 1. Set counter_v = 0.
Update: At each time t = 1, 2, · · · do
  For each v ∈ S do
    Neighbors selection: Set p_t = τ/(B|S|(|S| − 1)). Draw independent Bernoulli r.v.'s δ^w_{t+1,v} ∼ B(p_t) for any w ∈ S, w ≠ v
    Gossip step: Transmit X_{t+1,v} to all nodes w such that δ^w_{t+1,v} = 1, and obtain h_w(X_{t+1,v}, θ_{t,w}) in return
    Local descent: Update the parameters α_{t+1,v}, β_{t+1,v} using (B.10)-(B.11)
    if α_{t+1,v} = 0 then counter_v ← counter_v + 1
    if counter_v = M then S ← S \ {v}

Appendix C

Examples of gossip models for consensus algorithms

In this appendix we provide some useful computations for the consensus algorithm involved in the gossip step of Algorithm (2.1)-(2.2) considered in Chapter 2.

Notation: The network of agents is represented as an undirected graph G = (V, E) where V is the set of N vertices, i.e. V = {1, . . . , N}, and E = {{i, j} : i, j ∈ V, i ∼ j} is the set of edges; |E| denotes the total number of edges. We denote by A the adjacency matrix of G, which has non-zero entries A(i, j) = 1 whenever {i, j} ∈ E and 0 otherwise. The diagonal matrix D stands for the corresponding degree matrix of G, that is D = diag(A1), and d_i denotes the degree of any node i. The Laplacian matrix is denoted by L and verifies L = D − A. Set p_ij the probability of any edge i ∼ j. Upon noting symmetry on the edges, p_ij = p_ji = (1/(2N))(1/d_i + 1/d_j): node i is uniformly chosen and contacts a uniformly selected neighbor j (probability (1/N)(1/d_i)), or conversely j selects i (probability (1/N)(1/d_j)).
We denote by A_w the version of A weighted by the probabilities p_ij, i.e. A_w(i, j) = p_ij. The weighted Laplacian L_w is equal to D_w − A_w, where the diagonal matrix is D_w = diag(A_w 1). Set I the N × N identity matrix, and let e_i denote the i-th vector of the canonical basis of R^N, i.e. all components of e_i are equal to zero except the i-th component, which is equal to 1. We define the N × N matrix W_n containing the nonnegative weights [w_n(i, j)]_{i,j=1,...,N} at time n involved in the communication (gossip) step. We denote by C the covariance matrix of the vector W_1^T 1, which plays an important role in the asymptotic variance of the sequence generated by Algorithm (2.1)-(2.2), i.e. C = E[W_1^T 1 1^T W_1] − 1 1^T. We define the real vector of the temporary estimates θ̃_n = (θ̃_{n,1}, . . . , θ̃_{n,N})^T and of the updated estimates θ_n = (θ_{n,1}, . . . , θ_{n,N})^T of the sought parameter involved in the following algorithms.

C.1 Standard gossip averaging

We recall the algorithm introduced in Section 1.3.1. The aim is to obtain a weighted average at any instant n:

θ_n = W_n θ̃_n

where (W_n)_n is an i.i.d. matrix sequence. We now establish some useful metrics of W_n for two standard gossip schemes: the pairwise model of [31] and the broadcast model of [10]. We first introduce some notation.

C.1.1 Communication model description

Pairwise gossip [31]. At time n, two connected nodes, say i and j, wake up randomly, independently from the past, with probability p_ij associated to the edge i ∼ j becoming active. Nodes i and j compute the weighted average θ_{i,n} = θ_{j,n} = 0.5 θ̃_{i,n} + 0.5 θ̃_{j,n}; for k ∉ {i, j}, the nodes do not gossip: θ_{k,n} = θ̃_{k,n}. In this example, given that the edge {i ∼ j} wakes up, W_n is given by:

W_n(k, ℓ) = 1/2 if k, ℓ ∈ {i, j};  1 if k = ℓ ∉ {i, j};  0 otherwise.   (C.1)

The above definition can be written in matrix form.
W_n is equal to W_ij if the edge {i ∼ j} is active (which happens with probability p_ij), where

W_ij = I − ½ (e_i − e_j)(e_i − e_j)^T .   (C.2)

Upon noting that E[e_i e_i^T] = E[e_j e_j^T] = D_w and E[e_i e_j^T] = E[e_j e_i^T] = A_w, the expectation matrix of W_1 is:

W̄ = E[W_ij] = I − L_w .

The matrices (W_n)_{n≥0} are doubly stochastic and satisfy W_1 1 = 1 and W_1^T 1 = 1. Then, C = 0. Note that W_1 is symmetric (W_1^T = W_1) and idempotent (W_1² = W_1). Thus, the spectral radius of E[W_1^T J^⊥ W_1] is:

ρ = r(E[W_1^T J^⊥ W_1]) = r(W̄ − J) = r(J^⊥ − L_w) = 1 − λ_2(L_w) < 1 .   (C.3)

The eigenvalues of W̄ = I − L_w are 1 > 1 − λ_2(L_w) ≥ · · · ≥ 1 − λ_N(L_w); the above condition holds because the graph is assumed to be connected, i.e. the second smallest eigenvalue λ_2(L_w) of L_w is non-zero. When G is a complete graph, then L_w = (1/(N − 1)) J^⊥, W̄ = I − (1/(N − 1)) J^⊥ and ρ = (N − 2)/(N − 1). In that case, the larger the network size N, the closer ρ gets to 1, meaning that the graph connectivity λ_2(L_w) decreases. Indeed, this seems logical, as the averaged gossip matrix W̄ tends to I, i.e. no communication is performed between the agents.

Broadcast gossip [10]. At time n, a node i wakes up at random and broadcasts its temporary update θ̃_{i,n} to all its neighbors N_i. Any neighbor j computes the weighted average θ_{j,n} = β θ̃_{i,n} + (1 − β) θ̃_{j,n} where β ∈ (0, 1). On the other hand, the nodes k which do not belong to the neighborhood of i (including i itself) set θ_{k,n} = θ̃_{k,n}. Note that, as opposed to the pairwise scheme, the transmitter node i does not expect any feedback from its neighbors. Then, given that i wakes up, the (k, ℓ)-th component of W_n is given by:

W_n(k, ℓ) = 1 if k ∉ N_i and k = ℓ;  β if k ∈ N_i and ℓ = i;  1 − β if k ∈ N_i and k = ℓ;  0 otherwise.   (C.4)

The above definition can be written in matrix form. We denote by E_i = e_i e_i^T the N × N matrix whose only non-zero entry is at (i, i).
Then, W_n is equal to W_i if node i is active (which happens with probability 1/N), where

W_i = I − β diag(A E_i 1) + β A E_i .   (C.5)

Upon noting that E[E_i] = (1/N) I, the expectation matrix is:

W̄ = E[W_i] = I − (β/N) L .

The matrices (W_n)_{n≥0} are row stochastic and column stochastic in average, satisfying W_1 1 = 1 and E[W_1^T] 1 = 1. We also refer to W_n as a doubly stochastic matrix in average. The spectral radius of E[W_1^T J^⊥ W_1] is now:

ρ = r(E[W_1^T J^⊥ W_1]) = r(J^⊥ − (2β/N)(1 − β) L − (β²/N²) L²)
  = 1 − (2β/N)(1 − β) λ_2(L) − (β²/N²) λ_2(L)² < 1 .   (C.6)

In that case, using (C.5) and noting that W_1^T 1 − 1 = β(d_i e_i − A e_i) when node i is active, we compute C as follows:

C = β² E[(d_i e_i − A e_i)(d_i e_i − A e_i)^T] = (β²/N)(D² + A² − AD − DA) = (β²/N) L² .   (C.7)

When G is a complete graph, then L = N J^⊥, W̄ = I − β J^⊥ and ρ = (1 − β)² r(J^⊥) = (1 − β)².

C.1.2 Numerical results on distributed optimization

We illustrate the behavior of these two consensus schemes in distributed optimization. We consider the same scenario as in Section 2.7 of Chapter 2, i.e. min_{θ∈R} F(θ) where F(θ) = Σ_{i=1}^N ½(θ − α_i)². Note that in both cases v = 1, which implies θ_F = θ_V = θ*. We compare the performance of Algorithm (2.1)-(2.2) assuming (2.4), i.e. (ξ_{n,i})_{n,i} is an i.i.d. sequence with Gaussian distribution N(0, σ²) where σ² = 1. The network is represented by a graph G = (V, E) of N = 10 vertices and three different sets of edges varying the connectivity of the graph, i.e. |E| = 9, 26 and 45. Here, |E| = 9 corresponds to the line graph (minimum connectivity), whose average degree, i.e. the average number of neighbors (edges) per node, is 1.8; |E| = 45 corresponds to the complete graph (maximum connectivity), in which each node has N − 1 = 9 neighbors. The figure below illustrates the graphs considered in the present example.
[Figure: the three graphs G(10, 9), G(10, 26) and G(10, 45), with |V| = 10 vertices.]

Table 1 reports the connectivity information of each graph through the spectral radius value ρ. As the connectivity (the average number of neighbors per node) increases, the spectral radius ρ decreases. This parameter appears in the convergence analysis of the disagreement sequence φ_n (2.12) (see Section D.2) and thus impacts the performance of the consensus convergence (see Figure C.1 (a)).

Gossip scheme | |E| = 9 | |E| = 26 | |E| = 45
Pairwise      |  0.995  |  0.937   |  0.889
Broadcast     |  0.992  |  0.771   |  0.25

Table 1. Spectral radius ρ of the pairwise (C.3) and broadcast (C.6) schemes when |V| = N = 10.

Figure C.1 shows the convergence result stated in Theorem 2.1. The error curves are computed as averages over 100 independent runs of Algorithm (2.1)-(2.2). We define the mean deviation of consensus (MDC) as follows:

MDC_n = (1/N) Σ_{i=1}^N E[(θ_{n,i} − ⟨θ_n⟩)²] .   (C.8)

The convergence to a consensus is illustrated in Figure C.1 (a), while Figure C.1 (b) illustrates the convergence to the sought value θ*. We report the MDC (C.8) and the mean square error (MSE), i.e. MSE_n = (1/N) Σ_i E[|θ_{n,i} − θ*|²]. Figure C.1 (a) shows the influence of the connectivity (the ρ values in Table 1) on the network agreement. When the network is formed by a line graph, ρ is close to 1 in both schemes (pairwise and broadcast) and the consensus error (MDC) has almost the same performance. Although in general the performance improves when increasing |E| (connectivity), this improvement is difficult to discern in the pairwise case, since ρ decreases much more slowly than in the broadcast case. In addition, the error curves corresponding to the broadcast scheme lie below those of the pairwise scheme, since ρ in the broadcast case is lower than in the pairwise case.
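The two gossip matrices of this section are easy to instantiate and check numerically, and a single pairwise run reproduces the consensus behavior summarized by the MDC. A minimal sketch (the graph, β and all sizes are arbitrary choices; the final loop runs pairwise averaging on the complete graph):

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta = 10, 0.5

def W_pairwise(i, j):                      # eq. (C.2)
    u = np.zeros((N, 1)); u[i], u[j] = 1.0, -1.0
    return np.eye(N) - 0.5 * u @ u.T

def W_broadcast(A, i):                     # eq. (C.5)
    Ei = np.zeros((N, N)); Ei[i, i] = 1.0
    return np.eye(N) - beta * np.diag(A @ Ei @ np.ones(N)) + beta * A @ Ei

Wp = W_pairwise(1, 3)                      # doubly stochastic, symmetric, idempotent
A = np.ones((N, N)) - np.eye(N)            # complete graph adjacency
Wb = W_broadcast(A, 2)                     # row stochastic only

x = rng.normal(size=N)                     # a pairwise averaging run
mean0 = x.mean()
mdc0 = np.mean((x - mean0) ** 2)           # single-run MDC at n = 0
for _ in range(500):
    i, j = rng.choice(N, size=2, replace=False)
    x = W_pairwise(i, j) @ x               # theta_n = W_n theta_tilde_n
print(mdc0, np.mean((x - x.mean()) ** 2))  # the deviation shrinks toward 0
```

Note that the pairwise run preserves the network average exactly (double stochasticity), whereas a broadcast run would only preserve it in expectation, which is precisely the source of the extra MSE term discussed below.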
Indeed, the gaps between the values of ρ for |E| = 26 and |E| = 45 are easily appreciated when choosing the broadcast gossip.

Figure C.1 (b) shows the influence of the non-doubly stochastic nature of the broadcast protocol on the average sequence ⟨θn⟩. The gaps between the values of ρ in Table 1 translate into gaps between the error curves (MSE) in Figure C.1 (b). However, the pairwise gossip performs better than the broadcast gossip for all |E|, since in the latter case the error term related to 1^T Wn − 1^T is always non-zero.

[Figure C.1: (a) mean deviation of consensus (MDC) and (b) mean square error (MSE) as functions of the number of iterations n, for the pairwise and broadcast schemes on G(10,9), G(10,26) and G(10,45).]

Figure C.1. Convergence performance of Algorithm (2.1)-(2.2) when considering the gossip schemes (C.1) and (C.4).

[Figure C.2: (a) boxplots of the average error sequence; (b) empirical histograms against the theoretical Gaussian densities: N(0, 0.1) for the pairwise scheme on all three graphs, and N(0, 155), N(0, 51), N(0, 35) for the broadcast scheme on G(10,9), G(10,26) and G(10,45) respectively.]

Figure C.2. Asymptotic behavior of the normalized average error sequence (1/√γn)(⟨θn⟩ − θ⋆) when n = 20000.
Finally, we illustrate the CLT result of Chapter 2 (see Theorem 2.3 in Section 2.5). Figure C.2 shows the asymptotic behavior of the normalized average error sequence (1/√γn)(⟨θn⟩ − θ⋆). The asymptotic distributions and variances U⋆ are reported by the boxplots and histograms in Figures C.2 (a) and C.2 (b) respectively. As expected, the asymptotic variance in the pairwise case only depends on the noise in the observations (see Corollary 2.1 in Section 2.5.3). Hence, the asymptotic distribution is the same for all connectivity values ρ, since U⋆ does not depend on the choice of the graph. On the contrary, when choosing a non-doubly stochastic protocol such as the broadcast gossip, an additional term depending on Wn appears in U⋆; the connectivity then influences the asymptotic variance, which increases as the connectivity of the graph decreases.

C.2 Push-sum gossip averaging

In this section we highlight a possible direction for future work related to Chapter 2 which we did not have time to investigate. The idea is to solve distributed optimization problems of the form (2.3) by an adaptation-diffusion algorithm following the scheme (2.1)-(2.2), coupled with a consensus protocol based on the push-sum introduced by [89]. This framework has already been addressed by [153] and [114]-[115], which provide convergence results. However, both works require synchronous communication models.

C.2.1 Communication model description

Since we are interested in asynchronous protocols, we focus on the push-sum model more recently proposed in the context of consensus averaging by [84]. Similarly to [10], a single agent is randomly activated asynchronously and broadcasts its temporary estimates to all its neighbors. Define sn = (s_{n,1}, ..., s_{n,N})^T and wn = (w_{n,1}, ..., w_{n,N})^T. The scheme is summarized as follows:

sn = Kn s_{n−1} ,  wn = Kn w_{n−1} ,  θ_{n,i} = s_{n,i} / w_{n,i}  ∀i = 1, ..., N ,    (C.9)

where the intermediate sequences sn and wn represent the sums and weights needed to update the average parameter θn, under the assumptions s0 = θ0 and w0 = 1. Contrarily to (Wn)n, the sequence of i.i.d. matrices (Kn)n is column stochastic, i.e. 1^T Kn = 1^T, where Kn = I − L(I + D)^{−1} Ei if node i is activated with probability 1/N. The main differences compared to the standard gossip models of the previous section are the need for an additional parameter involved in the communication step and the conditions on (Kn)n.

C.2.2 Algorithm for distributed optimization

The objective is to couple the asynchronous model (C.9) of [84] with Algorithm (2.1)-(2.2) and to illustrate the convergence analysis, i.e. that θn tends to θ⋆ (2.3) as n → ∞. The algorithm under study can be described as follows.

[Local step] Similarly to (2.1), each node i generates a temporary iterate s'_{n,i} given by

s'_{n,i} = s_{n−1,i} + γn α_{n,i} Y_{n,i}  where  Y_{n,i} = −∇fi(θ_{n−1,i}) + ξ_{n,i} ,    (C.10)

where (γn)n is a decreasing step-size sequence. We introduce the sequence (αn)n, which is defined according to different models.

[Push-sum step] Kn is defined as in (C.9). The active agent i sends its values to all j ∈ Ni. Node i and its neighbors compute the weighted average while the other nodes stay idle. Then, each node i is able to update its estimate θ_{n,i} given s_{n,i} and w_{n,i}:

sn = Kn s'_n ,  wn = Kn w_{n−1} ,  θ_{n,i} = s_{n,i} / w_{n,i}  ∀i = 1, ..., N .    (C.11)

Note that the sequences (Kn)n, (wn)n and (sn)n are related to the push-sum protocol, while the sequences (s'_n)n and (θn)n are related to the stochastic gradient descent algorithm. Moreover, contrarily to Assumption 2.2-2) (Chapter 2), nothing is required in (C.11) concerning the row-stochasticity of (Kn)n. It is worth noting the contribution of such a new algorithm with respect to the existing works [153] and [115]. The authors in [153] consider a fixed matrix model, i.e.
Kn = K̄ for all n, for some deterministic K̄, and (C.10) is replaced by a primal-dual step with αn = 1 and ξn = 0. Although in [115] an algorithm similar to (C.10)-(C.11) is proposed by setting αn = 1, the step order is inverted, i.e. it is a diffusion-adaptation scheme (C.11)-(C.10). In addition, contrarily to our framework, [115] proposes a time-varying and synchronous model for the matrices (Kn)n and assumes that they are adapted to a directed and strongly connected graph. In this section we analyze the influence of the asynchronous consensus model of [84] when used for distributed optimization problems.

C.2.3 Numerical results on distributed optimization

We consider the same scenario as in Section 2.7 of Chapter 2. The minimization problem (2.19) is solved by a network of N = 5 connected nodes (or agents). The minimizer of (2.19) is θF = 1. The objective is to check whether θn generated by Algorithm (C.10)-(C.11) tends to θ⋆, corresponding to the sought minimizer θF, as the number of iterations n tends to ∞. The graph G is formed according to the configuration defined by (2.20). In this context, we show the convergence behavior of Algorithm (C.10)-(C.11) for 4 different versions. The numerical results are based on 100 Monte-Carlo runs of the trajectory θn. Note that, while the first three versions use the matrices (Kn)n of [84], the last one uses the model of [10], in order to isolate some features when comparing these two broadcast-like schemes.

1. First, we set αn = 1 as in [115]. This choice does not yield convergence, as illustrated by the following figures.
[Left: trajectories of the estimates θ_{n,i} of the five agents as functions of n. Right: zoom on the sequences s'_{n,1}, s_{n,1}, w_{n,1} and θ_{n,1} of agent 1 over the iterations n = 30596 to 30950.]

The figure on the left shows the trajectory of the estimated parameter θ_{n,i} of each agent i as a function of the number of iterations n. The convergence of θn to 1 is not achieved. As the number of iterations increases, θn does seem to approach 1 as expected; however, there are peaks at random instants n where θ_{n,i} reaches values far above or below the asymptotic value θ⋆. In order to investigate the reason for those peaks, the figure on the right plots the trajectories of each sequence involved in the steps of Algorithm (C.10)-(C.11) for agent 1. The figure zooms on the sequences s'_{n,1}, s_{n,1}, w_{n,1} and θ_{n,1} within an interval of large iteration numbers, namely from n = 30596 to n = 30950. The two most important peaks appear at n = 30620 and n = 30940. Before these two instants, the temporary sum estimate s'_{n,1} decreases until it reaches 0, while at the same time both s_{n,1} and w_{n,1} decrease and become almost zero. As a consequence, the quotient of these two quantities makes θ_{n,1} drop towards values far from the asymptotic value 1.

2. Then, we set α_{n,i} = w_{n−1,i} (∀i = 1, ..., N), which seems more coherent since the gradient term involved in (C.10) is evaluated at θ_{n−1,i} = s_{n−1,i}/w_{n−1,i} and this step updates sn. However, once again this choice fails to converge to the sought value (2.3).
Indeed, θn tends to θV, the minimizer of the weighted problem (2.5), where v is the right Perron eigenvector of E[Kn], i.e. v = E[Kn] v and v^T 1 = 1 (analogous to Lemma 2.1 in Chapter 2).

3. Hence, similarly to (2.6) in Chapter 2, we introduce a weighted version with α_{n,i} = w_{n−1,i}/v_i (∀i = 1, ..., N), in order to drive the sequence θn to the sought value (2.3).

4. In addition, based on the intuition that the failure in case 2) may be caused by the non-row stochasticity of (Kn)n, we also include the numerical results obtained when αn = 1 and Kn = Wn as in (C.4) (broadcast gossip [10]). This choice maintains the asynchronous and broadcast nature of [84] but modifies the main stochasticity assumption on (Kn)n: the matrices (Kn)n are then column-stochastic in average and row-stochastic for all n.

[Figure C.3: curves for Wn of [84] with αn = 1, αn = w_{n−1} and αn = w_{n−1} ◦ v^{−1}, and for Wn of [10] with αn = 1. (a) Mean deviation of consensus (MDC) as a function of n; (b) mean square error (MSE) as a function of n, measured with respect to θF (and also θV for case 2).]

Figure C.3. Convergence performance of Algorithm (C.10)-(C.11) when considering the different models 1)-4).

Figure C.3¹ reports the convergence to a consensus and to the sought solution θ⋆ through the error sequences MDC (C.8) and MSE (defined in Section C.1.2) for Algorithm (C.10)-(C.11) in the four cases 1)-4). Regarding the convergence to a consensus, Figure C.3 (a) shows that the trajectories of the MDC sequence decrease as the number of iterations n grows.
It is worth noting the difference in behavior between case 1), i.e. αn = 1 (as in [115], which uses a synchronous model), and the two asynchronous broadcast schemes of [10] and [84]. As discussed in case 1), θn does not converge to θ⋆. This affects both the MDC and MSE sequences by introducing a large bias, even if their trajectories have a decreasing profile (see Figures C.3 (a) and C.3 (b)).

¹ x ◦ y denotes the element-wise (Hadamard) product between vectors x and y.

On the contrary, even though in case 4) the matrices (Kn)n are column-stochastic only in average, the row-stochasticity condition is satisfied and convergence is achieved (Algorithm (C.10)-(C.11) is then essentially equivalent to Algorithm (2.1)-(2.2) analyzed in Chapter 2). In addition, as shown in Figure C.3 (a), the weighted versions of Algorithm (C.10)-(C.11) described by cases 2) and 3) converge to a consensus. However, case 2) does not yield the sought minimizer θ⋆ = θF (see Figure C.3 (b)). Indeed, in Figure C.3 (b) we include the MSE sequence with respect to both θF and θV when θn is generated by case 2): with respect to θF the MSE remains almost flat, while with respect to θV it shows that θn converges to θV. Although the MSE performance of case 2) measured against θV is close to that achieved by case 4), case 2) fails to converge to the sought solution. Finally, Figures C.3 (a) and C.3 (b) show that case 3) yields convergence to both a consensus and θF, even if its performance is slightly worse than that obtained in case 4).

In conclusion, case 3) is an alternative to case 4). Thus, if (Kn)n are row-stochastic (and column-stochastic in average), then Algorithm (C.10)-(C.11) can be used with αn = 1, i.e. with no additional knowledge, and its asymptotic behavior is comparable to that of Algorithm (2.1)-(2.2) (Chapter 2).
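The role of column-stochasticity can be made concrete with a minimal simulation of the pure push-sum averaging step (C.9) (no gradient term): even though each Kn of [84] is only column-stochastic, the ratios s_{n,i}/w_{n,i} recover the exact average of the initial values. The ring graph, seed and iteration count below are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# ring graph on N = 5 nodes (an illustrative connected topology)
N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
D = np.diag(A.sum(1))
L = D - A

def K_matrix(i):
    """K_n = I - L (I + D)^{-1} E_i : active node i splits its mass with its neighbors."""
    Ei = np.zeros((N, N)); Ei[i, i] = 1.0
    return np.eye(N) - L @ np.linalg.inv(np.eye(N) + D) @ Ei

s = rng.normal(size=N)        # s_0 = theta_0
w = np.ones(N)                # w_0 = 1
target = s.mean()
for _ in range(2000):
    K = K_matrix(rng.integers(N))   # asynchronous uniform activation
    s, w = K @ s, K @ w

# column-stochasticity conserves the total sums; the ratios reach the average
assert np.isclose(s.sum(), target * N)
assert np.isclose(w.sum(), N)
assert np.allclose(s / w, target, atol=1e-5)
```

The same mechanism underlies cases 2) and 3): when a gradient increment enters sn through (C.10), the weights w_{n−1,i} (rescaled by v_i in case 3) decide which local problem the ratio sequence solves.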
If (Kn)n are not row-stochastic (but column-stochastic), Algorithm (C.10)-(C.11) can be used with α_{n,i} = w_{n−1,i}/v_i (for all i = 1, ..., N), i.e. at the price of requiring prior knowledge of v from E[Kn] and of a slight loss in performance. We recall that an algorithm similar to (C.10)-(C.11) using αn = 1 and non-row-stochastic (Kn)n is addressed in [115], where convergence is obtained under more restrictive conditions, e.g. a synchronous communication model and a strongly connected directed graph. Hence, the choice among these three alternatives will depend on the scenario.

Figure C.4² illustrates the analogue of the CLT result obtained in Section 2.5.2 (see Theorem 2.3 in Chapter 2). Figure C.4 (a) shows the asymptotic variances of the normalized average error in cases 1)-4) when n = 20000, over 100 independent runs. As expected from the non-convergence of case 1), several aberrant values appear outside the bounds of the average variance. The asymptotic behavior of Algorithm (C.10)-(C.11) in that case cannot be predicted, since the peaks of θn occur randomly. As shown in Figures C.4 (a) and C.4 (b), the asymptotic variance and distribution in case 2) are close to those obtained in case 4), up to a bias due to the convergence failure, i.e. convergence to θV instead of θF. Besides, the performance of case 3) is within a factor lower than two of that of case 4), i.e. the asymptotic variance of case 3) is less than double that of case 4).

² x ◦ y denotes the element-wise (Hadamard) product between vectors x and y.

[Figure C.4 (a): boxplots of the normalized error (1/√γn)(⟨θn⟩ − θF) for Wn of [84] with αn = 1, αn = w_{n−1} (also shown against θV) and αn = w_{n−1} ◦ v^{−1}, and for Wn of [10] with αn = 1.]

(a) Boxplots of the average error sequence.
[Figure C.4 (b): empirical and theoretical distributions of (1/√γn)(⟨θn⟩ − θF) (and (1/√γn)(⟨θn⟩ − θV) for case 2) for Wn of [84] with αn = w_{n−1} and αn = w_{n−1} ◦ v^{−1}, and for Wn of [10] with αn = 1.]

(b) Empirical and theoretical distribution of the average error sequence.

Figure C.4. Asymptotic behavior of the normalized average error sequence (1/√γn)(⟨θn⟩ − θ⋆) when n = 20000.

Appendix D

Proofs related to Chapter 2

D.1 Proof of Theorem 2.1

We prove that Assumptions 2.5 to 2.8 hold; Theorem 2.1 will then follow from Theorem B.1. For any θ = (θ1, ..., θN) ∈ R^{dN} where θi ∈ R^d, define the R^{dN}-valued function g by g(θ) := (−∇f1(θ1)^T, ..., −∇fN(θN)^T)^T. Under Assumptions 2.2-1) and 2.2-2), for any Borel set A × B of R^{dN} × M1,

P[(Y_{n+1}, W_{n+1}) ∈ A × B | Fn] = P[Y_{n+1} ∈ A | Fn] P[W_{n+1} ∈ B] .

In addition, by Assumption 2.1 and Eq. (2.4),

P[Y_{n+1} ∈ A | Fn] = ∫ I_A(g(θn) + z) dν_{θn}(z) .

The above discussion provides the expression of µθ in Assumption 2.5-1). In addition, under Assumption 2.1-2), for any compact set K of R^{dN},

sup_{θ∈K} ∫ |y|² dµθ(y, w) = sup_{θ∈K} ( |g(θ)|² + ∫ |z|² dνθ(z) ) < ∞ ,

which proves Assumption 2.5-2). Set W = w ⊗ Id. The above expression of µθ implies that

∫ (φ + Ay)^T W^T J⊥ W (φ + Ay) dµθ(y, w) = ∫ (φ + A(g(θ) + z))^T E[W1^T J⊥ W1] (φ + A(g(θ) + z)) dνθ(z) .

Therefore, Assumption 2.6 easily follows from Assumption 2.2-3). The regularity conditions of Assumption 2.7 are satisfied with λµ = δ, where δ is given by Assumption 1. Observe indeed that the left-hand side of (2.15) is zero, and that (2.16) and (2.17) hold as long as the (∇fi)i are locally Hölder-continuous. Again, the expression of µθ implies that Wθ = E[W1]. Therefore, the mean field vector h defined by (2.18) becomes h(ϑ) = ⟨E[W1] g(1 ⊗ ϑ)⟩. Using the Woodbury matrix identity (see [81]), we have

h(ϑ) = (v^T ⊗ Id) g(1 ⊗ ϑ) = − Σ_{i=1}^N v_i ∇fi(ϑ) ,

where v = (v1, . . .
, vN) is the left Perron eigenvector given by Lemma 2.1. Set V̄ := exp(V), where V is defined by (2.5). Upon noting that ∇V̄ = −h V̄, it is easily seen that, under the assumptions of Theorem 2.1, Assumption 2.8 holds.

D.2 Proof of Lemma 2.3

From (2.12), we compute |φn|² = αn² (φ_{n−1} + Yn)^T Wn^T J⊥ Wn (φ_{n−1} + Yn). Using Assumption 2.5-1), E[|φn|² | F_{n−1}] is equal to

αn² ∫ (φ_{n−1} + y)^T (w ⊗ Id)^T J⊥ (w ⊗ Id) (φ_{n−1} + y) dµ_{θ_{n−1}}(y, w) .

By the Fubini theorem and Assumption 2.6, there exists ρK ∈ (0, 1) such that for any n ≥ 1, E[|φn|² | F_{n−1}] ≤ αn² ρK ∫ |φ_{n−1} + y|² dµ_{θ_{n−1}}(y, w). By Assumption 2.5-2), there exists a constant C such that for any n ≥ 1, almost surely,

E[|φn|² | F_{n−1}] I_{θ_{n−1}∈K} ≤ αn² ρK ( |φ_{n−1}|² + 2|φ_{n−1}| √C + C ) .

Set Un := |φn|² I_{∩_{j≤n−1} {θj∈K}}. Upon noting that I_{∩_{j≤n−1} {θj∈K}} ≤ I_{∩_{j≤n−2} {θj∈K}}, the previous inequality implies E[Un] ≤ αn² ρK ( E[U_{n−1}] + 2 √(E[U_{n−1}]) √C + C ). Let δ ∈ (0, 1 − ρK). For any n large enough (say n ≥ n0), αn² ρK ≤ 1 − δ, since lim_n αn = 1 under Assumption B.1-1). There exist positive constants M, b such that for any n ≥ n0,

E[Un] ≤ (1 − δ) ( E[U_{n−1}] + 2 √(E[U_{n−1}]) √C + C ) ≤ (1 − δ/2) E[U_{n−1}] + b 1_{E[U_{n−1}]≤M} .

A trivial induction implies that E[Un] ≤ (1 − δ/2)^{n−n0} E[U_{n0}] + 2b/δ, which concludes the proof.

D.3 Preliminary results on the sequence (φn)n

Due to the coupling of the sequences (⟨θn⟩)n and (φn)n (see Eq. (2.11)), the asymptotic analysis of (⟨θn⟩)n requires a more detailed understanding of the behavior of φn. Note from Assumption 2.5-1) and (2.12) that {φn, n ≥ 0} is a Markov chain w.r.t. the filtration {Fn, n ≥ 0}, with a transition kernel controlled by {αn, θn, n ≥ 0} (see also (D.2) below).

Let us introduce some notations and definitions. If (x, A) ↦ P(x, A) is a probability transition kernel on R^{dN}, then for any bounded continuous function f : R^{dN} → R, Pf is the measurable function x ↦ ∫ f(y) P(x, dy).
If ν is a probability on R^{dN}, νP is the probability on R^{dN} given by νP(A) = ∫ ν(dx) P(x, A). For n ≥ 0, the notation P^n stands for the n-fold iterated kernel, i.e. P^n f(x) = ∫ P^{n−1} f(y) P(x, dy); by convention P^0(x, A) = 1_A(x) = δ_x(A). A measure π is said to be an invariant distribution w.r.t. P if πP = π. For p ≥ 0, denote by Lp(R^{dN}) the set of Lipschitz functions f : R^{dN} → R^{dN} satisfying

[f]_p := sup_{x,y∈R^{dN}} |f(x) − f(y)| / ( |x − y| (1 + |x|^p + |y|^p) ) < ∞ .

We define Np(f) := ( sup_{x∈R^{dN}} |f(x)| / (1 + |x|^{p+1}) ) ∨ [f]_p for f ∈ Lp(R^{dN}). For any θ ∈ R^{dN} and any α ≥ 0, define the probability transition kernel P_{α,θ} on R^{dN} as

P_{α,θ} f(x) = ∫ f( α J⊥ (w ⊗ Id)(x + y) ) dµθ(y, w) .    (D.1)

This collection of kernels is related to the sequence (φn)n since, by Assumption 2.5-1) and (2.12), for any measurable positive function f it holds almost surely that

E[f(φ_{n+1}) | Fn] = P_{α_{n+1},θn} f(φn) .    (D.2)

We start with a result which claims that each transition kernel P_{α,θ} possesses a unique invariant distribution π_{α,θ} and is ergodic at a geometric rate. This also implies that, for a large family of functions f, a solution f_{α,θ} to the Poisson equation

f − π_{α,θ}(f) = f_{α,θ} − P_{α,θ} f_{α,θ}    (D.3)

exists, and is unique up to an additive constant.

Proposition D.1. Let Assumptions 2.5 and 2.6 hold. Let K ⊂ R^{dN} be a compact set and let ρK ∈ (0, 1) be given by Assumption 2.6. The following holds for any a ∈ (0, 1/√ρK).

1. For any θ ∈ K and α ∈ [0, a], P_{α,θ} admits a unique invariant distribution π_{α,θ} such that sup_{α∈[0,a], θ∈K} ∫ |x|² dπ_{α,θ}(x) < ∞.

2. For any p ∈ [0, 1], there exists a constant K such that for any x ∈ R^{dN} and any f ∈ Lp(R^{dN}),

sup_{α∈[0,a], θ∈K} |P^n_{α,θ} f(x) − π_{α,θ}(f)| ≤ K Np(f) (a √ρK)^n (1 + |x|^{p+1}) .

3. For any α ∈ (0, a], θ ∈ K, p ∈ [0, 1] and f ∈ Lp(R^{dN}), the function f_{α,θ} : x ↦ Σ_{n≥0} ( P^n_{α,θ} f(x) − π_{α,θ} f ) exists, solves the Poisson equation (D.3) and belongs to Lp(R^{dN}). In addition,

sup_{α∈[0,a], θ∈K} |f_{α,θ}(x)| ≤ ( K Np(f) / (1 − a √ρK) ) (1 + |x|^{p+1}) .
Proof. Let K be a compact subset of R^{dN}. Throughout this proof, for ease of notation, we write ρ instead of ρK. Let a ∈ (0, 1/√ρ) be fixed. We check the assumptions of [17, Proposition 2, p. 253], from which all the items follow. We first prove [17, (2.1.10), p. 253]. By Assumption 2.6, for any α ∈ [0, a] and θ ∈ K,

∫ P_{α,θ}(x, dy) |y|² ≤ a² ρ ( |x|² + ∫ |y|² dµθ(y, w) + 2|x| ∫ |y| dµθ(y, w) ) ;

by Assumption 2.5-2), for any ρ̄ ∈ (a²ρ, 1), there exists a positive constant c such that for any x ∈ R^{dN},

sup_{α∈[0,a], θ∈K} ∫ P_{α,θ}(x, dy) |y|² ≤ ρ̄ |x|² + c .

This concludes the proof of [17, (2.1.10), p. 253]. Note that iterating this inequality and applying Jensen's inequality yields, for any n ≥ 1, p ∈ [0, 1] and x ∈ R^{dN},

sup_{α∈[0,a], θ∈K} ∫ P^n_{α,θ}(x, dy) |y|^{p+1} ≤ ( ρ̄^n |x|² + c/(1 − ρ̄) )^{(p+1)/2} .    (D.4)

We now prove [17, (2.1.9), p. 253]. Let x, z ∈ R^{dN}, α ∈ [0, a] and θ ∈ K. We consider a coupling of the distributions P^n_{α,θ}(x, ·) and P^n_{α,θ}(z, ·) defined as follows: (W̄n, Ȳn)_{n∈N} are i.i.d. random variables with distribution µθ, and we set Wn = W̄n ⊗ Id. The stochastic process (φ^x_n)_{n∈N} defined recursively by φ^x_n = α J⊥ Wn (φ^x_{n−1} + Ȳn), with φ^x_0 = x, is a Markov chain with transition kernel P_{α,θ} started from x. We denote by E_{α,θ} the expectation on the associated canonical space. Let p ∈ [0, 1]. For any g ∈ Lp(R^{dN}), it holds that

|P^n_{α,θ} g(x) − P^n_{α,θ} g(z)| = |E_{α,θ}( g(φ^x_n) − g(φ^z_n) )| ≤ E_{α,θ}( |g(φ^x_n) − g(φ^z_n)| )
≤ [g]_p E_{α,θ}[ |φ^x_n − φ^z_n| (1 + |φ^x_n|^p + |φ^z_n|^p) ]
≤ [g]_p { E_{α,θ}[ |φ^x_n − φ^z_n|² ] E_{α,θ}[ (1 + |φ^x_n|^p + |φ^z_n|^p)² ] }^{1/2} .    (D.5)

By Assumption 2.6 combined with a trivial induction,

E_{α,θ}( |φ^x_n − φ^z_n|² )^{1/2} = α E_{α,θ}( |J⊥ Wn (φ^x_{n−1} − φ^z_{n−1})|² )^{1/2} = α E_{α,θ}( (φ^x_{n−1} − φ^z_{n−1})^T Aθ (φ^x_{n−1} − φ^z_{n−1}) )^{1/2}
≤ a √ρ E_{α,θ}( |φ^x_{n−1} − φ^z_{n−1}|² )^{1/2} ≤ (a √ρ)^n |x − z| ,    (D.6)

where Aθ := ∫ (w ⊗ Id)^T J⊥ (w ⊗ Id) dµθ(y, w).
Combining (D.4) and (D.6) shows that there exists C > 0 such that for any x, z ∈ R^{dN}, g ∈ Lp(R^{dN}) and n ≥ 1,

sup_{α∈[0,a], θ∈K} |P^n_{α,θ} g(x) − P^n_{α,θ} g(z)| ≤ C [g]_p |x − z| (a √ρ)^n (1 + |x|^p + |z|^p) .    (D.7)

This concludes the proof of [17, (2.1.9), p. 253]. Finally, we show that the transition kernels are weak Feller. From (D.1) and the dominated convergence theorem, it is easily checked that for any bounded continuous function f on R^{dN}, x ↦ P_{α,θ} f(x) is continuous. Therefore, all the assumptions of [17, Proposition 2, p. 253] are verified.

In Proposition D.2, we go further by giving an explicit expression of the first two moments of π_{α,θ}.

Proposition D.2. Let Assumptions 2.5 and 2.6 hold. Let θ ∈ R^{dN} and α be such that π_{α,θ} exists.

1. The first-order moment m^{(1)}_θ(α) := ∫ x dπ_{α,θ}(x) of π_{α,θ} is given by

m^{(1)}_θ(α) = (α^{−1} I_{dN} − J⊥ Wθ)^{−1} J⊥ zθ ,    (D.8)

where Wθ and zθ are given by (2.13) and (2.14).

2. Set T(w) := ((J⊥ w) ⊗ Id) ⊗ ((J⊥ w) ⊗ Id). The vector m^{(2)}_θ(α) := vec( ∫ x x^T dπ_{α,θ}(x) ) is given by

m^{(2)}_θ(α) = ( α^{−2} I_{d²N²} − Φθ )^{−1} ζθ(α)    (D.9)

where Φθ := ∫ T(w) dµθ(y, w) and

ζθ(α) := ∫ T(w) vec( y y^T + 2 y m^{(1)}_θ(α)^T ) dµθ(y, w) .

Proof. Since π_{α,θ} = π_{α,θ} P_{α,θ}, we obtain

m^{(1)}_θ(α) = ∫∫ α J⊥ (w ⊗ Id)(y + x) dµθ(y, w) dπ_{α,θ}(x) = α ∫ ((J⊥ w) ⊗ Id)( y + m^{(1)}_θ(α) ) dµθ(y, w) ;

this yields the expression of m^{(1)}_θ(α). The proof of item 2) follows the same lines as above and is omitted.

We finally state a result on the regularity in (α, θ) of some expectations w.r.t. π_{α,θ} and of the solutions to the Poisson equation (D.3).

Proposition D.3. Let Assumptions 2.5, 2.6 and 2.7 hold. Let K ⊂ R^{dN} be a compact set and let ρK ∈ (0, 1) and λµ ∈ (0, 1] be given by Assumption 2.6 and Assumption 2.7 respectively. The following holds for any a ∈ (0, 1/√ρK).

1. For any f ∈ L1(R^{dN}), there exists a constant Cf such that for any α, α' ∈ [0, a] and θ, θ' ∈ K,

| ∫ f(x) ( dπ_{α,θ}(x) − dπ_{α',θ'}(x) ) | ≤ Cf ( |α − α'| + |θ − θ'|^{λµ} ) .

2.
When f is the identity function f(x) = x, then for any α ∈ (0, a], θ ∈ K and x ∈ R^{dN}, one has

f_{α,θ}(x) = (I_{dN} − α J⊥ Wθ)^{−1} ( x − m^{(1)}_θ(α) ) .    (D.10)

In addition, there exists a constant K such that for any α, α' ∈ [0, a] and θ, θ' ∈ K,

|P_{α,θ} f_{α,θ}(x) − P_{α',θ'} f_{α',θ'}(x)| + |f_{α,θ}(x) − f_{α',θ'}(x)| ≤ K ( |α − α'| + |θ − θ'|^{λµ} ) (1 + |x|) .

3. For any function f of the form x^T A x, the Poisson solution f_{α,θ} exists and there exists a constant K such that for any α, α' ∈ [0, a] and θ, θ' ∈ K,

|P_{α,θ} f_{α,θ}(x) − P_{α',θ'} f_{α',θ'}(x)| ≤ K ( |α − α'| + |θ − θ'|^{λµ} ) (1 + |x|²) .

Proof. Let K be a compact subset of R^{dN}. Throughout this proof, for ease of notation, we write ρ instead of ρK. Let a ∈ (0, 1/√ρ) be fixed. Item 1 is a consequence of [17, Theorem 5, p. 259]; its proof therefore consists in verifying the assumptions of this theorem. The conditions [17, Theorem 5(i-ii), p. 259] were established in the proof of Proposition D.1 (see Eqs. (D.4) and (D.7)). Let us prove [17, Theorem 5(iii), p. 259]. Let α, α' ∈ [0, a] and θ, θ' ∈ K. Denote by (φn)n (resp. (φ'_n)n) the chain with transition kernel P_{α,θ} (resp. P_{α',θ'}) and initial value z ∈ R^{dN}. Let (W, Y) (resp. (W', Y')) be two independent pairs of random variables with distribution dµθ (resp. dµ_{θ'}), independent of (φ_{n−1}, φ'_{n−1}). Then, using Assumption 2.5-1), it is easily seen that

E[φn − φ'_n] = α Aθ E[φ_{n−1} − φ'_{n−1}] + (α − α') B_{α,θ}(n − 1) + α' C_{α',θ,θ'}(n − 1)

with Aθ := J⊥ Wθ, B_{α,θ}(k) := J⊥ ( Wθ E[φk] + zθ ) and C_{α',θ,θ'}(k) := J⊥ ( Wθ E[φ'_k] + zθ ) − J⊥ ( W_{θ'} E[φ'_k] + z_{θ'} ). Then, by a trivial induction and upon noting that φ0 − φ'_0 = z − z = 0,

E[φn − φ'_n] = (α − α') Σ_{k=1}^{n−1} α^{k−1} A_θ^{k−1} B_{α,θ}(n − k) + α' Σ_{k=1}^{n−1} α^{k−1} A_θ^{k−1} C_{α',θ,θ'}(n − k) .

Since sup_k |E[φk]| ≤ sup_k ∫ P^k_{α,θ}(z, dy) |y| ≤ C(1 + |z|) by (D.4), Assumption 2.6 implies that there exists C such that for any k ≤ n,

sup_{α∈[0,a], θ∈K} | α^{k−1} A_θ^{k−1} B_{α,θ}(n − k) | ≤ C (a √ρ)^{k−1} (1 + |z|) .
Similarly, by Assumption 2.7, Eqs. (2.15) and (2.17), there exists a constant C' such that for any k ≤ n,

sup_{α'∈[0,a], θ,θ'∈K} |θ − θ'|^{−λµ} | α^{k−1} A_θ^{k−1} C_{α',θ,θ'}(n − k) | ≤ C' (a √ρ)^{k−1} (1 + |z|) .

Since a √ρ < 1, this implies that there exists C̄ such that for any n ≥ 0, α, α' ∈ [0, a] and θ, θ' ∈ K, sup_n |E[φn − φ'_n]| ≤ C̄ ( |α − α'| + |θ − θ'|^{λµ} ) (1 + |z|). Item 1 now follows from [17, Theorem 5, p. 259].

We now prove the expression (D.10). The regularity in (α, θ) will follow from (D.8) and Assumption 2.7; details are omitted (we also omit the proof of Item 3, which follows from similar arguments). As a preliminary, note that for any affine function g : R^{dN} → R^{dN} of the form g(x) = Ax + b for some matrix A and some vector b, one has

P_{α,θ} g(x) = α A J⊥ Wθ x + α A J⊥ zθ + b .    (D.11)

We now prove by induction that for all n ≥ 0, P^n_{α,θ} f(x) − π_{α,θ} f = (α J⊥ Wθ)^n ( x − m^{(1)}_θ(α) ). The statement holds true for n = 0 because π_{α,θ} f = m^{(1)}_θ(α) by definition. Assume that it holds for an arbitrary n. By (D.11), P^{n+1}_{α,θ} f(x) − π_{α,θ} f = (α J⊥ Wθ)^{n+1} x + α (α J⊥ Wθ)^n zθ − (α J⊥ Wθ)^n m^{(1)}_θ(α), and the statement holds for the integer n + 1 by straightforward algebra. Therefore, f_{α,θ}(x) = Σ_n (α J⊥ Wθ)^n ( x − m^{(1)}_θ(α) ).

D.4 Proof of Proposition 2.2

Hereafter, we will largely use the following property: any row-stochastic matrix has bounded entries. Therefore, there exists a constant C such that

P{ ‖Wn‖ ≤ C } = 1 .    (D.12)

Lemma D.1. Under Assumptions B.1-1) and 2.5, there exists C > 0 such that, almost surely, |θ_{n+1} − θn| ≤ C γn ( |Y_{n+1}| + |φn| ).

Proof. Since lim_n γn/γ_{n+1} = 1, there exists a constant C such that |θ_{n+1} − θn| ≤ |1 ⊗ ⟨θ_{n+1}⟩ − 1 ⊗ ⟨θn⟩| + |J⊥ θ_{n+1}| + |J⊥ θn| ≤ C ( |⟨θ_{n+1}⟩ − ⟨θn⟩| + γn |φ_{n+1}| + γn |φn| ). The result follows from Eqs. (2.11), (2.12), (D.12) and sup_n αn < ∞.
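The first-moment formula (D.8) can be sanity-checked numerically: the stationary mean of the linear recursion behind the kernel (D.1) must satisfy m = α J⊥ Wθ (m + ȳ), whose solution is exactly (D.8) with zθ = Wθ ȳ when W and Y are independent. The sketch below uses an illustrative broadcast-gossip mean matrix Wθ = I − (β/N)L on the complete graph and an arbitrary mean vector ȳ; these are assumptions for the demonstration, not data from the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, beta = 6, 0.9, 0.5
A = np.ones((N, N)) - np.eye(N)          # complete graph, illustrative
L_mat = np.diag(A.sum(1)) - A
J_perp = np.eye(N) - np.ones((N, N)) / N

W_bar = np.eye(N) - beta / N * L_mat     # W_theta = E[W_1] for broadcast gossip
y_bar = rng.normal(size=N)               # stand-in for the mean of Y under mu_theta
z = W_bar @ y_bar                        # z_theta = E[W_1 Y_1] with W, Y independent

# closed form (D.8): m = (alpha^{-1} I - J_perp W_theta)^{-1} J_perp z_theta
m_closed = np.linalg.solve(np.eye(N) / alpha - J_perp @ W_bar, J_perp @ z)

# fixed point of the mean recursion m <- alpha * J_perp W_theta (m + y_bar)
m = np.zeros(N)
for _ in range(500):
    m = alpha * J_perp @ W_bar @ (m + y_bar)
assert np.allclose(m, m_closed, atol=1e-10)
```

The geometric convergence of the fixed-point iteration mirrors the contraction a√ρ < 1 used throughout Proposition D.1; here the iteration matrix α J⊥ Wθ has spectral radius α(1 − β) < 1.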
D.4.1 Decomposition of ⟨θ_{n+1}⟩ − ⟨θn⟩

By (2.11), it holds that ⟨θ_{n+1}⟩ = ⟨θn⟩ + γ_{n+1} h(⟨θn⟩) + γ_{n+1} (η_{n+1,1} + η_{n+1,2}) where

η_{n+1,1} = ⟨W_{n+1} (Y_{n+1} + φn)⟩ − ⟨zθn + Wθn φn⟩
η_{n+1,2} = ⟨zθn + Wθn φn⟩ − h(⟨θn⟩) .

We write η_{n+1,2} = un + vn + w_{n+1} + zn where

un = ⟨zθn − z_{Jθn}⟩
vn = ⟨Wθn − W_{Jθn}⟩ φn
w_{n+1} = ⟨W_{Jθn}⟩ ( φn − m^{(1)}_{θn}(α_{n+1}) )
zn = ⟨W_{Jθn}⟩ ( m^{(1)}_{θn}(α_{n+1}) − m^{(1)}_{Jθn}(1) )

and m^{(1)}_θ(a) is defined in Proposition D.2. We finally introduce a decomposition of wn. For any compact K, let ρK ∈ (0, 1) be given by Assumption 2.6 and let a ∈ (1, 1/√ρK). Under Assumption B.1, the sequence (αn)n given by (2.9) converges to one; hence, there exists a (deterministic) integer n0 (depending on K) such that αn ∈ (0, a) for all n ≥ n0. The identity function is in L0(R^{dN}) and, by Proposition D.3, there exists a solution f_{α,θ} to the Poisson equation (D.3) with f equal to the identity function, for any α ∈ (0, a) and θ ∈ K; by (D.10), f_{α,θ}(x) = (I_{dN} − α J⊥ Wθ)^{−1} ( x − m^{(1)}_θ(α) ). To make the notation easier, we set below fn := f_{α_{n+1},θn} and Pn := P_{α_{n+1},θn}. By Proposition D.1-3), there exists a constant C > 0 such that a.s.

sup_{n≥n0} |fn(x)| I_{EK} ≤ C (1 + |x|) .    (D.13)

Letting x = φn in the Poisson equation (D.3), we obtain φn − m^{(1)}_{θn}(α_{n+1}) = fn(φn) − Pn fn(φn). We set w_{n+1} = e_{n+1} + c_{n+1} + s_{n+1} + tn where

e_{n+1} = ⟨W_{Jθn}⟩ ( fn(φ_{n+1}) − Pn fn(φn) )
c_{n+1} = ⟨W_{Jθn}⟩ f_{n−1}(φn) − ⟨W_{Jθ_{n+1}}⟩ fn(φ_{n+1})
s_{n+1} = ⟨W_{Jθ_{n+1}} − W_{Jθn}⟩ fn(φ_{n+1})
tn = ⟨W_{Jθn}⟩ ( fn(φn) − f_{n−1}(φn) ) .

As a conclusion, we have η_{n+1,2} = un + vn + zn + e_{n+1} + c_{n+1} + s_{n+1} + tn.

D.4.2 Proof of Proposition 2.2

Define EK = {∀j ∈ N, θj ∈ K} and E_{n,K} = ∩_{j≤n} {θj ∈ K} for some compact set K. We show that Σ_n γn η_{n,i} < ∞ a.s. for both i = 1, 2. The proposition will then follow from [9]. By Assumption 2.4, it is enough to show that, for any fixed compact set K, Σ_{k≥1} γk η_{k,i} I_{EK} is finite a.s. Hereafter, K is fixed and n0 is defined as in Section D.4.1.
We first study η_{n,1}. Note that for any ω, the sequence I_{E_{n,K}}(ω) is identically equal to I_{EK}(ω) for all large n. As a consequence, Σ_n γn η_{n,1} (I_{EK} − I_{E_{n−1,K}}) is finite a.s., and it is therefore sufficient to prove that Σ_n γn η_{n,1} I_{E_{n−1,K}} is finite a.s. Since η_{n,1} I_{E_{n−1,K}} is a martingale difference noise, the sought result will be obtained provided

Σ_n γn^{1+λ} E[ |η_{n,1}|^{1+λ} I_{E_{n−1,K}} ] < ∞

where λ > 0 (see e.g. [75, Theorem 2.18]); we choose λ ∈ (0, 1) given by Assumption B.1. After some algebra,

sup_n E[ |η_{n,1}|² I_{E_{n−1,K}} ] ≤ 2 sup_n E[ |⟨Wn (Yn + φ_{n−1})⟩|² I_{E_{n−1,K}} ] ≤ C sup_n E[ ( |Yn|² + |φ_{n−1}|² ) I_{E_{n−1,K}} ]

for some constant C, where we used (D.12). Assumption 2.5-2) directly leads to sup_n E[ |Yn|² I_{E_{n−1,K}} ] < ∞, whereas by Lemma 2.3, sup_n E[ |φ_{n−1}|² I_{E_{n−1,K}} ] < ∞. Hence, Σ_n γn^{1+λ} E[ |η_{n,1}|^{1+λ} I_{E_{n−1,K}} ] ≤ C' Σ_n γn^{1+λ} for some C' > 0, and the upper bound is finite by Assumption B.1. This concludes the first step.

We now study η_{n,2} for any n ≥ n0. By (2.16), there exists C such that |un| I_{EK} ≤ C |J⊥ θ_{n−1}|^{λµ} I_{EK} ≤ C γn^{λµ} |φ_{n−1}|^{λµ} I_{E_{n−2,K}}. Therefore,

Σ_n E( I_{EK} γn |un| ) ≤ C Σ_n γn^{1+λµ} sup_n E( |φ_{n−1}| I_{E_{n−2,K}} ) ,

which is finite by Assumption B.1 and Lemma 2.3. Thus Σ_n γn |un| I_{EK} is a.s. finite. The term vn can be analyzed similarly: by (2.15) applied with K ← K ∪ {Jθ, θ ∈ K}, there exists a constant C such that

|vn| I_{EK} ≤ C |J⊥ θn|^{λµ} |φn| I_{E_{n−1,K}} ≤ C γ_{n+1}^{λµ} |φn|^{1+λµ} I_{E_{n−1,K}} ,

and the fact that Σ_n γn |vn| I_{EK} is finite a.s. follows from the same arguments as above.

We now study zn. By (D.12), |zn| ≤ C | m^{(1)}_{θn}(α_{n+1}) − m^{(1)}_{Jθn}(1) |. By Proposition D.3-1), since α_{n+1} < a < 1/√ρK, there exists a constant C' such that

Σ_n γn E( |zn| I_{EK} ) ≤ C' ( Σ_n |γn − γ_{n+1}| + Σ_n γn^{1+λµ} sup_k E( |φk|^{λµ} I_{E_{k−1,K}} ) ) .

The RHS is finite by Lemma 2.3 and Assumption B.1. Hence, Σ_n γn |zn| I_{EK} is finite a.s. The sequence (en)n is a martingale-increment sequence: as above for the term η_{n,1}, Σ_n γn en I_{EK} is finite a.s. provided sup_n E( |e_{n+1}|^{1+λ} I_{E_{n,K}} ) < ∞.
This holds true by (D.12), (D.13) and Lemma 2.3.

Let us now investigate $c_{n+1}$. We write
$\sum_{k=1}^{n} \gamma_{k+1} c_{k+1} = \sum_{k=2}^{n} (\gamma_{k+1} - \gamma_k)\langle W_{J\theta_k}\rangle f_{k-1}(\phi_k) - \gamma_{n+1}\langle W_{J\theta_{n+1}}\rangle f_n(\phi_{n+1}) + \gamma_2 \langle W_{J\theta_1}\rangle f_0(\phi_1)\,.$
Using again (D.12), (D.13) and Lemma 2.3, there exists $C > 0$ such that
$\sum_{k=1}^{n} \gamma_{k+1}\, E(|c_{k+1}| I_{E_K}) \le C \big(\sum_{k\ge 1} |\gamma_{k+1} - \gamma_k| + \gamma_n + 1\big)\,.$
The RHS is finite by Assumption B.1, thus implying that $\sum_n \gamma_n c_n I_{E_K}$ is finite a.s.

Consider the term $s_{n+1}$. Following similar arguments and using (D.13) again, we obtain
$\sum_{k\le n} \gamma_k |s_k| I_{E_K} \le C \sum_{k\le n} \gamma_k \|\langle W_{J\theta_k} - W_{J\theta_{k-1}}\rangle\| (1 + |\phi_k|) I_{E_K}$
for some constant $C$ which depends only on $K$. By condition (2.15) and Lemma D.1, one has
$\|\langle W_{J\theta_k} - W_{J\theta_{k-1}}\rangle\| I_{E_K} \le C_K \gamma_k^{\lambda_\mu} (|Y_k|^{\lambda_\mu} + |\phi_{k-1}|^{\lambda_\mu}) I_{E_K}\,.$
By the Cauchy-Schwarz inequality, Assumption 2.5 and Lemma 2.3, it can be proved that
$\sup_k E[(|Y_k| + |\phi_{k-1}|)(1 + |\phi_k|) I_{E_K}] < \infty\,.$ (D.14)
Therefore, by Assumption B.1, $E(\sum_k \gamma_k |s_k| I_{E_K})$ is finite, thus implying that $\sum_{k\ge 1} \gamma_k s_k I_{E_K}$ exists a.s.

Finally consider the term $t_n$. By (D.12) and Proposition D.3-2), there exists a constant $C$ such that for any $n \ge n_0$,
$|t_n| I_{E_K} \le C (|\alpha_n - \alpha_{n-1}| + |\theta_n - \theta_{n-1}|^{\lambda_\mu})(1 + |\phi_n|)\,.$
By Lemma D.1, (D.14) and Assumption B.1, it can be shown that $\sum_n \gamma_n E(|t_n| I_{E_K}) < \infty$, which proves that $\sum_n \gamma_n t_n I_{E_K}$ converges a.s.

D.5 Proof of Theorem 2.3

The core of the proof consists in checking the conditions of [67, Theorem 2.1]. To make the notations easier, we write the proofs in the case $d = 1$ and under the assumption that $\lim_n \theta_n = \theta_\star 1$ almost surely. Throughout the proof, we will write that a sequence of r.v. $(Z_n)_n$ is $O_{\mathrm{w.p.1}}(1)$ iff $\sup_n |Z_n| < \infty$ almost surely, and that $(Z_n)_n$ is $O_{L^1}(1)$ iff $\sup_n E[|Z_n|] < \infty$. Fix $\delta > 0$. For any positive integer $m$, set
$A_m := \cap_{j\ge m} \{|\theta_j - \theta_\star 1| \le \delta\}\,.$
From Section D.4.1, it holds
$\langle\theta_{n+1}\rangle = \langle\theta_n\rangle + \gamma_{n+1} h(\langle\theta_n\rangle) + \gamma_{n+1} E_{n+1} + \gamma_{n+1} R_{n+1}$
where
$E_{n+1} := \langle W_{n+1}(Y_{n+1} + \phi_n)\rangle - \langle z_{\theta_n}\rangle - \langle W_{\theta_n}\rangle\phi_n + \langle W_{J\theta_n}\rangle (f_n(\phi_{n+1}) - P_n f_n(\phi_n))$
$R_{n+1} := u_n + v_n + z_n + c_{n+1} + s_{n+1} + t_n\,.$
Note that $E[E_{n+1} \mid \mathcal{F}_n] = 0$, i.e. $(E_n)_n$ is an $\mathcal{F}_n$-adapted martingale increment. From the expression of $f_n = f_{\alpha_{n+1},\theta_n}$ (see (D.10)), we have
$f_{\alpha,\theta}(y) - P_{\alpha,\theta} f_{\alpha,\theta}(x) = B_{\alpha,\theta}\,(y - \alpha J_\perp W_\theta x - \alpha J_\perp z_\theta) \quad \text{with} \quad B_{\alpha,\theta} := (I_{dN} - \alpha J_\perp W_\theta)^{-1}\,.$ (D.15)
Hence,
$E_{n+1} = \langle W_{n+1}(Y_{n+1} + \phi_n)\rangle - \langle z_{\theta_n}\rangle - \langle W_{\theta_n}\rangle\phi_n + \langle W_{J\theta_n}\rangle B_{\alpha_{n+1},\theta_n}\big(\phi_{n+1} - \alpha_{n+1} J_\perp (W_{\theta_n}\phi_n + z_{\theta_n})\big)\,.$ (D.16)

D.5.1 A preliminary result

The following lemma extends Lemma 2.3.

Lemma D.2. Let Assumptions B.1-1), 2.5, 2.10 and 2.11 hold. Let $(\phi_n)_{n\ge 0}$ be the sequence given by (2.9) and $\tau$ be given by Assumption 2.10. For any compact set $K \subset \mathbb{R}^{dN}$,
$\sup_n E\big[|\phi_n|^{2+\tau}\, I_{\cap_{j\le n-1}\{\theta_j\in K\}}\big] < \infty\,.$
Let $\tilde\rho_K$ be given by Assumption 2.11. For any $a \in (0, 1/\sqrt{\tilde\rho_K})$,
$\sup_{\alpha\in[0,a],\,\theta\in K} \int |x|^{2+\tau}\, d\pi_{\alpha,\theta}(x) < \infty\,.$

Proof. From (2.12) and (D.12), there exists a constant $C$ such that, on the set $\cap_{j\le n-1}\{\theta_j \in K\}$,
$|\phi_n|^{2+\tau} \le \alpha_n^{2+\tau}\, (\phi_{n-1}^T W_n^T J_\perp W_n \phi_{n-1})^{1+\tau/2} + C|Y_n|^{2+\tau} + C|Y_n|^{1+\tau/2}|\phi_{n-1}|^{1+\tau/2}\,.$
Since $\lim_n \alpha_n = 1$, for any $\bar\rho_K \in (\tilde\rho_K, 1)$ there exists a deterministic $n_\star$ such that $\sup_{n\ge n_\star} \alpha_n^{2+\tau}\tilde\rho_K \le \bar\rho_K$. Using Assumptions 2.10 and 2.11, there exists a constant $C'$ such that for any $n$ large enough,
$E\big[|\phi_n|^{2+\tau} \mid \mathcal{F}_{n-1}\big] I_{\theta_{n-1}\in K} \le \bar\rho_K |\phi_{n-1}|^{2+\tau} + C'(1 + |\phi_{n-1}|^{1+\tau/2})\,.$
A trivial induction (see the proof of Lemma 2.3 for similar computations) yields the first result. For the second statement, we mimic the previous computations in order to prove that there exist $\bar\rho_K \in (0,1)$ and $C$ such that for any $\alpha \in [0,a]$ and $\theta \in K$,
$\int P_{\alpha,\theta}(x, dy)\,|y|^{2+\tau} \le \bar\rho_K |x|^{2+\tau} + C(1 + |x|^{1+\tau/2})\,.$
We conclude by [109, Theorem 14.3.7.] and Proposition D.1-1).

D.5.2 Checking condition C2 of [67, Theorem 2.1]

Let $m \ge 1$.
From Assumption 2.10, (D.12) and Lemma D.2, it is easily seen by using the expression (D.16) that $\sup_n E\big[|E_{n+1}|^{2+\tau}\, 1_{\cap_{m\le j\le n}\{|\theta_j-\theta_\star 1|\le\delta\}}\big] < \infty$, where $\tau$ is given by Assumption 2.10.

In order to derive the asymptotic covariance, we go further into the expression of the conditional covariance $E[E_{n+1}^2 \mid \mathcal{F}_n]$. We write $E[E_{n+1}^2 \mid \mathcal{F}_n] = \Xi(\alpha_{n+1}, \theta_n, \phi_n)$ where
$\Xi(\alpha,\theta,x) := \int (\xi_{\alpha,\theta,x}(y,w))^2\, d\mu_\theta(y,w)$ (D.17)
$\xi_{\alpha,\theta,x}(y,w) := A_{\alpha,\theta}\big((w - W_\theta)x + (wy - z_\theta)\big)$
$A_{\alpha,\theta} := \frac{1^T}{N}\big(I_{dN} + \alpha\, W_{J\theta} (I_{dN} - \alpha J_\perp W_\theta)^{-1} J_\perp\big)\,.$
Set $\pi_\star := \pi_{1,\theta_\star 1}$ and $\pi_n := \pi_{\alpha_{n+1},\theta_n}$, where $\pi_{\alpha,\theta}$ is defined by Proposition D.1. We write
$\Xi(\alpha_{n+1},\theta_n,\phi_n) = \big(\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n)\big) + \big(\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)\big) + \big(\Xi(1,\theta_n,\phi_n) - \int \Xi(1,\theta_n,x)\,d\pi_n(x)\big) + \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)\,.$
For any $m \ge 1$, we have on the set $A_m$:
$\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n) \to 0$ a.s.
$\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x) \to 0$ a.s.
$\gamma_n\, E\Big[\big|\sum_{k=1}^{n}\big(\Xi(1,\theta_k,\phi_k) - \int \Xi(1,\theta_k,x)\,d\pi_k(x)\big)\big|\, 1_{A_m}\Big] \to 0\,.$
The detailed computations are given in Section D.5.5. This implies that the key quantity involved in the asymptotic covariance matrix is $\int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)$.

D.5.3 Expression of $U_\star$

Set
$U_\star := \int \Xi(1, 1\otimes\theta_\star, x)\, d\pi_{1,1\otimes\theta_\star}(x)\,.$
Lemma D.3 gives an explicit expression of $U_\star$ as a function of the quantities introduced in Section 2.5.2.

Lemma D.3. Under the assumptions of Theorem 2.3,
$\mathrm{vec}\, U_\star = (A_\star \otimes A_\star)(R_\star m_\star^{(2)} + 2 T_\star m_\star^{(1)} + S_\star)\,.$

Proof. For simplicity, we use the notations $R_\theta(w) := w - W_\theta$, $\upsilon_\theta(y,w) := wy - z_\theta$ and $\tilde T_{\theta,x}(y,w) := (R_\theta(w)x + \upsilon_\theta(y,w))(R_\theta(w)x + \upsilon_\theta(y,w))^T$. Note that $\tilde T_{\theta,x}(y,w)$ coincides with
$R_\theta(w)xx^T R_\theta(w)^T + 2 R_\theta(w)x\, \upsilon_\theta(y,w)^T + \upsilon_\theta(y,w)\upsilon_\theta(y,w)^T\,.$
From (D.17), $\xi_{\alpha,\theta,x}(y,w) = A_{\alpha,\theta}(R_\theta(w)x + \upsilon_\theta(y,w))$ so that
$\mathrm{vec}\, \Xi(\alpha,\theta,x) = (A_{\alpha,\theta} \otimes A_{\alpha,\theta}) \int \mathrm{vec}\, \tilde T_{\theta,x}(y,w)\, d\mu_\theta(y,w)\,.$
Applying the vec operator on $\tilde T_{\theta,x}(y,w)$ yields
$(R_\theta(w) \otimes R_\theta(w))\,\mathrm{vec}(xx^T) + 2(\upsilon_\theta(y,w) \otimes R_\theta(w))x + \mathrm{vec}(\upsilon_\theta(y,w)\upsilon_\theta(y,w)^T)\,.$
Therefore, when applied with $\alpha = 1$ and $\theta = \theta_\star 1$, it holds
$\mathrm{vec}\, \Xi(1,\theta_\star 1,x) = (A_\star \otimes A_\star)(R_\star\, \mathrm{vec}(xx^T) + 2 T_\star x + S_\star)\,.$
This yields the result by integrating $x$ w.r.t. $\pi_\star$.

D.5.4 Checking condition C3 of [67, Theorem 2.1]

We first prove that for any $m \ge 1$,
$|u_n + v_n + z_n + s_{n+1} + t_n|\, 1_{A_m} \le \sqrt{\gamma_n}\, o(1)\, O_{L^1}(1)\,.$ (D.18)
Let $m \ge 1$. By (2.9), (D.12) and Proposition D.3-1), there exists a constant $C_1$ such that almost surely, on the set $A_m$,
$|z_n| \le C_1\big(|\alpha_{n+1} - 1| + |J_\perp\theta_n|^{\lambda_\mu}\big) \le C_1\big(|\alpha_{n+1} - 1| + \gamma_{n+1}^{\lambda_\mu}(1 + |\phi_n|^{\lambda_\mu})\big)\,.$
Assumption 2.13, Lemma 2.3 and $\lambda_\mu > 1/2$ imply that $|z_n| 1_{A_m} = \sqrt{\gamma_n}\, o(1)\, O_{L^1}(1)$. By Assumption 2.7, Proposition D.1-3) and Lemma D.1, there exist a constant $C_2 > 0$ and $n_0$ such that almost surely, for all $n \ge n_0$,
$|s_{n+1}| 1_{A_m} \le C_2 \gamma_n^{\lambda_\mu}\big(|Y_{n+1}|^{\lambda_\mu} + |\phi_n|^{\lambda_\mu}\big)(1 + |\phi_{n+1}|)\, 1_{A_m}\,.$
Assumption 2.5, Lemma 2.3 and the condition $\lambda_\mu > 1/2$ imply that $|s_{n+1}| 1_{A_m} = \sqrt{\gamma_n}\, O_{L^1}(1)$. By (D.12), Proposition D.3-2) and Lemma D.1, there exist a constant $C_3 > 0$ and $n_0$ such that almost surely, for any $n \ge n_0$,
$|t_n| 1_{A_m} \le C_3\big(|\alpha_{n+1} - \alpha_n| + \gamma_n^{\lambda_\mu}(|Y_n|^{\lambda_\mu} + |\phi_n|^{\lambda_\mu})\big)\, 1_{A_m}\,.$
Lemma 2.3, Assumption 2.13 and $\lambda_\mu > 1/2$ imply that $|t_n| 1_{A_m} = \sqrt{\gamma_n}\, o(1)\, O_{L^1}(1)$. By Assumption 2.7, there exists a constant $C_4 > 0$ such that almost surely, $|u_n| 1_{A_m} \le C_4 \gamma_n^{\lambda_\mu} |\phi_n|^{\lambda_\mu} 1_{A_m}$. Lemma 2.3 and the property $\lambda_\mu > 1/2$ imply $u_n = o(\sqrt{\gamma_n})\, O_{L^1}(1)$. Finally, by Assumption 2.7, there exists a constant $C$ such that almost surely, $|v_n| 1_{A_m} \le C \gamma_{n+1}^{\lambda_\mu} |\phi_n|^{1+\lambda_\mu} 1_{A_m}$, so that by Lemma 2.3 again and the condition $\lambda_\mu > 1/2$, $v_n = o(\sqrt{\gamma_n})\, O_{L^1}(1)$. The above discussion concludes the proof of (D.18).

The second step is to prove that for any $m \ge 1$, $\sqrt{\gamma_n} \sum_{k=1}^{n} c_k\, 1_{A_m} = o(1)\, O_{\mathrm{w.p.1}}(1)\, O_{L^1}(1)$. By (D.12) and (D.13), there exists a constant $C > 0$ such that almost surely,
$\big|\sum_{k=1}^{n} c_k\big|\, 1_{A_m} \le C(1 + |\phi_0| + |\phi_n|)\, 1_{A_m}\,.$
Lemma 2.3 implies that $\sum_{k=1}^{n} c_k = O_{L^1}(1)$.
This concludes the proof of the condition C3 in [67].

D.5.5 Detailed computations for verifying the condition C2

We start with a preliminary lemma whose proof is omitted since it follows from standard computations.

Lemma D.4. Let Assumptions 2.5, 2.11 and 2.12-1) hold. Let $\delta > 0$ and set $K := \{\theta : |\theta - \theta_\star 1| \le \delta\}$. Fix $a \in (0, 1/\sqrt{\tilde\rho_K})$ where $\tilde\rho_K$ is given by Assumption 2.11. There exists a constant $C$ such that for any $\theta, \theta' \in K$, $\alpha, \alpha' \in [0,a]$, $x, z, y \in \mathbb{R}^{dN}$ and $w \in \mathcal{M}_1$,
$|\xi_{\alpha,\theta,x}(y,w)| \le C(1 + |y| + |x|)\,,$
$\|A_{\alpha,\theta} - A_{\alpha',\theta'}\| \le C\big(|\alpha - \alpha'| + |\theta - \theta'|^{\lambda_\mu}\big)\,,$
$|\xi_{\alpha,\theta,x}(y,w) - \xi_{\alpha',\theta',x}(y,w)| \le C\big(|\alpha - \alpha'| + |\theta - \theta'|^{\lambda_\mu}\big)(1 + |x| + |y|)\,,$
$|\xi_{\alpha,\theta,x}(y,w) - \xi_{\alpha,\theta,z}(y,w)| \le C|x - z|\,,$
where $\lambda_\mu$ is given by Assumptions 2.5 and 2.12-1).

1) First term: $\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n)$

It is sufficient to prove that this term converges almost surely to zero on the event $A_m$ for any $m \ge 1$, which is implied by the almost-sure convergence to zero on the event $\{\theta_n \in K\}$ with $K := \{\theta : |\theta - \theta_\star 1| \le \delta\}$. Below, $C_m$ is a constant whose value may change upon each appearance. By using the inequality $|a^2 - b^2| \le |a - b|(|a| + |b|)$, Assumption 2.10 and Lemma D.4, there exists a constant $C_m$ such that for any $\alpha$ close enough to 1 and $\theta \in K$,
$|\Xi(\alpha,\theta,x) - \Xi(1,\theta,x)| \le C_m (1 + |x|^2)\, |\alpha - 1|\,.$
By Lemma D.2, for any $\varepsilon > 0$, there exists $C_m$ such that
$P\Big(\sup_{n\ge\ell} (1 + |\phi_n|)^2 |\alpha_{n+1} - 1|\, 1_{\theta_n\in K} \ge \varepsilon\Big) \le C_m \sum_{n\ge\ell} |\alpha_{n+1} - 1|^{1+\tau/2}\,.$
The RHS converges to zero as $\ell \to \infty$ by Assumption 2.13. This implies that almost surely, $\lim_n |\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n)|\, 1_{\theta_n\in K} = 0$.

2) Second term: $\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)$

We apply the following lemma (see [68, Proposition 4.3.]).

Lemma D.5. Let $\mu, \{\mu_n, n \ge 0\}$ be probability distributions on $\mathbb{R}^{dN}$ endowed with its Borel $\sigma$-field. Let $\{h_n, n \ge 1\}$ be an equicontinuous family of functions from $\mathbb{R}^{dN}$ to $\mathbb{R}$. Assume
1. the sequence $\{\mu_n, n \ge 0\}$ weakly converges to $\mu$;
2. for any $x \in \mathbb{R}^{dN}$, $\lim_n h_n(x)$ exists, and there exists $a > 1$ such that $\sup_n \int |h_n|^a\, d\mu_n + \int |\lim_n h_n|\, d\mu < \infty$.
Then $\lim_n \int h_n\, d\mu_n = \int \lim_n h_n\, d\mu$.

a) Almost-sure weak convergence

In our case $\mu_n \leftarrow \pi_n$ and $\mu \leftarrow \pi_\star$, and $\mu_n$ is a random probability. Since the set of bounded Lipschitz functions is convergence determining (see e.g. [60, Theorem 11.3.3.]), we prove that for any bounded Lipschitz function $h$, $\lim_n \int h\, d\pi_n = \int h\, d\pi_\star$ almost surely, with an almost-sure set which has to be uniform over the set of bounded Lipschitz functions. Following the same lines as in the proof of [68, Proposition 5.2.], this convergence occurs almost surely if and only if for any bounded Lipschitz function $h$, there exists a full set such that on this set, $\lim_n \int h\, d\pi_n = \int h\, d\pi_\star$. Let $h$ be a bounded Lipschitz function. Then $h \in L_0(\mathbb{R}^{dN})$. By Proposition D.3-1), there exists a constant $C_f$ such that for any $n$ large enough, on the set $\{\theta_n \in K\}$,
$\Big|\int h\, d\pi_n - \int h\, d\pi_\star\Big| \le C_f\big(|\alpha_{n+1} - 1| + |\theta_{n+1} - \theta_\star 1|^{\lambda_\mu}\big)\,.$
Since $\lim_n \theta_n = \theta_\star 1$ almost surely and $\lim_n \alpha_n = 1$, we have $\lim_n \int h\, d\pi_n = \int h\, d\pi_\star$ almost surely. This concludes the proof of the almost-sure weak convergence.

b) Equicontinuity of the family of functions

We prove that the family of functions $\{x \mapsto \Xi(1,\theta,x);\ \theta \in K\}$ is equicontinuous. Using again the inequality $|a^2 - b^2| \le |a - b|(|a| + |b|)$, Lemma D.4 and Assumption 2.10, there exists a constant $C_m$ such that for any $\theta \in K$ and $x, z \in \mathbb{R}^{dN}$,
$|\Xi(1,\theta,x) - \Xi(1,\theta,z)| \le C_m (1 + |x| + |z|)\,|x - z|\,.$

c) Almost-sure limit of $\Xi(1,\theta_n,x)$ when $n \to \infty$

Let $x$ be fixed. We write
$|\Xi(1,\theta,x) - \Xi(1,\theta',x)| \le \int \big|\xi^2_{1,\theta,x}(y,w) - \xi^2_{1,\theta',x}(y,w)\big|\, d\mu_{\theta'}(y,w) + \Big|\int \xi^2_{1,\theta,x}(y,w)\, d\mu_\theta(y,w) - \int \xi^2_{1,\theta,x}(y,w)\, d\mu_{\theta'}(y,w)\Big|\,.$
Let us consider the first term. Using again $|a^2 - b^2| \le |a - b|(|a| + |b|)$ and Lemma D.4, there exists a constant $C_m$ such that the first term is upper bounded by $C_m (1 + |x|^2)\,|\theta - \theta'|^{\lambda_\mu}$ for any $\theta, \theta' \in K$.
For the second term, we use Assumption 2.12-2) and obtain the same upper bound. Then, there exists a constant $C_m$ such that for any $\theta, \theta' \in K$,
$|\Xi(1,\theta,x) - \Xi(1,\theta',x)| \le C_m (1 + |x|^2)\, |\theta - \theta'|^{\lambda_\mu}\,.$ (D.19)
Since $\lim_n \theta_n = \theta_\star 1$ almost surely, the above discussion implies that for any fixed $x$, $\lim_n \Xi(1,\theta_n,x) = \Xi(1,\theta_\star 1,x)$ almost surely on $A_m$.

d) Moment conditions

It is easily seen (using again Lemma D.4) that there exists a constant $C_m$ such that for any $\theta \in K$, $|\Xi(1,\theta,x)| \le C_m(1 + |x|^2)$. Therefore, Lemma D.2 implies that $\int |\Xi(1,\theta_\star 1,x)|\, d\pi_\star(x) < \infty$. In addition, for any $\theta \in K$, $\alpha$ in a neighborhood of 1 and $a > 1$,
$\int |\Xi(1,\theta,x)|^a\, \pi_{\alpha,\theta}(dx) \le C_m \int (1 + |x|^{2a})\, \pi_{\alpha,\theta}(dx)\,.$
Lemma D.2 implies that there exists $a > 1$ such that
$\sup_n 1_{\theta_n\in K} \int |\Xi(1,\theta_n,x)|^a\, \pi_{\alpha_{n+1},\theta_n}(dx) < \infty\,.$

e) Conclusion

We can now apply Lemma D.5; we have almost surely,
$\lim_n \Big(\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)\Big)\, 1_{A_m} = 0\,.$

3) Third term: $\Xi(1,\theta_n,\phi_n) - \int \Xi(1,\theta_n,x)\,d\pi_n(x)$

We prove that for any $m \ge 1$,
$\lim_n \gamma_n\, E\Big[\big|\sum_{k=1}^{n}\big(\Xi(1,\theta_k,\phi_k) - \int \Xi(1,\theta_k,x)\,d\pi_k(x)\big)\big|\, 1_{A_m}\Big] = 0\,.$
We write $\sum_{k=1}^{n}\big(\Xi(1,\theta_k,\phi_k) - \int \Xi(1,\theta_k,x)\,d\pi_k(x)\big) = \sum_{i=1}^{3} T_n^{(i)}$ with
$T_n^{(1)} = \sum_{k=1}^{n} \{\Xi(1,\theta_k,\phi_k) - \Xi(1,\theta_{k-1},\phi_k)\}$
$T_n^{(2)} = \sum_{k=1}^{n} \Big(\Xi(1,\theta_{k-1},\phi_k) - \int \Xi(1,\theta_{k-1},x)\,d\pi_{k-1}(x)\Big)$
$T_n^{(3)} = \int \Xi(1,\theta_0,x)\,d\pi_0(x) - \int \Xi(1,\theta_n,x)\,d\pi_n(x)\,.$

a) Term $T_n^{(1)}$

By (D.19), there exists a constant $C_m$ such that for any $k \ge m+1$, on the set $A_m$,
$|\Xi(1,\theta_k,\phi_k) - \Xi(1,\theta_{k-1},\phi_k)| \le C_m |\theta_k - \theta_{k-1}|^{\lambda_\mu} (1 + |\phi_k|^2)\,.$
Hence, by Lemma D.1, on the set $A_m$,
$|\Xi(1,\theta_k,\phi_k) - \Xi(1,\theta_{k-1},\phi_k)| \le C_m \gamma_k^{\lambda_\mu} (1 + |\phi_k|^2)(|Y_k|^{\lambda_\mu} + |\phi_{k-1}|^{\lambda_\mu})\,.$
By Assumption 2.10, Lemma D.2 and Assumption 2.13, the sum
$\sum_{k\ge 1} \gamma_k^{1+\lambda_\mu}\, E\big[(1 + |\phi_k|^2)(|Y_k|^{\lambda_\mu} + |\phi_{k-1}|^{\lambda_\mu})\, 1_{A_m}\big]$
is finite, which implies $\lim_n \gamma_n E[|T_n^{(1)}| 1_{A_m}] = 0$ by the Kronecker Lemma.
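The Kronecker Lemma step used here is generic: if $\sum_k a_k$ converges and $b_n \uparrow \infty$, then $b_n^{-1}\sum_{k\le n} b_k a_k \to 0$ (above, $a_k$ plays the role of the summable $\gamma_k^{1+\lambda_\mu}$-weighted terms and $b_n = 1/\gamma_n$). A small numerical sketch with illustrative choices $a_k = k^{-1.2}$ and $b_n = n$ (not the thesis's sequences):

```python
# Kronecker Lemma illustration: sum(a_k) converges, b_n = n -> infinity,
# so the weighted average (1/b_n) * sum_{k<=n} b_k * a_k tends to zero.
def kronecker_average(n):
    # a_k = k^(-1.2) is summable; b_k = k
    return sum(k * k ** (-1.2) for k in range(1, n + 1)) / n

vals = [kronecker_average(n) for n in (10, 100, 1000, 10000)]
# The weighted averages decrease toward zero as n grows.
assert all(x > y for x, y in zip(vals, vals[1:]))
```

Here the decay is slow ($\sim n^{-0.2}$), which mirrors why the proof only needs convergence to zero, not a rate.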
b) Term $T_n^{(2)}$

From the expression of $\xi$ (see (D.17)), we have
$\Xi(1,\theta,\phi) - \Xi(1,\theta,x) = \phi^T C_\theta \phi - x^T C_\theta x + (\phi - x)^T D_\theta$
with
$C_\theta := \int (w - W_\theta)^T A_{1,\theta}^T A_{1,\theta} (w - W_\theta)\, d\mu_\theta(y,w)$
$D_\theta := 2 \int (w - W_\theta)^T A_{1,\theta}^T A_{1,\theta} (wy - z_\theta)\, d\mu_\theta(y,w)\,.$
We detail the proof of the statement
$\lim_n \gamma_n\, E\Big[\big|\sum_{k=1}^{n} \big(\phi_k - \int x\, d\pi_{\alpha_k,\theta_{k-1}}(x)\big)^T D_{\theta_{k-1}}\big|\, 1_{A_m}\Big] = 0\,.$
The second statement, with the quadratic dependence on $\phi_k$, is similar and omitted (its proof uses Proposition D.3-3) and the condition $\lim_n \gamma_n n^{1/(1+\tau/2)} = 0$). Using again the Poisson solution $f_n := f_{\alpha_{n+1},\theta_n}$ associated to the identity function and the kernel $P_n := P_{\alpha_{n+1},\theta_n}$, it holds by (D.15)
$\big(\phi_k - \int x\, d\pi_{k-1}(x)\big)^T D_{\theta_{k-1}} = (f_{k-1}(\phi_k) - P_{k-1}f_{k-1}(\phi_{k-1}))^T D_{\theta_{k-1}}$ (D.20)
$\quad + \big(P_{k-1}f_{k-1}^T(\phi_{k-1}) D_{\theta_{k-1}} - P_k f_k^T(\phi_k) D_{\theta_k}\big)$ (D.21)
$\quad + \big(P_k f_k^T(\phi_k) - P_{k-1}f_{k-1}^T(\phi_k)\big) D_{\theta_k}$ (D.22)
$\quad + P_{k-1}f_{k-1}^T(\phi_k)\big(D_{\theta_k} - D_{\theta_{k-1}}\big)\,.$ (D.23)
From Assumption 2.12-2) and Lemma D.4, there exists a constant $C_m$ such that for any $k$,
$|D_{\theta_k}|\, 1_{A_m} \le C_m$ (D.24)
$|D_{\theta_k} - D_{\theta_{k-1}}|\, 1_{A_m} \le C_m |\theta_k - \theta_{k-1}|^{\lambda_\mu}\,.$ (D.25)
Let us control the first term (D.20). Upon noting that it is a martingale increment, the Burkholder inequality (see e.g. [75, Theorem 2.10]) applied with $p \leftarrow 2+\tau$ and Lemma D.2 imply
$E\Big[\big|\sum_{k=1}^{n} (f_{k-1}(\phi_k) - P_{k-1}f_{k-1}(\phi_{k-1}))^T D_{\theta_{k-1}}\big|\, 1_{A_m}\Big] = O(\sqrt{n})\,.$
This term is $o(1/\gamma_n)$ by Assumption 2.13. Let us consider (D.21).
$E\Big[\big|\sum_{k=1}^{n} \big(P_{k-1}f_{k-1}^T(\phi_{k-1})D_{\theta_{k-1}} - P_k f_k^T(\phi_k)D_{\theta_k}\big)\big|\, 1_{A_m}\Big] = E\big[|P_0 f_0^T(\phi_0)D_{\theta_0} - P_n f_n^T(\phi_n)D_{\theta_n}|\, 1_{A_m}\big]$
and this term is $O(1)$ by Proposition D.1-3), (D.24) and Lemma D.2. Let us see the third term (D.22). By Proposition D.3-2) and (D.24), we have
$\sum_{k=1}^{n} E\big[\big|(P_k f_k^T(\phi_k) - P_{k-1}f_{k-1}^T(\phi_k)) D_{\theta_k}\big|\, 1_{A_m}\big] \le C_m \sum_{k=1}^{n} E\big[\big(|\theta_k - \theta_{k-1}|^{\lambda_\mu} + |\alpha_{k+1} - \alpha_k|\big)\, 1_{A_m}\big]\,.$
By Lemmas D.1 and D.2 and Assumptions 2.10 and 2.13, this term is $o(1/\gamma_n)$. Finally, the same conclusion holds for (D.23) by using Proposition D.1-3), Lemma D.2 and (D.25). This concludes the proof of $\lim_n \gamma_n E[|T_n^{(2)}| 1_{A_m}] = 0$.
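The $O(\sqrt{n})$ bound on the martingale term (D.20) can be seen already at the level of second moments: for centered independent increments, $E[S_n^2] = n\sigma^2$, hence $E|S_n| \le \sigma\sqrt{n}$ by Cauchy-Schwarz. A generic simulation (synthetic $\pm 1$ increments, not the thesis's noise sequence):

```python
import numpy as np

# Martingale partial sums grow like sqrt(n) in L1: simulate a +/-1 random walk.
rng = np.random.default_rng(0)
n, trials = 1000, 2000
steps = rng.choice([-1.0, 1.0], size=(trials, n))  # centered, unit-variance increments
S_n = steps.sum(axis=1)                            # martingale at time n, per trial
mean_abs = np.abs(S_n).mean()                      # Monte Carlo estimate of E|S_n|

assert mean_abs <= np.sqrt(n)          # E|S_n| <= sqrt(E[S_n^2]) = sqrt(n)
assert mean_abs >= 0.5 * np.sqrt(n)    # and the sqrt(n) rate is actually attained
```

Multiplying such a term by $\gamma_n$ with $\gamma_n \sqrt{n} \to 0$ (Assumption 2.13) is what makes the contribution of (D.20) vanish.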
c) Term $T_n^{(3)}$

By Lemma D.4, there exists $C_m$ such that for any $\theta \in K$, $|\Xi(1,\theta,x)| \le C_m(1 + |x|^2)$. By Lemma D.2, for any $a$ in a neighborhood of 1 we have $\sup_{\alpha\in[0,a],\theta\in K} \int |x|^2\, \pi_{\alpha,\theta}(dx) < \infty$. Since $\lim_n \alpha_n = 1$, we have
$\sup_{n\ge m} \Big|\int \Xi(1,\theta_n,x)\,d\pi_n(x)\Big|\, 1_{\theta_n\in K} < C$
for some constant $C$, which implies that $\lim_n \gamma_n E[|T_n^{(3)}| 1_{A_m}] = 0$.

Bibliography

[1] FIT IoT-LAB: very large scale open wireless sensor network testbed. https://www.iot-lab.info/.
[2] Morral, G. http://perso.telecom-paristech.fr/~morralad/.
[3] Dieng, N.A. http://nadieng.wordpress.com/.
[4] Achlioptas, D. and McSherry, F. (2001). Fast computation of low rank matrix approximations. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 611–618.
[5] Achlioptas, D. and McSherry, F. (2007). Fast computation of low rank matrix approximations. Journal of the ACM, 54(2).
[6] Agarwal, A., Chapelle, O., Dudík, M., and Langford, J. (2014). A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15:1111–1133.
[7] Aghasi, H., Hashemi, M., and Khalaj, B. (2012). A Source Localization Based on Signal Attenuation and Time Delay Estimation in Sensor Networks. International Journal of Computer and Electrical Engineering, 4(3):423–427.
[8] Alexander, S. (1982). Radio propagation within buildings at 900 MHz. Electronics Letters, 18(21):913–914.
[9] Andrieu, C., Moulines, E., and Priouret, P. (2005). Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312.
[10] Aysal, T., Yildiz, M., Sarwate, A., and Scaglione, A. (2009). Broadcast Gossip Algorithms for Consensus. IEEE Trans. on Signal Processing, 57(7):2748–2761.
[11] Bahl, P. and Padmanabhan, V. (2000). RADAR: An In-Building RF-Based User Location and Tracking System. In INFOCOM, pages 775–784.
[12] Bartlett, P., Jordan, M., and McAuliffe, J. (2006). Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138–156. [13] Bauso, D. and Nedic, A. (2013). Dynamic coalitional tu games: Distributed bargaining among players’ neighbors. Automatic Control, IEEE Transactions on, 58(6):1363 –1376. 204 Bibliography [14] Benaim, M. (1996). A dynamical system approach to stochastic approximations. SIAM Journal on Control and Optimization, 34:437. [15] Benaim, M., Hofbauer, J., and Sorin, S. (2005). Stochastic Approximations and Differential Inclusions. SIAM Journal on Control and Optimization, 44(1):328–348. [16] Bénézit, F. (2009). Distributed Average Consensus for Wireless Sensor Networks. PhD thesis, EPFL. [17] Benveniste, A., Metivier, M., and P., P. (1987). Adaptive Algorithms and Stochastic Approximations. Springer-Verlag. [18] Bertsekas, D. and Tsitsiklis, J. (1997). Parallel and Distributed Computation: Numerical Methods. Athena Scientific. [19] Bianchi, P., Fort, G., and Hachem, W. (2013). Performance of a Distributed Stochastic Approximation Algorithm. IEEE Transactions on Information Theory, 59(11):7405 – 7418. [20] Bianchi, P., Fort, G., Hachem, W., and Jakubowicz, J. (2011a). On the Convergence of a Distributed Parameter Estimator for Sensor Networks with Local Averaging of the Estimate. In ICASSP, Praha, Czech Republic. [21] Bianchi, P., Fort, G., Hachem, W., and Jakubowicz, J. (2011b). Performance of a Distributed Robbins-Monro Algorithm for Sensor Networks. In EUSIPCO, Barcelona, Spain. [22] Bianchi, P., Hachem, W., and Iutzeler, F. (2014). A stochastic coordinate descent primal-dual algorithm and applications to large-scale composite optimization. Arxiv preprint arXiv:1407.0898. [23] Bianchi, P. and Jakubowicz, J. (2013). On the convergence of a projected multi-agent stochastic gradient algorithm for non-convex optimization. Automatic Control, IEEE Transactions on, 58(2):391–405. [24] Biswas, P., Liang, T., Wang, T., and Ye, Y. (2006). 
Semidefinite programming based algorithms for sensor network localization. ACM Transactions on Sensor Networks, 2. [25] Biswas, P. and Ye, Y. (2006). A Distributed Method for Solving Semidefinite Programs Arising from Ad Hoc Wireless Sensor Network Localization. In Multiscale Optimization Methods and Applications, volume 82 of Nonconvex Optimization and Its Applications, pages 69–84. Springer US. [26] Blondel, V., Hendrickx, J., Olshevsky, A., and Tsitsiklis, J. (2005). Convergence in multiagent coordination, consensus, and flocking. In Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC ’05. 44th IEEE Conference on, pages 2996 – 3000. [27] Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: theory and applications. New York: Springer-Verlag. Bibliography 205 [28] Borkar, V. (2008). Stochastic approximation: a dynamical system viewpoint. Cambridge University Press. [29] Borkar, V. and Meyn, S. (2012). Oja’s algorithm for graph clustering, markov spectral decomposition, and risk sensitive control. Journal of Automatica, 48(10):2512–2519. [30] Borwein, J. and Lewis, A. (2006). Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer. [31] Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2004). Analysis and optimization of randomized gossip algorithms. In Decision and Control, 2004. CDC. 43rd IEEE Conference on (Volume 5), pages 5310 – 5315, Bahamas. [32] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122. [33] Brémaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. springer verlag. [34] Bénézit, F., Blondel, V., Thiran, P., Tsitsiklis, J., and Vetterli, M. (2010). Weighted gossip: Distributed averaging using non-doubly stochastic matrices. 
In Information Theory Proceedings (ISIT), IEEE International Symposium on, pages 1753–1757. IEEE.
[35] Calafiore, G., Carlone, L., and Wei, M. (2010). A distributed gradient method for localization of formations using relative range measurements. In IEEE International Symposium on Computer-Aided Control System Design (CACSD), Yokohama.
[36] Cappé, O. and Moulines, E. (2009). On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3):593–613.
[37] Castells, F., Laguna, P., Sörnmo, L., Bollmann, A., and Roig, J. (2007). Principal component analysis in ECG signal processing. EURASIP Journal on Advances in Signal Processing, (1):98–98.
[38] Cattivelli, F. and Sayed, A. (2010a). Diffusion LMS strategies for distributed estimation. IEEE Trans. Signal Processing, 58(3):1035–1048.
[39] Cattivelli, F. and Sayed, A. (2010b). Distributed nonlinear Kalman filtering with applications to wireless localization. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 3522–3525, Dallas, TX.
[40] Cevher, V., Becker, S., and Schmidt, M. (2014). Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5):32–43.
[41] Chen, H., Wang, G., Wang, Z., So, H., and Poor, H. (2012). Non-Line-of-Sight Node Localization Based on Semi-Definite Programming in Wireless Sensor Networks. IEEE Transactions on Wireless Communications, 11(1):108–116.
[42] Chen, J., Hudson, R., and Yao, K. (2002). Maximum-likelihood source localization and unknown sensor location estimation for wideband signals in the near-field. IEEE Trans. on Signal Processing, 50(8):1843–1854.
[43] Chen, J. and Sayed, A. (2012). Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks. IEEE Trans. Signal Processing, 60(8):4289–4305.
[44] Connelly, R. (2005). Generic global rigidity. Discrete Comput. Geom., 33(4):549–563.
[45] Costa, J., Patwari, N., and Hero, A. (2006). Distributed Weighted-Multidimensional Scaling for Node Localization in Sensor Networks. ACM Transactions on Sensor Networks, 2(1):39–64.
[46] Cox, D., Murray, R., and Norris, A. (1984). 800-MHz attenuation measured in and around suburban houses. AT&T Bell Laboratories Technical Journal, 63(6):921–954.
[47] de Moraes, L. and Nunes, B. (2006). Calibration-Free WLAN Location System Based on Dynamic Mapping of Signal Strength. In MobyWac.
[48] DeGroot, M. H. (1974). Reaching a Consensus. Journal of the American Statistical Association, 69(345):118–121.
[49] Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2011). Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 713–720.
[50] Delyon, B. (1996). General results on the convergence of stochastic algorithms. IEEE Transactions on Automatic Control, 41(9):1245–1255.
[51] Delyon, B. (2000). Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory. Unpublished Lecture Notes, http://perso.univrennes1.fr/bernard.delyon/as_cours.ps.
[52] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38.
[53] Dieng, N., Charbit, M., Chaudet, C., Toutain, L., and Meriem, T. (2012a). A Multi-Path Data Exclusion Model for RSSI-based Indoor Localization. In 15th International Symposium on Wireless Personal Multimedia Communications (WPMC), pages 336–340.
[54] Dieng, N., Charbit, M., Chaudet, C., Toutain, L., and Meriem, T. (2012b). Experiments on the RSSI as a Range Estimator for Indoor Localization. In (NTMS), pages 1–5.
[55] Dieng, N., Charbit, M., Chaudet, C., Toutain, L., and Meriem, T. (2013). Indoor Localization in Wireless Networks based on a Two-modes Gaussian Mixture Model.
In IEEE 78th Vehicular Technology Conference (VTC Fall), pages 1–5. Bibliography 207 [56] Dieng, N., Chaudet, C., Toutain, L., Meriem, T., and Charbit, M. (2014). No-calibration localisation for indoor wireless sensor networks. International Journal of Ad Hoc and Ubiquitous Computing, 15(1):200–214. [57] Doherty, L., Pister, K. S. J., and El Ghaoui, L. (2001). Convex position estimation in wireless sensor networks. In Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies, volume 3 of IEEE Infocom, pages 1655–1663. [58] Duchi, J., Agarwal, J., and Wainwright, M. (2010a). Distributed Dual Averaging in Networks. In Advances in Neural Information Systems. [59] Duchi, J., Agarwal, J., and Wainwright, M. (2010b). Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. Automatic control, IEEE Transactions on, 99(10):1–40. [60] Dudley, R. (2002). Real analysis and Probability. Cambridge University Press. [61] Duflo, M. (2010). Random Iterative Models. Springer. [62] Elnahrawy, E., Xiaoyan, L., and Martin, R. (2004). The limits of localization using signal strength: a comparative study. In In First Annual IEEE Conference on Sensor and Ad-hoc Communications and Networks,, pages 406–414. [63] Essoloh, M., Richard, C., and Snoussi, H. (2007). Localisation distribuée dans les réseaux de capteurs sans fil par résolution d’un problème quadratique. In GRETSI. [64] Forero, P., Cano, A., and Giannakis, G. (2008). Consensus-based distributed expectationmaximization algorithm for density estimation and classification using wireless sensor networks. In IEEE ICASSP 2008, pages 1989–1992. [65] Forero, P., Cano, A., and Giannakis, G. (2010a). Consensus-based distributed support vector machines. Journal of Machine Learning Research, 11:1663 – 1707. [66] Forero, P., Cano, A., and Giannakis, G. (2010b). Consensus-based distributed support vector machines. The Journal of Machine Learning Research, 11:1663–1707. [67] Fort, G. (2014). 
Central Limit Theorems for Stochastic Approximation with Controlled Markov Chain Dynamics. Accepted for publication in ESAIM PS. [68] Fort, G., Moulines, E., and Priouret, P. (2012). Convergence of Adaptive and Interacting Markov chain Monte Carlo algorithms. Ann. Statist., 39(6):3262–3289. [69] Frasca, P. and Hendrickx, J. (2013). On the mean square error of randomized averaging algorithms. Automatica, 49(8):2496 – 2501. [70] Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407. 208 Bibliography [71] Goldoni, E., Savioli, A., Risi, M., and Gamba, P. (2010). Experimental analysis of RSSIbased indoor localization with IEEE 802.15.4. In IEEE European Wireless Conference (EW), pages 71–77. [72] Goldsmith, A. (2005). Wireless Communications. Cambridge University Press, New York, USA. [73] Golub, G. H. and Van Loan, C. F. (1983). Matrix Computations. Johns Hopkins University Press. [74] Gu, D. (2008). Distributed em algorithm for gaussian mixtures in sensor networks. Neural Networks, IEEE Transactions on, 19(7):1154–1166. [75] Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and its Application. Academic Press, New York, London. [76] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning. Springer. [77] Hendrickson, B. (1992). Conditions for unique graph realizations. SIAM J. Comput., 21(1):65–84. [78] Hendrickx, J., Shi, G., and Johansson, K. (2014). Finite-time consensus using stochastic matrices with positive diagonals. To appear in IEEE Transactions on Automatic Control. [79] Hereman, W. (2011). Trilateration: The Mathematics Behind a Local Positioning System . Seminar. [80] Honeine, P., Richard, C., Bermudez, J., Snoussi, H., and et al. (2009). Functional estimation in Hilbert space for distributed learning in wireless sensor networks. 
In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 2861 – 2864, Taipei. [81] Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge University Press. [82] Huang, L., Nguyen, X., Garofalakis, M., Jordan, M., Joseph, A., and Taft, N. (2007). Innetwork pca and anomaly detection. Advances in Neural Information Processing Systems, 19:617. [83] Iutzeler, F., Ciblat, P., and Hachem, W. (2013). Analysis of Sum-Weight-like algorithms for averaging in Wireless Sensor Networks. IEEE Transactions on Signal Processing, 61(11):2802–2814. [84] Iutzeler, F., Ciblat, P., Hachem, W., and Jakubowicz, J. (2012). New broadcast based distributed averaging algorithm over wireless sensor networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3117–3120. [85] Javanmard, A. and Montanari, A. (2013). Localization from Incomplete Noisy Distance Measurements. Journal of Foundations of Computational Mathematics, 13(3):297–345. Bibliography 209 [86] Kaemarungsi, K. and Krishnamurthy, P. (2004). Modeling of Indoor Positioning Systems Based on Location Fingerprinting. In INFOCOM. [87] Kar, S. and Moura, J. (2010). Distributed consensus algorithms in sensor networks: Quantized data and random link failures. IEEE Trans. on Signal Processing, 58(3):1383–1400. [88] Karhunen, J. (1984). Adaptive algorithms for estimating eigenvectors of correlation type matrices. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’84. (Volume:9 ). [89] Kempe, D., Dobra, A., and Gehrke, J. (2003). Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pages 482–491. IEEE Computer Society. [90] Kempe, D. and McSherry, F. (2008). A decentralized algorithm for spectral analysis. Journal of Computer and System Sciences, 74(1):70 – 83. Learning Theory 2004. 
[91] Keshavan, R., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998. [92] Korada, S., Montanari, A., and Oh, S. (2011). Gossip pca. ACM SIGMETRICS Performance Evaluation Review, 39(1):169–180. [93] Kowalczyk, W. and Vlassis, N. (2005). Newscast em. In NIPS, pages 713–720. [94] Krasulina, T. (1969). The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. {USSR} Computational Mathematics and Mathematical Physics, 9(6):189 – 195. [95] Kruskal, J. and Myron Wish (1978). Multidimensional Scaling. Eric M. Uslaner. [96] Kushner, H. and Clark, D. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag. [97] Kushner, H. and Yin, G. (1987). Asymptotic properties of distributed and communicating stochastic approximation algorithms. SIAM J. Control Optim., 25:1266 – 1290. [98] Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer. [99] Langendoen, K. and Reijers, N. (2003). Distributed localization in wireless sensor networks: a quantitative comparison. Computer Networks, 43(4):499–518. [100] Lee, S. and Nedic, A. (2012). Drsvm: Distributed random projection algorithms for svms. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5286 – 5291, Maui, Hawai. [101] Leeuw, J. (1977). Applications of Convex Analysis to Multidimensional Scaling. Recent Developments in Statistics, pages 133–145. 210 Bibliography [102] Li, L., Scaglione, A., and Manton, J. (2011). Distributed Principal Subspace Estimation in Wireless Sensor Networks. IEEE Selected Topics in Signal Processing, 5(4):725–738. [103] Lopes, C. and Sayed, A. (2006). Distributed processing over adaptive networks. In Proc. Adaptive Sensor Array Processing Workshop, pages 1–5, MIT Lincoln Laboratory, MA. 
[104] Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. (2012). Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727. [105] Mao, G., Fidan, B., and Anderson, B. (2007). Wireless sensor network localization techniques. Computer Networks, 51(10):2529–2553. [106] Mateos, G., Bazerque, J., and Giannakis, G. (2010). Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276. [107] McDonald, R., Hall, K., and Mann, G. (2010). Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456–464. Association for Computational Linguistics. [108] Mertikopoulos, P., Belmega, E., Moustakas, A., and Lasaulce, S. (2012). Distributed learning policies for power allocation in multiple access channels. IEEE Journal on Selected Areas in Communications, 30(1):1–11. [109] Meyn, S. and Tweedie, R. (1993). Markov Chains and Stochastic Stability. Springer-Verlag. [110] Murphy, W. and Hereman, W. (1995). Determination of a position in three dimensions using trilateration and approximate distances. Technical Report MCS-95-07, Department of Mathematical and Computer Sciences, Colorado School of Mines, Colorado. [111] Navia-Vazquez, A., Gutierrez-Gonzalez, D., Parrado-Hernandez, E., and Navarro-Abellan, J. (2006). Distributed support vector machines. IEEE Transactions on Neural Networks, 17(4):1091–1097. [112] Nedic, A. (2011). Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351. [113] Nedic, A. and Bertsekas, D. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138. [114] Nedic, A. and Olshevsky, A. (2013).
Distributed optimization over time-varying directed graphs. In Proceedings of the 52nd IEEE Conference on Decision and Control (CDC), pages 6855–6860, Florence, Italy. [115] Nedic, A. and Olshevsky, A. (2014). Stochastic gradient-push for strongly convex functions on time-varying directed graphs. arXiv preprint arXiv:1406.2075. [116] Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61. [117] Nedic, A., Ozdaglar, A., and Parrilo, P. (2010). Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938. [118] Niculescu, D. and Nath, B. (2001). Ad hoc positioning system (APS). In GLOBECOM, pages 2926–2931. [119] Nowak, R. (2003). Distributed EM algorithms for density estimation and clustering in sensor networks. IEEE Transactions on Signal Processing, 51(8):2245–2253. [120] Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273. [121] Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935. [122] Oja, E. and Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69–84. [123] Olfati-Saber, R. (2007). Distributed tracking for mobile sensor networks with information-driven mobility. In American Control Conference, pages 4606–4612, New York, USA. [124] Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank citation ranking: bringing order to the web. Technical report. [125] Patra, B. (2011). Convergence of distributed asynchronous learning vector quantization algorithms. The Journal of Machine Learning Research, 12:3431–3466. [126] Patwari, N. et al. (2005). Locating the nodes: cooperative localization in wireless sensor networks.
IEEE Signal Processing Magazine, 22(4):54–69. [127] Patwari, N., Hero, A., et al. (2003). Relative location estimation in wireless sensor networks. IEEE Transactions on Signal Processing, 51(8):2137–2148. [128] Patwari, N., O'Dea, R., and Wang, Y. (2001). Relative location in wireless networks. In VTC. [129] Pelletier, M. (1998). Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, 8(1):10–44. [130] Priyantha, N., Balakrishnan, H., Demaine, E., and Teller, S. (2003). Anchor-free distributed localization in sensor networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys '03), pages 340–341. ACM. [131] Rabbat, M. and Nowak, R. (2005). Quantized incremental algorithms for distributed optimization. IEEE Journal on Selected Areas in Communications, 23(4):798–808. [132] Raffard, R., Tomlin, C., and Boyd, S. (2004). Distributed optimization for cooperative agents: application to formation flight. In Proceedings of the 43rd IEEE Conference on Decision and Control. [133] Ram, S., Nedic, A., and Veeravalli, V. (2009). Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717. [134] Ram, S., Nedic, A., and Veeravalli, V. (2010a). Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545. [135] Ram, S., Veeravalli, V., and Nedic, A. (2010b). Distributed and recursive parameter estimation in parametrized linear state-space models. IEEE Transactions on Automatic Control, 55(2):488–492. [136] Rappaport, T. (1996). Wireless Communications: Principles and Practice. Prentice Hall. [137] Rappaport, T. S., Reed, J. H., and Woerner, B. D. (1996). Position location using wireless communications on highways of the future. IEEE Communications Magazine, 34(10):33–41.
[138] Rätsch, G., Onoda, T., and Müller, K. (2001). Soft margins for AdaBoost. Machine Learning, 42(3):287–320. [139] Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407. [140] Savarese, C., Rabaey, J., and Langendoen, K. (2002). Robust positioning algorithms for distributed ad-hoc wireless sensor networks. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference, pages 317–327, Monterey. [141] Savvides, A., Park, H., and Srivastava, M. (2002). The bits and flops of the n-hop multilateration primitive for node localization problems. In Proceedings of the 1st ACM International Workshop on Wireless Sensor Networks and Applications (WSNA '02), pages 112–121. ACM. [142] Shang, Y. and Ruml, W. (2004). Improved MDS-based localization. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, pages 2640–2651, vol. 4, Hong Kong. [143] Shang, Y., Ruml, W., and Fromherz, M. (2003). Localization from mere connectivity. In Proceedings of the 4th ACM International Symposium on Mobile Ad Hoc Networking & Computing (MobiHoc '03), pages 201–212. ACM. [144] Stanković, S., Stanković, M., and Stipanović, D. (2011). Decentralized parameter estimation by consensus based stochastic approximation. IEEE Transactions on Automatic Control, 56(3):531–543. [145] Sugano, M., Kawazoe, T., Ohta, Y., and Murata, M. (2006). Indoor localization system using RSSI measurement of wireless sensor network based on ZigBee standard. In Wireless and Optical Communications, pages 1–6. IASTED/ACTA Press. [146] Titterington, D. (1984). Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B, pages 257–267. [147] Tomozei, D. and Massoulié, L. (2010). Distributed user profiling via spectral methods. SIGMETRICS Performance Evaluation Review, 38(1):383–384. [148] Tonneau, A., Mitton, N., and Vandaele, J. (2014).
A survey on (mobile) wireless sensor network experimentation testbeds. In DCOSS, IEEE International Conference on Distributed Computing in Sensor Systems, Marina del Rey, California, United States. [149] Towfic, Z., Chen, J., and Sayed, A. (2013). On distributed online classification in the midst of concept drifts. Neurocomputing, 112:138–152. [150] Towfic, Z. and Sayed, A. (2013). Adaptive stochastic convex optimization over networks. In Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on, pages 1272–1277, Monticello, USA. [151] Towfic, Z. and Sayed, A. (2014). Adaptive penalty-based distributed stochastic convex optimization. IEEE Transactions on Signal Processing, 62(15):3924–3938. [152] Trawny, N., Roumeliotis, S., and Giannakis, G. (2009). Cooperative multi-robot localization under communication constraints. In IEEE International Conference on Robotics and Automation (ICRA '09), pages 4394–4400, Kobe. [153] Tsianos, K., Lawlor, S., and Rabbat, M. (2012). Push-sum distributed dual averaging for convex optimization. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), pages 5453–5458. [154] Tsiptsis, K. and Chorianopoulos, A. (2009). Data Mining Techniques in CRM: Inside Customer Segmentation. John Wiley & Sons, Ltd. [155] Tsitsiklis, J. (1984). Problems in Decentralized Decision Making and Computation. PhD thesis, Massachusetts Institute of Technology. [156] Tsitsiklis, J., Bertsekas, D., and Athans, M. (1986). Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812. [157] Wei, E. and Ozdaglar, A. (2012). Distributed alternating direction method of multipliers. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), pages 5445–5450. [158] Williams, R. (1985). Feature discovery through error-correcting learning.
Technical report, Institute of Cognitive Science, University of California. [159] Xu, J., Liu, W., Lang, F., Zhang, Y., and Wang, C. (2010). Distance measurement model based on RSSI in WSN. Wireless Sensor Network, 2(8):606–611. [160] Yan, F., Sundaram, S., Vishwanathan, S., and Qi, Y. (2013). Distributed autonomous online learning: regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483–2493. [161] Ye, J., Chow, J., Chen, J., and Zheng, Z. (2009). Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2061–2064. [162] Yu, J., Kulkarni, S., and Poor, H. (2013). Dimension expansion and customized spring potentials for sensor localization. EURASIP Journal on Advances in Signal Processing. A study of distributed algorithms for stochastic approximation in wireless sensor networks Gemma MORRAL ADELL RÉSUMÉ : Dans le cadre du traitement statistique du signal, les réseaux multi-agents servent à un grand nombre d'applications. Dans le domaine radio, les réseaux de capteurs sans fil sont utilisés par exemple pour la surveillance d'environnements ou la détection et la poursuite de cibles. Sur internet, les machines sont les agents au service de toutes les applications liées au "cloud computing" ou, plus récemment, à la gestion du "Big Data". Les réseaux distribués se caractérisent par l'absence d'un agent central qui collecte toutes les données, se charge des calculs et gère le reste des agents. Nous considérons un réseau d'agents dont le but est d'estimer un paramètre d'intérêt. Un agent est un dispositif capable de recueillir des informations de façon locale et/ou partielle sur le paramètre inconnu, d'effectuer des calculs locaux à chaque instant et de fusionner ses informations avec celles d'autres agents dans son voisinage.
Nous cherchons à concevoir et à analyser des algorithmes distribués d'approximation stochastique, bien adaptés au cas où les données sont collectées en ligne, simultanément avec le processus d'estimation. La première partie de la thèse porte sur les méthodes distribuées d'adaptation-diffusion. À chaque itération de l'algorithme, les agents mettent à jour leurs estimations locales et fusionnent ces résultats avec les estimations de leurs voisins. Nous étudions la convergence de l'algorithme et l'influence du protocole de communication sur sa performance asymptotique. Nous appliquons nos résultats à l'inférence statistique dans les réseaux de capteurs sans fil. Dans la deuxième partie de la thèse, nous étudions l'analyse en composantes principales distribuée. Nous proposons un algorithme d'approximation stochastique distribué, fondé sur la méthode d'Oja, permettant d'estimer l'espace propre principal d'une matrice partiellement observée. Nous appliquons nos résultats à l'auto-localisation dans les réseaux de capteurs sans fil. Les résultats numériques obtenus sur une plateforme réelle de capteurs sans fil confirment le comportement attractif de notre approche. MOTS-CLEFS : Algorithmes d'optimisation distribués, approximation stochastique, réseaux de capteurs sans fil, localisation distribuée. ABSTRACT: In the framework of statistical signal processing, multi-agent networks serve a wide range of applications. In radio communications for instance, wireless sensor networks are used for environmental monitoring and sensing, and for target detection and tracking. On the internet, machines are the agents serving all applications related to cloud computing or, more recently, Big Data processing. Distributed networks are characterized by the absence of a central agent which collects and processes all the data and manages the remaining agents. We consider a network of agents whose aim is to estimate some parameter of interest.
An agent is a device able to collect local and/or partial information on the unknown parameter, to perform local computations at each time instant, and to merge information with other agents in its neighborhood. We seek to design and analyze distributed stochastic approximation algorithms, which are well suited to the case where the data are collected online, simultaneously with the estimation process. The first part of the thesis focuses on distributed adaptation-diffusion methods. At each iteration of the algorithm, the agents update their local estimates and merge the results with the estimates of their neighbors. We address the convergence of the algorithm and the influence of the communication protocol on its asymptotic performance. We apply our results to statistical inference in wireless sensor networks. In the second part of the thesis, we investigate distributed principal component analysis. We propose a distributed stochastic approximation algorithm based on Oja's method for estimating the principal eigenspace of a partially observed matrix. We apply our results to self-localization in wireless sensor networks. Numerical experiments performed on a wireless sensor network testbed support the attractive behavior of our approach. KEY-WORDS: Distributed optimization algorithms, stochastic approximation, wireless sensor networks, distributed localization
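The two algorithmic ingredients named in the abstract — an adaptation-diffusion step, where each agent combines its neighbors' estimates and then takes a local stochastic-gradient step, and Oja's rule for the stochastic approximation of a principal eigenvector — can be illustrated with a minimal sketch. The function names, the toy mean-estimation problem, the uniform mixing matrix, and the step-size choices below are illustrative assumptions, not the thesis's exact algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptation_diffusion_step(theta, grads, W, gamma):
    # Combine: each agent averages its neighbors' estimates (W is a
    # doubly stochastic mixing matrix), then adapts with a local
    # stochastic-gradient step of size gamma.
    return W @ theta - gamma * grads

def oja_step(w, x, gamma):
    # Oja's rule: one stochastic update of w toward the principal
    # eigenvector of E[x x^T], followed by renormalization.
    y = x @ w
    w = w + gamma * y * (x - y * w)
    return w / np.linalg.norm(w)

# Toy run 1: three agents estimating the network-wide mean of their
# local targets (quadratic local costs, noisy gradients).
targets = np.array([1.0, 2.0, 3.0])
W = np.full((3, 3), 1.0 / 3.0)  # uniform doubly stochastic weights
theta = np.zeros(3)
for n in range(1, 2001):
    grads = theta - (targets + 0.1 * rng.standard_normal(3))
    theta = adaptation_diffusion_step(theta, grads, W, gamma=1.0 / n)
print(theta)  # all agents should be close to the average target 2.0

# Toy run 2: Oja's method on samples whose variance is largest along e_1.
d = 4
u = np.zeros(d); u[0] = 1.0            # true principal eigenvector
w = rng.standard_normal(d); w /= np.linalg.norm(w)
for n in range(1, 5001):
    x = rng.standard_normal(d); x[0] *= 3.0
    w = oja_step(w, x, gamma=1.0 / (n + 10))
print(abs(w @ u))  # should be close to 1 (alignment with e_1)
```

With a decreasing step size, the first loop behaves like a Robbins-Monro scheme driving all agents to a consensus on the average of the local targets, while the second loop tracks the dominant eigendirection of the sample covariance, which is the mechanism the thesis's distributed PCA algorithm builds on.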