Analysis of distributed algorithms for stochastic approximation (Analyse d'algorithmes distribués pour l'approximation stochastique)
2015-ENST-0002
EDITE - ED 130

Doctorat ParisTech

THESIS
submitted to obtain the degree of Doctor awarded by TELECOM ParisTech
Speciality: "Signal et Images"

publicly presented and defended by
Gemma MORRAL ADELL
on January 8, 2015

Analysis of distributed algorithms for stochastic approximation in sensor networks
(Analyse d'algorithmes distribués pour l'approximation stochastique dans les réseaux de capteurs)

Thesis advisor: Pascal BIANCHI
Thesis co-advisor: Jérémie JAKUBOWICZ

Jury:
M. Walid BEN-AMEUR, Professor, SAMOVAR, Télécom SudParis - President
M. Alex OLSHEVSKY, Assistant Professor, University of Illinois - Reviewer
M. Cédric RICHARD, Professor, Laboratoire Lagrange, Université de Nice Sophia-Antipolis - Reviewer
M. Julien HENDRICKX, Assistant Professor, ICTEAM, Ecole Polytechnique de Louvain - Examiner
Mme. Gersende FORT, Research Director, LTCI, Télécom ParisTech - Examiner
M. Claude CHAUDET, Associate Professor, RMS, Télécom ParisTech - Invited member
M. Pascal BIANCHI, Associate Professor, LTCI, Télécom ParisTech - Thesis advisor
M. Jérémie JAKUBOWICZ, Associate Professor, SAMOVAR, Télécom SudParis - Thesis co-advisor

TELECOM ParisTech, a school of Institut Mines-Télécom - member of ParisTech
46 rue Barrault 75013 Paris - (+33) 1 45 81 77 77 - www.telecom-paristech.fr

This work has been supported by DGA (French Armament Procurement Agency) and by Institut Mines-Télécom ("Futur & Ruptures" program).

Acknowledgments

First and foremost, I would like to warmly thank my advisor Pascal Bianchi for trusting me, for believing in my skills, and especially for his academic and scientific support. I really admire him as a brilliant researcher, and I am indebted to him for his patience with me. I am also most grateful to my co-advisor Jérémie Jakubowicz: for his gift, a book on Probabilités that I have read two or three times, and for being available, especially during the first two years. I would like to thank Gersende Fort for her unconditional help, whatever my problem and whenever it arose.
I am really honored to be part of an important joint work with her. More briefly but no less importantly, it was a pleasure to follow the lectures on Probabilités of Jamal Najim at the very beginning of my thesis, to have been invited to present a joint work with Stéphan Clémençon at a Big Data conference, and to receive the pleasant greetings of Eric Moulines most mornings at the laboratory. I had the chance to learn from all of them. It was a pleasure to be in office DA-319, where I could share plenty of moments, chats, confidences and support with such kind people: Marjorie Jala and Amandine Schreck. For almost three years we were copines du bureau, and we spent many hours in the same place. I will keep them in my memory. Briefer acquaintances, but equally kind and nice colleagues, are: Emilie, Cristina, Andrés (with whom I could speak Spanish), Claire, Paul, Miro and Alain. And a really warm thank-you to Amy N. Dieng, with whom I did interesting joint work on the localization topic. Outside my lab's environment, there is one of my best friends and an excellent scientist, my running mate Tommaso. Grazie mille Tommy for your support. I would like to thank Alexander for being my German partner for a short while. I shared many break-time moments with: my swimming mate Jérôme (Massard & Vignerons are our meeting point); my Saturday break mates Stéphane and Eric (discussions chez "Le touche-balles"); my Franco-Catalan partner Gwenola; my Italian mates Mary, Rugge and Peppe (pazienza ragga); my scientist mentors Albert (gràcies for always being available via WhatsApp) and Jérôme (merci for being my cultural source). I would like to thank each of them for being available and for having such patience to wait for me. I also think of my lovely friend Cristina, since I had the good fortune that she found a job in Paris. I am grateful to be part of my association nc with Emmanuel and other such interesting people. I finish with my family.
I would like to thank my parents and my sister for their unconditional help, support and love. They have always been there, even though we are separated by distance. I am fortunate to have them. I have fulfilled one of my wishes: I have lived in Paris to finish my studies, for more than five years. I received the best presents at the end, my doctoral degree and Marco. I will carry them with me for the rest of my life.

Contents

Introduction et présentation des résultats (in French)
  1 Motivations
  2 Context and considered framework
    2.1 Network model
    2.2 Communication model
  3 Position of this thesis
    3.1 Preliminaries: consensus algorithms
    3.2 Distributed optimization
    3.3 Distributed principal component analysis
    3.4 Application of PCA: localization in wireless sensor networks
  4 Thesis outline
  5 Scientific production

1 Introduction
  1.1 Motivations
  1.2 Framework
  1.3 Position of the thesis
    1.3.1 Preliminary: consensus algorithms
    1.3.2 Distributed optimization
    1.3.3 Distributed principal component analysis
  1.4 Thesis outline
  1.5 Publications
I Consensus algorithms

2 Success and failure of adaptation-diffusion algorithms
  2.1 Introduction
    2.1.1 Context and goal
    2.1.2 Related works on distributed optimization
    2.1.3 Contributions
  2.2 Distributed optimization
    2.2.1 Framework
    2.2.2 Results
    2.2.3 Success and failure of convergence
    2.2.4 Enhanced algorithm with weighted step sizes
  2.3 Distributed Robbins-Monro algorithm: general setting
  2.4 Convergence analysis
    2.4.1 Disagreement vector
    2.4.2 Average vector
    2.4.3 Main convergence result
  2.5 Convergence rate
    2.5.1 Assumption
    2.5.2 Main result
    2.5.3 A special case: doubly-stochastic matrices
  2.6 Concluding remarks
  2.7 Numerical results
II Distributed principal component analysis

3 A distributed on-line Oja's algorithm
  3.1 Introduction
    3.1.1 Context and goal
    3.1.2 Related works
    3.1.3 Contributions
  3.2 Case G = GN
    3.2.1 Oja's algorithm
    3.2.2 Communication model: randomized sparsification
    3.2.3 Distributed on-line Oja's algorithm (p = 1, G = GN)
  3.3 General graph and unknown matrix M case
    3.3.1 Network considerations
    3.3.2 Distributed on-line algorithm
    3.3.3 Convergence analysis
  3.4 Extension of Oja's algorithm for p ≥ 1
    3.4.1 Oja's algorithm
    3.4.2 Distributed on-line Oja's algorithm
  3.5 Numerical results
    3.5.1 Principal eigenvector estimation (p = 1)
    3.5.2 Two principal eigenvectors estimation (p = 2)

4 Application to self-localization in WSN
  4.1 Contributions
  4.2 Received signal model and testbed description
    4.2.1 Ranging-based approaches
    4.2.2 Log-normal shadowing model (LNSM)
    4.2.3 Distance estimation
    4.2.4 FIT IoT-LAB: platform of wireless sensor nodes
  4.3 Overview of some localization techniques
    4.3.1 Centralized techniques
    4.3.2 Distributed approaches
  4.4 Distributed MDS-MAP approach
    4.4.1 The framework: centralized batch MDS
    4.4.2 Centralized on-line MDS
    4.4.3 Distributed on-line MDS
    4.4.4 Convergence analysis
    4.4.5 Numerical results
  4.5 Position refinement: distributed maximum likelihood estimator
    4.5.1 Principle: maximum likelihood estimation
    4.5.2 The algorithm: on-line gossip-based implementation
    4.5.3 Numerical results: initialization by do-MDS algorithm
  4.6 A cooperative RSSI-based algorithm for indoor localization in WSN
    4.6.1 Observation model: biased log-normal shadowing model (B-LNSM)
    4.6.2 Initialization: biased maximum likelihood estimator (B-MLE)
    4.6.3 Experimental results after the refinement phase

Conclusions and perspectives

III Appendices

A Application on distributed parameter estimation
  A.1 Introduction
  A.2 Parametric model: exponential families
  A.3 Centralized EM algorithms
    A.3.1 Batch EM
    A.3.2 On-line EM
  A.4 Proposed distributed on-line EM
    A.4.1 Algorithm
  A.5 Convergence w.p.1
  A.6 Numerical results
    A.6.1 Application to Gaussian mixtures
    A.6.2 Simulations

B Application on distributed machine learning
  B.1 Introduction
  B.2 Background
    B.2.1 Objective
    B.2.2 Distributed Learning
  B.3 The Online Learning Gossip Algorithm (OLGA)
  B.4 Performance Analysis
  B.5 Distributed Selection
  B.6 Numerical Results
    B.6.1 Simulation data
    B.6.2 Real data
  B.7 Conclusion

C Examples of gossip models for consensus algorithms
  C.1 Standard gossip averaging
    C.1.1 Communication model description
    C.1.2 Numerical results on distributed optimization
  C.2 Push-sum gossip averaging
    C.2.1 Communication model description
    C.2.2 Algorithm for distributed optimization
    C.2.3 Numerical results on distributed optimization

D Proofs related to Chapter 2
  D.1 Proof of Theorem 2.1
  D.2 Proof of Lemma 2.3
  D.3 Preliminary results on the sequence (φn)n
  D.4 Proof of Proposition 2.2
    D.4.1 Decomposition of ⟨θn+1⟩ − ⟨θn⟩
    D.4.2 Proof of Proposition 2.2
  D.5 Proof of Theorem 2.3
    D.5.1 A preliminary result
    D.5.2 Checking condition C2 of [67, Theorem 2.1]
    D.5.3 Expression of U⋆
    D.5.4 Checking condition C3 of [67, Theorem 2.1]
    D.5.5 Detailed computations for verifying the condition C2

Bibliography

Introduction et présentation des résultats (Introduction and presentation of the results)

This thesis was carried out at the LTCI (Laboratoire Traitement et Communication de l'Information) at Télécom ParisTech, under the supervision of Pascal Bianchi. It was also co-supervised by Jérémie Jakubowicz during half of the thesis, with the close and much appreciated support of Gersende Fort whenever joint works were being developed.
The objective of this thesis is to propose and analyze new distributed strategies, based on stochastic approximation, for consensus and principal component analysis problems in sensor networks. This kind of approach is well suited to wireless sensor networks, which are composed of generally low-cost devices with limited resources. We therefore give priority to the design of simple distributed algorithms that process the data without requiring storage or complex operations, and with only sporadic exchanges between neighboring sensors. The problems addressed in the two parts of this thesis can be cast as a decentralized statistical estimation problem. On the one hand, the estimation comes from the fact that each sensor must estimate an unknown parameter that depends on the environment. On the other hand, the statistical analysis comes from considering this environment as imperfect and subject to a noise modeled as random. Consequently, in this context, the environment is only partially known by the sensors. The goal of this first chapter is to present our results and our main contributions in the two fields that have been treated: consensus algorithms in a first part, and distributed principal component analysis in the following part. They are explained in greater depth in Chapters 2, 3 and 4 respectively, where the details of the proofs and bibliographical precisions are given. First of all, in this chapter we justify the motivation of these contributions by pointing out the applications, trends and current market needs that drive the use of networks of interconnected devices, e.g. wireless sensors. We then introduce in Section 2 the context and framework of this work.
We describe the three main properties shared by the algorithms we have designed for the different problems addressed. Section 3 presents our contributions in the two application contexts considered, i.e. distributed optimization in Section 3.2 and principal component analysis in Section 3.3. We also add two indispensable additional sections. First, Section 3.1 serves as a preliminary step towards understanding consensus algorithms in general, which are a key ingredient of the distributed optimization addressed thereafter. Second, Section 3.4 presents the experimental results from the concrete application of sensor localization using the distributed principal component analysis algorithm proposed in the previous section.

1 Motivations

Remote sensing and environmental monitoring have been an active research area in recent decades. The technological evolution of wireless sensor networks has helped to spread their use, first for military purposes and later for civil and industrial applications, e.g. surveillance and safety management. Moreover, the scientific motivations in statistical signal processing have been an important element of the progress made in the detection, estimation and classification of the data acquired by sensor networks. Although we pay particular attention to sensor networks, the work of this thesis also applies to multi-agent systems more generally, e.g. peer-to-peer networks or multi-robot systems.
Applications related to multi-agent systems include:

• network control and coordination, with the following examples: target or trajectory tracking [132], [45], [123], [152], resource allocation [108], [23];

• Big Data processing, with the following examples: learning of classifiers [149], [160], recommendation by profile estimation [147], PageRank [124];

• environmental monitoring and object recognition in sensor networks, with the following examples: parameter estimation [135], [144], signal detection [120], [90], [102].

In this thesis, we pay particular attention to the localization problem in wireless sensor networks, owing to the current growth of applications and services for which the knowledge of the positions of a group of interconnected sensors or devices is necessary, i.e. location-awareness services. The information on the positions of the nodes/sensors of the network can thus be used for several purposes, such as dynamic routing, which can be controlled and adapted according to the given positions. Furthermore, for environmental monitoring, there is a need for applications requiring geographical information on the observed data, e.g. identifying the temperature or humidity on a map. In a more sophisticated use, the developments in home automation and the Internet of Things (smart home and IoT) contribute to the design of smart houses in which household appliances, energy regulators and any electronic device are activated automatically depending on the position of the host.
While robustness and flexible deployment are two of the main advantages of a distributed network of wireless sensors, possible problems related to privacy, to data security, and to the limitation of energy resources (typically battery), storage and computational complexity must be overcome. We consider a multi-agent system as a group of entities endowed with both interaction and data-processing capabilities, e.g. a wireless sensor network. We use agent as the general word for a computer, a processor, a node/sensor or any other device that constitutes a connected network. Within a specific objective and context, a global task is performed by the network of communicating agents, e.g. temperature measurements in sensor networks or binary classification in machine learning. The agents are autonomous and have their own awareness of the environment, having access to local views of the global scenario without being under the control of a central unit designated in advance. In this thesis, we consider that each agent has a partial view of the global problem to be solved. Typically, in sensor networks each sensor is able to observe its surrounding environment, whereas in distributed machine learning each processor (or computer) is in charge of handling a part of the distributed data set. The objective of our work is the design of distributed algorithms for embedded multi-agent networks. By distributed, we mean that the agents share their local information without any hierarchical architecture or any master agent, i.e. in contrast to centralized processing. These architectures offer two main advantages.
On the one hand, robustness to individual node failures, since the data can be recovered from any other agent. On the other hand, scalability: the network can adapt to changes in its order of magnitude while maintaining performance, since there is no central agent that could create a bottleneck.

2 Context and considered framework

In this thesis, we focus on distributed algorithms from the particular viewpoint of stochastic approximation. We consider the case where the data/observations are processed on-line. A randomly drawn sample is used by an agent to update its local solution and is discarded after use; then a new sample is used, and so on. In this context, we jointly address two frameworks: multi-agent systems and distributed stochastic approximation. Consequently, we are interested in numerical methods that have the following three properties. They are i) iterative - an iterate is updated at each discrete time instant, i.e. iteration; ii) distributed - the agents communicate in order to merge their individual iterates; and iii) on-line - each update is made using the most recent observation or data sample.

2.1 Network model

The network of N agents is represented by an undirected graph G = (V, E), where V = {1, . . . , N} denotes the set of agents (also called nodes in the context of wireless sensor networks) and E denotes the set of undirected edges. Two agents i, j ∈ V are connected if there exists the link {i, j} ∈ E allowing communication between i and j. We will sometimes write i ∼ j to denote the edge {i, j} ∈ E. We will always assume that {i, i} is not an edge, i.e.
the graph G has no self-loops.

2.2 Communication model

Distributed algorithms can be either synchronous or asynchronous. In the synchronous case, all agents are expected to complete their local computations before their output values can be merged; the algorithm only proceeds to the next iteration once all agents have obtained their result. In the asynchronous case, we assume that each agent is likely to complete a local computation at some random time instant, passes its output to other neighboring agents, and proceeds to another local computation without having to wait for the rest of the agents. Thus, the meaning we give to "asynchronous distributed algorithm" is that there is no central agent scheduling the computation instants and that any node may wake up randomly at any time, independently of the other nodes. This mode of operation brings obvious advantages in terms of low complexity and flexibility of implementation. Asynchronous distributed optimization is a promising topic for addressing machine learning problems involving massive data sets (see [32] or the more recent survey [40]).

3 Position of this thesis

3.1 Preliminaries: consensus algorithms

Historically in the literature, a significant amount of work on distributed processing has been devoted to solving the so-called average consensus problems. Although this thesis addresses more general questions, we start with a description of this archetypal problem, since it allows us to present the main ideas underlying many of the algorithms proposed, in particular, for distributed optimization. Let T0,i ∈ R denote some scalar value observed by agent i ∈ V.
The objective of average consensus is, for every agent i, to estimate the average value over the whole set, i.e. the quantity denoted T̄0 = (1/N) ∑_{i=1}^N T0,i. The most widespread approach to solve this problem in a distributed way is the following. At time n, each agent i has an estimate Tn,i of the sought (unknown) average. Each node i receives the current iterates of the other nodes in its neighborhood, i.e. the nodes j such that j ∼ i, and updates its local iterate as a weighted average of its past iterate and the iterates received from its neighbors. Formally, at each iteration n,

∀i ∈ V,  Tn,i = ∑_{j∈V} w(i, j) Tn−1,j    (1)

where the w(i, j) are non-negative weights such that w(i, j) = 0 whenever i and j are not connected. This condition guarantees that the algorithm is indeed distributed over the network graph. For simplicity, assume also that the weights satisfy ∑_j w(i, j) = 1 for all i. We then define the N × N matrix W whose (i, j) coefficient coincides with w(i, j). The above algorithm is simply written Tn = W Tn−1, where Tn is the column vector whose i-th entry coincides with Tn,i. Note that W is a stochastic matrix in the sense that each of its rows sums to one. Distributed average consensus techniques have their origins in applied statistics [48] and computer science [155], [156] (see [16] for a detailed survey on this topic). The conditions under which the above distributed algorithm converges to the sought average T̄0 are well known. Suppose now that

∀i ∈ V,  ∑_{j: j∼i} w(i, j) = 1    (2)

∀j ∈ V,  ∑_{i: i∼j} w(i, j) = 1 ,    (3)

that is, not only do all the rows of the matrix W sum to one, but all its columns sum to one as well. Such a matrix W is called doubly stochastic (or bistochastic).
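The synchronous iteration Tn = W Tn−1 can be sketched in a few lines. Here is a minimal, hypothetical example (not from the thesis) on a 3-node path graph 1 ∼ 2 ∼ 3, using the standard Metropolis weights w(i, j) = 1/(1 + max(deg_i, deg_j)), an assumed choice that makes W doubly stochastic, symmetric and primitive:

```python
import numpy as np

# Doubly stochastic Metropolis weight matrix for the path 1 ~ 2 ~ 3
# (degrees 1, 2, 1): off-diagonal weights 1/3, remainder on the diagonal.
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])

T = np.array([3.0, 6.0, 12.0])   # initial observations T_{0,i}
target = T.mean()                # sought average = 7.0

for _ in range(200):
    T = W @ T                    # each node averages with its neighbours

# Every node's iterate has converged to the network average.
assert np.allclose(T, target)
```

Each row of W sums to one (the update is a weighted average) and each column sums to one (the network sum is preserved), which is exactly the double stochasticity required by (2)-(3).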
If W is moreover a primitive matrix¹, it can be shown, as an immediate application of the Perron-Frobenius theorem (see [81]), that

∀i ∈ V,  lim_{n→∞} Tn,i = T̄0 .    (4)

Thus, the algorithm eventually allows each node to reach a consensus, that is, an agreement on the final value. Moreover, the consensus value coincides with the sought average T̄0. The above algorithm is sometimes called a gossip algorithm by some authors. This gossip algorithm is synchronous, in the sense that all agents must communicate their values at every step of the iterative algorithm, and the matrix W is fixed, the same for all iterations. The authors of [31] propose an asynchronous communication protocol. At time instant n, a given randomly chosen node (say node i) wakes up and selects, uniformly at random, a node in its neighborhood (node j). Nodes i and j then merge their values as:

Tn,i = Tn,j = 0.5 Tn−1,i + 0.5 Tn−1,j ,    (5)

while the other nodes k ∉ {i, j} keep their iterates, Tn,k = Tn−1,k. The algorithm can similarly be written in matrix form as Tn = Wn Tn−1, where (Wn)n is a random sequence of matrices, namely Wn = IN − 0.5 (ei − ej)(ei − ej)ᵀ, where IN is the N × N identity matrix, ei is the i-th vector of the canonical basis of Rᴺ and ᵀ denotes transposition. The convergence (4) is still preserved (in the almost sure sense) under some technical conditions that are specified in detail in [31]. The key feature of the matrices Wn is that they are still doubly stochastic for every n ≠ 0.

¹ There exists m > 0 such that all the coefficients of Wᵐ are strictly positive. The assumption holds for instance if w(i, j) > 0 for all i ∼ j and there exists i such that w(i, i) > 0.
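The pairwise gossip update (5) admits an equally short sketch. The 4-node ring topology and the number of iterations below are assumptions made for illustration, not taken from [31]:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pairwise gossip: a random node i wakes up, picks a uniform neighbour j,
# and both replace their values by the pair average, i.e. T_n = W_n T_{n-1}
# with W_n = I_N - 0.5 (e_i - e_j)(e_i - e_j)^T. Sketch on a 4-node ring.
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
T = np.array([1.0, 2.0, 3.0, 10.0])
mean0 = T.mean()

for _ in range(2000):
    i = rng.integers(4)                       # random node wakes up
    j = rng.choice(neighbours[i])             # uniform random neighbour
    T[i] = T[j] = 0.5 * T[i] + 0.5 * T[j]     # pairwise average (5)

# Each W_n is doubly stochastic, so the network average is preserved
# at every step and all nodes converge to it.
assert np.isclose(T.mean(), mean0)
assert np.allclose(T, mean0, atol=1e-6)
```

The invariant to notice is that every single update conserves the sum of the iterates, which is precisely why the double stochasticity of Wn matters.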
The gossip protocol described above (known as pairwise gossip) can be considered asynchronous, since nodes are allowed to be idle at certain instants. However, it still requires some level of coordination between nodes: two nodes must update their values simultaneously at the same time instant. Reducing the need for such bidirectional links (transmission and feedback at once), in order to obtain truly asynchronous protocols, was an important issue in the years following [31]. The authors of [10] propose a fully asynchronous communication model commonly known as broadcast gossip. As in the previous protocol, at each iteration n an agent is activated uniformly at random. The asynchronism is now at the "agent level" instead of the "edge level": the active agent i broadcasts its estimate to all its neighbors without waiting for a return transmission. Unfortunately, it is shown in [10] that convergence to the sought average (4) no longer holds. All one can expect from such an easy-to-implement protocol is convergence in expectation, but not almost sure convergence. We refer to [69] for more detailed considerations on this question. To conclude this preliminary section, the average consensus problem can be solved using a linear gossip algorithm in an asynchronous version, but the double stochasticity of the weight matrices Wn is required at each instant n. However, double stochasticity has some practical drawbacks regarding implementation, since it generally requires feedback links in the network. Alternative methods do not require double stochasticity, at the expense of more complex communication models, e.g. [89], [84], [78].
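A hypothetical broadcast gossip sketch makes the failure mode concrete. The ring topology, the mixing parameter gamma = 0.5 and the update rule below are illustrative assumptions (a simplified reading of the broadcast model of [10], not its exact formulation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Broadcast gossip sketch: the woken node i broadcasts its value and each
# neighbour j moves towards it, T_j <- gamma*T_j + (1-gamma)*T_i, with no
# feedback link. The mixing matrices are only row-stochastic, so the
# network sum is NOT preserved: nodes still agree in the limit, but the
# consensus value is random and equals the true average only in expectation.
gamma = 0.5
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
T = np.array([1.0, 2.0, 3.0, 10.0])
mean0 = T.mean()

for _ in range(500):
    i = rng.integers(4)              # active agent, no return transmission
    for j in neighbours[i]:
        T[j] = gamma * T[j] + (1 - gamma) * T[i]

spread = T.max() - T.min()           # nodes agree...
# ...but typically not on the exact initial average mean0.
```

Since every update is a convex combination, all iterates stay in the interval spanned by the initial values; what is lost is the conservation of their sum.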
For instance, [78] obtains the convergence result (4) using only row-stochastic matrices, i.e. Wn 1 = 1, but assuming that the set of communicating nodes grows at each iteration n. Also worth noting is the contribution of [89], which introduces the push-sum protocol (analyzed in greater generality by [34]). The gossip model of [89] circumvents the convergence issue without feedback links by introducing some extra communication into the model: two variables instead of one are involved in the update step. [84] proposes an asynchronous version of the latter model [89]. The convergence analysis and the convergence rate are provided in [83] by the same authors. We refer to [16] for a more complete and general account of distributed average consensus algorithms.

3.2 Distributed optimization

1) Context

Distributed optimization arises in most of the applications mentioned above in wireless sensor networks and machine learning. The goal of the network is to optimize a global function defined as a sum of private local functions. The global minimization problem reads:

min_{θ∈ℝ} F(θ),  F(θ) = Σ_{i=1}^N f_i(θ)   (6)

where f_i is the private cost function of agent i. We assume throughout this thesis that the functions f_i are differentiable, but not necessarily convex. This thesis also focuses on first-order methods, i.e. algorithms relying only on gradient computations. An illustrative example in the sensor network setting is the following.

Example 1. In sensor networks, one must often estimate a parameter θ (e.g.
temperature, humidity, position of a source) based on a set of random observations X1, …, XN collected by independent sensors whose marginal probability density functions p_{θ,1}, …, p_{θ,N} are indexed by θ. Provided the random variables X_i are independent, the maximum likelihood estimate of θ can be written as a minimizer of (6) with f_i(θ) = −log p_{θ,i}(X_i). In a centralized setting, the random observations are collected by a central unit/agent. All the functions are assumed to be available to this single agent, so a standard gradient descent on the global function F can be applied directly to obtain a minimizer. This thesis, however, is concerned with the design of distributed approaches: the functions f_i are only known locally by the agents and the function F is no longer available. In the literature, there are essentially two distributed first-order approaches to this problem. The first is the incremental approach proposed by [113], [131], [133]. A message containing the current estimate is passed iteratively from node to node through the network: each agent adds its own update, based on its local observation, and forwards the result to the next agent. Although conceptually simple, this approach has some drawbacks. An incremental algorithm generally requires the message to travel along a Hamiltonian cycle, and finding such a path in a graph is known to be an NP-complete problem. Alternatives to the Hamiltonian cycle requirement have been proposed: for instance, [113] assumes that an agent communicates with a single other agent chosen uniformly at random in the network (not necessarily among its neighbors).
However, the approach of [113] still requires a substantial amount of routing. This thesis pursues a second, cooperative approach to distributed computation of the adapt-then-combine form (a terminology coined by [103] in [38]), also known as adaptation-diffusion algorithms. It is an approach based on the consensus techniques of Section 3.1, whose idea stems from the work of [155]. In this setting, agents update their estimates through a local gradient descent step; then some agents communicate and merge their local estimates according to the exchanged information. As introduced in the previous section, these methods are known in the literature as gossip methods. Unlike incremental approaches, each node i holds its own estimate θ_{n,i} at every time instant n, i.e. each agent i generates a sequence of estimates (θ_{n,i})_{n≥1}, which we assume here to be scalar (the vector case raises no difficulty beyond notation; see Chapter 2). At iteration n, the algorithm under study consists of two steps:

[Local step] Agent i generates a temporary estimate θ̃_{n,i} given by:

θ̃_{n,i} = θ_{n−1,i} − γ_n ∇f_i(θ_{n−1,i}),   (7)

where γ_n is a positive deterministic step size, (∇f_i(θ_{n−1,i}))_{n≥1} are the observations available to agent i, and ∇ denotes the gradient.

[Gossip] Agent i observes the values θ̃_{n,j} of some other agents j and forms a weighted average:

θ_{n,i} = Σ_{j=1}^N w_n(i,j) θ̃_{n,j},   (8)

where W_n = (w_n(i,j))_{i,j∈V} is a gossip matrix similar to those described in the previous section (see Section C.1 in Appendix C for more detailed examples). Moreover, it is more common to observe the gradient only up to some random perturbation, which may depend on the history of the algorithm.
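To make the two steps (7)-(8) concrete, here is a minimal Python simulation in a toy setting of our own choosing (not the thesis code): quadratic local costs f_i(θ) = (θ − a_i)²/2, so the global minimizer of Σ_i f_i is the average of the a_i, combined with a doubly stochastic pairwise gossip matrix:

```python
import random

def distributed_gradient_gossip(a, edges, n_iter, seed=0):
    """Algorithm (7)-(8) on local costs f_i(t) = (t - a[i])^2 / 2.
    Local step: gradient descent with decreasing step gamma_n.
    Gossip step: one random edge averages its two temporary estimates,
    i.e. a doubly stochastic W_n."""
    rng = random.Random(seed)
    theta = [0.0] * len(a)
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n ** 0.75
        # [Local step] temporary estimates, eq. (7)
        tmp = [t - gamma * (t - ai) for t, ai in zip(theta, a)]
        # [Gossip] eq. (8) with a pairwise averaging matrix W_n
        i, j = rng.choice(edges)
        tmp[i] = tmp[j] = 0.5 * (tmp[i] + tmp[j])
        theta = tmp
    return theta

# All local estimates approach the global minimizer mean(a) = 2.5.
theta = distributed_gradient_gossip([1.0, 2.0, 3.0, 4.0],
                                    [(0, 1), (1, 2), (2, 3), (3, 0)], 4000)
```

Because the pairwise matrix here is doubly stochastic at every n, the estimates agree on the sought minimizer; the thesis studies precisely what happens when this assumption is relaxed.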
In that case, equation (7) must be replaced by

θ̃_{n,i} = θ_{n−1,i} − γ_n ∇f_i(θ_{n−1,i}) + γ_n ξ_{n,i}   (9)

where ξ_{n,i} is a random perturbation accounting for the fact that the gradient is not perfectly observed at node i.

Example 2. To illustrate this point, consider again the sensor network example given above, where the agents seek to estimate an unknown parameter θ in the maximum likelihood sense from random observations. Suppose each sensor i collects a sequence of random observations (X_{n,i})_{n=1,2,…} instead of a single observation X_i, and that this sequence consists of independent copies of X_i. Then a distributed online estimation of the parameter θ using the above algorithm would read

θ̃_{n,i} = θ_{n−1,i} + γ_n ∇ log p_{θ_{n−1,i},i}(X_{n,i}).

Under some regularity conditions, it can be proved that the above update coincides with (9) upon setting f_i(θ) = −E[log p_{θ,i}(X_i)], where E denotes expectation, and the perturbation ξ_{n,i} is then a martingale increment.

It is hoped that, under suitable assumptions,

∀i ∈ V,  lim_{n→∞} θ_{n,i} = θ★   (10)

where θ★ is some minimizer of F (assumed to exist). We refer to Chapter 2 for a more detailed state of the art on these techniques. Let us mention, however, that convergence is generally proved under rather strong assumptions on the matrices (W_n)_n describing the consensus communication protocol. In general, consensus on the sought optimal value θ★ is obtained under the double stochasticity assumption ([116], [134]), i.e. the (W_n)_n are row- and column-stochastic, meaning W_n 1 = 1 and 1^T W_n = 1^T. Later, in [19], [112], the column-stochasticity condition was relaxed and assumed to hold only in expectation, i.e. 1^T E[W_n] = 1^T.
This allows, for instance, the use of the broadcast gossip model introduced by [10]. Similarly, the authors of [43] propose a diffusion model that only requires row stochasticity, at the price of being synchronous.

2) Results

The objective of this thesis is to obtain convergence results such as (10) for the sequence generated by Algorithm (7)-(8) under weaker conditions on (W_n)_n. We revisit the results of [19] when the (W_n)_n are assumed to be row-stochastic only. Most works additionally assume them column-stochastic, which turns out to be restrictive in terms of implementation and rules out otherwise natural exchange protocols. We address a broader communication framework in which the matrices (W_n)_n may depend on the observations or on the past estimates. By relaxing the double stochasticity assumption, we quantify the performance degradation this relaxation entails. Furthermore, we consider a more general stochastic approximation setting by letting Algorithm (7)-(8) take the form:

θ_n = W_n ( θ_{n−1} + γ_n Y_n ).   (11)

The recursion (11) extends the distributed optimization problem (6) to a more general framework. Indeed, algorithm (11) can be seen as a distributed version of the Robbins-Monro algorithm [139]. Stochastic approximation algorithms were originally designed by [139] to find the zeros (roots) of some function h, called the mean field, in situations where only noisy measurements of that function are available. In this respect, Y_n can be related to an unbiased estimate of the function h(θ) whose roots are sought, i.e. θ ∈ {h(θ) = 0}.
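For reference, the original (centralized) Robbins-Monro recursion can be sketched as follows; this is a textbook illustration with a linear mean field of our own choosing, not an algorithm from the thesis:

```python
import random

def robbins_monro(noisy_h, n_iter, theta0=0.0, seed=0):
    """Robbins-Monro: theta_n = theta_{n-1} + gamma_n * Y_n, where Y_n is a
    noisy, unbiased measurement of the mean field h at theta_{n-1}.
    With steps gamma_n = 1/n the iterates converge a.s. to a root of h."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        theta += (1.0 / n) * noisy_h(theta, rng)
    return theta

# Mean field h(theta) = 2 - theta (root at 2), observed with additive noise.
root = robbins_monro(lambda t, rng: (2.0 - t) + rng.uniform(-1.0, 1.0), 20000)
```

The recursion (11) is this same scheme with the gossip matrix W_n applied after each noisy step.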
Under decreasing step size assumptions on (γ_n)_n, we make the following contributions, which answer the above questions both qualitatively and quantitatively.

• Assuming that the sequence of stochastic matrices (W_n)_{n≥1} is i.i.d., we show under some technical assumptions that Algorithm (7)-(8) converges to a consensus, which we characterize. We show that this agreement value does not necessarily coincide with a critical point of Σ_i f_i. We also provide a variant of the algorithm that recovers the sought points.

• We provide sufficient conditions, either on the communication protocol represented by (W_n)_{n≥1} or on the functions f_i, ensuring that the limit points are the critical points of Σ_i f_i. When no such condition is satisfied, we also propose a simple modification of the algorithm that recovers the desired behavior.

• We extend our results to a broader setting, assuming that the matrices (W_n)_{n≥1} are no longer i.i.d. but may depend both on the current observations and on the past estimates. We also study a general stochastic approximation framework that goes beyond the model (11) and beyond the sole problem of distributed optimization.

• We characterize the convergence rate of the algorithm in the form of a central limit theorem. Unlike [19], we address the case where the sequence (W_n)_{n≥1} is not necessarily doubly stochastic. We show that non-doubly stochastic matrices affect the asymptotic error covariance (even when they are doubly stochastic in expectation, e.g. [10]). On the other hand, we show that when the matrix W_n is doubly stochastic for every n, e.g.
[31], the asymptotic covariance is identical to the one that would be obtained in a centralized (optimal) setting.

Finally, one objective of the thesis is to study the use of the proposed algorithm for statistical inference tasks in sensor networks. We propose a distributed Expectation-Maximization (EM) algorithm inspired by the adaptation-diffusion approach (see Appendix A). We also apply our method to the self-localization problem in sensor networks, including a refinement step for the estimated positions (see Chapter 4).

3.3 Distributed principal component analysis

1) Context

Another problem that can be addressed by means of stochastic approximation is principal component analysis (PCA). The objective in this class of problems is quite different from the one considered previously. Indeed, the goal for the network is no longer to reach a consensus on a common parameter of interest. Here, the goal for each node i is to drive its iterate to the value of the i-th entries of the principal eigenvectors and eigenvalues of a matrix M that depends on the environment and on the configuration of the graph underlying the network. We let M ∈ ℝ^{N×N} be a symmetric positive semidefinite matrix whose entries describe some similarity measure between each pair of agents, e.g. similarities (multidimensional scaling, MDS [95], [27]), distances (localization in sensor networks [57], [143]), customer ratings of consumed products (user profiling [154], [91]), adjacency coefficients of a graph (spectral clustering [29]), or covariances (signal detection [88], [37]).
We assume that a given agent i has only partial information about the matrix M (typically, agent i is only able to observe the i-th row of M). The principal component analysis of M then amounts to finding its spectral decomposition

M = U Λ U^T,  U U^T = I_N   (12)

where U is an orthonormal matrix whose columns are the eigenvectors of M and Λ is a diagonal matrix containing the corresponding eigenvalues (λ_1, …, λ_N), sorted in decreasing order λ_1 ≥ ⋯ ≥ λ_N. We denote the Euclidean norm by ‖·‖. For a given integer p < N, the objective is to evaluate the p largest eigenvalues λ_1, …, λ_p and the corresponding eigenvectors, which we denote u_1, …, u_p. When M is perfectly known and the data is processed in a centralized fashion, several classical methods are known to solve (12) efficiently, such as the power method [73, p. 406] when p = 1, and the QR-factorization [81, p. 114] (called orthogonal iteration in [73, p. 454]) or Gram-Schmidt orthonormalization [73, p. 254] when p > 1. The centralized power method is based on the following recursion (p = 1):

Ũ_n = M U_{n−1}   (13)
U_n = Ũ_n / ‖Ũ_n‖   (14)

where (U_n)_n is the sequence of estimated vectors, which converges to the first eigenvector u_1, and ‖x‖ denotes the Euclidean norm, i.e. ‖x‖² = Σ_i |x(i)|² for x ∈ ℝ^N. From the standpoint of a distributed implementation, the two terms M U_{n−1} and ‖M U_{n−1}‖ (the normalization term) raise complexity issues, i.e. in the number of communications and in the number of operations (sums and multiplications).
For a given agent i, the first matrix product reads as the sum Σ_{j=1}^N M(i,j) U_{n−1}(j), which contains N terms, each involving a separate communication with agent j. Second, for every agent i, (14) reads

U_n(i) = Ũ_n(i) / √( Σ_{j=1}^N Ũ_n(j)² ),

which implies that agent i must query all the other agents j ≠ i for their values Ũ_n(j) in order to perform the normalization update. When N is large, this can induce a prohibitive cost for the network in terms of number of communications. Consequently, several works have sought a decentralized implementation of (13)-(14). A pair of works derive a distributed version of (13)-(14) (see [90], [92]) by introducing a consensus step to compute the normalization term ‖Ũ_n‖, which each agent i then uses to update its coordinate U_n(i) locally. While [90] assumes M to be perfectly known, [92] includes a synchronous sparse model to estimate the vector M U_{n−1}. Unlike [90], where each agent i is able to compute Σ_{j=1}^N M(i,j) U_{n−1}(j), the authors of [92] describe a sparse-matrix model for M in which each agent i transmits M(i,j) U_{n−1}(j) to a small set of randomly chosen neighbors. In this thesis, we aim to design a principal component analysis algorithm that is: distributed, since the nodes cooperate asynchronously to estimate the different entries of the principal eigenvectors separately; and online, since the matrix M is not observed perfectly, but a sequence (M_n)_n of perturbed/noisy versions of M is generated. The sequence (M_n)_n reads M_n = M + ξ_n, where the random noise matrix ξ_n is typically a martingale increment.
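The centralized recursion (13)-(14) that the works above try to decentralize can be written down directly; the sketch below uses plain Python and an example matrix of our own:

```python
def power_method(M, n_iter):
    """Power method (13)-(14): U_n = M U_{n-1} / ||M U_{n-1}||, which
    converges to the leading eigenvector u_1 of a symmetric PSD matrix M."""
    N = len(M)
    u = [1.0] + [0.0] * (N - 1)            # initial vector U_0
    for _ in range(n_iter):
        v = [sum(M[i][j] * u[j] for j in range(N)) for i in range(N)]  # (13)
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]                                      # (14)
    return u

# For M = [[2, 1], [1, 2]] the top eigenpair is lambda_1 = 3, u_1 ∝ (1, 1).
u = power_method([[2.0, 1.0], [1.0, 2.0]], 50)
```

The two problematic terms for a distributed implementation are visible as the matrix-vector product in (13) and the global sum inside the norm in (14).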
In the centralized case, when a sequence of matrices (M_n)_n is observed globally by a central processing unit, an algorithm following a stochastic approximation approach can be used to estimate the principal eigenvectors of M. Oja's algorithm serves this purpose (see [120] for p = 1 and [122] for p > 1). We also refer to [57], [24], [85], where the authors propose alternative approaches to solving (12) based on constrained semidefinite programming. Specifically, in this work we introduce a distributed version of Oja's algorithm. Let U_n = (u_{n,1}, …, u_{n,p})^T denote the p principal components estimated at iteration n. In Oja's algorithm [122], the estimated sequence U_n is generated by the recursion:

U_n = U_{n−1} + γ_n ( M_n U_{n−1} − U_{n−1} (U_{n−1}^T M_n U_{n−1}) ).   (15)

Note that (15) reduces to a Robbins-Monro algorithm [139], as discussed in the previous section, since (15) can be seen as a particular instance of a stochastic approximation step. However, the norm ‖U_n‖ of the generated vector may exceed one, which would make the algorithm unstable. Several variants have been proposed to remedy this instability, either by introducing a normalization term or a projection step (see [29]). Since we aim at distributed implementations of (15) in which each sensor must estimate its own coordinates U_n(i), the recursion can be written at each sensor as:

U_n(i) = U_{n−1}(i) + γ_n ( Σ_{j=1}^N M_n(i,j) U_{n−1}(j) − U_{n−1}(i) (U_{n−1}^T M_n U_{n−1}) ),   (16)

where U_n(i) are the coordinates estimated at sensor i, i.e. the p components corresponding to the i-th row, U_n(i) = (u_{n,1}(i), …, u_{n,p}(i)).
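A minimal centralized illustration of the Oja recursion (15) for p = 1, with a toy matrix and noise model of our own choosing (not the distributed scheme of Chapter 3):

```python
import random

def oja(M, n_iter, noise=0.1, seed=0):
    """Oja's recursion (15) with p = 1 on noisy observations M_n = M + xi_n,
    where xi_n is a symmetric zero-mean noise matrix. U_n tends to align
    with the leading eigenvector of M while staying near unit norm."""
    rng = random.Random(seed)
    N = len(M)
    u = [1.0] + [0.0] * (N - 1)
    for n in range(1, n_iter + 1):
        gamma = 0.5 / n ** 0.7
        # draw a symmetric zero-mean perturbation xi_n
        e = [[rng.uniform(-noise, noise) for _ in range(N)] for _ in range(N)]
        Mn = [[M[i][j] + 0.5 * (e[i][j] + e[j][i]) for j in range(N)]
              for i in range(N)]
        v = [sum(Mn[i][j] * u[j] for j in range(N)) for i in range(N)]  # M_n U
        r = sum(u[i] * v[i] for i in range(N))                          # U^T M_n U
        u = [u[i] + gamma * (v[i] - u[i] * r) for i in range(N)]
    return u

# u aligns (up to sign) with the top eigenvector (1, 1)/sqrt(2) of [[2,1],[1,2]].
u = oja([[2.0, 1.0], [1.0, 2.0]], 5000)
```

The term U^T M_n U computed in `r` is precisely the quantity that, in the distributed setting, no single sensor can evaluate on its own.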
The goal of our work is to design an algorithm in which the two terms involved, Σ_{j=1}^N M_n(i,j) U_{n−1}(j) and U_{n−1}^T M_n U_{n−1}, are estimated in a fully distributed, asynchronous, and online fashion. Different distributed versions of algorithm (15) have been proposed in the literature, often in specific contexts, for example user/customer profiling in [147] as a machine learning application, or signal estimation/detection in [102] as a wireless sensor network application. Beyond the chosen approach, these two works share a common feature in their algorithms: the distributed version of (15) includes several average consensus steps of the form [31] (the average consensus algorithms described in Section 3.1) to estimate certain terms of (15) in a decentralized fashion. Indeed, these approaches require two time scales, i.e. the iteration index n to update U_n and another time index for the number of consensus cycles of the form (5). In particular, the authors of [147] address a machine learning problem in which the observations M correspond to a large matrix containing (binary) ratings given by users on certain consumed products. Under the assumption that M is a low-rank matrix, the objective is to estimate the profile vector of each user. A distributed Oja algorithm is proposed to perform the spectral decomposition of a partially known dataset M_n, i.e. the matrix M_n is assumed to be sparse. A normalization term is included in (15) to avoid stability issues. The term M_n U_{n−1} is computed through a fixed sparse model, i.e.
each agent i observes a small set of values M_n(i,j) U_{n−1}(j) from its neighbors j at each iteration n of the Oja recursion. In a preliminary step, the introduced normalization term and the term U_{n−1}^T M_n U_{n−1} are both computed through several consensus steps before U_n(i) is updated at each agent i. In the article [102], on the other hand, the objective is to find the spectral decomposition of the covariance matrix M of a signal from noisy measurements received within a wireless sensor network, i.e. the standard received-signal model "high-energy signal + zero-mean random noise" is assumed. Finding the p principal eigenvectors of M then amounts to capturing the components carrying the most energy in the received data, in order to detect and estimate the received signal of interest. The authors of [102] assume that the sensors only have access to an estimate of the covariance matrix of the form M_n = n^{−1} Σ_{t=0}^{n−1} r_t r_t^H, where (r_t ∈ ℂ^N)_{t≥0} are the measurements collected by the N sensors of the network. Under these model assumptions, three terms are identified when expanding (15) for its distributed implementation, i.e. r_n^H U_{n−1}, U_{n−1}^H U_{n−1} and U_{n−1}^H r_n r_n^H U_{n−1}. The finally proposed algorithm introduces three consensus steps for these three terms, involving several communication cycles of the form (5). At the end of this phase, each sensor is able to update its own coordinates U_n(i). In conclusion, note that each of these distributed approaches [147], [102], [92] includes one or more consensus phases per iteration/update of the estimated components, which means that the cost in terms of number of computations and communications grows considerably with the number of cycles required by the consensus step.
2) Results

Unlike [147], [102], we propose a distributed Oja algorithm to estimate the principal components of M in a general framework where no explicit model is given for the observations. The observations are defined by an independent sequence of matrices (M_n)_n, i.e. M_n contains imperfect measurements of M corrupted by random noise at time n. Moreover, in this thesis we consider the following model. At each time instant (iteration) n, each node i observes a few noisy random samples of the i-th row of the matrix M_n. Each node i sends and/or receives variables from other nodes j of the network chosen at random (unlike [147], where fixed links between nodes are considered). The matrix products involved in Oja's update equation (see (15)), namely M_n U_{n−1} and U_{n−1}^T M_n U_{n−1}, are computed through an asynchronous communication model different from the consensus scheme [31] required in [147], [102]. Thus, at each sensor i we define two random sequences y_n(i) and z_n(i), corresponding to unbiased estimates of the two terms Σ_j M(i,j) U_{n−1}(j) and U_{n−1}^T M_n U_{n−1}, respectively, which appear in equation (16). In addition, we introduce a projection step at each iteration n that keeps U_n confined to a compact set in order to prevent instabilities of the sequence (U_n)_n. The update at each sensor becomes:

U_n(i) = Π_K [ U_{n−1}(i) + γ_n ( y_n(i) − z_n(i) U_{n−1}(i) ) ]   (17)

where K is an arbitrary compact set whose interior contains [−1, 1]^p and Π_K is the projector onto K. This step is easy to implement at each sensor and requires no additional communications. The estimates y_n(i) and z_n(i) result from an asynchronous communication protocol detailed in Chapter 3.
The convergence of the proposed algorithm is analyzed in the asymptotic regime where n tends to infinity. Although the implementation and the objective differ from algorithm (11), both are tied to the Robbins-Monro theoretical framework. Thus, the convergence analysis of the sequence (U_n)_n generated by the proposed algorithm relies on the existence of a function h(U) representing the mean field, whose roots correspond to the eigenspace of M, i.e. the U ∈ {h(U) = 0} are eigenvectors satisfying the spectral decomposition (12) of M. Hence, similarly to the works [122], [29], the convergence analysis hinges on the following ingredients: the stability of U_n, and the definition of h(U) and of its roots {h(U) = 0}. We make the following contributions:

• We borrow from [92] the idea of sparsification (rare communications), and we share with [102], [147] and [29] the same foundation, namely Oja's algorithm, which we use in a distributed setting as initiated by [147] and [102].

• We provide a general framework and algorithms encompassing both the case where the symmetric matrix M is perfectly known and the case where it is not. In the latter case, we consider instead an i.i.d. sequence of random matrices denoted M_n, n ≥ 0.

• We provide an algorithm involving an asynchronous model for the communications between agents and an online model for the acquisition and processing of data by each agent. This online model is suited to the setting considered, where the measurements (observations) and the communications between agents are modeled as random processes with known parametric distributions.

• We prove the almost sure convergence of the sequence of vector subspaces generated by the distributed algorithms proposed in Chapter 3 to a set of eigenvectors of M.
Then, with a more practical and experimental aim, we study the application of our algorithm to self-localization in wireless sensor networks. This setting was motivated by the possibility of obtaining numerical results from real data collected on actual sensors.

3.4 Application of PCA: localization in wireless sensor networks

In the context of signal processing, an appealing motivation for designing the distributed, asynchronous, and online Oja algorithm described by (15) lies in its application to the self-localization problem in wireless sensor networks (see [57], [27], [143], [24], [92], [41]). Multidimensional scaling (MDS) theory [95] deals with the following general problem: find a configuration of the N objects under study when only data about their similarities/distances are available. In particular, the classical method considered in this thesis, i.e. MDS [27, Chapter 12], takes the Euclidean distances between the N positions under study as similarity measures in a space of dimension p (sensor positions usually live in p = 2 or 3 dimensions). In that case, the classical MDS method performs the principal component analysis (12) of the matrix M defined as follows:

M = −(1/2) J_⊥ D J_⊥   (18)

where the matrix D contains the squared distances and J_⊥ = I − (1/N) 1 1^T. In the context of wireless sensor networks, the positions of the network formed by N sensors are recovered (up to an isometry, i.e. rotation/translation/reflection) by applying the classical MDS method (also known as MDS-MAP [143]). Let z_i denote the position of a sensor/node i of the network and let z̄ denote the barycenter of the network, i.e. the average position z̄ = (1/N) Σ_i z_i.
In the Euclidean case, the entries of the squared-distance matrix D are related to the sensor positions z_i by:

D(i,j) = ‖z_i − z_j‖².   (19)

Then, using (19) in the definition (18) of M implies that M = Z Z^T, where the i-th row of the matrix Z coincides with z_i − z̄. Consequently, the principal component analysis problem (12) applied to (18) in the sensor network context reduces to finding the decomposition M = Z Z^T with Z = U Λ^{1/2} ∈ ℝ^{N×p} (p = 2 or p = 3). The matrices U and Λ are obtained from the spectral decomposition of M. We define the estimated position at each sensor i as Z(i), where Z(i) = (√λ_1 u_1(i), …, √λ_p u_p(i)). We assume the conditions described in the previous section, where the sensors only have access to noisy measurements of the corresponding rows of M, i.e. (M_n)_n is a sequence of unbiased estimates of M. The algorithm proposed in Chapter 4 is an adaptation of (17) to this specific context, in order to obtain the estimated positions Z from the eigenvectors U in a distributed, asynchronous, and online fashion. The centralized localization approach introduced by [143] (a theoretical analysis of which is given in [85]) consists of two main steps: first, obtain the squares of all pairwise distances D(i,j) between the sensors and compute the matrix (18) (a step that involves double-centering the positions at the barycenter of the network); second, find the p principal components of M. It is important to note, however, that in wireless sensor networks D cannot be acquired directly. The distances must instead be estimated from other available measurements, depending on the electronic modules and devices embedded in the sensors, e.g.
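The two-step centralized pipeline just described (double-centering the squared distances, then extracting p principal components) can be sketched as follows, in pure Python with power iteration and deflation, on toy coordinates of our own:

```python
def classical_mds(D, p=2, n_iter=300):
    """Classical MDS (MDS-MAP): build M = -0.5 * J D J from squared
    distances D, eq. (18), then recover Z = U Lambda^{1/2} from the p
    leading eigenpairs of M. Positions are recovered up to an isometry
    (rotation/translation/reflection)."""
    N = len(D)
    # double-centering step, eq. (18)
    row = [sum(D[i]) / N for i in range(N)]
    tot = sum(row) / N
    M = [[-0.5 * (D[i][j] - row[i] - row[j] + tot) for j in range(N)]
         for i in range(N)]
    Z = [[] for _ in range(N)]
    for _ in range(p):
        u = [1.0] + [0.0] * (N - 1)
        for _ in range(n_iter):  # power iteration for the current eigenpair
            v = [sum(M[i][j] * u[j] for j in range(N)) for i in range(N)]
            norm = sum(x * x for x in v) ** 0.5
            u = [x / norm for x in v]
        lam = sum(u[i] * sum(M[i][j] * u[j] for j in range(N))
                  for i in range(N))
        for i in range(N):
            Z[i].append(lam ** 0.5 * u[i])     # column of Z = U Lambda^{1/2}
        # deflate: M <- M - lam * u u^T, then repeat for the next eigenpair
        M = [[M[i][j] - lam * u[i] * u[j] for j in range(N)] for i in range(N)]
    return Z

pts = [(0.0, 0.0), (4.0, 0.0), (1.0, 2.0), (3.0, 3.0)]
D = [[(a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 for b in pts] for a in pts]
Z = classical_mds(D)  # recovers the 4 positions up to an isometry
```

A quick check of success is that the pairwise distances of the recovered Z match the input D, which is all that can be asked given the isometry ambiguity.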
the received power of a signal (the received signal strength indicator, RSSI), the time of arrival (TOA), or the angle of arrival (AOA) (see [126], [105] for a more detailed state of the art). In this thesis, the sensors used in our experimental results, made available by the FIT IoT-LAB platform [1], are equipped with the CC2420 radio chip². In particular, FIT IoT-LAB is a remotely accessible platform: one logs in from a private account and issues command-line instructions through the serial ports of the sensors. The (French national) platform comprises four sites located in France. Our experiments are run on the Rennes site, which hosts 256 WSN430 sensors based on the ZigBee IEEE 802.15.4 standard, operating at 2.4 GHz and deployed in two large, nearly empty storage rooms of dimensions 6 × 15 m². With this radio technology the sensors can obtain RSSI measurements. We therefore define an unbiased estimator of the squared distance based on a parametric signal model with a log-normal distribution (the standard log-normal shadowing model, LNSM, described in [136]). Furthermore, the principal component analysis step of our approach involves a distributed and asynchronous version of Oja's algorithm (15), with an observation model that lets each sensor i estimate the i-th row of M_n using this unbiased distance model from the sporadic RSSI measurements it acquires. The algorithm is fully detailed in Section 4.4.3 of Chapter 4.
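To illustrate the kind of unbiased squared-distance estimator that the LNSM allows, here is a small sketch. The path-loss parameters (P0, d0, n_p, sigma) and the bias-correction derivation below are our own illustrative choices, not the calibrated values or estimator of Chapter 4:

```python
import math
import random

# Log-normal shadowing model (LNSM): RSSI = P0 - 10*NP*log10(d/D0) + X,
# with shadowing X ~ N(0, SIGMA^2). All parameter values are illustrative.
P0, D0, NP, SIGMA = -40.0, 1.0, 3.0, 4.0

def rssi_sample(d, rng):
    """Draw one RSSI measurement at true distance d under the LNSM."""
    return P0 - 10.0 * NP * math.log10(d / D0) + rng.gauss(0.0, SIGMA)

def unbiased_d2(rssi):
    """Unbiased estimator of the squared distance d^2 from one RSSI sample.
    Inverting the LNSM gives D0^2 * 10^((P0 - rssi)/(5 NP)) = d^2 * 10^(-X/(5 NP));
    since E[10^(-X/(5 NP))] = exp(SIGMA^2 ln(10)^2 / (50 NP^2)), dividing by
    this log-normal correction factor removes the bias."""
    naive = D0 ** 2 * 10.0 ** ((P0 - rssi) / (5.0 * NP))
    correction = math.exp((SIGMA * math.log(10.0)) ** 2 / (50.0 * NP ** 2))
    return naive / correction

# Monte Carlo sanity check: the estimator averages to d^2 = 9 for d = 3.
rng = random.Random(1)
est = sum(unbiased_d2(rssi_sample(3.0, rng)) for _ in range(100000)) / 100000
```

Unbiasedness is exactly what the stochastic approximation step (17) needs from its inputs, even though each individual sample remains very noisy.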
Our contributions are the following:

• We adapt the distributed algorithm proposed in Chapter 3 to the self-localization problem in wireless sensor networks, assuming sporadic RSSI measurements acquired by the sensors. The position of each sensor is estimated without any a priori knowledge of landmarks, i.e. of the positions of anchor nodes.

• We obtain numerical results on the accuracy of the position estimates in two cases: when synthetic data are generated according to the LNSM distribution, and when the data are collected from real-world experiments using the sensors of the FIT IoT-LAB platform [1]. We compare our results with classical centralized methods (multilateration [79] (MC), min-max [141], the classical MDS algorithm summarized in Section 4.4.1 and Oja's approach (15)) and with a distributed MDS approach proposed by [45].

• We propose an additional phase consisting of a distributed optimization algorithm of the form (7)-(8), in order to improve the accuracy of the estimated positions and to let each sensor build a local map from its own position and those of its neighboring sensors. This optional step can be particularly useful when some anchor nodes are subsequently present in the network. The refinement algorithm is first implemented on the sensors of the FIT IoT-LAB platform, with the estimated positions initialized by the algorithm proposed in Chapter 3, and then on three different scenarios where they are initialized with the algorithm proposed by [53] (data available at [2] and [3]).

² Detailed specifications at http://www.ti.com/product/cc2420
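For reference, the centralized MDS baseline that the distributed algorithm emulates (double-centering of the squared distances, followed by a rank-p spectral factorization, cf. (18)-(19)) can be sketched in a few lines. The function below is a generic illustration with hypothetical names, not the thesis code.

```python
import numpy as np

def classical_mds(D, p=2):
    """Recover positions (up to isometry) from a squared-distance matrix D.

    D[i, j] = ||z_i - z_j||^2  =>  M = -0.5 * J @ D @ J = Z @ Z.T,
    where J = I - (1/N) 11^T double-centers the positions at the barycenter.
    """
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    M = -0.5 * J @ D @ J                       # similarity matrix, cf. (18)
    lam, U = np.linalg.eigh(M)                 # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:p]            # keep the p largest
    return U[:, idx] * np.sqrt(np.maximum(lam[idx], 0))  # Z = U Lambda^{1/2}

# With exact distances, the recovered map reproduces all pairwise distances.
rng = np.random.default_rng(0)
Z_true = rng.standard_normal((10, 2))
D = ((Z_true[:, None, :] - Z_true[None, :, :]) ** 2).sum(-1)
Z_hat = classical_mds(D, p=2)
```

Since positions are only identifiable up to rotation, reflection and translation, the recovered `Z_hat` should be compared to `Z_true` through pairwise distances rather than coordinates.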
4 Organization of the thesis

The work in this thesis addresses two different problems in wireless sensor networks: consensus and principal component analysis. For both problems we propose algorithms based on distributed stochastic approximation, whose main features were described in Section 2. We therefore split this manuscript into two parts, corresponding to the introductions given in Sections 3.2 and 3.3 respectively. In each part we treat a particular application that provides a context for the problem: on the one hand, distributed optimization is the application of the consensus algorithms; on the other hand, localization is the application of distributed principal component analysis. Figure 1 below shows how this thesis is organized and the relations between the chapters. In the first part, we provide general convergence results on consensus algorithms. We specialize and illustrate these results in the case of distributed optimization. Finally, we adapt this work to a more specific instance of distributed optimization, namely parameter estimation. Our first contribution at the beginning of this thesis was the design of a distributed, on-line version of the well-known Expectation-Maximization algorithm, which is commonly used to estimate the parameters of a signal represented by a Gaussian mixture. The second part is less theoretical than the first and is mainly devoted to experimental results on distributed principal component analysis, obtained from experiments on real wireless sensors of the remotely accessible FIT IoT-LAB platform.
The design of a new algorithm for this purpose is introduced in the first chapter of this part, and it is adapted to localization in sensor networks in the following chapter. We add to the proposed localization algorithm a refinement step for the estimated positions, formulated as a distributed optimization problem; to this end, we use the consensus gradient descent algorithm, as a special case of (7)-(8), to solve this problem distributively. This thesis is thus split into two parts corresponding to the two applications that motivate our work. The manuscript finally contains three chapters, summarized as follows.

Chapter 2 studies the distributed stochastic approximation problem in multi-agent systems. The algorithm under study consists of two steps: a local stochastic approximation step and a diffusion step that drives the network to a consensus on the result. The diffusion step uses row-stochastic matrices to weight the network exchanges. Contrary to previous works, the weight matrices are not assumed to be doubly stochastic, and may also depend on the past estimates.

[Figure 1. Diagram of the framework of this thesis and the relations between the chapters: Chapter 2 (consensus algorithms: gradient descent algorithm, Expectation-Maximization algorithm) leads to parameter estimation (Appendix A); Chapter 3 (principal component analysis: Oja's algorithm) leads to localization (Chapter 4).]

We prove that non-doubly stochastic (non-bistochastic) matrices generally influence the limit points of the algorithm.
However, the limit points are not affected by the choice of the matrices, provided that these are doubly stochastic in expectation. This conclusion legitimizes the use of broadcast-type diffusion protocols, which are easier to implement. Then, by means of a central limit theorem, we prove that doubly stochastic protocols asymptotically achieve the same performance as a centralized algorithm, and we quantify the degradation caused by the use of non-doubly stochastic matrices. Throughout the chapter, particular emphasis is put on the special case of distributed non-convex optimization as an illustration of our results.

Chapter 3 deals with the problem of principal component analysis (PCA) through a distributed and asynchronous implementation. We provide two algorithms suited to different situations depending on the structure of the underlying graph. A sufficiently general framework allows us to analyze all these algorithms at once. Convergence is proved with probability one under suitable assumptions, and numerical experiments illustrate their good behavior. The proposed algorithm allows us to address, in the following chapter (Chapter 4), the self-localization problem in wireless sensor networks on which this context is based.

Chapter 4 considers the localization problem in wireless networks formed by sensors whose positions remain fixed. Each node seeks to estimate its own position from noisy measurements of its distance to other nodes. We assume that the sensors can obtain measurements of the received signal power (Received Signal Strength Indicator, RSSI), which are in turn related to the Euclidean distance through a log-normal statistical model known as the Log-Normal Shadowing Model (LNSM).
In a centralized batch mode, the positions can be recovered (up to an isometry, i.e. rotation, reflection, translation) by a PCA applied to a so-called similarity matrix built from the relative distances between each pair of sensors. In this chapter, we propose a distributed on-line algorithm allowing each node to estimate its own position based on limited information exchange in the network. Our framework encompasses sporadic measurements and links that may randomly fail. We prove the consistency of our algorithm using a convergence analysis similar to the one used in the previous chapter. We also include a refinement step based on a consensus algorithm (Chapter 2) in order to improve the accuracy of the estimated positions. Finally, we provide numerical and experimental results based on real and simulated data. The experiments with real data are carried out on a remotely accessible platform providing wireless sensor networks.

We chose to present in the appendices the detailed proofs related to the analysis of consensus algorithms that are not included in Chapter 2 (see Appendix D). In addition, Appendix C presents a numerical analysis of the best-known communication protocols used for consensus, i.e. gossip protocols, and in particular for distributed optimization. We also include two conference papers resulting from joint work within our department.
Note that throughout this thesis, our contributions contain a more rigorous part devoted to theoretical results and a part devoted to more specific, concrete applications related to current research topics. Indeed, the first algorithm based on distributed stochastic approximation, designed at the beginning of this thesis, is reported in Appendix A: we introduced a new distributed on-line Expectation-Maximization (DEM) algorithm for latent data models, including the well-known Gaussian mixture model as a special case. A second algorithm results from a collaboration in the fields of machine learning and Big Data and is presented in Appendix B: we proposed an on-line learning algorithm with a gossip communication protocol (OLGA) devoted to binary classification in a distributed setting.

5 Scientific production

The research presented in this manuscript is the fruit of collaborations with my thesis advisor Pascal Bianchi, but also with Jérémie Jakubowicz (Télécom SudParis), Gersende Fort (CNRS-Télécom ParisTech) and Stéphan Clémençon (Télécom ParisTech), as well as with Amy N. Dieng (PhD student at the LINCS of Télécom ParisTech with Claude Chaudet) for the topics of the second part of the thesis. The contributions of these collaborations include several different results, presented at both international and national conferences. They are listed below.

Articles in international peer-reviewed journals

1. G. Morral and P. Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", submitted to Elsevier journal on Signal Processing, February 2015, arXiv:1503.05298.

2. G. Morral, P. Bianchi and G.
Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", submitted to IEEE Trans. on Signal Processing, October 2014, arXiv:1410.6956.

Conference papers with proceedings

1. G. Morral, P. Bianchi* and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", the 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, USA, December 2014.

2. G. Morral* and N.A. Dieng, "Cooperative RSSI-based indoor localization: B-MLE and distributed stochastic approximation", the 80th IEEE Vehicular Technology Conference (VTC2014-Fall), Vancouver, Canada, September 2014.

3. G. Morral*, N.A. Dieng and P. Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1110-1114, Florence, Italy, May 2014.

4. P. Bianchi, S. Clémençon, J. Jakubowicz and G. Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", the 1st IEEE International Conference on Big Data (BigData), pp. 6-14, Santa Clara, USA, October 2013.

5. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité", the 24th National Conference on Signal and Image Processing (GRETSI), Brest, France, September 2013.

6. G. Morral, P. Bianchi, and J. Jakubowicz*, "Asynchronous distributed principal component analysis using stochastic approximation", the 51st Annual Conference on Decision and Control (CDC), pp. 1398-1403, Maui, Hawaii, December 2012.

7. G. Morral, P. Bianchi*, G. Fort and J. Jakubowicz, "Distributed stochastic approximation: the price of non-double stochasticity", invited paper, the 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1473-1477, California, USA, November 2012.

8. G. Morral*, P. Bianchi and J.
Jakubowicz, "Gossip-based online distributed expectation maximization", the 2012 IEEE Statistical Signal Processing Workshop (SSP), pp. 305-308, Ann Arbor, USA, August 2012.

Workshops without proceedings

1. G. Morral*, "Analyse d’algorithmes distribués pour l’approximation stochastique dans les réseaux de capteurs", presentation of the results of this thesis at the 4th Journée de restitution des travaux de recherche of the Futur & Ruptures program organized by Fondation Télécom, as a candidate for the 2015 thesis prize, March 2015.

2. P. Bianchi, S. Clémençon, J. Jakubowicz and G. Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", poster presented at the 3rd colloquium « Numérique : Grande échelle Complexité » organized by Institut Mines-Télécom, March 2014.

3. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité", poster presented at the 2nd Journée de restitution des travaux de recherche of the Futur & Ruptures program organized by Fondation Télécom, January 2013.

Chapter 1

Introduction

1.1 Motivations

Remote sensing and environmental monitoring have been active research areas for the last decades. The technological evolution of wireless sensor networks (WSNs) has helped spread their use, first for military purposes, and later in civil and industrial applications, e.g. surveillance and security management. Besides, scientific advances in statistical signal processing have been an important part of the progress made in detection, estimation and classification of the data acquired by WSNs. Although we pay special attention to WSNs, we also address more general multi-agent systems, e.g. peer-to-peer networks or multi-robot systems. The applications related to multi-agent systems include: network control and coordination (e.g.
target or trajectory tracking [132], [45], [123], [152], power and resource allocation [108], [23]), Big Data processing (e.g. classifier training [149], [160], recommendation profiling [147], PageRank [124]) or environmental monitoring and pattern recognition in sensor networks (e.g. parameter estimation [135], [144], signal detection [120], [90], [102]). In this thesis, we pay special attention to the localization problem in wireless sensor networks, since location awareness (or network location awareness) is required in many applications. Information about the positions of sensor nodes may, for instance, be used for routing and querying, which can be adapted or controlled according to those positions. For environmental monitoring, many applications require geographical information about the observed data, e.g. mapping temperature or humidity. In a more challenging use case, developments in home automation aim at making a home activate its appliances automatically depending on the host's position. If robustness and flexible deployment are two of the main advantages of these networks, one also has to overcome the issues related to data privacy and security and to the limited resources, including energy, memory and computational power. We consider a multi-agent system as a group of entities having both interaction and data-processing capabilities, e.g. a wireless sensor network. We use agent as the generic word for a computer, a processor, a sensor node or any other device forming the connected network. Within any specific objective and context, a global task is performed by the network of communicating agents, e.g. temperature measurement in WSNs or binary classification in machine learning. Agents are autonomous and self-aware: they have access to local views of the global scenario and they are not controlled by a designated central unit.
In this thesis, we consider that each agent has a partial view of the global problem to solve. Typically, in WSNs each sensor node is able to observe its surrounding area of the environment, while in distributed machine learning each processor (or computer) is in charge of handling one part of a distributed dataset. The objective of our work is to design distributed algorithms for embedded multi-agent networks. By distributed, we mean that agents share their local information without any hierarchical architecture or master agent, i.e. contrary to centralized processing. These architectures offer two main advantages: robustness to individual node failures, since data can be recovered at each agent, and scalability, since there is no central agent that may become a bottleneck.

1.2 Framework

In this thesis, we focus on distributed algorithms from the special point of view of stochastic approximation. We consider the case where data/observations are handled "on-line": a random sample is used by an agent to update its local solution and then deleted after use; next, another new sample is used, and so on. In this context, we jointly address two frameworks: multi-agent systems and distributed stochastic approximation. As a consequence, we are interested in numerical methods that have the following three properties. They are i) iterative – an iterate is updated at each time instant, ii) distributed – agents communicate to merge their individual iterates, iii) on-line – each update is done using the most recent observation or data sample.

Network model: The network of N agents is represented by an undirected (non-oriented) graph G = (V, E), where V = {1, . . . , N} stands for the set of agents (also referred to as nodes in the WSN context) and E is the set of undirected edges. Two agents i, j ∈ V are connected if there exists a link {i, j} ∈ E that enables communication between i and j. We shall sometimes write i ∼ j to denote the edge {i, j} ∈ E.
We shall always assume that {i, i} is not an edge (G has no self-loop).

Communication model: Distributed algorithms can either be synchronous or asynchronous. In the synchronous case, all agents are expected to complete their local computations before their outputs can be merged; otherwise stated, the algorithm proceeds with the next iteration provided that all agents have returned their result. In the asynchronous case, we assume that each agent is likely to finish a local computation at some random time instant, passes its output to other agents, and proceeds with another local computation with no need to wait for the rest of the agents. Thus, the meaning we give to "distributed asynchronous algorithm" is that there is no central scheduler and that any node can wake up randomly at any moment, independently of the other nodes. This mode of operation brings clear advantages in terms of complexity and flexibility. Asynchronous distributed optimization is a promising framework for scaling up machine learning problems involving massive data sets (see [32] or the more recent survey [40]).

1.3 Position of the thesis

1.3.1 Preliminary: consensus algorithms

Historically, a significant amount of work in distributed processing has been devoted to solving the so-called average consensus problem. Although the thesis addresses more general issues, we start with a description of this archetypal problem, as it allows us to introduce the main ideas underlying many distributed algorithms. We denote by T0,i ∈ R some scalar value observed by agent i ∈ V. The objective of average consensus is, for any agent i, to estimate the average value T̄0 = (1/N) Σ_{i=1}^N T0,i. One of the most widespread approaches to solve the problem distributively is the following. At time instant n, each agent i has an estimate Tn,i of the sought average.
Each node i receives the current iterates of the other nodes in its neighborhood, i.e. nodes j such that j ∼ i, and updates its local iterate as a weighted average of its past iterate and the iterates received from neighbors. Formally, at each time n,

∀i ∈ V,  Tn,i = Σ_{j∈V} w(i,j) Tn−1,j   (1.1)

where the w(i,j) are non-negative weights such that w(i,j) = 0 whenever i and j are not connected. This condition ensures that the algorithm is indeed distributed over the graph. For simplicity, assume also that Σ_j w(i,j) = 1 for all i. Define the N × N matrix W whose coefficient (i,j) coincides with w(i,j). The above algorithm simply writes Tn = W Tn−1, where Tn is the column vector whose i-th entry coincides with Tn,i. Note that W is a stochastic matrix in the sense that all its rows sum to one. Distributed average consensus has its origins in applied statistics [48] and computer science [155], [156] (see [16] for a detailed state of the art on this subject). The conditions under which the above distributed algorithm converges to the sought average T̄0 are well known. Assume that

∀i ∈ V,  Σ_{j:j∼i} w(i,j) = 1   (1.2)
∀j ∈ V,  Σ_{i:i∼j} w(i,j) = 1   (1.3)

that is, not only do all rows of the matrix W sum to one, but all columns sum to one as well. Such a matrix W is called doubly stochastic. If W is moreover a primitive matrix¹, it can be shown as an immediate application of the Perron-Frobenius theorem (see [81]) that

∀i ∈ V,  lim_{n→∞} Tn,i = T̄0 .   (1.4)

¹ There exists m > 0 such that all coefficients of W^m are strictly positive. The assumption holds for instance if w(i,j) > 0 for all i ∼ j and there exists i such that w(i,i) > 0.

Thus, the algorithm allows each node to eventually reach a consensus, i.e. an agreement on the final value. In addition, the value of the consensus coincides with the sought average T̄0. The above algorithm is sometimes referred to by some authors as a gossip algorithm.
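The linear iteration Tn = W Tn−1 and the convergence (1.4) can be checked numerically. A minimal sketch, assuming a ring graph with lazy doubly stochastic weights of our own choosing (each node averages itself and its two neighbors, so W is doubly stochastic and primitive):

```python
import numpy as np

N = 8
# Doubly stochastic weights on a ring: rows and columns both sum to one,
# and w(i, i) > 0 makes W primitive, so (1.4) applies.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

rng = np.random.default_rng(2)
T = rng.standard_normal(N)          # initial values T_{0,i}
target = T.mean()                   # the sought average, cf. (1.4)
for _ in range(500):                # T_n = W T_{n-1}
    T = W @ T
```

After a few hundred iterations every entry of `T` coincides with the initial average, illustrating both the consensus and the fact that double stochasticity pins the consensus value to T̄0.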
The above gossip algorithm is synchronous in the sense that all agents must communicate their value at every step of the algorithm, and the matrix W is fixed across the iterations. The authors of [31] propose an asynchronous communication protocol. At time n, a given node chosen at random (say node i) wakes up and randomly selects a node in its neighborhood (say node j). Nodes i and j merge their values by

Tn,i = Tn,j = 0.5 Tn−1,i + 0.5 Tn−1,j ,   (1.5)

while the other nodes k ∉ {i, j} keep their iterates, Tn,k = Tn−1,k. The algorithm writes Tn = Wn Tn−1, where (Wn)n is a random sequence of matrices, namely Wn = IN − 0.5 (ei − ej)(ei − ej)ᵀ, where IN is the N × N identity matrix, ei is the i-th vector of the canonical basis of Rᴺ and ᵀ stands for transposition. The convergence (1.4) still holds (in the almost sure sense) under technical conditions specified in [31]. A key feature of the matrices Wn is that they are still doubly stochastic for all n ≥ 1. The pairwise protocol described above can be considered asynchronous in the sense that nodes are allowed to be inactive at some instants. However, it still requires some level of coordination among nodes: two nodes must update their values simultaneously. Alleviating the need for such feedback in order to achieve truly asynchronous protocols has been an important stake in the years after [31]. The authors of [10] propose a fully asynchronous communication model called broadcast gossip. Similarly to the previous protocol, at each instant n one agent is randomly activated. The asynchronism is now at "agent level", since the active agent i broadcasts its estimate to all its neighbors without expecting a feedback transmission. Unfortunately, it is shown in [10] that the sought convergence result (1.4) no longer holds. All that can be expected with such a simple protocol is convergence in expectation, but not almost sure convergence. We refer to [69] for more detailed considerations on that matter.
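The pairwise update (1.5) amounts to multiplying by the random doubly stochastic matrix Wn = IN − 0.5 (ei − ej)(ei − ej)ᵀ. A toy simulation (on a complete graph, our simplifying assumption) illustrates that each Wn preserves the network average while driving the values to consensus:

```python
import numpy as np

def pairwise_gossip_matrix(N, i, j):
    """W_n = I_N - 0.5 (e_i - e_j)(e_i - e_j)^T: averages nodes i and j."""
    e = np.zeros(N)
    e[i], e[j] = 1.0, -1.0
    return np.eye(N) - 0.5 * np.outer(e, e)

rng = np.random.default_rng(3)
N = 10
T = rng.standard_normal(N)
avg0 = T.mean()
for _ in range(2000):
    i = rng.integers(N)                 # node that wakes up at time n
    j = (i + rng.integers(1, N)) % N    # a random distinct peer (complete graph)
    # Each W_n is doubly stochastic, so the average is preserved exactly.
    T = pairwise_gossip_matrix(N, i, j) @ T
```

The contrast with broadcast gossip is precisely here: dropping the feedback (i.e. updating only the receivers) would keep Wn row-stochastic but not column-stochastic, and the preserved quantity above would be lost.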
As a conclusion to this preliminary paragraph, the average consensus problem can be solved using a linear gossip algorithm in an asynchronous version, but the double stochasticity of the weighting matrices Wn is required at each n. Double stochasticity comes with some drawbacks regarding practical implementation, as it generally requires feedback in the network. Alternative works on this matter do not require double stochasticity, at the expense of more complex models, e.g. [89], [84], [78]. For instance, [78] establish the convergence result (1.4) using only row-stochastic matrices, i.e. Wn 1 = 1, but letting the set of communicating nodes grow at each time n. It is worth noting the contribution of [89], who introduce the push-sum protocol (analyzed in greater generality by [34]). The gossip model of [89] circumvents the convergence issue without the need for feedback links by introducing some additional communication, i.e. two variables instead of one are involved in the update step. [84] propose an asynchronous version of the latter model [89]; the convergence analysis and convergence speed are provided in [83] by the same authors. We refer to [16] for a more complete description of general distributed average consensus algorithms.

1.3.2 Distributed optimization

Distributed optimization is present in many of the applications mentioned above related to sensor networks and machine learning. The goal of the network is to optimize a global function defined as a sum of local private functions. The minimization problem under study is described as follows:

min_{θ∈R} F(θ) ,   F(θ) = Σ_{i=1}^N fi(θ)   (1.6)

where fi is the private cost function of agent i. We assume in this thesis that the functions fi are differentiable but not necessarily convex. This thesis also puts the focus on first-order methods, i.e. algorithms relying merely on gradient computations.
Let us provide an illustrative example in the context of sensor networks.

Example 1.1. In WSN contexts, it is often the case that one should estimate a parameter θ based on a set of random observations X1, . . . , XN collected by independent sensors and whose marginal probability density functions pθ,1, . . . , pθ,N are indexed by θ. Provided that the Xi's are independent, the maximum likelihood estimate of θ can be written as the minimizer of (1.6) where fi(θ) = − log pθ,i(Xi).

In a centralized setting, random observations are collected at a central unit. All functions are assumed to be available at a single place, and a standard gradient descent on F can be used to obtain a minimizer. This thesis focuses on the distributed setting: the functions fi are only locally known by the agents, and the function F is available nowhere. In the literature, there are mainly two kinds of distributed first-order algorithms for solving this problem. The first one is known as the incremental approach (see [113], [131], [133]). A single iterate travels through the network from node to node. Each node updates the estimate by incrementing the iterate with a scaled version of its negative gradient evaluated at the current point. The approach, although conceptually simple, has some drawbacks. Incremental algorithms generally require the message to go through a Hamiltonian cycle in the network, and finding such a path is known to be an NP-complete problem. Relaxations of the Hamiltonian cycle requirement have been proposed: for instance, [113] only requires that an agent communicates with another agent selected uniformly at random in the network (not necessarily in its neighborhood). However, substantial routing is still needed. This thesis focuses on another cooperative approach of the form adapt-then-combine (following a terminology introduced by [103] in [38]), also known as adaptation-diffusion algorithms.
The idea, which traces back to [155], consists in coupling a local gradient descent at the nodes' side with a gossip communication step that merges the iterates, as explained in the previous subsection. Contrary to incremental approaches, each node i has its own estimate θn,i. At each iteration, the following update holds:

θ̃n,i = θn−1,i − γn ∇fi(θn−1,i) ,   (1.7)
θn,i = Σ_{j=1}^N wn(i,j) θ̃n,j ,   (1.8)

where γn is a deterministic positive step size, ∇ denotes the gradient, and Wn = (wn(i,j))_{i,j∈V} is a gossip matrix similar to the ones described previously (see Section C.1 in Appendix C for detailed examples). In addition, it is sometimes the case that the gradient is observed up to some random perturbation, which may depend on the history of the algorithm. In that case, equation (1.7) must be replaced by

θ̃n,i = θn−1,i − γn ∇fi(θn−1,i) + γn ξn,i   (1.9)

where ξn,i is a perturbation due to the fact that the gradient is not perfectly observed at node i.

Example 1.2. To illustrate this point, consider again the WSN example given previously, where agents seek to estimate an unknown parameter θ in the maximum likelihood sense based on random observations. Consider the case where each sensor i gathers a sequence of random observations (Xn,i)n=1,2,... instead of a single observation Xi. Assume also that the sequence is formed by independent copies of Xi. Then, an on-line distributed estimation of the parameter θ using the above algorithm would read

θ̃n,i = θn−1,i + γn ∇ log p_{θn−1,i}(Xn,i) .

Under some regularity conditions, it can be shown that the above update coincides with (1.9) by letting fi(θ) = −E[log pθ,i(Xi)], where E stands for the expectation, the perturbation ξn,i being a martingale increment.

It is expected that, under some assumptions,

∀i ∈ V,  lim_{n→∞} θn,i = θ⋆   (1.10)

where θ⋆ is some minimizer of F (assumed to exist). We refer to Chapter 2 for a more detailed state of the art on these techniques.
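A minimal synchronous simulation of the adapt-then-combine recursion (1.7)-(1.8), with noiseless quadratic local costs and a fixed doubly stochastic ring matrix (both our illustrative choices, not the general setting analyzed in Chapter 2):

```python
import numpy as np

N = 5
rng = np.random.default_rng(5)
targets = rng.standard_normal(N)     # f_i(theta) = 0.5 * (theta - targets[i])^2
# The minimizer of F = sum_i f_i is theta* = targets.mean().

# Fixed doubly stochastic gossip matrix on a ring; double stochasticity is
# what steers the consensus toward the true minimizer of the sum.
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = 0.5
    W[i, (i - 1) % N] = 0.25
    W[i, (i + 1) % N] = 0.25

theta = np.zeros(N)                  # theta_{0,i}
for n in range(1, 5001):
    gamma = 1.0 / n                  # decreasing deterministic step size
    theta_tilde = theta - gamma * (theta - targets)   # local step (1.7)
    theta = W @ theta_tilde                           # combine step (1.8)
```

All agents end up near the common minimizer theta* = targets.mean(), even though no agent ever sees the cost functions of the others.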
However, we mention that convergence is generally proved under rather strong assumptions on the matrices (Wn)n describing the consensus protocol. In general, the sought consensus is achieved under the double-stochasticity assumption ([116], [134]), i.e. the (Wn)n are row- and column-stochastic, meaning that Wn 1 = 1 and 1ᵀ Wn = 1ᵀ. In [19], [112] the column-stochasticity condition is relaxed and only assumed in expectation; this allows, for instance, the use of the broadcast gossip model of [10]. Similarly, the authors of [43] introduce a diffusion model that only requires the row-stochasticity condition, at the expense of a synchronous operation. The objective of this thesis is to derive convergence results such as (1.10) for the sequence generated by Algorithm (1.7)-(1.8) under milder conditions on (Wn)n. We investigate the results of [19] when the (Wn)n are only row-stochastic, and we extend them to a broader communication setting in which (Wn)n may depend on the observations or on the last estimates. In addition, we consider a more general stochastic approximation framework by letting Algorithm (1.7)-(1.8) take the following form:

θn = Wn ( θn−1 + γn Yn ) .   (1.11)

Recursion (1.11) extends the distributed optimization problem (1.6) to a more general framework. Indeed, Algorithm (1.11) can be viewed as a distributed version of the so-called Robbins-Monro algorithm [139]: Yn may be related to an unbiased estimate of a given mean-field function h(θ) whose roots one seeks, i.e. θ ∈ {h(θ) = 0}. We also focus on the convergence rate of this algorithm, along with its asymptotic normality. Finally, an objective of the thesis is to investigate the use of the above algorithm for statistical inference tasks in sensor networks. We propose a distributed Expectation-Maximization algorithm inspired by the adaptation-diffusion approach, and we also apply our method to the sensor self-localization problem.
1.3.3 Distributed principal component analysis

Another problem addressed within the stochastic approximation framework is principal component analysis (PCA). The objective in such problems is rather different from the one considered previously. Indeed, the aim is no longer to find a consensus on a common parameter of interest. Here, the aim is to drive the iterate of each node i to the value of the i-th entries of the principal eigenvectors of a matrix M. We define M ∈ R^{N×N} as a symmetric positive semi-definite matrix whose entries describe some similarity metric between each pair of agents, e.g. similarities (multidimensional scaling [95], [27]), distances (WSN localization [57], [143]), customer ratings (user profiling [154], [91]), adjacency weights (spectral clustering [29]) or covariances (signal detection [88], [37]). We assume that a given agent i has only partial information on the matrix M (typically, it is only able to observe the i-th row of M). Consider the spectral decomposition of M

M = U Λ Uᵀ ,  U Uᵀ = I_N   (1.12)

where U is an orthonormal matrix whose columns are the eigenvectors of M and Λ is a diagonal matrix containing the corresponding eigenvalues (λ1, ..., λN) in decreasing order λ1 ≥ ··· ≥ λN. We denote by ‖·‖ the Euclidean norm. For a given integer p < N, the aim is to evaluate the p largest eigenvalues λ1, ..., λp and the corresponding eigenvectors, which we denote u1, ..., up. When M is perfectly known and data is processed in a centralized manner, several classical methods are known to solve (1.12) efficiently, such as the power method ([73, p. 406]) when p = 1, and the QR-factorization ([81, p. 114], called orthogonal iteration in [73, p. 454]) or the Gram-Schmidt orthonormalization [73, p. 254] when p > 1.
The centralized power method is based on computing the following recursion (when p = 1):

Ũn = M Un−1 ,   (1.13)
Un = Ũn / ‖Ũn‖ ,   (1.14)

where (Un)n is the estimate sequence that converges to the first eigenvector u1 and ‖x‖ stands for the Euclidean norm, i.e. ‖x‖² = ∑i |x(i)|² for x ∈ R^N. From a distributed implementation viewpoint, both terms M Un−1 and ‖M Un−1‖ have drawbacks. For a given agent i, the first matrix product writes as a sum ∑_{j=1}^{N} M(i, j) Un−1(j) that contains N terms, involving a communication with each separate agent j. Second, for any agent i, (1.14) writes Un(i) = Ũn(i) / √(∑_{j=1}^{N} Ũn(j)²), and thus agent i should query all other agents about their values Ũn(j) to implement the normalization update. When N is large, this could incur a prohibitive cost to the network in terms of number of communications. As a consequence, several works have made efforts to derive a decentralized implementation of (1.13)-(1.14). A couple of works deal with a distributed version of (1.13)-(1.14) (see [90], [92]) by introducing consensus averaging to compute the normalization term. While in [90] M is assumed to be perfectly known, [92] includes a synchronous sparse model for M Un−1. Contrary to [90], where each agent i is able to compute ∑_{j=1}^{N} M(i, j) Un−1(j), [92] describes a sparse model in which each agent i transmits M(i, j) Un−1(j) to a small set of randomly chosen neighbors. In this thesis we seek to design an algorithm which is

• distributed: nodes cooperate in order to separately estimate different entries of the principal eigenvectors;

• on-line: matrix M is unobserved, but a sequence (Mn)n of perturbed/noisy versions of M is generated. The sequence (Mn)n is written as Mn = M + Ξn, where the perturbation matrix Ξn is typically a martingale increment.

In the centralized case, when a sequence of matrices (Mn)n is globally observed at a central computing unit, stochastic approximation can be used to estimate the eigenvectors of M.
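For reference, the centralized recursion (1.13)-(1.14) takes only a few lines of numpy; the matrix below is an illustrative randomly generated symmetric positive semi-definite M, not one from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
A = rng.normal(size=(N, N))
M = A @ A.T                   # symmetric positive semi-definite test matrix

U = rng.normal(size=N)        # random initialization
for _ in range(500):
    U_tilde = M @ U                        # (1.13): matrix-vector product
    U = U_tilde / np.linalg.norm(U_tilde)  # (1.14): normalization

# Compare against the leading eigenvector from a direct eigendecomposition
# (the sign of an eigenvector is arbitrary, hence the two norms).
lam, vecs = np.linalg.eigh(M)
u1 = vecs[:, -1]              # eigenvector of the largest eigenvalue
print(min(np.linalg.norm(U - u1), np.linalg.norm(U + u1)) < 1e-6)
```

Both operations in the loop are exactly the ones identified above as costly in a network: the full matrix-vector product and the global normalization.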
Oja's algorithm can be used for that purpose (see [120] for p = 1 and [122] for p > 1). We also refer to [57], [24], [85] for alternative approaches that solve (1.12) by semidefinite programming based on constrained optimization. In this work we introduce a distributed version of Oja's algorithm. We define Un = (un,1, ..., un,p)ᵀ as the p principal components estimated at time n. In Oja's algorithm [122], the estimate sequence Un is generated by:

Un = Un−1 + γn ( Mn Un−1 − Un−1 (Un−1ᵀ Mn Un−1) ) .   (1.15)

Note that (1.15) boils down to a Robbins-Monro algorithm [139]. Due to possible instabilities of the algorithm, several variants have also been proposed, which either introduce a normalization or a projection step (see [29]). Distributed variants of the algorithm have been investigated, often in specific contexts such as user profiling [147] or signal estimation (or detection) in WSN [102]. Both works have two common features: they propose a distributed version of (1.15) and they include average consensus iterates of the form [31] in their algorithm to compute some of the terms in (1.15) distributedly. Indeed, these approaches require two time scales, i.e. the iteration index n used to update Un and another time index corresponding to the number of consensus cycles of the form (1.5). In particular, the authors of [147] address a machine learning context where the observations M correspond to a large matrix containing (binary) user taste ratings of some products. Under the assumption that M is a low-rank matrix, the aim is to estimate the profile vector of each user. A distributed Oja algorithm is proposed to perform the spectral decomposition of a partially known dataset Mn. A normalization term is included in (1.15) to avoid stability issues. The term Mn Un−1 is computed through a fixed sparse model, i.e. each agent i observes a small set of values Mn(i, j) Un−1(j) from its neighbors j at each iteration n of the Oja update.
The introduced normalization term and (Un−1ᵀ Mn Un−1) are both computed by an average consensus run over several rounds before the Oja update Un(i) is computed at each agent i. In contrast, in [102] the goal is to find the eigendecomposition of a signal's covariance matrix M from noisy measurements received within a wireless sensor network, i.e. the received signal model is assumed to be "high energy signal + zero mean random noise". Here, finding the p principal eigenvectors of M amounts to capturing the high-energy components of the received data in order to detect and estimate the incoming signal of interest. The authors of [102] assume an estimated covariance matrix of the form Mn = n⁻¹ ∑_{t=0}^{n−1} rt rtᴴ, where (rt ∈ Cᴺ)t≥0 are the received measurements at the N sensors. Under the latter model assumptions, three terms are identified when rewriting recursion (1.15) so as to define its distributed implementation, i.e. rnᴴ Un−1, Un−1ᴴ Un−1 and Un−1ᴴ rn rnᴴ Un−1. The proposed distributed Oja algorithm introduces three average consensus steps for the latter terms, each involving several rounds. At the end of this step, each sensor node is able to update Un(i). In contrast to [147], [102], we propose a distributed Oja algorithm to estimate the principal components of M in a general setting, without explicitly giving a model for the observations (Mn)n. In this thesis, we consider the following model. At each instant n, each node i observes some random noisy samples of the i-th row of the matrix Mn. Each node i sends and/or receives variables from other nodes j in the network, chosen at random (contrary to [147], which considers fixed links). The matrix products involved in the Oja update, i.e. Mn Un−1 and Un−1ᵀ Mn Un−1, are performed via an asynchronous communication model different from the average consensus model [31] required in [147], [102].
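The centralized recursion (1.15) on which all of these variants build can be sketched as follows for p = 1, with noisy observations Mn = M + Ξn; the matrix, noise level and step sizes are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
A = rng.normal(size=(N, N))
M = A @ A.T / N               # unknown symmetric PSD matrix to be estimated

U = rng.normal(size=N)
U /= np.linalg.norm(U)
for n in range(1, 20001):
    gamma = 1.0 / (n + 10) ** 0.7            # decreasing step size
    Xi = 0.1 * rng.normal(size=(N, N))
    Mn = M + (Xi + Xi.T) / 2                 # noisy observation M_n = M + Xi_n
    # Oja update (1.15) with p = 1: U^T M_n U is a scalar here.
    U = U + gamma * (Mn @ U - U * (U @ Mn @ U))

# The iterate aligns with the principal eigenvector of M (up to sign).
lam, vecs = np.linalg.eigh(M)
u1 = vecs[:, -1]
print(min(np.linalg.norm(U - u1), np.linalg.norm(U + u1)) < 0.1)
```

The distributed question addressed in this thesis is precisely how node i can compute its entry of Mn @ U and the scalar U @ Mn @ U without global communication.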
We define at each sensor i two random sequences yn(i) and zn(i) as unbiased estimates of ∑j M(i, j) Un−1(j) and (Un−1ᵀ Mn Un−1), respectively. Besides, we introduce a projection step at each iteration n that forces Un to remain in a compact set, in order to avoid instabilities of the sequence (Un)n. The convergence of the proposed algorithm is analyzed in the asymptotic regime where n tends to infinity. Although the implementation and the objective are different from those of (1.11), both are related through the Robbins-Monro framework. Thus, the convergence analysis of the sequence generated by the proposed algorithm involves the existence of a mean field function h(U) whose roots correspond to the underlying eigenspace of M, i.e. points U such that h(U) = 0 verify (1.12). Hence, similarly to [122], [29], the convergence analysis mainly consists in addressing: the stability of Un, the definition of h(U), and its set of roots {h(U) = 0}. Next, we investigate the application of our algorithm to self-localization in wireless sensor networks, since numerical results can be provided from real data.

Application: self-localization in wireless sensor networks

In signal processing, an interesting motivation to design a distributed, asynchronous and on-line version of Oja's algorithm (1.15) relies on its application to the localization problem in wireless sensor networks (see [57], [27], [143], [24], [92], [41]). The theory of multidimensional scaling (MDS) [95] deals with the following general problem: find an embedding configuration of N objects when only similarity/distance data are available. In particular, the method referred to as classical MDS [27, Chapter 12] considers Euclidean distances between N positions in a coordinate space of dimension p. In that case, classical MDS performs the PCA (1.12) of M defined as follows:

M = −(1/2) J⊥ D J⊥   (1.16)

where matrix D contains the squared distances and J⊥ = I − (1/N) 11ᵀ.
In the WSN context, classical MDS (also known as MDS-MAP [143]) recovers the positions of a network formed by N sensor nodes (up to a rotation/translation/reflection). We denote by zi the position of any sensor node i and by z̄ the barycenter of the network, i.e. z̄ = (1/N) ∑i zi. In the Euclidean case, the entries of D are related to the sensor node positions by:

D(i, j) = ‖zi − zj‖² .   (1.17)

Then, plugging (1.17) into (1.16) yields M = ZZᵀ, where the i-th row of matrix Z coincides with zi − z̄. Hence, the PCA problem (1.12) applied to (1.16) within the WSN context reduces to finding the factorization M = ZZᵀ with Z = U Λ^{1/2} ∈ R^{N×p} (usually p = 2 or p = 3). Each recovered node position is given by Z(i) = (√λ1 u1(i), ..., √λp up(i)). The centralized localization approach introduced by [143] (with a theoretical analysis in [85]) involves two main steps: first, obtain the squared pairwise distances D(i, j) between the sensor nodes and compute (1.16) (also referred to as double-centering); second, find the p principal components of M. In wireless sensor networks, the direct acquisition of D is not possible. Thus, distances must be estimated from other available measurements depending on the electronics of the sensor node devices, e.g. received signal strength indicator (RSSI), time-of-arrival (TOA) or angle-of-arrival (AOA) (see [126], [105]). In this thesis, we focus on RSSI-based techniques since the wireless sensor nodes considered in our experiments are issued from the FIT IoT-LAB platform [1]. We define an unbiased estimator of the squared distance based on the standard parametric Log-Normal Shadowing Model (LNSM) of [136]. Besides, the PCA step of our approach involves a distributed version of Oja's algorithm (1.15) along with an observation model that enables each sensor node to estimate the i-th row of Mn.

1.4 Thesis outline

Figure 1.1 illustrates the organization of this thesis.
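The centralized classical MDS pipeline described above — squared distances (1.17), double-centering (1.16), then PCA — can be sketched as follows (numpy, with randomly drawn illustrative positions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 2
Z_true = rng.uniform(0, 10, size=(N, p))  # unknown node positions

# Squared pairwise distances (1.17) and double-centering (1.16).
D = ((Z_true[:, None, :] - Z_true[None, :, :]) ** 2).sum(axis=2)
J_perp = np.eye(N) - np.ones((N, N)) / N
M = -0.5 * J_perp @ D @ J_perp            # equals Z Z^T with Z the centered positions

# PCA step: keep the p principal eigenpairs, Z = U Lambda^{1/2}.
lam, U = np.linalg.eigh(M)                # eigh returns ascending eigenvalues
lam, U = lam[::-1][:p], U[:, ::-1][:, :p]
Z_hat = U * np.sqrt(lam)

# Positions are recovered only up to rotation/translation/reflection,
# so we compare the reconstructed distance matrix instead of raw coordinates.
D_hat = ((Z_hat[:, None, :] - Z_hat[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(D, D_hat, atol=1e-6))
```

The thesis replaces both steps with their distributed, on-line counterparts: noisy RSSI-based estimates of D(i, j) and a distributed Oja iteration for the PCA.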
The thesis is divided into two parts according to the two different applications addressed in our work. Chapter 2 investigates the problem of distributed stochastic approximation in multi-agent systems. The algorithm under study consists of two steps: a local stochastic approximation step and a diffusion step which drives the network to a consensus. The diffusion step uses row-stochastic matrices to weight the network exchanges. As opposed to previous works, exchange matrices are not supposed to be doubly stochastic, and may also depend on the past estimate.

Figure 1.1. Scheme of the framework realized in the present thesis and relations between the chapters (Chapter 2: consensus algorithms and the gradient descent algorithm; Chapter 3: principal component analysis and Oja's algorithm; Chapter 4: localization; Appendix A: the Expectation-Maximization algorithm and parameter estimation).

We prove that non-doubly stochastic matrices generally influence the limit points of the algorithm. Nevertheless, the limit points are not affected by the choice of the matrices provided that the latter are doubly stochastic in expectation. This conclusion legitimates the use of broadcast-like diffusion protocols, which are easier to implement. Next, by means of a central limit theorem, we prove that doubly stochastic protocols perform asymptotically as well as centralized algorithms, and we quantify the degradation caused by the use of non-doubly stochastic matrices. Throughout the chapter, a special emphasis is put on the special case of distributed non-convex optimization as an illustration of our results. Chapter 3 addresses the problem of asynchronous distributed Principal Component Analysis (PCA). We provide two algorithms coping with different situations according to the underlying graph structure. A general enough framework allows us to analyze all these algorithms at the same time.
Convergence is proved with probability one under suitable assumptions, and numerical experiments illustrate the good behavior of the algorithms. The proposed framework allows us to address, in the following chapter, the problem of self-localization in Wireless Sensor Networks (WSNs). Chapter 4 considers the localization problem in wireless networks formed by fixed nodes. Each node seeks to estimate its own position based on noisy measurements of the relative distance to other nodes. We assume that sensor nodes are able to obtain RSSI (Received Signal Strength Indicator) measurements, which are related to the Euclidean distance by a Log-Normal Shadowing Model (LNSM). In a centralized batch mode, positions can be retrieved (up to a rigid transformation) by PCA on a so-called similarity matrix built from the relative distances. In this chapter, we propose a distributed on-line algorithm allowing each node to estimate its own position based on limited exchange of information in the network. Our framework encompasses the case of sporadic measurements and random link failures. We prove the consistency of our algorithm using a convergence analysis similar to that of the previous chapter. We also include a refinement step based on a consensus algorithm (Chapter 2) in order to improve the accuracy of the estimated positions. Finally, we provide numerical and experimental results from both simulated and real data. Experiments on real data are conducted on a wireless sensor network testbed. We defer to the appendix the proofs not included in Chapter 2 related to the analysis of consensus algorithms (see Appendix D). Appendix C gives additional numerical analysis on the choice of some known gossip communication protocols for consensus, and in particular for distributed optimization. We also include two of the conference papers that stem from joint works within our department.
It is worth noting that, throughout this thesis, our contributions comprise a rigorous part devoted to obtaining theoretical results and a part devoted to more specific and concrete applications arising from current research topics. Indeed, the first algorithm based on distributed stochastic approximation, designed at the beginning of this thesis, is reported in Appendix A. We introduced a novel on-line Distributed Expectation-Maximization (DEM) algorithm for latent data models including Gaussian mixtures as a special case. A second algorithm, originating from a collaboration within the machine learning and Big Data framework, is presented in Appendix B. We proposed an on-line learning gossip algorithm (OLGA) devoted to binary classification in a distributed setting.

1.5 Publications

The contributions of this work led to several results which have been presented in both international and national meetings. They are enumerated below.

Journal papers
1. G. Morral and P. Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", submitted to Elsevier journal on Signal Processing, February 2015, arXiv:1503.05298.
2. G. Morral, P. Bianchi and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", submitted to IEEE Transactions on Signal Processing journal, October 2014, arXiv:1410.6956.

International conference papers with proceedings
1. G. Morral, P. Bianchi* and G. Fort, "Success and failure of adaptation-diffusion algorithms for consensus in multi-agent networks", the 53rd IEEE Conference on Decision and Control (CDC), Los Angeles, USA, December 2014.
2. G. Morral* and N.A. Dieng, "Cooperative RSSI-based indoor localization: B-MLE and distributed stochastic approximation", the 80th IEEE Vehicular Technology Conference (VTC2014-Fall), Vancouver, Canada, September 2014.
3. G. Morral*, N.A. Dieng and P.
Bianchi, "Distributed on-line multidimensional scaling for self-localization in wireless sensor networks", the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1110-1114, Florence, Italy, May 2014.
4. P. Bianchi, S. Clémençon, J. Jakubowicz and G. Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", the 1st IEEE International Conference on Big Data (BigData), pp. 6-14, Santa Clara, USA, October 2013.
5. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité" [Distributed stochastic approximation: the cost of non-double-stochasticity], the 24th National Conference on Signal and Image Processing (GRETSI), Brest, September 2013.
6. G. Morral, P. Bianchi, and J. Jakubowicz*, "Asynchronous distributed principal component analysis using stochastic approximation", the 51st Annual Conference on Decision and Control (CDC), pp. 1398-1403, Maui, Hawaii, December 2012.
7. G. Morral, P. Bianchi*, G. Fort and J. Jakubowicz, "Distributed stochastic approximation: the price of non-double stochasticity", invited paper, the 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), pp. 1473-1477, California, USA, November 2012.
8. G. Morral*, P. Bianchi and J. Jakubowicz, "Gossip-based online distributed expectation maximization", the 2012 IEEE Statistical Signal Processing Workshop (SSP), pp. 305-308, Ann Arbor, USA, August 2012.

National conferences without proceedings
1. G. Morral*, "A study of distributed algorithms for stochastic approximation in wireless sensor networks", presentation of the results related to this thesis during the 4th annual Meeting of the research work granted by the Futur & Ruptures program organized by Fondation Télécom, as candidate for the Thesis Prizes 2015, March 2015.
2. P. Bianchi, S. Clémençon, J. Jakubowicz and G.
Morral*, "On-line learning gossip algorithm (OLGA) in multi-agent systems with local decision rules", poster presented at the 3rd Seminar on Digital Technologies: Scale and Complexity organized by Institut Mines-Télécom, March 2014.
3. G. Morral*, P. Bianchi, G. Fort and J. Jakubowicz, "Approximation stochastique distribuée: le coût de la non-bistochasticité" [Distributed stochastic approximation: the cost of non-double-stochasticity], poster presented at the 2nd annual Meeting of the research work granted by the Futur & Ruptures program organized by Fondation Télécom, January 2013.

Part I: Consensus algorithms

Chapter 2: Success and failure of adaptation-diffusion algorithms

The first part of this thesis is devoted to the convergence analysis of consensus algorithms in multi-agent systems. The objective of the network is to find an agreement on the estimated value when the environment is partially unknown and only local information is available at each agent. In particular, we focus on distributed algorithms based on adaptation and diffusion schemes. The general algorithm consists of two steps: a local stochastic approximation step and a diffusion step which drives the network to an agreement. The diffusion step uses row-stochastic matrices to weight the network exchanges. As opposed to previous works, exchange matrices are not supposed to be doubly stochastic, and may also depend on the past estimate. We prove that non-doubly stochastic matrices generally influence the limit points of the algorithm. Nevertheless, the limit points are not affected by the choice of the matrices provided that the latter are doubly stochastic in expectation. This conclusion legitimates the use of broadcast-like diffusion protocols, which are easier to implement. Next, by means of a central limit theorem, we prove that doubly stochastic protocols perform asymptotically as well as centralized algorithms, and we quantify the degradation caused by the use of non-doubly stochastic matrices.
Throughout this chapter, a special emphasis is put on the special case of distributed non-convex optimization as an illustration of our results. Appendix A provides an application of this case related to parameter estimation: we design a distributed version of the Expectation-Maximization algorithm for exponential families. We first introduce some useful notation that enables us to define, and later analyze, the algorithms under study in this chapter.

Notation:
- N : positive integer
- x, y, ... : column vectors in R^{dN}
- |x| : Euclidean norm of x, such that |x|² = ∑i |x(i)|²
- ‖A‖, r(A) : spectral norm and spectral radius of matrix A
- A ⊗ B : Kronecker product between matrices A and B
- 1 : N × 1 vector with all entries equal to one
- I_N : N × N identity matrix
- J : orthogonal projector onto the linear span of 1 (consensus space)
- J⊥ : projection matrix orthogonal to J, such that J⊥ = I_N − J (disagreement space)
- P, E : probability and associated expectation operators on a measurable space

2.1 Introduction

2.1.1 Context and goal

During the last thirty years, distributed stochastic approximation has been addressed using different cooperative approaches. In the so-called incremental approach (see for instance [131, 133]) a message containing an estimate of the quantity of interest iteratively travels all over the network. In this chapter we focus on another cooperative approach, based on average consensus techniques, where the estimates computed locally by each agent are combined through the network. This idea traces back to [155], where a network of processors seeks to optimize some objective function known by all agents (possibly up to some additive noise). In our context we consider a network composed of N agents, or nodes. Agents seek to find a consensus on some global parameter by means of local observations and peer-to-peer communications. The aim in this chapter is to analyze the asymptotic behavior of the following distributed algorithm. Agent i (i = 1, .
. . , N) generates an Rᵈ-valued stochastic process (θn,i)n≥0. At time n, the update is obtained in two steps:

[Local step] Node i generates a temporary iterate θ̃n,i given by

θ̃n,i = θn−1,i + γn Yn,i ,   (2.1)

where γn is a deterministic positive step size and where the Rᵈ-valued random process (Yn,i)n≥1 represents the observations made by agent i.

[Gossip step] Agent i is able to observe the values θ̃n,j of some other agents j and computes the weighted average as follows:

θn,i = ∑_{j=1}^{N} wn(i, j) θ̃n,j ,   (2.2)

where the wn(i, j)'s are scalar non-negative random coefficients such that ∑_{j=1}^{N} wn(i, j) = 1 for any i. The sequence of random matrices Wn := [wn(i, j)]_{i,j=1}^{N} represents the time-varying communication network between the nodes. One simply sets wn(i, j) = 0 whenever nodes i and j are unable to communicate at time n. The aim of this chapter is to investigate the almost sure (a.s.) convergence of this algorithm as n tends to infinity, as well as the convergence rate. Our goal is in particular to quantify the effect of the sequence of matrices (Wn)n≥1 on the convergence. The algorithm is initialized at some arbitrary Rᵈ-valued vectors θ0,1, ..., θ0,N. The random variables Wn ∈ R^{N×N} and Yn := (Yn,1ᵀ, ..., Yn,Nᵀ)ᵀ ∈ R^{dN}, n ≥ 1, are defined on the same measurable space equipped with P and E. For any n ≥ 1, define the σ-field Fn := σ(θ0, W1, ..., Wn, Y1, ..., Yn), where θ0 is the (possibly random) initial point of the algorithm. It is assumed that for any i ∈ {1, ..., N}, (θn,i)n≥0 satisfies the update equations (2.1)-(2.2); and we set θn := (θn,1ᵀ, ..., θn,Nᵀ)ᵀ.

2.1.2 Related works on distributed optimization

Many recent applications related to statistical data processing and machine learning can be handled within the framework of distributed optimization. We may refer to applications such as: network control and coordination (e.g.
target or trajectory tracking [132], [123], power and resource allocation [108], [23]), big data processing (e.g. classifier training [151], [160]) or environmental monitoring in sensor networks (e.g. parameter estimation [135], [144]). The algorithm (2.1)-(2.2) under study is not new. The idea behind the algorithm traces back to [155, 156], where a network of processors seeks to optimize some objective function known by all agents (possibly up to some additive noise). More recently, numerous works extended this kind of algorithm to more involved multi-agent scenarios, see [97, 103, 117, 87, 144, 43, 19, 21, 23, 114] as a non-exhaustive list. In this context, one seeks to minimize a sum of local private cost functions fi of the agents:

min_θ ∑_{i=1}^{N} fi(θ) ,   (2.3)

where for all i, the function fi is supposed to be unknown to any other agent j, j ≠ i. To address this question, it is assumed that

Yn,i = −∇fi(θn−1,i) + ξn,i   (2.4)

where ∇ is the gradient operator and ξn,i represents some random perturbation which possibly occurs when observing the gradient. Hence, the distributed algorithm (2.1)-(2.2) is a distributed stochastic gradient algorithm. In this chapter, we handle the case where the functions fi are not necessarily convex. Of course, in that case, there is generally no hope to ensure convergence to a minimizer of (2.3). Instead, a more realistic objective is to reach critical points of the objective function, i.e. points θ such that ∑i ∇fi(θ) = 0. In a machine learning context, fi is typically the risk of a classifier indexed by θ (for more details we refer to [107, 65, 32, 6]). The problem of finding the optimal vector quantizer is addressed in [125] by minimizing a non-convex cost function called the distortion. [125] proposes a distributed and on-line implementation of the k-means algorithm, named the competitive learning vector quantization (CLVQ) algorithm and based on stochastic approximation.
The consistency of the algorithm is proved under suitable assumptions such as row-stochastic matrices and asynchronous weights: the trajectories of the agents reach an asymptotic consensus a.s. and the corresponding agreement vector converges a.s. towards one of the random connected components of the set of critical points. [43], [144] restrict their analysis by considering a linear regression model for the observations and the case of common quadratic functions for the agents. [43] studies the mean square error performance of a distributed stochastic approximation algorithm based on a deterministic diffusion scheme; it is shown that the error variance is bounded and that convergence is achieved in the noise-free case. In [144] these results are obtained when considering, in addition, i.i.d. random noise. In the field of stochastic cooperative games, the work of [13] is focused on the a.s. convergence of bargaining processes when they are allocated in a distributed manner. The proposed algorithm iteratively generates a sequence through two steps: a combining step involving a doubly stochastic time-varying random matrix through which the agents communicate, and a local projection step onto a closed and convex set. The results are: the a.s. convergence towards zero of the nonlinear error due to the projection, and the a.s. convergence of the network towards the sought allocation. Regarding works on statistical data inference, there is a rich literature on distributed estimation and optimization algorithms, see [26], [103], [87], [38], [117], [144] as a non-exhaustive list. Among the first gossip algorithms are those considered in the treatise [18] and in [156]. The case where the gossip matrices are random and the observations are noiseless is considered in [31]. The authors of [117] solve a constrained optimization problem by also using noiseless estimates. The contributions [38] and [144] consider the framework of linear regression models.
In [134], stochastic gradient algorithms are considered in the case where the matrices (Wn)n are doubly stochastic, i.e. Wn 1 = Wnᵀ 1 = 1. This contribution assumes in addition that the gradients are bounded and considers rather stringent assumptions on the conditional variances of the observation noises. Convergence to a global minimizer is shown in [116] assuming convex utility functions and bounded (sub)gradients. The results of [116] are extended in [134] to the stochastic descent case, i.e. when the observation of the utility functions is perturbed by a random noise. More recently, [19] investigated distributed stochastic approximation at large, providing stability conditions for the algorithm (2.1)-(2.2) while relaxing the bounded gradient assumption and including the case of random communication links. In [19], it is also proved under some hypotheses that the estimation error is asymptotically normal: the convergence rate and the asymptotic covariance matrix are characterized. An enhanced averaging algorithm à la Polyak is also proposed to recover the optimal convergence rate. Note that none of the works cited above takes into account the case where (Wn)n depend on the observations (Yn)n in their convergence analysis.

Doubly and non-doubly stochastic matrices. In most works (see for instance [116, 134]), the matrices (Wn)n≥1 are assumed doubly stochastic, meaning that Wnᵀ 1 = Wn 1 = 1, where 1 is the N × 1 vector whose components are all equal to one and where ᵀ denotes transposition. Although row-stochasticity (Wn 1 = 1) is rather easy to ensure in practice, column-stochasticity (Wnᵀ 1 = 1) implies more stringent restrictions on the communication protocol. For instance, in [31], each one-way transmission from an agent i to another agent j requires at the same time a feedback link from j to i. As a matter of fact, double stochasticity prevents from using
natural broadcast schemes, in which a given node may transmit its local estimate to all neighbors without expecting any immediate feedback. Remarkably, although generally assumed, double stochasticity of the matrices Wn is in fact not mandatory. A couple of works (see e.g. [112, 19]) get rid of the column-stochasticity condition, but at the price of assumptions that may not always be satisfied in practice. Other works ([114, 153]) manage to circumvent the use of feedback links by coupling the gradient descent with the so-called push-sum protocol [89]. The latter however introduces an additional communication of weights in the network in order to keep track of some summary of the past transmissions. As a consequence, we address the following questions: What conditions on the sequence (Wn)n≥1 are needed to ensure that Algorithm (2.1)-(2.2) drives all agents to a common critical point of ∑i fi? What happens if these conditions are not satisfied? How is the convergence rate influenced by the communication protocol?

2.1.3 Contributions

We provide the following contributions, which answer the previous questions in both a qualitative and a quantitative manner.

• Assuming that (Wn)n≥1 forms an i.i.d. sequence of stochastic matrices, we prove under some technical hypotheses that Algorithm (2.1)-(2.2) leads the agents to a consensus, which is characterized. It is shown that the latter consensus does not necessarily coincide with a critical point of ∑i fi. We also provide an augmented algorithm which allows one to recover the sought points.

• We provide sufficient conditions, either on the communication protocol (Wn)n≥1 or on the functions fi, which ensure that the limit points are the critical points of ∑i fi. When such conditions are not satisfied, we also propose a simple modification of the algorithm which allows one to recover the sought behavior.
• We extend our results to a broader setting, assuming that the matrices (Wn)n≥1 are no longer i.i.d., but are allowed to depend on both the current observations and the past estimates. We also investigate a general stochastic approximation framework which goes beyond the model (2.4) and beyond the sole problem of distributed optimization.

• We characterize the convergence rate of the algorithm in the form of a central limit theorem. Unlike [19], we address the case where the sequence (Wn)n≥1 is not necessarily doubly stochastic. We show that non-doubly stochastic matrices have an influence on the asymptotic error covariance (even if they are doubly stochastic in average). On the other hand, we prove that when the matrix Wn is doubly stochastic for all n, the asymptotic covariance is identical to the one obtained in a centralized setting.

The chapter is organized as follows: Section 2.2 is a gentle presentation of our results in the special case of distributed optimization (see (2.3)), assuming in addition that the sequence (Wn) is independent and identically distributed (i.i.d.). In Section 2.3 we provide the general setting for the almost sure convergence, which is studied in Section 2.4.1. Section 2.5 investigates convergence rates. Conclusions and numerical results in Section 2.7 and Appendix C complete the discussion on the topic. Proofs are given in a small part in Section D.1 but are mostly deferred to Appendix D.

2.2 Distributed optimization

We first sketch our result in the special case of distributed optimization, i.e., when the "innovation" Yn,i of the algorithm in (2.1) has the form (2.4). For simplicity, the matrix-valued process Wn will be assumed i.i.d. and independent of both processes Yn and θn. This assumption will be relaxed in Section 2.3.

2.2.1 Framework

In this section, we consider the case when Yn,i satisfies (2.4), with the following assumptions.

Assumption 2.1.
1.
fi : Rd → R is differentiable and ∇fi is locally Lipschitz continuous.
2. For any Borel set A of RdN, P[ξn+1 ∈ A | Fn] = νθn(A) almost surely (a.s.), where (νθ)θ∈RdN is a family of probability measures such that
(a) ∫ z dνθ(z) = 0 and,
(b) supθ∈K ∫ |z|² dνθ(z) < ∞ for any compact set K ⊂ RdN.

Assumption 2.2.
1. For any n ≥ 0, conditionally to Fn, Yn+1 and Wn+1 are independent.
2. (Wn)n≥1 is an i.i.d. sequence of row-stochastic matrices (i.e., Wn1 = 1 for any n) with non-negative entries.
3. The spectral radius of the matrix E[W1T J⊥ W1] is strictly lower than 1.

The row-stochasticity assumption is a rather mild condition. It states that Σj wn(i, j) = 1 for any i, i.e., each node i computes a weighted average of the temporary updates of the nodes (with possibly some null weights). In many works, it is usually also assumed that Wn is column-stochastic, i.e., Σi wn(i, j) = 1 for any j. Our weaker framework addresses more general gossip protocols, usually less demanding in terms of scheduling and overall network coordination. Assumption 2.2-3) is a contraction condition which is required to drive the network to a consensus.

Assumption 2.3. The deterministic step-size sequence (γn)n≥1 satisfies γn > 0 and:
1. limn γn+1/γn = 1,
2. Σn γn = +∞ and Σn γn^(1+λ) < ∞ for some λ ∈ (0, 1),
3. Σn |γn − γn−1| < ∞.

Polynomially decreasing sequences γn ∼ γ⋆/n^a as n → ∞, for some a ∈ (1/2, 1] and γ⋆ > 0, satisfy Assumption B.1. Finally, we introduce a stability-like condition.

Assumption 2.4. Almost surely, there exists a compact set K of RdN such that θn ∈ K for any n ≥ 0.

Assumption 2.4 states that the sequence (θn)n≥0 remains in a compact set, where this compact set may depend on the path. It is implied by the stronger assumption "there exists a compact set K of RdN such that with probability one, θn ∈ K for any n ≥ 0". Checking Assumption 2.4 is not always an easy task.
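Before moving on, note that the step-size conditions of Assumption 2.3 are easy to probe numerically for the polynomial sequences just mentioned. The sketch below uses illustrative values a = 0.7, γ⋆ = 0.1 and λ = 0.6 (so that a(1 + λ) = 1.12 > 1, making Σ γn^(1+λ) a convergent p-series); it is a sanity check, not a proof.

```python
# Numerical probe (not a proof) of Assumption 2.3 for gamma_n = gamma_star / n**a.
# a = 0.7, gamma_star = 0.1 and lambda = 0.6 are illustrative choices.
def gamma(n, gamma_star=0.1, a=0.7):
    return gamma_star / n ** a

# 1) gamma_{n+1} / gamma_n -> 1
assert abs(gamma(10**6 + 1) / gamma(10**6) - 1.0) < 1e-5

# 2) sum gamma_n diverges: far-out blocks still contribute substantial mass,
#    while the tail of sum gamma_n**(1 + 0.6) is already negligible
block = sum(gamma(n) for n in range(20001, 40001))
assert block > 1.0
head = sum(gamma(n) ** 1.6 for n in range(1, 10001))
tail = sum(gamma(n) ** 1.6 for n in range(10001, 20001))
assert tail < 0.1 * head

# 3) sum |gamma_n - gamma_{n-1}| telescopes (monotone sequence), hence is finite
variation = sum(abs(gamma(n) - gamma(n - 1)) for n in range(2, 10001))
assert abs(variation - (gamma(1) - gamma(10000))) < 1e-9
```

The third check exploits monotonicity: for a decreasing sequence the total variation equals γ1 minus the last term, so it stays bounded.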
As the main scope of this paper is the analysis of convergence rather than stability, it is taken for granted: we refer to [19] for sufficient conditions implying stability.

2.2.2 Results

The statement of our convergence result is prefaced with the following lemma, which shows that the matrix W := E[W1] admits a unique left Perron eigenvector v; this vector will play a role in the characterization of the limiting points of the algorithm (2.1)-(2.2).

Lemma 2.1. Under Assumptions 2.2-2) and 2.2-3), the RN-valued vector v defined by vT := (1/N) 1T W (IN − J⊥ W)−1 is the unique non-negative vector satisfying vT = vT W and vT 1 = 1.

Proof. By Jensen's inequality, for any x ∈ RN, xT WT J⊥ W x ≤ xT E[W1T J⊥ W1] x. By Assumption 2.2-3), the spectral norm of J⊥ W is therefore strictly lower than one, and IN − J⊥ W is invertible. We prove the two following statements: (i) any vector w satisfying wT 1 = 1 and wT W = wT necessarily satisfies wT = (1/N) 1T W (IN − J⊥ W)−1; (ii) conversely, the vector w defined by wT = (1/N) 1T W (IN − J⊥ W)−1 satisfies wT 1 = 1 and wT W = wT. Together, the two statements imply that w = v. Moreover, since W is a stochastic matrix, its spectral radius is one and, by [81], there exists a non-negative vector w such that wT W = wT and 1T w > 0; upon normalization, statement (i) shows that this vector is v, so that v is indeed non-negative.

Let us start with the first statement. Let w ∈ RN satisfy wT 1 = 1 and wT W = wT. Then

wT = wT W = wT W − wT 1 (1T/N) W + (1T/N) W = wT J⊥ W + (1/N) 1T W ,

which yields wT (IN − J⊥ W) = (1/N) 1T W, and thus wT = (1/N) 1T W (IN − J⊥ W)−1 = vT.

We now prove the second statement: let wT := (1/N) 1T W (IN − J⊥ W)−1. This definition implies that wT (IN − J⊥ W) = (1/N) 1T W.
By multiplying this equality on the right by the vector 1, using the row-stochasticity of W (W1 = 1) and the fact that J⊥1 = 0, it follows that wT 1 = (1/N) 1T W 1 = 1. In the same way, one could apply the matrix inversion lemma (see [81]) twice to show directly that (IN − J⊥ W)−1 1 = 1, as follows:

(IN − J⊥ W)−1 1 = (IN + J⊥ (IN − W J⊥)−1 W) 1 = 1 + J⊥ (IN − W J⊥)−1 1 = 1 + J⊥ (IN + W (IN − J⊥ W)−1 J⊥) 1 = 1 ,

where the last equality uses J⊥ 1 = 0. Secondly, we verify that wT W = wT. Using the equality wT (IN − J⊥ W) = (1/N) 1T W together with the condition wT 1 = 1 just verified,

wT W = wT W − wT 1 (1T/N) W + (1/N) 1T W = wT J⊥ W + wT (IN − J⊥ W) = wT .

This proves the second statement and concludes the proof.

If A is a set, we say that (xn)n converges to A if inf{|xn − y| : y ∈ A} tends to zero as n → ∞.

Theorem 2.1. Let Assumptions 2.1, 2.2, B.1 and 2.4 hold true. Define the function V : Rd → R by

V(θ) := Σ_{i=1}^N vi fi(θ)    (2.5)

where v = (v1, ..., vN) is the vector defined in Lemma 2.1. Assume that the set L = {θ ∈ Rd | ∇V(θ) = 0} of critical points of V is non-empty and included in some level set {θ : V(θ) ≤ C}, and that V(L) has an empty interior. Assume also that the level sets {θ : V(θ) ≤ C} are either empty or compact. The following holds with probability one:
1. The algorithm converges to a consensus, i.e., limn→∞ maxi,j |θn,i − θn,j| = 0.
2. The sequence (θn,1)n≥0 converges to L as n → ∞.

Theorem 2.1 is proved in Appendix D.1. Its proof consists in showing that it is a special case of the more general convergence result given by Theorem B.1.

2.2.3 Success and failure of convergence

The algorithm converges to L, which in general is not the set of critical points of θ ↦ Σi fi(θ). We discuss some special cases where both sets actually coincide.

Scenario 1.
All functions fi are strictly convex and admit a (unique) common minimizer θ⋆. This case is for instance investigated by [43] in the framework of statistical estimation in wireless sensor networks. In this scenario, we may assume without loss of generality that fi(θ) ≥ fi(θ⋆) = 0 for all i (note that Algorithm (2.1)-(2.2) is not modified when fi is translated). Since vi ≥ 0, V is a non-negative strictly convex function such that V(θ⋆) = 0. Therefore, the set of minimizers of V is {θ⋆}. On the other hand, since V is convex, L is the set of minimizers of V. This implies that the set L is formed by the minimizers of Σi fi. Relaxing strict convexity, note that when the functions fi are just convex with a common minimizer and vi > 0 for any i, then L is again formed by the minimizers of Σi fi, and the same conclusion holds.

Scenario 2. W is column-stochastic, i.e., 1T W = 1T.

In this case, the vector v given by Lemma 2.1 is (1/N)1. Consequently, V = (1/N) Σi fi. Here again, L is the set of minimizers of Σi fi. An example of a random communication protocol satisfying 1T W = 1T is the following: at time n, a single node i wakes up at random with probability pi and broadcasts its temporary update θ̃n,i to all its neighbors Ni. Any neighbor j computes the weighted average θn,j = β θ̃n,i + (1 − β) θ̃n,j. On the other hand, any node k which does not belong to the neighborhood of i (including i itself) sets θn,k = θ̃n,k. Then, given that i wakes up, the (k, ℓ)-th entry of Wn is given by:

wn(k, ℓ) =
  1       if k ∉ Ni and k = ℓ ,
  β       if k ∈ Ni and ℓ = i ,
  1 − β   if k ∈ Ni and k = ℓ ,
  0       otherwise.

Here, Wn is not doubly stochastic. However, when nodes wake up according to the uniform distribution (pi = 1/N for all i), it is easily seen that 1T E[Wn] = 1T.

2.2.4 Enhanced algorithm with weighted step sizes

We end this section with a simple modification of the initial algorithm in the case where vi > 0 for all i.
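As a side illustration of Scenario 2 above, the broadcast gossip matrices are easy to instantiate. The sketch below uses illustrative choices (a 5-node ring graph and β = 1/2) and checks that each realization Wn is row- but not column-stochastic, while E[Wn] under uniform wake-up probabilities is doubly stochastic.

```python
import numpy as np

# Broadcast gossip of Scenario 2: node i wakes up and broadcasts; each
# neighbour j averages theta_j <- beta*theta_i + (1-beta)*theta_j.
# The 5-node ring graph and beta = 1/2 are illustrative choices.
N, beta = 5, 0.5
neighbors = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}

def broadcast_matrix(i):
    W = np.eye(N)
    for k in neighbors[i]:
        W[k, k] = 1 - beta   # neighbour keeps (1-beta) of its own update
        W[k, i] = beta       # and takes beta of the broadcast value
    return W

Ws = [broadcast_matrix(i) for i in range(N)]
W_mean = sum(Ws) / N                          # uniform wake-up: p_i = 1/N

for W in Ws:
    assert np.allclose(W.sum(axis=1), 1)      # row-stochastic...
    assert not np.allclose(W.sum(axis=0), 1)  # ...but not column-stochastic
assert np.allclose(W_mean.sum(axis=0), 1)     # E[W_n] is doubly stochastic
```

The last assertion is the key point of Scenario 2: each realization fails column-stochasticity, yet the protocol is doubly stochastic in average.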
Let us replace the local step (2.1) of the algorithm by

θ̃n,i := θn−1,i + γn vi−1 Yn,i    (2.6)

where Yn,i is still given by (2.4). As an immediate corollary of Theorem 2.1, the algorithm (2.6)-(2.2) drives the agents to a consensus located in the set of critical points of Σi fi. Of course, this modification requires each node i to have some prior knowledge of the communication protocol through the coefficient vi (in that case, questions related to a distributed computation of the vi's would be of interest, but are beyond the scope of this paper).

2.3 Distributed Robbins-Monro algorithm: general setting

In this section, we consider the general setting described by Algorithm (2.1)-(2.2) with weaker conditions on the distribution of the observations Yn. We also weaken the assumptions on the conditional distribution of (Yn+1, Wn+1) given the past behavior of the algorithm Fn: our general framework includes the case when the communication protocol is adapted at each time n and takes into account the network observations. We denote by M1 the set of N × N non-negative row-stochastic matrices and we endow M1 with its Borel σ-field.

Assumption 2.5.
1. There exists a collection of distributions (µθ)θ∈RdN on RdN × M1 such that almost surely, for any Borel set A,

P[(Yn+1, Wn+1) ∈ A | Fn] = µθn(A) .

In addition, the application θ ↦ µθ(A) defined on RdN is measurable for any A in the Borel σ-field of RdN × M1.
2. For any compact set K ⊂ RdN, supθ∈K ∫ |y|² dµθ(y, w) < ∞.

Assumption 2.5-1) means that the joint distribution of the r.v.'s Yn+1 and Wn+1 depends on the past Fn only through the last value θn of the vector of estimates. It also implies that Wn is almost surely (a.s.) non-negative and row-stochastic.
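In this setting the update (2.1)-(2.2) can be simulated directly in its stacked form θn = Wn(θn−1 + γn Yn) (this matrix form is made precise in Section 2.4). The following minimal sketch, with d = 1 and illustrative random data, performs one step and illustrates the decomposition of θn into its average and disagreement components.

```python
import numpy as np

# One step of Algorithm (2.1)-(2.2) in stacked form, d = 1 for readability,
# followed by the decomposition theta_n = 1*<theta_n> + J_perp theta_n.
# W, theta, gamma and Y are illustrative stand-ins.
rng = np.random.default_rng(1)
N = 4
W = rng.random((N, N))
W /= W.sum(axis=1, keepdims=True)          # row-stochastic: W 1 = 1
theta = rng.random(N)
gamma, Y = 0.05, rng.standard_normal(N)

theta = W @ (theta + gamma * Y)            # stacked update

avg = theta.mean()                         # <theta_n>, the average vector
J_perp = np.eye(N) - np.ones((N, N)) / N   # projector orthogonal to span(1)
disagreement = J_perp @ theta              # disagreement vector

# the two components recombine exactly, and the disagreement sums to zero
assert np.allclose(np.ones(N) * avg + disagreement, theta)
assert np.isclose(disagreement.sum(), 0.0)
```

The convergence analysis of Section 2.4 studies precisely these two components: the disagreement vector vanishes (consensus) while the average follows a stochastic approximation scheme.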
Since the variables (Yn+1, Wn+1) are not necessarily independent conditionally to the past Fn, and the (Wn)n≥1 are no longer i.i.d., the contraction condition on J⊥W1 is replaced with the following condition:

Assumption 2.6. For any compact set K ⊂ RdN, there exists ρK ∈ (0, 1) such that for all θ ∈ K, φ ∈ RdN and any dN × dN matrix A,

∫ (φ + Ay)T (w ⊗ Id)T J⊥ (w ⊗ Id)(φ + Ay) dµθ(y, w) ≤ ρK ∫ |φ + Ay|² dµθ(y, w) .

We provide some insight on the above condition. Assumption 2.6 is satisfied as soon as the spectral radius r(E[W1T J⊥ W1 | θ0, Y1]) is upper bounded, when θ0 ∈ K, by a constant independent of (θ0, Y1) and strictly lower than one. When (Wn)n≥1 is an i.i.d. sequence, independent of the sequence (Yn)n≥1 and of θ0, the above condition reduces to r(E[W1T J⊥ W1]) < 1.

2.4 Convergence analysis

For any vector x ∈ RdN of the form x = (x1T, ..., xNT)T where xi ∈ Rd, we define the vector of Rd

⟨x⟩ := (x1 + · · · + xN)/N = (1/N)(1T ⊗ Id) x .    (2.7)

We extend the notation to matrices X ∈ RdN×k as ⟨X⟩ = (1/N)(1T ⊗ Id)X. We also define J := J ⊗ Id and J⊥ := J⊥ ⊗ Id. Note that Jx = 1 ⊗ ⟨x⟩. Algorithm (2.1)-(2.2) can be written in matrix form as:

θn = Wn(θn−1 + γn Yn)  where  Wn = Wn ⊗ Id .    (2.8)

We decompose the estimate vector θn into two components, θn = 1 ⊗ ⟨θn⟩ + J⊥θn. In Section 2.4.1, we analyze the asymptotic behavior of the disagreement vector J⊥θn. The study of the average vector ⟨θn⟩ will be addressed in Section 2.4.2. These two sections are prefaced by a result which establishes the dynamics of these sequences. Set

φn := (1/γn+1) J⊥θn    (2.9)
αn := γn/γn+1 .    (2.10)

Lemma 2.2. Let (θn)n≥0 be the sequence given by (2.8). Assume that the (Wn)n≥0 are row-stochastic matrices. It holds that

⟨θn⟩ = ⟨θn−1⟩ + γn ⟨Wn(Yn + φn−1)⟩ ,    (2.11)
φn = αn J⊥ Wn(φn−1 + Yn) .    (2.12)

Proof. Since Wn is row-stochastic, JWnJ = J. Hence, Jθn = Jθn−1 + JWnJ⊥θn−1 + γn JWnYn.
It follows that Jθn = Jθn−1 + γn JWn(Yn + (1/γn) J⊥θn−1), which directly gives (2.11). By projecting θn onto the disagreement subspace, one has J⊥θn = J⊥Wn(θn−1 + γnYn). Since Wn is row-stochastic, J⊥Wn = J⊥WnJ⊥. Then, (2.12) follows.

2.4.1 Disagreement vector

We first begin with a technical lemma, proved in Appendix D.2.

Lemma 2.3. Let Assumptions B.1-1), 2.5 and 2.6 hold. Let (φn)n≥0 be the sequence given by (2.9). For any compact set K ⊂ RdN,

supn E[|φn|² 1{θj ∈ K, ∀j ≤ n−1}] < ∞ .

This lemma implies that for any compact set, there exists C such that for any n ≥ 0, E[|J⊥θn|² 1{θk ∈ Km, ∀k}] ≤ C γn+1².

Proposition 2.1 (Agreement). Let Assumptions B.1-1), B.1-2), 2.4, 2.5 and 2.6 hold. Then, almost surely, limn→∞ J⊥θn = 0.

Proof. Let (Km)m≥0 be an increasing sequence of compact subsets of RdN such that ∪m Km = RdN. Under Assumption 2.4, it is equivalent to prove that for any m ≥ 0,

limn J⊥θn 1{θk ∈ Km, ∀k} = 0 a.s.

Let m ≥ 0. Lemma 2.3 implies that there exists a constant C such that E[|J⊥θn|² 1{θk ∈ Km, ∀k}] ≤ C γn+1² for any n. By Assumption B.1-2), this implies that Σn E[|J⊥θn|² 1{θk ∈ Km, ∀k}] is finite; hence Σn |J⊥θn|² 1{θk ∈ Km, ∀k} is finite a.s., which yields limn |J⊥θn|² 1{θk ∈ Km, ∀k} = 0 a.s.

2.4.2 Average vector

We now study the long-time behavior of the average estimate ⟨θn⟩. Define for any θ ∈ RdN:

Wθ := ∫ (w ⊗ Id) dµθ(y, w) ,    (2.13)
zθ := ∫ (w ⊗ Id) y dµθ(y, w) ,    (2.14)

and let us assume the following regularity-in-θ properties of these quantities.

Assumption 2.7.
There exist λµ ∈ (1/2, 1] and, for any compact set K ⊂ RdN, a constant C > 0 such that for any θ, θ′ ∈ K,

‖Wθ − Wθ′‖ ≤ C |θ − θ′|^λµ ,    (2.15)
|Jzθ − JzJθ| ≤ C |J⊥θ|^λµ ,    (2.16)
|J⊥zθ − J⊥zθ′| ≤ C |θ − θ′|^λµ .    (2.17)

From (2.11) and Assumption 2.5-1), we have

⟨θn⟩ = ⟨θn−1⟩ + γn ⟨Wn(Yn + φn−1)⟩ = ⟨θn−1⟩ + γn E[⟨Wn(Yn + φn−1)⟩ | Fn−1] + γn Ξn

where (Ξn)n is a martingale-increment term and

E[⟨Wn(Yn + φn−1)⟩ | Fn−1] = ⟨zθn−1 + Wθn−1 φn−1⟩ .

Since limn (θn − 1 ⊗ ⟨θn⟩) = 0 almost surely, Assumption 2.7 implies that, roughly speaking,

⟨zθn−1 + Wθn−1 φn−1⟩ ≈ ⟨z1⊗⟨θn−1⟩ + W1⊗⟨θn−1⟩ φn−1⟩ .

In addition, from (2.12) and Assumption 2.5-1) again, the conditional distribution of φn given the past coincides with its conditional distribution given φn−1, and is of the form Pαn,θn−1(φn−1, ·), where Pα,θ is a Markov transition kernel (see Appendix D.3 for an explicit expression of this transition kernel). Each kernel Pα,θ possesses an invariant distribution πα,θ and is ergodic. Therefore, it is natural to define the mean field function h : Rd → Rd by

h(ϑ) = ⟨z1⊗ϑ + W1⊗ϑ m^(1)_{1⊗ϑ}⟩    (2.18)

where m^(1)_{1⊗ϑ} is the expectation of the invariant distribution π1,1⊗ϑ, given by (see Proposition D.2 in Appendix D.3)

m^(1)_θ := (IdN − J⊥Wθ)−1 J⊥zθ .

Note that under Assumption 2.6 this quantity is well defined since, for any compact K ⊂ RdN, supθ∈K ‖J⊥Wθ‖ ≤ √ρK. We establish the limiting behavior of the average sequence (⟨θn⟩)n by verifying the sufficient conditions for the convergence of stochastic approximation schemes given in [9, Theorems 2.2 and 2.3]. To that goal, we assume that there exists a Lyapunov function V for the mean field h.

Assumption 2.8.
1. h : Rd → Rd is continuous.
2. There exists a continuously differentiable function V : Rd → R+ such that
(a) there exists M > 0 such that L = {ϑ ∈ Rd : ∇V(ϑ)T h(ϑ) = 0} ⊂ {V ≤ M}.
In addition, V(L) has an empty interior;
(b) there exists M′ > M such that {V ≤ M′} is a compact subset of Rd;
(c) for any ϑ ∈ Rd \ L, ∇V(ϑ)T h(ϑ) < 0.

Assumptions 2.5, 2.6 and 2.7 imply that ϑ ↦ m^(1)_{1⊗ϑ} is continuous on Rd (see Proposition D.3 in Appendix D.3). Therefore, a sufficient condition for Assumption 2.8-1) is to strengthen the conditions (2.16)-(2.17) of Assumption 2.7 as follows: |zθ − zθ′| ≤ C |θ − θ′|^λµ.

Proposition 2.2. Let Assumptions B.1, 2.4, 2.5, 2.6, 2.7 and 2.8 hold true. Assume in addition that λ ≤ λµ, where λ and λµ are respectively given by Assumptions B.1 and 2.7. The average sequence (⟨θn⟩)n converges almost surely to a connected component of L.

The proof of Proposition 2.2 is given in Appendix D.4. It consists in verifying the assumptions of [9, Theorem 2].

2.4.3 Main convergence result

As a direct consequence of Propositions 2.1 and 2.2, we have:

Theorem 2.2. Let Assumptions B.1, 2.4, 2.5, 2.6, 2.7 and 2.8 hold true. Assume in addition that λ ≤ λµ, where λ and λµ are respectively given by Assumptions B.1 and 2.7. The following holds with probability one:
1. The algorithm converges to a consensus, i.e., limn→∞ J⊥θn = 0;
2. θn,1 converges to a connected component of L.

2.5 Convergence rate

2.5.1 Assumptions

We derive the rate of convergence of the sequence (θn)n≥0 to 1 ⊗ θ⋆ for some θ⋆ satisfying:

Assumption 2.9. θ⋆ is a root of h, i.e., h(θ⋆) = 0. Moreover, h is twice continuously differentiable in a neighborhood of θ⋆. The Jacobian ∇h(θ⋆) is a Hurwitz matrix. Denote by −L, L > 0, the largest real part of its eigenvalues.

The moment conditions on the conditional distributions of the observations Yn and the contraction assumption on the network have to be strengthened as follows:

Assumption 2.10. There exists τ ∈ (0, 2) such that for any compact set K ⊂ RdN,

supθ∈K ∫ |y|^(2+τ) dµθ(y, w) < ∞ .

Assumption 2.11. Let τ be given by Assumption 2.10.
For any compact set K ⊂ RdN, there exists ρ̃K ∈ (0, 1) such that for any φ ∈ RdN,

supθ∈K ∫ |((J⊥w) ⊗ Id) φ|^(2+τ) dµθ(y, w) ≤ ρ̃K |φ|^(2+τ) .

We also go further in the regularity-in-θ of the integrals with respect to µθ. More precisely:

Assumption 2.12. There exist λµ ∈ (1/2, 1] and, for any compact set K ⊂ RdN, a constant C such that
1. for any θ, θ′ ∈ K, |⟨zθ⟩ − ⟨zθ′⟩| ≤ C |θ − θ′|^λµ.
2. Set QA(x, y, w) := (x + y)T (w ⊗ Id)T J⊥ A J⊥ (w ⊗ Id)(x + y) for a dN × dN matrix A. For any θ, θ′ ∈ K, x ∈ RdN and any matrix A such that ‖A‖ ≤ 1,

|∫ QA(x, y, w) dµθ(y, w) − ∫ QA(x, y, w) dµθ′(y, w)| ≤ C |θ − θ′|^λµ (1 + |x|²) .

We finally have to strengthen the conditions on the step-size sequence.

Assumption 2.13. Let τ (resp. λµ) be given by Assumption 2.10 (resp. Assumption 2.12). As n → ∞, γn ∼ γ⋆/n^a for some a ∈ ((1 + λµ)−1 ∨ (1 + τ/2)−1, 1] and γ⋆ > 0. In addition, if a = 1 then γ⋆ > 1/(2L), where L is given by Assumption 2.9.

2.5.2 Main result

Define m⋆^(1) := (IdN − J⊥W1⊗θ⋆)−1 J⊥z1⊗θ⋆ and m⋆^(2) := (Id²N² − Φ⋆)−1 ζ⋆, where zθ is defined in (2.14) and

Φ⋆ := ∫ T(w) dµ1⊗θ⋆(y, w) ,
ζ⋆ := ∫ T(w) vec(yyT + 2 m⋆^(1) yT) dµ1⊗θ⋆(y, w) ,

and where we used the notation T(w) := ((J⊥w) ⊗ Id) ⊗ ((J⊥w) ⊗ Id). As will be seen in the proofs, m⋆^(1) and m⋆^(2) represent the asymptotic first order moment and (vectorized) second order moment of the r.v. φn defined by (2.9). Define also R⋆(w) := (w ⊗ Id) − W1⊗θ⋆ and υ⋆(y, w) := (w ⊗ Id)y − z1⊗θ⋆. Finally, define

A⋆ := (1/N)(1T ⊗ Id)(IdN + W1⊗θ⋆ (IdN − J⊥W1⊗θ⋆)−1 J⊥) ,
R⋆ := ∫ (R⋆(w) ⊗ R⋆(w)) dµ1⊗θ⋆(y, w) ,
T⋆ := ∫ (υ⋆(y, w) ⊗ R⋆(w)) dµ1⊗θ⋆(y, w) ,
S⋆ := ∫ vec(υ⋆(y, w) υ⋆(y, w)T) dµ1⊗θ⋆(y, w) .

We establish in Section D.5 the following result.

Theorem 2.3. Let Assumptions 2.5-1), 2.6, 2.7 and 2.9 to 2.13 hold true. Let U⋆ be the positive-definite matrix given by

vec U⋆ = (A⋆ ⊗ A⋆)(R⋆ m⋆^(2) + 2 T⋆ m⋆^(1) + S⋆) .

Then, conditionally to the event {limn θn = 1 ⊗ θ⋆}, the sequence (γn^(−1/2)(⟨θn⟩ − θ⋆))n≥0 converges in distribution to a zero-mean Gaussian distribution with covariance matrix V, where V is the unique positive-definite matrix satisfying

V ∇h(θ⋆)T + ∇h(θ⋆) V = −U⋆    if a < 1,
V (I + 2γ⋆∇h(θ⋆))T + (I + 2γ⋆∇h(θ⋆)) V = −2γ⋆ U⋆    if a = 1.

2.5.3 A special case: doubly stochastic matrices

In this paragraph, let us investigate the special case when the (Wn)n are N × N doubly stochastic matrices. Note that in this case, (2.11) becomes ⟨θn⟩ = ⟨θn−1⟩ + γn⟨Yn⟩ and the mean field function h is equal to h(ϑ) = ∫ ⟨y⟩ dµ1⊗ϑ(y, w). Along the event {limn θn = 1 ⊗ θ⋆}, it is therefore expected that U⋆ equals the covariance matrix of ⟨Y⟩ when Y ∼ µ1⊗θ⋆ (see e.g. [61, Theorem 2.2.12]). This is exactly what can be retrieved from Theorem 2.3, as shown below. Since Wn is column-stochastic, ∫ w dµ1⊗θ⋆(y, w) is column-stochastic, and we have A⋆ = (1/N)(1T ⊗ Id). Then, it is not difficult to check that A⋆R⋆(w) = 0, which implies that R⋆ = T⋆ = 0. Therefore vec U⋆ = (A⋆ ⊗ A⋆)S⋆, i.e., U⋆ = ∫ ⟨υ⋆(y, w)⟩⟨υ⋆(y, w)⟩T dµ1⊗θ⋆(y, w). This yields the following corollary.

Corollary 2.1. In addition to the assumptions of Theorem 2.3, assume that the (Wn)n are N × N doubly stochastic matrices. Then the matrix U⋆ is given by

U⋆ = ∫ ⟨y − ȳ⋆⟩⟨y − ȳ⋆⟩T dµ1⊗θ⋆(y, w)

where ȳ⋆ = ∫ y dµ1⊗θ⋆(y, w).

2.6 Concluding remarks

In this paragraph, we informally draw some general conclusions from our study. We assimilate the communication protocol to the selection of the sequence Wn, which we assume i.i.d. in this paragraph for simplicity. We say that a protocol is doubly stochastic if Wn is doubly stochastic for each n. We say that a protocol is doubly stochastic in average if E[Wn] is doubly stochastic for each n.

1. Consensus is fast. Theorem 2.3 states that the average estimation error converges to zero at rate √γn.
This result was actually expected, as √γn is the well-known convergence rate of standard stochastic approximation algorithms. On the other hand, Lemma 2.3 suggests that the disagreement vector J⊥θn goes to zero at rate γn, that is, one order of magnitude faster. Asymptotically, the fluctuations of the normalized estimation error (θn − 1 ⊗ θ⋆)/√γn are fully supported by the consensus space. This remark also suggests analyzing non-stationary communication protocols, for which the number of transmissions per unit of time decreases with n. This problem is addressed in [19].

2. Non-doubly stochastic protocols generally influence the limit points. This issue is discussed in Section 2.2.3. The choice of the matrices Wn is likely to have an impact on the set of limit points of the algorithm. This may be inconvenient, especially in distributed optimization tasks.

3. Protocols that are doubly stochastic "in average" all lead to the same limit points. In the framework of distributed optimization, this set of limit points precisely coincides with the sought critical points of the minimization problem. It means that non-doubly stochastic protocols can be used, provided that they are doubly stochastic in average.

4. Asymptotically, doubly stochastic protocols perform as well as a centralized algorithm. By Corollary 2.1, if Wn is chosen to be doubly stochastic for all n, the asymptotic error covariance characterized in Theorem 2.3 does not depend on the specific choice of Wn. In distributed optimization, the asymptotic performance is identical to the performance that would have been obtained by replacing Wn with the orthogonal projector J, which would lead to the centralized update ⟨θn⟩ = ⟨θn−1⟩ + (γn/N) Σ_{i=1}^N Yn,i. By contrast, protocols that are not doubly stochastic generally influence the asymptotic error covariance, even if they are doubly stochastic in average.
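For completeness, the covariance V of Theorem 2.3 (case a < 1) can be computed numerically from U⋆ and ∇h(θ⋆) by vectorizing the Lyapunov equation. In the sketch below, H and U are illustrative stand-ins (with H symmetric negative definite for simplicity, which makes it Hurwitz), not the quantities of this chapter.

```python
import numpy as np

# Solve V H^T + H V = -U (the a < 1 case of Theorem 2.3) by vectorization:
# (H (x) I + I (x) H) vec(V) = -vec(U). H stands in for grad h(theta_star),
# chosen symmetric negative definite, and U stands in for U_star.
d = 3
rng = np.random.default_rng(0)
B = rng.standard_normal((d, d))
H = -(B @ B.T + np.eye(d))             # symmetric negative definite => Hurwitz
M = rng.standard_normal((d, d))
U = M @ M.T + np.eye(d)                # symmetric positive definite

A = np.kron(H, np.eye(d)) + np.kron(np.eye(d), H)
V = np.linalg.solve(A, -U.reshape(-1)).reshape(d, d)

assert np.allclose(V @ H.T + H @ V, -U)               # Lyapunov equation holds
assert (np.linalg.eigvalsh((V + V.T) / 2) > 0).all()  # V is positive definite
```

Since H is Hurwitz and U is positive definite, the solution V is guaranteed to be the unique positive-definite solution, matching the statement of the theorem.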
2.7 Numerical results

We illustrate the convergence results obtained in Section 2.2.2 and discussed in Sections 2.2.3 and 2.6. We consider a particular case of the distributed optimization problem described in Section 2.2: a network of N = 5 agents where, for any i = 1, ..., 5, we define a private cost function fi : R → R. We address the following minimization problem:

min_{θ∈R} F(θ)  where  F(θ) = Σ_{i=1}^5 (1/2)(θ − αi)²    (2.19)

where αT = (−3, 5, 5, 1, −3). The minimizer of (2.19) is θF = ⟨α⟩ = 1. The network is represented by an undirected graph G = (V, E) with vertices {1, ..., N} and 6 fixed edges E. The corresponding adjacency matrix is given by

      0 1 0 1 0
      1 0 1 0 0
A =   0 1 0 1 1    (2.20)
      1 0 1 0 1
      0 0 1 1 0

We choose θ0,i = 0 for each agent i and a step-size sequence of the form γn = 0.1/n^0.7. The observations Yn,i are defined as in (2.4): (ξn,i)n,i is an i.i.d. sequence with Gaussian distribution N(0, σ²) where σ² = 1.

Figure 2.1 illustrates the two results of Theorem 2.1 for different gossip matrices (Wn)n. First, Figure 2.1 (a) addresses the convergence of the sequence (θn,1)n≥0 as a function of n, in order to show the influence of the matrices Wn on the limit points. In particular, the dashed curve corresponds to the algorithm (2.1)-(2.2) when Wn is assumed fixed and deterministic (Wn = W1 for all n); we select W1 in such a way that each agent computes the average of the temporary estimates in its neighborhood. This amounts to setting W1 = (IN + D)−1(IN + A), where D is the diagonal matrix of the degrees, i.e. D(i, i) = Σ_{j=1}^N A(i, j) for each agent i. Note that W1 is not doubly stochastic since 1TW1 ≠ 1T. Computing the left Perron eigenvector defined by Lemma 2.1 yields the minimizer of V = Σi vi fi, namely θV = vTα = 1.24. In that case, the sequence (θn,1)n converges to θ⋆ = θV instead of the desired θ⋆ = θF.
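This deterministic-gossip experiment can be reproduced in a few lines. The sketch below runs the noiseless version of the recursion (ξn ≡ 0) so that the limit point is clearly visible, with the graph (2.20) and W1 = (I + D)−1(I + A) as above; the vector α below is an illustrative choice (different from the chapter's) picked so that the bias vTα ≠ ⟨α⟩ is clearly visible.

```python
import numpy as np

# Noiseless run of Algorithm (2.1)-(2.2) with fixed W1 = (I+D)^{-1}(I+A)
# on the graph (2.20). alpha is an illustrative cost configuration,
# not the one used in the chapter.
A = np.array([[0, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 1],
              [1, 0, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
N = len(A)
D = np.diag(A.sum(axis=1))
W1 = np.linalg.inv(np.eye(N) + D) @ (np.eye(N) + A)   # row-, not column-stochastic

alpha = np.array([0.0, 0.0, 0.0, 0.0, 10.0])          # illustrative f_i minimizers
theta = np.zeros(N)
for n in range(1, 100001):
    gamma = 0.1 / n ** 0.7
    theta = W1 @ (theta + gamma * (alpha - theta))    # Y_n,i = -(theta_i - alpha_i)

# left Perron eigenvector of W1, via the formula of Lemma 2.1
J_perp = np.eye(N) - np.ones((N, N)) / N
v = np.ones(N) @ W1 @ np.linalg.inv(np.eye(N) - J_perp @ W1) / N

assert theta.max() - theta.min() < 1e-3        # consensus is reached...
assert abs(theta[0] - v @ alpha) < 1e-3        # ...at theta_V = v^T alpha
assert abs(v @ alpha - alpha.mean()) > 0.1     # biased away from the mean <alpha>
```

The run confirms the qualitative message of this section: the agents agree, but on the v-weighted critical point rather than on the desired minimizer of the sum.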
Figure 2.1 (a) also includes the trajectory of θn,1 generated by Algorithm (2.6)-(2.2) with W1 = (IN + D)−1(IN + A). As proposed in Section 2.2.4, when introducing the weighted step sizes γn vi−1, the sequence now converges to the sought value θF. Figure 2.1 (a) also illustrates the convergence behavior of Scenario 2, where the limit point θ⋆ of Algorithm (2.1)-(2.2) coincides with θF. In that case, we consider two standard models for Wn, namely the pairwise gossip of [31] and the broadcast gossip of [10] (we set β = 1/2). Finally, the plain line in Figure 2.1 (a) shows the performance of the algorithm proposed by [115] for distributed optimization, which is based on a synchronous version of the push-sum protocol of [89].

We conclude the illustration of Theorem 2.1 with the results on the consensus convergence for the same examples of Wn considered in Figure 2.1 (a). Figure 2.1 (b) represents the norm of the scaled disagreement vector as a function of n. As expected from Theorem 2.1, consensus is asymptotically achieved independently of the limit point, i.e. θF or θV. Note that the synchronous models of W1 and [115] require N transmissions at each iteration n, whereas the gossip protocols of [31] and [10] only require two and one transmissions respectively, due to their asynchronous nature. This may explain the gap between the curves in Figure 2.1 (b) regarding the convergence rate towards consensus.

[Figure 2.1 (a): Trajectories of θn,1 as a function of n, for Algorithm (2.1)-(2.2) with Wn = (I+D)−1(I+A), Algorithm (2.6)-(2.2) with Wn = (I+D)−1(I+A), Algorithm (2.1)-(2.2) with Wn pairwise [31], Algorithm (2.1)-(2.2) with Wn broadcast [10], and the algorithm of [115].]
[Figure 2.1 (b): √((1/N) Σ_{i=1}^N (θn,i − ⟨θn⟩)²) as a function of n, for the same five algorithms as in (a).]

Figure 2.1. Convergence result of Theorem 2.1 according to different communication schemes for (Wn)n.

The result of Theorem 2.3 is illustrated in Figure 2.2, which leads to the concluding remark 4) of Section 2.6. Figures 2.2 (a) and 2.2 (b) display the asymptotic analysis of the normalized average error γn^(−1/2)(⟨θn⟩ − θ⋆). Indeed, once convergence is achieved, the asymptotic distribution can be characterized by the closed form of the variance U⋆ ∈ R. In this example, Theorem 2.3 states that γn^(−1/2)(⟨θn⟩ − θ⋆) converges in distribution to a r.v. ∼ N(0, V), where ∇h(θ⋆) = −1 and thus the variance is V = U⋆/2. The first boxplot and the first histogram in Figure 2.2 are related to the algorithm implemented in a centralized manner. We consider the distributed algorithm (2.1)-(2.2) with different choices of Wn: the pairwise gossip of [31], the broadcast gossip of [10] and the fixed W1 defined by (IN + D)−1(IN + A). Note that θ⋆ coincides with the sought minimum θF when Algorithm (2.1)-(2.2) uses the pairwise or the broadcast gossip matrices since, in these cases, v is equal to (1/N)1 and thus θV = θF. However, in the fixed W1 case (where W1T1 ≠ 1), the average error sequence γn^(−1/2)(⟨θn⟩ − θ⋆) is computed with respect to θ⋆ = θV, which does not coincide with θF. From Figure 2.2 (b) we observe that the normal distribution obtained in Theorem 2.3 is coherent with the empirical results. The expression of U⋆ defined in (2.21) takes the following form:

U⋆ = (1/N²)[vec(C)T Γ vec(σ²IN + (IN + 2Λ) g(θ⋆1) g(θ⋆1)T) + g(θ⋆1)T C (Λ + IN) g(θ⋆1) + σ² tr(C + 11T)]    (2.21)

where tr(A) denotes the trace of the matrix A and:

C = E[W1T 11T W1] − 11T ,
Γ = (IN² − E[J⊥W1 ⊗ J⊥W1])−1 E[J⊥W1 ⊗ J⊥W1] ,
Λ = (IN − J⊥W)−1 J⊥W ,
g(θ⋆1) = −θ⋆1 + α .

As stated in Corollary 2.1, when the (Wn)n are doubly stochastic, U⋆ corresponds to the same variance as obtained in a centralized setting: since the covariance matrix C is equal to zero and tr(11T) = N, (2.21) reduces to σ²/N = 0.2. Furthermore, we can rewrite the variance (2.21) as the sum of two terms, U⋆ = U⋆opt + U⋆com, where:

U⋆opt = σ²/N ,
U⋆com = (1/N²)[vec(C)T Γ vec(σ²IN + (IN + 2Λ) g(θ⋆1) g(θ⋆1)T) + g(θ⋆1)T C (Λ + IN) g(θ⋆1) + σ² tr(C)] .

The case U⋆ = U⋆opt is displayed by the two first histograms in Figure 2.2 (b). On the contrary, U⋆com > 0 once column-stochasticity fails, since the covariance matrix C is no longer null and all terms in (2.21) influence the asymptotic variance. For instance, this is the case when the matrices (Wn)n follow the broadcast model of [10], since then C = (β²/N) L², where L is the Laplacian matrix of G, i.e. L = D − A. As shown in Figure 2.2 (b), U⋆ is now 18.03. However, regarding the asymptotic variances illustrated by the boxplots in Figure 2.2 (a), it is worth noting that these variances are rather different for the two non-column-stochastic cases, i.e. the deterministic case (fixed Wn = W1 for all n) and the random broadcast case.

[Figure 2.2 (a): Boxplots of the normalized average error for: (1) the centralized algorithm, (2) (Wn)n pairwise [31], (3) (Wn)n broadcast [10], (4) Wn = W1 for all n, and (5) Wn = W1 for all n with weighted step sizes. Boxplots 2, 3 and 4 correspond to Algorithm (2.1)-(2.2), while boxplot 5 corresponds to Algorithm (2.6)-(2.2) (weighted step sizes).]
[Figure 2.2(b) panels: Centralized; Distributed with W_n pairwise [31]; Distributed with W_n broadcast [10].]

(b) Empirical distribution (dark bars) versus theoretical distribution given by Theorem 2.3 (solid line).

Figure 2.2. Asymptotic analysis of the normalized average error γ_n^{-1/2}(⟨θ_n⟩ − θ⋆) according to different communication schemes for (W_n)_n, after n = 30000 iterations and over 100 independent Monte-Carlo runs.

are 0.26 and 0.49 respectively. These values are close to the variance achieved in the optimal cases (0.2), i.e. centralized processing and doubly stochastic matrices. This is due to the low contribution of the variance term U⋆^com. Indeed, this contribution comes from two sources: the moments of the disagreement sequence defined in (2.9), and the covariance matrix C (related to the non-column-stochastic character of the model for W_n). Observing the disagreement trajectories in Figure 2.1(b), there is a gap of nearly two orders of magnitude (10⁻²) between the mean values (first order) of the deterministic case (dashed line) and the broadcast case (plain line with square markers). Besides, the fluctuations (second order) are larger in the broadcast case, while they are almost null in the deterministic cases (dashed line and plain line with circle markers). Nevertheless, due to the convergence rate of the disagreement sequence, the impact on the average error is negligible, i.e. the disagreement vanishes faster than √γ_n, as stated in Proposition 2.1. Thus, the main contribution to U⋆ comes from the covariance matrix C. In the deterministic cases, the entries of C are close to 0, contrary to those obtained with the broadcast gossip (values greater than one).
Part II — Distributed principal component analysis

Chapter 3 — A distributed on-line Oja's algorithm

In this chapter we investigate the spectral decomposition of a real symmetric positive-semidefinite N × N matrix denoted by M. The objective is to compute the principal eigenvectors and the corresponding eigenvalues of M, a task known as principal component analysis (PCA). Consider the spectral decomposition of M:

M = U Λ U^T    (3.1)

where U^T = U^{-1} is an orthogonal matrix and Λ is a diagonal matrix, both of size N × N. We seek to obtain the factorization (3.1), with the column vectors of U having unit norm, by designing a distributed algorithm. For that purpose, we introduce a connected network of N agents with a graph structure, in which each agent partially observes the matrix M and focuses on the computation of its own entries of U and Λ. By means of communications among the agents, we aim at designing an iterative and distributed algorithm whose convergence towards the principal eigenvectors of M can be ensured. Before going into the details, we first introduce the notations used throughout this chapter, summarized in the table below.

N, p — positive integers
x, y, … — real numbers (scalars)
x, y, … — column vectors in R^N
X, Y, … — real matrices in R^{N×N} or R^{N×p}
x ∘ y — Hadamard product
‖x‖ — Euclidean norm, ‖x‖² = Σ_i |x(i)|²
‖X‖ — Frobenius norm, ‖X‖² = tr(X^T X), where tr(·) is the trace
diag(·) — diagonal matrix formed by the elements (·) on its diagonal
P, E — probability and associated expectation operators on a tacit probability space

3.1 Introduction

3.1.1 Context and goal

Eigenvector computation is the main ingredient of the celebrated principal component analysis (PCA) [76], a classical approach to reduce the complexity of systems of high dimension or subject to randomness.
In wireless sensor networks (WSN), PCA is applied to compute the sensors' positions (see our application in Chapter 4 or [92]) and the signal's covariance matrix (see [102]). In cluster learning, eigendecomposition is needed for several applications such as: PageRank [124], the stationary distribution of an ergodic Markov chain (see [33] or [29]), graph clustering [29], spectral analysis of an adjacency matrix for social engineering [90] and, recently, user profiling to personalize services [147]. As highlighted in [90], [147] or [102], it is important to achieve such an embedding at the user/agent level by means of distributed processing.

Let M be in R^{N×N}. We denote by λ_1(M) ≥ ··· ≥ λ_N(M) the eigenvalues of matrix M sorted in non-increasing order. When no ambiguity arises, an eigenvector u_k(M) and its corresponding eigenvalue λ_k(M) are simply denoted by u_k and λ_k, for all k = 1, …, N. Let p be such that p < N. Following the compact matrix notation of (3.1), we define U as the N × p matrix containing the first p eigenvectors, U = (u_1, …, u_p), and Λ as the p × p diagonal matrix containing the corresponding eigenvalues, Λ = diag(λ_1, …, λ_p). Note that if the eigenspace has dimension 1, a unit-norm eigenvector associated with Λ = λ_1 is denoted U = u_1.

Let us assume that N agents are connected and form a network. We denote by G = (V, E) an undirected graph with vertex set V = {1, …, N} and edge set E ⊆ {{i, j} : i, j ∈ V}, and we write i ∼ j whenever {i, j} ∈ E. We denote by N_i the neighborhood of any agent i, i.e. all agents j such that j ∼ i. G_N denotes the complete graph, which has all N(N−1)/2 possible edges. A symmetric matrix M is said to be adapted to a graph G = (V, E) whenever i ≁ j implies M(i, j) = 0. For instance, when G = G_N, any symmetric matrix M is adapted to G. In this section and in Section 3.2, we assume that all agents are connected, so the network they form is represented by the complete graph G_N.
General types of graphs will be addressed in Section 3.3. Let us also assume that each pair of connected agents is given a weight, representing, for instance, the distance between the two agents, a link quality measure or a resource allocation. The weight between a pair of agents i and j is denoted M(i, j). Thus, all weights are simultaneously encoded into the single symmetric matrix M = M^T. Under perfect observation of M, each agent i has access to the i-th row of M, i.e. M(i, 1), …, M(i, N). In addition, the location of each agent i depends on the i-th component of each eigenvector, i.e. the i-th row of U, denoted by U(i) = (u_1(i), …, u_p(i)).

In the first part of this thesis, we addressed the problem of network consensus on some global parameter, external to and independent of the agents. In this second part, we deal with a network configuration problem requiring each agent to infer its own coordinates, which depend on the global subspace spanned by U, i.e. the eigenspace associated with M. It is worth noting that, contrary to Chapter 2, the goal is now to design a distributed algorithm that enables each agent i to obtain an estimate of its coordinates U(i). Similarly to Chapter 2, the process is based on a stochastic approximation approach and is performed by each agent from local information and several communications with its neighbors. Although the agents do not have access to the global information M, the coordinates U(i) of each agent i are nevertheless related to M. Hence, to solve this problem in a distributed manner, we define a communication scheme different from the one considered in Chapter 2, since network consensus is no longer required.

3.1.2 Related works

Consider the spectral analysis problem in one dimension (p = 1) applied to a deterministic (or perfectly known) matrix M; the goal is then to compute u_1.
A well-known centralized algorithm to accomplish this task is the so-called power method (PM) [73, p. 406], described by Algorithm 1.

Algorithm 1: Power method for principal eigenvector estimation (p = 1)
Initialize: set u_0 ∈ R^N randomly.
Update: at each time n = 1, 2, …
[step 1] Compute ũ_n = M u_{n−1}.
[step 2] Normalize u_n = ũ_n / ‖ũ_n‖.

From a distributed implementation viewpoint, both steps 1 and 2 have drawbacks. For a given agent i, step 1 writes as a sum ũ_n(i) = Σ_{j=1}^N M(i, j) u_{n−1}(j) that contains N terms, each involving a communication with a separate agent j. When N is large, this may incur a prohibitive cost to the network. Second, for any agent i, step 2 writes u_n(i) = ũ_n(i) / √(Σ_{j=1}^N ũ_n(j)²), implying that agent i should query all other agents for their values ũ_n(j) to implement this step. Hence, Algorithm 1 has a computational cost (multiplications, additions and divisions) scaling as N², and a constrained communication scheme demanding synchronization among the agents at each iteration n. Note that an additional, more involved step is required when computing more than one eigenvector (p > 1). Indeed, this step relies on the condition that the eigenvectors form an orthonormal basis, which can be enforced by the Gram-Schmidt method or a QR decomposition. The generalization of the PM to the computation of the first p eigenvectors of a matrix M is called orthogonal iteration (OI) [73, p. 454]. As an extension of the PM, the OI includes a QR decomposition at each iteration n to generate the corresponding eigenvector estimates U_n. Note that the computational cost then increases, scaling as pN² (multiplications and additions) plus a factor p³ due to the Cholesky factorization, which includes the inversion of a p × p matrix. Two decentralized algorithms can be found in [90] and [92].
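As a point of reference, the centralized Algorithm 1 can be sketched in a few lines of NumPy (the test matrix, seeds and tolerance below are illustrative choices of ours, not taken from the thesis):

```python
import numpy as np

def power_method(M, n_iter=500, rng=None):
    """Centralized power method (Algorithm 1): estimate the principal
    eigenvector u1 of a symmetric matrix M."""
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(M.shape[0])
    for _ in range(n_iter):
        u_tilde = M @ u                        # step 1: costs O(N^2)
        u = u_tilde / np.linalg.norm(u_tilde)  # step 2: global normalization
    return u

# Toy check on a random positive-semidefinite matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T
u = power_method(M, rng=1)
u1 = np.linalg.eigh(M)[1][:, -1]               # true principal eigenvector
assert abs(abs(u @ u1) - 1.0) < 1e-8
```

Both drawbacks discussed above are visible here: the matrix-vector product needs every entry of a row of M, and the normalization needs a global reduction over all coordinates.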
The first work addresses the general p-eigenvector estimation of a perfectly known adjacency matrix; the authors propose a distributed version of the OI [73, p. 454] by introducing two communication steps at each iteration: one in which each agent i sends its current estimate U_n(i) to all other agents, and a second in which a push-sum averaging gossip phase is used to estimate the p × p matrix related to the Cholesky factorization. Note that this phase implies that all agents communicate during a given number of rounds in order to achieve a certain accuracy. In [92], a distributed gossip-based version of Algorithm 1 is proposed to estimate the first eigenvector; the authors mention that the extension to general p can be performed by a distributed Gram-Schmidt method, whose details are not given, and they consider an example with p = 2 to illustrate their numerical results. The authors of [92] introduce a deterministic and a random sparsification of the matrix M to compute the product of step 1 of Algorithm 1, in which each agent communicates with the other agents with a given probability. The normalization step 2 is then handled by an averaging gossip phase involving, at each iteration, a number of rounds chosen to achieve a given accuracy. Error bounds are given as functions of these two design parameters: the communication probability of each agent, and the desired accuracy, which determines the number of gossip rounds. Thus, [92] does not provide an algorithm that converges to the sought eigenvector; it converges to an approximate solution. In a more general framework, if the matrix M is only partially known, for instance because it is corrupted by random noise, an unbiased estimate M̂ can first be computed by collecting a large number of observations. In that case, Algorithm 1 must be preceded by a batch phase before performing the eigenspace computation, which increases the number of communications.
In such a context, when considering a sequence (M_n)_n such that E[M_n] = M for all n, an alternative is the centralized stochastic approximation approach proposed by Oja in [94] (p = 1) and [122] (p > 1) as an extension of the two previous algorithms. Although, under further hypotheses (large matrices, sparsity, noisy measurements, etc.), different methods can be considered (see [73]), we focus on distributed PCA based on Oja's approach. We design an on-line implementation in which each agent is able to update its local estimate whenever a new observation is obtained, which is well adapted to our context. A distributed version of [122] is presented in [147] for a sparse and perfectly known matrix M. The authors introduce two time scales: a slow one to update the coordinates U_n(i) at each agent, and a fast one performed by the random averaging gossip of [31] to make the global terms U_n^T U_n and U_n^T M U_n available at each agent and at each iteration n. Moreover, [147] proposes two algorithms, respectively synchronous and asynchronous. Unfortunately, for settings where robustness to asynchronism is required, such as wireless sensor networks (WSN), convergence is only guaranteed for the synchronous algorithm. Finally, paper [102] addresses the distributed computation of the principal components of a signal's covariance matrix that is only observed up to some noise in a WSN. Once Oja's recursion is written for each agent i, the authors of [102] identify three terms which require the knowledge of the values of all the other agents at each iteration, namely M U_{n−1}, U_{n−1}^T U_{n−1} and U_{n−1}^T M U_{n−1}. Hence, an average consensus phase (as in [31]) involving a given number of rounds is performed to estimate each term before updating U_n. The associated mean field is proved to be close to Oja's recursion. This result leads them to obtain, under suitable assumptions, convergence towards an equilibrium point lying in a limit set close to the sought principal component.
The chapter is organized as follows. Section 3.2 introduces the framework and details the proposed algorithm in the case of a complete graph and of the first-eigenvector estimation problem. Section A.5 provides the analysis of the algorithm: convergence with probability 1 is proved under suitable assumptions in the one-dimensional case. We provide an extension in Section 3.3, taking into account a general graph context and the computation of several principal components. This latter section leads us to introduce the general algorithm used for the localization application in WSN (see Chapter 4). Numerical results in one and two dimensions are provided in Section 3.5 to conclude this chapter.

3.1.3 Contributions

Note that each of the distributed solutions above includes one or several average consensus phases at each iteration, meaning that the computational and communication costs are governed by the required number of consensus steps. We bring the following contributions:

• We borrow from [92] the idea of sparsification (sparse communications), and we share with [102], [147] and [29] the same foundation, namely Oja's algorithm, which we use in a distributed context as initiated by [147] and [102].

• We provide a general framework and algorithms that encompass both the case where the symmetric matrix M is perfectly known and the case where M is not perfectly known but, instead, a sequence of independent and identically distributed (i.i.d.) samples M_n, n ≥ 0, is observed.

• We provide an asynchronous and on-line implementation based on random measurements or observations and random communications among the agents.

• We prove almost sure (a.s.) convergence of the proposed distributed algorithm to some eigenvector of M.

Since we are interested in on-line processing, we denote by M_n the sample available at the time instant n at which the computation is performed by the agents.
3.2 Case G = G_N

In this section we investigate the following simple case: the estimation of the first eigenvector when the network of N agents forms a complete graph. First, we recall the centralized Oja's algorithm, and then we introduce the communication step that yields the distributed version.

3.2.1 Oja's algorithm

When faced with random matrices (M_n)_n having a given expectation M, the following centralized on-line algorithm, due to Oja [120] and analyzed in [122], converges to the principal eigenvector (p = 1) under suitable assumptions:

u_n = u_{n−1} + γ_n ( M_n u_{n−1} − (u_{n−1}^T M_n u_{n−1}) u_{n−1} ).    (3.2)

However, since we expect convergence to a unit-norm vector, the above recursion is known to suffer from numerical instabilities as soon as the initialization is not well chosen [122]. The work [29] deals with the stabilization issue of [122] by adding a normalization term to the above stochastic approximation equation, which ensures the Lipschitz continuity of the term M_n − M_{n−1} and thus the stability of the generated sequence. Moreover, a white noise term is added to ensure the consistency of the sequence, and convergence is finally proved. Nevertheless, the computation of the normalization term proposed in [29] may be difficult to generalize to a distributed context. The instabilities can instead be easily avoided by introducing a simple projection step (see [23] for instance):

u_n = Π_K [ u_{n−1} + γ_n ( M_n u_{n−1} − (u_{n−1}^T M_n u_{n−1}) u_{n−1} ) ],    (3.3)

where K is any compact convex set whose interior contains the closed unit ball in R^N, and Π_K is the projector onto K. Note that both the standard Oja's algorithm [122] and the variants of [29] and (3.3) are centralized and need a number of operations (products and sums) and of communications scaling as N² and N respectively at each iteration, due to the terms M_n u_{n−1} and (u_{n−1}^T M_n u_{n−1}) u_{n−1}.
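To fix ideas, the projected recursion (3.3) can be sketched as follows, taking K = [−α, α]^N so that Π_K is a coordinate-wise clipping; the noise model, step sizes and tolerance are illustrative assumptions of ours:

```python
import numpy as np

def projected_oja(samples, gamma, alpha=2.0, u0=None):
    """Centralized projected Oja recursion (3.3):
    u_n = Pi_K[u_{n-1} + gamma_n (M_n u_{n-1} - (u^T M_n u) u_{n-1})]."""
    u = u0.copy()
    for n, M_n in enumerate(samples, start=1):
        u = u + gamma(n) * (M_n @ u - (u @ M_n @ u) * u)
        u = np.clip(u, -alpha, alpha)   # projection onto K = [-alpha, alpha]^N
    return u

rng = np.random.default_rng(0)
N = 6
B = rng.standard_normal((N, N))
M = B @ B.T / N                          # target mean: E[M_n] = M
def sym(E):                              # symmetric zero-mean noise
    return (E + E.T) / 2
samples = (M + 0.1 * sym(rng.standard_normal((N, N))) for _ in range(20000))
u = projected_oja(samples, gamma=lambda n: 0.5 / n**0.7,
                  u0=rng.standard_normal(N) / N)
u1 = np.linalg.eigh(M)[1][:, -1]
# the direction of u aligns (up to sign) with the principal eigenvector
assert abs(abs(u / np.linalg.norm(u) @ u1) - 1.0) < 0.1
```

The clipping step is what keeps the iterates bounded during the early, large-step-size phase, exactly the role Π_K plays in (3.3).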
However, in a distributed setting of N agents, a similar amount of complexity is charged to each agent i to update its corresponding entry u_n(i). The first term M_n u_{n−1} implies that each agent i sends u_{n−1}(i) to all other agents j ≠ i and computes Σ_{j=1}^N M_n(i, j) u_{n−1}(j). Subsequently, each agent i sends this latter value to all other agents j ≠ i and is then able to compute the second matrix operation (u_{n−1}^T M_n u_{n−1}) u_{n−1}(i). In order to reduce the number of communications per agent at each iteration, we introduce an asynchronous communication model which enables the agents to perform fewer transmissions while keeping the behavior of the sequence (3.3).

3.2.2 Communication model: randomized sparsification

We define the following asynchronous model.

Definition 3.1 (Asynchronous sparsification matrices). Let q be a real number such that 0 < q < 1. Define i.i.d. uniformly distributed random variables (r.v.) V_n (P[V_n = i] = 1/N for all i ∈ {1, …, N}) and i.i.d. Bernoulli random variables Q_{n,i} (P[Q_{n,i} = 1] = q for all i ∈ {1, …, N}). The following sequence of random matrices A_n are referred to as asynchronous sparsification matrices (ASM):

A_n(i, j) = 1 if i = j; N/q if i ≠ j, j = V_n and Q_{n,i} = 1; 0 otherwise.    (3.4)

Notice that the matrices A_n are not symmetric. The following lemma can be straightforwardly proved.

Lemma 3.1. The ASM defined by (3.4) are i.i.d. random matrices such that:
i) There exists a constant C_q such that ‖A_n‖ ≤ C_q with probability 1.
ii) E[A_n] = N J = 11^T.

The following proposition is an immediate consequence of Lemma 3.1.

Proposition 3.1. Define M_n := A_n ∘ M. The matrix sequence (M_n)_n is uniformly bounded with probability 1 and is an unbiased estimate of M, i.e. E[M_n] = M.

Remark 3.1. Let u be a vector whose i-th component is known by node i only.
The computation of vector (An ◦ M )u can be performed easily in an asynchronous and distributed way: at time n, some agent Vn wakes up, chooses a sparse list of neighbors it is going to contact using head and tail draws Qn,i . Each of the chosen neighbors is awaken by Vn and updates its value only using the value of agent Vn and its own. Notice also that no feedback is needed from the network: agent Vn only has to send its value to some chosen neighbors. In that sense, ASM are analogous to broadcast matrices of [10]. Let us mention one drawback that seems difficult to circumvent: each agent has to know previously the total number N of agents in the network. The Authors of [92] replace the matrix multiplication step un = M un−1 by another multiplication un = M n un−1 where M n is an unbiased and sparse estimation of M . Yet, in [92] it is suggested the use of a simple Bernoulli sparsification scheme studied by [5]: M n = q −1 B n ◦ M where B n are random matrices with i.i.d. binary entries taking value 1 with probability q. However, notice that multiplying by matrix M n cannot be done asynchronously. Indeed, in [92] it is considered that all agents transmit at same time. In addition, as already noticed, the convergence towards the sought eigenspace is not ensured. Following the communication model in Definition 3.1, we now describe the proposed algorithm. 3.2.3 Distributed on-line Oja’s algorithm (p = 1, G = GN ) Since we are interested in a distributed on-line implementation of Oja’s algorithm (3.3), we take the particular case where M n = An ◦ M for some ASM An . Putting aside the term uTn−1 M n un−1 Oja’s algorithm only involves matrix multiplications of the form M n un−1 that could easily be distributed for the reason mentioned in Remark 3.1. Unfortunately, term uTn−1 M n un−1 still remains difficult to evaluate distributively. Our idea is thus to replace the latter with an estimate which is more suitable to distributed computation. 
We set

z_n = Ã_n ( u_{n−1} ∘ (M_n u_{n−1}) ),

where the Ã_n are ASM independent from the A_n. Again, z_n can be computed distributively, each node i being able to evaluate the i-th component z_n(i) of z_n by means of local gossiping with the agents selected by the sparsification matrix Ã_n. Loosely speaking, we interpret z_n(i) as a noisy estimate of the desired term u_{n−1}^T M_n u_{n−1}. We are now in a position to state our first algorithm, directly inspired by the projected Oja's algorithm of Section 3.2.1. The algorithm iteratively generates a random sequence (u_n)_n according to the following updates:

y_n = (A_n ∘ M) u_{n−1}
z_n = Ã_n (u_{n−1} ∘ y_n)    (3.5)
u_n = Π_K [ u_{n−1} + γ_n (y_n − z_n ∘ u_{n−1}) ]

where A_n and Ã_n are two independent ASM. From Remark 3.1, it follows that the matrix multiplication steps in the update (3.5) can easily be implemented in a distributed fashion. Thus, algorithm (3.5) complies with the asynchronism requirement and is fully distributed, as soon as the projector Π_K can be applied distributively. To that end, since the vector u_n has N entries, it is sufficient to choose K as a Cartesian product of the form:

K = [−α, α] × ··· × [−α, α]    (3.6)

with α > 1. Each factor of (3.6) is a real interval whose interior contains [−1, 1], one for each agent i. Note that the algorithm is fully characterized by (3.5). However, in order to give a more detailed description, we also provide a pseudocode version of (3.5) in Algorithm 2 below.

Algorithm 2: Distributed on-line Oja's algorithm (doOja)
Initialize: Set u_0(i) ≠ 0 for any i ∈ V.
Iterate: At each time n = 1, 2, …
The clock of some random agent i ∈ V is ticking.
Agent i sends u_{n−1}(i) to a set of agents N_i ⊂ V randomly selected as in Remark 3.1.
For any agent j ∈ V, do:
y_n(j) = M(j, j) u_{n−1}(j) + (N/q) M(i, j) u_{n−1}(i) if j ∈ N_i;
y_n(j) = M(j, j) u_{n−1}(j) otherwise.
The clock of some random agent l ∈ V is ticking.
Agent l sends u_{n−1}(l) y_n(l) to a set of randomly selected agents N_l ⊂ V.
For any agent i ∈ V, do:
z_n(i) = u_{n−1}(i) y_n(i) + (N/q) u_{n−1}(l) y_n(l) if i ∈ N_l;
z_n(i) = u_{n−1}(i) y_n(i) otherwise.
and finally:
u_n(i) = Π_{[−α,α]} [ u_{n−1}(i) + γ_n (y_n(i) − z_n(i) u_{n−1}(i)) ].

3.3 General graph and unknown matrix M case

The algorithm detailed in the previous Section 3.4 is well suited to the context of perfect connectivity between agents (complete graph setting G_N), known matrix M and centralized processing. In this section, we introduce some necessary notations for the general context of a general connected graph formed by the network of N agents.

3.3.1 Network considerations

We let (W_{n,t})_{n,t≥0} be a doubly infinite i.i.d. sequence of random matrices with common mean E[W_{n,t}] = W. We adopt the random pairwise gossip scheme of [31] for (W_{n,t})_{n,t≥0}. In that case, W_{n,t} = I − (e_i − e_j)(e_i − e_j)^T / 2, where e_i denotes the i-th vector of the canonical basis of R^N, and the edge {i, j} ∈ E is activated with probability (1/N)(d_i^{-1} + d_j^{-1}), where d_i and d_j are the degrees of agents (vertices) i and j respectively. We introduce the following lemma, which will be used later in the convergence analysis. It establishes that a sufficient number of consecutive steps of the gossip scheme of [31] brings the product close, in expectation, to the projector.

Lemma 3.2. Let C_n = N ∏_{t=1}^{φ(n)} W_{n,t} be a product of gossip matrices performed during φ(n) steps, where the (W_{n,t})_{n,t≥0} are the pairwise gossip matrices described above. Then E[C_n] = N J + R_n, where R_n → 0 as n → ∞.

Proof. Upon noting that the (W_{n,t})_{n,t≥0} are i.i.d. with E[W_{n,t}] = W, it is necessary to verify that lim_{n→∞} ‖W^{φ(n)} − J‖ = 0, which follows by induction:
Setting φ(n) = n, and since W J = J and J W^{n−1} = J,

(W − J)^n = (W − J)(W − J)^{n−1} = (W − J)(W^{n−1} − J) = W^n − J.

Thus, lim_{n→∞} (W − J)^n = 0, as the corresponding spectral radius ρ(W − J) is smaller than one (see [31]).

In addition, we make the following considerations:

1) Graph G is not necessarily complete. On the one hand, the fact that G ≠ G_N naturally implies some degree of sparsity of the matrix M, since M(i, j) = 0 for any i ≁ j. One might expect that this natural sparsity of M should facilitate the design of distributed algorithms. On the other hand, however, it is no longer possible to design an ASM sequence as in Algorithm 2. In particular, the computation of z_n in (3.5) becomes irrelevant. Intuitively, the main advantage of using ASM in Section 3.2.3 was that such matrices are equal in expectation to N J. When G ≠ G_N, one can no longer generate a sparse random matrix adapted to G whose expectation is N J. In the sequel, we circumvent this issue by replacing the ASM with a random gossip step inspired from [31]. The idea of using linear gossip methods for PCA has been used previously in [102]. Note however that our algorithm presents significant differences with [102]. First, the authors of [102] focus on the case where the observed matrix has rank one, which allows for useful simplifications. Second, the algorithm of [102] ends up in a neighborhood of the sought eigenspace, whereas we show almost sure convergence of our algorithm. Finally, [102] assumes synchronous deterministic gossip, while we focus on asynchronous random gossip.

2) Matrix M is likely to be imperfectly observed. Instead, we assume that each node i ∈ V associates, at time n, a weight M_n(i, j) to any node j in its neighborhood. Here, (M_n)_n is a random matrix sequence which can be interpreted as a noisy version of a deterministic matrix M (we typically assume E[M_n] = M).
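The gossip product C_n of Lemma 3.2 can be sketched as follows; for simplicity we activate edges uniformly at random rather than with the exact probabilities (1/N)(d_i^{-1} + d_j^{-1}) of [31], and the ring graph, value of φ and tolerance are illustrative choices of ours:

```python
import numpy as np

def gossip_product(edges, N, phi, rng):
    """C_n = N * prod_{t=1}^{phi} W_{n,t}, where each W_{n,t} is a pairwise
    gossip matrix: W = I - (e_i - e_j)(e_i - e_j)^T / 2 averages the
    coordinates of the two endpoints of a randomly activated edge."""
    C = N * np.eye(N)
    for _ in range(phi):
        i, j = edges[rng.integers(len(edges))]  # uniform edge activation
        W = np.eye(N)
        W[[i, i, j, j], [i, j, i, j]] = 0.5     # average entries i and j
        C = C @ W
    return C

# On a connected ring of N = 6 agents, the product approaches N*J = 1 1^T.
rng = np.random.default_rng(0)
N = 6
ring = [(i, (i + 1) % N) for i in range(N)]
C = gossip_product(ring, N, phi=800, rng=rng)
assert np.max(np.abs(C - np.ones((N, N)))) < 0.01
```

This illustrates the role of φ(n): the longer the gossip phase, the closer C_n is to N J = 11^T, which is exactly the quantity the ASM provide in expectation in the complete-graph case.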
We consider a random sequence of matrices (M_n)_n adapted to G. This scenario has various applications. First, this model encompasses the sparsification scheme of Section 3.2: we just set M_n = A_n ∘ M, where A_n is a well-chosen sparse random matrix adapted to G. As another example, our model also encompasses the case where M is the covariance matrix of some i.i.d. random vectors η_n: M = E[η_n η_n^T] is not directly observable by the network, but η_n η_n^T is.

3.3.2 Distributed on-line algorithm

Under the network assumptions described in the previous section, we propose the following iterations as an extension of Algorithm 2 to estimate the first eigenvector. We generalize (3.5) as:

y_n = M_n u_{n−1}
z_n = C_n (u_{n−1} ∘ y_n)    (3.7)
u_n = Π_K [ u_{n−1} + γ_n (y_n − z_n ∘ u_{n−1}) ]

where C_n is defined as in Lemma 3.2 for a general network, or is an ASM (Definition 3.1) in the complete-graph case; φ is a non-decreasing integer-valued function going to infinity with n, and Π_K plays the same role as in (3.3). Remark that this algorithm yields a distributed asynchronous implementation compatible with G, since all matrices are adapted to the graph structure G.

Remark 3.2. It is known from [31] that, for a fixed n, the infinite product of gossip matrices ∏_{t=1}^∞ W_{n,t} converges almost surely to the orthogonal projector J, provided that G is connected. Here, we approximate J using the finite product ∏_{t=1}^{φ(n)} W_{n,t}. Since φ(n) → ∞, the latter product is expected to become closer and closer to the true projector as n increases. It is worth noting that if the product ∏_{t=1}^{φ(n)} W_{n,t} were indeed replaced by J in (3.13), then the recursion (3.13) would coincide with the centralized Oja's algorithm (3.12) as described in Section 3.4.

3.3.3 Convergence analysis

We have seen two distributed on-line algorithms in Sections 3.2.3 and 3.4.2. In this section we provide a convergence result which encompasses both algorithms.
Note that in this case, since p = 1, the upper-case notation U_n is equivalent to the lower-case notation u_n. In order to make precise statements, let us first introduce some assumptions. Recall that G denotes the underlying graph.

Assumption 3.1 (Step size). (γ_n)_n is a decreasing step-size sequence with γ_n > 0 satisfying the following standard properties [28]:
i) Σ_{n≥0} γ_n = +∞.
ii) Σ_{n≥0} γ_n² < ∞.

Let us introduce some sequences of random matrices (M_n)_n, (C_n)_n, (R_n)_n and (R'_n)_n, and denote by F_n the filtration up to time n, i.e. σ(u_0, M_1, …, M_n, C_1, …, C_n). We assume the following conditions.

Assumption 3.2 (M_n).
i) M is a symmetric N × N matrix and λ_1(M) has multiplicity 1.
ii) M_n is adapted to G.
iii) There exists a sequence of matrices R_n such that E[M_n | F_{n−1}] = M + R_n and R_n converges almost surely to 0.
iv) There exists C > 0 such that P[‖M_n‖ < C] = 1.
v) There exists a constant L > 0 such that E[‖M + R_n − M_n‖² | F_{n−1}] < L.

Assumption 3.3 (C_n). Let (C_n)_n be a sequence of matrices such that:
i) C_n is adapted to G.
ii) There exists a sequence of matrices R'_n such that E[C_n | F_n] = N J + R'_n and ‖R'_n‖ converges almost surely to 0.
iii) There exists C > 0 such that P[‖C_n‖ < C | F_{n−1}] = 1.
iv) Conditionally to F_n, C_n and M_n are independent.
v) There exists a constant L' > 0 such that E[‖N J + R'_n − C_n‖² | F_{n−1}] < L'.

Note that the sequence of matrices C_n = N ∏_{t=1}^{φ(n)} W_{n,t} defined in Lemma 3.2 of Section 3.3.1 satisfies Assumption 3.3. As mentioned previously, the technical bottleneck is to ensure stability, i.e. P[(‖u_n‖)_n is bounded] = 1. Hence, we introduce a stability-like condition in order to follow an analysis in the style of stochastic approximation schemes.

Assumption 3.4. For each agent i, there exists a time instant n_i^0 such that, for all n > n_i^0, the sequence u_{n−1}(i) + γ_n (y_n(i) − z_n(i) u_{n−1}(i)) remains in the compact set K almost surely.
The above assumption states that the projector Π_K becomes inactive at each agent i for all n beyond a certain value. Hence, Assumption 3.4 implies that the sequence (u_n)_{n≥0} remains a.s. in the compact set K, i.e. for all n > max_i n_i^0 we have u_n = Π_K[u_n]. The following lemma is simple algebra but allows us to cast the proposed algorithms (3.5) and (3.7) into a stochastic approximation framework. Indeed, the sequence defined in (3.7), with the projection onto the set K of (3.6), can be written in compact matrix form as:

u_n = Π_K [ u_{n−1} + γ_n ( M_n u_{n−1} − (C_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} ) ].    (3.8)

Lemma 3.3. Under Assumptions 3.2 and 3.3, (3.8) can be written as:

u_n = u_{n−1} + γ_n h(u_{n−1}) + γ_n ι_n + γ_n r_n + γ_n e_n    (3.9)

where h(u) = M u − (u^T M u) u is the so-called mean field function, ι_n is a martingale increment, i.e. E[ι_n | F_{n−1}] = 0, and r_n and e_n are two remainder terms.

Proof. Direct computations give the following expressions for the error sequences ι_n, r_n and e_n:

ι_n = (M_n − M − R_n) u_{n−1} + ((N J + R'_n − C_n)(u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} + (u_{n−1}^T (M + R_n − M_n) u_{n−1}) u_{n−1}

r_n = R_n u_{n−1} − (R'_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} − (u_{n−1}^T R_n u_{n−1}) u_{n−1}

γ_n e_n = Π_K [ u_{n−1} + γ_n ( M_n u_{n−1} − (C_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} ) ] − ( u_{n−1} + γ_n ( M_n u_{n−1} − (C_n (u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} ) )

Using Assumption 3.2, one derives that E[(M_n − M − R_n) u_{n−1} | F_{n−1}] = 0 and E[(u_{n−1}^T (M + R_n − M_n) u_{n−1}) u_{n−1} | F_{n−1}] = 0. Then, using Assumption 3.3, one has E[((N J + R'_n − C_n)(u_{n−1} ∘ M_n u_{n−1})) ∘ u_{n−1} | F_{n−1}] = 0, which gives E[ι_n | F_{n−1}] = 0 as stated.

Moreover:

Lemma 3.4. Under Assumptions 3.1, 3.2, 3.3 and 3.4, the remainder terms r_n and e_n tend a.s. to 0 as n → ∞, and Σ_n γ_n ι_n is an almost surely convergent series.

Proof. Under Assumption 3.4, e_n converges a.s. to zero. From the stability condition of Assumption 3.4, one has P[‖u_{n−1}‖ < α] = 1.
Moreover, using Assumption 3.2:

‖rn‖ ≤ ‖Rn‖α + ‖R′n‖Cα³ + α³‖Rn‖

and Assumption 3.3 then implies that rn tends a.s. to 0. Moreover, using the decomposition of Lemma 3.3, one has

E[‖ιn‖² | Fn−1] ≤ Lα² + (L + L′)α⁶.

Using Assumption 3.1, it follows that

Σn γn² E[‖ιn‖² | Fn−1] < K Σn γn² < ∞

for some constant K. Using a standard L²-martingale argument [75, Theorem 2.17, p. 35], this implies a.s. convergence of the series Σn γn ιn.

The following result is standard in the stochastic approximation folklore; see for instance [14] or [28, Corollary 3, p. 17]. Define r′n = rn + en; then:

Theorem 3.1. A discrete dynamical system un = un−1 + γn h(un−1) + γn ιn + γn r′n such that:

i) r′n converges almost surely to 0,

ii) Σn γn ιn < ∞ almost surely,

iii) there exists a function V with lim_{‖u‖→∞} V(u) = +∞ and ⟨∇V(u), h(u)⟩ ≤ 0,

converges almost surely to the critical set of V defined as: H = {u ∈ R^N : ∇V(u) = 0}.

Following [50], we have:

Proposition 3.2. The function V(u) = exp(‖u‖²)/(uᵀ(M + λI)u) is a positive, coercive Lyapunov function for u̇ = Mu − (uᵀMu)u. Its critical set is formed by the eigenvectors of the matrix M.

Combining Proposition 3.2 and Theorem 3.1 gives the main result of this section.

Theorem 3.2. Under Assumptions 3.1, 3.2, 3.3 and 3.4, the column vector un defined by the proposed algorithms (3.5) and (3.7) converges to an eigenvector of M.

3.4 Extension of Oja’s algorithm for p ≥ 1

3.4.1 Oja’s algorithm

We address the natural extension of the proposed algorithm when the goal is to compute more than one eigenvector, i.e., the distributed version of Oja’s algorithm for p ≥ 1 [122]. Namely, the previous algorithms have to be extended to the computation of the N × p matrix U = (u1, . . . , up) (the principal components). There are now p estimates of the p principal components computed simultaneously. The running estimates un,1, . . . , un,p are concatenated into a single N × p matrix Un = (un,1, . . .
, un,p) obeying the same iterations. Note that, while extending the centralized on-line algorithm (3.2) of Section 3.2.1, our objective is to eventually find a point in the set χ of N × p matrices whose columns are orthonormal and span the vector space associated with the p principal eigenvalues of M. When faced with random matrices Mn having a given expectation M (see Proposition 3.1), the principal eigenspace of M can be recovered by the following algorithm, due to Oja [120] and analyzed in [122]. The algorithm generates a sequence (Un)n of N × p matrices according to:

Un = Un−1 + γn(Mn Un−1 − Un−1 Un−1ᵀ Mn Un−1),   (3.10)

where γn > 0 is a step size sequence. The main objective is to identify the principal eigenvectors of M, that is, to find an N × p matrix within the set

χ ≜ {V ∈ R^{N×p} : VᵀV = Ip, Im(V) = Im(U)}   (3.11)

where Im(U) stands for the linear span of the columns of U. In order to gain more insight, it is convenient to interpret (3.10) as a Robbins-Monro algorithm (see Chapter 3 of [51]) of the form Un = Un−1 + γn(h(Un−1) + ξn), where ξn is a martingale increment noise and h is the so-called mean field of the algorithm, given by h(U) = MU − UUᵀMU. It is known that, under adequate stability assumptions and a vanishing step size γn, the algorithm converges to the roots of h (Theorem 2 in [51]). By Theorem 1 of [121], the roots of h are essentially rotations of matrices whose columns are eigenvectors of M, multiplied by some scalar, including zero. Thus, strictly speaking, the algorithm might converge to a broader set than the sought set χ. Fortunately, it is known since [158] that all roots of h outside the set χ are unstable. Undesired points can be avoided by standard avoidance-of-traps methods (see Chapter 4 in [28] and [129]), which consist in artificially adding extra noise inside the parentheses of the right hand side of (3.10).
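The iteration (3.10) can be sketched as follows. This is an illustrative simulation of ours, not taken from the thesis: the spectrum of M, the noise model and the step sizes are arbitrary choices made so that the principal eigenspace is known in advance.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 8, 2
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
M = Q @ np.diag([4.0, 3.0, 1.0, 0.8, 0.5, 0.3, 0.2, 0.1]) @ Q.T  # known spectrum
P_true = Q[:, :p] @ Q[:, :p].T           # projector onto the principal eigenspace

U = 0.1 * rng.standard_normal((N, p))
for n in range(1, 100001):
    gamma = 1.0 / (n + 50)               # vanishing step size
    E = rng.standard_normal((N, N))
    M_n = M + 0.2 * (E + E.T) / 2        # random M_n with expectation M
    # Oja's iteration (3.10): U <- U + gamma*(M_n U - U U^T M_n U)
    U = U + gamma * (M_n @ U - U @ (U.T @ M_n @ U))

print(np.linalg.norm(U.T @ U - np.eye(p)))   # ~ 0: columns become orthonormal
print(np.linalg.norm(U @ U.T - P_true))      # ~ 0: Im(U) matches the principal eigenspace
```

The final checks correspond to membership in the set χ of (3.11): orthonormal columns and the correct column span, up to a rotation of the basis.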
Thus, we introduce a projection step, as proposed in (3.3) for the one-dimensional case, to overcome the numerical instabilities of the general algorithm (3.10):

Un = ΠK[Un−1 + γn(Mn Un−1 − Un−1 Un−1ᵀ Mn Un−1)],   (3.12)

where K is an arbitrary compact convex set whose interior contains χ, and where ΠK is the projector onto K. Similarly to (3.6), we set K = [−α, α]^{N×p}. Thus, the interior of [−α, α]^p contains [−1, 1]^p for each agent i, which allows a distributed implementation.

3.4.2 Distributed on-line Oja’s algorithm

To avoid cluttered notations we write the iterations at the agent level: Un(i) refers to line i of matrix Un (Un(i) hence has size 1 × p). We also extend the compact set defined by (3.6) to the p-dimensional case. Algorithm (3.5) generalizes to the following recursion:

Yn(i) = Σ_{j∈V} Mn(i, j) Un−1(j)
Zn(i) = Σ_{j∈V} Cn(i, j) Un−1(j)ᵀ Yn(j)   (3.13)
Un(i) = Π_{[−α,α]^p}[Un−1(i) + γn(Yn(i) − Un−1(i) Zn(i))]

where the matrix Zn(i) has size p × p.

3.5 Numerical results

3.5.1 Principal eigenvector estimation (p = 1)

We consider the problem where the aim is to find the first eigenvector u1 of a similarity matrix observed partially by a set of N agents forming a network. We study the convergence behavior of the sequence generated by Algorithm 2 when varying the parameters N and q of our model. We run 100 independent trajectories of Algorithm 2 with a decreasing step size sequence (γn)n of the form γ0/n^a. We set γ0 = 1 and a = 1. To illustrate Theorem 3.2, Figure 3.1 shows the convergence towards the principal component of M for different values of q and a fixed network size of N = 10. In order to highlight the impact of the parameter q on the error performance, Figure 3.2 displays the root-mean-square error (RMSE) between u1 and the estimate un as a function of q after n = 1000 iterations and for different network sizes, namely N equal to 10, 50, 100 and 500.
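A minimal sketch of the kind of simulation run here, using the agent-level recursion (3.13), is given below. This is our own illustration, not the exact experimental code: we take an idealized Cn with mean N J and a diagonal M so that the principal eigenspace is span(e1, e2); all parameter values are arbitrary. Rows are treated as 1 × p vectors, so the p × p matrix Zn(i) multiplies the estimate on the right.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, alpha = 8, 2, 2.0
M = np.diag([4.0, 3.0, 1.0, 0.8, 0.5, 0.3, 0.2, 0.1])  # principal eigenspace span(e1, e2)

U = rng.uniform(-1, 1, (N, p))           # U[i] is the 1 x p estimate held by agent i
for n in range(1, 50001):
    gamma = 1.0 / (n + 100)
    E = rng.standard_normal((N, N))
    M_n = M + 0.2 * (E + E.T) / 2        # random M_n with mean M
    C_n = np.ones((N, N))                # idealized C_n with mean N*J
    Y = M_n @ U                                   # Y(i) = sum_j M_n(i,j) U(j)
    Z = np.einsum('ij,jk,jl->ikl', C_n, U, Y)     # Z(i) = sum_j C_n(i,j) U(j)^T Y(j)
    for i in range(N):                            # each agent updates its own row
        U[i] = np.clip(U[i] + gamma * (Y[i] - U[i] @ Z[i]), -alpha, alpha)

print(np.linalg.norm(U.T @ U - np.eye(p)))   # ~ 0: orthonormal estimates
```

With Cn = 1 1ᵀ, every agent's Zn(i) equals Un−1ᵀ Mn Un−1, so the agent-wise updates reproduce the rows of the projected iteration (3.12).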
Note that q is the Bernoulli parameter of the communication model (see Definition 3.1), which describes the probability that a node communicates at iteration n.

Figure 3.1. Convergence of the sequence (unᵀ Mn un)n towards the eigenvalue λ1 (right) and of the norm-one sequence (‖un‖)n (left) for different values of the sparsification parameter q.

One would expect that a decreasing probability q degrades the estimation accuracy, i.e., increases the RMSE. Yet, in Figure 3.2, the performance is similar over the different values of N when q ≤ 0.1 and q ≥ 0.8. In contrast, when q ∈ (0.1, 0.8) a small gap appears in the RMSE values for the different values of N; in general the performance is better for N = 100 than for N = 500.

Figure 3.2. RMSE as a function of the probability parameter q of our proposed distributed Oja’s algorithm for different values of the network size N.

In addition to Figure 3.2, Figure 3.3 illustrates the performance of Algorithm 2 for each network size N ∈ {10, 50, 100, 500} and for three different values of q ∈ {0.5, 0.8, 1}. Similarly to [92], we define the complexity per agent as η = nd, where n is the number of iterations, i.e., the number of updates needed to obtain an estimate, and d is the average number of communications, i.e., d = qN by Definition 3.1. Figure 3.3 illustrates the RMSE as a function of the complexity per node η. Similarly to [92], Figure 3.3 includes the threshold on the complexity per node in
order to emphasize the trade off between the RMSE and the amount of communications and computations required, e.g., by setting different values for the parameter q. Let η1 and η2 be the thresholds such that:

η1 = arg min_η (RMSE_{q=0.5}, RMSE_{q=1})

η2 = arg min_η (RMSE_{q=0.8}, RMSE_{q=1})

Figure 3.3. RMSE on the first eigenvector when using the proposed distributed Oja’s algorithm (Algorithm 2) for different numbers of agents N: (a) N = 10, (b) N = 50, (c) N = 100, (d) N = 500.

Table 1 and Figure 3.4 report the thresholds (η1 and η2 defined above) marked by the vertical black lines in Figure 3.3. The thresholds indicate when the dense computation scheme with q = 1 achieves the same accuracy as the sparse schemes, i.e., by tuning different levels of sparsity with q = 0.5 and q = 0.8. Indeed, Table 1 highlights the threshold on the complexity required by Algorithm 2 beyond which the sparse schemes perform better than the dense scheme. Moreover, Figure 3.4 shows the complexity thresholds (η1, η2) as a function of the network size N; they seem to grow linearly with N, i.e., ηi ∝ βi N for some βi > 1 (i = 1, 2). Table 1 also includes the values of the first eigenvalue λ1. They increase with the dimension N, meaning that a larger number of communications and updates are required to improve the accuracy.
Note that in [102] the convergence is shown to be related to N and λ1 through an expression of the error bound between the estimated and the true eigenvector.

N     η1      η2      λ1
10    880     990     1.75
50    2800    3680    9.6
100   4450    5300    11.96
500   23000   28020   81.9

Table 1. Complexity thresholds η1 and η2 marked in Figure 3.3.

Figure 3.4. Complexity thresholds η1 and η2 as a function of the network size N.

Finally, we show in Figure 4.17 the impact of the parameter q on the variance of the error, which reflects the trade off between the accuracy and the number of communications required. Figure 4.17 summarizes the statistics of the RMSE over the 100 independent runs of the sequence generated by (3.5). We consider N = 100, n = 1000 iterations and the probability q of the Bernoulli variables defined in Definition 3.1 set to 0.05, 0.5 and 0.8. The median and the variance of the RMSE decrease as the probability that an agent communicates with all other agents tends to 1, since this extreme value corresponds to the performance of the centralized processing of algorithm (3.2). The differences are rather larger between a low value (q = 0.05) and a median value (q = 0.5) than between values of q above 0.5, since the differences between the cases q = 0.5 and q = 0.8 are relatively small. Although the best RMSE performance is achieved when q = 0.8, extreme values (outliers) appear more frequently than in the lowest case q = 0.05.

In Figure 3.6 we represent the impact of the Bernoulli parameter q of the sparse communication model of Definition 3.1 on the stopping time of the projection operator. In the present scenario we set the compact set as in (3.6) with [−α, α] = [−1, 1] for each agent i = 1, . . . , N.
We consider a network of N = 10 agents and we compute the mean total number of projections over the 100 independent runs of Algorithm 2 at each iteration time n. We indicate in Figure 3.6 by a black arrow the stopping time of the projection step, i.e., the last iteration n at which the projector is required on average. Since the convergence and the accuracy degrade when decreasing q (see Figures 3.2 and 4.17), the projector remains active for longer. We observe from Figure 3.6 that even if the projector becomes inactive later when decreasing q, it is active (on average) during almost the first 300 iterations for all values of q, which may be due to the randomness of the initialization. In addition, the difference in the stopping time of the projector between q = 0.5 and q = 0.75 is rather larger (a gap of 615 iterations) than the differences between q = 0.25 and q = 0.5 (182 iterations) and between q = 0.75 and q = 0.9 (154 iterations).

3.5.2 Two principal eigenvectors estimation (p = 2)

To illustrate the proposed algorithm described in Section 3.4.2, we consider a complete graph GN with N = 1000. Although the application of localization in WSN is explained at greater length in the following Chapter 4, we use the same scenario as [92] for the agents’ positions as a benchmark.

Figure 3.5. Boxplot of the RMSE values obtained over the 100 independent trajectories and for n = 1000 iterations of Algorithm 2 for different probabilities q.

Consider a network of N = 1000 agents placed uniformly at random in the unit square, each agent i having an unknown fixed position in the plane, i.e., p = 2.
The authors of [92] propose a distributed implementation of the so-called MDS (multidimensional scaling [95]) algorithm (see [143]), whose aim is to compute the coordinates zi of each node i (up to some rotation/translation) based on the measurement of the square distance D(i, j) = ‖zi − zj‖² with the other nodes j. We set M = −(1/2)(IN − J)D(IN − J) and compute the p = 2 principal components of M using the algorithm described in Section 3.4.2. As shown in [143], the position zi can easily be inferred from the i-th entries of these vectors along with the corresponding eigenvalues. We set a decreasing step size sequence (γn)n of the form 1/√n. We run the proposed distributed on-line algorithm (3.13) 100 independent times and until iteration n = 1000. Following [92], we define d as the average number of communications per node at any iteration, which is d = Nq where q is the Bernoulli parameter of the ASM in Definition 3.1. We compare our algorithm with the algorithm of [92] for the same number of communications per agent. Figure 3.7 illustrates the root-mean-square error (RMSE) of both algorithms as a function of the complexity per node, i.e., the iteration time n multiplied by the average number of communications d. The error of the algorithm of [92] vanishes more rapidly during the very first iterations but is then immediately subject to a residual error, while our algorithm eventually converges to the sought solution. The trade off between the accuracy and the communication cost per node is contrasted by comparing the centralized dense matrix case (the power method) and the distributed sparse matrices case ([92] and our proposed algorithm).
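The (centralized) MDS reconstruction just described can be sketched in a few lines. This is our own illustration; it is exact here because the positions are truly planar, so the double-centered matrix M has rank p.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 2
Z = rng.uniform(0, 1, (N, p))                        # true positions in the unit square
D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared distances D(i, j)

J = np.ones((N, N)) / N
M = -0.5 * (np.eye(N) - J) @ D @ (np.eye(N) - J)     # double-centered matrix

lam, V = np.linalg.eigh(M)
Z_hat = V[:, -p:] * np.sqrt(lam[-p:])                # classical MDS coordinates

# Recovered positions match the true ones up to rotation/translation,
# so all pairwise squared distances coincide:
D_hat = ((Z_hat[:, None, :] - Z_hat[None, :, :]) ** 2).sum(-1)
print(np.abs(D - D_hat).max())
```

The distributed algorithm (3.13) estimates exactly the p principal eigenpairs of this matrix M, from which each node can read off its own coordinates.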
Figure 3.6. Mean number of projections per iteration n for different values of the probability q.

Finally, we show a more detailed convergence analysis of algorithm (3.13) for a fixed value of q in the considered scenario. Figure 3.8 illustrates the convergence of the estimated eigenvalues and eigenvectors. In that case, we set q = 0.8. Since we consider the two-dimensional case (p = 2), Figure 3.8 shows the orthonormality of the estimated matrix through UnᵀUn and the two corresponding eigenvalues diag(UnᵀMUn). Indeed, the desired unit norm of the estimated eigenvectors is reported in Figure 3.8 (c) and the orthogonality between the two estimated eigenvectors in Figure 3.8 (d). In addition, we expect convergence towards the true eigenvalues, λ1 = 1.1 and λ2 = 0.57, when regarding the trajectories of un,1ᵀ M un,1 and un,2ᵀ M un,2 respectively.

Figure 3.7. RMSE for the centralized case with the power method of [73, p. 406] against its distributed version [92] and the proposed distributed on-line Oja’s algorithm (3.13) (also called doOja).

(a) Trajectory of the estimated λ1.
(b) Trajectory of the estimated λ2. (c) Trajectories of ‖un,1‖ and ‖un,2‖. (d) Trajectory of un,1ᵀ un,2.

Figure 3.8. Convergence analysis of the estimated eigenspace with the proposed distributed on-line Oja’s algorithm (3.13).

Chapter 4

Application to self-localization in WSN

In this chapter we investigate the problem of localization in wireless sensor networks (WSN) as a particular application of the principal component analysis (PCA) framework described in the previous Chapter 3. The link between PCA and localization, which may not be immediately clear, is given by the following framework: wireless sensor devices are able to obtain received power measurements that can be related to a ranging model depending on the inter-sensor distances and, finally, from square distances one may find the underlying network configuration by applying PCA (the multidimensional scaling method). Thus, we focus on ranging techniques using the received signal strength indicator (RSSI), since it does not require additional hardware and/or synchronization compared to time difference of arrival (TDOA) and angle of arrival (AOA) techniques. Moreover, we focus on RSSI-based techniques that make each sensor node able to localize itself (see the scheme in Figure 4.1). Since WSN are composed of electronic devices (also called motes) characterized by their low cost, size and power consumption, distributed processing based on on-line measurements and asynchronous communications is well adapted to this scenario. The problem of self-localization involving low-cost radio devices in WSN can be viewed as an example of the internet of things (IoT).
The evolution of embedded systems and smart grids over the last 50 years has contributed to enabling WSN to integrate into the emerging IoT ecosystem. Recently, advanced applications handling specific tasks require networking support in order to design cloud-based architectures involving sensor nodes, computers and other remote components. Among the large range of applications, location services can be provided by small devices carried by persons or deployed in a given area. Information about the sensor nodes’ positions may be used for purposes such as routing and querying, which can be adapted or controlled according to the given positions.

Throughout this chapter some variables are used regularly; we summarize their notation in the table below. Then, before the description of the framework, we highlight the main contributions of this chapter.

N, M : number of unknown sensor positions and of anchors/landmarks
zi : row vector in R^{1×p} of any sensor position i (in practice, p = 2 or 3)
xi, yi : abscissa and ordinate values of any unknown sensor position zi
Z : N × p matrix whose row-elements are the positions zi
Ak : row vector in R^{1×p} of any anchor/landmark k
Ni, Mi : set of neighbor sensors of i and of its neighboring anchors
ai, bi : abscissa and ordinate values of any anchor/landmark position i
PL0 : path loss parameter of the RSSI model
η : path loss exponent parameter of the RSSI model
σ² : noise variance parameter of the RSSI model
l × h m² : dimensions of a given indoor area of length l and height h
T, Ti : number of observations (equal for all nodes or different at each node i)
t, n : time and iteration index of the observed or estimated data
dij : distance between any pair of nodes i and j
‖·‖, (·)ᵀ, ⟨·⟩ : Euclidean norm, transpose operator and scalar product

4.1 Contributions

We provide the following contributions.
• We adapt and design the distributed algorithm proposed in Chapter 3 for the WSN localization problem, assuming sparse RSSI measurements across the sensor nodes. The position of each sensor node is estimated without prior knowledge of landmarks, i.e., anchor nodes are not required.

• We provide numerical results on the position accuracy in two cases: when data is simulated following a known distribution, and when data is collected from a real testbed in an indoor scenario (at the FIT IoT-LAB platform [1]).

• We include a distributed refinement phase in order to improve the position accuracy and to lead each sensor node to obtain a local map of itself and its neighbors. This optional step may be especially useful when some landmarks surround the WSN. The refinement algorithm is first implemented on the FIT IoT-LAB testbed [2] with initial positions obtained by the algorithm proposed in this chapter. Then, we test it on three different indoor scenarios ([2] and [3]).

The chapter is organized as follows. In the first section we introduce the framework, the received signal model used in WSN to extract position information, and some experimental results obtained in a real WSN scenario. The following sections recall the classical and more recent techniques dealing with this problem. Then we focus on presenting our distributed on-line algorithm along with numerical results based on simulated and real data.

4.2 Received signal model and testbed description

4.2.1 Ranging-based approaches

Consider N agents (e.g., sensor nodes or other electronic devices) seeking to estimate their respective positions, defined as {z1, · · · , zN}. Note that positions are generally expressed in 2 or 3 dimensions in the wireless sensor context, i.e., zi ∈ R^p with p = 2, 3 for i = 1, . . . , N.
The goal is to design a distributed and on-line algorithm enabling each sensor node to estimate its position zi from noisy measurements of the distances, i.e., a ranging technique. We assume that agents only have access to noisy measurements of their relative RSSI values. The RSSI can be related to the Euclidean distance d through a statistical model. Figure 4.1 summarizes the framework of this chapter and the relation between these quantities, namely RSSI, d and z.

Figure 4.1. Framework of the RSSI-based ranging techniques.

We denote by dij the Euclidean distance between any pair of nodes i, j ∈ {1, · · · , N}, defined as dij = ‖zi − zj‖. Before introducing the RSSI model, it is worth noting that this problem is in fact ill-posed. Since the only input data are distances, exact positions are identifiable only up to a rigid transformation. Indeed, the quantities (dij)∀i,j are preserved when an isometry is applied to the agents’ positions:

i) when the positions are affected by a translation t, the distance between two shifted positions z′i and z′j is: d′ij = ‖z′i − z′j‖ = ‖zi + t − (zj + t)‖ = dij;

ii) when the positions are affected by a rotation/reflection denoted by the p × p orthogonal matrix R (i.e., a unitary matrix such that R⁻¹ = Rᵀ), the distance between two rotated positions z′i and z′j is: d′ij = ‖z′i − z′j‖ = ‖zi R − zj R‖ = dij.

The problem is generally circumvented by assuming a minimum number of anchors, also named landmarks (sensor nodes whose GPS positions are known), e.g., M = 3 or 4 when p = 2, and using this prior knowledge to resolve the indeterminacy. Thus, we divide the existing methods into two groups according to this point: the anchor-based methods and the anchor-free methods (see Section 4.3). The first group gives a position in the absolute reference system (e.g.,
GPS coordinates), and the second group gives a set of positions in a relative reference system (e.g., the origin is taken as the barycenter of the sensor nodes’ positions). Once the relation between positions and Euclidean distances is assumed, we now describe the framework used to estimate any distance d from the measurements available in wireless sensor networks, i.e., the RSSI.

4.2.2 Log-normal shadowing model (LNSM)

We recall the statistical model used to describe the received signal strength indicator (RSSI) data as a function of the distances between the sensor nodes. The log-normal shadowing model (LNSM) is based on the log-distance path loss model (see [8], [46], [136] for details), which describes the average path loss PL(d) at a distance d, expressed in dB, as:

PL(d) = PL0 + 10 η log10(d/d0)   (4.1)

where the parameters η, d0 and PL0 depend on the environment. The parameter η depends on the propagation medium. For instance, η = 2 is considered in free space, while η varies from 1.6 to 6 in indoor and more complex environments (see [136] and [72]). The parameters d0 and PL0 are the reference distance (typically 1 m in indoor environments [136]) and its corresponding path loss value. In order to capture the random shadowing effects that may occur at different locations having the same distance separation, the LNSM adds a Gaussian random variable (r.v.) ε ∼ N(0, σ²) to the average path loss PL(d) given by (4.1). Thus, the RSSI is defined as the noisy received power, given an emitted power PT and an average path loss PL(d) at distance d, expressed in dBm units: P(d) = PT − PL(d) + ε. Since the presented experimental results deal with indoor environments, we assume d0 = 1 m in our RSSI signal model. In addition, the sensor nodes considered in all our experimental campaigns are equipped with the CC2420¹ radio frequency (RF) transceiver (i.e.,
a device comprising both a transmitter and a receiver part) whose transmit power PT is typically 0 dBm. Taking into account these two specifications, the general expression of the LNSM for the RSSI r.v., denoted P, is:

P = −PL0 − 10 η log10 d + ε ∼ N(−PL0 − 10 η log10 d, σ²).   (4.2)

¹ Technical specifications: http://www.ti.com/lit/ds/symlink/cc2420.pdf.

4.2.3 Distance estimation

From the previous distribution, maximum likelihood (ML) estimation can be used to estimate the environmental parameters PL0, η and σ². When collecting several RSSI values associated with different known distances, we obtain the estimates P̂L0 and η̂ related to the mean path loss value, and the estimated variance σ̂² from the corresponding squared residuals. Let K be the number of known distance values (dk)k=1,...,K and let T be the number of RSSI values (Pk(t))t=1,...,T collected for each distance dk. If the KT total samples are drawn from the distribution (4.2), the ML optimization problem is written as follows:

max_{(PL0, η, σ²)} − (KT/2) ln(2πσ²) − (1/(2σ²)) Σ_{k=1}^{K} Σ_{t=1}^{T} (Pk(t) + PL0 + 10 η log10 dk)².   (4.3)

The environmental parameters PL0 and η can be estimated first by a least squares (LS) or maximum likelihood (ML) method, since for the normal distribution both estimators are identical. Indeed, equation (4.2) can be viewed as a linear model of the form y = −PL0 − ηx corrupted by additive noise when setting x equal to 10 log10 d. Writing (4.3) in vector notation, the estimator solves the minimization problem:

(P̂L0, η̂) = arg min_{(PL0, η)} ‖P − Lα‖²

where:

P = ((1/T) Σ_{t=1}^{T} P1(t), . . . , (1/T) Σ_{t=1}^{T} PK(t))ᵀ,
L = (−1, −10 log10 d1 ; . . . ; −1, −10 log10 dK),
α = (PL0, η)ᵀ.

The solution is then obtained as follows:

(P̂L0, η̂)ᵀ = (LᵀL)⁻¹ Lᵀ P.   (4.4)

Finally, the variance σ² is computed by plugging the estimates (4.4) into (4.3):
σ̂² = (1/(KT)) Σ_{k=1}^{K} Σ_{t=1}^{T} (Pk(t) + P̂L0 + 10 η̂ log10 dk)².   (4.5)

Given the environmental parameters estimated by (4.4) and (4.5), and using the model (4.2) for a collection of T RSSI values whose mean value is denoted by P̄, the maximum likelihood estimator of any of the distances involved in (4.3) is a biased estimator:

dˆ1 = 10^{(−P̄ − PL0)/(10η)} = d · 10^{ε̄/(10η)}   where ε̄ ∼ N(0, σ²/T).   (4.6)

Figure 4.2. Biased estimator dˆ1 (4.6) as a function of the true distance d when considering two levels of noise represented by the factor σ/η and when varying the number of collected RSSI samples T.

Due to the distribution of the signal model, the additive average noise term ε̄ becomes a multiplicative random term 10^{ε̄/(10η)} whose mean is not equal to one. For a normal r.v. ε of variance σ² and for any real constant γ, the following property holds:

E[10^{γε}] = 10^{γ²σ² ln(10)/2}   (4.7)

Hence, denoting by C the mean of this multiplicative noise, the bias of the estimator (4.6) is equal to d(C − 1), which tends to zero when T >> σ/η, as shown in Figure 4.2. As expected, the bias increases with the distance but decreases when the number of samples T increases, being nearly negligible for T = 10 and for both noise levels (σ/η = 1.5 and σ/η = 4). Taking into account the noise variance σ², the unbiased estimator of the distance can be defined as follows:

dˆ2 = dˆ1 / C = (d · 10^{ε̄/(10η)}) / C   =⇒   E[dˆ2] = d,   where C = 10^{σ² ln(10)/(2T(10η)²)}.   (4.8)

For the same reason, due to the multiplicative noise, the variance of the estimator grows linearly with the square of the distance. Nevertheless, the environmental parameters σ² and η and the number of RSSI samples T affect the behavior of the bias and the variance through the value of C.
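Equations (4.2) to (4.8) can be chained into a small end-to-end sketch of ours (not from the thesis): simulate RSSI samples from the LNSM with arbitrary "true" parameter values, fit (PL0, η) by least squares as in (4.4), estimate σ² as in (4.5), and apply the bias correction C of (4.8) to a distance estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
PL0, eta, sigma = 40.0, 2.5, 4.0               # assumed "true" environment parameters
d_cal = np.array([1.0, 2.0, 4.0, 8.0, 12.0])   # known calibration distances (K = 5)
T = 50                                         # RSSI samples per distance

# Draw T RSSI samples per distance from the LNSM (4.2): P = -PL0 - 10 eta log10(d) + eps
P = (-PL0 - 10 * eta * np.log10(d_cal))[:, None] + sigma * rng.standard_normal((len(d_cal), T))
P_bar = P.mean(axis=1)

# Least squares fit (4.4): P_bar ~ L @ (PL0, eta)^T
L = np.column_stack([-np.ones_like(d_cal), -10 * np.log10(d_cal)])
PL0_hat, eta_hat = np.linalg.lstsq(L, P_bar, rcond=None)[0]

# Residual variance (4.5)
sigma2_hat = ((P + PL0_hat + 10 * eta_hat * np.log10(d_cal)[:, None]) ** 2).mean()

# Bias-corrected distance estimate (4.8) from T new RSSI samples at an unknown distance
d = 6.0
P_new = -PL0 - 10 * eta * np.log10(d) + sigma * rng.standard_normal(T)
d1 = 10 ** ((-P_new.mean() - PL0_hat) / (10 * eta_hat))       # biased ML estimator (4.6)
C = 10 ** (sigma2_hat * np.log(10) / (2 * T * (10 * eta_hat) ** 2))
d2 = d1 / C                                                   # unbiased estimator (4.8)
print(PL0_hat, eta_hat, d1, d2)
```

With T = 50 the correction factor C is very close to 1, consistent with the regime T >> σ/η in which the bias of (4.6) vanishes.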
The variance of the unbiased estimator (4.8) coincides with its mean square error (MSE):

MSE(dˆ2) = var(dˆ2) = d²(C² − 1) → 0 ⇐⇒ T >> σ/η.   (4.9)

It is easy to see that, when taking the biased version (4.6), its variance is the same as in (4.9) multiplied by a factor C². We summarize the main statistics of both estimators dˆ1 and dˆ2 in the following table:

dˆ     B(dˆ)      var(dˆ)         MSE(dˆ)
dˆ1    d(C − 1)   d²C²(C² − 1)    d²C²(C² − 1) + d²(C − 1)²
dˆ2    0          d²(C² − 1)      d²(C² − 1)

Table 1. Bias (B(dˆ)), variance (var(dˆ)) and mean square error (MSE(dˆ) = var(dˆ) + B(dˆ)²) of each estimator as a function of the distance d and the constant C defined in (4.8).

Note from Table 1 that MSE(dˆ1) = C² MSE(dˆ2) + d²(C − 1)² > MSE(dˆ2).

Figure 4.3 highlights the effect of the trade off between the environmental parameters (σ/η) and the number of RSSI values (T) when choosing the biased or the unbiased estimator of the distance. We draw 1000 samples of (4.6) and (4.8) and compute their empirical MSE values. We consider the values 1.5 and 4 for the environmental factor σ/η, as the values estimated in our real measurement settings (see Table 7) are between 1.6 and 3.

Figure 4.3. Mean square error (MSE) of the biased estimator (4.6) and the unbiased estimator (4.8) as a function of the true distance d when considering two levels of noise represented by the factor σ/η (top σ/η = 1.5 and bottom σ/η = 4) and when varying the number of collected RSSI samples T (left T = 1, middle T = 5 and right column T = 10).

As displayed in Figure 4.3, the variance of the biased estimator is larger than the variance of the unbiased one, inflated by the multiplying factor C² defined in (4.8).
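The closed forms in Table 1 can be checked by Monte Carlo; the sketch below is ours, using the high-noise setting σ/η = 4 with T = 1 and an arbitrary true distance.

```python
import numpy as np

rng = np.random.default_rng(6)
eta, T, d = 2.0, 1, 8.0
sigma = 4 * eta                      # high-noise case sigma/eta = 4
C = 10 ** (sigma**2 * np.log(10) / (2 * T * (10 * eta) ** 2))

# Monte Carlo samples of both estimators at true distance d
eps_bar = sigma / np.sqrt(T) * rng.standard_normal(200000)
d1 = d * 10 ** (eps_bar / (10 * eta))      # biased estimator (4.6)
d2 = d1 / C                                # unbiased estimator (4.8)

mse1 = ((d1 - d) ** 2).mean()
mse2 = ((d2 - d) ** 2).mean()
# Closed forms from Table 1:
mse1_th = d**2 * C**2 * (C**2 - 1) + d**2 * (C - 1) ** 2
mse2_th = d**2 * (C**2 - 1)
print(mse1, mse1_th, mse2, mse2_th)
```

The empirical values match the closed forms, and the sample mean of dˆ2 is close to the true distance d, confirming that the correction by C removes the bias.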
For instance, in the on-line case (\(T = 1\)), the factor \(C^2\) equals 1.1 at the low noise level (\(\frac{\sigma}{\eta} = 1.5\)) and 2 at the high noise level (\(\frac{\sigma}{\eta} = 4\)). For a fixed value of \(\frac{\sigma}{\eta}\), the variances decrease as the number of samples \(T\) increases. Note that in low-noise environments the gap between the biased and the unbiased estimator is rather small even for \(T = 1\), since \(C^2\) is close to one. At higher noise levels, increasing the number of collected samples becomes the way to bring the behavior of the biased estimator close to that of the unbiased one. In conclusion, Figures 4.2 and 4.3 show that for on-line data acquisition (\(T = 1\)) the unbiased estimator should in general be preferred. However, depending on the required accuracy, the biased estimator may still be useful for distances below 5 m in the high-noise case, and for distances up to 15 m if the noise level is relatively low. Note that estimating the parameter \(\sigma^2\) may be restrictive in real scenarios, as it may not remain stable during the processing time, especially in the indoor case. Thus, the choice of distance estimator can be determined by the localization algorithm considered and the scenario involved. As we will see in the overview of existing localization techniques in Section 4.3, some of them require an estimate of the squared distance instead of the plain distance; in such cases the construction is analogous to the estimators (4.6) and (4.8).
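For the squared-distance case just mentioned, the same moment property (4.7) supplies the correction factor: the naive estimator \(10^{(-\bar P - PL_0)/(5\eta)}\) has mean \(d^2C^4\), so dividing by \(C^4\) makes it unbiased. A short simulation (illustrative parameters, not the testbed values) confirms this:

```python
import numpy as np

rng = np.random.default_rng(2)
PL0, eta, sigma, d, T, runs = 40.0, 2.0, 6.0, 8.0, 5, 300_000  # illustrative
C = 10 ** (sigma**2 * np.log(10) / (2 * T * (10 * eta) ** 2))  # as in (4.8)

# Empirical mean of T RSSI samples: P = -PL0 - 10*eta*log10(d) + eps.
P_bar = -PL0 - 10 * eta * np.log10(d) + rng.normal(0, sigma / np.sqrt(T), runs)

D_naive = 10 ** ((-P_bar - PL0) / (5 * eta))   # mean d^2 * C^4 (biased)
D = D_naive / C**4                              # unbiased squared-distance estimate

print(f"E[D_naive]: theory {d**2 * C**4:.2f}, empirical {D_naive.mean():.2f}")
print(f"E[D]:       theory {d**2:.2f}, empirical {D.mean():.2f}")
```

The factor \(C^4\) follows from (4.7) with \(\gamma = \frac{1}{5\eta} = \frac{2}{10\eta}\), whose squared exponent is four times that of \(C\).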
Upon noting the property (4.7) and the definition of the constant \(C\) in (4.8), the unbiased estimator of the squared distance is defined as:
\[
D = \frac{10^{\frac{-\bar P - PL_0}{5\eta}}}{C^4} = \frac{d^2\,10^{\frac{\bar\varepsilon}{5\eta}}}{C^4} \;\Longrightarrow\; \mathbb E[D] = d^2\,, \qquad \mathrm{var}[D] = d^4(C^8-1)\,. \qquad (4.10)
\]
Since the objective of this chapter is to evaluate the theoretical model and the proposed distributed algorithm on data coming from a real WSN deployment, the following section introduces the management of the platform and the procedure used to run an experiment on the set of selected sensor nodes appearing in our numerical results. Before discussing localization techniques, we show some of our numerical results in order to point out the main issues that arise in practice in a real indoor scenario. Indeed, during the last ten years several works have discussed the LNSM and its relevance in different types of environments (see for instance [62], [127] or, more recently, [159], among a long list).

4.2.4 FIT IoT-LAB: platform of wireless sensor nodes

1) Platform description

In order to obtain real RSSI values we make use of the FIT IoT-LAB platform deployed at Rennes, illustrated in Figure 4.4. The 256 WSN430 open nodes available on the platform comply with the IEEE 802.15.4 (ZigBee) standard operating at 2.4 GHz. The sensor nodes are located in two storage rooms of size 6 × 15 m² containing various objects. They are attached to the ceiling, 1.9 m above the floor, in a grid layout, as shown in Figure 4.4 (see more details on the website [1]). The WSN430 nodes support the open-source operating system (OS) Contiki³, which is designed for networked and memory-constrained systems with a particular focus on low-power wireless IoT devices. It is used to design embedded systems requiring IPv4/IPv6 (Internet Protocol) or Rime (a lightweight custom networking protocol) communication features.

² See: https://github.com/iot-lab/iot-lab/wiki/Hardware_Wsn430-node
³ Details: https://github.com/iot-lab/iot-lab/wiki/Contiki-support
Figure 4.4. FIT IoT-LAB platform at Rennes using WSN430 open nodes: (a) view of the platform hosted at Rennes; (b) schematics of the WSN430 node².

Since each sensor node is uniquely identified by a Rime address (a 2-byte assignment), one can make use of the primitives provided by this alternative network stack, which is specialized for low-power wireless systems. Thus, the code available on the website [1] can be adapted and loaded as a firmware file (ihex extension) onto the selected sensor nodes so that they can communicate under a designed protocol. In our experimental results we only required Rime addresses to enable the selected sensor nodes to communicate. Figure 4.5 displays the main steps of an experiment performed on the FIT IoT-LAB platform. Experiments can be launched remotely by selecting the desired sensor nodes available at each moment depending on their status, i.e. free, busy with other users, or unavailable due to technical problems. Once a firmware file has been uploaded to the selected sensor nodes, they are able to communicate and receive packets containing the corresponding RSSI measurements. Once registered, an experiment is run remotely via ssh commands in a terminal window, as shown in Figures 4.5 (c) and 4.5 (d). At the end of the experiment we recover the text file generated by each sensor node, which contains the communications passing through the serial port of the node (see the received RSSI message format in Figure 4.5 (d)).

2) Illustration of RSSI-distance measurements

Figure 4.6 (a) illustrates the behavior of some real measurements from the experimental campaigns on the FIT IoT-LAB testbed (see the experimental results in Section 2) and 2)).
We ran several experiments of 5 min (300 s) each, on different days and at different scheduled times (around 10 am, 3 pm and 10 pm, depending on node availability), involving the 50 nodes shown in Figure 4.7 (a). We set N = 44 and M = 6 in our testbed. Figure 4.6 (a) shows the RSSI values recovered by the anchor node whose Node_id is 157, which received data from 39 neighboring nodes. In order to compare the empirical distribution with the theoretical model recalled in (4.2), we drew 1000 samples from the theoretical LNSM with the parameters estimated by the anchor node LM157 (PL0 = 59.65 dBm, η = 2.3 and σ² = 31.8 dB in Table 7).

Figure 4.5. Workflow to handle an experiment from the user's website profile and to recover the collected data from the user's terminal frontend: (a) create a new experiment; (b) firmware association; (c) experiment's state transitions; (d) recover the data flow.

We compute the maximum and minimum values and plot them in Figure 4.6 (a) along with the real values. Note that the real RSSI values fit the model well for data coming from sensor neighbors close to node LM157, such as nodes 156 and 158 (see the distribution plots in Figures 4.6 (b) and 4.6 (c)). On the contrary, for nodes 247 and 249 a gap of about 10 dB appears between the theoretical mean of the LNSM and the corresponding averaged RSSI values (see Figures 4.6 (d) and 4.6 (e)), as these nodes are considerably further away and placed close to the wall. Such effects are probably related to multi-path propagation and are consequently reproduced on the estimated distances when using the unbiased estimator defined in (4.8). We report in Figure 4.7 the distance estimator based on (4.8) applied to the RSSI data of Figure 4.6 (a). As shown in Figure 4.7 (b), 53.8% of the distances (21 out of 39) are estimated with an error greater than 0.5 m, against 46.2% within that tolerance.
Nevertheless, a third of the 39 estimated distances exhibit an error greater than 1 m (the square markers on the right and the boxed sensor nodes on the left in Figure 4.7).

Figure 4.6. Comparison between the empirical distribution (histogram bars) of the real data displayed in the top figure and the distribution of the model described by (4.2): (a) real RSSI values measured by the anchor node whose Node_id is 157 of the network illustrated in Figure 4.15, together with the maximum and minimum values of the theoretical LNSM (4.2) for each given distance; (b) estimated distance by LM157 from node 156; (c) estimated distance by LM157 from node 158; (d) estimated distance by LM157 from node 247; (e) estimated distance by LM157 from node 249.

These estimated distances correspond to: the sensor nodes located furthest from anchor node 157 and closest to the wall (Node_id 240, 247, 249 and 253); the surrounding sensors of the network from which node 157 received only a low number of RSSI packets (202, 218 and 237); and the sensor nodes located close to the anchor node whose Node_id is 176 (194, 195 and 196), which may directly affect the line-of-sight of node 157.

Figure 4.7. Estimated distances at anchor node LM157 (circled in the left figure) from its neighboring sensor nodes, computed from the real collected data shown in Figure 4.6 (a): (a) network configuration around LM157; (b) estimated distances using (4.8) by LM157. In the right figure, the estimated distances whose error is greater than 1 m are marked and correspond to the sensor nodes emphasized by the boxes in the left figure; the plain line is the true distance (i.e. the distance estimated if σ² = 0) and the dashed lines delimit the estimated distances within an error tolerance of ±0.5 m.

4.3 Overview of some localization techniques

During the last decades the localization problem has attracted a great deal of attention, resulting in an extended list of proposed algorithms within the signal processing community. Indeed, several overview papers dealing with the description and classification of localization techniques have been published in the last ten years (see, for instance, [99], [126] and [105]). In this chapter we recall and summarize the main RSSI-based approaches to recover the network configuration, according to one criterion: whether anchors/landmarks are used (anchor-based) or not (anchor-free). The classification scheme is shown in Figure 4.8. We consider that all of them have a centralized nature, i.e. the problem can be completely described and directly addressed by a central processor.

Figure 4.8. Classification of the existing classical methods for localization.

4.3.1 Centralized techniques

1) Anchor-based methods

The classical techniques solve for a single unknown sensor node position at a time, by means of RSSI values coming from a fixed number M of surrounding anchor nodes (or landmarks) (see the comparisons made by [99] or [71]). We denote the unknown node position by z = (x, y) (two dimensions, p = 2) and any anchor node position by A_i = (a_i, b_i). Since the sensor node only uses information from known positions, z can be expressed in absolute coordinates, i.e. anchor positions in GPS coordinates. In anchor-based methods, the solution (the unknown sensor node position) is obtained from MT measurements, where T denotes the number of RSSI values collected from each of the M anchor nodes. Note that, when dealing with more than one unknown position, e.g. a network of N sensor nodes, the problem may be solved sequentially, i.e. one position at a time. We distinguish two ways to address the problem.

Geometrical point of view

The following methods were first studied for aerospace or aeronautical systems and consider the unknown position as the intersection point of a fixed number M of curves, one per landmark. Although they were originally based on time measurements⁴ (see [137]), the related system of equations can ultimately be described by the set of landmark positions and the distances between the landmarks and the unknown position. If the LNSM is assumed for the RSSI signal, distances can be estimated as described in Section 4.2.1. If distances are perfectly known, the system of equations is described by different geometrical curves: circles (also called trilateration [110]), hyperbolas [137], quadrilaterals (also called min-max [141]) or triangles (known as multilateration [79]). Figure 4.9 illustrates instances of these approaches.

⁴ ToA = Time of Arrival and TDoA = Time Difference of Arrival

Figure 4.9. Classical methods, geometrical point of view: (a) trilateration (M = 3); (b) intersection of three hyperbolas (M = 4); (c) intersection of three quadrilaterals (M = 3); (d) multilateration: the cosine rule applied to 5 triangles, which forms a linear system of 5 equations.

Note that, except in the quadrilaterals case, one landmark is designated as the reference of the coordinate system. As shown in Figure 4.9 (a), trilateration explicitly designates one reference landmark and applies a linear transformation (a translation t and a rotation R, both easily computed) to the whole system in order to solve the problem. In Figures 4.9 (b) and 4.9 (d) the reference landmark is taken into account implicitly in the equations through simple subtractions. Note also that multilateration (geometrically based on the cosine rule) is an extension of trilateration considering M intersecting circles. Indeed, starting from M circle equations, subtracting the equation of the reference landmark from the M − 1 remaining equations and adding the squared distance between the reference landmark and each of the M − 1 remaining landmarks yields the same system of M − 1 equations as the multilateration technique. When considering a noisy scenario, in which the distances between the unknown sensor node and the landmarks are estimated by means of RSSI values following the LNSM, several works have coupled the latter methods with a least squares problem. The most relevant works are those of [118], [140] and [141], which consider multi-hop communications between the sensor nodes: any pair of nodes in the network which are not directly connected can communicate with the help of one or more intermediate/relay nodes. The main drawback of these methods is their dependence on the anchor positions, since the solutions must lie inside the convex hull formed by the set of known positions. In addition, the solution loses robustness since it may be affected by the choice of the reference landmark, e.g. a landmark with an error in its known position or placed too far from the sensor node.
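As an illustration of the subtraction trick just described (a sketch under assumed coordinates, not the thesis code), trilateration with exact distances reduces to a small linear system:

```python
import numpy as np

def trilaterate(anchors, dists):
    """Solve for z = (x, y) from M >= 3 anchor positions and exact distances.

    Subtracting the circle equation of the first (reference) anchor from the
    others yields the linear system 2*(A_i - A_0) . z = d_0^2 - d_i^2 + |A_i|^2 - |A_0|^2.
    """
    anchors = np.asarray(anchors, float)
    d = np.asarray(dists, float)
    a0, d0 = anchors[0], d[0]
    A = 2 * (anchors[1:] - a0)
    b = d0**2 - d[1:]**2 + np.sum(anchors[1:]**2, axis=1) - np.sum(a0**2)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Hypothetical anchors and a target at (3, 4); noiseless distances here.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
target = np.array([3.0, 4.0])
dists = [np.linalg.norm(target - np.array(a)) for a in anchors]
print(trilaterate(anchors, dists))  # approximately [3. 4.]
```

With noisy RSSI-based distances, the same system is solved in the least-squares sense, which is exactly the coupling with a least squares problem mentioned above.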
Statistical point of view

These methods focus on the statistical distribution of the RSSI measurements received from the M landmarks. The goal is to assume a parametric model for the received signal and to apply the maximum likelihood estimator (MLE). Here the LNSM (4.2) is the parametric model assumed for the observed RSSI, and the unknown parameter to estimate is the position z of the sensor node. Let \((P_m(1), \dots, P_m(T))\) be the RSSI values collected from any landmark m. Then the MLE of z (in two dimensions) is:
\[
\hat z = \arg\min_{z=(x,y)} \sum_{m=1}^{M}\sum_{t=1}^{T} \Bigl(P_m(t) + PL_0 + 10\eta \log_{10}\sqrt{(x-a_m)^2 + (y-b_m)^2}\Bigr)^2 \qquad (4.11)
\]
Since (4.11) is a non-convex problem, the solution is affected by the existence of local minima. Iterative algorithms in general need a suitable initialization, e.g. a noisy version of the target position. Indeed, the solution of (4.11) lies at the intersection of the M circles of the form:
\[
(x-a_m)^2 + (y-b_m)^2 = r \qquad \forall m = 1, \dots, M
\]
where r is set to the squared distance (4.10). If the dimensions of the area of interest are known beforehand, one can set a grid of points, evaluate the function in (4.11) at each grid center and set \(\hat z\) to the point where the minimum value is achieved (see for instance [86], [47] or [54]). Note that the accuracy depends on the choice of the grid resolution, and so does the required computational cost. As an alternative, iterative algorithms can be used to solve this non-linear optimization problem, such as the conjugate gradient method used in [128], the Nelder-Mead method used in [42] and, more recently, the Levenberg-Marquardt algorithm in [7]. It is worth noting that the LNSM may not be the most suitable model for more complex indoor scenarios (see [11]). Other works propose the MLE under different statistical models. In [159] the environment is considered to be dynamic and time-varying.
They propose a modified version of the LNSM in which the estimate of the noise variance σ² is progressively adapted. In [53] the modified LNSM considers different environmental parameters (PL0, η and σ²) for each landmark and adds a bias factor to (4.2) in order to model possible outlier/multipath effects. More recently, [55] models the RSSI as a Gaussian mixture of two classes, and the problem (4.11) is solved through an Expectation-Maximization algorithm.

2) Anchor-free techniques

When dealing with distributed sensor networks, e.g. a WSN of N nodes, anchors may be absent or too far away, or the GPS signal may be unavailable, e.g. in indoor scenarios. Nevertheless, sensor nodes can still benefit from the RSSI measurements obtained from their neighbors, whose positions are unknown. The configuration of the network can then be recovered in a relative coordinate system instead of the absolute GPS coordinate system. One should therefore rely on anchor-free methods. When distances between nodes are viewed as similarity metrics, the positioning problem is referred to as multidimensional scaling (MDS). The structure of the distance-like data is related to the underlying geometric configuration of the network, i.e. the goal is to find an embedding of the N nodes such that distances are preserved. For instance, in classical MDS [27, Chapter 12] positions are obtained by principal component analysis (PCA) of the input N × N matrix constructed from the Euclidean distances. If the distances are noisy, e.g. estimated from RSSI measurements as in (4.8), [143] proposes an MDS-MAP algorithm based on the classical MDS problem: the WSN localization problem is solved by enabling each sensor node to infer all the estimated pairwise distances. It is worth noting that the MDS problem for WSN localization is related to rigid graph theory. Indeed, rigid graph theory explores the property of a given network configuration of being an embedding of a graph in a Euclidean space, i.e. the globally rigid property.
This property, also known as fold-freedom (see [77] or [44]), suffers from two problems: non-uniqueness and NP-hard complexity. To overcome these issues, [130] proposed a two-phase algorithm based on a fold-free initial configuration combined with a mass-spring model to refine the estimated positions. Alternative approaches in the localization context are based on optimization techniques. In metric or modern MDS, positions are obtained by the stress majorization algorithm called SMACOF (see [27, Chapter 8] and [101]); the minimization is performed using an auxiliary quadratic function that majorizes the stress function. Semidefinite programming (SDP) with convex constraints can be found in [57] and [24]. Recently, the works [41] and [85] gave further error-bound analyses of the SDP approach. In general, even if these centralized techniques achieve high accuracy and solve the N unknown positions at once, they have a high computational cost and may be especially complex to implement in a real wireless sensor network: classical MDS and SDP require about O(N³) operations.

4.3.2 Distributed approaches

A distributed batch version of the SMACOF algorithm based on a round-robin communication scheme is proposed in [45]. Following a sequential approach, the global function to minimize is computed at each iteration, with each node aggregating its local estimate along a cyclic path through the network. In their approach, all nodes have to broadcast their estimated positions at each iteration before the updating cycle of the stress function, and the anchor positions are required throughout the iterative process. Their numerical results on real data show a root-mean-square error (RMSE) of 2.48 m on the same testbed considered in [127], where a centralized MLE achieved an RMSE of 2.18 m.
The indoor scenario consists of N = 40 sensor nodes and M = 4 anchors deployed over 14 × 14 m², and the estimated parameters are η = 2.3 with a low noise variance σ² = 3.92 dB. Since [45] considers the minimization of the non-convex stress function, the same distributed approach (batch and incremental) is presented in [63], but with a quadratic criterion that bypasses the non-convexity issue. The function to minimize consists of two terms: a first quadratic term pushing each sensor node towards the center of the polygon defined by its neighbors' positions, and a second regularizing term which includes the information from the anchor nodes. The iterative algorithm performs two steps at each cycle: first all nodes carry out their local optimization step, and then they broadcast their estimated positions to all neighbors. The authors consider simulated data from a network composed of N = 72 sensor nodes and M = 8 anchors deployed over 90 × 90 m². They achieve a localization mean error of 5.12 m, an improvement of 31% over the corresponding value of [45]. The authors of [24] propose a distributed implementation of their SDP-based localization algorithm. In [25] the network is divided into several clusters, each with at least two anchor nodes and a large number of sensor nodes (N = 1000-2000 nodes are considered), and the SDP problem is then addressed locally at each cluster. A first step obtains a uniform partition of the network based on the geographical positions of the sensors. Then the convex optimization problem is solved at the cluster level. Depending on the accuracy obtained for their estimated positions, sensor nodes become "anchors" according to a given tolerance. The newly promoted "anchors" are used to update the clusters, and the process of solving local SDPs is repeated for several rounds. Note that [25] is a parallelization of [24], since the core computation is done at cluster level, i.e. one sensor node is designated to solve the problem in each cluster.
Although it is mentioned that neighboring clusters need to communicate which nodes have been positioned at each round, [25] gives no further details about how the computation and communication processes are handled; simulations on synthetic data illustrate the performance of the algorithm. More recently, gossip-based algorithms have been proposed in [39], [152] and [35]. The works [39] and [152] are both based on Kalman filtering applied to noisy measurements coming from anchor and non-anchor nodes. In [39], a diffusion scheme represented by a deterministic matrix, assumed to be column-stochastic, is used to solve the minimization problem; the authors show a nearly zero mean error on simulated data involving a simple network of N = 8 sensor nodes and M = 4 anchors in a 1 × 1 m² square. The work [152] focuses on mobile networks and proposes a cooperative algorithm to solve the convex optimization problem on mobile robots: each robot computes its estimated position at each iteration from all the quantized data coming from the other robots. The algorithm proposed in [35] is based on a distributed gradient descent with a time-varying step size, performed in two stages. First, the step size is computed by an iterative averaging procedure relying on a deterministic, doubly stochastic matrix, in order to reach consensus on the optimal Barzilai-Borwein step size. Once the averaging cycles provide a sufficiently accurate step size, each node receives the latest estimated positions from its neighbors and performs the gradient update locally. Their simulations on a network of N = 10 nodes in a 1 × 1 m² area show a decreasing error on the estimated positions, but require a large total number of communications (several averaging cycles for every distributed gradient descent step).
A distributed version of the spring-model algorithm of [130] can be found in [162], where all sensor nodes broadcast their local estimates at each iteration before each updating step. Alternatively, the two iterative algorithms [92] and [80] introduce a sparsification model enabling each sensor node to select only a small set of observations from its neighbors at each time. In [92] the sparsification model is based on a uniformly random choice of the observations, while the authors of [80] define a threshold determining the level of sparsity and apply it to the incoming data. These works, however, follow two different approaches: [92] introduces a sparsification model on the observations in order to decentralize the PCA algorithm, whereas [80] is based on non-linear kernel learning, where the optimization problem is solved by a gradient descent algorithm with constant step size. Other works address the MDS problem for distributed WSN localization. The MDS-MAP proposed in [143] was later improved in [142] for larger network sizes (experimental results with N = 200). In [142] each sensor node applies the MDS-MAP of [143] to its local map, and the local maps are then merged sequentially to recover the global map. Although the accuracy is improved, the complexity remains high due to the spectral decomposition involved in MDS, i.e. about O(k³), where k is the average number of neighbors per node.

4.4 Distributed MDS-MAP approach

In this section we describe the proposed distributed on-line algorithm for WSN localization based on the MDS-MAP approach. Our algorithm is asynchronous and encompasses the case of random link failures and of random, noisy and sporadic RSSI measurements. First, we briefly recall the MDS-MAP algorithm. Then, the on-line version is obtained by using Oja's algorithm [122, 29] (see Section 3.4 in Chapter 3). The last part of this section is devoted to the asymptotic convergence analysis of the proposed algorithm.
4.4.1 The framework: centralized batch MDS

Define S as the N × N matrix of squared relative distances, i.e. \(S(i,j) = d_{ij}^2\). Define \(\bar z = \frac{1}{N}\sum_{i=1}^N z_i\) as the center of mass (or barycenter) of the agents. Upon noting that \(d_{ij}^2 = \|z_i - \bar z\|^2 + \|z_j - \bar z\|^2 - 2\langle z_i - \bar z,\, z_j - \bar z\rangle\), one has:
\[
S = c\mathbf 1^T + \mathbf 1 c^T - 2ZZ^T \qquad (4.12)
\]
where \(\mathbf 1\) is the N × 1 vector whose components are all equal to one, \(c = (\|z_1-\bar z\|^2, \cdots, \|z_N-\bar z\|^2)^T\), and the i-th row of the matrix Z coincides with the row vector \(z_i - \bar z\). Otherwise stated, the i-th row of Z coincides with the barycentric coordinates of node i. Define \(J = \mathbf 1\mathbf 1^T/N\) as the orthogonal projector onto the linear span of \(\mathbf 1\), and \(J^\perp = I_N - J\) as the projector onto the space of vectors with zero sum, where \(I_N\) is the N × N identity matrix. It is straightforward to verify that \(J^\perp Z = Z\). Thus, introducing the matrix
\[
M \triangleq -\frac{1}{2}\, J^\perp S J^\perp, \qquad (4.13)
\]
equation (4.12) implies that \(M = ZZ^T\). In particular, M is symmetric, non-negative and has rank (at most) p. The agents' coordinates can be recovered from M (up to a rigid transformation) by recovering the principal eigenspace of M, i.e. the vector space spanned by the p principal eigenvectors (see [27, Chapter 12]). Denote by \(\{\lambda_k\}_{k=1}^N\) the eigenvalues of M in decreasing order, i.e. \(\lambda_1 \ge \cdots \ge \lambda_N\). In the sequel we shall always assume that \(\lambda_p > 0\), meaning that M has full column rank p < N. Denote by \(\{u_k\}_{k=1}^p\) the corresponding unit-norm N × 1 eigenvectors and set \(\mathcal Z = (\sqrt{\lambda_1}\,u_1, \cdots, \sqrt{\lambda_p}\,u_p)\). Clearly \(M = \mathcal Z\mathcal Z^T\), and \(Z = R\mathcal Z\) for some matrix R such that \(RR^T = I\). Otherwise stated, \(\mathcal Z\) coincides with the barycentric coordinates Z up to an orthogonal transformation. In practice, the matrix S is usually not perfectly known and must be replaced by an estimate \(\hat S\). This yields Algorithm 3 (see [27, Chapter 12]).

Algorithm 3: Centralized batch MDS for localization
Input: Noisy estimates of the squared distances \(D_{ij}\) defined by (4.10) for all pairs i, j.
1. Compute the matrix \(\hat S = (D_{ij})_{i,j=1,\dots,N}\).
2. Set \(\hat M = -\frac{1}{2} J^\perp \hat S J^\perp\).
3. Find the p principal eigenvectors \(\{u_k\}_{k=1}^p\) and eigenvalues \(\{\lambda_k\}_{k=1}^p\) of \(\hat M\).
Output: \(\hat Z = (\sqrt{\lambda_1}\,u_1, \cdots, \sqrt{\lambda_p}\,u_p)\)

Note that if not all N² pairwise distances can be obtained at once, due to the size of the network and the different connectivity radii of the sensor nodes, the matrix (4.12) is incomplete (it contains null entries) and the eigendecomposition cannot be computed. To overcome this problem, the MDS-MAP algorithm (see [143], and [85] for an error-bound analysis) introduces a first step in which each sensor node, or a central node, computes all pairwise distances using multi-hop communications, finding all shortest paths (by algorithms such as Dijkstra's or Floyd's, whose time complexity is O(N³)) in order to construct the complete similarity matrix (4.12). Since MDS-MAP may be limited by its accuracy and computational cost, the alternative approach proposed in [142] applies the basic MDS Algorithm 3 locally at each sensor node: each sensor node uses the information of all its immediate neighbors, which form a complete network, and is thus able to compute a local complete similarity matrix. Once the local set of relative positions is estimated, an iterative/sequential phase merges the obtained local maps. Thus the accuracy of the local maps is preserved, and the merging phase provides a refinement of the global map positions.

4.4.2 Centralized on-line MDS

In the batch Algorithm 3 above, all measurements are made prior to the estimation of the coordinates. From now on, observations are not stored in the system's memory: they are deleted after use. Thus, agents gather measurements of their relative distances to other agents and simultaneously estimate their positions.

1) Observation model: sparse measurements

We introduce a collection of independent r.v.
\((P_{ij}(n) : i, j = 1, \cdots, N,\ n \in \mathbb N)\) such that each \(P_{ij}(n)\) follows the LNSM (4.2) described in Section 4.2.2. We let \(D_n(i,j)\) be the r.v. associated with each measurement \(P_{ij}(n)\) through the unbiased estimator (4.10) of the squared distance, i.e. \(D_n(i,j)\) is equal to \(10^{\frac{-P_{ij}(n)-PL_0}{5\eta}}/C^4\) and satisfies \(\mathbb E[D_n(i,j)] = d_{ij}^2\). We set \(D_n(i,i) = 0\).

Definition 4.1 (Sparse measurements). At each time instant n, we assume that with probability \(q_{ij}\) an agent i is able to obtain an estimate \(S_n(i,j)\) of the squared distance to another agent \(j \ne i\), and makes no observation otherwise. Thus, one can represent the available observations as the product \(S_n(i,j) = A_n(i,j)D_n(i,j)\), where \((A_n)_n\) is an i.i.d. sequence of random matrices whose components \(A_n(i,j)\) follow the Bernoulli distribution with parameter \(q_{ij}\). Stated otherwise, node i observes the i-th row of the matrix \(A_n \circ D_n\) at time n, where \(\circ\) stands for the Hadamard product.

Lemma 4.1. Assume \(q_{ij} > 0\) for all pairs i, j. Set \(W := [q_{ij}^{-1}]_{i,j=1}^N\) and let \(A_n\), \(D_n\) be defined as above. The matrix
\[
S_n = W \circ A_n \circ D_n \qquad (4.14)
\]
is an unbiased estimate of S, i.e. \(\mathbb E[S_n] = S\).

Proof. Each entry \(S_n(i,j)\) of the matrix \(S_n\) is equal to \(q_{ij}^{-1} A_n(i,j) D_n(i,j)\). As the random variables \(A_n(i,j)\) and \(D_n(i,j)\) are independent, by the above definition of \(D_n\) and since \(\mathbb E[A_n(i,j)] = q_{ij}\), it follows that \(\mathbb E[S_n(i,j)] = d_{ij}^2\).

2) Oja's algorithm for the localization problem

In this section we adapt the projected Oja's algorithm presented in Section 3.4 of Chapter 3 to solve the MDS-based localization problem. The p > 1 largest eigenvectors and eigenvalues can be estimated by an extension of Oja's algorithm [122]. As a consequence of Lemma 4.1, an unbiased estimate of the matrix M defined in (4.13) is simply obtained by \(M_n = -\frac{1}{2} J^\perp S_n J^\perp\).
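To make this pipeline concrete, the following sketch (illustrative parameters; \(D_n\) is taken noiseless here, whereas in the thesis it is itself the RSSI-based estimator (4.10), which only adds independent noise) averages the unbiased sparse estimates \(S_n\) of Lemma 4.1 and feeds the result to the batch MDS of Algorithm 3:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, n_obs = 8, 2, 200_000

# Ground-truth configuration and matrix S of squared pairwise distances.
Z = rng.uniform(0, 10, size=(N, p))
S = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=2)

# Sparse observations S_n = W o A_n o D_n of Lemma 4.1 (here D_n = S).
q = rng.uniform(0.2, 0.8, size=(N, N))
W = 1.0 / q
A = rng.random((n_obs, N, N)) < q            # Bernoulli(q_ij) masks A_n
S_hat = (A * (W * S)).mean(axis=0)           # empirical mean of the S_n

# Algorithm 3: double centering + top-p eigendecomposition.
J_perp = np.eye(N) - np.ones((N, N)) / N
M_hat = -0.5 * J_perp @ S_hat @ J_perp       # eq. (4.13)
vals, vecs = np.linalg.eigh(M_hat)           # ascending eigenvalue order
Z_hat = vecs[:, -p:] * np.sqrt(np.clip(vals[-p:], 0, None))

# The output matches the true configuration up to a rigid transformation,
# so compare the reconstructed inter-node squared distances instead.
S_rec = np.sum((Z_hat[:, None] - Z_hat[None, :]) ** 2, axis=2)
print(f"max abs error on squared distances: {np.abs(S_rec - S).max():.3f}")
```

Since \(\mathbb E[S_n] = S\), the averaged estimate converges to S, and the recovered squared distances approach the true ones while the coordinates remain defined only up to an orthogonal transformation, as stated after Algorithm 3.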
When faced with random matrices Mn having a given expectation M, the principal eigenspace of M can be recovered similarly to algorithm (3.12) as:

Un = ΠK[Un−1 + γn (Mn Un−1 − Un−1 Un−1^T Mn Un−1)],    (4.15)

where K = [−α, α]^p ⊗ · · · ⊗ [−α, α]^p with α > 1 is a compact set whose interior contains χ (the set (3.11) defined in Section 3.4 of Chapter 3), ΠK is the projector onto K, and γn > 0 is a step size. Let un,k denote the k-th column of matrix Un. The p largest eigenvalues can be estimated as a straightforward extension of the above Oja's algorithm. If (un,k)n converges to one of the eigenvectors of M, then the quantity λn,k recursively defined by:

λn,k = λn−1,k + γn (un−1,k^T Mn un−1,k − λn−1,k)    (4.16)

converges to the corresponding eigenvalue (see [122]). Finally, according to step 3 of the batch Algorithm 3, the estimated barycentric coordinates are obtained as:

Ẑn = (√λn,1 un,1, . . . , √λn,p un,p).    (4.17)

4.4.3 Distributed on-line MDS

Since the goal is to implement the previous on-line algorithm (4.15)-(4.16) in a distributed setting, we introduce the communication model that enables sensor nodes to process their data locally. Each sensor node i estimates its position as in (4.17) based on its local measurements (see Definition 4.1) and sparse random communications within its neighborhood.

1) Communication model: sparse asynchronous transmissions

It is clear from the previous section that an unbiased estimate of matrix M is the first step needed to estimate the sought eigenspace. In the centralized setting, this estimate was given by matrix Mn = −(1/2) J⊥ Sn J⊥. As made clear by the observation model (in Definition 4.1), each node i observes the i-th row of matrix Sn. As a consequence, node i has access to the i-th row-average S̄n(i) := (1/N) Σj Sn(i, j). This means that matrix Sn J⊥ can be obtained with no need for further exchange of information in the network.
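The centralized recursion (4.15)-(4.16) above can be sketched as follows. This is a minimal numpy sketch, not the thesis implementation: the matrix M is a hypothetical diagonal matrix with a known spectrum, the Gaussian perturbations stand in for the random unbiased estimates Mn, and the step size, noise level and projection bound α are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, alpha = 6, 2, 1.5
M = np.diag([3.0, 2.0, 1.0, 0.5, 0.3, 0.1])  # known spectrum to check against

U = rng.normal(scale=0.1, size=(N, p))       # current estimate of the p eigenvectors
lam = np.zeros(p)                            # eigenvalue estimates, recursion (4.16)
for n in range(1, 20001):
    gamma = 1.0 / n                          # step size satisfying Assumption 4.1
    M_n = M + rng.normal(scale=0.02, size=(N, N))
    M_n = (M_n + M_n.T) / 2                  # noisy but unbiased symmetric estimate of M
    # projected Oja update (4.15); the projection is an entrywise clip onto [-alpha, alpha]
    U = np.clip(U + gamma * (M_n @ U - U @ (U.T @ M_n @ U)), -alpha, alpha)
    lam = lam + gamma * (np.diag(U.T @ M_n @ U) - lam)

# U spans the principal eigenspace (rows 3..6 vanish) and the estimated
# eigenvalues sum to lambda_1 + lambda_2 = 5
assert np.abs(U[2:]).max() < 0.1
assert abs(lam.sum() - 5.0) < 0.2
```

With this symmetric form of the update, the columns of U converge to an orthonormal basis of the principal eigenspace; the individual-eigenvector statement of the text relies on the extension of [122].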
On the other hand, J⊥ Sn J⊥ requires computing the per-column averages of matrix Sn J⊥, i.e. (1/N) Σj Sn(j, i) for all i. This task is difficult in a distributed setting, as it would require that all nodes share all their observations at any time. A similar obstacle appears in Oja's algorithm when computing matrix products, e.g. Mn Un−1 in (4.15). To circumvent the above difficulties, we introduce the following sparse asynchronous communication framework. In order to derive an unbiased estimate of M, let us first remark that for all i, j,

M(i, j) = (d̄²(i) + d̄²(j))/2 − (d²ij + δ)/2    (4.18)

where we set d̄²(i) := (1/N) Σk d²ik and δ := (1/N) Σi d̄²(i). Note that the terms d²ij and d̄²(i) can be estimated by Sn(i, j) and S̄n(i) respectively. However, additional communication is needed to estimate δ, since it corresponds to the average value over all squared distances. We define

M̂n(i, j) = (S̄n(i) + S̄n(j))/2 − (Sn(i, j) + δn(i))/2    (4.19)

where δn(i) is a quantity that we will define in the sequel, and which represents the estimate of δ at agent i. We are now faced with two problems. First, we must construct δn(i) as an unbiased estimate of δ. Second, we need to avoid computing M̂n(i, j) for all pairs i, j, and restrict the computation to some of them. In order to answer these problems, we instantiate the notion of asynchronous transmission sequence already introduced in the previous chapter. Formally,

Definition 4.2 (Asynchronous transmission sequence (ATS)). Let q be a real number such that 0 < q < 1. We say that the sequence of random vectors Tn = (ιn, Qn,i : i ∈ {1, · · · , N}, n ∈ N) is an asynchronous transmission sequence (ATS) if:
i) all variables (ιn, Qn,i)_{i,n} are independent.
ii) ιn is uniformly distributed on the set {1, · · · , N}.
iii) ∀i ≠ ιn, Qn,i is a Bernoulli variable with parameter q, i.e., P[Qn,i = 1] = q.
iv) Qn,ιn = 0.

Let (Tn)n denote an ATS defined as above.
At time n, we assume that a given node ιn ∈ {1, . . . , N} wakes up and transmits its local row-average S̄n(ιn) to other nodes. All nodes i such that Qn,i = 1 are supposed to receive the message. For any i, we set:

δn(i) = S̄n(i)/N + S̄n(ιn) Qn,i / q.    (4.20)

The following Lemma is a consequence of Definition 4.2 along with Lemma 4.1 and equation (4.13).

Lemma 4.2. Assume that (Tn)n is an ATS independent of (Sn)n. Let (M̂n)n be the sequence of matrices defined by (4.19). Then, E[M̂n] = M.

Proof. By Lemma 4.1, the expectations of the terms S̄n(i), S̄n(j) and Sn(i, j) are respectively d̄²(i), d̄²(j) and d²ij. Moreover, by Definition 4.2 the expectation of the random term δn(i) is equal to

E[δn(i)] = (1/N) E[S̄n(i)] + (1/(qN)) Σ_{j≠i} E[S̄n(j)] q = (1/N) Σ_{i=1}^N d̄²(i),

which coincides with δ. Then, the expectation of each entry of the matrix M̂n in (4.19) is equal to the corresponding M(i, j) defined in (4.18).

2) Preliminaries: constructing unbiased estimates

Now that we have obtained a distributed and unbiased estimate of M, the remaining task is to adapt Oja's algorithm (4.15) accordingly. In this paragraph, we provide the main ideas behind the construction of our algorithm. Assume that we are given a current estimate Un−1 at time n, under the form of an N × p matrix. Assume also that for each i, the i-th row of Un−1 is a variable which is physically handled by node i. We denote by Un−1(i) the i-th row of Un−1. Looking at (4.15) in more detail, Oja's algorithm requires the evaluation of intermediate values, namely unbiased estimates of M Un−1 and Un−1^T M Un−1. We consider the previous ATS (Tn)n involved in (4.19). We assume that the active node ιn (i.e., the one which transmits S̄n(ιn)) is also able to transmit its local estimate Un−1(ιn) at the same time. Thus, with probability 1/N, node ιn sends its former estimate Un−1(ιn) and S̄n(ιn) to all nodes i such that Qn,i = 1.
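The unbiasedness of δn(i) stated in Lemma 4.2 can be checked by simulation. The sketch below makes a simplifying assumption: the row-averages S̄n(i) are taken as fixed numbers rather than random measurements, which is enough to verify the expectation of the ATS-based term in (4.20).

```python
import numpy as np

rng = np.random.default_rng(2)
N, q = 5, 0.6
S_bar = rng.uniform(1.0, 3.0, N)     # row-averages, here taken as fixed numbers
delta = S_bar.mean()                 # target: delta = (1/N) sum_i dbar2(i)

def delta_estimate(i, rng):
    iota = rng.integers(N)           # active node iota_n, uniform on {0,...,N-1}
    Q = rng.random(N) < q            # Bernoulli(q) receivers ...
    Q[iota] = False                  # ... with Q_{n,iota} = 0 (Definition 4.2)
    # equation (4.20)
    return S_bar[i] / N + S_bar[iota] * Q[i] / q

est = np.mean([delta_estimate(0, rng) for _ in range(50000)])
assert abs(est - delta) < 0.05       # E[delta_n(i)] = delta (Lemma 4.2)
```

The factor 1/q compensates, on average, for the transmissions that node i does not receive, exactly as 1/qij does in (4.14).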
Then, all nodes compute:

Yn(i) = M̂n(i, i) Un−1(i) + (N/q) M̂n(i, ιn) Un−1(ιn) Qn,i.    (4.21)

As will be made clear below, the N × p matrix Yn whose i-th row coincides with Yn(i) can be interpreted as an unbiased estimate of M Un−1. We now introduce the distributed version of the second term Un−1^T Mn Un−1. Consider a second ATS (T′n)n independent of (Tn)n. At time n, node ι′n wakes up uniformly at random and broadcasts the product Un−1(ι′n)^T Yn(ι′n) to other nodes. Receiving nodes are those i's for which Q′n,i = 1. Then, all nodes are able to compute the estimated p × p matrix as follows:

Λn(i) = Un−1(i)^T Yn(i) + (N/q) Un−1(ι′n)^T Yn(ι′n) Q′n,i.    (4.22)

Lemma 4.3. Let (Tn)n and (T′n)n be two independent ATS. For any n, denote by Fn the σ-field generated by (Tk)k≤n, (T′k)k≤n, (Ak)k≤n and (Dk)k≤n. Let (Un)n be an Fn-measurable N × p random matrix and let Yn, Λn be defined as above. Then,

E[Yn | Fn−1] = M Un−1,
E[Λn(i) | Fn−1] = Un−1^T M Un−1.

Under Lemmas 4.1 and 4.2 and Definition 4.2, the random sequences Yn(i) and Λn(i) are unbiased estimates of Σj M(i, j) Un−1(j) and Un−1^T M Un−1 respectively, given Un−1.

Proof. For each i, we obtain

E[Yn(i) | Fn−1] = M(i, i) Un−1(i) + (N/q) Σ_{j≠i} (q/N) M(i, j) Un−1(j) = Σj M(i, j) Un−1(j),

and

E[Λn(i) | Fn−1] = Un−1(i)^T E[Yn(i) | Fn−1] + (N/q) (1/N) Σ_{j≠i} Un−1(j)^T E[Yn(j) | Fn−1] q = Σi Σj Un−1(i)^T M(i, j) Un−1(j),

which corresponds to the square matrix Un−1^T M Un−1.

3) Main algorithm

We are now ready to state the main algorithm. The algorithm generates iteratively, for any node i, two variables Un(i) and λn(i), according to:

Un(i) = ΠK[Un−1(i) + γn (Yn(i) − Un−1(i) Λn(i))]    (4.23)
λn(i) = λn−1(i) + γn (diag(Λn(i)) − λn−1(i)),    (4.24)

where ΠK is the projector onto the set K := [−α, α]^p, given an arbitrary α > 1.
Finally, as in (4.17), each sensor i obtains its estimated position Ẑn(i) by:

Ẑn(i) = (√λn,1(i) un,1(i), · · · , √λn,p(i) un,p(i))    (4.25)

where we set Un(i) = (un,1(i), . . . , un,p(i)). The proposed algorithm (4.23)-(4.25) is summarized in Algorithm 4 below. Note that, at each iteration n, only two communications are performed, by two randomly selected nodes drawn according to the ATSs Tn and T′n.

Algorithm 4: Distributed on-line MDS for localization (do-MDS)
Update: At each time n = 1, 2, . . .
[Measures]: each sensor node i:
• Makes sparse measurements of its RSSI to obtain (Dn(i, j))j for the j's such that An(i, j) = 1 (Definition 4.1). Set
Sn(i, j) = qij⁻¹ Dn(i, j) if An(i, j) = 1, and Sn(i, j) = 0 otherwise,
and set S̄n(i) = (1/N) Σj Sn(i, j).
[Communication step]:
• A randomly selected node ιn broadcasts Un−1(ιn) and S̄n(ιn) to the nodes i such that Qn,i = 1.
• Each node i computes Yn(i) by (4.21).
• A randomly selected node ι′n broadcasts Un−1(ι′n)^T Yn(ι′n) to the nodes i such that Q′n,i = 1.
• Each node i updates Un(i) by (4.22)-(4.23) and Ẑn(i) by (4.25).

4.4.4 Convergence analysis

We now investigate the convergence of Algorithm 4. We prove that if the sensors' positions are fixed, the algorithm recovers this configuration up to a rigid transformation.

Assumption 4.1. The sequence (γn)n is positive and satisfies:
i) Σn γn = +∞,
ii) Σn γn² < ∞.

The following assumption states the stability of the sequence (Un)n (see Assumption 3.4 in Chapter 3).

Assumption 4.2. For each node i, there exists a time instant n0i such that for all n > n0i the sequence Un−1(i) + γn (Yn(i) − Un−1(i) Λn(i)) remains in the compact set K almost surely.

Roughly speaking, Assumption 4.2 means that the projector ΠK becomes inactive at each sensor node i for all n beyond a certain value.

Proposition 4.1. For any U ∈ R^{N×p}, set h(U) = M U − U U^T M U.
Let Un be defined by (4.23). There exist two random sequences (ξn, en)n such that, almost surely (a.s.), en converges to zero, Σn γn ξn converges and

Un = Un−1 + γn h(Un−1) + γn ξn + γn en.    (4.26)

Proof. Set for each i, (M Un−1)_i = Σj M(i, j) Un−1(j) and

ξn(i) = (Yn(i) − (M Un−1)_i) + Un−1(i) (Un−1^T M Un−1 − Λn(i))    (4.27)
en(i) = ΠK[Yn(i) − Un−1(i) Λn(i)] − (Yn(i) − Un−1(i) Λn(i)).    (4.28)

Then, the sequence generated by each sensor node i is written as:

Un(i) = Un−1(i) + γn ((M Un−1)_i − Un−1(i) Un−1^T M Un−1) + γn ξn(i) + γn en(i).

First we prove that Σn γn ξn(i) converges. For that purpose, we use [75, Theorem 2.17] to prove that for any i, (ξn(i))n is an L²-bounded martingale increment sequence. By Lemma 4.3, E[ξn(i) | Un−1] is equal to zero. Regarding the second moment of (ξn(i))n, we show that supn E[‖ξn(i)‖² | Un−1] < ∞ for any i as follows:

E[‖ξn(i)‖² | Un−1] ≤ E[‖Yn(i)‖² | Un−1] + ‖Un−1(i)‖² E[‖Λn(i)‖² | Un−1] + 2 ‖Un−1(i)‖ E[‖Yn(i) Λn(i)‖ | Un−1].    (4.29)

The first term on the right-hand side (RHS) of (4.29) can be expanded as:

E[‖Yn(i)‖² | Un−1] ≤ E[|M̂n(i, i)|²] ‖Un−1(i)‖² + (N/q) Σ_{j≠i} E[|M̂n(i, j)|²] ‖Un−1(i)‖² + 2 Σ_{j≠i} E[M̂n(i, i) M̂n(i, j)] ‖Un−1(i)‖².

Upon noting that for any i, j, E[Sn(i, j)²] = (1/qij) C⁸ d⁴ij, there exists a constant K > 0, depending on N, qmin = min_{i,j} qij, C defined in (4.8) and max_{i,j} d⁴ij, such that E[|M̂n(i, j)|²] < K. In addition, by Assumption 4.2, Un−1(i) remains in the compact set [−α, α]^p, which depends on α > 1; hence supn ‖Un−1(i)‖² = pα² < ∞. Thus:

E[‖Yn(i)‖² | Un−1] ≤ (1 + N²/qmin + 2N) K α² p = B1.    (4.30)

The second term on the RHS of (4.29) can be expressed as a function of the latter bound B1 as:

E[‖Λn(i)‖² | Un−1] ≤ E[‖Yn(i)‖² | Un−1] ‖Un−1(i)‖² + ((N/q) Σ_{j≠i} E[‖Yn(j)‖² | Un−1] + 2 Σ_{j≠i} E[Yn(i) Yn(j) | Un−1]) ‖Un−1(j)‖²
≤ (1 + N²/qmin + 2N) B1 α² p.    (4.31)

Finally, the cross term E[Yn(i) Λn(i) | Un−1] is also bounded by:

E[‖Yn(i) Λn(i)‖ | Un−1] ≤ E[‖Yn(i)‖² | Un−1] ‖Un−1(i)‖ + Σ_{j≠i} E[Yn(i) Yn(j) | Un−1] ‖Un−1(j)‖ ≤ N B1 α √p.    (4.32)

Hence, using the bounds (4.30)-(4.32), supn E[‖ξn(i)‖² | Un−1] < ∞, and by Assumption 4.1, Σn E[‖γn ξn(i)‖² | Un−1] is bounded almost surely, which concludes the bound on the second moment in (4.29). Finally, by Assumption 4.2, limn en(i) = 0 a.s. for any i.

We now state the main theorem on the convergence of Algorithm 4. The following result is standard in the stochastic approximation folklore, e.g. [14], [51], [28].

Theorem 4.1 (Main result). Let Un be defined by (4.23) and λn,k be defined by (4.24). Under Assumption 4.1, for any k = 1, · · · , p, the k-th column un,k of Un converges to an eigenvector of M with unit norm. Moreover, for each node i, λn,k(i) converges to the corresponding eigenvalue.

Proof. Consider the following Lyapunov function V : R^{N×p} \ {0} → R+:

V(U) = e^{‖U‖²} / (U^T M U).    (4.33)

The following properties hold:
i) assuming that the eigenvalues of the expectation matrix M are bounded as λmin ≤ λk(M) ≤ λmax for all k = 1, . . . , p, and that ‖U‖² ≤ N α² p = b, then: e^b/(λmax b) ≤ V(U) ≤ e^b/(λmin b);
ii) lim_{‖U‖→∞} V(U) = +∞ and its gradient is ∇V(U) = −2 (V(U)/(U^T M U)) h(U);
iii) ⟨∇V(U), h(U)⟩ ≤ 0, and equality holds on the set {U ∈ R^{N×p} | h(U) = 0} ⊂ χ defined in (3.11).

The proof is an immediate consequence of Proposition 4.1 and the existence of (4.33), along with Theorem 2 of [51]. The sequence Un converges a.s. to the roots of h. These roots are characterized in [121]. In particular, h(U) = 0 implies that each column of U is a unit-norm eigenvector of M. Note that Theorem 4.1 might seem incomplete in some respect: one indeed expects that the sequence Un converges to the set χ characterizing the principal eigenspace of M.
Instead, Theorem 4.1 only guarantees that one recovers some eigenspace of M. As discussed in Section 3.4, undesired limit points can be theoretically avoided by introducing an arbitrarily small Gaussian noise inside the parentheses of (4.23) (see Chapter 4 in [28]). As a consequence, the sequence Ẑn converges to Z up to a rigid transformation.

4.4.5 Numerical results

In this section we show the performance of our proposed distributed algorithm on simulated and real data. In both cases, we consider the same network configuration corresponding to a set of N = 50 sensor nodes selected from the FIT IoT-LAB [1] platform at Rennes. Sensor nodes are located within a 5 × 9 m² area, i.e. p = 2. Six of the 50 sensors were set as anchor nodes (or landmarks), as illustrated by Figure 4.7 (a). We compare the performance of our proposed distributed on-line MDS (do-MDS) to other existing algorithms. We consider the distributed batch MDS [45] (dw-MDS) and the classical centralized methods of Section 1), namely multilateration [79] (MC), min-max [141], Algorithm 3 of Section 4.4.1 (the batch MDS) and the Oja's algorithm (4.15)-(4.16) described in Section 4.4.2. The three iterative algorithms (Oja's, dw-MDS and do-MDS) are initialized with randomly chosen positions in the 5 × 9 m² area. The performances are compared through the root-mean-square error (RMSE, expressed in meters) between the true and the estimated positions as a function of the number of communications per iteration (n).

Method / # communications:
MC/min-max / TMN
MDS / TN
Oja / NI
dw-MDS / TN + 2NI
do-MDS / (N + 2)I

Table 2. Number of communications required for each method.

Table 2 summarizes the total number of communications (i.e. information broadcast by each sensor node). For the iterative algorithms (centralized Oja, dw-MDS [45] and Algorithm 4), we consider a fixed number of iterations I. The cost is rather different depending on the approach.
Indeed, for the classical anchor-based methods, TM communications are needed for each unknown position in order to obtain each estimated distance between the node and each anchor node. Similarly, the classical anchor-free MDS algorithm needs all N nodes to broadcast T measurements to compute the estimated matrix Ŝ before computing the double-centered matrix M̂ and its eigendecomposition. The remaining iterative approaches (Oja's, dw-MDS [45] and Algorithm 4) require a different number of communications at each iteration. The batch dw-MDS requires a prior measurement phase of TN communications to obtain the estimated squared distances, while the on-line Oja's approaches (centralized and distributed do-MDS) take one measurement step at each iteration's update. Since dw-MDS [45] needs all nodes to broadcast their estimated positions before updating the global stress function by an incremental cycle, a total of 2N communications is required per iteration. For the centralized Oja's update (4.15), N communications are enough to obtain the estimated matrix Mn (from the observation matrix Sn), while our distributed Oja's update needs two additional communications due to the ATSs involved in the communication step (see Algorithm 4). Note that distributed Oja's is slightly worse in terms of communication cost, but regarding the number of operations (sums and multiplications), the cost is considerably reduced. Indeed, the computational cost scales as N² per iteration for the centralized Oja's, while our algorithm needs N + 2qN operations per iteration, since all N nodes perform at least one multiplication to update and, on average, the 2qN receiving nodes perform an additional multiplication related to the received data.

1) Simulated data

First, we show the results from simulated data drawn according to the observation model defined in Section 4.4.2.
In order to compare our proposed algorithm with the distributed MDS proposed by [45], we set the same environmental context, in which σ/η = 1.7. Figure 4.10 displays the RMSE when running Algorithm 4 over 300 independent runs of the estimated positions for different communication parameters: (qij)_{i,j} (the Bernoulli parameters related to the observation model (4.14)) and q (the Bernoulli parameter related to the ATS in Definition 4.2). Since the variance of the error sequence is upper bounded in terms of the minimum probability value in (4.30)-(4.32), we observe from Figure 4.10 a trade-off between the accuracy and the number of communications, as the RMSE decreases faster when the probability q is closer to 1.

Figure 4.10. RMSE as a function of nN for the two estimated eigenvectors un,1 and un,2, in the noiseless (σ/η = 0) and noisy (σ/η = 1.7) cases and for different values of the parameters ((qij)_{i,j} = q ∈ {0.5, 0.8}, and the centralized case with (qij)_{i,j} = 0.5).

Figure 4.11 shows the comparison of the localization RMSE over 300 independent runs of the overall estimated positions for the three iterative methods: the centralized Oja's (4.15)-(4.16) (co-MDS), the dw-MDS of [45] and our proposed Algorithm 4. The estimated positions after 1000 iterations of the three iterative algorithms are reported in Figure 4.12. Note that, as remarked in Table 2, the result in Figure 4.12 (b) requires at least twice the number of communications compared to the results of both on-line Oja's approaches. Positions close to the barycenter of the network tend to be more accurate than positions in the surrounding area for the three cases.
Nevertheless, Figures 4.12 (a) and 4.12 (c) show that these outer positions are better preserved than in [45]. Indeed, our distributed and asynchronous Oja's algorithm achieves better accuracy in general (for around 65% of the positions), except for the roughly one third of nodes located around the network's boundary, e.g. nodes 11 or 36-37 (see the squared nodes in Figure 4.12 (c)).

Figure 4.11. RMSE as a function of nN for the estimated positions (Ẑn(1), . . . , Ẑn(N)), comparing co-MDS (Algorithm (4.15)-(4.17)), dw-MDS [45], do-MDS (Algorithm 4), min-max, multilateration and the batch MDS (Algorithm 3).

Figure 4.12. Estimated network configuration from simulated data after 1000 iterations: (a) Oja's (4.15)-(4.16) (co-MDS); (b) dw-MDS [45]; (c) Algorithm 4 (do-MDS). Markers (Q) correspond to the estimated values and markers (#) to the true positions. On the right (distributed on-line Oja's), squared positions highlight worse accuracy compared to the centralized case (on the left).

2) Real data

As described in Section 4.2.4, through our user profile created on the FIT IoT-LAB website [1], we ran remotely several experiments involving the sensor nodes illustrated in Figure 4.7 (a). All real data used in this section can be found in [2] (on the research information).
The set of estimated parameters is obtained from the LNSM as in (4.4)-(4.5): σ² = 28.16 dB, PL0 = 61.71 dB and η = 2.44. We set qij = 0.8 ∀i, j, q = 0.85 and γn = 0.015/√n for our proposed algorithm described in Section 4.4.3. Table 3 summarizes the RMSE of the 44 location estimates in the real testbed over 100 independent runs for the six considered algorithms, after 1000 iterations for the three iterative algorithms (centralized on-line Oja's, dw-MDS [45] and Algorithm 4).

Method / over all nodes / at a center node / at an outer node:
MC / 1.87 / 1.76 / 3.53
min-max / 0.8 / 0.55 / 2.42
MDS / 1.98 / 1.02 / 3.2
Oja / 2.18 / 1.12 / 2.5
dw-MDS / 0.86 / 0.71 / 1.91
do-MDS / 1.56 / 0.47 / 1.41

Table 3. RMSE over the 44 estimated positions considering the real data from the FIT IoT-LAB testbed.

Table 3 also includes the results for a given center node (node 12) and a surrounding node (node 36), in order to gain insight into the impact of the sensor node location relative to the barycentric position and the anchors' positions. Although the best overall performance is achieved by min-max and dw-MDS, do-MDS achieves the best accuracy for the center node situated close to the barycentric position of the network, while multilateration and MDS give the worst RMSE values for the positions on the network's boundary.

4.5 Position refinement: distributed maximum likelihood estimator

In the framework of WSN localization, a refinement phase is in general added to obtain the absolute coordinates and/or to improve the accuracy of the estimated positions (see [140], [141], [142] or [45]). Assuming the RSSI model of Section 4.2.2, we propose a distributed and on-line algorithm based on the maximum likelihood estimator (MLE) that can be executed after Algorithm 4 without too much additional complexity. Since the function involved in the ML criterion is not convex, the convergence issues can be avoided by a suitable initialization.
The idea here is to use the estimated positions of the previous Algorithm 4 to initialize the following algorithm. We make use of a consensus-based algorithm which is linked to Chapter 2.

4.5.1 Principle: maximum likelihood estimation

Similarly to the observation model in Section 1), we introduce a collection of independent r.v.'s (Pij(n) : i, j = 1, · · · , N, n ∈ N) such that each Pij(n) follows the LNSM (4.2) described in Section 4.2.2. We let D̂ij(n) be the unbiased estimate of the log-distance obtained from (4.2) as: D̂ij(n) = (−Pij(n) − PL0)/(10η) = log10 dij + εij/(10η). Since εij ~ N(0, σ²), it is easy to verify that D̂ij(n) ∼ N(log10 dij, σ²/(100η²)). Moreover, we define two sets of RSSI measurements collected at each sensor node i: {Pij(n)}∀j∈Ni from its neighboring sensor nodes and {Pik(n)}∀k∈Mi from its neighboring landmarks. For each sensor node i, the set of positions of its neighboring nodes is denoted by (x(i), y(i)) = {(xj, yj)}∀j∈Ni (including its own position zi = (xi, yi)). The aim is to solve the global optimization problem on the function F : (x, y) ⊂ R^{N×2} → R+, defined as a sum of local functions:

min_{(x,y)} F(x, y) = min_{(x,y)} Σ_{i=1}^N fi(x(i), y(i))    (4.34)

where:

fi(x(i), y(i)) = Σ_{j∈Ni} (D̂ij(n) − log10 ‖zi − zj‖)² + Σ_{k∈Mi} (D̂ik(n) − log10 ‖zi − zk‖)².

Note that in a centralized setting, the gradient ∇F(x, y) can be computed for each unknown position zi as:

dF/dxi = 4 Σ_{j∈Ni} ((D̂ij(n) + D̂ji(n))/2 − log10 dij) (xi − xj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (xi − ak)/(d²ik ln 10)
dF/dyi = 4 Σ_{j∈Ni} ((D̂ij(n) + D̂ji(n))/2 − log10 dij) (yi − yj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (yi − bk)/(d²ik ln 10)    (4.35)

When considering zi ≠ zj for each pair i ≠ j ∈ {1, . . . , N}, the solution of (4.34) that cancels (4.35) is written as a system of equations.
A first set of N(N − 1)/2 equations has 2N unknowns and, for each unknown position zi, a set of |Mi| equations of the form (4.11). The solutions lie on the intersection of both sets:

A ∩ B    (4.36)

where:

A = { ∀i, j = 1, . . . , N, i ≠ j | (xi − xj)² + (yi − yj)² = 10^{D̂ij(n)+D̂ji(n)} }
B = { ∀i = 1, . . . , N, ∀k ∈ Mi | (xi − ak)² + (yi − bk)² = 10^{2D̂ik(n)} }

The resulting system (4.36) can be viewed as a separable problem: the first set gives a global set of solutions forming the whole network configuration, while the second set deals with a local trilateration (multilateration) problem solved for each single position inside its reference system of landmarks. Indeed, as explained in Section 1), a first unbiased estimate of each single position can be found from the second set of equations. Then, the first set of equations can be solved in the same way, considering one unknown position at a time while fixing the rest of the neighboring nodes as estimated anchor positions. Since the likelihood function of the LNSM involved in (4.34) is not convex with respect to the unknown positions (i.e. the Hessian matrix obtained by differentiating (4.35) is not positive definite), a standard iterative gradient descent suffers from a poor initialization point. We consider an initial guess obtained by another localization algorithm before the cooperative refinement step, in order to overcome this initialization issue.

4.5.2 The algorithm: on-line gossip-based implementation

In a distributed processing setting, the global function in (4.35) is not perfectly known by each sensor node. However, in this context, the separable nature of problem (4.34) can be exploited to design a distributed implementation consisting of local computations and random communications among the sensor nodes.
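The random communications used below rely on the standard gossip averaging primitive of [31]: two randomly selected nodes average their current values, which preserves the network-wide average while driving all values to consensus. A minimal sketch of this primitive on hypothetical data (the full Algorithm 5 of the next section only averages the positions common to both neighborhoods):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10
x = rng.uniform(0.0, 5.0, (N, 2))    # hypothetical local position estimates
mean0 = x.mean(axis=0)

for _ in range(2000):
    i, j = rng.choice(N, size=2, replace=False)  # two random nodes wake up
    x[i] = x[j] = (x[i] + x[j]) / 2              # pairwise gossip average

assert np.allclose(x, mean0)         # the network average is preserved...
assert x.std(axis=0).max() < 1e-8    # ...and all nodes reach consensus
```

Each pairwise exchange is doubly stochastic, which is why the average is conserved at every step.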
Thus, (4.34) is finally solved by means of a distributed on-line stochastic approximation algorithm (DSA) based on the local version of (4.35), ∇fi(x(i), y(i)) (see the general framework of [19] and Chapter 2 for more details on consensus algorithms). The components for each sensor i are composed of its partial derivatives (dfi/dxi, dfi/dyi) and those from its neighbors (dfj/dxi, dfj/dyi), as follows:

dfi/dxi = 2 Σ_{j∈Ni} (D̂ij(n) − log10 dij) (xi − xj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (xi − ak)/(d²ik ln 10)
dfi/dyi = 2 Σ_{j∈Ni} (D̂ij(n) − log10 dij) (yi − yj)/(d²ij ln 10) + 2 Σ_{k∈Mi} (D̂ik(n) − log10 dik) (yi − bk)/(d²ik ln 10)    (4.37)
dfj/dxi = 2 (D̂ji(n) − log10 dij) (xi − xj)/(d²ij ln 10),  ∀j ∈ Ni
dfj/dyi = 2 (D̂ji(n) − log10 dij) (yi − yj)/(d²ij ln 10),  ∀j ∈ Ni

Note that each component i in (4.37) involves the knowledge of the set of positions (x(i), y(i)) and the neighborhood components of the gradient function (dfj/dxi, dfj/dyi). Thus, we introduce a consensus-based algorithm in order to drive these local terms towards the global gradient terms (4.35). Similarly to Algorithm (2.1)-(2.2) in Chapter 2, Algorithm 5 consists of a local gradient descent step along with a gossip step based on the pairwise model of [31]. Equations (4.38)-(4.39) generate a sequence of the sets of estimated positions at each sensor i, i.e. its own position and those of its neighbors. Figure 4.13 illustrates an iteration of the proposed algorithm. Indeed, the scheme shown in Figure 4.13 highlights that at any n, each sensor node i obtains an estimate of its local map through the set (x(i)_n, y(i)_n) = {(x(i)_{n,j}, y(i)_{n,j})}∀j∈Ni.

Figure 4.13. Scheme of Algorithm 5 (doMLE) at any time n, when the gossip step is performed by nodes 1 and 2. Both estimated local maps are delimited by two different dashed lines.
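The per-neighbor structure of the local cost and its partial derivatives can be checked against finite differences. This is a minimal numpy sketch with hypothetical positions and log-distance estimates; note that differentiating −log10 ‖zi − zj‖ produces a leading minus sign in front of the residual-times-offset terms.

```python
import numpy as np

def f_i(z, z_nb, D_hat):
    """Local ML cost f_i of (4.34): squared log-distance residuals between
    position z and the (hypothetical) neighbour positions z_nb."""
    d = np.linalg.norm(z - z_nb, axis=1)
    return np.sum((D_hat - np.log10(d)) ** 2)

def grad_i(z, z_nb, D_hat):
    """Analytic partial derivative of f_i with respect to z: residual times
    (z - z_j)/(d_ij^2 ln 10) per neighbour."""
    d2 = np.sum((z - z_nb) ** 2, axis=1)
    r = D_hat - 0.5 * np.log10(d2)       # residual D_hat - log10 d
    return -2.0 * np.sum(r[:, None] * (z - z_nb) / (d2[:, None] * np.log(10)), axis=0)

rng = np.random.default_rng(3)
z = np.array([1.0, 2.0])
z_nb = rng.uniform(0.0, 5.0, (4, 2))     # four hypothetical neighbours
D_hat = rng.normal(0.3, 0.05, 4)         # hypothetical log-distance estimates
g = grad_i(z, z_nb, D_hat)

eps = 1e-6                               # central finite-difference check
num = np.array([(f_i(z + eps * e, z_nb, D_hat) - f_i(z - eps * e, z_nb, D_hat)) / (2 * eps)
                for e in np.eye(2)])
assert np.allclose(g, num, atol=1e-5)
```

The same per-neighbor terms, evaluated with the current position estimates, are what each node plugs into the descent step (4.38).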
Algorithm 5: Distributed on-line MLE (doMLE)
Initialize: for each i, set {(x(i)_0(j), y(i)_0(j))}∀j∈Ni.
Update: at each time n = 1, 2, . . .
[Local step] [156]: each node i obtains {Pij(n)}∀j∈Ni and {Pik(n)}∀k∈Mi. Each sensor i computes (4.37) and a temporary estimate of its position set as:

(x̃(i)_n, ỹ(i)_n) = (x(i)_{n−1}, y(i)_{n−1}) − γn ∇fi(x(i)_{n−1}, y(i)_{n−1})    (4.38)

where (γn)_{n≥1} is a decreasing step sequence such that γn = 1/√n.
[Gossip step] [31]: two uniformly randomly selected nodes i and j exchange their common estimated positions and average their values. The final updates are set as follows:

(x(i)_{n,ℓ}, y(i)_{n,ℓ}) = (x(j)_{n,ℓ}, y(j)_{n,ℓ}) = ((x̃(i)_{n,ℓ}, ỹ(i)_{n,ℓ}) + (x̃(j)_{n,ℓ}, ỹ(j)_{n,ℓ}))/2,  ∀ℓ ∈ Ni ∩ Nj.    (4.39)

Otherwise, ∀ℓ ∉ Ni ∩ Nj and ∀m ≠ i, j,

(x(m)_{n,ℓ}, y(m)_{n,ℓ}) = (x̃(m)_{n,ℓ}, ỹ(m)_{n,ℓ}).

4.5.3 Numerical results: initialization by the do-MDS algorithm

In order to highlight the improvement in accuracy that can be achieved by the refinement phase for given estimated positions, we consider the real data from the FIT IoT-LAB testbed used in Section 4.4.5. We compare the same algorithms by using the estimated positions obtained from each algorithm as the initialization of Algorithm 5. Table 4 shows the RMSE values after the refinement phase. In addition, we include the ratio of the accuracy improvement, considering the RMSE values after and before applying the distributed MLE, and the ratio of positions over the total N that are improved. The best performances are achieved by min-max, dw-MDS and do-MDS in terms of minimum RMSE value over the N estimated positions. Nevertheless, the highest improvement is obtained with the proposed do-MDS, since its RMSE before the refinement phase was higher than the values from min-max and dw-MDS, which do not experience a considerable decrease.
In general, the refinement Algorithm 5 improves almost all the positions for each method, and especially for the anchor-free methods based on the MDS approach. Indeed, the highest values are those of the distributed versions, which can take advantage of the local knowledge of each sensor node.

Method / After refinement / Improvement (%) / Positions improved (%):
MC / 1.05 / 44 / 75
min-max / 0.54 / 32 / 71
MDS / 1.39 / 30 / 80
Oja / 1.37 / 28 / 80
dw-MDS / 0.6 / 30 / 82
do-MDS / 0.51 / 78 / 86

Table 4. RMSE averaged over the 44 estimated positions considering real data.

Table 5 reports the results for the center node (node 12) and the surrounding node (node 36), in order to gain insight into the impact of the sensor's placement relative to the barycentric position and the anchors' area. The improvement in accuracy is rather higher for the outer position for all methods and, in general, the RMSE values for the center and the outer positions are similar, except for multilateration and MDS. Since the distributed MLE focuses on the signal model instead of the network configuration, the resulting RMSE may not depend on the sensor nodes' location.

Center node 12 — Method / After refinement / Improvement (%):
MC / 1.68 / 4
min-max / 0.48 / 12
MDS / 0.75 / 26
Oja / 0.48 / 58
dw-MDS / 0.54 / 22
do-MDS / 0.37 / 27

Outer node 36 — Method / After refinement / Improvement (%):
MC / 0.89 / 75
min-max / 0.45 / 26
MDS / 1.27 / 60
Oja / 0.46 / 76
dw-MDS / 0.33 / 62
do-MDS / 0.23 / 84

Table 5. RMSE location error and improvement after the refinement algorithm for the center node 12 (top) and the outer node 36 (bottom), considering real data and the RMSE values of Table 3.

4.6 A cooperative RSSI-based algorithm for indoor localization in WSN

This section is devoted to presenting the work made in collaboration with N.A. Dieng, a member of the laboratory LINCS(5) at Institut Mines-Télécom.
In order to improve the accuracy achieved by the biased maximum likelihood estimator (B-MLE) proposed in [53], and to let each sensor node build a local map of itself and its neighbors, we use the on-line distributed stochastic approximation (DSA) algorithm described in the previous section. Algorithm 5 was tested in three different indoor environments where several measurement campaigns were held. Apart from the FIT IoT-LAB testbed described in Section 4.2.4, we benefit from two more testbeds available in [3]. In general, when applying the MLE to the LNSM (4.2), the parameters of the propagation model, PL_0, η and σ², are considered equal for all landmarks, even in indoor scenarios (see for instance [145]). When dealing with closed and relatively small spaces, RSSI is not accurate enough, and the effects of multipath, possible blocking objects and antenna orientation may also be included in the propagation model as outliers (see, for instance, [11] and [62]). A B-MLE involving a random factor related to the possible outliers was proposed by [53] and has been experimentally shown to reduce the mean error of the classical MLE. As an anchor-based approach, the optimization problem is defined for each single unknown position, given several RSSI values from a set of surrounding landmarks. Since the average RSSI values measured at different positions do not always decrease with the distance (see the learning phase in [54]), we let the parameters of the propagation model differ from one landmark to another. As a result, the M considered landmarks are not treated equally during the statistical estimation phase (cf. Tables 6 and 7), and a set of M parameters is defined by {(PL_{0,k}, η_k, σ_k²)}_{k=1}^M.

4.6.1 Observation model: biased log-normal shadowing model (B-LNSM)

This section recalls the dynamic method introduced by [53] to estimate the position of a sensor node from a set of landmarks.
The sensor node seeks to reduce the effect of any potentially aberrant landmark whose measurements do not improve the localization accuracy. This effect is compensated by introducing a constant bias, which becomes an additional variable to estimate and replaces the log-normal shadowing model of the measurements associated with this landmark. As for the standard LNSM, we denote by P_L(t) the t-th RSSI sample measured by the sensor node from packets transmitted by a given landmark L. We define PL_{0,L}, η_L and σ_L² as the propagation parameters of landmark L, and we rewrite the general signal model (4.2) as follows:

  P_L(t) = −(PL_{0,L} + 10 η_L log_{10} d_L) I_{L≠O} + β I_{L=O} + N(0, σ_L²),   (4.40)

where d_L is the distance to landmark L, β is the constant bias which replaces the measurements coming from a given landmark O, and I is the indicator function, equal to 1 when the subscript expression is true and 0 otherwise. Abnormal landmarks O can be detected from equation (4.40), and the biased LNSM can then be fully characterized. The aberrant landmark can be identified by comparing the global likelihood values when each landmark is successively considered as the outlier.

Footnote 5: see the "Network, Mobility and Security" research group, http://www.infres.enst.fr/wp/nms/

4.6.2 Initialization: biased maximum likelihood estimator (B-MLE)

Combining all the measured values, we apply the MLE to the proposed model (4.40) to compute the likelihood expressions in the case where landmark O is considered as abnormal.
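A minimal sketch of drawing RSSI samples from the B-LNSM (4.40); the function name and the parameter values used below are illustrative, not taken from the testbeds.

```python
import math
import random

def sample_rssi(d_L, PL0_L, eta_L, sigma_L, is_outlier, beta, rng):
    """One RSSI sample from the B-LNSM (4.40): the log-normal shadowing
    mean is replaced by the constant bias beta when landmark L is the
    outlier O; Gaussian noise N(0, sigma_L^2) is added in both cases."""
    if is_outlier:
        mean = beta
    else:
        mean = -(PL0_L + 10.0 * eta_L * math.log10(d_L))
    return mean + rng.gauss(0.0, sigma_L)
```

Averaging many samples recovers the model mean, −(PL_{0,L} + 10 η_L log_{10} d_L) for a normal landmark and β for the outlier.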
If we denote by T_L the number of samples received from landmark L, the likelihood function for each L ≠ O is written as:

  L_L(x, y) = −T_L log σ_L² − (1/2) Σ_{t=1}^{T_L} ( (P_L(t) + PL_{0,L} + 10 η_L log_{10} d_L) / σ_L )²,   (4.41)

and for the outlier (abnormal) landmark O it becomes:

  L_O(β) = −T_O log σ_O² − (1/2) Σ_{t=1}^{T_O} ( (P_O(t) − β) / σ_O )².   (4.42)

The global likelihood function of the data set is then the sum of (4.41) and (4.42) over the M landmarks:

  L^O(x, y, β) = Σ_{L=1; L≠O}^{M} L_L(x, y) + L_O(β).   (4.43)

Thus, the maximum likelihood criterion applied to (4.43) is used to infer the sensor node's position and the corresponding bias β. Upon noting that (4.43) is a separable problem, the B-MLE solution is given by:

  (x_O, y_O, β_O) = argmax_{x,y} Σ_{L=1; L≠O}^{M} L_L(x, y) + argmax_β L_O(β),   (4.44)

where β_O = (1/T_O) Σ_{t=1}^{T_O} P_O(t).

4.6.3 Experimental results after the refinement phase

The numerical results are obtained by running Algorithm 5 with the positions initialized by the B-MLE (4.44). We consider three wireless sensor networks based on the ZigBee IEEE 802.15.4 standard and operating at 2.4 GHz. The three real testbeds involve different dimensions and low-power devices: two testbeds use TMote Sky nodes (technical details in http://www.eecs.harvard.edu/~konrad/projects/shimmer/references/tmote-sky-datasheet.pdf) and one uses WSN430 nodes; both node types include the CC2420 RF transceiver. For each testbed, the procedure is as follows. Each of the N_L sensor nodes selected for the learning phase broadcasts 100 frames. Then, the M landmarks compute the set of propagation model parameters {(PL_{0,k}, η_k, σ_k²)}_{k=1}^M as detailed in [54]. We set two different sizes N_L of the learning data set: a small set involving the first 10 sensor nodes' positions and a large one involving the first 25.
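The separability of (4.44) means that the bias estimate decouples from the position search: β_O is simply the empirical mean of landmark O's samples, which maximizes the quadratic likelihood (4.42). A minimal sketch (function names are illustrative):

```python
import math

def b_mle_bias(p_samples):
    """Closed-form bias of (4.44): the empirical mean of the RSSI
    samples of the landmark treated as outlier."""
    return sum(p_samples) / len(p_samples)

def likelihood_outlier(beta, p_samples, sigma2):
    """L_O(beta) of (4.42): -T*log(sigma^2) - (1/2) sum ((P(t)-beta)/sigma)^2."""
    T = len(p_samples)
    return -T * math.log(sigma2) \
           - 0.5 * sum((p - beta) ** 2 / sigma2 for p in p_samples)
```

Since L_O is concave and quadratic in β, the empirical mean attains its maximum, which is what makes the B-MLE solution separable.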
The RSSI values are collected from the set of received frames, and the parameters are then estimated by applying the MLE criterion as in (4.4)-(4.5). The remaining N − N_L sensor nodes compute a first estimate of their positions using the B-MLE (4.44), given the parameters and the RSSI values from their corresponding received frames. The refinement phase is subsequently applied, taking the latter positions as the initialization values of the distributed on-line Algorithm 5. At each iteration, a single frame is broadcast by each sensor in order to compute the local estimates, and only two randomly selected sensor nodes exchange their common estimated positions, which are finally updated as the average value.

1) Office scenario

The testbeds considered in this section are the same as those of [55]. The first testbed, located in Paris, is a small semi-furnished office at the LINCS laboratory with dimensions 4 × 3 m². It involves the positions of 48 sensors and 5 landmarks. The second, larger testbed, located at Telecom Bretagne, is a classroom of the Rammus platform hosted by the RSM department, whose dimensions are 8.77 × 6.46 m². It involves the positions of 57 sensor nodes and 8 landmarks. In both testbeds the sensor nodes are placed 1.25 m above the floor. None of these rooms is electromagnetically isolated, and active wireless access points close to the testbeds may create interference. The presence of two moving persons in both rooms may obstruct the line-of-sight between the sensor nodes and the landmarks during the communication phases. Both testbeds are illustrated in Figure 4.14 (available and detailed in [3]).
[Figure 4.14 (two panels): (a) LINCS testbed layout (N = 48 nodes, M = 5 landmarks) with the RSSI values collected at node 15 from the 5 landmarks; (b) Rammus testbed layout (N = 57 nodes, M = 8 landmarks) with the RSSI values collected at node 20 from the 8 landmarks.]

Figure 4.14. Office testbeds and RSSI values collected at the squared nodes from data transmitted by the M landmarks. One marker highlights the real RSSI values; two further markers indicate, respectively, the average and the minimum/maximum values of 100 i.i.d. random samples drawn from the theoretical LNSM (4.2) given the estimated parameters in Table 6.

Regarding the real RSSI data shown in Figure 4.14, some values received at each of the two nodes (node 15 at the LINCS testbed and node 20 at the Rammus testbed) are affected by a bias depending on the landmark. For instance, in Figure 4.14 (a), the data coming from LM3, which has the highest path loss exponent η (see Table 6), suffer a considerable bias with respect to the theoretical mean value (around −42 dBm), since the real values lie around −70 dBm.
From Figure 4.14 (b) we observe a gap of around 20 dBm between the theoretical and the empirical mean value of the RSSI coming from LM1, which is the landmark closest to node 20, possibly due to its proximity to the wall, since its corresponding estimated parameters are rather mild (see Table 6). The estimated parameters for each landmark are summarized in Table 6.

LINCS testbed:
LM_ID   PL0     η      σ²
LM1     40.72   1.47   26.91
LM2     29.97   2.81   49.55
LM3     40.59   3.11   12.09
LM4     57.69   2.06   17.76
LM5     40.38   0.92   31.29

Rammus testbed:
LM_ID   PL0     η      σ²
LM1     52.09   1.31   11.89
LM2     49.23   0.81   26.17
LM3     47.83   1.51   27.51
LM4     41.09   1.97   6.69
LM5     42.67   1.38   26.81
LM6     45.69   1.66   23.08
LM7     71.85   −2.19  12.69
LM8     50.97   0.51   13.19

Table 6. Estimated parameters from the office at the LINCS testbed (top) and the classroom at the Rammus testbed (bottom) when the small data set of 10 sensor nodes is selected.

2) FIT IoT-LAB platform

The testbed at the FIT IoT-LAB platform in Rennes, of size 5 × 9 m², involves the positions of 44 sensor nodes and 6 landmarks; its configuration is shown in Figure 4.15. The WSN430 open nodes available at the platform are located in a big storage room containing various objects, and they are placed on the ceiling, 1.9 m above the floor, in a grid organization. There was no one in the room most of the time, and the only wireless access point was located in the corridor, which is separated by a cinder wall (not electromagnetically isolating). On the right in Figure 4.15 we illustrate the real and the empirical RSSI values collected at the node whose Node_id is 240, coming from the 6 landmarks. Note that the most important bias, of more than 10 dBm, appears on the RSSI corresponding to the closest landmark, LM244, which is situated next to the wall of the room and has the highest path loss value PL0 (see Table 7). The estimated parameters for each landmark are summarized in Table 7.
3) Comparison and discussion

We run both algorithms: the initialization performed by (4.44) and the refinement phase performed by the distributed Algorithm 5.

[Figure 4.15 (two panels): testbed selection at the Rennes FIT IoT-LAB platform (N = 44 nodes, M = 6 landmarks, left) and the RSSI values collected at node 240 from the 6 landmarks (right).]

Figure 4.15. Network configuration of the 50 sensors selected at the FIT IoT-LAB platform in Rennes [1].

LM_ID    PL0     η      σ²
LM157    62.19   1.76   19.06
LM163    63.61   2.83   40.87
LM176    58.4    3.39   37.04
LM214    63.33   1.98   75.62
LM236    58.55   2.80   30.03
LM244    67.67   2.29   18.97

Table 7. Estimated parameters from the FIT IoT-LAB Rennes testbed when the small data set of 10 sensor nodes is selected.

In order to evaluate and quantify the accuracy achieved by these methods, we define the normalized mean deviation (NMD) as the RMSE over the N estimated positions normalized by the testbed's dimensions, i.e. l × h m². It can be defined as:

  NMD = (1/N) Σ_{i=1}^{N} NMD_i = (1/√(l² + h²)) (1/N) Σ_{i=1}^{N} ||ẑ_i − z_i||,

where {ẑ_i}_{∀i} is the set of the N estimated positions. Figure 4.16 illustrates the decrease of the RMSE over the N positions along the iteration index n for each testbed, where each n involves a communication between two randomly selected nodes. Note that, the environment being real, the algorithm converges to an asymptotic error which may depend on the testbed's parameters. Regarding the different curves in Figure 4.16, the earliest testbed to achieve an improvement (after n = 24 iterations, implying 48 communications) is the Rammus testbed.
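The NMD defined above is straightforward to compute; a short sketch (function name illustrative):

```python
import math

def nmd(est, true, l, h):
    """Mean Euclidean location error over the N positions,
    normalized by the testbed diagonal sqrt(l^2 + h^2)."""
    n = len(est)
    mean_err = sum(math.dist(e, t) for e, t in zip(est, true)) / n
    return mean_err / math.sqrt(l * l + h * h)
```

For example, on a 3 × 4 m² testbed, a single position off by the full diagonal and another estimated exactly give an NMD of 0.5.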
However, this testbed achieves the worst accuracy after the refinement phase, since its RMSE curve always remains above the other two. The best accuracy is achieved with the LINCS testbed, even if the convergence is slower: 89 refinement iterations, implying 178 communications, are required to improve the mean localization error. As reported in Table 8, an RMSE of less than 80 cm is obtained for the LINCS testbed. In order to summarize the results for each testbed, Table 8 displays the average error, both the RMSE in meters and the corresponding normalized value NMD. The numerical results are reported for the two sizes of data sets chosen during the learning phase of the B-MLE (N_L = 10 and N_L = 25).

[Figure 4.16: RMSE_n versus the iteration time n (0 to 250) for the DSA and B-MLE methods on the LINCS, Rammus and FIT IoT-LAB testbeds.]

Figure 4.16. Convergence of the RMSE sequence generated by Algorithm 5 along the iteration time n for each testbed when considering the small learning data set (N_L = 10). Markers emphasize the iteration time at which the RMSE of the distributed refinement phase becomes lower than the RMSE computed by the initialization B-MLE.

Small learning data set (N_L = 10):
Testbed       Method   RMSE (m)   NMD    Improvement (%)   Positions improved (%)
LINCS         B-MLE    1.39       0.28
              DSA      0.77       0.16   44.3              76
Rammus        B-MLE    2.73       0.25
              DSA      1.73       0.16   36.7              72
FIT IoT-LAB   B-MLE    1.85       0.18
              DSA      1.3        0.13   29.5              74

Big learning data set (N_L = 25):
Testbed       Method   RMSE (m)   NMD    Improvement (%)   Positions improved (%)
LINCS         B-MLE    1.35       0.27
              DSA      1.31       0.26   3.8               55
Rammus        B-MLE    2.27       0.22
              DSA      1.64       0.15   30.5              63
FIT IoT-LAB   B-MLE    2.27       0.22
              DSA      1.84       0.18   19                68

Table 8. Localization error averaged over the N estimated sensor nodes' positions for each of the three testbeds when using the small data set of 10 positions (top) and the big data set of 25 (bottom).
In addition, we compute the accuracy improvement ratio as the percentage (1 − ρ)%, where ρ is the ratio between the NMD achieved after the refinement phase described in Section 4.5.1 and the one achieved previously by the B-MLE. We also compute the ratio of improved positions after the refinement phase (see the column "Positions improved" in Table 8). From the results reported in Table 8, the best accuracy improvement is in general obtained in the case of the small learning data set, i.e. N_L = 10. The best accuracy, about 70 cm, is achieved for the smallest testbed (LINCS), which at the same time exhibits the biggest noise factor, σ² ≈ 49.55 dB. However, the LINCS testbed requires the highest number of pairwise communications between the sensor nodes during the refinement phase. Our numerical results appear consistent with other experiments involving real indoor scenarios with testbeds of similar dimensions and numbers of sensor nodes: see for instance the accuracy between 1.5 and 2.5 m in the experiments of [145], or the 2.27 m reported in [45].

Figure 4.17. Boxplot of the NMD values obtained over 100 independent runs of the doMLE (Algorithm 5) for each testbed when considering the parameter learning sizes N_L = 10 and N_L = 25.

In addition, Figure 4.17 illustrates the statistical behavior of the localization error (NMD) through the corresponding boxplot representations. Both the LINCS and FIT IoT-LAB testbeds behave similarly: the standard deviation of the NMD error decreases when the learning data size increases (from N_L = 10 to 25), but the mean value of the NMD error increases. Indeed, more outlier values appear in that case, since considering a larger number of positions may add corrupted data to the learning phase, affecting the estimated parameters.
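The improvement ratio (1 − ρ)% used in Table 8 is simple arithmetic; as a sketch (function name illustrative):

```python
def improvement_pct(nmd_before, nmd_after):
    """Accuracy improvement (1 - rho) * 100 with rho = NMD_after / NMD_before."""
    return (1.0 - nmd_after / nmd_before) * 100.0
```

For instance, halving the NMD corresponds to a 50% improvement, and an unchanged NMD to 0%.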
The smallest testbed (LINCS) has the worst error performance, since it gives the highest standard deviation and mean values when considering the big learning data set (N_L = 25), and the highest standard deviation when considering the small one (N_L = 10). On the contrary, the FIT IoT-LAB testbed, which has the biggest dimensions and is rarely occupied by moving people, maintains the lowest standard deviation of the error for both values of N_L and achieves the best NMD accuracy when considering the small learning set. Finally, the middle-sized Rammus testbed has a more regular behavior, since it gives a similar performance independently of the value of N_L. The localization error at each sensor node is detailed in Figure 4.18. We report the NMD values {NMD_i}_{∀i} for each sensor node of each testbed, before (given by the B-MLE) and after the refinement phase (given by the doMLE algorithm), when considering the small and the big learning data sets.

[Figure 4.18 (six panels): (a) LINCS testbed, N_L = 25; (b) LINCS testbed, N_L = 10; (c) Rammus testbed, N_L = 25; (d) Rammus testbed, N_L = 10; (e) FIT IoT-LAB testbed, N_L = 25; (f) FIT IoT-LAB testbed, N_L = 10.]

Figure 4.18. NMD values at each testbed before (blue bars) and after (red bars) the refinement phase (Algorithm 5) for each estimated sensor position. On the left, N_L = 25; on the right, N_L = 10.

Regarding Figure 4.18 and the information summarized in Table 8, the accuracy improvement and the number of improved positions are considerably higher when N_L = 10. For the LINCS testbed about 76% of the positions are improved, while for the Rammus testbed the percentage is 72%, when considering 10 positions for the learning phase. However, these percentages decrease to 55% and 63%, respectively, for the LINCS and Rammus testbeds when 25 positions are considered during the learning phase.
Moreover, comparing the localization errors in Figure 4.18 with the network configurations (see Figures 4.14 and 4.15), the estimated positions whose accuracy does not improve after the refinement phase are those of sensor nodes located in denser areas, or in the middle, surrounded by objects and by the other sensor nodes (see for instance nodes 13, 14, 28 and 30 at the LINCS testbed, nodes 4, 5, 17, 18 and 28 at the Rammus testbed, and nodes 197, 216 and 234 at the FIT IoT-LAB testbed). In some cases, there is no accuracy improvement for nodes located in the corners, such as nodes 252 and 253 at the FIT IoT-LAB testbed or nodes 1 and 42 at the Rammus testbed. In conclusion, after the refinement phase, an accuracy improvement of at least 30% is achieved, and more than 70% of the positions are improved, for different indoor scenarios involving different dimensions and different radio devices.

Conclusions and perspectives

In this thesis, two applications of distributed stochastic approximation in multi-agent systems have been considered: consensus-based distributed optimization and distributed principal component analysis (PCA). Regarding consensus-based methods, we addressed the case where a network of agents seeks to find the global minimizer of an optimization problem. The aim is to drive the local iterates of all agents to a common minimizer. We have concentrated our efforts on the theoretical analysis of an adaptation-diffusion algorithm, where agents iteratively update their local iterates and merge them by communicating with their neighbors. We have demonstrated almost sure convergence under weak conditions on the communication protocols. Although double stochasticity is generally assumed in past works, our convergence result holds even when the matrix W_n characterizing the network exchanges is not doubly stochastic.
This observation opens the possibility of using simple communication schemes between agents, such as the intuitive broadcast protocol, in which agents send information to their neighbors without expecting any instantaneous feedback from the latter. We have also analyzed the convergence rate of the method. More precisely, we have proved that the estimation error between the iterates and the minimizer tends to zero at rate √γ_n, where γ_n is the step size of the algorithm. The normalized estimation error is asymptotically normal, and the limiting covariance matrix has been characterized. As a consequence of our results, we have shown that the price to pay for using simple non-doubly stochastic weight matrices is an increase of the asymptotic variance of the estimation error. We have applied our results and tested our algorithms on the problem of statistical inference in wireless sensor networks, with a special focus on self-localization problems. We have also proposed and analyzed a distributed on-line expectation-maximization algorithm (see Appendix A) which relies on the same principles. Regarding distributed PCA, we addressed the case where the i-th agent seeks to estimate the i-th entry of the principal eigenvectors of a given matrix, based on noisy and distributed measurements of that matrix. We have proposed an iterative algorithm based on sporadic information exchanges in the network and proved its almost sure convergence. The algorithm can be seen as a distributed version of Oja's algorithm, a popular stochastic approximation method for estimating the principal eigenspace of a matrix. We have applied our results to the issue of self-localization in wireless sensor networks. We have considered the case where agents are identified with sensors able to collect noisy measurements of the distance to their neighbors.
We have proposed a distributed version of a multidimensional scaling algorithm based on PCA, which makes it possible to recover the positions of the sensors as the eigenvectors of a so-called similarity matrix computed from inter-sensor distances. In addition, our algorithm encompasses the context where measurements are gathered in an on-line and sporadic fashion. We have also tested our algorithms on a wireless sensor network platform, namely the FIT IoT-LAB platform7. The collaboration with N.A. Dieng has made possible our contributions to the localization framework in WSN (see [56]). Besides, we could show the performance on real testbeds from data acquired in different indoor scenarios (see [3] and [2]). There still remain several open problems in the continuity of this thesis, which have not been addressed due to lack of time. First of all, it would be interesting to analyze the effect of Polyak-Ruppert averaging methods, which are known to increase the convergence rate of stochastic approximation methods (see [98]). Most probably, such methods are effective in a distributed setting as well, as discussed in [19] in the special case of doubly stochastic matrices. Second, this thesis has focused on iterative algorithms with vanishing step sizes. In stochastic contexts where measurements are collected on-line and then deleted, vanishing step sizes are generally required to ensure the convergence of the algorithms. Nevertheless, in signal processing it is also important to consider methods with a constant step size. Such methods are generally not convergent, but they allow the variations of the environment to be tracked adaptively (e.g. adaptive optimization for target tracking [151]). This is particularly well suited to the case of mobile sensor networks, for instance.
Typically, our self-localization algorithm based on distributed PCA would be especially relevant in the case of a constant step size, as it would allow each sensor to adaptively track its own position by collecting noisy measurements of the distance to its neighbors. Indeed, the constant step-size setting could have been considered in the last part of this thesis, since the experiments held at the FIT IoT-LAB platform with wireless sensor nodes made us wonder about the localization problem for mobile nodes, i.e. tracking the trajectories of mobile sensor nodes. Our goal was to design a distributed algorithm that could be tested in real WSN, ideally in both static and mobile contexts. However, we were not able to address the mobile localization problem for two reasons: the lack of time and the lack of resources on real devices. A first period had to be devoted to studying the localization framework and the fixed-position case. In a second stage, we found FIT IoT-LAB to be a suitable open platform allowing remote access to both real static and mobile resources (see [148] for a recent survey on existing testbeds). Unfortunately, the early development stage of the platform affected our testbeds: at the time we ran the experiments, the available nodes had to be progressively replaced by a new generation of wireless sensor nodes with more reliable features, and the mobile nodes were not yet operational. Thus, as future work, the distributed algorithm presented in Chapter 4 could be adapted to enable position tracking and tested on the mobile WSN recently made available at the FIT IoT-LAB platform. More recently, in the distributed optimization framework, several works [32], [153], [150], [157], [114], [22] have proposed other methods to solve the problem described as a sum of private functions (2.3) (see Chapter 2).
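The trade-off between vanishing and constant step sizes in a drifting environment can be illustrated on a scalar toy problem (a sketch under arbitrary drift and noise levels; names are illustrative):

```python
import random

def track(step_fn, drift=0.01, n=3000, seed=0):
    """Stochastic-approximation estimate of a slowly drifting mean.
    step_fn(k) returns the step size used at iteration k."""
    rng = random.Random(seed)
    target, est = 0.0, 0.0
    for k in range(1, n + 1):
        target += drift                       # the environment moves
        y = target + rng.gauss(0.0, 0.5)      # noisy observation of it
        est += step_fn(k) * (y - est)         # SA update
    return est, target
```

With a constant step the estimate stays within a bounded neighborhood of the moving target, while the vanishing step 1/k, which would be appropriate in a stationary setting, lags far behind, illustrating why constant step sizes matter for mobile sensor networks.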
Indeed, distributed gradient methods, such as the adaptation-diffusion algorithm studied in this thesis, perform well in an on-line setting where only noisy versions of the gradients are available: as we have shown, such methods achieve the convergence rate that is usually expected from centralized stochastic approximation methods. Nevertheless, in the absence of stochastic perturbations (i.e. when gradients are assumed to be perfectly observed), a vanishing step size is still needed and no significant improvement of the convergence rate is thus expected. On the other hand, recent works on distributed proximal methods (especially [22]) have shown that it is possible to construct a first-order distributed optimization method (i.e. one which computes only gradients) which does not require a vanishing step size. It would be interesting to investigate the behavior of such methods in a stochastic setting and to study how well they compare to distributed gradient methods. Another aspect is the use of alternative gossip protocols such as the so-called push-sum protocol [89], [34], [84]. Such a protocol, initially designed to avoid bilateral exchanges in average consensus algorithms, can also be cast into a distributed optimization framework, as shown by [153], [114]. Push-sum protocols and their variants are easy to use from a networking point of view, just like the broadcast protocol studied in this thesis. As a consequence, it would be interesting to analyze the advantages and drawbacks of both methods from the point of view of stochastic approximation.

Footnote 7: https://www.iot-lab.info/deployment/
Part III: Appendices

Appendix A. Application to distributed parameter estimation

On-Line Gossip-based Distributed Expectation Maximization Algorithm

This appendix is extracted from the proceedings of the 2012 IEEE Statistical Signal Processing Workshop (SSP).

Abstract. In this paper, we introduce a novel on-line Distributed Expectation-Maximization (DEM) algorithm for latent data models, including Gaussian mixtures as a special case. We consider a network of agents whose mission is to estimate a parameter from the time series locally observed by the agents. Our estimator works on-line and asynchronously: it starts processing data as they arrive, without needing a time-line shared by the network. Agents update some local summary statistics using recent data (E-step), then share these statistics with their neighbors in order to eventually reach a consensus (gossip step), and finally use them to generate individual estimates of the unknown parameter (M-step). Our algorithm is shown to converge under mild conditions on the gossip protocol, freeing the network from feedback communications and hence making this DEM algorithm particularly well suited to Wireless Sensor Networks (WSN).

A.1 Introduction

We address the problem of maximum likelihood estimation for latent data models. This problem is usually addressed by the celebrated Expectation-Maximization (EM) algorithm [52]. A typical use case of a latent data model is the problem of mixture parameter estimation: a feature is observed for several individuals in a population made of M distinct subgroups, but the observer does not know which subgroup a given individual belongs to; the goal is to estimate the feature distribution of each subgroup along with the subgroup proportions. Recently, the Distributed Expectation-Maximization (DEM) algorithm has attracted a great deal of attention.
Consider a network formed by N agents whose mission is to estimate an unknown parameter θ. At time n, each agent i = 1, . . . , N observes a random sample Y_{i,n} which is assumed to be governed by a latent, unobserved random variable X_{i,n}. The main challenge is to propose efficient ways to extend the celebrated EM algorithm to a distributed setting, i.e., in the absence of a fusion center able to collect the data at every time instant. In typical scenarios such as wireless sensor networks, the nodes are generally assumed to be able to process their local observations and to share a limited amount of information with their neighbors. There is a vast literature on the EM algorithm. Most works have been devoted to the standard batch version of the EM algorithm: the data is first collected, stored in memory, and then processed. In [146], and later in [36], the authors use stochastic approximation tools to propose an on-line version of the EM algorithm: the data does not need to be stored, and each new single piece of data can lead to updated estimates. The original EM algorithm is also a centralized algorithm. In [119], a distributed version of the EM algorithm is presented in the case of a Gaussian mixture. The algorithm of [119] uses an incremental approach where a message has to cycle across the network, going through each node once per cycle (Hamiltonian cycle). This is a limitation for at least two reasons. First, finding a Hamiltonian cycle is an NP-complete problem and, second, letting the algorithm depend on a single message traveling across the network lacks robustness. Several alternatives to the latter incremental method have been developed quite recently; see for instance [93, 74] and [64], to cite a few. All of these works investigate a batch context where the data has to be stored in the sensors' memories.
In addition, most of these works only investigate the case of Gaussian mixtures and may be quite demanding in terms of communication protocol between nodes (number of communications between consecutive iterations, existence of feedback links, synchronism). We consider a network of agents whose mission is to estimate a parameter from the time series locally observed by the agents. Our estimator, based on the elegant approach of [36], consists of three main steps: agents update some local summary statistics using recent data (E-step), then share these statistics with their neighbors (gossip step), and finally use them to generate individual estimates of the unknown parameter (M-step). We consider Gaussian mixtures as a special case. The paper is organized as follows. Section A.2 introduces the parametric model; we review the centralized versions of the EM algorithm in Section A.3; our algorithm is introduced in Section A.4; convergence results are established in Section A.5; closed-form expressions are provided in Section A.6 for Gaussian mixtures; Section A.6.2 is devoted to numerical results.

A.2 Parametric model: exponential families

Consider a sensor network formed by N nodes (the terms sensor and node will be used interchangeably). For any i, consider a couple of random variables (r.v.) Z_i := (X_i, Y_i) on a measurable space (Ω, F), where Y_i ∈ Y represents an observed variable and X_i ∈ X is a latent/unobserved variable. We assume that the aim of the network is to estimate a parameter θ of the form θ = (θ̄, α_1, . . . , α_N), where θ̄ ∈ Θ̄ and α_i ∈ A for all i, with Θ̄ and A arbitrary sets (we define Θ := Θ̄ × A^N). One should think of θ̄ as a global parameter and of α_i ∈ A as a local parameter identifiable by node i only. Let (P_θ)_{θ∈Θ} be a collection of probability measures on (Ω, F). We denote by f_θ(z_1, . . . , z_N) the p.d.f. of (Z_1, . . . , Z_N) induced by the model P_θ, w.r.t. some arbitrary reference measures on (X × Y)^N.
We denote by gθ(y1, . . . , yN) the p.d.f. of the observations induced by Pθ. We assume that the observations (Y1, . . . , YN) have an unknown p.d.f. π under some probability Pπ on (Ω, F). Here, Pπ represents the actual probability under which the observed samples are generated. For a better understanding, it might be convenient to think of π as π = g_{θ⋆} for some "true" parameter θ⋆; however, our algorithm and our analysis do not require such a hypothesis. Denote by ⟨·, ·⟩ the inner product in R^p and by |·| the Euclidean norm.

Assumption A.1. For any θ = (θ̄, α1, . . . , αN),

i) For any z1, . . . , zN,
$$f_\theta(z_1, \ldots, z_N) = \prod_{i=1}^{N} f_{i;\bar\theta,\alpha_i}(z_i)$$
where the marginal p.d.f. f_{i;θ̄,αi}(zi) coincides with:
$$h_i(z_i) \exp\left( -\bar\psi(\bar\theta) - \psi_i(\alpha_i) + \langle S_i(z_i), \bar\phi(\bar\theta) + \phi_i(\alpha_i) \rangle \right)$$
where ψ̄ : Θ̄ → R, φ̄ : Θ̄ → R^p, ψi : A → R, φi : A → R^p and Si : X × Y → R^p are some measurable functions, and hi(zi) is a normalization factor.

ii) The r.v. Eθ[Si(Zi)|Yi] is well defined for any i.

In the sequel, we assume that a sequence of independent and identically distributed (i.i.d.) observations is available at each sensor. More precisely, for each i = 1, . . . , N, we introduce a time series Zi,n = (Xi,n, Yi,n) (n = 1, 2, . . . ) such that, under Pθ, (Zi,n)n≥1 is i.i.d. and has the same distribution as Zi. Here, (Yi,n)n≥1 represents the sequence of observations of sensor i, while (Xi,n)n≥1 represents the sequence of hidden r.v.'s.

A.3 Centralized EM algorithms

We review centralized EM algorithms, assuming that a fusion center is able to gather the information of all sensors at each instant n. Although we are interested in on-line algorithms, we first review the usual batch version of the EM algorithm for convenience.

A.3.1 Batch EM

Assume that each sensor i collects T observations Yi,1:T := (Yi,1, . . . , Yi,T).
The so-called intermediate quantity of the EM algorithm plays a central role:
$$Q_T(\theta', \theta) := \frac{1}{NT} \sum_{n=1}^{T} \sum_{i=1}^{N} \mathbb{E}_{\theta'}\!\left[ \log f_{i;\bar\theta,\alpha_i}(Z_{i,n}) \,\middle|\, Y_{i,n} \right] \qquad \text{(A.1)}$$
where θ′, θ ∈ Θ̄ × A^N, θ = (θ̄, α1, . . . , αN) and Eθ is the expectation associated with Pθ. The EM algorithm is an iterative procedure which generates an estimate θ^(k) = (θ̄^(k), α1^(k), . . . , αN^(k)) at each iteration k. The update is done in two steps:

E-step: Compute the function θ ↦ QT(θ^(k), θ);
M-step: Set θ^(k+1) := arg maxθ QT(θ^(k), θ).

In practice, such an algorithm makes sense only if each of the above steps can be realized at low computational cost. Under Assumption A.1, both steps simplify as follows. Consider i = 1, . . . , N, θ = (θ̄, α1, . . . , αN) and θ′ = (θ̄′, α′1, . . . , α′N). Let us introduce a function y ↦ σ_{i;θ̄,αi}(y) defined on Y such that w.p.1:
$$\sigma_{i;\bar\theta,\alpha_i}(Y_i) = \mathbb{E}_{\theta}\left( S_i(Z_i) \mid Y_i \right). \qquad \text{(A.2)}$$
By Assumption A.1, the RHS of the above equality depends on θ only through θ̄ and αi. It is straightforward to show that E_{θ′}[log f_{i;θ̄,αi}(Zi) | Yi] coincides with:
$$-\bar\psi(\bar\theta) - \psi_i(\alpha_i) + \langle \sigma_{i;\bar\theta',\alpha'_i}(Y_i), \bar\phi(\bar\theta) + \phi_i(\alpha_i) \rangle$$
up to an additive random term E_{θ′}(log hi(Zi)|Yi) which does not depend on θ and which shall thus play no role in the M-step. Thus, up to a constant w.r.t. θ, the intermediate function QT(θ^(k), θ) at iteration k coincides with:
$$-\bar\psi(\bar\theta) + \langle \bar s^{(k)}, \bar\phi(\bar\theta) \rangle + \frac{1}{N} \sum_{i=1}^{N} \left( -\psi_i(\alpha_i) + \langle s_i^{(k)}, \phi_i(\alpha_i) \rangle \right) \qquad \text{(A.3)}$$
where s_i^(k) := (1/T) Σ_{n=1}^T σ_{i;θ̄^(k),αi^(k)}(Yi,n) and s̄^(k) := (1/N) Σ_{i=1}^N s_i^(k). We will respectively refer to these quantities as the local and global sufficient statistics. The E-step reduces to the computation of s_i^(k) for any i = 1, . . . , N, and of their average s̄^(k). The maximization of (A.3) can be achieved separately with respect to (w.r.t.) θ̄, α1, . . . , αN.
Assume that the following functions are well defined for any s in a relevant domain, and that their numerical computation is inexpensive:
$$\mathsf{M}(s) := \arg\max_{\bar\theta \in \bar\Theta} \; -\bar\psi(\bar\theta) + \langle s, \bar\phi(\bar\theta) \rangle \qquad \text{(A.4)}$$
$$\mathsf{M}_i(s) := \arg\max_{\alpha \in \mathsf{A}} \; -\psi_i(\alpha) + \langle s, \phi_i(\alpha) \rangle. \qquad \text{(A.5)}$$
The standard batch EM algorithm is summarized below in Algorithm 6.

Algorithm 6: Centralized batch EM algorithm (EM)
Initialize: s_i^(0), s̄^(0) for all i = 1, . . . , N.
Update: At each iteration k ≥ 0 do
  E-step: Compute s_i^(k) for any i, and the average s̄^(k).
  M-step: For all i = 1, . . . , N, set αi^(k+1) := Mi(s_i^(k)). Set θ̄^(k+1) := M(s̄^(k)).

A.3.2 On-line EM

From now on to the end of this paper, we assume that each sensor observes the time series (Yn,i)n≥1. We are interested in on-line algorithms, i.e., algorithms which are able to update the estimate any time new samples come in. The idea behind the on-line EM algorithm of [36] is simply to replace the batch sufficient statistics with their on-line counterparts. In such a case, there is no difference between the indices n and k, and the E-step is computed any time a new observation comes in. Assume that each agent i has access to its time series (Yn,i)n≥1. The algorithm proposed by [36] replaces the E-step, which involves an average among the N collected samples, by an iterative stochastic approximation step, in order to track this average value at the same time as the M-step tracks the estimated parameters. The estimate θn at time n is generated similarly to Algorithm 6, in two steps, after an arbitrary initialization of the values s0,1, . . . , s0,N. The on-line E-step is given by the following recursion:
$$s_{n,i} = s_{n-1,i} + \gamma_n \left( \sigma_{i;\bar\theta_{n-1},\alpha_{n-1,i}}(Y_{n,i}) - s_{n-1,i} \right) \qquad \text{(A.6)}$$
$$\bar s_n = \frac{1}{N} \sum_{i=1}^{N} s_{n,i}, \qquad \text{(A.7)}$$
where γn is a positive step size (gain). We refer to s_{n,i} as a summary statistic. Next, the estimate θn is updated by the following M-step:
$$\bar\theta_n = \mathsf{M}(\bar s_n) \quad \text{and} \quad \forall i, \;\; \alpha_{n,i} = \mathsf{M}_i(s_{n,i}).$$
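To make the recursion (A.6)-(A.8) concrete, the following is a minimal single-sensor sketch for a one-dimensional Gaussian mixture in which only the component means are unknown; the known variances and mixing weights, the step-size exponent, and all numerical values are illustrative assumptions and not the setting analyzed above.

```python
import numpy as np

def online_em_gmm(y_stream, mu0, var, weights, gamma=lambda n: n ** -0.6):
    """Single-sensor on-line EM for a 1-D Gaussian mixture with known
    variance and mixing weights; only the component means are tracked.
    Summary statistic of component m: s[m] = (E[1{X=m}], E[1{X=m} Y])."""
    weights = np.asarray(weights, dtype=float)
    mu = np.asarray(mu0, dtype=float)
    s = np.column_stack([weights, weights * mu])
    for n, y in enumerate(y_stream, start=1):
        # E-step: posterior responsibilities under the current parameters
        w = weights * np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(var)
        w /= w.sum()
        # stochastic approximation step, cf. (A.6)
        s += gamma(n) * (np.column_stack([w, w * y]) - s)
        # M-step, cf. (A.8): the mean of component m is s[m,1] / s[m,0]
        mu = s[:, 1] / s[:, 0]
    return mu
```

Feeding the function a long i.i.d. stream drawn from a well-separated two-component mixture drives the tracked means toward the true ones, with fluctuations controlled by the decaying gain.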
(A.8)

The asymptotic analysis of the above centralized algorithm is available in [36] under the hypothesis of vanishing gains γn such that:

Assumption A.2. The sequence (γn)n≥0 is positive, non-increasing, and satisfies:
i) Σ_n γn = +∞,
ii) Σ_n γn² < ∞.

Remark A.1. The convergence result of [36] holds under the assumption that the algorithm is stable: the sequence of summary statistics remains almost surely in some compact set, strictly included in the domain of definition of the functions M and Mi. Verifying this assumption is not an easy task. Instead, it is common practice in stochastic approximation to force stability by confining the updated sequence (A.6) to a given convex compact set S (see [98, p. 120] for a discussion). Here, we shall follow this approach. We denote by ΠS the Euclidean projector onto the set S. Motivated by Remark A.1, we thus introduce the following assumption.

Assumption A.3. There exists a convex compact set S such that the following holds for any i = 1, . . . , N. The functions M : S → Θ̄ and Mi : S → A are well defined by (A.4) and (A.5), i.e., the argument of the maximum is a singleton.

A.4 Proposed distributed on-line EM

A.4.1 Algorithm

We now assume that no fusion center is available: each sensor i observes (Yn,i)n≥1 but ignores the samples collected by the other nodes j ≠ i. Each node recursively generates a sequence of estimates (αn,i)n≥1 of its local parameter and a sequence (θ̄n,i)n≥1 of estimates of the global parameter. The estimation relies on the recursive computation of on-line summary statistics similar to (A.6) and (A.7). Of course, (A.6) and (A.7) are no longer available in a distributed setting. Thus, (A.6) must be substituted by:
$$s'_{n,i} = \Pi_S\!\left[ s'_{n-1,i} + \gamma_n \left( \sigma_{i;\bar\theta_{n-1,i},\alpha_{n-1,i}}(Y_{n,i}) - s'_{n-1,i} \right) \right] \qquad \text{(A.9)}$$
where we simply replaced θ̄n−1 of (A.6), which is unavailable in the distributed setting, by θ̄n−1,i, and added the projector ΠS as discussed in Remark A.1.
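The projector Π_S in (A.9) is the ordinary Euclidean projection onto a compact convex set; for the simple sets typically used in practice it has a closed form. A minimal sketch follows, where the choice of a box or a centered ball for S is an illustrative assumption:

```python
import numpy as np

def project_box(s, lower, upper):
    """Euclidean projection onto the box S = [lower, upper]^p: for a box,
    the projection decouples across coordinates into a simple clipping."""
    return np.clip(s, lower, upper)

def project_ball(s, radius):
    """Euclidean projection onto the centered ball S = {x : |x| <= radius}:
    points outside the ball are rescaled onto its boundary."""
    s = np.asarray(s, dtype=float)
    norm = np.linalg.norm(s)
    return s if norm <= radius else s * (radius / norm)
```

In both cases the projection is non-expansive, which is what the stability argument of Remark A.1 exploits.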
Second, the computation of the average (A.7) is likely to be difficult. It either requires finding a Hamiltonian cycle in the graph, as in [119], or using gossip-based average consensus techniques such as [31], which require a significant amount of communication before convergence. In practice, assuming a large number of communications during each time slot n can be highly restrictive: it may be that only a couple of nodes communicate at a given time n, or even no nodes at all. In the proposed algorithm, each node replaces the unknown average (A.7) by a r.v. s̄n,i which is updated in two steps at each time n:

[Local E-step] Each node i generates a temporary update s̃n,i based on its local observation:
$$\tilde s_{n,i} = \Pi_S\!\left[ \bar s_{n-1,i} + \gamma_n \left( \sigma_{i;\bar\theta_{n-1,i},\alpha_{n-1,i}}(Y_{n,i}) - \bar s_{n-1,i} \right) \right]. \qquad \text{(A.10)}$$

[Gossip step] The final update s̄n,i of a node i is defined as a weighted sum of its own s̃n,i and the temporary updates received from its neighbors at time n:
$$\bar s_{n,i} = \sum_{j=1}^{N} w_n(i,j) \, \tilde s_{n,j}, \qquad \text{(A.11)}$$
where the wn(i,j) are some non-negative random weights such that Σ_j wn(i,j) = 1 for all i. Of course, wn(i,j) ≠ 0 only if node j communicates with i at time n. Once the above summary statistics have been computed, the estimates θ̄n,i and αn,i are obtained by a standard M-step similar to (A.8). The proposed algorithm is summarized below. The sequence of random matrices Wn := [wn(i,j)]_{i,j=1}^N represents the time-varying communication network between the nodes (see Appendix C). Let 1 denote the N × 1 vector with all components equal to one. Let Eπ denote the expectation under Pπ.

Algorithm 7: On-line distributed EM algorithm (dEM)
Initialize: s′_{0,i}, s̄_{0,i} for all i = 1, . . . , N.
Update: At each time n ≥ 0 do
  Local E-step: For all i = 1, . . . , N, compute s′_{n,i} by (A.9) and s̃_{n,i} by (A.10).
  Gossip step: For all i = 1, . . . , N, compute s̄_{n,i} by (A.11).
  M-step: For all i = 1, . . . , N, set α_{n,i} = Mi(s′_{n,i}) and θ̄_{n,i} = M(s̄_{n,i}).

Assumption A.4. The following holds under Pπ:
i) For any n, Wn is an N × N row-stochastic random matrix with non-negative elements: Wn 1 = 1.
ii) Wn is column-stochastic in expectation: Eπ[Wn]^T 1 = 1.
iii) (Wn, Z1,n, . . . , ZN,n)n≥1 is an i.i.d. sequence.
iv) The matrix W1 is independent of (Z1,1, . . . , ZN,1).
v) The spectral norm ρ of the matrix Eπ[W1^T (I_N − 11^T/N) W1] satisfies ρ ∈ [0, 1).

A.5 Convergence w.p.1

We need the following regularity conditions.

Assumption A.5.
i) The sets Θ̄ and A are convex open subsets of R^κ and R^ι respectively, where κ and ι are integers.
ii) The functions ψ̄, φ̄ (resp. ψ1, . . . , ψN, φ1, . . . , φN) are continuously differentiable on Θ̄ (resp. on A).
iii) S is a C² compact convex set.
iv) The functions M, M1, . . . , MN are well defined and continuously differentiable on the compact convex set S.
v) For all i, sup_{(θ̄,α) ∈ M(S)×Mi(S)} Eπ |σ_{i;θ̄,α}(Yi)|² < ∞.

The notation D(π|gθ) stands for the Kullback-Leibler divergence Eπ[ log( π(Y1, . . . , YN) / gθ(Y1, . . . , YN) ) ], where we recall that π denotes the p.d.f. of a sample (Y1, . . . , YN) under Pπ, whereas gθ denotes the p.d.f. of (Y1, . . . , YN) induced by the model Pθ. We need more notation. We define:
$$s_n := \left( \frac{1}{N} \sum_{i=1}^{N} \bar s_{n,i}, \; s'_{n,1}, \ldots, s'_{n,N} \right)$$
$$\theta_n := \left( \frac{1}{N} \sum_{i=1}^{N} \bar\theta_{n,i}, \; \alpha_{n,1}, \ldots, \alpha_{n,N} \right).$$
For any vector s of the form s = (s0, . . . , sN) ∈ S^{N+1}, we write M(s) := (M(s0), M1(s1), . . . , MN(sN)). We denote by L the set of Karush-Kuhn-Tucker points associated with the problem:
$$\min_{s \in S^{N+1}} D(\pi \,|\, g_{\mathsf{M}(s)})$$
i.e., the set of vectors s such that −∇s D(π|g_{M(s)}) lies in the normal cone to S^{N+1} at the point s [30]. We are now in position to state our main result. Recall that a.s. stands for almost surely.

Theorem A.1. The following holds true under Assumptions A.1-A.5.
i) The network achieves a consensus in the following sense:
$$\lim_{n\to\infty} \max_{i,j} |\bar\theta_{n,i} - \bar\theta_{n,j}| = 0 \quad \text{a.s.}$$
ii) The vectors sn and θn converge a.s. to L and M(L) respectively.
iii) On the event that s′_{n,i} and s̃_{n,i} lie in the interior of S for all n large enough, the sequence θn converges a.s. to the set:
$$\{ \theta \in \Theta : \nabla_\theta D(\pi | g_\theta) = 0 \}.$$
The proof extensively uses the results of [36] along with those of [20].

A.6 Numerical results

A.6.1 Application to Gaussian mixtures

As a leading application, we shall pay particular attention to Gaussian mixtures. For brevity, we focus on the scalar case Y = R, but our statements can be generalized to the vector case without difficulty. Consider the parametric model:
$$\forall i, \quad Y_i \sim \sum_{m=1}^{M} \alpha_i^{(m)} \, \mathcal{N}(\mu^{(m)}, \nu^{(m)}) \quad \text{under } P_\theta, \qquad \text{(A.12)}$$
where M is an integer, the vector αi = (αi^(1), . . . , αi^(M)) represents a set of non-negative weights such that Σ_{m=1}^M αi^(m) = 1, and N(μ^(m), ν^(m)) stands for the real Gaussian distribution with mean μ^(m) and variance ν^(m). In (A.12), we set θ = (θ̄, α1, . . . , αN), where θ̄ = (μ^(1), . . . , μ^(M), ν^(1), . . . , ν^(M)) is the set of global parameters. To be more explicit, for any i, the latent variable Xi ∈ {1, . . . , M} satisfies Pθ(Xi = m) = αi^(m) and represents the class under which the observation Yi is drawn. The distribution of Yi given Xi is N(μ^(Xi), ν^(Xi)). It can be verified that this model satisfies Assumption A.1. In this case, hi(zi) = 1, ψ̄(θ̄) and ψi(αi) are equal to zero, and φi(αi) and φ̄(θ̄) are both 3M × 1 column vectors. Each marginal p.d.f. f_{i;θ̄,αi}(zi) has the following closed form:
$$f_{i;\bar\theta,\alpha_i}(z_i) = \exp\{ \langle S_i(z_i), \bar\phi(\bar\theta) + \phi_i(\alpha_i) \rangle \} \qquad \text{(A.13)}$$
where, with δ_{xi}^(m) := I{xi = m}:
$$S_i(z_i) = \left( \delta^{(1)}_{x_i}, \; \delta^{(1)}_{x_i} y_i, \; \delta^{(1)}_{x_i} y_i^2, \; \ldots, \; \delta^{(M)}_{x_i}, \; \delta^{(M)}_{x_i} y_i, \; \delta^{(M)}_{x_i} y_i^2 \right)^T, \qquad \text{(A.14)}$$
$$\bar\phi(\bar\theta) = \left( -\tfrac{1}{2}\ln 2\pi\nu^{(1)} - \tfrac{(\mu^{(1)})^2}{2\nu^{(1)}}, \;\; \tfrac{\mu^{(1)}}{\nu^{(1)}}, \;\; -\tfrac{1}{2\nu^{(1)}}, \;\; \ldots, \;\; -\tfrac{1}{2}\ln 2\pi\nu^{(M)} - \tfrac{(\mu^{(M)})^2}{2\nu^{(M)}}, \;\; \tfrac{\mu^{(M)}}{\nu^{(M)}}, \;\; -\tfrac{1}{2\nu^{(M)}} \right)^T$$
and
$$\phi_i(\alpha_i) = \left( \ln \alpha_i^{(1)}, \; 0, \; 0, \; \ldots, \; \ln \alpha_i^{(M)}, \; 0, \; 0 \right)^T. \qquad \text{(A.15)}$$
Besides, we can explicitly derive the local summary quantities expressed by the mapping σ_{i;θ̄,α}(y) of (A.2). Here, for any y ∈ Y, the function σ_{i;θ̄,α}(y) involved in the Local E-step of Algorithm 7 is a 3M × 1 column vector. Upon noting that Eθ[δ_{Xi}^(m) | y] = P_{i;θ}[Xi = m | y] in expression (A.14) and applying Bayes' rule, σ_{i;θ̄,α}(y) is given by:
$$\sigma_{i;\bar\theta,\alpha}(y) = \left( \sigma^{(1)}_{i;\bar\theta,\alpha}(y)^T, \ldots, \sigma^{(M)}_{i;\bar\theta,\alpha}(y)^T \right)^T,$$
where, for all m = 1, . . . , M:
$$\sigma^{(m)}_{i;\bar\theta,\alpha}(y) = \left( \omega^{(m)}_{i;\bar\theta,\alpha}(y), \;\; \omega^{(m)}_{i;\bar\theta,\alpha}(y)\, y, \;\; \omega^{(m)}_{i;\bar\theta,\alpha}(y)\, y^2 \right)^T$$
with
$$\omega^{(m)}_{i;\bar\theta,\alpha}(y) = \mathbb{P}_{i;\bar\theta,\alpha}[X_i = m \mid y] = \frac{ \frac{\alpha^{(m)}_i}{\sqrt{2\pi\nu^{(m)}}} \exp\{ -\frac{1}{2\nu^{(m)}} (y-\mu^{(m)})^2 \} }{ \sum_{k=1}^{M} \frac{\alpha^{(k)}_i}{\sqrt{2\pi\nu^{(k)}}} \exp\{ -\frac{1}{2\nu^{(k)}} (y-\mu^{(k)})^2 \} }.$$
We denote by s̃n = (s̃n,1, . . . , s̃n,N)^T ∈ R^{3MN} and s̄n = (s̄n,1, . . . , s̄n,N)^T ∈ R^{3MN} the estimates at iteration n. Then, the Gossip step of Algorithm 7 can be written in matrix form as:
$$\bar s_n = (W_n \otimes I_{3M}) \, \tilde s_n,$$
where I_{3M} is the 3M × 3M identity matrix, ⊗ the Kronecker product and (Wn)n≥1 is an i.i.d. sequence of non-negative matrices satisfying Assumption A.4. After the Gossip step, each node i updates its parameters by the M-step of Algorithm 7. In this particular case, it is easy to show that the solutions of both (A.4) and (A.5) decompose into M separate maximization problems. For each class m and given the 3 × 1 column vectors s′^(m)_{n,i} and s̄^(m)_{n,i}, the solutions for any node i are:
$$\alpha^{(m)}_{n,i} = s'^{(m)}_{n,i}(1), \qquad \mu^{(m)}_{n,i} = \frac{\bar s^{(m)}_{n,i}(2)}{\bar s^{(m)}_{n,i}(1)}, \qquad \nu^{(m)}_{n,i} = \frac{\bar s^{(m)}_{n,i}(3)}{\bar s^{(m)}_{n,i}(1)} - \left( \mu^{(m)}_{n,i} \right)^2.$$

A.6.2 Simulations

In order to validate our algorithm under the conditions of Section A.2, we simulate the well-known example (A.12), that is, a Gaussian mixture of M = 3 classes. The WSN is represented as a random geometric graph G(E, V) with N = 10 nodes, randomly placed according to the uniform distribution on [0, 1] × [0, 1].
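For concreteness, the pairwise scheme of [31] used in the simulations can be sketched as follows: at each time a uniformly drawn pair of nodes averages its statistics, which yields a doubly stochastic realization of W_n, so Assumption A.4 holds. Drawing the pair from the complete graph is an illustrative simplification; in a WSN the pair would be drawn among graph neighbors.

```python
import numpy as np

def pairwise_gossip_matrix(N, rng):
    """One realization of the pairwise-averaging matrix W_n: a uniformly
    drawn pair (i, j) replaces its values by their common average, while
    all other nodes keep theirs. Each realization is doubly stochastic."""
    i, j = rng.choice(N, size=2, replace=False)
    W = np.eye(N)
    W[i, i] = W[j, j] = 0.5
    W[i, j] = W[j, i] = 0.5
    return W

def gossip_step(W, s_tilde):
    """Gossip step (A.11): with the statistics stacked as an (N, d) array,
    s_bar = (W ⊗ I_d) s_tilde reduces to a plain matrix product."""
    return W @ s_tilde
```

Since each realization is doubly stochastic, the network-wide average of the statistics is preserved by every gossip step, which is the key property behind the consensus result of Theorem A.1.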
We set the global parameters θ̄ to μ = (110, 80, 40) and ν = (36, 16, 36), and the local weights αi are randomly chosen on (0.7, 0.25, 0.05) at each node i. Algorithm 7 is run for 20000 iterations over 30 independent realizations, and the step size is chosen as γn = 1/n^0.8. We compare the two gossip strategies described in Appendix C by computing the mean deviation of consensus (MDC), normalized by the sought parameter. For the first parameter μ^(1), it is defined as:
$$\mathrm{MDC}_n = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \frac{ \left( \mu^{(1)}_{n,i} - \langle \mu^{(1)}_n \rangle \right)^2 }{ (\mu^{(1)})^2 } }, \qquad \text{(A.16)}$$
where ⟨μn^(1)⟩ := (1/N) Σ_{j=1}^N μ^(1)_{n,j}. Figure A.1 shows the comparison between the pairwise [31] and broadcast [10] schemes (see Appendix C) over 30 independent runs of Algorithm 7. The consensus is slightly faster in the broadcast case since, at each round, one node is able to communicate its value to all its neighbors. Figure A.2 then illustrates the convergence towards the true value μ^(1) = 110 as box-and-whisker plots over 30 independent runs of Algorithm 7; it reports the asymptotic behavior of the parameter averaged over the N nodes, (1/N) Σ_{i=1}^N μ^(1)_{n,i}, at n = 20000. Note that the asymptotic variance in the pairwise case is almost the same as that obtained in the centralized case. This is consistent with the convergence results derived in the first part of this thesis for consensus algorithms (see Section 2.5 in Chapter 2).

Figure A.1. MDC as a function of the number of iterations n for the pairwise [31] (plain line) and broadcast [10] (plain line with cross markers) models when considering N = 10 nodes.

Figure A.2. Box-and-whisker plots of the parameter estimate μn^(1) from the 30 independent runs of the centralized EM Algorithm 6 and the distributed EM Algorithm 7. The centralized setting corresponds to the first boxplot; the distributed settings correspond to the second (pairwise [31]) and third (broadcast [10]) boxplots respectively.

Similarly, we show the results obtained with a standard "signal + noise" model for the network of N = 10 sensor nodes. The observation model can then be described as a mixture of two Gaussians, i.e., M = 2. We set the following values for the global parameters θ̄: the signal class with μ1 = 1 and ν1 = 1, the noise class with μ2 = 0 and ν2 = 100, and we set the mixing weights to α1 = 0.9 and α2 = 0.1, equal for all nodes i = 1, . . . , N. Figure A.3 illustrates the agreement on the estimated parameters, i.e., all trajectories (θn,i)n go towards the same value ⟨θn⟩ as n → ∞, where ⟨θn⟩ = (1/N) Σ_i θn,i. Figure A.3 shows the trajectory of the MDC over 30 independent trials as a function of the iteration time n. Note that we only include the consensus on the weight of the first class (αn,i,1), since the weight of the second class satisfies αn,i,2 = 1 − αn,i,1 for all i.

Figure A.3. Performance on the consensus of the global parameters θn: convergence of the mean consensus deviation as a function of the iteration time n for each parameter and each class, i.e., (μn,i,j, νn,i,j, αn,i,1) for all i, j.

Once the agreement is achieved, the consensus value may converge to the true value, i.e., the sequence (⟨θn⟩)n tends to the true parameters θ⋆ = (1, 0, 1, 100, 0.9, 0.1). We define the mean θE⋆ and the variance σE²⋆ as:
$$\theta_E^\star = \mathbb{E}_\pi[Y] = \alpha_1 \mu_1 + \alpha_2 \mu_2$$
$$\sigma_E^{2\star} = \mathbb{E}_\pi\!\left[(Y - \mathbb{E}[Y])^2\right] = \alpha_1 \left( (\mu_1 - \theta_E^\star)^2 + \nu_1 \right) + \alpha_2 \left( (\mu_2 - \theta_E^\star)^2 + \nu_2 \right).$$
Figure A.4 illustrates the convergence towards the mean and variance of the observed signal. Figure A.4 (a) refers to the convergence towards the mean value θE⋆ = 0.9, while Figure A.4 (b) refers to the convergence towards the variance value σE²⋆ = 10.99. Each line corresponds to the estimated sequences computed by each node i as:
$$\theta_{E_{n,i}} = \alpha_{n,i,1}\, \mu_{n,i,1} + \alpha_{n,i,2}\, \mu_{n,i,2}$$
$$\sigma^2_{E_{n,i}} = \alpha_{n,i,1} \left( (\mu_{n,i,1} - \theta_{E_{n,i}})^2 + \nu_{n,i,1} \right) + \alpha_{n,i,2} \left( (\mu_{n,i,2} - \theta_{E_{n,i}})^2 + \nu_{n,i,2} \right).$$

Figure A.4. Trajectories at each node i = 1, . . . , N of the estimated mean (a) and variance (b) sequences of the observed signal (colored curves) as a function of the iteration time n, together with the optimal values θE⋆ and σE²⋆ (black lines).

Appendix B

Application on distributed machine learning

On-Line Learning Gossip Algorithm in Multi-Agent Systems with Local Decision Rules¹

This appendix is extracted from the proceedings of BIGDATA 2013.

Abstract

This paper is devoted to the investigation of binary classification in a distributed and on-line setting. In the Big Data era, datasets can be so large that it may be impossible to process them using a single processor. The framework considered accounts for situations where both the training and test phases have to be performed by taking advantage of a network architecture, by means of local computations and the exchange of limited information between neighbor nodes. An online learning gossip algorithm (OLGA) is introduced, together with a variant which implements a node selection procedure.
Beyond a discussion of the practical advantages of the algorithm we promote, the paper proposes an asymptotic analysis of the accuracy of the rules it produces, together with preliminary experimental results.

B.1 Introduction

In most analyses carried out in the field of statistical learning theory, the practical constraints related to the data acquisition/storage/access system and inherent to processing speed, memory and computing capacity have so far been generally ignored, or incorporated into the mathematical framework in a very stylized manner. With the advent of highly complex digital network infrastructures and the pressing necessity of sharing resources and distributing computing power [155, 18], this facet of the machine-learning environment is becoming more and more essential from a technological perspective and is now receiving increasing attention; see [82, 92, 100, 111, 66, 106, 104, 49, 59, 58] for instance.

¹ Proceedings of the 2013 IEEE International Conference on Big Data.

Motivated by the recent developments in the architecture of data repositories and computer systems, the main purpose of this paper is to investigate the binary classification problem, the "flagship" problem in statistical learning, in a distributed and on-line context, accounting for certain real-life situations likely to be encountered more and more often in the near future. Throughout the article, we consider the case where the training data are not stored in some central memory but split into distinct clusters, individually processed by independent agents (e.g. processors). To process Big Data, one generally distributes data subsamples over a network of processors communicating with each other. Precisely, it is assumed that the agents can exchange only a limited amount of information per unit of time, through a communication structure modeled by a graph of which they form the nodes.
Hence, due to these capacity constraints, merging all training sets at any node is unfeasible, and a distributed approach, limiting the network overhead, is required. Here, by "distributed", it is meant that both the learning and prediction stages are performed by means of local computations of the agents and sparse communications between them: each agent simultaneously processes the data set it has been assigned and shares some information with its neighbors in order to build a local classifier. In [66, 106], a specific view of distributed learning has been developed, where the goal is to reach a consensus among local classifiers. In this setting, all agents originally dispose of the same collection of classification rules, and a local gradient descent technique, jointly performed with a gossip step, is used to drive them to a consensus. At the end of the learning procedure, all agents use the consensus classifier to predict the labels assigned to test data, with no need for further communications. The nature of the problem we investigate in this paper is very different: it is not of the "distributed consensus" type. It should be noticed that, unlike most works on "distributed classification", agents are not assumed exchangeable in the framework we consider. First, we assume that the collection of classifiers may vary from one agent to another. This situation encompasses, for instance, the case where each agent is an expert in recognition based on a specific feature of the input observation. Additionally, the issue at stake is not to achieve a consensus between the agents, but to learn how to aggregate the local decisions efficiently, typically through a majority vote or a well-chosen weighted average. Hence, in the classification problem we consider, both the learning and test phases require distributed computations relying on the whole network of agents.
In addition, it is expected that, unlike consensus-based approaches that drive all nodes to a common classifier, our scheme should preserve and take full advantage of the peculiar skills of the local classifiers, being therefore closer to the spirit of ensemble learning algorithms. The paper is organized as follows. Section II describes the specific framework of the learning problem considered. In Section III, the principles of the algorithm promoted are described at length. The performance of the proposed procedure is analyzed in Section IV, while Section V focuses on a specific situation. Finally, numerical experiments are displayed in Section VI, in order to provide some preliminary empirical evidence of the efficiency of the methods proposed in this paper. Section VII collects some concluding remarks, and technical details are deferred to the Appendix section.

B.2 Background

We start off by setting out the notation and describing the key ingredients of the learning problem subsequently analyzed. Here and throughout, the indicator of any event E is denoted by I{E}.

B.2.1 Objective

Suppose we have a "black-box" system where Y is a binary output, taking its values in {−1, +1} say, and X is an input random vector valued in a high-dimensional space X, modeling some (hopefully) useful observation for predicting Y. Based on training data, the goal is to build a prediction rule sign(h(X)), where h : X → R is some measurable function, which minimizes the risk
$$R_\varphi(h) = \mathbb{E}\left[ \varphi(-Y h(X)) \right],$$
where the expectation is taken over the unknown distribution of the pair of r.v.'s (X, Y) and φ : R → [0, +∞) denotes a cost function (i.e. a measurable function such that φ(u) ≥ I{u ≥ 0} for any u ∈ R). For reasons which will appear obvious in the sequel (see Remark 3), we focus on the cost function φ(u) = (u + 1)²/2. Notice that, in this case, the optimal decision function is given by:
$$\forall x \in \mathcal{X}, \quad h^*(x) = 2\,\mathbb{P}\{Y = +1 \mid X = x\} - 1.$$
The classification rule H*(x) = sign(h*(x)) thus coincides with the Bayes classifier. For this specific choice, decision function candidates h(x) will be assumed to be square integrable with respect to X's distribution. The learning environment under study is non-standard. Here we consider a model of distributed classification device composed of a set V of N ≥ 1 connected agents which process independent databases: each agent v ∈ V disposes of a training dataset Dv = {(X1,v, Y1,v), . . . , (X_{nv,v}, Y_{nv,v})} of size nv ≥ 1, made of independent copies of the pair (X, Y). In addition, each agent v ∈ V must select a weak classifier function among a given parametric class possibly depending on v, namely {hv(·, θv)}_{θv ∈ R^{dv}}, where dv ≥ 1. We set D = Σ_v dv. For any vector θ = (θ1, · · · , θN) ∈ Θ = R^{d1} × · · · × R^{dN}, we define the global (soft) classifier as:
$$H(x, \theta) = \sum_{v \in \mathcal{V}} h_v(x, \theta_v) \quad \text{for } x \in \mathcal{X},$$
the label related to an observation X being estimated by sign(H(X, θ)). To lighten notation, we set Rφ(θ) = Rφ(H(·, θ)). This paper investigates the problem of finding a "global classification rule", as defined above, with minimum risk, i.e. the optimization problem
$$\min_{\theta \in \Theta} R_\varphi(\theta), \qquad \text{(B.1)}$$
while fulfilling some capacity constraints, which shall be described in the next subsection.

Remark B.1. (MIXTURE OF EXPERTS.) A typical example of the framework above stipulates that a fixed weak classifier hv : X → {−1, +1} is assigned to each agent v. For any (θv, x) ∈ R × X, we set hv(x, θv) = θv hv(x), and the global classifier then reduces to a weighted sum of the local weak classifiers. In the learning phase, the issue is to determine the optimal weights using a distributed algorithm. In the test phase, it is to compute a weighted sum of the local decisions, by using standard average consensus algorithms such as those studied in [31] for instance.

Remark B.2. (MAJORITY VOTE.) Another useful example is given by the case where each θv corresponds to some local parameter of a local classifier x ↦ hv(x, θv) ∈ R. In this case, the global classifier output sign(H(x, θ)) can be evaluated by a simple majority vote between agents, see [16].

B.2.2 Distributed Learning

In order to give an insight into the approach we propose, we first consider the ideal case where a standard gradient descent for solving (1) could be applied. One would then generate in an iterative manner a sequence θ^(t) = (θ1^(t), · · · , θN^(t)), t ≥ 1, satisfying the following update equation for each v ∈ V:
$$\theta_v^{(t+1)} = \theta_v^{(t)} + \gamma_t \, \mathbb{E}\left[ Y \nabla_v h_v(X, \theta_v^{(t)}) \, \varphi'(-Y H(X, \theta^{(t)})) \right], \qquad \text{(B.2)}$$
where γt > 0 is a step size and ∇v represents the gradient operator w.r.t. the argument θv. Naturally, as (X, Y)'s distribution is unknown, the expectation involved in (2) cannot be computed and must be replaced by a statistical version, in accordance with the Empirical Risk Minimization paradigm. It is assumed that each agent v ∈ V must rely on the local dataset Dv only to update its estimate, in a one-pass fashion: each observation (Xk,v, Yk,v) must be used only once by agent v and is not stored in the agent's memory. This "on-line" framework is especially relevant in the context of large data sets, where it is generally hopeless to process the whole training sequence as a block. It shall also be revealed useful in a distributed optimization setting, as will be discussed later on. The expectation in (2) can then be replaced by the following unbiased estimate:
$$Y_{t+1,v} \, \nabla_v h_v(X_{t+1,v}, \theta_v^{(t)}) \, \varphi'(-Y_{t+1,v} H(X_{t+1,v}, \theta^{(t)})).$$
The second issue is related to the distributed setting. In the estimate of the gradient above, the evaluation of the quantity H(X_{t+1,v}, θ^(t)) requires that: i) agent v sends the input X_{t+1,v} to all the other nodes w ≠ v; ii) each node w computes its local decision hw(X_{t+1,v}, θw^(t)) and returns the result to node v.
Needless to say, such a procedure can reveal itself overwhelmingly complex when the number of nodes is significant and/or when the dimensionality of the input X is large. It is therefore crucial to reduce the amount of information exchanged in the network. To formalize this constraint, we define the network throughput τ as the average number of information bits successfully carried by the network during each unit of time. Formally, we require that the sum, over all pairs of agents (v, w), of the number of bits sent by v to w does not exceed τ in expectation.

B.3 The Online Learning Gossip Algorithm (OLGA)

We now describe at length the general algorithm we propose in order to solve the constrained optimization problem (1) in the general on-line and distributed framework described in Section II. Suppose that, at step t ≥ 1, for each agent v ∈ V, the current parameter value is θt,v. Set θt = (θt,1, · · · , θt,N). The update is performed as follows. Agent v observes the pair (X_{t+1,v}, Y_{t+1,v}) and evaluates its local decision hv(X_{t+1,v}, θt,v) using the former value of the parameter θt,v. Next, it obtains an estimate of the global decision H(X_{t+1,v}, θt) by selecting some neighbors at random and sending its training input X_{t+1,v} to the selected nodes. Let δ_{t+1,v} = {δ^w_{t+1,v}}_{w∈V, w≠v} be a collection of N − 1 independent Bernoulli r.v.'s B(p), with parameter p ∈ (0, 1], independent from X_{t+1,v} given θt. Agent v sends the input X_{t+1,v} to node w if and only if δ^w_{t+1,v} = 1. An estimate of the global decision is then given by:
$$\hat Y^{(\mathcal{V})}_{t+1,v} = h_v(X_{t+1,v}, \theta_{t,v}) + p^{-1} \sum_{w \in \mathcal{V} \setminus \{v\}} \delta^w_{t+1,v} \, h_w(X_{t+1,v}, \theta_{t,w}), \qquad \text{(B.3)}$$
where the superscript emphasizes the fact that the estimate is computed by means of communications in the whole network V. It is worth noticing that (3) is an unbiased estimate of the global decision, in the sense that E[Ŷ^(V)_{t+1,v} | X_{t+1,v}, θt] = H(X_{t+1,v}, θt).
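The construction (3) can be checked numerically: averaging the sparsified estimate over many draws of the Bernoulli selection recovers the exact global decision. A minimal sketch follows, in which the decision values are arbitrary illustrative numbers:

```python
import numpy as np

def global_decision_estimate(v, local_decisions, p, rng):
    """Sparsified estimate (3) of the global decision H = sum_w h_w: agent v
    keeps its own decision and adds the decisions of a Bernoulli(p) random
    subset of the other agents, rescaled by 1/p to keep the estimate unbiased."""
    local_decisions = np.asarray(local_decisions, dtype=float)
    delta = rng.random(local_decisions.size) < p   # random neighbor selection
    delta[v] = False                               # agent v never samples itself
    return local_decisions[v] + local_decisions[delta].sum() / p
```

The 1/p rescaling trades communication for variance: smaller p means fewer transmissions per step, but a noisier gradient in the local descent that follows.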
If B represents the number of bits required to represent an arbitrary input x ∈ X, then each link v → w carries on average pB bits per unit of time. In order not to exceed the network throughput τ, one must pick the sampling parameter p so that p ≤ τ/(BN(N − 1)). Finally, agent v performs a local gradient descent as follows:

θ_{t+1,v} = θ_{t,v} + γ_t Y_{t+1,v} ∇_v h_v(X_{t+1,v}, θ_{t,v}) ϕ′(−Y_{t+1,v} Ŷ^(V)_{t+1,v}) .   (B.4)

As mentioned above, we shall pay particular attention to the case ϕ(x) = ½(x + 1)². In that case, the update equation (B.4) boils down to:

θ_{t+1,v} = θ_{t,v} + γ_t ∇_v h_v(X_{t+1,v}, θ_{t,v}) (Y_{t+1,v} − Ŷ^(V)_{t+1,v}) .   (B.5)

Remark B.3. (On the cost function.) The quadratic nature of the cost functional is essential in the subsequent analysis. It guarantees that the OLGA output remains unbiased at each iteration, in spite of its online nature and the randomness incurred by the gossip phase.

The algorithm is summarized in Algorithm 8 below.

Algorithm 8: OLGA
Initialize: Set arbitrary initial values θ_{0,v} for each node v ∈ V.
Update: At each time t = 1, 2, · · · do
  For each v ∈ V do
    Neighbors selection: Draw independent Bernoulli r.v.'s δ^w_{t+1,v} ∼ B(p) for any w ≠ v
    Gossip step: Transmit X_{t+1,v} to all w such that δ^w_{t+1,v} = 1 and obtain h_w(X_{t+1,v}, θ_{t,w}) in return
    Local descent: Update the parameter value θ_{t+1,v} using (B.4)

As the algorithm is single-pass, the number of iterations is necessarily smaller than the size of the full data sample, n = Σ_{v∈V} n_v. Hence, in the asymptotic framework that stipulates t → +∞, it is implicitly assumed that n → +∞.

B.4 Performance Analysis

In this section, we investigate the asymptotic behavior of the predictor output by OLGA as t → ∞. First, we establish its almost-sure convergence to the set of minimizers of R_ϕ(θ). Next, we provide a Central Limit Theorem which characterizes the fluctuations of the excess of risk as t → ∞.
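Before the analysis, Algorithm 8 can be made concrete with a short end-to-end sketch on synthetic data, in the linear-expert case of Remark B.1, h_v(x, θ_v) = θ_v h_v(x), with the quadratic update (B.5). All sizes, step sizes and the data model below are illustrative assumptions, and for brevity one common sample is shared by all agents at each round instead of N local streams.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, p, T = 5, 3, 0.5, 3000           # agents, input dim., gossip prob., iterations

w_true = rng.normal(size=d)            # hypothetical data model: Y = sign(w.x)
a = rng.normal(size=(N, d))            # fixed weak classifiers h_v(x) = sign(a_v.x)

def risk(theta, X, Y):                 # empirical quadratic risk (1/2) E[(H - Y)^2]
    return 0.5 * np.mean((np.sign(X @ a.T) @ theta - Y) ** 2)

X_test = rng.normal(size=(2000, d))
Y_test = np.sign(X_test @ w_true)
theta = np.zeros(N)
r0 = risk(theta, X_test, Y_test)       # risk of the all-zero initialization

for t in range(T):
    gamma = 0.1 / (1 + t) ** 0.6       # decaying step size, Assumption B.1
    x = rng.normal(size=d)
    y = np.sign(x @ w_true)
    hx = np.sign(a @ x)                # h_w(x) for every agent
    for v in range(N):                 # each agent: gossip estimate, then (B.5)
        delta = rng.random(N) < p
        delta[v] = False
        y_hat = theta[v] * hx[v] + (delta * theta * hx).sum() / p   # eq. (B.3)
        theta[v] += gamma * hx[v] * (y - y_hat)                     # eq. (B.5)

print(r0, risk(theta, X_test, Y_test))  # the risk decreases over the pass
```

The gradient ∇_v h_v(x, θ_v) reduces to h_v(x) for linear experts, which is why the update only multiplies the sign feature by the residual.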
This result determines the convergence rate of the algorithm and explicitly characterizes the impact of the "sparsifying" parameter p on the performance of the algorithm. Finally, using the results of [12], we provide a uniform bound on the error probability of the proposed classifier.

The following assumption is rather standard in stochastic approximation.

Assumption B.1. The step size γ_t decays to 0 as t → ∞ and satisfies Σ_{t≥1} γ_t = ∞ and Σ_{t≥1} γ_t² < ∞.

Additionally, some classical regularity conditions on the weak classifier functions h_v are required.

Assumption B.2. The conditions below hold true for any v ∈ {1, · · · , N} and any compact set K ⊂ R^{d_v}.
(a) For any x ∈ X, the function θ_v ↦ h_v(x, θ_v) is continuously differentiable.
(b) For any θ_v ∈ R^{d_v}, E[h_v(X, θ_v)²] < ∞.
(c) We have: E[sup_{θ_v∈K} ‖∇_v h_v(X, θ_v)‖²] < ∞ and E[sup_{θ_v∈K} ‖∇_v h_v²(X, θ_v)‖] < ∞.
(d) The mappings θ_v ↦ h_v(X, θ_v) and θ_v ↦ ∇_v h_v(X, θ_v), seen as functions with values in L²(P), are both continuous.
(e) We have: sup_{θ_v∈K} E[‖∇_v h_v(X, θ_v)‖⁴] < ∞ and sup_{θ_v∈K} E[h_v⁴(X, θ_v)] < ∞.
(f) The set of stationary points L = {θ : ∇R_ϕ(θ) = 0} is finite.

Assumption B.2 is clearly satisfied in the example described in Remark B.1, i.e. when h_v(x, θ_v) = θ_v h_v(x) for some fixed local weak classifier h_v such that the fourth moment of h_v(X) is finite. Recall that the algorithm is said to be stable if the sequence (θ_t)_{t≥1} remains in a compact set with probability one: P{∃K > 0, sup_{t≥1} ‖θ_t‖ < K} = 1. The next result reveals that, provided that it is stable, the algorithm produces a consistent decision rule as the number of iterations grows to infinity.

Theorem B.1. (Consistency) Assume that the algorithm is stable. Under Assumptions B.1 and B.2, the sequence (θ_t)_{t≥1} almost-surely converges to the set of stationary points L of R_ϕ.

The stability condition may not be easy to check in practice. There are several ways to guarantee stability.
A possible approach is to confine the sequence to a predetermined bounded set. This can be achieved by introducing a projection step at each iteration of the stochastic gradient algorithm. Each time an estimate θ_{t,v} falls outside some convex compact set K_v, agent v brings the estimate back into K_v by replacing θ_{t,v} with the nearest point in K_v. In that case, differential inclusion arguments may show that the conclusions of Theorem B.1 remain true: θ_t converges to the set of Karush-Kuhn-Tucker points of the functional R_ϕ(θ) over the set Π_v K_v. Refer to [15] or [96] for further details on projected stochastic approximation algorithms. Alternatively, one can stipulate additional assumptions on the weak classifier functions, see for instance [51]. The following result focuses on the situation described in Remark B.1.

Theorem B.2. (Consistency (bis)) Suppose that, for all v ∈ V, h_v(x, θ_v) = θ_v h_v(x) for some given function h_v such that E[(h_v(X))⁴] < ∞. Then, OLGA is stable and the sequence (θ_t)_{t≥1} almost-surely converges to the set of minimizers of R_ϕ as t → +∞.

In the sequel, notation ∇² (resp. ∇_v²) denotes the Hessian operator w.r.t. θ (resp. θ_v). We also use notation ∇_v¹ for ∇_v, and ∇_v⁰ stands for the identity, i.e., ∇_v⁰ f(θ_v) = f(θ_v). Superscript T represents transposition. Let θ* = (θ*_1, · · · , θ*_N) be an arbitrary point. We make the following assumption.

Assumption B.3. Suppose that θ* ∈ L and that the following conditions hold true for any v ∈ V.
(a) There exists a neighborhood N_v of θ*_v such that for any x ∈ X, the function θ_v ↦ h_v(x, θ_v) is twice continuously differentiable on N_v.
(b) We have E[‖∇_v² h_v(X, θ*_v)‖²] < ∞, where ‖ · ‖ represents any matrix norm.
(c) There exists a square-integrable random variable C(X) s.t. for any i ∈ {0, 1, 2} and θ_v ∈ N_v, ‖∇_v^i h_v(X, θ_v) − ∇_v^i h_v(X, θ*_v)‖ ≤ C(X) ‖θ_v − θ*_v‖.
(d) The matrix −Q*, where Q* = E[∇H(X, θ*)∇^T H(X, θ*)] + E[(H(X, θ*) − Y)∇²H(X, θ*)], is a Hurwitz matrix, i.e.
the largest real part of its eigenvalues is −L, with L > 0.
(e) There exists b > 4 such that for any i ∈ {0, 1}, sup_{θ_v∈N_v} E[‖∇_v^i h_v(X, θ_v)‖^b] < ∞.
(f) The mapping θ ↦ Γ_v(θ) is continuous at point θ*, where Γ_v(θ) is defined by:

Γ_v(θ) = E[(H(X, θ) − Y)² ∇_v h_v(X, θ_v) ∇_v^T h_v(X, θ_v)]
  + (1 − p)/p · Σ_{w∈V\{v}} E[h_w(X, θ_w)² ∇_v h_v(X, θ_v) ∇_v^T h_v(X, θ_v)] .   (B.6)

(g) The block-diagonal matrix Γ* = diag(Γ_v(θ*))_{v∈V} is positive definite.

Observe that the mapping (B.6) is well-defined in a neighborhood of θ*, by virtue of Assumption B.3(e).

Theorem B.3. (A conditional CLT) Suppose that Assumptions B.2 and B.3 hold true and that γ_t = γ_0 t^{−α} for some constants γ_0 > 0 and α ∈ (1/2, 1]. When α = 1, take γ_0 > (2L)^{−1} and η = 1/(2γ_0). Otherwise, set η = 0. Conditioned upon the event {lim_{t→∞} θ_t = θ*}, the sequence γ_t^{−1/2}(θ_t − θ*) converges in distribution to a centered Gaussian distribution N(0, Σ) whose covariance matrix Σ is the unique solution to the Lyapunov equation:

(Q* − ηI)Σ + Σ(Q* − ηI)^T = Γ* .

Theorem B.3 still holds true under milder assumptions on the step size, see [129] for more general conditions. The effect of the Bernoulli sampling parameter p on the asymptotic behavior of the estimation error deserves some attention. The case p = 1 corresponds to a centralized setting where all nodes communicate without restriction at any time. The matrix Γ_v(θ*) then boils down to the first term in (B.6) only. This gives the insight that the second term of (B.6) corresponds to the additional noise covariance induced by the distributed setting, as opposed to a centralized situation. In effect, when p becomes close to zero, i.e. when communications become rare, the second term of (B.6) becomes significant and produces a dramatic increase of the asymptotic covariance matrix Σ. In that sense, Theorem B.3 quantifies the unavoidable tradeoff between throughput and accuracy.

Corollary B.1.
(Error rate) Let U be a D × 1 vector of independent centered Gaussian r.v.'s with unit variance. Under the assumptions of Theorem B.3 and conditioned upon the event {lim_{t→∞} θ_t = θ*}, γ_t^{−1}(R_ϕ(θ_t) − R_ϕ(θ*)) converges in distribution to the χ²-type random variable ½ U^T Σ^{1/2} Q* Σ^{1/2} U.

Remark B.4. (On the cost function (bis).) We recall that the excess of probability of error of a classifier sign(H(x)) is bounded by (R_ϕ(H) − R_ϕ*)^{1/2}, see [12]. However, the damage to the rate of the excess of risk caused by the use of a quadratic convex surrogate for the cost is somehow compensated by the (possibly parametric) rate stated in Corollary B.1.

B.5 Distributed Selection

This section investigates more specifically the situation where, for any v ∈ V, h_v(x, θ_v) = α_v h_v(x, β_v), the local parameter θ_v being of the form θ_v = (α_v, β_v) with α_v ∈ R, β_v ∈ R^{d_v−1} and h_v : X × R^{d_v−1} → R being a local weak classifier function. For any agent v, the aim is to jointly determine the value of β_v parametrizing the local classifier and the weight α_v of agent v in the sum:

H(x, θ) = Σ_{v∈V} α_v h_v(x, β_v) ,   (B.7)

for θ = (θ_1, · · · , θ_N). In this scenario, it is natural to include a non-negativity constraint on the weights: α_v ≥ 0 for any v ∈ V. Clearly, the vector θ can be estimated by using a distributed algorithm as proposed in Section B.3. However, when the number N of nodes is very large, the implementation of OLGA can be difficult or even unfeasible. Indeed, in the learning phase, a significant amount of information must be exchanged by all nodes and, in the test phase, all N nodes are involved in the decision process. It is therefore desirable to restrict the number of nodes in order to simplify both the optimization stage and the prediction process. This remark is also motivated by the fact that, in practice, different nodes might generate quite similar outputs.
For such nodes, it is useless to duplicate the information in the sum (B.7). In this section, we propose an online method to jointly i) learn the parameters θ and ii) withdraw the nodes which are not essential for classification. Note that the withdrawal of a node v can be achieved by setting α_v to zero in (B.7). Based on this remark, we propose to add an ℓ1 penalization term to the initial cost function. For some fixed constant λ > 0, this yields the following optimization problem:

min_{θ∈Θ} R_ϕ(θ) + λ Σ_{v∈V} |α_v| .   (B.8)

The "lasso" penalization term Σ_v |α_v| is introduced so that the minimizers exhibit a certain level of sparsity, i.e. are such that a significant number of coefficients α_v are exactly equal to zero. Here, λ is a tuning parameter which sets the trade-off between the minimization of the cost and the sparsity of the minimizers.

The following modifications should be brought to OLGA in order to produce an efficient distributed online algorithm for selecting the relevant experts. At each iteration t, we assume that certain nodes have been definitively declared as idle, and we denote by S_t ⊂ V the remaining subset of active nodes. Following in the footsteps of the approach described in Section B.3, a given active node v ∈ S_t observing a pair of the training sample (X_{t+1,v}, Y_{t+1,v}) can obtain a noisy estimate of the classifier output by i) drawing independent Bernoulli distributed r.v.'s {δ^w_{t+1,v}}_{w∈S_t\{v}} with parameter p_t ∈ (0, 1], ii) computing the sum:

Ŷ^(S_t)_{t+1,v} = h_v(X_{t+1,v}, θ_{t,v}) + p_t^{−1} Σ_{w∈S_t\{v}} δ^w_{t+1,v} h_w(X_{t+1,v}, θ_{t,w}) .

The Bernoulli parameter p_t should be chosen in such a way that the network throughput does not exceed τ. Thus, if |S_t| denotes the cardinality of the set S_t and B the number of bits required to represent an arbitrary input x ∈ X, the Bernoulli parameter should satisfy:

p_t ≤ τ / (B |S_t|(|S_t| − 1)) .   (B.9)
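A side effect of (B.9) is that the admissible p_t grows as nodes are withdrawn: with τ fixed, shrinking |S_t| leaves more throughput per surviving pair. A small numerical illustration (the values of τ and B below are hypothetical):

```python
# Largest Bernoulli parameter allowed by the throughput constraint (B.9),
# p_t <= tau / (B |S_t| (|S_t| - 1)), capped at 1.
tau = 100_000            # hypothetical network throughput (bits per unit of time)
B = 64                   # hypothetical number of bits per input x

def p_max(n_active):
    return min(1.0, tau / (B * n_active * (n_active - 1)))

for n in (100, 50, 10):  # active nodes remaining as the selection proceeds
    print(n, p_max(n))   # the admissible p_t increases as |S_t| shrinks
```

With these numbers, 100 active nodes force p_t ≈ 0.16, while the last 10 surviving nodes may gossip with p_t = 1.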
Next, agent v updates its local parameters θ_{t,v} = (α_{t,v}, β_{t,v}) using a stochastic gradient descent. Unlike the algorithm of Section B.3, the update should include the ℓ1-penalization term in (B.8) and keep the nonnegativity constraints satisfied. We thus propose the following update equations:

α_{t+1,v} = [α_{t,v} + γ_t h_v(X_{t+1,v}, β_{t,v})(Y_{t+1,v} − Ŷ^(S_t)_{t+1,v}) − λ γ_t sign(α_{t,v})]_+ ,   (B.10)

β_{t+1,v} = β_{t,v} + γ_t α_{t,v} ∇_{β_v} h_v(X_{t+1,v}, β_{t,v})(Y_{t+1,v} − Ŷ^(S_t)_{t+1,v}) ,   (B.11)

with the notation [u]_+ = max(u, 0) and where ∇_{β_v} represents the gradient w.r.t. β_v. Finally, we need a criterion to decide whether agent v should declare itself as idle at step t + 1 or should be kept active. Here, we propose to declare a node as idle at iteration t + 1 if the current value α_{t+1,v} of the parameter α_v is zero for the M-th time, where M is an integer fixed in advance. Formally, a node v declares itself as idle if the sequence (α_{k,v})_{1≤k≤t+1} contains at least M zeros.

B.6 Numerical Results

The proposed algorithms have been tested on toy examples based on simulated data and on public datasets. Due to space limitations, only a few experiments are reported below: OLGA with expert selection is evaluated on a toy example, where its usefulness can be simply illustrated, and OLGA is tested on real data.

B.6.1 Simulation data

We placed ourselves in the mixture-of-experts case, using randomly placed affine experts as weak classifiers, namely h_v(x1, x2; θ_v) = θ_v sign(cos(a_v) x1 + sin(a_v) x2 − ρ_v), where a_v and ρ_v are considered fixed for each agent. We then ran OLGA with expert selection and kept the experts v for which θ_v ≠ 0. Notice below how the algorithm mostly selects affine separators relevant to the distribution of (X, Y).

B.6.2 Real data

In this section, we compare the performances of GentleBoost [70] and OLGA on some benchmark datasets for binary classification: banana and twonorm.
Detailed information about these datasets can be found in [138]; see also [161] for a distributed boosting approach. We split each data sample into a training set and a test set using an 80%-20% rule. For both GentleBoost and OLGA, the classifier is of the form H(x) = Σ_{1≤m≤M} α_m h_m(x, β_m), based on linear combinations of weak classifiers h(x, β), where β are the target parameters for the algorithm. For GentleBoost, the (α_m, β_m)'s are estimated using a stagewise block procedure. This means that α_1 h_1(β_1, ·) is added first, then α_2 h_2(β_2, ·), etc., and, for each α_m h_m(β_m, ·) to be added, a pass over the whole block of training data is required. For OLGA, the algorithm is online and distributed, meaning that each data point is processed only once and then forgotten. In addition, each α_m h_m(β_m, ·) is computed simultaneously by separate agents forming a network.

Figure B.1. Left: the experts (a_v, ρ_v) are represented by lines in red and sampling points (X, Y) by "+" in blue when Y = −1 and by "o" in green when Y = 1. Right: only the experts v having a non-zero weight θ_v ≠ 0 at the end of the iterations are represented.

For GentleBoost, the form of the weak classifier is arbitrary, but a widespread choice is to use stumps, i.e. rules of the form I{x^(j) ≥ β}, where x^(j) denotes the j-th component of x. For OLGA we used "smooth stumps", of the form F(σ(x^(j) − β)) where F(x) = 1 − 2/(1 + exp(−x)). The smoothness is required by the gradient descent approach. In the case of OLGA, weak classifiers are in one-to-one correspondence with the agents: V = {1, . . . , M}. Each agent v starts by uniformly randomly selecting an axis j(v) ∈ {1, . . . , d}, independently from all other agents, and next applies the algorithm described in Section B.5, using θ_{t,v} = (α_{t,v}, β_{t,v}, σ_{t,v}) and h_v(x, θ_{t,v}) = F(σ_{t,v}(x^(j(v)) − β_{t,v})).
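The smooth stump is differentiable in β, which is what the gradient updates of Section B.5 require. A quick sketch (the parameter values are arbitrary) of F, the stump, and its analytic derivative in β, validated against a finite difference:

```python
import math

def F(x):
    """F(x) = 1 - 2/(1 + exp(-x)): a smooth step taking values in (-1, 1)."""
    return 1.0 - 2.0 / (1.0 + math.exp(-x))

def stump(x, beta, sigma):
    """Smooth stump F(sigma * (x - beta)) acting on one input coordinate."""
    return F(sigma * (x - beta))

def dstump_dbeta(x, beta, sigma):
    """d/dbeta F(sigma*(x - beta)) = -sigma * F'(u), with F'(u) = -2 s(1 - s)
    where s = 1/(1 + exp(-u)) is the logistic function."""
    s = 1.0 / (1.0 + math.exp(-sigma * (x - beta)))
    return 2.0 * sigma * s * (1.0 - s)

x, beta, sigma = 0.7, 0.2, 3.0           # arbitrary test point
eps = 1e-6
fd = (stump(x, beta + eps, sigma) - stump(x, beta - eps, sigma)) / (2 * eps)
print(fd, dstump_dbeta(x, beta, sigma))  # the two derivatives agree
```

As σ grows, F(σ(x − β)) sharpens toward a hard stump at β, so σ can itself be learned by the same gradient mechanism, as done above.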
On both examples, one can see that OLGA does not outperform GentleBoost and has a more erratic error curve, due to its stochastic nature. However, it should be emphasized that: 1) both algorithms lead to comparable results and 2) OLGA is online and distributed, thus far less demanding in storage and power capability, which are crucial properties in a wide variety of applications.

B.7 Conclusion

In this paper, two variants of an online learning gossip algorithm (OLGA) for binary classification, very different in nature from "distributed consensus" approaches, are proposed and investigated. The main strength of OLGA lies in its ability to process data "on the fly" and then forget about it forever. Being distributed, datasets can be split and partially processed by several agents, while the network is able to benefit from the whole dataset. On the real datasets tested in this paper, OLGA performs nearly as well as GentleBoost [70], a centralized block-based robust version of boosting that needs to store the whole dataset in order to process it. OLGA is well suited to underlying complete graphs, while sub-sampling edges to perform sparsification [4].

Figure B.2. Comparison between GentleBoost and OLGA on the datasets banana (top) and twonorm (bottom): error rate as a function of the number of weak learners.

Even if this assumption seems realistic for IP networks, one might want to alleviate it and introduce hierarchies. Sophisticated versions of OLGA should thus be designed and analyzed in the near future.

Appendix - Technical details

Proof of Theorem B.1 (sketch)

By Assumption B.2(a), R_ϕ(θ) is finite for any θ. Let us write its derivative.
For any θ, any v ∈ V and any (x, y) ∈ X × {±1},

∇_v ϕ(−yH(x, θ)) = (H(x, θ) − y) ∇_v h_v(x, θ_v)
  = (Σ_{w≠v} h_w(x, θ_w) − y) ∇_v h_v(x, θ_v) + ½ ∇_v h_v²(x, θ_v) .

In particular, for any fixed value of θ and any θ'_v ∈ B(θ_v, 1) := {θ̃ : ‖θ̃ − θ_v‖ ≤ 1}, we obtain that

‖∇_v ϕ(−yH(x, θ'))‖ ≤ sup_{θ̃∈B(θ_v,1)} ‖∇_v h_v(x, θ̃)‖ (1 + Σ_{w≠v} |h_w(x, θ_w)|) + sup_{θ̃∈B(θ_v,1)} ‖∇_v h_v²(x, θ̃)‖ ,

where we set θ' = (θ_1, · · · , θ_{v−1}, θ'_v, θ_{v+1}, · · · , θ_N). Thus, ∇_v ϕ(−yH(x, θ')) is bounded by a r.v. which does not depend on θ'_v and which can be proved to be integrable by a straightforward application of the Cauchy-Schwarz inequality along with Assumptions B.2(b,c). Using Lebesgue's dominated convergence theorem, R_ϕ is differentiable w.r.t. θ_v and its gradient coincides with

∇_v R_ϕ(θ) = E[(H(X, θ) − Y) ∇_v h_v(X, θ_v)] .

The next step is to prove that ∇_v R_ϕ is continuous, and thus that R_ϕ(θ) is continuously differentiable w.r.t. θ. This is a direct consequence of Assumption B.2(d); the proof is left to the reader. We are now in position to prove Theorem B.1. Let us write our algorithm under the form θ_{t+1,v} = θ_{t,v} + γ_t Z_{t+1,v} where we set:

Z_{t+1,v} = ∇_v h_v(X_{t+1,v}, θ_{t,v})(Y_{t+1,v} − Ŷ^(V)_{t+1,v}) .   (B.12)

Let (F_t : t ≥ 0) represent the natural filtration F_t = σ(F_{t−1}, X_{t,1}, · · · , X_{t,N}, Y_{t,1}, · · · , Y_{t,N}). From the previous statement, it follows that E(Z_{t+1,v} | F_t) = −∇_v R_ϕ(θ_t). Using Minkowski's inequality followed by the Cauchy-Schwarz inequality, we obtain:

E[‖Z_{t+1,v}‖² | F_t]^{1/2} ≤ E[‖∇_v h_v(X, θ_v)‖²]^{1/2} + Σ_w E[‖h_w(X, θ_w) ∇_v h_v(X, θ_v)‖²]^{1/2}
  ≤ E[‖∇_v h_v(X, θ_v)‖⁴]^{1/4} (1 + Σ_w E[h_w⁴(X, θ_w)]^{1/4}) .

Therefore, Assumption B.2(e) implies that for any compact set K ⊂ Θ,

sup_{θ∈K} E[‖Z_{t+1,v}‖² | F_t] < ∞ .

The proof is concluded by direct application of [51].

Proof of Theorem B.2 (sketch)

The proof relies on the fact that R_ϕ is a Lyapunov function for the mean field of our algorithm, and that it is well-behaved as ‖θ‖ → ∞.
More precisely, we prove that ∇R_ϕ is Lipschitz continuous and satisfies ‖∇R_ϕ‖² ≤ C(1 + R_ϕ) for some constant C > 0. Using these conditions along with adequate estimates of the conditional moments of the noise sequence ξ_t, standard stochastic approximation results imply that the sequence θ_t remains in a compact set (see for instance [51]). Moreover, R_ϕ is convex under the assumptions of Theorem B.2. Thus the stationary points coincide with the global minimizers.

Proof of Theorem B.3 (sketch)

Define ξ_{t+1} = Z_{t+1} + ∇R_ϕ(θ_t) where Z_{t+1} is the vector whose v-th block-component coincides with Z_{t+1,v} defined by (B.12). As already seen in the proof of Theorem B.1, the sequence ξ_t is a martingale increment sequence adapted to the natural filtration, meaning that E[ξ_{t+1} | F_t] = 0 for any t. The algorithm writes θ_{t+1} = θ_t − γ_t ∇R_ϕ(θ_t) + γ_t ξ_{t+1}. The function −∇R_ϕ is the so-called mean field of the algorithm, whereas ξ_t plays the role of a noise sequence. Theorem B.3 is a consequence of [129, Theorem 1]. We only need to show that the assumptions of [129] are satisfied. To that end, we prove the following two technical lemmas. The first lemma provides some conditions on the mean field of the algorithm. Due to space limitations, its proof is omitted.

Lemma B.1. Set θ* ∈ L. Under Assumptions B.2(b-c) and B.3(a-c), the function R_ϕ is twice continuously differentiable on N := Π_v N_v and satisfies:

∇²R_ϕ(θ) = E[∇H(X, θ)∇^T H(X, θ)] + E[(H(X, θ) − Y)∇²H(X, θ)] .

Moreover, ∇R_ϕ(θ) = Q*(θ − θ*) + O(‖θ − θ*‖²).

The second lemma yields the required conditions on the probabilistic behavior of the noise sequence.

Lemma B.2. Set θ* ∈ L. Let Assumptions B.2(a-d) and B.3(e-f) hold true. Then, we have:

sup_{t≥0} E(‖ξ_{t+1}‖^{b/2} | F_t) I{θ_t ∈ N} < ∞ .

Moreover, almost surely on the event {θ_t → θ*}, E(ξ_{t+1} ξ_{t+1}^T | F_t) → Γ* as t → ∞.

Theorem B.3 directly follows from Lemmas B.1 and B.2 by straightforward application of [129].

Proof of Lemma B.2.
Since ∇R_ϕ is continuous, it is bounded in a neighborhood of θ*. Therefore, it is quite immediate to see that the statement sup_{t≥0} E[‖ξ_{t+1}‖^{b/2} | F_t] I{θ_t ∈ N} < ∞ is in fact equivalent to sup_{t≥0} E[‖Z_{t+1,v}‖^{b/2} | F_t] I{θ_t ∈ N} < ∞ for any v ∈ V. Recalling the definition (B.12) of Z_{t+1,v}, we obtain using the Cauchy-Schwarz inequality:

E[‖Z_{t+1,v}‖^{b/2} | F_t] ≤ E[‖∇_v h_v(X_{t+1,v}, θ_{t,v})‖^b | F_t]^{1/2} E[|Ŷ^(V)_{t+1,v} − Y_{t+1,v}|^b | F_t]^{1/2}
  ≤ C E[‖∇_v h_v(X, θ_{t,v})‖^b]^{1/2} (1 + E[|Ŷ^(V)_{t+1,v}|^b | F_t])^{1/2}

for some constant C > 0 which depends on b. Assumption B.3(e) ensures that the first factor in the right-hand side of the above inequality is bounded uniformly in t when multiplied by the indicator of the event {θ_t ∈ N}. The remaining task is thus to estimate E[|Ŷ^(V)_{t+1,v}|^b | F_t]. Recalling (B.3), one can prove after some algebra that:

E[|Ŷ^(V)_{t+1,v}|^b | F_t]^{1/b} ≤ C' Σ_{w∈V} sup_{θ_w∈N_w} E[‖h_w(X, θ_w)‖^b]^{1/b}

for some constant C' > 0. Using again Assumption B.3(e), we obtain that the right-hand side is bounded. Putting all pieces together, this proves sup_{t≥0} E[‖ξ_{t+1}‖^{b/2} | F_t] I{θ_t ∈ N} < ∞.

Consider the second statement of Lemma B.2. For any v ≠ w, Z_{t+1,v} and Z_{t+1,w} are independent conditionally on F_t. Thus, it is sufficient to study the conditional covariance of ξ_{t+1,v} for a given v. The latter covariance matrix coincides with U_v(θ_t) − ∇_v R_ϕ(θ_t) ∇_v R_ϕ(θ_t)^T where U_v(θ_t) = E[Z_{t+1,v} Z_{t+1,v}^T | F_t]. Upon noting that ∇_v R_ϕ is continuous and ∇_v R_ϕ(θ*) = 0, it is therefore sufficient to show that θ ↦ U_v(θ) is continuous and that U_v(θ*) = Γ_v(θ*), in order to complete the proof of Lemma B.2. Note that U_v(θ_t) coincides with the conditional expectation of (Ŷ^(V)_{t+1,v} − Y_{t+1,v})² ∇_v h_v(X_{t+1,v}, θ_{t,v}) ∇_v h_v(X_{t+1,v}, θ_{t,v})^T given F_t. After some tedious but straightforward derivations, one is able to show that U_v(θ) = Γ_v(θ). By Assumption B.3(f), the proof of Lemma B.2 is complete.
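As an aside, the covariance Σ of Theorem B.3 is only characterized implicitly, but for small dimensions the Lyapunov equation (Q* − ηI)Σ + Σ(Q* − ηI)^T = Γ* can be solved numerically by vectorization, using the column-major identity vec(AΣ + ΣAᵀ) = (I ⊗ A + A ⊗ I) vec(Σ). The sketch below uses arbitrary stand-ins for Q*, η and Γ*:

```python
import numpy as np

def solve_lyapunov(A, G):
    """Solve A @ S + S @ A.T = G via (I kron A + A kron I) vec(S) = vec(G)."""
    n = A.shape[0]
    K = np.kron(np.eye(n), A) + np.kron(A, np.eye(n))
    s = np.linalg.solve(K, G.reshape(-1, order="F"))   # column-major vec
    return s.reshape(n, n, order="F")

# Hypothetical stand-ins: Q with positive-real-part spectrum, eta < L, Gamma > 0.
Q = np.array([[2.0, 0.3], [0.1, 1.5]])
eta = 0.4
Gamma = np.array([[1.0, 0.2], [0.2, 0.8]])

Sigma = solve_lyapunov(Q - eta * np.eye(2), Gamma)
print(Sigma)     # satisfies (Q - eta I) Sigma + Sigma (Q - eta I)^T = Gamma
```

The linear system is uniquely solvable as soon as no two eigenvalues of Q − ηI sum to zero, which holds here since all its eigenvalues have positive real part.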
Proof of Corollary B.1

We use a second-order Taylor-Lagrange expansion of R_ϕ at θ*. As ∇R_ϕ(θ*) = 0, R_ϕ(θ_t) − R_ϕ(θ*) is equal to ½(θ_t − θ*)^T ∇²R_ϕ(θ*)(θ_t − θ*) up to a negligible term. Upon noting that ∇²R_ϕ(θ*) = Q*, the result follows from Theorem B.3.

Pseudo-code - Penalized OLGA

The method proposed in Section B.5 is summarized by Algorithm 9 below.

Algorithm 9: Penalized OLGA
Initialize: Set S = V. For each node v ∈ V, set initial values β_{0,v} and α_{0,v} = 1. Set counter_v = 0.
Update: At each time t = 1, 2, · · · do
  For each v ∈ S do
    Neighbors selection: Set p_t = τ/(B|S|(|S| − 1)). Draw independent Bernoulli r.v.'s δ^w_{t+1,v} ∼ B(p_t) for any w ∈ S, w ≠ v
    Gossip step: Transmit X_{t+1,v} to all nodes w such that δ^w_{t+1,v} = 1, and obtain h_w(X_{t+1,v}, θ_{t,w}) in return
    Local descent: Update the parameters α_{t+1,v}, β_{t+1,v} using (B.10)-(B.11)
    if α_{t+1,v} = 0 then counter_v ← counter_v + 1
    if counter_v = M then S ← S \ {v}

Appendix C

Examples of gossip models for consensus algorithms

In this appendix we provide some useful computations for the consensus algorithm involved in the gossip step of Algorithm (2.1)-(2.2) considered in Chapter 2.

Notation: The network of agents is represented as an undirected graph G = (V, E) where V is the set of N vertices, i.e. V = {1, . . . , N}, and E = {{i, j} : i, j ∈ V, i ∼ j} is the set of edges; |E| denotes the total number of edges. We denote by A the adjacency matrix of G, which has non-zero entries A(i, j) = 1 whenever {i, j} ∈ E and 0 otherwise. The diagonal matrix D stands for the corresponding degree matrix of G, that is D = diag(A1), and d_i denotes the degree of any node i. The Laplacian matrix is denoted by L and verifies L = D − A. Set p_ij the probability of any edge i ∼ j. Upon noting symmetry on the edges, p_ij = p_ji = (1/(2N))(1/d_i + 1/d_j): node i is uniformly chosen and contacts a uniformly selected neighbor j (probability (1/N)(1/d_i)), or conversely j selects i (probability (1/N)(1/d_j)).
We denote by A_w the version of A weighted by the probabilities p_ij, i.e. A_w(i, j) = p_ij. The weighted Laplacian L_w is equal to D_w − A_w, where the diagonal matrix is D_w = diag(A_w 1). Set I the N × N identity matrix, and let e_i denote the i-th vector of the canonical basis of R^N, i.e. all components of e_i are equal to zero except the i-th component, which is equal to 1. We define the N × N matrix W_n containing the nonnegative weights [w_n(i, j)]_{i,j=1,...,N} at time n involved in the communication (gossip) step. We denote by C the covariance matrix of the vector W_1^T 1, which plays an important role in the asymptotic variance of the sequence generated by Algorithm (2.1)-(2.2), i.e. C = E[W_1^T 1 1^T W_1] − 1 1^T. We define the real vector of the temporary estimates θ̃_n = (θ̃_{n,1}, . . . , θ̃_{n,N})^T and of the updated estimates θ_n = (θ_{n,1}, . . . , θ_{n,N})^T of the sought parameter involved in the following algorithms.

C.1 Standard gossip averaging

We recall the algorithm introduced in Section 1.3.1. The aim is to obtain a weighted average at any instant n:

θ_n = W_n θ̃_n

where (W_n)_n is an i.i.d. matrix sequence. We now establish some useful metrics of W_n for two standard gossip schemes: the pairwise model of [31] and the broadcast model of [10]. We first introduce some notation.

C.1.1 Communication model description

Pairwise gossip [31]. At time n, two connected nodes, say i and j, wake up randomly, independently from the past, with probability p_ij associated to the edge i ∼ j becoming active. Nodes i and j compute the weighted average θ_{i,n} = θ_{j,n} = 0.5 θ̃_{i,n} + 0.5 θ̃_{j,n}; for k ∉ {i, j}, the nodes do not gossip: θ_{k,n} = θ̃_{k,n}. In this example, given that the edge {i ∼ j} wakes up, W_n is given by:

W_n(k, ℓ) = 1/2 if k, ℓ ∈ {i, j};  1 if k = ℓ ∉ {i, j};  0 otherwise.   (C.1)

The above definition can be written in matrix form.
W_n is equal to W_ij if the edge {i ∼ j} is active (which happens with probability p_ij), where

W_ij = I − ½ (e_i − e_j)(e_i − e_j)^T .   (C.2)

Upon noting that E[e_i e_i^T] = E[e_j e_j^T] = D_w and E[e_i e_j^T] = E[e_j e_i^T] = A_w, the expectation matrix of W_1 is:

W̄ = E[W_ij] = I − L_w .

The matrices (W_n)_{n≥0} are doubly stochastic and satisfy W_1 1 = 1 and W_1^T 1 = 1. Then, C = 0. Note that W_1 is symmetric (W_1^T = W_1) and idempotent (W_1² = W_1). Thus, the spectral radius of E[W_1^T J^⊥ W_1] is:

ρ = r(E[W_1^T J^⊥ W_1]) = r(W̄ − J) = r(J^⊥ − L_w) = 1 − λ_2(L_w) < 1 .   (C.3)

The eigenvalues of W̄ = I − L_w are 1 > 1 − λ_2(L_w) ≥ · · · ≥ 1 − λ_N(L_w); the above condition holds because the graph is assumed to be connected, i.e. the second smallest eigenvalue λ_2(L_w) of L_w is non-zero. When G is a complete graph, then L_w = (1/(N − 1)) J^⊥, W̄ = I − (1/(N − 1)) J^⊥ and ρ = (N − 2)/(N − 1). In that case, the larger the network size N, the closer ρ gets to 1, meaning that the graph connectivity λ_2(L_w) decreases. Indeed, this seems logical, as the averaged gossip matrix W̄ tends to I, i.e. no communication is performed between the agents.

Broadcast gossip [10]. At time n, a node i wakes up at random and broadcasts its temporary update θ̃_{i,n} to all its neighbors N_i. Any neighbor j computes the weighted average θ_{j,n} = β θ̃_{i,n} + (1 − β) θ̃_{j,n} where β ∈ (0, 1). On the other hand, the nodes k which do not belong to the neighborhood of i (including i itself) set θ_{k,n} = θ̃_{k,n}. Note that, as opposed to the pairwise scheme, the transmitter node i does not expect any feedback from its neighbors. Then, given that i wakes up, the (k, ℓ)-th component of W_n is given by:

W_n(k, ℓ) = 1 if k ∉ N_i and k = ℓ;  β if k ∈ N_i and ℓ = i;  1 − β if k ∈ N_i and k = ℓ;  0 otherwise.   (C.4)

The above definition can be written in matrix form. We denote by E_i = e_i e_i^T the N × N matrix whose only non-zero entry is at (i, i).
Then, W_n is equal to W_i if node i is active (which happens with probability 1/N), where

W_i = I − β diag(A E_i 1) + β A E_i .   (C.5)

Upon noting that E[E_i] = (1/N) I, the expectation matrix is:

W̄ = E[W_i] = I − (β/N) L .

The matrices (W_n)_{n≥0} are row stochastic and column stochastic in average, satisfying W_1 1 = 1 and E[W_1^T] 1 = 1. We also refer to W_n as a doubly stochastic matrix in average. The spectral radius of E[W_1^T J^⊥ W_1] is now:

ρ = r(E[W_1^T J^⊥ W_1]) = r(J^⊥ − (2β/N)(1 − β) L − (β²/N²) L²)
  = 1 − (2β/N)(1 − β) λ_2(L) − (β²/N²) λ_2(L)² < 1 .   (C.6)

In that case, using (C.5) and noting that W_1^T 1 − 1 = β(d_i e_i − A e_i) when node i is active, we compute C as follows:

C = β² E[(d_i e_i − A e_i)(d_i e_i − A e_i)^T] = (β²/N)(D² + A² − AD − DA) = (β²/N) L² .   (C.7)

When G is a complete graph, then L = N J^⊥, W̄ = I − β J^⊥ and ρ = (1 − β)² r(J^⊥) = (1 − β)².

C.1.2 Numerical results on distributed optimization

We illustrate the behavior of these two consensus schemes in distributed optimization. We consider the same scenario as in Section 2.7 of Chapter 2, i.e. min_{θ∈R} F(θ) where F(θ) = Σ_{i=1}^N ½(θ − α_i)². Note that in both cases v = 1, which implies θ_F = θ_V = θ*. We compare the performance of Algorithm (2.1)-(2.2) assuming (2.4), i.e. (ξ_{n,i})_{n,i} is an i.i.d. sequence with Gaussian distribution N(0, σ²) where σ² = 1. The network is represented by a graph G = (V, E) of N = 10 vertices and three different sets of edges varying the connectivity of the graph, i.e. |E| = 9, 26 and 45. Here, |E| = 9 corresponds to the line graph (minimum connectivity), whose average degree, i.e. the average number of neighbors (edges) per node, is 1.8; |E| = 45 corresponds to the complete graph (maximum connectivity), in which each node has N − 1 = 9 neighbors. The figure below illustrates the graphs considered in the present example.
[Figure: the three graphs G(10, 9), G(10, 26) and G(10, 45), with |V| = 10 vertices.]

Table 1 reports the connectivity information of each graph through the spectral radius value ρ. As the connectivity (the average number of neighbors per node) increases, the spectral radius ρ decreases. This parameter appears in the convergence analysis of the disagreement sequence φ_n (2.12) (see Section D.2) and thus impacts the performance of the consensus convergence (see Figure C.1 (a)).

Gossip scheme | |E| = 9 | |E| = 26 | |E| = 45
Pairwise      |  0.995  |  0.937   |  0.889
Broadcast     |  0.992  |  0.771   |  0.25

Table 1. Spectral radius ρ of the pairwise (C.3) and broadcast (C.6) schemes when |V| = N = 10.

Figure C.1 shows the convergence result stated in Theorem 2.1. The error curves are computed as averages over 100 independent runs of Algorithm (2.1)-(2.2). We define the mean deviation of consensus (MDC) as follows:

MDC_n = (1/N) Σ_{i=1}^N E[(θ_{n,i} − ⟨θ_n⟩)²] .   (C.8)

The convergence to a consensus is illustrated in Figure C.1 (a), while Figure C.1 (b) illustrates the convergence to the sought value θ*. We report the MDC (C.8) and the mean square error (MSE), i.e. MSE_n = (1/N) Σ_i E[|θ_{n,i} − θ*|²]. Figure C.1 (a) shows the influence of the connectivity (the ρ values in Table 1) on the network agreement. When the network is formed by a line graph, ρ is close to 1 in both schemes (pairwise and broadcast) and the consensus error (MDC) has almost the same performance. Although in general the performance improves when increasing |E| (connectivity), this improvement is difficult to discern in the pairwise case, since ρ decreases much more slowly than in the broadcast case. In addition, the error curves corresponding to the broadcast scheme lie below those of the pairwise scheme, since ρ in the broadcast case is lower than in the pairwise case.
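The two gossip matrices of this section are easy to instantiate and check numerically, and a single pairwise run reproduces the consensus behavior summarized by the MDC. A minimal sketch (the graph, β and all sizes are arbitrary choices; the final loop runs pairwise averaging on the complete graph):

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta = 10, 0.5

def W_pairwise(i, j):                      # eq. (C.2)
    u = np.zeros((N, 1)); u[i], u[j] = 1.0, -1.0
    return np.eye(N) - 0.5 * u @ u.T

def W_broadcast(A, i):                     # eq. (C.5)
    Ei = np.zeros((N, N)); Ei[i, i] = 1.0
    return np.eye(N) - beta * np.diag(A @ Ei @ np.ones(N)) + beta * A @ Ei

Wp = W_pairwise(1, 3)                      # doubly stochastic, symmetric, idempotent
A = np.ones((N, N)) - np.eye(N)            # complete graph adjacency
Wb = W_broadcast(A, 2)                     # row stochastic only

x = rng.normal(size=N)                     # a pairwise averaging run
mean0 = x.mean()
mdc0 = np.mean((x - mean0) ** 2)           # single-run MDC at n = 0
for _ in range(500):
    i, j = rng.choice(N, size=2, replace=False)
    x = W_pairwise(i, j) @ x               # theta_n = W_n theta_tilde_n
print(mdc0, np.mean((x - x.mean()) ** 2))  # the deviation shrinks toward 0
```

Note that the pairwise run preserves the network average exactly (double stochasticity), whereas a broadcast run would only preserve it in expectation, which is precisely the source of the extra MSE term discussed below.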
Indeed, the gaps between the values of ρ for |E| = 26 and |E| = 45 are easily appreciated when choosing the broadcast gossip.

Figure C.1 (b) shows the influence of the non-doubly stochastic nature of the broadcast protocol on the average sequence ⟨θn⟩. The gaps between the values of ρ in Table 1 translate into gaps between the error curves (MSE) in Figure C.1 (b). However, the pairwise gossip performs better than the broadcast gossip for all |E|, since in the latter case the error term related to 1^T Wn − 1^T is always non-zero.

[Figure C.1: (a) mean deviation of consensus (MDC) and (b) mean square error (MSE) as functions of the number of iterations n, for the pairwise and broadcast schemes on G(10,9), G(10,26) and G(10,45).]

Figure C.1. Convergence performance of Algorithm (2.1)-(2.2) when considering the gossip schemes (C.1) and (C.4).

[Figure C.2: (a) boxplots of the average error sequence; (b) empirical histograms against the theoretical Gaussian densities: N(0, 0.1) for the pairwise scheme on all three graphs, and N(0, 155), N(0, 51), N(0, 35) for the broadcast scheme on G(10,9), G(10,26) and G(10,45) respectively.]

Figure C.2. Asymptotic behavior of the normalized average error sequence (1/√γn)(⟨θn⟩ − θ⋆) when n = 20000.
Finally, we illustrate the CLT result of Chapter 2 (see Theorem 2.3 in Section 2.5). Figure C.2 shows the asymptotic behavior of the normalized average error sequence (1/√γn)(⟨θn⟩ − θ⋆). The asymptotic distributions and variances U⋆ are reported by the boxplots and histograms in Figures C.2 (a) and C.2 (b) respectively. As expected, the asymptotic variance in the pairwise case only depends on the noise in the observations (see Corollary 2.1 in Section 2.5.3). Hence, the asymptotic distribution is the same for all connectivity values ρ, since U⋆ does not depend on the choice of the graph. On the contrary, when choosing a non-doubly stochastic protocol such as the broadcast gossip, an additional term depending on Wn appears in U⋆; the connectivity then influences the asymptotic variance, which increases as the connectivity of the graph decreases.

C.2 Push-sum gossip averaging

In this section we highlight a possible direction for future work related to Chapter 2 which we did not have time to investigate. The idea is to solve distributed optimization problems of the form (2.3) by an adaptation-diffusion algorithm following the scheme (2.1)-(2.2), coupled with a consensus protocol based on the push-sum introduced by [89]. This framework has already been addressed by [153] and [114]-[115], which provide convergence results. However, both works require synchronous communication models.

C.2.1 Communication model description

Since we are interested in asynchronous protocols, we focus on the push-sum model more recently proposed in the context of consensus averaging by [84]. Similarly to [10], a single agent is randomly activated asynchronously and broadcasts its temporary estimates to all its neighbors. Define sn = (s_{n,1}, ..., s_{n,N})^T and wn = (w_{n,1}, ..., w_{n,N})^T. The scheme is summarized as follows:

sn = Kn s_{n−1} ,  wn = Kn w_{n−1} ,  θ_{n,i} = s_{n,i} / w_{n,i}  ∀i = 1, ..., N ,    (C.9)

where the intermediate sequences sn and wn represent the sums and weights needed to update the average parameter θn, under the assumptions s0 = θ0 and w0 = 1. Contrarily to (Wn)n, the sequence of i.i.d. matrices (Kn)n is column stochastic, i.e. 1^T Kn = 1^T, where Kn = I − L(I + D)^{−1} Ei if node i is activated with probability 1/N. The main differences compared to the standard gossip models of the previous section are the need for an additional parameter involved in the communication step and the conditions on (Kn)n.

C.2.2 Algorithm for distributed optimization

The objective is to couple the asynchronous model (C.9) of [84] with Algorithm (2.1)-(2.2) and to illustrate the convergence analysis, i.e. that θn tends to θ⋆ (2.3) as n → ∞. The algorithm under study can be described as follows.

[Local step] Similarly to (2.1), each node i generates a temporary iterate s'_{n,i} given by

s'_{n,i} = s_{n−1,i} + γn α_{n,i} Y_{n,i}  where  Y_{n,i} = −∇fi(θ_{n−1,i}) + ξ_{n,i} ,    (C.10)

where (γn)n is a decreasing step-size sequence. We introduce the sequence (αn)n, which is defined according to different models.

[Push-sum step] Kn is defined as in (C.9). The active agent i sends its values to all j ∈ Ni. Node i and its neighbors compute the weighted average while the other nodes stay idle. Then, each node i is able to update its estimate θ_{n,i} given s_{n,i} and w_{n,i}:

sn = Kn s'_n ,  wn = Kn w_{n−1} ,  θ_{n,i} = s_{n,i} / w_{n,i}  ∀i = 1, ..., N .    (C.11)

Note that the sequences (Kn)n, (wn)n and (sn)n are related to the push-sum protocol, while the sequences (s'_n)n and (θn)n are related to the stochastic gradient descent algorithm. Moreover, contrarily to Assumption 2.2-2) (Chapter 2), nothing is required in (C.11) concerning the row-stochasticity of (Kn)n. It is worth noting the contribution of such a new algorithm with respect to the existing works [153] and [115]. The authors in [153] consider a fixed matrix model, i.e.
Kn = K̄ for all n, for some deterministic K̄, and (C.10) is replaced by a primal-dual step with αn = 1 and ξn = 0. Although in [115] an algorithm similar to (C.10)-(C.11) is proposed by setting αn = 1, the step order is inverted, i.e. it is a diffusion-adaptation scheme (C.11)-(C.10). In addition, contrarily to our framework, [115] proposes a time-varying and synchronous model for the matrices (Kn)n and assumes that they are adapted to a directed and strongly connected graph. In this section we analyze the influence of the asynchronous consensus model of [84] when used for distributed optimization problems.

C.2.3 Numerical results on distributed optimization

We consider the same scenario as in Section 2.7 of Chapter 2. The minimization problem (2.19) is solved by a network of N = 5 connected nodes (or agents). The minimizer of (2.19) is θF = 1. The objective is to check whether θn generated by Algorithm (C.10)-(C.11) tends to θ⋆, corresponding to the sought minimizer θF, as the number of iterations n tends to ∞. The graph G is formed according to the configuration defined by (2.20). In this context, we show the convergence behavior of Algorithm (C.10)-(C.11) for 4 different versions. The numerical results are based on 100 Monte-Carlo runs of the trajectory θn. Note that, while the first three versions use the matrices (Kn)n of [84], the last one uses the model of [10], in order to isolate some features when comparing these two broadcast-like schemes.

1. First, we set αn = 1 as in [115]. This choice does not yield convergence, as illustrated by the following figures.
[Left: trajectories of the estimates θ_{n,i} of the five agents as functions of n. Right: zoom on the sequences s'_{n,1}, s_{n,1}, w_{n,1} and θ_{n,1} of agent 1 over the iterations n = 30596 to 30950.]

The figure on the left shows the trajectory of the estimated parameter θ_{n,i} of each agent i as a function of the number of iterations n. The convergence of θn to 1 is not achieved. As the number of iterations increases, θn does seem to approach 1 as expected; however, there are peaks at random instants n where θ_{n,i} reaches values far above or below the asymptotic value θ⋆. In order to investigate the reason for those peaks, the figure on the right plots the trajectories of each sequence involved in the steps of Algorithm (C.10)-(C.11) for agent 1. The figure zooms on the sequences s'_{n,1}, s_{n,1}, w_{n,1} and θ_{n,1} within an interval of large iteration numbers, namely from n = 30596 to n = 30950. The two most important peaks appear at n = 30620 and n = 30940. Before these two instants, the temporary sum estimate s'_{n,1} decreases until it reaches 0, while at the same time both s_{n,1} and w_{n,1} decrease and become almost zero. As a consequence, the quotient of these two quantities makes θ_{n,1} drop towards values far from the asymptotic value 1.

2. Then, we set α_{n,i} = w_{n−1,i} (∀i = 1, ..., N), which seems more coherent since the gradient term involved in (C.10) is evaluated at θ_{n−1,i} = s_{n−1,i}/w_{n−1,i} and this step updates sn. However, once again this choice fails to converge to the sought value (2.3).
Indeed, θn tends to θV, the minimizer of the weighted problem (2.5), where v is the right Perron eigenvector of E[Kn], i.e. v = E[Kn] v and v^T 1 = 1 (analogous to Lemma 2.1 in Chapter 2).

3. Hence, similarly to (2.6) in Chapter 2, we introduce a weighted version with α_{n,i} = w_{n−1,i}/v_i (∀i = 1, ..., N), in order to drive the sequence θn to the sought value (2.3).

4. In addition, based on the intuition that the failure in case 2) may be caused by the non-row stochasticity of (Kn)n, we also include the numerical results obtained when αn = 1 and Kn = Wn as in (C.4) (broadcast gossip [10]). This choice maintains the asynchronous and broadcast nature of [84] but modifies the main stochasticity assumption on (Kn)n: the matrices (Kn)n are then column-stochastic in average and row-stochastic for all n.

[Figure C.3: curves for Wn of [84] with αn = 1, αn = w_{n−1} and αn = w_{n−1} ◦ v^{−1}, and for Wn of [10] with αn = 1. (a) Mean deviation of consensus (MDC) as a function of n; (b) mean square error (MSE) as a function of n, measured with respect to θF (and also θV for case 2).]

Figure C.3. Convergence performance of Algorithm (C.10)-(C.11) when considering the different models 1)-4).

Figure C.3¹ reports the convergence to a consensus and to the sought solution θ⋆ through the error sequences MDC (C.8) and MSE (defined in Section C.1.2) for Algorithm (C.10)-(C.11) in the four cases 1)-4). Regarding the convergence to a consensus, Figure C.3 (a) shows that the trajectories of the MDC sequence decrease as the number of iterations n grows.
It is worth noting the difference in behavior between case 1), i.e. αn = 1 (as in [115], which uses a synchronous model), and the two asynchronous broadcast schemes of [10] and [84]. As discussed in case 1), θn does not converge to θ⋆. This affects both the MDC and MSE sequences by introducing a large bias, even if their trajectories have a decreasing profile (see Figures C.3 (a) and C.3 (b)).

¹ x ◦ y denotes the element-wise (Hadamard) product between vectors x and y.

On the contrary, even though in case 4) the matrices (Kn)n are column-stochastic only in average, the row-stochasticity condition is satisfied and convergence is achieved (Algorithm (C.10)-(C.11) is then essentially equivalent to Algorithm (2.1)-(2.2) analyzed in Chapter 2). In addition, as shown in Figure C.3 (a), the weighted versions of Algorithm (C.10)-(C.11) described by cases 2) and 3) converge to a consensus. However, case 2) does not yield the sought minimizer θ⋆ = θF (see Figure C.3 (b)). Indeed, in Figure C.3 (b) we include the MSE sequence with respect to both θF and θV when θn is generated by case 2): with respect to θF the MSE remains almost flat, while with respect to θV it shows that θn converges to θV. Although the MSE performance of case 2) measured against θV is close to that achieved by case 4), case 2) fails to converge to the sought solution. Finally, Figures C.3 (a) and C.3 (b) show that case 3) yields convergence to both a consensus and θF, even if its performance is slightly worse than that obtained in case 4).

In conclusion, case 3) is an alternative to case 4). Thus, if (Kn)n are row-stochastic (and column-stochastic in average), then Algorithm (C.10)-(C.11) can be used with αn = 1, i.e. with no additional knowledge, and its asymptotic behavior is comparable to that of Algorithm (2.1)-(2.2) (Chapter 2).
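The role of column-stochasticity can be made concrete with a minimal simulation of the pure push-sum averaging step (C.9) (no gradient term): even though each Kn of [84] is only column-stochastic, the ratios s_{n,i}/w_{n,i} recover the exact average of the initial values. The ring graph, seed and iteration count below are illustrative choices, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# ring graph on N = 5 nodes (an illustrative connected topology)
N = 5
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1.0
D = np.diag(A.sum(1))
L = D - A

def K_matrix(i):
    """K_n = I - L (I + D)^{-1} E_i : active node i splits its mass with its neighbors."""
    Ei = np.zeros((N, N)); Ei[i, i] = 1.0
    return np.eye(N) - L @ np.linalg.inv(np.eye(N) + D) @ Ei

s = rng.normal(size=N)        # s_0 = theta_0
w = np.ones(N)                # w_0 = 1
target = s.mean()
for _ in range(2000):
    K = K_matrix(rng.integers(N))   # asynchronous uniform activation
    s, w = K @ s, K @ w

# column-stochasticity conserves the total sums; the ratios reach the average
assert np.isclose(s.sum(), target * N)
assert np.isclose(w.sum(), N)
assert np.allclose(s / w, target, atol=1e-5)
```

The same mechanism underlies cases 2) and 3): when a gradient increment enters sn through (C.10), the weights w_{n−1,i} (rescaled by v_i in case 3) decide which local problem the ratio sequence solves.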
If (Kn)n are not row-stochastic (but column-stochastic), Algorithm (C.10)-(C.11) can be used with α_{n,i} = w_{n−1,i}/v_i (for all i = 1, ..., N), i.e. at the price of requiring prior knowledge of v from E[Kn] and of a slight loss in performance. We recall that an algorithm similar to (C.10)-(C.11) using αn = 1 and non-row-stochastic (Kn)n is addressed in [115], where convergence is obtained under more restrictive conditions, e.g. a synchronous communication model and a strongly connected directed graph. Hence, the choice among these three alternatives will depend on the scenario.

Figure C.4² illustrates the analogue of the CLT result obtained in Section 2.5.2 (see Theorem 2.3 in Chapter 2). Figure C.4 (a) shows the asymptotic variances of the normalized average error in cases 1)-4) when n = 20000, over 100 independent runs. As expected from the non-convergence of case 1), several aberrant values appear outside the bounds of the average variance. The asymptotic behavior of Algorithm (C.10)-(C.11) in that case cannot be predicted, since the peaks of θn occur randomly. As shown in Figures C.4 (a) and C.4 (b), the asymptotic variance and distribution in case 2) are close to those obtained in case 4), up to a bias due to the convergence failure, i.e. convergence to θV instead of θF. Besides, the performance of case 3) is within a factor lower than two of that of case 4), i.e. the asymptotic variance of case 3) is less than double that of case 4).

² x ◦ y denotes the element-wise (Hadamard) product between vectors x and y.

[Figure C.4 (a): boxplots of the normalized error (1/√γn)(⟨θn⟩ − θF) for Wn of [84] with αn = 1, αn = w_{n−1} (also shown against θV) and αn = w_{n−1} ◦ v^{−1}, and for Wn of [10] with αn = 1.]

(a) Boxplots of the average error sequence.
[Figure C.4 (b): empirical and theoretical distributions of (1/√γn)(⟨θn⟩ − θF) (and (1/√γn)(⟨θn⟩ − θV) for case 2) for Wn of [84] with αn = w_{n−1} and αn = w_{n−1} ◦ v^{−1}, and for Wn of [10] with αn = 1.]

(b) Empirical and theoretical distribution of the average error sequence.

Figure C.4. Asymptotic behavior of the normalized average error sequence (1/√γn)(⟨θn⟩ − θ⋆) when n = 20000.

Appendix D

Proofs related to Chapter 2

D.1 Proof of Theorem 2.1

We prove that Assumptions 2.5 to 2.8 hold; Theorem 2.1 will then follow from Theorem B.1. For any θ = (θ1, ..., θN) ∈ R^{dN} where θi ∈ R^d, define the R^{dN}-valued function g by g(θ) := (−∇f1(θ1)^T, ..., −∇fN(θN)^T)^T. Under Assumptions 2.2-1) and 2.2-2), for any Borel set A × B of R^{dN} × M1,

P[(Y_{n+1}, W_{n+1}) ∈ A × B | Fn] = P[Y_{n+1} ∈ A | Fn] P[W_{n+1} ∈ B] .

In addition, by Assumption 2.1 and Eq. (2.4),

P[Y_{n+1} ∈ A | Fn] = ∫ I_A(g(θn) + z) dν_{θn}(z) .

The above discussion provides the expression of µθ in Assumption 2.5-1). In addition, under Assumption 2.1-2), for any compact set K of R^{dN},

sup_{θ∈K} ∫ |y|² dµθ(y, w) = sup_{θ∈K} ( |g(θ)|² + ∫ |z|² dνθ(z) ) < ∞ ,

which proves Assumption 2.5-2). Set W = w ⊗ Id. The above expression of µθ implies that

∫ (φ + Ay)^T W^T J⊥ W (φ + Ay) dµθ(y, w) = ∫ (φ + A(g(θ) + z))^T E[W1^T J⊥ W1] (φ + A(g(θ) + z)) dνθ(z) .

Therefore, Assumption 2.6 easily follows from Assumption 2.2-3). The regularity conditions of Assumption 2.7 are satisfied with λµ = δ, where δ is given by Assumption 1. Observe indeed that the left-hand side of (2.15) is zero, and that (2.16) and (2.17) hold as long as the (∇fi)i are locally Hölder-continuous. Again, the expression of µθ implies that Wθ = E[W1]. Therefore, the mean field vector h defined by (2.18) becomes h(ϑ) = ⟨E[W1] g(1 ⊗ ϑ)⟩. Using the Woodbury matrix identity (see [81]), we have

h(ϑ) = (v^T ⊗ Id) g(1 ⊗ ϑ) = − Σ_{i=1}^N v_i ∇fi(ϑ) ,

where v = (v1, . . .
, vN) is the left Perron eigenvector given by Lemma 2.1. Set V̄ := exp(V), where V is defined by (2.5). Upon noting that ∇V̄ = −h V̄, it is easily seen that, under the assumptions of Theorem 2.1, Assumption 2.8 holds.

D.2 Proof of Lemma 2.3

From (2.12), we compute |φn|² = αn² (φ_{n−1} + Yn)^T Wn^T J⊥ Wn (φ_{n−1} + Yn). Using Assumption 2.5-1), E[|φn|² | F_{n−1}] is equal to

αn² ∫ (φ_{n−1} + y)^T (w ⊗ Id)^T J⊥ (w ⊗ Id) (φ_{n−1} + y) dµ_{θ_{n−1}}(y, w) .

By the Fubini theorem and Assumption 2.6, there exists ρK ∈ (0, 1) such that for any n ≥ 1, E[|φn|² | F_{n−1}] ≤ αn² ρK ∫ |φ_{n−1} + y|² dµ_{θ_{n−1}}(y, w). By Assumption 2.5-2), there exists a constant C such that for any n ≥ 1, almost surely,

E[|φn|² | F_{n−1}] I_{θ_{n−1}∈K} ≤ αn² ρK ( |φ_{n−1}|² + 2|φ_{n−1}| √C + C ) .

Set Un := |φn|² I_{∩_{j≤n−1} {θj∈K}}. Upon noting that I_{∩_{j≤n−1} {θj∈K}} ≤ I_{∩_{j≤n−2} {θj∈K}}, the previous inequality implies E[Un] ≤ αn² ρK ( E[U_{n−1}] + 2 √(E[U_{n−1}]) √C + C ). Let δ ∈ (0, 1 − ρK). For any n large enough (say n ≥ n0), αn² ρK ≤ 1 − δ, since lim_n αn = 1 under Assumption B.1-1). There exist positive constants M, b such that for any n ≥ n0,

E[Un] ≤ (1 − δ) ( E[U_{n−1}] + 2 √(E[U_{n−1}]) √C + C ) ≤ (1 − δ/2) E[U_{n−1}] + b 1_{E[U_{n−1}]≤M} .

A trivial induction implies that E[Un] ≤ (1 − δ/2)^{n−n0} E[U_{n0}] + 2b/δ, which concludes the proof.

D.3 Preliminary results on the sequence (φn)n

Due to the coupling of the sequences (⟨θn⟩)n and (φn)n (see Eq. (2.11)), the asymptotic analysis of (⟨θn⟩)n requires a more detailed understanding of the behavior of φn. Note from Assumption 2.5-1) and (2.12) that {φn, n ≥ 0} is a Markov chain w.r.t. the filtration {Fn, n ≥ 0}, with a transition kernel controlled by {αn, θn, n ≥ 0} (see also (D.2) below).

Let us introduce some notations and definitions. If (x, A) ↦ P(x, A) is a probability transition kernel on R^{dN}, then for any bounded continuous function f : R^{dN} → R, Pf is the measurable function x ↦ ∫ f(y) P(x, dy).
If ν is a probability on R^{dN}, νP is the probability on R^{dN} given by νP(A) = ∫ ν(dx) P(x, A). For n ≥ 0, the notation P^n stands for the n-fold iterated kernel, i.e. P^n f(x) = ∫ P^{n−1} f(y) P(x, dy); by convention P^0(x, A) = 1_A(x) = δ_x(A). A measure π is said to be an invariant distribution w.r.t. P if πP = π. For p ≥ 0, denote by Lp(R^{dN}) the set of Lipschitz functions f : R^{dN} → R^{dN} satisfying

[f]_p := sup_{x,y∈R^{dN}} |f(x) − f(y)| / ( |x − y| (1 + |x|^p + |y|^p) ) < ∞ .

We define Np(f) := ( sup_{x∈R^{dN}} |f(x)| / (1 + |x|^{p+1}) ) ∨ [f]_p for f ∈ Lp(R^{dN}). For any θ ∈ R^{dN} and any α ≥ 0, define the probability transition kernel P_{α,θ} on R^{dN} as

P_{α,θ} f(x) = ∫ f( α J⊥ (w ⊗ Id)(x + y) ) dµθ(y, w) .    (D.1)

This collection of kernels is related to the sequence (φn)n since, by Assumption 2.5-1) and (2.12), for any measurable positive function f it holds almost surely that

E[f(φ_{n+1}) | Fn] = P_{α_{n+1},θn} f(φn) .    (D.2)

We start with a result which claims that each transition kernel P_{α,θ} possesses a unique invariant distribution π_{α,θ} and is ergodic at a geometric rate. This also implies that, for a large family of functions f, a solution f_{α,θ} to the Poisson equation

f − π_{α,θ}(f) = f_{α,θ} − P_{α,θ} f_{α,θ}    (D.3)

exists, and is unique up to an additive constant.

Proposition D.1. Let Assumptions 2.5 and 2.6 hold. Let K ⊂ R^{dN} be a compact set and let ρK ∈ (0, 1) be given by Assumption 2.6. The following holds for any a ∈ (0, 1/√ρK).

1. For any θ ∈ K and α ∈ [0, a], P_{α,θ} admits a unique invariant distribution π_{α,θ} such that sup_{α∈[0,a], θ∈K} ∫ |x|² dπ_{α,θ}(x) < ∞.

2. For any p ∈ [0, 1], there exists a constant K such that for any x ∈ R^{dN} and any f ∈ Lp(R^{dN}),

sup_{α∈[0,a], θ∈K} |P^n_{α,θ} f(x) − π_{α,θ}(f)| ≤ K Np(f) (a √ρK)^n (1 + |x|^{p+1}) .

3. For any α ∈ (0, a], θ ∈ K, p ∈ [0, 1] and f ∈ Lp(R^{dN}), the function f_{α,θ} : x ↦ Σ_{n≥0} ( P^n_{α,θ} f(x) − π_{α,θ} f ) exists, solves the Poisson equation (D.3) and belongs to Lp(R^{dN}). In addition,

sup_{α∈[0,a], θ∈K} |f_{α,θ}(x)| ≤ ( K Np(f) / (1 − a √ρK) ) (1 + |x|^{p+1}) .
Proof. Let K be a compact subset of R^{dN}. Throughout this proof, for ease of notation, we write ρ instead of ρK. Let a ∈ (0, 1/√ρ) be fixed. We check the assumptions of [17, Proposition 2, p. 253], from which all the items follow. We first prove [17, (2.1.10), p. 253]. By Assumption 2.6, for any α ∈ [0, a] and θ ∈ K,

∫ P_{α,θ}(x, dy) |y|² ≤ a² ρ ( |x|² + ∫ |y|² dµθ(y, w) + 2|x| ∫ |y| dµθ(y, w) ) ;

by Assumption 2.5-2), for any ρ̄ ∈ (a²ρ, 1), there exists a positive constant c such that for any x ∈ R^{dN},

sup_{α∈[0,a], θ∈K} ∫ P_{α,θ}(x, dy) |y|² ≤ ρ̄ |x|² + c .

This concludes the proof of [17, (2.1.10), p. 253]. Note that iterating this inequality and applying Jensen's inequality yields, for any n ≥ 1, p ∈ [0, 1] and x ∈ R^{dN},

sup_{α∈[0,a], θ∈K} ∫ P^n_{α,θ}(x, dy) |y|^{p+1} ≤ ( ρ̄^n |x|² + c/(1 − ρ̄) )^{(p+1)/2} .    (D.4)

We now prove [17, (2.1.9), p. 253]. Let x, z ∈ R^{dN}, α ∈ [0, a] and θ ∈ K. We consider a coupling of the distributions P^n_{α,θ}(x, ·) and P^n_{α,θ}(z, ·) defined as follows: (W̄n, Ȳn)_{n∈N} are i.i.d. random variables with distribution µθ, and we set Wn = W̄n ⊗ Id. The stochastic process (φ^x_n)_{n∈N} defined recursively by φ^x_n = α J⊥ Wn (φ^x_{n−1} + Ȳn), with φ^x_0 = x, is a Markov chain with transition kernel P_{α,θ} started from x. We denote by E_{α,θ} the expectation on the associated canonical space. Let p ∈ [0, 1]. For any g ∈ Lp(R^{dN}), it holds that

|P^n_{α,θ} g(x) − P^n_{α,θ} g(z)| = |E_{α,θ}( g(φ^x_n) − g(φ^z_n) )| ≤ E_{α,θ}( |g(φ^x_n) − g(φ^z_n)| )
≤ [g]_p E_{α,θ}[ |φ^x_n − φ^z_n| (1 + |φ^x_n|^p + |φ^z_n|^p) ]
≤ [g]_p { E_{α,θ}[ |φ^x_n − φ^z_n|² ] E_{α,θ}[ (1 + |φ^x_n|^p + |φ^z_n|^p)² ] }^{1/2} .    (D.5)

By Assumption 2.6 combined with a trivial induction,

E_{α,θ}( |φ^x_n − φ^z_n|² )^{1/2} = α E_{α,θ}( |J⊥ Wn (φ^x_{n−1} − φ^z_{n−1})|² )^{1/2} = α E_{α,θ}( (φ^x_{n−1} − φ^z_{n−1})^T Aθ (φ^x_{n−1} − φ^z_{n−1}) )^{1/2}
≤ a √ρ E_{α,θ}( |φ^x_{n−1} − φ^z_{n−1}|² )^{1/2} ≤ (a √ρ)^n |x − z| ,    (D.6)

where Aθ := ∫ (w ⊗ Id)^T J⊥ (w ⊗ Id) dµθ(y, w).
Combining (D.4) and (D.6) shows that there exists C > 0 such that for any x, z ∈ R^{dN}, g ∈ Lp(R^{dN}) and n ≥ 1,

sup_{α∈[0,a], θ∈K} |P^n_{α,θ} g(x) − P^n_{α,θ} g(z)| ≤ C [g]_p |x − z| (a √ρ)^n (1 + |x|^p + |z|^p) .    (D.7)

This concludes the proof of [17, (2.1.9), p. 253]. Finally, we show that the transition kernels are weak Feller. From (D.1) and the dominated convergence theorem, it is easily checked that for any bounded continuous function f on R^{dN}, x ↦ P_{α,θ} f(x) is continuous. Therefore, all the assumptions of [17, Proposition 2, p. 253] are verified.

In Proposition D.2, we go further by giving an explicit expression of the first two moments of π_{α,θ}.

Proposition D.2. Let Assumptions 2.5 and 2.6 hold. Let θ ∈ R^{dN} and α be such that π_{α,θ} exists.

1. The first-order moment m^{(1)}_θ(α) := ∫ x dπ_{α,θ}(x) of π_{α,θ} is given by

m^{(1)}_θ(α) = (α^{−1} I_{dN} − J⊥ Wθ)^{−1} J⊥ zθ ,    (D.8)

where Wθ and zθ are given by (2.13) and (2.14).

2. Set T(w) := ((J⊥ w) ⊗ Id) ⊗ ((J⊥ w) ⊗ Id). The vector m^{(2)}_θ(α) := vec( ∫ x x^T dπ_{α,θ}(x) ) is given by

m^{(2)}_θ(α) = ( α^{−2} I_{d²N²} − Φθ )^{−1} ζθ(α)    (D.9)

where Φθ := ∫ T(w) dµθ(y, w) and

ζθ(α) := ∫ T(w) vec( y y^T + 2 y m^{(1)}_θ(α)^T ) dµθ(y, w) .

Proof. Since π_{α,θ} = π_{α,θ} P_{α,θ}, we obtain

m^{(1)}_θ(α) = ∫∫ α J⊥ (w ⊗ Id)(y + x) dµθ(y, w) dπ_{α,θ}(x) = α ∫ ((J⊥ w) ⊗ Id)( y + m^{(1)}_θ(α) ) dµθ(y, w) ;

this yields the expression of m^{(1)}_θ(α). The proof of item 2) follows the same lines as above and is omitted.

We finally state a result on the regularity in (α, θ) of some expectations w.r.t. π_{α,θ} and of the solutions to the Poisson equation (D.3).

Proposition D.3. Let Assumptions 2.5, 2.6 and 2.7 hold. Let K ⊂ R^{dN} be a compact set and let ρK ∈ (0, 1) and λµ ∈ (0, 1] be given by Assumption 2.6 and Assumption 2.7 respectively. The following holds for any a ∈ (0, 1/√ρK).

1. For any f ∈ L1(R^{dN}), there exists a constant Cf such that for any α, α' ∈ [0, a] and θ, θ' ∈ K,

| ∫ f(x) ( dπ_{α,θ}(x) − dπ_{α',θ'}(x) ) | ≤ Cf ( |α − α'| + |θ − θ'|^{λµ} ) .

2.
When f is the identity function f(x) = x, then for any α ∈ (0, a], θ ∈ K and x ∈ R^{dN}, one has

f_{α,θ}(x) = (I_{dN} − α J⊥ Wθ)^{−1} ( x − m^{(1)}_θ(α) ) .    (D.10)

In addition, there exists a constant K such that for any α, α' ∈ [0, a] and θ, θ' ∈ K,

|P_{α,θ} f_{α,θ}(x) − P_{α',θ'} f_{α',θ'}(x)| + |f_{α,θ}(x) − f_{α',θ'}(x)| ≤ K ( |α − α'| + |θ − θ'|^{λµ} ) (1 + |x|) .

3. For any function f of the form x^T A x, the Poisson solution f_{α,θ} exists and there exists a constant K such that for any α, α' ∈ [0, a] and θ, θ' ∈ K,

|P_{α,θ} f_{α,θ}(x) − P_{α',θ'} f_{α',θ'}(x)| ≤ K ( |α − α'| + |θ − θ'|^{λµ} ) (1 + |x|²) .

Proof. Let K be a compact subset of R^{dN}. Throughout this proof, for ease of notation, we write ρ instead of ρK. Let a ∈ (0, 1/√ρ) be fixed. Item 1 is a consequence of [17, Theorem 5, p. 259]; its proof therefore consists in verifying the assumptions of this theorem. The conditions [17, Theorem 5(i-ii), p. 259] were established in the proof of Proposition D.1 (see Eqs. (D.4) and (D.7)). Let us prove [17, Theorem 5(iii), p. 259]. Let α, α' ∈ [0, a] and θ, θ' ∈ K. Denote by (φn)n (resp. (φ'_n)n) the chain with transition kernel P_{α,θ} (resp. P_{α',θ'}) and initial value z ∈ R^{dN}. Let (W, Y) (resp. (W', Y')) be two independent pairs of random variables with distribution dµθ (resp. dµ_{θ'}), independent of (φ_{n−1}, φ'_{n−1}). Then, using Assumption 2.5-1), it is easily seen that

E[φn − φ'_n] = α Aθ E[φ_{n−1} − φ'_{n−1}] + (α − α') B_{α,θ}(n − 1) + α' C_{α',θ,θ'}(n − 1)

with Aθ := J⊥ Wθ, B_{α,θ}(k) := J⊥ ( Wθ E[φk] + zθ ) and C_{α',θ,θ'}(k) := J⊥ ( Wθ E[φ'_k] + zθ ) − J⊥ ( W_{θ'} E[φ'_k] + z_{θ'} ). Then, by a trivial induction and upon noting that φ0 − φ'_0 = z − z = 0,

E[φn − φ'_n] = (α − α') Σ_{k=1}^{n−1} α^{k−1} A_θ^{k−1} B_{α,θ}(n − k) + α' Σ_{k=1}^{n−1} α^{k−1} A_θ^{k−1} C_{α',θ,θ'}(n − k) .

Since sup_k |E[φk]| ≤ sup_k ∫ P^k_{α,θ}(z, dy) |y| ≤ C(1 + |z|) by (D.4), Assumption 2.6 implies that there exists C such that for any k ≤ n,

sup_{α∈[0,a], θ∈K} | α^{k−1} A_θ^{k−1} B_{α,θ}(n − k) | ≤ C (a √ρ)^{k−1} (1 + |z|) .
Similarly, by Assumption 2.7, Eqs. (2.15) and (2.17), there exists a constant C' such that for any k ≤ n,

sup_{α'∈[0,a], θ,θ'∈K} |θ − θ'|^{−λµ} | α^{k−1} A_θ^{k−1} C_{α',θ,θ'}(n − k) | ≤ C' (a √ρ)^{k−1} (1 + |z|) .

Since a √ρ < 1, this implies that there exists C̄ such that for any n ≥ 0, α, α' ∈ [0, a] and θ, θ' ∈ K, sup_n |E[φn − φ'_n]| ≤ C̄ ( |α − α'| + |θ − θ'|^{λµ} ) (1 + |z|). Item 1 now follows from [17, Theorem 5, p. 259].

We now prove the expression (D.10). The regularity in (α, θ) will follow from (D.8) and Assumption 2.7; details are omitted (we also omit the proof of Item 3, which follows from similar arguments). As a preliminary, note that for any affine function g : R^{dN} → R^{dN} of the form g(x) = Ax + b for some matrix A and some vector b, one has

P_{α,θ} g(x) = α A J⊥ Wθ x + α A J⊥ zθ + b .    (D.11)

We now prove by induction that for all n ≥ 0, P^n_{α,θ} f(x) − π_{α,θ} f = (α J⊥ Wθ)^n ( x − m^{(1)}_θ(α) ). The statement holds true for n = 0 because π_{α,θ} f = m^{(1)}_θ(α) by definition. Assume that it holds for an arbitrary n. By (D.11), P^{n+1}_{α,θ} f(x) − π_{α,θ} f = (α J⊥ Wθ)^{n+1} x + α (α J⊥ Wθ)^n zθ − (α J⊥ Wθ)^n m^{(1)}_θ(α), and the statement holds for the integer n + 1 by straightforward algebra. Therefore, f_{α,θ}(x) = Σ_n (α J⊥ Wθ)^n ( x − m^{(1)}_θ(α) ).

D.4 Proof of Proposition 2.2

Hereafter, we will largely use the following property: any row-stochastic matrix has bounded entries. Therefore, there exists a constant C such that

P{ ‖Wn‖ ≤ C } = 1 .    (D.12)

Lemma D.1. Under Assumptions B.1-1) and 2.5, there exists C > 0 such that, almost surely, |θ_{n+1} − θn| ≤ C γn ( |Y_{n+1}| + |φn| ).

Proof. Since lim_n γn/γ_{n+1} = 1, there exists a constant C such that |θ_{n+1} − θn| ≤ |1 ⊗ ⟨θ_{n+1}⟩ − 1 ⊗ ⟨θn⟩| + |J⊥ θ_{n+1}| + |J⊥ θn| ≤ C ( |⟨θ_{n+1}⟩ − ⟨θn⟩| + γn |φ_{n+1}| + γn |φn| ). The result follows from Eqs. (2.11), (2.12), (D.12) and sup_n αn < ∞.
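The first-moment formula (D.8) can be sanity-checked numerically: the stationary mean of the linear recursion behind the kernel (D.1) must satisfy m = α J⊥ Wθ (m + ȳ), whose solution is exactly (D.8) with zθ = Wθ ȳ when W and Y are independent. The sketch below uses an illustrative broadcast-gossip mean matrix Wθ = I − (β/N)L on the complete graph and an arbitrary mean vector ȳ; these are assumptions for the demonstration, not data from the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, beta = 6, 0.9, 0.5
A = np.ones((N, N)) - np.eye(N)          # complete graph, illustrative
L_mat = np.diag(A.sum(1)) - A
J_perp = np.eye(N) - np.ones((N, N)) / N

W_bar = np.eye(N) - beta / N * L_mat     # W_theta = E[W_1] for broadcast gossip
y_bar = rng.normal(size=N)               # stand-in for the mean of Y under mu_theta
z = W_bar @ y_bar                        # z_theta = E[W_1 Y_1] with W, Y independent

# closed form (D.8): m = (alpha^{-1} I - J_perp W_theta)^{-1} J_perp z_theta
m_closed = np.linalg.solve(np.eye(N) / alpha - J_perp @ W_bar, J_perp @ z)

# fixed point of the mean recursion m <- alpha * J_perp W_theta (m + y_bar)
m = np.zeros(N)
for _ in range(500):
    m = alpha * J_perp @ W_bar @ (m + y_bar)
assert np.allclose(m, m_closed, atol=1e-10)
```

The geometric convergence of the fixed-point iteration mirrors the contraction a√ρ < 1 used throughout Proposition D.1; here the iteration matrix α J⊥ Wθ has spectral radius α(1 − β) < 1.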
D.4.1 Decomposition of ⟨θ_{n+1}⟩ − ⟨θn⟩

By (2.11), it holds that ⟨θ_{n+1}⟩ = ⟨θn⟩ + γ_{n+1} h(⟨θn⟩) + γ_{n+1} (η_{n+1,1} + η_{n+1,2}) where

η_{n+1,1} = ⟨W_{n+1} (Y_{n+1} + φn)⟩ − ⟨zθn + Wθn φn⟩
η_{n+1,2} = ⟨zθn + Wθn φn⟩ − h(⟨θn⟩) .

We write η_{n+1,2} = un + vn + w_{n+1} + zn where

un = ⟨zθn − z_{Jθn}⟩
vn = ⟨Wθn − W_{Jθn}⟩ φn
w_{n+1} = ⟨W_{Jθn}⟩ ( φn − m^{(1)}_{θn}(α_{n+1}) )
zn = ⟨W_{Jθn}⟩ ( m^{(1)}_{θn}(α_{n+1}) − m^{(1)}_{Jθn}(1) )

and m^{(1)}_θ(a) is defined in Proposition D.2. We finally introduce a decomposition of wn. For any compact K, let ρK ∈ (0, 1) be given by Assumption 2.6 and let a ∈ (1, 1/√ρK). Under Assumption B.1, the sequence (αn)n given by (2.9) converges to one; hence, there exists a (deterministic) integer n0 (depending on K) such that αn ∈ (0, a) for all n ≥ n0. The identity function is in L0(R^{dN}) and, by Proposition D.3, there exists a solution f_{α,θ} to the Poisson equation (D.3) with f equal to the identity function, for any α ∈ (0, a) and θ ∈ K; by (D.10), f_{α,θ}(x) = (I_{dN} − α J⊥ Wθ)^{−1} ( x − m^{(1)}_θ(α) ). To make the notation easier, we set below fn := f_{α_{n+1},θn} and Pn := P_{α_{n+1},θn}. By Proposition D.1-3), there exists a constant C > 0 such that a.s.

sup_{n≥n0} |fn(x)| I_{EK} ≤ C (1 + |x|) .    (D.13)

Letting x = φn in the Poisson equation (D.3), we obtain φn − m^{(1)}_{θn}(α_{n+1}) = fn(φn) − Pn fn(φn). We set w_{n+1} = e_{n+1} + c_{n+1} + s_{n+1} + tn where

e_{n+1} = ⟨W_{Jθn}⟩ ( fn(φ_{n+1}) − Pn fn(φn) )
c_{n+1} = ⟨W_{Jθn}⟩ f_{n−1}(φn) − ⟨W_{Jθ_{n+1}}⟩ fn(φ_{n+1})
s_{n+1} = ⟨W_{Jθ_{n+1}} − W_{Jθn}⟩ fn(φ_{n+1})
tn = ⟨W_{Jθn}⟩ ( fn(φn) − f_{n−1}(φn) ) .

As a conclusion, we have η_{n+1,2} = un + vn + zn + e_{n+1} + c_{n+1} + s_{n+1} + tn.

D.4.2 Proof of Proposition 2.2

Define EK = {∀j ∈ N, θj ∈ K} and E_{n,K} = ∩_{j≤n} {θj ∈ K} for some compact set K. We show that Σ_n γn η_{n,i} < ∞ a.s. for both i = 1, 2. The proposition will then follow from [9]. By Assumption 2.4, it is enough to show that, for any fixed compact set K, Σ_{k≥1} γk η_{k,i} I_{EK} is finite a.s. Hereafter, K is fixed and n0 is defined as in Section D.4.1.
We first study η_{n,1}. Note that for any ω, the sequence I_{E_{n,K}}(ω) is identically equal to I_{EK}(ω) for all large n. As a consequence, Σ_n γn η_{n,1} (I_{EK} − I_{E_{n−1,K}}) is finite a.s., and it is therefore sufficient to prove that Σ_n γn η_{n,1} I_{E_{n−1,K}} is finite a.s. Since η_{n,1} I_{E_{n−1,K}} is a martingale difference noise, the sought result will be obtained provided

Σ_n γn^{1+λ} E[ |η_{n,1}|^{1+λ} I_{E_{n−1,K}} ] < ∞

where λ > 0 (see e.g. [75, Theorem 2.18]); we choose λ ∈ (0, 1) given by Assumption B.1. After some algebra,

sup_n E[ |η_{n,1}|² I_{E_{n−1,K}} ] ≤ 2 sup_n E[ |⟨Wn (Yn + φ_{n−1})⟩|² I_{E_{n−1,K}} ] ≤ C sup_n E[ ( |Yn|² + |φ_{n−1}|² ) I_{E_{n−1,K}} ]

for some constant C, where we used (D.12). Assumption 2.5-2) directly leads to sup_n E[ |Yn|² I_{E_{n−1,K}} ] < ∞, whereas by Lemma 2.3, sup_n E[ |φ_{n−1}|² I_{E_{n−1,K}} ] < ∞. Hence, Σ_n γn^{1+λ} E[ |η_{n,1}|^{1+λ} I_{E_{n−1,K}} ] ≤ C' Σ_n γn^{1+λ} for some C' > 0, and the upper bound is finite by Assumption B.1. This concludes the first step.

We now study η_{n,2} for any n ≥ n0. By (2.16), there exists C such that |un| I_{EK} ≤ C |J⊥ θ_{n−1}|^{λµ} I_{EK} ≤ C γn^{λµ} |φ_{n−1}|^{λµ} I_{E_{n−2,K}}. Therefore,

Σ_n E( I_{EK} γn |un| ) ≤ C Σ_n γn^{1+λµ} sup_n E( |φ_{n−1}| I_{E_{n−2,K}} ) ,

which is finite by Assumption B.1 and Lemma 2.3. Thus Σ_n γn |un| I_{EK} is a.s. finite. The term vn can be analyzed similarly: by (2.15) applied with K ← K ∪ {Jθ, θ ∈ K}, there exists a constant C such that

|vn| I_{EK} ≤ C |J⊥ θn|^{λµ} |φn| I_{E_{n−1,K}} ≤ C γ_{n+1}^{λµ} |φn|^{1+λµ} I_{E_{n−1,K}} ,

and the fact that Σ_n γn |vn| I_{EK} is finite a.s. follows from the same arguments as above.

We now study zn. By (D.12), |zn| ≤ C | m^{(1)}_{θn}(α_{n+1}) − m^{(1)}_{Jθn}(1) |. By Proposition D.3-1), since α_{n+1} < a < 1/√ρK, there exists a constant C' such that

Σ_n γn E( |zn| I_{EK} ) ≤ C' ( Σ_n |γn − γ_{n+1}| + Σ_n γn^{1+λµ} sup_k E( |φk|^{λµ} I_{E_{k−1,K}} ) ) .

The RHS is finite by Lemma 2.3 and Assumption B.1. Hence, Σ_n γn |zn| I_{EK} is finite a.s. The sequence (en)n is a martingale-increment sequence: as above for the term η_{n,1}, Σ_n γn en I_{EK} is finite a.s. provided sup_n E( |e_{n+1}|^{1+λ} I_{E_{n,K}} ) < ∞.
This holds true by (D.12), (D.13) and Lemma 2.3.

Let us now investigate $c_{n+1}$. We write
$\sum_{k=1}^{n} \gamma_{k+1} c_{k+1} = \sum_{k=2}^{n} (\gamma_{k+1} - \gamma_k)\langle W_{J\theta_k}\rangle f_{k-1}(\phi_k) - \gamma_{n+1}\langle W_{J\theta_{n+1}}\rangle f_n(\phi_{n+1}) + \gamma_2 \langle W_{J\theta_1}\rangle f_0(\phi_1)\,.$
Using again (D.12), (D.13) and Lemma 2.3, there exists $C > 0$ such that
$\sum_{k=1}^{n} \gamma_{k+1}\, E(|c_{k+1}| I_{E_K}) \le C \big(\sum_{k\ge 1} |\gamma_{k+1} - \gamma_k| + \gamma_n + 1\big)\,.$
The RHS is finite by Assumption B.1, thus implying that $\sum_n \gamma_n c_n I_{E_K}$ is finite a.s.

Consider the term $s_{n+1}$. Following similar arguments and using (D.13) again, we obtain
$\sum_{k\le n} \gamma_k |s_k| I_{E_K} \le C \sum_{k\le n} \gamma_k \|\langle W_{J\theta_k} - W_{J\theta_{k-1}}\rangle\| (1 + |\phi_k|) I_{E_K}$
for some constant $C$ which depends only on $K$. By condition (2.15) and Lemma D.1, one has
$\|\langle W_{J\theta_k} - W_{J\theta_{k-1}}\rangle\| I_{E_K} \le C_K \gamma_k^{\lambda_\mu} (|Y_k|^{\lambda_\mu} + |\phi_{k-1}|^{\lambda_\mu}) I_{E_K}\,.$
By the Cauchy-Schwarz inequality, Assumption 2.5 and Lemma 2.3, it can be proved that
$\sup_k E[(|Y_k| + |\phi_{k-1}|)(1 + |\phi_k|) I_{E_K}] < \infty\,.$ (D.14)
Therefore, by Assumption B.1, $E(\sum_k \gamma_k |s_k| I_{E_K})$ is finite, thus implying that $\sum_{k\ge 1} \gamma_k s_k I_{E_K}$ exists a.s.

Finally consider the term $t_n$. By (D.12) and Proposition D.3-2), there exists a constant $C$ such that for any $n \ge n_0$,
$|t_n| I_{E_K} \le C (|\alpha_n - \alpha_{n-1}| + |\theta_n - \theta_{n-1}|^{\lambda_\mu})(1 + |\phi_n|)\,.$
By Lemma D.1, (D.14) and Assumption B.1, it can be shown that $\sum_n \gamma_n E(|t_n| I_{E_K}) < \infty$, which proves that $\sum_n \gamma_n t_n I_{E_K}$ converges a.s.

D.5 Proof of Theorem 2.3

The core of the proof consists in checking the conditions of [67, Theorem 2.1]. To make the notations easier, we write the proofs in the case $d = 1$ and under the assumption that $\lim_n \theta_n = \theta_\star 1$ almost surely. Throughout the proof, we will write that a sequence of r.v. $(Z_n)_n$ is $O_{\mathrm{w.p.1}}(1)$ iff $\sup_n |Z_n| < \infty$ almost surely, and that $(Z_n)_n$ is $O_{L^1}(1)$ iff $\sup_n E[|Z_n|] < \infty$. Fix $\delta > 0$. For any positive integer $m$, set
$A_m := \cap_{j\ge m} \{|\theta_j - \theta_\star 1| \le \delta\}\,.$
From Section D.4.1, it holds
$\langle\theta_{n+1}\rangle = \langle\theta_n\rangle + \gamma_{n+1} h(\langle\theta_n\rangle) + \gamma_{n+1} E_{n+1} + \gamma_{n+1} R_{n+1}$
where
$E_{n+1} := \langle W_{n+1}(Y_{n+1} + \phi_n)\rangle - \langle z_{\theta_n}\rangle - \langle W_{\theta_n}\rangle\phi_n + \langle W_{J\theta_n}\rangle (f_n(\phi_{n+1}) - P_n f_n(\phi_n))$
$R_{n+1} := u_n + v_n + z_n + c_{n+1} + s_{n+1} + t_n\,.$
Note that $E[E_{n+1} \mid \mathcal{F}_n] = 0$, i.e. $(E_n)_n$ is an $\mathcal{F}_n$-adapted martingale increment. From the expression of $f_n = f_{\alpha_{n+1},\theta_n}$ (see (D.10)), we have
$f_{\alpha,\theta}(y) - P_{\alpha,\theta} f_{\alpha,\theta}(x) = B_{\alpha,\theta}\,(y - \alpha J_\perp W_\theta x - \alpha J_\perp z_\theta) \quad \text{with} \quad B_{\alpha,\theta} := (I_{dN} - \alpha J_\perp W_\theta)^{-1}\,.$ (D.15)
Hence,
$E_{n+1} = \langle W_{n+1}(Y_{n+1} + \phi_n)\rangle - \langle z_{\theta_n}\rangle - \langle W_{\theta_n}\rangle\phi_n + \langle W_{J\theta_n}\rangle B_{\alpha_{n+1},\theta_n}\big(\phi_{n+1} - \alpha_{n+1} J_\perp (W_{\theta_n}\phi_n + z_{\theta_n})\big)\,.$ (D.16)

D.5.1 A preliminary result

The following lemma extends Lemma 2.3.

Lemma D.2. Let Assumptions B.1-1), 2.5, 2.10 and 2.11 hold. Let $(\phi_n)_{n\ge 0}$ be the sequence given by (2.9) and $\tau$ be given by Assumption 2.10. For any compact set $K \subset \mathbb{R}^{dN}$,
$\sup_n E\big[|\phi_n|^{2+\tau}\, I_{\cap_{j\le n-1}\{\theta_j\in K\}}\big] < \infty\,.$
Let $\tilde\rho_K$ be given by Assumption 2.11. For any $a \in (0, 1/\sqrt{\tilde\rho_K})$,
$\sup_{\alpha\in[0,a],\,\theta\in K} \int |x|^{2+\tau}\, d\pi_{\alpha,\theta}(x) < \infty\,.$

Proof. From (2.12) and (D.12), there exists a constant $C$ such that, on the set $\cap_{j\le n-1}\{\theta_j \in K\}$,
$|\phi_n|^{2+\tau} \le \alpha_n^{2+\tau}\, (\phi_{n-1}^T W_n^T J_\perp W_n \phi_{n-1})^{1+\tau/2} + C|Y_n|^{2+\tau} + C|Y_n|^{1+\tau/2}|\phi_{n-1}|^{1+\tau/2}\,.$
Since $\lim_n \alpha_n = 1$, for any $\bar\rho_K \in (\tilde\rho_K, 1)$ there exists a deterministic $n_\star$ such that $\sup_{n\ge n_\star} \alpha_n^{2+\tau}\tilde\rho_K \le \bar\rho_K$. Using Assumptions 2.10 and 2.11, there exists a constant $C'$ such that for any $n$ large enough,
$E\big[|\phi_n|^{2+\tau} \mid \mathcal{F}_{n-1}\big] I_{\theta_{n-1}\in K} \le \bar\rho_K |\phi_{n-1}|^{2+\tau} + C'(1 + |\phi_{n-1}|^{1+\tau/2})\,.$
A trivial induction (see the proof of Lemma 2.3 for similar computations) yields the first result. For the second statement, we mimic the previous computations in order to prove that there exist $\bar\rho_K \in (0,1)$ and $C$ such that for any $\alpha \in [0,a]$ and $\theta \in K$,
$\int P_{\alpha,\theta}(x, dy)\,|y|^{2+\tau} \le \bar\rho_K |x|^{2+\tau} + C(1 + |x|^{1+\tau/2})\,.$
We conclude by [109, Theorem 14.3.7.] and Proposition D.1-1).

D.5.2 Checking condition C2 of [67, Theorem 2.1]

Let $m \ge 1$.
From Assumption 2.10, (D.12) and Lemma D.2, it is easily seen by using the expression (D.16) that $\sup_n E\big[|E_{n+1}|^{2+\tau}\, 1_{\cap_{m\le j\le n}\{|\theta_j-\theta_\star 1|\le\delta\}}\big] < \infty$, where $\tau$ is given by Assumption 2.10.

In order to derive the asymptotic covariance, we go further into the expression of the conditional covariance $E[E_{n+1}^2 \mid \mathcal{F}_n]$. We write $E[E_{n+1}^2 \mid \mathcal{F}_n] = \Xi(\alpha_{n+1}, \theta_n, \phi_n)$ where
$\Xi(\alpha,\theta,x) := \int (\xi_{\alpha,\theta,x}(y,w))^2\, d\mu_\theta(y,w)$ (D.17)
$\xi_{\alpha,\theta,x}(y,w) := A_{\alpha,\theta}\big((w - W_\theta)x + (wy - z_\theta)\big)$
$A_{\alpha,\theta} := \frac{1^T}{N}\big(I_{dN} + \alpha\, W_{J\theta} (I_{dN} - \alpha J_\perp W_\theta)^{-1} J_\perp\big)\,.$
Set $\pi_\star := \pi_{1,\theta_\star 1}$ and $\pi_n := \pi_{\alpha_{n+1},\theta_n}$, where $\pi_{\alpha,\theta}$ is defined by Proposition D.1. We write
$\Xi(\alpha_{n+1},\theta_n,\phi_n) = \big(\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n)\big) + \big(\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)\big) + \big(\Xi(1,\theta_n,\phi_n) - \int \Xi(1,\theta_n,x)\,d\pi_n(x)\big) + \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)\,.$
For any $m \ge 1$, we have on the set $A_m$:
$\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n) \to 0$ a.s.
$\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x) \to 0$ a.s.
$\gamma_n\, E\Big[\big|\sum_{k=1}^{n}\big(\Xi(1,\theta_k,\phi_k) - \int \Xi(1,\theta_k,x)\,d\pi_k(x)\big)\big|\, 1_{A_m}\Big] \to 0\,.$
The detailed computations are given in Section D.5.5. This implies that the key quantity involved in the asymptotic covariance matrix is $\int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)$.

D.5.3 Expression of $U_\star$

Set
$U_\star := \int \Xi(1, 1\otimes\theta_\star, x)\, d\pi_{1,1\otimes\theta_\star}(x)\,.$
Lemma D.3 gives an explicit expression of $U_\star$ as a function of the quantities introduced in Section 2.5.2.

Lemma D.3. Under the assumptions of Theorem 2.3,
$\mathrm{vec}\, U_\star = (A_\star \otimes A_\star)(R_\star m_\star^{(2)} + 2 T_\star m_\star^{(1)} + S_\star)\,.$

Proof. For simplicity, we use the notations $R_\theta(w) := w - W_\theta$, $\upsilon_\theta(y,w) := wy - z_\theta$ and $\tilde T_{\theta,x}(y,w) := (R_\theta(w)x + \upsilon_\theta(y,w))(R_\theta(w)x + \upsilon_\theta(y,w))^T$. Note that $\tilde T_{\theta,x}(y,w)$ coincides with
$R_\theta(w)xx^T R_\theta(w)^T + 2 R_\theta(w)x\, \upsilon_\theta(y,w)^T + \upsilon_\theta(y,w)\upsilon_\theta(y,w)^T\,.$
From (D.17), $\xi_{\alpha,\theta,x}(y,w) = A_{\alpha,\theta}(R_\theta(w)x + \upsilon_\theta(y,w))$ so that
$\mathrm{vec}\, \Xi(\alpha,\theta,x) = (A_{\alpha,\theta} \otimes A_{\alpha,\theta}) \int \mathrm{vec}\, \tilde T_{\theta,x}(y,w)\, d\mu_\theta(y,w)\,.$
Applying the vec operator on $\tilde T_{\theta,x}(y,w)$ yields
$(R_\theta(w) \otimes R_\theta(w))\,\mathrm{vec}(xx^T) + 2(\upsilon_\theta(y,w) \otimes R_\theta(w))x + \mathrm{vec}(\upsilon_\theta(y,w)\upsilon_\theta(y,w)^T)\,.$
Therefore, when applied with $\alpha = 1$ and $\theta = \theta_\star 1$, it holds
$\mathrm{vec}\, \Xi(1,\theta_\star 1,x) = (A_\star \otimes A_\star)(R_\star\, \mathrm{vec}(xx^T) + 2 T_\star x + S_\star)\,.$
This yields the result by integrating $x$ w.r.t. $\pi_\star$.

D.5.4 Checking condition C3 of [67, Theorem 2.1]

We first prove that for any $m \ge 1$,
$|u_n + v_n + z_n + s_{n+1} + t_n|\, 1_{A_m} \le \sqrt{\gamma_n}\, o(1)\, O_{L^1}(1)\,.$ (D.18)
Let $m \ge 1$. By (2.9), (D.12) and Proposition D.3-1), there exists a constant $C_1$ such that almost surely, on the set $A_m$,
$|z_n| \le C_1\big(|\alpha_{n+1} - 1| + |J_\perp\theta_n|^{\lambda_\mu}\big) \le C_1\big(|\alpha_{n+1} - 1| + \gamma_{n+1}^{\lambda_\mu}(1 + |\phi_n|^{\lambda_\mu})\big)\,.$
Assumption 2.13, Lemma 2.3 and $\lambda_\mu > 1/2$ imply that $|z_n| 1_{A_m} = \sqrt{\gamma_n}\, o(1)\, O_{L^1}(1)$. By Assumption 2.7, Proposition D.1-3) and Lemma D.1, there exist a constant $C_2 > 0$ and $n_0$ such that almost surely, for all $n \ge n_0$,
$|s_{n+1}| 1_{A_m} \le C_2 \gamma_n^{\lambda_\mu}\big(|Y_{n+1}|^{\lambda_\mu} + |\phi_n|^{\lambda_\mu}\big)(1 + |\phi_{n+1}|)\, 1_{A_m}\,.$
Assumption 2.5, Lemma 2.3 and the condition $\lambda_\mu > 1/2$ imply that $|s_{n+1}| 1_{A_m} = \sqrt{\gamma_n}\, O_{L^1}(1)$. By (D.12), Proposition D.3-2) and Lemma D.1, there exist a constant $C_3 > 0$ and $n_0$ such that almost surely, for any $n \ge n_0$,
$|t_n| 1_{A_m} \le C_3\big(|\alpha_{n+1} - \alpha_n| + \gamma_n^{\lambda_\mu}(|Y_n|^{\lambda_\mu} + |\phi_n|^{\lambda_\mu})\big)\, 1_{A_m}\,.$
Lemma 2.3, Assumption 2.13 and $\lambda_\mu > 1/2$ imply that $|t_n| 1_{A_m} = \sqrt{\gamma_n}\, o(1)\, O_{L^1}(1)$. By Assumption 2.7, there exists a constant $C_4 > 0$ such that almost surely, $|u_n| 1_{A_m} \le C_4 \gamma_n^{\lambda_\mu} |\phi_n|^{\lambda_\mu} 1_{A_m}$. Lemma 2.3 and the property $\lambda_\mu > 1/2$ imply $u_n = o(\sqrt{\gamma_n})\, O_{L^1}(1)$. Finally, by Assumption 2.7, there exists a constant $C$ such that almost surely, $|v_n| 1_{A_m} \le C \gamma_{n+1}^{\lambda_\mu} |\phi_n|^{1+\lambda_\mu} 1_{A_m}$, so that by Lemma 2.3 again and the condition $\lambda_\mu > 1/2$, $v_n = o(\sqrt{\gamma_n})\, O_{L^1}(1)$. The above discussion concludes the proof of (D.18).

The second step is to prove that for any $m \ge 1$, $\sqrt{\gamma_n} \sum_{k=1}^{n} c_k\, 1_{A_m} = o(1)\, O_{\mathrm{w.p.1}}(1)\, O_{L^1}(1)$. By (D.12) and (D.13), there exists a constant $C > 0$ such that almost surely,
$\big|\sum_{k=1}^{n} c_k\big|\, 1_{A_m} \le C(1 + |\phi_0| + |\phi_n|)\, 1_{A_m}\,.$
Lemma 2.3 implies that $\sum_{k=1}^{n} c_k = O_{L^1}(1)$.
This concludes the proof of the condition C3 in [67].

D.5.5 Detailed computations for verifying the condition C2

We start with a preliminary lemma whose proof is omitted since it follows from standard computations.

Lemma D.4. Let Assumptions 2.5, 2.11 and 2.12-1) hold. Let $\delta > 0$ and set $K := \{\theta : |\theta - \theta_\star 1| \le \delta\}$. Fix $a \in (0, 1/\sqrt{\tilde\rho_K})$ where $\tilde\rho_K$ is given by Assumption 2.11. There exists a constant $C$ such that for any $\theta, \theta' \in K$, $\alpha, \alpha' \in [0,a]$, $x, z, y \in \mathbb{R}^{dN}$ and $w \in \mathcal{M}_1$,
$|\xi_{\alpha,\theta,x}(y,w)| \le C(1 + |y| + |x|)\,,$
$\|A_{\alpha,\theta} - A_{\alpha',\theta'}\| \le C\big(|\alpha - \alpha'| + |\theta - \theta'|^{\lambda_\mu}\big)\,,$
$|\xi_{\alpha,\theta,x}(y,w) - \xi_{\alpha',\theta',x}(y,w)| \le C\big(|\alpha - \alpha'| + |\theta - \theta'|^{\lambda_\mu}\big)(1 + |x| + |y|)\,,$
$|\xi_{\alpha,\theta,x}(y,w) - \xi_{\alpha,\theta,z}(y,w)| \le C|x - z|\,,$
where $\lambda_\mu$ is given by Assumptions 2.5 and 2.12-1).

1) First term: $\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n)$

It is sufficient to prove that this term converges almost surely to zero on the event $A_m$ for any $m \ge 1$, which is implied by the almost-sure convergence to zero on the event $\{\theta_n \in K\}$ with $K := \{\theta : |\theta - \theta_\star 1| \le \delta\}$. Below, $C_m$ is a constant whose value may change upon each appearance. By using the inequality $|a^2 - b^2| \le |a - b|(|a| + |b|)$, Assumption 2.10 and Lemma D.4, there exists a constant $C_m$ such that for any $\alpha$ close enough to 1 and $\theta \in K$,
$|\Xi(\alpha,\theta,x) - \Xi(1,\theta,x)| \le C_m (1 + |x|^2)\, |\alpha - 1|\,.$
By Lemma D.2, for any $\varepsilon > 0$, there exists $C_m$ such that
$P\Big(\sup_{n\ge\ell} (1 + |\phi_n|)^2 |\alpha_{n+1} - 1|\, 1_{\theta_n\in K} \ge \varepsilon\Big) \le C_m \sum_{n\ge\ell} |\alpha_{n+1} - 1|^{1+\tau/2}\,.$
The RHS converges to zero as $\ell \to \infty$ by Assumption 2.13. This implies that almost surely, $\lim_n |\Xi(\alpha_{n+1},\theta_n,\phi_n) - \Xi(1,\theta_n,\phi_n)|\, 1_{\theta_n\in K} = 0$.

2) Second term: $\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)$

We apply the following lemma (see [68, Proposition 4.3.]).

Lemma D.5. Let $\mu, \{\mu_n, n \ge 0\}$ be probability distributions on $\mathbb{R}^{dN}$ endowed with its Borel $\sigma$-field. Let $\{h_n, n \ge 1\}$ be an equicontinuous family of functions from $\mathbb{R}^{dN}$ to $\mathbb{R}$. Assume
1. the sequence $\{\mu_n, n \ge 0\}$ weakly converges to $\mu$;
2. for any $x \in \mathbb{R}^{dN}$, $\lim_n h_n(x)$ exists, and there exists $a > 1$ such that $\sup_n \int |h_n|^a\, d\mu_n + \int |\lim_n h_n|\, d\mu < \infty$.
Then $\lim_n \int h_n\, d\mu_n = \int \lim_n h_n\, d\mu$.

a) Almost-sure weak convergence

In our case $\mu_n \leftarrow \pi_n$ and $\mu \leftarrow \pi_\star$, and $\mu_n$ is a random probability. Since the set of bounded Lipschitz functions is convergence determining (see e.g. [60, Theorem 11.3.3.]), we prove that for any bounded Lipschitz function $h$, $\lim_n \int h\, d\pi_n = \int h\, d\pi_\star$ almost surely, with an almost-sure set which has to be uniform over the set of bounded Lipschitz functions. Following the same lines as in the proof of [68, Proposition 5.2.], this convergence occurs almost surely if and only if for any bounded Lipschitz function $h$, there exists a full set such that on this set, $\lim_n \int h\, d\pi_n = \int h\, d\pi_\star$. Let $h$ be a bounded Lipschitz function. Then $h \in L_0(\mathbb{R}^{dN})$. By Proposition D.3-1), there exists a constant $C_f$ such that for any $n$ large enough, on the set $\{\theta_n \in K\}$,
$\Big|\int h\, d\pi_n - \int h\, d\pi_\star\Big| \le C_f\big(|\alpha_{n+1} - 1| + |\theta_{n+1} - \theta_\star 1|^{\lambda_\mu}\big)\,.$
Since $\lim_n \theta_n = \theta_\star 1$ almost surely and $\lim_n \alpha_n = 1$, we have $\lim_n \int h\, d\pi_n = \int h\, d\pi_\star$ almost surely. This concludes the proof of the almost-sure weak convergence.

b) Equicontinuity of the family of functions

We prove that the family of functions $\{x \mapsto \Xi(1,\theta,x);\ \theta \in K\}$ is equicontinuous. Using again the inequality $|a^2 - b^2| \le |a - b|(|a| + |b|)$, Lemma D.4 and Assumption 2.10, there exists a constant $C_m$ such that for any $\theta \in K$ and $x, z \in \mathbb{R}^{dN}$,
$|\Xi(1,\theta,x) - \Xi(1,\theta,z)| \le C_m (1 + |x| + |z|)\,|x - z|\,.$

c) Almost-sure limit of $\Xi(1,\theta_n,x)$ when $n \to \infty$

Let $x$ be fixed. We write
$|\Xi(1,\theta,x) - \Xi(1,\theta',x)| \le \int \big|\xi^2_{1,\theta,x}(y,w) - \xi^2_{1,\theta',x}(y,w)\big|\, d\mu_{\theta'}(y,w) + \Big|\int \xi^2_{1,\theta,x}(y,w)\, d\mu_\theta(y,w) - \int \xi^2_{1,\theta,x}(y,w)\, d\mu_{\theta'}(y,w)\Big|\,.$
Let us consider the first term. Using again $|a^2 - b^2| \le |a - b|(|a| + |b|)$ and Lemma D.4, there exists a constant $C_m$ such that the first term is upper bounded by $C_m (1 + |x|^2)\,|\theta - \theta'|^{\lambda_\mu}$ for any $\theta, \theta' \in K$.
For the second term, we use Assumption 2.12-2) and obtain the same upper bound. Then, there exists a constant $C_m$ such that for any $\theta, \theta' \in K$,
$|\Xi(1,\theta,x) - \Xi(1,\theta',x)| \le C_m (1 + |x|^2)\, |\theta - \theta'|^{\lambda_\mu}\,.$ (D.19)
Since $\lim_n \theta_n = \theta_\star 1$ almost surely, the above discussion implies that for any fixed $x$, $\lim_n \Xi(1,\theta_n,x) = \Xi(1,\theta_\star 1,x)$ almost surely on $A_m$.

d) Moment conditions

It is easily seen (using again Lemma D.4) that there exists a constant $C_m$ such that for any $\theta \in K$, $|\Xi(1,\theta,x)| \le C_m(1 + |x|^2)$. Therefore, Lemma D.2 implies that $\int |\Xi(1,\theta_\star 1,x)|\, d\pi_\star(x) < \infty$. In addition, for any $\theta \in K$, $\alpha$ in a neighborhood of 1 and $a > 1$,
$\int |\Xi(1,\theta,x)|^a\, \pi_{\alpha,\theta}(dx) \le C_m \int (1 + |x|^{2a})\, \pi_{\alpha,\theta}(dx)\,.$
Lemma D.2 implies that there exists $a > 1$ such that
$\sup_n 1_{\theta_n\in K} \int |\Xi(1,\theta_n,x)|^a\, \pi_{\alpha_{n+1},\theta_n}(dx) < \infty\,.$

e) Conclusion

We can now apply Lemma D.5; we have almost surely,
$\lim_n \Big(\int \Xi(1,\theta_n,x)\,d\pi_n(x) - \int \Xi(1,\theta_\star 1,x)\,d\pi_\star(x)\Big)\, 1_{A_m} = 0\,.$

3) Third term: $\Xi(1,\theta_n,\phi_n) - \int \Xi(1,\theta_n,x)\,d\pi_n(x)$

We prove that for any $m \ge 1$,
$\lim_n \gamma_n\, E\Big[\big|\sum_{k=1}^{n}\big(\Xi(1,\theta_k,\phi_k) - \int \Xi(1,\theta_k,x)\,d\pi_k(x)\big)\big|\, 1_{A_m}\Big] = 0\,.$
We write $\sum_{k=1}^{n}\big(\Xi(1,\theta_k,\phi_k) - \int \Xi(1,\theta_k,x)\,d\pi_k(x)\big) = \sum_{i=1}^{3} T_n^{(i)}$ with
$T_n^{(1)} = \sum_{k=1}^{n} \{\Xi(1,\theta_k,\phi_k) - \Xi(1,\theta_{k-1},\phi_k)\}$
$T_n^{(2)} = \sum_{k=1}^{n} \Big(\Xi(1,\theta_{k-1},\phi_k) - \int \Xi(1,\theta_{k-1},x)\,d\pi_{k-1}(x)\Big)$
$T_n^{(3)} = \int \Xi(1,\theta_0,x)\,d\pi_0(x) - \int \Xi(1,\theta_n,x)\,d\pi_n(x)\,.$

a) Term $T_n^{(1)}$

By (D.19), there exists a constant $C_m$ such that for any $k \ge m+1$, on the set $A_m$,
$|\Xi(1,\theta_k,\phi_k) - \Xi(1,\theta_{k-1},\phi_k)| \le C_m |\theta_k - \theta_{k-1}|^{\lambda_\mu} (1 + |\phi_k|^2)\,.$
Hence, by Lemma D.1, on the set $A_m$,
$|\Xi(1,\theta_k,\phi_k) - \Xi(1,\theta_{k-1},\phi_k)| \le C_m \gamma_k^{\lambda_\mu} (1 + |\phi_k|^2)(|Y_k|^{\lambda_\mu} + |\phi_{k-1}|^{\lambda_\mu})\,.$
By Assumption 2.10, Lemma D.2 and Assumption 2.13, the sum
$\sum_{k\ge 1} \gamma_k^{1+\lambda_\mu}\, E\big[(1 + |\phi_k|^2)(|Y_k|^{\lambda_\mu} + |\phi_{k-1}|^{\lambda_\mu})\, 1_{A_m}\big]$
is finite, which implies $\lim_n \gamma_n E[|T_n^{(1)}| 1_{A_m}] = 0$ by the Kronecker Lemma.
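The Kronecker Lemma step used here is generic: if $\sum_k a_k$ converges and $b_n \uparrow \infty$, then $b_n^{-1}\sum_{k\le n} b_k a_k \to 0$ (above, $a_k$ plays the role of the summable $\gamma_k^{1+\lambda_\mu}$-weighted terms and $b_n = 1/\gamma_n$). A small numerical sketch with illustrative choices $a_k = k^{-1.2}$ and $b_n = n$ (not the thesis's sequences):

```python
# Kronecker Lemma illustration: sum(a_k) converges, b_n = n -> infinity,
# so the weighted average (1/b_n) * sum_{k<=n} b_k * a_k tends to zero.
def kronecker_average(n):
    # a_k = k^(-1.2) is summable; b_k = k
    return sum(k * k ** (-1.2) for k in range(1, n + 1)) / n

vals = [kronecker_average(n) for n in (10, 100, 1000, 10000)]
# The weighted averages decrease toward zero as n grows.
assert all(x > y for x, y in zip(vals, vals[1:]))
```

Here the decay is slow ($\sim n^{-0.2}$), which mirrors why the proof only needs convergence to zero, not a rate.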
b) Term $T_n^{(2)}$

From the expression of $\xi$ (see (D.17)), we have
$\Xi(1,\theta,\phi) - \Xi(1,\theta,x) = \phi^T C_\theta \phi - x^T C_\theta x + (\phi - x)^T D_\theta$
with
$C_\theta := \int (w - W_\theta)^T A_{1,\theta}^T A_{1,\theta} (w - W_\theta)\, d\mu_\theta(y,w)$
$D_\theta := 2 \int (w - W_\theta)^T A_{1,\theta}^T A_{1,\theta} (wy - z_\theta)\, d\mu_\theta(y,w)\,.$
We detail the proof of the statement
$\lim_n \gamma_n\, E\Big[\big|\sum_{k=1}^{n} \big(\phi_k - \int x\, d\pi_{\alpha_k,\theta_{k-1}}(x)\big)^T D_{\theta_{k-1}}\big|\, 1_{A_m}\Big] = 0\,.$
The second statement, with the quadratic dependence on $\phi_k$, is similar and omitted (its proof uses Proposition D.3-3) and the condition $\lim_n \gamma_n n^{1/(1+\tau/2)} = 0$). Using again the Poisson solution $f_n := f_{\alpha_{n+1},\theta_n}$ associated to the identity function and the kernel $P_n := P_{\alpha_{n+1},\theta_n}$, it holds by (D.15)
$\big(\phi_k - \int x\, d\pi_{k-1}(x)\big)^T D_{\theta_{k-1}} = (f_{k-1}(\phi_k) - P_{k-1}f_{k-1}(\phi_{k-1}))^T D_{\theta_{k-1}}$ (D.20)
$\quad + \big(P_{k-1}f_{k-1}^T(\phi_{k-1}) D_{\theta_{k-1}} - P_k f_k^T(\phi_k) D_{\theta_k}\big)$ (D.21)
$\quad + \big(P_k f_k^T(\phi_k) - P_{k-1}f_{k-1}^T(\phi_k)\big) D_{\theta_k}$ (D.22)
$\quad + P_{k-1}f_{k-1}^T(\phi_k)\big(D_{\theta_k} - D_{\theta_{k-1}}\big)\,.$ (D.23)
From Assumption 2.12-2) and Lemma D.4, there exists a constant $C_m$ such that for any $k$,
$|D_{\theta_k}|\, 1_{A_m} \le C_m$ (D.24)
$|D_{\theta_k} - D_{\theta_{k-1}}|\, 1_{A_m} \le C_m |\theta_k - \theta_{k-1}|^{\lambda_\mu}\,.$ (D.25)
Let us control the first term (D.20). Upon noting that it is a martingale increment, the Burkholder inequality (see e.g. [75, Theorem 2.10]) applied with $p \leftarrow 2+\tau$ and Lemma D.2 imply
$E\Big[\big|\sum_{k=1}^{n} (f_{k-1}(\phi_k) - P_{k-1}f_{k-1}(\phi_{k-1}))^T D_{\theta_{k-1}}\big|\, 1_{A_m}\Big] = O(\sqrt{n})\,.$
This term is $o(1/\gamma_n)$ by Assumption 2.13. Let us consider (D.21).
$E\Big[\big|\sum_{k=1}^{n} \big(P_{k-1}f_{k-1}^T(\phi_{k-1})D_{\theta_{k-1}} - P_k f_k^T(\phi_k)D_{\theta_k}\big)\big|\, 1_{A_m}\Big] = E\big[|P_0 f_0^T(\phi_0)D_{\theta_0} - P_n f_n^T(\phi_n)D_{\theta_n}|\, 1_{A_m}\big]$
and this term is $O(1)$ by Proposition D.1-3), (D.24) and Lemma D.2. Let us see the third term (D.22). By Proposition D.3-2) and (D.24), we have
$\sum_{k=1}^{n} E\big[\big|(P_k f_k^T(\phi_k) - P_{k-1}f_{k-1}^T(\phi_k)) D_{\theta_k}\big|\, 1_{A_m}\big] \le C_m \sum_{k=1}^{n} E\big[\big(|\theta_k - \theta_{k-1}|^{\lambda_\mu} + |\alpha_{k+1} - \alpha_k|\big)\, 1_{A_m}\big]\,.$
By Lemmas D.1 and D.2 and Assumptions 2.10 and 2.13, this term is $o(1/\gamma_n)$. Finally, the same conclusion holds for (D.23) by using Proposition D.1-3), Lemma D.2 and (D.25). This concludes the proof of $\lim_n \gamma_n E[|T_n^{(2)}| 1_{A_m}] = 0$.
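The $O(\sqrt{n})$ bound on the martingale term (D.20) can be seen already at the level of second moments: for centered independent increments, $E[S_n^2] = n\sigma^2$, hence $E|S_n| \le \sigma\sqrt{n}$ by Cauchy-Schwarz. A generic simulation (synthetic $\pm 1$ increments, not the thesis's noise sequence):

```python
import numpy as np

# Martingale partial sums grow like sqrt(n) in L1: simulate a +/-1 random walk.
rng = np.random.default_rng(0)
n, trials = 1000, 2000
steps = rng.choice([-1.0, 1.0], size=(trials, n))  # centered, unit-variance increments
S_n = steps.sum(axis=1)                            # martingale at time n, per trial
mean_abs = np.abs(S_n).mean()                      # Monte Carlo estimate of E|S_n|

assert mean_abs <= np.sqrt(n)          # E|S_n| <= sqrt(E[S_n^2]) = sqrt(n)
assert mean_abs >= 0.5 * np.sqrt(n)    # and the sqrt(n) rate is actually attained
```

Multiplying such a term by $\gamma_n$ with $\gamma_n \sqrt{n} \to 0$ (Assumption 2.13) is what makes the contribution of (D.20) vanish.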
c) Term $T_n^{(3)}$

By Lemma D.4, there exists $C_m$ such that for any $\theta \in K$, $|\Xi(1,\theta,x)| \le C_m(1 + |x|^2)$. By Lemma D.2, for any $a$ in a neighborhood of 1 we have $\sup_{\alpha\in[0,a],\theta\in K} \int |x|^2\, \pi_{\alpha,\theta}(dx) < \infty$. Since $\lim_n \alpha_n = 1$, we have
$\sup_{n\ge m} \Big|\int \Xi(1,\theta_n,x)\,d\pi_n(x)\Big|\, 1_{\theta_n\in K} < C$
for some constant $C$, which implies that $\lim_n \gamma_n E[|T_n^{(3)}| 1_{A_m}] = 0$.

Bibliography

[1] FIT IoT-LAB: very large scale open wireless sensor network testbed. https://www.iot-lab.info/.
[2] Morral, G. http://perso.telecom-paristech.fr/~morralad/.
[3] Dieng, N.A. http://nadieng.wordpress.com/.
[4] Achlioptas, D. and McSherry, F. (2001). Fast computation of low rank matrix approximations. In Proceedings of the thirty-third annual ACM symposium on Theory of computing, pages 611–618.
[5] Achlioptas, D. and McSherry, F. (2007). Fast computation of low rank matrix approximations. Journal of the ACM, 54(2).
[6] Agarwal, A., Chapelle, O., Dudík, M., and Langford, J. (2014). A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15:1111–1133.
[7] Aghasi, H., Hashemi, M., and Khalaj, B. (2012). A Source Localization Based on Signal Attenuation and Time Delay Estimation in Sensor Networks. International Journal of Computer and Electrical Engineering, 4(3):423–427.
[8] Alexander, S. (1982). Radio propagation within buildings at 900 MHz. Electronics Letters, 18(21):913–914.
[9] Andrieu, C., Moulines, E., and Priouret, P. (2005). Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim., 44(1):283–312.
[10] Aysal, T., Yildiz, M., Sarwate, A., and Scaglione, A. (2009). Broadcast Gossip Algorithms for Consensus. IEEE Trans. on Signal Processing, 57(7):2748–2761.
[11] Bahl, P. and Padmanabhan, V. (2000). RADAR: An In-Building RF-Based User Location and Tracking System. In INFOCOM, pages 775–784.
[12] Bartlett, P., Jordan, M., and McAuliffe, J. (2006). Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473):138–156. [13] Bauso, D. and Nedic, A. (2013). Dynamic coalitional tu games: Distributed bargaining among players’ neighbors. Automatic Control, IEEE Transactions on, 58(6):1363 –1376. 204 Bibliography [14] Benaim, M. (1996). A dynamical system approach to stochastic approximations. SIAM Journal on Control and Optimization, 34:437. [15] Benaim, M., Hofbauer, J., and Sorin, S. (2005). Stochastic Approximations and Differential Inclusions. SIAM Journal on Control and Optimization, 44(1):328–348. [16] Bénézit, F. (2009). Distributed Average Consensus for Wireless Sensor Networks. PhD thesis, EPFL. [17] Benveniste, A., Metivier, M., and P., P. (1987). Adaptive Algorithms and Stochastic Approximations. Springer-Verlag. [18] Bertsekas, D. and Tsitsiklis, J. (1997). Parallel and Distributed Computation: Numerical Methods. Athena Scientific. [19] Bianchi, P., Fort, G., and Hachem, W. (2013). Performance of a Distributed Stochastic Approximation Algorithm. IEEE Transactions on Information Theory, 59(11):7405 – 7418. [20] Bianchi, P., Fort, G., Hachem, W., and Jakubowicz, J. (2011a). On the Convergence of a Distributed Parameter Estimator for Sensor Networks with Local Averaging of the Estimate. In ICASSP, Praha, Czech Republic. [21] Bianchi, P., Fort, G., Hachem, W., and Jakubowicz, J. (2011b). Performance of a Distributed Robbins-Monro Algorithm for Sensor Networks. In EUSIPCO, Barcelona, Spain. [22] Bianchi, P., Hachem, W., and Iutzeler, F. (2014). A stochastic coordinate descent primal-dual algorithm and applications to large-scale composite optimization. Arxiv preprint arXiv:1407.0898. [23] Bianchi, P. and Jakubowicz, J. (2013). On the convergence of a projected multi-agent stochastic gradient algorithm for non-convex optimization. Automatic Control, IEEE Transactions on, 58(2):391–405. [24] Biswas, P., Liang, T., Wang, T., and Ye, Y. (2006). 
Semidefinite programming based algorithms for sensor network localization. ACM Transactions on Sensor Networks, 2. [25] Biswas, P. and Ye, Y. (2006). A Distributed Method for Solving Semidefinite Programs Arising from Ad Hoc Wireless Sensor Network Localization. In Multiscale Optimization Methods and Applications, volume 82 of Nonconvex Optimization and Its Applications, pages 69–84. Springer US. [26] Blondel, V., Hendrickx, J., Olshevsky, A., and Tsitsiklis, J. (2005). Convergence in multiagent coordination, consensus, and flocking. In Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC ’05. 44th IEEE Conference on, pages 2996 – 3000. [27] Borg, I. and Groenen, P. (1997). Modern Multidimensional Scaling: theory and applications. New York: Springer-Verlag. Bibliography 205 [28] Borkar, V. (2008). Stochastic approximation: a dynamical system viewpoint. Cambridge University Press. [29] Borkar, V. and Meyn, S. (2012). Oja’s algorithm for graph clustering, markov spectral decomposition, and risk sensitive control. Journal of Automatica, 48(10):2512–2519. [30] Borwein, J. and Lewis, A. (2006). Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Springer. [31] Boyd, S., Ghosh, A., Prabhakar, B., and Shah, D. (2004). Analysis and optimization of randomized gossip algorithms. In Decision and Control, 2004. CDC. 43rd IEEE Conference on (Volume 5), pages 5310 – 5315, Bahamas. [32] Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122. [33] Brémaud, P. (1999). Markov chains: Gibbs fields, Monte Carlo simulation, and queues. springer verlag. [34] Bénézit, F., Blondel, V., Thiran, P., Tsitsiklis, J., and Vetterli, M. (2010). Weighted gossip: Distributed averaging using non-doubly stochastic matrices. 
In Information Theory Proceedings (ISIT), IEEE International Symposium on, pages 1753–1757. IEEE.
[35] Calafiore, G., Carlone, L., and Wei, M. (2010). A distributed gradient method for localization of formations using relative range measurements. In IEEE International Symposium on Computer-Aided Control System Design (CACSD), Yokohama.
[36] Cappé, O. and Moulines, E. (2009). On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B, 71(3):593–613.
[37] Castells, F., Laguna, P., Sörnmo, L., Bollmann, A., and Roig, J. (2007). Principal component analysis in ECG signal processing. EURASIP Journal on Advances in Signal Processing, (1):98–98.
[38] Cattivelli, F. and Sayed, A. (2010a). Diffusion LMS strategies for distributed estimation. IEEE Trans. Signal Processing, 58(3):1035–1048.
[39] Cattivelli, F. and Sayed, A. (2010b). Distributed nonlinear Kalman filtering with applications to wireless localization. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 3522–3525, Dallas, TX.
[40] Cevher, V., Becker, S., and Schmidt, M. (2014). Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5):32–43.
[41] Chen, H., Wang, G., Wang, Z., So, H., and Poor, H. (2012). Non-Line-of-Sight Node Localization Based on Semi-Definite Programming in Wireless Sensor Networks. IEEE Transactions on Wireless Communications, 11(1):108–116.
[42] Chen, J., Hudson, R., and Yao, K. (2002). Maximum-likelihood source localization and unknown sensor location estimation for wideband signals in the near-field. IEEE Trans. on Signal Processing, 50(8):1843–1854.
[43] Chen, J. and Sayed, A. (2012). Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks. IEEE Trans. Signal Processing, 60(8):4289–4305.
[44] Connelly, R. (2005). Generic global rigidity. Discrete Comput. Geom., 33(4):549–563.
[45] Costa, J., Patwari, N., and Hero, A. (2006). Distributed Weighted-Multidimensional Scaling for Node Localization in Sensor Networks. ACM Transactions on Sensor Networks, 2(1):39–64.
[46] Cox, D., Murray, R., and Norris, A. (1984). 800-MHz attenuation measured in and around suburban houses. AT&T Bell Laboratories Technical Journal, 63(6):921–954.
[47] de Moraes, L. and Nunes, B. (2006). Calibration-Free WLAN Location System Based on Dynamic Mapping of Signal Strength. In MobyWac.
[48] DeGroot, M. H. (1974). Reaching a Consensus. Journal of the American Statistical Association, 69(345):118–121.
[49] Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. (2011). Optimal distributed online prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 713–720.
[50] Delyon, B. (1996). General results on the convergence of stochastic algorithms. IEEE Transactions on Automatic Control, 41(9):1245–1255.
[51] Delyon, B. (2000). Stochastic Approximation with Decreasing Gain: Convergence and Asymptotic Theory. Unpublished Lecture Notes, http://perso.univrennes1.fr/bernard.delyon/as_cours.ps.
[52] Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–38.
[53] Dieng, N., Charbit, M., Chaudet, C., Toutain, L., and Meriem, T. (2012a). A Multi-Path Data Exclusion Model for RSSI-based Indoor Localization. In 15th International Symposium on Wireless Personal Multimedia Communications (WPMC), pages 336–340.
[54] Dieng, N., Charbit, M., Chaudet, C., Toutain, L., and Meriem, T. (2012b). Experiments on the RSSI as a Range Estimator for Indoor Localization. In (NTMS), pages 1–5.
[55] Dieng, N., Charbit, M., Chaudet, C., Toutain, L., and Meriem, T. (2013). Indoor Localization in Wireless Networks based on a Two-modes Gaussian Mixture Model.
In IEEE 78th Vehicular Technology Conference (VTC Fall), pages 1–5. Bibliography 207 [56] Dieng, N., Chaudet, C., Toutain, L., Meriem, T., and Charbit, M. (2014). No-calibration localisation for indoor wireless sensor networks. International Journal of Ad Hoc and Ubiquitous Computing, 15(1):200–214. [57] Doherty, L., Pister, K. S. J., and El Ghaoui, L. (2001). Convex position estimation in wireless sensor networks. In Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies, volume 3 of IEEE Infocom, pages 1655–1663. [58] Duchi, J., Agarwal, J., and Wainwright, M. (2010a). Distributed Dual Averaging in Networks. In Advances in Neural Information Systems. [59] Duchi, J., Agarwal, J., and Wainwright, M. (2010b). Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. Automatic control, IEEE Transactions on, 99(10):1–40. [60] Dudley, R. (2002). Real analysis and Probability. Cambridge University Press. [61] Duflo, M. (2010). Random Iterative Models. Springer. [62] Elnahrawy, E., Xiaoyan, L., and Martin, R. (2004). The limits of localization using signal strength: a comparative study. In In First Annual IEEE Conference on Sensor and Ad-hoc Communications and Networks,, pages 406–414. [63] Essoloh, M., Richard, C., and Snoussi, H. (2007). Localisation distribuée dans les réseaux de capteurs sans fil par résolution d’un problème quadratique. In GRETSI. [64] Forero, P., Cano, A., and Giannakis, G. (2008). Consensus-based distributed expectationmaximization algorithm for density estimation and classification using wireless sensor networks. In IEEE ICASSP 2008, pages 1989–1992. [65] Forero, P., Cano, A., and Giannakis, G. (2010a). Consensus-based distributed support vector machines. Journal of Machine Learning Research, 11:1663 – 1707. [66] Forero, P., Cano, A., and Giannakis, G. (2010b). Consensus-based distributed support vector machines. The Journal of Machine Learning Research, 11:1663–1707. [67] Fort, G. (2014). 
Central Limit Theorems for Stochastic Approximation with Controlled Markov Chain Dynamics. Accepted for publication in ESAIM PS. [68] Fort, G., Moulines, E., and Priouret, P. (2012). Convergence of Adaptive and Interacting Markov chain Monte Carlo algorithms. Ann. Statist., 39(6):3262–3289. [69] Frasca, P. and Hendrickx, J. (2013). On the mean square error of randomized averaging algorithms. Automatica, 49(8):2496 – 2501. [70] Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407. 208 Bibliography [71] Goldoni, E., Savioli, A., Risi, M., and Gamba, P. (2010). Experimental analysis of RSSIbased indoor localization with IEEE 802.15.4. In IEEE European Wireless Conference (EW), pages 71–77. [72] Goldsmith, A. (2005). Wireless Communications. Cambridge University Press, New York, USA. [73] Golub, G. H. and Van Loan, C. F. (1983). Matrix Computations. Johns Hopkins University Press. [74] Gu, D. (2008). Distributed em algorithm for gaussian mixtures in sensor networks. Neural Networks, IEEE Transactions on, 19(7):1154–1166. [75] Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and its Application. Academic Press, New York, London. [76] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning. Springer. [77] Hendrickson, B. (1992). Conditions for unique graph realizations. SIAM J. Comput., 21(1):65–84. [78] Hendrickx, J., Shi, G., and Johansson, K. (2014). Finite-time consensus using stochastic matrices with positive diagonals. To appear in IEEE Transactions on Automatic Control. [79] Hereman, W. (2011). Trilateration: The Mathematics Behind a Local Positioning System . Seminar. [80] Honeine, P., Richard, C., Bermudez, J., Snoussi, H., and et al. (2009). Functional estimation in Hilbert space for distributed learning in wireless sensor networks. 
In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 2861 – 2864, Taipei. [81] Horn, R. A. and Johnson, C. R. (1990). Matrix Analysis. Cambridge University Press. [82] Huang, L., Nguyen, X., Garofalakis, M., Jordan, M., Joseph, A., and Taft, N. (2007). Innetwork pca and anomaly detection. Advances in Neural Information Processing Systems, 19:617. [83] Iutzeler, F., Ciblat, P., and Hachem, W. (2013). Analysis of Sum-Weight-like algorithms for averaging in Wireless Sensor Networks. IEEE Transactions on Signal Processing, 61(11):2802–2814. [84] Iutzeler, F., Ciblat, P., Hachem, W., and Jakubowicz, J. (2012). New broadcast based distributed averaging algorithm over wireless sensor networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3117–3120. [85] Javanmard, A. and Montanari, A. (2013). Localization from Incomplete Noisy Distance Measurements. Journal of Foundations of Computational Mathematics, 13(3):297–345. Bibliography 209 [86] Kaemarungsi, K. and Krishnamurthy, P. (2004). Modeling of Indoor Positioning Systems Based on Location Fingerprinting. In INFOCOM. [87] Kar, S. and Moura, J. (2010). Distributed consensus algorithms in sensor networks: Quantized data and random link failures. IEEE Trans. on Signal Processing, 58(3):1383–1400. [88] Karhunen, J. (1984). Adaptive algorithms for estimating eigenvectors of correlation type matrices. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP ’84. (Volume:9 ). [89] Kempe, D., Dobra, A., and Gehrke, J. (2003). Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pages 482–491. IEEE Computer Society. [90] Kempe, D. and McSherry, F. (2008). A decentralized algorithm for spectral analysis. Journal of Computer and System Sciences, 74(1):70 – 83. Learning Theory 2004. 
[91] Keshavan, R., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998. [92] Korada, S., Montanari, A., and Oh, S. (2011). Gossip pca. ACM SIGMETRICS Performance Evaluation Review, 39(1):169–180. [93] Kowalczyk, W. and Vlassis, N. (2005). Newscast em. In NIPS, pages 713–720. [94] Krasulina, T. (1969). The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. {USSR} Computational Mathematics and Mathematical Physics, 9(6):189 – 195. [95] Kruskal, J. and Myron Wish (1978). Multidimensional Scaling. Eric M. Uslaner. [96] Kushner, H. and Clark, D. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag. [97] Kushner, H. and Yin, G. (1987). Asymptotic properties of distributed and communicating stochastic approximation algorithms. SIAM J. Control Optim., 25:1266 – 1290. [98] Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer. [99] Langendoen, K. and Reijers, N. (2003). Distributed localization in wireless sensor networks: a quantitative comparison. Computer Networks, 43(4):499–518. [100] Lee, S. and Nedic, A. (2012). Drsvm: Distributed random projection algorithms for svms. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 5286 – 5291, Maui, Hawai. [101] Leeuw, J. (1977). Applications of Convex Analysis to Multidimensional Scaling. Recent Developments in Statistics, pages 133–145. 210 Bibliography [102] Li, L., Scaglione, A., and Manton, J. (2011). Distributed Principal Subspace Estimation in Wireless Sensor Networks. IEEE Selected Topics in Signal Processing, 5(4):725–738. [103] Lopes, C. and Sayed, A. (2006). Distributed processing over adaptive networks. In Proc. Adaptive Sensor Array Processing Workshop, pages 1–5, MIT Lincoln Laboratory, MA. 
[104] Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. (2012). Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727. [105] Mao, G., Fidan, B., and Anderson, B. (2007). Wireless sensor network localization techniques. Computer Networks, 51(10):2529–2553. [106] Mateos, G., Bazerque, J., and Giannakis, G. (2010). Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262–5276. [107] McDonald, R., Hall, K., and Mann, G. (2010). Distributed training strategies for the structured perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 456–464. Association for Computational Linguistics. [108] Mertikopoulos, P., Belmega, E., Moustakas, A., and Lasaulce, S. (2012). Distributed learning policies for power allocation in multiple access channels. IEEE Journal on Selected Areas in Communications, 30(1):1–11. [109] Meyn, S. and Tweedie, R. (1993). Markov Chains and Stochastic Stability. Springer-Verlag. [110] Murphy, W. and Hereman, W. (1995). Determination of a position in three dimensions using trilateration and approximate distances. Technical Report MCS-95-07, Department of Mathematical and Computer Sciences, Colorado School of Mines, Colorado. [111] Navia-Vazquez, A., Gutierrez-Gonzalez, D., Parrado-Hernandez, E., and Navarro-Abellan, J. (2006). Distributed support vector machines. IEEE Transactions on Neural Networks, 17(4):1091–1097. [112] Nedic, A. (2011). Asynchronous broadcast-based convex optimization over a network. IEEE Transactions on Automatic Control, 56(6):1337–1351. [113] Nedic, A. and Bertsekas, D. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138. [114] Nedic, A. and Olshevsky, A. (2013).
Distributed optimization over time-varying directed graphs. In Proceedings of the 52nd IEEE Conference on Decision and Control (CDC), pages 6855–6860, Florence, Italy. [115] Nedic, A. and Olshevsky, A. (2014). Stochastic gradient-push for strongly convex functions on time-varying directed graphs. arXiv preprint arXiv:1406.2075. [116] Nedic, A. and Ozdaglar, A. (2009). Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61. [117] Nedic, A., Ozdaglar, A., and Parrilo, P. (2010). Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938. [118] Niculescu, D. and Nath, B. (2001). Ad hoc positioning system (APS). In GLOBECOM, pages 2926–2931. [119] Nowak, R. (2003). Distributed EM algorithms for density estimation and clustering in sensor networks. IEEE Transactions on Signal Processing, 51(8):2245–2253. [120] Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273. [121] Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935. [122] Oja, E. and Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106(1):69–84. [123] Olfati-Saber, R. (2007). Distributed tracking for mobile sensor networks with information-driven mobility. In American Control Conference, pages 4606–4612, New York, USA. [124] Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank citation ranking: bringing order to the web. Technical report. [125] Patra, B. (2011). Convergence of distributed asynchronous learning vector quantization algorithms. The Journal of Machine Learning Research, 12:3431–3466. [126] Patwari, N. et al. (2005). Locating the nodes: cooperative localization in wireless sensor networks.
IEEE Signal Processing Magazine, 22(4):54–69. [127] Patwari, N., Hero, A., et al. (2003). Relative location estimation in wireless sensor networks. IEEE Transactions on Signal Processing, 51(8):2137–2148. [128] Patwari, N., O'Dea, R., and Wang, Y. (2001). Relative location in wireless networks. In VTC. [129] Pelletier, M. (1998). Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, 8(1):10–44. [130] Priyantha, N., Balakrishnan, H., Demaine, E., and Teller, S. (2003). Anchor-free distributed localization in sensor networks. In Proceedings of the 1st International Conference on Embedded Networked Sensor Systems (SenSys '03), pages 340–341. ACM. [131] Rabbat, M. and Nowak, R. (2005). Quantized incremental algorithms for distributed optimization. IEEE Journal on Selected Areas in Communications, 23(4):798–808. [132] Raffard, R., Tomlin, C., and Boyd, S. (2004). Distributed optimization for cooperative agents: application to formation flight. In Proceedings of the 43rd IEEE Conference on Decision and Control. [133] Ram, S., Nedic, A., and Veeravalli, V. (2009). Incremental stochastic subgradient algorithms for convex optimization. SIAM Journal on Optimization, 20(2):691–717. [134] Ram, S., Nedic, A., and Veeravalli, V. (2010a). Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545. [135] Ram, S., Veeravalli, V., and Nedic, A. (2010b). Distributed and recursive parameter estimation in parametrized linear state-space models. IEEE Transactions on Automatic Control, 55(2):488–492. [136] Rappaport, T. (1996). Wireless Communications: Principles and Practice. Prentice Hall. [137] Rappaport, T. S., Reed, J. H., and Woerner, B. D. (1996). Position location using wireless communications on highways of the future. IEEE Communications Magazine, 34(10):33–41.
[138] Rätsch, G., Onoda, T., and Müller, K. (2001). Soft margins for AdaBoost. Machine Learning, 42(3):287–320. [139] Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407. [140] Savarese, C., Rabaey, J., and Langendoen, K. (2002). Robust positioning algorithms for distributed ad-hoc wireless sensor networks. In Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference, pages 317–327, Monterey. [141] Savvides, A., Park, H., and Srivastava, M. (2002). The bits and flops of the n-hop multilateration primitive for node localization problems. In Proceedings of the 1st ACM International Workshop on Wireless Sensor Networks and Applications (WSNA '02), pages 112–121. ACM. [142] Shang, Y. and Ruml, W. (2004). Improved MDS-based localization. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, pages 2640–2651, vol. 4, Hong Kong. [143] Shang, Y., Ruml, W., and Fromherz, M. (2003). Localization from mere connectivity. In Proceedings of the 4th ACM International Symposium on Mobile Ad Hoc Networking & Computing (MobiHoc '03), pages 201–212. ACM. [144] Stanković, S., Stanković, M., and Stipanović, D. (2011). Decentralized parameter estimation by consensus based stochastic approximation. IEEE Transactions on Automatic Control, 56(3):531–543. [145] Sugano, M., Kawazoe, T., Ohta, Y., and Murata, M. (2006). Indoor localization system using RSSI measurement of wireless sensor network based on ZigBee standard. In Wireless and Optical Communications, pages 1–6. IASTED/ACTA Press. [146] Titterington, D. (1984). Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B, pages 257–267. [147] Tomozei, D. and Massoulié, L. (2010). Distributed user profiling via spectral methods. SIGMETRICS Performance Evaluation Review, 38(1):383–384. [148] Tonneau, A., Mitton, N., and Vandaele, J. (2014).
A survey on (mobile) wireless sensor network experimentation testbeds. In DCOSS, IEEE International Conference on Distributed Computing in Sensor Systems, Marina del Rey, California, United States. [149] Towfic, Z., Chen, J., and Sayed, A. (2013). On distributed online classification in the midst of concept drifts. Neurocomputing, 112:138–152. [150] Towfic, Z. and Sayed, A. (2013). Adaptive stochastic convex optimization over networks. In Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on, pages 1272–1277, Monticello, USA. [151] Towfic, Z. and Sayed, A. (2014). Adaptive penalty-based distributed stochastic convex optimization. IEEE Transactions on Signal Processing, 62(15):3924–3938. [152] Trawny, N., Roumeliotis, S., and Giannakis, G. (2009). Cooperative multi-robot localization under communication constraints. In IEEE International Conference on Robotics and Automation (ICRA '09), pages 4394–4400, Kobe. [153] Tsianos, K., Lawlor, S., and Rabbat, M. (2012). Push-sum distributed dual averaging for convex optimization. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), pages 5453–5458. [154] Tsiptsis, K. and Chorianopoulos, A. (2009). Data Mining Techniques in CRM: Inside Customer Segmentation. John Wiley & Sons, Ltd. [155] Tsitsiklis, J. (1984). Problems in Decentralized Decision Making and Computation. PhD thesis, Massachusetts Institute of Technology. [156] Tsitsiklis, J., Bertsekas, D., and Athans, M. (1986). Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812. [157] Wei, E. and Ozdaglar, A. (2012). Distributed alternating direction method of multipliers. In Proceedings of the 51st IEEE Conference on Decision and Control (CDC), pages 5445–5450. [158] Williams, R. (1985). Feature discovery through error-correcting learning.
Technical report, Institute of Cognitive Science, University of California. [159] Xu, J., Liu, W., Lang, F., Zhang, Y., and Wang, C. (2010). Distance measurement model based on RSSI in WSN. Wireless Sensor Network, 2(8):606–611. [160] Yan, F., Sundaram, S., Vishwanathan, S., and Qi, Y. (2013). Distributed autonomous online learning: regrets and intrinsic privacy-preserving properties. IEEE Transactions on Knowledge and Data Engineering, 25(11):2483–2493. [161] Ye, J., Chow, J., Chen, J., and Zheng, Z. (2009). Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 2061–2064. [162] Yu, J., Kulkarni, S., and Poor, H. (2013). Dimension expansion and customized spring potentials for sensor localization. EURASIP Journal on Advances in Signal Processing. A study of distributed algorithms for stochastic approximation in wireless sensor networks Gemma MORRAL ADELL RÉSUMÉ : Dans le cadre du traitement statistique du signal, les réseaux multi-agents servent à un grand nombre d'applications. Dans le domaine radio, les réseaux de capteurs sans fil sont utilisés par exemple pour la surveillance d'environnements ou la détection et la poursuite de cibles. Sur internet, les machines sont les agents au service de toutes les applications liées au "cloud computing" ou, plus récemment, à la gestion du "Big Data". Les réseaux distribués se caractérisent par l'absence d'un agent central qui collecte toutes les données, se charge des calculs et gère le reste des agents. Nous considérons un réseau d'agents dont le but est d'estimer un paramètre d'intérêt. Un agent est un dispositif capable de recueillir des informations de façon locale et/ou partielle sur le paramètre inconnu, d'effectuer des calculs locaux à chaque instant et de fusionner ses informations avec celles d'autres agents dans son voisinage.
Nous cherchons à concevoir et à analyser des algorithmes distribués d'approximation stochastique, bien adaptés au cas où les données sont collectées en ligne, simultanément avec le processus d'estimation. La première partie de la thèse porte sur les méthodes distribuées d'adaptation-diffusion. À chaque itération de l'algorithme, les agents mettent à jour leurs estimations locales et fusionnent ces résultats avec les estimations de leurs voisins. Nous étudions la convergence de l'algorithme et l'influence du protocole de communication sur sa performance asymptotique. Nous appliquons nos résultats à l'inférence statistique dans les réseaux de capteurs sans fil. Dans la deuxième partie de la thèse, nous étudions l'analyse en composantes principales distribuée. Nous proposons un algorithme d'approximation stochastique distribué, fondé sur la méthode d'Oja, permettant d'estimer l'espace propre principal d'une matrice partiellement observée. Nous appliquons nos résultats à l'auto-localisation dans les réseaux de capteurs sans fil. Les résultats numériques obtenus sur une plateforme réelle de capteurs sans fil confirment le comportement attractif de notre approche. MOTS-CLEFS : Algorithmes d'optimisation distribués, approximation stochastique, réseaux de capteurs sans fil, localisation distribuée. ABSTRACT: In the framework of statistical signal processing, multi-agent networks serve a wide range of applications. In radio communications for instance, wireless sensor networks are used for environmental monitoring and sensing, and for target detection and tracking. On the internet, machines are the agents serving all applications related to cloud computing or, more recently, Big Data processing. Distributed networks are characterized by the absence of a central agent which collects and processes all the data and manages the remaining agents. We consider a network of agents whose aim is to estimate some parameter of interest.
An agent is a device able to collect local and/or partial information on the unknown parameter, to perform local computations at each time instant, and to merge information with other agents in its neighborhood. We seek to design and analyze distributed stochastic approximation algorithms, which are well suited to the case where the data are collected online, simultaneously with the estimation process. The first part of the thesis focuses on distributed adaptation-diffusion methods. At each iteration of the algorithm, the agents update their local estimates and merge the results with the estimates of their neighbors. We address the convergence of the algorithm and the influence of the communication protocol on its asymptotic performance. We apply our results to statistical inference in wireless sensor networks. In the second part of the thesis, we investigate distributed principal component analysis. We propose a distributed stochastic approximation algorithm based on Oja's method for estimating the principal eigenspace of a partially observed matrix. We apply our results to self-localization in wireless sensor networks. Numerical experiments performed on a wireless sensor network testbed support the attractive behavior of our approach. KEY-WORDS: Distributed optimization algorithms, stochastic approximation, wireless sensor networks, distributed localization
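The two algorithmic ingredients named in the abstract — an adaptation-diffusion step, where each agent combines its neighbors' estimates and then takes a local stochastic-gradient step, and Oja's rule for the stochastic approximation of a principal eigenvector — can be illustrated with a minimal sketch. The function names, the toy mean-estimation problem, the uniform mixing matrix, and the step-size choices below are illustrative assumptions, not the thesis's exact algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptation_diffusion_step(theta, grads, W, gamma):
    # Combine: each agent averages its neighbors' estimates (W is a
    # doubly stochastic mixing matrix), then adapts with a local
    # stochastic-gradient step of size gamma.
    return W @ theta - gamma * grads

def oja_step(w, x, gamma):
    # Oja's rule: one stochastic update of w toward the principal
    # eigenvector of E[x x^T], followed by renormalization.
    y = x @ w
    w = w + gamma * y * (x - y * w)
    return w / np.linalg.norm(w)

# Toy run 1: three agents estimating the network-wide mean of their
# local targets (quadratic local costs, noisy gradients).
targets = np.array([1.0, 2.0, 3.0])
W = np.full((3, 3), 1.0 / 3.0)  # uniform doubly stochastic weights
theta = np.zeros(3)
for n in range(1, 2001):
    grads = theta - (targets + 0.1 * rng.standard_normal(3))
    theta = adaptation_diffusion_step(theta, grads, W, gamma=1.0 / n)
print(theta)  # all agents should be close to the average target 2.0

# Toy run 2: Oja's method on samples whose variance is largest along e_1.
d = 4
u = np.zeros(d); u[0] = 1.0            # true principal eigenvector
w = rng.standard_normal(d); w /= np.linalg.norm(w)
for n in range(1, 5001):
    x = rng.standard_normal(d); x[0] *= 3.0
    w = oja_step(w, x, gamma=1.0 / (n + 10))
print(abs(w @ u))  # should be close to 1 (alignment with e_1)
```

With a decreasing step size, the first loop behaves like a Robbins-Monro scheme driving all agents to a consensus on the average of the local targets, while the second loop tracks the dominant eigendirection of the sample covariance, which is the mechanism the thesis's distributed PCA algorithm builds on.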