Coordination in Open Source Projects

– A Social Network Analysis using CVS data
DISSERTATION
of the University of St. Gallen,
Graduate School of Business Administration,
Economics, Law, and Social Sciences (HSG)
to obtain the title of
Doctor of Business Administration
Submitted by
Sebastian Spaeth
from
Germany
Approved on the application of
Prof. Georg F. von Krogh, PhD
and
Prof. Dr. Oliver Gassmann
Dissertation no. 3110
Difo-Druck GmbH, Bamberg
The University of St. Gallen, Graduate School of Business Administration, Economics, Law,
and Social Sciences (HSG) hereby consents to the printing of the present dissertation, without
hereby expressing any opinion on the views herein expressed.
St. Gallen, June 30, 2005
The President
Prof. Ernst Mohr, PhD
“code is only a miniscule part of the
big picture, its what people /do/ with
it that matters”
–jrandom, Founder of I2P, February 18, 2005
Acknowledgements
This dissertation builds upon three years of research on open source software development at
the Institute of Management. While the research for it was conducted by me, it is inspired by
my work as a research assistant at the chair of Professor von Krogh and through interaction
with my colleagues. It could only be finished (within a reasonable time frame) because of the
support from some individuals and many participants in various open source projects who I
would like to thank:
First of all, I would like to thank Georg von Krogh for hiring me right away after a telephone
interview in which I was his research subject. I have learned a lot during my time at his chair.
Unfortunately, I never got round to speaking as much Swedish/Norwegian with him as I would
have liked.
I would also like to thank Stefan Haefliger. It is a pleasure working in a team with him, and
he is a great colleague and friend. Of course, his parents’ house near Saint-Tropez and the castle
of his girlfriend Susann’s family really helped a lot in relaxing from our hard research¹. Thanks
for those invitations to both of you.
Sharing an office with five persons is not always easy, but I enjoyed working with Daniela
Blettner, Philip Tuertscher, Christian Loepfe, Stephan Herting, and, of course, Fritz. We had a
great time and much fun together.
Last but not least, thanks to my girlfriend Almut, who suffered through my studies as well
(she likes singing the Free Software Song now, thus thoroughly taking revenge on me). She
had to endure months of procrastination, with a final outburst of intense writing during which
I would hear but not listen. Hopefully my ears are now open again. She was the most critical
reviewer of this dissertation throughout all stages, which I am extremely grateful for (although
it might not have been obvious to her at times).
I cannot help but find the enthusiasm of dozens of researchers who praise the superior quality
of free and open source products in their articles, but write them in Microsoft™ WinWord™ on
their Microsoft™ Windows™ computers, somewhat hypocritical.²
Being a half-geek, I enjoyed fiddling for days (and nights) to get the superior quality software
up and running: this dissertation was written with the help of FLOSS software, namely LaTeX,
Emacs, and R, on a GNU/Linux computer.
Finally, I want to apologize for the continuous mixing of the terms free software, open source,
FLOSS, and so on throughout this work. I do know that this might be offensive to some
participants in the field. However, I found it too bothersome to be politically correct all the
time, and wherever these terms are used in this work, they can be considered equivalent from my
point of view. If you find that unsatisfactory, feel free to search and replace all of these terms
with the terminus technicus of your choice.
July 2005
Sebastian Spaeth

¹ Except the night where a dormouse found its way into my jeans.
² This sentence alone might convey to the reader that I am ideologically biased. I admit that readily; however, I
take great care that my personal attitudes have no effects on the outcomes of the research I perform.
Contents

Acknowledgements
List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 The Quest for a Research Topic: Motivation and Research Question
  1.2 Structure overview
  1.3 Introduction to Open Source
    1.3.1 History of Open Source
    1.3.2 Open Source Definition
    1.3.3 Open Source vs. Free Software

2 Theoretical framework
  2.1 A history of research on open source
    2.1.1 Common pitfalls - A critique of empiric OS research
  2.2 Social Network Analysis
  2.3 Research on organization in OS projects
    2.3.1 SNA in open source

3 Empirical Section
  3.1 Methodology
    3.1.1 Sample Selection
    3.1.2 Analysis
  3.2 Sample
  3.3 Analysis
    3.3.1 Concentration of modifications
    3.3.2 Degree
    3.3.3 Inclusiveness
    3.3.4 Centralization
    3.3.5 Code ownership
  3.4 Characterizing the coordination styles
  3.5 Analysis of the Sample Projects
    3.5.1 Abiword
    3.5.2 Adonthell
    3.5.3 AWStats
    3.5.4 bison
    3.5.5 BZFlag
    3.5.6 CDex
    3.5.7 emacs
    3.5.8 Flightgear
    3.5.9 Freenet
    3.5.10 Gnomemeeting
    3.5.11 GNUnet
    3.5.12 GTK+
    3.5.13 iRate
    3.5.14 LAME
    3.5.15 Mailman
    3.5.16 mnet
    3.5.17 nano
    3.5.18 Ogle
    3.5.19 OpenSSL
    3.5.20 pango
    3.5.21 phpMyAdmin
    3.5.22 PostgreSQL
    3.5.23 Smarty
    3.5.24 Stepmania
    3.5.25 tdb
    3.5.26 TikiWiki
    3.5.27 wget
    3.5.28 xerces
    3.5.29 XFCE4

4 Discussion & Conclusion

A Appendix
  A.1 Open Source Definition (Version 1.9)
Index
Curriculum Vitae
List of Tables

1.1   Similarity of FS vs. OS philosophies
2.1   Categorization of FLOSS research
3.1   Sample overview
3.2   Degree overview
3.3   Modification concentration, Centralization & Inclusiveness
3.4   Linear regression (inclusiveness, centrality)
3.5   Modification statistics for the Abiword project
3.6   Modification statistics for the Adonthell project
3.7   Modification statistics for the AWStats project
3.8   Modification statistics for the bison project
3.9   Modification statistics for the BZFlag project
3.10  Modification statistics for the CDex project
3.11  Modification statistics for the emacs project
3.12  Modification statistics for the Flightgear project
3.13  Modification statistics for the Freenet project
3.14  Modification statistics for the Gnomemeeting project
3.15  Modification statistics for the GNUnet project
3.16  Modification statistics for the GTK+ project
3.17  Modification statistics for the iRate project
3.18  Modification statistics for the LAME project
3.19  Modification statistics for the Mailman project
3.20  Modification statistics for the mnet project
3.21  Modification statistics for the nano project
3.22  Modification statistics for the Ogle project
3.23  Modification statistics for the OpenSSL project
3.24  Modification statistics for the pango project
3.25  Modification statistics for the phpMyAdmin project
3.26  Modification statistics for the PostgreSQL project
3.27  Modification statistics for the Smarty project
3.28  Modification statistics for the Stepmania project
3.29  Modification statistics for the tdb project
3.30  Modification statistics for the TikiWiki project
3.31  Modification statistics for the wget project
3.32  Modification statistics for the xerces project
3.33  Modification statistics for the XFCE4 project
List of Figures

1.1   Starting year of participants in FLOSS Communities
1.2   Differences between open source and free software developers
2.1   Publications on social networks
3.1   The iRate architecture
3.2   Histogram of modification concentration (Gini)
3.3   Relative degree over all developers
3.4   Rel. degree vs. log(#modifications)
3.5   Inclusiveness of projects
3.6   Inclusiveness vs. number of developers
3.7   Project dendrogram
3.8   Mean distinct authors per file (over all projects)
3.9   Mean distinct authors per file vs. # developers
3.10  Centrality, Inclusiveness, and Concentration of modifications
3.11  Abiword sociogram
3.12  Adonthell sociogram
3.13  AWStats sociogram
3.14  Bison sociogram
3.15  BZFlag sociogram
3.16  CDex sociogram
3.17  emacs sociogram
3.18  Flightgear sociogram
3.19  Freenet sociogram
3.20  Gnomemeeting sociogram
3.21  GNUnet sociogram
3.22  GTK+ sociogram
3.23  iRate sociogram
3.24  LAME sociogram
3.25  Mailman sociogram
3.26  mnet sociogram
3.27  nano sociogram
3.28  Ogle sociogram
3.29  OpenSSL sociogram
3.30  pango sociogram
3.31  phpMyAdmin sociogram
3.32  PostgreSQL sociogram
3.33  Smarty sociogram
3.34  Stepmania sociogram
3.35  tdb sociogram
3.36  TikiWiki sociogram
3.37  wget sociogram
3.38  xerces sociogram
3.39  XFCE4 sociogram
List of Abbreviations

BSD     Berkeley Software Distribution
CVS     Concurrent Versions System
DARCS   David’s Advanced Revision Control System (an alternative to →CVS)
DARPA   Defense Advanced Research Projects Agency
ESR     Eric S. Raymond
FLOSS   free/libre/open source software
FOSS    free/open source software
FS      free (→ libre) software
FSF     Free Software Foundation
GDBM    GNU database manager
GNU     the “GNU’s not Unix” operating system
GPL     (→GNU) General Public License
GTK+    The GIMP Toolkit
Hird    →Hurd of Interfaces Representing Depth
Hurd    →Hird of Unix-Replacing Daemons
ITIL    IT Infrastructure Library
LAME    LAME Ain’t an Mp3 Encoder
LGPL    (→GNU) Lesser General Public License
LOC     lines of code
MPL     Mozilla Public License
NDA     Non Disclosure Agreement
OS      open source
OSI     Open Source Initiative
OSS     open source software
PHP     PHP Hypertext Preprocessor
RMS     Richard M. Stallman
SQL     Structured Query Language
SNA     Social Network Analysis
SSL     Secure Sockets Layer
TLS     Transport Layer Security
XML     Extensible Markup Language
1 Introduction
Limited to a strict interpretation of its definition, open source consists of a set of rules which
apply to a piece of software and which specify how the software and derivatives of it may be
used (and under what circumstances derivatives may be created at all). These rules are listed in
the software’s license and must be compatible with the criteria of the Open Source Definition
(The Open Source Initiative, 2003b).
But looking at numerous articles in magazines, listening to managers in companies, and seeing how researchers of all kinds scrutinize free and open source software projects, it becomes
apparent that open source seems to be much more than a simple set of licensing terms:
It is seen as a philosophy (Stallman, 1993), a paradigm (Feller and Fitzgerald, 2000), a
production model (Kogut and Metiu, 2001) or even a new innovation model (von Hippel and
von Krogh, 2003). It can be a way of organizing projects (Crowston and Scozzi, 2002), or a
way to collaborate (Yamauchi, Yokozawa, Shinohara, and Ishida, 2000). It follows plenty of
unwritten rules and exhibits an informal hierarchy which newcomers need to follow in order to
be allowed to join a community (von Krogh, Spaeth, and Lakhani, 2003).
Some of the prominent figures in this field have been in the spotlight of media attention.
For example, Linus Torvalds, creator and project leader of the Linux kernel, the OS¹ project
most widely known to the public, was voted #17 in Time Magazine’s Person of the Century
Poll in 2000. In 2001, he shared the Takeda Award for Social/Economic Well-Being with
Richard Stallman and Ken Sakamura. In 2004, Time magazine named him one of the most
influential people in the world. Also in 2004, he won The Economist’s Innovation
Award (The Economist, 2004)².
The phenomenon has been the subject of intense research by dozens of researchers during the
last few years, an overview of which is given in Section 2.1 on page 16.

¹ To improve readability, open source is abbreviated as OS throughout the text.
1.1 The Quest for a Research Topic: Motivation and Research Question
Originally, it was intended to study “strategic decision making”, as described in the thesis proposal (Spaeth, 2003). But a brief analysis of mailing lists from several open source projects
revealed that the examined projects, at least, did not have a “formal” decision making process
and, even worse, that decisions were seldom discussed explicitly and decided on in public mailing lists. For instance, important decisions, such as the import and reuse of a component from
an external project, were rarely discussed and decided on collectively.
There are two plausible explanations for this. The first is that the main developers
discuss and decide these issues off the public record, in private e-mails or chat channels, before
presenting the public with a fait accompli. This possibility implies that decision making happens but
cannot be observed directly by researchers using publicly archived data sources, as it happens
through private or non-archived communication channels. If this were the case, it would best
be examined through, e.g., an ethnography, observing the developers’ behavior in real time.
The second possibility is that strategic decisions are indeed not explicitly discussed and
openly decided on in the community. This would imply that explicit decision making does
not happen at all in projects beyond the level of individual task self-assignment. While this is
a somewhat extreme and provocative assumption, it is not improbable that few explicit decision
making processes exist in free and open source projects, which very often do not have a formal
organizational structure at all.
² In his book “Just for Fun”, Linus states that he finds the frequent worshiping of his person rather annoying
(Torvalds and Diamond, 2001).

It seemed to become ever more impractical to examine strategic decision making processes
empirically using public data archives. The more preliminary research was done to uncover
these processes, the more apparent it became that decisions were indeed often made implicitly
by developers by simply picking some task they liked and starting to work on it. Thus it appears
that decision making takes place mostly through the self-assignment of tasks by individual
developers (as e.g. theorized by Benkler, 2002). As long as the resulting code works and is
deemed “clean enough”, there seems to be little resentment toward this spontaneous and
self-organizing coordination of developers³.
The logical consequence, therefore, was to look at the areas developers pick to work on.
A closer look at the organization of projects, and more specifically at the coordination among
developers in open source projects, was needed. Previous research had looked into the
social structure of open source projects and their self-organizing characteristics and seemed to
agree that this is an important part of open source research (see the literature review for more
information).
Previous research by von Krogh et al. (2003) had shown that newcomers (developers who
are granted the permission to modify the source code) would often join a project by contributing
some new (or improved) functionality and would later specialize and focus their efforts on
this area. These findings, derived from a single case study (Yin, 2003), had yet to be verified
in multiple cases. So, examining aspects of self-coordination and the “code ownership” of
developers would also serve to verify these findings of our previous research.
One way to examine the areas of code that developers choose to work on is to see if there is
really such a thing as code ownership, how concentrated and/or overlapping these “pockets of
activity” (von Krogh et al., 2003) are, and how they are interconnected; this would result in a
construction of a network that shows developers who are more or less interconnected with each
other, based on their coding activities. After some pondering of the issue, it became clear that
this approach basically would be the construction and analysis of a social network of developers
within one OS project.
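As a rough illustration of the network construction described above, the following Python sketch links two developers whenever both have modified the same file, and then counts each developer's direct ties (the degree). The record format, the developer and file names, and the function `developer_network` are hypothetical and serve only this illustration; they are not the data or tooling used in this work.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (developer, file) modification records, e.g. as they might be
# extracted from a CVS log. Names and paths are made up for illustration.
commits = [
    ("alice", "src/main.c"),
    ("bob", "src/main.c"),
    ("bob", "src/net.c"),
    ("carol", "src/net.c"),
    ("dave", "doc/README"),
]


def developer_network(records):
    """Link two developers whenever both modified at least one common file."""
    # Step 1: collect the set of distinct authors for every file.
    authors_by_file = defaultdict(set)
    for developer, path in records:
        authors_by_file[path].add(developer)

    # Step 2: every pair of co-authors of a file becomes an undirected tie.
    neighbours = {developer: set() for developer, _ in records}
    for authors in authors_by_file.values():
        for a, b in combinations(sorted(authors), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


network = developer_network(commits)
# Degree: number of other developers a developer shares at least one file with.
degree = {developer: len(peers) for developer, peers in network.items()}
print(degree)
```

In this toy data, a developer who only touches files nobody else works on (here `dave`) stays isolated in the network, which is the kind of pattern the "pockets of activity" idea refers to.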
³ This does not mean that co-developers are uncritical of their peers. If they do not like some code, they can be
extremely critical in their peer reviews, sometimes resulting in heavy attacks against each other (flame wars).
Researchers have proposed performing Social Network Analysis (SNA) on computer-centered communities (e.g. Wellman, 1996). However, most studies actually conducted were either
not very in-depth or were high-level studies looking beyond the boundaries of single projects
(e.g. all projects hosted on Sourceforge.net⁴). Section 2.3.1 on page 29 provides more information on recent studies applying SNA to OS projects.
Yet none of the studies examined coordination, or rather self-coordination⁵, of work among
developers using a specific project as the unit of analysis. I became interested in the ways in
which Social Network Analysis could contribute to the examination of coordination and the
ownership of code among developers.
Following this line of thinking, the basic topic which is addressed in this dissertation evolved
over time:
How are free and open source projects organized? How is work coordinated/distributed
among their developers?
This very broad formulation of the topic of interest was divided into three research questions,
addressing the specific issues as explained in the above text.
Res. Question 1 Are free software and open source projects self-coordinated through the set
of files each developer works on?
Res. Question 2 Is there such a thing as “code ownership”? Are developers responsible for
certain areas of code?
Res. Question 3 Are patterns of coordination detectable through Social Network Analysis and
how can they be interpreted? Can projects be categorized according to their “coordination
style”?
⁴ Sourceforge.net is the most popular and prominent platform which provides infrastructure for open source
projects without charge.
⁵ Based on an unscientific definition of coordination, 1: the act or action of coordinating 2: the harmonious
functioning of parts for effective results (Merriam-Webster, 1993), coordination can be understood as “a way
of developers to arrange themselves harmonically to achieve effective results (→software)”. Self-coordination
emphasizes the self-organizing aspect of coordination.
Goal & Readership of this Research
Although some propositions are made during this work, its goal is not a single quantitative
model with one dependent variable. It aims to explore various measures that could be used to
characterize the coordination and the development style, if there is any, of OS projects. One
long term goal should be to enable a categorization of projects, i.e. the creation of a typology
of coordination styles.
The applied methods are to a large extent drawn from Social Network Analysis. The work
takes an interdisciplinary approach. On the one hand, it intends to further research on OSS
development, to derive a theoretical foundation of what happens so seamlessly in thousands of
projects every day, and to help it grow further. On the other hand, the way in which OSS
is developed is of interest to both management and innovation researchers. Organization and
distribution of work in virtual, global organizations is of much interest to both disciplines.
Accordingly, this work could be of interest to a broad audience. Maintainers of open
source projects may be interested in the empirical findings and may find clues on how they
would like ‘their’ community to be organized and to coordinate itself, even though project leaders might only have limited influence over this. Managers deploying virtual, geographically
dispersed teams might want to turn their attention to the apparently fully functioning
self-coordination of these communities. And, last but not least, researchers both of organizations such as companies or communities of practice and of open source projects might be
interested in the use of Social Network Analysis applied across several cases. This work examines the usefulness of diverse measures for characterizing projects, and provides a first
step towards a typology of projects with respect to the distribution and self-coordination of work
among developers through the identification of the variables that determine a coordination style.
The structure of this work is presented next.
1.2 Structure overview
So far, Section 1 has given a brief introduction to the research topic. It presented the research
questions and clarified the goals of the dissertation.
Next, Section 1.3 presents an introduction to the history of
open source, its precise definition, and a comparison of free/libre and open source software, to
give readers who are not yet familiar with the topic a better understanding of open source.
Although the study is performed in an inductive manner, the relevant theoretical framework
and existing literature can be found in Section 2 on page 16. This part begins with an overview
of the history of open source research until today, and a critique of the sampling and data
gathering methods employed in many current empirical studies.
The next subsection (Section 2.2 on page 24) contains a brief generic introduction to
Social Network Analysis (SNA).
Existing literature concerned with the organization of open source projects is reviewed in
Section 2.3 on page 26, focusing later (page 29) on existing research applying Social Network
Analysis to open source projects.
The Empirical Section (Section 3 on page 31) begins with the Methodology. A large part
of that section is then dedicated to the presentation of the sample projects (Section 3.2 on
page 37), characterizing each project briefly.
Various measures, mostly derived from Social Network Analysis, such as the degree of
connection and inclusiveness of the network, are presented and discussed in Section 3.3 on
page 57.
A first step towards a typology of open source projects in regard to the way developers choose
the areas they work on, characterizing the “coordination style” through the identification of
influencing variables, can be found in Section 3.4. Using the four identified variables, Section 3.5 on page 75 characterizes each of the sample projects, presents its sociogram and gives
some key findings.
Finally, the Discussion part summarizes the results and findings. The implications for researchers and practitioners are discussed, starting on page 134. Lastly, interesting open issues
worth future research point to directions that related research could follow.
1.3 Introduction to Open Source
When doing research on free software and open source software projects⁶, it is important to
understand precisely what open source means. This section gives the reader a better understanding of the basic concepts and terms in regard to open source projects and how OS evolved
over time. It also clarifies the differences between open source software and free software and
when those two terms need to be distinguished.
The next sections do not yet provide an overview of current open source research (see Section 2 for this); they serve merely as an introduction to the area. Readers who are already very
familiar with the history and terms of this topic might be inclined to “fast-forward” through
these parts.
1.3.1 History of Open Source
The ARPAnet was introduced as an experiment of the US Department of Defense, or to be
more correct, its spin-off, the Advanced Research Projects Agency. A proposal for "Resource
Sharing Computer Networks" was submitted on June 3, 1968, and approved by the Director on
June 21, 1968 (Hauben, 1994).
Once established, it allowed hackers all over the U.S. to communicate with each other instead
of being isolated in small local groups. Successively, they started to feel like a “virtual” tribe,
connected through the new network. They developed a common hacker culture and hacker
ethics7 .
One man who once called himself “the last true hacker” had especially taken the hacker
culture and ethics to heart. In the beginning of the 70s, when Richard Stallman (RMS) worked
at the MIT Artificial Intelligence Lab, software was not classified as commercial, free software,
or open source, as all software was originally free (Stallman, 1999; Perens, 1999). The hacker
communities, which had developed at the Artificial Intelligence Lab and at other places, were
6 Often abbreviated as FLOSS in a politically correct manner, standing for free, libre & open source software.
7 The terms hacker and hacker ethics were probably coined by the Tech Model Railroad Club at MIT (Levy, 1984; Williams, 2002, App. B).
sharing the source code of operating systems and other applications without any hesitation or
restrictions (Levy, 1984).
But in the early 80s new computer systems were replacing the old ones and the operating
systems for these computers were not free anymore, as many had discovered the commercial
value of software and had started software companies based on proprietary development models. Gradually, software companies, instead of hacker communities, started to dominate the
production and distribution of software of all kinds. These companies kept the source
code closed to the user and sold only compiled binary versions of their software, which were not
readable by humans and could not easily be modified and extended. Furthermore, copyright
law disallowed any modification of these programs. Most hacker communities were weakened when many of their members were hired away by commercial software companies and had
to sign Non-Disclosure Agreements (NDAs); these communities gradually dissolved.
Richard Stallman was faced with the choice to either join the proprietary software world or
to try to bring these hacker communities back to life. His conviction was that all software
should be free to share and to build upon. In order to enable people to use free software again
and to revive the hacker communities he missed, he quit his job in 1984 and decided to write
a new, truly free (i.e. libre) Unix-compatible operating system, which he termed GNU8 (see
Section 1.3.3 for a more detailed explanation of the specific meaning of free). From 1984
on, he worked to replace the numerous tools and programs that make up a Unix operating system,
piece by piece, with free software applications (Moody, 2001). Others followed
and supported the Free Software Foundation (FSF), founded in 1985 by Stallman, or started
complementary and competing projects of their own.
With the spread of the Internet to ever larger parts of the population9 , and the creation of
network based software development tools and infrastructures, the number of free and open
8 GNU is a ’recursive’ acronym of the operating system’s full name “GNU’s Not Unix”. Recursive acronyms are quite common as a joke among computer hackers. E.g. the GNU kernel “Hurd” is even named by a pair of mutually recursive acronyms: “Hurd” stands for “Hird of Unix-Replacing Daemons”. And, then, “Hird” stands for “Hurd of Interfaces Representing Depth” (Bushnell, 1991).
9 Networked computers did not become available to the general public until after the first implementation of TCP/IP (a network protocol) was released with the Berkeley Software Distribution 4.2 (BSD) in 1983. (Raymond, 2003)
Figure 1.1: Starting year of participants in FLOSS Communities (source: Ghosh et al., 2002)
source projects started to explode in the mid-90s. Figure 1.1 visualizes when most participants
started to become involved in FLOSS development, based on a large scale survey conducted by
Ghosh, Glott, Krieger, and Robles (2002), showing the growing participation since the late 1990s.
But not all participants were entirely happy with the strong fixation of Stallman – and therefore the entire Free Software Movement – on moral and ethical issues (“all software must be
free”). His attitude, plus the emphasis on the ambiguous term free, had led many from the press
and corporate world to connect free software with negative stereotypes and the (false) notion
that software of this type must be gratis too.
Therefore, when Netscape decided in early 1998 that it would release the source code of its
web browser suite (Netscape, 1998), the term open source was created on February 3, 1998
by participants of the newly founded Open Source Initiative (OSI) (see next section for more on
this). The motivation behind this move was the creation of a term which could not be mistaken
for gratis and which did not imply any ethical or moral judgment. It would basically follow the
same rules as free software but should still be connected with a positive image in the corporate
world.
Its founders consider the Open Source Initiative, with its policies and goals, a derivative
of Stallman’s work, still following his intentions (Perens, 1999; Raymond, 1999b), although
Stallman disputes this, as proponents of open source software do not explicitly mention and
emphasize the normative values of FLOSS10.
Today, many individuals and companies still hesitate to contribute their expertise and knowledge to a public good. Yet, some studies have shown that contributing to and using open source
software can be the most efficient way to innovate under certain conditions (Bessen, 2002).
Many large companies, notably IBM and to a certain extent Sun and Apple, have adopted and
supported OS software. Even Microsoft, considered the “Anti-Christ” by many believers
in open source, has released some projects under an OSI compatible license; e.g. in 2004,
WiX (Windows Installer XML) was donated by Microsoft and is hosted by Sourceforge (Foley,
2004).
A definition of the term will be given and its main differences to free software will be explained next.
1.3.2 Open Source Definition
In June 1997, Bruce Perens, then the project leader of the Debian GNU/Linux distribution,
drafted The Debian Free Software Guidelines, a document intended to enable the categorization of software into free and non-free software by comparing the software license to the
guidelines (Perens, 1999). After some discussion with Debian developers, it was made official in July. In February 1998, all Debian-specific references were removed and the guidelines
were published as the first version of the official Open Source Definition by the Open Source
Initiative (The Open Source Initiative, 2003a).
According to this definition (The Open Source Initiative, 2003b; full text is given in Appendix A.1 on page 155), users and programmers are granted several rights (see Table 1.1 for
an overview). The most important aspects of OS licenses are, according to its definition:
10 “RMS” commented on the reception of the Linus Torvalds Award at the 1999 LinuxWorld with a sarcastic "Giving the Linus Torvalds Award to the Free Software Foundation is a bit like giving the Han Solo Award to the Rebel Alliance."
• The possibility of free distribution, i.e. the possibility to make any number of copies of
software that one possesses and to sell or give them away, without having to pay anyone
for that privilege.
• The availability of the source code together with the program (or means to download
it without charge), which should guarantee the easy modification and evolution of OS
projects. It should be noted that it is acceptable to require any amount of money when
selling the product. However, when selling a binary program, the source code must be
delivered with it or be accessible for no extra charge.
• Improved and modified code (derived works) should always be allowed to be distributed
under the same license terms as the original software. In order to protect the reputation of
the original author, the license may require that source code be redistributed only in its original
form, as long as “patches”11 are allowed to be distributed along with it.
• No discrimination against persons, groups, or fields of endeavor may occur, which means
that nobody may be excluded from the use of the software (including commercial organizations).
• The license must apply automatically to all to whom the program is distributed. This
clause is intended to forbid closing up software by indirect means such as requiring a
non-disclosure agreement.
The main goal of open source compliant licenses (such as the GNU GPL) is to ensure that
knowledge in the form of source code, once created, remains free and available. Users of that
software are granted the right to modify and improve the software as they see fit.
OS licenses take advantage of copyright law in order to achieve the opposite of what it has
11 A patch is a piece of text describing source code modifications in a specific format, which can be used by the patch program to actually modify the original source code. This is then called patching.
been created for; therefore, this scheme of protection has been termed copyleft12 13 .
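The notion of a patch described in footnote 11 can be made concrete with a small sketch. The following Python fragment is an illustration only and not part of the original study: the file names and the two versions of the function are hypothetical, and the standard library module difflib stands in for the diff program that developers would normally use to create a patch.

```python
import difflib

# Hypothetical original source file, given as a list of lines.
original = [
    "def greet(name):\n",
    "    print('Hello ' + name)\n",
]

# Hypothetical modified version of the same file.
modified = [
    "def greet(name):\n",
    "    print('Hello, ' + name + '!')\n",
]

# Describe the modification in the "unified diff" format, one common
# patch format that the patch program can read.
patch_lines = list(
    difflib.unified_diff(
        original,
        modified,
        fromfile="greet.py.orig",  # assumed file name
        tofile="greet.py",         # assumed file name
    )
)

patch_text = "".join(patch_lines)
print(patch_text)
```

Lines prefixed with a minus sign are removed from the original and lines prefixed with a plus sign are added. The resulting text is what a contributor would send to a maintainer, who could then apply it to the original source code with the patch program.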
1.3.3 Open Source vs. Free Software
There is often some confusion concerning free software in contrast to open source. Most research assumes that they are equivalent, which in most cases might be a fair assumption. However, there are some differences which should be clarified.
As already mentioned in Section 1.3.1 (History), the term free software was introduced by
Richard Stallman and refers to “free as in Freedom”, not as in (getting the software for) free14 .
Followers of the free software movement refrain from using any proprietary and closed
source software, which is in their eyes “evil” software. This includes any proprietary software that runs on top of other free programs (e.g. a commercial application making use of a
free library). Basically, the Free Software Movement is driven mostly by ethical and moral
reasoning.
However, from a technical point of view, the definition of free software is very similar to the
definition of open source software; it guarantees the user the following rights (Stallman, 1999,
p.56):
• You have the freedom to run the program, for any purpose.
• You have the freedom to modify the program to suit your needs. (To make this freedom
effective in practice, you must have access to the source code, since making changes in a
program without having the source code is exceedingly difficult.)
• You have the freedom to redistribute copies, either gratis or for a fee.
12 copyleft /kop’ee-left/ n. [play on ‘copyright’] 1. The copyright notice (‘General Public License’) carried by GNU EMACS and other Free Software Foundation software, granting reuse and reproduction rights to all comers (but see also General Public Virus). 2. By extension, any copyright notice intended to achieve similar aims. (Jargon, 2004)
13 The term Copyleft is derived from the phrase “Copyleft – desrever sthgir lla”, which Don Hopkins wrote in a message to Stallman in 1984 and which is intended as a double pun on the phrase “Copyright–all rights reserved”. (Wikipedia.org)
14 “Free as in Freedom, not free as in beer.” (Stallman, 1993)
• You have the freedom to distribute modified versions of the program, so that the community can benefit from your improvements.
The ambiguity of the word free in the English language has often led to confusion about
the goals of the Free Software Movement in the public and the corporate world. Therefore,
free software is often called libre software, as libre refers more clearly to the intended meaning
“free as in Freedom”.
The GNU GPL (GNU General Public License) was created by Stallman as the standard
license for free software projects. It uses copyright methods in order to ensure the above
mentioned rights to all people. In order to keep all modifications and derivatives free for all
people, it requires all derivatives of GPL’ed work to be published under a GPL compatible
license as well. The requirement to release the resulting derivative source code together with
the binary files caused Microsoft to call the GPL licensing scheme virulent.
The Open Source Initiative takes a more pragmatic stance than the free software movement.
It mainly considers the technical and organizational advantages of open source software over
commercial closed-source software. Its primary motivations are not morals or ethics, but issues like the ease of evolution and modification of software as well as the simple means of cooperation even across organizational boundaries. If combining and using proprietary software
with more restrictive licenses helps to spread open source software, its proponents would promote it
without hesitation (they actually do that by endorsing commercial software on the GNU/Linux
operating system).
Apart from the underlying motivations behind these two movements, their philosophies are
obviously very similar (see Table 1.1 for a summary of both main philosophies), so that free
software and open source projects are usually considered close enough to be regarded as equivalent for research.
Whether this assumption is generally true has yet to be proved. As the definitions are nearly
equivalent, and the development style seems to work identically, it might be fair to treat both
types as the same.
But especially research on the motivation of participants should not neglect the potential
Free Software philosophy:
⇒ You have the freedom to run the program, for any purpose.
⇒ You have the freedom to modify the program to suit your needs. (To make this freedom
effective in practice, you must have access to the source code, since making changes in
a program without having the source code is exceedingly difficult.)
⇒ You have the freedom to redistribute copies, either gratis or for a fee.
⇒ You have the freedom to distribute modified versions of the program, so that the community can benefit from your improvements.
Open source philosophy:
⇒ The right to make copies of the program, and distribute these copies.
⇒ The right to have access to the software’s source code, a necessary preliminary before
you can change it.
⇒ The right to make improvements to the program.
Table 1.1: Similarities of the FS/OS philosophies (source: Stallman, 1999; Perens, 1999)
differences between both types of projects. A recent survey of free software and open source
developers has identified significant differences in the way developers see themselves (Ghosh
et al., 2002, chapter 4). Figure 1.2 shows that especially those who assign themselves to the
Free Software community, strive for a sharp distinction between their community and the open
source software community. On the other hand, answers from members of the open source
software community correspond approximately with the average distribution of answers, and
indicate indifference toward the distinction between the two types.
This research is mostly concerned with the processes of organization, specialization and
code-ownership. These aspects are assumed to not be related to the motivation of developers, or
their ethical convictions. For the purpose of this study, free software and open source software
projects are therefore regarded as equivalent. This work will therefore generally use the more
popular term “open source”.
The next sections present the relevant theoretical framework on which this research
is based.
Figure 1.2: Describing the differences between FS and OSS (source: Ghosh et al., 2002)
2 Theoretical framework
Section 2.1 gives an overview of the evolving research on open source software development
over time. It also categorizes the research into three areas according to the unit of analysis,
clarifying which area this dissertation belongs to.
A critique of frequent erroneous assumptions in the sampling and data gathering methods
follows in Section 2.1.1.
A brief introduction into Social Network Analysis (SNA) based on existing literature is given
and its usefulness in connection with open source research is presented in Section 2.2.
A section about previous research on the organization of OS projects is presented. A subsection focuses on literature applying SNA methodology to open source projects.
2.1 A history of research on open source
Most academics discovered their interest in free and open source software development between 1998 and 2000. Two major events raised the awareness of the general (i.e. non-geek)
public of the various FLOSS movements:
The first event was the publication of Eric Raymond’s “The Cathedral and the Bazaar” on
the web (it was presented to the public on May 21, 1997 during Linux Kongress and was later
published as a book (Raymond, 1999a)). It is one of the first attempts to describe and analyze
the workings of a free software community. In this work, Raymond described what he termed
Linus’s Law: “Given enough eyeballs, all bugs are shallow.”
The second major event consisted in the announcement of Netscape Communications to
release the source code of their web browser suite under a free license to the public (Netscape,
1998) as already described in Section 1.3.1 on page 7, and related to this event, the coining of
the term open source.
While free software had always existed, it had so far occupied a niche which did not seem interesting to researchers and commercial participants. With the release of a commercial closed
source software product (although it had earlier been distributed without charge) as an open source
project, the first significant commercial player demonstrated its interest and indeed bet its existence on the power of free and open source code. Moreover, the term open source was coined
in order to promote the various aspects of free software without being hindered by a)
the ambiguous meaning of free in the English language and b) the negative connotation
that arises when managers associate their ’opened’ product with a “free as in beer” (i.e. gratis) product.
Some of the first researchers who recognized OS as interesting and relevant to study came
from the field of innovation research: e.g. von Hippel and Lakhani (2000) looked into innovation through lead users1.
Tuomi (2000) recognized that organizational, institutional, economic, cultural, and cognitive
aspects of the open source development model would be interesting to study. Also, in that
early stage, Lerner and Tirole (2000) were among the first researchers trying to explain the
motivation of developers to participate without direct monetary benefits.
Since then, research into free and open source software development has been conducted
by academics from many disciplines, ranging from anthropologists (Zeitlyn, 2003) to software
engineers (Scacchi, 2002), and was often performed in an interdisciplinary manner. Many of
the first studies were explorative single case studies, while later studies also performed analysis
on multiple projects. By 2004, so many studies had been conducted (as of Jan 13, 2005,
http://opensource.mit.edu lists 198 published articles and working papers alone)
that a review article was published by Rossi (2004).
1 The concept of ’Lead Users’ was introduced by Eric von Hippel in the mid 1980s. He defined lead users as those users who display the following two characteristics: 1) They face the needs that will be general in the market place, but face them months or years before the bulk of that marketplace encounters them. 2) They are positioned to benefit significantly by obtaining a solution to those needs.
Research category: Motivations
Unit of analysis: Individual developer
Research questions: Why do programmers dedicate their free time to these projects without being paid? Why do organizations contribute resources to advance projects without direct monetary rewards?
Research category: Processes
Unit of analysis: Single OS project
Research questions: How do open source projects work? What processes go on? How are decisions made? Is there a ’life cycle’ of open source projects? How is work organized and are tasks distributed?
Research category: Competitive Dynamics
Unit of analysis: All OS projects / ’Commercial’ projects
Research questions: What makes some projects more successful than others? What are the competitive advantages of open source projects over commercial projects?
Table 2.1: Categorization of FLOSS research (adapted from von Krogh and von Hippel, 2003)
Table 2.1 presents an attempt to categorize FLOSS research into three broad categories: motivations, processes, and competitive dynamics. Those can be distinguished by their unit of
analysis. The first category, “motivations”, investigates the motivation of single developers.
The second category, “processes”, examines processes of a specific single project. The last category, “competitive dynamics”, looks across the boundary of a single project at the ‘universe’
of all closed and open source projects, identifying success factors.
Motivations Most early contributions to open source research were dedicated to explorative
and descriptive case studies, describing more or less prominent open source projects and examining the motivation of those volunteering, unpaid hobbyists2 (Harhoff, Henkel, and von
Hippel, 2003; Hertel, Niedner, and Herrmann, 2003; Hars and Ou, 2000; Osterloh and Rota,
2003). There has been much speculation about the true motivations. Two main schools of
2 Back then, research mostly (and correctly) assumed that all participants were volunteering, unpaid hobbyists. By now, the share of paid developers has risen significantly (see e.g. Ghosh et al., 2002), yet much research on motivation has failed to take this into account. A first attempt to distinguish these types has been made by West and O’Mahony (2005).
thought have emerged from this discourse:
The first, mostly propagated by economists, argues that developers are rational actors, and
gives one main reason for their participation: The true motivation of developers is their increased labor market value due to their participation in prominent projects (Lerner and Tirole,
2002b; Prufer, 2004). Others have argued to the same effect through the use of signaling theory (Lee, Moisa, and Weiss, 2003; Stenborg, 2004). According to this theory, developers signal
competency through a high reputation among their peers to potential employers. It is argued
along the same line that reputation within the developer community serves as an important factor contributing to the motivation (explaining why being the founder of a project is attractive).
These arguments are supported by the fact that indeed many prominent developers have been
subsequently hired by companies such as Novell, RedHat, or IBM. The latter even patented a
method of paying open source developers (The Enquirer, 2004).
The second school of thought focuses on the existence of intrinsic motivation (Bitzer, Schrettl,
and Schroder, 2004). Followers of this school argue that OSS development takes place in a gift
giving culture (Zeitlyn, 2003; Bergquist and Ljungberg, 2001; Hemetsberger, 2004), and that
“motives of hobbyist evolve over time; most join the community because they have a need for
the software and stay because they enjoy programming in the context of a particular community” (Shah, 2003).
Others deem trust an important factor enabling contributions, as it fosters intrinsic motivation (Osterloh and Rota, 2004a,b). Ferraro and O’Mahony (2003) have looked into the “web
of trust” between developers by participating in key signing parties3 of developers of the
Debian project. The importance of trust, however, is not universally recognized. While Jarvenpaa and Leidner (1999) had shown that swift trust, a quickly established yet temporary and
fragile form of trust, emerges in virtual teams, some have argued that trust in open
source projects is not necessary at all (Gallivan, 2001).
3 Key signing parties are events where developers verify the authenticity of their encryption keys through face-to-face contact.
Some large scale surveys have been conducted, emphasizing the existence of intrinsic motivation, but also acknowledging the existence of extrinsic motivation (Lakhani, Wolf, Bates,
and DiBona, 2002; Ghosh et al., 2002), thus indicating that no single school of thought is
able to explain the full motivation of developers; the truth might rather lie in a combination of
both intrinsic and extrinsic motivation, as e.g. Osterloh and Rota (2003) acknowledge.
While the area of motivation is interesting and fascinating, interest has also been raised in
organizational issues and processes that govern OS projects.
Processes The second category, Processes, uses a specific project as unit of analysis. It
is concerned with the inner workings of a project and its organization. Such processes might
for instance be how newcomers join the project (von Krogh et al., 2003) or how programming
resources are allocated within a project (Dalle and David, 2003; Dalle, David, Ghosh, and
Steinmueller, 2004). Many have seen OS development as a new mode of innovation (von
Hippel and von Krogh, 2003; Osterloh and Rota, 2004b), production, software development,
or even as a generic model for organizing (Ljungberg, 2000).
Some have looked into the specific characteristics of OS development communities which
make it so different from other settings. E.g. Lanzara and Morner (2003) found that “technology, rather than formal or informal organization, embodies most of the conditions for governance in open-source software projects, hence becoming a critical pathway to the understanding
of collective task accomplishment, coordination and knowledge making processes.”
Research on project governance, coordination, and the social structure of the communities is
part of this category. A review of literature concerned with the organization and social structure
of projects can be found on page 26.
This work is deeply rooted in this second category, examining the areas of code that developers elect to work on. It will not question the motivation of these developers (see e.g. von Krogh,
Haefliger, and Spaeth, 2003 for a potential explanation). It will also not examine project success (the third category) of the sample projects. This would be a different research question in
itself.
Competitive dynamics The third category ’competitive dynamics’ looks beyond the scope
of a specific OS project. It is concerned with the success factors of open source projects compared to a) other open source projects competing for the same developer resources and b)
commercial closed source projects.
The success of open source projects is very difficult to measure (Crowston and Scozzi,
2002), as projects have very different purposes and various potentially useful measures of
project success are difficult or even impossible to gather. Some researchers therefore focus
on very specific, and restricted, measures of success, such as counting the added number of
lines of code (LOC), while others try to come up with composite variables consisting of various success factors. A working paper by Healy and Schussman (2003) is also concerned with the success of projects in the form of development
activity. They observe a snapshot of about
46,000 projects hosted by Sourceforge and look at the skewness of six activity measures
across all projects.4
MacCormack, Rusnak, and Baldwin (2004) compare the software architecture of the Linux
kernel with that of the Mozilla web browser, originally developed as closed source, when it
was released to the public. They find that software developed in an open source mode is more
modular than software developed as closed source. They also find that the Mozilla code base
was restructured and became more and more modular over time after it was released to the
public as open source.
Lerner and Tirole (2002a) approach the success factors of projects from a legal perspective:
They perform an empirical analysis of the determinants of license choice using nearly 40,000
projects. They find that projects geared toward end-users, such as games, tend to have restrictive
licenses, while those oriented toward developers, those designed to run on commercial operating systems, and those which are geared towards the Internet are less likely to have restrictive
licenses.
Another approach to examine the success of the open source software development model is
4 The validity of their interpretation needs to be considered with caution, however; e.g. they neglect that about a third of all projects hosted on Sourceforge are dormant.
the reuse of knowledge in the form of software components from external projects (von Krogh,
Spaeth, and Haefliger, 2005). It is found that projects frequently reuse components in order to
get their code working fast and to prevent the “re-inventing of the wheel”.
By its nature, research from this third category is based on multiple cases or
empirical comparisons of large samples. In my opinion, results from multiple-project studies
have often become skewed due to common mistakes in the sample selection and data gathering
methods. As this work relies on quantitative data from multiple cases as well, I will summarize
the most common pitfalls in order to validate my own research design.
2.1.1 Common pitfalls - A critique of empirical OS research
So far, most case studies have only been concerned with a selected few prominent projects,
such as the Linux kernel, the Apache web server, or the Mozilla web browser suite. Although
it is very insightful and interesting to study these cases, it should be noted that they are far
from the ’average’ open source project and might as such, although popular and successful, not
exhibit patterns of ’common’ open source projects.
They are extremely large (compared to other open source projects), successful, and are those
projects which are the closest to a formal organization that one can have. Most of the core
developers in these projects have been exposed to so many OS studies, been scrutinized and
interviewed so many times, and have participated in so many conferences on the topic, that
they might suffer from the panel effect, i.e. exhibit a different behavior than a non-observed
developer would.5
A high share of the contributions to these projects comes from paid developers of companies, and their organization is mostly managed by organizational entities rather than individuals. Therefore, drawing conclusions from these non-representative cases to explain ’how open
source works’, or to examine the motivation of contributors, needs to be done very carefully. Little thought seems to have been given to these issues in case studies involving those
5 This is similar to Heisenberg’s uncertainty principle in quantum mechanics stating that “the uncertainty principle involved the perturbation to a particle’s state by a measurement of one variable, which affects one’s ability to predict the outcome of a subsequent measurement of the conjugate variable” (Raymer, 1994).
projects.
Much empirical research on open source projects looks exclusively at projects which are
hosted on Sourceforge. This fact alone skews the sample of projects, as these projects for
instance tend to be centered around the English speaking community, neglecting e.g. Asian
projects (Bates, 2003). Sourceforge.net itself was originally founded as an incubator for small
projects which could not afford an infrastructure of their own. Projects hosted by Sourceforge
tended to be smaller projects without a supporting organization to provide the technical infrastructure (at least this used to be the case until being represented, and thus being search-able,
on Sourceforge became a major promotional advantage). A typical study whose results need to be
interpreted very carefully is the working paper by Healy and Schussman (2003).
They analyze a complete snapshot of all Sourceforge hosted projects, detecting a high skewness in various activity measures. Although their findings are in line with those of others who found
power-law distributions across OS projects (e.g. Madey, Freeh, and Tynan, 2002), it has been
neglected that, according to Jeff Bates (employee of the company providing Sourceforge), at
least a third of the projects hosted there are dormant, and many were simply used as a dumping
ground for some code, without the intention of getting a “real” project started.
Alternative hosting sites are mostly ignored by researchers, for instance Savannah (http://savannah.gnu.org), which is dedicated to hosting projects
which are part of the GNU project, or Berlios (http://berlios.de). This limits
the representativeness of such samples, as projects hosted there (or hosted independently)
might have different characteristics than those hosted by Sourceforge.
Large projects (often older than Sourceforge itself) frequently have their own infrastructure
(e.g. the Linux kernel, Mozilla, and Apache are not hosted on Sourceforge). If these large
and/or old projects are hosted on Sourceforge, it happens (as reported in an interview with a
developer of one such project) for promotional reasons only. These projects, while listed, either
use only a fraction of the facilities Sourceforge offers, such as bug databases, mailing lists,
source code patch trackers, etc., or do not use them at all. For instance, many developers ignore
the bug database offered by Sourceforge and rely on mailing lists only to report bugs (or the
other way round). Yet, users might still file bugs into the bug database, because the project
administrator has not bothered to explicitly turn off this feature. Similar concerns can also be
raised for other aspects, such as the number of official developers and the categorization of the
project status (alpha, beta, ..., mature). A study by Krishnamurthy (2002), examining 100
mature projects on Sourceforge, has been heavily criticized for its methodology and
conclusions in a subsequent letter to the editor6 (DeMaggio, 2002).
Some researchers have recognized the danger of blindly relying on data gathered from Sourceforge, e.g. Howison and Crowston (2004) describe the “perils and pitfalls of
mining Sourceforge”, yet it remains an often neglected issue.
The section on methodology will explain in more detail how these issues have been taken
into account in this study.
The next section contains a brief introduction to social network analysis, before social
structure research and social network analysis in OS projects are presented in more detail.
2.2 Social Network Analysis
Various interdisciplinary strands of development have contributed to the creation of Social
Network Analysis (SNA). This section does not present an in-depth introduction to SNA; it
merely gives a brief introduction for those unfamiliar with the topic. Interested readers are
referred to e.g. Scott (1991) for a good introduction and overview.
In the following, the three main research streams on which Social Network
Analysis has been founded are briefly presented (Freeman, White, and Kimball, 1992; Scott, 1991, p.7):
SNA descended from Gestalt theory, which looked into group structure and the flow
of information and ideas through groups, using laboratory methods. Building upon this work,
sociometric analysts developed the methods of graph theory. The sociogram, for example,
stems from this stream of research (it is generally attributed to Moreno (1934)).
6 “But your conslusions are suspect because of the completely unscientific nature of your analysis.” (DeMaggio,
2002)
A second stream of researchers, at Harvard in the 1930s and 40s, explored patterns of interpersonal relations and the formation of ’cliques’, building e.g. upon the work of the social anthropologist Radcliffe-Brown and, through him, Durkheim. A seminal work by these researchers
was the study of the Hawthorne electrical factory (Roethlisberger and Dickson, 1939), which
independently created sociograms as well (without references to the first stream of research).
Finally, the third pillar on which SNA was built is the work of social anthropologists at Manchester
University. They examined the structure of ’community relations in tribal and village societies’,
looking into the issues of conflict and power. This group extended former studies by systematizing the use of SNA. They introduced the notion of ’total’ vs. ’partial’ networks (Barnes,
1954, p. 43), and measures such as the ’density’ of a network (Mitchell, 1969).
Density is one of the important measures to characterize a network. It measures the (relative)
number of linkages between its nodes and gives an overall idea of how densely connected a
network is.
The centrality of a specific node (here: developer) characterizes how central or peripheral
that node is, giving indications of its importance (Freeman, 1979). Different operationalizations of node centrality exist. Frequently, these are based on either the degree
(the number of connections of a node) or on its betweenness (Freeman, 1977), which
indicates whether a node lies ’between’ many other nodes or is peripheral.
Based on the differences in the centrality of all nodes, the centralization indicates
whether the network as a whole is very centralized (star shaped) or rather decentralized.
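The measures just described can be illustrated with a minimal sketch. It assumes an undirected, unweighted network stored as a 0/1 adjacency matrix, and it uses the degree-based variant of Freeman's centralization; the function names and the two toy networks (a star and a ring) are illustrative assumptions and not the operationalization used later in this study.

```python
from itertools import combinations


def density(adj):
    """Share of all possible undirected ties that are actually present."""
    n = len(adj)
    if n < 2:
        return 0.0
    ties = sum(adj[i][j] for i, j in combinations(range(n), 2))
    return ties / (n * (n - 1) / 2)


def degrees(adj):
    """Degree of each node: the number of ties it has."""
    return [sum(row) for row in adj]


def degree_centralization(adj):
    """Degree-based centralization: 1.0 for a star, 0.0 if all degrees are equal."""
    n = len(adj)
    if n < 3:
        return 0.0
    deg = degrees(adj)
    max_deg = max(deg)
    # Sum of differences to the most central node, divided by the
    # maximum that sum can reach in a network of n nodes (a star).
    return sum(max_deg - d for d in deg) / ((n - 1) * (n - 2))


def star(n):
    """Toy network: node 0 is tied to every other node."""
    adj = [[0] * n for _ in range(n)]
    for j in range(1, n):
        adj[0][j] = adj[j][0] = 1
    return adj


def ring(n):
    """Toy network: every node is tied to its two neighbours."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        j = (i + 1) % n
        adj[i][j] = adj[j][i] = 1
    return adj


if __name__ == "__main__":
    for name, net in (("star", star(5)), ("ring", ring(5))):
        print(name, density(net), degree_centralization(net))
```

The star and the ring have similar densities but opposite centralization, which is why both measures are needed to characterize a network.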
The importance of dense linkages had always been understood. However, in 1974, Granovetter, a prominent figure in the area of SNA, published his classic contribution “Getting a job”
(Granovetter, 1974). In this, he elaborates on the ’strength of weak ties’, which truly provide
new information from unknown sources, while too densely linked network connections provide
mostly redundant, already known information (see also Granovetter, 1973).
One of the problems in a network analysis is usually the identification and specification of
the network boundaries (Laumann, Marsden, and Prensky, 1992), as most of the time only
’partial networks’ can be studied. Fortunately, this is no issue when applying the analysis to
open source CVS data, as its boundaries are clearly defined and the analysis is able to cover the
’total network’.
Figure 2.1: Publications indexed by “Sociological Abstracts” containing social network in the
abstract or title (source: Borgatti and Foster, 2003)
Social Network Analysis has become an important method in sociology and innovation research over the last years. Borgatti and Foster (2003) show the exponential growth of research
on social networks (see Fig. 2.1). Coulon (2005) provides a comprehensive literature review
on the use of SNA in previous and current innovation research.
2.3 Research on organization in OS projects
This section reviews literature concerned with a part of the second category of research, processes; namely the social structure and organization of open source projects. Although this
study proceeds in an inductive manner, a brief overview of current research in this area is
needed in order to relate this research to the ongoing efforts in this area.
The study of social structure in OS software projects has been approached by researchers
from various disciplines with a plethora of methods. One of the early attempts was made
by Koch and Schneider (2002), who examined the CVS code modifications from the GNOME
desktop project in a case study. They examined the number of lines of code added per developer
(as a measure of project success). They found that on average each file was modified by only
1.8 distinct developers.
Reputation plays an important role concerning the social structure of communities. Von Krogh
et al. (2003) find that a reward in the form of peer reputation plays an important role in overcoming the costs of contributing to a project. Stewart (2004) is also concerned with reputation
in a developer community: He examines the role of tenure in establishing social status and
finds that the social status is “frozen” after a while, forcing members to act quickly if they want
to achieve a high social status.
Ferraro and O’Mahony (2003) examine the role of face-to-face communication of developers
(namely from the GNU/Linux Debian project) and the resulting social network. They find that
the more developers a participant had met in ‘the real world’, the more likely he was to vote in a
leadership election. Also, the more a developer participates in on-line discussions, the more
likely he is to be elected in such a leadership election.
Interestingly enough, economists are also looking into the social structure of open
source projects. One would expect economists to be most puzzled by the contribution
of developers without direct monetary benefits. However, e.g. Dalle et al. (2004) state that as
interesting as research on the motivation of developers might be, “the modes of organization,
governance and performance of F/LOSS development – viewed as a collective distributed mode
of production” should also be looked into. Open source thus seems to puzzle them beyond
the motivation of developers.
Sagers, McLure Wasko, and Dickey (2004) propose to build upon a theory of network governance, which was introduced by Jones, Hesterly, and Borgatti (1997). Their model uses project
success as a dependent variable, using a composite measure from various success dimensions.
Coordination is but one independent variable in their model. Their survey contains questions
pertaining to two aspects of coordination: expertise location, and administrative coordination,
i.e. managing of tangible and economic resources. However, no empirical results are presented
in their working paper yet. Given that many of the constructs they use have not been defined
in the area of OS projects, it will probably take some more effort to verify their model. Their
notion of “expertise location” builds upon a study by Faraj and Sproull (2000), who performed
a cross-sectional examination of 69 (closed-source) software development teams. They found
that the expertise coordination processes (described by them as “recognizing where expertise
is needed”, “knowing where expertise is located”, and “bringing expertise to bear”) led to an
increased team performance.
Some have been looking into the apparently self-organizing nature of open source projects.
Van Wendel de Joode, De Bruijn, and Van Eten (2002) examined how self-organizing developer communities deal with various property regimes. Benkler (2002) creates a theory of
“Commons-based peer production”, using transaction cost economics, where people elect to
work on the tasks they are best at. However, as already mentioned in the introduction, the examination of self-organization has traditionally been approached mostly by natural scientists,
namely from biology (Haken, 1977; Kauffman, 1995; Coffey, 1998). It was also biologists,
Maturana and Varela, who created autopoiesis7 as a theory in order to explain social systems
(Whitaker, 1995). The theory has since been used, e.g. to explain the creation of organizational knowledge (von Krogh and Roos, 1995). Although self-coordination of social systems
has been studied in the context of autopoietic theory (see Whitaker, 1995), apparently no researcher has so far tackled open source communities as an example of a social system from
this point of view. As this research focuses neither on a system dynamics point of view nor on
self-maintaining aspects of the community, autopoiesis will not be discussed further. Although
many publications mention that OS projects are self-organized, there is still a lack of research
focusing on this area. This research looks into the areas of code that developers elect to work
on, thus explaining a part of self-coordination.
7 Autopoiesis (literally self-production) is the process whereby an organization produces itself. An autopoietic
organization is an autonomous and self-maintaining unity which contains component-producing processes.
The components, through their interaction, generate recursively the same network of processes which produced
them. An autopoietic system is operationally closed and structurally state determined with no apparent inputs
and outputs. (Heylighen, 2005)
2.3.1 SNA in open source
Some of the literature is explicitly concerned with the use of Social Network Analysis in open
source projects. One of the first to propose the use of SNA in this area was the sociologist Wellman who gave “a sociologists’ perspective on the use of social network analysis of computer
networks” (Wellman, 1996).
In her Master’s Thesis, te Meerman (2003) applies SNA methodology to open source projects
using interview data in order to examine the “ITIL” concept8. Unfortunately, her results help
little to observe the community structure, or a project’s organization.
So far, I know of only a few studies which apply Social Network Analysis to open
source projects using empirical evidence besides interviews:
Madey et al. (2002) perform an SNA analysis on the whole population of projects hosted
on Sourceforge. Generally, they observe that both the project size (in terms of developers) and
the number of projects per developer obey power laws (confirmed by the findings of
Healy and Schussman, 2003). Performing a very high-level analysis looking at inter-project
participation, they construct a network of developers who are connected if they work on the
same project. In the resulting network, they observe one cluster covering 25% of all developers,
and some much smaller clusters. The paper does not examine the organization of single projects
and does not perform a deeper analysis besides the construction of these networks.
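The kind of inter-project network described above can be sketched as follows. This is an illustrative reconstruction and not Madey et al.'s code: it assumes project membership is given as a mapping from (hypothetical) project names to developer names, links developers who share a project, and reports the share of developers in the largest connected cluster using a simple union-find structure.

```python
from collections import defaultdict


def largest_cluster_share(membership):
    """Share of developers in the largest connected cluster.

    membership: dict mapping a project name to an iterable of developer
    names. Two developers are connected if they share a project.
    """
    parent = {}

    def find(x):
        # Find the representative of x's cluster, compressing the path.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for developers in membership.values():
        developers = list(developers)
        for dev in developers:
            find(dev)  # register developers, including those working alone
        for dev in developers[1:]:
            union(developers[0], dev)

    if not parent:
        return 0.0
    sizes = defaultdict(int)
    for dev in parent:
        sizes[find(dev)] += 1
    return max(sizes.values()) / len(parent)


if __name__ == "__main__":
    # Hypothetical toy data: bob links projects p1 and p2.
    toy = {
        "p1": ["ann", "bob"],
        "p2": ["bob", "cid"],
        "p3": ["dee"],
        "p4": ["eve", "fay"],
    }
    print(largest_cluster_share(toy))
```

In the toy data, three of the six developers end up in one cluster because a single developer works on two projects; such linking developers are what create a large cluster in a population of otherwise separate projects.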
One of the relevant studies in this area was performed by Crowston and Howison (2004).
They examine 120 projects hosted on Sourceforge. Applying Social Network Analysis to the
bug database activities, looking at communication centralization, they find that some projects
are very centralized while others are very decentralized. They conclude that projects do not
automatically gain all advantages that are usually attributed to OS projects by “going open”.
Two working papers, both published in June 2004, are highly relevant to this research as they
apply SNA methods to CVS data.
8 ITIL stands for IT Infrastructure Library and is described by te Meerman as a collection of best practice approaches
to organize IT projects.
The first paper (González-Barahona, López, and Robles, 2004) presents a representation of
the Apache project (http://apache.org) based on CVS data. The goal of this paper is to
present the connections between CVS modules9 for very large ’libre’ projects. Unfortunately,
the version of the working paper which is available seems to be at a rather early stage. It gives
visualizations of the resulting network, but does not present any further results or data analysis.
This type of research examines a very high level of coordination: the relationship between CVS
modules might be an interesting topic in itself, but as few projects use more than one module it
will mostly be interesting for very large, already modular projects. It is, however, a first attempt
to present a dynamic view of a software architecture.
The second working paper by López, González-Barahona, and Robles (2004) is the first
publication to propose applying SNA research methods to CVS source code10 and makes a first
attempt at the operationalization of some SNA measures. They propose creating a “committer
network” and a “module network” (this research will indeed create and focus on a committer
network of all projects; see methodology section). They provide the graphs for the degree
and the clustering coefficient for three sample cases in their working paper. Unfortunately, no
further information on their data is given, nor is a deeper analysis performed. This research leans on and
draws from their operationalization where appropriate (see the methodology section for more
information).
In summary, researchers have recognized the usefulness of SNA to analyze open source projects on many levels. So far, many publications are rather short working
papers or conference presentations, which (although representing a lot of tedious work) do not
go very deep in their analysis. Besides Crowston and Howison (2004) and López et al. (2004),
which proved to be the most relevant publications for this study, there seem to be few attempts to
use SNA to analyze the inner workings of a project community.
9 A CVS code repository for a large project can be divided into separately maintained modules. Note that this is
not typically the case for all projects, e.g. all projects in this research’s sample maintain their source code only
in one module. In order to create separate CVS modules, the software architecture needs to be designed in a
very modular fashion.
10 Although the decision to use this approach in this dissertation was made independently from their paper and
precedes its publication date.
3 Empirical Section
3.1 Methodology
3.1.1 Sample Selection
The number of projects in the sample should be large enough to allow the comparison of statistics (i.e. the number of projects should be around 30¹). On the other hand, it should be small
enough to allow an understanding of each of the selected projects. As many projects are special
in that tools are inconsistently used, old data is dumped into the source code repositories, etc.
(Howison and Crowston, 2004), it was deemed necessary to get a general understanding of how
a project was founded, what it accomplishes, and to get a basic insight into how the project is
organized.
In addition to the specific inclusion criteria for projects as listed further down, some general
considerations were made in order to create a sample as representative as possible, falling into
two categories: 1) maximizing the variety of projects and 2) avoiding the common pitfalls as
described in the above critique (Section 2.1.1 on page 22).
1 A sample of 30 cases is usually deemed sufficient to be able to apply the central limit theorem if necessary,
thus deriving a single normal distribution (or under some circumstances a Lévy stable distribution) from the
summed (independent) variables (Feller, 1966; Hall, 1982).
1. Maximizing the variety of projects with respect to their:
• Type of application: Instead of focusing on a specific type of application, such as
servers, desktop applications, or functionality-providing libraries, it was attempted to
include as many types of applications as possible. The final sample includes text
processors, games, servers, libraries, etc.
2. The following pitfalls should be avoided:
• As described, many case studies have focused on the most prominent projects (especially
Apache, Mozilla, and the Linux kernel). I have already elaborated why the choice
of these projects, which have been put extensively in the spotlight of media and
research and are heavily sponsored by companies, is not appropriate from my point
of view. Although coordination of prominent company-guided or initiated projects
could be interesting to examine as a special case, no such project was included in
this study (although some projects of the final sample are well known to
open source connoisseurs).
• The second danger is the sole use of projects which are hosted by only one organization. In order to achieve a good and valid sample, this research draws its sample
from projects hosted on Sourceforge, Savannah, Berlios, and from those providing
their own infrastructure.
• The third main criticism is the danger of inconsistently used tools across projects.
This research does not rely on consistency of mailing lists, bug databases, etc. By
using only projects which provided a full CVS log history and by getting familiar
with each of the projects in advance (e.g. to ensure that no gatekeepers were being
used, and that the CVS code repository was used as primary means of code administration), it was assured that the retrieved CVS log data was indeed comparable
across projects.
The specific necessary inclusion criteria for a project were:
Code availability: The source code of the project needed to be available on-line. Moreover,
all source code modifications needed to be archived together with the date, file names,
and author. As this information was to be extracted in an automated manner, only those
projects with a CVS source code repository were considered for inclusion, to ensure a
consistent treatment of all sample projects.
Age: This research does not perform a longitudinal analysis. However, coordination does not
show in a snapshot of the data, as obviously not all developers modify files at the same
time. It was therefore deemed necessary to use projects which had already been active for
some time, in order to show coordination and mutual modifications of other developers’
files. An arbitrary cut-off point had to be chosen, as there is no experience yet of how long
it takes for “typical” patterns to evolve. It was decided that only projects which had
been active for at least one year should be taken into consideration.
Size: In order to examine coordination, a project obviously needed to have more than one developer working on it. Other considerations regarding suitable project sizes were already
given above.
Gatekeepers: Some large projects have divided up their software architecture into certain areas. Those areas are ’guarded’ by formally announced (e.g. through an entry in a MAINTAINER list) gatekeepers or maintainers. It is their task to collect patches, evaluate them,
and commit those that are deemed worthwhile. While this helps to coordinate the
work in very large projects, it distorts the CVS logs: those marked as authors of a CVS
modification are not necessarily those who wrote the patches.2 Care was therefore
taken not to choose projects with formal maintainers of areas of code.
The selection resulted in a sample of 29 open source projects. The accumulated number
of developers is 1,150, with an average of 39.7 per project (median of 19). The total
number of observed file modifications is 740,677. An overview of the selected sample is given in
Table 3.1.
2 It happens in every project that some patches are contributed by non-developers e.g. through the developer’s
mailing list and finally committed by a developer with the permission to modify the source code repository.
However, this cannot be detected without an in-depth manual scrutinizing of the mailing list and code modification comments.
No conscious effort was made in advance of the sample selection to include only projects
of a certain size (apart from the inclusion criterion of >1 developer). The final sample
contains mostly projects in the range of approximately 5 to 40 developers. It was seen as beneficial
to have projects of different sizes in the final sample, as this helps to show differences between small
and large projects. On the other hand, their sizes should not be too different: Social Network
measures can be difficult to compare if the networks are of very different sizes (Scott, 1991).
Moreover, only very few projects are larger than this (Madey et al., 2002; Healy and Schussman,
2003); these belong mostly to the very prominent cases, which were avoided (as explained
above).
3.1.2 Analysis
After the sample of projects had been identified, the source code modification logs were retrieved from the corresponding code repositories. This includes all modifications from the creation of the code repository until the end of the recorded time frame (which is given in
the respective project presentations). Using scripts written in Perl, the log entries
were stored in a local database for further analysis.
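As an illustration of this retrieval step, the following sketch parses text in the usual layout of `cvs log` output and stores one row per file modification. It is written in Python rather than Perl and is not the script used in this study; the sample log, the table name `modification`, and the function names are assumptions made for the example. The parser is deliberately simple and would be confused by a log message that itself starts with "revision ".

```python
import re
import sqlite3

# Matches the per-revision line, e.g.
# "date: 2004/01/05 12:00:00;  author: alice;  state: Exp;  lines: +3 -1"
REVISION_INFO = re.compile(r"^date: (?P<date>[^;]+);\s+author: (?P<author>[^;]+);")

# Hypothetical sample in the layout of `cvs log` output.
SAMPLE = """\
RCS file: /cvsroot/demo/src/main.c,v
Working file: src/main.c
head: 1.2
----------------------------
revision 1.2
date: 2004/01/05 12:00:00;  author: alice;  state: Exp;  lines: +3 -1
Fix crash on startup
----------------------------
revision 1.1
date: 2003/12/01 09:30:00;  author: bob;  state: Exp;
Initial import
=============================================================================
"""


def parse_cvs_log(text):
    """Return a list of (file, revision, date, author), one per modification."""
    records = []
    current_file = None
    revision = None
    for line in text.splitlines():
        if line.startswith("Working file: "):
            current_file = line[len("Working file: "):].strip()
        elif line.startswith("revision "):
            revision = line.split()[1]
        else:
            match = REVISION_INFO.match(line)
            if match and current_file and revision:
                records.append((current_file, revision,
                                match.group("date").strip(),
                                match.group("author").strip()))
                revision = None  # wait for the next "revision" line
    return records


def store(records, conn):
    """Store the parsed modifications in a local SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS modification "
                 "(file TEXT, revision TEXT, date TEXT, author TEXT)")
    conn.executemany("INSERT INTO modification VALUES (?, ?, ?, ?)", records)
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store(parse_cvs_log(SAMPLE), connection)
    for row in connection.execute(
            "SELECT author, COUNT(*) FROM modification GROUP BY author"):
        print(row)
```

Keeping one row per modified file, rather than one per commit, preserves the file-level information on which the later network construction relies.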
This database served as the source for further analysis: A number of tools written in
PHP were used, e.g. to extract descriptive data about the projects, such as the number
of committing developers. Additional information, such as the programming language and
project description, was taken either from the public project websites or extracted from the
source code.
The CVS log data also served as data source, in order to construct the incident matrix to be
used in the SNA analysis 3 and subsequently the adjacency matrix, representing an undirected
and unweighted network of developers.
3 The incident matrix lists all authors/files and the number of modifications
López et al. (2004) propose to use all modifications a developer conducts in a certain directory (representing a module in the software architecture in their point of view). Although
the equivalence of a directory to a software module might be valid, it is an assumption about
software architecture in general, which is not being made in this study. Using directories would
also lead to a loss of information, as the data provides modification information about each file.
It has therefore been decided to use a modification of a specific file as incident. This results
in a finer grained analysis at the disadvantage of not being able to identify the relationship
between software modules (i.e. directories). As this research is not concerned with module
connections, this was deemed the better choice.

Project | Description
Abiword | WinWord-like word processor
Adonthell | Role Playing Game
AWstats | Powerful and featureful server logfile analyzer that shows you all your Web/Mail/FTP statistics
bison | General-purpose parser generator that converts a grammar description to a C program to parse that grammar.
BZflag | OpenSource OpenGL Multiplayer Multiplatform Battle Zone capture the Flag. 3D first person Tank Simulation.
CDex | CD-Ripper, thus extracting digital audio data from an Audio CD.
emacs | Extensible, customizable, self-documenting real-time display editor.
eMule | Filesharing client which is based on the eDonkey2000 network but offers more features.
Flightgear | A Flight simulator
Freenet | Platform for anonymous file distribution and retrieval
Gnomemeeting | Voice/Video conferencing application
Gnunet | A framework for secure peer-to-peer networking that does not use any centralized or otherwise trusted services. Allows anonymous censorship-resistant file-sharing.
GTK+ | Graphical ToolKit
Irate | Internet radio with user taste correlation
LAME | mp3 encoder.
mailman | Mailing list administration
mnet | Kind of Freenet written in Python, heritage from the MojoNation project
nano | GNU nano is a clone of the Pico text editor.
Ogle | DVD player capable of menus and navigation.
OpenSSL | provides encryption.
pango | The goal of the Pango project is to provide an open-source framework for the layout and rendering of internationalized text. Pango is an offshoot of the GTK+ and GNOME projects, and the initial focus is operation in those environments, however there is nothing fundamentally GTK+ or GNOME specific about Pango.
phpMyAdmin | handles the administration of MySQL data bases over the Web.
PostgreSQL | Object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California (Berkeley).
Smarty | Php template engine. Smarty cleanly separates your presentation elements (HTML, CSS, etc.) from your application code.
Stepmania | StepMania is a music/rhythm game.
tdb | A “Trivial Database”
TikiWiki | A Wiki and Content management system
wget | Retrieves files using HTTP, HTTPS and FTP
xerces | Xerces (named after the Xerces Blue butterfly) provides XML parsing and generation.
XFCE4 | A Window Manager (desktop) for Linux
Developers (column values in extracted order; alignment with the rows above is not preserved): 63, 13, 3, 18, 32, 177, 8, 32, 98, 15, 271, 18, 26, 24, 14, 11, 8, 19, 28, 11, 44, 7, 89, 6, 27, 22, 46
Table 3.1: Sample overview
A preliminary examination of the data showed that most projects feature a few common files
which are modified by nearly all developers, e.g. a change log, build scripts, or help
files. To account for these common files, a cut-off point of ten common files was chosen.
This means that a tie between two nodes (developers) in this network signifies that both
have modified at least the same ten files. Using cut-off points is a frequent and accepted procedure in SNA
(Scott, 1991).
The “number of modifications” of a developer as used throughout this work is calculated as
the sum of all modifications on all files. That means, that it can differ from the number of CVS
commits a developer performs (one CVS commit can contain modifications to multiple files).
An incident matrix can be used to create two different adjacency matrices (Scott, 1991;
López et al., 2004); however, for the purpose of this research only one of them makes sense:
the one showing which authors modified the same files. The resulting adjacency matrix contains all
developers (on both dimensions of the matrix), with a 1 indicating a connection between any
two developers and a 0 indicating no connection. This adjacency matrix is what López et al.
(2004) call a “committer network” and helps to identify connections between developers.
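The construction of the committer network can be sketched in a few lines. The sketch assumes the modification log is available as (author, file) pairs, one per file modification; the function name, the toy data, and the lowered cut-off in the example are illustrative assumptions. Collecting the files per author into a set corresponds to binarizing the incident matrix, and counting shared files corresponds to multiplying that matrix with its transpose before applying the cut-off.

```python
def committer_network(modifications, cutoff=10):
    """Build an undirected, unweighted developer network.

    modifications: iterable of (author, file) pairs, one per modification.
    Two authors are tied if they modified at least `cutoff` common files.
    Returns the sorted author list and the 0/1 adjacency matrix.
    """
    files_by_author = {}
    for author, path in modifications:
        # A set ignores how often an author modified the same file.
        files_by_author.setdefault(author, set()).add(path)

    authors = sorted(files_by_author)
    n = len(authors)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = files_by_author[authors[i]] & files_by_author[authors[j]]
            if len(shared) >= cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return authors, adjacency


if __name__ == "__main__":
    # Hypothetical toy log; a cut-off of 2 is used only to keep it short.
    log = [
        ("alice", "a.c"), ("alice", "a.c"), ("alice", "b.c"),
        ("alice", "ChangeLog"),
        ("bob", "a.c"), ("bob", "b.c"),
        ("carol", "ChangeLog"), ("carol", "c.c"),
    ]
    names, matrix = committer_network(log, cutoff=2)
    print(names)
    for row in matrix:
        print(row)
```

In the toy log, alice and carol share only the change log and therefore remain unconnected, which is the effect the cut-off is meant to achieve for commonly modified files.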
Constructing the second adjacency matrix would lead to a network where nodes represent
files which are more or less connected. This matrix could e.g. be used to create a functional
architecture of a software project. This analysis was not pursued further, as the goal
of this work is to study the relationship between developers and not files.
Once the adjacency matrices for all projects were created, the statistical software R, together with an
extension library for performing Social Network Analysis, was used to assist the data analysis.4
4 R is the open source equivalent of the commercial package S and can be obtained from http://r-project.org
Besides the number of modifications for each developer, measures from SNA that characterize a network such as its density, inclusiveness, and the degree of its nodes (i.e. developers)
were derived.
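Two of these measures can be illustrated on a 0/1 adjacency matrix. The sketch below is in Python rather than R and does not reproduce the extension library used in this study; the function names and the toy matrix are assumptions. Inclusiveness is taken here, following Scott (1991), as the share of nodes that have at least one tie.

```python
def degrees(adjacency):
    """Number of ties of each node (developer)."""
    return [sum(row) for row in adjacency]


def inclusiveness(adjacency):
    """Share of nodes that are connected to at least one other node."""
    n = len(adjacency)
    if n == 0:
        return 0.0
    connected = sum(1 for d in degrees(adjacency) if d > 0)
    return connected / n


if __name__ == "__main__":
    # Hypothetical network of four developers; the last one is isolated.
    toy = [
        [0, 1, 1, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(degrees(toy))
    print(inclusiveness(toy))
```

An isolated developer lowers the inclusiveness without affecting the degrees of the others, so the two measures capture different aspects of the same network.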
The next section presents the sample projects, and gives a brief description of each.
3.2 Sample
Abiword
Abiword is part of a larger project known as AbiSource, which was originally started by the
SourceGear Corporation. The goal of the project was the development of a cross-platform,
open source office suite, beginning with AbiWord, the project’s word processor. SourceGear
released the source code to AbiWord and a developer community quickly formed around the
project. SourceGear has since then stopped working on the project, but the Abiword development was continued independently. AbiWord runs on multiple platforms: Windows, Linux,
QNX, FreeBSD and Solaris. It is able to read and write industry standard document types, such
as OpenOffice.org documents, Microsoft Word documents, WordPerfect documents, Rich Text
Format documents, or HTML web pages.
While the graphical interface leans toward the WinWord text processing application, the
main Abiword program is very small and requires comparatively few resources to run, allowing
Abiword to be used on systems that are not considered “State of the Art” anymore. A variety of
plugins can be used to extend AbiWord’s functionality, ranging from Document Importers
to a Thesaurus, Image Importers, and a Text Summarizer (Abisource.com, 2004).
Although the Abiword project can also be found and its source code downloaded from
Sourceforge.net, the project provides its own infrastructure for web sites and source code repository (the Internet bandwidth is currently provided through the University of Twente). Abiword
was released under the GNU GPL license. The Abiword source code repository was recorded
from July 16, 1998 until April 23, 2004 and contains 52060 modifications in 5265 files conducted by 63 authors.
Adonthell
The Adonthell project is dedicated to the development and implementation of a role playing
game and was started in summer 1999 by Alexandre Courbot, James Nash, and Kai Sterker.
More specifically it creates “a graphics engine [. . . ], a set of tools and an actual, playable game
driven by that engine and built with those tools.” Its website emphasizes the fact that it is free
(→libre) software.
The developers claim that they “were united by the vague idea of creating a free role playing
game for Linux and without any experience in the field of software engineering when they
started the project.” As a governance principle it claims “autonomy in choice and execution of
the actual work, since there was no dedicated project manager” (Adonthell, 2004).
While the project’s main homepage can be found at http://adonthell.linuxgames.com,
the source code repository is hosted on the Savannah platform (provided by the GNU
project). True to the spirit of free software, the project releases the software under the GNU
GPL license. The recorded CVS time frame ranges from January 5, 2000 until February 9,
2004. The Adonthell source code repository contains 9,485 modifications in 1,349 files conducted by 13 authors.
AWStats
AWStats is a feature-rich tool that generates advanced web, FTP, or mail server statistics graphically. It takes the log files of those servers (various file formats are supported), analyzes them,
and prepares a report that includes statistics and graphs. AWStats has been designed
to work on very large files; it can also be called automatically to update statistics at a set
interval (AWStats, 2004).
The project was founded on October 29, 2000 on Sourceforge.net and uses its facilities for
website, mailing lists, and source code repository. AWStats is licensed under the GNU GPL.
The recorded source code modifications range from November 4, 2000 until January 17, 2005
and comprise 7,460 modifications in 1,310 files conducted by only 3 authors (making it the
smallest project of the sample in terms of developers).
bison
Bison is described on its web page as:
“a general-purpose parser generator that converts a grammar description for an
LALR context-free grammar into a C program to parse that grammar. Once you are
proficient with Bison, you can use it to develop a wide range of language parsers,
from those used in simple desk calculators to complex programming languages.”
(bison, 2004b)
This statement, rather cryptic for a software layman, makes it clear that the tool is intended
for software developers themselves; it is not an application intended for ‘end users’. The program
basically translates logical statements written in a specific syntax into a language that computers
can understand and execute. The name bison is inspired by the application to which it is
intended as an alternative: yacc (which stands for “Yet Another Compiler Compiler”).
The bison project uses savannah.gnu.org to administer its mailing lists and source code. As
it is part of the “GNU effort” it comes under the GNU GPL license. Being part of GNU, it is
also morally motivated: the bison manual states that “Software should be free” (bison, 2004a).
The CVS modifications range from the first commit on December 16, 1987 until January 17,
2005. The repository contains 10,607 modifications in 407 files conducted by 18 authors.
BZflag
The name BZFlag stands for Battle Zone capture Flag. It is a multi-player, multi-platform 3D
tank battle game, running on Irix, Linux, *BSD, Windows, Mac OS X and other platforms.
The game features complex and sophisticated graphics. Tanks drive in complex worlds which
follow the laws of physics, including e.g. a weather system (rain, snow, frogs).
The game can be played against other players over the Internet; it is based on a client/server architecture
(both client and server are included in the examined source code repository).
BZFlag is a comparatively popular project: according to its website, it is the third game hosted
on Sourceforge to reach 1 million downloads.
BZFlag uses the infrastructure provided by Sourceforge.net, although it has a website of its
own (http://bzflag.org). It was registered on March 4, 2000 on Sourceforge, which
featured the project as “Sourceforge Project of the Month” in April 2004. The observed source
code modifications range from March 5, 2000 until January 20, 2005. The BZFlag source code
is published under the GNU LGPL license and its repository contains 32,100 modifications in
2,164 files conducted by 32 authors.
CDex
CDex is a CD ripper for the Windows platform. That is, it extracts digital audio data from
an audio CD to a wav file and can subsequently convert it to many audio formats, such as
mp2, mp3, vqf, aac, and ogg (CDex, 2004).
The first code was written in 1998 (according to the copyright information in the source
code’s ‘README’ file) and made available through the website http://www.cdex.n3.net.
In December 1999, the project was transferred to Sourceforge and has used
Sourceforge’s infrastructure since then.
The recorded time frame of CVS activities spans from December 5, 1999 until February
10, 2004. The source code is released under the GNU GPL license. The Sourceforge project
summary page lists ten registered developers; of these, however, only four authors performed
the 10,106 modifications in 2,346 files.5
5
This “discrepancy” is evidence of how unsuitable and unreliable Sourceforge summary data is as a basis for
empirical research on OS projects.
emacs
The Emacs website characterizes its product as “the extensible, customizable, self-documenting
real-time display editor.” (Emacs, 2004)
Among free software enthusiasts this program is one of the two applications of choice
(emacs or vi) for doing everything text-related. Emacs is used to write articles, letters,
and books.6 It is also capable of sending e-mails and includes many features for use as an
editor for programming in various programming languages. It includes its own programming
language, Lisp, which enables the customization of emacs to extreme levels.
Emacs was originally created by the founder and philosophical leader of the free software
movement, Richard M. Stallman (RMS), and he still contributes to its development today.
Emacs was one of the very first applications programmed by ‘RMS’ for the GNU operating
system. He began with emacs, the text editor, to write programs, and gcc, a compiler to translate
the source code into computer binaries, as foundations for his new system. In keeping with his
belief that all software should be free, Stallman released Emacs under the GNU General
Public License (GPL).
Hosted on savannah.gnu.org, emacs is the dinosaur of the project sample: its observed CVS modifications range from April 18, 1985 to January 19, 2005. Measured by the
number of developers, emacs is the second-largest project in the sample. 117 contributors
performed more than 100,113 modifications on 3,175 files.
6
Trivia: As a matter of fact, a large proportion of this dissertation has been typed in emacs.
FlightGear
The FlightGear flight simulator project is a multi-platform, cooperative flight simulator development project. The goal of the project is to create a sophisticated flight simulator framework
for use in research or academic environments, for the development and pursuit of other interesting flight simulation ideas, and as an end-user application. Over 20,000 real-world airports
are included in the full scenery set, including correct runway markings and placement, and correct
runway and approach lighting.
The simulator supports many different aircraft types. Currently, the 1903 Wright Flyer, strange
flapping-wing "ornithopters", a Boeing 747, an Airbus A320, various military jets, and several light
singles can be used. FlightGear has the ability to model those aircraft and just about everything
in between. A number of networking options allow FlightGear to communicate with other
instances of FlightGear, GPS receivers, external flight dynamics modules, and external autopilot or
control modules (flightgear, 2004).
It should be noted that the recorded CVS activity, which ranges from September 10, 2002 until
February 10, 2004, does not mark the very beginning of the project. Earlier versions of
the software (available as compressed files through the FlightGear FTP server) show a creation
date of May 23, 1997 for the oldest file (the GNU GPL license), but were not captured in the
current CVS source code repository.
FlightGear maintains its own, independent infrastructure for website, mailing lists, and source
code repository. The current, observed repository contains 5,406 modifications in 1,108 files
conducted by eight authors.
Freenet
Freenet is software that enables the anonymous publication and retrieval of information on
the Internet in a way that prevents anyone from controlling or censoring this information. To achieve this, the network is a completely decentralized peer-to-peer network. Both publishers and consumers of information remain anonymous. All communications between Freenet nodes are encrypted
and are routed through other nodes to make it extremely difficult to determine who is requesting
the information and what its content is (Freenet, 2004).
The Freenet project is based on a Master’s thesis by Ian Clarke (1999).
The first code commits happened at the beginning of 2000; however, the code was transferred
to a new module in 2002. Therefore the observed CVS time frame stretches from January 4,
2002 to November 28, 2004. Since 2003, one of the developers, Matthew Toseland (CVS
name: toad), has been paid through donations to work full time on the project. A non-profit
organization, Freenet Inc., has been incorporated in California in order to give developers legal
protection from third parties.
Freenet is licensed under the GNU GPL. It uses its own infrastructure for the website and
developer mailing lists, but takes advantage of the source code repository offered by Sourceforge. The current, observed source code repository contains 14,118 modifications in 1,231
files conducted by 36 authors.
Gnomemeeting
Gnomemeeting is an H.323-compatible7 videoconferencing and VoIP telephony8 application
that allows users to make audio and video calls to remote users with H.323 hardware or software
(such as Microsoft Netmeeting) (Gnomemeeting, 2004).
Gnomemeeting itself relies on an underlying library to provide some low-level functionality.
It uses openh323, whose development is coordinated by Quicknet Technologies Inc. and which is
itself released under the Mozilla Public License (MPL), a license that resembles the liberal
GNU LGPL.
As the MPL is not compatible with the GNU General Public License (GPL and non-GPL
code may not be mixed under all conditions), the Gnomemeeting project leader grants users
the following permission: “as a special exception, you have permission to link or otherwise combine this program
with the programs OpenH323 and Pwlib, and distribute the combination, without applying
the requirements of the GNU GPL to the OpenH323 program, as long as you do follow the
requirements of the GNU GPL for all the rest of the software thus combined.”
The project was founded by Damien Sandras, who remains a dominant figure in its development to this day ( 32 of all modifications). Gnomemeeting hosts its website independently; its mailing lists and source code repository are provided by the Gnome project (http://gnome.org), with which it forms a loose relationship. The Gnomemeeting source code
repository contains 10,684 modifications in 557 files conducted by 98 authors between August
20, 2001 and January 17, 2005.
7
“H.323 is the name given to a set of communications protocols used by programs such as Microsoft NetMeeting
and equipment such as Cisco Routers to transmit and receive audio and video information over the Internet. It
was developed by the ITU (http://www.itu.int), an international standards body for telecommunications.” (http://openh323.org)
8
VoIP: Voice over Internet Protocol
Gnunet
GNUnet is a framework for secure peer-to-peer networking that does not use any centralized or
otherwise trusted services. A first service implemented on top of the networking layer allows
anonymous censorship-resistant file-sharing. GNUnet uses a simple, excess-based economic
model to allocate resources. This means that peers in GNUnet monitor each other’s behavior
with respect to resource usage; peers that contribute to the network are rewarded with better
service than peers that only download a lot of content (GNUnet, 2004).
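The idea that contributing peers receive better service can be illustrated with a small sketch. This is a hypothetical model and not GNUnet's actual implementation: the class name `Node`, the credit bookkeeping, and the rule that credit is only charged when requests exceed capacity are all assumptions made for illustration.

```python
class Node:
    """A peer that keeps a local credit score per neighbour (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity  # requests the node can serve per round
        self.credit = {}          # neighbour id -> credit earned so far

    def record_contribution(self, neighbour, amount):
        """Reward a neighbour that supplied content or bandwidth to us."""
        self.credit[neighbour] = self.credit.get(neighbour, 0) + amount

    def serve(self, requests):
        """Decide which requests to serve this round.

        `requests` is a list of (neighbour, priority) pairs. With excess
        capacity everything is served for free; under load, requests are
        ranked by the priority the neighbour can actually pay for.
        """
        if len(requests) <= self.capacity:
            return [neighbour for neighbour, _ in requests]

        def affordable(request):
            neighbour, priority = request
            return min(priority, self.credit.get(neighbour, 0))

        ranked = sorted(requests, key=affordable, reverse=True)
        served = ranked[: self.capacity]
        for neighbour, priority in served:
            # Charge only what the neighbour could afford.
            paid = min(priority, self.credit.get(neighbour, 0))
            self.credit[neighbour] = self.credit.get(neighbour, 0) - paid
        return [neighbour for neighbour, _ in served]
```

In this sketch a peer without earned credit is still served while the node is idle, but loses out to contributing peers as soon as requests compete for capacity.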
The project was founded in 2001 by Christian Grothoff. A design draft was published as
a research paper in 2002 (Grothoff, Patrascu, Bennett, Stef, and Horozov, 2002). The project
was inspired by, among others, the Freenet project, with which it shares many of its goals.
One of the major criticisms of Freenet is that it is written in the Java programming language, which
implies a significant overhead in terms of memory requirements and processor power. GNUnet
is written in the C programming language, which is commonly considered more efficient.
The GNUnet project website, mailing list and source code repository were first hosted on
the Ovm server (http://ovmj.org). Ovm, which employed Christian Grothoff as a graduate student, is a
DARPA-funded collaborative effort between Purdue University, SUNY Oswego, University of
Maryland, and DLTech “to develop an open source framework for building programming
language runtime systems.”
The project moved recently (after the end of the recorded code modifications) to a new,
independently hosted home (http://gnunet.org), but states on its website that it is part of
the GNU project. It is licensed under the GNU GPL.
The recorded GNUnet source code repository contains 12,111 modifications in 962 files
between June 20, 2001 and November 21, 2003, conducted by 15 authors.
GTK+
GTK+ is a multi-platform toolkit for creating graphical user interfaces. GTK+ was initially
developed for and used by the GIMP, the GNU Image Manipulation Program. Therefore, it is
named “The GIMP Toolkit”, so that the origins of the project are remembered. Today GTK+
is used by a large number of applications (e.g. Abiword which is part of the research sample),
and is the toolkit used by the GNU project’s GNOME desktop. (GTK+, 2004)
It was originally started as GTK by Peter Mattis, with help from Spencer Kimball and Josh Macdonald. When it was significantly improved (introducing object-oriented “widgets”), it
was renamed GTK+.
GTK+ itself takes advantage of three libraries developed by the GTK+ team:
GLib is the low-level core library that forms the basis of GTK+ and GNOME. It provides
data structure handling for C, portability wrappers, and interfaces for such runtime functionality as an event loop, threads, dynamic loading, and an object system.
Pango is a library for layout and rendering of text, with an emphasis on internationalization. It forms the core of text
and font handling for GTK+-2.0. Pango has been included in the research sample as well.
The ATK library provides a set of interfaces for accessibility (support for users with disabilities,
e.g. blind or color-blind people). By supporting the ATK interfaces, an application or toolkit can be used with such tools as screen readers, magnifiers, and alternative input
devices.
The GTK+ project hosts its website (http://www.gtk.org) independently; its developer mailing lists and the source code repository are provided through the GNOME project
infrastructure. It is licensed under the LGPL license.
The observed gtk+ source code repository, which ranges from January 1, 1997 to May 17,
2004, contains 76,666 modifications in 2,701 files conducted by 271 authors, making it the largest
project in the sample.
Figure 3.1: The iRate architecture
iRate
iRate radio is a collaborative filtering system for music. The user rates the music tracks he
downloads, and the server uses these ratings together with other people’s ratings to guess what the user will like.
The tracks are downloaded from websites which allow free and legal downloads of their music
(see Figure 3.1 for a graphical overview of the system) (iRate, 2004).
The iRate system allows users to discover, download, and listen to music which they like and
would otherwise probably not get to know. It is not intended as a file-sharing application,
but rather aims to interest users in artists they like and whose music they might want to buy. In the view of the iRate developers, it essentially does what a good radio station should be doing. It uses the
correlation of ratings from users to recommend tracks.
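How a correlation of user ratings can drive recommendations may be illustrated with a short sketch. It is not iRate's actual algorithm: the use of Pearson correlation, the weighting scheme, and all function, user, and track names are assumptions chosen for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the tracks both users rated (0.0 if undefined)."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    xs = [a[t] for t in common]
    ys = [b[t] for t in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def recommend(ratings, user, top_n=3):
    """Rank tracks the user has not rated yet.

    Each unrated track gets a predicted rating: the average of other users'
    ratings, weighted by how strongly their taste correlates with the user's.
    Users with zero or negative correlation are ignored.
    """
    own = ratings[user]
    weighted_sum = {}
    weight_total = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = pearson(own, other_ratings)
        if weight <= 0:
            continue
        for track, rating in other_ratings.items():
            if track in own:
                continue
            weighted_sum[track] = weighted_sum.get(track, 0.0) + weight * rating
            weight_total[track] = weight_total.get(track, 0.0) + weight
    predicted = {t: weighted_sum[t] / weight_total[t] for t in weighted_sum}
    return sorted(predicted, key=lambda t: (-predicted[t], t))[:top_n]
```

Given a dictionary that maps each user to a dictionary of track ratings, `recommend(ratings, "alice")` returns the unrated tracks that users with similar taste rated highest.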
iRate is a relatively young project: It is completely hosted by Sourceforge, where it was
registered in March 2003 by project founder Anthony Jones. It is released under the GNU
GPL. The recorded time span ranges from March 26, 2003 to August 8, 2004. The iRate
source code repository contains 1,986 modifications in 363 files conducted by 18 authors.
LAME
LAME originally stood for LAME Ain’t an Mp3 Encoder. It started its life as a patch against
the dist10 ISO demonstration source.9
However, the mp3 format was still licensed through the Fraunhofer Institute, which disallowed the modification of the demonstration source code. In September 1998, Fraunhofer
stopped various efforts to improve their freely available source code, and the 8hz effort (basically a successor to LAME) and a number of other free encoders based on ISO sources were
aborted.
Mike Cheng found that “That sucked” and, in September 1998, released a patch to the ISO
code. The patch was incapable of producing an mp3 stream or even of being compiled by itself,
so its release could not be prevented by the Fraunhofer Institute.
In May 2000, the last remnants of the ISO source code were replaced, and the LAME source
code now provides a full MP3 encoder.
The developers state that “LAME is an educational tool to be used for learning about MP3
encoding. The goal of the LAME project is to use the open source model to improve the psycho
acoustics, noise shaping and speed of MP3.”
LAME indeed achieved more than simply providing a clone of a program with a restrictive
license and an initially somewhat dubious legal status. Mark Taylor, current maintainer
of the LAME v3.x code base, developed and implemented the GPsycho “psycho acoustic and
noise shaping model”. This model is, the developers claim, vastly superior to the reference model provided by the Fraunhofer Institute, the inventor of the mp3 format (LAME, 2004).
LAME was originally hosted on its own infrastructure, but moved over to take advantage of
the facilities provided by Sourceforge in November 1999. Therefore, the recorded time frame
from November 24, 1999 until November 28, 2003 does not include the very beginnings of the
project.
It is licensed under the GNU LGPL. The LAME source code repository contains 10,312
modifications in 509 files conducted by 26 authors.
9
mpg is an ISO standard; the dist10 code was distributed by the ISO committee to provide a working demonstration of the format.
Mailman
Mailman is free software for managing electronic mail discussion and e-newsletter lists. Mailman is integrated with the web, making it easy for users
to manage their accounts and for list owners to administer their lists. Mailman supports built-in archiving, automatic bounce processing, content filtering, digest delivery, spam filters, and
more. It is written in the Python programming language, with a little bit of C code for security
(Mailman, 2004).
The system was first presented in 1998 at a conference (Viega, Warsaw, and Manheimer,
1998). One of its founders, Barry Warsaw, remains Mailman’s lead developer
today.
His contributions have been partially sponsored by third parties. The website thanks, e.g.,
“Control.com for their sponsorship of new Mailman 2.1 features such as the topic filters, external membership sources, and "virtual" mailing lists”. Additionally, Barry Warsaw seems to
have been granted time for the development of Mailman by his (previous) employer Zope Corporation (which develops an open source content management system).
It is not easy to identify the providers of the Mailman infrastructure. The software itself
can be downloaded from Sourceforge, the GNU project, and its own website, List.Org.
While the Mailman website is hosted independently, its developer mailing lists are provided through the platform of the Python programming language, python.org. Although
the project started in early 1998 (the first recorded CVS commit is on January 6, 1998), the
project did not register itself on Sourceforge until November 8, 1999, in order to take advantage
of its source code repository.
The software claims to be part of the GNU project, and is accordingly released under the
GPL license. The examined Mailman source code repository, ranging from January 6, 1998 to
November 28, 2003, contains 13,695 modifications in 1,873 files conducted by 24 authors.
mnet
Mnet is a distributed file store. The website describes a distributed file store as “a shared virtual
space into which you can put, and from which you can get, files.” To achieve its goal, Mnet
forms an emergent network without a central server. While many potential applications can
take advantage of an underlying mnet, the first application that has been written for the Mnet
project is a file-sharing application for files of all kinds (Mnet, 2004).
The project shares many of the goals of the Freenet project. However,
mnet is written in the Python programming language. It provides binaries for both Unix-like
and Windows platforms.
In June 2004 (well after the end of the examined time frame) the project moved away from
the Sourceforge infrastructure to a new, independently hosted server at http://mnetproject.org,
and began using a new source code versioning system, DARCS.
During the examined time frame the project relied on the infrastructure provided by Sourceforge, on which it was registered as a project in January 2002.
The observed CVS code changes range from January 29, 2002 to November 28, 2003. The
mnet source code repository contains 7,204 modifications in 930 files conducted by 14 authors
and is distributed under the GNU LGPL license (a license similar to the GNU GPL, yet allowing
closed source products to take advantage of the code through interacting with it).
nano
GNU nano is a very small and relatively simple to use text editor. It was designed to be a free
replacement for the Pico text editor, which is part of the Pine email suite from The University
of Washington but does not come with a GPL compatible license. It aims to “emulate Pico as
closely as possible and perhaps include extra functionality”. (Nano, 2004)
It features some convenience functionality such as an interactive replace function, a spell
checker, and auto-indent support.
While the project has a website of its own (http://nano-editor.org), both mailing
list and source code administration are handled through the savannah.gnu.org platform.
The software comes under the GNU General Public License. The observed nano source code
repository, ranging from June 6, 2000 to January 19, 2005, contains 7,976 modifications in 191
files conducted by 11 authors.
Ogle
Ogle is a “DVD player for the Solaris, Linux and BSD environments. The first open source
DVD player to support DVD menus!” (Ogle, 2004)
Its website states that “Ogle is developed by a few students at Chalmers University of Technology”, not mentioning an explicit lead developer or project leader (a fact which later showed
itself in the project sociogram; more on this in the project analysis section).
The project website is hosted by “Chalmers University of Technology” (Sweden), but the developer mailing list and source code repository are provided by http://berlios.de (a
platform similar to Sourceforge, dedicated to providing infrastructure for OS projects).
The player is licensed under the GNU GPL. The examined code from the ogle source code
repository comes from the “core module” called ogle and does not include a separate ogle_gui
module.10 It ranges from January 20, 2000 to January 1, 2005 and contains 3,310 modifications
in 273 files conducted by 8 authors.
OpenSSL
“The OpenSSL Project is a collaborative effort to develop a robust, commercial-grade, full-featured, and Open Source toolkit implementing the Secure Sockets Layer (SSL v2/v3) and
Transport Layer Security (TLS v1) protocols as well as a full-strength general purpose cryptography library.” (OpenSSL, 2004)
10
Some modularly organized projects divide their code into CVS modules, which form separate entities that are
able to work together.
The project’s source code was originally based on the SSLeay library developed by Eric A.
Young and Tim J. Hudson. Due to the licensing restrictions of SSLeay, OpenSSL cannot be
published under the GNU GPL, but is licensed under an Apache-style license, which is a very
liberal license and basically means that one is free to use it for commercial and non-commercial
purposes. It is so liberal that it is incompatible with the GNU GPL requirements, meaning that
OpenSSL code cannot be included in GPL’d projects, which often causes complaints from
users.
According to its website, the official start of the OpenSSL project was on December 3, 1998.
The observed time span of code modifications ranges from December 21, 1998 until May 4,
2004.
The OpenSSL project provides its own infrastructure for website, mailing lists and source code
administration. The source code repository contains 44,116 modifications in 3,238 files conducted by 12 authors.
pango
“The goal of the Pango project is to provide an open-source framework for the layout and
rendering of internationalized text. Pango is an offshoot of the GTK+ and GNOME projects,
and the initial focus is operation in those environments, however there is nothing fundamentally
GTK+ or GNOME specific about Pango. Pango uses Unicode for all of its encoding, and will
eventually support output in all the world’s major languages.” The name is derived from two words:
Greek: “Pan” (παν) = All
Japanese: “Go” (語) = Language
So, “properly” written, it looks like παν語 (adapted from pango, 2004).
The pango library, which is not intended for direct use by end users, but as an
underlying foundation to be used by applications, is released under the GNU LGPL, just as its
‘relative’ the GTK+ project.
The project’s mailing lists and source code repository are provided through the infrastructure
of the GNOME project. Its source code repository contains 6,715 modifications in 383 files
conducted by 46 authors in the time frame from January 1, 1997 until May 6, 2004.
phpMyAdmin
phpMyAdmin is a tool written in the interpreted PHP language intended to handle the administration of MySQL databases11 over the Web. It can create and drop databases, create/drop/alter
tables, delete/edit/add fields, and execute any other SQL12 statement (phpMyAdmin, 2004).
phpMyAdmin runs on any platform that runs the PHP language interpreter and a web server,
which basically means that it can run virtually everywhere.
The phpMyAdmin changelog begins on September 9, 1998 with a “First internally used version”. It was transferred to Sourceforge in March 2001, where it was featured as “Sourceforge
project of the month” in December 2002. The GNU GPL licensed software relies on mailing
lists and source code repository provided by Sourceforge, but hosts its website independently.
The phpMyAdmin source code repository was recorded over a time frame from May 3, 2001
to January 20, 2005 and contains 37,385 modifications in 1,072 files conducted by 19 authors.
PostgreSQL
“PostgreSQL is a highly-scalable, SQL compliant, open source object-relational database management system. With more than 15 years of development history, it is quickly becoming the
de facto database for enterprise level open source solutions.” (PostgreSQL, 2004)
The database now known as PostgreSQL is derived from the POSTGRES package written
at the University of California at Berkeley. The POSTGRES project, led by Professor Michael
Stonebraker, was sponsored by the Defense Advanced Research Projects Agency (DARPA),
the Army Research Office (ARO), the National Science Foundation (NSF), and ESL, Inc. Its
implementation began in 1986.
In 1994, Andrew Yu and Jolly Chen added a SQL language interpreter to POSTGRES, Version 4.2. The result was subsequently released to the web under a new name, Postgres95.
11
MySQL is a popular open source database coordinated by the Swedish company MySQL AB.
12
SQL: Structured Query Language is used to perform commands on a database.
In 1996 the name "Postgres95" was deemed unsuitable and the project was renamed to PostgreSQL, to reflect the relationship between the original POSTGRES and the more recent versions with SQL capability.
Today PostgreSQL runs on almost every brand of Unix (the website claims that it runs on 34
different platforms with the latest stable release), and has included native Windows compatibility
since version 8.0.
PostgreSQL is a somewhat unique project in that it was neither founded by nor is it run by a
single person (or organization); instead, it is run by a “Steering Committee” which currently
consists of six persons.
It is licensed under the BSD license.13 All development infrastructure is provided independently by the project itself. The PostgreSQL source code repository, which ranges from July
9, 1996 to January 19, 2005, contains 95,221 modifications in 5,644 files conducted by 28
authors.
Smarty
Smarty is known as a “Template Engine”, but would be more accurately described as a “Template/Presentation Framework.” That is, it provides the programmer and template designer
with a wealth of tools to automate tasks commonly dealt with at the presentation layer of an
application. (Smarty, 2004)
Smarty, started by Monte Ohrt and Andrei Zmievski, is a way to separate the content of
websites from their presentation and layout. It is not intended to be used directly by end users,
but provides functionality to other programs, or could be used by website administrators. The
TikiWiki Content Management System (also included in the sample) is an example of such a
program which takes advantage of the Smarty functionality.
Although the name "Smarty" and the logo are trademarks of New Digital Group, Inc., the
software itself comes under the liberal GNU LGPL license. The project itself is hosted by the
php.net project; its website, mailing lists, and source code reside there.
13
BSD stands for Berkeley Software Distribution and is the name of a free Unix derivative developed at the University of California, Berkeley. The BSD license is a very liberal license and allows basically every use of the software.
The observed smarty source code repository begins on August 8, 2000 and ends on June 8,
2004, containing 5,881 modifications in 1,361 files conducted by 11 authors.
Stepmania
“StepMania is a music/rhythm game. The player presses different buttons in time to the music
and to note patterns that scroll across the screen. Features 3D graphics, visualizations, support
for gamepads/dance pads, a step recording mode, and more!” (Stepania, 2004)
StepMania runs on Windows and Linux computers as well as on modified Xboxes. This
rather crazy game14 comes under the very liberal MIT License, which basically allows every
use of the source code.
Although the website is hosted independently, the project takes advantage of the mailing list and
source code repository on Sourceforge, where the project was registered in October 2001.
The observed time frame ranges from November 3, 2001 until May 13, 2004, during which
48,357 modifications in 12,725 files were conducted by 44 authors.
tdb
TDB is a “Trivial Database”. In concept, it is very much like GDBM15 except that it allows
multiple simultaneous writers and uses locking internally to keep writers from trampling on
each other. TDB is also extremely small. (tdb, 2004)
The project was registered in August 2000 and is licensed under the GNU GPL. Its source
code modifications between August 14, 2000 and April 9, 2002 were observed for this research.
tdb relies on facilities provided by Sourceforge for its infrastructure. Although the Sourceforge summary page lists 10 developers for this project, the tdb source code repository contains
295 modifications in 55 files conducted by only seven authors, making it the smallest project
in terms of the number of modifications.
14
If dancing to the rhythm of music on a “dance pad” on the floor can be considered gaming.
15
The GNU Database Manager
TikiWiki
“The Tiki CMS/Groupware, also known as TikiWiki, is a powerful web-based Groupware and
Content Management System (CMS). It can be used to create all sorts of Web applications,
Sites, Portals, Intranets and Extranets”. (TikiWiki.org, 2004)
According to its developers, the TikiWiki project aimed from the beginning to be an open,
high-growth project and to embrace as many developers as possible. The following quote from
an interview with a lead developer emphasizes this when talking about future challenges:
“Keeping the balance between number of features and quality of features. We
decided to start adding as many features as we could, as fast as possible, so more
and more users can help us to refine the features. We will start to be focused on
making features better, more often than adding new features (expand first, conquer
later).” (SourceForge.net, 2003)
The TikiWiki project hosts its own website independently, using its own code for it. Mailing
lists, bug database, and source code repository are provided through Sourceforge, where the
project was registered in October 2002.
The software is licensed under the GNU LGPL. The observed time frame ranges from October 8, 2002 to March 28, 2004. During this time, 32,260 modifications in 5,202 files were
conducted by 89 authors in the TikiWiki source code repository.
wget
GNU wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the
most widely-used Internet protocols. It is a non-interactive command line tool, so it may easily
be called from scripts, cron jobs, terminals without X-Windows support, etc. (wget, 2004)
The website of this command line program (programmed as part of the GNU effort, but also
ported to the Windows platform) is hosted through gnu.org.
A source code repository is provided by http://sunsite.dk16
The wget source code repository of this GPL’d software contains 4,865 modifications in 207
files conducted by six authors between December 2, 1999 and January 1, 2005.
xerces
The “Xerces Java Parser” (xerces, 2004) (named after the Xerces Blue butterfly) is written
in Java and provides “XML parsing and generation.”17 (xerces, 2004) The project resides under the umbrella of the Apache foundation and is licensed under the liberal Apache Software
License (which is not compatible with the GNU GPL).
Xerces provides “hooks” in order to be usable from different programming languages, such
as C++ or Perl. These reside in separate CVS modules and have not been included in the
analysis. The core “xerces” module contains the code of the software, which is written in Java.
The observed time frame ranges from November 9, 1999 to June 6, 2004. Its source code
repository contains 18,304 modifications in 2,276 files conducted by 27 authors.
XFCE4
“Xfce is a lightweight desktop environment for unix-like operating systems. It aims to be fast
and lightweight, while still being visually appealing and easy to use. [. . . ] Another priority
of Xfce 4 is adherence to standards, specifically those defined at freedesktop.org. Xfce 4 can
be installed on several UNIX platforms. It is known to compile on Linux, NetBSD, FreeBSD,
Solaris, Cygwin and MacOS X, on x86, PPC, Sparc, Alpha...” (Xfce, 2004)
The project, which is released under the GNU GPL, was started and coordinated
by Olivier Fourdan. It hosts its web presence and its CVS code repository independently on
http://xfce.org (the hosting is donated by a commercial 3rd party). Mailing lists are
provided through http://foo-projects.org.
The observed time frame for the Xfce4 development spans from February 14, 2001 to May
4, 2004. Its source code repository contains 61,879 modifications in 13,675 files conducted by
22 authors.
16
“Sunsite.dk is a completely non-commercial project, powered by sponsored hardware, and driven by volunteer
workforce. The goal of Sunsite.dk is to help power the development of Open Source Software in the world.
Sunsite.dk is currently part of the Sun initiated project known as SunSITE.” (http://sunsite.dk)
17
XML (Extensible Markup Language) is a file format for the transfer and storage of structured information.
3.3 Analysis
This section analyzes the sample projects: it examines the concentration of modifications, the
degree of developers (absolute and relative), as well as the density, inclusiveness, and centralization of the resulting networks. Code ownership of source code files is examined as well.
3.3.1 Concentration of modifications
This study measures the “amount of work” performed through the number of file modifications per author. One commit to the CVS source code repository can consist of one or several
modified files, so the measure is not necessarily equal to the number of CVS commits.
It has been noted in other studies of open source projects that a large proportion of the work
is performed by only a small fraction of contributors, usually called “core developers” (Koch
and Schneider, 2002; Ghosh and Prakash, 2000; Mockus, Fielding, and Herbsleb, 2000). This
is also the case in nearly all of the projects in this sample. One indicator of the concentration
of modifications is the gini coefficient, introduced by Gini (1912).18
However, in order for the gini coefficient to be an unbiased estimate independent of the
population size, it should be multiplied by $\frac{n}{n-1}$ (Dixon, Weiner, Mitchell-Olds, and Woodley,
1987; Mills and Zandvakili, 1997). Note that all measurements using the gini coefficient
throughout this work are calculated using this “corrected” version of the gini coefficient.
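The correction can be illustrated with a short sketch. The function name and the example counts below are hypothetical and not taken from the sample; the uncorrected coefficient is computed with the common rank-weighted formula over the sorted values, which is an assumption about the computation rather than a description of the tooling used in this study.

```python
def corrected_gini(values):
    """Gini coefficient multiplied by n / (n - 1), as described in the text."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n < 2 or total <= 0:
        raise ValueError("need at least two values with a positive sum")
    # Uncorrected Gini: rank-weighted sum over the ascending values (ranks start at 1).
    gini = sum((2 * rank - n - 1) * x for rank, x in enumerate(xs, start=1)) / (n * total)
    # Small-sample correction so that "one author does everything" yields exactly 1.
    return gini * n / (n - 1)


# Hypothetical modification counts per author.
equal_work = [100, 100, 100, 100]   # every author contributes the same amount
one_author = [0, 0, 0, 400]         # a single author performs all modifications

print(corrected_gini(equal_work))
print(corrected_gini(one_author))
```

Without the correction, the second case would yield 0.75 for four authors instead of 1, which is why small projects would otherwise appear less concentrated than large ones.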
Applied to the number of modifications per developer, the gini coefficients for our projects
range from 0.58 to 1.00, with a mean of 0.84 and a median of 0.86. These values indicate a
high inequality of modifications.19 Fig. 3.2 shows the distribution of the gini coefficients for
all projects in a histogram.
Figure 3.2: Histogram of modification concentration (gini)
A weak Spearman’s rank correlation of ρ = 0.19 indicates that no clear monotonic relationship exists
between the size of a project (in terms of number of developers) and its concentration of modifications. It is interesting, though, that the four largest projects in our sample all have a gini
coefficient between 0.89 and 0.92.
The generally high levels of concentration come as no surprise if the assumption of a small
number of core developers who shoulder most of the work holds for projects of every size. It is
therefore proposed:
Proposition 1 Modifications are highly concentrated on a few developers, meaning that the
largest share of work (i.e. highest number of modifications) is performed by only a small
number of “core developers”.
18
The gini coefficient is used to measure the inequality of income in countries, but is also usable as a general
indicator for the inequality of values (v). Gini ∈ [0, 1]: 0 indicating that $v_i = v_j \; \forall i, j$ and 1 in case $v_i \neq 0$, $v_{j \neq i} = 0$.
19
To give a feeling for the coefficient: the inequality of incomes in the US hovers around 0.4 (Germany 0.28)
(source: US Bureau of Census).
3.3.2 Degree
The degree (of connection) of a node equals its number of connections to other nodes, i.e.
$d_j = \sum_{i=1}^{n} 1 \;\; \forall \, ad_{j,i} > 0$ ($ad_{j,i}$ being the elements of the adjacency matrix). This measure
gives some indication of how well a node in the network is connected to other nodes (Freeman,
1979).
In a social network of developers based on their file modifications, the degree of a developer
is the number of other developers with whom he has modified the same files.
The degree can therefore vary between 0 and the number of developers in a project minus one. A
developer’s connections can be interpreted based on his degree:
0 The developer has never modified the same files as other developers. There are two possible
reasons for this. Either his contribution has been so marginal that it did not reach the
cut-off point of ten common files, or the developer works on a highly
specialized area of code, being its owner. The latter could e.g. be a sign of a very modular
software architecture with divided responsibilities, or of a feature which is being developed by
a single author. These developers are the “lone wolves”, not connected to the network.
1 Developers with a degree of one are connected to only one other developer, i.e. they share
commonly modified files with exactly one other developer. This could e.g. be a team of two developers working
on the same feature, or a “lone wolf” whose files got modified by some other developer.20
Developers belonging to this category can be termed “tamed wolves”.
low value (>1) Contributors in this category modified files which a few others had worked
on as well. This could be a few developers sharing certain areas of code, or other forms
of collaboration. The borders between a “low value” and a “high value” are somewhat
fuzzy.
high value These are the “generalists” who modify files which a lot of other developers have
been working on as well. These well-connected developers are part of the inner network
and perform most of the development (Fig. 3.4). An interesting fact is that it is mostly developers with a relatively high degree who are connected to “tamed wolves”. They seem to take
the role of integrators.
20
As ties in this analysis are undirected, a degree of 1 does not mean that a developer actively modified another
developer’s files. It is possible that “his” files were modified by the other developer instead.
The distinction between a “low value” and “high value” degree has not been made for the
purposes of this research. Generalized blockmodels (Wasserman, 1994) could be used to create
such a specific categorization if needed.
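The construction of the network and the resulting degree categories can be sketched as follows. All names and the example data are hypothetical; the sketch assumes that two developers are tied when they have modified at least ten files in common, which is one reading of the cut-off point mentioned above.

```python
from itertools import combinations


def build_network(files_by_author, min_common_files=10):
    """Return {author: set of connected authors} for an undirected network.

    Two authors are tied if they modified at least `min_common_files`
    of the same files (assumed interpretation of the cut-off point).
    """
    neighbours = {author: set() for author in files_by_author}
    for a, b in combinations(files_by_author, 2):
        common = files_by_author[a] & files_by_author[b]
        if len(common) >= min_common_files:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def degrees(neighbours):
    """Degree of each author: the number of authors he is connected to."""
    return {author: len(others) for author, others in neighbours.items()}


# Hypothetical sets of modified files per author.
files_by_author = {
    "alice": {f"src/core_{i}.c" for i in range(30)},
    "bob": {f"src/core_{i}.c" for i in range(12)},
    "carol": {f"src/core_{i}.c" for i in range(20, 30)} | {"doc/readme.txt"},
    "dave": {"plugins/x.c", "plugins/y.c"},
}

network = build_network(files_by_author)
print(degrees(network))  # {'alice': 2, 'bob': 1, 'carol': 1, 'dave': 0}
```

In this toy example "dave" is a lone wolf, "bob" and "carol" are tamed wolves, and "alice" acts as the integrator connecting them.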
Table 3.2 lists a summary of the degree of developers for each project. As can be seen there,
the mean number of developers is 39.7, while the mean of the average degrees is only 5.08,
indicating that the modification of other developers’ files is not too common.
Relative degree & network density
In order to compare projects of different sizes, the
relative degree of a developer in project $p$ is calculated as $\text{rel. degree}_p = \frac{\text{degree}_p}{\#\text{developers}_p - 1}$, being the ratio of realized over
possible connections. The mean relative degree of a project is also called the density of its network (Scott, 1991, p.74).
The mean and maximum relative degree for each project can also be found in Table 3.2. The
average value of the network density for all projects is 0.21.
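A minimal sketch of this calculation, using hypothetical function names and a hypothetical four-developer network, shows that the mean relative degree coincides with the usual density formula (realized ties over possible ties):

```python
def relative_degrees(degree_by_dev):
    """Relative degree: degree divided by the number of other developers."""
    n = len(degree_by_dev)
    if n < 2:
        raise ValueError("relative degree needs at least two developers")
    return {dev: d / (n - 1) for dev, d in degree_by_dev.items()}


def density(degree_by_dev):
    """Network density, computed as the mean relative degree."""
    rel = relative_degrees(degree_by_dev)
    return sum(rel.values()) / len(rel)


# Hypothetical network with two ties: alice-bob and alice-carol.
degree_by_dev = {"alice": 2, "bob": 1, "carol": 1, "dave": 0}
n = len(degree_by_dev)
ties = sum(degree_by_dev.values()) / 2          # every tie is counted at both ends
print(density(degree_by_dev))                   # mean relative degree
print(ties / (n * (n - 1) / 2))                 # realized over possible ties

# Consistency check against Table 3.2: Abiword, mean degree 22.79, 63 developers.
print(round(22.79 / (63 - 1), 2))               # 0.37
```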
Taking the relative degree of all developers across all projects, as illustrated in Figure 3.3,
the mean value is only 0.134 (3rd quartile: 0.21). Accordingly, developers are (on average)
only connected to 13.4% of their fellow developers. Following the observation of rather low
density it can be proposed that:
Proposition 2 On average, developers maintain connections (i.e. commonly modified
files) with only a minority of their co-developers, leading to a low mean relative degree, i.e. a low density
of the network.
Figure 3.3: Relative degree over all developers
Relative degree vs. number of modifications Although the average degree is relatively low, Figure 3.3, depicting the relative degree of all developers across projects, also illustrates that a few developers have very high relative degrees. The gini coefficient of the relative
degrees is 0.73, confirming a high concentration. The average maximum relative degree is 0.52,
which, given that the average inclusiveness (see later) is only 0.59, means that in most cases
some developers are connected to nearly every other connected developer.
Figure 3.4: Rel. degree vs. log(#modifications)
But who are the few developers with a high degree? Are they moderately active developers with
the sole task of tying code together? Or are they the “core developers” who bear the lion’s share of
the development work? It is therefore interesting to examine the relationship between the (relative) degree of a developer and his number of code modifications. Fig. 3.4 suggests that
there is a positive relationship between the number of modifications and a developer’s relative
degree (although a non-linear one, which is why the y-axis shows log(#modifications)).
A Spearman rank correlation of 0.85 confirms the visual clue. Following this argumentation it
can be proposed:
Proposition 3 The relative degree of developers is highly concentrated, with a few very active
developers acting as “generalists” or “integrators”, who are connected to most other developers (apart from the “lone wolves”).
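Because the relationship between relative degree and number of modifications is non-linear, a rank correlation is the appropriate measure. The following sketch computes Spearman's rank correlation as the Pearson correlation of average ranks; the function names and the five data points are hypothetical and only illustrate why the rank-based coefficient is insensitive to the log-like shape of the relationship.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical developers: relative degree and number of modifications.
rel_degree = [0.0, 0.1, 0.2, 0.5, 0.9]
modifications = [3, 20, 150, 2000, 40000]

print(spearman(rel_degree, modifications))  # monotonic relationship
print(pearson(rel_degree, modifications))   # lower, as the relationship is not linear
```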
3.3.3 Inclusiveness
Inclusiveness refers to the number of nodes which are included in the connected parts of a
network (Scott, 1991). A useful measure of inclusiveness is the number of connected nodes as
a proportion of the total number of nodes, i.e. $i = \frac{n_{connected}}{n_{total}}$. As, by definition, the inclusiveness
also equals 100% minus the percentage of “lone wolves”, it is an important characteristic of an open source
project.
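As a small illustration, inclusiveness can be computed directly from a neighbour map of the developer network. The function name and the example network are hypothetical.

```python
def inclusiveness(neighbours):
    """Share of developers with at least one tie (1 minus the share of lone wolves)."""
    total = len(neighbours)
    if total == 0:
        raise ValueError("empty network")
    connected = sum(1 for others in neighbours.values() if others)
    return connected / total


# Hypothetical network: three connected developers and one lone wolf.
example_network = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice"},
    "dave": set(),
}
print(inclusiveness(example_network))  # 0.75
```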
Table 3.3 gives an overview of the inclusiveness value for each project. Figure 3.5 visualizes the values. It is interesting to note that only 6 projects feature an inclusiveness < 0.5.
The values above 0.5 seem to be spread relatively evenly, without apparent clusters.
Figure 3.5: Inclusiveness of projects
Figure 3.6: Inclusiveness vs. # developers
Figure 3.7: Dendrogram of projects (using avg. inclusiveness)
What determines the inclusiveness of a project? Fig. 3.6 relates the number of developers
to the inclusiveness of each project in order to identify a possible correlation between those
variables. However, the plot gives no visible clue about a relationship. A relatively
weak Spearman’s rank correlation of -0.14 between the number of developers and the project
inclusiveness also indicates that no clear relationship between the two can be found.
Project size (in terms of developers) therefore does
not seem to affect the inclusiveness.
A correlation of -0.52 exists between the gini coefficient, indicating the concentration of
modifications, and the inclusiveness of projects (see Figure 3.10 for a visualization). This
shows that the larger the share of work performed by the small web of core developers, the fewer developers are included
in the network. This is intuitively no surprise, as contributors with very little activity do not
perform enough work to become connected to the project network.
Fig. 3.7 shows a dendrogram, clustering projects according to their inclusiveness. However,
no obvious common criterion, besides the concentration of modifications as explained above, seems
to determine the inclusiveness of a project.
Stating this observation as a proposition:
Proposition 4 The inclusiveness of open source projects differs widely. There are no equilibria
through which open source projects could be characterized in general, although inclusiveness
is somewhat related to the concentration of modifications.
3.3.4 Centralization
Another measure frequently used in SNA is that of node centrality and, building upon it, the global network centralization. Freeman (1979) provides a generic definition of the centralization of a network $G$ for any centrality measure $C(v)$ as $C^*(G) = \sum_{i \in V} \left( \left| \max_{v \in V}(C(v)) - C(i) \right| \right)$.
Two specific operationalizations have been used frequently, one based on the degree of a
node, the second based on its betweenness. Both measures have been calculated and are listed
in Table 3.3. Centralization is usually measured using only the connected parts of the network
(Scott, 1991); otherwise it is mostly a function of the inclusiveness of the network. Both
centralization measures therefore include only developers who are part of the network.
Degree The average centralization of the network based on the degree of developers is 0.51
(1st quart. 0.4, median 0.53, 3rd quart. 0.66). No relationship between the inclusiveness of a
project and its centralization could be found.
The centralization of a project could be interpreted as the “dominance” of the core developers
over other contributors. Note that, for instance, the ogle project, which was founded by a team of
developers, exhibits a centralization of 0.0, while the mailman project, which was founded
and is led by a single developer, exhibits a centralization of 0.82.
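A degree-based centralization between 0 and 1 can be sketched as below. The sketch assumes Freeman's usual normalization, i.e. the sum of differences to the maximum degree divided by its largest possible value (n - 1)(n - 2), which a star-shaped network attains; the text above only gives the unnormalized sum. The function name, the handling of networks with fewer than three connected developers, and the example networks are hypothetical.

```python
def degree_centralization(neighbours):
    """Normalized degree centralization of the connected part of a network.

    Isolated developers are dropped first, as described in the text.
    Returns NaN when fewer than three connected developers remain
    (an assumption of this sketch, since the normalizer would be zero).
    """
    connected = {dev: others for dev, others in neighbours.items() if others}
    n = len(connected)
    if n < 3:
        return float("nan")
    degs = [len(others) for others in connected.values()]
    top = max(degs)
    return sum(top - d for d in degs) / ((n - 1) * (n - 2))


# Hypothetical star: one core developer tied to everyone else, plus a lone wolf.
star = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "loner": set(),
}
# Hypothetical team in which everybody shares files with everybody else.
team = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

print(degree_centralization(star))  # 1.0
print(degree_centralization(team))  # 0.0
```

The two extremes correspond to the interpretation given above: a project dominated by a single developer versus a project run by a team of equally connected developers.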
Betweenness The betweenness of a node $v$ is given by $C_B(v) = \sum_{i \neq v \neq j \in V} \left( \frac{\sigma_{ivj}}{\sigma_{ij}} \right)$, where $\sigma_{ijk}$
is the number of shortest paths (geodesics) from $i$ to $k$ through $j$ and $\sigma_{ik}$ is the number of
geodesics from $i$ to $k$ (Anthonisse, 1971; Freeman, 1977). Conceptually, high-betweenness
nodes lie on a large number of non-redundant shortest paths between other nodes; they can
thus be thought of as “bridges” or “boundary spanners” (Freeman, 1979). The betweenness
based centralization is also a value between 0 (decentralized) and 1 (centralized).
Betweenness based centralization allows basically the same interpretation as the centralization
based on the degree of developers. However, it does not only look at the degree of a developer in
the network, but also takes his position (central vs. peripheral) into account. The values for the
single projects are listed in Table 3.3. The range of centralization in our project sample is 0.0-1.0,
with a mean of 0.28 (1st quart. 0.11, median 0.21, 3rd quart. 0.39).
Relatively low values indicate well connected networks of developers. A higher value can
be interpreted as the dominant coordinator role of a single core developer.
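The betweenness of each developer and the resulting centralization can be sketched as follows. The sketch counts shortest paths with a breadth-first search per source node (the accumulation scheme known as Brandes' algorithm, used here for convenience and not necessarily the procedure applied in this study) and normalizes the centralization by its maximum, which a star-shaped network attains. All names and the example networks are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Betweenness of every node in an undirected, unweighted network.

    adj maps each node to the set of its neighbours.
    """
    cb = {v: 0.0 for v in adj}
    for s in adj:
        order = []                          # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}        # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}         # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:             # w is reached for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):           # accumulate dependencies back to s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Every unordered pair was counted from both of its endpoints.
    return {v: c / 2 for v, c in cb.items()}


def betweenness_centralization(adj):
    """Sum of differences to the maximum betweenness, scaled to the range 0..1."""
    n = len(adj)
    if n < 3:
        return float("nan")                 # assumption: undefined for tiny networks
    cb = betweenness(adj)
    top = max(cb.values())
    return sum(top - c for c in cb.values()) / ((n - 1) ** 2 * (n - 2) / 2)


# Hypothetical connected parts of two networks.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
team = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}

print(betweenness(star)["hub"])             # 3.0: the hub lies between all three pairs
print(betweenness_centralization(star))     # 1.0
print(betweenness_centralization(team))     # 0.0
```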
Following these observations, it can be proposed:
Proposition 5 The centralization (both degree and betweenness based) of a project differs
widely between open source projects (0.0-1.0), although the middle half of the projects ranges
between 0.11 and 0.39 (betweenness based centralization).
3.3.5 Code ownership
The research question asked about possible “code-ownership” in open source projects. Code ownership would
contradict the popular belief that open source is developed in an anarchistic manner,
where everybody modifies source code as he sees fit.
In order to explore the aspect of code-ownership it makes sense to look at the number of
distinct authors per file. In virtually all programming languages a file is a container for a certain
functionality (or, in object oriented languages such as C++ or Java, the container for an object
or the methods that an object can perform). Being ’owner’ of a file is therefore the software
architectural equivalent of being owner of that functionality or that object. The number of
distinct authors therefore gives an idea of how much ownership a developer possesses over
a piece of the software architecture. As described in the methodology, only projects without
specific ‘gatekeepers’, where a few developers collect patches by other contributors and commit
them to the area they maintain, were included in the sample (see Section Methodology on page
31 for more information).
Figure 3.8: Mean distinct authors per file (over all projects)
An alternative measurement to the authors per file would have been the number of distinct
authors per directory, as software repositories are often organized in a way where a directory
equals a module in the software architecture (López et al., 2004). However, as the empirical
data provides per file information, the fine grained level of information seemed to be most
accurate. This research does not intend to gather information about the software architecture
(where this indeed would have been useful), but looks into the ‘ownership’ of distinct entities,
the smallest of which are specific files.
As can be seen in the boxplot in Figure 3.8, the average number of distinct authors per file
is relatively equal across all projects (mean 2.37, median 2.49). Only the two largest projects
(larger by an order of magnitude) deviate from the mean value: emacs 6.362 (177 devs., 76,666 modifications)
and gtk+ 4.98 (271 devs., 6,362 modifications).
Figure 3.9: Mean distinct authors per file vs. # developers
It is remarkable that the average number of distinct authors per file is relatively uniform across
most projects, while the projects differ significantly in size, age, and number of developers. This observation suggests that code-ownership is indeed an issue: on average,
only 2.4 developers work on a single file, although the average number of developers is nearly
40. This observation fits previous interviews with developers, who have said things
like “that is his code”.
It can therefore be proposed:
Proposition 6 Open source projects feature “code ownership”, i.e. most files are only modified
by a very small number of developers.
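The number of distinct authors per file can be derived from a list of modification records, as the following sketch shows. The record format (author, file path) and the example log are hypothetical simplifications of what a CVS log provides.

```python
from collections import defaultdict
from statistics import mean


def distinct_authors_per_file(modifications):
    """Map each file to the number of different authors who modified it."""
    authors = defaultdict(set)
    for author, path in modifications:
        authors[path].add(author)
    return {path: len(names) for path, names in authors.items()}


# Hypothetical modification log: one (author, file) pair per file modification.
log = [
    ("alice", "src/parser.c"),
    ("alice", "src/parser.c"),
    ("bob", "src/parser.c"),
    ("alice", "src/lexer.c"),
    ("carol", "doc/manual.txt"),
    ("carol", "doc/manual.txt"),
]

per_file = distinct_authors_per_file(log)
print(per_file)                  # {'src/parser.c': 2, 'src/lexer.c': 1, 'doc/manual.txt': 1}
print(mean(per_file.values()))   # mean number of distinct authors per file
```

Repeated modifications by the same author do not increase the count, so a low value indicates ownership regardless of how often a file was changed.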
3.4 Characterizing the coordination styles
One goal of this research is to identify variables which capture typical characteristics of open
source projects and which would enable the creation of a typology of such projects. This
section relates the variables used to each other in order to see whether and how they could
be useful to achieve such a characterization.
A “coordination style” describes the way in which developers work on the source code
and, more importantly, how they interact with each other (through the modification of the same
files).
This study shows that most variables that characterize a network vary significantly across
the sample. It seems that there is no single way to collaborate. This observation is in line
with the findings of Crowston and Howison (2004), who looked at the communication structure in bug
reports.
However, as the variables do not correlate well with each other, they cannot be collapsed
into a single, one-dimensional dependent variable. “Coordination style” is therefore a construct consisting of four dimensions which vary significantly
across projects:
Concentration of modifications As was shown, most work in projects is performed by
very few developers. Typically, the most active developer would conduct between $\frac{1}{2}$ and $\frac{2}{3}$ of
all file modifications. Concentration of file modifications can be measured as the gini coefficient for all file modifications per author as listed in Table 3.3. The concentration ranges from
0.58, indicating a comparatively equal number of contributions from all developers (although still very much
concentrated), to a value of 1.00, indicating a very high concentration of
modifications (basically all work performed by a single developer).
The concentration of modifications is a relevant measure as it gives a clue about
who does most of the work and shows how equally the workload is distributed among
contributors.
Core developer structure: The projects’ (betweenness based) centralization is spread nearly evenly between 0.0 and 0.4, with only four projects having values between 0.6 and 1.0
(the average value is 0.28).
The centralization value indicates the dominance of one (or a few) core developers:
the higher the centralization value, the more “star-shaped” the characterized network; the lower
the value, the better all connected developers are connected with all others. Centralization can
therefore indicate how dominant (in terms of his degree) the core developer is compared to
other connected developers.
Degree of collaboration: It has been shown that the inclusiveness varies across projects
(its range in the sample is 0.08 - 1.0). The inclusiveness of a project’s network should be part of
the typology as it indicates how large the share of developers is who work on files that others
modify as well. An inclusiveness of 0.0 would therefore indicate a completely modularized
project where each developer specializes in ‘his’ set of files.
Project size: It is obvious that large projects are more difficult to handle than very small
projects. The project size should therefore be part of the characterization. It gives an impression
of the complexity of coordination.
That coordination complexity increases with size has long been an undisputed argument and was
famously stated in 1975 as what has been termed Brooks’s Law: “Adding manpower to a late
software project makes it later.”21 (Brooks, 1995)
However, it should be noted that the number of developers is not a completely independent
variable in itself. Table 3.4 on page 74 displays the summary of a linear regression, explaining
the number of developers of a project through its inclusiveness and its (betweenness based)
centralization measure. It can be seen that both variables are statistically significant, explaining
the number of developers to a certain extent (adjusted $R^2$ = 0.33). This is an interesting
observation, although no causal relationship can be inferred here. It is also notable that the
linear regression works best for projects with fewer than 100 developers. This could be due to
the fact that measures from social network theory can be very tricky to compare if the
networks’ sizes are too different (Scott, 1991), or because huge projects work in a very different
manner. The latter possibility deserves more attention in future studies and might be one of
the reasons to break large projects down into modules in order to keep networks manageable
(Baldwin and Clark, 2000). An increasing level of modularity has indeed been observed in large
open source projects by MacCormack et al. (2004).
21
This assumption has, however, been partially revised by Brooks himself in the 20th anniversary edition of his
book (Brooks, 1995) and was also challenged by others (Jones, 2000).
Distinct authors per file did not vary enough to be a useful dimension for the classification
of projects (as shown in the previous section). It is an interesting fact to see such a stable value
across all projects. This value is similar to that of other case studies: A low distinct number of
authors seems to be a consistent property across open source projects.
Figure 3.10: Scatterplots of Centrality, Inclusiveness, and Concentration of modifications
                                          Degree
Project         Devs.  Min.  Median    Mean  Rel. mean  3rd Qu.   Max.  Rel. Max.
                                             (Density)
Abiword            63     0      28   22.79       0.37       37     47       0.76
adonthell          13     0       2       2       0.17        3      6       0.50
awstats             3     0       0       0       0.00        0      0       0.00
bison              18     0     1.5       3       0.18        5      8       0.47
bzflag             32     0       8    8.25       0.27    15.25     21       0.68
cdex                4     0     0.5     0.5       0.17        1      1       0.33
emacs             177     0       7   17.57       0.10       28     95       0.54
flightgear          8     0       3    3.25       0.46        5      6       0.86
freenet            36     0       1   4.111       0.12     8.25     18       0.51
gnomemeeting       98     0       0  0.4694       0.00        0      7       0.07
gnunet             15     0       1   2.533       0.18      4.5      9       0.64
gtk+              271     0       0   6.421       0.02        3     74       0.27
irate              18     0       0   1.444       0.08     2.75      7       0.41
LAME               26     0       3   4.923       0.20      9.5     15       0.60
mailman            24     0       1     2.5       0.11        6     14       0.61
mnet               14     0     0.5   1.714       0.13        3      6       0.46
nano               11     0       2       2       0.20      3.5      5       0.50
ogle                8     0       5    3.75       0.54        5      5       0.71
openssl            12     0     9.5     8.5       0.77       10     10       0.91
pango              46     0       0   3.043       0.07     5.75     16       0.36
phpmyadmin         19     0       9   6.632       0.37       11     12       0.67
postgresql         28     0     7.5   8.571       0.32    12.25     22       0.81
smarty             11     0       4   2.727       0.27        4      6       0.60
stepmania          44     0       1   3.364       0.08     4.25     21       0.49
tdb                 7     0       0  0.5714       0.10        1      2       0.33
TikiWiki           89     0       2    8.27       0.09       13     45       0.51
wget                6     2     3.5   3.667       0.73     4.75      5       1.00
xerces             27     0       8   8.148       0.31     13.5     19       0.73
xfce4              22     0     5.5   6.545       0.31       10     15       0.71
TOTAL (avg)     39.66  0.07    3.91    5.08       0.21     7.91  17.83       0.52
Table 3.2: Degree overview
                          Concentr.        Centralization
Project         Devs.     (Gini coeff)     degree        betweenness   Inclusiveness
Abiword         63        0.75             0.38          0.05          0.78
adonthell       13        0.87             0.53          0.21          0.54
awstats         3         1.00             0.00          NaN           0.00
bison           18        0.93             0.32          0.07          0.50
bzflag          32        0.81             0.47          0.23          0.69
cdex            4         1.00             NaN           NaN           0.50
emacs           177       0.89             0.67          0.10          0.55
flightgear      8         0.74             0.53          0.38          0.88
freenet         36        0.91             0.63          0.46          0.53
gnomemeeting    98        0.91             0.24          0.06          0.08
gnunet          15        0.93             0.72          0.64          0.67
gtk+            271       0.92             0.69          0.17          0.28
irate           18        0.81             0.71          0.62          0.44
LAME            26        0.83             0.53          0.24          0.65
mailman         24        0.92             0.82          0.82          0.62
mnet            14        0.91             0.60          0.43          0.50
nano            11        0.86             0.40          0.14          0.55
ogle            8         0.64             0.00          0.00          0.75
openssl         12        0.65             0.09          0.01          0.92
pango           46        0.89             0.55          0.24          0.37
phpmyadmin      19        0.79             0.23          0.04          0.68
postgresql      28        0.86             0.57          0.26          0.86
smarty          11        0.68             0.43          0.39          0.82
stepmania       44        0.94             0.70          0.36          0.55
tdb             7         0.58             1.00          1.00          0.43
TikiWiki        89        0.89             0.65          0.16          0.55
wget            6         0.85             0.40          0.14          1.00
xerces          27        0.72             0.47          0.12          0.78
xfce4           22(21)∗   0.81(0.80)       0.46(0.52)    0.20(0.30)    0.77(0.76)
TOTAL (avg)               0.84             0.51          0.28          0.59
∗ “user” xfce excluded. See Section 3.5.29 on page 131 for more information.
Table 3.3: Modification concentration, Centralization & Inclusiveness
Coefficients:
                              Estimate  Std. Error  t value  Pr(>|t|)
Intercept     [devs < 100]       82.53       15.27    5.404  1.99e-05
inclusiveness [devs < 100]      -69.64       20.52   -3.394  0.00261 **
centrality    [devs < 100]      -35.96       15.97   -2.252  0.03461 *
Significance: 0.001 ‘**’; 0.01 ‘*’
Multiple R-Squared: 0.3881, Adjusted R-squared: 0.3325
Correlation Inclusiveness – Centrality: -0.21 (Pearson), -0.29 (Spearman)
Table 3.4: Linear regression #dev. = a ∗ inclusiveness + b ∗ centrality (including all projects
with <100 developers)
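The relation between the multiple and the adjusted R-squared in Table 3.4 can be retraced with the standard adjustment formula. The number of observations is not stated in the table; the value of 25 used below is an assumption, inferred from Table 3.3 as the 27 projects with fewer than 100 developers minus the two projects without a defined betweenness centralization (awstats and cdex).

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    dof = n_observations - n_predictors - 1   # residual degrees of freedom
    if dof <= 0:
        raise ValueError("not enough observations for this many predictors")
    return 1 - (1 - r_squared) * (n_observations - 1) / dof


# Multiple R-squared from Table 3.4; two predictors (inclusiveness, centrality).
# n_observations = 25 is an inferred assumption, not a figure from the table.
print(round(adjusted_r_squared(0.3881, 25, 2), 4))  # 0.3325
```

Under this assumption the result agrees with the adjusted value reported in the table.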
3.5 Analysis of the Sample Projects
This section provides a characterization of each project in the sample. The four dimensions
of the coordination style are used as a basic structure. Also, the sociogram of each of the
constructed project networks is presented.
3.5.1 Abiword
                                       Min.  1st Qu.  Median    Mean  3rd Qu.     Max.
Number of modifications per author:     2.0     22.5   204.0   826.3    807.0   8512.0
Number of modified files per author:    2.0     18.5   130.0   283.7    388.0   2197.0
Number of modifications per file:     0.000    2.000   3.000   9.888    7.000  930.000
Distinct authors per file:            0.000    1.000   2.000   3.394    4.000   38.000
Gini coefficient (modifications): 0.7500
Table 3.5: Modification statistics for the Abiword project
Concentration of modifications: The gini coefficient for the concentration of modifications
is 0.75 (see Table 3.5). This is below the average value of 0.84 and remarkably low for a project of this size:
most of the larger projects (in terms of the number of developers) tend to have a relatively
high gini coefficient of around 0.9 (Sec. 3.3.1). The most active developer has “only” performed
$\frac{8,512}{52,060}$ = 16.4% of all modifications, a comparatively low value.
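The share of the most active developer follows directly from the figures in Table 3.5 and the project size given below. The helper name is hypothetical; the numbers are those reported for Abiword.

```python
def top_developer_share(max_modifications, total_modifications):
    """Share of all modifications performed by the most active developer, in percent."""
    return 100 * max_modifications / total_modifications


# Abiword: 8,512 modifications by the most active of 63 authors, 52,060 in total.
print(round(top_developer_share(8512, 52060), 1))  # 16.4

# The mean number of modifications per author is recovered the same way.
print(round(52060 / 63, 1))                        # 826.3
```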
Core developer structure: The centralization is only 0.05, a comparatively low value, given the
average value of 0.28 across all projects. This is an indication that all connected developers are
connected with most others, i.e. they modify each other’s files. It is remarkable that, although the
number of developers is relatively large, the best connected developer is connected with 47 of
his peers (the 3rd quartile is still 37 connections). The well-connectedness of the Abiword network
is also reflected in a high mean relative degree of the developers of 0.37 (project average: 0.21), which equals the
standard measure of density in Social Network Analysis.
Figure 3.11: Abiword sociogram
Degree of collaboration: The inclusiveness of the Abiword project network is 0.78, being
above average. It shows that not many “pure” specialists exist in the project who never touch
files besides their own.
Project size: The Abiword source code repository contains 52,060 modifications in 5,265
files conducted by 63 authors. This means it is a rather large project in terms of the number of
developers.
Abiword seems to be a relatively collectively driven effort, an interpretation which is confirmed
by an above-average mean number of distinct authors per file of 3.4. The sociogram of the
Abiword network, depicted in Figure 3.11, intuitively confirms the above analysis: it shows a
large proportion of well connected developers and relatively few unconnected “lone wolves”,
although it is too crowded to identify single connections between developers. Other summary
data on the project can be taken from Table 3.5.
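The two per-file measures used in this comparison can be computed directly from a list of modifications. The following is a minimal sketch; the record format (one author and file pair per modification) and the names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_file_stats(records):
    """Return (mean modifications per file, mean distinct authors per file)."""
    modifications = defaultdict(int)
    authors = defaultdict(set)
    for author, path in records:
        modifications[path] += 1
        authors[path].add(author)
    mean_mods = mean(modifications.values())
    mean_authors = mean(len(a) for a in authors.values())
    return mean_mods, mean_authors


# Hypothetical modification log: one (author, file) pair per modification.
log = [
    ("ann", "main.c"), ("ann", "main.c"), ("bob", "main.c"),
    ("bob", "util.c"),
]
print(per_file_stats(log))
```

A file that is modified often but always by the same author raises the first value only, which is why the two measures are read together.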
3.5.2 Adonthell
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      4.0     25.0     53.0    729.6    176.0   4380.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      3.0     10.0     27.0    169.2     72.0    911.0
Gini coefficient (modifications): 0.8651
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    4.000    7.031    7.000  148.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.631    2.000    7.000
Table 3.6: Modification statistics for the Adonthell project
Concentration of modifications: The concentration of modifications is relatively high, the
gini coefficient is 0.87 (Table 3.6). This shows that a single developer (in this case the project
founder) performs a large part of all modifications: the most active developer conducted
4,380 / 9,485 = 46% of them. The high concentration also becomes clear when looking at the
difference between the mean and median values of developer contributions: while the average
is 730, the median value is just 53, indicating a highly skewed distribution.
Figure 3.12: Adonthell sociogram
Core developer structure: The (betweenness based) centralization of the project network is
0.21, still below the average of 0.28. The sociogram in Fig. 3.12 shows that the connected
developers do not form a star-shaped network (although two developers are connected to the
rest), but form a more or less dense web (network density is 0.17). It seems the two main
developers have a somewhat integrating role.
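A betweenness based centralization of this kind can be sketched as follows. The sketch assumes Freeman's graph centralization on top of unnormalised shortest-path betweenness (computed with Brandes' algorithm) for an undirected, unweighted network; whether isolated developers are excluded beforehand, as suggested for this study, is left to the caller. The function names and the toy networks are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Unnormalised betweenness per node (Brandes' algorithm, undirected)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        paths = {v: 0 for v in adj}       # number of shortest paths from source
        dist = {v: -1 for v in adj}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                      # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dep = {v: 0.0 for v in adj}       # accumulated dependencies
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += paths[v] / paths[w] * (1.0 + dep[w])
            if w != source:
                score[w] += dep[w]
    # Every unordered pair was counted from both of its end points.
    return {v: s / 2.0 for v, s in score.items()}


def betweenness_centralization(adj):
    """Freeman centralization: 1.0 for a star, 0.0 for a complete graph."""
    n = len(adj)
    if n < 3:
        return 0.0
    b = betweenness(adj)
    top = max(b.values())
    return sum(top - x for x in b.values()) / ((n - 1) ** 2 * (n - 2) / 2.0)


# Hypothetical networks: a star around "hub" and a fully connected web.
star = {"hub": {"a", "b", "c", "d"},
        "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}
web = {v: {w for w in "abcd" if w != v} for v in "abcd"}
print(betweenness_centralization(star), betweenness_centralization(web))
```

A dense web in which no developer lies on the shortest paths between others yields a value close to 0, while a network held together by one integrating developer approaches 1.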
Degree of collaboration: The inclusiveness of the project is 0.54; the sociogram visualizes
the six unconnected “lone wolves” in the network. Rather than being very active specialists,
it appears that these are the least active developers (considering the median value of only 53
modifications).
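The inclusiveness value can be reproduced from the counts given above. The sketch assumes that inclusiveness is the share of developers with at least one tie, which is consistent with the reported figures (13 authors, six of them unconnected); the individual degree values in the example are placeholders.

```python
def inclusiveness(degrees):
    """Share of developers with at least one tie (connected / total)."""
    if not degrees:
        return 0.0
    connected = sum(1 for d in degrees.values() if d > 0)
    return connected / len(degrees)


# Adonthell: 13 authors, six of them unconnected. The degree of 1 given to
# the seven connected developers is a placeholder, not data from the thesis.
adonthell = {f"dev{i}": (1 if i < 7 else 0) for i in range(13)}
print(round(inclusiveness(adonthell), 2))
```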
Project size: The Adonthell source code repository contains 9,485 modifications in 1,349
files conducted by 13 authors. Other summary statistics are listed in Table 3.6.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        1        3        5     2487     3730     7454
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      2.5      4.0    438.3    657.0   1310.0
Gini coefficient (modifications): 1.0000
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    3.000    5.695    7.000  797.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.004    1.000    2.000
Table 3.7: Modification statistics for the AWStats project
3.5.3 AWStats
Concentration of modifications: The AWStats project proved to be a somewhat unique project.
The gini concentration coefficient is 1.0000, with 7,454 / 7,460 = 100% of all modifications
performed by a single author. Two other developers contributed 5 and 1 file modifications,
respectively, to the project. So, although not obvious in advance, AWStats is basically a
one-man project.
Core developer structure: As this project does not exhibit any connections between developers at all (see sociogram, in Fig. 3.13), it was not possible to calculate a (betweenness based)
centralization measure for this project.
Degree of collaboration: The inclusiveness for this project is an all-time low of 0.0, indicating
a total lack of connection between the developers. What would indicate a “perfect
specialization” of developers in a larger project with a lower concentration of modifications
merely indicates here that no other developer modified the same 10 files as the main developer
did.
Project size: The AWStats source code repository contains 7,460 modifications in 1,310 files
conducted by 3 authors. Table 3.7 shows the summary data about the number of modifications
per file and author.
Figure 3.13: AWStats sociogram
Table 3.7 confirms the findings of a one-man project: although each of the 1,310 files has
been modified 5.695 times (on average), the mean number of distinct authors per file is 1.004.
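The rows of the modification tables are six-number summaries (minimum, quartiles, mean, maximum). A minimal sketch of such a summary is given below, applied to the three AWStats contribution counts quoted in the text (7,454, 5 and 1). It assumes quartiles by linear interpolation between order statistics; that this is the rule behind the tables is an assumption, although it matches the first row of Table 3.7 up to rounding.

```python
from statistics import mean, quantiles


def six_number_summary(values):
    """Minimum, quartiles, mean and maximum of a list of counts."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min.": min(values), "1st Qu.": q1, "Median": median,
        "Mean": mean(values), "3rd Qu.": q3, "Max.": max(values),
    }


# Modifications per author for AWStats as quoted in the text.
awstats = [1, 5, 7454]
for name, value in six_number_summary(awstats).items():
    print(f"{name:>8} {value:10.1f}")
```

The large gap between the median (5) and the mean (about 2,487) is the same skewness that the gini coefficient expresses in a single number.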
3.5.4 bison
Concentration of modifications: The bison project exhibits a high concentration of 0.93, well
above the average value for all projects of 0.84. The skewness also becomes apparent when
comparing the median of 25 and the mean value of 589 modifications. The most active developer
performed 7,724 / 10,607 = 73% of all modifications. The bison project seems to be driven by
very few core developers, performing most of the work.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      8.0     25.0    589.3    160.3   7724.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     5.00    16.00    53.56    40.50   356.00
Gini coefficient (modifications): 0.9272
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00     6.00    26.06    19.00  1501.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.369    3.000   12.000
Table 3.8: Modification statistics for the bison project
Core developer structure: Although the main work is performed by only a few, the centralization
measure is relatively low (0.07) when compared to the average across all projects
(0.28). Although two of the developers, akim and eggert, take a central role, being connected to
all other developers in the network, most developers are rather well interconnected with each
other.
Degree of collaboration: The inclusiveness of the project network is 0.50; 9 out of 18 developers
remain unconnected “lone wolves” (see sociogram in Fig. 3.14). This is not surprising,
as the median of 25 modifications indicates that the less active half of the developers
hardly performed enough modifications to be connected at all.
Project size: The bison source code repository contains 10,607 modifications in 407 files
conducted by 18 authors.
Bison seems to have a few “overseeing developers”, a relatively large number of closely connected
but much less active contributors, and a number of much less active “lone wolves”.
Although its files have, on average, been modified 26.06 times, these modifications have been
performed by only 2.369 distinct authors per file (Table 3.8), which is very close to the
all-project average value.
Figure 3.14: Bison sociogram
3.5.5 BZflag
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      7.5     49.0   1003.0   1021.0   9841.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      5.5     30.0    246.8    375.8   1794.0
Gini coefficient (modifications): 0.8147
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     2.00     4.00     7.00    14.83    15.25  1300.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    3.000    3.649    5.000   20.000
Table 3.9: Modification statistics for the BZFlag project
Concentration of modifications: The gini concentration coefficient (0.81) is just below the
average for all projects. The most active developer performed “only” 9,841 / 32,100 = 31% of
all modifications, which, although still high, is not very high compared to other projects. It
seems that although much effort is put in by the lead programmer, others take on a part of the
work as well.
Core developer structure: The centralization measure of the BZFlag project is 0.23, also
below average, indicating that no developer takes on the role of a benevolent dictator in this
network. However, the best connected developer (degree of 21) is connected to each other developer within the network, indicating a coordinating role. The sociogram (Fig. 3.15) visualizes
the well-connectedness: 10 developers are not connected at all, three have weak connections
leaving 19 well connected (degree > 4) authors.
Degree of collaboration: The inclusiveness of the network is 0.69, above average. This
confirms the above characterization, in which many developers mutually modify files. Many
of the unconnected developers simply performed too few modifications (see Table 3.9) to be
connected to others.
Figure 3.15: BZFlag sociogram
Project size: The bzflag source code repository contains 32,100 modifications in 2,164 files
conducted by 32 authors.
Table 3.9 shows a comparatively high mean of 3.649 distinct authors per file (average of
nearly 15 modifications per file). This indicates, together with the relatively dense network,
little “code-ownership” and a high level of collaborative effort for this project.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     3.00     8.25    10.50  2527.00  2529.00 10080.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      3.0      6.0      9.0    591.8    594.8   2346.0
Gini coefficient (modifications): 0.9973
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    2.000    4.308    3.000  165.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.009    1.000    2.000
Table 3.10: Modification statistics for the CDex project
3.5.6 CDex
Concentration of modifications: The second smallest project in the sample (4 developers)
surprised with a high concentration rate of 1.0. 10,080 out of 10,106 modifications were performed
by the same author, “afaber”, who is also the project founder. Obviously, such a high
concentration renders many of the other dimensions pointless, as this project basically is a
one-man effort.
Core developer structure: With only one connection between two developers (see Figure 3.16),
it was not possible (or in any case not useful) to calculate a centralization measure for
this network.
Degree of collaboration: The inclusiveness of this project is 0.5, although this measure is
not very meaningful for such a small project.
Project size: The cdex source code repository contains 10,106 modifications in 2,346 files
conducted by 4 authors. It is licensed under the GNU GPL.
Although the project contains many files, these have only been modified 4.3 times on average
(see Table 3.10). It is interesting that the two smallest projects in the sample, with 3 and
4 developers respectively, both basically form a one-person effort. These two are the only ones
in the sample to exhibit concentration values of 1.0. It would be interesting to see if this is
generally the case for such small projects, or if there are some small projects which truly share
their programming efforts in a more equal way.
Figure 3.16: CDex sociogram
3.5.7 emacs
Concentration of modifications: The modification concentration coefficient for emacs is 0.89
(Tab. 3.11), indicating an above average concentration of modifications. Although 177 developers
contributed to the emacs development, the most active developer performed
20,000 / 100,113 = 20% of all modifications.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      7.0     37.0    565.6    250.0  20000.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      3.0     18.0    114.1     63.0   2273.0
Gini coefficient (modifications): 0.8896
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     4.00     9.00    31.53    24.00  7501.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    5.000    6.362    9.000  103.000
Table 3.11: Modification statistics for the emacs project
Core developer structure: The (betweenness based) centralization is only 0.10, indicating
that the developers who are part of the network are relatively well connected. Were
the degree based centralization used as a measure, the value of 0.67 would be above the average
of 0.51 (Table 3.3 on page 74). This shows that although developers are generally well connected
with each other, a few have much higher degrees than the rest. This can be confirmed by
the distribution of the degree within the project (Table 3.2 on page 73). It becomes clear that,
although more than 50% of the developers are connected (median: 7), only a smaller share is
very tightly connected to many developers (3rd quartile: 28, max: 95).
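The degree based alternative mentioned here can be sketched in a few lines. The sketch assumes Freeman's degree centralization for an undirected network; the degree lists in the example are hypothetical and do not reproduce the emacs data.

```python
def degree_centralization(degrees):
    """Freeman degree centralization: 1.0 for a star, 0.0 if all degrees are equal."""
    n = len(degrees)
    if n < 3:
        return 0.0
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical degree sequences.
star = [4, 1, 1, 1, 1]            # one hub tied to four others
even = [3, 3, 3, 3]               # everybody tied to everybody
mixed = [5, 2, 2, 1, 1, 1]        # one very well connected developer
print(degree_centralization(star),
      degree_centralization(even),
      degree_centralization(mixed))
```

A network can therefore score low on the betweenness based measure and high on the degree based one when a few developers hold far more ties than the rest without being the only bridges between others.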
It is remarkable that emacs has the highest mean number of distinct authors per file (6.4) of
all sample projects. This might not appear to be extraordinarily high, given its high number of
177 developers. However, most other projects are consistent with a mean number of around
2.4 authors. This might be due to the fact that emacs is by far the oldest project in the sample
(CVS commits begin in April 1985), thus having many “generations” of hackers working on
its files.
Degree of collaboration: The inclusiveness measure is 0.55, which is close to the average
value. A median of 37 modifications explains that most “lone wolves” were not active enough
to become a part of the network.
Project size: Measured by the number of developers, emacs is the second-largest project
in the sample (177 developers performing 100,113 modifications on 3,175 files). This high
number of developers makes the emacs sociogram (Fig. 3.17) too crowded to identify single
connections between developers.
Figure 3.17: emacs sociogram
3.5.8 Flightgear
Concentration of modifications: The FlightGear project has a gini concentration coefficient of
0.74, which is relatively low when compared to the other projects. Nevertheless, 58% of all
modifications were performed by the most active developer.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     24.0     86.0    279.0    675.8    680.3   3155.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     16.0     47.0     79.5    237.3    281.5    973.0
Gini coefficient (modifications): 0.7360
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    3.000    4.879    5.000  146.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.713    2.000    5.000
Table 3.12: Modification statistics for the Flightgear project
Core developer structure: The betweenness based centralization value for FlightGear is 0.38
which is above average. This is mostly due to two central developers, david and curt, who are
connected to (nearly) everybody else. Also, the main developer curt, is connected to a degree-1
developer mselig. This degree-1 developer would have otherwise been unconnected.
Degree of collaboration: The inclusiveness is 0.88; all developers, with the exception of one
“developer” aptly named “cvsguest”, are linked with each other, as the sociogram (Fig. 3.18)
shows. (cvsguest has performed few modifications; the name hints at the fact that no single
person is behind this CVS account.) It seems that no strict specialization takes place in
this project.
Project size: The flightgear source code repository contains 5,406 modifications in 1,108
files conducted by 8 authors. On average, each file was modified 4.9 times by 1.7 distinct
authors.
Although the number of distinct authors per file is a mere 1.7 (see Table 3.12), two central
developers are connected to nearly everybody else. A high inclusiveness and a comparably
high mean degree of 3.25 (median: 5, maximum: 6) show that FlightGear appears to involve
a relatively large amount of interaction between its developers.
Figure 3.18: Flightgear sociogram
3.5.9 Freenet
Concentration of modifications: The Freenet project has a high concentration of 0.91
(Table 3.13); the most active developer has performed 63% of all modifications. The high
concentration for this project can partly be explained: Since 2003, one of the main developers,
Matthew Toseland (CVS name amphibian), has been paid through donations to the project
foundation22, working full-time on the project.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      4.0     25.5    392.2    159.5   8856.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00    14.50    84.25    87.50   830.00
Gini coefficient (modifications): 0.9063
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     4.00    11.47     9.00   744.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.464    3.000   17.000
Table 3.13: Modification statistics for the Freenet project
Core developer structure: amphibian dominates not only the proportion of work performed.
The network is also comparably centralized (0.46, all-project mean value: 0.28), with
amphibian, among other prominent figures, forming a central node. As
the sociogram shows (Fig. 3.19), he is connected to four otherwise unconnected developers
(“tamed wolves”), emphasizing his role as coordinator and integrator. Project founder Ian
Clarke, named sanity in the sociogram, is also part of the tight network of well-connected core
developers.
Degree of collaboration: With an inclusiveness of 0.53, the Freenet project is close to the
average inclusiveness of 0.52. The median of contributions is only 25.5, which means that most
of the unconnected “lone wolves” have not been active enough to become connected to the
network.
Project size: The freenet source code repository contains 14,118 modifications in 1,231 files
conducted by 36 authors. The average distinct number of authors per file (2.5) is close to the
average across all projects. Other summary data can be seen in Table 3.13.
22 Freenet set up a non-profit foundation in order to protect their source code and to shield
individual developers from law suits, as many other open source projects do. See e.g. O’Mahony
(2003) on how many open source projects protect themselves and their source code.
Figure 3.19: Freenet sociogram
3.5.10 Gnomemeeting
Concentration of modifications: The Gnomemeeting project is highly concentrated in its number
of contributions. The gini coefficient is 0.91; this is mostly due to project founder and lead
developer Damien Sandras (CVS name dsandras), who performed 86% of all modifications.
The median number of modifications across all 98 developers is only 9.5 (mean value of
109), confirming a high skewness.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     4.00     9.50   109.00    26.75  7191.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     2.00    13.03     4.00   477.00
Gini coefficient (modifications): 0.9142
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     4.00    19.18     9.00  1607.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.293    2.000   81.000
Table 3.14: Modification statistics for the Gnomemeeting project
Core developer structure: The centralization of the Gnomemeeting network is relatively low:
0.06. The sociogram (Fig. 3.20) shows that the connected part of the developers is
indeed very well connected, showing collaboration between these core developers. However
low this value is, it should be noted that the centralization is only concerned with the connected
part of the network, which is very small in this case, as the next measure shows.
Degree of collaboration: The inclusiveness of the Gnomemeeting network is, with the exception
of the one-developer effort AWStats, by far the lowest in the sample: 0.08, as e.g. the dendrogram
in Figure 3.7 on page 64 visualizes. Only eight out of 98 developers are connected at all. The
explanation for this has been described above: a low median of modifications of 9.5 shows that
most developers in this network have only contributed very little to the project. Possibly the
source code administrator, Damien Sandras, quickly gave CVS write permissions to persons
who only wanted to contribute a small bug-fix.
Project size: The gnomemeeting source code repository contains 10,684 modifications in
557 files conducted by 98 authors.
Figure 3.20: Gnomemeeting sociogram
3.5.11 Gnunet
Concentration of modifications: The gini coefficient of the GNUnet project is 0.93, a relatively
high value, representing a strong concentration of modifications. This is mostly due to project
founder and leader Christian Grothoff, who performed 83% of all modifications.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0     21.5     64.0    807.4    282.0  10040.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      9.5     34.0    111.5     95.5    936.0
Gini coefficient (modifications): 0.9284
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     4.00    12.59    11.00   338.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.738    2.000    8.000
Table 3.15: Modification statistics for the GNUnet project
Core developer structure: The centralization of the GNUnet network is 0.64, a high value
compared to the all-project average of 0.28. A look at the sociogram (Fig. 3.21) clarifies
why: grothoff is not only part of the small dense network of extremely well connected core
developers, but also ties several “tamed wolves” into the network, resulting in a relatively
star-shaped part of the network. It is remarkable that in GNUnet, just as e.g. in the Freenet
project, the main developer is the person who connects to all degree-1 developers. It seems that
these most active persons perform the integrating work, tying other people’s code into the
overall architecture.
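The observation that degree-1 developers (“tamed wolves”) attach to the main developer can be checked mechanically on a network. The following is a minimal sketch on a hypothetical network; the account names are placeholders and the adjacency format (author mapped to the set of tied authors) is an assumption.

```python
from collections import Counter


def tamed_wolves(adjacency):
    """Map each degree-1 developer to the single developer they are tied to."""
    return {dev: next(iter(nb)) for dev, nb in adjacency.items() if len(nb) == 1}


def integrators(adjacency):
    """Developers ranked by the number of degree-1 developers they tie in."""
    return Counter(tamed_wolves(adjacency).values()).most_common()


# Hypothetical network: a dense core of three plus two degree-1 contributors.
network = {
    "lead": {"core1", "core2", "wolf1", "wolf2"},
    "core1": {"lead", "core2"},
    "core2": {"lead", "core1"},
    "wolf1": {"lead"},
    "wolf2": {"lead"},
    "loner": set(),          # unconnected "lone wolf"
}
print(tamed_wolves(network))
print(integrators(network))
```

Developers with no ties at all are ignored, so the result lists only those who would drop out of the network if their single connection were removed.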
Degree of collaboration: The inclusiveness of the network is 0.67, above average. Although
the largest amount of work is performed by a single developer, relatively many developers
are part of the network, possibly because grothoff forms a link to the three “tamed
wolves”.
Project size: The gnunet source code repository contains 12,111 modifications in 962 files
conducted by 15 authors.
Figure 3.21: GNUnet sociogram
3.5.12 GTK+
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      5.5     15.0    282.9     58.5  21530.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00     5.00    49.64    14.00  1594.00
Gini coefficient (modifications): 0.9183
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00     6.00    28.38    18.00  6752.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    3.000    4.981    6.000  150.000
Table 3.16: Modification statistics for the GTK+ project
Concentration of modifications: GTK+ is a large project: 271 developers contributed to
it. A large number of developers, however, does not seem to guarantee an equal workload for
all developers. The gini coefficient for the number of modifications is 0.92; the most active
developer performed 21,530 / 76,666 = 28% of all modifications.
Core developer structure: The centralization of the network is 0.17, below average. This
indicates that the 28% of the developers who are included in the network (see below) are
relatively well connected. The degree based centralization measure for this project is 0.69,
above the average of 0.51, indicating that the degree of connections in the network is not equally
distributed. This is confirmed by Table 3.2, which shows a maximum degree of 74, while the
3rd quartile (covering all 271 developers) is only 3.
Degree of collaboration: The inclusiveness of the GTK+ network is 0.28. Given that the
median of the modifications is 15 (Table 3.16), i.e. 50% of the contributors performed fewer
than 15 modifications, it becomes clear that most of the large proportion of unconnected “lone
wolves” did not contribute enough to become part of the network.
Project size: The gtk+ source code repository contains 76,666 modifications in 2,701 files
conducted by 271 authors and is the largest project in the sample. As a consequence, the
sociogram (Fig. 3.22) is rather crowded: it is just possible to spot the web of developers
surrounded by a large mass of unconnected contributors. It is included for the sake of
completeness.
Figure 3.22: GTK+ sociogram
Overall, it can be said that GTK+ is, despite its size, a relatively centrally coordinated project,
with a few developers performing most of the work and exhibiting most connections to other
developers. Still, its low centralization measure, indicating a well-connected core, and a very
high number of distinct authors per file of 5.0 (the second highest in the sample) show that no
strict specialization or code ownership seems to take place in this project.
3.5.13 iRate
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00    28.50   110.30    78.75  1054.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.25    13.50    34.56    33.00   240.00
Gini coefficient (modifications): 0.8137
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    5.471    4.000  182.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.713    2.000   11.000
Table 3.17: Modification statistics for the Irate project
Concentration of modifications: The gini coefficient representing the concentration of modifications is 0.81, below the average of 0.84. Still, Anthony Jones (CVS name ajones), project
founder and leader has performed 1,054 out of 1,986 modifications (53%).
Core developer structure: The core developer network is relatively centralized (0.62). This is
partly due to ajones tying in all connected developers, including two “tamed wolves” (degree-1
contributors), as can be seen in the sociogram (Fig. 3.23). Again in this project, it is the
main developer who connects those degree-1 developers into the network, which hints at an
integrating function of this main developer.
Degree of collaboration: The inclusiveness of the network is 0.44. Again, many developers
performed too little work to be connected to the network. Some of them seem to perform some
specialized work, such as modifying translation files for a certain language.
Project size: The iRate source code repository contains 1,986 modifications in 363 files
conducted by 18 authors.
Figure 3.23: Irate sociogram
A high centralization and a comparably low inclusiveness of the network help to explain the
comparably low number of distinct authors per file of 1.7. A strong coordinating role of ajones
and many “lone wolves” contribute to that low value.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      2.0     15.0     34.0    396.6    298.0   3264.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     2.00     8.75    19.50    63.50    74.00   333.00
Gini coefficient (modifications): 0.8258
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     5.00    20.26    19.00   379.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    3.000    3.244    4.000   18.000
Table 3.18: Modification statistics for the LAME project
3.5.14 LAME
Concentration of modifications: The LAME project exhibits a concentration coefficient of 0.83
(Table 3.18), close to the average value. The most active developer performed 32% of all
modifications.
Core developer structure: The centralization of the LAME network is 0.24, below average.
The sociogram (Fig. 3.24) visualizes the reason: The core developers form a tight,
well-connected network, with only a few “tamed wolves”. It is interesting that both “tamed wolves”
are connected to very prominent developers: aleidinger, listed on the project page as “primary
developer”, and markt, who currently acts as project maintainer.
Degree of collaboration: LAME includes 65% of its developers in the network (inclusiveness
of 0.65). This is above average; it seems that the remaining 35% contributed too little to
become a part of the network (median of modifications of 34).
Project size: The LAME source code repository contains 10,312 modifications in 509 files
conducted by 26 authors.
The dense web suggests that LAME is a very collaboratively developed project, without
strong areas of specialization or areas of code ownership. The above average distinct authors
per file measure of 3.2 confirms this assumption.
Figure 3.24: LAME sociogram
3.5.15 Mailman
Concentration of modifications: The concentration of modifications is comparably high (0.92).
80% of all modifications were performed by project founder and lead developer Barry Warsaw.
The remaining 20% were performed by 23 developers.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0     18.5     61.5    570.6    151.3  11000.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     6.75    21.50   108.40    62.25  1772.00
Gini coefficient (modifications): 0.9213
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    2.000    7.312    5.000  312.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.389    1.000    9.000
Table 3.19: Modification statistics for the Mailman project
Core developer structure: The sociogram in Fig. 3.25 illustrates the high centralization of
the developer network of 0.82 (the second largest value across the sample). While a set of 7
developers forms a very tight core network23, one developer from the inner core clearly stands
out: bwarsaw, the project lead developer, is connected with many “tamed wolves”. His central
position, integrating other people’s code into the main architecture, is clearly visible.
Degree of collaboration: The inclusiveness of the network is 0.62, above average. This could
be due to the fact that bwarsaw ties many developers into the network who would otherwise
have been unconnected.
Project size: The Mailman source code repository contains 13,695 modifications in 1,873
files conducted by 24 authors.
The above dimensions make it clear that mailman is very much controlled by a single developer,
although a tight core of well connected developers exists. The high concentration, i.e. 80%
of all modifications, and his central position in the network emphasize this. The low distinct
authors per file value of just 1.4 is therefore not surprising. Other summary data can be found
in Table 3.19.
23 One of these seven developers is called mailman. It is possible that this account is not used
by a single person, but for e.g. some automated tasks or the import of external code.
Figure 3.25: Mailman sociogram
3.5.16 mnet
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.25    14.00   514.60    77.50  4410.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.25    13.00   114.40    45.25   709.00
Gini coefficient (modifications): 0.9099
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    3.000    7.746    5.000  477.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.723    2.000    6.000
Table 3.20: Modification statistics for the mnet project
Concentration of modifications: mnet, like many others, is very concentrated with regard to
the number of modifications. The gini coefficient is 0.91. The lion’s share of the modifications
is distributed among the two most active developers. The most active performed 61% of all
modifications alone.
Core developer structure: The centralization of the network is 0.43, far above average. The
sociogram (Fig. 3.26) shows that the best connected developer (who is also the most active)
ties in a “tamed wolf” and is connected to all other developers in the network.
Degree of collaboration: A median of only 14 modifications helps to explain the inclusiveness of 0.50. The less active half of the developers did not perform enough work to become a
part of the mnet network.
Project size:
The mnet source code repository contains 7,204 modifications in 930 files conducted by 14
authors.
All in all, the project is mostly driven by the three most active developers, who are also the
best connected ones, indicating their dominant role. All other developers who are part of the
network are not connected to each other, but only to these three developers, confirming their importance.
Figure 3.26: mnet sociogram
3.5.17 nano
Concentration of modifications: The nano text editor has a gini coefficient of 0.86, close to the
average of 0.84. The most active developer performed 59% of all modifications.
Core developer structure: The centralization of the network is 0.14, exactly half the average
value. The nano sociogram (Fig. 3.27) shows that while the number of modifications is concentrated, the connections with other developers are relatively equal and decentralized.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     10.0     19.0     33.0    725.1    567.0   4724.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     3.00     8.50    11.00    43.45    51.50   169.00
Gini coefficient (modifications): 0.8629
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00     8.00    41.76    43.50  1370.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    2.000    2.503    3.000   11.000
Table 3.21: Modification statistics for the nano project
Degree of collaboration: The inclusiveness of the network is 0.55; most of the unconnected
developers performed too few modifications to become a part of the network.
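The inclusiveness values in this chapter are consistent with reading inclusiveness as the share of developers who have at least one tie in the network (for Ogle, for instance, two of eight developers are unconnected and the reported value is 0.75). A minimal Python sketch under that reading follows; the developer names and ties are hypothetical.

```python
def inclusiveness(developers, ties):
    """Share of developers that appear in at least one tie."""
    connected = set()
    for a, b in ties:
        connected.update((a, b))
    return len(connected & set(developers)) / len(developers)


# Eight hypothetical accounts: six form a fully connected core,
# dev7 and dev8 have no ties at all.
developers = [f"dev{i}" for i in range(1, 9)]
core = developers[:6]
ties = [(a, b) for i, a in enumerate(core) for b in core[i + 1:]]

print(inclusiveness(developers, ties))
```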
Project size: The nano source code repository contains 7,976 modifications in 191 files
conducted by 11 authors.
On average, each file has been modified 41.8 times, which is comparatively high, with a mean number
of distinct authors of 2.5 (approximately the average value across all projects). The
nano project appears to be driven by a few developers performing most of the modifications.
However, when it comes to coordination, no developer appears to take a very dominant role in
the project.
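The two per-file figures used here, mean modifications per file and mean distinct authors per file, can be derived from a flat list of modification records. The sketch below assumes one (author, file) pair per modification; the record layout, names, and paths are hypothetical and not the data format used in the study.

```python
from collections import defaultdict
from statistics import mean


def per_file_statistics(records):
    """records: iterable of (author, file) pairs, one per modification."""
    modifications = defaultdict(int)
    authors = defaultdict(set)
    for author, path in records:
        modifications[path] += 1
        authors[path].add(author)
    mean_modifications = mean(modifications.values())
    mean_distinct_authors = mean(len(a) for a in authors.values())
    return mean_modifications, mean_distinct_authors


# Hypothetical log entries, for illustration only.
records = [
    ("alice", "src/main.c"), ("alice", "src/main.c"), ("bob", "src/main.c"),
    ("alice", "src/util.c"),
    ("carol", "doc/README"), ("carol", "doc/README"),
]
print(per_file_statistics(records))
```

For nano, the reported totals give 7,976 / 191 ≈ 41.8 modifications per file, which matches the mean in Table 3.21.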
Figure 3.27: nano sociogram
3.5.18 Ogle
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      3.0     39.0    215.5    413.8    709.3   1141.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     2.00    24.50    82.50    84.63   122.80   193.00
Gini coefficient (modifications): 0.6408
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     6.00    12.12    13.00   142.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     1.00     2.00     2.48     3.00     7.00
Table 3.22: Modification statistics for the Ogle project
Concentration of modifications: Ogle has one of the lowest gini coefficients across the sample,
“only” 0.64. Of all eight developers, the most active is responsible for 34% of the modifications.
Core developer structure: The core developer structure of ogle is interesting: apart from two
totally unconnected “lone wolves”, all developers are connected to all others (sociogram, see
Fig. 3.28), leading to a centralization of 0.0 (for both the betweenness and the degree based centrality
measures). This is a unique situation across all projects. It seems that none of the developers
has a central role, coordinating or integrating the others’ work. This assumption is backed by
the ogle website, which does not list a project founder or leader (as most others do), but states
that ogle is “developed by a few students at Chalmers University of Technology” (Sweden).
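A degree based centralization of 0.0 is what Freeman's centralization index yields when every developer has the same degree, and 1.0 is what it yields for a pure star. The Python sketch below illustrates this under two assumptions that this section does not confirm: that Freeman's normalization is the one used, and that the index is computed over the connected developers only (the reported 0.0 despite two unconnected developers suggests the latter). Node names are hypothetical.

```python
def degree_centralization(adjacency):
    """Freeman degree centralization of an undirected graph.

    adjacency maps each node to the set of its neighbours.
    """
    n = len(adjacency)
    if n < 3:
        raise ValueError("centralization needs at least three nodes")
    degrees = [len(neighbours) for neighbours in adjacency.values()]
    top = max(degrees)
    # The denominator is the value reached by a star graph,
    # the most centralized structure possible.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


def complete_graph(nodes):
    return {u: {v for v in nodes if v != u} for u in nodes}


def star_graph(center, leaves):
    graph = {leaf: {center} for leaf in leaves}
    graph[center] = set(leaves)
    return graph


print(degree_centralization(complete_graph(list("abcdef"))))
print(degree_centralization(star_graph("hub", ["x", "y"])))
```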
Degree of collaboration: The inclusiveness of the network is 0.75; apart from two unconnected
authors, all developers are included.
Project size: The ogle source code repository contains 3,310 modifications in 273 files conducted by 8 authors.
Although only eight developers take part in the development, the project has 2.5 distinct
authors per file. The moderate concentration of modifications and the extremely low centralization of the network fit the picture of ogle as a project developed collaboratively by a team of students located at the
same physical facility.
Figure 3.28: Ogle sociogram
3.5.19 OpenSSL
Concentration of modifications: OpenSSL is in many respects very similar to the ogle project.
It is only moderately concentrated: the gini coefficient is 0.65 (all-project average 0.84). Although the most active developer has performed 41% of all modifications, the comparatively
small difference between the median (2346) and the mean value (3676) (Table 3.23) shows that the
distribution is not as skewed as most others.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      7.0    348.3   2346.0   3676.0   4684.0  18050.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      7.0    115.5    694.0    728.8   1184.0   1907.0
Gini coefficient (modifications): 0.6496
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00     4.00    13.62    12.00  1704.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.701    4.000   12.000
Table 3.23: Modification statistics for the OpenSSL project
Core developer structure: The structure of the core developers is very interesting. The centralization value is just 0.01; most developers are connected to nearly all others, forming a very
dense web of connections.
Degree of collaboration: The inclusiveness of the project is 0.92, the second highest value
in the sample. Just one developer remains unconnected; with only 7 file modifications, he did
not perform enough work to become part of the network.
Project size: The openssl source code repository contains 44,116 modifications in 3,238 files
conducted by 12 authors.
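The project size figures allow a quick consistency check of the table means and of the share attributed to the most active developer. The short Python calculation below uses only numbers reported for OpenSSL in the text and in Table 3.23; the table values are rounded, so small deviations are to be expected.

```python
# Figures reported for OpenSSL in the text and in Table 3.23.
modifications = 44_116
files = 3_238
authors = 12
top_author_modifications = 18_050  # maximum per author, as printed

mean_per_author = modifications / authors             # table mean: 3676.0
mean_per_file = modifications / files                 # table mean: 13.62
top_share = top_author_modifications / modifications  # text: 41%

print(round(mean_per_author), round(mean_per_file, 2), f"{top_share:.0%}")
```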
The OpenSSL network exhibits an extremely well-connected and dense network of developers. No central generalists or integrators are visible in this network, which seems to be a truly
collective effort of a community without a strong feeling for code ownership (although “only”
2.7 distinct authors modified each file on average) or specialization. The reasons for this cannot be derived
from this data; OpenSSL would make an interesting case study. It could be speculated that the
library, which provides encryption, only attracts very competent and experienced programmers,
thus having a high barrier to entry and deterring developers who would contribute only little. Perhaps the requirements for being granted CVS write access are very high, as a library that
provides encryption to many other programs needs to be reliable and trusted.
Figure 3.29: OpenSSL sociogram
3.5.20 pango
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        1        4       14      146       58     4001
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     4.00     8.00    26.17    29.25   329.00
Gini coefficient (modifications): 0.8894
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     5.00    17.53    13.00  1097.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    3.144    4.000   40.000
Table 3.24: Modification statistics for the pango project
Concentration of modifications: The pango project is, despite its large number of 46 developers, relatively concentrated (0.89). 60% of all modifications are performed by one developer.
Core developer structure: Although the number of modifications is concentrated, the centralization of the core developer network is not very high: 0.24 is below the average value. The
inner core indeed seems to be relatively well connected, with only one “tamed wolf” sticking
out (sociogram, see Fig. 3.30).
Degree of collaboration: Although the inner core is well-connected, the remainder of the
developers are not. The inclusiveness of 0.37 shows that just about a third of the developers are
connected at all. Given the low median of modifications of 14, it becomes apparent that most
developers only contributed little to the project.
Project size: The pango source code repository contains 6,715 modifications in 383 files
conducted by 46 authors. Other modification data can be taken from Table 3.24.
The pango project seems to be mainly driven by a small web of core developers, who
are relatively well connected. On average, 3.1 distinct authors have modified each file. One
potential explanation for this is that pango is an underlying foundation for the GTK+ toolkit,
which is an essential part of the GNOME desktop. This desktop is actively supported by
many companies. For example, core developer Owen Taylor (CVS name owen) is a software engineer
employed by RedHat; other employed programmers support the pango development.
Figure 3.30: pango sociogram
3.5.21 phpMyAdmin
Concentration of modifications: The gini coefficient indicating the concentration of phpMyAdmin modifications is 0.79, a little below the average. The most active developer performed 30%
of all modifications. However, the large difference between the median of the number of modifications (70) and the average (1968) shows that the number of contributions is still relatively
skewed.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
        2       12       70     1968     1897    11250
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      2.0      8.5     35.0    204.6    269.5    919.0
Gini coefficient (modifications): 0.7927
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     2.00     4.00    34.87    45.75  3978.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    3.000    3.627    4.000   18.000
Table 3.25: Modification statistics for the phpMyAdmin project
Core developer structure: The core developers form a well connected, dense network without
a central figure standing out. The betweenness based centralization is only 0.04.
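To make the betweenness based measure more concrete, the following Python sketch computes shortest-path betweenness with Brandes' algorithm and normalizes the result in the way proposed by Freeman, so that a star network scores 1.0 and a fully connected network 0.0. That the study used exactly this normalization is an assumption; the two toy networks are hypothetical.

```python
from collections import deque


def betweenness(adjacency):
    """Raw shortest-path betweenness per node (unweighted, undirected)."""
    score = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = dict.fromkeys(adjacency, 0)   # number of shortest paths
        dist = dict.fromkeys(adjacency, -1)
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                          # breadth-first search
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adjacency, 0.0)
        while stack:                          # accumulate dependencies
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


def betweenness_centralization(adjacency):
    n = len(adjacency)
    if n < 3:
        raise ValueError("centralization needs at least three nodes")
    scores = betweenness(adjacency)
    top = max(scores.values())
    # A star graph attains the maximum sum of differences.
    return sum(top - s for s in scores.values()) / ((n - 1) ** 2 * (n - 2) / 2)


star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
clique = {v: {w for w in "abcd" if w != v} for v in "abcd"}
print(betweenness_centralization(star), betweenness_centralization(clique))
```

A value as low as 0.04 therefore indicates a network that is much closer to the fully connected case than to the star.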
Degree of collaboration: The inclusiveness of the network is 0.68, above average. Just six
“lone wolves” remain unconnected to the rest of the network.
Project size: The phpmyadmin source code repository contains 37,385 modifications in
1,072 files conducted by 19 authors.
The average number of distinct authors per file is 3.6, a comparatively large value. Together
with the exceptionally low centralization and the relatively high inclusiveness of the dense network
depicted in Fig. 3.31, this suggests that phpMyAdmin is a project in which most contributions are performed by a few developers, while many active developers do not hesitate
to modify other developers’ files. There do not seem to be areas of responsibility or code
ownership of files.
Figure 3.31: phpMyAdmin sociogram
3.5.22 PostgreSQL
Concentration of modifications: The gini coefficient indicating the concentration of modifications is 0.86, a little above average. Of all 28 developers, the most active performed 44% of all
changes.
Core developer structure: The centralization of the developers’ network is 0.26, a little
below average. Looking at the sociogram (Fig. 3.32), a dense network of core developers can
be identified, although some developers are less well connected to the “core web”.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00    97.75   447.00  3401.00  1477.00 41540.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00    26.75   183.00   586.80   427.80  4128.00
Gini coefficient (modifications): 0.8554
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     3.00     6.00    16.87    15.00  1447.00
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.911    4.000   13.000
Table 3.26: Modification statistics for the PostgreSQL project
Degree of collaboration: The inclusiveness of the PostgreSQL network is relatively high
(0.86). Most developers are part of the network.
Project size: The postgresql source code repository contains 95,221 modifications in 5,644
files conducted by 28 authors.
PostgreSQL is a somewhat unique project in that it was not founded by, and is not run by, a
single person (or organization), but is run by a “Steering Committee”, currently consisting of six persons. This seems to show in the sociogram (Fig. 3.32): no star-shaped network
emerges, but rather a dense web of collaborators. In absolute terms, it has the third highest
mean degree (8.6). The number of distinct developers per file is comparatively high (2.9). This
illustrates the collaborative nature of this project.
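The mean degree quoted above is the average number of ties per developer, which for an undirected network equals twice the number of ties divided by the number of developers. A small Python sketch with a hypothetical four-developer network follows; whether unconnected developers enter the average in the study is not stated here and is left open.

```python
def mean_degree(adjacency):
    """Average number of ties per developer in an undirected network."""
    return sum(len(neighbours) for neighbours in adjacency.values()) / len(adjacency)


# Hypothetical network: a triangle (a, b, c) plus d attached to c.
network = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(mean_degree(network))
```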
Figure 3.32: PostgreSQL sociogram
3.5.23 Smarty
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     12.0     68.0    295.0    534.6    703.5   2224.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0     19.0     63.0    163.9    158.0    912.0
Gini coefficient (modifications): 0.6759
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    4.321    2.000  491.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.325    1.000    7.000
Table 3.27: Modification statistics for the Smarty project
Concentration of modifications: With a value of only 0.68, the Smarty project has a relatively
low concentration of modifications. The top developer performed “only” 38% of all modifications. It seems that Smarty is developed by several developers sharing the burden.
Core developer structure: Although Smarty is developed by several developers, the centralization of the network
is relatively high (0.39). While five developers are relatively well-connected, several “tamed
wolves” are brought into the network. It seems that some developers are only weakly connected
to the network. This could indicate developers who mostly specialize in “their” own set of files.
Degree of collaboration: The inclusiveness of the network is relatively high, 0.82. A reason
could be the inclusion of many degree-1 developers into the network who would otherwise
remain unconnected.
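The distinction drawn in the project descriptions between unconnected “lone wolves”, degree-1 “tamed wolves”, and better connected developers can be expressed as a simple rule on each developer's degree. The Python sketch below is an illustration of that reading; the label "connected" for developers with two or more ties is a placeholder chosen here, and all names are hypothetical.

```python
def classify_developers(developers, ties):
    """Label developers by degree; ties are unique undirected pairs."""
    degree = {dev: 0 for dev in developers}
    for a, b in ties:
        degree[a] += 1
        degree[b] += 1
    labels = {}
    for dev, d in degree.items():
        if d == 0:
            labels[dev] = "lone wolf"    # not part of the network
        elif d == 1:
            labels[dev] = "tamed wolf"   # tied in by exactly one developer
        else:
            labels[dev] = "connected"    # placeholder label for degree >= 2
    return labels


developers = ["hub", "core1", "core2", "newcomer", "outsider"]
ties = [("hub", "core1"), ("hub", "core2"), ("core1", "core2"), ("hub", "newcomer")]
print(classify_developers(developers, ties))
```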
Project size: The smarty source code repository contains 5,881 modifications in 1,361 files
conducted by 11 authors.
Although the inclusiveness of the project is relatively high and the concentration of modifications relatively low, indicating a collaborative effort, the high centralization of the network
and the very low number of distinct authors per file (1.3) indicate the existence of code ownership for files and a specialization of developers, with some central figures coordinating the
project.
Figure 3.33: Smarty sociogram
3.5.24 Stepmania
Concentration of modifications: The stepmania project has a high concentration of modifications; the gini coefficient is 0.94. Although the most active developer performed “only” 44% of
all modifications, most of the activity is performed by very few developers. The large difference
between the median (32) and the mean value of modifications (1,099) demonstrates how skewed
the distribution is.
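The comparison of median and mean used in several project descriptions can be summarized as a ratio. The Python snippet below applies it to the (median, mean) pairs of modifications per author reported in this chapter; the ratio itself is only an informal indicator introduced here for illustration, not a measure used in the study.

```python
# (median, mean) of modifications per author, as reported in the text.
reported = {
    "tdb": (42.0, 42.14),
    "OpenSSL": (2346.0, 3676.0),
    "phpMyAdmin": (70.0, 1968.0),
    "Stepmania": (32.0, 1099.0),
}

# A ratio near 1 suggests little skew; large values indicate a long right tail.
ratios = {name: mean / median for name, (median, mean) in reported.items()}

for name in sorted(ratios, key=ratios.get):
    print(f"{name:<11}{ratios[name]:7.2f}")
```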
Core developer structure: The centralization of the network is 0.36, a relatively high value.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      6.0     32.0   1099.0    135.5  21120.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      5.5     17.0    410.0     92.5   7136.0
Gini coefficient (modifications): 0.9394
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      1.0      2.0      3.8      2.0    597.0
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.418    2.000   16.000
Table 3.28: Modification statistics for the Stepmania project
A look at the sociogram (Fig. 3.34) confirms that, although a small web of well connected
developers exists, two developers, gmaynard and chrisdanford, stand out with a high degree.
These two are also connected to many “tamed wolves”, tying their work into the network. It
seems that these degree-1 developers focus on only some files. The low mean value of distinct
authors per file of 1.4 is consistent with this.
Degree of collaboration: The Stepmania project has an inclusiveness of 0.55. Given that
many of the less active contributors performed only few modifications, this is not too surprising.
Project size: The stepmania source code repository contains 48,357 modifications in 12,725
files conducted by 44 authors.
Stepmania seems to be driven by two coordinating main developers, with a small network of
well-connected core developers. The many weakly connected developers seem to indicate
a specialization of developers on few files. Given the very high number of files, with only 3.8
modifications per file, this could very well be the case.
Figure 3.34: Stepmania sociogram
3.5.25 tdb
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     3.00     5.50    42.00    42.14    67.50   104.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00     5.50    11.00    14.71    20.00    40.00
Gini coefficient (modifications): 0.5763
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    5.364    5.000   66.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.873    2.000    6.000
Table 3.29: Modification statistics for the tdb project
Concentration of modifications: The tdb project is remarkable in that the concentration of
modifications is the lowest across all projects; the gini coefficient is only 0.58. It is also the
only project in which the mean (42.14) and the median (42) of the number of contributions are nearly equal, i.e. the distribution is hardly skewed. The most active developer performed 104 out of the
295 modifications.
Core developer structure: The project is rather small, with only three connected developers
in the network and ben in the center (see sociogram in Fig. 3.35). This leads to a unique
centralization value of 1.0.
Degree of collaboration: The inclusiveness of this network is 0.43. Given that the developers
only perform a total sum of 295 modifications in 55 files, inclusiveness might not be very
significant, as the cut-off point of ten common files might be hard to reach.
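The effect of the cut-off can be illustrated with a small Python sketch that links two developers only if they have modified enough files in common. The record layout and names are hypothetical, and treating the cut-off as "at least ten common files" (rather than "more than ten") is an assumption made for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_network(records, min_common_files=10):
    """Link two authors who modified at least `min_common_files` of the same files.

    records: iterable of (author, file) pairs, one per modification.
    Returns a set of alphabetically ordered author pairs.
    """
    files_by_author = defaultdict(set)
    for author, path in records:
        files_by_author[author].add(path)
    ties = set()
    for a, b in combinations(sorted(files_by_author), 2):
        if len(files_by_author[a] & files_by_author[b]) >= min_common_files:
            ties.add((a, b))
    return ties


# Hypothetical data: alice and bob share ten files, carol shares only five.
records = (
    [("alice", f"f{i}") for i in range(10)]
    + [("bob", f"f{i}") for i in range(10)]
    + [("carol", f"f{i}") for i in range(5)]
)
print(build_network(records))
print(build_network(records, min_common_files=5))
```

In a repository with only 55 files, few pairs of developers can reach ten common files, which is why the inclusiveness of a very small project says little about its degree of collaboration.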
Project size: The tdb source code repository contains 295 modifications in 55 files conducted
by 7 authors. It is interesting that, although the project is very small in terms of the number of
modifications and the number of files, each file still has 1.9 distinct authors on average. A
value around 2.0-2.5 seems to be some natural equilibrium for most open source projects
(at least those with fewer than 100 developers).
Figure 3.35: tdb sociogram
3.5.26 TikiWiki
Concentration of modifications: The TikiWiki project is relatively concentrated; the gini coefficient is 0.89. The most active developer performed 22% of all modifications.
Core developer structure: The centralization of the project network is relatively low, having
a value of 0.16. This means that although a large proportion of modifications is concentrated
on a few developers, no “central hub” exists in the middle of the network. The
degrees are not equally distributed among developers, though. The degree based centralization
measure is 0.65, above the average of 0.51; the best connected developer has ties to 45 of his
peers.
Degree of collaboration: Given the median of 24 modifications of all 89 developers, it is
not surprising that the inclusiveness of the network is 0.55. Many developers did not perform
enough work to become connected to the network. The sociogram (Fig. 3.36) shows the large
number of unconnected “lone wolves” surrounding the crowded cluster of connected developers.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      6.0     24.0    362.5    103.0   7110.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      4.0     14.0    160.9     58.0   3160.0
Gini coefficient (modifications): 0.8865
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0    1.000    3.000    6.201    6.000  546.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.753    3.000   42.000
Table 3.30: Modification statistics for the TikiWiki project
Project size: The TikiWiki source code repository contains 32,260 modifications in 5,202 files conducted
by 89 authors.
The ‘grow fast, make stable later’ philosophy of the project, as described in the interview
quote in the project description on page 55, explains why this project with 89 developers only has 6.2
modifications per file on average. Although the number of modifications per file is low, the
mean number of distinct authors is above average (2.8), showing that no strict file ownership
exists in TikiWiki. All in all, it seems that coordination mostly takes place through a small
number of well connected developers.
Figure 3.36: TikiWiki sociogram
3.5.27 wget
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     25.0     55.5    147.0    810.8    699.8   3637.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    19.00    25.00    59.50    77.17   104.50   192.00
Gini coefficient (modifications): 0.8484
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0      4.0      8.0     23.5     24.5    705.0
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    2.237    3.000    6.000
Table 3.31: Modification statistics for the wget project
Concentration of modifications: The gini coefficient indicating the concentration of modifications is 0.85, just above average.
Core developer structure: While the amount of work is highly concentrated, the same is not
true for the developer network. The centralization of the network is 0.14, thus comparably low.
Looking at the sociogram (Fig. 3.37), no center of the network can be detected (although two
of the developers are connected to all others).
Degree of collaboration: wget features an inclusiveness of 1.0: all developers are included in the
network and connected to at least two other developers.
Project size: The wget source code repository contains 4,865 modifications in 207 files
conducted by 6 authors. Other summary data is listed in Table 3.31.
It is interesting that, although each file has been modified 23.5 times (on average), the number
of distinct authors per file (2.2) is about the same as in most other projects.
In the wget project, it seems, most of the work is performed by a few developers. These are
also connected to all other developers, tying their work into the network.
Figure 3.37: wget sociogram
3.5.28 xerces
Concentration of modifications: The xerces project has a concentration of modifications of
0.72, high but below average. The most active developer performed 16% of all modifications.
Core developer structure: The centralization of the xerces network is relatively low, 0.12.
The sociogram (Fig. 3.38) illustrates how densely the network is connected. It does not seem
to contain a single center, but rather a close web of interconnected core developers who modify
each other’s files.
Degree of collaboration: The inclusiveness of the project is relatively high (0.78).
Project size: The xerces source code repository contains 18,304 modifications in 2,276 files
conducted by 27 authors.
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0     23.0    116.0    677.9   1162.0   2989.0
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0     18.0     80.0    261.1    455.0   1231.0
Gini coefficient (modifications): 0.7219
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    3.000    4.000    8.042    8.000  286.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    2.000    3.000    3.098    4.000   17.000
Table 3.32: Modification statistics for the xerces project
Although each file has, on average, been modified 8.0 times, it has 3.1 distinct authors, which
is above average. Together with the high inclusiveness and the low centralization of the network,
it can be concluded that the xerces project is a collaborative effort without strong feelings for code
ownership.
Figure 3.38: xerces sociogram
3.5.29 XFCE4
Number of modifications per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.00    23.75   143.50  2813.00  3096.00 21370.00
Number of modified files per author:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
      1.0     13.5     61.5    909.7   1277.0   7163.0
Gini coefficient (modifications): 0.8101
Number of modifications per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    2.000    4.525    3.000  429.000
Distinct authors per file:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    1.000    1.000    1.000    1.464    2.000   10.000
Table 3.33: Modification statistics for the XFCE4 project
Concentration of modifications: Mechanistically applied, the gini coefficient of the XFCE4
project is 0.81, a bit below average. However, this project requires careful consideration. The
source code of version 3 of the Xfce project was imported into the CVS code repository, and this import was performed under the CVS account xfce. As a consequence, this non-human account seems to be
the most active “developer”, performing 35% of all modifications. Of course, this poses a difficulty in how to handle the project. Should the imported modifications silently be dropped from
the analysis? This would distort the total number of modifications, and the resulting network
did not exhibit very different characteristics. Unfortunately, the old code repository was not
available, so those modifications could not be traced back to specific developers any more.
In the end it was decided to simply treat xfce as another developer in the summary analysis.
However, concentration, centralization, and inclusiveness have been calculated for both networks: once with xfce treated as a developer, and once with all of its modifications ignored. Both
figures are included in the summary table (Table 3.3 on page 74).
With user xfce removed, the concentration of modifications is 0.80.
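Computing a measure once with and once without a non-human account amounts to filtering the modification records before aggregating them. The Python sketch below shows this for the concentration measure. Only the account name xfce comes from the text; the other names and all counts are hypothetical, and the Gini formula is the same assumed rank-based variant as in the earlier illustration, so the toy result does not reproduce the reported 0.81 and 0.80.

```python
from collections import Counter


def gini(values):
    """Rank-based Gini coefficient of non-negative counts."""
    data = sorted(values)
    n, total = len(data), sum(data)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(data, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def concentration(records, exclude=()):
    """Gini of modifications per author, ignoring the excluded accounts."""
    counts = Counter(author for author, _ in records if author not in exclude)
    return gini(counts.values())


# Hypothetical records: one import account and three human developers.
records = (
    [("xfce", f"import/{i}") for i in range(35)]
    + [("alice", f"src/{i}") for i in range(40)]
    + [("bob", f"src/{i}") for i in range(20)]
    + [("carol", f"doc/{i}") for i in range(5)]
)
print(round(concentration(records), 2))
print(round(concentration(records, exclude={"xfce"}), 2))
```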
Core developer structure: Including user xfce, the betweenness based centralization is 0.20;
without it, it is 0.30. The sociogram (Fig. 3.39) shows a well connected developer network,
although some developers are clearly more central than others. Xfce exhibits a relatively dense
network, although the centralization indicates a differentiation between core integrators and
more peripheral members of the network.
Figure 3.39: XFCE4 sociogram
Degree of collaboration: The inclusiveness of the network is 0.77 (0.76 without user xfce).
It seems that the project ties relatively many developers into the network.
Project size: The xfce4 source code repository contains 61,879 modifications in 13,675 files
conducted by 22 authors (xfce as author included).
This version of the product is still young (the first CVS commit dates from February 2001, but
development did not take off until after the release of the last version of xfce3 in November
2002), and its files have only been modified 4.5 times on average. Therefore, the mean number of
distinct authors is still relatively, yet not exceptionally, low: 1.5. This may indicate the existence
of file ownership in the project; however, more development will possibly have to take place to
answer this. Other summary modification statistics can be taken from Table 3.33.
Having presented a summary analysis and an analysis of each project, the next section will
summarize and discuss the findings and their implications.
4 Discussion & Conclusion
A literature review had shown that there is a strong interest in research on the organization of
open source projects and that researchers have proposed applying Social Network Analysis to the
field of computer centered communities (Wellman, 1996) and especially open source projects.
While some looked at identifying the links between software architecture modules for very
large projects (González-Barahona et al., 2004) or performed a cross project analysis (Madey
et al., 2002), others had already gone further, trying to explain project success through
the use of network governance theory (Sagers et al., 2004).
The former two examine projects on a very high level, looking at relationships between
software modules, or entire projects. They seemed little concerned with the collaboration of
developers within one project. The previously mentioned study by Sagers et al. uses many
variables and constructs from network governance theory which have not been operationalized
in the field of open source research yet and explains project success as a dependent variable; a
measure which is difficult, if not impossible, to define in general.
This work identified a lack of knowledge about how developers coordinate themselves within
a project. It focused therefore on the coordination of developers within a single project. It
addressed the issues of self-coordination through the set of files a developer works on (research
question 1), the existence of “code ownership” (research question 2), and the identification
of patterns through the Social Network Analysis, with the aim to identify influence factors of
coordination styles (research question 3).
In order to answer the above questions, this research looked at the areas of code that developers chose to work on, trying to identify typical and varying properties which can characterize
open source projects. It examined 29 projects empirically, using the source code modifications
of these projects as primary data source. The research took more than 700,000 file modifications by 1,550 developers into account. As it relied on “hard” data to explore the coordination
patterns of OS projects, it attempted to avoid the typical pitfalls of quantitative research on
open source, which were described earlier in the study.
In order to answer research question 1, “are free software and open source projects self-coordinated through the set of files each developer works on?”, a number of measures were
examined in general and six propositions were made. Generally, it was found that in most cases, a
high share of the work was performed by very few developers (proposition 1). This observation
is in line with other studies, which find power-law distributions for the number of CVS commits
per developer. Next, the degrees of developers were examined, which represent the number
of developers with whom common files were modified. It was found that the mean degree is
relatively low (5.0) (proposition 2), but the degree of developers is highly concentrated (proposition 3), with a few very active developers having very high degrees. It was concluded that
these developers, modifying nearly everybody’s files, perform an integrating role by tying the
files into the main network. The inclusiveness of project networks differed widely (proposition
4); however, the only determinant which could be identified is the concentration of modifications (the larger the proportion of less active contributors, the less they are connected to the
network). The centralization of projects indicated the dominance of single developers in the
network, indicating a coordinating role. Values for the centralization varied significantly across
projects (proposition 5).
The second research question is concerned with the existence of “code ownership” in a project. Although the average number of developers per project is 40, the projects exhibit a mean of 2.4 distinct authors per file. This value remained remarkably consistent across projects, independent of their age, number of modifications, or project size. Only two projects proved to be exceptions, with values of 5.0 and 6.3. These were the only projects with more than 100 developers, and one of them was observed over a time frame of nearly 20 years. It seems that, while no strict file ownership exists, most files are maintained by only very few developers. Although some files were modified by many developers, the median number of modifications is 1 for many projects.
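The two file-level figures discussed here can be sketched in a few lines of Python. This is a hedged illustration only: the function name and the toy log are hypothetical, and it assumes that the median refers to the number of modifications per file.

```python
from collections import defaultdict
from statistics import mean, median


def file_statistics(modifications):
    """Summarize authorship per file from (developer, file) modification records."""
    authors = defaultdict(set)
    counts = defaultdict(int)
    for developer, path in modifications:
        authors[path].add(developer)
        counts[path] += 1
    return {
        "mean_authors_per_file": mean(len(a) for a in authors.values()),
        "median_modifications_per_file": median(counts.values()),
    }


# Hypothetical toy log, not data from the study.
log = [
    ("alice", "core.c"), ("bob", "core.c"), ("alice", "core.c"),
    ("carol", "net.c"),
    ("dave", "README"),
]
stats = file_statistics(log)
print(stats)
```

A single heavily edited file raises the mean number of authors per file only slightly, while the median number of modifications stays at the value of the many files that are touched once.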
Research question 3 is concerned with the interpretation of the SNA measures and the identification of influencing factors that enable the creation of a typology of coordination styles. This resulted in four influencing variables: the concentration of modifications as measured by the Gini coefficient, the core developer structure as characterized by the network centralization, the degree of collaboration as measured by inclusiveness, and project size as given by the number of developers, files, and modifications. These four dimensions were used to characterize and interpret the coordination style of each of the sample projects.
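Two of these dimensions can be illustrated with a short Python sketch. It is a sketch under assumptions, not the procedure of this study: the Gini coefficient is computed with the standard rank-based formula for a sorted sample, and network centralization is shown in its degree-based form after Freeman (1979), whereas the study may rely on a different variant. All names and numbers are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending sample, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def degree_centralization(degrees):
    """Freeman degree centralization: 1.0 for a star network, 0.0 if all degrees are equal."""
    n = len(degrees)
    if n < 3:
        return 0.0
    top = max(degrees)
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))


# Hypothetical modification counts per developer and developer degrees.
modifications_per_developer = [120, 30, 5, 3, 1, 1]
developer_degrees = [5, 2, 1, 1, 1, 0]
print(gini(modifications_per_developer))
print(degree_centralization(developer_degrees))
```

Because the two measures are computed from different inputs (modification counts versus ties between developers), a project can score high on one and low on the other.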
Implications What can be learned from this study, and who will be able to take advantage of the knowledge?
The foremost group that will be able to benefit from this research consists of researchers from various disciplines looking into the organization and social structure of OS projects.
The analysis of diverse measures in an inductive manner led to the formulation of six propositions that characterize the coordination in OSS development communities. These propositions can form a basis for further research on community structures in OS projects. File ownership could interest researchers of modularity in OS projects. The apparent aversion to performing modifications in other developers’ files could be one reason for the high modularity (creating areas of responsibility) in open source software. The characterization of coordination styles also shows that not all projects work in the same way. It could be used and extended to characterize different types of projects in future studies.
But the results should not only be interesting for researchers. Practitioners, whether from commercial organizations or volunteer participants in open source communities, benefit from a deeper understanding of how community developers choose their area of work. These benefits do not come in the form of direct and concrete recommendations on how a project should be organized or how the probability of project success can be increased; it is my conviction that it is crucial to understand the processes of open source project communities first, and that there is no single way to achieve project success.
Practitioners who want to facilitate the survival and growth of such communities should still be interested in how they work. The existence of a small core of integrating, coordinating and much-contributing developers across all projects shows, for example, that OS projects are no perpetuum mobile (perpetual motion machine¹). One cannot release source code into the wild and expect momentum to build up by magic. Community facilitators should also be aware of the low number of authors per file and provide opportunities for newcomers to plug new pieces of “their” code easily into the main software architecture. They should also be aware that the number of contributions and the centralization of the network did not seem to be connected. It is possible to have projects where one developer contributes most, but several decentralized integrators keep the network together.
Another potential implication is that the social structure of the communities and the way participants pick their work have an impact on the resulting software architecture. Given the low number of distinct authors per file, as identified across all projects, it becomes apparent that a software architecture which requires lots of interdependent code might not be the preferred way of working. This observation is in line with research on modularity of software: MacCormack et al. (2004) find that open source projects are more modular than their closed source counterparts; work on knowledge reuse finds that modular components are much preferred to a collection of lines of code without a specified interface (von Krogh et al., 2005).
¹ “A Perpetual Motion Machine of the First Kind is a mechanism which, once set in motion, continues to do useful work without an input of energy, or which produces more energy than is absorbed in its operation. This kind of PM is impossible because it violates the principle of conservation of energy.” (McGraw-Hill Staff and Parker, 1994)
Limitations Choosing cases for a sample that should represent the total population in some ways is never an easy task, and it is even harder if there does not seem to be “the representative” case at all. Many of these projects have very interesting stories around their creation, and deciding which of them to include was not easy. An attempt was made to avoid the common pitfalls of quantitative research (as described in Section 2.1.1 on page 22) and otherwise to maximize the variation of project properties, such as size, age, and type of application. Of course, other researchers are encouraged to compare these findings to other projects and settings to confirm the validity of the sample.
One strength of this research, the relatively unbiased “hard” data which could be consistently gathered for all projects, can of course also be turned into a criticism. Although this study attempts to interpret coordination styles, there is no context-sensitive data to explain and interpret them in a qualitative way. There is little that can be said in reply to this criticism, as it is inherent in the decision to rely on “hard” data. The study should be seen as an attempt to gather measures that can characterize the collaboration and coordination of projects. Now that the results are presented, it might be beneficial to follow up specific cases with in-depth case studies.
Future research This research laid a foundation for examining the coordination in OS projects, using existing methods and measures. The results help to answer the research questions posed, but there are many ways in which one could proceed differently, obtain further results, or examine related issues.
In the following, I present a list of ideas on how this research could be used to further the understanding of how open source projects work and how organizations could take advantage of it:
The empirical analysis of the CVS modifications has been performed over the complete recorded time frame for the projects. However, it would be interesting to take the dynamics of projects into account. For example, it would be very interesting to see whether, when a developer stops contributing code, his “heritage” is taken over by a new member (immediately or gradually over time) with a similar set of ties in the network. The social network could also be examined in connection with a “life-cycle” theory of projects (and communities).
One of the advantages of open source projects is the easy availability of data, such as code
modifications, discussion through mailing lists, etc. This is unfortunately not true for closed
source projects, and not many companies give access to their source code easily. It would be
interesting to perform the same analysis on closed source projects to see if the characteristics
are similar to those of open source projects.
Some of the projects presented here feature an array of interesting properties, such as extremely low or high specialization, a high number of distinct authors per file, or other characteristics. These projects could be analyzed in a more in-depth case study, using complementary qualitative data sources in order to explain these characteristics. An important issue to examine is the importance of peripheral developers: if the main part of the work is done by a few developers, do the peripheral contributors bring any advantage at all? Are they a small, yet important, part of the big picture? Unfortunately, these questions cannot be answered by looking at code contributions alone and warrant further in-depth research.
Although there are lots of open questions left, research on open source software projects has indeed come a long way during the last six years. It has proved to be an interdisciplinary area that has been tackled from many directions and on many levels. As I was part of the developer community before becoming a researcher on it, one thing is for sure: “it seems so much easier and less complicated when you are just a hobby developer, contributing work to some open source project of your choice.”
Bibliography
Abisource.com, 2004. Abiword website.
URL http://abisource.com
Adonthell, 2004. Adonthell website.
URL http://adonthell.linuxgames.com
Anthonisse, J., 1971. The rush in a directed graph. Tech. Rep. BN9/71, Stichting Mathematisch
Centrum, Amsterdam.
AWStats, 2004. Awstats website.
URL http://awstats.sourceforge.net
Baldwin, C. Y., Clark, K. B., 2000. The Power of Modularity. Vol. 1 of Design Rules. The MIT
Press.
Barnes, J. A., 1954. Class and committee in a Norwegian island parish. Human Relations 7.
Bates, J., 2003. HBS-MIT free/open source conference in Boston, MA, informal, personal conversation with Jeff Bates.
Benkler, Y., 2002. Coase’s penguin, or Linux and the nature of the firm. Yale Law Journal
112 (3), 369–447.
Bergquist, M., Ljungberg, J., 2001. The power of gifts: Organising social relationships in Open
Source communities. Information Systems Journal 11 (4), 305–320.
Bessen, J., 2002. Open source software: Free provision of complex public goods. Tech. rep.,
Research on Innovation.
URL http://www.researchoninnovation.org/opensrc.pdf
bison, Dec. 2004a. The bison 2.0 manual.
URL http://www.gnu.org/software/bison/manual/pdf/bison.pdf
bison, 2004b. bison website.
URL http://www.gnu.org/software/bison
Bitzer, J., Schrettl, W., Schroder, P. J., 2004. Intrinsic motivation in open source software
development.
URL http://opensource.mit.edu/papers/bitzerschrettlschroder.pdf
Borgatti, S. P., Foster, P. C., 2003. The network paradigm in organizational research: A review
and typology. Journal of Management 29 (6), 991–1013.
Brooks, Jr., F. P., 1995. The Mythical Man-Month: Essays on Software Engineering, 20th
Anniversary Edition. Addison-Wesley.
Bushnell, M. I., Dec. 1991. The meaning of ‘hurd’. E-mail to [email protected], [email protected].
URL http://www.cs.pdx.edu/~trent/gnu/hurd/hurd-name
CDex, 2004. Cdex website.
URL http://cdexos.sourceforge.net/
Clarke, I., 1999. A distributed decentralised information storage and retrieval system. Master’s
thesis, University of Edinburgh.
Coffey, D. S., 1998. Self-organization, complexity and chaos: the new biology for medicine.
Nature Medicine 4 (8), 882–885.
Coulon, F., Jan. 2005. The use of social network analysis in innovation research: A literature
review, working paper.
URL http://www.druid.dk/ocs/viewabstract.php?id=305&cf=2
Crowston, K., Howison, J., Dec. 2004. The social structure of free and open source software
development. Syracuse FLOSS research working paper.
URL http://opensource.mit.edu/papers/crowstonhowison.pdf
Crowston, K., Scozzi, B., 2002. Open source software projects as virtual organizations: Competency rallying for software development. IEE Proceedings - Software 149 (1), 3–17.
Dalle, J.-M., David, P. A., 2003. The allocation of software development resources in ’open
source’ production. Discussion paper for The Stanford Institute For Economic Policy Research.
URL http://opensource.mit.edu/papers/dalledavid.pdf
Dalle, J.-M., David, P. A., Ghosh, R. A., Steinmueller, W. E., 2004. Advancing economic research on the free and open source software mode of production. In: Wynants, M., Cornelis,
J. (Eds.), Building our Digital Future - Future Economic, Social & Cultural Scenarios Based
On Open Standards. Vrije Universiteit Brussels Press, Brussels, forthcoming.
DeMaggio, D., Jun. 2002. Letters to the editor: Reply to "cave or community? An empirical examination of 100 mature open source projects". http://www.firstmonday.org/issues/issue7_9/letters/index.html.
DiBona, C., Ockman, S., Stone, M. (Eds.), 1999. Open Sources: Voices from the Open Source
Revolution. O’Reilly & Associates, Inc.
Dixon, P. M., Weiner, J., Mitchell-Olds, T., Woodley, R., 1987. Bootstrapping the Gini coefficient of inequality. Ecology 68, 1548–1551.
Emacs, 2004. Emacs website.
URL http://www.gnu.org/software/emacs/emacs.html
Faraj, S., Sproull, L., 2000. Coordinating expertise in software development teams. Management Science 46 (12), 1554–1568.
Feller, J., Fitzgerald, B., 2000. A framework analysis of the open source software development
paradigm. In: The 21st International Conference in Information Systems (ICIS 2000). pp.
58–69.
Feller, W., 1966. Introduction to Probability Theory and Its Applications. Vol. 2. John Wiley.
Ferraro, F., O’Mahony, S., 2003. Managing the boundary of an ’open’ project. Harvard NOM
Working Paper No. 03-60.
URL http://opensource.mit.edu/papers/omahonyferraro.pdf
flightgear, 2004. Flightgear website.
URL http://www.flightgear.org
Foley, M. J., Apr. 2004. Microsoft releases source code on SourceForge. Microsoft Watch.
URL http://www.microsoft-watch.com/article2/0,1995,1561861,00.asp
Freeman, L. C., 1977. A set of measures of centrality based on betweenness. Sociometry 40,
35–41.
Freeman, L. C., 1979. Centrality in social networks: Conceptual clarification. Social Networks
1, 215–239.
Freeman, L. C., White, D. R., Kimball, R. A. (Eds.), 1992. Research Methods in Social Network Analysis. Transaction Publishers, New Brunswick (USA) & London (UK).
Freenet, 2004. Freenet website.
URL http://freenetproject.org
Gallivan, M. J., 2001. Striking a balance between trust and control in a virtual organization: A
content analysis of open source software case studies. Information Systems Journal 11 (4),
277–304.
Ghosh, R. A., Glott, R., Krieger, B., Robles, G., 2002. Free/Libre and open source software:
Survey and study (FLOSS). http://www.infonomics.nl/FLOSS/report/.
Ghosh, R. A., Prakash, V. V., Jul. 2000. The orbiten free software survey. First Monday 5 (7).
URL http://firstmonday.org/issues/issue5_7/gosh
Gini, C., 1912. Variabilità e mutabilità. In: Pizetti, E., Salvemini, T. (Eds.), Memorie di
metodologica statistica. Libreria Eredi Virgilio Veschi, Rome, reprint published 1955.
Gnomemeeting, 2004. Gnomemeeting website.
URL http://gnomemeeting.org
GNUnet, 2004. Gnunet website.
URL http://gnunet.org
González-Barahona, J. M., López, L., Robles, G., Jun. 2004. Community structure of modules
in the apache project. http://opensource.mit.edu/papers/barahona-apache_structure.pdf.
Granovetter, M. S., 1973. The strength of weak ties. American Journal of Sociology 78, 1360–
1380.
Granovetter, M. S., 1974. Getting a job. Harvard University Press, Cambridge, MA.
Grothoff, C., Patrascu, I., Bennett, K., Stef, T., Horozov, T., Jun. 2002. Gnet.
URL http://gnunet.org/download/main.pdf
GTK+, 2004. Gtk+ website.
URL http://www.gtk.org
Haken, H., 1977. Synergetics: An Introduction. Nonequilibrium Phase Transitions and Self-Organization in Physics, Chemistry and Biology. Springer.
Hall, P., 1982. Rates of convergence in the central limit theorem. Pitman, Boston.
Harhoff, D., Henkel, J., von Hippel, E., 2003. Profiting from voluntary information spillovers:
how users benefit by freely revealing their innovations. Research Policy 32 (10), 1753–1769.
Hars, A., Ou, S., 2000. Why Is Open Source Software Viable? - A Study of Intrinsic Motivation, Personal Needs, and Future Returns. In: The 2000 Americas Conference on Information
Systems (AMCIS 2000).
Hauben, M., 1994. History of arpanet: Behind the net - the untold history of the arpanet.
URL http://www.dei.isep.ipp.pt/docs/arpa.html
Healy, K., Schussman, A., Jan. 2003. The ecology of open source software development. Working paper.
URL http://opensource.mit.edu/papers/healyschussman.pdf
Hemetsberger, A., 2004. Fostering cooperation on the internet: social exchange processes in innovative virtual consumer communities. Presented at the Association of Consumer Research
conference (2001).
URL http://opensource.mit.edu/papers/hemetsberger2.pdf
Hertel, G., Niedner, S., Herrmann, S., 2003. Motivation of software developers in open source
projects: An internet-based survey of contributors to the Linux Kernel. Research Policy
32 (7), 1159–1177.
Heylighen, F., 2005. Web dictionary of cybernetics and systems.
URL http://pespmc1.vub.ac.be/ASC/
von Hippel, E., Lakhani, K., May 2000. How open source software works: "Free" user-to-user
assistance. MIT Sloan Working Paper No. 4117-00.
URL http://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID290305_code011119590.pdf?abstractid=290305
von Hippel, E., von Krogh, G., March-April 2003. Open source software and the "private-collective" innovation model: Issues for organization science. Organization Science 14 (2),
209–223.
Howison, J., Crowston, K., May 2004. The perils and pitfalls of mining SourceForge.
URL http://opensource.mit.edu/papers/howison04msr.pdf
iRate, 2004. irate website.
URL http://irate.sourceforge.net
Jargon, 2004. The Jargon Dictionary.
URL http://info.astrian.net/jargon/
Jarvenpaa, S. L., Leidner, D. E., 1999. Communication and trust in global virtual teams. Organization Science 10 (6), 791–815.
Jones, C., Hesterly, W. S., Borgatti, S. P., 1997. A general theory of network governance:
Exchange conditions and social mechanisms. Academy of Management Review 22 (4), 911–
945.
Jones, P., Sep. 2000. Brooks’ law and open source: The more the merrier? Does the open source
development method defy the adage about cooks in the kitchen? IBM Developer Works.
Kauffman, S., 1995. At Home in the Universe: The Search for Laws of Self-Organization and
Complexity. Oxford University Press, New York.
Koch, S., Schneider, G., 2002. Effort, cooperation and coordination in an open source software
project: Gnome. Information Systems Journal 12 (1), 27–42.
Kogut, B. M., Metiu, A., 2001. Open-source software development and distributed innovation.
Oxford Review of Economic Policy 17 (2).
Krishnamurthy, S., Jun. 2002. Cave or community?: An empirical examination of 100 mature
open source projects. First Monday 7 (6).
URL http://firstmonday.org/issues/issue7_6/krishnamurthy
von Krogh, G., Haefliger, S., Spaeth, S., 2003. Collective action and innovation in open source
software development: The case of Freenet. Presented at Academy of Management 2003,
Seattle.
von Krogh, G., Spaeth, S., Haefliger, S., 2005. Knowledge reuse in open source software:
An exploratory study of 15 open source projects. In: Proceedings of the 38th Hawaii
International Conference on System Sciences. pp. 198–207.
URL http://csdl.computer.org/comp/proceedings/hicss/2005/2268/07/22680198b.pdf
von Krogh, G., Roos, J., 1995. Organizational Epistemology. St. Martin’s Press, New York.
von Krogh, G., Spaeth, S., Lakhani, K., Jul. 2003. Community, joining, and specialization in
open source software innovation: A case study. Research Policy 32 (7), 1217–1241.
von Krogh, G., von Hippel, E., 2003. Special issue on open source software development:
Editorial. Research Policy 32 (7), 1149–1157.
Lakhani, K., Wolf, B., Bates, J., DiBona, C., Jul. 2002. The Boston Consulting Group/OSDN
hacker survey. http://www.osdn.com/bcg/, ver 0.73.
LAME, 2004. Lame website.
URL http://www.mp3dev.org/
Lanzara, G. F., Morner, M., 2003. The knowledge ecology of open-source software projects.
URL http://opensource.mit.edu/papers/lanzaramorner.pdf
Laumann, E. O., Marsden, P. V., Prensky, D., 1992. The boundary specification problem in
network analysis. In: Freeman et al. (1992), Ch. 3, pp. 61–87.
Lee, S., Moisa, N., Weiss, M., Mar. 2003. Open source as a signalling device - an economic
analysis.
URL http://opensource.mit.edu/papers/leemoisaweiss.pdf
Lerner, J., Tirole, J., 2000. The simple economics of open source. Working Paper 7600, National Bureau of Economic Research.
Lerner, J., Tirole, J., Dec. 2002a. The scope of open source licensing. National Bureau of Economic Research.
URL http://papers.nber.org/papers/W9363
Lerner, J., Tirole, J., 2002b. Some simple economics of open source. Journal of Industrial
Economics 52.
Levy, S., 1984. Hackers: Heroes of the Computer Revolution. Anchor Press/Doubleday, NY.
Ljungberg, J., 2000. Open source movements as a model for organizing. European Journal of
Information Systems 9 (4).
López, L., González-Barahona, J. M., Robles, G., Jun. 2004. Applying social network analysis
to the information in CVS repositories. Working paper.
URL http://opensource.mit.edu/papers/llopez-sna-short.pdf
MacCormack, A., Rusnak, J., Baldwin, C., Oct. 2004. Exploring the structure of complex
software designs: An empirical study of open source and proprietary code.
URL http://opensource.mit.edu/papers/maccormackrusnakbaldwin.pdf
Madey, G., Freeh, V., Tynan, R., 2002. The open source software development phenomenon:
An analysis based on social network theory. In: Americas Conference on Information Systems (AMCIS 2002). Dallas, TX, pp. 1806–1813.
URL http://www.nd.edu/~oss/Papers/amcis_oss.pdf
Mailman, 2004. Mailman website.
URL http://www.gnu.org/software/mailman
Maturana, H. R., Varela, F. J., 1980. Autopoiesis and Cognition: The Realization of the Living.
D. Reidel, Boston.
McGraw-Hill Staff, Parker, S. P., 1994. Dictionary of Scientific and Technical Terms, 5th Edition. McGraw-Hill Professional.
Merriam-Webster, 1993. Merriam-Webster’s Collegiate Dictionary, 10th Edition. Merriam-Webster.
URL http://m-w.com
Mills, J. A., Zandvakili, A., 1997. Statistical inference via bootstrapping for measures of inequality. Journal of Applied Econometrics 12, 133–150.
Mitchell, J. C. (Ed.), 1969. Social Networks in Urban Situations. Manchester University Press.
Mnet, 2004. Mnet website.
URL http://mnetproject.org/
Mockus, A., Fielding, R. T., Herbsleb, J. D., 2000. A case study of open source software development: The Apache server. In: The 22nd International Conference on Software Engineering.
Limerick, Ireland, pp. 263–272.
Moody, G., 2001. Rebel Code: Linux and the Open Source Revolution. Penguin, London.
Moreno, J., 1934. Who Shall Survive? Beacon Press, New York.
Nano, 2004. Nano website.
URL http://www.nano-editor.org
Netscape, 1998. Netscape announces plans to make next-generation communicator source code
available free on the net. press release from 22 Jan 1998.
Ogle, 2004. Ogle website.
URL http://www.dtek.chalmers.se/groups/dvd
O’Mahony, S., 2003. Guarding the commons: How community managed software projects
protect their work. Research Policy 32 (7), 1179–1198.
OpenSSL, 2004. Openssl website.
URL http://www.openssl.org
Osterloh, M., Rota, S., 2003. Open source software production - The magic cauldron?
Osterloh, M., Rota, S., 2004a. Trust and community in open source software production.
In: Lahno, B., Matzat, U. (Eds.), Trust and Community on the Internet: Opportunities and
Restrictions for Online Cooperation. Lucius & Lucius.
Osterloh, M., Rota, S. G., Mar. 2004b. Open source software development - just another case
of collective invention?
URL http://ssrn.com/abstract=561744
pango, 2004. pango website.
URL http://www.pango.org
Perens, B., 1999. The open source definition. In: DiBona, Ockman, and Stone (1999), pp.
171–188.
phpMyAdmin, 2004. phpmyadmin.
URL http://www.phpmyadmin.net
PostgreSQL, 2004. Postgresql website.
URL http://www.postgresql.org
Prufer, J., 2004. Network formation via contests: The production process of open source software.
URL http://opensource.mit.edu/papers/prufer.pdf
Raymer, M. G., 1994. Uncertainty principle for joint measurement of noncommuting variables.
American Journal of Physics 62 (11), 986–993.
Raymond, E., Aug. 2003. The Art Of Unix Programming. Addison-Wesley.
URL http://www.catb.org/~esr/writings/taoup/html/
Raymond, E. S., 1999a. The Cathedral & the Bazaar, 1st Edition. O’Reilly, Sebastopol, CA.
URL http://www.catb.org/~esr/writings/cathedral-bazaar
Raymond, E. S., 1999b. The Revenge of the Hackers. In: DiBona et al. (1999), pp. 207–219.
Roethlisberger, F. J., Dickson, W. J., 1939. Management and the Worker. Harvard University
Press.
Rossi, M. A., 2004. Decoding the "free/open source (f/oss) puzzle" - a survey of theoretical and
empirical contributions.
URL http://opensource.mit.edu/papers/rossi.pdf
Sagers, G. W., McLure Wasko, M., Dickey, M. H., Aug. 2004. Coordinating efforts in virtual
communities: Examining network governance in open source. In: Proceedings of the Tenth
Americas Conference on Information Systems. New York.
Scacchi, W., 2002. Understanding requirements for developing open source software systems.
IEE Proceedings - Software 149 (1), 24–39.
Scott, J., 1991. Social Network Analysis: A handbook. Sage Publications Ltd.
Shah, S., 2003. Understanding the Nature of Participation & Coordination in Open and Gated
Source Software Development Communities. Ch. 4, doctoral dissertation.
Smarty, 2004. Smarty website.
URL http://smarty.php.net
SourceForge.net, Jul. 2003. Project of the month: Tiki.
URL http://sourceforge.net/potm/potm-2003-07.php
Spaeth, S., Apr. 2003. Decision-making in open source projects. Proposal for the Doctoral
thesis.
URL http://sspaeth.org/paper/Vorstudie.pdf
Stallman, R., 1999. The GNU Operating System and the Free Software Movement. In: DiBona
et al. (1999), pp. 53–70.
Stallman, R. M., 1993. The GNU manifesto.
URL http://www.gnu.org/gnu/manifesto.html
Stenborg, M., Aug. 2004. Waiting for f/oss: Coordinating the production of free/open source
software.
URL http://opensource.mit.edu/papers/stenborg.pdf
Stepmania, 2004. Stepmania website.
URL http://stepmania.sourceforge.net
Stewart, D., Apr. 2004. Status inertia: the speed imperative in the attainment of community
status.
URL http://opensource.mit.edu/papers/stewart2.pdf
tdb, 2004. Trivial database website.
URL http://tdb.sourceforge.net
te Meerman, S., 2003. Puzzling with a top-down blueprint and a bottom-up network: An explorative analysis of the open source world using ITIL and social network analysis. Master’s
thesis, University of Groningen, Netherlands.
URL http://opensource.mit.edu/papers/meerman2.pdf
The Economist, 2004. And the winners are... The Economist September 16th.
The Enquirer, 2004. IBM patents method for paying open source volunteers. published on 26
January 2004.
URL http://www.theinquirer.net/?article=13813
The Open Source Initiative, 2003a. The History of OSI.
URL http://opensource.org/docs/history.php
The Open Source Initiative, 2003b. The Open Source Definition. Accessed: 10 Jan 2005.
URL http://opensource.org/docs/definition.php (ver. 1.9)
TikiWiki.org, 2004. Tikiwiki website.
URL http://tikiwiki.org
Torvalds, L., Diamond, D., 2001. Just for Fun. Texere, London, UK.
Tuomi, I., Apr. 2000. Learning from Linux: Empirical and descriptive analysis of the open
source model. Working paper was distributed in Berkeley and Stanford in April 2000.
URL http://www.jrc.es/~tuomiil/articles/LearningFromLinux.pdf
Van Wendel de Joode, R., De Bruijn, J. A., Van Eeten, M. J. G., Oct. 2002. Protecting the virtual
commons: Self-organizing communities and innovative intellectual property rights regimes.
Working paper.
URL http://opensource.mit.edu/papers/joode.pdf
Viega, J., Warsaw, B., Manheimer, K., Dec. 1998. Mailman: The GNU mailing list manager. In:
Proceedings of the 12th Systems Administration Conference (LISA ’98). USENIX Technical
Program. Boston, MA.
Wasserman, S., 1994. Social Network Analysis. Methods and Applications. Cambridge University Press, pp. 345–423, 461–482.
Wellman, B., 1996. For a social network analysis of computer networks: a sociological perspective on collaborative work and virtual community. In: SIGCPR ’96: Proceedings of the
1996 ACM SIGCPR/SIGMIS conference on Computer personnel research. ACM Press, pp.
1–11.
West, J., O’Mahony, S., 2005. Contrasting community building in sponsored and community
founded open source projects. In: Proceedings of the 38th Annual Hawai’i International Conference on System Sciences (Jan 2005).
URL http://opensource.mit.edu/papers/westomahony.pdf
wget, 2004. wget website.
URL http://www.gnu.org/software/wget/wget.html
Whitaker, R., Dec. 1995. Self-organization, autopoiesis, and enterprises.
URL http://www.acm.org/sigs/sigois/auto/Main.html
Wikipedia.org, 2004. Dictionary.
URL http://wikipedia.org
Williams, S., 2002. Free as in Freedom: Richard Stallman’s Crusade for Free Software.
O’Reilly, Sebastopol, CA, full text available online.
URL http://www.oreilly.com/openbook/freedom/
xerces, 2004. xerces java parser website.
URL http://xml.apache.org
Xfce, 2004. Xfce website.
URL http://xfce.org
Yamauchi, Y., Yokozawa, M., Shinohara, T., Ishida, T., 2000. Collaboration with lean media:
How open-source software succeeds. In: CSCW 2000. ACM, Philadelphia, PA, pp. 329–338.
Yin, R. K., 2003. Case Study Research: Design and Methods, 2nd Edition. Sage.
Zeitlyn, D., 2003. Gift economies in the development of open source software: Anthropological
reflections. Research Policy 32 (7), 1287–1291.
A Appendix
A.1 Open Source Definition (Version 1.9)
Introduction
Open source doesn’t just mean access to the source code. The distribution terms of open-source
software must comply with the following criteria:
1. Free Redistribution
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different
sources. The license shall not require a royalty or other fee for such sale.
2. Source Code
The program must include source code, and must allow distribution in source code as
well as compiled form. Where some form of a product is not distributed with source code,
there must be a well-publicized means of obtaining the source code for no more than a
reasonable reproduction cost, preferably downloading via the Internet without charge.
The source code must be the preferred form in which a programmer would modify the
program. Deliberately obfuscated source code is not allowed. Intermediate forms such
as the output of a preprocessor or translator are not allowed.
3. Derived Works
The license must allow modifications and derived works, and must allow them to be
distributed under the same terms as the license of the original software.
4. Integrity of The Author’s Source Code
The license may restrict source-code from being distributed in modified form only if the
license allows the distribution of "patch files" with the source code for the purpose of
modifying the program at build time. The license must explicitly permit distribution of
software built from modified source code. The license may require derived works to
carry a different name or version number from the original software.
5. No Discrimination Against Persons or Groups
The license must not discriminate against any person or group of persons.
6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field
of endeavor. For example, it may not restrict the program from being used in a business,
or from being used for genetic research.
7. Distribution of License
The rights attached to the program must apply to all to whom the program is redistributed
without the need for execution of an additional license by those parties.
8. License Must Not Be Specific to a Product
The rights attached to the program must not depend on the program’s being part of a particular software distribution. If the program is extracted from that distribution and used
or distributed within the terms of the program’s license, all parties to whom the program
is redistributed should have the same rights as those that are granted in conjunction with
the original software distribution.
9. The License Must Not Restrict Other Software
The license must not place restrictions on other software that is distributed along with
the licensed software. For example, the license must not insist that all other programs
distributed on the same medium must be open-source software.
10. The License Must Be Technology-Neutral
No provision of the license may be predicated on any individual technology or style of
interface.
Index
AWStats (2004), 38, 140
Crowston and Scozzi (2002), 1, 21, 142
Abisource.com (2004), 37, 140
Dalle and David (2003), 20, 142
Adonthell (2004), 38, 140
Dalle et al. (2004), 20, 27, 142
Anthonisse (1971), 66, 140
DeMaggio (2002), 24, 142
Baldwin and Clark (2000), 71, 140
DiBona et al. (1999), 142, 150–152
Barnes (1954), 25, 140
Dixon et al. (1987), 57, 142
Bates (2003), 23, 140
Emacs (2004), 41, 142
Benkler (2002), 3, 28, 140
Faraj and Sproull (2000), 28, 142
Bergquist and Ljungberg (2001), 19, 140
Feller and Fitzgerald (2000), 1, 143
Bessen (2002), 10, 140
Feller (1966), 31, 143
Bitzer et al. (2004), 19, 141
Ferraro and O’Mahony (2003), 19, 27, 143
Borgatti and Foster (2003), 26, 141
Foley (2004), 10, 143
Brooks (1995), 70, 141
Freeman et al. (1992), 24, 143, 147
Bushnell (1991), 8, 141
Freeman (1977), 25, 66, 143
CDex (2004), 40, 141
Freeman (1979), 25, 59, 65, 66, 143
Clarke (1999), 42, 141
Freenet (2004), 42, 143
Coffey (1998), 28, 141
GNUnet (2004), 44, 144
Coulon (2005), 26, 141
GTK+ (2004), 45, 144
Crowston and Howison (2004), 29, 30, 69, 142
Gallivan (2001), 19, 143
Ghosh and Prakash (2000), 57, 144
Ghosh et al. (2002), 9, 14, 15, 18, 20, 144
López et al. (2004), 30, 34, 36, 67, 148
Gini (1912), 57, 144
Lakhani et al. (2002), 20, 147
Gnomemeeting (2004), 43, 144
Lanzara and Morner (2003), 20, 147
González-Barahona et al. (2004), 29, 134, 144
Laumann et al. (1992), 25, 147
Lee et al. (2003), 19, 147
Granovetter (1973), 25, 144
Lerner and Tirole (2000), 17, 148
Granovetter (1974), 25, 144
Lerner and Tirole (2002a), 21, 148
Grothoff et al. (2002), 44, 144
Lerner and Tirole (2002b), 19, 148
Haken (1977), 28, 144
Levy (1984), 7, 8, 148
Hall (1982), 31, 144
Ljungberg (2000), 20, 148
Harhoff et al. (2003), 18, 145
MacCormack et al. (2004), 21, 71, 137, 148
Hars and Ou (2000), 18, 145
Madey et al. (2002), 23, 29, 34, 134, 148
Hauben (1994), 7, 145
Mailman (2004), 48, 148
Healy and Schussman (2003), 21, 23, 29, 34, 145
Maturana and Varela (1980), 28, 149
Merriam-Webster (1993), 4, 149
Hemetsberger (2004), 19, 145
Mills and Zandvakili (1997), 57, 149
Hertel et al. (2003), 18, 145
Mitchell (1969), 25, 149
Heylighen (2005), 28, 145
Mnet (2004), 49, 149
Howison and Crowston (2004), 24, 31, 146
Mockus et al. (2000), 57, 149
Jargon (2004), 12, 146
Moody (2001), 8, 149
Jarvenpaa and Leidner (1999), 19, 146
Moreno (1934), 24, 149
Jones et al. (1997), 27, 146
Nano (2004), 49, 149
Jones (2000), 70, 146
Netscape (1998), 9, 17, 149
Kauffman (1995), 28, 146
O’Mahony (2003), 90, 150
Koch and Schneider (2002), 26, 57, 146
Ogle (2004), 50, 149
Kogut and Metiu (2001), 1, 146
OpenSSL (2004), 50, 150
Krishnamurthy (2002), 24, 146
Osterloh and Rota (2003), 18, 20, 150
LAME (2004), 47, 147
Osterloh and Rota (2004a), 19, 150
Osterloh and Rota (2004b), 19, 20, 150
Torvalds and Diamond (2001), 2, 153
Perens (1999), 7, 10, 14, 150
Tuomi (2000), 17, 153
PostgreSQL (2004), 52, 150
Van Wendel de Joode et al. (2002), 28, 153
Prufer (2004), 19, 150
Viega et al. (1998), 48, 153
Raymer (1994), 22, 150
Wasserman (1994), 60, 153
Raymond (1999a), 16, 151
Wellman (1996), 4, 29, 134, 153
Raymond (1999b), 10, 151
West and O’Mahony (2005), 18, 153
Raymond (2003), 8, 151
Roethlisberger and Dickson (1939), 25, 151
Rossi (2004), 17, 151
Sagers et al. (2004), 27, 134, 151
Scacchi (2002), 17, 151
Scott (1991), 24, 34, 36, 60, 62, 65, 71, 151
Whitaker (1995), 28, 154
Wikipedia.org (2004), 12, 154
Williams (2002), 7, 154
Xfce (2004), 56, 154
Yamauchi et al. (2000), 1, 154
Yin (2003), 3, 154
Shah (2003), 19, 151
Smarty (2004), 53, 151
SourceForge.net (2003), 55, 151
Spaeth (2003), 2, 151
Stallman (1993), 1, 12, 152
Stallman (1999), 7, 12, 14, 152
Stenborg (2004), 19, 152
Zeitlyn (2003), 17, 19, 154
bison (2004a), 39, 141
bison (2004b), 39, 141
flightgear (2004), 42, 143
iRate (2004), 46, 146
pango (2004), 51, 150
Stepania (2004), 54, 152
phpMyAdmin (2004), 52, 150
Stewart (2004), 27, 152
tdb (2004), 54, 152
The Economist (2004), 2, 152
te Meerman (2003), 29, 152
The Enquirer (2004), 19, 152
wget (2004), 55, 154
The Open Source Initiative (2003a), 10, 152
xerces (2004), 56, 154
The Open Source Initiative (2003b), 1, 10, 153
McGraw-Hill Staff and Parker (1994), 137, 149
TikiWiki.org (2004), 55, 153
von Hippel and Lakhani (2000), 17, 145
von Hippel and von Krogh (2003), 1, 20, 145
von Krogh and Roos (1995), 28, 147
von Krogh and von Hippel (2003), 18, 147
von Krogh et al. (2003), 1, 3, 20, 147
von Krogh et al. (2003), 20, 27, 147
von Krogh et al. (2005), 22, 137, 147
acknowledgements, v
adjacency matrix, 34, 59
ARPAnet, 7
authors p. file, 66
autopoiesis, 28
betweenness, 25
Brook’s law, 70
centralization, 25, 65, 74
Cheng, Mike, 47
Clarke, Ian, 42, 91
code ownership, 66
copyleft, 12
Curriculum vitae, 164
CVS modules, 30
Debian Free Software Guidelines, 10
Debian GNU/Linux, 10
decision making, 2
degree (of connectedness), 25, 59
    relative, 60
density, 25, 60
FLOSS, 7
free
    beer, 12
    freedom, 12
    software, 12
Free Software Foundation, 8
future research, 138
gini coefficient, 57
GNU, 8
GNU General Public License, see GPL
GNU project, 23
GPL, 11, 13
gratis, 9
Grothoff, Christian, 44, 94
Hopkins, Don, 12
implications, 136
incident matrix, 34
inclusiveness, 62, 74
innovation research, 17
Jones, Anthony, 46
knowledge reuse, 22
libre software, see free software
limitations, 137
Linus Law, 16
Linux, 1
literature review, 26
lone wolves, 62
methodology, 31
Microsoft, 10
modification concentration, 74
Netscape, 9, 16
node centrality, 25
Non-Disclosure Agreement, 8
number of modifications, 57
open source
    history, 10
open source
    definition, 10, 155
    history, 7
    research
        categorization, 17
        critique, 22
        history of ~, 16
Open Source Initiative, 9
OSI, see Open Source Initiative
patch, 11
Perens, Bruce, 10
perpetuum mobile, 137
project success, 21
projects
    Abiword, 37, 75
    Adonthell, 38, 77
    AWStats, 38, 79
    bison, 39, 80
    BZflag, 39, 83
    CDex, 40, 85
    emacs, 41, 86
    flightgear, 41, 88
    Freenet, 42, 90
    Gnomemeeting, 43, 92
    Gnunet, 44, 94
    GTK+, 45, 97
    Irate, 46, 99
    LAME, 47, 101
    Mailman, 48, 102
    mnet, 49, 105
    nano, 49, 106
    Ogle, 50, 109
    OpenSSL, 50, 110
    pango, 51, 113
    phpMyAdmin, 52, 114
    PostgreSQL, 52, 116
    Smarty, 53, 119
    Stepmania, 54, 120
    tdb, 54, 123
    TikiWiki, 55, 124
    wget, 55, 127
    xerces, 56, 128
    XFCE4, 56, 131
proposition, 58, 60, 62, 65, 66, 68
R, 36
research question, 4
    evolution of ~, 2
RMS, see Stallman
Sandras, Damien, 43, 92
self-coordination, 28
signaling theory, 19
SNA, 24, 25, 29
    history, 24–25
social network analysis, see SNA
Stallman, 1, 7, 41
swift trust, 19
Taylor, Mark, 47
Tech Model Railroad Club, 7
Torvalds, Linus, 1
Toseland, Matthew, 42, 90
typology, 5
Warsaw, Barry, 48, 102
Curriculum Vitae
Personal data
Name: Sebastian Spaeth
Address: Zürichstrasse 45, CH - 8600 Dübendorf, Switzerland
E-mail: [email protected]
Date of Birth: 30 November 1975 (Göttingen, Germany)
Nationality: German
Education
Apr 2001 - Oct 2005 University of St. Gallen (Institute of Management), Switzerland, Doctoral studies (leading to “Dr. oec.” on Oct 24, 2005)
Aug 1999 - Mar 2001 University of Linköping, International Master’s programme in Manufacturing Management (leading to “Master of Science in Engineering” in Apr 2001)
Oct 1996 - Aug 1999 Technical University Karlsruhe, Industrial engineering and management (Wirtschaftsingenieurwesen), majoring in corporate planning
1992 - 1995 Oberstufengymnasium Eschwege, High School Diploma (Abitur)
Professional Experience
Mar 2001 - present University of St. Gallen, Switzerland; Research assistant at the chair of Prof. von Krogh, PhD
Jun 2002 - Dec 2002 “Mergers and Acquisitions” (Publisher: Verlagsgruppe Handelsblatt); Editor, responsible for “Computer and Telecommunications industry”
Management
Sep 2000 - Mar 2001 NCC AB, Stockholm, Sweden; Master’s thesis in cooperation with NCC, examining goals, tasks, and organizational designs of a logistics function
Mar 2000 - Jun 2000 Strömsholmen AB, Tranås, Sweden; Simulation project to optimize the assembly process
Apr 1997 - Jul 1999 Institut für Rechneranwendung in Planung und Konstruktion (University of Karlsruhe); Academic Assistant in the Computer Department: Responsible for Organisation and Computer Maintenance
Publications
2005 Spaeth, S., Coordination in Open Source Projects: A Social Network Analysis Using CVS Data, Doctoral Dissertation
2005 von Krogh, G., Spaeth, S. & Haefliger, S., Knowledge Reuse in Open Source Software: An Exploratory Study of 15 Open Source Projects, Proceedings of the 38th Hawaii Internat. Conf. on System Sciences
2003 von Krogh, G., Haefliger, S. & Spaeth, S., Collective Action and Innovation in Open Source Software Development: The Case of Freenet, presented at Academy of Management 2003, Seattle
2003 von Krogh, G., Haefliger, S. & Spaeth, S., Collective Action and Innovation in Open Source Software Development: The Case of Freenet, Academy of Mgmt. 2003, Seattle
2003 von Krogh, G., Spaeth, S., & Lakhani, K., Community, Joining, and Specialization in Open Source Software Innovation: A Case Study, Research Policy 32(7), pp. 1217–1241
2003 Spaeth, S., Decision-Making in Open Source Projects, Proposal for the Doctoral Dissertation
2001 Spaeth, S., Logistics at NCC, Master’s Thesis
Internships
Aug 1998 - Oct 1998 SEW-EURODRIVE GmbH & Co, Bruchsal; Planning, designing and implementing a division presentation in the Intranet
Mar 1997 Georg Sahm GmbH & Co. KG Maschinenfabrik, Eschwege, engineering internship
Aug 1995 - Sep 1995 Präwema Antriebstechnik GmbH, Eschwege, engineering internship
Voluntary Work
Apr 2001 - Apr 2005 Member of the board, Kammerchor Oberthurgau, Switzerland
Jul 1998 - Jul 1999 Executive secretary of the student university cinema “Akademisches Filmstudio an der Universität Karlsruhe e.V.”, Karlsruhe
Miscellaneous
Computer: Windows, Apple, Unix (Linux), Office, Arena, TCP/IP, HTML, XML, MySQL, basic programming skills (C, Java, PHP, Perl, JavaScript)
Languages: German (native language), English (fluent), Swedish (conversational)
Interests: Computers, Reading, Badminton, Sailing