Coordination in Open Source Projects
Coordination in Open Source Projects – A Social Network Analysis using CVS data

DISSERTATION
of the University of St. Gallen, Graduate School of Business Administration, Economics, Law, and Social Sciences (HSG) to obtain the title of Doctor of Business Administration

Submitted by Sebastian Spaeth from Germany

Approved on the application of Prof. Georg F. von Krogh, PhD and Prof. Dr. Oliver Gassmann

Dissertation no. 3110

Difo-Druck GmbH, Bamberg

The University of St. Gallen, Graduate School of Business Administration, Economics, Law, and Social Sciences (HSG) hereby consents to the printing of the present dissertation, without hereby expressing any opinion on the views herein expressed.

St. Gallen, June 30, 2005
The President
Prof. Ernst Mohr, PhD

“code is only a miniscule part of the big picture, its what people /do/ with it that matters”
–jrandom, Founder of I2P, February 18, 2005

Acknowledgements

This dissertation builds upon three years of research on open source software development at the Institute of Management. While the research for it was conducted by me, it is inspired by my work as a research assistant at the chair of Professor von Krogh and by interaction with my colleagues. It could only be finished (within a reasonable time frame) because of the support from some individuals and many participants in various open source projects, whom I would like to thank.

First of all, I would like to thank Georg von Krogh for hiring me right away after a telephone interview in which I was his research object. I learned a lot during the time at his chair. Unfortunately, I never got round to speaking as much Swedish/Norwegian with him as I would have liked.

I would also like to thank Stefan Haefliger. It is a pleasure working in a team with him, and he is a great colleague and friend. Of course, his parents’ house near Saint-Tropez and the castle of his girlfriend Susann’s family really helped a lot to relax from our hard research[1].
Thanks for those invitations to both of you.

Sharing an office with five persons is not always easy, but I enjoyed working with Daniela Blettner, Philip Tuertscher, Christian Loepfe, Stephan Herting, and, of course, Fritz. We had a great time and much fun together.

Last but not least, thanks to my girlfriend Almut, who suffered through my studies as well (she likes singing the Free Software Song now, thus thoroughly taking revenge on me). She had to endure months of procrastination, with a final outburst of intense writing during which I would hear but not listen. Hopefully my ears are now open again. She was the most critical reviewer of this dissertation throughout all stages, for which I am extremely grateful (although it might not have been obvious to her at times).

I cannot help but find the enthusiasm of dozens of researchers who praise the superior quality of free and open source products in their articles, but write them in Microsoft™ WinWord™ on their Microsoft™ Windows™ computer, somewhat hypocritical.[2] Being a half-geek, I enjoyed fiddling days (and nights) to get the superior quality software up and running: this dissertation was written with the help of FLOSS software, namely LaTeX, Emacs, and R, on a GNU/Linux computer.

Finally, I want to apologize for the continuous mixing of free software, open source, FLOSS, ... throughout this work. I do know that this might be offensive to some participants in the field. However, I found it too bothersome to be politically correct all the time, and wherever these terms are used in this work they can be considered equivalent from my point of view. If you find that unsatisfactory, feel free to search and replace all of these terms with the terminus technicus of your choice.

July 2005
Sebastian Spaeth

[1] Except the night where a dormouse found its way into my jeans.
[2] This sentence alone might convey to the reader that I am ideologically biased. I admit that readily; however, I take great care that my personal attitudes have no effect on the outcomes of the research I perform.

Contents

Acknowledgements
List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 The Quest for a Research Topic: Motivation and Research Question
  1.2 Structure overview
  1.3 Introduction to Open Source
    1.3.1 History of Open Source
    1.3.2 Open Source Definition
    1.3.3 Open Source vs. Free Software

2 Theoretical framework
  2.1 A history of research on open source
    2.1.1 Common pitfalls - A critique of empiric OS research
  2.2 Social Network Analysis
  2.3 Research on organization in OS projects
    2.3.1 SNA in open source

3 Empirical Section
  3.1 Methodology
    3.1.1 Sample Selection
    3.1.2 Analysis
  3.2 Sample
  3.3 Analysis
    3.3.1 Concentration of modifications
    3.3.2 Degree
    3.3.3 Inclusiveness
    3.3.4 Centralization
    3.3.5 Code ownership
  3.4 Characterizing the coordination styles
  3.5 Analysis of the Sample Projects
    3.5.1 Abiword
    3.5.2 Adonthell
    3.5.3 AWStats
    3.5.4 bison
    3.5.5 BZflag
    3.5.6 CDex
    3.5.7 emacs
    3.5.8 Flightgear
    3.5.9 Freenet
    3.5.10 Gnomemeeting
    3.5.11 Gnunet
    3.5.12 GTK+
    3.5.13 iRate
    3.5.14 LAME
    3.5.15 Mailman
    3.5.16 mnet
    3.5.17 nano
    3.5.18 Ogle
    3.5.19 OpenSSL
    3.5.20 pango
    3.5.21 phpMyAdmin
    3.5.22 PostgreSQL
    3.5.23 Smarty
    3.5.24 Stepmania
    3.5.25 tdb
    3.5.26 TikiWiki
    3.5.27 wget
    3.5.28 xerces
    3.5.29 XFCE4

4 Discussion & Conclusion

Appendix
  A.1 Open Source Definition (Version 1.9)

Index
Curriculum Vitae

List of Tables

1.1 Similarity of FS vs. OS philosophies
2.1 Categorization of FLOSS research
3.1 Sample overview
3.2 Degree overview
3.3 Modification concentration, Centralization & Inclusiveness
3.4 Linear regression (inclusiveness, centrality)
3.5 Modification statistics for the Abiword project
3.6 Modification statistics for the Adonthell project
3.7 Modification statistics for the AWStats project
3.8 Modification statistics for the bison project
3.9 Modification statistics for the BZFlag project
3.10 Modification statistics for the CDex project
3.11 Modification statistics for the emacs project
3.12 Modification statistics for the Flightgear project
3.13 Modification statistics for the Freenet project
3.14 Modification statistics for the Gnomemeeting project
3.15 Modification statistics for the GNUnet project
3.16 Modification statistics for the GTK+ project
3.17 Modification statistics for the Irate project
3.18 Modification statistics for the LAME project
3.19 Modification statistics for the Mailman project
3.20 Modification statistics for the mnet project
3.21 Modification statistics for the nano project
3.22 Modification statistics for the Ogle project
3.23 Modification statistics for the OpenSSL project
3.24 Modification statistics for the pango project
3.25 Modification statistics for the phpMyAdmin project
3.26 Modification statistics for the PostgreSQL project
3.27 Modification statistics for the Smarty project
3.28 Modification statistics for the Stepmania project
3.29 Modification statistics for the tdb project
3.30 Modification statistics for the TikiWiki project
3.31 Modification statistics for the wget project
3.32 Modification statistics for the xerces project
3.33 Modification statistics for the XFCE4 project

List of Figures

1.1 Starting year of participants in FLOSS Communities
1.2 Differences between open source and free software developers
2.1 Publications on social networks
3.1 The iRate architecture
3.2 Histogram of modification concentration (gini)
3.3 Relative degree over all developers
3.4 Rel. degree vs. log(#modifications)
3.5 Inclusiveness of projects
3.6 Inclusiveness vs. number of developers
3.7 Project dendrogram
3.8 Mean distinct authors per file (over all projects)
3.9 Mean distinct authors per file vs. # developers
3.10 Centrality, Inclusiveness, and Concentration of modifications
3.11 Abiword sociogram
3.12 Adonthell sociogram
3.13 AWStats sociogram
3.14 Bison sociogram
3.15 BZFlag sociogram
3.16 CDex sociogram
3.17 emacs sociogram
3.18 Flightgear sociogram
3.19 Freenet sociogram
3.20 Gnomemeeting sociogram
3.21 GNUnet sociogram
3.22 GTK+ sociogram
3.23 Irate sociogram
3.24 LAME sociogram
3.25 Mailman sociogram
3.26 mnet sociogram
3.27 nano sociogram
3.28 Ogle sociogram
3.29 OpenSSL sociogram
3.30 pango sociogram
3.31 phpMyAdmin sociogram
3.32 PostgreSQL sociogram
3.33 Smarty sociogram
3.34 Stepmania sociogram
3.35 tdb sociogram
3.36 TikiWiki sociogram
3.37 wget sociogram
3.38 xerces sociogram
3.39 XFCE4 sociogram

List of Abbreviations

BSD     Berkeley Software Distribution
CVS     Concurrent Versions System
DARCS   David’s Advanced Revision Control System (an alternative to →CVS)
DARPA   Defense Advanced Research Projects Agency
ESR     Eric S. Raymond
FLOSS   free/libre/open source software
FOSS    free/open source software
FS      free (→ libre) software
FSF     Free Software Foundation
GDBM    GNU database manager
GNU     the “GNU’s not Unix” operating system
GPL     (→GNU) General Public License
GTK+    The GIMP Toolkit
Hird    →Hurd of Interfaces Representing Depth
Hurd    →Hird of Unix-Replacing Daemons
ITIL    IT Infrastructure Library
LAME    LAME Ain’t an Mp3 Encoder
LGPL    (→GNU) Lesser General Public License
LOC     lines of code
MPL     Mozilla Public License
NDA     Non Disclosure Agreement
OS      open source
OSI     Open Source Initiative
OSS     open source software
PHP     PHP Hypertext Preprocessor
RMS     Richard M. Stallman
SQL     Structured Query Language
SNA     Social Network Analysis
SSL     Secure Sockets Layer
TLS     Transport Layer Security
XML     Extensible Markup Language

1 Introduction

Limited to a strict interpretation of its definition, open source consists of a set of rules which apply to a piece of software and which specify how the software and derivatives of it may be used (and under what circumstances derivatives may be created at all). These rules are listed in the software’s license and must be compatible with the criteria of the Open Source Definition (The Open Source Initiative, 2003b).

But looking at numerous articles in magazines, listening to managers in companies, and seeing how researchers of all kinds scrutinize free and open source software projects, it becomes apparent that open source seems to be much more than a simple set of licensing terms: it is seen as a philosophy (Stallman, 1993), a paradigm (Feller and Fitzgerald, 2000), a production model (Kogut and Metiu, 2001), or even a new innovation model (von Hippel and von Krogh, 2003). It can be a way of organizing projects (Crowston and Scozzi, 2002), or a way to collaborate (Yamauchi, Yokozawa, Shinohara, and Ishida, 2000). It follows plenty of unwritten rules and exhibits an informal hierarchy which newcomers need to follow in order to be allowed to join a community (von Krogh, Spaeth, and Lakhani, 2003).

Some of the prominent figures in this field have been in the spotlight of media attention. For example, Linus Torvalds, creator and project leader of the Linux kernel, the OS[1] project most widely known to the public, was voted #17 in Time Magazine’s Person of the Century Poll in 2000. In 2001, he shared the Takeda Award for Social/Economic Well-Being with Richard Stallman and Ken Sakamura. In 2004, he was named one of the most influential people in the world by Time magazine. Also in 2004, he won The Economist’s Innovation Award (The Economist, 2004)[2].

[1] To improve readability, open source is abbreviated as OS throughout the text.

The phenomenon has been subject to intense research by dozens of researchers during the last few years, an overview of which will be given in Section 2.1 on page 16.

1.1 The Quest for a Research Topic: Motivation and Research Question

Originally, it was intended to study “strategic decision making”, as described in the thesis proposal (Spaeth, 2003). But a brief analysis of mailing lists from several open source projects revealed that the examined projects, at least, did not have a “formal” decision making process and, even worse, that decisions were seldom discussed explicitly and decided on in public mailing lists. For instance, important decisions, such as the import and reuse of a component from an external project, were rarely discussed and decided on collectively.

There are two plausible explanations for this. The first is that the main developers discuss and decide these issues off the public record, in private e-mails or chat channels, before presenting the public with a fait accompli. This possibility implies that decision making happens, but cannot be observed directly by researchers using publicly archived data sources, as it happens through private or non-archived communication channels. If this were the case, it would best be examined through, e.g., an ethnographic study observing the developers’ behavior in real time.

The second possibility is that strategic decisions are indeed not explicitly discussed and openly decided on in the community. This would imply that explicit decision making does not happen at all in projects beyond the level of individual task self-assignment. While this is a somewhat extreme and provocative assumption, it is not improbable that few explicit decision making processes exist in free and open source projects, which very often do not have a formal organization structure at all.
[2] In his book “Just for Fun”, Linus states that he finds the frequent worshiping of his person rather annoying (Torvalds and Diamond, 2001).

It seemed to become ever more impractical to examine strategic decision making processes empirically using public data archives. The more preliminary research was done to uncover these processes, the more apparent it became that decisions were indeed often made implicitly, by developers simply picking some task they liked and starting to work on it. Thus it appears that decision making takes place mostly through the self-assignment of tasks by individual developers (as e.g. theorized by Benkler, 2002). As long as the resulting code works and is deemed “clean enough”, there seems to be little resentment against this spontaneous and self-organizing coordination of developers[3].

The logical consequence, therefore, was to look at the areas developers pick to work on. A closer look at the organization of projects, and more specifically at the coordination among developers in open source projects, was needed. Previous research had been looking into the social structure of open source projects and their self-organizing characteristics and seemed to agree that this is an important part of open source research (see the literature review for more information).

Previous research by von Krogh et al. (2003) had shown that newcomers (developers who are granted the permission to modify the source code) would often join a project by contributing some new (or improved) functionality, and would later specialize and focus their efforts in this area. These findings, derived from a single case study (Yin, 2003), had yet to be verified in multiple cases. So, examining aspects of self-coordination and the “code ownership” of developers would also serve to verify these findings of our previous research.
One way to examine the areas of code that developers choose to work on is to see whether there is really such a thing as code ownership, how concentrated and/or overlapping these “pockets of activity” (von Krogh et al., 2003) are, and how they are interconnected. This would result in the construction of a network that shows developers who are more or less interconnected with each other, based on their coding activities. After some pondering of the issue, it became clear that this approach would basically amount to the construction and analysis of a social network of developers within one OS project.

[3] This does not mean that co-developers are uncritical of their peers. If they do not like some code, they can be extremely critical in their peer reviews, sometimes resulting in heavy attacks against each other (flame wars).

Researchers have proposed performing Social Network Analysis (SNA) on computer centered communities (e.g. Wellman, 1996); however, most studies actually conducted were either not very in-depth or were high-level studies looking beyond the boundaries of single projects (e.g. all projects hosted on Sourceforge.net[4]). Section 2.3.1 on page 29 provides more information on recent studies applying SNA to OS projects. Yet none of the studies examined coordination, or rather self-coordination[5], of work among developers using a specific project as the unit of analysis. I became interested in the ways in which Social Network Analysis could contribute to the examination of coordination and the ownership of code among developers.

Following this line of thinking, the basic topic addressed in this dissertation evolved over time: How are free and open source projects organized? How is work coordinated and distributed among their developers?

This very broad formulation of the topic of interest was divided into three research questions, addressing the specific issues explained above.

Res. Question 1: Are free software and open source projects self-coordinated through the set of files each developer works on?

Res. Question 2: Is there such a thing as “code ownership”? Are developers responsible for certain areas of code?

Res. Question 3: Are patterns of coordination detectable through Social Network Analysis, and how can they be interpreted? Can projects be categorized according to their “coordination style”?

[4] Sourceforge.net is the most popular and prominent platform which provides infrastructure for open source projects without charge.
[5] Based on an unscientific definition of coordination, 1: the act or action of coordinating 2: the harmonious functioning of parts for effective results (Merriam-Webster, 1993), coordination can be understood as “a way for developers to arrange themselves harmoniously to achieve effective results (→software)”. Self-coordination emphasizes the self-organizing aspect of coordination.

Goal & Readership of this Research

Although some propositions are made during this work, its goal is not a single quantitative model with one dependent variable. It aims to explore various measures that could be used to characterize the coordination and the development style, if there is any, of OS projects. One long term goal should be to enable a categorization of projects, i.e. the creation of a typology of coordination styles of projects. The applied methods are to a large extent drawn from Social Network Analysis.

The work takes an interdisciplinary approach. On the one hand, it intends to further research on OSS development, to derive a theoretical foundation of what happens so seamlessly in thousands of projects every day, and to help it grow further. On the other hand, the way in which OSS is developed is interesting to both management and innovation researchers. Organization and distribution of work in virtual, global organizations is of much interest to both disciplines.
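The network construction outlined in Section 1.1, in which developers are linked through the files they modify, could be sketched as follows. This is only a minimal illustration and not the procedure used in this dissertation: the input format (developer/file pairs, as might be extracted from a CVS log), the rule that two developers are tied when they have modified at least one common file, and all function names are assumptions made for the example. Relative degree and inclusiveness follow their common textbook definitions in Social Network Analysis, which are likewise assumed here.

```python
from collections import defaultdict
from itertools import combinations


def build_developer_network(modifications):
    """Build an undirected developer network from (developer, file) pairs.

    Assumption: two developers are tied if both modified the same file.
    Returns the set of developers and a dict mapping an ordered pair of
    developers to the number of files both of them modified.
    """
    authors_by_file = defaultdict(set)
    developers = set()
    for developer, path in modifications:
        authors_by_file[path].add(developer)
        developers.add(developer)

    ties = defaultdict(int)
    for authors in authors_by_file.values():
        # sorted() keeps each pair in one canonical order
        for a, b in combinations(sorted(authors), 2):
            ties[(a, b)] += 1
    return developers, dict(ties)


def relative_degree(developers, ties):
    """Degree of each developer divided by the maximum possible degree (n - 1)."""
    degree = {d: 0 for d in developers}
    for a, b in ties:
        degree[a] += 1
        degree[b] += 1
    possible = len(developers) - 1
    return {d: (k / possible if possible > 0 else 0.0) for d, k in degree.items()}


def inclusiveness(developers, ties):
    """Share of developers that have at least one tie to another developer."""
    if not developers:
        return 0.0
    connected = {d for pair in ties for d in pair}
    return len(connected) / len(developers)


# Hypothetical modification records; names and paths are invented.
sample = [
    ("alice", "src/main.c"),
    ("bob", "src/main.c"),
    ("bob", "src/util.c"),
    ("carol", "src/util.c"),
    ("dave", "doc/README"),
    ("alice", "src/main.c"),  # repeated modifications do not add ties
]
devs, ties = build_developer_network(sample)
print(ties)
print(relative_degree(devs, ties))
print(inclusiveness(devs, ties))
```

In this invented sample, dave works on a file nobody else touches and therefore remains isolated, which lowers the inclusiveness of the network; such isolated "pockets of activity" are one possible indication of code ownership.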
Accordingly, this work could be of interest to a broad audience. Maintainers of open source projects may be interested in the empirical findings and may find clues on how they would like ‘their’ community to be organized and to coordinate itself, even though project leaders might have only limited influence over this. Managers deploying virtual, geographically dispersed teams might want to turn their attention to the apparently well-functioning self-coordination of these communities. And, last but not least, researchers both of organizations, such as companies or communities of practice, and of open source projects might be interested in the use of Social Network Analysis applied across several cases.

This work examines the usefulness of diverse measures for characterizing projects, and provides a first step towards a typology of projects with respect to the distribution and self-coordination of work among developers, through the identification of determining variables of a coordination style. The structure of this work is presented next.

1.2 Structure overview

So far, Section 1 gave a brief introduction to the research topic. It presented the research questions and clarified the goals of the dissertation.

Next, Section 1.3 presents an introduction to the history of open source, its precise definition, and a comparison of free/libre and open source software, to give readers who are not yet familiar with the topic a better understanding of open source.

Although the study is performed in an inductive manner, the relevant theoretical framework and existing literature can be found in Section 2 on page 16. This part begins with an overview of the history of open source research until today, and a critique of the sampling and data gathering methods employed in many current empirical studies.
The next subsection (Section 2.2 on page 24) contains a brief generic introduction to Social Network Analysis (SNA). Existing literature concerned with the organization of open source projects is reviewed in Section 2.3 on page 26, focusing later (page 29) on existing research applying Social Network Analysis to open source projects.

The Empirical Section (Section 3 on page 31) begins with the Methodology. A large part of that section is then dedicated to the presentation of the sample projects (Section 3.2 on page 37), characterizing each project briefly. Various measures, mostly derived from Social Network Analysis, such as the degree of connection and the inclusiveness of the network, are presented and discussed in Section 3.3 on page 57. A first step towards a typology of open source projects with regard to the way developers choose the areas they work on, characterizing the “coordination style” through the identification of influencing variables, can be found in Section 3.4. Using the four identified variables, Section 3.5 on page 75 characterizes each of the sample projects, presents its sociogram, and gives some key findings.

Finally, the Discussion part summarizes the results and findings. The implications for researchers and practitioners are discussed, starting on page 134. Lastly, interesting open issues worth future research point to directions that related research could follow.

1.3 Introduction to Open Source

When doing research on free software and open source software projects[6], it is important to understand precisely what open source means. This section gives the reader a better understanding of the basic concepts and terms with regard to open source projects and of how OS evolved over time. It also clarifies the differences between open source software and free software, and when those two terms need to be distinguished.
The next sections do not yet provide an overview of current open source research (see Section 2 for this); they serve merely as an introduction to the area. Readers who are already very familiar with the history and terms of this topic might be inclined to “fast-forward” through these parts.

1.3.1 History of Open Source

The ARPAnet was introduced as an experiment of the US Department of Defense, or to be more correct, its spin-off, the Advanced Research Projects Agency. A proposal for "Resource Sharing Computer Networks" was submitted on June 3, 1968, and approved by the Director on June 21, 1968 (Hauben, 1994). Once established, it allowed hackers all over the U.S. to communicate with each other instead of being isolated in small local groups. Gradually, they started to feel like a “virtual” tribe, connected through the new network. They developed a common hacker culture and hacker ethics[7].

One man who once called himself “the last true hacker” had especially taken the hacker culture and ethics to heart. In the beginning of the 70s, when Richard Stallman (RMS) worked at the MIT Artificial Intelligence Lab, software was not classified as commercial, free software, or open source, as all software was originally free (Stallman, 1999; Perens, 1999). The hacker communities, which had developed at the Artificial Intelligence Lab and at other places, were sharing the source code of operating systems and other applications without any hesitation or restrictions (Levy, 1984).

[6] Often abbreviated as FLOSS in a politically correct manner, standing for free, libre & open source software.
[7] The terms hacker and hacker ethics were probably coined by the Tech Model Railroad Club at MIT (Levy, 1984; Williams, 2002, App. B).
But in the early 80s new computer systems were replacing the old ones, and the operating systems for these computers were not free anymore, as many had discovered the commercial value of software and had started software companies based on proprietary development models. Gradually, software companies, instead of hacker communities, started to dominate the production and distribution of software of all kinds. These companies closed the source code to the user and sold only compiled binary versions of their software, which were not readable by humans and could not easily be modified and extended. Furthermore, copyright laws disallowed any modification of these programs. Most hacker communities were weakened when many of their members were hired away by commercial software companies and had to sign Non-Disclosure Agreements (NDA), and they gradually dissolved.

Richard Stallman was faced with the choice either to join the proprietary software world or to try to bring these hacker communities back to life. His conviction was that all software should be free to share and to build upon. In order to enable people to use free software again and to revive the hacker communities he missed, he quit his job in 1984 and decided to write a new, truly free (i.e. libre) Unix-compatible operating system, which he termed GNU[8] (see Section 1.3.3 for a more detailed explanation of the specific meaning of free). From 1984 on, he worked to replace the numerous tools and programs of which a Unix operating system is composed, piece by piece, with free software applications (Moody, 2001). Others followed and supported the Free Software Foundation (FSF), founded in 1985 by Stallman, or started complementary and competing projects of their own.
With the spread of the Internet to ever larger parts of the population[9], and the creation of network based software development tools and infrastructures, the number of free and open source projects started to explode in the mid-90s. Figure 1.1 visualizes when most participants started to become involved in FLOSS development, based on a large scale survey conducted by Ghosh, Glott, Krieger, and Robles, showing the growing participation since the late 1990s.

[8] GNU is a ’recursive’ acronym of the operating system’s full name “GNU’s Not Unix”. Recursive acronyms are quite common as a joke among computer hackers. E.g. the GNU kernel “Hurd” is even named by a pair of mutually recursive acronyms: “Hurd” stands for “Hird of Unix-Replacing Daemons”, and “Hird” in turn stands for “Hurd of Interfaces Representing Depth” (Bushnell, 1991).
[9] Networked computers did not become available to the general public until after the first implementation of TCP/IP (a network protocol) was released with the Berkeley Software Distribution 4.2 (BSD) in 1983 (Raymond, 2003).

[Figure 1.1: Starting year of participants in FLOSS Communities (source: Ghosh et al., 2002)]

But not all participants were entirely happy with the strong fixation of Stallman – and therefore the entire Free Software Movement – on moral and ethical issues (“all software must be free”). His attitude, plus the emphasis on the ambiguous term free, had led many from the press and corporate world to connect free software with negative stereotypes and the (false) notion that software of this type must be gratis too.

Therefore, when Netscape decided in early 1998 that it would release the source code of its web browser suite (Netscape, 1998), the term open source was created on February 3, 1998 by participants of the newly founded Open Source Initiative (OSI) (see next section for more on this).
The motivation behind this move was the creation of a term which could not be mistaken for gratis and which did not imply any ethical or moral judgment. It would basically follow the same rules as free software but should still be connected with a positive image in the corporate world. Its founders consider the Open Source Initiative, with its policies and goals, a derivative of Stallman’s work, still following his intentions (Perens, 1999; Raymond, 1999b), although Stallman disputes this, as proponents of open source software do not explicitly mention and emphasize the normative values of FLOSS10. Today, many individuals and companies still hesitate to contribute their expertise and knowledge to a public good. Yet, some studies have shown that contributing to and using open source software can be the most efficient way to innovate under certain conditions (Bessen, 2002). Many large companies, notably IBM and to a certain extent Sun and Apple, have adopted and supported OS software. Even Microsoft, considered the “Anti-Christ” by many believers in open source, has released some projects under an OSI-compatible license; e.g. in 2004, WiX (Windows Installer XML) was donated by Microsoft and is hosted by Sourceforge (Foley, 2004). A definition of the term open source will be given and its main differences to free software will be explained next.
1.3.2 Open Source Definition
In June 1997, Bruce Perens, then the project leader of the Debian GNU/Linux distribution, drafted The Debian Free Software Guidelines, a document intended to enable the categorization of software into free and non-free software by comparing the software license to the guidelines (Perens, 1999). After some discussion with Debian developers it was made official in July.
In February 1998, all Debian-specific references were removed and the guidelines were published as the first version of the official Open Source Definition by the Open Source Initiative (The Open Source Initiative, 2003a). According to this definition (The Open Source Initiative, 2003b; full text is given in Appendix A.1 on page 155), users and programmers are granted several rights (see Table 1.1 for an overview). The most important aspects of OS licenses are, according to this definition:
• The possibility of free distribution, i.e. the possibility to make any number of copies of software that one possesses and to sell or give them away, without having to pay anyone for that privilege.
• The availability of the source code together with the program (or means to download it without charge), which should guarantee the easy modification and evolution of OS projects. It should be noted that it is acceptable to charge any amount of money when selling the product. However, when selling a binary program, the source code must be delivered with it or be accessible for no extra charge.
• Improved and modified code (derived works) should always be allowed to be distributed under the same license terms as the original software. To protect his reputation, the original author can demand that source code be redistributed only in its original form, as long as “patches”11 may be distributed along with it.
• No discrimination against persons, groups, or fields of endeavor may occur, which means that nobody may be excluded from the use of the software (including commercial organizations).
• The license must apply automatically to all to whom the program is distributed. This clause is intended to forbid closing up software by indirect means such as requiring a non-disclosure agreement.
10 “RMS” commented on the reception of the Linus Torvalds Award at the 1999 LinuxWorld with a sarcastic “Giving the Linus Torvalds Award to the Free Software Foundation is a bit like giving the Han Solo Award to the Rebel Alliance.”
11 A patch is a piece of text describing source code modifications in a specific format, which can be used by the patch program to actually modify the original source code. This is then called patching.
The main goal of open source compliant licenses (such as the GNU GPL) is to ensure that knowledge in the form of source code, once created, remains free and available. Users of that software are granted the right to modify and improve the software as they see fit. OS licenses take advantage of copyright law in order to achieve the opposite of what it was created for; therefore, this scheme of protection has been termed copyleft12 13.
1.3.3 Open Source vs. Free Software
There is often some confusion concerning free software in contrast to open source. Most research assumes that they are equivalent, which in most cases might be a fair assumption. However, there are some differences which should be clarified. As already mentioned in Section 1.3.1 (History), the term free software was introduced by Richard Stallman and refers to “free as in Freedom”, not as in (getting the software for) free14. Followers of the free software movement disdain the use of any proprietary and closed source software, which is in their eyes “evil” software. This includes any proprietary software that runs on top of other free programs (e.g. a commercial application making use of a free library). Basically, the Free Software Movement is driven mostly by ethical and moral reasoning. However, from a technical point of view, the definition of free software is very similar to the definition of open source software; it guarantees the user the following rights (Stallman, 1999, p.56):
• You have the freedom to run the program, for any purpose.
• You have the freedom to modify the program to suit your needs.
(To make this freedom effective in practice, you must have access to the source code, since making changes in a program without having the source code is exceedingly difficult.)
• You have the freedom to redistribute copies, either gratis or for a fee.
• You have the freedom to distribute modified versions of the program, so that the community can benefit from your improvements.
12 copyleft /kop’ee-left/ n. [play on ‘copyright’] 1. The copyright notice (‘General Public License’) carried by GNU EMACS and other Free Software Foundation software, granting reuse and reproduction rights to all comers (but see also General Public Virus). 2. By extension, any copyright notice intended to achieve similar aims. (Jargon, 2004)
13 The term Copyleft is derived from the phrase “Copyleft – desrever sthgir lla”, which Don Hopkins wrote in a message to Stallman in 1984 and which is intended as a double pun on the phrase “Copyright–all rights reserved”. (Wikipedia.org)
14 “Free as in Freedom, not free as in beer.” (Stallman, 1993)
The ambiguity of the word free in the English language has often led to confusion about the goals of the Free Software Movement in the public and the corporate world. Therefore, free software is often called libre software, as libre refers more clearly to the intended meaning “free as in Freedom”. The GNU GPL (GNU General Public License) was created by Stallman as the standard license for free software projects. It uses copyright methods in order to ensure the above-mentioned rights to all people. In order to keep all modifications and derivatives free for all people, it requires all derivatives of GPL’ed work to be published under a GPL-compatible license as well. The requirement to release the resulting derivative source code together with the binary files caused Microsoft to call the GPL licensing scheme virulent. The Open Source Initiative takes a more pragmatic stance than the free software movement.
It mainly considers the technical and organizational advantages of open source software over commercial closed-source software. Its primary motivations are not morals or ethics, but issues like the ease of evolution and modification of software as well as the simple means of cooperation even across organizational boundaries. If combining and using proprietary software with more restrictive licenses helps to spread open source software, its proponents would propagate it without hesitation (they actually do so by endorsing commercial software on the GNU/Linux operating system). Apart from the underlying motivations behind these two movements, their philosophies are obviously very similar (see Table 1.1 for a summary of both main philosophies), so that free software and open source projects are usually considered close enough to be regarded as equivalent for research. Whether this assumption is generally true has yet to be proved. As the definitions are nearly equivalent, and the development style seems to work identically, it might be fair to treat both types as the same. But research on the motivation of participants, in particular, should not neglect the potential differences between both types of projects. A recent survey of free software and open source developers has identified significant differences in the way developers see themselves (Ghosh et al., 2002, chapter 4). Figure 1.2 shows that especially those who assign themselves to the Free Software community strive for a sharp distinction between their community and the open source software community. On the other hand, answers from members of the open source software community correspond approximately with the average distribution of answers, and indicate an indifference toward the distinction between the two types.
Table 1.1: Similarities of the FS/OS philosophies (source: Stallman, 1999; Perens, 1999)
Free Software philosophy:
⇒ You have the freedom to run the program, for any purpose.
⇒ You have the freedom to modify the program to suit your needs. (To make this freedom effective in practice, you must have access to the source code, since making changes in a program without having the source code is exceedingly difficult.)
⇒ You have the freedom to redistribute copies, either gratis or for a fee.
⇒ You have the freedom to distribute modified versions of the program, so that the community can benefit from your improvements.
Open source philosophy:
⇒ The right to make copies of the program, and distribute these copies.
⇒ The right to have access to the software’s source code, a necessary preliminary before you can change it.
⇒ The right to make improvements to the program.
This research is mostly concerned with the processes of organization, specialization and code-ownership. These aspects are assumed not to be related to the motivation of developers or to their ethical convictions. For the purpose of this study, free software and open source software projects are therefore regarded as equivalent, and this work will generally use the more popular term “open source”. The next chapter presents the theoretical framework on which this research draws.
Figure 1.2: Describing the differences between FS and OSS (source: Ghosh et al., 2002)
2 Theoretical framework
Section 2.1 gives an overview of the evolving research on open source software development over time. It also categorizes the research into three areas according to the unit of analysis, clarifying which area this dissertation belongs to. A critique of frequent erroneous assumptions in the sampling and data gathering methods follows in Section 2.1.1. Section 2.2 gives a brief introduction to Social Network Analysis (SNA) based on existing literature and presents its usefulness in connection with open source research. Section 2.3 then presents previous research on the organization of OS projects.
A subsection focuses on literature applying SNA methodology to open source projects.
2.1 A history of research on open source
Most academics discovered their interest in free and open source software development between 1998 and 2000. Two major events raised the awareness of the general (i.e. non-geek) public of the various FLOSS movements. The first event was the publication of Eric Raymond’s “The Cathedral and the Bazaar” on the web (it was presented to the public on May 21, 1997 during Linux Kongress and was later published as a book (Raymond, 1999a)). It is one of the first attempts to describe and analyze the workings of a free software community. In this work, Raymond described what he termed Linus’s Law: “Given enough eyeballs, all bugs are shallow.” The second major event was the announcement by Netscape Communications that it would release the source code of its web browser suite to the public under a free license (Netscape, 1998), as already described in Section 1.3.1 on page 7, and, related to this event, the coining of the term open source. While free software had always existed, it had so far occupied a niche which did not seem interesting for researchers and commercial participants. With the release of a commercial closed source software product (although it had been distributed without charge earlier) as an open source project, the first significant commercial player demonstrated its interest and indeed bet its existence on the power of free and open source code. Moreover, the term open source was coined in order to be able to promote the various aspects of free software without being hindered by a) the ambiguous meaning of free in the English language and b) the reluctance of managers to have their ‘opened’ product associated with a “free as in beer” (i.e. gratis) product.
Some of the first researchers who recognized OS as interesting and relevant to study came from the field of innovation research: e.g. von Hippel and Lakhani (2000) looked into innovation through lead users1. Tuomi (2000) recognized that organizational, institutional, economic, cultural, and cognitive aspects of the open source development model would be interesting to study. Also, at that early stage, Lerner and Tirole (2000) were among the first researchers trying to explain the motivation of developers to participate without direct monetary benefits. Since then, research into free and open source software development has been conducted by academics from many disciplines, ranging from anthropologists (Zeitlyn, 2003) to software engineers (Scacchi, 2002), and was often performed in an interdisciplinary manner. Many of the first studies were explorative single case studies, while later studies also analyzed multiple projects. By 2004, so many studies had been conducted (as of Jan 13, 2005, http://opensource.mit.edu lists 198 published articles and working papers alone) that a review article was published by Rossi (2004).
1 The concept of ‘Lead Users’ was introduced by Eric von Hippel in the mid 1980s. He defined the lead user as those users who display the following two characteristics: 1) They face the needs that will be general in the market place, but face them months or years before the bulk of that marketplace encounters them. 2) They are positioned to benefit significantly by obtaining a solution to those needs
Table 2.1: Categorization of FLOSS research (adapted from von Krogh and von Hippel, 2003)
Research category: Motivations
Unit of analysis: Individual developer
Research questions: Why do programmers dedicate their free time to these projects without being paid? Why do organizations contribute resources to advance projects without direct monetary rewards?
Research category: Processes
Unit of analysis: Single OS project
Research questions: How do open source projects work? What processes go on? How are decisions made? Is there a ‘life cycle’ of open source projects? How is work organized and are tasks distributed?
Research category: Competitive Dynamics
Unit of analysis: All OS projects / ‘Commercial’ projects
Research questions: What makes some projects more successful than others? What are the competitive advantages of open source projects over commercial projects?
Table 2.1 presents an attempt to categorize FLOSS research into three broad categories: motivations, processes, and competitive dynamics. Those can be distinguished by their unit of analysis. The first category, “motivations”, investigates the motivation of single developers. The second category, “processes”, examines processes of a specific single project. The last category, “competitive dynamics”, looks across the boundary of a single project at the ‘universe’ of all closed and open source projects, identifying success factors.
Motivations
Most early contributions to open source research were dedicated to explorative and descriptive case studies, describing more or less prominent open source projects and examining the motivation of those volunteering, unpaid hobbyists2 (Harhoff, Henkel, and von Hippel, 2003; Hertel, Niedner, and Herrmann, 2003; Hars and Ou, 2000; Osterloh and Rota, 2003).
2 Back then, research mostly (and correctly) assumed that all participants were volunteering, unpaid hobbyists. By now, the share of paid developers has risen significantly (see e.g. Ghosh et al., 2002), yet much research on motivation has failed to take this into account. A first attempt to distinguish these types has been made by West and O’Mahony (2005).
There has been much speculation about the true motivations. Mainly two schools of thought have emerged from this discourse: The first, mostly propagated by economists, argues that developers are rational actors, and gives one main reason for their participation: The true motivation of developers is their increased labor market value due to their participation in prominent projects (Lerner and Tirole, 2002b; Prufer, 2004). Others have argued to the same effect through the use of signaling theory (Lee, Moisa, and Weiss, 2003; Stenborg, 2004). According to this theory, developers signal competency to potential employers through a high reputation among their peers. It is argued along the same line that reputation within the developer community serves as an important factor contributing to the motivation (explaining why being the founder of a project is attractive). These arguments are supported by the fact that many prominent developers have indeed been subsequently hired by companies such as Novell, RedHat, or IBM. The latter even patented a method of paying open source developers (The Enquirer, 2004).
The second school of thought focuses on the existence of intrinsic motivation (Bitzer, Schrettl, and Schroder, 2004). Followers of this school argue that OSS development takes place in a gift giving culture (Zeitlyn, 2003; Bergquist and Ljungberg, 2001; Hemetsberger, 2004), and that “motives of hobbyist evolve over time; most join the community because they have a need for the software and stay because they enjoy programming in the context of a particular community” (Shah, 2003). Others deem trust an important factor enabling contributions, as it fosters intrinsic motivation (Osterloh and Rota, 2004a,b). Ferraro and O’Mahony (2003) have looked into the “web of trust” between developers by participating in key signing parties3 of developers of the Debian project. The importance of trust, however, is not universally recognized.
While Jarvenpaa and Leidner (1999) had shown that swift trust, a quickly established yet temporal and fragile form of trust, emerges in virtual teams, some have argued that trust in open source projects is not necessary at all (Gallivan, 2001). Some large-scale surveys have been conducted, emphasizing the existence of intrinsic motivation, but also acknowledging the existence of extrinsic motivation (Lakhani, Wolf, Bates, and DiBona, 2002; Ghosh et al., 2002), thus indicating that no single school of thought is able to explain the full motivation of developers; the truth might rather lie in a combination of both intrinsic and extrinsic motivation, as e.g. Osterloh and Rota (2003) acknowledge. While the area of motivation is interesting and fascinating, interest has also been raised in organizational issues and processes that govern OS projects.
3 Key signing parties are events where developers verify the authenticity of their encryption keys through face-to-face contact.
Processes
The second category, Processes, uses a specific project as unit of analysis. It is concerned with the inner workings of a project and its organization. Such processes might for instance be how newcomers join the project (von Krogh et al., 2003) or how programming resources are allocated within a project (Dalle and David, 2003; Dalle, David, Ghosh, and Steinmueller, 2004). Many have seen OS development as a new mode of innovation (von Hippel and von Krogh, 2003; Osterloh and Rota, 2004b), production, software development, or even as a generic model for organizing (Ljungberg, 2000). Some have looked into the specific characteristics of OS development communities which make them so different from other settings. For example,
Lanzara and Morner (2003) found that “technology, rather than formal or informal organization, embodies most of the conditions for governance in open-source software projects, hence becoming a critical pathway to the understanding of collective task accomplishment, coordination and knowledge making processes.” Research on project governance, coordination, and the social structure of the communities is part of this category. A review of literature concerned with the organization and social structure of projects can be found in Section 2.3 (page 26). This work is deeply rooted in this second category, examining the areas of code that developers elect to work on. It will not question the motivation of these developers (see e.g. von Krogh, Haefliger, and Spaeth, 2003 for a potential explanation). It will also not examine project success (the third category) of the sample projects. This would be a different research question in itself.
Competitive dynamics
The third category, ‘competitive dynamics’, looks beyond the scope of a specific OS project. It is concerned with the success factors of open source projects compared to a) other open source projects competing for the same developer resources and b) commercial closed source projects. The success of open source projects is very difficult to measure (Crowston and Scozzi, 2002), as projects have very different purposes and various potentially useful measures of project success are difficult or even impossible to gather. Some researchers therefore focus on very specific, and restricted, measures of success, such as counting the number of lines of code (LOC) added, while others try to come up with composite variables consisting of various success factors. A working paper by Healy and Schussman (2003) is also concerned with the success of projects in the form of development activity. They observe a snapshot of about 46,000 projects hosted by Sourceforge.
They look at the skewness of six activity measures across all projects.4
4 Although the validity of their interpretation needs to be considered with caution. E.g. they neglect that about a third of all projects hosted on Sourceforge are dormant.
MacCormack, Rusnak, and Baldwin (2004) compare the software architecture of the Linux kernel with that of the Mozilla web browser, originally developed as closed source, at the time it was released to the public. They find that software developed in an open source mode is more modular than software developed as closed source. They also find that the Mozilla code base was restructured and became more and more modular over time after it was released to the public as open source. Lerner and Tirole (2002a) approach the success factors of projects from a legal perspective: They perform an empirical analysis of the determinants of license choice using nearly 40,000 projects. They find that projects geared toward end-users, such as games, tend to have restrictive licenses, while those oriented toward developers, those designed to run on commercial operating systems, and those which are geared towards the Internet are less likely to have restrictive licenses. Another approach to examining the success of the open source software development model is the reuse of knowledge in the form of software components from external projects (von Krogh, Spaeth, and Haefliger, 2005). It is found that projects frequently reuse components in order to get their code working fast and to avoid “re-inventing the wheel”. As lies in the nature of research from this third category, it is based on multiple cases or empirical comparisons of large samples. In my opinion, results from multiple project studies have often become skewed due to common mistakes in the sample selection and data gathering methods.
As this work relies on quantitative data from multiple cases as well, I will summarize the most common pitfalls in order to validate my own research design.
2.1.1 Common pitfalls - A critique of empiric OS research
So far, most case studies have only been concerned with a select few prominent projects, such as the Linux kernel, the Apache web server, or the Mozilla web browser suite. Although it is very insightful and interesting to study these cases, it should be noted that they are far from the ‘average’ open source project and, although popular and successful, might not exhibit the patterns of ‘common’ open source projects. They are extremely large (compared to other open source projects), successful, and as close to a formal organization as a project can be. Most of the core developers in these projects have been exposed to so many OS studies, been scrutinized and interviewed so many times, and have participated in so many conferences on the topic, that they might suffer from the panel effect, i.e. exhibit a different behavior than a non-observed developer would.5 A high share of the contributions to these projects comes from paid developers of companies, and their organization is mostly managed by organizational entities rather than individuals. Therefore, conclusions from these non-representative cases that explain ‘how open source works’, or that concern the motivation of contributors, need to be drawn very carefully. Little thought seems to have been given to these issues in case studies involving those projects.
5 This is similar to Heisenberg’s uncertainty principle in quantum mechanics stating that “the uncertainty principle involved the perturbation to a particle’s state by a measurement of one variable, which affects one’s ability to predict the outcome of a subsequent measurement of the conjugate variable” (Raymer, 1994).
Much empirical research on open source projects looks exclusively at projects which are hosted on Sourceforge. This fact alone skews the sample of projects, as these projects for instance tend to be centered around the English speaking community, neglecting e.g. Asian projects (Bates, 2003). Sourceforge.net itself was originally founded as an incubator for small projects which could not afford an infrastructure of their own. Projects hosted by Sourceforge tended to be smaller projects without a supporting organization to provide the technical infrastructure (at least this used to be the case until being represented, and thus being searchable, on Sourceforge became a major promotional advantage). A typical study that needs to be interpreted very carefully was described by Healy and Schussman (2003) in a working paper. They analyze a complete snapshot of all Sourceforge hosted projects, detecting a high skewness in various activity measures. Although their findings are in line with others who found power-law distributions across OS projects (e.g. Madey, Freeh, and Tynan, 2002), it has been neglected that, according to Jeff Bates (an employee of the company providing Sourceforge), at least a third of the projects hosted there are dormant, and many were simply used as a dumping ground for some code, without the intention of getting a “real” project started. Projects on alternative hosting sites are mostly ignored by researchers, for instance Savannah (http://savannah.gnu.org), which is dedicated to hosting projects that are part of the GNU project, or Berlios (http://berlios.de). However, this limits the representativeness of the sample, as projects hosted there (or hosted independently) might have different characteristics than those hosted by Sourceforge. Large projects (often older than Sourceforge itself) frequently have their own infrastructure (e.g. the Linux kernel, Mozilla, and Apache are not hosted on Sourceforge).
If these large and/or old projects are hosted on Sourceforge, it happens (as reported in an interview with a developer of such a project) for promotional reasons only. These projects, while listed, either use only a fraction of the facilities Sourceforge offers, such as bug databases, mailing lists, source code patch trackers, etc., or do not use them at all. For instance, many developers ignore the bug database offered by Sourceforge and rely on mailing lists only to report bugs (or the other way round). Yet, users might still file bugs into the bug database, because the project administrator has not bothered to explicitly turn off this feature. Similar concerns can also be raised for other aspects, such as the number of official developers and the categorization of the project status (alpha, beta, ..., mature). A study by Krishnamurthy (2002), examining 100 mature projects on Sourceforge, has been heavily criticized for the applied methodology and conclusions in a subsequent letter to the editor6 (DeMaggio, 2002). Some researchers have recognized the danger of blindly relying on data gathered from Sourceforge, e.g. Howison and Crowston (2004) describe the “perils and pitfalls of mining Sourceforge”, yet it remains an often neglected issue. The section on methodology will explain in more detail how these issues have been taken into account in this study. The next section contains a brief introduction to social network analysis, before social structure research and social network analysis in OS projects are presented in more detail.
2.2 Social Network Analysis
Various interdisciplinary strands of development have contributed to the creation of Social Network Analysis (SNA). This section will not present an in-depth introduction to SNA; it merely gives a brief introduction for those unfamiliar with the topic. Interested readers are referred to e.g. Scott (1991) for a good introduction and overview.
6 “But your conclusions are suspect because of the completely unscientific nature of your analysis.” (DeMaggio, 2002)
The following briefly presents the three main research streams on which Social Network Analysis has been founded (Freeman, White, and Kimball, 1992; Scott, 1991, p.7): SNA descended from Gestalt theory, which looked into group structure and the flow of information and ideas through groups, using laboratory methods. Building upon this work, sociometric analysts developed the methods of graph theory. The sociogram, for example, stems from this stream of research (it is generally attributed to Moreno (1934)). A second stream of researchers, at Harvard in the 1930s and 40s, explored patterns of interpersonal relations and the formation of ‘cliques’, building e.g. upon the work of the social anthropologist Radcliffe-Brown and, through him, Durkheim. A seminal work by these researchers was the study of the Hawthorne electrical factory (Roethlisberger and Dickson, 1939), which created sociograms in parallel (without reference to the first stream of research). Finally, the third pillar on which SNA was built is the work of social anthropologists at Manchester University. They examined the structure of ‘community relations in tribal and village societies’, looking into the issues of conflict and power. This group extended former studies by systematizing the use of SNA. They introduced the notion of ‘total’ vs. ‘partial’ networks (Barnes, 1954, p. 43), and measures such as the ‘density’ of a network (Mitchell, 1969). Density is one of the important measures to characterize a network. It measures the (relative) number of linkages between its nodes and gives an overall idea of how densely connected a network is. The centrality of a specific node (here: developer) characterizes how central or peripheral that node is, giving indications of its importance (Freeman, 1979).
Different operationalizations of how to measure node centrality exist. Frequently, these are based either on the degree (the number of connections of a node) or on its betweenness (Freeman, 1977), which indicates whether a node lies ‘between’ many other nodes or is peripheral. Based on the differences in the centrality of all nodes, the centralization indicates whether the network as a whole is very centralized (star shaped) or rather decentralized.
The importance of dense linkages had always been understood. However, in 1974, Granovetter, a prominent figure in the area of SNA, published his classic contribution “Getting a job” (Granovetter, 1974). In it, he elaborates on the ‘strength of weak ties’, which truly provide new information from unknown sources, while too densely linked network connections provide mostly redundant, already known information (see also Granovetter, 1973).
One of the problems in a network analysis is usually the identification and specification of the network boundaries (Laumann, Marsden, and Prensky, 1992), as most of the time only ‘partial networks’ can be studied. Fortunately, this is no issue when applying the analysis to open source CVS data, as its boundaries are clearly defined and the analysis is able to cover the ‘total network’.
Social Network Analysis has become an important method in sociology and innovation research in recent years. Borgatti and Foster (2003) show the exponential growth of research on social networks (see Fig. 2.1). Coulon (2005) provides a comprehensive literature review on the use of SNA in previous and current innovation research.
Figure 2.1: Publications indexed by “Sociological Abstracts” containing social network in the abstract or title (source: Borgatti and Foster, 2003)
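The degree-based centrality and the network centralization introduced earlier in this section can be sketched as follows. This is only an illustration under stated assumptions: the network is undirected and given as a symmetric 0/1 adjacency matrix, and centralization is computed with the common degree-based normalization (sum of differences to the highest degree, divided by (n-1)(n-2)), which is one of several possible operationalizations. Betweenness is not covered here, and the function names are hypothetical.

```python
def degrees(adjacency):
    """Degree of each node: the number of ties it has."""
    return [sum(row) for row in adjacency]


def degree_centralization(adjacency):
    """Degree-based centralization of an undirected network.

    Returns 1.0 for a perfect star and 0.0 when all nodes have the
    same degree. Assumes a symmetric 0/1 matrix with a zero diagonal.
    """
    n = len(adjacency)
    if n < 3:
        return 0.0  # the normalization is undefined for fewer than 3 nodes
    deg = degrees(adjacency)
    highest = max(deg)
    return sum(highest - d for d in deg) / ((n - 1) * (n - 2))


# Hypothetical star: node 0 is tied to everyone, nobody else is connected.
star = [
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
# Hypothetical ring: every node has exactly two ties.
ring = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
]
star_centralization = degree_centralization(star)
ring_centralization = degree_centralization(ring)
```

The star yields a centralization of 1.0 and the ring 0.0, matching the intuition of a very centralized versus a decentralized network.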
2.3 Research on organization in OS projects
This section reviews literature concerned with a part of the second category of research, processes; namely the social structure and organization of open source projects. Although this study proceeds in an inductive manner, a brief overview of current research is needed in order to relate this research to the ongoing efforts in this area.
The study of social structure in OS software projects has been approached by researchers from various disciplines with a plethora of methods. One of the early attempts was made by Koch and Schneider (2002), who examined the CVS code modifications of the GNOME desktop project in a case study. They examined the number of lines of code added per developer (as a measure of project success). They found that on average each file was modified by only 1.8 distinct developers.
Reputation plays an important role concerning the social structure of communities. Von Krogh et al. (2003) find that a reward in the form of peer reputation plays an important role in overcoming the costs of contributing to a project. Stewart (2004) is also concerned with reputation in a developer community: he examines the role of tenure in establishing social status and finds that the social status is “frozen” after a while, forcing members to act quickly if they want to achieve a high social status.
Ferraro and O’Mahony (2003) examine the role of face-to-face communication of developers (namely from the GNU/Linux Debian project) and the resulting social network. They find that the more developers a participant had met in ‘the real world’, the more likely he was to vote in a leadership election. Also, the more a developer participated in on-line discussions, the more likely he was to be elected in such a leadership election.
Interestingly enough, economists are also looking into the social structure of open source projects.
One would expect that economists are most puzzled about the contribution of developers without direct monetary benefits. However, e.g. Dalle et al. (2004) state that as interesting as research on the motivation of developers might be, “the modes of organization, governance and performance of F/LOSS development – viewed as a collective distributed mode of production” should also be looked into. Open source thus seems to puzzle them beyond the motivation of developers.
Sagers, McLure Wasko, and Dickey (2004) propose to build upon a theory of network governance, which was introduced by Jones, Hesterly, and Borgatti (1997). Their model uses project success as a dependent variable, using a composite measure from various success dimensions. Coordination is but one independent variable in their model; their survey contains questions pertaining to two aspects of coordination: expertise location, and administrative coordination, i.e. managing of tangible and economic resources. However, no empirical results are presented in their working paper yet. Given that many of the constructs they use have not been defined in the area of OS projects, it will probably take some more effort to verify their model. Their notion of “expertise location” builds upon a study by Faraj and Sproull (2000), who performed a cross-sectional examination of 69 (closed-source) software development teams. They found that the expertise coordination processes (described by them as “recognizing where expertise is needed”, “knowing where expertise is located”, and “bringing expertise to bear”) led to an increased team performance.
Some have been looking into the apparently self-organizing nature of open source projects. Van Wendel de Joode, De Bruijn, and Van Eten (2002) examined how self-organizing developer communities deal with various property regimes.
Benkler (2002) creates a theory of “Commons-based peer production”, using transaction cost economics, where people elect to work on the tasks they are best at.
However, as already mentioned in the introduction, the examination of self-organization has traditionally been approached mostly by natural scientists, namely from biology (Haken, 1977; Kauffman, 1995; Coffey, 1998). It was also biologists, Maturana and Varela, who created autopoiesis⁷ as a theory in order to explain social systems (Whitaker, 1995). The theory has since been used, e.g. to explain the creation of organizational knowledge (von Krogh and Roos, 1995). Although self-coordination of social systems has been studied in the context of autopoietic theory (see Whitaker, 1995), apparently no researcher has so far tackled open source communities as an example of a social system from this point of view. As this research focuses neither on a system dynamics point of view nor on self-maintaining aspects of the community, autopoiesis will not be discussed further.
Although many publications mention that OS projects are self-organized, there is still a lack of research focusing on this area. This research looks into the areas of code that developers elect to work on, thus explaining a part of self-coordination.
⁷ Autopoiesis (literally self-production) is the process whereby an organization produces itself. An autopoietic organization is an autonomous and self-maintaining unity which contains component-producing processes. The components, through their interaction, generate recursively the same network of processes which produced them. An autopoietic system is operationally closed and structurally state determined with no apparent inputs and outputs. (Heylighen, 2005)
2.3.1 SNA in open source
Some of the literature is explicitly concerned with the use of Social Network Analysis in open source projects.
One of the first to propose the use of SNA in this area was the sociologist Wellman who gave “a sociologists’ perspective on the use of social network analysis of computer networks” (Wellman, 1996). In her Master’s Thesis, te Meerman (2003) applies SNA methodology to open source projects using interview data in order to examine the “ITIL” concept⁸. Unfortunately, her results help little to observe the community structure, or a project’s organization.
So far, I know of only a few studies which apply Social Network Analysis to open source projects using empirical evidence besides interviews. Madey et al. (2002) apply SNA to the whole population of projects hosted on Sourceforge. Generally, they observe that both the project size (in terms of developers) and the number of developers per project obey power laws (confirmed by the findings of Healy and Schussman, 2003). Performing a very high-level analysis looking at inter-project participation, they construct a network of developers who are connected if they work on the same project. In the resulting network, they observe one cluster covering 25% of all developers, and some much smaller clusters. The paper does not examine the organization of single projects and does not perform a deeper analysis besides the construction of these networks.
One of the relevant studies in this area is that of Crowston and Howison (2004). They examine 120 projects hosted on Sourceforge. Applying Social Network Analysis to the bug database activities and looking at communication centralization, they find that some projects are very centralized while others are very decentralized. They conclude that projects do not automatically gain all advantages that are usually ascribed to OS projects by “going open”.
Two working papers, both published in June 2004, are highly relevant to this research as they apply SNA methods to CVS data.
The first paper (González-Barahona, López, and Robles, 2004) presents a representation of the Apache project (http://apache.org) based on CVS data. The goal of this paper is to present the connections between CVS modules⁹ for very large ‘libre’ projects. Unfortunately, the available version of the working paper seems to be at a rather early stage. It gives visualizations of the resulting network, but does not present any further results or data analysis. This type of research examines a very high level of coordination: the relationship between CVS modules might be an interesting topic in itself, but as few projects use more than one module it will mostly be interesting for very large, already modular projects. It is, however, a first attempt to present a dynamic view of a software architecture.
⁸ ITIL stands for IT Infrastructure Library and is described by te Meerman as a collection of best practice approaches to organize IT projects.
The second working paper by López, González-Barahona, and Robles (2004) is the first publication to propose applying SNA research methods to CVS source code¹⁰ and makes a first attempt at the operationalization of some SNA measures. They propose creating a “committer network” and a “module network” (this research will indeed create and focus on a committer network of all projects; see methodology section). They provide the graphs for the degree and the clustering coefficient for three sample cases in their working paper. Unfortunately, no further information on their data is given and no deeper analysis is performed. This research leans on and draws from their operationalization where appropriate (see the methodology section for more information).
Summarizing, it can be said that researchers have recognized the usefulness of SNA to analyze open source projects on many levels.
So far, many publications are rather short working papers or conference presentations, which (although representing a lot of tedious work) do not go very deep in their analysis. Besides Crowston and Howison (2004) and López et al. (2004), which proved to be the most relevant publications for this study, there seem to be few attempts to use SNA to analyze the inner workings of a project community.
⁹ A CVS code repository for a large project can be divided into separately maintained modules. Note that this is not typically the case for all projects, e.g. all projects in this research’s sample maintain their source code in only one module. In order to create separate CVS modules, the software architecture needs to be designed in a very modular fashion.
¹⁰ Although the decision to use this approach in this dissertation was made independently from their paper and precedes its publication date.
3 Empirical Section
3.1 Methodology
3.1.1 Sample Selection
The number of projects in the sample should be large enough to allow the comparison of statistics (i.e. the number of projects should be around 30¹). On the other hand, it should be small enough to allow an understanding of each of the selected projects. As many projects are special in that tools are inconsistently used, old data is dumped into the source code repositories, etc. (Howison and Crowston, 2004), it was deemed necessary to get a general understanding of how a project was founded, what it accomplishes, and to get a basic insight into how the project is organized.
¹ A sample of 30 cases is usually deemed sufficient to be able to apply the central limit theorem if necessary, thus deriving a single normal distribution (or under some circumstances a Lévy stable distribution) from the summed (independent) variables. (Feller, 1966; Hall, 1982)
In addition to the specific inclusion criteria for projects as listed further down, some general considerations were made in order to create a sample as representative as possible. They fall into two categories: 1) maximizing the variety of projects and 2) avoiding the common pitfalls as described in the above critique (Section 2.1.1).
1. Maximizing the variety of projects with respect to their:
• Type of application: Instead of focusing on a specific type of application, such as servers, desktop applications, or functionality-providing libraries, it was attempted to include as many types of applications as possible. The final sample includes text processors, games, servers, libraries, ...
2. The following pitfalls should be avoided:
• As described, many studies have focused on the most prominent projects (especially Apache, Mozilla, and the Linux kernel). I have already elaborated why the choice of these projects, which have been put extensively in the spotlight of media and research and are heavily sponsored by companies, is not appropriate from my point of view. Although coordination of prominent company-guided or company-initiated projects could be interesting to examine as a special case, no such project was included in this sample (although some projects of the final sample are well known to open source connoisseurs).
• The second danger is the sole use of projects which are hosted by only one organization. In order to achieve a good and valid sample, this research draws its sample from projects hosted on Sourceforge, Savannah, Berlios, and from those providing their own infrastructure.
• The third main criticism is the danger of inconsistently used tools across projects. This research does not rely on consistency of mailing lists, bug databases, etc. By using only projects which provided a full CVS log history and by getting familiar with each of the projects in advance (e.g.
to ensure that no gatekeepers were being used, and that the CVS code repository was used as the primary means of code administration), it was assured that the retrieved CVS log data was indeed comparable across projects.
The specific necessary inclusion criteria for a project were:
Code availability: The source code of the project needed to be available on-line. Moreover, all source code modifications needed to be archived together with the date, file names, and author. As this information was to be extracted in an automated manner, only those projects with a CVS source code repository were considered for inclusion, to ensure a consistent treatment of all sample projects.
Age: This research does not perform a longitudinal analysis. However, coordination does not show in a snapshot of the data, as obviously not all developers modify files at the same time. It was therefore deemed necessary to use projects which had already been active over some time, in order to show coordination and mutual modifications of other developers’ files. An arbitrary cut-off point had to be chosen, for there is no experience yet of how long it takes for “typical” patterns to evolve. It was decided that only projects which had been active for at least one year should be taken into consideration.
Size: In order to examine coordination, a project obviously needed to have more than one developer working on it. Other considerations regarding suitable project sizes were already given above.
Gatekeepers: Some large projects have divided up their software architecture into certain areas. Those areas are ‘guarded’ by formally announced (e.g. through an entry in a MAINTAINER list) gatekeepers or maintainers. It is their task to collect patches, evaluate them, and to commit those that are deemed worthwhile. While this helps to coordinate the work in very large projects, it distorts the CVS logs.
Those marked as authors of a CVS modification are not necessarily those who wrote the patches.² Care was therefore taken not to choose projects with formal maintainers of areas of code.
The final outcome was a sample of 29 open source projects. The accumulated number of developers is 1,150, with an average of 39.7 per project (median of 19). The total sum of observed file modifications is 740,677. An overview of the selected sample is given in Table 3.1.
² It happens in every project that some patches are contributed by non-developers, e.g. through the developers’ mailing list, and finally committed by a developer with the permission to modify the source code repository. However, this cannot be detected without an in-depth manual scrutiny of the mailing list and code modification comments.
No conscious effort was made in advance of the sample selection to include only projects of a certain size (apart from the inclusion criterion of more than one developer). The final sample contains mostly projects in the range of approximately 5 to 40 developers. It was seen as beneficial to have projects of different sizes in the final sample: it would help to show differences between small and large projects. On the other hand, their sizes should not be too different: Social Network measures can be difficult to compare if the networks are of very different sizes (Scott, 1991). Secondly, only very few projects are larger than this (Madey et al., 2002; Healy and Schussman, 2003); these belong mostly to the very prominent cases, which were avoided (as explained above).
3.1.2 Analysis
After the sample of projects had been identified, the source code modification logs were retrieved from the corresponding code repositories. This includes all modifications from the instantiation of the code repository until the end of the recorded time frame (which is denoted in the respective project presentations).
Using scripts written in the Perl language, the log entries were stored in a local database. This database served as the source for further analysis: a number of tools written in PHP were used, e.g., to extract descriptive data about the projects, such as the number of committing developers. Additional information, such as the programming language and project description, was taken either from the public project websites or extracted from the source code.
Table 3.1: Sample overview
Project | Description
Abiword | WinWord-like word processor
Adonthell | Role Playing Game
AWstats | Powerful and featureful server logfile analyzer that shows you all your Web/Mail/FTP statistics
bison | General-purpose parser generator that converts a grammar description to a C program to parse that grammar.
BZflag | OpenSource OpenGL Multiplayer Multiplatform Battle Zone capture the Flag. 3D first person Tank Simulation.
CDex | CD-Ripper, thus extracting digital audio data from an Audio CD.
emacs | Extensible, customizable, self-documenting real-time display editor.
eMule | Filesharing client which is based on the eDonkey2000 network but offers more features.
Flightgear | A Flight simulator
Freenet | Platform for anonymous file distribution and retrieval
Gnomemeeting | Voice/Video conferencing application
Gnunet | A framework for secure peer-to-peer networking that does not use any centralized or otherwise trusted services. Allows anonymous censorship-resistant file-sharing.
GTK+ | Graphical ToolKit
Irate | Internet radio with user taste correlation
LAME | mp3 encoder.
mailman | Mailing list administration
mnet | Kind of Freenet written in Python, heritage from the MojoNation project
nano | GNU nano is a clone of the Pico text editor.
Ogle | DVD player capable of menus and navigation.
OpenSSL | provides encryption.
pango | The goal of the Pango project is to provide an open-source framework for the layout and rendering of internationalized text. Pango is an offshoot of the GTK+ and GNOME projects, and the initial focus is operation in those environments, however there is nothing fundamentally GTK+ or GNOME specific about Pango.
phpMyAdmin | handles the administration of MySQL databases over the Web.
PostgreSQL | Object-relational database management system (ORDBMS) based on POSTGRES, Version 4.2, developed at the University of California (Berkeley).
Smarty | Php template engine. Smarty cleanly separates your presentation elements (HTML, CSS, etc.) from your application code.
Stepmania | StepMania is a music/rhythm game.
tdb | A “Trivial Database”
TikiWiki | A Wiki and Content management system
wget | Retrieves files using HTTP, HTTPS and FTP
xerces | Xerces (named after the Xerces Blue butterfly) provides XML parsing and generation.
XFCE4 | A Window Manager (desktop) for Linux
Developers column (values in extracted order; the alignment with the rows above is not recoverable): 63, 13, 3, 18, 32, 177, 8, 32, 98, 15, 271, 18, 26, 24, 14, 11, 8, 19, 28, 11, 44, 7, 89, 6, 27, 22, 46
The CVS log data also served as the data source for constructing the incidence matrix used in the SNA (this matrix lists all authors/files and the number of modifications) and, subsequently, the adjacency matrix, which represents an undirected and unweighted network of developers. López et al. (2004) propose to use all modifications a developer conducts in a certain directory (representing, in their view, a module in the software architecture). Although the equivalence of a directory to a software module might be valid, it is an assumption about software architecture in general, which is not being made in this study. Using directories would also lead to a loss of information, as the data provides modification information about each file. It has therefore been decided to use a modification of a specific file as incident. This results in a finer grained analysis at the disadvantage of not being able to identify the relationship between software modules (i.e. directories).
As this research is not concerned with module connections, this was deemed the better choice. A preliminary examination of the data showed that most projects feature a few common files which are modified by nearly all developers, e.g. a change log, build scripts, or help files. A cut-off point of ten common files was chosen to account for these. This means that a tie between two nodes (developers) in this network signifies that both have modified at least ten of the same files. Using cut-off points is a frequent and accepted procedure in SNA (Scott, 1991).
The “number of modifications” of a developer as used throughout this work is calculated as the sum of all modifications on all files. This means that it can differ from the number of CVS commits a developer performs (one CVS commit can contain modifications to multiple files).
An incidence matrix can be used to create two different adjacency matrices (Scott, 1991; López et al., 2004); however, for the purpose of this research only one of them makes sense: the one that shows which authors modified the same files. The resulting adjacency matrix contains all developers (on both dimensions of the matrix), with a 1 indicating a connection between any two developers and a 0 indicating no connection. This adjacency matrix is what López et al. (2004) call a “committer network” and helps to identify connections between developers. Constructing the second adjacency matrix would lead to a network where nodes represent files which are more or less connected. This matrix could e.g. be used to create a functional architecture of a software project. This analysis was not pursued further, as the goal of this work is to study the relationship between developers and not files.
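The construction described above can be sketched in a few lines. This is not the Perl/PHP tooling used in the study but a minimal illustration under stated assumptions: each log entry is reduced to a hypothetical `(author, file)` pair with one pair per file modification, and the function names, file names, and developer names are invented for the example.

```python
from collections import Counter
from itertools import combinations


def modifications_per_author(records):
    """Number of modifications per developer (summed over all files)."""
    return Counter(author for author, _ in records)


def committer_network(records, min_shared_files=10):
    """Build the 0/1 adjacency matrix of developers.

    Two developers are tied if they have modified at least
    `min_shared_files` of the same files (the cut-off point).
    """
    files_by_author = {}
    for author, path in records:
        files_by_author.setdefault(author, set()).add(path)
    authors = sorted(files_by_author)
    index = {author: i for i, author in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    for a, b in combinations(authors, 2):
        shared = files_by_author[a] & files_by_author[b]
        if len(shared) >= min_shared_files:
            matrix[index[a]][index[b]] = 1
            matrix[index[b]][index[a]] = 1  # undirected network
    return authors, matrix


# Hypothetical log: alice and bob both work on ten source files,
# carol only touches one of them; everybody edits the ChangeLog.
records = [("alice", f"src/f{i}.c") for i in range(10)]
records += [("bob", f"src/f{i}.c") for i in range(10)]
records += [("alice", "src/f0.c"), ("carol", "src/f0.c")]
records += [(name, "ChangeLog") for name in ("alice", "bob", "carol")]

authors, matrix = committer_network(records)
counts = modifications_per_author(records)
```

Here alice and bob share eleven files and are therefore tied, while carol shares only two files with each of them and remains isolated, which is the effect the cut-off point is meant to have for commonly edited files such as a change log.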
Once the adjacency matrices for all projects were created, the statistical software R, together with an extension library for performing Social Network Analysis, was used to assist the data analysis.⁴ Besides the number of modifications for each developer, measures from SNA that characterize a network, such as its density, inclusiveness, and the degree of its nodes (i.e. developers), were derived. The next section presents the sample projects and gives a brief description of each.
⁴ R is the open source equivalent of the commercial package S and can be obtained from http://r-project.org
3.2 Sample
Abiword
Abiword is part of a larger project known as AbiSource, which was originally started by the SourceGear Corporation. The goal of the project was the development of a cross-platform, open source office suite, beginning with AbiWord, the project’s word processor. SourceGear released the source code to AbiWord and a developer community quickly formed around the project. SourceGear has since stopped working on the project, but the Abiword development was continued independently.
AbiWord runs on multiple platforms: Windows, Linux, QNX, FreeBSD and Solaris. It is able to read and write industry standard document types, such as OpenOffice.org documents, Microsoft Word documents, WordPerfect documents, Rich Text Format documents, or HTML web pages. While the graphical interface leans toward the WinWord text processing application, the main Abiword program is very small and requires comparatively few resources to run, allowing Abiword to be used on systems that are not considered "State of the Art" anymore. A variety of plugins can be used to extend AbiWord’s functionality; they range from Document Importers to a Thesaurus, Image Importers, and a Text Summarizer (Abisource.com, 2004).
Although the Abiword project can also be found and its source code downloaded from Sourceforge.net, the project provides its own infrastructure for web sites and source code repository (the Internet bandwidth is currently provided through the University of Twente). Abiword was released under the GNU GPL license. The Abiword source code repository was recorded from July 16, 1998 until April 23, 2004 and contains 52,060 modifications in 5,265 files conducted by 63 authors.
Adonthell
The Adonthell project is dedicated to the development and implementation of a role playing game and was started in summer 1999 by Alexandre Courbot, James Nash, and Kai Sterker. More specifically it creates “a graphics engine [. . . ], a set of tools and an actual, playable game driven by that engine and built with those tools.” Its website emphasizes the fact that it is free (→libre) software. The developers claim that they “were united by the vague idea of creating a free role playing game for Linux and without any experience in the field of software engineering when they started the project.” As a governance principle it claims “autonomy in choice and execution of the actual work, since there was no dedicated project manager” (Adonthell, 2004).
While the project’s main homepage can be found under http://adonthell.linuxgames.com, the source code repository is hosted on the Savannah platform (provided by the GNU project). True to the spirit of free software, the project releases the software under the GNU GPL license. The recorded CVS time frame ranges from January 5, 2000 until February 9, 2004. The Adonthell source code repository contains 9,485 modifications in 1,349 files conducted by 13 authors.
AWStats
AWStats is a feature rich tool that generates advanced web, ftp or mail server statistics graphically. It takes the log files of those servers (various file formats are supported), analyzes them and prepares a report that includes statistics and graphs.
AWStats has been programmed to work on very big files; it can also be called automatically to update statistics at a certain interval (AWStats, 2004).
The project was founded on October 29, 2000 on Sourceforge.net and uses its facilities for website, mailing lists, and source code repository. AWStats is licensed under the GNU GPL. The recorded source code modifications range from November 4, 2000 until January 17, 2005 and comprise 7,460 modifications in 1,310 files conducted by only 3 authors (thus being the smallest project of the sample in terms of developers).
bison
Bison is described on its web page as: “a general-purpose parser generator that converts a grammar description for an LALR context-free grammar into a C program to parse that grammar. Once you are proficient with Bison, you can use it to develop a wide range of language parsers, from those used in simple desk calculators to complex programming languages.” (bison, 2004b)
This statement, rather cryptic for a software layman, makes it clear that the tool is intended for software developers themselves; it is not an application intended for ‘end users’. The program basically translates logical statements written in a specific syntax into a language that computers can understand and execute. The name bison is inspired by the application for which it is intended as an alternative: yacc (which stands for “Yet Another Compiler Compiler”).
The bison project uses savannah.gnu.org to administer its mailing lists and source code. As it is part of the “GNU effort” it comes under the GNU GPL license. Being part of GNU, it is also morally motivated: the bison manual states that “Software should be free” (bison, 2004a). The CVS modifications range from the first commit on December 16, 1987 until January 17, 2005. The repository contains 10,607 modifications in 407 files conducted by 18 authors.
BZflag
The name BZFlag stands for Battle Zone capture Flag.
It is a multi-player, multi-platform 3D tank battle game, running on Irix, Linux, *BSD, Windows, Mac OS X and other platforms. The game features complex and sophisticated graphics. Tanks drive in complex worlds which follow the laws of physics, including e.g. a weather system (rain, snow, frogs). The game can be played against others over the Internet; it is based on a client/server architecture (both client and server are included in the examined source code repository). BZFlag is a comparatively popular project: according to its website it is the third game hosted on Sourceforge to reach 1 million downloads.
BZFlag uses the infrastructure provided by Sourceforge.net, although it has a website of its own (http://bzflag.org). It was registered on March 4, 2000 on Sourceforge, which featured the project as “Sourceforge Project of the Month” in April 2004. The observed source code modifications range from March 5, 2000 until January 20, 2005. The BZFlag source code is published under the GNU LGPL license and its repository contains 32,100 modifications in 2,164 files conducted by 32 authors.
CDex
CDex is a CD-ripper for the Windows platform. That is, it extracts digital audio data from an Audio CD to a wav file and can subsequently convert it to many audio formats, such as mp2, mp3, vqf, aac, and ogg (CDex, 2004).
The first code was written in 1998 (according to the copyright information in the source code’s ‘README’ file) and made available through the website http://www.cdex.n3.net. In December 1999, the project was transferred to Sourceforge, and it has taken advantage of Sourceforge’s infrastructure since then. The recorded time frame of CVS activities spans from December 5, 1999 until February 10, 2004. The source code is released under the GNU GPL license.
The Sourceforge project summary page lists ten registered developers; of these, however, only four authors performed the 10,106 modifications in 2,346 files.[5]

[5] This “discrepancy” is evidence of how unsuitable and unreliable Sourceforge summary data are as a basis for empirical research on OS projects.

emacs

The Emacs website characterizes its product as “the extensible, customizable, self-documenting real-time display editor.” (Emacs, 2004) Among free software enthusiasts this program is one of the two applications of choice (emacs or vi) for doing everything text-related. Emacs is used to write articles, letters, and books.[6] It is also capable of sending e-mails, and includes many features for use as an editor for programming in various programming languages. It includes its own programming language, lisp, which enables the customization of emacs to extreme levels.

Emacs was originally created by the founder and philosophical leader of the free software movement, Richard M. Stallman (RMS), and he still contributes to its development today. Emacs was one of the very first applications programmed by ‘RMS’ for the GNU operating system. He began with emacs, the text editor used to write programs, and gcc, a compiler to translate the source code into computer binaries, as foundations for his new system. In line with his belief that all software should be free, Stallman released Emacs under the GNU General Public License (GPL).

Hosted on savannah.gnu.org, emacs is the dinosaur of the project sample: its observed CVS modifications range from April 18, 1985 to January 19, 2005. Measured by the number of developers, emacs is the second-largest project in the sample: 117 contributors performed more than 100,113 modifications on 3,175 files.

FlightGear

The FlightGear flight simulator project is a multi-platform, cooperative flight simulator development project.
The goal of the project is to create a sophisticated flight simulator framework for use in research or academic environments, for the development and pursuit of other interesting flight simulation ideas, and as an end-user application. Over 20,000 real world airports are included in the full scenery set, including correct runway markings and placement, correct runway and approach lighting. The simulator allows many different plane types. Currently, the 1903 Wright Flyer, strange flapping wing "ornithopters", a Boeing 747, Airbus A320, various military jets, and several light singles can be used. FlightGear has the ability to model those aircraft and just about everything in between. A number of networking options allow FlightGear to communicate with other instances of FlightGear, GPS receivers, external flight dynamics modules, external autopilot or control modules. (flightgear, 2004)

[6] Trivia: As a matter of fact, a large proportion of this dissertation has been typed in emacs.

It should be noted that the recorded CVS activity range, from September 10, 2002 until February 10, 2004, does not mark the very beginning of the project. Earlier versions of the software (available as compressed files through the FlightGear FTP server) show a creation date of May 23, 1997 for the oldest file (the GNU GPL license), but were not captured in the current CVS source code repository. FlightGear maintains its own, independent infrastructure for website, mailing lists, and source code repository. The current, observed repository contains 5,406 modifications in 1,108 files conducted by eight authors.

Freenet

Freenet is software which enables the anonymous publication and retrieval of information on the Internet without the ability to control or censor this information. To achieve this, the network is a completely decentralized peer-to-peer network. Both publishers and consumers of information remain anonymous.
All communications between Freenet nodes are encrypted and are routed through other nodes to make it extremely difficult to determine who is requesting the information and what its content is. (Freenet, 2004)

The Freenet project is based on a Master’s thesis published by Ian Clarke (1999). The first code commits happened at the beginning of 2000; however, the code was transferred to a new module in 2002. Therefore, the observed CVS time frame stretches from January 4, 2002 to November 28, 2004. Since 2003, one of the developers, Matthew Toseland (CVS name: toad), has been paid through donations to work full time on the project. A non-profit organization, Freenet Inc., has been incorporated in California in order to give developers legal protection from third parties. Freenet is licensed under the GNU GPL. It uses its own infrastructure for the website and developer mailing lists, but takes advantage of the source code repository offered by Sourceforge. The current, observed source code repository contains 14,118 modifications in 1,231 files conducted by 36 authors.

Gnomemeeting

Gnomemeeting is an H.323[7] compatible videoconferencing and VoIP telephony[8] application that allows users to make audio and video calls to remote users with H.323 hardware or software (such as Microsoft Netmeeting). (Gnomemeeting, 2004)

Gnomemeeting itself relies on an underlying library to provide some low-level functionality. It uses openh323, whose development is coordinated by Quicknet Technologies Inc. and which is itself released under the Mozilla Public License (MPL), a license which resembles the liberal GNU LGPL.
As the MPL is not compatible with the GNU General Public License (GPL and non-GPL code may not be mixed under all conditions), the Gnomemeeting project leader grants users that “as a special exception, you have permission to link or otherwise combine this program with the programs OpenH323 and Pwlib, and distribute the combination, without applying the requirements of the GNU GPL to the OpenH323 program, as long as you do follow the requirements of the GNU GPL for all the rest of the software thus combined.”

The project was founded by Damien Sandras, who remains a dominant figure in its development to this day ( 32 of all modifications). Gnomemeeting hosts its website independently; its mailing lists and source code repository are provided by the Gnome project (http://gnome.org), with which it forms a loose relationship. The Gnomemeeting source code repository contains 10,684 modifications in 557 files conducted by 98 authors between August 20, 2001 and January 17, 2005.

[7] “H.323 is the name given to a set of communications protocols used by programs such as Microsoft NetMeeting and equipment such as Cisco Routers to transmit and receive audio and video information over the Internet. It was developed by the ITU (http://www.itu.int), an international standards body for telecommunications.” (http://openh323.org)
[8] VoIP: Voice over Internet Protocol

Gnunet

GNUnet is a framework for secure peer-to-peer networking that does not use any centralized or otherwise trusted services. A first service implemented on top of the networking layer allows anonymous censorship-resistant file-sharing. GNUnet uses a simple, excess-based economic model to allocate resources. This means that peers in GNUnet monitor each other’s behavior with respect to resource usage; peers that contribute to the network are rewarded with better service than peers that only download a lot of content. (GNUnet, 2004)

The project was founded in 2001 by Christian Grothoff.
A design draft was published as a research paper in 2002 (Grothoff, Patrascu, Bennett, Stef, and Horozov, 2002). The project was inspired by, among others, the Freenet project, with which it shares many of its goals. One of its major criticisms is that Freenet is written in the Java programming language, which implies a significant overhead in terms of memory requirements and processor power. GNUnet is written in the C programming language, which is commonly considered more efficient.

The GNUnet project website, mailing list and source code repository were first hosted at Ovm (http://ovmj.org). Ovm, which employed Christian Grothoff as a graduate student, is a DARPA funded collaborative effort between Purdue University, SUNY Oswego, University of Maryland, and DLTech “to develop an open source framework for building programming language runtime systems.” The project moved recently (after the end of the recorded code modifications) to a new, independently hosted home (http://gnunet.org), but states on its website that it is part of the GNU project. It is licensed under the GNU GPL. The recorded GNUnet source code repository contains 12,111 modifications in 962 files between June 20, 2001 and November 21, 2003, conducted by 15 authors.

GTK+

GTK+ is a multi-platform toolkit for creating graphical user interfaces. GTK+ was initially developed for and used by the GIMP, the GNU Image Manipulation Program. Therefore, it is named “The GIMP Toolkit”, so that the origins of the project are remembered. Today GTK+ is used by a large number of applications (e.g. Abiword which is part of the research sample), and is the toolkit used by the GNU project’s GNOME desktop. (GTK+, 2004)

It was originally started by Peter Mattis, with help from Spencer Kimball and Josh MacDonald, as GTK. When it was significantly improved (introducing object oriented “widgets”), it was renamed to GTK+.
GTK+ itself takes advantage of three libraries developed by the GTK+ team:

GLib is the low-level core library that forms the basis of GTK+ and GNOME. It provides data structure handling for C, portability wrappers, and interfaces for such runtime functionality as an event loop, threads, dynamic loading, and an object system.

Pango is a library for layout and rendering of text, with an emphasis on internationalization. It forms the core of text and font handling for GTK+-2.0. Pango has been included in the research sample as well.

The ATK library provides a set of interfaces for accessibility (support for users with some form of disability, e.g. blind or color-blind people). By supporting the ATK interfaces, an application or toolkit can be used with such tools as screen readers, magnifiers, and alternative input devices.

The GTK+ project hosts its website (http://www.gtk.org) independently; its developer mailing lists and the source code repository are provided through the GNOME project infrastructure. It is licensed under the LGPL license. The observed gtk+ source code repository, which ranges from January 1, 1997 to May 17, 2004, contains 76,666 modifications in 2,701 files conducted by 271 authors and is the largest project in the sample.

Figure 3.1: The iRate architecture

iRate

iRate radio is a collaborative filtering system for music. The user rates the music tracks it downloads and the server uses the ratings and other people’s ratings to guess what you’ll like. The tracks are downloaded from websites which allow free and legal downloads of their music (see Figure 3.1 for a graphical overview of the system). (iRate, 2004)

The iRate system allows a user to discover, download and listen to music which he likes and would otherwise probably not get to know. It is not intended as a file-sharing application, but rather to interest users in artists they like and whose music they might like to buy.
According to the iRate developers, it basically does what a good radio station should be doing. It uses the correlation of ratings from users to recommend tracks. iRate is a relatively young project: it is completely hosted by Sourceforge, where it was registered in March 2003 by project founder Anthony Jones. It is released under the GNU GPL. The recorded time span ranges from March 26, 2003 to August 8, 2004. The iRate source code repository contains 1,986 modifications in 363 files conducted by 18 authors.

LAME

LAME originally stood for LAME Ain’t an Mp3 Encoder. It started its life as a patch against the dist10 ISO demonstration source.[9] However, the mp3 format was still licensed through the Fraunhofer Institute, which disallowed the modification of the demonstration source code. In September 1998, Fraunhofer stopped various efforts to improve their freely available source code, and the 8hz effort (basically a successor to LAME) and a number of other free encoders based on ISO sources were aborted. Mike Cheng found that “That sucked” and, in September 1998, released a patch to the ISO code which was incapable of producing an mp3 stream or even of being compiled by itself, an act which could not be prevented by the Fraunhofer Institute. In May 2000, the last remnants of the ISO source code were replaced, and the LAME source code now provides a full MP3 encoder. The developers state that “LAME is an educational tool to be used for learning about MP3 encoding. The goal of the LAME project is to use the open source model to improve the psycho acoustics, noise shaping and speed of MP3.” LAME indeed achieved more than simply providing a clone of a program with a restrictive license and an initially somewhat dubious legal status. Mark Taylor, current maintainer of the LAME v3.x code base, developed and implemented the GPsycho “psycho acoustic and noise shaping model”.
This model is, they claim, vastly superior to the reference model provided by the Fraunhofer Institute, the inventors of the mp3 format, itself. (LAME, 2004)

LAME was originally hosted on its own infrastructure, but moved over to take advantage of the facilities provided by Sourceforge in November 1999. Therefore, the recorded time frame from November 24, 1999 until November 28, 2003 does not include the very beginnings of the project. It is licensed under the GNU LGPL. The LAME source code repository contains 10,312 modifications in 509 files conducted by 26 authors.

[9] mpg is an ISO standard; the dist10 code was distributed by the ISO committee to provide a working demonstration of the format.

Mailman

Mailman is free software for managing electronic mail discussion and e-newsletter lists. Mailman is integrated with the web, making it easy for users to manage their accounts and for list owners to administer their lists. Mailman supports builtin archiving, automatic bounce processing, content filtering, digest delivery, spam filters, and more. It is written in the Python programming language, with a little bit of C code for security. (Mailman, 2004)

The system was first presented in 1998 at a conference (Viega, Warsaw, and Manheimer, 1998). One of its founders, Barry Warsaw, continues to be Mailman’s lead developer today. His contributions have been partially sponsored by third parties. The website mentions, e.g. “Control.com for their sponsorship of new Mailman 2.1 features such as the topic filters, external membership sources, and "virtual" mailing lists”. Additionally, Barry Warsaw seems to have been granted time for development of Mailman by his (previous) employer Zope Corporation (which develops an open source content management system).

It is not so easy to identify the providers of the Mailman infrastructure: the software itself can be downloaded from Sourceforge, the GNU project, and its own website, List.Org.
While the Mailman website is hosted independently, its developer mailing lists are provided through the platform of the Python programming language, python.org. Although the project started in early 1998 (the first recorded CVS commit is on January 6, 1998), it did not register itself on Sourceforge until November 8, 1999, in order to take advantage of its source code repository. The software claims to be part of the GNU project, and is accordingly released under the GPL license. The examined Mailman source code repository, ranging from January 6, 1998 to November 28, 2003, contains 13,695 modifications in 1,873 files conducted by 24 authors.

mnet

Mnet is a distributed file store. The website describes a distributed file store as “a shared virtual space into which you can put, and from which you can get, files.” To achieve their goal, Mnet forms an emergent network without a central server. While many potential applications can take advantage of an underlying mnet, the first application that has been written for the Mnet project is a file-sharing application for files of all kinds. (Mnet, 2004)

The project has been inspired by many of the same goals as the Freenet project. However, mnet is written in the Python programming language. It provides binaries for both unix-like and Windows platforms. In June 2004 (well after the end of the examined time frame) the project moved away from the Sourceforge infrastructure to a new, independently hosted server at http://mnetproject.org, and began using a new source code versioning system, DARCS. During the examined time frame the project relied on the infrastructure provided by Sourceforge, on which it was registered as a project in January 2002. The observed CVS code changes range from January 29, 2002 to November 28, 2003.
The mnet source code repository contains 7,204 modifications in 930 files conducted by 14 authors and is distributed under the GNU LGPL license (a license similar to the GNU GPL, yet allowing closed source products to take advantage of the code through interacting with it).

nano

GNU nano is a very small and relatively simple to use text editor. It was designed to be a free replacement for the Pico text editor, which is part of the Pine email suite from The University of Washington but does not come with a GPL compatible license. It aims to “emulate Pico as closely as possible and perhaps include extra functionality”. (Nano, 2004)

It features some convenience functionality such as an interactive replace function, a spell checker, or auto-indent support. While the project has a website of its own (http://nano-editor.org), both mailing list and source code administration happen through the savannah.gnu.org platform. The software comes under the GNU General Public License. The observed nano source code repository, ranging from June 6, 2000 to January 19, 2005, contains 7,976 modifications in 191 files conducted by 11 authors.

Ogle

Ogle is a “DVD player for the Solaris, Linux and BSD environments. The first open source DVD player to support DVD menus!” (Ogle, 2004) Its website states that “Ogle is developed by a few students at Chalmers University of Technology”, not mentioning an explicit lead developer or project leader (a fact which showed itself later in the project sociogram; more on this in the project analysis section). The project website is hosted by “Chalmers University of Technology” (Sweden), but the developer mailing list and source code repository are provided by http://berlios.de (a platform similar to Sourceforge, dedicated to providing infrastructure for OS projects). The player is licensed under the GNU GPL.
The examined code from the ogle source code repository comes from the “core module” called ogle and does not include a separate ogle_gui module.[10] It ranges from January 20, 2000 to January 1, 2005 and contains 3,310 modifications in 273 files conducted by 8 authors.

[10] Some modularly organized projects divide their project into CVS modules, which form separate entities that are able to work together.

OpenSSL

“The OpenSSL Project is a collaborative effort to develop a robust, commercial-grade, full-featured, and Open Source toolkit implementing the Secure Sockets Layer (SSL v2/v3) and Transport Layer Security (TLS v1) protocols as well as a full-strength general purpose cryptography library.” (OpenSSL, 2004)

The project’s source code is originally based on the SSLeay library developed by Eric A. Young and Tim J. Hudson. Due to the licensing restrictions of SSLeay, OpenSSL cannot be published under the GNU GPL, but is licensed under an Apache-style license, which is a very liberal license and basically means that one is free to use it for commercial and non-commercial purposes. It is so liberal that it is incompatible with the GNU GPL requirements, meaning that OpenSSL code cannot be included in GPL’d projects, which often causes complaints from users. According to its website, the official start of the OpenSSL project was on December 3, 1998. The observed time span of code modifications ranges from December 21, 1998 until May 4, 2004. The OpenSSL project provides its own infrastructure for website, mailing lists and source code administration. The source code repository contains 44,116 modifications in 3,238 files conducted by 12 authors.

pango

“The goal of the Pango project is to provide an open-source framework for the layout and rendering of internationalized text. Pango is an offshoot of the GTK+ and GNOME projects, and the initial focus is operation in those environments, however there is nothing fundamentally GTK+ or GNOME specific about Pango.
Pango uses Unicode for all of its encoding, and will eventually support output in all the world’s major languages.” The name is composed of Greek “Pan” (Παν) = All and Japanese “Go” (語) = Language. So, “properly” written, it looks like Παν語 (adapted from pango, 2004).

The pango library, which is not intended for direct use by end users but as an underlying foundation to be used by applications, is released under the GNU LGPL, just as its ‘relative’ the GTK+ project. The project’s mailing lists and source code repository are provided through the infrastructure of the GNOME project. Its source code repository contains 6,715 modifications in 383 files conducted by 46 authors in the time frame from January 1, 1997 until May 6, 2004.

phpMyAdmin

phpMyAdmin is a tool written in the interpreted PHP language intended to handle the administration of MySQL databases[11] over the Web. It can create and drop databases, create/drop/alter tables, delete/edit/add fields, or execute any other SQL[12] statement. (phpMyAdmin, 2004)

phpMyAdmin runs on any platform that runs the PHP language interpreter and a web server, which basically means that it can run everywhere. The phpMyAdmin changelog begins on September 9, 1998 with a “First internally used version”. It was transferred to Sourceforge in March 2001, where it was featured as “Sourceforge project of the month” in December 2002. The GNU GPL licensed software relies on mailing lists and a source code repository provided by Sourceforge; it hosts its website independently. The phpMyAdmin source code repository was recorded over a time frame from May 3, 2001 to January 20, 2005 and contains 37,385 modifications in 1,072 files conducted by 19 authors.

PostgreSQL

“PostgreSQL is a highly-scalable, SQL compliant, open source object-relational database management system.
With more than 15 years of development history, it is quickly becoming the de facto database for enterprise level open source solutions.” (PostgreSQL, 2004)

The database now known as PostgreSQL is derived from the POSTGRES package written at the University of California at Berkeley. The POSTGRES project, led by Professor Michael Stonebraker, was sponsored by the Defense Advanced Research Projects Agency (DARPA), the Army Research Office (ARO), the National Science Foundation (NSF), and ESL, Inc. Its implementation began in 1986. In 1994, Andrew Yu and Jolly Chen added a SQL language interpreter to POSTGRES, Version 4.2. The result was subsequently released to the web under a new name, Postgres95.

[11] MySQL is a popular open source database coordinated by the Swedish company MySQL AB.
[12] SQL: Structured Query Language is used to perform commands on a database.

In 1996 the name "Postgres95" was deemed unsuitable and the project was renamed to PostgreSQL, to reflect the relationship between the original POSTGRES and the more recent versions with SQL capability. Today PostgreSQL runs on almost every brand of Unix (the website claims that it runs on 34 different platforms with the latest stable release), and includes native Windows compatibility since version 8.0.

PostgreSQL is a somewhat unique project in that it was not founded and is not run by a single person (or organization), but is run by a “Steering Committee” which currently consists of six persons. It is licensed under the BSD license.[13] All development infrastructure is provided independently by the project itself. The PostgreSQL source code repository, which ranges from July 9, 1996 to January 19, 2005, contains 95,221 modifications in 5,644 files conducted by 28 authors.
Smarty

Smarty is known as a “Template Engine”, but would be more accurately described as a “Template/Presentation Framework.” That is, it provides the programmer and template designer with a wealth of tools to automate tasks commonly dealt with at the presentation layer of an application. (Smarty, 2004)

Smarty, started by Monte Ohrt and Andrei Zmievski, is a way to separate the content of websites from their presentation and layout. It is not intended to be used directly by end users, but provides functionality to other programs, or could be used by website administrators. The TikiWiki Content Management System (also included in the sample) is an example of such a program which takes advantage of the Smarty functionality. Although the name "Smarty" and the logo are trademarks of New Digital Group, Inc., the software itself comes under the liberal GNU LGPL license. The project itself is hosted by the php.net project. Website, mailing lists, and source code reside there. The observed smarty source code repository begins on August 8, 2000 and ends on June 8, 2004, containing 5,881 modifications in 1,361 files conducted by 11 authors.

[13] BSD stands for Berkeley Software Distribution and is the name of a free Unix derivative developed at the University of California at Berkeley. The BSD license is a very liberal license and allows basically every use of the software.

Stepmania

“StepMania is a music/rhythm game. The player presses different buttons in time to the music and to note patterns that scroll across the screen. Features 3D graphics, visualizations, support for gamepads/dance pads, a step recording mode, and more!” (Stepmania, 2004)

StepMania runs on Windows and Linux computers as well as on modified Xboxes. This rather crazy game[14] comes under the very liberal MIT License, which basically allows every use of the source code.
Although the website is hosted independently, the project takes advantage of the mailing list and source code repository on Sourceforge, where the project was registered in October 2001. The observed time frame ranges from November 3, 2001 until May 13, 2004, during which 48,357 modifications in 12,725 files were conducted by 44 authors.

[14] If dancing to the rhythm of music on a “dance pad” on the floor can be considered gaming.

tdb

TDB is a “Trivial Database”. In concept, it is very much like GDBM[15] except that it allows multiple simultaneous writers and uses locking internally to keep writers from trampling on each other. TDB is also extremely small. (tdb, 2004)

[15] The GNU Database Manager

The project was registered in August 2000; it is licensed under the GNU GPL. Its source code modifications between August 14, 2000 and April 9, 2002 were observed for this research. tdb relies on facilities provided by Sourceforge for its infrastructure. Although the Sourceforge summary page lists 10 developers for this project, the tdb source code repository contains 295 modifications in 55 files conducted by only seven authors, making it the smallest project in terms of the number of modifications.

TikiWiki

“The Tiki CMS/Groupware, also known as TikiWiki, is a powerful web-based Groupware and Content Management System (CMS). It can be used to create all sorts of Web applications, Sites, Portals, Intranets and Extranets”. (TikiWiki.org, 2004)

According to its developers, the TikiWiki project aimed from the beginning to be an open, high-growth project and to embrace as many developers as possible. The following quote from an interview with a lead developer emphasizes this when talking about future challenges: “Keeping the balance between number of features and quality of features. We decided to start adding as many features as we could, as fast as possible, so more and more users can help us to refine the features.
We will start to be focused on making features better, more often than adding new features (expand first, conquer later).” (SourceForge.net, 2003)

The TikiWiki project hosts its own website independently, using its own code for it. Mailing lists, bug database, and source code repository are provided through Sourceforge, where the project was registered in October 2002. The software is licensed under the GNU LGPL; the observed time frame ranges from October 8, 2002 to March 28, 2004. During this time, 32,260 modifications in 5,202 files were conducted by 89 authors in the TikiWiki source code repository.

wget

GNU wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc. (wget, 2004)

The website of this command line program (programmed as part of the GNU effort, but also ported to the Windows platform) is hosted through gnu.org. A source code repository is provided by http://sunsite.dk.[16] The wget source code repository of this GPL’d software contains 4,865 modifications in 207 files conducted by six authors between December 2, 1999 and January 1, 2005.

[16] “Sunsite.dk is a completely non-commercial project, powered by sponsored hardware, and driven by volunteer workforce. The goal of Sunsite.dk is to help power the development of Open Source Software in the world. Sunsite.dk is currently part of the Sun initiated project known as SunSITE.” (http://sunsite.dk)

xerces

The “Xerces Java Parser” (xerces, 2004) (named after the Xerces Blue butterfly) is written in Java and provides “XML parsing and generation.”[17] (xerces, 2004) The project resides under the umbrella of the Apache foundation and is licensed under the liberal Apache Software License (which is not compatible with the GNU GPL). Xerces provides “hooks” in order to be usable for different programming languages, such as C++ or Perl. These reside in separate CVS modules and have not been included in the analysis. The core “xerces” module contains the code of the software which is written in Java; the observed time frame ranges from November 9, 1999 to June 6, 2004. Its source code repository contains 18,304 modifications in 2,276 files conducted by 27 authors.

[17] XML (Extensible Markup Language) is a file format for the transfer and storage of structured information.

XFCE4

“Xfce is a lightweight desktop environment for unix-like operating systems. It aims to be fast and lightweight, while still being visually appealing and easy to use. [. . . ] Another priority of Xfce 4 is adherence to standards, specifically those defined at freedesktop.org. Xfce 4 can be installed on several UNIX platforms. It is known to compile on Linux, NetBSD, FreeBSD, Solaris, Cygwin and MacOS X, on x86, PPC, Sparc, Alpha...” (Xfce, 2004)

The project, which is released under the GNU GPL license, was started and coordinated by Olivier Fourdan. It hosts its web presence and its CVS code repository independently on http://xfce.org (the hosting is donated by a commercial 3rd party). Mailing lists are provided through http://foo-projects.org. The observed time frame for the Xfce4 development spans from February 14, 2001 to May 4, 2004. Its source code repository contains 61,879 modifications in 13,675 files conducted by 22 authors.

3.3 Analysis

This section analyzes the sample projects: it examines the concentration of modifications, the degree of developers (absolute and relative), as well as the density, inclusiveness, and centralization of the resulting networks. Code ownership of source code files is also examined.

3.3.1 Concentration of modifications

This study measures the “amount of work” performed through the number of file modifications per author.
One commit to the CVS source code repository can consist of one or several modified files, so the measure is not necessarily equal to the number of CVS commits. It has been noted in other studies of open source projects that a large proportion of the work is performed by only a small fraction of contributors, usually called “core developers” (Koch and Schneider, 2002; Ghosh and Prakash, 2000; Mockus, Fielding, and Herbsleb, 2000). This is also the case in nearly all of the projects in this sample.

One indicator of the concentration of modifications is the Gini coefficient, introduced by Gini (1912) [18]. However, for the Gini coefficient to be an unbiased estimate independent of the population size, it should be multiplied by $n/(n-1)$ (Dixon, Weiner, Mitchell-Olds, and Woodley, 1987; Mills and Zandvakili, 1997). Note that all measurements in this work that use the Gini coefficient are calculated with this “corrected” version. Applied to the number of modifications per developer, the Gini coefficients for our projects range from 0.57 to 1.00, with a mean of 0.84 and a median of 0.86. These values indicate a high inequality of modifications [19]. Fig. 3.2 shows the distribution of the Gini coefficients for all projects in a histogram.

[18] The Gini coefficient is used to measure the inequality of income in countries, but is also usable as a general indicator for the inequality of values ($v$). Gini $\in [0, 1]$: 0 indicates that $v_i = v_j \;\forall i, j$, and 1 indicates that $v_i \neq 0$ and $v_{j \neq i} = 0$.

[19] To give a feeling for the coefficient: the inequality of incomes in the US hovers around 0.4 (Germany: 0.28) (source: US Bureau of Census).

Figure 3.2: Histogram of modification concentration (gini)

A weak Spearman’s rank correlation of $\rho = 0.19$ indicates that no clear relationship exists between the size of a project (in terms of number of developers) and its concentration of modifications. It is interesting, though, that the four largest projects in our sample all have a Gini coefficient between 0.89 and 0.92. The generally high levels of concentration come as no surprise if the assumption of a small number of core developers who shoulder most of the work holds for projects of every size. It is therefore proposed:

Proposition 1 Modifications are highly concentrated on a few developers, meaning that the largest share of work (i.e. the highest number of modifications) is performed by only a small number of “core developers”.

3.3.2 Degree

The degree (of connection) of a node equals its number of connections to other nodes, i.e. $d_j = \sum_{i=1}^{n} 1 \;\; \forall\, ad_{j,i} > 0$ ($ad_{i,j}$ being the elements of the adjacency matrix). This measure gives some indication of how well a node in the network is connected to other nodes (Freeman, 1979). In a social network of developers based on their file modifications, the degree of a developer is the number of other developers with whom he has modified the same files. This means that the degree can vary between 0 and the number of developers in the project minus one. A developer’s connections can be interpreted based on his degree:

0: The developer has never modified the same files as other developers. There are two possible reasons for this. Either his contribution has been so marginal that it did not reach the cut-off point of ten common files, or the developer works on a highly specialized area of code, being its owner. The latter could, for example, be a sign of a very modular software architecture with divided responsibilities, or of a feature which is being developed by a single author. These developers are the “lone wolves”, not connected to the network.

1: Developers with a degree of one are connected to only one other developer, i.e. they modified the same files as only one other developer. This could, for example, be a team of two developers working on the same feature, or a “lone wolf” whose files got modified by some other developer [20]. Developers belonging to this category can be termed “tamed wolves”.

low value (>1): Contributors in this category modified files which a few others had worked on as well. This could be a few developers sharing certain areas of code, or other forms of collaboration. The borders between a “low value” and a “high value” are somewhat fuzzy.

high value: These are the “generalists” who modify files which a lot of other developers have been working on as well. These well-connected developers are part of the inner network and perform most of the development (Fig. 3.4). Interestingly, it is mostly developers with a relatively high degree who are connected to “tamed wolves”. They seem to take the role of integrators.

[20] As ties in this analysis are undirected, a degree of 1 does not mean that a developer actively modified another developer’s files. It is possible that “his” files were modified by someone else instead.

The distinction between a “low value” and a “high value” degree has not been made for the purposes of this research. Generalized blockmodels (Wasserman, 1994) could be used to create such a specific categorization if needed.

Table 3.2 lists a summary of the degree of developers for each project. As can be seen there, the mean number of developers is 39.7, while the mean of the average degrees is only 5.08, indicating that the modification of other developers’ files is not too common.

Relative degree & network density

In order to compare projects of different sizes, the relative degree is calculated as $\text{rel. degree}_p = \frac{\text{degree}_p}{\#\text{developers}_p - 1}$, being the ratio of realized over possible connections. Averaged over all developers of a project, this measure is also called the density of a network (Scott, 1991, p.74). The mean and maximum relative degree for each project can also be found in Table 3.2. The average value of the network density for all projects is 0.21.
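The degree measures of this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure actually used for the study: the function names, the toy developers and their file sets are hypothetical, and a tie is assumed to exist when two developers share at least ten modified files (the text names a cut-off of ten common files without saying whether the bound is inclusive).

```python
from itertools import combinations

CUTOFF = 10  # assumption: a tie needs at least ten commonly modified files


def build_ties(files_by_dev, cutoff=CUTOFF):
    """Return undirected ties between developers sharing >= cutoff files."""
    ties = set()
    for a, b in combinations(sorted(files_by_dev), 2):
        if len(files_by_dev[a] & files_by_dev[b]) >= cutoff:
            ties.add((a, b))
    return ties


def degrees(files_by_dev, ties):
    """Degree of each developer: number of other developers tied to him."""
    degree = {dev: 0 for dev in files_by_dev}
    for a, b in ties:
        degree[a] += 1
        degree[b] += 1
    return degree


def relative_degrees(degree):
    """Degree divided by the number of possible ties (#developers - 1)."""
    possible = len(degree) - 1
    if possible < 1:
        return {dev: 0.0 for dev in degree}
    return {dev: d / possible for dev, d in degree.items()}


def density(degree):
    """Mean relative degree, i.e. realized over possible connections."""
    rel = relative_degrees(degree)
    return sum(rel.values()) / len(rel) if rel else 0.0


# Hypothetical toy project (names and file sets are invented for illustration).
files_by_dev = {
    "alice": {f"src/f{i}.c" for i in range(0, 20)},
    "bob": {f"src/f{i}.c" for i in range(0, 12)},    # 12 files shared with alice
    "carol": {f"src/f{i}.c" for i in range(5, 17)},  # 12 shared with alice, 7 with bob
    "dave": {f"doc/d{i}.txt" for i in range(5)},     # shares nothing: a "lone wolf"
}
ties = build_ties(files_by_dev)
degree = degrees(files_by_dev, ties)
print(degree)                      # {'alice': 2, 'bob': 1, 'carol': 1, 'dave': 0}
print(round(density(degree), 3))   # 0.333
```

In this toy network alice plays the “generalist”, bob and carol are “tamed wolves” with a degree of one, and dave is a “lone wolf”. Two realized ties out of six possible ones give a density of 1/3.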
Taking the relative degree of all developers across all projects, as illustrated in Figure 3.3, the mean value is only 0.134 (3rd quartile: 0.21). Accordingly, developers are (on average) only connected to 13.4% of their fellow developers. Following the observation of rather low density, it can be proposed that:

Proposition 2 On average, developers maintain connections (i.e. commonly modified files) with only a minority of their co-developers, leading to a low mean relative degree, i.e. a low density of the network.

Figure 3.3: Relative degree over all developers

Relative degree vs. number of modifications

Although the average degree is relatively low, Figure 3.3, depicting the relative degree of all developers across projects, also illustrates that a few developers have very high relative degrees. The Gini coefficient of the relative degrees is 0.73, confirming a high concentration. The average maximum relative degree is 0.52 (which, given that the average inclusiveness (see later) is only 0.59, means that in most cases some developers are connected to nearly every other connected developer).

But who are the few developers with a high degree? Are they moderately active developers whose sole task is to tie code together? Or are they the “core developers” who bear the lion’s share of the development work? It is therefore interesting to examine the relationship between the (relative) degree of a developer and his number of code modifications. Fig. 3.4 suggests a positive relationship between the number of modifications and a developer’s relative degree (although a non-linear one, which is why the y-axis shows log(#modifications)). A Spearman rank correlation of 0.85 confirms the visual clue. Following this argumentation, it can be proposed:

Figure 3.4: Rel. degree vs. log(#modifications)

Proposition 3 The relative degree of developers is highly concentrated, with a few very active developers acting as “generalists” or “integrators”, who are connected to most other developers (apart from the “lone wolves”).

3.3.3 Inclusiveness

Inclusiveness refers to the number of nodes which are included in the connected parts of a network (Scott, 1991). A useful measure of inclusiveness is the number of connected nodes as a proportion of the total number of nodes, i.e. $i = \frac{n_{connected}}{n_{total}}$. As, by definition, the inclusiveness is also 100% minus the percentage of “lone wolves”, it is an important characteristic of an open source project. Table 3.3 gives an overview of the inclusiveness value for each project. Figure 3.5 visualizes the values. It is interesting to note that only 6 projects feature an inclusiveness < 0.5. The values above 0.5 seem to be relatively equally spread, without apparent clusters.

Figure 3.5: Inclusiveness of projects

Figure 3.6: Inclusiveness vs. # developers

Figure 3.7: Dendrogram of projects (using avg. inclusiveness)

What determines the inclusiveness of a project? Fig. 3.6 relates the number of developers to the inclusiveness for each project, in order to identify a possible correlation between those variables. However, the graphic does not give a visible clue about a relationship. A relatively weak Spearman’s rank correlation of -0.14 between the number of developers and the project inclusiveness also indicates that no linear relationship between the two can be found. Project size (in terms of developers) therefore does not seem to affect the inclusiveness. A correlation of -0.52 exists between the Gini coefficient, indicating the concentration of modifications, and the inclusiveness of projects (see Figure 3.10 for a visualization).
This shows that the more the work is concentrated on a small web of core developers, the fewer developers are included in the network. This is intuitively no surprise, as contributors with very little activity do not perform enough work to become a part of the project network. Fig. 3.7 shows a dendrogram, clustering projects according to their inclusiveness. However, no obvious common criterion, besides the concentration of modifications as explained above, seems to determine the inclusiveness of a project. Stating this observation as a proposition, it can be derived:

Proposition 4 The inclusiveness of open source projects differs widely. There are no equilibria through which open source projects could be characterized in general, although inclusiveness is somewhat related to the concentration of modifications.

3.3.4 Centralization

Another measure frequently used in SNA is that of node centrality and, building upon it, the global network centralization. Freeman (1979) provides a generic definition of the centralization of a network $G$ for any centrality measure $C(v)$ as

$C^*(G) = \sum_{i \in V} \left| \max_{v \in V} C(v) - C(i) \right|.$

Two specific operationalizations have been used frequently, one based on the degree of a node, the second based on its betweenness. Both measures have been calculated and are listed in Table 3.3. Centralization is usually measured using only the connected parts of the network (Scott, 1991); otherwise the values are mostly a function of the inclusiveness of a network. Both centralization measures therefore include only developers who are part of the network.

Degree

The average centralization of the network based on the degree of developers is 0.51 (1st quart. 0.4, median 0.53, 3rd quart. 0.66). No relationship between the inclusiveness of a project and its centralization could be found. The centralization of a project could be interpreted as the “dominance” of the core developers over other persons.
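The inclusiveness measure of Section 3.3.3 and the degree-based centralization described here can be sketched as follows. This is a hedged illustration, not the computation used for Table 3.3: the function names and toy networks are hypothetical, and the normalization by $(n-1)(n-2)$, the value the sum takes for a star-shaped network, is the standard Freeman normalization assumed here so that values fall between 0 and 1 (the text gives only the unnormalized sum). Returning NaN for fewer than three connected developers is likewise an assumption.

```python
import math


def inclusiveness(degree):
    """Connected developers as a proportion of all developers."""
    if not degree:
        return math.nan
    connected = sum(1 for d in degree.values() if d > 0)
    return connected / len(degree)


def degree_centralization(degree):
    """Freeman centralization on degree, over connected developers only.

    Sum of the differences to the maximum degree, divided by the largest
    value that sum can take, (n - 1)(n - 2), reached by a star network.
    The normalization is an assumption of this sketch.
    """
    connected = [d for d in degree.values() if d > 0]
    n = len(connected)
    if n < 3:
        return math.nan  # undefined: the normalizing term is zero
    top = max(connected)
    return sum(top - d for d in connected) / ((n - 1) * (n - 2))


# Hypothetical degree tables: a star with one "lone wolf", and a triangle.
star = {"hub": 3, "a": 1, "b": 1, "c": 1, "loner": 0}
clique = {"a": 2, "b": 2, "c": 2}

print(inclusiveness(star))            # 0.8
print(degree_centralization(star))    # 1.0
print(degree_centralization(clique))  # 0.0
```

The star is the fully centralized case (one developer connected to everyone, nobody else connected to each other), while the triangle, where every developer is tied to every other, is fully decentralized.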
Note that, for instance, the ogle project, which was founded by a team of developers, exhibits a centralization of 0.0, while the mailman project, which was founded and is led by a single developer, exhibits a centralization of 0.82.

Betweenness

The betweenness of a node $v$ is given by

$C_B(v) = \sum_{i \neq v \neq j \in V} \frac{\sigma_{ivj}}{\sigma_{ij}},$

where $\sigma_{ivj}$ is the number of shortest paths (geodesics) from $i$ to $j$ through $v$ and $\sigma_{ij}$ is the number of geodesics from $i$ to $j$ (Anthonisse, 1971; Freeman, 1977). Conceptually, high-betweenness nodes lie on a large number of non-redundant shortest paths between other nodes; they can thus be thought of as “bridges” or “boundary spanners.” (Freeman, 1979). The betweenness based centralization is also a value between 0 (decentralized) and 1 (centralized).

Betweenness based centralization gives basically the same interpretation as the centralization based on the degree of developers. However, it does not only look at the degree of a developer in the network, but also takes his position (central vs. peripheral) into account. The values for the single projects are listed in Table 3.3. The range of centralization in our project sample is 0.0-1.0, with a mean of 0.28 (1st quart. 0.11, median 0.21, 3rd quart. 0.39). Relatively low values indicate well connected networks of developers. A higher value can be interpreted as the dominant coordinator role of a single core developer. Following these observations, it can be proposed:

Proposition 5 The centralization (both degree and betweenness based) of a project differs widely between open source projects (0.0-1.0), although the majority of the projects range between 0.12 and 0.39 (betweenness based centralization).

3.3.5 Code ownership

The research question asked about possible “code-ownership” in open source projects. Code ownership would contradict the popular belief that open source is developed in an anarchistic manner, where everybody modifies source code as he sees fit.
In order to explore the aspect of code-ownership it makes sense to look at the number of distinct authors per file. In virtually all programming languages a file is a container for a certain functionality (or, in object oriented languages such as C++ or Java, the container for an object or the methods that an object can perform). Being the ‘owner’ of a file is therefore the software architectural equivalent of being the owner of that functionality or that object. The number of distinct authors therefore gives an idea of how much ownership a developer possesses over a piece of the software architecture. As described in the methodology, only projects without specific ‘gatekeepers’ (a few developers who collect patches from other contributors and commit them to the area they maintain) were included in the sample (see Section Methodology on page 31 for more information).

An alternative measurement to the authors per file would have been the number of distinct authors per directory, as software repositories are often organized in a way where a directory equals a module in the software architecture (López et al., 2004). However, as the empirical data provides per-file information, this fine-grained level of information seemed to be the most accurate. This research does not intend to gather information about the software architecture (where the directory level would indeed have been useful), but looks into the ‘ownership’ of distinct entities, the smallest of which are specific files.

Figure 3.8: Mean distinct authors per file (over all projects)

Figure 3.9: Mean distinct authors per file vs. # developers

As can be seen in the boxplot in Figure 3.8, the average number of distinct authors per file is relatively equal across all projects (mean 2.37, median 2.49). Only the two largest projects (larger by an order of magnitude) deviate from the mean value: emacs 6.362 (177 devs., 76,666 modifications) and gtk+ 4.98 (271 devs., 6,362 modifications).
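The number of distinct authors per file can be derived directly from a list of file modifications. The following is a minimal sketch, assuming the repository log has already been reduced to (author, file) pairs, one per file modification; the function name and the toy log are hypothetical and only illustrate the counting step.

```python
from collections import defaultdict
from statistics import mean, median


def distinct_authors_per_file(modifications):
    """Map each file to the number of distinct authors who modified it.

    `modifications` is an iterable of (author, file) pairs, one pair per
    file modification (a single commit may contribute several pairs).
    """
    authors = defaultdict(set)
    for author, path in modifications:
        authors[path].add(author)
    return {path: len(names) for path, names in authors.items()}


# Hypothetical log: repeated modifications by one author count only once.
log = [
    ("alice", "parser.c"), ("alice", "parser.c"), ("bob", "parser.c"),
    ("alice", "lexer.c"),
    ("carol", "README"), ("carol", "README"),
]
counts = distinct_authors_per_file(log)
print(counts)                   # {'parser.c': 2, 'lexer.c': 1, 'README': 1}
print(median(counts.values()))  # 1
print(round(mean(counts.values()), 3))  # 1.333
```

A file with many modifications but a single distinct author, such as README in the toy log, is the pattern that the text interprets as ‘ownership’.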
It is remarkable that the average number of distinct authors per file is relatively uniform across most projects, although the projects differ significantly in size, age, and number of developers. This observation suggests that code-ownership is indeed an issue: on average only 2.4 developers work on a single file, although the average number of developers is nearly 40. The observation is consistent with earlier interviews with developers, who have talked about things like “that is his code”. It can therefore be proposed:

Proposition 6 Open source projects feature “code ownership”, i.e. most files are only modified by a very small number of developers.

3.4 Characterizing the coordination styles

One goal of this research is to identify variables which determine typical characteristics of open source projects and to find those which would enable the creation of a typology of these projects. This section tries to relate the variables used to each other in order to see whether and how they could be useful to achieve such a characterization.

A “coordination style” depends on the way in which developers work on the source code and, more importantly, on how they interact with each other (through the modification of the same files). This study shows that most variables that characterize a network vary significantly across all samples. It seems that there is no single way to collaborate. This observation is in line with the findings of Crowston and Howison (2004), who looked at the communication structure in bug reports. However, as the variables do not correlate well with each other, there is no single variable which gives a one-dimensional value representing the resulting dependent variable. “Coordination style” is therefore a construct consisting of four dimensions which vary significantly across projects:

Concentration of modifications: As was shown, most work in projects is performed by very few developers.
Typically, the most active developer would conduct between 1/2 and 2/3 of all file modifications. Concentration of file modifications can be measured as the Gini coefficient for all file modifications per author, as listed in Table 3.3. The concentration ranges from 0.58, indicating a comparatively equal number of contributions from all developers (although still very much concentrated), to a value of 1.00, indicating a very high concentration of modifications (basically all work performed by a single developer). The concentration of modifications is a relevant measure as it gives a clue about who does most of the work. It shows how equally the work load is distributed among contributors.

Core developer structure: The project’s (betweenness based) centralization is nearly linearly spread between 0.0 and 0.4, with only four projects having values between 0.6 and 1.0 (the average value is 0.28). The centralization value is able to explain the dominance of one (or a few) core developers: the higher the centralization value, the more “star-shaped” the characterized network; the lower the value, the better all connected developers are connected with all others. Centralization can therefore indicate how dominant (in terms of his degree) the core developer is compared to other connected developers.

Degree of collaboration: It has been shown that the inclusiveness varies across projects (its range in the sample is 0.08 - 1.0). The inclusiveness of a project’s network should be part of the typology as it explains how large the share of developers is who work on files that others modify as well. An inclusiveness of 0.0 would therefore indicate a completely modularized project where each developer specializes in ‘his’ set of files.

Project size: It is obvious that large projects are more difficult to handle than very small projects. The project size should therefore be part of the characterization.
It gives an impression of the complexity of coordination. That coordination complexity increases with size has long been an undisputed argument and was famously stated in 1975 as what has been termed Brooks’s Law: “Adding manpower to a late project makes it later.” [21] (Brooks, 1995) However, it should be noted that the number of developers is not a completely independent variable in itself. Table 3.4 on page 74 displays the summary of a linear regression, explaining the number of developers of a project through its inclusiveness and its (betweenness based) centralization measure. It can be seen that both variables are highly significant, explaining the number of developers to a certain extent (adjusted $R^2 = 0.33$). This is an interesting observation, although no causal relationship can be inferred here. It is also notable that the linear regression works best for projects with fewer than 100 developers. This could be due either to the fact that measures from social network theory can be very tricky to compare if the networks’ sizes are too different (Scott, 1991), or to the fact that huge projects work in a very different manner. The latter possibility deserves more attention in future studies and might be one of the reasons to break large projects down into modules in order to keep networks manageable (Baldwin and Clark, 2000). An increasing level of modularity has indeed been observed in large open source projects by MacCormack et al. (2004).

[21] This assumption has, however, been partially revised by Brooks himself in the 1995 20th Anniversary edition of his book and was also challenged by others (Jones, 2000).

Distinct authors per file did not vary enough to be a useful dimension for the classification of projects (as shown in the previous section). It is an interesting fact to see such a stable value across all projects.
This value is similar to that of other case studies: a low number of distinct authors per file seems to be a consistent property across open source projects.

Figure 3.10: Scatterplots of Centrality, Inclusiveness, and Concentration of modifications

Project        Devs.  Min.  Median    Mean  3rd Qu.   Max.  Rel. mean  Rel. max.
                                                            (density)
Abiword           63     0      28   22.79       37     47       0.37       0.76
adonthell         13     0       2       2        3      6       0.17       0.50
awstats            3     0       0       0        0      0       0.00       0.00
bison             18     0     1.5       3        5      8       0.18       0.47
bzflag            32     0       8    8.25    15.25     21       0.27       0.68
cdex               4     0     0.5     0.5        1      1       0.17       0.33
emacs            177     0       7   17.57       28     95       0.10       0.54
flightgear         8     0       3    3.25        5      6       0.46       0.86
freenet           36     0       1   4.111     8.25     18       0.12       0.51
gnomemeeting      98     0       0  0.4694        0      7       0.00       0.07
gnunet            15     0       1   2.533      4.5      9       0.18       0.64
gtk+             271     0       0   6.421        3     74       0.02       0.27
irate             18     0       0   1.444     2.75      7       0.08       0.41
LAME              26     0       3   4.923      9.5     15       0.20       0.60
mailman           24     0       1     2.5        6     14       0.11       0.61
mnet              14     0     0.5   1.714        3      6       0.13       0.46
nano              11     0       2       2      3.5      5       0.20       0.50
ogle               8     0       5    3.75        5      5       0.54       0.71
openssl           12     0     9.5     8.5       10     10       0.77       0.91
pango             46     0       0   3.043     5.75     16       0.07       0.36
phpmyadmin        19     0       9   6.632       11     12       0.37       0.67
postgresql        28     0     7.5   8.571    12.25     22       0.32       0.81
smarty            11     0       4   2.727        4      6       0.27       0.60
stepmania         44     0       1   3.364     4.25     21       0.08       0.49
tdb                7     0       0  0.5714        1      2       0.10       0.33
TikiWiki          89     0       2    8.27       13     45       0.09       0.51
wget               6     2     3.5   3.667     4.75      5       0.73       1.00
xerces            27     0       8   8.148     13.5     19       0.31       0.73
xfce4             22     0     5.5   6.545       10     15       0.31       0.71
TOTAL (avg)    39.66  0.07    3.91    5.08     7.91  17.83       0.21       0.52

Table 3.2: Degree overview (Min., Median, Mean, 3rd Qu. and Max. refer to the degree of developers)

Project            Devs.    Concentr.   Centralization   Centralization   Inclusiveness
                          (Gini coeff)        (degree)    (betweenness)
Abiword               63          0.75            0.38             0.05            0.78
adonthell             13          0.87            0.53             0.21            0.54
awstats                3          1.00            0.00              NaN            0.00
bison                 18          0.93            0.32             0.07            0.50
bzflag                32          0.81            0.47             0.23            0.69
cdex                   4          1.00             NaN              NaN            0.50
emacs                177          0.89            0.67             0.10            0.55
flightgear             8          0.74            0.53             0.38            0.88
freenet               36          0.91            0.63             0.46            0.53
gnomemeeting          98          0.91            0.24             0.06            0.08
gnunet                15          0.93            0.72             0.64            0.67
gtk+                 271          0.92            0.69             0.17            0.28
irate                 18          0.81            0.71             0.62            0.44
LAME                  26          0.83            0.53             0.24            0.65
mailman               24          0.92            0.82             0.82            0.62
mnet                  14          0.91            0.60             0.43            0.50
nano                  11          0.86            0.40             0.14            0.55
ogle                   8          0.64            0.00             0.00            0.75
openssl               12          0.65            0.09             0.01            0.92
pango                 46          0.89            0.55             0.24            0.37
phpmyadmin            19          0.79            0.23             0.04            0.68
postgresql            28          0.86            0.57             0.26            0.86
smarty                11          0.68            0.43             0.39            0.82
stepmania             44          0.94            0.70             0.36            0.55
tdb                    7          0.58            1.00             1.00            0.43
TikiWiki              89          0.89            0.65             0.16            0.55
wget                   6          0.85            0.40             0.14            1.00
xerces                27          0.72            0.47             0.12            0.78
xfce4            22 (21)*   0.81 (0.80)     0.46 (0.52)      0.20 (0.30)     0.77 (0.76)
TOTAL (avg)                       0.84            0.51             0.28            0.59

* “user” xfce excluded. See Section 3.5.29 on page 131 for more information.

Table 3.3: Modification concentration, Centralization & Inclusiveness

Coefficients:                   Estimate   Std. Error   t value   Pr(>|t|)
Intercept [devs < 100]             82.53        15.27     5.404   1.99e-05
inclusiveness [devs < 100]        -69.64        20.52    -3.394   0.00261 **
centrality [devs < 100]           -35.96        15.97    -2.252   0.03461 *

Significance: 0.001 ‘**’; 0.01 ‘*’
Multiple R-Squared: 0.3881, Adjusted R-squared: 0.3325
Correlation between inclusiveness and centrality: -0.21 (Pearson), -0.29 (Spearman)

Table 3.4: Linear regression #dev. = a ∗ inclusiveness + b ∗ centrality (including all projects with <100 developers)

3.5 Analysis of the Sample Projects

This section provides a characterization of each project in the sample. The four dimensions of the coordination style are used as a basic structure. Also, the sociogram of each of the constructed project networks is presented.

3.5.1 Abiword

                                         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Number of modifications per author        2.0     22.5    204.0    826.3    807.0   8512.0
Number of modified files per author       2.0     18.5    130.0    283.7    388.0   2197.0
Number of modifications per file        0.000    2.000    3.000    9.888    7.000  930.000
Distinct authors per file               0.000    1.000    2.000    3.394    4.000   38.000
Gini coefficient (modifications): 0.7500

Table 3.5: Modification statistics for the Abiword project

Concentration of modifications: The Gini coefficient for the concentration of modifications is 0.75 (see Table 3.5). This is below the average value of 0.84 and remarkably low for a project of this size: most of the larger projects (in terms of the number of developers) tend to have a relatively high Gini coefficient of around 0.9 (Sec. 3.3.1). The most active developer has “only” performed 8,512/52,060 = 16.4% of all modifications, a comparably low value.

Core developer structure: The centralization is only 0.05, a comparably low value, given the average value of 0.28 across all projects. This is an indication that all connected developers are connected with most others, i.e. modify each other’s files. It is remarkable that, although the number of developers is relatively large, the best connected developer is connected with 47 of his peers (the 3rd quartile is still 37 connections). The well-connectedness of the Abiword network is also reflected in a high mean relative degree of the developers of 0.37 (project average: 0.21), which equals the standard measure of density in Social Network Analysis.

Figure 3.11: Abiword sociogram

Degree of collaboration: The inclusiveness of the Abiword project network is 0.78, which is above average. It shows that not many “pure” specialists exist in the project who never touch files besides their own.

Project size: The Abiword source code repository contains 52,060 modifications in 5,265 files conducted by 63 authors.
This means it is a rather large project in terms of the number of developers.

Abiword seems to be a relatively collectively driven effort, an interpretation which is confirmed by an above-average number of distinct authors per file of 3.4. The sociogram of the Abiword network, depicted in Figure 3.11, intuitively confirms the above analysis, showing a large proportion of well connected developers and relatively few unconnected “lone wolves”, although the sociogram is too crowded to identify single connections between developers. Other summary data on the project can be taken from Table 3.5.

3.5.2 Adonthell

                                         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Number of modifications per author        4.0     25.0     53.0    729.6    176.0   4380.0
Number of modified files per author       3.0     10.0     27.0    169.2     72.0    911.0
Number of modifications per file        1.000    2.000    4.000    7.031    7.000  148.000
Distinct authors per file               1.000    1.000    1.000    1.631    2.000    7.000
Gini coefficient (modifications): 0.8651

Table 3.6: Modification statistics for the Adonthell project

Concentration of modifications: The concentration of modifications is relatively high; the Gini coefficient is 0.87 (Table 3.6). This shows that a single developer (in this case the project founder) performs a large part of all modifications: the most active developer conducted 4,380/9,485 = 46% of them. The high concentration also becomes clear when looking at the difference between the mean and median values of developer contributions: while the average is 730, the median value is just 53, indicating a highly skewed distribution.

Core developer structure: The (betweenness based) centralization of the project network is 0.21, still below the average of 0.28. The sociogram in Fig. 3.12 shows that the connected developers do not form a star-shaped network (although two developers are connected to the rest), but form a more or less dense web (network density is 0.17). It seems the two main developers have a somewhat integrating role.

Figure 3.12: Adonthell sociogram

Degree of collaboration: The inclusiveness of the project is 0.54; the sociogram visualizes the six unconnected “lone wolves” in the network. Rather than being very active specialists, these appear to be the less active developers (considering the median value of only 53 modifications).

Project size: The Adonthell source code repository contains 9,485 modifications in 1,349 files conducted by 13 authors. Other summary statistics are listed in Table 3.6.

3.5.3 AWStats

                                         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Number of modifications per author          1        3        5     2487     3730     7454
Number of modified files per author       1.0      2.5      4.0    438.3    657.0   1310.0
Number of modifications per file        1.000    1.000    3.000    5.695    7.000  797.000
Distinct authors per file               1.000    1.000    1.000    1.004    1.000    2.000
Gini coefficient (modifications): 1.0000

Table 3.7: Modification statistics for the AWStats project

Concentration of modifications: The AWStats project proved to be a somewhat unique project. The Gini concentration coefficient is 1.0000, with 7,454/7,460 = 100% of all modifications performed by a single author. Two other developers contributed 5 and 1 file modifications to the project, respectively. So, although not obvious in advance, AWStats is basically a one-man project.

Core developer structure: As this project does not exhibit any connections between developers at all (see the sociogram in Fig. 3.13), it was not possible to calculate a (betweenness based) centralization measure for this project.
Degree of collaboration: The inclusiveness for this project is an all-time low of 0.0, indicating a total lack of connection between the developers. What would indicate a “perfect specialization” of developers in a larger project with a lower concentration of modifications merely indicates here that no other developer modified ten or more of the same files as the main developer.

Project size: The AWStats source code repository contains 7,460 modifications in 1,310 files conducted by 3 authors. Table 3.7 shows the summary data about the number of modifications per file and author.

Figure 3.13: AWStats sociogram

Table 3.7 confirms the finding of a one-man project: although each of the 1,310 files has been modified 5.695 times (on average), the mean number of distinct authors per file is 1.004.

3.5.4 bison

                                         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Number of modifications per author        1.0      8.0     25.0    589.3    160.3   7724.0
Number of modified files per author      1.00     5.00    16.00    53.56    40.50   356.00
Number of modifications per file         1.00     3.00     6.00    26.06    19.00  1501.00
Distinct authors per file               1.000    1.000    2.000    2.369    3.000   12.000
Gini coefficient (modifications): 0.9272

Table 3.8: Modification statistics for the bison project

Concentration of modifications: The bison project exhibits a high concentration of 0.93, well above the average value for all projects of 0.84. The skewness also becomes apparent when comparing the median of 25 and the mean value of 589 modifications. The most active developer performed 7,724/10,607 = 73% of all modifications. The bison project seems to be driven by very few core developers, who perform most of the work.

Core developer structure: Although the main work is performed by only a few, the centralization measure is relatively low (0.07) when compared to the average across all projects (0.28). Although two of the developers, akim and eggert, take a central role, being connected to all other developers in the network, most developers are also well connected with each other.

Degree of collaboration: The inclusiveness of the project network is 0.50; 9 out of 18 developers remain unconnected “lone wolves” (see the sociogram in Fig. 3.14). This is not surprising, as the median value of 25 modifications indicates that the less active half of the developers hardly performed enough modifications to be connected at all.

Project size: The bison source code repository contains 10,607 modifications in 407 files conducted by 18 authors.

Bison seems to have few ‘overseeing developers’, a relatively large number of closely connected, though much less active, contributors, and much less active “lone wolves”. Although its files have, on average, been modified 26.06 times, only 2.369 distinct authors have performed these modifications (Table 3.8), which is very close to the all-project average value.

Figure 3.14: Bison sociogram

3.5.5 BZflag

                                         Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Number of modifications per author        1.0      7.5     49.0   1003.0   1021.0   9841.0
Number of modified files per author       1.0      5.5     30.0    246.8    375.8   1794.0
Number of modifications per file         2.00     4.00     7.00    14.83    15.25  1300.00
Distinct authors per file               1.000    2.000    3.000    3.649    5.000   20.000
Gini coefficient (modifications): 0.8147

Table 3.9: Modification statistics for the BZFlag project

Concentration of modifications: The Gini concentration coefficient (0.81) is just below the average for all projects. The most active developer performed “only” 9,841/32,100 = 31% of all modifications, which, although still high, is not very high compared to other projects.
It seems that although much effort is put in by the lead programmer, others take on a part of the work as well.

Core developer structure: The centralization measure of the BZFlag project is 0.23, also below average, indicating that no developer takes on the role of a benevolent dictator in this network. However, the best connected developer (degree of 21) is connected to every other developer within the network, indicating a coordinating role. The sociogram (Fig. 3.15) visualizes the well-connectedness: 10 developers are not connected at all and three have weak connections, leaving 19 well connected (degree > 4) authors.

Degree of collaboration: The inclusiveness of the network is 0.69, above average. This confirms the above characterization, in which many developers mutually modify files. Many of the unconnected developers simply performed too few modifications (see Table 3.9) to be connected to others.

Figure 3.15: BZFlag sociogram

Project size: The bzflag source code repository contains 32,100 modifications in 2,164 files conducted by 32 authors. Table 3.9 shows a comparatively high mean of 3.649 distinct authors per file (with an average of nearly 15 modifications per file). Together with the relatively dense network, this indicates little “code ownership” and a high level of collaborative effort for this project.

3.5.6 CDex

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   3.00      8.25      10.50     2527.00   2529.00   10080.00
Number of modified files per author:  3.0       6.0       9.0       591.8     594.8     2346.0
Number of modifications per file:     1.000     2.000     2.000     4.308     3.000     165.000
Distinct authors per file:            1.000     1.000     1.000     1.009     1.000     2.000
Gini coefficient (modifications): 0.9973

Table 3.10: Modification statistics for the CDex project

Concentration of modifications: The second smallest project in the sample (4 developers) surprised with a high concentration rate of 1.0. 10,080 out of 10,106 modifications were performed by the same author, “afaber”, who is also the project founder. Obviously, such a high concentration renders many of the other dimensions pointless, as this project is basically a one-man effort.

Core developer structure: With only one connection between two developers (see Figure 3.16), it was not possible (or in any case not useful) to calculate a centralization measure for this network.

Degree of collaboration: The inclusiveness of this project is 0.5, although this measure is not very meaningful for such a small project.

Project size: The cdex source code repository contains 10,106 modifications in 2,346 files conducted by 4 authors. It is licensed under the GNU GPL. Although the project contains many files, these have only been modified 4.3 times on average (see Table 3.10).

It is interesting that the two smallest projects in the sample, with 3 and 4 developers respectively, both basically form a one-person effort. These two are the only ones in the sample to exhibit concentration values of 1.0. It would be interesting to see whether this holds generally for such small projects, or whether there are small projects which truly share their programming efforts more equally.

Figure 3.16: CDex sociogram

3.5.7 emacs

Concentration of modifications: The modification concentration coefficient for emacs is 0.89 (Tab. 3.11), indicating an above-average concentration of modifications. Although 177 developers contributed to the emacs development, the most active developer performed 20,000/100,113 = 20% of all modifications.
Core developer structure: The (betweenness based) centralization is only 0.10, indicating that the developers who are part of the network are relatively well connected. Were the degree based centralization used as a measure, the value of 0.67 would be above the average of 0.51 (Table 3.3 on page 74). This shows that although developers are generally well connected with each other, a few have much higher degrees than the rest. This can be confirmed by the distribution of the degree within the project (Table 3.2 on page 73). It becomes clear that, although more than 50% of the developers are connected (median: 7), only a smaller share is very tightly connected to many developers (3rd quartile: 28, max: 95).

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.0       7.0       37.0      565.6     250.0     20000.0
Number of modified files per author:  1.0       3.0       18.0      114.1     63.0      2273.0
Number of modifications per file:     1.00      4.00      9.00      31.53     24.00     7501.00
Distinct authors per file:            1.000     2.000     5.000     6.362     9.000     103.000
Gini coefficient (modifications): 0.8896

Table 3.11: Modification statistics for the emacs project

It is remarkable that emacs has the highest number of distinct authors per file (6.4) of all sample projects. This might not appear to be extraordinarily high, given its high number of 177 developers. However, most other projects have a mean number of around 2.4 authors. This might be due to the fact that emacs is by far the oldest project in the sample (CVS commits begin in April 1985), thus having had many “generations” of hackers working on its files.

Degree of collaboration: The inclusiveness measure is 0.55, which is close to the average value. A median of 37 modifications suggests that most “lone wolves” were not active enough to become a part of the network.
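The degree based centralization mentioned for emacs can be illustrated with Freeman's degree centralization. The sketch below is an assumption about the form of the measure rather than the exact computation behind Table 3.3; it considers only developers that appear in at least one tie, and the function name and toy networks are hypothetical.

```python
def degree_centralization(ties):
    """Freeman degree centralization of an undirected network.

    `ties` is a collection of unique (a, b) pairs. Only nodes that appear
    in a tie are considered (assumption: unconnected developers are left out).
    Returns 1.0 for a star and 0.0 when every node has the same degree.
    """
    degree = {}
    for a, b in ties:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    n = len(degree)
    if n < 3:
        raise ValueError("centralization needs at least three connected nodes")
    max_degree = max(degree.values())
    # Sum of differences to the best connected node, divided by the
    # maximum possible sum, which a star network of n nodes attains.
    return sum(max_degree - d for d in degree.values()) / ((n - 1) * (n - 2))


# Hypothetical toy networks.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
clique = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
print(degree_centralization(star))    # 1.0
print(degree_centralization(clique))  # 0.0
```

A network without any ties, or with a single tie, raises an error here, which mirrors the cases in which no centralization value is reported.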
Project size: Measured by the number of developers, emacs is the second-largest project in the sample (177 developers performing 100,113 modifications on 3,175 files). This high number of developers makes the emacs sociogram (Fig. 3.17) too crowded to identify single connections between developers.

Figure 3.17: emacs sociogram

3.5.8 FlightGear

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   24.0      86.0      279.0     675.8     680.3     3155.0
Number of modified files per author:  16.0      47.0      79.5      237.3     281.5     973.0
Number of modifications per file:     1.000     2.000     3.000     4.879     5.000     146.000
Distinct authors per file:            1.000     1.000     1.000     1.713     2.000     5.000
Gini coefficient (modifications): 0.7360

Table 3.12: Modification statistics for the Flightgear project

Concentration of modifications: The FlightGear project has a gini concentration coefficient of 0.74, which is relatively low when compared to the other projects. Nevertheless, 58% of all modifications were performed by the most active developer.

Core developer structure: The betweenness based centralization value for FlightGear is 0.38, which is above average. This is mostly due to two central developers, david and curt, who are connected to (nearly) everybody else. Also, the main developer, curt, is connected to a degree-1 developer, mselig, who would otherwise have been unconnected.

Degree of collaboration: The inclusiveness is 0.88; all developers, with the exception of one “developer” aptly named “cvsguest”, are linked with each other, as the sociogram (Fig. 3.18) shows. (cvsguest has performed few modifications; the name hints at the fact that no single person is behind this CVS account.) No strict specialization appears to take place in this project.
Project size: The flightgear source code repository contains 5,406 modifications in 1,108 files conducted by 8 authors. Each file was modified 4.9 times on average, by a mean of 1.7 distinct authors (see Table 3.12). Although the number of distinct authors per file is a mere 1.7, two central developers are connected to nearly everybody else. A high inclusiveness and a comparably high mean degree of 3.25 (median: 5, maximum: 6) show that FlightGear appears to involve a relatively large amount of interaction between its developers.

Figure 3.18: Flightgear sociogram

3.5.9 Freenet

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.0       4.0       25.5      392.2     159.5     8856.0
Number of modified files per author:  1.00      3.00      14.50     84.25     87.50     830.00
Number of modifications per file:     1.00      2.00      4.00      11.47     9.00      744.00
Distinct authors per file:            1.000     1.000     2.000     2.464     3.000     17.000
Gini coefficient (modifications): 0.9063

Table 3.13: Modification statistics for the Freenet project

Concentration of modifications: The Freenet project has a high concentration of 0.91 (Table 3.13); the most active developer has performed 63% of all modifications. The high concentration for this project can partly be explained: since 2003, one of the main developers, Matthew Toseland (CVS name amphibian), has been paid through donations to the project foundation22, working full-time on the project.

22 Freenet set up a non-profit foundation in order to protect their source code and to shield individual developers from lawsuits, as many other open source projects do. See e.g. O’Mahony (2003) on how many open source projects protect themselves and their source code.

Core developer structure: It is not only the proportion of work performed that is determined by amphibian. The network is also comparably centralized (0.46; all-project mean value: 0.28), with amphibian, among other prominent figures, forming a central node. As the sociogram shows (Fig. 3.19), he is connected to four otherwise unconnected developers (“tamed wolves”), emphasizing his role as coordinator and integrator. The project founder, Ian Clarke, named sanity in the sociogram, is also part of the tight network of well-connected core developers.

Degree of collaboration: With an inclusiveness of 0.53, the Freenet project is close to the average inclusiveness of 0.52. The median of contributions is only 25.5, which means that most of the unconnected “lone wolves” have not been active enough to become connected to the network.

Project size: The freenet source code repository contains 14,118 modifications in 1,231 files conducted by 36 authors. The average number of distinct authors per file (2.5) is close to the average across all projects. Other summary data can be seen in Table 3.13.

Figure 3.19: Freenet sociogram

3.5.10 Gnomemeeting

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.00      4.00      9.50      109.00    26.75     7191.00
Number of modified files per author:  1.00      2.00      2.00      13.03     4.00      477.00
Number of modifications per file:     1.00      2.00      4.00      19.18     9.00      1607.00
Distinct authors per file:            1.000     1.000     2.000     2.293     2.000     81.000
Gini coefficient (modifications): 0.9142

Table 3.14: Modification statistics for the Gnomemeeting project

Concentration of modifications: The Gnomemeeting project is highly concentrated in its number of contributions. The gini coefficient is 0.91; this is mostly due to project founder and lead developer Damien Sandras (CVS name dsandras), who performed 7,191/10,684 = 67% of all modifications. The median number of modifications across all 98 developers is only 9.5 (mean value of 109), confirming a high skewness.

Core developer structure: The centralization of the Gnomemeeting network is relatively low: 0.06. The sociogram (Fig. 3.20) shows that the connected part of the developers is indeed very well connected, showing collaboration between these core developers. Low as this value is, it should be noted that the centralization is only concerned with the connected part of the network, which is very small in this case, as the next measure shows.

Degree of collaboration: The inclusiveness of Gnomemeeting is, with the exception of the one-developer effort AWStats, by far the lowest in the sample: 0.08, as e.g. the dendrogram in Figure 3.7 on page 64 visualizes. Only eight out of 98 developers are connected at all. The explanation for this has been described above: a low median of 9.5 modifications shows that most developers in this network have contributed only very little to the project. Possibly the source code administrator, Damien Sandras, readily gave CVS write permissions to people who only wanted to contribute a small bug fix.

Project size: The gnomemeeting source code repository contains 10,684 modifications in 557 files conducted by 98 authors.

Figure 3.20: Gnomemeeting sociogram

3.5.11 GNUnet

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.0       21.5      64.0      807.4     282.0     10040.0
Number of modified files per author:  1.0       9.5       34.0      111.5     95.5      936.0
Number of modifications per file:     1.00      2.00      4.00      12.59     11.00     338.00
Distinct authors per file:            1.000     1.000     1.000     1.738     2.000     8.000
Gini coefficient (modifications): 0.9284

Table 3.15: Modification statistics for the GNUnet project

Concentration of modifications: The gini coefficient of the GNUnet project is 0.93, a relatively high value, representing a strong concentration of modifications. This is mostly due to project founder and leader Christian Grothoff, who performed 83% of all modifications.

Core developer structure: The centralization of the GNUnet network is 0.64, a high value compared to the all-project average of 0.28. A look at the sociogram (Fig. 3.21) clarifies why: grothoff is not only part of the small dense network of extremely well connected core developers, but also ties several “tamed wolves” into the network, resulting in a relatively star-shaped part of the network. It is remarkable that in GNUnet, just as e.g. in the Freenet project, the main developer is the person who connects to all degree-1 developers. It seems that these most active persons perform the integrating work, tying other people’s code into the overall architecture.

Degree of collaboration: The inclusiveness of the network is 0.67, above average. Although the largest amount of work is performed by a single developer, relatively many developers are part of the network. This is possibly because grothoff forms a link to the three “tamed wolves”.

Project size: The gnunet source code repository contains 12,111 modifications in 962 files conducted by 15 authors.

Figure 3.21: GNUnet sociogram

3.5.12 GTK+

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.0       5.5       15.0      282.9     58.5      21530.0
Number of modified files per author:  1.00      3.00      5.00      49.64     14.00     1594.00
Number of modifications per file:     1.00      3.00      6.00      28.38     18.00     6752.00
Distinct authors per file:            1.000     2.000     3.000     4.981     6.000     150.000
Gini coefficient (modifications): 0.9183

Table 3.16: Modification statistics for the GTK+ project

Concentration of modifications: GTK+ is a large project.
271 developers contributed to the project. A large number of developers, however, does not seem to guarantee an equal workload for all of them. The gini coefficient for the number of modifications is 0.92; the most active developer performed 21,530/76,666 = 28% of all modifications.

Core developer structure: The centralization of the network is 0.17, below average. This indicates that the 28% of the developers who are included in the network (see below) are relatively well connected. The degree based centralization measure for this project is 0.69, above the average of 0.51, indicating that the degree of connections in the network is not equally distributed. This is confirmed by Table 3.2, which shows a maximum degree of 74, while the 3rd quartile (covering all 271 developers) is only 3.

Degree of collaboration: The inclusiveness of the GTK+ network is 0.28. Given that the median of the modifications is 15 (Table 3.16), i.e. 50% of the contributors performed fewer than 15 modifications, it becomes clear that most of the large proportion of unconnected “lone wolves” did not contribute enough to become part of the network.

Project size: The gtk+ source code repository contains 76,666 modifications in 2,701 files conducted by 271 authors and is the largest project in the sample. As a consequence, the sociogram (Fig. 3.22) is rather crowded; it is just possible to spot the web of developers surrounded by a large mass of unconnected contributors. It is included for the sake of completeness.

Figure 3.22: GTK+ sociogram

Overall, it can be said that GTK+ is, despite its size, a relatively centrally coordinated project, with a few developers performing most of the work and exhibiting most connections to other developers.
Still, its low centralization measure, indicating a well-connected core, and a very high number of distinct authors per file of 5.0 (the second largest in the sample) show that no strict specialization or code ownership seems to take place in this project.

3.5.13 iRate

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.00      3.00      28.50     110.30    78.75     1054.00
Number of modified files per author:  1.00      2.25      13.50     34.56     33.00     240.00
Number of modifications per file:     1.000     1.000     2.000     5.471     4.000     182.000
Distinct authors per file:            1.000     1.000     1.000     1.713     2.000     11.000
Gini coefficient (modifications): 0.8137

Table 3.17: Modification statistics for the Irate project

Concentration of modifications: The gini coefficient representing the concentration of modifications is 0.81, below the average of 0.84. Still, Anthony Jones (CVS name ajones), project founder and leader, has performed 1,054 out of 1,986 modifications (53%).

Core developer structure: The core developer network is relatively centralized (0.62). This is partly due to ajones tying in all connected developers, including two “tamed wolves” (degree-1 contributors), as can be seen in the sociogram (Fig. 3.23). Again in this project, it is the main developer who connects those degree-1 developers to the network, which hints at an integrating function of this main developer.

Degree of collaboration: The inclusiveness of the network is 0.44. Again, many developers performed too little work to be connected to the network. Some of them seem to perform specialized work, such as modifying translation files for a certain language.

Project size: The iRate source code repository contains 1,986 modifications in 363 files conducted by 18 authors.

Figure 3.23: Irate sociogram
A high centralization and a comparably low inclusiveness of the network help to explain the comparably low number of distinct authors per file of 1.7. A strong coordinating role of ajones and many “lone wolves” contribute to that low value.

3.5.14 LAME

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   2.0       15.0      34.0      396.6     298.0     3264.0
Number of modified files per author:  2.00      8.75      19.50     63.50     74.00     333.00
Number of modifications per file:     1.00      2.00      5.00      20.26     19.00     379.00
Distinct authors per file:            1.000     1.000     3.000     3.244     4.000     18.000
Gini coefficient (modifications): 0.8258

Table 3.18: Modification statistics for the LAME project

Concentration of modifications: The LAME project exhibits a concentration coefficient of 0.83 (Table 3.18), close to the average value. The most active developer performed 32% of all modifications.

Core developer structure: The centralization of the LAME network is 0.24, below average. The sociogram (Fig. 3.24) visualizes the reason: the core developers form a tight, well-connected network, with only a few “tamed wolves”. It is interesting that both “tamed wolves” are connected to very prominent developers: aleidinger, listed on the project page as “primary developer”, and markt, who currently acts as project maintainer.

Degree of collaboration: LAME includes 0.65 of its developers in the network. This is above average; it seems that the remaining 35% contributed too little to become a part of the network (median of 34 modifications).

Project size: The LAME source code repository contains 10,312 modifications in 509 files conducted by 26 authors.

The dense web suggests that LAME is a very collaboratively developed project, without strong areas of specialization or areas of code ownership.
The above-average distinct authors per file measure of 3.2 confirms this assumption.

Figure 3.24: LAME sociogram

3.5.15 Mailman

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.0       18.5      61.5      570.6     151.3     11000.0
Number of modified files per author:  1.00      6.75      21.50     108.40    62.25     1772.00
Number of modifications per file:     1.000     2.000     2.000     7.312     5.000     312.000
Distinct authors per file:            1.000     1.000     1.000     1.389     1.000     9.000
Gini coefficient (modifications): 0.9213

Table 3.19: Modification statistics for the Mailman project

Concentration of modifications: The concentration of modifications is comparably high (0.92). 80% of all modifications were performed by project founder and lead developer Barry Warsaw. The remaining 20% were performed by 23 developers.

Core developer structure: The sociogram in Fig. 3.25 illustrates the high centralization of the developer network of 0.82 (the second largest value across the sample). While a set of 7 developers forms a very tight core network23, one developer from the inner core clearly stands out: bwarsaw, the project lead developer, is connected with many “tamed wolves”. His central position, integrating other people’s code into the main architecture, is clearly visible.

23 One of these seven developers is called mailman. It is possible that this account is not used by a single person, but for e.g. some automated tasks or the import of external code.

Degree of collaboration: The inclusiveness of the network is 0.62, above average. This could be due to the fact that bwarsaw ties many developers into the network who would otherwise have been unconnected.

Project size: The Mailman source code repository contains 13,695 modifications in 1,873 files conducted by 24 authors.

The above dimensions make it clear that mailman is very much controlled by a single developer, although a tight core of well connected developers exists. The high concentration, i.e. 80% of all modifications, and his central position in the network emphasize this. The low distinct authors per file value of just 1.4 is therefore not surprising. Other summary data can be found in Table 3.19.

Figure 3.25: Mailman sociogram

3.5.16 mnet

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   1.00      2.25      14.00     514.60    77.50     4410.00
Number of modified files per author:  1.00      2.25      13.00     114.40    45.25     709.00
Number of modifications per file:     1.000     2.000     3.000     7.746     5.000     477.000
Distinct authors per file:            1.000     1.000     1.000     1.723     2.000     6.000
Gini coefficient (modifications): 0.9099

Table 3.20: Modification statistics for the mnet project

Concentration of modifications: mnet, like many others, is very concentrated with regard to the number of modifications. The gini coefficient is 0.91. The lion’s share of the modifications is distributed among the two most active developers. The most active one alone performed 61% of all modifications.

Core developer structure: The centralization of the network is 0.43, far above average. The sociogram (Fig. 3.26) shows that the best connected developer (who is also the most active) ties in a “tamed wolf” and is connected to all other developers in the network.

Degree of collaboration: A median of only 14 modifications helps to explain the inclusiveness of 0.50. The less active half of the developers did not perform enough work to become a part of the mnet network.

Project size: The mnet source code repository contains 7,204 modifications in 930 files conducted by 14 authors.
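The recurring distinction between unconnected “lone wolves”, degree-1 “tamed wolves” and the developers who tie them into the network can be expressed as a simple classification by degree. The sketch below is only an illustration: the label "connected" for all developers with a degree above 1, the function names and the sample network are hypothetical.

```python
def classify_developers(authors, ties):
    """Label each author by degree: 0 = lone wolf, 1 = tamed wolf, else connected."""
    degree = {author: 0 for author in authors}
    for a, b in ties:
        degree[a] += 1
        degree[b] += 1
    roles = {}
    for author, d in degree.items():
        if d == 0:
            roles[author] = "lone wolf"
        elif d == 1:
            roles[author] = "tamed wolf"
        else:
            roles[author] = "connected"
    return roles


def integrators(ties, roles):
    """Count, per developer, how many tamed wolves they tie into the network."""
    counts = {}
    for a, b in ties:
        for wolf, partner in ((a, b), (b, a)):
            if roles[wolf] == "tamed wolf" and roles[partner] != "tamed wolf":
                counts[partner] = counts.get(partner, 0) + 1
    return counts


# Hypothetical network: one lead developer, two core developers,
# two degree-1 contributors and one unconnected author.
authors = ["lead", "dev1", "dev2", "wolf1", "wolf2", "loner"]
ties = [("lead", "dev1"), ("lead", "dev2"), ("dev1", "dev2"),
        ("lead", "wolf1"), ("lead", "wolf2")]
roles = classify_developers(authors, ties)
print(roles["loner"], "/", roles["wolf1"], "/", roles["lead"])
print(integrators(ties, roles))  # {'lead': 2}
```

A developer with a high count in the second function corresponds to the integrating role attributed to lead developers such as bwarsaw in the text.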
All in all, the project is mostly driven by the three most active developers, who are also the best connected ones, indicating their dominant role. All other developers who are part of the network are not connected to each other, but to these three developers, confirming their importance.

Figure 3.26: mnet sociogram

3.5.17 nano

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   10.0      19.0      33.0      725.1     567.0     4724.0
Number of modified files per author:  3.00      8.50      11.00     43.45     51.50     169.00
Number of modifications per file:     1.00      3.00      8.00      41.76     43.50     1370.00
Distinct authors per file:            1.000     2.000     2.000     2.503     3.000     11.000
Gini coefficient (modifications): 0.8629

Table 3.21: Modification statistics for the nano project

Concentration of modifications: The nano text editor has a gini coefficient of 0.86, close to the average of 0.84. The most active developer performed 59% of all modifications.

Core developer structure: The centralization of the network is 0.14, exactly half the average value. The nano sociogram (Fig. 3.27) shows that while the number of modifications is concentrated, the connections with other developers are relatively equal and decentralized.

Degree of collaboration: The inclusiveness of the network is 0.55; most of the unconnected developers performed too few modifications to become a part of the network.

Project size: The nano source code repository contains 7,976 modifications in 191 files conducted by 11 authors. Each file has been modified 41.8 times on average, which is comparably high, with a mean number of distinct authors of 2.5 (which is approximately the average value across all projects).

The nano project appears to be driven by a few developers performing most of the modifications.
However, when it comes to coordination, no developer appears to take a very dominant role in the project.

Figure 3.27: nano sociogram

3.5.18 Ogle

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   3.0       39.0      215.5     413.8     709.3     1141.0
Number of modified files per author:  2.00      24.50     82.50     84.63     122.80    193.00
Number of modifications per file:     1.00      2.00      6.00      12.12     13.00     142.00
Distinct authors per file:            1.00      1.00      2.00      2.48      3.00      7.00
Gini coefficient (modifications): 0.6408

Table 3.22: Modification statistics for the Ogle project

Concentration of modifications: Ogle has the lowest gini coefficient across the sample, “only” 0.64. Of all eight developers, the most active is responsible for 34% of the modifications.

Core developer structure: The core developer structure of ogle is interesting: besides two totally unconnected “lone wolves”, all developers are connected to all others (see the sociogram in Fig. 3.28), leading to a centralization of 0.0 (for both the betweenness and the degree based centralization measures). This is a unique situation across all projects. It seems that none of the developers has a central role, coordinating or integrating the others’ work. This assumption is backed by the ogle website, which does not list a project founder or leader (as most others do), but states that ogle is “developed by a few students at Chalmers University of Technology” (Sweden).

Degree of collaboration: The inclusiveness of the network is 0.75; besides two unconnected authors, all developers are included.

Project size: The ogle source code repository contains 3,310 modifications in 273 files conducted by 8 authors. Although only eight developers take part in the development, the project has 2.5 distinct authors per file.
The moderate concentration of modifications and the extremely low centralization of the network show that ogle is indeed developed by a team of students, located at the same physical facility, in a collaborative manner.

Figure 3.28: Ogle sociogram

3.5.19 OpenSSL

                                      Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Number of modifications per author:   7.0       348.3     2346.0    3676.0    4684.0    18050.0
Number of modified files per author:  7.0       115.5     694.0     728.8     1184.0    1907.0
Number of modifications per file:     1.00      3.00      4.00      13.62     12.00     1704.00
Distinct authors per file:            1.000     1.000     2.000     2.701     4.000     12.000
Gini coefficient (modifications): 0.6496

Table 3.23: Modification statistics for the OpenSSL project

Concentration of modifications: OpenSSL is in many respects very similar to the ogle project. It is only moderately concentrated; the gini coefficient is 0.65 (all-project average: 0.84). Although the most active developer has performed 41% of all modifications, the comparably small difference between median (2346) and mean value (3676) (Table 3.23) shows that the distribution is not as skewed as most others.

Core developer structure: The structure of the core developers is very interesting. The centralization value is just 0.01; most developers are connected to nearly all others, forming a very dense web of connections.

Degree of collaboration: The inclusiveness of the project is 0.92, the second highest value in the sample. Just one developer remains unconnected (with only 7 file modifications, he did not perform enough work to become part of the network).

Project size: The openssl source code repository contains 44,116 modifications in 3,238 files conducted by 12 authors.

The OpenSSL network exhibits an extremely well-connected and dense network of developers.
No central generalists or integrators are visible in this network, which seems to be a truly collective effort of a community without a strong feeling for code ownership (although “only” 2.7 distinct authors modified each file) or specialization. The reasons for this cannot be derived from this data; OpenSSL would make an interesting case study. It could be speculated that a library which provides encryption only attracts very competent and experienced programmers, thus having a high barrier to entry and keeping away many developers who would contribute little. Perhaps the requirements for being granted CVS write access are very high, as a library that provides encryption to many other programs needs to be reliable and trusted.

Figure 3.29: OpenSSL sociogram

3.5.20 pango

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author           1         4        14       146        58      4001
Number of modified files per author       1.00      4.00      8.00     26.17     29.25    329.00
Number of modifications per file          1.00      2.00      5.00     17.53     13.00   1097.00
Distinct authors per file                1.000     1.000     2.000     3.144     4.000    40.000
Gini coefficient (modifications): 0.8894
Table 3.24: Modification statistics for the pango project

Concentration of modifications: The pango project is, despite its large number of 46 developers, relatively concentrated (0.89). 60% of all modifications are performed by one developer.

Core developer structure: Although the number of modifications is concentrated, the centralization of the core developer network is not very high: 0.24 is below the average value. The inner core indeed seems to be relatively well connected, with only one “tamed wolf” sticking out (sociogram, see Fig. 3.30).

Degree of collaboration: Although the inner core is well-connected, the remainder of the developers are not.
The inclusiveness of 0.37 shows that just about a third of the developers are connected at all. Given the low median of 14 modifications, it becomes apparent that most developers contributed only little to the project.

Project size: The pango source code repository contains 6,715 modifications in 383 files conducted by 46 authors. Other modification data can be taken from Table 3.24.

The pango project seems to be driven mainly by a small web of core developers, who are relatively well connected. On average, 3.1 distinct authors have modified each file. One potential explanation for this is that pango is an underlying foundation for the GTK+ toolkit, which is an essential part of the GNOME desktop. This desktop is actively supported by many companies. For example, core developer Owen Taylor (CVS name owen) is a software engineer employed by RedHat; other employed programmers support the pango development.

Figure 3.30: pango sociogram

3.5.21 phpMyAdmin

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author           2        12        70      1968      1897     11250
Number of modified files per author        2.0       8.5      35.0     204.6     269.5     919.0
Number of modifications per file          1.00      2.00      4.00     34.87     45.75   3978.00
Distinct authors per file                1.000     2.000     3.000     3.627     4.000    18.000
Gini coefficient (modifications): 0.7927
Table 3.25: Modification statistics for the phpMyAdmin project

Concentration of modifications: The Gini coefficient indicating the concentration of phpMyAdmin modifications is 0.79, a little below the average. The most active developer performed 30% of all modifications. However, the large difference between the median number of modifications (70) and the mean (1968) shows that the number of contributions is still relatively skewed.
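The gap between median and mean used above is one sign of skew; the Gini coefficient condenses the same concentration into a single number. The sketch below uses a common rank-based formula for the sample Gini coefficient. Whether the study used exactly this estimator or a bias-corrected variant is an assumption, and the modification counts are invented for illustration.

```python
from statistics import mean, median


def gini(values):
    """Gini coefficient of non-negative values (0 means equal shares)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum: larger values receive larger weights.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Invented modification counts for eight hypothetical authors.
commits = [2, 5, 12, 30, 70, 400, 1900, 11000]

print(f"median = {median(commits)}, mean = {mean(commits):.1f}")
print(f"gini   = {gini(commits):.2f}")
```

With n authors this estimator ranges from 0 (all authors contribute equally) to (n - 1)/n (one author makes every modification), so values for projects with very few authors are not directly comparable to those of large projects.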
Core developer structure: The core developers form a well connected, dense network without a central figure standing out. The betweenness-based centralization is only 0.04.

Degree of collaboration: The inclusiveness of the network is 0.68, above average. Just six “lone wolves” remain unconnected to the rest of the network.

Project size: The phpmyadmin source code repository contains 37,385 modifications in 1,072 files conducted by 19 authors.

The average number of distinct authors per file is 3.6, a comparably large value. Together with the exceptionally low centralization and the relatively high inclusiveness of the dense network depicted in Fig. 3.31, it can be concluded that phpMyAdmin is a project in which most contributions are performed by a few developers, but many active developers do not hesitate to modify other developers’ files. There do not seem to be areas of responsibility or code ownership of files.

Figure 3.31: phpMyAdmin sociogram

3.5.22 PostgreSQL

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author        1.00     97.75    447.00   3401.00   1477.00  41540.00
Number of modified files per author       1.00     26.75    183.00    586.80    427.80   4128.00
Number of modifications per file          1.00      3.00      6.00     16.87     15.00   1447.00
Distinct authors per file                1.000     1.000     2.000     2.911     4.000    13.000
Gini coefficient (modifications): 0.8554
Table 3.26: Modification statistics for the PostgreSQL project

Concentration of modifications: The Gini coefficient indicating the concentration of modifications is 0.86, a little above average. Of all 28 developers, the most active performed 44% of all changes.

Core developer structure: The centralization of the developers’ network is 0.26, a little below average. Looking at the sociogram (Fig. 3.32), a dense network of core developers can be identified, although some developers are less well connected to the “core web”.

Degree of collaboration: The inclusiveness of the PostgreSQL network is relatively high (0.86). Most developers are part of the network.

Project size: The postgresql source code repository contains 95,221 modifications in 5,644 files conducted by 28 authors.

PostgreSQL is a somewhat unique project in that it was not founded by, and is not run by, a single person (or organization); it is run by a “Steering Committee”, currently consisting of six persons. This seems to show in the sociogram (Fig. 3.32): no star-shaped network emerges, but rather a dense web of collaborators. In absolute terms, it has the third highest mean degree (8.6). The number of distinct developers per file is comparably high (2.9). This illustrates the collaborative nature of this project.

Figure 3.32: PostgreSQL sociogram

3.5.23 Smarty

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author        12.0      68.0     295.0     534.6     703.5    2224.0
Number of modified files per author        1.0      19.0      63.0     163.9     158.0     912.0
Number of modifications per file         1.000     1.000     1.000     4.321     2.000   491.000
Distinct authors per file                1.000     1.000     1.000     1.325     1.000     7.000
Gini coefficient (modifications): 0.6759
Table 3.27: Modification statistics for the Smarty project

Concentration of modifications: With a value of only 0.68, the Smarty project has a relatively low concentration of modifications. The top developer performed “only” 38% of all modifications. It seems that Smarty is developed by several developers sharing the burden.

Core developer structure: Although developed by several, the centralization of the network is relatively high (0.39). While five developers are relatively well-connected, several “tamed wolves” are brought into the network.
It seems that some developers are only weakly connected to the network. This could indicate developers who specialize mostly on “their” set of files.

Degree of collaboration: The inclusiveness of the network is relatively high, 0.82. A reason could be the inclusion of many degree-1 developers into the network who would otherwise remain unconnected.

Project size: The smarty source code repository contains 5,881 modifications in 1,361 files conducted by 11 authors.

Although the inclusiveness of the project is relatively high and the concentration of modifications relatively low, indicating a collaborative effort, the high centralization of the network and the very low number of distinct authors per file (1.3) indicate the existence of code ownership for files and a specialization of developers, with some central figures coordinating the project.

Figure 3.33: Smarty sociogram

3.5.24 Stepmania

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author         1.0       6.0      32.0    1099.0     135.5   21120.0
Number of modified files per author        1.0       5.5      17.0     410.0      92.5    7136.0
Number of modifications per file           1.0       1.0       2.0       3.8       2.0     597.0
Distinct authors per file                1.000     1.000     1.000     1.418     2.000    16.000
Gini coefficient (modifications): 0.9394
Table 3.28: Modification statistics for the Stepmania project

Concentration of modifications: The stepmania project has a high concentration of modifications; the Gini coefficient is 0.94. Although the most active developer performed “only” 44% of all modifications, most of the activity is performed by very few developers. The large difference between the median (32) and the mean number of modifications (1,099) demonstrates how skewed the distribution is.

Core developer structure: The centralization of the network is 0.36, a relatively high value. A look at the sociogram (Fig. 3.34) confirms that, although a small web of well connected developers exists, two developers, gmaynard and chrisdanford, stand out with a high degree. These two are also connected to many “tamed wolves”, tying their work into the network. It seems that these degree-1 developers focus on only some files. The low mean value of 1.4 distinct authors per file is consistent with this.

Degree of collaboration: The Stepmania project has an inclusiveness of 0.55. Given that many of the less active contributors performed only few modifications, this is not too surprising.

Project size: The stepmania source code repository contains 48,357 modifications in 12,725 files conducted by 44 authors.

Stepmania seems to be driven by two coordinating main developers, with a small network of well connected core developers. Many weakly connected developers seem to indicate a specialization of developers on few files. Given the very high number of files, with only 3.8 modifications per file, this could very well be the case.

Figure 3.34: Stepmania sociogram

3.5.25 tdb

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author        3.00      5.50     42.00     42.14     67.50    104.00
Number of modified files per author       1.00      5.50     11.00     14.71     20.00     40.00
Number of modifications per file         1.000     1.000     2.000     5.364     5.000    66.000
Distinct authors per file                1.000     1.000     1.000     1.873     2.000     6.000
Gini coefficient (modifications): 0.5763
Table 3.29: Modification statistics for the tdb project

Concentration of modifications: The tdb project is remarkable in that the concentration of modifications is the lowest across all projects; the Gini coefficient is only 0.58. It is also the only project in which the mean (42.14) and the median (42) of the number of contributions are practically equal, i.e. the distribution is hardly skewed.
The most active developer performed 104 out of the 295 modifications.

Core developer structure: The project is rather small, with only three connected developers in the network, ben being in the center (see sociogram in Fig. 3.35). This leads to a unique centralization value of 1.0.

Degree of collaboration: The inclusiveness of this network is 0.43. Given that the developers performed a total of only 295 modifications in 55 files, inclusiveness might not be very meaningful, as the cut-off point of ten common files might be hard to reach.

Project size: The tdb source code repository contains 295 modifications in 55 files conducted by 7 authors.

It is interesting that, although the project is very small in terms of the number of modifications and the number of files, each file still has 1.9 distinct authors on average. It seems that a value of around 2.0-2.5 is some natural equilibrium for most open source projects (at least those with fewer than 100 developers).

Figure 3.35: tdb sociogram

3.5.26 TikiWiki

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author         1.0       6.0      24.0     362.5     103.0    7110.0
Number of modified files per author        1.0       4.0      14.0     160.9      58.0    3160.0
Number of modifications per file           1.0     1.000     3.000     6.201     6.000   546.000
Distinct authors per file                1.000     1.000     2.000     2.753     3.000    42.000
Gini coefficient (modifications): 0.8865
Table 3.30: Modification statistics for the TikiWiki project

Concentration of modifications: The TikiWiki project is relatively concentrated; the Gini coefficient is 0.89. The most active developer performed 22% of all modifications.

Core developer structure: The centralization of the project network is relatively low, having a value of 0.16. This means that although a large proportion of modifications is concentrated on a few developers, no “central hub” exists in the middle of the network. The degrees are not equally distributed among developers, though. The degree-based centralization measure is 0.65, above the average of 0.51; the best connected developer has ties to 45 of his peers.

Degree of collaboration: Given the median of 24 modifications among the 89 developers, it is not surprising that the inclusiveness of the network is 0.55. Many developers did not perform enough work to become connected to the network. The sociogram (Fig. 3.36) shows the large number of unconnected “lone wolves” surrounding the crowded cluster of connected developers.

Project size: The TikiWiki source code repository contains 32,260 modifications in 5,202 files conducted by 89 authors.

The ’grow fast, make stable later’ philosophy of the project, as described in the interview quote in the project description on page 55, explains why this project with 89 developers has only 6.2 modifications per file on average. Although the number of modifications per file is low, the mean number of distinct authors is above average (2.8), showing that no strict file ownership exists in TikiWiki. All in all, it seems that coordination mostly takes place through a small number of well connected developers.

Figure 3.36: TikiWiki sociogram

3.5.27 wget

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author        25.0      55.5     147.0     810.8     699.8    3637.0
Number of modified files per author      19.00     25.00     59.50     77.17    104.50    192.00
Number of modifications per file           1.0       4.0       8.0      23.5      24.5     705.0
Distinct authors per file                1.000     1.000     2.000     2.237     3.000     6.000
Gini coefficient (modifications): 0.8484
Table 3.31: Modification statistics for the wget project

Concentration of modifications: The Gini coefficient indicating the concentration of modifications is 0.85, just above average.

Core developer structure: While the amount of work is highly concentrated, the same is not true for the developer network. The centralization of the network is 0.14, thus comparably low. Looking at the sociogram (Fig. 3.37), no center of the network can be detected (although two of the developers are connected to all others).

Degree of collaboration: wget features an inclusiveness of 1.0; all developers are included in the network and connected to at least two other developers.

Project size: The wget source code repository contains 4,865 modifications in 207 files conducted by 6 authors. Other summary data is listed in Table 3.31.

It is interesting that, although each file has been modified 23.5 times on average, the number of distinct authors per file (2.2) is about the same as in most other projects. In the wget project, it seems, most of the work is performed by a few developers. These are also connected to all other developers, tying their work into the network.

Figure 3.37: wget sociogram

3.5.28 xerces

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author         1.0      23.0     116.0     677.9    1162.0    2989.0
Number of modified files per author        1.0      18.0      80.0     261.1     455.0    1231.0
Number of modifications per file         1.000     3.000     4.000     8.042     8.000   286.000
Distinct authors per file                1.000     2.000     3.000     3.098     4.000    17.000
Gini coefficient (modifications): 0.7219
Table 3.32: Modification statistics for the xerces project

Concentration of modifications: The xerces project has a concentration of modifications of 0.72, high but below average. The most active developer performed 16% of all modifications.

Core developer structure: The centralization of the xerces network is relatively low, 0.12. The sociogram (Fig. 3.38) illustrates how densely the network is connected. It does not seem to contain a single center, but rather a close web of interconnected core developers who modify each other’s files.

Degree of collaboration: The inclusiveness of the project is relatively high (0.78).

Project size: The xerces source code repository contains 18,304 modifications in 2,276 files conducted by 27 authors.

Although each file has, on average, been modified only 8.0 times, it has 3.1 distinct authors, which is above average. Together with the high inclusiveness and the low centralization of the network, it can be concluded that the xerces project is a collaborative effort, without strong feelings for code ownership.

Figure 3.38: xerces sociogram

3.5.29 XFCE4

                                          Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
Number of modifications per author        1.00     23.75    143.50   2813.00   3096.00  21370.00
Number of modified files per author        1.0      13.5      61.5     909.7    1277.0    7163.0
Number of modifications per file         1.000     1.000     2.000     4.525     3.000   429.000
Distinct authors per file                1.000     1.000     1.000     1.464     2.000    10.000
Gini coefficient (modifications): 0.8101
Table 3.33: Modification statistics for the XFCE4 project

Concentration of modifications: Mechanistically applied, the Gini coefficient of the XFCE4 project is 0.81, a bit below average. However, this project requires careful consideration. The source code of version 3 of the Xfce project was imported into the CVS code repository, and the import was performed under the CVS account xfce. As a consequence, this non-human account appears to be the most active “developer”, performing 35% of all modifications. Of course, this poses a difficulty in how to handle the project.
Should the imported modifications silently be dropped from the analysis? This would falsify the total number of modifications, and the resulting network did not exhibit very different characteristics. Unfortunately, the old code repository was not available, so those modifications could not be traced back to specific developers any more. In the end it was decided to simply treat xfce as another developer in the summary analysis. However, concentration, centralization, and inclusiveness have been calculated for both networks: once with xfce treated as a developer, and once with all of its modifications ignored. Both figures are included in the summary table (Table 3.3 on page 74). With user xfce removed, the concentration of modifications is 0.80.

Core developer structure: Including user xfce, the betweenness-based centralization is 0.20; without it, the value is 0.30. The sociogram (Fig. 3.39) shows a well connected developer network, although some developers are clearly more central than others. Xfce exhibits a relatively dense network, although the centralization indicates a differentiation between core integrators and more peripheral members of the network.

Figure 3.39: XFCE4 sociogram

Degree of collaboration: The inclusiveness of the network is 0.77 (0.76 without user xfce). It seems that the project ties relatively many developers into the network.

Project size: The xfce4 source code repository contains 61,879 modifications in 13,675 files conducted by 22 authors (xfce as author included).

This version of the product is still young (the first CVS commit dates from February 2001, but development did not take off until after the release of the last version of xfce3 in November 2002), so its files have only been modified 4.5 times on average. Accordingly, the mean number of distinct authors is still relatively, yet not exceptionally, low: 1.5.
This indicates the existence of file ownership in the project; however, more development will possibly have to take place to answer this. Other summary modification statistics can be taken from Table 3.33.

Following the summary analysis and the analysis of each project, the next section summarizes and discusses the findings and their implications.

4 Discussion & Conclusion

A literature review had shown that there is strong interest in research on the organization of open source projects, and researchers have proposed applying Social Network Analysis to the field of computer-centered communities (Wellman, 1996) and especially open source projects. While some looked at identifying the links between software architecture modules for very large projects (González-Barahona et al., 2004) or performed a cross-project analysis (Madey et al., 2002), others had already gone further, trying to explain project success through the use of network governance theory (Sagers et al., 2004). The former two examine projects on a very high level, looking at relationships between software modules or entire projects. They seemed little concerned with the collaboration of developers within one project. The previously mentioned study by Sagers et al. uses many variables and constructs from network governance theory which have not yet been operationalized in the field of open source research, and it explains project success as a dependent variable, a measure which is difficult, if not impossible, to define in general.

This work identified a lack of knowledge about how developers coordinate themselves within a project. It therefore focused on the coordination of developers within a single project.
It addressed the issues of self-coordination through the set of files a developer works on (research question 1), the existence of “code ownership” (research question 2), and the identification of patterns through Social Network Analysis, with the aim of identifying influence factors of coordination styles (research question 3).

In order to answer the above questions, this research looked at the areas of code that developers chose to work on, trying to identify typical and varying properties which can characterize open source projects. It examined 29 projects empirically, using the source code modifications of these projects as the primary data source. The research takes more than 700,000 file modifications by 1,550 developers into account. As it relies on “hard” data to explore the coordination patterns of OS projects, it attempted to avoid the typical pitfalls of quantitative research on open source, which were described earlier in the study.

In order to answer research question 1, “are free software and open source projects self-coordinated through the set of files each developer works on?”, a number of measures were examined in general and six propositions made. Generally, it was found that in most cases a high share of the work was performed by very few developers (proposition 1). This observation is in line with other studies finding power-law distributions for the number of CVS commits per developer. Next, the degrees of developers were examined, which represent the number of developers with whom common files were modified. It was found that the mean degree is relatively low (5.0) (proposition 2), but the degree of developers is highly concentrated (proposition 3), with a few very active developers having very high degrees. It was concluded that these developers, modifying nearly everybody’s files, perform an integrating role by tying the files into the main network.
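The degree measure described above can be sketched as follows. The code assumes that the input is a list of (author, file) pairs extracted from the CVS log, and that two developers are tied when they have modified at least `min_common_files` of the same files. The study mentions a cut-off of ten common files, but the exact comparison and all names below are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_network(records, min_common_files=10):
    """Return (authors, ties) for a list of (author, file) modification records."""
    files_by_author = defaultdict(set)
    for author, path in records:
        files_by_author[author].add(path)
    ties = set()
    for a, b in combinations(sorted(files_by_author), 2):
        shared = files_by_author[a] & files_by_author[b]
        if len(shared) >= min_common_files:
            ties.add((a, b))
    return set(files_by_author), ties


def degrees(authors, ties):
    """Number of distinct co-developers each author is tied to."""
    degree = {author: 0 for author in authors}
    for a, b in ties:
        degree[a] += 1
        degree[b] += 1
    return degree


# Toy records with hypothetical names; the threshold is lowered to 2 to fit the toy data.
records = [
    ("ann", "a.c"), ("ann", "b.c"), ("ann", "a.c"),
    ("bob", "a.c"), ("bob", "b.c"),
    ("cid", "a.c"),
    ("dee", "z.c"),
]
authors, ties = build_network(records, min_common_files=2)
print(degrees(authors, ties))
```

In the toy data only ann and bob share enough files to be tied; cid and dee stay unconnected, which is how the “lone wolves” of the sociograms would arise under this construction.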
The inclusiveness of project networks differed widely (proposition 4); however, the only determinant which could be identified is the concentration of modifications (the larger the proportion of less active contributors, the less they are connected to the network). The centralization of a project network reflects the dominance of single developers in the network, which points to a coordinating role. Values for the centralization varied significantly across projects (proposition 5).

The second research question is concerned with the existence of “code ownership” in a project. Although the average number of developers per project is 40, the projects exhibit a mean distinct authors per file value of 2.4. This value remained remarkably consistent across projects, independent of their age, number of modifications, or project size. Only two projects proved to be an exception, having values of 5.0 and 6.3. These were the only projects with more than 100 developers, and one of them was observed over a time frame of nearly 20 years. It seems that, while no strict file ownership exists, most files are maintained by only very few developers. Although some files proved to be modified by many developers, the median number of distinct authors per file is 1 for many projects.

Research question 3 is concerned with the interpretation of the SNA measures and the identification of influence factors to enable the creation of a typology of coordination styles. This resulted in four influencing variables: the concentration of modifications as measured by the Gini coefficient, the core developer structure characterized by the network centralization, the degree of collaboration as measured by inclusiveness, and project size given by the number of developers, files, and modifications. These four dimensions were used to characterize and interpret the coordination style of each of the sample projects.
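The code-ownership measure, distinct authors per file, is simple to derive from a modification log. The following is a minimal sketch that assumes (author, file) pairs as input and uses invented records; it is not the tooling used in the study.

```python
from collections import defaultdict
from statistics import mean, median


def distinct_authors_per_file(records):
    """Map each file to the number of different authors who modified it."""
    authors_by_file = defaultdict(set)
    for author, path in records:
        authors_by_file[path].add(author)
    return {path: len(authors) for path, authors in authors_by_file.items()}


# Invented records: repeated modifications by the same author count once.
records = [
    ("ann", "main.c"), ("ann", "main.c"), ("bob", "main.c"), ("cid", "main.c"),
    ("ann", "util.c"),
    ("bob", "io.c"), ("bob", "io.c"),
]
counts = distinct_authors_per_file(records)
values = list(counts.values())
print(counts)
print(f"mean = {mean(values):.2f}, median = {median(values)}")
```

A mean above the median, as in this toy example, is the pattern noted in the text: a few files attract several authors while most files are maintained by a single developer.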
Implications

What can be learned from this study, and who will be able to take advantage of the knowledge? The foremost group able to benefit from this research are researchers from various disciplines looking into the organization and social structure of OS projects. The analysis of diverse measures in an inductive manner led to the formulation of six propositions that characterize the coordination in OSS development communities. These propositions can form a basis for further research on community structures in OS projects. File ownership could interest researchers of modularity in OS projects. The apparent aversion to performing modifications in other developers’ files could be one reason for the high modularity (creating areas of responsibility) in open source software. The characterization of coordination styles also shows that not all projects work in the same way. It could be used and extended to characterize different types of projects in future studies.

But the results should not only be interesting for researchers. Practitioners, whether from commercial organizations or volunteer participants in open source communities, benefit from a deeper understanding of how community developers choose their area of work. These benefits do not come in the form of direct and concrete recommendations on how a project should be organized or how the probability of project success can be increased; it is my conviction that it is crucial to understand the processes of open source project communities first, and that there is no single way to achieve project success. Practitioners who want to facilitate the survival and growth of such communities should still be interested in how they work. The existence of a small core of integrating, coordinating and much-contributing developers across all projects shows, for example, that OS projects are no perpetuum mobile (perpetual motion machine¹).
One cannot expect momentum to build up by magic after simply releasing source code into the wild. Community facilitators should also be aware of the low number of authors per file and provide opportunities for newcomers to plug new pieces of “their” code easily into the main software architecture. They should also be aware that the number of contributions and the centralization of the network did not seem to be connected. It is possible to have projects where one developer contributes most, but several decentralized integrators keep the network together.

Another potential implication is that the social structure of the communities and the way participants pick their work have an impact on the resulting software architecture. Given the low number of distinct authors per file, as identified across all projects, it becomes apparent that a software architecture which requires lots of interdependent code might not be the preferred way of working. This observation is in line with research on the modularity of software: MacCormack et al. (2004) find that open source projects are more modular than their closed source counterparts; work on knowledge reuse finds that modular components are much preferred to a collection of lines of code without a specified interface (von Krogh et al., 2005).

Limitations

Choosing cases for a sample which should represent the total population in some ways is never an easy task, and it is even harder if there does not seem to be “the representative” case at all. Many of these projects have very interesting stories around their creation, and the decision to include some of them is not easy. It was attempted to avoid the common pitfalls of quantitative research (as described in Section 2.1.1 on page 22) and to otherwise maximize the variation of project properties, such as size, age, type of application etc.
Of course, 1 “A Perpetual Motion Machine of the First Kind is a mechanism which, once set in motion, continues to do useful work without an input of energy, or which produces more energy than is absorbed in its operation. This kind of PM is impossible because it violates the principle of conservation of energy.” (McGraw-Hill Staff and Parker, 1994) 137 4 Discussion & Conclusion other researchers are encouraged to compare these findings to other projects and settings to confirm the validity of the sample. One strength of this research, the relatively unbiased “hard” data which could be consistently gathered for all projects, can of course also be used as potential criticism. Although this study attempts to interpret coordination styles, there is no context-sensitive data to explain and interpret it in a qualitative way. There is little which can be replied to supporters of this criticism, as it is inherent in the decision to rely on “hard” data. It should be seen as an attempt to gather measures that can characterize the collaboration and coordination of projects. Now that the results are presented, it might be beneficial to follow up specific cases with in-depth case studies. Future research This research laid a foundation to examine the coordination in OS projects, using existing methods and measures. The results help to answer the research questions posed, but there are many ways how one could proceed in a different manner, get yet more results, or examine related issues. Following, I present a list of ideas on how this research could be used to further the understanding of how open source projects work and how organizations could take advantage of it: The empirical analysis of the CVS modifications has been performed over the complete recorded time frame for the projects. However, it would be interesting to take the dynamics of projects into account. It would e.g. 
be very interesting to see whether, when a developer stops contributing code, his “heritage” is taken over by a new member (immediately or gradually over time) with a similar set of ties in the network. The social network could also be examined in connection with a “life-cycle” theory of projects (and communities).

One of the advantages of open source projects is the easy availability of data, such as code modifications and mailing list discussions. This is unfortunately not true for closed source projects, and few companies readily give access to their source code. It would be interesting to perform the same analysis on closed source projects to see whether their characteristics are similar to those of open source projects.

Some of the projects presented here feature interesting properties, such as extremely low or high specialization or a high number of distinct authors per file. These projects could be analyzed in more in-depth case studies, using complementary qualitative data sources, in order to explain these characteristics.

An important issue to examine is the importance of peripheral developers: if the main part of the work is done by a few developers, do the peripheral contributors bring any advantage at all? Are they a small, yet important, part of the big picture? Unfortunately, these questions cannot be answered by looking at code contributions alone and warrant further in-depth research.

Although many open questions remain, research on open source software projects has indeed come a long way during the last six years. It has proved to be an interdisciplinary area that has been tackled from many directions and on many levels.
As I was part of the developer community before becoming a researcher on it, one thing is for sure: “it seems so much easier and less complicated when you are just a hobby developer, contributing work to some open source project of your choice.”

Bibliography

Abisource.com, 2004. Abiword website. URL http://abisource.com
Adonthell, 2004. Adonthell website. URL http://adonthell.linuxgames.com
Anthonisse, J., 1971. The rush in a directed graph. Tech. Rep. BN9/71, Stichting Mathematisch Centrum, Amsterdam.
AWStats, 2004. Awstats website. URL http://awstats.sourceforge.net
Baldwin, C. Y., Clark, K. B., 2000. The Power of Modularity. Vol. 1 of Design Rules. The MIT Press.
Barnes, J. A., 1954. Class and committee in a Norwegian island parish. Human Relations 7.
Bates, J., 2003. HBS-MIT free/open source conference in Boston, MA, informal, personal conversation with Jeff Bates.
Benkler, Y., 2002. Coase’s penguin, or Linux and the nature of the firm. Yale Law Journal 112 (3), 369–447.
Bergquist, M., Ljungberg, J., 2001. The power of gifts: Organising social relationships in Open Source communities. Information Systems Journal 11 (4), 305–320.
Bessen, J., 2002. Open source software: Free provision of complex public goods. Tech. rep., Research on Innovation. URL http://www.researchoninnovation.org/opensrc.pdf
bison, Dec. 2004a. The bison 2.0 manual. URL http://www.gnu.org/software/bison/manual/pdf/bison.pdf
bison, 2004b. bison website. URL http://www.gnu.org/software/bison
Bitzer, J., Schrettl, W., Schroder, P. J., 2004. Intrinsic motivation in open source software development. URL http://opensource.mit.edu/papers/bitzerschrettlschroder.pdf
Borgatti, S. P., Foster, P. C., 2003. The network paradigm in organizational research: A review and typology. Journal of Management 29 (6), 991–1013.
Brooks, Jr., F. P., 1995. The Mythical Man-Month: Essays on Software Engineering, 20th Anniversary Edition. Addison-Wesley.
Bushnell, M. I., Dec. 1991.
The meaning of ‘hurd’. E-mail to [email protected], [email protected]. URL http://www.cs.pdx.edu/~trent/gnu/hurd/hurd-name
CDex, 2004. Cdex website. URL http://cdexos.sourceforge.net/
Clarke, I., 1999. A distributed decentralised information storage and retrieval system. Master’s thesis, University of Edinburgh.
Coffey, D. S., 1998. Self-organization, complexity and chaos: the new biology for medicine. Nature Medicine 4 (8), 882–885.
Coulon, F., Jan. 2005. The use of social network analysis in innovation research: A literature review, working paper. URL http://www.druid.dk/ocs/viewabstract.php?id=305&cf=2
Crowston, K., Howison, J., Dec. 2004. The social structure of free and open source software development. Syracuse FLOSS research working paper. URL http://opensource.mit.edu/papers/crowstonhowison.pdf
Crowston, K., Scozzi, B., 2002. Open source software projects as virtual organizations: Competency rallying for software development. IEE Proceedings - Software 149 (1), 3–17.
Dalle, J.-M., David, P. A., 2003. The allocation of software development resources in ’open source’ production. Discussion paper for The Stanford Institute For Economic Policy Research. URL http://opensource.mit.edu/papers/dalledavid.pdf
Dalle, J.-M., David, P. A., Ghosh, R. A., Steinmueller, W. E., 2004. Advancing economic research on the free and open source software mode of production. In: Wynants, M., Cornelis, J. (Eds.), Building our Digital Future - Future Economic, Social & Cultural Scenarios Based On Open Standards. Vrije Universiteit Brussels Press, Brussels, forthcoming.
DeMaggio, D., Jun. 2002. Letters to the editor: Reply to "cave or community? An empirical examination of 100 mature open source projects". URL http://www.firstmonday.org/issues/issue7_9/letters/index.html
DiBona, C., Ockman, S., Stone, M. (Eds.), 1999. Open Sources: Voices from the Open Source Revolution. O’Reilly & Associates, Inc.
Dixon, P.
M., Weiner, J., Mitchell-Olds, T., Woodley, R., 1987. Bootstrapping the Gini coefficient of inequality. Ecology 68, 1548–1551.
Emacs, 2004. Emacs website. URL http://www.gnu.org/software/emacs/emacs.html
Faraj, S., Sproull, L., 2000. Coordinating expertise in software development teams. Management Science 46 (12), 1554–1568.
Feller, J., Fitzgerald, B., 2000. A framework analysis of the open source software development paradigm. In: The 21st International Conference in Information Systems (ICIS 2000). pp. 58–69.
Feller, W., 1966. Introduction to Probability Theory and Its Applications. Vol. 2. John Wiley.
Ferraro, F., O’Mahony, S., 2003. Managing the boundary of an ’open’ project. Harvard NOM Working Paper No. 03-60. URL http://opensource.mit.edu/papers/omahonyferraro.pdf
flightgear, 2004. Flightgear website. URL http://www.flightgear.org
Foley, M. J., Apr. 2004. Microsoft releases source code on SourceForge. Microsoft Watch. URL http://www.microsoft-watch.com/article2/0,1995,1561861,00.asp
Freeman, L. C., 1977. A set of measures of centrality based on betweenness. Sociometry 40, 35–41.
Freeman, L. C., 1979. Centrality in social networks: Conceptual clarification. Social Networks 1, 215–239.
Freeman, L. C., White, D. R., Kimball, R. A. (Eds.), 1992. Research Methods in Social Network Analysis. Transaction Publishers, New Brunswick (USA) & London (UK).
Freenet, 2004. Freenet website. URL http://freenetproject.org
Gallivan, M. J., 2001. Striking a balance between trust and control in a virtual organization: A content analysis of open source software case studies. Information Systems Journal 11 (4), 277–304.
Ghosh, R. A., Glott, R., Krieger, B., Robles, G., 2002. Free/Libre and open source software: Survey and study (FLOSS). URL http://www.infonomics.nl/FLOSS/report/
Ghosh, R. A., Prakash, V. V., Jul. 2000. The orbiten free software survey. First Monday 5 (7). URL http://firstmonday.org/issues/issue5_7/gosh
Gini, C., 1912.
Variabilità e mutabilità. In: Pizetti, E., Salvemini, T. (Eds.), Memorie di metodologica statistica. Libreria Eredi Virgilio Veschi, Rome, reprint published 1955.
Gnomemeeting, 2004. Gnomemeeting website. URL http://gnomemeeting.org
GNUnet, 2004. Gnunet website. URL http://gnunet.org
González-Barahona, J. M., López, L., Robles, G., Jun. 2004. Community structure of modules in the Apache project. URL http://opensource.mit.edu/papers/barahona-apache_structure.pdf
Granovetter, M. S., 1973. The strength of weak ties. American Journal of Sociology 78, 1360–1380.
Granovetter, M. S., 1974. Getting a job. Harvard University Press, Cambridge, MA.
Grothoff, C., Patrascu, I., Bennett, K., Stef, T., Horozov, T., Jun. 2002. Gnet. URL http://gnunet.org/download/main.pdf
GTK+, 2004. Gtk+ website. URL http://www.gtk.org
Haken, H., 1977. Synergetics: An Introduction. Nonequilibrium Phase Transitions and Self-Organization in Physics, Chemistry and Biology. Springer.
Hall, P., 1982. Rates of convergence in the central limit theorem. Pitman, Boston.
Harhoff, D., Henkel, J., von Hippel, E., 2003. Profiting from voluntary information spillovers: how users benefit by freely revealing their innovations. Research Policy 32 (10), 1753–1769.
Hars, A., Ou, S., 2000. Why Is Open Source Software Viable? - A Study of Intrinsic Motivation, Personal Needs, and Future Returns. In: The 2000 Americas Conference on Information Systems (AMCIS 2000).
Hauben, M., 1994. History of ARPANET: Behind the net - the untold history of the ARPANET. URL http://www.dei.isep.ipp.pt/docs/arpa.html
Healy, K., Schussman, A., Jan. 2003. The ecology of open source software development. Working paper. URL http://opensource.mit.edu/papers/healyschussman.pdf
Hemetsberger, A., 2004. Fostering cooperation on the internet: social exchange processes in innovative virtual consumer communities. Presented at the Association of Consumer Research conference (2001).
URL http://opensource.mit.edu/papers/hemetsberger2.pdf
Hertel, G., Niedner, S., Herrmann, S., 2003. Motivation of software developers in open source projects: An internet-based survey of contributors to the Linux Kernel. Research Policy 32 (7), 1159–1177.
Heylighen, F., 2005. Web dictionary of cybernetics and systems. URL http://pespmc1.vub.ac.be/ASC/
von Hippel, E., Lakhani, K., May 2000. How open source software works: "Free" user-to-user assistance. MIT Sloan Working Paper No. 4117-00. URL http://papers.ssrn.com/sol3/Delivery.cfm/SSRN_ID290305_code011119590.pdf?abstractid=290305
von Hippel, E., von Krogh, G., Mar.-Apr. 2003. Open source software and the "private-collective" innovation model: Issues for organization science. Organization Science 14 (2), 209–223.
Howison, J., Crowston, K., May 2004. The perils and pitfalls of mining SourceForge. URL http://opensource.mit.edu/papers/howison04msr.pdf
iRate, 2004. irate website. URL http://irate.sourceforge.net
Jargon, 2004. The Jargon Dictionary. URL http://info.astrian.net/jargon/
Jarvenpaa, S. L., Leidner, D. E., 1999. Communication and trust in global virtual teams. Organization Science 10 (6), 791–815.
Jones, C., Hesterly, W. S., Borgatti, S. P., 1997. A general theory of network governance: Exchange conditions and social mechanisms. Academy of Management Review 22 (4), 911–945.
Jones, P., Sep. 2000. Brooks’ law and open source: The more the merrier? Does the open source development method defy the adage about cooks in the kitchen? IBM Developer Works.
Kauffman, S., 1995. At Home in the Universe: The Search for Laws of Self-Organization and Complexity. Oxford University Press, New York.
Koch, S., Schneider, G., 2002. Effort, cooperation and coordination in an open source software project: Gnome. Information Systems Journal 12 (1), 27–42.
Kogut, B. M., Metiu, A., 2001. Open-source software development and distributed innovation. Oxford Review of Economic Policy 17 (2).
Krishnamurthy, S., Jun. 2002. Cave or community?: An empirical examination of 100 mature open source projects. First Monday 7 (6). URL http://firstmonday.org/issues/issue7_6/krishnamurthy
von Krogh, G., Haefliger, S., Spaeth, S., 2003. Collective action and innovation in open source software development: The case of Freenet. Presented at Academy of Management 2003, Seattle.
von Krogh, G., Spaeth, S., Haefliger, S., 2005. Knowledge reuse in open source software: An exploratory study of 15 open source projects. In: Proceedings of the 38th Hawaii International Conference on System Sciences. pp. 198–207. URL http://csdl.computer.org/comp/proceedings/hicss/2005/2268/07/22680198b.pdf
von Krogh, G., Roos, J., 1995. Organizational Epistemology. St. Martin’s Press, New York.
von Krogh, G., Spaeth, S., Lakhani, K., Jul. 2003. Community, joining, and specialization in open source software innovation: A case study. Research Policy 32 (7), 1217–1241.
von Krogh, G., von Hippel, E., 2003. Special issue on open source software development: Editorial. Research Policy 32 (7), 1149–1157.
Lakhani, K., Wolf, B., Bates, J., DiBona, C., Jul. 2002. The Boston Consulting Group/OSDN hacker survey. Ver. 0.73. URL http://www.osdn.com/bcg/
LAME, 2004. Lame website. URL http://www.mp3dev.org/
Lanzara, G. F., Morner, M., 2003. The knowledge ecology of open-source software projects. URL http://opensource.mit.edu/papers/lanzaramorner.pdf
Laumann, E. O., Marsden, P. V., Prensky, D., 1992. The boundary specification problem in network analysis. In: Freeman et al. (1992), Ch. 3, pp. 61–87.
Lee, S., Moisa, N., Weiss, M., Mar. 2003. Open source as a signalling device - an economic analysis. URL http://opensource.mit.edu/papers/leemoisaweiss.pdf
Lerner, J., Tirole, J., 2000. The simple economics of open source. Working Paper 7600, National Bureau of Economic Research.
Lerner, J., Tirole, J., Dec. 2002a. The scope of open source licensing.
National Bureau of Economic Research. URL http://papers.nber.org/papers/W9363
Lerner, J., Tirole, J., 2002b. Some simple economics of open source. Journal of Industrial Economics 52.
Levy, S., 1984. Hackers: Heroes of the Computer Revolution. Anchor Press/Doubleday, NY.
Ljungberg, J., 2000. Open source movements as a model for organizing. European Journal of Information Systems 9 (4).
López, L., González-Barahona, J. M., Robles, G., Jun. 2004. Applying social network analysis to the information in CVS repositories. Working paper. URL http://opensource.mit.edu/papers/llopez-sna-short.pdf
MacCormack, A., Rusnak, J., Baldwin, C., Oct. 2004. Exploring the structure of complex software designs: An empirical study of open source and proprietary code. URL http://opensource.mit.edu/papers/maccormackrusnakbaldwin.pdf
Madey, G., Freeh, V., Tynan, R., 2002. The open source software development phenomenon: An analysis based on social network theory. In: Americas Conference on Information Systems (AMCIS 2002). Dallas, TX, pp. 1806–1813. URL http://www.nd.edu/~oss/Papers/amcis_oss.pdf
Mailman, 2004. Mailman website. URL http://www.gnu.org/software/mailman
Maturana, H. R., Varela, F. J., 1980. Autopoiesis and Cognition: The Realization of the Living. D. Reidel, Boston.
McGraw-Hill Staff, Parker, S. P., 1994. Dictionary of Scientific and Technical Terms, 5th Edition. McGraw-Hill Professional.
Merriam-Webster, 1993. Merriam-Webster’s Collegiate Dictionary, 10th Edition. Merriam-Webster. URL http://m-w.com
Mills, J. A., Zandvakili, A., 1997. Statistical inference via bootstrapping for measures of inequality. Journal of Applied Econometrics 12, 133–150.
Mitchell, J. C. (Ed.), 1969. Social Networks in Urban Situations. Manchester University Press.
Mnet, 2004. Mnet website. URL http://mnetproject.org/
Mockus, A., Fielding, R. T., Herbsleb, J. D., 2000. A case study of open source software development: The Apache server.
In: The 22nd International Conference on Software Engineering. Limerick, Ireland, pp. 263–272.
Moody, G., 2001. Rebel Code: Linux and the Open Source Revolution. Penguin, London.
Moreno, J., 1934. Who Shall Survive? Beacon Press, New York.
Nano, 2004. Nano website. URL http://www.nano-editor.org
Netscape, 1998. Netscape announces plans to make next-generation Communicator source code available free on the net. Press release, 22 Jan 1998.
Ogle, 2004. Ogle website. URL http://www.dtek.chalmers.se/groups/dvd
O’Mahony, S., 2003. Guarding the commons: How community managed software projects protect their work. Research Policy 32 (7), 1179–1198.
OpenSSL, 2004. Openssl website. URL http://www.openssl.org
Osterloh, M., Rota, S., 2003. Open source software production - The magic cauldron?
Osterloh, M., Rota, S., 2004a. Trust and community in open source software production. In: Lahno, B., Matzat, U. (Eds.), Trust and Community on the Internet: Opportunities and Restrictions for Online Cooperation. Lucius & Lucius.
Osterloh, M., Rota, S. G., Mar. 2004b. Open source software development - just another case of collective invention? URL http://ssrn.com/abstract=561744
pango, 2004. pango website. URL http://www.pango.org
Perens, B., 1999. The open source definition. In: DiBona et al. (1999), pp. 171–188.
phpMyAdmin, 2004. phpmyadmin. URL http://www.phpmyadmin.net
PostgreSQL, 2004. Postgresql website. URL http://www.postgresql.org
Prufer, J., 2004. Network formation via contests: The production process of open source software. URL http://opensource.mit.edu/papers/prufer.pdf
Raymer, M. G., 1994. Uncertainty principle for joint measurement of noncommuting variables. American Journal of Physics 62 (11), 986–993.
Raymond, E., Aug. 2003. The Art Of Unix Programming. Addison-Wesley. URL http://www.catb.org/~esr/writings/taoup/html/
Raymond, E. S., 1999a. The Cathedral & the Bazaar, 1st Edition. O’Reilly, Sebastopol, CA.
URL http://www.catb.org/~esr/writings/cathedral-bazaar
Raymond, E. S., 1999b. The Revenge of the Hackers. In: DiBona et al. (1999), pp. 207–219.
Roethlisberger, F. J., Dickson, W. J., 1939. Management and the Worker. Harvard University Press.
Rossi, M. A., 2004. Decoding the "free/open source (f/oss) puzzle" - a survey of theoretical and empirical contributions. URL http://opensource.mit.edu/papers/rossi.pdf
Sagers, G. W., McLure Wasko, M., Dickey, M. H., Aug. 2004. Coordinating efforts in virtual communities: Examining network governance in open source. In: Proceedings of the Tenth Americas Conference on Information Systems. New York.
Scacchi, W., 2002. Understanding requirements for developing open source software systems. IEE Proceedings - Software 149 (1), 24–39.
Scott, J., 1991. Social Network Analysis: A handbook. Sage Publications Ltd.
Shah, S., 2003. Understanding the Nature of Participation & Coordination in Open and Gated Source Software Development Communities. Ch. 4, doctoral dissertation.
Smarty, 2004. Smarty website. URL http://smarty.php.net
SourceForge.net, Jul. 2003. Project of the month: Tiki. URL http://sourceforge.net/potm/potm-2003-07.php
Spaeth, S., Apr. 2003. Decision-making in open source projects. Proposal for the Doctoral thesis. URL http://sspaeth.org/paper/Vorstudie.pdf
Stallman, R., 1999. The GNU Operating System and the Free Software Movement. In: DiBona et al. (1999), pp. 53–70.
Stallman, R. M., 1993. The GNU manifesto. URL http://www.gnu.org/gnu/manifesto.html
Stenborg, M., Aug. 2004. Waiting for f/oss: Coordinating the production of free/open source software. URL http://opensource.mit.edu/papers/stenborg.pdf
Stepania, 2004. Stepmania website. URL http://stepmania.sourceforge.net
Stewart, D., Apr. 2004. Status inertia: The speed imperative in the attainment of community status. URL http://opensource.mit.edu/papers/stewart2.pdf
tdb, 2004. Trivial database website. URL http://tdb.sourceforge.net
te Meerman, S., 2003.
Puzzling with a top-down blueprint and a bottom-up network: An explorative analysis of the open source world using ITIL and social network analysis. Master’s thesis, University of Groningen, Netherlands. URL http://opensource.mit.edu/papers/meerman2.pdf
The Economist, 2004. And the winners are... The Economist September 16th.
The Enquirer, 2004. IBM patents method for paying open source volunteers. Published on 26 January 2004. URL http://www.theinquirer.net/?article=13813
The Open Source Initiative, 2003a. The History of OSI. URL http://opensource.org/docs/history.php
The Open Source Initiative, 2003b. The Open Source Definition (ver. 1.9). Accessed: 10 Jan 2005. URL http://opensource.org/docs/definition.php
TikiWiki.org, 2004. Tikiwiki website. URL http://tikiwiki.org
Torvalds, L., Diamond, D., 2001. Just for Fun. Texere, London, UK.
Tuomi, I., Apr. 2000. Learning from Linux: Empirical and descriptive analysis of the open source model. Working paper was distributed in Berkeley and Stanford in April 2000. URL http://www.jrc.es/~tuomiil/articles/LearningFromLinux.pdf
Van Wendel de Joode, R., De Bruijn, J. A., Van Eten, M. J. G., Oct. 2002. Protecting the virtual commons: Self-organizing communities and innovative intellectual property rights regimes. Working paper. URL http://opensource.mit.edu/papers/joode.pdf
Viega, J., Warsaw, B., Manheimer, K., Dec. 1998. Mailman: The GNU mailing list manager. In: Proceedings of the 12th Systems Administration Conference (LISA ’98). USENIX Technical Program. Boston, MA.
Wasserman, S., 1994. Social Network Analysis. Methods and Applications. Cambridge University Press, pp. 345–423, 461–482.
Wellman, B., 1996. For a social network analysis of computer networks: a sociological perspective on collaborative work and virtual community. In: SIGCPR ’96: Proceedings of the 1996 ACM SIGCPR/SIGMIS conference on Computer personnel research. ACM Press, pp. 1–11.
West, J., O’Mahony, S., 2005.
Contrasting community building in sponsored and community founded open source projects. In: Proceedings of the 38th Annual Hawai’i International Conference on System Sciences (Jan 2005). URL http://opensource.mit.edu/papers/westomahony.pdf
wget, 2004. wget website. URL http://www.gnu.org/software/wget/wget.html
Whitaker, R., Dec. 1995. Self-organization, autopoiesis, and enterprises. URL http://www.acm.org/sigs/sigois/auto/Main.html
Wikipedia.org, 2004. Dictionary. URL http://wikipedia.org
Williams, S., 2002. Free as in Freedom: Richard Stallman’s Crusade for Free Software. O’Reilly, Sebastopol, CA, full text available online. URL http://www.oreilly.com/openbook/freedom/
xerces, 2004. xerces java parser website. URL http://xml.apache.org
Xfce, 2004. Xfce website. URL http://xfce.org
Yamauchi, Y., Yokozawa, M., Shinohara, T., Ishida, T., 2000. Collaboration with lean media: How open-source software succeeds. In: CSCW 2000. ACM, Philadelphia, PA, pp. 329–338.
Yin, R. K., 2003. Case Study Research: Design and Methods, 2nd Edition. Sage.
Zeitlyn, D., 2003. Gift economies in the development of open source software: Anthropological reflections. Research Policy 32 (7), 1287–1291.

A Appendix

A.1 Open Source Definition (Version 1.9)

Introduction

Open source doesn’t just mean access to the source code. The distribution terms of open-source software must comply with the following criteria:

1. Free Redistribution
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.

2. Source Code
The program must include source code, and must allow distribution in source code as well as compiled form.
Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

3. Derived Works
The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original software.

4. Integrity of The Author’s Source Code
The license may restrict source-code from being distributed in modified form only if the license allows the distribution of "patch files" with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of software built from modified source code. The license may require derived works to carry a different name or version number from the original software.

5. No Discrimination Against Persons or Groups
The license must not discriminate against any person or group of persons.

6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research.

7. Distribution of License
The rights attached to the program must apply to all to whom the program is redistributed without the need for execution of an additional license by those parties.

8. License Must Not Be Specific to a Product
The rights attached to the program must not depend on the program’s being part of a particular software distribution.
If the program is extracted from that distribution and used or distributed within the terms of the program’s license, all parties to whom the program is redistributed should have the same rights as those that are granted in conjunction with the original software distribution.

9. The License Must Not Restrict Other Software
The license must not place restrictions on other software that is distributed along with the licensed software. For example, the license must not insist that all other programs distributed on the same medium must be open-source software.

10. The License Must Be Technology-Neutral
No provision of the license may be predicated on any individual technology or style of interface.
Linus, 1
Toseland, Matthew, 42, 90
typology, 5
Warsaw, Barry, 48, 102

Curriculum Vitae

Personal data

Name: Sebastian Spaeth
Address: Zürichstrasse 45, CH - 8600 Dübendorf, Switzerland
E-mail: [email protected]
Date of Birth: 30 November 1975 (Göttingen, Germany)
Nationality: German

Education

Apr 2001 - Oct 2005: University of St. Gallen (Institute of Management), Switzerland; doctoral studies (leading to “Dr. oec.” on Oct 24, 2005)
Aug 1999 - Mar 2001: University of Linköping; International Master’s programme in Manufacturing Management (leading to “Master of Science in Engineering” in Apr 2001)
Oct 1996 - Aug 1999: Technical University Karlsruhe; Industrial engineering and management (Wirtschaftsingenieurwesen), majoring in corporate planning
1992 - 1995: Oberstufengymnasium Eschwege; High School Diploma (Abitur)

Professional Experience

Mar 2001 - present: University of St. Gallen, Switzerland; Research assistant at the chair of Prof. von Krogh, PhD
Jun 2002 - Dec 2002: “Mergers and Acquisitions” (Publisher: Verlagsgruppe Handelsblatt); Editor, responsible for “Computer and Telecommunications industry” Management
Sep 2000 - Mar 2001: NCC AB, Stockholm, Sweden; Master’s thesis in cooperation with NCC, examining goals, tasks, and organizational designs of a logistics function
Mar 2000 - Jun 2000: Strömsholmen AB, Tränas, Sweden; Simulation project to optimize the assembly process
Apr 1997 - Jul 1999: Institut für Rechneranwendung in Planung und Konstruktion (University of Karlsruhe); Academic Assistant in the Computer Department, responsible for organisation and computer maintenance

Publications

2005: Spaeth, S., Coordination in Open Source Projects: A Social Network Analysis Using CVS Data, Doctoral Dissertation
2005: von Krogh, G., Spaeth, S. & Haefliger, S., Knowledge Reuse in open source software: An exploratory Study of 15 open source projects, Proceedings of the 38th Hawaii Internat. Conf. on System Sciences
2003: von Krogh, G., Haefliger, S. & Spaeth, S., Collective Action and Innovation in Open Source Software Development: The Case of Freenet, presented at Academy of Management 2003, Seattle
2003: von Krogh, G., Haefliger, S. & Spaeth, S., Collective Action and Innovation in Open Source Software Development: The Case of Freenet, Academy of Mgmt. 2003, Seattle
2003: von Krogh, G., Spaeth, S., & Lakhani, K., Community, Joining, and Specialization in Open Source Software Innovation: A Case Study, Research Policy 32(7), pp. 1217–1241
2003: Spaeth, S., Decision-Making in Open Source Projects, Proposal for the Doctoral Dissertation
2001: Spaeth, S., Logistics at NCC, Master’s Thesis

Internships

Aug 1998 - Oct 1998: SEW-EURODRIVE GmbH & Co, Bruchsal; planning, designing and implementing a division presentation on the intranet
Mar 1997: Georg Sahm GmbH & Co. KG Maschinenfabrik, Eschwege; engineering internship
Aug 1995 - Sep 1995: Präwema Antriebstechnik GmbH, Eschwege; engineering internship

Voluntary Work

Apr 2001 - Apr 2005: Member of the board, Kammerchor Oberthurgau, Switzerland
Jul 1998 - Jul 1999: Executive secretary of the student university cinema “Akademisches Filmstudio an der Universität Karlsruhe e.V.”, Karlsruhe

Miscellaneous

Computer: Windows, Apple, Unix (Linux), Office, Arena, TCP/IP, HTML, XML, MySQL, basic programming skills (C, Java, PHP, Perl, JavaScript)
Languages: German (native language), English (fluent), Swedish (conversational)
Interests: Computers, Reading, Badminton, Sailing