University of Notre Dame

Mining and Modeling the Open Source Software Community

thesis
posted on 2007-04-18, 00:00 authored by Jin Xu
The success of Open Source Software (OSS) has attracted growing interest across many research areas. Unlike proprietary, closed-source software, OSS projects are developed in a distributed and decentralized way. The OSS community is composed largely of part-time developers, who have produced a substantial number of outstanding technical achievements. Studying how OSS developers interact with each other and how projects are developed will help researchers understand the success and failure of OSS projects. OSS developers can also benefit from this research by making more informed decisions about participating in OSS projects.

In this dissertation, we address the challenge of efficiently mining data from OSS web repositories and building models to study OSS community features. Data collection for OSS studies is nontrivial, since most OSS projects are developed by distributed developers using web tools. Most previous studies rely on a manually created web crawler tailored to specific research goals. We design a mining process that combines web mining and database mining to identify, extract, filter, and analyze data, and we analyze the difficulties of mining OSS data. Our work provides a general solution for researchers who want to apply advanced techniques such as web mining, data mining, statistics, and algorithms to collect and analyze web repository data.

Based on our mining results, we model the OSS community as a social network, which can be further modeled as a project network and a developer network, and we study the properties of these networks. Our goal is to find intrinsic mechanisms in OSS networks that explain OSS-specific features such as the roles of developers, communication, and the reliability of the OSS community. We construct four social networks for the OSS development community at SourceForge.
Each social network is created by widening the set of participant roles included in the network, moving from the core project leaders, to the core developers, to the co-developers, and finally out to active users. Social network properties such as degree distribution, diameter, cluster size, and clustering coefficient are calculated and compared across the expanding social networks. We elaborate on how the changing topological characteristics of the social networks may signify important capabilities for the diffusion of information, the ability to find collaborations, and the overall robustness of the OSS development community. We further find that all the social networks have scale-free properties, and that the inclusion of the co-developers and active users triggers the emergence of the small-world phenomenon. We examine how these topological network properties may explain the success and efficiency of OSS development practices.

To study the organization and backbones of the OSS community, we identify the community structure of the SourceForge project network and find that groups exist in it. We then explore possible reasons for the formation of those groups by examining assortative mixing coefficients for project categories. We find that projects with the same programming languages, operating systems, and topics are more likely to be grouped together. Our research provides useful information for studying the interaction between projects and the communication and information flow in OSS virtual organizations.

We simulate the OSS community based on four social network models: random graphs, preferential attachment, preferential attachment with constant fitness, and preferential attachment with dynamic fitness, using two tools, Repast and Swarm. Our simulation models are fit to data from year two in the history of SourceForge.
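As an illustration of the growth models named above, the following is a minimal Python sketch of preferential attachment with an optional constant per-node fitness. It is not the Repast or Swarm code from the dissertation. The function name `grow_network`, its parameters, and the rule that attachment probability is proportional to degree times fitness are all assumptions made for this example.

```python
import random
from collections import Counter

def grow_network(n_nodes, edges_per_node=2, fitness=None, seed=0):
    """Grow an undirected graph by preferential attachment (illustrative sketch).

    Each new node links to `edges_per_node` distinct existing nodes, chosen
    with probability proportional to degree * fitness. With fitness=None,
    every node has fitness 1.0, which gives plain preferential attachment.
    `fitness`, if given, is a callable taking the random generator and
    returning a positive number that stays fixed for the node's lifetime.
    """
    rng = random.Random(seed)
    m = edges_per_node
    # Seed graph: a complete graph on m + 1 nodes, so every degree is > 0.
    edges = {(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)}
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    fit = {i: (fitness(rng) if fitness else 1.0) for i in range(m + 1)}

    for new in range(m + 1, n_nodes):
        candidates = list(degree)
        weights = [degree[c] * fit[c] for c in candidates]
        targets = set()
        # Redraw until m distinct targets are chosen.
        while len(targets) < m:
            targets.add(rng.choices(candidates, weights=weights, k=1)[0])
        for t in targets:
            edges.add((t, new))
            degree[t] += 1
            degree[new] += 1
        fit[new] = fitness(rng) if fitness else 1.0
    return edges, degree

# Plain preferential attachment, then a constant-fitness variant.
edges, degree = grow_network(200, edges_per_node=2, seed=1)
edges_f, degree_f = grow_network(
    200, edges_per_node=2, fitness=lambda rng: rng.uniform(0.1, 1.0), seed=1
)
```

A dynamic-fitness variant would update the values in `fit` as the network grows. How that update should work is left open here because the abstract does not specify it.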
To verify our simulations, we perform docking experiments between the Repast simulation and the Java/Swarm simulation. Our models simulate developers' actions and the growth of the OSS community. We compare social network properties such as degree distribution, diameter, and clustering coefficient to dock the Repast and Swarm simulations of the four social network models. Our practice demonstrates the importance of verification in scientific simulations. The simulation models we build can be used to forecast the future development of the OSS community.
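The network properties used for comparison (degree distribution, diameter, and clustering coefficient) can be computed from an edge list, as in the following minimal Python sketch. The function names and the toy edge list are assumptions made for this example, and the code is not taken from the dissertation. The diameter is taken as the longest shortest path between any pair of mutually reachable nodes.

```python
from collections import Counter, deque
from itertools import combinations

def adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_distribution(adj):
    """Map each degree k to the number of nodes that have degree k."""
    return Counter(len(nbrs) for nbrs in adj.values())

def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def diameter(adj):
    """Longest shortest path over all reachable pairs, via BFS from each node."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

# Toy network: a triangle (1, 2, 3) with one extra node attached to node 3.
toy = adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
print(degree_distribution(toy), average_clustering(toy), diameter(toy))
```

Running the same functions on the output of two independently built simulations gives a simple side-by-side check of whether they produce similar networks.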

History

Date Modified

2017-06-02

Defense Date

2007-03-19

Research Director(s)

Gregory Madey

Committee Members

Gregory Madey

Degree

  • Doctor of Philosophy

Degree Level

  • Doctoral Dissertation

Language

  • English

Alternate Identifier

etd-04182007-083425

Publisher

University of Notre Dame

Program Name

  • Computer Science and Engineering
