COMP 5900: Mining Software Repositories

Software development projects generate impressive amounts of data. Mining software repositories research aims to extract information from the various artifacts produced during the evolution of a software system and to infer the relationships between them. This course will introduce the methods and tools for mining the software repositories and artifacts used by software developers and researchers. The course will be seminar-based and will involve weekly reading and discussion. The project component will be flexible but will likely involve some programming. For further details on the course content, please refer to its outline (pdf). This course is offered by the School of Computer Science at Carleton University.

Seminars are held every Thursday from 08:35 AM to 11:25 AM in SA314 (new room!).


  • Project presentation schedule is posted (December 06)
  • The course schedule is now finalized (September 26)
  • Submit paper review (Due 8:00 AM every Thursday)

Content Overview

The course will be adjusted according to students’ interests and experience. This is an overview of the kinds of topics the course could cover:

  • Mining software repositories (data extraction and analysis)
  • Development team processes
  • Software development tools and environments
  • Software analytics
  • Software visualization
  • Mining social data
  • Software evolution
  • Quantitative and qualitative evaluation of software engineering research

Tentative Schedule

It is important to note that this schedule is evolving and will change based on your interests and how the class is progressing.

Thursday September 8 - Introduction

  1. Introduction.
  2. Introduction to Software Analytics.
    Presented by Olga Baysal

Thursday September 15 - Quantitative vs. Qualitative Empirical Studies

  1. Oleksii Kononenko, Olga Baysal, Latifa Guerrouj, Yaxin Cao, and Michael W. Godfrey. "Investigating Code Review Quality: Do People and Participation Matter?". ICSME 2015.
    Presented by Olga Baysal.
  2. Oleksii Kononenko, Olga Baysal, and Michael W. Godfrey. "Code Review Quality: How Developers See It". ICSE 2016.
    Presented by Olga Baysal.

Thursday September 22 - Bugs

  1. Thung, Lo, Jiang, Lucia, Rahman, Devanbu. When Would This Bug Get Reported? ICSM 2012.
    Presented by Sheenam Sharma (slides).
  2. Aranda and Venolia. The secret life of bugs: Going past the errors and omissions in software repositories. ICSE 2009.
    Presented by Shivjot Baidwan (slides).
  3. Anvik, Hiew, and Murphy. Who should fix this bug? ICSE 2006.
    Presented by Marzia Zaman (slides).

Thursday September 29 - Foundation and Research Methods in SE

  1. Frederick P. Brooks. No Silver Bullet: Essence and Accidents of Software Engineering. 1987.
    Presented by Pablo Navarro (slides).
  2. J. Hannay, D. Sjoberg and T. Dyba. A systematic review of theory use in software engineering experiments. TSE 2007.
    Presented by Gurpinder Kaur (slides).
  3. Miryung Kim and David Notkin. Discovering and Representing Systematic Code Changes. ICSE 2009.
    Presented by Sultan Almaghthawi (slides).

Thursday October 06 - Mining "cool" repositories (Twitter, tutorials, healthcare)

  1. Leif Singer, Fernando Marques Figueira Filho, Margaret-Anne D. Storey. Software engineering at the speed of light: how developers stay current using Twitter. ICSE 2014.
    Presented by Eric Tran (slides).
  2. Ponzanelli et al. Too Long; Didn't Watch! Extracting Relevant Fragments from Software Development Video Tutorials. ICSE 2016.
    Presented by Chris Budiman (slides).
  3. Parvez Ahmad, Saqib Qamar, Syed Qasim Afser Rizvi. Techniques of Data Mining In Healthcare: A Review. International Journal of Computer Applications 2015.
    Presented by Eric Torunski (slides).
  4. Sillito, Murphy and De Volder. Questions programmers ask during software evolution tasks. FSE 2006.
    Presented by Simrandeep Singh (slides).

Thursday October 13 - Software Developers vs. Data Scientists

  1. Li et al. What makes a great software engineer? ICSE 2015.
    Presented by Venus Pathak (slides).
  2. Kim et al. The emerging role of data scientists on software development teams. ICSE 2016.
    Presented by Shruthi Nagaraj (slides).
  3. Andrew Begel and Beth Simon. Struggles of new college graduates in their first software development job. SIGCSE 2008.
    Presented by Gurpinder Kaur (slides).
  4. Martin P. Robillard, Wesley Coelho, and Gail C. Murphy. How Effective Developers Investigate Source Code: An Exploratory Study. TSE 2004.
    Presented by Naga Prasanthi Bobbili (slides).

Thursday October 20 - Code Review and GitHub

  1. Alberto Bacchelli and Christian Bird. Expectations, Outcomes, and Challenges of Modern Code Review. ICSE 2013.
    Presented by Tresa Rose (slides).
  2. Eirini Kalliamvakou, Daniela Damian, Kelly Blincoe, Leif Singer, Daniel German. Open Source-Style Collaborative Development Practices in Commercial Projects Using GitHub. ICSE 2015.
    Presented by Eric Tran (slides).
  3. Georgios Gousios, Andy Zaidman, Margaret-Anne Storey, Arie van Deursen. Work Practices and Challenges in Pull-Based Development: The Integrator’s Perspective. ICSE 2015.
    Presented by Ajaydeep Singh Grewal (slides).
  4. Gousios, Storey, Bacchelli. Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective. ICSE 2016.
    Presented by Chris Budiman (slides).
  5. Andrew Sutherland, Gina Venolia. Can peer code reviews be exploited for later information needs? ICSE 2009.
    Presented by Tianxiao Deng (slides).

Thursday October 27 - NO CLASS (Reading week)

Thursday November 03 - Source Code and Refactoring

  1. Suresh Thummalapenta and Tao Xie. Parseweb: a programmer assistant for reusing open source code on the web. ASE 2007.
    Presented by Naga Prasanthi Bobbili (slides).
  2. Mathieu Lavallee, Pierre N. Robillard. Why Good Developers Write Bad Code: an Observational Case Study of the Impacts of Organizational Factors on Software Quality. ICSE 2015.
    Presented by Ajaydeep Singh Grewal (slides).
  3. Zhenchang Xing, Eleni Stroulia. Refactoring practice: How it is and how it should be supported: An Eclipse case study. ICSM 2006.
    Presented by Sultan Almaghthawi (slides).
  4. Emerson Murphy-Hill, Chris Parnin, Andrew P. Black. How we refactor, and how we know it. ICSE 2009.
    Presented by Pablo Navarro (slides).

Thursday November 10 - Bugs and Failures

  1. Schröter, Zimmermann, Zeller. Predicting Component Failures at Design Time. ISESE 2006.
    Presented by Marzia Zaman (slides).
  2. Aleem, Capretz and Ahmed. Benchmarking Machine Learning Techniques for Software Defect Detection.
    Presented by Sheenam Sharma (slides).
  3. Brittany Johnson, Yoonki Song, Emerson R. Murphy-Hill, Robert W. Bowdidge. Why don't software developers use static analysis tools to find bugs? ICSE 2013.
    Presented by Ranjodh Singh (slides).
  4. Feng Zhang, Audris Mockus, Iman Keivanloo and Ying Zou. Towards Building a Universal Defect Prediction Model. MSR 2014.
    Presented by Ekaba Bisong (slides).

Thursday November 17 - NO CLASS (I am away at a conference)

Thursday November 24 - Collaborative Development

  1. Julia Rubin and Martin Rinard. The Challenges of Staying Together While Moving Fast: An Exploratory Study. ICSE 2016.
    Presented by Venus Pathak (slides).
  2. Bird, C., Nagappan, N., Devanbu, P.T., Gall, H., and Murphy, B. Does distributed development affect software quality? An empirical case study of Windows Vista. ICSE 2009.
    Presented by Simrandeep Singh (slides).
  3. Peter Rigby and Christian Bird. Convergent Software Peer Review Practices. FSE 2013.
    Presented by Tresa Rose (slides).
  4. Sharafi, Shaffer, Sharif, Gueheneuc. Eye-tracking Metrics in Software Engineering. APSEC 2015.
    Presented by Shruthi Nagaraj (slides).
  5. Christoph Treude and Margaret-Anne Storey. Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds. ICSE 2010.
    Presented by Shivjot Kaur Baidwan (slides).

Thursday December 1 - Various Topics

  1. Andrew Begel, Yit Phang Khoo, Thomas Zimmermann. Codebook: discovering and exploiting relationships in software repositories. ICSE 2010.
    Presented by Ranjodh Singh (slides).
  2. Miltiadis Allamanis, Charles Sutton. Mining Source Code Repositories at Massive Scale using Language Modeling. MSR 2013.
    Presented by Eric Torunski (slides).
  3. Annie T. T. Ying, Martin P. Robillard. Selection and presentation practices for code example summarization. FSE 2014.
    Presented by Tianxiao Deng (slides).
  4. Patrick Morrison and Emerson Murphy-Hill. Is Programming Knowledge Related to Age? An Exploration of Stack Overflow. MSR 2013.
    Presented by Ekaba Bisong (slides).

Thursday December 8 - Project Presentations (15 minutes for each group)

  1. Gurpinder and Shivjot
  2. Chris and Pablo
  3. Sheenam
  4. Sultan and Prasanthi
  5. Marzia
  6. Ranjodh and Simrandeep
  7. Tresa
  8. Venus and Ajaydeep
  9. Shruthi
  10. Eric and Tianxiao
  11. Eric and Ekaba


  • Weekly paper reviews: 10%
  • Class participation and discussion: 20%
  • Paper presentations: 20%
  • Course Project: 50% (10% project presentation + 40% project report)

Weekly Paper Reviews

Each week you are expected to carefully read two papers. In addition, you are to submit a review of one of the papers (you choose which one). However, if you are doing a paper presentation, then you are excused for that week.

Reviews are due by 8:00 AM on the morning of the class. Please send me an email with the subject "[COMP 5900] Paper Review Student_Name".

A review should be about 500-1000 words long and submitted as a PDF file.

Your review should address the following points:

  1. What were the primary contributions of the paper as the author sees it?
  2. What were the main contributions of the paper as you (the reader) see it?
  3. How does this work move the research forward (or how does the work apply to you)?
  4. How was the work validated?
  5. How could this research be extended?
  6. How could this research be applied in practice?

Class Participation

Each week you are expected to read two papers, as well as participate in the class discussion.

Paper Presentations

In a typical week, we will examine two research papers. I will present a few of them on my own, but the other presentations will be done by students.

You will get to select the two papers you want to present. Please make your selections from this list and email them to me by September 13. I will generate a cohesive class schedule once everyone has selected their papers.

You are then to design a presentation of about 20-25 minutes that is both informative and entertaining. Don't feel limited to just the content of the papers.

You should also come prepared with a set of questions to foster a 15-20 minute discussion session that you will lead to follow the presentation (this is where the other students earn their class participation marks).

When you design your talk, keep in mind that the audience has already read the papers. Remind us of the motivation, the big ideas, the context of the problem being addressed, and how all of this relates to what we've already seen in the course.

Presentations can be done using OpenOffice, PowerPoint, Keynote, or PDF. You must send me your slides (PDF only) prior to your talk, and I will put them on the course web page.

Course Project

The project forms an integral part of this course. The projects can be done individually or completed in groups of two students.

You have two options: either create a submission for the 2017 MSR challenge or come up with an idea of your own that relates to the course material. In either case, the project topic will require my approval (via the proposal).

If you decide to do the MSR challenge, you may also submit it to the conference, but note that the deadline is February 20, 2017. Talk to me if you are interested in exploring this. Otherwise, you can do the challenge as your class project and ignore the actual conference submission.

There are three deliverables for your project:

  1. Project proposal. Before you undertake your project, you will need to submit a proposal for approval. The proposal should be short (a PDF of at most 2 pages in ACM format). It should include a problem statement, the motivation for the project, and a set of objectives you aim to accomplish. I will read these and provide comments. The proposal is not for marks but must be completed in order to pass the course. This will be due on October 06 by 8:00 AM via email.
  2. Written report. The required length of the written report varies from project to project (8-10 pages, double column format); all reports must be formatted according to the ACM format and submitted as a PDF. This report will constitute 100% of the project report grade. This will be due on December 16 by 11:59 PM via email.
  3. Project presentation. Each group will have the opportunity to present their project in class on December 08. This presentation should take the form of a 15-minute (hard maximum) conference-style talk and describe the motivation for your work, what you did, and what you found. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk. Please allocate 3-5 minutes for questions after the project has been presented.

    The proposed structure of your presentation:

    1. Introduction (describe the problem and motivation)
    2. Research questions
    3. Methodology: data collection, data cleanup, data mining, data analysis (statistics, machine learning), etc.
    4. Results (achieved, preliminary, or anticipated)
    5. Implications (why does this study matter? how can your findings be used?)
    6. Conclusion (summary, main contributions)


The best way to get in touch with me is via email: olga.baysal[at]

University Policies

Academic Integrity

Academic Integrity is everyone’s business because academic dishonesty affects the quality of every Carleton degree. Each year students are caught in violation of academic integrity and found guilty of plagiarism and cheating. In many instances they could have avoided failing an assignment or a course simply by learning the proper rules of citation. See the academic integrity policy for more information.

Academic Accommodations for Students with Disabilities

The Paul Menton Centre for Students with Disabilities (PMC) provides services to students with Learning Disabilities (LD), psychiatric/mental health disabilities, Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorders (ASD), chronic medical conditions, and impairments in mobility, hearing, and vision. If you have a disability requiring academic accommodations in this course, please contact PMC at 613-520-6608 or for a formal evaluation. If you are already registered with the PMC, contact your PMC coordinator to send me your Letter of Accommodation at the beginning of the term, and no later than two weeks before the first in-class scheduled test or exam requiring accommodation (if applicable). After requesting accommodation from PMC, meet with me to ensure accommodation arrangements are made. Please consult the PMC website for the deadline to request accommodations for the formally-scheduled exam (if applicable).

Pregnancy Obligation

Write to the instructor with any requests for academic accommodation during the first two weeks of class, or as soon as possible after the need for accommodation is known to exist. For more details visit the Equity Services website.

Religious Obligation

Write to the instructor with any requests for academic accommodation during the first two weeks of class, or as soon as possible after the need for accommodation is known to exist. For more details visit the Equity Services website.