COMP 5117: Mining Software Repositories

Software development projects generate impressive amounts of data. Mining software repositories research aims to extract information from the various artifacts produced during the evolution of a software system and to infer the relationships between them. This course will introduce the methods and tools used by software developers and researchers to mine software repositories and artifacts. The course will be seminar-based and will involve weekly reading and discussion. The project component will be flexible but will likely involve some programming. For further details on the course content, please refer to the Course Outline (pdf). This course is offered by the School of Computer Science at Carleton University.

Seminars are held every Monday from 10:05 AM to 12:55 PM via Zoom Meeting (meeting details are posted on Discord).

Announcements

  • Submit your paper review (due at 11:59 PM every Sunday; no later than 10:00 AM on Monday).
  • Please send me your paper selection list (3-5 papers) by Monday, September 16, 2024.
  • We will be using Discord for course communication, news, and reminders. Please join the COMP5117 server.
  • Welcome to COMP5117! Our seminars start on Monday, September 09, 2024.

Content Overview

The course will be adjusted according to students' interests and experience. This is an overview of the kinds of topics the course could cover:

  • Mining software repositories (data extraction and analysis)
  • AI for SE
  • Large language models (LLMs)
  • Software development processes
  • Software development tools and environments
  • Software analytics
  • Software visualization
  • Software maintenance and evolution
  • Collaborative development
  • Quantitative and qualitative evaluation of software engineering research

Tentative Schedule

It is important to note that this schedule is evolving and will change based on your interests and how the class is progressing.

Monday, September 9 - Introduction

  1. Introduction to the course.
    Presented by Olga Baysal.

Monday, September 16 - LLMs for issue labeling, code completion, and software architecture.

  1. Leveraging GPT-like LLMs to Automate Issue Labeling by Giuseppe Colavito, Filippo Lanubile, Nicole Novielli, and Luigi Quaranta. MSR 2024.
    Presented by Kaya Gouin.
  2. Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases by Ze Tang, Jidong Ge, Shangqing Liu, Tingwei Zhu, Tongtong Xu, Liguo Huang, Bin Luo. ASE 2023.
    Presented by Nima Meghdadi.
  3. Towards Human-Bot Collaborative Software Architecting with ChatGPT by Aakash Ahmad, Muhammad Waseem, Peng Liang, Mahdi Fahmideh, Mst Shamima Aktar, and Tommi Mikkonen. EASE 2023.
    Presented by Sean Mackenzie.

Monday, September 23 - Large-scale mining.

  1. A large-scale comparison of Python code in Jupyter notebooks and scripts by Konstantin Grotov, Sergey Titov, Vladimir Sotnikov, Yaroslav Golubev, and Timofey Bryksin. MSR 2022.
    Presented by Justin Zhang.
  2. What Do Users Ask in Open-Source AI Repositories? An Empirical Study of GitHub Issues by Zhou Yang, Chenyu Wang, Jieke Shi, Thong Hoang, Pavneet Singh Kochhar, Qinghua Lu, Zhenchang Xing, David Lo. MSR 2023.
    Presented by Bahareh Abolhasanzadeh.
  3. Revisiting Dockerfiles in Open Source Software Over Time by Kalvin Eng and Abram Hindle. MSR 2021.
    Presented by Huzaifa Patel.

Monday, September 30 - Security.

  1. Investigating the Resolution of Vulnerable Dependencies with Dependabot Security Updates by H. Mohayeji, A. Agaronian, E. Constantinou, N. Zannone and A. Serebrenik. MSR 2023.
    Presented by John Shortt.
  2. EvoAttack: An Evolutionary Search-Based Adversarial Attack for Object Detection Models by Kenneth Chan and Betty H. C. Cheng. SSBSE 2022.
    Presented by Yang Xu.
  3. Automated Detection of Password Leakage from Public GitHub Repositories by Runhan Feng, Ziyang Yan, Shiyan Peng, and Yuanyuan Zhang. ICSE 2022.
    Presented by Lanney Wang.
  4. Toward Improved Deep Learning-based Vulnerability Detection by Adriana Sejfia, Satyaki Das, Saad Shafiq, Nenad Medvidović. ICSE 2024.
    Presented by Elham Hekmatnia.

Monday, October 7 - NO CLASS (I am away at a conference)

Monday, October 14 - NO CLASS (Thanksgiving)

Monday, October 21 - NO CLASS (Fall break)

Monday, October 28 - Gen AI studies.

  1. ChatGPT-Resistant Screening Instrument for Identifying Non-Programmers by Raphael Serafini, Clemens Otto, Stefan Albert Horstmann, and Alena Naiakshina. ICSE 2024.
    Presented by Yiqun (Ezra) Hao.
  2. ChatGPT Chats Decoded: Uncovering Prompt Patterns for Superior Solutions in Software Development Lifecycle by Liangxuan Wu, Yanjie Zhao, Xinyi Hou, Tianming Liu, and Haoyu Wang. MSR 2024.
    Presented by Mohamed Amine Benzaarit.
  3. Unveiling ChatGPT's Usage in Open Source Projects: A Mining-based Study by Rosalia Tufano, Antonio Mastropaolo, Federica Pepe, Ozren Dabic, Massimiliano Di Penta, and Gabriele Bavota. MSR 2024.
    Presented by Mozhan Saeedidehshali.
  4. Trustworthy and Synergistic Artificial Intelligence for Software Engineering: Vision and Roadmaps by David Lo. ICSE 2023.
    Presented by Yang Xu.

Monday, November 4 - Bugs I.

  1. Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code by Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. ICSE 2024.
    Presented by John Shortt.
  2. IoT Bugs and Development Challenges by Amir Makhshari and Ali Mesbah. ICSE 2021.
    Presented by Huzaifa Patel.
  3. Large Language Models and Simple, Stupid Bugs by K. Jesse, T. Ahmed, P. Devanbu and E. Morgan. MSR 2023.
    Presented by Amine Ghazi.
  4. An Exploratory Study of Productivity Perceptions in Software Teams by Anastasia Ruvimova, Alexander Lill, Jan Gugler, Lauren Howe, Elaine Huang, Gail Murphy, and Thomas Fritz. ICSE 2022.
    Presented by Sean Mackenzie.

Monday, November 11 - Code authorship, bots, warnings, and more.

  1. Whodunit: Classifying Code as Human Authored or GPT-4 generated - A case study on CodeChef problems by Oseremen Joy Idialu, Noble Saji Mathews, Rungroj Maipradit, Joanne M. Atlee, and Mei Nagappan. MSR 2024.
    Presented by Drew Shields.
  2. BotHunter: An Approach to Detect Software Bots in GitHub by A. Abdellatif, M. Wessel, I. Steinmacher, M. A. Gerosa and E. Shihab. MSR 2022.
    Presented by Yiqun (Ezra) Hao.
  3. Large Language Models are Few-Shot Testers: Exploring LLM-Based General Bug Reproduction by Sungmin Kang, Juyeon Yoon, and Shin Yoo. ICSE 2023.
    Presented by Lanney Wang.
  4. Software Entity Recognition with Noise-Robust Learning by T. Nguyen, Y. Di, J. Lee, M. Chen and T. Zhang. ASE 2023.
    Presented by Nima Meghdadi.

Monday, November 18 - Bugs II: detection, localization, prediction.

  1. Enhancing Performance Bug Prediction Using Performance Code Metrics by Guoliang Zhao, Stefanos Georgiou, Safwat Hassan, Ying Zou, Derek Truong, and Toby Corbin. MSR 2024.
    Presented by Mohamed Amine Benzaarit.
  2. The ABLoTS Approach for Bug Localization: is it replicable and generalizable? by Feifei Niu, Christoph Mayr-Dorn, Wesley K. G. Assunção, LiGuo Huang, Jidong Ge, Bin Luo, and Alexander Egyed. MSR 2023.
    Presented by Justin Zhang.
  3. Bridging the Gap between Academia and Industry in Machine Learning Software Defect Prediction: Thirteen Considerations by S. Stradowski and L. Madeyski. ASE 2023.
    Presented by Bahareh Abolhasanzadeh.
  4. Supporting High-Level to Low-Level Requirements Coverage Reviewing with Large Language Models by Anamaria-Roberta Preda, Christoph Mayr-Dorn, Atif Mashkoor, and Alexander Egyed. MSR 2024.
    Presented by Amine Ghazi.

Monday, November 25 - No class.

Monday, December 2 - Code: generation, translation, readability.

  1. On Evaluating the Efficiency of Source Code Generated by LLMs by Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, and Vincent Ng. FORGE 2024.
    Presented by Kaya Gouin.
  2. Data Augmentation for Supervised Code Translation Learning by Binger Chen, Jacek Golebiowski, and Ziawasch Abedjan. MSR 2024.
    Presented by Elham Hekmatnia.
  3. Using Deep Learning to Automatically Improve Code Readability by A. Vitale, V. Piantadosi, S. Scalabrino and R. Oliveto. ASE 2023.
    Presented by Drew Shields.
  4. MicroRec: Leveraging Large Language Models for Microservice Recommendation by Ahmed Saeed Alsayed, Hoa Khanh Dam, and Chau Nguyen. MSR 2024.
    Presented by Mozhan Saeedidehshali.

Friday, December 6 - Project Presentations

Evaluation

  • Weekly paper reviews: 10%
  • Class participation and discussion: 20%
  • Paper presentation: 10%
  • Course Project: 60% (10% project presentation + 50% project report)

Weekly Paper Reviews

Each week you are expected to carefully read two to three papers. In addition, you are to submit a review of one of the papers (you choose which one). However, if you are doing a paper presentation, then you are excused for that week.

Reviews are due by 10:00 AM on the morning of the class. Please send me an email with the subject line "[COMP 5117] Paper Review Student_Name".

A review should be about 500-1000 words long (1.5-2 pages), and submitted as a PDF file.

Your review should address the following points:

  1. What were the primary contributions of the paper as the authors see them?
  2. What were the main contributions of the paper as you (the reader) see them?
  3. How does this work move the research forward (or how does the work apply to you)?
  4. How was the work validated?
  5. How could this research be extended?
  6. How could this research be applied in practice?

Class Participation

Each week you are expected to read all presented papers, as well as participate in the class discussion.

Paper Presentations

In a typical week, we will examine two or three research papers. Paper presentations will be done by students.

You will select three to five papers you would like to present, listed in order of preference (first to last). Please make your selections from the proceedings of MSR, ICSE, or other conferences such as FSE and ASE (2021-2024): MSR 2024, MSR 2023, MSR 2022, MSR 2021, ICSE 2024, ICSE 2023, ICSE 2022, ICSE 2021. Once you have selected your papers, email me your list of three to five papers. This must be done by Monday, September 16. I will generate a cohesive class schedule once everyone has selected their papers. Each student will be assigned to present one or two papers in class, depending on the class size.

You are then to design a presentation of about 20-25 minutes that is both informative and entertaining. Don't feel limited to just the content of the papers.

You should also come prepared with a set of questions to foster a 15-20 minute discussion session, which you will lead after the presentation (this is where the other students earn their class participation marks).

When you design your talk, keep in mind that the audience has already read the papers. Remind us of the motivation, the big ideas, the context of the problem being addressed, and how all of this relates to what we've already seen in the course.

Presentations can be done using OpenOffice, PowerPoint, Keynote, or PDF. You must share your slides (PDF only) by uploading them to the Discord server before or after your talk.

Course Project

The project forms an integral part of this course. Projects can be completed individually or in groups of two students.

You need to come up with an idea of your own that relates to the course material. The project topic will require my approval (via the proposal).

There are three deliverables for your project:

  1. Project proposal. Before you undertake your project, you will need to submit a proposal for approval. The proposal should be short (a PDF of at most 2 pages in ACM format). It should include a problem statement, the motivation for the project, and the set of objectives you aim to accomplish. I will read these and provide comments. The proposal is not for marks but must be completed in order to pass the course. It is due on September 23 by 11:59 PM via email.

  2. Written report. The required length of the written report varies from project to project (8-10 pages, double-column format); all reports must be formatted according to the ACM format (LaTeX users can use the "sigconf" option: \documentclass[sigconf]{acmart} ) and submitted as a PDF. This report will constitute 100% of the project report grade. It is due on December 13 by 11:59 PM via email.

  3. Project presentation. Each group will have the opportunity to present their project in class on December 6. This presentation should take the form of a 15-minute (hard maximum) conference-style talk and describe the motivation for your work, what you did, and what you found. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk. Please allow 3-5 minutes for questions after the presentation.
     The proposed structure of your presentation:

    1. Introduction (describe the problem and motivation)
    2. Research questions
    3. Methodology: data collection, data cleanup, data mining, data analysis (statistics, machine learning), etc.
    4. Results (achieved, preliminary, or anticipated)
    5. Implications (why does this study matter? how can your findings be used?)
    6. Conclusion (summary, main contributions)
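
For the written report (deliverable 2 above), the \documentclass[sigconf]{acmart} line could be expanded into a skeleton along the following lines. This is only a minimal sketch, assuming the standard acmart class is installed; the title, author details, email address, and section names are hypothetical placeholders and not requirements of this course.

```latex
% Minimal acmart skeleton (a sketch, not an official course template).
\documentclass[sigconf]{acmart}
% Assumption: acmart also offers a `nonacm` option, i.e.
% \documentclass[sigconf,nonacm]{acmart}, which drops ACM-specific
% copyright material; whether to use it is not specified by the course.

\begin{document}

% Placeholder metadata: replace with your own project details.
\title{Project Title (placeholder)}

\author{Student Name (placeholder)}
\affiliation{%
  \institution{Carleton University}
  \city{Ottawa}
  \country{Canada}}
\email{student@example.com}

% In acmart, the abstract must appear before \maketitle.
\begin{abstract}
A short summary of the problem, approach, and findings.
\end{abstract}

\maketitle

\section{Introduction}
Problem statement and motivation.

\section{Methodology}
Data collection, cleanup, mining, and analysis.

\section{Results}
What you found.

\section{Conclusion}
Summary and main contributions.

\end{document}
```

Compiling without CCS concepts or keywords may produce warnings from acmart; these do not prevent a PDF from being generated.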

Contact

The best way to get in touch with me is via Discord or email: olga.baysal[at]carleton.ca.

University Policies

For information about Carleton's academic year, including registration and withdrawal dates, see Carleton's Academic Calendar.

Academic Accommodations

Carleton is committed to providing academic accessibility for all individuals. Please review the academic accommodations available to students here: Academic Accommodations.

Academic Integrity

Student Academic Integrity Policy. Every student should be familiar with the Carleton University Student Academic Integrity policy. A student found in violation of academic integrity standards may be sanctioned with penalties which range from a reprimand to receiving a grade of F in the course, or even being suspended or expelled from the University. Examples of punishable offences include plagiarism and unauthorized collaboration. Any such reported offences will be reviewed by the office of the Dean of Science.

The use of any AI system will be considered academic misconduct. This includes, but is not limited to, chatbots or code generators (e.g., ChatGPT) for projects (reports, models, etc.). An exception to the above rule is made for automated grammar and punctuation checking tools (such as Grammarly).

More information on this policy may be found on the ODS Academic Integrity page.

Plagiarism

As defined by Senate, "plagiarism is presenting, whether intentional or not, the ideas, expression of ideas or work of others as one's own". Such reported offences will be reviewed by the office of the Dean of Science. More information and standard sanction guidelines can be found here. Please note that content generated by an unauthorized A.I.-based tool is considered plagiarized material.