The course will be lecture-based and will also offer some hands-on tutorials. The project component will be flexible and will involve data collection, manipulation, and analysis. For further details on the course content, please refer to the Course Outline (pdf). This course is offered by the School of Computer Science at Carleton University.
Seminars are held every Thursday from 10:05 AM to 12:55 PM via Zoom (see Discord for links).
Announcements
- We will use Discord for course communication, announcements, and reminders. You will receive an invitation to join the server by email (please email the instructor if you run into any issues).
- Welcome to DATA 5000! Lectures start on Thursday, September 05, 2024.
Content Overview
The course covers topics relevant to data science: working with data, exploratory data analysis, data mining, and machine learning. The concepts are illustrated using the R language. Students also receive hands-on tutorials (e.g., Tableau). Students are evaluated on their course projects.
Instructor
Olga Baysal
Email: olga.baysal[at]carleton.ca
Office hours: by appointment via Zoom or Discord
Website: http://olgabaysal.com/
Tentative Schedule
This schedule is tentative and may change depending on how the class progresses.
Thursday, September 5 - Lecture 1: What is Data Science?
Thursday, September 12 - Lecture 2: Working with Data.
Thursday, September 19 - Project proposal presentations.
Thursday, September 26 - Lecture 3: Visualization and Exploration.
Thursday, October 3 - Lecture 4: Data Mining and Machine Learning I.
Thursday, October 10 - No Class (I am away at a conference)
Thursday, October 17 - Lecture 5: Machine Learning II.
Thursday, October 24 - No Class (Fall Break)
Thursday, October 31 - Guest lectures: Prof. Matthew Holden (School of Computer Science) and Prof. James Green (Systems and Computer Engineering).
Thursday, November 7 - Paper presentations.
Thursday, November 14 - Tableau tutorial by Josh Gillmore.
Thursday, November 21 - Guest lectures: Prof. Majid Komeili (School of Computer Science) and Prof. Tracey Lauriault (School of Journalism and Communication).
Thursday, November 28 - IBM Cognos Analytics tutorial by Matthew Denham.
Thursday, December 5 - Project presentations.
Evaluation
- Project proposal presentation: 15% (September 19)
- Project proposal: 10% (due September 20, 11:59 PM)
- Paper selection: 0% (due October 17, 11:59 PM)
- Paper presentation: 15% (November 7)
- Project presentation: 10% (December 5)
- Project report: 50% (due December 12, 11:59 PM)
Paper presentation
Each group needs to choose a conference publication on the topic of Data Science to present in class (15-minute talk). Paper selection is due October 17, 2024. The paper should be an 8-12 page conference proceedings paper (e.g., from the IEEE International Conference on Data Science, the SIGKDD/KDD Conference, etc.) and is subject to the instructor's approval. Papers will be presented on November 7.
Project proposal
The project forms an integral part of this course and is to be completed in groups of two to three students. Each group should have one technical expert (a student from Computer Science, Systems and Computer Engineering, Information Technology, Physics, or Chemistry) and one or two domain experts (e.g., from Communication, Geography, Biology, History, Psychology, Economics, Business, Health Sciences, Cognitive Science, Public Policy and Administration, or International Affairs). Domain experts may contribute by finding the right problem, justifying why it is important to study, and extracting the value and implications of the work. Technical experts do the heavy lifting of building models. The main goal is for students to learn how to work on a multidisciplinary team: for domain experts, this means learning technical terminology; for technical experts, it means learning how to work fruitfully with domain experts.
You have two options: you can choose to mine and analyze one of the provided datasets or come up with an idea of your own that relates to the course material. In either case, the project topic will require my approval (via the project proposal).
Before you undertake your project, you will need to submit a proposal for approval. The proposal should be short (a PDF of at most two pages in ACM format). It should include a problem statement, the motivation for the project, and the set of objectives you aim to accomplish. I will read the proposals and provide comments. The proposal is due on September 20 by 11:59 PM via email to Olga.
Project presentation
Each group will have the opportunity to present their project in class on December 5. The presentation should take the form of a 20-minute (hard maximum) conference-style talk that describes the motivation for your work, what you did, and what you found. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk. Please allocate 3-5 minutes for questions after the project has been presented.
The proposed structure of your presentation:
- Introduction (describe the problem and motivation)
- Research questions
- Methodology: data collection, data cleanup, data mining, data analysis (statistics, machine learning), etc.
- Results (achieved, preliminary, or anticipated)
- Implications (why does this study matter? how can your findings be used?)
- Conclusion (summary, main contributions)
Project report
The required length of the written report varies from project to project (8-10 pages, double-column format). All reports must be formatted in either the ACM or IEEE format and submitted as a PDF. The report constitutes 50% of the course grade and is due on December 12 by 11:59 PM via email.
Datasets
- Open Ottawa, the City of Ottawa
- Open Data Ontario
- Open Data @ Government of Canada
- Statistics Canada
- Dataset search by Google
- GitHub repository via GHTorrent
- Machine learning data set repo
- Kaggle Datasets
- Kaggle Competitions
- IAPR
- Datamob
- KDnuggets:
- in R
Resources
The following books are suggested but not required:
- "Doing Data Science: Straight Talk From the Frontline" by Cathy O'Neil and Rachel Schutt, O'Reilly Media, 2013
- "Data Mining and Business Analytics with R" by Johannes Ledolter, Wiley, 2013
- "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking" by Foster Provost and Tom Fawcett, O'Reilly Media, 2013
The following books are good references for data mining and machine learning algorithms:
- "An Introduction to Statistical Learning: with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013
- "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, 2011
The following are good references for R (just to name a few):
- "Cookbook for R" by Winston Chang
- "The R Inferno" by Patrick Burns
- Quick-R
- "Software for Data Analysis: Programming with R" by John Chambers, Springer, 2008
Contact
The best way to get in touch with me is via email: olga.baysal[at]carleton.ca. However, for any public course-related communication, we will use the Discord server. For private messages, please email me directly or send a DM on Discord.
University Policies
For information about Carleton's academic year, including registration and withdrawal dates, see Carleton's Academic Calendar.
Academic Accommodations
Carleton is committed to providing academic accessibility for all individuals. Please review the academic accommodations available to students here: Academic Accommodations.
Academic Integrity
Student Academic Integrity Policy. Every student should be familiar with the Carleton University Student Academic Integrity policy. A student found in violation of academic integrity standards may be sanctioned with penalties which range from a reprimand to receiving a grade of F in the course, or even being suspended or expelled from the University. Examples of punishable offences include plagiarism and unauthorized collaboration. Any such reported offences will be reviewed by the office of the Dean of Science.
The use of any AI system will be considered academic misconduct. This includes, but is not limited to, chatbots or code generators (e.g., ChatGPT) for projects (reports, models, etc.). An exception to the above rule is made for automated grammar and punctuation checking tools (such as Grammarly).
More information on this policy may be found on the ODS Academic Integrity page.
Plagiarism
As defined by Senate, "plagiarism is presenting, whether intentional or not, the ideas, expression of ideas or work of others as one's own". Such reported offences will be reviewed by the office of the Dean of Science. More information and standard sanction guidelines can be found here. Please note that content generated by an unauthorized AI-based tool is considered plagiarized material.