Final project
Project milestones
Team members due Wednesday, October 1
Potential research topics (tentative) ideally before October 21
Research topic due Wednesday, October 22
Project proposal presentation (in-class) due Wednesday, November 5
Final presentation + Final presentation comments (in-class) due Monday, December 1
Written report, Reproducibility + organization due GT Official Final Exam Date
Introduction
TL;DR: Pick a data set and do a set of hypothesis tests. That is your final project.
The goal of the final project is for you to use statistical analysis to test a hypothesis and run a simple linear regression analysis on a data set of your own choosing. I will also provide some data sets. It is recommended to use a data set that already exists.
Choose the data based on your group’s interests or work you all have done in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set to analyze it in a meaningful way.
All analyses must be done in RStudio using Quarto and GitHub, and your analysis and written report must be reproducible.
Logistics
You will work on the project with your group. The primary deliverables for the project are
- a presentation about the data analysis, some hypothesis testing, and a simple linear regression
- a written, reproducible final report detailing your analysis
- a GitHub repository containing all work from the project
There are intermediate milestones throughout the semester to help you work towards the primary deliverables (these are ungraded).
Team members
Each group consists of three students. As soon as your group members are set, submit the group member list through Canvas. The next step is to create a GitHub repository where you will collaborate with your group members.
Potential research topics
The first task for your group is to discuss topics and develop potential research questions your team is interested in investigating for the project. You are only developing ideas at this point; you do not need to have a data set identified right now.
Develop three potential research topics. Include the following for each topic:
- A brief description of the topic
- A statement about your motivation for investigating this topic
- The potential audience(s), i.e., who might be most interested in this research?
- Two or three potential research questions you could analyze about this topic. (Note: These are draft questions at this point. You will finalize the questions in the next stage of the project.)
- Ideas about the type of data you might use to answer this question or potential data sets you’re interested in using. (Note: The goal is to generate ideas at this point, so it is fine if you have not identified any particular data sets at this point.)
Each group must seek feedback from me about these potential ideas at least once. I am happy to discuss these either via email or during my office hours. The purpose of my feedback is to ensure that your idea is feasible and to direct you towards potential data to use.
Research topic
The final decision about the topic that your group wants to pursue rests solely with the group members. You need to submit your final research topic via GitHub.
Data requirement
The data set must meet the following criteria:
- At least 500 observations
- At least 10 columns, such that at least 6 of the columns are useful and unique predictor variables.
  - e.g., identifier variables such as “name”, “ID number”, etc. are not useful predictor variables.
  - e.g., if you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.
- At least one variable that can be identified as a reasonable outcome variable.
  - The outcome variable must be quantitative.
Avoid the following types of data:
- Data that are likely to violate the independence condition, such as data with repeated measures, data collected over time, etc.
- Data sets in which there is no information about how the data were originally collected
- Data sets in which there are missing or unclear definitions about the observations and/or variables
Ask me if you’re unsure whether your data set meets the criteria.
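As a rough first pass, the size and outcome criteria above can be checked in R before you commit to a data set. The sketch below is only an illustration: the helper name `check_data_requirements`, its arguments, and the simulated `example_data` are hypothetical and not part of the course materials. It cannot judge whether predictors are useful and unique, or whether observations are independent; those still require reading the data documentation.

```r
# Hypothetical helper (not provided by the course): checks only the
# criteria that can be verified mechanically.
check_data_requirements <- function(data, outcome, min_rows = 500, min_cols = 10) {
  stopifnot(is.data.frame(data), outcome %in% names(data))
  c(
    enough_rows          = nrow(data) >= min_rows,        # at least 500 observations
    enough_columns       = ncol(data) >= min_cols,        # at least 10 columns
    quantitative_outcome = is.numeric(data[[outcome]])    # outcome must be quantitative
  )
}

# Simulated data standing in for a real data set
set.seed(1)
example_data <- as.data.frame(matrix(rnorm(600 * 10), nrow = 600, ncol = 10))
names(example_data) <- c("outcome", paste0("x", 1:9))

check_data_requirements(example_data, outcome = "outcome")
```

Each element of the returned logical vector corresponds to one criterion, so a `FALSE` points directly at the requirement that is not met.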
Submission
Write your responses in research-topic.qmd in your team’s project GitHub repo. Push the qmd and rendered pdf documents to GitHub by the deadline, Wednesday, October 22 at 11:59pm. There is no Gradescope submission.
Project proposal presentation
Project proposal presentations will take place in class Wednesday, November 5. Presentation order will be announced in advance.
Your team will do an in-person presentation that summarizes the research idea you’re pursuing, the data used, and hypotheses. It will also be an opportunity to receive feedback and suggestions as well as provide feedback to other teams. The presentation will focus on introducing the subject matter and research question, data description, outcome variables, and hypotheses. The presentation should be supported by slides that serve as a brief visual addition to the presentation. The presentation and slides will be graded for content and clarity.
You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!
The presentation is expected to last between 3 and 4 minutes. Because class time is limited, it may not exceed 4 minutes.
Slides
The slide deck should have no more than 3 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 3 slides.
Title Slide
Slide 1: Introduce the subject, motivation, and research question
Slide 2: Introduce the data set and data processing
Slide 3: List of outcome variables and hypotheses to be tested
Submission
You can submit the presentation slides in two ways:
- Put a PDF of the slides or Quarto slides in the presentation folder in your team’s GitHub repo.
- Put the URL to your slides in the README of the presentation folder. If you share the URL, please make sure permissions are set so that I can view the slides.
Slides must be submitted by the start of class on the day of presentations. We will use a classroom computer for the presentations.
Grading
The presentation is worth 2 points. It will be graded based on the following:
Content: The team told a unified story that clearly introduced the subject matter, research question, data and hypotheses.
Slides: The presentation slides were organized, included clear and informative visualizations, and were easily readable.
Presentation: The team’s communication style was clear and professional, and the team stayed within the time limit.
100% of the presentation grade will be the average of the teaching team scores.
Final presentation
Presentations will take place in class on the final day of class, Monday, December 1. Presentation order will be announced in advance.
Your team will do an in-person presentation that summarizes and showcases the work you’ve done on the project thus far. Because the presentations will take place while you’re still working on the project, it will also be an opportunity to receive feedback and suggestions as well as provide feedback to other teams. The presentation will focus on introducing the subject matter and research question, showcase key results from the exploratory data analysis, and discuss primary modeling strategies and/or results. The presentation should be supported by slides that serve as a brief visual addition to the presentation. The presentation and slides will be graded for content and clarity.
You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!
The presentation is expected to last between 4 and 5 minutes. Because class time is limited, it may not exceed 5 minutes.
Every team member is expected to speak in the presentation. Part of the grade will be whether every team member had a meaningful speaking role in the presentation.
Slides
The slide deck should have no more than 6 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
Title Slide
Slide 1: Introduce the subject, motivation, and research question
Slide 2: Introduce the data set
Slide 3: Highlights from the descriptive statistics
Slide 4: Results (if applicable)
Slide 5: Next steps and any questions you’d like to get feedback on
Submission
You can submit the presentation slides in two ways:
- Put a PDF of the slides or Quarto slides in the presentation folder in your team’s GitHub repo.
- Put the URL to your slides in the README of the presentation folder. If you share the URL, please make sure permissions are set so that I can view the slides.
Slides must be submitted by the start of your lab on the day of presentations. We will use a classroom computer for the presentations.
Grading
The presentation is worth 6 points. It will be graded based on the following:
Content: The team told a unified story that clearly introduced the subject matter, research question, and exploration of the data.
Slides: The presentation slides were organized, included clear and informative visualizations, and were easily readable.
Presentation: The team’s communication style was clear and professional. The team divided the time well and stayed within the 5-minute time limit, with each team member making a meaningful contribution to the presentation.
80% of the presentation grade will be the average of the teaching team scores and 20% will be the average of the peer scores.
Final presentation comments
This portion of the project is worth 2 points and will be assessed individually.
You will provide feedback on two teams’ presentations. The assigned teams and link to the feedback form will be available in advance of the presentations. Please provide all scores and comments by the end of the lab session. There will be a few minutes between each presentation to submit scores and comments.
The grade will be based on submitting the scores and comments for both of your assigned teams by the end of the presentation day.
Written report
Your written report must be completed in the written-report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
Before you finalize your write up, make sure the code chunks are not visible and all messages and warnings are suppressed.
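One possible way to do this, assuming the report is rendered with the knitr engine (the usual case for R code chunks in Quarto), is a setup chunk near the top of written-report.qmd that sets global chunk defaults. This is a sketch of one option, not a required setup; Quarto’s `execute` options in the document YAML are an alternative.

```r
# Global chunk defaults for the whole document; individual chunks can
# still override these if needed.
knitr::opts_chunk$set(
  echo    = FALSE,  # hide the code itself
  message = FALSE,  # suppress messages (e.g., package startup notes)
  warning = FALSE   # suppress warnings in the rendered output
)
```

After adding it, re-render the PDF and skim it once to confirm that no code, messages, or warnings appear.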
You will submit the PDF of your final report on GitHub.
The PDF you submit must match the .qmd in your GitHub repository exactly. The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including tables and visualizations, must be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis in your report.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will overwhelmingly be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some descriptive data analysis. Focus on the descriptive statistics that describe the main outcome variable and a few other interesting variables and relationships.
Methodology
This section includes a brief description of your hypothesis testing. State the null and alternative hypotheses, name the specific statistical test to be performed, and explain its underlying assumptions.
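As an illustration of what a fully specified test looks like, the sketch below runs a two-sample Welch t-test on simulated data. The test choice, the variable names `group_a` and `group_b`, and the simulated values are assumptions made for the example; your own test should follow from your research question and data.

```r
set.seed(42)

# Simulated stand-in data for two independent groups
group_a <- rnorm(250, mean = 50, sd = 10)
group_b <- rnorm(250, mean = 53, sd = 10)

# H0: the two population means are equal
# Ha: the two population means differ
# Assumptions: independent observations within and between groups, and
# approximately normal sampling distributions of the group means.
test_result <- t.test(group_a, group_b, alternative = "two.sided")

test_result$statistic  # t statistic
test_result$p.value    # p-value to compare with your chosen significance level
```

By default `t.test()` does not assume equal variances (Welch’s version), which is worth stating explicitly when you describe the test.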
Results
Describe the key results. The goal is not to interpret every single variable but rather to show that you are proficient in using the test to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
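For the simple linear regression part of the project, the key results usually come down to the slope estimate, its uncertainty, and what it says about the research question. The sketch below uses simulated data; the names `sim_data`, `x`, and `y` and the true coefficients are hypothetical and chosen only for illustration.

```r
set.seed(7)

# Simulated stand-in data: y depends linearly on x plus random noise
x <- runif(500, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(500, sd = 1)
sim_data <- data.frame(x = x, y = y)

# Fit the simple linear regression of y on x
fit <- lm(y ~ x, data = sim_data)

coef(summary(fit))          # estimates, standard errors, t statistics, p-values
confint(fit, level = 0.95)  # 95% confidence intervals for intercept and slope
```

In the report, interpret the slope and its interval in the context of your variables rather than pasting the raw output.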
Discussion + Conclusion
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Organization + formatting
This is an assessment of the overall presentation and formatting of the written report.
Reproducibility + organization
All written work (with the exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
The GitHub repo should have the following structure:
- README: Short project description and data dictionary
- written-report.qmd & written-report.pdf: Final written report
- research-topics.qmd & research-topics.pdf: Proposed research questions
- /data: Folder that contains the data set for the final project.
- project.Rproj: File specifying the RStudio project
- /presentation: Folder with the presentation slides or link to slides.
- .gitignore: File that lists all files that are in the local RStudio project but not the GitHub repo
- /.github: Folder for peer review issue template
Any other files should be neatly organized into clearly labeled folders.
Update the README of the project repo with your project title and team members’ names.
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.
Peer teamwork evaluation
There will be an opportunity to provide feedback to me about each team member’s contribution to the project. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of three students, if a student contributed less than about 16.7% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly.
Overall grading
The grade breakdown is as follows:
Total | 20 pts |
---|---|
Research topics | 1 pts |
Project proposal presentation | 2 pts |
Final presentation | 6 pts |
Final presentation comments | 2 pts |
Written report | 6 pts |
Reproducibility + organization | 3 pts |
Late work policy
There is no late work accepted on the draft report or presentation. Other components of the project may be accepted up to 48 hours late. A 10% late deduction will apply for each 24-hour period late.
Be sure to turn in your work early to avoid any technological mishaps.