Final project

Project milestones

Team members due Wednesday, October 1

Potential research topics (tentative): ideally before October 21

Research topic due Wednesday, October 22

Project proposal presentation (in-class) due Wednesday, November 5

Final presentation + Final presentation comments due Monday, December 1 (in-class)

Written report and Reproducibility + organization due on the GT official final exam date

Introduction

TL;DR: Pick a data set and do a set of hypothesis tests. That is your final project.

The goal of the final project is for you to use statistical analysis to test a hypothesis and to run a simple linear regression on a data set of your own choosing. I will also provide some data sets. It is recommended that you use a data set that already exists.

Choose the data based on your group’s interests or work you all have done in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like!) and apply them to a data set to analyze it in a meaningful way.

All analyses must be done in RStudio using Quarto and GitHub, and your analysis and written report must be reproducible.

Logistics

You will work on the project with your group. The primary deliverables for the project are:

  1. a presentation about the data analysis, some hypothesis testing, and a simple linear regression

  2. a written, reproducible final report detailing your analysis

  3. a GitHub repository containing all work from the project

There are intermediate milestones throughout the semester to help you work towards the primary deliverables (these are ungraded).

Team members

Each group consists of three students. As soon as your group members are set, submit the group member list through Canvas. Next, create a GitHub repository where you will collaborate with your group members.

Potential research topics

The first task for your group is to discuss topics and develop potential research questions your team is interested in investigating for the project. You are only developing ideas at this point; you do not need to have a data set identified right now.

Develop three potential research topics. Include the following for each topic:

  1. A brief description of the topic
  2. A statement about your motivation for investigating this topic
  3. The potential audience(s), i.e., who might be most interested in this research?
  4. Two or three potential research questions you could analyze about this topic. (Note: These are draft questions at this point. You will finalize the questions in the next stage of the project.)
  5. Ideas about the type of data you might use to answer this question or potential data sets you’re interested in using. (Note: The goal is to generate ideas at this point, so it is fine if you have not identified any particular data sets at this point.)

Each group must seek feedback from me about these potential ideas at least once. I am happy to discuss these either via email or during my office hours. My feedback is meant to ensure that your idea is feasible and to direct you towards potential data to use.

Research topic

The final decision about which topic to pursue rests solely with the group members. Submit your final research topic via GitHub.

Data requirement

The data set must meet the following criteria:

  • At least 500 observations

  • At least 10 columns, such that at least 6 of the columns are useful and unique predictor variables.

    • e.g., identifier variables such as “name”, “ID number”, etc. are not useful predictor variables.

    • e.g., if you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique predictors.

  • At least one variable that can be identified as a reasonable outcome variable.

    • The outcome variable must be quantitative.

Types of data sets to avoid

  • Data that are likely to violate the independence condition. Therefore, avoid data with repeated measures, data collected over time, etc.

  • Data sets in which there is no information about how the data were originally collected

  • Data sets in which there are missing or unclear definitions about the observations and/or variables

Ask me if you’re unsure whether your data set meets the criteria.

Submission

Write your responses in research-topic.qmd in your team’s project GitHub repo. Push the qmd and rendered pdf documents to GitHub by the deadline, Wednesday, October 22 at 11:59pm. There is no Gradescope submission.

Project proposal presentation

Important

Project proposal presentations will take place in class Wednesday, November 5. Presentation order will be announced in advance.

Your team will do an in-person presentation that summarizes the research idea you’re pursuing, the data used, and hypotheses. It will also be an opportunity to receive feedback and suggestions as well as provide feedback to other teams. The presentation will focus on introducing the subject matter and research question, data description, outcome variables, and hypotheses. The presentation should be supported by slides that serve as a brief visual addition to the presentation. The presentation and slides will be graded for content and clarity.

You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!

The presentation is expected to last between 3 and 4 minutes. It may not exceed 4 minutes, due to the limited time.

Slides

The slide deck should have no more than 3 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 3 slides.

  • Title Slide

  • Slide 1: Introduce the subject, motivation, and research question

  • Slide 2: Introduce the data set and data processing

  • Slide 3: List of outcome variables and hypotheses to be tested

Submission

You can submit the presentation slides in two ways:

  • Put a PDF of the slides or Quarto slides in the presentation folder in your team’s GitHub repo.

  • Put the URL to your slides in the README of the presentation folder. If you share the URL, please make sure permissions are set so that I can view the slides.

Important

Slides must be submitted by the start of class on the day of presentations. We will use a classroom computer for the presentations.

Grading

The presentation is worth 2 points. It will be graded based on the following:

  • Content: The team told a unified story that clearly introduced the subject matter, research question, data and hypotheses.

  • Slides: The presentation slides were organized, included clear and informative visualizations, and were easily readable.

  • Presentation: The team’s communication style was clear, professional, and within time limit.

100% of the presentation grade will be the average of the teaching team scores.

Final presentation

Important

Presentations will take place in class on the final day of class, Monday, December 1. Presentation order will be announced in advance.

Your team will do an in-person presentation that summarizes and showcases the work you’ve done on the project thus far. Because the presentations will take place while you’re still working on the project, it will also be an opportunity to receive feedback and suggestions as well as provide feedback to other teams. The presentation will focus on introducing the subject matter and research question, showcase key results from the exploratory data analysis, and discuss primary modeling strategies and/or results. The presentation should be supported by slides that serve as a brief visual addition to the presentation. The presentation and slides will be graded for content and clarity.

You can create your slides with any software you like (e.g., Keynote, PowerPoint, Google Slides, etc.). You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!

The presentation is expected to last between 4 and 5 minutes. It may not exceed 5 minutes, due to the limited time.

Every team member is expected to speak in the presentation. Part of the grade will be whether every team member had a meaningful speaking role in the presentation.

Slides

The slide deck should have no more than 6 content slides + 1 title slide to ensure you have enough time to discuss each slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.

  • Title Slide

  • Slide 1: Introduce the subject, motivation, and research question

  • Slide 2: Introduce the data set

  • Slide 3: Highlights from the descriptive statistics

  • Slide 4: Results (if applicable)

  • Slide 5: Next steps and any questions you’d like to get feedback on

Submission

You can submit the presentation slides in two ways:

  • Put a PDF of the slides or Quarto slides in the presentation folder in your team’s GitHub repo.

  • Put the URL to your slides in the README of the presentation folder. If you share the URL, please make sure permissions are set so that I can view the slides.

Important

Slides must be submitted by the start of class on the day of presentations. We will use a classroom computer for the presentations.

Grading

The presentation is worth 6 points. It will be graded based on the following:

  • Content: The team told a unified story that clearly introduced the subject matter, research question, and exploration of the data.

  • Slides: The presentation slides were organized, included clear and informative visualizations, and were easily readable.

  • Presentation: The team’s communication style was clear and professional. The team divided the time well and stayed within the 5-minute time limit, with each team member making a meaningful contribution to the presentation.

80% of the presentation grade will be the average of the teaching team scores and 20% will be the average of the peer scores.
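
The 80/20 weighting above can be illustrated with a short calculation. This is only a sketch of the arithmetic, written in Python for illustration (the project analyses themselves are still done in R); the function name and the example scores are hypothetical, and it assumes each score is given on the 6-point scale of the final presentation.

```python
def final_presentation_score(teaching_scores, peer_scores):
    """Combine scores as 80% teaching-team average plus 20% peer average."""
    teaching_avg = sum(teaching_scores) / len(teaching_scores)
    peer_avg = sum(peer_scores) / len(peer_scores)
    return 0.8 * teaching_avg + 0.2 * peer_avg


# Hypothetical scores, assumed to be out of 6 points.
# Teaching average is 5.75 and peer average is 5.5.
print(round(final_presentation_score([5.5, 6.0], [5.0, 5.5, 6.0]), 2))  # prints 5.7
```

Because the teaching team carries 80% of the weight, a one-point change in the peer average moves the combined score by only 0.2 points.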

Final presentation comments

This portion of the project is worth 2 points and will be assessed individually.

You will provide feedback on two teams’ presentations. The assigned teams and link to the feedback form will be available in advance of the presentations. Please provide all scores and comments by the end of the class session. There will be a few minutes between each presentation to submit scores and comments.

The grade will be based on submitting the scores and comments for both of your assigned teams by the end of the presentation day.

Written report

Important

Your written report must be completed in the written-report.qmd file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.

Before you finalize your write-up, make sure the code chunks are not visible and all messages and warnings are suppressed.

  • You will submit the PDF of your final report on GitHub.

  • The PDF you submit must match the .qmd in your GitHub repository exactly. The mandatory components of the report are listed below, and you are free to add sections as necessary. The report, including tables and visualizations, must be no more than 10 pages long. There is no minimum page requirement; however, the report should comprehensively address all parts of the analysis.

  • Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.

  • You are welcome to include an appendix with additional work at the end of the written report document; however, grading will overwhelmingly be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
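
One way to meet the requirement above (code chunks hidden, messages and warnings suppressed) is to set document-wide execution options in the YAML header of written-report.qmd. The following is a minimal sketch, assuming the report is rendered to PDF with the knitr engine; the title is a placeholder, and individual chunks can still override these defaults.

```yaml
---
title: "Project title"   # placeholder, replace with your own
format: pdf
execute:
  echo: false            # hide the code in every chunk
  warning: false         # suppress warnings in the output
knitr:
  opts_chunk:
    message: false       # suppress messages (knitr chunk option)
---
```

Setting these once in the header keeps the .qmd and the rendered PDF in sync, since no chunk has to be edited by hand before rendering.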

Introduction and data

This section includes an introduction to the project motivation, data, and research question. Describe the data and the definitions of key variables. It should also include some descriptive data analysis. Focus on the descriptive statistics that describe the main outcome variable and a few other interesting variables and relationships.

Methodology

This section includes a brief description of your hypothesis testing. State the null and alternative hypotheses, the specific statistical test to be performed, and its underlying assumptions.

Results

Describe the key results. The goal is not to interpret every single variable but rather to show that you are proficient in using the test to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.

Discussion + Conclusion

In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.

Organization + formatting

This is an assessment of the overall presentation and formatting of the written report.

Reproducibility + organization

All written work (with the exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.

The GitHub repo should have the following structure:

  • README: Short project description and data dictionary

  • written-report.qmd & written-report.pdf: Final written report

  • research-topics.qmd & research-topics.pdf: Proposed research questions

  • /data: Folder that contains the data set for the final project.

  • project.Rproj: File specifying the RStudio project

  • /presentation: Folder with the presentation slides or link to slides.

  • .gitignore: File that lists all files that are in the local RStudio project but not the GitHub repo

  • /.github: Folder for peer review issue template

  • Any other files should be neatly organized into clearly labeled folders.

Update the README of the project repo with your project title and team members’ names.

Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, and all text in the README should be easily readable.

Peer teamwork evaluation

There will be an opportunity to provide feedback to me about each team member’s contribution to the project. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly.
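
The threshold above is half of an equal share of the work, so it depends on team size. The sketch below works it out for teams of three and four; it is written in Python purely to show the arithmetic, and the function name is hypothetical.

```python
def contribution_threshold(team_size):
    """Return half of an equal share of the work, as a percentage."""
    expected_share = 100 / team_size  # equal split among team members
    return expected_share / 2


for size in (3, 4):
    print(size, round(contribution_threshold(size), 1))
# prints:
# 3 16.7
# 4 12.5
```

For a group of three students, the threshold is therefore about 16.7% of the total effort; the 12.5% figure applies to a team of four.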

Overall grading

The grade breakdown is as follows:

Research topics: 1 pt
Project proposal presentation: 2 pts
Final presentation: 6 pts
Final presentation comments: 2 pts
Written report: 6 pts
Reproducibility + organization: 3 pts
Total: 20 pts

Late work policy

No late work is accepted for the draft report or presentation. Other components of the project may be accepted up to 48 hours late. A 10% late deduction will apply for each 24-hour period late.

Be sure to turn in your work early to avoid any technological mishaps.
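
As a rough illustration of the late policy above, the sketch below computes the deduction for components that accept late work. It is written in Python only to show the arithmetic and rests on two assumptions that the policy does not spell out: any started 24-hour period counts as a full period, and the deduction is a percentage of that component's points. The function name is hypothetical; ask if you are unsure how the policy applies to your case.

```python
import math


def late_deduction_percent(hours_late):
    """Return the percentage deducted, or None if the work is too late to accept."""
    if hours_late <= 0:
        return 0  # on time
    if hours_late > 48:
        return None  # beyond the 48-hour window, not accepted
    # Assumption: a partial 24-hour period counts as a full period.
    return 10 * math.ceil(hours_late / 24)


print(late_deduction_percent(5))   # prints 10
print(late_deduction_percent(30))  # prints 20
print(late_deduction_percent(49))  # prints None
```

Under these assumptions, the largest deduction on accepted late work is 20%.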