Final Project

Due by 11:59 PM on Friday, December 16, 2022

Overview

The final project consists of analyzing a dataset of your choice. The goal is for you to demonstrate that you can ask meaningful statistical questions and answer them with data analysis results and visualizations. You should demonstrate proficiency in using R/Tidyverse and in interpreting and presenting results. You should also critique the data and your methods and suggest ways to improve your analysis.

Data

The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your interests or work in other courses or research projects. Use Data_notes to help with your dataset selection. Your dataset must have at least 50 observations and 10 to 20 variables (exceptions can be made, but you must speak with me first). The variables should include both categorical and continuous numerical variables. Do not reuse datasets from class or homework assignments, and do not use built-in R datasets.

Proposal (Due November 23, 2022)

Submit on Canvas a brief description of your proposed project. Include:

  • Introduction - Introduce your dataset and general research questions.
  • Data - Where the data came from, how it was collected, what the cases are, what the variables are, etc.
  • Analysis plan - What (outcome/response) variables will you use to answer your questions? What comparison groups will you use, if applicable? What statistical methods are needed to support your research? How do you plan to visualize your results?

Components

Write functions that perform data analysis and visualization. As input, your functions should accept your dataset and arguments (e.g., variables to subset the data). As an output, a function should produce meaningful statistical output and/or visualization. Make your functions general purpose, where possible, e.g., depending on input variables, a function may produce different test results or visualizations.
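
As one illustration of a general-purpose function, the sketch below chooses a test based on the number of groups in the input. The function name `compare_groups`, its arguments, and the simulated data are all hypothetical, and the choice between a Welch t-test and a one-way ANOVA is only one reasonable design, not a required one.

```r
# Hypothetical helper: compare a numeric outcome across the levels of a
# grouping variable. Assumes `outcome` and `group` are syntactic column
# names passed as strings.
compare_groups <- function(data, outcome, group) {
  stopifnot(is.data.frame(data), outcome %in% names(data), group %in% names(data))
  stopifnot(is.numeric(data[[outcome]]))

  # Treat the grouping column as categorical, whatever its stored type.
  data[[group]] <- factor(data[[group]])
  n_groups <- nlevels(data[[group]])
  if (n_groups < 2) stop("`group` must have at least two levels.")

  model_formula <- reformulate(group, response = outcome)
  if (n_groups == 2) {
    # Two groups: Welch two-sample t-test (the t.test() default).
    list(method = "Welch two-sample t-test",
         result = t.test(model_formula, data = data))
  } else {
    # Three or more groups: one-way ANOVA.
    list(method = "One-way ANOVA",
         result = summary(aov(model_formula, data = data)))
  }
}

# Simulated toy data, for illustration only.
set.seed(1)
toy <- data.frame(
  score = rnorm(60, mean = 50, sd = 10),
  arm = rep(c("a", "b", "c"), each = 20),
  sex = rep(c("f", "m"), times = 30)
)
three_groups <- compare_groups(toy, "score", "arm")  # dispatches to ANOVA
two_groups <- compare_groups(toy, "score", "sex")    # dispatches to t-test
```

Returning the method name alongside the result makes it easy for a report to state which test was run.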

Make a package that includes your functions. Your package should be installable from GitHub. Document your functions using Roxygen syntax. Use tidyverse packages where possible.
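
A minimal sketch of a Roxygen-documented function is shown below. The function `summarize_by_group`, its arguments, and the repository path in the comments are hypothetical placeholders. The sketch assumes the dplyr package is installed and that R is version 4.1 or later (for the native pipe).

```r
#' Summarize a numeric variable by group
#'
#' @param data A data frame.
#' @param outcome Name of a numeric column in `data`, as a string.
#' @param group Name of a grouping column in `data`, as a string.
#' @return A tibble with one row per group and columns `n`, `mean`, and `sd`.
#' @examples
#' toy <- data.frame(height = c(150, 160, 170, 180),
#'                   site = c("a", "a", "b", "b"))
#' summarize_by_group(toy, "height", "site")
#' @importFrom rlang .data
#' @export
summarize_by_group <- function(data, outcome, group) {
  data |>
    dplyr::group_by(dplyr::across(dplyr::all_of(group))) |>
    dplyr::summarise(
      n = dplyr::n(),
      mean = mean(.data[[outcome]], na.rm = TRUE),
      sd = stats::sd(.data[[outcome]], na.rm = TRUE),
      .groups = "drop"
    )
}

# Illustrative toy data (not a real dataset).
toy <- data.frame(height = c(150, 160, 170, 180),
                  site = c("a", "a", "b", "b"))
group_summary <- summarize_by_group(toy, "height", "site")

# Typical package workflow, run from the package root (not executed here):
# devtools::document()  # regenerates man/ and NAMESPACE from the comments
# remotes::install_github("your-username/yourpackage")  # hypothetical path
```

Calling functions as `dplyr::summarise()` instead of attaching the package with `library()` is the usual convention inside package code.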

Create functions that cover each of the following:

  • Downloading/reading your dataset
  • Summarizing your dataset
  • Statistical analysis (e.g., tests for differences between groups, regression)
  • Visualization (e.g., histograms, scatterplots with fit lines)
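
The reading and visualization items in the list above might be sketched as follows. The names `read_project_data` and `plot_variables` are hypothetical, the data are simulated, and the sketch assumes the readr and ggplot2 packages are installed. The plotting function returns a histogram when given one variable and a scatterplot with a linear fit line when given two.

```r
library(readr)
library(ggplot2)

# Hypothetical reader: `source` may be a local path or a URL to a CSV file.
read_project_data <- function(source) {
  readr::read_csv(source, show_col_types = FALSE)
}

# Hypothetical plotting helper: column names are passed as strings.
plot_variables <- function(data, x, y = NULL) {
  if (is.null(y)) {
    # One variable: histogram.
    ggplot(data, aes(x = .data[[x]])) +
      geom_histogram(bins = 30) +
      labs(x = x, y = "Count")
  } else {
    # Two variables: scatterplot with a linear fit line.
    ggplot(data, aes(x = .data[[x]], y = .data[[y]])) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ x) +
      labs(x = x, y = y)
  }
}

# Simulated toy data written to a temporary file, standing in for a download.
set.seed(2)
toy <- data.frame(hours = runif(40, 0, 10))
toy$score <- 50 + 3 * toy$hours + rnorm(40, sd = 5)
toy_path <- tempfile(fileext = ".csv")
readr::write_csv(toy, toy_path)

loaded <- read_project_data(toy_path)
hist_plot <- plot_variables(loaded, "hours")
scatter_plot <- plot_variables(loaded, "hours", "score")
```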

Reporting

  • Create a GitHub repository containing the code.
  • Create an RMarkdown document that demonstrates the use of your functions to analyze your dataset. The document must be knittable on any computer; that is, it should download the data, perform the analyses, and produce a PDF report.
  • Submit the RMarkdown document, which should demonstrate loading your package, preparing the data, and using your functions. Include the link to your GitHub package repository.
  • Submit a compiled PDF version of the document.
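
One common obstacle to knitting on another computer is a missing package. The sketch below shows one way a setup chunk might check for dependencies before the analysis runs. The helper `missing_packages`, the package list, the repository path, and the file name `report.Rmd` are all hypothetical placeholders.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Replace with the packages your report actually uses.
required <- c("stats", "utils")
to_install <- missing_packages(required)

# Install only what is missing. Setting `repos` explicitly avoids the
# "no CRAN mirror selected" error in non-interactive knitting sessions.
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Installing your own package and rendering the report (not executed here):
# remotes::install_github("your-username/yourpackage")  # hypothetical path
# rmarkdown::render("report.Rmd", output_format = "pdf_document")
```

Note that producing a PDF also requires a LaTeX installation on the machine that knits the document.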

Grading

Your project will be assessed on the following criteria:

  • Code - Does the package install without errors? Does the RMarkdown document compile on a local machine? Is the code sufficiently commented?
  • Content - What is the quality of the research and/or policy questions, and how relevant are the data to those questions?
  • Correctness - Are statistical procedures carried out and explained correctly?
  • Writing and Presentation - What is the quality of the statistical explanations, writing, and visualization? Is the code easy to understand, and are variable names self-explanatory?
  • Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?

A general breakdown of scoring is as follows:

  • 90%-100% - Outstanding effort. Students understand how to apply all statistical concepts, can put the results into a compelling argument, can identify weaknesses in the arguments, and can clearly communicate the results to others.
  • 80%-89% - Good effort. Students understand most of the concepts, put together adequate arguments, identify some weaknesses of their argument, and communicate most results clearly to others.
  • 70%-79% - Passing effort. Students misunderstand concepts in several areas, have some omissions in putting results together in a cogent argument, and sometimes communicate results unclearly.
  • 60%-69% - Struggling effort. Students are making some effort but misunderstand many concepts and are unable to put together a cogent argument. Communication of results is unclear.
  • Below 60% - Students are not making a sufficient effort.