Assignment 4: Tidyverse, data manipulation

Due by 05:00 PM on Wednesday, November 30, 2022

To do yourself

dplyr

Read Wickham, Hadley, et al. “Welcome to the Tidyverse.” Journal of Open Source Software, 2019, or Wickham, Hadley. “Tidy data.” Journal of Statistical Software, 2014
introverse - alternate documentation for commonly used functions and concepts in base R and in the tidyverse
Dive into dplyr (tutorial)
Data Manipulation Using R (& dplyr) by Ram Narasimhan, PDF slides
Data Manipulation with dplyr brief tutorial
Aggregating and analyzing data with dplyr tutorial by Data Carpentry
Introduction to dplyr for Faster Data Manipulation in R tutorial and a 40 min video Hands-on dplyr tutorial for faster data manipulation in R
Animations of tidyverse verbs using R, the tidyverse, and gganimate - visual explanation of dplyr operations
Pivoting, reformatting data into long/wide formats
A graphical introduction to tidyr’s pivot
Reusing Tidyverse code - dplyr/tidyverse data manipulation lecture slides

ggplot2

O’Donoghue, Seán I., Benedetta Frida Baldi, Susan J. Clark, Aaron E. Darling, James M. Hogan, Sandeep Kaur, Lena Maier-Hein, et al. “Visualization of Biomedical Data.” Annual Review of Biomedical Data Science 1, no. 1 (July 20, 2018) - Visualization best practices (the use of length, area, color, etc.)
RStudio Webinars - Code and slides for RStudio webinars
ggplot2 tutorial/slides/code examples/references by Jenny Bryan
The R Graph Gallery, all graphs with code
Collection of ggplot2 materials
A Gentle Guide to the Grammar of Graphics with ggplot2 by Garrick Aden-Buie
Plotting/reporting best practices

To submit on Canvas

Create an R Markdown document with headers, text, and code that answers the questions and produces the requested visualizations. Submit both the Rmd file and the knitted PDF. Pay attention to code clarity, variable names, and comments.

dplyr

What is the difference between the read_xls() and read_xlsx() functions? What message do you get when you read an .xlsx file with read_xls()?
What does the skip argument do?
Do we need to refer to a sheet within an Excel file by its number, or can we refer to it by its name instead?
What does the guess_max argument do?
What happens if columns in the Excel worksheet are of different length?
How would you write into an Excel file? Demonstrate saving the mtcars dataset into an Excel file.
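As a starting point for experimenting with the questions above, here is a minimal sketch that reads an example workbook bundled with the readxl package. The file datasets.xlsx and its iris sheet come from readxl itself, not from this assignment, and the argument values are arbitrary choices meant to be changed.

```r
library(readxl)

# Path to an example workbook that ships with readxl (not assignment data)
path <- readxl_example("datasets.xlsx")

# List the sheet names available in the workbook
sheet_names <- excel_sheets(path)
sheet_names

# Read one sheet; vary sheet, skip, and guess_max and compare the results
iris_sheet <- read_excel(path, sheet = "iris", skip = 0, guess_max = 1000)
head(iris_sheet)
```

Rerunning the last call with different values for each argument, one at a time, makes it easier to see what each one controls.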
Use the starwars dataset that is loaded with the tidyverse. Accomplish the following in one long string of pipes.
- Keep only observations with weight (the mass variable) and height recorded. Also include the homeworld variable.
- Create a variable called bmi that holds the character’s BMI (look up the formula).
- Summarize the BMI variable, grouping observations by homeworld.
- Print this summary in decreasing order of average BMI.
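The general shape of such a pipeline can be sketched on a different dataset. The example below uses the built-in mtcars data rather than starwars; the derived variable mpg_per_wt and the result name ratio_by_cyl are hypothetical and chosen only for illustration.

```r
library(dplyr)

ratio_by_cyl <- mtcars %>%
  filter(!is.na(mpg), !is.na(wt)) %>%          # keep rows with both values recorded
  select(cyl, mpg, wt) %>%                     # keep the grouping variable as well
  mutate(mpg_per_wt = mpg / wt) %>%            # derived variable (illustrative only)
  group_by(cyl) %>%
  summarise(mean_ratio = mean(mpg_per_wt)) %>% # one row per group
  arrange(desc(mean_ratio))                    # largest average first

ratio_by_cyl
```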
Read the following data into R. The data come from the American Community Survey and give the population of three cities in Virginia between 2009 and 2012.

cities <- data.frame(name = rep(c("richmond", "norfolk", "charlottesville")), pop2009 = c(1202494,236071,191515), pop2010 = c(1235565,242143,197279), pop2011 = c(1248271,241943,199675), pop2012 = c(1260202,243056,210909))

In one long string of pipes, convert the data from wide format (the 2009 to 2012 population columns) to long format, naming the new column of populations pop. Then group by city, create a summary variable that is the ratio of the largest to the smallest population value for each city, and arrange by this ratio in decreasing order.
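A wide-to-long conversion followed by a grouped summary might look like the sketch below. It runs on a hypothetical toy table named sales, not on the ACS data; the column names store, q1, q2, quarter, and amount are assumptions made up for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in wide format (not the ACS data)
sales <- tibble(
  store = c("a", "b"),
  q1 = c(10, 40),
  q2 = c(30, 50)
)

sales_ratio <- sales %>%
  # Stack the q1 and q2 columns into name/value pairs
  pivot_longer(cols = c(q1, q2), names_to = "quarter", values_to = "amount") %>%
  group_by(store) %>%
  summarise(max_to_min = max(amount) / min(amount)) %>%
  arrange(desc(max_to_min))

sales_ratio
```

In this toy table, store a has the ratio 30 / 10 = 3 and store b has 50 / 40 = 1.25, so store a is listed first.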

ggplot2

Get the names of all packages installed on your computer (see the installed.packages() function). Split the package names into individual characters and calculate the frequency of each character, case sensitive. Create a horizontal barplot with letters/characters on the Y-axis and frequency on the X-axis. Sort it so that the most frequent characters (longest bars) are on top. Give each bar its own color. Do not show the legend.
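One possible way to build a sorted horizontal barplot is sketched below on a hypothetical vector of three words instead of the installed package names. The names words, chars, and char_counts are made up for the example, and coord_flip() with reorder() is just one of several ways to get this layout.

```r
library(ggplot2)

# Hypothetical toy input (not the installed package names)
words <- c("apple", "banana", "cherry")

# Split every word into single characters and count each character
chars <- unlist(strsplit(words, split = ""))
char_counts <- as.data.frame(table(chars), stringsAsFactors = FALSE)

# reorder() sorts bars by frequency; after coord_flip() the largest is on top
p <- ggplot(char_counts, aes(x = reorder(chars, Freq), y = Freq, fill = chars)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Character", y = "Frequency")
p
```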
Download the CDC US Birth dataset from https://github.com/fivethirtyeight/data/tree/master/births. Answer the following questions:
- Make barplots of the most-to-least popular a) day of week, b) day of month, c) month to give birth, irrespective of year. X-axis - date, Y-axis - number of births. Color by date.
- Visually demonstrate the total birth trend over the years. X-axis - years, Y-axis - total number of births per year.
- For each a) day of week, b) day of month, c) month, collect the number of births, irrespective of year. Plot them as a) scatterplots, b) boxplots, with the X-axis showing the corresponding date variable and the Y-axis the number of births. For scatterplots, fit a smoothing line with the default nonlinear fit. Color by date.
- Bonus question: Make the time series curve similar to the second plot of the accompanying article.
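The aggregate-then-plot pattern behind these questions can be sketched on synthetic data. The data frame below is randomly generated and is not the CDC dataset; the column names year, month, and births are assumptions, so check them against the downloaded CSV before reusing any of this.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in data; values are random and column names are assumed
set.seed(1)
births <- tibble(
  year = rep(2000:2003, each = 12),
  month = rep(1:12, times = 4),
  births = round(rnorm(48, mean = 10000, sd = 500))
)

# Total births per month, irrespective of year
births_by_month <- births %>%
  group_by(month) %>%
  summarise(total_births = sum(births))

# Scatterplot with the default smoother chosen by geom_smooth()
p_scatter <- ggplot(births, aes(x = month, y = births)) +
  geom_point(aes(color = factor(month)), show.legend = FALSE) +
  geom_smooth() +
  scale_x_continuous(breaks = 1:12)

# Boxplot needs a discrete X variable, hence factor(month)
p_box <- ggplot(births, aes(x = factor(month), y = births)) +
  geom_boxplot() +
  labs(x = "month")
```

Printing p_scatter or p_box draws the corresponding plot.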

Last updated on November 29, 2022
