Assignment 4: Tidyverse, data manipulation

Due by 05:00 PM on Wednesday, November 30, 2022

To do yourself

dplyr

ggplot2

To submit on Canvas

Create an R Markdown document with headers, text, and code that answers the questions and produces the requested visualizations. Submit both the Rmd file and the knitted PDF. Pay attention to code clarity, variable names, and comments.

dplyr

  • What is the difference between the read_xls() and read_xlsx() functions? What message do you get when you read an .xlsx file with read_xls()?

  • What does the skip argument do?

  • Do we need to refer to a sheet within an Excel file by number, or can we refer to it by sheet name instead?

  • What does the guess_max argument do?

  • What happens if columns in the Excel worksheet have different lengths?

  • How would you write into an Excel file? Demonstrate saving the mtcars dataset into an Excel file.

  • Use the starwars dataset that is loaded with the tidyverse. Accomplish the following in one long string of pipes.

    • Keep only observations with mass (weight) and height recorded. Also include the homeworld variable.
    • Create a variable called bmi that holds the character’s BMI (search for the formula).
    • Summarize the BMI variable, grouping observations by homeworld.
    • Print this summary in decreasing order of average BMI.
  • Read the following data into R. The data come from the American Community Survey and give the population of three cities in Virginia from 2009 to 2012.

      cities <- data.frame(
        name = rep(c("richmond", "norfolk", "charlottesville")),
        pop2009 = c(1202494, 236071, 191515),
        pop2010 = c(1235565, 242143, 197279),
        pop2011 = c(1248271, 241943, 199675),
        pop2012 = c(1260202, 243056, 210909)
      )

    In one long string of pipes:

    • Convert the data from wide format (the 2009 to 2012 population columns) to long format, naming the new population column pop.
    • Group by city.
    • Create a summary variable that is the ratio of the largest to the smallest population value for each city.
    • Arrange by this ratio in decreasing order.
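Hint: the wide-to-long reshaping pattern asked for in the last question can be sketched on a small made-up table. This is only an illustration under stated assumptions: the `sales` data frame, its column names, and its values are hypothetical and not part of the assignment, and `pivot_longer()` assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not assignment data): one row per store, one column per year
sales <- data.frame(
  store = c("a", "b"),
  y2020 = c(10, 40),
  y2021 = c(20, 50)
)

sales_summary <- sales %>%
  # Stack the year columns into a single "amount" column
  pivot_longer(cols = starts_with("y"), names_to = "year", values_to = "amount") %>%
  group_by(store) %>%
  # Ratio of the largest to the smallest value within each store
  summarise(ratio = max(amount) / min(amount)) %>%
  # Largest ratio first
  arrange(desc(ratio))

print(sales_summary)
```

The same chain of verbs (reshape, group, summarise, arrange) should carry over to other wide tables once the column selection and names are adapted.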

ggplot2

  • Get the names of all packages installed on your computer (see the installed.packages() function). Split the package names into individual characters and calculate the frequency of each character, case sensitive. Create a horizontal barplot with characters on the Y-axis and frequency on the X-axis. Sort it so that the most frequent characters (longest bars) are on top. Give each bar its own color. Do not show a legend.

  • Download the CDC US Birth dataset from https://github.com/fivethirtyeight/data/tree/master/births. Answer the following questions:

    • Make barplots of the most-to-least popular a) day of week, b) day of month, c) month to give birth, irrespective of year. X-axis - date, Y-axis - number of births. Color by date.
    • Visually demonstrate the total birth trend over the years. X-axis - years, Y-axis - total number of births per year.
    • For each a) day of week, b) day of month, c) month, collect the number of births, irrespective of year. Plot them as a) scatterplots, b) boxplots, with the X-axis showing the corresponding date unit and the Y-axis showing the number of births. For scatterplots, fit a smoothing line with the default nonlinear fit. Color by date.
    • Bonus question: Make the time series curve similar to the second plot of the accompanying article.
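Hint: the sorted horizontal barplot requested in the first ggplot2 question can be sketched on made-up counts. The `counts` data frame and its values are hypothetical and not derived from any real data; the sketch also assumes ggplot2 3.3.0 or later, where `geom_col()` infers a horizontal orientation when the discrete variable is mapped to the Y-axis.

```r
library(ggplot2)

# Hypothetical counts (not real data): three items with their frequencies
counts <- data.frame(
  item = c("x", "y", "z"),
  n = c(5, 12, 8)
)

# reorder() sorts the factor levels by n in increasing order; the last level
# is drawn at the top of the Y-axis, so the longest bar ends up on top
p <- ggplot(counts, aes(x = n, y = reorder(item, n), fill = item)) +
  geom_col(show.legend = FALSE) +  # one fill color per bar, legend suppressed
  labs(x = "Frequency", y = "Item")

print(p)
```

Mapping `fill` to the same variable as the bars gives each bar its own color, and `show.legend = FALSE` drops the legend that would otherwise repeat the axis labels.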