class: center, middle, inverse, title-slide

.title[
# R preliminaries
]
.author[
### Mikhail Dozmorov
]
.institute[
### Virginia Commonwealth University
]
.date[
### 10-19-2022
]

---
## R expressions, function calls, and objects

According to John Chambers, one of the creators of R’s precursor S:

- Everything that exists in R is an **object**.
- Everything that happens in R is a **call to a function**.

---
## Assignment operator

- We often need to save a function's result or output. For this we use the assignment operator ` <- `, which is preferred over ` = `.

```r
scores <- mtcars
```

Now we can use `scores` as an argument to other functions. For example, compute summary statistics for the first four columns of the data:

```r
summary(scores[1:4]) # First four columns
```

```
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
```

Use `Alt + -` (Win) or `Option + -` (Mac) in RStudio to quickly insert ` <- `.

---
## Variables

- **Scalars** (0-dimensional): `a <- 42`, `b <- a / 7`
- **Vectors** (1-dimensional): `b <- c(12, 14, 16)`
    - Access a vector element as `b[2]` (returns 14)
- **Matrices** (2-dimensional):

```r
mtx <- matrix(data = c(3, 1, 3, 2, 3, 2), ncol = 2)
mtx
```

```
##      [,1] [,2]
## [1,]    3    2
## [2,]    1    3
## [3,]    3    2
```

---
## Variable names

- Common naming styles: camelCase, snake_case, PascalCase.
- Avoid naming variables after existing functions. E.g., `c` is a bad variable name because `c()` is a function for combining values. See its help page with `?c`.
- With auto-completion in RStudio, you don't need to worry about variable name length, so choose names that are self-explanatory.
- Check the [Google R Style Guide](https://google.github.io/styleguide/Rguide.html), [The Tidyverse Style Guide](http://adv-r.had.co.nz/Style.html), and [Bioconductor Coding Style](https://bioconductor.org/developers/how-to/coding-style/).

.small[ https://betterprogramming.pub/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841 ]

---
## Variable types

```r
# numeric: real or decimal numbers, sometimes referred to as “double”
# integer: a subset of numeric in which numbers are stored as integers (e.g., 2L)
a <- 2
# character: sometimes referred to as string data, written inside quotes
a <- "2"
# logical: Boolean data (TRUE and FALSE)
a <- TRUE
```

- complex: complex numbers with real and imaginary parts (e.g., 1 + 4i)
- raw: bytes of data (machine-readable, but not human-readable)

Auxiliary functions

```r
class(a)
str(a)
is.numeric(a)   # TRUE if the type matches; same idea for is.character()
as.numeric("2") # Attempt to convert types
```

---
## Factors

- Factors are how R represents categorical data.
- There are two kinds of factors:
    - `factor()` - used for nominal data ("Cats", "Cats", "Dogs", "Birds")
    - `ordered()` - used for ordinal data ("First", "Second", "Second", "Third")

```r
factor(c("Cats", "Cats", "Dogs", "Birds"))
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
```

```r
ordered(c("First", "Second", "Second", "Third"))
```

```
## [1] First  Second Second Third
## Levels: First < Second < Third
```

---
## Factors

Auxiliary functions

- `levels()` - get the levels of a factor. `levels` is also an argument of `factor()` that sets the level order manually.
- `relevel()` - move a chosen reference level to the first position.
- `is.factor()`, `as.factor()` - test whether an object is a factor, or convert it to one.
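As a minimal sketch of the `levels` argument mentioned above, the example below builds the same factor with the default (alphabetical) level order and with a manually chosen order. The `sizes` vector and its values are hypothetical, made up for illustration.

```r
# Hypothetical data, not part of the original slides
sizes <- c("small", "large", "medium", "small")

# Default: levels are sorted alphabetically
default_order <- factor(sizes)

# Manual: the levels argument of factor() fixes the order explicitly
manual_order <- factor(sizes, levels = c("small", "medium", "large"))

levels(default_order)
levels(manual_order)
is.factor(manual_order)
```

Setting the order at creation time leaves the data values untouched, which is usually what is wanted when only the ordering of categories should change.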
```r
a <- factor(c("Cats", "Cats", "Dogs", "Birds"))
a
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Birds Cats Dogs
```

```r
relevel(a, ref = "Cats")
```

```
## [1] Cats  Cats  Dogs  Birds
## Levels: Cats Birds Dogs
```

```r
# Caution: assigning to levels() renames the levels, so the values change too
levels(a) <- rev(levels(a))
a
```

```
## [1] Cats  Cats  Birds Dogs
## Levels: Dogs Cats Birds
```

---
## Data frames

- **Data frames**: tables or 2-dimensional arrays. Think of them as matrices whose columns can hold different data types.
- The column names should be non-empty.
- Columns should be the same length.
- The row names should be unique.
- The data stored in a data frame can be numeric, factor, or character.

```r
dat <- data.frame(Column.1 = c(3, 1, 3), Column.2 = c("2", "3", "2"))
dat
```

```
##   Column.1 Column.2
## 1        3        2
## 2        1        3
## 3        3        2
```

---
## Data frames

Auxiliary functions

```r
dim(dat)
```

```
## [1] 3 2
```

```r
nrow(dat)
```

```
## [1] 3
```

```r
ncol(dat)
```

```
## [1] 2
```

```r
length(dat)
```

```
## [1] 2
```

```r
colnames(dat)
```

```
## [1] "Column.1" "Column.2"
```

```r
rownames(dat)
```

```
## [1] "1" "2" "3"
```

---
## Subsetting elements in a data frame

```r
dat[3, 2] # [row, column] indices
```

```
## [1] "2"
```

```r
dat[3, "Column.2"] # Address by column name
```

```
## [1] "2"
```

```r
dat$Column.2[3] # Use $ shortcut to access column by name
```

```
## [1] "2"
```

```r
# Compare column classes
class(dat$Column.1)
```

```
## [1] "numeric"
```

```r
class(dat$Column.2)
```

```
## [1] "character"
```

---
## Subsetting elements in a data frame

```r
index <- dat$Column.1 > 1
index
```

```
## [1]  TRUE FALSE  TRUE
```

```r
dat[index, ]
```

```
##   Column.1 Column.2
## 1        3        2
## 3        3        2
```

```r
index <- dat$Column.1 == 1
dat[index, "Column.2"]
```

```
## [1] "3"
```

```r
dat[index, "Column.2", drop = FALSE] # drop = FALSE keeps a data frame
```

```
##   Column.2
## 2        3
```

---
## Inspecting data.frame objects

There are several built-in functions that are useful for working with data frames.

* Content:
    * `head()`: shows the first few rows.
    * `tail()`: shows the last few rows.
* Size:
    * `dim()`: returns a 2-element vector (the dimensions of the object): the number of rows first, then the number of columns.
    * `nrow()`: returns the number of rows.
    * `ncol()`: returns the number of columns.

---
## Inspecting data.frame objects

* `colnames()` (or just `names()`): returns the column names.
* `str()`: shows the structure of the object, including the class, length, and content of each column.
* `summary()`: works differently depending on the kind of object you pass to it. For a data frame, it prints summary statistics for each numeric column (min, max, median, mean, etc.).
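
---
## Inspecting data.frame objects

The functions listed above can be tried together on a single data frame. This is a minimal sketch that uses the built-in `mtcars` data; the variable names (`cars`, `first_rows`, and so on) are chosen here for illustration only.

```r
# mtcars ships with R (datasets package); any data frame would work here
cars <- mtcars

first_rows <- head(cars, n = 3)  # first 3 rows (head() shows 6 by default)
last_rows  <- tail(cars, n = 3)  # last 3 rows
size       <- dim(cars)          # c(number of rows, number of columns)
col_names  <- colnames(cars)     # same result as names(cars) for a data frame

print(first_rows)
print(size)
str(cars)          # class, dimensions, and a preview of each column
summary(cars$mpg)  # min, quartiles, mean, and max of one numeric column
```

Storing the results in variables, rather than only printing them, makes it easy to reuse them later, for example `size[1]` for the row count.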