Intro to R and the Tidyverse

Workshop materials

What is R?

R is a programming language designed for statistical computing. It is not just a statistics package: it is a language.

What is RStudio?

RStudio is a free integrated development environment (IDE) for R. It is cleaner and simpler than the default R GUI (graphical user interface) and has many useful features, such as syntax highlighting and code auto-completion with the Tab key.

Additionally, it has a 4-pane workspace:

  • Top left window: the R code editor

  • Bottom left: interactive console

  • Top right window: shows your workspace, including a list of objects currently in memory and a History tab

  • Bottom right: shows plots, external packages available on your system, files in your working directory, and help files

Useful RStudio shortcuts:

  • tab: auto-complete function

  • Ctrl+ or cmd+ (auto-complete tool that works only in the interactive console)

  • Ctrl+enter or cmd+return (executes the selected lines of code)

Things to keep in mind

  • R is case sensitive, so be careful while typing.

  • # is used for comments

    • Keyboard shortcut to comment or uncomment lines: Ctrl+Shift+C (Windows) or Cmd+Shift+C (macOS).

  • R does not care about spaces between commands or arguments.

  • Names should start with a letter and should not contain spaces.

  • You can use . in object names (e.g., my.data).

  • Use forward slash (/) in path names, even on Windows.
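
The points above can be tried directly in the console. The short sketch below uses made-up object names (my_data, my.data) and a hypothetical path purely for illustration.

```r
# R is case sensitive: my_data and My_Data would be two different objects
my_data <- 10
exists("my_data")   # TRUE
exists("My_Data")   # FALSE, no object with this capitalization was created

# Spaces around operators and arguments do not change the result
a <- 1+2
b <- 1 + 2
identical(a, b)     # TRUE

# A period is allowed inside an object name
my.data <- c(1, 2, 3)

# Forward slashes work in path names, including on Windows
example_path <- "Documents/Workshops/penguins.csv"  # hypothetical path
basename(example_path)  # "penguins.csv"
```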

Working directory

Your working directory is the folder on your computer in which you are working. We can find this with the getwd() command.

# Current working directory
getwd()
[1] /User/fordfishman/

We can also set our working directory with setwd(PATH).

# an example of the path to your workshop materials
# USE YOUR OWN PATH
setwd("Documents/Workshops/Intro to R and the Tidyverse 20220928/")

To see the files in your working directory, you can use list.files().

list.files()
[1] "IntroR_Tidyverse_code_along.R" "IntroR_Tidyverse_code.R"       "penguins.csv"
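
Before reading a file, it can help to confirm that R is looking in the right place. The following is a minimal sketch using base R functions; the file name penguins.csv comes from the listing above, and whether it exists depends on your own working directory.

```r
# Build a full path from pieces instead of typing separators by hand
data_file <- file.path(getwd(), "penguins.csv")
data_file

# Check that the file is really there before trying to read it
file.exists(data_file)

# List only the CSV files in the working directory
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, the working directory is probably not the folder that holds the workshop materials.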

Creating Objects

It is usually more useful to assign values to objects so that we can reuse them. We create an object by giving it a name, followed by the assignment operator <- and the value. You can type <- with the following shortcuts: Alt+- (Windows) or Option+- (Mac).

weight_kg <- 60
weight_lb <- 2.2 * weight_kg
weight_lb # Print the value of weight_lb
[1] 132

We can also reassign our variables to new values, but be careful, as there is no warning given for this.

You can also remove a variable from your environment with the rm() command.

weight_kg <- 100 # Overwrites your object. Be careful! no warning is given

rm(weight_lb) # Deletes that object
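
One consequence of reassignment is easy to miss: objects that were computed from the old value are not updated automatically. A small sketch, reusing the object names from above:

```r
weight_kg <- 60
weight_lb <- 2.2 * weight_kg   # computed once, from the current value of weight_kg

weight_kg <- 100               # overwrite weight_kg
weight_lb                      # still 132: R does not recompute it

weight_lb <- 2.2 * weight_kg   # rerun the calculation to update it
weight_lb                      # now 220
```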

Storing many numbers as a vector

We can use c() to combine or concatenate values together into a vector.

Myvector1 <- c(1, 3, 4, 5) # c for combine/concatenate
Myvector2 <- c(1:7)
Myvector3 <- seq(1, 6, by = 0.5)

Myvector1
Myvector2
Myvector3
[1] 1 3 4 5

[1] 1 2 3 4 5 6 7

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

You can also store characters and character vectors.

greeting <- "hello"
greeting

days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
days
[1] "hello"

[1] "Sunday"    "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday"

To extract individual elements of a vector, we use an index in square brackets. For instance, to get the third element of days, we can use days[3]. Unlike many other programming languages, R indexes from 1, not 0. Additionally, a negative index does not count from the end: days[-1] returns every element except the first. To extract several elements at once, pass a vector of indices, as in days[c(1,3)].

days[3]
days[-1]
days[c(1,3)]
[1] "Tuesday"

[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday"

[1] "Sunday"  "Tuesday"
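
Since negative indices exclude elements, getting the last element needs a different approach. Two common base R options are sketched here, using the same days vector:

```r
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

# Last element: use the vector's length as the index
days[length(days)]          # "Saturday"

# tail() with n = 1 gives the same result
tail(days, n = 1)           # "Saturday"

# Negative indices drop elements: everything except the first and last day
days[-c(1, length(days))]
```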

Exercise 1

Extract Tuesday, Wednesday and Thursday from the days vector.

Solution

Note: these two solutions are equivalent.

days[c(3, 4, 5)]

days[3:5]
[1] "Tuesday"   "Wednesday" "Thursday"

[1] "Tuesday"   "Wednesday" "Thursday"

Replacing/adding new elements

We can also use indexing to replace or add new elements to a vector.

greeting[2] <- "How are you?"
greeting
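
Assigning to a position that does not exist yet makes the vector grow. The sketch below repeats the example and then assigns well past the end, to show how R fills the gap:

```r
greeting <- "hello"
greeting[2] <- "How are you?"   # position 2 did not exist, so the vector grows
length(greeting)                # 2

# Assigning past the end leaves the skipped positions as NA
greeting[5] <- "Goodbye"
greeting                        # positions 3 and 4 are now NA
length(greeting)                # 5
```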

Exercise 2

Replace the 3rd element in Myvector2 with a 10.

Solution

Myvector2[3] <- 10 # R is case sensitive: the object is Myvector2, not myvector2

Data types

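When we use c(), every element of the resulting vector must have the same data type (all numbers or all characters). If the types are mixed, R converts the values to a common type; below, the numbers become character strings.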

Myvector4 <- c(1,2,"hello")
Myvector4
[1] "1"     "2"     "hello"
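
You can check which type a vector ended up with using class(). A brief sketch, with illustrative object names:

```r
numbers_only <- c(1, 2, 3)
class(numbers_only)   # "numeric"

mixed <- c(1, 2, "hello")
class(mixed)          # "character": the numbers were converted to text

# Converting back: text that is not a number becomes NA (with a warning)
as.numeric(mixed)
```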

If we have different types of data we need to use the list() function.

Mylist <- list(1,3, "hello", TRUE)

Mylist
[[1]]
[1] 1

[[2]]
[1] 3

[[3]]
[1] "hello"

[[4]]
[1] TRUE
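
The double brackets in the printed list hint at how list elements are accessed. A minimal sketch of the difference between [[ ]] and [ ] on the same list:

```r
Mylist <- list(1, 3, "hello", TRUE)

Mylist[[3]]          # the third element itself: "hello"
class(Mylist[[3]])   # "character"

Mylist[3]            # single brackets return a list of length 1
class(Mylist[3])     # "list"

# Each element keeps its own type
sapply(Mylist, class)
```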

Functions

A function is a piece of code to carry out a specified task. R has many built-in functions.

sum(1,3,5)
mean(Myvector1)
length(Myvector1)
max(Myvector1)
rep("hi", times=3)
[1] 9

[1] 3.25

[1] 4

[1] 5

[1] "hi" "hi" "hi"

If we want to learn more about a function we can ask for help with help() or ?.

help(mean)
?rep
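
Help pages mostly describe a function's arguments. As a small illustration, rep() behaves differently depending on which named argument is supplied:

```r
rep("hi", times = 3)            # "hi" "hi" "hi"
rep(c("a", "b"), times = 2)     # "a" "b" "a" "b": the whole vector is repeated
rep(c("a", "b"), each = 2)      # "a" "a" "b" "b": each element is repeated

# args() shows a function's arguments without opening the help page
args(sd)
```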

Packages

We can also bring in extra functions by installing packages. Packages are collections of functions. There are thousands of add-on packages available on CRAN (the Comprehensive R Archive Network).

For instance, we have the tidyverse, an “opinionated collection of R packages designed for data science” (www.tidyverse.org). These packages are designed to make data wrangling, analysis, and graphing much simpler and more enjoyable.

Tidyverse packages share a philosophy of data organization: they all expect tidy data. Tidy data is set up so that each row is an observation and each column is a variable.

Using the tidyverse packages

To install a package we use the function install.packages("package name"). We only need to install a package once.

install.packages("tidyverse")

If we want to use the functions in a package, we need to load it in R using the library() function.

library(tidyverse)
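
Because a package only needs to be installed once, it can be handy to check whether it is already present before calling install.packages(). The following is a minimal sketch; is_installed() is a hypothetical helper name defined here, not part of any package.

```r
# Hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # TRUE: stats ships with R

# Install the tidyverse only when it is missing.
# Left commented out so that running this block never triggers a download.
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```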

Importing data

Let’s explore penguins! In our file called penguins.csv, we have data for three penguin species observed in the Palmer Archipelago, Antarctica, collected by Dr. Kristen Gorman with Palmer Station LTER.

penguins <- read_csv("penguins.csv")

Exploring your data

We can use the View() function to look at our data frame.

View(penguins)

A very important function is str(), which lets you view the structure of your data.

str(penguins)
spec_tbl_df [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g      : num [1:344] 3750 3800 3250 NA 3450 ...
$ sex              : chr [1:344] "male" "female" "female" NA ...
$ year             : num [1:344] 2007 2007 2007 2007 2007 ...
- attr(*, "spec")=
 .. cols(
 ..   species = col_character(),
 ..   island = col_character(),
 ..   bill_length_mm = col_double(),
 ..   bill_depth_mm = col_double(),
 ..   flipper_length_mm = col_double(),
 ..   body_mass_g = col_double(),
 ..   sex = col_character(),
 ..   year = col_double()
 .. )
- attr(*, "problems")=<externalptr>

We can get similar information in a more compact form using glimpse().

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torge…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 345…
$ sex               <chr> "male", "female", "female", NA, "female", "male", "female", "male", NA, NA, NA, NA, "female", "ma…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…

We can use some built-in functions in R to summarize the data, such as showing column names and the dimensions of the data frame.

class(penguins) # check that penguins is the kind of object we expect
dim(penguins) # how many rows and columns?
names(penguins) # names of variables
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

[1] 344   8

[1] "species"           "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm" "body_mass_g"
[7] "sex"               "year"

head() displays the first 6 rows of the data frame.

head(penguins) # first 6 rows
# A tibble: 6 × 8
  species island     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl> <chr>  <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
4 Adelie  Torgersen           NA            NA                  NA          NA NA      2007
5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007

tail() similarly shows the last 6 rows.

tail(penguins) # last 6 rows
# A tibble: 6 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
  <chr>     <chr>           <dbl>         <dbl>             <dbl>       <dbl> <chr>  <dbl>
1 Chinstrap Dream            45.7          17                 195        3650 female  2009
2 Chinstrap Dream            55.8          19.8               207        4000 male    2009
3 Chinstrap Dream            43.5          18.1               202        3400 female  2009
4 Chinstrap Dream            49.6          18.2               193        3775 male    2009
5 Chinstrap Dream            50.8          19                 210        4100 male    2009
6 Chinstrap Dream            50.2          18.7               198        3775 female  2009

We can use summary() to display some descriptive statistics, like minimum and maximum values, means, and medians.

summary(penguins)
   species             island          bill_length_mm  bill_depth_mm   flipper_length_mm  body_mass_g       sex
Length:344         Length:344          Min.   :32.10   Min.   :13.10   Min.   :172.0     Min.   :2700   Length:344
Class :character   Class :character    1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     1st Qu.:3550   Class :character
Mode  :character   Mode  :character    Median :44.45   Median :17.30   Median :197.0     Median :4050   Mode  :character
                                       Mean   :43.92   Mean   :17.15   Mean   :200.9     Mean   :4202
                                       3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0     3rd Qu.:4750
                                       Max.   :59.60   Max.   :21.50   Max.   :231.0     Max.   :6300
                                       NA's   :2       NA's   :2       NA's   :2         NA's   :2
      year
Min.   :2007
1st Qu.:2007
Median :2008
Mean   :2008
3rd Qu.:2009
Max.   :2009

Note that the numerical variables are displayed differently than the character variables. We can summarize the character variables better by converting them to factors.

penguins$species <- as.factor(penguins$species)
penguins$island <- as.factor(penguins$island)
penguins$sex <- as.factor(penguins$sex)

Here we access columns of a data frame using $, which is the easiest way to do so.

penguins$species
penguins$island[1:10] # first 10
summary(penguins$body_mass_g)
 [1] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie
[13] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie
[25] Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie   ...
Levels: Adelie Chinstrap Gentoo

[1] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen
Levels: Biscoe Dream Torgersen

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   2700    3550    4050    4202    4750    6300       2

We can see the frequencies of a factor with table() or summary().

table(penguins$species) # these give the same thing back
summary(penguins$species)
Adelie Chinstrap    Gentoo
   152        68       124
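
To see what the conversion to a factor changes, here is a small sketch on a hypothetical toy vector (not the workshop data):

```r
# Hypothetical toy vector standing in for a species column
species_chr <- c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie")
summary(species_chr)   # only length, class, and mode

species_fct <- as.factor(species_chr)
levels(species_fct)    # "Adelie" "Chinstrap" "Gentoo" (alphabetical by default)
summary(species_fct)   # counts per level: Adelie 3, Chinstrap 1, Gentoo 2
```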

We can also summarize numerical columns with a variety of functions.

mean(penguins$body_mass_g, na.rm=TRUE) # na.rm makes sure to ignore missing data
median(penguins$body_mass_g, na.rm=TRUE)
sd(penguins$body_mass_g, na.rm=TRUE)
[1] 4201.754

[1] 4050

[1] 801.9545
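
The na.rm argument matters because a single missing value otherwise makes the whole result missing. A minimal sketch with made-up values:

```r
mass <- c(3750, 3800, NA, 3450)   # hypothetical values with one missing entry

mean(mass)                 # NA: the missing value propagates
mean(mass, na.rm = TRUE)   # 3666.667: the NA is dropped first

sum(is.na(mass))           # 1: the number of missing values
```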

We can use the tidyverse function filter() to subset the rows of our data frame.

Gentoo <- filter(penguins, species == "Gentoo")

Gentoo
# A tibble: 124 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
   <fct>   <fct>           <dbl>         <dbl>             <dbl>       <dbl> <fct>  <dbl>
 1 Gentoo  Biscoe           46.1          13.2               211        4500 female  2007
 2 Gentoo  Biscoe           50            16.3               230        5700 male    2007
 3 Gentoo  Biscoe           48.7          14.1               210        4450 female  2007
 4 Gentoo  Biscoe           50            15.2               218        5700 male    2007
 5 Gentoo  Biscoe           47.6          14.5               215        5400 male    2007
 6 Gentoo  Biscoe           46.5          13.5               210        4550 female  2007
 7 Gentoo  Biscoe           45.4          14.6               211        4800 female  2007
 8 Gentoo  Biscoe           46.7          15.3               219        5200 male    2007
 9 Gentoo  Biscoe           43.3          13.4               209        4400 female  2007
10 Gentoo  Biscoe           46.8          15.4               215        5150 male    2007
# … with 114 more rows
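
filter() also accepts several conditions at once. The sketch below uses a small hypothetical tibble rather than the workshop data, and loads dplyr (the tidyverse package that provides filter()) on its own so that the example is self-contained.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins data frame
toy <- tibble(
  species     = c("Gentoo", "Gentoo", "Adelie", "Chinstrap"),
  sex         = c("female", "male", "female", NA),
  body_mass_g = c(4500, 5700, 3800, 3650)
)

# Conditions separated by commas must all be TRUE (logical AND)
heavy_gentoo <- filter(toy, species == "Gentoo", body_mass_g > 5000)

# %in% keeps rows that match any value in a set
not_gentoo <- filter(toy, species %in% c("Adelie", "Chinstrap"))

# Rows where the condition evaluates to NA are dropped
females <- filter(toy, sex == "female")
```

Note that the row with a missing sex is not returned by the last call, because a comparison with NA is neither TRUE nor FALSE.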

If we want to select specific columns, we can use the select() function.

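# flipper_length_mm and body_mass_g are kept because the mutate() step needs them
penguins_subsetted <- select(penguins, species, island, bill_length_mm, flipper_length_mm, body_mass_g, sex)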

We can add new columns with mutate().

penguins_subsetted2 <- mutate(penguins_subsetted, mass_flipper_ratio = body_mass_g/flipper_length_mm)

We can use pipes to chain tidyverse commands together. The pipe operator in R is written %>%. It passes the result on its left to the function on its right as the first argument, so you can read it as “and then”.

female_penguins <- penguins %>%
   filter(sex == "female") %>%
   mutate(mass_flipper_ratio = body_mass_g/flipper_length_mm)
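
To make the role of the pipe concrete, the sketch below writes the same two steps as nested function calls and as a pipeline. It uses a hypothetical toy tibble and loads dplyr directly so that it runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for the penguins data frame
toy <- tibble(
  sex               = c("female", "male", "female"),
  body_mass_g       = c(3800, 5700, 4500),
  flipper_length_mm = c(190, 230, 211)
)

# Nested calls: read from the inside out
nested <- mutate(filter(toy, sex == "female"),
                 mass_flipper_ratio = body_mass_g / flipper_length_mm)

# Piped: read from top to bottom
piped <- toy %>%
  filter(sex == "female") %>%
  mutate(mass_flipper_ratio = body_mass_g / flipper_length_mm)

# Both approaches produce the same ratios
all.equal(nested$mass_flipper_ratio, piped$mass_flipper_ratio)   # TRUE
```

The piped version avoids intermediate objects and keeps each step on its own line, which tends to be easier to read as the chain grows.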

Simple graphs

To make a simple scatter plot in R, we can use the plot() function.

plot(penguins$bill_depth_mm, penguins$bill_length_mm)
[Figure: scatter plot of penguins$bill_length_mm against penguins$bill_depth_mm]

We can also use ggplot2 to get nicer graphs with many customizations.

mass_flipper <- ggplot(data = penguins,
                       aes(x = flipper_length_mm,
                           y = body_mass_g)) +
   geom_point(aes(color = species,
                  shape = species),
                  size = 3,
                  alpha = 0.8) +
   scale_color_manual(values = c("darkorange","purple","cyan4")) +
   labs(title = "Penguin size, Palmer Station LTER",
         subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
         x = "Flipper length (mm)",
         y = "Body mass (g)",
         color = "Penguin species",
         shape = "Penguin species") +
   theme(legend.position = c(0.2, 0.7),
         plot.title.position = "plot",
         plot.caption = element_text(hjust = 0, face= "italic"),
         plot.caption.position = "plot")

mass_flipper
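[Figure: scatter plot titled "Penguin size, Palmer Station LTER", showing body mass (g) against flipper length (mm), with point color and shape indicating species]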

Useful Resources