Data wrangling with the Tidyverse¶

Workshop materials

When running an analysis, data cleaning and pre-processing often take longer than the statistical tests themselves. R can speed up this work considerably: the Tidyverse packages provide functions for manipulating data and preparing it for analysis and plotting.

Install and Load Tidyverse Packages¶

You only need to run install.packages() once to download a package onto your computer. After that, call library() at the start of each R session in which you want to use the package's functions.

# To install packages:
install.packages("tidyverse")

library(tidyverse)

Load our data from a file into the R environment¶

We will be using data about various penguin species on different islands. To read in our data, we will use the function read_csv(), which is from a package in the Tidyverse called readr.

penguins <- read_csv("penguins.csv")

Output

## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
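
The message above suggests specifying the column types yourself. A minimal sketch of what that could look like follows. Since penguins.csv may not be at hand, it first writes a small hypothetical stand-in file (the rows in `penguins_small` are made up for illustration) and then reads it back with an explicit `col_types` specification.

```r
library(tidyverse)

# Hypothetical three-row stand-in for penguins.csv, written to a temporary file
csv_path <- tempfile(fileext = ".csv")
write_lines(
  c("species,island,body_mass_g,year",
    "Adelie,Torgersen,3750,2007",
    "Gentoo,Biscoe,5000,2008",
    "Chinstrap,Dream,NA,2009"),
  csv_path
)

# Declare the type of each column up front instead of letting readr guess
penguins_small <- read_csv(
  csv_path,
  col_types = cols(
    species = col_character(),
    island = col_character(),
    body_mass_g = col_double(),
    year = col_integer()
  )
)

penguins_small
```

Because the types are given explicitly, readr does not print the column specification message, and `year` is stored as an integer rather than a double.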

Data examination¶

Tidyverse expects our data to be tidy:

Each column is a variable.
Each row is an observation.
Each cell holds a single value.

Our data conform to these rules. Let’s start to explore our data set, first using glimpse() to see a summary that shows the dimensions of the data, the column names, and what type of data live in each column.

glimpse(penguins)

Output

## Rows: 344
## Columns: 8
## $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
## $ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
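
Besides glimpse(), a few base R functions are handy for a first look at a data frame. The sketch below uses a small hypothetical tibble, `penguins_small`, with made-up rows, so it runs without the workshop file.

```r
library(tidyverse)

# Hypothetical four-row sample that borrows column names from the workshop data
penguins_small <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Dream"),
  body_mass_g = c(3750, NA, 5000, 3500)
)

dim(penguins_small)      # number of rows, then number of columns
names(penguins_small)    # column names
head(penguins_small, 2)  # first two rows
summary(penguins_small)  # per-column summaries, including NA counts for numeric columns

# Count the missing values in each column
missing_counts <- colSums(is.na(penguins_small))
missing_counts
```

Checking for missing values early is useful here, because several later steps (for example mean()) need to be told how to handle NA.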

Tidyverse pipelines¶

Pipes¶

Pipes let you take the output of one function and send it directly to the next, which is useful when you need to perform several consecutive operations on the same dataset. The pipe passes the data frame along as the first argument of the next function, so you don't need to repeat its name in each function call.

%>% is the pipe operator in R. You can read the pipe like the word “then”.

# Using pipes
penguins_biscoe <- penguins %>%
   filter(island == "Biscoe") %>%
   select(species, body_mass_g, sex)

Notice there is no output for this command, since we are saving the resulting data frame as penguins_biscoe.
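
To make the benefit of the pipe concrete, the sketch below writes the same filter-then-select operation three ways. It uses a small hypothetical tibble, `penguins_small`, with made-up rows, so that it is self-contained.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(
  species = c("Adelie", "Gentoo", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Biscoe", "Dream"),
  body_mass_g = c(3750, 5000, 4800, 3500),
  sex = c("male", "female", "male", "female")
)

# 1. Nested calls: must be read from the inside out
nested <- select(filter(penguins_small, island == "Biscoe"), species, body_mass_g, sex)

# 2. Intermediate object: readable, but clutters the environment
biscoe_rows <- filter(penguins_small, island == "Biscoe")
stepwise <- select(biscoe_rows, species, body_mass_g, sex)

# 3. Pipe: reads top to bottom, "take the data, then filter, then select"
piped <- penguins_small %>%
  filter(island == "Biscoe") %>%
  select(species, body_mass_g, sex)

piped
```

All three versions return the same two Gentoo rows. R 4.1 and later also include a native pipe, `|>`, which behaves the same way in simple pipelines like this one.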

Exercise: subsetting and selection¶

Subset the data to include every species except Adelie, keeping only the species column and the two columns that describe the bill.

Solution

penguins %>%
   filter(species != "Adelie") %>%
   select(species, bill_length_mm, bill_depth_mm)

Output

## # A tibble: 192 × 3
##    species bill_length_mm bill_depth_mm
##    <chr>            <dbl>         <dbl>
##  1 Gentoo            46.1          13.2
##  2 Gentoo            50            16.3
##  3 Gentoo            48.7          14.1
##  4 Gentoo            50            15.2
##  5 Gentoo            47.6          14.5
##  6 Gentoo            46.5          13.5
##  7 Gentoo            45.4          14.6
##  8 Gentoo            46.7          15.3
##  9 Gentoo            43.3          13.4
## 10 Gentoo            46.8          15.4
## # … with 182 more rows

Mutate¶

Frequently you’ll want to create new columns based on the values in existing columns for tasks like unit conversion or finding the ratio of values in two columns. For this, we’ll use mutate().

We might be interested in the body mass of penguins in kg instead of g:

penguins %>%
   mutate(body_mass_kg = body_mass_g / 1000)

Output

## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <chr>, year <dbl>,
## #   body_mass_kg <dbl>
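
The other use case mentioned above, a ratio of two columns, works the same way. The sketch below uses a small hypothetical tibble with made-up values; the column names `bill_ratio` and `heavy`, and the 4 kg threshold, are chosen purely for illustration.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 46.1, NA),
  bill_depth_mm = c(18.7, 13.2, 17.9),
  body_mass_g = c(3750, 4500, 3500)
)

penguins_ratio <- penguins_small %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,            # unit conversion
    bill_ratio = bill_length_mm / bill_depth_mm,  # NA when either input is NA
    heavy = body_mass_kg > 4                      # can use a column created just above
  )

penguins_ratio
```

A single mutate() call can create several columns, and later columns in the same call can refer to earlier ones.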

Split-apply-combine data analysis and summarize¶

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.

The `summarize()` function¶

group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group.

group_by() takes the names of the columns that hold the categorical variables you want to group by; summarize() then calculates the summary statistics separately for each group.

So to compute the average body mass by species:

penguins %>%
   group_by(species) %>%
   summarize(body_mass_g_mean = mean(body_mass_g, na.rm = TRUE))

Output

## # A tibble: 3 × 2
##   species   body_mass_g_mean
##   <chr>                <dbl>
## 1 Adelie               3701.
## 2 Chinstrap            3733.
## 3 Gentoo               5076.

You can also group by multiple columns:

penguins %>%
   group_by(island, species) %>%
   summarize(flipper_length_mm_mean = mean(flipper_length_mm, na.rm = TRUE),
            flipper_length_mm_min = min(flipper_length_mm, na.rm = TRUE),
            flipper_length_mm_max = max(flipper_length_mm, na.rm = TRUE),
            flipper_length_mm_sd = sd(flipper_length_mm, na.rm = TRUE))

Output

## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.

## # A tibble: 5 × 6
## # Groups:   island [3]
##   island    species   flipper_length_mm_mean flipper_length_mm… flipper_length_…
##   <chr>     <chr>                      <dbl>              <dbl>            <dbl>
## 1 Biscoe    Adelie                      189.                172              203
## 2 Biscoe    Gentoo                      217.                203              231
## 3 Dream     Adelie                      190.                178              208
## 4 Dream     Chinstrap                   196.                178              212
## 5 Torgersen Adelie                      191.                176              210
## # … with 1 more variable: flipper_length_mm_sd <dbl>
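
The message in the output above refers to the `.groups` argument. The sketch below, on a small hypothetical tibble with made-up values, sets `.groups = "drop"` so that the result is returned ungrouped and the message is not printed. It also uses n() to report how many rows fall into each group.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(
  island = c("Biscoe", "Biscoe", "Biscoe", "Dream", "Dream"),
  species = c("Adelie", "Gentoo", "Gentoo", "Adelie", "Chinstrap"),
  flipper_length_mm = c(190, 215, NA, 188, 196)
)

flipper_summary <- penguins_small %>%
  group_by(island, species) %>%
  summarize(
    n_penguins = n(),                             # rows in the group
    n_measured = sum(!is.na(flipper_length_mm)),  # rows with a measurement
    flipper_length_mm_mean = mean(flipper_length_mm, na.rm = TRUE),
    .groups = "drop"                              # return an ungrouped tibble
  )

flipper_summary
```

Reporting the number of non-missing values next to the mean makes it clear how many observations each summary is based on once `na.rm = TRUE` has removed the NAs.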

Counting¶

When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count().

If we wanted to count the number of penguins by species, we would do the following:

penguins %>%
   count(species)

Output

## # A tibble: 3 × 2
##   species       n
##   <chr>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

For convenience, count() provides the sort argument to get results in decreasing order:

penguins %>%
   count(species, sort = TRUE)

Output

## # A tibble: 3 × 2
##   species       n
##   <chr>     <int>
## 1 Adelie      152
## 2 Gentoo      124
## 3 Chinstrap    68

count() also accepts more than one variable, in which case it counts each combination of values:

penguins %>%
   count(species, island, sex)

Output

## # A tibble: 13 × 4
##    species   island    sex        n
##    <chr>     <chr>     <chr>  <int>
##  1 Adelie    Biscoe    female    22
##  2 Adelie    Biscoe    male      22
##  3 Adelie    Dream     female    27
##  4 Adelie    Dream     male      28
##  5 Adelie    Dream     <NA>       1
##  6 Adelie    Torgersen female    24
##  7 Adelie    Torgersen male      23
##  8 Adelie    Torgersen <NA>       5
##  9 Chinstrap Dream     female    34
## 10 Chinstrap Dream     male      34
## 11 Gentoo    Biscoe    female    58
## 12 Gentoo    Biscoe    male      61
## 13 Gentoo    Biscoe    <NA>       5
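
count() can be thought of as a shortcut for the split-apply-combine pattern. The sketch below, on a small hypothetical tibble with made-up values, compares count() with the longer group_by() plus summarize() form.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex = c("male", "female", NA, "female", "female")
)

# Short form
by_count <- penguins_small %>%
  count(species, sex)

# Long form: group, then count the rows in each group with n()
by_summarize <- penguins_small %>%
  group_by(species, sex) %>%
  summarize(n = n(), .groups = "drop")

by_count
by_summarize
```

Both versions give one row per combination of species and sex, and both keep missing values of sex as their own group.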

Arrange the order of your rows¶

arrange() sorts the rows of a data frame by the values in one or more columns. The default is ascending order; wrap a variable in desc() inside arrange() to sort it in descending order.

penguins %>%
   arrange(body_mass_g)

Output

## # A tibble: 344 × 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
##    <chr>     <chr>              <dbl>         <dbl>            <dbl>       <dbl>
##  1 Chinstrap Dream               46.9          16.6              192        2700
##  2 Adelie    Biscoe              36.5          16.6              181        2850
##  3 Adelie    Biscoe              36.4          17.1              184        2850
##  4 Adelie    Biscoe              34.5          18.1              187        2900
##  5 Adelie    Dream               33.1          16.1              178        2900
##  6 Adelie    Torgersen           38.6          17                188        2900
##  7 Chinstrap Dream               43.2          16.6              187        2900
##  8 Adelie    Biscoe              37.9          18.6              193        2925
##  9 Adelie    Dream               37.5          18.9              179        2975
## 10 Adelie    Dream               37            16.9              185        3000
## # … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>
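
The example above sorts in ascending order only. The sketch below shows desc() and sorting by more than one column, using a small hypothetical tibble with made-up values.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000, 3500, NA, 4800)
)

# Heaviest penguins first; rows with a missing body mass are placed last
heaviest_first <- penguins_small %>%
  arrange(desc(body_mass_g))

# Sort by species alphabetically, then by descending body mass within species
by_species_then_mass <- penguins_small %>%
  arrange(species, desc(body_mass_g))

heaviest_first
by_species_then_mass
```

When several columns are given, later columns are only used to break ties in the earlier ones.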

We can rename columns using the rename() function. The new name goes on the left of the equals sign and the existing name on the right.

penguins %>%
   rename(bill_length = bill_length_mm)

Output

## # A tibble: 344 × 8
##    species island   bill_length bill_depth_mm flipper_length_… body_mass_g sex
##    <chr>   <chr>          <dbl>         <dbl>            <dbl>       <dbl> <chr>
##  1 Adelie  Torgers…        39.1          18.7              181        3750 male
##  2 Adelie  Torgers…        39.5          17.4              186        3800 fema…
##  3 Adelie  Torgers…        40.3          18                195        3250 fema…
##  4 Adelie  Torgers…        NA            NA                 NA          NA <NA>
##  5 Adelie  Torgers…        36.7          19.3              193        3450 fema…
##  6 Adelie  Torgers…        39.3          20.6              190        3650 male
##  7 Adelie  Torgers…        38.9          17.8              181        3625 fema…
##  8 Adelie  Torgers…        39.2          19.6              195        4675 male
##  9 Adelie  Torgers…        34.1          18.1              193        3475 <NA>
## 10 Adelie  Torgers…        42            20.2              190        4250 <NA>
## # … with 334 more rows, and 1 more variable: year <dbl>

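We can combine mutate() with the function case_when() to generate values in a new column based on conditions. For instance, here we make a new column called body_type. Each condition is written as a formula, with the test on the left of the ~ and the resulting value on the right. The value in body_type is small, normal, or large depending on the value of body_mass_g in the same row. Rows where body_mass_g is missing match none of the conditions, so body_type is NA for them.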

penguins %>%
   mutate(body_type = case_when(
          body_mass_g < 3000 ~ "small",
          body_mass_g >= 3000 & body_mass_g < 4500 ~ "normal",
          body_mass_g >= 4500 ~ "large"))

Output

## # A tibble: 344 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <chr>, year <dbl>,
## #   body_type <chr>
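
Rows with a missing body_mass_g match none of the conditions above and end up with NA in body_type. If an explicit label is preferred, one option is to test for NA first, as in the sketch below. The tibble is a small hypothetical sample and the label "unknown" is an arbitrary choice for illustration.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(body_mass_g = c(2900, 3750, 4675, NA))

penguins_typed <- penguins_small %>%
  mutate(body_type = case_when(
    is.na(body_mass_g) ~ "unknown",  # handle missing values explicitly
    body_mass_g < 3000 ~ "small",
    body_mass_g < 4500 ~ "normal",   # only reached when body_mass_g >= 3000
    TRUE ~ "large"                   # fallback for every remaining row
  ))

penguins_typed
```

case_when() checks the conditions in order and uses the first one that is TRUE, which is why the second and third conditions do not need to repeat the lower bound.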

Exporting data¶

Now that you have learned how to use dplyr to extract information from your raw data or summarize it, you may want to export these new data sets to share them with your collaborators or to archive them.

Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames. Its first argument is the data frame to save and its second is the path of the file to create.

# Save the Biscoe subset created earlier to a CSV file
write_csv(penguins_biscoe, "penguins_biscoe.csv")

help(write_csv)
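
A quick way to check an export is to read the file back in. The sketch below does this with a small hypothetical tibble with made-up values and a temporary file, so that nothing is written into the working directory.

```r
library(tidyverse)

# Hypothetical sample data with made-up values
penguins_small <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, NA, 3500)
)

# Write to a temporary file, then read the file back in
out_path <- tempfile(fileext = ".csv")
write_csv(penguins_small, out_path)
penguins_back <- read_csv(out_path, show_col_types = FALSE)

penguins_back
```

By default write_csv() writes missing values as the string NA, which read_csv() turns back into NA when the file is loaded.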
