Data wrangling with the Tidyverse¶
When running an analysis, data cleaning and pre-processing can often take longer than doing the statistical tests. R is an excellent tool to speed up this process, with many powerful tools to manipulate and prepare data for analysis and plotting.
Install and Load Tidyverse Packages¶
You will need to run install.packages()
just once to download
libraries onto your computer. After that, use library()
any time you
want to access the tools and functions in a package.
# To install packages:
install.packages("tidyverse")
library(tidyverse)
Load our data from a file into R environment¶
We will be using data about various penguin species on different
islands. To read in our data, we will use the function read_csv()
,
which is from a package in the Tidyverse called readr
.
penguins <- read_csv("penguins.csv")
## Rows: 344 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data examination¶
Tidyverse expects our data to be tidy:
Each column is a variable.
Each row is an observation.
Each cell has a value.
Our data conform to these rules. Let’s start to explore our data set,
first using glimpse()
to see a summary that shows the dimensions of
the data, the column names, and what type of data live in each column.
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Select columns¶
To take specific columns we can use the function select()
. This
function takes at least two arguments (inputs). The first must be
the name of the data frame (e.g., penguins
). Any subsequent
arguments must be column names.
select(penguins,species, body_mass_g, sex)
## # A tibble: 344 × 3
## species body_mass_g sex
## <chr> <dbl> <chr>
## 1 Adelie 3750 male
## 2 Adelie 3800 female
## 3 Adelie 3250 female
## 4 Adelie NA <NA>
## 5 Adelie 3450 female
## 6 Adelie 3650 male
## 7 Adelie 3625 female
## 8 Adelie 4675 male
## 9 Adelie 3475 <NA>
## 10 Adelie 4250 <NA>
## # … with 334 more rows
We can also use columnA:columnB
to select all of the columns from
columnA
to columnB
.
select(penguins, species:body_mass_g)
## # A tibble: 344 × 6
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows
If we put a -
before the column names, we will include all but those
columns.
select(penguins, -year, -island)
## # A tibble: 344 × 6
## species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adelie 39.1 18.7 181 3750 male
## 2 Adelie 39.5 17.4 186 3800 female
## 3 Adelie 40.3 18 195 3250 female
## 4 Adelie NA NA NA NA <NA>
## 5 Adelie 36.7 19.3 193 3450 female
## 6 Adelie 39.3 20.6 190 3650 male
## 7 Adelie 38.9 17.8 181 3625 female
## 8 Adelie 39.2 19.6 195 4675 male
## 9 Adelie 34.1 18.1 193 3475 <NA>
## 10 Adelie 42 20.2 190 4250 <NA>
## # … with 334 more rows
We can also use the function starts_with()
to return a data frame
with columns starting with certain characters. Note you will need to put
the characters in quotes.
select(penguins, starts_with("bill"))
## # A tibble: 344 × 2
## bill_length_mm bill_depth_mm
## <dbl> <dbl>
## 1 39.1 18.7
## 2 39.5 17.4
## 3 40.3 18
## 4 NA NA
## 5 36.7 19.3
## 6 39.3 20.6
## 7 38.9 17.8
## 8 39.2 19.6
## 9 34.1 18.1
## 10 42 20.2
## # … with 334 more rows
We can accomplish a similar task with ends_with()
.
select(penguins, ends_with("mm"))
## # A tibble: 344 × 3
## bill_length_mm bill_depth_mm flipper_length_mm
## <dbl> <dbl> <dbl>
## 1 39.1 18.7 181
## 2 39.5 17.4 186
## 3 40.3 18 195
## 4 NA NA NA
## 5 36.7 19.3 193
## 6 39.3 20.6 190
## 7 38.9 17.8 181
## 8 39.2 19.6 195
## 9 34.1 18.1 193
## 10 42 20.2 190
## # … with 334 more rows
Filtering rows¶
We can also subset the data frame based on certain conditions with the
filter()
function. For instance, we can pick all of the samples from
a specific island
using the ==
operator. Note that a single
=
does not work for this.
filter(penguins, island=="Torgersen")
## # A tibble: 52 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 42 more rows, and 2 more variables: sex <chr>, year <dbl>
We can also use other conditions, like inequalities.
less than:
<
less than or equal to:
<=
greater than:
>
greater than or equal to:
>=
filter(penguins, year<=2008)
## # A tibble: 224 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 214 more rows, and 2 more variables: sex <chr>, year <dbl>
We can use multiple conditions at the same time, as well.
filter (penguins, island=="Torgersen", sex=="male")
## # A tibble: 23 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.3 20.6 190 3650
## 3 Adelie Torgersen 39.2 19.6 195 4675
## 4 Adelie Torgersen 38.6 21.2 191 3800
## 5 Adelie Torgersen 34.6 21.1 198 4400
## 6 Adelie Torgersen 42.5 20.7 197 4500
## 7 Adelie Torgersen 46 21.5 194 4200
## 8 Adelie Torgersen 41.8 19.4 198 4450
## 9 Adelie Torgersen 39.7 18.4 190 3900
## 10 Adelie Torgersen 45.8 18.9 197 4150
## # … with 13 more rows, and 2 more variables: sex <chr>, year <dbl>
We can use complex conditions too, such as putting |
between two
conditions. |
means “or”, so the only rows that are kept must
satisfy one condition or the other. You can do a similar operation with
&
, which means “and”.
filter(penguins, species == "Chinstrap" | species == "Gentoo")
## # A tibble: 192 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gentoo Biscoe 46.1 13.2 211 4500
## 2 Gentoo Biscoe 50 16.3 230 5700
## 3 Gentoo Biscoe 48.7 14.1 210 4450
## 4 Gentoo Biscoe 50 15.2 218 5700
## 5 Gentoo Biscoe 47.6 14.5 215 5400
## 6 Gentoo Biscoe 46.5 13.5 210 4550
## 7 Gentoo Biscoe 45.4 14.6 211 4800
## 8 Gentoo Biscoe 46.7 15.3 219 5200
## 9 Gentoo Biscoe 43.3 13.4 209 4400
## 10 Gentoo Biscoe 46.8 15.4 215 5150
## # … with 182 more rows, and 2 more variables: sex <chr>, year <dbl>
The %in%
operator checks to see if values are contained in a given
vector.
filter(penguins, species %in% c("Chinstrap", "Gentoo"))
## # A tibble: 192 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Gentoo Biscoe 46.1 13.2 211 4500
## 2 Gentoo Biscoe 50 16.3 230 5700
## 3 Gentoo Biscoe 48.7 14.1 210 4450
## 4 Gentoo Biscoe 50 15.2 218 5700
## 5 Gentoo Biscoe 47.6 14.5 215 5400
## 6 Gentoo Biscoe 46.5 13.5 210 4550
## 7 Gentoo Biscoe 45.4 14.6 211 4800
## 8 Gentoo Biscoe 46.7 15.3 219 5200
## 9 Gentoo Biscoe 43.3 13.4 209 4400
## 10 Gentoo Biscoe 46.8 15.4 215 5150
## # … with 182 more rows, and 2 more variables: sex <chr>, year <dbl>
Data sets often contain missing values for various samples. We can check
our missing data with is.na()
.
filter(penguins, is.na(sex))
## # A tibble: 11 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen NA NA NA NA
## 2 Adelie Torgersen 34.1 18.1 193 3475
## 3 Adelie Torgersen 42 20.2 190 4250
## 4 Adelie Torgersen 37.8 17.1 186 3300
## 5 Adelie Torgersen 37.8 17.3 180 3700
## 6 Adelie Dream 37.5 18.9 179 2975
## 7 Gentoo Biscoe 44.5 14.3 216 4100
## 8 Gentoo Biscoe 46.2 14.4 214 4650
## 9 Gentoo Biscoe 47.3 13.8 216 4725
## 10 Gentoo Biscoe 44.5 15.7 217 4875
## 11 Gentoo Biscoe NA NA NA NA
## # … with 2 more variables: sex <chr>, year <dbl>
We can also only return samples without missing data using !
, which
means “not”, before is.na()
.
filter(penguins, !is.na(sex))
## # A tibble: 333 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## 7 Adelie Torgersen 39.2 19.6 195 4675
## 8 Adelie Torgersen 41.1 17.6 182 3200
## 9 Adelie Torgersen 38.6 21.2 191 3800
## 10 Adelie Torgersen 34.6 21.1 198 4400
## # … with 323 more rows, and 2 more variables: sex <chr>, year <dbl>
Tidyverse pipelines¶
Pipes¶
Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many consecutive tasks to the same dataset. This means you don’t need to include the name of the data frame within each function we use.
%>%
is the pipe operator in R. You can read the pipe like the word
“then”.
# Using pipes
penguins_biscoe <- penguins %>%
filter(island == "Biscoe") %>%
select(species, body_mass_g, sex)
Notice there is no output for this command, since we are saving the
resulting data frame as penguins_biscoe
.
Exercise: subsetting and selection¶
Create a new object with the data subset to include all species except Adelie and retain the species column and the ones relating to their bill.
Solution
penguins %>%
filter(species != "Adelie") %>%
select(species, bill_length_mm, bill_depth_mm)
## # A tibble: 192 × 3
## species bill_length_mm bill_depth_mm
## <chr> <dbl> <dbl>
## 1 Gentoo 46.1 13.2
## 2 Gentoo 50 16.3
## 3 Gentoo 48.7 14.1
## 4 Gentoo 50 15.2
## 5 Gentoo 47.6 14.5
## 6 Gentoo 46.5 13.5
## 7 Gentoo 45.4 14.6
## 8 Gentoo 46.7 15.3
## 9 Gentoo 43.3 13.4
## 10 Gentoo 46.8 15.4
## # … with 182 more rows
Mutate¶
Frequently you’ll want to create new columns based on the values in
existing columns for tasks like unit conversion or finding the ratio of
values in two columns. For this, we’ll use mutate()
.
We might be interested in the body mass of penguins in kg instead of g:
penguins %>%
mutate(body_mass_kg = body_mass_g / 1000)
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 3 more variables: sex <chr>, year <dbl>,
## # body_mass_kg <dbl>
Split-apply-combine data analysis and summarize¶
Many data analysis tasks can be approached using the split-apply-combine
paradigm: split the data into groups, apply some analysis to each group,
and then combine the results. dplyr
makes this very easy through the
use of the group_by()
function.
The summarize()
function¶
group_by()
is often used together with summarize()
, which
collapses each group into a single-row summary of that group.
group_by()
takes in the column names that contain the categorical
variables for which you want to calculate the summary statistics.
So to compute the average body mass by species:
penguins %>%
group_by(species) %>%
summarize(body_mass_g_mean = mean(body_mass_g, na.rm=TRUE))
## # A tibble: 3 × 2
## species body_mass_g_mean
## <chr> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
You can also group by multiple columns:
penguins %>%
group_by(island, species) %>%
summarize(flipper_length_mm_mean = mean(flipper_length_mm, na.rm = TRUE),
flipper_length_mm_min = min(flipper_length_mm, na.rm = TRUE),
flipper_length_mm_max = max(flipper_length_mm, na.rm = TRUE),
flipper_length_mm_sd = sd(flipper_length_mm, na.rm = TRUE))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 6
## # Groups: island [3]
## island species flipper_length_mm_mean flipper_length_mm… flipper_length_…
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Biscoe Adelie 189. 172 203
## 2 Biscoe Gentoo 217. 203 231
## 3 Dream Adelie 190. 178 208
## 4 Dream Chinstrap 196. 178 212
## 5 Torgersen Adelie 191. 176 210
## # … with 1 more variable: flipper_length_mm_sd <dbl>
Counting¶
When working with data, we often want to know the number of observations
found for each factor or combination of factors. For this task,
dplyr
provides count()
.
If we wanted to count the number of penguins by species, we would do the following:
penguins %>%
count(species)
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Adelie 152
## 2 Chinstrap 68
## 3 Gentoo 124
For convenience, count()
provides the sort
argument to get
results in decreasing order:
penguins %>%
count(species, sort = TRUE)
## # A tibble: 3 × 2
## species n
## <chr> <int>
## 1 Adelie 152
## 2 Gentoo 124
## 3 Chinstrap 68
We can add more than one variable:
penguins %>%
count(species, island, sex)
## # A tibble: 13 × 4
## species island sex n
## <chr> <chr> <chr> <int>
## 1 Adelie Biscoe female 22
## 2 Adelie Biscoe male 22
## 3 Adelie Dream female 27
## 4 Adelie Dream male 28
## 5 Adelie Dream <NA> 1
## 6 Adelie Torgersen female 24
## 7 Adelie Torgersen male 23
## 8 Adelie Torgersen <NA> 5
## 9 Chinstrap Dream female 34
## 10 Chinstrap Dream male 34
## 11 Gentoo Biscoe female 58
## 12 Gentoo Biscoe male 61
## 13 Gentoo Biscoe <NA> 5
Arrange the order of your rows¶
The default is to arrange in ascending order. You can use the desc()
function on the variable inside arrange()
to arrange in descending
order.
penguins %>%
arrange(body_mass_g)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Chinstrap Dream 46.9 16.6 192 2700
## 2 Adelie Biscoe 36.5 16.6 181 2850
## 3 Adelie Biscoe 36.4 17.1 184 2850
## 4 Adelie Biscoe 34.5 18.1 187 2900
## 5 Adelie Dream 33.1 16.1 178 2900
## 6 Adelie Torgersen 38.6 17 188 2900
## 7 Chinstrap Dream 43.2 16.6 187 2900
## 8 Adelie Biscoe 37.9 18.6 193 2925
## 9 Adelie Dream 37.5 18.9 179 2975
## 10 Adelie Dream 37 16.9 185 3000
## # … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>
We can rename columns using the rename()
functions.
penguins %>%
rename(bill_length = bill_length_mm)
## # A tibble: 344 × 8
## species island bill_length bill_depth_mm flipper_length_… body_mass_g sex
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adelie Torgers… 39.1 18.7 181 3750 male
## 2 Adelie Torgers… 39.5 17.4 186 3800 fema…
## 3 Adelie Torgers… 40.3 18 195 3250 fema…
## 4 Adelie Torgers… NA NA NA NA <NA>
## 5 Adelie Torgers… 36.7 19.3 193 3450 fema…
## 6 Adelie Torgers… 39.3 20.6 190 3650 male
## 7 Adelie Torgers… 38.9 17.8 181 3625 fema…
## 8 Adelie Torgers… 39.2 19.6 195 4675 male
## 9 Adelie Torgers… 34.1 18.1 193 3475 <NA>
## 10 Adelie Torgers… 42 20.2 190 4250 <NA>
## # … with 334 more rows, and 1 more variable: year <dbl>
We can combine mutate()
with the function case_when()
to
generate values in a new column based on conditions. For instance, here
we make a new column called body_type
. Values in this column are
small
, normal
, or large
based on the value in the same row
of body_mass_g
, which are specified as individual conditions.
penguins %>%
mutate(body_type = case_when(
body_mass_g < 3000 ~ "small",
body_mass_g >= 3000 & body_mass_g < 4500 ~ "normal",
body_mass_g >= 4500 ~ "large"))
## # A tibble: 344 × 9
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # … with 334 more rows, and 3 more variables: sex <chr>, year <dbl>,
## # body_type <chr>
Wide and long data transformation¶
Sometimes, it might be more useful to have data in a wide format,
where columns represent different values or levels of a variable. We can
reshape our data using two tidyr functions, pivot_wider()
and
pivot_longer()
.
penguins %>%
pivot_longer(contains("_"))
## # A tibble: 1,376 × 6
## species island sex year name value
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Adelie Torgersen male 2007 bill_length_mm 39.1
## 2 Adelie Torgersen male 2007 bill_depth_mm 18.7
## 3 Adelie Torgersen male 2007 flipper_length_mm 181
## 4 Adelie Torgersen male 2007 body_mass_g 3750
## 5 Adelie Torgersen female 2007 bill_length_mm 39.5
## 6 Adelie Torgersen female 2007 bill_depth_mm 17.4
## 7 Adelie Torgersen female 2007 flipper_length_mm 186
## 8 Adelie Torgersen female 2007 body_mass_g 3800
## 9 Adelie Torgersen female 2007 bill_length_mm 40.3
## 10 Adelie Torgersen female 2007 bill_depth_mm 18
## # … with 1,366 more rows
Here we pivot the penguin data so that all the bill measurements are in the same column.
penguins %>%
pivot_longer(starts_with("bill"))
## # A tibble: 688 × 8
## species island flipper_length_mm body_mass_g sex year name value
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 Adelie Torgersen 181 3750 male 2007 bill_leng… 39.1
## 2 Adelie Torgersen 181 3750 male 2007 bill_dept… 18.7
## 3 Adelie Torgersen 186 3800 female 2007 bill_leng… 39.5
## 4 Adelie Torgersen 186 3800 female 2007 bill_dept… 17.4
## 5 Adelie Torgersen 195 3250 female 2007 bill_leng… 40.3
## 6 Adelie Torgersen 195 3250 female 2007 bill_dept… 18
## 7 Adelie Torgersen NA NA <NA> 2007 bill_leng… NA
## 8 Adelie Torgersen NA NA <NA> 2007 bill_dept… NA
## 9 Adelie Torgersen 193 3450 female 2007 bill_leng… 36.7
## 10 Adelie Torgersen 193 3450 female 2007 bill_dept… 19.3
## # … with 678 more rows
We can remove NAs at the same time.
penguins %>%
pivot_longer(starts_with("bill"),
values_drop_na = TRUE)
## # A tibble: 684 × 8
## species island flipper_length_mm body_mass_g sex year name value
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 Adelie Torgersen 181 3750 male 2007 bill_leng… 39.1
## 2 Adelie Torgersen 181 3750 male 2007 bill_dept… 18.7
## 3 Adelie Torgersen 186 3800 female 2007 bill_leng… 39.5
## 4 Adelie Torgersen 186 3800 female 2007 bill_dept… 17.4
## 5 Adelie Torgersen 195 3250 female 2007 bill_leng… 40.3
## 6 Adelie Torgersen 195 3250 female 2007 bill_dept… 18
## 7 Adelie Torgersen 193 3450 female 2007 bill_leng… 36.7
## 8 Adelie Torgersen 193 3450 female 2007 bill_dept… 19.3
## 9 Adelie Torgersen 190 3650 male 2007 bill_leng… 39.3
## 10 Adelie Torgersen 190 3650 male 2007 bill_dept… 20.6
## # … with 674 more rows
When making our data into a long format, we can split the former column names into multiple new columns.
penguins_long <- penguins %>%
pivot_longer(contains("_"),
names_to = c("part", "measure" , "unit"),
names_sep = "_",
values_drop_na = TRUE)
penguins_long
## # A tibble: 1,368 × 8
## species island sex year part measure unit value
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Adelie Torgersen male 2007 bill length mm 39.1
## 2 Adelie Torgersen male 2007 bill depth mm 18.7
## 3 Adelie Torgersen male 2007 flipper length mm 181
## 4 Adelie Torgersen male 2007 body mass g 3750
## 5 Adelie Torgersen female 2007 bill length mm 39.5
## 6 Adelie Torgersen female 2007 bill depth mm 17.4
## 7 Adelie Torgersen female 2007 flipper length mm 186
## 8 Adelie Torgersen female 2007 body mass g 3800
## 9 Adelie Torgersen female 2007 bill length mm 40.3
## 10 Adelie Torgersen female 2007 bill depth mm 18
## # … with 1,358 more rows
pivot_wider()
¶
We can also take a long data set and make it wide.
First, let’s create a simple long datafame.
penguins_long_simple <- penguins %>%
pivot_longer(contains("_"))
penguins_long_simple
## # A tibble: 1,376 × 6
## species island sex year name value
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Adelie Torgersen male 2007 bill_length_mm 39.1
## 2 Adelie Torgersen male 2007 bill_depth_mm 18.7
## 3 Adelie Torgersen male 2007 flipper_length_mm 181
## 4 Adelie Torgersen male 2007 body_mass_g 3750
## 5 Adelie Torgersen female 2007 bill_length_mm 39.5
## 6 Adelie Torgersen female 2007 bill_depth_mm 17.4
## 7 Adelie Torgersen female 2007 flipper_length_mm 186
## 8 Adelie Torgersen female 2007 body_mass_g 3800
## 9 Adelie Torgersen female 2007 bill_length_mm 40.3
## 10 Adelie Torgersen female 2007 bill_depth_mm 18
## # … with 1,366 more rows
If we want to make this wider again we can use pivot_wider()
,
specifying the names of the columns we want to expand.
penguins_long_simple %>%
pivot_wider(names_from = name,
values_from = value)
## # A tibble: 35 × 8
## species island sex year bill_length_mm bill_depth_mm flipper_length_mm
## <chr> <chr> <chr> <dbl> <list> <list> <list>
## 1 Adelie Torgersen male 2007 <dbl [7]> <dbl [7]> <dbl [7]>
## 2 Adelie Torgersen female 2007 <dbl [8]> <dbl [8]> <dbl [8]>
## 3 Adelie Torgersen <NA> 2007 <dbl [5]> <dbl [5]> <dbl [5]>
## 4 Adelie Biscoe female 2007 <dbl [5]> <dbl [5]> <dbl [5]>
## 5 Adelie Biscoe male 2007 <dbl [5]> <dbl [5]> <dbl [5]>
## 6 Adelie Dream female 2007 <dbl [9]> <dbl [9]> <dbl [9]>
## 7 Adelie Dream male 2007 <dbl [10]> <dbl [10]> <dbl [10]>
## 8 Adelie Dream <NA> 2007 <dbl [1]> <dbl [1]> <dbl [1]>
## 9 Adelie Biscoe female 2008 <dbl [9]> <dbl [9]> <dbl [9]>
## 10 Adelie Biscoe male 2008 <dbl [9]> <dbl [9]> <dbl [9]>
## # … with 25 more rows, and 1 more variable: body_mass_g <list>
However, the output looks a little strange. This is due to how
pivot_wider()
identifies unique rows.
To guarentee this does not happen, make sure to include the original row number in the long data set.
Note the alternative pipe operator here (|>
). This is equivalent to
%>%
.
penguins_long_simple <- penguins |>
mutate(sample = row_number()) |>
pivot_longer(contains("_"))
penguins_long_simple
## # A tibble: 1,376 × 7
## species island sex year sample name value
## <chr> <chr> <chr> <dbl> <int> <chr> <dbl>
## 1 Adelie Torgersen male 2007 1 bill_length_mm 39.1
## 2 Adelie Torgersen male 2007 1 bill_depth_mm 18.7
## 3 Adelie Torgersen male 2007 1 flipper_length_mm 181
## 4 Adelie Torgersen male 2007 1 body_mass_g 3750
## 5 Adelie Torgersen female 2007 2 bill_length_mm 39.5
## 6 Adelie Torgersen female 2007 2 bill_depth_mm 17.4
## 7 Adelie Torgersen female 2007 2 flipper_length_mm 186
## 8 Adelie Torgersen female 2007 2 body_mass_g 3800
## 9 Adelie Torgersen female 2007 3 bill_length_mm 40.3
## 10 Adelie Torgersen female 2007 3 bill_depth_mm 18
## # … with 1,366 more rows
We can also use pivot_wider()
across multiple columns.
penguins_long <- penguins %>%
mutate(sample = row_number()) %>%
pivot_longer(contains("_"),
names_to = c("part", "measure" , "unit"),
names_sep = "_",
values_drop_na = TRUE)
penguins_long
## # A tibble: 1,368 × 9
## species island sex year sample part measure unit value
## <chr> <chr> <chr> <dbl> <int> <chr> <chr> <chr> <dbl>
## 1 Adelie Torgersen male 2007 1 bill length mm 39.1
## 2 Adelie Torgersen male 2007 1 bill depth mm 18.7
## 3 Adelie Torgersen male 2007 1 flipper length mm 181
## 4 Adelie Torgersen male 2007 1 body mass g 3750
## 5 Adelie Torgersen female 2007 2 bill length mm 39.5
## 6 Adelie Torgersen female 2007 2 bill depth mm 17.4
## 7 Adelie Torgersen female 2007 2 flipper length mm 186
## 8 Adelie Torgersen female 2007 2 body mass g 3800
## 9 Adelie Torgersen female 2007 3 bill length mm 40.3
## 10 Adelie Torgersen female 2007 3 bill depth mm 18
## # … with 1,358 more rows
penguins_long %>%
pivot_wider(names_from = c("part", "measure", "unit"),
names_sep = "_",
values_from = value)
## # A tibble: 342 × 9
## species island sex year sample bill_length_mm bill_depth_mm
## <chr> <chr> <chr> <dbl> <int> <dbl> <dbl>
## 1 Adelie Torgersen male 2007 1 39.1 18.7
## 2 Adelie Torgersen female 2007 2 39.5 17.4
## 3 Adelie Torgersen female 2007 3 40.3 18
## 4 Adelie Torgersen female 2007 5 36.7 19.3
## 5 Adelie Torgersen male 2007 6 39.3 20.6
## 6 Adelie Torgersen female 2007 7 38.9 17.8
## 7 Adelie Torgersen male 2007 8 39.2 19.6
## 8 Adelie Torgersen <NA> 2007 9 34.1 18.1
## 9 Adelie Torgersen <NA> 2007 10 42 20.2
## 10 Adelie Torgersen <NA> 2007 11 37.8 17.1
## # … with 332 more rows, and 2 more variables: flipper_length_mm <dbl>,
## # body_mass_g <dbl>
Using dplyr to merge tables¶
Merging data frames is an important aspect of data pre-processing, and there are various methods for doing so.
To start, let’s create two small data sets with some overlapping information.
data1 <- data.frame(country=c("Germany","Australia", "Ecuador"),
region=c("Europe","Western Pacific", "Americas" ),
life_exp=c(81, 83, 75))
data2 <- data.frame(country=c("Germany","Australia", "Iceland"),
region=c("Europe","Western Pacific", "Europe" ),
urban_pop=c(76, 86, 94))
data1
data2
## country region life_exp
## 1 Germany Europe 81
## 2 Australia Western Pacific 83
## 3 Ecuador Americas 75
## country region urban_pop
## 1 Germany Europe 76
## 2 Australia Western Pacific 86
## 3 Iceland Europe 94
We can merge tables with various join functions. These join functions require us to provide at least one column to act as an identifier.
Inner joins return rows where both data sets have data, according to
the identifier column. In the following example, we use inner_join()
to return rows with countries present in both data frames.
inner_join(data1, data2, by="country")
## country region.x life_exp region.y urban_pop
## 1 Germany Europe 81 Europe 76
## 2 Australia Western Pacific 83 Western Pacific 86
A left join keeps what is in the first data frame and any matches to that in the second data frame.
left_join(data1, data2, by="country")
## country region.x life_exp region.y urban_pop
## 1 Germany Europe 81 Europe 76
## 2 Australia Western Pacific 83 Western Pacific 86
## 3 Ecuador Americas 75 <NA> NA
A right join does just the opposite and keeps matches to items in the second data frame.
right_join(data1, data2, by="country")
## country region.x life_exp region.y urban_pop
## 1 Germany Europe 81 Europe 76
## 2 Australia Western Pacific 83 Western Pacific 86
## 3 Iceland <NA> NA Europe 94
A full join will include all data from both data frames.
full_join(data1, data2, by="country")
## country region.x life_exp region.y urban_pop
## 1 Germany Europe 81 Europe 76
## 2 Australia Western Pacific 83 Western Pacific 86
## 3 Ecuador Americas 75 <NA> NA
## 4 Iceland <NA> NA Europe 94
Filtering joins¶
Sometimes, we want to filter our data frame based on another. There are
several methods for this. semi_join()
will keep rows in the first
data frame for entries present in the second.
semi_join(data1, data2, by="country")
## country region life_exp
## 1 Germany Europe 81
## 2 Australia Western Pacific 83
We can do the opposite with anti_join()
. This only keeps rows of the
first data frame that do not have entries in the second.
anti_join(data1, data2, by="country")
## country region life_exp
## 1 Ecuador Americas 75
Combining¶
Let’s make two new data frame for the next examples.
data3 <- data.frame(country=c("Germany","Ecuador"),
life_exp=c(81, 75))
data4 <- data.frame(country=c("Germany","Australia" ),
life_exp=c(81, 83 ))
If we simply want to add new rows on top of each other, we can use the
bind_rows()
function.
bind_rows(data3, data4)
## country life_exp
## 1 Germany 81
## 2 Ecuador 75
## 3 Germany 81
## 4 Australia 83
We can keep rows that are the same in both data frames with
intersect()
.
intersect(data3, data4)
## country life_exp
## 1 Germany 81
setdiff()
returns the rows that appear in the first but not the
second data frame.
setdiff(data3, data4)
## country life_exp
## 1 Ecuador 75
Finally, union()
returns all unique rows.
union(data3, data4)
## country life_exp
## 1 Germany 81
## 2 Ecuador 75
## 3 Australia 83
Exporting data¶
Now that you have learned how to use dplyr
to extract information
from or summarize your raw data, you may want to export these new data
sets to share them with your collaborators or for archival.
Similar to the read_csv()
function used for reading CSV files into
R, there is a write_csv()
function that generates CSV files from
dataframes.
write_csv(data4, 'countries.csv')
help(write_csv)