Intro to R and the Tidyverse¶
What is R?¶
R is a programming language designed for statistical computing. It is not just a statistics package: it is a language.
What is RStudio?¶
RStudio is a free integrated development environment (IDE) for R. It is cleaner and simpler than the default R GUI (graphical user interface). It has many useful features, like syntax highlighting and code auto-completion suggestions triggered with the tab key.
Additionally, it has a 4-pane workspace:
Top left window: the R code editor
Bottom left: interactive console
Top right window: shows your workspace, including a list of objects currently in memory and a history tab
Bottom right: shows plots, external packages available on your system, files in your working directory, and help files
Useful RStudio shortcuts:
tab: auto-complete function
Ctrl+↑ or cmd+↑ (searches your command history for previous commands that start with what you have typed; works only in the interactive console)
Ctrl+enter or cmd+return (executes the selected lines of code)
Things to keep in mind¶
R is case sensitive, so be careful while typing.
# is used for comments. Keyboard shortcut to comment or uncomment the selected lines: Ctrl+Shift+C (Windows) or Cmd+Shift+C (macOS).
R does not care about spaces between commands or arguments.
Names should start with a letter and should not contain spaces.
You can use . in object names (e.g., my.data).
Use forward slash ( / ) in path names, even on Windows.
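A minimal sketch of a few of these rules in action. The object names (my.data, My.data, spaced, unspaced) are made up for illustration.

```r
# Object names are case sensitive: my.data and My.data are two different objects
my.data <- 5
My.data <- 10
my.data == My.data                # FALSE

# Extra spaces between arguments do not change the result
spaced   <- sum( 1,   2, 3 )
unspaced <- sum(1,2,3)
identical(spaced, unspaced)       # TRUE
```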
Working directory¶
Your working directory is the folder on your computer in which you are working. We can find it with the getwd() command.
# Current working directory
getwd()
[1] /User/fordfishman/
We can also set our working directory with setwd(PATH).
# an example of the path to your workshop materials
# USE YOUR OWN PATH
setwd("Documents/Workshops/Intro to R and the Tidyverse 20220928/")
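One cautious way to change directories is to check that the folder exists first and to remember where you started. This is only a sketch: the folder names below are hypothetical, so replace them with your own.

```r
# Remember the starting directory so we can return to it later
original_dir <- getwd()

# file.path() joins folder names with forward slashes
workshop_dir <- file.path("Documents", "Workshops")   # hypothetical folders

# Only change directory if the folder actually exists
if (dir.exists(workshop_dir)) {
  setwd(workshop_dir)
}
getwd()

# Go back to where we started
setwd(original_dir)
```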
To see the files in your working directory, you can use list.files().
list.files()
[1] "IntroR_Tidyverse_code_along.R" "IntroR_Tidyverse_code.R" "penguins.csv"
Creating Objects¶
It is useful to assign values to objects so that we can reuse them later. We create an object by giving it a name, followed by the assignment operator <- and the value. You can type <- with the following shortcuts: Alt+- (Windows) or Option+- (Mac).
weight_kg <- 60
weight_lb <- 2.2 * weight_kg
weight_lb # Print the value of weight_lb
[1] 132
We can also reassign our variables to new values, but be careful, as there is no warning given for this.
You can also remove a variable from your environment with the rm() command.
weight_kg <- 100 # Overwrites your object. Be careful! no warning is given
rm(weight_lb) # Deletes that object
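One consequence worth seeing for yourself: reassigning weight_kg does not recalculate objects that were computed from it earlier. The sketch below repeats the steps above and uses exists() to check whether an object is still in the environment.

```r
weight_kg <- 60
weight_lb <- 2.2 * weight_kg

weight_kg <- 100        # weight_lb is NOT recalculated
weight_lb               # still 132

exists("weight_lb")     # TRUE: the object is in the environment
rm(weight_lb)
exists("weight_lb")     # FALSE: the object has been removed
```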
Storing many numbers as a vector¶
We can use c() to combine or concatenate values together into a vector.
Myvector1 <- c(1,3,4,5) # c for combine/concatenate
Myvector2 <- c(1:7)
Myvector3 <- seq (1,6, by=0.5)
Myvector1
Myvector2
Myvector3
[1] 1 3 4 5
[1] 1 2 3 4 5 6 7
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
You can also store characters and character vectors.
greeting <- "hello"
greeting
days <- c ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
days
[1] "hello"
[1] "Sunday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
To extract individual elements of a vector, we use an index in square brackets. For instance, to get the third element of days, we can use days[3]. Unlike many other programming languages, R indexes from 1, not 0. Additionally, a negative index such as -1 does not return the last value: it returns the vector with the first element excluded.
days[3]
days[-1]
days[c(1,3)]
[1] "Tuesday"
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday"
[1] "Sunday" "Tuesday"
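Since -1 drops the first element rather than returning the last one, here is a short sketch of two common ways to get at the end of a vector, using length() and tail().

```r
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

days[length(days)]      # last element: "Saturday"
tail(days, 1)           # same result using tail()
days[-length(days)]     # everything except the last element
days[-c(1, 7)]          # drop both the first and the last element
```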
Exercise 1¶
Extract Tuesday, Wednesday and Thursday from the days vector.
Solution
Note: these two solutions are equivalent.
days[c(3, 4, 5)]
days[3:5]
[1] "Tuesday" "Wednesday" "Thursday"
[1] "Tuesday" "Wednesday" "Thursday"
Replacing/adding new elements¶
We can also use indexing to replace or add new elements to a vector.
greeting[2] <- "How are you?"
greeting
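To make the difference between replacing and adding concrete, here is a small sketch. Assigning to an index that already exists replaces the value, while assigning to the next free index (or using c()) makes the vector longer. The "Goodbye" value is just an illustrative addition.

```r
greeting <- "hello"
greeting[2] <- "How are you?"        # index 2 does not exist yet, so the vector grows
length(greeting)                     # 2

greeting[1] <- "Hello"               # index 1 exists, so its value is replaced
greeting <- c(greeting, "Goodbye")   # c() can also append to the end
greeting
```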
Exercise 2¶
Replace the 3rd element in Myvector2 with a 10.
Solution
Myvector2[3] <- 10 # R is case sensitive: the object is Myvector2, not myvector2
Data types¶
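When we use c(), every element of the vector must be of the same data type (all numbers or all characters). If we mix types, R converts the elements to a common type. In the example below, the numbers become character strings.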
Myvector4 <- c(1,2,"hello")
Myvector4
[1] "1" "2" "hello"
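You can check what type a vector ended up with using class(). A brief sketch, with made-up vector names:

```r
numbers <- c(1, 2, 3)
mixed   <- c(1, 2, "hello")

class(numbers)     # "numeric"
class(mixed)       # "character": the numbers were converted to text

# Converting back turns anything that is not a number into NA (with a warning)
as.numeric(mixed)
```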
If we have different types of data, we need to use the list() function.
Mylist <- list(1,3, "hello", TRUE)
Mylist
[[1]]
[1] 1
[[2]]
[1] 3
[[3]]
[1] "hello"
[[4]]
[1] TRUE
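The double brackets in the printed output hint at how list elements are accessed. As a rough sketch: double brackets return the element itself, while single brackets return a smaller list.

```r
Mylist <- list(1, 3, "hello", TRUE)

Mylist[[3]]             # the element itself: "hello"
class(Mylist[[3]])      # "character"

Mylist[3]               # a list of length 1 containing "hello"
class(Mylist[3])        # "list"

sapply(Mylist, class)   # each element keeps its own type
```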
Functions¶
A function is a piece of code to carry out a specified task. R has many built-in functions.
sum(1,3,5)
mean(Myvector1)
length(Myvector1)
max(Myvector1)
rep("hi", times=3)
[1] 9
[1] 3.25
[1] 4
[1] 5
[1] "hi" "hi" "hi"
If we want to learn more about a function, we can ask for help with help() or ?.
help(mean)
?rep
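Help pages list the arguments a function accepts. Once you know their names, you can pass arguments by name, which makes code easier to read. A few illustrative calls using base R functions:

```r
rep("hi", times = 3)              # repeat the whole value three times
rep(c("a", "b"), each = 2)        # repeat each element twice: "a" "a" "b" "b"
seq(from = 1, to = 6, by = 0.5)   # the same sequence as Myvector3 above
round(3.14159, digits = 2)        # 3.14
```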
Packages¶
We can also bring in extra functions by installing packages. Packages are collections of functions. There are thousands of add-on packages available on CRAN (the Comprehensive R Archive Network).
For instance, we have the tidyverse, an “opinionated collection of R packages designed for data science” (www.tidyverse.org). These packages are designed to make data wrangling, analysis, and graphing much simpler and more enjoyable.
Tidyverse packages share a philosophy of data organization: they all expect tidy data. Tidy data is set up so that each row is an observation and each column is a variable.
Using the tidyverse packages¶
To install a package, we use the function install.packages("package name"). We only need to install a package once.
install.packages("tidyverse")
If we want to use the functions in a package, we need to load it with the library() function in every new R session.
library(tidyverse)
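If you are not sure whether a package is already installed, you can check before loading it. The sketch below uses requireNamespace() from base R; the helper name is_installed is made up for this example.

```r
# Hypothetical helper: TRUE if the package can be found on this system
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")    # TRUE: the stats package ships with R

# Print a reminder instead of failing when the tidyverse is missing
if (!is_installed("tidyverse")) {
  message("Run install.packages(\"tidyverse\") first")
}
```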
Importing data¶
Let’s explore penguins! In our file called penguins.csv, we have data for three penguin species observed in the Palmer Archipelago, Antarctica, collected by Dr. Kristen Gorman with Palmer Station LTER.
penguins <- read_csv("penguins.csv")
Exploring your data¶
We can use the View() function to look at our data frame.
View(penguins)
A very important function is str(), which lets you view the structure of the data.
str(penguins)
spec_tbl_df [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ species : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
$ sex : chr [1:344] "male" "female" "female" NA ...
$ year : num [1:344] 2007 2007 2007 2007 2007 ...
- attr(*, "spec")=
.. cols(
.. species = col_character(),
.. island = col_character(),
.. bill_length_mm = col_double(),
.. bill_depth_mm = col_double(),
.. flipper_length_mm = col_double(),
.. body_mass_g = col_double(),
.. sex = col_character(),
.. year = col_double()
.. )
- attr(*, "problems")=<externalptr>
We can get similar information in a more compact form using glimpse().
glimpse(penguins)
Rows: 344
Columns: 8
$ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie…
$ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torge…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194…
$ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 345…
$ sex <chr> "male", "female", "female", NA, "female", "male", "female", "male", NA, NA, NA, NA, "female", "ma…
$ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…
We can use some built-in functions in R to summarize the data, such as showing column names and the dimensions of the data frame.
class(penguins) # check that penguins is the kind of object we expect
dim(penguins) # how many rows and columns?
names(penguins) # names of variables
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
[1] 344 8
[1] "species" "island" "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
head() displays the first 6 rows of the data frame.
head(penguins) # first 6 rows
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
3 Adelie Torgersen 40.3 18 195 3250 female 2007
4 Adelie Torgersen NA NA NA NA NA 2007
5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
tail() similarly shows the last 6 rows.
tail(penguins) # last 6 rows
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 Chinstrap Dream 45.7 17 195 3650 female 2009
2 Chinstrap Dream 55.8 19.8 207 4000 male 2009
3 Chinstrap Dream 43.5 18.1 202 3400 female 2009
4 Chinstrap Dream 49.6 18.2 193 3775 male 2009
5 Chinstrap Dream 50.8 19 210 4100 male 2009
6 Chinstrap Dream 50.2 18.7 198 3775 female 2009
We can use summary() to display some descriptive statistics, like minimum and maximum values, means, and medians.
summary(penguins)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Length:344 Length:344 Min. :32.10 Min. :13.10 Min. :172.0 Min. :2700 Length:344
Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 1st Qu.:3550 Class :character
Mode :character Mode :character Median :44.45 Median :17.30 Median :197.0 Median :4050 Mode :character
Mean :43.92 Mean :17.15 Mean :200.9 Mean :4202
3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0 3rd Qu.:4750
Max. :59.60 Max. :21.50 Max. :231.0 Max. :6300
NA's :2 NA's :2 NA's :2 NA's :2
year
Min. :2007
1st Qu.:2007
Median :2008
Mean :2008
3rd Qu.:2009
Max. :2009
Note that the numerical variables are displayed differently than the character variables. We can summarize the character variables better by converting them to factors.
penguins$species <- as.factor(penguins$species)
penguins$island <- as.factor(penguins$island)
penguins$sex <- as.factor(penguins$sex)
Here we access columns of a data frame using $, which is the easiest way to do so.
penguins$species
penguins$island[1:10] # first 10
summary(penguins$body_mass_g)
[1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
[13] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie
[25] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie ...
Levels: Adelie Chinstrap Gentoo
[1] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen
Levels: Biscoe Dream Torgersen
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2700 3550 4050 4202 4750 6300 2
We can see the frequencies of a factor with table() or summary().
table(penguins$species) # these give the same thing back
summary(penguins$species)
Adelie Chinstrap Gentoo
152 68 124
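To see why converting to a factor helps, here is a small sketch on a made-up character vector (not the real penguins data). Before conversion, summary() only reports length and class. After conversion, the categories become levels that can be counted.

```r
# A short made-up vector standing in for penguins$species
species <- c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie")
summary(species)          # only length, class and mode

species_f <- as.factor(species)
levels(species_f)         # "Adelie" "Chinstrap" "Gentoo" (alphabetical order)
table(species_f)          # counts per level: 3, 1, 1
```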
We can also summarize numerical columns with a variety of functions.
mean(penguins$body_mass_g, na.rm=TRUE) # na.rm makes sure to ignore missing data
median(penguins$body_mass_g, na.rm=TRUE)
sd(penguins$body_mass_g, na.rm=TRUE)
[1] 4201.754
[1] 4050
[1] 801.9545
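The effect of na.rm is easiest to see on a short vector. This sketch uses the first five body masses shown in the output above, one of which is missing.

```r
masses <- c(3750, 3800, 3250, NA, 3450)

mean(masses)                  # NA: a single missing value makes the result missing
mean(masses, na.rm = TRUE)    # 3562.5: the mean of the four known values
sum(is.na(masses))            # 1: how many values are missing
```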
We can use the filter() tidyverse function to subset our data frame.
Gentoo <- filter(penguins,species =="Gentoo")
Gentoo
# A tibble: 124 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 Gentoo Biscoe 46.1 13.2 211 4500 female 2007
2 Gentoo Biscoe 50 16.3 230 5700 male 2007
3 Gentoo Biscoe 48.7 14.1 210 4450 female 2007
4 Gentoo Biscoe 50 15.2 218 5700 male 2007
5 Gentoo Biscoe 47.6 14.5 215 5400 male 2007
6 Gentoo Biscoe 46.5 13.5 210 4550 female 2007
7 Gentoo Biscoe 45.4 14.6 211 4800 female 2007
8 Gentoo Biscoe 46.7 15.3 219 5200 male 2007
9 Gentoo Biscoe 43.3 13.4 209 4400 female 2007
10 Gentoo Biscoe 46.8 15.4 215 5150 male 2007
# … with 114 more rows
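If we want to select specific columns, we can use the select() function.
penguins_subsetted <- select(penguins, species, island, bill_length_mm, flipper_length_mm, body_mass_g, sex)
We can add new columns with mutate(). Any column used in the calculation must be present in the data frame passed to mutate(), which is why the selection above keeps body_mass_g and flipper_length_mm.
penguins_subsetted2 <- mutate(penguins_subsetted, mass_flipper_ratio = body_mass_g/flipper_length_mm)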
We can use pipes to chain tidyverse commands together. Pipes in R look like %>%. Read the pipe like the words “and then”.
female_penguins <- penguins %>%
filter(sex == "female") %>%
mutate(mass_flipper_ratio = body_mass_g/flipper_length_mm)
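To show what the pipe does, the sketch below writes the same two steps with and without pipes and checks that the results match. It assumes the dplyr package (part of the tidyverse) is installed. The toy data frame is made up so the example runs without penguins.csv.

```r
library(dplyr)

# A tiny made-up data frame standing in for the penguins data
toy <- data.frame(
  sex               = c("female", "male", "female"),
  body_mass_g       = c(3800, 3750, 3250),
  flipper_length_mm = c(186, 181, 195)
)

# Without pipes: the calls are nested and read from the inside out
nested <- mutate(filter(toy, sex == "female"),
                 mass_flipper_ratio = body_mass_g / flipper_length_mm)

# With pipes: take toy, and then filter, and then mutate
piped <- toy %>%
  filter(sex == "female") %>%
  mutate(mass_flipper_ratio = body_mass_g / flipper_length_mm)

identical(nested, piped)    # TRUE
```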
Simple graphs¶
To make a simple scatter plot in R, we can use the plot() function.
plot(penguins$bill_depth_mm, penguins$bill_length_mm)
We can also use ggplot2 to get nicer graphs with many customizations.
mass_flipper <- ggplot(data = penguins,
                       aes(x = flipper_length_mm,
                           y = body_mass_g)) +
  geom_point(aes(color = species,
                 shape = species),
             size = 3,
             alpha = 0.8) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin size, Palmer Station LTER",
       subtitle = "Flipper length and body mass for Adelie, Chinstrap and Gentoo Penguins",
       x = "Flipper length (mm)",
       y = "Body mass (g)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.2, 0.7),
        plot.title.position = "plot",
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.caption.position = "plot")
mass_flipper