Data Analysis¶

This section covers:

Exploring and summarizing data
Correlation and Chi-Squared tests
T-test and ANOVA
Checking assumptions
Linear regression

Set up¶

To get started, let’s install and/or load the libraries we will be using. If this is your first time using one of the packages, “uncomment” and run the appropriate install.packages('package') line.

#install.packages('tidyverse')
library(tidyverse)
#install.packages('car')
library(car)
#install.packages('broom')
library(broom)
#install.packages('rstatix')
library(rstatix)
#install.packages("sjPlot")
library(sjPlot)
#install.packages("lmtest")
library(lmtest)

We are going to analyze penguins! See https://allisonhorst.github.io/palmerpenguins/

Let’s get the data

#install.packages("palmerpenguins")
library(palmerpenguins)

Exploring the data set¶

View(penguins)

We can check the structure

str(penguins)

Output

tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
$ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007
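
The NA entries in the structure output suggest that some rows have missing values. As a minimal sketch (the name missing_counts is only illustrative), the number of missing values per column could be counted like this:

```r
library(palmerpenguins)

# is.na() marks each missing cell as TRUE; colSums() counts the TRUEs per column
missing_counts <- colSums(is.na(penguins))
missing_counts
```

Columns with non-zero counts are the ones where arguments such as na.rm = TRUE matter later on.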

I want to see how many penguins of each species I have

penguins %>%
 count(species)

Output

## # A tibble: 3 x 2
## species       n
## <fctr>    <int>
## Adelie      152
## Chinstrap    68
## Gentoo      124
## # 3 rows

Let’s create a bar graph

ggplot(penguins, aes(species)) +
  geom_bar()

Output

I want to see summary statistics for each species of penguin

penguins %>%
  group_by(species) %>%
  # mean of each numeric measurement per species, ignoring missing values
  summarize(across(bill_length_mm:body_mass_g, ~ mean(.x, na.rm = TRUE)))
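
Means alone do not show how much the penguins vary within each species. As a hedged sketch, the same group_by() pattern can report the group size and standard deviation next to the mean for a single variable. The object name mass_summary and its column names are illustrative, not part of the original workshop code.

```r
library(dplyr)
library(palmerpenguins)

mass_summary <- penguins %>%
  group_by(species) %>%
  summarize(
    n = n(),                                        # rows per species, including rows with missing mass
    mean_mass_g = mean(body_mass_g, na.rm = TRUE),  # mean body mass, ignoring NAs
    sd_mass_g = sd(body_mass_g, na.rm = TRUE)       # standard deviation, ignoring NAs
  )
mass_summary
```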

Correlation¶

Is there a correlation between Flipper Length and Body Mass? Let’s create a scatterplot first

correlation_graph <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")
correlation_graph

Output

What’s the correlation coefficient?

cor.test(penguins$flipper_length_mm, penguins$body_mass_g)

Output

##  Pearson's product-moment correlation
##
## data:  penguins$flipper_length_mm and penguins$body_mass_g
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.843041 0.894599
## sample estimates:
##       cor
## 0.8712018
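
To see where the t value in this output comes from, here is a small sketch that recomputes it from the reported correlation and degrees of freedom, using the standard formula t = r * sqrt(df) / sqrt(1 - r^2). The numbers are copied from the output above. The variable names are illustrative.

```r
# Values copied from the cor.test() output
r  <- 0.8712018   # sample correlation coefficient
df <- 340         # degrees of freedom (number of complete pairs minus 2)

# t statistic for testing the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(t_stat, 3)
p_value
```

The recomputed statistic should agree with the t reported by cor.test() up to rounding.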

Chi-Squared test¶

Now, I want to see if there is a relationship between species and island. As both variables are categorical, we need to run a chi-squared test.

Let’s visualize both variables first

ggplot(penguins, aes(x = species, fill = island)) + geom_bar()

Output

We can also build contingency tables

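penguins_table <- table(penguins$species, penguins$island)
penguins_table                        # counts
prop.table(penguins_table)            # proportions of the overall total
prop.table(penguins_table, 1) * 100   # row percentages (within each species)
prop.table(penguins_table, 2) * 100   # column percentages (within each island)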

Output

##           Biscoe Dream Torgersen
## Adelie        44    56        52
## Chinstrap      0    68         0
## Gentoo       124     0         0
##
##              Biscoe     Dream Torgersen
## Adelie    0.1279070 0.1627907 0.1511628
## Chinstrap 0.0000000 0.1976744 0.0000000
## Gentoo    0.3604651 0.0000000 0.0000000
##
##              Biscoe     Dream Torgersen
## Adelie     28.94737  36.84211  34.21053
## Chinstrap   0.00000 100.00000   0.00000
## Gentoo    100.00000   0.00000   0.00000
##
##              Biscoe     Dream Torgersen
## Adelie     26.19048  45.16129 100.00000
## Chinstrap   0.00000  54.83871   0.00000
## Gentoo     73.80952   0.00000   0.00000

Now let’s run the chi-squared test

chisq <- chisq.test(penguins$species, penguins$island)
chisq

Output

##  Pearson's Chi-squared test
##
## data:  penguins$species and penguins$island
## X-squared = 299.55, df = 4, p-value < 2.2e-16
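
To make the test statistic less of a black box, the following sketch recomputes it by hand from the species-by-island counts shown in the contingency table. It compares the observed counts with the counts expected if species and island were independent. The counts are typed in from that table, and the object names (observed, expected, chi_sq) are illustrative.

```r
# Observed counts, copied from the species-by-island contingency table
observed <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 1)
round(chi_sq, 2)
df
```

The expected matrix is also the place to look when checking the usual rule of thumb that expected counts should not be too small.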
