Text Mining with R¶

Setup¶

Install and load packages¶

We will be using several external libraries to do our text analysis.

install.packages("readtext")
install.packages("quanteda")
install.packages("quanteda.textmodels")
install.packages("quanteda.textstats")
install.packages("quanteda.textplots")
install.packages("tidyverse")

library(readtext)
library(quanteda)
library(quanteda.textmodels)
library(quanteda.textstats)
library(quanteda.textplots)
library(tidyverse)

Import data¶

We are going to analyze the State of the Union Addresses from 1934 to 2020. First, set your working directory to the location of your files.

setwd('~/Documents/Workshops/TM2023/') # replace with the appropriate directory
sotu <- readtext("texts")
sotu

Output

readtext object consisting of 96 documents and 0 docvars.
# Description: df [96 × 2]
doc_id                text
<chr>                 <chr>
1 barack-obama-2009.txt "\"Madam Spea\"..."
2 barack-obama-2010.txt "\"Madam Spea\"..."
3 barack-obama-2011.txt "\"Mr. Speake\"..."
4 barack-obama-2012.txt "\"Mr. Speake\"..."
5 barack-obama-2013.txt "\"Please, ev\"..."
6 barack-obama-2014.txt "\"The Presid\"..."
# … with 90 more rows

Create a corpus¶

Next, we convert the readtext data frame into a quanteda corpus object, which the tokenization step below takes as its input.

sotu_corp <- corpus(sotu)

Tokenize the corpus¶

The next step is tokenization, where we break text down into individual tokens: words, characters, and symbols. By default, tokens() removes only separators (whitespace); here we also remove punctuation. Numbers and symbols can be removed in the same way with the arguments remove_numbers and remove_symbols.

Finally, we convert the tokens to lowercase and print the first tokens of the first document.

sotu_toks <- tokens(sotu_corp) # removes separators (whitespace) by default; numbers, punctuation, and symbols can also be removed

sotu_toks <- tokens(sotu_corp, remove_punct = TRUE)
print(sotu_toks)

Output

Tokens consisting of 96 documents.
barack-obama-2009.txt :
 [1] "Madam"     "Speaker"   "Mr"        "Vice"      "President" "Members"
 [7] "of"        "Congress"  "the"       "First"     "Lady"      "of"
[ ... and 6,028 more ]

barack-obama-2010.txt :
 [1] "Madam"         "Speaker"       "Vice"          "President"
 [5] "Biden"         "Members"       "of"            "Congress"
 [9] "distinguished" "guests"        "and"           "fellow"
[ ... and 7,172 more ]

barack-obama-2011.txt :
 [1] "Mr"            "Speaker"       "Mr"            "Vice"
 [5] "President"     "Members"       "of"            "Congress"
 [9] "distinguished" "guests"        "and"           "fellow"
[ ... and 6,823 more ]

barack-obama-2012.txt :
 [1] "Mr"            "Speaker"       "Mr"            "Vice"
 [5] "President"     "Members"       "of"            "Congress"
 [9] "distinguished" "guests"        "and"           "fellow"
[ ... and 6,975 more ]

barack-obama-2013.txt :
 [1] "Please"    "everybody" "have"      "a"         "seat"      "Mr"
 [7] "Speaker"   "Mr"        "Vice"      "President" "Members"   "of"
[ ... and 6,736 more ]

barack-obama-2014.txt :
 [1] "The"       "President" "Mr"        "Speaker"   "Mr"        "Vice"
 [7] "President" "Members"   "of"        "Congress"  "my"        "fellow"
[ ... and 6,945 more ]

[ reached max_ndoc ... 90 more documents ]

sotu_toks <- tokens_tolower(sotu_toks)

sotu_toks[[1]][1:20] # first 20 tokens of document 1

Output

 [1] "madam"     "speaker"   "mr"        "vice"      "president"
 [6] "members"   "of"        "congress"  "the"       "first"
[11] "lady"      "of"        "the"       "united"    "states"
[16] "she's"     "around"    "here"      "somewhere" "i"
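
To see what the cleaning arguments of tokens() do, it can help to try them on a single sentence. The following is a minimal sketch; the sentence and the object names (toy, toks_default, toks_clean) are made up for illustration and are not part of the workshop data.

```r
library(quanteda)

# Hypothetical one-sentence input, used only for illustration
toy <- c(d1 = "In 2009, we passed 3 bills; it wasn't easy!")

# Default behaviour: only separators (whitespace) are dropped
toks_default <- tokens(toy)

# Also drop punctuation, numbers, and symbols
toks_clean <- tokens(toy,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE)

print(toks_default)
print(tokens_tolower(toks_clean))

# Number of tokens per document before and after cleaning
ntoken(toks_default)
ntoken(toks_clean)
```

Comparing the two printed objects shows which tokens each argument removes.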

Keywords in context¶

To examine how a word is used in a wider context, we can search for keywords in context. kwic() allows you to identify a keyword of interest, and see a number of words before and after it. We can specify the number of context words to be displayed with the argument window.

The argument pattern also takes a wild card (*) and multiple keywords in a character vector.

kw_health <- kwic(sotu_toks, pattern="health*", window = 10)
head(kw_health)

Output

Keyword-in-context with 6 matches.
 [barack-obama-2009.txt, 448]
 [barack-obama-2009.txt, 585]
 [barack-obama-2009.txt, 683]
 [barack-obama-2009.txt, 916]
[barack-obama-2009.txt, 1026]
[barack-obama-2009.txt, 2371]

                   import more oil today than ever before the cost of | health  |
                           sake of a quick profit at the expense of a | healthy |
         job creation restart lending and invest in areas like energy | health  |
                     who can now keep their jobs and educate our kids | health  |
 will be able to receive extended unemployment benefits and continued | health  |
                        of our dependence on oil and the high cost of | health  |

care eats up more and more of our savings each
market people bought homes they knew they couldn't afford from
care and education that will grow our economy even as
care professionals can continue caring for our sick there are
care coverage to help them weather this storm now i
care the schools that aren't preparing our children and the

Using a vector, we can search for multiple patterns at once.

kw_health2 <- kwic(sotu_toks, pattern=c("health*","care"), window = 10)
head(kw_health2)

Output

Keyword-in-context with 6 matches.
[barack-obama-2009.txt, 448]
[barack-obama-2009.txt, 449]
[barack-obama-2009.txt, 585]
[barack-obama-2009.txt, 683]
[barack-obama-2009.txt, 684]
[barack-obama-2009.txt, 916]


             import more oil today than ever before the cost of | health  |
             more oil today than ever before the cost of health |  care   |
                     sake of a quick profit at the expense of a | healthy |
   job creation restart lending and invest in areas like energy | health  |
creation restart lending and invest in areas like energy health |  care   |
               who can now keep their jobs and educate our kids | health  |

care eats up more and more of our savings each
eats up more and more of our savings each year
market people bought homes they knew they couldn't afford from
care and education that will grow our economy even as
and education that will grow our economy even as we
care professionals can continue caring for our sick there are

We can also search for multi-word expressions.

kw_healthcare <- kwic(sotu_toks, pattern = phrase("health care"))
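
The difference between a wildcard pattern and a phrase() pattern is easier to see on a small example. This is a sketch on a made-up sentence; toy_toks, kw_glob, and kw_phrase are hypothetical names.

```r
library(quanteda)

# Hypothetical sentence, used only for illustration
toy_toks <- tokens(c(d1 = "the cost of health care is rising and healthy people still need care"))

# The wildcard pattern matches single tokens: "health" and "healthy"
kw_glob <- kwic(toy_toks, pattern = "health*", window = 3)

# phrase() matches the two-token sequence "health care" as one keyword
kw_phrase <- kwic(toy_toks, pattern = phrase("health care"), window = 3)

print(kw_glob)
print(kw_phrase)

# A kwic object behaves like a data frame, so nrow() counts the matches
nrow(kw_glob)
nrow(kw_phrase)
```

Note that the second, standalone "care" is not matched by either pattern: it does not start with "health", and it is not preceded by "health".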

Select tokens¶

We also remove stopwords, which are words that are not particularly useful for understanding the meaning of the text, like “an”, “have”, and “about”. The command for this is tokens_select().

sotu_toks_nostop <- tokens_select(sotu_toks,
                                  pattern = stopwords("english"),
                                  selection="remove")
stopwords("english")

Output

  [1] "i"          "me"         "my"         "myself"     "we"
  [6] "our"        "ours"       "ourselves"  "you"        "your"
 [11] "yours"      "yourself"   "yourselves" "he"         "him"
 [16] "his"        "himself"    "she"        "her"        "hers"
 [21] "herself"    "it"         "its"        "itself"     "they"
 [26] "them"       "their"      "theirs"     "themselves" "what"
 [31] "which"      "who"        "whom"       "this"       "that"
 [36] "these"      "those"      "am"         "is"         "are"
 [41] "was"        "were"       "be"         "been"       "being"
 [46] "have"       "has"        "had"        "having"     "do"
 [51] "does"       "did"        "doing"      "would"      "should"
 [56] "could"      "ought"      "i'm"        "you're"     "he's"
 [61] "she's"      "it's"       "we're"      "they're"    "i've"
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"
[101] "who's"      "what's"     "here's"     "there's"    "when's"
[106] "where's"    "why's"      "how's"      "a"          "an"
[111] "the"        "and"        "but"        "if"         "or"
[116] "because"    "as"         "until"      "while"      "of"
[121] "at"         "by"         "for"        "with"       "about"
[126] "against"    "between"    "into"       "through"    "during"
[131] "before"     "after"      "above"      "below"      "to"
[136] "from"       "up"         "down"       "in"         "out"
[141] "on"         "off"        "over"       "under"      "again"
[146] "further"    "then"       "once"       "here"       "there"
[151] "when"       "where"      "why"        "how"        "all"
[156] "any"        "both"       "each"       "few"        "more"
[161] "most"       "other"      "some"       "such"       "no"
[166] "nor"        "not"        "only"       "own"        "same"
[171] "so"         "than"       "too"        "very"       "will"

sotu_toks_nostop

Output

Tokens consisting of 96 documents.
barack-obama-2009.txt :
 [1] "madam"     "speaker"   "mr"        "vice"      "president" "members"   "congress"
 [8] "first"     "lady"      "united"    "states"    "around"
[ ... and 3,000 more ]

barack-obama-2010.txt :
 [1] "madam"         "speaker"       "vice"          "president"     "biden"
 [6] "members"       "congress"      "distinguished" "guests"        "fellow"
[11] "americans"     "constitution"
[ ... and 3,698 more ]

barack-obama-2011.txt :
 [1] "mr"            "speaker"       "mr"            "vice"          "president"
 [6] "members"       "congress"      "distinguished" "guests"        "fellow"
[11] "americans"     "tonight"
[ ... and 3,512 more ]

barack-obama-2012.txt :
 [1] "mr"            "speaker"       "mr"            "vice"          "president"
 [6] "members"       "congress"      "distinguished" "guests"        "fellow"
[11] "americans"     "last"
[ ... and 3,690 more ]

barack-obama-2013.txt :
 [1] "please"    "everybody" "seat"      "mr"        "speaker"   "mr"        "vice"
 [8] "president" "members"   "congress"  "fellow"    "americans"
[ ... and 3,653 more ]

barack-obama-2014.txt :
 [1] "president" "mr"        "speaker"   "mr"        "vice"      "president" "members"
 [8] "congress"  "fellow"    "americans" "today"     "america"
[ ... and 3,767 more ]

[ reached max_ndoc ... 90 more documents ]

We can do the same with the shortcut tokens_remove(sotu_toks, pattern = stopwords("en")).
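
As a quick check that the two commands are equivalent, we can run both on a made-up sentence. This is a minimal sketch; toy_toks, no_stop_a, and no_stop_b are hypothetical names.

```r
library(quanteda)

# Hypothetical sentence, used only for illustration
toy_toks <- tokens(c(d1 = "we have a plan for the future of our economy"))

# Two ways of removing stopwords
no_stop_a <- tokens_select(toy_toks, pattern = stopwords("en"), selection = "remove")
no_stop_b <- tokens_remove(toy_toks, pattern = stopwords("en"))

# Both calls leave the same tokens
identical(as.list(no_stop_a), as.list(no_stop_b))
print(no_stop_a)

# padding = TRUE keeps an empty placeholder where each stopword was,
# so the original token positions are preserved
print(tokens_remove(toy_toks, pattern = stopwords("en"), padding = TRUE))
```

Without padding, the remaining tokens are treated as adjacent in later steps, even when stopwords originally stood between them.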

Generating n-grams¶

An n-gram is a contiguous sequence of n items in a text.

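tokens_ngrams() generates a set of n-grams (tokens in sequence) from a tokenized text object. It returns every contiguous sequence of the requested lengths, not every possible combination of tokens. With n = 2:4 we get all bigrams, trigrams, and 4-grams, in that order, which is why the first 30 entries below are bigrams and the last 30 are 4-grams.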

toks_ngram <- tokens_ngrams(sotu_toks_nostop, n = 2:4)
head(toks_ngram[[1]], 30)
tail(toks_ngram[[1]], 30)

Output

 [1] "madam_speaker"         "speaker_mr"            "mr_vice"
 [4] "vice_president"        "president_members"     "members_congress"
 [7] "congress_first"        "first_lady"            "lady_united"
[10] "united_states"         "states_around"         "around_somewhere"
[13] "somewhere_come"        "come_tonight"          "tonight_address"
[16] "address_distinguished" "distinguished_men"     "men_women"
[19] "women_great"           "great_chamber"         "chamber_speak"
[22] "speak_frankly"         "frankly_directly"      "directly_men"
[25] "men_women"             "women_sent"            "sent_us"
[28] "us_know"               "know_many"             "many_americans"

 [1] "fear_challenges_time_summon"         "challenges_time_summon_enduring"
 [3] "time_summon_enduring_spirit"         "summon_enduring_spirit_america"
 [5] "enduring_spirit_america_quit"        "spirit_america_quit_someday"
 [7] "america_quit_someday_years"          "quit_someday_years_now"
 [9] "someday_years_now_children"          "years_now_children_can"
[11] "now_children_can_tell"               "children_can_tell_children"
[13] "can_tell_children_time"              "tell_children_time_performed"
[15] "children_time_performed_words"       "time_performed_words_carved"
[17] "performed_words_carved_chamber"      "words_carved_chamber_something"
[19] "carved_chamber_something_worthy"     "chamber_something_worthy_remembered"
[21] "something_worthy_remembered_thank"   "worthy_remembered_thank_god"
[23] "remembered_thank_god_bless"          "thank_god_bless_may"
[25] "god_bless_may_god"                   "bless_may_god_bless"
[27] "may_god_bless_united"                "god_bless_united_states"
[29] "bless_united_states_america"         "united_states_america_thank"
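
To make "contiguous sequence" concrete, here is a small base R sketch that builds n-grams by sliding a window over a token vector. The helper make_ngrams() is a hypothetical name written for this illustration; it is not how quanteda implements tokens_ngrams().

```r
# Build contiguous n-grams from a character vector of tokens (base R sketch)
make_ngrams <- function(tokens, n) {
  # Fewer than n tokens means no n-gram can be formed
  if (length(tokens) < n) return(character(0))
  # One n-gram starts at each position that leaves room for n tokens
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# First five tokens of the first document after stopword removal
toy <- c("madam", "speaker", "mr", "vice", "president")

bigrams  <- make_ngrams(toy, 2)
trigrams <- make_ngrams(toy, 3)

print(bigrams)
print(trigrams)
```

A vector of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams.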

tokens_compound() generates n-grams more selectively: it joins only the token sequences that match a pattern. For example, you can make bigrams that start with "united" using phrase() and a wildcard (*).

toks_neg_bigram <- tokens_compound(sotu_toks_nostop, pattern = phrase("united *"))
toks_neg_bigram_select <- tokens_select(toks_neg_bigram, pattern = phrase("united_*"))
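
The effect of the two commands is easiest to see on a short example. This is a sketch on a made-up sentence; toy_toks, toy_compound, and toy_united are hypothetical names.

```r
library(quanteda)

# Hypothetical sentence, used only for illustration
toy_toks <- tokens(c(d1 = "the united states and the united nations met"))

# Join "united" and whatever token follows it into one compound token
toy_compound <- tokens_compound(toy_toks, pattern = phrase("united *"))
print(toy_compound)

# Keep only the compound tokens that start with "united_"
toy_united <- tokens_select(toy_compound, pattern = "united_*")
print(toy_united)
```

All other tokens are left as they were by tokens_compound(); only the subsequent tokens_select() call drops them.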

Collocations¶

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance. Collocations can be more informative or meaningful than ngrams as they allow us to evaluate the strength and significance of the association between sequences of words.

sotu_collocations <- textstat_collocations(sotu_toks_nostop,
                                           method = "lambda",
                                           size = 2,
                                           min_count = 2,
                                           smoothing = 0.5)
head(sotu_collocations)

Output

         collocation count count_nested length   lambda        z
    united states   709            0      2 7.885126 87.41215
        last year   388            0      2 4.996601 70.81536
      health care   235            0      2 6.353499 63.41266
  american people   328            0      2 4.183064 61.56281
  social security   231            0      2 6.452656 60.44997
federal government   277            0      2 4.307875 58.60510
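
To build some intuition for the lambda and z columns, the sketch below computes a smoothed log odds ratio for a single bigram from a 2 x 2 table of counts. This is a simplified illustration of the idea of association strength, under the assumption that a two-word lambda behaves like a log odds ratio; the helper bigram_lambda() is hypothetical, it is not quanteda's implementation, and its values should not be expected to match textstat_collocations() exactly.

```r
# Smoothed log odds ratio for one bigram (simplified base R sketch)
bigram_lambda <- function(tokens, w1, w2, smoothing = 0.5) {
  first  <- head(tokens, -1)  # first word of every adjacent pair
  second <- tail(tokens, -1)  # second word of every adjacent pair

  # 2 x 2 table: is the first word w1, is the second word w2
  n11 <- sum(first == w1 & second == w2)
  n10 <- sum(first == w1 & second != w2)
  n01 <- sum(first != w1 & second == w2)
  n00 <- sum(first != w1 & second != w2)

  # Add the smoothing constant to every cell to avoid log(0)
  counts <- c(n11, n10, n01, n00) + smoothing

  lambda <- log(counts[1]) + log(counts[4]) - log(counts[2]) - log(counts[3])
  z <- lambda / sqrt(sum(1 / counts))  # lambda divided by its approximate standard error
  c(lambda = lambda, z = z)
}

# Hypothetical token sequence, used only for illustration
toy <- c("health", "care", "costs", "rise", "health", "care", "reform",
         "health", "care", "now", "care", "costs")

res_health_care <- bigram_lambda(toy, "health", "care")
res_care_costs  <- bigram_lambda(toy, "care", "costs")

print(res_health_care)
print(res_care_costs)
```

In this toy sequence "health" is always followed by "care", while "care" is followed by "costs" only some of the time, so "health care" gets the larger lambda.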

Document feature matrix¶

We are going to create a document feature matrix (dfm) from a tokens object. We will then turn the dfm into a tidy data frame. A tidy data frame means one variable per column, one observation per row, one value per cell.

sotu_dfm <- dfm(sotu_toks)
sotu_dfm

dim(sotu_dfm)

Output

Document-feature matrix of: 96 documents, 20,805 features (92.39% sparse) and 0 docvars.
                       features
docs                    madam speaker mr vice president members  of congress the first
  barack-obama-2009.txt     1       1  1    2         5       1 161       10 269     8
  barack-obama-2010.txt     1       1  0    2         6       2 166       10 336     5
  barack-obama-2011.txt     0       3  2    1         2       2 195       10 354    11
  barack-obama-2012.txt     0       2  2    2         5       3 170       15 294     8
  barack-obama-2013.txt     0       1  2    1         2       1 172       17 302     6
  barack-obama-2014.txt     0       2  2    2         7       3 154       20 284    14
[ reached max_ndoc ... 90 more documents, reached max_nfeat ... 20,795 more features ]

[1]    96 20805

ndoc(sotu_dfm)
nfeat(sotu_dfm)
topfeatures(sotu_dfm, 10)

Output

[1] 96

[1] 20805

  the    of   and    to    in     a    we   our  that   for
37864 23162 22842 21779 13935 10702 10156  9866  7908  7569
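
One possible way to turn a dfm into a tidy data frame is to convert it to an ordinary (wide) data frame and then reshape it so that each row holds one document-feature pair. This is a sketch on a made-up two-document dfm; toy_dfm, toy_wide, and toy_tidy are hypothetical names, and other approaches exist.

```r
library(quanteda)
library(tidyr)

# Hypothetical two-document dfm, used only for illustration
toy_dfm <- dfm(tokens(c(d1 = "a b b", d2 = "b c")))

# Wide data frame: one row per document, one column per feature
toy_wide <- convert(toy_dfm, to = "data.frame")

# Tidy (long) data frame: one row per document-feature pair
toy_tidy <- pivot_longer(toy_wide,
                         cols = -doc_id,
                         names_to = "feature",
                         values_to = "count")

# Drop the zero counts, which make up most of a sparse dfm
toy_tidy <- toy_tidy[toy_tidy$count > 0, ]
print(toy_tidy)
```

For a matrix as sparse as sotu_dfm, dropping the zero rows keeps the tidy version at a manageable size.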

TF-IDF analysis¶

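We can also do a TF-IDF (term frequency-inverse document frequency) analysis. Its purpose is to find each document’s most distinctive terms by weighting how often a term occurs in a document against how many documents in the corpus contain it. A high score means the term is distinctive for that document; a score of 0 means the term appears in every document, which is why "of", "congress", and "the" score 0 in the output below.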

# Add a tf-idf on a dfm to determine a document's most distinctive words
sotu_tf_idf <- dfm_tfidf(sotu_dfm)
sotu_tf_idf

Output

Document-feature matrix of: 96 documents, 20,805 features (92.39% sparse) and 0 docvars.
                       features
docs                      madam    speaker         mr      vice  president    members of
  barack-obama-2009.txt 1.20412 0.07918125 0.08464414 0.5841503 0.14014362 0.02802872  0
  barack-obama-2010.txt 1.20412 0.07918125 0          0.5841503 0.16817234 0.05605745  0
  barack-obama-2011.txt 0       0.23754374 0.16928828 0.2920752 0.05605745 0.05605745  0
  barack-obama-2012.txt 0       0.15836249 0.16928828 0.5841503 0.14014362 0.08408617  0
  barack-obama-2013.txt 0       0.07918125 0.16928828 0.2920752 0.05605745 0.02802872  0
  barack-obama-2014.txt 0       0.15836249 0.16928828 0.5841503 0.19620107 0.08408617  0
                       features
docs                    congress the      first
  barack-obama-2009.txt        0   0 0.07314704
  barack-obama-2010.txt        0   0 0.04571690
  barack-obama-2011.txt        0   0 0.10057717
  barack-obama-2012.txt        0   0 0.07314704
  barack-obama-2013.txt        0   0 0.05486028
  barack-obama-2014.txt        0   0 0.12800731
[ reached max_ndoc ... 90 more documents, reached max_nfeat ... 20,795 more features ]
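
The scores above can be reproduced by hand. Assuming the default settings of dfm_tfidf() (raw counts as term frequency and a base-10 logarithm of the inverse document frequency), each score is the count multiplied by log10(number of documents / number of documents containing the term). For example, 1.20412 is what 1 × log10(96 / 6) gives, which would be consistent with "madam" appearing in 6 of the 96 speeches. The base R sketch below applies the same formula to a made-up count matrix; all object names and counts are hypothetical.

```r
# Hypothetical document-feature counts, used only for illustration
counts <- matrix(c(1, 2, 0,
                   0, 3, 4,
                   0, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("madam", "the", "jobs")))

n_docs   <- nrow(counts)           # number of documents
doc_freq <- colSums(counts > 0)    # number of documents containing each term
idf      <- log10(n_docs / doc_freq)

# Multiply every column of counts by that term's idf
tfidf <- sweep(counts, 2, idf, `*`)
print(round(tfidf, 4))
```

Because "the" occurs in all three toy documents, its inverse document frequency is log10(3 / 3) = 0 and its score is 0 everywhere, however often it occurs.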

# Simple frequency analysis
sotu_freq <- textstat_frequency(sotu_dfm)
head(sotu_freq, 20)

Output

   feature frequency rank docfreq group
    the     37864    1      96   all
     of     23162    2      96   all
    and     22842    3      96   all
     to     21779    4      96   all
     in     13935    5      96   all
      a     10702    6      96   all
     we     10156    7      96   all
    our      9866    8      96   all
   that      7908    9      96   all
   for      7569   10      96   all
    is      5905   11      96   all
  will      5334   12      96   all
     i      5230   13      96   all
  this      5104   14      96   all
  have      4886   15      96   all
    be      4083   16      96   all
   are      3991   17      96   all
  with      3722   18      96   all
    on      3637   19      96   all
    it      3528   20      96   all

Create a word cloud¶

We can use word clouds as a simple way to represent our corpus. This first version can take a long time to render and may time out, so you may need to stop the process to see the output.

textplot_wordcloud(sotu_dfm)

Output

This word cloud is a bit overwhelming, so let’s pare it down a bit.

Here we add some specifications to control the rotation and color of the words. The number of words included can also be limited with the max_words argument. You shouldn’t need to halt this process.

textplot_wordcloud(sotu_dfm,
                  rotation = 0.25, #proportion of words with 90 degree rotation
                  color = rev(RColorBrewer::brewer.pal(10, "RdBu")))

Output

Lexical diversity¶

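Lexical diversity measures how varied the vocabulary of a text is. The default measure in textstat_lexdiv() is the type-token ratio (TTR): the number of unique tokens (types) divided by the total number of tokens. A higher TTR means less repetition.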

sotu_lexdiv <- textstat_lexdiv(sotu_dfm)
head(sotu_lexdiv)

Output

               document       TTR
barack-obama-2009.txt 0.2487967
barack-obama-2010.txt 0.2373709
barack-obama-2011.txt 0.2491568
barack-obama-2012.txt 0.2465636
barack-obama-2013.txt 0.2554831
barack-obama-2014.txt 0.2607883
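
The type-token ratio is simple enough to compute by hand. The base R sketch below does so for a made-up token vector; the helper ttr() is a hypothetical name, and textstat_lexdiv() may apply additional cleaning before counting, so its values can differ from a naive count.

```r
# Type-token ratio: unique tokens divided by total tokens (base R sketch)
ttr <- function(tokens) {
  length(unique(tokens)) / length(tokens)
}

# Hypothetical tokens, used only for illustration
toy <- c("we", "will", "rebuild", "we", "will", "recover")

# 4 types (we, will, rebuild, recover) out of 6 tokens
toy_ttr <- ttr(toy)
print(toy_ttr)
```

Longer texts tend to repeat more words, so TTR values are easiest to compare between documents of similar length.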

Working with a subset¶

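We can also work with a subset of the texts in the corpus. Here we do some additional preprocessing first: we read the files again with docvarsfrom = "filenames", which stores each filename (without its extension) in a document variable called docvar1, and split it into a year and a president’s name. We then create a subset containing only President Obama’s speeches.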

sotu2 <- readtext("texts",
                  docvarsfrom = "filenames")

sotu2 <- sotu2 %>%
   mutate(year = str_sub(docvar1, -5)) %>% # create year column from the last five characters
   mutate(name = str_sub(docvar1, 1, -6)) # create name column from the remaining characters

sotu2$year <- sotu2$year %>%
   str_replace_all("[-ab]", "") # remove unwanted characters from the year column

sotu2$year <- as.integer(sotu2$year)

sotu2$name <- sotu2$name %>%
   str_replace_all("-", " ") %>%
   trimws() # trim leading and trailing whitespace from the name column

sotu_corp2 <- corpus(sotu2)
obama_corpus <- corpus_subset(sotu_corp2, name=="barack obama")
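
The filename parsing above is easier to follow when the intermediate values are visible. The sketch below applies the same steps to two made-up filename stems; the stems (including the one with a trailing letter) and all object names are hypothetical.

```r
library(stringr)

# Hypothetical filename stems, as they would appear in docvar1
stems <- c("barack-obama-2009", "richard-nixon-1973a")

# Last five characters hold the year, plus a hyphen or a trailing letter
year_raw <- str_sub(stems, -5)
print(year_raw)

# Strip hyphens and the letters a and b, then convert to integer
year <- as.integer(str_replace_all(year_raw, "[-ab]", ""))
print(year)

# Everything before the last five characters holds the name
name_raw <- str_sub(stems, 1, -6)
print(name_raw)

# Replace hyphens with spaces and trim any leftover whitespace
name <- trimws(str_replace_all(name_raw, "-", " "))
print(name)
```

The trimws() step matters for stems with a trailing letter, where the name part ends in a hyphen that becomes a trailing space.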

From here, we can repeat the same steps we did above.

# Clean the tokens and create a dfm
obama_toks <- tokens(obama_corpus, remove_numbers = TRUE, remove_punct = TRUE)
obama_toks <- tokens_remove(obama_toks, pattern = stopwords("english"))
obama_toks <- tokens_tolower(obama_toks)
obama_dfm <- dfm(obama_toks)

obama_dfm

Output

Document-feature matrix of: 8 documents, 5,104 features (69.68% sparse) and 3 docvars.
                       features
docs                    madam speaker mr vice president members congress first lady
  barack-obama-2009.txt     1       1  1    2         5       1       10     8    1
  barack-obama-2010.txt     1       1  0    2         6       2       10     5    1
  barack-obama-2011.txt     0       3  2    1         2       2       10    11    0
  barack-obama-2012.txt     0       2  2    2         5       3       15     8    0
  barack-obama-2013.txt     0       1  2    1         2       1       17     6    0
  barack-obama-2014.txt     0       2  2    2         7       3       20    14    1
                       features
docs                    united
  barack-obama-2009.txt      5
  barack-obama-2010.txt      7
  barack-obama-2011.txt      4
  barack-obama-2012.txt      6
  barack-obama-2013.txt     10
  barack-obama-2014.txt      6
[ reached max_ndoc ... 2 more documents, reached max_nfeat ... 5,094 more features ]

We can look at keywords in context for this subset, as well.

# Keywords in context: the words that immediately precede and follow terms of interest

kw_health <- kwic(obama_toks, "health*", window = 10)
head(kw_health)

Output

Keyword-in-context with 6 matches.
 [barack-obama-2009.txt, 206]
 [barack-obama-2009.txt, 270]
 [barack-obama-2009.txt, 321]
 [barack-obama-2009.txt, 431]
 [barack-obama-2009.txt, 486]
 [barack-obama-2009.txt, 1146]

                           finding new sources energy yet import oil today ever cost
      instead opportunity invest future regulations gutted sake quick profit expense
               time jump-start job creation restart lending invest areas like energy
                           mass transit plan teachers can now keep jobs educate kids
 americans lost jobs recession able receive extended unemployment benefits continued
               another american century confront last price dependence oil high cost

 | health  |
 | healthy |
 | health  |
 | health  |
 | health  |
 | health  |

 care eats savings year yet keep delaying reform children compete
 market people bought homes knew afford banks lenders pushed bad
 care education grow economy even make hard choices bring deficit
 care professionals can continue caring sick police officers still streets
 care coverage help weather storm now know chamber watching home
 care schools preparing children mountain debt stand inherit responsibility next

We can make a lexical dispersion plot to visualize where our keyword is appearing across the documents.

textplot_xray(kw_health)

Output
