How to tokenize my dataset in R using the tidytext library?

I have been trying to follow Text Mining with R by Julia Silge; however, I cannot tokenize my dataset with the unnest_tokens function.
Here are the packages I have loaded:
# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)
Here is the dataset I tried to use, which is online, so the results should be reproducible:
bible <- readLines('http://bereanbible.com/bsb.txt')
And here is where everything falls apart.
Input:
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
From what I have read about this error in RStudio, the issue is that the dataset needs to be a matrix, so I tried transforming the dataset into a matrix and received the same error message.
Input:
bible <- readLines('http://bereanbible.com/bsb.txt')
bible <- as.matrix(bible, nrow = 31105, ncol = 2 )
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
Any recommendations for next steps I could take, or some good text mining resources I could use as I continue to dive into this, would be very much appreciated.

The problem is that readLines() creates a vector, not a data frame as expected by unnest_tokens(), so you need to convert it. It is also helpful to separate the verse reference into its own column:
library(tidytext)
library(tidyverse)
bible_orig <- readLines('http://bereanbible.com/bsb.txt')
# Get rid of the copyright etc.
bible_orig <- bible_orig[4:length(bible_orig)]
# Convert to df
bible <- enframe(bible_orig)
# Separate verse from text
bible <- bible %>%
separate(value, into = c("verse", "text"), sep = "\t")
tidy_bible <- bible %>%
unnest_tokens(word, text)
tidy_bible
#> # A tibble: 730,130 x 3
#> name verse word
#> <int> <chr> <chr>
#> 1 1 Genesis 1:1 in
#> 2 1 Genesis 1:1 the
#> 3 1 Genesis 1:1 beginning
#> 4 1 Genesis 1:1 god
#> 5 1 Genesis 1:1 created
#> 6 1 Genesis 1:1 the
#> 7 1 Genesis 1:1 heavens
#> 8 1 Genesis 1:1 and
#> 9 1 Genesis 1:1 the
#> 10 1 Genesis 1:1 earth
#> # … with 730,120 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)
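From there, the usual tidy workflows apply. As a quick follow-up sketch (assuming the tidy_bible object from above), you can count the most frequent words after removing stop words:
tidy_bible %>%
  anti_join(stop_words, by = "word") %>% # drop common English stop words
  count(word, sort = TRUE)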

Related

Finding the dominant topic in each sentence in topic modeling

One question I can't find the answer to for R is: how can I find the dominant topic of each sentence in an NLP topic model?
Imagine I have a data frame like this:
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
id <- c(1,2,3,4,5,6,7)
data <- data.frame(id, comment)
I preprocess as shown below:
text_cleaning_tokens <- data %>%
tidytext::unnest_tokens(word, comment)
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>%
anti_join(stop_words)
stemmed_token <- text_cleaning_tokens %>% mutate(word=wordStem(word))
tokens <- stemmed_token %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
tidyr::spread(key = ind, value = word)
tokens [is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark,-id,sep =" " )
tokens$clean_remark <- trimws(tokens$clean_remark)
Then I ran the FitLdaModel function on this data and finally found the top terms for 2 topics:
t_1 t_2
1 beauti built
2 block home
3 renov legal
4 bathroom locat
5 bdm bachelor
6 bdm_heart bachelor_suit
7 beauti_street block_norquai
8 beauti_upgr blueridg
9 bedroom blueridg_legal
10 bedroom_unit built_design
Now, based on the result I have, I want to find the dominant topic of each sentence. For example, I want to know, for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?
Can anyone help me with this question? Do we have any package that can do this?
It is pretty easy to work with quanteda and topicmodels. The former is for data management and quantitative analysis of textual data, the latter is for topic modeling inference.
Here I take your comment object and transform it to a corpus and then to a dfm. I then convert it to be understandable by topicmodels.
The function LDA() gives you all you need to easily extract information. In particular, with get_topics() you get the most probable topic for each document. If you instead want to see the document-topic weights, you can do so with ldamodel@gamma. You will see that get_topics() does exactly what you asked for.
Please see if this works for you.
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(topicmodels)
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
mycorp <- corpus(comment)
docvars(mycorp, "id") <- 1L:7L
mydfm <- dfm(mycorp)
# convert the dfm to a document-term matrix for topicmodels
forTM <- convert(mydfm, to = "topicmodels")
myLDA <- LDA(forTM, k = 2)
dominant_topics <- get_topics(myLDA)
dominant_topics
#> text1 text2 text3 text4 text5 text6 text7
#> 2 2 2 2 1 1 1
dtw <- myLDA@gamma
dtw
#> [,1] [,2]
#> [1,] 0.4870600 0.5129400
#> [2,] 0.4994974 0.5005026
#> [3,] 0.4980144 0.5019856
#> [4,] 0.4938985 0.5061015
#> [5,] 0.5037667 0.4962333
#> [6,] 0.5000727 0.4999273
#> [7,] 0.5176960 0.4823040
Created on 2021-03-18 by the reprex package (v1.0.0)
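If you want the dominant topics lined up with the original comments, a quick sketch (assuming the mycorp and myLDA objects from above):
res <- data.frame(id = docvars(mycorp, "id"),
                  topic = get_topics(myLDA))
res
The document-topic weights from myLDA@gamma can be added as extra columns in the same way if you want the full picture.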
I agree with the other answer that quanteda and topicmodels are a better choice. Maybe also look into seededlda which is an LDA implementation from one of the quanteda authors (with extra features you don't have to use).
However, if you want to stick with your choice of tidytext and textmineR, this is how you would do it.
First, I simplified your preprocessing a bit, since you did some steps that seemed unnecessary to me:
library(tidyverse)
library(tidytext)
text_cleaning_tokens <- data %>%
unnest_tokens(word, comment) %>%
mutate(word = str_remove_all(word, "[[:digit:]]+|[[:punct:]]+")) %>%
filter(!(nchar(word) <= 1))%>%
anti_join(stop_words, by = "word") %>%
mutate(word = SnowballC::wordStem(word))
Then I run LDA according to the textmineR example:
lda <- text_cleaning_tokens %>%
cast_sparse(id, word) %>%
textmineR::FitLdaModel(k = 2,
iterations = 200,
burnin = 175,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_r2 = TRUE)
Now all implementations of LDA deliver two important results:
phi (φ), which shows how each word in the corpus scores on each topic. The higher the phi value, the more prevalent the word is in that specific topic.
theta (θ), which shows how each document in the corpus scores on each topic. The higher the theta value, the more prevalent the topic is in that document. (topicmodels calls it gamma for some reason.)
In other words, all you have to do to find the most dominant topic in a text is:
lda$theta %>%
as_tibble() %>%
rowwise() %>%
mutate(top = which.max(c_across(everything()))) %>% # find highest value per row dplyr style
bind_cols(data, .) %>% # bind to original data
as_tibble() # just for nicer printing
#> # A tibble: 7 x 5
#> id comment t_1 t_2 top
#> <int> <chr> <dbl> <dbl> <int>
#> 1 1 outstanding renovation all improvements are t… 0.892 0.108 1
#> 2 2 solidly constructed lovingly maintained sf crest … 0.0161 0.984 2
#> 3 3 one year since built new this well designed store… 0.0238 0.976 2
#> 4 4 beautiful street large bdm in the heart of lynn v… 0.986 0.0139 1
#> 5 5 rare to find legal beautiful upgr in port moody c… 0.992 0.00820 1
#> 6 6 fantastic opportunity to get value for the money … 0.266 0.734 2
#> 7 7 original owner tired but rock solid perfect locat… 0.00549 0.995 2
Created on 2021-03-18 by the reprex package (v1.0.0)
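By the way, phi is what you would use if you also want the top words per topic. A minimal sketch, assuming the lda object fitted above (and dplyr/tidyr >= 1.0 for slice_max()/pivot_longer()):
lda$phi %>%
  as_tibble(rownames = "topic") %>%
  pivot_longer(-topic, names_to = "word", values_to = "phi") %>% # one row per topic-word pair
  group_by(topic) %>%
  slice_max(phi, n = 5) # the five highest-scoring words per topic
textmineR also ships a helper for this, GetTopTerms(), if you prefer to stay within that package.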
I also recommend you read Julia Silge's blog posts and tutorials on the matter.

How to use unnest_token on twitter text data?

I'm trying to run the following and it gives me an error message.
data <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- as.data.frame(data)
remove_reg <- "&|<|>"
tidy_data <- data_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text, token = "data_df") %>%
filter(!word %in% stop_words$word,
!word %in% str_remove_all(stop_words$word, "'"),
str_detect(word, "[a-z]"))
It gives me the following error message:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
argument str should be a character vector (or an object coercible to)
How can I fix it?
The main problem is that you gave your text column the name data but then referred to it later as text. Try something more like this:
library(tidyverse)
library(tidytext)
text <- c("Who said we cant have a lil dance party while were stuck in Quarantine? Happy Friday Cousins!! We got through another week of Quarantine. Lets continue to stay safe, healthy and make the best of the situation. . . Video: . . - #blackgirlstraveltoo #everydayafrica #travelnoire #blacktraveljourney #essencetravels #africanculture #blacktravelfeed #blacktravel #melanintravel #ethiopia #representationmatters #blackcommunity #Moyoafrika #browngirlbloggers #travelafrica #blackgirlskillingit #passportstamps #blacktravelista #blackisbeautiful #weworktotravel #blackgirlsrock #mytravelcrush #blackandabroad #blackgirlstravel #blacktravel #africanamerican #africangirlskillingit #africanmusic #blacktravelmovement #blacktravelgram",
"#Copingwiththelockdown... Festac town, Lagos. #covid19 #streetphotography #urbanphotography #copingwiththelockdown #documentaryphotography #hustlingandbustling #cityscape #coronavirus #busyroad #everydaypeople #everydaylife #commute #lagosroad #lagosmycity #nigeria #africa #westafrica #lagos #hustle #people #strength #faith #nopoverty #everydayeverywhere #everydayafrica #everydaylagos #nohunger #chroniclesofonyinye",
"Peace Everywhere. Amani Kila Pahali. Photo by Adan Galma . * * * * * * #matharestories #mathare #adangalma #everydaymathare #everydayeverywhere #everydayafrica #peace #amani #knowmathare #streets #spi_street #mathareslums")
data_df <- tibble(text)
remove_reg <- "&|<|>"
data_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>%
unnest_tokens(word, text) %>%
anti_join(get_stopwords()) %>%
filter(str_detect(word, "[a-z]"))
#> Joining, by = "word"
#> # A tibble: 105 x 1
#> word
#> <chr>
#> 1 said
#> 2 cant
#> 3 lil
#> 4 dance
#> 5 party
#> 6 stuck
#> 7 quarantine
#> 8 happy
#> 9 friday
#> 10 cousins
#> # … with 95 more rows
If you are specifically interested in Twitter data, consider using token = "tweets":
data_df %>%
unnest_tokens(word, text, token = "tweets")
#> Using `to_lower = TRUE` with `token = 'tweets'` may not preserve URLs.
#> # A tibble: 121 x 1
#> word
#> <chr>
#> 1 who
#> 2 said
#> 3 we
#> 4 cant
#> 5 have
#> 6 a
#> 7 lil
#> 8 dance
#> 9 party
#> 10 while
#> # … with 111 more rows
Created on 2020-04-12 by the reprex package (v0.3.0)
This option handles hashtags and usernames well.
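Since token = "tweets" keeps the leading #, you can also pull out just the hashtags afterwards. A small sketch using the data_df from above:
data_df %>%
  unnest_tokens(word, text, token = "tweets") %>%
  filter(str_detect(word, "^#")) %>% # keep only hashtags
  count(word, sort = TRUE)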

tidyverse: Combining a column with different length into an existing tibble

I have tibble which looks like:
Review_Text
<chr>
Because it is a nice game
Best trump soumd board out there
Boring hated because it does not work when I get done
but you can make better game if game has unlimeted chemicals bottles
cant get pass loading screen
Can't play video
Casting from Note 3 to Roku 3 screen appears to start loading then back to Roku home screen. Roku software version 6.1 build 5604. It is up to date but still not able to cast Showbox. ..
Crashes all the time in the middle of the show. Whining ensues. Ugh.
Crashing
Does not work on tab 3
Doesn't work
Doesn't work with S7 which is unacceptable in this day and age.
Doesn't work... I absolutely hate it
Dont use this app battery consumers
Dose this work for snmsung I tried some many times 😡
😄I loved it so much I would recommend this to other families 😄
Every time i pressed apply it just took me to the home screen
Everytime it says collect on T.V. it won't obtain the magisword
Excellent!!! My grandchildren watch it all the time...
Feel like Lizzie McGuire 😂❤
I want to remove the stopwords from Review_Text and append the column without stopwords to the existing tibble. I am using the following code to remove the stopwords:
no_stpwrd <- tibble(line = 1:nrow(tb), text = tb$Review_Text) %>%
unnest_tokens(word, text)%>%
anti_join(stop_words, by = c("word" = "word")) %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
Then I use the following command to merge the no_stpwrd with the existing tibble:
add_column(tb, no_stpwrd)
However, when I run the above command, it throws an error message because of a mismatch between the number of rows the tibble and no_stpwrd have. There are a few rows in the tibble that contain only stopwords (for example, line 11), so when I remove stopwords they come back empty, and hence the number of rows in no_stpwrd is reduced. Is there any way to fix the issue?
Instead of trying to use add_column() here, what you want to do is use a join.
library(tidyverse)
library(tidytext)
review_df <- tibble(review_text = c("Because it is a nice game",
"cant get pass loading screen",
"Because I don't",
"Dont use this app battery consumers")) %>%
mutate(line = row_number())
review_df
#> # A tibble: 4 x 2
#> review_text line
#> <chr> <int>
#> 1 Because it is a nice game 1
#> 2 cant get pass loading screen 2
#> 3 Because I don't 3
#> 4 Dont use this app battery consumers 4
no_stpwrd <- review_df %>%
unnest_tokens(word, review_text) %>%
anti_join(get_stopwords()) %>%
group_by(line) %>%
summarise(title = paste(word,collapse =' '))
#> Joining, by = "word"
no_stpwrd
#> # A tibble: 3 x 2
#> line title
#> <int> <chr>
#> 1 1 nice game
#> 2 2 cant get pass loading screen
#> 3 4 dont use app battery consumers
Notice that the third document is no longer there because it was made up of all stop words. It's time for a left_join().
review_df %>%
left_join(no_stpwrd)
#> Joining, by = "line"
#> # A tibble: 4 x 3
#> review_text line title
#> <chr> <int> <chr>
#> 1 Because it is a nice game 1 nice game
#> 2 cant get pass loading screen 2 cant get pass loading screen
#> 3 Because I don't 3 <NA>
#> 4 Dont use this app battery consumers 4 dont use app battery consumers
Created on 2020-03-20 by the reprex package (v0.3.0)
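If you would rather have an empty string than NA for the all-stopword rows, tidyr::replace_na() can follow the join. A sketch, assuming the objects from above:
review_df %>%
  left_join(no_stpwrd, by = "line") %>%
  mutate(title = replace_na(title, "")) # empty string instead of NA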

ngrams analysis in tidytext in R

I am trying to do ngram analysis in tidytext. I have a corpus of 770 speeches. However, the function unnest_tokens in tidytext takes a data frame as input. When I checked the example (the Jane Austen books), each line of the book is stored as a row in a data frame. I am not able to convert the corpus into a data frame, neither one speech at a time nor for the whole corpus at once.
How can I run ngram (n = 2, 3, etc.) analysis in tidytext using unnest_tokens on my corpus? Can someone please suggest?
Thanks
You can use the ngram and tm libraries for this. You can replace myCorpus with the corpus you created.
library(tm)
library(ngram)
myCorpus<-c("Hi How are you","Hello World","I love Stackoverflow","Good Bye All")
ng <- ngram(myCorpus, n = 2)
get.phrasetable(ng)
If you want to tokenize and convert your corpus into a data frame, use the code below.
tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "),stringsAsFactors = FALSE)
You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:
library(tidyverse)
library(tidytext)
speech_vec <- c("I am giving a speech!",
"My second speech is even better.",
"Unfortunately, this speech is terrible!",
"For my final speech, I will wow you all.")
speech_df <- tibble(text = speech_vec) %>%
mutate(speech = row_number())
tidy_speeches <- speech_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_speeches
#> # A tibble: 21 x 2
#> speech bigram
#> <int> <chr>
#> 1 1 i am
#> 2 1 am giving
#> 3 1 giving a
#> 4 1 a speech
#> 5 2 my second
#> 6 2 second speech
#> 7 2 speech is
#> 8 2 is even
#> 9 2 even better
#> 10 3 unfortunately this
#> # … with 11 more rows
Created on 2020-02-15 by the reprex package (v0.3.0)
If instead you mean that you have a DocumentTermMatrix from the tm package, check out the chapter of Text Mining with R on converting to and from non-tidy formats for details on how to get to a tidy data structure.
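For the DocumentTermMatrix case, the tidy() verb from tidytext is the usual route; a minimal sketch with a hypothetical dtm object:
library(tidytext)
tidy_dtm <- tidy(dtm) # one row per document-term pair: document, term, count
Note that a document-term matrix has already been split into single terms, so for bigrams you will still want to go back to the raw text, as in the example above.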

Appending columns/variables from a data frame into a new variable

I've been searching for how to do this but cannot seem to find an example for my question. I'm pretty new to R but am very familiar with SAS, so I wanted to ask how to do the equivalent of this SAS code in R.
I have one dataset (cohort) and two variables (last_pre_cv_prob, first_post_cv_prob), and I want to make a new dataset with two variables: the first stacks the two original variables underneath each other (cv_prob), and the second indicates which variable each value came from (time). So in SAS I would simply do this:
data post_cv;
set cohort(keep=last_pre_cv_prob rename=(last_pre_cv_prob=cv_prob) in=a)
cohort(keep=first_post_cv_prob rename=(first_post_cv_prob=cv_prob) in=b);
if b then time='post';
if a then time='pre';
run;
How would I do this in R?
Thanks!
edit:
post_cv2 %>% gather(column, prob, last_pre_cv_prob, first_post_cv_prob)
Error in eval(expr, envir, enclos) : object 'last_pre_cv_prob' not found
Then I tried:
post_cv2 %>% gather(column, prob, liver_cv$last_pre_cv_prob,
liver_cv$first_post_cv_prob)
Error: All select() inputs must resolve to integer column positions.
The following do not:
* liver_cv$last_pre_cv_prob
* liver_cv$first_post_cv_prob
edit:
Second issue resolved; I had to add backticks:
post_cv2 <- post_cv %>%
gather(time, cv_prob, `liver_cv$last_pre_cv_prob`,
`liver_cv$first_post_cv_prob`)
edit:
Solved!
library(tidyverse)
library(stringr)
post_cv <- data_frame(pre = liver_cv$last_pre_cv_prob, post = liver_cv$first_post_cv_prob)
post_cv2 <- post_cv %>%
gather(time, cv_prob, pre, post)
You can simply gather the 2 columns and extract the time information:
library(tidyverse)
cohort <- data_frame(last_pre_cv_prob = runif(5),
first_post_cv_prob = runif(5))
cohort_2 <- cohort %>%
gather(time, cv_prob, last_pre_cv_prob, first_post_cv_prob) %>%
mutate(time = str_extract(time, "post|pre"))
cohort_2
#> # A tibble: 10 × 2
#> time cv_prob
#> <chr> <dbl>
#> 1 pre 0.64527372
#> 2 pre 0.55086818
#> 3 pre 0.05882369
#> 4 pre 0.19626147
#> 5 pre 0.05933594
#> 6 post 0.25564350
#> 7 post 0.01908338
#> 8 post 0.84901506
#> 9 post 0.07761842
#> 10 post 0.29019190
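For what it's worth, on tidyr >= 1.0 the same reshape can be written with pivot_longer(), the successor to gather(). A sketch using the cohort data from above:
cohort %>%
  pivot_longer(c(last_pre_cv_prob, first_post_cv_prob),
               names_to = "time", values_to = "cv_prob") %>% # stack the two columns
  mutate(time = str_extract(time, "post|pre")) # recover the pre/post label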
