One question that I can't find the answer for in R is that how I can find the dominant topic in NLP model for each sentence?
Imagine I have data frame like this:
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
id <- c(1,2,3,4,5,6,7)
data <- data.frame(id, comment)
I do preprocess as shown below:
text_cleaning_tokens <- data %>%
tidytext::unnest_tokens(word, comment)
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>%
anti_join(stop_words)
stemmed_token <- text_cleaning_tokens %>% mutate(word=wordStem(word))
tokens <- stemmed_token %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
tidyr::spread(key = ind, value = word)
tokens [is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark,-id,sep =" " )
tokens$clean_remark <- trimws(tokens$clean_remark)
The I ran FitLdaModel function on this data and finally, found the best topics based on 2 groups:
t_1 t_2
1 beauti built
2 block home
3 renov legal
4 bathroom locat
5 bdm bachelor
6 bdm_heart bachelor_suit
7 beauti_street block_norquai
8 beauti_upgr blueridg
9 bedroom blueridg_legal
10 bedroom_unit built_design
now based on the result I have, I want to find the most dominant topic in each sentence in topic modelling. For example, I want to know that for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?
Can anyone help me with this question? do we have any package that can do this?
It is pretty easy to work with quanteda and topicmodels. The former is for data management and quantitative analysis of textual data, the latter is for topic modeling inference.
Here I take your comment object and transform it to a corpus and then to a dfm. I then convert it to be understandable by topicmodels.
The function LDA() gives you all you need to easily extract information. In particular, with get_topics() you get the most probable topic for each document. If you instead want to see the document-topic-weights you can do so with ldamodel#gamma. You will see that get_topics() does exactly what you asked.
Please, see if this works for you.
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(topicmodels)
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
mycorp <- corpus(comment)
docvars(mycorp, "id") <- 1L:7L
mydfm <- dfm(mycorp)
# convert the DFM to a Document Matrix for topicmodels
forTM <- convert(mydfm, to = "topicmodels")
myLDA <- LDA(forTM, k = 2)
dominant_topics <- get_topics(myLDA)
dominant_topics
#> text1 text2 text3 text4 text5 text6 text7
#> 2 2 2 2 1 1 1
dtw <- myLDA#gamma
dtw
#> [,1] [,2]
#> [1,] 0.4870600 0.5129400
#> [2,] 0.4994974 0.5005026
#> [3,] 0.4980144 0.5019856
#> [4,] 0.4938985 0.5061015
#> [5,] 0.5037667 0.4962333
#> [6,] 0.5000727 0.4999273
#> [7,] 0.5176960 0.4823040
Created on 2021-03-18 by the reprex package (v1.0.0)
I agree with the other answer that quanteda and topicmodels are a better choice. Maybe also look into seededlda which is an LDA implementation from one of the quanteda authors (with extra features you don't have to use).
However, if you want to stick with your choice of tidytext and textmineR, this is how you would do it.
First, I simplified your preprocessing a bit, since you did some steps that seemed unnecessary to me:
library(tidyverse)
library(tidytext)
text_cleaning_tokens <- data %>%
unnest_tokens(word, comment) %>%
mutate(word = str_remove(word, "[[:digit:]]|[[:punct:]]")) %>%
filter(!(nchar(word) <= 1))%>%
anti_join(stop_words, by = "word") %>%
mutate(word = SnowballC::wordStem(word))
Then I run LDA according to the textmineR example:
lda <- text_cleaning_tokens %>%
cast_sparse(id, word) %>%
textmineR::FitLdaModel(k = 2,
iterations = 200,
burnin = 175,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_r2 = TRUE)
Now all implementations of LDA deliver two important results:
phi (φ) which shows for each word in the corpus how it scored on each topic. The higher the phi-value, the more prevalent the word in this specific topic.
theta (θ) which shows for each document in the corpus how it scored on each topic. The higher the theta-value, the more prevalent the topic is in the document. (topicmodels calls it gamma for some reason.)
In other words, all you have to do to find the most dominant topic in a text is:
lda$theta %>%
as_tibble() %>%
rowwise() %>%
mutate(top = which.max(c_across(everything()))) %>% # find highest value per row dplyr style
bind_cols(data, .) %>% # bind to original data
as_tibble() # just for nicer printing
#> # A tibble: 7 x 5
#> id comment t_1 t_2 top
#> <int> <chr> <dbl> <dbl> <int>
#> 1 1 1 . outstanding renovation all improvements are t… 0.892 0.108 1
#> 2 2 solidly constructed lovingly maintained sf crest … 0.0161 0.984 2
#> 3 3 one year since built new this well designed store… 0.0238 0.976 2
#> 4 4 beautiful street large bdm in the heart of lynn v… 0.986 0.0139 1
#> 5 5 rare to find legal beautiful upgr in port moody c… 0.992 0.00820 1
#> 6 6 fantastic opportunity to get value for the money … 0.266 0.734 2
#> 7 7 original owner tired but rock solid perfect locat… 0.00549 0.995 2
Created on 2021-03-18 by the reprex package (v1.0.0)
I also recommend you read Julia Silge's stuff on the matter. For example, this and this.
Related
I'm trying to scrape information from this webpage, https://www.ncleg.gov/Laws/GeneralStatuteSections/Chapter14, (the info under the "Chapter 14" tab)
and put it into a datafram with two columns in R, but these skills are out of my wheelhouse and I need some help. More specifically, I want one column with the G.S. numbers ("G.S. 14-1", "G.S. 14-1.1", etc.) and one column with the names corresponding to these G.S. numbers ("14.1 Felonies and Misdemeanors Defined", "14-1.1: Repealed by Session Laws 1993, c. 538, s. 2.", etc.). As text and not the links.
I've tried using the selector gadget, but this tool is pretty new to me and I don't really understand how to apply what I do with it in R.
Any advice or tips on how to do this?
Yes, this is fairly tricky. I would probably approach it with a combination of xpath and regular expressions:
library(rvest)
#> Loading required package: xml2
page <- read_html("https://www.ncleg.gov/Laws/GeneralStatuteSections/Chapter14")
x1 <- ("//div[#class = 'col-12 col-md-3 col-lg-2 d-flex mobile-font-size-large']")
x2 <- ("//div[#class='col-12 col-md-9 col-lg-10']")
description <- html_nodes(page, xpath = x2) %>% html_text() %>% trimws()
col2 <- gsub("^.*.\\d[A-Z]?(\\.|:) +", "", description)
col1 <- gsub("^(.*.\\d[A-Z]?[\\.|:]) +.*$", "\\1", description)
col1 <- gsub("\u00a7", "GS", col1)
df <- data.frame(section = col1, description = col2)
For ease of printing, I'll show the resulting data frame as a tibble:
tibble::as_tibble(df)
#> # A tibble: 1,059 x 2
#> section description
#> <chr> <chr>
#> 1 GS 14-1. Felonies and misdemeanors defined.
#> 2 GS 14-1.1: Repealed by Session Laws 1993, c. 538, s. 2.
#> 3 GS 14-2: Repealed by Session Laws 1993, c. 538, s. 2.1.
#> 4 GS 14-2.1: Repealed by Session Laws 1993, c. 538, s. 3.
#> 5 GS 14-2.2: Repealed by Session Laws 2003-0378, s. 1, effective August 1, 200~
#> 6 GS 14-2.3. Forfeiture of gain acquired through criminal activity.
#> 7 GS 14-2.4. Punishment for conspiracy to commit a felony.
#> 8 GS 14-2.5. Punishment for attempt to commit a felony or misdemeanor.
#> 9 GS 14-2.6. Punishment for solicitation to commit a felony or misdemeanor.
#> 10 GS 14-3. Punishment of misdemeanors, infamous offenses, offenses committed~
#> # ... with 1,049 more rows
Created on 2020-09-30 by the reprex package (v0.3.0)
I have been trying to follow Text Mining with R by Julia Silge, however, I cannot tokenize my dataset with the unnest_tokens function.
Here are the packages I have loaded:
# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)
Here is the dataset I tried to use which is online, so the results should be reproducible:
bible <- readLines('http://bereanbible.com/bsb.txt')
And here is where everything falls apart.
Input:
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
From what I have read about this error, in Rstudio, the issue is that the dataset needs to be a matrix, so I tried transforming the dataset into a matrix table and I received the same error message.
Input:
bible <- readLines('http://bereanbible.com/bsb.txt')
bible <- as.matrix(bible, nrow = 31105, ncol = 2 )
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
Any recommendations for what next steps I could take or maybe some good Text mining sources I could use as I continue to dive into this would be very much appreciated.
The problem is that readLines()creates a vector, not a dataframe, as expected by unnest_tokens(), so you need to convert it. It is also helpful to separate the verse to it's own column:
library(tidytext)
library(tidyverse)
bible_orig <- readLines('http://bereanbible.com/bsb.txt')
# Get rid of the copyright etc.
bible_orig <- bible_orig[4:length(bible_orig)]
# Convert to df
bible <- enframe(bible_orig)
# Separate verse from text
bible <- bible %>%
separate(value, into = c("verse", "text"), sep = "\t")
tidy_bible <- bible %>%
unnest_tokens(word, text)
tidy_bible
#> # A tibble: 730,130 x 3
#> name verse word
#> <int> <chr> <chr>
#> 1 1 Genesis 1:1 in
#> 2 1 Genesis 1:1 the
#> 3 1 Genesis 1:1 beginning
#> 4 1 Genesis 1:1 god
#> 5 1 Genesis 1:1 created
#> 6 1 Genesis 1:1 the
#> 7 1 Genesis 1:1 heavens
#> 8 1 Genesis 1:1 and
#> 9 1 Genesis 1:1 the
#> 10 1 Genesis 1:1 earth
#> # … with 730,120 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)
I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
The below image makes it easier to see what it represents:
What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differs by no more than one.
The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates and it seems to work
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work ok but if the modes of transportation and number of people are significant, it becomes really difficult finding/knowing not just which rows are duplicates, but which indexes actually matched. In this case that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?
Happy to elaborate further and very grateful for any and all help.
library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", )
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
group_by(person) %>%
filter(value == "X") %>%
summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
arrange(desc(Modalities), Which) %>%
mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get a membership list in any particular IndenticalGrp you can just pull like this.
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
I am trying to do ngram analysis for in tidytext, I have a corpus of 770 speeches. However the function unnest_tokens in tidytext takes data frame as input. when i checked with the example (jane austin books) each line of the book is stored as row in a data frame. i am not able to convert the corpus into a dataframe, neither for one speech at a time nor for all the corpus at once.
What is the way i can run ngrams (n=2,3, etc) analysis on tidytext using unnest tokens on my corpus. Can someone please suggest?
Thanks
You can use library ngram & tm for this.You can replace "myCorpus" with the corpus you created.
library(tm)
library(ngarm)
myCorpus<-c("Hi How are you","Hello World","I love Stackoverflow","Good Bye All")
ng <- ngram (myCorpus , n =2)
get.phrasetable (ng)
If you want to tokenize and convert your corpus into a dataframe then use the below code.
tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "),stringsAsFactors = FALSE)
You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:
library(tidyverse)
library(tidytext)
speech_vec <- c("I am giving a speech!",
"My second speech is even better.",
"Unfortunately, this speech is terrible!",
"For my final speech, I will wow you all.")
speech_df <- tibble(text = speech_vec) %>%
mutate(speech = row_number())
tidy_speeches <- speech_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_speeches
#> # A tibble: 21 x 2
#> speech bigram
#> <int> <chr>
#> 1 1 i am
#> 2 1 am giving
#> 3 1 giving a
#> 4 1 a speech
#> 5 2 my second
#> 6 2 second speech
#> 7 2 speech is
#> 8 2 is even
#> 9 2 even better
#> 10 3 unfortunately this
#> # … with 11 more rows
Created on 2020-02-15 by the reprex package (v0.3.0)
If instead, you mean that you have a DocumentTermMatrix from the tm package, check out this chapter for details on how to convert to a tidy data structure.
I am trying to the sentiment of a dataset of Tweets using the AFINN dictionary (get_sentiments("afinn"). A sample of the dataset is provided below:
A tibble: 10 x 2
Date TweetText
<dttm> <chr>
1 2018-02-10 21:58:19 "RT #RealSirTomJones: Still got the moves! That was a lo~
2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual
4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK
7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk
8 2018-02-10 21:58:21 RT #tonysheps: Sir Thomas Jones you’re a bloody legend #~
9 2018-02-10 21:58:22 #ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT #RealSirTomJones: Still got the moves! That was a lo~
What I want to do is:
1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the score of all the words of each Tweet
4. Return this sum into a new third column, so I can see the score per Tweet.
For a similar lexicon I found the following code:
# Initiate the scoreTopic
scoreTopic <- 0
# Start a loop over the documents
for (i in 1:length (myCorpus)) {
# Store separate words in character vector
terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
# Determine the number of positive matches
pos_matches <- sum(terms %in% positive_words)
# Determine the number of negative matches
neg_matches <- sum(terms %in% negative_words)
# Store the difference in the results vector
scoreTopic [i] <- pos_matches - neg_matches
} # End of the for loop
dsMyTweets$score <- scoreTopic
I am however not able to adjust this code to get it working with the afinn dictionary.
This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).
library(tidytext)
library(tidyverse)
tweets <- tribble(
~tweetID, ~TweetText,
1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
2, "#darinself OMG, this is my favorite.",
3, "#treycausey #ftrain THIS IS AMAZING.",
4, "#nest No, no, not in error. Just the turkey!",
5, "The #nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")
Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package allows you to use a special tokenizer for tweets) and the inner_join() implements the sentiment analysis.
tweet_sentiment <- tweets %>%
unnest_tokens(word, TweetText, token = "tweets") %>%
inner_join(get_sentiments("afinn"))
#> Joining, by = "word"
Now we can find the scores for each tweet. Take the original data set of tweets and left_join() on to it the sum() of the scores for each tweet. The handy function replace_na() from tidyr lets you replace the resulting NA values with zero.
tweets %>%
left_join(tweet_sentiment %>%
group_by(tweetID) %>%
summarise(score = sum(score))) %>%
replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#> tweetID TweetText score
#> <dbl> <chr> <dbl>
#> 1 1. Was Julie helping me because I don't know anything about … 4.
#> 2 2. #darinself OMG, this is my favorite. 2.
#> 3 3. #treycausey #ftrain THIS IS AMAZING. 4.
#> 4 4. #nest No, no, not in error. Just the turkey! -4.
#> 5 5. The #nest people should write a blog post about how many … 0.
Created on 2018-05-09 by the reprex package (v0.2.0).
If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.
For future reference:
Score_word <- function(x) {
word_bool_vec <- get_sentiments("afinn")$word==x
score <- get_sentiments("afinn")$score[word_bool_vec]
return (score) }
Score_tweet <- function(sentence) {
words <- unlist(strsplit(sentence, " "))
words <- as.vector(words)
scores <- sapply(words, Score_word)
scores <- unlist(scores)
Score_tweet <- sum(scores)
return (Score_tweet)
}
dsMyTweets$score<-apply(df, 1, Score_tweet)
This executes what I initially wanted! :)