My text data is a novel in plain text. I used the packages tm and tidytext. Data processing went well and I created my TermDocumentMatrix without trouble.
library(readr)   # for read_lines()
library(tm)

text <- read_lines("GoneWithTheWind2.txt")
set.seed(314)
text <- iconv(text, "UTF-8", sub = "")
myCorpus <- VCorpus(VectorSource(text))   # corpus creation step, implied but not shown in the question
myCorpus <- tm_map(myCorpus, removeWords, c(stopwords("english"),
                                            stopwords("SMART"), mystopwords, Top200Words))
myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
However, I could not get the inner_join between the bing lexicon and the TermDocumentMatrix to work for a chronological sentiment analysis of this novel over time. I wrote the code below based on an online example, but I did not know what to group by in count(sentiment, ...) (I put ???? as a placeholder), because neither the plain text nor the TermDocumentMatrix has a "lines" column.
bing <- get_sentiments("bing")
m <- as.matrix(myDtm)
v <- sort(rowSums(m), decreasing = TRUE)
myNames <- names(v)
d <- data.frame(term = myNames, freq = v)

wind_polarity <- d %>%
  # Inner join to the lexicon
  inner_join(bing, by = c("term" = "word")) %>%
  # Count by sentiment, ????
  count(sentiment, ????) %>%
  # Spread sentiments
  spread(sentiment, n, fill = 0) %>%
  mutate(
    # Add polarity field
    polarity = positive - negative,
    # Add line number field
    line_number = row_number())
Then I plot with ggplot.
I tried adding an "Index" column indicating the line number for each document (line) in text, but this column disappears somewhere in the process. Any suggestions would be highly appreciated.
Below is an approach that calculates the polarity per line (based on a minimal example of three lines). You might join your DTM with the lexicon directly to maintain the count information, then turn the sentiment labels into a numeric representation and do your calculations per line. You can certainly rewrite the code and make it more elegant (I am not very familiar with the dplyr vocabulary, sorry). I hope it helps anyway.
library(tm)
library(tidytext)
library(dplyr)   # needed for inner_join() and mutate() below

text <- c("I like coffee."
          ,"I rather like tea."
          ,"I hate coffee and tea, but I love orange juice.")

myDtm <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE))
bing <- tidytext::get_sentiments("bing")
wind_polarity <- as.matrix(myDtm) %>%
  # keep the terms as a column alongside the per-document counts
  data.frame(terms = rownames(myDtm), ., stringsAsFactors = FALSE) %>%
  inner_join(bing, by = c("terms" = "word")) %>%
  # code sentiment as +1 / -1 and drop the label columns
  mutate(terms = NULL,
         polarity = ifelse(.[, "sentiment"] == "positive", 1, -1),
         sentiment = NULL) %>%
  # weight each term's counts by its polarity
  { . * .$polarity } %>%
  mutate(polarity = NULL) %>%
  colSums
# the polarity per line, which you may plot, e.g., with base graphics or ggplot
# X1 X2 X3
# 1 1 0
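Alternatively, if you would rather keep an explicit line index (the "Index" column mentioned in the question), a minimal sketch that tokenizes with tidytext before joining, so the line number survives the join (assuming text is your vector of lines; line_polarity is just a new name for the result):
library(dplyr)
library(tidyr)
library(tidytext)

line_polarity <- tibble(line_number = seq_along(text), text = text) %>%
  unnest_tokens(word, text) %>%                        # one row per word, line_number is kept
  inner_join(get_sentiments("bing"), by = "word") %>%  # lines with no sentiment words drop out here
  count(line_number, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(polarity = positive - negative)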
I am at the final stages of a project where I have been comparing the appraisal price vs. the sold price of different properties. The complete code for data collection and tidying is below.
At this stage I am looking at different ways to visualize my data. However, I am quite new to this, so my question is whether anyone has any "new" or special ways of visualizing data that they find useful or intuitive. I have given a couple of examples of what I am able to visualize now using ggplot.
Additionally: my visualizations currently plot all 1275 observations every time. However, I would also like to visualize the data with the mean and median of the Percentage, SoldAmount and Tax variables, which I am most interested in, for example the mean value of the Percentage column for different years (see the sketch after the plotting code below).
Appreciate any help!
Complete code:
#Step 1: Load needed library
library(tidyverse)
library(rvest)
library(jsonlite)
library(stringi)
library(dplyr)
library(data.table)
library(ggplot2)
#Step 2: Access the URL of where the data is located
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
#Step 3: Direct JSON as format of data in URL
data <- jsonlite::fromJSON(url, flatten = TRUE)
#Step 4: Access all items in API
totalItems <- data$TotalNumberOfItems
#Step 5: Summarize all data from API
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
jsonlite::fromJSON(., flatten = TRUE) %>%
.[1] %>%
as.data.frame() %>%
rename_with(~str_replace(., "ListItems.", ""), everything())
#Step 6: remove columns not needed
allData <- allData[, -c(1,4,8,9,11,12,13,14,15)]
#Step 7: remove whitespace and change to numeric in columns SoldAmount and Tax
#https://stackoverflow.com/questions/71440696/r-warning-argument-is-not-an-atomic-vector-when-attempting-to-remove-whites/71440806#71440806
allData[c("Tax", "SoldAmount")] <- lapply(allData[c("Tax", "SoldAmount")], function(z) as.numeric(gsub(" ", "", z)))
#Step 8: Remove rows where value is NA
#https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
alldata <- allData %>%
filter(across(where(is.numeric),
~ !is.na(.)))
#Step 9: Remove values below 10000 NOK in SoldAmount and Tax.
alldata <- alldata %>%
filter_all(any_vars(is.numeric(.) & . > 10000))
#Step 10: Calculate percentage change between tax and sold amount and create new column with percent change
#df %>% mutate(Percentage = number/sum(number))
alldata_Percent <- alldata %>% mutate(Percentage = (SoldAmount-Tax)/Tax)
Visualization
# Plot Percentage difference based on County
ggplot(data=alldata_Percent,mapping = aes(x = Percentage, y = County)) +
geom_point(size = 1.5)
#Plot County with both Date and Percentage difference
theme_set(new = ggthemes::theme_economist())
p <- ggplot(data = alldata_Percent,
mapping = aes(x = Date, y = Percentage, colour = County)) +
geom_line(na.rm = TRUE) +
geom_point(na.rm = TRUE)
p
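As a rough sketch for the yearly summaries mentioned above (assuming the Date column can be parsed by as.Date(); adjust the parsing if it cannot):
# Yearly mean and median of Percentage (sketch, not part of the original code)
yearly_summary <- alldata_Percent %>%
  mutate(Year = format(as.Date(Date), "%Y")) %>%
  group_by(Year) %>%
  summarise(mean_percentage   = mean(Percentage, na.rm = TRUE),
            median_percentage = median(Percentage, na.rm = TRUE))

# one bar per year for the mean; swap in median_percentage for the median
ggplot(yearly_summary, aes(x = Year, y = mean_percentage)) +
  geom_col() +
  labs(y = "Mean percentage difference")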
I have a data set from a text analysis. There is a column that shows whether one of the predefined terms was recognized (it shows the term itself). It looks somewhat like this (the relevant column is "funnel_term"):
sample of my data set
I want to count how many times each of the terms in "funnel_term" appears. I thought of a for loop, but it is not working as I wished. The output I am looking for would be something like this:
sexual - 6
racist - 4
ill - 2
Thanks in advance.
You can use grep() for this. Example with minimal data set:
df <- data.frame(x = c("['Sexual']", "['Sexual']"))
length(grep("Sexual", df$x))
Or with a prettier output:
paste("Sexual - ", length(grep("Sexual", test$x)), sep="")
[1] "Sexual - 2"
Or with the package dplyr:
library(dplyr)
df <- data.frame(x = c("['Sexual']", "['Sexual']"))
df %>% dplyr::count(x)
This doesn't work for cells with two words, for example "['Sexual', 'Religion']". So we need this:
library(dplyr)
library(tidyr)     # for unnest()
library(stringr)   # for str_replace_all()
df <- data.frame(x = c("['Sexual', 'Religion']", "['Sexual']"))
df %>% mutate(x2 = strsplit(as.character(x), ",")) %>%
  unnest(x2) %>%
  mutate(x2 = str_replace_all(x2, "[^[:alnum:]]", "")) %>%
  count(x2)
Do you want to count multiple occurrences of a word like "sexual" or "racist" within a row? If so, you may want to check out the function gregexpr:
gregexpr("sexual", df$text)
This will tell you the starting position of each match of "sexual" in your column. To get a count of all of them, you can do:
object_1 <- gregexpr("sexual", df$text)
# drop rows without a match (gregexpr returns -1 when nothing is found);
# deleting elements inside a for loop shifts the indices, so filter instead
object_1 <- object_1[sapply(object_1, function(x) x[1] != -1)]
# total number of matches across all rows
sum(sapply(object_1, length))
If you want to find words like "sexual" but not words like "asexual" or "sexually", add word boundaries to the regular expression. Use
gregexpr("\\bsexual\\b", df$text)
instead of
gregexpr("sexual", df$text)
To get your desired output, you would do:
original_funnel_terms <- c("sexual", "racist", "ill")
funnel_terms <- paste0("\\b", original_funnel_terms, "\\b")
output_1 <- sapply(seq_len(length(funnel_terms)), function (z) {
  sum(sapply(sapply(gregexpr(funnel_terms[z], df$text), function (x) {
    if (x[1] == -1) {
      y <- NULL
    } else {
      y <- x
    }
    y
  }), length))
})
names(output_1) <- original_funnel_terms
output_2 <- paste(names(output_1), " - ", as.character(output_1), sep = "")
cat(output_2, sep = "\n")
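To try the snippet above on something concrete, you could first define a hypothetical df (not from the question), e.g.:
df <- data.frame(text = c("a sexual comment",
                          "racist and sexual remarks",
                          "he was ill"),
                 stringsAsFactors = FALSE)
# running the block above on this df would print:
# sexual - 2
# racist - 1
# ill - 1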
I created a sample data set similar to yours with the following code:
library(dplyr)
library(tidyr)   # for separate_rows()
sample <- tribble(~funnel_term, "['Sexual']", "['Islam', 'Religion']", "['Sexual', 'Islam']")
Which gives you a data frame that looks like this:
funnel_term
<chr>
1 ['Sexual']
2 ['Islam', 'Religion']
3 ['Sexual', 'Islam']
You can get rid of the brackets and single quotes and then separate the rows so that each item in the list becomes its own row:
sample.1 <- sample %>%
  mutate(funnel_term_new = gsub("\\[|\\]|\'", "", funnel_term)) %>%
  separate_rows(funnel_term_new, sep = ", ")
Which gives you a data frame that looks like this:
funnel_term funnel_term_new
<chr> <chr>
1 ['Sexual'] Sexual
2 ['Islam', 'Religion'] Islam
3 ['Islam', 'Religion'] Religion
4 ['Sexual', 'Islam'] Sexual
5 ['Sexual', 'Islam'] Islam
Now that you have all of the funnel terms in their own rows, you can use simple dplyr functions to get the count of each unique funnel_term:
sample.final <- sample.1 %>% group_by(funnel_term_new) %>% summarise(n = n())
funnel_term_new n
<chr> <int>
1 Islam 2
2 Religion 1
3 Sexual 2
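A slightly shorter equivalent, since dplyr's count() wraps group_by() + summarise(n = n()):
sample.final <- sample.1 %>% count(funnel_term_new)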
I have a dataset containing a column of strings from which I wish to calculate an overall sentiment score, and a data frame containing all the unique words that appear in all the strings, each of which is assigned a score:
library(stringr)
df <- data.frame(text = c('recommend good value no problem','terrible quality no good','good service excellent quality commend'), score = 0)
words <- c('recommend','good','value','problem','terrible','quality','service','excellent','commend')
scores <- c(1,2,1,-2,-3,1,0,3,1)
wordsdf <- data.frame(words,scores)
The only way I have been able to get close to this is by using a nested for loop and the str_count function from the stringr package:
for (i in 1:3){
count = 0
for (j in 1:9){
count <- count + (str_count(df$text[i],as.character(wordsdf$words[j])) * wordsdf$scores[j])
}
df$score[i] <- count
}
This almost achieves what I want:
text score
1 recommend good value no problem 3
2 terrible quality no good 0
3 good service excellent quality commend 7
However, since the word 'commend' is also contained in the word 'recommend', my code calculates the scores as if both words are contained in the string.
So I have two queries:
1 - Is there a way to get it to match only to exact words?
2 - Is there a way to achieve this without using the nested loop?
One tidyverse possibility could be:
library(dplyr)
library(tidyr)    # for unnest()
library(tibble)   # for rowid_to_column()

df %>%
  rowid_to_column() %>%
  mutate(text = strsplit(as.character(text), " ", fixed = TRUE)) %>%
  unnest() %>%
  full_join(wordsdf, by = c("text" = "words")) %>%
  group_by(rowid) %>%
  summarise(text = paste(text, collapse = " "),
            scores = sum(scores, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-rowid)
text scores
<chr> <dbl>
1 recommend good value no problem 2
2 terrible quality no good 0
3 good service excellent quality commend 7
It first splits the "text" column into separate words. Second, it performs a full join on these words. Finally, it combines the words from the "text" column again and performs the summation.
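For the first query on its own, a small sketch (not part of the answer above) that keeps the original str_count() idea but adds word boundaries, so that "commend" no longer matches inside "recommend":
library(stringr)

patterns <- paste0("\\b", wordsdf$words, "\\b")   # exact-word patterns
df$score <- sapply(df$text, function(txt) sum(str_count(txt, patterns) * wordsdf$scores))
df$score   # 2 0 7, matching the join-based result above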
I'm trying to plot tweet topics by year from this file
https://www.mediafire.com/file/64lzbt46v01jbe1/cleaned.xlsx/file
Getting the topics works fine, but when I try to plot them by year I have a dimension problem:
library(readxl)
library(tm)
library(topicmodels)   # needed below for LDA() and posterior()
tweets <- read_xlsx("C:/cleaned.xlsx")
mytextdata <- tweets$textdata
# Convert to tm corpus and use its API
corpus <- Corpus(VectorSource(mytextdata)) # Create corpus object
dtm <- DocumentTermMatrix(corpus)
ui = unique(dtm$i)
dtm.new = dtm[ui,]
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control=list(alpha = 0.1, seed = 77), k = k)
tmResult <- posterior(ldaTopics)
theta <- tmResult$topics
dim(theta)
dim(theta) gives 4857 rows, but I have 4876 dates in my cleaned.xlsx file, and they need to match to run this aggregate function:
topic_proportion_per_decade <- aggregate(theta, by = list(decade = textdata$decade), mean)
from here
https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html
I think the problem is that the cleaned.xlsx file is not clean enough and that is why theta is missing some rows.
But in fact I really do not know why theta is missing some rows.
I also do not know how to clean the file better, if that is the problem; the file looks good to me. There are some rows that are only numbers or non-English words, but I prefer to keep them.
The problem is that ui = unique(dtm$i) removes several documents (I don't know why you do this, so I won't comment on that part). So your theta doesn't have the same number of rows as the data. We can solve this by only keeping the rows which are still in theta:
library("dplyr")
library("reshape2")
library("ggplot2")
tweets_clean <- tweets %>%
mutate(id = rownames(.)) %>%
filter(id %in% rownames(theta)) %>% # keep only rows still in theta
cbind(theta) %>% # now we can attach the topics to the data.frame
mutate(year = format(date, "%Y")) # make year variable
I then used dplyr functions to do the aggregation, since I think it makes for easier-to-read code:
tweets_clean_yearly <- tweets_clean %>%
group_by(year) %>%
summarise_at(vars(as.character(1:7)), funs(mean)) %>%
melt(id.vars = "year")
Then we can plot this:
ggplot(tweets_clean_yearly, aes(x = year, y = value, fill = variable)) +
geom_bar(stat = "identity") +
ylab("proportion")
Note: I tested if theta and tweets had really the same documents with:
tweets_clean <- tweets %>%
mutate(id = rownames(.)) %>%
filter(id %in% rownames(theta))
all.equal(tweets_clean$id, rownames(theta))
Phrasal verbs are really important in day-to-day English usage. Is there any library in R that allows us to deal with them?
I have tried two ways, but neither seems able to handle them. For example:
library(sentimentr)
library(tidytext)
library(tidyverse)
x <- 'i vomit when i see her'
y <- 'i throw up when i see her'
# sentimentR
sentiment(x) #give sentiment of -0.4
sentiment(y) #give sentiment of 0
# Similarly, using tidytext
y %>% as_tibble() %>%
  unnest_tokens(word, value) %>%
  left_join(get_sentiments('bing')) # no word matches the lexicon, so every sentiment is NA
I came up with a (clumsy) strategy to deal with phrasal verbs:
library(zoo)   # for na.fill()

# create a dummy phrasal-verb sentiment score
phrasel_verb <- data.frame(bigram = c("throw up"),
                           bigram_score = -1)

# use tidytext to make bigrams --> join
y %>% as_tibble() %>%
  unnest_tokens(bigram, value, 'ngrams', n = 2) %>%
  separate(bigram, c('word','word2'), remove = F) %>%
  left_join(phrasel_verb) %>%
  left_join(get_sentiments('bing')) %>%
  mutate(sentiment_all = coalesce(bigram_score, as.numeric(sentiment))) %>%
  summarise(sentiment_sum = sum(na.fill(sentiment_all, 0)))
The result is -1, which suggests a negative sentiment.
Any ideas to improve it? Are there any datasets that include sentiment scores for phrasal verbs?
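One rough idea to build on (a sketch, not a tested solution): normalize known phrasal verbs to single-word synonyms before scoring, so that ordinary unigram lexicons can handle them. The mapping below is hypothetical and would have to be curated by hand:
library(sentimentr)
library(stringr)

# hypothetical hand-made mapping of phrasal verbs to single-word synonyms
phrasal_map <- c("throw up" = "vomit", "give up" = "surrender")

y <- 'i throw up when i see her'
sentiment(str_replace_all(y, phrasal_map)) # should now score like sentiment(x) above (about -0.4)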