Sentiment analysis in R for cyrillic - r

I can't comment on this page where i found a function Sentiment Analysis Text Analytics in Russian / Cyrillic languages
get_sentiment_rus <- function(char_v, method="custom", lexicon=NULL, path_to_tagger = NULL, cl = NULL, language = "english") {
language <- tolower(language)
russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"
russ.char.no <- "[^\u0401\u0410-\u044F\u0451]"
if (is.na(pmatch(method, c("syuzhet", "afinn", "bing", "nrc",
"stanford", "custom"))))
stop("Invalid Method")
if (!is.character(char_v))
stop("Data must be a character vector.")
if (!is.null(cl) && !inherits(cl, "cluster"))
stop("Invalid Cluster")
if (method == "syuzhet") {
char_v <- gsub("-", "", char_v)
}
if (method == "afinn" || method == "bing" || method == "syuzhet") {
word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
if (is.null(cl)) {
result <- unlist(lapply(word_l, get_sent_values,
method))
}
else {
result <- unlist(parallel::parLapply(cl = cl, word_l,
get_sent_values, method))
}
}
else if (method == "nrc") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
lexicon <- dplyr::filter_(syuzhet:::nrc, ~lang == tolower(language),
~sentiment %in% c("positive", "negative"))
lexicon[which(lexicon$sentiment == "negative"), "value"] <- -1
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "custom") {
# word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl=T)
result <- unlist(lapply(word_l, get_sent_values, method,
lexicon))
}
else if (method == "stanford") {
if (is.null(path_to_tagger))
stop("You must include a path to your installation of the coreNLP package. See http://nlp.stanford.edu/software/corenlp.shtml")
result <- get_stanford_sentiment(char_v, path_to_tagger)
}
return(result)
}
It gives an error
> mysentiment <- get_sentiment_rus(as.character(corpus))
Show Traceback
Rerun with Debug
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "NULL"
And the sentiment scores are equal to 0
> SentimentScores <- data.frame(colSums(mysentiment[,]))
> SentimentScores
colSums.mysentiment.....
anger 0
anticipation 0
disgust 0
fear 0
joy 0
sadness 0
surprise 0
trust 0
negative 0
positive 0
Could you please point out where a problem might be? Or suggest any other working method for sentiment analysis в R? Just wonder what package supports russian language.
I am looking for any working method for sentiment analysis of a text in russian.

It looks to me like your function did not really find any sentiment words in your text. This might have to do with the sentiment dictionary you are using. Instead of trying to repair this function, you might want to consider a tidy approach instead, which is outlined in the book "Text Mining with R. A Tidy Approach". The advantage is that it does not mind the cyrillic letters and that it is really easy to understand and tweak.
First, we need a dictionary with sentiment values. I found one on GitHub, which we can directly read into R:
library(rvest)
library(stringr)
library(tidytext)
library(dplyr)
dict <- readr::read_csv("https://raw.githubusercontent.com/text-machine-lab/sentimental/master/sentimental/word_list/russian.csv")
Next, let's get some test data to work with. For no particular reason, I use the Russian Wikipedia entry for Brexit and scrape the text:
brexit <- "https://ru.wikipedia.org/wiki/%D0%92%D1%8B%D1%85%D0%BE%D0%B4_%D0%92%D0%B5%D0%BB%D0%B8%D0%BA%D0%BE%D0%B1%D1%80%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B8_%D0%B8%D0%B7_%D0%95%D0%B2%D1%80%D0%BE%D0%BF%D0%B5%D0%B9%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%BE%D1%8E%D0%B7%D0%B0" %>%
read_html() %>%
html_nodes("body") %>%
html_text() %>%
tibble(text = .)
Now this data can be turned into a tidy format. I split the text into paragraphs first, so we can check sentiment scores for paragraphs individually.
brexit_tidy <- brexit %>%
unnest_tokens(output = "paragraph", input = "text", token = "paragraphs") %>%
mutate(id = seq_along(paragraph)) %>%
unnest_tokens(output = "word", input = "paragraph", token = "words")
The way a dictionary is used with tidy data is incredibly straightwoard from this point. You just combine the data frame with sentiment values (i.e., the dictionary) and the data frame with the words in your text. Where text and dictionary match, the sentiment value is added. All other values are dropped.
# apply dictionary
brexit_sentiment <- brexit_tidy %>%
inner_join(dict, by = "word")
head(brexit_sentiment)
#> # A tibble: 6 x 3
#> id word score
#> <int> <chr> <dbl>
#> 1 7 затяжной -1.7
#> 2 13 против -5
#> 3 22 популярность 5
#> 4 22 против -5
#> 5 23 нужно 1.7
#> 6 39 против -5
Instead of the value for each word, you probably prefer the values per paragraphs. This can easily be done by getting the mean for each paragraph:
# group sentiment by paragraph
brexit_sentiment %>%
group_by(id) %>%
summarise(sentiment = mean(score))
#> # A tibble: 25 x 2
#> id sentiment
#> <int> <dbl>
#> 1 7 -1.7
#> 2 13 -5
#> 3 22 0
#> 4 23 1.7
#> 5 39 -5
#> 6 42 5
#> 7 43 -1.88
#> 8 44 -3.32
#> 9 45 -3.35
#> 10 47 1.7
#> # … with 15 more rows
There are a couple of ways this approach could be improved if necessary:
to get rid of different word forms, you could lemmatize the words, making matches more likely
in case your text includes misspellings, you could consider matching words which are similar with e.g. fuzzyjoin
you can find or create a better dictionary than the one I pulled of the first page I found when googling "russian sentiment dictionary"

Related

Merging words in a wordcloud made in R

I have created a word cloud with the following frequency of terms:
interesting interesting 21
economics economics 12
learning learning 9
learn learn 6
taxes taxes 6
debating debating 6
everything everything 6
know know 6
tax tax 3
meaning meaning 3
I want to add the 6 counts for "learn" into the overall count for "learning" so that the frequency becomes 15, and I only have "learning" in my word cloud. I also want to do the same for "taxes" and "tax".
This is the code I used to generate the wordcloud.
dataset <- read.csv("~/filepath.csv")
> corpus <- Corpus(VectorSource(dataset$comment))
> clean_corpus <- tm_map(corpus, removeWords, stopwords('english'))
> wordcloud(clean_corpus, scale=c(5,0.5), max.words=100, random.order = FALSE, rot.per=0.35, colors=my_palette)
I have tried using the SnowballC package, but this was the outcome:
> library(SnowballC)
> clean_set <- tm_map(clean_corpus, stemDocument)
> dtm <- TermDocumentMatrix(clean_set)
> m <- as.matrix(dtm)
> v <- sort(rowSums(m), decreasing = TRUE)
> d <- data.frame(word = names(v), freq=v)
> head(d, 10)
This gives me the output below (economics has become econom, debating has become debat, everything, everyth) which is obviously unideal. I only have an issue with learn/learning and tax/taxes, so would it be possible to manually merge just those two sets of words?
interest interest 21
learn learn 18
econom econom 12
tax tax 9
debat debat 6
everyth everyth 6
know know 6
mean mean 3
understand understand 3
group group 3
I have also tried clean_corpus_2 <- tm_map(clean_corpus, content_transformer(gsub), pattern = "taxes", replacement = "tax", fixed = TRUE) which changed nothing in the output.
I'm using the tidyverse packages, particularly dplyr as that's why I'm comfortable with, but I'm sure this is doable with base R or any number of other approaches.
library(tidyverse)
First I mock up some data as I don't have yours to test on:
testdata <- tribble(
~ID, ~comment,
1, "learn",
2, "learning",
3, "learned",
4, "tax",
5, "taxes",
6, "panoply"
)
Next is the explicitly listing the options approach:
testdata1 <- testdata %>% mutate(
newcol = case_when(
comment %in% c("learn", "learning", "learned") ~ "learn",
comment %in% c("tax", "taxes") ~ "tax",
TRUE ~ as.character(comment)
)
)
In this code, %>% is a pipe, mutate() adds a new column based on what follows. newcol is the name of the new column, and its contents is decided by the case_when() construct, which tests each option in turn until it finds something returning "TRUE" - that's why the last option (the default "don't change" approach) is listed as TRUE ~ .
After that, the pattern-matching (grepl) approach:
testdata2 <- testdata %>% mutate(
newcol = case_when(
grepl(comment, pattern = "learn") ~ "learn",
grepl(comment, pattern = "tax") ~ "tax",
TRUE ~ as.character(comment)
)
)
Yielding:
> testdata1
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
> testdata2
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply

Having difficulty using rle command within a mutate step in r to count the max number of consecutive characters in a word

I created this function to count the maximum number of consecutive characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
num_vowels = get_count(word),
num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
)
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.
This command rle(unlist(strsplit(word, split = "")))$lengths is not vectorized and thus is operating on the entire list of words for each row thus the same result for each row.
You will need to use some type of loop (ie for, apply, purrr::map) to solve it.
library(dplyr)
library(tidytext)
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
output<- text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
# num_vowels = get_count(word),
)
output$num_consec_char<- sapply(output$word, function(word){
max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1

Words being cut off when text mining

I am doing some text mining and am doing a simple list of most frequent words. I am removing stop words as well as some other tidying of data. Here is the corpus along with the matrix and sorting:
# Preliminary corpus
corpusNR <- Corpus(VectorSource(nat_registry)) %>%
tm_map(removePunctuation) %>%
tm_map(removeNumbers) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(stripWhitespace) %>%
tm_map(stemDocument)
# Create term-document matrices & remove sparse terms
tdmNR <- DocumentTermMatrix(corpusNR) %>%
removeSparseTerms(1 - (5/length(corpusNR)))
# Calculate and sort by word frequencies
word.freqNR <- sort(colSums(as.matrix(tdmNR)),
decreasing = T)
# Create frequency table
tableNR <- data.frame(word = names(word.freqNR),
absolute.frequency = word.freqNR,
relative.frequency =
word.freqNR/length(word.freqNR))
# Remove the words from the row names
rownames(tableNR) <- NULL
# Show the 10 most common words
head(tableNR, 10)
Here are the top ten words I am ending up with:
> head(tableNR, 10)
word absolute.frequency relative.frequency
1 vaccin 95 0.4822335
2 program 82 0.4162437
3 covid 59 0.2994924
4 health 59 0.2994924
5 educ 55 0.2791878
6 extens 53 0.2690355
7 inform 49 0.2487310
8 communiti 42 0.2131980
9 provid 41 0.2081218
10 counti 36 0.1827411
Notice a portion of many of the words is being cut off #1 should be "vaccine", #5 should be "education"...ect.
Any ideas why this is happening? Thanks in advance.

How to tokenize my dataset in R using the tidytext library?

I have been trying to follow Text Mining with R by Julia Silge, however, I cannot tokenize my dataset with the unnest_tokens function.
Here are the packages I have loaded:
# Load
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(corpus)
library(corpustools)
library(dplyr)
library(tidyverse)
library(tidytext)
library(tokenizers)
library(stringr)
Here is the dataset I tried to use which is online, so the results should be reproducible:
bible <- readLines('http://bereanbible.com/bsb.txt')
And here is where everything falls apart.
Input:
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
From what I have read about this error, in Rstudio, the issue is that the dataset needs to be a matrix, so I tried transforming the dataset into a matrix table and I received the same error message.
Input:
bible <- readLines('http://bereanbible.com/bsb.txt')
bible <- as.matrix(bible, nrow = 31105, ncol = 2 )
bible <- bible %>%
unnest_tokens(word, text)
Output:
Error in tbl[[input]] : subscript out of bounds
Any recommendations for what next steps I could take or maybe some good Text mining sources I could use as I continue to dive into this would be very much appreciated.
The problem is that readLines()creates a vector, not a dataframe, as expected by unnest_tokens(), so you need to convert it. It is also helpful to separate the verse to it's own column:
library(tidytext)
library(tidyverse)
bible_orig <- readLines('http://bereanbible.com/bsb.txt')
# Get rid of the copyright etc.
bible_orig <- bible_orig[4:length(bible_orig)]
# Convert to df
bible <- enframe(bible_orig)
# Separate verse from text
bible <- bible %>%
separate(value, into = c("verse", "text"), sep = "\t")
tidy_bible <- bible %>%
unnest_tokens(word, text)
tidy_bible
#> # A tibble: 730,130 x 3
#> name verse word
#> <int> <chr> <chr>
#> 1 1 Genesis 1:1 in
#> 2 1 Genesis 1:1 the
#> 3 1 Genesis 1:1 beginning
#> 4 1 Genesis 1:1 god
#> 5 1 Genesis 1:1 created
#> 6 1 Genesis 1:1 the
#> 7 1 Genesis 1:1 heavens
#> 8 1 Genesis 1:1 and
#> 9 1 Genesis 1:1 the
#> 10 1 Genesis 1:1 earth
#> # … with 730,120 more rows
Created on 2020-07-14 by the reprex package (v0.3.0)

ngrams analysis in tidytext in R

I am trying to do ngram analysis for in tidytext, I have a corpus of 770 speeches. However the function unnest_tokens in tidytext takes data frame as input. when i checked with the example (jane austin books) each line of the book is stored as row in a data frame. i am not able to convert the corpus into a dataframe, neither for one speech at a time nor for all the corpus at once.
What is the way i can run ngrams (n=2,3, etc) analysis on tidytext using unnest tokens on my corpus. Can someone please suggest?
Thanks
You can use library ngram & tm for this.You can replace "myCorpus" with the corpus you created.
library(tm)
library(ngarm)
myCorpus<-c("Hi How are you","Hello World","I love Stackoverflow","Good Bye All")
ng <- ngram (myCorpus , n =2)
get.phrasetable (ng)
If you want to tokenize and convert your corpus into a dataframe then use the below code.
tokenizedCorpus <- lapply(myCorpus, scan_tokenizer)
mydata <- data.frame(text = sapply(tokenizedCorpus, paste, collapse = " "),stringsAsFactors = FALSE)
You say that you have a "corpus" of 770 speeches. Do you mean you have a character vector? If so, you can tokenize your text in this way:
library(tidyverse)
library(tidytext)
speech_vec <- c("I am giving a speech!",
"My second speech is even better.",
"Unfortunately, this speech is terrible!",
"For my final speech, I will wow you all.")
speech_df <- tibble(text = speech_vec) %>%
mutate(speech = row_number())
tidy_speeches <- speech_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2)
tidy_speeches
#> # A tibble: 21 x 2
#> speech bigram
#> <int> <chr>
#> 1 1 i am
#> 2 1 am giving
#> 3 1 giving a
#> 4 1 a speech
#> 5 2 my second
#> 6 2 second speech
#> 7 2 speech is
#> 8 2 is even
#> 9 2 even better
#> 10 3 unfortunately this
#> # … with 11 more rows
Created on 2020-02-15 by the reprex package (v0.3.0)
If instead, you mean that you have a DocumentTermMatrix from the tm package, check out this chapter for details on how to convert to a tidy data structure.

Resources