I am doing some text mining and building a simple list of the most frequent words. I am removing stop words and doing some other tidying of the data. Here is the corpus along with the matrix and sorting:
# Preliminary corpus
corpusNR <- Corpus(VectorSource(nat_registry)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument)
# Create document-term matrix & remove sparse terms
tdmNR <- DocumentTermMatrix(corpusNR) %>%
  removeSparseTerms(1 - (5 / length(corpusNR)))
# Calculate and sort by word frequencies
word.freqNR <- sort(colSums(as.matrix(tdmNR)),
                    decreasing = TRUE)
# Create frequency table
tableNR <- data.frame(word = names(word.freqNR),
                      absolute.frequency = word.freqNR,
                      relative.frequency =
                        word.freqNR / length(word.freqNR))
# Remove the words from the row names
rownames(tableNR) <- NULL
# Show the 10 most common words
head(tableNR, 10)
Here are the top ten words I am ending up with:
> head(tableNR, 10)
word absolute.frequency relative.frequency
1 vaccin 95 0.4822335
2 program 82 0.4162437
3 covid 59 0.2994924
4 health 59 0.2994924
5 educ 55 0.2791878
6 extens 53 0.2690355
7 inform 49 0.2487310
8 communiti 42 0.2131980
9 provid 41 0.2081218
10 counti 36 0.1827411
Notice that a portion of many of the words is being cut off: #1 should be "vaccine", #5 should be "education", etc.
Any ideas why this is happening? Thanks in advance.
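One way to check where the truncation happens is to run the stemmer directly on a few of the affected words; tm's stemDocument() is built on SnowballC::wordStem(), so if these come back shortened, the last tm_map() step in the pipeline is the cause (a minimal sketch):
library(SnowballC)

# Porter stemming trims suffixes, e.g. "vaccine" -> "vaccin", "education" -> "educ"
wordStem(c("vaccine", "education", "community", "county"))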
Related
I have created a word cloud with the following frequency of terms:
interesting interesting 21
economics economics 12
learning learning 9
learn learn 6
taxes taxes 6
debating debating 6
everything everything 6
know know 6
tax tax 3
meaning meaning 3
I want to add the 6 counts for "learn" into the overall count for "learning" so that the frequency becomes 15, and I only have "learning" in my word cloud. I also want to do the same for "taxes" and "tax".
This is the code I used to generate the wordcloud.
dataset <- read.csv("~/filepath.csv")
> corpus <- Corpus(VectorSource(dataset$comment))
> clean_corpus <- tm_map(corpus, removeWords, stopwords('english'))
> wordcloud(clean_corpus, scale=c(5,0.5), max.words=100, random.order = FALSE, rot.per=0.35, colors=my_palette)
I have tried using the SnowballC package, but this was the outcome:
> library(SnowballC)
> clean_set <- tm_map(clean_corpus, stemDocument)
> dtm <- TermDocumentMatrix(clean_set)
> m <- as.matrix(dtm)
> v <- sort(rowSums(m), decreasing = TRUE)
> d <- data.frame(word = names(v), freq=v)
> head(d, 10)
This gives me the output below (economics has become econom, debating has become debat, everything has become everyth), which is not ideal. I only have an issue with learn/learning and tax/taxes, so would it be possible to manually merge just those two sets of words?
interest interest 21
learn learn 18
econom econom 12
tax tax 9
debat debat 6
everyth everyth 6
know know 6
mean mean 3
understand understand 3
group group 3
I have also tried clean_corpus_2 <- tm_map(clean_corpus, content_transformer(gsub), pattern = "taxes", replacement = "tax", fixed = TRUE) which changed nothing in the output.
I'm using the tidyverse packages, particularly dplyr, as that's what I'm comfortable with, but I'm sure this is doable with base R or any number of other approaches.
library(tidyverse)
First I mock up some data as I don't have yours to test on:
testdata <- tribble(
  ~ID, ~comment,
  1, "learn",
  2, "learning",
  3, "learned",
  4, "tax",
  5, "taxes",
  6, "panoply"
)
Next is the approach of explicitly listing the options:
testdata1 <- testdata %>% mutate(
  newcol = case_when(
    comment %in% c("learn", "learning", "learned") ~ "learn",
    comment %in% c("tax", "taxes") ~ "tax",
    TRUE ~ as.character(comment)
  )
)
In this code, %>% is a pipe and mutate() adds a new column based on what follows. newcol is the name of the new column, and its contents are decided by the case_when() construct, which tests each option in turn until one returns TRUE - that's why the last option (the default "don't change" case) is written as TRUE ~ .
After that, the pattern-matching (grepl) approach:
testdata2 <- testdata %>% mutate(
  newcol = case_when(
    grepl(comment, pattern = "learn") ~ "learn",
    grepl(comment, pattern = "tax") ~ "tax",
    TRUE ~ as.character(comment)
  )
)
Yielding:
> testdata1
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
> testdata2
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
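From there, getting the merged word cloud is just a matter of counting the recoded column and passing the counts to wordcloud(). A sketch with the mock data above (where each comment happens to be a single word; with real comments you would tokenize first):
library(wordcloud)

word_counts <- testdata1 %>%
  count(newcol, name = "freq", sort = TRUE)   # merged frequencies per word

wordcloud(words = word_counts$newcol, freq = word_counts$freq,
          scale = c(5, 0.5), max.words = 100, random.order = FALSE)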
I'm using quanteda and trying to compute the relative frequencies of specific words in a corpus which is organized by date and party. However, after converting the corpus to a dfm and using dfm_weight(dfmat, scheme = "prop") followed by textstat_frequency, I get scores greater than 1.
Here is my code (I also stem and clean my tokens, though that is not shown here):
corp <- corpus(title_df, text_field = "text", meta = list(title_df[,-4]))
toks <- tokens(corp)
dfmat <- dfm(toks, verbose=TRUE)
dfm_rel_freq <- dfm_weight(dfmat, scheme = "prop")
rel_freq_all <- quanteda.textstats::textstat_frequency(dfm_rel_freq, groups = year)
# arrange by max frequency:
rel_freq_all %>% arrange(frequency) %>% tail()
        feature frequency rank docfreq group
81093   pension  5.802529    1     117  2004
40971   pension  6.117154    1      97  1998
148372    peopl  6.430454    1     220  2014
65747   pension  6.721089    1     138  2002
53303   pension  7.871011    1     153  2000
74391   pension  8.153381    1     156  2003
This is the expected behaviour: quanteda.textstats::textstat_frequency(x, groups = year) will sum the dfm within the year groups. So your proportions from the dfm are being summed, and these can exceed 1.0.
If you want a different operation on the groups, for instance the mean, then do not use the groups argument; instead, apply some dplyr operations afterwards, such as:
library(dplyr)
quanteda.textstats::textstat_frequency(dfm_rel_freq) %>%
group_by(year) %>%
summarize(mean_rel_freq = mean(frequency))
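Alternatively, if the goal is per-year relative frequencies that stay between 0 and 1, you can group the dfm by year before weighting, so that each year's proportions sum to 1. A sketch, assuming year is a document variable on dfmat:
library(quanteda)
library(quanteda.textstats)

dfm_by_year <- dfm_group(dfmat, groups = year)              # one row per year
dfm_year_prop <- dfm_weight(dfm_by_year, scheme = "prop")   # each row sums to 1
rel_freq_year <- textstat_frequency(dfm_year_prop, groups = year)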
I can't comment on the page where I found this function ("Sentiment Analysis Text Analytics in Russian / Cyrillic languages"):
get_sentiment_rus <- function(char_v, method = "custom", lexicon = NULL,
                              path_to_tagger = NULL, cl = NULL, language = "english") {
  language <- tolower(language)
  russ.char.yes <- "[\u0401\u0410-\u044F\u0451]"
  russ.char.no <- "[^\u0401\u0410-\u044F\u0451]"
  if (is.na(pmatch(method, c("syuzhet", "afinn", "bing", "nrc",
                             "stanford", "custom"))))
    stop("Invalid Method")
  if (!is.character(char_v))
    stop("Data must be a character vector.")
  if (!is.null(cl) && !inherits(cl, "cluster"))
    stop("Invalid Cluster")
  if (method == "syuzhet") {
    char_v <- gsub("-", "", char_v)
  }
  if (method == "afinn" || method == "bing" || method == "syuzhet") {
    word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
    if (is.null(cl)) {
      result <- unlist(lapply(word_l, get_sent_values, method))
    }
    else {
      result <- unlist(parallel::parLapply(cl = cl, word_l,
                                           get_sent_values, method))
    }
  }
  else if (method == "nrc") {
    # word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
    word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl = TRUE)
    lexicon <- dplyr::filter_(syuzhet:::nrc, ~lang == tolower(language),
                              ~sentiment %in% c("positive", "negative"))
    lexicon[which(lexicon$sentiment == "negative"), "value"] <- -1
    result <- unlist(lapply(word_l, get_sent_values, method, lexicon))
  }
  else if (method == "custom") {
    # word_l <- strsplit(tolower(char_v), "[^A-Za-z']+")
    word_l <- strsplit(tolower(char_v), paste0(russ.char.no, "+"), perl = TRUE)
    result <- unlist(lapply(word_l, get_sent_values, method, lexicon))
  }
  else if (method == "stanford") {
    if (is.null(path_to_tagger))
      stop("You must include a path to your installation of the coreNLP package. See http://nlp.stanford.edu/software/corenlp.shtml")
    result <- get_stanford_sentiment(char_v, path_to_tagger)
  }
  return(result)
}
It gives an error
> mysentiment <- get_sentiment_rus(as.character(corpus))
Error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "NULL"
And the sentiment scores are equal to 0
> SentimentScores <- data.frame(colSums(mysentiment[,]))
> SentimentScores
colSums.mysentiment.....
anger 0
anticipation 0
disgust 0
fear 0
joy 0
sadness 0
surprise 0
trust 0
negative 0
positive 0
Could you please point out where the problem might be? Or suggest any other working method for sentiment analysis in R?
I am looking for any working method for sentiment analysis of text in Russian, and I wonder which packages support the Russian language.
It looks to me like your function did not really find any sentiment words in your text. This might have to do with the sentiment dictionary you are using. Instead of trying to repair this function, you might want to consider a tidy approach, which is outlined in the book "Text Mining with R: A Tidy Approach". The advantage is that it does not mind the Cyrillic letters and that it is really easy to understand and tweak.
First, we need a dictionary with sentiment values. I found one on GitHub, which we can directly read into R:
library(rvest)
library(stringr)
library(tidytext)
library(dplyr)
dict <- readr::read_csv("https://raw.githubusercontent.com/text-machine-lab/sentimental/master/sentimental/word_list/russian.csv")
Next, let's get some test data to work with. For no particular reason, I use the Russian Wikipedia entry for Brexit and scrape the text:
brexit <- "https://ru.wikipedia.org/wiki/%D0%92%D1%8B%D1%85%D0%BE%D0%B4_%D0%92%D0%B5%D0%BB%D0%B8%D0%BA%D0%BE%D0%B1%D1%80%D0%B8%D1%82%D0%B0%D0%BD%D0%B8%D0%B8_%D0%B8%D0%B7_%D0%95%D0%B2%D1%80%D0%BE%D0%BF%D0%B5%D0%B9%D1%81%D0%BA%D0%BE%D0%B3%D0%BE_%D1%81%D0%BE%D1%8E%D0%B7%D0%B0" %>%
read_html() %>%
html_nodes("body") %>%
html_text() %>%
tibble(text = .)
Now this data can be turned into a tidy format. I split the text into paragraphs first, so we can check sentiment scores for paragraphs individually.
brexit_tidy <- brexit %>%
unnest_tokens(output = "paragraph", input = "text", token = "paragraphs") %>%
mutate(id = seq_along(paragraph)) %>%
unnest_tokens(output = "word", input = "paragraph", token = "words")
The way a dictionary is used with tidy data is incredibly straightforward from this point. You just combine the data frame with sentiment values (i.e., the dictionary) and the data frame with the words in your text. Where text and dictionary match, the sentiment value is added. All other values are dropped.
# apply dictionary
brexit_sentiment <- brexit_tidy %>%
inner_join(dict, by = "word")
head(brexit_sentiment)
#> # A tibble: 6 x 3
#> id word score
#> <int> <chr> <dbl>
#> 1 7 затяжной -1.7
#> 2 13 против -5
#> 3 22 популярность 5
#> 4 22 против -5
#> 5 23 нужно 1.7
#> 6 39 против -5
Instead of a value for each word, you probably prefer the values per paragraph. This can easily be done by taking the mean for each paragraph:
# group sentiment by paragraph
brexit_sentiment %>%
group_by(id) %>%
summarise(sentiment = mean(score))
#> # A tibble: 25 x 2
#> id sentiment
#> <int> <dbl>
#> 1 7 -1.7
#> 2 13 -5
#> 3 22 0
#> 4 23 1.7
#> 5 39 -5
#> 6 42 5
#> 7 43 -1.88
#> 8 44 -3.32
#> 9 45 -3.35
#> 10 47 1.7
#> # … with 15 more rows
There are a couple of ways this approach could be improved if necessary:
to get rid of different word forms, you could lemmatize the words, making matches more likely
in case your text includes misspellings, you could consider matching words which are similar, e.g. with fuzzyjoin (see the sketch after this list)
you can find or create a better dictionary than the one I pulled off the first page I found when googling "russian sentiment dictionary"
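For the fuzzy-matching idea, a minimal sketch with the fuzzyjoin package (assuming, as above, that dict has word and score columns; max_dist = 1 is just an illustrative threshold):
library(fuzzyjoin)

# Match dictionary entries that differ from the text's words by at most
# one character, which catches small typos and some trailing inflections.
brexit_fuzzy <- brexit_tidy %>%
  stringdist_inner_join(dict, by = "word", max_dist = 1)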
I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but I need to remove repeated records.
When I have the name of a creature, I need to keep that row in the data frame but remove the row where that creature's name appears as the spouse name. I have the following data sample:
name spouse
15 Finarfin Eärwen
6 Tar-Vanimeldë Herucalmo
17 Faramir Éowyn
8 Tar-Meneldur Almarian
14 Finduilas of Dol Amroth Denethor II
12 Finwë Míriel Serindë then Indis
9 Tar-Ancalimë Hallacar
7 Tar-Míriel Ar-Pharazôn
5 Tarannon Falastur Berúthiel
21 Rufus Burrows Asphodel Brandybuck
2 Angrod Eldalótë
4 Ar-Gimilzôr Inzilbêth
19 Lobelia Sackville-Baggins Otho Sackville-Baggins
25 Mrs. Proudfoot Odo Proudfoot
22 Rudigar Bolger Belba Baggins
24 Odo Proudfoot Mrs. Proudfoot
3 Ar-Pharazôn Tar-Míriel
13 Fingolfin Anairë
18 Silmariën Elatan
23 Rowan Greenhand Belba Baggins
20 Rían Huor
1 Adanel Belemir
16 Fastolph Bolger Pansy Baggins
10 Morwen Steelsheen Thengel
11 Tar-Aldarion Erendis
25 Belemir Adanel
For example, when I ran the code, row 1 has the name Adanel with Belemir as the spouse, so I need to keep row 1 but remove row 25 in order to avoid duplicated data.
I have tried this following code:
interacialMariage <-data %>% filter(spouse != name) %>% select(name, spouse)
How can I remove the rows where the name/spouse pair simply reverses one that is already in the data frame?
P.S.: I would need it to be case-insensitive (Belemir == belemir) so that I don't have problems in the future.
Thanks!
You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
                 function(i) paste(sort(c(trimws(tolower(data$name[i])),
                                          trimws(tolower(data$spouse[i])))),
                                   collapse = " "))
irM <- data[!duplicated(sorted), ]
The trimws strips off any leading or trailing spaces before sorting and pasting, and tolower converts everything to lower case.
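A dplyr variant of the same sorted-key idea (just a sketch; the tidyverse answer below takes a different, reshape-based route) builds the case-insensitive key with pmin()/pmax() and keeps the first occurrence of each pair:
library(dplyr)

irM <- data %>%
  mutate(key = paste(pmin(tolower(trimws(name)), tolower(trimws(spouse))),
                     pmax(tolower(trimws(name)), tolower(trimws(spouse))))) %>%
  distinct(key, .keep_all = TRUE) %>%  # keep the first row for each unordered pair
  select(-key)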
My attempt with tidyverse:
library(tidyverse)
dat %>%
mutate(id = 1:n()) %>% # add id to label the pairs
gather('key', 'name', -id) %>% # transform: key (name | spouse), name, id
group_by(name) %>% # group by unique name to find duplicated
top_n(-1, wt = id) %>% # if name > 1, take row with the lower id
spread(key, name) %>% # spread data to original format
select(-id) # remove id's
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel" ),
stringsAsFactors = F
)
I'm having some trouble making the table of most frequent words in a PDF file because some words appear incomplete or look "strange". To explain myself better, first the file (in Spanish), which can be downloaded from:
https://drive.google.com/file/d/178s_tfbqbXmnxsknxF8DP154_N1DYjgf/view
Second, the code: (Just include your own path and run the code)
library(rJava)
library(tm)
library(qdap)
library(tidyverse)
library(pdftools)
library(stringr)
library(tidytext)
library(stringi)
library(wordcloud)
stop_es <- c(stopwords("es")) #This is the vector I'll be feeding with additional stopwords
cce <- pdf_text("path/file.pdf") #Reading the file
corpus <- Corpus(VectorSource(cce)) #Create corpus
#Cleaning and pre-processing
CCE <- tm_map(corpus, tolower) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stop_es) %>%
  stri_trans_general("Latin-ASCII") # Remove accents from the Spanish words
## Create the corpus again (stri_trans_general has a strange behaviour that forces me to rebuild the corpus)
CCEGTO <- Corpus(VectorSource(CCE))
After the previous steps, I explore the most frequent terms with:
ft <- freq_terms(CCEGTO, 50, stopwords=stop_es) ##Create the table for most frequent terms
ft
That gives the following output (I removed some words to focus on the incomplete or "strange" ones):
WORD FREQ
2 ca 105 ## No idea about this one
3 guanajuato 94
5 vo 86
6 ufb 75 ##¿¿??
9 va 69
10 propuestas 68
11 nivel 64
12 par 58 #For example this one could stand for "parte" or "participacion"
27 ins 42 #This one could stand for "instituto", "institucion" or some else related
28 n 42 #No idea why this simple term appears as a frequent term
30 vos 41
33 numero 40
34 vas 40
35 l 39
38 d 37
39 s 37
42 poli 35 #This one could stand for "policia", "politica", "politicas"
43 vidad 35 #This one could be a bad output for "vida" or maybe for "actividad"
44 cas 34
45 r 34 #Single character...
46 cipacion 33 #This one could be the complement for "parti" in order to form "participacion"
47 i 33
Am I missing something in the cleaning and pre-processing, or is it the PDF structure itself that doesn't allow a proper text mining job?
Any advice and help will be much appreciated.
If I run slightly adjusted code with fewer packages loaded, I can create a frequency table which looks normal. Checking the outcome of each function before going on to the next is also useful; see the writeLines statement below to check whether everything transfers correctly from the pdf extraction. You might want to use stri_trans_general before creating a corpus instead of inside the corpus pipeline, but then you need to apply it to the stopword list as well.
Depending on what exactly you want to do with Spanish text, you might want to look into udpipe. But try to contain your work in as few packages as possible, so do most of the work with tm or with any one of the other text mining packages like qdap, quanteda, tidytext or udpipe.
library(tm)
library(dplyr)
library(pdftools)
cce <- pdf_text("PROPUESTAS GTO 2018 FINAL.pdf") #Reading the file
# have a look at page 4 output not printed in answer!
writeLines(stringi::stri_trans_general(cce[4], "latin-ascii"))
stop_es <- stopwords("spanish")
corpus <- Corpus(VectorSource(cce)) #Create corpus
CCE <- tm_map(corpus, tolower) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stop_es) %>%
  stringi::stri_trans_general("Latin-ASCII")
CCEGTO <- Corpus(VectorSource(CCE))
# create frequency table
dtm <- DocumentTermMatrix(CCEGTO)
m <- as.matrix(dtm)
df <- data.frame(words = names(colSums(m)), freq = colSums(m))
# filter frequencies
df %>%
filter(freq > 50) %>%
arrange(desc(freq))
words freq
1 fb01 236
2 desarrollo 107
3 guanajuato 94
4 nacional 90
5 problema 73
6 social 69
7 propuestas 68
8 nivel 64
9 par 58
10 ciudadanos 55
11 pais 55
12 leon 53
13 asi 52
14 gobierno 52
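The fb01 term at the top of that table points at the underlying problem: U+FB01 is the Unicode "fi" ligature, and this PDF appears to emit it for the letter pairs that produce the broken tokens in the question (par + cipacion, ac + vidad and so on suggest it stands for "ti" here). A hedged fix-up is to substitute the ligature in the extracted text before building the corpus; which replacement string is correct should be confirmed with the writeLines() check above:
# Replace the ligature codepoint before any other cleaning. "ti" is an
# assumption based on the broken tokens; verify with writeLines(cce[4]).
cce_fixed <- gsub("\ufb01", "ti", cce, fixed = TRUE)
corpus <- Corpus(VectorSource(cce_fixed))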