Problems with non english letters using wordcloud by twitter mined text - r

I'm new to Stack Overflow and I've been doing my best to follow the guidelines. If I've missed something, please let me know.
Lately I've been playing around with text mining in R, something I'm a novice at. I've been using the packages you can find in the code below to do this. However, a problem occurs when the wordcloud displays the Swedish letters å, ä and ö. As you can see in the attached picture, the dots get positioned a bit weirdly.
Wordcloud image
I've tried my best to solve this myself, but nothing I've tried has worked.
What I've tried to do:
Use Encoding(tweets) <- "UTF-8" in an attempt to set tweets to UTF-8
Use iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "")
Furthermore, the last part of the code, after defining the corpus vector, was copied from the author of the tm package. He listed it as the solution after other people mentioned problems using the wordcloud function with a corpus as input. Without it I get an error message when trying to create the wordcloud.
#Get and load necessary packages:
install.packages("twitteR")
install.packages("ROAuth")
install.packages("wordcloud")
install.packages("tm")
library("tm")
library("wordcloud")
library("twitteR")
library("ROAuth")
#Authentication:
api_key <- "XXX"
api_secret <- "XXX"
access_token <- "XXX"
access_token_secret <- "XXX"
cred <- setup_twitter_oauth(api_key,api_secret,access_token,
access_token_secret)
#Extract tweets:
search.string <- "#svpol"
no.of.tweets <- 3200
tweets <- searchTwitter(search.string, n=no.of.tweets, since = "2017-01-01")
tweets.text <- sapply(tweets, function(x){x$getText()})
#Remove the "RT" prefix from retweets:
tweets.text <- gsub("^RT\\b", "", tweets.text)
#Collapse runs of spaces and tabs into a single space:
tweets.text <- gsub("[ \t]{2,}", " ", tweets.text)
#Remove hashtags:
tweets.text <- gsub("#\\w+", "", tweets.text)
tweets.text <- (tweets.text[!is.na(tweets.text)])
tweets.text <- gsub("\n", " ", tweets.text)
#Remove links:
tweets.text <- gsub("http[^[:space:]]*", "", tweets.text)
#Remove stopwords:
stopwords_swe <- c("är", "från", "än")
#Just a short example above, the real one is very large
tweets.text <- removeWords(tweets.text,stopwords_swe)
#Create corpus:
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
#See notes in the longer text about the corpus vector
tweets.text.corpus <- tm_map(tweets.text.corpus,
content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')), mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, content_transformer(tolower), mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, removePunctuation, mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, function(x)removeWords(x,stopwords(kind = "en")), mc.cores=1)
wordcloud <- wordcloud(tweets.text.corpus, min.freq = 10,
max.words=300, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Set2"))
wordcloud
Would be super happy receiving help with this!

Managed to solve it by first encoding the vector to UTF-8-MAC (since I'm on OSX), then using the gsub() function to manually replace the hex codes for å, ä and ö (the letters I had problems with) with the actual letters. For example gsub("0xc3 0x85", "å", x) and gsub("0xc3 0xa5", "å", x); two calls are needed per letter because the uppercase and lowercase forms have different byte sequences.
Lastly, I changed the argument for the tm_map() function from UTF-8-MAC to latin1. That did the trick for me; hopefully someone else will find this useful in the future.
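One possible explanation for the misplaced dots, offered as an assumption rather than something confirmed in this thread: UTF-8-MAC stores letters in decomposed form, so å becomes a plain a followed by a combining ring, and some plotting devices draw the combining mark separately. The base R sketch below shows the difference between the two forms and the byte sequences of the precomposed letters. The last step is optional and assumes the stringi package is installed.

```r
# Precomposed "å" (one code point) versus decomposed "a" + combining ring above
composed   <- "\u00e5"
decomposed <- "a\u030a"

# They look alike when printed but are different strings
same_string  <- identical(composed, decomposed)
n_composed   <- nchar(composed, type = "chars")
n_decomposed <- nchar(decomposed, type = "chars")

# UTF-8 bytes of the precomposed lowercase and uppercase letters
bytes_lower <- as.character(charToRaw(composed))   # c3 a5
bytes_upper <- as.character(charToRaw("\u00c5"))   # c3 85

# Optional: NFC normalisation recomposes the letter (requires stringi)
if (requireNamespace("stringi", quietly = TRUE)) {
  recomposed <- stringi::stri_trans_nfc(decomposed)
  print(nchar(recomposed, type = "chars"))
}
```

If this is the cause, normalising the text to NFC before building the corpus would be an alternative to replacing hex codes by hand.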

Related

Using gsub function to replace words by reading a .csv file

setwd("C:\\Users\\Joshua\\Documents\\TextMining\\Description_text")
library(RTextTools)
library(topicmodels)
library(tm)
library(SnowballC)
#Seizure
#cnamePS=file.path("C:\\Users\\Joshua\\Documents\\TextMining\\")
cnamePS=file.path("C:\\Users\\Joshua\\Documents\\TextMining\\Description_text")
docs <- Corpus(DirSource(cnamePS), readerControl=list(reader=readPlain))
#clean file
docsA <- tm_map(docs, removePunctuation)
docsA <- tm_map(docsA, removeNumbers)
docsA=tm_map(docsA,content_transformer(tolower))
docsA=tm_map(docsA,removeWords,stopwords("english"))
#phrase replacement
phrasesReplacement <- read.csv("C:\\Users\\Joshua\\Documents\\TextMining\\phrases.csv", stringsAsFactors = FALSE)
replacePhrasesFunc <- function(txt, replacementtable)
{
  # drop hyphens, then apply each phrase replacement as a literal (non-regex) match
  txt <- gsub("-", "", txt)
  for (r in seq_len(nrow(replacementtable)))
  {
    txt <- gsub(replacementtable$phrase[r], replacementtable$replacement[r], txt, fixed=TRUE)
  }
  return(txt)
}
replacePhrases <- content_transformer(replacePhrasesFunc)
docsA <- tm_map(docsA, replacePhrases, phrasesReplacement)#keep phrase
#docsAcopy=docsA #make copy
#docsA=tm_map(docsA,stemDocument)
docsA=tm_map(docsA,stripWhitespace)
Hi, I am trying to do some text mining for some research, but I keep getting an error in one part of my code.
> docsA <- tm_map(docsA, replacePhrases, phrasesReplacement)#keep phrase
Error in gsub(replacementtable$phrase[r], replacementtable$replacement[r], :
invalid 'pattern' argument
Called from: gsub(replacementtable$phrase[r], replacementtable$replacement[r],
txt, fixed = TRUE)
Unfortunately, as I am new to R, I cannot figure out why the error keeps recurring. Thanks for any help.
So the .csv file I use is formatted as such..
So here is the debugging information.
Turns out that it was a problem with my .csv file.
I was missing a comma in between words.
A good reminder that the smallest things can cause the whole code to not work.
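A malformed CSV like this can be caught before the loop runs. The sketch below is a minimal illustration, not part of the original code: check_replacement_table is a hypothetical helper, and the sample phrases are made up. It relies on the fact that gsub() reports an invalid 'pattern' argument when the pattern is NULL or NA, which is what happens when the phrase column is missing or empty.

```r
# Hypothetical helper: validate the replacement table before looping over it
check_replacement_table <- function(replacement_table) {
  required <- c("phrase", "replacement")
  missing_cols <- setdiff(required, names(replacement_table))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  bad <- is.na(replacement_table$phrase) | replacement_table$phrase == ""
  if (any(bad)) {
    stop("empty phrase in row(s): ", paste(which(bad), collapse = ", "))
  }
  invisible(TRUE)
}

# A well-formed table passes silently
good <- data.frame(phrase = c("heart attack", "blood pressure"),
                   replacement = c("heart_attack", "blood_pressure"),
                   stringsAsFactors = FALSE)
check_replacement_table(good)

# A header row without its comma yields a single, differently named column
broken <- read.csv(text = "phrase replacement\nheart attack heart_attack",
                   stringsAsFactors = FALSE)
result <- tryCatch(check_replacement_table(broken),
                   error = function(e) conditionMessage(e))
print(result)
```

Calling such a check right after read.csv() turns a confusing gsub() error into a message that points at the file.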

How to do large-scale replacement/tokenization in R tm_map gsub from a list?

Has anyone managed to create a massive find/replace function/working code snippet that exchanges out known bigrams in a dataframe?
Here's an example. I'm able to do onesie-twosie replacements, but I really want to leverage a known lexicon of about 800 terms that I want to find and replace, turning them into single word units prior to DTM generation. For example, I want to turn "Google Analytics" into "google-analytics".
I know it's theoretically possible; essentially, a custom stopwords list functionally does almost the same thing, except without the replacement. And it seems stupid to just have 800 gsubs.
Here's my current code. Any help/pointers/URLs/RTFMs would be greatly appreciated.
mystopwords <- read.csv(stopwords.file, header = FALSE)
mystopwords <- as.character(mystopwords$V1)
mystopwords <- c(mystopwords, stopwords())
# load the file
df <- readLines(file.name)
# transform to corpus
doc.vec <- VectorSource(df)
doc.corpus <- Corpus(doc.vec)
# summary(doc.corpus)
## Hit known phrases
doc.corpus <- tm_map(doc.corpus, content_transformer(gsub), pattern = "Google Analytics", replacement = "google-analytics")
## Clean up and fix text - note, no stemming
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removePunctuation,preserve_intra_word_dashes = TRUE)
doc.corpus <- tm_map(doc.corpus, removeNumbers)
doc.corpus <- tm_map(doc.corpus, removeWords, c(stopwords("english"),mystopwords))
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
The corpus library allows you to combine multi-word phrases into single tokens. When there are multiple matches, it chooses the longest one:
library(corpus)
text_tokens("I live in New York City, New York",
combine = c("new york city", "new york"))
# [[1]]
# [1] "i" "live" "in" "new_york_city"
# [5] "," "new_york"
By default, the connector is the underscore character (_), but you can specify an alternative connector using the connector argument.
In your example, you could do the following to get a document-by-term matrix:
mycombine <- c("google analytics", "amazon echo") # etc.
term_matrix(doc.corpus, combine = mycombine,
drop_punct = TRUE, drop_number = TRUE,
drop = c(stopwords_en, mystopwords))
Note also that corpus keeps intra-word hyphens, so there's no need for a preserve_intra_word_dashes option.
It can be a hassle to specify the preprocessing options in every function call. If you'd like, you can convert your corpus to a corpus_frame (a data.frame with a special text column), then set the preprocessing options (the text_filter):
corpus <- as_corpus_frame(doc.corpus)
text_filter(corpus) <- text_filter(combine = mycombine,
drop_punct = TRUE,
drop_number = TRUE,
drop = c(stopwords_en, mystopwords))
After that, you can just call
term_matrix(corpus)
There's a lot more information about corpus, including an introductory vignette, at http://corpustext.com
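For anyone who prefers to stay with tm and base R, the lexicon can also be applied in a single loop instead of 800 separate gsub() calls. The following is a minimal sketch under stated assumptions: combine_phrases is a hypothetical helper name, matching is literal (fixed = TRUE) with no word-boundary check, and longer phrases are replaced first so that a shorter phrase does not break up a longer one.

```r
# Hypothetical helper: join each known phrase into a single token
combine_phrases <- function(text, phrases, connector = "-") {
  phrases <- tolower(phrases)
  # Longest phrases first, so "new york city" wins over "new york"
  phrases <- phrases[order(nchar(phrases), decreasing = TRUE)]
  text <- tolower(text)
  for (p in phrases) {
    joined <- gsub(" ", connector, p, fixed = TRUE)
    text <- gsub(p, joined, text, fixed = TRUE)
  }
  text
}

lexicon <- c("google analytics", "amazon echo", "new york", "new york city")
out1 <- combine_phrases("I use Google Analytics and Amazon Echo", lexicon)
out2 <- combine_phrases("New York City, New York", lexicon)
print(out1)
print(out2)
```

With tm, the helper could presumably be applied as tm_map(doc.corpus, content_transformer(combine_phrases), phrases = lexicon), following the same pattern as the single gsub call in the question; that call is not tested here.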

Text mining with R: use of sub

I am on a project with R and I am starting to get my hands dirty with it.
In the first part I try to clean the data of vector msg. But later when I build the termdocumentmatrix, these characters still appear.
I would like to remove words with less than 4 letters and remove punctuation
gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
gsub("[[:punct:]]", "", pclbyshares$msg)
corpus <- Corpus(VectorSource(pclbyshares$msg))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq=120, highfreq=Inf)
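You haven't stored the results of your first two lines of code in variables, so nothing is kept for later use. In your third line, where you create your corpus variable, you are therefore using the unmodified msg data. Note also that \\w{1,4} matches words of up to four letters; use {1,3} if you only want to drop words with fewer than four letters. Give this a try: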
msg_clean <- gsub("\\b\\w{1,4}\\b ", " ", pclbyshares$msg)
msg_clean <- gsub("[[:punct:]]", "", msg_clean)
corpus <- Corpus(VectorSource(msg_clean))
TermDocumentMatrix(corpus)
tdm <- TermDocumentMatrix(corpus)
findFreqTerms(tdm, lowfreq = 120, highfreq = Inf)
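The underlying point is that gsub() returns a new vector and never modifies its input. A small base R sketch with made-up sample text shows this:

```r
msg <- c("Hello, world!", "R is fun.")

# Calling gsub() without assignment only prints the result; msg is untouched
gsub("[[:punct:]]", "", msg)
unchanged <- identical(msg, c("Hello, world!", "R is fun."))

# Assigning the result keeps the cleaned text for later steps
msg_clean <- gsub("[[:punct:]]", "", msg)
print(msg_clean)
```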

Using RemoveWords in tm_map in R on words loaded from a file

I have seen several questions about using the removeWords function with tm_map from the tm package in R to remove either stopwords() or hard-coded words from a corpus. However, I am trying to remove words stored in a file (currently CSV, but I don't care which type). Using the code below, I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
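A likely cause, offered as a sketch rather than a confirmed diagnosis: read.csv() returns a data frame, and paste() on a data frame produces one string per column that looks like deparsed R code, not a vector of words. removeWords expects a character vector with one word per element. The example below uses base R only; the column name word and the sample words are assumptions.

```r
# Stand-in for read.csv("mystopwords.csv", ...); column name is hypothetical
my_stop_words <- read.csv(text = "word\nfoo\nbar", stringsAsFactors = FALSE)

# What the question does: a single string, not a list of words
pasted <- paste(my_stop_words, collapse = " ")
print(pasted)
print(length(pasted))

# What removeWords needs: the column itself as a character vector
stop_vec <- my_stop_words[[1]]
print(stop_vec)

# With tm loaded, the call would then be (not run here):
# corpus <- tm_map(corpus, removeWords, stop_vec)
```

The same applies to problem_statements: extracting the text column, rather than pasting the whole data frame, keeps one document per row.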

FUN-error after running 'tolower' while making Twitter wordcloud

Trying to create wordcloud from twitter data, but get the following error:
Error in FUN(X[[72L]], ...) :
invalid input '������������❤������������ "#xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'
This error appears after running the "mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)" code:
mytwittersearch_list <-sapply(mytwittersearch, function(x) x$getText())
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus<-tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus<-tm_map( mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <-tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))
I read on other pages this may be due to R having difficulty processing symbols, emoticons and letters in non-English languages, but this appears not to be the problem with the "error tweets" that R has issues with. I did run the codes:
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))
These do not help. I also get an error that the function content_transformer cannot be found, even though the tm package is loaded.
I'm running this on OS X 10.6.8 and using the latest RStudio.
I use this code to get rid of the problem characters:
tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
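As a small illustration of what this call does, the sketch below runs it on made-up sample text. Since iconv() is vectorised, the sapply() wrapper is not strictly needed. Every byte that cannot be represented in ASCII is dropped, so accented letters and emoji disappear entirely. Exact behaviour can vary with the platform's iconv implementation.

```r
tweets_text <- c("sant\u00e9", "I \u2764 R", "plain ascii")

# Drop every byte that has no ASCII equivalent
clean <- iconv(tweets_text, "latin1", "ASCII", sub = "")
print(clean)

# The cleaned text is safe to pass to tolower()
lowered <- tolower(clean)
print(lowered)
```

This is lossy: "santé" loses its last letter, which may not be acceptable for non-English text.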
A nice example on creating wordcloud from Twitter data is here. Using the example, and the code below, and passing the tolower parameter while creating the TermDocumentMatrix, I could create a Twitter wordcloud.
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")
#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())
#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))
#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,control = list(removePunctuation = TRUE,stopwords = c("new", "year", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))
#Convert as matrix
m = as.matrix(tdm)
#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
Have you tried updating tm and using stri_trans_tolower from stringi?
library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456)
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) :
# invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "#comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "#comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"
The above solutions may have worked but not anymore in the newest versions of wordcloud and tm.
This problem almost made me crazy, but I found a solution and want to explain it the best I can to save anyone becoming desperate.
The function which is implicitly called by wordcloud and responsible for throwing the error
Error in FUN(content(x), ...) : in 'utf8towcs'
is this one:
words.corpus <- tm_map(words.corpus, tolower)
which is a shortcut for
words.corpus <- tm_map(words.corpus, content_transformer(tolower))
To provide a reproducible example, here's a function that embeds the solution:
plot_wordcloud <- function(words, max_words = 70, remove_words = "",
                           n_colors = 5, palette = "Set1")
{
  require(dplyr)        # for %>%
  require(wordcloud)
  require(RColorBrewer) # for brewer.pal()
  require(tm)           # for Corpus() and tm_map()
  # Solution: replace every non-ASCII byte with its hex code (e.g. "<c3>")
  # before the corpus is built
  words <- iconv(words, "ASCII", "UTF-8", sub="byte")
  # Build the corpus from the converted text and drop unwanted words
  words.corpus <- Corpus(VectorSource(words))
  words.corpus <- tm_map(words.corpus, removeWords, remove_words)
  wc <- wordcloud(words=words.corpus, max.words=max_words,
                  random.order=FALSE,
                  colors = brewer.pal(n_colors, palette),
                  random.color = FALSE,
                  scale=c(5.5,.5), rot.per=0.35) %>% recordPlot
  return(wc)
}
Here's what failed:
I tried to convert the text BEFORE and AFTER creating the corpus with
words.corpus <- Corpus(VectorSource(words))
BEFORE:
Converting to UTF-8 on the text didn't work with:
words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))
nor
for (i in 1:length(words))
{
Encoding(words[[i]])="UTF-8"
}
AFTER:
Converting to UTF-8 on the corpus didn't work with:
words.corpus <- tm_map(words.corpus, removeWords, remove_words)
nor
words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))
nor
words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))
nor
words.corpus <- tm_map(words.corpus, enc2utf8)
nor
words.corpus <- tm_map(words.corpus, tolower)
All these solutions may have worked at a certain point in time, so I don't want to discredit the authors. They may work some time in the future. But why they didn't work is almost impossible to say because there were good reasons why they were supposed to work.
Anyway, just remember to convert the text before creating the corpus with:
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
Disclaimer:
I got the solution with more detailed explanation here:
http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/
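To see what the iconv() call above does to the text, here is a minimal base R sketch on made-up input. With from = "ASCII" and sub = "byte", every byte outside the ASCII range is replaced by its hex code in angle brackets, so the result is plain ASCII that tolower() can process. Exact behaviour can vary with the platform's iconv implementation.

```r
words <- c("sant\u00e9", "hello")

# Non-ASCII bytes become "<xx>" hex codes; ASCII text is left alone
converted <- iconv(words, "ASCII", "UTF-8", sub = "byte")
print(converted)

# No invalid multibyte input is left for tolower() to choke on
lowered <- tolower(converted)
print(lowered)
```

The trade-off is that the hex codes end up as visible tokens in the text, so non-English letters are not preserved as letters.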
I ended up with updating my RStudio and packages. This seemed to solve the tolower/ content_transformer issues. I read somewhere that the last tm-package had some issues with tm_map, so maybe that was the problem. In any case, this worked!
Instead of
corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)
use
corp <- tm_map(corp, tolower, mc.cores=1)
While using code similar to that above, I worked on a word cloud Shiny app that ran fine on my own PC but didn't work on either Amazon AWS or shinyapps.io. I discovered that text containing accents, e.g. santé, didn't upload well to the cloud in CSV files. I solved it by saving the files as .txt files in UTF-8 using Notepad and rewriting my code to read .txt files instead of .csv files. My version of R was 3.2.1 and RStudio was version 0.99.465.
Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8"), even though Encoding() told me it was already UTF-8.
