FUN error after running 'tolower' while making a Twitter wordcloud in R

I'm trying to create a wordcloud from Twitter data, but I get the following error:
Error in FUN(X[[72L]], ...) :
invalid input '������������❤������������ "#xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'
The error appears after running the mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower) line in the code below:
mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))
I have read elsewhere that this may be due to R having difficulty processing symbols, emoticons and letters from non-English languages, but that does not appear to be the problem with the "error tweets" R is complaining about. I did run this code:
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))
These do not help. I also get an error that the function content_transformer cannot be found, even though the tm package is loaded and running.
I'm running this on OS X 10.6.8 and using the latest RStudio.

I use this code to get rid of the problem characters:
tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
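In the context of the question's pipeline, that step would slot in before the corpus is built. A sketch, reusing the object names from the question (it simply strips characters that cannot be represented in ASCII, such as emoji):
mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())
# drop anything that cannot be represented as ASCII (emoji, decorative symbols, ...)
mytwittersearch_list <- sapply(mytwittersearch_list,
                               function(row) iconv(row, "latin1", "ASCII", sub = ""))
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))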

A nice example of creating a wordcloud from Twitter data is here. Using that example and the code below, and passing tolower as a control option while creating the TermDocumentMatrix, I could create a Twitter wordcloud.
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")
#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())
#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))
#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,
                         control = list(removePunctuation = TRUE,
                                        stopwords = c("new", "year", stopwords("english")),
                                        removeNumbers = TRUE, tolower = TRUE))
#Convert as matrix
m = as.matrix(tdm)
#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

Have you tried updating tm and using stri_trans_tolower from stringi?
library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456)
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) :
# invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "#comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "#comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"

The above solutions may have worked at the time, but not anymore with the newest versions of wordcloud and tm.
This problem almost drove me crazy, but I found a solution and want to explain it as best I can to save anyone else from getting desperate.
The function which is implicitly called by wordcloud and responsible for throwing the error
Error in FUN(content(x), ...) : in 'utf8towcs'
is this one:
words.corpus <- tm_map(words.corpus, tolower)
which is a shortcut for
words.corpus <- tm_map(words.corpus, content_transformer(tolower))
To provide a reproducible example, here's a function that embeds the solution:
plot_wordcloud <- function(words, max_words = 70, remove_words = "",
                           n_colors = 5, palette = "Set1")
{
  require(dplyr)        # for %>%
  require(wordcloud)
  require(RColorBrewer) # for brewer.pal()
  require(tm)           # for removeWords()
  # Solution: remove all non-printable characters in UTF-8 with this line
  words <- iconv(words, "ASCII", "UTF-8", sub = "byte")
  words <- removeWords(words, remove_words)
  wc <- wordcloud(words = words, max.words = max_words,
                  random.order = FALSE,
                  colors = brewer.pal(n_colors, palette),
                  random.color = FALSE,
                  scale = c(5.5, .5), rot.per = 0.35) %>% recordPlot
  return(wc)
}
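A hypothetical call might look like this (the object name tweet_texts and the word list are made up, standing in for whatever character vector of tweets you have):
plot_wordcloud(tweet_texts, max_words = 50, remove_words = c("rt", "https"),
               n_colors = 8, palette = "Dark2")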
Here's what failed:
I tried to convert the text BEFORE and AFTER creating the corpus with
words.corpus <- Corpus(VectorSource(words))
BEFORE:
Converting to UTF-8 on the text didn't work with:
words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))
nor
for (i in 1:length(words))
{
Encoding(words[[i]])="UTF-8"
}
AFTER:
Converting or transforming the corpus afterwards didn't work with:
words.corpus <- tm_map(words.corpus, removeWords, remove_words)
nor
words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))
nor
words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))
nor
words.corpus <- tm_map(words.corpus, enc2utf8)
nor
words.corpus <- tm_map(words.corpus, tolower)
All these solutions may have worked at some point in time, so I don't want to discredit the authors; they may work again some time in the future. But it is almost impossible to say why they didn't work here, because there were good reasons to expect each of them to.
Anyway, just remember to convert the text before creating the corpus with:
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
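Put in context, the ordering looks roughly like this (a sketch using the object names from this answer):
words <- iconv(words, "ASCII", "UTF-8", sub = "byte")   # convert BEFORE building the corpus
words.corpus <- Corpus(VectorSource(words))
words.corpus <- tm_map(words.corpus, content_transformer(tolower))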
Disclaimer:
I got the solution with more detailed explanation here:
http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/

I ended up updating RStudio and my packages. This seemed to solve the tolower/content_transformer issues. I read somewhere that the latest tm package had some issues with tm_map, so maybe that was the problem. In any case, this worked!

Instead of
corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)
use
corp <- tm_map(corp, tolower, mc.cores=1)

While using code similar to the above in a wordcloud Shiny app that ran fine on my own PC but failed on both Amazon AWS and shinyapps.io, I discovered that text containing accents (e.g. santé) didn't upload well to the cloud as CSV files. I found a solution by saving the files as UTF-8 .txt files using Notepad and rewriting my code to account for the files being plain text rather than CSV. My R version was 3.2.1 and RStudio was version 0.99.465.
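For reference, reading the re-saved file could look roughly like this (the file name is made up; the encoding argument tells R the file was saved as UTF-8):
tweets_text <- readLines("tweets.txt", encoding = "UTF-8")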

Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8"), even though Encoding() told me it was already UTF-8.
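A minimal sketch of that check-and-convert step (x stands for whatever character vector is failing):
Encoding(x)                        # may report "UTF-8" even when the bytes are really latin1
x <- iconv(x, "latin1", "UTF-8")   # convert the underlying bytes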

Related

Using the gsub function to replace words read from a .csv file

setwd("C:\\Users\\Joshua\\Documents\\TextMining\\Description_text")
library(RTextTools)
library(topicmodels)
library(tm)
library(SnowballC)
#Seizure
#cnamePS=file.path("C:\\Users\\Joshua\\Documents\\TextMining\\")
cnamePS=file.path("C:\\Users\\Joshua\\Documents\\TextMining\\Description_text")
docs <- Corpus(DirSource(cnamePS), readerControl=list(reader=readPlain))
#clean file
docsA <- tm_map(docs, removePunctuation)
docsA <- tm_map(docsA, removeNumbers)
docsA=tm_map(docsA,content_transformer(tolower))
docsA=tm_map(docsA,removeWords,stopwords("english"))
#phrase replacement
phrasesReplacement <- read.csv("C:\\Users\\Joshua\\Documents\\TextMining\\phrases.csv", stringsAsFactors = FALSE)
replacePhrasesFunc <- function(txt, replacementtable)
{
  txt <- gsub("-", "", txt)
  for (r in seq(nrow(replacementtable)))
  {
    txt <- gsub(replacementtable$phrase[r], replacementtable$replacement[r], txt, fixed = TRUE)
  }
  return(txt)
}
replacePhrases <- content_transformer(replacePhrasesFunc)
docsA <- tm_map(docsA, replacePhrases, phrasesReplacement)#keep phrase
#docsAcopy=docsA #make copy
#docsA=tm_map(docsA,stemDocument)
docsA=tm_map(docsA,stripWhitespace)
Hi, I am trying to do some text mining for research, but I keep getting an error in one part of my code.
> docsA <- tm_map(docsA, replacePhrases, phrasesReplacement)#keep phrase
Error in gsub(replacementtable$phrase[r], replacementtable$replacement[r], :
invalid 'pattern' argument
Called from: gsub(replacementtable$phrase[r], replacementtable$replacement[r],
txt, fixed = TRUE)
Unfortunately, as I am new to R, I cannot figure out why the error keeps recurring. Thanks for any help.
(Screenshots of the .csv file layout and the debugging output were attached here.)
Turns out that it was a problem with my .csv file.
I was missing a comma in between words.
A good reminder that the smallest things can cause the whole code to not work.
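For anyone reading along: the code above expects phrases.csv to contain phrase and replacement columns, one pair per row. A hypothetical equivalent built directly in R (the example phrases are made up):
phrasesReplacement <- data.frame(
  phrase      = c("heart attack", "blood pressure"),
  replacement = c("heart_attack", "blood_pressure"),
  stringsAsFactors = FALSE
)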

Problems with non-English letters in a wordcloud made from Twitter-mined text

I'm new to Stack Overflow and I've been doing my best to follow the guidelines. If there's something I've missed, however, please let me know.
Lately I've been playing around with text mining in R, something I'm new to. I've been using the packages you can find in the code below. However, a problem occurs when the wordcloud displays the Swedish letters å, ä and ö: as you can see in the attached picture, the dots get positioned a bit strangely.
Wordcloud image
I've been trying as best as I could solving this by myself, but whatever I've been trying, I can't seem to get it to work.
What I've tried to do:
Use Encoding(tweets) <- "UTF-8" in an attempt to set tweets to UTF-8
Use iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "")
Furthermore, the last part of the code, after defining the corpus vector, was copied from the author of the tm package. He listed it as the solution after other people mentioned problems with the wordcloud function when a corpus vector is used as input. Without it, I get an error message when trying to create the wordcloud.
#Get and load necessary packages:
install.packages("twitteR")
install.packages("ROAuth")
install.packages("wordcloud")
install.packages("tm")
library("tm")
library("wordcloud")
library("twitteR")
library("ROAuth")
#Authentication:
api_key <- "XXX"
api_secret <- "XXX"
access_token <- "XXX"
access_token_secret <- "XXX"
cred <- setup_twitter_oauth(api_key,api_secret,access_token,
access_token_secret)
#Extract tweets:
search.string <- "#svpol"
no.of.tweets <- 3200
tweets <- searchTwitter(search.string, n=no.of.tweets, since = "2017-01-01")
tweets.text <- sapply(tweets, function(x){x$getText()})
#Remove the "RT" prefix from retweets:
tweets.text <- gsub("^\\bRT", "", tweets.text)
#Remove tabs:
tweets.text <- gsub("[ |\t]{2,}", "", tweets.text)
#Remove usernames:
tweets.text <- gsub("#\\w+", "", tweets.text)
tweets.text <- (tweets.text[!is.na(tweets.text)])
tweets.text <- gsub("\n", " ", tweets.text)
#Remove links:
tweets.text <- gsub("http[^[:space:]]*", "", tweets.text)
#Remove stopwords:
stopwords_swe <- c("är", "från", "än")
#Just a short example above, the real one is very large
tweets.text <- removeWords(tweets.text,stopwords_swe)
#Create corpus:
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
#See notes in the longer text about the corpus vector
tweets.text.corpus <- tm_map(tweets.text.corpus,
content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')), mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, content_transformer(tolower), mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, removePunctuation, mc.cores=1)
tweets.text.corpus <- tm_map(tweets.text.corpus, function(x)removeWords(x,stopwords(kind = "en")), mc.cores=1)
wordcloud <- wordcloud(tweets.text.corpus, min.freq = 10,
max.words=300, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Set2"))
wordcloud
Would be super happy to receive help with this!
Managed to solve it by first encoding the vector to UTF-8-MAC (since I'm on OS X), then using gsub() to manually change the hex codes for å, ä and ö (the letters I had problems with) into the actual letters. For example gsub("0xc3 0x85", "Å", x) and gsub("0xc3 0xa5", "å", x) (since gsub is case sensitive).
Lastly, I changed the encoding argument in the tm_map() call from UTF-8-MAC to latin1. That did the trick for me; hopefully someone else will find this useful in the future.
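A rough sketch of that approach (the object name and the exact byte-escape strings are assumptions; the escapes depend on how iconv() substituted the bytes on your system):
tweets.text <- iconv(tweets.text, to = "UTF-8-MAC", sub = "byte")
tweets.text <- gsub("<c3><85>", "Å", tweets.text, fixed = TRUE)
tweets.text <- gsub("<c3><a5>", "å", tweets.text, fixed = TRUE)
tweets.text <- gsub("<c3><84>", "Ä", tweets.text, fixed = TRUE)
tweets.text <- gsub("<c3><a4>", "ä", tweets.text, fixed = TRUE)
tweets.text <- gsub("<c3><96>", "Ö", tweets.text, fixed = TRUE)
tweets.text <- gsub("<c3><b6>", "ö", tweets.text, fixed = TRUE)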

Using RemoveWords in tm_map in R on words loaded from a file

I have seen several questions about using the removeWords function from R's tm package to remove either stopwords() or hard-coded words from a corpus. However, I am trying to remove words stored in a file (currently CSV, but I don't care which type). With the code below I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
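For reference, removeWords expects a character vector of individual words, so pasting the file into one long string means no whole-word matches. A sketch of the change (the assumption is that the words live in the first column of mystopwords.csv):
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors = FALSE, header = TRUE)
corpus <- tm_map(corpus, removeWords, my_stop_words[[1]])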

TermDocumentMatrix errors in R

I have been working through numerous online examples of the {tm} package in R, attempting to create a TermDocumentMatrix. Creating and cleaning a corpus has been pretty straightforward, but I consistently encounter an error when I attempt to create a matrix. The error is:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
For example, here is code from Jon Starkweather's text mining example. Apologies in advance for such long code, but it does produce a reproducible example. Please note that the error occurs at the end, in the TermDocumentMatrix() call.
#Read in data
policy.HTML.page <- readLines("http://policy.unt.edu/policy/3-5")
#Obtain text and remove mark-up
policy.HTML.page[186:202]
id.1 <- 3 + which(policy.HTML.page == " TOTAL UNIVERSITY </div>")
id.2 <- id.1 + 5
text.data <- policy.HTML.page[id.1:id.2]
td.1 <- gsub(pattern = "<p>", replacement = "", x = text.data,
ignore.case = TRUE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
td.2 <- gsub(pattern = "</p>", replacement = "", x = td.1, ignore.case = TRUE,
perl = FALSE, fixed = FALSE, useBytes = FALSE)
text.d <- td.2; rm(text.data, td.1, td.2)
#Create corpus and clean
library(tm)
library(SnowballC)
txt <- VectorSource(text.d); rm(text.d)
txt.corpus <- Corpus(txt)
txt.corpus <- tm_map(txt.corpus, tolower)
txt.corpus <- tm_map(txt.corpus, removeNumbers)
txt.corpus <- tm_map(txt.corpus, removePunctuation)
txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
txt.corpus <- tm_map(txt.corpus, stripWhitespace); #inspect(docs[1])
txt.corpus <- tm_map(txt.corpus, stemDocument)
# NOTE ERROR WHEN CREATING TDM
tdm <- TermDocumentMatrix(txt.corpus)
The link provided by jazzurro points to the solution. The following line of code
txt.corpus <- tm_map(txt.corpus, tolower)
must be changed to
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
There are two reasons for this issue in tm v0.6.
If you are doing term-level transformations like tolower, tm_map returns a character vector instead of a PlainTextDocument.
Solution: call tolower through content_transformer, or call tm_map(corpus, PlainTextDocument) immediately after tolower.
This can also occur if the SnowballC package is not installed and you are trying to stem the documents.
Solution: install.packages('SnowballC')
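Putting the two fixes together, a minimal sketch of the corrected tail of the pipeline (the intermediate cleaning steps from the question are omitted here):
install.packages("SnowballC")   # only needed once, for stemDocument()
library(SnowballC)
txt.corpus <- tm_map(txt.corpus, content_transformer(tolower))
txt.corpus <- tm_map(txt.corpus, stemDocument)
tdm <- TermDocumentMatrix(txt.corpus)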
There is no need to apply content_transformer. Create the corpus this way:
trainData_corpus <- Corpus(VectorSource(trainData$Comments))
Try it.

Finding ngrams in R and comparing ngrams across corpora

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement").
This is a two-step question, one regarding my code so far and one regarding how I should go on.
Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:
library(tm)
library(RWeka)
a <-Corpus(DirSource("/mycorpora/1965"), readerControl = list(language="lat")) # that dir is full of txt files
summary(a)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
# everything works fine so far, so I start playing around with what I have
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
inspect(adtm)
findFreqTerms(adtm, lowfreq=10) # find terms with a frequency higher than 10
findAssocs(adtm, "usa",.5) # just looking for some associations
findAssocs(adtm, "china",.5)
# ... and so on, and so forth, all of this works fine
The corpus I load into R works fine with most functions I throw at it. I haven't had any problems creating TDMs from my corpus, finding frequent words and associations, creating word clouds and so on. But when I try to identify ngrams using the approach outlined in the tm FAQ, I'm apparently making some mistake with the TDM constructor:
# Trigram
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
inspect(tdm)
I get this error message:
Error in rep(seq_along(x), sapply(tflist, length)) :
invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
Any ideas? Is "a" not the right class/object? I'm confused. I assume there's a fundamental mistake here, but I'm not seeing it. :(
Step 2: Then I would like to identify ngrams that are significantly overrepresented, when I compare the corpus against other corpora. For example I could compare my corpus against a large standard english corpus. Or I create subsets that I can compare against each other (e.g. Soviet vs. a Chinese Communist terminology). Do you have any suggestions how I should go about doing this? Any scripts/functions I should look into? Just some ideas or pointers would be great.
Thanks for your patience!
I could not reproduce your problem. Are you using the latest versions of R, tm, RWeka, etc.?
require(tm)
a <- Corpus(DirSource("C:\\Downloads\\Only1965\\Only1965"))
summary(a)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
# a <- tm_map(a, stemDocument, language = "english")
# I also got it to work with stemming, but it takes so long...
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
inspect(adtm)
findFreqTerms(adtm, lowfreq=10) # find terms with a frequency higher than 10
findAssocs(adtm, "usa",.5) # just looking for some associations
findAssocs(adtm, "china",.5)
# Trigrams
require(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.75)
inspect(tdm[1:5,1:5])
And here's what I get
A term-document matrix (5 terms, 5 documents)
Non-/sparse entries: 11/14
Sparsity : 56%
Maximal term length: 28
Weighting : term frequency (tf)
Docs
Terms PR1965-01.txt PR1965-02.txt PR1965-03.txt
†chinese press 0 0 0
†renmin ribao 0 1 1
— renmin ribao 2 5 2
“ chinese people 0 0 0
“renmin ribaoâ€\u009d editorial 0 1 0
etc.
Regarding your step two, here are some pointers to useful starts:
http://quantifyingmemory.blogspot.com/2013/02/mapping-significant-textual-differences.html
http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ and here's his code https://dl.dropboxusercontent.com/u/4713959/Neuchatel/NassrProgram.R
Regarding Step 1, Brian.keng gives a one-liner workaround here https://stackoverflow.com/a/20251039/3107920 that solves this issue on Mac OS X; it seems to be related to parallelisation rather than (the minor nightmare that is) Java setup on Mac.
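If I remember the linked workaround correctly, it boils down to turning off tm's parallel processing before building the matrix, roughly:
options(mc.cores = 1)   # reported workaround on Mac OS X; treat the exact option as an assumption
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))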
You may also want to access the functions explicitly, like this:
BigramTokenizer <- function(x) {
RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))
}
myTdmBi.d <- TermDocumentMatrix(
myCorpus.d,
control = list(tokenize = BigramTokenizer, weighting = weightTfIdf)
)
Also, some other things that randomly came up.
myCorpus.d <- tm_map(myCorpus.d, tolower) # This does not work anymore
Try this instead
myCorpus.d <- tm_map(myCorpus.d, content_transformer(tolower)) # Make lowercase
In the RTextTools package,
create_matrix(as.vector(C$V2), ngramLength=3) # ngramLength throws an error message.
Further to Ben's answer - I couldn't reproduce this either, but in the past I've had trouble with the plyr package and conflicting dependencies. In my case there was a conflict between Hmisc and ddply. You could try adding this line just prior to the offending line of code:
tryCatch(detach("package:Hmisc"), error = function(e) NULL)
Apologies if this is completely tangental to your problem!
