Here is the source code that I have used:
MyData <- Corpus(DirSource("F:/Data/CSV/Data"),
                 readerControl = list(reader = readPlain, language = "cn"))
SegmentedData <- lapply(MyData, function(x) unlist(segmentCN(x)))
temp <- Corpus(DataframeSource(SegmentedData),
               readerControl = list(reader = readPlain, language = "cn"))
# Preprocessing the data
temp <- tm_map(temp, removePunctuation)
temp <- tm_map(temp, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", " ", x)
temp <- tm_map(temp, removeURL)
temp <- tm_map(temp, stripWhitespace)
dtmxi <- DocumentTermMatrix(temp)
dtmxi <- removeSparseTerms(dtmxi, 0.83)
inspect(t(dtmxi))  # <-- this is where I get the error
I believe there are some Chinese characters in your file. To overcome this issue, use this line of code so that they are read as well:
Sys.setlocale('LC_ALL','C')
My RStudio restarts the session after I set Sys.setlocale('LC_ALL', 'C') and run the TermDocumentMatrix(mycorpus) function.
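If the locale switch destabilizes your session, a more cautious pattern is to scope the change to the failing call and then restore the previous setting. This is only a sketch, assuming the temp and dtmxi objects from the question above; note that a full LC_ALL setting is not guaranteed to round-trip on every platform:
old_ctype <- Sys.getlocale("LC_CTYPE")  # remember the current character-handling locale
Sys.setlocale("LC_ALL", "C")
dtmxi <- DocumentTermMatrix(temp)       # the call that previously failed
inspect(t(dtmxi))
Sys.setlocale("LC_CTYPE", old_ctype)    # attempt to restore the old locale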
You can use this code:
txt <- tm_map(txt, content_transformer(stemDocument))
where txt is your corpus.
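For reference, stemDocument delegates to the Porter stemmer from the SnowballC package, so you can preview what it will do to individual words. A minimal sketch:
library(SnowballC)
wordStem(c("running", "runs", "runner"), language = "english")
# [1] "run"    "run"    "runner"   (Porter stemming leaves "runner" unchanged)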
I have a data frame with 60,000 rows/phrases that I would like to use as stopwords and remove from my text.
I use the tm package, and after reading the csv file with the stopword list I run this line:
corpus <- tm_map(corpus, removeWords, df$mylistofstopwords)
but I receive this error:
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
PCRE pattern compilation error
'regular expression is too large'
at ''
Is the problem that the list is too big? Is there anything I can do to fix it?
You could probably resolve your issue by splitting the stopword list into multiple parts: removeWords pastes all the words into a single regular-expression alternation (see the warning above), which is what overflows PCRE's pattern size limit. Something like the following:
chunk <- 1000
i <- 0
n <- length(df$mylistofstopwords)
while (i != n) {
    i2 <- min(i + chunk, n)  # end index of this chunk
    corpus <- tm_map(corpus, removeWords, df$mylistofstopwords[(i + 1):i2])
    i <- i2
}
Or, you could just use a package that can handle long stopword lists: corpus is one such package, and quanteda is another (see the sketch below). Here's how to get a document-by-term matrix in corpus:
library(corpus)
x <- term_matrix(corpus, drop = df$mylistofstopwords)
Here, the input argument corpus can be a tm corpus.
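For completeness, a comparable quanteda sketch. This is hedged: it assumes your documents are in a character vector txts (an illustrative name), rather than starting from the tm corpus directly:
library(quanteda)
toks <- tokens(txts)                               # tokenize the raw texts
toks <- tokens_remove(toks, df$mylistofstopwords)  # long stopword lists are fine here
dtm <- dfm(toks)                                   # document-feature (document-by-term) matrix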
I am new to the tm package in R. I am trying to create a document-term matrix with the tm_map function, but apparently the function passed to tm_map(Corpus, function, lazy=TRUE) is not applied to the corpus. Concretely, the documents are not converted to lower case. RStudio does not show any errors or warnings.
Did I mess up anything here? Could this be some encoding issue?
library(tm)
setwd("...")
filenames <- list.files(getwd(), pattern="*.txt")
files <- lapply(filenames, readLines)
docs <- Corpus(VectorSource(files))
writeLines(as.character(docs[[30]]))
docs <- tm_map(docs, function(x) iconv(enc2utf8(x$content), sub = ""), lazy=TRUE)
#to lower case
docs <- tm_map(docs, content_transformer(tolower), lazy=TRUE)
writeLines(as.character(docs[[30]]))
Thank you for any advice!
This is a simple fix: move your lower-case conversion so it runs before the iconv(...) step.
This works:
library(tm)
setwd("")
# Read in Files
filenames <- list.files(getwd(), pattern="*.txt")
files <- lapply(filenames, readLines)
docs <- Corpus(VectorSource(files))
writeLines(as.character(docs[[30]]))
# Lower Case
docs <- tm_map(docs, content_transformer(tolower), lazy=TRUE)
# Convert
docs <- tm_map(docs, function(x) iconv(enc2utf8(x$content), sub = ""))
writeLines(as.character(docs[[30]]))
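An alternative that avoids the ordering issue entirely is to wrap the iconv step in content_transformer as well, so each element stays a proper document instead of being flattened to a bare character vector. A hedged variant of the convert step:
docs <- tm_map(docs, content_transformer(function(x) iconv(enc2utf8(x), sub = "")))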
I have seen several questions about using the removeWords function from R's tm package to remove either stopwords() or hard-coded words from a corpus. However, I am trying to remove words stored in a file (currently csv, but I don't care which type). Using the code below, I don't get any errors, but my words are still there. Could someone please explain what is wrong?
#install.packages('tm')
library(tm)
setwd("c://Users//towens101317//Desktop")
problem_statements <- read.csv("query_export_results_100.csv", stringsAsFactors = FALSE, header = TRUE)
problem_statements_text <- paste(problem_statements, collapse=" ")
problem_statements_source <- VectorSource(problem_statements_text)
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors=FALSE, header = TRUE)
my_stop_words_text <- paste(my_stop_words, collapse=" ")
corpus <- Corpus(problem_statements_source)
corpus <- tm_map(corpus, removeWords, my_stop_words_text)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
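For what it's worth, a hedged observation rather than a tested fix: removeWords expects a character vector with one word per element, while paste(my_stop_words, collapse = " ") collapses the whole data frame into a single long string. Passing the column itself may behave differently, for example:
my_stop_words <- read.csv("mystopwords.csv", stringsAsFactors = FALSE, header = TRUE)
corpus <- tm_map(corpus, removeWords, my_stop_words[[1]])  # a vector of words, not one big string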
I am trying to create a wordcloud from Twitter data, but I get the following error:
Error in FUN(X[[72L]], ...) :
invalid input '������������❤������������ "#xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'
The error appears after running this line:
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_list <- sapply(mytwittersearch, function(x) x$getText())
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))
I have read on other pages that this may be due to R having difficulty processing symbols, emoticons, and letters from non-English languages, but that does not appear to be the problem with the "error tweets" R complains about. I did run these lines:
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))
These did not help. I also get an error that the function content_transformer cannot be found, even though the tm package is checked off and loaded.
I'm running this on OS X 10.6.8 and using the latest RStudio.
I use this code to get rid of the problem characters:
tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
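The same cleanup can also be applied before the corpus is ever built, so later tm_map calls never see the problem characters. A hedged sketch reusing the names from the question above:
mytwittersearch_list <- sapply(mytwittersearch_list, function(row) iconv(row, "latin1", "ASCII", sub = ""))
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))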
A nice example of creating a wordcloud from Twitter data is here. Using that example and the code below, and passing the tolower parameter while creating the TermDocumentMatrix, I could create a Twitter wordcloud.
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")
#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())
#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))
#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,
                         control = list(removePunctuation = TRUE,
                                        stopwords = c("new", "year", stopwords("english")),
                                        removeNumbers = TRUE, tolower = TRUE))
#Convert as matrix
m = as.matrix(tdm)
#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
Have you tried updating tm and using stri_trans_tolower from stringi?
library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456)
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) :
# invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "#comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "#comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"
The above solutions may have worked, but not anymore with the newest versions of wordcloud and tm.
This problem almost drove me crazy, but I found a solution and want to explain it as well as I can to save anyone else from becoming desperate.
The function which is implicitly called by wordcloud and responsible for throwing the error
Error in FUN(content(x), ...) : in 'utf8towcs'
is this one:
words.corpus <- tm_map(words.corpus, tolower)
which is a shortcut for
words.corpus <- tm_map(words.corpus, content_transformer(tolower))
To provide a reproducible example, here's a function that embeds the solution:
plot_wordcloud <- function(words, max_words = 70, remove_words = "",
                           n_colors = 5, palette = "Set1")
{
    require(dplyr)        # for %>%
    require(wordcloud)
    require(RColorBrewer) # for brewer.pal()
    require(tm)           # for Corpus() and tm_map()
    # Solution: remove all non-printable characters in UTF-8 with this line
    words <- iconv(words, "ASCII", "UTF-8", sub = "byte")
    # Build the corpus AFTER the conversion, then drop the unwanted words
    words.corpus <- Corpus(VectorSource(words))
    words.corpus <- tm_map(words.corpus, removeWords, remove_words)
    wc <- wordcloud(words = words.corpus, max.words = max_words,
                    random.order = FALSE,
                    colors = brewer.pal(n_colors, palette),
                    random.color = FALSE,
                    scale = c(5.5, .5), rot.per = 0.35) %>% recordPlot
    return(wc)
}
Here's what failed:
I tried to convert the text BEFORE and AFTER creating the corpus with
words.corpus <- Corpus(VectorSource(words))
BEFORE:
Converting to UTF-8 on the text didn't work with:
words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))
nor
for (i in 1:length(words)) {
    Encoding(words[[i]]) <- "UTF-8"
}
AFTER:
Converting to UTF-8 on the corpus didn't work with:
words.corpus <- tm_map(words.corpus, removeWords, remove_words)
nor
words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))
nor
words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))
nor
words.corpus <- tm_map(words.corpus, enc2utf8)
nor
words.corpus <- tm_map(words.corpus, tolower)
All of these solutions may have worked at some point in time, so I don't want to discredit their authors; they may work again in the future. But it is almost impossible to say why they didn't work here, because there were good reasons to expect them to.
Anyway, just remember to convert the text before creating the corpus with:
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
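In sketch form, the working order is as follows (raw_text is an illustrative name for your input vector):
words <- iconv(raw_text, "ASCII", "UTF-8", sub = "byte")            # 1. clean the raw text first
words.corpus <- Corpus(VectorSource(words))                         # 2. only then build the corpus
words.corpus <- tm_map(words.corpus, content_transformer(tolower))  # 3. transformations now succeed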
Disclaimer:
I got the solution with more detailed explanation here:
http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/
I ended up updating my RStudio and packages. This seemed to solve the tolower/content_transformer issues. I read somewhere that the latest tm package had some issues with tm_map, so maybe that was the problem. In any case, this worked!
Instead of
corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)
use
corp <- tm_map(corp, tolower, mc.cores=1)
While using code similar to that above for a wordcloud Shiny app, which ran fine on my own PC but worked on neither Amazon AWS nor shinyapps.io, I discovered that text containing accents (e.g. santé) did not upload well as csv files to the cloud. I solved it by saving the files as .txt files in UTF-8 using Notepad and rewriting my code to account for the files no longer being csv but txt. My R version was 3.2.1 and my RStudio was version 0.99.465.
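A hedged sketch of that txt-based read (the file name is illustrative):
accented_text <- readLines("mytext.txt", encoding = "UTF-8")  # read the UTF-8 .txt saved from Notepad
corpus <- Corpus(VectorSource(accented_text))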
Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8"), even though Encoding() told me it was already UTF-8.