I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are correctly found, but there are a couple that are missing. Now, I would love to post here a minimal working example, but the problem is: there is one of the words ("insolvency", so not a short word as in the problem here) in a document of 32 pages which is missed. Now, this word is actually in page 7 of the text. But if I reduce my text with text <- text[7], then DocumentTermMatrix actually finds it! So I am not able to reproduce this with a minimal working example...
Do you have any ideas?
Below a sketch of my script:
library(fastpipe)
library(openxlsx)
library(tm)
`%>>%` <- fastpipe::`%>>%`
source("cleanText.R") # Custom function to clean up the text from reports
keywords_xlsx <- read.xlsx(paste0(getwd(),"/Keywords.xlsx"),
sheet = "all",
startRow = 1,
colNames = FALSE,
skipEmptyRows = TRUE,
skipEmptyCols = TRUE)
keywords <- keywords_xlsx[1] %>>%
tolower(as.character(.[,1]))
# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))
# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]])
text <- cleanText(report, text) # This is a custom function to clean up the texts
# text <- text[7] # If I do this, my word is found! Otherwise it is missed
# Create a corpus
text_corpus <- Corpus(VectorSource(text))
matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
list(dictionary = keywords,
list(wordLengths=c(1, Inf))
)
))))
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
The problem lies in your use of inspect. Only use inspect to check whether your code is working and to see if a dtm has any values. Never use inspect inside functions / transformations, because by default inspect only shows the first 10 rows and 10 columns of a document term matrix, so any keyword outside that window is silently dropped from the result.
Also if you want to transpose the outcome of a dtm, use TermDocumentMatrix.
Your last line should be:
mat <- as.matrix(TermDocumentMatrix(text_corpus,
control = list(dictionary = keywords,
wordLengths = c(1, Inf))))
Note that turning a dtm / tdm into a matrix will use a lot more memory than having the data inside a sparse matrix.
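If memory is a concern, the word counts can be taken from the sparse object directly. Below is a minimal sketch, assuming the slam package (which tm already depends on) is available; the two toy "pages" and the keyword list are made up for illustration and stand in for the cleaned report text.

```r
library(tm)
library(slam)  # sparse-matrix helpers; tm already depends on it

# Toy stand-ins for the cleaned report pages and the keyword list (assumptions)
text <- c("insolvency risk increased this year",
          "liquidity improved and insolvency was avoided")
keywords <- c("insolvency", "liquidity")

text_corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(text_corpus,
                          control = list(dictionary = keywords,
                                         wordLengths = c(1, Inf)))

# Sum the counts per term on the sparse matrix: no as.matrix(), no inspect()
words <- sort(slam::row_sums(tdm), decreasing = TRUE)
df <- data.frame(word = names(words), freq = as.numeric(words))
```

Because nothing is printed or truncated along the way, every keyword in the dictionary ends up in df, however many documents or terms there are.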
I'm trying to run structural topic models (using stm package) on the document-term matrix that was prepared using tm package.
I built a corpus in tm package that contains the following metadata:
library(tm)
myReader2 <- readTabular(mapping=list(content="text", id="id", sentiment = "sentiment"))
text_corpus2 <- VCorpus(DataframeSource(bin_stm_df), readerControl = list(reader = myReader2))
meta(text_corpus2[[1]])
id : 11
sentiment: negative
language : en
After doing some text cleaning and saving the result as clean_corpus2 (metadata still present), I convert it to a document-term matrix and then read it in as an stm-compatible object:
library(stm)
chat_DTM2 <- DocumentTermMatrix(clean_corpus2, control = list(wordLengths = c(3, Inf)))
DTM2 <- removeSparseTerms(chat_DTM2 , 0.990)
DTM_st <-readCorpus(DTM2, type = "slam")
So far, so good. However, when I try to specify the metadata using stm-compatible data, the metadata is gone:
docsTM <- DTM_st$documents # works fine
vocabTM <- DTM_st$vocab # works fine
metaTM <- DTM_st$meta # returns NULL
> metaTM
NULL
How do I keep the metadata from tm-generated Corpus in stm-compatible document-term matrix? Any suggestions welcome, thanks.
How about trying the quanteda package?
Without the ability to access your object, I cannot guarantee this works verbatim, but it should:
library("quanteda")
# creates the corpus with document variables except for the "text"
text_corpus3 <- corpus(bin_stm_df, text_field = "text")
# convert to document-feature matrix - cleaning options can be added
# see ?tokens
chat_DTM3 <- dfm(text_corpus3)
# similar to tm::removeSparseTerms()
DTM3 <- dfm_trim(chat_DTM3, sparsity = 0.990)
# convert to STM format
DTM_st <- convert(DTM3, to = "stm")
# then it's all there
docsTM <- DTM_st$documents
vocabTM <- DTM_st$vocab
metaTM <- DTM_st$meta # should return the data.frame of document variables
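If you would rather stay within tm, the metadata can be pulled out of the corpus and passed to stm separately. The following is only a sketch under assumptions: the data frame is a made-up stand-in for bin_stm_df, and it relies on the behaviour of recent tm versions, where DataframeSource expects doc_id and text columns and keeps the remaining columns as document-level metadata.

```r
library(tm)

# Hypothetical stand-in for bin_stm_df (doc_id and text columns are assumed)
bin_stm_df <- data.frame(
  doc_id    = c("11", "12", "13"),
  text      = c("the service was terrible",
                "great support thank you",
                "terrible waiting times"),
  sentiment = c("negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(bin_stm_df))
dtm  <- DocumentTermMatrix(corp, control = list(wordLengths = c(3, Inf)))

# Extra data-frame columns are stored as document-level metadata
meta_df <- meta(corp)
meta_df$doc_id <- unname(sapply(corp, function(d) meta(d, "id")))

# Keep the metadata rows in the same order as the documents in the DTM
metaTM <- meta_df[match(Docs(dtm), meta_df$doc_id), , drop = FALSE]
```

metaTM can then be supplied as the data argument when fitting the model. If documents are dropped later on (for example by stm::prepDocuments, which accepts a meta argument), the metadata has to be subset in the same way.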
I have two corpora that contain similar words, similar enough that using setdiff doesn't really help my cause. So I've turned towards finding a way to extract a list or corpus (to eventually make a wordcloud) of words that are more frequent in corpus #1 than in corpus #2, presumably above some threshold (maybe 50% more frequent?).
This is everything I have right now:
> install.packages("tm")
> install.packages("SnowballC")
> install.packages("wordcloud")
> install.packages("RColorBrewer")
> library(tm)
> library(SnowballC)
> library(wordcloud)
> library(RColorBrewer)
> UKDraft = read.csv("UKDraftScouting.csv", stringsAsFactors=FALSE)
> corpus = Corpus(VectorSource(UKDraft$Report))
> corpus = tm_map(corpus, tolower)
> corpus = tm_map(corpus, PlainTextDocument)
> corpus = tm_map(corpus, removePunctuation)
> corpus = tm_map(corpus, removeWords, c("strengths", "weaknesses", "notes", "kentucky", "wildcats", stopwords("english")))
> frequencies = DocumentTermMatrix(corpus)
> allReports = as.data.frame(as.matrix(frequencies))
> SECDraft = read.csv("SECMinusUKDraftScouting.csv", stringsAsFactors=FALSE)
> SECcorpus = Corpus(VectorSource(SECDraft$Report))
> SECcorpus = tm_map(SECcorpus, tolower)
> SECcorpus = tm_map(SECcorpus, PlainTextDocument)
> SECcorpus = tm_map(SECcorpus, removePunctuation)
> SECcorpus = tm_map(SECcorpus, removeWords, c("strengths", "weaknesses", "notes", stopwords("english")))
> SECfrequencies = DocumentTermMatrix(SECcorpus)
> SECallReports = as.data.frame(as.matrix(SECfrequencies))
So if the word "wingspan" has a 100 count frequency in corpus#2 ('SECcorpus') but 150 count frequency in corpus#1 ('corpus'), we would want that word in our resulting corpus/list.
I can suggest a method that might be more straightforward, based on the new text analysis package I developed with Paul Nulty. It's called quanteda, available on CRAN and GitHub.
I don't have access to your texts, but this will work in a similar fashion for your examples. You create a corpus of your two sets of documents, then add a document variable (using docvars), and then create a document feature matrix grouping on the new document partition variable. The rest of the operations are straightforward, see the code below. Note that by default, dfm objects are sparse Matrixes, but subsetting on features is not yet implemented (next release!).
install.packages("quanteda")
library(quanteda)
# built-in character vector of 57 inaugural addresses
str(inaugTexts)
# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts,
docvars = data.frame(docset = c(rep(1, 29), rep(2, 28))),
notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)
# toLower, removePunct are on by default
inaugDfm <- dfm(inaugCorp,
groups = "docset", # by docset instead of document
ignoredFeatures = c("strengths", "weaknesses", "notes", stopwords("english")),
matrixType = "dense")
# now compare frequencies and trim based on ratio threshold
ratioThreshold <- 1.5
featureRatio <- inaugDfm[2, ] / inaugDfm[1, ]
# to select where set 2 feature frequency is 1.5x set 1 feature frequency
inaugDfmReduced <- inaugDfm[2, featureRatio >= ratioThreshold]
# plot the wordcloud
plot(inaugDfmReduced)
I would recommend you pass through some options to wordcloud() (what plot.dfm() uses), perhaps to restrict the minimum number of features to be plotted.
Very happy to assist with any queries you might have on using the quanteda package.
New
Here's a stab directly at your problem. I don't have your files so cannot verify that it works. Also if your R skills are limited, you might find this challenging to understand; ditto if you have not looked at any of the (sadly limited for now) documentation for quanteda.
I think what you need (based on your comment/query) is the following:
# read in each corpus separately, directly into quanteda
mycorpus1 <- corpus(textfile("UKDraftScouting.csv", textField = "Report"))
mycorpus2 <- corpus(textfile("SECMinusUKDraftScouting.csv", textField = "Report"))
# assign docset variables to each corpus as appropriate
docvars(mycorpus1, "docset") <- 1
docvars(mycorpus2, "docset") <- 2
myCombinedCorpus <- mycorpus1 + mycorpus2
Then proceed with the dfm step as above, substituting myCombinedCorpus for inaugCorp.
I am updating the answer by @Ken Benoit, as it is several years old and the quanteda package has gone through some major changes in syntax.
The current version should be (April 2017):
str(inaugTexts)
# create a corpus, with a partition variable to represent
# the two sets of texts you want to compare
inaugCorp <- corpus(inaugTexts,
docvars = data.frame(docset = c(rep(1, 29), rep(2, 29))),
notes = "Example made for stackoverflow")
# summarize the corpus
summary(inaugCorp, 5)
inaugDfm <- dfm(inaugCorp,
groups = "docset", # by docset instead of document
remove = c("<p>", "http://", "www", stopwords("english")),
remove_punct = TRUE,
remove_numbers = TRUE,
stem = TRUE)
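The ratio step itself does not depend on any package. Here is a minimal base R sketch of it, assuming you already have one named vector of term counts per document set; the counts below are invented, and align() is a hypothetical helper, not part of quanteda or tm.

```r
# Hypothetical term counts per document set (e.g. column sums of each DTM)
freq1 <- c(wingspan = 150, athletic = 40, motor = 12)
freq2 <- c(wingspan = 100, athletic = 45, rebound = 20)

# Put both count vectors on a shared vocabulary, filling absent terms with 0
vocab <- union(names(freq1), names(freq2))
align <- function(freq, vocab) {
  out <- setNames(numeric(length(vocab)), vocab)
  out[names(freq)] <- freq
  out
}
f1 <- align(freq1, vocab)
f2 <- align(freq2, vocab)

# Keep terms that are at least 1.5x as frequent in set 1 as in set 2;
# terms absent from set 2 give a ratio of Inf and are kept as well
ratioThreshold <- 1.5
featureRatio <- f1 / f2
overrepresented <- f1[featureRatio >= ratioThreshold]
```

The names and counts in overrepresented can be passed straight to wordcloud(). If the two sets differ a lot in size, it is probably fairer to divide each vector by its sum first, so that the ratio compares relative rather than raw frequencies.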
Trying to create wordcloud from twitter data, but get the following error:
Error in FUN(X[[72L]], ...) :
invalid input '������������❤������������ "#xxx:bla, bla, bla... http://t.co/56Fb78aTSC"' in 'utf8towcs'
This error appears after running the "mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, tolower)" code:
mytwittersearch_list <-sapply(mytwittersearch, function(x) x$getText())
mytwittersearch_corpus <-Corpus(VectorSource(mytwittersearch_corpus_list))
mytwittersearch_corpus<-tm_map(mytwittersearch_corpus, tolower)
mytwittersearch_corpus<-tm_map( mytwittersearch_corpus, removePunctuation)
mytwittersearch_corpus <-tm_map(mytwittersearch_corpus, function(x) removeWords(x, stopwords()))
I read on other pages this may be due to R having difficulty processing symbols, emoticons and letters in non-English languages, but this appears not to be the problem with the "error tweets" that R has issues with. I did run the codes:
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, function(x) iconv(enc2utf8(x), sub = "byte"))
mytwittersearch_corpus<- tm_map(mytwittersearch_corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "bytes")))
These do not help. I also get that it can't find function content_transformer even though the tm-package is checked off and running.
I'm running this on OS X 10.6.8 and using the latest RStudio.
I use this code to get rid of the problem characters:
tweets$text <- sapply(tweets$text,function(row) iconv(row, "latin1", "ASCII", sub=""))
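To make it clearer what that call does, here is a small sketch on made-up tweets. The data frame and the "UTF-8" source encoding are assumptions for the sake of a self-contained example (the line above declares the input as "latin1"); the point is only that sub = "" drops whatever cannot be represented in ASCII.

```r
# Made-up tweets; the \u escapes create UTF-8 strings with an accent and a heart symbol
tweets <- data.frame(text = c("Caf\u00e9 time \u2764", "Plain ASCII tweet"),
                     stringsAsFactors = FALSE)

# Drop every byte that cannot be represented in ASCII
tweets$text <- iconv(tweets$text, from = "UTF-8", to = "ASCII", sub = "")

# The cleaned strings are plain ASCII, so tolower() has nothing to choke on
lowered <- tolower(tweets$text)
```

Note that this throws information away: accented letters and emoji are removed, not transliterated.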
A nice example on creating wordcloud from Twitter data is here. Using the example, and the code below, and passing the tolower parameter while creating the TermDocumentMatrix, I could create a Twitter wordcloud.
library(twitteR)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(ggplot2)
#Collect tweets containing 'new year'
tweets = searchTwitter("new year", n=50, lang="en")
#Extract text content of all the tweets
tweetTxt = sapply(tweets, function(x) x$getText())
#In tm package, the documents are managed by a structure called Corpus
myCorpus = Corpus(VectorSource(tweetTxt))
#Create a term-document matrix from a corpus
tdm = TermDocumentMatrix(myCorpus,control = list(removePunctuation = TRUE,stopwords = c("new", "year", stopwords("english")), removeNumbers = TRUE, tolower = TRUE))
#Convert as matrix
m = as.matrix(tdm)
#Get word counts in decreasing order
word_freqs = sort(rowSums(m), decreasing=TRUE)
#Create data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
#Plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
Have you tried updating tm and using stri_trans_tolower from stringi?
library(twitteR)
library(tm)
library(stringi)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET")
mytwittersearch <- showStatus(551365749550227456)
mytwittersearch_list <- mytwittersearch$getText()
mytwittersearch_corpus <- Corpus(VectorSource(mytwittersearch_list))
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(tolower))
# Error in FUN(content(x), ...) :
# invalid input 'í ½í±…í ¼í¾¯â¤í ¼í¾§í ¼í½œ "#comScore: Nearly half of #Millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56Fb78aTSC"' in 'utf8towcs'
mytwittersearch_corpus <- tm_map(mytwittersearch_corpus, content_transformer(stri_trans_tolower))
inspect(mytwittersearch_corpus)
# <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
#
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# <ed><U+00A0><U+00BD><ed><U+00B1><U+0085><ed><U+00A0><U+00BC><ed><U+00BE><U+00AF><U+2764><ed><U+00A0><U+00BC><ed><U+00BE><U+00A7><ed><U+00A0><U+00BC><ed><U+00BD><U+009C> "#comscore: nearly half of #millennials do at least some of their video viewing from a smartphone or tablet: http://t.co/56fb78atsc"
The above solutions may have worked but not anymore in the newest versions of wordcloud and tm.
This problem almost made me crazy, but I found a solution and want to explain it the best I can to save anyone becoming desperate.
The function which is implicitly called by wordcloud and responsible for throwing the error
Error in FUN(content(x), ...) : in 'utf8towcs'
is this one:
words.corpus <- tm_map(words.corpus, tolower)
which is a shortcut for
words.corpus <- tm_map(words.corpus, content_transformer(tolower))
To provide a reproducible example, here's a function that embeds the solution:
plot_wordcloud <- function(words, max_words = 70, remove_words ="",
n_colors = 5, palette = "Set1")
{
require(dplyr)
require(wordcloud)
require(RColorBrewer) # for brewer.pal()
require(tm) # for tm_map()
# Solution: remove all non-printable characters in UTF-8 with this line
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
# Build the corpus only after the conversion
words.corpus <- Corpus(VectorSource(words))
wc <- wordcloud(words=words.corpus, max.words=max_words,
random.order=FALSE,
colors = brewer.pal(n_colors, palette),
random.color = FALSE,
scale=c(5.5,.5), rot.per=0.35) %>% recordPlot
return(wc)
}
Here's what failed:
I tried to convert the text BEFORE and AFTER creating the corpus with
words.corpus <- Corpus(VectorSource(words))
BEFORE:
Converting to UTF-8 on the text didn't work with:
words <- sapply(words, function(x) iconv(enc2utf8(x), sub = "byte"))
nor
for (i in 1:length(words))
{
Encoding(words[[i]])="UTF-8"
}
AFTER:
Converting to UTF-8 on the corpus didn't work with:
words.corpus <- tm_map(words.corpus, removeWords, remove_words)
nor
words.corpus <- tm_map(words.corpus, content_transformer(stringi::stri_trans_tolower))
nor
words.corpus <- tm_map(words.corpus, function(x) iconv(x, to='UTF-8'))
nor
words.corpus <- tm_map(words.corpus, enc2utf8)
nor
words.corpus <- tm_map(words.corpus, tolower)
All these solutions may have worked at a certain point in time, so I don't want to discredit the authors. They may work some time in the future. But why they didn't work is almost impossible to say because there were good reasons why they were supposed to work.
Anyway, just remember to convert the text before creating the corpus with:
words <- iconv(words, "ASCII", "UTF-8", sub="byte")
Disclaimer:
I got the solution with more detailed explanation here:
http://www.textasdata.com/2015/02/encoding-headaches-emoticons-and-rs-handling-of-utf-816/
I ended up with updating my RStudio and packages. This seemed to solve the tolower/ content_transformer issues. I read somewhere that the last tm-package had some issues with tm_map, so maybe that was the problem. In any case, this worked!
Instead of
corp <- tm_map(corp, content_transformer(tolower), mc.cores=1)
use
corp <- tm_map(corp, tolower, mc.cores=1)
While using code similar to the above in a word cloud Shiny app, which ran fine on my own PC but not on Amazon AWS or shinyapps.io, I discovered that text with accents (e.g. santé) did not upload well as CSV files to the cloud. I solved it by saving the files as .txt in UTF-8 using Notepad and rewriting my code to allow for the fact that the files were no longer CSV but txt. My version of R was 3.2.1 and RStudio was version 0.99.465.
Just to mention, I had the same problem in a different context (nothing to do with tm or Twitter). For me, the solution was iconv(x, "latin1", "UTF-8"), even though Encoding() told me it was already UTF-8.
I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would like to extract newly coined political terms (multiple words, e.g. "struggle-criticism-transformation movement").
This is a two-step question, one regarding my code so far and one regarding how I should go on.
Step 1: To do this, I wanted to identify some common ngrams first. But I get stuck very early on. Here is what I've been doing:
library(tm)
library(RWeka)
a <-Corpus(DirSource("/mycorpora/1965"), readerControl = list(language="lat")) # that dir is full of txt files
summary(a)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
a <- tm_map(a, stemDocument, language = "english")
# everything works fine so far, so I start playing around with what I have
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
inspect(adtm)
findFreqTerms(adtm, lowfreq=10) # find terms with a frequency higher than 10
findAssocs(adtm, "usa",.5) # just looking for some associations
findAssocs(adtm, "china",.5)
# ... and so on, and so forth, all of this works fine
The corpus I load into R works fine with most functions I throw at it. I haven't had any problems creating TDMs from my corpus, finding frequent words, associations, creating word clouds and so on. But when I try to identify ngrams using the approach outlined in the tm FAQ, I'm apparently making some mistake with the tdm-constructor:
# Trigram
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
inspect(tdm)
I get this error message:
Error in rep(seq_along(x), sapply(tflist, length)) :
invalid 'times' argument
In addition: Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
Any ideas? Is "a" not the right class/object? I'm confused. I assume there's a fundamental mistake here, but I'm not seeing it. :(
Step 2: Then I would like to identify ngrams that are significantly overrepresented, when I compare the corpus against other corpora. For example I could compare my corpus against a large standard english corpus. Or I create subsets that I can compare against each other (e.g. Soviet vs. a Chinese Communist terminology). Do you have any suggestions how I should go about doing this? Any scripts/functions I should look into? Just some ideas or pointers would be great.
Thanks for your patience!
I could not reproduce your problem, are you using the latest versions of R, tm, RWeka, etc.?
require(tm)
a <- Corpus(DirSource("C:\\Downloads\\Only1965\\Only1965"))
summary(a)
a <- tm_map(a, removeNumbers)
a <- tm_map(a, removePunctuation)
a <- tm_map(a , stripWhitespace)
a <- tm_map(a, tolower)
a <- tm_map(a, removeWords, stopwords("english"))
# a <- tm_map(a, stemDocument, language = "english")
# I also got it to work with stemming, but it takes so long...
adtm <-DocumentTermMatrix(a)
adtm <- removeSparseTerms(adtm, 0.75)
inspect(adtm)
findFreqTerms(adtm, lowfreq=10) # find terms with a frequency higher than 10
findAssocs(adtm, "usa",.5) # just looking for some associations
findAssocs(adtm, "china",.5)
# Trigrams
require(RWeka)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(a, control = list(tokenize = TrigramTokenizer))
tdm <- removeSparseTerms(tdm, 0.75)
inspect(tdm[1:5,1:5])
And here's what I get
A term-document matrix (5 terms, 5 documents)
Non-/sparse entries: 11/14
Sparsity : 56%
Maximal term length: 28
Weighting : term frequency (tf)
Docs
Terms PR1965-01.txt PR1965-02.txt PR1965-03.txt
†chinese press 0 0 0
†renmin ribao 0 1 1
— renmin ribao 2 5 2
“ chinese people 0 0 0
“renmin ribaoâ€\u009d editorial 0 1 0
etc.
Regarding your step two, here are some pointers to useful starts:
http://quantifyingmemory.blogspot.com/2013/02/mapping-significant-textual-differences.html
http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ and here's his code https://dl.dropboxusercontent.com/u/4713959/Neuchatel/NassrProgram.R
Regarding Step 1, Brian.keng gives a one liner workaround here https://stackoverflow.com/a/20251039/3107920 that solves this issue on Mac OSX - it seems to be related to parallelisation rather than ( the minor nightmare that is ) java setup on mac.
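For readers who cannot follow the link: a workaround that is commonly suggested for this kind of error is to switch off parallel processing before building the matrix. That this is exactly the one-liner in the linked answer is an assumption on my part.

```r
# Assumption: parallel execution inside tm is what trips up the RWeka tokenizer.
# Setting this option before calling TermDocumentMatrix() forces single-core processing.
options(mc.cores = 1)
```

The option stays in effect for the rest of the R session, so it only needs to be set once at the top of the script.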
You may want to explicitly access the functions like this
BigramTokenizer <- function(x) {
RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 3))
}
myTdmBi.d <- TermDocumentMatrix(
myCorpus.d,
control = list(tokenize = BigramTokenizer, weighting = weightTfIdf)
)
Also, some other things that randomly came up.
myCorpus.d <- tm_map(myCorpus.d, tolower) # This does not work anymore
Try this instead
myCorpus.d <- tm_map(myCorpus.d, content_transformer(tolower)) # Make lowercase
In the RTextTools package,
create_matrix(as.vector(C$V2), ngramLength=3) # ngramLength throws an error message.
Further to Ben's answer - I couldn't reproduce this either, but in the past I've had trouble with the plyr package and conflicting dependencies. In my case there was a conflict between Hmisc and ddply. You could try adding this line just prior to the offending line of code:
tryCatch(detach("package:Hmisc"), error = function(e) NULL)
Apologies if this is completely tangential to your problem!