DocumentTermMatrix error on Corpus argument - r

I have the following code:
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)
news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
When I run the DocumentTermMatrix() function, it gives me this error:
Error: inherits(doc, "TextDocument") is not TRUE
Why do I get this error? Are my rows not text documents?
Here is the output upon inspecting corpus_clean:
[[153]]
[1] obama holds technical school model us
[[154]]
[1] oil boom produces jobs bonanza archaeologists
[[155]]
[1] islamic terrorist group expands territory captures tikrit
[[156]]
[1] republicans democrats feel eric cantors loss
[[157]]
[1] tea party candidates try build cantor loss
[[158]]
[1] vehicles materials stored delaware bridges
[[159]]
[1] hill testimony hagel defends bergdahl trade
[[160]]
[1] tweet selfpropagates tweetdeck
[[161]]
[1] blackwater guards face trial iraq shootings
[[162]]
[1] calif man among soldiers killed afghanistan
[[163]]
[1] stocks fall back world bank cuts growth outlook
[[164]]
[1] jabhat alnusra longer useful turkey
[[165]]
[1] catholic bishops keep focus abortion marriage
[[166]]
[1] barbra streisand visits hill heart disease
[[167]]
[1] rand paul cantors loss reason stop talking immigration
[[168]]
[1] israeli airstrike kills northern gaza
Edit: Here is my data:
type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods
Which I retrieve like:
news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)
Edit: Here is the traceback():
> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)
When I evaluate inherits(corpus_clean, "TextDocument") it is FALSE.

It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that functions like tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They instead return plain character vectors, and DocumentTermMatrix doesn't know how to handle a corpus of characters.
So you could change to
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
Or you can run
corpus_clean <- tm_map(corpus_clean, PlainTextDocument)
after all of your non-standard transformations (those not in getTransformations()) are done, just before you create the DocumentTermMatrix. That will ensure all of your documents are stored as PlainTextDocument objects, which should make DocumentTermMatrix happy.
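Putting the two fixes together, here is a minimal sketch of the question's pipeline corrected for tm >= 0.6.0 (my own assembly, not code from the question; only the two non-standard transformations need the content_transformer() wrapper, the built-ins from getTransformations() do not):
library(tm)
trim <- function(x) gsub("^\\s+|\\s+$", "", x)  # strip leading/trailing whitespace
news_corpus <- Corpus(VectorSource(news_raw$text))
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))  # wrapped
corpus_clean <- tm_map(corpus_clean, removeNumbers)                # built-in, no wrapper needed
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))    # wrapped
news_dtm <- DocumentTermMatrix(corpus_clean)  # should no longer error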

I found a way to solve this problem in an article about tm.
An example in which the error follows below:
getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation)
matrix_terms <- DocumentTermMatrix(corpus)
Warning messages:
In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers
This error occurs because DocumentTermMatrix needs a corpus whose documents are text documents, but the previous transformations turn the documents of your corpus into plain character vectors, that is, into a class the function does not accept.
However, if you wrap each transformation in content_transformer() inside the tm_map() call, you don't need any extra command before running TermDocumentMatrix().
The code below changes the class (see second last line) and avoids the error:
getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus
summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class
matrix_term <- DocumentTermMatrix(corpus)

Change this:
corpus_clean <- tm_map(news_corpus, tolower)
To this:
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

This should work.
remove.packages("tm")
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip", repos = NULL)
library(tm)

How to calculate R1 (lexical richness index) in R?

Hi, I need to write a function to calculate R1, which is defined as follows:
R1 = 1 - (F(h) - h^2/(2N))
where N is the number of tokens, h is the Hirsch point, and F(h) is the cumulative relative frequency of the ranks up to that point. Using the quanteda package I managed to calculate the Hirsch point:
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", "The US president accused Palestinians of lacking “appreciation or respect.”")
a3 <- c("We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.", "After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
txt <- list(a, a1, a2, a3)
To create my data I had to chunk each text in an increasing manner; the input is therefore a list of chunked texts within another list.
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)
get_h_point <- function(DATA) {
fn_interp <- approxfun(DATA$rank, DATA$frequency)
fn_root <- function(x) fn_interp(x) - x
uniroot(fn_root, range(DATA$rank))$root
}
s_p <- function(x){split(x,x$group)}
tstat_by <- lapply(Fs, s_p)
h_values <- lapply(tstat_by, vapply, get_h_point, double(1))
To calculate F(h), the cumulative relative frequency up to the Hirsch point, for R1, I need two values: one from Fs$rank and the other from h_values. Consider the first original texts (tstat_by[[1]], tstat_by[[2]], and tstat_by[[3]]) and their respective h_values (h_values[[1]], h_values[[2]], and h_values[[3]]):
fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])), n=1)
fh_txt2 <-tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])), n=1)
...
tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])), n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])), n=1)
[1] 0.75
As you can see, the grouping is the same: the docnames for each chunk of the original character vectors are the same (text1, text2, text3, etc.). My question is how to write a function fh_txt() so that lapply() can be used to calculate F(h) for R1.
Please note that the goal is to write a function to calculate R1; what I've put here is what has been done in this regard so far.
I've simplified your inputs below, and used the groups argument in textstat_frequency() instead of your approach to creating lists of dfm objects.
a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.")
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
dfmat <- c(a, a1, a2) %>%
  tokens() %>%
  dfm()
tstat <- quanteda.textstats::textstat_frequency(dfmat, groups = docnames(dfmat), ties_method = "first")
tstat_by <- split(tstat, tstat$group)
get_h_point <- function(DATA) {
fn_interp <- approxfun(DATA$rank, DATA$frequency)
fn_root <- function(x) fn_interp(x) - x
uniroot(fn_root, range(DATA$rank))$root
}
h_values <- vapply(tstat_by, get_h_point, double(1))
h_values
## text1 text2 text3
## 2.000014 1.500000 2.000024
tstat_by <- lapply(
  names(tstat_by),
  function(x) subset(tstat_by[[x]], cumsum(rank) <= h_values[[x]])
)
do.call(rbind, tstat_by)
## feature frequency rank docfreq group
## 1 the 2 1 1 text1
## 29 the 2 1 1 text2
## 48 the 3 1 1 text3
You didn't specify what you wanted for output, but with this result, you should be able to compute your own either on the list using lapply(), or on the combined data.frame using for instance dplyr.
Created on 2021-04-05 by the reprex package (v1.0.0)
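To finish the job, here is a hypothetical sketch (my own reading of the formula, not code from the answer above) of the final R1 computation, assuming F(h) means the cumulative relative frequency of all terms ranked at or below h, and N is the per-document token count:
# split tstat again, since tstat_by was overwritten by the subsetting step above
get_R1 <- function(tstat_grp, h) {
  N <- sum(tstat_grp$frequency)                            # total tokens N
  Fh <- sum(tstat_grp$frequency[tstat_grp$rank <= h]) / N  # F(h)
  1 - (Fh - h^2 / (2 * N))
}
R1_values <- mapply(get_R1, split(tstat, tstat$group), h_values)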

Package `sentimentr`: how remove emoticons and stopwords before `sentiment_by`

Here is a basic sentiment example. The text data is split into sentences via the get_sentences() function. With sentiment_by() we approximate the sentiment (polarity) of the text for each element of the list (mytext in this example).
E.g. for the example:
library(sentimentr)
mytext <- c(
'do you like it? But I hate really bad dogs',
'I am the best friend.',
'Do you really like it? I\'m not a fan'
)
mytext <- get_sentences(mytext)
sentiment_by(mytext)
I obtained the following result:
element_id word_count sd ave_sentiment
1: 1 10 1.497465 -0.8088680
2: 2 5 NA 0.5813777
3: 3 9 0.284605 0.2196345
Before applying the sentiment function, I would like to remove stop words, numbers, and emoticons from mytext. I figured I could use, e.g.:
library("tm")
tm_map(mytext, removeNumbers)
tm_map(mytext, removeWords, stopwords())
but I obtained:
Error in UseMethod("tm_map", x) :
no applicable method for 'tm_map' applied to an object of class "c('get_sentences',
'get_sentences_character', 'list')"
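One possible workaround (a sketch of my own, not an answer from the thread): tm's transformations like removeNumbers() and removeWords() are plain functions on character vectors, so you can apply them to the raw text before calling get_sentences(), skipping tm_map() and corpora entirely:
library(sentimentr)
library(tm)
mytext <- c(
  'do you like it? But I hate really bad dogs',
  'I am the best friend.',
  'Do you really like it? I\'m not a fan'
)
# clean the raw character vector first, then split into sentences
cleaned <- removeNumbers(mytext)
cleaned <- removeWords(cleaned, stopwords('english'))
cleaned <- stripWhitespace(cleaned)
sentiment_by(get_sentences(cleaned))
Be aware that stopwords('english') contains negators such as "not", which sentimentr relies on as valence shifters, so you may want a custom stopword list instead.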

Use mapply to replace a patterns in a vector with replacements in a vector in tm

Hi there: I'm using the tm package for some text analysis, and I need to substitute a vector of terms with the paired replacement terms from a vector of replacements. The pattern/replacement dictionary looks like this:
#pattern -replacement dictionary
df<-data.frame(replace=c('crude', 'oil', 'price'), with=c('xcrude', 'xoil', 'xprice'))
#load tm
library(tm)
#load crude
data('crude')
I tried this and received an error
tm_map(crude, mapply, gsub, df$replace, df$with)
Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code
Based on this answer you could use stringi and wrap it in content_transformer() to preserve the corpus structure:
library(stringi)
corp <- tm_map(crude, content_transformer(
function(x) {
stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
})
)
Or use multigsub() from qdap:
library(qdap)
corp <- tm_map(crude, content_transformer(
function(x) {
multigsub(df$replace, df$with, x, fixed = FALSE)
})
)
Which gives:
> corp[[1]][1]
"Diamond Shamrock Corp said that\neffective today it had cut its
contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n The reduction brings its posted xprice for West Texas\nIntermediate to
16.00 dlrs a barrel, the copany said.\n \"The xprice reduction today was made in the light of falling\nxoil product xprices
and a weak xcrude xoil market,\" a company\nspokeswoman said.\n
Diamond is the latest in a line of U.S. xoil companies that\nhave
cut its contract, or posted, xprices over the last two
days\nciting weak xoil markets.\n Reuter"
You can then apply other tm functions on the resulting corpus:
> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity : 91%
#Maximal term length: 17
#Weighting : term frequency (tf)

R tm package: utf-8 text

I would like to create a wordcloud for non-English text in UTF-8 (actually, it's in the Kazakh language).
The text is displayed absolutely correctly by the inspect() function of the tm package.
However, when I look at word frequencies, everything is displayed incorrectly:
The problem is that the text is displayed as coded characters instead of words. Cyrillic characters are displayed correctly. Consequently, the wordcloud becomes a complete mess.
Is it possible to assign an encoding to the tm functions somehow? I tried this, but the text on its own is fine; the problem only appears when using the tm package.
Let a sample text be:
Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді.
Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді.
Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық.
Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы.
Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді.
My simple code is this:
(Based on onertipaday.blogspot.com tutorials:)
require(tm)
require(wordcloud)
text<-readLines("text.txt", encoding="UTF-8")
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, tolower)
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
table(ap.d$freq)
1 2
44 4
findFreqTerms(ap.tdm, lowfreq=2)
[1] "<U+04D9>лем" "арман" "еді"
[4] "м<U+04D9><U+04A3>гілік"
Those words should be: "Әлем", арман", "еді", "мәңгілік". They are displayed correctly in inspect(ap.corpus) output.
Highly appreciate any help! :)
The problem comes from the default tokenizer. By default tm uses scan_tokenizer, which loses the encoding (maybe you should contact the maintainer to add an encoding argument):
scan_tokenizer
function (x)
scan(text = x, what = "character", quote = "", quiet = TRUE)
One solution is to provide your own tokenizer when creating the term matrix. I am using strsplit():
scanner <- function(x) strsplit(x," ")
ap.tdm <- TermDocumentMatrix(ap.corpus,control=list(tokenize=scanner))
Then you get the result well encoded:
findFreqTerms(ap.tdm, lowfreq=2)
[1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
Actually, I disagree with agstudy's answer. It does not seem to be a tokenizer problem. I'm using version 0.6.0 of the tm package and your code works just fine for me, except that I had to explicitly set the encoding of your text data to UTF-8 using:
Encoding(text) <- "UTF-8"
Below is the complete piece of reproducible code. Just make sure you save it in a file with UTF-8 encoding, and use source() to run it; do not use source.with.encoding(), which will throw an error.
text <- "Ол арман – әлем елдерімен терезесі тең қатынас құрып, әлем картасынан ойып тұрып орын алатын Тәуелсіз Мемлекет атану еді. Ол арман – тұрмысы бақуатты, түтіні түзу ұшқан, ұрпағы ертеңіне сеніммен қарайтын бақытты Ел болу еді. Біз армандарды ақиқатқа айналдырдық. Мәңгілік Елдің іргетасын қаладық. Мен қоғамда «Қазақ елінің ұлттық идеясы қандай болуы керек?» деген сауал жиі талқыға түсетінін көріп жүрмін. Біз үшін болашағымызға бағдар ететін, ұлтты ұйыстырып, ұлы мақсаттарға жетелейтін идея бар. Ол – Мәңгілік Ел идеясы. Тәуелсіздікпен бірге халқымыз Мәңгілік Мұраттарына қол жеткізді."
Encoding(text)
# [1] "unknown"
Encoding(text) <- "UTF-8"
Encoding(text)
# [1] "UTF-8"
ap.corpus <- Corpus(DataframeSource(data.frame(text)))
ap.corpus <- tm_map(ap.corpus, removePunctuation)
ap.corpus <- tm_map(ap.corpus, content_transformer(tolower))
content(ap.corpus[[1]])
ap.tdm <- TermDocumentMatrix(ap.corpus)
ap.m <- as.matrix(ap.tdm)
ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
ap.d <- data.frame(word = names(ap.v),freq=ap.v)
print(table(ap.d$freq))
# 1 2 3
# 62 5 1
print(findFreqTerms(ap.tdm, lowfreq=2))
# [1] "арман" "біз" "еді" "әлем" "идеясы" "мәңгілік"
It worked for me, hope it does for you too.

How to read data from a csv table in R?

I have a csv file, and I want to extract each column as a string so I can use it with the getSymbols() function from the quantmod package.
The csv file looks like this:
AEGR,Aegerion Pharmaceuticals Inc
AKS,AK Steel Holding Corp
ALXA‎,Alexza Pharmaceuticals Inc
CCL‎,Carnival Corporation
CECO‎,Career Education Corp
CDXS‎,Codexis Inc
And I use this code to read the file:
data<-read.csv(file='CAPM/allquotes.csv',header=F)
symbols=gettext(data[,1])
symbol.names=gettext(data[,2])
getSymbols(symbols)
I get this error:
Error in download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot open URL 'http://chart.yahoo.com/table.csv?s=ALXA‎&a=0&b=01&c=2007&d=5&e=16&f=2012&g=d&q=q&y=0&z=ALXA‎&x=.csv'
In addition: Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m, : cannot open: HTTP status was '404 Not Found'
When I enter the symbols one by one it works fine. I've also noticed that when I move to the end of the last line, the margins seem corrupted: in the console, the end of the line sits a few spaces further to the right than it should (you can tell from the position of the initial parenthesis).
Your csv has hidden characters in it, namely a left-to-right mark. Since you are using RStudio, you can remove it with gsub(), using "\016" as the value for the pattern argument. Alternatively, instead of removing the hidden character that you don't want, you could keep only the characters that you know you DO want. For example, if your symbols will only contain letters and/or numbers, you could use something like gsub("[^A-Za-z0-9]", "", data[, 1]).
data <- read.csv(text="AEGR,Aegerion Pharmaceuticals Inc
AKS,AK Steel Holding Corp
ALXA‎,Alexza Pharmaceuticals Inc
CCL‎,Carnival Corporation
CECO‎,Career Education Corp
CDXS‎,Codexis Inc", header=FALSE)
#data[, 1] <- gsub("\016", "", data[, 1]) #this should work in RStudio
data[, 1] <- gsub("[^A-Za-z0-9]", "", data[, 1]) #but this should work anywhere
symbols=gettext(data[,1])
getSymbols(symbols, src='yahoo')
After you read.csv, you can examine the data object to see that something is amiss.
s <- as.character(data[, 1])
str(s)
#chr [1:6] "AEGR" "AKS" "ALXA""| __truncated__ "CCL""| __truncated__ "CECO""| __truncated__ "CDXS""| __truncated__
str(s[3])
#chr "ALXA""| __truncated__
charToRaw(s[3])
#[1] 41 4c 58 41 e2 80 8e
# Compare what we have to what we think we have
charToRaw("ALXA")
#[1] 41 4c 58 41
I'm using the Systematic Investor Toolbox, which builds on quantmod. Thanks to GSee, the solution came together this way:
source('SystematicInvestorToolbox.r')
load.packages('quantmod')
dates='2012::2012'
data<-read.csv(file='CAPM/allquotes.csv',header=F,stringsAsFactors=F)
data[, 1] <- gsub("[^A-Za-z0-9]", "", data[, 1])
symbols=gettext(data[,1])
symbol.names=gettext(data[,2])
ia=aa.test.create.ia.custom(symbols,symbol.names,dates)
plot.ia(ia,(1:1))
It's worth noting that the left-to-right marks only appear in 'symbols', not when I extract the characters for the names of the quotes into 'symbol.names'.
Thanks for the help.
