tm combine list of corpora - r

I have a list of URLs for which I have fetched the web content and wrapped each page into a tm corpus:
library(tm)
library(XML)
link <- c(
"http://www.r-statistics.com/tag/hadley-wickham/",
"http://had.co.nz/",
"http://vita.had.co.nz/articles.html",
"http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
"http://www.analyticstory.com/hadley-wickham/"
)
create.corpus <- function(url.name){
  doc <- htmlParse(url.name)
  parag <- xpathSApply(doc, '//p', xmlValue)
  if (length(parag) == 0){
    parag <- "empty"
  }
  cc <- Corpus(VectorSource(parag))
  meta(cc, "link") <- url.name
  return(cc)
}
link=catch$url
cc <- lapply(link, create.corpus)
This gives me a "large list" of corpora, one for each URL.
Combining them one by one works:
x=cc[[1]]
y=cc[[2]]
z=c(x,y,recursive=T) # preserved metadata
x;y;z
# A corpus with 8 text documents
# A corpus with 2 text documents
# A corpus with 10 text documents
But this becomes unfeasible for a list with a few thousand corpora.
So how can a list of corpora be merged into one corpus while maintaining the meta data?

You can use do.call to call c:
do.call(function(...) c(..., recursive = TRUE), cc)
# A corpus with 155 text documents

I don't think tm offers any built-in function to join/merge many corpora. But a corpus is, after all, a list of documents, so the question is really how to turn a list of lists into a single list. I would create a new corpus from all the documents, then assign the metadata manually:
y = Corpus(VectorSource(unlist(cc)))
meta(y,'link') = do.call(rbind,lapply(cc,meta))$link
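If you are on a newer tm version, the same idea can be expressed with per-document metadata, which survives the combine on its own. This is only a minimal sketch, assuming tm >= 0.6 (VCorpus and document-level meta) and the link vector defined in the question:
create.corpus2 <- function(url.name){
  doc <- htmlParse(url.name)
  parag <- xpathSApply(doc, '//p', xmlValue)
  if (length(parag) == 0) parag <- "empty"
  cc <- VCorpus(VectorSource(parag))
  # store the source URL on every document instead of on the corpus
  tm_map(cc, function(d) { meta(d, "link") <- url.name; d })
}
cc <- lapply(link, create.corpus2)
big <- do.call(c, cc)    # c() concatenates corpora and keeps document metadata
meta(big[[1]], "link")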

Your code does not work because catch is not defined, so I don't know exactly what that is supposed to do.
But nowadays tm corpora can simply be combined with c() to make one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine
So maybe c(unlist(cc)) would work. I have no way to test whether it does, though, because your code doesn't run.

Related

stri_replace_all_fixed slow on big data set - is there an alternative?

I'm trying to stem ~4000 documents in R using the stri_replace_all_fixed function. However, it is VERY slow, since my dictionary of stemmed words consists of approx. 300k words. I am doing this because the documents are in Danish and the Porter stemming algorithm is therefore not useful (it is too aggressive).
I have posted the code below. Does anyone know an alternative for doing this?
Logic: Look at each word in each document -> If word = word from voc-table, then replace with tran-word.
##Read in the dictionary
voc <- read.table("danish.csv", header = TRUE, sep=";")
#Using the library 'stringi' to make the stemming
library(stringi)
#Split the voc corpus and put the word and stem column into different corpus
word <- Corpus(VectorSource(voc))[1]
tran <- Corpus(VectorSource(voc))[2]
#Using stri_replace_all_fixed to stem words
## !! NOTE THAT THE FOLLOWING STEP MIGHT TAKE A FEW MINUTES DEPENDING ON THE SIZE !! ##
docs <- tm_map(docs, function(x) stri_replace_all_fixed(x, word, tran, vectorize_all = FALSE))
Structure of "voc" data frame:
Word Stem
1 abandonnere abandonner
2 abandonnerede abandonner
3 abandonnerende abandonner
...
313273 åsyns åsyn
To make dictionary matching fast, you need to implement a clever data structure such as a prefix tree; 300,000 separate search-and-replace passes simply do not scale.
I don't think this will be efficient in pure R; you will likely need to write a C or C++ extension. You have many tiny operations there, and the overhead of the R interpreter will kill you when trying to do this in pure R.
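If you want to stay in plain R, one possible alternative to 300k fixed replacements is to tokenize each document and replace tokens through a named-vector lookup (character subsetting goes through match(), which hashes the names, so each document costs one batched lookup rather than 300k scans). A rough sketch only, assuming voc has the Word and Stem columns shown above and texts is a character vector holding the documents (a hypothetical name):
library(stringi)
stems <- setNames(as.character(voc$Stem), as.character(voc$Word))
stem_doc <- function(txt) {
  # split into word and non-word pieces so the text can be re-assembled afterwards
  tokens <- stri_split_boundaries(txt, type = "word")[[1]]
  # note: the comparison is case-sensitive; lowercase first if the dictionary is lowercase
  hit <- tokens %in% names(stems)
  tokens[hit] <- stems[tokens[hit]]
  stri_c(tokens, collapse = "")
}
stemmed <- vapply(texts, stem_doc, character(1), USE.NAMES = FALSE)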

How to add metadata to tm Corpus object with tm_map

I have been reading different questions/answers (especially here and here) without managing to apply any to my situation.
I have an 11,390-row matrix with attributes id, author, text, such as:
library(tm)
m <- cbind(c("01","02","03","04","05","06"),
c("Author1","Author2","Author2","Author3","Author3","Auhtor4"),
c("Text1","Text2","Text3","Text4","Text5","Text6"))
I want to create a tm corpus out of it. I can quickly create my corpus with
tm_corpus <- Corpus(VectorSource(m[,3]))
which for my 11,390-row matrix completes in
user system elapsed
2.383 0.175 2.557
But then when I try to add metadata to the corpus with
meta(tm_corpus, type="local", tag="Author") <- m[,2]
the execution time is over 15 minutes and counting (at which point I stopped execution).
According to the discussion here, there is a good chance of significantly decreasing the processing time by handling the corpus with tm_map; something like
tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])
Still I am not sure how to do this. Probably it is going to be something like
addMeta <- function(text, vector) {
meta(text, tag="Author") = vector[??]
text
}
For one thing, how do I pass to tm_map a vector of values to be assigned to each text of the corpus? Should I call the function from within a loop? Should I enclose the tm_map call within vapply?
Have you already tried the excellent readTabular?
## your sample data
matrix <- cbind(c("01","02","03","04","05","06"),
c("Author1","Author2","Author2","Author3","Author3","Auhtor4"),
c("Text1","Text2","Text3","Text4","Text5","Text6"))
## simple transformations
matrix <- as.data.frame(matrix)
names(matrix) <- c("id", "author", "content")
Now your ex-matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular wants you to define a reader, which itself takes a mapping. In your mapping, "content" points to the text data and the other names - well - to metadata.
## define myReader, which will be used in creation of Corpus
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))
Now the creation of the Corpus is same as always, apart from small changes:
## create the corpus
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus,
readerControl = list(reader=myReader))
Now have a look at the content and meta data of the first items:
lapply(tm_corpus, as.character)
lapply(tm_corpus, meta)
## output just as expected.
This should be fast, as it is part of the package and extremely adaptable. In my own project I am using this on a data.table with some 20 variables - it works like a charm.
However, I cannot provide a benchmark against the answer you have already accepted; I simply guess that this is faster and more efficient.
Yes, tm_map is faster and is the way to go. You should use it here with a global counter.
auths <- paste0('Author', seq(11390))
i <- 0
tm_corpus <- tm_map(tm_corpus, function(x) {
  i <<- i + 1
  meta(x, "Author") <- m[i,2]
  x
})
Since readTabular from the tm package has been deprecated, the solution might now look like this:
matrix <- cbind(c("Author1","Author2","Author2","Author3","Author3","Auhtor4"),
c("Text1","Text2","Text3","Text4","Text5","Text6"))
matrix <- as.data.frame(matrix)
names(matrix) <- c("doc_id", "text")
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus)
inspect(tm_corpus)
meta(tm_corpus)
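If you also need the author metadata from the original question, note (this relies on the documented behaviour of DataframeSource in tm >= 0.7, an assumption about your tm version) that every column other than doc_id and text is imported as document-level metadata:
df <- data.frame(doc_id = c("01","02","03"),
                 text   = c("Text1","Text2","Text3"),
                 author = c("Author1","Author2","Author2"),
                 stringsAsFactors = FALSE)
tm_corpus <- VCorpus(DataframeSource(df))
meta(tm_corpus[[1]])   # shows the author field for the first document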

R text mining documents from CSV file (one row per doc)

I am trying to work with the tm package in R and have a CSV file of customer feedback, with each line being a different instance of feedback. I want to import all of this feedback into a corpus, but I want each line to be a separate document within the corpus so that I can compare the feedback in a document-term matrix. There are over 10,000 rows in my data set.
Originally I did the following:
fdbk_corpus <- Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")
This creates a corpus with 1 document and >10,000 rows, and I want >10,000 docs with 1 row each.
I imagine I could just have 10,000+ separate CSV or TXT documents within a folder and create a corpus from that... but I'm thinking there is a much simpler answer than that, reading each line as a separate document.
Here's a complete workflow to get what you want:
# change this file location to suit your machine
file_loc <- "C:\\Documents and Settings\\Administrator\\Desktop\\Book1.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
dtm <- DocumentTermMatrix(corp)
In the dtm object each row will be a doc, or a line of your original CSV file. Each column will be a word.
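One caveat, depending on your tm version: in tm >= 0.7 DataframeSource insists on columns named doc_id and text, so the read.csv result usually needs to be reshaped first. A sketch along those lines (the column holding the feedback text is an assumption about your CSV):
x <- read.csv(file_loc, header = TRUE, stringsAsFactors = FALSE)
feedback <- data.frame(doc_id = as.character(seq_len(nrow(x))),
                       text   = x[[1]],   # assuming the feedback text is in the first column
                       stringsAsFactors = FALSE)
corp <- VCorpus(DataframeSource(feedback))
dtm <- DocumentTermMatrix(corp)   # one row per CSV line, one column per term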
You can also use TermDocumentMatrix() on your fdbk object to obtain a term-document matrix where each column represents one piece of customer feedback (with DocumentTermMatrix() you get one row per feedback instead).

How can I manually set the document id in a corpus?

I am creating a Corpus from a dataframe. I pass it as a VectorSource since there is only one column I want to use as the text source. This works fine; however, I need the document ids within the corpus to match the document ids from the dataframe. The document ids are stored in a separate column of the original dataframe.
df <- as.data.frame(t(rbind(c(1,3,5,7,8,10),
c("text", "lots of text", "too much text", "where will it end", "give peas a chance","help"))))
colnames(df) <- c("ids","textColumn")
library("tm")
library("lsa")
corpus <- Corpus(VectorSource(df[["textColumn"]]))
Running this code creates a corpus however the document ids run from 1-6. Is there any way of creating the corpus with the document ids 1,3,5,7,8,10?
I know it's probably late for #user1098798, but there is a way to specify the ids directly when creating the corpus. You need to load the data as a DataframeSource() and add a mapping to the columns:
corpus <- VCorpus(DataframeSource(df),
                  readerControl = list(reader = readTabular(mapping = list(content = "textColumn", id = "ids"))))
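Since readTabular was removed in later tm releases, a present-day equivalent (assuming tm >= 0.7, where DataframeSource uses the doc_id column as the document id) would be to rename the columns before building the corpus:
df2 <- data.frame(doc_id = as.character(df$ids),
                  text   = as.character(df$textColumn),
                  stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(df2))
meta(corpus[[1]], "id")   # "1"
names(corpus)             # "1" "3" "5" "7" "8" "10"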
Well, one simple but not very elegant way to assign your ids to your documents afterwards could be the following:
for (i in 1:length(corpus)) {
  attr(corpus[[i]], "ID") <- df$ids[i]
}
Here is a qdap approach to this problem that can handle it without the loop:
Use qdap version >= 1.1.0 right from the get go to convert the dataframe to a Corpus and the ID tags will be automatically added.
with(df, as.Corpus(textColumn, ids))
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 3
## Content: documents: 6
## Look around a bit
meta(with(df, as.Corpus(textColumn, ids)), tag="id")
inspect(with(df, as.Corpus(textColumn, ids)))

R text mining package: Allowing to incorporate new documents into an existing corpus

I was wondering if there is any chance of R's text mining package having the following feature:
myCorpus <- Corpus(DirSource(<directory-contatining-textfiles>),control=...)
# add docs
myCorpus.addDocs(DirSource(<new-dir>),control=...)
Ideally I would like to incorporate additional documents into the existing corpus.
Any help is appreciated
You should be able to just use c(), as in
> library(tm)
> data("acq")
> data("crude")
> together <- c(acq,crude)
> acq
A corpus with 50 text documents
> crude
A corpus with 20 text documents
> together
A corpus with 70 text documents
You can find more in the tm package documentation under tm_combine.
I ran into this issue as well, in the context of big text mining data sets where it was not possible to load the entire data set at once.
Another option for such big data sets is to collect single-document corpora in a vector inside a loop. After processing all documents this way, the vector can be converted into one huge corpus, e.g. to create a DTM from it.
# Vector to collect the corpora:
webCorpusCollection <- c()
# Loop over raw data:
for(i in ...) {
  try({
    # Convert one document into a corpus:
    webDocument <- Corpus(VectorSource(iconv(webDocuments[i,1], "latin1", "UTF-8")))
    #
    # Do other things e.g. preprocessing...
    #
    # Store this document into the corpus vector:
    webCorpusCollection <- rbind(webCorpusCollection, webDocument)
  })
}
# Collecting done. Create one huge corpus:
webCorpus <- Corpus(VectorSource(unlist(webCorpusCollection[,"content"])))
