I'm trying to use LDA() from the topicmodels package on quite a large data set. After trying everything to fix the errors "In nr * nc : NAs produced by integer overflow" and "Each row of the input matrix needs to contain at least one non-zero entry", I ended up with this error.
ask<- read.csv('askreddit201508.csv', stringsAsFactors = F)
myDtm <- create_matrix(as.vector(ask$title), language="english", removeNumbers=TRUE, stemWords=TRUE, weighting=weightTf)
myDtm2 = removeSparseTerms(myDtm,0.99999)
myDtm2 <- rollup(myDtm2, 2, na.rm=TRUE, FUN = sum)
rowTotals <- apply(myDtm2 , 1, sum)
myDtm2 <- myDtm2[rowTotals> 0, ]
LDA2 <- LDA(myDtm2,100)
Error in LDA(myDtm2, 100) :
The DocumentTermMatrix needs to have a term frequency weighting
Part of the problem is that you are weighting the document-term matrix by tf-idf, but LDA requires term counts. In addition, this method of removing sparse terms seems to be creating some documents where all terms have been removed.
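To make the second point concrete, here is a minimal base-R sketch. The toy matrix and the `min_docs` threshold are invented for illustration and are not taken from the question's data. It shows how dropping rare terms can leave a document with no terms at all, and how such documents can be dropped before calling LDA():

```r
# Toy count matrix: 4 documents x 4 terms (made-up data for illustration).
counts <- matrix(
  c(2L, 1L, 0L, 0L,
    0L, 3L, 1L, 0L,
    0L, 0L, 0L, 1L,   # doc3 contains only the rare term "zebra"
    1L, 0L, 2L, 0L),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("apple", "bread", "corn", "zebra"))
)

# Document frequency: in how many documents does each term occur?
doc_freq <- colSums(counts > 0)

# Keep terms that occur in at least `min_docs` documents (assumed threshold).
min_docs <- 2
trimmed <- counts[, doc_freq >= min_docs, drop = FALSE]

# Trimming can empty a document; such rows have to go before fitting.
row_totals <- rowSums(trimmed)
cleaned <- trimmed[row_totals > 0, , drop = FALSE]

rownames(cleaned)
```

On a DTM of the size in the question you would want to do the same thing on a sparse representation rather than a dense matrix; the dense version here is only meant to show the logic.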
It is easier to get from your text to topic models using the quanteda package. Here's how:
require(quanteda)
myCorpus <- corpus(textfile("http://homepage.stat.uiowa.edu/~thanhtran/askreddit201508.csv",
textField = "title"))
myDfm <- dfm(myCorpus, stem = TRUE)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 160,707 documents
## ... indexing features: 39,505 feature types
## ... stemming features (English), trimmed 12563 feature variants
## ... created a 160707 x 26942 sparse dfm
## ... complete.
# remove infrequent terms: see http://stats.stackexchange.com/questions/160539/is-this-interpretation-of-sparsity-accurate/160599#160599
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.99999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
## Features occurring in fewer than 1.60707 documents: 12579
nfeature(myDfm2)
## [1] 14363
# fit the LDA model
require(topicmodels)
LDA2 <- LDA(quantedaformat2dtm(myDfm2), 100)
all.dtm <- DocumentTermMatrix(corpus,
control = list(weighting=weightTf)) ; inspect(all.dtm)
tpc.mdl.LDA <- LDA(all.dtm ,k=the.number.of.topics)
I have this issue when I run this chunk of code
text_lda <- LDA(text_dtm, k = 2, method = "VEM", control = NULL)
I get the following error: "Each row of the input matrix needs to contain at least one non-zero entry"
Then I tried to solve this with these lines
rowTotals <- apply(text_dtm, 1, sum)
empty.rows <- text_dtm[rowTotals == 0, ]$dimnames[1][[1]]
But then I got the following error:
cannot allocate vector of size 3890.8 GB
This is the size of my DTM:
DocumentTermMatrix documents: 1968850, terms: 265238
Non-/sparse entries: 29766814/522184069486
Sparsity : 100%
Maximal term length: 4000
Weighting : term frequency (tf)
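For what it's worth, the 3890.8 GB figure in the error can be reproduced from the dimensions printed above. This is a back-of-the-envelope check, assuming that apply() first coerces the DTM to a dense matrix of 8-byte doubles:

```r
# Dimensions as printed for the DocumentTermMatrix above.
n_docs  <- 1968850
n_terms <- 265238

# Both values are doubles here, so the product does not overflow.
cells <- n_docs * n_terms

# Sanity check against the printout: non-sparse + sparse entries.
cells == 29766814 + 522184069486

# A dense numeric matrix stores every cell as an 8-byte double.
gib <- cells * 8 / 2^30
round(gib, 1)
```

So the request is for a fully dense copy of the matrix, which is why any approach that goes through apply() or as.matrix() on the whole DTM cannot work at this size.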
Try this:
empty.rows <- text_dtm[rowTotals == 0, ]$dimnames[1][[1]]
corpus_new <- corpus[-as.numeric(empty.rows)]
Or use tm to generate the dtm and then:
ui = unique(text_dtm$i)
text_dtm.new = text_dtm[ui,]
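Why the unique(text_dtm$i) trick works: a triplet-style sparse matrix only stores its non-zero cells, so a row index that never shows up in i belongs to an empty document. The sketch below uses a hand-built list that mimics the i/j/v layout; that tm stores its DTM this way and never keeps explicit zeros is an assumption here, and the data is made up.

```r
# Hand-built stand-in for the triplet slots of a DTM (made-up data).
toy_dtm <- list(
  i = c(1L, 1L, 3L, 4L, 4L),   # row (document) index of each stored count
  j = c(1L, 2L, 2L, 1L, 3L),   # column (term) index of each stored count
  v = c(2, 1, 1, 3, 1),        # the counts themselves
  nrow = 5L, ncol = 3L
)

# Rows that appear in i have at least one stored entry.
non_empty <- sort(unique(toy_dtm$i))
empty     <- setdiff(seq_len(toy_dtm$nrow), non_empty)

# Row totals computed from the triplets, without a dense matrix.
row_totals <- numeric(toy_dtm$nrow)
agg <- rowsum(toy_dtm$v, toy_dtm$i)
row_totals[as.integer(rownames(agg))] <- agg[, 1]

empty
row_totals
```

Memory use here scales with the number of stored entries (about 30 million in the question) rather than with documents times terms.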
I’d recommend using the dgCMatrix class for your DTM. The class comes from the widely used Matrix package, which ships with R; it works with topicmodels::LDA and many other NLP packages (textmineR, text2vec, tidytext, etc.), and it has methods that let you work with it as if it were a dense matrix.
library(tm)
library(topicmodels)
library(Matrix)
# grab a character vector of text. Your source may be different
text <- textmineR::nih_sample$ABSTRACT_TEXT
text_corpus <- SimpleCorpus(VectorSource(text))
text_dtm <- DocumentTermMatrix(text_corpus,
control = list(tolower=TRUE,
removePunctuation = TRUE,
removeNumbers= TRUE,
stopwords = TRUE,
sparse=TRUE))
# build a sparse Matrix object directly from the triplet slots of the tm DTM
text_dtm2 <- Matrix::sparseMatrix(i=text_dtm$i,
j=text_dtm$j,
x=text_dtm$v,
dims=c(text_dtm$nrow, text_dtm$ncol),
dimnames = text_dtm$dimnames)
doc_lengths <- Matrix::rowSums(text_dtm2)
text_dtm3 <- text_dtm2[doc_lengths > 0, ]
text_lda <- LDA(text_dtm3, k = 2, method = "VEM", control = NULL)
I'm trying to run the initial steps of this stm tutorial
https://github.com/dondealban/learning-stm
with this dataset, which is a subset of the original one:
http://www.mediafire.com/file/1jk2aoz4ac84jn6/data.csv/file
install.packages("stm")
library(stm)
load("VignetteObjects.RData")
data <- read.csv("C:/data.csv")
head(data)
processed <- textProcessor(data$documents, metadata=data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
poliblogPrevFit <- stm(out$documents, out$vocab, K=4, prevalence=~rating+s(day),
max.em.its=200, data=out$meta, init.type="Spectral",
seed=8458159)
But I keep getting the same error
Error in makeTopMatrix(prevalence, data) : Error creating model matrix.
This could be caused by many things including
explicit calls to a namespace within the formula.
Try a simpler formula.
Could anyone please run it on 64-bit MS Windows with R 3.5.2? I could not find similar errors anywhere.
It seems your problem was that with the sampling you did, you ended up with a factor object with just one level:
> levels(meta$rating)
[1] "Conservative"
Using a variable like this does not make any sense though, as there is no variation between cases. If you use the original data, your code works absolutely fine:
data <- read.csv("https://raw.githubusercontent.com/dondealban/learning-stm/master/data/poliblogs2008.csv")
processed <- textProcessor(data$documents, metadata = data)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
levels(meta$rating)
[1] "Conservative" "Liberal"
poliblogPrevFit <- stm(docs, vocab, K = 4, prevalence = ~rating+s(day),
max.em.its = 200, data = out$meta, init.type = "Spectral",
seed = 8458159)
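The single-level problem described above can be checked before fitting. I have not traced how stm builds its design matrix internally, so the sketch below only uses base R's model.matrix() as a stand-in to show the same kind of failure; the data frame and the helper `constant_factors` are made up for illustration.

```r
# Made-up metadata in which sampling left only one rating category.
meta_bad <- data.frame(
  rating = factor(rep("Conservative", 5)),
  day    = 1:5
)

# model.matrix() cannot build contrasts for a factor with a single level,
# so this call fails; the error message is captured instead of stopping.
result <- tryCatch(
  model.matrix(~ rating + day, data = meta_bad),
  error = function(e) conditionMessage(e)
)
result

# Hypothetical pre-check: names of factor columns with fewer than 2 used levels.
constant_factors <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  few <- vapply(df[is_fac], function(x) nlevels(droplevels(x)) < 2, logical(1))
  names(few)[few]
}

constant_factors(meta_bad)
```

Running something like constant_factors(out$meta) after prepDocuments() would flag a covariate such as rating before it reaches the prevalence formula.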
I have a question regarding LDA in topicmodels in R.
I created a matrix from a data frame, with documents as rows, terms as columns, and the number of occurrences of each term in a document as the values. When I tried to run LDA, I got an error message stating "Error in !all.equal(x$v, as.integer(x$v)) : invalid argument type". The data contains 1675 documents and 368 terms. What can I do to make the code work?
library("tm")
library("topicmodels")
data_matrix <- data %>%
group_by(documents, terms) %>%
tally %>%
spread(terms, n, fill=0)
doctermmatrix <- as.DocumentTermMatrix(data_matrix, weightTf("data_matrix"))
lda_head <- topicmodels::LDA(doctermmatrix, 10, method="Gibbs")
Help is much appreciated!
edit
# Toy Data
documentstoy <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
meta1toy <- c(3,4,1,12,1,2,3,5,1,4,2,1,1,1,1,1)
meta2toy <- c(10,0,10,1,1,0,1,1,3,3,0,0,18,1,10,10)
termstoy <- c("cus","cus","bill","bill","tube","tube","coa","coa","un","arc","arc","yib","yib","yib","dar","dar")
toydata <- data.frame(documentstoy,meta1toy,meta2toy,termstoy)
So I looked inside the code, and apparently the LDA() function only accepts integers as input, so you have to convert your categorical variables as below:
library('tm')
library('topicmodels')
documentstoy <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
meta1toy <- c(3,4,1,12,1,2,3,5,1,4,2,1,1,1,1,1)
meta2toy <- c(10,0,10,1,1,0,1,1,3,3,0,0,18,1,10,10)
toydata <- data.frame(documentstoy,meta1toy,meta2toy)
termstoy <- c("cus","cus","bill","bill","tube","tube","coa","coa","un","arc","arc","yib","yib","yib","dar","dar")
toy_unique = unique(termstoy)
for (i in 1:length(toy_unique)){
A = as.integer(termstoy == toy_unique[i])
toydata[toy_unique[i]] = A
}
lda_head <- topicmodels::LDA(toydata, 10, method="Gibbs")
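The error text in the question quotes the expression `!all.equal(x$v, as.integer(x$v))`. Base R alone is enough to see why that fails on non-integer values: all.equal() returns a description string rather than FALSE, and negating a string is an error. The sketch below also shows one way to build an integer count matrix from long-format data with table(). The toy data and the helper `is_count` are assumptions for illustration; this is not the topicmodels source.

```r
# Long-format toy data (made up): one row per term occurrence.
toy <- data.frame(
  doc  = c(1, 1, 1, 2, 2, 3),
  term = c("cus", "cus", "bill", "bill", "tube", "coa")
)

# table() cross-tabulates into integer counts, documents as rows.
counts <- unclass(table(toy$doc, toy$term))
storage.mode(counts)

# Hypothetical helper: TRUE only when every value is a whole number.
is_count <- function(v) isTRUE(all.equal(v, as.integer(v)))
is_count(as.vector(counts))
is_count(c(0.5, 1.25))

# Negating all.equal() directly fails once it returns a description string.
msg <- tryCatch(
  !all.equal(c(0.5, 1.25), as.integer(c(0.5, 1.25))),
  error = function(e) conditionMessage(e)
)
msg
```

A matrix like counts contains only the term columns, so no document ID or metadata column ends up being treated as a term.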
I am trying to build a user-based recommender in R with the recommenderlab package, but the model never returns any predictions.
My code is:
library("recommenderlab")
# Loading to pre-computed affinity data
movie_data<-read.csv("D:/course/Colaborative filtering/data/UUCF Assignment Spreadsheet_user_row.csv")
movie_data[is.na(movie_data)] <- 0
rownames(movie_data) <- movie_data$X
movie_data$X <- NULL
# Convert it as a matrix
R<-as.matrix(movie_data)
# Convert R into realRatingMatrix data structure
# realRatingMatrix is a recommenderlab sparse-matrix like data-structure
r <- as(R, "realRatingMatrix")
r
rec=Recommender(r[1:nrow(r)],method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
recom <- predict(rec, r["1648"], n=5)
recom
as(recom, "list")
Every time, I get output like this:
as(recom, "list")
$`1648`
character(0)
I am using user-row data from this link:
https://drive.google.com/file/d/0BxANCLmMqAyIQ0ZWSy1KNUI4RWc/view
In that data, column A contains the user ID, and every other column holds the ratings for one movie.
Thanks.
The line of code movie_data[is.na(movie_data)] <- 0 is the source of the error. For realRatingMatrix (unlike the binaryRatingMatrix) the movies that are not rated by the users are expected to be NA values, not zero values. For example, the following code gives the correct predictions:
library("recommenderlab")
movie_data<-read.csv("UUCF Assignment Spreadsheet_user_row.csv")
rownames(movie_data) <- movie_data$X
movie_data$X <- NULL
R<-as.matrix(movie_data)
r <- as(R, "realRatingMatrix")
rec=Recommender(r,method="UBCF", param=list(normalize = "Z-score",method="Cosine",nn=5, minRating=1))
recom <- predict(rec, r["1648"], n=5)
as(recom, "list")
# [[1]]
# [1] "X13..Forrest.Gump..1994." "X550..Fight.Club..1999."
# [3] "X77..Memento..2000." "X122..The.Lord.of.the.Rings..The.Return.of.the.King..2003."
# [5] "X1572..Die.Hard..With.a.Vengeance..1995."
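The effect of zero-filling can be seen without recommenderlab. The base-R sketch below uses a made-up ratings matrix and does not reproduce the package's internals; it only illustrates that once NA is replaced by 0, every movie looks rated, so the user averages shift and nothing is left to recommend.

```r
# Made-up ratings: NA means "not rated".
ratings <- matrix(
  c(5, NA, 4,
    NA, 3, NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("u1", "u2"), c("m1", "m2", "m3"))
)

# With NA kept, averages use only the movies each user actually rated.
mean_na <- rowMeans(ratings, na.rm = TRUE)

# After zero-filling, every movie counts as rated, with a score of 0.
zero_filled <- ratings
zero_filled[is.na(zero_filled)] <- 0
mean_zero <- rowMeans(zero_filled)

# Number of unrated movies per user under each encoding.
unrated_na   <- rowSums(is.na(ratings))
unrated_zero <- rowSums(is.na(zero_filled))

rbind(mean_na, mean_zero, unrated_na, unrated_zero)
```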
Let's do some Text Mining
Here I stand with a term-document matrix (from the tm package):
dtm <- TermDocumentMatrix(
myCorpus,
control = list(
weighting = weightTfIdf,
tolower=TRUE,
removeNumbers = TRUE,
minWordLength = 2,
removePunctuation = TRUE,
stopwords=stopwords("german")
))
When I do a
typeof(dtm)
I see that it is a "list" and the structure looks like
Docs
Terms 1 2 ...
lorem 0 0 ...
ipsum 0 0 ...
... .......
So I try a
wordMatrix = as.data.frame( t(as.matrix( dtm )) )
That works for 1000 Documents.
But when I try to use 40000 it doesn't anymore.
I get this error:
Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow
So I looked at as.matrix, and it turns out that the function first converts the data to a vector with as.vector and then to a matrix.
The conversion to a vector works, but the one from the vector to the matrix doesn't.
Do you have any suggestions as to what the problem could be?
Thanks, The Captain
Integer overflow tells you exactly what the problem is: with 40000 documents, you have too much data. The problem begins in the conversion to a matrix, by the way, which you can see if you look at the code of the underlying function:
class(dtm)
[1] "TermDocumentMatrix" "simple_triplet_matrix"
getAnywhere(as.matrix.simple_triplet_matrix)
A single object matching ‘as.matrix.simple_triplet_matrix’ was found
...
function (x, ...)
{
nr <- x$nrow
nc <- x$ncol
y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
...
}
This is the line referenced by the error message. What's going on can easily be simulated by:
as.integer(40000 * 60000) # 40000 documents is 40000 rows in the resulting frame
[1] NA
Warning message:
NAs introduced by coercion
The function vector() takes the length as an argument, in this case nr * nc. Because nr and nc are both integers, a product larger than approx. 2.1e9 (.Machine$integer.max) overflows to NA, and NA is not a valid length for vector().
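A small sketch of that arithmetic, using the same 40000 and 60000 as above (the 60000 is only an illustrative column count, not a number from the question):

```r
nr <- 40000L   # integer, like x$nrow
nc <- 60000L   # integer, like x$ncol (illustrative value)

# integer * integer stays integer, so the product overflows to NA.
# R signals this with a warning, which is suppressed here.
cells_int <- suppressWarnings(nr * nc)
is.na(cells_int)

# Converting one operand to double avoids the overflow.
cells_dbl <- as.numeric(nr) * nc
cells_dbl > .Machine$integer.max

# Rough size of the dense result if every cell is an 8-byte double, in GiB.
cells_dbl * 8 / 2^30
```

Even when the length is computed as a double, the dense matrix in this example would need close to 18 GiB of memory.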
Bottom line: you're running into the limits of R. For now, working in 64-bit won't help you. You'll have to resort to different methods. One possibility would be to continue working with the list you have (dtm is a list), select the data you need using list manipulation, and go from there.
PS : I made a dtm object by
require(tm)
data("crude")
dtm <- TermDocumentMatrix(crude,
control = list(weighting = weightTfIdf,
stopwords = TRUE))
Here is a very simple solution I discovered recently:
DTM = t(TDM)  # transpose the term-document matrix (optional; I prefer a DTM over a TDM)
M = as.big.matrix(x = as.matrix(DTM))  # convert the DTM into a bigmemory object using the bigmemory package
M = as.matrix(M)  # convert the bigmemory object back to a regular matrix
M = t(M)  # transpose again to get the TDM back
Please note that taking the transpose of the TDM to get a DTM is entirely optional; it's my personal preference to work with matrices this way.
P.S. I could not answer the question four years ago, as I had only just started college.
Based on Joris Meys's answer, I've found the solution. The vector() documentation says this about the length argument:
...
For a long vector, i.e., length > .Machine$integer.max, it has to be of type "double"...
So we can make a tiny fix of the as.matrix():
as.big.matrix <- function(x) {
nr <- x$nrow
nc <- x$ncol
# nr and nc are integers. 1 is double. Double * integer -> double
y <- matrix(vector(typeof(x$v), 1 * nr * nc), nr, nc)
y[cbind(x$i, x$j)] <- x$v
dimnames(y) <- x$dimnames
y
}
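A quick way to try the idea on something small. The sketch below repeats the function under a hypothetical name, `triplet_to_dense`, so that it does not mask bigmemory's as.big.matrix, and feeds it a hand-built list with the same fields; the toy data is made up.

```r
# Same approach as the function above, under a hypothetical name.
triplet_to_dense <- function(x) {
  # as.numeric() forces a double product, so nrow * ncol cannot overflow to NA
  n_cells <- as.numeric(x$nrow) * x$ncol
  y <- matrix(vector(typeof(x$v), n_cells), x$nrow, x$ncol)
  y[cbind(x$i, x$j)] <- x$v      # fill in the stored (non-zero) cells
  dimnames(y) <- x$dimnames
  y
}

# Hand-built stand-in for a small term-document matrix (made-up data).
toy <- list(
  i = c(1L, 2L, 2L), j = c(1L, 1L, 3L), v = c(2, 1, 4),
  nrow = 2L, ncol = 3L,
  dimnames = list(Terms = c("lorem", "ipsum"), Docs = c("1", "2", "3"))
)

dense <- triplet_to_dense(toy)
dense
```

Note that this only removes the overflow in the length computation. The result is still a fully dense matrix, so it has to fit in memory.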