Problems with k-means clustering using the tm package in R

I have a question about k-means clustering in R. I'm following this article. Everything is based on example data shipped with the tm package, so no data import is required. acq contains 50 documents and crude contains 20 documents.
library(tm)
data("acq")
data("crude")
ws <- c(acq, crude)
wsTDM <- Data(TermDocumentMatrix(ws)) #First problem here
wsKMeans <- kmeans(wsTDM, 2)
wsReutersCluster <- c(rep("acq", 50), rep("crude", 20))
cl_agreement(wsKMeans, as.cl_partition(wsReutersCluster), "diag")
Error in lapply(X, FUN, ...) :
(list) object cannot be coerced to type 'integer'
I actually want to create a cross-agreement matrix. But this article was written in 2008, and a lot has changed since then. A Data function is only available in the RSurvey package, and I doubt it's the same one. I think the main problem is that TermDocumentMatrix used to be an S4 class and is now S3. I know it's possible to do this with the raw text only, but I want to do it this way, since with a TDM it's possible to remove stopwords, punctuation, etc. for better results. So if someone has a solution, that would be terrific.

The TDM is stored as a sparse matrix, as described in ?TermDocumentMatrix. You can also see this by inspecting the object with str(wsTDM). The old Data() function was just a way to access the contents as a regular matrix; it is no longer needed. Just do kmeans(wsTDM, 2) on the TermDocumentMatrix itself and you'll see that the output is as expected, with clusters identified for 2775 observations (terms) on 70 features (documents). Good luck!
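For what it's worth, here is a minimal end-to-end sketch of the cross-agreement comparison, not the article's exact code: it swaps in DocumentTermMatrix (the transpose of the TDM) so that kmeans clusters the 70 documents rather than the 2775 terms, which is what the 70-element label vector needs, and it assumes the clue package provides cl_agreement() and as.cl_partition() as in the question.
library(tm)
library(clue)  # assumed: provides cl_agreement() and as.cl_partition()
data("acq")
data("crude")
ws <- c(acq, crude)
wsDTM <- as.matrix(DocumentTermMatrix(ws))  # documents as rows, so kmeans clusters documents
wsKMeans <- kmeans(wsDTM, 2)
wsReutersCluster <- c(rep("acq", 50), rep("crude", 20))
cl_agreement(wsKMeans, as.cl_partition(wsReutersCluster), "diag")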

Related

How can I create a reproducible example of a SpatRaster (terra)?

For a question that is specific to my particular dataset, how can I make a reproducible example of that dataset if I have it stored as a SpatRaster in R?
The data structure is complex enough that I don't know how to invent a simpler version freehand and read it as a SpatRaster (i.e. x <- rast(????????)).
I also haven't been able to figure out how to use a package or command to extract enough information to provide what is functionally a reproducible example.
See my previous question for an example: How can I add a class name to numeric raster values in a terra SpatRaster?
You can create objects from scratch like this:
library(terra)
r <- rast()
s <- rast(ncols=22, nrows=25, nlyrs=5, xmin=0)
See ?terra::rast for additional arguments you can use and for alternative approaches.
You can also use a file that ships with R. For example:
f <- system.file("ex/elev.tif", package="terra")
r <- rast(f)
You can also create from scratch a new SpatRaster with (mostly) the same properties as an existing one, using what is returned by
as.character(r)
and then recreate it with something like
r <- rast(ncols=95, nrows=90, nlyrs=1, xmin=5.74166666666667, xmax=6.53333333333333, ymin=49.4416666666667, ymax=50.1916666666667, names=c('elevation'), crs='GEOGCRS[\"WGS 84\",DATUM[\"World Geodetic System 1984\",ELLIPSOID[\"WGS 84\",6378137,298.257223563,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],CS[ellipsoidal,2],AXIS[\"geodetic latitude (Lat)\",north,ORDER[1],ANGLEUNIT[\"degree\",0.0174532925199433]],AXIS[\"geodetic longitude (Lon)\",east,ORDER[2],ANGLEUNIT[\"degree\",0.0174532925199433]],ID[\"EPSG\",4326]]')
r <- init(r, "cell")
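As a hedged illustration of that round trip (the eval(parse(...)) step is my suggestion, not something the terra documentation prescribes), using the elev.tif sample file from above:
library(terra)
f <- system.file("ex/elev.tif", package="terra")
x <- rast(f)
txt <- as.character(x)        # a string of R code describing the geometry
y <- eval(parse(text = txt))  # a new SpatRaster with the same properties
y <- init(y, "cell")          # fill it with dummy cell-number values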
If you cannot replicate your error with example data, this may give you a hint about the problem. Does it have to do with NAs? The file being on disk? The file format? One tricky situation is when the behavior differs because the real file is much larger. You can simulate a large file by setting terraOptions(todisk=TRUE) and using a steps argument in a function, e.g.
b <- clamp(x, steps=5)
If none of that allows you to replicate the error, your last resort is to provide a link to the file so that others can download it. If you cannot do that, then at least show the content of the SpatRaster x with show(x) and provide the code to create a similar object with as.character(x).

Convert Document Term Matrix (DTM) to Data Frame (R Programming)

I am a beginner with the R programming language and am currently working on a project.
There's a huge Document Term Matrix (DTM) that I would like to convert into a data frame.
However, due to the restrictions of the functions involved, I am not able to do so directly.
The method I have been using is to first convert it into a matrix, and then convert that to a data frame.
DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)
It was working perfectly with smaller DTMs. However, when the DTM is too large, I am not able to convert it to a matrix, and I get the error shown below:
Error: cannot allocate vector of size 2409.3 Gb
I have been looking online for a few days but have not been able to find a solution.
I would be really thankful if anyone could suggest the best way to convert a DTM into a data frame, especially when dealing with a large DTM.
In the tidytext package there is actually a function to do just that. Try the tidy function, which will return a tibble (basically a fancy data frame that prints nicely). The nice thing about tidy is that it takes care of the pesky stringsAsFactors=FALSE issue by never converting strings to factors, and it deals nicely with the sparsity of your DTM.
as.matrix is trying to convert your DTM into a non-sparse matrix with an entry for every document and term, even if the term occurs 0 times in that document, which is causing your memory usage to balloon. tidy will convert it into a data frame where each document only has counts for the terms actually found in it.
In your example here you'd run
library(tidytext)
DF <- tidy(DTM)
There's even a vignette on how to use the tidytext package (meant to work with the tidyverse) here.
It's possible that as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE) instead of data.frame(as.matrix(DTM), stringsAsFactors=FALSE) might do the trick.
The API documentation notes that as.data.frame() simply coerces a matrix into a data frame, whereas data.frame() creates a new data frame from its inputs.
as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html
data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
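If you would rather stay in base R, here is a hedged sketch that builds the same long-format data frame directly from the sparse triplet representation. It assumes DTM is a tm DocumentTermMatrix, which inherits from slam's simple_triplet_matrix and so carries i, j, v and dimnames components:
DF <- data.frame(
  document = DTM$dimnames$Docs[DTM$i],   # document name of each nonzero entry
  term     = DTM$dimnames$Terms[DTM$j],  # term name of each nonzero entry
  count    = DTM$v,                      # the nonzero counts themselves
  stringsAsFactors = FALSE)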

In R, Create Summary Data Frame from Multiple Objects

I'm trying to create a "summary" data frame that holds some high-level stats about a few objects in my R project. I'm having trouble accomplishing even this simple task, and I've tried for loops and apply functions with no luck.
After searching (a lot) on SO, I see that for loops might not be the best-performing option, so I'm open to any solution that gets the job done.
I have three objects, text1, text2, and text3, of class "Large Character (vector)" (imagine I might be exploring these objects and will create an NLP predictive model from them). Each is > 250 MB in size (upwards of 1 million "rows" each) once loaded into R.
My goal: store the results of object.size(), length(), and max(nchar()) in a table for my 3 objects.
Method 1: Use an apply() function
Issue: I haven't successfully applied multiple functions to a single object. I understand how to do simple applies like lapply(x, mean), but I'm falling short here.
Method 2: Bind Rows Using a For loop
I like this solution because I almost know how to implement it. A lot of SO users say this is a bad approach, but I'm lacking other ideas.
sources <- c("text1", "text2", "text3")
text.summary <- data.frame()
for (i in sources){
text.summary[i ,] <- rbind(i, object.size(get(i)), length(get(i)),
max(nchar(get(i))))
}
Issue: This returns the error "data length exceeds size of matrix". I know I could define the structure of my data frame up front (on line 2), but I've seen too much feedback on other questions that advises against doing this.
Thanks for helping me understand the proper way to accomplish this. I know I'm going to have trouble doing NLP if I can't even figure out this simple problem, but R is my first foray into programming. Oof!
Just try, for example:
do.call(rbind, lapply(list(text1,text2,text3),
function(x) c(objectSize=c(object.size(x)),length=length(x),max=max(nchar(x)))))
You'll obtain a matrix; you can coerce it to a data.frame later if you need to.
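A hedged variation on the same idea: since you already have the object names as strings, mget() can fetch the objects by name and keep those names as row labels (the column names below are my own invention):
sources <- c("text1", "text2", "text3")
stats <- t(sapply(mget(sources), function(x)
  c(size_bytes = as.numeric(object.size(x)),
    length = length(x),
    max_nchar = max(nchar(x)))))
text.summary <- as.data.frame(stats)  # rows are named text1, text2, text3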

Work-around to clear blank entries in a document term matrix?

I have some R code that I've used in the past to produce topic models. Everything was working fine until I updated all of my R packages in the hope of fixing a slightly unrelated problem. Now code that had previously worked seems to be broken, and I can't figure out what to do.
I read this post and found it very helpful in setting this up originally. It describes a method of cleaning out blank rows, after sparse terms have been removed, to set up the subsequent analysis. Here is what happens when I enter the same code with my current packages:
> rowTotals <- apply(a.dtm.t, 1, sum) #Find the sum of words in each Document
> a.dtm.t.rt <- a.dtm.t[rowTotals>0]
Error in `[.simple_triplet_matrix`(a.dtm.t, rowTotals > 0) :
Logical vector subscripting disabled for this object.
Does anyone know how I can go about locating the problem and rolling back to a working solution? Thanks.
Try a.dtm.t.rt <- a.dtm.t[which(rowTotals>0)]
If that doesn't work then you need to show a reproducible example; we have no idea what any of the objects you're working with contain.
I ran into the same problem and used the slam package to solve it.
library(slam)
# take tdm as a large term-document matrix
freq <- rowapply_simple_triplet_matrix(tdm, sum)
colapply_simple_triplet_matrix will likewise help when you need to work over the columns of the sparse matrix.
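Putting the two answers together, a minimal sketch, assuming a.dtm.t is a document-term simple_triplet_matrix (slam::row_sums is an alternative I'm swapping in for the apply call; integer-index row subsetting is still supported even though logical subscripting is not):
library(slam)
rowTotals <- row_sums(a.dtm.t)                 # sparse-aware row sums
a.dtm.t.rt <- a.dtm.t[which(rowTotals > 0), ]  # keep only non-empty documents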

Converting an R code snippet to use the Matrix package?

I am not sure there are any R users out there, but just in case:
I am a novice at R and was kindly "handed down" the following R code snippet:
Beta <- exp(as.matrix(read.table('beta.transpose')))
WordFreq <- read.table('freq-matrix')
WordProbs <- WordFreq$V1 / sum(WordFreq)
infile <- file('freq-matrix')
outfile <- file('doc_topic_prob_matrix', 'w')
open(infile)
open(outfile)
for (i in 1:93049) {
vec <- t(scan(infile, nlines=1))
topics <- (vec/WordProbs) %*% Beta
write.table(topics, outfile, append=T, row.names=F, col.names=F)
}
When I tried running this on my dataset, the system thrashed and swapped like crazy. Now I realize why: the file freq-matrix holds a large (22 GB) matrix, and I was trying to read it all into memory.
I have been told to use the Matrix package, because freq-matrix has many, many zeros all over the place and the package handles such cases well. Will that help? If so, any hints on how to change this code would be most welcome. I have no R experience and have just started reading through the introduction PDF available on the site.
Many thanks
~l
My suggestion might be completely off, because you don't give enough details about the contents of your files and I had to guess from the code. Anyway, here it goes.
You don't state it, but I would assume that your code chokes on the second line, when you read in the big matrix. The loop reads the lines one at a time and should not crash. The only reason you need that big matrix is to calculate the WordProbs vector, so why not rewrite that part using the same scan-based looping? In fact, you probably don't even need to store the WordProbs vector, just sum(WordFreq), which you can get from an initial run through the file. Then rewrite the formula within the loop to calculate the current WordProb.
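A hedged sketch of that two-pass idea. It deviates slightly from the suggestion above in that it also stores the small first-column vector v1 (about a million numbers, which is cheap) so that WordProbs can be formed exactly as in the original code, and it keeps the original's assumption that each line of freq-matrix lines up element-wise with WordProbs:
n <- 93049  # number of rows, from the original loop

# Pass 1: stream the file once to collect the first column and the grand total
con <- file('freq-matrix', 'r')
v1 <- numeric(n)
total <- 0
for (i in 1:n) {
  line <- scan(con, nlines = 1, quiet = TRUE)
  v1[i] <- line[1]
  total <- total + sum(line)
}
close(con)
WordProbs <- v1 / total

# Pass 2: the original loop, never holding the full matrix in memory
Beta <- exp(as.matrix(read.table('beta.transpose')))
infile <- file('freq-matrix', 'r')
outfile <- file('doc_topic_prob_matrix', 'w')
for (i in 1:n) {
  vec <- t(scan(infile, nlines = 1, quiet = TRUE))
  topics <- (vec / WordProbs) %*% Beta
  write.table(topics, outfile, append = TRUE, row.names = FALSE, col.names = FALSE)
}
close(infile)
close(outfile)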
Belated answer, but I'd recommend reading the data into a memory-mapped file using the bigmemory package. After that, I'd look for the non-zero entries, which can then be represented as a 3-column matrix: (ix_row, ix_col, value). This is called a coordinate list (COO), though the name is unimportant. From there, the Matrix package supports the creation of sparse matrices (via sparseMatrix). After you get the COO, you're pretty much set: conversion to and from the sparse matrix format is reasonably fast. Multiplying the matrix by Beta should be reasonably fast as well. If you need even greater speed, you could use an optimized BLAS library, but that opens up more questions. :)
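As a hedged illustration of the COO-to-sparse step (the toy triplets below are made up; in practice ix_row, ix_col, and value would come from scanning the memory-mapped data):
library(Matrix)
coo <- data.frame(ix_row = c(1, 2, 2), ix_col = c(3, 1, 4), value = c(5, 1, 2))  # toy stand-in
m <- sparseMatrix(i = coo$ix_row, j = coo$ix_col, x = coo$value)
# m %*% Beta would then be a fast, sparse-aware matrix product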
