Convert Document Term Matrix (DTM) to Data Frame (R Programming)

I am a beginner with the R programming language and am currently working on a project.
There's a huge Document Term Matrix (DTM) that I would like to convert into a data frame.
However, due to the restrictions of the functions, I am not able to do so.
The method I have been using is to first convert it into a matrix, and then convert that to a data frame.
DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)
It worked perfectly with smaller DTMs. However, when the DTM is too large, I am not able to convert it to a matrix; the conversion fails with the error shown below:
Error: cannot allocate vector of size 2409.3 Gb
I have been looking online for a few days but have not been able to find a solution.
I would be really thankful if anyone could suggest the best way to convert a DTM into a data frame, especially when dealing with a large DTM.

In the tidytext package there is actually a function to do just that. Try the tidy function, which returns a tibble (basically a fancy data frame that prints nicely). The nice thing about tidy is that it takes care of the pesky stringsAsFactors=FALSE issue by not converting strings to factors, and it deals nicely with the sparsity of your DTM.
as.matrix tries to convert your DTM into a dense matrix with an entry for every document-term pair, even when a term occurs 0 times in a document, which is what causes your memory usage to balloon. tidy will instead convert it into a data frame where each document only has counts for the terms actually found in it.
In your example here you'd run
library(tidytext)
DF <- tidy(DTM)
There's even a vignette on how to use the tidytext package (designed to work within the tidyverse).
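For a concrete picture of what tidy() returns, here is a minimal sketch using the small crude corpus that ships with tm (the column names document, term, and count are the tidytext defaults for a DTM):

```r
library(tm)
library(tidytext)

data("crude")                 # 20 example documents shipped with tm
DTM <- DocumentTermMatrix(crude)

DF <- tidy(DTM)               # one row per non-zero (document, term) pair
names(DF)                     # "document" "term" "count"
```

Because only non-zero entries become rows, the result stays small even when the dense matrix would not fit in memory.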

It's possible that as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE) instead of data.frame(as.matrix(DTM), stringsAsFactors=FALSE) might do the trick.
The API documentation notes that as.data.frame() simply coerces a matrix into a dataframe, whereas data.frame() creates a new data frame from the input.
as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html
data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html
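Another memory-friendly route, if you'd rather avoid extra packages: a tm DTM is stored as a slam simple_triplet_matrix, so its non-zero entries are already available as parallel vectors (i, j, v) and can be turned into a long-format data frame directly, without ever densifying. A sketch using tm's built-in crude corpus:

```r
library(tm)

data("crude")
DTM <- DocumentTermMatrix(crude)

# i = row index, j = column index, v = count of each non-zero entry
DF <- data.frame(document = DTM$dimnames$Docs[DTM$i],
                 term     = DTM$dimnames$Terms[DTM$j],
                 count    = DTM$v,
                 stringsAsFactors = FALSE)

nrow(DF) == length(DTM$v)     # one row per non-zero entry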


Explaining Simple Loop in R

I successfully wrote a for loop in R. That's fine, and I'm very happy it works, but I also want to understand what I've done exactly, because I will have to work with loops later in my analysis as well.
I work with raster data (DEMs). I load the files into the environment as rasters and then use the getValues function in the loop, as I want to do some calculations. It looks as follows:
list <- dir(pattern=".tif", full.names=T)
tif.files <- list()
tif.files.values <- tif.files
for (i in 1:length(list)) {
  tif.files[[i]] <- raster(list[[i]])
  tif.files.values[[i]] <- getValues(tif.files[[i]])
}
Okay, so far so good. What I don't get is why I have to specify tif.files and tif.files.values before using them in the loop, and why I have to specify them exactly the way I did. For the first part, the raster operation, I had a pattern to follow. Maybe someone can explain the context. I really want to understand R.
When you do:
tif.files[[i]] <- raster (list[[i]])
then tif.files[[i]] is the result of running raster(list[[i]]), so that line is storing the raster object. This object contains the metadata (extent, number of rows, columns, etc.) and the data, although if the tiff is huge it doesn't actually read the values in at that time.
tif.files.values[[i]] <- getValues(tif.files[[i]])
that line calls getValues on the raster object, which reads the values from the raster and returns a vector. The values of the grid cells are now in tif.files.values[[i]].
Experiment by printing tif.files[[1]] and tif.files.values[[1]] at the R prompt.
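On the "why do I have to create the lists first" point: subscript assignment with [[<- needs an existing object to assign into, so tif.files[[i]] <- ... fails with "object 'tif.files' not found" unless the list already exists. (Assigning tif.files.values <- tif.files just creates a second empty list the same way.) A minimal base-R illustration, no rasters involved:

```r
# Without the next line, results[[i]] <- ... would fail with
# "Error: object 'results' not found"
results <- list()            # create an empty list to assign into

for (i in 1:3) {
  results[[i]] <- i^2        # fill the list one element per iteration
}

unlist(results)              # 1 4 9
```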
Note
This is a question about R, not RStudio; RStudio is just the interface you are using, with all the buttons and menus. The R language exists quite happily without it, and your question is purely a language question. I've edited and tagged it accordingly for you.

How to search for huge list of search-terms from a corpus using custom function in tm package

I want to select and retain the gene names from a corpus of multiple text documents using the tm package. I have used a custom function to keep only the genes defined in "pattern" and remove everything else. Here is my code:
docs <- Corpus(DirSource("path of the directory containing text documents"))
f <- content_transformer(function(x, pattern)regmatches(x, gregexpr(pattern, x, ignore.case=TRUE)))
genes = "IL1|IL2|IL3|IL4|IL5|IL6|IL7|IL8|IL9|IL10|TNF|TGF|AP2|OLR1|OLR2"
docs <- tm_map(docs, f, genes)
The code works perfectly fine. However, if I need to match a larger number of genes (say > 5000), what is the best way to approach it? I don't want to put the genes in an array and loop the tm_map function, to avoid huge run times and memory constraints.
If you simply want the fastest vectorized regex matching, use the stringi package, not tm. Specifically, look at the stri_match* functions (or you might find stringr even faster if you're only handling ASCII; look for Hadley's latest versions and comments).
But if the regex of gene names is fixed and known upfront, and you're going to be doing a lot of retrieval on those few strings, then you could tag each document for faster retrieval.
(You haven't fully told us your use-case. What % of your runtime is this retrieval task? 0.1%? 99%? Are you storing your genes as text strings? Why not tokenize them and convert once to factors at input-time?)
Either way, tm is not a very scalable, performant package, so look at other approaches.
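A sketch of the stringi route, with a hypothetical handful of gene symbols standing in for the full list (in practice the genes vector would hold your >5000 symbols, and the alternation regex is built once, not per document):

```r
library(stringi)

genes <- c("IL1", "IL2", "IL6", "TNF", "TGF")   # hypothetical stand-ins

# One alternation regex, built once; \\b avoids matching IL1 inside IL10
pattern <- stri_c("\\b(", stri_c(genes, collapse = "|"), ")\\b")

texts <- c("TNF and IL2 were upregulated", "no genes mentioned here")
hits <- stri_extract_all_regex(texts, pattern,
                               opts_regex = stri_opts_regex(case_insensitive = TRUE))
hits
```

stri_extract_all_regex is vectorized over the documents, so there is no per-gene loop at all.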

In R, Create Summary Data Frame from Multiple Objects

I'm trying to create a "summary" data frame that holds some high-level stats about a few objects in my R project. I'm having trouble even accomplishing this simple task and I've tried using For loops and Apply functions with no luck.
After searching (a lot) on SO I'm seeing that For loops might not be the best performing option, so I'm open to any solution that gets the job done.
I have three objects: text1, text2 and text3, each a large character vector (imagine I'm exploring these objects and will build an NLP predictive model from them). Each is > 250 MB in size (upwards of 1 million "rows" each) once loaded into R.
My goal: store the results of object.size(), length() and max(nchar()) in a table for my 3 objects.
Method 1: Use an Apply() Function
Issue: I haven't successfully applied multiple functions to a single object. I understand how to do simple applies like lapply(x, mean), but I'm falling short here.
Method 2: Bind Rows Using a For loop
I'm liking this solution because I almost know how to implement it. A lot of SO users say this is a bad approach, but I'm lacking other ideas.
sources <- c("text1", "text2", "text3")
text.summary <- data.frame()
for (i in sources) {
  text.summary[i, ] <- rbind(i, object.size(get(i)), length(get(i)),
                             max(nchar(get(i))))
}
Issue: This returns the error data length exceeds size of matrix. I know I could define the structure of my data frame up front (on line 2), but I've seen a lot of feedback on other questions advising against that.
Thanks for helping me understand the proper way to accomplish this. I know I'm going to have trouble doing NLP if I can't even figure out this simple problem, but R is my first foray into programming. Oof!
Just try, for example:
do.call(rbind, lapply(list(text1, text2, text3),
        function(x) c(objectSize = c(object.size(x)),
                      length = length(x),
                      max = max(nchar(x)))))
You'll obtain a matrix; you can coerce it to a data.frame later if you need.
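A self-contained version of that answer with toy vectors standing in for the real objects (naming the list is a small addition that labels the rows by source):

```r
# Toy stand-ins for text1/text2/text3 (the real ones are large character vectors)
text1 <- c("alpha", "bravo")
text2 <- c("charlie")
text3 <- c("delta", "echo", "foxtrot!")

texts <- list(text1 = text1, text2 = text2, text3 = text3)  # named, so rows get labels

text.summary <- as.data.frame(do.call(rbind, lapply(texts, function(x)
  c(objectSize = c(object.size(x)),    # c() strips the object_size class to numeric
    length     = length(x),
    max        = max(nchar(x))))))

text.summary
```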

Replacing Rows in a large data frame in R

I have to manually collect some rows, and based on the R Cookbook's recommendation I pre-allocated memory for a large data frame. Say my code is
dataSize <- 500000;
shoesRead <- read.csv(file="someShoeCsv.csv", head=TRUE, sep=",");
shoes <- data.frame(size=integer(dataSize), price=double(dataSize),
cost=double(dataSize), retail=double(dataSize));
So now I have some data about shoes which I imported via CSV; I then perform some calculations and want to insert the results into the data frame shoes. Let's say someShoeCsv.csv has a column called ukSize, and so
usSize <- ukSize * 1.05 # for example
My question is: how do I do this? Running the code below, now that I have a usSize variable transformed from the ukSize column read from the CSV file:
shoes <- rbind(shoes,
               data.frame("size"=usSize, "price"=price,
                          "cost"=cost, "retail"=retail));
appends to the already large data frame.
I have experimented with building a list and then calling rbind, but I understand that is tedious, so I am trying this method instead, still to no avail.
I'm not quite sure what you're trying to do, but if you're trying to replace some of the pre-allocated rows with new data, you could do so like this:
Nreplace = length(usSize)
shoes$size[1:Nreplace] <- usSize
shoes$price[1:Nreplace] <- shoesRead$price
And so on, for the rest of the columns.
Here's some unsolicited advice. Looking at the code you've included, you reference ukSize and price etc. without referencing the data frame, which makes it appear you've done attach(shoesRead). Definitely never use attach(). If you want the price vector, for example, just do shoesRead$price. It's a little more typing for the sake of much more readable code.
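To make the replace-in-place pattern concrete, here is a self-contained sketch with invented numbers (the column names follow the question; the values are hypothetical):

```r
n <- 6
shoes <- data.frame(size = numeric(n), price = numeric(n))  # pre-allocated, all zeros

usSize  <- c(9.5, 10.5, 11)   # hypothetical computed sizes
usPrice <- c(40, 55, 60)      # hypothetical prices

k <- length(usSize)
shoes$size[1:k]  <- usSize    # overwrite the first k pre-allocated rows in place
shoes$price[1:k] <- usPrice   # no rbind(), so the frame is never copied and regrown

shoes
```

The point of pre-allocating is exactly that you then fill rows by assignment instead of growing the frame with rbind() on every insert.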

Package tm: problems with kmeans

I have a question about k-means clustering in R. I'm doing everything according to this article. Everything is based on examples within the tm package, so no data import is required. acq contains 50 documents and crude 20 documents.
library(tm)
data("acq")
data("crude")
ws <- c(acq, crude)
wsTDM <- Data(TermDocumentMatrix(ws)) #First problem here
wsKMeans <- kmeans(wsTDM, 2)
wsReutersCluster <- c(rep("acq", 50), rep("crude", 20))
cl_agreement(wsKMeans, as.cl_partition(wsReutersCluster), "diag")
Error in lapply(X, FUN, ...) :
(list) object cannot be coerced to type 'integer'
I actually want to create a cross-agreement matrix. But this article was written in 2008, and a lot has changed since then. The Data function is now only available in the RSurvey package, but I rather doubt it's the same function. And I think the main problem is that TermDocumentMatrix used to be an S4 class and is now S3. I know it's possible to do this with the text alone, but I want to do it this way, since with a TDM it's possible to remove stopwords, punctuation, etc. for better results. So if someone has a solution, that would be terrific.
The TDM is stored as a sparse matrix, as described in ?TermDocumentMatrix. This can also be seen from just inspecting the object like str(wsTDM). That old Data() function was just a way to access the contents as a regular matrix. It is not needed anymore. Just do kmeans(wsTDM, 2) and you'll see that the output is as expected, with clusters identified for 2775 observations (terms) on 70 features (documents). Good luck!
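A minimal updated sketch of the pipeline. One assumption worth flagging: to compare clusters against the 70 known document labels, the documents (not the terms) should be the rows, so this uses DocumentTermMatrix rather than TermDocumentMatrix; as.matrix() stands in for the old Data() accessor and is fine at this small scale:

```r
library(tm)

data("acq")
data("crude")
ws <- c(acq, crude)                       # 50 acq + 20 crude documents

# as.matrix() densifies the sparse matrix; acceptable for 70 small documents
wsDTM <- as.matrix(DocumentTermMatrix(ws))

wsKMeans <- kmeans(wsDTM, 2)
length(wsKMeans$cluster)                  # one cluster assignment per document
```

From there, the clue-based cl_agreement step in the question has 70 cluster assignments to compare against the 70 "acq"/"crude" labels.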
