Working with text classification and big sparse matrices in R

I'm working on a multi-class text classification project and I need to build the document-term matrices and do the training and testing in R.
I already have datasets that don't fit in the limited dimensionality of the base matrix class in R, and I would need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, as it has so far been more useful and reliable than the tm package, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory hungry even with small datasets. Currently, as I said, I use quanteda to build the equivalent document-term matrix container, which I then convert into a data.frame to perform the training.
I want to know if there is a way to build such big matrices. I have been reading about the bigmemory package, which provides this kind of container, but I am not sure it will work with caret for the later classification. Overall I want to understand the problem and build a workaround so I can work with bigger datasets. RAM is not a (big) problem (32 GB), but I'm trying to find the right way to do it and I feel completely lost about it.

At what point did you hit RAM constraints?
quanteda is a good package for NLP on medium-sized datasets. But I also suggest trying my text2vec package. It is generally quite memory friendly and doesn't require loading all the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).
The second point is that I strongly recommend not converting the data into a data.frame. Try to work with sparseMatrix objects directly.
The following methods work well for text classification (see the sketch after this list):
logistic regression with an L1 penalty (see the glmnet package)
linear SVM (see LiblineaR, but it is worth searching for alternatives)
It is also worth trying xgboost; I would prefer linear models, so you can try its linear booster.
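A minimal sketch of both suggestions, assuming a character vector of documents and a label factor; the toy tweets below are placeholders for your real 100k documents, and package APIs may differ slightly across versions:

```r
# Hedged sketch: build a sparse DTM with text2vec and train an L1-penalised
# multinomial model with glmnet directly on it (no data.frame conversion).
library(text2vec)
library(glmnet)

tweets <- rep(c("great movie tonight really loved it",
                "terrible service never going back",
                "the weather is fine today nothing special"), 50)
labels <- factor(rep(c("pos", "neg", "neutral"), 50))

it  <- itoken(tweets, preprocessor = tolower,
              tokenizer = word_tokenizer, progressbar = FALSE)
dtm <- create_dtm(it, vocab_vectorizer(create_vocabulary(it)))  # sparse dgCMatrix

# alpha = 1 gives the L1 penalty; glmnet accepts sparse input directly
fit  <- cv.glmnet(dtm, labels, family = "multinomial", alpha = 1, nfolds = 5)
pred <- predict(fit, dtm, s = "lambda.min", type = "class")
table(pred, labels)
```

A quanteda dfm is also a sparse matrix, so the same glmnet call should work on it without going through a data.frame.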

Related

Random Forest with p>>n and not enough memory

I am trying to perform random forest classification on genomic data with ~200k predictors and ~20 rows. The predictors have already been pruned for autocorrelation. I tried the 'ranger' R package, but it complains that it cannot allocate a 164 GB vector (I have 32 GB of RAM).
Is there any RF implementation that can manage the analysis within the available RAM (I would like to avoid increasing the swap)?
Should I use a different algorithm instead (from what I have read, RF should deal well with p >> n)?
If it's genomic data, are there a lot of zeroes? If so, you might be able to convert it into a sparse matrix using the Matrix package. I believe ranger has been able to work with sparse matrices for a while, and this can help a lot with memory issues.
As far as I know, ranger is the best R random forest package available for datasets where p >> n.
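A minimal sketch of that idea, assuming your installed version of ranger accepts a dgCMatrix through its x/y interface (the toy matrix below is far smaller than ~200k predictors):

```r
# Hedged sketch: store the mostly-zero predictor matrix as a sparse
# dgCMatrix and pass it to ranger instead of a dense matrix/data.frame.
# Whether ranger accepts sparse input depends on the installed version.
library(Matrix)
library(ranger)

set.seed(1)
n <- 20; p <- 2000                                    # stand-in for 20 x ~200k
x_dense  <- matrix(rbinom(n * p, 1, 0.02), nrow = n)  # mostly zeroes
x_sparse <- Matrix(x_dense, sparse = TRUE)            # dgCMatrix, much smaller in RAM
y        <- factor(sample(c("case", "control"), n, replace = TRUE))

rf <- ranger(x = x_sparse, y = y, num.trees = 500)
rf$prediction.error                                   # out-of-bag error
```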

Stochastic Gradient Descent design matrix too big for R

I'm trying to implement a baseline prediction model of movie ratings (akin to the various baseline models from the Netflix Prize), with parameters learned via stochastic gradient descent. However, because both explanatory variables are categorical (users and movies), the design matrix is really big and cannot fit into my RAM.
I thought that the sgd package would automagically find its way around this issue (since it's designed for large amounts of data), but that does not seem to be the case.
Does anyone know a way around this? Maybe a way to build the design matrix as a sparse matrix?
Cheers,
You can try Matrix::sparseMatrix to build the design matrix in triplet form, which stores only the non-zero entries and is far more memory efficient.
You can also move the problem to Amazon EC2 and use an instance with more RAM, or configure a cluster to run a MapReduce job.
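A minimal sketch of the triplet idea for the user/movie case; the column layout (user indicators first, then movie indicators) is just an illustrative choice:

```r
# Hedged sketch: build the one-hot user/movie design matrix in triplet
# form with Matrix::sparseMatrix(), so only the non-zero entries are stored.
library(Matrix)

ratings <- data.frame(user  = c(1L, 1L, 2L, 3L),
                      movie = c(10L, 20L, 10L, 30L),
                      score = c(4, 3, 5, 2))
n_users  <- max(ratings$user)
n_movies <- max(ratings$movie)

rows <- rep(seq_len(nrow(ratings)), 2)        # each rating hits two columns
cols <- c(ratings$user,                       # ... its user indicator
          n_users + ratings$movie)            # ... and its movie indicator
X <- sparseMatrix(i = rows, j = cols, x = 1,
                  dims = c(nrow(ratings), n_users + n_movies))
dim(X)          # one row per rating, one column per user and per movie
object.size(X)  # only non-zero entries are stored
```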
Check out the xgboost package (https://github.com/dmlc/xgboost) and its documentation to understand how it deals with memory problems.
A more practical tutorial is here: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html
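As a rough illustration (toy data, arbitrary parameter choices), xgboost can be fed a sparse matrix through xgb.DMatrix, which keeps the memory footprint down:

```r
# Hedged sketch: xgboost accepts a sparse dgCMatrix via xgb.DMatrix,
# so the design matrix never has to be densified. Toy data only.
library(Matrix)
library(xgboost)

set.seed(1)
X <- rsparsematrix(10000, 500, density = 0.01)   # sparse design matrix
y <- as.numeric(rowSums(X[, 1:5]) > 0)           # toy binary target

dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params  = list(objective = "binary:logistic",
                                booster   = "gblinear"),   # linear booster
                 data    = dtrain,
                 nrounds = 20)
```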

Train SVM on a very large dataset stored on hard drive

I have a very large self-collected dataset of size [2,000,000 x 12,672], where the rows are the instances and the columns the features. The dataset occupies about 60 GB on the local hard disk. I want to train a linear SVM on this dataset. The problem is that I have only 8 GB of RAM, so I cannot load all the data at once. Is there any solution for training an SVM on such a large dataset? Generating the dataset is under my control, and it is currently in HDF5 format.
Thanks
Welcome to machine learning! One of the hard things about working in this space is the compute requirement. There are two main kinds of algorithms, online and offline.
Online: supports feeding in examples one at a time, each one improving the model slightly
Offline: requires feeding in the entire dataset at once, typically achieving higher accuracy than an online model
Many common algorithms have both online and offline implementations, but an SVM is not one of them. To the best of my knowledge, SVMs are traditionally an offline-only algorithm. The reason lies in the fine details around "shattering" the dataset. I won't go into the math here, but if you read up on it, it should become apparent.
It's also worth noting that the complexity of an SVM is somewhere between n^2 and n^3, meaning that even if you could load everything into memory, it would take ages to actually train the model. It's very common to test with a much smaller portion of your dataset before moving to the full one.
When moving to the full dataset you would have to run this on a much larger machine than your own; AWS should have something large enough for you. However, at your size of data I strongly advise using something other than an SVM. At large data sizes, neural network approaches really shine and can be trained in a more realistic amount of time.
As alluded to in the comments, there is also the concept of an out-of-core algorithm that can operate directly on objects stored on disk. The only group I know of with a good offering of out-of-core algorithms is Dato. It's a commercial product, but it might be your best solution here.
A stochastic gradient descent approach to the SVM could help, as it scales well and avoids the n^2 problem. An implementation available in R is RSofia, which was created by a team at Google and is discussed in Large Scale Learning to Rank. In the paper, they show that compared to a traditional SVM, the SGD approach significantly decreases training time (due to (1) the pairwise learning method and (2) the fact that only a subset of the observations ends up being used to train the model).
Note that RSofia is a little more bare-bones than some of the other SVM packages available in R; for example, you need to do your own centering and scaling of features.
As to your memory problem, it would be a little surprising if you needed the entire dataset. I would expect that you'd be fine reading in a sample of your data and training your model on that. To confirm this, you could train multiple models on different samples and then estimate performance on the same holdout set; the performance should be similar across the models (see the sketch below).
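A minimal sketch of that check, using LiblineaR as an illustrative linear SVM on simulated data (in practice you would read sampled chunks from your HDF5 file instead):

```r
# Hedged sketch: fit a linear SVM on several random subsets and compare
# accuracy on one common holdout set; similar scores suggest a sample is
# enough and the full 2M rows never need to be in RAM at once.
library(LiblineaR)

set.seed(42)
n <- 50000; p <- 100                          # toy stand-in for the real data
X <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(X[, 1] + rnorm(n) > 0, "pos", "neg"))

holdout    <- sample(n, 5000)
train_pool <- setdiff(seq_len(n), holdout)

acc <- sapply(1:3, function(i) {
  idx <- sample(train_pool, 10000)            # one random training sample
  m   <- LiblineaR(data = X[idx, ], target = y[idx], type = 1, cost = 1)
  mean(predict(m, X[holdout, ])$predictions == y[holdout])
})
acc                                           # values should be close together
```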
You don't say why you want a linear SVM, but if you can consider another model that often gives superior results, check out the hpelm Python package. It can read an HDF5 file directly. You can find it at https://pypi.python.org/pypi/hpelm. It trains on segmented data, which can even be pre-loaded (asynchronously) to speed up reading from slow hard disks.

SVM modeling with BIG DATA

For SVM modeling in R, I have used the kernlab package (ksvm method) on Windows XP with 2 GB of RAM. But with 201,497 data rows, I cannot provide enough memory for the modeling (I get the error: cannot allocate vector of size greater than 2.7 GB).
Therefore, I have used Amazon micro and large instances for the SVM modeling. But they show the same issue as my local machine (cannot allocate vector of size greater than 2.7 GB).
Can anyone suggest a solution to this big-data modeling problem, or tell me whether there is something wrong with my approach?
Without a reproducible example it is hard to say whether the dataset is just too big or whether some parts of your script are suboptimal. A few general pointers:
Take a look at the High Performance Computing Task View; it lists the main R packages relevant for working with big data.
You currently use your entire dataset for training. You could instead take a subset (say 10%) and fit your model on that. Repeating this procedure a few times will show whether the fit is sensitive to which subset of the data you use (see the sketch after this list).
Some analysis techniques, e.g. PCA, can be done by processing the data iteratively, i.e. in chunks, which makes analyses possible on very big datasets (>> 100 GB). I'm not sure whether this is possible with kernlab.
Check that the R version you are using is 64-bit.
This earlier question might be of interest.
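A minimal sketch of the subsetting idea from the list above (toy data; with 201,497 rows you would sample from your real data instead):

```r
# Hedged sketch: fit ksvm on a few 10% subsets and compare the reported
# cross-validation error; stable values suggest the subset is representative.
library(kernlab)

set.seed(1)
n <- 20000; p <- 20                         # toy stand-in for 201,497 rows
X <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(rowSums(X[, 1:3]) + rnorm(n) > 0, "yes", "no"))

errs <- sapply(1:3, function(i) {
  idx <- sample(n, n %/% 10)                # a 10% subset
  fit <- ksvm(X[idx, ], y[idx], kernel = "rbfdot", C = 1, cross = 5)
  cross(fit)                                # 5-fold CV error on that subset
})
errs
```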

Large matrix: solve(crossprod(X)) when dim(X) = 100,000:5000

I need to run a simple two-stage least squares regression on large data matrices. This just requires some crossprod() and solve() calls, but the matrices have dimensions of 100,000 by 5,000. My understanding is that holding a matrix like this in memory takes a bit less than 4 GB. Unfortunately, my 64-bit Win7 machine has only 8 GB of RAM. When I try to manipulate the matrices in question, I get the usual 'cannot allocate vector of size' message.
I have considered a number of options, such as the ff and bigmemory packages. However, the base R functions for the matrix operations I need only support the usual matrix object type, not the big.matrix type.
It seems like it may be possible to extend the code from biglm(), but I'm on a tight schedule for this project, so I wanted to check in with you all to see whether a ready-made solution exists for problems like this. Apologies if this has been addressed before (I couldn't find it) or if the question is too generic.
Yes, a ready-made solution exists in biglm, the package you already identified. Linear regression can be fitted with an updating scheme; that basic property is implemented in the package.
Dump your data to disk, say to SQLite, study the package documentation, and proceed in, say, 10 chunks of 10,000 rows each.
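A minimal sketch of that updating scheme on simulated chunks (in practice, read each chunk from SQLite or a file inside the loop):

```r
# Hedged sketch: biglm fits on the first chunk and is update()d with each
# following chunk, so only one chunk of rows is ever held in RAM.
library(biglm)

set.seed(1)
make_chunk <- function(n) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  data.frame(x1 = x1, x2 = x2, y = 1 + 2 * x1 - x2 + rnorm(n))
}
chunks <- replicate(10, make_chunk(10000), simplify = FALSE)  # 10 chunks of 10,000 rows

fit <- biglm(y ~ x1 + x2, data = chunks[[1]])
for (ch in chunks[-1]) fit <- update(fit, ch)
summary(fit)                     # coefficients estimated from all 100,000 rows
```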
