Random forest on a big dataset - r

I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, even down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM to my machine, and that random forests are very well suited to the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.

You're likely asking randomForest to create the proximity matrix for the data, which, if you think about it, will be insanely big: 1 million x 1 million. A matrix of that size is required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this: the package author states that the only place in the entire source code where n,n) appears is in the calculation of the proximity matrix.
But it's hard to help more, given that you've provided no details about the actual code you're using.
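If that is what is happening, here is a minimal sketch of the kind of call that avoids it, assuming x holds the 6 predictors and y the numeric response; the sampsize and nodesize values are placeholders just to illustrate the memory-reducing knobs:

library(randomForest)

## x: data frame of the 6 predictors, y: numeric response (placeholders)
set.seed(42)
rf_fit <- randomForest(
  x, y,
  ntree     = 500,
  sampsize  = 100000,   # bootstrap sample per tree, far smaller than 1M rows
  nodesize  = 20,       # larger terminal nodes keep the trees smaller
  proximity = FALSE     # never build the n x n proximity matrix
)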

Related

Keras predict repeated columns

I have a question about keras model code in R. I have finished training the model and need to predict. Predicting a single row is very fast, but my data has 2,000,000,000 rows and nearly 200 columns, with a structure like the attached image.
[Image: data structure]
I don't know if anyone has suggestions on which method to use so that predict can run quickly and use less memory. To predict, I created matrices according to the table as shown, each 200,000 x 200 in dimension, and then used sapply to predict over all the remaining matrices. However, even though predict is fast for each matrix, creating the matrices is slow, so the whole run takes two or three times as long, and that is without counting the sapply step. I wonder whether keras has a "smart" way to recognize that in each of these matrices the last N columns are exactly the same? I googled and saw someone mention RepeatVector, but I don't quite understand it, and it seems to be used only for training. I already have the model and just need to predict.
Thank you so much everyone!
One of the most performant ways to feed keras models locally is by creating a tf.data.Dataset object. Please take a look at the tfdatasets R package for guides and example usage.
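As a rough sketch of that approach, assuming model is the trained keras model and x is a numeric matrix laid out the way the model expects (the batch size is only an illustration):

library(keras)
library(tfdatasets)

## stream the data in batches instead of materialising one giant matrix
ds <- tensor_slices_dataset(x) %>%
  dataset_batch(65536) %>%    # large batches amortise per-call overhead
  dataset_prefetch(1)         # prepare the next batch while the current one predicts

preds <- predict(model, ds)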

Random forest survival analysis crashes

I'm trying to run an RFSRC on a 6,500-record data frame with 59 variables:
rfsrc_test <- rfsrc(Surv(TIME, DIED) ~ ., data=test, nsplit=10, na.action = "na.impute")
It seems to work when I run it on 1500 records, but crashes on the entire dataset.
It crashes R without any specific error - sometimes it gives "exceptional processing error".
Any thoughts how to debug this one? I skimmed the database for weird rows without any luck.
We do not know the size of each record, nor the complexity of the variables.
I have encountered similar situations when I have hit the RAM ceiling. R keeps its data in memory and is not designed for massive data sets. Parallel processing could in principle get around this, but R is not designed for that out of the box either, so the next suggestion is to buy more RAM.
My approach would be to reduce the number of variables until you can process 6,500 records, to make sure it's just the data set size. Then I'd pre-screen the fitness of each variable, e.g. with a GLM, and keep the variables that explain a large amount of the data and minimise the residual. Then I'd rerun the survival analysis on the reduced set of variables.
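A minimal sketch of that univariate pre-screening idea, assuming the same test data frame as above and using coxph from the survival package rather than a GLM, since the outcome is a survival time; the 0.05 cutoff is only an illustration:

library(survival)

## every column other than the outcome is a candidate predictor
vars <- setdiff(names(test), c("TIME", "DIED"))

## fit a univariate Cox model per variable and record the likelihood-ratio p-value
pvals <- sapply(vars, function(v) {
  fit <- coxph(as.formula(paste("Surv(TIME, DIED) ~", v)), data = test)
  summary(fit)$logtest["pvalue"]
})

keep <- vars[pvals < 0.05]   # variables to carry into the reduced rfsrc run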
One thing you can check is the time variable: how many different values exist? The survival forest saves a cumulative hazard function for each node, so if the number of unique time points in the dataset is large, the CHFs grow large as well. I had to round my time variable, and this significantly reduced the run time.
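A quick way to check this, plus the rounding workaround, using the same column names as the rfsrc call above (the rounding granularity is whatever your analysis can tolerate):

## how many distinct time points does the forest have to store per node?
length(unique(test$TIME))

## coarsen the time scale, then refit
test$TIME <- round(test$TIME)
rfsrc_test <- rfsrc(Surv(TIME, DIED) ~ ., data = test,
                    nsplit = 10, na.action = "na.impute")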

Free up RAM for ERGM Model R

I'm attempting to run a large amount of data through the ergm function in R. By large, I mean my network graph object has 4,300 vertices and approximately 470,000 total edges. Covariates X, Y, and Z are all categorical. When I run this script, RStudio ultimately crashes, as the model needs more gigabytes of memory than the machine has. I'm aware of the number of combinations that will be generated by the nodemix terms; however, my analysis requires this particular function given the nature of the study. I should also mention that I have already reduced my data as much as possible to account for its size.
I wanted to know whether there is a way to drop any coefficients, by modifying the ergm function behind the scenes, with a -Inf. I could be wrong, but I suspect that a majority of my nodemix combinations will have a -Inf coefficient; if so, I can drop these unnecessary combinations and free up some RAM so the function can run to completion. I am not concerned with any combinations that have a -Inf. Hopefully this question makes sense; if you need any additional information, please let me know. Thanks in advance for your help.
ergm_control <- control.ergm(drop = TRUE, MPLE.max.dyad.types = 500000)

ergm.factor.model <- ergm(sna.network ~ edges +
                            nodemix('Covariate_X', base = 1) +
                            nodemix('Covariate_Y', base = 1) +
                            nodemix('Covariate_Z', base = 1),
                          control = ergm_control)

How to use LSA for dimension reduction in text analytics with R

I am a beginner at data science, and I am working on a text analytics/sentiment analysis project with tweets.
What I have been trying to do is perform some dimension reduction on my training set of tweets, feed the reduced training set into a Naive Bayes learner, and then use the learned Naive Bayes model to predict the sentiment of the test set of tweets.
I have been following the steps in this article:
http://www.analyticskhoj.com/data-mining/text-analytics-part-iv-cluster-analysis-on-terms-and-documents-using-r/
Their explanation is a bit too brief for a beginner like me.
I have used lsa() to create what RStudio labels a "Large LSAspace (3 elements)". Following their example, I've created 3 more data frames:
lsa.train.tk = as.data.frame(lsa.train$tk)
lsa.train.dk = as.data.frame(lsa.train$dk)
lsa.train.sk = as.data.frame(lsa.train$sk)
When I view the lsa.train.tk data, it looks like this (lsa.train.dk looks pretty similar to this matrix):
And my lsa.train.sk looks like the following:
My question is: how do I interpret this information?
How can I use this information to create something I can feed into my NaiveBayes learner? I tried just using lsa.train.sk for the NaiveBayes learner, but I cannot think of any good explanation to justify what I've tried. Any help would be much appreciated!
EDIT:
What I've done so far:
1) Make everything into a term-document matrix.
2) Pass the matrix into the NaiveBayes learner.
3) Predict using the learned model.
My problems are:
1) Accuracy is only 50%, and I realized that it labels everything as positive sentiment (so I could have gotten roughly 0% accuracy if my test set contained only negative-sentiment tweets).
2) The current code is not scalable: since it uses large matrices, I can only handle up to 3.5k rows of data; more than that and my computer crashes. That is why I want to do dimension reduction, so that I can handle more data (such as 10k or 100k rows of tweets).
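In case it helps to see the usual shape of this workflow, here is a hedged sketch. It assumes tdm.train and tdm.test are term-document matrices (terms in rows, documents in columns) built over the same vocabulary and that train.labels holds the training sentiments; the 50 dimensions and the manual fold-in formula are illustrative choices, not the only way to do it:

library(lsa)
library(e1071)

## reduce the training term-document matrix to ~50 latent dimensions
lsa.train <- lsa(tdm.train, dims = 50)

## rows of dk are the training documents in the reduced space -> features
train.features <- as.data.frame(lsa.train$dk)

## fold the test documents into the same space: t(D_new) %*% T %*% S^-1
test.features <- as.data.frame(
  t(tdm.test) %*% lsa.train$tk %*% diag(1 / lsa.train$sk)
)

## Gaussian Naive Bayes on the continuous LSA features
nb.fit <- naiveBayes(train.features, as.factor(train.labels))
pred   <- predict(nb.fit, test.features)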

R knn large dataset

I'm trying to use knn in R (I've tried several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k lines of 8 columns, but my machine seems to be having difficulty with even a 10k-line sample. Any suggestions for doing knn on a dataset bigger than the usual ~50-line examples (i.e. iris)?
EDIT:
To clarify there are a couple issues.
1) The examples in the class and knnflex packages are a bit unclear, and I was curious whether there is an implementation similar to the randomForest package, where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I don't really understand why knn wants both training AND test data for building the model. From what I can tell, the package creates a matrix of roughly nrow(trainingData)^2 elements, which also seems to put an upper limit on the size of the data that can be predicted. I created a model using 5,000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5,000 rows. Thus I would need to either:
a) find a way to use > 5000 lines in a training set
or
b) find a way to use the model on the full 100k lines.
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.
The training data is the model.
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.
The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I do have 8 GB of RAM. Also, I suspect that knn in class will be faster than knnflex, but I haven't done extensive testing.
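For reference, a minimal sketch of the class::knn interface discussed above, with placeholder object names (train.x and test.x are numeric matrices of the 8 predictors, train.y is a factor of default labels):

library(class)

## training and test data go in together; the "model" is the training data itself
pred <- knn(train = train.x, test = test.x, cl = train.y, k = 15, prob = TRUE)

head(pred)                 # predicted class for each test row
head(attr(pred, "prob"))   # proportion of the k neighbours that voted for it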
