I'm trying to run RFSRC on a data frame with 6,500 records and 59 variables:
rfsrc_test <- rfsrc(Surv(TIME, DIED) ~ ., data=test, nsplit=10, na.action = "na.impute")
It seems to work when I run it on 1500 records, but crashes on the entire dataset.
It crashes R without any specific error - sometimes it gives "exceptional processing error".
Any thoughts on how to debug this one? I skimmed the data for weird rows without any luck.
We do not know the size of each record, nor the complexity of the variables.
I have encountered similar situations when I hit the RAM limit. R is not designed for massive data sets. Parallel processing might help, but R is not really designed for that either, so the next suggestion is to buy more RAM.
My approach would be to reduce the number of variables until you can process all 6,500 records (to make sure it's just the data set size). Then I'd pre-screen the fit of each variable, e.g. with a GLM, and keep the variables that explain a large share of the variation in the outcome and minimise the residual. Then I'd rerun the survival analysis on the reduced set of variables.
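A minimal sketch of that pre-screening step, assuming one single-predictor GLM per variable, ranked by residual deviance. The mtcars data and the mpg outcome are stand-ins, not the poster's data; for a survival outcome a Cox model per variable would be the closer analogue, and univariate screening can miss variables that only matter in combination.

```r
# Stand-in data: mtcars, with mpg playing the role of the outcome.
outcome    <- "mpg"
predictors <- setdiff(names(mtcars), outcome)

# Fit one single-predictor GLM (gaussian by default) per variable
# and record its residual deviance.
dev <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = outcome), data = mtcars)
  deviance(fit)
})

# Smaller residual deviance = the variable alone explains more of the outcome.
# Keeping the top 5 is an arbitrary cut-off chosen for illustration.
screened <- names(sort(dev))[1:5]
screened
```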
One thing you can check is the time variable: how many different values exist? The survival forest will save a cumulative hazard function (CHF) for each node. If the number of unique time points in the dataset is large, then the CHFs grow large as well. I had to round my time variable, and this significantly reduced the run time.
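As a rough illustration of that check, the sketch below counts unique time points before and after coarsening. The simulated times and the 30-day bin width are assumptions made up for the example. The randomForestSRC documentation also describes an ntime argument to rfsrc() for restricting calculations to a grid of time points, which may be worth checking in your installed version.

```r
set.seed(42)

# Hypothetical follow-up times in days, recorded with fractional precision.
time <- rexp(6500, rate = 1 / 365)

# Nearly every record has its own distinct time point.
n_unique_raw <- length(unique(time))

# Coarsen to 30-day bins; each value is replaced by the upper edge of its bin.
time_binned <- ceiling(time / 30) * 30
n_unique_binned <- length(unique(time_binned))

c(raw = n_unique_raw, binned = n_unique_binned)
```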
I'm trying to estimate a logit using the glm function in R, where my data set has about 40,000 observations and I'm trying to use as a control a factor with about 1,800 levels. It's a data set of mayoral candidates in cities. I stopped it after 10 minutes, but I'm not sure if this will take minutes, hours, days, weeks, or longer to finish. Is there any way to estimate how long it will take?
Converting my comments to an answer:
There's not really a way to pre-compute time... it will depend on a lot of factors, including the computer you're running it on. You could use the control parameters to set trace = TRUE which will give you output every iteration. The default is a maximum of 25 iterations. So monitoring that as it runs will give you a sense of how quickly things are moving.
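A small sketch of the tracing idea, using simulated data (the variables x, g and y are invented for the example). glm.control() is the base R helper that holds the trace and maxit settings.

```r
set.seed(1)
n <- 2000

# Simulated stand-in data: one numeric predictor and a 5-level factor.
d <- data.frame(
  x = rnorm(n),
  g = factor(sample(letters[1:5], n, replace = TRUE))
)
d$y <- rbinom(n, 1, plogis(0.5 * d$x))

# trace = TRUE prints the deviance after every IRLS iteration;
# maxit = 25 is the default iteration cap, shown here explicitly.
fit <- glm(y ~ x + g, data = d, family = binomial,
           control = glm.control(trace = TRUE, maxit = 25))

fit$iter       # number of iterations actually used
fit$converged  # whether the fit converged within maxit
```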
You could run your model on increasing subsets of your data to see how it scales. Do 5k rows with 200 levels of your factor. Then 10k rows with 400 levels, etc. Doing this 4 or 5 times should give you a decent sense. Don't expect the growth in time to be linear...
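The scaling experiment could look something like the sketch below. The data are simulated and the sizes are kept deliberately small so it runs quickly; make_data is a hypothetical helper, and in practice you would subset your real data and use sizes closer to the ones suggested above.

```r
set.seed(1)

# Hypothetical helper: simulate n rows with a factor of n_levels levels.
make_data <- function(n, n_levels) {
  d <- data.frame(
    x = rnorm(n),
    g = factor(sample(n_levels, n, replace = TRUE))
  )
  d$y <- rbinom(n, 1, plogis(0.5 * d$x))
  d
}

# Grow rows and factor levels together and time each fit.
sizes <- data.frame(n = c(500, 1000, 2000), n_levels = c(10, 20, 40))
sizes$seconds <- mapply(function(n, n_levels) {
  d <- make_data(n, n_levels)
  system.time(glm(y ~ x + g, data = d, family = binomial))[["elapsed"]]
}, sizes$n, sizes$n_levels)

sizes
```

Plotting seconds against n (or fitting a curve on the log scale) gives a crude basis for extrapolating to the full data set.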
A better use of your time may be finding ways to speed up the estimation. With that many factor levels, a sparse matrix will certainly help out. The fastglm package looks quite nice (though I've never used it). This question has several answers with ideas for speeding up glm estimation.
I am relatively new to the machine learning ocean, please excuse me if some of my questions are really basic.
Current situation: The overall goal is to improve some code that uses the h2o package in R and runs on a supercomputer cluster. Because the data is so large that a single node with h2o takes more than a day, we have decided to use multiple nodes to run the model. I came up with an idea:
(1) Assign each node (nTree/num_node) trees to build and save as a model;
(2) run on the cluster so that each node grows its (nTree/num_node) trees of the forest;
(3) merge the trees back together to re-form the original forest, and average the measurement results.
I later realized this could be risky, but I cannot find a clear statement for or against it, since I am not a machine-learning-focused programmer.
Questions:
If handling a random forest this way carries some risk, please point me to a link so I can get a basic idea of why it is not right.
If this approach is actually OK, what should I do to merge the trees? Is there a package or method I can borrow from?
If this is actually a solved problem, please point me to a link; I may have searched with the wrong keywords. Thank you!
A concrete example with real numbers:
I have a random forest task with 80k rows and 2k columns and want 64 trees. What I have done is put 16 trees on each node, each running with the whole dataset, so each of the four nodes comes up with an RF model. I am now trying to merge the trees from each model into one big RF model and average the measurements (from each of those four models).
There is no need to merge the models. Unlike with boosting methods, every tree in a Random Forest is grown independently (just don't set the same seed prior to kicking off RF on each node!).
You are basically doing what Random Forest does on its own, which is to grow X independent trees and then average across the votes. Many packages provide an option to specify the number of cores or threads, in order to take advantage of this feature of RF.
In your case, since you have the same number of trees per node, you'll get 4 "models" back, but those are really just collections of 16 trees. To use it, I'd just keep the 4 models separate and when you want a prediction, average the prediction from each of the 4 models. Assuming you're going to be doing that more than once, you could write a small wrapper function to predict with the 4 models and average the output.
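A minimal sketch of such a wrapper, assuming each sub-model works with R's generic predict() and returns a numeric vector. The name predict_avg is hypothetical, and the lm fits on bootstrap samples are only stand-ins for the four 16-tree forests so that the example runs with base R. With h2o the prediction call and return type differ, so the body would need adapting.

```r
# Predict with every sub-model and average the results.
# Equal weights are only appropriate because every sub-model
# holds the same number of trees.
predict_avg <- function(models, newdata, ...) {
  preds <- lapply(models, predict, newdata = newdata, ...)
  Reduce(`+`, preds) / length(preds)
}

# Stand-in "models": four fits on different bootstrap samples.
# Note that each one sees a different random draw, mirroring the
# advice not to reuse the same seed on every node.
set.seed(1)
models <- lapply(1:4, function(i) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  lm(mpg ~ wt + hp, data = boot)
})

avg <- predict_avg(models, newdata = mtcars[1:5, ])
avg
```

For classification, the same idea applies to predicted class probabilities rather than to hard class labels.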
10,000 rows by 1,000 columns is not overly large and should not take that long to train an RF model.
It sounds like something unexpected is happening.
While you can try to average models if you know what you are doing, I don't think it should be necessary in this case.
I want to use glm( ... , family = "binomial") to do a logistic regression with my big dataset which has 80,000,000 rows and 125 columns as a data.frame. But when I run in RStudio, it just crashes:
So I wonder what the time complexity of glm() is, and whether there are any solutions to handle such data? Someone suggested I try running the code from command line: does this make any difference (I tried, but it seems that doesn't work either)?
Memory requirement: R has to load the entire dataset into memory (RAM). However, your dataset (assuming 32-bit entries) is roughly 37 gigabytes, and about twice that if the columns are stored as R's default 8-byte doubles. That is much larger than the amount of RAM you have on your computer, so it crashes. You cannot use R for this dataset unless you use some special big data packages, and I'm not sure it's even feasible then.
There are other languages that do not need to load the data into memory to work with it, so it might be wise to use one of those.
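The size estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it ignores object overhead and the extra copies that glm() makes (model frame, design matrix, working weights), so real usage would be higher.

```r
n_rows <- 80e6
n_cols <- 125
cells  <- n_rows * n_cols

# 4-byte entries (e.g. R integers) versus 8-byte doubles (R's default numeric).
gib_int32  <- cells * 4 / 2^30
gib_double <- cells * 8 / 2^30

round(c(int32 = gib_int32, double = gib_double), 1)
```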
Time complexity for GLMs: if N = # of observations (usually # of rows), and p = # of variables (usually # of columns), it is O(p^3 + Np^2) per iteration for most standard GLM algorithms.
For your situation, that works out to roughly 10^12 operations per iteration, which is still barely in the realm of possibility, but you would probably need more than one modern PC running for at least a few days.
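The 10^12 figure can be checked directly, assuming the usual per-iteration cost of iteratively reweighted least squares, p^3 + N p^2. The result is an order-of-magnitude operation count, not a run-time prediction.

```r
N <- 80e6   # observations
p <- 125    # variables

# Forming the weighted cross-product costs about N * p^2;
# solving the resulting p x p system costs about p^3.
ops_per_iter <- p^3 + N * p^2

format(ops_per_iter, scientific = TRUE, digits = 3)
```

The N p^2 term dominates completely here, so the cost grows linearly in the number of rows and quadratically in the number of columns.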
I have a large dataset in R (1M+ rows by 6 columns) that I want to use to train a random forest (using the randomForest package) for regression purposes. Unfortunately, I get an "Error in matrix(0, n, n) : too many elements specified" error when trying to do the whole thing at once, and "cannot allocate enough memory" kinds of errors when running it on a subset of the data, down to 10,000 or so observations.
Seeing that there is no chance I can add more RAM on my machine and random forests are very suitable for the type of process I am trying to model, I'd really like to make this work.
Any suggestions or workaround ideas are much appreciated.
You're likely asking randomForest to create the proximity matrix for the data, which if you think about it, will be insanely big: 1 million x 1 million. A matrix this size would be required no matter how small you set sampsize. Indeed, simply Googling the error message seems to confirm this, as the package author states that the only place in the entire source code where n,n) is found is in calculating the proximity matrix.
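To put a number on "insanely big", here is the arithmetic for an n x n matrix of doubles with n = 1 million. This is a sketch of the size only. If the call does pass proximity = TRUE to randomForest(), dropping that argument is the obvious first thing to try.

```r
n <- 1e6
cells <- n^2

# Storage needed if every cell is an 8-byte double, in terabytes.
tb_needed <- cells * 8 / 1e12
tb_needed

# The cell count also exceeds the largest value an R integer can hold.
cells > .Machine$integer.max
```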
But it's hard to help more, given that you've provided no details about the actual code you're using.
I'm trying to use knn in R (I have tried several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k lines of 8 columns, but my machine seems to be having difficulty with a sample of 10k lines. Any suggestions for doing knn on a dataset > 50 lines (i.e. iris)?
EDIT:
To clarify, there are a couple of issues.
1) The examples in the class and knnflex packages are a bit unclear and I was curious if there was some implementation similar to the randomForest package where you give it the variable you want to predict and the data you want to use to train the model:
RF <- randomForest(x, y, ntree, type,...)
then turn around and use the model to predict data using the test data set:
pred <- predict(RF, testData)
2) I don't really understand why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix with roughly nrow(trainingData)^2 entries, which also seems to be an upper limit on the size of the predicted data. I created a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets > 5000 rows. Thus I would need to either:
a) find a way to use > 5000 lines in a training set
or
b) find a way to use the model on the full 100k lines.
The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.
The training data is the model.
To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
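The point is easier to see in a bare-bones implementation. The sketch below is a brute-force k-nearest-neighbour classifier in base R, written for illustration; it is not how class::knn is implemented, and knn_predict is a hypothetical name. Notice there is no fitting step: the training data are simply carried into the prediction loop.

```r
# Brute-force kNN: for each test row, measure the distance to every
# training row and take a majority vote among the k closest.
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(obs) {
    # Euclidean distance from this test observation to each training row.
    d <- sqrt(colSums((t(train_x) - obs)^2))
    nearest <- order(d)[seq_len(k)]
    # Majority vote among the labels of the k nearest training rows.
    names(which.max(table(train_y[nearest])))
  })
}

set.seed(1)
idx  <- sample(nrow(iris), 100)   # 100 training rows, 50 test rows
pred <- knn_predict(iris[idx, 1:4], iris$Species[idx],
                    iris[-idx, 1:4], k = 5)

accuracy <- mean(pred == iris$Species[-idx])
accuracy
```

Because each test row is handled on its own, memory use here is proportional to the training set rather than to its square, at the cost of being slow for large inputs.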
The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function. All the work is in the "predict" function. And those are really intended as wrappers to be used for error estimation using cross validation.
As far as limitations on the number of cases, that will be dependent on how much physical memory you have. If you're getting memory allocation errors, then you either need to reduce your RAM usage elsewhere (close apps, etc), buy more RAM, buy a new computer, etc.
The knn function in class runs fine for me with training and test data sets of 10k rows or more, although I have 8 GB of RAM. Also, I suspect that knn in class will be faster than in knnflex, but I haven't done extensive testing.