I am trying to run a random forest regression on a large dataset in R using the package randomForest. I've run into problems with the computation time required, even when parallelized with doSNOW and 10-20 cores. I think I am misunderstanding the "sampsize" parameter in the function randomForest.
When I subset the dataset to 100,000 rows, I can build 1 tree in 9-10 seconds.
library(dplyr)  # for sample_n()
training = read.csv("training.csv")
t100K = sample_n(training, 100000)  # subset 100,000 rows up front
system.time(randomForest(tree~., data=t100K, ntree=1, importance=T)) # ~10 sec
But, when I use the sampsize parameter to sample 100,000 rows from the full dataset in the course of running randomForest, the same 1 tree takes hours.
system.time(randomForest(tree~., data=training, sampsize = ifelse(nrow(training<100000),nrow(training), 100000), ntree=1, importance=T)) #>>100x as long. Why?
Obviously, I am eventually going to run >>1 tree. What am I missing here? Thanks.
Your brackets are slightly off. Notice the difference between the following statements. You currently have:
ifelse(nrow(mtcars<10),nrow(mtcars), 10)
which takes the number of rows of the logical matrix mtcars < 10, a matrix that is TRUE wherever an element of mtcars is smaller than 10 and FALSE otherwise. That row count is always just nrow(mtcars). You want:
ifelse(nrow(mtcars)<10,nrow(mtcars), 10)
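To see why this makes the call so slow: because nrow(mtcars < 10) equals nrow(mtcars), and any non-zero number is treated as TRUE by ifelse(), your expression always returns nrow(training). In other words, sampsize silently becomes the full dataset, so the tree is grown on all the rows instead of on 100,000. A quick check, using mtcars as a stand-in:
nrow(mtcars < 10)                              # 32, the same as nrow(mtcars)
ifelse(nrow(mtcars < 10), nrow(mtcars), 10)    # 32: the non-zero test is coerced to TRUE
ifelse(nrow(mtcars) < 10, nrow(mtcars), 10)    # 10: the comparison you intended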
Hope this helps.
In the traditional gbm package, we can use
predict.gbm(model, newdata=..., n.trees=...)
so that I can compare results for different numbers of trees on the test data.
In h2o.gbm, although there is an n.tree argument to set, it doesn't seem to have any effect on the result; predictions are all the same as with the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can it be solved? h2o.gbm is much faster than gbm, so if it could give detailed results for each number of trees, that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,  # max desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
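As a sketch of that early-stopping suggestion, using the same iris frames as above (the stopping metric and thresholds here are illustrative assumptions, not recommendations):
m_es <- h2o.gbm(1:4, 5, train,
                validation_frame = valid,
                ntrees = 1000,               # generous upper bound
                score_tree_interval = 1,
                stopping_rounds = 5,         # stop if no improvement over 5 scoring rounds
                stopping_metric = "logloss",
                stopping_tolerance = 0.001)
h2o.scoreHistory(m_es)  # shows how many trees were actually built before stopping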
As of 3.20.0.6 H2O does support this. The method you are looking for is
staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
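From those staged predictions, a per-tree metric is just a loop over the columns. A rough sketch for the regression case, assuming the returned frame has one prediction column per tree (check the column names for your H2O version) and reusing whatever R2() helper you already have:
staged <- as.data.frame(h2o.staged_predict_proba(model, prostate.test))
actual <- as.data.frame(prostate.test)$y      # assumed name of the response column
r2_by_tree <- sapply(seq_len(ncol(staged)), function(i) R2(staged[[i]], actual))
plot(r2_by_tree, type = "b", xlab = "number of trees", ylab = "R2")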
I've got a rather small dataset (162,000 observations with 13 attributes)
that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it has been running for nearly 24 hours without any progress (the H2O GBM progress report stays at 0%).
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as data is not particularly large.
here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is that when you ask for 1 tree, it actually builds one tree per level of the response column under the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy, such as zip codes or ICD-10 medical diagnostic codes, for example.
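A rough sketch of that two-stage idea, reusing the variable names from the question. Everything else here is hypothetical: class_group is an assumed column that maps each of the original classes to one of K coarse groups, and the mapping itself has to come from your domain knowledge or a separate clustering step:
library(h2o)
h2o.init()
train.h20 <- as.h2o(analdata_train)   # now including the hypothetical class_group column

# Stage 1: predict the coarse group (K levels instead of ~20,000)
stage1 <- h2o.gbm(x = independ_vars, y = "class_group",
                  training_frame = train.h20,
                  ntrees = 50, max_depth = 5)

# Stage 2: one model per group, trained only on that group's rows,
# predicting the original fine-grained class
groups <- h2o.levels(train.h20$class_group)
stage2 <- lapply(groups, function(g) {
  sub <- train.h20[train.h20$class_group == g, ]
  h2o.gbm(x = independ_vars, y = response_var,
          training_frame = sub,
          ntrees = 50, max_depth = 5)
})
names(stage2) <- groups
# At prediction time: use stage1 to pick a group, then the matching stage2 model.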
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.
Usually we fix a seed number to produce the same split every time we run the code. So the code
set.seed(12345)
data <- (1:100)
train <- sample(data, 50)
test <- (1:100)[-train]
always gives the same train and test sets (since we fixed the seed).
Now, assume that I have data, train, and test. Is there a way to know which seed number was used to produce train and test from data?
Best.
It's not possible to know with absolute mathematical certainty: but if you have a suspicion about the range in which the seed lies, you can check every seed in that range by "brute force" and see if it leads to the same result.
For example, you could check seeds from 1 to a million with the following code:
tests <- sapply(1:1e6, function(s) {
  set.seed(s)
  this_train <- sample(data, 50)
  all(this_train == train)
})
which(tests)
# 12345
A few notes:
If your dataset or your sample is much smaller, you will start getting collisions: multiple seeds that give the same output. For example, if you were sampling 5 from 10 rather than 50 from 100, there are 34 seeds in the 1:1e6 range that would produce the same result.
If you have absolutely no suspicion about how the seed was set, you'd have to check every integer from -.Machine$integer.max to .Machine$integer.max, which is roughly 4.3 billion checks (that will take a while, and you may have to get clever about not storing all the results).
If there were random numbers generated after the set.seed(), you'd need to replicate that same behavior in between the set.seed and sample lines in your function.
The behavior of sample() after a seed is set changed in R 3.6.0 (the default sample.kind switched from "Rounding" to "Rejection"), so you may not be able to reproduce a split created on an earlier version of R; see the sketch below.
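For that last point, newer versions of R can be told to reproduce the old sampling behaviour, which you would need inside the brute-force loop if the split was made on R < 3.6.0. A sketch (suppressWarnings just silences the warning about using the old sampler):
suppressWarnings(RNGkind(sample.kind = "Rounding"))  # pre-3.6.0 sample() behaviour
tests_old <- sapply(1:1e6, function(s) {
  set.seed(s)
  all(sample(data, 50) == train)
})
which(tests_old)
RNGkind(sample.kind = "Rejection")  # restore the current default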
No, this is not possible. Multiple seeds can produce the same series of data. It's non-reversible.
I am running glmnet, favoring lasso regression, on a 16-core machine. I have some 800K rows and around 2K columns in sparse matrix format, and the model is trained to predict the probability of the outcome in the first column.
This process has become very slow. Is there a way to speed it up, either by parallelizing across the folds or by selecting a smaller number of rows without affecting the accuracy? If so, which would be better?
The process can be expedited by parallelization: as explained in the comment link above, running glmnet in parallel in R is done by setting the parallel=TRUE option in cv.glmnet(), once you have registered the number of cores, like this:
library(doParallel)
registerDoParallel(5)
m <- cv.glmnet(x, y, family="binomial", alpha=0.7, type.measure="auc",
               grouped=FALSE, standardize=FALSE, parallel=TRUE)
Reducing the number of rows is more of a judgement call, based on the AUC on a test set: if it stays above your threshold after the rows are reduced, then it is certainly a good idea.
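If you want to test the row-reduction idea, a simple check is to fit on a random subset and compare the cross-validated AUC against the full fit (a sketch; the subset size of 200K is an arbitrary assumption):
set.seed(1)
idx <- sample(nrow(x), 2e5)  # e.g. 200K of the 800K rows
m_sub <- cv.glmnet(x[idx, ], y[idx], family="binomial", alpha=0.7,
                   type.measure="auc", standardize=FALSE, parallel=TRUE)
max(m_sub$cvm)  # best cross-validated AUC on the subset
max(m$cvm)      # compare with the full-data fit from above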
I have a dataset with 1 million rows and 100 columns. randomForest is quite slow for data this big, so I would like to train each tree on a subset of, say, 50,000 rows.
How do I achieve this with the randomForest function? Do I have to hack something together manually? I am not able to find any instruction on this in the vignette.
Do you mean that the sample for each tree should be different?
To start with, I would consider sampling before calling randomForest. Indeed, the fact that you take different samples for each tree could have an impact on the final result, and the importance matrix would probably be partly biased.
You can achieve this as follows:
numrow <- nrow(data)
subset <- sample(numrow, 50000)   # draw 50,000 row indices once
learn <- data[subset,]            # training subset
test <- data[-subset,]            # remaining rows for testing
model_rf <- randomForest(formula=[...], data=learn, importance=T)
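If you do want each tree to see a different random draw of rows (rather than one fixed subset for the whole forest), the sampsize argument discussed in the first question above does exactly that; a minimal sketch on the same data, with an illustrative ntree:
# each tree is grown on its own random draw of 50,000 rows (without replacement)
model_rf2 <- randomForest(formula=[...], data=data, sampsize=50000,
                          replace=FALSE, ntree=100, importance=T)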