I am interested in running a Random Forest model on a very large dataset. I have been reading about "parallel computing" in an effort to make the code run faster. I came across this post over here (parallel execution of random forest in R) that had some suggestions:
library(randomForest)
library(doMC)
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)
rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
              .multicombine=TRUE, .packages='randomForest') %dopar% {
  randomForest(x, y, ntree=ntree)
}
I am trying to understand what is happening in the above code. My guess is that 6 Random Forest models (each with 25000 trees) are being fit to the dataset and then combined into a single model?
I started looking into the "combine()" function in R (https://cran.r-project.org/web/packages/randomForest/randomForest.pdf) - it seems that the "combine()" function is combining several Random Forest models into a single model (here, I think 3 Random Forest models are being combined into a single model):
data(iris)
rf1 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf2 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf3 <- randomForest(Species ~ ., iris, ntree=50, norm.votes=FALSE)
rf.all <- combine(rf1, rf2, rf3)
print(rf.all)
My Question: Can someone please confirm if I have understood this correctly? In the above code, are 6 Random Forest models being trained in parallel and then combined into a single model - is this correct?
References:
https://stats.stackexchange.com/questions/519640/parallelizing-random-forest-learning-in-r-changes-the-class-of-the-rf-object
https://rpubs.com/chidungkt/315749
https://www.learnbymarketing.com/724/parallel-processing-r-basics/
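Yes, that is correct. foreach's .combine argument takes a function that is applied to the results of the iterations to merge them. Here ntree=rep(25000, 6) gives six iterations, each fitting its own 25000-tree forest on the same data, and randomForest::combine merges them into a single forest of 150000 trees.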
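A quick way to check this is to run a smaller version of the loop and inspect the combined object. The sketch below assumes the foreach and randomForest packages are installed. It uses %do% (sequential) in place of %dopar% so that no parallel backend is needed, and the tree counts are scaled down purely for illustration.

```r
library(randomForest)
library(foreach)

set.seed(1)
x <- matrix(runif(500), 100)   # 100 rows, 5 predictors
y <- gl(2, 50)                 # two classes, 50 rows each

# Three iterations, each fitting its own 50-tree forest on the same data.
# %do% runs them sequentially; swap in %dopar% once a backend is registered.
rf <- foreach(ntree = rep(50, 3), .combine = randomForest::combine,
              .multicombine = TRUE) %do% {
  randomForest(x, y, ntree = ntree)
}

rf$ntree              # total number of trees in the merged forest
is.null(rf$err.rate)  # the combine() help page says err.rate is not kept
```

The merged object should report the sum of the per-iteration tree counts. Note that, according to the combine() documentation, components such as err.rate and confusion are NULL in the combined forest, so OOB error has to be assessed some other way.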
Related
I'm using Databricks with the SparkR package to build a glm model. Everything seems to run ok except when I run summary(lm1). Instead of getting Variable, Estimate, Std.Error, t-value & p-value (see pic below - this is what I'd expect to see, NOT what I'm getting), I just get the variable and estimate. The only thing I can think is that the data set is big enough (train1 is 12 million rows and test1 is 6 million rows) that all estimates have 0 p-values. Any other reasons this would happen??
library(SparkR)
rdf <- sql("select * from myTable") #read data
train1 <- rdf[rdf$ntile_3 != 1,] # split into test and train based on ntile in table
test1 <- rdf[rdf$ntile_3 == 1,]
vtu1 <- c('var1','var2','var3')
lm1 <- glm( target ~., train1[,c(vtu1,'target' )],family = 'gaussian')
pred1 <- predict(lm1, test1)
summary(lm1)
As you specify family = 'gaussian' in your model, your glm model is equivalent to a standard linear regression model (the kind fitted by lm in R).
For an extensive answer to your question, see for example here: https://stats.stackexchange.com/questions/187100/interpreting-glm-model-output-assessing-quality-of-fit
If you specify your model using lm, you should get the output you want.
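To illustrate the equivalence, here is a minimal sketch in base R on an in-memory data frame. It uses the built-in mtcars data rather than a SparkDataFrame, so it only shows how stats::glm and stats::lm relate; it does not show what SparkR's own summary prints.

```r
# Base R on an in-memory data frame (not SparkR); mtcars ships with R.
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian)
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)

# Same coefficient estimates from both fits
all.equal(coef(fit_glm), coef(fit_lm))

# Both summaries carry estimate, standard error, t value and p-value columns
colnames(coef(summary(fit_lm)))
colnames(coef(summary(fit_glm)))
```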
Here is my code for randomForest and rfsrc in R. Is there any way to set the equivalents of sklearn's n_estimators and max_depth in my R code? Also, how can I plot OOB error against the number of trees, like this?
set.seed(2234)
tic("Time to train RFSRC fast")
fast.o <- rfsrc.fast(Label ~ ., data = train[(1:50000),],forest=TRUE)
toc()
print(fast.o)
#print(vimp(fast.o)$importance)
set.seed(2367)
tic("Time to test RFSRC fast ")
#data(breast, package = "randomForestSRC")
fast.pred <- predict(fast.o, test[(1:50000),])
toc()
print(fast.pred)
set.seed(3)
tic("RF model fitting without Parallelization")
rf <-randomForest(Label~.,data=train[(1:50000),])
toc()
print(rf)
plot(rf)
varImp(rf,sort = T)
varImpPlot(rf, sort=T, n.var= 10, main= "Variable Importance", pch=16)
rf_pred <- predict(rf, newdata=test[(1:50000),])
confMatrix <- confusionMatrix(rf_pred,test[(1:50000),]$Label)
confMatrix
I appreciate your time.
You need to set block.size=1 so that the cumulative error rate is recorded for every tree. Also note that rfsrc samples without replacement by default; you can check the vignette for rfsrc:
Unlike Breiman's random forests, the default action here is sampling
without replacement. Thus out-of-bag (OOB) technically means
out-of-sample, but for legacy reasons we retain the term OOB.
So using an example dataset,
library(mlbench)
library(randomForestSRC)
data(Sonar)
set.seed(911)
trn = sample(nrow(Sonar),150)
rf <- rfsrc(Class ~ ., data = Sonar[trn,],ntree=500,block.size=1,importance=TRUE)
pred <- predict(rf,Sonar[-trn,],block.size=1)
# steelblue: OOB error on the training data; orange: error on the held-out rows
plot(rf$err.rate[,1],type="l",col="steelblue",xlab="ntrees",ylab="err.rate",
     ylim=c(0,0.5))
lines(pred$err.rate[,1],col="orange")
legend("topright",fill=c("steelblue","orange"),c("OOB.train","test"))
In randomForest:
library(randomForest)
rf <- randomForest(Class ~ ., data = Sonar[trn,],ntree=500)
pred <- predict(rf,Sonar[-trn,],predict.all=TRUE)
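I am not sure whether there is an easier way to get the error by number of trees:
# For each i, take the majority vote over the first i trees for every test row.
# table() counts votes; rle() would only count consecutive runs.
err_by_tree = sapply(1:ncol(pred$individual),function(i){
  apply(pred$individual[,1:i,drop=FALSE],1,
        function(v) names(which.max(table(v))))
})
err_by_tree = colMeans(err_by_tree!=Sonar$Class[-trn])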
Then plot:
# steelblue: OOB error on the training data; orange: error on the held-out rows
plot(rf$err.rate[,1],type="l",col="steelblue",xlab="ntrees",ylab="err.rate",
     ylim=c(0,0.5))
lines(err_by_tree,col="orange")
legend("topright",fill=c("steelblue","orange"),c("OOB.train","test"))
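On the n_estimators and max_depth part of the question: in randomForest the number of trees is ntree. There is no direct depth argument, but maxnodes caps the number of terminal nodes per tree, which limits depth indirectly. A minimal sketch on iris, where the values 200 and 8 are arbitrary and chosen only for illustration:

```r
library(randomForest)

set.seed(42)
# ntree plays the role of n_estimators; maxnodes caps terminal nodes per tree
rf_small <- randomForest(Species ~ ., data = iris, ntree = 200, maxnodes = 8)

rf_small$ntree             # number of trees grown
max(treesize(rf_small))    # largest terminal-node count across the trees
```

For rfsrc, as I understand it, the corresponding arguments are ntree and nodedepth, but check the package help before relying on that.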
For mixed-effects models, I am putting together a side-by-side list of equivalent syntax in the lme4 and nlme packages.
For an independent random intercept and slope, if I use the following code in the lme4 package, what is the corresponding code in nlme?
model1 <- lmer(y~A + (1+site) + (0+A|site), data, REML = FALSE)
Also, for nested random effects, which are estimated differently from the above, are my scripts correct?
model2 <- lme(y~A, random = ~1+site/A, data, method="REML")
and
model3 <- lmer(y~A + (1|site) + (1|site:A), data, method=FALSE)
Thank you so much!
I hope this answer is not too late!
For your first model the version in nlme would be:
model1 <- lme(y ~ A,
              random = list(site = pdDiag(~ A)),  # uncorrelated intercept and slope by site
              data = data, method = "ML")         # ML to match REML = FALSE in lmer
Your second and third models are equivalent. Model 3 in the lme4 package can also be written as follows (note that lmer takes REML=FALSE, not method=FALSE):
model3 <- lmer(y~A + (1|site/A), data, REML=FALSE)
I found this link, which might help you a lot in comparing the nlme and lme4 packages:
https://rpsychologist.com/r-guide-longitudinal-lme-lmer#conditional-growth-model-dropping-intercept-slope-covariance
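As a sanity check on the lme4 to nlme mapping for an uncorrelated random intercept and slope, the two fits can be compared on a dataset that ships with lme4. This sketch assumes both packages are installed and uses sleepstudy (Reaction, Days, Subject) in place of the y, A and site variables from the question.

```r
library(lme4)   # lmer, sleepstudy
library(nlme)   # lme, pdDiag

# lme4: uncorrelated random intercept and slope for Days within Subject
m_lme4 <- lme4::lmer(Reaction ~ Days + (1 | Subject) + (0 + Days | Subject),
                     data = sleepstudy, REML = FALSE)

# nlme: pdDiag() gives a diagonal random-effects covariance (no correlation)
m_nlme <- nlme::lme(Reaction ~ Days,
                    random = list(Subject = pdDiag(~ Days)),
                    data = sleepstudy, method = "ML")

# The two fits should agree closely on fixed effects and log-likelihood
lme4::fixef(m_lme4)
nlme::fixef(m_nlme)
c(lme4 = as.numeric(logLik(m_lme4)), nlme = as.numeric(logLik(m_nlme)))
```

Small differences in the last digits are expected, since the two packages use different optimizers.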
I would like to create confusion matrices for a multinomial logistic regression as well as a proportional odds model but I am stuck with the implementation in R. My attempt below does not seem to give the desired output.
This is my code so far:
library(nnet)  # multinom
library(MASS)  # polr
CH <- read.table("http://data.princeton.edu/wws509/datasets/copen.dat", header=TRUE)
CH$housing <- factor(CH$housing)
CH$influence <- factor(CH$influence)
CH$satisfaction <- factor(CH$satisfaction)
CH$contact <- factor(CH$contact)
CH$satisfaction <- factor(CH$satisfaction,levels=c("low","medium","high"))
CH$housing <- factor(CH$housing,levels=c("tower","apartments","atrium","terraced"))
CH$influence <- factor(CH$influence,levels=c("low","medium","high"))
CH$contact <- relevel(CH$contact,ref=2)
model <- multinom(satisfaction ~ housing + influence + contact, weights=n, data=CH)
summary(model)
preds <- predict(model)
table(preds,CH$satisfaction)
omodel <- polr(satisfaction ~ housing + influence + contact, weights=n, data=CH, Hess=TRUE)
preds2 <- predict(omodel)
table(preds2,CH$satisfaction)
I would really appreciate some advice on how to correctly produce confusion matrices for my 2 models!
You can refer to this related question: "Predict() - Maybe I'm not understanding it".
In predict() you need to pass the unseen data (the newdata argument) that you want predictions for.
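One further point, offered as a suggestion rather than a confirmed diagnosis: the data in the question are grouped, with n giving the number of respondents per row, so table(preds, CH$satisfaction) counts rows rather than respondents. xtabs() with n on the left-hand side weights each cell by those counts. The sketch below uses a small made-up grouped data frame (d, actual, x, n and their values are hypothetical) and assumes the nnet package is installed.

```r
library(nnet)

# Hypothetical grouped data: each row stands for n observations
d <- data.frame(
  actual = factor(c("low", "high", "low", "high")),
  x      = factor(c("a", "a", "b", "b")),
  n      = c(30, 10, 5, 25)
)

fit <- multinom(actual ~ x, weights = n, data = d, trace = FALSE)
d$pred <- predict(fit, newdata = d)   # predicted class for each row

# Confusion matrix weighted by the group counts (rows: predicted, cols: actual)
cm <- xtabs(n ~ pred + actual, data = d)
cm
sum(diag(cm)) / sum(cm)   # weighted accuracy
```

The same xtabs() call should carry over to the polr predictions, since predict() on that model also returns a class factor by default.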
Is there a function within caret (or another package) that can perform a Breusch-Pagan / Cook-Weisberg test for heteroskedasticity on an 'nnet' model trained using caret?
E.g. something similar to library(car); ncvTest or library(lmtest); bptest for lm objects, but that works on nnet objects created from caret?
Example data
library(caret)
set.seed(4)
n <- 100
x1i <- rnorm(n)
x2i <- rnorm(n)
yi <- rnorm(n)
dat <- data.frame(yi, x1i, x2i)
mod <- train(yi ~., data=dat, method="nnet", trace=FALSE, linout=TRUE)
This produces the plot of fitted vs residuals:
No, there is nothing like that in the package right now.
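If a manual check is acceptable, a Breusch-Pagan-style statistic can be computed from any model's residuals by regressing the squared residuals on the predictors and taking n times the R-squared of that auxiliary regression (the studentized, Koenker form). This is only a sketch: bp_from_residuals is a hypothetical helper, not a caret or nnet function, and whether the chi-squared reference distribution is appropriate for a neural network fit is an assumption that is not established here. A plain lm stands in for the fitted model to keep the example self-contained; with the caret model above, the residuals would presumably be yi - predict(mod, dat).

```r
# Hypothetical helper: studentized (Koenker) Breusch-Pagan-style statistic
bp_from_residuals <- function(res, predictors) {
  aux_data <- data.frame(sq = res^2, predictors)
  aux  <- lm(sq ~ ., data = aux_data)             # auxiliary regression
  stat <- length(res) * summary(aux)$r.squared    # n * R^2
  df   <- ncol(predictors)
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(4)
n   <- 100
dat <- data.frame(yi = rnorm(n), x1i = rnorm(n), x2i = rnorm(n))

# Stand-in fit; replace with residuals from the model of interest
fit <- lm(yi ~ ., data = dat)
out <- bp_from_residuals(residuals(fit), dat[, c("x1i", "x2i")])
out
```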