RandomForest for Regression in R

I'm experimenting with R and the randomForest package; I have some experience with SVMs and neural nets.
My first test is to try to regress sin(x) + Gaussian noise.
With neural nets and SVM I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters).
When doing the same with randomForest I get a completely overfitted solution.
I simply use (R 2.14.0, tried on 2.14.1 too, just in case):
library("randomForest")
x<-seq(-3.14,3.14,by=0.00628)
noise<-rnorm(1001)
y<-sin(x)+noise/4
mat<-matrix(c(x,y),ncol=2,dimnames=list(NULL,c("X","Y")))
plot(x,predict(randomForest(Y~.,data=mat),mat),col="green")
points(x,y)
I guess there is a magic option in randomForest to make it work correctly; I tried a few but did not find the right lever to pull...

You can use maxnodes to limit the size of the trees,
as in the examples in the manual.
r <- randomForest(Y~.,data=mat, maxnodes=10)
plot(x,predict(r,mat),col="green")
points(x,y)

You can do a lot better (RMSE ~ 0.04, $R^2$ > 0.99) by training individual trees on small samples, or "bites" as Breiman called them.
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine learning terms, this requires increasing regularization. For an ensemble learner, this means trading strength for diversity.
Diversity of random forests can be increased by reducing the number of candidate features per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only 1 input dimension, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.
small bags, more trees :: rmse = 0.04:
> sd(predict(randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000),
             mat)
     - sin(x))
[1] 0.03912643
default settings :: rmse=0.14:
> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018
error due to noise in training set :: rmse = 0.25
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from
noise<-rnorm(1001)
y<-sin(x)+noise/4
In the above the evaluation is being done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out of bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y ~ ., data = mat, sampsize = 60, nodesize = 2,
                          replace = FALSE, ntree = 5000))
     - sin(x))
[1] 0.04059679

My intuition is that:
fitting a single decision tree to a one-dimensional curve f(x) is equivalent to fitting a staircase function (not necessarily with equally spaced jumps);
a random forest then produces a linear combination of such staircase functions.
For a staircase function to be a good approximator of f(x), you want enough steps on the x axis, but each step should contain enough points so that their mean is a good approximation of f(x) and less affected by noise.
So I suggest you tune the nodesize parameter. If you have one decision tree and N points, and nodesize=n, then your staircase function will have about N/n steps. Too small an n leads to overfitting. I got nice results with n~30 (RMSE~0.07):
r <- randomForest(Y~.,data=mat, nodesize=30)
plot(x,predict(r,mat),col="green")
points(x,y)
Notice that RMSE gets smaller if you take N'=10*N and n'=10*n.
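A quick way to check that scaling claim, under the same simulation setup (the specific seed, point count, and nodesize below are illustrative):
set.seed(42)
x2 <- seq(-3.14, 3.14, length.out = 10010)               # roughly 10 * N points
y2 <- sin(x2) + rnorm(length(x2)) / 4
mat2 <- data.frame(X = x2, Y = y2)
r2 <- randomForest(Y ~ ., data = mat2, nodesize = 300)   # roughly 10 * n
sd(predict(r2, mat2) - sin(x2))                          # RMSE; should come out below the ~0.07 above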

Related

Validate Accuracy of Test Data

I have fit my model on my training data and tested its accuracy using R-squared.
However, I want to test the accuracy of the model on my test data; how do I do this?
My predicted value is continuous. I'm quite new to this, so I'm open to suggestions.
LR_swim <- lm(racetime_mins ~ event_month + gender + place +
                clocktime_mins + handicap_mins +
                Wind_Speed_knots +
                Air_Temp_Celsius + Water_Temp_Celsius + Wave_Height_m,
              data = SwimmingTrain)
family = gaussian(link = "identity")
summary(LR_swim)
rsq(LR_swim) # Returns 0.9722331
# Predict Race_Time using the test data
pred_LR <- predict(LR_swim, SwimmingTest, type = "response")
# Add predicted race times back into the test dataset
SwimmingTest$Pred_RaceTime <- pred_LR
To start with, as already pointed out in the comments, the term accuracy is actually reserved for classification problems. What you are actually referring to is the performance of your model. And truth is, for regression problems (such as yours), there are several such performance measures available.
For good or bad, R^2 is still the standard measure in several implementations; nevertheless, it may be helpful to keep in mind what I have argued elsewhere:
the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where R-squared is used for any kind of performance assessment. Nor is it an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the GitHub thread above (emphasis added):
In particular when using a test set, it's a bit unclear to me what the R^2 means.
with which I certainly concur.
There are several other performance measures that are arguably more suitable in a predictive task, such as yours; and most of them can be implemented with a simple line of R code. So, for some dummy data:
preds <- c(1.0, 2.0, 9.5)
actuals <- c(0.9, 2.1, 10.0)
the mean squared error (MSE) is simply
mean((preds-actuals)^2)
# [1] 0.09
while the mean absolute error (MAE) is
mean(abs(preds-actuals))
# [1] 0.2333333
and the root mean squared error (RMSE) is simply the square root of the MSE, i.e.:
sqrt(mean((preds-actuals)^2))
# [1] 0.3
These measures are arguably more useful for assessing the performance on unseen data. The last two have an additional advantage of being in the same scale as your original data (not the case for MSE).
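Applied to the swimming example, and assuming racetime_mins is also the observed response column in SwimmingTest (column names taken from the question), the same one-liners give the test-set error:
# test-set RMSE and MAE, assuming SwimmingTest$racetime_mins holds the observed values
sqrt(mean((SwimmingTest$Pred_RaceTime - SwimmingTest$racetime_mins)^2))
mean(abs(SwimmingTest$Pred_RaceTime - SwimmingTest$racetime_mins))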

bnlearn::bn.fit difference and calculation of methods "mle" and "bayes"

I am trying to understand the differences between the two methods bayes and mle in the bn.fit function of the bnlearn package.
I know about the debate between the frequentist and the Bayesian approach to understanding probabilities. On a theoretical level, I suppose the maximum likelihood estimate mle is a simple frequentist approach, setting the relative frequencies as the probabilities. But what calculations are done to get the bayes estimate? I have already checked the bnlearn documentation, the description of the bn.fit function, and some application examples, but nowhere is there a real description of what is happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, leading to bnlearn:::bn.fit.backend, leading to bnlearn:::smartSapply but then I got stuck.
Some help would be really appreciated as I use the package for academic work and therefore I should be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, corresponding to combinations that were not observed in the data. Such zero entries could then result in a network which generalizes poorly, such as some early versions of the Pathfinder system.
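A minimal sketch of the difference for a single binary root node, so that cptsize = 2 (the node name and data below are made up for illustration):
library(bnlearn)
set.seed(1)
d <- data.frame(X = factor(sample(c("a", "b"), 100, replace = TRUE, prob = c(0.7, 0.3))))
dag <- empty.graph("X")
# maximum likelihood: relative frequencies Nx / N
coef(bn.fit(dag, d, method = "mle"))
# Bayesian estimate with imaginary sample size iss = 10: (Nx + iss/2) / (N + iss)
coef(bn.fit(dag, d, method = "bayes", iss = 10))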
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.

Finding the best LCA model in poLCA R package

I am applying LCA with the poLCA R package, but the analysis has not finished after three days (it has not found the best model yet), and occasionally it gives the following error: "ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND". So I cancelled the process at 35 latent classes. I am analyzing 16 variables (all of them categorical) and 36036 rows of data. When I tested variable importance for the 16 variables with the Boruta package, all 16 came out as important, so I used all 16 variables in the LCA with poLCA. Which path should I follow? Should I use another clustering method, such as k-modes, for clustering the categorical variables in this dataset? I use 500 iterations and nrep=10 model estimation runs. The R script I use to find the best model, and one of its outputs, are as follows:
min_bic <- Inf   # initialised before the loop so the first fitted model is kept
for(i in 2:50){
  lc <- poLCA(f, data, nclass=i, maxiter=500,
              tol=1e-5, na.rm=FALSE,
              nrep=10, verbose=TRUE, calc.se=TRUE)
  if(lc$bic < min_bic){
    min_bic <- lc$bic
    LCA_best_model <- lc
  }
}
=========================================================
Fit for 35 latent classes:
=========================================================
number of observations: 36036
number of estimated parameters: 2029
residual degrees of freedom: 34007
maximum log-likelihood: -482547.1

AIC(35): 969152.2
BIC(35): 986383
G^2(35): 233626.8 (Likelihood ratio/deviance statistic)
X^2(35): 906572555 (Chi-square goodness of fit)
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
The script you are using sequentially tests every model from 2 to 50 classes and keeps the one with the lowest BIC. BIC is not the only or the best way to select "the best" model, but fair enough.
The problem is that you are estimating a LOT of parameters, especially in the last steps. The more classes you fit, the more time-consuming the process is. Also, in these cases convergence problems are to be expected because you are fitting so many classes. That is what the error message reports: it cannot find the maximum likelihood for a model with 35 classes.
I don't know what problem you are trying to solve, but models with over 10 classes are unusual in LCA. You do LCA to reduce the complexity of your data as much as possible. If you NEED to fit models with many (over 10) classes:
fit them one by one, so RAM consumption will be less of a problem;
increase the nrep= argument in the call, so the probability of the model not finding the maximum likelihood by chance (bad random starting values) is reduced. This also increases computing time.
Alternatively, you can reduce computing time by running models in parallel. Almost every modern PC has 2 or more cores. The function acl() in the next block does this with foreach() and %dopar%, so it is OS-independent.
library(poLCA)
library(foreach)
library(doParallel)
registerDoParallel(cores=2)   # as many physical cores as available

acl <- function(datos,     # a data.frame with your data
                k,         # the maximum number of classes to fit
                formula) {
  foreach(i=1:k, .packages="poLCA") %dopar%
    poLCA(formula, datos, nclass=i)
}
acl() returns a list of models; you can pick "the best" later. The next function will retrieve the quantities of interest from the list and create a nicely formatted data.frame with useful information to select the right number of classes.
comparar_clases_acl <- function(modelo) {
  entropy <- function(p) sum(-p*log(p))   # to assess the quality of classification
  tabla_LCA <- data.frame(Modelo=0, BIC=0, Lik_ratio=0, Entropia=0,
                          MenorClase=0)   # empty data.frame to preallocate memory
  for(i in 1:length(modelo)){
    tabla_LCA[i,1] <- paste("Modelo", i)
    tabla_LCA[i,2] <- modelo[[i]]$bic
    tabla_LCA[i,3] <- modelo[[i]]$Gsq
    error_prior <- entropy(modelo[[i]]$P)
    error_post  <- mean(apply(modelo[[i]]$posterior, 1, entropy), na.rm = TRUE)
    tabla_LCA[i,4] <- round(((error_prior - error_post) / error_prior), 3)
    tabla_LCA[i,5] <- min(modelo[[i]]$P)*100
  }
  return(tabla_LCA)
}
It takes only one argument: an object with a list of LCA models, exactly what acl() returns.
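Putting the two functions together would look something like this (f and data stand for your own poLCA formula and data.frame, and 10 is just an illustrative upper bound on the number of classes):
modelos <- acl(data, 10, f)      # list of models with 1 to 10 classes, fitted in parallel
comparar_clases_acl(modelos)     # BIC, likelihood ratio, entropy and smallest-class share per model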
This parallel approach should reduce computing time. Still, 50 classes are too many, and you are probably reaching the smallest BIC well before 50 classes. Remember, BIC penalizes models as the number of estimated parameters increases, helping you find the point of diminishing returns of an extra class in your model.
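For reference, the penalty term is explicit in the criterion's definition, with $k$ the number of estimated parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood: $\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})$.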

Why does h2o.randomForest in R make much better predictions than the randomForest package?

setwd("D:/Santander")
## import train dataset
train<-read.csv("train.csv",header=T)
dim(train)
summary(train)
str(train)
prop.table(table(train$TARGET))
stats<-function(x){
length<-length(x)
nmiss<-sum(is.na(x))
y<-x[!is.na(x)]
freq<-as.data.frame(table(y))
max_freq<-max(freq[,2])/length
min<-min(y)
median<-median(y)
max<-max(y)
mean<-mean(y)
freq<-length(unique(y))
return(c(nmiss=nmiss,min=min,median=median,mean=mean,max=max,freq=freq,max_freq=max_freq))
}
var_stats<-sapply(train,stats)
var_stats_1<-t(var_stats)
### Drop variables whose most frequent value accounts for more than 0.9999 of observations (all other values together below 1/10000)
exclude_var<-rownames(var_stats_1)[var_stats_1[,7]>0.9999]
train2<-train[,! colnames(train) %in% c(exclude_var,"ID")]
rm(list=setdiff(ls(),"train2"))
train2<-train2[1:10000,]
write.csv(train2,"example data.csv",row.names = F)
## Randomly split the data into training and test sets
set.seed(1)
ind<-sample(c(1,2),size=nrow(train2),replace=T,prob=c(0.8,0.2))
train2$TARGET<-factor(train2$TARGET)
train_set<-train2[ind==1,]
test_set<-train2[ind==2,]
rm(train2)
## 1. Build a prediction model with R randomForest (100 trees)
library(randomForest)
library(ROCR)   # needed for prediction() and performance() below
memory.limit(4000)
random<-randomForest(TARGET~.,data=train_set,ntree=50)
print(random)
random.importance<-importance(random)
p_train<-predict(random,train_set,type="prob")
pred.auc<-prediction(p_train[,2],train_set$TARGET)
performance(pred.auc,"auc")
##train_set auc=0.8177
## predict test_set
p_test<-predict(random,newdata = test_set,type="prob")
pred.auc<-prediction(p_test[,2],test_set$TARGET)
performance(pred.auc,"auc")
##test_set auc=0.60
#________________________________________________#
##_________h2o.randomForest_______________
library(h2o)
h2o.init()
train.h2o<-as.h2o(train_set)
test.h2o<-as.h2o(test_set)
random.h2o <- h2o.randomForest(y = "TARGET", training_frame = train.h2o, ntrees = 50)
importance.h2o<-h2o.varimp(random.h2o)
p_train.h2o<-as.data.frame(h2o.predict(random.h2o,train.h2o))
pred.auc<-prediction(p_train.h2o$p1,train_set$TARGET)
performance(pred.auc,"auc")
##auc=0.9388, bigger than previous one
###test_set prediction
p_test.h2o<-as.data.frame(h2o.predict(random.h2o,test.h2o))
pred.auc<-prediction(p_test.h2o$p1,test_set$TARGET)
performance(pred.auc,"auc")
###auc=0.775
I tried to make predictions for the Kaggle competition Santander Customer Satisfaction: https://www.kaggle.com/c/santander-customer-satisfaction
When I use the randomForest package in R, I get a final AUC of 0.57 on the test data, but when I use h2o.randomForest, I get a final AUC of 0.81. The parameters in both functions are the same; I only used the default parameters with ntree = 100.
So why does h2o.randomForest make much better predictions than the randomForest package itself?
Firstly, as user1808924 noted, there are differences in the algorithms and their default hyperparameters. For example, R's randomForest splits based on the Gini criterion and H2O trees are split based on reduction in Squared Error (even for classification). H2O also uses histograms for splitting and can handle splitting on categorical variables without dummy (or one-hot) encoding (although I don't think that matters here since the Santander dataset is entirely numeric). Other information on H2O's splitting can be found here (this is in the GBM section but the splitting for both algos is the same).
If you look at the predictions from your R randomForest model you will see that they are all in increments of 0.02. R's randomForest builds really deep trees, resulting in pure leaf nodes. This means the predicted outcome of an observation is either going to be 0 or 1 in each tree, and since you've set ntree=50 the predicted probabilities will all be in increments of 0.02. The reason you get bad AUC scores is that with AUC it is the order of the predictions that matters, and since all of your predictions are [0.00, 0.02, 0.04, ...] there are a lot of ties. The trees in H2O's random forest aren't quite as deep and therefore aren't as pure, allowing for predictions with more granularity that can be better sorted for a better AUC score.
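One way to sanity-check the ties explanation with the randomForest package itself is to compare prediction granularity and AUC at different tree counts (this reuses train_set / test_set from the question and ROCR for the AUC; exact numbers will vary from run to run):
library(randomForest)
library(ROCR)
rf50  <- randomForest(TARGET ~ ., data = train_set, ntree = 50)
rf500 <- randomForest(TARGET ~ ., data = train_set, ntree = 500)
p50  <- predict(rf50,  newdata = test_set, type = "prob")[, 2]
p500 <- predict(rf500, newdata = test_set, type = "prob")[, 2]
# vote fractions from 50 trees can take at most 51 distinct values, so the ranking
# is full of ties; 500 trees gives a much finer-grained ranking
length(unique(p50)); length(unique(p500))
auc <- function(p, y) performance(prediction(p, y), "auc")@y.values[[1]]
auc(p50, test_set$TARGET); auc(p500, test_set$TARGET)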

setting values for ntree and mtry for random forest regression model

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered: what would be a good value for the number of trees ntree and the number of variables tried at each split mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is 200 characters representing an amino acid sequence, and I want to build a regression model that uses such sequences to predict the distances between proteins.
The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over correlate the ensemble, which leads to overfit.
Here is the caveat: variable interactions stabilize at a slower rate than error, so if you have a large number of independent variables you need more replicates. I would keep ntree an odd number so ties can be broken.
For the dimensions of your problem I would start with ntree=1501. I would also recommend looking into one of the published variable selection approaches to reduce the number of your independent variables.
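A minimal sketch of both suggestions, assuming x is your 38772 x 200 predictor matrix and y the numeric response (the tuneRF arguments other than the data are illustrative):
library(randomForest)
# plot OOB error against the number of trees to check convergence
rf <- randomForest(x, y, ntree = 501)
plot(rf)
# tuneRF searches over mtry using the OOB error (note the bias caveat above)
tuneRF(x, y, ntreeTry = 501, stepFactor = 1.5, improve = 0.01)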
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from its default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the randomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value of mtry.
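A sketch of such a grid search with caret, assuming a data.frame dat with a numeric response column y (the mtry grid and cross-validation settings are illustrative):
library(caret)
fit <- train(y ~ ., data = dat,
             method = "rf",
             tuneGrid = expand.grid(mtry = c(2, 5, 10, 20)),
             trControl = trainControl(method = "cv", number = 5))
fit$bestTune   # the mtry value with the best cross-validated performance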
Could this paper help?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that a priori determines a minimum number of classifiers to combine in order to obtain a prediction accuracy level similar to the one obtained with the combination of larger ensembles. The procedure is based on the McNemar non-parametric test of significance. Knowing a priori the minimum size of the classifier ensemble giving the best prediction accuracy constitutes a gain for time and memory costs, especially for huge databases and real-time applications. Here we applied this procedure to four multiple classifier systems with C4.5 decision tree (Breiman's Bagging, Ho's Random subspaces, their combination we labeled 'Bagfs', and Breiman's Random forests) and five large benchmark databases. It is worth noticing that the proposed procedure may easily be extended to other base learning algorithms than a decision tree as well. The experimental results showed that it is possible to limit significantly the number of trees. We also showed that the minimum number of trees required for obtaining the best prediction accuracy may vary from one classifier combination method to another.
They never use more than 200 trees.
One nice trick that I use is to initially take the square root of the number of predictors and plug that value in for mtry. It is usually around the same value that the tuneRF function in randomForest would pick.
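As a one-liner, with x standing for a data.frame of predictors (the same placeholder used in the block below):
mtry_start <- floor(sqrt(ncol(x)))   # rule-of-thumb starting value for mtry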
I use the code below to check accuracy as I play around with ntree and mtry (change the parameters):
library(randomForest)
library(ROCR)   # for prediction() and performance()

results_df <- data.frame(matrix(ncol = 8))
colnames(results_df) <- c("No. of trees", "No. of variables",
                          "Dev_AUC", "Dev_Hit_rate", "Dev_Coverage_rate",
                          "Val_AUC", "Val_Hit_rate", "Val_Coverage_rate")

trees <- c(50, 100, 150, 250)
variables <- c(8, 10, 15, 20)

for (i in 1:length(trees)) {
  ntree <- trees[i]
  for (j in 1:length(variables)) {
    mtry <- variables[j]
    rf <- randomForest(x, y, ntree = ntree, mtry = mtry)

    ## development (training) set
    pred <- as.data.frame(predict(rf, type = "class"))
    class_rf <- cbind(dev$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    dev_hit_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
                    nrow(subset(class_rf, predicted_values == 1))
    dev_coverage_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
                         nrow(subset(class_rf, actual_values == 1))

    pred_prob <- as.data.frame(predict(rf, type = "prob"))
    prob_rf <- cbind(dev$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    dev_auc <- as.numeric(auc@y.values)   # '@' slot access, not '#'

    ## validation set
    pred <- as.data.frame(predict(rf, val, type = "class"))
    class_rf <- cbind(val$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    val_hit_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
                    nrow(subset(class_rf, predicted_values == 1))
    val_coverage_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
                         nrow(subset(class_rf, actual_values == 1))

    pred_prob <- as.data.frame(predict(rf, val, type = "prob"))
    prob_rf <- cbind(val$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    val_auc <- as.numeric(auc@y.values)

    results_df <- rbind(results_df,
                        c(ntree, mtry, dev_auc, dev_hit_rate, dev_coverage_rate,
                          val_auc, val_hit_rate, val_coverage_rate))
  }
}
