Why does h2o.randomForest in R make much better predictions than the randomForest package?


setwd("D:/Santander")
## import train dataset
train<-read.csv("train.csv",header=T)
dim(train)
summary(train)
str(train)
prop.table(table(train$TARGET))
stats<-function(x){
length<-length(x)
nmiss<-sum(is.na(x))
y<-x[!is.na(x)]
freq<-as.data.frame(table(y))
max_freq<-max(freq[,2])/length
min<-min(y)
median<-median(y)
max<-max(y)
mean<-mean(y)
freq<-length(unique(y))
return(c(nmiss=nmiss,min=min,median=median,mean=mean,max=max,freq=freq,max_freq=max_freq))
}
var_stats<-sapply(train,stats)
var_stats_1<-t(var_stats)
### Drop every variable whose most frequent category covers more than 0.9999 of the rows (all other categories below 1/10000)
exclude_var<-rownames(var_stats_1)[var_stats_1[,7]>0.9999]
train2<-train[,! colnames(train) %in% c(exclude_var,"ID")]
rm(list=setdiff(ls(),"train2"))
train2<-train2[1:10000,]
write.csv(train2,"example data.csv",row.names = F)
## Randomly split the data into a training set and a test set
set.seed(1)
ind<-sample(c(1,2),size=nrow(train2),replace=T,prob=c(0.8,0.2))
train2$TARGET<-factor(train2$TARGET)
train_set<-train2[ind==1,]
test_set<-train2[ind==2,]
rm(train2)
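## 1. Build a prediction model with R's randomForest (the original note says 100 trees; the call below uses ntree=50)
library(randomForest)
library(ROCR)  # provides prediction() and performance() used below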
memory.limit(4000)
random<-randomForest(TARGET~.,data=train_set,ntree=50)
print(random)
random.importance<-importance(random)
p_train<-predict(random,train_set,type="prob")
pred.auc<-prediction(p_train[,2],train_set$TARGET)
performance(pred.auc,"auc")
##train_set auc=0.8177
## predict test_set
p_test<-predict(random,newdata = test_set,type="prob")
pred.auc<-prediction(p_test[,2],test_set$TARGET)
performance(pred.auc,"auc")
##test_set auc=0.60
#________________________________________________#
##_________h2o.randomForest_______________
library(h2o)
h2o.init()
train.h2o<-as.h2o(train_set)
test.h2o<-as.h2o(test_set)
random.h2o<-h2o.randomForest(y="TARGET",training_frame = train.h2o,ntrees=50)
importance.h2o<-h2o.varimp(random.h2o)
p_train.h2o<-as.data.frame(h2o.predict(random.h2o,train.h2o))
pred.auc<-prediction(p_train.h2o$p1,train_set$TARGET)
performance(pred.auc,"auc")
##auc=0.9388, bigger than previous one
###test_set prediction
p_test.h2o<-as.data.frame(h2o.predict(random.h2o,test.h2o))
pred.auc<-prediction(p_test.h2o$p1,test_set$TARGET)
performance(pred.auc,"auc")
###auc=0.775
I tried to make predictions for the Kaggle competition Santander Customer Satisfaction: https://www.kaggle.com/c/santander-customer-satisfaction
When I use the randomForest package in R, I get a final AUC of 0.57 on the test data, but when I use h2o.randomForest, I get a final AUC of 0.81 on the test data. The parameters in both functions are the same; I only used the default parameters with ntree=100.
So why does h2o.randomForest make much better predictions than the randomForest package itself?

Firstly, as user1808924 noted, there are differences in the algorithms and their default hyperparameters. For example, R's randomForest splits based on the Gini criterion and H2O trees are split based on reduction in Squared Error (even for classification). H2O also uses histograms for splitting and can handle splitting on categorical variables without dummy (or one-hot) encoding (although I don't think that matters here since the Santander dataset is entirely numeric). Other information on H2O's splitting can be found here (this is in the GBM section but the splitting for both algos is the same).
If you look at the predictions from your R randomForest model you will see that they are all in increments of 0.02. R's randomForest builds really deep trees, resulting in pure leaf nodes. This means the predicted outcome for an observation is either going to be 0 or 1 in each tree, and since you've set ntree=50 the predictions will all be in increments of 0.02. The reason you get bad AUC scores is that with AUC it is the order of the predictions that matters, and since all of your predictions are [0.00, 0.02, 0.04, ...] there are a lot of ties. The trees in H2O's random forest aren't quite as deep and therefore aren't as pure, allowing for predictions that have some more granularity to them and that can be better sorted for a better AUC score.
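The effect of ties on AUC can be seen in a toy case. The sketch below is only an illustration: the labels and scores are made up, and `auc_rank` is a hypothetical helper that computes AUC with the rank (Mann-Whitney) formula instead of ROCR. Snapping the scores to a 0.02 grid, as a vote over 50 pure trees would, ties all four observations and the ranking information is lost.

```r
# AUC via the rank (Mann-Whitney) formula; tied scores receive average ranks.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label  <- c(0, 1, 0, 1)                  # made-up class labels
fine   <- c(0.011, 0.013, 0.017, 0.019)  # finely resolved scores
coarse <- round(fine / 0.02) * 0.02      # snapped to multiples of 0.02

auc_fine   <- auc_rank(fine, label)      # 0.75
auc_coarse <- auc_rank(coarse, label)    # 0.5: every score is tied at 0.02
print(c(fine = auc_fine, coarse = auc_coarse))
```

Coarse scores do not always cost this much; the loss depends on how many positive/negative pairs end up tied.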

Related

Plot an envelope for an mppm object in spatstat

My question is closely related to this previous one: Simulation-based hypothesis testing on spatial point pattern hyperframes using "envelope" function in spatstat
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I study its envelope to compare it to my observations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realizations of the same experiment.
However, the envelope function does not accept inputs of type mppm:
> envelope(model, Kcross.inhom, nsim=10)
Error in UseMethod("envelope") :
no applicable method for 'envelope' applied to an object of class "c('mppm', 'list')"
The answer provided to the previously mentioned question indicates how to plot global envelopes for each pattern, and to use the product rule for multiple testing. However, my fitted model implies that my 3 ppp objects are statistically equivalent and are independent realizations of the same experiment (i.e., there are no covariates that differ between them). I would thus like to obtain one single plot comparing my fitted model to my 3 datasets. The following code:
gamma= 1 - 0.95^(1/3)
nsims=round(1/gamma-1)
sims <- simulate(model, nsim=2*nsims)
SIMS <- list()
for(i in 1:nrow(sims)) SIMS[[i]] <- as.solist(sims[i,,drop=TRUE])
Hplus <- cbind(data, hyperframe(Sims=SIMS))
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim=nsims, simulate=Sims))
pool(EE1[1],EE1[2],EE1[3])
leads to the following error:
Error in pool.envelope(`1` = list(r = c(0, 0.78125, 1.5625, 2.34375, 3.125, :
Arguments 2 and 3 do not belong to the class “envelope”

Wrong type of subset index. Use
pool(EE1[[1]], EE1[[2]], EE1[[3]])
or just
pool(EE1)
These would have given an error message that the envelope commands should have been called with savefuns=TRUE. So you just need to change that step as well.
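The difference between the two kinds of subset index can be checked on a plain list. This is a minimal sketch with stand-in objects: `ee` is a hypothetical list whose elements merely carry the class name "envelope", not real spatstat envelopes.

```r
# Stand-in for a list of envelope objects (hypothetical contents).
ee <- list(
  structure(list(r = c(0, 1, 2)), class = "envelope"),
  structure(list(r = c(0, 1, 2)), class = "envelope")
)

single <- ee[1]     # single bracket: a list of length 1 wrapping the element
double <- ee[[1]]   # double bracket: the element itself

class(single)                  # "list"
class(double)                  # "envelope"
inherits(single, "envelope")   # FALSE, which is why pool() rejects it
inherits(double, "envelope")   # TRUE
```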
However, statistically this procedure makes little sense. You have already fitted a model, which allows for rigorous statistical inference using anova.mppm and other tools. Instead of this, you are generating simulated data from the fitted model and performing a Monte Carlo test, with all the fraught issues of multiple testing and low power. There are additional problems with this approach - for example, even if the model is "the same" for each row of the hyperframe, the patterns are not statistically equivalent unless the windows of the point patterns are identical, and so on.

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier Score for my Machine Learning Predictive models. The outcome "y" was categorical (1 or 0). Predictors are a mix of continuous and categorical variables.
I have created four models with different predictors, I will call them "model_1"-"model_4" here (except predictors, other parameters are the same). Example code of my model is:
Model_1=rfsrc(y~ ., data=TrainTest, ntree=1000,
mtry=30, nodesize=1, nsplit=1,
na.action="na.impute", nimpute=3,seed=10,
importance=T)
When I run the "Model_1" function in R, I got the results:
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the "DescTools" package.
But when I tried "BrierScore(Model_1)", it gave me no results.
The code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried the code above and got many error messages.

One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on. That is usually not the correct approach (look up "train-test split" if you need more information), so you should first reflect on whether it is what you want.
The BrierScore method in DescTools only has a defined method for glm models, otherwise, it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do though is to predict on your data using:
prediction = predict(Model_1, TrainTest, na.action="na.impute")
and then compute the brier score using
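BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 2L])
(Note that we transform TrainTest$y into a numeric vector of 0's and 1's in order to compute the Brier score, and that the predicted probability has to be the one for class 1, which is the second column when the factor levels are 0 and 1.)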
Note: The randomForestSRC package also prints a normalized brier score when you call print(prediction).
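For a binary outcome the Brier score is just the mean squared difference between the predicted probability of class 1 and the observed 0/1 outcome, so it can also be computed by hand. A minimal sketch with made-up numbers (the vectors `y_factor` and `p` are hypothetical, not taken from the question's data):

```r
y_factor <- factor(c(1, 0, 1, 0))   # outcome stored as a factor with levels "0", "1"

codes <- as.numeric(y_factor)                # 2 1 2 1: internal level codes, not the labels
y     <- as.numeric(as.character(y_factor))  # 1 0 1 0: the actual 0/1 outcome

p <- c(0.9, 0.2, 0.6, 0.4)          # made-up predicted probabilities of class 1

brier <- mean((p - y)^2)            # (0.01 + 0.04 + 0.16 + 0.16) / 4 = 0.0925
print(brier)
```

The `codes` line shows why `as.numeric()` on a factor needs either the `- 1` or the `as.character()` step: without it the outcome is coded 1/2 and the score is meaningless.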
In general, using one of the available workbenches for machine learning in R (mlr3, tidymodels, caret) might simplify this approach for you. This is good practice, especially if you are less experienced in ML, as it can prevent many errors of this kind.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

Writing syntax for bivariate survival censored data to fit copula models in R

library(Sunclarco)
library(MASS)
library(survival)
library(SPREDA)
library(SurvCorr)
library(doBy)
#Dataset
data("diabetes")  # loads the data frame 'diabetes'; data() itself only returns the name, so do not assign its result
data1=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME1","STATUS1"))
data2=subset(diabetes,select=c("LASER","TRT_EYE","AGE_DX","ADULT","TIME2","STATUS2"))
#Adding variable which identify cluster
data1$CLUSTER<- rep(1,197)
data2$CLUSTER<- rep(2,197)
#Renaming the variables so that the common items have the same names in both data sets
names(data1)[5] <- "TIME"
names(data1)[6] <- "STATUS"
names(data2)[5] <- "TIME"
names(data2)[6] <- "STATUS"
#merge the files
Total_data=rbind(data1,data2)
# Re arranging the database
diabete_full=orderBy(~LASER+TRT_EYE+AGE_DX,data=Total_data)
diabete_full
#using the Sunclarco package for Clayton and Gumbel
Clayton_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="Clayton",marginal="Weibull")
summary(Clayton_1step)
# Estimates StandardErrors
#lambda 0.01072631 0.005818201
#rho 0.79887565 0.058942208
#theta 0.10224445 0.090585891
#beta_LASER 0.16780224 0.157652947
#beta_TRT_EYE 0.24580489 0.162333369
#beta_ADULT 0.09324001 0.158931463
# Estimate StandardError
#Kendall's Tau 0.04863585 0.04099436
Clayton_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="Clayton",marginal="Weibull")
summary(Clayton_2step)
# Estimates StandardErrors
#lambda 0.01131751 0.003140733
#rho 0.79947406 0.012428824
#beta_LASER 0.14244235 0.041845100
#beta_TRT_EYE 0.27246433 0.298184235
#beta_ADULT 0.06151645 0.253617142
#theta 0.18393973 0.151048024
# Estimate StandardError
#Kendall's Tau 0.08422381 0.06333791
Gumbel_1step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=1,copula="GH",marginal="Weibull")
# Estimates StandardErrors
#lambda 0.01794495 0.01594843
#rho 0.70636113 0.10313853
#theta 0.87030690 0.11085344
#beta_LASER 0.15191936 0.14187943
#beta_TRT_EYE 0.21469814 0.14736381
#beta_ADULT 0.08284557 0.14214373
# Estimate StandardError
#Kendall's Tau 0.1296931 0.1108534
Gumbel_2step <- SunclarcoModel(data=diabete_full,time="TIME",status="STATUS",
clusters="CLUSTER",covariates=c("LASER","TRT_EYE","ADULT"),
stage=2,copula="GH",marginal="Weibull")
I need to fit copula models in R for several copula classes, in particular the Gaussian, FGM, Plackett and possibly Frank (if I still have time). The data I am using is the diabetes data, available in R through the packages survival and SurvCorr.
This is for my thesis: an exploratory study of how different copula classes lead to different results on the same data. I found the package Sunclarco, with which I was able to fit the Clayton and Gumbel copula classes, but it does not yet support the other classes.
The challenge is that I have censored data, which has to be incorporated into the likelihood estimation, and writing the code is hard for me because I don't have a strong programming background. In addition, I have to incorporate the covariates and see whether or not they have an impact on the association. However, when I talked to my promoter, he gave me the following insights on how to approach writing the code:
• First of all, forget about the likelihood function. We only work with the log-likelihood function. In this way, you do not need to take the product of the contributions over each of the observations, but can take the sum of the log-contributions over the different observations.
• Next, since we have a balanced design, we can use the regular data frame structure in which we have for each cluster only one row in the data frame. The different variables such as the lifetimes, the indicators and all the covariates are the columns in this data frame.
• Due to the bivariate setting, there are only 4 possible ways to give a contribution to the log-likelihood function: both uncensored, both censored, first uncensored and second censored, or first censored and second uncensored. To create the log-likelihood function, you create a new variable in your data frame in which you put the correct contribution to the log-likelihood based on which individual in the couple is censored. When you take the sum of this variable, you have the value of the log-likelihood function.
• Since this function depends on parameters, you can use any optimizer, like optim or nlm, to get your optimal values. Be careful here: optim and nlm look for the minimum of a function, not a maximum. This is easily solved, since the parameter values that minimize -f are the ones that maximize f.
• Since you have, for each copula function, the different expressions for the derivatives, it should be possible to get the likelihood functions now.
I am still struggling to find a way, because the likelihood changes for each copula class: the generator function is unique to the respective copula and has to be adapted during estimation. Lastly, I should run the analysis for both one-step and two-step copula estimation, as I will use both to compare results.
If someone could help me figure this out I would be eternally grateful, even if it is for just one copula class, e.g. the Gaussian; I can then work out the rest based on that one. I have tried everything and still have nothing to show for it, and I feel I am running out of time to find the answers by myself.
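The promoter's recipe above can be sketched in base R. The sketch makes several assumptions that are not in the question: it uses the Clayton copula (chosen only because its derivatives are closed-form, so nothing beyond base R is needed) rather than the Gaussian, one common Weibull margin without covariates, simulated data instead of the diabetes data, and a hypothetical function name `clayton_weibull_nll`. All parameters are maximised jointly, which corresponds to a one-step fit.

```r
# Negative log-likelihood for right-censored pairs (t1, t2) with event
# indicators (d1, d2), assuming a Clayton copula on the joint survival
# function and one common Weibull margin S(t) = exp(-lambda * t^rho).
# par holds log(lambda), log(rho), log(theta) so optim() is unconstrained.
clayton_weibull_nll <- function(par, t1, t2, d1, d2) {
  lambda <- exp(par[1]); rho <- exp(par[2]); theta <- exp(par[3])
  log_surv <- function(t) -lambda * t^rho
  log_dens <- function(t) log(lambda) + log(rho) + (rho - 1) * log(t) - lambda * t^rho
  lu <- log_surv(t1)   # log S(t1)
  lv <- log_surv(t2)   # log S(t2)
  log_a <- log(exp(-theta * lu) + exp(-theta * lv) - 1)

  # The four possible log-contributions, one per censoring pattern.
  both_events <- log1p(theta) - (theta + 1) * (lu + lv) -
    (1 / theta + 2) * log_a + log_dens(t1) + log_dens(t2)
  first_only  <- -(1 / theta + 1) * log_a - (theta + 1) * lu + log_dens(t1)
  second_only <- -(1 / theta + 1) * log_a - (theta + 1) * lv + log_dens(t2)
  both_cens   <- -log_a / theta

  contrib <- ifelse(d1 == 1 & d2 == 1, both_events,
             ifelse(d1 == 1, first_only,
             ifelse(d2 == 1, second_only, both_cens)))
  -sum(contrib)        # optim() minimises, so return the negative
}

# Simulated example data (not the diabetes data): Clayton-dependent uniforms
# via the conditional method, Weibull times, administrative censoring at 2.
set.seed(42)
n <- 300
theta0 <- 1; lambda0 <- 0.5; rho0 <- 1.5
u <- runif(n); w <- runif(n)
v <- ((w^(-theta0 / (1 + theta0)) - 1) * u^(-theta0) + 1)^(-1 / theta0)
time1 <- (-log(u) / lambda0)^(1 / rho0)
time2 <- (-log(v) / lambda0)^(1 / rho0)
cens <- 2
d1 <- as.integer(time1 <= cens); t1 <- pmin(time1, cens)
d2 <- as.integer(time2 <= cens); t2 <- pmin(time2, cens)

start <- c(0, 0, 0)    # i.e. lambda = rho = theta = 1
fit <- optim(start, clayton_weibull_nll, t1 = t1, t2 = t2, d1 = d1, d2 = d2)
est <- setNames(exp(fit$par), c("lambda", "rho", "theta"))
print(est)
```

For another copula class only the four contribution lines change: they need that copula's C, its two first partial derivatives and its density. Covariates could enter through the Weibull scale, for example by replacing lambda with lambda * exp(x * beta), and a two-step fit would first estimate the margins alone and then maximise over theta with the margins held fixed.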

Spark ML Logistic Regression with Categorical Features Returns Incorrect Model

I've been doing a head-to-head comparison of Spark 1.6.2 ML's LogisticRegression with R's glmnet package (the closest analog I could find based on other forum posts).
I'm specifically looking at these two fitting packages when using categorical features. When using continuous features, results for the two packages are comparable.
For my first attempt with Spark, I used the ML Pipeline API to transform my single 21-level categorical variable (called FAC for faculty) with StringIndexer followed by OneHotEncoder to get a binary vector representation.
When I fit my models in Spark and R, I get the following sets of results (that aren't even close):
SPARK 1.6.2 ML
lrModel.intercept
-3.1453838659926427
lrModel.weights
[0.37664264958084287,0.697784342445422,0.4269429071484017,0.3521764371898419,0.19233585960734872,0.6708049751689226,0.49342372792676115,0.5471489576300356,0.37650628365008465,1.0447861554914701,0.5371820187662734,0.4556833133252492,0.2873530144304645,0.09916227313130375,0.1378469333986134,0.20412095883272838,0.4494641670133712,0.4499625784826652,0.489912016708041,0.5433020878341336]
R (glmnet)
(Intercept) -2.79255253
facG -0.35292166
facU -0.16058275
facN 0.69187146
facY -0.06555711
facA 0.09655696
facI 0.02374558
facK -0.25373146
facX 0.31791765
facM 0.14054251
facC 0.02362977
facT 0.07407357
facP 0.09709607
facE 0.10282076
facH -0.21501281
facQ 0.19044412
facW 0.18432837
facF 0.34494177
facO 0.13707197
facV -0.14871580
facS 0.19431703
I've manually checked the glmnet results and they're correct (calculating the proportion of training samples with a particular level of the categorical feature and comparing that to the softmax prob. under the estimated model). These results do not change even when the max. no. of iterations in the optimization is set to 1000000 and the convergence tolerance is set to 1E-15. These results also do not change when the Spark LogisticRegression weights are initialized to the glmnet-estimated weights (Spark's optimizing a different cost function?).
I should say that the optimization problem is not different between these two approaches. You should be minimizing logistic loss (a convex surface) and thereby arriving at nearly the exact same answer.
Now, when I manually recode the FAC feature as a binary vector in the data file and read those binary columns as "DoubleType" (using Spark's DataFrame schema), I get very comparable results. (The order of the coefficients for the following results is different from the above results. Also the reference levels are different--"B" in the first case, "A" in the second--so the coefficients for this test should not match those from the above test.)
SPARK 1.6.2 ML
lrModel.intercept
-2.9530485080391378
lrModel.weights
[-0.19233467682265934,0.8524505857034615,0.09501714540028124,0.25712829253044844,0.18430675058702053,0.09317325898819705,0.4784688407322236,0.3010877381053835,0.18417033887042242,0.2346069926274015,0.2576267066227656,0.2633474197307803,0.05448893119304087,0.35096612444193326,0.3448460751810199,0.505448794876487,0.29757609104571175,0.011785058030487976,0.3548130904832268,0.15984047288368383]
R (glmnet)
s0
(Intercept) -2.9419468179
FAC_B -0.2045928975
FAC_C 0.8402716731
FAC_E 0.0828962518
FAC_F 0.2450427806
FAC_G 0.1723424956
FAC_H -0.1051037449
FAC_I 0.4666239456
FAC_K 0.2893153021
FAC_M 0.1724536240
FAC_N 0.2229762780
FAC_O 0.2460295934
FAC_P 0.2517981380
FAC_Q -0.0660069035
FAC_S 0.3394729194
FAC_T 0.3334048723
FAC_U 0.4941379563
FAC_V 0.2863010635
FAC_W 0.0005482422
FAC_X 0.3436361348
FAC_Y 0.1487405173
Standardization is set to FALSE for both and no regularization is performed (you shouldn't perform it here since you're really just learning the incidence rate of each level of the feature and the binary feature columns are completely uncorrelated from one another). Also, I should mention that the 21 levels of the categorical feature range in incidence counts from ~800 to ~3500 (so this is not due to lack of data; large error in estimates).
Has anyone experienced this? I'm one step away from reporting this to the Spark guys.
Thanks as always for the help.
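Two points from the question can be reproduced on a small scale: with a single categorical feature and no regularization, the fitted model just recovers the incidence rate of each level, and changing the reference level changes the coefficients but not the fitted probabilities. The sketch below is an illustration under stated assumptions: it uses base R's glm rather than glmnet or Spark, and the three-level factor `fac` and its rates are simulated, not the FAC data.

```r
set.seed(1)
n <- 900
fac <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
p_true <- c(A = 0.2, B = 0.4, C = 0.6)            # made-up incidence rates
y <- rbinom(n, 1, p_true[as.character(fac)])

fit_a <- glm(y ~ fac, family = binomial)                      # reference level "A"
fit_b <- glm(y ~ relevel(fac, ref = "B"), family = binomial)  # reference level "B"

coef(fit_a)   # intercept is the log-odds of level "A"
coef(fit_b)   # intercept is the log-odds of level "B"

# Same fitted probabilities despite different coefficients.
max_diff <- max(abs(fitted(fit_a) - fitted(fit_b)))

# Manual check: the intercept maps back to the observed rate of the reference level.
rate_a   <- mean(y[fac == "A"])
fitted_a <- unname(plogis(coef(fit_a)[1]))
print(c(max_diff = max_diff, rate_a = rate_a, fitted_a = fitted_a))
```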

setting values for ntree and mtry for random forest regression model

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered: what would be a good value for the number of trees ntree and the number of variables tried at each split mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is a 200 character representing the amino acid sequence, and I want to build a regression model to use such sequence in order to predict the distances between the proteins.

The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over correlate the ensemble, which leads to overfit.
Here is the caveat: variable interactions stabilize at a slower rate than error so, if you have a large number of independent variables you need more replicates. I would keep the ntree an odd number so ties can be broken.
For the dimensions of your problem I would start with ntree=1501. I would also recommend looking into one of the published variable selection approaches to reduce the number of your independent variables.

The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from its default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the randomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value for mtry.

Could this paper help?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that a priori determines a minimum number of classifiers to combine in order to obtain a prediction accuracy level similar to the one obtained with the combination of larger ensembles. The procedure is based on the McNemar non-parametric test of significance. Knowing a priori the minimum size of the classifier ensemble giving the best prediction accuracy, constitutes a gain for time and memory costs especially for huge data bases and real-time applications. Here we applied this procedure to four multiple classifier systems with C4.5 decision tree (Breiman’s Bagging, Ho’s Random subspaces, their combination we labeled ‘Bagfs’, and Breiman’s Random forests) and five large benchmark data bases. It is worth noticing that the proposed procedure may easily be extended to other base learning algorithms than a decision tree as well. The experimental results showed that it is possible to limit significantly the number of trees. We also showed that the minimum number of trees required for obtaining the best prediction accuracy may vary from one classifier combination method to another.
They never use more than 200 trees.

One nice trick that I use is to start by taking the square root of the number of predictors and plugging that value in for "mtry". It is usually around the same value that the tuneRF function in randomForest would pick.
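For context, the square root of the number of predictors is what randomForest itself uses as the default mtry for classification, while for regression (the case in the question) the default is one third of the predictors. A minimal sketch of those two documented defaults, wrapped in a hypothetical helper `default_mtry` and applied to the 200 predictors implied by the question:

```r
# Default mtry as documented for randomForest():
#   regression:     max(floor(p / 3), 1)
#   classification: floor(sqrt(p))
default_mtry <- function(p, type = c("regression", "classification")) {
  type <- match.arg(type)
  if (type == "regression") max(floor(p / 3), 1) else floor(sqrt(p))
}

mtry_reg   <- default_mtry(200, "regression")      # 66
mtry_class <- default_mtry(200, "classification")  # 14
print(c(regression = mtry_reg, classification = mtry_class))
```

So for a regression forest the square-root rule starts well below the package default, which is worth keeping in mind when comparing it with what tuneRF suggests.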

I use the code below to check for accuracy as I play around with ntree and mtry (change the parameters):
library(randomForest)  # randomForest()
library(ROCR)          # prediction(), performance()
# x, y: predictors and factor response, presumably from the development set dev;
# val: validation set with the same columns.
results_df <- data.frame(matrix(ncol = 8))  # starts with one row of NA
colnames(results_df) <- c("No. of trees", "No. of variables",
                          "Dev_AUC", "Dev_Hit_rate", "Dev_Coverage_rate",
                          "Val_AUC", "Val_Hit_rate", "Val_Coverage_rate")
trees = c(50,100,150,250)
variables = c(8,10,15,20)
for (i in seq_along(trees)) {
  ntree = trees[i]
  for (j in seq_along(variables)) {
    mtry = variables[j]
    rf <- randomForest(x, y, ntree = ntree, mtry = mtry)
    # Development set: predict() without new data returns out-of-bag predictions
    pred <- as.data.frame(predict(rf, type = "class"))
    class_rf <- cbind(dev$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    dev_hit_rate = nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, predicted_values == 1))
    dev_coverage_rate = nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, actual_values == 1))
    pred_prob <- as.data.frame(predict(rf, type = "prob"))
    prob_rf <- cbind(dev$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    dev_auc <- as.numeric(auc@y.values)
    # Validation set
    pred <- as.data.frame(predict(rf, val, type = "class"))
    class_rf <- cbind(val$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    val_hit_rate = nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, predicted_values == 1))
    val_coverage_rate = nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) / nrow(subset(class_rf, actual_values == 1))
    pred_prob <- as.data.frame(predict(rf, val, type = "prob"))
    prob_rf <- cbind(val$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    val_auc <- as.numeric(auc@y.values)
    # One result row per (ntree, mtry) combination
    results_df = rbind(results_df, c(ntree, mtry, dev_auc, dev_hit_rate, dev_coverage_rate, val_auc, val_hit_rate, val_coverage_rate))
  }
}
