I tried neural net in R on Boston data set available.
data("Boston",package="MASS")
data <- Boston
Retaining only those variable we want to use:
keeps <- c("crim", "indus", "nox", "rm" , "age", "dis", "tax" ,"ptratio", "lstat" ,"medv" )
data <- data[keeps]
In this case the formula is stored in an R object called f.
The response variable medv is to be βregressedβ against the remaining nine attributes. I have done it as below:
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
To set up train sample 400 of the 506 rows of data without replacement is collected using the sample method:
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
neuralnet function of R is fitted.
library(neuralnet)
fit<- neuralnet(f, data = data[train ,], hidden=c(10 ,12 ,20),
algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
threshold =0.1, linear.output=TRUE)
But warning message is displayed as algorithm not converging.
Warning message:
algorithm did not converge in 1 of 1 repetition(s) within the stepmax
Tried Prediction using compute,
pred <- compute(fit,data[-train, 1:9])
Following error msg is displayed
Error in nrow[w] * ncol[w] : non-numeric argument to binary operator
In addition: Warning message:
In is.na(weights) : is.na() applied to non-(list or vector) of type 'NULL'
Why the error is coming up and how to recover from it for prediction. I want to use the neuralnet function on that data set.
When neuralnet doesn't converge, the resulting neural network is not complete. You can tell by calling attributes(fit)$names. When training converges, it will look like this:
[1] "call" "response" "covariate" "model.list" "err.fct"
[6] "act.fct" "linear.output" "data" "net.result" "weights"
[11] "startweights" "generalized.weights" "result.matrix"
When it doesn't, some attributes will not be defined:
[1] "call" "response" "covariate" "model.list" "err.fct" "act.fct" "linear.output"
[8] "data"
That explains why compute doesn't work.
When training doesn't converge, the first possible solution could be to increase stepmax (default 100000). You can also add lifesign = "full", to get better insight into the training process.
Also, looking at your code, I would say three layers with 10, 12 and 20 neurons is too much. I would start with one layer with the same number of neurons as the number of inputs, in your case 9.
EDIT:
With scaling (remember to scale both training and test data, and to 'de-scale' compute results), it converges much faster. Also note that I reduced the number of layers and neurons, and still lowered the error threshold.
data("Boston",package="MASS")
data <- Boston
keeps <- c("crim", "indus", "nox", "rm" , "age", "dis", "tax" ,"ptratio", "lstat" ,"medv" )
data <- data[keeps]
f <- medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat
set.seed(2016)
n = nrow(data)
train <- sample(1:n, 400, FALSE)
# Scale data. Scaling parameters are stored in this matrix for later.
scaledData <- scale(data)
fit<- neuralnet::neuralnet(f, data = scaledData[train ,], hidden=9,
algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
threshold = 0.01, linear.output=TRUE, lifesign = "full")
pred <- neuralnet::compute(fit,scaledData[-train, 1:9])
scaledResults <- pred$net.result * attr(scaledData, "scaled:scale")["medv"]
+ attr(scaledData, "scaled:center")["medv"]
cleanOutput <- data.frame(Actual = data$medv[-train],
Prediction = scaledResults,
diff = abs(scaledResults - data$medv[-train]))
# Show some results
summary(cleanOutput)
The problem seems to be in your argument linear.output = TRUE.
With your data, but changing the code a bit (not defining the formula and adding some explanatory comments):
library(neuralnet)
fit <- neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat,
data = data[train,],
hidden=c(10, 12, 20), # number of vertices (neurons) in each hidden layer
algorithm = "rprop+", # resilient backprop with weight backtracking,
err.fct = "sse", # calculates error based on the sum of squared errors
act.fct = "logistic", # smoothing the cross product of neurons and weights with logistic function
threshold = 0.1, # of the partial derivatives for error function, stopping
linear.output=FALSE) # act.fct applied to output neurons
print(net)
Call: neuralnet(formula = medv ~ crim + indus + nox + rm + age + dis + tax + ptratio + lstat, data = data[train, ], hidden = c(10, 12, 20), threshold = 0.1, rep = 10, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = FALSE)
10 repetitions were calculated.
Error Reached Threshold Steps
1 108955.0318 0.03436116236 4
5 108955.0339 0.01391790099 8
3 108955.0341 0.02193379592 3
9 108955.0371 0.01705056758 6
8 108955.0398 0.01983134293 8
4 108955.0450 0.02500006437 5
6 108955.0569 0.03689097762 5
7 108955.0677 0.04765829189 5
2 108955.0705 0.05052776877 5
10 108955.1103 0.09031966778 7
10 108955.1103 0.09031966778 7
# now compute will work
pred <- compute(fit, data[-train, 1:9])
Related
Hi I tried to print the confusion matrix for a dataset using R. Following are my results
My class variables contains binary values. Medv value is my class variable binarized with Medv value of the house greater than 230k being 1, or 0 otherwise. When I see the confusion matrix, at the end represents Positive class as 0. What does this mean? Are these results misrepresents my data?
my R code so far,
# Load CART packages
library(rpart)
library(rpart.plot)
library(caTools)
library(caret)
library (pROC)
housing_data = read.csv('housing.csv')
summary(housing_data)
housing_data = na.omit(housing_data)
# CART model
latlontree = rpart(Medv ~ Crim + Rm, data=housing_data , method = "class")
# Plot the tree using prp command defined in rpart.plot package
prp(latlontree)
# Split the data for Machine Learning
set.seed(123)
split = sample.split(housing_data$Medv, SplitRatio = 0.8)
train = subset(housing_data, split==TRUE)
test = subset(housing_data, split==FALSE)
#print (train)
#print (test)
# Create a CART model
tree = rpart(Medv ~ Crim + Zn + Indus + Chas + Nox + Rm + Age + Dis + Rad + Tax + Ptratio + B + Lstat , data=train , method = "class")
prp(tree)
#Decision tree prediction
#tree.pred = predict(tree, test)
pred = predict(tree,test, type="class")
#print (pred)
table(pred, test$Medv)
table(factor(pred, levels=min(test$Medv):max(test$Medv)),
factor(test$Medv, levels=min(test$Medv):max(test$Medv)))
# If p exceeds threshold of 0.5, M else R: m_or_r
#m_or_r <- ifelse(p > 0.5, 1, 0)
#print (m_or_r)
# Convert to factor: p_class
#p_class <- factor(m_or_r, levels = test$Medv)
# Create confusion matrix
confusionMatrix(table(factor(pred, levels=min(test$Medv):max(test$Medv)),
factor(test$Medv, levels=min(test$Medv):max(test$Medv))))
#print (tree.sse)
#ROC Curve
#Obtaining predicted probabilites for Test data
tree.probs=predict(tree,
test,
type="prob")
head(tree.probs)
#Calculate ROC curve
rocCurve.tree <- roc(test$Medv,tree.probs[,2])
#plot the ROC curve
plot(rocCurve.tree,col=c(4))
auc <- auc (test$Medv,tree.probs[,2])
print (auc)
#creating a dataframe with a single row
x <- data.frame("Crim"= c(0.03), "Zn"=c(13), "Indus"=c(3.5), "Chas"=c(0.3), "Nox"=c(0.58), "Rm"=c(4.1), "Age"=c(68), "Dis"=c(4.98), "Rad" =c(3), "Tax"=c(225), "Ptratio"=c(17), "B"=c(396), "Lstat"=c(7.56))
#Obtaining predicted probabilites for Test data
probability2=predict(tree,
x,
type="prob")
print (probability2)
#Obtaining predicted class for Test data
probability3=predict(tree,
x,
type="class")
print (probability3)
Image of the dataset
> library("lmtest")
> a = arima.sim(list(ar = c(.05, -.05)), 1000)
> b = arima(a, order = c(2, 0, 0))
> resettest(b)
**Error in terms.default(formula) : no terms component nor attribute**
Question 1. What I am doing is shown above. What should I do about that?
(I have tried to put in type, data and power parameter at resettest(), result is the same.)
Question 2.If I want to do the same thing on the model below
ππ‘=0.5+0.5π(π‘β1)β0.5π(π‘β2)+0.1π(π‘β1)^2+π_π‘
which is a ar(2) model plus 0.1π_(π‘β1)^2, how to fit this nonlinear model (by using R, thank you!)?
should have earn more reputation... can't post pic below 10 :(
The issue is that the first argument of resettest is
formula - a symbolic description for the model to be tested (or a fitted "lm" object).
So, passing an Arima object is not going to work. Instead we may manually define the lagged variables and provide an lm object or just the formula:
la1 <- Hmisc::Lag(a, 1)
la2 <- Hmisc::Lag(a, 2)
resettest(a ~ la1 + la2)
#
# RESET test
#
# data: a ~ la1 + la2
# RESET = 0.10343, df1 = 2, df2 = 993, p-value = 0.9018
Now your second model is nonlinear in variables but linear in parameters, so the same estimation methods still apply. (I'm assuming that the true DGP remains the same and you just want to test a new specification.) In particular,
resettest(a ~ la1 + la2 + I(la2^2))
#
# RESET test
#
# data: a ~ la1 + la2 + I(la2^2)
# RESET = 0.089211, df1 = 2, df2 = 992, p-value = 0.9147
I am running glmer logit models using the lme4 package. I am interested in various two and three way interaction effects and their interpretations. To simplify, I am only concerned with the fixed effects coefficients.
I managed to come up with a code to calculate and plot these effects on the logit scale, but I am having trouble transforming them to the predicted probabilities scale. Eventually I would like to replicate the output of the effects package.
The example relies on the UCLA's data on cancer patients.
library(lme4)
library(ggplot2)
library(plyr)
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
facmin <- function(n) {
min(as.numeric(levels(n)))
}
facmax <- function(x) {
max(as.numeric(levels(x)))
}
hdp <- read.csv("http://www.ats.ucla.edu/stat/data/hdp.csv")
head(hdp)
hdp <- hdp[complete.cases(hdp),]
hdp <- within(hdp, {
Married <- factor(Married, levels = 0:1, labels = c("no", "yes"))
DID <- factor(DID)
HID <- factor(HID)
CancerStage <- revalue(hdp$CancerStage, c("I"="1", "II"="2", "III"="3", "IV"="4"))
})
Until here it is all data management, functions and the packages I need.
m <- glmer(remission ~ CancerStage*LengthofStay + Experience +
(1 | DID), data = hdp, family = binomial(link="logit"))
summary(m)
This is the model. It takes a minute and it converges with the following warning:
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0417259 (tol = 0.001, component 1)
Even though I am not quite sure if I should worry about the warning, I use the estimates to plot the average marginal effects for the interaction of interest. First I prepare the dataset to be feed into the predict function, and then I calculate the marginal effects as well as the confidence intervals using the fixed effects parameters.
newdat <- expand.grid(
remission = getmode(hdp$remission),
CancerStage = as.factor(seq(facmin(hdp$CancerStage), facmax(hdp$CancerStage),1)),
LengthofStay = seq(min(hdp$LengthofStay, na.rm=T),max(hdp$LengthofStay, na.rm=T),1),
Experience = mean(hdp$Experience, na.rm=T))
mm <- model.matrix(terms(m), newdat)
newdat$remission <- predict(m, newdat, re.form = NA)
pvar1 <- diag(mm %*% tcrossprod(vcov(m), mm))
cmult <- 1.96
## lower and upper CI
newdat <- data.frame(
newdat, plo = newdat$remission - cmult*sqrt(pvar1),
phi = newdat$remission + cmult*sqrt(pvar1))
I am fairly confident these are correct estimates on the logit scale, but maybe I am wrong. Anyhow, this is the plot:
plot_remission <- ggplot(newdat, aes(LengthofStay,
fill=factor(CancerStage), color=factor(CancerStage))) +
geom_ribbon(aes(ymin = plo, ymax = phi), colour=NA, alpha=0.2) +
geom_line(aes(y = remission), size=1.2) +
xlab("Length of Stay") + xlim(c(2, 10)) +
ylab("Probability of Remission") + ylim(c(0.0, 0.5)) +
labs(colour="Cancer Stage", fill="Cancer Stage") +
theme_minimal()
plot_remission
I think now the OY scale is measured on the logit scale but to make sense of it I would like to transform it to predicted probabilities. Based on wikipedia, something like exp(value)/(exp(value)+1) should do the trick to get to predicted probabilities. While I could do newdat$remission <- exp(newdat$remission)/(exp(newdat$remission)+1) I am not sure how should I do this for the confidence intervals?.
Eventually I would like to get to the same plot what the effects package generates. That is:
eff.m <- effect("CancerStage*LengthofStay", m, KR=T)
eff.m <- as.data.frame(eff.m)
plot_remission2 <- ggplot(eff.m, aes(LengthofStay,
fill=factor(CancerStage), color=factor(CancerStage))) +
geom_ribbon(aes(ymin = lower, ymax = upper), colour=NA, alpha=0.2) +
geom_line(aes(y = fit), size=1.2) +
xlab("Length of Stay") + xlim(c(2, 10)) +
ylab("Probability of Remission") + ylim(c(0.0, 0.5)) +
labs(colour="Cancer Stage", fill="Cancer Stage") +
theme_minimal()
plot_remission2
Even though I could just use the effects package, it unfortunately does not compile with a lot of the models I had to run for my own work:
Error in model.matrix(mod2) %*% mod2$coefficients :
non-conformable arguments
In addition: Warning message:
In vcov.merMod(mod) :
variance-covariance matrix computed from finite-difference Hessian is
not positive definite or contains NA values: falling back to var-cov estimated from RX
Fixing that would require adjusting the estimation procedure, which at the moment I would like to avoid. plus, I am also curious what effects actually does here.
I would be grateful for any advice on how to tweak my initial syntax to get to predicted probabilities!
To obtain a similar result as the effect function provided in your question, you just have to back transform both the predicted values and the boundaries of your confidence interval from the logit scale to the original scale with the transformation you provide : exp(x)/(1+exp(x)).
This transformation can be done in base R with the plogis function :
> a <- 1:5
> plogis(a)
[1] 0.7310586 0.8807971 0.9525741 0.9820138 0.9933071
> exp(a)/(1+exp(a))
[1] 0.7310586 0.8807971 0.9525741 0.9820138 0.9933071
So using proposal from #eipi10 using ribbons for the confidence bands instead of the dotted lines (I also find this presentation more readable) :
ggplot(newdat, aes(LengthofStay, fill=factor(CancerStage), color=factor(CancerStage))) +
geom_ribbon(aes(ymin = plogis(plo), ymax = plogis(phi)), colour=NA, alpha=0.2) +
geom_line(aes(y = plogis(remission)), size=1.2) +
xlab("Length of Stay") + xlim(c(2, 10)) +
ylab("Probability of Remission") + ylim(c(0.0, 0.5)) +
labs(colour="Cancer Stage", fill="Cancer Stage") +
theme_minimal()
The results are the same (with effects_3.1-2 and lme4_1.1-13):
> compare <- merge(newdat, eff.m)
> compare[, c("remission", "plo", "phi")] <-
+ sapply(compare[, c("remission", "plo", "phi")], plogis)
> head(compare)
CancerStage LengthofStay remission Experience plo phi fit se lower upper
1 1 10 0.20657613 17.64129 0.12473504 0.3223392 0.20657613 0.3074726 0.12473625 0.3223368
2 1 2 0.35920425 17.64129 0.27570456 0.4522040 0.35920425 0.1974744 0.27570598 0.4522022
3 1 4 0.31636299 17.64129 0.26572506 0.3717650 0.31636299 0.1254513 0.26572595 0.3717639
4 1 6 0.27642711 17.64129 0.22800277 0.3307300 0.27642711 0.1313108 0.22800360 0.3307290
5 1 8 0.23976445 17.64129 0.17324422 0.3218821 0.23976445 0.2085896 0.17324530 0.3218805
6 2 10 0.09957493 17.64129 0.06218598 0.1557113 0.09957493 0.2609519 0.06218653 0.1557101
> compare$remission-compare$fit
[1] 8.604228e-16 1.221245e-15 1.165734e-15 1.054712e-15 9.714451e-16 4.718448e-16 1.221245e-15 1.054712e-15 8.326673e-16
[10] 6.383782e-16 4.163336e-16 7.494005e-16 6.383782e-16 5.689893e-16 4.857226e-16 2.567391e-16 1.075529e-16 1.318390e-16
[19] 1.665335e-16 2.081668e-16
The differences between the confidence boundaries is higher but still very small :
> compare$plo-compare$lower
[1] -1.208997e-06 -1.420235e-06 -8.815678e-07 -8.324261e-07 -1.076016e-06 -5.481007e-07 -1.429258e-06 -8.133438e-07 -5.648821e-07
[10] -5.806940e-07 -5.364281e-07 -1.004792e-06 -6.314904e-07 -4.007381e-07 -4.847205e-07 -3.474783e-07 -1.398476e-07 -1.679746e-07
[19] -1.476577e-07 -2.332091e-07
But if I use the real quantile of the normal distribution cmult <- qnorm(0.975) instead of cmult <- 1.96 I obtain very small differences also for these boundaries :
> compare$plo-compare$lower
[1] 5.828671e-16 9.992007e-16 9.992007e-16 9.436896e-16 7.771561e-16 3.053113e-16 9.992007e-16 8.604228e-16 6.938894e-16
[10] 5.134781e-16 2.289835e-16 4.718448e-16 4.857226e-16 4.440892e-16 3.469447e-16 1.006140e-16 3.382711e-17 6.765422e-17
[19] 1.214306e-16 1.283695e-16
this post follows this question : https://stackoverflow.com/questions/31234329/rpart-user-defined-implementation
I'm very interested in tools which could handle tree growing with customized criteria, such that I could test different model.
I tried to use the partykit R package to grow a tree for which the split rule is given by the negative log-likelihood of a Cox model (which is log-quasi-likelihood in case of the Cox model) and a Cox model is fitted in each leaf.
As I understood reading the vignette about the MOB function, there are two way to implement my own split criteria, namely to get the fit function return either a list or a model object.
For my purpose, I tried the two solutions but I failed to make it work.
Solution 1 : return a list object :
I take as an example the "breast cancer dataset" as in the "mob" vignette.
I tried this :
cox1 = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_cox = coxph(formula = y ~ x )
list(
coefficients = res_cox$coefficients,
objfun = - res_cox$loglik[2],
object = res_cox)
}
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox1,
control = mob_control(alpha = 0.0001) )
There is a warning about the singularity of the X matrix, and the mob function a tree with a single node (even with smaller values for alpha).
Note that there is no singularity problem with the X matrix when running the coxph function :
res_cox = coxph( formula = Surv(time, cens) ~ horTh + pnodes ,
data = GBSG2 )
Solution 2 : Return a coxph.object :
I tried this :
cox2 = function(y,x, start = NULL, weights = NULL, offset = NULL, ... ){
res_cox = coxph(formula = y ~ x )
}
logLik.cox2 <- function(object, ...)
structure( - object$loglik[2], class = "logLik")
mob(formula = Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade + progrec +
estrec + menostat ,
data = GBSG2 ,
fit = cox2,
control = mob_control(alpha = 0.0001 ) )
So this time I get a split along the "progrec" variable :
Model-based recursive partitioning (cox2)
Model formula:
Surv(time, cens) ~ horTh + pnodes - 1 | age + tsize + tgrade +
progrec + estrec + menostat
Fitted party:
[1] root
| [2] progrec <= 21: n = 281
| xhorThno xhorThyes xpnodes
| 0.19306661 NA 0.07832756
| [3] progrec > 21: n = 405
| xhorThno xhorThyes xpnodes
| 0.64810352 NA 0.04482348
Number of inner nodes: 1
Number of terminal nodes: 2
Number of parameters per node: 3
Objective function: 1531.132
Warning message:
In coxph(formula = y ~ x) : X matrix deemed to be singular; variable 2
I would like to know what's wrong with my Solution 1.
I also tried a similar thing for a regression problem and get the same result, ending with a single leaf :
data("BostonHousing", package = "mlbench")
BostonHousing <- transform(BostonHousing,
chas = factor(chas, levels = 0:1, labels = c("no", "yes")),
rad = factor(rad, ordered = TRUE))
linear_reg = function(y,x, start = NULL, weights = NULL, offset = NULL, ...,
estfun = FALSE, object = TRUE){
res_lm = glm(formula = y ~ x , family = "gaussian")
list(
coefficients = res_lm$coefficients,
objfun = res_lm$deviance,
object = res_lm )
}
mob( formula = medv ~ log(lstat) + I(rm^2) | zn + indus + chas + nox +
+ age + dis + rad + tax + crim + b + ptratio,
data = BostonHousing ,
fit = linear_reg)
Also I would like to know if there is no problem using a variable for both "fit the model in a node" and "make a split".
Thank you in advance.
I will probably have other questions about partykit functioning.
The problem with the cox1() and linear_reg() functions you have set up are that you do not supply the estimating functions aka score contributions. As these are the basis for the inference that selects the splitting variable, the algorithm does not split at all if these are not provided. See this recent answer for some discussion of this issues.
But for coxph() objects (unlike the fitdistr() example in the discussion linked above) it is very easy to obtain these estimating functions or scores because there is an estfun() method available. So your cox2() approach is the easier route to go here.
The reason that the latter doesn't work correctly is due to the special handling of intercepts in coxph(). Internally, this always forces the intercept into the model but then omits the first column from the design matrix. When interfacing this through mob() you need to be careful not to mess this up because mob() sets up its own model matrix. And because you exclude the intercept, mob() thinks that it can estimate both levels of horTh. But this is not the case because the intercept is not identified in the Cox-PH model.
The best solution in this case (IMO) is the following: You let mob() set up an intercept but then exclude it again when passing the model matrix to coxph(). Because there are coef(), logLik(), and estfun() methods for the resulting objects, one can use the simple setup of your cox2() function.
Packages and data:
library("partykit")
library("survival")
data("GBSG2", package = "TH.data")
Fitting function:
cox <- function(y, x, start = NULL, weights = NULL, offset = NULL, ... ) {
x <- x[, -1]
coxph(formula = y ~ 0 + x)
}
Fitting of the MOB tree to the GBSG2 data:
mb <- mob(formula = Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec + estrec + menostat,
data = GBSG2, fit = cox)
mb
## Model-based recursive partitioning (cox)
##
## Model formula:
## Surv(time, cens) ~ horTh + pnodes | age + tsize + tgrade + progrec +
## estrec + menostat
##
## Fitted party:
## [1] root: n = 686
## xhorThyes xpnodes
## -0.35701115 0.05768026
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 2
## Objective function: 1758.86
I am using the randomForest package in R to build several species distribution models. My response variable is binary (0 - absence or 1-presence), and pretty unbalanced - for some species the ratio of absences:presences is 37:1. This imbalance (or zero-inflation) leads to questionable out-of-bag error estimates - the larger the ratio of absences to presence, the lower my out-of-bag (OOB) error estimate.
To compensate for this imbalance, I wanted to implement stratified sampling such that each tree in the random forest included an equal (or at least less imbalanced) number of results from both the presence and absences category. I was surprised that there doesn't seem to be any difference in the stratified and unstratified model OOB error estimates. See my code below:
Without stratification
> set.seed(25)
> HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
> HHrf
Call:
randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 19.1%
Confusion matrix:
0 1 class.error
0 422 18 0.04090909
1 84 10 0.89361702
With stratification
> HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, strata = bll_HH$HH_Pres, sampsize = ceiling(.632*nrow(bll_HH)))
> HHrf
Call:
randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr + DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla, data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 19.1%
Confusion matrix:
0 1 class.error
0 422 18 0.04090909
1 84 10 0.89361702
Is there a reason that I am getting the same results in both cases? For the strata argument, I specify my response variable, HH_Pres. For the sampsize argument, I specify that it should just be 63.2% of the entire dataset.
Anyone know what I am doing wrong? Or is this to be expected?
Thanks,
Liza
To reproduce this problem:
Sample data: https://docs.google.com/file/d/0B-JMocik79JzY3B4U3NoU3kyNW8/edit
Code:
bll = read.csv("bll_Nov2013_NMV.csv", header=TRUE)
HH_Pres <- bll$HammerHeadALL_Presence
Slope <-bll$Slope
Dist2Shr <-bll$Dist2Shr
Bathy <-bll$Bathy2
Chla <-bll$GSM_Chl_Daily_MF
SST <-bll$SST_PF_daily
Region <- bll$Region
MoonPhase <-bll$MoonPhase
DaylightHours <- bll$DaylightHours
bll_HH <- data.frame(HH_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region)
set.seed(25)
HHrf<- randomForest(formula = factor(HH_Pres) ~ SST + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region) + Chla , data = bll_HH, ntree = 500, replace = FALSE, importance = TRUE, na.action = na.omit)
HHrf
set.seed(25)
HHrf_strata<- randomForest(formula = factor(HH_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_HH, strata = bll_HH$HH_Pres, sampsize = c(100, 50), ntree = 500, replace = FALSE, importance = TRUE)
HHrf
As far as I know, the sampsize argument should be a vector that is the same length as the number of classes in your data set. If you specify a factor variable in the strata argument, then sampsize should be given a vector that is the same length as the number of factors in the strata argument. I am not sure it performs as you describe in your question, but it has been a while since I have used the randomForest function.
From the help files, it says:
strata
A (factor) variable that is used for stratified sampling.
sampsize:
Size(s) of sample to draw. For classification, if sampsize is a vector
of the length the number of strata, then sampling is stratified by
strata, and the elements of sampsize indicate the numbers to be drawn
from the strata.
For example, since your classification has 2 distinct classes, you need to give sampsize a vector of length 2 that specifies how many observations you want to sample from each class during training time.
e.g. sampsize=c(100,50)
Furthermore, you can specify the names of the groups to be extra clear.
e.g. sampsize=c('0'=100, '1'=50)
An example from the help files that uses the sampsize argument, to clarify:
## stratified sampling: draw 20, 30, and 20 of the species to grow each tree.
data(iris)
(iris.rf2 <- randomForest(iris[1:4], iris$Species, sampsize=c(20, 30, 20)))
EDIT: Added some notes about the strata argument in randomForest.
EDIT: Make sure the strata argument is given a factor variable!
e.g. try strata = factor(HH_Pres), sampsize = c(...) where c(...) is a vector that is the same length as length(levels(factor(bll_HH$HH_Pres)))
EDIT:
OK, I tried running the code with your data, and it works for me.
# Fix up the data set to have HH_Pres and Region as factors
bll_HH$Region <- factor(bll_HH$Region)
bll_HH$HH_Pres <- factor(bll_HH$HH_Pres)
# Original RF code
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
Slope + MoonPhase + Chla + Region,
data=bll_HH, ntree = 500, replace = FALSE,
importance = TRUE, na.action = na.omit)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
# 0 1 class.error
# 0 425 15 0.03409091
# 1 86 8 0.91489362
# Take 63.2% from each class
mySampSize <- ceiling(table(bll_HH$HH_Pres) * 0.632)
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
Slope + MoonPhase + Chla + Region,
data=bll_HH, ntree = 500, replace = FALSE,
importance = TRUE, na.action = na.omit,
sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 18.91%
# Confusion matrix:
# 0 1 class.error
# 0 424 16 0.03636364
# 1 85 9 0.90425532
Note that the OOB error estimate is the same in this case, even if we only use 63.2% of the data from each of the classes in our bootstrap samples. This is probably due to using sample sizes that are proportional to the class distribution in your training data, and the relatively small size of your data set. Let's try changing mySampSize to make sure it REALLY worked.
# Change mySampSize. Sample 100 from class 0 and 50 from class 1
mySampSize[1] <- 100
mySampSize[2] <- 50
set.seed(25)
HHrf <- randomForest(formula=HH_Pres ~ SST + Dist2Shr + DaylightHours + Bathy +
Slope + MoonPhase + Chla + Region,
data=bll_HH, ntree = 500, replace = FALSE,
importance = TRUE, na.action = na.omit,
sampsize=mySampSize)
HHrf
# Output
# OOB estimate of error rate: 21.16%
# Confusion matrix:
# 0 1 class.error
# 0 382 58 0.1318182
# 1 55 39 0.5851064
This syntax seems to be working fine for me on your data. The OOB is 32.21% and the class error(s): 0.32, 0.29. I did kick up the number of Bootstraps to 1000. I always recommend using indexing to define a RF model. In certain circumstances, symbolic syntax seems to be unstable.
require(randomForest)
HHrf <- read.csv("bll_HH.csv")
set.seed(25)
( rf.mdl <- randomForest( y=as.factor(HHrf[,"HH_Pres"]), x=HHrf[,2:ncol(HHrf)],
strata=as.factor(HHrf[,"HH_Pres"]), sampsize=c(50,50),
ntree=1000) )
I ran into this problem too. What I noticed is that my error rate when using importance = TRUE changes significantly. It is not the same as if I did not choose stratification with sampling.
For me it ended up being a tradeoff in not having an importance/accuracy score for my classification tree. It appears to be one of many bugs in this implementation.