rpart not splitting obvious nodes - r

I am using a data set of about 54K records with 5 classes (pop), of which one class is insignificant. I am using the caret package and the following call to run rpart:
model <- train(pop ~ pe + chl_small, method = "rpart", data = training)
and I get the following tree:
n= 54259
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 54259 38614 pico (0.0014 0.18 0.29 0.25 0.28)
2) pe< 5004 39537 23961 pico (0 0.22 0.39 2.5e-05 0.38)
4) chl_small< 32070.5 16948 2900 pico (0 0.00012 0.83 5.9e-05 0.17) *
5) chl_small>=32070.5 22589 10281 ultra (0 0.39 0.068 0 0.54) *
3) pe>=5004 14722 1113 synecho (0.0052 0.052 0.0047 0.92 0.013) *
It is obvious that node 5 should be split further, but rpart is not doing it. I tried cp values from 0.001 to 0.1 and also minbucket = 1000 as additional parameters, but there was no improvement.
Appreciate any help on this.

Try running the model with an even smaller cp = 0.00001, or cp = -1. If it is still not splitting that node, then the split will not improve the overall fit.
You can also try changing the splitting criterion from the default Gini impurity to the information gain criterion: parms = list(split = "information").
If you do force it to split, it might be a good idea to do a quick check:
compare the accuracy on the training vs. test set for the original model and for the model with the small cp.
If the training/test difference is much smaller for the original model, then the other model probably overfits the data.
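For reference, here is a minimal sketch of how both suggestions can be passed through caret's train() (assuming the training data frame and the pe/chl_small columns from your question; the cp grid below is purely illustrative):
library(caret)
library(rpart)
# tune cp explicitly instead of relying on caret's default grid;
# extra arguments such as parms are forwarded to rpart()
grid <- data.frame(cp = c(1e-5, 1e-4, 1e-3, 1e-2))
model <- train(pop ~ pe + chl_small,
               data = training,
               method = "rpart",
               tuneGrid = grid,
               parms = list(split = "information"))
# inspect the underlying rpart tree
print(model$finalModel)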

Related

Simulating correlated Bernoulli data

I want to simulate 100 observations with 5 columns and a correlation of 0.5 between the columns. To set this up, I defined the correlation matrix:
F1 <- matrix( c(1, .5, .5, .5,.5,
.5, 1, .5, .5,.5,
.5, .5, 1, .5,.5,
.5, .5, .5, 1,.5,
.5, .5, .5, .5,1
), 5,5)
To simulate the intended data frame, I have done this, but it does not work properly.
df2 <- as.data.frame (rbinom(100, 1,.5),ncol(5), F1)
I'm surprised this isn't a duplicate (the closest existing question refers specifically to non-binary responses, i.e. binomial with N > 1). The bindata package does what you want.
library(bindata)
## set up correlation matrix (compound-symmetric with rho=0.5)
m <- matrix(0.5,5,5)
diag(m) <- 1
Simulate with a mean of 0.5 (as in your example):
set.seed(101)
## this simulates 10 rather than 100 realizations
## (I didn't read your question carefully enough)
## but it's easy to change
r <- rmvbin(n=10, margprob=rep(0.5,5), bincorr=m)
round(cor(r),2)
Results
1.00 0.22 0.80 0.05 0.22
0.22 1.00 0.00 0.65 1.00
0.80 0.00 1.00 -0.09 0.00
0.05 0.65 -0.09 1.00 0.65
0.22 1.00 0.00 0.65 1.00
This looks wrong (the correlations aren't exactly 0.5), but on average they will be: when I sampled 10,000 vectors rather than 10, the values ranged from about 0.48 to 0.51. Equivalently, if you simulated many samples of 10 and computed the correlation matrix for each, you should find that the expected (average) correlation matrix is correct; see the sketch below.
Simulating values with correlation exactly equal to the specified value is much harder (and not necessarily what you want to do anyway, depending on the application).
Note that there are limits on which mean vectors and correlation matrices are feasible. For example, the off-diagonal elements of an n-by-n compound-symmetric (equal-correlation) matrix can't be less than -1/(n-1). Similarly, there may be limits on what correlations are possible for a given set of means (this may be discussed in the technical reference; I haven't checked).
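A quick check along those lines (a sketch, reusing the correlation matrix m defined above):
set.seed(101)
## with a much larger sample the estimated correlations settle near 0.5
r_big <- rmvbin(n = 10000, margprob = rep(0.5, 5), bincorr = m)
round(cor(r_big), 2)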
The reference for this method is:
Leisch, Friedrich, Andreas Weingessel, and Kurt Hornik (1998). "On the generation of correlated artificial binary data." Working Paper 13, SFB "Adaptive Information Systems and Modelling in Economics and Management Science", WU Vienna University of Economics and Business, Vienna. https://epub.wu.ac.at/286/

Supplying several methods in a loop and saving the output

EDIT: Actually, looking at the function shows that it only uses one method. How can I write a loop that selects each method from a vector of methods?
I've looked at a couple of questions (e.g. Loop in R: how to save the outputs?) but can't seem to figure out how to save the output of these models. Here is my "function":
library(caret)
Control <- trainControl(method = "cv", number = 5)
metric <- "Accuracy"
modeler <- function(df, yname, xname, method, metric, control){
  df <- df
  methods1 <- method
  f1 <- as.formula(paste0(yname, "~", xname))
  for(method in methods1){
    fit <- do.call("train", list(data = quote(df), f1, metric = metric,
                                 trControl = control, method = method))
  }
  fit
}
An implementation of the "function":
methods1<-c("rf","rpart")
modeler(iris,yname="Species",xname=".",method = methods1,control=Control,
metric = metric)
Now I could save the above like:
mod1<-modeler(iris,yname="Species",xname=".",method = methods1,control=Control,
metric = metric)
And then:
sapply(mod1,"[",1)
This works, but it returns only the last model and not in the best of displays.
How can I optimise this process?
Here is an abridged answer covering the key parts.
With methods1 as a vector of methods, one can use lapply:
methods1 <- c("rf", "rpart")
# use lapply to loop through the methods;
# fit will be a list of the results from modeler
fit <- lapply(methods1, function(met) {
  modeler(iris, yname = "Species", xname = ".", method = met,
          control = Control, metric = metric)
})
If the calculations are time-consuming, the lapply call can easily be made to run in parallel with the parallel package.
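For example, a sketch using parallel::mclapply (which forks, so it works on Unix-alikes but not on Windows; there you would use makeCluster()/parLapply() instead; mc.cores below is illustrative):
library(parallel)
# fit one model per method on separate cores
fit <- mclapply(methods1, function(met) {
  modeler(iris, yname = "Species", xname = ".", method = met,
          control = Control, metric = metric)
}, mc.cores = 2)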
Based on #Dave2e's comment, here is how I've managed to "solve" it:
library(caret)
Control <- trainControl(method = "cv", number = 5)
metric <- "Accuracy"
modeler <- function(df, yname, xname, method, metric, control){
  df <- df
  methods1 <- method
  f1 <- as.formula(paste0(yname, "~", xname))
  for(method in methods1){
    fit <- do.call("train", list(data = quote(df), f1, metric = metric,
                                 trControl = control, method = method))
  }
  data.frame(fit$results, Type = fit$modelInfo$tags[1])
}
Implement:
methods1 <- c("rf", "rpart")
lapply(methods1, function(met) {
  modeler(iris, yname = "Species", xname = ".", method = met,
          control = Control, metric = metric)
})
Result (no seed set); this is still lacking, but it works:
[[1]]
mtry Accuracy Kappa AccuracySD KappaSD Type
1 2 0.9533333 0.93 0.03800585 0.05700877 Random Forest
2 3 0.9533333 0.93 0.03800585 0.05700877 Random Forest
3 4 0.9533333 0.93 0.03800585 0.05700877 Random Forest
[[2]]
cp Accuracy Kappa AccuracySD KappaSD Type
1 0.00 0.9533333 0.93 0.02981424 0.04472136 Tree-Based Model
2 0.44 0.7733333 0.66 0.14605935 0.21908902 Tree-Based Model
3 0.50 0.3333333 0.00 0.00000000 0.00000000 Tree-Based Model
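If you would rather see one table than a list, the per-method data frames can be stacked; a small sketch (res and common are just illustrative names):
res <- lapply(methods1, function(met) {
  modeler(iris, yname = "Species", xname = ".", method = met,
          control = Control, metric = metric)
})
names(res) <- methods1                # label each element by its method
# keep only the columns shared by all methods before stacking
common <- c("Accuracy", "Kappa", "AccuracySD", "KappaSD", "Type")
do.call(rbind, lapply(res, function(d) d[, common]))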

Mixed integer programming R: Least absolute deviation with cost associated with each regressor

I have been presented with a problem regarding minimization of the absolute error, known as LAD (least absolute deviation), but with the twist that each regressor is the result of an expensive test with an associated cost, so one should refrain from using regressors that don't explain much variance. The objective takes (roughly) the following form:
minimize sum_{i=1..N} |E_i| + lambda * sum_{j=1..S} C_j * z_j
where N is the total number of observations, E_i the deviation associated with observation i, S the number of independent variables, lambda a penalty coefficient for the cost, and C_j the cost associated with performing test j.
So far I have proceeded in the usual way. To make it linear, I split the absolute value into two error terms, E^+ and E^-, where E_i = y_i - (B_0 + sum_j(B_j * X_ij)), with the following constraints (a sketch of this linearization follows the list):
z_j in {0,1}: a binary variable indicating whether regressor j enters the model;
B_j <= M*z_j and B_j >= -M*z_j (big-M constraints tying each coefficient to its z_j);
E^+_i, E^-_i >= 0.
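For reference, here is a minimal sketch of the plain LAD part of this linearization with lpSolve (no cost penalty; the data, sizes, and names below are made up for illustration). The cost term would add the binary z_j variables via lp()'s binary.vec argument plus the big-M rows.
library(lpSolve)
set.seed(1)
n <- 20; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))   # first column = intercept
y <- X %*% c(1, 2, -1) + rnorm(n, sd = 0.5)
k <- ncol(X)                                # number of coefficients
# decision variables: beta+ (k), beta- (k), E+ (n), E- (n); lp() assumes all
# variables are >= 0, so each free coefficient is written as beta+ - beta-
obj <- c(rep(0, 2 * k), rep(1, 2 * n))      # minimize sum(E+ + E-)
# equality rows: X beta+ - X beta- + E+ - E- = y
constr <- cbind(X, -X, diag(n), -diag(n))
sol <- lp("min", obj, constr, rep("=", n), as.vector(y))
beta_hat <- sol$solution[1:k] - sol$solution[(k + 1):(2 * k)]
beta_hat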
A toy subset of data I'm working has the following structure:
For y
quality
1 5
2 5
3 5
4 6
5 7
6 5
For the regressors
fixed.acidity volatile.acidity citric.acid
1 7.5 0.610 0.26
2 5.6 0.540 0.04
3 7.4 0.965 0.00
4 6.7 0.460 0.24
5 6.1 0.400 0.16
6 9.7 0.690 0.32
And for the cost
fixed.acidity volatile.acidity citric.acid
1 0.26 0.6 0.52
So far, my code looks like this:
# loading the matrixes
y <- read.csv(file="PATH\\y.csv", header = TRUE, sep = ",") #dim=100*11
regresores <- read.csv(file="PATH\\regressors.csv", header = TRUE, sep = ",")#dim=100*1
cost <- read.csv(file="PATH\\cost.csv", header = TRUE, sep = ",")#dim=1*11
for (i in seq(0, 1, by = 0.1)){#so as to have a collection of models with different penalties
obj.fun <- c(1,1,i*coste)
constr <- matrix(
c(y,regresores,-regresores),
c(-y,-regresores,regresores),
sum(regresores),ncol = ,byrow = TRUE)
constr.dir <- c("<=",">=","<=","==")
rhs<-c(regresores,-regresores,1,binary)
sol<- lp("min", obj.fun, constr, constr.tr, rhs)
sol$objval
sol$solution}
I know there is a LAD function in R, but for consistency's sake with my colleagues, as well as a pretty annoying PhD tutor, I have to do this using lpSolve in R. I have just started with R for this project and I don't know exactly why this won't run. Is there something wrong with the syntax or with my formulation of the model? Right now, the main problem I have is:
"Error in matrix(c(y, regressors, -regressors), c(-y, -regressors, regressors), : non-numeric matrix extent".
Mainly, I intended the loop to build this weighted LAD model and return the solution for each value of lambda, from 0 to 1 in steps of 0.1.
Thanks in advance, and sorry for any inconvenience; neither English nor R is my native language.

Regression tree with simulated data - rpart package

I have simulated some data to create a regression tree with 3 terminal nodes:
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
So I want the first split to be at X1 < 1, and within X1 < 1 I want a split at X2 < 0.2. The values of Y in the leaves are the coefficients of the indicators.
If I run the procedure implemented in the rpart package, everything is fine in the case above.
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 1627.0670 4.043696
2) X1>=0.9490461 326 373.8485 3.124825 *
3) X1< 0.9490461 674 844.8367 4.488135
6) X2>=0.2488142 327 312.7506 3.970742 *
7) X2< 0.2488142 347 362.0582 4.975708 *
It also works if I make all the coefficients negative.
But when I generate a mix of negative and positive values in the final terms (that is, in the "interaction" of the tree, where the split occurs at the second level), rpart changes the order of the splits and the values in the leaves are not correct:
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 17811.4000 0.6136962
2) X2< 0.1974489 515 8116.5350 -2.3192910
4) X1< 1.002815 343 359.7394 -5.0305350 *
5) X1>=1.002815 172 207.4313 3.0874360 *
3) X2>=0.1974489 485 560.3419 3.7281050 *
Does anyone have an idea about this problem?
Thanks
You need to tune the complexity parameter cp. See the code below.
# Data Generating Process
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
library(rpart)
mytree<-rpart(Y~.,data=mydat, cp=0.0001)
# Plot the cross-validation error vs the complexity parameter
plotcp(mytree)
# Find the optimal value of the complexity parameter cp
optcp <- mytree$cptable[which.min(mytree$cptable[,4]),1]
# Prune the tree using the optimal complexity parameter
mytree <- prune(mytree,optcp)
The pruned tree correctly represents the underlying data generating process:
library(rattle)
fancyRpartPlot(mytree)
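As a quick check (a sketch; the rattle plot above is optional), you can print the pruned tree and compare the splits to the data generating process:
# the leaves should recover the regions defined by X1 (cut near 1) and
# X2 (cut near 0.2), with fitted values near -5, 4 and 3
print(mytree)
# an alternative plot if rattle is not installed
library(rpart.plot)
rpart.plot(mytree)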

Calculating CIs of fixed effects using confint in R

I would like to perform bootstrapping to obtain 95% CIs for my fixed effects in a binomial GLMM:
m <- glmer(cbind(df$Valid.detections, df$Missed.detections) ~ distance +
Habitat + Replicate + transmitter.depth + receiver.depth +
wind.speed + wtc + Transmitter + (1 | Unit) +
(1 | SUR.ID) + distance:Transmitter +
distance:Habitat + distance:transmitter.depth + distance:receiver.depth +
distance:wind.speed, data = df, family = binomial(link=logit),control=glmerControl(calc.derivs=F))
I found that the confint() function is able to achieve this, so I specified the function:
confint(m, method = "boot", boot.type = "basic", seed = 123, nsim = 1000)
The function had been running for more than 8 hours before I decided to terminate it. Upon termination, I got the following warning messages (x10):
Warning messages:
1: In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
failure to converge in 10000 evaluations
My questions are: 1) Do I have to worry about these warning messages? If so, how could I deal with them? 2) Because it was still running after 8 hours, I have no clue how long this function takes, so it would be nice to have some sort of progress bar while it runs. I read that confint() can pass arguments to bootMer, so I included the argument .progress = "txt", resulting in:
confint(m, method = "boot", boot.type = "basic", seed = 123, nsim = 1000, .progress = "txt")
but it doesn't seem to work. Alternatively, if there are better ways to achieve the same goal, I'm open to suggestions.
Thanks for any help
Here's an example:
library("lme4")
(t1 <- system.time(
gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial)))
## user system elapsed
## 0.188 0.000 0.186
nranpars <- length(getME(gm1,"theta"))
nfixpars <- length(fixef(gm1))
(t2 <- system.time(c1 <- confint(gm1,method="boot", nsim=1000,
parm=(nranpars+1):(nranpars+nfixpars),
.progress="txt")))
## user system elapsed
## 221.958 0.164 222.187
Note that this progress bar is only 80 characters long, so it increments only after every 1000/80 ≈ 12 bootstrap iterations. If your original model took an hour to fit, then you shouldn't expect to see any progress-bar activity until 12 hours later ...
(t3 <- system.time(c2 <- confint(gm1,
parm=(nranpars+1):(nranpars+nfixpars))))
## user system elapsed
## 5.212 0.012 5.236
1000 bootstrap reps is probably overkill -- if your model fit is slow, you can probably get reasonable CIs from 200 bootstrap reps.
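Applied to your own model m, that might look like this (a sketch following the same pattern as above; adjust parm if you also want the random-effect parameters):
nranpars <- length(getME(m, "theta"))
nfixpars <- length(fixef(m))
# 200 bootstrap replicates and a text progress bar
ci <- confint(m, method = "boot", nsim = 200,
              parm = (nranpars + 1):(nranpars + nfixpars),
              .progress = "txt", PBargs = list(style = 3))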
I tried this with optimizer="nloptwrap" as well, hoping it would speed things up. It did, although there is a drawback (see below).
(t4 <- system.time(
gm1B <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial,
control=glmerControl(optimizer="nloptwrap"))))
## user system elapsed
## 0.064 0.008 0.075
(t5 <- system.time(c3 <- confint(gm1B,method="boot", nsim=1000,
parm=(nranpars+1):(nranpars+nfixpars),
.progress="txt",PBargs=list(style=3))))
##
## user system elapsed
## 65.264 2.160 67.504
This is much faster, but gives warnings (37 in this case) about convergence. According to all.equal(), there was only about a 2% difference in the confidence intervals calculated this way. (There are still some wrinkles to sort out in the package itself ...)
Your best bet for speeding this up will be to parallelize -- unfortunately, this way you lose the ability to use a progress bar ...
(t6 <- system.time(c4 <- confint(gm1,method="boot", nsim=1000,
parm=(nranpars+1):(nranpars+nfixpars),
parallel="multicore", ncpus=4)))
##
## user system elapsed
## 310.355 0.916 116.917
This takes more user time (it counts the time used on all cores), but the elapsed time is cut in half. (It would be nice to do better with 4 cores, but twice as fast is still good. These are virtual cores on a virtual Linux machine; real, non-virtual cores might have better performance.)
With nloptwrap and multicore combined I can get the time down to 91 seconds (user)/ 36 seconds (elapsed).
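A sketch of that combination, following the same pattern as above (gm1B is the nloptwrap-fitted model; the object names just continue the numbering):
(t7 <- system.time(c5 <- confint(gm1B, method = "boot", nsim = 1000,
                                 parm = (nranpars + 1):(nranpars + nfixpars),
                                 parallel = "multicore", ncpus = 4)))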
