Choosing the method and complexity parameter (cp) in rpart - r

What's the rule for selecting the complexity parameter (cp) and the method argument in the rpart() function from the rpart package? I've read a couple of articles about the package, but the content was too technical for me to grok.
Example:
library(rpart)
rpart_1 <- rpart(myFormula, data = kyphosis,
                 method = "class",
                 control = rpart.control(minsplit = 0, cp = 0))
plotcp(rpart_1)
printcp(rpart_1)

You typically don't choose the method parameter as such; it's determined by the problem you're solving. If it's a classification problem, you use method="class"; if it's a regression problem, you use method="anova"; and so on. Naturally, this means you have to understand what problem you're trying to solve, and whether your data will let you solve it.
The cp parameter controls the size of the fitted tree. You choose its value via cross-validation or a separate test dataset. rpart is somewhat different from most other R modelling packages in how it handles this: the rpart function does cross-validation automatically by default. You then examine the fitted model to see the result of the cross-validation, and prune it based on that.
Worked example, using the MASS::Boston dataset:
library(MASS)
library(rpart)
# does 10-fold CV by default
Bos.tree <- rpart(medv ~ ., data=Boston, cp=0)
# look at the result of the CV
plotcp(Bos.tree)
The plot shows that the 10-fold cross-validated error flattens out beginning at a tree size of about 9 leaf nodes. The dotted line is the minimum of the curve plus 1 standard error, which is a standard rule of thumb for pruning decision trees: you pick the smallest tree size that is within 1 SE of the minimum.
Printing the CP values gives a more precise view of how to choose the tree size:
printcp(Bos.tree)
#            CP nsplit rel error  xerror     xstd
# 1  0.45274420      0   1.00000 1.00355 0.082973
# 2  0.17117244      1   0.54726 0.61743 0.057053
# 3  0.07165784      2   0.37608 0.43034 0.046596
# 4  0.03616428      3   0.30443 0.34251 0.042502
# 5  0.03336923      4   0.26826 0.32642 0.040456
# 6  0.02661300      5   0.23489 0.32591 0.040940
# 7  0.01585116      6   0.20828 0.29324 0.040908
# 8  0.00824545      7   0.19243 0.28256 0.039576
# 9  0.00726539      8   0.18418 0.27334 0.037122
# 10 0.00693109      9   0.17692 0.27593 0.037326
# 11 0.00612633     10   0.16999 0.27467 0.037310
# 12 0.00480532     11   0.16386 0.26547 0.036897
# . . .
This shows that a CP value of 0.00612 corresponds to a tree with 10 splits (and hence 11 leaves). This is the value of cp you use to prune the tree. So:
# prune with a value of cp slightly greater than 0.00612633
Bos.tree.cv <- prune(Bos.tree, cp=0.00613)
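If you prefer to pick the pruning value programmatically rather than reading it off the table, the 1-SE rule can be applied directly to the model's cptable. The CP, xerror and xstd columns are the ones printcp() shows above; the selection logic itself is a sketch of mine, not a built-in rpart feature:
# smallest tree whose CV error is within 1 SE of the minimum
cp_tab  <- as.data.frame(Bos.tree$cptable)
thresh  <- min(cp_tab$xerror) + cp_tab$xstd[which.min(cp_tab$xerror)]
best_cp <- cp_tab$CP[which(cp_tab$xerror <= thresh)[1]]  # rows are ordered smallest tree first
Bos.tree.1se <- prune(Bos.tree, cp = best_cp)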

Related

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust E (univariate, equal variance) model with 4 components:

 log-likelihood    n df       BIC       ICL
      -20504.71 3258  8 -41074.13 -44326.69

Clustering table:
   1    2    3    4
   0 2271  896   91

Mixing probabilities:
        1         2         3         4
0.2807685 0.4342499 0.2544305 0.0305511

Means:
       1        2        3        4
1381.391 1381.715 1574.335 1851.667

Variances:
       1        2        3        4
7466.189 7466.189 7466.189 7466.189
Edit: Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially one with a mean nearly identical to that of the second cluster. This only appears when specifically looking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving the default, does not produce this problem.
This thread: Cluster contains no observations has a similar problem, but if I understand correctly, that appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there's quite a lot of data there (see the spike on the histogram):
set.seed(111)
library(mclust)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value, br=50)
abline(v=fit$parameters$mean,
       col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"), lty=8)
Briefly, mclust fits a Gaussian mixture model (GMM), a probabilistic model that estimates the mean and variance of each cluster as well as the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The model's likelihood is built from these per-point, per-cluster probabilities; you can check this out in mclust's publication.
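Concretely, for a univariate equal-variance mixture with K components, the log-likelihood that EM maximizes is (this is the textbook GMM formulation, not anything specific to this dataset):

\log L = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \sigma^2)

where the \pi_k are the mixing probabilities reported above.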
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if you have a data point around the means of 1 or 2, it will be consistently assigned to cluster 2. For example, let's try to predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
             1         2          3            4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, same example, everything is assigned to 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
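You can check that correspondence directly (max.col is base R; this snippet is my own illustration using the fitted object from above):
z <- predict(fit, 1350:1400)$z
head(max.col(z))  # column with the maximum probability per row, i.e. the classification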
To answer your question: no, you did not do anything wrong; it's a quirk, at least with this implementation of GMM. I would say it's a bit of overfitting, but you can basically keep only the clusters that actually have members.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need to use a Gaussian mixture with spherical (equal-variance) components, consider using a fuzzy k-means:
library(ClusterR)
# fit_kmeans was not defined in the original; a fuzzy k-means fit is assumed
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.
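For the ClusterR route, a minimal sketch (GMM() and its gaussian_comps argument are ClusterR's API; the choice of 3 components is just an assumption for illustration):
library(ClusterR)
# Gaussian mixture without the equal-variance constraint
gmm_fit <- GMM(as.matrix(data$value), gaussian_comps = 3)
gmm_fit$centroids  # component means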

'mtry' in 'rfcv()' function of R's 'randomForest' library

I would like to use cross validation to determine the number of variables to try in a Random Forest method. I don't understand how to use the mtry argument in the rfcv() function.
I have 6 predictors in my dataset. I want to use mtry = 6, 5, 4, 3, 2, 1 (i.e., every possible value of m) and cross-validate with 5-fold CV.
I believe this can be done with the rfcv() function of the randomForest package. I am running the code:
library(randomForest)
rf_cv <- rfcv(training_x, training_y, cv.fold=5, mtry=function(p) max(1, p-1))
However, calling rf_cv$n.var gives me:
[1] 6 3 1
So, this method does not apply mtry as I was hoping, since I asked it to reduce the number of variables by 1 each time.
How can I try every number of variables, applying 5-fold cross-validation for each?
I checked this post; however, it is not completely related, since it discusses the default of mtry.
The post you referenced explains how step determines the variable counts tested. So in your case, p = 6, and since you did not change step or scale:
p <- 6; step <- 0.5
k <- floor(log(p, base = 1/step))
n.var <- round(p * step^(0:(k - 1)))
n.var
[1] 6 3
And if n.var does not include 1, rfcv goes ahead and includes it for you, which gives you 6, 3, 1. So if you want to try every number of variables, set mtry to identity, step to -1, and scale to anything but "log" (yeah, the code doesn't give you other options):
rf_cv <- rfcv(matrix(rnorm(100*6), ncol=6), rnorm(100), cv.fold=3,
              mtry=identity, scale="new", step=-1)
rf_cv$n.var
[1] 6 5 4 3 2 1

R e1071: Balanced Error Rate (BER) as error criterion in tune function

I'm kind of new to R and machine learning in general, so apologies if this seems stupid!
I'm using the e1071 package to tune the parameters of various models. My dataset is very unbalanced and I would like the error criterion to be the Balanced Error Rate, NOT the overall classification error. However, I'm stumped as to how to achieve this.
Here is my code:
library(e1071)

# Find the optimal 'k' for the k-NN model (feature subset).
c <- data_train_sub[1:13]
d <- data_train_sub[,14]
knn2 <- tune.knn(c, d, k = 1:10,
                 tunecontrol = tune.control(sampling = "cross", performances = TRUE,
                                            sampling.aggregate = mean))
summary(knn2)
plot(knn2)
Which returns this:
Parameter tuning of ‘knn.wrapper’:

- sampling method: 10-fold cross validation

- best parameters:
 k
 1

- best performance: 0.001190476

- Detailed performance results:
    k       error  dispersion
1   1 0.001190476 0.003764616
2   2 0.005952381 0.006274360
3   3 0.003557423 0.005728122
4   4 0.005924370 0.008352124
5   5 0.005938375 0.008407043
6   6 0.005938375 0.008407043
7   7 0.007128852 0.008315090
8   8 0.009495798 0.009343555
9   9 0.008305322 0.009751997
10 10 0.008319328 0.009795292
Does anyone have experience with altering the error criterion assessed by this function?
Look at the class.weights argument of the svm() function:
a named vector of weights for the different classes, used for asymmetric class sizes...
The weights can easily be calculated like this:
class.weights = table(Xcal$species)/sum(table(Xcal$species))
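If you want the balanced error rate itself rather than reweighting, note that tune.control() also accepts an error.fun argument: a function of the true and predicted values whose result is minimized. A sketch along those lines (the ber() helper is my own illustration, not part of e1071):
ber <- function(true, pred) {
  # balanced error rate: mean of the per-class misclassification rates
  mean(sapply(levels(factor(true)), function(cl) mean(pred[true == cl] != cl)))
}
knn_ber <- tune.knn(c, d, k = 1:10,
                    tunecontrol = tune.control(sampling = "cross", error.fun = ber))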

Multinomial logit models and nested logit models

I am using the mlogit package in program R. I have converted my data from its original wide format to long format. Here is a sample of the converted data.frame which I refer to as 'long_perp'. All of the independent variables are individual specific. I have 4258 unique observations in the data-set.
         date_id  act2 grp.bin pdist   ship   sea avgknots   shore day location chid  alt
4.dive 40707_004  TRUE       2 2.250 second light 14.06809 2.30805  12    Lower    4 dive
4.fly  40707_004 FALSE       2 2.250 second light 14.06809 2.30805  12    Lower    4  fly
4.none 40707_004 FALSE       2 2.250 second light 14.06809 2.30805  12    Lower    4 none
5.dive 40707_006 FALSE       2 0.000 second light 15.12650 2.53312  12    Lower    5 dive
5.fly  40707_006  TRUE       2 0.000 second light 15.12650 2.53312  12    Lower    5  fly
5.none 40707_006 FALSE       2 0.000 second light 15.12650 2.53312  12    Lower    5 none
6.dive 40707_007 FALSE       1 1.995 second light 14.02101 2.01680  12    Lower    6 dive
6.fly  40707_007  TRUE       1 1.995 second light 14.02101 2.01680  12    Lower    6  fly
6.none 40707_007 FALSE       1 1.995 second light 14.02101 2.01680  12    Lower    6 none
'act2' is the dependent variable and consists of the choices a bird floating on the water could make when approached by a ship: fly, dive, or none. I am interested in how these probabilities relate to the remaining independent variables in the data.frame, i.e. perpendicular distance to the ship's path (pdist), sea conditions (sea), speed (avgknots), distance to shore (shore), etc. The independent variables are a mix of dichotomous, factor, and continuous variables.
I ran two multinomial logit models, one including all the choice options and another including only a subset. I then compared these models with the hmftest() function to test the IIA assumption. The results were confusing, to say the least. I will include the code for the two models and the test output (in case I am mis-specifying the models).
library(mlogit)

# model including all choice options (fly, dive, none)
mod.1 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
                as.factor(sea) + avgknots + shore + as.factor(location),
                long_perp, reflevel = 'none')

# model including only a subset of choice options (fly, dive)
mod.alt <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
                  as.factor(sea) + avgknots + shore + as.factor(location),
                  long_perp, reflevel = 'none', alt.subset = c("fly","dive"))
# IIA test
hmftest(mod.1, mod.alt)
# output
Hausman-McFadden test
data: long_perp
chisq = -968.7303, df = 7, p-value = 1
alternative hypothesis: IIA is rejected
As you can see, the chi-squared statistic is negative! I assume I am either 1. doing something wrong, or 2. IIA is violated. This result holds for the choice subset (fly, dive), but the IIA assumption is upheld with the choice subset (none, dive)? This confuses me.
Next I tried to formulate a nested model as a way to relax the IIA assumption. I nested the choices as nest1 = none, nest2 = fly, dive. This makes sense to me as a logical break: the bird decides whether to react or not, then decides which reaction to make.
I am unclear on how to run the nested logit models (even after reading the two vignettes for mlogit, Croissant vignette and Train vignette).
When I run my analysis following the example in the Croissant vignette I get the following error.
nested.1 <- mlogit(act2 ~ 0 | pdist + as.factor(grp.bin) + as.factor(ship) +
                   as.factor(sea) + avgknots + shore + as.factor(location),
                   long_perp, reflevel = "none",
                   nests = list(noact = "none", react = c("dive","fly")),
                   unscaled = TRUE)
# Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) :
#   Lapack routine dgesv: system is exactly singular: U[19,19] = 0
I have read a bit about this error message and it may occur because of complete separation. I have looked at some tables of the data and do not believe this is happening as I have 4,000+ observations and only one factor variable with more than 2 levels (it has 3).
Help on these specific problems is greatly appreciated but I am also open to alternate analyses that I can use to answer my question. I am mainly interested in the probability of flying as a function of perpendicular distance to the ships path.
Thanks, Tim
To get a positive chi-squared statistic, change the code as follows:
alt.subset = c("none", "fly")
That is, the reference level must be in the subset too. It may help, though the p-value may not change much.
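Applied to the models in the question, that would look like this (mod.alt2 is just an illustrative name; the specification is the question's mod.alt with only the subset changed):
mod.alt2 <- mlogit(act2 ~ 1 | pdist + as.factor(grp.bin) +
                   as.factor(sea) + avgknots + shore + as.factor(location),
                   long_perp, reflevel = 'none',
                   alt.subset = c("none", "fly"))
hmftest(mod.1, mod.alt2)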

R decision tree using all the variables

I would like to perform a decision tree analysis, and I want the decision tree to use all the variables in the model.
I also need to plot the decision tree. How can I do that in R?
This is a sample of my dataset
> head(d)
  TargetGroup2000 TargetGroup2012 SmokingGroup_Kai PA_Score wheeze3 asthma3 tres3
1               2               2                4        2       0       0     0
2               2               2                4        3       1       0     0
3               2               2                5        1       0       0     0
4               2               2                4        2       1       0     0
5               2               3                3        1       0       0     0
6               2               3                3        2       0       0     0
I would like to use the formula
myFormula <- wheeze3 ~ TargetGroup2000 + TargetGroup2012 + SmokingGroup_Kai + PA_Score
Note that all the variables are categorical.
EDIT:
My problem is that some variables do not appear in the final decision tree.
The depth of the tree should be defined by a penalty parameter, alpha. I do not know how to set this penalty so that all the variables appear in my model.
In other words, I would like a model that minimizes the training error.
As mentioned above, if you want to run the tree on all the variables you should write it as
library(party)
ctree(wheeze3 ~ ., d)
The penalty you mentioned is set via ctree_control(). You can set the p-value threshold there, along with the minimum split and bucket sizes. So in order to maximize the chance that all the variables will be included, you should do something like this:
ctree(wheeze3 ~ ., d, controls = ctree_control(mincriterion = 0.85, minsplit = 0, minbucket = 0))
The problem is that you'll run the risk of overfitting.
The last thing you need to understand is that the reason you may not see all the variables in the output of the tree is that they don't have a significant influence on the dependent variable. Unlike linear or logistic regression, which will show all the variables and give you a p-value to determine whether they are significant, the decision tree does not return the insignificant variables, i.e., it doesn't split on them.
For better understanding of how ctree works, please take a look here: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
The easiest way is to use the rpart package that is part of the core R.
library(rpart)
model <- rpart(wheeze3 ~ ., data = d)
summary(model)
plot(model)
text(model)
The . in the formula argument means use all the other variables as independent variables.
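For a more readable tree than the base plot()/text() pair, the rpart.plot package can be used; it's a separate CRAN package, suggested here as an optional extra rather than part of the answer above:
library(rpart.plot)
rpart.plot(model)  # draws the same rpart tree with labelled, shaded nodes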
plot(ctree(myFormula, data = d))
