How to turn off k-fold cross validation in rpart() in R

I have the Bitcoin time series, I use 11 technical indicators as features, and I want to fit a regression tree to the data. As far as I know, there are two functions in R which can create regression trees, i.e. rpart() and tree(), but neither seems appropriate: rpart() uses k-fold cross-validation to validate the optimal cost-complexity parameter cp, and in tree() it is not possible to specify the value of cp.
I am aware that cv.tree() looks for the optimal value of cp via cross-validation, but again, cv.tree() uses k-fold cross-validation. Since I have a time series, and therefore temporal dependencies, I do not want to use k-fold cross-validation: it randomly divides the data into k folds, fits the model on k-1 folds and calculates the MSE on the left-out k-th fold, so the sequence of my time series is obviously ruined.
I found an argument of rpart(), xval, which is supposed to let me specify the number of cross-validations, but when I look at the output of the rpart() call with xval = 0, it does not seem like cross-validation is turned off. Below you can see my function call and the output:
tree.model <- rpart(Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI + DKD + R + FI + DVI + OBV,
                    data = train.subset, method = "anova",
                    control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
> summary(tree.model)
Call:
rpart(formula = Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI +
DKD + R + FI + DVI + OBV, data = train.subset, method = "anova",
control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
n= 590
CP nsplit rel error
1 0.35433076 0 1.0000000
2 0.10981049 1 0.6456692
3 0.06070669 2 0.5358587
4 0.04154720 3 0.4751521
5 0.02415633 5 0.3920576
6 0.02265346 6 0.3679013
7 0.02139752 8 0.3225944
8 0.02096500 9 0.3011969
9 0.02086543 10 0.2802319
10 0.01675277 11 0.2593665
11 0.01551861 13 0.2258609
12 0.01388126 14 0.2103423
13 0.01161287 15 0.1964610
14 0.01127722 16 0.1848482
15 0.01000000 18 0.1622937
It seems like rpart() cross-validated 15 different values of cp. If these values were tested with k-fold cross-validation, then again the sequence of my time series would be ruined and I could basically not use these results. Does anyone know how to effectively turn off cross-validation in rpart(), or how to vary the value of cp in tree()?
UPDATE: I followed the suggestion in the comments and set xval = 1, but that did not seem to solve the problem. You can see the full function output with xval = 1 here. By the way, parameters[j] is the j-th element of a parameter vector; when I made this call, parameters[j] = 0.0009765625.
Many thanks in advance

To demonstrate that rpart() creates tree nodes by iterating over declining values of cp rather than by resampling, we'll use the Ozone data from the mlbench package to compare the results of rpart() and caret::train(), as discussed in the comments to the OP. We'll set up the Ozone data as illustrated in the CRAN documentation for Support Vector Machines, which support nonlinear regression and are comparable to rpart().
library(rpart)
library(caret)
data(Ozone, package = "mlbench")
# split into test and training
index <- 1:nrow(Ozone)
set.seed(01381708)
testIndex <- sample(index, trunc(length(index) / 3))
testset <- na.omit(Ozone[testIndex,-3])
trainset <- na.omit(Ozone[-testIndex,-3])
# rpart version
set.seed(95014) #reset seed to ensure sample is same as caret version
rpart.model <- rpart(V4 ~ .,data = trainset,xval=0)
# summary(rpart.model)
# calculate the mean squared prediction error (MSE)
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
...and the output of the MSE calculation:
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
Next, we'll run the same analysis with caret::train() as proposed in the comments to the OP.
# caret version
set.seed(95014)
rpart.model <- caret::train(x = trainset[,-3], y = trainset[,3],
                            method = "rpart",
                            trControl = trainControl(method = "none"),
                            metric = "RMSE", tuneGrid = data.frame(cp = 0.01),
                            preProcess = c("center", "scale"),
                            xval = 0, minbucket = 5)
# summary(rpart.model)
# demonstrate caret version did not do resampling
rpart.model
# calculate MSE, which matches the MSE from rpart()
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
When we print the model output from caret::train() it clearly notes that there was no resampling.
> rpart.model
CART
135 samples
11 predictor
Pre-processing: centered (9), scaled (9), ignore (2)
Resampling: None
The MSE for the caret::train() version matches the MSE from rpart().
> # calculate MSE, which matches the MSE from rpart()
> rpart.pred <- predict(rpart.model, testset[,-3])
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
>
Conclusions
First, as configured above, neither caret::train() nor rpart() is resampling. If one prints the model output, however, one will see that multiple values of cp are used to generate the final tree (7 splits) with both techniques.
Output from caret summary(rpart.model)
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
Output from rpart summary(rpart.model)
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
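For reference, the same table can be pulled programmatically from either fitted object rather than read off the summary() printout; a small sketch, assuming the objects created above (caret stores the underlying rpart fit in $finalModel):
# Extract the CP table directly from the fitted objects.
rpart.model$cptable              # when rpart.model is an rpart() fit
rpart.model$finalModel$cptable   # when rpart.model is a caret::train() fit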
Second, both models account for time values via the inclusion of month and day variables as independent variables. In the Ozone data set, V1 is the month variable, and V2 is the day variable. All data was collected during 1976, so there is no year variable included in the data set, and in the original analysis in the svm vignette, day of week was dropped prior to analysis.
Third, to account for other time-based effects with algorithms like rpart() or svm() when date attributes are not used as features, one must include lagged values of the response as features, because these algorithms do not directly model a time component. One example of how to do this with an ensemble of regression trees over a range of lagged values is Ensemble Regression Trees for Time Series Predictions.
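For example, here is a minimal sketch of how one might build lagged features and feed them to rpart(). It assumes a data frame train.subset ordered by time with a numeric response Close_5 (names taken from the OP's call); the choice of three lags is arbitrary.
library(rpart)
# Add lagged copies of a column as extra predictors (each shifted down by k rows).
add_lags <- function(df, col, lags = 1:3) {
  for (k in lags) {
    df[[paste0(col, "_lag", k)]] <- c(rep(NA, k), head(df[[col]], -k))
  }
  df
}
lagged <- na.omit(add_lags(train.subset, "Close_5", lags = 1:3))
lag.model <- rpart(Close_5 ~ Close_5_lag1 + Close_5_lag2 + Close_5_lag3,
                   data = lagged, method = "anova",
                   control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))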

In your model, setting xval=0 does turn off cross-validation.
You can see this in your output: you only have the columns CP, NSPLIT and REL ERROR. With cross-validation you would also have XERROR and XSTD.
cp is the complexity parameter (cp=0.01 by default); the table lists the cp values, from 1 down to 0.01, at which the tree changes.
rel error is the prediction error on your training set, relative to the expected loss at the root node.
nsplit is the number of splits, i.e. the size of the tree obtained at each value of cp.
See: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
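A quick way to see the difference, sketched here on the built-in kyphosis data (any small data set would do):
library(rpart)
# With xval = 0: printcp() reports only CP, nsplit and rel error (no cross-validation).
fit.noxval <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
                    control = rpart.control(xval = 0))
printcp(fit.noxval)
# With xval = 10 (the default): the xerror and xstd columns appear.
fit.xval <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
                  control = rpart.control(xval = 10))
printcp(fit.xval)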

Related

How to use LOOCV to find a subset that classifies better than full set in R

I am working with the wbca data from the faraway package. The prior probability of sampling a malignant tumor is π0 = 1/3 and the prior probability for sampling a benign tumor is π1 = 2/3.
I am trying to use the naive Bayes classifier with multinomials to see if there is a good subset of the 9 features that classifies better than the full set using LOOCV.
I am unsure where to even begin with this, so any R code help would be great. Thanks!
You can try something like the code below; the kernel estimates for your predictors might not be the most accurate, but it's something to start with:
library(faraway)
library(naivebayes)
library(caret)

x <- wbca[, !grepl("Class", colnames(wbca))]
y <- factor(wbca$Class)

subsets <- 2:9   # candidate subset sizes to evaluate
ctrl <- rfeControl(functions = nbFuncs, method = "LOOCV")

bayesProfile <- rfe(x, y, sizes = subsets, rfeControl = ctrl)
bayesProfile
Recursive feature selection
Outer resampling method: Leave-One-Out Cross-Validation
Resampling performance over subset size:
Variables Accuracy Kappa Selected
2 0.9501 0.8891
3 0.9648 0.9225
4 0.9648 0.9223
5 0.9677 0.9290
6 0.9750 0.9454 *
7 0.9692 0.9322
8 0.9750 0.9455
9 0.9662 0.9255
The top 5 variables (out of 6):
USize, UShap, BNucl, Chrom, Epith
You can get the optimal variables:
bayesProfile$optVariables
[1] "USize" "UShap" "BNucl" "Chrom" "Epith" "Thick"

What is the difference between predict() function and the model$predicted in case of a random forest model in R?

Using the randomForest package:
#install.packages("randomForest")
library(randomForest)
I used code I found online to run a random forest on my system. I got a model with a confusion matrix, accuracy, etc.
Now, my data is in the form of training and validation sets. I got the data from here:
https://archive.ics.uci.edu/ml/machine-learning-databases/car/
I divided it in a ratio of 70%-30% (training and validation, respectively).
Then I ran a model on it.
The model results showed that around 30 observations were misclassified overall for the target variable on which the random forest was run.
Below is the sample data:
BuyingPrice Maintenance NumDoors NumPersons Bootspace Safety Condition
vhigh low 4 4 med low unacc
vhigh med 2 4 med high acc
vhigh med 2 more small high unacc
vhigh high 3 4 big high unacc
vhigh med 4 more small med unacc
low low 2 more med med acc
The randomForest was run on predicting the last variable, "Condition".
Below is the model summary
Call:
randomForest(formula = Condition ~ ., data = TrainSet, ntree = 500,
mtry = 6, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6
OOB estimate of error rate: 2.48%
Confusion matrix:
acc good unacc vgood class.error
acc 244 4 6 2 0.04687500
good 3 44 1 0 0.08333333
unacc 11 1 843 0 0.01403509
vgood 2 0 0 47 0.04081633
If we take the first row of the table just above, we see that the value "acc" had 244 correct predictions (95%) and 12 wrong predictions.
Similarly, "good" had 44 correct predictions (91%) and 4 wrong predictions. And so on for the other two.
The total number of wrong predictions is 30 (12 + 4 + 12 + 2).
Now, technically, the predicted values of this model should differ from the actual values in 30 places.
Now I tried getting the predicted values in two ways:
1. First method: model2$predicted
2. Second method: predTrain <- predict(model2, TrainSet, type = "class")
The first method gives me predicted values that differ from the actual values in 30 places, while the second method gives me values that match the actual values exactly.
I think the first method is correct, but the author of the post linked below used the second one.
https://www.r-bloggers.com/how-to-implement-random-forests-in-r/
I am not sure where my understanding is going wrong.
Please help.
PS: I know a similar question has already been asked, but I feel that neither that question nor its answers were elaborate or clear enough for me. That's why I asked a new question.
SAMPLE CODE
set.seed(100)
train <- sample(nrow(data1), 0.7 * nrow(data1), replace = FALSE)
TrainSet <- data1[train, ]
ValidSet <- data1[-train, ]
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6,
                       importance = TRUE)
predTrain <- predict(model2, TrainSet, type = "class")
new1 <- data.frame(actual = TrainSet$Condition, predicted = predTrain)
new2 <- data.frame(actual = TrainSet$Condition, predicted = model2$predicted)
new1$third <- 0
for(i in 1:nrow(new1))
{
if(new1[i,1] == new1[i,2])
{
new1[i,3] = 1
}else{
new1[i,3] = 0
}
}
new2$third <- 0
for(i in 1:nrow(new2))
{
if(new2[i,1] == new2[i,2])
{
new2[i,3] = 1
}else{
new2[i,3] = 0
}
}
Thanks,
Abhay
According to the documentation of the randomForest() function:
predicted: the predicted values of the input data based on out-of-bag samples.
So the predicted value for an observation is obtained from a model that did not use that observation.
The predict() function applies the learned model to the data you pass it and does not know whether those observations were used for training, so each observation ends up being used both for learning and for prediction.
You should use the model2$predicted output, because every predicted value there is computed without the corresponding observation having been used for training.
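A quick way to see the difference, sketched with the objects from the question's code (model2 and TrainSet):
# Out-of-bag error rate: each prediction comes from trees that never saw that row.
mean(model2$predicted != TrainSet$Condition)   # close to the reported OOB estimate (~2.5%)
# Resubstitution error rate: predict() on the training data reuses the rows the
# trees were grown on, so the error is optimistically close to 0.
mean(predict(model2, TrainSet, type = "class") != TrainSet$Condition)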

Repeated measures ANOVA and link to mixed-effect models in R

I have a problem when performing a two-way rm ANOVA in R on the following data (link : https://drive.google.com/open?id=1nIlFfijUm4Ib6TJoHUUNeEJnZnnNzO29):
subjectnbr is the id of the subject, blockType and linesTTL are the independent variables, and RT2 is the dependent variable.
I first performed the rm ANOVA using ezANOVA with the following code:
ANOVA_RTS <- ezANOVA(
data=castRTs
, dv=RT2
, wid=subjectnbr
, within = .(blockType,linesTTL)
, type = 2
, detailed = TRUE
, return_aov = FALSE
)
ANOVA_RTS
The result is correct (I double-checked using Statistica).
However, when I perform the rm ANOVA using the lme function, I do not get the same answer and I have no clue why.
Here is my code:
lmeRTs <- lme(
RT2 ~ blockType*linesTTL,
random = ~1|subjectnbr/blockType/linesTTL,
data=castRTs)
anova(lmeRTs)
Here are the outputs of both ezANOVA and lme.
I hope I have been clear enough and have given you all the information needed.
I'm looking forward to your help, as I have been trying to figure this out for at least 4 hours!
Thanks in advance.
Here is a step-by-step example of how to reproduce the ezANOVA results with nlme::lme.
The data
We read in the data and ensure that all categorical variables are factors.
# Read in data
library(tidyverse);
df <- read.csv("castRTs.csv");
df <- df %>%
mutate(
blockType = factor(blockType),
linesTTL = factor(linesTTL));
Results from ezANOVA
As a check, we reproduce the ez::ezANOVA results.
## ANOVA using ez::ezANOVA
library(ez);
model1 <- ezANOVA(
data = df,
dv = RT2,
wid = subjectnbr,
within = .(blockType, linesTTL),
type = 2,
detailed = TRUE,
return_aov = FALSE);
model1;
# $ANOVA
# Effect DFn DFd SSn SSd F p
#1 (Intercept) 1 13 2047405.6654 34886.767 762.9332235 6.260010e-13
#2 blockType 1 13 236.5412 5011.442 0.6136028 4.474711e-01
#3 linesTTL 1 13 6584.7222 7294.620 11.7348665 4.514589e-03
#4 blockType:linesTTL 1 13 1019.1854 2521.860 5.2538251 3.922784e-02
# p<.05 ges
#1 * 0.976293831
#2 0.004735442
#3 * 0.116958989
#4 * 0.020088855
Results from nlme::lme
We now run nlme::lme
## ANOVA using nlme::lme
library(nlme);
model2 <- anova(lme(
RT2 ~ blockType * linesTTL,
random = list(subjectnbr = pdBlocked(list(~1, pdIdent(~blockType - 1), pdIdent(~linesTTL - 1)))),
data = df))
model2;
# numDF denDF F-value p-value
#(Intercept) 1 39 762.9332 <.0001
#blockType 1 39 0.6136 0.4382
#linesTTL 1 39 11.7349 0.0015
#blockType:linesTTL 1 39 5.2538 0.0274
Results/conclusion
We can see that the F test results from both methods are identical. The somewhat complicated structure of the random effect definition in lme arises from the fact that you have two crossed random effects. Here "crossed" means that for every combination of blockType and linesTTL there exists an observation for every subjectnbr.
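You can confirm the crossed structure directly from the data; a small check, assuming the df created above:
# Every subjectnbr should contribute observations to all four blockType x linesTTL cells.
xtabs(~ blockType + linesTTL + subjectnbr, data = df)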
Some additional (optional) details
To understand the role of pdBlocked and pdIdent, we need to take a look at the corresponding two-level mixed-effects model.
The predictor variables are your categorical variables blockType and linesTTL, which are generally encoded using dummy variables.
The variance-covariance matrix for the random effects can take different forms, depending on the underlying correlation structure of your random-effect coefficients. To be consistent with the assumptions of a two-level repeated-measures ANOVA, we must specify a block-diagonal variance-covariance matrix with pdBlocked, where we create diagonal blocks for the offset ~1 and for the categorical predictor variables blockType, via pdIdent(~blockType - 1), and linesTTL, via pdIdent(~linesTTL - 1), respectively. Note that we need to subtract the offset in the last two blocks (since we've already accounted for the offset).
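As an additional sanity check (not part of the lme approach above), the same F tests can usually be reproduced with a classical aov() call and an Error() term, which is how base R expresses a two-way within-subjects ANOVA:
# Classical repeated-measures ANOVA; the Error() term nests the within-subject
# factors inside subjectnbr. For a balanced design the F ratios should match ezANOVA.
df$subjectnbr <- factor(df$subjectnbr)
model3 <- aov(RT2 ~ blockType * linesTTL + Error(subjectnbr / (blockType * linesTTL)),
              data = df)
summary(model3)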
Some relevant/interesting resources
Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS, Springer (2000)
Potvin and Schutz, Statistical power for the two-factor repeated measures ANOVA, Behavior Research Methods, Instruments & Computers, 32, 347-356 (2000)
Deming Mi, How to understand and apply mixed-effect models, Department of Biostatistics, Vanderbilt University

How to perform a K-fold cross validation and understanding the outputs

I have been trying to perform k-fold cross-validation in R on a data set that I have created. The link to this data is as follows:
https://drive.google.com/open?id=0B6vqHScIRbB-S0ZYZW1Ga0VMMjA
I used the following code:
library(DAAG)
six = read.csv("six.csv") #opening file
fit <- lm(Height ~ GLCM.135 + Blue + NIR, data=six) #applying a regression model
summary(fit) # show results
CVlm(data =six, m=10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR )) # 10 fold cross validation
This produces the following output (summarized version):
Sum of squares = 7.37 Mean square = 1.47 n = 5
Overall (Sum over all 5 folds)
ms
3.75
Warning message:
In CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + :
As there is >1 explanatory variable, cross-validation
predicted values for a fold are not a linear function
of corresponding overall predicted values. Lines that
are shown for the different folds are approximate
I do not understand what the ms value refers to, as I have seen different interpretations on the internet. My understanding is that k-fold cross-validation produces an overall RMSE value for a specified model (which is what I am trying to obtain for my research).
I also don't understand why the output says "Overall (Sum over all 5 folds)" when I specified 10-fold cross-validation in the code.
If anyone can help it would be much appreciated.
When I ran this same thing, I saw that it did do 10 folds, but the final output printed was the same as yours ("Sum over all 5 folds"). The "ms" is the mean squared prediction error. The value of 3.75 is not exactly a simple average across all 10 folds either (I got 3.67):
msaverage <- (1.19+6.04+1.26+2.37+3.57+5.24+8.92+2.03+4.62+1.47)/10
msaverage
Notice that the average, as well as most folds, is higher than the "Residual standard error" (1.814). This is what we would expect, as the CV error represents likely model performance on "test" data (not the data used to train the model). For instance, for Fold 10, notice that the residuals are calculated on the held-out observations (5 observations) that were not used in training that fold's model:
fold 10
Observations in test set: 5
12 14 26 54 56
Predicted 20.24 21.18 22.961 18.63 17.81
cvpred 20.15 21.14 22.964 18.66 17.86
Height 21.98 22.32 22.870 17.12 17.37
CV residual 1.83 1.18 -0.094 -1.54 -0.49
Sum of squares = 7.37 Mean square = 1.47 n = 5
It appears the warning we received may be common too; it also shows up in this article: http://www.rpubs.com/jmcimula/xCL1aXpM3bZ
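To get the single overall cross-validated RMSE you are after, one option is sketched below. It assumes you assign the CVlm() result to an object; DAAG returns the input data augmented with a cvpred column of out-of-fold predictions.
# Capture the CVlm() result and compute an overall cross-validated RMSE
# from the out-of-fold predictions in the cvpred column.
cv.out <- CVlm(data = six, m = 10, form.lm = formula(Height ~ GLCM.135 + Blue + NIR))
cv.rmse <- sqrt(mean((cv.out$cvpred - cv.out$Height)^2))
cv.rmse
# sqrt(attr(cv.out, "ms")) should give approximately the same value.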
One thing that may be useful to you: in the case of linear regression, there is a closed-form solution for leave-one-out cross-validation (LOOCV) that does not require fitting multiple models.
predictedresiduals <- residuals(fit)/(1 - lm.influence(fit)$hat)
PRESS <- sum(predictedresiduals^2)
PRESS #Predicted Residual Sum of Squares Error
fitanova <- anova(fit) #Anova to get total sum of squares
tss <- sum(fitanova$"Sum Sq") #Total sum of squares
predrsquared <- 1 - PRESS/(tss)
predrsquared
Notice this value is 0.574 vs. the original R-squared of 0.6422.
To get a feel for the spread of the prediction errors, it is useful to look at the distribution of the predicted residuals:
hist(predictedresiduals)
The LOOCV RMSE can then be calculated as:
sqrt(mean(predictedresiduals^2))
(sd(predictedresiduals) gives almost the same number when the predicted residuals are centred near zero.)

Tree sizes given by CP table in rpart

In the R package rpart, what determines the size of trees presented within the CP table for a decision tree? In the below example, the CP table defaults to presenting only trees with 1, 2, and 5 nodes (as nsplit = 0, 1 and 4 respectively).
library(rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
> printcp(fit)
Classification tree:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
Variables actually used in tree construction:
[1] Age Start
Root node error: 17/81 = 0.20988
n= 81
CP nsplit rel error xerror xstd
1 0.176471 0 1.00000 1.00000 0.21559
2 0.019608 1 0.82353 0.94118 0.21078
3 0.010000 4 0.76471 0.94118 0.21078
Is there an inherent rule rpart() uses to determine which tree sizes to present? And is it possible to force printcp() to return cross-validation statistics for all possible tree sizes, i.e. for the above example, to also include rows for trees with 3 and 4 nodes (nsplit = 2, 3)?
The rpart() function is controlled using rpart.control(). It has parameters such as minsplit, which tells the function to only split a node when it contains more observations than the value specified, and cp, which tells the function to only make a split if it decreases the overall lack of fit by a factor of cp.
If you look at summary(fit) for your example above, it shows the statistics for all values of nsplit. To get more of these values to print with printcp(fit), you need to choose appropriate values of cp and minsplit when calling rpart() in the first place.
The CRAN documentation for rpart mentions adding the option cp = 0 to the rpart() call: http://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
It also mentions other options that can be given to rpart(), e.g. to control the number of splits.
dfit <- rpart(y ~ x, method = 'class',
              control = rpart.control(xval = 10, minbucket = 2, cp = 0))
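Applied to the kyphosis example from the question, a sketch might look like this (minbucket = 2 just mirrors the vignette's example and allows very small nodes). Note that tree sizes pruned away together at the same cp value will still share a single row of the table.
# Refit the kyphosis tree with cp = 0 so the full pruning sequence is reported,
# rather than only splits that improve the fit by at least the default cp = 0.01.
fit0 <- rpart(Kyphosis ~ Age + Number + Start, method = "class", data = kyphosis,
              control = rpart.control(cp = 0, xval = 10, minbucket = 2))
printcp(fit0)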
