Search for corresponding node in a regression tree using rpart - r

I'm pretty new to R and I'm stuck with a pretty dumb problem.
I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.
Thanks to R the calibration part is easy to do and easy to control.
#the package rpart is needed
library(rpart)
# Loading of a big data file used for calibration
my_data <- read.csv("my_file.csv", sep=",", header=TRUE)
# Regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 +
Attribute4 + Attribute5,
method="anova", data=my_data,
control=rpart.control(minsplit=100, cp=0.0001))
After calibrating a big decision tree, I want to find, for each sample in some new data, the cluster (leaf) it falls into, and thus its forecasted value.
The predict function seems to be perfect for the need.
# read validation data
validationData <-read.csv("my_sample.csv", sep=",", header=TRUE)
# search for the probability in the tree
predict <- predict(tree, newdata=validationData, class="prob")
# dump them in a file
write.table(predict, file="dump.txt")
However, with the predict method I only get the forecasted ratio of my new elements, and I can't find a way to get the decision tree leaf where my new elements belong.
I think it should be pretty easy to get since the predict method must have found that leaf in order to return the ratio.
There are several parameters that can be given to the predict method through the class= argument, but for a regression tree all seem to return the same thing (the value of the target attribute of the decision tree)
Does anyone know how to get the corresponding node in the decision tree?
Analyzing that node with the path.rpart method would help me understand the results.

Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.
My solution is pretty kludgy, but I don't think there's a better way. The trick is to replace the predicted y values in the tree's frame with the corresponding node numbers.
tree2 = tree
tree2$frame$yval = as.numeric(rownames(tree2$frame))
predict = predict(tree2, newdata=validationData)
Now the output of predict will be node numbers as opposed to predicted y values.
(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
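One way to check this trick is to run it on a built-in data set and compare the result with the leaves rpart recorded for the training rows. The following is only a minimal sketch: the mtcars data and the names fit, fit_nodes and node_id are assumptions for illustration, not part of the question.

```r
library(rpart)

# Small regression tree on a built-in data set (stand-in for the question's data).
fit <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0.01))

# Copy the tree and overwrite each node's fitted value with its node number.
fit_nodes <- fit
fit_nodes$frame$yval <- as.numeric(rownames(fit_nodes$frame))

# predict() now returns the node number of the leaf each row falls into.
node_id <- predict(fit_nodes, newdata = mtcars)

# Cross-check against the leaves recorded at fit time.
train_nodes <- as.numeric(rownames(fit$frame))[fit$where]
stopifnot(all(node_id == train_nodes))

# Print the splitting rules that lead to each leaf that was reached.
path.rpart(fit, nodes = sort(unique(node_id)))
```

The node numbers can be passed straight to path.rpart, which is what the question wanted them for.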

You can use the partykit package:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
library("partykit")
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")
For your example just set
predict(as.party(tree), newdata = validationData, type = "node")

I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:
If type="vector": vector of predicted responses. For regression trees this is the mean response at the node, for Poisson trees it is the estimated response rate, and for classification trees it is the predicted class (as a number).
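For a regression tree, a quick check shows what type="vector" actually gives back. This is a small sketch on the built-in mtcars data (an assumption, not the question's data): the result is identical to the default predict output, so it holds leaf means, not node numbers.

```r
library(rpart)

# Regression tree on a built-in data set (illustrative stand-in).
fit <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

p_default <- predict(fit, newdata = mtcars)
p_vector  <- predict(fit, newdata = mtcars, type = "vector")

# "vector" is the default type, so both calls return the same leaf means.
stopifnot(identical(p_default, p_vector))

# There are at most as many distinct predictions as there are leaves.
n_leaves <- sum(fit$frame$var == "<leaf>")
n_distinct <- length(unique(p_vector))
stopifnot(n_distinct <= n_leaves)
```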

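treeClust::rpart.predict.leaves(tree, validationData) returns the leaf for each row of the new data.
Also, tree$where gives the leaf of each training observation as a row index into tree$frame; rownames(tree$frame)[tree$where] converts it to node numbers.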
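The relation between tree$where, the rows of tree$frame and the node numbers can be checked directly. Here is a short sketch, again assuming the built-in mtcars data and illustrative variable names:

```r
library(rpart)

fit <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# fit$where holds, for each training row, a row index into fit$frame.
leaf_row <- fit$where

# Node numbers are the row names of fit$frame.
leaf_node <- as.integer(rownames(fit$frame))[leaf_row]

# The fitted value of each training row is the yval of its leaf.
stopifnot(isTRUE(all.equal(unname(predict(fit)), fit$frame$yval[leaf_row])))

# Every referenced frame row is a terminal node.
stopifnot(all(fit$frame$var[leaf_row] == "<leaf>"))

# Number of training rows per leaf, labelled by node number.
table(leaf_node)
```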


Plot an envelope for an mppm object in spatstat

My question is closely related to this previous one: Simulation-based hypothesis testing on spatial point pattern hyperframes using "envelope" function in spatstat
I have obtained an mppm object by fitting a model on several independent datasets using the mppm function from the R package spatstat. How can I study its envelope to compare it to my observations?
I fitted my model as such:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
model <- mppm(Points ~marks*sqrt(x^2+y^2), data)
where NMJ1, NMJ2, and NMJ3 are marked ppp and are independent realizations of the same experiment.
However, the envelope function does not accept inputs of type mppm:
> envelope(model, Kcross.inhom, nsim=10)
Error in UseMethod("envelope") :
no applicable method for 'envelope' applied to an object of class "c('mppm', 'list')"
The answer provided to the previously mentioned question indicates how to plot global envelopes for each pattern and to use the product rule for multiple testing. However, my fitted model implies that my 3 ppp objects are statistically equivalent and are independent realizations of the same experiment (i.e., no covariates differ between them). I would thus like to obtain one single plot comparing my fitted model to my 3 datasets. The following code:
gamma= 1 - 0.95^(1/3)
nsims=round(1/gamma-1)
sims <- simulate(model, nsim=2*nsims)
SIMS <- list()
for(i in 1:nrow(sims)) SIMS[[i]] <- as.solist(sims[i,,drop=TRUE])
Hplus <- cbind(data, hyperframe(Sims=SIMS))
EE1 <- with(Hplus, envelope(Points, Kcross.inhom, nsim=nsims, simulate=Sims))
pool(EE1[1],EE1[2],EE1[3])
leads to the following error:
Error in pool.envelope(`1` = list(r = c(0, 0.78125, 1.5625, 2.34375, 3.125, :
Arguments 2 and 3 do not belong to the class “envelope”
Wrong type of subset index. Use
pool(EE1[[1]], EE1[[2]], EE1[[3]])
or just
pool(EE1)
These would have given an error message that the envelope commands should have been called with savefuns=TRUE. So you just need to change that step as well.
However, statistically this procedure makes little sense. You have already fitted a model, which allows for rigorous statistical inference using anova.mppm and other tools. Instead of this, you are generating simulated data from the fitted model and performing a Monte Carlo test, with all the fraught issues of multiple testing and low power. There are additional problems with this approach - for example, even if the model is "the same" for each row of the hyperframe, the patterns are not statistically equivalent unless the windows of the point patterns are identical, and so on.

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier Score for my Machine Learning Predictive models. The outcome "y" was categorical (1 or 0). Predictors are a mix of continuous and categorical variables.
I have created four models with different predictors, I will call them "model_1"-"model_4" here (except predictors, other parameters are the same). Example code of my model is:
Model_1=rfsrc(y~ ., data=TrainTest, ntree=1000,
mtry=30, nodesize=1, nsplit=1,
na.action="na.impute", nimpute=3,seed=10,
importance=T)
When I run the "Model_1" function in R, I got the results:
My question is: how can I get the predicted probability for those 412 people? And how do I find the observed probability for each person? Do I need to calculate it by hand? I found the function BrierScore() in the "DescTools" package.
But when I tried "BrierScore(Model_1)", it gave me no results.
Code I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
I tried some code, but got many error messages:
One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on (which is usually not the correct approach; look up train-test splits if you need more information).
You should therefore reflect on whether that approach is correct.
The BrierScore method in DescTools only has a defined method for glm models, otherwise, it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do though is to predict on your data using:
prediction = predict(Model_1, TrainTest, na.action="na.impute")
and then compute the brier score using
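BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 2L])
(Note that we transform TrainTest$y into a numeric vector of 0's and 1's in order to compute the Brier score, and that we take the second column of prediction$predicted, the predicted probability of class 1, assuming y is a factor with levels 0 and 1 in that order.)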
Note: The randomForestSRC package also prints a normalized brier score when you call print(prediction).
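To make the inputs concrete, the Brier score for a binary outcome can also be computed by hand as the mean squared difference between the 0/1 outcome and the predicted probability of class 1. The following base R sketch uses made-up toy numbers (y, p1 and the variable names are assumptions for illustration only):

```r
# Toy data: observed classes and predicted probabilities of class "1".
y  <- factor(c(0, 1, 1, 0, 1))
p1 <- c(0.1, 0.8, 0.6, 0.3, 0.9)

# Convert the factor to numeric 0/1 (levels are "0" and "1").
y01 <- as.numeric(as.character(y))

# Brier score: mean squared error of the predicted probability.
brier <- mean((p1 - y01)^2)

# Using the probability of the wrong class (here class "0") gives a
# very different and misleading number.
brier_wrong_column <- mean(((1 - p1) - y01)^2)

print(c(brier = brier, wrong_column = brier_wrong_column))
```

On these toy numbers the first value is 0.062 and the second 0.622, which is why it matters which column of the predicted probability matrix is passed in.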
In general, using one of the available machine learning frameworks in R (mlr3, tidymodels, caret) might simplify this for you. This is good practice, especially if you are less experienced in ML, as it can prevent many errors of this kind.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

How to do a leave-one-out cross validation for CAP/capscale in R vegan?

I would like to perform a "leave-one-out cross validation" (LOO-CV) for a CAP in R. The CAP was calculated by using capscale in R package vegan and is a canonical analysis of principal coordinates, similar to an rda or cca, but based on another similarity matrix, in my case Bray-Curtis. I have found that within predict.cca there is the function calibrate.cca but I cannot make it work.
https://www.rdocumentation.org/packages/vegan/versions/2.4-2/topics/predict.cca
This is what I have (based on the sample data mite available in vegan)
library(vegan)
data(mite, mite.env)
str(mite.env) #"SubsDens", "WatrCont", "Substrate", "Shrub", "Topo"
miteBC <- vegdist(mite, method="bray") #Bray-Curtis dissimilarity matrix
miteCAP <-capscale(miteBC~Substrate + Shrub + Topo, data=mite.env, #CAP in capscale
distance = "bray", metaMDSdist = F)
summary(miteCAP)
anova(miteCAP)
anova(miteCAP, by = "axis")
anova(miteCAP, by = "margin")
calibrate.cca(miteCAP, type = c("response")) #error cannot find function calibrate.cca
In the program Primer it is done automatically within the CAP function ("Leave-one-out Allocation of Observations to Groups"), where it assigns each sample automatically to a group and get a mis-classification error (similar to a classification randomForest, which I have already done), but I would like to use R, and it should be possible with vegan::capscale.
Any help is very much appreciated!
Function vegan::calibrate does not have argument type and never returns "response". Check its documentation. It does the environmental calibration, and returns the predicted values of constraints (Substrate, Shrub, Topo) in the scale of model matrix, and with factors these hardly make sense directly.
There is no direct option for LOO: you have to do it by hand, cycling through the points and using each complete left-out point as the newdata. However, I'd suggest k-fold cross-validation as a better alternative for estimating predictive power: LOO changes the data too little and gives an over-optimistic view of predictive power.

Display more nodes in decision tree in R?

Based on the result I have 7 nodes, and I want more than 2 nodes displayed in the plot, but I keep getting only 2 nodes displayed.
Is there a way to display more nodes and in a nicer way?
library(rpart)
tr1<-rpart(leaveyrx~marstx.f+age+jobtitlex.f+organizationunitx.f+fteworkschedule+nationalityx.f+eesubgroupx.f+lvlx.f+sttpmx.f+ staff2ndtpmx.f+staff3rdtpmx.f+staff4thtpmx.f, method="class",data=btree)
printcp(tr1)
plotcp(tr1)
summary(tr1)
plot(tr1, uniform=TRUE, margin = 0.2, main="Classification Tree for Exploration")
text(tr1, use.n=TRUE, all=TRUE, cex=.5)
*A repost
Your problem is probably not your plot, but rather your decision tree model. Can you clarify why you expect 7 nodes? When you only have two (leaf) nodes, it probably means that your model is using only one predictor variable to split a binary response variable. This is probably caused by the predictor variable having a 1:1 relation with the response variable, for example if you are predicting Gender (Male, Female) and one of your predictor variables is Sex (M, F). In this case, a decision tree model is not needed because you can just use the predictor variable. Maybe something happened in the pre-processing of your data that copied the response variable. Here are a few things to look for:
1) Calculate the misclassification rate. If it is 0, then you have a perfect model.
yhat<-predict(tr1, type="class") # Model Predictions
sum(yhat != btree$leaveyrx)/nrow(btree) # misclassification rate
2) See which predictor your model is using. Double check that this variable has been processed correctly. Try excluding it from the model.
tr1$variable.importance
3) If you are absolutely sure the variable is calculated correctly and that it should be used in the model, try decreasing your cp value. The default is 0.01, but decision trees run quickly even with much lower cp values. While you are tinkering with cp, also consider the other tuning parameters; see ?rpart.control
control <- rpart.control(minbucket = 20, cp = 0.0002, maxsurrogate = 0, usesurrogate = 0, xval = 10)
tr1 <- rpart(leaveyrx~marstx.f+age+jobtitlex.f+organizationunitx.f+fteworkschedule+nationalityx.f+eesubgroupx.f+lvlx.f+sttpmx.f+ staff2ndtpmx.f+staff3rdtpmx.f+staff4thtpmx.f,
data=btree,
method = "class",
control = control)
4) Once you have a tree with many nodes, you will need to trim it back. It may be that your best model is really driven by only one variable and hence will have only two nodes.
# Plot the cp
plotcp(tr1)
printcp(tr1) # Printing cp table (choose the cp with the smallest xerror)
# Prune back to optimal size, according to plot of CV r^2
tr1.pruned <- prune(tr1, cp=0.001) #approximately the cp corresponding to the best size
5) The rpart.plot package is a good resource for plotting decision trees. There are lots of great articles out there, but here is a good one on plotting rpart trees: http://www.milbo.org/rpart-plot/prp.pdf
It may also be helpful to post a bit of the summary of your model.
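The grow-then-prune steps above can be sketched on the kyphosis data that ships with rpart. This is only an illustration under assumed settings (the data set, the cp value and the helper n_leaves are not from the question); the pruning cp is read from the cp table instead of being chosen by eye.

```r
library(rpart)
set.seed(1)  # cross-validation inside rpart is random

# Helper (illustrative): count terminal nodes of a fitted tree.
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Grow a deliberately large tree by lowering cp below the 0.01 default.
big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0.0001, xval = 10))

# Pick the cp with the smallest cross-validated error from the cp table.
cp_tab  <- big$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to that size.
pruned <- prune(big, cp = best_cp)

# Training misclassification rate of the pruned tree.
err <- mean(predict(pruned, type = "class") != kyphosis$Kyphosis)

print(c(leaves_big = n_leaves(big), leaves_pruned = n_leaves(pruned),
        train_error = err))
```

If the pruned tree still ends up with a single split, that suggests the data, not the plotting code, is what limits the number of nodes.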

Successive training in neuralnet

I have a huge trainData and I want to draw random subsets from it (let's say 1000 times) and use them to train the neural network object successively. Is it possible to do this with the neuralnet R package? What I am thinking about is something like:
library(neuralnet)
for (i in 1:1000) {
classA <- 2000
classB <- 2000
dataB <- trainData[sample(which(trainData$class == "B"), classB, replace=TRUE),] #withdraw 2000 samples from class B
dataU <- trainData[sample(which(trainData$class == "A"), classA, replace=TRUE),] #withdraw 2000 samples from class A
subset <- rbind(dataB, dataU) #bind them to make a subset
and then feed this subset of actual trainData to train the neuralnet object again and again like:
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647) #use that subset for training the neural network
}
My question is: will the neuralnet object named nn be trained further in every iteration of the loop, so that when the loop finishes I get a fully trained neural network object? Secondly, what is the effect of non-convergence in cases where neuralnet is unable to converge for a particular subset? Will it affect the prediction results?
The shortest answer - No
More nuanced answer - Sort of ...
Why? - Because the neuralnet::neuralnet function is not designed to return the weights if the threshold is not reached within stepmax. However, if the threshold is reached, the resulting object will contain the final weights. These weights could then be fed to the neuralnet function as the startweights argument allowing for successive learning. Your call would look like the following:
# nn.prior = previously run neuralnet object
nn <- neuralnet(formula, data=subset, hidden=c(3,5), linear.output = F, stepmax = 2147483647, startweights = nn.prior$weights)
However, I initially answered 'No' because choosing a threshold to get a suitable amount of information out of a subset while also making sure it 'converges' before stepmax would likely be a guessing game and not very objective.
You have essentially four options I can think of:
1) Find another package that allows for this explicitly.
2) Get the neuralnet source code and modify it to return the weights even when 'convergence' isn't achieved (i.e. reaching threshold).
3) Take a suitably sized random subset, build your model on just that, and test its performance. (This is actually quite common practice AFAIK.)
4) Take all your subsets, build a model on each and look into combining them as an 'ensemble' model.
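The last option can be sketched in base R. Note that this sketch swaps in a logistic regression (glm) as a stand-in learner so that it runs without extra packages; it is not a neuralnet model. The synthetic trainData, the helper draw_balanced and all sizes are assumptions for illustration.

```r
set.seed(42)

# Synthetic two-class data (illustrative stand-in for the real trainData).
n <- 600
trainData <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
trainData$class <- factor(
  ifelse(trainData$x1 + trainData$x2 + rnorm(n) > 0, "A", "B"))

# Draw a balanced subset: per_class rows from each class, with replacement.
draw_balanced <- function(data, per_class) {
  idx_a <- sample(which(data$class == "A"), per_class, replace = TRUE)
  idx_b <- sample(which(data$class == "B"), per_class, replace = TRUE)
  data[c(idx_a, idx_b), ]
}

# Fit one model per subset (glm used here in place of neuralnet).
n_models <- 25
models <- vector("list", n_models)
for (i in seq_len(n_models)) {
  sub <- draw_balanced(trainData, per_class = 100)
  models[[i]] <- glm(class ~ x1 + x2, data = sub, family = binomial)
}

# Ensemble: average the predicted probabilities of class "B" over all models.
prob_each <- sapply(models, predict, newdata = trainData, type = "response")
prob_B <- rowMeans(prob_each)
pred <- ifelse(prob_B > 0.5, "B", "A")
acc <- mean(pred == trainData$class)
print(acc)
```

Each model is trained independently here, so a subset on which one learner fails to converge can simply be dropped from the average without affecting the others.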
I would recommend using k-fold cross-validation to train many nets, using library(e1071) and its tune function.
