Getting "Variable Importance" from rpart - r

I'm performing a tree analysis using rpart, and I need to access the values of "Variable importance" as shown when the rpart object is printed.
Is there a way to do that?
Thanks!

#rawr indicated it in the comments, I'll just make it an answer:
You can extract the variable importance from a rpart object using:
fit$variable.importance

Just adding details on #user7779's answer, you can also access the information you need in the following way:
library(rpart)
my.tree = rpart(y ~ X, data = dta, method = "anova") # I am assuming regression tree.
summary(my.tree)
In the output, among the first lines, you find variable importance. Notice though that here everything is rescaled, thus you will get the relative importance (i.e., numbers are going to sum up to one hundred).

Related

The Validation Set Approach

I am studying the book ISLR on my own
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
attach(Auto)
mean((mpg-predict(lm.fit,Auto))[-train]^2)
23.26601
If I don't use the attach()
library(ISLR)
set.seed(1)
train=sample(392,196)
lm.fit=lm(mpg~horsepower, data= Auto, subset=train)
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
97.06483
Why the result change substantially?
Also, I don't know the syntax of this code mean((mpg-predict(lm.fit,data=Auto))[-train]^2);
what does [] represent?
Also, why mpg-predict,we usually use ~ for formula?
I tried to use ?mean but it didn't show the answers.
Many thanks in advance for the help.
The second argument to predict.lm is not "data", it is newdata. So the first set of instruction matched the Auto dataframe to the newdata argument. If you run the second set of instructions with newdata as the parameter, you get the same result:
mean((mpg-predict(lm.fit,newdata=Auto))[-train]^2)
[1] 23.26601
When you execute mpg-predict(lm.fit,newdata=Auto)) you are getting the residuals from the model. You are asking for the difference of the mpg variable value and the prediction for the that variable. (It's just a minus sign between expression mpg - predict(...).
The next part of the code [-train] is removing the training set from the consideration. This is often called the "out-of-bag" residuals when you are doing cross validation.
Instead of attaching, use with
with(Auto, mean((mpg-predict(lm.fit,Auto))[-train]^2))
#[1] 23.26601
The use of - on 'train' in (?Extract - [) is subsetting the elements from the vector by removing those positions created by the sample, before taking the mean of power
with(Auto, (mpg-predict(lm.fit,Auto)))
If we don't use attach or with, the 'mpg' object is not found in the global environment. Therefore, it would result in error
mean((mpg-predict(lm.fit,data=Auto))[-train]^2)
Error in mean((mpg - predict(lm.fit, data = Auto))[-train]^2) :
object 'mpg' not found
If the OP got different value, then mpg may be from a different data
Regarding the use of ~ in formula, according to ?lm
Models for lm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.

Why is predict in R taking Train data instead of Test data? [duplicate]

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm,testset$independent))
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm,tempset$independent))
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and not least causing you the problem that you are hitting. When you now call predict, it will be looking for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict()
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them, is legibility. Getting the object name out of the formula allows you to more easily see what the model is.

Number at risk table using survplot with cph() object

This is pretty specific to the rms package. When using survplot with a cph object, the number at risk table is not stratified according to the covariate. using npsurv() does this correctly.
library(survival);library(rms)
data(lung)
fit <- cph(Surv(time, status==2) ~ sex,data = lung,surv = TRUE,x=TRUE, y=TRUE)
survplot(fit, n.risk=TRUE)
# compared to:
survplot(npsurv(fit$sformula, data = lung), conf= "none",n.risk=TRUE)
For usual use cases I guess re-running the model with npsurv will do, but for others, like when wanting to use fit.mult.impute (my specific use case), you want to use the "original" fit created by cph().
Is there a work-around/way to fix this?
So I should have used
strat(sex)
as the term.

lm and predict - agreement of data.frame names

Working in R to develop regression models, I have something akin to this:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
c_pred = predict(c_lm,testset$independent))
and every single time, I get a mysterious error from R:
Warning message:
'newdata' had 34 rows but variables found have 142 rows
which essentially translates into R not being able to find the independent column of the testset data.frame. This is simply because the exact name from the right-hand side of the formula in lm must be there in predict. To fix it, I can do this:
tempset = trainingset
c_lm = lm(trainingset$dependent ~ tempset$independent)
tempset = testset
c_pred = predict(c_lm,tempset$independent))
or some similar variation, but this is really sloppy, in my opinion.
Is there another way to clean up the translation between the two so that the independent variables' data frame does not have to have the exact same name in predict as it does in lm?
No, No, No, No, No, No! Do not use the formula interface in the way you are doing if you want all the other sugar that comes with model formulas. You wrote:
c_lm = lm(trainingset$dependent ~ trainingset$independent)
You repeat trainingset twice, which is a waste of fingers/time, redundant, and not least causing you the problem that you are hitting. When you now call predict, it will be looking for a variable in testset that has the name trainingset$independent, which of course doesn't exist. Instead, use the data argument in your call to lm(). For example, this fits the same model as your formula but is efficient and also works properly with predict()
c_lm = lm(dependent ~ independent, data = trainingset)
Now when you call predict(c_lm, newdata = testset), you only need to have a data frame with a variable whose name is independent (or whatever you have in the model formula).
An additional reason to write formulas as I show them, is legibility. Getting the object name out of the formula allows you to more easily see what the model is.

Search for corresponding node in a regression tree using rpart

I'm pretty new to R and I'm stuck with a pretty dumb problem.
I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.
Thanks to R the calibration part is easy to do and easy to control.
#the package rpart is needed
library(rpart)
# Loading of a big data file used for calibration
my_data <- read.csv("my_file.csv", sep=",", header=TRUE)
# Regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 +
Attribute4 + Attribute5,
method="anova", data=my_data,
control=rpart.control(minsplit=100, cp=0.0001))
After having calibrated a big decision tree, I want, for a given data sample to find the corresponding cluster of some new data (and thus the forecasted value).
The predict function seems to be perfect for the need.
# read validation data
validationData <-read.csv("my_sample.csv", sep=",", header=TRUE)
# search for the probability in the tree
predict <- predict(tree, newdata=validationData, class="prob")
# dump them in a file
write.table(predict, file="dump.txt")
However with the predict method I just get the forecasted ratio of my new elements, and I can't find a way get the decision tree leaf where my new elements belong.
I think it should be pretty easy to get since the predict method must have found that leaf in order to return the ratio.
There are several parameters that can be given to the predict method through the class= argument, but for a regression tree all seem to return the same thing (the value of the target attribute of the decision tree)
Does anyone know how to get the corresponding node in the decision tree?
By analyzing the node with the path.rpart method, it would help me understanding the results.
Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.
My solution is pretty klugy, but I don't think there's a better way. The trick is to replace the predicted y values in the model frame with the corresponding node numbers.
tree2 = tree
tree2$frame$yval = as.numeric(rownames(tree2$frame))
predict = predict(tree2, newdata=validationData)
Now the output of predict will be node numbers as opposed to predicted y values.
(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
You can use the partykit package:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
library("partykit")
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")
For your example just set
predict(as.party(tree), newdata = validationData, type = "node")
I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:
If type="vector": vector of predicted
responses. For regression trees this
is the mean response at the node, for
Poisson trees it is the estimated
response rate, and for classification
trees it is the predicted class (as a
number).
treeClust::rpart.predict.leaves(tree, validationData) returns node number
also tree$where returns node numbers for the training set

Resources