I built a decision tree from training data using the rpart package in R. Now I have more data and I want to run it through the tree to evaluate the model. Logically/iteratively, I want to do the following:
for each datapoint in new data
run point thru decision tree, branching as appropriate
examine how tree classifies the data point
determine if the datapoint is a true positive or false positive
How do I do that in R?
To do this, I assume you have split your original data into a training set and a test set.
To create the training model you can use:
model <- rpart(y ~ ., traindata, minbucket = 5) # I suspect you have already done something like this
To apply it to the test set:
pred <- predict(model, testdata)
You then get a vector of predicted results.
In your test data set you also have the "real" answer; let's say it is the last column of the test set.
Simply equating them will yield the result:
pred == testdata[ , last] # where 'last' equals the index of 'y'
When the elements are equal you get a TRUE; a FALSE means your prediction was wrong.
pred + testdata[, last] > 1 # gives the true positives, as it means both prediction and truth are 1 (assuming 0/1 coding)
pred == testdata[, last] # gives those that are correct
It might also be interesting to see what percentage of predictions is correct:
mean(pred == testdata[ , last]) # here TRUE will count as a 1, and FALSE as 0
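If it helps to see the whole flow in one place, here is a minimal sketch; mydata, the outcome column y (a two-level 0/1 factor) and the 70/30 split are placeholder assumptions, not taken from the question. type = "class" makes predict() return class labels instead of a matrix of class probabilities.
library(rpart)
set.seed(1)
idx       <- sample(nrow(mydata), 0.7 * nrow(mydata))
traindata <- mydata[idx, ]
testdata  <- mydata[-idx, ]
model <- rpart(y ~ ., traindata, method = "class", minbucket = 5)
pred  <- predict(model, testdata, type = "class")   # predicted class labels
cm <- table(Truth = testdata$y, Prediction = pred)  # confusion table
cm["1", "1"]                # true positives (with 0/1 factor levels)
cm["0", "1"]                # false positives
mean(pred == testdata$y)    # overall accuracy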
Related
I am running the gbm function (from the gbm R package) and I am setting the option train.fraction to 0.7. I would like to get a vector with the response variable corresponding to this subset. I thought this must be saved in one of the components of the output gbm object, but I haven't found it and I don't know if there is a way to get it. The data fraction used is saved in gbm.result$data$x.ordered but it does not include the response variable. Apologies if this has a very obvious answer.
It takes the first 0.7 * nrow(data) rows of your data if you specify train.fraction = 0.7.
If you check out the gbm function:
train.fraction: The first ‘train.fraction * nrows(data)’ observations
are used to fit the ‘gbm’ and the remainder are used for
computing out-of-sample estimates of the loss function.
We can verify this by checking the training error and valid error:
train.error: a vector of length equal to the number of fitted trees
containing the value of the loss function for each boosting
iteration evaluated on the training data
valid.error: a vector of length equal to the number of fitted trees
containing the value of the loss function for each boosting
iteration evaluated on the validation data
For example:
library(gbm)
set.seed(111)
data = iris[sample(nrow(iris)),]
data$Species=as.numeric(data$Species=="versicolor")
fit = gbm(Species ~ .,data=data,train.fraction=0.7,distribution="bernoulli")
Since 0.7 * 150 = 105, the first 105 rows form the training set. We will write a function to calculate the deviance (can refer to this for the derivation) and check the respective values:
# here y is the observed label, 0 or 1
# P is the log-odds obtained from predict.gbm(..)
b_dev = function(y,P){-2*mean(y*P-log(1+exp(P)))}
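# (why this formula works: with p = 1/(1+exp(-P)) we have
#    log(p)   = P - log(1+exp(P))   and   log(1-p) = -log(1+exp(P)),
#  so -2*mean(y*log(p) + (1-y)*log(1-p)) reduces to -2*mean(y*P - log(1+exp(P))) )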
fit$train.error[length(fit$train.error)]
[1] 0.1408239
b_dev(data$Species[1:105],predict(fit,data[1:105,],n.trees=fit$n.trees))
[1] 0.1408239
fit$valid.error[100]
[1] 0.365474
b_dev(data$Species[106:150],predict(fit,data[106:150,],n.trees=fit$n.trees))
[1] 0.365474
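So, to answer the original question directly: the response values used for fitting are just the first train.fraction * nrow(data) elements of the response column, in the row order of the data you passed in. A small sketch for the iris example above (recent gbm versions also store the number of training rows as fit$nTrain, but if in doubt you can compute it yourself):
n_train <- floor(0.7 * nrow(data))           # 105 here; should equal fit$nTrain
y_train <- data$Species[seq_len(n_train)]    # response used to fit the model
y_valid <- data$Species[-seq_len(n_train)]   # response used for valid.error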
I've been using h2o.gbm for a classification problem, and wanted to understand a bit more about how it calculates the class probabilities. As a starting point, I tried to recalculate the class probability of a gbm with only 1 tree (by looking at the observations in the leafs), but the results are very confusing.
Let's assume my positive class variable is "buy" and negative class variable "not_buy" and I have a training set called "dt.train" and a separate test-set called "dt.test".
In a normal decision tree, the class probability for "buy" P(has_bought="buy") for a new data row (test-data) is calculated by dividing all observations in the leaf with class "buy" by the total number of observations in the leaf (based on the training data used to grow the tree).
However, h2o.gbm seems to do something differently, even when I simulate a 'normal' decision tree (setting ntrees to 1 and all sample_rate parameters to 1). I think the best way to illustrate this confusion is by describing what I did in a step-wise fashion.
Step 1: Training the model
I do not care about overfitting or model performance. I want to make my life as easy as possible, so I've set ntrees to 1 and made sure all training data (rows and columns) are used for each tree and split by setting all sample_rate parameters to 1. Below is the code to train the model.
base.gbm.model <- h2o.gbm(
  x = predictors,
  y = "has_bought",
  training_frame = dt.train,
  model_id = "2",
  nfolds = 0,
  ntrees = 1,
  learn_rate = 0.001,
  max_depth = 15,
  sample_rate = 1,
  col_sample_rate = 1,
  col_sample_rate_per_tree = 1,
  seed = 123456,
  keep_cross_validation_predictions = TRUE,
  stopping_rounds = 10,
  stopping_tolerance = 0,
  stopping_metric = "AUC",
  score_tree_interval = 0
)
Step 2: Getting the leaf assignments of the training set
What I want to do is use the same data that was used to train the model and understand which leaf each row ended up in. H2O offers a function for this, which is shown below.
train.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.train)
This will return the leaf node assignment (e.g. "LLRRLL") for each row in the training data. As we only have 1 tree, this column is called "T1.C1"; I renamed it to "leaf_node" and cbind-ed it with the target variable "has_bought" of the training data. This results in the output below (from here on referred to as "train.leafs").
Step 3: Making predictions on the test set
For the test set, I want to predict two things:
The prediction of the model itself P(has_bought="buy")
The leaf node assignment according to the model.
test.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.test)
test.pred <- h2o.predict(base.gbm.model, dt.test)
After finding this, I've used cbind to combine these two predictions with the target variable of the test-set.
test.total <- h2o.cbind(dt.test[, c("has_bought")], test.pred, test.leafs)
The result of this, is the table below, from here on referred to as "test.total"
(I do not have enough reputation to post more than two links/images, but the table "test.total" at this point is basically the same as the table "test.total" combined with the manual probability calculation shown in step 5, just without the column "manual_prob_buy".)
Step 4: Manually predicting probabilities
Theoretically, I should now be able to predict the probabilities myself. I did this by writing a loop over each row in "test.total". For each row, I take the leaf node assignment.
I then use that leaf node assignment to filter the table "train.leafs" and count how many observations have a positive class (has_bought == "buy") (posN) and how many observations there are in total (totalN) within the leaf associated with the test row.
I perform the (standard) calculation posN / totalN and store the result in the test row as a new column called "manual_prob_buy", which should be the probability P(has_bought="buy") for that leaf. Thus, each test row that falls in this leaf should get this probability.
This for-loop is shown below.
for(i in 1:nrow(dt.test)){
  # leaf the i-th test row fell into
  leaf <- test.total[i, leaf_node]
  # training observations in that leaf, and how many of them bought
  totalN <- nrow(train.leafs[train.leafs$leaf_node == leaf, ])
  posN   <- nrow(train.leafs[train.leafs$leaf_node == leaf & train.leafs$has_bought == "buy", ])
  # standard leaf probability: positives / total
  test.total[i, manual_prob_buy := posN / totalN]
}
Step 5: Comparing the probabilities
This is where I get confused. Below is the updated "test.total" table, in which "buy" represents the probability P(has_bought="buy") according to the model and "manual_prob_buy" represents the manually calculated probability from step 4. As far as I know, these probabilities should be identical, given that I only used 1 tree and set all the sample rates to 1.
Table "test.total" combined with manual probability calculation
The Question
I just don't understand why these two probabilities are not the same. As far as I know, I've set the parameters in such a way that it should just be like a 'normal' classification tree.
So the question: does anyone know why I find differences in these probabilities?
I hope someone could point me to where I might have made wrong assumptions. I just really hope I did something stupid, as this is driving me crazy.
Thanks!
Rather than compare the results from R's h2o.predict() with your own handwritten code, I recommend you compare with an H2O MOJO, which should match.
See an example here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#quickstartmojo
You can run that simple example yourself, and then modify it according to your own model and new row of data to predict on.
Once you can do that, you can look at the code and debug/single-step it in a Java environment to see exactly how the prediction gets calculated.
You can find the MOJO prediction code on github here:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/EasyPredictModelWrapper.java
The main cause of the large difference between your observed probabilities and the predictions of h2o is your learning rate. As you have learn_rate = 0.001 the gbm is adjusting the probabilities by a relatively small amount from the overall rate. If you adjust this to learn_rate = 1 you will have something much closer to a decision tree, and h2o's predicted probabilities will come much closer to the rates in each leaf node.
There is a secondary difference which will then become apparent, as your probabilities will still not match exactly. This is because GBM fits the tree by gradient descent (the G in GBM) on the logistic loss function, so the leaf values are gradient steps on the log-odds scale rather than simple proportions of observations in each leaf node.
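To make that concrete, here is a rough back-of-the-envelope sketch of how a single-tree GBM prediction is assembled (a schematic of the general GBM recipe, not h2o's exact internals; the base rate and leaf value are made-up numbers): the model starts from the base-rate log-odds and adds learn_rate times the leaf value, so with learn_rate = 0.001 the probability barely moves away from the overall rate.
p0         <- 0.30     # assumed overall "buy" rate in the training data
leaf_value <- 2.0      # assumed leaf value on the log-odds (gradient) scale
f0         <- log(p0 / (1 - p0))              # initial prediction: base-rate log-odds
1 / (1 + exp(-(f0 + 0.001 * leaf_value)))     # ~0.300: learn_rate = 0.001 barely changes the base rate
1 / (1 + exp(-(f0 + 1     * leaf_value)))     # ~0.76:  learn_rate = 1 moves much closer to the leaf rate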
Assuming "test" and "train" are two data frames for testing and traininig respectively, and "model" is a classifier that was generated using training data. I can find the number of misclassified examples like this:
n = sum(test$class_label != predict(model, test))
How can I find the number of examples that is predicted as negative but it is actually positive? (i.e. false positive)
NOTE: The above example assumes that the problem is a binary classification problem whose classes are, say, "yes" (positive class) and "no". Additionally, predict is a function from the caret package.
This will get you a 2x2 table showing true positives, false positives, false negatives and true negatives.
> table(Truth = test$class_label, Prediction = predict(model, test))
         Prediction
Truth     yes  no
  yes      32   3
  no        8  27
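If you want the individual counts rather than the whole table, you can index it directly; a small sketch, assuming the positive class is labelled "yes":
cm <- table(Truth = test$class_label, Prediction = predict(model, test))
cm["no",  "yes"]   # predicted "yes" but actually "no"  (false positives)
cm["yes", "no"]    # predicted "no"  but actually "yes" (false negatives)
cm["yes", "yes"]   # true positives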
I need help with the R programming language. I have to answer this question: (a) Create a supervised classifier based on decision trees. (b) Randomly split into a training and a test set to determine the prediction quality of your classifier.
I wrote this code but I just get the same result for all categories. Can anybody help me?
library(tree)
quality<- as.numeric(winequality.red$quality)
range(quality) #8.4 14.9
High = ifelse(winequality.red$quality >= 5, "Yes","No")
winequality.red2 = data.frame(winequality.red, High)
winequality.red2 = winequality.red2[,-12]
#divide data into testing and training
set.seed(2)
train = sample(1:nrow(winequality.red2), nrow(winequality.red2)/2) # half for training and half for testing
test = -train
training_data = winequality.red2[train, ]
testing_data = winequality.red2[test, ]
testing_Test = High[test]
tree_model = tree(test~., training_data)
plot(tree_model)
text(tree_model, pretty= 0 )
tree_Pred = predict(tree_model, testing_data)
mean(tree_Pred !=testing_data)
I found rpart better than tree; it does cross-validation internally, if that is what you meant. Be sure to use rpart.plot::prp to plot the trees nicely. There is enough documentation on the packages from here.
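For reference, here is a minimal sketch of how the split/fit/evaluate flow could look with rpart, assuming a data frame winequality.red with a numeric quality column as in the question (the threshold and seed are just illustrative):
library(rpart)
library(rpart.plot)
wine         <- winequality.red
wine$High    <- factor(ifelse(wine$quality >= 5, "Yes", "No"))  # binary target
wine$quality <- NULL                                            # drop the original response
set.seed(2)
train <- sample(seq_len(nrow(wine)), nrow(wine) / 2)            # half training, half test
fit  <- rpart(High ~ ., data = wine[train, ], method = "class")
prp(fit)                                                        # nicer plot than plot()/text()
pred <- predict(fit, wine[-train, ], type = "class")
table(Truth = wine[-train, "High"], Prediction = pred)
mean(pred == wine[-train, "High"])                              # prediction accuracy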
But what I am doing must be done with tree, and I always get the same result when I change numbers or variables. Then I have to compare the result with random forests. The questions which I have to cover with this code are:
1) (a) Create a supervised classifier based on decision trees. (b) Randomly split into training and test set to determine the prediction quality of your classifier
I want to make sure whether I did it right or not.
I'm pretty new to R and I'm stuck with a pretty dumb problem.
I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.
Thanks to R the calibration part is easy to do and easy to control.
#the package rpart is needed
library(rpart)
# Loading of a big data file used for calibration
my_data <- read.csv("my_file.csv", sep=",", header=TRUE)
# Regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 +
                Attribute4 + Attribute5,
              method = "anova", data = my_data,
              control = rpart.control(minsplit = 100, cp = 0.0001))
After having calibrated a big decision tree, I want to find, for some new data, the corresponding cluster/leaf (and thus the forecasted value).
The predict function seems to be perfect for the need.
# read validation data
validationData <-read.csv("my_sample.csv", sep=",", header=TRUE)
# search for the probability in the tree
predict <- predict(tree, newdata=validationData, class="prob")
# dump them in a file
write.table(predict, file="dump.txt")
However, with the predict method I just get the forecasted ratio of my new elements, and I can't find a way to get the decision tree leaf my new elements belong to.
I think it should be pretty easy to get since the predict method must have found that leaf in order to return the ratio.
There are several parameters that can be given to the predict method through the class= argument, but for a regression tree all seem to return the same thing (the value of the target attribute of the decision tree)
Does anyone know how to get the corresponding node in the decision tree?
Analyzing the node with the path.rpart method would help me understand the results.
Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.
My solution is pretty kludgy, but I don't think there's a better way. The trick is to replace the predicted y values in the model frame with the corresponding node numbers.
tree2 = tree
tree2$frame$yval = as.numeric(rownames(tree2$frame))
predict = predict(tree2, newdata=validationData)
Now the output of predict will be node numbers as opposed to predicted y values.
(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
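As a follow-up, the node numbers obtained this way can be fed straight into path.rpart (mentioned in the question) to print the split rules leading to each leaf:
nodes <- predict(tree2, newdata = validationData)   # node numbers via the tree2 trick above
path.rpart(tree, nodes = unique(nodes))             # splits leading to each distinct leaf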
You can use the partykit package:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
library("partykit")
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")
For your example just set
predict(as.party(tree), newdata = validationData, type = "node")
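A small usage note: the same as.party() object can also return the fitted value, so you can line up the node and the prediction for each row (val.party is just a name used here for illustration):
val.party <- as.party(tree)
data.frame(node = predict(val.party, newdata = validationData, type = "node"),
           pred = predict(val.party, newdata = validationData, type = "response"))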
I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:
If type="vector": vector of predicted
responses. For regression trees this
is the mean response at the node, for
Poisson trees it is the estimated
response rate, and for classification
trees it is the predicted class (as a
number).
treeClust::rpart.predict.leaves(tree, validationData) returns node number
also, tree$where gives the leaf membership for the training set (strictly speaking it contains row numbers of tree$frame rather than node numbers)
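A small sketch tying these last two suggestions together (assuming tree and validationData from the question; depending on the treeClust version the second call may return node numbers or rows of tree$frame, so check which you get):
# training data: tree$where indexes rows of tree$frame, whose row names are the node numbers
train_nodes <- as.numeric(rownames(tree$frame))[tree$where]
# new data: leaf assignment via treeClust, same call as above
library(treeClust)
new_leaves <- rpart.predict.leaves(tree, validationData)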