I need help with R. I have to answer this question: (a) Create a supervised classifier based on decision trees. (b) Randomly split the data into training and test sets to determine the prediction quality of your classifier.
I wrote the code below, but I get the same result for all categories. Can anybody help?
library(tree)
quality<- as.numeric(winequality.red$quality)
range(quality) #8.4 14.9
High = ifelse(winequality.red$quality >= 5, "Yes","No")
winequality.red2 = data.frame(winequality.red, High)
winequality.red2 = winequality.red2[,-12]
#divide data into testing and training
set.seed(2)
train = sample(1:nrow(winequality.red2), nrow(winequality.red2)/2) # half for testing and half for training
test = -train
training_data = winequality.red2[train, ]
testing_data = winequality.red2[test, ]
testing_Test = High[test]
tree_model = tree(test~., training_data)
plot(tree_model)
text(tree_model, pretty= 0 )
tree_Pred = predict(tree_model, testing_data)
mean(tree_Pred !=testing_data)
I found rpart better than tree. It does cross-validation internally, if that is what you meant. Be sure to use rpart.plot::prp to plot the trees nicely. There is enough documentation on the packages.
But what I am doing must be done with tree, and I always get the same result when I change the numbers or variables. After that I have to compare the result with random forests. The question I have to cover with this code is:
1) (a) Create a supervised classifier based on decision trees. (b) Randomly split into training and test set to determine the prediction quality of your classifier.
I want to make sure whether I did it right or not.
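A few things in the posted code look like they could explain the identical results: the formula uses `test` (the index vector) as the response instead of `High`, `predict()` is called without `type = "class"` (so it returns a probability matrix rather than labels), and the error rate compares predictions against the whole test data frame instead of the `High` column. It may also be worth checking `table(High)`: if almost every wine scores 5 or more, nearly all rows are "Yes" and the tree can end up predicting a single class. Below is a minimal sketch of the workflow with the tree package. The data is simulated because the wine file is not available here, so the column names and the simulated quality score are assumptions, not the real dataset.

```r
library(tree)

set.seed(2)

# Simulated stand-in for winequality.red (hypothetical columns and values)
n <- 400
wine <- data.frame(alcohol = runif(n, 8, 15), sulphates = runif(n, 0.3, 2))
quality <- round(3 + 0.4 * (wine$alcohol - 8) + rnorm(n))

# The response must be a factor for tree() to fit a classification tree
wine$High <- factor(ifelse(quality >= 5, "Yes", "No"))

# Random split: half for training, half for testing
train <- sample(seq_len(n), n / 2)
training_data <- wine[train, ]
testing_data <- wine[-train, ]

# The response on the left of the formula is High, not the index vector
tree_model <- tree(High ~ ., data = training_data)

# type = "class" returns predicted labels instead of class probabilities
tree_pred <- predict(tree_model, testing_data, type = "class")

# Compare predictions with the true labels of the test rows only
confusion <- table(predicted = tree_pred, actual = testing_data$High)
error_rate <- mean(tree_pred != testing_data$High)
print(confusion)
print(error_rate)
```

With the real data, `wine` would be replaced by `winequality.red2` and everything after the split should carry over unchanged.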
I am using the R package machisplin (it's not on CRAN) to downscale a satellite image. According to the description of the package:
The machisplin.mltps function simultaneously evaluates different combinations of the six algorithms to predict the input data. During model tuning, each algorithm is systematically weighted from 0-1 and the fit of the ensembled model is evaluated. The best performing model is determined through k-fold cross validation (k=10) and the model that has the lowest residual sum of squares of test data is chosen. After determining the best model algorithms and weights, a final model is created using the full training dataset.
My question is how can I check which model out of the 6 has been selected for the downscaling? To put it differently, when I export the downscaled image, I would like to know which algorithm (out of the 6) has been used to perform the downscaling.
Here is the code:
library(MACHISPLIN)
library(raster)
library(gbm)
evi = raster("path/evi.tif") # covariate
ntl = raster("path/ntl_1600.tif") # raster to be downscaled
##convert one of the rasters to a point dataframe to sample. Use any raster input.
ntl.points<-rasterToPoints(ntl,
fun = NULL,
spatial = FALSE)
##subset only the x and y data
ntl.points<- ntl.points[,1:2]
##Extract values to points from rasters
RAST_VAL<-data.frame(extract(ntl, ntl.points))
##merge sampled data to input
InInterp<-cbind(ntl.points, RAST_VAL)
#run an ensemble machine learning thin plate spline
interp.rast<-machisplin.mltps(int.values = InInterp,
covar.ras = evi,
smooth.outputs.only = T,
tps = T,
n.cores = 4)
#set negative values to 0
interp.rast[[1]]$final[interp.rast[[1]]$final <= 0] <- 0
writeRaster(interp.rast[[1]]$final,
filename = "path/ntl_splines.tif")
I viewed all the output parameters (please refer to Example 2 in the package description) but I couldn't find anything relevant to my question.
I have posted a question on GitHub as well. From here you can download my images.
I think this is a misunderstanding: MACHISPLIN isn't testing 6 algorithms and returning one. It tries many ensembles of the 6 and returns one ensemble. In other words, what you get is the best combination of the 6 algorithms, not a single algorithm chosen out of the 6.
The result is something like "a model which is 20% algo1, 10% algo2, etc." and not "algo1 is the best and was chosen".
I have two regression models, rf1 and rf2, and I want to find values of the input variables such that the output of rf1 is between 20 and 26 and the output of rf2 is below 10.
I tried a grid search but found nothing. If you know how to do this with a heuristic (simulated annealing or a genetic algorithm), please help me.
You can find the code for this example in this repository here.
library(randomForest)
model_rf_fines<- readRDS(file = paste0("rf1.rds"))
model_rf_gros<- readRDS(file = paste0("rf2.rds"))
#grid------
grid_input_test = expand.grid(
"Poste" ="P1",
"Qualité" ="BTNBA",
"CPT_2500" =13.83,
"CPT400" = 46.04,
"CPT160" =15.12,
"CPT125" =5.9,
"CPT40"=15.09,
"CPT_40"=4.02,
"retart"=0,
"dure"=0,
'Débit_CV004'=seq(1300,1400,10),
"Dilution_SB002"=seq(334.68,400,10),
"Arrosage_Crible_SC003"=seq(250,300,10),
"Dilution_HP14"=1200,
"Dilution_HP15"=631.1,
"Dilution_HP18"=500,
"Dilution_HP19"=seq(760.47,800,10),
"Pression_PK12"=c(0.59,0.4),
"Pression_PK13"=c(0.8,0.7),
"Pression_PK14"=c(0.8,0.9,0.99,1),
"Pression_PK16"=c(0.5),
"Pression_PK18"=c(0.4,0.5)
)
#levels correction ----
levels(grid_input_test$Qualité) = model_rf_fines$forest$xlevels$Qualité
levels(grid_input_test$Poste) = model_rf_fines$forest$xlevels$Poste
for(i in 1:nrow(grid_input_test)){
#fines
print("----------------------------")
print(i)
print(paste0('Fines :', predict(object = model_rf_fines,newdata = grid_input_test[i,]) ))
#gros
print(paste0('Gros :',predict(object = model_rf_gros,newdata = grid_input_test[i,]) ))
if(predict(object = model_rf_gros,newdata = grid_input_test[i,])<=10){break}
}
any suggestions will be greatly appreciated
thanks.
It might be that no such input exists. If rf1 and rf2 are two random forest models with, say, more than 50 trees each, averaging over the trees will smooth out spikes and edges in the model.
Similar to the law of large numbers, the more trees in each forest, the closer the outputs of rf1 and rf2 will be. This assumes both forests were trained on the same data; in that case, the more trees there are, the less likely it is that an input satisfying your conditions exists.
Try a naive grid search first, and keep track of the minimum value of rf2 over the points where rf1 satisfies your condition. Call this minimum M_grid.
If you want to implement simulated annealing, I would start with a simple neighbour scheme, say pick a random input variable and vary it a bit. Use Python packages for the annealing scheme. If this simple scheme beats your M_grid by quite a bit and you feel you are close to a solution, you can play around with slower cooling schemes or more complicated neighbour proposals.
Also, do not choose the objective for SA or GA too hastily. You probably want an objective that steers rf1 towards its lower edge of 20 and makes rf2 as small as possible, perhaps with an exp() or **3 to reward going down strongly.
I made some assumptions here that may be wrong, but I hope this helps anyway.
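Since the question is in R, here is a rough sketch of the annealing idea using base R's optim() with method = "SANN", which accepts a custom neighbour function through its gr argument. The two stub functions, the two-variable input and its bounds are all hypothetical stand-ins; in the real problem they would be replaced by predict() calls on the two forests with a one-row data frame built from x. The objective here is a plain feasibility penalty (zero once both conditions hold), which is simpler than the shaped objective suggested above.

```r
# Hypothetical stand-ins for predict(rf1, ...) and predict(rf2, ...)
rf1_stub <- function(x) 10 + 0.01 * x[1] + 4 * x[2]
rf2_stub <- function(x) 0.02 * (x[1] - 1300) + 12 * x[2]

# Hypothetical bounds for two tunable inputs
lower <- c(1300, 0.4)
upper <- c(1400, 1.0)

# Penalty is zero exactly when 20 <= rf1 <= 26 and rf2 <= 10
penalty <- function(x) {
  y1 <- rf1_stub(x)
  y2 <- rf2_stub(x)
  max(0, 20 - y1) + max(0, y1 - 26) + max(0, y2 - 10)
}

# Neighbour scheme: perturb one randomly chosen variable, then clamp to bounds
neighbour <- function(x) {
  i <- sample(length(x), 1)
  x[i] <- x[i] + rnorm(1, sd = 0.05 * (upper[i] - lower[i]))
  pmin(pmax(x, lower), upper)
}

set.seed(1)
start <- upper  # deliberately infeasible starting point
res <- optim(start, penalty, gr = neighbour, method = "SANN",
             control = list(maxit = 5000, temp = 1))

print(res$par)    # best input found
print(res$value)  # remaining penalty; 0 means both conditions are met
```

If the penalty never reaches zero even with many iterations and restarts, that would support the suspicion that no feasible input exists in the search region.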
I've been using h2o.gbm for a classification problem, and wanted to understand a bit more about how it calculates the class probabilities. As a starting point, I tried to recalculate the class probability of a gbm with only 1 tree (by looking at the observations in the leafs), but the results are very confusing.
Let's assume my positive class variable is "buy" and negative class variable "not_buy" and I have a training set called "dt.train" and a separate test-set called "dt.test".
In a normal decision tree, the class probability for "buy" P(has_bought="buy") for a new data row (test-data) is calculated by dividing all observations in the leaf with class "buy" by the total number of observations in the leaf (based on the training data used to grow the tree).
However, h2o.gbm seems to do something different, even when I simulate a 'normal' decision tree (setting n.trees to 1, and all sample.rates to 1). I think the best way to illustrate this confusion is by describing what I did step by step.
Step 1: Training the model
I do not care about overfitting or model performance. I want to make my life as easy as possible, so I've set the n.trees to 1, and make sure all training-data (rows and columns) are used for each tree and split, by setting all sample.rate parameters to 1. Below is the code to train the model.
base.gbm.model <- h2o.gbm(
x = predictors,
y = "has_bought",
training_frame = dt.train,
model_id = "2",
nfolds = 0,
ntrees = 1,
learn_rate = 0.001,
max_depth = 15,
sample_rate = 1,
col_sample_rate = 1,
col_sample_rate_per_tree = 1,
seed = 123456,
keep_cross_validation_predictions = TRUE,
stopping_rounds = 10,
stopping_tolerance = 0,
stopping_metric = "AUC",
score_tree_interval = 0
)
Step 2: Getting the leaf assignments of the training set
What I want to do, is use the same data that is used to train the model, and understand in which leaf they ended up in. H2o offers a function for this, which is shown below.
train.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.train)
This will return the leaf node assignment (e.g. "LLRRLL") for each row in the training data. As we only have 1 tree, this column is called "T1.C1" which I renamed to "leaf_node", which I cbind with the target variable "has_bought" of the training data. This results in the output below (from here on referred to as "train.leafs").
Step 3: Making predictions on the test set
For the test set, I want to predict two things:
The prediction of the model itself P(has_bought="buy")
The leaf node assignment according to the model.
test.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.test)
test.pred <- h2o.predict(base.gbm.model, dt.test)
After finding this, I've used cbind to combine these two predictions with the target variable of the test-set.
test.total <- h2o.cbind(dt.test[, c("has_bought")], test.pred, test.leafs)
The result of this, is the table below, from here on referred to as "test.total"
Unfortunately, I do not have enough rep points to post more than 2 links. But if you click on "Table "test.total" combined with manual probability calculation" in step 5, it's basically the same table without the column "manual_prob_buy".
Step 4: Manually predicting probabilities
Theoretically, I should be able to predict the probabilities now myself. I did this by writing a loop, that loops over each row in "test.total". For each row, I take the leaf node assignment.
I then use that leaf-node assignment to filter the table "train.leafs", and check how many observations have the positive class (has_bought == "buy") (posN) and how many observations there are in total (totalN) within the leaf associated with the test-row.
I perform the (standard) calculation posN / totalN, and store this in the test-row as a new column called "manual_prob_buy", which should be the probability of P(has_bought="buy") for that leaf. Thus, each test-row that falls in this leaf should get this probability.
This for-loop is shown below.
for(i in 1:nrow(dt.test)){
leaf <- test.total[i, leaf_node]
totalN <- nrow(train.leafs[train.leafs$leaf_node == leaf])
posN <- nrow(train.leafs[train.leafs$leaf_node == leaf & train.leafs$has_bought == "buy",])
test.total[i, manual_prob_buy := posN / totalN]
}
Step 5: Comparing the probabilities
This is where I get confused. Below is the updated "test.total" table, in which "buy" represents the probability P(has_bought="buy") according to the model and "manual_prob_buy" represents the manually calculated probability from step 4. As far as I know, these probabilities should be identical, knowing I only used 1 tree and I've set the sample.rates to 1.
Table "test.total" combined with manual probability calculation
The Question
I just don't understand why these two probabilities are not the same. As far as I know, I've set the parameters in such a way that it should just be like a 'normal' classification tree.
So the question: does anyone know why I find differences in these probabilities?
I hope someone could point me to where I might have made wrong assumptions. I just really hope I did something stupid, as this is driving me crazy.
Thanks!
Rather than compare the results from R's h2o.predict() with your own handwritten code, I recommend you compare with an H2O MOJO, which should match.
See an example here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#quickstartmojo
You can run that simple example yourself, and then modify it according to your own model and new row of data to predict on.
Once you can do that, you can look at the code and debug/single-step it in a java environment to see exactly how the prediction gets calculated.
You can find the MOJO prediction code on github here:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/EasyPredictModelWrapper.java
The main cause of the large difference between your observed probabilities and the predictions of h2o is your learning rate. As you have learn_rate = 0.001 the gbm is adjusting the probabilities by a relatively small amount from the overall rate. If you adjust this to learn_rate = 1 you will have something much closer to a decision tree, and h2o's predicted probabilities will come much closer to the rates in each leaf node.
There is a secondary difference which will then become apparent as your probabilities will still not exactly match. This is due to the method of gradient descent (the G in GBM) on the logistic loss function, which is used rather than the number of observations in each leaf node.
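Both effects can be illustrated with a small worked example in base R. This is a sketch of the textbook one-tree update for gradient boosting with logistic loss (start from the log-odds of the overall rate, add learn_rate times a Newton-step leaf value), not H2O's actual code, and H2O's implementation may differ in details. The overall rate of 0.3 and the leaf with 8 buyers out of 10 rows are made-up numbers.

```r
# Made-up numbers: overall "buy" rate and one leaf holding 10 training rows
p0 <- 0.3
y <- c(rep(1, 8), rep(0, 2))  # 8 buyers and 2 non-buyers in this leaf

# What a plain decision tree would report for this leaf
leaf_rate <- mean(y)

# Initial prediction on the log-odds scale
f0 <- qlogis(p0)

# Newton-step leaf value for logistic loss:
# sum of residuals divided by sum of p * (1 - p)
gamma <- sum(y - p0) / (length(y) * p0 * (1 - p0))

# One-tree prediction for two learning rates
p_small <- plogis(f0 + 0.001 * gamma)  # stays very close to p0
p_full <- plogis(f0 + 1 * gamma)       # near leaf_rate, but not equal to it

print(c(leaf_rate = leaf_rate, p_small = p_small, p_full = p_full))
```

With learn_rate = 0.001 the prediction barely moves away from the overall rate, and even with learn_rate = 1 it lands near the leaf proportion without matching it exactly, which is the secondary difference described above.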
I have a dataset with 5 independent variables and a categorical dependent variable.
I would like to develop a code in R that allows me to predict the final results for a test data set.
I would like to implement bagging using as classifier a decision tree. In order to obtain the final predictions I would like to use the uniform voting procedure.
The code that I developed is the following
library(rpart)
set.seed(10)
all_data<-qwe
positions <- sample(nrow(all_data),size=floor((nrow(all_data)/5)*4))
training<- all_data[positions,]
testing<- all_data[-positions,]
n <-10
for (i in 1:n ){
training_positions <- sample(nrow(training), size=floor((nrow(training)/3)))
train_pos<-1:nrow(training) %in% training_positions
model_tree <- rpart(UNS~., data=training[train_pos,])
pred <- predict(model_tree, newdata = testing, type="class")
print(as.matrix(pred))
plot(model_tree)  # plot the fitted tree, not the vector of predictions
text(model_tree)
}
I have the predictions made by each decision tree (10 decision trees), but I do not know how to determine the most common prediction for each observation ( I mean the mode).
Any help would be welcome!
Thanks in advance!
Best regards,
Liza Vieira
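One way to get the mode is to store each tree's predictions as a column of a matrix and then take the most frequent label in each row. The sketch below uses the built-in iris data as a stand-in for the poster's dataset, so `Species` plays the role of `UNS`. It also draws bootstrap samples with replacement, which is the usual bagging scheme, rather than the one-third subsample in the question. Both choices are assumptions made for illustration.

```r
library(rpart)

set.seed(10)

# iris stands in for the real data (Species plays the role of UNS)
all_data <- iris
positions <- sample(nrow(all_data), size = floor(nrow(all_data) / 5 * 4))
training <- all_data[positions, ]
testing <- all_data[-positions, ]

n <- 10
# One column of predicted labels per tree
votes <- matrix(NA_character_, nrow = nrow(testing), ncol = n)

for (i in seq_len(n)) {
  # Bootstrap sample of the training rows
  boot <- sample(nrow(training), replace = TRUE)
  model_tree <- rpart(Species ~ ., data = training[boot, ])
  votes[, i] <- as.character(predict(model_tree, newdata = testing, type = "class"))
}

# Uniform voting: most frequent label per test row
# (ties go to the label that comes first alphabetically)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
final_pred <- factor(majority, levels = levels(all_data$Species))

accuracy <- mean(final_pred == testing$Species)
print(table(predicted = final_pred, actual = testing$Species))
print(accuracy)
```

The tie-breaking rule is arbitrary; using an odd number of trees avoids ties when there are only two classes.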
I'm pretty new to R and I'm stuck with a pretty dumb problem.
I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.
Thanks to R the calibration part is easy to do and easy to control.
#the package rpart is needed
library(rpart)
# Loading of a big data file used for calibration
my_data <- read.csv("my_file.csv", sep=",", header=TRUE)
# Regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 +
Attribute4 + Attribute5,
method="anova", data=my_data,
control=rpart.control(minsplit=100, cp=0.0001))
After having calibrated a big decision tree, I want, for a given data sample to find the corresponding cluster of some new data (and thus the forecasted value).
The predict function seems to be perfect for the need.
# read validation data
validationData <-read.csv("my_sample.csv", sep=",", header=TRUE)
# search for the probability in the tree
predict <- predict(tree, newdata=validationData, class="prob")
# dump them in a file
write.table(predict, file="dump.txt")
However with the predict method I just get the forecasted ratio of my new elements, and I can't find a way get the decision tree leaf where my new elements belong.
I think it should be pretty easy to get since the predict method must have found that leaf in order to return the ratio.
There are several parameters that can be given to the predict method through the class= argument, but for a regression tree all seem to return the same thing (the value of the target attribute of the decision tree)
Does anyone know how to get the corresponding node in the decision tree?
By analyzing the node with the path.rpart method, it would help me understanding the results.
Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.
My solution is pretty klugy, but I don't think there's a better way. The trick is to replace the predicted y values in the model frame with the corresponding node numbers.
tree2 = tree
tree2$frame$yval = as.numeric(rownames(tree2$frame))
predict = predict(tree2, newdata=validationData)
Now the output of predict will be node numbers as opposed to predicted y values.
(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
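As a sanity check, the trick can be tried on a built-in dataset and compared with the leaf assignments rpart stores for the training rows. This is a small sketch using mtcars as a stand-in for the poster's data; `fit$where` holds, for each training row, the row index of its leaf in `fit$frame`, and the row names of the frame are the node numbers.

```r
library(rpart)

# Regression tree on a built-in dataset (stand-in for the real data)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Copy the tree and overwrite the fitted values with the node numbers
fit2 <- fit
fit2$frame$yval <- as.numeric(rownames(fit2$frame))

# predict() now returns the node number of the leaf each row falls into
node_pred <- predict(fit2, newdata = mtcars)

# Node numbers of the training rows according to the fitted tree itself
node_train <- as.numeric(rownames(fit$frame))[fit$where]

print(table(node_pred))
print(all(unname(node_pred) == node_train))
```

The node numbers obtained this way can be passed to path.rpart(fit, nodes = ...) to print the split rules leading to each leaf.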
You can use the partykit package:
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
library("partykit")
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")
For your example just set
predict(as.party(tree), newdata = validationData, type = "node")
I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:
If type="vector": vector of predicted responses. For regression trees this is the mean response at the node, for Poisson trees it is the estimated response rate, and for classification trees it is the predicted class (as a number).
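treeClust::rpart.predict.leaves(tree, validationData) returns the leaf for each row of the new data.
Also, tree$where gives the leaf of each training observation as a row index into tree$frame; rownames(tree$frame)[tree$where] converts these to node numbers.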