Determining the mode using bagging with decision trees in R

I have a dataset with 5 independent variables and a categorical dependent variable.
I would like to develop code in R that allows me to predict the final results for a test dataset.
I would like to implement bagging with a decision tree as the base classifier. To obtain the final predictions I would like to use the uniform (majority) voting procedure.
The code that I developed is the following:
library(rpart)

set.seed(10)
all_data <- qwe
# 80/20 train/test split
positions <- sample(nrow(all_data), size = floor((nrow(all_data) / 5) * 4))
training <- all_data[positions, ]
testing <- all_data[-positions, ]

n <- 10
for (i in 1:n) {
  # random subsample of the training set for this tree
  training_positions <- sample(nrow(training), size = floor(nrow(training) / 3))
  train_pos <- 1:nrow(training) %in% training_positions
  model_tree <- rpart(UNS ~ ., data = training[train_pos, ])
  pred <- predict(model_tree, newdata = testing, type = "class")
  print(as.matrix(pred))
  # plot the fitted tree (plot/text take the tree, not the predictions)
  plot(model_tree)
  text(model_tree)
}
I have the predictions made by each of the 10 decision trees, but I do not know how to determine the most common prediction (the mode) for each observation.
Any help would be welcome!
Thanks in advance!
Best regards,
Liza Vieira
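One way to get the row-wise mode is to collect each tree's predictions as the columns of a matrix and then take the most frequent class per row. A minimal sketch along those lines, reusing the loop above (pred_matrix and final_pred are names introduced here for illustration):

library(rpart)

# one column of predictions per tree
pred_matrix <- matrix(NA_character_, nrow = nrow(testing), ncol = n)
for (i in 1:n) {
  training_positions <- sample(nrow(training), size = floor(nrow(training) / 3))
  model_tree <- rpart(UNS ~ ., data = training[training_positions, ])
  pred_matrix[, i] <- as.character(predict(model_tree, newdata = testing,
                                           type = "class"))
}

# row-wise mode: the most frequent class across the n trees (uniform voting)
final_pred <- apply(pred_matrix, 1, function(row) names(which.max(table(row))))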

Related

Does k-fold cross-validation work well with a random forest model in R?

I want to build a random forest model in R, and I first want to divide my dataset into training and testing sets.
All the tutorials so far use regular (sequential) sampling, for example: training <- data[1:150,] and testing <- data[151:700,].
I tried to use 10-fold cross-validation with my random forest model and I want to ensure I am doing the right thing.
Here's my code in R:
library(caret)          # createFolds(), confusionMatrix()
library(randomForest)

# the head of the dataset after deleting the ID attribute
head(wpdc)

# k-fold cross-validation
RF_folds <- createFolds(wpdc$outcome, k = 10)  # create folds
RF_fun <- lapply(RF_folds, function(x) {
  RF_training_folds <- wpdc[-x, ]
  RF_test_folds <- wpdc[x, ]
  RF_test_folds_class <- RF_test_folds[, 1]
  # build the model
  RF_model <- randomForest(outcome ~ ., data = RF_training_folds)
  # test the model
  RF_predict <- predict(RF_model, RF_test_folds[-1])
  # accuracy
  RF_table <- table(RF_test_folds_class, RF_predict)
  RF_confusionMatrix <- confusionMatrix(RF_table, positive = "R")  # confusion matrix of each fold
  return(RF_confusionMatrix$table)
})
RF_sum_matrices <- Reduce('+', RF_fun) / 10  # average the 10 matrices
RF_final_confusionMatrix <- confusionMatrix(RF_sum_matrices, positive = "R")
RF_final_confusionMatrix
I read an article saying that the out-of-bag (OOB) error in random forests is similar to k-fold cross-validation and that there is no need for a separate cross-validation. So my questions are: is my code correct? Is k-fold cross-validation a good choice for partitioning data before building a model? If so, what is the correct way to do it?
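For comparison, the OOB estimate that article refers to can be read directly from a fitted randomForest object, with no explicit folds; a minimal sketch, assuming the same wpdc data with a factor outcome in the first column:

library(randomForest)

set.seed(1)
RF_model <- randomForest(outcome ~ ., data = wpdc)

# OOB confusion matrix: each observation is predicted only by the trees
# that did not see it during training
RF_model$confusion

# OOB error rate by number of trees; the last row is the final estimate
tail(RF_model$err.rate, 1)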

Successive training in neuralnet

I have a huge trainData set and I want to draw random subsets out of it (let's say 1000 times) and use them to train the neural network object successively. Is it possible to do this using the neuralnet R package? What I am thinking about is something like:
library(neuralnet)
for (i in 1:1000) {
  classA <- 2000
  classB <- 2000
  # withdraw 2000 samples from each class
  dataB <- trainData[sample(which(trainData$class == "B"), classB, replace = TRUE), ]
  dataA <- trainData[sample(which(trainData$class == "A"), classA, replace = TRUE), ]
  subset <- rbind(dataB, dataA)  # bind them to make a subset
  # ...and then feed this subset of the actual trainData to train the
  # neuralnet object again and again:
  nn <- neuralnet(formula, data = subset, hidden = c(3, 5),
                  linear.output = FALSE, stepmax = 2147483647)
}
My question is: will this neuralnet object named nn be trained in every iteration of the loop, and when the loop finishes, will I get a fully trained neural network object? Secondly, what is the effect of non-convergence in cases where neuralnet is unable to converge for a particular subset? Will it affect the prediction results?
The shortest answer - No
More nuanced answer - Sort of ...
Why? - Because the neuralnet::neuralnet function is not designed to return the weights if the threshold is not reached within stepmax. However, if the threshold is reached, the resulting object will contain the final weights. These weights can then be fed back to the neuralnet function as the startweights argument, allowing for successive learning. Your call would look like the following:
# nn.prior = previously run neuralnet object
nn <- neuralnet(formula, data = subset, hidden = c(3, 5), linear.output = FALSE,
                stepmax = 2147483647, startweights = nn.prior$weights)
However, I initially answered 'No' because choosing a threshold that extracts a suitable amount of information from each subset, while also making sure the net 'converges' before stepmax, would likely be a guessing game and not very objective.
You have essentially four options I can think of:
Find another package that allows for this explicitly.
Get the neuralnet source code and modify it to return the weights even when 'convergence' isn't achieved (i.e., reaching the threshold).
Take a suitably sized random subset, build your model on that alone, and test its performance. (This is actually quite common practice, AFAIK.)
Take all your subsets, build a model on each, and look into combining them as an 'ensemble' model, as in the sketch below.
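A minimal sketch of that ensemble option, assuming a binary 0/1 response, the formula from the question, and a hypothetical test input matrix testX; it also assumes each net reaches its threshold (see the caveat above):

library(neuralnet)

# train one net per random subset (bootstrap rows, as in the question's loop)
nets <- lapply(1:10, function(i) {
  idx <- sample(nrow(trainData), 2000, replace = TRUE)
  neuralnet(formula, data = trainData[idx, ], hidden = c(3, 5),
            linear.output = FALSE)
})

# average each net's predicted probabilities across the ensemble, then threshold
probs <- sapply(nets, function(nn) compute(nn, testX)$net.result)
ensemble_pred <- as.integer(rowMeans(probs) > 0.5)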
I would recommend using k-fold validation to train many nets, for example via library(e1071) and its tune function.

How do I convert an "RWeka" decision tree into a "party" tree in R?

I am using the RWeka package in R to fit M5' trees to a dataset using "M5P". I then want to convert the generated tree into a "party" tree so that I can access variable importances. The issue I am having is that I can't seem to get the as.party function to work without getting the following error:
"Error: all(sapply(split, head, 1) %in% c("<=", ">")) is not TRUE"
This error only arises when I apply the function within a for loop, but the for loop is necessary because I am running 5-fold cross-validation.
Below is the code I have been running:
library(RWeka)     # M5P()
library(partykit)  # as.party()

n <- nrow(data)
k <- 5
indCV <- sample(rep(1:k, each = ceiling(n / k)), n)
for (i in 1:k) {
  # training data: all observations where indCV is not equal to i
  training_data <- data.frame(x[-which(indCV == i), ])
  training_response <- y[-which(indCV == i)]
  # test on the fifth of the data where the observation indices equal i
  test_data <- x[which(indCV == i), ]
  test_response <- y[which(indCV == i)]
  # fit a model to the training data (N = TRUE requests an unpruned tree)
  fit <- M5P(training_response ~ ., data = training_data,
             control = Weka_control(N = TRUE))
  # convert to party
  p <- as.party(fit)
}
The RWeka package has an example for converting M5P trees into party objects. If you run example("M5P", package = "RWeka"), the tree visualizations are actually drawn by partykit. After running the examples, see plot(m3) and as.party(m3).
However, while J48 yields a fully fledged constparty object, the same is not true for M5P. In the latter case, the tree structure itself can be converted to party, but the linear models within the nodes are not completely straightforward to convert into lm objects. Thus, if you want to use the party representation to compute measures that depend only on the tree structure (e.g., variables used for splitting, number of splits, split points, etc.), you can do so. But if you want to compute measures that depend on the models or the predictions (e.g., mean squared errors), the party class won't be of much help.
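A minimal sketch of what this looks like in practice (m3 is created in the global environment by the RWeka examples; depth() and width() are partykit helpers that use only the tree structure):

library(RWeka)
library(partykit)

# runs the package examples, which fit several M5P models including m3
example("M5P", package = "RWeka")

# the tree structure converts and is plotted by partykit
m3_party <- as.party(m3)
plot(m3_party)

# structure-only measures are available from the party representation
depth(m3_party)  # depth of the tree
width(m3_party)  # number of terminal nodes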

Create a supervised classifier based on decision trees

I need help with the R programming language; I have to answer this question: (a) Create a supervised classifier based on decision trees. (b) Randomly split into training and test sets to determine the prediction quality of your classifier.
I wrote the code below, but I just get the same result for all categories. Is there anybody who can help me?
library(tree)

quality <- as.numeric(winequality.red$quality)
range(quality) # 8.4 14.9
# the response must be a factor for a classification tree
High <- factor(ifelse(winequality.red$quality >= 5, "Yes", "No"))
winequality.red2 <- data.frame(winequality.red, High)
winequality.red2 <- winequality.red2[, -12]  # drop the original quality column

# divide data into testing and training
set.seed(2)
train <- sample(1:nrow(winequality.red2), nrow(winequality.red2) / 2)  # half for training, half for testing
test <- -train
training_data <- winequality.red2[train, ]
testing_data <- winequality.red2[test, ]
testing_Test <- High[test]

# the response is High, not the index vector 'test'
tree_model <- tree(High ~ ., data = training_data)
plot(tree_model)
text(tree_model, pretty = 0)

# predict classes and compute the misclassification rate
tree_Pred <- predict(tree_model, testing_data, type = "class")
mean(tree_Pred != testing_Test)
I found rpart better than tree. It does cross-validation internally, if that is what you meant. Be sure to use rpart.plot::prp to plot the trees nicely. There is enough documentation on the packages from here.
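A minimal sketch of the rpart equivalent, reusing training_data, testing_data, and testing_Test from the code above:

library(rpart)
library(rpart.plot)

set.seed(2)
rpart_model <- rpart(High ~ ., data = training_data, method = "class")

# rpart cross-validates internally; printcp() shows the CV error by tree size
printcp(rpart_model)

# a nicer plot than plot()/text()
rpart.plot::prp(rpart_model)

rpart_pred <- predict(rpart_model, testing_data, type = "class")
mean(rpart_pred != testing_Test)  # misclassification rate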
But what I am doing must be done with tree, and I always get the same result when I change numbers or variables. Then I have to compare the result with random forests. The question I have to cover with this code is:
1) (a) Create a supervised classifier based on decision trees. (b) Randomly split into training and test sets to determine the prediction quality of your classifier.
I want to make sure whether I did it right or not...

Search for corresponding node in a regression tree using rpart

I'm pretty new to R and I'm stuck on a pretty dumb problem.
I'm calibrating a regression tree using the rpart package in order to do some classification and some forecasting.
Thanks to R, the calibration part is easy to do and easy to control.
# the package rpart is needed
library(rpart)

# load a big data file used for calibration
my_data <- read.csv("my_file.csv", sep = ",", header = TRUE)

# regression tree calibration
tree <- rpart(Ratio ~ Attribute1 + Attribute2 + Attribute3 +
                Attribute4 + Attribute5,
              method = "anova", data = my_data,
              control = rpart.control(minsplit = 100, cp = 0.0001))
After having calibrated a big decision tree, I want, for a given data sample, to find the corresponding cluster for some new data (and thus the forecasted value).
The predict function seems perfect for this need.
# read validation data
validationData <- read.csv("my_sample.csv", sep = ",", header = TRUE)
# search for the probability in the tree
predict <- predict(tree, newdata = validationData, class = "prob")
# dump the predictions to a file
write.table(predict, file = "dump.txt")
However, with the predict method I just get the forecasted ratio for my new elements, and I can't find a way to get the decision-tree leaf that each new element belongs to.
I think it should be pretty easy to get, since the predict method must have found that leaf in order to return the ratio.
There are several values that can be given to the predict method through the class= argument, but for a regression tree they all seem to return the same thing (the value of the target attribute at the leaf).
Does anyone know how to get the corresponding node in the decision tree?
Analyzing the node with the path.rpart method would help me understand the results.
Benjamin's answer unfortunately doesn't work: type="vector" still returns the predicted values.
My solution is pretty kludgy, but I don't think there's a better way. The trick is to replace the predicted y values in the model frame with the corresponding node numbers.
tree2 <- tree
# overwrite each node's fitted value with its node number (the frame rownames)
tree2$frame$yval <- as.numeric(rownames(tree2$frame))
predict <- predict(tree2, newdata = validationData)
Now the output of predict will be node numbers as opposed to predicted y values.
(One note: the above worked in my case where tree was a regression tree, not a classification tree. In the case of a classification tree, you probably need to omit as.numeric or replace it with as.factor.)
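To relate the returned node numbers back to split rules, they can be passed straight to path.rpart (the function mentioned in the question); a small usage sketch:

# 'predict' now holds one node number per row of validationData
nodes <- unique(predict)
path.rpart(tree, nodes = nodes)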
You can use the partykit package:
library(rpart)      # provides rpart() and the kyphosis data
library("partykit")

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit.party <- as.party(fit)
predict(fit.party, newdata = kyphosis[1:4, ], type = "node")
For your example, just use:
predict(as.party(tree), newdata = validationData, type = "node")
I think what you want is type="vector" instead of class="prob" (I don't think class is an accepted parameter of the predict method), as explained in the rpart docs:
If type="vector": vector of predicted
responses. For regression trees this
is the mean response at the node, for
Poisson trees it is the estimated
response rate, and for classification
trees it is the predicted class (as a
number).
treeClust::rpart.predict.leaves(tree, validationData) returns the leaf each observation falls into.
Also, tree$where gives, for each training observation, the row of tree$frame for its leaf; rownames(tree$frame)[tree$where] converts those rows into actual node numbers.
