I have a data set with 1962 observations and 46 columns. Column 46 is the target. Six of the other columns are nominal variables and the rest are ordinal variables. I have preprocessed them as follows:
# Convert the nominal columns (and the target in column 46) to unordered factors
for (i in c(1:4, 6, 9, 46)) {
  cw_alldata_known[, i] <- as.factor(cw_alldata_known[, i])
}
# Convert the ordinal columns to ordered factors
for (i in c(5, 7, 8, 10:45)) {
  cw_alldata_known[, i] <- as.ordered(cw_alldata_known[, i])
}
Then I divide them 50/50 into training and test sets.
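For reference, a minimal sketch of that split (assuming a random half; cw.train and cw.test are the names used below, and the seed is hypothetical):
set.seed(42)                           # hypothetical seed, for reproducibility
n <- nrow(cw_alldata_known)
idx <- sample(n, size = floor(n / 2))  # random half of the rows
cw.train <- cw_alldata_known[idx, ]
cw.test  <- cw_alldata_known[-idx, ]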
I fitted a decision tree using the party package in R:
cw.ctree <- ctree(cr ~ ., data = cw.train)
Then I also fitted a random forest model:
cw.forest <- randomForest(credit.rating ~ ., data = cw.train, ntree = 107)
I have tried other ntree values but 107 seems to be the best.
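For what it's worth, this is roughly how I compared ntree values (a sketch; accuracy is measured on the held-out test set, and credit.rating is the target from the formula above):
library(randomForest)
# Compare test-set accuracy over a small grid of ntree values
for (nt in c(51, 107, 201, 501)) {
  fit <- randomForest(credit.rating ~ ., data = cw.train, ntree = nt)
  acc <- mean(predict(fit, newdata = cw.test) == cw.test$credit.rating)
  cat("ntree =", nt, " test accuracy =", round(acc, 3), "\n")
}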
The test-set accuracy of the decision tree is around 61%, while the random forest's is only 56%. I read that random forests are often more robust and reliable. Why doesn't the random forest perform better than the single decision tree in this case?
I have a dataset of 8100 observations of 118 variables that are used to determine which of 4 groups each respondent falls into. I am interested in which variables are the most important for predicting group membership. My data is a combination of ordinal and binary variables.

I initially did a discriminant function analysis, but then read that this does not handle binary data well. Next I tried a multinomial logistic regression, but from there I have struggled to work out which variables are the most important. I also tried an rpart decision tree, but then read that these are not very stable, and indeed, when I ran it on a random half of my data I got different results every time.

Now I am trying a dominance analysis. I can get it working for a linear model (lm), but for both the multinomial logistic regression and the discriminant function analysis I get the error:
Error in daRawResults(x = x, constants = constants, terms = terms, fit.functions = fit.functions, :
Not implemented method to retrieve data from model
Does anyone have any advice for what else I can try? Only 4 of the 118 variables are binary, so I can remove them if needed and will still have a good analysis.
Here is a reproducible example including a much smaller example dataset:
set.seed(1) ## for reproducibility
remotes::install_github("clbustos/dominanceAnalysis") # if you don't have the dominance analysis package
library(dominanceanalysis)
library(MASS)
library(nnet)

mydata <- data.frame(Segments = sample(1:4, 15, replace = TRUE),
                     var1 = sample(1:7, 15, replace = TRUE),
                     var2 = sample(1:7, 15, replace = TRUE),
                     var3 = sample(1:6, 15, replace = TRUE),
                     var4 = sample(1:2, 15, replace = TRUE))
# Show that it works for a linear model
LM <- lm(Segments ~ ., data = mydata)
da.LM <- dominanceAnalysis(LM); da.LM
#var1 is the most important, followed by var4
# Try the discriminant function analysis
DFA <- lda(Segments ~ ., data = mydata)
da.DFA <- dominanceAnalysis(DFA)
# Error

# Try multinomial logistic regression
MLR <- multinom(Segments ~ ., data = mydata, maxit = 500)
da.MLR <- dominanceAnalysis(MLR)
# Error
I've discovered a partial answer.
The dominanceanalysis package can only be used on these models: Ordinary Least Squares, Generalized Linear Models, Dynamic Linear Models and Hierarchical Linear Models.
Source: https://github.com/clbustos/dominanceAnalysis
This explains why it didn't work for my data - I wasn't using those models.
I have decided to pursue the decision-tree route to variable selection by using a random forest instead; a sketch of what I mean is below.
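For anyone in the same position, roughly what I have in mind (a sketch using the randomForest package on the toy mydata from the reproducible example above):
library(randomForest)
# A forest averages the importance ranking over many trees,
# which addresses the instability I saw with a single rpart tree
rf <- randomForest(factor(Segments) ~ ., data = mydata,
                   ntree = 500, importance = TRUE)
importance(rf)   # per-class and overall importance measures
varImpPlot(rf)   # MeanDecreaseAccuracy and MeanDecreaseGini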
Due to computational limitations with my GIS software, I am trying to implement random forests in R for image classification. My input is a multi-band TIFF image, and the model is trained on samples from an ArcGIS shapefile (target values 0 and 1). The code technically works and produces a valid output. When I view the confusion matrix I get the following:
0 1 class.error
0 11 3 0.214285714
1 1 13 0.071428571
This is sensible for my data. However, when I plot the output of the image classification in my GIS software (the binary reclassified TIFF with values 0 and 1), it predicts the training data with a 100% success rate. In other words, there is no classification error in the output image. How can this be when the confusion matrix indicates there are classification errors?
Am I missing something really obvious here? Code snippet below.
rf.mdl <- randomForest(x = samples@data[, names(PredMaps)],
                       y = samples@data[, ValueFld],
                       ntree = 501, proximity = TRUE, importance = TRUE,
                       keep.forest = TRUE, keep.inbag = TRUE)
ConfMat <- rf.mdl$confusion
write.csv(ConfMat, file = "ConfMat1.csv")
predict(PredMaps, rf.mdl, filename = classifiedPath, type = "response",
        na.rm = TRUE, overwrite = TRUE, progress = "text")
I expected the output classified image to misclassify 1 of the Value=1 training points and misclassify 3 of the Value=0 training points based on what is indicated in the confusion matrix.
The Random Forest algorithm is a bagging (bootstrap aggregating) method. This means it creates numerous weak classifiers, then has each weak classifier "vote" to create the final prediction. In RF, each weak classifier is one decision tree trained on a random sample of observations from the training set. Think of the random sample each decision tree is trained on as its "bag" of data.
What is being shown in the confusion matrix is something called "out-of-bag error" (OOB error). This OOB error is an accurate estimate of how your model would generalize to data it has never seen before (this estimate is usually achieved by testing your model on a withheld testing set). Since each decision tree is trained on only one bag from your training data, the rest of the data (data that's "outside the bag") can stand in for this withheld data.
OOB error is calculated by making a prediction for each observation in the training set; however, when predicting each individual observation, only the decision trees whose bags did not include that observation are allowed to participate in the voting process. The result is the confusion matrix available after training an RF model.
When you predict the observations in the training set using the complete model, decision trees whose bags did include each observation are now involved in the voting process. Since these decision trees "remember" the observations they were trained on, they skew the prediction toward the correct answer. This is why you achieve 100% accuracy.
Essentially, you should trust the confusion matrix that uses OOB error. It's a robust estimate of how the model will generalize to unseen data.
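You can see this directly in R: with the randomForest package, calling predict() on the fitted model without newdata returns the OOB predictions, while passing the training data back in returns the optimistic resubstitution predictions. A quick sketch on a built-in dataset:
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 501)
oob.pred   <- predict(rf)                  # no newdata: OOB predictions
resub.pred <- predict(rf, newdata = iris)  # trees vote on data they saw
mean(oob.pred   == iris$Species)  # honest estimate, below 100%
mean(resub.pred == iris$Species)  # typically 100%, like your output map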
I am trying to use a random forest model on my data set, which has 4679 observations and 13 variables.
I am using the random forest model to predict whether a part will fail or not.
Of the 4679 observations, 66 have NA for the target variable, and I want to predict whether those 66 parts will fail or not.
So, I decided to take the first 4613 rows as my training data and the remaining 66 rows as my test data.
train <- Imputed_data[1:4613, ]
test <- Imputed_data[4614:4679, ]
I then used the code below for my random forest:
fit <- randomForest(claim.Qty.Accepted ~ ., data = train, na.action = na.exclude)
The training confusion matrix I received looked fine.
I then tried to predict the test set with the following piece of code:
#Prediction for test set
p2 <- predict(fit, test)
head(p2)
head(test$claim.Qty.Accepted)
caret::confusionMatrix(p2, test$claim.Qty.Accepted)
The resulting confusion matrix showed 0 in every cell for both classes, Yes and No.
I later saved the predicted values p2 as a data frame, as below; in that table I could see that all 66 entries were assigned Yes or No classes.
t2 <- data.frame(p2)
I am confused: why didn't the confusion matrix show me the results of the prediction? Also, is the approach I am following the right way to predict my test results? Any lead would be helpful, since I am new to the field.
I am a newbie to machine learning methods and have a question about them. I am trying to use the caret package in R to get started and work with my dataset.
I have a training dataset (Dataset1) with mutation information for my gene of interest, let's say Gene A.
In Dataset1, the mutation status of Gene A is recorded as Mut or Not-Mut. I used Dataset1 with an SVM model to predict the output (I chose SVM because it was more accurate than LVQ or GBM).
So, as a first step, I divided my dataset into training and test groups, since the dataset already labels each case as train or test. Then I did cross-validation with 10 folds.
I tuned my model and assessed its performance on the test dataset (using a ROC curve).
Everything goes fine up to this point.
I have another dataset, Dataset2, which doesn't have mutation information for Gene A.
What I want to do now is use my tuned SVM model from Dataset1 on Dataset2, to see whether it can give me the mutation status of Gene A in Dataset2 in the form Mut/Not-Mut. I've gone through the caret package guide but couldn't work it out. I am stuck here and don't know what to do.
I am not sure if I chose the right approach. Any suggestions or help would really be appreciated.
Here is my code up to the point where I tuned the model on the first dataset.
Selecting training and test sets from the first dataset:
library(caret)
library(doParallel)
library(pROC)

M_train <- Dataset1[Dataset1$Case == 'train', -1] # training feature data frame
M_test  <- Dataset1[Dataset1$Case == 'test', -1]  # test feature data frame
y <- as.factor(M_train$Class) # target variable for training
ctrl <- trainControl(method = "repeatedcv",             # 10-fold cross-validation
                     repeats = 5,                       # 5 repetitions of CV
                     summaryFunction = twoClassSummary, # use AUC to pick the best model
                     classProbs = TRUE)
# Use expand.grid to specify the search space
# Note that the default search grid selects 3 values of each tuning parameter
# (this grid is for the gbm model mentioned below; the SVM uses tuneLength instead)
grid <- expand.grid(interaction.depth = seq(1, 4, by = 2), # tree depths from 1 to 4
                    n.trees = seq(10, 100, by = 10),       # let iterations go from 10 to 100
                    shrinkage = c(0.01, 0.1),              # try 2 values for the learning rate
                    n.minobsinnode = 20)
# Set up parallel processing
#set.seed(1951)
registerDoParallel(cores = 4)
# Train and tune the SVM
svm.tune <- train(x = M_train[, names(M_train) != "Class"], # predictors only: drop the target
                  y = y,                          # the factor target created above
                  method = "svmRadial",
                  tuneLength = 9,                 # 9 values of the cost parameter
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)               # same control object as for gbm above
# Finally, assess the performance of the model using the test data set
# Make predictions on the test data with the SVM model
svm.pred <- predict(svm.tune, M_test)
confusionMatrix(svm.pred, as.factor(M_test$Class))
svm.probs <- predict(svm.tune, M_test, type = "prob") # class probabilities for ROC
svm.ROC <- roc(predictor = svm.probs$mut,
               response = as.factor(M_test$Class),
               levels = levels(y)) # the two class levels, not the factor itself
plot(svm.ROC, main = "ROC for SVM built with GA selected features")
So, here is where I am stuck: how can I use the svm.tune model to predict the mutation status of Gene A in Dataset2?
Thanks in advance,
Now you just take the model you built and tuned, and predict from it using predict():
D2.predictions <- predict(svm.tune, newdata = Dataset2)
The key is to be sure that you have ALL of the same predictor variables in this set, with the same column names (and, in my paranoid world, in the same order).
D2.predictions will contain your predicted classes for the unlabeled data.
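One hedged way to check that column alignment before predicting (a sketch; M_train and Dataset2 are the objects from the question, and "Class" is assumed to be the only non-predictor column in M_train):
predictors <- setdiff(names(M_train), "Class")
setdiff(predictors, names(Dataset2))  # should print character(0): nothing missing
D2 <- Dataset2[, predictors]          # same columns, same order as in training
D2.predictions <- predict(svm.tune, newdata = D2)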
I am new to random forest regression. I have 300 continuous variables (299 predictors and 1 target) in prep1, and some predictors are highly correlated. The problem is that I still need an importance value for every one of the predictors, so eliminating some is not an option.
Here are my questions:
1) Is there a way to choose, for each tree, only variables that are not highly correlated? If yes, how should the code below be adjusted?
2) Assuming yes to 1), will this take care of the multicollinearity problem?
bound <- floor(nrow(prep1) / 2)
df <- prep1[sample(nrow(prep1)), ] # shuffle the rows
train <- df[1:bound, ]
test <- df[(bound + 1):nrow(df), ]
modelFit <- randomForest(continuous_target ~ ., data = train)
prediction <- predict(modelFit, test)
Random forest by nature draws samples of the observations with replacement and also selects a random subset of the features for each of those samples, so different trees see different mixes of the correlated predictors. In your scenario, given that you don't have skewness in the response variable, building a LARGE number of trees should give you an importance value for all of the variables. This does increase the computational cost, since across all the bags you end up estimating the same variable's importance many times over. Also, multicollinearity won't affect the predictive power.
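A minimal sketch of that suggestion (assuming the train data frame and continuous_target column from the question; mtry is left at its regression default of roughly one third of the predictors, so each split already sees a random subset of the correlated variables):
library(randomForest)
set.seed(1)
modelFit <- randomForest(continuous_target ~ ., data = train,
                         ntree = 2000,       # many trees, as suggested above
                         importance = TRUE)  # enable permutation importance
imp <- importance(modelFit)             # %IncMSE and IncNodePurity per predictor
head(imp[order(-imp[, "%IncMSE"]), ])   # top predictors by permutation importance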