Random Forest Predictions - r

I am looking for some guidance on a homework assignment for a class. We are given a dataset with 14K observations and are asked to build a model that predicts the last variable, "classe". Using the caret package, I split the dataset into training and testing sets (4909 observations in the testing set), removed the near-zero-variance variables, and built the model. But when I try to make predictions on the testing set, I only get 97 predictions back. I reviewed the help files but still can't figure out where I am going wrong. Any hints would be appreciated.
Here is the Code:
set.seed(1234)
pml.training <- read.csv("./data/pml-training.csv")
#
library(caret)
inTrain <- createDataPartition(y=pml.training$classe, p=0.75, list=FALSE)
training <- pml.training[inTrain,]
testing <- pml.training[-inTrain,]
# Pull out the Near Zero Value (NZV)
nzv <- nearZeroVar(training, saveMetrics=TRUE)
omit <- which(nzv$nzv==TRUE)
training <- training[,-omit]
testing <- testing[,-omit]
# Fit the model
modFit <- train(classe ~., method="rf", data=training)
modFit
print(modFit$finalModel)
plot(modFit)
# Try to predict on the testing set
pred <- predict(modFit, newdata=testing)
testing$predRight <- pred==testing$classe
print(table(pred, testing$classe))
Thanks, Pat C.

Have you checked
sum(complete.cases(subset(testing, select = -classe)))
?
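If that sum is smaller than nrow(testing), the NAs explain the shortfall: by default, predict() drops incomplete rows, so you get one prediction per complete case rather than per row. A minimal sketch of one way to handle it (the 0.9 threshold is an illustrative choice):
# drop columns that are mostly NA, deciding on the training set only
na_frac <- colMeans(is.na(training))
training <- training[, na_frac < 0.9]
testing <- testing[, na_frac < 0.9]
# refit the model on the reduced training set, then predict again;
# length(pred) should now equal nrow(testing)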

Related

How can I calculate the mean square error in R of a regression tree?

I am working with the wine quality dataset.
I am fitting regression trees on different predictor variables, as follows:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will evaluate on the same dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the dataframe from aaa and bbb. However, I am getting an error. Can someone please help me?
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the code below I've done it with base functions, but the sample.split function from the caTools package does the same procedure (see the sketch after this paragraph); the linked website shows all the ways to split data in R.
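For reference, a minimal sketch of the caTools alternative (assuming the same vinos data frame; sample.split tries to preserve the relative ratios of the labels in the vector you pass it):
library(caTools)
set.seed(42)  # illustrative seed for a reproducible split
split <- sample.split(vinos$quality, SplitRatio = 0.75)  # logical vector
train <- vinos[split, ]
test <- vinos[!split, ]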
Remember the formula for the Mean Squared Error (MSE):
MSE = (1/n) * Σᵢ (yᵢ - ŷᵢ)²
where yᵢ are the observed values and ŷᵢ the predicted values.
So it's very simple to apply in R: just compute the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values produced by the predict function).
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
set.seed(123)  # illustrative seed to make the split reproducible
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)

How can I handle a confusionMatrix error when it says my data is NULL

I am trying to run a random forest analysis in R. Fitting the model and predicting on the test group work well, but when I run confusionMatrix it gives me the following error:
Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length
load the test and training data
library(caret)  # provides nearZeroVar, createDataPartition, train, confusionMatrix
trainData <- read.csv("./pml-training.csv")
testData <- read.csv("./pml-testing.csv")
dim(trainData)
dim(testData)
Data cleaning - here, variables with nearly zero variance, variables that are almost always NA, and columns containing summary statistics or irrelevant data will be removed.
trainClean <- trainData[,colMeans(is.na(trainData))< .9]
trainClean <- trainData[,-c(1:7)]
nvz <- nearZeroVar(trainClean)
trainClean <- trainClean[,-nvz]
dim(trainClean)
Split the data into training (70%) and validation (30%)
inTrain <- createDataPartition(y=trainClean$classe, p=0.7, list=FALSE)
train <- trainClean[inTrain,]
valid <- trainClean[-inTrain,]
# Create a control for 3 fold validation
control <- trainControl(method="cv", number=3, verboseIter = FALSE)
Building the models
Random Forests
# Fit the model on train using random forest
modFit <- train(classe~., data=train, method="rf", trControl=control, tuneLength=5, na.action=na.omit)
modFit
modPredict<- predict(modFit, valid, na.action=na.omit) # predict on the valid data set.
# Turn valid$classe into a factor and check it
valid$classe <- as.factor(valid$classe)
modCM <- confusionMatrix(modPredict, as.factor(valid$classe))
modCM
table(modPredict, valid$classe)
When I check the length of modPredict, it is 122, while valid$classe has length 5885. If I try dim on modPredict, I get NULL. I have tried using na.action=na.omit on the prediction chunk, and I have also tried NOT using na.action=na.omit on either the prediction or the fit chunks.
I checked the test and valid data sets where I split the data using:
```length(train); length(valid); length(valid$classe); nrow(valid); nrow(train)```
The output is:
[1] 94
[1] 94
[1] 5885
[1] 5885
[1] 13737
I have been struggling with this problem and similar problems on my decision tree chunk as well. I don't want people to do my homework for me, but I could use a hint.
Thanks in advance
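One thing worth checking, offered as a hint rather than a full solution: in the cleaning step, the second assignment starts again from trainData, which silently discards the NA-column filtering done on the line before it. Columns that are mostly NA then survive into train and valid, and predict() drops the incomplete rows, which would explain 122 predictions against 5885 rows. A sketch of the chained version:
# chain the cleaning steps so each builds on the previous result
trainClean <- trainData[, colMeans(is.na(trainData)) < .9]  # drop mostly-NA columns
trainClean <- trainClean[, -c(1:7)]  # then drop the ID/timestamp columns from the cleaned data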

Performing SVM regression with Test and training sets in R

I have a response variable containing 100 observations, and I wish to estimate it using 8 independent variables via Support Vector Regression.
I have searched a lot for a template implementing SVR with training and testing sets in R, but I could not find what I wanted.
I have used the following code to fit the model and calculate the RMSE, but I want to check my model on unseen data and I do not know how to do this in R.
My code is as follows:
data<-read.csv("Enzyme.csv",header = T)
Testset <- data[c(11:30),]
Trainset <- data[-c(11:30), ]
#attached dependent variable
Y<-Trainset$Urease
Trainset<-Trainset[,-c(1)]
SVMUr <- svm(Urease~., data=Trainset, kernel="radial", gamma=1, epsilon=seq(0,1,0.1), cost=10)
summary(SVMUr)
################### RMSE SVMUr ##########################
RMSE <- function(observed, predicted){
  sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}
RMSE(observed =Y,predicted = predSVMUr)
######## Check the model for unseen data via using testset ######
predicted_test <- predict(SVMUr, Testset[,-1])
RMSE(Testset$Urease, predicted_test)
The way you want to go about testing your model is to:
First apply your model on unseen data using predict(SVMUr, Testset[,-1]) assuming the first variable is your target response Y. If it is the 15th variable for example, replace -1 with -15.
Now use the RMSE() function to get the RMSE of the model on your test dataset
Additional Recommendation:
I would not split the data the way you do, because taking a fixed block of rows (11:30) as the test set rather than a random sample can bias both sets. If you want a random 80%-20% split, you can adapt my code below:
data<-read.csv("Enzyme.csv",header = T)
split_data <- sample(nrow(data), nrow(data)*0.8)
Trainset <- data[split_data, ]
Testset <- data[-split_data, ]
That would put 80% of your data in the train set and 20% in the test set.
The rest of the code:
SVMUr <- svm(Urease~., data=Trainset, kernel="radial", gamma=1, epsilon=0.1, cost=10)  # epsilon must be a single value; see the tuning note below
summary(SVMUr)
################### RMSE SVMUr ##########################
RMSE <- function(observed, predicted){
  sqrt(mean((predicted - observed)^2, na.rm=TRUE))
}
predSVMUr <- predict(SVMUr, Trainset)  # in-sample predictions, needed for the next line
RMSE(observed = Trainset$Urease, predicted = predSVMUr)
######## Check the model for unseen data via using testset ######
predicted_test <- predict(SVMUr, Testset[,-1])
RMSE(Testset$Urease, predicted_test)
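One additional note, based on my reading of your svm() call: epsilon expects a single value, so epsilon=seq(0,1,0.1) does not search over a grid. If the intent was to tune epsilon (and cost), e1071 provides tune() for a cross-validated grid search; a minimal sketch:
# grid search over epsilon and cost with cross-validation (e1071::tune)
tuned <- tune(svm, Urease ~ ., data = Trainset,
              ranges = list(epsilon = seq(0, 1, 0.1), cost = c(1, 10, 100)))
summary(tuned)
SVMUr <- tuned$best.model  # the model refit with the best parameters found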

Estimate prediction accuracy of Cox PH

I would like to develop a Cox proportional hazards model in R, use it to make predictions, and evaluate the accuracy of the model. For the evaluation I would like to use the Brier score.
# import various packages, needed at some point of the script
library("survival")
library("survminer")
library("prodlim")
library("randomForestSRC")
library("pec")
library("rpart")
library("mlr")
library("Hmisc")
library("ipred")
# load lung cancer data
data("lung")
head(lung)
# recode status variable from 1/2 to 0/1 (0 = censored, 1 = dead)
lung$status <- lung$status-1
# Delete rows with missing values
lung <- na.omit(lung)
# split data into training and testing
## 80% of the sample size
smp_size <- floor(0.8 * nrow(lung))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(lung)), size = smp_size)
# training and testing data
train.lung <- lung[train_ind, ]
test.lung <- lung[-train_ind, ]
# time and failure event
s <- Surv(train.lung$time, train.lung$status)
# create model
cox.ph2 <- coxph(s~age+meal.cal+wt.loss, data=train.lung)
# predict
pred <- predict(cox.ph2, newdata = train.lung)
# evaluate
sbrier(s, pred)
As the outcome of the prediction I would expect a time (as in "when does this individual experience failure?"). Instead I get values like this:
[1] 0.017576359 -0.135928959 -0.347553969 0.112509137 -0.229301199 -0.131861582 0.044589175 0.002634008
[9] 0.345966978 0.209488560 0.002418358
What does that mean?
Furthermore, sbrier does not work; apparently it cannot work with the prediction pred (no surprise there).
How do I solve this? How do I make a prediction with cox.ph2? How can I evaluate the model afterwards?
The predict() function won't return a time value; you have to choose one of the type options ("lp", "risk", "expected", "terms", "survival") in the predict() call. The values you are seeing are the default, type = "lp": the linear predictors, not times.
If you want to get the hazard ratios :
predict(cox.ph2, newdata = test.lung, type = "risk")
Note that you want to predict the values on the test set not the training set.
I have read that you can use AFT models in your case:
https://stats.stackexchange.com/questions/79362/how-to-get-predictions-in-terms-of-survival-time-from-a-cox-ph-model
You can also read this post:
Calculate the Survival prediction using Cox Proportional Hazard model in R
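For the Brier score itself, here is a minimal sketch of the idea, evaluated at a single arbitrary time point (t = 365 days) and ignoring censoring; for a proper Brier score that weights for censoring, use ipred::sbrier or the pec package with predicted survival probabilities rather than linear predictors:
# predicted survival probabilities for the test set at t = 365
sf <- survfit(cox.ph2, newdata = test.lung)
surv365 <- as.numeric(summary(sf, times = 365)$surv)  # one probability per test subject
# naive Brier score at t = 365: mean squared difference between the
# observed survival indicator and the predicted survival probability
alive365 <- as.numeric(!(test.lung$time <= 365 & test.lung$status == 1))
mean((alive365 - surv365)^2)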
Hope it helps.

Two different ways to calculate the AUC of the training set on randomForest give different results?

I used two ways to calculate the AUC of the training set with randomForest, but I get very different results. The two ways are as follows:
library(randomForest)
library(ROCR)  # provides prediction() and performance()
rfmodel <- randomForest(y~., data=train, importance=TRUE, ntree=1000)
Way 1 of calculating AUC of train set:
rf_p_train <- predict(rfmodel, type="prob", newdata = train)[,'yes']
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train[i] <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
Way 2 of calculating AUC of train set:
rf_p_train <- as.vector(rfmodel$votes[,2])
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train[i] <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
Way 1 gives me an AUC around 1, but way 2 gives me an AUC around 0.65. I am wondering why these two results differ so much. Could anyone help me with this? I'd really appreciate it. As for the data, I am sorry that I am not allowed to share it here. This is the first time I have asked a question here, so please forgive me if anything is unclear. Thanks a lot!
OK. The second way is correct. Why? Because in the first way, you treat the training data as a new dataset and predict on the very rows the forest was fit to, so each tree has already seen them and the AUC is inflated. In the second way, what you get is the so-called out-of-bag (OOB) estimate: each row is scored only by the trees that did not have it in their bootstrap sample, and that is the right way to estimate AUC on the training set.
I am not sure what data you are using. It is best if you provide a reproducible example, but I think I was able to piece one together:
library(randomForest)
#install.packages("ModelMetrics")
library(ModelMetrics)
# prep training to binary outcome
train <- iris[iris$Species %in% c('virginica', 'versicolor'),]
train$Species <- droplevels(train$Species)
# build model
rfmodel <- randomForest(Species~., data=train, importance=TRUE, ntree=2)
# generate predictions
preds <- predict(rfmodel, type="prob",newdata = train)[,2]
# Calculate AUC
auc(train$Species, preds)
# Calculate LogLoss
logLoss(train$Species, preds)
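To see the first answer's point in this example, compare against the out-of-bag version (way 2). A sketch, refitting with a larger ntree so that every row is out of bag for at least some trees (500 is an illustrative choice; 2 trees leave many rows with no OOB votes):
# refit with enough trees for stable OOB votes
rfmodel2 <- randomForest(Species ~ ., data = train, importance = TRUE, ntree = 500)
# OOB vote fraction for the second class: each row is scored only by
# trees whose bootstrap sample did not contain it
oob_preds <- as.vector(rfmodel2$votes[, 2])
auc(train$Species, oob_preds)  # typically lower than the resubstitution AUC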
