I am trying to use the randomForest function in R. For my analysis, I have a dataset with 151 observations. I used a 70/30 split to get 105 observations in training and 46 in test.
I'm using the subset argument to indicate the training observations. However, when I look at rf$predicted, I see that the model used the entire dataset (151 observations), not just the training data.
Also, when I use predict() to score the test data, the model predicts on 150 observations, but the test set has only 46 observations.
Can you please tell me what I may be doing wrong? I want to fit the model using the training dataset only and predict on the test dataset only. Thank you in advance!
Data is available here:
https://archive.ics.uci.edu/ml/datasets/teaching+assistant+evaluation
https://archive.ics.uci.edu/ml/machine-learning-databases/tae/
Code:
library("randomForest")
library(caTools)
# Importing tae.csv
setwd("C:\\Users\\Saulat Majid\\Documents\\MSDataAnalytics\\DSU\\10 STAT 702 Modern Applied Statistics II - Saunders\\HW5")
tae <- read.table(file = "tae.csv", header = FALSE, sep = ",")
head(tae)
colnames(tae) <- c("N_Speaker", "Instructor", "Course", "Summer", "C_Size", "Class")
# Numerical Summary of tae dataset
head(tae)
str(tae)
summary(tae)
# Convert categorical variables to factors
tae$Class <- as.factor(tae$Class)
tae$N_Speaker <- as.factor(tae$N_Speaker)
tae$Instructor <- as.factor(tae$Instructor)
tae$Summer <- as.factor(tae$Summer)
tae$Course <- as.factor(tae$Course)
str(tae)
# Splitting data into train and test
tae.Split <- sample.split(tae$Class, SplitRatio = 0.7)
table(tae.Split)
tae.train <- tae[tae.Split,]
tae.test <- tae[!tae.Split,]
dim(tae)
dim(tae.train)
dim(tae.test)
rf <- randomForest(Class ~ N_Speaker + Summer + C_Size, data = tae, subset = tae.Split)
rf$predicted
predict(object = rf,
        newdata = tae[-tae.Split,],
        type = "response")
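For reference, a minimal sketch of the fit-on-train / predict-on-test workflow, using the tae.train and tae.test objects created above (an illustration, not necessarily the only fix): passing the training data frame directly avoids the subset argument, rf$predicted then holds the out-of-bag predictions for the 105 training rows, and predict() on tae.test returns exactly 46 values. Note that tae[-tae.Split, ] mixes negative indexing with a logical vector (TRUE/FALSE are coerced to -1/0), which is why that call scores 150 rows instead of 46.
# Sketch: fit on the training rows only, then predict on the 46 test rows
rf <- randomForest(Class ~ N_Speaker + Summer + C_Size, data = tae.train)
rf$predicted                                        # out-of-bag predictions for the 105 training rows
predict(rf, newdata = tae.test, type = "response")  # predictions for the 46 test rows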
I am working with the wine quality dataset.
I am studying regression trees that depend on different variables, for example:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as:
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?
It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but the sample.split function from the caTools package does the same procedure, and there are several other ways to split data in R as well.
Remember that the formula for the Mean Squared Error (MSE) is:

MSE = (1/n) * Σᵢ (yᵢ − ŷᵢ)²

So it's very simple to compute in R: you just take the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values you obtain from the model with the predict function).
A solution for your wine dataset could be the following.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)
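If you also want an error measure on the same scale as quality, you can take the square root of the MSE (the RMSE); a small sketch using the objects above:
# Sketch: compare the two trees on the held-out data via MSE and RMSE
mse0 <- mean((pred0 - test$quality)^2)
mse1 <- mean((pred1 - test$quality)^2)
c(RMSE_arbol0 = sqrt(mse0), RMSE_arbol1 = sqrt(mse1))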
I'm trying to generate a confusion table using the HMDA data from the AER package. I ran a probit model, predicted on the testing set, and used the table() function to generate a 2 by 2 confusion table, but R just returns a long list instead of the 2 by 2 matrix I wanted.
Could anyone tell me what's going on?
# load required packages and data (HMDA)
library(e1071)
library(caret)
library(AER)
library(plotROC)
data(HMDA)
# again, check variable columns
names(HMDA)
# convert the dependent variable to numeric
HMDA$deny <- ifelse(HMDA$deny == "yes", 1, 0)
# subset needed columns
subset <- c("deny", "hirat", "lvrat", "mhist", "unemp")
# subset data
data <- HMDA[complete.cases(HMDA), subset]
# do a 75-25 train-test split
train_row_numbers <- createDataPartition(data$deny, p=0.75, list=FALSE)
training <- data[train_row_numbers, ]
testing <- data[-train_row_numbers, ]
# fit a probit model and predict on testing data
probit.fit <- glm(deny ~ ., family = binomial(link = "probit"), data = training)
probit.pred <- predict(probit.fit, testing)
confmat_probit <- table(Predicted = probit.pred,
                        Actual = testing$deny)
confmat_probit
You need to specify a threshold or cut-point for predicting a dichotomous outcome. predict() returns the predicted values, not 0/1.
Also be careful with the predict function: the default type is "link", which in your case is the probit (z) scale. If you want predict to return probabilities, specify type = "response".
probit.pred <- predict(probit.fit, testing, type="response")
Then choose a cut-point; any prediction above this value will be TRUE:
confmat_probit <- table(`Predicted>0.1` = probit.pred > 0.1 , Actual = testing$deny)
confmat_probit
             Actual
Predicted>0.1   0   1
        FALSE 248  21
        TRUE  273  53
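For example, here is a sketch of the same table at a 0.5 cut-point, together with the overall accuracy (the 0.5 threshold is only an illustration; pick whatever cut-point suits your application):
# Sketch: confusion table at a 0.5 cut-point and the corresponding accuracy
pred_class <- ifelse(probit.pred > 0.5, 1, 0)
table(Predicted = pred_class, Actual = testing$deny)
mean(pred_class == testing$deny)   # proportion of correct predictions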
I have a data frame with 45045 variables and only 90 observations in R. I did a PCA to reduce the dimensionality and I'll use 14 principal components. I need to make predictions and I want to try the Naive Bayes method. I can't use the predict function with the transformed data and I don't understand the error.
Here is some code:
data.pca <- prcomp(data)
I'll use 14 PCs:
newdata <- as.data.frame(data.pca$x[,1:14]) #dimension: 90x14
Training:
library(naivebayes)
mod.nb <- naive_bayes(label ~ newdata$PC1+...+newdata$PC14, data = NULL)
Trying to predict the 50th observation:
test.pca <- predict(data.pca, newdata = data[50,])
test.pca <- as.data.frame(test.pca)
test.pca <- test.pca[,1:14]
pred <- predict(mod.nb, test.pca)
I'm getting these errors:
predict.naive_bayes(): Only 0 feature(s) out of 14 defined in the naive_bayes object "mod.nb" are used for prediction.
predict.naive_bayes(): No feature in the newdata corresponds to probability tables in the object. Classification is done based on the prior probabilities
The vector of labels is a factor with levels 1 to 6, and for any observation that I try to predict, the result is always 1. The 50th observation, for example, has the label 4.
You can try the following code, modified only slightly from your own:
data.pca <- prcomp(data)
newdata <- as.data.frame(data.pca$x[,1:14])
library(naivebayes)
mod.nb <- naive_bayes(label ~ newdata$PC1+...+newdata$PC14, data = newdata)
test.pca <- predict(mod.nb, newdata = newdata[50,])
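If that still reports that no features are used, the usual cause is that the column names in newdata do not match the feature names stored in the model. A sketch that keeps the names consistent, assuming label is the factor of class labels for the 90 observations:
library(naivebayes)
# Keep the label and the 14 PC scores together, with plain column names PC1 ... PC14
pcs      <- as.data.frame(data.pca$x[, 1:14])
train.nb <- cbind(label = label, pcs)     # assumes `label` holds the 90 class labels
mod.nb <- naive_bayes(label ~ ., data = train.nb)
# newdata must contain columns named PC1 ... PC14
pred <- predict(mod.nb, newdata = pcs[50, , drop = FALSE])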
I'm using a support vector machine on the Titanic dataset, and some of the observations are not being predicted when I use the predict function with my model.
library(e1071)
library(data.table)
library(ISLR)
titanic.index <- sample(891, 600)
titanic.train <- dat[titanic.index]
titanic.test <- dat[-titanic.index]
titanic.fit <- svm(Survived ~ Pclass + Sex + SibSp, data = titanic.train, kernel = "polynomial")
titanic.preds <- predict(titanic.fit, newdata = titanic.test)
titanic.preds
length(titanic.preds)
Whenever I run this on my computer I get anywhere from 220 to 240 predictions, but there are clearly 291 observations in the test data. There aren't any missing values for these predictors. To make matters even weirder, when I build an SVM using the Auto dataset in the ISLR package, the same problem doesn't occur.
data("Auto")
auto <- as.data.table(Auto)
auto[, mileage := ifelse(auto[, mpg] > median(auto[, mpg]), 1, 0)]
auto[, mileage := factor(mileage)]
auto.index <- sample(392, 200)
auto.train <- auto[auto.index]
auto.test <- auto[-auto.index]
auto.fit <- svm(mileage ~ ., data = auto.train)
auto.preds <- predict(auto.fit, newdata = auto.test)
auto.preds
length(auto.preds)
I have no idea why this is happening. Any insight you can provide is greatly appreciated!
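One thing worth checking (a guess, since I can't run your data): predict.svm silently drops rows of newdata that contain NA in any variable used by the model, because its default is na.action = na.omit. A sketch for seeing whether that explains the shortfall:
# Sketch: count NAs in the model variables and compare prediction count to test size
colSums(is.na(titanic.test[, .(Pclass, Sex, SibSp)]))
nrow(titanic.test)       # expected number of predictions
length(titanic.preds)    # number actually returned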
How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)
will fail, as the Purchase variable is missing. If instead I convert it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification but a regression problem.
The example code seems to have a few issues, indicated in the comments below. To answer your questions:
The result of ifelse is an integer vector, not a factor, so the train function defaults to regression.
Passing the dummyVars output directly to the model is done by using train(x = , y = , ...) instead of a formula.
To avoid these problems, check the class of your objects carefully.
Be aware that the preProcess option in train() will apply the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't transform the data yet, so that standardization can be applied to the numeric variables
# before the categorical variables are turned into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also move the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data, since it was
# dropped by predict(dummies, ...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
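To sanity-check either fit on the held-out half, you can predict on the corresponding test frame; a short sketch for Option 2 (Option 1 works the same way with data2[-train_idx, ]):
# Sketch: class predictions on the held-out half and a confusion matrix
pred <- predict(modelFit, newdata = test)
confusionMatrix(pred, data$Purchase[-train_idx])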