prediction() function of the ROCR package gives Error: 'predictions' contains NA

I have been following the edX course The Analytics Edge and I am currently in the logistic regression section: the Framingham Heart Study part.
There they use the prediction function of the ROCR package to evaluate the model's accuracy with the threshold value set to 0.5. I downloaded the .csv file from the edX course portal and wrote the exact same code, but I am getting the error that 'predictions' contains NA.
Here's the code:
framingham <- read.csv("framingham.csv")
library(caTools)
# framingham <- na.omit(framingham)  # With this line the code works, but people work with missing data all the time, so it should work without dropping rows too.
set.seed(1000)
split <- sample.split(framingham$TenYearCHD, SplitRatio = 0.65)
train <- subset(framingham, split == TRUE)
test <- subset(framingham, split == FALSE)
framinghamLog <- glm(TenYearCHD ~ ., data = train, family = binomial)
summary(framinghamLog)
predictTest <- predict(framinghamLog, type = "response", newdata = test)
table(test$TenYearCHD, predictTest > 0.5)
library(ROCR)
ROCRpred <- prediction(predictTest, test$TenYearCHD)
This is the error:
Error: 'predictions' contains NA.
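A likely cause (my reading of the error, not part of the course material): predict() returns NA for any test row with a missing predictor value, and ROCR::prediction() refuses vectors containing NA. A minimal sketch of a workaround, assuming you would rather score only the complete rows than impute the missing values:
# Keep only the test rows for which predict() produced a value,
# and subset the labels the same way so the two vectors stay aligned.
keep <- !is.na(predictTest)
ROCRpred <- prediction(predictTest[keep], test$TenYearCHD[keep])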

Obtaining randomForest Predictions from a dataset different from the original set

The datasets were prepared as follows:
library(caret)
library(randomForest)
set.seed(2242)
inTrain <- createDataPartition(y = new_train$classe, p= 0.85, list = FALSE)
training <- new_train[inTrain,]
testing <- new_train[-inTrain, ]
Then I trained the random forest model below:
model <- randomForest(classe~., data = training, importance = TRUE)
The predictions are generated successfully:
testingpredictions <- predict(model, testing[ ,-55])
The validation set is prepared as follows, where chosen_columns is the subset of test-set variables that were used to train the model:
validation_set <- test[ ,chosen_columns]
validation_set$new_window<- as.factor(validation_set$new_window)
PROBLEM 1
When I try to predict using the validation set, which comes from a different file:
valpredictions <- predict(model, newdata = validation_set, type = "response")
I get the error
Error in predict.randomForest(model, newdata = validation_set, type = "response") :
Type of predictors in new data do not match that of the training data.
PROBLEM 2
When I use caret's train function with method = "rf" or method = "gbm",
the code runs until the program crashes.
Attempted Solutions
I tried rbind to align the datasets, but that also throws an error:
data <- rbind(train, test)
Error in match.names(clabs, names(xi)) :
names do not match previous names
I also tried coercing all the columns into the same classes before running the models, but it still yields the same errors.
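A common cause of the "Type of predictors in new data do not match" error is a column whose class, or whose factor levels, differ between the training data and the new data. A minimal sketch of one way to align them, assuming training and validation_set share the predictor column names (the loop is illustrative, not the poster's code):
# For every predictor used in training, give the validation column the
# same class, and for factors copy the levels from the training data.
for (col in setdiff(names(training), "classe")) {
  if (is.factor(training[[col]])) {
    validation_set[[col]] <- factor(validation_set[[col]],
                                    levels = levels(training[[col]]))
  } else if (is.numeric(training[[col]])) {
    validation_set[[col]] <- as.numeric(validation_set[[col]])
  }
}
valpredictions <- predict(model, newdata = validation_set)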

How can I calculate the mean squared error of a regression tree in R?

I am working with the wine quality dataset.
I am studying regression trees built on different predictors:
library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)
vinos <- read.csv(file = 'Wine.csv', header = T)
arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)
arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)
I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will use my own dataset, since no more data is available. I have tried to do it as
aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")
and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?
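As an aside (my diagnosis, not stated in the question): the error most likely comes from type = "anova", which predict.rpart() does not accept; its valid types are "vector", "prob", "class" and "matrix". For an anova tree the default already returns the predicted response:
aaa <- predict(arbol0, newdata = vinos)  # predicted quality from the chlorides-only tree
bbb <- predict(arbol1, newdata = vinos)  # predicted quality from the chlorides + density tree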
This website could be useful for you. It's very important to split your dataset into train and test subsets before training your models. In the following code I've done it with base functions, but the sample.split function from the caTools package does the same thing. I'm also linking a website where you can see all the ways to split data in R.
Remember that the formula for the Mean Squared Error (MSE) is:
MSE = (1/n) * Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
So it is very simple to compute in R: you just take the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values obtained from the model with the predict function).
A solution for your wine dataset could be this one, based on the previous website.
library(rpart)
library(dplyr)
library(data.table)
vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
# Split data into train and test subsets
sample_index <- sample(nrow(vinos), size = nrow(vinos)*0.75)
train <- vinos[sample_index, ]
test <- vinos[-sample_index, ]
# Train regression trees models
arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
# Make predictions for each model
pred0 <- predict(arbol0, newdata = test)
pred1 <- predict(arbol1, newdata = test)
# Calculate MSE for each model
mean((pred0 - test$quality)^2)
mean((pred1 - test$quality)^2)

Wald Test for Multinomial Regression in R

I asked this question before but never got an answer, so I am trying again, this time with a sample data set, so someone can tell me why I am getting these errors when I implement the Wald test from the aod and lmtest packages.
Sample data:
marital <- sample(1:5, 64614, replace = T)
race <- sample(1:3, 64614, replace = T)
educ <- sample(1:20, 64614, replace = T)
test <- data.frame(educ, marital, race)
test$marital <- as.factor(test$marital)
test$race <- as.factor(test$race)
test$marital <- relevel(test$marital, ref = "3")
require(nnet)
require(aod)
require(lmtest)
testmod <- multinom(marital ~ race*educ, data = test)
testnull <- multinom(marital ~ 1, data = test) #null model for the global test
waldtest(testnull, testmod)
wald.test(b = coef(testmod), Sigma = vcov(testmod), Terms = 1:24) #testing all terms for the global test
As you can see, when I use the waldtest function from lmtest package I get the following error:
Error in solve.default(vc[ovar, ovar]) : 'a' is 0-diml
When I use the wald.test function from aod, I get the following error:
Error in L %*% b : non-conformable arguments
I assume these errors are related, as they both seem to have to do with the variance-covariance matrix. I am not sure why I am getting them, though, as the data set has no missing values.
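One plausible explanation (an assumption on my part, not something the errors confirm): coef() on a multinom model returns a matrix with one row per non-reference outcome, while wald.test() expects a plain coefficient vector in the same order as vcov(). A sketch of flattening the matrix row by row so the two line up:
# Flatten the coefficient matrix equation by equation, matching the
# ordering of vcov(testmod), then test all 24 terms at once.
b <- as.vector(t(coef(testmod)))
wald.test(b = b, Sigma = vcov(testmod), Terms = seq_along(b))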
Just as a heads up when using the nnet package with multinom: you can also use the broom package to tidy things up a bit:
tidy(multinom_model, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE)
This returns a tibble with the coefficients exponentiated, confidence intervals (similar to confint used with lm), as well as the z-scores, standard errors and the respective p-values for the Wald z-test (essentially doing z <- summary(multinom_model)$coefficients / summary(multinom_model)$standard.errors and round((1 - pnorm(abs(z))) * 2, digits = 5) for you).

Unable to Subset in randomForest() function in R

I am trying to use the randomForest function in R. For my analysis, I have a dataset with 151 observations. I used a 70/30 split to get 105 observations in training and 46 in test.
I am using the subset argument to indicate the training observations. However, when I inspect rf$predicted, I see that the model used the entire dataset (151 observations), not just the training set.
Also, when I use predict() on the test data, the model produces predictions for 150 observations, but the test set has only 46.
Can you please tell me what I may be doing wrong? I want to fit the model using the training dataset only and predict on the test dataset only. Thank you in advance!
Data is available here:
https://archive.ics.uci.edu/ml/datasets/teaching+assistant+evaluation
https://archive.ics.uci.edu/ml/machine-learning-databases/tae/
Code:
library("randomForest")
library(caTools)
# Importing tae.csv
setwd("C:\\Users\\Saulat Majid\\Documents\\MSDataAnalytics\\DSU\\10 STAT 702 Modern Applied Statistics II - Saunders\\HW5")
tae <- read.table(file = "tae.csv", header = FALSE, sep = ",")
head(tae)
colnames(tae) <- c("N_Speaker", "Instructor", "Course", "Summer", "C_Size", "Class")
# Numerical Summary of tae dataset
head(tae)
str(tae)
summary(tae)
# Converted categorical variables into factor
tae$Class <- as.factor(tae$Class)
tae$N_Speaker <- as.factor(tae$N_Speaker)
tae$Instructor <- as.factor(tae$Instructor)
tae$Summer <- as.factor(tae$Summer)
tae$Course <- as.factor(tae$Course)
str(tae)
# Splitting data into train and test
tae.Split <- sample.split(tae$Class, SplitRatio = 0.7)
table(tae.Split)
tae.train <- tae[tae.Split,]
tae.test <- tae[!tae.Split,]
dim(tae)
dim(tae.train)
dim(tae.test)
rf <- randomForest(Class ~ N_Speaker + Summer + C_Size, data = tae, subset = tae.Split)
rf$predicted
predict(object = rf,
newdata = tae[-tae.Split,],
type = "response")

How do I use predict() on new data for lme4::glmer model?

I have been trying to establish predictive performance (AUC ROC) for a glmer model. When I use the predict() function on a test data set, the output has the length of my training data set.
folds = 10;
glmerperf=rep(0,folds); glmperf=glmerperf;
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[,c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds)
{
testdata = TB_Train.glmer.cns[foldsamples[[n]],]
traindata = TB_Train.glmer.cns[-foldsamples[[n]],]
GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family="binomial", control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1000000)))
glmer.probs <- predict(GLMER, newdata=testdata$Non.TB.Case, type="response")
glmer.ROC <- roc(predictor=glmer.probs, response=testdata$Case.Status, levels=rev(levels(testdata$Case.Status)))
glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata=TB_Test.glmer$Non.TB.Case, type="response", re.form=~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects have the length of the traindata object, despite the newdata argument being specified. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several warnings about needing to scale my data (which I have already done) and about the model failing to converge. Any ideas on how to fix this? I have already increased the iterations and selected a new optimizer.
Figured out that the error was arising from using the "." shortcut to specify all predictors in the model formula.
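A sketch of the fix that answer implies, with the fixed effects spelled out instead of "." and newdata passed as a data frame rather than a single column (the predictor names besides Non.TB.Case are placeholders, not from the question):
# Spell out the fixed effects and pass the whole test data frame;
# predict() then returns one probability per test row.
GLMER <- lme4::glmer(Case.Status ~ Non.TB.Case + Age + Sex + (1 | IDNO),  # Age, Sex are placeholder predictors
                     data = traindata, family = "binomial")
glmer.probs <- predict(GLMER, newdata = testdata, type = "response",
                       allow.new.levels = TRUE)  # tolerate IDNO levels unseen in training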
