I'm trying to generate a confusion table using the HMDA data from the AER package. I ran a probit model, predicted on the testing set, and used the table() function to generate a 2 by 2 confusion matrix, but R just returns a long list instead of the 2 by 2 matrix I wanted.
Could anyone tell me what's going on?
# load required packages and data (HMDA)
library(e1071)
library(caret)
library(AER)
library(plotROC)
data(HMDA)
# again, check variable columns
names(HMDA)
# convert dependent variables to numeric
HMDA$deny <- ifelse(HMDA$deny == "yes", 1, 0)
# subset needed columns
subset <- c("deny", "hirat", "lvrat", "mhist", "unemp")
# subset data
data <- HMDA[complete.cases(HMDA), subset]
# do a 75-25 train-test split
train_row_numbers <- createDataPartition(data$deny, p=0.75, list=FALSE)
training <- data[train_row_numbers, ]
testing <- data[-train_row_numbers, ]
# fit a probit model and predict on testing data
probit.fit <- glm(deny ~ ., family = binomial(link = "probit"), data = training)
probit.pred <- predict(probit.fit, testing)
confmat_probit <- table(Predicted = probit.pred,
Actual = testing$deny)
confmat_probit
You need to specify a threshold or cut-point for predicting a dichotomous outcome. predict() returns the predicted values, not 0 / 1.
Also be careful with the predict function: the default type is "link", which in your case is the probit scale (the linear predictor). If you want predict to return probabilities, specify type = "response".
probit.pred <- predict(probit.fit, testing, type="response")
Then choose a cut-point; any prediction above this value will be TRUE:
confmat_probit <- table(`Predicted>0.1` = probit.pred > 0.1 , Actual = testing$deny)
confmat_probit
              Actual
Predicted>0.1   0   1
        FALSE 248  21
        TRUE  273  53
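If you also want the usual summary statistics (accuracy, sensitivity, and so on), one option is caret's confusionMatrix(); a minimal sketch, assuming you keep the probabilities from type = "response" and use an illustrative 0.5 cut-point:
# turn the probabilities into class labels and let caret summarise the table
# (0.5 is only an example cut-point, tune it for your application)
pred_class <- factor(ifelse(probit.pred > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(data = pred_class,
                reference = factor(testing$deny, levels = c(0, 1)),
                positive = "1")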
I built a model using the plm package. The sample dataset is here.
I am trying to predict on test data and calculate metrics.
# Import package
library(plm)
library(tidyverse)
library(prediction)
library(nlme)
# Import data
df <- read_csv('Panel data sample.csv')
# Convert author to character
df$Author <- as.character(df$Author)
# Split data into train and test
df_train <- df %>% filter(Year != 2020) # 2017, 2018, 2019
df_test <- df %>% filter(Year == 2020) # 2020
# Convert data
panel_df_train <- pdata.frame(df_train, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
panel_df_test <- pdata.frame(df_test, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
# Create the first model
plmFit1 <- plm(Score ~ Articles, data = panel_df_train)
# Print
summary(plmFit1)
# Get the RMSE for train data
sqrt(mean(plmFit1$residuals^2))
# Get the MSE for train data
mean(plmFit1$residuals^2)
Now I am trying to calculate metrics for the test data.
First, I tried to use prediction() from the prediction package, which has an option for plm.
predictions <- prediction(plmFit1, panel_df_test)
Got an error:
Error in crossprod(beta, t(X)) : non-conformable arguments
I read several related questions on predicting from plm models, but they did not solve my problem. One of them suggested fitted <- as.numeric(plmFit1$model[[1]] - plmFit1$residuals), but that gives me a different number of values than either my train or test set has.
Regarding out-of-sample prediction with fixed effects models, it is not clear how data relating to fixed effects not in the original model are to be treated, e.g., data for an individual not contained in the original data set the model was estimated on. (This is more a methodological question than a programming question.)
Version 2.6-2 of plm allows predict() for fixed effects models with the original data and with out-of-sample data (see ?predict.plm).
Below is an example with 10 firms used for model estimation, where the data used for prediction contain a firm not present in the original data set (besides that firm, there are also years not contained in the original model object, but these are irrelevant here because it is a one-way individual model). It is unclear what the fixed effect of that out-of-sample firm would be, so by default no predicted value is given (NA). If the argument na.fill is set to TRUE, the (weighted) mean of the fixed effects contained in the original model object is used as a best guess.
library(plm)
data("Grunfeld", package = "plm")
# fit a fixed effect model
fit.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# generate 55 new observations of three firms used for prediction:
# * firm 1 with years 1935:1964 (has out-of-sample years 1955:1964),
# * firm 2 with years 1935:1949 (all in sample),
# * firm 11 with years 1935:1944 (firm 11 is out-of-sample)
set.seed(42L)
new.value2 <- runif(55, min = min(Grunfeld$value), max = max(Grunfeld$value))
new.capital2 <- runif(55, min = min(Grunfeld$capital), max = max(Grunfeld$capital))
newdata <- data.frame(firm = c(rep(1, 30), rep(2, 15), rep(11, 10)),
year = c(1935:(1935+29), 1935:(1935+14), 1935:(1935+9)),
value = new.value2, capital = new.capital2)
# make pdata.frame
newdata.p <- pdata.frame(newdata, index = c("firm", "year"))
## predict from fixed effect model with new data as pdata.frame
predict(fit.fe, newdata = newdata.p) # has NA values for the 11'th firm
## set na.fill = TRUE to have the weighted mean used for the fixed effects -> no NA values
predict(fit.fe, newdata = newdata.p, na.fill = TRUE)
NB: When you input a plain data.frame as newdata, it is not clear how the data relate to the individuals and time periods, which is why the weighted mean of fixed effects from the original model object is used for all observations in newdata and a warning is printed. For fixed effects model prediction, it is reasonable to assume the user can provide information (via a pdata.frame) about how the data used for prediction relate to the individual and time dimensions of the panel.
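Applied to the question's setup, computing test-set metrics then looks roughly like the sketch below (assuming panel_df_test is built from df_test as above and the response column is Score):
# predict on the test pdata.frame; na.fill = TRUE uses the mean fixed effect
# for authors not seen in training
pred_test <- predict(plmFit1, newdata = panel_df_test, na.fill = TRUE)
# RMSE and MSE on the test data
sqrt(mean((panel_df_test$Score - pred_test)^2, na.rm = TRUE))
mean((panel_df_test$Score - pred_test)^2, na.rm = TRUE)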
I have a data frame with 45045 variables and only 90 observations in R. I did a PCA to reduce the dimension, and I'll use 14 principal components. I need to do predictions and I want to try the Naive Bayes method. I can't use the predict function with the transformed data and I don't understand the error.
Here is some code:
data.pca <- prcomp(data)
I'll use 14 PCs:
newdata <- as.data.frame(data.pca$x[,1:14]) #dimension: 90x14
Training:
library(naivebayes)
mod.nb <- naive_bayes(label ~ newdata$PC1+...+newdata$PC14, data = NULL)
Trying to predict the 50th observation:
test.pca <- predict(data.pca, newdata = data[50,])
test.pca <- as.data.frame(test.pca)
test.pca <- test.pca[,1:14]
pred <- predict(mod.nb, test.pca)
I'm getting these errors:
predict.naive_bayes(): Only 0 feature(s) out of 14 defined in the naive_bayes object "mod.nb" are used for prediction.
predict.naive_bayes(): No feature in the newdata corresponds to probability tables in the object. Classification is done based on the prior probabilities
The vector of labels is a factor with levels 1 to 6, and for any observation that I try to predict the result is only 1. The 50th observation, for example, has the label 4.
You can try the following, which stays close to your code. The key point is to train the model on the data frame of scores itself, so that the feature names it stores (PC1 ... PC14) match the names it sees at prediction time:
data.pca <- prcomp(data)
newdata <- as.data.frame(data.pca$x[, 1:14])  # 90 x 14 data frame of scores
library(naivebayes)
# default x/y interface: features are the PC columns, response is the label factor
mod.nb <- naive_bayes(x = newdata, y = label)
# predict the 50th observation from the same score data frame (keep it as one row)
test.pca <- predict(mod.nb, newdata = newdata[50, , drop = FALSE])
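If you do want to route a new raw observation through predict(data.pca, ...) as in your original attempt, the projected row must keep the PC column names the model was trained on; a minimal sketch, assuming mod.nb was fit on the PC1 ... PC14 columns as above:
# project the raw row through the same PCA rotation, keep it as a
# one-row data frame with columns PC1 ... PC14, then predict
new_scores <- as.data.frame(predict(data.pca, newdata = data[50, , drop = FALSE]))
new_scores <- new_scores[, 1:14, drop = FALSE]
predict(mod.nb, newdata = new_scores)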
I'm working on a new piece of code and need a little help with ridge regularized regression. I'm trying to build a predictive model, but first I need the rows of my x matrix and y vector to match.
I found something similar in a Google search, but their data were randomly generated rather than provided like mine. My data is a large dataset with over 500,000 observations and 670 variables.
library(rsample)
library(glmnet)
library(dplyr)
library(ggplot2)
# Create training (70%) and test (30%) sets
# Use set.seed for reproducibility
set.seed(123)
alumni_split<-initial_split(alumni, prop=.7, strata = "Id.Number")
alumni_train<-training(alumni_split)
alumni_test<-testing(alumni_split)
#----
# Create training and testing feature model matrices and response vectors.
# We use model.matrix(...)[, -1] to discard the intercept
alumni_train_x <- model.matrix(Id.Number ~ ., alumni_train)[, -1]
alumni_test_x <- model.matrix(Id.Number ~ ., alumni_test)[, -1]
alumni_train_y <- log(alumni_train$Id.Number)
alumni_test_y <- log(alumni_test$Id.Number)
# What is the dimension of your feature matrix?
dim(alumni_train_x)
#---- [HERE]
# Apply Ridge regression to alumni data
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
The error message (with code):
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)
Error in glmnet(alumni_train_x, alumni_train_y, alpha = 0) :
number of observations in y (329870) not equal to the number of rows of
x (294648)
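One likely cause (an assumption on my part, since the data are not shown): model.matrix() silently drops rows that contain missing values, while the y vector is built from the full data frame, so x ends up with fewer rows than y. A sketch that keeps them aligned by removing incomplete rows first:
# drop incomplete rows once, then build x and y from the same data frame
alumni_train_cc <- alumni_train[complete.cases(alumni_train), ]
alumni_train_x <- model.matrix(Id.Number ~ ., alumni_train_cc)[, -1]
alumni_train_y <- log(alumni_train_cc$Id.Number)
stopifnot(nrow(alumni_train_x) == length(alumni_train_y))
alumni_ridge <- glmnet(alumni_train_x, alumni_train_y, alpha = 0)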
I am trying to use the randomForest function in R. For my analysis, I have a dataset with 151 observations. I used a 70/30 split to get 105 observations in training and 46 in test.
I'm using the subset argument to indicate the training observations. However, when I look at rf$predicted, I see that the model used the entire dataset (151 observations), not just the training dataset.
Also, when I use predict() with the test data, the model is predicting on 150 observations, but the test set has only 46 observations.
Can you please tell me what I may be doing wrong? I want to fit the model using the training dataset only and predict on the test dataset only. Thank you in advance!
Data is available here:
https://archive.ics.uci.edu/ml/datasets/teaching+assistant+evaluation
https://archive.ics.uci.edu/ml/machine-learning-databases/tae/
Code:
library("randomForest")
library(caTools)
# Importing tae.csv
setwd("C:\\Users\\Saulat Majid\\Documents\\MSDataAnalytics\\DSU\\10 STAT 702 Modern Applied Statistics II - Saunders\\HW5")
tae <- read.table(file = "tae.csv", header = FALSE, sep = ",")
head(tae)
colnames(tae) <- c("N_Speaker", "Instructor", "Course", "Summer", "C_Size", "Class")
# Numerical Summary of tae dataset
head(tae)
str(tae)
summary(tae)
# Converted categorical variables into factor
tae$Class <- as.factor(tae$Class)
tae$N_Speaker <- as.factor(tae$N_Speaker)
tae$Instructor <- as.factor(tae$Instructor)
tae$Summer <- as.factor(tae$Summer)
tae$Course <- as.factor(tae$Course)
str(tae)
# Splitting data into train and test
tae.Split <- sample.split(tae$Class, SplitRatio = 0.7)
table(tae.Split)
tae.train <- tae[tae.Split,]
tae.test <- tae[!tae.Split,]
dim(tae)
dim(tae.train)
dim(tae.test)
rf <- randomForest(Class ~ N_Speaker + Summer + C_Size, data = tae, subset = tae.Split)
rf$predicted
predict(object = rf,
newdata = tae[-tae.Split,],
type = "response")
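For comparison, a minimal sketch of fitting on the training rows only and predicting on the held-out rows, reusing the tae.train and tae.test objects created above instead of subset = and a negated logical vector:
# fit on the training data frame and predict on the test data frame
rf.train <- randomForest(Class ~ N_Speaker + Summer + C_Size, data = tae.train)
rf.train$predicted                                        # OOB predictions, one per training row
predict(rf.train, newdata = tae.test, type = "response")  # one prediction per test row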
How can I use dummy vars in caret without destroying my target variable?
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
dummies <- dummyVars( Purchase ~ ., data = data)
data2 <- predict(dummies, newdata = data)
split_factor = 0.5
n_samples = nrow(data2)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
train <- data2[train_idx, ]
test <- data2[-train_idx, ]
modelFit<- train(Purchase~ ., method='lda',preProcess=c('scale', 'center'), data=train)
will fail, as the Purchase variable is missing. If I instead convert it beforehand with data$Purchase <- ifelse(data$Purchase == "CH", 1, 0), caret complains that this is no longer a classification but a regression problem.
The example code seems to have a few issues, indicated in the comments below. To answer your questions:
The result of ifelse is an integer vector, not a factor, so the train function defaults to regression.
Passing the dummyVars output directly is done with train(x = ..., y = ..., ...) instead of a formula.
To avoid these problems, check the class of your objects carefully.
Be aware that the preProcess option in train() applies the preprocessing to all numeric variables, including the dummies. Option 2 below avoids this by standardizing the data before calling train().
set.seed(5)
data <- ISLR::OJ
data<-na.omit(data)
# Make sure that all variables that should be a factor are defined as such
newFactorIndex <- c("StoreID","SpecialCH","SpecialMM","STORE")
data[, newFactorIndex] <- lapply(data[,newFactorIndex], factor)
library(caret)
# See help for dummyVars. The function does not take a dependent variable and predict will give an error
# I don't include the target variable here, so predicting dummies on new data will drop unknown columns
# including the target variable
dummies <- dummyVars(~., data = data[,-1])
# I don't change the data yet to apply standardization to the numeric variables,
# before turning the categorical variables into dummies
split_factor = 0.5
n_samples = nrow(data)
train_idx <- sample(seq_len(n_samples), size = floor(split_factor * n_samples))
# Option 1 (as asked): Specify independent and dependent variables separately
# Note that dummy variables will be standardized by preProcess as per the original code
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
data2 <- data.frame(predict(dummies, newdata = data))
modelFit<- train(y = data[train_idx, "Purchase"], x = data2[train_idx,], method='lda',preProcess=c('scale', 'center'))
# Option 2: Append dependent variable to the independent variables (needs to be a data frame to allow factor and numeric)
# Note that I also move the preprocessing away from train() to
# avoid standardizing the dummy variables
train <- data[train_idx, ]
test <- data[-train_idx, ]
preprocessor <- preProcess(train[!sapply(train, is.factor)], method = c('center',"scale"))
train <- predict(preprocessor, train)
test <- predict(preprocessor, test)
# Turn the categorical variables into (unstandardized) dummies
# The output of predict is a matrix, change it to data frame
train <- data.frame(predict(dummies, newdata = train))
test <- data.frame(predict(dummies, newdata = test))
# Reattach the target variable to the training data that has been
# dropped by predict(dummies,...)
train$Purchase <- data$Purchase[train_idx]
modelFit<- train(Purchase ~., data = train, method='lda')
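For completeness, the held-out half prepared above can then be scored in the usual way; a small sketch, comparing against the original factor labels:
# predict on the preprocessed, dummy-encoded test data and compare
# with the true Purchase labels of the held-out rows
test_pred <- predict(modelFit, newdata = test)
confusionMatrix(test_pred, data$Purchase[-train_idx])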