Confusion Matrix in Logistic Regression in R

The confusion matrix created for my logistic regression model only has values in the Predicted = FALSE column. Even after adjusting my threshold, the matrix barely changes. What is wrong, and how do I adjust the threshold? Below is the code for the training set and the result. "retain" is my dependent variable (1 = retained, 0 = not retained), and all the independent variables are continuous. I have 170K records overall in the dataset (df). This matrix says the model predicts that no one retained, which is odd, because in reality 45% retained.
model_1 <- glm(retain~ age_2010+cnt_total_funds+sum_MS_2010+tenure_2010, data=df, family="binomial")
res <- predict(model_1, training, retain="response")
(table(ActualValue=training$retain, PredictedValue=res>0.05))
            PredictedValue
ActualValue FALSE
          0 96006
          1 43676

You made a mistake inside the predict() function: you want the type argument (retain does not exist for this function).
I use sample data to show you a working example.
In your example, change retain="response" to type="response".
aa <- airquality
aa$retain <- aa$Ozone > 50
gg = glm(retain ~ Solar.R + Month, data = aa, family = "binomial")
range(predict(gg, aa, type = "response"), na.rm = TRUE)
#> [1] 0.05918388 0.48769632
Created on 2021-06-18 by the reprex package (v2.0.0)
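Applying the fix to the original code, a minimal sketch (using the asker's model_1, training, and retain): the typo meant predict() fell back to its default type = "link" and returned log-odds, so comparing them to 0.05 mixed scales. With type = "response", res holds probabilities between 0 and 1 and the cutoff behaves as expected.
# probabilities on the response scale (type, not retain)
res <- predict(model_1, training, type = "response")
# confusion matrix at the conventional 0.5 cutoff;
# lowering the cutoff predicts TRUE more often
table(ActualValue = training$retain, PredictedValue = res > 0.5)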

Related

Predict on test data using the plm package in R, and calculate RMSE for the test data

I built a model using the plm package. The sample dataset is here.
I am trying to predict on the test data and calculate metrics.
# Import package
library(plm)
library(tidyverse)
library(prediction)
library(nlme)
# Import data
df <- read_csv('Panel data sample.csv')
# Convert author to character
df$Author <- as.character(df$Author)
# Split data into train and test
df_train <- df %>% filter(Year != 2020) # 2017, 2018, 2019
df_test <- df %>% filter(Year == 2020) # 2020
# Convert data
panel_df_train <- pdata.frame(df_train, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
panel_df_test <- pdata.frame(df_test, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
# Create the first model
plmFit1 <- plm(Score ~ Articles, data = panel_df_train)
# Print
summary(plmFit1)
# Get the RMSE for train data
sqrt(mean(plmFit1$residuals^2))
# Get the MSE for train data
mean(plmFit1$residuals^2)
Now I am trying to calculate metrics for the test data.
First, I tried prediction() from the prediction package, which has a method for plm.
predictions <- prediction(plmFit1, panel_df_test)
Got an error:
Error in crossprod(beta, t(X)) : non-conformable arguments
I read several related questions. One of them suggested
fitted <- as.numeric(plmFit1$model[[1]] - plmFit1$residuals)
but that gives me a different number of values than my train or test rows.
Regarding out-of-sample prediction with fixed-effects models, it is not clear how data relating to fixed effects not in the original model are to be treated, e.g., data for an individual not contained in the original data set the model was estimated on. (This is a methodological question rather than a programming question.)
Version 2.6-2 of plm allows predict() for fixed-effects models, both with the original data and with out-of-sample data (see ?predict.plm).
Below is an example with 10 firms used for model estimation, where the data used for prediction contain a firm not in the original data set (besides that firm, there are also years not in the original model object, but these are irrelevant here because it is a one-way individual model). It is unclear what the fixed effect of that out-of-sample firm would be, so by default no predicted value is given (an NA value). If the argument na.fill is set to TRUE, the (weighted) mean of the fixed effects contained in the original model object is used as a best guess.
library(plm)
data("Grunfeld", package = "plm")
# fit a fixed effect model
fit.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# generate 55 new observations of three firms used for prediction:
# * firm 1 with years 1935:1964 (has out-of-sample years 1955:1964),
# * firm 2 with years 1935:1949 (all in sample),
# * firm 11 with years 1935:1944 (firm 11 is out-of-sample)
set.seed(42L)
new.value2 <- runif(55, min = min(Grunfeld$value), max = max(Grunfeld$value))
new.capital2 <- runif(55, min = min(Grunfeld$capital), max = max(Grunfeld$capital))
newdata <- data.frame(firm = c(rep(1, 30), rep(2, 15), rep(11, 10)),
year = c(1935:(1935+29), 1935:(1935+14), 1935:(1935+9)),
value = new.value2, capital = new.capital2)
# make pdata.frame
newdata.p <- pdata.frame(newdata, index = c("firm", "year"))
## predict from fixed effect model with new data as pdata.frame
predict(fit.fe, newdata = newdata.p) # has NA values for the 11'th firm
## set na.fill = TRUE to have the weighted mean used for the fixed effects -> no NA values
predict(fit.fe, newdata = newdata.p, na.fill = TRUE)
NB: When you input a plain data.frame as newdata, it is not clear how the data relate to the individuals and time periods, which is why the weighted mean of fixed effects from the original model object is used for all observations in newdata and a warning is printed. For fixed-effects model prediction, it is reasonable to assume the user can provide information (via a pdata.frame) about how the prediction data relate to the individual and time dimensions of the panel.
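Returning to the asker's goal, test-set metrics follow directly once predictions are available. A minimal sketch, assuming plmFit1 and panel_df_test from the question, plm >= 2.6-2, and Score as the outcome column:
# predict on the test pdata.frame; na.fill = TRUE guesses the fixed
# effect for any author not present in the training data
preds <- predict(plmFit1, newdata = panel_df_test, na.fill = TRUE)
# RMSE over rows where both prediction and actual are available
ok <- !is.na(preds) & !is.na(panel_df_test$Score)
sqrt(mean((panel_df_test$Score[ok] - preds[ok])^2))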

What is the correct way to use weights in a logistic regression in R?

My data includes survey data on car buyers. It has a weight column that I used in SPSS to get sample sizes; the weights are affected by demographic factors and vehicle sales. Now I am trying to put together a logistic regression model for a car segment which includes a few vehicles. I want to use the weight column in the logistic regression model, and I tried to do so via the weights argument of glm(). But the results are horrific: deviances are too high and McFadden's R-squared is too low. My dependent variable is binary, and the independent variables are on a 1-to-5 scale. The weight column is numerical, ranging from 32 to 197. Could that be a reason the results are poor? Do I need the values in the weight column to be below 1?
The format of the input file to R is:
WGT  output  I1  I2  I3  I4  I5
 67       1   1   3   1   5   4
with I1, I2, I3, etc. being the independent variables.
logr<-glm(output~1,data=data1,weights=WGT,family="binomial")
logrstep <- step(logr, direction = "both", scope = formula(data1))
logr1<-glm(output~ (formula from final iteration),weights = WGT,data=data1,family="binomial")
hl <- hoslem.test(data1$output,fitted(logr1),g=10)
I want a logistic regression model with better accuracy, and to gain a better understanding of how to use weights with logistic regression.
I would check out the survey package. This will allow you to specify weights for the survey design using the svydesign function. Additionally, you can use the svyglm function to perform your weighted logistic regression. See http://r-survey.r-forge.r-project.org/survey/
Something like the following assuming your data is in a dataframe called df
my_svy <- svydesign(ids = ~1, data = df, weights = ~WGT)
Then you can do the following:
my_fit <- svyglm(output ~ 1, design = my_svy, family = "binomial")
For a full reprex, check out the example below.
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
As for the stepwise regression, it typically isn't a great methodology from a statistical point of view. It inflates R² and causes other issues for inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).

Use a Cox model to estimate survival

I first fit a Cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
and when I tried to predict the next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results, and the warning is
"'newdata' had 1 row but variables found have 20 rows"
FYI, there are 20 events and the results contain 20 survival curves.
The data frame given as the basis for a prediction must have the same column names as those on the RHS of the model formula. I don't think yours qualifies unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the data frame.
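Putting the pieces together, the prediction step might then look like this (a sketch using the asker's objects; the point is that covs1 now carries the MTT and ADC column names that survfit() can match against the formula):
# one-row newdata -> one predicted survival curve instead of 20
summary(survfit(model.1, newdata = covs1))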

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example of the code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the 2-variate regression model trained on 5 samples, I want to make a prediction for the test data point whose first variate is 3 and second variate is 5. But I get a warning from the above code saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct the above code? The code below works fine, where I give the variables separately to the model formula. But since I will have hundreds of variates, I have to give them in a matrix, as it would be unfeasible to append hundreds of columns using the + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to newdata data.frame column names is to put your input data into a data.frame as well. Try this:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame so that it is named the same way as the newdata data.frame. We use the . in the formula to mean "all other columns", so we don't have to specify them explicitly.
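For reference (a quick check, not part of the original answer), data.frame() auto-names the matrix columns, which is why the model frame and newdata line up:
# both indata and the newdata data.frame get columns named
# predictors.1 and predictors.2, so predict() can match them
names(data.frame(predictors = matrix(c(3, 5), nrow = 1)))
#> [1] "predictors.1" "predictors.2"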
You need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict( mod1 , data.frame(predictor1=3,predictor2=5))

Is there a predict function for plm in R?

I have a small N large T panel which I am estimating via plm::plm (panel linear regression model), with fixed effects.
Is there any way to get predicted values for a new dataset? (I want to
estimate parameters on a subset of my sample, and then use these to
calculate model-implied values for the whole sample).
There are (at least) two methods in the package to produce estimates from plm objects:
-- fixef.plm: Extract the Fixed Effects
-- pmodel.response: A function to extract the model.response
It appears to me that the author(s) are not interested in providing estimates for the "random effects". It may be a matter of "if you don't know how to do it on your own, then we don't want to give you a sharp knife to cut yourself too deeply."
I wrote a function called predict.out.plm that can create predictions for the original data and for a manipulated data set (with the same column names).
predict.out.plm calculates (a) the predicted (fitted) outcome of the transformed data and (b) the corresponding outcome in levels. The function works for first-difference (FD) and fixed-effects (FE) estimations using plm: for FD it creates the differenced outcome over time, and for FE it creates the time-demeaned outcome.
The function is largely untested and probably only works with strongly balanced data frames.
Any suggestions and corrections are very welcome. Help developing a small R package would be much appreciated.
The function predict.out.plm
predict.out.plm <- function(
  estimate,
  formula,
  data,
  model = "fd",
  pname = "y",
  pindex = NULL,
  levelconstr = T
){
  # get index of panel data
  if (is.null(pindex) && class(data)[1] == "pdata.frame") {
    pindex <- names(attributes(data)$index)
  } else {
    pindex <- names(data)[1:2]
  }
  if (class(data)[1] != "pdata.frame") {
    data <- pdata.frame(data)
  }
  # model frame
  mf <- model.frame(formula, data = data)
  # model matrix - transformed data
  mn <- model.matrix(formula, mf, model)
  # define variable names
  y.t.hat <- paste0(pname, ".t.hat")
  y.l.hat <- paste0(pname, ".l.hat")
  y.l <- names(mf)[1]
  # transformed data of explanatory variables;
  # exclude variables that were dropped in estimation
  n <- names(estimate$aliased[estimate$aliased == F])
  i <- match(n, colnames(mn))
  X <- mn[, i]
  # predict transformed outcome with X * beta
  # (crossprod(t(X), beta) is equivalent to X %*% beta)
  p <- crossprod(t(X), coef(estimate))
  colnames(p) <- y.t.hat
  if (levelconstr == T) {
    # old dataset with original outcome
    od <- data.frame(
      attributes(mf)$index,
      data.frame(mf)[, 1]
    )
    rownames(od) <- rownames(mf) # preserve row names from model.frame
    names(od)[3] <- y.l
    # merge old dataset with prediction
    nd <- merge(
      od,
      p,
      by = "row.names",
      all.x = T,
      sort = F
    )
    nd$Row.names <- as.integer(nd$Row.names)
    nd <- nd[order(nd$Row.names), ]
    # construct predicted level outcome for FD estimations
    if (model == "fd") {
      # first observation from real data
      i <- which(is.na(nd[, y.t.hat]))
      nd[i, y.l.hat] <- NA
      nd[i, y.l.hat] <- nd[i, y.l]
      # fill values over all years
      ylist <- unique(nd[, pindex[2]])[-1]
      ylist <- as.integer(as.character(ylist))
      for (y in ylist) {
        nd[nd[, pindex[2]] == y, y.l.hat] <-
          nd[nd[, pindex[2]] == (y - 1), y.l.hat] +
          nd[nd[, pindex[2]] == y, y.t.hat]
      }
    }
    if (model == "within") {
      # group means of outcome
      gm <- aggregate(nd[, pname], list(nd[, pindex[1]]), mean)
      gl <- aggregate(nd[, pname], list(nd[, pindex[1]]), length)
      nd <- cbind(nd, groupmeans = rep(gm$x, gl$x))
      # predicted values + group means
      nd[, y.l.hat] <- nd[, y.t.hat] + nd[, "groupmeans"]
    }
    if (model != "fd" && model != "within") {
      stop("function works only for FD and FE estimations")
    }
  }
  # results
  results <- p
  if (levelconstr == T) {
    results <- list(results, nd)
    names(results) <- c("p", "df")
  }
  return(results)
}
Testing the function:
## packages
library(plm)
## test data frame
# data structure
N <- 4
G <- 2
M <- 5
d <- data.frame(
  id = rep(1:N, each = M),
  year = rep(1:M, N) + 2000,
  gid = rep(1:G, each = M * 2)
)
# explanatory variable
d[, "x"] <- runif(N * M, 0, 1)
# outcome
d[, "y"] <- 2 * d[, "x"] + runif(N * M, 0, 1)
# panel data frame
d <- pdata.frame(d, index = c("id", "year"))
## new data frame for out-of-sample prediction
dn <- d
dn$x <- rnorm(nrow(dn), 0, 2)
## estimate
# formula
f <- pFormula(y ~ x + factor(year))
# fixed-effects or first-difference estimation
e <- plm(f, data = d, model = "within", index = c("id", "year"))
e <- plm(f, data = d, model = "fd", index = c("id", "year"))
summary(e)
## fitted values of estimation
# transformed outcome prediction
predict(e)
c(pmodel.response(e) - residuals(e))
predict.out.plm(e, f, d, "fd")$p
# "level" outcome prediction
predict.out.plm(e, f, d, "fd")$df$y.l.hat
# both
predict.out.plm(e, f, d, "fd")
## out-of-sample prediction
predict(e, newdata = d)
predict(e, newdata = dn)
# Error in crossprod(beta, t(X)) : non-conformable arguments
# If plm omits variables specified in the formula (e.g. one year in factor(year)),
# it tries to multiply two matrices with a different number of columns than regressors.
# The new function avoids this and therefore can do out-of-sample predictions.
predict.out.plm(e, f, dn, "fd")
plm now has a predict.plm() function, although it is not documented/exported.
Note also that predict works on the transformed model (i.e., after the within/between/fd transformation), not the original one. I speculate that the reason is that prediction is harder to define in a panel-data framework. Indeed, you need to consider whether you are predicting:
-- new time periods for an existing individual, with individual fixed effects? Then you can add the prediction to the existing individual mean.
-- new time periods for a new individual? Then you need to figure out which individual mean to use.
-- the same is even more complicated if you use a random-effects model, as the effects are not easily derived.
In the code below, I illustrate how to use fitted values on the existing sample:
library(plm)
#> Loading required package: Formula
library(tidyverse)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))
## produce a dataset of predictions, added to the group means
Produc_means <- Produc %>%
  mutate(y = log(gsp)) %>%
  group_by(state) %>%
  transmute(y_mean = mean(y),
            y = y,
            year = year) %>%
  ungroup() %>%
  mutate(y_pred = predict(zz) + y_mean) %>%
  select(-y_mean)
## plot it
Produc_means %>%
  gather(type, value, y, y_pred) %>%
  filter(state %in% toupper(state.name[1:5])) %>%
  ggplot(aes(x = year, y = value, linetype = type)) +
  geom_line() +
  facet_wrap(~state) +
  ggtitle("Visualising in-sample prediction, for 5 states")
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
Created on 2018-11-20 by the reprex package (v0.2.1)
It looks like there is a new package for in-sample predictions for a variety of models, including plm:
https://cran.r-project.org/web/packages/prediction/prediction.pdf
You can calculate the residuals via residuals(reg_name). From there, subtracting them from your response variable gives the predicted values.
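A minimal sketch of that approach, assuming a fitted plm model named fit: pmodel.response() returns the (transformed) response, so subtracting the residuals recovers the fitted values on the transformed scale.
# fitted values = response minus residuals (transformed scale)
fitted_vals <- pmodel.response(fit) - residuals(fit)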
