Stepwise Regression with ROC - r

I am learning data science with R on DataCamp. In one exercise, I have to build a stepwise regression model. Even though I create the stepwise model successfully, the roc() function doesn't accept the response and gives an error like: "'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead"
I want to learn how to handle this problem, so I have included my code below.
# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")
# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")
# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")
# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(step_prob, donors$donated)
plot(ROC, col = "red")
auc(ROC)

I changed the order of the roc() function's arguments, because roc() expects the response first and the predictor second, and the error was resolved.
library(pROC)
ROC <- roc(donors$donated, step_prob)
plot(ROC, col = "red")
auc(ROC)
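A slightly more defensive version (a sketch, not part of the original exercise) passes the arguments by name, so the order no longer matters; pROC's roc() takes the observed outcome as response and the fitted probabilities as predictor:
library(pROC)
# Named arguments make the call order-independent
ROC <- roc(response = donors$donated, predictor = step_prob)
plot(ROC, col = "red")
auc(ROC)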

Related

R GLM Regression Model - Graph outcome incorrect & Error Rate code needed

For my homework, I am working with a dataset titled Default. I split my data into training and test sets, and ran a logistic regression for the relationship between default1 and the other three predictors (income (continuous), balance (continuous), student (0/1)).
I am supposed to plot the regression model, but it keeps showing a straight horizontal line on the graph and I don't think that's correct.
How can I graph multiple predictors with a singular binary outcome using my Default_train_logistic glm?
Also, how can I obtain those coefficients and error rates of the model?
TIA!
library(dplyr)    # for filter(), select(), and %>% used below
library(ggplot2)  # for ggplot() used below
set.seed(1234)
Default$subsample <- runif(nrow(Default))
Default$test <- ifelse(Default$subsample < 0.80, "train", "test")
Default_train <- filter(Default, test == "train")
Default_test <- filter(Default, test == "test")
### Q1 Part B: Construct a logistic regression to predict if an individual will default based on all of the provided predictors, and visualize your final predicted model.
# Immediately after loading the data, I created default1 to use default as a numerical binary variable for logistic regression.
Default_train_logistic <- glm(default1 ~ ., data = Default_train %>% select(-test), family = "binomial")
summary(Default_train_logistic)
plot(Default_train_logistic)
G1 <- ggplot(Default_train_logistic, aes(balance + income + student1, default1)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE)
print(G1)
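One way to approach this (a sketch, not part of the original post; the choice of balance as the x-axis variable and the 0.5 cutoff are assumptions) is to plot the observed outcome against a single predictor with a logistic smooth, and to compute a misclassification rate on the held-out test set:
# Observed outcome vs. one predictor; geom_smooth refits a one-predictor logistic curve for display
ggplot(Default_train, aes(x = balance, y = default1)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE)
# Coefficient table of the fitted model
coef(summary(Default_train_logistic))
# Test-set error rate with a 0.5 probability cutoff (an arbitrary choice)
test_prob <- predict(Default_train_logistic, newdata = Default_test, type = "response")
test_pred <- ifelse(test_prob > 0.5, 1, 0)
mean(test_pred != Default_test$default1)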

plotting semivariograms with non-nlme package models

I am trying to plot a semivariogram of my model residuals for a generalised mixed effect model in R. Doing this for a mixed effect model with a normal distribution is straightforward with the nlme package; I use the quakes dataset as an example.
library(nlme)
data(quakes)
head(quakes)
model1 <- lme(mag ~ depth, random = ~ 1 | stations, data = quakes)
summary(model1)
semivario <- Variogram(model1, form = ~ long + lat, resType = "normalized")
plot(semivario, smooth = TRUE)
I want to create a model with a non-normal distribution, which I can't do with nlme, so I have tried glmer and glmmPQL. I have turned 'mag' into a binomial variable and then tried to reapply the Variogram function to these models.
quakes$thresh <- ifelse(quakes$mag > 5, 0, 1)  # compare against the number 5, not the string "5"
library(MASS)
model2 <- glmmPQL(as.factor(thresh) ~ depth, random = ~ 1 | stations, family = binomial, data = quakes)
summary(model2)
semivario <- Variogram(model2, form = ~ long + lat, resType = "normalized")
plot(semivario, smooth = TRUE)
library(lme4)
model3 <- glmer(as.factor(thresh) ~ depth + (1 | stations), data = quakes, family = binomial)
summary(model3)
semivario <- Variogram(model3, form = ~ long + lat, resType = "normalized")
plot(semivario, smooth = TRUE)
Neither of these appears to work for plotting the variogram: for the glmmPQL model, Variogram() says that lat and long aren't found, and for the glmer model it says the distance isn't specified.
How can I plot a semivariogram for these models? Is the Variogram function from the nlme package unusable for them, and if so, what alternatives can I use?
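One possible workaround (a sketch, not from the original post; it assumes Pearson residuals and the gstat package are acceptable for your purposes) is to compute an empirical semivariogram of the model residuals directly, instead of relying on nlme's Variogram():
library(lme4)
library(gstat)
# Pair the glmer residuals with their coordinates
vg_data <- data.frame(resid = residuals(model3, type = "pearson"),
                      long = quakes$long,
                      lat = quakes$lat)
# Empirical semivariogram of the residuals (coordinates treated as planar)
semivario <- variogram(resid ~ 1, locations = ~ long + lat, data = vg_data)
plot(semivario)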

Getting estimated means after multiple imputation using the mitml, nlme & geepack R packages

I'm running multilevel multiple imputation through the package mitml (using the panImpute() function) and am fitting linear mixed models and marginal models through the packages nlme and geepack and the with() function from mitml.
I can get the estimates, p-values etc. for those through the testEstimates() function, but I'm also looking to get estimated means across my model predictors. I've tried the emmeans package, which I normally use for getting estimated means when running nlme & geepack without multiple imputation, but emmeans tells me "Can't handle an object of class “mitml.result”".
I'm wondering is there a way to get pooled estimated means from the multiple imputation analyses I've run?
The data frames I'm analyzing are longitudinal/repeated measures and in long format. In the linear mixed model I want to get the estimated means for a 2x2 interaction effect and in the marginal model I'm trying to get estimated means for the 6 levels of 'time' variable. The outcome in all models is continuous.
Here's my code
# mixed model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100, group = "treatment")
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, lme(Dep ~ time*treatment, random = ~ 1|id, method = "ML", na.action = na.exclude, control = list(opt = "optim")))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
# marginal model
fml <- Dep + time ~ 1 + (1|id)
imp <- panImpute(data=Data, formula=fml, n.burn=50000, n.iter=5000, m=100)
summary(imp)
plot(imp, trace="all")
implist <- mitmlComplete(imp, "all", force.list = TRUE)
fit <- with(implist, geeglm(Dep ~ time, id = id, corstr = "unstructured"))
testEstimates(fit, var.comp = TRUE)
confint.mitml.testEstimates(testEstimates(fit, var.comp = TRUE))
Is there a way to get pooled estimated means from the multiple imputation analyses I've run?
This is not a reprex without Data, so I can't verify this works for you. But emmeans provides support for mira-class (lists of) models in the mice package. So if you fit your model in with() using the mids rather than mitml.list class object, then you can use that to obtain marginal means of your outcome (and any contrasts or pairwise comparisons afterward).
Using example data found here, which uncomfortably loads an external workspace:
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)
## imputation
library(mice)
ini <- mice(popNCR, maxit = 0)
meth <- ini$meth
meth[c(3, 5, 6, 7)] <- "norm"
pred <- ini$pred
pred[, "pupil"] <- 0
imp <- mice(popNCR, meth = meth, pred = pred, print = FALSE)
## analysis
library(lme4) # fit multilevel model
mod <- with(imp, lmer(popular ~ sex + (1|class)))
library(emmeans) # obtain pooled estimates of means
(em <- emmeans(mod, specs = ~ sex) )
pairs(em) # test comparison
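For the 2x2 interaction mentioned in the question, the same pattern should carry over; this is only a sketch using the question's own variable names (Dep, time, treatment, id), which are not in the popNCR example data above:
# Hypothetical: assumes an imputation object whose data contain Dep, time, treatment and id
mod_int <- with(imp, lmer(Dep ~ time * treatment + (1 | id)))
em_int <- emmeans(mod_int, specs = ~ treatment * time)
pairs(em_int, by = "time")  # pairwise treatment comparisons at each time point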

predict function with lasso regression

I am trying to implement lasso regression for my sales prediction problem. I am using the glmnet package and the cv.glmnet function to train the model.
library(glmnet)
set.seed(123)
model = cv.glmnet(x = as.matrix(train[, -which(names(train) %in% "Sales")]),
                  y = train$Sales,
                  alpha = 1,
                  lambda = 10^seq(4, -1, -0.1))
best_lambda = model$lambda.min
lasso_predictions_valid <- predict(model, s = best_lambda, type = "coefficients")
After reading a few articles about implementing lasso regression, I still don't know how to supply the test data that I want to predict on. There is also a newx argument to the predict function that I don't understand. In most regression functions there is a newdata or data argument that takes the test data.
I think there is an error in your lasso_predictions_valid: you shouldn't put valid$Sales as your newx, as I believe this is the actual sales column.
Once you have created the model with the training set, for newx you need to pass a matrix of the x values that you want to make predictions on; in this case that will be your validation set.
Looking at your example code above, I think your predict line should be something like:
lasso_predictions_valid <- predict(model, s = best_lambda,
                                   newx = as.matrix(valid[, -which(names(valid) %in% "Sales")]),
                                   type = "response")  # "response" returns predictions; "coefficients" would ignore newx
Then you should run your RMSE() line:
RMSE(lasso_predictions_valid, valid$Sales)
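If you also want to inspect the coefficients at the chosen lambda, that is a separate call (standard glmnet usage, shown here as a small sketch):
# Sparse coefficient vector at lambda.min; "." entries were shrunk to zero
coef(model, s = best_lambda)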

How to calculate R Squared value for Lasso regression using glmnet in R

I am performing lasso regression in R using the glmnet package:
fit.lasso <- glmnet(x, y)
plot(fit.lasso, xvar = "lambda", label = TRUE)
Then using cross-validation:
cv.lasso <- cv.glmnet(x, y)
plot(cv.lasso)
One tutorial (last slide) suggests the following for R^2:
R_Squared = 1 - cv.lasso$cvm/var(y)
But it did not work for me.
I want to understand the model's efficiency/performance in fitting the data, as we usually get R^2 and adjusted R^2 when fitting a model with lm() in R.
If you are using the "gaussian" family, you can access the R-squared value through the dev.ratio component: fit.lasso$dev.ratio for a glmnet fit, or cv.lasso$glmnet.fit$dev.ratio for a cv.glmnet fit.
I use the example data shipped with glmnet to demonstrate it.
library(glmnet)
# load data
data(BinomialExample)
head(x)
head(y)
# For cross-validation
cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
rsq = 1 - cvfit$cvm / var(y)
plot(cvfit$lambda, rsq)
First fit the lasso model with the selected lambda:
...
lasso.model <- glmnet(x = X, y = Y, family = "binomial", alpha = 1, lambda = cv.model$lambda.min)
Then you can get the pseudo-R2 from the fitted model:
lasso.model$dev.ratio
This value gives the fraction of the null deviance explained by the model (1 - residual deviance / null deviance).
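For a gaussian fit, dev.ratio corresponds directly to R^2. A minimal self-contained sketch (the simulated x and y below are only for illustration):
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)   # simulated predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)   # simulated response
cvfit <- cv.glmnet(x, y)                # gaussian family by default
i <- which(cvfit$lambda == cvfit$lambda.min)
cvfit$glmnet.fit$dev.ratio[i]           # fraction of (null) deviance explained = R^2 at lambda.min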
