R logistic regression area under curve - r

I am performing logistic regression using this page. My code is as below.
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre, data = mydata, family = "binomial")
summary(mylogit)
prob=predict(mylogit,type=c("response"))
mydata$prob=prob
After running this code, the mydata data frame contains the columns 'admit' and 'prob'.
Shouldn't those two columns be sufficient to get the ROC curve?
How can I get the ROC curve?
Secondly, by looking at mydata, it seems that the model is predicting the probability of admit=1.
Is that correct?
How can I find out which particular event the model is predicting?
Thanks
UPDATE:
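It seems that the three commands below are very useful (g is a pROC roc object). They provide the cut-off that pROC considers "best" (by default the one maximizing Youden's J statistic, sensitivity + specificity - 1, which is not necessarily the one with maximum accuracy) and then help to evaluate the predictions at that cut-off.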
coords(g, "best")
mydata$prediction=ifelse(prob>=0.3126844,1,0)
confusionMatrix(factor(mydata$prediction), factor(mydata$admit)) # from caret; both arguments passed as factors
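On the question of which event is being predicted: for a binomial glm, a 0/1 numeric response is modelled as the probability of 1, and for a factor response the first level is treated as "failure" and every other level as "success". The following is a small sketch on simulated data to check this behaviour; the variable names (y_fac, y_rev and so on) and the simulation settings are invented for the illustration.

```r
set.seed(7)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.8 * x))

# Numeric 0/1 response: fitted values are P(y = 1)
fit_num <- glm(y ~ x, family = binomial)

# Factor response with "no" as first level: fitted values are P(y == "yes")
y_fac <- factor(ifelse(y == 1, "yes", "no"), levels = c("no", "yes"))
fit_fac <- glm(y_fac ~ x, family = binomial)

# Same factor with the level order reversed: now "no" is the modelled event
y_rev <- factor(y_fac, levels = c("yes", "no"))
fit_rev <- glm(y_rev ~ x, family = binomial)

# Compare the three sets of fitted probabilities
same_as_numeric <- all.equal(unname(fitted(fit_num)), unname(fitted(fit_fac)), tolerance = 1e-6)
complement_when_reversed <- all.equal(unname(fitted(fit_num)), 1 - unname(fitted(fit_rev)), tolerance = 1e-6)
print(same_as_numeric)
print(complement_when_reversed)
```

So checking levels() of a factor response (or the coding of a numeric one) tells you which event the predicted probability refers to.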

The ROC curve compares the ranking of the predictions with the true outcomes. Therefore, you could evaluate the ROC curve with the pROC package as follows:
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre, data = mydata, family = "binomial")
summary(mylogit)
prob=predict(mylogit,type=c("response"))
mydata$prob=prob
library(pROC)
g <- roc(admit ~ prob, data = mydata)
plot(g)
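Since the title asks about the area under the curve: the AUC equals the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one, so it depends only on ranks. Below is a base R sketch of that idea on simulated data; the data frame sim, the helper auc_rank and the simulation coefficients are assumptions made for this example and are not taken from the UCLA data set.

```r
set.seed(1)
n <- 400
# Simulated stand-in for the admissions data (hypothetical coefficients)
gre <- rnorm(n, mean = 580, sd = 115)
admit <- rbinom(n, 1, plogis(-3 + 0.004 * gre))
sim <- data.frame(admit = admit, gre = gre)

fit <- glm(admit ~ gre, data = sim, family = "binomial")
prob <- predict(fit, type = "response")

# AUC from the rank-sum (Mann-Whitney) statistic; ties get average ranks
auc_rank <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc <- auc_rank(prob, sim$admit)
print(auc)
```

With the roc object g from the answer above, pROC's auc(g) reports the same kind of quantity, and ROCR offers performance(pred, measure = "auc") for its prediction objects.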

Another way to plot the ROC curve:
library(Deducer)
modelfit <- glm(formula=admit ~ gre + gpa, family=binomial(), data=mydata, na.action=na.omit)
rocplot(modelfit)

#Another way to plot ROC
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mylogit <- glm(admit ~ gre, data = mydata, family = "binomial")
summary(mylogit)
prob=predict(mylogit,type=c("response"))
library("ROCR")
pred <- prediction(prob, mydata$admit)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(7), main="ROC curve Admissions", xlab="1 - Specificity (false positive rate)",
ylab="Sensitivity (true positive rate)")
abline(0, 1) #add a 45 degree line
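To make the axes concrete: "tpr" is sensitivity and "fpr" is 1 - specificity, each evaluated as the cut-off sweeps from high to low. The following base R sketch builds those points by hand on simulated scores; the vectors label and score are made up for the illustration, and this is not how ROCR computes the curve internally.

```r
set.seed(42)
label <- rbinom(200, 1, 0.4)
score <- rnorm(200, mean = label)  # positives tend to score higher

# Sort cases from highest to lowest score; lowering the cut-off past each
# case classifies one more case as positive
ord <- order(score, decreasing = TRUE)
tpr <- c(0, cumsum(label[ord] == 1) / sum(label == 1))  # sensitivity
fpr <- c(0, cumsum(label[ord] == 0) / sum(label == 0))  # 1 - specificity

plot(fpr, tpr, type = "s",
     xlab = "False positive rate (1 - specificity)",
     ylab = "True positive rate (sensitivity)",
     main = "ROC curve built by hand")
abline(0, 1, lty = 2)  # reference line for a non-informative classifier
```

The curve starts at (0, 0), where nothing is classified as positive, and ends at (1, 1), where everything is.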

Related

Obtaining individual slopes from an lme4 object in R

I'm new to the lme4 package in R. In my example below, I was wondering if it might be possible to obtain the gender slopes (i.e., differences) for each dep after fitting my glmer model?
dat <- data.frame(dep = rep(LETTERS[1:6],each=2), gender = rep(c("Ma","Fe"),6),
admit=c(512,89,353,17,120,202,138,131,53,94,22,24),
reject=c(313,19,207,8,205,391,279,244,138,299,351,317))
lme4::glmer(cbind(admit,reject) ~ gender+dep + (gender|dep), data=dat, family=binomial)
In lme4 you can get the estimated slopes from ranef, but in your model you will need to sum the global and unit specific terms, as in the example below.
library(lme4)
dat <- data.frame(dep = rep(LETTERS[1:6],each=2), gender = rep(c("Ma","Fe"),6),
admit=c(512,89,353,17,120,202,138,131,53,94,22,24),
reject=c(313,19,207,8,205,391,279,244,138,299,351,317))
mod1 <- glmer(cbind(admit,reject) ~ gender+dep + (gender|dep), data=dat, family=binomial)
summary(mod1)
ran_gender <- ranef(mod1)$dep
fe_mod1 <- fixef(mod1)
slopes <- fe_mod1[[2]] + ran_gender[,2]
slopes

ROC curves comparing logistic regression and neural network predictions in R

I am trying to compare the prediction accuracy of a dataset using a logistic regression model and a neural network. While looking at the confusion matrices of the two methods, the ANN model gives a better output compared to the logistic regression model. However, while plotting the ROC curves for the two methods, it seems that the logistic regression model is better. I am wondering if there is something wrong with my code for the ROC curves.
For context, I am explaining my procedure. First, I divided the dataset into training and testing data.
data = read.csv("heart.csv", header=TRUE)
set.seed(300)
index = sample(seq_len(nrow(data)), size = samplesize) # For logistic
train <- data[index,]
test <- data[-index,]
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
scaled <- as.data.frame(lapply(data, normalize))
index = sample(seq_len(nrow(scaled)), size = samplesize) # For ANN
trainset <- scaled[index, ]
testset <- scaled[-index, ]
The response variable is "target" so I fit the following GLM :
glm.fit <- glm(target ~ ., data=train, family=binomial(link = "logit"),control = list(maxit = 50))
For the ANN, I used R's neuralnet package and did the following:
library(neuralnet)
nn <- neuralnet(target ~ ., data=trainset, hidden=c(3,2), act.fct = "logistic", err.fct = "sse", linear.output=FALSE, threshold=0.01)
For my ROC curves, I did the following:
For ANN:
prob = compute(nn, testset[, -ncol(testset)] )
prob.result <- prob$net.result
detach(package:neuralnet,unload = T)
library(ROCR)
nn.pred = prediction(prob.result, testset$target)
pref <- performance(nn.pred, "tpr", "fpr")
plot(pref)
And for logistic regression:
prob=predict(glm.fit,type=c("response"))
library(ROCR)
pred <- prediction(prob, test$target)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(7), main="ROC curve Admissions", xlab="Specificity",
ylab="Sensitivity")
I would just like some guidance in understanding why the plots seem to suggest that the logistic regression model is better when the confusion matrix suggests otherwise, and understand what I am doing wrong.
Thank you for any input.
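One thing that may be worth checking in the code above: predict(glm.fit, type = "response") is called without newdata, and in that case predict() returns fitted values for the rows the model was trained on, not predictions for test. Note also that index is re-sampled for the scaled data, so the two models are evaluated on different held-out rows. Here is a small sketch of the newdata point on simulated data; the data frame d and its columns are invented for the example.

```r
set.seed(300)
n <- 300
d <- data.frame(x = rnorm(n))
d$target <- rbinom(n, 1, plogis(1.5 * d$x))

index <- sample(seq_len(n), size = 200)
train <- d[index, ]
test <- d[-index, ]

fit <- glm(target ~ x, data = train, family = binomial(link = "logit"))

# Without newdata: one fitted value per TRAINING row
p_train <- predict(fit, type = "response")
# With newdata: one prediction per TEST row, which is what should be
# paired with test$target when building a ROC curve
p_test <- predict(fit, newdata = test, type = "response")

print(c(train_rows = nrow(train), fitted = length(p_train)))
print(c(test_rows = nrow(test), predicted = length(p_test)))
```

Using a single shared index for both models, and passing newdata for the logistic model, would put the two ROC curves on the same footing.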

Find R-square value of Weibull fit (Survival model) in R

I have a survival object (S) for which I am doing a weibull fit using the survreg function and weibull distribution in R.
S = Surv(data$ValueX, data$ValueY)
W = survreg(S ~ 1, data=data, dist="weibull")
How do I extract the R-square value of the Weibull fit which is essentially a linear line? Or is there a function to calculate the correlation coefficient value Rho?
Basically, I want to calculate the goodness of fit.
Look at pam.censor in the PAmeasures package, which produces an R^2-like statistic. Using the ovarian dataset from the survival package:
library(PAmeasures)
library(survival)
fit.s <- survreg(Surv(futime, fustat) ~ age, data = ovarian, dist="weibull" )
p <- predict(fit.s, type = "response")
with(ovarian, pam.censor(futime, p, fustat))
For the ovarian data with an age regressor we get a value of only 0.0915.
Another idea is that for a Weibull model with no covariates we have S(t) = exp(-(lambda * t)^p), so log(-log(S(t))) is linear in log(t); hence we could use the R squared of the corresponding regression to measure how well the data fit a Weibull model.
library(survival)
fit1 <- survfit(Surv(futime, fustat) ~ 1, data = ovarian)
sum1 <- summary(fit1, times = ovarian$futime)
fo <- log(-log(surv)) ~ log(time)
d <- as.data.frame(sum1[c("time", "surv")])
fit.lm <- lm(fo, d)
summary(fit.lm)$r.sq
plot(fo, d)
abline(fit.lm)
For the ovarian data without covariates the R^2 at 93% is high but the plot does suggest systematic departures from linearity so it may not really be Weibull.
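For reference, the linearity used in the regression above follows directly from the Weibull survival function, in the same lambda and p notation:

```latex
\begin{align*}
S(t) &= \exp\!\left(-(\lambda t)^{p}\right) \\
-\log S(t) &= (\lambda t)^{p} \\
\log\!\left(-\log S(t)\right) &= p \log \lambda + p \log t
\end{align*}
```

So in the regression of log(-log(surv)) on log(time), the slope estimates p and the intercept estimates p * log(lambda).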
Other
Not sure if this is of interest, but the eha package has the check.dist function, which can be used for a visual comparison of a parametric baseline hazard model to a Cox proportional hazards model. See the documentation as well as:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5233524/
Using the ovarian dataset from survival:
library(eha)
library(survival)
fit.c <- coxreg(Surv(futime, fustat) ~ age, data = ovarian)
fit.p <- phreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "weibull")
check.dist(fit.c, fit.p)
The survAUC package has three functions that provide R-squared-type statistics for Cox proportional hazards models (OXS, Nagelk and XO).

Example of Time Series Prediction using Neural Networks in R

Does anyone have a quick, short educational example of how to use neural networks (nnet in R) for prediction?
Here is an example, in R, of a time series
T = seq(0,20,length=200)
Y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
plot(T,Y,type="l")
Many thanks
David
I think you can use the caret package, and especially the train function. This function sets up a grid of tuning parameters for a number of classification and regression routines.
require(quantmod)
require(nnet)
require(caret)
T = seq(0,20,length=200)
y = 1 + 3*cos(4*T+2) +.2*T^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
dat <- dat[c(3:200),] #delete first 2 observations
#Fit model
model <- train(y ~ x1+x2 ,
dat,
method='nnet',
linout=TRUE,
trace = FALSE)
ps <- predict(model, dat)
#Examine results
plot(T,y,type="l",col = 2)
lines(T[-c(1:2)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
The solution proposed by @agstudy is useful, but in-sample fits are not a reliable guide to out-of-sample forecasting accuracy. The gold standard in forecasting accuracy measurement is to use a holdout sample. Remove the last 5 or 10 or 20 observations (depending on the length of the time series) from the training sample, fit your models to the rest of the data, use the fitted models to forecast the holdout sample, and simply compare accuracies on the holdout, using Mean Absolute Deviations (MAD) or weighted Mean Absolute Percentage Errors (wMAPEs).
So to do this you can change the code above in this way:
require(quantmod)
require(nnet)
require(caret)
t = seq(0,20,length=200)
y = 1 + 3*cos(4*t+2) +.2*t^2 + rnorm(200)
dat <- data.frame( y, x1=Lag(y,1), x2=Lag(y,2))
names(dat) <- c('y','x1','x2')
train_set <- dat[c(3:185),]
test_set <- dat[c(186:200),]
#Fit model
model <- train(y ~ x1+x2 ,
train_set,
method='nnet',
linout=TRUE,
trace = FALSE)
ps <- predict(model, test_set)
#Examine results
plot(t,y,type="l",col = 2)
lines(t[c(186:200)],ps, col=3)
legend(5, 70, c("y", "pred"), cex=1.5, fill=2:3)
This last line outputs the wMAPE of the forecasts from the model:
sum(abs(ps - test_set$y)) / sum(abs(test_set$y))
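The holdout idea can also be sketched in base R alone. In the example below a plain linear model on the two lags stands in for the neural network (this is a deliberate substitution to keep the sketch dependency-free, not the caret/nnet fit used above), embed() builds the lagged columns, and the last 15 rows are held out. As in the code above, the forecasts are one-step-ahead predictions that use the observed lagged values.

```r
set.seed(123)
t <- seq(0, 20, length = 200)
y <- 1 + 3 * cos(4 * t + 2) + 0.2 * t^2 + rnorm(200)

# embed(y, 3) returns rows of (y_t, y_{t-1}, y_{t-2})
lagged <- embed(y, 3)
dat <- data.frame(y = lagged[, 1], x1 = lagged[, 2], x2 = lagged[, 3])

n <- nrow(dat)
holdout <- (n - 14):n  # last 15 observations

# Stand-in model: linear regression on the two lags
fit <- lm(y ~ x1 + x2, data = dat[-holdout, ])
ps <- predict(fit, newdata = dat[holdout, ])

actual <- dat$y[holdout]
mad_err <- mean(abs(actual - ps))                # mean absolute deviation
wmape <- sum(abs(actual - ps)) / sum(abs(actual))  # weighted MAPE
print(c(MAD = mad_err, wMAPE = wmape))
```

Computing the same two numbers for each candidate model on the same holdout rows gives a like-for-like comparison.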

How to plot the survival curve generated by survreg (package survival of R)?

I’m trying to fit and plot a Weibull model to survival data. The data has just one covariate, cohort, which runs from 2006 to 2010. So, any ideas on what to add to the two lines of code that follow to plot the survival curve of the cohort of 2010?
library(survival)
s <- Surv(subSetCdm$dur,subSetCdm$event)
sWei <- survreg(s ~ cohort,dist='weibull',data=subSetCdm)
Accomplishing the same with the Cox PH model is rather straightforward, with the following lines. The problem is that survfit() doesn’t accept objects of type survreg.
sCox <- coxph(s ~ cohort,data=subSetCdm)
cohort <- factor(c(2010),levels=2006:2010)
sfCox <- survfit(sCox,newdata=data.frame(cohort))
plot(sfCox,col='green')
Using the data lung (from the survival package), here is what I'm trying to accomplish.
#create a Surv object
s <- with(lung,Surv(time,status))
#plot kaplan-meier estimate, per sex
fKM <- survfit(s ~ sex,data=lung)
plot(fKM)
#plot Cox PH survival curves, per sex
sCox <- coxph(s ~ as.factor(sex),data=lung)
lines(survfit(sCox,newdata=data.frame(sex=1)),col='green')
lines(survfit(sCox,newdata=data.frame(sex=2)),col='green')
#plot weibull survival curves, per sex, DOES NOT RUN
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
lines(survfit(sWei,newdata=data.frame(sex=1)),col='red')
lines(survfit(sWei,newdata=data.frame(sex=2)),col='red')
Hope this helps and I haven't made some misleading mistake:
copied from above:
#create a Surv object
s <- with(lung,Surv(time,status))
#plot kaplan-meier estimate, per sex
fKM <- survfit(s ~ sex,data=lung)
plot(fKM)
#plot Cox PH survival curves, per sex
sCox <- coxph(s ~ as.factor(sex),data=lung)
lines(survfit(sCox,newdata=data.frame(sex=1)),col='green')
lines(survfit(sCox,newdata=data.frame(sex=2)),col='green')
for Weibull, use predict, re the comment from Vincent:
#plot weibull survival curves, per sex,
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
lines(predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
lines(predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
The trick here was reversing the quantile orders for plotting vs predicting. There is likely a better way to do this, but it works here. Good luck!
An alternative option is to make use of the package flexsurv. This offers some additional functionality over the survival package - including that the parametric regression function flexsurvreg() has a nice plot method which does what you ask.
Using lung as above;
#create a Surv object
s <- with(lung,Surv(time,status))
require(flexsurv)
sWei <- flexsurvreg(s ~ as.factor(sex),dist='weibull',data=lung)
sLno <- flexsurvreg(s ~ as.factor(sex),dist='lnorm',data=lung)
plot(sWei)
lines(sLno, col="blue")
You can plot on the cumulative hazard or hazard scale using the type argument, and add confidence intervals with the ci argument.
This is just a note clarifying Tim Riffe's answer, which uses the following code:
lines(predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
lines(predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01)),seq(.99,.01,by=-.01),col="red")
The reason for the two mirror-image sequences, seq(.01,.99,by=.01) and seq(.99,.01,by=-.01), is that the predict() method gives quantiles of the event-time distribution f(t) - that is, values of the inverse CDF of f(t) - while a survival curve plots 1-(CDF of f) versus t. In other words, if you plot p versus predict(p), you'll get the CDF, and if you plot 1-p versus predict(p) you'll get the survival curve, which is 1-CDF. The following code is more transparent and generalizes to arbitrary vectors of p values:
pct <- seq(.01,.99,by=.01)
lines(predict(sWei, newdata=list(sex=1),type="quantile",p=pct),1-pct,col="red")
lines(predict(sWei, newdata=list(sex=2),type="quantile",p=pct),1-pct,col="red")
In case someone wants to add a Weibull distribution to the Kaplan-Meier curve in the ggplot2 ecosystem, we can do the following:
library(survival) # provides Surv, survfit, survreg and the lung data
library(survminer)
library(tidyr)
s <- with(lung,Surv(time,status))
fKM <- survfit(s ~ sex,data=lung)
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
pred.sex1 = predict(sWei, newdata=list(sex=1),type="quantile",p=seq(.01,.99,by=.01))
pred.sex2 = predict(sWei, newdata=list(sex=2),type="quantile",p=seq(.01,.99,by=.01))
df = data.frame(y=seq(.99,.01,by=-.01), sex1=pred.sex1, sex2=pred.sex2)
df_long = gather(df, key= "sex", value="time", -y)
p = ggsurvplot(fKM, data = lung, risk.table = T)
p$plot = p$plot + geom_line(data=df_long, aes(x=time, y=y, group=sex))
In case you'd like to use the survival function itself S(t) (instead of the inverse survival function S^{-1}(p) used in other answers here), I've written a function to implement that for the case of the Weibull distribution (following the same inputs as the pec::predictSurvProb family of functions):
survreg.predictSurvProb <- function(object, newdata, times){
  shape <- 1/object$scale # also equals 1/exp(object$icoef[2])
  lps <- predict(object, newdata = newdata, type = "lp")
  # one row per newdata row, one column per time point
  surv <- t(sapply(lps, function(lp){
    sapply(times, function(t) 1 - pweibull(t, shape = shape, scale = exp(lp)))
  }))
  return(surv)
}
You can then do:
sWei <- survreg(s ~ as.factor(sex),dist='weibull',data=lung)
times <- seq(min(lung$time), max(lung$time), length.out = 1000)
new_dat <- data.frame(sex = c(1,2))
surv <- survreg.predictSurvProb(sWei, newdata = new_dat, times = times)
lines(times, surv[1, ],col='red')
lines(times, surv[2, ],col='red')
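As a sanity check that the quantile-based and S(t)-based approaches describe the same curve, one can feed the predicted quantiles back into the Weibull survival function and compare with 1 - p. A minimal sketch using the lung data from the survival package follows; the object names q1, lp1 and surv_at_q1 are made up here.

```r
library(survival)

sWei <- survreg(Surv(time, status) ~ as.factor(sex), dist = "weibull", data = lung)

pct <- seq(0.01, 0.99, by = 0.01)

# Quantiles of the event-time distribution for sex = 1
q1 <- as.vector(predict(sWei, newdata = data.frame(sex = 1),
                        type = "quantile", p = pct))

# survreg's scale is 1/shape of the usual Weibull parameterisation,
# and exp(linear predictor) is the Weibull scale
shape <- 1 / sWei$scale
lp1 <- as.numeric(predict(sWei, newdata = data.frame(sex = 1), type = "lp"))

# Survival probability evaluated at each predicted quantile
surv_at_q1 <- pweibull(q1, shape = shape, scale = exp(lp1), lower.tail = FALSE)

# Largest discrepancy from 1 - p across the grid
max_diff <- max(abs(surv_at_q1 - (1 - pct)))
print(max_diff)
```

A discrepancy at the level of floating-point error indicates that plotting 1 - p against the quantiles and plotting S(t) against t trace the same curve.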

Resources