How to run cross-validation for logistic regression in R?

I'm trying to run cross-validation (leave-one-out and k-fold) for a logistic regression with a binary outcome in R.
My problem is the cost function. I do not understand the cost function in the R help, and I found a more intuitive one here on Stack Overflow, but I don't know how to call it, or more specifically, how to pass it the arguments.
library(ISLR)
D <- Default

mycost <- function(r, pi) {
  weight1 <- 1  # cost for getting 1 wrong
  weight0 <- 1  # cost for getting 0 wrong
  c1 <- (r == 1) & (pi < 0.5)   # logical vector - true if actual 1 but predict 0
  c0 <- (r == 0) & (pi >= 0.5)  # logical vector - true if actual 0 but predict 1
  return(mean(weight1 * c1 + weight0 * c0))
}

glm.fit1 <- glm(default ~ balance + student, data = D, family = binomial)
The problem is: if R runs several logistic regressions in the background (for example, 3 for K = 3), how can I pass the vector of predicted probabilities (pi) and the vector of actual values to the cost function?
I am confused...
Is there a way to use a for loop and do it manually instead of using cv.glm?
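For what it's worth, here is a minimal sketch of the call, assuming the Default data and the mycost() function above. cv.glm() does the repeated fitting itself and passes the held-out observed responses and the corresponding predicted probabilities to the cost function internally, so you only supply the function object, not its arguments; a manual for loop is not required.

library(boot)
library(ISLR)

D <- Default
glm.fit1 <- glm(default ~ balance + student, data = D, family = binomial)

set.seed(1)
# 3-fold CV: cv.glm() calls mycost(observed, predicted) internally for each fold
cv.k3 <- cv.glm(D, glm.fit1, cost = mycost, K = 3)
cv.k3$delta[1]  # estimated misclassification cost

# leave-one-out CV: set K to the number of observations (slow for 10,000 rows)
# cv.loo <- cv.glm(D, glm.fit1, cost = mycost, K = nrow(D))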

Related

Why does the Ljung-Box test return NA for Q (and hence for the p-value) when the residuals are 0 (in R)?

Just like the question title says.
I have run Ljung-Box tests in R when fitting models to time series with constant values (i.e. all 0), and, unsurprisingly, I got a perfect model fit and zero residuals. But I want to know why the test returns NA for Q and for the p-value instead of, for example, p = 0.99999 or something like that.
I would like a theoretical interpretation of this.
Given that you are using stats::Box.test(), you can take a look at the code yourself:
utils::getAnywhere(Box.test)
The Ljung-Box Q-statistic is NaN because
cor <- acf (x, lag.max = lag, plot = FALSE, na.action = na.pass)
already returns NaN. So the subsequent computations
obs <- cor$acf[2:(lag+1)]
STATISTIC <- n*(n+2)*sum(1/seq.int(n-1, n-lag)*obs^2)
are NaN too - and so on. So it seems you should look at stats::acf() to see what is going on in there:
utils::getAnywhere(acf)
You should also be able to find the code on GitHub.
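To see the mechanism concretely, here is a small demonstration of my own with a constant, zero-residual series: the sample autocorrelations are 0/0 and therefore NaN, and the NaN propagates into Q and the p-value.

x <- rep(0, 50)                        # constant series, e.g. zero residuals
acf(x, lag.max = 5, plot = FALSE)$acf  # all NaN: the lag-0 autocovariance is zero
Box.test(x, lag = 5, type = "Ljung-Box")
# the statistic, and hence the p-value, comes out as NaN/NA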

Ridge regression within a loop

I am new to coding, so I still struggle with simple things such as loops, subsetting, and data frames vs. matrices.
I am trying to fit a ridge regression with a multivariable X (X1 = Marker 1, X2 = Marker 2, X3 = Marker 3, ..., X1333 = Marker 1333), shown in the first image, as the predictor of Y, shown in the second image.
I want to compute the sum of squared errors (SSE) for a varying tuning parameter λ (between 1 and 20). My code is the following:
#install.packages("MASS")
library(MASS)
fitridge <- function(x,y){
fridge=lm.ridge (y ~ x, lambda = seq(0, 20, 2)) #Fitting a ridge regression for varying λ values
sum(residuals(fridge)^2) #This results in SSE
}
all_gcv= apply(as.matrix(genmark_new),2,fitridge,y=as.matrix(coleslev_new))
}
However, it returns this error, and I don't know what to do anymore. I have tried converting the data set into a matrix, a data frame, changing the order of rows and columns...
Error in colMeans(X[, -Inter]) : 'x' must be an array of at least two dimensions.
I would just like to take the marker values from a single row (first picture), pass them into my fitridge function, which fits a ridge regression against the Y from the second data set (in the second picture), and then extract the SSE and the corresponding lambda values.
You cannot fit a ridge regression with only one independent variable; it is not meant for that. In your case, you most likely have to do:
genmark_new = data.frame(matrix(sample(0:1,1333*100,replace=TRUE),ncol=1333))
colnames(genmark_new) = paste0("Marker_",1:ncol(genmark_new))
coleslev_new = data.frame(NormalizedCholesterol=rnorm(100))
Y = coleslev_new$NormalizedCholesterol
library(MASS)
fit = lm.ridge(y ~ ., data = data.frame(genmark_new, y = Y), lambda = seq(0, 20, 2))
And calculate residuals for each lambda:
apply(fit$coef,2,function(i)sum((Y-as.matrix(genmark_new) %*% i)^2))
0 2 4 6 8 10 12 14
26.41866 27.88029 27.96360 28.04675 28.12975 28.21260 28.29530 28.37785
16 18 20
28.46025 28.54250 28.62459
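Since the original goal was to compare the SSE across λ values, here is a small follow-up sketch of my own. It uses coef(), which returns the coefficients rescaled to the original data with the intercept in the first column, and then picks the λ with the smallest SSE.

B    <- coef(fit)                                  # one row per lambda: intercept + 1333 markers
pred <- cbind(1, as.matrix(genmark_new)) %*% t(B)  # fitted values for every lambda
sse  <- colSums((Y - pred)^2)
data.frame(lambda = fit$lambda, SSE = sse)
fit$lambda[which.min(sse)]                         # lambda with the smallest SSE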
If you need to fit each variable separately, you can consider using a linear model:
fitlm <- function(x, y) {
  fridge <- lm(y ~ x)
  sum(residuals(fridge)^2)
}
all_gcv <- apply(genmark_new, 2, fitlm, y = Y)
Suggestion: check out some notes or introductions to ridge regression; it is meant for multivariable regression, that is, more than one independent variable.

Pass vector of lambdas to Poisson(), or guidance on idiomatic function composition

I'm trying to learn a little Julia by doing some Bayesian analysis. In Peter Hoff's textbook, he describes a process of sampling from the posterior predictive distribution of a Poisson-Gamma model in which he:
Samples values from the gamma distribution
Samples values from the Poisson distribution, passing a vector of lambdas
Here is what this looks like in R:
a <- 2
b <- 1
sy1 <- 217; n1 <- 111
theta1.mc <- rgamma(1000, a+sy1, b+n1)
y1.mc <- rpois(1000, theta1.mc)
In Julia, I see that distributions can't take a vector of parameters. So, I end up doing something like this:
using Distributions
a = 2
b = 1
sy1 = 217; n1 = 111
theta_mc = rand(Gamma(a + sy1, 1/(b + n1)), 5000)
y1_mc = map(x -> rand(Poisson(x)), theta_mc)
While I was initially put off by the distribution function not taking a vector and working Just Like R™, I like that I don't need to set my number of samples more than once. That said, I'm not sure I'm doing this idiomatically, either in terms of how people would work with the Distributions package, or more generically in terms of how to compose functions.
Can anyone suggest a better, more idiomatic approach than my example code?
I would usually do something like the following, which uses list comprehensions:
a, b = 2, 1
sy1, n1 = 217, 111
theta_mc = rand(Gamma(a + sy1, 1 / (b + n1)), 1000)
y1_mc = [rand(Poisson(theta)) for theta in theta_mc]
One source of confusion may be that Poisson isn't really a function; it's a type constructor, and it returns an object. So vectorization over theta doesn't really make sense, since that wouldn't construct one object but many -- which would then require another step to call rand on each generated object.

Manually conduct leave-one-out cross validation for a GLMM using a for() loop in R

I am trying to build a for() loop to manually conduct leave-one-out cross-validation for a GLMM fit with the lmer() function from the lme4 package. I need to remove an individual, fit the model, use the beta coefficients to predict a response for the individual that was withheld, and repeat the process for all individuals.
I have created some test data to tackle the first step of simply leaving an individual out, fitting the model and repeating for all individuals in a for() loop.
The data have a binary (0,1) Response, an IndID that classifies 4 individuals, a Time variable, and a Binary variable. There are N=100 observations. The IndID is fit as a random effect.
require(lme4)
#Make data
Response <- round(runif(100, 0, 1))
IndID <- as.character(rep(c("AAA", "BBB", "CCC", "DDD"),25))
Time <- round(runif(100, 2,50))
Binary <- round(runif(100, 0, 1))
#Make data.frame
Data <- data.frame(Response, IndID, Time, Binary)
Data <- Data[with(Data, order(IndID)), ] #**Edit**: Added code to sort by IndID
#Look at head()
head(Data)
Response IndID Time Binary
1 0 AAA 31 1
2 1 BBB 34 1
3 1 CCC 6 1
4 0 DDD 48 1
5 1 AAA 36 1
6 0 BBB 46 1
# Build model with all IndIDs
# (note: in current versions of lme4, a binomial GLMM is fit with glmer() rather
# than lmer(..., family = binomial))
fit <- lmer(Response ~ Time + Binary + (1 | IndID), data = Data,
            family = binomial)
summary(fit)
As stated above, my hope is to get four model fits – one with each IndID left out – in a for() loop. This is a new type of application of the for() command for me, and I quickly reached the limits of my coding abilities. My attempt is below.
fit <- list()
for (i in Data$IndID) {
  fit[[i]] <- lmer(Response ~ Time + Binary + (1 | IndID), data = Data[-i],
                   family = binomial)
}
I am not sure storing the model fits as a list is the best option, but I had seen it on a few other help pages. The above attempt results in the error:
Error in -i : invalid argument to unary operator
If I remove the [-i] subscript from the data = Data argument, the code runs four fits, but the data for each individual are not removed.
Just as an FYI, I will need to further expand the loop to:
1) extract the beta coefficients, 2) apply them to the X matrix of the individual that was withheld, and lastly 3) compare the predicted values (after an inverse-logit transformation) to the observed values. As all steps are needed for each IndID, I hope to build them into the loop. I am providing the extra details in case my planned future steps inform the more immediate question of leave-one-out model fits.
Thanks as always!
The problem you are having is that Data[-i] expects i to be an integer index. Instead, i is either "AAA", "BBB", "CCC", or "DDD". To fix the loop, set
data = Data[Data$IndID != i, ]
in your model fit.
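Putting that fix together with the prediction steps listed in the question, a rough sketch of the full loop might look like the following. This is my own sketch, not the original answer's code: it uses glmer() for the binomial GLMM and predicts the withheld individual from the fixed effects only, since that individual's random effect is never estimated.

require(lme4)

loo_pred <- list()
for (i in unique(Data$IndID)) {
  train <- Data[Data$IndID != i, ]  # leave individual i out
  test  <- Data[Data$IndID == i, ]
  fit_i <- glmer(Response ~ Time + Binary + (1 | IndID), data = train,
                 family = binomial)
  X   <- model.matrix(~ Time + Binary, data = test)  # X matrix of the withheld individual
  eta <- drop(X %*% fixef(fit_i))                    # linear predictor, fixed effects only
  loo_pred[[i]] <- data.frame(observed  = test$Response,
                              predicted = plogis(eta))  # inverse-logit transformation
}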

cost function in cv.glm of boot library in R

I am trying to use the cross-validation function cv.glm from the boot library in R to determine the number of misclassifications when a GLM logistic regression is applied.
The function has the following signature:
cv.glm(data, glmfit, cost, K)
with the first two arguments denoting the data and the model, while K specifies the number of folds.
My problem is the cost parameter, which is defined as:
cost: A function of two vector arguments specifying the cost function
for the crossvalidation. The first argument to cost should correspond
to the observed responses and the second argument should correspond to
the predicted or fitted responses from the generalized linear model.
cost must return a non-negative scalar value. The default is the
average squared error function.
I guess for classification it would make sense to have a function that returns the rate of misclassification, something like:
nrow(subset(data, (predict >= 0.5 & data$response == "no") |
(predict < 0.5 & data$response == "yes")))
which is of course not even syntactically correct.
Unfortunately, my limited R knowledge has already cost me hours, and I was wondering if someone could point me in the right direction.
It sounds like you might do well to just use the cost function (i.e. the one named cost) defined further down in the "Examples" section of ?cv.glm. Quoting from that section:
# [...] Since the response is a binary variable an
# appropriate cost function is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
This does essentially what you were trying to do with your example. Replacing your "no" and "yes" with 0 and 1, let's say you have two vectors, predict and response. Then cost() is nicely designed to take them and return the mean misclassification rate:
## Simulate some reasonable data
set.seed(1)
predict <- seq(0.1, 0.9, by=0.1)
response <- rbinom(n=length(predict), prob=predict, size=1)
response
# [1] 0 0 0 1 0 0 0 1 1
## Demonstrate the function 'cost()' in action
cost(response, predict)
# [1] 0.3333333 ## Which is right, as 3/9 elements (4, 6, & 7) are misclassified
## (assuming you use 0.5 as the cutoff for your predictions).
I'm guessing the trickiest bit of this will be just getting your mind fully wrapped around the idea of passing a function in as an argument. (At least that was for me, for the longest time, the hardest part of using the boot package, which requires that move in a fair number of places.)
Added on 2016-03-22:
The function cost() given above is, in my opinion, unnecessarily obfuscated; the following alternative does exactly the same thing but in a more expressive way:
cost <- function(r, pi = 0) {
  mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
}
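As a quick sanity check (my addition, reusing the simulated predict and response vectors from above), the two formulations give the same answer:

cost_orig <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cost_expr <- function(r, pi = 0) mean((pi < 0.5) & r == 1 | (pi > 0.5) & r == 0)
cost_orig(response, predict)  # 0.3333333
cost_expr(response, predict)  # 0.3333333 - identical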
I will try to explain the cost function in simple words. Let's take the cv.glm(data, glmfit, cost, K) arguments step by step:
data
The data consist of many observations; think of them as a series of numbers.
glmfit
The generalized linear model that is fit to the data. The catch is that cv.glm splits the data into K parts, refits the model on each training set (all parts except one), and produces predictions for the held-out part (the test set). The predictions form a series with the same number of elements as the held-out input.
cost
The cost function. It takes two arguments: first the observed responses in the test set, and second the predictions from the refitted model for that test set. The default is the average squared error function: it sums the squared differences between observed and predicted data points (a loop over the test set calculates each difference, squares it, and adds it to an accumulator) and divides by the number of points.
K
The number of groups into which the data should be split. The default gives leave-one-out cross-validation.
Judging from your cost function description, your input (x) would be a set of numbers between 0 and 1 (0 to 0.5 = no, 0.5 to 1 = yes) and your output (y) would be 'yes' or 'no'. So the error (e) between the predictions (x) and the observations (y) would be:
cost <- function(x, y) {
  e <- 0
  for (i in 1:length(x)) {
    if (x[i] > 0.5) {
      # predicted 'yes'
      if (y[i] == 'yes') d <- 0 else d <- x[i] - 0.5
    } else {
      # predicted 'no'
      if (y[i] == 'no') d <- 0 else d <- 0.5 - x[i]
    }
    e <- e + d * d      # accumulate the squared error
  }
  e <- e / length(x)    # mean squared error
  return(e)
}
Source: http://www.cs.cmu.edu/~schneide/tut5/node42.html
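The same calculation can also be written without the explicit loop; a vectorized sketch (my own, keeping this answer's convention that x holds the predicted probabilities and y the 'yes'/'no' labels) would be:

cost_vec <- function(x, y) {
  miss <- ifelse(x > 0.5, y != 'yes', y != 'no')  # TRUE where the predicted class is wrong
  d    <- ifelse(miss, abs(x - 0.5), 0)           # distance beyond the 0.5 cutoff, else 0
  mean(d^2)                                       # mean squared error
}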
The cost function can optionally be supplied if there is one you prefer over the default average squared error. In that case you write a function that returns the cost you want to minimize, with two inputs: (1) the vector of observed responses you are predicting and (2) the vector of predicted probabilities from your model for those responses. So for the cost function that (I think) you described in your post, you are looking for a function that returns the proportion of accurate classifications, which would look something like this:
cost <- function(labels, pred) {
  mean(labels == ifelse(pred > 0.5, 1, 0))
}
With that function defined, you can then pass it into your cv.glm() call, although I wouldn't recommend using your own cost function over the default one unless you have a reason to. Your example isn't reproducible, so here is another one:
> library(boot)
>
> cost <- function(labels,pred){
+ mean(labels==ifelse(pred > 0.5, 1, 0))
+ }
>
> #make model
> nodal.glm <- glm(r ~ stage+xray+acid, binomial, data = nodal)
> #run cv with your cost function
> (nodal.glm.err <- cv.glm(nodal, nodal.glm, cost, nrow(nodal)))
$call
cv.glm(data = nodal, glmfit = nodal.glm, cost = cost, K = nrow(nodal))
$K
[1] 53
$delta
[1] 0.8113208 0.8113208
$seed
[1] 403 213 -2068233650 1849869992 -1836368725 -1035813431 1075589592 -782251898
...
The cost function defined in the example for cv.glm clearly assumes that the predictions are probabilities, which would require the type = "response" argument in the predict function. The documentation in library(boot) should state this explicitly; otherwise one is left to assume that the default type = "link" is used inside cv.glm, in which case the cost function would not work as intended.
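As with Box.test earlier, one way to check is to print the body of cv.glm and look at how predict() is called on the held-out folds:

library(boot)
utils::getAnywhere(cv.glm)  # look for the type argument in the internal predict() call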
