Manual LOOCV vs cv.glm in R

In Introduction to Statistical Learning we're asked to perform Leave-One-Out Cross-Validation (LOOCV) on a logistic regression manually. The code for it is here:
library(ISLR)  # provides the Weekly data set
count = rep(0, dim(Weekly)[1])
for (i in 1:(dim(Weekly)[1])) {
  ## fit a logistic regression model, excluding the ith observation from the training data
  glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  is_up = predict.glm(glm.fit, Weekly[i, ], type = "response") > 0.5
  is_true_up = Weekly[i, ]$Direction == "Up"
  if (is_up != is_true_up)
    count[i] = 1
}
sum(count)
## [1] 490
This means that the error rate is approximately 45% (490 misclassified out of the 1089 observations in Weekly).
But when we do the same thing using the cv.glm() function from the boot package, the result is far different.
> library(boot)
> glm.fit = glm(Direction~Lag1+Lag2,data=Weekly,family=binomial)
> cv.glm = cv.glm(Weekly,glm.fit)
> cv.glm$delta
[1] 0.2464536 0.2464530
Why does this occur? What does the cv.glm() function exactly do?

I believe there may be a bug in the cv.glm function. On line 23 of its source it calculates
cost(glm.y, fitted(glmfit)), where fitted(glmfit) are fitted probabilities. In order to calculate the cross-validated error rate (= total number of misclassified observations over n), we first need to map these probabilities to classes. In other words, if you replace
cost.0 <- cost(glm.y, fitted(glmfit))
with
cost.0 <- cost(glm.y, ifelse(fitted(glmfit) > 0.5, 1, 0))
I believe you should get the same thing as what you coded up manually.
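Two notes on the discrepancy itself. First, the default cost function in cv.glm is the average squared error between the 0/1 response and the fitted probability, which is why delta comes out near 0.246 rather than as a misclassification rate. Second, instead of patching the function you can pass a classification cost through its cost argument. A minimal sketch:
library(boot)
library(ISLR)  # Weekly data
glm.fit = glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
## count an observation as an error when the fitted probability falls on the
## wrong side of 0.5 (r is the observed 0/1 response, pi the fitted probability)
cost = function(r, pi) mean(abs(r - pi) > 0.5)
cv.err = cv.glm(Weekly, glm.fit, cost = cost)
cv.err$delta[1]  # LOOCV misclassification rate; should match 490/1089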

Can glmmLasso be used with the Tweedie distribution?

I have a linear mixed effects model and I am trying to do variable selection. The model is testing the level of forest degradation in 1000 sampled points. Most points have no degradation, and so the dependent variable is highly skewed with many zeros. Therefore, I am using the Tweedie distribution to fit the model. My main question is: can the Tweedie distribution actually be used in the glmmLasso function? My second question is: do I even need to use this distribution in glmmLasso()? Any help is much appreciated!
When I run the function with family = tweedie(var.power=1.2,link.power=0) I get the following error:
Error in logLik.glmmLasso(y = y, yhelp = yhelp, mu = mu, family = family, :
object 'loglik' not found
If I change the link.power from 0 to 1 (which I think is not correct for my model, but just for the sake of figuring out the problem), I get a different error:
Error in grad.lasso[b.is.0] <- score.beta[b.is.0] - lambda.b * sign(score.beta[b.is.0]) :
NAs are not allowed in subscripted assignments
Here tweedie comes from the statmod package. A simple example:
library(tweedie)   # rtweedie() for simulating the response
library(glmmLasso)
library(statmod)   # tweedie() family object
power <- 2
mu <- 1
set.seed(10000)
y <- rtweedie(100, mu = mu, power = power, phi = 3)
x <- rnorm(100)
z <- c(rep(1, 50), rep(2, 50))
df <- data.frame(y, x, z)
df$z <- as.factor(df$z)
f <- y ~ x
varSelect <- glmmLasso(fix = f, rnd = list(z = ~1), data = df,
                       lambda = 5, family = tweedie(var.power = 1.2, link.power = 0))
I created a hacked version of glmmLasso that incorporates the Tweedie distribution as an option and put it on GitHub. I had to change two aspects of the code:
add a clause to compute the log-likelihood when family$family == "Tweedie" (sketched below);
in a number of places where the code was essentially if (family$family in list_of_families) ..., add "Tweedie" as an option.
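For intuition, here is a rough sketch (not the actual patch) of what such a log-likelihood clause computes, using the Tweedie density from the tweedie package; tweedie_loglik is a hypothetical helper name:
library(tweedie)
## log-likelihood of observations y under a Tweedie model with mean mu,
## dispersion phi and variance power p, i.e. Var(Y) = phi * mu^p
tweedie_loglik <- function(y, mu, phi, power) {
  sum(log(dtweedie(y, mu = mu, phi = phi, power = power)))
}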
remotes::install_github("bbolker/glmmLasso-bmb")
packageVersion("glmmLasso")
## [1] ‘1.6.2.9000’
Your example runs for me now, but I haven't checked at all to see if the results are sensible.

Warning: "Intersecting linear/additive predictors" when fitting non-proportional odds ordinal regression model with VGLM()

I am fitting a partial proportional odds cumulative logit ordinal regression model. Response is an ordinal diagnosis, predictors are two urinary biomarkers. I fit the model using the following command:
fit = vglm(diagnosis ~ creatinine + LYVE1, data = urine.dat,
           family = cumulative(parallel = FALSE))
summary(fit)
Afterwards, I often get about 20 of the following warnings:
In slot(family, "validparams")(eta, y, extra = extra) :
It seems that the nonparallelism assumption has resulted in
intersecting linear/additive predictors.
Try propodds() or fitting a partial nonproportional odds model or
choosing some other link function, etc.
Does anyone understand what is meant by "intersecting linear/additive predictors"? From what I have seen, this warning is returned very often with non-proportional odds VGLM models. I am just trying to understand what the issue with the model is.
Any insight would be helpful.
If you look through the VGAM source code, you'll find the following piece that throws the warning:
probs <-
  if ( .reverse ) {
    ccump <- cbind(1, eta2theta(eta, .link , earg = .earg ))
    cbind(-tapplymat1(ccump, "diff"), ccump[, ncol(ccump)])
  } else {
    cump <- cbind(eta2theta(eta, .link , earg = .earg ), 1)
    cbind(cump[, 1], tapplymat1(cump, "diff"))
  }
okay1 <- all(is.finite(probs)) && all(0 < probs & probs < 1)
if (!okay1)
  warning("It seems that the nonparallelism assumption has ",
          "resulted in intersecting linear/additive ",
          "predictors. Try propodds() or fitting a partial ",
          "nonproportional odds model or choosing ",
          "some other link function, etc.")
We can boil the guts of this down to a couple of different pieces:
library(VGAM)
#> Loading required package: stats4
#> Loading required package: splines
fit <- vglm(cyl ~ wt, data = mtcars,
            family = cumulative(parallel = FALSE))
eta <- predict(fit, type = "link")
cump <- cbind(VGAM:::eta2theta(eta, logitlink), 1)
probs <- cbind(cump[, 1], tapplymat1(cump, "diff"))
Since cyl has three values, eta is an N x 2 matrix of predicted values on the link scale. cump is the matrix of cumulative probabilities calculated in the usual way for an ordered logit, and probs is the matrix of category probabilities, calculated as usual by subtracting the previous cumulative probability from the current one.
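Concretely, with three ordered categories the mapping from cumulative to category probabilities works like this (a toy illustration with made-up numbers, not the package code):
cump <- c(0.2, 0.7, 1)           # P(Y<=1), P(Y<=2), P(Y<=3)
probs <- c(cump[1], diff(cump))  # P(Y=1), P(Y=2), P(Y=3)
probs
#> [1] 0.2 0.5 0.3
Once these are calculated, a flag is generated to identify whether all probabilities are finite and within the theoretical bounds: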
okay1 <- all(is.finite(probs)) && all(0 < probs & probs < 1)
#> [1] FALSE
In this case okay1 is FALSE. We can see why below:
all(is.finite(probs))
#> [1] TRUE
all(0 < probs & probs < 1)
#> [1] FALSE
It's because some of the predicted probabilities are negative. We can see which ones below:
ind <- which(probs < 0 | probs > 1, arr.ind=TRUE)[,1]
ind
#> row
#> 16
probs[ind, ]
#> logitlink(P[Y<=2])
#> 1.253506e-04 -2.167395e-12 9.998746e-01
Notice that the predicted probability for the second group is negative (though only barely below zero). The takeaway is that even though you have specified parallel=FALSE, the resulting model is incompatible with the underlying cumulative-probability construction. The warning is encouraging you to use a different model that doesn't calculate probabilities this way, such as a multinomial logit. For example:
fit2 <- vglm(cyl ~ wt, data = mtcars,
             family = multinomial())
which doesn't throw a warning, because there the probabilities are calculated in a way that cannot put them outside [0, 1] so long as the exponentiated predicted values on the link scale are finite.
Created on 2022-04-26 by the reprex package (v2.0.1)
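To see why the multinomial parameterization is safe, here is a minimal sketch with made-up numbers (not the package code): each category probability is a ratio of exponentials (a softmax), which is strictly between 0 and 1 whenever the linear predictors are finite.
eta <- c(2.3, -0.7)            # hypothetical linear predictors for one observation;
                               # the last category is the reference with eta = 0
expeta <- exp(c(eta, 0))
probs <- expeta / sum(expeta)  # each entry lies in (0, 1) and the entries sum to 1
probs                          # -> approximately 0.870 0.043 0.087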

How to get AIC from lm_robust object

How do I get an AIC from an lm_robust object (package estimatr)? I'm using lm_robust because I want a robust estimator of the standard errors. Unlike with lm, the AIC is not reported by summary(), and calling AIC() on an lm_robust object produces an error. Below is a toy example of the kind of model I'm trying to run.
library(estimatr)
fake_data <- data.frame(outcome = rnorm(100, 3.65, 1),
                        pred1 = rnorm(100, 15, 7),
                        pred2 = as.factor(sample(1:5, 100, replace = TRUE)))
mod1 <- lm_robust(outcome ~ pred1 + pred2, data = fake_data)
AIC(mod1)
Here is what the error message looks like:
> AIC(mod1)
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "lm_robust"
If you have to use lm_robust, you can calculate the AIC yourself as below.
For a Gaussian linear model the formula is
AIC = 2*k + n*(log(2*pi*RSS/n) + 1)
# n   : number of observations
# k   : number of estimated parameters -- all regression coefficients
#       (intercept and factor dummies included) plus one for the residual variance
# RSS : residual sum of squares
If we apply it to R for your case (note that RSS = (1 - R^2) * TSS, and lm_robust stores both R^2 and TSS),
# k = 7: intercept + pred1 + 4 dummies for the 5-level factor = 6 coefficients,
# plus 1 for the estimated residual variance
AIC_calculated <- 2*7 + 100*(log(2*pi*(1 - mod1$r.squared)*mod1$tss/100) + 1)
AIC_calculated
[1] 332.2865
which is the same value that both lm and glm report:
mod2 <- lm(outcome ~ pred1 + pred2, data = fake_data)
AIC(mod2)
[1] 332.2865
And finally, you can of course wrap this calculation in a function that takes an lm_robust model and reads N, k, R^2 and TSS directly from it, so you don't have to set n and k by hand for any given data:
myAIC <- function(fit) {
  # fit$k is the number of regression coefficients; the + 1 accounts for the
  # estimated residual variance
  2*(fit$k + 1) + fit$N * (log(2*pi*(1 - fit$r.squared)*fit$tss/fit$N) + 1)
}
> myAIC(mod1)
[1] 332.2865
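As a sanity check on the parameter count, the same number falls out of R's definition AIC = -2*logLik + 2*df applied to the lm fit (a quick sketch using mod2 from above):
# df counts the 6 regression coefficients plus the residual variance, i.e. 7
-2*as.numeric(logLik(mod2)) + 2*(length(coef(mod2)) + 1)
# -> the same value as AIC(mod2) above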
Note: your results will differ from the values shown here, because the example data are generated with rnorm() and sample() without a fixed seed.
Here's a workaround:
mod1 <- lm_robust(outcome ~ pred1 + pred2, data = fake_data)
# Create any fitted model using 'lm' as a placeholder
mod2 <- with(list(x = rnorm(10), y = rnorm(10)), lm(y ~ x))
# Copy the values in `mod2` over from `mod1`
mod2[names(mod2)] <- mod1[names(mod2)]
# Recalculate the residuals in `mod2` (the sign does not matter for the AIC)
mod2$residuals <- mod2$fitted.values - fake_data$outcome
AIC(mod2)
#[1] 326.6092
(The value differs from the one above only because the data were regenerated without a seed.)

Using caret together with gam ("gamSpline" method) for Poisson regression in R

I am trying to use the caret package to tune the df parameter of a gam model for my cohort analysis.
With the following data:
cohort <- 1:60
age <- 1:26
grid <- expand.grid(age = age, cohort = cohort)
size <- data.frame(cohort = cohort,
                   N = sample(100:150, length(cohort), replace = TRUE))
df <- merge(grid, size, by = "cohort")
log_k <- -3 + log(df$N) - 0.5*log(df$age) +
  df$cohort*(df$cohort - 30)*(df$cohort - 50)/20000 +
  runif(nrow(df), min = 0, max = 0.5)
df$conversion <- rpois(nrow(df), exp(log_k))
Explanation of the data: the cohort number is the arrival time of the potential customers, N is the number of potential customers that arrived at that time, and conversion is the number of those potential customers that 'converted' (bought something). Age is the age of the cohort (the time elapsed since arrival) when the conversions took place. For a given cohort there are fewer conversions as age grows, and this effect follows a power law.
But the total conversion rate of each cohort can also change slowly in time (with cohort number), which is why I want a smoothing spline of the time variable in my model.
I can fit a gam model from the gam package:
library(gam)
fit <- gam(conversion ~ log(N) + log(age) + s(cohort, df = 4),
           data = df, family = poisson)
fit
## Call:
## gam(formula = conversion ~ log(N) + log(age) + s(cohort, df = 4),
##     family = poisson, data = df)
## Degrees of Freedom: 1559 total; 1553 Residual
## Residual Deviance: 1869.943
But if I try to train the model using the caret package
library(caret)
fitControl <- trainControl(verboseIter = TRUE)
fit.crt <- train(conversion ~ log(N) + log(age) + s(cohort, df),
                 data = df, method = "gamSpline",
                 trControl = fitControl, tuneLength = 3, family = poisson)
I get this error :
+ Resample01: df=1
model fit failed for Resample01: df=1 Error in as.matrix(x) : object 'N' not found
- Resample01: df=1
+ Resample01: df=2
model fit failed for Resample01: df=2 Error in as.matrix(x) : object 'N' not found .....
Please does anyone know what I'm doing wrong here?
Thanks
There are two things wrong with your code.
The train function can be a bit tedious depending on the method used (as you have noticed). In the case of method = "gamSpline", train adds a smooth term to every independent term in the formula, so it converts your variables to s(log(N), df), s(log(age), df) and s(s(cohort, df), df).
But s(s(cohort, df), df) does not really make sense, so you must change s(cohort, df) to plain cohort.
I am not sure why, but train with method = "gamSpline" does not like it when you put functions (e.g. log) in the formula. I think this is because the method already applies the s() function to your variables. The problem can be solved by applying the log to your variables beforehand: either df$N <- log(df$N), or logN <- log(df$N) with logN used as the variable; and of course do the same for age. A corrected call combining both fixes is sketched below.
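Putting the two fixes together, something like this should run (a sketch assuming the df data frame from the question; note that caret spells the tuning-grid argument tuneLength):
library(caret)
df$logN <- log(df$N)      # log-transform before modelling instead of in the formula
df$logAge <- log(df$age)
fitControl <- trainControl(verboseIter = TRUE)
fit.crt <- train(conversion ~ logN + logAge + cohort,
                 data = df, method = "gamSpline",
                 trControl = fitControl, tuneLength = 3, family = poisson)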
My guess, based on the code you provided, is that you don't want this method to apply a smoothing term to every independent variable. I am not sure whether that is possible, or how to do it if it is.
Hope this helps.
EDIT: If you want a more elegant solution than the one I provided in point 2, make sure to read the comment from @topepo. If I understand it correctly, that suggestion also lets you apply the s() function only to the variables you want.

Manual Perceptron example in R - are the results acceptable?

I am trying to get a perceptron algorithm for classification working but I think something is missing. This is the decision boundary achieved with logistic regression:
The red dots got into college, after performing better on tests 1 and 2.
This is the data, and this is the code for the logistic regression in R:
dat = read.csv("perceptron.txt", header=F)
colnames(dat) = c("test1","test2","y")
plot(test2 ~ test1, col = as.factor(y), pch = 20, data=dat)
fit = glm(y ~ test1 + test2, family = "binomial", data = dat)
coefs = coef(fit)
(x = c(min(dat[,1])-2, max(dat[,1])+2))
(y = c((-1/coefs[3]) * (coefs[2] * x + coefs[1])))
lines(x, y)
The code for the "manual" implementation of the perceptron is as follows:
# DATA PRE-PROCESSING:
dat = read.csv("perceptron.txt", header = F)
dat[, 1:2] = apply(dat[, 1:2], MARGIN = 2, FUN = function(x) scale(x)) # scaling the data
data = data.frame(rep(1, nrow(dat)), dat)  # introducing the "bias" column
colnames(data) = c("bias", "test1", "test2", "y")
data$y[data$y == 0] = -1  # turning the 0/1 dependent variable into -1/1
data = as.matrix(data)    # turning the data.frame into a matrix to avoid mmult problems
# PERCEPTRON:
set.seed(62416)
no.iter = 1000                      # number of loops
theta = rnorm(ncol(data) - 1)       # starting with a random vector of coefficients
theta = theta / sqrt(sum(theta^2))  # normalizing the vector
h = theta %*% t(data[, 1:3])        # performing the first f(theta^T X)
for (i in 1:no.iter) {              # we will recalculate 1,000 times
  for (j in 1:nrow(data)) {         # each time we go through each example
    if (h[j] * data[j, 4] < 0) {    # if the hypothesis disagrees with the sign of y,
      theta = theta + (sign(data[j, 4]) * data[j, 1:3])  # we + or - the example from theta
    } else {
      theta = theta                 # else we let it be
    }
  }
  h = theta %*% t(data[, 1:3])      # calculating h() after each pass
}
theta                        # final coefficients
mean(sign(h) == data[, 4])   # accuracy
With this, I get the following coefficients:
bias test1 test2
9.131054 19.095881 20.736352
and an accuracy of 88%, consistent with the accuracy calculated from the glm() logistic regression, mean(sign(predict(fit)) == data[, 4]), which is 89%. Logically, there is no way of linearly classifying all of the points, as is obvious from the plot above. In fact, iterating only 10 times and plotting the accuracy, ~90% is reached after just 1 iteration:
Since this is in line with the training classification performance of logistic regression, the code is probably not conceptually wrong.
QUESTION: Is it OK to get coefficients so different from those of the logistic regression:
(Intercept) test1 test2
1.718449 4.012903 3.743903
This is really more of a CrossValidated question than a StackOverflow question, but I'll go ahead and answer.
Yes, it's normal and expected to get very different coefficients, because you can't directly compare the magnitudes of the coefficients between these two techniques.
With the logit (logistic) model you're using a binomial distribution and a logit link based on a sigmoid cost function. The coefficients are only meaningful in this context. You've also got an intercept term in the logit model.
None of this is true for the perceptron model, so the interpretation of the coefficients is totally different.
Now, that's not saying anything about which model is better. There are no comparable performance metrics in your question that would allow us to determine that; to do so you should use cross-validation or at least a holdout sample.
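One way to see the scale issue concretely: the decision boundary theta0 + theta1*x1 + theta2*x2 = 0 is unchanged if you rescale the whole coefficient vector, so a fairer comparison is between unit-normalized vectors. A rough sketch (still imperfect here, because the perceptron was fit on scaled inputs while the glm was not):
percep <- c(9.131054, 19.095881, 20.736352)  # perceptron coefficients from above
logit  <- c(1.718449, 4.012903, 3.743903)    # glm coefficients from above
percep / sqrt(sum(percep^2))  # direction of the perceptron boundary
logit / sqrt(sum(logit^2))    # direction of the logistic boundary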
