cv.glmnet gives AUC value greater than 1

I want to get the AUC on the test folds from cv.glmnet for the best set of hyperparameters. According to this post,
I should look at cvm to get it. However, when I do this I get a value greater than 1, and my understanding is that AUC should be between 0 and 1. Here's an example:
library(glmnet)

age <- c(4, 8, 7, 12, 6, 9, 10, 14, 7)
gender <- as.factor(c(1, 0, 1, 1, 1, 0, 1, 0, 0))
bmi_p <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88)
m_edu <- as.factor(c(0, 1, 1, 2, 2, 3, 2, 0, 1))
p_edu <- as.factor(c(0, 2, 2, 2, 2, 3, 2, 0, 0))
f_color <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow",
"yellow", "red", "yellow"))
asthma <- c(1, 1, 0, 1, 0, 0, 0, 1, 1)
xfactors <- model.matrix(asthma ~ gender + m_edu + p_edu + f_color)[, -1]
x <- as.matrix(data.frame(age, bmi_p, xfactors))
cv.glmmod <- cv.glmnet(x, y = asthma, alpha = 1, family = "binomial", type.measure = "auc")
max(cv.glmmod$cvm)
[1] 7.0223
How do I interpret this number? Is it really just 0.70223?
Thanks,
Steve

For your dataset, cv.glmnet() does not measure the loss by AUC but by deviance, which is what you obtained via cv.glmmod$cvm.
Although you ran the CV with cv.glmnet(type.measure = "auc"), your dataset is too small. In this situation cv.glmnet() (actually cv.lognet()) issues the warning "Too few (< 10) observations per fold for type.measure='auc' in cv.lognet; changed to type.measure='deviance'. Alternatively, use smaller value for nfolds", and, as the warning says, it falls back to type.measure = "deviance".
You can verify this by printing cv.glmmod$name, which should be "Partial Likelihood Deviance" in your case, instead of "AUC".
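A minimal check along these lines (a sketch; the exact label depends on your glmnet version):
cv.glmmod$name
# reports the measure actually used; with at least 10 observations per
# held-out fold, type.measure = "auc" is honored and max(cv.glmmod$cvm)
# then lies between 0 and 1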

How to multiply some elements of a variable by a constant, depending on the value of another variable, in every data frame of a list of data frames

I have a list of 100 data frames, each corresponding to one participant, in which there is a variable "trial" (100 trials), a variable "response" (0/1 for incorrect/correct) and a variable "time" (for reaction times). Here is an example of my data:
library(tidyverse)
# list of data sets
my_responses <- tibble(
  participant = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  trial1 = c(0, 0, 1, 0, 1, 0, 0, 1, 1),
  trial2 = c(1, 1, 1, 1, 0, 1, 0, 0, 1),
  trial3 = c(0, 1, 0, 1, 1, 1, 0, 0, 1)
)
my_times <- tibble(
  participant = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  rt1 = c(0.5, 1.1, 1.3, 0.4, 0.6, 1.2, 1.9, 0.6, 0.8),
  rt2 = c(0.8, 0.7, 1.4, 1.5, 0.6, 1.3, 0.9, 0.8, 1.2),
  rt3 = c(0.4, 0.9, 1.3, 1.1, 1.6, 0.3, 0.9, 1.4, 1.3)
)
my_responses <- my_responses %>%
  pivot_longer(2:ncol(my_responses), names_to = "trials1", values_to = "response")
my_times <- my_times %>%
  pivot_longer(2:ncol(my_times), names_to = "trials2", values_to = "times")
my_data <- my_responses %>% bind_cols(my_times)
my_data <- my_data %>% subset(select = -c(participant...4, trials2))
my_data <- split(my_data, f = my_data$participant...1)
Now I need to give a negative sign to the incorrect responses. So, for each data frame in my list I need to multiply the variable "time" by -1 if the variable "response" has the value 0.
I've tried several things, based on previous answers to questions here, but none seem to be working. Here is what I tried so far:
1)
# multiply times of incorrect responses (0) by -1
my_data <- lapply(my_data, function(x) {ifelse(my_data[[x]]$response == 0, my_data[[x]]$times*-1, my_data[[x]]$times*1)})
which gives the error: Error in my_data[[x]] : invalid subscript type 'list'
2)
my_data <- lapply(my_data, function(x) {ifelse(my_data$response == 0, my_data$times*-1, my_data$times*1)})
which gives me just a list of 9 elements, each containing logical(0)
3)
my_data <- lapply(names(my_data), function(x) {ifelse(my_data[[x]]$response == 0, my_data[[x]]$times*-1, my_data[[x]]$times*1)})
which gives me a list of 9 elements, each with the "times" multiplied by -1 or not, according to the variable "response". This seems close, but I need the output to be a list of data frames again (the same ones I had).
I appreciate any help to try to make this work!
Thank you.
Please, try this:
for (i in 1:length(my_data)) {
  my_data[[i]] <- my_data[[i]] %>%
    mutate(times = ifelse(response, times, -times))
}
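If you would rather keep lapply, as in your own attempts, here is an equivalent sketch that preserves the list-of-data-frames structure:
# same result with lapply: negate times wherever response == 0
my_data <- lapply(my_data, function(df) {
  df$times <- ifelse(df$response == 0, -df$times, df$times)
  df
})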

R: Calculate row sum (MERSQI score), adjusted to missing values / not applicable categories

I would like to calculate sums of rows, including an adjustment for missing data.
The row sums are actual "MERSQI" scores (scoring the quality of studies, one study per row). Each column is a question about quality with a specific maximum of achievable points.
However, in some cases questions were not applicable to some studies, leading to missing values. The row sum should be rescaled to the standard denominator of 18 as the maximal score per row (with max achievable points = the sum of the maximal achievable points of the applicable questions/columns):
total MERSQI score = row sum / max achievable points * 18
For example:
questions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) #number of question or col number
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3) #maximum of every single question
study1 <- c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3) #points for every single questions for study1
study2 <- c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1, 3) # for study2
study3 <- c(2, 1.5, NA, 3, NA, 1, NA, 1, 1, 3) #for study3
df <- rbind (questions, max_quest, study1, study2, study3)
For study1 we would have a row sum and resulting score of 10.5, as there are no missing values.
For study2 we have a row sum of 10. There are three NAs; the maximal achievable points for study2 were 15 (= 18 maximal points - 3*1 point for the NA questions), giving an adjusted MERSQI score of 12 (= 10 * 18/15).
For study3: row sum = 12.5, maximal achievable points = 14.5 (= 18 - (1.5 + 1 + 1)), adjusted MERSQI score = 15.52.
Do you have any idea how to calculate the row sums while adjusting for missing values? Maybe by going through every row with a for loop and if with is.na()?
Thank you!
PS: Link / explanation to the MERSQI score: https://www.aliem.com/article-review-how-do-you-assess/ and https://pubmed.ncbi.nlm.nih.gov/26107881/
There is an issue with the lengths of the vectors. I edited the dataset so that they are all length 9, but this should work:
apply(mat[, 3:5], 2, FUN = function(x) {
  tot <- sum(x, na.rm = TRUE)
  nas <- which(is.na(x))
  total_max <- sum(max_quest)
  if (!length(nas))
    return(tot)
  else
    return(tot * total_max / (total_max - sum(max_quest[nas])))
})
Data:
questions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9) #number of question or col number
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3) #maximum of every single question
study1 <- c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3) #points for every single questions for study1
study2 <- c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1) # for study2
study3 <- c(2, 1.5, NA, 3, NA, 1, NA, 1, 1) #for study3
## named mat because cbind() of vectors returns a matrix, not a data frame
mat <- cbind (questions, max_quest, study1, study2, study3)
For each study column, calculate its sum, multiply by the sum of max_quest, and divide by (the sum of max_quest minus the max_quest values of the NA entries).
library(dplyr)
val <- sum(df$max_quest)
df %>%
  summarise(across(starts_with('study'),
                   ~ sum(., na.rm = TRUE) * val / (val - sum(max_quest[is.na(.)]))))
data
The data shared is not usable as-is due to incompatible lengths. Also, it would make more sense to store these values column-wise rather than row-wise.
questions <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
max_quest <- c(3, 1.5, 1.5, 3, 1, 1, 1, 1, 3, 3)
study1 <- c(1.5, 0.5, 1.5, 3, 0, 0, 0, 1, 3, 0)
study2 <- c(1, 0.5, 0.5, 3, NA, NA, NA, 1, 1, 3)
study3 <- c(2, 1.5, NA, 3, NA, 1, NA, 1, 1, 3)
df <- data.frame(questions, max_quest, study1, study2, study3)
This can be done with vectorization.
First compute the row sums and the points belonging to the NA questions in each row (using the question's row-wise df):
row_sums <- apply(df, 1, function(x) sum(x, na.rm = TRUE))
na_points <- apply(df, 1, function(x) sum(max_quest[is.na(x)]))
Then pull out the studies and the maximal points:
studies <- row_sums[3:length(row_sums)]
max_total <- row_sums[2]
Compute the MERSQI from the adjusted max. Note that the maximal points of the NA questions, not merely their count, must be subtracted, because the questions have different maxima:
adjusted_max <- max_total - na_points[3:length(na_points)]
MERSQI <- studies * max_total / adjusted_max

Getting (maybe manually) confidence intervals of fits after using the multi-way clustering package (multiwayvcov)

I am interested in plotting fits with confidence intervals after using the two-way clustering package (multiwayvcov).
Here is my reproducible data.
rm(list=ls(all=TRUE))
library(lmtest)
library(multiwayvcov)
dv<-c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
int1<-c(0.0123, 0.3428, 0.2091, 0.8325, 0.7113, 0.7401, 0.6009, 0.5062, 0.4841, 0.8912, 0.3850, 0.2463, 0.0625, 0.5374, 0.1984)
int2<-c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0)
cont<-c(3, 1, 2, 4, 6, 7, 1, 4, 3, 2, 4, 3, 6, 1, 3)
cluster1<-c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
cluster2<-c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2)
mydata<-as.data.frame(cbind(dv, int1, int2, cont, cluster1, cluster2))
This is my non-clustered model:
result_lm <- lm(dv~int1+int2+cont,data=mydata)
To get clustered results using "cluster1" and "cluster2", I use functions from the "lmtest" and "multiwayvcov" packages as follows.
cluster_vcov<-cluster.vcov(result_lm, ~cluster1+cluster2)
result_2c<-coeftest(result_lm, cluster_vcov)
Here, "cluster_vcov" is just a variance-covariance matrix and "result_2c" is just an atomic vector. Thus, I am not able to use "predict" function to plot fits on a new dataset ("datagrid") such as
grid <- seq(0, 1, .2)
datagrid <- data.frame(int1 = rep(grid, 2),
                       int2 = c(rep(0, length(grid)),
                                rep(1, length(grid))))
datagrid$cont <- mean(mydata$cont, na.rm = TRUE)
Before moving to what I have done, here is something similar to what I would like to have eventually.
fits <- predict(result_lm,newdata=datagrid,interval="confidence")
plotdata <- data.frame(fits,datagrid)
plotdata$int2 <- plotdata$int2==1
ggplot(plotdata, aes(x = int1, y = fit, ymin = lwr, ymax = upr, color = int2)) +
  geom_line(aes(linetype = int2)) +
  geom_ribbon(alpha = .2) +
  theme(legend.position = "none") +
  scale_color_manual(values = c("red", "darkgreen")) +
  scale_linetype_manual(values = c("dashed", "solid"))
The result is the desired plot: two fitted lines with their confidence ribbons (image omitted).
To address the problem that "result_2c" does not give an object that can be used directly with "predict", I decided to construct the data myself as follows.
d_twc_result <- data.frame(matrix(0, nrow = 4, ncol = 4))
colnames(d_twc_result) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames(d_twc_result) <- c("(Intercept)", "int1", "int2", "cont")
for (j in 1:4) {
  for (i in 1:4) {
    d_twc_result[i, j] <- result_2c[i, j]
  }
}
Then, using "d_twc_result$Estimate", I generate a vector that corresponds to "fits" that one could get after running "predict".
fits<-c(1:12)
for (i in 1:12){
fits[i]<-d_twc_result$Estimate[1]+
d_twc_result$Estimate[2]*datagrid$int1[i]+
d_twc_result$Estimate[3]*datagrid$int2[i]+
d_twc_result$Estimate[4]*datagrid$cont[i]
}
Yet I was still not able to construct the vectors "lwr" and "upr", which require residuals or standard errors. Where I actually got stuck is that it seems impossible to get residuals or standard errors, because there are no observations of "dv" in the dataset "datagrid".
Nevertheless, "predict" works with the dataset "datagrid", so I guess I am misunderstanding how "predict" works, or the concept of a fit.
It would be highly appreciated if you could help me get "lwr" and "upr" (or correct my understanding of the concept of a fit). Thanks for any comment in advance.
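A sketch of one way to do this (an editor's suggestion, not from the original thread): a confidence band for the mean fit needs only the coefficient estimates and a variance-covariance matrix, so the clustered vcov can be plugged in directly via Var(Xb) = X V X'.
# Hedged sketch: propagate the clustered vcov by hand
X <- model.matrix(~ int1 + int2 + cont, data = datagrid)
beta <- coef(result_lm)                  # point estimates are unchanged by clustering
fit <- as.vector(X %*% beta)
se_fit <- sqrt(diag(X %*% cluster_vcov %*% t(X)))  # sqrt of diag(X V X')
tcrit <- qt(0.975, df = result_lm$df.residual)     # df choice is debatable with clustering
plotdata <- data.frame(datagrid,
                       fit = fit,
                       lwr = fit - tcrit * se_fit,
                       upr = fit + tcrit * se_fit)
This is also why "predict" works without "dv": a confidence interval for the fit depends only on the uncertainty of the coefficients, not on residuals for the new data.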

Extract the coefficients for the best tuning parameters of a glmnet model in caret

I am running elastic net regularization in caret using glmnet.
I pass a grid of values for alpha and lambda via tuneGrid, then perform repeatedcv to get the optimal tunings of alpha and lambda.
Here is an example where the optimal tunings for alpha and lambda are 0.7 and 0.5, respectively:
library(caret)

age <- c(4, 8, 7, 12, 6, 9, 10, 14, 7, 6, 8, 11, 11, 6, 2, 10, 14, 7, 12, 6, 9, 10, 14, 7)
gender <- make.names(as.factor(c(1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1)))
bmi_p <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88, 0.83, 0.48, 0.99, 0.80, 0.85,
0.50, 0.91, 0.29, 0.88, 0.99, 0.84, 0.80, 0.85, 0.88, 0.99)
m_edu <- make.names(as.factor(c(0, 1, 1, 2, 2, 3, 2, 0, 1, 1, 0, 1, 2, 2, 1, 2, 0, 1, 1, 2, 2, 0 , 1, 0)))
p_edu <- make.names(as.factor(c(0, 2, 2, 2, 2, 3, 2, 0, 0, 0, 1, 2, 2, 1, 3, 2, 3, 0, 0, 2, 0, 1, 0, 1)))
f_color <- make.names(as.factor(c("blue", "blue", "yellow", "red", "red", "yellow",
"yellow", "red", "yellow","blue", "blue", "yellow", "red", "red", "yellow",
"yellow", "red", "yellow", "yellow", "red", "blue", "yellow", "yellow", "red")))
asthma <- make.names(as.factor(c(1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1)))
x <- data.frame(age, gender, bmi_p, m_edu, p_edu, f_color, asthma)
tuneGrid <- expand.grid(alpha = seq(0, 1, 0.05), lambda = seq(0, 0.5, 0.05))
fitControl <- trainControl(method = 'repeatedcv', number = 3, repeats = 5,
                           classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(1352)
model.test <- caret::train(asthma ~ age + gender + bmi_p + m_edu + p_edu + f_color,
                           data = x, method = "glmnet", family = "binomial",
                           trControl = fitControl, tuneGrid = tuneGrid,
                           metric = "ROC")
model.test$bestTune
My question:
When I run as.matrix(coef(model.test$finalModel)), which I would assume gives me the coefficients corresponding to the best model, I get 100 different sets of coefficients.
So how do I get the coefficients corresponding to the best tuning?
I've seen the recommendation coef(model.test$finalModel, model.test$bestTune$lambda) to get the best model. However, this returns NULL coefficients, and in any case it would only reflect the best tuning for lambda, not for alpha as well.
EDIT:
After searching everywhere on the internet, all I can find that points me in the direction of the correct answer is this blog post, which says that model.test$finalModel returns the model corresponding to the best alpha tuning, and that coef(model.test$finalModel, model.test$bestTune$lambda) returns the set of coefficients corresponding to the best value of lambda. If this is true, then this is the answer to my question. However, as this is a single blog post and I can't find anything else to back up the claim, I am still skeptical. Can anyone validate the claim that model.test$finalModel returns the model corresponding to the best alpha? If so, this question would be solved. Thanks!
After a bit of playing with your code, I find it very odd that glmnet's train chooses different lambda ranges depending on the seed. Here is an example:
library(caret)
library(glmnet)
set.seed(13)
model.test <- caret::train(asthma ~ age + gender + bmi_p + m_edu + p_edu + f_color,
                           data = x, method = "glmnet", family = "binomial",
                           trControl = fitControl, tuneGrid = tuneGrid,
                           metric = "ROC")
c(head(model.test$finalModel$lambda, 5), tail(model.test$finalModel$lambda, 5))
#output
[1] 3.7796447301 3.4438715094 3.1379274562 2.8591626295 2.6051625017 0.0005483617 0.0004996468 0.0004552595 0.0004148155
[10] 0.0003779645
The optimal lambda is:
model.test$finalModel$lambdaOpt
#output
#[1] 0.05
and this works:
coef(model.test$finalModel, model.test$finalModel$lambdaOpt)
#12 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -0.03158974
age 0.03329806
genderX1 -1.24093677
bmi_p 1.65156913
m_eduX1 0.45314106
m_eduX2 -0.09934991
m_eduX3 -0.72360297
p_eduX1 -0.51949828
p_eduX2 -0.80063642
p_eduX3 -2.18231433
f_colorred 0.87618211
f_coloryellow -1.52699254
giving the coefficients at the best alpha and lambda.
When using this model to predict, some y are predicted as X1 and some as X0:
[1] X1 X1 X0 X1 X1 X0 X0 X1 X1 X1 X0 X1 X1 X1 X0 X0 X0 X1 X1 X1 X1 X0 X1 X1
Levels: X0 X1
Now with the seed you used:
set.seed(1352)
model.test <- caret::train(asthma ~ age + gender + bmi_p + m_edu + p_edu + f_color, data = x, method = "glmnet",
family = "binomial", trControl = fitControl, tuneGrid = tuneGrid,
metric = "ROC")
c(head(model.test$finalModel$lambda, 5), tail(model.test$finalModel$lambda, 5))
#output
[1] 2.699746e-01 2.459908e-01 2.241377e-01 2.042259e-01 1.860830e-01 3.916870e-05 3.568906e-05 3.251854e-05 2.962968e-05
[10] 2.699746e-05
The lambda values are roughly 10 times smaller, and this gives empty coefficients, since lambdaOpt (0.5) is not within the range of the fitted lambda path:
coef(model.test$finalModel, model.test$finalModel$lambdaOpt)
#output
12 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) .
age .
genderX1 .
bmi_p .
m_eduX1 .
m_eduX2 .
m_eduX3 .
p_eduX1 .
p_eduX2 .
p_eduX3 .
f_colorred .
f_coloryellow .
model.test$finalModel$lambdaOpt
#output
0.5
Now, when predicting from this model, only X0 (the first level) is predicted:
predict(model.test, x)
#output
[1] X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0 X0
Levels: X0 X1
Quite odd behavior, probably worth reporting.
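One workaround sketch (an editor's suggestion, not from the thread above): refit glmnet directly at the tuned alpha and extract coefficients at the tuned lambda via the s argument of coef(), which interpolates within the fitted lambda path; variable names follow the question's code.
# Hedged workaround: refit at bestTune$alpha, extract at bestTune$lambda
library(glmnet)
x_mat <- model.matrix(asthma ~ age + gender + bmi_p + m_edu + p_edu + f_color,
                      data = x)[, -1]
refit <- glmnet(x_mat, x$asthma, family = "binomial",
                alpha = model.test$bestTune$alpha)
coef(refit, s = model.test$bestTune$lambda)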

How to force model.matrix() function to consider NA as a level?

How can I force the model.matrix() function to treat NA as a level of a factor, just like "yellow" for example, and not delete those records?
My code:
a <- c(4,NA, 8, 7, 12, 6, 9, 10, 14, 7,9)
b <- as.factor(c(1, 0, 1,NA, 1, 1, 0, 1,1, 0, 0))
c <- c(0.86, 0.45, 0.99, 0.84, 0.85, 0.67, 0.91, 0.29, 0.88,1,0.5)
d <- as.factor(c(0, 1, 1, 2, 2, 3, 2, 0, NA,NA,NA))
e <- as.factor(c(0, 2, 2, 2, 2, 3, 2, 0, 0 , 2,0))
f <- as.factor(c("blue", "blue", "yellow", "red", "red", "yellow",
"yellow", "red", "yellow" , "blue" , "red"))
g <- c(1, 1, 0, 1, 0, 0, 0, 1, 1 , 0,0)
mat <- model.matrix(g ~ b + d + e + f)[, -1]
mat # the rows containing NA have been dropped
Any help would be greatly appreciated.
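A sketch of one common approach (an editor's suggestion, not from the thread): make NA an explicit factor level with addNA() before calling model.matrix(), so those rows are kept and the NA level gets its own dummy column.
# Hedged sketch: NA becomes an explicit level, so no rows are dropped
b2 <- addNA(b)
d2 <- addNA(d)
mat2 <- model.matrix(g ~ b2 + d2 + e + f)[, -1]
nrow(mat2)  # all 11 rows retained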
