I have a glm model that works. Since I'd like to add (ridge) regularization, I thought I'd switch to glmnet. For some reason I cannot get glmnet to work: it seems to always predict the first class, never the second, which results in low accuracy and kappa = 0.
Below is some code to reproduce the problem. What am I doing wrong?
The test data it generates looks like this:
Since the data cannot be linearly separated, two polynomial terms, A^2 and B^2, are added.
A glm model predicts the data correctly (with accuracy = 1 and kappa = 1). Here is its prediction boundary:
While a glmnet model always has kappa = 0, no matter what lambda it tries:
lambda Accuracy Kappa Accuracy SD Kappa SD
0 0.746 0 0.0295 0
1e-04 0.746 0 0.0295 0
0.01 0.746 0 0.0295 0
0.1 0.746 0 0.0295 0
1 0.746 0 0.0295 0
10 0.746 0 0.0295 0
Code to reproduce the problem:
library(caret)
# generate test data
set.seed(42)
n <- 500; m <- 100
data <- data.frame(A=runif(n, 98, 102), B=runif(n, 98, 102), Type="foo")
data <- subset(data, sqrt((A-100)^2 + (B-100)^2) > 1.5)
data <- rbind(data, data.frame(A=rnorm(m, 100, 0.25), B=rnorm(m, 100, 0.25), Type="bar"))
# add a few polynomial features to match ellipses
polymap <- function(data) cbind(data, A2=data$A^2, B2=data$B^2)
data <- polymap(data)
plot(x=data$A, y=data$B, pch=21, bg=data$Type, xlab="A", ylab="B")
# train a binomial glm model
model.glm <- train(Type ~ ., data=data, method="glm", family="binomial",
preProcess=c("center", "scale"))
# train a binomial glmnet model with ridge regularization (alpha = 0)
model.glmnet <- train(Type ~ ., data=data, method="glmnet", family="binomial",
preProcess=c("center", "scale"),
tuneGrid=expand.grid(alpha=0, lambda=c(0, 0.0001, 0.01, 0.1, 1, 10)))
print(model.glm) # <- Accuracy = 1, Kappa = 1 - good!
print(model.glmnet) # <- Accuracy = low, Kappa = 0 - bad!
Calling glmnet directly (without caret) results in the same problem:
library(glmnet)
x <- as.matrix(subset(data, select=-c(Type)))
y <- data$Type
model.glmnet2 <- cv.glmnet(x=x, y=y, family="binomial", type.measure="class")
preds <- predict(model.glmnet2, x, type="class", s="lambda.min")
# all predictions are class 1...
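A quick cross-tabulation of the predictions against the true labels (just a sanity check, a minimal sketch) makes the degenerate behaviour obvious:
# every prediction lands in a single class
table(predicted = as.vector(preds), actual = y)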
EDIT: Plot of the scaled data and the decision boundary found by glm:
Model: -37 + 6317*A + 6059*B - 6316*A2 - 6059*B2
You should center and scale the data prior to making polynomial versions of the predictors. Numerically, things work better that way:
set.seed(42)
n <- 500; m <- 100
data <- data.frame(A=runif(n, 98, 102), B=runif(n, 98, 102), Type="foo")
data <- subset(data, sqrt((A-100)^2 + (B-100)^2) > 1.5)
data <- rbind(data, data.frame(A=rnorm(m, 100, 0.25), B=rnorm(m, 100, 0.25), Type="bar"))
data2 <- data
data2$A <- scale(data2$A, scale = TRUE)
data2$B <- scale(data2$B, scale = TRUE)
data2$A2 <- data2$A^2
data2$B2 <- data2$B^2
# train a binomial glm model
model.glm2 <- train(Type ~ ., data=data2, method="glm")
# train a binomial glmnet model with ridge regularization (alpha = 0)
model.glmnet2 <- train(Type ~ ., data=data2, method="glmnet",
tuneGrid=expand.grid(alpha=0,
lambda=c(0, 0.0001, 0.01, 0.1, 1, 10)))
From these:
> getTrainPerf(model.glm2)
TrainAccuracy TrainKappa method
1 1 1 glm
> getTrainPerf(model.glmnet2)
TrainAccuracy TrainKappa method
1 1 1 glmnet
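One caveat worth adding (a minimal sketch, not part of the original answer): because the scaling is now done outside of caret's preProcess, new data must be centred and scaled with the training attributes stored by scale() before the squared terms are rebuilt, for example:
# sketch only: reuse the training centers/scales on hypothetical new points
ctrA <- attr(data2$A, "scaled:center"); sclA <- attr(data2$A, "scaled:scale")
ctrB <- attr(data2$B, "scaled:center"); sclB <- attr(data2$B, "scaled:scale")
newpts <- data.frame(A = c(99.5, 100.2), B = c(100.1, 98.9))  # made-up points
newpts$A <- (newpts$A - ctrA) / sclA
newpts$B <- (newpts$B - ctrB) / sclB
newpts$A2 <- newpts$A^2
newpts$B2 <- newpts$B^2
predict(model.glmnet2, newdata = newpts)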
Max
I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-class classification problem using a random forest classifier. I could only find chunks of code online, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(0.75 * nrow(ITA)), replace = FALSE)
train <- ITA[sample, ]
test <- ITA[-sample, ]
set.seed(42)
rftrain <-randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3 <- function(object, newdata)
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[, 3]
shap_values_G <- fastshap::explain(rftrain,
                                   X = train,
                                   pred_wrapper = p_function_3,
                                   nsim = 50,
                                   newdata = train[which(y == 3), ])
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
fit,
X = train, # Reference data
feature_names = xvars,
pred_wrapper = pred_3,
nsim = 50,
newdata = train[train$Species == "virginica", ] # For these rows, you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
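Since the original question asks for SHAP values for every category of the outcome, one possible extension (a sketch only, not tested on the ITA data) is to build one prediction wrapper per class level and call explain() in a loop:
# sketch: one set of SHAP values per outcome level
shap_by_class <- lapply(levels(train$Species), function(cl) {
  pw <- function(model, newdata) {
    predict(model, newdata = newdata, type = "prob")[, cl]
  }
  fastshap::explain(
    fit,
    X = train,
    feature_names = xvars,
    pred_wrapper = pw,
    nsim = 50,
    newdata = train[train$Species == cl, ]
  )
})
names(shap_by_class) <- levels(train$Species)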
In glm() it is possible to model Bernoulli (0/1) outcomes with a logistic regression using the following sort of syntax.
glm(bin ~ x, df, family = "binomial")
However you can also perform aggregated binomial regression, where each observation represents a count of target events from a certain fixed number of bernoulli trials. For example see the following data:
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
With this sort of data you use slightly different syntax in glm():
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
#
# Coefficients:
# (Intercept) x
# -0.3064 0.6786
#
# Degrees of Freedom: 49 Total (i.e. Null); 48 Residual
# Null Deviance: 53.72
# Residual Deviance: 39.54 AIC: 178
I was wondering is it possible to model this type of aggregated binomial data in the glmnet package? If so, what is the syntax?
Yes, you can do it as follows:
set.seed(1)
n <- 50
cov <- 10
x <- c(rep(0,n/2), rep(1, n/2))
x = cbind(x, xx = c(rep(0.5,20), rep(0.7, 20), rep(1,10)))
p <- 0.4 + 0.2*x
y <- rbinom(n, cov, p)
I added another covariate here, called xx, as glmnet requires a minimum of two covariates.
In glm, as you have it in your post:
mod <- glm(cbind(y, cov-y) ~ x, family="binomial")
mod
# output
# Call: glm(formula = cbind(y, cov - y) ~ x, family = "binomial")
# Coefficients:
# (Intercept) xx xxx
# 0.04366 0.86126 -0.64862
# Degrees of Freedom: 49 Total (i.e. Null); 47 Residual
# Null Deviance: 53.72
# Residual Deviance: 38.82 AIC: 179.3
In glmnet, set lambda=0 (no regularization) to reproduce results similar to glm:
library(glmnet)
fit = glmnet(x, cbind(cov-y,y), family="binomial", lambda=0)
coef(fit)
# output
# 3 x 1 sparse Matrix of class "dgCMatrix"
# s0
# (Intercept) 0.04352689
# x 0.86111234
# xx -0.64831806
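Once the two-column count response is in place, you can also let cv.glmnet pick the amount of regularization with the same syntax (a sketch; adjust alpha for ridge vs. lasso):
# sketch: cross-validated lambda with the same aggregated binomial response
cvfit <- cv.glmnet(x, cbind(cov - y, y), family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")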
I tried computing a confusion matrix for my glm model but I keep getting:
Error: data and reference should be factors with the same levels.
Below is my model:
model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
confusionMatrix(table(predict(model3, newdata=test_set, type="response")) >= 0.5,
train_set$winner == 1)
winner variable contains team1 and team2.
srs.1 and srs.2 are numerical values.
What is my problem here?
I suppose your winner label is binary (0/1), so let's use the example below:
library(caret)
set.seed(111)
data = data.frame(
srs.1 = rnorm(200),
srs.2 = rnorm(200)
)
data$winner = ifelse(data$srs.1*data$srs.2 > 0,1,0)
idx = sample(nrow(data),150)
train_set = data[idx,]
test_set = data[-idx,]
model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
Like you did, we predict: if the probability is > 0.5 the prediction is 1, else 0. You got the table() about right. Note that the predictions and the reference must come from the same data set, either test_set or train_set:
pred = as.numeric(predict(model3, newdata=test_set, type="response")>0.5)
ref = test_set$winner
confusionMatrix(table(pred,ref))
Confusion Matrix and Statistics
ref
pred 0 1
0 12 5
1 19 14
Accuracy : 0.52
95% CI : (0.3742, 0.6634)
No Information Rate : 0.62
P-Value [Acc > NIR] : 0.943973
Kappa : 0.1085
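Alternatively, you can skip table() and pass two factors with identical levels, which is exactly what the error message asks for (a small sketch):
# sketch: factors with the same levels can be passed to confusionMatrix() directly
pred_f <- factor(as.numeric(predict(model3, newdata = test_set, type = "response") > 0.5),
                 levels = c(0, 1))
ref_f <- factor(test_set$winner, levels = c(0, 1))
confusionMatrix(data = pred_f, reference = ref_f)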
I'm performing a cross validation on a competing risks proportional hazards model. With help from the mstate package, I've prepared my data and am fitting it with survival::coxph. I get a fitted Cox model object for my training data, but I want to evaluate the partial likelihood of my trained coefficients with my test data.
If I need to, I'll write the partial likelihood function myself, but I'd rather not (though it would probably be good for me). The survival package calculates it in C code, but the likelihood calculation is embedded in the fitting function. Maybe there's a way to fix parameters, or some other tool to easily get at the partial likelihood?
Minimal Working Example
# Adapted from examples in the mstate vignette
# http://cran.r-project.org/web/packages/mstate/vignettes/Tutorial.pdf
# beginning at the bottom of page 28
library(mstate)
library(survival)
# Get data. I add a second explanatory variable (badx) for illustration
# Also divide the data by subject into training and test sets.
data(aidssi)
si <- aidssi # Just a shorter name
si$badx <- sample(c("A", "B"), size = nrow(si), replace = TRUE)
si$fold <- sample(c("train", "test"), size = nrow(si), replace = TRUE, prob = c(0.7, 0.3))
tmat <- trans.comprisk(2, names = c("event-free", "AIDS", "SI"))
si$stat1 <- as.numeric(si$status == 1)
si$stat2 <- as.numeric(si$status == 2)
# Convert the data to a long competing risks format
silong <- msprep(time = c(NA, "time", "time"),
status = c(NA,"stat1", "stat2"),
data = si, keep = c("ccr5", "badx", "fold"), trans = tmat)
silong <- na.omit(silong)
silong <- expand.covs(silong, c("ccr5", "badx"))
train.dat <- subset(silong, fold == "train")
test.dat <- subset(silong, fold == "test")
Data looks like this:
> head(silong)
An object of class 'msdata'
Data:
id from to trans Tstart Tstop time status ccr5 badx fold ccr5WM.1 ccr5WM.2 badxB.1 badxB.2
1 1 1 2 1 0 9.106 9.106 1 WW A train 0 0 0 0
2 1 1 3 2 0 9.106 9.106 0 WW A train 0 0 0 0
3 2 1 2 1 0 11.039 11.039 0 WM B train 1 0 1 0
4 2 1 3 2 0 11.039 11.039 0 WM B train 0 1 0 1
5 3 1 2 1 0 2.234 2.234 1 WW B train 0 0 1 0
6 3 1 3 2 0 2.234 2.234 0 WW B train 0 0 0 1
Now, the ccr5 variable could be modeled as transition-specific, or as a having equal proportional effect for all transitions. The models are:
train.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
data = train.dat)
train.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
data = train.dat)
Now I would like to use the test data to evaluate the variable selection: should ccr5 be transition-specific or not?
I have a large data set and many variables (mostly but not all categorical) that could go either way. The evaluation is where I'm stuck.
# We can fit the same models to the test data,
# this yields new parameter estimates of course,
# but the model matrices might be useful
test.mod.equal <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
data = test.dat)
test.mod.specific <- coxph(Surv(time, status) ~ ccr5WM.1 + ccr5WM.2 + badx + strata(trans),
data = test.dat)
test.eq.mm <- model.matrix(test.mod.equal)
test.sp.mm <- model.matrix(test.mod.specific)
# We can use these to get the first part of the sum of the partial likelihood:
xbeta.eq <- test.eq.mm[test.dat$status == 1, ] %*% coef(train.mod.equal)
xbeta.sp <- test.sp.mm[test.dat$status == 1, ] %*% coef(train.mod.specific)
# We can also get linear predictors
lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
lp.sp <- predict(train.mod.specific, newdata = test.dat, type = "lp")
I'm hoping to calculate the partial likelihood for each of the models on the test data with the training coefficient estimates. Maybe I should move the question to Cross Validated and ask if the sum of the linear predictors (or the sum of the linear predictors excluding censored cases) is close enough to an equivalent measure.
This is what I was proposing when I wrote: 'Can you calculate a "neo-model" (using the [new data]) with a formula that includes an offset [built with] beta estimates [from the original fit], and then use summary(mdl) to do the heavy lifting for you? You might even be able to calculate the offset with predict.coxph.' Turns out I don't need to use summary.coxph since print.coxph gives the LLR statistic.
lp.eq <- predict(train.mod.equal, newdata = test.dat, type = "lp")
eq.test.mod <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq),
data=test.dat )
eq.test.mod
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) +
offset(lp.eq), data = test.dat)
coef exp(coef) se(coef) z p
ccr5WM -0.20841 0.812 0.323 -0.6459 0.52
badxB -0.00829 0.992 0.235 -0.0354 0.97
Likelihood ratio test=0.44 on 2 df, p=0.804 n= 212, number of events= 74
I would interpret this to mean that a similar model, fit with the predictions based on the first model but with new data, was not significantly different (from a null model) and that, on a log-likelihood scale, it was 0.44 "away" from an exact fit.
As pointed out by @Gregor, one can access the 'loglik' node of the coxph object, but I would advise against attaching too much meaning to the individual values. To get the LRT statistic one could use:
> diff(eq.test.mod$loglik)
[1] 0.399137
For interest sake, also look at the result without the offset:
> coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
+ data=test.dat)
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans),
data = test.dat)
coef exp(coef) se(coef) z p
ccr5WM -0.8618 0.422 0.323 -2.671 0.0076
badxB -0.0589 0.943 0.235 -0.251 0.8000
Likelihood ratio test=8.42 on 2 df, p=0.0148 n= 212, number of events= 74
And you do get the expected result when testing against the original data:
> lp.eq2 <- predict(train.mod.equal, newdata = train.dat, type = "lp")
> coxph(Surv(time, status) ~ ccr5 + badx + strata(trans)+offset(lp.eq2),
+ data=train.dat)
Call:
coxph(formula = Surv(time, status) ~ ccr5 + badx + strata(trans) +
offset(lp.eq2), data = train.dat)
coef exp(coef) se(coef) z p
ccr5WM -4.67e-12 1 0.230 -2.03e-11 1
badxB 2.57e-14 1 0.168 1.53e-13 1
Likelihood ratio test=0 on 2 df, p=1 n= 436, number of events= 146
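If you want the test-set partial likelihood evaluated exactly at the training coefficients (rather than the LRT around an offset), one option is to fix the coefficients via coxph's init argument and do zero iterations; this is a sketch of that idea:
# sketch: zero Newton iterations, so loglik[2] is the test-data partial
# log-likelihood evaluated at the training coefficients
ll.eq <- coxph(Surv(time, status) ~ ccr5 + badx + strata(trans),
               data = test.dat,
               init = coef(train.mod.equal),
               control = coxph.control(iter.max = 0))
ll.eq$loglik[2]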
I'm using caret with a custom fitting metric, but I need to maximize not just this metric but the lower bound of its confidence interval. So I'd like to maximize something like mean(metric) - k * stddev(metric). I know how to do this manually, but is there a way to tell caret to automatically select the best parameters using this function?
Yes, you can define your own selection metric through the "summaryFunction" parameter of your "trainControl" object and then with the "metric" parameter of your call to train(). Details on this are pretty well documented in the "Alternate Performance Metrics" section on caret's model tuning page: http://caret.r-forge.r-project.org/training.html
I don't think you gave enough information for anyone to write exactly what you're looking for, but here is an example using the code from the twoClassSummary function:
> library(caret)
> data(Titanic)
>
> #an example custom function
> roc <- function (data, lev = NULL, model = NULL) {
+ require(pROC)
+ if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
+ stop("levels of observed and predicted data do not match")
+ rocObject <- try(pROC:::roc(data$obs, data[, lev[1]]), silent = TRUE)
+ rocAUC <- if (class(rocObject)[1] == "try-error")
+ NA
+ else rocObject$auc
+ out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
+ names(out) <- c("ROC", "Sens", "Spec")
+ out
+ }
>
> #your train control specs
> tc <- trainControl(method="cv",classProb=TRUE,summaryFunction=roc)
> # your model with the selection metric specified
> train(Survived~.,data=data.frame(Titanic),method="rf",trControl=tc,metric="ROC")
32 samples
4 predictors
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 28, 29, 30, 30, 28, 28, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec ROC SD Sens SD Spec SD
2 0.9 0.2 0.25 0.175 0.35 0.425
4 0.85 0.4 0.6 0.211 0.459 0.459
6 0.875 0.35 0.6 0.212 0.412 0.459
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
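Note that summaryFunction only sees one resample at a time, so the mean(metric) - k * SD(metric) criterion itself is better expressed as a custom selectionFunction in trainControl. A sketch follows; it assumes the resample SD sits in a column named paste0(metric, "SD") of the results table (check names(yourModel$results) for your caret version):
# sketch of a "mean minus k * SD" selection rule
lowerBound <- function(x, metric, maximize, ...) {
  k <- 1                                   # how many SDs below/above the mean
  sdcol <- paste0(metric, "SD")
  if (maximize) which.max(x[, metric] - k * x[, sdcol])
  else which.min(x[, metric] + k * x[, sdcol])
}
tc2 <- trainControl(method = "cv", classProbs = TRUE,
                    summaryFunction = roc, selectionFunction = lowerBound)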
There is a more basic example in caret's help for the train() function:
madSummary <- function(data, lev = NULL, model = NULL) {
  out <- mad(data$obs - data$pred, na.rm = TRUE)
  names(out) <- "MAD"
  out
}
robustControl <- trainControl(summaryFunction = madSummary)
marsGrid <- expand.grid(degree = 1, nprune = (1:10) * 2)
data(BostonHousing, package = "mlbench")  # load the outcome data used below
earthFit <- train(medv ~ .,
data = BostonHousing,
method = "earth",
tuneGrid = marsGrid,
metric = "MAD",
maximize = FALSE,
trControl = robustControl)