Simple Logistic Regression in a Loop? (R)

I have a bunch of features for a multivariate logistic regression, but I want to test each feature individually in multiple univariate logistic regressions.
I'm trying to do this in a loop like so:
features <- c("f1", "f2", "f3", "f4")
out <- list()
for (f in features) {
  mod <- train(form = positive ~ f,
               data = training,
               method = "glm",
               metric = "ROC",
               family = "binomial")
  out <- append(out, mod)
}
I'm getting an error saying variable lengths differ (found for 'f'). I think it's not recognizing f as the column name? How can I fix this?

For future reference, here is an answer with a reprex that uses the same solution proposed by @Rorschach:
x <- runif(50, min = 0, max = 100)
z <- runif(50, min = 0, max = 100)
a <- runif(50, min = 0, max = 100)
b <- runif(50, min = 0, max = 100)
positive <- rbinom(50, 1, 0.4)

training <- as.data.frame(cbind(x, z, a, b, positive = positive))
training$positive <- factor(training$positive)

library(caret)

features <- c("x", "z", "a", "b")
out <- list()
for (f in features) {
  mod <- train(form = as.formula(paste("positive ~ ", f)),
               data = training,
               method = "glm",
               family = "binomial")
  out <- append(out, mod)
}
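One small aside: append() splices every element of the train object into out, so each fitted model loses its structure in the result list. A hedged alternative sketch, using the same objects as above, stores each fit under its feature name instead:
out <- list()
for (f in features) {
  # keep the whole train object as a single, named list element
  out[[f]] <- train(form = as.formula(paste("positive ~", f)),
                    data = training,
                    method = "glm",
                    family = "binomial")
}
out$x  # the univariate model for feature x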

Related

Vectors for simulations

The code I have so far is written below. This is a simulation, so there is no actual data. I have two questions:
I have two vectors (treat and cont), but I need to put them into one single vector, which I did (vect). However, I also need another vector that codes for treatment vs. control. How do I do that?
For my model (model) I need to fit a linear model testing for a treatment effect, but I don't know how to add that effect to what I have, or is that what the code I have is already testing?
library(car)

treat = rnorm(3, mean = 460, sd = 110)
treat
cont = rnorm(3, mean = 415, sd = 110)
cont
vect = c(treat, cont)
vect

nsims = 1000
p.value.saved = coeff.saved = vector()
for (i in 1:nsims) {
  treat = rnorm(3, mean = 460, sd = 110)
  cont = rnorm(3, mean = 415, sd = 110)
  vect = c(treat, cont)
  model = glm(treat ~ cont, family = poisson)
  p.value.saved[i] = Anova(model)$P[1]
  coeff.saved[i] = coef(model)
}
Thank you!
Something like this? (Note that you'll get a bunch of warnings for running a Poisson regression against continuous data.)
n <- 3
nsims <- 10

do.call(
  rbind,
  lapply(1:nsims, function(.) {
    treat <- rnorm(n, mean = 460, sd = 110)
    cont <- rnorm(n, mean = 415, sd = 110)
    # Instead of vect
    df <- data.frame(
      y = c(treat, cont),
      x = rep(c("treat", "cont"), each = n)
    )
    # Model the values vs treatment indicator
    model <- glm(y ~ x, data = df, family = poisson)
    # Extract the model's p-value and coefficient of treatment.
    data.frame(p = car::Anova(model)$P, coef = coef(model)[2])
  })
)
The first line creates the labels and the second combines them with the values. In your example both groups have length 3, hence the 3 in rep(). Note that the labels have to follow the same order as c(treat, cont), i.e. the treatment values come first:
treat_lab <- c(rep("trt", 3), rep("control", 3))
treatment <- cbind(treat_lab, c(treat, cont))
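To answer the second question with that label vector, here is a hedged sketch of fitting and testing the treatment effect (same simulated objects as above; lm() is used because the response is continuous):
sim_df <- data.frame(y = c(treat, cont),
                     group = factor(treat_lab, levels = c("control", "trt")))
fit <- lm(y ~ group, data = sim_df)
summary(fit)$coefficients  # the "grouptrt" row is the estimated treatment effect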

How can I train a glmnet model (Poisson family) with an offset term using the caret package in R?

I want to model insurance claim count using a Poisson glmnet. The data I have at hand contains the number of claims for each policy (which is the response variable), some features about the policy (gender, region, etc.) as well as the duration of the policy (in years). I want to include the log-duration as an offset term, as we usually do in actuarial science. With the cv.glmnet function of the glmnet package, it is straightforward:
library(tidyverse)
library(glmnet)

n <- 100

dat <- tibble(
  nb_claims = rpois(n, lambda = 0.5),
  duration = runif(n),
  x1 = runif(n),
  x2 = runif(n),
  x3 = runif(n)
)

fit <- cv.glmnet(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)
fit
However, my goal is to train this model using the train function of the caret package, because of the many advantages it gives. Indeed, validation, preprocessing and feature selection are much more convenient with this package. It is straightforward to train a basic glmnet (without an offset term) with caret:
library(caret)

fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson"
)
fit
Naively, we could try to add the offset argument in the train function:
fit <- caret::train(
  x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
  y = dat %>% pull(nb_claims),
  method = "glmnet",
  family = "poisson",
  offset = dat %>% pull(duration) %>% log()
)
fit
Unfortunately, this code throws the error Error : No newoffset provided for prediction, yet offset used in fit of glmnet. This error occurs because the caret::train function does not pass a value for the newoffset argument to the predict.glmnet function.
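To make the failure mode concrete, here is a hedged illustration at the glmnet level, reusing the simulated data from above: once a glmnet fit uses an offset, predict.glmnet insists on a newoffset.
x_mat   <- dat %>% dplyr::select(x1, x2, x3) %>% as.matrix()
log_dur <- dat %>% pull(duration) %>% log()
g <- glmnet::glmnet(x_mat, dat$nb_claims, family = "poisson", offset = log_dur)

predict(g, newx = x_mat)                       # errors: offset used in fit, no newoffset supplied
predict(g, newx = x_mat, newoffset = log_dur)  # works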
In this book, they show how to add an offset term to a GLM model by modifying the source code of the caret::train function. It works perfectly. However, the predict.glm function is quite different from the predict.glmnet function, because it does not have the newoffset argument. I tried to modify the source code of the caret::train function, but I am having some trouble because I do not know well enough how this function works.
A simple way to do this is to pass the offset column as part of x; in each fit and predict call, use the columns of x that are not the offset as the predictors, and pass the column corresponding to the offset as offset/newoffset.
In the following example, the offset column of x needs to be named "offset" too. This can be changed relatively easily.
To create the custom model we will just reuse large parts of: https://github.com/topepo/caret/blob/master/models/files/glmnet.R
glmnet is peculiar since it needs a loop; the rest is just rinse and repeat from https://topepo.github.io/caret/using-your-own-model-in-train.html#illustrative-example-1-svms-with-laplacian-kernels
family = "poisson" is specified throughout; to change this, adapt the code from https://github.com/topepo/caret/blob/master/models/files/glmnet.R
glmnet_offset <- list(type = "Regression",
                      library = c("glmnet", "Matrix"),
                      loop = function(grid) {
                        alph <- unique(grid$alpha)
                        loop <- data.frame(alpha = alph)
                        loop$lambda <- NA
                        submodels <- vector(mode = "list", length = length(alph))
                        for (i in seq(along = alph)) {
                          np <- grid[grid$alpha == alph[i], "lambda"]
                          loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)]
                          submodels[[i]] <- data.frame(lambda = np[-which.max(np)])
                        }
                        list(loop = loop, submodels = submodels)
                      })
glmnet_offset$parameters <- data.frame(parameter = c('alpha', 'lambda'),
                                       class = c("numeric", "numeric"),
                                       label = c('Mixing Percentage', 'Regularization Parameter'))

glmnet_offset$grid <- function(x, y, len = NULL, search = "grid") {
  if (search == "grid") {
    init <- glmnet::glmnet(Matrix::as.matrix(x[, colnames(x) != "offset"]), y,
                           family = "poisson",
                           nlambda = len + 2,
                           alpha = .5,
                           offset = x[, colnames(x) == "offset"])
    lambda <- unique(init$lambda)
    lambda <- lambda[-c(1, length(lambda))]
    lambda <- lambda[1:min(length(lambda), len)]
    out <- expand.grid(alpha = seq(0.1, 1, length = len),
                       lambda = lambda)
  } else {
    out <- data.frame(alpha = runif(len, min = 0, 1),
                      lambda = 2^runif(len, min = -10, 3))
  }
  out
}
So x[, colnames(x) != "offset"] plays the role of x, while x[, colnames(x) == "offset"] is the offset:
glmnet_offset$fit <- function(x, y, wts, param, last, ...) {
  theDots <- list(...)

  ## pass in any model weights
  if (!is.null(wts)) theDots$weights <- wts

  if (!(class(x)[1] %in% c("matrix", "sparseMatrix")))
    x <- Matrix::as.matrix(x)

  modelArgs <- c(list(x = x[, colnames(x) != "offset"],
                      y = y,
                      alpha = param$alpha,
                      family = "poisson",
                      offset = x[, colnames(x) == "offset"]),
                 theDots)

  out <- do.call(glmnet::glmnet, modelArgs)
  if (!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
  out
}
glmnet_offset$predict <- function(modelFit, newdata, submodels = NULL) {
  if (!is.matrix(newdata)) newdata <- Matrix::as.matrix(newdata)
  out <- predict(modelFit,
                 newdata[, colnames(newdata) != "offset"],
                 s = modelFit$lambdaOpt,
                 newoffset = newdata[, colnames(newdata) == "offset"],
                 type = "response") # important for measures to be appropriate
  if (is.matrix(out)) out <- out[, 1]
  if (!is.null(submodels)) {
    tmp <- as.list(as.data.frame(predict(modelFit,
                                         newdata[, colnames(newdata) != "offset"],
                                         s = submodels$lambda,
                                         newoffset = newdata[, colnames(newdata) == "offset"],
                                         type = "response"),
                                 stringsAsFactors = TRUE))
    out <- c(list(out), tmp)
  }
  out
}
For some reason, which I don't understand yet, it does not work without the prob slot:
glmnet_offset$prob <- glmnet_offset$predict
glmnet_offset$tags <- c("Generalized Linear Model", "Implicit Feature Selection",
                        "L1 Regularization", "L2 Regularization", "Linear Classifier",
                        "Linear Regression")

glmnet_offset$sort <- function(x) x[order(-x$lambda, x$alpha), ]

glmnet_offset$trim <- function(x) {
  x$call <- NULL
  x$df <- NULL
  x$dev.ratio <- NULL
  x
}
library(tidyverse)
library(caret)
library(glmnet)

n <- 100
set.seed(123)
dat <- tibble(
  nb_claims = rpois(n, lambda = 0.5),
  duration = runif(n),
  x1 = runif(n),
  x2 = runif(n),
  x3 = runif(n)
)

x <- dat %>%
  dplyr::select(-nb_claims) %>%
  mutate(offset = log(duration)) %>%
  dplyr::select(-duration) %>%
  as.matrix()
fit <- caret::train(
  x = x,
  y = dat %>% pull(nb_claims),
  method = glmnet_offset
)
fit
100 samples
4 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.10 0.0001640335 0.7152018 0.01805762 0.5814200
0.10 0.0016403346 0.7152013 0.01805684 0.5814193
0.10 0.0164033456 0.7130390 0.01798125 0.5803747
0.55 0.0001640335 0.7151988 0.01804917 0.5814020
0.55 0.0016403346 0.7150312 0.01802689 0.5812936
0.55 0.0164033456 0.7095996 0.01764947 0.5783706
1.00 0.0001640335 0.7152033 0.01804795 0.5813997
1.00 0.0016403346 0.7146528 0.01798979 0.5810811
1.00 0.0164033456 0.7063482 0.01732168 0.5763653
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.01640335.
predict(fit$finalModel, x[,1:3], newoffset = x[,4]) #works
This will not work with preprocessing in caret, since we pass the offset as one of the features. However, it will work with recipes, because there you can define the columns on which preprocessing functions are performed via selections. See this article for details: https://tidymodels.github.io/recipes/articles/Selecting_Variables.html
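As an untested, hedged sketch of that idea (assuming the recipes package and the same column layout as above, with the log-duration stored in a column literally named "offset"), a recipe can restrict centering and scaling to the true predictors:
library(recipes)

dat2 <- dat %>%
  mutate(offset = log(duration)) %>%
  dplyr::select(nb_claims, x1, x2, x3, offset)

rec <- recipe(nb_claims ~ x1 + x2 + x3 + offset, data = dat2) %>%
  step_center(x1, x2, x3) %>%  # preprocess only the real predictors,
  step_scale(x1, x2, x3)       # leaving the "offset" column untouched

# fit_rec <- caret::train(rec, data = dat2, method = glmnet_offset)  # sketch, not verified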
I haven't had time to error-check my code. If any problems occur or if there is a mistake somewhere, please comment. Thanks.
You can also post an issue on the caret GitHub asking for this feature (offset/newoffset) to be added to the model.
I tried to change the model info in a lot of ways, but it was failing miserably. Below I propose one solution; it may not be the best, but it will get you somewhere if your data is sensible.
In Poisson / negative binomial regression, the offset enters the linear predictor on the log scale (you can read more here and here):
log E[Y | x] = beta_0 + beta_1 * x_1 + ... + log(t_x)
where t_x is the offset (the exposure). In glmnet, there is a penalty factor you can introduce for each term, and if you set it to 0 for a term, that term is not penalized and is always included. We can use that for the offset, and you can see this effect only if you use a dataset that makes some sense (note that in your example dataset, the offsets are numbers that make no sense).
Below I use the insurance claims dataset from MASS:
library(tidyverse)
library(glmnet)
library(MASS)
dat <- Insurance
X = model.matrix(Claims ~ District + Group + Age,data=dat)
Y = dat$Claims
OFF = log(dat$Holders)
fit_cv <- cv.glmnet(
x = X,
y = Y,
family = "poisson",
offset = OFF
)
Now, using caret, I will fit it without any tuning or resampling, using the same lambda obtained from the cv.glmnet fit. One thing you should note too is that cv.glmnet often uses lambda.1se instead of lambda.min:
fit_c <- caret::train(
  x = cbind(X, OFF),
  y = Y,
  method = "glmnet",
  family = "poisson",
  tuneGrid = data.frame(lambda = fit_cv$lambda.1se, alpha = 1),
  penalty.factor = c(rep(1, ncol(X)), 0),
  trControl = trainControl(method = "none")
)
We can see how different the predictions are:
p1 <- predict(fit_cv, newx = X, newoffset = OFF)
p2 <- predict(fit_c, newdata = cbind(X, OFF))
plot(p1, p2)

Save Gradient Boosting Machine values obtained with Bootstrap

I am fitting a gradient boosting machine to identify the importance of the variables in the model, and I am using resampling to see how the importance of each variable behaves.
But I can't correctly save each variable's name together with its importance calculated in each bootstrap sample.
I'm doing this with a function that is called from the boot package's boot() command.
Below is a minimal reproducible example adapted to the AmesHousing data:
library(gbm)
library(boot)
library(AmesHousing)

df <- make_ames()

imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  gbm.fit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  return(summary(gbm.fit)[, 2])
}

results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
results_GBM$t0
I expect to save the bootstrap results with their variable names but I can only save the importance of variables without their names.
With summary.gbm, the default is to order the variables according to importance; you need to set order = FALSE, and also suppress the plot with plotit = FALSE. Then the returned variable importances are in the same order as the variables in the fit, so you can attach the names:
imp_gbm <- function(data, indices) {
  d <- data[indices, ]
  # use gbmfit as the object name because gbm.fit is also a function
  gbmfit <- gbm(
    formula = Sale_Price ~ .,
    distribution = "gaussian",
    data = d,
    n.trees = 100,
    interaction.depth = 5,
    shrinkage = 0.1,
    cv.folds = 5,
    n.cores = NULL,
    verbose = FALSE
  )
  o <- summary(gbmfit, plotit = FALSE, order = FALSE)[, 2]
  names(o) <- gbmfit$var.names
  return(o)
}
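As a hedged usage sketch (assuming the revised function above), the names then propagate to the bootstrap output, and you can label the matrix of replicates yourself:
results_GBM <- boot(data = df, statistic = imp_gbm, R = 100)
results_GBM$t0                                    # named vector of importances on the original data
colnames(results_GBM$t) <- names(results_GBM$t0)  # label the R x n_variables matrix of bootstrap replicates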

Iterate two arguments with map2 (purrr function)

I want to calculate all possible predictions with different probability cut-offs for my data with multiple models. The result is a list.
df <- iris
df$y <- sample(0:1, nrow(df), replace = TRUE)

set.seed(101)
# Now selecting 80% of the data as a sample from the total 'n' rows
sample <- sample.int(n = nrow(df), size = floor(.8 * nrow(df)), replace = F)
train <- df[sample, ]
test <- df[-sample, ]
Then I create the logistic models:
full <- glm(y ~ ., data = train, family = "binomial")
min <- glm(y ~ 1, data = train, family = "binomial")
backward <- step(full, direction = "backward", trace = 0)
forward <- step(min, scope = list(lower = min, upper = full), direction = "forward", trace = 0)
model2 <- glm(y ~ Sepal.Length + Sepal.Width, data = train, family = "binomial")

models <- list(backward, forward, model2)
prediction <- lapply(models, function(x) { predict(x, newdata = test, type = "response") })
First of all, I have a table with predictions. Then I created a vector with all possible probability cut-offs:
p <- seq(from = 0.1, to = 0.9, by = 0.5)
The problem is that I want to apply the different break points. I tried the map2 function of the purrr package, but it doesn't work:
pred = map2(prediction,p, function(x,pi){ifelse(x > pi, 1, 0)})
The problem is:
Error: .x (3) and .y (2) are different lengths
Can anyone help?
I think it is best to change lapply to sapply; then I will have a data.frame:
prediction <- sapply(models, function(x) { predict(x, newdata = test, type = "response") },
                     simplify = TRUE, USE.NAMES = TRUE)
Then could I use the pmap function?
Thanks.
EDIT: I updated with all code.
See if this makes sense:
df <- iris
df$y <- sample(0:1, nrow(df), replace = TRUE)

set.seed(101)
# Now selecting 80% of the data as a sample from the total 'n' rows
sample <- sample.int(n = nrow(df), size = floor(.8 * nrow(df)), replace = F)
train <- df[sample, ]
test <- df[-sample, ]

full <- glm(y ~ ., data = train, family = "binomial")
min <- glm(y ~ 1, data = train, family = "binomial")
backward <- step(full, direction = "backward", trace = 0)
forward <- step(min, scope = list(lower = min, upper = full), direction = "forward", trace = 0)
model2 <- glm(y ~ Sepal.Length + Sepal.Width, data = train, family = "binomial")

models <- list(backward, forward, model2)
prediction <- lapply(models, function(x) { predict(x, newdata = test, type = "response") })
library(purrr)

p <- seq(from = 0.1, to = 0.9, by = 0.5)

combn <- cross2(prediction, p)

pred <- map(combn,
            function(combination) {
              x <- combination[[1]]
              pi <- combination[[2]]
              ifelse(x > pi, 1, 0)
            })
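map2 pairs its inputs element by element, so it needs .x and .y of equal length (or length 1); with three prediction vectors and two cut-offs you want every combination, which is what cross2 above produces. An alternative, hedged sketch (same objects, purrr loaded) keeps the result nested by model instead of as one flat list:
pred_nested <- map(prediction, function(x) {
  map(p, function(pi) as.integer(x > pi))  # one 0/1 vector per cut-off, for each model
})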

predicting from flexmix object (R)

I fit some data to a mixture of two Gaussians in flexmix:
data("NPreg", package = "flexmix")
mod <- flexmix(yn ~ x, data = NPreg, k = 2,
model = list(FLXMRglm(yn ~ x, family= "gaussian"),
FLXMRglm(yn ~ x, family = "gaussian")))
the model fit is as follows:
> mod

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2, model = list(FLXMRglm(yn ~ x, family = "gaussian"),
    FLXMRglm(yn ~ x, family = "gaussian")))

Cluster sizes:
  1   2
 74 126

convergence after 31 iterations
But how do I actually predict from this model?
when I do
pred <- predict(mod, NPreg)
I get a list with the predictions from each of the two components
To get a single prediction, do I have to add in the cluster sizes like this?
single <- (74/200)* pred$Comp.1[,1] + (126/200)*pred$Comp.2[,2]
I use flexmix for prediction in the following way:
pred <- predict(mod, NPreg)
clust <- clusters(mod, NPreg)
result <- cbind(NPreg, data.frame(pred), data.frame(clust))

plot(result$yn, col = c("red", "blue")[result$clust], pch = 16, ylab = "yn")
And the confusion matrix:
table(result$class,result$clust)
For getting the predicted values of yn, I select the component value of the cluster to which a data point belongs.
for (i in 1:nrow(result)) {
  result$pred_model1[i] <- result[, paste0("Comp.", result$clust[i], ".1")][i]
  result$pred_model2[i] <- result[, paste0("Comp.", result$clust[i], ".2")][i]
}
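As an aside, a hedged vectorized sketch of the same selection (assuming the Comp.*.1 and Comp.*.2 columns created by data.frame(pred) above) avoids the row-by-row loop:
idx <- cbind(seq_len(nrow(result)), result$clust)  # row/column pairs for matrix indexing
result$pred_model1 <- as.matrix(result[, c("Comp.1.1", "Comp.2.1")])[idx]
result$pred_model2 <- as.matrix(result[, c("Comp.1.2", "Comp.2.2")])[idx]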
The actual vs. predicted results show the fit (adding only one of them here, as both of your models are the same; you would use pred_model2 for the second model).
qplot(result$yn, result$pred_model1,xlab="Actual",ylab="Predicted") + geom_abline()
RMSE = sqrt(mean((result$yn-result$pred_model1)^2))
gives a root mean square error of 5.54.
This answer is based on many SO answers I read through while working with flexmix. It worked well for my problem.
You may also be interested in visualizing the two distributions. My model was the following, which shows some overlap, as the ratios of the components are not close to 1.
Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2,
    model = list(FLXMRglm(yn ~ x, family = "gaussian"),
        FLXMRglm(yn ~ x, family = "gaussian")))

       prior size post>0 ratio
Comp.1 0.481  102    129 0.791
Comp.2 0.519   98    171 0.573

'log Lik.' -1312.127 (df=13)
AIC: 2650.255   BIC: 2693.133
I also generate a density plot with histograms to visualize both components. This was inspired by an SO answer from the maintainer of betareg.
a <- subset(result, clust == 1)
b <- subset(result, clust == 2)

hist(a$yn, col = hcl(0, 50, 80), main = "", xlab = "", freq = FALSE, ylim = c(0, 0.06))
hist(b$yn, col = hcl(240, 50, 80), add = TRUE, main = "", xlab = "", freq = FALSE, ylim = c(0, 0.06))

ys <- seq(0, 50, by = 0.1)
lines(ys, dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)), col = hcl(0, 80, 50), lwd = 2)
lines(ys, dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)), col = hcl(240, 80, 50), lwd = 2)

# Joint histogram
p <- prior(mod)
hist(result$yn, freq = FALSE, main = "", xlab = "", ylim = c(0, 0.06))
lines(ys, p[1] * dnorm(ys, mean = mean(a$yn), sd = sd(a$yn)) +
          p[2] * dnorm(ys, mean = mean(b$yn), sd = sd(b$yn)))
You can pass an additional argument to your prediction call.
pred <- predict(mod, NPreg, aggregate = TRUE)[[1]][,1]
