How to remove correlated variables from GLM in R - r

I am trying to exclude correlated variables from a GLM. First, I calculate the correlation matrix. Then I would like to use it together with the combn function so that combinations containing correlated variables (column headers) are skipped. This is where I fail: I have not managed to incorporate it into the combn loop so that the correlated variables are actually excluded.
Here is the link to the data I use:
https://drive.google.com/open?id=0B5IgiR_svnKcZkxHeTJXTm9jUjQ
Here is the code I am trying to get working:
## rm(list = ls()) ## Edited out to prevent accidents
mod_data <- read.csv("mod_data.csv", header = T)
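## `:` binds tighter than `-`, so 3:ncol(mod_data)-1 means (3:ncol(mod_data)) - 1,
## i.e. columns 2 to ncol(mod_data)-1; write 3:(ncol(mod_data)-1) if column 2 should be skipped
mod_headers <- names(mod_data[3:ncol(mod_data)-1])
## same precedence here: 1:ncol(mod_data)-1 is 0:(ncol(mod_data)-1), and the index 0 is silently dropped
CM = which(abs(cor(mod_data[,1:ncol(mod_data)-1])-diag(1,ncol(mod_data)-1)) > 0.5, arr.ind = T)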
f <- function(){
null_model <- glm(newcol ~ 1, data=mod_data, family = binomial(link = "logit"), control = list(maxit = 50))
best_model <- null_model
best_aic <- AIC(null_model)
for(i in 1:length(mod_headers)){
tab <- combn(mod_headers,i)
for(j in 1:ncol(tab)){
tab_new <- c(tab[,j])
mod_tab_new <- c(tab_new, "newcol")
model <- glm(newcol ~., data=mod_data[c(mod_tab_new)], family = binomial(link = "logit"), control = list(maxit = 50000))
if(AIC(model) < best_aic){
best_model <- model
best_aic <- AIC(model)
}
}
}
return(best_model)
}
f()
Thanks for your tips!
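One possible way to use the correlation matrix inside the combn loop is to test each candidate set of variables against it and skip any set that contains a pair above the threshold. The following is only a minimal sketch: the data are simulated stand-ins for mod_data (which is not reproduced here), and the names dat, is_admissible and tested are hypothetical.

```r
# Simulated stand-in for mod_data (assumption: the real file is not available here)
set.seed(1)
n <- 200
a <- rnorm(n)
dat <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.3),  # strongly correlated with a
  c = rnorm(n),
  d = rnorm(n)
)
dat$newcol <- rbinom(n, 1, plogis(0.8 * dat$a - 0.5 * dat$c))

predictors <- setdiff(names(dat), "newcol")
cor_mat <- abs(cor(dat[predictors]))
diag(cor_mat) <- 0  # ignore the correlation of a variable with itself

# TRUE when no pair of variables in `vars` exceeds the threshold
is_admissible <- function(vars, cor_mat, threshold = 0.5) {
  all(cor_mat[vars, vars, drop = FALSE] <= threshold)
}

best_model <- glm(newcol ~ 1, data = dat, family = binomial)
best_aic <- AIC(best_model)
tested <- character(0)  # record of the combinations that were actually fitted

for (k in seq_along(predictors)) {
  for (vars in combn(predictors, k, simplify = FALSE)) {
    if (!is_admissible(vars, cor_mat)) next  # skip sets with a correlated pair
    tested <- c(tested, paste(vars, collapse = "+"))
    model <- glm(reformulate(vars, response = "newcol"),
                 data = dat, family = binomial)
    if (AIC(model) < best_aic) {
      best_model <- model
      best_aic <- AIC(model)
    }
  }
}
```

In this sketch every combination containing both a and b is skipped, while each of them can still appear on its own. The 0.5 threshold mirrors the one used in the question.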

Related

How can I train a glmnet model (Poisson family) with an offset term using the caret package in R?

I want to model insurance claim count using a Poisson glmnet. The data I have at hand contains the number of claims for each policy (which is the response variable), some features about the policy (gender, region, etc.) as well as the duration of the policy (in years). I want to include the log-duration as an offset term, as we usually do in actuarial science. With the cv.glmnet function of the glmnet package, it is straightforward:
library(tidyverse)
library(glmnet)
n <- 100
dat <- tibble(
nb_claims = rpois(n, lambda = 0.5),
duration = runif(n),
x1 = runif(n),
x2 = runif(n),
x3 = runif(n)
)
fit <- cv.glmnet(
x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
y = dat %>% pull(nb_claims),
family = "poisson",
offset = dat %>% pull(duration) %>% log()
)
fit
However, my goal is to train this model using the train function of the caret package, because of the many advantages it gives. Indeed, validation, preprocessing as well as feature selection is much better with this package. It is straightforward to train a basic glmnet (without an offset term) with caret:
library(caret)
fit <- caret::train(
x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
y = dat %>% pull(nb_claims),
method = "glmnet",
family = "poisson"
)
fit
Naively, we could try to add the offset argument in the train function:
fit <- caret::train(
x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
y = dat %>% pull(nb_claims),
method = "glmnet",
family = "poisson",
offset = dat %>% pull(duration) %>% log()
)
fit
Unfortunately, this code throws the error Error : No newoffset provided for prediction, yet offset used in fit of glmnet. This error occurs because the caret::train function doesn't take care to give a value for the newoffset argument in predict.glmnet function.
In this book, they show how to add an offset term to a GLM model by modifying the source code of the caret::train function. It works perfectly. However, the predict.glm function is quite different from the predict.glmnet function, because it does not have the newoffset argument. I tried to modify the source code of the caret::train function, but I am having some trouble because I do not know well enough how this function works.
A simple way to do this is to pass the offset as one of the columns of x. In every fit and predict call, pass the columns of x that are not the offset as x, and pass the offset column as offset/newoffset.
In the following example the offset column of x must be named "offset". This can be changed relatively easily.
To create the function we will just use lots of parts from: https://github.com/topepo/caret/blob/master/models/files/glmnet.R
glmnet is peculiar since it needs a loop; the rest is just rinse and repeat from https://topepo.github.io/caret/using-your-own-model-in-train.html#illustrative-example-1-svms-with-laplacian-kernels
family = "poisson" is hard-coded throughout; to change this, adapt the code from https://github.com/topepo/caret/blob/master/models/files/glmnet.R
glmnet_offset <- list(type = "Regression",
library = c("glmnet", "Matrix"),
loop = function(grid) {
alph <- unique(grid$alpha)
loop <- data.frame(alpha = alph)
loop$lambda <- NA
submodels <- vector(mode = "list", length = length(alph))
for(i in seq(along = alph)) {
np <- grid[grid$alpha == alph[i],"lambda"]
loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)]
submodels[[i]] <- data.frame(lambda = np[-which.max(np)])
}
list(loop = loop, submodels = submodels)
})
glmnet_offset$parameters <- data.frame(parameter = c('alpha', 'lambda'),
class = c("numeric", "numeric"),
label = c('Mixing Percentage', 'Regularization Parameter'))
glmnet_offset$grid <- function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
init <- glmnet::glmnet(Matrix::as.matrix(x[,colnames(x) != "offset"]), y,
family = "poisson",
nlambda = len+2,
alpha = .5,
offset = x[,colnames(x) == "offset"])
lambda <- unique(init$lambda)
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
out <- expand.grid(alpha = seq(0.1, 1, length = len),
lambda = lambda)
} else {
out <- data.frame(alpha = runif(len, min = 0, 1),
lambda = 2^runif(len, min = -10, 3))
}
out
}
So x[,colnames(x) != "offset"] supplies the feature matrix, while x[,colnames(x) == "offset"] supplies the offset.
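The column split can be checked on its own with a tiny matrix. This is just an illustrative sketch with made-up numbers; the names is_offset, features and offset_vec are hypothetical. Using drop = FALSE is a defensive choice (not in the code above) so that the feature part stays a matrix even if only one feature column is left.

```r
# Toy design matrix with two features and a log-exposure column named "offset"
x <- cbind(x1 = c(0.2, 0.5, 0.9),
           x2 = c(1.0, 0.3, 0.7),
           offset = log(c(0.5, 1.0, 0.25)))

is_offset <- colnames(x) == "offset"

features   <- x[, !is_offset, drop = FALSE]  # stays a matrix
offset_vec <- x[, is_offset]                 # plain numeric vector
```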
glmnet_offset$fit <- function(x, y, wts, param, last, ...) {
theDots <- list(...)
## pass in any model weights
if(!is.null(wts)) theDots$weights <- wts
if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
x <- Matrix::as.matrix(x)
modelArgs <- c(list(x = x[,colnames(x) != "offset"],
y = y,
alpha = param$alpha,
family = "poisson",
offset = x[,colnames(x) == "offset"]),
theDots)
out <- do.call(glmnet::glmnet, modelArgs)
if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
out
}
glmnet_offset$predict <- function(modelFit, newdata, submodels = NULL) {
if(!is.matrix(newdata)) newdata <- Matrix::as.matrix(newdata)
out <- predict(modelFit,
newdata[,colnames(newdata) != "offset"],
s = modelFit$lambdaOpt,
newoffset = newdata[,colnames(newdata) == "offset"],
type = "response") #important for measures to be appropriate
if(is.matrix(out)) out <- out[,1]
out
if(!is.null(submodels)) {
tmp <- as.list(as.data.frame(predict(modelFit,
newdata[,colnames(newdata) != "offset"],
s = submodels$lambda,
newoffset = newdata[,colnames(newdata) == "offset"],
type = "response"),
stringsAsFactors = TRUE))
out <- c(list(out), tmp)
}
out
}
For some reason that I don't understand yet, it does not work without the prob slot:
glmnet_offset$prob <- glmnet_offset$predict
glmnet_offset$tags = c("Generalized Linear Model", "Implicit Feature Selection",
"L1 Regularization", "L2 Regularization", "Linear Classifier",
"Linear Regression")
glmnet_offset$sort = function(x) x[order(-x$lambda, x$alpha),]
glmnet_offset$trim = function(x) {
x$call <- NULL
x$df <- NULL
x$dev.ratio <- NULL
x
}
library(tidyverse)
library(caret)
library(glmnet)
n <- 100
set.seed(123)
dat <- tibble(
nb_claims = rpois(n, lambda = 0.5),
duration = runif(n),
x1 = runif(n),
x2 = runif(n),
x3 = runif(n)
)
x = dat %>%
dplyr::select(-nb_claims) %>%
mutate(offset = log(duration)) %>%
dplyr::select(-duration) %>%
as.matrix
fit <- caret::train(
x = x,
y = dat %>% pull(nb_claims),
method = glmnet_offset
)
fit
100 samples
4 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.10 0.0001640335 0.7152018 0.01805762 0.5814200
0.10 0.0016403346 0.7152013 0.01805684 0.5814193
0.10 0.0164033456 0.7130390 0.01798125 0.5803747
0.55 0.0001640335 0.7151988 0.01804917 0.5814020
0.55 0.0016403346 0.7150312 0.01802689 0.5812936
0.55 0.0164033456 0.7095996 0.01764947 0.5783706
1.00 0.0001640335 0.7152033 0.01804795 0.5813997
1.00 0.0016403346 0.7146528 0.01798979 0.5810811
1.00 0.0164033456 0.7063482 0.01732168 0.5763653
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.01640335.
predict(fit$finalModel, x[,1:3], newoffset = x[,4]) #works
This will not work with preprocessing in caret, since we pass the offset as one of the features. However, it will work with recipes, since you can define which columns the preprocessing functions are applied to via selections. See the article for details: https://tidymodels.github.io/recipes/articles/Selecting_Variables.html
I haven't had time to error check my code. If any problems occur or if there is a mistake somewhere please comment. Thanks.
You can also post an issue on the caret GitHub repository asking for this feature (offset/newoffset) to be added to the model.
I tried changing the model info in a lot of ways, but it kept failing. Below I propose one solution; it may not be the best, but it will get you somewhere if your data is sensible.
In Poisson / negative binomial regression, the offset is introduced into the regression as an extra term in the linear predictor; you can read more here and here:
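For reference, a standard way to write a log-link count model with an offset is sketched below. The symbols ($\mu_x$ for the expected count, $t_x$ for the exposure, $\beta_0$ and $\beta_1$ for the coefficients) are an assumption based on common textbook notation, not necessarily the exact expression used on the linked pages:

```latex
\log(\mu_x) = \log(t_x) + \beta_0 + \beta_1 x
```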
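where tx is the exposure and log(tx) is the offset, which enters the linear predictor with its coefficient fixed at 1. In glmnet, there is a penalty factor you can set for each term; if you set it to 0 for a term, that term is not penalized and is always included. We can use that for the offset column. Note that this is an approximation: an unpenalized column gets a freely estimated coefficient, whereas a true offset has its coefficient fixed at 1. You can also only see the effect if you use a dataset that makes some sense (in your example dataset the offsets are random numbers unrelated to the counts).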
Below I use the insurance claims dataset from MASS:
library(tidyverse)
library(glmnet)
library(MASS)
dat <- Insurance
X = model.matrix(Claims ~ District + Group + Age,data=dat)
Y = dat$Claims
OFF = log(dat$Holders)
fit_cv <- cv.glmnet(
x = X,
y = Y,
family = "poisson",
offset = OFF
)
Now using caret, I will fit it without any resampling, using the same lambda obtained from the fit in cv.glmnet. Note also that predict() and coef() on a cv.glmnet object use lambda.1se by default, not lambda.min:
fit_c <- caret::train(
x = cbind(X,OFF),
y = Y,
method = "glmnet",
family = "poisson",
tuneGrid=data.frame(lambda=fit_cv$lambda.1se,alpha=1),
penalty.factor=c(rep(1,ncol(X)),0),
trControl = trainControl(method="none")
)
We can see how different the predictions are:
p1 = predict(fit_cv,newx=X,newoffset=OFF)
## predict() on a caret train object takes newdata, not newx
p2 = predict(fit_c,newdata=cbind(X,OFF))
plot(p1,p2)
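To see what treating the exposure as an ordinary unpenalized column changes, here is a small sketch with plain glm() on the same MASS::Insurance data. Using glm() instead of glmnet or caret is an assumption made to keep the example dependency-free; fit_offset, fit_free and coef_log_holders are hypothetical names.

```r
library(MASS)  # provides the Insurance data set

# True offset: log(Holders) enters with its coefficient fixed at 1
fit_offset <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
                  data = Insurance, family = poisson)

# Exposure as an ordinary covariate: its coefficient is estimated from the data
fit_free <- glm(Claims ~ District + Group + Age + log(Holders),
                data = Insurance, family = poisson)

# The estimated coefficient; the closer it is to 1, the closer the two fits are
coef_log_holders <- unname(coef(fit_free)["log(Holders)"])
coef_log_holders
```

If the estimated coefficient is far from 1, predictions from the two approaches can be expected to differ noticeably, which is one reason p1 and p2 above need not agree.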

Using $ to refer to multiple variables in user defined function R

I am trying to run a function which calculates the marginal effects for different mixed effects models, based on two different main predictors (var1 vs. var2). The original code can be found here:
https://stats.idre.ucla.edu/r/dae/mixed-effects-logistic-regression/. Below is a reproducible example:
I create a dataframe (ex):
time <- seq(from = 1, to = 500, by =1)
var1 <- factor(sample(0:1, 500, replace = TRUE))
var2 <- factor(sample(0:1, 500, replace = TRUE))
var3 <- sample(1:500, 500, replace = TRUE)
group <- rep(1001:1005, 100)  # 5 groups x 100 = 500 values, matching the other columns
out <- sample(0:1, 500, replace = TRUE)
group <- as.factor(group)
ex <- data.frame(time,var1,var2,var3,group,out)
Run the models:
library(lme4)  # provides glmer() and glmerControl()
m1a <- glmer(out ~ time + var1 + (1|group), data=ex, family = binomial(link = "logit"), nAGQ = 1,
control = glmerControl(calc.derivs = FALSE))
m1b <- glmer(out ~ time + var2 + (1|group), data=ex, family = binomial(link = "logit"), nAGQ = 1,
control = glmerControl(calc.derivs = FALSE))
Create subsets of the data with only the predictors for complete cases:
sub1a <- na.omit(ex[, c("time", "var1", "group")])
sub1b <- na.omit(ex[, c("time", "var2", "group")])
I cannot attach my data frame, ex, because R says var1 and var2 are masked. Therefore, the only way I know to refer to the variables is using $. However, every function I create produces a wrong or null result. I first tried:
marg <- function(v1, v2, d, m) {
biprobs <- lapply(levels(v1), function(var) {
v2[ ] <- var
lapply(time, function(ti) {
d$time <- ti
predict(m, newdata = d, type = "response")
})
})
plotdat <- lapply(biprobs, function(X) {
temp <- t(sapply(X, function(x) {
c(M=mean(x), quantile(x, c(.25, .75)))
}))
temp <- as.data.frame(cbind(temp,time))
colnames(temp) <- c("PP", "Lower", "Upper", "Dayssince")
return(temp)
})
plotdat <- do.call(rbind, plotdat)
}
result1 <- marg(ex$var1, sub1a$var1, sub1a, m1a)
Although this creates a data frame, it produces the same predicted probabilities for each level of var1 (0 vs. 1) at a given time (1-500), which is not what I want. So then I tried:
marg <- function(v, d, m) {
biprobs <- lapply(levels(ex$v), function(var) {
d$v[ ] <- var
lapply(time, function(ti) {
d$time <- ti
predict(m, newdata = d, type = "response")
})
})
.....
}
result2 <- marg(var1,sub1a, m1a)
This produces a null result. I also tried, which produces a null result:
marg <- function(d1,v,d2,m) {
biprobs <- lapply(levels(d1$v), function(var) {
d2$v[ ] <- var
lapply(time, function(ti) {
d2$time <- ti
predict(m, newdata = d2, type = "response")
})
})
......
}
result3 <- marg(ex,var1,sub1a,m1a)
I also tried creating a new object to input directly into the function:
v1 <- ex$var1
marg <- function(d, m) {
biprobs <- lapply(levels(v1), function(var) {
.....
})
})
That also produces a null result. How do I refer to different variables in an unattached data frame? The code works with direct inputs, so it's a matter of correctly defining the function arguments. I appreciate any help!
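A common way to handle this in R is to pass the column name as a string and index with [[ ]]. Inside a function, d$v looks for a column literally named "v", which is why the later attempts return NULL; the first attempt changes the standalone vector v2 rather than the column inside d, so predict() never sees the change. The sketch below illustrates the idea under simplifying assumptions: a plain glm() stands in for glmer() so that it runs without lme4, the data are simulated, and the times argument is a hypothetical addition to keep the run short.

```r
set.seed(42)
ex <- data.frame(
  time = 1:200,
  var1 = factor(sample(0:1, 200, replace = TRUE)),
  var2 = factor(sample(0:1, 200, replace = TRUE))
)
ex$out <- rbinom(200, 1, plogis(-0.5 + 1.2 * (ex$var1 == "1")))

# Plain glm as a stand-in for the glmer model (assumption)
m1 <- glm(out ~ time + var1, data = ex, family = binomial)

# `v` is the column NAME as a string; d[[v]] looks the column up by that name
marg <- function(v, d, m, times = c(1, 100, 200)) {
  per_level <- lapply(levels(d[[v]]), function(lev) {
    # set every row to this level, keeping the original factor levels
    d[[v]] <- factor(rep(lev, nrow(d)), levels = levels(d[[v]]))
    rows <- lapply(times, function(ti) {
      d$time <- ti
      p <- predict(m, newdata = d, type = "response")
      data.frame(level = lev, time = ti, PP = mean(p))
    })
    do.call(rbind, rows)
  })
  do.call(rbind, per_level)
}

res <- marg("var1", ex, m1)
res
```

The same function can then be called with "var2" and the corresponding model, without attaching the data frame.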

Iterate two arguments with map2 (purrr function)

I want to turn the predicted probabilities from several models into class predictions, using every cutoff from a set of probability thresholds. The predictions are stored in a list.
df<-iris
set.seed(101)  # set the seed before the first random draw so y is reproducible too
df$y<-sample(0:1,nrow(df),replace=TRUE)
#Now Selecting 80% of data as sample from total 'n' rows of the data
sample <- sample.int(n = nrow(df), size = floor(.8*nrow(df)), replace = F)
train <- df[sample, ]
test <- df[-sample, ]
Then I fit several logistic models:
full <- glm(y~., data = train, family = "binomial")
min <- glm( y~ 1, data = train, family = "binomial")
backward <- step(full,direction = "backward",trace=0)
forward <- step(min,scope=list(lower=min, upper=full),direction = "forward",trace=0)
model2<- glm(y~Sepal.Length+Sepal.Width , data = train, family = "binomial")
models<-list(backward,forward,model2)
prediction<- lapply(models, function(x){predict(x,newdata=test,type="response")})
First I have a list with the predictions. Then I created a vector with all the probability cutoffs I want to try.
p <- seq(from = 0.1, to = 0.9, by = 0.5)
The problem is that I want to apply the different cutoffs to every model. I tried the map2 function from the purrr package, but it doesn't work.
pred = map2(prediction,p, function(x,pi){ifelse(x > pi, 1, 0)})
The error is:
Error: .x (3) and .y (2) are different lengths
Can anyone help?
I think it is best to change lapply to sapply, so that I get a matrix with one column per model.
prediction<- sapply(models, function(x){predict(x, newdata=test,type="response")},
simplify = T,USE.NAMES = TRUE)
Could I then use the pmap function?
Thanks
EDIT: I updated the question with all the code.
See if this makes sense:
df<-iris
set.seed(101)  # set the seed before the first random draw so y is reproducible too
df$y<-sample(0:1,nrow(df),replace=TRUE)
#Now Selecting 80% of data as sample from total 'n' rows of the data
sample <- sample.int(n = nrow(df), size = floor(.8*nrow(df)), replace = F)
train <- df[sample, ]
test <- df[-sample, ]
full <- glm(y~., data = train, family = "binomial")
min <- glm( y~ 1, data = train, family = "binomial")
backward <- step(full,direction = "backward",trace=0)
forward <- step(min,scope=list(lower=min, upper=full),direction = "forward",trace=0)
model2<- glm(y~Sepal.Length+Sepal.Width , data = train, family = "binomial")
models<-list(backward,forward,model2)
prediction<- lapply(models, function(x){predict(x,newdata=test,type="response")})
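p <- seq(from = 0.1, to = 0.9, by = 0.5)  # gives c(0.1, 0.6)
library(purrr)
## cross2() builds every (prediction, cutoff) pair; it is deprecated as of purrr 1.0.0
## the result is named combos so that base::combn is not shadowed
combos = cross2(prediction, p)
pred <- map(combos,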
function(combination) {
x <- combination[[1]]
pi <- combination[[2]]
ifelse(x > pi, 1, 0)
}
)
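map2 walks its two inputs in parallel, so three models and two cutoffs cannot be paired up; applying every cutoff to every model needs a nested iteration instead. A base-R sketch of that idea follows. The random vectors are stand-ins for the model predictions (an assumption, so that the example runs on its own), and the column names p_0.1 and p_0.6 are a hypothetical labelling.

```r
set.seed(101)
# Stand-ins for the predicted probabilities of three models
prediction <- list(runif(5), runif(5), runif(5))
p <- c(0.1, 0.6)  # cutoffs

# For each model: a matrix with one row per observation and one column per cutoff
pred <- lapply(prediction, function(x) {
  out <- sapply(p, function(cutoff) as.integer(x > cutoff))
  colnames(out) <- paste0("p_", p)
  out
})

pred[[1]]
```

This keeps the results grouped by model, which may be easier to work with than a flat list of model and cutoff pairs.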

Use pdp package to get probability scale partial dependence plots for all classes

I have been following the example here to create partial dependence plots, but I would like to combine the approach used to get plots for all levels of a multiclass response with the one used to get predictions on the probability scale (see pages 430-431).
This is my approach, but it doesn't work because pred.fun is not allowed to have a third argument:
library(e1071)
library(pdp)  # provides partial()
iris.svm <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.75,
cost = 0.25, probability = TRUE)
pred.prob <- function(object, newdata,i) { # see ?predict.svm
pred <- predict(object, newdata, probability = TRUE)
prob.class <- attr(pred, which = "probabilities")[, i]
mean(prob.class)
}
pred.prob(iris.svm,iris,"setosa")
pd <- NULL
for (i in 1:3) {
tmp <- partial(iris.svm, pred.var = c("Petal.Width", "Petal.Length"),
pred.fun = pred.prob,
which.class = i, grid.resolution = 101, progress = "text")
pd <- rbind(pd, cbind(tmp, Species = levels(iris$Species)[i]))
}
Any recommendations for how to get around this requirement or a different approach?
It looks like the package has actually been updated since the article I referred to was published. Now all you need to do is set the prob argument to TRUE and it will predict on the probability scale.
pd <- NULL
for (i in 1:3) {
tmp <- partial(iris.svm, pred.var = c("Petal.Width", "Petal.Length"),
prob = T,
which.class = i, grid.resolution = 101, progress = "text")
pd <- rbind(pd, cbind(tmp, Species = levels(iris$Species)[i]))
}
I hope this helps someone else to avoid wasting an afternoon!
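For cases where a custom prediction function is still needed, a general R workaround for the two-argument restriction is a function factory: a function that fixes the class and returns a function of object and newdata only. The sketch below only demonstrates the factory itself. MASS::lda is used as a stand-in classifier so that it runs without e1071 (an assumption), and make_pred_prob is a hypothetical name; it has not been tested with partial().

```r
library(MASS)  # lda() is used here as a stand-in for the SVM (assumption)

fit <- lda(Species ~ ., data = iris)

# Factory: fixes the class `i` and returns a two-argument prediction function
make_pred_prob <- function(i) {
  force(i)  # evaluate i now, so a later change in a loop variable has no effect
  function(object, newdata) {
    probs <- predict(object, newdata)$posterior  # matrix, one column per class
    mean(probs[, i])
  }
}

pred_setosa <- make_pred_prob("setosa")
avg_setosa <- pred_setosa(fit, iris)
avg_setosa
```

With an SVM, the body would extract the "probabilities" attribute as in pred.prob above, and the returned function could presumably be supplied as pred.fun = make_pred_prob(i) inside the loop.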

Reproducing results from previous answer is not working due to using new version of lme4

I have tried to reproduce the results from the answers to the question "Estimating random effects and applying user defined correlation/covariance structure with R lme4 or nlme package": https://stats.stackexchange.com/questions/18563/estimating-random-effects-and-applying-user-defined-correlation-covariance-struc
Aaron Rendahl's code:
library(pedigreemm)
relmatmm <- function (formula, data, family = NULL, REML = TRUE, relmat = list(),
control = list(), start = NULL, verbose = FALSE, subset,
weights, na.action, offset, contrasts = NULL, model = TRUE,
x = TRUE, ...)
{
mc <- match.call()
lmerc <- mc
lmerc[[1]] <- as.name("lmer")
lmerc$relmat <- NULL
if (!length(relmat))
return(eval.parent(lmerc))
stopifnot(is.list(relmat), length(names(relmat)) == length(relmat))
lmerc$doFit <- FALSE
lmf <- eval(lmerc, parent.frame())
relfac <- relmat
relnms <- names(relmat)
stopifnot(all(relnms %in% names(lmf$FL$fl)))
asgn <- attr(lmf$FL$fl, "assign")
for (i in seq_along(relmat)) {
tn <- which(match(relnms[i], names(lmf$FL$fl)) == asgn)
if (length(tn) > 1)
stop("a relationship matrix must be associated with only one random effects term")
Zt <- lmf$FL$trms[[tn]]$Zt
relmat[[i]] <- Matrix(relmat[[i]][rownames(Zt), rownames(Zt)],
sparse = TRUE)
relfac[[i]] <- chol(relmat[[i]])
lmf$FL$trms[[tn]]$Zt <- lmf$FL$trms[[tn]]$A <- relfac[[i]] %*% Zt
}
ans <- do.call(if (!is.null(lmf$glmFit))
lme4:::glmer_finalize
else lme4:::lmer_finalize, lmf)
ans <- new("pedigreemm", relfac = relfac, ans)
ans@call <- match.call()
ans
}
The original example:
set.seed(1234)
mydata <- data.frame (gen = factor(rep(1:10, each = 10)),
repl = factor(rep(1:10, 10)),
yld = rnorm(10, 5, 0.5))
library(lme4)
covmat <- round(nearPD(matrix(runif(100, 0, 0.2), nrow = 10))$mat, 2)
diag(covmat) <- diag(covmat)/10+1
rownames(covmat) <- colnames(covmat) <- levels(mydata$gen)
m <- relmatmm(yld ~ (1|gen) + (1|repl), relmat=list(gen=covmat), data=mydata)
Here is the error message:
Error in lmf$FL : $ operator not defined for this S4 class
In addition: Warning message:
In checkArgs("lmer", doFit = FALSE) : extra argument(s) ‘doFit’ disregarded
I would appreciate any help.
Thanks
This is a re-implementation of the previous code -- I have done some slight modifications, and I have not tested it in any way -- test yourself and/or use at your own risk.
First create a slightly more modularized function that constructs the deviance function and fits the model:
doFit <- function(lmod,lmm=TRUE) {
## see ?modular
if (lmm) {
devfun <- do.call(mkLmerDevfun, lmod)
opt <- optimizeLmer(devfun)
mkMerMod(environment(devfun), opt, lmod$reTrms, fr = lmod$fr)
} else {
devfun <- do.call(mkGlmerDevfun, lmod)
opt <- optimizeGlmer(devfun)
devfun <- updateGlmerDevfun(devfun, lmod$reTrms)
opt <- optimizeGlmer(devfun, stage=2)
mkMerMod(environment(devfun), opt, lmod$reTrms, fr = lmod$fr)
}
}
Now create a function to construct the object that doFit needs and modify it:
relmatmm <- function (formula, ..., lmm=TRUE, relmat = list()) {
ff <- if (lmm) lFormula(formula, ...) else glFormula(formula, ...)
stopifnot(is.list(relmat), length(names(relmat)) == length(relmat))
relnms <- names(relmat)
relfac <- relmat
flist <- ff$reTrms[["flist"]] ## list of factors
## random-effects design matrix components
Ztlist <- ff$reTrms[["Ztlist"]]
stopifnot(all(relnms %in% names(flist)))
asgn <- attr(flist, "assign")
for (i in seq_along(relmat)) {
tn <- which(match(relnms[i], names(flist)) == asgn)
if (length(tn) > 1)
stop("a relationship matrix must be",
" associated with only one random effects term")
zn <- rownames(Ztlist[[i]])
relmat[[i]] <- Matrix(relmat[[i]][zn,zn],sparse = TRUE)
relfac[[i]] <- chol(relmat[[i]])
Ztlist[[i]] <- relfac[[i]] %*% Ztlist[[i]]
}
ff$reTrms[["Ztlist"]] <- Ztlist
ff$reTrms[["Zt"]] <- do.call(rbind,Ztlist) ## rbind works for Matrix objects; rBind is deprecated
fit <- doFit(ff,lmm)
}
Example
set.seed(1234)
mydata <- data.frame (gen = factor(rep(1:10, each = 10)),
repl = factor(rep(1:10, 10)),
yld = rnorm(10, 5, 0.5))
library(lme4)
covmat <- round(nearPD(matrix(runif(100, 0, 0.2), nrow = 10))$mat, 2)
diag(covmat) <- diag(covmat)/10+1
rownames(covmat) <- colnames(covmat) <- levels(mydata$gen)
m <- relmatmm(yld ~ (1|gen) + (1|repl), relmat=list(gen=covmat),
data=mydata)
This runs -- I don't know if the output is correct. It also doesn't make the resulting object into a pedigreemm object ...
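The central step in relmatmm is premultiplying Zt by the Cholesky factor of the relationship matrix. A small base-R check of why that gives the intended covariance structure is sketched below; the matrix A, the factor g and the other names are made up for illustration, and dense base matrices are used instead of sparse Matrix objects.

```r
# Hypothetical 3 x 3 relationship matrix (positive definite)
A <- matrix(0.1, 3, 3)
diag(A) <- 1
rownames(A) <- colnames(A) <- c("g1", "g2", "g3")

R <- chol(A)  # upper triangular factor with t(R) %*% R equal to A

# Indicator matrix Z for a grouping factor with 5 observations
g <- factor(c("g1", "g1", "g2", "g3", "g3"))
Z <- model.matrix(~ 0 + g)
Zt <- t(Z)

# The transformation applied inside relmatmm
Zt_star <- R %*% Zt

# Independent effects on the transformed design imply covariance Z A t(Z)
lhs <- crossprod(Zt_star)  # t(Zt_star) %*% Zt_star
rhs <- Z %*% A %*% t(Z)
all.equal(unname(lhs), unname(rhs))
```

So fitting ordinary independent random effects with the transformed Zt corresponds to fitting correlated effects with covariance proportional to A on the original design.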

Resources