Finite mixture of tweedie - r

I'm trying to estimate a finite mixture of tweedie (or compound Poisson-gamma) distributions. I have scoured any resources I can think of, without finding any resources on how to do this.
I am currently trying to use the flexmix package in R writing a different M-step driver, as outlined in the flexmix vignette on pages 12-14. Here is my code, which relies on the cplm package:
tweedieClust <- function(formula = .~.,offset = NULL){
require(tweedie)
require(cplm)
require(plyr)
require(dplyr)
retval <- new("FLXMC", weighted = TRUE, formula = formula, dist = "tweedie",
name = "Compound Poisson Clustering")
retval#defineComponent <- expression ({
predict <- function(x, ...) {
pr <- mu
}
logLik <- function(x, y, ...){
dtweedie(y, xi = p, mu = mu, phi = phi) %>%
log
}
new("FLXcomponent",
parameters=list(coef=coef),
logLik=logLik, predict=predict,
df=df)
})
retval#fit <- function (x, y, w, component) {
fit <- cpglm(formula = y ~ x, link = "log", weights=w, offset=offset)
with(list(coef = coef(fit), df = ncol(x),mu = fit$fitted.values,
p = fit$p, phi = fit$phi),
eval(retval#defineComponent))
}
retval
}
However, this results in the following error:
Error in dtweedie(y, xi = p, mu = mu, phi = phi) :
binary operation on non-conformable arrays
Has anyone done or seen a finite mixture of tweedie distributions? Can you point me in the right direction to accomplish this, using flexmix or otherwise?

The problem is somewhere in the weights part, if you remove it, it works:
tweedieClust <- function(formula = .~.,offset = NULL){
require(tweedie)
require(statmod)
require(cplm)
require(plyr)
require(dplyr)
retval <- new("FLXMC", weighted = F, formula = formula, dist = "tweedie",
name = "Compound Poisson Clustering")
retval#defineComponent <- expression ({
predict <- function(x, ...) {
pr <- mu
}
logLik <- function(x, y, ...){
dtweedie(y, xi = p, mu = mu, phi = phi) %>%
log
}
new("FLXcomponent",
parameters=list(mu=mu,xi=p,phi=phi),
logLik=logLik, predict=predict,
df=df)
})
retval#fit <- function (x, y, w, component) {
fit <- cpglm(formula = End~.,data=dmft, link = "log")
with(list(df = ncol(x), mu = fit$fitted.values,
p = fit$p, phi = fit$phi),
eval(retval#defineComponent))
}
retval
}
example:
library(flexmix)
data("dmft", package = "flexmix")
m1 <- flexmix(End ~ .,data=dmft, k = 4, model = tweedieClust())

Related

MLE - Optimization with constraints as non-linear functions of the variables

I have a problem with the following optimization problem. In particular, I would like to add the following constraint to the MLE problem: (x - location)/scale > 0. Without this constraint, the LL is Inf and the L-BGFS-B optimization gives the following error
library(PearsonDS)
x <- rpearsonIII(n=1000, shape = 5, location = 6, scale = 7)
dpearson3 <- function (x, shape, location, scale, log = FALSE)
{
gscale <- abs(scale)
ssgn <- sign(scale)
density <- dgamma(ssgn * (x - location), shape = shape, scale = gscale, log = log)
return(density)
}
LL3 <- function(theta, x, display)
{
shape <- as.numeric(theta[1])
location <- as.numeric(theta[2])
scale <- as.numeric(theta[3])
tmp <- -sum(log(dpearson3(x, shape, location, scale, log = FALSE)))
if (is.na(tmp)) +Inf else tmp
if(display == 1){print(c(tmp, theta))}
return(sum(tmp))
}
control.list <- list(maxit = 100000, factr=1e-12, fnscale = 1)
fit <- optim(par = param,
fn = LL3,
hessian = TRUE,
method = "L-BFGS-B",
lower = c(0,-Inf,-Inf),
upper = c(Inf,Inf,Inf),
control = control.list,
x = x, display = 1)
Assume that I start the search from
param <- c(100,1000,10), I get the following error
Error in optim(par = param, fn = LL3, hessian = TRUE, method = "L-BFGS-B", :
L-BFGS-B needs finite values of 'fn'
How to solve the issue?
Changing the MLE function to
LL3 <- function(theta, x, display){
shape <- as.numeric(theta[1])
location <- as.numeric(theta[2])
scale <- as.numeric(theta[3])
tmp <- -sum(log(dpearson3(x, shape, location, scale, log = FALSE)))
if(min((x-location)/scale) < 0) tmp = + 100000000000 # I added this line
if (is.na(tmp)) +Inf else tmp
if(display == 1){print(c(tmp, theta))}
return(tmp)
}
is the smartest thing I could find. In this way I avoid the Inf problem. Any better answer?

How can I train a glmnet model (Poisson family) with an offset term using the caret package in R?

I want to model insurance claim count using a Poisson glmnet. The data I have at hand contains the number of claims for each policy (which is the response variable), some features about the policy (gender, region, etc.) as well as the duration of the policy (in years). I want to include the log-duration as an offset term, as we usually do in actuarial science. With the cv.glmnet function of the glmnet package, it is straightforward:
library(tidyverse)
library(glmnet)
n <- 100
dat <- tibble(
nb_claims = rpois(n, lambda = 0.5),
duration = runif(n),
x1 = runif(n),
x2 = runif(n),
x3 = runif(n)
)
fit <- cv.glmnet(
x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
y = dat %>% pull(nb_claims),
family = "poisson",
offset = dat %>% pull(duration) %>% log()
)
fit
However, my goal is to train this model using the train function of the caret package, because of the many advantages it gives. Indeed, validation, preprocessing as well as feature selection is much better with this package. It is straightforward to train a basic glmnet (without an offset term) with caret:
library(caret)
fit <- caret::train(
x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
y = dat %>% pull(nb_claims),
method = "glmnet",
family = "poisson"
)
fit
Naively, we could try to add the offset argument in the train function:
fit <- caret::train(
x = dat %>% dplyr::select(x1, x2, x3) %>% as.matrix(),
y = dat %>% pull(nb_claims),
method = "glmnet",
family = "poisson",
offset = dat %>% pull(duration) %>% log()
)
fit
Unfortunately, this code throws the error Error : No newoffset provided for prediction, yet offset used in fit of glmnet. This error occurs because the caret::train function doesn't take care to give a value for the newoffset argument in predict.glmnet function.
In this book, they show how to add an offset term to a GLM model by modifying the source code of the caret::train function. It works perfectly. However, the predict.glm function is quite different from the predict.glmnet function, because it does not have the newoffset argument. I tried to modify the source code of the caret::train function, but I am having some trouble because I do not know well enough how this function works.
A simple way to perform this is pass the offset column as part of x and in each fit and predict call pass as x columns of x which are not the offset. While as offset/newoffset pass the x column corresponding to the offset.
In the following example the offest column of x needs to be named "offset" too. This can be changed relatively easy
To create the function we will just use lots of parts from: https://github.com/topepo/caret/blob/master/models/files/glmnet.R
glmnet is peculiar since it needs a loop, the rest is just rinse and reapeat from https://topepo.github.io/caret/using-your-own-model-in-train.html#illustrative-example-1-svms-with-laplacian-kernels
family = "poisson" will be specified throughout, to change this adopt code from https://github.com/topepo/caret/blob/master/models/files/glmnet.R
glmnet_offset <- list(type = "Regression",
library = c("glmnet", "Matrix"),
loop = function(grid) {
alph <- unique(grid$alpha)
loop <- data.frame(alpha = alph)
loop$lambda <- NA
submodels <- vector(mode = "list", length = length(alph))
for(i in seq(along = alph)) {
np <- grid[grid$alpha == alph[i],"lambda"]
loop$lambda[loop$alpha == alph[i]] <- np[which.max(np)]
submodels[[i]] <- data.frame(lambda = np[-which.max(np)])
}
list(loop = loop, submodels = submodels)
})
glmnet_offset$parameters <- data.frame(parameter = c('alpha', 'lambda'),
class = c("numeric", "numeric"),
label = c('Mixing Percentage', 'Regularization Parameter'))
glmnet_offset$grid <- function(x, y, len = NULL, search = "grid") {
if(search == "grid") {
init <- glmnet::glmnet(Matrix::as.matrix(x[,colnames(x) != "offset"]), y,
family = "poisson",
nlambda = len+2,
alpha = .5,
offset = x[,colnames(x) == "offset"])
lambda <- unique(init$lambda)
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
out <- expand.grid(alpha = seq(0.1, 1, length = len),
lambda = lambda)
} else {
out <- data.frame(alpha = runif(len, min = 0, 1),
lambda = 2^runif(len, min = -10, 3))
}
out
}
So x[,colnames(x) != "offset"] is x while offset is x[,colnames(x) == "offset"]
glmnet_offset$fit <- function(x, y, wts, param, last, ...) {
theDots <- list(...)
## pass in any model weights
if(!is.null(wts)) theDots$weights <- wts
if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
x <- Matrix::as.matrix(x)
modelArgs <- c(list(x = x[,colnames(x) != "offset"],
y = y,
alpha = param$alpha,
family = "poisson",
offset = x[,colnames(x) == "offset"]),
theDots)
out <- do.call(glmnet::glmnet, modelArgs)
if(!is.na(param$lambda[1])) out$lambdaOpt <- param$lambda[1]
out
}
glmnet_offset$predict <- function(modelFit, newdata, submodels = NULL) {
if(!is.matrix(newdata)) newdata <- Matrix::as.matrix(newdata)
out <- predict(modelFit,
newdata[,colnames(newdata) != "offset"],
s = modelFit$lambdaOpt,
newoffset = newdata[,colnames(newdata) == "offset"],
type = "response") #important for measures to be appropriate
if(is.matrix(out)) out <- out[,1]
out
if(!is.null(submodels)) {
tmp <- as.list(as.data.frame(predict(modelFit,
newdata[,colnames(newdata) != "offset"],
s = submodels$lambda,
newoffset = newdata[,colnames(newdata) == "offset"],
type = "response"),
stringsAsFactors = TRUE))
out <- c(list(out), tmp)
}
out
}
For some reason which I don't understand yet it does not work without the prob slot
glmnet_offset$prob <- glmnet_offset$predict
glmnet_offset$tags = c("Generalized Linear Model", "Implicit Feature Selection",
"L1 Regularization", "L2 Regularization", "Linear Classifier",
"Linear Regression")
glmnet_offset$sort = function(x) x[order(-x$lambda, x$alpha),]
glmnet_offset$trim = function(x) {
x$call <- NULL
x$df <- NULL
x$dev.ratio <- NULL
x
}
library(tidyverse)
library(caret)
library(glmnet)
n <- 100
set.seed(123)
dat <- tibble(
nb_claims = rpois(n, lambda = 0.5),
duration = runif(n),
x1 = runif(n),
x2 = runif(n),
x3 = runif(n)
)
x = dat %>%
dplyr::select(-nb_claims) %>%
mutate(offset = log(duration)) %>%
dplyr::select(-duration) %>%
as.matrix
fit <- caret::train(
x = x,
y = dat %>% pull(nb_claims),
method = glmnet_offset,
)
fit
100 samples
4 predictor
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:
alpha lambda RMSE Rsquared MAE
0.10 0.0001640335 0.7152018 0.01805762 0.5814200
0.10 0.0016403346 0.7152013 0.01805684 0.5814193
0.10 0.0164033456 0.7130390 0.01798125 0.5803747
0.55 0.0001640335 0.7151988 0.01804917 0.5814020
0.55 0.0016403346 0.7150312 0.01802689 0.5812936
0.55 0.0164033456 0.7095996 0.01764947 0.5783706
1.00 0.0001640335 0.7152033 0.01804795 0.5813997
1.00 0.0016403346 0.7146528 0.01798979 0.5810811
1.00 0.0164033456 0.7063482 0.01732168 0.5763653
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 0.01640335.
predict(fit$finalModel, x[,1:3], newoffset = x[,4]) #works
This will not work with preprocessing in caret since we pass offset as one of the features. However it will work with recipes since you can define columns on which preprocessing functions will be performed via selections. Se article for details: https://tidymodels.github.io/recipes/articles/Selecting_Variables.html
I haven't had time to error check my code. If any problems occur or if there is a mistake somewhere please comment. Thanks.
You can also post an issue in caret github asking this feature (offset/newoffset) to be added to the model
I tried to change the model info a lot of ways, but it was failing miserably. Below I can propose one solution, may not be the best, but will get you somewhere if your data is sensible.
In the poisson / negative binom .. regression, the offset in factor gets introduced into the regression, you can read more here and here:
where tx is the offset. In glmnet, there is a penalty factor you can introduce for each term, and if you let that be 0 for a term, basically you are not penalizing it and it's always included. We can use that for the offset, and you can see this effect only if you use a dataset that makes some sense (note that in your example dataset, the offsets are numbers that make no sense).
Below I use the insurance claims dataset from MASS:
library(tidyverse)
library(glmnet)
library(MASS)
dat <- Insurance
X = model.matrix(Claims ~ District + Group + Age,data=dat)
Y = dat$Claims
OFF = log(dat$Holders)
fit_cv <- cv.glmnet(
x = X,
y = Y,
family = "poisson",
offset = OFF
)
Now using caret, I will fit it without any training, and using the same lambda obtained from the fit in cv.glmnet. One thing you should note too is that cv.glmnet often uses lambda.1se instead of lambda.min:
fit_c <- caret::train(
x = cbind(X,OFF),
y = Y,
method = "glmnet",
family = "poisson",
tuneGrid=data.frame(lambda=fit_cv$lambda.1se,alpha=1),
penalty=c(rep(1,ncol(X)),0),
trControl = trainControl(method="none")
)
We can see how different are the predictions:
p1 = predict(fit_cv,newx=X,newoffset=OFF)
p2 = predict(fit_c,newx=cbind(X,OFF))
plot(p1,p2)

Error with bootmer and confint for glmer

I'm running into an error that I can't find any documentation on when I try to bootstrap a glmer object:
glm2 <- glmer(RT~valence+location+first_location+Trial_num +
(1+Trial_num|id)+(1|Trial_num),
family=inverse.gaussian(log),
control = glmerControl(optimizer = "nloptwrap",
calc.derivs = FALSE), data=df_long)
The error is:
Error in lme4::.simulateFun(object = , :
could not find function "sfun
This is regardless of whether I try bootMer or confint:
bootMer_out <- bootMer(glm2,FUN=fixef, nsim=300)
confint_out <- confint(glm2, method="boot")
When I run as an lmer object I don't have the issue with bootstrapping. i.e.
lm2 <- glmer(RT~valence+location+first_location+Trial_num + (1+Trial_num|id)+(1|Trial_num), family=inverse.gaussian(log), control = glmerControl(optimizer = "nloptwrap", calc.derivs = FALSE), data=df_long))
Does it have to do with the link function? Is there a workaround? I couldn't find function 'sfun' in the simulateFun documentation either. I could always just do the transformation on the data separately and use lmer instead of glmer, but if anyone has some insight that would be great (since I'm curious now).
As pointed out by #user20650, you'll need to add a simulation method for the inverse gaussian family.
For example, I added these to a branch on my lme4 fork under predict.R:
rinvgauss <- function(n, mu, lambda) {
# transcribed from https://en.wikipedia.org/wiki/Inverse_Gaussian_distribution
nu <- rnorm(n)
y <- nu^2
x <- mu + (mu^2 * y)/(2*lambda) - (mu/(2*lambda)) * sqrt(4*mu*lambda*y + mu^2*y^2)
z <- runif(n)
ifelse(z <= mu/(mu + x), x, mu^2/x)
}
inverse.gaussian_simfun <- function(object, nsim, ftd = fitted(object),
wts = weights(object)) {
if (any(wts != 1)) message("using weights as inverse variances")
dispersion <- sum((weights(object, 'working') *
resid(object, 'working')^2)[weights(object, 'working')>0])/df.residual(object)
rinvgauss(nsim * length(ftd), mu = ftd,
lambda = wts/dispersion)
}
# ... skip a few
simfunList <- list(gaussian = gaussian_simfun,
binomial = binomial_simfun,
poisson = poisson_simfun,
Gamma = Gamma_simfun,
negative.binomial = negative.binomial_simfun,
inverse.gaussian = inverse.gaussian_simfun)
Here's an example:
# devtools::install_github('aforren1/lme4', ref = 'add_invgauss_simulate')
library(lme4)
set.seed(1)
dat <- data.frame(y = lme4:::rinvgauss(1000, 3, 4),
x = runif(1000),
subj = factor(rep(1:10, 100)))
mod <- glmer(y ~ x + (1|subj),
data = dat,
family = inverse.gaussian(link='log'))
# ~60 secs on my laptop
(boots <- confint(mod, method = 'boot', nsim = 100, parm = 'beta_'))
2.5 % 97.5 %
(Intercept) 1.0044813 1.248774
x -0.2158155 0.161213
(walds <- confint(mod, method = 'Wald', parm = 'beta_'))
2.5 % 97.5 %
(Intercept) 1.000688 1.2289971
x -0.205546 0.1644621
You can see that the bootstrap method gives (roughly) the same results as the Wald method.

Error with custom SVM model for tuning in caret

I'm trying to follow this link to create a custom SVM and run it through some cross-validations. My primary reason for this is to run Sigma, Cost and Epsilon parameters in my grid-search and the closest caret model (svmRadial) can only do two of those.
When I attempt to run the code below, I get the following error all over the place at every iteration of my grid:
Warning in eval(expr, envir, enclos) :
model fit failed for Fold1.: sigma=0.2, C=2, epsilon=0.1 Error in if (!isS4(modelFit) & !(method$label %in% c("Ensemble Partial Least Squares Regression", :
argument is of length zero
Even when I replicate the code from the link verbatim, I get a similar error and I'm not sure how to solve it. I found this link which goes through how the custom models are built and I see where this error is referenced, but still not sure what the issue is. I have my code below:
#Generate Tuning Criteria across Parameters
C <- c(1,2)
sigma <- c(0.1,.2)
epsilon <- c(0.1,.2)
grid <- data.frame(C,sigma)
#Parameters
prm <- data.frame(parameter = c("C", "sigma","epsilon"),
class = rep("numeric", 3),
label = c("Cost", "Sigma", "Epsilon"))
#Tuning Grid
svmGrid <- function(x, y, len = NULL) {
expand.grid(sigma = sigma,
C = C,
epsilon = epsilon)
}
#Fit Element Function
svmFit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
ksvm(x = as.matrix(x), y = y,
type = "eps-svr",
kernel = rbfdot,
kpar = list(sigma = param$sigma),
C = param$C,
epsilon = param$epsilon,
prob.model = classProbs,
...)
}
#Predict Element Function
svmPred <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata)
#Sort Element Function
svmSort <- function(x) x[order(x$C),]
#Model
newSVM <- list(type="Regression",
library="kernlab",
loop = NULL,
parameters = prm,
grid = svmGrid,
fit = svmFit,
predict = svmPred,
prob = NULL,
sort = svmSort,
levels = NULL)
#Train
tc<-trainControl("repeatedcv",number=2, repeats = 0,
verboseIter = T,savePredictions=T)
svmCV <- train(
Y~ 1
+ X1
+ X2
,data = data_nn,
method=newSVM,
trControl=tc
,preProc = c("center","scale"))
svmCV
After viewing the second link provided, I decided to try and include a label into the Model's parameters and that solved the issue! It's funny that it worked because the caret documentation says that value is optional, but if it works I can't complain.
#Model
newSVM <- list(label="My Model",
type="Regression",
library="kernlab",
loop = NULL,
parameters = prm,
grid = svmGrid,
fit = svmFit,
predict = svmPred,
prob = NULL,
sort = svmSort,
levels = NULL)

Passing the weights argument to a regression function inside an R function

I am trying to write an R function to run a weighted (optional) regressions, and I am having difficulties getting the weight variable to work.
Here is a simplified version of the function.
HC <- function(data, FUN, formula, tau = 0.5, weights = NULL){
if(is.null(weights)){
est <- FUN(data = data, formula = formula, tau = tau)
intercept = est$coef[["(Intercept)"]]
zeroWorker <- exp(intercept)
}
else {
est <- FUN(data = data, formula = formula, tau = tau, weights = weights)
intercept = est$coef[["(Intercept)"]]
zeroWorker <- exp(intercept)
}
return(zeroWorker)
}
The function works perfectly if I do not use the weights argument.
mod1 <- HC(data = mydata, formula = lin.model, tau = 0.2,
FUN = rq)
But, throws an error message when I use the weights argument.
mod2 <- HC(data = mydata, formula = lin.model, tau = 0.2,
FUN = rq, weights = weig)
I google the problem, and this post seems to be the closest to my problem, but I could still not get it to work. R : Pass argument to glm inside an R function.
Any help will be appreciated.
My problem can be replicated with:
library("quantreg")
data(engel)
mydata <- engel
mydata$weig <- with(mydata, log(sqrt(income))) # Create a fictive weigth variable
lin.model <- foodexp~income
mod1 <- HC(data = mydata, formula = lin.model, tau = 0.2,
FUN = rq) # This works perfectly
mod2 <- HC(data = mydata, formula = lin.model, tau = 0.2,
FUN = rq, weights = weig) # throws an error.
Error in HC(data = mydata, formula = lin.model, tau = 0.2, FUN = rq, weights = weig) :
object 'weig' not found
You have two problems. The error you're encountering is because you're trying to use the weigh variable without referencing it as coming from the mydata dataset. Try using mydata$weig. This will solve your first error, but you then get the actual one related to using the weights argument, which is:
Error in model.frame.default(formula = formula, data = data, weights = substitute(weights), :
invalid type (symbol) for variable '(weights)'
The solution is to add the variable specified in HC's weights argument to the dataframe before passing it to FUN:
HC <- function(data, FUN, formula, tau = 0.5, weights = NULL){
data$.weights <- weights
if(is.null(weights)){
est <- FUN(data = data, formula = formula, tau = tau)
} else {
est <- FUN(data = data, formula = formula, tau = tau, weights = .weights)
}
intercept = est$coef[["(Intercept)"]]
zeroWorker <- exp(intercept)
return(zeroWorker)
}
Then everything works:
mod2 <- HC(data = mydata, formula = lin.model, tau = 0.2, FUN = rq, weights = mydata$weig)
mod2
# [1] 4.697659e+47

Resources