My mixed models usually contain several categorical variables with many unique levels, so the X matrix is very sparse.
I use the glmmTMB package, which can handle the X and Z matrices as sparse. This significantly reduces RAM usage while fitting the model.
The glmmTMB package is great, but I have run into one problem (maybe I'm missing something):
when I use interactions between a numeric variable and a categorical variable (as fixed effects), the model is fitted without errors.
For example, this model works well:
fit = glmmTMB(Y ~ 0 + num1:factor1 + num2:factor1 + factor2 +
(0 + num3|subject) + (0 + num4|subject) + (1|subject),
model_data, REML = TRUE, sparseX=c(cond=TRUE))
But when I use any interaction between two categorical variables, i.e. the formula looks like this:
fit = glmmTMB(Y ~ 0 + num1:factor1 + factor3:factor1 + factor2 +
(0 + num2|subject) + (0 + num3|subject) + (1|subject),
model_data, REML = TRUE, sparseX=c(cond=TRUE))
I get the following error:
iter: 5 Error in newton(par = c(beta = 1, beta = 1, beta = 1, beta = 1, beta = 1, :
Newton failed to find minimum.
In addition: Warning message:
In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
outer mgc: NaN
Error in (function (start, objective, gradient = NULL, hessian = NULL, :
gradient function must return a numeric vector of length 4
At the same time, interactions between two categorical variables are perfectly valid in mixed model theory.
Moreover, such a model (with interactions between two factors) is fitted successfully by the Julia MixedModels package.
Could you please help me understand the root cause of this error?
Is there a way to avoid it in a model with interactions between two categorical variables?
Why do such models work with Julia MixedModels but not with glmmTMB?
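One thing worth ruling out is rank deficiency in the fixed-effects design matrix. The sketch below uses a small made-up data frame (the level names and sizes are assumptions; only the factor names mirror the question) and base R to check the categorical part of the failing formula. With no intercept and no main effects for factor1 or factor3, R codes factor3:factor1 with a full set of indicator columns, and those sum to the same constant column as the factor2 indicators. Whether this is what actually trips the optimizer in the real model is a guess, not a confirmed diagnosis.

```r
# Toy data with hypothetical levels; balanced so every factor1 x factor3 cell is observed.
toy <- data.frame(
  factor1 = factor(rep(c("a", "b"), each = 6)),
  factor2 = factor(rep(c("u", "v", "w"), times = 4)),
  factor3 = factor(rep(c("p", "q"), times = 6))
)

# Fixed-effects design matrix for the categorical terms of the failing formula.
X <- model.matrix(~ 0 + factor3:factor1 + factor2, data = toy)

# Compare the number of columns with the numerical rank of X.
n_cols <- ncol(X)
rank_X <- qr(X)$rank

cat("columns:", n_cols, " rank:", rank_X, "\n")
if (rank_X < n_cols) {
  cat("X is rank deficient; some coefficients are not identifiable.\n")
}
```

If the same check flags the real X, possible workarounds are to recode the redundant term (for example by keeping the intercept or the main effects) or to drop empty factor combinations. As far as I know, MixedModels.jl pivots out rank-deficient columns automatically, which may explain the difference. Recent glmmTMB versions also document a rank_check option in glmmTMBControl(), but I have not verified how it interacts with sparseX.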
Related
I would like to use the gamlss package to fit a model, because it offers more distributions. However, I am struggling to specify my random effects correctly, or at least I think there is a mistake: when I compare the output of an lmer model with a Gaussian distribution to that of a gamlss model with a Gaussian distribution, the results differ. When I compare an lm model without random effects to a gamlss model with a Gaussian distribution and no random effects, the output is similar.
Unfortunately, I cannot share my data to make this reproducible.
Here is my code:
df <- subset.data.frame(GFW_food_agg, GFW_food_agg$fourC_area_perc < 200, select = c("ISO3", "Year", "Forest_loss_annual_perc_boxcox", "fourC_area_perc", "Pop_Dens_km2", "Pop_Growth_perc", "GDP_Capita_current_USD", "GDP_Capita_growth_perc",
"GDP_AgrForFis_percGDP", "Gini_2008_2018", "Arable_land_perc", "Forest_loss_annual_perc_previous_year", "Forest_extent_2000_perc"))
fourC <- lmer(Forest_loss_annual_perc_boxcox ~ fourC_area_perc + Pop_Dens_km2 + Pop_Growth_perc + GDP_Capita_current_USD +
GDP_Capita_growth_perc + GDP_AgrForFis_percGDP + Gini_2008_2018 + Arable_land_perc + Forest_extent_2000_perc + (1|ISO3) + (1|Year),
data = df)
summary(fourC)
resid_panel(fourC)
df <- subset.data.frame(GFW_food_agg, GFW_food_agg$fourC_area_perc < 200, select = c("ISO3", "Year", "Forest_loss_annual_perc_boxcox", "fourC_area_perc", "Pop_Dens_km2", "Pop_Growth_perc", "GDP_Capita_current_USD", "GDP_Capita_growth_perc",
"GDP_AgrForFis_percGDP", "Gini_2008_2018", "Arable_land_perc", "Forest_loss_annual_perc_previous_year", "Forest_extent_2000_perc"))
df <- na.omit(df)
df$ISO3 <- as.factor(df$ISO3)
df$Year <- as.factor(df$Year)
fourC <- gamlss(Forest_loss_annual_perc_boxcox ~ fourC_area_perc + Pop_Dens_km2 + Pop_Growth_perc + GDP_Capita_current_USD +
GDP_Capita_growth_perc + GDP_AgrForFis_percGDP + Gini_2008_2018 + Arable_land_perc + Forest_extent_2000_perc + random(ISO3) + random(Year),
data = df, family = NO, control = gamlss.control(n.cyc = 200))
summary(fourC)
plot(fourC)
How do the random effects need to be specified in gamlss to be similar to the random effects in lmer?
If I specify the random effects instead using
re(random = ~1|ISO3) + re(random = ~1|Year)
I get the following error:
Error in model.frame.default(formula = Forest_loss_annual_perc_boxcox ~ :
variable lengths differ (found for 're(random = ~1 | ISO3)')
I found the +re(random=~1|x) specification to work fairly well with my GAMLSS models. Have you double-checked that the NAs are actually being removed from your dataset? Sometimes na.omit does not do what you expect.
Have a look at this thread, which reports the same error as yours, but in a GAM. You can try the code from there to remove your NAs:
Error in model.frame.default: variable lengths differ
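As a rough illustration of the NA check suggested above, the sketch below filters on exactly the columns that enter the model and only then converts the grouping variables to factors. The data frame `raw` and its column names are hypothetical stand-ins, since the original data cannot be shared.

```r
# Hypothetical stand-in for the real data, with NAs scattered over several columns.
raw <- data.frame(
  y      = c(1.2, 0.8, NA, 1.5, 0.9, 1.1),
  x1     = c(10, NA, 12, 11, 13, 9),
  ISO3   = c("DEU", "DEU", "FRA", "FRA", "ITA", NA),
  Year   = c(2001, 2002, 2001, 2002, 2001, 2002),
  unused = c(NA, NA, NA, 1, 2, 3),  # not in the model, so it should not cost rows
  stringsAsFactors = FALSE
)

# Columns that actually appear in the model formula.
model_vars <- c("y", "x1", "ISO3", "Year")

# Keep only those columns, and only rows that are complete on them.
clean <- raw[complete.cases(raw[, model_vars]), model_vars]

# Convert grouping variables after filtering so no empty levels are left behind.
clean$ISO3 <- factor(clean$ISO3)
clean$Year <- factor(clean$Year)

# Fail early if any NA survived.
stopifnot(!anyNA(clean))
print(clean)
```

Fitting both lmer and gamlss on the same cleaned data frame also makes sure the two models are compared on identical rows.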
I'm trying to run a GAM using the mgcv package with a response variable that is a proportion. The data are overdispersed, so initially I used a quasibinomial distribution. However, because I'm using model selection, that's not particularly useful, as it does not produce AIC scores.
Instead I'm trying to use betar distribution, as I've read that it could be appropriate.
mRI_br <- bam(ri ~ SE_score + s(deg, k=7) + s(gs, k=7) + TL + species + sex + season + year + s(code, bs = 're') + s(station, bs = 're'), family=betar(), data=node_dat, na.action = "na.fail")
However, I get this warning when I run the model.
Warning messages:
1: In estimate.theta(theta, family, y, mu, scale = scale1, ... :
step failure in theta estimation
And when I try and check the model summary I get this error.
> summary(mRI_br)
Error in chol.default(diag(p) + crossprod(S2 %*% t(R2))) :
the leading minor of order 62 is not positive definite
I would like to know:
What is causing these errors and warnings, and how can they be solved?
If they cannot be solved, are there any other distributions for proportion data that would let me use model selection techniques afterwards (such as the dredge function from the MuMIn package)?
A copy of the dataset can be found here
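One thing worth checking, since the cause is not confirmed here, is whether `ri` contains exact 0s or 1s. The beta density is defined on the open interval (0, 1), and boundary values are a common source of trouble in beta regression. The sketch below uses made-up values and a hypothetical helper, `squeeze_unit()`, that applies the widely used (y(n - 1) + 0.5) / n compression. It is an illustration of the check, not a confirmed fix for this model.

```r
# Hypothetical helper: compress proportions in [0, 1] into the open interval (0, 1).
squeeze_unit <- function(y) {
  stopifnot(all(y >= 0 & y <= 1, na.rm = TRUE))
  n <- sum(!is.na(y))
  (y * (n - 1) + 0.5) / n
}

# Made-up proportions that include both boundaries.
ri_example <- c(0, 0.12, 0.5, 0.97, 1)

# How many observations sit exactly on a boundary?
n_boundary <- sum(ri_example == 0 | ri_example == 1)
cat("boundary values:", n_boundary, "\n")

ri_squeezed <- squeeze_unit(ri_example)
print(range(ri_squeezed))
```

mgcv's betar() also has an eps argument related to values at the boundaries; its help page describes how it is applied.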
I fitted a zero inflated poisson model using gam() from the mgcv package
ziplss.fit.mixed <- gam(
list(
n.ind ~ as.factor(data.type) + s(year) +s(enz,bs="re") + s(clc.3,bs="re") + te(lon, lat),
~ as.factor(data.type) + s(year) + s(enz,bs="re") + s(clc.3,bs="re") + te(lon, lat)),
family=ziplss(),
data = dat,
control = gam.control(keepData = TRUE)
)
Every time I try to do predictions I get the following message
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I even tried to predict using the data I used for the fit. It gives the same error
predict.gam(ziplss.fit.mixed, newdata = ziplss.fit.mixed$data, newdata.guaranteed = TRUE)
If I change back to a regular poisson gam with the same formula then it works.
Any idea why the ziplss option triggers the error?
Thanks for your help
For some reason I do not understand, it appears you have to convert your variables to the right class yourself before predicting from a ziplss family fit.
For instance, if you wrote as.factor(my.var) in your model, then my.var in your new data must already be of class factor.
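A minimal sketch of that advice, using a made-up data frame (the values and level names of data.type are assumptions): convert the variable to a factor once in the training data, use the bare name in the formula, and build the prediction data with the same levels.

```r
# Hypothetical training data; data.type arrives as character, e.g. from a CSV file.
dat <- data.frame(
  n.ind     = c(0, 3, 0, 1, 5, 0),
  data.type = c("survey", "atlas", "survey", "atlas", "survey", "atlas"),
  stringsAsFactors = FALSE
)

# Convert once, up front, instead of calling as.factor() inside the formula.
dat$data.type <- factor(dat$data.type)

# New data for prediction: a single row would otherwise yield a one-level factor.
newdat <- data.frame(data.type = "atlas", stringsAsFactors = FALSE)

# Reuse the training levels so both data frames agree on the factor coding.
newdat$data.type <- factor(newdat$data.type, levels = levels(dat$data.type))

str(newdat)
```

The model formula would then refer to data.type directly, and predict() would receive `newdat` with a factor that carries both training levels.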
Using glmer, I can run a logistic regression mixed model just fine. But when I try to do the same using glmulti, I get errors (described below). I think the problem is with the function I am specifying for use in glmulti. I want a function that specifies a logistic regression model for data containing continuous fixed covariates and categorical random effects, using a logit link. The response variable is a binary 0/1.
Sample data:
library(lme4)
library(rJava)
library(glmulti)
set.seed(666)
x1 = rnorm(1000) # some continuous variables
x2 = rnorm(1000)
x3 = rnorm(1000)
r1 = rep(c("red", "blue"), times = 500) #categorical random effects
r2 = rep(c("big", "small"), times = 500)
z = 1 + 2*x1 + 3*x2 +2*x3
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr) # bernoulli response variable
df = data.frame(y=y,x1=x1,x2=x2, x3=x3, r1=r1, r2=r2)
A single glmer logistic regression works just fine:
model1<-glmer(y~x1+x2+x3+(1|r1)+(1|r2),data=df,family="binomial")
But errors occur when I try to use the same model structure through glmulti:
# create a function - I think this is where my problem is
glmer.glmulti<-function(formula, data, family=binomial(link ="logit"), random="", ...){
glmer(paste(deparse(formula),random),data=data,...)
}
# run glmulti models
glmulti.logregmixed <-
glmulti(formula(glmer(y~x1+x2+x3+(1|r1)+(1|r2), data=df), fixed.only=TRUE), #error w/o fixed.only=TRUE
data=df,
level = 2,
method = "g",
crit = "aicc",
confsetsize = 128,
plotty = F, report = F,
fitfunc = glmer.glmulti,
family = binomial(link ="logit"),
random="+(1|r1)","+(1|r2)", # possibly this line is incorrect?
intercept=TRUE)
#Errors returned:
singular fit
Error in glmulti(formula(glmer(y ~ x1 + x2 + x3 + (1 | r1) + (1 | r2), :
Improper call of glmulti.
In addition: Warning message:
In glmer(y ~ x1 + x2 + x3 + (1 | r1) + (1 | r2), data = df) :
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
I've tried various changes to the function, and to the formula and fitfunc portions of the glmulti code. I've tried substituting lmer for glmer, and I guess I don't understand the error. I'm also afraid that calling lmer may change the model structure, as during one of my attempts the summary() of the model stated "Linear mixed model fit by REML ['lmerMod']." I need the glmulti models to be the same as what I'm obtaining with model1 using glmer (i.e. summary(model1) gives "Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']").
Many similar questions remain unanswered. Thanks in advance!
Credit:
sample data set created with help from here:
https://stats.stackexchange.com/questions/46523/how-to-simulate-artificial-data-for-logistic-regression
glmulti code adapted from here:
Model selection using glmulti
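A few things in the code above look like likely culprits, though none is confirmed here. The wrapper never forwards family to glmer, so the fit falls back to the Gaussian default. The inner glmer() call that supplies the formula also has no family, which matches the deprecation warning. And random="+(1|r1)","+(1|r2)" passes the second string as a separate unnamed argument instead of one random-effects string. The sketch below shows one way the wrapper could be written. The helper name `build_mixed_formula()` is hypothetical, and the wrapper is only defined, not run, because running it requires lme4.

```r
# Hypothetical helper: glue a fixed-effects formula and a random-effects string together.
build_mixed_formula <- function(fixed, random) {
  # deparse() can split long formulas over several strings, so collapse them first.
  fixed_txt <- paste(deparse(fixed), collapse = " ")
  stats::as.formula(paste(fixed_txt, random))
}

# Wrapper intended for glmulti's fitfunction argument; note that family is forwarded.
glmer.glmulti <- function(formula, data, random = "",
                          family = binomial(link = "logit"), ...) {
  lme4::glmer(build_mixed_formula(formula, random),
              data = data, family = family, ...)
}

# The random part goes in as ONE string containing both terms.
random_part <- "+ (1 | r1) + (1 | r2)"
full_formula <- build_mixed_formula(y ~ x1 + x2 + x3, random_part)
print(full_formula)
```

In the glmulti() call, the first argument could then be the plain fixed-effects formula y ~ x1 + x2 + x3, with random = random_part passed as a single named argument.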
I am trying to use the caret package to tune the 'df' parameter of a gam model for my cohort analysis.
With the following data:
cohort = 1:60
age = 1:26
grid = data.frame(expand.grid(age = age, cohort = cohort))
size = data.frame(cohort = cohort, N = sample(100:150,length(cohort), replace = TRUE))
df = merge(grid, size, by = "cohort")
log_k = -3 + log(df$N) - 0.5*log(df$age) + df$cohort*(df$cohort-30)*(df$cohort-50)/20000 + runif(nrow(df),min = 0, max = 0.5)
df$conversion = rpois(nrow(df),exp(log_k))
Explanation of the data: cohort number is the time of arrival of the potential customers. N is the number of potential customers that arrived at that time. Conversion is the number of those potential customers that 'converted' (bought something). Age is the age (time elapsed since arrival) of the cohort when the conversion took place. For a given cohort there are fewer conversions as age grows. This effect follows a power law.
But the total conversion rate of each cohort can also change slowly in time (cohort number). Thus I want a smoothing spline of the time variable in my model.
I can fit a gam model from package gam
library(gam)
fit = gam(conversion ~ log(N) + log(age) + s(cohort, df = 4), data = df, family = poisson)
fit
> Call:
> gam(formula = conversion ~ log(N) + log(age) + s(cohort, df = 4),
> family = poisson, data = df)
> Degrees of Freedom: 1559 total; 1553 Residual
> Residual Deviance: 1869.943
But if I try to train the model using the caret package
library(caret)
fitControl = trainControl(verboseIter = TRUE)
fit.crt = train(conversion ~ log(N) + log(age) + s(cohort,df),
data = df, method = "gamSpline",
trControl = fitControl, tune.length = 3, family = poisson)
I get this error :
+ Resample01: df=1
model fit failed for Resample01: df=1 Error in as.matrix(x) : object 'N' not found
- Resample01: df=1
+ Resample01: df=2
model fit failed for Resample01: df=2 Error in as.matrix(x) : object 'N' not found .....
Please does anyone know what I'm doing wrong here?
Thanks
There are two things wrong with your code.
1. The train function can be a bit tedious depending on the method you use (as you have noticed). In the case of method = "gamSpline", the train function adds a smooth term to every independent term in the formula. So it converts your variables to s(log(N), df), s(log(age), df) and s(s(cohort, df), df). But s(s(cohort, df), df) does not really make sense, so you must change s(cohort, df) to cohort.
2. I am not sure why, but train with method = "gamSpline" does not like it when you put functions (e.g. log) in the formula. I think this is because the method already applies the s() function to your variables. You can solve this by applying the log to your variables beforehand, for example df$N <- log(df$N), or logN <- log(df$N) and then using logN as the variable. And of course, do the same for age.
Based on the code you provided, my guess is that you don't want this method to apply a smoothing term to all your independent variables. I am not sure whether that is possible, or how to do it if it is.
Hope this helps.
EDIT: If you want a more elegant solution than the one I provided in point 2, make sure to read the comment by @topepo. If I understand it correctly, that suggestion also lets you apply the s() function only to the variables you choose.
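The two points above can be sketched as follows. The data generation follows the question, with a seed added as an assumption so the sketch is reproducible. The column names logN and logAge are illustrative. The train() call is shown only as a comment because it needs the caret and gam packages. Note that the sketch spells the tuning argument as tuneLength, which is the name caret::train() documents, rather than tune.length as in the question.

```r
set.seed(42)  # assumption: seed added only for reproducibility

# Data as in the question.
cohort <- 1:60
age <- 1:26
grid <- expand.grid(age = age, cohort = cohort)
size <- data.frame(cohort = cohort,
                   N = sample(100:150, length(cohort), replace = TRUE))
df <- merge(grid, size, by = "cohort")
log_k <- -3 + log(df$N) - 0.5 * log(df$age) +
  df$cohort * (df$cohort - 30) * (df$cohort - 50) / 20000 +
  runif(nrow(df), min = 0, max = 0.5)
df$conversion <- rpois(nrow(df), exp(log_k))

# Point 2: apply the log transforms before calling train().
df$logN <- log(df$N)
df$logAge <- log(df$age)

# Point 1: plain variable names only; gamSpline adds the s() terms itself.
train_formula <- conversion ~ logN + logAge + cohort

# Assumed call, not run here (requires caret and gam):
# fit.crt <- caret::train(train_formula, data = df, method = "gamSpline",
#                         trControl = caret::trainControl(verboseIter = TRUE),
#                         tuneLength = 3, family = poisson)
print(train_formula)
```

With this setup every predictor gets a smooth term, including logN and logAge, which is the limitation mentioned above.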