CPU requirement to run logistic regression with pairwise interactions - r

I am trying to fit a logistic regression model with pairwise interactions for 7 variables; however, I have let the code run for as long as 12 hours, and still no results. My dataset is not terribly large…about 3000 lines. One of my variables has 82 degrees of freedom, and I am wondering if that is the problem? I have no problem running the code with main level effects, but I expect interactions between my variables, so I would like pairwise interactions included. I have tried adding arguments to the code to speed up the process, but I still can’t get the code to kick back results even after 12 hours of running. I am using the glmulti package to fit the model, and I included the method = “g” argument and the conseq = 5 argument in an attempt to make the code run faster. Is there anything else I can do to speed it up? Another code or different package, or is a basic laptop just not enough to run it?
This is the code I used:
detectmodel<- glmulti::glmulti(outcome~ bird + year + season + sex + numobs + obsname + season + month, data=detect, level=2, fitfunction=glm, crit="aicc", family=binomial, confsetsize=10, method = "g")

Related

Computation time GAM

I am fitting the below GAM in mgcv
m3.2 <- bam(pt10 ~
s(year, by = org.type) +
s(year, by = region) +
s(org.name, bs = 're') +
s(org.name, year, bs = 're'),
data = dat,
method = "fREML",
family = betar(link="logit"),
select = T,
discrete = T)
My dat has ~58,000 observations, and the factor org.name has ~2,500 levels - meaning there are a lot of random intercepts and slopes to be fit. Hence, in an attempt to reduce computation time, I have used bam() and the discrete = T option. However, my model has now been running for ~36 hours and still has not fit nor failed and provided me with an error message. I am unsure how long might be reasonable for such a model to take to fit, and therefore how/when to decide to kill the command or not; I don't want to stop the model running if this is normal behaviour/computational time for such a model, but also don't want to waste my time if bam() is stuck going around in circles and will never fit.
Question: How long might be reasonable for such a model to take to fit / what would reasonable computation time for such a model be? Is there a way I can determine if bam() is still making progress or if the command should just be killed to avoid wasting time?
My computer has 16GB RAM and an Intel(R) Core(TM) i7-8565u processor (CPU # 1.80GHz). In my Windows Task Manager I can see that RStudio is using 20-30% CPU and 20-50% memory and that these usage values are changing and are not static.
To see what bam() is doing, you should have set trace = TRUE in the list passed to the control argument:
ctrl <- gam.control(trace = TRUE)
m3.2 <- bam(pt10 ~
org.type + region, # include parametric terms for group means
s(year, by = org.type) +
s(year, by = region) +
s(org.name, bs = 're') +
s(org.name, year, bs = 're'),
data = dat,
method = "fREML",
family = betar(link="logit"),
select = TRUE,
discrete = TRUE,
control = ctrl)
That way you'd get printed statements in the console while bam() is doing it's thing. I would check on a much smaller set of data that this actually works (run the example in ?bam() say) within RStudio on Windows; I've never used it so you wouldn't want the trace output to only come once the function had finished, in a single torrent.)
Your problem is the random effects, not really the size of the data; you are estimating 2500 + 2500 coefficients for your two random effects and another 10 * nlevels(org.type) + 10 * nlevels(region) for the two factor by smooths. This is a lot for 58,000 observations.
As you didn't set nthreads it's fitting the model on a single CPU core. I wouldn't necessarily change that though as using 2+ threads might just make the memory situation worse. 16GB on Windows isn't very much RAM these days - RStudio is using ~half the available RAM with your other software and Windows using the remaining 36% that Task Manager is reporting is in use.
You should also check if the OS is having to swap memory out to disk; if that's happening then give up as retrieving data from disk for each iteration of model fitting is going to be excruciating, even with a reasonably fast SSD.
The random effects can be done more efficiently in dedicated mixed model software but then you have the problem of writing the GAM bits (the two factor by smooths) in that form - you would write the random effects in the required notation for {brms} or {lme4} respectively as (1 + year | org.name) in the relevant part of the formula (or the random argument in gamm4()).
{brms} and {gamm4} can do this bit, but for the former you need to know how to drive {brms} and Stan (which is doing HMC sampling of the posterior), while {lme4}, which is what {gamm4} uses to do the fitting, doesn't have a beta response family. The {bamlss} package has options for this too, but it's quite a complex package so be sure to understand how to specify the model estimation method.
So perhaps revisit your model structure; why do you want a smooth of year for the regions and organisation types, but a linear trend for individual organisations?

A strange case of singular fit in lme4 glmer - simple random structure with variation among groups

Background
I am trying to test for differences in wind speed data among different groups. For the purpose of my question, I am looking only on side wind (wind direction that is 90 deg from the individual), and I only care about the strength of the wind. Thus, I use absolute values. The range of wind speeds is 0.0004-6.8 m/sec and because I use absolute values, Gamma distribution describes it much better than normal distribution.
My data contains 734 samples from 68 individuals, with each individual having between 1-30 repeats. However, even if I reduce my samples to only include individuals with at least 10 repeats (which leaves me with 26 individuals and a total of 466 samples), I still get the problematic error.
The model
The full model is Wind ~ a*b + c*d + (1|individual), but for the purpose of this question, the simple model of Wind ~ 1 + (1|individual) gives the same singularity error, so I do not think that the explanatory variables are the problem.
The complete code line is glmer(Wind ~ 1 + (1|individual), data = X, family = Gamma(log))
The problem and the strange part
When running the model, I get the boundary (singular) fit: see ?isSingular error, although, as you can see, I use a very simple model and random structure. The strange part is that I can solve this by adding 0.1 to the Wind variable (i.e. glmer(Wind+0.1 ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not give any error). I honestly do not remember why I added 0.1 the first time I did it, but I was surprised to see that it solved the error.
The question
Is this a problem with lme4? Am I missing something? Any ideas what might cause this and why does me adding 0.1 to the variable solve this problem?
Edit following questions
I am not sure what's the best way to add data, so here is a link to a csv file in Google drive
using glmmTMB does not produce any warnings with the basic formula glmmTMB(Wind ~ 1 + (1|Tag), data = X, family = Gamma(log)), but gives convergence problems warnings ('non-positive-definite Hessian matrix') when using the full model (i.e., Wind ~ a*b + c*d + (1|individual)), which are then solved if I scale the continuous variables

Why is my glmer model in R taking so long to run?

I had previously been using simple stats in Statistica, but required R for my masters research. I am trying to run the following code to test for any significant interactions, and it is just running forever. If I simplify the model by taking month out, then it runs, but biologically it makes sense that month is significant so I would really like this to run including month as a factor. Once I run the model, the stop sign in R studio just stays present for hours, what could be the reason for this? Like I said I'm very new and it has been really difficult to learn this on my own. I am working with presence/absence data (as %) which I do cbind as my dependent variable. SO far this is what my coad looks like:
library(car)
library(languageR)
library(AICcmodavg)
library(lme4)
Scat <- read.csv("Scat2.csv", header=T)
attach(Scat)
names(Scat)
y <- cbind(Present,Absent)
ScatData <- glmer(y ~ Estate * Species * Month * Content * (1|Site) + Min + Max,family=binomial)
summary(ScatData)
Once I get to running the actual model, I don't even get to do the summary because R is not done computing the results of the actual model. I ran the model for approximately 4 hours, and when I clicked on the stop sign, I received this message:
Warning message:
In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf, :
failure to converge in 10000 evaluations
I would really appreciate some input on this matter.
You have a few problems with your model specification. Your model
y ~ Estate * Species * Month * Content * (1|Site) + Min + Max
is asking for all the main effects and interactions of estate, species, month, content, and site, which is incredibly complex.
Also, you have specified site as a random effect and asked for its interaction with fixed effects. I'm not sure whether that's possible, but it certainly seems wrong. You should decide whether you want site to be a fixed effect or a random effect.
If you post a minimal replicable example, I can give more specific advice.

Repeated effect LMM in R

Until recently I used SPSS for my statistics, but since I am not in University any more, I am changing to R. Things are going well, but I can't seem to replicate the results I obtained for repeated effect LMM in SPSS. I did find some treads here which seemed relevant, but those didn't solve my issues.
This is the SPSS script I am trying to replicate in R
MIXED TriDen_L BY Campaign Watering Heating
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1)
SINGULAR(0.000000000001) HCONVERGE(0,
ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED=Campaign Watering Heating Campaign*Watering Campaign*Heating
Watering*Heating Campaign*Watering*Heating | SSTYPE(3)
/METHOD=REML
/PRINT=TESTCOV
/RANDOM=Genotype | SUBJECT(Plant_id) COVTYPE(AD1)
/REPEATED=Week | SUBJECT(Plant_id) COVTYPE(AD1)
/SAVE=PRED RESID
Using the lme4 package in R I have tried:
lmm <- lmer(lnTriNU ~ Campaign + Watering + Heating + Campaign*Watering
+ Campaign*Heating + Watering*Heating + Campaign*Watering*Heating
+ (1|Genotype) + (1|Week:Plant_id), pg)
But this -and the other options I have tried for the random part- keep producing an error:
Error: number of levels of each grouping factor must be < number of observations
Obviously in SPSS everything is fine. I am suspecting I am not correctly modelling the repeated effect? Also saving predicted and residual values is not yet straightforward for me...
I hope anyone can point me in the right direction.
You probably need to take out either Week or Plant_id, as I think you have as many values for either variable as you have cases. You can nest observations within a larger unit if you add a variable to model time. I am not familiar with SPSS, but if your time variable is Week (i.e., if week has a value of 1 for the first observation, 2 for the second etc.), then it should not be a grouping factor but a random effect in the model. Something like <snip> week + (1 + week|Plant_id).
k.
Is Plant_id nested within Genotype, and Week indicate different measure points? If so, I assume that following formula leads to the required result:
lmm <- lmer(lnTriNU ~ Campaign + Watering + Heating + Campaign*Watering
+ Campaign*Heating + Watering*Heating + Campaign*Watering*Heating
+ (1+Week|Genotype/Plant_id), pg)
Also saving predicted and residual values is not yet straightforward for me...
Do you mean "computing" by "saving"? In R, all relevant information are in the returned object, and accessible through functions like residuals() or predict() etc., called on the saved object (in your case, residuals(lmm)).
Note that, by default, lmer does not use AD1-covtype.

Nested model in R

I'm having a huge problem with a nested model I am trying to fit in R.
I have response time experiment with 2 conditions with 46 people each and 32 measures each. I would like measures to be nested within people and people nested within conditions, but I can't get it to work.
The code I thought should make sense was:
nestedmodel <- lmer(responsetime ~ 1 + condition +
(1|condition:person) + (1|person:measure), data=dat)
However, all I get is an error:
Error in checkNlevels(reTrms$flist, n = n, control) :
number of levels of each grouping factor must be < number of observations
Unfortunately, I do not even know where to start looking what the problem is here.
Any ideas? Please, please, please? =)
Cheers!
This might be more appropriate on CrossValidated, but: lme4 is trying to tell you that one or more of your random effects is confounded with the residual variance. As you've described your data, I don't quite see why: you should have 2*46*32=2944 total observations, 2*46=92 combinations of condition and person, and 46*32=1472 combinations of measure and person.
If you do
lf <- lFormula(responsetime ~ 1 + condition +
(1|condition:person) + (1|person:measure), data=dat)
and then
lapply(lf$reTrms$Ztlist,dim)
to look at the transposed random-effect design matrices for each term, what do you get? You should (based on your description of your data) see that these matrices are 1472 by 2944 and 92 by 2944, respectively.
As #MrFlick says, a reproducible example would be nice. Other things you could show us are:
fit the model anyway, using lmerControl(check.nobs.vs.nRE="ignore") to ignore the test, and show us the results (especially the random effects variances and the statement of the numbers of groups)
show us the results of with(dat,table(table(interaction(condition,person))) to give information on the number of replicates per combination (and similarly for measure)

Resources