Mixed model with large sample size in R

I am currently fitting a linear mixed model (using the lme function in R), and I have some problems.
My dataset concerns damage caused by brown bears in Slovenia. Slovenia was divided into 1×1 km grid cells, and for each cell I have the number of damage cases per year, for 12 consecutive years. This damage frequency will be my Y variable in the model, and I will test different environmental variables to explain the occurrence of damage (e.g. distance to the forest edge, forest cover, etc.).
I put year as a random factor (verified with a likelihood ratio test).
My sample size is large (250,000 cell values) and consists mainly of zeros: only 4,000 cases were positive, ranging from 1 to 17 damage cases in one cell in a year.
Here is my problem. Following the methods of Zuur (2009), I am trying to find the optimal fixed structure for my model. My first model has all the variables, plus some interactions (see below). I'm using a logit link.
f1 <- formula(dam ~ masting + dens*pop_size_index + saturation + exposition +
              settlements + orchards + crops + meadows + mixed_for + dist_for_out +
              dist_for_out_a + dist_for_in + dist_for_in_a + for_edge + prop_broadleaves +
              prop_broadleaves_a + dist_road + dist_village + feed_stat + sup_food +
              masting*prop_broadleaves)
M1.lme <- lme(f1, random = ~ 1 | year, method = "REML", data = d)
But judging by the likelihood ratio tests, I cannot remove ANY variable: all of them are significant. However, the model is still very poor (too many variables in it, badly behaved residuals), and I definitely cannot stop there.
So how can I find a better model (i.e. get rid of the non-significant variables)?
Is this due to my large sample size?
Could the zero inflation be a problem?
I could not find another way of improving my model that would take this into account.
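For illustration, an explicitly zero-inflated count model is the kind of thing I imagine might be needed. A minimal sketch, assuming the glmmTMB package and using only a subset of my predictors (I have not verified that this is the right approach for my data):
library(glmmTMB)
# zero-inflated Poisson GLMM: yearly damage counts per cell,
# year as a random intercept, constant zero-inflation probability
zi1 <- glmmTMB(dam ~ masting + dens*pop_size_index + dist_for_out + prop_broadleaves +
                 (1 | year),
               ziformula = ~ 1,   # intercept-only zero-inflation component
               family = poisson,
               data = d)
summary(zi1)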
Any suggestions?


A strange case of singular fit in lme4 glmer - simple random structure with variation among groups

Background
I am trying to test for differences in wind speed data among different groups. For the purpose of my question, I am looking only at side wind (wind direction that is 90° from the individual), and I only care about the strength of the wind, so I use absolute values. The wind speeds range from 0.0004 to 6.8 m/s, and because I use absolute values, a Gamma distribution describes them much better than a normal distribution.
My data contain 734 samples from 68 individuals, with each individual having between 1 and 30 repeats. However, even if I reduce my sample to include only individuals with at least 10 repeats (which leaves me with 26 individuals and a total of 466 samples), I still get the problematic message.
The model
The full model is Wind ~ a*b + c*d + (1|individual), but for the purpose of this question, the simpler model Wind ~ 1 + (1|individual) gives the same singularity message, so I do not think the explanatory variables are the problem.
The complete code line is glmer(Wind ~ 1 + (1|individual), data = X, family = Gamma(log))
The problem and the strange part
When running the model, I get the boundary (singular) fit: see ?isSingular message, although, as you can see, I use a very simple model and random structure. The strange part is that I can avoid it by adding 0.1 to the Wind variable (i.e. glmer(Wind + 0.1 ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not give any warning; Tag is the individual identifier in my data). I honestly do not remember why I added 0.1 the first time I did it, but I was surprised to see that it solved the problem.
The question
Is this a problem with lme4? Am I missing something? Any ideas what might cause this, and why does adding 0.1 to the variable solve it?
Edit following questions
I am not sure what the best way to add data is, so here is a link to a csv file in Google Drive.
Using glmmTMB does not produce any warnings with the basic formula glmmTMB(Wind ~ 1 + (1|Tag), data = X, family = Gamma(log)), but it gives convergence warnings ('non-positive-definite Hessian matrix') with the full model (i.e., Wind ~ a*b + c*d + (1|individual)). Those warnings disappear if I scale the continuous variables.
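For reference, the scaling step looks roughly like this (a sketch; a and c stand in for my actual continuous predictors):
library(glmmTMB)
# scale() centers each predictor and divides by its SD
X$a_sc <- as.numeric(scale(X$a))
X$c_sc <- as.numeric(scale(X$c))
full <- glmmTMB(Wind ~ a_sc*b + c_sc*d + (1 | Tag),
                data = X, family = Gamma(link = "log"))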

Trouble Converging Bifactor model using lavaan

The title basically explains it, but I'm trying to build a bifactor model with psychopathy as the general factor and its subtypes as the group factors. I believe I have everything constrained properly, but that might be the issue.
Current code:
BifactorModel <- 'psychopathyBi =~ YPIS_1 + YPIS_2 + YPIS_3 + YPIS_4 + YPIS_5 + YPIS_6 + YPIS_7 + YPIS_8 + YPIS_9 + YPIS_10 + YPIS_11 + YPIS_12 + YPIS_13 + YPIS_14 + YPIS_15 + YPIS_16 + YPIS_17 + YPIS_18
GMbi =~ YPIS_4 + YPIS_5 + YPIS_8 + YPIS_9 + YPIS_14 + YPIS_16
CUbi =~ YPIS_3 + YPIS_6 + YPIS_10 + YPIS_15 + YPIS_17 + YPIS_18
DIbi =~ YPIS_1 + YPIS_2 + YPIS_7 + YPIS_11 + YPIS_12 + YPIS_13
psychopathyBi ~~ 0*GMbi
psychopathyBi ~~ 0*CUbi
psychopathyBi ~~ 0*DIbi
GMbi ~~ 0*CUbi
GMbi ~~ 0*DIbi
CUbi ~~ 0*DIbi
'
#fit bifactor model
bifactorFit <- cfa(BifactorModel, data = YPIS_Data)
#get summary of bifactor model
summary(bifactorFit, fit.measures = TRUE, standardized = TRUE)
This produces the following:
lavaan 0.6-9 did NOT end normally after 862 iterations
[Image omitted: a diagram of what the model should ultimately look like once converged.]
Any suggestions or comments would be greatly appreciated. Thanks in advance.
The variances of several of your latent variables are very small. For example, DIbi appears to be effectively zero. That's the source of the issue here.
There are a few things you can try to remedy this.
First, it may work better to identify the model by fixing the latent variable variances to 1, rather than fixing each factor's first indicator loading to 1. Do this by specifying std.lv = TRUE.
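A minimal sketch of that change, reusing the model string above:
# identify the model via unit latent variances instead of unit first loadings
bifactorFit <- cfa(BifactorModel, data = YPIS_Data, std.lv = TRUE)
summary(bifactorFit, fit.measures = TRUE, standardized = TRUE)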
Even then, it will likely be the case that one or more of the group factors have very small loadings. This indicates that there really isn't much of a group factor in your data for those items that is distinct from the general factor. You should consider estimating a model that drops that group factor (as well as comparing models that drop each of the other group factors one at a time). We discuss this issue some here: https://psyarxiv.com/q356f/
Additionally, you should constrain item loadings so that they are in the theoretically expected direction (e.g., all positive with a lower bound of 0). It is common for bifactor models to overextract variance in items and produce uninterpretable group factors that have a mix of positive and negative loadings. This can also cause convergence issues.
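In lavaan syntax, one way to impose such bounds is to label the loadings and add inequality constraints. A sketch for one group factor (the same pattern would apply to the other group factors):
# label the GMbi loadings, then bound each labelled loading below at zero
GMconstrained <- 'GMbi =~ g1*YPIS_4 + g2*YPIS_5 + g3*YPIS_8 + g4*YPIS_9 + g5*YPIS_14 + g6*YPIS_16
g1 > 0
g2 > 0
g3 > 0
g4 > 0
g5 > 0
g6 > 0'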
In general, this sort of unconstrained bifactor model tends to be overly flexible and tends to overfit to a similar degree as exploratory factor analysis. You should be sure to evaluate the bifactor model based not only on global model fit statistics, but also on whether the factor loadings actually resemble a true bifactor model: do the items each show substantial loadings on both the general factor and their group factor in the expected directions, or do items tend to load on only one or the other? See some examples of this issue in the paper linked above.
Another option would be to switch to exploratory bifactor modeling. This is implemented in R in the fungible package, via the fungible::BiFAD() function. The approach is discussed here:
https://www.sciencedirect.com/science/article/pii/S0001879120300555
Exploratory bifactor models are useful because they rely on targeted EFA rotation to estimate loadings. This makes convergence much more likely and can help to diagnose when a group factor is too weak to identify in the data.
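A minimal sketch of that approach (hedged: the argument names are from my reading of the fungible documentation; check ?BiFAD before relying on this):
library(fungible)
# exploratory bifactor analysis from the item correlation matrix,
# with 3 group factors to mirror GM/CU/DI
R_items <- cor(YPIS_Data[, paste0("YPIS_", 1:18)])
out <- BiFAD(R = R_items, numFactors = 3)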

Repeated effect LMM in R

Until recently I used SPSS for my statistics, but since I am not at university any more, I am switching to R. Things are going well, but I can't seem to replicate the results I obtained for a repeated-effects LMM in SPSS. I did find some threads here that seemed relevant, but they didn't solve my issues.
This is the SPSS syntax I am trying to replicate in R:
MIXED TriDen_L BY Campaign Watering Heating
  /CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1)
    SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
    PCONVERGE(0.000001, ABSOLUTE)
  /FIXED=Campaign Watering Heating Campaign*Watering Campaign*Heating
    Watering*Heating Campaign*Watering*Heating | SSTYPE(3)
  /METHOD=REML
  /PRINT=TESTCOV
  /RANDOM=Genotype | SUBJECT(Plant_id) COVTYPE(AD1)
  /REPEATED=Week | SUBJECT(Plant_id) COVTYPE(AD1)
  /SAVE=PRED RESID
Using the lme4 package in R I have tried:
lmm <- lmer(lnTriNU ~ Campaign + Watering + Heating + Campaign*Watering
+ Campaign*Heating + Watering*Heating + Campaign*Watering*Heating
+ (1|Genotype) + (1|Week:Plant_id), pg)
But this (and the other options I have tried for the random part) keeps producing an error:
Error: number of levels of each grouping factor must be < number of observations
In SPSS everything runs fine, so I suspect I am not modelling the repeated effect correctly. Also, saving predicted and residual values is not yet straightforward for me...
I hope anyone can point me in the right direction.
You probably need to take out either Week or Plant_id, as I think you have as many levels of each grouping variable as you have cases. You can nest observations within a larger unit if you add a variable to model time. I am not familiar with SPSS, but if your time variable is Week (i.e., if Week has a value of 1 for the first observation, 2 for the second, etc.), then it should not be a grouping factor but a random effect in the model. Something like <snip> week + (1 + week|Plant_id).
Is Plant_id nested within Genotype, and does Week indicate different measurement points? If so, I assume the following formula gives the required result:
lmm <- lmer(lnTriNU ~ Campaign + Watering + Heating + Campaign*Watering
+ Campaign*Heating + Watering*Heating + Campaign*Watering*Heating
+ (1+Week|Genotype/Plant_id), pg)
Also saving predicted and residual values is not yet straightforward for me...
Do you mean "computing" by "saving"? In R, all the relevant information is in the returned object and is accessible through functions like residuals() or predict(), called on the saved object (in your case, residuals(lmm)).
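For example, to mimic the SPSS /SAVE=PRED RESID subcommand you could attach both to your data frame (a sketch; this assumes no rows of pg were dropped due to missing values during fitting):
pg$pred  <- predict(lmm)    # fitted (predicted) values
pg$resid <- residuals(lmm)  # residuals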
Note that lmer does not use the AD1 covariance type from your SPSS syntax; lme4 offers no way to specify such structured covariance matrices.

Cox Regression Hazard Ratio in Percentiles

I fitted a Cox proportional hazards regression in R:
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I have the hazard ratio (HR, or exp(coef)) for each of these covariates, but I'm really only interested in the effect of the continuous predictor X. The HR for X is 1.20. X is scaled to the sample measurements, such that X has a mean of 0 and an SD of 1. That is, an individual one SD above the mean of X has a 1.20 times higher hazard of mortality (the event) than someone with an average value of X (I believe).
I would like to be able to state these results in a way that is a bit less awkward, and this article does exactly what I would like. It says:
"In a Cox proportional hazards model adjusting for age, sex and
education, a higher level of total daily physical activity was
associated with a decreased risk of death (hazard ratio=0.71;
95%CI:0.63, 0.79). Thus, an individual with high total daily physical
activity (90th percentile) had about ΒΌ the risk of death as compared
to an individual with low total daily physical activity (10th
percentile)."
Assuming only the HR (i.e. 1.20) is needed, how does one compute such a comparison statement? If you need any other information, please ask me for it.
If x1 is your 90th-percentile value of X and x2 your 10th-percentile value, and s is the Cox regression coefficient for X (so exp(s) = 1.20, i.e. s = log(1.20)), then the comparison is exp(s*x1)/exp(s*x2) = exp(s*(x1 - x2)) = 1.20^(x1 - x2). The terms for the other covariates (A, B and C) appear in both the numerator and the denominator, so they cancel out of the ratio. This ratio gives you the comparison statement.
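A quick sketch of the arithmetic, assuming X is standardized so its percentiles can be taken from the standard normal (with real data you would use quantile(df$X, c(0.9, 0.1))):
hr  <- 1.20          # hazard ratio per 1 SD of X
x90 <- qnorm(0.90)   # ~  1.28, 90th percentile of a standard normal
x10 <- qnorm(0.10)   # ~ -1.28, 10th percentile
hr^(x90 - x10)       # ~ 1.60: hazard at the 90th vs the 10th percentile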
This question is actually for stats.stackexchange.com though.

Standardisation in MuMIn package in R

I am using the MuMIn package in R to select models and calculate the effect sizes of my input variables (rain, brk, onset, wid). To make the effect sizes comparable between variables, I standardised them using the standardize() function in the arm package. Here is the code I am following:
For reference, please refer to the appendix of this paper: http://onlinelibrary.wiley.com/doi/10.1111/j.1420-9101.2010.02210.x/full
Grueber et al. 2011: Multimodel inference in ecology and evolution: challenges and solutions
data1 <- read.csv("data.csv", header = TRUE)  # read the data
global.model <- lmer(yld.res ~ rain + brk + onset + wid + (1|state), data = data1, REML = FALSE)  # prepare a global model
stdz.model <- standardize(global.model, standardize.y = FALSE)  # standardise the input variables
model.set <- dredge(stdz.model)  # generate the full submodel set
top.models <- get.models(model.set, subset = delta < 2)  # select models with delta AIC < 2
model.avg(top.models)  # calculate the average effect size of the input variables
Here is the result of model.avg(top.models) which gives the average effect size of each input variable
Coefficients:
(Intercept) brk rain wid onset
subset -4.281975e-14 -106.0919 51.54688 39.82837 35.68766
I read up on how the standardize function works: it subtracts the mean and divides by 2 SD.
My question is this: since I have standardised the input variables, shouldn't the effect sizes be between -1 and 1? Or are the effect sizes in the output correct as shown?
Please advise.
Thanks a lot.
This is more of a statistical question than a programming question, but: you've only standardized the predictor variables, not the response (you specified standardize.y = FALSE); therefore, each of your coefficients represents the expected change in the response (in the response's units!) per 2 SD change in a predictor. If the range of the response is large (as it must be in your example), then there can be a very large change. For example, if I were analyzing the change in elephant weight measured in milligrams, I would expect very large changes in the response for reasonably small changes in the predictors (e.g. sex, age, food availability). You should use standardize.y = TRUE if you want truly nondimensional/unitless effect sizes. Even nondimensional effects aren't necessarily constrained to lie between -1 and +1, but it would be surprising for them to be as large as yours.
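A minimal sketch of that change, reusing the pipeline above:
stdz.model2 <- standardize(global.model, standardize.y = TRUE)  # standardise the response too
model.set2  <- dredge(stdz.model2)
top.models2 <- get.models(model.set2, subset = delta < 2)
model.avg(top.models2)  # coefficients are now unitless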
By the way, I think your standardize function comes from the arm package, not from MuMIn (library("sos"); findFn("standardize", sortby = "Function")).
