Repeated effect LMM in R

Until recently I used SPSS for my statistics, but since I am not at university any more, I am changing to R. Things are going well, but I can't seem to replicate the results I obtained for a repeated effect LMM in SPSS. I did find some threads here which seemed relevant, but they didn't solve my issue.
This is the SPSS script I am trying to replicate in R:
MIXED TriDen_L BY Campaign Watering Heating
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1)
SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE)
PCONVERGE(0.000001, ABSOLUTE)
/FIXED=Campaign Watering Heating Campaign*Watering Campaign*Heating
Watering*Heating Campaign*Watering*Heating | SSTYPE(3)
/METHOD=REML
/PRINT=TESTCOV
/RANDOM=Genotype | SUBJECT(Plant_id) COVTYPE(AD1)
/REPEATED=Week | SUBJECT(Plant_id) COVTYPE(AD1)
/SAVE=PRED RESID
Using the lme4 package in R I have tried:
lmm <- lmer(lnTriNU ~ Campaign + Watering + Heating + Campaign*Watering
+ Campaign*Heating + Watering*Heating + Campaign*Watering*Heating
+ (1|Genotype) + (1|Week:Plant_id), pg)
But this, and the other options I have tried for the random part, keeps producing an error:
Error: number of levels of each grouping factor must be < number of observations
In SPSS everything runs fine, so I suspect I am not modelling the repeated effect correctly. Also, saving predicted and residual values is not yet straightforward for me...
I hope anyone can point me in the right direction.

You probably need to take out either Week or Plant_id, as I think you have as many values for each of those variables as you have cases. You can nest observations within a larger unit if you add a variable to model time. I am not familiar with SPSS, but if your time variable is Week (i.e., if Week has a value of 1 for the first observation, 2 for the second, etc.), then it should not be a grouping factor but a random effect in the model. Something like <snip> week + (1 + week|Plant_id).
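Spelled out in full, that suggestion might look like the following sketch (assuming the same fixed effects as in the original call, with Week entering as a numeric time variable; untested against the actual data):

```r
library(lme4)

# Random intercept per genotype, plus a random intercept and a random
# Week slope per plant; Week also enters as a fixed effect
lmm <- lmer(lnTriNU ~ Campaign * Watering * Heating + Week
            + (1 | Genotype) + (1 + Week | Plant_id),
            data = pg)
```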

Is Plant_id nested within Genotype, and does Week indicate different measurement points? If so, I assume the following formula leads to the required result:
lmm <- lmer(lnTriNU ~ Campaign + Watering + Heating + Campaign*Watering
+ Campaign*Heating + Watering*Heating + Campaign*Watering*Heating
+ (1+Week|Genotype/Plant_id), pg)
"Also saving predicted and residual values is not yet straightforward for me..."
Do you mean "computing" by "saving"? In R, all relevant information is stored in the returned object and is accessible through functions like residuals() or predict(), called on the saved object (in your case, residuals(lmm)).
Note that, by default, lmer does not use the AD1 covariance type.
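For instance (a sketch, assuming the model has been fitted successfully and stored as lmm, with pg the data frame it was fitted to):

```r
# Extract predicted values and residuals from the fitted object and
# attach them to the data, mirroring SPSS's /SAVE=PRED RESID
pg$pred  <- predict(lmm)     # fitted values, including random effects
pg$resid <- residuals(lmm)   # conditional residuals
```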

Related

fixest package using the feols command - interesting difference-in-differences results

I'm not sure how to make this example reproducible, but I'm having an issue with my fixest implementation of a simple event-study plot based on coefficients. Wondering if anyone has suggestions for ways to fix this issue.
I have longitudinal data in which I'm looking into the impact of a policy on mothers' employment. Since the treatment (a reduction in benefits) was applied to all affected observations at once, there is no need to worry about staggered difference-in-differences treatment effects.
Treated observations are those that made use of the benefit before it was taken away; the control group is everyone else. Treatment periods are normalized, with 0 being the quarter of the event.
I'm using the specification below:
employ_mother <- feols(paid_emp ~ i(time_to_treat, treated_group, ref = -1) +
                         age_dv + I(age_dv^2) + nkids_dv + marital_status + regions | quarter,
                       data = dta, cluster = dta$pidp)
iplot(employ_mother,
      xlab = 'Time to treatment',
      main = 'Mother in Employment')
For which the output graph looks like so:
[Image: Mother Employment graph]
I'm trying to understand why all of the pre-treatment coefficients are below 0 and rising. When I try the same specification in Stata, my results look normal, with pre-treatment coefficients hovering around 0 and a positive effect after the treatment period begins.
Would really appreciate any help with this.
Thanks!

A strange case of singular fit in lme4 glmer - simple random structure with variation among groups

Background
I am trying to test for differences in wind speed data among different groups. For the purpose of my question, I am looking only at side wind (wind direction that is 90 degrees from the individual), and I only care about the strength of the wind, so I use absolute values. The range of wind speeds is 0.0004-6.8 m/s, and because I use absolute values, a Gamma distribution describes them much better than a normal distribution.
My data contains 734 samples from 68 individuals, with each individual having between 1 and 30 repeats. However, even if I reduce my samples to only include individuals with at least 10 repeats (which leaves me with 26 individuals and a total of 466 samples), I still get the problematic singular-fit message.
The model
The full model is Wind ~ a*b + c*d + (1|individual), but for the purpose of this question, the simpler model Wind ~ 1 + (1|individual) gives the same singularity message, so I do not think that the explanatory variables are the problem.
The complete call is glmer(Wind ~ 1 + (1|individual), data = X, family = Gamma(log))
The problem and the strange part
When running the model, I get the boundary (singular) fit: see ?isSingular message, although, as you can see, I use a very simple model and random structure. The strange part is that I can resolve this by adding 0.1 to the Wind variable (i.e., glmer(Wind + 0.1 ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not give any message). I honestly do not remember why I added 0.1 the first time I did it, but I was surprised to see that it solved the problem.
The question
Is this a problem with lme4? Am I missing something? Any ideas what might cause this, and why does adding 0.1 to the variable solve it?
Edit following questions
I am not sure what's the best way to add data, so here is a link to a csv file on Google Drive.
Using glmmTMB with the basic formula glmmTMB(Wind ~ 1 + (1|Tag), data = X, family = Gamma(log)) does not produce any warnings, but the full model (i.e., Wind ~ a*b + c*d + (1|individual)) gives a convergence warning ('non-positive-definite Hessian matrix'), which is resolved if I scale the continuous variables.
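For reference, base R's scale() is one way to do that centering and scaling (a and c here are the poster's placeholder variable names):

```r
# Center each continuous predictor to mean 0 and rescale to SD 1;
# scale() returns a one-column matrix, so as.numeric() keeps a plain vector
X$a <- as.numeric(scale(X$a))
X$c <- as.numeric(scale(X$c))
```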

Incorporating time series into a mixed effects model in R (using lme4)

I've had a search for similar questions and come up short, so apologies if there are related questions that I've missed.
I'm looking at the amount of time spent on feeders (dependent variable) across various conditions with each subject visiting feeders 30 times.
Subjects are exposed to feeders of one type which will have a different combination of being scented/unscented, having visual patterns/being blank, and having these visual or scented patterns presented in one of two spatial arrangements.
So far my model is:
mod<-lmer(timeonfeeder ~ scent_yes_no + visual_yes_no +
pattern_one_or_two + (1|subject), data=data)
How can I incorporate the visit numbers into the model to see if these factors have an effect on the time spent on the feeders over time?
You have a variety of choices (this question might be marginally better suited for CrossValidated).
As @Dominix suggests, you can allow for a linear increase or decrease in time on feeder over time. It probably makes sense to allow this change to vary across birds:
timeonfeeder ~ time + ... + (time|subject)
you could allow for an arbitrary pattern of change over time (i.e. not just linear):
timeonfeeder ~ factor(time) + ... + (1|subject)
this probably doesn't make sense in your case, because you have a large number of observations, so it would require many parameters (it would be more sensible if you had, say, 3 time points per individual)
you could allow for a more complex pattern of change over time via an additive model, i.e. modeling change over time with a cubic spline. For example:
library(mgcv)
gamm(timeonfeeder ~ s(time) + ..., random = list(subject = ~1))
(1) This assumes the temporal pattern is the same across subjects; (2) because gamm() uses lme rather than lmer under the hood, you have to specify the random effect as a separate argument. (You could also use the gamm4 package, which uses lmer under the hood.)
You might want to allow for temporal autocorrelation. For example,
lme(timeonfeeder ~ time + ... ,
random = ~ time|subject,
correlation = corAR1(form= ~time|subject) , ...)

Explanation of the formula object used in the coxph function in R

I am a complete novice when it comes to survival analysis. I am working on a project that requires I use the coxph function in the "survival" package, but I am running into trouble because I do not understand what is required by the formula object.
Most descriptions I can find about the function are as follows:
"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function. "
I know what needs to be on the left of the operator, the issue is what the function expects from the right-hand side.
Here is a link of what my data looks like (The actual data set is much larger, I'm only displaying the first 20 data points for brevity):
Short explanation of data:
-Row 1 is the header
-Each row after that is a separate patient
-The first column is the age of the patient at the time of the study
-Columns 2 through 14 (headed by x2-x13), plus 19 (x18) and 20 (x19), are covariates such as race, relationship status, and medical conditions, which take on either true (1) or false (0) values
-Columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole-number values greater than 0
-The second-to-last column, "sur", is the number of months survived, and "index" is whether or not that time is right-censored (1 for true, 0 for false)
Given this data I need to plot a Cox proportional hazards curve, but I end up with an incorrect plot because the right-hand side of the formula object is wrong.
Here is my code, "temp4" is the name I gave to the data table:
library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))
I should also note that I have tried replacing the right-hand side with the number 1, a period, and leaving it blank; these methods produce a Kaplan-Meier curve.
The following is the console output:
Each new line is an example of the error produced depending on how I filter the data. (ie if I only include patients with ages greater than 85, etc.)
If someone could explain how it works, it would be greatly appreciated.
PS - I have searched for over a week for a solution, and I am asking for help here as a last resort.
You should not be using the prefix temp4$ if you are also supplying a data argument. The whole purpose of supplying a data argument is to allow you to drop those prefixes in the formula.
seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)
The above would use all of the x-variables in your temp4 data.frame. This will use just the first 3:
seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
Exactly what the warnings signify depends on the data (as you have in one sense already demonstrated by producing different sorts of collinearity with different subsets). If you have collinear columns, you get singularities in the inversion of the model matrix, and the software will attempt to drop aliased columns with a warning. This is really telling you that you do not have enough data to build the large models you are attempting. Exploring that possibility with table() calls is often informative.
Bottom line: this is not so much a problem with your formula construction as it is a problem of not understanding the limitations of the chosen method with the dataset you have assembled. You need to be more careful about defining your goals. What is the highest priority in this research? Do you really need every variable? Is it possible to aggregate some of these anonymous variables into clinically meaningful categories such as diagnostic categories or comorbidities?
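A toy illustration of that table() check (made-up 0/1 vectors, not the poster's data): an empty off-diagonal cell means a combination of covariate values never occurs, which is exactly what produces aliased columns in the model matrix.

```r
# Two binary covariates that happen to be perfectly collinear:
x2 <- c(0, 0, 1, 1, 0)
x3 <- c(0, 0, 1, 1, 0)   # identical to x2
table(x2, x3)            # both off-diagonal cells are 0
```

With singular.ok=TRUE, coxph() drops one of the aliased terms; which(is.na(coef(fit))) will locate the dropped coefficients in the fitted object.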

Regression coefficients by group in dataframe R

I have data of various companies' financial information organized by company ticker. I'd like to regress one of the columns' values against the others while keeping the company constant. Is there an easy way to write this out in lm() notation?
I've tried using:
reg <- lmList(lead2.dDA ~ paudit1 + abs.d.GINDEX + logcapx + logmkvalt +
logmkvalt2|pp, data=reg.df)
where pp is a vector of company names, but this returns coefficients as though I regressed all the data at once (and did not separate by company name).
A convenient and apparently little-known syntax for estimating separate regression coefficients by group in lm() involves using the nesting operator, /. In this case it would look like:
reg <- lm(lead2.dDA ~ 0 + pp/(paudit1 + abs.d.GINDEX + logcapx +
logmkvalt + logmkvalt2), data=reg.df)
Make sure that pp is a factor and not a numeric. Also notice that the overall intercept must be suppressed for this to work; in the new formulation, we have a different "intercept" for each group.
A couple of comments:
Although the regression coefficients obtained this way will match those given by lmList(), note that with lm() we estimate only a single residual variance across all the groups, whereas lmList() estimates a separate residual variance for each group.
As I mentioned in my earlier comment, the lmList() syntax that you gave looks like it should have worked. Since you say it didn't, I suspect the real problem is something else (although it's hard to tell what without a reproducible example), and so it seems likely that the solution I posted will fail for you as well, for the same unknown reasons. If you want more detailed guidance, please provide more information; help us help you.
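To see that the nesting syntax really yields per-group coefficients, here is a self-contained check with simulated data (all names are made up): the slope estimated for one group by the nested lm() formulation matches a separate lm() fit on that group's subset.

```r
set.seed(1)
d <- data.frame(
  g = factor(rep(c("A", "B"), each = 50)),  # grouping factor
  x = rnorm(100)                            # predictor
)
# Group A: intercept 1, slope 2; group B: intercept 3, slope -1
d$y <- ifelse(d$g == "A", 1 + 2 * d$x, 3 - 1 * d$x) + rnorm(100, sd = 0.1)

# Nested formulation: separate intercept and slope for each group
fit_nested <- lm(y ~ 0 + g/x, data = d)

# Separate fit on group A only
fit_A <- lm(y ~ x, data = d[d$g == "A", ])

# The group-A slope agrees between the two approaches:
all.equal(unname(coef(fit_nested)["gA:x"]),
          unname(coef(fit_A)["x"]))   # TRUE
```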
