I am using lavaan and have only observed variables (no latent variables).
I would like to include an interaction term in the model, but I am not sure how to do this.
This is what I have:
model4 <-'
interac =~ var1 * var2
Ent ~ age
presu ~ age + interac
protein ~ age + fat
fat ~ age
tempo ~ age +interac+protein
score ~sex+education+presu+tempo
'
fit4 <- sem(model4, data=mydata)
summary(fit4, fit.measures=TRUE)
(all variables have been scaled before starting, because I had some issues with some variables being 100 times larger than others).
I am wondering whether this is correct? I don't have the main effects of the interaction in the regression? Shouldn't these be included?
When I add the interaction term directly in the regression (var1*var2), I get 1 as estimates, so that must be wrong...
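No, it is not correct. In lavaan syntax, =~ defines a latent variable, and * attaches a modifier (a label or a fixed value) to a term rather than multiplying two variables. For an interaction between manifest variables, you have two alternatives: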
1 - create the interaction term outside lavaan, e.g.:
mydata$interac <- mydata$var1 * mydata$var2
or
2 - use the : operator:
model4 <-'
Ent ~ age
presu ~ age + var1:var2 #interaction and age as predictors
protein ~ age + fat
fat ~ age
tempo ~ age + var1:var2 + protein #interaction, age and protein as predictors
score ~sex+education+presu+tempo
'
fit4 <- sem(model4, data=mydata)
summary(fit4, fit.measures=TRUE)
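As a rough illustration of option 1, the sketch below builds the product term by hand on simulated data and checks that it matches the column that the : operator generates in a base-R formula. The data frame d, the outcome y, and the coefficients are made up for the example. The lavaan call at the end only runs if lavaan is installed, and it includes var1 and var2 as main effects, which is an assumption about what the model should contain rather than something stated in the answer above.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for var1, var2 and an outcome (all hypothetical)
d <- data.frame(var1 = rnorm(n), var2 = rnorm(n))
d$y <- 0.5 * d$var1 - 0.3 * d$var2 + 0.4 * d$var1 * d$var2 + rnorm(n)

# Scale first, then form the product from the scaled columns
d[] <- lapply(d, function(x) as.numeric(scale(x)))
d$interac <- d$var1 * d$var2

# The hand-made product equals the column that var1:var2 generates
mm <- model.matrix(~ var1:var2, data = d)
same_column <- isTRUE(all.equal(unname(mm[, "var1:var2"]), d$interac))
print(same_column)

# Optional: fit the same regression in lavaan if it is available
if (requireNamespace("lavaan", quietly = TRUE)) {
  model <- '
    y ~ var1 + var2 + interac
  '
  fit <- lavaan::sem(model, data = d)
  print(lavaan::parameterEstimates(fit))
}
```

Forming the product after scaling keeps the interaction on the same footing as the scaled main effects; whether that is the right choice depends on the analysis.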
I am trying to determine whether there is a significant effect of treatment on microbiome diversity between two timepoints (two timepoints x three treatments).
Can somebody please explain how to model this using linear mixed models using the nlme library in R?
Particularly how to handle repeated sampling of the same subject over time.
I have seen the three following syntaxes used but don't really understand the difference between them.
model1 <- lme(diversity ~ treatment * timepoint,
random = ~ 1 | mouseID,
data = alpha_df)
model2 <- lme(diversity ~ treatment * timepoint,
random = ~ timepoint | mouseID,
data = alpha_df)
model3 <- lme(shannon ~ treatment * timepoint,
random = ~ 1 + timepoint | mouse,
data = alpha_df)
I think model3 is the correct one for my use but I am not sure.
Thanks in advance!
~ 1 | mouse means "one intercept per mouse". There is a main, fixed intercept (actually there are three intercepts, one per treatment), and the random intercepts of the mice are normally distributed around the main intercept.
~ timepoint | mouse is the same as ~ 1 + timepoint | mouse. It means "one regression line (i.e. an intercept and a slope) per mouse". There is a main slope (actually three main slopes because of the interaction term with the treatments) and the random slopes are normally distributed around the main slope.
So the "biggest" model is ~ 1 + timepoint | mouse. If there is a biological justification that the mice have the same diversity value at time 0, you can drop the random intercept: random = ~ 0 + timepoint | mouse.
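The three random-effects specifications can be compared on a dataset that ships with nlme. This is only a sketch: Orthodont stands in for the microbiome data, with age playing the role of timepoint and Subject the role of mouse, and the object names are made up for the example.

```r
library(nlme)

# Random intercept only: one intercept deviation per subject
m_int <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

# Random intercept and slope, written in the two equivalent ways
m_slope_a <- lme(distance ~ age, random = ~ age | Subject, data = Orthodont)
m_slope_b <- lme(distance ~ age, random = ~ 1 + age | Subject, data = Orthodont)

# Random slope only: the random intercept is dropped with "0 +"
m_noint <- lme(distance ~ age, random = ~ 0 + age | Subject, data = Orthodont)

# Which random effects does each specification estimate per subject?
re_names <- list(
  intercept_only = colnames(ranef(m_int)),
  slope_a        = colnames(ranef(m_slope_a)),
  slope_b        = colnames(ranef(m_slope_b)),
  slope_only     = colnames(ranef(m_noint))
)
print(re_names)

# The two spellings of the intercept-plus-slope model give the same fit
same_fit <- isTRUE(all.equal(as.numeric(logLik(m_slope_a)),
                             as.numeric(logLik(m_slope_b))))
print(same_fit)
```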
In the Orthodont dataset in nlme, there are 27 subjects and each subject is measured at 4 different ages. I wish to use this data to explore under what conditions the model will be overdetermined. Here are the models:
library(nlme)
library(lme4)
m1 <- lmer( distance ~ age + (age|Subject), data = Orthodont )
m2 <- lmer( distance ~ age + I(age^2) + (age|Subject), data = Orthodont )
m3 <- lmer( distance ~ age + I(age^2) + I(age^3) + (age|Subject), data = Orthodont )
m1nlme <- lme(distance ~ age, random = ~ age|Subject, data = Orthodont)
m2nlme <- lme(distance ~ age + I(age^2), random = ~ age|Subject, data = Orthodont)
m3nlme <- lme(distance ~ age + I(age^2) + I(age^3), random = ~ age|Subject, data = Orthodont)
m4nlme <- lme(distance ~ age + I(age^2) + I(age^3), random = ~ age + I(age^2) + I(age^3)|Subject, data = Orthodont)
Of all of the above models, only m3 throws a warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.00762984 (tol = 0.002, component 1)
Questions:
What does the warning message suggest, and is it sensible to ignore this message?
For m2, the model estimates the fixed intercept and the fixed coefficients for age and I(age^2), together with the random effect parameters sigma^2_intercept, sigma^2_age, and sigma^2_intercept:age. So a total of 1+2+3=6 parameters are estimated for each Subject. But there are only 4 observations per subject. Why doesn't m2 throw an error? Isn't m2 overdetermined? Am I counting the number of parameters incorrectly anywhere?
The warning message means that the model fit may be a bit numerically unstable. The check is done by numerically evaluating the scaled gradient, but this depends in turn on the gradient and Hessian estimated by finite differences, which are themselves subject to numerical error. As I've stated in many different venues, these warnings definitely tend to be over-sensitive/likely to be false positives: see e.g. ?lme4::convergence, ?lme4::troubleshooting. The gold standard is to use allFit() to refit the model with a variety of optimizers and make sure that the results from different optimizers are close enough to the same for your purposes.
There are two random effects values (BLUPs or conditional modes) per subject - the subject-level deviation of the intercept and slope wrt age. For values, we will be in trouble if the number of values is greater than or equal to the number of observations per group (or, for GLMMs without a scale parameter such as the Poisson, if the number of values is strictly greater than the number of observations per group). For parameters, there are up to four fixed-effect parameters (intercept, linear, quadratic, cubic terms wrt age) and three RE parameters (variance of the intercept, variance of the slope, covariance between intercept and slope), but these 7 parameters are estimated at the population level - the appropriate comparisons are with either the total number of observations or with the number of groups, not with the number of observations per group.
In general you should probably think about the number of observations when considering the number of fixed effect parameters and the number of groups when considering the number of random effect parameters; "10 [observations/groups] per parameter" is a not-unreasonable starting rule of thumb.
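The distinction between per-subject values and population-level parameters can be checked by counting them on a fitted model. The sketch below uses the nlme version of the cubic model from the question; the variable names holding the counts are made up for the example, and the number of random-effect parameters is computed from the dimension of the random effects on the assumption of an unstructured covariance matrix (the default).

```r
library(nlme)

m3nlme <- lme(distance ~ age + I(age^2) + I(age^3),
              random = ~ age | Subject, data = Orthodont)

# Quantities at the level of a single subject
obs_per_subject    <- as.vector(table(Orthodont$Subject))  # observations in each group
values_per_subject <- ncol(ranef(m3nlme))                  # conditional modes per subject

# Quantities at the population level
n_groups <- nrow(ranef(m3nlme))
n_obs    <- nrow(Orthodont)
n_fixed  <- length(fixef(m3nlme))
# Unstructured q x q covariance matrix: q variances plus q(q-1)/2 covariances
n_re_par <- values_per_subject * (values_per_subject + 1) / 2

counts <- c(values_per_subject = values_per_subject,
            min_obs_per_subject = min(obs_per_subject),
            n_fixed = n_fixed, n_re_par = n_re_par,
            n_groups = n_groups, n_obs = n_obs)
print(counts)
```

The comparison that matters for identifiability is values_per_subject against the observations per subject, while n_fixed and n_re_par are weighed against n_obs and n_groups respectively.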
level 1 variable:
income - continuous
level 2 variable:
state's general weather: three-level categorical variable: hot/moderate/cool
effect coded, which generates two variables because it has three levels
(weather_ef1, weather_ef2)
enrolled in university - binary: yes/no (effect coded: yes = -1, no = 1)
DV:
math score
grouping variable: household
model 1: (fixed slope)
DV is predicted by income, enrollment, and the interaction between enrollment and income.
In this case:
lmer(y~ 1 + income + enrollment +income*enrollment+ (1|householdID), data=data)
lmer(y~ 1 + income + enrollment +income:enrollment+ (1|householdID), data=data)
Is : for the interaction, or is * for the interaction?
Further, do I have to use factor(enrollment),
or is it okay because it is already effect coded?
model 2: (fixed slope)
DV is predicted by income, weather, and interaction between income and weather
lmer(y ~ 1 + income + weather_ef1 + weather_ef2 + weather_ef1*income
+ weather_ef2*income + (1|household_id), data)
lmer(y ~ 1 + income + weather_ef1 + weather_ef2 + weather_ef1:income
+ weather_ef2:income + (1|household_id), data)
I am still confused about whether * or : is right.
I think these variables are already effect coded, so I don't have to
use factor(weather_ef1).
From the documentation (use ?formula):
The * operator denotes factor crossing: a*b interpreted as a+b+a:b.
In other words a*b adds the main effects of a and b and their interaction. So in your model when you use income*enrollment this is the same as income + enrollment +income:enrollment. The two versions you described for each model should give identical results. You could just have used:
lmer(y~ 1 + income*enrollment+ (1|householdID), data=data)
which also describes the same model.
If your variables are effect coded then you don't need to use factor but be careful about the interpretation of the effects.
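The expansion of * can be checked directly in base R by asking terms() which terms each formula contains. This is a small sketch on simulated data; the data frame toy and its coefficients are invented for the example, and lm is used instead of lmer only to keep it free of extra packages.

```r
# Three ways of writing the model-1 fixed effects
f_star  <- y ~ 1 + income * enrollment
f_both  <- y ~ 1 + income + enrollment + income * enrollment
f_colon <- y ~ 1 + income + enrollment + income:enrollment

# Term labels after R has expanded the formula
labels_of <- function(f) attr(terms(f), "term.labels")
print(labels_of(f_star))
print(labels_of(f_both))
print(labels_of(f_colon))

# With : alone, only the product term is present (no main effects)
print(labels_of(y ~ income:enrollment))

# Simulated data with an effect-coded (-1 / 1) binary predictor
set.seed(2)
toy <- data.frame(income = rnorm(50),
                  enrollment = sample(c(-1, 1), 50, replace = TRUE))
toy$y <- 1 + 0.5 * toy$income + 0.3 * toy$enrollment +
  0.2 * toy$income * toy$enrollment + rnorm(50)

# The * and : spellings give the same coefficients
same_fit <- isTRUE(all.equal(coef(lm(f_star, data = toy)),
                             coef(lm(f_colon, data = toy))))
print(same_fit)
```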
Okay, I have students in classrooms in schools. I want to know whether test scores depend on the school.
My basic model is:
basemodel <- lmer(test ~ schoolnumber +
(1 | schoolnumber/classnumber), data=mydata)
Do I want to try and add in the student level?
Doesn't work:
model1 <- lmer(test ~ schoolnumber +
(1 | schoolnumber/classnumber/ studentID), data=ED)
Doesn't work:
model2 <- lmer(test ~ schoolnumber +
(1 | schoolnumber/classnumber) +( 1 |studentID), data=ED)
Doesn't work:
model3 <- lmer(test ~ schoolnumber +
(1 + studentID | schoolnumber/classnumber), data=ED)
model4 <- lmer(test ~ schoolnumber + studentID +
(1 | schoolnumber/classnumber), data=ED)
When I add student ID it says
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
Also, my current test score is a standardised score: raw scores were converted to z scores and then linearly transformed to standard scores, 100 + 15(z).
Am I okay to use these linear transformed scores or should I be using something else? I've seen code elsewhere saying to use scale()?
As Roland says, if schoolnumber is categorical/a factor variable, then your first model should fail:
~ schoolnumber + (1 | schoolnumber/classnumber)
includes schoolnumber as both a fixed categorical predictor and as a random effects grouping variable. ~ (1|schoolnumber/classnumber) would make more sense.
If you get rid of schoolnumber as a fixed effect predictor, then
~ (1 | schoolnumber/classnumber) + (1|studentID)
should work. I wouldn't recommend adding studentID as a fixed effect.
I'm assuming that students are labeled uniquely, i.e. that there isn't a student 1A57 in school number 1 and a different student 1A57 in school number 2 ...
How large is your data set at each level (observations, students, classes, schools)? I'm guessing that students are nested within schools but crossed among classes, i.e. each student is in only one school but in more than one class. As long as you have students labeled uniquely, it won't matter as much how you specify the model.
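Why unique labels matter can be seen from how the nesting operator expands. The sketch below uses a made-up layout (3 schools, class numbers 1 and 2 reused in every school, 4 students per class); the data frame ed_toy and the column classID are hypothetical. In lme4, (1 | schoolnumber/classnumber) is shorthand for (1 | schoolnumber) + (1 | schoolnumber:classnumber), which mirrors the base-R expansion shown here.

```r
# Hypothetical layout: class numbers are reused across schools
ed_toy <- expand.grid(student = 1:4,
                      classnumber = factor(1:2),
                      schoolnumber = factor(1:3))

# a/b expands to a + a:b
nesting_terms <- attr(terms(~ schoolnumber / classnumber), "term.labels")
print(nesting_terms)

# On its own, classnumber has only as many levels as there are reused labels
n_reused_labels <- nlevels(ed_toy$classnumber)

# Combining school and class gives one label per actual classroom
ed_toy$classID <- interaction(ed_toy$schoolnumber, ed_toy$classnumber, drop = TRUE)
n_unique_classes <- nlevels(ed_toy$classID)

print(c(reused = n_reused_labels, unique = n_unique_classes))
```

With a unique identifier such as classID, a term like (1 | classID) refers to the same grouping as the schoolnumber:classnumber part of the nested specification.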
I have a serious problem with R. I could not figure out how to run a logit regression with an instrumental variable.
The tricky thing is that I have 2 independent variables that work as an interaction term, but the instrument only works on one of the two independent variables. Further, I have a couple of controls.
I tried a couple of things with ivreg from the AER package, but I could not figure out what I have to type in the regression command.
I would be so grateful if somebody could help me.
I think this post is what you need:
http://www.r-bloggers.com/a-simple-instrumental-variables-problem/
The code in the post:
library(AER)     # provides the CollegeDistance data
library(lmtest)  # provides encomptest()
data("CollegeDistance")
cd.d <- CollegeDistance
# First stage: regress the endogenous variable on the instrument
simple.ed.1s <- lm(education ~ distance, data = cd.d)
cd.d$ed.pred <- predict(simple.ed.1s)
# Second stage: use the predicted values in place of education
simple.ed.2s <- lm(wage ~ urban + gender + ethnicity + unemp + ed.pred, data = cd.d)
# Encompassing tests comparing the competing specifications
simple.comp <- encomptest(wage ~ urban + gender + ethnicity + unemp + ed.pred, wage ~ urban + gender + ethnicity + unemp + education, data = cd.d)
ftest.1s <- encomptest(education ~ tuition + gender + ethnicity + urban, education ~ distance, data = cd.d)
library(arm)     # provides coefplot()
# Coefficient plots for the naive and the two-stage models
coefplot(lm(wage ~ urban + gender + ethnicity + unemp + education, data = cd.d), vertical = FALSE, var.las = 1, varnames = c("Education","Unemp","Hispanic","Af-am","Female","Urban","Education"))
coefplot(simple.ed.2s, vertical = FALSE, var.las = 1, varnames = c("Education","Unemp","Hispanic","Af-am","Female","Urban","Education"))
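The two-stage logic can also be tried on simulated data where the true coefficient is known. This is a minimal sketch for a linear outcome, not a logit, and every variable in it (z as the instrument, w as a control, u as the unobserved confounder) is invented for the example. Unlike the first stage in the post's code, the first stage here also includes the control, which is the usual two-stage least squares convention.

```r
set.seed(42)
n <- 5000

z <- rnorm(n)   # instrument: affects x, not y directly
w <- rnorm(n)   # observed control
u <- rnorm(n)   # unobserved confounder: affects both x and y

x <- 0.8 * z + 0.5 * w + u + rnorm(n)
y <- 1.5 * x + 0.7 * w + 2 * u + rnorm(n)   # true effect of x is 1.5

# Naive regression: biased because u is omitted
ols <- lm(y ~ x + w)

# Stage 1: predict the endogenous variable from the instrument and the control
stage1 <- lm(x ~ z + w)
x_hat <- fitted(stage1)

# Stage 2: replace x with its first-stage prediction
stage2 <- lm(y ~ x_hat + w)

b_ols  <- unname(coef(ols)["x"])
b_2sls <- unname(coef(stage2)["x_hat"])
print(c(ols = b_ols, two_stage = b_2sls))
```

The standard errors reported by the manual second stage are not valid because they ignore the first-stage estimation. With AER, the same model would be written as ivreg(y ~ x + w | z + w), which reports appropriate standard errors.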