Using R for lack-of-fit F-test

I learnt how to use R to perform an F-test for lack of fit of a regression model, where $H_0$: "there is no lack of fit in the regression model". The test statistic is
$$F = \frac{SSLF/df_1}{SSPE/df_2},$$
where $df_1$ is the degrees of freedom for SSLF (the lack-of-fit sum of squares) and $df_2$ is the degrees of freedom for SSPE (the sum of squares due to pure error).
In R, the F-test (say for a model with 2 predictors) can be calculated with
anova(lm(y~x1+x2), lm(y~factor(x1)*factor(x2)))
Example output:
Model 1: y ~ x1 + x2
Model 2: y ~ factor(x1) * factor(x2)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     19 18.122
2     11 12.456  8    5.6658 0.6254 0.7419
F-statistic: 0.6254 with a p-value of 0.7419.
Since the p-value is greater than 0.05, we do not reject $H_0$ that there is no lack of fit. Therefore the model is adequate.
What I want to know is: why use two models, and why use the command factor(x1)*factor(x2)? Apparently 12.456, the RSS from Model 2, is magically the SSPE for Model 1.
Why?

You are testing whether a model with an interaction improves the model fit.
Model 1 corresponds to an additive effect of x1 and x2.
One way to "check" if the complexity of a model is adequate (in your case whether a multiple regression with additive effects make sense for your data) is to compare the proposed model with a more flexible/complex model.
Your model 2 has the role of this more flexible model. First the predictors are made categorical (by using factor(x1) and factor(x2)) and then an interaction between them is constructed by factor(x1)*factor(x2). The interaction model includes the additive model as a special case (i.e., model 1 is nested in model 2) and has several extra parameters to provide a potentially better fit to the data. Because the factor-interaction model fits a separate mean for every observed (x1, x2) combination, its residual sum of squares is exactly the pure-error sum of squares: the variation remaining within replicates, which no function of x1 and x2 could explain. That is why 12.456 from Model 2 is the SSPE for Model 1.
You can see the difference in the number of parameters between the two models in the output from anova. Model 2 has 8 extra parameters to allow for a better fit, but because the p-value is non-significant you would conclude that model 2 (with the extra flexibility from the additional 8 parameters) does not provide a significantly better fit to the data. Thus, the additive model provides a decent enough fit when compared to model 2.
Note that the trick above of making categories (factors) of x1 and x2 only really works when the number of unique values of x1 and x2 is low. If x1 and x2 are numeric and each individual has their own value, then model 2 is not that useful, as you end up with as many parameters as observations. In those situations more ad hoc modifications, such as binning the variables, are used.
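To see the whole procedure end to end, here is a minimal simulated sketch (the data, seed, and coefficients are made up purely for illustration):
# simulate data with replicates at each (x1, x2) combination
set.seed(1)
x1 <- rep(1:4, each = 6)
x2 <- rep(1:3, times = 8)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(24, sd = 0.5)

fit_additive  <- lm(y ~ x1 + x2)                   # Model 1: additive, linear in x1 and x2
fit_saturated <- lm(y ~ factor(x1) * factor(x2))   # Model 2: one mean per (x1, x2) cell
anova(fit_additive, fit_saturated)                 # lack-of-fit F-test
A large p-value here, as in the output above, means the extra cell means buy you nothing beyond the additive fit.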

Related

Model formula for Multilevel Model where different sets of predictors are used for random intercept and random slope

I have searched many web sources trying to find any information on this issue. I am working on a two-level multilevel model.
Problem Background
In level one, I have one response variable with one predictor variable. Therefore, I have one intercept and one slope coefficient at this level.
In level two, I have one model that predicts the level-one intercept and another that predicts the level-one slope. In my study, I will be using different predictors in the two level-two models. Nearly all tutorials on the internet and the books I have read assume the same level-two predictors are used to predict both the level-one intercept and the level-one slope.
Question
How should I specify my model in R? (Packages in use: lmerTest, blmer, and brms; they all use the same model formula syntax.)
List of Variables
y - level 1 response variable
L1x - level 1 predictor variable
L2a - level 2 predictor for the level 1 intercept
L2b - level 2 predictor for the level 1 slope
g - grouping variable
What I Know
Null Model: This is simple. I think I have done it correctly.
y ~ (1|g)
Random Intercept Model: I am pretty sure this is correct too. L1x will be a fixed-effect predictor, and this only allows the intercept to vary across the different groups.
y ~ L1x + (1 | g)
What I Don't Know
How do I make a formula for random intercept and random slope and beyond? I know that when you have the same level-two predictors for both the intercept and the slope, it is
y ~ L1x + (L2b | g)
But to my understanding, this assumes L2b to be the level-two predictor of both the level-one intercept and the level-one slope. How do I formulate my model when the level-two predictors for the level-one intercept and slope are different? I hope my question makes sense to everybody. (My best guess appears below.)
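My best guess, from substituting the level-two equations into the level-one equation, is a cross-level interaction formula along these lines, though I am not at all sure this specification is right:
# combined model (my guess): L2a predicts the intercept, L2b predicts the slope of L1x
y ~ L1x + L2a + L1x:L2b + (L1x | g)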
Note:
This is my first time posting. Please let me know what I should do to make the question clearer to you. Thank you.
I could not figure out how to use LaTeX code here, so I am adding the model as images.
[Image: Level One Model]
[Image: Level Two Models]

Linear mixed model comparison with ANOVA in R

I have two models:
model1 = y ~ a + b*c + (1|d)
model2 = y ~ a*e + c + (1|d)
I wanted to compare how they do.
anova(model1, model2)
This is the result (the output was posted as an image and is not reproduced here):
Why is the p value 0?
Thank you!
Desperate grad student
Hi Desperate Grad student! Typically, the ANOVA test is used to test whether a complex model is necessary relative to a simpler, more parsimonious model nested within it. Since, in your case, you're comparing two models with the same number of parameters, you have 0 degrees of freedom (where df = # of parameters in the complex model - # of parameters in the simpler model). This is why you have an absent p-value associated with this comparison.
However, since you have the information criteria for both of these models (AIC/BIC), you can use those to compare the two. Here, model 1 is favored since its AIC and BIC are lower than those for model 2.
If you're set on using the ANOVA approach to compare models, consider creating an "intercept-only" null model, with formula y ~ 1 + (1|d) so the random effect is retained, as your basis for comparison (a sketch follows).
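A minimal sketch of both comparisons, assuming lme4 and a hypothetical data frame dat with columns y, a, b, c, e, and the grouping factor d (ML fits, since likelihood-ratio tests on fixed effects need REML = FALSE):
library(lme4)
model0 <- lmer(y ~ 1 + (1 | d), data = dat, REML = FALSE)         # intercept-only baseline
model1 <- lmer(y ~ a + b * c + (1 | d), data = dat, REML = FALSE)
model2 <- lmer(y ~ a * e + c + (1 | d), data = dat, REML = FALSE)
anova(model0, model1)  # nested comparison: likelihood-ratio test is meaningful
AIC(model1, model2)    # non-nested, equal-size models: compare information criteria
BIC(model1, model2)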

What post-hoc test should be used for a glmer model with a continuous and a categorical predictor variable?

I'm a bit of a newbie with stats and R, so need a bit of direction to find a suitable post-hoc test for my glmer model.
The model has a binary dependent variable (absent/present), and the predictor variables are interaction terms between a continuous variable (e.g., temp) and a categorical variable (species, n = 3). Only the interaction terms, rather than the continuous factor in isolation, produce significant results when an anova is run on the model. Species by itself has a large effect because one species is much rarer than the others. I'm trying to tease apart how the presence of these species varies across pH and between species.
I've tried an lsmeans test with Tukey adjustment, Firth's bias-reduced logistic regression, and emmeans. I ran the effects function on the interaction terms, so I had a rough expectation of what a post hoc could show, but the results logistf (Firth's) produced were not what I was expecting. emmeans and Tukey both gave the same results and ignored the continuous variable, I assume because it's not a factor.
When I run Firth's regression it produces chi-squared values that are enormous and p-values that are astronomically small, even though what I saw through effects suggested no significant difference. I can't tell with the interaction term whether there truly is an effect of the environmental variable or whether the significant effect is due to the difference between species. Based on what I had seen of the logistf function, I didn't think it would produce a chi-squared statistic at all. Is this an issue in my coding or is it because of my data?
If I wasn't clear enough about something please let me know and if anyone has any suggestions or advice, they would be massively appreciated. Thanks!
The model and test code I used are below:
### glmer model
Large <- glmer(Abs.Pres ~ Species:Q.Depth + Species:Conductivity + Species:Temp +
                 Species:pH + Species:DO.P + (1 | QID),
               nAGQ = 0,
               family = binomial,
               data = Stacked_Pref)
anova(Large)
Output:
Analysis of Variance Table
                     npar  Sum Sq Mean Sq F value
Species:Q.Depth         3 234.904  78.301 78.3014
Species:Conductivity    3  32.991  10.997 10.9970
Species:Temp            3  39.001  13.000 13.0004
Species:pH              3  25.369   8.456  8.4562
Species:DO.P            3  34.930  11.643 11.6434
### Firth's
Lp <- logistf(Abs.Pres ~ Species:pH, data = Stacked_Pref,
              contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
> Lp
logistf(formula = Abs.Pres ~ Species:pH, data = Stacked_Pref,
contrasts.arg = list(pH = "contr.treatment", Species = "contr.sum"))
Model fitted by Penalized ML
Confidence intervals and p-values by Profile Likelihood
                         coef   se(coef) lower 0.95 upper 0.95    Chisq            p
(Intercept)         1.9711411 0.57309880  0.8552342  3.1015114 12.09107 5.066380e-04
SpeciesGoby:pH     -0.3393185 0.07146049 -0.4804047 -0.2003108 23.31954 1.371993e-06
SpeciesMosquito:pH -0.3001385 0.07127771 -0.4408186 -0.1614419 18.24981 1.937453e-05
SpeciesRFBE:pH     -0.4771393 0.07232469 -0.6200179 -0.3365343 45.73750 1.352096e-11
Likelihood ratio test=267.0212 on 3 df, p=0, n=3945
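One way to compare the pH effect between species directly, rather than comparing species at a fixed pH, is emtrends from the emmeans package. A sketch, assuming the Large model above (emtrends estimates each species' slope for the continuous variable, which is exactly what the factor-based post hocs were ignoring):
library(emmeans)
slopes <- emtrends(Large, ~ Species, var = "pH")  # per-species pH slope on the link scale
slopes
pairs(slopes)  # pairwise (Tukey-adjusted) comparisons of those slopes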

How to report overall results of an nlme mixed effects model

I want to report the results of a one-factor lme from the nlme package. I want to know the overall effect of A on y. To do so I would compare the model with a null model:
m1 <- lme(y ~ A, random = ~1|B/C, data = data, weights = varIdent(form = ~1|A), method = "ML")
m0 <- lme(y ~ 1, random = ~1|B/C, data = data, weights = varIdent(form = ~1|A), method = "ML")
I am using maximum likelihood because I am comparing models with different main effects.
stats::anova(m0,m1) gives me a significant p value, meaning that there is a significant effect of A on y. However, in contrast to lmer models made with lme4, no Chi2 values are given. First: Is this approach valid? And second: What is the best way to report the result?
Thanks for your answers
An anova with lme should give you the same information as with lmer. Both use what's called a deviance test or likelihood-ratio test. The L.Ratio in the table returned by anova is simply the difference in the log-likelihoods of the two models multiplied by -2. A deviance test compares this value to a Chi2 distribution with degrees of freedom equal to the difference in the number of model parameters (in your case, 1). So the value reported under L.Ratio for lme models is the same as the Chi2 value reported for lmer models (assuming the models are the same, of course; lmer rounds the value to one decimal).
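A minimal sketch verifying that relationship by hand, assuming the m0 and m1 fits from the question:
comp <- anova(m0, m1)
comp$L.Ratio                      # likelihood-ratio (deviance) statistic
-2 * (logLik(m0) - logLik(m1))    # the same value, computed from the log-likelihoods
comp$`p-value`                    # its p-value from the Chi2(1) reference distribution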
The approach is valid, and you could report the value under L.Ratio along with the degrees of freedom and p-value, but I would add more information in your report, such as the fixed and random coefficients of both models and the other parameters you've added (such as the difference in variance across levels of A specified under weights). If you're only interested in the fixed effect of A, then a Wald test should also be appropriate, though REML estimates are recommended in cases with a small number of groups (Snijders & Bosker, 2012). The test statistic is the t-value and associated p-value in the model summary output summary(m1). Chapter 6 in Snijders & Bosker (2012) gives a great explanation of tests for fixed and random parameters, along with reporting examples.

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e., individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, I have a lot of observations with missing income data for at least one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid <- rep(1:4, each = 6)
years  <- rep(c(1998, 2000, 2002, 2004, 2006, 2008), times = 4)
income <- c(1100,   NA,   NA,   NA,   NA, 1300,
            1500, 1900, 2000,   NA, 2200,   NA,
              NA,   NA,   NA,   NA,   NA,   NA,
            2300, 2500, 2000, 1800,   NA, 1900)
df <- data.frame(caseid, years, income)
I decided to use a random effects model, which I think will still predict income for missing years via a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe <- plm(income ~ years, data = df, model = "within", effect = "individual")
However, I get a coefficient only for years and not for individuals, and I cannot get residuals.
To maybe give an idea, the code in Stata would be
xtset caseid
xtreg income years, fe
predict resid, resid
Then I tried to run the pvcm function from the same package, which fits variable-coefficients models:
inc.wi <- pvcm(income ~ years, data = df, model = "within", effect = "individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long-form data has 202,976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to handle with custom code once you've chosen how you want to deal with it.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let $Y_{it}$ = income for individual $i$ ($i = 1, \ldots, N$) in year $t$ ($t = 1, \ldots, T$). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
$$Y_{it} \sim N(\mu_i + \beta t + \gamma_i t,\; \sigma^2)$$
$$\mu_i \sim N(\phi_0, \tau_0^2)$$
$$\gamma_i \sim N(\phi_1, \tau_1^2)$$
M2: random intercepts, random slopes
$$Y_{it} \sim N(\mu_i + \gamma_i t,\; \sigma^2)$$
$$\mu_i \sim N(\phi_0, \tau_0^2)$$
$$\gamma_i \sim N(\phi_1, \tau_1^2)$$
Also, your example data is nonsensical (see below): you don't have enough observations to estimate all the parameters. I'm not familiar with library(plm), but the above models (without missingness) can be estimated easily in lme4. Without a realistic example dataset, I won't bother providing full code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.
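For what it's worth, a bare-bones sketch of the M1 specification in lme4, using the toy df above purely to show the syntax (with this little data lmer will likely warn about convergence or singular fits):
library(lme4)
# M1 in lmer syntax: fixed intercept and slope, plus per-individual deviations.
# Rows with missing income are dropped automatically (na.omit is the default).
fit <- lmer(income ~ years + (years | caseid), data = df)
coef(fit)       # per-individual intercepts and slopes (fixed + random parts)
residuals(fit)  # within-individual residuals, one per non-missing observation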
