R: step function not writing out complete model in result report

I am running the step function from the lmerTest package in RStudio on this model:
inputData.entry <- lmer(height ~ ENTRY_NO + REP + (1 | SUB.BLOCK), data = inputData)  # our model
This is what I am running with step:
help.search("step", package = "lmerTest")
st <- step(inputData.entry, reduce.fixed = FALSE)
print(st)
Here is the output:
Backward reduced random-effect table:
                Eliminated npar  logLik    AIC     LRT Df Pr(>Chisq)
<none>                      142 -397.15 1078.3
(1 | SUB.BLOCK)          1  141 -397.47 1076.9 0.63157  1     0.4268

Backward reduced fixed-effect table:
         Eliminated  Df Sum of Sq    RSS    AIC F value    Pr(>F)
ENTRY_NO          0 138    4238.1 6210.4 844.18  1.9775 5.749e-05 ***
REP               0   1      30.6 2002.9 816.03  1.9720    0.1627
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model found:
height ~ ENTRY_NO + REP
My issue is with the "Model found:" statement.
Why doesn't the result list SUB.BLOCK in the model, when inputData.entry includes it in the lmer call?
Is there something I am doing wrong?
Thanks for the advice!

The step function performs backward elimination and reports the reduced model it arrives at. "Model found" tells you which model was selected according to these criteria. In this instance the algorithm is telling you that SUB.BLOCK does not contribute meaningful information to the analysis relative to the added complexity it introduces (p = 0.4268 in the random-effect table above), and thus it is not in the suggested model.
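If the aim is to keep the random term regardless of the test, the lmerTest step() method also has a reduce.random argument. A minimal sketch on lme4's built-in sleepstudy data (standing in for the asker's inputData, which we don't have), assuming lmerTest is installed:

```r
# Hedged sketch: prevent elimination of random effects during backward reduction.
library(lmerTest)  # masks lme4::lmer so step() dispatches to the lmerTest method

data("sleepstudy", package = "lme4")
m  <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
st <- step(m, reduce.random = FALSE)  # random terms are never dropped
final <- get_model(st)                # extract the model step() settled on
formula(final)                        # still contains (1 | Subject)
```

With reduce.random = FALSE only the fixed part is reduced, so "Model found" will always retain the random term.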

Related

How to get the overall significance from an mblogit model (mclogit package)

I am modeling a multiple-choice response with the mclogit package (mblogit function) and I don't get an overall model significance in summary(). Running anova() for a comparison with a null model doesn't work either. Moreover, I'm not sure the chi-square test even applies here. I get this warning:
Warning in anova.mclogitlist(c(list(object), dotargs), dispersion = dispersion,
Results are unreliable, since deviances from quasi-likelihoods are not comparable.
Analysis of Deviance Table

Model 1: this_response ~ 1
Model 2: this_response ~ classLevel
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1      7755     6530.4
2      7704     6058.6 51   471.82 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here's my code:
m1 <- mclogit::mblogit(
  formula = this_response ~ classLevel,
  random  = ~ 1 + classLevel | question,
  data    = test_set_8_df)
summary(m1)  # -> significance for the coefficients but not for the overall model

m0 <- mclogit::mblogit(
  formula = this_response ~ 1,
  data    = test_set_8_df)
anova(m0, m1, test = "Chisq")  # -> produces the warning above

R: Calculating ANOVA Sum sqr for a model with interacting numerical and categorical variables

I need to know how the Sum Sq column of R's anova() function is calculated for a linear model of the form:
modelXg <- lm(Y ~ X * group, data)
(which is equivalent to lm(Y ~ X + group + X:group, data = dat))
where X is a numerical variable and group is a categorical one.
The function anova(modelXg) returns a table like:
Analysis of Variance Table

Response: TMIN
             Df  Sum Sq Mean Sq  F value    Pr(>F)
X             1    6476  6476.1 282.9208 < 2.2e-16 ***
group         1    1176  1176.4  51.3956 7.666e-13 ***
X:group       1      64    64.2   2.8058   0.09393 .
Residuals 45130 1033029    22.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What I need is to know how to calculate all the terms of the Sum Sq column, described in a way that is as simple and reproducible as possible, because I need to implement it in C#.
I already searched a lot across the net, but I didn't find this exact case. I found some useful info in "Interpretation of Sum Sq in ANOVA with numeric independent variable", but it is incomplete for this case because the model discussed there does not involve the interaction between the two variables.
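For reference, anova() on an lm fit reports sequential (Type I) sums of squares: each term's Sum Sq is the reduction in the residual sum of squares when that term is added to the model after every term that precedes it in the formula. That recipe is easy to reproduce in any language. A base-R sketch on simulated data (the data and variable names are made up):

```r
# Sequential (Type I) Sum Sq = drop in RSS as terms are added in formula order.
set.seed(1)
dat <- data.frame(X = rnorm(100), group = factor(rep(c("a", "b"), 50)))
dat$Y <- 1 + 2 * dat$X + 0.5 * (dat$group == "b") + rnorm(100)

rss <- function(f) sum(resid(lm(f, data = dat))^2)  # residual sum of squares

ss_X   <- rss(Y ~ 1)         - rss(Y ~ X)          # Sum Sq for X
ss_grp <- rss(Y ~ X)         - rss(Y ~ X + group)  # Sum Sq for group
ss_int <- rss(Y ~ X + group) - rss(Y ~ X * group)  # Sum Sq for X:group

tab <- anova(lm(Y ~ X * group, data = dat))
all.equal(unname(tab[["Sum Sq"]][1:3]), c(ss_X, ss_grp, ss_int))  # TRUE
```

So for a C# port: fit the nested sequence of least-squares models (intercept only, + X, + group, + interaction) and difference the RSS values; the residual row is the RSS of the full model.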

Why does the drop1 function in R give me zero degrees of freedom and no significance test result for a continuous fixed effect in a GLM?

I am trying to determine the influence of five categorical and one continuous independent variable on some ecological count data using a GLM in R. Here is an example of what the data that I am using looks like:
No. DateNum Tunnel Trt Row Time AvgTemp sqTotal.L
1   44382   1      A   3   AM   30.0    1.41
2   44384   3      C   2   PM   21.0    2.23
3   44384   7      D   3   AM   24.0    3.65
4   44400   4      B   1   AM   27.5    2.78
The fixed effects DateNum, Tunnel and Row are coded as ordered factors, Trt and Time are unordered factors, and AvgTemp is coded as numeric. sqTotal.L is the square-root-transformed count data, which is a normally distributed response variable. I decided to use a GLM instead of an ANOVA because the experimental design is not balanced and there are different numbers of samples from the different experimental plots.
When I run the following code for a GLM and then drop1() on the resulting model, the effect of my continuous fixed effect (AvgTemp) appears not to be incorporated into the result of drop1():
> rowfvs.glm <- glm(sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + Trt,
+ family = gaussian, data = rowfvs2)
> summary(rowfvs.glm)
Call:
glm(formula = sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel +
Trt, family = gaussian, data = rowfvs2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.63548 -0.38868 0.06587 0.41777 1.31886
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.89037 5.98768 1.485 0.1492
AvgTemp -0.28191 0.24566 -1.148 0.2612
Row.L -0.46085 0.24735 -1.863 0.0734 .
Row.Q 0.08047 0.25153 0.320 0.7515
DateNum.L -1.17448 0.85015 -1.382 0.1785
DateNum.Q 0.57857 0.64731 0.894 0.3793
DateNum.C -2.17331 2.15684 -1.008 0.3226
DateNum^4 -0.76025 1.09723 -0.693 0.4943
DateNum^5 -1.62269 0.68388 -2.373 0.0250 *
DateNum^6 0.63799 0.70822 0.901 0.3756
DateNum^7 NA NA NA NA
TimePM -0.31436 0.87881 -0.358 0.7233
Tunnel.L 1.38420 0.62199 2.225 0.0346 *
Tunnel.Q -0.03521 0.56561 -0.062 0.9508
Tunnel.C 0.81639 0.54880 1.488 0.1484
Tunnel^4 0.24029 0.61180 0.393 0.6976
Tunnel^5 0.30665 0.51849 0.591 0.5592
Tunnel^6 0.67603 0.53728 1.258 0.2191
TrtB 0.10067 0.40771 0.247 0.8068
TrtC 0.31278 0.41048 0.762 0.4527
TrtD -0.49857 0.46461 -1.073 0.2927
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.7583716)
Null deviance: 50.340 on 46 degrees of freedom
Residual deviance: 20.476 on 27 degrees of freedom
AIC: 136.33
Number of Fisher Scoring iterations: 2
> drop1(rowfvs.glm, test = "Chi")
Single term deletions
Model:
sqTotal.L ~ AvgTemp + Row + DateNum + Time + Tunnel + Trt
         Df Deviance    AIC scaled dev. Pr(>Chi)
<none>       20.476 136.33
AvgTemp   0  20.476 136.33      0.0000
Row       2  23.128 138.05      5.7249  0.05713 .
DateNum   6  25.517 134.67     10.3447  0.11087
Time      1  20.573 134.55      0.2222  0.63736
Tunnel    6  27.525 138.23     13.9039  0.03073 *
Trt       3  23.201 136.20      5.8725  0.11798
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
By contrast, when I try running the anova() function on the model, I do get an analysis of the influence of AvgTemp on sqTotal.L:
> anova(rowfvs.glm, test = "Chi")
Analysis of Deviance Table
Model: gaussian, link: identity
Response: sqTotal.L
Terms added sequentially (first to last)
         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                        46     50.340
AvgTemp   1   0.7885        45     49.551 0.3078945
Row       2   1.0141        43     48.537 0.5124277
DateNum   6  17.6585        37     30.879 0.0007065 ***
Time      1   0.3552        36     30.523 0.4937536
Tunnel    6   7.3222        30     23.201 0.1399428
Trt       3   2.7251        27     20.476 0.3088504
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, my questions are: why isn't drop1() taking AvgTemp into account, and is it sufficient to report the results from anova(), or do I need to figure out how to get drop1() to incorporate this continuous predictor?
This is a bit of a guess because we don't have your data, but: I believe the answer is related to the multicollinearity in your design matrix (as signalled by the message "1 not defined because of singularities" and the presence of an NA estimate for the DateNum^7 parameter).
When you have collinear (perfectly correlated) columns in your design matrix, how they get dealt with can be a bit unpredictable. The fitting routine picks one of the columns to discard: in this case it is DateNum^7. However, assuming AvgTemp is also in the set of collinear columns, if you drop AvgTemp from the model then the refit will no longer discard DateNum^7 (because it no longer needs to), yet you will still get the same goodness of fit (AIC, log-likelihood, etc.), because the variable you dropped was redundant. That is why drop1() reports zero degrees of freedom and no change in deviance for AvgTemp.
You should be able to explore this possibility via caret::findLinearCombos(model.matrix(rowfvs.glm)), although careful thought about your observational/experimental design might also enlighten you as to why these variables are collinear ...
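If installing caret is not an option, the same check can be sketched in base R with a pivoted QR decomposition of the design matrix, which is essentially what findLinearCombos does internally. The toy data below is made up purely to force an exact linear dependence:

```r
# Hedged sketch: flag linearly dependent design-matrix columns via qr().
set.seed(42)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
d$x3 <- d$x1 + d$x2                      # x3 is an exact linear combination

X <- model.matrix(~ x1 + x2 + x3, data = d)
q <- qr(X)                               # pivoted, rank-revealing QR
q$rank                                   # 3, although X has 4 columns
colnames(X)[q$pivot[-seq_len(q$rank)]]   # the column qr() deems redundant
```

Which of x1, x2, x3 gets flagged depends on the pivoting order; any one of them can serve as the redundant column, just as lm()/glm() arbitrarily chose DateNum^7 above.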

R code to test the difference between coefficients of regressors from one panel regression

I am trying to compare two regression coefficients from the same panel regression estimated over two different time periods, in order to test whether their difference is statistically significant. Running my panel regression first on observations over 2007-2009, I get an estimate of the coefficient I am interested in, which I want to compare with the estimate of the same coefficient obtained from the same panel model over 2010-2017.
Based on "R code to test the difference between coefficients of regressors from one regression", I tried to compute a likelihood-ratio test. The linked discussion uses a simple linear equation. If I use the same commands in R as described in that answer, I get results based on a chi-squared distribution, and I don't understand whether and how I can interpret that.
In R, I did the following:
linearHypothesis(reg.pannel.recession.fe, "Exp_Fri=0.311576")
where reg.pannel.recession.fe is the panel regression over the period 2007-2009, Exp_Fri is the coefficient of this regression I want to compare, 0.311576 is the estimated coefficient over the period 2010-2017.
I get the following results using linearHypothesis():
How can I interpret that? Should I use another function, since these are plm objects?
Thank you very much for your help.
You get an F test in that example because, as stated in the vignette:

The method for "lm" objects calls the default method, but it changes the default test to "F" [...]

You can also set the test to "F" explicitly, but basically linearHypothesis works whenever the standard error of the coefficient can be estimated from the variance-covariance matrix, as also said in the vignette:

The default method will work with any model object for which the coefficient vector can be retrieved by ‘coef’ and the coefficient-covariance matrix by ‘vcov’ (otherwise the argument ‘vcov.’ has to be set explicitly)
So, using an example from the package:
library(car)  # provides linearHypothesis()
library(plm)
data("Grunfeld", package = "plm")
wi <- plm(inv ~ value + capital,
          data = Grunfeld, model = "within", effect = "twoways")
linearHypothesis(wi, "capital = 0.3", test = "F")
Linear hypothesis test

Hypothesis:
capital = 0.3

Model 1: restricted model
Model 2: inv ~ value + capital

  Res.Df Df      F Pr(>F)
1    170
2    169  1 6.4986 0.01169 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

linearHypothesis(wi, "capital = 0.3")

Linear hypothesis test

Hypothesis:
capital = 0.3

Model 1: restricted model
Model 2: inv ~ value + capital

  Res.Df Df  Chisq Pr(>Chisq)
1    170
2    169  1 6.4986     0.0108 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And you can also use a t test:
tested_value <- 0.3
BETA  <- coefficients(wi)["capital"]
SE    <- coefficients(summary(wi))["capital", 2]
tstat <- (BETA - tested_value) / SE
# two-sided p-value; -abs() keeps this correct whatever the sign of tstat
pvalue <- as.numeric(2 * pt(-abs(tstat), wi$df.residual))
pvalue
[1] 0.01168515
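On interpreting the chi-squared output: with a single linear restriction, the Wald chi-square statistic coincides with the F statistic (6.4986 in both tables above); only the reference distribution, and hence the p-value, differs slightly. A quick base-R check using the statistic from the output:

```r
# Same Wald statistic, two reference distributions (values from the output above).
stat <- 6.4986
pchisq(stat, df = 1, lower.tail = FALSE)           # ~0.0108 (chi-square p-value)
pf(stat, df1 = 1, df2 = 169, lower.tail = FALSE)   # ~0.0117 (F p-value)
```

So the chi-squared version is interpreted exactly like the F version: a small p-value rejects the hypothesis that the coefficient equals the tested value; the chi-square test is just the large-sample counterpart.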

3-way ANOVA for reshaped data in R

I have just discovered reshaping in R and am unsure of how to proceed with an ANOVA once the data is reshaped. I found this site which has the data organized in a way very similar to my own data. If I were using this hypothetical data, how would I conduct a 3-way ANOVA say between race, program and subject? Now that the subjects have been reshaped into a single column I'm having trouble seeing how to include this variable using the typical ANOVA code. Any help would be much appreciated!
Assuming the data are in long format and score is your dependent variable, you could do something like:
mymodel <- aov(score ~ prog + race + subj, data = l)
summary(mymodel)
which in this case yields:
             Df Sum Sq Mean Sq F value   Pr(>F)
prog          1   2864    2864   31.32 2.82e-08 ***
race          1   5064    5064   55.39 2.14e-13 ***
subj          4    106      27    0.29    0.885
Residuals   993  90780      91
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
n.b. this model contains only the main effects
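To include the interactions as well, `*` in a formula expands to all main effects plus their interactions. A hedged sketch with simulated long-format data (the factor names prog, race, subj and the column score are borrowed from the answer above; the data frame here is made up):

```r
# Hedged sketch: full-factorial 3-way ANOVA on simulated long-format data.
set.seed(1)
l <- data.frame(
  prog = factor(sample(c("general", "academic"), 200, replace = TRUE)),
  race = factor(sample(c("white", "other"),      200, replace = TRUE)),
  subj = factor(sample(paste0("subj", 1:4),      200, replace = TRUE))
)
l$score <- rnorm(200, mean = 50, sd = 10)

mymodel <- aov(score ~ prog * race * subj, data = l)  # main effects + all interactions
summary(mymodel)  # 3 main effects, 3 two-way terms, 1 three-way term, Residuals
```

The summary table then has one row per main effect, per two-way interaction, and for the three-way interaction, in addition to the residuals row.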