Interpret regression with heteroskedasticity-corrected standard errors in R

In my data I have problems with heteroscedasticity, as indicated by the Breusch-Pagan test and the NCV test, which are both significant.
Therefore, I would like to follow the method posted by Gavin Simpson here:
Regression with Heteroskedasticity Corrected Standard Errors
This seems to work, but now I have trouble interpreting the results, as they look very different from my original multiple regression results.
mySummary(model_maineffect, vcovHC)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5462588 0.0198430 -27.5291 < 2.2e-16 ***
IV1 0.0762802 0.0082630 9.2315 < 2.2e-16 ***
Control1 -0.0062260 0.0071657 -0.8689 0.38493
Control2 0.0277049 0.0066251 4.1818 2.910e-05 ***
Control3 0.0199855 0.0104345 1.9153 0.05547 .
Control4 -0.4639035 0.0083046 -55.8608 < 2.2e-16 ***
Control5 0.6239948 0.0072652 85.8876 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Wald test
Model 1: DV ~ IV1 + Control1 + Control2 +
Control3 + Control4 + Control5
Model 2: DV ~ 1
Res.Df Df F Pr(>F)
1 14120
2 14128 -8 1304.6 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Can I interpret them in the same way as a multiple regression, i.e., IV1 has a highly significant effect on DV since its Pr(>|t|) is < 0.001? And does the significant Wald test (Pr(>F) < 0.001) mean the model is a significant improvement over the intercept-only model? How should I report my R-squared in this case?
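For reference, the underlying calls behind that output (a minimal sketch, assuming the sandwich and lmtest packages and a fitted lm object named model_maineffect, as in the linked answer) look like:

```r
library(sandwich)
library(lmtest)

# Coefficient tests with heteroskedasticity-consistent (HC) standard errors
coeftest(model_maineffect, vcov. = vcovHC(model_maineffect))

# Overall Wald test against the intercept-only model, using the same
# robust covariance
waldtest(model_maineffect, vcov = vcovHC(model_maineffect))

# The robust correction changes only the standard errors, t values, and
# p values; the coefficient estimates and fitted values are still the OLS
# ones, so the usual R-squared from the original fit still applies
summary(model_maineffect)$r.squared
```

Because the point estimates are unchanged, the R-squared reported by summary() on the original model is the one to report alongside the robust tests.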

Related

How to improve my Generalized Additive Model (GAM) performance and fix heteroscedasticity?

I have a large dataset which I try to model with a GAM. The model performance is otherwise good, but the errors are not normally distributed. Do you have any ideas how this problem could be solved? I have added the histogram of my data (skewness is 0.4, so no transformation should be needed), the model results, and the residual plots.
Code and model results for the model:
Family: gaussian
Link function: identity
Formula:
y ~ s(x, k = 20) + s(z, k = 23) + w
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 179.9106 0.2917 616.7 <2e-16 ***
w -26.8263 0.2595 -103.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 15.33 16.44 611.9 <2e-16 ***
s(z) 16.39 18.53 24365.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.839 Deviance explained = 83.9%
-REML = 1.3054e+06 Scale est. = 1287.3 n = 261113
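One option worth knowing about for heteroscedastic Gaussian responses in mgcv (a sketch only, reusing the y/x/z/w names from the model above): the gaulss location-scale family fits one linear predictor for the mean and a second one for the (log) standard deviation, so the variance no longer has to be constant.

```r
library(mgcv)

# Gaussian location-scale model: first formula is the mean, second is
# the log standard deviation (here allowed to vary smoothly with x)
m <- gam(list(y ~ s(x, k = 20) + s(z, k = 23) + w,
                ~ s(x)),
         family = gaulss())
```

Whether this is appropriate depends on what the residual plots show; if the spread of the residuals varies systematically with a covariate, letting the scale depend on that covariate is a natural fix.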

How to use the Wald Chi-Sq test statistic in a GAM to determine which smooth term has the largest impact on the response?

I am tasked with determining, from the summary of a generalized additive model, which smooth term is most impactful on the response. Intuitively I understand that to be the smooth term with the largest Chi-sq test statistic listed in the summary, so it would be s(tests) in the case below:
Family: poisson
Link function: log
Formula:
daily_confirmed_cases ~ s(tests, k = 18) + s(vaccines, k = 18) +
s(people_fully_vaccinated, k = 18) + s(hosp) + s(icu) + s(ndate,
k = 18)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.405421 0.001489 4973 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(tests) 16.793 16.98 14076 <2e-16 ***
s(vaccines) 16.982 17.00 9744 <2e-16 ***
s(people_fully_vaccinated) 16.923 17.00 7337 <2e-16 ***
s(hosp) 8.988 9.00 6893 <2e-16 ***
s(icu) 8.985 9.00 7246 <2e-16 ***
s(ndate) 16.674 16.96 11156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.965 Deviance explained = 97.7%
fREML = 19764 Scale est. = 1 n = 460
Is this approach correct, and if so, can you explain the rationale for why a high Chi-sq statistic indicates a greater impact on the GAM?
Thanks
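One caveat worth keeping in mind: the Chi-sq column is a Wald statistic for the null hypothesis that the smooth is zero, so it measures evidence against "no effect" rather than effect size directly. A rough alternative sketch (the data frame name `covid` is a placeholder for the data behind the summary above): compare the drop in deviance explained when each smooth is removed.

```r
library(mgcv)

# Refit the full model from the summary above (hypothetical data `covid`)
full <- gam(daily_confirmed_cases ~ s(tests, k = 18) + s(vaccines, k = 18) +
              s(people_fully_vaccinated, k = 18) + s(hosp) + s(icu) +
              s(ndate, k = 18),
            family = poisson, data = covid)

# Drop one smooth and see how much deviance explained is lost
no_tests <- update(full, . ~ . - s(tests, k = 18))
summary(full)$dev.expl - summary(no_tests)$dev.expl  # contribution of s(tests)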

Small sample (20-25 observations) - Robust standard errors (Newey-West) do not change coefficients/standard errors. Is this normal?

I am running a simple regression (OLS)
> lm_1 <- lm(Dependent_variable_1 ~ Independent_variable_1, data = data_1)
> summary(lm_1)
Call:
lm(formula = Dependent_variable_1 ~ Independent_variable_1,
data = data_1)
Residuals:
Min 1Q Median 3Q Max
-143187 -34084 -4990 37524 136293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330853 13016 25.418 < 2e-16 ***
`GDP YoY% - Base` 3164631 689599 4.589 0.000118 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66160 on 24 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.4674, Adjusted R-squared: 0.4452
F-statistic: 21.06 on 1 and 24 DF, p-value: 0.0001181
The autocorrelation and heteroskedasticity tests follow:
> dwtest(lm_1,alternative="two.sided")
Durbin-Watson test
data: lm_1
DW = 0.93914, p-value = 0.001591
alternative hypothesis: true autocorrelation is not 0
> bptest(lm_1)
studentized Breusch-Pagan test
data: lm_1
BP = 9.261, df = 1, p-value = 0.002341
then I run a regression with standard errors robust to autocorrelation and heteroskedasticity (HAC, Newey-West):
> coeftest(lm_1, vocv=NeweyWest(lm_1,lag=2, prewhite=FALSE))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 330853 13016 25.4185 < 2.2e-16 ***
Independent_variable_1 3164631 689599 4.5891 0.0001181 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and I get the same coefficients and standard errors.
Is this normal? Is it due to the small sample size?
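One thing worth double-checking before attributing this to the sample size: in lmtest, coeftest()'s covariance argument is spelled vcov. (with a trailing dot). A misspelled argument name such as vocv= is silently absorbed by ..., so the default non-robust covariance is used and the output is identical to summary(). A sketch of the corrected call:

```r
library(lmtest)
library(sandwich)

# Note the argument name `vcov.` -- a misspelled name (e.g. `vocv`) is
# silently ignored, which reproduces the non-robust summary() output
coeftest(lm_1, vcov. = NeweyWest(lm_1, lag = 2, prewhite = FALSE))
```

With the argument spelled correctly, the coefficients will still match OLS (HAC estimators only change the covariance matrix), but the standard errors, t values, and p values should differ.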

Convert mgcv or gamm4 gam/bam output to dataframe

The broom package has a great tidy() function for the summary results of simple linear models such as those generated by lm(). However, tidy() does not work for mgcv::bam(), mgcv::gam(), or gamm4::gamm4(). The bam() call below produces the following:
library(mgcv)
library(broom)  # for tidy() and glance()
set.seed(3)
dat <- gamSim(1,n=25000,dist="normal",scale=20)
bs <- "cr";k <- 12
b <- bam(y ~ s(x0,bs=bs)+s(x1,bs=bs)+s(x2,bs=bs,k=k)+
s(x3,bs=bs),data=dat)
summary(b)
tidy(b)
glance(b)
Output of above code:
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, bs = bs) + s(x1, bs = bs) + s(x2, bs = bs, k = k) +
s(x3, bs = bs)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.8918 0.1275 61.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 3.113 3.863 6.667 3.47e-05 ***
s(x1) 2.826 3.511 63.015 < 2e-16 ***
s(x2) 8.620 9.905 52.059 < 2e-16 ***
s(x3) 1.002 1.004 3.829 0.0503 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0295 Deviance explained = 3.01%
fREML = 1.1057e+05 Scale est. = 406.15 n = 25000
> tidy(b)
data frame with 0 columns and 0 rows
> glance(b)
Error in `$<-.data.frame`(`*tmp*`, "logLik", value = -110549.163197452) :
replacement has 1 row, data has 0
How can I convert the summary to a dataframe so I can access outputs like the coefficients?
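Without relying on broom, the object returned by summary.gam() already carries matrices that convert cleanly to data frames (a sketch using the bam fit `b` from above):

```r
# summary.gam() returns a list whose p.table and s.table components hold
# the parametric and smooth-term tables printed in the console output
s <- summary(b)

para   <- as.data.frame(s$p.table)  # Estimate, Std. Error, t value, Pr(>|t|)
smooth <- as.data.frame(s$s.table)  # edf, Ref.df, F, p-value per smooth
```

Row names carry the term labels (e.g. "s(x0)"), and scalar summaries such as s$r.sq and s$dev.expl can be collected into a one-row data frame if a glance()-style output is needed.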

Get pearson's effect size from GAM on R

I'm fitting a GAM in R. I have many variables in the model and I'd like to know Pearson's effect size for each (the aim is to compare the importance of a focal variable "A" between different datasets). I could use the F value, but my datasets have different sizes (n), which influences the value of F...
Is it possible to obtain it from the result of an anova or from summary.gam? Or is there a specific function that I haven't found?
Here is the result for one of my datasets:
summary.gam(modF)
Family: gaussian
Link function: identity
Formula:
W ~ B + s(A, k = 4) + s(C) + s(D) + s(E) + s(F) + s(G)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.39861 0.09408 4.237 2.27e-05 ***
BB2 0.42625 0.07903 5.393 6.95e-08 ***
BB3 0.37377 0.09741 3.837 0.000125 ***
BB4 0.31527 0.09500 3.319 0.000905 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(A) 2.573 2.859 493.12 <2e-16 ***
s(C) 8.923 8.998 391.11 <2e-16 ***
s(D) 8.921 8.998 539.71 <2e-16 ***
s(E) 8.894 8.997 119.45 <2e-16 ***
s(F) 8.279 8.858 13.04 <2e-16 ***
s(G) 8.752 8.977 27.99 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.231 Deviance explained = 23.2%
GCV = 0.064502 Scale est. = 0.064434 n = 47402
Thanks!
Edit:
anov<-anova(modF)
anov$s.table
edf Ref.df F p-value
s(A) 2.572899 2.858737 493.11752 1.799004e-291
s(C) 8.923426 8.998212 391.10473 0.000000e+00
s(D) 8.921437 8.998192 539.70477 0.000000e+00
s(E) 8.893891 8.996777 119.44924 9.555698e-224
s(F) 8.279099 8.857940 13.03855 2.554811e-20
s(G) 8.751735 8.976584 27.99345 8.139591e-49
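One rough possibility, building on the s.table extracted above (this is a heuristic assumption on my part, not an established effect size for smooth terms): treat each smooth's approximate F test like a classical F test and compute a partial-eta-squared-style quantity, eta2 = (F * df1) / (F * df1 + df2), with df1 taken as Ref.df and df2 as the model's residual degrees of freedom. This puts the terms on a 0-1 scale that is less directly tied to n than the raw F value.

```r
# Heuristic partial-eta-squared-style effect size per smooth term,
# computed from the anova s.table of the fitted GAM `modF`
st   <- anova(modF)$s.table
df2  <- modF$df.residual
eta2 <- (st[, "F"] * st[, "Ref.df"]) / (st[, "F"] * st[, "Ref.df"] + df2)
eta2
```

Because the approximate F tests for smooths are themselves only approximate, any such derived quantity should be interpreted as a rough comparison device rather than a formal effect size.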
