Get Pearson's effect size from a GAM in R

I'm fitting a GAM in R. I have many variables in the model and I'd like to know Pearson's effect size for each (the aim is to compare the importance of a focal variable "A" across different datasets). I could use the F value, but my datasets have different sizes (n), which influences the value of F...
Is it possible to compute it from the result of an anova or from summary.gam? Or is there a specific function that I haven't found?
Here is my result for one of my datasets:
summary.gam(modF)
Family: gaussian
Link function: identity
Formula:
W ~ B + s(A, k = 4) + s(C) + s(D) + s(E) + s(F) + s(G)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.39861 0.09408 4.237 2.27e-05 ***
BB2 0.42625 0.07903 5.393 6.95e-08 ***
BB3 0.37377 0.09741 3.837 0.000125 ***
BB4 0.31527 0.09500 3.319 0.000905 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(A) 2.573 2.859 493.12 <2e-16 ***
s(C) 8.923 8.998 391.11 <2e-16 ***
s(D) 8.921 8.998 539.71 <2e-16 ***
s(E) 8.894 8.997 119.45 <2e-16 ***
s(F) 8.279 8.858 13.04 <2e-16 ***
s(G) 8.752 8.977 27.99 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.231 Deviance explained = 23.2%
GCV = 0.064502 Scale est. = 0.064434 n = 47402
Thanks!
Edit:
anov<-anova(modF)
anov$s.table
edf Ref.df F p-value
s(A) 2.572899 2.858737 493.11752 1.799004e-291
s(C) 8.923426 8.998212 391.10473 0.000000e+00
s(D) 8.921437 8.998192 539.70477 0.000000e+00
s(E) 8.893891 8.996777 119.44924 9.555698e-224
s(F) 8.279099 8.857940 13.03855 2.554811e-20
s(G) 8.751735 8.976584 27.99345 8.139591e-49
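
One way to get a size-adjusted measure from this table (a rough sketch only, not an established mgcv feature: it treats the approximate F tests of the smooths as ordinary F tests and converts them to partial eta-squared, which adjusts for sample size through the residual degrees of freedom) would be:

# Rough sketch: convert the approximate F statistics from anova.gam()
# into partial eta-squared, eta_p^2 = F*df1 / (F*df1 + df2).
anov <- anova(modF)
st   <- anov$s.table
df1  <- st[, "Ref.df"]       # numerator df of each approximate test
df2  <- df.residual(modF)    # residual df of the fitted model
eta_p2 <- (st[, "F"] * df1) / (st[, "F"] * df1 + df2)
round(eta_p2, 4)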

Related

How to improve my Generalized Additive Model (GAM) performance and fix heteroscedasticity?

I have a large dataset which I am trying to model with a GAM. My model performance is otherwise good, but the errors are not normally distributed. Do you have any idea how this problem could be solved? I have added the histogram of my data (skewness is 0.4 and thus no transformation is needed), the model results, and the residual plots.
Code and model results for the model:
Family: gaussian
Link function: identity
Formula:
y ~ s(x, k = 20) + s(z, k = 23) + w
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 179.9106 0.2917 616.7 <2e-16 ***
w -26.8263 0.2595 -103.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x) 15.33 16.44 611.9 <2e-16 ***
s(z) 16.39 18.53 24365.8 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.839 Deviance explained = 83.9%
-REML = 1.3054e+06 Scale est. = 1287.3 n = 261113
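
One remedy sometimes suggested for heteroscedastic Gaussian GAMs (a sketch only, not the asker's code; the data frame name dat is an assumption) is to model the variance alongside the mean with mgcv's gaulss location-scale family:

library(mgcv)
# Sketch: the first formula models the mean, the second the log
# standard deviation (data frame `dat` is hypothetical here).
m_ls <- gam(list(y ~ s(x, k = 20) + s(z, k = 23) + w,
                 ~ s(x) + s(z)),
            family = gaulss(), method = "REML", data = dat)
summary(m_ls)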

How to use the Wald Chi-Sq test statistic in a GAM to determine which smooth term has the largest impact on the response?

I am tasked with determining, from the summary of a generalized additive model, which smooth term is most impactful to the response. Intuitively I understand that to be the smooth term with the largest Chi-Sq test statistic listed in the summary, so it would be s(tests) in this case below:
Family: poisson
Link function: log
Formula:
daily_confirmed_cases ~ s(tests, k = 18) + s(vaccines, k = 18) +
s(people_fully_vaccinated, k = 18) + s(hosp) + s(icu) + s(ndate,
k = 18)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.405421 0.001489 4973 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
s(tests) 16.793 16.98 14076 <2e-16 ***
s(vaccines) 16.982 17.00 9744 <2e-16 ***
s(people_fully_vaccinated) 16.923 17.00 7337 <2e-16 ***
s(hosp) 8.988 9.00 6893 <2e-16 ***
s(icu) 8.985 9.00 7246 <2e-16 ***
s(ndate) 16.674 16.96 11156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.965 Deviance explained = 97.7%
fREML = 19764 Scale est. = 1 n = 460
Is this approach correct, and if so, can you explain the rationale for why a high Chi-Sq statistic indicates a greater impact on the GAM?
Thanks
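
For reference, a minimal sketch of extracting and ranking the Wald statistics (assuming the fitted model object is called m, which the question does not show; note the raw Chi.sq grows with edf and n, so ranking by it alone is only a rough screen):

st <- summary(m)$s.table   # columns: edf, Ref.df, Chi.sq, p-value
st[order(st[, "Chi.sq"], decreasing = TRUE), ]  # largest statistic first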

R CLM logistic regression: significance changes depending on the reference levels of the independent variables

I am running a categorical (ordinal) logistic regression with clm.
Int is an intelligence ranking (1st place, 2nd, 3rd, and 4th).
My question: I noticed that the significances vary depending on how I set the reference levels of Sex and Pos (body posture), i.e., whether M or F comes first for Sex and whether Open or Closed comes first for Pos.
This is very strange to me, because I thought changing the level order just flips the sign of the coefficients. What did I do wrong? Is the strong Pos*Sex interaction the key to the solution?
Many thanks for any hints.
Here you can see the output for every combination:
> Pos = relevel(Pos,ref="Open")
> mopen<- clm(Int ~ Pos*Sex, data = x)
> summary(mopen)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.30e-12 6.7e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosClosed 1.128633 0.204955 5.507 3.66e-08 ***
SexF 0.008686 0.195416 0.044 0.964548
PosClosed:SexF -0.991075 0.281194 -3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.8356 0.1489 -5.614
2|3 0.2956 0.1451 2.037
3|4 1.4497 0.1557 9.310
>
> Sex = relevel(Sex,ref="F")
> Pos = relevel(Pos,ref="Open")
> fopen<- clm(Int ~ Pos*Sex, data = x)
> summary(fopen)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.27e-12 6.4e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosClosed 0.137559 0.193101 0.712 0.476238
SexM -0.008686 0.195416 -0.044 0.964548
PosClosed:SexM 0.991075 0.281194 3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.8443 0.1458 -5.791
2|3 0.2869 0.1406 2.041
3|4 1.4410 0.1519 9.489
>
> Sex = relevel(Sex,ref="M")
> Pos = relevel(Pos,ref="Closed")
> mclosed<- clm(Int ~ Pos*Sex, data = x)
> summary(mclosed)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.30e-12 7.2e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosOpen -1.1286 0.2050 -5.507 3.66e-08 ***
SexF -0.9824 0.2021 -4.861 1.17e-06 ***
PosOpen:SexF 0.9911 0.2812 3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -1.9642 0.1656 -11.859
2|3 -0.8331 0.1536 -5.422
3|4 0.3211 0.1506 2.132
>
> Sex = relevel(Sex,ref="F")
> Pos = relevel(Pos,ref="Closed")
> fclosed<- clm(Int ~ Pos*Sex, data = x)
> summary(fclosed)
formula: Int ~ Pos * Sex
data: x
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 668 -904.76 1821.51 4(0) 1.32e-12 6.5e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PosOpen -0.1376 0.1931 -0.712 0.476238
SexM 0.9824 0.2021 4.861 1.17e-06 ***
PosOpen:SexM -0.9911 0.2812 -3.525 0.000424 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Threshold coefficients:
Estimate Std. Error z value
1|2 -0.9819 0.1477 -6.649
2|3 0.1493 0.1413 1.057
3|4 1.3035 0.1512 8.623
My best answer is that it was a mistake to use dummy coding for F/M and Closed/Open.
I tried contrast coding and got better results.
I used the following code to create the contrast:
contrasts(Sex) <- contr.sum(2)
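A fuller sketch of that sum-coding approach (names follow the question; clm() comes from the ordinal package):

library(ordinal)
# Sum-to-zero contrasts: main effects are then averaged over the
# levels of the other factor instead of evaluated at a reference level,
# so they no longer change when a factor is releveled.
contrasts(x$Sex) <- contr.sum(2)
contrasts(x$Pos) <- contr.sum(2)
m_sum <- clm(Int ~ Pos * Sex, data = x)
summary(m_sum)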

Convert mgcv or gamm4 gam/bam output to dataframe

The broom package has a great tidy() function for the summary results of simple linear models such as those generated by lm(). However, tidy() does not work for mgcv::bam(), mgcv::gam(), or gamm4::gamm4(). The bam() call below produces the following:
library(mgcv)
library(broom)  # needed for tidy() and glance()
set.seed(3)
dat <- gamSim(1, n = 25000, dist = "normal", scale = 20)
bs <- "cr"; k <- 12
b <- bam(y ~ s(x0, bs = bs) + s(x1, bs = bs) + s(x2, bs = bs, k = k) +
           s(x3, bs = bs), data = dat)
summary(b)
tidy(b)
glance(b)
Output of above code:
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, bs = bs) + s(x1, bs = bs) + s(x2, bs = bs, k = k) +
s(x3, bs = bs)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.8918 0.1275 61.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 3.113 3.863 6.667 3.47e-05 ***
s(x1) 2.826 3.511 63.015 < 2e-16 ***
s(x2) 8.620 9.905 52.059 < 2e-16 ***
s(x3) 1.002 1.004 3.829 0.0503 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.0295 Deviance explained = 3.01%
fREML = 1.1057e+05 Scale est. = 406.15 n = 25000
> tidy(b)
data frame with 0 columns and 0 rows
> glance(b)
Error in `$<-.data.frame`(`*tmp*`, "logLik", value = -110549.163197452) :
replacement has 1 row, data has 0
How can I convert the summary to a dataframe so I can access outputs like the coefficients?
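
One workaround sketch: summary.gam() already returns the coefficient tables as matrices, so they can be converted directly (column names as returned by mgcv):

s <- summary(b)
param_df  <- as.data.frame(s$p.table)   # parametric terms
smooth_df <- as.data.frame(s$s.table)   # smooth terms
smooth_df$term <- rownames(smooth_df)   # keep the term labels as a column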

Interpret regression with Heteroskedasticity Corrected Standard Errors

In my data I have problems with heteroscedasticity, as indicated by the Breusch-Pagan test and the NCV test, which are both significant.
Therefore, I would like to follow the method posted by Gavin Simpson here:
Regression with Heteroskedasticity Corrected Standard Errors
This seems to work, but now I have trouble interpreting the results, as they look very different from my original multiple regression results.
mySummary(model_maineffect, vcovHC)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5462588 0.0198430 -27.5291 < 2.2e-16 ***
IV1 0.0762802 0.0082630 9.2315 < 2.2e-16 ***
Control1 -0.0062260 0.0071657 -0.8689 0.38493
Control2 0.0277049 0.0066251 4.1818 2.910e-05 ***
Control3 0.0199855 0.0104345 1.9153 0.05547 .
Control4 -0.4639035 0.0083046 -55.8608 < 2.2e-16 ***
Control5 0.6239948 0.0072652 85.8876 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Wald test
Model 1: DV ~ IV1 + Control1 + Control2 +
Control3 + Control4 + Control5
Model 2: DV ~ 1
Res.Df Df F Pr(>F)
1 14120
2 14128 -8 1304.6 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Can I interpret them in the same way as a multiple regression, i.e., IV1 has a highly significant effect on DV since Pr(>|t|) for IV1 is < 0.001? And does it mean that the model is significantly improved over the intercept-only model since Pr(>F) is < 0.001? How should I report my R-squared in this case?
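
For context, judging from the output headers ("t test of coefficients" and "Wald test"), the mySummary() wrapper in the linked answer appears to be built on coeftest() and waldtest() from lmtest with a sandwich covariance estimator; a minimal equivalent sketch:

library(lmtest)
library(sandwich)
# Robust t tests of the coefficients, and a robust overall F test
# of the model against the intercept-only model.
coeftest(model_maineffect, vcov. = vcovHC)
waldtest(model_maineffect, vcov = vcovHC, test = "F")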
