I've run into an issue where R-INLA isn't returning the fitted marginal values. I first hit it with my own dataset, and have been able to reproduce it by following an example from this book. I suspect there is some configuration I need to change, or maybe INLA isn't working well with something under the hood. Here is the code:
library("rgdal")  # readOGR()
library("spdep")  # poly2nb(), nb2mat()
library("INLA")   # inla()
boston.tr <- readOGR(system.file("shapes/boston_tracts.shp",
package="spData")[1])
#create adjacency matrices
boston.adj <- poly2nb(boston.tr)
W.boston <- nb2mat(boston.adj, style = "B")
W.boston.rs <- nb2mat(boston.adj, style = "W")
boston.tr$CMEDV2 <- boston.tr$CMEDV
boston.tr$CMEDV2[boston.tr$CMEDV2 == 50.0] <- NA
#define formula
boston.form <- log(CMEDV2) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2) +
AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT)
boston.tr$ID <- 1:length(boston.tr)
#run model
boston.iid <- inla(update(boston.form, . ~. + f(ID, model = "iid")),
data = as.data.frame(boston.tr),
control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
control.predictor = list(compute = TRUE)
)
When I look at the output of this model, it specifies that the fitted values were computed:
summary(boston.iid)
Call:
c("inla(formula = update(boston.form, . ~ . + f(ID, model = \"iid\")), ", " data = as.data.frame(boston.tr),
control.compute = list(dic = TRUE, ", " waic = TRUE, cpo = TRUE), control.predictor = list(compute = TRUE))"
)
Time used:
Pre = 0.981, Running = 0.481, Post = 0.0337, Total = 1.5
Fixed effects:
mean sd 0.025quant 0.5quant 0.975quant mode kld
(Intercept) 4.376 0.151 4.080 4.376 4.672 4.376 0
CRIM -0.011 0.001 -0.013 -0.011 -0.009 -0.011 0
ZN 0.000 0.000 -0.001 0.000 0.001 0.000 0
INDUS 0.001 0.002 -0.003 0.001 0.006 0.001 0
CHAS1 0.056 0.034 -0.010 0.056 0.123 0.056 0
I(NOX^2) -0.540 0.107 -0.751 -0.540 -0.329 -0.540 0
I(RM^2) 0.007 0.001 0.005 0.007 0.010 0.007 0
AGE 0.000 0.001 -0.001 0.000 0.001 0.000 0
log(DIS) -0.143 0.032 -0.206 -0.143 -0.080 -0.143 0
log(RAD) 0.082 0.018 0.047 0.082 0.118 0.082 0
TAX 0.000 0.000 -0.001 0.000 0.000 0.000 0
PTRATIO -0.031 0.005 -0.040 -0.031 -0.021 -0.031 0
B 0.000 0.000 0.000 0.000 0.001 0.000 0
log(LSTAT) -0.329 0.027 -0.382 -0.329 -0.277 -0.329 0
Random effects:
Name Model
ID IID model
Model hyperparameters:
mean sd 0.025quant 0.5quant 0.975quant mode
Precision for the Gaussian observations 169.24 46.04 99.07 160.46 299.72 141.30
Precision for ID 42.84 3.40 35.40 43.02 49.58 43.80
Deviance Information Criterion (DIC) ...............: -996.85
Deviance Information Criterion (DIC, saturated) ....: 1948.94
Effective number of parameters .....................: 202.49
Watanabe-Akaike information criterion (WAIC) ...: -759.57
Effective number of parameters .................: 337.73
Marginal log-Likelihood: 39.74
CPO and PIT are computed
Posterior marginals for the linear predictor and
the fitted values are computed
However, when I try to inspect those fitted marginal values, there is nothing there:
> boston.iid$marginals.fitted.values
NULL
Interestingly enough, I do get a summary of the posteriors, so they must be getting computed somehow?
> boston.iid$summary.fitted.values
mean sd 0.025quant 0.5quant 0.975quant mode
fitted.Predictor.001 2.834677 0.07604927 2.655321 2.844934 2.959994 2.858717
fitted.Predictor.002 3.020424 0.08220780 2.824525 3.034319 3.149766 3.052558
fitted.Predictor.003 3.053759 0.08883760 2.841738 3.071530 3.188051 3.094010
fitted.Predictor.004 3.032981 0.09846662 2.801099 3.056692 3.175215 3.084842
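Any ideas on what I'm mis-specifying in the call? I have set compute = TRUE in control.predictor, which is the setting I had seen causing issues on the R-INLA forums.
The developers intentionally disabled returning the marginals of the predictor by default, to make model fitting faster. The summaries are still computed, which is why summary.fitted.values is populated while marginals.fitted.values is NULL.
To enable the marginals, add both of these to the inla() arguments: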
control.predictor=list(compute=TRUE)
control.compute=list(return.marginals.predictor=TRUE)
So it looks something like this:
boston.form <- log(CMEDV2) ~ CRIM + ZN + INDUS + CHAS + I(NOX^2) + I(RM^2) +
AGE + log(DIS) + log(RAD) + TAX + PTRATIO + B + log(LSTAT)
boston.tr$ID <- 1:length(boston.tr)
#run model
boston.iid <- inla(update(boston.form, . ~. + f(ID, model = "iid")),
data = as.data.frame(boston.tr),
control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE, return.marginals.predictor=TRUE),
control.predictor = list(compute = TRUE)
)
boston.iid$summary.fitted.values
boston.iid$marginals.fitted.values
Related
Say I have a series of GAMs that I would like to average together using MuMIn. How do I go about interpreting the results of the averaged smoothers? Why are there numbers after each smoother term?
library(glmmTMB)
library(mgcv)
library(MuMIn)
data("Salamanders") # glmmTMB data
# mgcv gams
gam1 <- gam(count ~ spp + s(cover) + s(DOP), data = Salamanders, family = tw, method = "ML")
gam2 <- gam(count ~ mined + s(cover) + s(DOP), data = Salamanders, family = tw, method = "ML")
gam3 <- gam(count ~ s(Wtemp), data = Salamanders, family = tw, method = "ML")
gam4 <- gam(count ~ mined + s(DOY), data = Salamanders, family = tw, method = "ML")
# MuMIn model average
summary(model.avg(gam1, gam2, gam3, gam4))
And an excerpt from the results...
Model-averaged coefficients:
(full average)
Estimate Std. Error
(Intercept) -1.32278368618846586812765053764451295137405 0.16027398202204409805027296442858641967177
minedno 2.22006553885311141982583649223670363426208 0.19680444996609294805445244946895400062203
s(cover).1 0.00096638939252485735100645092288118576107 0.05129736767981037115493592182247084565461
s(cover).2 0.00360413985630353601863351542533564497717 0.18864911049300209233692271482141222804785
s(cover).3 0.00034381902619062468381624930735540601745 0.01890820689958183642431777116144075989723
s(cover).4 -0.00248365164684107844403349041328965540743 0.12950622739175629560826052966149291023612
s(cover).5 -0.00089826079366626997504963192398008686723 0.04660149540411069601919535898559843190014
s(cover).6 0.00242197856572917875894734862640689243563 0.12855093144749979439112053114513400942087
s(cover).7 -0.00032596616013735266745646179664674946252 0.02076865732570042782922925539423886220902
s(cover).8 0.00700001172809289889942263584998727310449 0.36609857217759655956257347497739829123020
s(cover).9 -0.17150069832114492318630993850092636421323 0.17672571419517621449379873865836998447776
s(DOP).1 0.00018839994220792031023870016781529557193 0.01119134546418791391342306695833030971698
s(DOP).2 -0.00081869157242861999301819508900734945200 0.04333670935815417402103832955617690458894
s(DOP).3 -0.00021538789478326670289408395486674407948 0.01164171952980479901595955993798270355910
s(DOP).4 0.00043433676942596419591827161532648915454 0.02463278659589070856972270462392771150917
This is a little easier to read if you don't print so many digits (see below):
Each smooth term is parameterized using multiple coefficients (9 by default), which is why we have multiple s(whatever).xxx coefficients.
It's not clear to me what you want to do with the model-averaged results. It's usually best to make model-averaged predictions rather than trying to interpret model-averaged coefficients, which has some pitfalls ... There is a predict() method for objects of class "averaging" (which is what model.avg() returns).
For further questions about interpretation you might want to ask on CrossValidated ...
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE z value Pr(>|z|)
(Intercept) -1.323e+00 1.603e-01 1.606e-01 8.239 <2e-16 ***
minedno 2.220e+00 1.968e-01 1.971e-01 11.263 <2e-16 ***
s(cover).1 9.664e-04 5.130e-02 5.130e-02 0.019 0.985
s(cover).2 3.604e-03 1.886e-01 1.887e-01 0.019 0.985
s(cover).3 3.438e-04 1.891e-02 1.891e-02 0.018 0.985
s(cover).4 -2.484e-03 1.295e-01 1.295e-01 0.019 0.985
s(cover).5 -8.983e-04 4.660e-02 4.660e-02 0.019 0.985
s(cover).6 2.422e-03 1.286e-01 1.286e-01 0.019 0.985
s(cover).7 -3.260e-04 2.077e-02 2.078e-02 0.016 0.987
s(cover).8 7.000e-03 3.661e-01 3.661e-01 0.019 0.985
s(cover).9 -1.715e-01 1.767e-01 1.768e-01 0.970 0.332
s(DOP).1 1.884e-04 1.119e-02 1.120e-02 0.017 0.987
s(DOP).2 -8.187e-04 4.334e-02 4.334e-02 0.019 0.985
s(DOP).3 -2.154e-04 1.164e-02 1.164e-02 0.018 0.985
s(DOP).4 4.343e-04 2.463e-02 2.464e-02 0.018 0.986
s(DOP).5 -1.737e-04 1.019e-02 1.020e-02 0.017 0.986
s(DOP).6 -3.224e-04 1.790e-02 1.790e-02 0.018 0.986
s(DOP).7 2.991e-07 5.739e-04 5.750e-04 0.001 1.000
s(DOP).8 -1.756e-03 9.557e-02 9.559e-02 0.018 0.985
s(DOP).9 1.930e-02 5.630e-02 5.639e-02 0.342 0.732
s(DOY).1 5.189e-08 3.378e-04 3.384e-04 0.000 1.000
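To illustrate why each smooth shows up as nine numbered coefficients, here is a minimal sketch on simulated data (the `toy` data frame and its variables are made up for illustration, not taken from the Salamanders example). It fits a single default `s()` term with mgcv and lists the coefficient names; the default basis dimension is 10, and one coefficient is absorbed by the identifiability constraint. It also shows that the smooth is easier to read through predictions on a grid than through the individual basis coefficients.

```r
library(mgcv)

# Simulated data, purely illustrative
set.seed(1)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.3)

# One smooth term with the default basis (k = 10)
m <- gam(y ~ s(x), data = toy, method = "ML")

# Intercept plus nine basis coefficients named s(x).1 ... s(x).9
coef_names <- names(coef(m))
print(coef_names)

# Fewer digits makes the coefficient vector readable
print(coef(m), digits = 3)

# The smooth is interpreted via predictions, not single coefficients
grid <- data.frame(x = seq(0, 1, length.out = 5))
grid$fit <- as.numeric(predict(m, newdata = grid))
print(grid, digits = 3)
```

The same logic applies to the averaged models: the numbered entries are basis-function weights, so predictions from the averaged object are generally the more interpretable summary.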
I have a dataset that contains weekly counts from 2019 to 2021. What I want to do is to compare the weekly count for a given week in 2020 to the count for the same week in 2019, and similarly compare the count in 2021 to that of the same week in 2019. The data looks like this:
set.seed(123)
df <- data.frame(count = sample(1:300, 156, replace = TRUE),
week = rep(seq(1, 52, by = 1), 3),
year = rep(2019:2021, each = 52))
In my real data, there is significant overdispersion, so I figured a negative binomial model may be best suited. I have run the following:
library(MASS)
nb <- glm.nb(count ~ factor(year)+factor(week), data = df)
summary(nb)
> summary(nb)
Call:
glm.nb(formula = count ~ factor(year) + factor(week), data = df,
init.theta = 2.193368056, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5012 -0.7180 -0.0207 0.4896 1.6763
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.180832 0.399664 12.963 < 2e-16 ***
factor(year)2020 0.077185 0.133587 0.578 0.563404
factor(year)2021 0.037590 0.133609 0.281 0.778448
factor(week)2 -0.642313 0.556042 -1.155 0.248028
factor(week)3 0.084091 0.554446 0.152 0.879451
factor(week)4 0.228528 0.554245 0.412 0.680103
factor(week)5 -0.275901 0.555095 -0.497 0.619165
factor(week)6 0.181420 0.554308 0.327 0.743448
factor(week)7 -0.331875 0.555218 -0.598 0.550015
factor(week)8 -0.018194 0.554608 -0.033 0.973829
factor(week)9 -0.093659 0.554737 -0.169 0.865926
factor(week)10 -0.260570 0.555062 -0.469 0.638753
factor(week)11 -0.165835 0.554871 -0.299 0.765038
factor(week)12 -0.003480 0.554583 -0.006 0.994993
factor(week)13 0.045328 0.554506 0.082 0.934850
factor(week)14 -0.420895 0.555429 -0.758 0.448581
factor(week)15 -0.288260 0.555121 -0.519 0.603570
factor(week)16 -1.719551 0.561984 -3.060 0.002215 **
factor(week)17 -0.339217 0.555235 -0.611 0.541237
factor(week)18 -0.770541 0.556464 -1.385 0.166141
factor(week)19 -0.088333 0.554728 -0.159 0.873483
factor(week)20 -0.595712 0.555901 -1.072 0.283893
factor(week)21 -2.010330 0.565001 -3.558 0.000374 ***
factor(week)22 -0.075819 0.554706 -0.137 0.891282
factor(week)23 0.298783 0.554157 0.539 0.589772
factor(week)24 0.114664 0.554401 0.207 0.836147
factor(week)25 0.089396 0.554439 0.161 0.871907
factor(week)26 -0.396060 0.555368 -0.713 0.475754
factor(week)27 -0.261789 0.555065 -0.472 0.637186
factor(week)28 -0.090157 0.554731 -0.163 0.870894
factor(week)29 0.210589 0.554269 0.380 0.703990
factor(week)30 -0.537967 0.555736 -0.968 0.333032
factor(week)31 -0.401567 0.555381 -0.723 0.469651
factor(week)32 0.108651 0.554410 0.196 0.844630
factor(week)33 -0.732234 0.556332 -1.316 0.188113
factor(week)34 -0.589688 0.555884 -1.061 0.288775
factor(week)35 -0.437695 0.555471 -0.788 0.430714
factor(week)36 -0.402218 0.555383 -0.724 0.468933
factor(week)37 -0.076802 0.554708 -0.138 0.889881
factor(week)38 -0.151350 0.554844 -0.273 0.785022
factor(week)39 0.272593 0.554189 0.492 0.622806
factor(week)40 -0.119806 0.554785 -0.216 0.829027
factor(week)41 -1.184984 0.558260 -2.123 0.033784 *
factor(week)42 -0.153762 0.554848 -0.277 0.781685
factor(week)43 -0.068443 0.554693 -0.123 0.901799
factor(week)44 -0.721053 0.556294 -1.296 0.194916
factor(week)45 0.102378 0.554419 0.185 0.853497
factor(week)46 -0.009142 0.554593 -0.016 0.986848
factor(week)47 -0.284169 0.555112 -0.512 0.608712
factor(week)48 -0.133066 0.554809 -0.240 0.810454
factor(week)49 -0.705118 0.556242 -1.268 0.204924
factor(week)50 -0.080921 0.554715 -0.146 0.884017
factor(week)51 0.152016 0.554348 0.274 0.783912
factor(week)52 -0.503605 0.555642 -0.906 0.364752
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(2.1934) family taken to be 1)
Null deviance: 221.83 on 155 degrees of freedom
Residual deviance: 169.71 on 102 degrees of freedom
AIC: 1923.7
Number of Fisher Scoring iterations: 1
Theta: 2.193
Std. Err.: 0.243
2 x log-likelihood: -1813.701
The reference category for factor(year) is 2019, which is what I want it to be. However, I am struggling with the interpretation of the coefficients (and also IRR) for week with week 1 being the reference category.
Is there a better way to do this, in order to achieve the week/year comparison? My main goal is to plot weekly IRR for 2020 and 2021, relative to 2019.
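As a sketch of how the coefficients above can be read (not a definitive answer to the modelling question), exponentiating a coefficient from the log-link model gives an incidence rate ratio relative to the reference level. The code below refits the model from the question and extracts the two year IRRs with Wald intervals from `confint.default()`; the object names `est`, `ci`, `year_rows`, and `irr` are introduced here for illustration. Note that in this additive model each year IRR is a single number that applies to every week.

```r
library(MASS)

# Same simulated data as in the question
set.seed(123)
df <- data.frame(count = sample(1:300, 156, replace = TRUE),
                 week = rep(seq(1, 52, by = 1), 3),
                 year = rep(2019:2021, each = 52))

nb <- glm.nb(count ~ factor(year) + factor(week), data = df)

# Coefficients are on the log scale; exp() turns them into rate ratios
est <- coef(nb)
ci  <- confint.default(nb)  # Wald intervals on the log scale

# Keep only the year terms: 2020 vs 2019 and 2021 vs 2019
year_rows <- grep("^factor\\(year\\)", names(est))
irr <- cbind(IRR = exp(est[year_rows]),
             exp(ci[year_rows, , drop = FALSE]))
print(round(irr, 3))
```

Week-specific IRRs would need a year-by-week interaction, but with one count per week and year that model has as many parameters as observations (156), so each weekly IRR would reduce to the raw ratio of two counts with no information left to estimate its uncertainty.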
I have seen similar posts, like this one, which say that the message Coefficients: (1 not defined because of singularities) appears because of nearly perfect correlation among the predictors used in the lm() call.
But in my case, there is no nearly perfect correlation among predictors, but still one coefficient (X_wthn_outcome) returns NA in the output of lm().
I wonder what is wrong with the coefficient that returns NA?
For reproducibility purposes, the exact data and code are provided below.
library(dplyr)
set.seed(132)
data <- expand.grid(study = 1:1e3, outcome = rep(1:50,2))  # 100,000 rows; not printed
data$X <- rnorm(nrow(data))
e <- rnorm(nrow(data), 0, 2)
data$yi <- .8 +.6*data$X + e
dat <- data %>%
group_by(study) %>%
mutate(X_btw_study = mean(X), X_wthn_study = X-X_btw_study) %>%
group_by(outcome, .add = TRUE) %>%
mutate(X_btw_outcome = mean(X), X_wthn_outcome = X-X_btw_outcome) %>% ungroup()
round(cor(select(dat,-study,-outcome,-X,-yi)),3)
# X_btw_study X_wthn_study X_btw_outcome X_wthn_outcome
#X_btw_study 1.000 0.000 0.141 0.00
#X_wthn_study 0.000 1.000 0.698 0.71
#X_btw_outcome 0.141 0.698 1.000 0.00
#X_wthn_outcome 0.000 0.710 0.000 1.00
summary(lm(yi ~ 0 + X_btw_study + X_btw_outcome + X_wthn_study
+ X_wthn_outcome, data = dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#X_btw_study 0.524093 0.069610 7.529 5.15e-14 ***
#X_btw_outcome 0.014557 0.013694 1.063 0.288
#X_wthn_study 0.589517 0.009649 61.096 < 2e-16 ***
#X_wthn_outcome NA NA NA NA ## What's wrong with this variable
You have constructed a problem where the three-way combination of X_btw_study + X_btw_outcome + X_wthn_study perfectly predicts X_wthn_outcome:
lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat)
#------------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study,
data = dat)
Coefficients:
(Intercept) X_btw_study X_btw_outcome X_wthn_study
1.165e-17 1.000e+00 -1.000e+00 1.000e+00
#--------------
summary( lm(X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study , data = dat) )
#---------------
Call:
lm(formula = X_wthn_outcome ~ X_btw_study + X_btw_outcome + X_wthn_study,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-3.901e-14 -6.000e-17 0.000e+00 5.000e-17 3.195e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.165e-17 3.242e-18 3.594e+00 0.000326 ***
X_btw_study 1.000e+00 3.312e-17 3.020e+16 < 2e-16 ***
X_btw_outcome -1.000e+00 6.515e-18 -1.535e+17 < 2e-16 ***
X_wthn_study 1.000e+00 4.590e-18 2.178e+17 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025e-15 on 99996 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.582e+34 on 3 and 99996 DF, p-value: < 2.2e-16
You have an adjusted R^2 of 1 with three predictors. So this is MULTI-collinearity, not two-way collinearity, which is why the pairwise correlation matrix looks harmless. (R has caught on to your tricks and won't let you get away with this dplyr game of "hide the dependencies".) I think the dependencies might have been more apparent if you had built the variables in a sequence of separate steps rather than in a piped chain.
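The dependency can also be written out directly. Since X_btw_study + X_wthn_study equals X, and X_wthn_outcome is X minus X_btw_outcome, the fourth predictor is X_btw_study + X_wthn_study - X_btw_outcome, matching the coefficients 1, -1, 1 in the regression above. The sketch below checks this on a smaller version of the simulated data (20 studies and 5 outcomes, chosen only to keep it quick), building the variables step by step with base R's `ave()` instead of the dplyr chain.

```r
# Smaller version of the simulated design, base R only
set.seed(132)
d <- expand.grid(study = 1:20, outcome = rep(1:5, 2))
d$X <- rnorm(nrow(d))

# Build the four predictors in separate, explicit steps
d$X_btw_study    <- ave(d$X, d$study)             # mean of X per study
d$X_wthn_study   <- d$X - d$X_btw_study
d$X_btw_outcome  <- ave(d$X, d$study, d$outcome)  # mean per study-by-outcome cell
d$X_wthn_outcome <- d$X - d$X_btw_outcome

# The fourth predictor is an exact linear combination of the other three
recombined <- d$X_btw_study + d$X_wthn_study - d$X_btw_outcome
print(all.equal(recombined, d$X_wthn_outcome))

# Four columns in the design matrix, but only three independent directions
M <- as.matrix(d[, c("X_btw_study", "X_btw_outcome",
                     "X_wthn_study", "X_wthn_outcome")])
design_rank <- qr(M)$rank
print(c(columns = ncol(M), rank = design_rank))
```

Because the rank is one less than the number of columns, lm() has to drop one coefficient, and it drops the last one in the formula.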
I am trying to determine the best value of spar to implement across a dataset by reducing the root mean square error between test and training replicates on the same raw data. My test and training replicates look like this:
Traindataset
t        dist
-0.008   NA
-0.006   0
-0.004   0
-0.002   0
0        0
0.002    0.000165038
0.004    0.000686934
0.006    0.001168098
0.008    0.001928885
0.01     0.003147262
0.012    0.004054971
0.014    0.005605361
0.016    0.007192645
0.018    0.009504648
0.02     0.011498809
0.022    0.013013655
0.024    0.01342625
Testdataset
t        dist
-0.008   NA
-0.006   0
-0.004   0
-0.002   0
0        0
0.002    0
0.004    0.000481184
0.006    0.001306409
0.008    0.002590156
0.01     0.003328259
0.012    0.004429246
0.014    0.005012768
0.016    0.005829698
0.018    0.006567801
0.02     0.008030102
0.022    0.009617453
0.024    0.011202827
I need the spline to be 5th order so I can accurately predict the 3rd derivative, so I am using smooth.Pspline (from the pspline package) instead of the more common smooth.spline. I attempted using a variant of the solution outlined here (using root mean squared error of predicting testdataset from traindataset instead of cross validation sum of squares within one dataset). My code looks like this:
library(pspline)  # provides smooth.Pspline()
RMSE <- function(m, o){
sqrt(mean((m - o)^2))
}
Psplinermse <- function(spar){
trainmod <- smooth.Pspline(traindataset$t, traindataset$dist, norder = 5,
spar = spar)
testpreddist <- predict(trainmod,testdataset$t)[,1]
RMSE(testpreddist, testdataset$dist)
}
spars <- seq(0, 1, by = 0.001)
rmsevals <- rep(0, length(spars))
for (i in seq_along(spars)){
rmsevals[i] <- Psplinermse(spars[i])
}
plot(spars, rmsevals, type = 'l', xlab = 'spar', ylab = 'RMSE')
The issue I am having is that, with pspline, the RMSE values are identical for every spar above 0, so the plot of spar vs RMSE is flat. When I dug into just the prediction line of the code, I realized I am getting exactly the same predicted values of dist for any spar above 0. Any ideas on why this might be are greatly appreciated.
I'm wondering if it's possible to access (in some form) the information that is presented in the -forest- command in the -metafor- package.
I am checking / verifying results, and I'd like to have the output of values produced. Thus far, the calculations all check, but I'd like to have them available for printing, saving, etc. instead of having to type them out by hand.
Sample code is below :
es <- read.table(header=TRUE, text = "
b se_b
0.083 0.011
0.114 0.011
0.081 0.013
0.527 0.017
" )
library(metafor)
es.est <- rma(yi=b, sei=se_b, dat=es, method="DL")
studies <- as.vector( c("Larry (2011)" , "Curly (2011)", "Moe (2015)" , "Shemp (2010)" ) )
forest(es.est , transf=exp , slab = studies , refline = 1 , xlim=c(0,3), at = c(1, 1.5, 2, 2.5, 3, 3.5, 4) , showweights=TRUE)
I'd like to access the values (effect size and c.i. for each study, as well as the overall estimate, and c.i.) that are presented on the right of the graphic.
Thanks so much,
-Jon
How about:
> summary(escalc(measure="GEN", yi=b, sei=se_b, data=es), transf=exp)
b se_b yi vi sei zi ci.lb ci.ub
1 0.083 0.011 1.0865 0.0001 0.0110 7.5455 1.0634 1.1102
2 0.114 0.011 1.1208 0.0001 0.0110 10.3636 1.0968 1.1452
3 0.081 0.013 1.0844 0.0002 0.0130 6.2308 1.0571 1.1124
4 0.527 0.017 1.6938 0.0003 0.0170 31.0000 1.6383 1.7512
Then yi, ci.lb, and ci.ub provide the same per-study information that is shown on the right of the forest plot.
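For the pooled row and the weights, one option is sketched below, assuming the `es` data and the `rma()` call from the question. `predict()` on the fitted model with the same `transf` gives the overall estimate and its confidence interval on the transformed scale, and `weights()` gives the percentages that `showweights=TRUE` prints. The object names `overall` and `w` are introduced here for illustration.

```r
library(metafor)

# Data and model as in the question
es <- read.table(header = TRUE, text = "
b se_b
0.083 0.011
0.114 0.011
0.081 0.013
0.527 0.017
")
es.est <- rma(yi = b, sei = se_b, data = es, method = "DL")

# Pooled estimate and confidence interval, back-transformed with exp()
overall <- predict(es.est, transf = exp)
print(overall)

# Study weights in percent, as displayed by showweights=TRUE
w <- weights(es.est)
print(round(w, 2))

# Individual pieces can be pulled out for saving or further checks
pooled <- c(estimate = overall$pred, ci.lb = overall$ci.lb, ci.ub = overall$ci.ub)
print(round(pooled, 4))
```

The `pooled` vector and `w` can then be written out with `write.csv()` alongside the per-study table instead of being typed by hand.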