How to use sample weights in GAM (mgcv) on survey data for logit regression?

I'm interested in performing a GAM regression on data from a nationwide survey that includes sample weights. I read this post with interest.
I selected my variables of interest, generating a data frame:
library(dplyr)

nhanesAnalysis <- nhanesDemo %>%
  select(fpl,
         age,
         gender,
         persWeight,
         psu,
         strata)
Then, from what I understood, I generated a weighted survey design with the following code:
library(survey)

nhanesDesign <- svydesign(id      = ~psu,
                          strata  = ~strata,
                          weights = ~persWeight,
                          nest    = TRUE,
                          data    = nhanesAnalysis)
Let's say that I want to select only subjects with age ≥ 30:
ageDesign <- subset(nhanesDesign, age >= 30)
Now, I would like to fit a GAM model (fpl ~ s(age) + gender) with the mgcv package. Is it possible to do so with the weights argument, or by using the svydesign object ageDesign?
EDIT
I was wondering whether it is correct to extract the computed weights from an svyglm object and use them for the weights argument in GAM.

This is more difficult than it looks. There are two issues:
You want to get the right amount of smoothing.
You want valid standard errors.
Just giving the sampling weights to mgcv::gam() won't do either of these: gam() treats the weights as frequency weights and so will think it has a lot more data than it actually has. You will get undersmoothing and underestimated standard errors because of the weights, and you will also likely get underestimated standard errors because of the cluster sampling.
The simple work-around is to use regression splines (splines package) instead. These aren't quite as good as the penalised splines used by mgcv, but the difference usually isn't a big deal, and they work straightforwardly with svyglm. You do need to choose how many degrees of freedom to assign.
library(splines)
svyglm(fpl ~ ns(age, 4) + gender, design = nhanesDesign)
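Since the question's title mentions a logit model, here is a minimal sketch of how that might look, assuming fpl has been recoded as a 0/1 indicator and using the ageDesign subset from the question; the family choice and the values in newdata are my assumptions, not part of the answer above:
library(splines)
library(survey)

# quasibinomial() keeps svyglm from warning about non-integer weighted counts;
# this assumes fpl is coded 0/1
fit <- svyglm(fpl ~ ns(age, 4) + gender,
              design = ageDesign, family = quasibinomial())
summary(fit)

# predicted probabilities over age for one (assumed) gender level
newdat <- data.frame(age = 30:80, gender = "Female")
pred <- predict(fit, newdata = newdat, type = "response")
plot(newdat$age, coef(pred), type = "l",
     xlab = "age", ylab = "predicted probability of fpl = 1")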

Related

How to plot multi-level meta-analysis by study (in contrast to the subgroup)?

I am doing a multi-level meta-analysis. Many studies have several subgroups. When I make a forest plot, the studies are presented by subgroup, of which there are 60. However, I would like to plot by study, which would give 25 studies and be more appropriate. Does anyone have an idea how to do this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | Author/Study,
                     test = "t",
                     method = "REML")
forest(full.model)
It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
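For example, a sketch of adding that lowest level (the esid column is hypothetical, just a unique identifier per row/estimate):
# one id per estimate so the lowest random effect has as many levels as there are estimates (k)
df$esid <- seq_len(nrow(df))

full.model <- rma.mv(yi = yi, V = vi, slab = Author, data = df,
                     random = ~ 1 | Author/Study/esid,
                     test = "t", method = "REML")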
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the same type as yours with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence (note that this makes use of aggregate.escalc(), which only works with escalc objects, so if your dataset is not one, you need to convert it first - see help(aggregate.escalc) for details):
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
You will find that if you then fit an equal-effects model to these estimates based on the aggregated data that the results are identical to what you obtained from the multilevel model (we use an equal-effects model since the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level and I will mess a bit with the data so that all three variance components are non-zero (again, just for illustration purposes):
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and using again vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
Now the model coefficients of this model are the aggregated values and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains in these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that; the plot is just a visualization of the aggregated estimates, and the CIs shown are correct.
You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5)  # example

full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | study_group,
                     test = "t",
                     method = "REML")
forest(full.model)

Is it possible to establish splitting criteria in partykit::mob() with a model and then fit a different model to terminal nodes?

Sometimes when working with this package, I only want to assess heterogeneity in one parameter or another. However, I don't think I can do that and then fit a more complete model to the terminal nodes in one step. Is there a way to do that? Here's what I think the code should look like, but it does not work:
library(partykit)

full_mod <- function(y, x, weights = NULL, start = NULL, offset = NULL, ...) {
  lm(y ~ x + 1, ...)
}

tree_1 <- mob(
  # assess heterogeneity in slope, ignoring intercepts
  Sepal.Length ~ 0 + Sepal.Width | Species,
  data = iris,
  # fit each terminal node WITH intercepts
  fit = full_mod
)
This achieves what I want to do, but I'm looking for a single-step way.
library(dplyr)

tree2 <- lmtree(
  Sepal.Length ~ 0 + Sepal.Width | Species,
  data = iris
)

iris <- iris %>%
  mutate(prediction = predict(tree2, type = 'node'))

lms <- iris %>%
  nest_by(prediction) %>%
  rowwise() %>%
  summarize(linear_model = list(lm(Sepal.Length ~ Sepal.Width, data = data)))
I see that this is not the best method here with continuous variables, but with dichotomous predictors I think this could be very powerful, and I would like to write some code to do this and assess this variant of the model (as long as there is not another way to do it).
ADDED ON 1st EDIT: Perhaps an alternative way to fit this type of model would be to optimize fit based on homogeneity in a chosen regression parameter (rather than the entire model-based deviance, log-likelihood, etc.). I'm happy with either solution, but (personally) I had more trouble trying to go the latter route.
Thank you!
Christopher Loan
In mob_control() you can specify the parm argument. This means that only a certain subset of the parameters, say parm = 2 (the second parameter) or parm = "x" (the coefficient of x), gets tested for parameter instability.
However, the catch is that once a variable is selected for splitting, then the best split point is searched by optimizing the overall objective function (e.g., error sum of squares or log-likelihood etc.) of the model. Thus, this will be sensitive to all changes in all parameters of the model.
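As a minimal sketch of the parm argument (my own example, using lmtree(), which passes control arguments on to mob_control()):
library(partykit)

# test for parameter instability only in the Sepal.Width slope, not the intercept
tree_parm <- lmtree(Sepal.Length ~ Sepal.Width | Species,
                    data = iris,
                    parm = "Sepal.Width")  # or parm = 2
plot(tree_parm)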
A better alternative for fixing some parameters globally and only splitting with respect to others is to iterate between:
Estimating the (generalized) linear model given the subgroups from the tree.
Estimating the tree (and its subgroups) while keeping the global parameters of the model fixed.
This is what the PALM tree algorithm does for partially additive (generalized) linear models. It is implemented in the palmtree package in R. For the methodological background see: Heidi Seibold, Torsten Hothorn, Achim Zeileis (2019). "Generalised Linear Model Trees with Global Additive Effects." Advances in Data Analysis and Classification, 13(3), 703-725. doi:10.1007/s11634-018-0342-1
A replication of the empirical illustration in the paper is provided in: https://www.zeileis.org/news/palmtree/
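Just as a very rough, hand-rolled sketch of this iteration on the iris example (palmtree automates it and handles convergence properly; the choice of Petal.Length as the global regressor and Sepal.Width as the subgroup-specific one is purely mine, for illustration):
library(partykit)

dat <- iris
dat$global_part <- 0  # contribution of the global effect, starts at zero

for (i in 1:10) {
  # (2) tree with the current global contribution held fixed; for a linear
  # model, subtracting it from the response is equivalent to an offset
  dat$y_adj <- dat$Sepal.Length - dat$global_part
  tr <- lmtree(y_adj ~ Sepal.Width | Species, data = dat)
  dat$node <- factor(predict(tr, newdata = dat, type = "node"))

  # (1) global linear model given the subgroups from the tree
  gm <- lm(Sepal.Length ~ 0 + node + node:Sepal.Width + Petal.Length, data = dat)
  dat$global_part <- coef(gm)["Petal.Length"] * dat$Petal.Length
}

coef(gm)["Petal.Length"]  # global effect
tr                        # tree with subgroup-specific intercepts and slopes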

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical variables, I don't know how to do the analyses. WeMix::mix() has a weights parameter, but I'm not sure if it treats them as sampling weights. Moreover, this function doesn't support the multinomial family.
So, to sum up: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)

data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)

data_wide <- dcast(data_long, id ~ time,
                   value.var = c('outcome1', 'outcome2', 'outcome3',
                                 'group', 'weight'))[
                     , `:=`(weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It happens that glmer returns a lot of problems (convergence, high eigenvalues...), so I gave another look at @ThomasLumley's answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So, my question is now whether I can use participant id as clusters in svydesign:
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
                     Estimate Std. Error t value Pr(>|t|)  
(Intercept)         1.875e+01  1.000e+00  18.746   0.0339 *
groupWoman         -1.903e+01  1.536e+00 -12.394   0.0513 .
timePre             5.443e-09  5.443e-09   1.000   0.5000  
groupWoman:timePre  2.877e-01  1.143e+00   0.252   0.8431  
and still interpret groupWoman:timePre as the difference in the average rate of change/improvement in the outcome over time between sex groups, as if I were using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs), I would recommend just using svyglm or svy_vglm.
Another option if you have non-survey software for random effects versions of the models is to use that. If you scale the weights to sum to the sample size and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
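As a minimal sketch of that rescaling idea, using the continuous outcome from the example data above (the same rescaling would apply before passing weights to glmer() or WeMix for the categorical outcomes; with this tiny toy dataset the fit may well be singular, it only shows the mechanics):
library(lme4)

# scale the sampling weights so they sum to the sample size
data_long$w_scaled <- data_long$weight * nrow(data_long) / sum(data_long$weight)

# random intercept for id captures the repeated measures
m <- lmer(outcome3 ~ group * time + (1 | id),
          data = data_long, weights = w_scaled)
summary(m)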

plm() versus lm() with multiple fixed effects

I am attempting to run a model with county, year, and state:year fixed effects. The lm() approach looks like this:
lm <- lm(data = mydata, formula = y ~ x + county + year + state:year)
where county, year, and state:year are all factors.
Because I have a large number of counties, running the model is very slow using lm(). More frustratingly, given the number of models I need to produce, lm() also returns a much larger object than plm(). This plm() command yields the same coefficients and levels of significance for my main variables:
library(plm)
plm <- plm(data = mydata, formula = y ~ x + year + state:year, index = "county", model = "within")
However, these produce substantially different R-squared, Adj. R-squared, etc. I thought I could solve the R-squared problem by calculating the R-squared for plm by hand:
SST <- sum((mydata$y - mean(mydata$y))^2)
fit <- (mydata$y - plm$residuals)
SSR <- sum((fit - mean(mydata$y))^2)
R2 <- SSR / SST
I tested the R-squared code with lm and got the same result reported by summary(lm). However, when I calculated R-squared for plm I got a different R-squared (and it was greater than 1).
At this point I checked what the coefficients for my fixed effects in plm were and they were different than the coefficients in lm.
Can someone please 1) help me understand why I'm getting these differing results and 2) suggest the most efficient way to construct the models I need and obtain correct R-squareds? Thanks!

How can I extract coefficients from this model in caret?

I'm using the caret package with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses mdl$bestTune number of variables? If this can't be done, are there functions in other packages you would recommend that allow for LOOCV of a stepwise linear regression and actually allow me to find the final model?
Below is reproducible code. From mdl$bestTune I can tell that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel), but I'm not sure how I would do this in a general case and not just for this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns for hundreds of mutual funds. Each fund's returns will be a dependent variable that I'd like to regress against a set of returns on a handful of factors (e.g. 5). For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any fund.
You can use:
coef(mdl$finalModel,unlist(mdl$bestTune))
