How to plot multi-level meta-analysis by study (in contrast to the subgroup)? - r

I am doing a multi-level meta-analysis. Many studies have several subgroups. When I make a forest plot studies are presented as subgroups. There are 60 of them, however, I would like to plot studies according to the study, then it would be 25 studies and it would be more appropriate. Does anyone have an idea how to do this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
V = vi,
slab = Author,
data = df,
random = ~ 1 | Author/Study,
test = "t",
method = "REML")
forest(full.model)

It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the type as you have with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence (note that this makes use of aggregate.escalc() which only works with escalc objects, so if it is not, you need to convert the dataset to one - see help(aggregate.escalc) for details):
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
You will find that if you then fit an equal-effects model to these estimates based on the aggregated data that the results are identical to what you obtained from the multilevel model (we use an equal-effects model since the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level and I will mess a bit with the data so that all three variance components are non-zero (again, just for illustration purposes):
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and using again vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
Now the model coefficients of this model are the aggregated values and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains in these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that and the plot is just a visualization of the aggregated estimates and the CIs shown are correct.

You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5) # example
full.model <- rma.mv(yi = yi,
V = vi,
slab = Author,
data = df,
random = ~ 1 | study_group,
test = "t",
method = "REML")
forest(full.model)

Related

How to start R regression at specific value

I need to find a regression in R which has the form of
lm(Binary_value ~ Age, data=dataframe)
But my age variable starts at 15 yrs old so I'm not interested in ages that are less than 14. How can I specify that I only want my regression to be accurate at the age point of 15 and not worry about smaller values? I tried it this way:
lm(Binary_value ~ Age, data=dataframe)
But I get nonsense results for higher ages.
First things first, remember that R is case-sensitive, so the function would look like lm, not LM. I edited your question to fix that. Second, a regression only includes data that is available for prediction. It will not magically make up 14 data points if they are not already present, so there is no issue there. However, the regression line will not map to just => 15 years old because it uses the model coefficients to draw an intercept. An example below with fake data:
#### Create Fake Data ####
set.seed(123)
x <- 15:100 # use these numbers for age
age <- sample(x, # using x
size=1000, # sample 1000 times
replace=T) # sample with replacement
outcome <- age * .60 + rnorm(n=1000,sd=15) # make fake outcome variable
df <- data.frame(age,outcome)
#### Fit Data ####
fit <- lm(outcome ~ age, data = df)
summary(fit)
plot(age,outcome)
abline(fit,
col = "red")
You will see that the regression line, despite only including 15, will still draw to the left where there is no data. This is because the intercept is a conditional value based on the coefficients.
P.S. I used a normal Gaussian regression for this example because you used the lm function in your question, but included a binary response. For a logistic regression, the rationale would be the same, but it would use glm instead.

How to make predictions using an LDA (Linear discriminant analysis) model in R

as the title suggests I am trying to make predictions using an LDA model in R. I have two sets of data that I'm working with: the first set is a series of entries associated with 16 predictor variables and 1 outcome variable (the outcome variable are "groups" that each entry belongs to that I've assigned myself), the second set of data also consists of entries associated with the same 16 predictor variables, but with no outcome variable. What I would like to do is predict the group membership of the entries in the second set of data.
So far I've successfully managed to create an LDA model by separating the first dataset into a "training set" and a "test set". However, now that I have the model I don't know how I would go about predicting the group membership of the entries in my second data set.
Thanks for the help! Please let me know if any more information is required, this is my first post on stack overflow so I am still learning the ropes.
Short example based on An introduction to Statistical learning, chapter 4. Say you have fitted a model lda_model on a training_data set, with dependent variable Group which you aim to predict, and predictors Predictor1 and Predictor2
library(MASS)
lda_model <- lda(Group∼ Predictor1 + Predictor2, data = training_set)
You can then make predictions with the lda_model using the predict function on the testing_set
lda_predictions <- predict(lda_model, testing_set)
lda_predictions then holds the posterior probabilities in $posterior that the observation is part of Group j.
You could then apply a threshold of for instance (but not limiting to) 50% probability. E.g.
sum(lda_model$posterior[, 7] >= .5)
returns the number of observations for which the probabilty that the observation is part of Group 7 is larger than 50%

Application of a multi-way cluster-robust function in R

Hello (first timer here),
I would like to estimate a "two-way" cluster-robust variance-covariance matrix in R. I am using a particular canned routine from the "multiwayvcov" library. My question relates solely to the set-up of the cluster.vcov function in R. I have panel data of various crime outcomes. My cross-sectional unit is the "precinct" (over 40 precincts) and I observe crime in those precincts over several "months" (i.e., 24 months). I am evaluating an intervention that 'turns on' (dummy coded) for only a few months throughout the year.
I include "precinct" and "month" fixed effects (i.e., a full set of precinct and month dummies enter the model). I have only one independent variable I am assessing. I want to cluster on "both" dimensions but I am unsure how to set it up.
Do I estimate all the fixed effects with lm first? Or, do I simply run a model regressing crime on the independent variable (excluding fixed effects), then use cluster.vcov i.e., ~ precinct + month_year.
This seems like it would provide the wrong standard error though. Right? I hope this was clear. Sorry for any confusion. See my set up below.
library(multiwayvcov)
model <- lm(crime ~ as.factor(precinct) + as.factor(month_year) + policy, data = DATASET_full)
boot_both <- cluster.vcov(model, ~ precinct + month_year)
coeftest(model, boot_both)
### What the documentation offers as an example
### https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf
library(lmtest)
data(petersen)
m1 <- lm(y ~ x, data = petersen)
### Double cluster by firm and year using a formula
vcov_both_formula <- cluster.vcov(m1, ~ firmid + year)
coeftest(m1, vcov_both_formula)
Is is appropriate to first estimate a model that ignores the fixed effects?
First the answer: you should first estimate your lm -model using fixed effects. This will give you your asymptotically correct parameter estimates. The std errors are incorrect because they are calculated from a vcov matrix which assumes iid errors.
To replace the iid covariance matrix with a cluster robust vcov matrix, you can use cluster.vcov, i.e. my_new_vcov_matrix <- cluster.vcov(~ precinct + month_year).
Then a recommendation: I warmly recommend the function felm from lfe for both multi-way fe's and cluster-robust standard erros.
The syntax is as follows:
library(multiwayvcov)
library(lfe)
data(petersen)
my_fe_model <- felm(y~x | firmid + year | 0 | firmid + year, data=petersen )
summary(my_fe_model)

How can I extract coefficients from this model in caret?

I'm using the caret package with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses mdl$bestTune number of variables? If this can't be done are there functions in other packages you would recommend that allow for loocv of a stepwise linear regression and actually allow me to find the final model?
Below is reproducible code. From it, I can tell from mdl$bestTune that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel) but I'm not sure how I would do this in a general case and not just this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns hundreds of mutual fund. Each fund's returns will be a dependent variable that I'd like to regress against a set of returns on a handful (e.g. 5) factors. For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any fund.
you can use:
coef(mdl$finalModel,unlist(mdl$bestTune))

R: Regressions with group fixed effects and clustered standard errors with imputed dataset

I am trying to run regressions in R (multiple models - poisson, binomial and continuous) that include fixed effects of groups (e.g. schools) to adjust for general group-level differences (essentially demeaning by group) and that cluster standard errors to account for the nesting of participants in the groups. I am also running these over imputed data frames (created with mice). It seems that different disciplines use the phrase ‘fixed effects’ differently so I am having a hard time searching to troubleshoot.
I have fit random intercept models (with lme4) but they do not account for the school fixed effects (and the random effects are not of interest to my research questions). Putting the groups in as dummies slows the run down tremendously. I could also run a single level glm/lm with group dummies but I have not been able to find a strategy to cluster the standard errors with the imputed data (tried the clusterSE package). I could hand calculate the demeaning but there seems like there should be a more direct way to achieve this.

I have also looked at the lfe package but that does not seem to have glm options and the demeanlist function does not seem to be compatible with the imputed data frames.
In Stata, the command would be xtreg, fe vce (Cluster Variable), (fe = fixed effects, vce = clustered standard errors, with mi added to run over imputed dataframes). I could switch to Stata for the modeling but would definitely prefer to stay with R if possible!
Please let me know if this is better posted in cross-validated - I was on the fence but went with this one since it seemed to be more a coding question.
Thank you!
I would block bootstrap. The "block" handles the clustering and "bootstrap" handles the generated regressors.
There is probably a more elegant way to make this extensible to other estimators, but this should get you started.
# junk data
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
dat1 <- data.frame(y, x, id=seq_along(y))
summary(lm(y ~ x, data=dat1))
# same point estimates, but lower SEs
dat2 <- dat1[rep(seq_along(y), each=10), ]
summary(lm(y ~ x, data=dat2))
# block boostrap helper function
require(boot)
myStatistic <- function(ids, i) {
myData <- do.call(rbind, lapply(i, function(i) dat2[dat2$id==ids[i], ]))
myLm <- lm(y ~ x, data=myData)
myLm$coefficients
}
# same point estimates from helper function if original data
myStatistic(unique(dat2$id), 1:100)
# block bootstrap recovers correct SEs
boot(unique(dat2$id), myStatistic, 500)

Resources