Boot package in R: simple assistance

If I want to use the boot() function from R's boot package to calculate the significance of the Pearson correlation coefficient between two vectors, should I do it like this:
boot(re1, cor, R = 1000)
where re1 is a two-column matrix holding the two observation vectors? I can't seem to get this right: cor() of these vectors is 0.8, but the call above returns -0.2 as t0.

Just to emphasize the general idea of bootstrapping in R, even though @caracal already answered your question in a comment: when using boot, you need a data structure (usually a matrix) that can be sampled by row. The computation of your statistic is done in a function that receives this data matrix together with a vector of resampled row indices and returns the statistic of interest computed on that resample. You then call boot(), which takes care of applying this function to R replicates and collecting the results in a structured format. Those results can in turn be assessed with boot.ci().
Here are two working examples using the low birth weight study (birthwt) in the MASS package.
require(boot)
require(MASS)
data(birthwt)
# compute CIs for the correlation between mother's weight and birth weight
cor.boot <- function(data, k) cor(data[k, ])[1, 2]
cor.res <- boot(data = with(birthwt, cbind(lwt, bwt)),
                statistic = cor.boot, R = 500)
cor.res
boot.ci(cor.res, type = "bca")
# compute a CI for a particular regression coefficient, e.g. bwt ~ smoke + ht
fm <- bwt ~ smoke + ht
reg.boot <- function(formula, data, k) coef(lm(formula, data = data[k, ]))
reg.res <- boot(data = birthwt, statistic = reg.boot,
                R = 500, formula = fm)
boot.ci(reg.res, type = "bca", index = 2) # smoke
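Applied to the original question, here is a minimal sketch (assuming re1 really is the two-column matrix holding your two observation vectors); the key point is that the statistic function must use the resampling indices, which is why your original call returned a nonsensical t0:
cor.boot <- function(data, k) cor(data[k, ])[1, 2]
re1.res <- boot(data = re1, statistic = cor.boot, R = 1000)
re1.res$t0                      # now matches cor(re1)[1, 2] (0.8 in your case)
boot.ci(re1.res, type = "bca")  # significance: does the 95% BCa interval exclude 0?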

Related

How to plot multi-level meta-analysis by study (in contrast to the subgroup)?

I am doing a multi-level meta-analysis. Many studies have several subgroups. When I make a forest plot, the estimates are presented by subgroup, and there are 60 of those. I would instead like to plot by study, which would give 25 rows and be more appropriate. Does anyone have an idea how to do this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | Author/Study,
                     test = "t",
                     method = "REML")
forest(full.model)
It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
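As a hedged sketch of what that might look like with your variable names (esid is a hypothetical estimate-level identifier that I am adding purely for illustration):
df$esid <- 1:nrow(df)
full.model <- rma.mv(yi = yi, V = vi, slab = Author, data = df,
                     random = ~ 1 | Author/Study/esid,
                     test = "t", method = "REML")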
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the type as you have with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence. (Note that this makes use of aggregate.escalc(), which only works with escalc objects, so if your dataset is not one, you need to convert it first - see help(aggregate.escalc) for details.)
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
You will find that if you then fit an equal-effects model to the aggregated estimates, the results are identical to what you obtained from the multilevel model (we use an equal-effects model because the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level, and I will mess a bit with the data so that all three variance components are non-zero:
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now, instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and again using vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
The coefficients of this model are now the aggregated values, and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains among these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that; the plot is just a visualization of the aggregated estimates, and the CIs shown are correct.
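Just to make that point concrete, an equal-effects model fitted with only the diagonal (for illustration only, not something to report) gives results that differ from the full multilevel model:
rma(coef(agg), diag(vcov(agg)), method="EE")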
You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5) # example
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | study_group,
                     test = "t",
                     method = "REML")
forest(full.model)

Multinomial/Ordinal hypothesis tests in SAS/R

I'm trying to recreate some SAS output in R. I'm doing ordinal/multinomial regression using the polr and multinom functions from the MASS and nnet packages respectively.
The output I want to recreate in R from SAS is the test of the global null via the LRT, Score, and Wald tests, as well as the Type 3 analysis of effects, i.e. basically the test of the interaction (all interaction terms tested together) and of the main effects. I tried to use the wald.test function from the aod package, but it kept giving me errors about L and V not being conformable arrays, even though I made sure L was a matrix of the same size as the matrix of coefficients passed to the b = argument.
Lastly, is there a quick way to test the proportional odds assumption in R?
Any help/guidance is appreciated. Thanks!
Some example data:
library(nnet)    # multinom()
library(forcats) # gss_cat data
educ <- runif(21483, min = 0, max = 20)
df <- cbind(gss_cat[, c("marital", "race")], educ)
model <- multinom(marital ~ race*educ, data = df)
Basically what I'm trying to reproduce from SAS are the following command lines:
proc logistic data=in desc;
  class race / param=ref;
  model marital = educ race educ*race / link=glogit;
  output out=predicted predprobs=individual;
run;
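There is no complete R answer here, but as a hedged sketch, the joint (Type 3 style) test of the interaction can be obtained from a likelihood ratio test between nested multinom fits; reduced is a name I introduce for the model without the interaction, and model is the full fit from above. For the proportional odds assumption on a polr fit, the brant package is one commonly used option.
# LR test of all race:educ interaction terms jointly
reduced <- multinom(marital ~ race + educ, data = df, trace = FALSE)
anova(reduced, model)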

How can I extract coefficients from this model in caret?

I'm using the caret package together with the leaps package to get the number of variables to use in a linear regression. How do I extract the model with the lowest RMSE that uses the mdl$bestTune number of variables? If this can't be done, are there functions in other packages you would recommend that allow for LOOCV of a stepwise linear regression and actually let me find the final model?
Below is reproducible code. From mdl$bestTune I can tell that the number of variables should be 4 (even though I would have hoped for 3). It seems like I should be able to extract the variables from the third row of summary(mdl$finalModel), but I'm not sure how I would do this in a general case and not just in this example.
library(caret)
set.seed(101)
x <- matrix(rnorm(36*5), nrow=36)
colnames(x) <- paste0("V", 1:5)
y <- 0.2*x[,1] + 0.3*x[,3] + 0.5*x[,4] + rnorm(36) * .0001
train.control <- trainControl(method="LOOCV")
mdl <- train(x=x, y=y, method="leapSeq", trControl = train.control, trace=FALSE)
coef(mdl$finalModel, as.double(mdl$bestTune))
mdl$bestTune
summary(mdl$finalModel)
mdl$results
Here's the context behind my question in case it's of interest. I have historical monthly returns for hundreds of mutual funds. Each fund's returns will be a dependent variable that I'd like to regress against the returns on a handful (e.g. 5) of factors. For each fund I want to run a stepwise regression. I expect only 1 to 3 of the five factors to be significant for any given fund.
You can use:
coef(mdl$finalModel, unlist(mdl$bestTune))
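If you also want a fitted model object rather than just the coefficients, a hedged follow-up (final_fit is a name I made up) is to refit lm() on the selected predictors:
sel <- coef(mdl$finalModel, unlist(mdl$bestTune))
vars <- setdiff(names(sel), "(Intercept)")
final_fit <- lm(y ~ ., data = data.frame(y = y, x[, vars, drop = FALSE]))
summary(final_fit)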

Get Regression Coefficient Names with R Bootstrap

I'm using the boot package in R to calculate bootstrapped SEs and confidence intervals. I'm trying to find an elegant and efficient way of getting the names of my parameters along with the bootstrap distribution of their estimates. For instance, consider the simple example given here:
# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs = function(data, indices, formula) {
  d = data[indices, ] # allows boot to select a sample
  fit = lm(formula, data = d)
  return(coef(fit))
}
# bootstrapping with 1000 replications
results = boot(data = mtcars,
               statistic = bs,
               R = 1000,
               formula = mpg ~ wt + disp)
This works fine, except that the results just appear as numerical indices:
# view results
results
Bootstrap Statistics :
original bias std. error
t1* 34.96055404 0.1559289371 2.487617954
t2* -3.35082533 -0.0948558121 1.152123237
t3* -0.01772474 0.0002927116 0.008353625
Particularly when getting into long, complicated regression formulas, involving a variety of factor variables, it can take some work to keep track of precisely which indices go with which coefficient estimates.
I could of course just re-fit my model once outside of the bootstrap function and extract the names with names(coef(fit)), or use something like a call to model.matrix(). These approaches seem cumbersome, both in terms of extra coding and in terms of extra CPU and RAM use.
How can I more easily get a nice vector of the coefficient names to pair with a vector of coefficient standard errors in situations like this?
UPDATE
Based on the great answer from lmo, here is the code I use to get a simple regression table:
Names = names(results$t0)
SEs = sapply(data.frame(results$t), sd)
Coefs = as.numeric(results$t0)
zVals = Coefs / SEs
Pvals = 2*pnorm(-abs(zVals))
Formatted_Results = cbind(Names, Coefs, SEs, zVals, Pvals)
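Note that cbind() with the character vector Names coerces all of the columns to character; using a data.frame instead keeps the numeric columns numeric:
Formatted_Results = data.frame(Names, Coefs, SEs, zVals, Pvals)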
The estimates from calling the bootstrapped function (here, lm) on the original data are stored in an element of the list called "t0".
results$t0
(Intercept) wt disp
34.96055404 -3.35082533 -0.01772474
This object preserves the names of the estimates from the original function call, which you can then access with names().
names(results$t0)
[1] "(Intercept)" "wt" "disp"

Residuals and plots in ordered multinomial regression

I need to make a binned residual plot of fitted versus residual values from an ordered multinomial logit regression.
How can I extract residuals when using polr? Is there any other function that fits an ordered multinomial logit from which residuals can be extracted?
This is the code I used:
library(MASS) # polr()
library(arm)  # binnedplot()
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data, method = 'logistic')
fit <- mod1$fitted.values
res <- residuals(mod1)
binnedplot(fit, res)
The problem is that the object 'res' is NULL.
Thanks
For a start, can you tell us how residuals would be defined in principle for a model with categorical responses? fitted.values is a matrix of probabilities. You could define residuals in terms of correct prediction (taking the most likely outcome as the prediction, as the default predict method for polr objects does), or you could compute a cross-tabulation of true and predicted values. Alternatively, you could map the ordinal data back to an integer scale and compute a mean outcome as the prediction ... but I can't see that there is any unique way to define the residuals in the first place.
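A minimal sketch of the cross-tabulation idea (assuming mod1 and the data frame data from the question):
# cross-tabulate the observed categories against the most likely predicted category
table(observed = data$y, predicted = predict(mod1))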
polr() itself provides no function that returns residuals; you would have to calculate them manually from whichever definition you choose.
There are actually plenty of ways to get residuals from an ordinal probit/logit. Although polr does not provide any residuals, vglm provides several; see ?residualsvglm in the VGAM package (see also below).
NOTE: However, for a Control Function/2SRI approach, Wooldridge (2014) suggests using the generalised residuals as described in Vella (1993). As far as I know, these are currently not available in R (although I am working on that), but they are available in Stata (using predict gr, score).
Residuals in VGLM
Surrogate residuals for polr
You can use the sure package to calculate surrogate residuals with resids(). The package is based on a paper in the Journal of the American Statistical Association.
library(sure) # for the resids() function and the sample data sets df1, df2, df3
library(MASS) # for the polr() function
# combine the sample data sets into one data frame with outcome y and predictors x1-x3
df1 <- df1    # local copy of the lazy-loaded data set
df1$x1 <- df1$x
df1$x <- NULL
df1$y <- df2$y
df1$x2 <- df2$x
df1$x3 <- df3$x
options(contrasts = c("contr.treatment", "contr.poly"))
mod1 <- polr(as.ordered(y) ~ x1 + x2 + x3, data = df1, method = 'probit')
fit <- mod1$fitted.values
res <- resids(mod1)
EDIT: One big issue is the following (from ?resids):
"Note: Surrogate residuals require sampling from a continuous distribution; consequently, the result will be different with every call to resids. The internal functions used for sampling from truncated distributions when method = "latent" are based on modified versions of rtrunc and qtrunc."
Even when running resids(mod1, nsim=1000, method="latent"), the outcome does not converge to a stable result.
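A hedged workaround of my own (not something from the sure documentation) is to average the surrogate residuals over several draws to dampen the Monte Carlo noise:
# average surrogate residuals over repeated draws before plotting or summarizing
set.seed(101)
res_mat <- replicate(25, resids(mod1))
res_avg <- rowMeans(res_mat)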
