Metafor updated degrees of freedom - r

Edit: changed code to include test = "t"
I'm hoping to better understand how the updated development version of metafor (2.5-101) will help me adjust the degrees of freedom in a multi-level model to provide some protection against Type I error.
My understanding of this comes from the Nakagawa preprint "Methods for testing publication bias in ecological and evolutionary meta-analyses" https://ecoevorxiv.org/k7pmz/ and their "Supplemental_Impleentation_Example.Rmd" file, following along with their lines 133-142:
Before moving on to some useful corrections, users should be aware that the most up-to-date version of metafor (version 2.5-101) does now provide users with some protection against Type I errors. Instead of using the number of effect sizes in the calculation of the degrees of freedom, we can make use of the total number of papers instead. We show in our simulations that a "papers-1" degrees of freedom can be fairly good. This can be implemented as follows after installing the development version of metafor (see "R Packages Required" above):
mod_multilevel_pdf = rma.mv(yi = yi, V = vi, mods = ~1,
                            random = list(~1 | study_id, ~1 | obs),
                            data = data, test = "t", dfs = "contain")
summary(mod_multilevel_pdf)
We can see here that the df for the model has changed from 149 to 29, and the p-value has been adjusted accordingly.
So my understanding is that the model now shows df = 29, i.e. the original number of papers (30) minus 1, instead of the number of papers times the number of effects minus 1 (30 papers with 5 effects each = 150, minus 1 = 149).
Adapting this to my code, where I have n = 18 papers and a total of n = 24 effects, I would expect the above code to adjust my df to 17 (the number of papers, 18, minus 1); however, I still get df = 23 (the total number of effects, 24, minus 1).
The output using the df code:
mod_multilevel_pdf = rma.mv(yi = yi, V = vi, mods = ~1,
                            random = list(~1 | study_id, ~1 | es_id),
                            data = dat, test = "t", dfs = "contain")
summary(mod_multilevel_pdf)
is:
Multivariate Meta-Analysis Model (k = 24; method: REML)

  logLik  Deviance       AIC       BIC      AICc
-30.2270   60.4540   66.4540   69.8604   67.7171

Variance Components:

            estim    sqrt  nlvls  fixed    factor
sigma^2.1  0.6783  0.8236     18     no  study_id
sigma^2.2  0.1416  0.3763     24     no     es_id

Test for Heterogeneity:
Q(df = 23) = 167.2145, p-val < .0001

Model Results:

estimate      se     tval  df    pval    ci.lb   ci.ub
 -0.3508  0.2219  -1.5809  17  0.1323  -0.8190  0.1174

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Quite stumped on this one! Any help would be majorly appreciated.

You have neither df = 17 nor df = 23, since you did not specify that you want a t-test. With test = "t", dfs = "contain", you will get the expected t-test with df = 17.
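For completeness, a minimal sketch contrasting the two calls (assuming the same dat with yi, vi, study_id, and es_id columns as in the question):

library(metafor)

# Default: Wald-type z-tests, no denominator degrees of freedom are involved
mod_z <- rma.mv(yi = yi, V = vi, random = list(~1 | study_id, ~1 | es_id), data = dat)

# t-tests with the "contain" method: the intercept is tested on
# (number of papers) - 1 = 18 - 1 = 17 degrees of freedom
mod_t <- rma.mv(yi = yi, V = vi, random = list(~1 | study_id, ~1 | es_id),
                data = dat, test = "t", dfs = "contain")
summary(mod_t)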

Related

One-way anova using the Survey package in R

I am trying to identify the best way to run a one-way ANOVA on a complex survey design. After perusing Lumley's survey package documentation, I am none the wiser.
The survey::anova function is meant to 'Fit and compare hierarchical loglinear models for complex survey data', which is not what I am doing.
What I am trying to do
I have collected data about one categorical independent variable [3 levels] and one quantitative dependent variable. I want to use ANOVA to check if the dependent variable changes according to the level of the independent variable.
Here is an example of my process:
Load Survey package and create complex survey design object
library(survey)
df <- data.frame(sex     = c('F', 'O', NA, 'M', 'M', 'O', 'F', 'F'),
                 married = c(1, 1, 1, 1, 0, 0, 1, 1),
                 pens    = c(0, 1, 1, NA, 1, 1, 0, 0),
                 weight  = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))

svy_design <- svydesign(ids = ~1, data = df, weights = ~weight)
Borrowing from this post over here,
Method 1: using survey::aov
summary(aov(weight~sex,data = svy_design))
However I got an error saying:
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'object' in selecting a method for function 'summary': object 'api00' not found
Method 2: use survey::svyglm instead of ANOVA
That same post has an answer/explanation with a case against using anova:
According to the main statistician of our institute there is no easy implementation of this kind of analysis in any common modeling environment. The reason for that is that ANOVA and ANCOVA are linear models that were not further developed after the emergence of general linear models (later generalized linear models, GLMs) in the 1970s.
A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA... simply use a GLM instead.
summary(svyglm(weight~sex,svy_design))
I got this output:
Call:
svyglm(formula = weight ~ sex, design = svy_design)

Survey design:
svydesign(ids = ~1, data = df, weights = ~weight)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.8730     0.1478   5.905  0.00412 **
sexM         -0.3756     0.1855  -2.024  0.11292   
sexO         -0.4174     0.1788  -2.334  0.07989 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.04270091)

Number of Fisher Scoring iterations: 2
My Questions:
Why does method 1 throw an error?
Is it possible to use the survey::aov function to accomplish my goal?
If I were to use svyglm [method 2], which value should I be looking at to identify a difference in means? Would it be the p-value of the intercept?
I am a far cry from a stats buff, please do explain in the simplest possible terms. Thank you!!
There is no such function as survey::aov, so you can't use it to accomplish your goal. Your code uses stats::aov.
You can use survey::svyglm. I will use one of the examples from the package, so I can actually run the code.
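(Setup assumed by the example: the api data ship with the survey package, and the design mirrors the one shown in the output below.)

library(survey)
data(api)   # loads apiclus2, among other api data sets
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)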
> model <- svyglm(api00 ~ stype, design = dclus2)
> summary(model)

Call:
svyglm(formula = api00 ~ stype, design = dclus2)

Survey design:
dclus2 <- svydesign(id = ~dnum + snum, weights = ~pw, data = apiclus2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   692.81      30.28  22.878  < 2e-16 ***
stypeH        -94.47      27.66  -3.415  0.00156 ** 
stypeM        -50.46      23.01  -2.193  0.03466 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 17528.44)

Number of Fisher Scoring iterations: 2
There are three school types, E, M, and H. The two coefficients here estimate differences between the mean of E and the means of the other two groups and the $p$-values test the hypotheses that H and E have the same mean and that M and E have the same mean.
If you want an overall test for the difference in means among the three groups you can use the regTermTest function, which tests a term or set of terms in the model, e.g.,
> regTermTest(model,~stype)
Wald test for stype
in svyglm(formula = api00 ~ stype, design = dclus2)
F = 12.5997 on 2 and 37 df: p= 6.7095e-05
That F test is analogous to the one stats::aov gives. It's not identical, because this is survey data.
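To see the analogy concretely, here is a rough design-ignoring counterpart (a sketch only; it drops the weights and clustering entirely):

# Ordinary one-way ANOVA on the same data, ignoring the survey design
summary(aov(api00 ~ stype, data = apiclus2))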

user-defined "negative exponential" link glm

I tried to follow this example (modify glm... user specified link function in r) but am getting errors. I have binary data and would like to change the link function from "logit" to a negative exponential link, so that the predicted probability of success is p = 1 - exp(linear predictor).
The reason I need this link instead of one of the built-in links is that p increases in a convex manner between 0 and 0.5, but the "logit", "cloglog", "probit", and "cauchy" only allow a concave shape. See attached photo for reference: predicted p vs binned observations
Simulate data
location <- as.character(LETTERS[rep(seq(from = 1, to = 23), 30)])
success <- rbinom(n = 690, size = 1, prob = 0.15)
df <- data.frame(location, success)
df$random_var <- rnorm(690, 5, 3)
df$seedling_size <- abs((0.1 + df$success)^(1 / df$random_var))
df <- df[order(df$location), ]   # note the comma: order the rows, keep all columns
Create custom link function. Note: eta = linear predictor, mu = probability
negex <- function() {
  ## link
  linkfun <- function(mu) log(-mu + 1)
  linkinv <- function(eta) 1 - exp(eta)
  ## derivative of inverse link with respect to eta
  mu.eta <- function(eta) -exp(eta)
  valideta <- function(eta) TRUE
  link <- "log(-mu+1)"
  structure(list(linkfun = linkfun, linkinv = linkinv,
                 mu.eta = mu.eta, valideta = valideta,
                 name = link),
            class = "link-glm")
}
Model success as a function of seedling size
negexp<-negex()
model1<-glm(success~seedling_size,family=binomial(link=negexp),data=df)
Error: no valid set of coefficients has been found: please supply starting values
Model using glmer (My ultimate goal)
model2<-glmer(success~seedling_size+ (1|location),family=binomial(link=negexp),data=df)
Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, :
(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
I get different error messages, but I think the problem is the same regardless of whether using glmer or glm, and that is that my link function is wrong somehow.
I found the answer. Most helpful was this R thread from 2016. There were two issues. First, my link function was wrong. I revised it to this:
negex <- function()
{
linkfun <- function(mu) -log(1-mu)
linkinv <- function(eta) 1-exp(-eta)
mu.eta <- function(eta) exp(-eta)
valideta <- function(eta) all(is.finite(eta)&eta>0)
link <- paste0("negexp")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta, name = link),
class = "link-glm")
}
Second, the model required specific starting values. These will be unique to your data. Here are the first few lines of the data I actually used to find the solution:
site  plot  sub_plot  oak_success  oak_o1_gt05ft..1
0001    10         3            1                 0
0001    12         2            0                 0
0001    12         3            0                 0
0001    12         4            0                 0
0001    13         4            0                 0
I don't know how to post the full data to this site, but if someone wants it to run the example, shoot me an email: lake.graboski#gmail.com
Hopefully this helps someone in the future, because I found no other examples of this being solved on Stack Overflow or elsewhere online. Using the new starting values, I was able to get the model to run:
negexp <- negex()
starting_values <- c(1, 0)   # 1 for the intercept and 0 for the slope
h_gt05_solo_negex2 <- glm(oak_success ~ oak_o1_gt05ft..1,
                          family = binomial(link = negexp), start = starting_values,
                          data = rocdf)
summary(h_gt05_solo_negex2)
Call:
glm(formula = oak_success ~ oak_o1_gt05ft..1, family = binomial(link = negexp),
    data = lt40, start = starting_values)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3808  -0.4174  -0.2637  -0.2637   2.5985

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      0.034774   0.005484   6.341 2.28e-10 ***
oak_o1_gt05ft..1 0.023253   0.002187  10.635  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1416.9  on 2078  degrees of freedom
Residual deviance: 1213.5  on 2077  degrees of freedom
AIC: 1217.5

Number of Fisher Scoring iterations: 6
There were some issues with convergence. As seedling heights (oak_o1_gt05ft..1) got above 40 ft, the parameter estimates became unreliable due to convergence problems. I had very few observations in this range, so I restricted the data to observations where the predictor was < 40 ft and re-ran the model. I also included "site" (the same as "location" in the simulated data). What you see in the figure are the predicted probabilities of oak success with respect to oak seedling height for each site/location (black circles), the binned observed proportions of successes (large green dots), and the predicted success probability without a site factor (blue line). It looks like the slope of the seedling-size variable is more accurate when site is factored in.
Unfortunately, I was not able to get this model to run in glmer, so site had to be included as a fixed effect, thus, the standard errors and slope estimates for oak seedling height might be a bit conservative.
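For readers without the original data, here is a rough sketch applying the corrected link to the simulated df from earlier in the question; whether it converges depends on the particular simulated draw, so the starting values may need adjusting:

negexp <- negex()                  # the corrected link defined above
start_vals <- c(1, 0)              # intercept near 1, slope 0, as suggested above
m_sim <- try(glm(success ~ seedling_size,
                 family = binomial(link = negexp),
                 start = start_vals, data = df))
if (!inherits(m_sim, "try-error")) summary(m_sim)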

R: Clustering standard errors in MASS::polr()

I am trying to estimate an ordinal logistic regression with clustered standard errors using the MASS package's polr() function. There is no built-in clustering feature, so I am looking for (a) packages or (b) manual methods for calculating clustered standard errors from the model output. I plan to use the margins package to estimate marginal effects from the model.
Here is an example:
library(MASS)
set.seed(1)
obs <- 500
# Create data frame
dat <- data.frame(y = as.factor(round(rnorm(n = obs, mean = 5, sd = 1), 0)),
                  x = sample(x = 1:obs, size = obs, replace = T),
                  clust = rep(c(1, 2), 250))
# Estimate and summarize model
m1 <- MASS::polr(y ~x, data = dat, Hess = TRUE)
summary(m1)
While many questions on Stack Overflow ask about how to cluster standard errors in R for ordinary least squares models (and in some cases for logistic regression), it's unclear how to cluster errors in ordered logistic regression (i.e. proportional odds logistic regression). Additionally, the existing SO questions focus on packages that have other severe drawbacks (e.g. the classes of model outputs are not compatible with other standard packages for analysis and presentation of results) rather than using MASS::polr() which is compatible with predict().
This essentially follows an answer offered by Achim Zeileis on R-help in 2016.
library(lmtest)
library(sandwich)

coeftest(m1, vcov = vcovCL(m1, cluster = factor(dat$clust)))
t test of coefficients:

    Estimate  Std. Error t value  Pr(>|t|)    
x 0.00093547  0.00023777  3.9343 9.543e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
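If you also need the clustered variance matrix itself (for example, to pass to other functions that accept a user-supplied vcov), a small sketch along the same lines; coefci() from lmtest is assumed here to accept the matrix via its vcov. argument:

V_cl <- vcovCL(m1, cluster = factor(dat$clust))   # clustered vcov from sandwich
se_cl <- sqrt(diag(V_cl))                         # clustered standard errors
coefci(m1, vcov. = V_cl)                          # confidence intervals using the clustered vcov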

How to obtain Poisson's distribution "lambda" from R glm() coefficients

My R script produces the glm() coefficients below.
What is the Poisson lambda, then? It should be ~3.0, since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-22.726  -12.726   -8.624    6.405   18.515

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  8.222532   0.015100  544.53   <2e-16 ***
h_mids      -0.363560   0.004393  -82.75   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11451.0  on 10  degrees of freedom
Residual deviance:  1975.5  on  9  degrees of freedom
AIC: 2059

Number of Fisher Scoring iterations: 5
Here is the code that produced the output above:
random_pois <- rpois(10000, 3)
h <- hist(random_pois, breaks = 10)
mean(random_pois)   # verifying that the mean is close to 3
h_mids <- h$mids
h_counts <- h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family = poisson(link = log))
summary_ideal <- summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
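Equivalently, because the default log link is used, exponentiating the intercept recovers lambda:

exp(coef(fit))
# (Intercept)
#       3.005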
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that computes the sum over bins of (# counts) x log(prob(count | lambda)), negated, and minimize it using optim(). The solution given below using the bbmle package is a little more complex up front, but gives you added benefits like easily computing confidence intervals.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts = unname(c(tt)),
                 val = as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure)
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x, val, lambda, log = FALSE) {
  probs <- dpois(val, lambda, log = TRUE)
  r <- sum(x * probs)
  if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts ~ dpoisbin(val, exp(loglambda)),
           data = dd,
           start = list(loglambda = 0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
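For comparison, a minimal optim()-based sketch of the same idea, reusing the dd data frame built above:

nll <- function(loglambda) -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
opt <- optim(par = 0, fn = nll, method = "BFGS")
exp(opt$par)   # should again be very close to mean(random_pois)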

anova.rq() in quantreg package in R

I'm interested in comparing estimates from different quantiles (same outcome, same covariates) using the anova.rqlist function, which is called by anova() in the quantreg package. However, the math in the function is beyond my rudimentary expertise. Let's say I fit three models at different quantiles:
library(quantreg)
data(Mammals) # data in quantreg to be used as a useful example
fit1 <- rq(weight ~ speed + hoppers + specials, tau = .25, data = Mammals)
fit2 <- rq(weight ~ speed + hoppers + specials, tau = .5, data = Mammals)
fit3 <- rq(weight ~ speed + hoppers + specials, tau = .75, data = Mammals)
Then I compare them using:
anova(fit1, fit2, fit3, test="Wald", joint=FALSE)
My question is: which of these models is being used as the basis of the comparison?
My understanding of the Wald test (from the Wikipedia entry) is that the test statistic has the form W = (θ̂ − θ0)² / Var(θ̂), where θ̂ is the estimate of the parameter(s) of interest θ that is compared with the proposed value θ0.
So my question is what is the anova function in quantreg choosing as the θ0?
Based on the p-value returned from the anova, my best guess is that it is choosing the lowest quantile specified (i.e. tau = 0.25). Is there a way to specify the median (tau = 0.5), or better yet the mean estimate obtained using lm(y ~ x1 + x2 + x3, data)?
anova(fit1, fit2, fit3, joint=FALSE)
actually produces
Quantile Regression Analysis of Deviance Table

Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.25 0.5 0.75 }

             Df Resid Df F value  Pr(>F)  
speed         2      319  1.0379 0.35539  
hoppersTRUE   2      319  4.4161 0.01283 *
specialsTRUE  2      319  1.7290 0.17911  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
while
anova(fit3, fit1, fit2, joint=FALSE)
produces the exact same result
Quantile Regression Analysis of Deviance Table

Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.5 0.25 0.75 }

             Df Resid Df F value  Pr(>F)  
speed         2      319  1.0379 0.35539  
hoppersTRUE   2      319  4.4161 0.01283 *
specialsTRUE  2      319  1.7290 0.17911  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The order of the models is clearly being changed in the anova, but how is it that the F value and Pr(>F) are identical in both tests?
All the quantiles you input are used and there is not one model used as a reference.
I suggest you read this post and the related answer to understand what your "theta.0" is.
I believe what you are trying to do is test whether the regression lines are parallel, in other words, whether the effects of the predictor variables (speed, hoppers, and specials here) are uniform across quantiles.
You can use anova() from the quantreg package to answer this question, and you should indeed supply one fit per quantile, as you did.
When you use joint=FALSE, you get coefficient-wise comparisons: one line per predictor. Your results tell you that the effect of hoppers is not uniform across quantiles (Pr(>F) = 0.01283), while there is no strong evidence that the slopes for speed or specials differ across quantiles.
You can do an overall test of equality of the entire sets of coefficients if you do not use joint=FALSE; that gives the "Joint Test of Equality of Slopes" and therefore a single p-value, as shown in the sketch below.
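A minimal sketch of that joint test, using the fits from the question:

# Single joint F test of equality of all slopes across tau = 0.25, 0.5, 0.75
anova(fit1, fit2, fit3)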
EDIT:
I think theta.0 is the average slope across all 'tau' values, or the actual estimate from 'lm()', rather than the slope of any one specific model. My reasoning is that 'anova.rq()' does not give any special status to the lowest 'tau' value, or even the median 'tau'.
There are several ways to test this. Either do the calculations by hand with theta.0 set to the average value, or compare many combinations, because you could then have a situation where some of your models are close to the model with the lowest 'tau' but not to the 'lm()' estimate. If theta.0 were the slope of the first model with the lowest 'tau', your Pr(>F) would be high, whereas in the other case it would be low.
This question should maybe have been asked on cross-validated.
