I am trying to estimate an ordinal logistic regression with clustered standard errors using the MASS package's polr() function. There is no built-in clustering feature, so I am looking for (a) packages or (b) manual methods for calculating clustered standard errors from the model output. I plan to use the margins package to estimate marginal effects from the model.
Here is an example:
library(MASS)
set.seed(1)
obs <- 500
# Create data frame
dat <- data.frame(y = as.factor(round(rnorm(n = obs, mean = 5, sd = 1), 0)),
                  x = sample(x = 1:obs, size = obs, replace = TRUE),
                  clust = rep(c(1, 2), 250))
# Estimate and summarize model
m1 <- MASS::polr(y ~ x, data = dat, Hess = TRUE)
summary(m1)
While many questions on Stack Overflow ask about how to cluster standard errors in R for ordinary least squares models (and in some cases for logistic regression), it's unclear how to cluster errors in ordered logistic regression (i.e. proportional odds logistic regression). Additionally, the existing SO questions focus on packages that have other severe drawbacks (e.g. the classes of model outputs are not compatible with other standard packages for analysis and presentation of results) rather than using MASS::polr() which is compatible with predict().
This is essentially following an answer offered by Achim Zeileis on R-help in 2016.
library(lmtest)
library(sandwich)
coeftest(m1, vcov = vcovCL(m1, cluster = factor(dat$clust)))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.00093547 0.00023777 3.9343 9.543e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
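For the manual route asked about in (b), here is a sketch of what a cluster-robust sandwich estimator computes. The notation is generic and is not taken from the answer above: $\theta$ stacks the slope and the thresholds, $\ell_i$ is the log-likelihood contribution of observation $i$, and $g$ indexes the $G$ clusters. Any finite-sample corrections that vcovCL() may apply are left out.

```latex
\widehat{V}_{\mathrm{CL}}
  = \hat{A}^{-1}
    \Bigl( \sum_{g=1}^{G} s_g s_g^{\top} \Bigr)
    \hat{A}^{-1},
\qquad
s_g = \sum_{i \in g} \frac{\partial \ell_i(\hat{\theta})}{\partial \theta},
\qquad
\hat{A} = -\sum_{i=1}^{n}
  \frac{\partial^{2} \ell_i(\hat{\theta})}{\partial \theta \, \partial \theta^{\top}}
```

The usual model-based covariance is $\hat{A}^{-1}$ alone. Clustering replaces the middle term with outer products of scores summed within each cluster.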
I am trying to identify the best way to run a one-way Anova on a complex survey design. After perusing Lumley's Survey package documentation, I am none the wiser.
The survey::anova function is meant to 'Fit and compare hierarchical loglinear models for complex survey data', which is not what I am doing.
What I am trying to do
I have collected data about one categorical independent variable [3 levels] and one quantitative dependent variable. I want to use ANOVA to check if the dependent variable changes according to the level of the independent variable.
Here is an example of my process:
Load Survey package and create complex survey design object
library(survey)
df <- data.frame(sex = c('F', 'O', NA, 'M', 'M', 'O', 'F', 'F'),
married = c(1,1,1,1,0,0,1,1),
pens = c(0, 1, 1, NA, 1, 1, 0, 0),
weight = c(1.12, 0.55, 1.1, 0.6, 0.23, 0.23, 0.66, 0.67))
svy_design <- svydesign(ids=~1, data=df, weights=~weight)
Borrowing from this post over here,
Method 1: using survey::aov
summary(aov(weight~sex,data = svy_design))
However I got an error saying:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': object 'api00' not found
Method 2: use survey::glm instead of anova
That same post has an answer/explanation with a case against using anova:
According to the main statistician of our institute there is no easy implementation of this kind of analysis in any common modeling environment. The reason for that is that ANOVA and ANCOVA are linear models that were not further developed after the emergence of General Linear Models (later Generalized linear models - GLMs) in the 70's.
A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA... simply use a GLM instead.
summary(svyglm(weight~sex,svy_design))
I got this output:
Call:
svyglm(formula = weight ~ sex, design = svy_design)
Survey design:
svydesign(ids = ~1, data = df, weights = ~weight)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8730 0.1478 5.905 0.00412 **
sexM -0.3756 0.1855 -2.024 0.11292
sexO -0.4174 0.1788 -2.334 0.07989 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.04270091)
Number of Fisher Scoring iterations: 2
My Questions:
Why does method 1 throw an error?
Is it possible to use the survey::aov function to accomplish my goal?
If I were to use survey::glm [method 2], which value should I be looking at to identify a difference in means? Would it be the p value of the intercept?
I am a far cry from a stats buff, please do explain in the simplest possible terms. Thank you!!
There is no such function as survey::aov, so you can't use it to accomplish your goal. Your code uses stats::aov.
You can use survey::svyglm. I will use one of the examples from the package (the api data, with the design dclus2 echoed in the output below), so I can actually run the code.
> model<-svyglm(api00~stype, design=dclus2)
> summary(model)
Call:
svyglm(formula = api00 ~ stype, design = dclus2)
Survey design:
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 692.81 30.28 22.878 < 2e-16 ***
stypeH -94.47 27.66 -3.415 0.00156 **
stypeM -50.46 23.01 -2.193 0.03466 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 17528.44)
Number of Fisher Scoring iterations: 2
There are three school types, E, M, and H. The two coefficients here estimate differences between the mean of E and the means of the other two groups and the $p$-values test the hypotheses that H and E have the same mean and that M and E have the same mean.
If you want an overall test for the difference in means among the three groups you can use the regTermTest function, which tests a term or set of terms in the model, e.g.:
> regTermTest(model,~stype)
Wald test for stype
in svyglm(formula = api00 ~ stype, design = dclus2)
F = 12.5997 on 2 and 37 df: p= 6.7095e-05
That F test is analogous to the one stats::aov gives. It's not identical, because this is survey data.
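To see the regression/ANOVA correspondence without the survey machinery, here is a minimal unweighted sketch. It uses the built-in PlantGrowth data rather than the survey example above, so it ignores weights and design. It only illustrates how to read the coefficients and the overall F test.

```r
# Unweighted illustration on a built-in data set (not survey data).
data(PlantGrowth)  # weight measured in three groups: ctrl, trt1, trt2

fit_lm  <- lm(weight ~ group, data = PlantGrowth)
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# Each non-intercept coefficient is the difference between a group's mean
# and the mean of the reference group (ctrl).
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
coef(fit_lm)
group_means - group_means[["ctrl"]]

# The overall F test for the factor is the same in both fits.
f_lm  <- anova(fit_lm)[["F value"]][1]
f_aov <- summary(fit_aov)[[1]][["F value"]][1]
c(lm = f_lm, aov = f_aov)
```

So the p-value of the intercept is not the one to look at. The group coefficients give pairwise comparisons with the reference level, and the term-level F (or Wald) test gives the overall comparison.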
Edit: changed code to include test = "t"
I'm hoping to better understand how the updated dev version of Metafor 2.5-101 will help me to adjust my degrees of freedom in a multi-level model to provide some protection against type 1 error.
My understanding of this has come from the Nakagawa preprint "Methods for testing publication bias in ecological and evolutionary meta-analyses" https://ecoevorxiv.org/k7pmz/ and their "Supplemental_Impleentation_Example.Rmd" file, following along with their lines 133-142:
Before moving on to some useful corrections, users should be aware that the most up-to-date version of metafor (version 2.5-101) does now provide users with some protection against Type I errors. Instead of using the number of effect sizes in the calculation of the degrees of freedom we can instead make use of the total numbers of papers instead. We show in our simulations that a "papers-1" degrees of freedom can be fairly good. This can be implemented as follows after installing the development version of metafor (see "R Packages Required" above):
mod_multilevel_pdf = rma.mv(yi = yi, V = vi, mods = ~1,
random=list(~1|study_id,~1|obs),
data=data, test="t", dfs = "contain")
summary(mod_multilevel_pdf)
We can see here that the df for the model has changed from 149 to 29, and the p-value has been adjusted accordingly.
So my understanding is that the model now shows df as 29 (the number of papers, 30, minus 1) instead of 149 (the total number of effects, 30 papers with 5 effects each = 150, minus 1).
Adapting this to my code, where I have n = 18 papers and a total of n = 24 effects, I would expect the above code to adjust my df to 17 (the number of papers, 18, minus 1); however, I still have df as 23 (the total number of effects, 24, minus 1).
The output using the df code:
mod_multilevel_pdf = rma.mv(yi = yi, V = vi, mods = ~1,
random=list(~1|study_id,~1|es_id),
data=dat, test="t", dfs = "contain")
summary(mod_multilevel_pdf)
Is:
Multivariate Meta-Analysis Model (k = 24; method: REML)
logLik Deviance AIC BIC AICc
-30.2270 60.4540 66.4540 69.8604 67.7171
Variance Components:
estim sqrt nlvls fixed factor
sigma^2.1 0.6783 0.8236 18 no study_id
sigma^2.2 0.1416 0.3763 24 no es_id
Test for Heterogeneity:
Q(df = 23) = 167.2145, p-val < .0001
Model Results:
estimate se tval df pval ci.lb ci.ub
-0.3508 0.2219 -1.5809 17 0.1323 -0.8190 0.1174
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Quite stumped on this one! Any help would be majorly appreciated.
You neither have df=17 nor df=23, since you did not specify that you want a t-test. With test="t", dfs = "contain", you will get the expected t-test with df=17.
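The effect of the degrees of freedom can be checked by hand from the printed "Model Results" row. The sketch below copies the estimate, standard error and t value from that output and recomputes the p-value and confidence interval under three references: a t-distribution with papers - 1 df, a t-distribution with effects - 1 df, and the normal distribution (df = Inf), which corresponds to not requesting a t-test. The helper summarise_t() is made up for this illustration and is not part of metafor.

```r
# Values copied from the printed "Model Results" row.
est  <- -0.3508
se   <-  0.2219
tval <- -1.5809

n_studies <- 18  # levels of study_id
n_effects <- 24  # k, levels of es_id

# Hypothetical helper: two-sided p-value and 95% CI for a given df.
summarise_t <- function(df) {
  crit <- qt(0.975, df = df)
  c(df    = df,
    pval  = 2 * pt(-abs(tval), df = df),
    ci.lb = est - crit * se,
    ci.ub = est + crit * se)
}

res <- rbind(papers_minus_1  = summarise_t(n_studies - 1),
             effects_minus_1 = summarise_t(n_effects - 1),
             normal          = summarise_t(Inf))
round(res, 4)
```

The papers_minus_1 row should agree with the df = 17 line in the output above up to rounding. The other two rows show how the same estimate would be judged with less conservative reference distributions.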
I have built a generalized additive mixed effects model with a fixed effect and a random intercept (for a categorical variable). After running the model I am able to extract the random intercepts for each of my categories using ranef(m1$lme)$x[[1]]. However, when I try to extract the standard errors of the random effects using se.ranef(m1$lme), the function does not work. Other attempts using se.ranef(m1) and se.ranef(m1$gam) don't work either. I don't know if this is because these functions only apply to models fitted by lmer?
Can anyone help me pull out the standard errors of my random intercepts from an object of class "gamm"? I would like to use the random intercepts and standard errors to plot the Best Linear Unbiased Predictors of my gamm model.
My initial model is of the form: gamm(y ~ s(z), random = list(x = ~1), data = dat).
library(mgcv)
library(arm)
example <- gamm(mag ~ s(depth), random = list(stations = ~1), data = quakes)
summary(example$gam)
#Family: gaussian
#Link function: identity
#Formula:
# mag ~ s(depth)
#Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 5.02300 0.04608 109 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Approximate significance of smooth terms:
# edf Ref.df F p-value
#s(depth) 3.691 3.691 43.12 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#R-sq.(adj) = 0.0725
#Scale est. = 0.036163 n = 1000
ranef(example$lme)$stations[[1]] # extract random intercepts
#se.ranef(example$lme) # extract se of random intercepts - Problem line - doesn't work?
I'm not an expert on the inner workings of nlme::lme but I don't think it is easy to get what you want from that model; the ranef() method doesn't allow for the posterior or conditional variance of the random effects to be returned, unlike the method for models fitted by lmer() and co.
Two options spring to mind:
fit the model using gamm4::gamm4(), where the mixed model face of the object should work with se.ranef(), or
fit the model with gam() using the random effect basis.
Setup
library("mgcv")
library("gamm4")
library("arm")
library("ggplot2")
library("cowplot")
theme_set(theme_bw())
Option 1: gamm4::gamm4()
This is a straight translation of your model to the syntax required for gamm4::gamm4()
quakes <- transform(quakes, fstations = factor(stations))
m1 <- gamm4::gamm4(mag ~ s(depth), random = ~ (1 | fstations),
data = quakes)
re1 <- ranef(m1$mer)[["fstations"]][,1]
se1 <- se.ranef(m1$mer)[["fstations"]][,1]
Note I convert stations to a factor as mgcv::gam needs a factor to fit a random intercept.
Option 2: mgcv::gam()
For this we use the random effects basis. The theory of penalised spline models shows that if we write the math down in a particular way, the model has the same form as a mixed model, with the wiggly parts of the basis acting as random effects and the infinitely smooth parts of the basis used as fixed effects. The same theory allows the reverse process; we can formulate a spline basis that is fully penalised, which is the equivalent of a random effect.
m2 <- gam(mag ~ s(depth) + s(fstations, bs = "re"),
data = quakes, method = "REML")
We also need to do a little more work to get the "estimated" random effects and standard errors. We need to predict from the model at the levels of fstations. We also need to pass in values for the other terms in the models but as the model is additive we can ignore their effect and just pull out the random effect.
newd <- with(quakes, data.frame(depth = mean(depth),
fstations = levels(fstations)))
p <- predict(m2, newd, type = "terms", se.fit = TRUE)
re2 <- p[["fit"]][ , "s(fstations)"]
se2 <- p[["se.fit"]][ , "s(fstations)"]
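As a cross-check on the prediction-based extraction, the same quantities can be read directly off the fitted coefficients. The "re" basis has one coefficient per factor level, and vcov() on a gam returns the Bayesian posterior covariance matrix by default. The sketch below refits the model under its own object names (quakes2, m_re) so it runs on its own. Treat it as an illustration rather than the answer's method.

```r
library(mgcv)

quakes2 <- transform(quakes, fstations = factor(stations))
m_re <- gam(mag ~ s(depth) + s(fstations, bs = "re"),
            data = quakes2, method = "REML")

# Coefficients of the random-effect smooth, one per level of fstations.
idx     <- grep("s(fstations)", names(coef(m_re)), fixed = TRUE)
re_coef <- unname(coef(m_re)[idx])
se_coef <- unname(sqrt(diag(vcov(m_re)))[idx])

# Same quantities via term-wise prediction at each level.
lev  <- levels(quakes2$fstations)
newd <- data.frame(depth = mean(quakes2$depth),
                   fstations = factor(lev, levels = lev))
p_re <- predict(m_re, newd, type = "terms", se.fit = TRUE)

all.equal(re_coef, unname(p_re$fit[, "s(fstations)"]))
all.equal(se_coef, unname(p_re$se.fit[, "s(fstations)"]))
```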
How do these options compare?
re <- data.frame(GAMM = re1, GAM = re2)
se <- data.frame(GAMM = se1, GAM = se2)
p1 <- ggplot(re, aes(x = GAMM, y = GAM)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
coord_equal() +
labs(title = "Random effects")
p2 <- ggplot(se, aes(x = GAMM, y = GAM)) +
geom_point() +
geom_abline(intercept = 0, slope = 1) +
coord_equal() +
labs(title = "Standard errors")
plot_grid(p1, p2, nrow = 1, align = "hv")
The "estimates" are equivalent, but their standard errors are somewhat larger in the GAM version.
My R-script produces glm() coeffs below.
What is Poisson's lambda, then? It should be ~3.0 since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
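Because the model uses a log link, the intercept itself is log(lambda), so lambda can also be recovered by exponentiating the coefficient. A minimal sketch with the same simulated data follows. The Wald interval from confint.default() is added here for illustration and is not part of the answer above.

```r
set.seed(0)
x   <- rpois(10000, 3)
fit <- glm(x ~ 1, family = poisson())

# The intercept is on the log scale; back-transform to get lambda.
lambda_hat <- exp(coef(fit)[["(Intercept)"]])

# Wald 95% interval computed on the log scale, then back-transformed.
lambda_ci <- exp(confint.default(fit))

lambda_hat
lambda_ci
```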
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that computes -sum((# counts)*log(prob(count|lambda))) and minimize it using optim(); the solution given below using the bbmle package is a little more complex up-front but gives you added benefits like easily computing confidence intervals etc.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts=unname(c(tt)),
val=as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure)
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x,val,lambda,log=FALSE) {
probs <- dpois(val,lambda,log=TRUE)
r <- sum(x*probs)
if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts~dpoisbin(val,exp(loglambda)),
data=dd,
start=list(loglambda=0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
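For comparison, the bare-bones route (write the negative log-likelihood and minimize it) can be sketched in base R. Since there is only one parameter, this sketch uses optimize() on a bounded interval instead of optim(). The interval c(-5, 5) on the log scale is an arbitrary choice for this example.

```r
set.seed(101)
random_pois <- rpois(10000, 3)
tt     <- table(random_pois)
counts <- as.vector(tt)           # how many times each value occurred
val    <- as.numeric(names(tt))   # the observed values

# Negative log-likelihood of the binned data, parameterized by log(lambda).
nll <- function(loglambda) {
  -sum(counts * dpois(val, lambda = exp(loglambda), log = TRUE))
}

opt        <- optimize(nll, interval = c(-5, 5), tol = 1e-8)
lambda_hat <- exp(opt$minimum)

c(mle = lambda_hat, sample_mean = mean(random_pois))
```

The estimate should match the sample mean closely, since the Poisson maximum-likelihood estimate of lambda is the mean of the observations.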
I am interested to reproduce results calculated by the GNU plugin to MS Word WordMat in R, but I can't get them to arrive at similar results (I am not looking for identical, but simply similar).
I have some y and x values and a power function, y = bx^a
Using the following data,
x <- c(15,31,37,44,51,59)
y <- c(126,71,61,53,47,42)
I get a = -0.8051 and b = 1117.7472 in WordMat, but a = -0.8026 and b = 1108.2533 in R, slightly different values.
Am I using the nls function in some wrong way or is there a better (more transparent) way to calculate it in R?
Data and R code,
# x <- c(15,31,37,44,51,59)
# y <- c(126,71,61,53,47,42)
df <- data.frame(x,y)
moD <- nls(y~a*x^b, df, start = list(a = 1,b=1))
summary(moD)
Formula: y ~ a * x^b
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1.108e+03 1.298e+01 85.35 1.13e-07 ***
b -8.026e-01 3.626e-03 -221.36 2.50e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3296 on 4 degrees of freedom
Number of iterations to convergence: 19
Achieved convergence tolerance: 5.813e-06
It looks like WordMat is estimating the parameters of y=b*x^a by doing the log-log regression rather than by solving the nonlinear least-squares problem:
> x <- c(15,31,37,44,51,59)
> y <- c(126,71,61,53,47,42)
>
> (m1 <- lm(log(y)~log(x)))
Call:
lm(formula = log(y) ~ log(x))
Coefficients:
(Intercept) log(x)
7.0191 -0.8051
> exp(coef(m1)[1])
(Intercept)
1117.747
To explain what's going on here a little bit more: if y=b*x^a, taking the log on both sides gives log(y)=log(b)+a*log(x), which has the form of a linear regression (lm() in R). However, log-transforming also affects the variance of the errors (which are implicitly included on the right-hand side of the equation), meaning that you're actually solving a different problem. Which is correct depends on exactly how you state the problem. This question on CrossValidated gives more details.
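The two fits can be put side by side to make the difference concrete. The sketch below uses the parameter naming y = b*x^a from the question, starts nls() from the log-log estimates, and compares residual sums of squares on both scales. The helper functions rss_raw() and rss_log() are made up for this illustration.

```r
x <- c(15, 31, 37, 44, 51, 59)
y <- c(126, 71, 61, 53, 47, 42)

# Log-log linear regression: log(y) = log(b) + a * log(x)
fit_log <- lm(log(y) ~ log(x))
a_log   <- coef(fit_log)[["log(x)"]]
b_log   <- exp(coef(fit_log)[["(Intercept)"]])

# Nonlinear least squares on the original scale, started at the log-log fit.
fit_nls <- nls(y ~ b * x^a, start = list(a = a_log, b = b_log))
a_nls   <- coef(fit_nls)[["a"]]
b_nls   <- coef(fit_nls)[["b"]]

# Hypothetical helpers: residual sum of squares on each scale.
rss_raw <- function(a, b) sum((y - b * x^a)^2)
rss_log <- function(a, b) sum((log(y) - log(b) - a * log(x))^2)

rbind(loglog = c(a = a_log, b = b_log,
                 rss_raw = rss_raw(a_log, b_log), rss_log = rss_log(a_log, b_log)),
      nls    = c(a = a_nls, b = b_nls,
                 rss_raw = rss_raw(a_nls, b_nls), rss_log = rss_log(a_nls, b_nls)))
```

Each method wins on its own criterion: nls() gives the smaller residual sum of squares on the original scale, and the log-log regression gives the smaller one on the log scale.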