I have built a generalized additive mixed model (GAMM) with one fixed smooth effect and a random intercept for a categorical variable. After running the model I can extract the random intercepts for each of my categories using ranef(m1$lme)$x[[1]]. However, when I try to extract the standard errors of the random effects with se.ranef(m1$lme), the function does not work; se.ranef(m1) and se.ranef(m1$gam) don't work either. I suspect these functions only apply to models fitted with lmer()?
Can anyone help me extract the standard errors of my random intercepts from an object of class "gamm"? I would like to use the random intercepts and their standard errors to plot the best linear unbiased predictors (BLUPs) of my GAMM.
My initial model is of the form: gamm(y ~ s(z), random = list(x = ~1), data = dat).
library(mgcv)
library(arm)
example <- gamm(mag ~ s(depth), random = list(stations = ~1), data = quakes)
summary(example$gam)
#Family: gaussian
#Link function: identity
#Formula:
# mag ~ s(depth)
#Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 5.02300 0.04608 109 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Approximate significance of smooth terms:
# edf Ref.df F p-value
#s(depth) 3.691 3.691 43.12 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#R-sq.(adj) = 0.0725
#Scale est. = 0.036163 n = 1000
ranef(example$lme)$stations[[1]] # extract random intercepts
#se.ranef(example$lme) # extract se of random intercepts - Problem line - doesn't work?
I'm not an expert on the inner workings of nlme::lme(), but I don't think it is easy to get what you want from that model: the ranef() method for "lme" objects doesn't return the posterior (conditional) variances of the random effects, unlike the method for models fitted by lmer() and friends.
Two options spring to mind:
fit the model using gamm4::gamm4(), where the mixed-model part of the returned object (the $mer component) should work with se.ranef(), or
fit the model with gam() using the random effect basis.
Setup
library("mgcv")
library("gamm4")
library("arm")
library("ggplot2")
library("cowplot")
theme_set(theme_bw())
Option 1: gamm4::gamm4()
This is a straight translation of your model to the syntax required for gamm4::gamm4()
quakes <- transform(quakes, fstations = factor(stations))
m1 <- gamm4::gamm4(mag ~ s(depth), random = ~ (1 | fstations),
                   data = quakes)
re1 <- ranef(m1$mer)[["fstations"]][,1]
se1 <- se.ranef(m1$mer)[["fstations"]][,1]
Note that I convert stations to a factor; a factor is needed to fit a random intercept this way (and the "re" basis used in Option 2 requires one).
Option 2: mgcv::gam()
For this we use the random effects basis. The theory of penalised spline models shows that if we write the math down in a particular way, the model has the same form as a mixed model, with the wiggly parts of the basis acting as random effects and the infinitely smooth parts of the basis used as fixed effects. The same theory allows the reverse process; we can formulate a spline basis that is fully penalised, which is the equivalent of a random effect.
m2 <- gam(mag ~ s(depth) + s(fstations, bs = "re"),
          data = quakes, method = "REML")
We also need to do a little more work to get the "estimated" random effects and standard errors: we need to predict from the model at the levels of fstations. We also have to pass in values for the other terms in the model, but as the model is additive we can ignore their effect and just pull out the random-effect term.
newd <- with(quakes, data.frame(depth = mean(depth),
                                fstations = levels(fstations)))
p <- predict(m2, newd, type = "terms", se.fit = TRUE)
re2 <- p[["fit"]][ , "s(fstations)"]
se2 <- p[["se.fit"]][ , "s(fstations)"]
How do these options compare?
re <- data.frame(GAMM = re1, GAM = re2)
se <- data.frame(GAMM = se1, GAM = se2)
p1 <- ggplot(re, aes(x = GAMM, y = GAM)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1) +
    coord_equal() +
    labs(title = "Random effects")
p2 <- ggplot(se, aes(x = GAMM, y = GAM)) +
    geom_point() +
    geom_abline(intercept = 0, slope = 1) +
    coord_equal() +
    labs(title = "Standard errors")
plot_grid(p1, p2, nrow = 1, align = "hv")
The "estimates" are the equivalent, but their standard errors are somewhat larger in the GAM version.
Related
As a follow-up to the question Creating confidence intervals for regression curve in GLMM using Bootstrapping, I am interested in obtaining the correct values of a regression curve together with its confidence-interval curves.
Consider a case where in a GLMM, there is one response variable, two continuous fixed effects and one random effect. Here is some fake data:
library(dplyr)
set.seed(1129)
x1 <- runif(100, 0, 1)
x2 <- rnorm(100, 0.5, 0.4)
f1 <- gl(n = 5, k = 20)
rnd1 <- rnorm(5, 0.5, 0.1)
my_data <- data.frame(x1 = x1, x2 = x2, f1 = f1)
modmat <- model.matrix(~ x1 + x2, my_data)
fixed <- c(-0.12, 0.35, 0.09)
y <- modmat %*% fixed + rnd1
my_data$y <- ((y - min(y)) / max(y - min(y))) %>% round(digits = 1)
rm(y)
The GLMM that I fit looks like this:
m1 <- glmer(y ~ x1 + x2 + (1 | f1), data = my_data, family = "binomial")
summary(m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ x1 + x2 + (1 | f1)
Data: my_data
AIC BIC logLik deviance df.resid
65.7 76.1 -28.8 57.7 96
Scaled residuals:
Min 1Q Median 3Q Max
-8.4750 -0.7042 -0.0102 1.5904 14.5919
Random effects:
Groups Name Variance Std.Dev.
f1 (Intercept) 1.996e-10 1.413e-05
Number of obs: 100, groups: f1, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.668 2.051 -4.713 2.44e-06 ***
x1 12.855 2.659 4.835 1.33e-06 ***
x2 4.875 1.278 3.816 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) x1
x1 -0.970
x2 -0.836 0.734
convergence code: 0
boundary (singular) fit: see ?isSingular
Plotting y vs x1:
plot(y ~ x1, my_data)
It should be possible to get a regression curve from the summary of m1. I have learned that I need to invert the link function (in this case, the logit):
y = 1/(1+exp(-(Intercept + b*x1 + c*x2)))
In order to plot a regression curve for x1 in two dimensions, I set x2 = mean(x2) in the formula (which also seems important: the red line in the plots below ignores x2, apparently leading to considerable bias). The regression line:
xx <- seq(from = 0, to = 1, length.out = 100)
yy <- 1/(1 + exp(-(-9.668 + 12.855*xx + 4.875*mean(x2))))
yyy <- 1/(1 + exp(-(-9.668 + 12.855*xx)))
lines(yy ~ xx, col = "blue")
lines(yyy ~ xx, col = "red")
I think the blue line does not look very good (and the red line is worse, of course). So, as a side question: is y = 1/(1+exp(-(Intercept + b*x1 + c*x2))) always the right back-transformation of the logit link? I am asking because I found https://sebastiansauer.github.io/convert_logit2prob/, which made me suspicious. Or is there another reason for the poor fit? Maybe my data-creation process is flawed.
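As a partial check on that side question: 1/(1+exp(-η)) is exactly the inverse logit, which R provides as plogis(), and predict() can apply it for you. A small sketch (newdat is an illustrative name; re.form = NA gives population-level predictions that ignore the random effects):
yy_check <- plogis(-9.668 + 12.855*xx + 4.875*mean(x2))  # reproduces yy above
newdat <- data.frame(x1 = xx, x2 = mean(x2))
yy_pred <- predict(m1, newdata = newdat, type = "response", re.form = NA)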
What I need now is to add the 95% confidence interval to the curve. I think that bootstrapping with the bootMer function should be a good approach. However, all the examples I found were for models with a single fixed effect. Jamie Murphy asked a similar question, but he was interested in models containing one continuous and one categorical fixed effect: Creating confidence intervals for regression curve in GLMM using Bootstrapping
But when it comes to models with more than one continuous fixed effect, I get lost. Perhaps someone can help solve this issue, possibly by modifying the second part of this tutorial:
https://www.r-bloggers.com/2015/06/confidence-intervals-for-prediction-in-glmms/
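Not a definitive answer, but one possible sketch of how bootMer could be applied here is to bootstrap population-level predictions over a grid of x1 while holding x2 at its mean (pred_fun, pframe and the number of simulations are illustrative choices):
library(lme4)  # glmer() and bootMer()
xx <- seq(0, 1, length.out = 100)
pframe <- data.frame(x1 = xx, x2 = mean(my_data$x2))
pred_fun <- function(fit) {
    # predictions on the probability scale, ignoring the random effects
    predict(fit, newdata = pframe, type = "response", re.form = NA)
}
bb <- bootMer(m1, FUN = pred_fun, nsim = 200, seed = 101)
# percentile interval for each value of x1
ci <- t(apply(bb$t, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE))
plot(y ~ x1, my_data)
lines(xx, pred_fun(m1), col = "blue")
lines(xx, ci[, 1], lty = 2, col = "blue")
lines(xx, ci[, 2], lty = 2, col = "blue")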
I am trying to estimate an ordinal logistic regression with clustered standard errors using the MASS package's polr() function. There is no built-in clustering feature, so I am looking for (a) packages or (b) manual methods for calculating clustered standard errors from the model output. I plan to use the margins package to estimate marginal effects from the model.
Here is an example:
library(MASS)
set.seed(1)
obs <- 500
# Create data frame
dat <- data.frame(y = as.factor(round(rnorm(n = obs, mean = 5, sd = 1), 0)),
                  x = sample(x = 1:obs, size = obs, replace = TRUE),
                  clust = rep(c(1, 2), 250))
# Estimate and summarize model
m1 <- MASS::polr(y ~ x, data = dat, Hess = TRUE)
summary(m1)
While many questions on Stack Overflow ask about how to cluster standard errors in R for ordinary least squares models (and in some cases for logistic regression), it's unclear how to cluster errors in ordered logistic regression (i.e. proportional odds logistic regression). Additionally, the existing SO questions focus on packages that have other severe drawbacks (e.g. the classes of model outputs are not compatible with other standard packages for analysis and presentation of results) rather than using MASS::polr() which is compatible with predict().
This essentially follows an answer offered by Achim Zeileis on R-help in 2016.
library(lmtest)
library(sandwich)
coeftest(m1, vcov = vcovCL(m1, cluster = factor(dat$clust)))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.00093547 0.00023777 3.9343 9.543e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
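The clustered covariance matrix can also be reused directly, for example to extract the standard errors or build rough Wald intervals by hand. A sketch, assuming polr's usual parameter ordering of slopes followed by cutpoints:
vc <- sandwich::vcovCL(m1, cluster = factor(dat$clust))
se_clust <- sqrt(diag(vc))       # clustered SEs for slopes and cutpoints
est <- c(coef(m1), m1$zeta)      # point estimates in the same order
cbind(estimate = est,
      lower = est - 1.96 * se_clust,
      upper = est + 1.96 * se_clust)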
My R script produces the glm() coefficients below.
What is the Poisson lambda, then? It should be ~3.0, since that is what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois <- rpois(10000, 3)
h <- hist(random_pois, breaks = 10)
mean(random_pois)  # verifying that the mean is close to 3
h_mids <- h$mids
h_counts <- h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., data = pois_ideal_data, family = poisson(link = log))
summary_ideal <- summary(pois_ideal_model)
summary_ideal
What are you doing here?! You used glm() to fit a distribution?
Well, it is not impossible to do so, but it is done like this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
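Because the default Poisson link is the log, the intercept is reported on the log scale; exponentiating it recovers lambda directly:
exp(coef(fit))
# 3.005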
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm() does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that sums (# counts) * log(prob(count | lambda)) over the bins and minimize it using optim(); the solution below using the bbmle package is a little more complex up front but gives you added benefits such as easily computed confidence intervals.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts = unname(c(tt)),
                 val = as.numeric(names(tt)))
Here I'm using table() rather than hist() because histograms on discrete data are fussy (integer cutpoints often make things confusing because you have to be careful about right- vs left-closure).
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x, val, lambda, log = FALSE) {
    probs <- dpois(val, lambda, log = TRUE)
    r <- sum(x * probs)
    if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts ~ dpoisbin(val, exp(loglambda)),
           data = dd,
           start = list(loglambda = 0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
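For comparison, here is roughly what the bare-bones optim() route mentioned above might look like: minimise the weighted negative log-likelihood on the log-lambda scale (nll is just an illustrative name):
nll <- function(loglambda) {
    -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
}
opt <- optim(par = 0, fn = nll, method = "BFGS")
exp(opt$par)  # essentially mean(random_pois), as with mle2()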
Let's say I've got data entered into a data frame like so:
df <- data.frame(x = c(1, 2, 3, 4, 5, 10, 15, 25, 50),
                 y = c(0.57, 0.75, 0.82, 0.87, 0.89, 0.95, 0.97, 0.98, 0.99))
df
and I wish to fit the expression:
y = ((x/a)^b)/(1+(x/a)^b)
where a and b are unknown parameters.
I have plotted the points and drawn a fitted line by guessing the values of a and b:
library(ggplot2)
graph <- ggplot(df, aes(x=x, y=y))
graph <- graph + geom_point()
a = 0.50
b = 1.00
guesstimate <- function(x){((x/a)^b)/(1+(x/a)^b)}
graph <- graph + stat_function(fun = guesstimate)
graph
However, I'd like to find the values of a and b that produce the highest R^2 value, i.e. the best possible fit to the data.
Question:
Short of guessing values of a and b manually and judging the fit by eye, is there a way to get R to find the 'best' a and b values, along with an R-squared value confirming that they really are the best possible fit?
You can use the nls (non-linear least squares) function:
m1 <- nls(y ~ (x/a)^b/(1 + (x/a)^b), data = df, start = list(a = 1, b = 1))
summary(m1)
Formula: y ~ (x/a)^b/(1 + (x/a)^b)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.779291 0.009444 82.51 1.01e-11 ***
b 1.145174 0.012733 89.94 5.53e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.003086 on 7 degrees of freedom
Number of iterations to convergence: 4
Achieved convergence tolerance: 5.949e-08
ggplot(df, aes(x, y)) +
    geom_point() +
    geom_line(data = data.frame(x = seq(1, 50, 0.1),
                                y = predict(m1, newdata = list(x = seq(1, 50, 0.1)))),
              aes(x, y), colour = "red")
nls does not provide an r-squared value, because, as discussed in this thread on R-help, r-squared is not necessarily meaningful for a non-linear model. nls does, however, find the parameter values that minimize the residual sum-of-squares, so in that sense these parameters provide the best fit for the given data and model. That doesn't mean that there isn't another model specification that gives a better fit, though in this case the model fit is virtually perfect.
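If you still want a single descriptive number, a pseudo-R-squared can be computed by hand from the residual and total sums of squares, with the caveat above that it should not be over-interpreted for a non-linear model:
ss_res <- sum(residuals(m1)^2)        # residual sum of squares from the nls fit
ss_tot <- sum((df$y - mean(df$y))^2)  # total sum of squares
1 - ss_res/ss_tot                     # descriptive pseudo R-squared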
Even if it is not obvious, a linear model can be applied here using basic algebra. Indeed, starting from 1/y = (1+(x/a)^b)/(x/a)^b, a little manipulation gets you to:
log(1/y - 1) = -b*log(x) + b*log(a)
which is basically a linear model in the variables Y = log(1/y - 1) and X = log(x). From here, you can use lm:
df2 <- data.frame(Y = log(1/df$y - 1), X = log(df$x))
coeffs <- lm(Y ~ X, data = df2)$coefficients
a <- exp(-coeffs[1]/coeffs[2])
# 0.7491387
b <- -coeffs[2]
# 1.116111
which are similar to those obtained with nls.
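One practical use of the linearised fit is to supply starting values for nls(), which can help when convergence from arbitrary starting values is an issue. A small sketch re-using the objects above (m3 is an illustrative name):
start_vals <- list(a = unname(a), b = unname(b))
m3 <- nls(y ~ (x/a)^b/(1 + (x/a)^b), data = df, start = start_vals)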
I have performed a Cox regression analysis including four variables (sex, age and two binary explanatory variables), all of which are significantly associated with the outcome. I have used the coxph function from the "survival" package in R:
library(survival)
cox <- coxph(Surv(time, status_risk == 1) ~ sex + age + stone_number + stone_size, data = cox_cut)
summary(cox)
Call:
coxph(formula = Surv(time, status_risk == 1) ~ sex + age +
stone_number + stone_size, data = cox_cut)
n= 582, number of events= 48
(82 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
sexfemale 0.76993 2.15961 0.34577 2.227 0.025966 *
age -0.03222 0.96829 0.01201 -2.682 0.007311 **
stone_number>=2 0.60646 1.83393 0.29942 2.025 0.042821 *
stone_size>10 1.02593 2.78969 0.29391 3.491 0.000482 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
sexfemale 2.1596 0.4630 1.0966 4.2530
age 0.9683 1.0327 0.9458 0.9914
stone_number>=2 1.8339 0.5453 1.0198 3.2980
stone_size>10 2.7897 0.3585 1.5681 4.9629
I would like to make a prediction score table including the four variables, with four age-stratified groups (30, 40, 50 and 60 years of age). All hazards in this table should be divided by one predefined reference hazard in order to obtain an HR for each age group.
How can I compute the HR with its 95% CI for each of these specific age groups in R?
As per shadow's comment, the CI of the parameter estimate is based on the whole dataset; if you want age-conditional CIs you need to subset your data. If instead you want to generate expected survival curves conditional on a set of covariates (including age), this is how you do it:
# Create a dummy dataset
df <- data.frame(sex = sample(c(TRUE, FALSE), 100, replace = TRUE),
                 age = 50 + rnorm(100)*10,
                 foo = sample(c('a','b','c'), 100, replace = TRUE),
                 bar = sample(c('x','y','z'), 100, replace = TRUE),
                 status = sample(c(TRUE, FALSE), 100, replace = TRUE),
                 start = 0,  # required for survfit with `individual = TRUE`
                 time = -log(runif(100)))
# fit the coxph model to your full dataset
cox <- coxph(Surv(start,time, status) ~ sex + age + foo + bar, data=df)
# create a data.frame with all the variables used in the formula
newData <- data.frame(sex = TRUE,
                      age = 55,
                      foo = sample(c('a','b','c'), 1),
                      bar = sample(c('x','y','z'), 1),
                      status = TRUE,  # required but unused
                      start = 0,      # required but unused
                      time = 1)       # required but unused
# get a prediction from the fitted model, specifying 'individual = TRUE'
pred <- survfit(cox, newdata = newData, individual = TRUE)
# plot the survival curves
matplot(x = cbind(pred$time),
        y = cbind(pred$surv,
                  pred$upper,
                  pred$lower),
        type = 'l',
        lty = c(1, 2, 2),
        main = 'Predicted Survival with 95% CI')
You may also want to inspect unclass(pred) and summary(pred).
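If what you are after really is hazard ratios for specific ages relative to a predefined reference age (holding the other covariates fixed), they can be computed directly from the age coefficient and its variance. A sketch using the dummy model above, where ref_age and the age grid are illustrative choices:
b_age   <- coef(cox)["age"]
se_age  <- sqrt(vcov(cox)["age", "age"])
ref_age <- 50
ages    <- c(30, 40, 50, 60)
delta   <- ages - ref_age
# HR for age a vs the reference is exp(beta_age * (a - ref)); the SE of the
# log-HR for that contrast is |a - ref| * se(beta_age)
data.frame(age   = ages,
           HR    = exp(delta * b_age),
           lower = exp(delta * b_age - 1.96 * abs(delta) * se_age),
           upper = exp(delta * b_age + 1.96 * abs(delta) * se_age))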