To assess whether there is an association between certain groups of patients (patient_group; categorical variable) and a disease (disease_outcome; count variable), I am running negative binomial regression models (due to overdispersion). To check for confounding by other variables, I am running 3 models with an increasing number of covariates.
To display the IRRs and CIs, I want to use the tbl_regression function from the package gtsummary (I am using the latest version, 1.3.7.9022). However, calling the function returns the IRRs and the corresponding 95% CIs non-exponentiated, even though I set exponentiate=TRUE:
# Load packages
library(haven)
library(magrittr)
library(MASS)
library(dplyr)
install.packages("gtsummary")
remotes::install_github("ddsjoberg/gtsummary")
library(gtsummary)
# Load example data.
dat <- read_dta("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
  prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})
# Run negative binomial regression and pipe in the tbl_regression function
model1 <-
  glm.nb(data = dat, formula = daysabs ~ prog) %>%
  tbl_regression(exponentiate = TRUE)
model1
This returns the summary table, but the regression coefficients have not been exponentiated. Is there a way to get gtsummary to return exponentiated coefficients and CIs?
Thanks!
I was just doing some poking around to see what is going on. The tbl_regression() function uses broom::tidy() in the background. Support for negbin models was just added 7 days ago, but for some reason an exponentiate= argument was not added for this type of model.
I am going to request that it be added. In the meantime, this code should get you up and going with negbin models.
library(gtsummary)
library(tidyverse)
# add a custom tidying function
my_negbin_tidy <- function(x, exponentiate = FALSE, ...) {
  df_tidy <- broom::tidy(x, ...)
  # exponentiate coef and CI if requested
  if (exponentiate) {
    df_tidy <-
      df_tidy %>%
      mutate_at(vars(any_of(c("estimate", "conf.low", "conf.high"))), exp)
  }
  df_tidy
}
# build model
mod <- MASS::glm.nb(response ~ age, gtsummary::trial)
# summarize model results
tbl <-
  tbl_regression(
    mod,
    exponentiate = TRUE,
    tidy_fun = my_negbin_tidy
  )
Created on 2021-04-12 by the reprex package (v2.0.0)
I used tbl_regression() and add_glance_table() from gtsummary to build a table with model statistics:
library(gtsummary)
coxph(Surv(time, event) ~ score, data = dat) %>%
  tbl_regression(exponentiate = TRUE) %>%
  add_glance_table(concordance)
1st question: How can I move the model statistic horizontally, to the right?
Because, in the end, I want to display multiple model statistics, with the C-index in the last column, like this:
tbl_uvregression(
  dat_score,
  method = survival::coxph,
  y = Surv(time, event),
  exponentiate = TRUE)
2nd question: How do I add add_glance_table() in tbl_uvregression()?
You can merge any additional columns/statistics into a gtsummary table using the modify_table_body() function (the table_body is an internal data frame that is styled and printed as the summary table).
It's possible to add the c-index in a tbl_uvregression() setting. But I think it requires a higher understanding of the internals of a tbl_uvregression() object. In the example below, I estimate each univariable model separately, summarize the model with tbl_regression(), merge in the c-index, then stack all the tbls with tbl_stack().
Happy Programming!
library(gtsummary)
library(tidyverse)
library(survival)
packageVersion("gtsummary")
#> [1] '1.5.2'
covariates <- c("age", "marker")
# iterate over the covariates
tbl <-
  covariates %>%
  map(
    function(varname) {
      # build regression model
      mod <-
        str_glue("Surv(ttdeath, death) ~ {varname}") %>%
        as.formula() %>%
        coxph(data = trial)
      # calculate and format c-index, adding a variable column to merge in the next step
      df_cindex <-
        broom::glance(mod) %>%
        select(concordance) %>%
        mutate(
          concordance = style_sigfig(concordance, digits = 3),
          variable = varname
        )
      # summarize model
      tbl_regression(mod, exponentiate = TRUE) %>%
        # merge in the c-index
        modify_table_body(~ left_join(.x, df_cindex, by = "variable")) %>%
        modify_header(concordance = "**c-index**") # assigning a header label unhides the column
    }
  ) %>%
  # stack all tbls
  tbl_stack()
Created on 2022-04-09 by the reprex package (v2.0.1)
I am trying to extract random intercepts from tidymodels using lme4 and multilevelmod. I am able to do this using lme4 directly, as shown below:
Using R and lme4:
library("tidyverse")
library("lme4")
# set up model
mod <- lmer(Reaction ~ Days + (1|Subject),data=sleepstudy)
# create expanded df
expanded_df <- with(sleepstudy,
                    data.frame(
                      expand.grid(Subject = levels(Subject),
                                  Days = seq(min(Days), max(Days), length = 51))))
# create predicted df with **random intercepts**
predicted_df <- data.frame(expanded_df, resp = predict(mod, newdata = expanded_df))
predicted_df
# plot intercepts
ggplot(predicted_df,aes(x=Days,y=resp,colour=Subject))+
geom_line()
Using tidymodels:
# example from
# https://github.com/tidymodels/multilevelmod
library("multilevelmod")
library("tidymodels")
library("tidyverse")
library("lme4")
#> Loading required package: parsnip
data(sleepstudy, package = "lme4")
# set engine to lme4
mixed_model_spec <- linear_reg() %>% set_engine("lmer")
# create model
mixed_model_fit_tidy <-
  mixed_model_spec %>%
  fit(Reaction ~ Days + (1 | Subject), data = sleepstudy)
expanded_df_tidy <- with(sleepstudy,
                         data.frame(
                           expand.grid(Subject = levels(Subject),
                                       Days = seq(min(Days), max(Days), length = 51))))
# predict() on a parsnip fit returns a tibble with a .pred column
predicted_df_tidy <- bind_cols(expanded_df_tidy,
                               predict(mixed_model_fit_tidy, new_data = expanded_df_tidy))
ggplot(predicted_df_tidy, aes(x = Days, y = .pred, colour = Subject)) +
  geom_line()
Using the predict() function seems to give only the fixed-effect predictions.
Is there a way to extract the random intercepts from tidymodels and multilevelmod? I know the package is still in development so it might not be possible at this stage.
I think you can work around this as follows:
predicted_df_tidy <- mutate(expanded_df_tidy,
                            .pred = predict(mixed_model_fit_tidy,
                                            new_data = expanded_df_tidy,
                                            type = "raw", opts = list(re.form = NULL)))
bind_cols() instead of mutate() might be useful in some circumstances?
The issue is that multilevelmod internally sets the default for prediction to re.form = NA; the code above resets it to re.form = NULL (which is the lme4 default, i.e. include all random effects in the prediction).
If you actually want the random intercepts (only), I guess you could use predicted_df_tidy %>% filter(Days == 0).
PS If you want to be more 'tidy' about this, I think you can use purrr::cross_df() in place of expand.grid() and pipe the results directly to mutate() ...
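If you want the estimated random intercepts themselves (rather than predictions), one option is to pull the underlying lmerMod object out of the parsnip fit and use lme4's accessors on it. A minimal sketch, assuming a recent parsnip that provides extract_fit_engine() (on older versions, mixed_model_fit_tidy$fit is the same object):
lme4_fit <- extract_fit_engine(mixed_model_fit_tidy)
# random intercepts as per-subject deviations from the fixed intercept
lme4::ranef(lme4_fit)$Subject
# per-subject intercepts (fixed + random combined)
coef(lme4_fit)$Subject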
To test whether there is an association between disease groups (categorical_variable) and a disease (outcome; dichotomous), I am running a logistic regression. To check for confounding by other variables, I am running 3 models with varying numbers of covariates.
To display the ORs and CIs, I am using the tbl_regression function from the package gtsummary (and I used the dplyr function mutate to display CIs in round brackets). However, this displays the CI in a separate column from the OR, but I want them in the same column, with the CI in round brackets after the OR.
My code:
library(gtsummary)
library(dplyr)
model1 <-
  glm(data = wide_dataset, formula = outcome ~ categorical_variable,
      family = binomial(link = "logit")) %>%
  tbl_regression(
    exponentiate = TRUE) %>%
  # remove the p-value column
  modify_column_hide(column = p.value) %>%
  # CI in round brackets:
  modify_table_body(mutate,
                    ci = gsub("(\\d\\.\\d{,4})(, )(\\d\\.\\d{,4})",
                              "\\(\\1 \\3\\)",
                              ci))
model1
Thanks in advance!
It's much easier to put the odds ratio and the CI in the same column using the dev version of the package (will be released to CRAN next week).
If you use the JAMA journal theme, the OR and the CI will automatically be combined into a single column.
remotes::install_github("ddsjoberg/gtsummary")
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.3.7.9016'
# set the JAMA theme to display OR and CI in same column
theme_gtsummary_journal("jama")
#> Setting theme `JAMA`
tbl <-
  glm(response ~ age + grade, data = trial, family = binomial) %>%
  tbl_regression(exponentiate = TRUE) %>%
  modify_column_hide(p.value)
Created on 2021-04-07 by the reprex package (v2.0.0)
This code could also be used without setting the JAMA journal theme.
tbl <-
  glm(response ~ age + grade, data = trial, family = binomial) %>%
  tbl_regression(exponentiate = TRUE) %>%
  # merge OR and CI into single column
  modify_table_styling(
    column = estimate,
    rows = !is.na(estimate),
    cols_merge_pattern = "{estimate} ({conf.low} to {conf.high})"
  ) %>%
  modify_header(estimate ~ "**OR (95% CI)**") %>%
  modify_column_hide(c(ci, p.value))
It sounds like you want to compare multiple models. Here's how you can put them in the same table.
list(tbl, tbl, tbl) %>%
  tbl_merge(tab_spanner = paste0("**Model ", 1:3, "**"))
I have 100 replicates of a coxph model fitted in a loop. I am trying to extract the log-rank score test result, with p-values, for each replicate into a data frame or list. I am using the following code, but it gives me only the log-rank score, not the p-value. Any help will be much appreciated.
I can share the dataset, but I am not sure how to attach it here.
thanks,
Krina
Repl_List <- unique(dat3$Repl)
doLogRank = function(sel_name) {
  dum <- dat3[dat3$Repl == sel_name, ]
  reg <- with(dum, coxph(Surv(TIME_day, STATUS) ~ Treatment, ties = "breslow"))
  LogRank <- with(reg, reg$score)
}
LogRank <- t(as.data.frame(lapply(Repl_List, doLogRank)))
Here is a mock example that I took from the help page of the coxph function. I just replicated the dataset 100 times to recreate your scenario. I highly recommend starting to use the tidyverse packages for this kind of work; broom is a great addition, along with dplyr and tidyr.
library(survival)
library(tidyverse)
library(broom)
test <- data.frame(time = c(4, 3, 1, 1, 2, 2, 3),
                   status = c(1, 1, 1, 0, 1, 1, 0),
                   x = c(0, 2, 1, 1, 1, 0, 0),
                   sex = c(0, 0, 0, 0, 1, 1, 1))
Below I am replicating the dataset 100 times using the replicate function.
r <- replicate(test, n = 100, simplify = FALSE) %>% bind_rows %>%
  mutate(rep = rep(seq(1, 100, 1), each = 7))
I set up the Cox model as a small function that I can then apply to each replicate of the dataframe.
cxph_mod <- function(df) {
  coxph(Surv(time, status) ~ x + strata(sex), df)
}
Below is the step-by-step process of fitting the model and extracting the values:
-- tidyr::nest the dataframe (nest is a function in library(tidyr))
-- purrr::map the model into each nest (map is a function similar to lapply, in library(purrr))
nested <- r %>%
  group_by(rep) %>%
  nest %>%
  mutate(model = data %>% map(cxph_mod))
Look into the first rep to see the coxph output. You will see the model object stored in the cells of the dataframe, allowing easier access.
nested %>% filter(rep==1)
With each model object in place, now use broom to get the parameter estimates from each model into the nested dataset:
nested <- nested %>%
  mutate(
    ests = model %>% map(broom::tidy)
  )
Use tidyr::unnest to view the estimates from fitting each resampled dataset:
ests <- unnest(nested,ests,.drop=TRUE) %>% dplyr::select(rep,estimate:conf.high)
In this case, since I am repeating the same dataset 100 times, the p-value will be the same, but in your case you will have 100 different datasets and hence 100 different p-values.
ggplot(data=ests,aes(y=p.value,x=rep))+geom_point()
Vijay
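A side note on the p-values: broom::tidy() returns the Wald-test p-value for each term. If you specifically want the score (log-rank) test p-value from each fit, it is available from summary.coxph(). A minimal sketch, assuming the dat3/Repl_List structure from the question:
doLogRank <- function(sel_name) {
  dum <- dat3[dat3$Repl == sel_name, ]
  reg <- coxph(Surv(TIME_day, STATUS) ~ Treatment, data = dum, ties = "breslow")
  sc <- summary(reg)$sctest # named vector: test statistic, df, pvalue
  data.frame(Repl = sel_name, score = sc[["test"]], p.value = sc[["pvalue"]])
}
LogRank <- do.call(rbind, lapply(Repl_List, doLogRank))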
I have a small-N, large-T panel which I am estimating via plm::plm (panel linear regression model) with fixed effects.
Is there any way to get predicted values for a new dataset? (I want to estimate parameters on a subset of my sample, and then use these to calculate model-implied values for the whole sample.)
There are (at least) two methods in the package to produce estimates from plm objects:
-- fixef.plm: Extract the Fixed Effects
-- pmodel.response: A function to extract the model.response
It appears to me that the author(s) are not interested in providing estimates for the "random effects". It may be a matter of "if you don't know how to do it on your own, then we don't want to give you a sharp knife to cut yourself too deeply."
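For concreteness, a minimal sketch of these two extractors; fe_mod is a hypothetical fixed-effects fit created here just for illustration:
library(plm)
data("Produc", package = "plm")
# hypothetical within (fixed-effects) model for illustration
fe_mod <- plm(log(gsp) ~ log(pcap) + unemp, data = Produc,
              index = c("state", "year"), model = "within")
fixef(fe_mod)                  # one estimated fixed effect per state
head(pmodel.response(fe_mod))  # the (within-transformed) response used in estimation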
I wrote a function called predict.out.plm that can create predictions for the original data and for a manipulated data set (with equal column names).
The predict.out.plm function calculates (a) the predicted (fitted) outcome of the transformed data and (b) constructs the corresponding level outcome. The function works for first-difference (FD) estimations and fixed-effects (FE) estimations using plm. For FD it creates the differenced outcome over time, and for FE it creates the time-demeaned outcome.
The function is largely untested, and probably only works with strongly balanced data frames.
Any suggestions and corrections are very welcome. Help developing this into a small R package would be much appreciated.
The function predict.out.plm
predict.out.plm <- function(
  estimate,
  formula,
  data,
  model = "fd",
  pname = "y",
  pindex = NULL,
  levelconstr = TRUE
) {
  # estimate = e.fe
  # formula = f
  # data = d
  # model = "within"
  # pname = "y"
  # pindex = NULL
  # levelconstr = TRUE
  # get index of panel data
  if (is.null(pindex) && class(data)[1] == "pdata.frame") {
    pindex <- names(attributes(data)$index)
  } else {
    pindex <- names(data)[1:2]
  }
  if (class(data)[1] != "pdata.frame") {
    data <- pdata.frame(data)
  }
  # model frame
  mf <- model.frame(formula, data = data)
  # model matrix - transformed data
  mn <- model.matrix(formula, mf, model)
  # define variable names
  y.t.hat <- paste0(pname, ".t.hat")
  y.l.hat <- paste0(pname, ".l.hat")
  y.l <- names(mf)[1]
  # transformed data of explanatory variables
  # exclude variables that were dropped in estimation
  n <- names(estimate$aliased[estimate$aliased == FALSE])
  i <- match(n, colnames(mn))
  X <- mn[, i]
  # predict transformed outcome with X * beta
  # p <- X %*% coef(estimate)
  p <- crossprod(t(X), coef(estimate))
  colnames(p) <- y.t.hat
  if (levelconstr == TRUE) {
    # old dataset with original outcome
    od <- data.frame(
      attributes(mf)$index,
      data.frame(mf)[, 1]
    )
    rownames(od) <- rownames(mf) # preserve row names from model.frame
    names(od)[3] <- y.l
    # merge old dataset with prediction
    nd <- merge(
      od,
      p,
      by = "row.names",
      all.x = TRUE,
      sort = FALSE
    )
    nd$Row.names <- as.integer(nd$Row.names)
    nd <- nd[order(nd$Row.names), ]
    # construct predicted level outcome for FD estimations
    if (model == "fd") {
      # first observation from real data
      i <- which(is.na(nd[, y.t.hat]))
      nd[i, y.l.hat] <- NA
      nd[i, y.l.hat] <- nd[i, y.l]
      # fill values over all years
      ylist <- unique(nd[, pindex[2]])[-1]
      ylist <- as.integer(as.character(ylist))
      for (y in ylist) {
        nd[nd[, pindex[2]] == y, y.l.hat] <-
          nd[nd[, pindex[2]] == (y - 1), y.l.hat] +
          nd[nd[, pindex[2]] == y, y.t.hat]
      }
    }
    if (model == "within") {
      # group means of outcome
      gm <- aggregate(nd[, pname], list(nd[, pindex[1]]), mean)
      gl <- aggregate(nd[, pname], list(nd[, pindex[1]]), length)
      nd <- cbind(nd, groupmeans = rep(gm$x, gl$x))
      # predicted values + group means
      nd[, y.l.hat] <- nd[, y.t.hat] + nd[, "groupmeans"]
    }
    if (model != "fd" && model != "within") {
      stop("function works only for FD and FE estimations")
    }
  }
  # results
  results <- p
  if (levelconstr == TRUE) {
    results <- list(results, nd)
    names(results) <- c("p", "df")
  }
  return(results)
}
Testing the function:
## packages
library(plm)
## test dataframe
# data structure
N <- 4
G <- 2
M <- 5
d <- data.frame(
  id = rep(1:N, each = M),
  year = rep(1:M, N) + 2000,
  gid = rep(1:G, each = M * 2)
)
# explanatory variable
d[, "x"] <- runif(N * M, 0, 1)
# outcome
d[, "y"] <- 2 * d[, "x"] + runif(N * M, 0, 1)
# panel data frame
d <- pdata.frame(d, index = c("id", "year"))
## new data frame for out-of-sample prediction
dn <- d
dn$x <- rnorm(nrow(dn), 0, 2)
## estimate
# formula
f <- pFormula(y ~ x + factor(year))
# fixed effects or first difference estimation
e <- plm(f, data = d, model = "within", index = c("id", "year"))
e <- plm(f, data = d, model = "fd", index = c("id", "year"))
summary(e)
## fitted values of estimation
# transformed outcome prediction
predict(e)
c(pmodel.response(e) - residuals(e))
predict.out.plm(e, f, d, "fd")$p
# "level" outcome prediction
predict.out.plm(e, f, d, "fd")$df$y.l.hat
# both
predict.out.plm(e, f, d, "fd")
## out-of-sample prediction
predict(e, newdata = d)
predict(e, newdata = dn)
# Error in crossprod(beta, t(X)) : non-conformable arguments
# if plm omits variables specified in the formula (e.g. one year in factor(year)),
# it tries to multiply two matrices with a different number of columns than regressors;
# the new function avoids this and is therefore able to do out-of-sample predictions
predict.out.plm(e, f, dn, "fd")
plm now has a predict.plm() function, although it is not documented/exported.
Note also that predict works on the transformed model (i.e. after doing the within/between/fd transformation), not the original one. I speculate that the reason for this is that prediction is more difficult in a panel data framework. Indeed, you need to consider what you are predicting:
-- new time periods, for an existing individual, where you used an individual FE? Then you can add the prediction to the existing individual mean.
-- new time periods, for a new individual? Then you need to figure out which individual mean you are going to use.
-- the same is even more complicated if you use a random-effects model, as the effects are not easily derived.
In the code below, I illustrate how to use fitted values, on the existing sample:
library(plm)
#> Loading required package: Formula
library(tidyverse)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))
## produce a dataset of prediction, added to the group means
Produc_means <- Produc %>%
  mutate(y = log(gsp)) %>%
  group_by(state) %>%
  transmute(y_mean = mean(y),
            y = y,
            year = year) %>%
  ungroup() %>%
  mutate(y_pred = predict(zz) + y_mean) %>%
  select(-y_mean)
## plot it
Produc_means %>%
  gather(type, value, y, y_pred) %>%
  filter(state %in% toupper(state.name[1:5])) %>%
  ggplot(aes(x = year, y = value, linetype = type)) +
  geom_line() +
  facet_wrap(~state) +
  ggtitle("Visualising in-sample prediction, for 5 states")
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
Created on 2018-11-20 by the reprex package (v0.2.1)
Looks like there is a new package to do in-sample predictions for a variety of models, including plm:
https://cran.r-project.org/web/packages/prediction/prediction.pdf
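For illustration, a minimal, untested sketch of calling that package on the zz model fitted above (assuming plm objects are supported, as the package documentation indicates):
library(prediction)
# prediction() returns a data frame of observation-level fitted values
p <- prediction(zz)
head(p)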
You can calculate the residuals via residuals(reg_name). From there, you can subtract them from your response variable to get the predicted (fitted) values.
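A minimal sketch of that approach, using the zz model fitted above (note these are fitted values on the transformed, i.e. within-demeaned, scale):
# fitted values = (transformed) response minus residuals
fitted_zz <- pmodel.response(zz) - residuals(zz)
head(fitted_zz)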