Predictions using time dependent covariates in survival model - r

Simple question, how do you specify time dependent covariates in the data.frame supplied to newdata when looking to make predictions?
In other words, I fit a model with time dependent covariates:
cfit <- coxph(Surv(tstart, tstop, status) ~ treat + sex + age +
inherit + cluster(id), data=cgd)
Now I'd like to create a prediction for a patient, but using updated data from that patient. In other words, what is their survival probability, given that we observed changes in certain covariates within certain time intervals?
I can predict survival for a new patient, as follows:
survfit(cfit, newdata=data.frame(treat = "placebo", age = 12, sex ="male", inherit = "X-linked"))$surv
But this does not allow me to update predictions as time passes from the start of observation for that patient, allowing for the incorporation of updated covariates.

This is detailed in the 4th paragraph of the details section of the help page ?survfit.coxph. Basically you need an id column that shows which rows belong to the same person, then for each row you need the beginning time, the ending time, and the values of the covariates during that time period. Each time period for the individual being predicted will have its own row in newdata (so the time periods should not overlap).
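As a minimal sketch of that layout (not copied from the help page): the column names below mirror the cgd model in the question, while the interval boundaries, the covariate values and the treatment switch are invented purely for illustration. The status column is included as a placeholder because the response variables are expected to be present in newdata when id is used.

```r
# Hypothetical covariate path for ONE patient in (tstart, tstop] form.
# Interval boundaries and covariate values are made up for illustration.
nd <- data.frame(
  id      = 1,                                  # same id on every row = same person
  tstart  = c(0, 100, 200),                     # start of each interval
  tstop   = c(100, 200, 300),                   # end of each interval
  status  = 0,                                  # placeholder for the response column
  treat   = c("placebo", "placebo", "rIFN-g"),  # hypothetical change at day 200
  sex     = "male",
  age     = 12,
  inherit = "X-linked"
)

# Sanity check: each interval starts exactly where the previous one ends,
# so the rows neither overlap nor leave gaps
gaps_ok <- all(nd$tstart[-1] == nd$tstop[-nrow(nd)])
gaps_ok

# With the fitted model from the question, the call would then be (not run here):
# survfit(cfit, newdata = nd, id = id)
```

Each additional patient would get its own id value and its own block of rows in the same data frame.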

A reproducible example in R to my comment above:
## Install package that has a dataset with data applicable for this problem
install.packages("ipw")
library(ipw)
library(survival)  # provides coxph(), Surv() and survfit()
## Load data
head(haartdat,n=100)
## Fit model, with time varying covariate cd4.sqrt
model.2 <- coxph(Surv(tstart, fuptime, event) ~ sex + age + cd4.sqrt + cluster(patient), data = haartdat)
## Create dataframe of variables (one row)
covs <- data.frame(age = 25,sex = 1,cd4.sqrt = 24)
covs
## Get survival probabilities for these variables at baseline
summary(survfit(model.2, newdata = covs, type = "aalen"))
## Now create two 'newdata' sets of covariates, for time points up to 1900
## covs.2 the data is same at all 20 time points
covs.2 <- data.frame(patient = rep(1, 20), covs[rep(1, 20), ], row.names = NULL)
covs.2$tstart <- seq(-100,1800,100)
covs.2$fuptime <- seq(0,1900,100)
covs.2$event <- rep(0,20)
## covs.3 has varying cd4.sqrt
covs.3 <- data.frame(patient = rep(2, 20), covs[rep(1, 20), ], row.names = NULL)
covs.3$tstart <- seq(-100,1800,100)
covs.3$fuptime <- seq(0,1900,100)
covs.3$event <- rep(0,20)
covs.3$cd4.sqrt <- seq(20.25,25,0.25)
## Combine into one dataset
covs.4 <- rbind(covs.2,covs.3)
covs.4
## Create survival probabilities, with the id = argument
summary(survfit(model.2, newdata = covs.4, type = "aalen", id = patient))

Related

Non-parametric bootstrapping to generate 95% Confidence Intervals for fixed effect coefficients calculated by a glmer with nested random effects

I have an R coding question.
This is my first time asking a question here, so apologies if I am unclear or do something wrong.
I am trying to use a Generalized Linear Mixed Model (GLMM) with Poisson error family to test for any significant effect on a count response variable by three separate dichotomous variables (AGE = ADULT or JUVENILE, SEX = MALE or FEMALE and MEDICATION = NEW or OLD) and an interaction between AGE and MEDICATION (AGE:MEDICATION).
There is some dependency in my data in that the data was collected from a total of 22 different sites (coded as SITE vector with 33 distinct levels), and the data was collected over a total of 21 different years (coded as YEAR vector with 21 distinct levels, and treated as a categorical variable). Unfortunately, every SITE was not sampled for each YEAR, with some being sampled for a greater number of years than others.
The data is also quite sparse, in that I do not have a great number of measurements of the response variable (coded as COUNT and an integer vector) per SITE per YEAR.
My Poisson GLMM is constructed using the following code:
model <- glmer(data = mydata,
family = poisson(link = "log"),
formula = COUNT ~ SEX + SEX:MEDICATION + AGE + AGE:SEX + MEDICATION + AGE:MEDICATION + (1|SITE/YEAR),
offset = log(COUNT.SAMPLE.SIZE),
nAGQ = 0)
In order to try and obtain more reliable estimates for the fixed effect coefficients (particularly given the sparse nature of my data), I am trying to obtain 95% confidence intervals for the fixed effect coefficients through non-parametric bootstrapping.
I have come across the "glmmboot" package, which can be used to conduct non-parametric bootstrapping of GLMMs. I tried to run the non-parametric bootstrapping using the following code:
library(glmmboot)
bootstrap_model(base_model = model,
base_data = mydata,
resamples = 1000)
When I run this code, I receive the following message:
Performing case resampling (no random effects)
Naturally, though, my model does have random effects, namely (1|SITE/YEAR).
If I try to tell the function to resample from a specific block, by adding in the "resample_specific_blocks" argument, i.e.:
library(glmmboot)
bootstrap_model(base_model = model,
base_data = mydata,
resamples = 1000,
resample_specific_blocks = "YEAR")
Then I get the following error message:
Performing block resampling, over SITE
Error: Invalid grouping factor specification, YEAR:SITE
I get a similar error message if I try to set 'resample_specific_blocks' to "SITE".
If I then try to set 'resample_specific_blocks' to "SITE:YEAR" or "SITE/YEAR" I get the following error message:
Error in bootstrap_model(base_model = model, base_data = mydata, resamples = 1000, :
No random columns from formula found in resample_specific_blocks
I have tried explicitly nesting YEAR within SITE and then adapting the model accordingly using the code:
mydata <- within(mydata, SAMPLE <- factor(SITE:YEAR))
model.refit <- glmer(data = mydata,
family = poisson(link = "log"),
formula = COUNT ~ SEX + AGE + MEDICATION + AGE:MEDICATION + (1|SITE) + (1|SAMPLE),
offset = log(COUNT.SAMPLE.SIZE),
nAGQ = 0)
bootstrap_model(base_model = model.refit,
base_data = mydata,
resamples = 1000,
resample_specific_blocks = "SAMPLE")
But unfortunately I just get this error message:
Error: Invalid grouping factor specification, SITE
The same error message comes up if I set resample_specific_blocks argument to SITE, or if I just remove the resample_specific_blocks argument.
I believe that the case_bootstrap() function found in the lmeresampler package could potentially be another option, but according to its help page I would need to write a function, and I unfortunately have no experience with creating my own functions within R.
If anyone has any suggestions on how I can get the bootstrap_model() function in the glmmboot package to recognise the random effects in my model/dataframe, or any suggestions for alternative methods on conducting non-parametric bootstrapping to create 95% confidence intervals for the coefficients of the fixed effects in my model, it would be greatly appreciated! Many thanks in advance, and for reading such a lengthy question!
For reference, I include links to the RDocumentation and GitHub for the glmmboot package:
https://www.rdocumentation.org/packages/glmmboot/versions/0.6.0
https://github.com/ColmanHumphrey/glmmboot
The following is code that will allow for creation of a reproducible example using the data set from lme4::grouseticks
#Load in required packages
library(tidyverse)
library(lme4)
library(glmmboot)
library(psych)
#Load in the grouseticks dataframe
data("grouseticks")
tibble(grouseticks)
#Create dummy vectors for SEX, AGE and MEDICATION
set.seed(1)
SEX <-sample(1:2, size = 403, replace = TRUE)
SEX <- as.factor(ifelse(SEX == 1, "MALE", "FEMALE"))
set.seed(2)
AGE <- sample(1:2, size = 403, replace = TRUE)
AGE <- as.factor(ifelse(AGE == 1, "ADULT", "JUVENILE"))
set.seed(3)
MEDICATION <- sample(1:2, size = 403, replace = TRUE)
MEDICATION <- as.factor(ifelse(MEDICATION == 1, "OLD", "NEW"))
grouseticks$SEX <- SEX
grouseticks$AGE <- AGE
grouseticks$MEDICATION <- MEDICATION
#Use the INDEX vector to create a vector of sample sizes per LOCATION
#per YEAR
grouseticks$INDEX <- 1
sample.sizes <- grouseticks %>%
group_by(LOCATION, YEAR) %>%
summarise(SAMPLE.SIZE = sum(INDEX))
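#Combine the dataframes together into the dataframe to be used in the
#model (the join step was missing from the post; a left join on
#LOCATION and YEAR is assumed here)
mydata <- left_join(grouseticks, sample.sizes, by = c("LOCATION", "YEAR"))
mydata$SAMPLE.SIZE <- as.integer(mydata$SAMPLE.SIZE)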
#Create the Poisson GLMM model
model <- glmer(data = mydata,
family = poisson(link = "log"),
formula = TICKS ~ SEX + AGE + MEDICATION + AGE:MEDICATION + (1|LOCATION/YEAR),
nAGQ = 0)
#Attempt non-parametric bootstrapping on the model to get 95%
#confidence intervals for the coefficients of the fixed effects
set.seed(1)
Model.bootstrap <- bootstrap_model(base_model = model,
base_data = mydata,
resamples = 1000)
Model.bootstrap
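One alternative that does not depend on glmmboot is a hand-rolled cluster bootstrap: draw whole sites with replacement, relabel them, refit, and take percentile intervals of the coefficients. The sketch below shows the idea under stated assumptions: the helper name resample_clusters, the toy data, and the plain Poisson GLM used as a stand-in model are all invented for illustration. For the GLMM in the question, the refit step would presumably become something like fixef(update(model, data = boot_data)), which is not run here.

```r
# Resample whole clusters (e.g. sites) with replacement and relabel them,
# so that a cluster drawn twice is treated as two distinct clusters.
resample_clusters <- function(data, cluster) {
  ids   <- unique(as.character(data[[cluster]]))
  drawn <- sample(ids, length(ids), replace = TRUE)
  parts <- lapply(seq_along(drawn), function(i) {
    block <- data[as.character(data[[cluster]]) == drawn[i], , drop = FALSE]
    block[[cluster]] <- paste0("boot", i)  # new label for this draw
    block
  })
  out <- do.call(rbind, parts)
  out[[cluster]] <- factor(out[[cluster]])
  out
}

# Toy data (invented): 10 sites with 8 counts each
set.seed(1)
toy <- data.frame(SITE = factor(rep(1:10, each = 8)),
                  AGE  = factor(rep(c("ADULT", "JUVENILE"), times = 40)))
toy$COUNT <- rpois(nrow(toy), lambda = ifelse(toy$AGE == "ADULT", 4, 2))

# Stand-in model: a plain Poisson GLM keeps the sketch free of extra packages.
# Refit on each resampled data set and collect the coefficients.
boot_coefs <- t(replicate(200, {
  boot_data <- resample_clusters(toy, "SITE")
  coef(glm(COUNT ~ AGE, family = poisson, data = boot_data))
}))

# Percentile 95% intervals, one column per coefficient
ci <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
ci
```

Because entire sites are resampled, all years belonging to a site travel with it, so a YEAR-within-SITE nesting is preserved in each bootstrap data set. With sparse data some GLMM refits may fail to converge, so wrapping the refit in tryCatch() may be worth considering.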

Predict on test data, using plm package in R, and calculate RMSE for test data

I built a model, using plm package. The sample dataset is here.
I am trying to predict on test data and calculate metrics.
# Import package
library(plm)
library(tidyverse)
library(prediction)
library(nlme)
# Import data
df <- read_csv('Panel data sample.csv')
# Convert author to character
df$Author <- as.character(df$Author)
# Split data into train and test
df_train <- df %>% filter(Year != 2020) # 2017, 2018, 2019
df_test <- df %>% filter(Year == 2020) # 2020
# Convert data
panel_df_train <- pdata.frame(df_train, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
panel_df_test <- pdata.frame(df_test, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
# Create the first model
plmFit1 <- plm(Score ~ Articles, data = panel_df_train)
# Print
summary(plmFit1)
# Get the RMSE for train data
sqrt(mean(plmFit1$residuals^2))
# Get the MSE for train data
mean(plmFit1$residuals^2)
Now I am trying to calculate metrics for test data
First, I tried to use prediction() from prediction package, which has an option for plm.
predictions <- prediction(plmFit1, panel_df_test)
Got an error:
Error in crossprod(beta, t(X)) : non-conformable arguments
I read the following questions: One, Two, Three, Four.
I also read this question, but
fitted <- as.numeric(plmFit1$model[[1]] - plmFit1$residuals) gives me a different number of values from my train or test numbers.
Regarding out-of-sample prediction with fixed effects models, it is not clear how data relating to fixed effects not in the original model should be treated, e.g., data for an individual not contained in the original data set on which the model was estimated. (This is a methodological question rather than a programming question.)
Version 2.6-2 of plm allows predict() for fixed effect models with the original data and with out-of-sample data (see ?predict.plm).
Below is an example in which 10 firms are used for model estimation, while the data used for prediction contain a firm that is not in the original data set (there are also years not contained in the original model object, but these are irrelevant here as it is a one-way individual model). It is unclear what the fixed effect of that out-of-sample firm would be. Hence, by default, no predicted value is given (NA value). If argument na.fill is set to TRUE, the (weighted) mean of the fixed effects contained in the original model object is used as a best guess.
library(plm)
data("Grunfeld", package = "plm")
# fit a fixed effect model
fit.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# generate 55 new observations of three firms used for prediction:
# * firm 1 with years 1935:1964 (has out-of-sample years 1955:1964),
# * firm 2 with years 1935:1949 (all in sample),
# * firm 11 with years 1935:1944 (firm 11 is out-of-sample)
set.seed(42L)
new.value2 <- runif(55, min = min(Grunfeld$value), max = max(Grunfeld$value))
new.capital2 <- runif(55, min = min(Grunfeld$capital), max = max(Grunfeld$capital))
newdata <- data.frame(firm = c(rep(1, 30), rep(2, 15), rep(11, 10)),
year = c(1935:(1935+29), 1935:(1935+14), 1935:(1935+9)),
value = new.value2, capital = new.capital2)
# make pdata.frame
newdata.p <- pdata.frame(newdata, index = c("firm", "year"))
## predict from fixed effect model with new data as pdata.frame
predict(fit.fe, newdata = newdata.p) # has NA values for the 11th firm
## set na.fill = TRUE to have the weighted mean of the fixed effects used -> no NA values
predict(fit.fe, newdata = newdata.p, na.fill = TRUE)
NB: When you input a plain data.frame as newdata, it is not clear how the data relate to the individuals and time periods, which is why the weighted mean of fixed effects from the original model object is used for all observations in newdata and a warning is printed. For fixed effect model prediction, it is reasonable to assume the user can provide information (via a pdata.frame) on how the data to be used for prediction relate to the individual and time dimensions of the panel data.

How to plot survival relative to general population with age on the X-axis (left-truncated data)?

I am trying to compare the survival in my study cohort with the survival in the Dutch general population (matched for age and sex). I created a rate table of the Dutch population.
library(relsurv)
setwd("")
nldpop <- transrate.hmd("mltper_1x1.txt","fltper_1x1.txt")
Then, I wanted to create a plot of the survival of my cohort (observed) and the survival of the population (expected) with age on the X-axis. However, the 'survexp' function does not seem to support a (start,stop,event)-format. Only with the normal (futime, event)-format it works, see below, but then I have follow-up time on the X-axis. Does anyone know how to get the age on the X-axis instead of follow-up time?
# Observed and expected survival with time on X-axis
fit <- survfit(Surv(futime, event)~1)
efit <- survexp(futime ~ 1, rmap = list(year=(date_entry), age=(age_entry), sex=(sex)),
ratetable=nldpop)
plot(fit)
lines(efit)
You didn't provide your example data, so I used the survival::mgus data for this. Your problem may be due to incorrectly specifying variable names in the rmap option.
library(relsurv)
library(dplyr)  # for %>% and mutate()
nldpop <- transrate.hmd("mltper_1x1.txt", "fltper_1x1.txt")
mgus2 <- mgus %>% mutate(date_year = dxyr + 1900)
fit <- survfit(Surv(futime, death) ~ 1, data = mgus2)
efit <- survexp(Surv(futime, death) ~ 1, data = mgus2,
ratetable = nldpop, rmap = list(age = age*365.25, year = date_year, sex = sex))
plot(fit)
lines(efit)

bootMer output of lme4 linear mixed model: correct structure? at what level am I bootstrapping?

I have a linear mixed model that describes the relation between a continuous outcome and the interaction between a binary grouping variable (subjects are 1 out of 2 genotypes) and a time variable (5 consecutive days). Every subject has repeated measures within a day. I have 47 subjects.
Some simulated data:
Genotype <- as.factor(append(rep("X", 230), rep("Y", 240)))
Day <- as.factor(rep(c(1:5), 94))
Subject <- as.factor(rep(c(1:47), times=1, each=10))
jitter <- rep(jitter(0.01*c(1:47)), times=1, each=10)
Outcome <- append(rnorm(230, mean=80, sd=16), rnorm(240, mean=85, sd=18)) + jitter
Data <- data.frame(Genotype, Day, Subject, Outcome)
The model is coded as:
library(lme4)
Lmer <- lmer(Outcome ~ Genotype*Day + (1|Subject),
data = Data)
I want to perform bootstrapping. If I calculate bootstrapping results for a specific genotype on Day 1, I use this code:
new_dat <- subset(Data, Genotype=="X" & Day=="1")
b <- bootMer(Lmer, nsim=200,
FUN=function(x)predict(x, newdata=new_dat, re.form=NA))
b$t now gives me a matrix of 46 columns (=N-1), each with 200 identical observations.
My goal is to bootstrap on the subject level, should I then not have only 200 bootstrapping values (thus not 46*200 values)? Am I overlooking something in my code?
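One possible reading, sketched under assumptions: with re.form=NA the prediction uses only the fixed effects, so every row of new_dat that shares the same Genotype and Day receives the same predicted value, and the number of columns in b$t simply equals the number of rows in new_dat. Passing a one-row newdata gives a single column of nsim values. Note also that bootMer by default performs a parametric bootstrap (it simulates new responses from the fitted model) rather than resampling subjects. The data below are simulated with arbitrary values, loosely following the question.

```r
library(lme4)

# Simulated data, loosely following the question (all values are arbitrary)
set.seed(123)
Data <- data.frame(
  Genotype = factor(rep(c("X", "Y"), times = c(230, 240))),
  Day      = factor(rep(1:5, times = 94)),
  Subject  = factor(rep(1:47, each = 10))
)
subj_eff <- rnorm(47, sd = 5)  # subject-level random intercepts
Data$Outcome <- ifelse(Data$Genotype == "X", 80, 85) +
  subj_eff[as.integer(Data$Subject)] +
  rnorm(470, sd = 16)

Lmer <- lmer(Outcome ~ Genotype * Day + (1 | Subject), data = Data)

# One row is enough: with re.form = NA the prediction depends only on
# Genotype and Day
new_dat <- data.frame(Genotype = factor("X", levels = levels(Data$Genotype)),
                      Day      = factor("1", levels = levels(Data$Day)))

b <- bootMer(Lmer, nsim = 50,
             FUN = function(x) predict(x, newdata = new_dat, re.form = NA))

dim(b$t)                                     # nsim rows, one column
quantile(b$t[, 1], probs = c(0.025, 0.975))  # percentile interval for the prediction
```

nsim = 50 is kept small only so that the sketch runs quickly; a real analysis would use many more replicates.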

How to add baseline values of outcome at time = 0 as a fixed effect in mixed effect model?

My study was a 12-week study involving two diets (diet). Outcome measurements such as weight, waist circumference etc. were taken at weeks 0, 6, and 12 (time). The variable code identifies subjects.
I tried the following code to create a new variable called weight0 to only pick outcome values at time = 0.
weight0 <- dat2$weight[dat2$time==0]
I created the following model.
p1 <- lme4::lmer(weight ~ diet * time + weight0 + weight0 * time + (time | code), REML = FALSE, data = dat2)
Error in model.frame.default(data = dat2, drop.unused.levels = TRUE, formula = weight ~ :
variable lengths differ (found for 'weight0')
I think I understand what is going on. The weight0 vector is not the same length as the outcome weight. Is this correct? How do I correct for this and use baseline as a fixed effect?
Thanks for taking the time to read this.
You can do the following to define weight0:
dat2 <- dat2[order(dat2$code, dat2$time), ]
dat2$weight0 <- with(dat2, ave(weight, code, FUN = function (x) x[1]))
and then fit the model.
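To see what those two lines do, here is a small sketch on invented data (three hypothetical subjects, deliberately shuffled). It assumes every subject has a measurement at time 0, so that after sorting the first value per subject is the baseline.

```r
# Toy long-format data (invented values): 3 subjects measured at weeks 0, 6, 12
dat2 <- data.frame(
  code   = rep(c("A", "B", "C"), each = 3),
  time   = rep(c(0, 6, 12), times = 3),
  weight = c(80, 78, 77,  95, 94, 90,  70, 70, 69)
)

# Shuffle the rows to show why sorting first matters
dat2 <- dat2[c(9, 2, 4, 1, 6, 8, 3, 5, 7), ]

# Sort by subject and time, then copy each subject's first weight to all its rows
dat2 <- dat2[order(dat2$code, dat2$time), ]
dat2$weight0 <- with(dat2, ave(weight, code, FUN = function(x) x[1]))

dat2
```

Because weight0 is now a column of dat2 with one value per row, it has the same length as weight and can be used directly in the model formula.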
