I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third data
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender into the calculation and combine these two algorithm to make a loop function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and If sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
I mean: Read the first record in df1 if it is Female (with age=4.1) the time=fM.F(its age=4.1) but if the gender is Male then to calculate Time apply fM.M on its age so time=fM.M(4.1)
You can create a function that takes the Age vector, the sex value, and the male and female specific dataframes, and selects the frame to use based on the sex value.
f <- function(age, s, m,f) {
if(is.na(s)) return(NA)
if(s==0) df = m
else df = f
last = dim(df)[1]
fM<-approxfun(df$Age2, df$Mspline, yleft = df$Mspline[1] , yright = df$Mspline[last])
fM(age)
}
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
group_by(sex) %>%
mutate(time = f(Age, pull(cur_group(),sex), df2.M, df2.F))
Output:
Age R sex time
<dbl> <dbl> <dbl> <dbl>
1 2 2 1 -0.186
2 2.1 2.1 0 1.02
3 2.2 2.2 1 -1.55
4 3.4 3.4 1 -0.461
5 3.5 3.5 1 0.342
6 4.2 4.2 1 -0.560
7 4.7 4.7 1 -0.114
8 4.8 4.8 0 0.247
9 5 5 0 -0.510
10 5.6 5.6 0 -0.982
11 NA NA NA NA
12 5.9 5.9 0 -0.231
13 NA NA 1 NA
I am facing a problem I do not manage to solve. I would like to use nlme or nlmODE to perform a non linear regression with random effect using as a model the solution of a second order differential equation with fixed coefficients (a damped oscillator).
I manage to use nlme with simple models, but it seems that the use of deSolve to generate the solution of the differential equation causes a problem. Below an example, and the problems I face.
The data and functions
Here is the function to generate the solution of the differential equation using deSolve:
library(deSolve)
ODE2_nls <- function(t, y, parms) {
S1 <- y[1]
dS1 <- y[2]
dS2 <- dS1
dS1 <- - parms["esp2omega"]*dS1 - parms["omega2"]*S1 + parms["omega2"]*parms["yeq"]
res <- c(dS2,dS1)
list(res)}
solution_analy_ODE2 = function(omega2,esp2omega,time,y0,v0,yeq){
parms <- c(esp2omega = esp2omega,
omega2 = omega2,
yeq = yeq)
xstart = c(S1 = y0, dS1 = v0)
out <- lsoda(xstart, time, ODE2_nls, parms)
return(out[,2])
}
I can generate a solution for a given period and damping factor, as for example here a period of 20 and a slight damping of 0.2:
# small example:
time <- 1:100
period <- 20 # period of oscillation
amort_factor <- 0.2
omega <- 2*pi/period # agular frequency
oscil <- solution_analy_ODE2(omega^2,amort_factor*2*omega,time,1,0,0)
plot(time,oscil)
Now I generate a panel of 10 individuals with a random starting phase (i.e. different starting position and velocity). The goal is to perform a non linear regression with random effect on the starting values
library(data.table)
# generate panel
Npoint <- 100 # number of time poitns
Nindiv <- 10 # number of individuals
period <- 20 # period of oscillation
amort_factor <- 0.2
omega <- 2*pi/period # agular frequency
# random phase
phase <- sample(seq(0,2*pi,0.01),Nindiv)
# simu data:
data_simu <- data.table(time = rep(1:Npoint,Nindiv), ID = rep(1:Nindiv,each = Npoint))
# signal generation
data_simu[,signal := solution_analy_ODE2(omega2 = omega^2,
esp2omega = 2*0.2*omega,
time = time,
y0 = sin(phase[.GRP]),
v0 = omega*cos(phase[.GRP]),
yeq = 0)+
rnorm(.N,0,0.02),by = ID]
If we have a look, we have a proper dataset:
library(ggplot2)
ggplot(data_simu,aes(time,signal,color = ID))+
geom_line()+
facet_wrap(~ID)
The problems
Using nlme
Using nlme with similar syntax working on simpler examples (non linear functions not using deSolve), I tried:
fit <- nlme(model = signal ~ solution_analy_ODE2(esp2omega,omega2,time,y0,v0,yeq),
data = data_simu,
fixed = esp2omega + omega2 + y0 + v0 + yeq ~ 1,
random = y0 ~ 1 ,
groups = ~ ID,
start = c(esp2omega = 0.08,
omega2 = 0.04,
yeq = 0,
y0 = 1,
v0 = 0))
I obtain:
Error in checkFunc(Func2, times, y, rho) : The number of derivatives returned by func() (2) must equal the length of the initial conditions vector (2000)
The traceback:
12. stop(paste("The number of derivatives returned by func() (", length(tmp[[1]]), ") must equal the length of the initial conditions vector (", length(y), ")", sep = ""))
11. checkFunc(Func2, times, y, rho)
10. lsoda(xstart, time, ODE2_nls, parms)
9. solution_analy_ODE2(omega2, esp2omega, time, y0, v0, yeq)
.
.
I looks like nlme is trying to pass a vector of starting condition to solution_analy_ODE2, and causes an error in checkFunc from lasoda.
I tried using nlsList:
test <- nlsList(model = signal ~ solution_analy_ODE2(omega2,esp2omega,time,y0,v0,yeq) | ID,
data = data_simu,
start = list(esp2omega = 0.08, omega2 = 0.04,yeq = 0,
y0 = 1,v0 = 0),
control = list(maxiter=150, warnOnly=T,minFactor = 1e-10),
na.action = na.fail, pool = TRUE)
head(test)
Call:
Model: signal ~ solution_analy_ODE2(omega2, esp2omega, time, y0, v0, yeq) | ID
Data: data_simu
Coefficients:
esp2omega omega2 yeq y0 v0
1 0.1190764 0.09696076 0.0007577956 -0.1049423 0.30234654
2 0.1238936 0.09827158 -0.0003463023 0.9837386 0.04773775
3 0.1280399 0.09853310 -0.0004908579 0.6051663 0.25216134
4 0.1254053 0.09917855 0.0001922963 -0.5484005 -0.25972829
5 0.1249473 0.09884761 0.0017730823 0.7041049 0.22066652
6 0.1275408 0.09966155 -0.0017522320 0.8349450 0.17596648
We can see that te non linear fit works well on individual signals. Now if I want to perform a regression of the dataset with random effects, the syntax should be:
fit <- nlme(test,
random = y0 ~ 1 ,
groups = ~ ID,
start = c(esp2omega = 0.08,
omega2 = 0.04,
yeq = 0,
y0 = 1,
v0 = 0))
But I obtain the exact same error message.
I then tried using nlmODE, following Bne Bolker's comment on a similar question I asked some years ago
using nlmODE
library(nlmeODE)
datas_grouped <- groupedData( signal ~ time | ID, data = data_simu,
labels = list (x = "time", y = "signal"),
units = list(x ="arbitrary", y = "arbitrary"))
modelODE <- list( DiffEq = list(dS2dt = ~ S1,
dS1dt = ~ -esp2omega*S1 - omega2*S2 + omega2*yeq),
ObsEq = list(yc = ~ S2),
States = c("S1","S2"),
Parms = c("esp2omega","omega2","yeq","ID"),
Init = c(y0 = 0,v0 = 0))
resnlmeode = nlmeODE(modelODE, datas_grouped)
assign("resnlmeode", resnlmeode, envir = .GlobalEnv)
#Fitting with nlme the resulting function
model <- nlme(signal ~ resnlmeode(esp2omega,omega2,yeq,time,ID),
data = datas_grouped,
fixed = esp2omega + omega2 + yeq + y0 + v0 ~ 1,
random = y0 + v0 ~1,
start = c(esp2omega = 0.08,
omega2 = 0.04,
yeq = 0,
y0 = 0,
v0 = 0)) #
I get the error:
Error in resnlmeode(esp2omega, omega2, yeq, time, ID) : object 'yhat' not found
Here I don't understand where the error comes from, nor how to solve it.
Questions
Can you reproduce the problem ?
Does anyone have an idea to solve this problem, using either nlme or nlmODE ?
If not, is there a solution using an other package ? I saw nlmixr (https://cran.r-project.org/web/packages/nlmixr/index.html), but I don't know it, the instalation is complicated and it was recently remove from CRAN
Edits
#tpetzoldt suggested a nice way to debug nlme behavior, and it surprised me a lot. Here is a working example with a non linear function, where I generate a set of 5 individual with a random parameter varying between individuals :
reg_fun = function(time,b,A,y0){
cat("time : ",length(time)," b :",length(b)," A : ",length(A)," y0: ",length(y0),"\n")
out <- A*exp(-b*time)+(y0-1)
cat("out : ",length(out),"\n")
tmp <- cbind(b,A,y0,time,out)
cat(apply(tmp,1,function(x) paste(paste(x,collapse = " "),"\n")),"\n")
return(out)
}
time <- 0:10*10
ramdom_y0 <- sample(seq(0,1,0.01),10)
Nid <- 5
data_simu <-
data.table(time = rep(time,Nid),
ID = rep(LETTERS[1:Nid],each = length(time)) )[,signal := reg_fun(time,0.02,2,ramdom_y0[.GRP]) + rnorm(.N,0,0.1),by = ID]
The cats in the function give here:
time : 11 b : 1 A : 1 y0: 1
out : 11
0.02 2 0.64 0 1.64
0.02 2 0.64 10 1.27746150615596
0.02 2 0.64 20 0.980640092071279
0.02 2 0.64 30 0.737623272188053
0.02 2 0.64 40 0.538657928234443
0.02 2 0.64 50 0.375758882342885
0.02 2 0.64 60 0.242388423824404
0.02 2 0.64 70 0.133193927883213
0.02 2 0.64 80 0.0437930359893108
0.02 2 0.64 90 -0.0294022235568269
0.02 2 0.64 100 -0.0893294335267746
.
.
.
Now I do with nlme:
nlme(model = signal ~ reg_fun(time,b,A,y0),
data = data_simu,
fixed = b + A + y0 ~ 1,
random = y0 ~ 1 ,
groups = ~ ID,
start = c(b = 0.03, A = 1,y0 = 0))
I get:
time : 55 b : 55 A : 55 y0: 55
out : 55
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
time : 55 b : 55 A : 55 y0: 55
out : 55
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
...
So nlme binds 5 time (the number of individual) the time vector and pass it to the function, with the parameters repeated the same number of time. Which is of course not compatible with the way lsoda and my function works.
It seems that the ode model is called with a wrong argument, so that it gets a vector with 2000 state variables instead of 2. Try the following to see the problem:
ODE2_nls <- function(t, y, parms) {
cat(length(y),"\n") # <----
S1 <- y[1]
dS1 <- y[2]
dS2 <- dS1
dS1 <- - parms["esp2omega"]*dS1 - parms["omega2"]*S1 + parms["omega2"]*parms["yeq"]
res <- c(dS2,dS1)
list(res)
}
Edit: I think that the analytical function worked, because it is vectorized, so you may try to vectorize the ode function, either by iterating over the ode model or (better) internally using vectors as state variables. As ode is fast in solving systems with several 100k equations, 2000 should be feasible.
I guess that both, states and parameters from nlme are passed as vectors. The state variable of the ode model is then a "long" vector, the parameters can be implemented as a list.
Here an example (edited, now with parameters as list):
ODE2_nls <- function(t, y, parms) {
#cat(length(y),"\n")
#cat(length(parms$omega2))
ndx <- seq(1, 2*N-1, 2)
S1 <- y[ndx]
dS1 <- y[ndx + 1]
dS2 <- dS1
dS1 <- - parms$esp2omega * dS1 - parms$omega2 * S1 + parms$omega2 * parms$yeq
res <- c(dS2, dS1)
list(res)
}
solution_analy_ODE2 = function(omega2, esp2omega, time, y0, v0, yeq){
parms <- list(esp2omega = esp2omega, omega2 = omega2, yeq = yeq)
xstart = c(S1 = y0, dS1 = v0)
out <- ode(xstart, time, ODE2_nls, parms, atol=1e-4, rtol=1e-4, method="ode45")
return(out[,2])
}
Then set (or calculate) the number of equations, e.g. N <- 1 resp. N <-1000 before the calls.
The model runs through this way, before running in numerical issues, but that's another story ...
You may then try to use another ode solver (e.g. vode), set atoland rtol to lower values, tweak nmle's optimization parameters, use box constraints ... and so on, as usual in nonlinear optimization.
I found a solution hacking nlme behavior: as shown in my edit, the problem comes from the fact that nlme passes a vector of NindividualxNpoints to the nonlinear function, supposing that the function associates for each time point a value. But lsoda don't do that, as it integrates an equation along time (i.e. it need all time until a given time poit to produce a value).
My solution consists in decomposing the parameters that nlme passes to my function, make the calculation, and re-create a vector:
detect_id <- function(vec){
tmp <- c(0,diff(vec))
out <- tmp
out <- NA
out[tmp < 0] <- 1:sum(tmp < 0)
out <- na.locf(out,na.rm = F)
rleid(out)
}
detect_id decompose the time vector into single time vectors identificator:
detect_id(rep(1:10,3))
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
And then, the function doing the numeric integration loop over each individuals, and bind the resulting vectors together:
solution_analy_ODE2_modif = function(omega2,esp2omega,time,y0,v0,yeq){
tmp <- detect_id(time)
out <- lapply(unique(tmp),function(i){
idxs <- which(tmp == i)
parms <- c(esp2omega = esp2omega[idxs][1],
omega2 = omega2[idxs][1],
yeq = yeq[idxs][1])
xstart = c(S1 = y0[idxs][1], dS1 = v0[idxs][1])
out_tmp <- lsoda(xstart, time[idxs], ODE2_nls, parms)
out_tmp[,2]
}) %>% unlist()
return(out)
}
It I make a test, where I pass a vector similar to whats nlme passes to the function:
omega2vec <- rep(0.1,30)
eps2omegavec <- rep(0.1,30)
timevec <- rep(1:10,3)
y0vec <- rep(1,30)
v0vec <- rep(0,30)
yeqvec = rep(0,30)
solution_analy_ODE2_modif(omega2 = omega2vec,
esp2omega = eps2omegavec,
time = timevec,
y0 = y0vec,
v0 = v0vec,
yeq = yeqvec)
[1] 1.0000000 0.9520263 0.8187691 0.6209244 0.3833110 0.1321355 -0.1076071 -0.3143798
[9] -0.4718058 -0.5697255 1.0000000 0.9520263 0.8187691 0.6209244 0.3833110 0.1321355
[17] -0.1076071 -0.3143798 -0.4718058 -0.5697255 1.0000000 0.9520263 0.8187691 0.6209244
[25] 0.3833110 0.1321355 -0.1076071 -0.3143798 -0.4718058 -0.5697255
It works. It would not work with #tpetzoldt method, because the time vector passes from 10 to 0, which would cause integration problems. Here I really need to hack the way nlnme works.
Now :
fit <- nlme(model = signal ~ solution_analy_ODE2_modif (esp2omega,omega2,time,y0,v0,yeq),
data = data_simu,
fixed = esp2omega + omega2 + y0 + v0 + yeq ~ 1,
random = y0 ~ 1 ,
groups = ~ ID,
start = c(esp2omega = 0.5,
omega2 = 0.5,
yeq = 0,
y0 = 1,
v0 = 1))
works like a charm
summary(fit)
Nonlinear mixed-effects model fit by maximum likelihood
Model: signal ~ solution_analy_ODE2_modif(omega2, esp2omega, time, y0, v0, yeq)
Data: data_simu
AIC BIC logLik
-597.4215 -567.7366 307.7107
Random effects:
Formula: list(y0 ~ 1, v0 ~ 1)
Level: ID
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
y0 0.61713329 y0
v0 0.67815548 -0.269
Residual 0.03859165
Fixed effects: esp2omega + omega2 + y0 + v0 + yeq ~ 1
Value Std.Error DF t-value p-value
esp2omega 0.4113068 0.00866821 186 47.45002 0.0000
omega2 1.0916444 0.00923958 186 118.14876 0.0000
y0 0.3848382 0.19788896 186 1.94472 0.0533
v0 0.1892775 0.21762610 186 0.86974 0.3856
yeq 0.0000146 0.00283328 186 0.00515 0.9959
Correlation:
esp2mg omega2 y0 v0
omega2 0.224
y0 0.011 -0.008
v0 0.005 0.030 -0.269
yeq -0.091 -0.046 0.009 -0.009
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.2692477 -0.6122453 0.1149902 0.6460419 3.2890201
Number of Observations: 200
Number of Groups: 10
I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -
cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
bestglm(Xy = winedata,
family = binomial, # binomial family for logistic
IC = "AIC", # Information criteria
method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)
However, this gives the error -
Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"
I thought that $BestModel is the lm-object that represents the best fit, and that's what manual also says. If that's the case, then why cant I find the test error on it using 10 fold CV, with the help of cv.glm?
The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and the package used is the boot package for cv.glm, and the bestglm package.
The data was processed as -
winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good" #rename 'quality' to 'good'
bestglm fit rearranges your data and name your response variable as y, hence if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that
It's always good to check what is the class:
class(res.best.logistic$BestModel)
[1] "glm" "lm"
But if you look at the call of res.best.logistic$BestModel:
res.best.logistic$BestModel$call
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
head(res.best.logistic$BestModel$model)
y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1 0 7.0 0.27 0.36 20.7 0.045
2 0 6.3 0.30 0.34 1.6 0.049
3 0 8.1 0.28 0.40 6.9 0.050
4 0 7.2 0.23 0.32 8.5 0.058
5 0 7.2 0.23 0.32 8.5 0.058
6 0 8.1 0.28 0.40 6.9 0.050
free.sulfur.dioxide density pH sulphates
1 45 1.0010 3.00 0.45
2 14 0.9940 3.30 0.49
3 30 0.9951 3.26 0.44
4 47 0.9956 3.19 0.40
5 47 0.9956 3.19 0.40
6 30 0.9951 3.26 0.44
You can substitute things in the call etc, but it's too much of a mess. Fitting is not costly, so make a fit on winedata and pass it to cv.glm:
best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
# take the variable names for best model
best_var = names(best_var[[1]])
new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
fit = glm(new_form,winedata,family="binomial")
best.cv.err<- cv.glm(winedata,fit,cost1, K=10)
I read that the R flexsurv package can also be used for modeling time-dependent covariates according to Christopher Jackson (2016) ["flexsurv: a platform for parametric survival modeling in R, Journal of Statistical Software, 70 (1)].
However, I was not able to figure out how, even after several adjustments and searches in online forums.
Before turning to the estimation of time-dependent covariates I tried to create a simple model with only time-independent covariates to test whether I specified the Surv object correctly. Here is a small example.
library(splitstackshape)
library(flexsurv)
## create sample data
n=50
set.seed(2)
t <- rpois(n,15)+1
x <- rnorm(n,t,5)
df <- data.frame(t,x)
df$id <- 1:n
df$rep <- df$t-1
Which looks like this:
t x id rep
1 12 17.696149 1 11
2 12 20.358094 2 11
3 11 2.058789 3 10
4 16 26.156213 4 15
5 13 9.484278 5 12
6 15 15.790824 6 14
...
And the long data:
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:n){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
Which looks like this:
t x id start stop censrec
1 12 17.69615 1 1 2 0
1.1 12 17.69615 1 2 3 0
1.2 12 17.69615 1 3 4 0
1.3 12 17.69615 1 4 5 0
1.4 12 17.69615 1 5 6 0
1.5 12 17.69615 1 6 7 0
1.6 12 17.69615 1 7 8 0
1.7 12 17.69615 1 8 9 0
1.8 12 17.69615 1 9 10 0
1.9 12 17.69615 1 10 11 0
1.10 12 17.69615 1 11 12 1
2 12 20.35809 2 1 2 0
...
Now I can estimate a simple Cox model to see whether it works:
coxph(Surv(t)~x,data=df)
This yields:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
And in the long format:
coxph(Surv(start,stop,censrec)~x,data=long.df)
I get:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
Taken together I conclude that my transformation into the long format was correct. Now, turning to the flexsurv framework:
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
yields:
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00086 4.05569 6.16631 0.53452 NA NA NA
scale NA 13.17215 11.27876 15.38338 1.04293 NA NA NA
x 15.13380 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
But
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
causes an error:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull") :
Initial value for parameter 1 out of range
Would anyone happen to know the correct syntax for the latter Surv object? If you use the correct syntax, do you get the same estimates?
Thank you very much,
best,
David
===============
EDIT AFTER FEEDBACK FROM 42
===============
library(splitstackshape)
library(flexsurv)
x<-c(8.136527, 7.626712, 9.809122, 12.125973, 12.031536, 11.238394, 4.208863, 8.809854, 9.723636)
t<-c(2, 3, 13, 5, 7, 37 ,37, 9, 4)
df <- data.frame(t,x)
#transform into long format for time-dependent covariates
df$id <- 1:length(df$t)
df$rep <- df$t-1
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:length(df$t)){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
coxph(Surv(t)~x,data=df)
coxph(Surv(start,stop,censrec)~x,data=long.df)
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull",inits=c(shape=.1, scale=1))
Which yields the same estimates for both coxph models but
Call:
flexsurvreg(formula = Surv(time = t) ~ x, data = df, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 1.0783 0.6608 1.7594 0.2694 NA NA NA
scale NA 27.7731 3.5548 216.9901 29.1309 NA NA NA
x 9.3012 -0.0813 -0.2922 0.1295 0.1076 0.9219 0.7466 1.1383
N = 9, Events: 9, Censored: 0
Total time at risk: 117
Log-likelihood = -31.77307, df = 3
AIC = 69.54614
and
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 0.8660 0.4054 1.8498 0.3353 NA NA NA
scale NA 24.0596 1.7628 328.3853 32.0840 NA NA NA
x 8.4958 -0.0912 -0.3563 0.1739 0.1353 0.9128 0.7003 1.1899
N = 108, Events: 9, Censored: 99
Total time at risk: 108
Log-likelihood = -30.97986, df = 3
AIC = 67.95973
Reading the error message:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull", :
initial values must be a numeric vector
And then reading the help page, ?flexsurvreg, it seemed as though an attempt at setting values for inits to a named numeric vector should be attempted:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull", inits=c(shape=.1, scale=1))
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00082 4.05560 6.16633 0.53454 NA NA NA
scale NA 13.17213 11.27871 15.38341 1.04294 NA NA NA
x 15.66145 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
N = 715, Events: 50, Censored: 665
Total time at risk: 715
Log-likelihood = -131.5721, df = 3
AIC = 269.1443
Extremely similar results. My guess was basically a stab in the dark, so I have no guidance on how to make a choice if this had not succeeded other than to "expand the search."
I just want to mention that in flexsurv v1.1.1, running this code:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
doesn't return any errors. It also gives the same estimates as the non time-varying command
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")