Specifying a prior distribution on a matrix in rstan

I am having trouble getting a Bayesian mixed-effects model to yield stationary, well-mixed chains. I created my own data, so I know what parameters the model should recover. Unfortunately, because the effective number of samples is so low and the Rhat values so high, the parameter estimates are complete nonsense.
The data is designed so there are 60 subjects, split into three groups (g1, g2, g3) of 20 subjects each. Each subject is exposed to 3 conditions (cond1, cond2, cond3). I designed the data so there is no difference among the groups, but there are differences among the conditions, with cond1 scoring 100 on average, cond2 scoring 75 on average, and cond3 scoring 125.
df <- data.frame(id = factor(rep(1:60, 3)),
                 group = factor(rep(c("g1", "g2", "g3"), each = 20, length.out = 180)),
                 condition = factor(rep(c("cond1", "cond2", "cond3"), each = 60)),
                 score = c(ceiling(rnorm(60, 100, 15)), ceiling(rnorm(60, 75, 15)), ceiling(rnorm(60, 125, 15))))
Here are the descriptives
library(dplyr)
df %>% group_by(group, condition) %>% summarise(m = mean(score), sd = sd(score))
#   group condition     m    sd
#   <fct> <fct>     <dbl> <dbl>
# 1 g1    cond1     108    12.4
# 2 g1    cond2      79.4  13.1
# 3 g1    cond3     128    11.5
# 4 g2    cond1     105    15.5
# 5 g2    cond2      71.6  10.6
# 6 g2    cond3     127    17.7
# 7 g3    cond1     106    13.3
# 8 g3    cond2      75.8  17.6
# 9 g3    cond3     124    14.5
Everything looks correct: the differences between conditions are preserved nicely across groups.
Now for the model. The model I am running has a grand mean, a parameter for group, a parameter for condition, a parameter for the group x condition interaction, and a subject parameter.
Here is the data list
##### Step 1: put data into a list
mixList <- list(N = nrow(df),
                nSubj = nlevels(df$id),
                nGroup = nlevels(df$group),
                nCond = nlevels(df$condition),
                nGxC = nlevels(df$group)*nlevels(df$condition),
                sIndex = as.integer(df$id),
                gIndex = as.integer(df$group),
                cIndex = as.integer(df$condition),
                score = df$score)
Now to build the model in rstan, saving the string as a .stan file using the cat() function
###### Step 2: build model
cat("
data{
int<lower=1> N;
int<lower=1> nSubj;
int<lower=1> nGroup;
int<lower=1> nCond;
int<lower=1,upper=nSubj> sIndex[N];
int<lower=1,upper=nGroup> gIndex[N];
int<lower=1,upper=nCond> cIndex[N];
real score[N];
}
parameters{
real a0;
vector[nGroup] bGroup;
vector[nCond] bCond;
vector[nSubj] bSubj;
matrix[nGroup,nCond] bGxC;
real<lower=0> sigma_s;
real<lower=0> sigma_g;
real<lower=0> sigma_c;
real<lower=0> sigma_gc;
real<lower=0> sigma;
}
model{
vector[N] mu;
bCond ~ normal(100, sigma_c);
bGroup ~ normal(100, sigma_g);
bSubj ~ normal(0, sigma_s);
sigma ~ cauchy(0,2)T[0,];
for (i in 1:N){
mu[i] = a0 + bGroup[gIndex[i]] + bCond[cIndex[i]] + bSubj[sIndex[i]] + bGxC[gIndex[i],cIndex[i]];
}
score ~ normal(mu, sigma);
}
", file = "mix.stan")
Next is to generate the chains in rstan
##### Step 3: generate the chains
mix <- stan(file = "mix.stan",
            data = mixList,
            iter = 2e3,
            warmup = 1e3,
            cores = 1,
            chains = 1)
And here is the output
###### Step 4: Diagnostics
print(mix, pars = c("a0", "bGroup", "bCond", "bGxC", "sigma"), probs = c(.025,.975))
# mean se_mean sd 2.5% 97.5% n_eff Rhat
# a0 -1917.21 776.69 2222.64 -5305.69 1918.58 8 1.02
# bGroup[1] 2368.36 2083.48 3819.06 -2784.04 9680.78 3 1.54
# bGroup[2] 7994.87 446.06 1506.31 4511.22 10611.46 11 1.00
# bGroup[3] 7020.78 2464.68 4376.83 81.18 14699.90 3 1.91
# bCond[1] -3887.06 906.99 1883.45 -7681.24 -247.48 4 1.60
# bCond[2] 4588.50 676.28 1941.92 -594.56 7266.09 8 1.10
# bCond[3] 73.91 1970.28 3584.74 -5386.96 5585.99 3 2.13
# bGxC[1,1] 3544.02 799.91 1819.18 -1067.27 6327.68 5 1.26
# bGxC[1,2] -4960.08 1942.57 3137.33 -10078.84 317.07 3 2.66
# bGxC[1,3] -396.35 418.34 1276.44 -2865.39 2543.45 9 1.42
# bGxC[2,1] -2085.90 1231.36 2439.58 -5769.81 3689.38 4 1.46
# bGxC[2,2] -10594.89 1206.58 2560.42 -14767.50 -5074.33 5 1.02
# bGxC[2,3] -6024.75 2417.43 4407.09 -12002.87 4651.14 3 1.71
# bGxC[3,1] -1111.81 1273.66 2853.08 -4843.38 5572.87 5 1.48
# bGxC[3,2] -9616.85 2314.56 4020.02 -15775.40 -4262.64 3 2.98
# bGxC[3,3] -5054.27 828.77 2245.68 -8666.01 -321.74 7 1.00
# sigma 13.81 0.14 0.74 12.36 15.17 27 1.00
The low number of effective samples and high Rhats tell me I am doing something terribly wrong here, but what?
Is it because I am not specifying a prior on bGxC?
How does one specify a prior on a matrix?

Matrices are inefficient in Stan (see here). It's better to use an array of vectors:
vector[nCond] bGxC[nGroup];
And to set a prior:
for(i in 1:nGroup){
  bGxC[i] ~ normal(0, sigma_gc);
}
And:
for (i in 1:N){
  mu[i] = a0 + bGroup[gIndex[i]] + bCond[cIndex[i]] + bSubj[sIndex[i]] + bGxC[gIndex[i]][cIndex[i]];
}
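Putting this together with the original model, here is a sketch of the revised .stan string in the same cat() style as Step 2. Only the interaction term changes: bGxC becomes an array of vectors with its own prior, as in the answer; the file name mix2.stan is just illustrative.
cat("
data{
  int<lower=1> N;
  int<lower=1> nSubj;
  int<lower=1> nGroup;
  int<lower=1> nCond;
  int<lower=1,upper=nSubj> sIndex[N];
  int<lower=1,upper=nGroup> gIndex[N];
  int<lower=1,upper=nCond> cIndex[N];
  real score[N];
}
parameters{
  real a0;
  vector[nGroup] bGroup;
  vector[nCond] bCond;
  vector[nSubj] bSubj;
  vector[nCond] bGxC[nGroup];      // array of vectors instead of a matrix
  real<lower=0> sigma_s;
  real<lower=0> sigma_g;
  real<lower=0> sigma_c;
  real<lower=0> sigma_gc;
  real<lower=0> sigma;
}
model{
  vector[N] mu;
  bCond ~ normal(100, sigma_c);
  bGroup ~ normal(100, sigma_g);
  bSubj ~ normal(0, sigma_s);
  for (i in 1:nGroup){
    bGxC[i] ~ normal(0, sigma_gc); // prior on the interaction, per the answer
  }
  sigma ~ cauchy(0,2)T[0,];
  for (i in 1:N){
    mu[i] = a0 + bGroup[gIndex[i]] + bCond[cIndex[i]] + bSubj[sIndex[i]] + bGxC[gIndex[i]][cIndex[i]];
  }
  score ~ normal(mu, sigma);
}
", file = "mix2.stan")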

Related

Issue with designing loop in R for executing simple linear regression equation

The general form of the equation is
Sector ~ Beta_0 + Beta_1*absMkt + Beta_2*sqMkt
where Sector is the daily stock return of each of the 12 sectors, i.e. AUTO, IT, REALTY, BANK, ENERGY, FINANCIAL SERVICES, FMCG, INFRASTRUCTURE, SERVICES, MEDIA, METAL, and PHARMA.
Beta_0 is the intercept; Beta_1 is the coefficient of absolute market return; Beta_2 is the coefficient of the squared market return.
For each sector, I would like to run a linear regression and store the coefficients Beta_1 and Beta_2 whenever the corresponding p-value is less than 0.05.
Sample data is stated below.
It is also available for download from my google drive location
https://drive.google.com/drive/folders/16XUq8_lXXtD2BSlUdDAAWeHiWIznf--c?usp=share_link
Name of the file : Week_1_CSV.csv
Here is the code I have tried, but it is not producing the result I want:
# Reading the data
Returns <- read.csv("Week_1_CSV.CSV", header = TRUE, stringsAsFactors = FALSE)
# Splitting the Data into Sector and Market Returns
Sector_Returns <- Returns[,2:13]
Market_Returns <- Returns[,14:15]
# Defining the number of sectors
nc <- ncol(Sector_Returns)
# Creating a matrix with zero value to store the coefficient values and their corresponding p-values
Beta_1 <- Beta_2 <- p_1 <- p_2 <- matrix(0, 1, nc) # coefs and p values
# Converting the Sectoral Returns into a Matrix named "Sect_Ret_Mat"
Sect_Ret_Mat <- as.matrix(Sector_Returns)
head(Sect_Ret_Mat)
# Converting the Market Returns into a Matrix named "Mkt_Ret_Mat"
Mkt_Ret_Mat <- as.matrix(Market_Returns)
head(Mkt_Ret_Mat)
#### Without Loop ##############
mode1_lm <- lm(Sect_Ret_Mat[,1] ~ Mkt_Ret_Mat[,1] + Mkt_Ret_Mat[,2] )
summary(mode1_lm)
# Extracting the p-value
coef(summary(mode1_lm))[2, 4] ## p-value corresponding to Beta_1
coef(summary(mode1_lm))[3, 4] ## p-value corresponding to Beta_2
# Extracting the Coefficient
coef(mode1_lm)[[2]] ## Coefficient corresponding to Beta_1
coef(mode1_lm)[[3]] ## Coefficient corresponding to Beta_2
##############################################################################
#### WithLoop ##############
for (i in 1:nc) {
  for (j in 1:nc) {
    if (i != j) {
      mode1_lm <- lm(Sect_Ret_Mat[,i] ~ Mkt_Ret_Mat[,1] + Mkt_Ret_Mat[,2])
      p_0[i,j] <- coef(summary(mode1_lm))[2, 4]
      p_1[i,j] <- coef(summary(mode1_lm))[3, 4]
      if (p_0[i, j] < 0.05)
        Beta_0[i,j] <- coef(mode1_lm)[[2]]
      if (p_1[i, j] < 0.05)
        Beta_1[i,j] <- coef(mode1_lm)[[3]]
    }
  }
}
Beta_0
Beta_1
As a general rule, I don't download datasets from a link; you should try to make a small example that can easily be run on anyone's computer. When running a bunch of regressions for different groups, I find it easiest to keep everything in one data frame and avoid loops. The general workflow is: 1) nest the data by group, 2) run the regression for each group, 3) use broom to turn the coefficients into a tidy table, and 4) unnest the model coefficients into a single table. Here is an example using mtcars, running a separate regression for each group of cyl.
library(tidyverse)
mtcars |>
  nest(data = -cyl) |>
  mutate(mod = map(data, ~lm(mpg ~ wt + hp, data = .x)),
         summ = map(mod, broom::tidy)) |>
  select(-data, -mod) |>
  unnest(summ)
#> # A tibble: 9 x 6
#>     cyl term        estimate std.error statistic   p.value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1     6 (Intercept)  32.6       5.57       5.84  0.00428
#> 2     6 wt           -3.24      1.37      -2.36  0.0776
#> 3     6 hp           -0.0222    0.0202    -1.10  0.333
#> 4     4 (Intercept)  45.8       4.79       9.58  0.0000117
#> 5     4 wt           -5.12      1.60      -3.19  0.0128
#> 6     4 hp           -0.0905    0.0436    -2.08  0.0715
#> 7     8 (Intercept)  26.7       3.66       7.28  0.0000158
#> 8     8 wt           -2.18      0.721     -3.02  0.0117
#> 9     8 hp           -0.0137    0.0107    -1.27  0.229
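Since the goal in the question is to keep only coefficients whose p-value is below 0.05, the same pipeline can be extended with a filter step. This is just a sketch; the 0.05 cutoff and dropping the intercept mirror the question's requirement.
library(tidyverse)
mtcars |>
  nest(data = -cyl) |>
  mutate(mod = map(data, ~lm(mpg ~ wt + hp, data = .x)),
         summ = map(mod, broom::tidy)) |>
  select(-data, -mod) |>
  unnest(summ) |>
  filter(term != "(Intercept)", p.value < 0.05)   # keep only significant slopes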
Melt your data long, and apply a helper function to each sector.
Convert the dataframe wk1 to a data.table, and melt it long:
library(data.table)
setDT(wk1)
wk1 = melt(
  data = wk1[, id := .I],
  id = c("id", "Date", "ABSOLUTE MARKLET RETURN", "SQUARED MARKET RETURN"),
  variable.name = "sector"
)
Write a helper function that runs the model and returns the beta and p-value in a list:
f <- function(v, a, s) {
  cf = summary(lm(v ~ a + s))$coefficients[-1, ]
  list(beta = cf[, 1], pvalue = cf[, 4])
}
Apply f to each sector
wk1[, f(value, `ABSOLUTE MARKLET RETURN`, `SQUARED MARKET RETURN`), sector]
Output:
sector beta pvalue
1: AUTO 0.71847599 1.837679e-02
2: AUTO -7.44556841 3.921574e-01
3: IT 0.33384211 2.878851e-02
4: IT -1.69884185 6.970822e-01
5: REALTY 0.19224293 3.128459e-01
6: REALTY 0.63655626 9.084056e-01
7: BANK 0.72886921 4.544867e-06
8: BANK -15.07590018 6.835331e-04
9: ENERGY 0.30568300 5.611571e-01
10: ENERGY -0.42252869 9.780039e-01
11: FINANCIAL SERVICES 0.46149507 1.940130e-04
12: FINANCIAL SERVICES -2.72192333 4.238560e-01
13: FMCG 0.38259697 1.398654e-02
14: FMCG -0.45504587 9.180342e-01
15: INFRASTRUCTURE 0.28891493 1.845572e-01
16: INFRASTRUCTURE 0.03244222 9.958937e-01
17: SERVICES 0.49497910 2.098375e-06
18: SERVICES -6.52723131 2.036243e-02
19: MEDIA 0.10040065 7.554367e-01
20: MEDIA 0.26014494 9.779231e-01
21: METAL 0.45139509 1.576446e-02
22: METAL -6.87462275 1.991497e-01
23: PHARMA 0.39993847 1.781434e-02
24: PHARMA -0.74201727 8.772388e-01
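If you then want only the significant coefficients, as in the question, a small follow-up on the result table could look like this (a sketch assuming the wk1 object and helper f defined above):
res <- wk1[, f(value, `ABSOLUTE MARKLET RETURN`, `SQUARED MARKET RETURN`), sector]
res[pvalue < 0.05]   # keep only rows where the p-value is below 0.05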

R: Compute Cohen's d based on t-statistic of a coefficient in multiple linear regression

I'm looking at age- and sex-adjusted group differences in a continuous variable of interest. As done in other studies in my field, I want to calculate Cohen's d based on contrasts extracted from a multiple linear regression model.
The original formula (Nakagawa & Cuthill, 2007) is as follows:
d = t * (n1 + n2) / (sqrt(n1 * n2) * sqrt(df'))
where:
n1 = sample size in Group 1
n2 = sample size in Group 2
df' = degrees of freedom used for a corresponding t value in a linear model
t = t-statistic corresponding to the contrast of interest
So far I've attempted to apply this in R, but the results are looking strange (much larger effect sizes than expected).
Here's some simulated data:
library(broom)
df = data.frame(ID = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010),
                Group = as.numeric(c('0','1','0','0','1','1','0','1','0','1')),
                age = as.numeric(c('23','28','30','15','7','18','29','27','14','22')),
                sex = as.numeric(c('1','0','1','0','0','1','1','0','0','1')),
                test_score = as.numeric(c('18','20','19','15','20','23','19','25','10','14')))
# run lm and extract regression coefficients
model <- lm(test_score ~ Group + age + sex, data = df)
tidy_model <- tidy(model)
tidy_model
# A tibble: 4 x 5
#   term        estimate std.error statistic p.value
#   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
# 1 (Intercept)   11.1       4.41     2.52    0.0451
# 2 Group          4.63      2.65     1.75    0.131
# 3 age            0.225     0.198    1.13    0.300
# 4 sex            0.131     2.91     0.0452  0.965
t_statistic <- tidy_model[2,4] # = 1.76
n <- 5 #(equal n of participants in Group1 as in Group2)
cohens_d <- t_statistic*(n + n)/(sqrt(n * n) * sqrt(1)) # 1 dof for 1 estimated parameter (group contrast)
cohens_d # = 3.518096
Could you please flag up where I'm going wrong?
You have set the degrees of freedom to 1. However, you actually have 6 degrees of freedom, which you can see if you type summary(model).
If you set your degrees of freedom to 6, your Cohen's d will be ~1.4, which should be more in line with what you expect.
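As a sketch of the corrected calculation, pulling the residual degrees of freedom from the fitted model with df.residual() rather than hard-coding 1:
t_statistic <- tidy_model$statistic[2]   # t for the Group contrast (~1.76)
df_resid <- df.residual(model)           # 10 observations - 4 coefficients = 6
n1 <- n2 <- 5
cohens_d <- t_statistic * (n1 + n2) / (sqrt(n1 * n2) * sqrt(df_resid))
cohens_d # ~1.44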

pbcor and ggcorrmat correlations give different confidence intervals in R

I am working with multiple variables, where I would like to run a robust correlation and then extract the 95% confidence intervals. I am able to do this using pbcor from the WRS2 package.
However, when I want to plot these values, I use ggcorrmat from the ggstatsplot package. As I was checking the confidence intervals between the two outputs, I noticed they do not match up.
Here is a sample of my dataset:
Individual varA varB
1 2.9380842 0.09896456
2 2.9380842 -1.38772037
3 -0.6879859 -2.41310243
4 -0.6879859 0.55722346
5 -2.3129564 -1.34140699
6 -2.3129564 -1.75604301
7 -0.4937431 0.78381085
8 -0.4937431 0.38320385
9 -0.8558126 0.82125672
10 -0.8558126 0.06346062
11 -0.9211026 -1.67170174
Respective code/outputs using this sample dataset:
WRS2::pbcor(data$varA, data$varB, ci=TRUE, nboot=1000, beta=0.1)
> robust correlation coefficient: 0.275
> test statistic: 0.8582
> p-value:0.41307
> bootstrap CI: [-0.3564; 0.7792]
ggstatsplot::ggcorrmat(data, cor.vars = c(varA, varB), output = "dataframe",
                       matrix.type = "lower", type = "robust", beta = 0.1,
                       sig.level = 0.05, conf.level = 0.95, nboot = 1000)
>robust correlation: 0.275
>test statistic: 0.858
>p-value: 0.413
>CI: [-0.389, 0.751]
Why are the confidence intervals different, but the correlation values are the same?
You are right that the CIs differ between WRS2 and ggstatsplot because ggstatsplot internally doesn't use bootstrapping (which is slower and computationally costly) to compute the CIs.
Input <- ("
Individual varA varB
1 2.9380842 0.09896456
2 2.9380842 -1.38772037
3 -0.6879859 -2.41310243
4 -0.6879859 0.55722346
5 -2.3129564 -1.34140699
6 -2.3129564 -1.75604301
7 -0.4937431 0.78381085
8 -0.4937431 0.38320385
9 -0.8558126 0.82125672
10 -0.8558126 0.06346062
11 -0.9211026 -1.67170174
")
# creating a dataframe
df <- read.table(textConnection(Input), header = TRUE)
set.seed(123)
WRS2::pbcor(df$varA, df$varB, ci = TRUE, nboot = 1000, beta = 0.1)
#> Call:
#> WRS2::pbcor(x = df$varA, y = df$varB, beta = 0.1, ci = TRUE,
#> nboot = 1000)
#>
#> Robust correlation coefficient: 0.275
#> Test statistic: 0.8582
#> p-value: 0.41307
#>
#> Bootstrap CI: [-0.4476; 0.8223]
set.seed(123)
ggstatsplot::ggcorrmat(
  data = dplyr::select(df, -Individual),
  type = "robust",
  output = "dataframe",
  nboot = 1000,
  beta = 0.1
)
#> # A tibble: 1 x 10
#> parameter1 parameter2 r ci_low ci_high t df p method nobs
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
#> 1 varA varB 0.275 -0.389 0.751 0.809 9 0.439 Percentage~ 11
It instead returns non-central confidence intervals for effect sizes whenever it can.
If you are curious, the relevant piece of code used to compute CIs is here:
https://github.com/easystats/correlation/blob/ddd105da55c8b5a81e4ce97b8938f5f00e6e968b/R/cor_to_ci.R#L70-L85
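For intuition, here is a minimal base-R sketch of an analytic (Fisher z) interval. It is only an illustration of a non-bootstrap CI, not necessarily the exact code path in the correlation package, but for r = 0.275 and n = 11 it reproduces the ggstatsplot interval above:
r <- 0.275
n <- 11
z <- atanh(r)              # Fisher z transform of the correlation
se <- 1 / sqrt(n - 3)      # approximate standard error of z
tanh(z + c(-1, 1) * qnorm(0.975) * se)
#> [1] -0.389  0.751   (rounded)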

Estimating the standard deviation from mean and confidence intervals with a gamma distribution in R

I have the following problem I'd like to solve in R and apply to a larger workflow. I need to estimate the standard deviation from a gamma distribution where the mean and 95% confidence intervals are known.
state = c("group1", "group2", "group3")
mean = c(0.970, 0.694, 0.988)
lowers = c(0.527, 0.381, 0.536)
uppers = c(1.87, 1.37, 1.90)
df = data.frame(state = state, mean = mean, lower = lowers, upper = uppers)
Using Excel and the "solver" tool, I can adjust the standard deviation to minimize the sum of squared differences between the target 2.5 (lowers) and 97.5 (uppers) percentiles of the distribution and the actual values. The challenge is that this needs to be scaled up to a rather large set of data and operationalized in my R dataframe workflow. Any ideas how to solve this?
I think this problem is ultimately an optimization problem, dealing with one row of data at a time. Since you want to scale it, though, here's an approximation for finding the distribution's core parameters.
This process is not an optimization: it sweeps over a defined range of possible k (shape) parameters and finds the shape/scale combination (given your mean) that most closely matches your upper and lower quantiles. You control the granularity of k, which is as close as you will get to having a tolerance (which would be appropriate for a true optimization).
As such, this process will be imperfect. I hope it gets you a fast-enough process for good-enough accuracy.
I'm going to first demonstrate a process that operates one row at a time, as guesser1. I'll then expand it to do the same operation on an arbitrary number of mean, lower, and upper.
Data with Known Answers
But first, I want to generate my own samples so that we have known "truth" for this guesser.
library(dplyr)
library(purrr) # for map2() and map_dbl()
set.seed(42)
n <- 4
randks <- tibble(
  k = sample(1:10, size = n, replace = TRUE),
  scale = sample(seq(0.5, 2.5, by = 0.5), size = n, replace = TRUE)
) %>%
  mutate(
    samp = map2(k, scale, ~ rgamma(1000, shape = .x, scale = .y)),
    theor_mu = k*scale,
    mu = map_dbl(samp, ~ mean(.x)),
    lwr = map_dbl(samp, ~ quantile(.x, 0.025)),
    upr = map_dbl(samp, ~ quantile(.x, 0.975))
  ) %>%
  select(-samp)
randks
# # A tibble: 4 x 6
# k scale theor_mu mu lwr upr
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 10 2 20 19.9 9.47 33.7
# 2 10 1.5 15 15.1 7.36 25.0
# 3 3 2 6 5.85 1.08 14.5
# 4 9 0.5 4.5 4.51 1.99 7.72
Guesser1
Single row at a time:
guesser1 <- function(mu, lwr, upr, k.max = 10, k.by = 0.01) {
  stopifnot(length(mu) == 1, length(lwr) == 1, length(upr) == 1)
  ks <- seq(0, k.max, by = k.by)[-1]
  L <- sapply(ks, function(k) qgamma(0.025, shape = k, scale = mu / k))
  U <- sapply(ks, function(k) qgamma(0.975, shape = k, scale = mu / k))
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  ind <- which.min(dists)
  data.frame(
    k = ks[ind],
    scale = mu / ks[ind],
    dist = min(dists),
    lwr = L[ind],
    upr = U[ind]
  )
}
In action:
out1 <- do.call(rbind, Map(guesser1, randks$mu, randks$lwr, randks$upr))
cbind(subset(randks, select = -theor_mu), out1)
# k scale mu lwr upr k scale dist lwr upr
# 1 10 2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10 1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3 3 2.0 5.85 1.08 14.50 2.76 2.120 0.020 1.10 14.50
# 4 9 0.5 4.51 1.99 7.72 9.55 0.472 0.142 2.12 7.79
### \____ randks __________/ \____ guessed ____________/
There are certainly some differences, underscoring my original assertion that this is imperfect.
Guessers
All rows at once. This is a little more work in the function, since it deals with matrices instead of just vectors. Not a problem, I just wanted to prove it one-at-a-time before going for the gusto.
guessers <- function(mu, lwr, upr, k.max = 10, k.by = 0.01, include.size = FALSE) {
  stopifnot(length(mu) == length(lwr), length(mu) == length(upr))
  # count <- length(mu)
  ks <- seq(0, k.max, by = k.by)[-1]
  # 'ks' dims: [mu]
  L <- outer(mu, ks, function(m, k) qgamma(0.025, shape = k, scale = m / k))
  U <- outer(mu, ks, function(m, k) qgamma(0.975, shape = k, scale = m / k))
  # 'L' & 'U' dims: [mu, ks]
  dists <- sqrt((L - lwr)^2 + (U - upr)^2)
  inds <- apply(dists, 1, which.min)
  mindists <- apply(dists, 1, min)
  i <- seq_along(mu)
  out <- data.frame(
    k = ks[inds],
    scale = mu / ks[inds],
    dist = mindists,
    lwr = L[cbind(i, inds)],
    upr = U[cbind(i, inds)]
  )
  size <- if (include.size) {
    message("guessers memory: ",
            object.size(list(ks, L, U, dists, inds, mindists, i, out)))
  }
  out
}
In action:
outs <- guessers(randks$mu, randks$lwr, randks$upr, include.size = TRUE)
# guessers memory: 106400
cbind(subset(randks, select = -theor_mu), outs)
# k scale mu lwr upr k scale dist lwr upr
# 1 10 2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10 1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3 3 2.0 5.85 1.08 14.50 2.76 2.120 0.020 1.10 14.50
# 4 9 0.5 4.51 1.99 7.72 9.55 0.472 0.142 2.12 7.79
### \____ randks __________/ \____ guessed (same) _____/
(I included a memory message in there just to track how much this can scale. It's not bad now, and that argument should definitely not be used in production. FYI.)
Comparison
Using microbenchmark, we repeat each operation a few times and compare their run times.
microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# g1 27.3 28.8 33.9 29.7 33.0 131.1 100
# gs 13.3 13.6 14.4 13.8 14.6 20.3 100
Not too surprisingly, the all-at-once guessers is a bit faster. When will this not be the case? When the number of rows gets so big that memory consumption becomes a problem; I don't know where that threshold is.
Let's try the same thing, but increasing randks from 4 rows to 1000 and repeating the benchmark.
n <- 1000
# randks <- ...
nrow(randks)
# [1] 1000
microbenchmark::microbenchmark(
  g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
  gs = guessers(randks$mu, randks$lwr, randks$upr),
  times = 10
)
# Unit: seconds
# expr min lq mean median uq max neval
# g1 8.50 8.99 9.59 9.31 9.64 11.72 10
# gs 3.35 3.44 3.61 3.63 3.77 3.93 10
So it's definitely faster. The median run-time for 1000 estimations is 3.63 seconds, so it appears to finish about 300/sec.
guessers memory: 24066176
(about 24 MB) Actually, that doesn't seem bad at all. Decrease k.by to increase your accuracy, or increase k.by to speed this up.
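Since the question ultimately asks for the standard deviation and mentions a solver, here is a sketch of a true per-row optimization with stats::optimize() over the shape k, using the same scale = mean/k parameterization as above. The function name and search bounds are my own:
fit_gamma_sd <- function(mu, lwr, upr, k.max = 100) {
  obj <- function(k) {
    (qgamma(0.025, shape = k, scale = mu / k) - lwr)^2 +
      (qgamma(0.975, shape = k, scale = mu / k) - upr)^2
  }
  k <- optimize(obj, interval = c(1e-3, k.max))$minimum
  c(shape = k, scale = mu / k, sd = mu / sqrt(k))  # sd of a gamma is sqrt(shape) * scale
}
# Applied row-wise to the question's data frame:
# t(mapply(fit_gamma_sd, df$mean, df$lower, df$upper))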

R Flexsurv and time-dependent covariates

I read that the R flexsurv package can also be used for modeling time-dependent covariates, according to Christopher Jackson (2016), "flexsurv: a platform for parametric survival modeling in R", Journal of Statistical Software, 70(1).
However, I was not able to figure out how, even after several adjustments and searches in online forums.
Before turning to the estimation of time-dependent covariates I tried to create a simple model with only time-independent covariates to test whether I specified the Surv object correctly. Here is a small example.
library(splitstackshape)
library(flexsurv)
## create sample data
n=50
set.seed(2)
t <- rpois(n,15)+1
x <- rnorm(n,t,5)
df <- data.frame(t,x)
df$id <- 1:n
df$rep <- df$t-1
Which looks like this:
t x id rep
1 12 17.696149 1 11
2 12 20.358094 2 11
3 11 2.058789 3 10
4 16 26.156213 4 15
5 13 9.484278 5 12
6 15 15.790824 6 14
...
And the long data:
long.df <- expandRows(df, "rep")
rep.vec <- c()
for(i in 1:n){
  rep.vec <- c(rep.vec, 1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
Which looks like this:
t x id start stop censrec
1 12 17.69615 1 1 2 0
1.1 12 17.69615 1 2 3 0
1.2 12 17.69615 1 3 4 0
1.3 12 17.69615 1 4 5 0
1.4 12 17.69615 1 5 6 0
1.5 12 17.69615 1 6 7 0
1.6 12 17.69615 1 7 8 0
1.7 12 17.69615 1 8 9 0
1.8 12 17.69615 1 9 10 0
1.9 12 17.69615 1 10 11 0
1.10 12 17.69615 1 11 12 1
2 12 20.35809 2 1 2 0
...
Now I can estimate a simple Cox model to see whether it works:
coxph(Surv(t)~x,data=df)
This yields:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
And in the long format:
coxph(Surv(start,stop,censrec)~x,data=long.df)
I get:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
Taken together I conclude that my transformation into the long format was correct. Now, turning to the flexsurv framework:
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
yields:
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00086 4.05569 6.16631 0.53452 NA NA NA
scale NA 13.17215 11.27876 15.38338 1.04293 NA NA NA
x 15.13380 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
But
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
causes an error:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull") :
Initial value for parameter 1 out of range
Would anyone happen to know the correct syntax for the latter Surv object? If you use the correct syntax, do you get the same estimates?
Thank you very much,
best,
David
===============
EDIT AFTER FEEDBACK FROM 42
===============
library(splitstackshape)
library(flexsurv)
x<-c(8.136527, 7.626712, 9.809122, 12.125973, 12.031536, 11.238394, 4.208863, 8.809854, 9.723636)
t<-c(2, 3, 13, 5, 7, 37 ,37, 9, 4)
df <- data.frame(t,x)
#transform into long format for time-dependent covariates
df$id <- 1:length(df$t)
df$rep <- df$t-1
long.df <- expandRows(df, "rep")
rep.vec <- c()
for(i in 1:length(df$t)){
  rep.vec <- c(rep.vec, 1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
coxph(Surv(t)~x,data=df)
coxph(Surv(start,stop,censrec)~x,data=long.df)
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull",inits=c(shape=.1, scale=1))
Which yields the same estimates for both coxph models but
Call:
flexsurvreg(formula = Surv(time = t) ~ x, data = df, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 1.0783 0.6608 1.7594 0.2694 NA NA NA
scale NA 27.7731 3.5548 216.9901 29.1309 NA NA NA
x 9.3012 -0.0813 -0.2922 0.1295 0.1076 0.9219 0.7466 1.1383
N = 9, Events: 9, Censored: 0
Total time at risk: 117
Log-likelihood = -31.77307, df = 3
AIC = 69.54614
and
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 0.8660 0.4054 1.8498 0.3353 NA NA NA
scale NA 24.0596 1.7628 328.3853 32.0840 NA NA NA
x 8.4958 -0.0912 -0.3563 0.1739 0.1353 0.9128 0.7003 1.1899
N = 108, Events: 9, Censored: 99
Total time at risk: 108
Log-likelihood = -30.97986, df = 3
AIC = 67.95973
Reading the error message:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull", :
initial values must be a numeric vector
And then, reading the help page ?flexsurvreg, it seemed worth attempting to set inits to a named numeric vector:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull", inits=c(shape=.1, scale=1))
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00082 4.05560 6.16633 0.53454 NA NA NA
scale NA 13.17213 11.27871 15.38341 1.04294 NA NA NA
x 15.66145 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
N = 715, Events: 50, Censored: 665
Total time at risk: 715
Log-likelihood = -131.5721, df = 3
AIC = 269.1443
Extremely similar results. My guess was basically a stab in the dark, so I have no guidance on how to make a choice if this had not succeeded other than to "expand the search."
I just want to mention that in flexsurv v1.1.1, running this code:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
doesn't return any errors. It also gives the same estimates as the non time-varying command
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
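Building on that, here is a hedged sketch of where the long format actually pays off: an interval-specific (time-varying) covariate can be included directly in the counting-process Surv() call. The covariate x_t below is made up purely for illustration and assumes the long.df object and packages from the edit above.
## Illustration only: a covariate that changes across the start/stop intervals
## (x scaled by the log of the interval end; not a meaningful predictor).
long.df$x_t <- long.df$x * log(long.df$stop + 1)
coxph(Surv(start, stop, censrec) ~ x + x_t, data = long.df)
flexsurvreg(Surv(start, stop, censrec) ~ x + x_t, data = long.df, dist = "weibull")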
