Speed up a handmade Cox model fit (vs. `survival::coxph`) in R

Below, I compare the results from an R function with my own code. The algorithm simply consists of maximising a function of many parameters (here, 19). My code defines the function and uses nlm for optimisation. Fortunately, both return the same result. However, the R function is amazingly quick. I therefore suspect I can do better than using nlm (or a similar optimisation routine in R). Any ideas?
Here is some survival data that can be fitted with a Cox model. To do so, one needs to maximise the partial log-likelihood (the third equation in the Wikipedia article on the Cox model).
In R, this can be done with coxph() (part of the survival package):
> library(survival)
> fmla <- as.formula(paste("Surv(time, event) ~ ",
+ paste(names(data)[-(1:3)], collapse=" +")))
> mod <- coxph(formula=fmla, data=data)
> round(mod$coef, 3)
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15
-0.246 -0.760 0.089 -0.033 -0.138 -0.051 -0.484 -0.537 -0.620 -0.446 -0.204 -0.112 -0.089 -0.451 0.043
x16 x17 x18 x19
0.106 -0.015 -0.245 -0.653
This can be checked by explicitly writing the partial log-likelihood and by using some numerical optimisation routine. Here is some crude code which does this job.
The code has been edited based on the comments I received
> #------ minus partial log-lik ------
> Mpll <- function(beta, data)
+ #!!!data must be ordered by increasing time!!!
+ #--> data <- data[order(data$time), ]
+ {
+ #preparation
+ N <- nrow(data)
+ linpred <- as.matrix(data[, -(1:3)]) %*% beta
+
+ #pll
+ pll <- sum(sapply(X=which(data$event == 1), FUN=function(j)
+ linpred[j] - log(sum(exp(linpred[j:N])))))
+
+ #output
+ return(- pll)
+ }
> #-----------------------------------
>
> data <- data[order(data$time), ]
> round(nlm(f=Mpll, p=rep(0, 19), data=data)$estimate, 3)
[1] -0.246 -0.760 0.089 -0.033 -0.138 -0.051 -0.484 -0.537 -0.620 -0.446 -0.204 -0.112 -0.089 -0.451
[15] 0.043 0.106 -0.015 -0.245 -0.653
OK, it works... but it is much much slower!
Does anyone have an idea on what is done within coxph() to make it so fast?

Here is a vectorized version of your code.
Mpll2 <- function(beta, data) {
  X <- as.matrix(data[, -(1:3)])       # covariate matrix (first three columns are not covariates)
  a <- X %*% beta                      # linear predictor
  b <- log(rev(cumsum(rev(exp(a)))))   # log of each risk-set sum: sum(exp(a[j:N])) for every j
  -sum((a - b)[data$event == 1])       # minus the partial log-likelihood
}
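A quick sanity check that the two versions agree at an arbitrary coefficient vector (a small sketch; data is assumed to be ordered by time, as in the question):
set.seed(1)
b0 <- rnorm(19, sd = 0.1)                   # arbitrary test point
all.equal(Mpll(b0, data), Mpll2(b0, data))  # should return TRUE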
And here is a simple test of the run times.
data <- data[order(data$time), ] # No reason to order every time
# Yours
system.time(round(nlm(f=Mpll, p=rep(0, 19), data=data)$estimate, 3))
# user system elapsed
# 2.77 0.01 2.79
# Vectorized
system.time(round(nlm(f=Mpll2, p=rep(0, 19), data=data)$estimate, 3))
# user system elapsed
# 0.28 0.00 0.28
# Optimized C code
fmla <- as.formula(paste("Surv(time, event) ~ ",
paste(names(data)[-(1:3)], collapse=" +")))
system.time(round(coxph(formula=fmla, data=data)$coef,3))
# user system elapsed
# 0.02 0.00 0.03
So there is roughly an order of magnitude between each version. The C code is very fast, and you are never going to approach those speeds in pure R, but C is harder to write.
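One middle ground, still in plain R, is to hand nlm() an analytic gradient: if the value returned by the objective carries a "gradient" attribute, nlm() uses it instead of numerical differentiation, which typically reduces the number of function evaluations. Here is a sketch of such a version (the name Mpll2_grad is mine, and the actual speed-up is an assumption worth timing on your own data):
Mpll2_grad <- function(beta, data) {
  X  <- as.matrix(data[, -(1:3)])                 # covariates (data ordered by time)
  a  <- drop(X %*% beta)                          # linear predictor
  ea <- exp(a)
  S0 <- rev(cumsum(rev(ea)))                      # risk-set sums of exp(a)
  S1 <- apply(X * ea, 2,                          # risk-set sums of x * exp(a), per covariate
              function(col) rev(cumsum(rev(col))))
  d  <- data$event == 1
  val <- -sum(a[d] - log(S0[d]))                  # minus partial log-likelihood (as in Mpll2)
  attr(val, "gradient") <-
    -colSums(X[d, , drop = FALSE] - S1[d, , drop = FALSE] / S0[d])
  val
}
# nlm(f = Mpll2_grad, p = rep(0, 19), data = data)$estimate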

Related

How to run the predicted probabilities (or average marginal effects) for individuals fixed effects in panel data using R?

These are three different ways to run an individual fixed effects model, which give more or less the same results (see below). My main question is how to get predicted probabilities or average marginal effects using the second model (model_plm) or the third model (model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when I have a small sample.
As I have over a million individuals, my model only works with model_plm and model_felm. If I use model_lm, it takes a very long time to run with one million individuals, since they are controlled for in the model. I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on Stack Overflow to work around that error but nothing seems to solve it.
I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predicted probabilities or average marginal effects using one of these packages: ggeffects, emmeans, or margins.
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence+health + residence*union +factor(nr)-1, data=Males)
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr", "year"), data=Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data= Males)
pred_ggeffects <- ggpredict(model_lm, c("residence","union"),
vcov.fun = "vcovCL",
vcov.type = "HC1",
vcov.args = list(cluster = Males$nr))
I tried adjusting formula/datasets to get emmeans and plm to play nice. Let me know if there's something here. I realized the biglm answer wasn't going to cut it for a million individuals after some testing.
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union,model = "within", index=c("nr"), data=Males)
## expanded dataset
Males2 <- data.frame(wage=Males[complete.cases(Males),"wage"],
model.matrix(wage ~ exper + residence + health + residence*union, Males),
nr=Males[complete.cases(Males),"nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
model = "within",
index=c("nr"),
data=Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))
plm2_rg <- qdrg(fmla2_rg,
data = Males2,
coef = coef(model_plm2),
vcov = vcov(model_plm2),
df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
Which gives, after some row-deletion:
> ### when all 3 residences are 0, that's `rural area`
> ### then just pick the rows when one of the residences are 1
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
residencenorth_east residencenothern_central residencesouth unionyes emmean SE df lower.CL upper.CL
0 0 0 0 0.3777 0.0335 2677 0.31201 0.443
1 0 0 0 0.3301 0.1636 2677 0.00929 0.651
0 1 0 0 0.1924 0.1483 2677 -0.09834 0.483
0 0 1 0 0.2596 0.1514 2677 -0.03732 0.557
0 0 0 1 0.2875 0.1473 2677 -0.00144 0.576
1 0 0 1 0.3845 0.1647 2677 0.06155 0.708
0 1 0 1 0.3326 0.1539 2677 0.03091 0.634
0 0 1 1 0.3411 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
model = "within",
index = "nr",
data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that things line up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
residence union emmean SE df lower.CL upper.CL
rural_area no 0.378 0.0335 2677 0.31201 0.443
north_east no 0.330 0.1636 2677 0.00929 0.651
nothern_central no 0.192 0.1483 2677 -0.09834 0.483
south no 0.260 0.1514 2677 -0.03732 0.557
rural_area yes 0.287 0.1473 2677 -0.00144 0.576
north_east yes 0.385 0.1647 2677 0.06155 0.708
nothern_central yes 0.333 0.1539 2677 0.03091 0.634
south yes 0.341 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
In general, the key is to identify where the added column occurs. It will be the position of the first level of the first factor in the model formula. You can check it by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.
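For instance, with the mod and Males objects above, a quick sketch of that check (the column that the model matrix has but coef(mod) lacks is where the zero coefficient belongs):
mm <- model.matrix(~ -1 + exper + residence + health + residence*union, data = Males)
setdiff(colnames(mm), names(coef(mod)))        # the coefficient name missing from the fit
which(!colnames(mm) %in% names(coef(mod)))     # its position, i.e. where to place the 0 in BB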
Update: a general function
Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.
plmrg = function(object, ...) {
form = formula(formula(object))
if (!("(Intercept)" %in% names(coef(object))))
form = update(form, ~ . - 1)
data = eval(object$call$data, environment(form))
mmat = model.matrix(form, data)
sel = which(colnames(mmat) %in% names(coef(object)))
k = ncol(mmat)
b = rep(0, k)
b[sel] = coef(object)
v = matrix(0, nrow = k, ncol = k)
v[sel, sel] = vcov(object)
emmeans::qdrg(formula = form, data = data,
coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
residence emmean SE df lower.CL upper.CL
rural_area 0.313 0.0791 2677 0.1579 0.468
north_east 0.338 0.1625 2677 0.0190 0.656
nothern_central 0.243 0.1494 2677 -0.0501 0.536
south 0.281 0.1514 2677 -0.0161 0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
This potential solution uses biglm::biglm() to fit the model and then uses emmeans::qdrg() with a nuisance factor specified. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels has NAs.
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 +exper + residence+health + residence*union + factor(nr), data=Males[!is.na(Males$residence),])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a "[q]uick and [d]irty [r]eference [g]rid" and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence+health + residence*union + factor(nr),
data = Males,
coef = coef(model_biglm),
vcov = vcov(model_biglm),
df = model_biglm$df.resid,
nuisance="nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
Which gives:
> emmeans(rg1, c("residence","union"))
residence union emmean SE df lower.CL upper.CL
rural_area no 1.72 0.1417 2677 1.44 2.00
north_east no 1.67 0.0616 2677 1.55 1.79
nothern_central no 1.53 0.0397 2677 1.45 1.61
south no 1.60 0.0386 2677 1.52 1.68
rural_area yes 1.63 0.2011 2677 1.23 2.02
north_east yes 1.72 0.0651 2677 1.60 1.85
nothern_central yes 1.67 0.0503 2677 1.57 1.77
south yes 1.68 0.0460 2677 1.59 1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95

Assign dependent variable names to stargazer table with list from lapply

I am running several regressions with lm and lapply so that I get a list of models, and from there I want to create a stargazer table. The problem I am having is with stargazer: I can create a table without issue, but I can't figure out how to include more than one dependent-variable name. Here is some sample code:
library(stargazer)
library(magrittr)
x1 <- rnorm(1000,0,1)
x2 <- rnorm(1000,0,1)
x3 <- rnorm(1000,0,1)
x4 <- rnorm(1000,0,1)
x5 <- rnorm(1000,0,1)
x6 <- rnorm(1000,0,1)
data <- cbind(x1, x2, x3, x4, x5, x6) %>%
as.data.frame()
mod_list_test <- lapply(data[, 1:4], function(x) lm(x ~ data$x5 + data$x6))
dep_vars_test <- c("A", "B", "C", "D")
stargazer(mod_list_test, header = F,
dep.var.labels = dep_vars_test,
type = "text")
The issue, I believe, is that when the formula is called with lapply, it reads:
Call: lm(formula = x ~ data$x5 + data$x6)
so stargazer seems to think the dependent variable is "x" in all models. Specifying the dependent variable names as I've done in my code above then only uses the first name in the dep_vars vector.
This is kind of a follow-up to this post here, but in that thread it seems that OP is satisfied with excluding the dependent variable labels entirely. I would like to have them if possible. I tried with their solution of pasting in the formula text but that gave me the same result.
Any help would be much appreciated.
I suggest using lapply as follows:
mod_list_test <- lapply(data[, 1:4], function(x) {
df <- data.frame(y = x, x5=data$x5, x6=data$x6)
lm(y ~ x5 + x6, data=df)
})
or:
mod_list_test <- lapply(1:4, function(k) {
frm <- as.formula(paste(names(data)[k],"~ x5+x6"))
lm(frm, data=data)
})
stargazer(mod_list_test, header = F, type = "text")
The output is:
=================================================================
Dependent variable:
----------------------------------
x1 x2 x3 x4
(1) (2) (3) (4)
-----------------------------------------------------------------
x5 -0.041 0.011 -0.077** -0.002
(0.033) (0.033) (0.033) (0.033)
x6 -0.021 0.027 0.027 -0.095***
(0.033) (0.034) (0.033) (0.033)
Constant -0.013 -0.016 0.014 0.016
(0.033) (0.033) (0.033) (0.033)
-----------------------------------------------------------------
Observations 1,000 1,000 1,000 1,000
R2 0.002 0.001 0.006 0.008
Adjusted R2 -0.0001 -0.001 0.004 0.006
Residual Std. Error (df = 997) 1.035 1.041 1.029 1.036
F Statistic (df = 2; 997) 0.939 0.377 3.145** 4.016**
=================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
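If you still want the custom labels "A" to "D" from the question, dep.var.labels should now map one label per response, since each model carries its own dependent-variable name (a sketch, not verified against the run above):
stargazer(mod_list_test, header = FALSE, type = "text",
          dep.var.labels = dep_vars_test)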

Arrange monte carlo p-value into a matrix for different sample size and variance estimators

The following code works out quite well (based on my previous question). But I have to change the variance estimator (ols, hc0, hc1, hc2, hc3) every time before I run the code. I would like to solve this problem with a loop.
Hereafter, I briefly describe the code. Within the code, 1000 regression models are created for each sample size (n = 25, 50, 100, 250, 500, 1000). Each of the 1000 regression models is then estimated by OLS. After that, I calculate t-statistics based on the different beta values of x3 across the 1000 samples. The null hypothesis reads H0: beta3 = beta3h0, that is, the estimated coefficient of x3 equals the 'real' value, which I defined as 1. In the last step, I check how often the null hypothesis is rejected (significance level = 0.05). My final goal is code which spits out the percentage rejection rate of the null hypothesis for each sample size and variance estimator. Thus, the result should be a matrix, whereas right now I get a vector. I would be pleased if anyone could help me with that. Here is my code:
library(car)
sample_size = c("n=25"=25, "n=50"=50, "n=100"=100, "n=250"=250, "n=500"=500, "n=1000"=1000)
B <- 1000
beta0 <- 1
beta1 <- 1
beta2 <- 1
beta3 <- 1
alpha <- 0.05
simulation <- function(n, beta3h0){
t.test.values <- rep(NA, B)
#simulation of size
for(rep in 1:B){
#data generation
d1 <- runif(n, 0, 1)
d2 <- rnorm(n, 0, 1)
d3 <- rchisq(n, 1, ncp=0)
x1 <- (1 + d1)
x2 <- (3*d1 + 0.6*d2)
x3 <- (2*d1 + 0.6*d3)
# homoskedastic error term: exi <- rchisq(n, 4, ncp = 0)
exi <- sqrt(x3 + 1.6)*rchisq(n, 4, ncp = 0)
y <- beta0 + beta1*x1 + beta2*x2 + beta3*x3 + exi
mydata <- data.frame(y, x1, x2, x3)
#ols estimation
lmobj <- lm(y ~ x1 + x2 + x3, mydata)
#extraction
betaestim <- coef(lmobj)[4]
betavar <- vcov(lmobj)[4,4]
#robust variance estimators: hc0, hc1, hc2, hc3
betavar0 <- hccm(lmobj, type="hc0")[4,4]
betavar1 <- hccm(lmobj, type="hc1")[4,4]
betavar2 <- hccm(lmobj, type="hc2")[4,4]
betavar3 <- hccm(lmobj, type="hc3")[4,4]
#t statistic
t.test.values[rep] <- (betaestim - beta3h0)/sqrt(betavar)
}
mean(abs(t.test.values) > qt(p=c(1-alpha/2), df=n-4))
}
sapply(sample_size, simulation, beta3h0 = 1)
You don't need a double nested loop. Just make sure you get a matrix inside your loop. Update your current simulation with the following:
## set up a matrix
## replacing `t.test.values <- rep(NA, B)`
t.test.values <- matrix(nrow = 5, ncol = B) ## 5 estimators
## update / fill a column
## replacing `t.test.values[rep] <- (betaestim - beta3h0)/sqrt(betavar)`
t.test.values[, rep] <- abs(betaestim - beta3h0) / sqrt(c(betavar, betavar0, betavar1, betavar2, betavar3))
## row means
## replacing `mean(abs(t.test.values) > qt(p=c(1-alpha/2), df=n-4))`
rowMeans(t.test.values > qt(1-alpha/2, n-4))
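Folding those three changes into the question's function gives, as a sketch (reusing the data generation and hccm() calls from the question, and the globals B, alpha and the betas defined above):
simulation <- function(n, beta3h0){
  t.test.values <- matrix(nrow = 5, ncol = B)   # rows: ols, hc0, hc1, hc2, hc3
  for(rep in 1:B){
    d1 <- runif(n, 0, 1); d2 <- rnorm(n, 0, 1); d3 <- rchisq(n, 1, ncp = 0)
    x1 <- 1 + d1; x2 <- 3*d1 + 0.6*d2; x3 <- 2*d1 + 0.6*d3
    exi <- sqrt(x3 + 1.6)*rchisq(n, 4, ncp = 0)
    y <- beta0 + beta1*x1 + beta2*x2 + beta3*x3 + exi
    lmobj <- lm(y ~ x1 + x2 + x3, data.frame(y, x1, x2, x3))
    betaestim <- coef(lmobj)[4]
    vars <- c(vcov(lmobj)[4, 4],
              hccm(lmobj, type = "hc0")[4, 4], hccm(lmobj, type = "hc1")[4, 4],
              hccm(lmobj, type = "hc2")[4, 4], hccm(lmobj, type = "hc3")[4, 4])
    t.test.values[, rep] <- abs(betaestim - beta3h0) / sqrt(vars)
  }
  rowMeans(t.test.values > qt(p = 1 - alpha/2, df = n - 4))
}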
Now, simulation returns a vector of length 5: for each sample size, the Monte Carlo rejection rate of the null hypothesis is reported for all 5 variance estimators. Then, when you call sapply, you get a matrix result:
sapply(sample_size, simulation, beta3h0 = 1)
# n=25 n=50 n=100 n=250 n=500 n=1000
#[1,] 0.132 0.237 0.382 0.696 0.917 0.996
#[2,] 0.198 0.241 0.315 0.574 0.873 0.994
#[3,] 0.157 0.220 0.299 0.569 0.871 0.994
#[4,] 0.119 0.173 0.248 0.545 0.859 0.994
#[5,] 0.065 0.122 0.197 0.510 0.848 0.993

variable lengths differ in R

I am getting the error above when trying to use the cv.lm function. Please see my code:
sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.string="NA")
sample1<-sample[2:2000,3:131]
samplex<-sample[2:50,3:131]
y<-as.numeric(sample1[1,])
y<-as.numeric(sample1[2:50,2])
x1<-as.numeric(sample1[2:50,3])
x2<-as.numeric(sample1[2:50,4])
x11<-x1[!is.na(y)]
x12<-x2[!is.na(y)]
y<-y[!is.na(y)]
fit1 <- lm(y ~ x11 + x12, data=sample)
fit1
x3<-as.numeric(sample1[2:50,5])
x4<-as.numeric(sample1[2:50,6])
x13<-x3[!is.na(y)]
x14<-x4[!is.na(y)]
fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
anova(fit1,fit2)
install.packages("DAAG")
library("DAAG")
cv.lm(df=samplex, fit1, m=10) # 3 fold cross-validation
Any insight will be appreciated.
Example of data
ID peak height LCA001 LCA002 LCA003
N001786 32391.111 0.397 0.229 -0.281
N005356 32341.473 0.397 -0.655 -1.301
N002416 32215.474 -0.703 -0.214 -0.901
GS239 31949.777 0.354 0.118 0.272
N016343 31698.853 0.226 0.04 -0.006
N003255 31604.978 0.024 NA -0.534
N004358 31356.597 -0.252 -0.022 -0.407
N000122 31168.09 -0.487 -0.533 -0.134
GS10564 31106.103 -0.156 -0.141 -1.17
GS17987 31043.876 NA 0.253 0.553
N003674 30876.207 0.109 0.093 0.07
Please see the example of the data above
First, you are using lm(...) incorrectly, or at least in a very unconventional way. The point of specifying the data=sample argument is that the formula then refers to columns of sample. It is generally very bad practice to use free-standing vectors in the formula instead.
So try this:
## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)
This will give columns 2:6 the appropriate names and then use those in the formula. The argument na.action=na.omit tells the lm(...) function to exclude all rows where there is an NA value in any of the relevant columns. This is actually the default, so it is not needed in this case, but included for clarity.
Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:
cv.lm(df=samplex, fit1, m=10)
is equivalent to:
cv.lm(df=samplex,y~x11+x12,m=10)
Since there are (presumably) no columns named x11 and x12 in samplex, and since you define these vectors externally, cv.lm(...) throws the error you are getting.

how to save a list object in R

I ran coxph on my data and got a result like this:
> z
Call:
coxph(formula = Surv(Years, Event) ~ y, data = x)
coef exp(coef) se(coef) z p
y 0.0714 1.07 0.288 0.248 0.8
Likelihood ratio test=0.06 on 1 df, p=0.804 n= 65, number of events= 49
I just want to save
y 0.0714 1.07 0.288 0.248 0.8
into a file, because I run permutations and generate 1000 such z objects.
I want to save them into a text file like this:
fin -0.3794 0.684 0.1914 -1.983 0.0470
age -0.0574 0.944 0.0220 -2.611 0.0090
race 0.3139 1.369 0.3080 1.019 0.3100
wexp -0.1498 0.861 0.2122 -0.706 0.4800
mar -0.4337 0.648 0.3819 -1.136 0.2600
paro -0.0849 0.919 0.1958 -0.434 0.6600
Can anyone help?
Thanks!
The coefficients are easily accessed by
summary(z)[['coefficients']]
and the confidence interval information by
summary(z)[['conf.int']]
To find out what the components of a summary.coxph object are, use
str(summary(z))
My advice would be to create a list of your permutations
data_list <- list(data_1, ...., data_1000)
Then call
lots_models <- lapply(data_list, coxph, formula = Surv(Years, Event) ~ y)
Which creates a list of models
You can create the summaries by
lots_summaries <- lapply(lots_models, summary)
Extract the coefficients
all_coefficients <- lapply(lots_summaries, '[[', 'coefficients')
all_conf.int <- lapply(lots_summaries, '[[', 'conf.int')
Add a permutation id column (if you want)
all_coefs_id <- lapply(seq_along(data_list),
function(i) cbind(all_coefficients[[i]],i))
all_ci_id <- lapply(seq_along(data_list),
function(i) cbind(all_conf.int[[i]],i))
Then combine into a data.frame
all_coefs_df <- do.call(rbind, all_coefs_id)
all_ci_df <- do.call(rbind, all_ci_id)
Which you can then save to a text file.
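For that last step, something along these lines should do (the file name is only a placeholder):
write.table(all_coefs_df, file = "coxph_coefficients.txt",
            quote = FALSE, sep = "\t", col.names = NA)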
