R formula in survfit - r

I am having trouble with formulas, environments, and survfit().
Things work fine for lm() but they fail for survfit().
General problem statement:
I am fitting a series of formulas to some data. So, I call the
modeling function with the formula passed as a variable. Later,
I want to work with the formula from the fitted object.
(From my naive point of view, the trouble comes from survfit not
recording the environment.)
Detailed Example
Expected behaviour as seen in lm():
library("plyr")
preds <- c("wt", "qsec")
f <- function() {
lm(mpg ~ wt, data = mtcars)
}
fits <- alply(preds, 1, function(pred)
{
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$call$formula
##modform
formula(fits[[1]])
## mpg ~ wt
## <environment: 0x1419d1a0>
Even though fits[[1]]$call$formula resolves to modform I can
still get the original formula with formula(fits[[1]]).
But things fail for survfit():
library("plyr")
library("survival")
preds <- c("resid.ds", "rx", "ecog.ps")
fits <-
alply(preds, 1, function(pred)
{
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
print(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$call$formula
## modform
formula(fits[[1]])
## Error in eval(expr, envir, enclos) : object 'modform' not found
Here (and in contrast to lm-fits), formula(fits[[1]]) does not
work!
So, my specific question is: How can I get back the formula used
to fit fits[[1]]?

The issue is that when x$formula is NULL, for an lm object there is a backup plan to get the formula; this doesn't exist for survfit objects
library("plyr")
library("survival")
preds <- c("wt", "qsec")
f <- function() lm(mpg ~ wt, data = mtcars)
fits <- alply(preds, 1, function(pred) {
modform <- reformulate(pred, response = "mpg")
lm(modform, data = mtcars)
})
fits[[1]]$formula
# NULL
The formula can be extracted with formula(fits[[1]]) which uses the formula generic. The lm S3 method for formula is
stats:::formula.lm
# function (x, ...)
# {
# form <- x$formula
# if (!is.null(form)) {
# form <- formula(x$terms)
# environment(form) <- environment(x$formula)
# form
# }
# else formula(x$terms)
# }
So when fits[[1]]$formula returns NULL, forumla.lm looks for a terms attribute in the object and finds the formula that way
fits[[1]]$terms
The survfit objects don't have x$formula or x$terms, so formula(x) givens an error
preds <- c("resid.ds", "rx", "ecog.ps")
fits <- alply(preds, 1, function(pred) {
modform <- paste("Surv(futime, fustat)", pred, sep = " ~ ")
modform <- as.formula(modform)
fit <- survfit(modform, data = ovarian)
})
fits[[1]]$formula
# NULL
formula(fits[[1]]) ## error
formula(fits[[1]]$terms)
# list()
You can fix this by inserting the formula into the call and evaluating it
modform <- as.formula(paste("Surv(futime, fustat)", 'rx', sep = " ~ "))
substitute(survfit(modform, data = ovarian), list(modform = modform))
# survfit(Surv(futime, fustat) ~ rx, data = ovarian)
eval(substitute(survfit(modform, data = ovarian), list(modform = modform)))
# Surv(futime, fustat) ~ rx
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA
Or by manually putting the formula into x$call$formula
fit <- survfit(modform, data = ovarian)
fit$call$formula
# modform
fit$call$formula <- modform
fit$call$formula
# Surv(futime, fustat) ~ rx
fit
# Call: survfit(formula = Surv(futime, fustat) ~ rx, data = ovarian)
#
# n events median 0.95LCL 0.95UCL
# rx=1 13 7 638 268 NA
# rx=2 13 5 NA 475 NA

Related

Custom function does not work properly unless the object is stored in the global environment in R

Context
I have a custom function myfun1 that fits the cox model. Before fitting the model, I need to do a bit of processing on the data used to fit the model. Specifically, run two lines of code, dd = datadist(data) and options(datadist = 'dd').
If dd exists in the environment inside the function, myfun1 will report an error.
But when I output dd to the global environment, myfun2 works fine.
Question
Why does this happen?
How can I get myfun1 to run properly while keeping dd inside the function?
Reproducible code
library(survival)
library(rms)
data(cancer)
myfun1 <- function(data, x){
x = sym(x)
dd = datadist(data)
options(datadist = 'dd')
fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
fit
}
myfun1(dat = lung, x = 'meal.cal')
# Error in Design(data, formula, specials = c("strat", "strata")) :
# dataset dd not found for options(datadist=)
myfun2 <- function(data, x){
x = sym(x)
dd <<- datadist(data) # Changed here compared to myfun1
options(datadist = 'dd')
fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
fit
}
myfun2(dat = lung, x = 'meal.cal')
# Frequencies of Missing Values Due to Each Variable
# Surv(time, status) meal.cal
# 0 47
#
# Cox Proportional Hazards Model
#
# cph(formula = Surv(time, status) ~ rcs(meal.cal), data = data)
#
#
# Model Tests Discrimination
# Indexes
# Obs 181 LR chi2 0.72 R2 0.004
# Events 134 d.f. 4 R2(4,181)0.000
# Center -0.3714 Pr(> chi2) 0.9485 R2(4,134)0.000
# Score chi2 0.76 Dxy 0.048
# Pr(> chi2) 0.9443

Use quasiquotation for formula syntax in a user-created function?

When I run this code:
# Create example data
df <- tibble(age=rnorm(10),
income=rnorm(10))
make_model <- function(response_var, df){
# Create formula
form <- as.formula(response_var ~ .)
# Create model
model <- lm(form , data=df)
# Return coefficients
return(coef(model))
}
make_model(income, df)
I obtain the following error
Error in eval(predvars, data, env) : object 'income' not found
How can I make this function work using quasiquotation? I assume the logic is the same as how we can call library(dplyr) instead of library("dplyr").
Use blast() (to be included in rlang 0.5.0)
blast <- function(expr, env = caller_env()) {
eval_bare(enexpr(expr), env)
}
make_model <- function(data, column) {
f <- blast(!!enexpr(column) ~ .)
model <- lm(f, data = data)
coef(model)
}
df <- data.frame(
age = rnorm(10),
income = rnorm(10)
)
make_model(df, income)
#> (Intercept) age
#> -0.3563103 -0.2200773
Works flexibly:
blast(list(!!!1:3))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
The following should work:
library(tidyverse)
# Your original function, modified
make_model <- function(df, column) {
column <- enexpr(column)
form <- as.formula(paste0(quo_text(column), " ~ ."))
model <- lm(form, data = df)
return(coef(model))
}
# Your original data and call
tibble(
age = rnorm(10),
income = rnorm(10)
) %>%
make_model(income)

calling the glm() function within a user-defined function

I have been trying to create a function that uses a glm() inside it. But I always get an error message. It looks like the function does not retrieve the value of the variable.
set.seed(234)
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)
rm(sex, age) #remove sex and age from the global environment for reproducibility
to_analyze <- function(dep, indep, data){
glm(dep~factor(indep), data=data)
}
to_analyze(dep=age, indep=sex, data=dsn)
#> Error in eval(predvars, data, env): object 'age' not found
You could use any of the following:
Using substitute:
to_analyze <- function(dep, indep, data){
glm(substitute(dep ~ factor(indep)), data=data)
}
to_analyze(dep=age, indep=sex, data=dsn)
Advantage: Can write the independent as a formula.
eg
to_analyze(Petal.Width, Sepal.Length + Sepal.Width, data = iris)
Using reformulate as stated by #NelsonGon
to_analyze <- function(dep, indep, data){
glm(reformulate(sprintf("factor(%s)",indep), dep), data = data)
}
Note that to call this function, the variables aught to be of type character
to_analyze(dep= "age", indep="sex", data=dsn)
Recall glm can also take a string that can be parsed to a formula:
to_analyze <- function(dep, indep, data){
glm(sprintf("%s~factor(%s)", dep, indep), data = data)
}
to_analyze("age", "sex", data=dsn)
or even:
to_analyze <- function(dep, indep, data){
glm(paste(dep,"~ factor(",indep,")"), data = data)
}
to_analyze("age", "sex", data=dsn)
LASTLY: to combine both the substitute and paste:
to_analyze <- function(dep, indep, data){
glm(paste(substitute(dep),"~ factor(",substitute(indep),")"), data = data)
}
will work for both symbols and characters. eg:
to_analyze(age, sex, data=dsn)
to_analyze("age", "sex", data=dsn)
Create a "formula" object in the function and pass to glm.
To get the variables without giving an error the standard trick is deparse(substitute(.)).
Then compose the formula with paste.
to_analyze <- function(dep, indep, data){
dep <- deparse(substitute(dep))
indep <- deparse(substitute(indep))
indep <- paste0("factor(", indep, ")")
fmla <- paste(dep, indep, sep = " ~ ")
fmla <- as.formula(fmla)
glm(fmla, data = data)
}
to_analyze(dep=age, indep=sex, data=dsn)
#
#Call: glm(formula = fmla, data = data)
#
#Coefficients:
# (Intercept) factor(sex)M
# 23.984 -3.984
#
#Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
#Null Deviance: 396.2
#Residual Deviance: 0.837 AIC: -188.5
#Onyambu and others. The substitute command seems to work well for just one call as it works for the to_analyze(). However when I call another function inside it, it is complaining again. Any help would be greatly appreciated
to_analyze <- function(dep, indep, data){
glm(substitute(dep ~ factor(indep)), data=data)
}
to_analyze(dep=age, indep=sex, data=dsn)
#>
#> Call: glm(formula = substitute(dep ~ factor(indep)), data = data)
#>
#> Coefficients:
#> (Intercept) factor(sex)M
#> 24.006 -4.034
#>
#> Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
#> Null Deviance: 397.3
#> Residual Deviance: 0.8152 AIC: -191.2
However, I am stuck again because I am trying to call the output from this model in lsmeans::lsmeans() to predict marginal means and return the output but it is giving me an error. Although it does not need an offset, I am including it here so that I can get a more general code that I can modify later. Any help would be greatly appreciated
to_predict_lsmeans <- function(dep, indep, data){
model <- glm(substitute(dep ~ factor(indep)), data=data)
pred <- lsmeans:: lsmeans(model, substitute(~ factor(indep)), offset=substitute(data)$log(age), type ="response" )
return(pred)
}
pred <- to_predict_lsmeans(dep=age, indep=sex, data=dsn)
#> Error in ref_grid(object, ...): We are unable to reconstruct the data.
#> The variables needed are:
#> sex
#> Are any of these actually constants? (specify via 'params = ')
#> The dataset name is:
#> data
#> Does the data still exist? Or you can specify a dataset via 'data = '
pred
#> Error in eval(expr, envir, enclos): object 'pred' not found

How do I store lm object in a data frame in R [duplicate]

This question already has answers here:
Linear Regression and storing results in data frame [duplicate]
(5 answers)
Closed 7 years ago.
I need to store lm fit object in a data frame for further processing (This is needed as I will have around 200+ regressions to be stored in the data frame). I am not able to store the fit object in the data frame. Following code produces the error message:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df = data.frame()
df = rbind(df, c(id="xx1", fitObj=fit))
Error in rbind(deparse.level, ...) :
invalid list argument: all variables should have the same length
I would like to get the data frame as returned by "do" call of dplyr, example below:
> tacrSECOutput
Source: local data frame [24 x 5]
Groups: <by row>
sector control id1 fit count
1 Chemicals and Chemical Products S tSector <S3:lm> 2515
2 Construation and Real Estate S tSector <S3:lm> 985
Please note that this is a sample output only. I would like to create the data frame (fit column for the lm object) in the above format so that my rest of the code can work on the added models.
What am I doing wrong? Appreciate the help very much.
The list approach:
Clearly based on #Pascal 's idea. Not a fan of lists, but in some cases they are extremely helpful.
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
# create a list based on models names provided
list_models = lapply(model_names, get)
# set names
names(list_models) = model_names
# check the output
list_models
# $fit1
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5368 1.9678
#
#
# $fit2
#
# Call:
# lm(formula = y ~ x)
#
# Coefficients:
# (Intercept) x
# 0.5545 1.9192
Given that you have lots of models in your work space, the only "manual" thing you have to do is provide a vector of your models names (how are they stored) and then using the get function you can obtain the actual model objects with those names and save them in a list.
Store model objects in a dataset when you create them:
The data frame can be created using dplyr and do if you are planning to store the model objects when they are created.
library(dplyr)
set.seed(42)
x1 = runif(100)
y1 = 2*x+runif(100)
set.seed(123)
x2 <- runif(100)
y2 <- 2*x+runif(100)
model_formulas = c("y1~x1", "y2~x2")
data.frame(model_formulas, stringsAsFactors = F) %>%
group_by(model_formulas) %>%
do(model = lm(.$model_formulas))
# model_formulas model
# (chr) (chr)
# 1 y1~x1 <S3:lm>
# 2 y2~x2 <S3:lm>
It REALLY depends on how "organised" is the process that allows you to built those 200+ models you mentioned. You can build your models this way if they depend on columns of a specific dataset. It will not work if you want to build models based on various columns of different datasets, maybe of different work spaces or different model types (linear/logistic regression).
Store existing model objects in a dataset:
Actually I think you can still use dplyr using the same philosophy as in the list approach. If the models are already built you can use their names like this
library(dplyr)
set.seed(42)
x <- runif(100)
y <- 2*x+runif(100)
fit1 <- lm(y ~x)
set.seed(123)
x <- runif(100)
y <- 2*x+runif(100)
fit2 <- lm(y ~x)
# manually select model names
model_names = c("fit1","fit2")
data.frame(model_names, stringsAsFactors = F) %>%
group_by(model_names) %>%
do(model = get(.$model_names))
# model_names model
# (chr) (chr)
# 1 fit1 <S3:lm>
# 2 fit2 <S3:lm>
This seems to work:
x = runif(100)
y = 2*x+runif(100)
fit = lm(y ~x)
df <- data.frame()
fitvec <- serialize(fit,NULL)
df <- rbind(df, data.frame(id="xx1", fitObj=fitvec))
fit1 <- unserialize( df$fitObj )
print(fit1)
yields:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
0.529 1.936
Update Okay, now more complex, so as to get one row per fit.
vdf <- data.frame()
fitlist <- list()
niter <- 5
for (i in 1:niter){
# Create a new model each time
a <- runif(1)
b <- runif(1)
n <- 50*runif(1) + 50
x <- runif(n)
y <- a*x + b + rnorm(n,0.1)
fit <- lm(x~y)
fitlist[[length(fitlist)+1]] <- serialize(fit,NULL)
}
vdf <- data.frame(id=1:niter)
vdf$fitlist <- fitlist
for (i in 1:niter){
print(unserialize(vdf$fitlist[[i]]))
}
yields:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.45689 0.07766
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.44922 0.00658
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.41036 0.04522
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40823 0.07189
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
0.40818 0.08141

Custom Bootstrapped Standard Error: numeric 'envir' arg not of length one

I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377

Resources