R: Predictions from a list of coxph objects on newdata

I am building a series of Cox regression models and getting predictions from those models on new data. I can get the expected number of events in some cases, but not others.
For example, if the formula is written out inside the coxph call, the predictions are calculated. But if the formula is stored in an object and that object is passed instead, I get an error. I also cannot get the predictions when I try to create them within a dplyr mutate call (for the function I am writing, this would be the ideal place for the predictions to work).
Any assistance is greatly appreciated!
Thank you,
Daniel
require(survival)
require(tidyverse)
n = 15
# creating tibble of tibbles.
results =
  tibble(id = 1:n) %>%
  group_by(id) %>%
  do(
    # creating tibble to evaluate the model on
    tbl0 = tibble(time = runif(n), x = runif(n)),
    # creating tibble to build the model on
    tbl = tibble(time = runif(n), x = runif(n))
  ) %>%
  ungroup()
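(An aside: the do() syntax above is superseded in current dplyr; the same tibble of tibbles can be built with rowwise() and list(), as in this sketch:)
# equivalent setup with rowwise() instead of the superseded do()
results <-
  tibble(id = 1:n) %>%
  rowwise() %>%
  mutate(tbl0 = list(tibble(time = runif(n), x = runif(n))),
         tbl = list(tibble(time = runif(n), x = runif(n)))) %>%
  ungroup()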
# it works when the formula is written out directly inside the coxph call
map2(results$tbl, results$tbl0,
     ~ predict(coxph(Surv(time) ~ x, data = .x), newdata = .y, type = "expected"))
# but if the formula is defined beforehand, I get an error
f = as.formula(Surv(time) ~ x)
map2(results$tbl, results$tbl0,
     ~ predict(coxph(f, data = .x), newdata = .y, type = "expected"))
# I also get an error when I try to include it in a dplyr pipe with mutate
results %>%
  mutate(
    pred = map2(tbl, tbl0,
                ~ predict(coxph(f, data = .x), newdata = .y, type = "expected"))
  )

I figured it out (with the help of a friend). If you define the formula as a string and coerce it to a formula inside the function call, everything runs smoothly. I am not sure why it works, but it does!
# define the formula as a string, and coerce it inside the call with as.formula()
f = "Surv(time) ~ x"
map2(results$tbl, results$tbl0,
     ~ predict(coxph(as.formula(f), data = .x), newdata = .y, type = "expected"))
# this also works in a dplyr pipe with mutate
results %>%
  mutate(
    pred = map2(tbl, tbl0,
                ~ predict(coxph(as.formula(f), data = .x), newdata = .y, type = "expected"))
  )
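A likely explanation (my reading, not something stated in the survival docs): with type = "expected", predict.coxph() has to rebuild the model frame from the call stored in the fit, and a pre-built formula object carries its own environment along with it, which can fail to resolve inside the map2() lambda, whereas the inline as.formula(f) expression is re-evaluated where the data live. Whatever the cause, the working pattern wraps up neatly in a helper; a minimal sketch (the function and argument names here are my own):
# hypothetical wrapper: fit a Cox model from a formula string and
# return the expected number of events on new data
fit_predict_expected <- function(formula_string, train, newdata) {
  fit <- coxph(as.formula(formula_string), data = train)
  predict(fit, newdata = newdata, type = "expected")
}

results %>%
  mutate(pred = map2(tbl, tbl0,
                     ~ fit_predict_expected("Surv(time) ~ x", .x, .y)))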

Related

Use lapply for multiple regression with formula changing, not the dataset

I have seen an example of list apply (lapply) that works nicely to take a list of data objects and return a list of regression outputs, which we can pass to stargazer for nicely formatted output:
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd) lm(crim ~ indus, data = dd))
stargazer(fit, type = "text")
What I would like to do is, instead of passing a list of datasets to run the same regression on each one (as above), pass a list of dependent variables to run different regressions on the same data set. Longhand, it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
With lapply, I tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar) lm(dvar ~ indus, data = Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from strings:
fit <- lapply(myvars, function(dvar)
  lm(as.formula(paste0(dvar, " ~ indus")), data = Boston))
This should also work, since lm() coerces a character string to a formula:
fit <- lapply(myvars, function(dvar) lm(paste0(dvar, " ~ indus"), data = Boston))
You can also use a dplyr & purrr approach: keep everything in a tibble and pull out what you want when you need it. There is no difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox", "crim"), data = list(Boston))
analysis <- var_tibble %>%
  mutate(models = map2(data, vars, ~ lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
  mutate(tables = map2(models, vars, ~ stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
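The fitted models can be pulled back out of the tibble whenever they're needed, for example to print one combined table (a small usage sketch building on the analysis tibble above):
models <- analysis %>% pull(models)
stargazer(models, type = "text")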
You can also use get():
# make a list of dependent variables
list_x <- list("nox", "crim")
# create the regression function
my_reg <- function(x) { lm(get(x) ~ indus, data = Boston) }
# run the regressions
results <- lapply(list_x, my_reg)
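Base R's reformulate() builds the same formulas without pasting strings together, which some find cleaner; a sketch of that variant (same myvars as above):
# reformulate("indus", response = "nox") produces nox ~ indus
fit <- lapply(myvars, function(dvar)
  lm(reformulate("indus", response = dvar), data = Boston))
stargazer(fit, type = "text")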

{gtsummary} R package -- Problem passing arguments from `add_difference()` to custom function

I'm trying to use a custom function for testing in a tbl_summary() |> add_difference() call.
Basically, whatever I pass to the function via the test.args argument of add_difference() doesn't reach the called function. I made a simplified version of my code below. I'm pretty sure it's far from optimized and probably full of other bugs I can't see, because I'm stuck at the call! 😕
library(tidyverse)
library(gtsummary)
library(infer)  # specify(), calculate(), generate(), get_confidence_interval()
library(rlang)  # ensym(), expr()

# Most args are necessary for any function to be compatible with
# add_difference()
testMe <- function(data,
                   variable,
                   by,
                   group = NULL,
                   type = NULL,
                   conf.level = 0.95,
                   adj.vars = NULL,
                   # test_name is the one I have problems with
                   test_name = NULL,
                   ci_type = 0.95,
                   continuous_variable = NULL,
                   na.rm,
                   ...) {
  # Turn strings into actual symbols for the variables
  variable <- ensym(variable)
  by <- ensym(by)
  # Use functions from infer to make statistical computations
  estimate <- data |>
    specify(expr(!!variable ~ !!by)) |>
    calculate(stat = test_name, na.rm = na.rm)
  booted <- data |>
    specify(expr(!!variable ~ !!by)) |>
    generate(reps = 1000, type = "bootstrap") |>
    calculate(stat = test_name, na.rm = na.rm)
  result_df <- get_confidence_interval(booted,
                                       level = conf.level,
                                       type = "bias-corrected",
                                       point_estimate = estimate) |>
    rename(conf.low = lower_ci,
           conf.high = upper_ci)
  # returns a df with column names as specified in the gtsummary docs
  return(result_df)
}
This is the gtsummary code:
# Turn the group factor into character (per add_difference()) and reduce
# the groups to 2.
pdata <- PlantGrowth |>
  filter((group == "trt1") | (group == "trt2")) |>
  mutate(group = as.character(group))

table <- pdata |>
  tbl_summary(
    by = group,
    statistic = list(all_continuous() ~ "{median} ({p25}—{p75})")
  ) |>
  add_difference(
    test = list(weight ~ "testMe"),
    # Here we pass a string containing the name of the estimate which
    # the infer functions will need.
    test.args = weight ~ list(test_name = "diff in medians")
    # The following selection, which I've seen in documentation and on the
    # Internet, doesn't work for me.
    # test.args = all_tests("testMe") ~ list(test_name = "diff in medians")
  )
In my attempts, the testMe function is called, but its test_name argument is always NULL (i.e. the default); all subsequent errors are due to its being empty, and if I set a sensible default, the function always returns the result for that default.
When I change testMe not to have a default test_name, I get an "argument missing" error. In other words, test.args isn't working for me…
Am I getting something wrong? Thanks!
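One sanity check worth running (my suggestion, not part of the original post) is to call testMe() directly, outside gtsummary, with test_name supplied explicitly. If the infer pipeline runs and returns conf.low/conf.high, the problem is confined to how test.args forwards the argument:
# direct call, bypassing add_difference(); pdata as defined above
# (infer may message about the implicit order of the two groups)
testMe(pdata, variable = weight, by = group,
       test_name = "diff in medians", na.rm = TRUE)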

How to apply a custom function to nested dataframes?

I'm trying to apply a custom function to a nested dataframe.
I want to apply a machine learning algorithm to predict NA values.
After doing a bit of reading online, it seemed that the map function would be the most applicable here.
I have a section of code that nests the dataframe and then splits the data into a test set (data3) and a train set (data2), with the test set containing all the rows where the column to be predicted is NA, and the train set containing the non-NA rows used to train the ML model.
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data2 = map(data, ~ filter(., !is.na(mean_night_flow))),
         data3 = map(data, ~ filter(., is.na(mean_night_flow))))
Here is the function I intend to use:
my_function <- function(test, train) {
  # extraTrees() comes from the extraTrees package
  et <- extraTrees(x = train, y = train[, "mean_night_flow"], na.action = "fuse",
                   ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9)
  test1 <- test
  test1[ , "mean_night_flow"] <- 0
  pred <- predict(et, newdata = test1[, "mean_night_flow"])
  test1[ , "mean_night_flow"] <- pred
  return(test1)
}
I have tried the following code; however, it does not work:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data2 = map(data, ~ filter(., !is.na(mean_night_flow))),
         data3 = map(data, ~ filter(., is.na(mean_night_flow))),
         data4 = map(data3, data2, ~ my_function(.x, .y)))
It gives the following error:
Error: Index 1 must have length 1, not 33
This suggests that it expects a column rather than a whole dataframe. How can I get this to work?
Many thanks
Without testing on your data, I think you're using the wrong map function. purrr::map works on one argument (one list, one vector, whatever) and returns a list. You are passing it two values (data3 and data2), so we need to use:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data2 = map(data, ~ filter(., !is.na(mean_night_flow))),
         data3 = map(data, ~ filter(., is.na(mean_night_flow))),
         data4 = map2(data3, data2, ~ my_function(.x, .y)))
If you find yourself needing more than two, you need pmap. (You can use pmap for one or two arguments as well; it's effectively the same.) The two biggest differences when migrating from map to pmap are:
your arguments need to be enclosed within a list, so
map2(data3, data2, ...)
becomes
pmap(list(data3, data2), ...)
you refer to them by double-dot number position, ..1, ..2, ..3, etc., so
~ my_function(.x, .y)
becomes
~ my_function(..1, ..2)
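Putting both changes together, the map2() call in the mutate above could equivalently be written with pmap (a sketch reusing the same columns):
# equivalent to: data4 = map2(data3, data2, ~ my_function(.x, .y))
mutate(data4 = pmap(list(data3, data2), ~ my_function(..1, ..2)))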
An alternative that simplifies your overall flow just a little.
my_function <- function(test, train = NULL, fld = "mean_night_flow") {
  if (is.null(train)) {
    train <- test[ !is.na(test[[fld]]), , drop = FALSE ]
    test <- test[ is.na(test[[fld]]), , drop = FALSE ]
  }
  et <- extraTrees(x = train, y = train[, fld], na.action = "fuse",
                   ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9)
  test1 <- test
  test1[ , fld] <- 0
  pred <- predict(et, newdata = test1[, fld])
  test1[ , fld] <- pred
  return(test1)
}
which auto-populates train based on the missingness of your field. (I also parameterized it in case you ever need to train/test on a different field.) This changes your use to
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
  group_by(dma) %>%
  nest() %>%
  mutate(data4 = map(data, ~ my_function(.x, fld = "mean_night_flow")))
(It's important to name fld=, since otherwise it would be matched positionally to train.)
If you're planning on reusing data2 and/or data3 later in the pipe or analysis, then this step is not necessarily what you need.
Note: I suspect your function is under-tested or incomplete. The fact that you assign all zeroes to test1[, "mean_night_flow"] and then use those zeroes in your call to predict seems suspect. I might be missing something, but I would expect perhaps
test1 <- test
pred <- predict(et, newdata = test1)
test1[ , fld] <- pred
return(test1)
(though copying to test1 is mostly unnecessary for a tibble or data.frame, since R's copy-on-modify semantics leave the original frame untouched; I would be more cautious if you were using a data.table).
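To see that copy-on-modify point concretely (a tiny base-R illustration, not from the original answer):
# df2 starts out sharing data with df1 ...
df1 <- data.frame(a = 1:3)
df2 <- df1
# ... but modifying df2 triggers a copy, so df1 is untouched
df2$a <- 0
df1$a   # still 1 2 3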

broom::augment omits columns from data

broom::augment outputs only the columns from data that are used in the formula. This is problematic behavior, because being able to find something like a respondent ID can be very helpful at times. Using the newdata argument could be a workaround, but it still doesn't provide a fix when working with nested data.
Additional explanations in-line:
library(tidyverse)  # dplyr, tidyr, purrr for %>%, mutate, nest, map
library(broom)

# simulated glm data
glmdata = data.frame(ID = 1:100, A = rnorm(100), B = rnorm(100)) %>%
  mutate(response = rbinom(length(ID), 1, 1 / (1 + exp(-2 * A - 3 * B))))
# fit the model, not including the ID variable
glmfit = glm(response ~ A + B, glmdata, family = 'binomial')
# the ID variable is contained in glmfit$data
str(glmfit$data)
# works!
head(glmfit$data$ID)
# use broom::augment
augmented = glmfit %>% augment()
# does not work, wth broom?!
augmented$ID
# ok ... I could use the newdata argument
augmented = glmfit %>% augment(newdata = glmdata)
augmented$ID
# however, that is a hacky workaround ...
# ... and it does not fix the following scenario:
# let's say I want to use nest
# simulated glm data
glmdata1 = data.frame(segm = 1, ID = 1:100, A = rnorm(100), B = rnorm(100)) %>%
  mutate(response = rbinom(length(ID), 1, 1 / (1 + exp(-2 * A - 3 * B))))
glmdata2 = data.frame(segm = 2, ID = 1:100, A = rnorm(100), B = rnorm(100)) %>%
  mutate(response = rbinom(length(ID), 1, 1 / (1 + exp(-3 * A - 2 * B))))
glmdata_nest = rbind(glmdata1, glmdata2) %>% group_by(segm) %>% nest()
# fit the two models via map
glmfit_nest = glmdata_nest %>%
  mutate(model = map(data, glm, formula = response ~ A + B, family = 'binomial'))
# run augment via map
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model, augment))
# ID is not here ...
glmfit_nest_augmented$augmented[[1]]$ID
# ok, so then we have to use map2 ....
glmfit_nest_augmented = glmfit_nest %>%
  mutate(augmented = map2(model, data, augment, newdata = .y))
# but even this doesn't work
# also, trying to recycle glm$data does not work
glmfit_nest_augmented = glmfit_nest %>%
  mutate(augmented = map(model, augment, newdata = .$data))
Update:
The broom developers deliberately chose this inconsistent behavior:
https://github.com/tidymodels/broom/issues/753
Here the .x and .y go with the anonymous-function syntax (~):
glmfit_nest_augmented <- glmfit_nest %>%
  mutate(augmented = map2(model, data, ~ augment(.x, newdata = .y)))
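Once that runs, unnesting the list-column shows the ID column is preserved (a usage sketch with tidyr::unnest; .fitted is the prediction column augment() adds):
glmfit_nest_augmented %>%
  select(segm, augmented) %>%
  unnest(augmented) %>%
  select(segm, ID, .fitted)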
