Averaging R objects

Let's say that I have several R objects, e.g. lm outputs:
m1 <- lm(x ~ y, data = data, subset = sample==1)
m2 <- lm(x ~ y, data = data, subset = sample==2)
m3 <- lm(x ~ y, data = data, subset = sample==3)
m4 <- lm(x ~ y, data = data, subset = sample==4)
and now I want to average those objects, i.e. average all the estimates produced by lm. I would be very happy if I could also get summary statistics of all the parameters in the objects, i.e. the average intercept etc. What simplifies the problem is that all the objects have roughly the same structure, just calculated on different samples.
Is there any way to do this in a general fashion, that is, using a single general function rather than taking all the individual values and averaging them one at a time? I would also need this kind of function for different kinds of objects.
lapply could probably be used in some way, but how do you deal with multiple (varying) layers of nesting?

This should work (example using the mtcars dataset):
library(dplyr)
meanpars <- mtcars %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ wt, data = .)) %>%
  summarise(
    intercepts = coef(mod)[1],
    wtbeta = coef(mod)[2]) %>%
  summarise(
    meaninter = mean(intercepts),
    meanbeta = mean(wtbeta))
Here it is with your toy data plugged in:
library(dplyr)
meanpars <- data %>%
  group_by(sample) %>%
  do(mod = lm(x ~ y, data = .)) %>%
  summarise(
    intercepts = coef(mod)[1],
    ybeta = coef(mod)[2]) %>%
  summarise(
    meaninter = mean(intercepts),
    meanbeta = mean(ybeta))
Edit: If you don't want to average the coefficients in the end, just remove the last summarise function and you'll still get a data.frame with the results from your models.
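For the more general part of the question (averaging arbitrary objects with varying layers of nesting), a recursive helper is one possible sketch. This is an illustration, not an established function: avg_objects (a hypothetical name) walks structurally identical objects and averages only their numeric leaves.
avg_objects <- function(objs) {
  first <- objs[[1]]
  if (is.numeric(first)) {
    # element-wise mean across numeric leaves (names are preserved)
    Reduce(`+`, objs) / length(objs)
  } else if (is.list(first)) {
    # recurse into each component, keeping the component names
    setNames(
      lapply(seq_along(first), function(i) avg_objects(lapply(objs, `[[`, i))),
      names(first)
    )
  } else {
    # leave non-numeric parts (calls, formulas, factors) untouched
    first
  }
}
# e.g. average the coefficient vectors of the four models
avg_objects(list(coef(m1), coef(m2), coef(m3), coef(m4)))
Applied to whole lm objects this would also recurse into model frames and qr components, so extracting the pieces you care about first (coef here) is the safer use.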

Related

Add multiple model statistics with add_glance_table() horizontally (and not vertically) in tbl_regression

I used tbl_regression() and add_glance_table() from gtsummary to build a table with model statistics:
library(gtsummary)
coxph(Surv(time, event) ~ score, data = dat) %>%
  tbl_regression(exponentiate = TRUE) %>%
  add_glance_table(concordance)
1st question: How can I move the model statistics horizontally, to the right?
Because, in the end, I want to display multiple model statistics, with the C-index in the last column, like this:
tbl_uvregression(
  dat_score,
  method = survival::coxph,
  y = Surv(time, event),
  exponentiate = TRUE)
2nd question: How do I add add_glance_table() in tbl_uvregression()?
You can merge any additional columns/statistics into a gtsummary table using the modify_table_body() function (the table_body is an internal data frame that is styled and printed as the summary table).
It's possible to add the c-index in a tbl_uvregression() setting, but I think it requires a deeper understanding of the internals of a tbl_uvregression() object. In the example below, I estimate each univariable model separately, summarize the model with tbl_regression(), merge in the c-index, then stack all the tbls with tbl_stack().
Happy Programming!
library(gtsummary)
library(tidyverse)
library(survival)
packageVersion("gtsummary")
#> [1] '1.5.2'

covariates <- c("age", "marker")

# iterate over the covariates
tbl <-
  covariates %>%
  map(
    function(varname) {
      # build regression model
      mod <-
        str_glue("Surv(ttdeath, death) ~ {varname}") %>%
        as.formula() %>%
        coxph(data = trial)

      # calculate and format c-index, adding a variable column to merge on in the next step
      df_cindex <-
        broom::glance(mod) %>%
        select(concordance) %>%
        mutate(
          concordance = style_sigfig(concordance, digits = 3),
          variable = varname
        )

      # summarize model
      tbl_regression(mod, exponentiate = TRUE) %>%
        # merge in the c-index
        modify_table_body(~left_join(.x, df_cindex, by = "variable")) %>%
        modify_header(concordance = "**c-index**") # assigning a header label unhides the column
    }
  ) %>%
  # stack all tbls
  tbl_stack()
Created on 2022-04-09 by the reprex package (v2.0.1)
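For the 1st question (a single model with the statistic as a column rather than a row), the same modify_table_body() trick should work on its own. This is a sketch, not from the original answer, reusing dat and score from the question:
# assumes library(gtsummary), library(survival) and the tidyverse are loaded as above
mod <- coxph(Surv(time, event) ~ score, data = dat)
# format the c-index, with a variable column matching table_body
df_cindex <- broom::glance(mod) %>%
  select(concordance) %>%
  mutate(
    concordance = style_sigfig(concordance, digits = 3),
    variable = "score"
  )
tbl_regression(mod, exponentiate = TRUE) %>%
  modify_table_body(~left_join(.x, df_cindex, by = "variable")) %>%
  modify_header(concordance = "**c-index**")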

Feeding new data to existing model and using broom::augment to add predictions

I am using tidyverse, broom, and purrr to fit a model to some data, by group. I am then trying to use this model to predict on some new data, again by group. broom's augment function nicely adds not only the predictions but also other values like the standard error. However, I am unable to make augment use the new data instead of the old data, so my two sets of predictions are exactly the same. The question is: how can I make augment use the new data instead of the old data (which was used to fit the model)?
Here's a reproducible example:
library(tidyverse)
library(broom)
library(purrr)

# nest the iris dataset by Species and fit a linear model
iris.nest <- nest(iris, data = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)) %>%
  mutate(model = map(data, function(df) lm(Sepal.Width ~ Sepal.Length, data = df)))

# create a new dataset where the Sepal.Length is 5x as big
newdata <- iris %>%
  mutate(Sepal.Length = Sepal.Length * 5) %>%
  nest(data = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)) %>%
  rename("newdata" = "data")

# join these two nested datasets together
iris.nest.new <- left_join(iris.nest, newdata)

# now form two new columns of predictions -- one using the "old" data that the model was
# initially fit on, and the second using the new data where the Sepal.Length has been increased
iris.nest.new <- iris.nest.new %>%
  mutate(preds = map(model, broom::augment),
         preds.new = map2(model, newdata, broom::augment)) # THIS LINE DOESN'T WORK ****

# unnest the predictions on the "old" data
preds <- select(iris.nest.new, preds) %>%
  unnest(cols = c(preds))
# rename the columns prior to merging
names(preds)[3:9] <- paste0("old", names(preds)[3:9])

# now unnest the predictions on the "new" data
preds.new <- select(iris.nest.new, preds.new) %>%
  unnest(cols = c(preds.new))
# ... and also rename columns prior to merging
names(preds.new)[3:9] <- paste0("new", names(preds.new)[3:9])

# merge the two sets of predictions and compare
compare <- bind_cols(preds, preds.new)
select(compare, old.fitted, new.fitted) %>% View(.) # EXACTLY THE SAME!!!!
When calling broom::augment, note that newdata= is the third parameter. When you use purrr::map2, the values you iterate over are passed to the first two parameters by default; it doesn't matter what you've named the lists you are passing in. You need to explicitly place the new data in the newdata= parameter.
iris.nest.new <- iris.nest.new %>%
  mutate(preds = map(model, broom::augment),
         preds.new = map2(model, newdata, ~broom::augment(.x, newdata = .y)))
The difference can be seen by running these two commands:
broom::augment(iris.nest.new$model[[1]], iris.nest.new$newdata[[1]])
broom::augment(iris.nest.new$model[[1]], newdata=iris.nest.new$newdata[[1]])

Create a list column with just one item in it (no group by)

Here is a workflow that trains an XGB model using tidyr list columns, rsample folds, and purrr map functions:
library(rsample)
library(xgboost)
library(Metrics)
library(tidyverse) # for diamonds, the dplyr verbs and purrr's map

# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)

train_cv <- vfold_cv(training_data, 5) %>%
  # create training and validation sets within each fold
  mutate(train = map(splits, ~training(.x)),
         validate = map(splits, ~testing(.x)))

# xgb across each fold
mod.xgb <- train_cv %>%
  # convert regression data to a dmatrix for xgb. Just simple price ~ carat for here and now
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price)),
         validate_dmatrix = map(validate, ~xgb.DMatrix(.x %>% select(carat) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>% # fit the model
  mutate(predictions = map2(.x = regression, .y = validate_dmatrix, ~predict(.x, .y))) %>% # predictions
  mutate(validation_actuals = map(validate, ~.x$price)) %>% # get the actuals (price, the label) for the evaluation metrics
  mutate(mae = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::mae(actual = .x, predicted = .y))) %>% # mae
  mutate(rmse = map2_dbl(.x = validation_actuals, .y = predictions, ~Metrics::rmse(actual = .x, predicted = .y))) # rmse
My actual script and data use crossing() and other models with their own hyperparameters in order to pick the best model. So the real block the above is based on actually contains several models, which lets me compare them.
I like this workflow because using dplyr verbs and the pipe operator, I can make changes as needed while progressing through each step, then apply them to each fold using map functions.
Now that I've passed the cross-validation phase and am at the test phase, I'd like to emulate that 'flow', except I do not have folds, so there is no need for map_* functions.
However, I still need to make transformations such as the one above adding an xgb.DMatrix since I am using xgboost.
For example, below is what I created to test my chosen xgb model:
library(rsample)
library(xgboost)
library(Metrics)
# keep just numeric features for this example
pdata_split <- initial_split(diamonds %>% select(-cut, -color, -clarity), 0.9)
training_data <- training(pdata_split)
testing_data <- testing(pdata_split)
# create xgb.DMatrix objects
training_data_xgb_matrix <- xgb.DMatrix(training_data %>% select(-price) %>% as.matrix(), label = training_data$price)
test_data_xgb_matrix <- xgb.DMatrix(testing_data %>% select(-price) %>% as.matrix(), label = testing_data$price)
# create a regression
model_xgb <- xgboost(training_data_xgb_matrix, nrounds = 100, objective = "reg:squarederror")
# predict on test data
xgb_predictions <- predict(model_xgb, test_data_xgb_matrix)
# evaluate using rmse
test_rmse <- rmse(actual = testing_data$price, predicted = xgb_predictions)
test_rmse
# 1370.185
So, that is doing it step by step. My question is: can I somehow do this in a similar way to the approach used above during cross-validation, particularly by just adding a new column to an existing df / list column?
What is the 'tidy' way of evaluating a model on test data? Is it possible to start with training_data, append the test data in a new column, and start a workflow to reach the same result, with rmse in its own column following a call to mutate()?
training_data %>%
  (add test data in a new column) %>%
  mutate(convert training data to a xgb.DMatrix) %>%
  mutate(convert test data to a xgb.DMatrix) %>%
  mutate(fit a regression model based on the training data xgb.DMatrix) %>%
  mutate(predict with the regression model on test data xgb.DMatrix) %>%
  mutate(calculate rmse)
Is this possible?
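One way to emulate that flow without folds (a sketch, not from the thread; test_flow is an illustrative name) is to wrap the single train/test pair in a one-row tibble of list columns, then reuse the same mutate()/map() idioms as in the cross-validation block:
library(dplyr)
library(purrr)
library(xgboost)
library(Metrics)
# one-row tibble: the train and test sets sit side by side in list columns
test_flow <- tibble(train = list(training_data), test = list(testing_data)) %>%
  mutate(train_dmatrix = map(train, ~xgb.DMatrix(.x %>% select(-price) %>% as.matrix(), label = .x$price)),
         test_dmatrix = map(test, ~xgb.DMatrix(.x %>% select(-price) %>% as.matrix(), label = .x$price))) %>%
  mutate(regression = map(train_dmatrix, ~xgboost(.x, objective = "reg:squarederror", nrounds = 100))) %>%
  mutate(predictions = map2(regression, test_dmatrix, ~predict(.x, .y))) %>%
  mutate(rmse = map2_dbl(test, predictions, ~Metrics::rmse(actual = .x$price, predicted = .y)))
test_flow$rmse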

Object '.' not found while piping with dplyr

I am trying to conduct a survival curve using the survival package. The MWE code is as follows:
df %>%
  filter(fac <= "Limit") %>%
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
  ggsurvplot(fit = .)
I get the error: Error in eval(fit$call$data) : object '.' not found
When I try to break this down further by:
survfit <- df %>%
  filter(fac <= "Limit") %>%
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .)
ggsurvplot(fit = survfit)
I get an identical error. Is anyone able to figure out how to pipe from my dataframe all the way through to a survival curve? The reason I would like to do this is to streamline the filtering of my dataframe, in order to produce a multitude of different survival curves without having to create many subsetted dataframes.
Apparently, ggsurvplot expects an object of class "survfit" as its first argument but also needs the data set as an argument.
The example below is based on the first example of survfit.formula in the survival package.
library(dplyr)
library(survival)
library(survminer)
aml %>%
  survfit(Surv(time, status) ~ x, data = .) %>%
  ggsurvplot(data = aml)
In the question's case this would become
df %>%
  filter(fac <= "Limit") %>%
  survfit(Surv(tte, !is.na(event)) ~ fac, data = .) %>%
  ggsurvplot(data = filter(df, fac <= "Limit"))
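A variation that avoids writing the filter twice (a sketch; df_sub is an illustrative name) is to filter once into an intermediate object and pass it to both survfit() and ggsurvplot():
library(dplyr)
library(survival)
library(survminer)
df_sub <- filter(df, fac <= "Limit") # filter once, reuse twice
fit <- survfit(Surv(tte, !is.na(event)) ~ fac, data = df_sub)
ggsurvplot(fit, data = df_sub)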

How to fit linear models to several randomly generated datasets using the tidyverse

I'm currently reading through Hadley Wickham's R for Data Science and came across the exercises in 23.2.1, dealing with the robustness of linear models fitted by minimizing squared differences to randomly generated datasets.
I tried to implement this using tidyverse packages.
generate_data <- function(seed){
  set.seed(seed)
  tibble(
    x = rep(1:10, each = 3),
    y = x * 1.5 + 6 + rt(length(x), df = 2),
    seed = as.character(seed)
  )
}
seeds <- 6:11
datasets <- seeds %>%
  map(generate_data)
This is the crucial point. The datasets exist in a list of dataframes, so I used the map function, finally extracting the coefficients of the respective model with coef. In this process, however, I lose the information about the seed that was used, and hence the link to the dataset it refers to, which forces me to do the ugly mutate(seed = as.character(seeds)) thing.
model_parameters <- datasets %>%
  map(~ lm(y ~ x, data = .)) %>%
  map(coef)
model_parameters <- model_parameters %>%
  map_df(bind_rows) %>%
  mutate(seed = as.character(seeds))
Convert the list of dataframes into a single one for plotting:
datasets <- datasets %>% map_df(bind_rows)
ggplot(datasets,
       aes(x, y, col = seed)
) +
  geom_jitter(width = .1) +
  geom_abline(
    data = model_parameters,
    aes(
      intercept = `(Intercept)`,
      slope = x,
      color = seed
    )
  )
My solution seems somewhat ugly. Is there a more natural approach to this?
Since you've added the seed column, it's usually easier to work with one large data.frame than with a list of data.frames. So you could do
library(tidyverse)
datasets <- seeds %>%
  map_df(generate_data)
and then, when extracting coefficients, the broom package can help in a tidy way. For example:
model_parameters <- datasets %>%
  group_by(seed) %>%
  do(broom::tidy(lm(y ~ x, .))) %>%
  select(seed, term, estimate) %>%
  spread(term, estimate)
Then these can both go directly into the ggplot code you've already written.
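On newer dplyr/tidyr versions, where do() and spread() are superseded, the same extraction can be sketched with group_modify() and pivot_wider() (assuming dplyr >= 1.0 and tidyr >= 1.0):
model_parameters <- datasets %>%
  group_by(seed) %>%
  # broom::tidy() per group replaces the do() call
  group_modify(~ broom::tidy(lm(y ~ x, data = .x))) %>%
  ungroup() %>%
  select(seed, term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate)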
