broom::augment outputs only the columns of the data that are used in the formula. This is problematic, because being able to look up something like a respondent ID can be very helpful. Using the newdata argument could be a workaround, but it still doesn't provide a fix when working with nested data.
Additional explanations in-line:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
#simulated glm data
glmdata = data.frame(ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
#fit model, not including the ID variable
glmfit = glm(response~A+B, glmdata,family='binomial')
#ID variable is contained in glmfit$data
str(glmfit$data)
#works!
head(glmfit$data$ID)
#use broom::augment
augmented = glmfit %>% augment
#does not work, wth broom?!
augmented$ID
#ok ... I could use the newdata argument
augmented = glmfit %>% augment(newdata=glmdata)
augmented$ID
#however, that is a hacky workaround ....
#... and it does not fix the following scenario:
#Let's say I want to use nest
#simulated glm data
glmdata1 = data.frame(segm=1,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-2*A-3*B)) ))
glmdata2 = data.frame(segm=2,ID=1:100, A=rnorm(100), B=rnorm(100)) %>% mutate(response=rbinom(length(ID),1,1/(1+exp(-3*A-2*B)) ))
glmdata_nest = rbind(glmdata1,glmdata2) %>% group_by(segm) %>% nest
#fit the two models via map
glmfit_nest= glmdata_nest %>% mutate(model=map(data, glm, formula=response~A+B, family='binomial') )
#run augment via map
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment))
#ID is not here (augmented is a list column, so look inside its first element)
glmfit_nest_augmented$augmented[[1]]$ID
#ok, so then we have to use map2 ....
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map2(model,data,augment,newdata=.y))
#but even this doesn't work
#also, trying to recycle the data column via .$data does not work
glmfit_nest_augmented = glmfit_nest %>% mutate(augmented = map(model,augment,newdata=.$data))
Update:
The broom developers deliberately chose this inconsistent behavior:
https://github.com/tidymodels/broom/issues/753
Here .x and .y are only defined inside an anonymous function written with ~, so wrap the augment call in one:
glmfit_nest_augmented <- glmfit_nest %>%
  mutate(augmented = map2(model, data, ~ augment(.x, newdata = .y)))
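A minimal, self-contained sketch of the map2() pattern above, assuming dplyr, tidyr, purrr and broom are installed. The simulated data mirrors the question; the seed and the names `sim_segment`, `nested` and `fits` are illustrative assumptions, not part of the original post. Passing each group's data frame as newdata keeps the ID column in the augmented output.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)  # arbitrary seed, only for reproducibility

# hypothetical helper: simulate one segment of logistic-regression data
sim_segment <- function(segm, n = 100) {
  data.frame(segm = segm, ID = seq_len(n), A = rnorm(n), B = rnorm(n)) %>%
    mutate(response = rbinom(n, 1, 1 / (1 + exp(-2 * A - 3 * B))))
}

nested <- bind_rows(sim_segment(1), sim_segment(2)) %>%
  group_by(segm) %>%
  nest()

fits <- nested %>%
  mutate(
    model     = map(data, ~ glm(response ~ A + B, data = .x, family = binomial)),
    # inside the lambda, .x is the model and .y is that group's data frame
    augmented = map2(model, data, ~ augment(.x, newdata = .y))
  )

# ID is available in each augmented data frame
head(fits$augmented[[1]]$ID)
```

Note that `augmented` is a list column, so individual results are reached with `[[`, or flattened with tidyr::unnest().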
I am learning R with an online course that teaches the use of resamples for cross-validation. I have copied their code but am running into some problems. The first problem was the order of the model, formula and resamples arguments, which I have fixed. Now, however, I get a new error saying that I must preprocess with a formula or recipe. Can someone tell me what is wrong with the code below?
titanic_cv <- vfold_cv(
data = titanic,
v = 5,
strata = Survived )
titanic_cv
titanic_cv %>%
slice( 1 ) %>% #This grabs only the first row
pull( splits ) # This lets us look at the splits variable
## The training set
titanic_cv %>%
slice( 1 ) %>% #This grabs only the first row
pull( splits ) %>%
map_df( training ) #If you are unfamiliar with map, have a look at the help file. What is it doing?
## The testing set
titanic_cv %>%
slice( 5 ) %>% #This grabs only the fifth row
pull( splits ) %>%
map_df( testing )
lr_spec <- logistic_reg( mode = "classification" ) %>%
set_engine( "glm" )
titanic_resamples <- fit_resamples(
model = lr_spec,
formula = Survived ~ Age + Sex + Pclass,
resamples = titanic_cv
)
titanic_resamples
The error that appears is
Error in fit_resamples():
! To tune a model spec, you must preprocess with a formula or recipe
Run rlang::last_error() to see where the error occurred.
I was expecting to see the below results
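One possible cause of this error, offered as an assumption because the installed package versions are not shown: in recent releases of tune, fit_resamples() takes the model spec first (`object`), then a formula or recipe (`preprocessor`), then `resamples`, so the names `model =` and `formula =` are not matched the way the course code expects. The sketch below passes the first two arguments positionally. The titanic data is not included in the post, so mtcars with `am` as the outcome is used as a stand-in.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed

# stand-in data: the outcome must be a factor for classification
cars <- mtcars %>% mutate(am = factor(am))

cars_cv <- vfold_cv(cars, v = 5, strata = am)

lr_spec <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

# model spec first, then the formula as preprocessor, then the resamples
cars_resamples <- fit_resamples(
  lr_spec,
  am ~ mpg + wt,
  resamples = cars_cv
)

collect_metrics(cars_resamples)
```

With the titanic data, the same call shape would use `Survived ~ Age + Sex + Pclass` and `titanic_cv`, provided `Survived` is a factor.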
I'm trying to apply a custom function to a nested dataframe.
I want to apply a machine learning algorithm to predict NA values.
After doing a bit of reading online, it seemed that the map function would be the most applicable here.
I have a section of code that nests the dataframe and then splits the data into a test set (data3) and a train set (data2). The test set contains all rows where the column to be predicted is NA, and the train set contains all rows where it is not NA, to be used to train the ML model:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))))
Here is the function I intend to use:
my_function <- function(test, train) {
  et <- extraTrees(x = train, y = train[, "mean_night_flow"], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
  test1 <- test
  test1[ , "mean_night_flow"] <- 0
  pred <- predict(et, newdata = test1[, "mean_night_flow"])
  test1[ , "mean_night_flow"] <- pred
  return(test1)
}
I have tried the following code, however it does not work:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))),
data4 = map(data3, data2, ~my_function(.x,.y)))
It gives the following error:
Error: Index 1 must have length 1, not 33
This suggests that it expects a column rather than a whole dataframe. How can I get this to work?
Many thanks
Without testing on your data, I think you're using the wrong map function. purrr::map works on one argument (one list, one vector, whatever) and returns a list. You are passing it two values (data3 and data2), so we need to use:
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data2 = map(data, ~filter(., !(is.na(mean_night_flow)))),
data3 = map(data, ~filter(., is.na(mean_night_flow))),
data4 = map2(data3, data2, ~my_function(.x,.y)))
If you ever need more than two inputs, use pmap. It also works for one or two inputs, where it is effectively the same as map or map2. The two biggest differences when migrating from map2 to pmap are:
your arguments need to be enclosed within a list, so
map2(data3, data2, ...)
becomes
pmap(list(data3, data2), ...)
you refer to them by position with double-dot numbers, ..1, ..2, ..3, etc., so
~ my_function(.x, .y)
becomes
~ my_function(..1, ..2)
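A small sketch of the map2-to-pmap migration described above, using toy numeric lists instead of the nested data frames. The names `x1`, `x2` and `x3` are illustrative assumptions.

```r
library(purrr)

x1 <- list(1, 2, 3)
x2 <- list(10, 20, 30)
x3 <- list(100, 200, 300)

# two inputs: map2() with .x and .y
sums2 <- map2(x1, x2, ~ .x + .y)

# the same call written with pmap(): inputs wrapped in a list, ..1 and ..2
sums2_pmap <- pmap(list(x1, x2), ~ ..1 + ..2)

# three inputs: map2() no longer applies, pmap() does
sums3 <- pmap(list(x1, x2, x3), ~ ..1 + ..2 + ..3)

identical(sums2, sums2_pmap)
unlist(sums3)  # 111 222 333
```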
An alternative that simplifies your overall flow just a little.
my_function <- function(test, train = NULL, fld = "mean_night_flow") {
if (is.null(train)) {
train <- test[ !is.na(test[[fld]]),, drop = FALSE ]
test <- test[ is.na(test[[fld]]),, drop = FALSE ]
}
et <- extraTrees(x = train, y = train[, fld], na.action = "fuse", ntree = 1000, nodesize = 2, mtry = ncol(train) * 0.9 )
test1 <- test
test1[ , fld] <- 0
pred <- predict(et, newdata = test1[, fld])
test1[ , fld] <- pred
return(test1)
}
which auto-populates train based on the missingness of your field. (I also parameterized it in case you ever need to train/test on a different field.) This changes your use to
dmaExtendedDataNA2 <- dmaExtendedDataNA %>%
group_by(dma) %>%
nest() %>%
mutate(data4 = map(data, ~ my_function(.x, fld = "mean_night_flow")))
(It's important to name fld=, since otherwise it will be confused for train.)
If you're planning on reusing data2 and/or data3 later in the pipe or analysis, then this step is not necessarily what you need.
Note: I suspect your function is under-tested or incomplete. The fact that you assign all 0 to your test1[,"mean_night_flow"] and then use those zeroes in your call to predict seems suspect. I might be missing something, but I would expect perhaps
test1 <- test
pred <- predict(et, newdata = test1)
test1[ , fld] <- pred
return(test1)
(though copying to test1 is mostly unnecessary with a tibble or data.frame, since R copies on modification and the original frame is untouched; I would be more cautious if you were using class data.table, which modifies by reference).
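The split-by-missingness idea and the suggested predict() call can be sketched in base R. This is only an illustration under stated assumptions: lm() is swapped in for extraTrees() so the example runs without extra packages, and `impute_field` and the toy data are hypothetical.

```r
# hypothetical helper: fill NA values of `fld` using a model fit on complete rows
impute_field <- function(df, fld = "y") {
  is_missing <- is.na(df[[fld]])
  train <- df[!is_missing, , drop = FALSE]
  test  <- df[is_missing, , drop = FALSE]
  # lm() is a stand-in for extraTrees(); all other columns act as predictors
  fit <- lm(reformulate(setdiff(names(df), fld), response = fld), data = train)
  # predict from the predictor columns, then write the predictions back
  test[[fld]] <- predict(fit, newdata = test)
  test
}

set.seed(42)  # arbitrary seed
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(20, sd = 0.1)
toy$y[c(3, 7, 15)] <- NA

filled <- impute_field(toy, fld = "y")
filled
```

Inside the nested pipeline this would be called as map(data, ~ impute_field(.x, fld = "mean_night_flow")), with the model swapped back to whatever learner is actually intended.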
Is there a more elegant way of doing this?
m1 <- lm(price ~ carat, data = diamonds)
m2 <- lm(price ~ carat + cut, data = diamonds)
m3 <- lm(price ~ carat + cut + depth, data = diamonds)
m1r2 <- summary(m1)$r.squared
m2r2 <- summary(m2)$r.squared
m3r2 <- summary(m3)$r.squared
data.frame(
model = c("m1", "m2", "m3"),
RSqd = c(m1r2, m2r2, m3r2)
)
With caret I often use the following to compare multiple models side by side:
resamples(list_of_models) %>% summary()
Is there a conventional approach to comparing models on fit such as R.Squared, AIC, RSE? As opposed to crudely writing a dataframe in the way I have above?
An option is to use mget
stack(mget(ls(pattern = "^m\\d+r\\d+$")))
From the input 'm' objects, get the models into a list with mget, loop over the list with lapply, apply summary to each, and extract r.squared
lapply(mget(ls(pattern = "^m\\d+$")), function(x) summary(x)$r.squared)
Also, this can be done with reformulate by passing the independent variables in a list
lapply(list('carat', c('carat', 'cut'), c('carat', 'cut', 'depth')),
function(nm) summary(lm(reformulate(nm, 'price'),
data = diamonds))$r.squared)
If we want to get multiple components
library(broom)
lapply(mget(ls(pattern = "^m\\d+$")), glance)
Here is a similar more tidyverse-based approach.
With only three variables it won't save you much typing, but once you have a two digit number of variables the 'many models approach' is really convenient.
library(dplyr)
library(purrr)
library(broom)
library(ggplot2)
reg_vars <- c("carat", "cut", "depth")
tibble(id = 1:3) %>%
mutate(equ = map(id, ~ reformulate(reg_vars[1:.x], response = "price")),
mod = map(equ, ~ lm(.x, data = diamonds)),
res = map(mod, glance)) %>%
pull(res) %>%
bind_rows(., .id = "model")
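A compact variant of the same idea, sketched with a named list so that the model labels come from the list names. To keep it self-contained it uses the built-in mtcars data rather than diamonds; the model formulas are illustrative assumptions.

```r
library(dplyr)
library(broom)

# named list: the names become the values of the `model` column
models <- list(
  m1 = lm(mpg ~ wt, data = mtcars),
  m2 = lm(mpg ~ wt + hp, data = mtcars),
  m3 = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# one row of fit statistics per model
comparison <- bind_rows(lapply(models, glance), .id = "model")

comparison[, c("model", "r.squared", "adj.r.squared", "sigma", "AIC", "BIC")]
```

glance() returns R squared, residual standard error (`sigma`), AIC and BIC in one row, which covers the fit measures asked about in the question.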
You could also try the texreg package.
library(texreg)
screenreg(list(m1, m2, m3))
I am building a series of Cox regression models, and getting predictions from those models on new data. I am able to get the expected number of events in some cases, but not others.
For example, if the formula in the coxph call is written out, then the predictions are calculated. But if the formula is stored in an object and that object is passed instead, I get an error. I also cannot get the predictions if I try to create them within a dplyr piped mutate function (for the function I am writing, this would be the most ideal place to get the predictions to work properly).
Any assistance is greatly appreciated!
Thank you,
Daniel
require(survival)
require(tidyverse)
n = 15
# creating tibble of tibbles.
results =
tibble(id = 1:n) %>%
group_by(id) %>%
do(
# creating tibble to evaluate model on
tbl0 = tibble(time = runif(n), x = runif(n)),
# creating tibble to build model on
tbl = tibble(time = runif(n), x = runif(n))
) %>%
ungroup
#it works when the formula is written out inside the coxph call
map2(results$tbl, results$tbl0, ~ predict(coxph( Surv(time) ~ x, data = .x), newdata = .y, type = "expected"))
#but if the formula is previously defined, I get an error
f = as.formula(Surv(time) ~ x)
map2(results$tbl, results$tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
# I also get an error when I try to include in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( f, data = .x), newdata = .y, type = "expected"))
)
I figured it out (with the help of a friend). If you define the formula as a string, and within the function call coerce it to a formula everything runs smoothly. I am not sure why it works, but it does!
#define the formula as a string, and call it in the function with as.formula(.)
f = "Surv(time) ~ x"
map2(results$tbl, results$tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
#also works in a dplyr pipe with mutate
results %>%
mutate(
pred = map2(tbl, tbl0, ~ predict(coxph( as.formula(f), data = .x), newdata = .y, type = "expected"))
)
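A likely explanation, stated here as an assumption rather than something confirmed in the thread: a formula remembers the environment in which it was created, and predict() on a coxph fit may need to re-evaluate the model's `data` argument in that environment. A formula built at top level knows nothing about `.x` inside the map2() lambda, whereas as.formula(f) called inside the lambda creates the formula in the lambda's own environment, where `.x` exists. The sketch below only illustrates that environment difference in base R; `make_formula` and `local_data` are hypothetical names.

```r
# formula created here: its environment is wherever this line runs
f_outer <- y ~ x

# hypothetical helper: builds the formula from a string inside a function
make_formula <- function(s) {
  local_data <- data.frame(x = 1:3, y = c(2, 4, 6))  # visible only in here
  as.formula(s)  # environment defaults to the calling function's frame
}
f_inner <- make_formula("y ~ x")

# the inner formula can see local_data, the outer one cannot
exists("local_data", envir = environment(f_inner), inherits = FALSE)  # TRUE
exists("local_data", envir = environment(f_outer), inherits = FALSE)  # FALSE
```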
I have seen an example of list apply (lapply) that works nicely to take a list of data objects,
and return a list of regression output, which we can pass to Stargazer for nicely formatted output.
Using stargazer with a list of lm objects created by lapply-ing over a split data.frame
library(MASS)
library(stargazer)
data(Boston)
by.river <- split(Boston, Boston$chas)
class(by.river)
fit <- lapply(by.river, function(dd)lm(crim ~ indus,data=dd))
stargazer(fit, type = "text")
What I would like to do, instead of passing a list of datasets to run the same regression on each data set (as above),
is pass a list of dependent variables to run different regressions on the same data set. In long hand it would look like this:
fit2 <- vector(mode = "list", length = 2)
fit2[[1]] <- lm(nox ~ indus, data = Boston)
fit2[[2]] <- lm(crim ~ indus, data = Boston)
stargazer(fit2, type = "text")
With lapply, I tried this and it doesn't work. Where did I go wrong?
myvarc <- c("nox","crim")
class(myvarc)
myvars <- as.list(myvarc)
class(myvars)
fit <- lapply(myvars, function(dvar)lm(dvar ~ indus,data=Boston))
stargazer(fit, type = "text")
Consider creating dynamic formulas from string:
fit <- lapply(myvars, function(dvar)
lm(as.formula(paste0(dvar, " ~ indus")),data=Boston))
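The same dynamic-formula idea can also be written with reformulate(), which builds the formula from variable names without pasting strings. A minimal sketch, using the built-in mtcars data as a stand-in for Boston; the variable names are illustrative assumptions.

```r
# response names to loop over; mtcars stands in for Boston
responses <- c("mpg", "hp")

fits <- lapply(responses, function(dv) {
  # reformulate() builds `dv ~ wt` from the variable names
  lm(reformulate("wt", response = dv), data = mtcars)
})
names(fits) <- responses

# each model carries the formula it was fit with
lapply(fits, formula)
```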
This should work:
fit <- lapply(myvars, function(dvar) lm(eval(paste0(dvar,' ~ indus')), data = Boston))
You can also use a dplyr & purrr approach, keep everything in a tibble, pull out what you want, when you need it. No difference in functionality from the lapply methods.
library(dplyr)
library(purrr)
library(MASS)
library(stargazer)
var_tibble <- tibble(vars = c("nox","crim"), data = list(Boston))
analysis <- var_tibble %>%
mutate(models = map2(data, vars, ~lm(as.formula(paste0(.y, " ~ indus")), data = .x))) %>%
mutate(tables = map2(models, vars, ~stargazer(.x, type = "text", dep.var.labels.include = FALSE, column.labels = .y)))
You can also use get():
# make a list of dependent variables
list_x <- list("nox","crim")
# create regression function
my_reg <- function(x) { lm(get(x) ~ indus, data = Boston) }
# run regression
results <- lapply(list_x, my_reg)