Residualize an observation after fitting a model in group_by - r

I'd like to find the residual of observations after fitting a model per group. I would have thought the code looks something like
library(dplyr)
df %>%
group_by(group) %>%
do(residual=resid(lm(y~x, data=.))) %>%
ungroup()
but this collapses df and leaves no trace of the x variable. What I want is a data frame return that is something like
group |y| x| residual

1) dplyr For purposes of example, this uses the iris data frame that comes with R. I noticed that the code below chokes on the formula if we remove the double quotes but it works OK if the formula is passed as a character string as shown:
iris %>%
group_by(Species) %>%
do(mutate(., resid = resid(lm("Sepal.Length ~ Sepal.Width", .)))) %>%
ungroup()
1a) This variation also works even without a character string formula:
iris %>%
group_by(Species) %>%
do(cbind(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
1b) and this variation also works:
iris %>%
group_by(Species) %>%
do(transform(., resid = resid(lm(Sepal.Length ~ Sepal.Width, .)))) %>%
ungroup()
2) Base R We could also consider not using dplyr and just base R like this:
f <- function(ix) resid(lm(Sepal.Length ~ Sepal.Width, iris, subset = ix))
transform(iris, resid = ave(seq_along(Species), Species, FUN = f))
3) data.table If speed is of concern you might want to try data.table which is often the fastest approach and is also quite compact here:
library(data.table)
dt <- as.data.table(iris)
dt[, resid := resid(lm(Sepal.Length ~ Sepal.Width, .SD)), by = Species]
3a) Interestingly this variation of (1) works with data.table input and an actual formula (not character string). Also, do() is not needed:
data.table(iris) %>%
group_by(Species) %>%
mutate(resid = resid(lm(Sepal.Length ~ Sepal.Width, .))) %>%
ungroup()
Note: I have added dplyr issue 1648.

Related

Purrr map over multiple models to store results in dataframe

I have got the following example:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_dbl(~.x$coefficients[[2]])
[1] -5.647025 -2.780106 -2.192438
I also want to store the intercept, so I thought this might work:
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients)
Error: Argument 1 must have names
However I get this error. What am I doing wrong and how can I store both coefficients in a dataframe?
The coefficients return a numeric vector, we can change it to dataframe and then use map_df.
library(tidyverse)
mtcars %>%
group_split(cyl) %>%
map(~lm(mpg ~ wt, data = .x)) %>%
map_df(~.x$coefficients %>% t %>% as.data.frame)
# (Intercept) wt
#1 39.571 -5.6470
#2 28.409 -2.7801
#3 23.868 -2.1924

grouped statistical test tidyverse

I'm trying to do a Wilcoxon test on long-formatted data. I want to use dplyr::group_by() to specify the subsets I'd like to do the test on.
The final result would be a new column with the p-value of the Wilcoxon test appended to the original data frame. All of the techniques I have seen require summarizing the data frame. I DO NOT want to summarize the data frame.
Please see an example reformatting the iris dataset to mimic my data, and finally my attempts to perform the task.
I am getting close, but I want to preserve all of my original data from before the Wilcoxon test.
# Reformatting Iris to mimic my data.
long_format <- iris %>%
gather(key = "attribute", value = "measurement", -Species) %>%
mutate(descriptor =
case_when(
str_extract(attribute, pattern = "\\.(.*)") == ".Width" ~ "Width",
str_extract(attribute, pattern = "\\.(.*)") == ".Length" ~ "Length")) %>%
mutate(Feature =
case_when(
str_extract(attribute, pattern = "^(.*?)\\.") == "Sepal." ~ "Sepal",
str_extract(attribute, pattern = "^(.*?)\\.") == "Petal." ~ "Petal"))
# Removing no longer necessary column.
cleaned_up <- long_format %>% select(-attribute)
# Attempt using do(), but I lose important info like "measurement"
cleaned_up %>%
group_by(Species, Feature) %>%
do(w = wilcox.test(measurement~descriptor, data=., paired=FALSE)) %>%
mutate(Wilcox = w$p.value)
# This is an attempt with the dplyr experimental group_map function. If only I could just make this a new column appended to the original df in one step.
cleaned_up %>%
group_by(Species, Feature) %>%
group_map(~ wilcox.test(measurement~descriptor, data=., paired=FALSE)$p.value)
Thanks for your help.
The model object can be wrapped in a list
library(tidyverse)
cleaned_up %>%
group_by(Species, Feature) %>%
nest %>%
mutate(model = map(data, ~
.x %>%
transmute(w = list(wilcox.test(measurement~descriptor,
data=., paired=FALSE)))))
Or another option is group_split into a list, then map through the list, elements create the 'pval' column after applying the model
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
Another option is to avoid the data argument entirely. The wilcox.test function only requires a data argument when the variables being tested aren't in the calling scope, but functions called within mutate have all the columns from the data frame in scope.
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
Same as akrun's output (thanks to his correction in the comments above)
akrun <-
cleaned_up %>%
group_split(Species, Feature) %>%
map_dfr(~ .x %>%
mutate(pval = wilcox.test(measurement~descriptor,
data=., paired=FALSE)$p.value))
me <-
cleaned_up %>%
group_by(Species, Feature) %>%
mutate(pval = wilcox.test(measurement~descriptor, paired=FALSE)$p.value)
all.equal(akrun, me)
# [1] TRUE

Bootstrapping by multiple groups in the tidyverse: rsample vs. broom

In this SO Question bootstrapping by several groups and subgroups seemed to be easy using the broom::bootstrap function specifying the by_group argument with TRUE.
My desired output is a nested tibble with n rows where the data column contains the bootstrapped data generated by each bootstrap call (and each group and subgroup has the same amount of cases as in the original data).
In broom I did the following:
# packages
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(rsample)
library(broom)
# some data to bootstrap
set.seed(123)
data <- tibble(
group=rep(c('group1','group2','group3','group4'), 25),
subgroup=rep(c('subgroup1','subgroup2','subgroup3','subgroup4'), 25),
v1=rnorm(100),
v2=rnorm(100)
)
# the actual approach using broom::bootstrap
tibble(id = 1:100) %>%
mutate(data = map(id, ~ {data %>%
group_by(group,subgroup) %>%
broom::bootstrap(100, by_group=TRUE)}))
Since the broom::bootstrap function is deprecated, I rebuild my approach with the desired output using rsample::bootstraps. It seems to be much more complicated to get my desired output. Am I doing something wrong or have things gotten more complicated in the tidyverse when generating grouped bootstraps?
data %>%
dplyr::mutate(group2 = group,
subgroup2 = subgroup) %>%
tidyr::nest(-group2, -subgroup2) %>%
dplyr::mutate(boot = map(data, ~ rsample::bootstraps(., 100))) %>%
pull(boot) %>%
purrr::map(., "splits") %>%
transpose %>%
purrr::map(., ~ purrr::map_dfr(., rsample::analysis)) %>%
tibble(id = 1:length(.), data = .)

bootstrap by group and calculate statistics

I'm trying to bootstrap some model fits and then calculate statistics without having to rerun the models every time. I can do this fine if I calculate r2 inside the first do() but I'd like to know how to access the data.
library(dplyr)
library(tidyr)
library(modelr)
library(purrr)
allmdls <-
mtcars %>%
group_by(cyl) %>%
do({
datsplit=crossv_mc(.,10)
mdls=list(map(datsplit$train, ~glm(hp~disp,data=.,family=gaussian(link='identity'))))
data_frame(datsplit=list(datsplit),mdls)
})
and now something like:
allmdls %>%
by_slice(dmap,.f=map2_dbl(.$mdls,.$datsplit$test,rsquare))
but I get
Error: .y is not a vector (NULL)
or
allmdls %>%
group_by(cyl) %>%
do({
map2_df(.x=.$mdls, .y=.$datsplit, .f=map2_dbl(.x=.x,.y=.y$test,.f=rsquare))
})
Error in map2_dbl(.x = .x, .y = .y$test, .f = rsquare) : object
'.x' not found
I can't seem to get the syntax right.
help?
Thanks
EDIT:
Thanks to #aosmith's comment, I created a somewhat simpler solution:
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})
One option is to use map2 within mutate. Because you are using lists of lists I ended up with nested map2s to get access to the innermost lists. I pulled the test data out via map(datsplit, "test"), as neither the dollar sign operator nor the extract brackets were working for me.
mutate(allmdls, rsq = map2(mdls, map(datsplit, "test"), ~map2_dbl(.x, .y, rsquare)))
Here is another option that avoids the nested lists all together:
mtcars %>%
split(.$cyl) %>%
map_df(crossv_mc, 10, .id = "cyl") %>%
mutate(models = map(train, ~glm(hp ~ disp, data = .x)),
rsq = map2_dbl(models, test, rsquare))
#aosmith answered my question but here is a simpler solution overall
mtcars %>%
group_by(cyl) %>%
do({
datplit=crossv_mc(.,10) %>%
mutate(mdls=map(train, ~glm(hp~disp,data=.)),
r2=map2_dbl(mdls,test,rsquare)
pctmae=map2_dbl(mdls,test,function(model,data) {mae(model,data)/mean(model$model$hp,na.rm=T)*100})
)
})

regression output in dplyr

I would like to define similar functions as in the 'broom' package
library(dplyr)
library(broom)
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
glance(model)
works fine. But how do I defne custom functions like
myglance <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance(model)
Error in eval(substitute(expr), data, enclos = parent.frame()) :
invalid 'envir' argument of type 'character'
glance works this way because the broom package defines a method for rowwise data frames here. If you were willing to bring in that whole .R file (along with the col_name utility from here), you could use my code to do the same thing:
myglance_df <- wrap_rowwise_df(wrap_rowwise_df_(myglance))
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
myglance_df(model)
There's also a workaround that doesn't require adding so much code from broom: change the class of each of your models, and define your own glance function on that class.
glance.mylm <- function(x, ...) {
s <- summary(x)
ret <- with(s, data.frame(r2=adj.r.squared, a=coefficients[1], b=coefficients[2]))
ret
}
mtcars %>%
group_by(am) %>%
do(model = lm(mpg ~ wt, .)) %>%
mutate(model = list(structure(model, class = c("mylm", class(model))))) %>%
glance(model)
Finally, you also have the option of performing myglance on the model right away.
mtcars %>%
group_by(am) %>%
do(myglance(lm(mpg ~ wt, .)))
Here is my take on how it would work, basically the approach would be:
Extract the appropriate column from the dataframe (My solution is based on this answer, there must be a better way, and I hope someone will correct me!
run lapply on the result and construct the variables that you wanted in the myglance function you have above.
run do.call with rbind to return a data.frame.
myglance <- function(df, ...) {
# step 1
s <- collect(select(df, ...))[[1]] # based on this answer: https://stackoverflow.com/a/21629102/1992167
# step 2
lapply(s, function(x) {
data.frame(r2 = summary(x)$adj.r.squared,
a = summary(x)$coefficients[1],
b = summary(x)$coefficients[2])
}) %>% do.call(rbind, .) # step 3
}
Output:
> mtcars %>%
+ group_by(am) %>%
+ do(model = lm(mpg ~ wt, .)) %>%
+ myglance(model)
r2 a b
1 0.5651357 31.41606 -3.785908
2 0.8103194 46.29448 -9.084268

Resources