Combining dplyr::do() with dplyr::mutate()?

I would like to achieve the following: for each subgroup of a dataset, I would like to carry out a regression, and the residuals of that regression should be saved as a new variable in the original dataframe. For instance,
group_by(mtcars, gear) %>% mutate(res = residuals(lm(mpg~carb, .)))
indicates what I think should work, but does not (anyone care to explain why it does not work?). One way to get the residuals is to do the following:
group_by(mtcars, gear) %>% do(res = residuals(lm(mpg~carb, .)))
which gives me a dataframe in which dbl-objects are saved, i.e. those contain the residuals for each group. However, it seems they do not contain the original rownames that would help me to merge them back to the original data.
So, my question is: how can I achieve what I want to do in a dplyr-kind of way?
Obviously, it can be achieved in other ways. To give you an example, the following works just fine:
dat <- mtcars
dat$res <- NA
for (i in unique(mtcars$gear)) {
  dat[dat$gear == i, "res"] <- residuals(lm(mpg ~ disp, data = dat[dat$gear == i, ]))
}
However, my understanding is that dplyr is made for this purpose, so there should be a dplyr-style way?
Any hints / tips / comments are appreciated.
Remark: this question is very similar to lm() called within mutate() except that in that question, only one parameter per group is retained, which makes a merge-approach easy. I have an entire vector with no rownames, so that I would have to rely on the ordering of the vector to do that, and that seems troublesome to me.

library(lazyeval)
eq <- "y ~ x"
dat <- mtcars
dat %>%
  group_by(gear) %>%
  mutate(res = residuals(lm(interp(eq, y = mpg, x = disp))))
or without lazyeval
dat %>%
  group_by(gear) %>%
  mutate(res = residuals(lm(deparse(substitute(mpg ~ disp)))))

# This gives you the residuals. You can then combine this with original data.
mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .)) %>%
  do((function(reg_mod) {
    data.frame(reg_res = residuals(reg_mod$model))
  })(.))
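As for why the mutate() attempt in the question fails: as far as I can tell, the dot there is the pipe's placeholder, so it stands for the whole grouped data frame rather than the current group. lm() then returns 32 residuals for every group, which does not match the group's size. Below is a minimal sketch of a version that should work in current dplyr. It assumes that bare column names inside a grouped mutate() resolve to the current group's rows, so lm() needs no data argument. The name res_by_gear is only illustrative, and mpg ~ disp follows the loop in the question.

```r
library(dplyr)

# Inside a grouped mutate(), mpg and disp refer to the current group's rows,
# so lm() is fitted once per gear value.
res_by_gear <- mtcars %>%
  group_by(gear) %>%
  mutate(res = as.numeric(residuals(lm(mpg ~ disp)))) %>%
  ungroup()

# Cross-check against the explicit loop from the question.
dat <- mtcars
dat$res <- NA_real_
for (g in unique(dat$gear)) {
  rows <- dat$gear == g
  dat[rows, "res"] <- residuals(lm(mpg ~ disp, data = dat[rows, ]))
}
all.equal(res_by_gear$res, dat$res)
```

Because mutate() keeps the rows in their original order, the residuals land next to the observations they belong to and no merge is needed.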

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){
  aggdata <- data %>%
    group_by({{group}}) %>%
    select_if(function(col) !is.numeric(col) & !is.integer(col)) %>%
    summarise_if(is.numeric, sum, na.rm = TRUE)
  return(aggdata)
Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.
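Another option is to keep the single group argument and let it accept several columns. This is only a sketch of how agg() might be rewritten, assuming dplyr 1.0.0 or later (for across()) and assuming the goal is to sum every numeric column per group. The body is my guess at the intent, not the asker's actual function.

```r
library(dplyr)

# Hypothetical rewrite of agg(): `group` takes one column or several via c().
agg <- function(data, group) {
  data %>%
    group_by(across({{ group }})) %>%
    summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
              .groups = "drop")
}

agg(mtcars, cyl)           # one grouping variable
agg(mtcars, c(cyl, gear))  # two grouping variables, same single argument
```

Wrapping the embraced argument in across() is what lets c(cyl, gear) be treated as a column selection instead of a single expression.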

Using dplyr rowwise to create multiple linear models

Considering this post:
https://www.tidyverse.org/blog/2020/06/dplyr-1-0-0/
I was trying to create multiple models for a data set, using multiple formulas. The example in the post says:
library(dplyr, warn.conflicts = FALSE)
models <- tibble::tribble(
  ~model_name,    ~formula,
  "length-width", Sepal.Length ~ Petal.Width + Petal.Length,
  "interaction",  Sepal.Length ~ Petal.Width * Petal.Length
)
iris %>%
  nest_by(Species) %>%
  left_join(models, by = character()) %>%
  rowwise(Species, model_name) %>%
  mutate(model = list(lm(formula, data = data))) %>%
  summarise(broom::glance(model))
You can see that rowwise() is used to get the answer, but when I don't use this function, I still get the correct answer:
iris %>%
  nest_by(Species) %>%
  left_join(models, by = character()) %>%
  mutate(model = list(lm(formula, data = data))) %>%
  summarise(broom::tidy(model))
I only lose the "model_name" column. But the rowwise() documentation says this function is what makes the computation happen row by row, so I don't get why the models are still computed per row without it. Why does this happen?
Thanks in advance.
Considering
https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html
You can optionally supply “identifier” variables in your call to rowwise(). These variables are preserved when you call summarise(), so they behave somewhat similarly to the grouping variables passed to group_by():
I didn't understand how identifiers work. As far as I can tell, the identifiers (Species, model_name) don't affect how a value is computed, only which columns are kept when the tibble is summarised.
So if you have a rowwise tibble created by nest_by(), you don't need rowwise() to compute by row. In my example, rowwise() only gives you an extra column of information; the linear models are the same. It is just a more elegant way to write it and doesn't change how anything is computed.
Thanks to tmfmnk
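A quick way to check this is to inspect what nest_by() returns. The sketch below assumes dplyr 1.0.0 or later; the names nested and sizes are only illustrative.

```r
library(dplyr)

nested <- iris %>% nest_by(Species)

# nest_by() already returns a rowwise data frame ...
inherits(nested, "rowwise_df")

# ... so expressions in mutate() see a single row's `data` at a time.
sizes <- nested %>% mutate(n = nrow(data))
sizes$n
```

Since the result is already rowwise, a later rowwise(Species, model_name) call mainly changes which identifier columns summarise() keeps.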

Working with lists of models using the pipe syntax

I often like to fit and examine multiple models that relate two variables in an R dataframe.
I can do that using syntax like this:
require(tidyverse)
require(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
map_df(models, ~tidy(lm(data=mtcars, formula=.x)))
But I'm used to the pipe syntax and was hoping to be able to do something like this:
mtcars %>% map_df(models, ~tidy(lm(data=., formula=.x)))
That makes it clear that I'm "starting" with mtcars and then doing stuff to it to generate my output. But that syntax doesn't work, giving an error Error: Index 1 must have length 1.
Is there a way to write my purrr::map() call so that I can pipe mtcars into it and get the same output as the working code above? I.e.
mtcars %>% <<<something>>>
tl;dr: mtcars %>% {map_df(models, function(.x) tidy(lm(data=., formula=.x)))}
Or mtcars %>% map_df(models, ~tidy(lm(..1,..2)), ..2 = .)
There are 2 problems with the solution you've tried.
The first is that you need to use curly braces if you want to place the dot in an unusual place.
library(magrittr)
1 %>% divide_by(2) # 0.5 -> this works
1 %>% divide_by(2,.) # 2 -> this works as well
1 %>% divide_by(2,mean(.,3)) # this doesn't
1 %>% divide_by(.,2,mean(.,3)) # as it's equivalent to this one
1 %>% {divide_by(2,mean(.,3))} # but this one works as it forces all dots to be explicit.
The second is that you can't use the dot with the ~ formulation in the way you intended, try map(c(1,2), ~ 3+.) and map(c(1,2), ~ 3+.x) (or even map(c(1,2), ~ 3+..1)) and you'll see you get the same result. By the time you use the dot in a ~ formula it's not linked to the pipe function anymore.
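To see this concretely, here is a small check, assuming only that purrr is installed. The object names a, b and c1 are only illustrative.

```r
library(purrr)

# Inside a ~ lambda, `.`, `.x` and `..1` all name the current element,
# so none of them can also stand for the pipe's left-hand side.
a  <- map(c(1, 2), ~ 3 + .)
b  <- map(c(1, 2), ~ 3 + .x)
c1 <- map(c(1, 2), ~ 3 + ..1)

identical(a, b) && identical(b, c1)
```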
To make sure the dot is interpreted as mtcars you need to use the good old function(x) ... definition.
This works:
mtcars %>% {map_df(models, function(.x) tidy(lm(data=., formula=.x)))}
Finally, as a bonus, here's what I came up with, trying to find a solution without curly braces :
mtcars %>% map(models,lm,.) %>% map_df(tidy)
mtcars %>% map_df(models, ~tidy(lm(..1,..2)), ..2 = .)
This should work and does not involve the complexity of functions or {}. A purely purrr solution.
library(tidyverse)
library(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
mtcars %>%
  list %>%                            # make it a list
  cross2(models) %>%                  # get combinations
  transpose %>%                       # put into a nice format
  set_names("data", "formula") %>%    # set names to lm arg names
  pmap(lm) %>%                        # fit models
  map_df(tidy)                        # tidy it up
This is a little at odds with how purrr::map works. You are mapping over the list of models (one item of the list at a time), not the dataframe (which would be one column of the dataframe at a time). Because the dataframe stays constant even with other model expressions, I don't think mapping will work for this situation.
However, you could get the syntax you want from defining a custom function based on the one you have above.
library(tidyverse)
library(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
models_to_rows <- function(data, models_list) {
  models_list %>%
    map_df(~tidy(lm(data=data, formula=.x)))
}
mtcars %>%
  models_to_rows(models)
#>          term     estimate    std.error statistic      p.value
#> 1 (Intercept)  89.60052274  9.702303069  9.234975 2.823542e-10
#> 2    exp(cyl)   0.04045315  0.004897717  8.259594 3.212750e-09
#> 3 (Intercept) -51.05436157 24.981944312 -2.043650 4.985522e-02
#> 4         cyl  31.95828066  3.883803355  8.228604 3.477861e-09

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
  my_data %>%
  group_by(my_grouping_variable) %>%
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe-chain, but check in your function:
your_function <- function(x) {
  # nest the groups if a grouped data frame was passed in
  if (inherits(x, "grouped_df")) x <- nest(x)
  x
}
Your function should then iterate over the list-column data with all grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function directly receives the values for each group, passed as class structure. It's a bit difficult to explain, just try it out and debug your_function_call step by step...
You can use groups(), which returns the grouping variables as a list of symbols. In current dplyr, group_vars() returns the same names as a character vector, which is easier to use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
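Note that groups() gives the names of the grouping variables, not the value for the current group. If the value itself is needed inside the per-group function, one possible approach is group_modify(), which hands the function both the group's rows and its key. This is a sketch on mtcars and not the asker's data; by_cyl, key_seen and mean_mpg are made-up names, and group_modify() is assumed to be available (dplyr 0.8.1 or later).

```r
library(dplyr)

# In the lambda, .x is the current group's rows and .y is a one-row tibble
# holding that group's key, so the group value is available without unique().
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ data.frame(key_seen = .y$cyl, mean_mpg = mean(.x$mpg))) %>%
  ungroup()

by_cyl
```

Inside mutate() or summarise(), cur_group() plays a similar role in dplyr 1.0.0 and later.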

Using dplyr to compare models

I was working through the examples in the dplyr documentation of the do() function and all was well until I came across this snippet to summarize model comparisons:
# compare %>% summarise(p.value = aov$`Pr(>F)`)
The error was "Error: expecting a single value". So I found a way forward accessing the list of aov elements directly. This question is about sub-setting operators and to ask if there is a better way to do this. Here is my full attempt and solution.
models <- group_by(mtcars, cyl) %>%
  do(mod_lin = lm(mpg ~ disp, data = .),
     mod_quad = lm(mpg ~ poly(disp, 2), data = .))
compare <- models %>% do(aov = anova(.$mod_lin, .$mod_quad))
compare %>% summarise(p.value = aov$'Pr(>F)')
Error: expecting a single value
# Looking into the structure of compare
# select comparison 1
compare$aov[[1]]
# select comparison 1 and all of element 6 (the p-values)
compare$aov[[1]][6]
# just the p-values
compare$aov[[1]][2,6]
compare %>% summarise(pvalue = aov[2,6]) # this gets the pvalues by group
So I suppose I'm wondering how, given an object of classes ('rowwise_df', 'tbl_df' and 'data.frame'), summarise() can intuit the [[ ]] operator. And also whether there might be a better way to do this.
You could try
compare %>% do(.$aov['Pr(>F)']) %>% na.omit()
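If do() is not a requirement, the same comparison can be sketched with nest_by() and a rowwise mutate(). This assumes dplyr 1.0.0 or later and is not the approach from the do() documentation. The name p_values is only illustrative.

```r
library(dplyr)

p_values <- mtcars %>%
  nest_by(cyl) %>%
  # fit both models once per cyl value; list() keeps each fit in a list-column
  mutate(mod_lin  = list(lm(mpg ~ disp, data = data)),
         mod_quad = list(lm(mpg ~ poly(disp, 2), data = data))) %>%
  # row 2 of the anova table holds the test of the quadratic term
  mutate(p.value = anova(mod_lin, mod_quad)[2, "Pr(>F)"]) %>%
  ungroup() %>%
  select(cyl, p.value)

p_values
```

Picking out row 2 explicitly avoids the "expecting a single value" problem, because the first row of the anova table has an NA p-value.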
