dplyr summarize with a function of a dataframe - r

I'm having some trouble carrying out a routine using the dplyr package. In short, I have a function which takes a dataframe as an input, and returns a single (numeric) value; I'd like to be able to apply this function to several subsets of a dataframe. It feels like I should be able to use group_by() to specify the subsets of the dataframe, then pipe along to the summarize() function, but I'm not sure how to pass the (subsetted) dataframe along to the function I'd like to apply.
As a simplified example, let's say I'm using the iris dataset, and I've got a fairly simple function which I'd like to apply to several subsets of the data:
data(iris)
lm.func = function(.data){
  # Fit Petal.Width on Petal.Length and return the slope estimate
  lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
  out = summary(lm.fit)$coefficients[2,1]
  return(out)
}
Now, I'd like to be able to apply this function to subsets of iris based on some other variable, like Species. I'm able to manually filter the data, then pipe along to my function, for example:
iris %>% filter(Species == "setosa") %>% lm.func(.)
But I'd like to be able to apply lm.func to each subset of the data, based on Species. My first thought was to try something like the following:
iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))
Even though I know this doesn't work, my idea is to try to pass each subset of iris to the lm.func function.
To clarify, I'd like to end up with a dataframe with two columns -- a first with each level of the grouping variable, and a second with the output of lm.func when the data are restricted to a subset specified by the grouping variable.
Is it possible to use summarize() in this way?

You can try do():
iris %>%
  group_by(Species) %>%
  do(data.frame(coef.val = lm.func(.)))
# Species coef.val
#1 setosa 0.2012451
#2 versicolor 0.3310536
#3 virginica 0.1602970
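For readers on a more recent dplyr, the same per-group call can be sketched with group_modify(), which passes each group's rows to a function as .x. This is only an illustration, assuming dplyr 0.8.1 or later; lm.func is repeated from the question so the block is self-contained.

```r
library(dplyr)

# Same function as in the question: slope of Petal.Width on Petal.Length
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# group_modify() calls the lambda once per group; .x holds that group's rows
res <- iris %>%
  group_by(Species) %>%
  group_modify(~ data.frame(coef.val = lm.func(.x))) %>%
  ungroup()

res
```

The result has one row per Species and a coef.val column, which is the two-column shape the question asks for.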

There is an easy way to do this without writing a function.
library(dplyr)
library(broom)
models <- iris %>%
  group_by(Species) %>%
  do(mod = lm(Petal.Width ~ Petal.Length, data = .))
models %>% do(tidy(.$mod))
term estimate std.error statistic p.value
1 (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
2 Petal.Length 0.20124509 0.08263253 2.4354220 1.863892e-02
3 (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
4 Petal.Length 0.33105360 0.03750041 8.8279995 1.271916e-11
5 (Intercept) 1.13603130 0.37936622 2.9945505 4.336312e-03
6 Petal.Length 0.16029696 0.06800119 2.3572668 2.253577e-02
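The output above does not show which rows belong to which species. One possible way to keep the grouping column is to call tidy() inside group_modify(); this is a sketch assuming dplyr 0.8.1 or later and an installed broom, not the method used in the answer above.

```r
library(dplyr)
library(broom)

# Fit one model per Species and return its tidy coefficient table;
# group_modify() prepends the grouping column to each group's result
tidy_by_species <- iris %>%
  group_by(Species) %>%
  group_modify(~ tidy(lm(Petal.Width ~ Petal.Length, data = .x))) %>%
  ungroup()

tidy_by_species
```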

Related

Extract (or isolate) 'group-wise constant' columns from a data frame, *using dplyr/tidyverse*

How can I extract (or isolate) group-wise constant columns from a data frame, using dplyr/tidyverse?
This is an update of Dowle/Hadley's decades-old question here. The earlier poster's example...
Using a contrived example from iris (to generate a dataset with columns that are constant by group for this example):
irisX <- iris %>% mutate(
  numspec = as.numeric(Species),
  numspec2 = numspec*2
)
Now I want to generate a dataset that keeps the columns Species, numspec, and numspec2 only (and keeps only one row for each).
And I don't want to have to tell it which columns these are (constant by group) -- I want it to find these for me.
So what I want is
Species, numspec, numspec2
setosa, 1, 2
versicolor, 2, 4
virginica, 3, 6
Unlike in the older linked question I want to do something using the tidyverse so I can understand it better and the code looks cleaner.
I tried something like
single_iris <- irisX %>%
  group_by(Species) %>%
  select_if(function(.) n_distinct(.) == 1)
But the latter select_if ignores the groupings.
If we want to use select, do it outside the grouping:
library(dplyr)
irisX %>%
  select(where(~ n_distinct(.) == n_distinct(irisX$Species))) %>%
  distinct()
You could do:
iris %>%
  group_by(Species) %>%
  summarise(numspec = as.numeric(first(Species)),
            numspec2 = numspec*2)
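Neither answer checks constancy within each group directly. A possible sketch of that check, assuming dplyr 1.0.0 or later (for across() and where()): count the distinct values of every column per group, then keep the columns whose count is 1 in every group. The names n_per_group and const_cols are only illustrative.

```r
library(dplyr)

irisX <- iris %>%
  mutate(numspec = as.numeric(Species), numspec2 = numspec * 2)

# Number of distinct values of each non-grouping column within each Species
n_per_group <- irisX %>%
  group_by(Species) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")

# Columns that have exactly one distinct value in every group
const_cols <- n_per_group %>%
  select(-Species) %>%
  select(where(~ all(.x == 1))) %>%
  names()

# Keep the grouping column plus the constant columns, one row per group
single_iris <- irisX %>%
  select(Species, all_of(const_cols)) %>%
  distinct()

single_iris
```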

Using dplyr rowwise to create multiple linear models

Considering this post:
https://www.tidyverse.org/blog/2020/06/dplyr-1-0-0/
I was trying to create multiple models for a data set using multiple formulas. The example in the post says:
library(dplyr, warn.conflicts = FALSE)
models <- tibble::tribble(
  ~model_name,    ~formula,
  "length-width", Sepal.Length ~ Petal.Width + Petal.Length,
  "interaction",  Sepal.Length ~ Petal.Width * Petal.Length
)
iris %>%
  nest_by(Species) %>%
  left_join(models, by = character()) %>%
  rowwise(Species, model_name) %>%
  mutate(model = list(lm(formula, data = data))) %>%
  summarise(broom::glance(model))
You can see that the rowwise() function is used to get the answer, but when I don't use this function, I still get the correct answer:
iris %>%
  nest_by(Species) %>%
  left_join(models, by = character()) %>%
  mutate(model = list(lm(formula, data = data))) %>%
  summarise(broom::tidy(model))
I only lost the "model_name" column. But the rowwise documentation says this function is what makes the computation happen row by row, so I don't understand why the models are still computed per row without it. Why does this happen?
Thanks in advance.
Considering
https://cran.r-project.org/web/packages/dplyr/vignettes/rowwise.html
You can optionally supply “identifier” variables in your call to rowwise(). These variables are preserved when you call summarise(), so they behave somewhat similarly to the grouping variables passed to group_by():
I didn't understand how identifiers work. As far as I can tell, the "identifiers" (Species, model_name) don't affect how a value is computed, only how the tibble is presented.
So if you have a rowwise tibble created by nest_by(), you don't need rowwise() to compute by row. In my example, rowwise() only gives you an extra column of information; the linear models are the same. It is there for elegance and doesn't change how the models are computed.
Thanks to tmfmnk.
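The claim that nest_by() already yields a row-wise tibble can be checked with a small sketch (assuming dplyr 1.0.0 or later). The column name n is only illustrative.

```r
library(dplyr)

nested <- iris %>% nest_by(Species)

# nest_by() returns a row-wise tibble, so no extra rowwise() call is needed
is_rowwise <- inherits(nested, "rowwise_df")

# Because the tibble is row-wise, `data` refers to one group's data frame at a time
res <- nested %>% mutate(n = nrow(data))

is_rowwise
res$n
```

If the tibble were not row-wise, nrow(data) would be evaluated on the whole list-column rather than once per row.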

Using filter_at to filter some data correctly

I am trying to use the filter_at function from the tidyverse/dplyr. I want to filter on Sepal.Width, Petal.Length, and Petal.Width without naming them explicitly, or by naming only the variables I do not want to filter on. For example, below I do not want to filter on Sepal.Length and Species.
I am trying to filter out all the data that is greater than 3 or less than 2 (keeping just the observations between 2.00001 and 2.9999, etc.).
data(iris)
iris %>%
  filter_at(vars(-Sepal.Length, -Species), all_vars(. > 2 & . < 3))
Additionally, I would like to filter out any Inf observations, with something like any_vars(is.infinite(.)) across all variables.
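One way this could be written with current dplyr is filter() plus if_all(), which superseded the filter_at()/all_vars() pair. This is a sketch under the assumption that dplyr 1.0.4 or later is available and that the goal is to keep rows where every selected column lies strictly between 2 and 3 and is finite.

```r
library(dplyr)

# Select every column except Sepal.Length and Species, and keep a row only
# if all of the selected columns are finite and strictly between 2 and 3
kept <- iris %>%
  filter(if_all(-c(Sepal.Length, Species),
                ~ is.finite(.x) & .x > 2 & .x < 3))

kept
```

Using if_any() instead of if_all() would keep rows where at least one selected column meets the condition.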

working with lists of models using the pipe syntax

I often like to fit and examine multiple models that relate two variables in an R dataframe.
I can do that using syntax like this:
require(tidyverse)
require(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
map_df(models, ~tidy(lm(data=mtcars, formula=.x)))
But I'm used to the pipe syntax and was hoping to be able to do something like this:
mtcars %>% map_df(models, ~tidy(lm(data=., formula=.x)))
That makes it clear that I'm "starting" with mtcars and then doing stuff to it to generate my output. But that syntax doesn't work, giving an error Error: Index 1 must have length 1.
Is there a way to write my purrr::map() function in a way that I can pipe mtcars into it to get the same output as the working code above? I.e.
mtcars %>% <<<something>>>
tl;dr: mtcars %>% {map_df(models, function(.x) tidy(lm(data=., formula=.x)))}
Or: mtcars %>% map_df(models, ~tidy(lm(..1,..2)), ..2 = .)
There are 2 problems with the solution you've tried.
The first is that you need to use curly braces if you want to place the dot in an unusual place.
library(magrittr)
1 %>% divide_by(2) # 0.5 -> this works
1 %>% divide_by(2,.) # 2 -> this works as well
1 %>% divide_by(2,mean(.,3)) # this doesn't
1 %>% divide_by(.,2,mean(.,3)) # as it's equivalent to this one
1 %>% {divide_by(2,mean(.,3))} # but this one works as it forces all dots to be explicit.
The second is that you can't use the dot with the ~ formulation in the way you intended, try map(c(1,2), ~ 3+.) and map(c(1,2), ~ 3+.x) (or even map(c(1,2), ~ 3+..1)) and you'll see you get the same result. By the time you use the dot in a ~ formula it's not linked to the pipe function anymore.
To make sure the dot is interpreted as mtcars you need to use the good old function(x) ... definition.
This works:
mtcars %>% {map_df(models, function(.x) tidy(lm(data=., formula=.x)))}
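The point about the dot inside a ~ lambda can be checked directly. The sketch below (variable names are illustrative) shows that ., .x and ..1 all refer to the element being mapped over, which is why the dot cannot also stand for the piped data frame.

```r
library(purrr)

# Inside a ~ lambda, `.`, `.x` and `..1` all mean the current element
with_dot   <- map(c(1, 2), ~ 3 + .)
with_dot_x <- map(c(1, 2), ~ 3 + .x)
with_dot_1 <- map(c(1, 2), ~ 3 + ..1)

identical(with_dot, with_dot_x) && identical(with_dot_x, with_dot_1)
```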
Finally, as a bonus, here's what I came up with while trying to find a solution without curly braces:
mtcars %>% map(models,lm,.) %>% map_df(tidy)
mtcars %>% map_df(models, ~tidy(lm(..1,..2)), ..2 = .)
This should work and does not involve the complexity of functions or {}. A purely purrr solution.
library(tidyverse)
library(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
mtcars %>%
  list %>%                           # make it a list
  cross2(models) %>%                 # get combinations
  transpose %>%                      # put into a nice format
  set_names("data", "formula") %>%   # set names to lm arg names
  pmap(lm) %>%                       # fit models
  map_df(tidy)                       # tidy it up
This is a little at odds with how purrr::map works. You are mapping over the list of models (one item of the list at a time), not the dataframe (which would be one column of the dataframe at a time). Because the dataframe stays constant even with other model expressions, I don't think mapping will work for this situation.
However, you could get the syntax you want from defining a custom function based on the one you have above.
library(tidyverse)
library(broom)
models <- list(hp ~ exp(cyl), hp ~ cyl)
models_to_rows <- function(data, models_list) {
  models_list %>%
    map_df(~tidy(lm(data=data, formula=.x)))
}
mtcars %>%
  models_to_rows(models)
#> term estimate std.error statistic p.value
#> 1 (Intercept) 89.60052274 9.702303069 9.234975 2.823542e-10
#> 2 exp(cyl) 0.04045315 0.004897717 8.259594 3.212750e-09
#> 3 (Intercept) -51.05436157 24.981944312 -2.043650 4.985522e-02
#> 4 cyl 31.95828066 3.883803355 8.228604 3.477861e-09

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
  my_data %>%
  group_by(my_grouping_variable) %>%
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
  do(my_function_call(data.frame(x = .$X, y = .$Y),
                      unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe-chain, but check in your function:
your_function <- function(x) {
  if (inherits(x, "grouped_df")) x <- nest(x)
  # ... continue with x, iterating over the list-column `data`
}
Your function should then iterate over the list-column data with all grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function directly receives the values for each group, passed as class structure. It's a bit difficult to explain, just try it out and debug your_function_call step by step...
You can use groups(), however a SE version of this does not exist so I'm unsure of its use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
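Another option, sketched here under the assumption that dplyr 0.8.1 or later is available, is group_modify(): its lambda receives the group's rows as .x and a one-row tibble of the grouping values as .y, so the key can be passed on without calling unique(). The helper my_function_call below is hypothetical and stands in for the function from the question; mtcars columns are used as placeholder data.

```r
library(dplyr)

# Hypothetical stand-in for the question's function:
# takes a data frame and the value of the grouping variable
my_function_call <- function(df, key) {
  data.frame(key_seen = key, mean_y = mean(df$y))
}

res <- mtcars %>%
  group_by(cyl) %>%
  # .x = rows of the current group, .y = one-row tibble holding its cyl value
  group_modify(~ my_function_call(data.frame(x = .x$wt, y = .x$mpg), .y$cyl)) %>%
  ungroup()

res
```

Inside mutate() or summarise(), cur_group() (dplyr 1.0.0 or later) gives similar access to the current group's key.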
