Number of rows per group along with other summary functions - r

I am trying to find number of rows and some other function after grouping by two different factors. Surprisingly, n() is not working. Is there any work around?
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>% group_by(cyl, am) %>%
summarise(across(c(disp, mpg), list(m = min, a = mean)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 6 x 6
#> # Groups: cyl [3]
#> cyl am disp_m disp_a mpg_m mpg_a
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 0 120. 136. 21.5 22.9
#> 2 4 1 71.1 93.6 21.4 28.1
#> 3 6 0 168. 205. 17.8 19.1
#> 4 6 1 145 155 19.7 20.6
#> 5 8 0 276. 358. 10.4 15.0
#> 6 8 1 301 326 15 15.4
mtcars %>% group_by(cyl, am) %>%
summarise(across(c(disp, mpg), list(m = min, n = n())))
#> Error: Problem with `summarise()` input `..1`.
#> x Can't convert an integer vector to function
#> i Input `..1` is `across(c(disp, mpg), list(m = min, n = n()))`.
#> i The error occurred in group 1: cyl = 4, am = 0.
Created on 2020-11-27 by the reprex package (v0.3.0)

You could use length:
mtcars %>%
group_by(cyl, am) %>%
summarise(across(c(disp, mpg), list(m = min, n = length)))
#`summarise()` regrouping output by 'cyl' (override with `.groups` argument)
## A tibble: 6 x 6
#> # Groups: cyl [3]
#> cyl am disp_m disp_n mpg_m mpg_n
#> <dbl> <dbl> <dbl> <int> <dbl> <int>
#> 1 4 0 120. 3 21.5 3
#> 2 4 1 71.1 8 21.4 8
#> 3 6 0 168. 4 17.8 4
#> 4 6 1 145 3 19.7 3
#> 5 8 0 276. 12 10.4 12
#> 6 8 1 301 2 15 2

We can do
library(dplyr)
mtcars %>%
group_by(cyl, am) %>%
summarise(across(c(disp, mpg), list(m = ~ min(.), n = ~n())), .groups = 'drop')
# A tibble: 6 x 6
cyl am disp_m disp_n mpg_m mpg_n
<dbl> <dbl> <dbl> <int> <dbl> <int>
1 4 0 120. 3 21.5 3
2 4 1 71.1 8 21.4 8
3 6 0 168. 4 17.8 4
4 6 1 145 3 19.7 3
5 8 0 276. 12 10.4 12
6 8 1 301 2 15 2

Related

How do I build a dplyr summarize statement programmatically?

I'm trying to do some dplyr programming and having trouble. I'd like to group_by an arbitrary number of variables (thus, across), and then summarize based on arbitrary length (but all the same length) vectors of:
The column to apply the function to
The function to apply
The name of the new column
So, like in a map or apply statement, I want to execute code that ends up looking like:
data %>%
group_by(group_column) %>%
summarize(new_name_1 = function_1(column_1),
summarize(new_name_2 = function_2(column_2))
Here's an example of what I want and my best shot so far. I know I can use the names argument to clean those up if I use across, but I'm not confident that across is the correct way. Finally, I'll be applying this to fairly large dataframes, so I'd rather not calculate the extra columns.
Desired result
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd))) %>%
select(cyl, carb, disp_mean, hp_sd)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
What I get
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(across(c("disp", "hp"), list(mean = mean, sd = sd)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 9 x 6
#> # Groups: cyl [3]
#> cyl carb disp_mean disp_sd hp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 21.4 77.4 16.1
#> 2 4 2 117. 27.1 87 24.9
#> 3 6 1 242. 23.3 108. 3.54
#> 4 6 4 164. 4.39 116. 7.51
#> 5 6 6 145 NA 175 NA
#> 6 8 2 346. 43.4 162. 14.4
#> 7 8 3 276. 0 180 0
#> 8 8 4 406. 57.8 234 21.7
#> 9 8 8 301 NA 335 NA
With different functions on different columns, an option is to use collap from collapse
library(collapse)
collap(mtcars, ~ cyl + carb, custom = list(fmean = 4, fsd = 5))
-output
cyl disp hp carb
1 4 91.38 16.133815 1
2 4 116.60 24.859606 2
3 6 241.50 3.535534 1
4 6 163.80 7.505553 4
5 6 145.00 NA 6
6 8 345.50 14.433757 2
7 8 275.80 0.000000 3
8 8 405.50 21.725561 4
9 8 301.00 NA 8
Or the index can be dynamically generated with match
collap(mtcars, ~ cyl + carb, custom = list(fmean =
match('disp', names(mtcars)), fsd = match('hp', names(mtcars))))
With tidyverse, an option is to loop over the column names of interest and the functions in map2 and do a join later
library(dplyr)
library(purrr)
library(stringr)
map2(c("disp", "hp"), c("mean", "sd"), ~
mtcars %>%
group_by(across(c('cyl', 'carb'))) %>%
summarise(across(all_of(.x), match.fun(.y),
.names = str_c("{.col}_", .y)), .groups = 'drop')) %>%
reduce(inner_join)
-output
# A tibble: 9 x 4
cyl carb disp_mean hp_sd
<dbl> <dbl> <dbl> <dbl>
1 4 1 91.4 16.1
2 4 2 117. 24.9
3 6 1 242. 3.54
4 6 4 164. 7.51
5 6 6 145 NA
6 8 2 346. 14.4
7 8 3 276. 0
8 8 4 406. 21.7
9 8 8 301 NA
I have a package on github {dplyover}
which can help with this kind of tasks. In this case we could use over2 to
loop over two character vectors simultaniously. The first vector contains the
variable names as string, which is why we have to wrap .x in sym() when
applying a function to it. The second vector contains the function names,
which we use as .y in a do.call. over2 creates the desired names automatically.
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(over2(c("disp", "hp"),
c("mean", "sd"),
~ do.call(.y, list(sym(.x)))
))
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
An alternative way building on the same logic is to use purrr::map2. However,
here we have to put some effort into creating vectors with the desired names.
library(purrr)
# setup vectors and names
myfuns <- c("mean", "sd")
myvars <- c("disp", "hp") %>%
set_names(., paste(., myfuns, sep = "_"))
mtcars %>%
group_by(across(c("cyl", "carb"))) %>%
summarise(map2(myvars,
myfuns,
~ do.call(.y, list(sym(.x)))
) %>% bind_cols()
)
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
#> # A tibble: 9 x 4
#> # Groups: cyl [3]
#> cyl carb disp_mean hp_sd
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1 91.4 16.1
#> 2 4 2 117. 24.9
#> 3 6 1 242. 3.54
#> 4 6 4 164. 7.51
#> 5 6 6 145 NA
#> 6 8 2 346. 14.4
#> 7 8 3 276. 0
#> 8 8 4 406. 21.7
#> 9 8 8 301 NA
Created on 2021-08-20 by the reprex package (v2.0.1)

working with columns containing `call` class objects in `tidyr::unnest`

I have a package that creates calls containing stats details that can then be displayed in plots.
Here is a simple use case:
# setup
set.seed(123)
library(statsExpressions)
library(tidyverse)
# two-sample t-test results in an expression
stats_exp <- bf_ttest(mtcars, am, wt)
# class of object
class(stats_exp)
#> [1] "call"
# using the expression to display details in a plot
ggplot(mtcars, aes(as.factor(am), wt)) + geom_boxplot() +
labs(subtitle = stats_exp)
Now let's say I wanted to do the same kind of visualizations for all levels of a grouping variable. In this case, I will need to create and save the call for each level.
I can successfully do so using tidyr, which can save the call objects in a list column:
# doing this across groups
(df <- mtcars %>%
group_nest(cyl) %>%
mutate(stats_exp = data %>% map(., ~bf_ttest(., am, wt))))
# A tibble: 3 x 3
cyl data stats_exp
<dbl> <list> <list>
1 4 <tibble [11 × 10]> <language>
2 6 <tibble [7 × 10]> <language>
3 8 <tibble [14 × 10]> <language>
# did it work? yes!
df$stats_exp[[1]]
#> atop(displaystyle(NULL), expr = paste("In favor of null: ", "log"["e"],
#> "(BF"["01"], ") = ", "-1.58", ", ", italic("r")["Cauchy"]^"JZS",
#> " = ", "0.71"))
The problem arises when I try to unnest it, which I would like to do since I will need to do some other operations on this dataframe somewhere downstream in my workflow:
# unnest
tidyr::unnest(data = df, cols = c(stats_exp, data))
#> Error: Input must be list of vectors
How can I avoid this error?
I'm not sure what you intend to do to the stats_exp after you've manipulated the other data but this could a potential solution:
set.seed(123)
library(statsExpressions)
library(tidyverse)
stats_exp <- bf_ttest(mtcars, am, wt)
df <- mtcars %>%
group_nest(cyl) %>%
mutate(stats_exp = map(data, ~ bf_ttest(.x, am, wt)),
stats_chr = map(stats_exp, ~ paste0(deparse(.x), collapse = " ")))
df %>%
select(stats_chr) %>%
unnest(cols = stats_chr)
#> # A tibble: 3 x 1
#> stats_chr
#> <chr>
#> 1 "atop(displaystyle(NULL), expr = paste(\"In favor of null: \", \"log\"[\"e\"]~
#> 2 "atop(displaystyle(NULL), expr = paste(\"In favor of null: \", \"log\"[\"e\"]~
#> 3 "atop(displaystyle(NULL), expr = paste(\"In favor of null: \", \"log\"[\"e\"]~
Created on 2020-02-25 by the reprex package (v0.3.0)
Based on a solution provided on Twitter (h/t #dvaughan32). unnest won't fail if stats_exp is not included in cols argument:
library(tidyverse)
library(statsExpressions)
# doing this across groups
df <- mtcars %>%
group_nest(cyl) %>%
mutate(stats_exp = data %>% map(., ~bf_ttest(., am, wt)))
# alternative
tidyr::unnest(data = df, cols = c(data))
#> # A tibble: 32 x 12
#> cyl mpg disp hp drat wt qsec vs am gear carb stats_exp
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>
#> 1 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1 <language>
#> 2 4 24.4 147. 62 3.69 3.19 20 1 0 4 2 <language>
#> 3 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 <language>
#> 4 4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1 <language>
#> 5 4 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2 <language>
#> 6 4 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1 <language>
#> 7 4 21.5 120. 97 3.7 2.46 20.0 1 0 3 1 <language>
#> 8 4 27.3 79 66 4.08 1.94 18.9 1 1 4 1 <language>
#> 9 4 26 120. 91 4.43 2.14 16.7 0 1 5 2 <language>
#> 10 4 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2 <language>
#> # … with 22 more rows
Created on 2020-02-27 by the reprex package (v0.3.0)

Purrr and Rlang - mapping functions with quasiquotation

If I have a function defined using rlang, how I can use purrr::map to use it with several variables ?
Suppose I have a function defined as:
mean_by <- function(data, by, var) {
data %>%
group_by({{ by }}) %>%
summarise(avg = mean({{ var }}, na.rm = TRUE))
}
Which computes group means,
Preferably using a purrr::map solution, how could I apply this function for several "by" variables but a single "var" in a data frame?
You need the !!! operator or using group_by_at
library(tidyverse)
mean_by <- function(data, by, var) {
data %>%
group_by_at(by) %>%
summarise(avg = {{var}} %>% mean(na.rm =TRUE))
}
mtcars %>%
mean_by(by = vars(mpg,cyl),hp)
#> # A tibble: 27 x 3
#> # Groups: mpg [25]
#> mpg cyl avg
#> <dbl> <dbl> <dbl>
#> 1 10.4 8 210
#> 2 13.3 8 245
#> 3 14.3 8 245
#> 4 14.7 8 230
#> 5 15 8 335
#> 6 15.2 8 165
#> 7 15.5 8 150
#> 8 15.8 8 264
#> 9 16.4 8 180
#> 10 17.3 8 180
#> # … with 17 more rows
# or
mean_by <- function(data, by, var) {
data %>%
group_by(!!!by) %>%
summarise(avg = {{var}} %>% mean(na.rm =TRUE))
}
mtcars %>%
mean_by(by = vars(cyl,disp),hp)
#> # A tibble: 27 x 3
#> # Groups: cyl [3]
#> cyl disp avg
#> <dbl> <dbl> <dbl>
#> 1 4 71.1 65
#> 2 4 75.7 52
#> 3 4 78.7 66
#> 4 4 79 66
#> 5 4 95.1 113
#> 6 4 108 93
#> 7 4 120. 97
#> 8 4 120. 91
#> 9 4 121 109
#> 10 4 141. 95
#> # … with 17 more rows
Created on 2020-01-07 by the reprex package (v0.3.0)
A good alternative is to "pass the dots".
The first argument will be the single variable you want to summarise, and use ... to pass all (if any) grouping variables you want.
This way you have a cleaner syntax for your function and you avoid including the vars function.
library(tidyverse)
mean_by <- function(data, var, ...) {
data %>%
group_by(...) %>%
summarise(avg = {{var}} %>% mean(na.rm =TRUE))
}
mtcars %>%
mean_by(hp, cyl, disp)
#> # A tibble: 27 x 3
#> # Groups: cyl [3]
#> cyl disp avg
#> <dbl> <dbl> <dbl>
#> 1 4 71.1 65
#> 2 4 75.7 52
#> 3 4 78.7 66
#> 4 4 79 66
#> 5 4 95.1 113
#> 6 4 108 93
#> 7 4 120. 97
#> 8 4 120. 91
#> 9 4 121 109
#> 10 4 141. 95
#> # ... with 17 more rows
mtcars %>%
mean_by(hp)
#> # A tibble: 1 x 1
#> avg
#> <dbl>
#> 1 147.
Created on 2020-01-08 by the reprex package (v0.3.0)

unquote a list of functions inside R dplyr functions

I was trying to pass a list of functions into dplyr summerize_at function and got a warning:
library(tidyverse)
library(purrr)
p <- c(0.2, 0.5, 0.8)
p_names <- map_chr(p, ~paste0(.x*100, "%"))
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>%
set_names(nm = p_names)
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), funs(!!!p_funs))
#> Warning: funs() is soft deprecated as of dplyr 0.8.0
#> please use list() instead
#>
#> # Before:
#> funs(name = f(.)
#>
#> # After:
#> list(name = ~f(.))
#> This warning is displayed once per session.
#> # A tibble: 3 x 4
#> cyl `20%` `50%` `80%`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 22.8 26 30.4
#> 2 6 18.3 19.7 21
#> 3 8 13.9 15.2 16.8
I then changed the funs to list but couldn't find a way to unquote the list of funs.
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ !!!p_funs))
#> Error in !p_funs: invalid argument type
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), list(~ {{p_funs}}))
#> Error: Column `mpg` must be length 1 (a summary value), not 3
list doesn't support splicing (!!!), use list2 or lst instead :
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), rlang::list2(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), lst(!!!p_funs))
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8
Though here the simplest is just to do :
mtcars %>%
group_by(cyl) %>%
summarize_at(vars(mpg), p_funs)
# # A tibble: 3 x 4
# cyl `20%` `50%` `80%`
# <dbl> <dbl> <dbl> <dbl>
# 1 4 22.8 26 30.4
# 2 6 18.3 19.7 21
# 3 8 13.9 15.2 16.8

Make prediction for each group differently

I have dataset that looks like this:
Category Weekly_Date a b
<chr> <date> <dbl> <dbl>
1 aa 2018-07-01 36.6 1.4
2 aa 2018-07-02 5.30 0
3 bb 2018-07-01 4.62 1.2
4 bb 2018-07-02 3.71 1.5
5 cc 2018-07-01 3.41 12
... ... ... ... ...
I fitted linear regression for each group separately:
fit_linreg <- train %>%
group_by(Category) %>%
do(model = lm(Target ~ Unit_price + Unit_discount, data = .))
Now I have different models for each category:
aa model1
bb model2
cc model3
So, I need to apply each model to the appropriate category. How to achieve that? (dplyr is preferable)
if you nest the data of your test data, join it with the models, then you can use map2 to make predictions on the test data with the trained models. See below example with mtcars.
library(tidyverse)
x <- mtcars %>%
group_by(gear) %>%
do(model = lm(mpg ~ hp + wt, data = .))
x
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
gear model
* <dbl> <list>
1 3 <S3: lm>
2 4 <S3: lm>
3 5 <S3: lm>
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict)) %>%
unnest(preds)
Joining, by = "gear"
# A tibble: 32 x 2
gear preds
<dbl> <dbl>
1 4 22.0
2 4 21.2
3 4 25.1
4 4 26.0
5 4 22.2
6 4 17.8
7 4 17.8
8 4 28.7
9 4 32.3
10 4 30.0
# ... with 22 more rows
Here's one approach, I'm using data.table to filter but you can use dplyr instead as well, I just prefer the data.table syntax.
d <- as.data.table(mtcars)
cats <- unique(d$cyl)
m <- lapply(cats, function(z){
return(lm(formula = mpg ~ wt + hp + disp,
data = d[cyl == z, ] ))
})
names(m) <- cats
OUTPUT
> summary(m)
Length Class Mode
6 12 lm list
4 12 lm list
8 12 lm list
# Checking first model
> m[[1]]
Call:
lm(formula = mpg ~ wt + hp + disp, data = d[cyl == z, ])
Coefficients:
(Intercept) wt hp disp
30.27791 -3.89618 -0.01097 0.01610
> sapply(1:length(m), function(z) return(summary(m[[z]])$adj.r.squared))
[1] 0.4434228 0.5829574 0.3461900
I named the list because it might be easier to refer to models by name aa or bb in your case. Hope this helps!
I find the nesting and un-nesting very unnatural, so here's my attempt.
Let's say you want the quality of the model's fit.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(r2 = summary(lm(mpg ~ wt, data = .))$r.squared))
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.509
#> 2 6 0.465
#> 3 8 0.423
Let's say you want the residuals:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>%
group_by(cyl) %>%
do(data.frame(resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 2
#> # Groups: cyl [3]
#> cyl resid
#> <dbl> <dbl>
#> 1 4 -3.67
#> 2 4 2.84
#> 3 4 1.02
#> 4 4 5.25
#> 5 4 -0.0513
#> 6 4 4.69
#> 7 4 -4.15
#> 8 4 -1.34
#> 9 4 -1.49
#> 10 4 -0.627
#> # ... with 22 more rows
See ?do for why you need the embedded data.frame(). You'll probably want to include other columns in the result. Not just the grouping variable and the residuals. I can't find a neat way to do this, other than listing them!
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(data.frame(disp = .$disp,
qsec = .$qsec,
resid = residuals(lm(mpg ~ wt, data = .))))
#> # A tibble: 32 x 4
#> # Groups: cyl [3]
#> cyl disp qsec resid
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 108 18.6 -3.67
#> 2 4 147. 20 2.84
#> 3 4 141. 22.9 1.02
#> 4 4 78.7 19.5 5.25
#> 5 4 75.7 18.5 -0.0513
#> 6 4 71.1 19.9 4.69
#> 7 4 120. 20.0 -4.15
#> 8 4 79 18.9 -1.34
#> 9 4 120. 16.7 -1.49
#> 10 4 95.1 16.9 -0.627
#> # ... with 22 more rows
Something that doesn't work
For the first example, I thought the following would work:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(r2 = summary(lm(mpg ~ wt, data = .))$r.squared)
#> # A tibble: 3 x 2
#> cyl r2
#> <dbl> <dbl>
#> 1 4 0.753
#> 2 6 0.753
#> 3 8 0.753
But you can see all models have the same r2. It's because the model is being fit to all the data, not per cyl. Looking at the authors' code, I believe this is because they've optimised the evaluation of mutate() and summarise() using Rcpp, and the optimisation doesn't work in this case. But do() works as expected. It subsets the data by group before passing it to the expression to be evaluated. I see they are pondering this, see Hyrbid Folding

Resources