dplyr group by colnames described as vector of strings - r

I'm trying to group_by multiple columns in my data frame and I can't write out every single column name in the group_by function so I want to call the column names as a vector like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.

Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to Harrison Jones' answer for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...

I believe group_by_at() has now been superseded by a combination of group_by() and across(). summarise() also has an experimental .groups argument that controls how the grouping is handled once the summarised object has been created. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
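A minimal sketch of wrapping the group_by(across(all_of(...))) pattern in a function, assuming dplyr 1.0.0 or later. The helper name count_by_names is hypothetical, and .groups = "drop" is used here (rather than "drop_last" as above) so that the result comes back ungrouped:

```r
library(dplyr)

# Hypothetical helper: count rows per combination of the columns named in `cols`.
# `cols` is a character vector of column names.
count_by_names <- function(data, cols) {
  data %>%
    group_by(across(all_of(cols))) %>%
    summarise(n = n(), .groups = "drop")  # return an ungrouped tibble
}

cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
res <- mtcars %>%
  filter(disp < 160) %>%
  count_by_names(cols)
```

all_of() makes the lookup strict: if any name in cols is missing from the data, the call errors instead of silently grouping by fewer columns.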

Related

Programming with `{data.table}`: how to name a new column?

The following question seems very basic in programming with data.table, so my apologies if it's a duplicate. I spent time researching but could not find an answer.
I want to create a "user-defined function" that wraps around a data.table wrangling procedure. In this procedure, a new column is created, and I want to let the user set the name of that new column.
Example
Consider the following code that works as-is. I want to wrap it inside a function.
library(data.table)
library(magrittr)
library(tibble)
mtcars %>%
as.data.table() %>%
.[, .(max_mpg = max(mpg)), by = cyl] %>%
as_tibble()
#> # A tibble: 3 x 2
#> cyl max_mpg
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
Created on 2021-10-13 by the reprex package (v0.3.0)
All I want my function to do is let the user set the name of new_colname_of_choice:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
#> # A tibble: 3 x 2
#> cyl new_colname_of_choice <---------- why this isn't called "my_lovely_colname"?
#> <dbl> <dbl>
#> 1 6 21.4
#> 2 4 33.9
#> 3 8 19.2
I've tried using curly braces which didn't work either (actually threw an error):
my_wrapper_2 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .({new_colname_of_choice} = max(mpg)), by = cyl] %>%
as_tibble()
}
Error: unexpected '=' in:
" as.data.table() %>%
.[, .({new_colname_of_choice} ="
Which is surprising because curly braces do promote the desired naming ability, but in a different (yet similar) kind of code:
my_wrapper_3 <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, {new_colname_of_choice} := max(mpg), by = cyl] %>%
as_tibble()
}
my_wrapper_3(new_colname_of_choice = "my_lovely_colname")
## # A tibble: 32 x 12
## mpg cyl disp hp drat wt qsec vs am gear carb my_lovely_colname <---- SUCCESS!
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 33.9
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 19.2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.4
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 19.2
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 33.9
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 33.9
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.4
## # ... with 22 more rows
Bottom line
My conclusion is that the = operator is sensitive to {...} on the LHS. How can I otherwise pass a name (from argument) to the LHS in the initial my_wrapper() example?
EDIT
I'd like to add the dplyr solution for the same problem, taken from the programming with dplyr vignette:
library(dplyr)
my_wrapper_dplyr <- function(new_colname_of_choice) {
mtcars %>%
group_by(cyl) %>%
summarise("{new_colname_of_choice}" := max(mpg))
}
my_wrapper_dplyr("another_lovely_colname")
Which is pretty robust and works in all naming situations I've encountered. Is there a built-in/canonical practice in data.table similar to {dplyr}'s?
With the upcoming data.table version 1.14.3, you'll be able to use the new env parameter:
A new interface for programming on data.table has been added, closing #2655 and many other linked issues. It is built using base R's substitute-like interface via a new env argument to [.data.table. For details see the new vignette programming on data.table, and the new ?substitute2 manual page. Thanks to numerous users for filing requests, and Jan Gorecki for implementing.
# install dev version
install.packages("https://github.com/Rdatatable/data.table/archive/master.tar.gz", repos = NULL, type = "source")
library(tibble)
library(data.table)
my_wrapper_new <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(new_colname_of_choice = max(mpg)), by = cyl,
env=list(new_colname_of_choice = new_colname_of_choice)] %>%
as_tibble()
}
my_wrapper_new('test')
# A tibble: 3 x 2
cyl test
<dbl> <dbl>
1 6 21.4
2 4 33.9
3 8 19.2
One thing you can do is separate the creation of the column and the naming of the column like so:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, .(tempcol = max(mpg)), by = cyl] %>%
setnames(., "tempcol", new_colname_of_choice) %>%
as_tibble()
}
my_wrapper("my_lovely_colname")
Using this method you can use either .(tempcol = max(mpg)) or tempcol := max(mpg)
Using setNames from stats:
my_wrapper <- function(new_colname_of_choice) {
mtcars %>%
as.data.table() %>%
.[, setNames(list(max(mpg)), new_colname_of_choice), by = cyl] %>%
as_tibble()
}
my_wrapper(new_colname_of_choice = "my_lovely_colname")
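The two renaming approaches above can also be written without pipes and checked against each other. This is only a sketch; the function names name_via_setnames and name_via_setNames are made up for illustration:

```r
library(data.table)

# Approach 1: compute under a temporary name, then rename by reference.
name_via_setnames <- function(new_name) {
  dt <- as.data.table(mtcars)[, .(tempcol = max(mpg)), by = cyl]
  setnames(dt, "tempcol", new_name)
  dt[]
}

# Approach 2: build a named list in j, so the name is taken from the argument.
name_via_setNames <- function(new_name) {
  as.data.table(mtcars)[, setNames(list(max(mpg)), new_name), by = cyl]
}

a <- name_via_setnames("my_lovely_colname")
b <- name_via_setNames("my_lovely_colname")
```

Both return a three-row table with columns cyl and my_lovely_colname, so either can be dropped into the wrapper from the question.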

Mimicking a secondary tidy dots argument in an R function

I'm looking to create a function that accepts a list of (data frame) variables as one of its parameters. I've managed to get it working partially, but when I get to the group_by/count, things fall apart. How can I do this??
## Works
f1 <- function(dfr, ..., split = NULL) {
dots <- rlang::enquos(...)
split <- rlang::enquos(split)
dfr %>%
select(!!!dots, !!!split) %>%
gather('type', 'score', -c(!!!split))
}
## does not work
f2 <- function(dfr, ..., split = NULL) {
dots <- rlang::enquos(...)
split <- rlang::enquos(split)
dfr %>%
select(!!!dots, !!!split) %>%
gather('type', 'score', -c(!!!split))
count(!!!split, type, score)
}
I would want to do things like
mtcars %>% f2(drat:qsec)
mtcars %>% f2(drat:qsec, split = gear)
mtcars %>% f2(drat:qsec, split = c(gear, carb)) ## ??
These calls all work with f1(), but none of them work with f2(). They all end up with an Error in !split : invalid argument type. I'm not too surprised that f2(drat:qsec) doesn't (immediately) work without the split argument, but how do I get the second and third commands working?
The issue with the second function (the missing pipe notwithstanding) is that count() (or rather group_by() which is called by count()) doesn't support tidyselect syntax so you can't pass it a list to be spliced like you can with select(), gather() etc. Instead, one option is to use group_by_at() and add_tally(). Here's a slightly modified version of the function:
library(dplyr)
f2 <- function(dfr, ..., split = NULL) {
dfr %>%
select(..., {{split}}) %>%
gather('type', 'score', -{{split}}) %>%
group_by_at(vars({{split}}, type, score)) %>% # could use `group_by_all()`
add_tally()
}
mtcars %>% f2(drat:qsec)
# A tibble: 96 x 3
# Groups: type, score [81]
type score n
<chr> <dbl> <int>
1 drat 3.9 2
2 drat 3.9 2
3 drat 3.85 1
4 drat 3.08 2
5 drat 3.15 2
6 drat 2.76 2
7 drat 3.21 1
8 drat 3.69 1
9 drat 3.92 3
10 drat 3.92 3
# ... with 86 more rows
mtcars %>% f2(drat:qsec, split = c(gear, carb))
# A tibble: 96 x 5
# Groups: gear, carb, type, score [89]
gear carb type score n
<dbl> <dbl> <chr> <dbl> <int>
1 4 4 drat 3.9 2
2 4 4 drat 3.9 2
3 4 1 drat 3.85 1
4 3 1 drat 3.08 1
5 3 2 drat 3.15 2
6 3 1 drat 2.76 1
7 3 4 drat 3.21 1
8 4 2 drat 3.69 1
9 4 2 drat 3.92 1
10 4 4 drat 3.92 2
# ... with 86 more rows
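With dplyr 1.0.0 or later, across() is another way to hand a tidyselect-style argument to count() or group_by(). A minimal sketch, assuming that version; the helper name count_split is hypothetical and, unlike f2 above, it requires split to be supplied:

```r
library(dplyr)

# Hypothetical helper: `split` accepts tidyselect input such as gear or c(gear, carb).
count_split <- function(dfr, split) {
  dfr %>%
    count(across({{ split }}))  # across() turns the selection into grouping columns
}

by_gear <- mtcars %>% count_split(gear)
by_gear_carb <- mtcars %>% count_split(c(gear, carb))
```

Extra grouping columns can be added after across(), e.g. count(across({{ split }}), type, score) once the data has been reshaped as in f2.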

Using a data_frame as an argument into a mutate and group_by routine

I have this data_frame (db) here with lots of columns:
A B C D ... ZZ
1 .23 .21 ... .23
2 .45 .12 ... .23
1 .47 ... .53
2 .49 ... .27
I want to employ group_by and mutate with a function which gets a complete data_frame and returns a vector.
function1 <- function(data_frame) {
...
return(vector)
}
db %>%
group_by(A) %>%
mutate(results = function1(.))
This is not working. It returns the results of using the function with the whole data_frame, not with the groups.
I know I could solve it using for, but I'm looking for a dplyr solution. The function necessarily gets a data_frame, I'm not passing columns separately as arguments.
dplyr
My trick has been to use bind_cols. By itself it won't honor any groups, so you need to nest it within a do block, such as:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
do(bind_cols(., {
# "insert complex stuff here"
data_frame(results = apply(., 1, mean))
}))
# Source: local data frame [32 x 12]
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb results
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 23.59818
# 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 24.63455
# 3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 27.23364
# # ... with 29 more rows
One benefit of this approach is that the code in the block can return one or more columns without further complication.
So, using your code, it would look something like:
db %>%
group_by(A) %>%
do(bind_cols(., data_frame(results = function1(.))))
tidyr
Another option is to use tidyr (see the RStudio blog post on it; though a little out of date, it is still useful).
library(tidyr) # nest, unnest
library(purrr) # map
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(results = map(data, ~ apply(., 1, mean))) %>%
unnest()
Your code might be something like (untested):
db %>%
group_by(A) %>%
nest() %>%
mutate(results = purrr::map(data, ~ function1(.))) %>%
unnest()
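More recent dplyr versions also provide group_modify(), which passes each group's rows to a function as a data frame. A sketch under that assumption; centre_mpg is a hypothetical stand-in for function1() that takes a whole data frame and returns one value per row:

```r
library(dplyr)

# Hypothetical stand-in for function1(): gets a data frame, returns a vector
# with one element per row (here, each row's mpg minus the group's mean mpg).
centre_mpg <- function(df) df$mpg - mean(df$mpg)

res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ mutate(.x, results = centre_mpg(.x))) %>%
  ungroup()
```

Inside the lambda, .x holds the rows of the current group without the grouping column, so the function is applied per group rather than to the whole data frame.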

R - dplyr Summarize and Retain Other Columns

I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks
TDAAtest <- data.frame(State=sample(state.abb,1000,replace=TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State,state.abb)]
TDAA.states <- TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
summarize(n=n()) %>%
ungroup() %>%
arrange(State)
Perhaps we need
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
summarise(State.Full.Name = first(State.Full.Name), n = n())
Or use mutate to create the column and then do the distinct
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
mutate(n= n()) %>%
distinct(State, .keep_all=TRUE)
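As a quick check that the two variants above agree, here is a sketch that runs both on the question's data and compares them. The seed and the names via_first and via_distinct are arbitrary choices for illustration:

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only to make the sample reproducible
TDAAtest <- data.frame(State = sample(state.abb, 1000, replace = TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State, state.abb)]

# Variant 1: carry the extra column through summarise() with first().
via_first <- TDAAtest %>%
  group_by(State) %>%
  summarise(State.Full.Name = first(State.Full.Name), n = n())

# Variant 2: add the count with mutate(), then keep one row per State.
via_distinct <- TDAAtest %>%
  group_by(State) %>%
  mutate(n = n()) %>%
  distinct(State, .keep_all = TRUE) %>%
  ungroup() %>%
  arrange(State)
```

The first variant returns one sorted row per group directly; the second needs the extra ungroup() and arrange() to end up in the same shape.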
To retain all columns, you can include across() as a summarize argument, as explained in the documentation for dplyr::do().
by_cyl <- head(mtcars) %>%
group_by(cyl)
by_cyl %>%
summarise(m_mpg = mean(mpg), across())
cyl m_mpg mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 22.8 22.8 108 93 3.85 2.32 18.6 1 1 4 1
2 6 20.4 21 160 110 3.9 2.62 16.5 0 1 4 4
3 6 20.4 21 160 110 3.9 2.88 17.0 0 1 4 4
4 6 20.4 21.4 258 110 3.08 3.22 19.4 1 0 3 1
5 6 20.4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
6 8 18.7 18.7 360 175 3.15 3.44 17.0 0 0 3 2
To retain only a subset of unaltered columns, you can select them within across using tidyselect semantics.
I believe there are more accurate answers than the accepted one, especially when the other columns are not unique within each group (e.g. when you want the max, the min, or the top n items based on one particular column).
The accepted answer works for this question, but suppose, for instance, that you would like to find the county with the max population for each state (you need county and population columns).
We have the following options:
1. dplyr version
From this link, you have three extra operations (mutate, ungroup and filter) to achieve that:
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
mutate(maxPopulation = max(Population)) %>%
ungroup() %>%
filter(maxPopulation == Population)
2. Function version
This one gives you as much flexibility as you want and you can apply any kind of operation to each group:
maxFUN = function(x) {
# order population in a descending order
x = x[with(x, order(-Population)), ]
x[1, ]
}
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
do(maxFUN(.))
This one is highly recommended for more complex operations. For instance, you can return the top n (topN) counties per state by returning x[1:topN, ] (or head(x, topN), which is safe when a group has fewer than topN rows) from maxFUN.
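For the top-n case specifically, dplyr 1.0.0 and later also offer slice_max(). A minimal sketch, assuming that version; the counties data frame below is invented toy data (made-up states, counties and populations), since the question's data has no Population column:

```r
library(dplyr)

# Invented toy data, for illustration only.
counties <- data.frame(
  State = c("AA", "AA", "BB", "BB", "BB"),
  County = c("a1", "a2", "b1", "b2", "b3"),
  Population = c(650, 410, 290, 32, 95)
)

# Largest county per state; with_ties = FALSE keeps exactly one row per group.
top1 <- counties %>%
  group_by(State) %>%
  slice_max(Population, n = 1, with_ties = FALSE) %>%
  ungroup()

# Two largest counties per state.
top2 <- counties %>%
  group_by(State) %>%
  slice_max(Population, n = 2, with_ties = FALSE) %>%
  ungroup()
```

All other columns of the selected rows are retained, which is the behaviour the question asks for.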

Stepping through a pipeline with intermediate results

Is there a way to output the result of a pipeline at each step without doing it manually? (eg. without selecting and running only the selected chunks)
I often find myself running a pipeline line-by-line to remember what it was doing or when I am developing some analysis.
For example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
sample_frac(0.1) %>%
summarise(res = mean(mpg))
# Source: local data frame [3 x 2]
#
# cyl res
# 1 4 33.9
# 2 6 18.1
# 3 8 18.7
I'd have to select and run:
mtcars %>% group_by(cyl)
and then...
mtcars %>% group_by(cyl) %>% sample_frac(0.1)
and so on...
But selecting and CMD/CTRL+ENTER in RStudio leaves a more efficient method to be desired.
Can this be done in code?
Is there a function which takes a pipeline and runs/digests it line by line, showing the output at each step in the console, so that you continue by pressing enter as in demo(...) or example(...) from package guides?
You can select which results to print by using the tee-operator (%T>%) and print(). The tee-operator is used exclusively for side-effects like printing.
# i.e.
mtcars %>%
group_by(cyl) %T>% print() %>%
sample_frac(0.1) %T>% print() %>%
summarise(res = mean(mpg))
It is easy with a magrittr function chain. For example, define a function my_chain with:
foo <- function(x) x + 1
bar <- function(x) x + 1
baz <- function(x) x + 1
my_chain <- . %>% foo %>% bar %>% baz
and get the final result of a chain as:
> my_chain(0)
[1] 3
You can get a function list with functions(my_chain)
and define a "stepper" function like this:
stepper <- function(fun_chain, x, FUN = print) {
f_list <- functions(fun_chain)
for(i in seq_along(f_list)) {
x <- f_list[[i]](x)
FUN(x)
}
invisible(x)
}
And run the chain with interposed print function:
stepper(my_chain, 0, print)
# [1] 1
# [1] 2
# [1] 3
Or with waiting for user input:
stepper(my_chain, 0, function(x) {print(x); readline()})
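The same idea applies to a dplyr pipeline written as a function chain. Here is a sketch that collects the intermediate results in a list instead of printing them; collect_steps is a hypothetical helper name, and the sampling step from the question is left out so that the result is deterministic:

```r
library(magrittr)
library(dplyr)

# A pipeline stored as a function chain (note the leading dot).
chain <- . %>%
  group_by(cyl) %>%
  summarise(res = mean(mpg))

# Hypothetical helper: apply each step in turn and keep every intermediate result.
collect_steps <- function(fun_chain, x) {
  out <- list()
  for (f in functions(fun_chain)) {
    x <- f(x)
    out[[length(out) + 1L]] <- x
  }
  out
}

steps <- collect_steps(chain, mtcars)
```

steps[[1]] is the grouped data and steps[[2]] the final summary, so each stage can be inspected after the fact.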
Add print:
mtcars %>%
group_by(cyl) %>%
print %>%
sample_frac(0.1) %>%
print %>%
summarise(res = mean(mpg))
IMHO magrittr is mostly useful interactively, that is when I am exploring data or building a new formula/model.
In these cases, storing intermediate results in distinct variables is very time consuming and distracting, while pipes let me focus on data, rather than typing:
x %>% foo
## reason on results and
x %>% foo %>% bar
## reason on results and
x %>% foo %>% bar %>% baz
## etc.
The problem here is that I don't know in advance what the final pipe will be, as in @bergant's answer.
Typing, as in @zx8754's answer,
x %>% print %>% foo %>% print %>% bar %>% print %>% baz
adds too much overhead and, to me, defeats the whole purpose of magrittr.
Essentially magrittr lacks a simple operator that both prints and pipes results.
The good news is that it seems quite easy to craft one:
`%P>%`=function(lhs, rhs){ print(lhs); lhs %>% rhs }
Now you can print and pipe:
1:4 %P>% sqrt %P>% sum
## [1] 1 2 3 4
## [1] 1.000000 1.414214 1.732051 2.000000
## [1] 6.146264
I found that if one defines key bindings for %P>% and %>%, the prototyping workflow is very streamlined (see Emacs ESS or RStudio).
I wrote the package pipes, which can do several things that might help:
use %P>% to print the output.
use %ae>% to use all.equal on input and output.
use %V>% to use View on the output, it will open a viewer for each relevant step.
If you want to see some aggregated info you can try %summary>%, %glimpse>% or %skim>% which will use summary, tibble::glimpse or skimr::skim, or you can define your own pipe to show specific changes, using new_pipe
# devtools::install_github("moodymudskipper/pipes")
library(dplyr)
library(pipes)
res <- mtcars %P>%
group_by(cyl) %P>%
sample_frac(0.1) %P>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
#> sample_frac(., 0.1)
#> # A tibble: 3 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 2 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> summarise(., res = mean(mpg))
#> # A tibble: 3 x 2
#> cyl res
#> <dbl> <dbl>
#> 1 4 26
#> 2 6 17.8
#> 3 8 18.7
res <- mtcars %ae>%
group_by(cyl) %ae>%
sample_frac(0.1) %ae>%
summarise(res = mean(mpg))
#> group_by(., cyl)
#> [1] "Attributes: < Names: 1 string mismatch >"
#> [2] "Attributes: < Length mismatch: comparison on first 2 components >"
#> [3] "Attributes: < Component \"class\": Lengths (1, 4) differ (string compare on first 1) >"
#> [4] "Attributes: < Component \"class\": 1 string mismatch >"
#> [5] "Attributes: < Component 2: Modes: character, list >"
#> [6] "Attributes: < Component 2: Lengths: 32, 2 >"
#> [7] "Attributes: < Component 2: names for current but not for target >"
#> [8] "Attributes: < Component 2: Attributes: < target is NULL, current is list > >"
#> [9] "Attributes: < Component 2: target is character, current is tbl_df >"
#> sample_frac(., 0.1)
#> [1] "Different number of rows"
#> summarise(., res = mean(mpg))
#> [1] "Cols in y but not x: `res`. "
#> [2] "Cols in x but not y: `qsec`, `wt`, `drat`, `hp`, `disp`, `mpg`, `carb`, `gear`, `am`, `vs`. "
res <- mtcars %V>%
group_by(cyl) %V>%
sample_frac(0.1) %V>%
summarise(res = mean(mpg))
# you'll have to test this one by yourself

Resources