Is it possible to group_by using regex match on column names using dplyr?
library(dplyr) # dplyr_0.5.0; R version 3.3.2 (2016-10-31)
# dummy data
set.seed(1)
df1 <- sample_n(iris, 20) %>%
mutate(Sepal.Length = round(Sepal.Length),
Sepal.Width = round(Sepal.Width))
Group by static version (looks/works fine, imagine if we have 10-20 columns):
df1 %>%
group_by(Sepal.Length, Sepal.Width) %>%
summarise(mySum = sum(Petal.Length))
Group by dynamic - "ugly" version:
df1 %>%
group_by_(.dots = colnames(df1)[ grepl("^Sepal", colnames(df1))]) %>%
summarise(mySum = sum(Petal.Length))
Ideally, something like this (doesn't work, as starts_with returns indices):
df1 %>%
group_by(starts_with("Sepal")) %>%
summarise(mySum = sum(Petal.Length))
Error in eval(expr, envir, enclos) :
wrong result size (0), expected 20 or 1
Expected output:
# Source: local data frame [6 x 3]
# Groups: Sepal.Length [?]
#
#   Sepal.Length Sepal.Width mySum
#          <dbl>       <dbl> <dbl>
# 1            4           3   1.4
# 2            5           3  10.9
# 3            6           2   4.0
# 4            6           3  43.7
# 5            7           3  15.7
# 6            8           4   6.4
Note: sounds very much like a duplicated post, kindly link the relevant posts if any.
This feature will be implemented in a future release; see GitHub issue #2619.
The solution is to use the group_by_at() function:
df1 %>%
group_by_at(vars(starts_with("Sepal"))) %>%
summarise(mySum = sum(Petal.Length))
Edit: This is now implemented in dplyr_0.7.1
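For readers on newer versions, here is a minimal sketch of the same idea, assuming dplyr 1.0.0 or later, where across() inside group_by() plays the role of group_by_at(). The matches() helper takes a regular expression, which is closest to what the question asks for. The object names res_prefix and res_regex are made up for illustration.

```r
library(dplyr)

# Same dummy data as in the question
set.seed(1)
df1 <- sample_n(iris, 20) %>%
  mutate(Sepal.Length = round(Sepal.Length),
         Sepal.Width  = round(Sepal.Width))

# Group by every column whose name starts with "Sepal"
res_prefix <- df1 %>%
  group_by(across(starts_with("Sepal"))) %>%
  summarise(mySum = sum(Petal.Length), .groups = "drop")

# Same grouping, selected with a regular expression instead
res_regex <- df1 %>%
  group_by(across(matches("^Sepal"))) %>%
  summarise(mySum = sum(Petal.Length), .groups = "drop")
```

Both calls should select the same two grouping columns here; matches() is the one to reach for when the pattern is more involved than a plain prefix.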
If you want to stick to dplyr functions only, you can try:
df1 %>%
group_by_(.dots = df1 %>% select(contains("Sepal")) %>% colnames()) %>%
summarise(mySum = sum(Petal.Length))
It's not necessarily much prettier, but it gets rid of the regex.
Folks, I have a couple of questions about how tidy evaluation works with dplyr.
The following code produces a tally of cars by cylinder using the mtcars dataset:
mtcars %>%
select(cyl) %>%
group_by(cyl) %>%
tally()
With output as expected:
# A tibble: 3 x 2
cyl n
* <dbl> <int>
1 4 11
2 6 7
3 8 14
If I want to pass the grouping factor as variable, then this fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(var) %>%
tally()
with error message:
Error: Must group by variables found in `.data`.
* Column `var` is not found.
This also fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by({{ var}}) %>%
tally()
Producing output:
# A tibble: 1 x 2
`"cyl"` n
* <chr> <int>
1 cyl 32
This code, however, works as expected:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(.data[[ var]]) %>%
tally()
Producing the expected output:
# A tibble: 3 x 2
cyl n
* <dbl> <int>
1 4 11
2 6 7
3 8 14
I have two questions about this and wondering if someone can help!
Why does select(var) work fine without using any of the dplyr tidy evaluation extensions, such as select({{ var }}) or select(.data[[ var ]])?
What is it about group_by() that makes group_by({{ var }}) wrong but group_by(.data[[ var ]]) right?
Thanks so much!
Matt.
It depends on how those functions work and accept input.
If you look at the documentation at ?select the relevant part for this question is -
These helpers select variables from a character vector:
all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names that don't exist.
So select() accepts character vectors through all_of() and any_of(). Passing a bare external vector still works, but it is ambiguous, which is why you get this note when you run mtcars %>% select(var):
Note: Using an external vector in selections is ambiguous.
ℹ Use all_of(var) instead of var to silence this message.
and no warning with mtcars %>% select(all_of(var)).
As far as group_by is concerned there is no such specific provision and you need to use mtcars %>% group_by(.data[[var]]).
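To make the difference concrete, here is a minimal sketch, assuming dplyr 1.0.0 or later. As far as I understand it, `{{ }}` is meant for forwarding a function argument that holds a bare column name; used at top level on a variable holding a string, it just injects the value "cyl", which matches the constant `"cyl"` column shown in the question. The helper name tally_by is hypothetical.

```r
library(dplyr)

var <- "cyl"

# String -> column via the .data pronoun
by_pronoun <- mtcars %>%
  group_by(.data[[var]]) %>%
  tally()

# String -> column via tidyselect, wrapped in across()
by_across <- mtcars %>%
  group_by(across(all_of(var))) %>%
  tally()

# {{ }} is for a bare column name passed through a function argument
tally_by <- function(data, group_col) {
  data %>%
    group_by({{ group_col }}) %>%
    tally()
}
by_embrace <- tally_by(mtcars, cyl)
```

All three should give the 11 / 7 / 14 tally from the question. The across(all_of()) form also extends to a character vector holding several column names.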
On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
column.names <- c("group1", "group2") # two column names I want to be able to toggle between for grouping
select.column <- column.names[1] # select the column for grouping
DataTable.summary <-
DataTable %>%
group_by(select.column) %>% #How do I pass that selection in here?
summarize(avg.price = mean(SALES.PRICE))
This is copied straight from the tidyverse website (https://dplyr.tidyverse.org/articles/programming.html#programming-recipes):
my_summarise <- function(df, group_var) {
group_var <- enquo(group_var)
print(group_var)
df %>%
group_by(!! group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env: global
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
But I think it illustrates your problem. What you really want to do is something like the code above, i.e. create a function.
You can use the group_by_ function for the example in your question:
library(dplyr)
x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.colums <- c("group1", "group2")
x %>% group_by_(select.colums[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
# group2 avg
# <fct> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
The *_ family of functions in dplyr might also offer the more general solution you are after, although the dplyr documentation (?group_by_) says they are deprecated and might disappear at some point. An analogous expression using the tidy evaluation syntax seems to be:
x %>% group_by(!!sym(select.colums[2])) %>% summarize(avg = mean(value))
And for several columns:
x %>% group_by(!!!syms(select.colums)) %>% summarize(avg = mean(value))
sym() creates a symbol out of a string, and !! unquotes it so that dplyr evaluates it as a column name.
I recommend using group_by_at(). It supports both single strings and character vectors:
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)
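For the toggle use case in the question, a small wrapper that takes the grouping columns as a character vector may be the cleanest option. The sketch below assumes dplyr 1.0.0 or later and uses across(all_of()) in place of group_by_at(). The data frame is a hypothetical stand-in for the asker's DataTable, and summarise_price is a made-up helper name.

```r
library(dplyr)

# Hypothetical stand-in for the asker's DataTable
DataTable <- data.frame(
  group1 = c("a", "a", "b", "b"),
  group2 = c("x", "y", "x", "y"),
  SALES.PRICE = c(10, 20, 30, 40)
)

# group_cols is a character vector of one or more column names
summarise_price <- function(data, group_cols) {
  data %>%
    group_by(across(all_of(group_cols))) %>%
    summarise(avg.price = mean(SALES.PRICE), .groups = "drop")
}

column.names <- c("group1", "group2")

# Toggle between grouping columns by changing the index
one_col  <- summarise_price(DataTable, column.names[1])
two_cols <- summarise_price(DataTable, column.names)
```

Because all_of() errors on names that are not present, a typo in the settings section fails early rather than silently grouping by nothing.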
I'm trying to count the number of unique elements in each column in the spark dataset s.
However, it seems that Spark doesn't recognize tally():
k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(tally(distinct(.)))))
Error: org.apache.spark.sql.AnalysisException: undefined function TALLY
It seems that Spark doesn't recognize simple R functions either, like "unique" or "length". I can run the code on local data, but when I try to run the exact same code on a Spark table, it doesn't work.
```
d <- data.frame(cbind(seq(1, 10, 1), rep(1, 10)))
d$group <- rep(c("a", "b"), each = 5)
d %>% group_by(group) %>% summarise_each(funs(length(unique(.))))
# A tibble: 2 × 3
#   group    X1    X2
#   <chr> <int> <int>
# 1     a     5     1
# 2     b     5     1
k <- collect(s %>% group_by(grouping_type) %>% summarise_each(funs(length(unique(.)))))
# Error: org.apache.spark.sql.AnalysisException: undefined function UNIQUE;
```
library(sparklyr)
library(dplyr)
#I am on Spark V. 2.1
#Building example input (local)
d <- data.frame(cbind(seq(1, 10, 1), rep(1,10)))
d$group <- rep(c("a","b"), each = 5)
d
#Spark tbl
sdf <- sparklyr::sdf_copy_to(sc, d)
# The Answer
sdf %>%
group_by(group) %>%
summarise_all(funs(n_distinct)) %>%
collect()
#Output
group X1 X2
<chr> <dbl> <dbl>
1 b 5 1
2 a 5 1
NB: Since we are using sparklyr, I went for dplyr::n_distinct().
Minor: dplyr::summarise_each is deprecated, hence dplyr::summarise_all.
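In more recent dplyr releases, funs() is itself deprecated and summarise_all() is superseded by across(). Below is a minimal sketch of the same per-group distinct count on the local data frame, assuming dplyr 1.0.0 or later. Whether the across() form translates on a Spark table depends on the sparklyr and dbplyr versions in use, so treat that part as an assumption to check against your own connection.

```r
library(dplyr)

# Local example input, same shape as above
d <- data.frame(X1 = seq(1, 10, 1), X2 = rep(1, 10))
d$group <- rep(c("a", "b"), each = 5)

# Count distinct values in every non-grouping column, per group
distinct_counts <- d %>%
  group_by(group) %>%
  summarise(across(everything(), n_distinct), .groups = "drop")
```

Inside a grouped summarise(), everything() leaves out the grouping column, so only X1 and X2 are counted.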
Remember that when you are writing sparklyr code you are really transpiling to Spark SQL, so you may need to use Spark SQL verbs from time to time. This is one of those times where Spark SQL verbs like count and distinct come in handy.
library(sparklyr)
sc <- spark_connect()
iris_spk <- copy_to(sc, iris)
# for instance this does not work in R, but it does in sparklyr
iris_spk %>%
summarise(Species = distinct(Species))
# or
iris_spk %>%
summarise(Species = approx_count_distinct(Species))
# this does what you are looking for
iris_spk %>%
group_by(Species) %>%
summarise_all(funs(n_distinct))
# for larger data sets this is much faster
iris_spk %>%
group_by(Species) %>%
summarise_all(funs(approx_count_distinct))
I have a parent dataset nesting multiple datasets (i.e. a tibble where each cell is a tibble), and for each nested dataset I want to find the number of rows in each group. The standard way, using a single dataset, would simply be group_by(var) %>% mutate(nrow=n()).
But now that I do this for multiple datasets with a map() call, it looks like the n() call refers to the (implicit) grouping made by map(), not the explicit grouping within my local dataset made by group_by?
Standard way for one single dataset, n() returns 50:
iris %>%
group_by(., Species) %>%
mutate(nrow=n())
Dataset of datasets:
df <- data_frame(name=c("a", "b"), Data=list(iris, iris))
df2 <- df %>%
mutate(Data2=map(Data, ~group_by(., Species) %>%
mutate(nrow=n()) %>%
ungroup()))
but now n() returned 2?
df2[1,]$Data2[[1]]
If you define the function outside of mutate it works fine (I assume this output is what you have in mind...)
fun <- function(x) {
df <- group_by(x, Species) %>%
summarise(nrow = n())
}
df2 <- df %>%
mutate(Data2=map(Data, fun))
df2$Data2
# [[1]]
# # A tibble: 3 x 2
# Species nrow
# <fctr> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
#
# [[2]]
# # A tibble: 3 x 2
# Species nrow
# <fctr> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
Another option, available since version 0.7.0, is to use add_count(), which does not conflict with map() and simplifies the code anyway:
# standard case:
iris %>%
add_count(Species)
## df of df:
df2 <- df %>%
mutate(Data2=map(Data, ~add_count(., Species)))
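Putting the two suggestions side by side, here is a self-contained sketch that also keeps the mutate(nrow = n()) shape the question asked for. It assumes reasonably recent dplyr and purrr and uses tibble() in place of the deprecated data_frame(). The helper name count_rows_per_species is made up for illustration.

```r
library(dplyr)
library(purrr)

# Tibble of tibbles, as in the question
df <- tibble(name = c("a", "b"), Data = list(iris, iris))

# Named helper: n() is evaluated inside the nested data, not the outer tibble
count_rows_per_species <- function(x) {
  x %>%
    group_by(Species) %>%
    mutate(nrow = n()) %>%
    ungroup()
}

df2 <- df %>%
  mutate(
    Data2 = map(Data, count_rows_per_species),     # helper defined outside mutate()
    Data3 = map(Data, ~ add_count(.x, Species))    # add_count() adds a column named n
  )
```

Both new list-columns keep all 150 rows of each nested dataset and attach the per-species row count, rather than collapsing to one row per species as summarise() does.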