Keep columns when using do - r

Code
Suppose I have the following code. (I know that a simple mutate could replace the second do here, and that rowwise() could then be skipped, but that is not the point: in my real code the second do is more complicated and fits a model.)
library(dplyr)
set.seed(1)
d <- data_frame(n = c(5, 1, 3))
e <- d %>% group_by(n) %>%
do(data_frame(y = rnorm(.$n), dat = list(data.frame(a = 1))))
e %>% rowwise() %>% do(data_frame(sum = .$y + .$n))
# Source: local data frame [9 x 1]
# Groups: <by row>
# # A tibble: 9 x 1
# sum
# * <dbl>
# 1 0.3735462
# 2 3.1836433
# 3 2.1643714
# 4 4.5952808
# 5 5.3295078
# 6 4.1795316
# 7 5.4874291
# 8 5.7383247
# 9 5.5757814
Problem
As you can see, the result contains only the column sum.
Question
Is there a way to keep the original columns from e without specifying them explicitly (as in e %>% do(data_frame(n = .$n, y = .$y, dat = .$dat, sum = .$y + .$n))) in dplyr, or do I have to use purrrlyr::by_row? (Not that I dislike purrrlyr*; I was just wondering whether there is a straightforward dplyr way of doing it that I may have overlooked.)
e %>% purrrlyr::by_row(function(x) x$y + x$n, .collate = "cols", .to = "sum")
*) Well, there is in fact a catch with purrrlyr::by_row:
e %>% purrrlyr::by_row(function(x) data_frame(sum = x$y + x$n, diff = x$y - x$n),
.collate ="cols")
This produces columns sum1 and diff1, which I would need to rename to sum and diff, adding another line of code.

I almost never use do, but rather do a combination of nest, mutate and map.
It's a bit hard to tell how that would look in your case, as your example doesn't seem to fully specify your needs.
In the simplest case, you can name the variables you need and map over them (this also works if they are lists of S3 objects, for example):
mutate(e, sum = map2_dbl(y, n, `+`))
Or, you could nest the required data then map the whole data. E.g.:
f <- e
f$r <- 1:nrow(e) # i.e. add some other variable, not necessarily row indices
f %>%
ungroup() %>% # e was still grouped
nest(n:dat) %>% # specify which variables you need
mutate(sum = map_dbl(data, ~.$y + .$n)) %>% # map to data, use the same formula as in do
unnest() # unnest to get original columns back
Both leave the original columns untouched.
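For the simple row-by-row case in the question, a minimal sketch is shown here, assuming dplyr 1.0.0 or later, where rowwise() followed by mutate() keeps every existing column. The tibble e below is a simplified stand-in for the one in the question (no dat list column), not the original data.

```r
library(dplyr)

set.seed(1)
# Stand-in for `e`: one row per draw, with `n` repeated within each group
e <- tibble(n = rep(c(1, 3, 5), times = c(1, 3, 5)), y = rnorm(9))

res <- e %>%
  rowwise() %>%              # treat every row as its own group
  mutate(sum = y + n) %>%    # mutate() appends a column instead of replacing
  ungroup()                  # drop the row-wise grouping again

names(res)
```

For a more involved per-row computation such as a model fit, the result would probably need to be wrapped in list() so that it is stored as a list column.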
For a modeling example, e.g.:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(model = map(data, ~lm(qsec ~ hp, .)),
coef = map_dbl(model, ~coef(.)[2])) %>%
unnest(data)
This will give you all your original data, with the regression coefficient for each group added. Before unnesting, the whole models are stored in your data.frame as a list column.
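The same idea can be sketched without group_by(), assuming tidyr 1.0.0 or later, where nest() takes a named column specification. The name res and the choice to drop the model column before unnesting are illustrative, not part of the answer above.

```r
library(dplyr)
library(tidyr)
library(purrr)

fits <- mtcars %>%
  nest(data = -cyl) %>%                                    # one row per cyl value
  mutate(model = map(data, ~ lm(qsec ~ hp, data = .x)),    # one model per group
         coef  = map_dbl(model, ~ coef(.x)[["hp"]]))       # slope for hp

# Drop the model list column, then expand back to one row per car
res <- fits %>%
  select(-model) %>%
  unnest(data)

dim(res)
```

All original mtcars columns come back after unnest(), with coef repeated within each cyl group.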

Related

Add summarize variable in multiple statements using dplyr?

In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where, if x=TRUE, I want to add variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
  summ <- data %>% summarize(n=n(), n_unique=n_distinct(val))
} else if (summarize_num) {
  summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
  summ <- data %>% summarize(n_unique=n_distinct(val))
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
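One way to avoid recomputing the summary for every row is to collect the requested summaries as unevaluated expressions and splice them into a single summarise() call. This is a sketch assuming rlang's expr() and the !!! operator; the list name summaries is made up for illustration. Whether the result translates to SQL under dbplyr has not been checked here.

```r
library(dplyr)
library(rlang)

summarize_num <- TRUE
summarize_num_distinct <- TRUE
data <- data.frame(val = c(1, 2, 2), grp = c(1, 1, 2))

# Each condition independently contributes one named expression
summaries <- list()
if (summarize_num) summaries$n <- expr(n())
if (summarize_num_distinct) summaries$n_unique <- expr(n_distinct(val))

summ <- data %>%
  group_by(grp) %>%
  summarise(!!!summaries)   # splice the named expressions in as arguments

summ
```

Each new condition adds one line, so the number of clauses grows linearly rather than combinatorially.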
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
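With dplyr 1.0.0 or later, the same selection trick can be sketched with across() in place of summarise_at(). The .names = "{.fn}" argument is used so that the output columns are named after the list entries only; the variable selected is an illustrative name.

```r
library(dplyr)

data <- data.frame(val = c(1, 2, 2))
summarize_num_distinct <- FALSE
summarize_num <- TRUE

fcts <- list(n_unique = n_distinct, n = length)

# Keep only the functions whose flag is TRUE
selected <- fcts[c(summarize_num_distinct, summarize_num)]

summ <- data %>%
  summarise(across(all_of("val"), selected, .names = "{.fn}"))

summ
```

Because length() is a plain R function, this variant is meant for local data frames; it is not claimed to translate through dbplyr.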

Programmatically dropping a `group_by` field in dplyr

I'm writing functions that take in a data.frame and then do some operations. I need to add and subtract items from the group_by criteria in order to get where I want to go.
If I want to add a group_by criteria to a df, that's pretty easy:
library(tidyverse)
set.seed(42)
n <- 10
input <- data.frame(a = 'a',
b = 'b' ,
vals = 1
)
input %>%
group_by(a) ->
grouped
grouped
#> # A tibble: 1 x 3
#> # Groups: a [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## add a group:
grouped %>%
group_by(b, add=TRUE)
#> # A tibble: 1 x 3
#> # Groups: a, b [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## drop a group?
But how do I programmatically drop the grouping by b which I added, yet keep all other groupings the same?
Here's an approach that uses tidyeval so that bare column names can be used as the function arguments. I'm not sure if it makes sense to convert the bare column names to text (as I've done below) or if there's a more elegant way to work directly with the bare column names.
drop_groups = function(data, ...) {
  groups = map_chr(groups(data), rlang::quo_text)
  drop = map_chr(quos(...), rlang::quo_text)
  if(any(!drop %in% groups)) {
    warning(paste("Input data frame is not grouped by the following groups:",
                  paste(drop[!drop %in% groups], collapse=", ")))
  }
  data %>% group_by_at(setdiff(groups, drop))
}
d = mtcars %>% group_by(cyl, vs, am)
groups(d %>% drop_groups(vs, cyl))
[[1]]
am
groups(d %>% drop_groups(a, vs, b, c))
[[1]]
cyl
[[2]]
am
Warning message:
In drop_groups(., a, vs, b, c) :
Input data frame is not grouped by the following groups: a, b, c
UPDATE: The approach below works directly with quosured column names, without converting them to strings. I'm not sure which approach is "preferred" in the tidyeval paradigm, or whether there is yet another, more desirable method.
drop_groups2 = function(data, ...) {
groups = map(groups(data), quo)
drop = quos(...)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by(!!!setdiff(groups, drop))
}
Maybe something like this, which removes grouping variables starting from the end of the list:
grouped %>%
group_by(b, add=TRUE) -> grouped
grouped %>% group_by_at(.vars = group_vars(.)[-2])
or use head or tail or something on the output from group_vars for more control.
It would be interesting to have this sort of utility function available more generally:
peel_groups <- function(.data,n){
.data %>%
group_by_at(.vars = head(group_vars(.data),-n))
}
A more thought-out version would likely check that n is within bounds.
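A sketch of such a bounds-checked version follows, assuming dplyr 1.0.0 or later for across(). The explicit length arithmetic is a deliberate change: head(x, -n) returns an empty vector when n is 0, which would drop every group instead of none.

```r
library(dplyr)

peel_groups <- function(.data, n = 1) {
  vars <- group_vars(.data)
  # n must be a single number between 0 and the number of grouping variables
  stopifnot(is.numeric(n), length(n) == 1, n >= 0, n <= length(vars))
  keep <- head(vars, length(vars) - n)
  if (length(keep) == 0) {
    return(ungroup(.data))          # nothing left to group by
  }
  group_by(.data, across(all_of(keep)))
}

d <- mtcars %>% group_by(cyl, vs, am)
group_vars(peel_groups(d, 1))
```

Calling peel_groups(d, 4) stops with an error rather than silently returning an ungrouped frame.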
Function to remove groups by column name
drop_groups_at <- function(df, vars){
df %>%
group_by_at(setdiff(group_vars(.), vars))
}
input %>%
group_by(a, b) %>%
drop_groups_at('b') %>%
group_vars
# [1] "a"

Can't split dataframe into equal buckets preserving order without introducing Xn. prefix

I am trying to split an ordered data frame into 10 equal buckets. The following works but it introduces an X1., X2., X3. ... prefix to each bucket, which prevents me from iterating over the buckets to sum them.
num_dfs <- 10
buckets<-split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs)))
Produces a df[10] that looks like:
$`10`
predicted_duration actual_duration
177188 23.7402944 6
466561 23.7402663 12
479556 23.7401721 5
147585 23.7401666 48
Here's the crude code I am using to try to sum the groups.
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(as.data.frame(df[i],row.names=NULL)$X1.actual_duration) # X1., X2.,
print(paste(i,"=",p))
}
How do I remove the Xn. grouping prefix or programmatically reference it using the index i?
Here's a similar reproducible example:
df<-data.frame(actual_duration=sample(100))
num_dfs <- 10
df_grouped<-as.data.frame(split(df, rep(1:num_dfs, each = round(nrow(df) / num_dfs))))
for (i in c(1,2,3,4,5,6,7,8,9,10)){
p<-sum(df[i]$actual_duration) # does not work because postfix .1, .2.. was added by R
print(paste(p))
}
I'm not entirely clear on what your issue is, but if you are just trying to get the sum by group, couldn't you use the following?
library(tidyverse)
df <- data.frame(actual_duration=sample(100))
df %>%
arrange(actual_duration) %>%
mutate(group = rep(1:10, each = 10)) %>%
group_by(group) %>%
summarise(sums = sum(actual_duration))
Alternatively, if you want to keep the list format:
df %>%
arrange(actual_duration) %>%
mutate(group = factor(rep(1:10, each = 10))) %>%
split(., .$group) %>%
map(., function(x) sum(x$actual_duration))
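The Xn. prefix in the question comes from calling as.data.frame() on the list returned by split(), which flattens the buckets into columns. A base R sketch that keeps the list and preserves the original row order (no arrange()) might look like this; the names bucket and sums are illustrative.

```r
set.seed(1)
df <- data.frame(actual_duration = sample(100))
num_dfs <- 10

# Bucket index for each row, in the existing row order
bucket <- rep(seq_len(num_dfs), each = nrow(df) / num_dfs)

# Keep the result as a list of data frames; column names stay unchanged
buckets <- split(df, bucket)

# Sum each bucket; the result is a vector named "1" to "10"
sums <- sapply(buckets, function(b) sum(b$actual_duration))

# Individual buckets can still be reached by index
sum(buckets[[3]]$actual_duration)
```

This assumes nrow(df) is an exact multiple of num_dfs, as in the example.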

dplyr summarise and group_by for unique values

Here's a representative example:
DF <- as.data.frame(matrix(data = 0, nrow = 9, ncol = 3))
colnames(DF) <- c("code", "actual", "expected")
DF$code <- letters[rep(1:3, each = 3)]
DF$actual <- runif(9, 3,5)
DF$expected <- rep(1:3, each = 3)
The following crashes:
DF %>%
group_by(code) %>%
summarise(Exp = expected)
Error: expecting a single value
However, the following works:
DF %>%
group_by(code) %>%
summarise(Exp = unique(expected))
However, there is only one unique value per code. Why doesn't returning the value work? Why do I need to wrap it in unique()?
Thanks!
This is a common mistake. One way to debug it is to use paste() in the summarise call.
> DF %>%
group_by(code) %>%
summarise(Exp=paste(expected, collapse='-'))
Source: local data frame [3 x 2]
code Exp
(chr) (chr)
1 a 1-1-1
2 b 2-2-2
3 c 3-3-3
Do you see what is going on? You are trying to assign multiple values to a single group.
One solution is to use unique as you describe. Alternatively, if you know that all rows with the same code always have the same expected value, you can group by both columns directly:
> DF%>% group_by(code, expected) %>% summarise()
Source: local data frame [3 x 2]
Groups: code [?]
code expected
(chr) (int)
1 a 1
2 b 2
3 c 3
If the dataframe is big, group_by will be much faster than the solution based on unique()
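Two further ways to get one value per group can be sketched as follows, both assuming that expected really is constant within each code: take the first value in each group, or keep the distinct combinations. The object names by_first and by_distinct are illustrative.

```r
library(dplyr)

set.seed(1)
DF <- data.frame(code     = letters[rep(1:3, each = 3)],
                 actual   = runif(9, 3, 5),
                 expected = rep(1:3, each = 3))

# first() returns a single value per group, which summarise() expects
by_first <- DF %>%
  group_by(code) %>%
  summarise(Exp = first(expected))

# distinct() keeps one row per (code, expected) combination
by_distinct <- DF %>%
  distinct(code, expected)

by_first
```

If expected is not constant within a group, first() silently picks one value while distinct() returns several rows for that code, which makes the difference easy to spot.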

Group by multiple columns in dplyr, using string vector input

I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped (which can sometimes catch people by surprise further down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup to your pipeline after you summarize.
The support for this in dplyr is currently pretty weak; eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr is now supported through variants of the dplyr functions with names ending in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. The dplyr vignette on non-standard evaluation describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that @sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above still work, and the solutions with the .dots argument are intriguing.
But if you are looking for a solution that is easier to remember, the new across() comes in handy. It was published on 2020-04-03 by Hadley Wickham, can be used in mutate() and summarise(), and replaces the scoped variants such as _at or _all. Above all, it elegantly replaces the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across() is very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
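To make the comparison concrete, here is a sketch that runs the across() form next to the !!! syms() form on the same data and drops the leftover grouping with .groups = "drop" (an argument available from dplyr 1.0.0). The object names via_across and via_syms are illustrative.

```r
library(dplyr)
library(rlang)

set.seed(1)
data <- data.frame(
  asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace = TRUE),
  a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)
columns <- names(data)[-3]

# Tidy-select form: all_of() turns the character vector into a selection
via_across <- data %>%
  group_by(across(all_of(columns))) %>%
  summarize(Value = mean(value), .groups = "drop")

# Quote/unquote form: convert strings to symbols and splice them in
via_syms <- data %>%
  group_by(!!!syms(columns)) %>%
  summarize(Value = mean(value), .groups = "drop")

isTRUE(all.equal(via_across$Value, via_syms$Value))
```

Both pipelines should return the same groups in the same order.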
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains a bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc. that use string arguments. You can mix them with the normal dplyr functions. For example:
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass it the objects (well, you aren't, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functions of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
@Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that, unevaluated, is not the name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data,newcol=paste(my.a,my.b,sep="_")) %>% summarise(Value=mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
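The functions mutate_each(), funs(), group_by_() and summarize_each() used above have since been deprecated. A sketch of the same three steps with across(), assuming dplyr 1.0.0 or later and a small made-up data frame in place of wakefield::r_series(), could look like this:

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the r_series() output: two numeric columns
df_foo <- data.frame(X_1 = rnorm(100), X_2 = rnorm(100))

res <- df_foo %>%
  # 1. create indicator versions of the base variables
  mutate(across(everything(), ~ .x > 0, .names = "{.col}_Quantized")) %>%
  # 2. group by the indicator columns created midstream
  group_by(across(ends_with("_Quantized"))) %>%
  # 3. summarize the base variables
  summarise(across(all_of(c("X_1", "X_2")), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")

res
```

Here ends_with() plays the role that grep() on names(.) plays in the older code.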
A general example of using the .dots argument to pass a character vector to dplyr::group_by:
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots =names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.

Resources