Starting with dplyr version 0.7, the verbs ending with an underscore, such as summarize_ and group_by_, are deprecated, since we are supposed to use quosures instead.
See:
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I am trying to implement the following example using quo and !!
Working example:
df <- data.frame(x = c("a","a","a","b","b","b"), y=c(1,1,2,2,3,3), z = 1:6)
lFG <- df %>%
  group_by(x, y)
lFG %>% summarize(min(z))
However, in my case, the columns to group by and to summarize are specified as strings:
cols2group <- c("x","y")
col2summarize <- "z"
How can I get the same example as above working?
For this you can now use the _at versions of the verbs:
df %>%
group_by_at(cols2group) %>%
summarize_at(.vars = col2summarize, .funs = min)
Edit (2021-06-09):
Please see Ronak Shah's answer; using
summarise(across(all_of(col2summarize), min))
is now the preferred option.
From dplyr 1.0.0 you can use across():
library(dplyr)
cols2group <- c("x","y")
col2summarize <- "z"
df %>%
group_by(across(all_of(cols2group))) %>%
summarise(across(all_of(col2summarize), min)) %>%
ungroup()
# x y z
# <chr> <dbl> <int>
#1 a 1 1
#2 a 2 3
#3 b 2 4
#4 b 3 5
Another option is to use non-standard evaluation (NSE) and have R interpret the strings as quoted names of objects:
cols2group <- c("x","y")
col2summarize <- "z"
df %>%
group_by(!!!rlang::syms(cols2group)) %>%
summarize(min(!!rlang::sym(col2summarize)))
The rlang::sym() function takes a string and turns it into a symbol (rlang::syms() does the same for a character vector, giving a list of symbols), which is in turn unquoted by !! (or spliced with !!!) and evaluated in the context of df, where it refers to the relevant column. There are different ways of doing the same thing, as always, and this is the shorthand I tend to use!
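To see what the conversion produces, here is a minimal sketch (assuming rlang is loaded):
library(rlang)
sym("z")           # a single symbol: z
syms(c("x", "y"))  # a list of symbols, ready to be spliced with !!!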
See ?dplyr::across for the updated way to do this, since group_by_at and summarize_at are now superseded.
I am trying to understand the expected output of dplyr::group_by() in conjunction with dplyr::all_of(). My understanding is that using dplyr::all_of() should convert a character vector of variable names to bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is expected when using all_of and how might I alternatively pass a character vector to group_by to process as a vector of bare names?
library(dplyr)
# Create a 20-row data.frame with
# 2 variables each with 2 unique values.
df <- data.frame(var = rep(c("a", "b"), 10),
                 bar = rep(c(1, 2), 10))
# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())
# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())
# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())
# Output 4: Error, in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(in_var) %>%
summarize(n = n())
})
# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(all_of(in_var)) %>%
summarize(n = n())
})
We can use group_by_at
lapply(foo2, function(in_var) df %>%
group_by_at(all_of(in_var)) %>%
summarise(n = n()))
-output
#[[1]]
# A tibble: 2 x 2
# var n
#* <chr> <int>
#1 a 20
#2 b 20
#[[2]]
# A tibble: 2 x 2
# bar n
#* <dbl> <int>
#1 1 20
#2 2 20
As across replaces some of the functionality of group_by_at, we can use it instead with all_of:
lapply(foo2, function(in_var) df %>%
group_by(across(all_of(in_var))) %>%
summarise(n = n()))
Or convert to symbol and evaluate (!!)
lapply(foo2, function(in_var) df %>%
group_by(!! rlang::sym(in_var)) %>%
summarise(n = n()))
Or use map
library(purrr)
map(foo2, ~ df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(n = n()))
Or, instead of group_by, it can be count:
map(foo2, ~ df %>%
count(across(all_of(.x))))
To add to #akrun's answers showing multiple ways to achieve the desired output: my understanding of all_of() is that it is a helper for selecting variables stored as character vectors for dplyr functions, and it uses vctrs underneath. Compare it with any_of(), which is a less strict version of all_of() with some convenient use cases of its own.
Reading ?tidyselect::all_of() is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are being superseded by across, based on decisions by the devs at RStudio. See ?group_by_at() or the other *_if, *_at, *_all documentation. So I guess it really depends on what version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context on how solutions for passing characters into dplyr functions have changed over time, and there are probably more posts out there.
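As a small sketch of the difference, reusing df from the question (assuming dplyr >= 1.0.0; not_a_column is a made-up name):
library(dplyr)
vars_present <- c("var", "bar")
vars_mixed <- c("var", "not_a_column")
# all_of() is strict: an unknown name raises an error
df %>% group_by(across(all_of(vars_present))) %>% summarize(n = n())
# df %>% group_by(across(all_of(vars_mixed)))  # would error: column not found
# any_of() is lenient: unknown names are silently dropped
df %>% group_by(across(any_of(vars_mixed))) %>% summarize(n = n())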
I am working with a dataset for which I want to calculate, using dplyr, the rowSums of columns that start with one specified string and end with another (in my example: starts_with('c_') & ends_with('_f')).
My current code is as follows (and works fine):
df <- df %>% mutate(row.sum = rowSums(select(select(., starts_with('c_')), ends_with('_f'))))
However, as you can see, using the select() function within a select() function seems a bit messy. Is there a way to combine the starts_with and ends_with within just one select() function? Or do you have other ideas to make this line of code more elegant via using dplyr?
EDIT:
To make the example reproducible:
names <- c('c_first_f', 'c_second_o', 't_third_f', 'c_fourth_f')
values <- c(5, 3, 2, 5)
df <- as.data.frame(t(values))
colnames(df) <- names
> df
  c_first_f c_second_o t_third_f c_fourth_f
1         5          3         2          5
Thus, here I want to sum the first and fourth column, making the summed value 10.
We could use select_at with matches
library(dplyr)
df %>% select_at(vars(matches("^c_.*_f$"))) %>% mutate(row.sum = rowSums(.))
and with base R:
df$row.sum <- rowSums(df[grep("^c_.*_f$", names(df))])
We can use tidyverse approaches
library(dplyr)
library(purrr)
df %>%
select_at(vars(matches("^c_.*_f$"))) %>%
mutate(rowSum = reduce(., `+`))
Or, with newer versions of the tidyverse, select can take matches:
df %>%
select(matches("^c_.*_f$")) %>%
mutate(rowSum = reduce(., `+`))
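As an aside, with tidyselect >= 1.0.0 the helpers can also be combined with logical operators inside a single select(), which avoids both the nesting and the regex; a sketch:
df %>% mutate(row.sum = rowSums(select(., starts_with('c_') & ends_with('_f'))))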
I am trying to use a piped mutate statement with a custom function. I looked at this somewhat similar SO post but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data):
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select columns based on the value in the exclude column and, for example, calculates a sum over the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two tries do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case is that you are passing the whole tibble to it, so it's like calling FUN(df).
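As a more recent aside (assuming dplyr >= 1.0.0), rowwise() now pairs with plain mutate(), and cur_data() hands FUN the current row as a one-row data frame (in dplyr >= 1.1.0, pick(everything()) replaces the superseded cur_data()):
df %>%
  rowwise() %>%
  mutate(s = FUN(cur_data())) %>%
  ungroup()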
A much more efficient way of doing the same thing, though, is to make a matrix marking the columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
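To make the masking concrete, here is what the pieces evaluate to for the example df (rows 1 and 2 exclude B; row 3 excludes D):
include_mat
#       [,1] [,2]  [,3]
# [1,] FALSE TRUE  TRUE
# [2,] FALSE TRUE  TRUE
# [3,]  TRUE TRUE FALSE
df$s
# [1] 4 5 9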
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise allows each subsequent function to support arbitrarily complex operations that need to be applied to each row.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
Finally, unnest restores our tibble to the more standard structure.
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
df %>%
by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
exclude B C D blob s
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 B 1 3 1 fd 4
2 B 0 4 1 fs 5
3 D 0 9 0 sa 9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
Forewarning: while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but comes at a rather large speed cost. See the following microbenchmark test:
library(microbenchmark)
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see #konvas answer), and rather on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.
I'm trying to transfer my understanding of plyr into dplyr, but I can't figure out how to group by multiple columns.
# make data with weird column names that can't be hard coded
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
# plyr - works
ddply(data, columns, summarize, value=mean(value))
# dplyr - raises error
data %.%
group_by(columns) %.%
summarise(Value = mean(value))
#> Error in eval(expr, envir, enclos) : index out of bounds
What am I missing to translate the plyr example into a dplyr-esque syntax?
Edit 2017: Dplyr has been updated, so a simpler solution is available. See the currently selected answer.
Just so as to write the code in full, here's an update on Hadley's answer with the new syntax:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# Columns you want to group by
grp_cols <- names(df)[-3]
# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)
# Perform frequency counts
df %>%
group_by_(.dots=dots) %>%
summarise(n = n())
output:
Source: local data frame [9 x 3]
Groups: asihckhdoydk
asihckhdoydk a30mvxigxkgh n
1 A A 10
2 A B 10
3 A C 13
4 B A 14
5 B B 10
6 B C 12
7 C A 9
8 C B 12
9 C C 10
Since this question was posted, dplyr added scoped versions of group_by (documentation here). This lets you use the same functions you would use with select, like so:
data = data.frame(
asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
# get the columns we want to average within
columns = names(data)[-3]
library(dplyr)
df1 <- data %>%
group_by_at(vars(one_of(columns))) %>%
summarize(Value = mean(value))
#compare plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE
## 27
The output from your example question is as expected (see comparison to plyr above and output below):
# A tibble: 9 x 3
# Groups: asihckhdoydkhxiydfgfTgdsx [?]
asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja Value
<fctr> <fctr> <dbl>
1 A A 0.04095002
2 A B 0.24943935
3 A C -0.25783892
4 B A 0.15161805
5 B B 0.27189974
6 B C 0.20858897
7 C A 0.19502221
8 C B 0.56837548
9 C C -0.22682998
Note that since dplyr::summarize only strips off one layer of grouping at a time, you've still got some grouping going on in the resultant tibble (which can sometimes catch people by surprise later down the line). If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
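For example, a defensive version of the pipeline above would be (a sketch):
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value)) %>%
  ungroup()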
The support for this in dplyr is currently pretty weak; eventually I think the syntax will be something like:
df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))
But that probably won't be there for a while (because I need to think through all the consequences).
In the meantime, you can use regroup(), which takes a list of symbols:
library(dplyr)
df <- data.frame(
asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
df %.%
regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
summarise(n = n())
If you have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():
vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)
df %.% regroup(vars2) %.% summarise(n = n())
String specification of columns in dplyr is now supported through variants of the dplyr functions with names ending in an underscore. For example, corresponding to the group_by function there is a group_by_ function that may take string arguments. This vignette describes the syntax of these functions in detail.
The following snippet cleanly solves the problem that #sharoz originally posed (note the need to write out the .dots argument):
# Given data and columns from the OP
data %>%
group_by_(.dots = columns) %>%
summarise(Value = mean(value))
(Note that dplyr now uses the %>% operator, and %.% is deprecated).
Update with across() from dplyr 1.0.0
All the answers above still work, and the solutions with the .dots argument are intriguing.
BUT if you are looking for a solution that is easier to remember, the new across() comes in handy. It was published on 2020-04-03 by Hadley Wickham and can be used in mutate() and summarise(), replacing the scoped variants like _at or _all. Above all, it very elegantly replaces the cumbersome non-standard evaluation (NSE) with quoting/unquoting such as !!! rlang::syms().
So the solution with across looks very readable:
data %>%
group_by(across(all_of(columns))) %>%
summarize(Value = mean(value))
Until dplyr has full support for string arguments, perhaps this gist is useful:
https://gist.github.com/skranz/9681509
It contains a bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc. that take string arguments. You can mix them with the normal dplyr functions. For example:
cols = c("cyl","gear")
mtcars %.%
s_group_by(cols) %.%
s_summarise("avdisp=mean(disp), max(disp)") %.%
arrange(avdisp)
It works if you pass in the column names as bare objects (well, you aren't doing that here, but...) rather than as a character vector:
df %.%
group_by(asdfgfTgdsx, asdfk30v0ja) %.%
summarise(Value = mean(value))
> df %.%
+ group_by(asdfgfTgdsx, asdfk30v0ja) %.%
+ summarise(Value = mean(value))
Source: local data frame [9 x 3]
Groups: asdfgfTgdsx
asdfgfTgdsx asdfk30v0ja Value
1 A C 0.046538002
2 C B -0.286359899
3 B A -0.305159419
4 C A -0.004741504
5 B B 0.520126476
6 C C 0.086805492
7 B C -0.052613078
8 A A 0.368410146
9 A B 0.088462212
where df was your data.
?group_by says:
...: variables to group by. All tbls accept variable names, some
will also accept functions of variables. Duplicated groups
will be silently dropped.
which I interpret to mean not the character versions of the names, but how you would refer to them in foo$bar; bar is not quoted here. Or how you'd refer to variables in a formula: foo ~ bar.
#Arun also mentions that you can do:
df %.%
group_by("asdfgfTgdsx", "asdfk30v0ja") %.%
summarise(Value = mean(value))
But you can't pass in something that, unevaluated, is not the name of a variable in the data object.
I presume this is due to the internal methods Hadley is using to look up the things you pass in via the ... argument.
data = data.frame(
my.a = sample(LETTERS[1:3], 100, replace=TRUE),
my.b = sample(LETTERS[1:3], 100, replace=TRUE),
value = rnorm(100)
)
group_by(data, newcol = paste(my.a, my.b, sep = "_")) %>% summarise(Value = mean(value))
One (tiny) case that is missing from the answers here, that I wanted to make explicit, is when the variables to group by are generated dynamically midstream in a pipeline:
library(wakefield)
df_foo = r_series(rnorm, 10, 1000)
df_foo %>%
# 1. create quantized versions of base variables
mutate_each(
funs(Quantized = . > 0)
) %>%
# 2. group_by the indicator variables
group_by_(
.dots = grep("Quantized", names(.), value = TRUE)
) %>%
# 3. summarize the base variables
summarize_each(
funs(sum(., na.rm = TRUE)), contains("X_")
)
This basically shows how to use grep in conjunction with group_by_(.dots = ...) to achieve this.
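For reference, a sketch of the same dynamic pipeline in current dplyr (assuming version >= 1.0.0), with no underscore verbs:
df_foo %>%
  # 1. create quantized versions of the base variables
  mutate(across(everything(), ~ . > 0, .names = "{.col}_Quantized")) %>%
  # 2. group_by the indicator variables
  group_by(across(ends_with("_Quantized"))) %>%
  # 3. summarize the base variables (across() skips grouping columns)
  summarize(across(starts_with("X_"), ~ sum(., na.rm = TRUE)))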
A general example of using the .dots argument as character-vector input to the dplyr::group_by function:
iris %>%
group_by(.dots ="Species") %>%
summarise(meanpetallength = mean(Petal.Length))
Or without a hard coded name for the grouping variable (as asked by the OP):
iris %>%
group_by(.dots = names(iris)[5]) %>%
summarise_at("Petal.Length", mean)
With the example of the OP:
data %>%
group_by(.dots = names(data)[-3]) %>%
summarise_at("value", mean)
See also the dplyr vignette on programming which explains pronouns, quasiquotation, quosures, and tidyeval.
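For reference, the quasiquotation route from that vignette looks like this for the OP's data (a sketch assuming rlang is available):
library(dplyr)
columns <- names(data)[-3]
data %>%
  group_by(!!!rlang::syms(columns)) %>%
  summarise(Value = mean(value))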
I have a data.frame:
dat <- data.frame(fac1 = c(1, 2),
fac2 = c(4, 5),
fac3 = c(7, 8),
dbl1 = c('1', '2'),
dbl2 = c('4', '5'),
dbl3 = c('6', '7')
)
To change data types I can use something like
l1 <- c("fac1", "fac2", "fac3")
l2 <- c("dbl1", "dbl2", "dbl3")
dat[, l1] <- lapply(dat[, l1], factor)
dat[, l2] <- lapply(dat[, l2], as.numeric)
with dplyr
dat <- dat %>% mutate(
fac1 = factor(fac1), fac2 = factor(fac2), fac3 = factor(fac3),
dbl1 = as.numeric(dbl1), dbl2 = as.numeric(dbl2), dbl3 = as.numeric(dbl3)
)
Is there a more elegant (shorter) way in dplyr?
Thanks,
Christof
Edit (as of 2021-03)
As also pointed out in Eric's answer, mutate_[at|if|all] has been superseded by a combination of mutate() and across(). For reference, I will add the respective counterparts to the examples in the original answer (see below):
# convert all factor to character
dat %>% mutate(across(where(is.factor), as.character))
# apply function (change encoding) to all character columns
dat %>% mutate(across(where(is.character),
function(x){iconv(x, to = "ASCII//TRANSLIT")}))
# substitute all NA in numeric columns
dat %>% mutate(across(where(is.numeric), function(x) tidyr::replace_na(x, 0)))
Original answer
Since Nick's answer is deprecated by now and Rafael's comment is really useful, I want to add this as an answer. If you want to change all factor columns to character, use mutate_if:
dat %>% mutate_if(is.factor, as.character)
Other functions are allowed as well. I, for instance, used iconv to change the encoding of all character columns:
dat %>% mutate_if(is.character, function(x){iconv(x, to = "ASCII//TRANSLIT")})
or to substitute all NA by 0 in numeric columns:
dat %>% mutate_if(is.numeric, function(x){ifelse(is.na(x), 0, x)})
You can use the standard evaluation version of mutate_each (which is mutate_each_) to change the column classes:
dat %>% mutate_each_(funs(factor), l1) %>% mutate_each_(funs(as.numeric), l2)
EDIT - The syntax of this answer has been deprecated; loki's updated answer is more appropriate.
ORIGINAL-
From the bottom of the ?mutate_each help page (at least in dplyr 0.5), it looks like that function, as used in #docendo discimus's answer, will be deprecated and replaced with the more flexible alternatives mutate_if, mutate_all, and mutate_at. The one most similar to what #hadley mentions in his comment is probably mutate_at. Note that the order of the arguments is reversed compared to mutate_each, and that vars() uses select()-like semantics, which I interpret to mean the ?select_helpers functions.
dat %>% mutate_at(vars(starts_with("fac")),funs(factor)) %>%
mutate_at(vars(starts_with("dbl")),funs(as.numeric))
But mutate_at can take column numbers instead of a vars() argument, and after reading through this page and looking at the alternatives, I ended up using mutate_at with grep to capture many different kinds of column names at once (unless you always have such obvious column names!).
dat %>% mutate_at(grep("^(fac|fctr|fckr)",colnames(.)),funs(factor)) %>%
mutate_at(grep("^(dbl|num|qty)",colnames(.)),funs(as.numeric))
I was pretty excited about figuring out mutate_at + grep, because now one line can work on lots of columns.
EDIT - now I see matches() among the select_helpers, which handles regex, so now I like this:
dat %>% mutate_at(vars(matches("fac|fctr|fckr")),funs(factor)) %>%
mutate_at(vars(matches("dbl|num|qty")),funs(as.numeric))
Another generally-related comment - if all your date columns have matchable names and consistent formats, this is powerful. In my case, this turns all my YYYYMMDD columns, which were read as numbers, into dates.
mutate_at(vars(matches("_DT$")),funs(as.Date(as.character(.),format="%Y%m%d")))
dplyr's across() function has superseded the _if, _at, and _all variants. See vignette("colwise").
dat %>%
mutate(across(all_of(l1), as.factor),
across(all_of(l2), as.numeric))
It's a one-liner with mutate_at:
dat %>% mutate_at(l1, factor) %>% mutate_at(l2, as.numeric)
A more general way of achieving column type transformation is as follows:
For example, if you want to transform all your factor columns to character columns, this can be done using one pipe:
df %>% mutate_each_( funs(as.character(.)), names( .[,sapply(., is.factor)] ))
Or maybe even simpler, with convert from hablar:
library(hablar)
dat %>%
convert(fct(fac1, fac2, fac3),
num(dbl1, dbl2, dbl3))
or combined with tidyselect:
dat %>%
convert(fct(contains("fac")),
num(contains("dbl")))
For future readers: if you are OK with dplyr guessing the column types, you can convert the column types of an entire data frame as if you were originally reading it in with readr and col_guess(), with:
library(tidyverse)
df %>% type_convert()
Try this:
df[, 1:11] <- sapply(df[, 1:11], as.character)
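A small caveat: sapply() returns a character matrix here, which happens to assign back cleanly, but lapply() keeps the result as a list and is the more idiomatic choice for this pattern:
df[, 1:11] <- lapply(df[, 1:11], as.character)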