So, if one wishes to apply an operation row by row in dplyr, one can use the rowwise function, for example: Applying a function to every row of a table using dplyr?
Is there an unrowwise function which you can use to stop doing operations row by row? Currently, it seems adding a group_by after the rowwise removes row operations, e.g.
data.frame(a=1:4) %>% rowwise() %>% group_by(a)
# ...
# Warning message:
# Grouping rowwise data frame strips rowwise nature
Does this mean one should use group_by(1) if you wish to explicitly remove rowwise?
As found in the comments and the other answer, the correct way of doing this is to use ungroup().
The operation rowwise(df) sets one of the classes of df to be rowwise_df. We can see the methods on this class by examining the code here, which gives the following ungroup method:
#' @export
ungroup.rowwise_df <- function(x) {
class(x) <- c( "tbl_df", "data.frame")
x
}
So we see that ungroup() is not strictly removing a grouped structure; it just removes the rowwise_df class added by the rowwise function.
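A minimal sketch of the effect, assuming dplyr 1.0 or later is installed (the column b and the object names rw and ug are made up for illustration): after rowwise(), sum() is computed per row; after ungroup(), it is computed over the whole columns again.

```r
library(dplyr)

# Hypothetical example data; column b is added only for illustration
df <- data.frame(a = 1:4, b = 5:8)

rw <- df %>% rowwise()   # operations now run row by row
ug <- rw %>% ungroup()   # back to ordinary whole-column operations

# Per-row sums of a and b
rw_sums <- rw %>% mutate(s = sum(a, b)) %>% pull(s)

# One grand total, recycled to every row
ug_sums <- ug %>% mutate(s = sum(a, b)) %>% pull(s)

inherits(rw, "rowwise_df")  # TRUE
inherits(ug, "rowwise_df")  # FALSE
```

Here rw_sums holds one value per row, while ug_sums repeats the total of both columns.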
Just use ungroup()
The following produces a warning:
data.frame(a=1:4) %>% rowwise() %>%
group_by(a)
#Warning message:
#Grouping rowwise data frame strips rowwise nature
This does not produce the warning:
data.frame(a=1:4) %>% rowwise() %>%
ungroup() %>%
group_by(a)
You can use as.data.frame(), like below
> data.frame(a=1:4) %>% rowwise() %>% group_by(a)
# A tibble: 4 x 1
# Groups: a [4]
a
* <int>
1 1
2 2
3 3
4 4
Warning message:
Grouping rowwise data frame strips rowwise nature
> data.frame(a=1:4) %>% rowwise() %>% as.data.frame() %>% group_by(a)
# A tibble: 4 x 1
# Groups: a [4]
a
* <int>
1 1
2 2
3 3
4 4
Folks, I have a couple of questions about how tidy evaluation works with dplyr.
The following code produces a tally of cars by cylinder using the mtcars dataset:
mtcars %>%
select(cyl) %>%
group_by(cyl) %>%
tally()
With output as expected:
# A tibble: 3 x 2
cyl n
* <dbl> <int>
1 4 11
2 6 7
3 8 14
If I want to pass the grouping factor as variable, then this fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(var) %>%
tally()
with error message:
Error: Must group by variables found in `.data`.
* Column `var` is not found.
This also fails:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by({{ var}}) %>%
tally()
Producing output:
# A tibble: 1 x 2
`"cyl"` n
* <chr> <int>
1 cyl 32
This code, however, works as expected:
var <- "cyl"
mtcars %>%
select(var) %>%
group_by(.data[[ var]]) %>%
tally()
Producing the expected output:
# A tibble: 3 x 2
cyl n
* <dbl> <int>
1 4 11
2 6 7
3 8 14
I have two questions about this and wondering if someone can help!
Why does select(var) work fine without using any of the dplyr tidy evaluation extensions, such as select({{ var }}) or select(.data[[ var ]])?
What is it about group_by() that makes group_by({{ var }}) wrong but group_by(.data[[ var ]]) right?
Thanks so much!
Matt.
It depends on how those functions work and accept input.
If you look at the documentation at ?select the relevant part for this question is -
These helpers select variables from a character vector:
all_of(): Matches variable names in a character vector. All names must be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names that don't exist.
So you can use all_of() and any_of() in select() with character vectors. A bare character variable is still accepted, which is why you get this message when you run mtcars %>% select(var):
Note: Using an external vector in selections is ambiguous.
ℹ Use all_of(var) instead of var to silence this message.
and no warning with mtcars %>% select(all_of(var)).
As far as group_by() is concerned, it uses data masking rather than tidy selection, so there is no such provision and you need to use mtcars %>% group_by(.data[[var]]).
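To make the difference concrete, here is a small sketch (assuming dplyr 1.0 or later). count_by() is a hypothetical helper, not part of dplyr. It shows that {{ }} is meant for forwarding a bare column name passed as a function argument, while .data[[ ]] and across(all_of()) are the options for a column name stored as a string.

```r
library(dplyr)

var <- "cyl"

# Column name stored as a string: look it up in the data
by_data <- mtcars %>% group_by(.data[[var]]) %>% tally()

# Same grouping via tidy selection wrapped in across()
by_across <- mtcars %>% group_by(across(all_of(var))) %>% tally()

# Hypothetical helper: {{ }} forwards a bare column name given by the caller
count_by <- function(data, col) {
  data %>% group_by({{ col }}) %>% tally()
}
by_embrace <- count_by(mtcars, cyl)
```

All three calls should give the same three-row tally by cylinder.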
I am trying to understand the expected output of dplyr::group_by() in conjunction with the use of dplyr::all_of(). My understanding is that dplyr::all_of() should convert character vectors containing variable names to bare names so that group_by() can use them, but this doesn't appear to happen.
Below, I generate some fake data, pass different objects to group_by() with(out) all_of() and calculate the number of observations in each group. In the example, passing a single bare column name without dplyr::all_of() produces the correct output: one row per unique value of the column. However, passing character vectors or using dplyr::all_of() produces incorrect output: one row regardless of the number of values in a column.
What is expected when using all_of and how might I alternatively pass a character vector to group_by to process as a vector of bare names?
library(dplyr)
# Create a 40-row data.frame (var is recycled to match bar)
# with 2 variables, each with 2 unique values.
df <- data.frame(var = rep(c("a", "b"), 10),
bar = rep(c(1, 2), 20))
# Output 1: 2x2 tibble - GOOD
df %>% group_by(var) %>% summarize(n = n())
# Output 2: 1x2 tibble - BAD
foo <- "var"
df %>% group_by(all_of(foo)) %>% summarize(n = n())
# Output 3: 1x2 tibble
df %>% group_by("var") %>% summarize(n = n())
# Output 4: Error in_var not found - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(in_var) %>%
summarize(n = n())
})
# Output 5: list of length 2 where
# each element is a 1x2 tibble - BAD
foo2 <- list("var", "bar")
lapply(foo2, function(in_var) {
df %>%
group_by(all_of(in_var)) %>%
summarize(n = n())
})
We can use group_by_at
lapply(foo2, function(in_var) df %>%
group_by_at(all_of(in_var)) %>%
summarise(n = n()))
-output
#[[1]]
# A tibble: 2 x 2
# var n
#* <chr> <int>
#1 a 20
#2 b 20
#[[2]]
# A tibble: 2 x 2
# bar n
#* <dbl> <int>
#1 1 20
#2 2 20
As across replaces some of the functionality of group_by_at, we can use it instead with all_of:
lapply(foo2, function(in_var) df %>%
group_by(across(all_of(in_var))) %>%
summarise(n = n()))
Or convert to symbol and evaluate (!!)
lapply(foo2, function(in_var) df %>%
group_by(!! rlang::sym(in_var)) %>%
summarise(n = n()))
Or use map
library(purrr)
map(foo2, ~ df %>%
group_by(!! rlang::sym(.x)) %>%
summarise(n = n()))
Or instead of group_by, it can be count
map(foo2, ~ df %>%
count(across(all_of(.x))))
To add to @akrun's answer showing multiple ways to achieve the desired output: my understanding is that all_of() is a selection helper for variables stored as character vectors in dplyr functions, and it uses vctrs underneath. any_of() is a less strict version of all_of() that is convenient in some use cases.
Reading ?tidyselect::all_of is helpful. This page is also helpful for keeping up with changes in dplyr and tidy evaluation: https://dplyr.tidyverse.org/articles/programming.html.
The scoped dplyr verbs are being superseded by across(), based on decisions by the devs at RStudio. See ?group_by_at or the documentation for the other *_if, *_at, *_all variants. So I guess it really depends on what version of dplyr you are using in your workflow and what works best for you.
This SO post also gives context of changes in solutions over time with passing characters into dplyr functions, and there's probably more posts out there.
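As a rough illustration of the strict versus lenient behaviour (a sketch assuming dplyr 1.0 or later; the data frame d2 and the column name "missing" are made up):

```r
library(dplyr)

# Hypothetical data: every (var, bar) combination appears once
d2 <- data.frame(var = rep(c("a", "b"), each = 2),
                 bar = c(1, 2, 1, 2))

# all_of(): every name must exist; counts by both columns
both <- d2 %>% count(across(all_of(c("var", "bar"))))

# any_of(): names that do not exist are silently dropped,
# so this counts by var only
lenient <- d2 %>% count(across(any_of(c("var", "missing"))))

# all_of() with a name that does not exist raises an error instead
strict_fails <- inherits(
  try(d2 %>% count(across(all_of(c("var", "missing")))), silent = TRUE),
  "try-error"
)
```

Which helper fits depends on whether a missing column should stop the pipeline or be ignored.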
dplyr provides a function top_n(); however, in case of equal values it returns all tied rows (more than one). I would like to return exactly one row per group. See the example below.
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
df %>% group_by(id1) %>% top_n(n=1)
You can use a combination of arrange and slice
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice(1)
Use desc() within arrange() if you want the largest element; otherwise leave it out.
Apparently slice_head() is also the newer function for what you are looking for. It takes no column argument, so sort first and keep one row per group:
df %>%
group_by(id1) %>%
arrange(desc(id2)) %>%
slice_head(n = 1)
Use slice_max() with the argument with_ties = FALSE:
library(dplyr)
df %>%
group_by(id1) %>%
slice_max(id2, with_ties = FALSE)
# A tibble: 3 x 2
# Groups: id1 [3]
id1 id2
<chr> <dbl>
1 A 8
2 B 7
3 C 5
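To see what with_ties changes, here is a short sketch using the data from the question (assuming dplyr 1.0 or later; the object names with_ties and one_each are made up):

```r
library(dplyr)

df <- data.frame(id1 = rep(c("A", "B", "C"), each = 3),
                 id2 = c(8, 8, 4, 7, 7, 4, 5, 5, 5))

# Default (with_ties = TRUE): every row tied for the maximum is kept
with_ties <- df %>% group_by(id1) %>% slice_max(id2, n = 1)

# with_ties = FALSE: exactly one row per group
one_each <- df %>% group_by(id1) %>% slice_max(id2, n = 1, with_ties = FALSE)

nrow(with_ties)  # more rows than groups, because of the ties
nrow(one_each)   # one row per group
```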
If you don't want to remember so many {dplyr} function names that are prone to be changed anyway, I can recommend the {data.table} package for such tasks. Plus, it's faster.
require(data.table)
df <- data.frame(id1=c(rep("A",3),rep("B",3),rep("C",3)),id2=c(8,8,4,7,7,4,5,5,5))
setDT(df)
# Sort descending first so that head() picks the largest id2 in each group
df[order(-id2),
.(id2_head = head(id2, 1)),
by = id1 ]
I have this input:
t <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
And want to have the rowwise nth-lowest element of the dataframe ordered by the rowwise values, so that the output is something like this (example for nth_element = 2):
[1] 2 3 5 4
I tried a function like this:
apply(t, 1, nth, n=1, order_by = .)
But this does not work. Two questions:
What should I type in the order_by argument to make this function work?
Which is the best way to summarise rows with an own summary function if I don't want to mention the column names in the rowwise summary function?
Sidenote:
I don't want to mention the column names specifically; I want the function to use all columns in the dataset.
I tried the rownth function from the Rfast package but it only provides one result. Does anybody know what I am doing wrong?
We can use apply and sort to do this.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
nth_lowest <- 2
apply(d, 1, FUN = function(x) sort(x)[nth_lowest])
# [1] 2 3 5 4
Note that I am calling the data d instead of t. t is already the name of a base R function (matrix transpose), so reusing it for data is confusing.
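The same idea can be wrapped in a small function so that no column names are mentioned. This is only a sketch: row_nth_lowest() is a hypothetical helper name, and it assumes all columns are numeric.

```r
# Hypothetical helper: n-th lowest value in each row, using all columns
row_nth_lowest <- function(data, n) {
  m <- as.matrix(data)              # assumes all columns are numeric
  stopifnot(n >= 1, n <= ncol(m))   # n must be a valid position in a row
  unname(apply(m, 1, function(row) sort(row)[n]))
}

d <- data.frame(x = c(1, 2, 8, 4), y = c(2, 3, 4, 5), k = c(3, 4, 5, 1))

second_lowest <- row_nth_lowest(d, 2)
lowest <- row_nth_lowest(d, 1)
```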
Not as elegant as @bouncyball's answer, but using dplyr (and tidyr), one possibility is to do:
library(dplyr)
library(tidyr)
t %>% mutate(Row = row_number()) %>%
pivot_longer(-Row, names_to = "Col", values_to = "Val") %>%
group_by(Row) %>%
arrange(Val) %>%
slice(2) %>%
select(Val)
Adding missing grouping variables: `Row`
# A tibble: 4 x 2
# Groups: Row [4]
Row Val
<int> <dbl>
1 1 2
2 2 3
3 3 5
4 4 4
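Another possible route that stays in dplyr without reshaping is rowwise() with c_across(). This is a sketch assuming dplyr 1.0 or later; the column name nth is made up.

```r
library(dplyr)

d <- data.frame(x = c(1, 2, 8, 4), y = c(2, 3, 4, 5), k = c(3, 4, 5, 1))

res <- d %>%
  rowwise() %>%
  # c_across(everything()) collects all columns of the current row
  mutate(nth = sort(c_across(everything()))[2]) %>%
  ungroup()   # drop the rowwise behaviour again

res$nth
```

Calling ungroup() at the end keeps later verbs from running row by row.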
Using Rfast, you could reduce the run time for big inputs, but it works on matrices only.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
d<- Rfast::data.frame.to_matrix(d)
# rownth() expects one position per row, so repeat over nrow(d)
nth_lowests <- rep(2, nrow(d))
Rfast::rownth(d,nth_lowests)
# [1] 2 3 5 4
You could also use the parallel version of Rfast::rownth