base::ifelse() within dplyr::arrange() for conditional arrangement of grouped rows

I'm trying to order the rows of a data.frame conditional upon the value of another column.
Here's an example below:
library(magrittr)
library(dplyr)
df <- data.frame(grp = c(1,1,1,2,2,2),
ori = c("f","f","f","r","r","r"),
ite = c("A","B","C","A","B","C"))
df
# # grp ori ite
# 1 1 f A
# 2 1 f B
# 3 1 f C
# 4 2 r A
# 5 2 r B
# 6 2 r C
df %>%
group_by(grp) %>%
arrange(ifelse(ori == "f", ite, desc(ite)), .by_group = TRUE) %>%
ungroup()
# # A tibble: 6 × 3
# # Groups: grp [2]
# grp ori ite
# <dbl> <chr> <chr>
# 1 1 f A
# 2 1 f B
# 3 1 f C
# 4 2 r A
# 5 2 r B
# 6 2 r C
The expected output is:
# # grp ori ite
# 1 1 f A
# 2 1 f B
# 3 1 f C
# 4 2 r C
# 5 2 r B
# 6 2 r A
I have a general idea of why it doesn't work: arrange() cannot look at things on a per-row basis, which is what the ifelse() is asking it to do.
Is there a better way of accomplishing this?

The idea of using ifelse(ori == "f", ite, desc(ite)) is basically good. Unfortunately, desc(ite) returns a negative numeric vector, whereas ite is a character vector, so ifelse() coerces everything to character:
ifelse(df$ori == "f", df$ite, dplyr::desc(df$ite))
#> [1] "A" "B" "C" "-1" "-3" "-5"
So that both branches of ifelse() return the same type, we can write a function asc() that is the counterpart of desc(): it returns the numeric sort key from xtfrm() without negating it:
asc <- function(x) {
xtfrm(x)
}
Now we can use both inside ifelse():
library(dplyr)
df <- data.frame(grp = c(1,1,1,2,2,2),
ori = c("f","f","f","r","r","r"),
ite = c("A","B","C","A","B","C"))
df %>%
arrange(ori, ifelse(ori == "r", desc(ite), asc(ite)))
#> grp ori ite
#> 1 1 f A
#> 2 1 f B
#> 3 1 f C
#> 4 2 r C
#> 5 2 r B
#> 6 2 r A
Created on 2022-08-21 by the reprex package (v2.0.1)
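The same idea can also be written without a helper function, by multiplying the numeric sort key by a sign. This is only a sketch and not part of the answer above; it assumes that sorting on xtfrm(ite) is acceptable for your data:

```r
library(dplyr)

df <- data.frame(grp = c(1, 1, 1, 2, 2, 2),
                 ori = c("f", "f", "f", "r", "r", "r"),
                 ite = c("A", "B", "C", "A", "B", "C"))

# xtfrm() turns ite into a numeric sort key. Multiplying that key by -1
# for the "r" rows reverses their order, and both branches stay numeric.
sorted <- df %>%
  arrange(grp, if_else(ori == "f", 1, -1) * xtfrm(ite))

sorted
```

Because the key is numeric in every row, nothing is coerced to character, which is what broke the original attempt.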

One possible way is to split the data on ori, sort each piece in its own direction, and combine the results as follows:
library(dplyr)
library(purrr)
df %>%
split(.$ori) %>%
map(function(x) {
if ('f' %in% x$ori) {
x %>%
group_by(grp) %>%
arrange(ite, .by_group = TRUE)
}
else {
x %>%
group_by(grp) %>%
arrange(desc(ite), .by_group = TRUE)
}
}) %>%
bind_rows()
# A tibble: 6 x 3
## Groups: grp [2]
# grp ori ite
# <dbl> <chr> <chr>
#1 1 f A
#2 1 f B
#3 1 f C
#4 2 r C
#5 2 r B
#6 2 r A

One option to achieve your desired result would be to make use of split + arrange + bind_rows like so:
library(dplyr)
library(purrr)
df %>%
split(.$ori) %>%
purrr::imap(
~ if (.y == "f") arrange(.x, grp, ite) else arrange(.x, grp, desc(ite))
) %>%
dplyr::bind_rows()
#> grp ori ite
#> 1 1 f A
#> 2 1 f B
#> 3 1 f C
#> 4 2 r C
#> 5 2 r B
#> 6 2 r A
And thanks to the suggestion by @MartinGal we could save the bind_rows step by making use of purrr::imap_dfr:
df %>%
split(.$ori) %>%
purrr::imap_dfr(
~ if (.y == "f") arrange(.x, grp, ite) else arrange(.x, grp, desc(ite))
)

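Here's another solution. It relies on arrange() always placing NA last (even inside desc()): the first key sorts the "f" rows in ascending order and pushes the other rows to the end, and the second key then sorts those remaining rows in descending order.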
df %>%
arrange(
ifelse(ori == 'f', ite, NA),
desc(ifelse(ori != 'f', ite, NA))
)

Related

Remove sequence of rows conditional on value in single cell in group-first position

In this type of data:
df <- data.frame(
Sequ = c(1,1,2,2,2,3,3,3),
G = c("A", "B", "*", "B", "A", "A", "*", "B")
)
I need to filter out rows grouped by Sequ iff the Sequ-first value is *. I can do it like so, but was wondering if there's a more direct and more elegant way in dplyr:
library(dplyr)
df %>%
group_by(Sequ) %>%
mutate(check = ifelse(first(G)=="*", 1, 0)) %>%
filter(check != 1)
# A tibble: 5 × 3
# Groups: Sequ [2]
Sequ G check
<dbl> <chr> <dbl>
1 1 A 0
2 1 B 0
3 3 A 0
4 3 * 0
5 3 B 0
We can try the following base R code using subset + ave
subset(
df,
!ave(G == "*", Sequ, FUN = function(x) head(x, 1))
)
which gives
Sequ G
1 1 A
2 1 B
6 3 A
7 3 *
8 3 B
Here is a direct dplyr way:
library(dplyr)
df %>%
group_by(Sequ) %>%
filter(!first(G == "*"))
Sequ G
<dbl> <chr>
1 1 A
2 1 B
3 3 A
4 3 *
5 3 B
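If your dplyr is version 1.1.0 or later, the grouping can be given per call through the .by argument, which avoids leaving the result grouped. A minimal sketch under that version assumption:

```r
library(dplyr)

df <- data.frame(
  Sequ = c(1, 1, 2, 2, 2, 3, 3, 3),
  G = c("A", "B", "*", "B", "A", "A", "*", "B")
)

# Keep only the Sequ groups whose first G value is not "*".
# .by groups for this one call; the result is an ungrouped data frame.
kept <- df %>%
  filter(first(G) != "*", .by = Sequ)

kept
```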
Another base R option with duplicated
subset(df, !Sequ %in% Sequ[G == "*" & !duplicated(Sequ)])
Sequ G
1 1 A
2 1 B
6 3 A
7 3 *
8 3 B

How to get all combinations of 2 from a grouped column in a data frame

I could write a loop to do this, but I was wondering how this might be done in R with dplyr. I have a data frame with two columns. Column 1 is the group, Column 2 is the value. I would like a data frame that has every combination of two values from each group in two separate columns. For example:
input = data.frame(col1 = c(1,1,1,2,2), col2 = c("A","B","C","E","F"))
input
#> col1 col2
#> 1 1 A
#> 2 1 B
#> 3 1 C
#> 4 2 E
#> 5 2 F
and have it return
output = data.frame(col1 = c(1,1,1,2), col2 = c("A","B","C","E"), col3 = c("B","C","A","F"))
output
#> col1 col2 col3
#> 1 1 A B
#> 2 1 B C
#> 3 1 C A
#> 4 2 E F
I'd like to be able to include it within dplyr syntax:
input %>%
group_by(col1) %>%
???
I tried writing my own function that produces a data frame of combinations like what I need from a vector and sent it into the group_map function, but didn't have success:
combos = function(x, ...) {
x = t(combn(x, 2))
return(as.data.frame(x))
}
input %>%
group_by(col1) %>%
group_map(.f = combos)
Produced an error.
Any suggestions?
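A likely cause of the error is that group_map() hands .f each group's rows as a data frame (plus the group key), not as a bare vector, so combn() receives a one-column data frame instead of the values. The sketch below pulls the column out first and uses group_modify(), which binds the per-group results back into one data frame. The output column names col2 and col3 are my own choice:

```r
library(dplyr)

input <- data.frame(col1 = c(1, 1, 1, 2, 2),
                    col2 = c("A", "B", "C", "E", "F"))

# .x is the data frame of rows for one group, .y is the one-row key.
combos <- function(.x, .y) {
  pairs <- t(combn(.x$col2, 2))  # one row per pair of values
  data.frame(col2 = pairs[, 1], col3 = pairs[, 2])
}

res <- input %>%
  group_by(col1) %>%
  group_modify(combos) %>%
  ungroup()

res
```

Note that combn() needs at least two values, so a group with a single row would still fail and would have to be handled separately.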
You can do :
library(dplyr)
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
# col1 X1 X2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 A C
#3 1 B C
#4 2 E F
Alternatively, with nest() and map() (this needs tidyr and purrr):
library(tidyr)
library(purrr)
input %>%
group_by(col1) %>%
nest(data=-col1) %>%
mutate(out= map(data, ~ t(combn(unlist(.x), 2)))) %>%
unnest(out) %>% select(-data)
# A tibble: 4 x 2
# Groups: col1 [2]
col1 out[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Or :
combos <- function(x, ...) {
tibble(col1 = x[[1, 1]], col2 = t(combn(unlist(x[[2]], use.names = FALSE), 2)))
}
input %>%
group_by(col1) %>%
group_map(.f = combos, .keep = TRUE) %>% invoke(rbind, .) %>% tibble
# A tibble: 4 x 2
col1 col2[,1] [,2]
<dbl> <chr> <chr>
1 1 A B
2 1 A C
3 1 B C
4 2 E F
Thank you! In terms of parsimony, I like both the answer from Ben
input %>%
group_by(col1) %>%
do(data.frame(t(combn(.$col2, 2))))
and Ronak
data <- input %>%
group_by(col1) %>%
summarise(col2 = t(combn(col2, 2)))
cbind(data[1], data.frame(data$col2))
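One caveat for newer installations: from dplyr 1.1.0 on, summarise() warns when a group returns more than one row, and reframe() is the function meant for that case. A sketch assuming dplyr >= 1.1.0; the V1 and V2 column names are simply what as.data.frame() assigns to the matrix columns:

```r
library(dplyr)

input <- data.frame(col1 = c(1, 1, 1, 2, 2),
                    col2 = c("A", "B", "C", "E", "F"))

# reframe() allows any number of rows per group. An unnamed data frame
# result is unpacked into separate columns (here V1 and V2).
pairs <- input %>%
  reframe(as.data.frame(t(combn(col2, 2))), .by = col1)

pairs
```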

Getting observations until and including first different value (groups with "no switch" are allowed)

I have a slightly convoluted way to slice a data frame by group from the first row (it always starts with the same value) up to and including the first different value.
I thought about using something like slice(1:min(which(value == new.value))), but there are groups where this switch does not happen, and this is what gives me a headache. I could split the data into groups with and without a switch and do the calculation only on those with a switch, but I would love to know if there are more elegant options out there. I am open to any package.
library(dplyr)
mydf <- data.frame(group = rep(letters[1:3], each = 4), value = c(1,2,2,2, 1, 1,1,1,1,1,2,2))
The following does not work, because there are groups without "switch"
mydf %>% group_by(group) %>% slice(1: min(which(value == 2)))
#> Warning in min(which(value == 2)): no non-missing arguments to min; returning
#> Inf
#> Error in 1:min(which(value == 2)): result would be too long a vector
Doing the slice operation on only the groups with a switch and binding with the "no-switchers" works:
mydf_grouped <- mydf %>% group_by(group)
mydf_grouped %>%
filter(any(value == 2)) %>%
slice(1: min(which(value == 2))) %>%
bind_rows(filter(mydf_grouped, !any(value ==2)))
#> # A tibble: 9 x 2
#> # Groups: group [3]
#> group value
#> <fct> <dbl>
#> 1 a 1
#> 2 a 2
#> 3 c 1
#> 4 c 1
#> 5 c 2
#> 6 b 1
#> 7 b 1
#> 8 b 1
#> 9 b 1
Created on 2019-12-22 by the reprex package (v0.3.0)
Here, one option is to pass the if/else condition
library(dplyr)
mydf %>%
group_by(group) %>%
slice(if(!2 %in% value) row_number() else seq_len(match(2, value)) )
Or more compactly, change the nomatch in match to n()
mydf %>%
group_by(group) %>%
slice(seq_len(match(2, value, nomatch = n())))
# A tibble: 9 x 2
# Groups: group [3]
# group value
# <fct> <dbl>
#1 a 1
#2 a 2
#3 b 1
#4 b 1
#5 b 1
#6 b 1
#7 c 1
#8 c 1
#9 c 2
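To see why the nomatch argument helps, it can be useful to look at match() on plain vectors, outside of dplyr. A small illustration (the vector names are made up for this example):

```r
with_switch    <- c(1, 1, 2, 2)
without_switch <- c(1, 1, 1, 1)

# Position of the first 2, if there is one.
pos_with <- match(2, with_switch)

# Without a 2, match() would return NA; nomatch substitutes the vector
# length instead, which plays the role of n() inside slice().
pos_without <- match(2, without_switch, nomatch = length(without_switch))

idx_with    <- seq_len(pos_with)     # rows up to and including the first 2
idx_without <- seq_len(pos_without)  # all rows
```

So a group without a switch ends up selecting every one of its rows, and no special case is needed.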
We want all rows having a value of 1 as well as the row with the first 2 in each group (this assumes the value never returns to 1 after the first 2):
mydf %>%
group_by(group) %>%
filter(value == 1 | cumsum(value == 2) == 1) %>%
ungroup
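If a group could go back to 1 after its first 2, the value == 1 condition would keep those later rows as well. A variation that only depends on the position of the first 2 is sketched here, with a made-up group "d" that switches back:

```r
library(dplyr)

mydf2 <- data.frame(group = rep(c("a", "d"), each = 4),
                    value = c(1, 2, 2, 2, 1, 2, 1, 2))

# cumsum(value == 2) counts the 2s seen so far. Lagging it by one row
# keeps every row up to and including the first 2, and nothing after.
res <- mydf2 %>%
  group_by(group) %>%
  filter(lag(cumsum(value == 2), default = 0) == 0) %>%
  ungroup()

res
```

Groups without any 2 keep all their rows, because the running count never leaves 0.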
We can use rleid to create an index of changes in value, shift it by one position, and select all the rows up to and including the first change.
library(data.table)
setDT(mydf)
mydf[, .SD[shift(rleid(value), fill = 1) == 1], group]
# group value
#1: a 1
#2: a 2
#3: b 1
#4: b 1
#5: b 1
#6: b 1
#7: c 1
#8: c 1
#9: c 2
The same logic in dplyr can be implemented by
library(dplyr)
mydf %>%
group_by(group) %>%
filter(lag(cumsum(value != lag(value, default = 1)), default = 0) == 0)

Use a grouped field to filter the original table in the summarise

Edit.
I've rewritten the question hoping it makes more sense.
Given this data:
> df
Cat1 Cat2 Q
1 A B 1
2 A C 1
3 B D 1
4 B C 1
5 C C 1
6 C D 1
You can easily group by Cat1 and sum Q using dplyr:
> df %>% group_by(Cat1) %>% summarise(Sum1 = sum(Q))
# A tibble: 3 x 2
Cat1 Sum1
<fct> <dbl>
1 A 2
2 B 2
3 C 2
Now, my question is, as a next step, can you use the groups in the group by (i.e. A, B and C) to operate in the original table? For example, how could you sum Q when Cat2 equals each group?
Meaning, for A there is no match in Cat2, so the sum of Q would be 0. For B there is only a match in the first row, so the sum of Q would be 1. For C there is a match in the second, the fourth and the fifth row, so the sum of Q would be 3:
# A tibble: 3 x 3
Cat1 Sum1 Sum2
<fct> <dbl> <dbl>
1 A 2 0
2 B 2 1
3 C 2 3
Note that this is not what I'm asking:
> df %>% group_by(Cat1) %>% summarise(Sum1 = sum(Q), Sum2 = sum(Q[Cat1==Cat2]))
# A tibble: 3 x 3
Cat1 Sum1 Sum2
<fct> <dbl> <dbl>
1 A 2 0
2 B 2 0
3 C 2 1
@antoine-sac proposed in the comments duplicating df and doing a left join on Cat1 (grouped) = Cat2. Of course this would solve the problem, but it's not the question I'm trying to answer.
Code:
Cat1 <- c("A","A","B","B","C","C")
Cat2 <- c("B","C","D","C","C","D")
Cat1 <- factor(Cat1, levels = c("A","B","C","D"))
Cat2 <- factor(Cat2, levels = c("A","B","C","D"))
Q <- c(1,1,1,1,1,1)
df <- data.frame(Cat1, Cat2, Q)
I think a join is the cleanest way to do it. Think about yourself reading your code again in 6 months: you want the meaning of your code to be obvious.
library("dplyr")
df <- read.table(text = " Cat1 Cat2 Q
1 A B 1
2 A C 1
3 B D 1
4 B C 1
5 C C 1
6 C D 1", stringsAsFactors = FALSE)
df1 <- df %>%
group_by(Cat1) %>%
summarise(Sum1 = sum(Q))
df2 <- df %>%
group_by(Cat2) %>%
summarise(Sum2 = sum(Q))
full_join(df1, df2, by = c("Cat1" = "Cat2")) %>%
tidyr::replace_na(list(Sum1 = 0, Sum2 = 0))
# # A tibble: 4 x 3
# Cat1 Sum1 Sum2
# <chr> <dbl> <dbl>
# 1 A 2 0
# 2 B 2 1
# 3 C 2 3
# 4 D 0 2
With a full_join, you keep all values in Cat1 or Cat2 (A, B, C , D) but you can use a left_join (to keep A, B, C), a right_join (to keep B, C, D) or an inner_join (to keep B, C).
These are respectively the values in Cat1, in Cat2 or both in Cat1 and Cat2.
It may seem painful, especially if you have a lot of categories, but if you have to do it more than once, it is actually easy to automate in a function.
EDIT: actually it is not easy at all if you want to use dplyr due to non-standard evaluation. Here's how you'd do it:
sum_cats <- function(df, cat1, cat2, value) {
cat1 <- enquo(cat1)
cat2 <- enquo(cat2)
value <- enquo(value)
sum1 <- paste0("Sum_", quo_name(cat1))
df1 <- df %>%
rename(cat = !! cat1) %>%
group_by(cat) %>%
summarise(!! sum1 := sum(!! value))
sum2 <- paste0("Sum_", quo_name(cat2))
df2 <- df %>%
rename(cat = !! cat2) %>%
group_by(cat) %>%
summarise(!! sum2 := sum(!! value))
full_join(df1, df2, by = "cat") %>%
tidyr::replace_na(rlang::list2(!! sum1 := 0, !! sum2 := 0))
}
Now you can just call sum_cats to do all the work:
df %>%
sum_cats(Cat1, Cat2, Q)
# cat Sum_Cat1 Sum_Cat2
# <chr> <dbl> <dbl>
# 1 A 2 0
# 2 B 2 1
# 3 C 2 3
# 4 D 0 2
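With more recent versions of dplyr and rlang, the embrace operator {{ }} removes most of the enquo() and !! bookkeeping. The following is a sketch of the same function in that style, not a drop-in from the answer above: sum_cats2 is a hypothetical name, and it assumes a dplyr version that supports {{ }} and glue-style names on the left of :=.

```r
library(dplyr)

# Hypothetical rewrite of sum_cats() using {{ }} instead of enquo()/!!.
sum_cats2 <- function(df, cat1, cat2, value) {
  df1 <- df %>%
    group_by(cat = {{ cat1 }}) %>%
    summarise("Sum_{{ cat1 }}" := sum({{ value }}))
  df2 <- df %>%
    group_by(cat = {{ cat2 }}) %>%
    summarise("Sum_{{ cat2 }}" := sum({{ value }}))
  # Categories missing on one side get NA from the join; turn them into 0.
  full_join(df1, df2, by = "cat") %>%
    mutate(across(starts_with("Sum_"), ~ coalesce(.x, 0)))
}

df <- data.frame(Cat1 = c("A", "A", "B", "B", "C", "C"),
                 Cat2 = c("B", "C", "D", "C", "C", "D"),
                 Q = c(1, 1, 1, 1, 1, 1))

res <- sum_cats2(df, Cat1, Cat2, Q)
res
```

The NA replacement uses across() with coalesce() so that the dynamically named columns do not have to be spelled out again.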
You can try
df %>%
group_by(Cat1) %>%
summarise(sum1 = sum(Q),
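sum2 = sum(.$Q[.$Cat2 == Cat1[1]]))
# A tibble: 3 x 3
Cat1 sum1 sum2
<fct> <dbl> <dbl>
1 A 2 0
2 B 2 1
3 C 2 3
Here .$ refers to the whole, ungrouped data frame that was piped in, so both the comparison and the sum run over all original rows rather than only the current group. Q is taken through .$ as well, so that its length matches the comparison.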
You probably could construct a new column and summarise from the new column:
df %>% mutate(new_Quantity=ifelse(Start == End, Quantity,0)) %>% group_by(Start) %>% summarise(Sum = sum(new_Quantity))

count by all variables / count distinct with dplyr

Say I have this data.frame :
library(dplyr)
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
# x y
# 1 a a
# 2 b b
# 3 b b
# 4 c c
# 5 c c
# 6 c c
I can group and count easily by mentioning the names :
df1 %>%
count(x,y)
# A tibble: 3 x 3
# x y n
# <fctr> <fctr> <int>
# 1 a a 1
# 2 b b 2
# 3 c c 3
How do I group by everything without mentioning individual column names, in the most compact and readable way?
We can pass the input itself to the ... argument and splice it with !!! :
df1 %>% count(., !!!.)
#> x y n
#> 1 a a 1
#> 2 b b 2
#> 3 c c 3
Note : see edit history to make sense of some comments
With base we could do : aggregate(setNames(df1[1],"n"), df1, length)
For those who wouldn't get the voodoo you are using in the accepted answer, if you don't need to use dplyr, you can do it with data.table:
library(data.table)
setDT(df1)
df1[, .N, names(df1)]
# x y N
# 1: a a 1
# 2: b b 2
# 3: c c 3
Have you considered the (now superseded) group_by_all()?
df1 <- data.frame(x=rep(letters[1:3],1:3),y=rep(letters[1:3],1:3))
df1 %>% group_by_all() %>% count
df1 %>% group_by(across()) %>% count()
df1 %>% count(across()) # don't know why this returns a data.frame and not tibble
See the colwise vignette "other verbs" section for explanation... though honestly I get turned around myself sometimes.
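As far as I can tell, recent dplyr versions prefer an explicit column selection over an empty across() call, so a slightly more future-proof spelling of the last variant might look like this (a sketch, assuming dplyr 1.0.0 or later):

```r
library(dplyr)

df1 <- data.frame(x = rep(letters[1:3], 1:3),
                  y = rep(letters[1:3], 1:3))

# everything() selects all columns, so count() groups by every
# variable without naming any of them.
res <- df1 %>% count(across(everything()))

res
```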

Resources