Apply dplyr::starts_with() with lambda function

I have the following setup:
library(dplyr)
library(tidyr)
dat = data.frame('A' = 1:3, 'C_1' = 1:3, 'C_2' = 1:3, 'M' = 1:3)
The following works:
dat %>% rowwise %>% mutate(Anew = list({function(x) c(x[1]^2, x[2] + 5, x[3] + 1)}(c(M, C_1, C_2)))) %>% ungroup %>% unnest_wider(Anew, names_sep = "")
However, the following does not work when I try to select the columns by name using dplyr::starts_with():
dat %>% rowwise %>% mutate(Anew = list({function(x) c(x[1]^2, x[2] + 5, x[3] + 1)}(c(M, starts_with('C_'))))) %>% ungroup %>% unnest_wider(Anew, names_sep = "")
Any pointers on how to correctly apply starts_with() in this context would be very helpful.
PS: This is a continuation of my earlier post, Apply custom function that returns multiple values after dplyr::rowwise()

starts_with() must be used inside a selecting function, so we can wrap it in select() as shown below. across() is also a selecting function, so we could alternatively use across(M | starts_with('C_')) in place of select(...). c_across() is a selecting function too, but it does not preserve names.
dat %>%
rowwise %>%
mutate(Anew = list({function(x) c(x[1]^2, x[2] + 5, x[3] + 1)}
(select(cur_data(), M, starts_with('C_'))))) %>%
ungroup %>%
unnest_wider(Anew, names_sep = "")
## # A tibble: 3 × 7
## A C_1 C_2 M AnewM AnewC_1 AnewC_2
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 6 2
## 2 2 2 2 2 4 7 3
## 3 3 3 3 3 9 8 4
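In dplyr 1.1.0 and later, cur_data() is deprecated in favor of pick(), which takes the column selection directly. A minimal sketch of the same approach, assuming dplyr >= 1.1.0 is installed; the helper name f is introduced here only for readability and is not part of the original answer:

```r
library(dplyr)
library(tidyr)

dat <- data.frame(A = 1:3, C_1 = 1:3, C_2 = 1:3, M = 1:3)

# Hypothetical helper: same anonymous function as above, just given a name
f <- function(x) c(x[1]^2, x[2] + 5, x[3] + 1)

res <- dat %>%
  rowwise() %>%
  # pick() is a selecting context, so starts_with() is allowed here
  mutate(Anew = list(f(pick(M, starts_with("C_"))))) %>%
  ungroup() %>%
  unnest_wider(Anew, names_sep = "")
```

Because pick() returns a one-row tibble with its column names intact, the new columns should carry the same AnewM, AnewC_1, AnewC_2 names as in the output above.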
Here group_modify() would also work, and it allows formula notation for the anonymous function. The indexes in the anonymous function have been reordered to match the column order of the input: .x excludes the grouping column A, which leaves C_1, C_2, M.
dat %>%
group_by(A) %>%
group_modify(~ cbind(.x, Anew = c(.x[3]^2, .x[1] + 5, .x[2] + 1))) %>%
ungroup
## # A tibble: 3 × 7
## A C_1 C_2 M Anew.M Anew.C_1 Anew.C_2
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 6 2
## 2 2 2 2 2 4 7 3
## 3 3 3 3 3 9 8 4

If we wrap starts_with() in c_across(), and assuming there is a third column that starts with C_ (see the data at the end), then the anonymous function can be applied on the fly:
library(dplyr)
library(tidyr)
dat %>%
rowwise %>%
mutate(Anew = list((function(x) c(x[1]^2, x[2] + 5, x[3] +
1))(c_across(starts_with("C_"))))) %>%
unnest_wider(Anew, names_sep = "")
-output
# A tibble: 3 × 8
A C_1 C_2 C_3 M Anew1 Anew2 Anew3
<int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 6 2
2 2 2 2 2 2 4 7 3
3 3 3 3 3 3 9 8 4
Alternatively, instead of working rowwise, we may create a named list of functions and apply them column-wise with across(), which would be more efficient:
fns <- list(C_1 = function(x) x^2, C_2 = function(x) x + 5,
C_3 = function(x) x + 1)
dat %>%
mutate(across(starts_with("C_"),
~ fns[[cur_column()]](.x), .names = "Anew{seq_along(.fn)}"))
-output
A C_1 C_2 C_3 M Anew1 Anew2 Anew3
1 1 1 1 1 1 1 6 2
2 2 2 2 2 2 4 7 3
3 3 3 3 3 3 9 8 4
data
dat <- data.frame('A' = 1:3, 'C_1' = 1:3, 'C_2' = 1:3, C_3 = 1:3, 'M' = 1:3)
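The across() call above looks up each column's function by name through cur_column(). The same pairing of function and column can be sketched in base R with Map(); this is only an illustration of the idea, not a replacement for the answer's code, and new_cols and res are names made up for the example:

```r
# Same data as above, with three C_ columns
dat <- data.frame(A = 1:3, C_1 = 1:3, C_2 = 1:3, C_3 = 1:3, M = 1:3)

fns <- list(C_1 = function(x) x^2,
            C_2 = function(x) x + 5,
            C_3 = function(x) x + 1)

# Pair each function with the column of the same name and call it
new_cols <- Map(function(f, col) f(col), fns, dat[names(fns)])
names(new_cols) <- paste0("Anew", seq_along(new_cols))

res <- cbind(dat, as.data.frame(new_cols))
```

Each function is applied to a whole column at once, which is why this style avoids the per-row overhead of rowwise().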

Related

how to filter on column==var when var has same name as column? (inside pmap)

I have a tibble that I want to filter by comparing its columns against some variables. However, it's convenient for that variable to have the same name as the column. How can I force dplyr to evaluate the variable so it doesn't confuse the variable and column names?
set.seed(2)
ngrp <- 3
npergrp <- 4
tib <- tibble(grp=rep(letters[1:ngrp], each=npergrp),
N=rep(1:npergrp, ngrp),
val=round(runif(npergrp*ngrp))) %>% print(n=Inf)
grp <- grp_ <- 'a'
tib %>% dplyr::filter(grp==grp_) %>% glimpse() ## works
tib %>% dplyr::filter(grp==grp) %>% glimpse() ## undesired result, grp==grp always true
tib %>% dplyr::filter(grp=={{grp}}) %>% glimpse() ## hey it works!
## slightly less toy example
tib %>% dplyr::filter(grp==grp_) %>%
  dplyr::mutate(
    the_rest = purrr::pmap(
      .,
      function(grp, N, ...) {
        gg <- grp ## there must be a better way
        NN <- N
        tib %>%
          dplyr::filter(
            # grp!=grp, ## always false
            # N==N ## always true
            grp!=gg,
            N==NN
          ) %>%
          dplyr::pull(val) %>%
          sum()
      }
    ),
    no_hugs = purrr::pmap(
      .,
      function(grp, N, ...) {
        tib %>%
          dplyr::filter(
            grp!={{grp}}, ## ERROR! oh noes!
            N=={{N}}
          ) %>%
          dplyr::pull(val) %>%
          sum()
      }
    )
  ) %>%
  tidyr::unnest() %>%
  glimpse()
output:
# A tibble: 12 × 3
grp N val
<chr> <int> <dbl>
1 a 1 0
2 a 2 1
3 a 3 1
4 a 4 0
5 b 1 1
6 b 2 1
7 b 3 0
8 b 4 1
9 c 1 0
10 c 2 1
11 c 3 1
12 c 4 0
Rows: 4
Columns: 3
$ grp <chr> "a", "a", "a", "a"
$ N <int> 1, 2, 3, 4
$ val <dbl> 0, 1, 1, 0
Rows: 4
Columns: 3
$ grp <chr> "a", "a", "a", "a"
$ N <int> 1, 2, 3, 4
$ val <dbl> 0, 1, 1, 0
Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?
# the_rest should be 1, 2, 1, 1
As often happens, writing the question taught me how to embrace variables using the double curly brace operator {{}}
https://dplyr.tidyverse.org/articles/programming.html
Use dynamic name for new column/variable in `dplyr`
However, it doesn't work inside the pmap.

We need the .env pronoun so that the object grp is looked up in the calling environment rather than in the data (alternatively, use !! to inject its value):
library(dplyr)
tib %>%
dplyr::filter(grp==.env$grp)
-output
# A tibble: 4 × 3
grp N val
<chr> <int> <dbl>
1 a 1 0
2 a 2 1
3 a 3 1
4 a 4 0
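To make the difference concrete, here is a small sketch contrasting the ambiguous comparison with the two explicit forms. The tibble is a cut-down stand-in for the one in the question, and the result names (ambiguous, explicit, injected) are made up for illustration:

```r
library(dplyr)

tib <- tibble(grp = rep(c("a", "b"), each = 2), N = rep(1:2, 2))
grp <- "a"

# Column compared with itself: always TRUE, so every row is kept
ambiguous <- tib %>% filter(grp == grp)

# .data$ refers to the column, .env$ to the variable in the calling environment
explicit <- tib %>% filter(.data$grp == .env$grp)

# !! injects the variable's value before the expression is evaluated
injected <- tib %>% filter(grp == !!grp)
```

Both explicit forms keep only the rows of group "a", while the ambiguous one returns the whole table.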
The .env pronoun can be used in the same way within the pmap code:
library(purrr)
tib %>%
dplyr::filter(grp==.env$grp_) %>%
dplyr::mutate(the_rest = purrr::pmap_dbl(across(everything()),
~ {gg <- ..1
NN <- ..2
tib %>%
dplyr::filter(grp != gg, N == NN) %>%
pull(val) %>%
sum()}),
no_hugs = purrr::pmap_dbl(across(all_of(names(tib))),
~ tib %>%
dplyr::filter(grp != .env$grp, N == ..2) %>%
pull(val) %>%
sum()))
-output
# A tibble: 4 × 5
grp N val the_rest no_hugs
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0 1 1
2 a 2 1 2 2
3 a 3 1 1 1
4 a 4 0 1 1
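If the function arguments should keep the same names as the columns, as in the question, .data and .env can disambiguate both sides inside the function body. A sketch under that assumption, with val hard-coded from the table printed in the question; sum_other_groups is a hypothetical name:

```r
library(dplyr)
library(purrr)

tib <- tibble(grp = rep(letters[1:3], each = 4),
              N   = rep(1:4, 3),
              val = c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0))

# Arguments deliberately share their names with the columns of tib;
# the unused val column is absorbed by ...
sum_other_groups <- function(grp, N, ...) {
  tib %>%
    filter(.data$grp != .env$grp,  # column vs. function argument
           .data$N == .env$N) %>%
    pull(val) %>%
    sum()
}

the_rest <- tib %>%
  filter(grp == "a") %>%
  pmap_dbl(sum_other_groups)
```

Here .env resolves to the function's own environment, so grp and N are the values pmap_dbl() passes in for the current row. This should reproduce the 1, 2, 1, 1 the question asks for.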

How can a table be rearranged one step at a time so that two or more observations are listed in a row in successive columns?

So far I have done this to achieve the desired result:
# A tibble: 4 x 2
frag treat
<dbl> <dbl>
1 1 1
2 2 1
3 1 2
4 2 2
treat_1 <- tab_example %>% filter(treat == "1")
treat_2 <- tab_example %>% filter(treat == "2")
new_tab_example <- full_join(treat_1, treat_2, by = "frag")
> new_tab_example
# A tibble: 2 x 3
frag treat.x treat.y
<dbl> <dbl> <dbl>
1 1 1 2
2 2 1 2
Is there a way to do it in one step?

You can use pivot_wider:
tidyr::pivot_wider(tab_example, names_from = treat,
names_prefix = 'treat', values_from = treat)
# frag treat1 treat2
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 2 1 2

One way is to use spread() (superseded by pivot_wider() since tidyr 1.0.0, but still available):
library(dplyr)
library(tidyr)
# Your data
df = tibble(frag = c(1, 2, 1, 2), treat = c(1,1,2,2) )
dfnew = df %>%
mutate(treat_name = case_when(treat==1 ~ 'treat.x', # Build names of columns
treat==2 ~ 'treat.y')
) %>%
spread(treat_name, treat) # Use spread function
If you print the result:
print(dfnew)
# A tibble: 2 x 3
frag treat.x treat.y
<dbl> <dbl> <dbl>
1 1 1 2
2 2 1 2

Apply the same function with multiple columns as inputs to multiple columns in R with tidyverse

As an example, I have the following data frame:
df <- data.frame(a1=1,a2=2,a3=3,b1=1,b2=2,b3=3)
I have a function:
fn <- function(x,y,z) x^y+(z-x)^(y-x)
I want the following:
df <- df %>% mutate(a=fn(a1,a2,a3),b=fn(b1,b2,b3))
The problem is, I have tons of triplets in my dataset, so it is not ideal to write them out one by one.

Here are base R options using:
split.default + lapply + do.call
cbind(
df,
lapply(
split.default(df, gsub("\\d+", "", names(df))),
function(x) do.call(fn, unname(x))
)
)
reshape + lapply + do.call
cbind(
df,
lapply(
subset(
reshape(
setNames(df, gsub("(\\d+)$", "\\.\\1", names(df))),
direction = "long",
varying = 1:length(df)
),
select = -c(time, id)
),
function(x) do.call(fn, as.list(x))
)
)
Output
a1 a2 a3 b1 b2 b3 a b
1 1 2 3 1 2 3 3 3
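Both options group the columns by the name that remains once the trailing digits are stripped. The same idea written as an explicit loop may be easier to follow; this sketch assumes every prefix has exactly three columns suffixed 1, 2, 3, and prefixes is a name introduced for the example:

```r
df <- data.frame(a1 = 1, a2 = 2, a3 = 3, b1 = 1, b2 = 2, b3 = 3)
fn <- function(x, y, z) x^y + (z - x)^(y - x)

# Unique column prefixes: "a", "b"
prefixes <- unique(sub("\\d+$", "", names(df)))

# For each prefix p, pass columns p1, p2, p3 to fn() and store the result as p
for (p in prefixes) {
  df[[p]] <- fn(df[[paste0(p, 1)]], df[[paste0(p, 2)]], df[[paste0(p, 3)]])
}
```

The loop relies on the column names rather than their positions, so it does not care whether the triplets are stored next to each other.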

I would convert df to long format, use lag() to create three columns, and then apply fn() to them:
library(tidyverse)
df_long <- df %>%
pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)")
df_longer <- df_long %>%
pivot_longer(-c(set),
names_to = "key",
values_to = "val") %>%
arrange(key)
df_longer
#> # A tibble: 6 x 3
#> set key val
#> <chr> <chr> <dbl>
#> 1 1 a 1
#> 2 2 a 2
#> 3 3 a 3
#> 4 1 b 1
#> 5 2 b 2
#> 6 3 b 3
Apply lag(), then fn(), and keep only the rows where val_fn is not NA:
df_longer <- df_longer %>%
group_by(key) %>%
mutate(val_lag1 = lag(val, n = 1),
val_lag2 = lag(val, n = 2)) %>%
mutate(val_fn = fn(val_lag2, val_lag1, val)) %>%
filter(!is.na(val_fn))
df_longer
#> # A tibble: 2 x 6
#> # Groups: key [2]
#> set key val val_lag1 val_lag2 val_fn
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 3 a 3 2 1 3
#> 2 3 b 3 2 1 3
Created on 2020-12-03 by the reprex package (v0.3.0)

I think it would be easier and shorter to stack the columns of each group into a single column and then apply the function to each stacked column.
library(dplyr)
library(tidyr)
result <- df %>%
  pivot_longer(cols = everything(),
               names_to = '.value',
               names_pattern = '([a-z]+)') %>%
  # pass the three stacked values of each column to fn() as x, y, z
  summarise(across(everything(), ~ do.call(fn, as.list(.x))))
result
# a b
# <dbl> <dbl>
#1 3 3
You can bind the result to your original dataset if needed.
bind_cols(df, result)
# a1 a2 a3 b1 b2 b3 a b
#1 1 2 3 1 2 3 3 3

Winners within pairs; or vector-valued group_by mutate?

I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The below does, but is clunky with an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Maybe related to: Weird group_by + mutate + which.max behavior

You could do:
dat %>%
  group_by(pair) %>%
  mutate(won = score == max(score),
         winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1

Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
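As for why the attempt in the question fails: ifelse() returns a result shaped like its test, and score[1] > score[2] has length 1, so only the first element of c(1, 0) or c(0, 1) is returned and then recycled across the pair. A plain if/else returns the whole vector. A small sketch, with the scores copied from the question's printed output; the object names are made up for the example:

```r
library(dplyr)

dat <- tibble(pair  = rep(1:3, each = 2),
              unit  = rep(1:2, 3),
              score = c(-1.40, 0.523, 0.142, -0.847, -0.412, -1.47))

# ifelse() with a length-1 test yields a length-1 result
short <- ifelse(TRUE, c(1, 0), c(0, 1))

# if/else returns the full length-2 vector, one value per unit in the pair
res <- dat %>%
  group_by(pair) %>%
  mutate(won = if (score[1] > score[2]) c(1, 0) else c(0, 1)) %>%
  ungroup()
```

This only works because every group has exactly two rows; with pairs of other sizes, mutate() would complain about the length of the result.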

How to separate a column list of fixed size X to X different columns?

I have a tibble with one column being a list column, always having two numeric values named a and b (e.g. as a result of calling purrr::map with a function that returns a list), say:
df <- tibble(x = 1:3, y = list(list(a = 1, b = 2), list(a = 3, b = 4), list(a = 5, b = 6)))
df
# A tibble: 3 × 2
x y
<int> <list>
1 1 <list [2]>
2 2 <list [2]>
3 3 <list [2]>
How do I separate the list column y into two columns a and b, and get:
df_res <- tibble(x = 1:3, a = c(1,3,5), b = c(2,4,6))
df_res
# A tibble: 3 × 3
x a b
<int> <dbl> <dbl>
1 1 1 2
2 2 3 4
3 3 5 6
Looking for something like tidyr::separate to deal with a list instead of a string.

Using dplyr (current release: 0.7.0):
bind_cols(df[1], bind_rows(df$y))
# # A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
# 1 1 1 2
# 2 2 3 4
# 3 3 5 6
edit based on OP's comment:
To embed this in a pipe and in case you have many non-list columns, we can try:
df %>% select(-y) %>% bind_cols(bind_rows(df$y))

We could also make use of map_df from purrr:
library(tidyverse)
df %>%
summarise(x = list(x), new = list(map_df(.$y, bind_rows))) %>%
unnest
# A tibble: 3 x 3
# x a b
# <int> <dbl> <dbl>
#1 1 1 2
#2 2 3 4
#3 3 5 6
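With newer tidyr versions (1.0.0 and later), unnest_wider() is probably the closest thing to a tidyr::separate for list columns, which is what the question asks for. A minimal sketch, assuming such a tidyr version is installed:

```r
library(tidyr)

df <- tibble::tibble(x = 1:3,
                     y = list(list(a = 1, b = 2),
                              list(a = 3, b = 4),
                              list(a = 5, b = 6)))

# Each named element of the list column becomes its own column
df_res <- unnest_wider(df, y)
```

All other columns are kept as they are, so this also fits into a pipe when there are many non-list columns.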