I want to get rolling means for the past 1 to 10 events grouped by a column for multiple columns. I also want it very fast such as in dplyr or data.table because I want to run this on a 1,000,000 x 1,000 dataframe.
starting df
data.table(a = c("bill", "bob", "bill", "bob", "bill", "bob"),
b = c(1,2,1,1,3,2),
c = c(2,3,9,1,4,1),
d = c(4,5,1,7,3,4))
1: bill 1 2 4
2: bob 2 3 5
3: bill 1 9 1
4: bob 1 1 7
5: bill 3 4 3
6: bob 2 1 4
desired df
I want the rolling mean of only b and c grouped by column a with a window of 1 to 10 for each column lagged 1 row.
a b c d b_roll1 c_roll1 b_roll2 c_roll2 b_roll3 c_roll3
1: bill 1 2 4 NA NA NA NA NA NA
2: bob 2 3 5 NA NA NA NA NA NA
3: bill 1 9 1 1 2 1 2 1 2
4: bob 1 1 7 2 3 2 3 2 3
5: bill 3 4 3 1 9 1 5.5 1 5.5
6: bob 2 1 4 1 1 1 2 1 2
Your example outcome doesn't make too much sense to me, but here is an example on how you can generate many mutate calls programmatically.
An extendable solution using lazyeval and RcppRoll:
library(tidyverse)
vars <- c('b', 'c')
ns <- 1:10
com <- expand.grid(vars, ns, stringsAsFactors = FALSE)
dots <- map2(com[[1]], com[[2]],
~lazyeval::interp(~RcppRoll::roll_meanr(x, y, fill = NA), x = as.name(.x), y = .y))
names(dots) <- apply(com, 1, paste0, collapse = '_')
D %>%
group_by(a) %>%
mutate_(.dots = dots)
Gives:
Source: local data frame [6 x 24]
Groups: a [2]
a b c d `b_ 1` `c_ 1` `b_ 2` `c_ 2` `b_ 3` `c_ 3` `b_ 4` `c_ 4` `b_ 5` `c_ 5` `b_ 6` `c_ 6` `b_ 7` `c_ 7` `b_ 8` `c_ 8` `b_ 9`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 bill 1 2 4 1 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 bob 2 3 5 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 bill 1 2 1 1 2 1.0 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
4 bob 1 1 7 1 1 1.5 2 NA NA NA NA NA NA NA NA NA NA NA NA NA
5 bill 3 4 3 3 4 2.0 3 1.666667 2.666667 NA NA NA NA NA NA NA NA NA NA NA
6 bob 2 1 4 2 1 1.5 1 1.666667 1.666667 NA NA NA NA NA NA NA NA NA NA NA
# ... with 3 more variables: `c_ 9` <dbl>, b_10 <dbl>, c_10 <dbl>
I am still not completely following you. It seems that you apply a combination of a lag and a rolled mean. For just the rolled mean this is a solution using dplyr and RcppRoll.
roll_mean_na <- function(x, lag){
c(rep(NA, lag - 1), RcppRoll::roll_mean(x, lag, align = "left"))
}
library(dplyr)
df %>% group_by(a) %>%
mutate(b_2 = roll_mean_na(b, 2), c_2 = roll_mean_na(c, 2),
b_3 = roll_mean_na(b, 3), c_3 = roll_mean_na(c, 3),
b_4 = roll_mean_na(b, 4), c_4 = roll_mean_na(c, 4))
Related
I have a Datafaame like this:
dt <- tibble(
TRIAL = c("A", "A", "A", "B", "B", "B", "C", "C", "C","D","D","D"),
RL = c(1, NA, 3, 1, 6, 3, 2, 3, 1, 0, 1.5, NA),
SL = c(6, 1.5, 1, 0, 0, 1, 1, 2, 0, 1, 1.5, NA),
HC = c(0, 1, 5, 6,7, 8, 9, 3, 4, 5, 4, 2)
)
# A tibble: 12 x 4
TRIAL RL SL HC
<chr> <dbl> <dbl> <dbl>
1 A 1 6 0
2 A NA 1.5 1
3 A 3 1 5
4 B 1 0 6
5 B 6 0 7
6 B 3 1 8
7 C 2 1 9
8 C 3 2 3
9 C 1 0 4
10 D 0 1 5
11 D 1.5 1.5 4
12 D NA NA 2
I want to group the data frame by TRIAL and have the values in RL and SL checked by group, if the value in either of the column is greater than 5 then move all values for RL and SL for that particular group to RLCT and SLCT respectively.
# A tibble: 12 x 6
TRIAL HC RLCT SLCT SL RL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 1 6 NA NA
2 A 1 NA 1.5 NA NA
3 A 5 3 1 NA NA
4 B 6 1 0 NA NA
5 B 7 6 0 NA NA
6 B 8 3 1 NA NA
7 C 9 NA NA 1 3
8 C 3 NA NA 3 5
9 C 4 NA NA 1 1
10 D 5 NA NA 1 0
11 D 4 NA NA 1.5 1.5
12 D 2 NA NA NA NA
When I run the below code, I did not get the expected output
dt0 <- dt %>%
mutate(RLCT = NA,
SLCT = NA) %>%
group_by(TRIAL) %>%
filter(!any(RL > 5.0 | SL > 5.0))
dt1 <- dt %>%
group_by(TRIAL) %>%
filter(any(RL > 5.0 | SL > 5.0)) %>%
mutate(RLCT = RL,
SLCT = SL) %>%
rbind(dt0, .) %>%
mutate(RL = ifelse(!is.na(RLCT), NA, RL),
SL = ifelse(!is.na(SLCT), NA, SL)) %>% arrange(TRIAL)
This is what I get
# A tibble: 9 x 6
# Groups: TRIAL [3]
TRIAL RL SL HC RLCT SLCT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
You can define a column to storage the condition, and change RL and SL with ifelse inside across.
dt %>%
group_by(TRIAL) %>%
mutate(cond = any(RL > 5.0 | SL > 5.0, na.rm = TRUE),
across(c(RL, SL), ~ ifelse(cond, ., NA), .names = "{.col}CT"),
across(c(RL, SL), ~ ifelse(!cond, ., NA)),
cond = NULL)
Result:
# A tibble: 12 x 6
# Groups: TRIAL [4]
TRIAL RL SL HC RLCT SLCT
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A NA NA 0 1 6
2 A NA NA 1 NA 1.5
3 A NA NA 5 3 1
4 B NA NA 6 1 0
5 B NA NA 7 6 0
6 B NA NA 8 3 1
7 C 2 1 9 NA NA
8 C 3 2 3 NA NA
9 C 1 0 4 NA NA
10 D 0 1 5 NA NA
11 D 1.5 1.5 4 NA NA
12 D NA NA 2 NA NA
With dplyr, you could use group_modify():
library(dplyr)
dt %>%
group_by(TRIAL) %>%
group_modify(~ {
if(any(select(.x, c(RL, SL)) > 5, na.rm = TRUE)) {
rename_with(.x, ~ paste0(.x, 'CT'), c(RL, SL))
} else {
.x
}
})
Output
# A tibble: 12 × 6
# Groups: TRIAL [4]
TRIAL RLCT SLCT HC RL SL
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 6 0 NA NA
2 A NA 1.5 1 NA NA
3 A 3 1 5 NA NA
4 B 1 0 6 NA NA
5 B 6 0 7 NA NA
6 B 3 1 8 NA NA
7 C NA NA 9 2 1
8 C NA NA 3 3 2
9 C NA NA 4 1 0
10 D NA NA 5 0 1
11 D NA NA 4 1.5 1.5
12 D NA NA 2 NA NA
I understand how to use complete from tidyverse tidyverse complete a dataframe
In the example they give:
df <- tibble(
group = c(1:2, 1, 2),
item_id = c(1:2, 2, 3),
item_name = c("a", "a", "b", "b"),
value1 = c(1, NA, 3, 4),
value2 = 4:7
)
df
#> # A tibble: 4 × 5
#> group item_id item_name value1 value2
#> <dbl> <dbl> <chr> <dbl> <int>
#> 1 1 1 a 1 4
#> 2 2 2 a NA 5
#> 3 1 2 b 3 6
#> 4 2 3 b 4 7
Is there a way of adding a group and completing? e.g. add a group 3 and complete the table.
For example, I have a df which I populate in a for loop to make a plot. The df is like so:
variant Location Position variable value protein Mutation.type
FANCI_L605F FANCI chr15:89828441_C/T B 0.45 L605F nonsynonymous_SNV
PLCG2_R953* PLCG2 chr16:81969788_C/T B 0.87 R953* stopgain
STAT3_R278C STAT3 chr17:40486033_G/A B 0.38 R278C nonsynonymous_SN
FANCI_L605F FANCI chr15:89828441_C/T C 0.45 L605F nonsynonymous_SNV
PLCG2_R953* PLCG2 chr16:81969788_C/T C 0.87 R953* stopgain
STAT3_R278C STAT3 chr17:40486033_G/A C 0.38 R278C nonsynonymous_SNV
I also have a vector of possible variable names:
all_var<-c("A","B","C")
I have worked out how to add any missing variables (I think):
new_df<-complete(df,variable=all_var,Position)
>new_df
variant Location Position variable value protein Mutation.type
NA NA chr15:89828441_C/T A NA NA NA
NA NA chr16:81969788_C/T A NA NA NA
NA NA chr17:40486033_G/A A NA NA NA
FANCI_L605F FANCI chr15:89828441_C/T B 0.45 L605F nonsynonymous_SNV
PLCG2_R953* PLCG2 chr16:81969788_C/T B 0.87 R953* stopgain
STAT3_R278C STAT3 chr17:40486033_G/A B 0.38 R278C nonsynonymous_SN
FANCI_L605F FANCI chr15:89828441_C/T C 0.45 L605F nonsynonymous_SNV
PLCG2_R953* PLCG2 chr16:81969788_C/T C 0.87 R953* stopgain
STAT3_R278C STAT3 chr17:40486033_G/A C 0.38 R278C nonsynonymous_SNV
How do I now complete the variant,Location, protein, Mutation.Type?
You can use add a row by specifying the group and use complete() to complete the combinations, i.e.
library(dplyr)
library(tidyr)
df %>%
add_row(group = 3) %>%
complete(group, nesting(item_id, item_name)) %>%
drop_na(item_id)
# A tibble: 12 x 5
group item_id item_name value1 value2
<dbl> <dbl> <chr> <dbl> <int>
1 1 1 a 1 4
2 1 2 a NA NA
3 1 2 b 3 6
4 1 3 b NA NA
5 2 1 a NA NA
6 2 2 a NA 5
7 2 2 b NA NA
8 2 3 b 4 7
9 3 1 a NA NA
10 3 2 a NA NA
11 3 2 b NA NA
12 3 3 b NA NA
This question already has answers here:
How can I automatically create n lags in a timeseries?
(3 answers)
Closed 1 year ago.
Given this tibble:
tibble(x = c(1:9))
I want to add a column x_lag_1 = c(NA,1:8), a column x_lag_2 = c(NA,NA,1:7), etc.
Up to x_lag_n.
This can be quick with data.table:
library(data.table)
n <- seq(4)
setDT(df)[, paste0('x_lag_', n) := shift(x, n)]
df
x x_lag_1 x_lag_2 x_lag_3 x_lag_4
1: 1 NA NA NA NA
2: 2 1 NA NA NA
3: 3 2 1 NA NA
4: 4 3 2 1 NA
5: 5 4 3 2 1
6: 6 5 4 3 2
7: 7 6 5 4 3
8: 8 7 6 5 4
9: 9 8 7 6 5
You may use map_dfc to add n new columns.
library(dplyr)
library(purrr)
df <- tibble(x = c(1:9))
n <- 3
bind_cols(df, map_dfc(seq_len(n), ~df %>%
transmute(!!paste0('x_lag', .x) := lag(x, .x))))
# x x_lag1 x_lag2 x_lag3
# <int> <int> <int> <int>
#1 1 NA NA NA
#2 2 1 NA NA
#3 3 2 1 NA
#4 4 3 2 1
#5 5 4 3 2
#6 6 5 4 3
#7 7 6 5 4
#8 8 7 6 5
#9 9 8 7 6
Edit 2: Reworked the answer to contemplate the case of a grouped df.
library(tidyverse)
set.seed(123)
df <- tibble(group = sample(letters[1:3], 30, replace = TRUE), x = c(1:30))
formulas <- seq(3, 12, 3) %>%
map(~ as.formula(str_glue("~lag(.,n={.x})"))) %>%
set_names(str_c("lag", seq(3, 12, 3)))
df %>%
summarise(x, across(x, lst(!!!formulas)))
#> # A tibble: 30 × 5
#> x x_lag3 x_lag6 x_lag9 x_lag12
#> <int> <int> <int> <int> <int>
#> 1 1 NA NA NA NA
#> 2 2 NA NA NA NA
#> 3 3 NA NA NA NA
#> 4 4 1 NA NA NA
#> 5 5 2 NA NA NA
#> 6 6 3 NA NA NA
#> 7 7 4 1 NA NA
#> 8 8 5 2 NA NA
#> 9 9 6 3 NA NA
#> 10 10 7 4 1 NA
#> # … with 20 more rows
df %>%
group_by(group) %>%
summarise(x, across(x, lst(!!!formulas)), .groups = "drop")
#> # A tibble: 30 × 6
#> group x x_lag3 x_lag6 x_lag9 x_lag12
#> <chr> <int> <int> <int> <int> <int>
#> 1 a 10 NA NA NA NA
#> 2 a 13 NA NA NA NA
#> 3 a 16 NA NA NA NA
#> 4 a 19 10 NA NA NA
#> 5 a 20 13 NA NA NA
#> 6 a 21 16 NA NA NA
#> 7 a 22 19 10 NA NA
#> 8 a 27 20 13 NA NA
#> 9 b 4 NA NA NA NA
#> 10 b 6 NA NA NA NA
#> # … with 20 more rows
Created on 2021-12-30 by the reprex package (v2.0.1)
I have a dataframe like this:
df <- data_frame(id = c(rep('A', 10), rep('B', 10)),
value = c(1:3, rep(NA, 2), 1:2, rep(NA, 3), 1, rep(NA, 4), 1:3, rep(NA, 2)))
I need to count the number of consective NA's in the value column. The count needs to be grouped by ID, and it needs to restart at 1 every time a new NA or new series of NA's is encountered. The exptected output should look like this:
df$expected_output <- c(rep(NA, 3), 1:2, rep(NA, 2), 1:3, NA, 1:4, rep(NA, 3), 1:2)
If anyone can give me a dplyr solution that would also be great :)
I've tried a few things but nothing is giving any sort of sensical result. Thanks in advance^!
A solution using dplyr and data.table.
library(dplyr)
library(data.table)
df2 <- df %>%
group_by(id) %>%
mutate(info = rleid(value)) %>%
group_by(id, info) %>%
mutate(expected_output = row_number()) %>%
ungroup() %>%
mutate(expected_output = ifelse(!is.na(value), NA, expected_output)) %>%
select(-info)
df2
# # A tibble: 20 x 3
# id value expected_output
# <chr> <dbl> <int>
# 1 A 1 NA
# 2 A 2 NA
# 3 A 3 NA
# 4 A NA 1
# 5 A NA 2
# 6 A 1 NA
# 7 A 2 NA
# 8 A NA 1
# 9 A NA 2
# 10 A NA 3
# 11 B 1 NA
# 12 B NA 1
# 13 B NA 2
# 14 B NA 3
# 15 B NA 4
# 16 B 1 NA
# 17 B 2 NA
# 18 B 3 NA
# 19 B NA 1
# 20 B NA 2
We can use rle to get length of groups that are or are not na, and use purrr::map2 to apply seq if they are NA and get the growing count or just fill in with NA values using rep.
library(tidyverse)
count_na <- function(x) {
r <- rle(is.na(x))
consec <- map2(r$lengths, r$values, ~ if (.y) seq(.x) else rep(NA, .x))
unlist(consec)
}
df %>%
mutate(expected_output = count_na(value))
#> # A tibble: 20 × 3
#> id value expected_output
#> <chr> <dbl> <int>
#> 1 A 1 NA
#> 2 A 2 NA
#> 3 A 3 NA
#> 4 A NA 1
#> 5 A NA 2
#> 6 A 1 NA
#> 7 A 2 NA
#> 8 A NA 1
#> 9 A NA 2
#> 10 A NA 3
#> 11 B 1 NA
#> 12 B NA 1
#> 13 B NA 2
#> 14 B NA 3
#> 15 B NA 4
#> 16 B 1 NA
#> 17 B 2 NA
#> 18 B 3 NA
#> 19 B NA 1
#> 20 B NA 2
Here is a solution using rle:
x <- rle(is.na(df$value))
df$new[is.na(df$value)] <- sequence(x$lengths[x$values])
# A tibble: 20 x 3
id value new
<chr> <dbl> <int>
1 A 1 NA
2 A 2 NA
3 A 3 NA
4 A NA 1
5 A NA 2
6 A 1 NA
7 A 2 NA
8 A NA 1
9 A NA 2
10 A NA 3
11 B 1 NA
12 B NA 1
13 B NA 2
14 B NA 3
15 B NA 4
16 B 1 NA
17 B 2 NA
18 B 3 NA
19 B NA 1
20 B NA 2
Yet another solution:
library(tidyverse)
df %>%
mutate(aux =data.table::rleid(value)) %>%
group_by(id, aux) %>%
mutate(eout = ifelse(is.na(value), row_number(), NA_real_)) %>%
ungroup %>% select(-aux)
#> # A tibble: 20 × 4
#> id value expected_output eout
#> <chr> <dbl> <int> <dbl>
#> 1 A 1 NA NA
#> 2 A 2 NA NA
#> 3 A 3 NA NA
#> 4 A NA 1 1
#> 5 A NA 2 2
#> 6 A 1 NA NA
#> 7 A 2 NA NA
#> 8 A NA 1 1
#> 9 A NA 2 2
#> 10 A NA 3 3
#> 11 B 1 NA NA
#> 12 B NA 1 1
#> 13 B NA 2 2
#> 14 B NA 3 3
#> 15 B NA 4 4
#> 16 B 1 NA NA
#> 17 B 2 NA NA
#> 18 B 3 NA NA
#> 19 B NA 1 1
#> 20 B NA 2 2
This question already has answers here:
Calculate max value across multiple columns by multiple groups
(5 answers)
Closed 2 years ago.
I have data which looks basically like this:
id <- c(1:5)
VolumeA <- c(12, NA, NA, NA, NA)
VolumeB <- c(NA, 34, NA, NA, NA)
VolumeC <- c(NA, NA, 56, NA, NA)
VolumeD <- c(NA, NA, NA, 78, NA)
VolumeE <- c(NA, NA, NA, NA, 90)
df_now <- tibble(id, VolumeA, VolumeB, VolumeC, VolumeD, VolumeE)
df_now
# A tibble: 5 x 6
id VolumeA VolumeB VolumeC VolumeD VolumeE
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 12 NA NA NA NA
2 2 NA 34 NA NA NA
3 3 NA NA 56 NA NA
4 4 NA NA NA 78 NA
5 5 NA NA NA NA 90
In the IRL dataset, there are MANY more Volume[label] columns, but in each row I only need one of them: the largest one. So I want to create a new variable which has the largest value:
Volume <- c(12, 34, 56, 78, 90)
df_desired <- cbind(df_now, Volume)
df_desired
id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
1 1 12 NA NA NA NA 12
2 2 NA 34 NA NA NA 34
3 3 NA NA 56 NA NA 56
4 4 NA NA NA 78 NA 78
5 5 NA NA NA NA 90 90
After looking at the dplyr documentation, I tried this...
library(tidyverse)
df_try <- df_now %>%
mutate(Volume = across(contains("Volume"), max, na.rm = TRUE))
...but got back a tibble of data, not a single column. Can someone tell me how to do this properly?
(Please assume, due to issues with my IRL data too complicated to explain here, that I cannot just gather and spread my data. I want to use a conditional mutate.)
Since you have "MANY more Volume[label] columns", any solution that works over each row (rowwise) or individually on each column (with reduce or Reduce) is going to be much slower than necessary.
df_now %>%
mutate(Volume = do.call(pmax, c(select(., starts_with('Volume')), na.rm = TRUE)))
# # A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
Proof of relative improvement:
Using Reduce or purrr::reduce or anything that will iterate per column (well, with nc columns, then it will iterate nc-1 times):
mypmax <- function(...) { message("mypmax"); pmax(...); }
df_now %>%
mutate(Volume = reduce(select(., starts_with('Volume')), mypmax, na.rm = TRUE))
# mypmax
# mypmax
# mypmax
# mypmax
# # A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
Anything rowwise is doing this once per row, perhaps even worse (assuming more rows than columns in your data:
mymax <- function(...) { message("mymax"); max(...); }
df_now %>%
rowwise %>%
mutate(Volume = mymax(c_across(starts_with('Volume')), na.rm = TRUE))
# mymax
# mymax
# mymax
# mymax
# mymax
# # A tibble: 5 x 7
# # Rowwise:
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
Do it once across all columns, all rows:
mypmax <- function(...) { message("mypmax"); pmax(...); }
df_now %>%
mutate(Volume = do.call(mypmax, c(select(., starts_with('Volume')), na.rm = TRUE)))
# mypmax
# # A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 12 NA NA NA NA 12
# 2 2 NA 34 NA NA NA 34
# 3 3 NA NA 56 NA NA 56
# 4 4 NA NA NA 78 NA 78
# 5 5 NA NA NA NA 90 90
The benchmarking is minor at this scale, but will be more dramatic with larger data:
microbenchmark::microbenchmark(
red = df_now %>% mutate(Volume = reduce(select(., starts_with('Volume')), pmax, na.rm = TRUE)),
row = df_now %>% rowwise %>% mutate(Volume = max(c_across(starts_with('Volume')), na.rm = TRUE)),
sgl = df_now %>% mutate(Volume = do.call(pmax, c(select(., starts_with('Volume')), na.rm = TRUE)))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# red 4.9736 5.36240 7.240561 5.68010 6.19915 70.7482 100
# row 4.5813 5.02020 6.082047 5.34460 5.70345 63.1166 100
# sgl 3.8270 4.18605 5.803043 4.43215 4.76030 65.7217 100
We can use pmax (first posted the pmax solution here). Note that the relative improvement is very small with do.call
library(dplyr)
library(purrr)
df_now %>%
mutate(Volume = reduce(select(., starts_with('Volume')), pmax, na.rm = TRUE))
# A tibble: 5 x 7
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 12 NA NA NA NA 12
#2 2 NA 34 NA NA NA 34
#3 3 NA NA 56 NA NA 56
#4 4 NA NA NA 78 NA 78
#5 5 NA NA NA NA 90 90
Or with c_across and max (using only tidyverse approaches)
df_now %>%
rowwise %>%
mutate(Volume = max(c_across(starts_with('Volume')), na.rm = TRUE))
# A tibble: 5 x 7
# Rowwise:
# id VolumeA VolumeB VolumeC VolumeD VolumeE Volume
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 12 NA NA NA NA 12
#2 2 NA 34 NA NA NA 34
#3 3 NA NA 56 NA NA 56
#4 4 NA NA NA 78 NA 78
#5 5 NA NA NA NA 90 90
Benchmarks
system.time({df_now %>% mutate(Volume = reduce(select(., starts_with('Volume')), pmax, na.rm = TRUE))})
# user system elapsed
# 0.023 0.006 0.029
system.time({df_now %>% rowwise %>% mutate(Volume = max(c_across(starts_with('Volume')), na.rm = TRUE))})
# user system elapsed
# 0.012 0.002 0.015
system.time({df_now %>% mutate(Volume = do.call(pmax, c(select(., starts_with('Volume')), na.rm = TRUE)))})
# user system elapsed
# 0.011 0.001 0.011
NOTE: Not that much difference in timings