Winners within pairs; or vector-valued group_by mutate? - r

I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The code below does work, but it is clunky because it needs an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Maybe related to: Weird group_by + mutate + which.max behavior

You could do:
dat %>%
group_by(pair) %>%
mutate(won = score == max(score),
winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1

Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
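As a side note on why the original ifelse() attempt fails: ifelse() returns a result with the same length as its test, so a length-one test gives a length-one value, which mutate() then recycles across the whole pair. A minimal sketch of a vector-valued alternative using a plain if/else is below. The scores are the rounded values printed in the question (typed in by hand, not regenerated from the seed), and it assumes every pair has exactly two rows.

```r
library(dplyr)

# Rounded scores copied from the question's printed output
dat <- tibble(
  pair  = rep(1:3, each = 2),
  unit  = rep(1:2, 3),
  score = c(-1.40, 0.523, 0.142, -0.847, -0.412, -1.47)
)

# ifelse() is shaped like its test: a length-1 test gives a length-1 result
single <- ifelse(TRUE, c(1, 0), c(0, 1))

# A plain if/else returns the whole length-2 vector for each pair
res <- dat %>%
  group_by(pair) %>%
  mutate(won = if (score[1] > score[2]) c(1, 0) else c(0, 1)) %>%
  ungroup()

res
```

Unlike the max()-based answer, this form depends on there being exactly two units per pair.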

Related

Apply dplyr::starts_with() with lambda function

I have the implementation below:
library(dplyr)
library(tidyr)
dat = data.frame('A' = 1:3, 'C_1' = 1:3, 'C_2' = 1:3, 'M' = 1:3)
The following works:
dat %>% rowwise %>% mutate(Anew = list({function(x) c(x[1]^2, x[2] + 5, x[3] + 1)}(c(M, C_1, C_2)))) %>% ungroup %>% unnest_wider(Anew, names_sep = "")
However, the following does not work when I try to find the column names using dplyr::starts_with():
dat %>% rowwise %>% mutate(Anew = list({function(x) c(x[1]^2, x[2] + 5, x[3] + 1)}(c(M, starts_with('C_'))))) %>% ungroup %>% unnest_wider(Anew, names_sep = "")
Any pointers on how to correctly apply starts_with() in this context would be very helpful.
PS: This is a continuation of my earlier post, Apply custom function that returns multiple values after dplyr::rowwise()
starts_with() must be used within a selecting function, so we can wrap it in select() as shown below. across() is also a selecting function, so we could alternatively use across(M | starts_with('C_')) in place of select(...). c_across() is a selecting function too, but it does not preserve names. Note that in dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick().
dat %>%
rowwise %>%
mutate(Anew = list({function(x) c(x[1]^2, x[2] + 5, x[3] + 1)}
(select(cur_data(), M, starts_with('C_'))))) %>%
ungroup %>%
unnest_wider(Anew, names_sep = "")
## # A tibble: 3 × 7
## A C_1 C_2 M AnewM AnewC_1 AnewC_2
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 6 2
## 2 2 2 2 2 4 7 3
## 3 3 3 3 3 9 8 4
Here group_modify would also work and allow the use of formula notation to specify an anonymous function. The indexes in the anonymous function have been reordered to correspond to the order in the input.
dat %>%
group_by(A) %>%
group_modify(~ cbind(.x, Anew = c(.x[3]^2, .x[1] + 5, .x[2] + 1))) %>%
ungroup
## # A tibble: 3 × 7
## A C_1 C_2 M Anew.M Anew.C_1 Anew.C_2
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 6 2
## 2 2 2 2 2 4 7 3
## 3 3 3 3 3 9 8 4
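If the names of the new columns do not matter, a shorter variant is to pass the selection to c_across(), which accepts tidyselect helpers directly. This is only a sketch under that assumption; the helper name f is introduced here for readability and is not part of the original post.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(A = 1:3, C_1 = 1:3, C_2 = 1:3, M = 1:3)

# f is a hypothetical name for the question's anonymous function
f <- function(x) c(x[1]^2, x[2] + 5, x[3] + 1)

res <- dat %>%
  rowwise() %>%
  # c_across() returns an unnamed vector in selection order: M, C_1, C_2
  mutate(Anew = list(f(c_across(c(M, starts_with("C_")))))) %>%
  ungroup() %>%
  unnest_wider(Anew, names_sep = "")

res
```

Because c_across() drops the names, the unnested columns are numbered by position rather than labelled with the source column names.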
If we wrap starts_with() in c_across(), and assuming there is a third column that starts with C_ (see the data at the end), an anonymous function applied on the fly would work:
library(dplyr)
library(tidyr)
dat %>%
rowwise %>%
mutate(Anew = list((function(x) c(x[1]^2, x[2] + 5, x[3] +
1))(c_across(starts_with("C_"))))) %>%
unnest_wider(Anew, names_sep = "")
-output
# A tibble: 3 × 8
A C_1 C_2 C_3 M Anew1 Anew2 Anew3
<int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 1 1 1 1 1 6 2
2 2 2 2 2 2 4 7 3
3 3 3 3 3 3 9 8 4
Or, instead of using rowwise, we may create a named list of functions and apply them column-wise with across (which would be more efficient):
fns <- list(C_1 = function(x) x^2, C_2 = function(x) x + 5,
C_3 = function(x) x + 1)
dat %>%
mutate(across(starts_with("C_"),
~ fns[[cur_column()]](.x), .names = "Anew{seq_along(.fn)}"))
-output
A C_1 C_2 C_3 M Anew1 Anew2 Anew3
1 1 1 1 1 1 1 6 2
2 2 2 2 2 2 4 7 3
3 3 3 3 3 3 9 8 4
data
dat <- data.frame('A' = 1:3, 'C_1' = 1:3, 'C_2' = 1:3, C_3 = 1:3, 'M' = 1:3)

R: Aggregate data in sliding window into new columns

Let's say I have a data frame like this:
df <- tibble(ID = c(1, 1, 1, 1, 1), v1 = c(3, 5, 1, 0, 1), v2 = c(10, 6, 1, 20, 23), Time = c(as.POSIXct("1900-01-01 10:00:00"), as.POSIXct("1900-01-01 11:00:00"), as.POSIXct("1900-01-01 13:00:00"), as.POSIXct("1900-01-01 16:00:00"), as.POSIXct("1900-01-01 20:00:00"))) %>% group_by(ID)
# A tibble: 5 x 4
# Groups: ID [1]
ID v1 v2 Time
<dbl> <dbl> <dbl> <dttm>
1 1 3 10 1900-01-01 10:00:00
2 1 5 6 1900-01-01 11:00:00
3 1 1 1 1900-01-01 13:00:00
4 1 0 20 1900-01-01 16:00:00
5 1 1 23 1900-01-01 20:00:00
In words, this is a simple time series for a specific ID with two values, v1 and v2, per time point.
As is quite common in machine learning, I want to aggregate the last n time steps into one feature vector. For each previous time step there should be a time reference, in hours, for when that data point occurred. For the first row, where no previous time step is available, the data should be filled with zeros.
Let's make an example. In this case n=2, that is, I want to aggregate the current time step (t2) and the previous one (t1) together:
# A tibble: 5 x 6
ID v1_t1 v2_t1 time_t1 v1_t2 v2_t2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 NA 3 10
2 1 3 10 1 5 6
3 1 5 6 2 1 1
4 1 1 1 3 0 20
5 1 0 20 4 1 23
I want to keep this as generic as possible, so that both n and the number of data columns can change. Any idea how to do this?
Thanks :)
Using dplyr::lag and dplyr::across you could do:
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
df %>%
group_by(ID) %>%
mutate(time_t1 = lubridate::hour(Time) - lag(lubridate::hour(Time))) %>%
mutate(across(c(v1, v2), .fns = list(t2 = ~.x, t1 = ~lag(.x, default = 0)))) %>%
select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID time_t1 v1_t2 v1_t1 v2_t2 v2_t1
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
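One caveat worth illustrating: subtracting hour-of-day values only gives the elapsed hours when both time points fall on the same day. A possible variation, sketched below on a small made-up example that crosses midnight, is to difference the timestamps themselves with difftime(). The data frame df2 and its values are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical two-row example that crosses midnight
times <- as.POSIXct(c("1900-01-01 22:00:00", "1900-01-02 01:00:00"), tz = "UTC")
df2 <- tibble(ID = 1, Time = times)

res <- df2 %>%
  group_by(ID) %>%
  # elapsed hours between consecutive rows, independent of the calendar day
  mutate(time_t1 = as.numeric(difftime(Time, dplyr::lag(Time), units = "hours"))) %>%
  ungroup()

res
```

Here the second row is three hours after the first, whereas the hour-of-day difference would be 1 - 22.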
UPDATE: Here is a more generic approach that uses function factories to create lists of functions, which can then be passed to the .fns argument of across. I haven't tested the more general case, but it should work for any n (the number of lags to include) and for any number of data columns. Note that in this version t1 labels the current time step and t2 the previous one, which is the reverse of the labelling used in the question.
library(dplyr, warn=FALSE)
library(lubridate, warn=FALSE)
fun_factory1 <- function(n) {
function(x) {
lubridate::hour(x) - lag(lubridate::hour(x), n = n)
}
}
fun_factory2 <- function(n) {
function(x) {
lag(x, n = n, default = 0)
}
}
n <- 2
fns1 <- lapply(seq(n - 1), fun_factory1)
names(fns1) <- paste0("t", seq(n - 1))
fns2 <- lapply(seq(n) - 1, fun_factory2)
names(fns2) <- paste0("t", seq(n))
df %>%
group_by(ID) %>%
mutate(across(Time, .fns = fns1)) %>%
mutate(across(c(v1, v2), .fns = fns2)) %>%
select(-v1, -v2, -Time)
#> # A tibble: 5 × 6
#> # Groups: ID [1]
#> ID Time_t1 v1_t1 v1_t2 v2_t1 v2_t2
#> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 3 0 10 0
#> 2 1 1 5 3 6 10
#> 3 1 2 1 5 1 6
#> 4 1 3 0 1 20 1
#> 5 1 4 1 0 23 20
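If the labels should follow the question's convention (the highest index is the current step), one option is to build the lag functions in reverse order. The following is a sketch under that assumption, shown for n = 3 and without the time column; make_lag is a hypothetical helper name.

```r
library(dplyr)

df <- tibble(ID = 1, v1 = c(3, 5, 1, 0, 1), v2 = c(10, 6, 1, 20, 23))

# Function factory: returns a function that lags its input by k steps
make_lag <- function(k) {
  force(k)
  function(x) dplyr::lag(x, n = k, default = 0)
}

n <- 3
lags <- rev(seq_len(n) - 1)            # n - 1, ..., 1, 0
fns <- lapply(lags, make_lag)
names(fns) <- paste0("t", seq_len(n))  # t1 = oldest step, t<n> = current step

res <- df %>%
  group_by(ID) %>%
  mutate(across(c(v1, v2), .fns = fns)) %>%
  ungroup() %>%
  select(-v1, -v2)

res
```

With this ordering, v1_t3 holds the current values and v1_t1 the values from two steps back, padded with zeros at the start.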

R Regex capture to remove/keep columns with repeats in their column names

This is an example dataframe
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
I have column names in my dataframe in R in this format
\S+_T\d+|\S+_T\d+
The syntax is something like (Name)_ (T)(Number) | (Name)_ (T)(Number)
Step 1) I want to select columns which contain the same (T)(Number) on both sides of the "|"
I did this with some manual labor :
means_t0 <- means2 %>% select(matches("\\S+_T0\\|\\S+_T0")) %>% rownames_to_column("id_cp_interaction")
means_t1 <- means2 %>% select(matches("\\S+_T1\\|\\S+_T1")) %>% rownames_to_column("id_cp_interaction")
means_t5 <- means2 %>% select(matches("\\S+_T5\\|\\S+_T5")) %>% rownames_to_column("id_cp_interaction")
means3 <- full_join(means_t0, means_t1) %>% full_join(means_t5)
This gives me what I want and it was easy to do because I only had 3 types - T0, T1 and T5. What do I do if I had a huge number?
Step 2) From the output of Step 1, I want the negation of the previous step, i.e. select only those columns whose Names are not the same on both sides
For example, B_T0|B_T0 should be removed but B_T0|Fibro_T0 should be retained
Is there a way to regex-capture the part in front of the pipe (|) and match it against the part after the pipe (|)?
Thank you
If you have that much information in your column names, I like to transform the data into the long format and then separate the info from the column name into several columns. Then it's easy to filter by these columns:
means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)
library(tidyr)
library(dplyr)
library(stringr)
res <- means2 %>%
pivot_longer(
cols = -id_cp_interaction,
names_to = "names",
values_to = "values"
) %>%
mutate(
celltype_1 = str_extract(names, "^[^_]*"),
# the pipe must be escaped; [0-9]+ also handles multi-digit timepoints
timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"),
celltype_2 = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
timepoint_2 = str_extract(names, "[0-9]+$")
)
head(res, n = 7)
#> # A tibble: 7 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|B… 1.68 B 0 B 0
#> 2 1 B_T0|B… 19.3 B 0 B 1
#> 3 1 B_T0|F… 10.6 B 0 Fibro 0
#> 4 1 B_T5|E… 12.5 B 5 Endo 5
#> 5 1 Macro_… 2.84 Macro 1 Fibro 1
#> 6 2 B_T0|B… 2.17 B 0 B 0
#> 7 2 B_T0|B… 10.1 B 0 B 1
# only keep interactions of different cell types
res %>%
filter(celltype_1 != celltype_2) %>%
head()
#> # A tibble: 6 × 7
#> id_cp_interaction names values celltype_1 timepoint_1 celltype_2 timepoint_2
#> <int> <chr> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 B_T0|F… 10.6 B 0 Fibro 0
#> 2 1 B_T5|E… 12.5 B 5 Endo 5
#> 3 1 Macro_… 2.84 Macro 1 Fibro 1
#> 4 2 B_T0|F… 1.47 B 0 Fibro 0
#> 5 2 B_T5|E… 11.3 B 5 Endo 5
#> 6 2 Macro_… 13.0 Macro 1 Fibro 1
Created on 2022-09-19 by the reprex package (v1.0.0)
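To address the literal question about capturing one side of the pipe and matching it on the other, a regex backreference can do this directly on the column names. The sketch below uses base R grepl() and assumes cell-type names contain no underscore or pipe; the integer matrix is placeholder data, not the random values from the question.

```r
# Placeholder data with the question's column names
means2 <- as.data.frame(matrix(1:25, nrow = 5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0",
                   "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
nms <- names(means2)

# Step 1: \\1 refers back to the timepoint number captured before the pipe
same_time <- grepl("^[^|]+_T(\\d+)\\|[^|]+_T\\1$", nms, perl = TRUE)

# Step 2: \\1 refers back to the cell-type name captured before the pipe
same_name <- grepl("^([^|_]+)_T\\d+\\|\\1_T\\d+$", nms, perl = TRUE)

# Keep columns with matching timepoints but different cell types
kept <- means2[, same_time & !same_name, drop = FALSE]
names(kept)
```

This avoids writing one pattern per timepoint, at the cost of a denser regular expression than the pivot_longer approach.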

Replace NA in multiple columns by group in R

df <- data.frame(A = c(NA,5,4,NA,1),
B = c(1,NA,1,1,NA),
C = c(3,3,NA,NA,6),
D = c(0,0,1,1,1))
I have a dataset like the one above and am trying to replace the NA values with the mean of the subgroup defined by the target variable D.
I tried the following code to replace them one column at a time.
df <- df %>%
group_by(D) %>%
mutate(
A = ifelse(is.na(A),
mean(A, na.rm=TRUE),A)
) %>%
mutate(
B = ifelse(is.na(B),
mean(B, na.rm=TRUE),B)
) %>%
mutate(
C = ifelse(is.na(C),
mean(C, na.rm=TRUE),C)
)
Is there a more efficient way to impute the mean values?
Perhaps this 'tidyverse' approach will suit. The first pipeline reproduces your code for comparison; the second does the same with across():
library(tidyverse)
df <- data.frame(A = c(NA,5,4,NA,1),
B = c(1,NA,1,1,NA),
C = c(3,3,NA,NA,6),
D = c(0,0,1,1,1))
df_output <- df %>%
group_by(D) %>%
mutate(
A = ifelse(is.na(A),
mean(A, na.rm=TRUE),A)
) %>%
mutate(
B = ifelse(is.na(B),
mean(B, na.rm=TRUE),B)
) %>%
mutate(
C = ifelse(is.na(C),
mean(C, na.rm=TRUE),C)
)
df_output
#> # A tibble: 5 × 4
#> # Groups: D [2]
#> A B C D
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 3 0
#> 2 5 1 3 0
#> 3 4 1 6 1
#> 4 2.5 1 6 1
#> 5 1 1 6 1
df_output_2 <- df %>%
group_by(D) %>%
mutate(across(A:C, ~replace_na(.x, mean(.x, na.rm = TRUE))))
df_output_2
#> # A tibble: 5 × 4
#> # Groups: D [2]
#> A B C D
#> <dbl> <dbl> <dbl> <dbl>
#> 1 5 1 3 0
#> 2 5 1 3 0
#> 3 4 1 6 1
#> 4 2.5 1 6 1
#> 5 1 1 6 1
all_equal(df_output, df_output_2)
#> [1] TRUE
Created on 2022-10-04 by the reprex package (v2.0.1)
I encountered the same problem before, but my dataset was bigger. In such cases, I use mutate_all (written here with a formula lambda, since funs() is deprecated):
df %>% group_by(D) %>% mutate_all(~ replace(., is.na(.), mean(., na.rm = TRUE)))
A B C D
<dbl> <dbl> <dbl> <dbl>
1 5 1 3 0
2 5 1 3 0
3 4 1 6 1
4 2.5 1 6 1
5 1 1 6 1
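For wider data, the same idea can be written with across(everything()), which skips the grouping column automatically. This is a sketch combining the two answers above rather than a new method.

```r
library(dplyr)

df <- data.frame(A = c(NA, 5, 4, NA, 1),
                 B = c(1, NA, 1, 1, NA),
                 C = c(3, 3, NA, NA, 6),
                 D = c(0, 0, 1, 1, 1))

res <- df %>%
  group_by(D) %>%
  # everything() covers all non-grouping columns, so D itself is left alone
  mutate(across(everything(), ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE)))) %>%
  ungroup()

res
```

If some columns are not numeric, the selection would need to be narrowed, for example with where(is.numeric).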

How to keep other values unchanged with dplyr's recode_factor

In the example below, recoding some values turns all the others into NA. How can I keep the other values unchanged?
library(tibble)
library(dplyr)
test <- tibble(
test_vec = as.factor(c(1, 2, 3))
)
test
#> # A tibble: 3 x 1
#> test_vec
#> <fct>
#> 1 1
#> 2 2
#> 3 3
test %>%
mutate(test_vec = recode_factor(test_vec, `3` = 4))
#> # A tibble: 3 x 1
#> test_vec
#> <fct>
#> 1 <NA>
#> 2 <NA>
#> 3 4
You need to make the replacement the same type as the original values. Factor levels are character strings, so quote the replacement:
test %>%
mutate(test_vec = recode_factor(test_vec, "3" = "4"))
# A tibble: 3 x 1
test_vec
<fct>
1 1
2 2
3 4
Using fct_recode
library(forcats)
library(dplyr)
test %>%
mutate(test_vec = fct_recode(test_vec, `4` = '3'))
-output
# A tibble: 3 x 1
# test_vec
# <fct>
#1 1
#2 2
#3 4
Alternatively, to avoid the NA values while keeping numeric replacements, you can list the other values in the function as well.
test %>%
mutate(test_vec = recode_factor(test_vec, `1` = 1, `2` = 2, `3` = 4))
Result
# A tibble: 3 x 1
test_vec
<fct>
1 1
2 2
3 4
Another way to do it is using case_when, but for this you have to start from numerical values.
Here is an example that starts from numerical values and converts them to a factor afterwards.
test <- tibble(
test_vec = (c(1, 2, 3)))
test %>%
mutate(test_vec = case_when( test_vec != 3 ~ test_vec,
test_vec == 3 ~ 4)) %>%
mutate(across(test_vec,factor))
Result
# A tibble: 3 x 1
test_vec
<fct>
1 1
2 2
3 4
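For completeness, a base R alternative (not taken from the answers above) is to rename the level directly, which leaves every other level untouched. A minimal sketch on a bare factor:

```r
x <- factor(c(1, 2, 3))

# Rename only the level "3"; the other levels are not affected
levels(x)[levels(x) == "3"] <- "4"

x
```

Inside a pipeline, the same assignment could be wrapped in a small helper function and called from mutate().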
