tidyverse: Simulating random sample with nested factor

I want to simulate a random sample with a nested factor. The factor Dept has two levels, A and B. Level A has two nested levels, A1 and A2; level B has three nested levels, B1, B2 and B3. I want to simulate a random sample for the dates 2022-01-01 to 2022-01-31 using R. Part of the desired output is given below (2022-01-01 to 2022-01-03 only, for reference).
library(tibble)
set.seed(12345)
df1 <-
tibble(
Date = c(rep("2022-01-01", 5), rep("2022-01-02", 4), rep("2022-01-03", 4))
, Dept = c("A", "A", "B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "B")
, Prog = c("A1", "A2", "B1", "B2", "B3", "A1", "B1", "B2", "B3", "A1", "A2", "B2", "B3")
, Amount = runif(n = 13, min = 50000, max = 100000)
)
df1
#> # A tibble: 13 x 4
#> Date Dept Prog Amount
#> <chr> <chr> <chr> <dbl>
#> 1 2022-01-01 A A1 86045.
#> 2 2022-01-01 A A2 93789.
#> 3 2022-01-01 B B1 88049.
#> 4 2022-01-01 B B2 94306.
#> 5 2022-01-01 B B3 72824.
#> 6 2022-01-02 A A1 58319.
#> 7 2022-01-02 B B1 66255.
#> 8 2022-01-02 B B2 75461.
#> 9 2022-01-02 B B3 86385.
#> 10 2022-01-03 A A1 99487.
#> 11 2022-01-03 A A2 51727.
#> 12 2022-01-03 B B2 57619.
#> 13 2022-01-03 B B3 86784.

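To sample randomly, build the full Date x Dept x Prog grid with crossing(), drop the combination that does not exist (A3), and then slice a random number of rows for each Date. Note that slice(seq_len(k)) keeps the first k rows of each date, so only the number of rows is random, not which programs are kept.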
library(dplyr)
library(tidyr)
library(stringr)
crossing(Date = seq(as.Date("2022-01-01"), as.Date("2022-01-31"),
by = "1 day"), Dept = c("A", "B"), Prog = 1:3) %>%
mutate(Prog = str_c(Dept, Prog)) %>%
filter(Prog != "A3") %>%
mutate(Amount = runif(n = n(), min = 50000, max = 100000)) %>%
group_by(Date) %>%
slice(seq_len(sample(row_number(), 1))) %>%
ungroup()
Output:
# A tibble: 102 × 4
Date Dept Prog Amount
<date> <chr> <chr> <dbl>
1 2022-01-01 A A1 83964.
2 2022-01-01 A A2 93428.
3 2022-01-01 B B1 85187.
4 2022-01-01 B B2 79144.
5 2022-01-01 B B3 65784.
6 2022-01-02 A A1 86014.
7 2022-01-03 A A1 76060.
8 2022-01-03 A A2 56412.
9 2022-01-03 B B1 87365.
10 2022-01-03 B B2 66169.
# … with 92 more rows
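If both the number of programs and the choice of programs should vary by date, one possible variation is sketched below. It assumes a small lookup table of valid Dept/Prog pairs (called progs here, a name that is not in the original) and keeps a random subset of 1 to 5 programs per date. It is a sketch under those assumptions, not the method used in the answer above.

```r
library(dplyr)
library(tidyr)

set.seed(12345)

# Assumed lookup table describing the nesting: each Prog belongs to one Dept
progs <- tibble(
  Dept = c("A", "A", "B", "B", "B"),
  Prog = c("A1", "A2", "B1", "B2", "B3")
)

dates <- seq(as.Date("2022-01-01"), as.Date("2022-01-31"), by = "1 day")

sim <- crossing(Date = dates, Prog = progs$Prog) %>%
  left_join(progs, by = "Prog") %>%
  group_by(Date) %>%
  # inner sample() draws how many rows to keep, outer sample() picks which ones
  filter(row_number() %in% sample(n(), sample(n(), 1))) %>%
  ungroup() %>%
  mutate(Amount = runif(n(), min = 50000, max = 100000)) %>%
  select(Date, Dept, Prog, Amount)

sim
```

Because the valid pairs come from a lookup table, no invalid combination such as A3 is ever generated, so there is nothing to filter out afterwards.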

Related

dplyr join with OR condition?

I am wondering whether there is a way, preferably in the tidyverse, to join two data frames based on OR conditions.
There are two data frames: df_obs and df_event. The join should happen if either
a) obs_id and event_id match, and obs_date, event_date, or both are NA;
OR
b) obs_date and event_date are identical, and obs_id, event_id, or both are NA.
The match should not happen if obs_id differs from event_id (when both are not NA) OR
obs_date differs from event_date (when both are not NA).
The result should look like df_res below: the column 'event' from df_event is added to df_obs.
I have seen the answer to this question, but maybe there is a way that avoids SQL?
df_obs <- tibble::tribble(
~obs, ~obs_date, ~obs_id,
"a1", NA, 10L,
"a2", "01/01/2000", NA,
"b", "02/01/2000", NA,
"a3", "03/01/2000", 10L
)
df_obs
#> # A tibble: 4 × 3
#> obs obs_date obs_id
#> <chr> <chr> <int>
#> 1 a1 <NA> 10
#> 2 a2 01/01/2000 NA
#> 3 b 02/01/2000 NA
#> 4 a3 03/01/2000 10
df_event <- tibble::tribble(
~event, ~event_date, ~event_id,
"A", "01/01/2000", 10L,
"B", "02/01/2000", NA
)
df_event
#> # A tibble: 2 × 3
#> event event_date event_id
#> <chr> <chr> <int>
#> 1 A 01/01/2000 10
#> 2 B 02/01/2000 NA
df_res <- tibble::tribble(
~obs, ~obs_date, ~obs_id, ~event,
"a1", NA, 10L, "A",
"a2", "01/01/2000", NA, "A",
"b", "02/01/2000", NA, "B",
"a3", "03/01/2000", 10L, NA
)
df_res
#> # A tibble: 4 × 4
#> obs obs_date obs_id event
#> <chr> <chr> <int> <chr>
#> 1 a1 <NA> 10 A
#> 2 a2 01/01/2000 NA A
#> 3 b 02/01/2000 NA B
#> 4 a3 03/01/2000 10 <NA>
Created on 2022-09-13 with reprex v2.0.2

I can only come up with this solution:
library(dplyr)
df_obs %>%
left_join(df_event, by = c("obs_id" = "event_id"), na_matches='never') %>%
mutate(event = ifelse(!(is.na(obs_date)|is.na(event_date)|obs_date == event_date), NA, event)) %>%
select(-event_date) %>%
left_join(df_event, by = c("obs_date" = "event_date"), na_matches='never') %>%
mutate(event.y = ifelse(!(is.na(obs_id)|is.na(event_id)|obs_id == event_id), NA, event.y)) %>%
select(-event_id) %>%
mutate(event = ifelse(is.na(event.x), event.y, event.x)) %>%
select(-c(event.x, event.y))
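The rules in the question can also be written more directly as a cross join followed by a filter. The sketch below assumes dplyr 1.1.0 or later (for cross_join()) and that obs uniquely identifies each row of df_obs. The helper names compatible() and same() are made up for illustration.

```r
library(dplyr)

df_obs <- tibble::tribble(
  ~obs, ~obs_date,    ~obs_id,
  "a1", NA,           10L,
  "a2", "01/01/2000", NA,
  "b",  "02/01/2000", NA,
  "a3", "03/01/2000", 10L
)

df_event <- tibble::tribble(
  ~event, ~event_date,  ~event_id,
  "A",    "01/01/2000", 10L,
  "B",    "02/01/2000", NA
)

# Hypothetical helper: TRUE when two values do not contradict each other
# (they are equal, or at least one of them is NA)
compatible <- function(x, y) is.na(x) | is.na(y) | x == y

# Hypothetical helper: TRUE only when both values are present and equal
same <- function(x, y) !is.na(x) & !is.na(y) & x == y

matches <- df_obs %>%
  cross_join(df_event) %>%   # pair every observation with every event
  filter(
    compatible(obs_id, event_id),
    compatible(obs_date, event_date),
    same(obs_id, event_id) | same(obs_date, event_date)
  ) %>%
  select(obs, event)

# Add the matched event back; unmatched observations get NA
res <- left_join(df_obs, matches, by = "obs")
res
```

A cross join grows with the product of the two row counts, so this is only practical for small tables. If an observation matched more than one event, its row would be repeated in res.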

How to check if pairs remain the same between years?

I have a data frame with pairs of individual birds (male and female) that were observed in several years. I am trying to figure out whether these pairs have changed from one year to the next so that I can do some further analyses.
My data is structured like this:
dat <- tibble(year = rep(1:3, each = 3),
Male = c("A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"),
Female = c("X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"))
# A tibble: 9 x 3
year Male Female
<int> <chr> <chr>
1 1 A1 X1
2 1 B1 Y1
3 1 C1 Z1
4 2 A1 X1
5 2 B1 Y2
6 2 C1 Z2
7 3 A1 X1
8 3 B1 Y2
9 3 C2 Z2
And my expected output is something like:
# A tibble: 9 x 5
year Male Female male_state female_state
<int> <chr> <chr> <chr> <chr>
1 1 A1 X1 new new
2 1 B1 Y1 new new
3 1 C1 Z1 new new
4 2 A1 X1 reunited reunited
5 2 B1 Y2 divorced new
6 2 C1 Z2 divorced new
7 3 A1 X1 reunited reunited
8 3 B1 Y2 reunited reunited
9 3 C2 Z2 new divorced
I cannot figure out how to check whether a value from a different column is the same in the year before (e.g. if the male ID is the same for a certain female in year 2 or 3 as in the year prior). Any ideas?

This (probably overcomplicated) pipe produces the following output.
library(dplyr)
dat <- tibble(year = rep(1:3, each = 3),
Male = c("A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"),
Female = c("X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"))
dat %>%
mutate(pair=paste0(Male,Female)) %>%
arrange(pair,year) %>%
mutate(check = if_else((pair==lag(pair)) & (year>lag(year)), 'old couple', 'new couple')) %>%
mutate(check = if_else(is.na(check), 'new couple', check)) %>%
mutate(divorced = if_else((Male == lag(Male)) & (Female != lag(Female)), 'divorce', '')) %>%
mutate(divorced = if_else(is.na(divorced), '', divorced))
OUTPUT:
# A tibble: 9 × 6
year Male Female pair check divorced
<int> <chr> <chr> <chr> <chr> <chr>
1 1 A1 X1 A1X1 new couple ""
2 2 A1 X1 A1X1 old couple ""
3 3 A1 X1 A1X1 old couple ""
4 1 B1 Y1 B1Y1 new couple ""
5 2 B1 Y2 B1Y2 new couple "divorce"
6 3 B1 Y2 B1Y2 old couple ""
7 1 C1 Z1 C1Z1 new couple ""
8 2 C1 Z2 C1Z2 new couple "divorce"
9 3 C2 Z2 C2Z2 new couple ""

Try this:
library(tidyverse)
dat <- tibble(
year = rep(1:3, each = 3),
Male = c(
"A1", "B1", "C1",
"A1", "B1", "C1",
"A1", "B1", "C2"
),
Female = c(
"X1", "Y1", "Z1",
"X1", "Y2", "Z2",
"X1", "Y2", "Z2"
)
)
dat |>
mutate(pairing = str_c(Male, "|", Female)) |>
add_count(pairing) |>
group_by(pairing) |>
mutate(male_state = if_else(pairing == lag(pairing), "reunited", NA_character_),
female_state = if_else(pairing == lag(pairing), "reunited", NA_character_)) |>
group_by(Male) |>
mutate(
male_state = if_else(row_number() == 1, "new", male_state),
male_state = if_else(is.na(male_state), "divorced", male_state)
) |>
group_by(Female) |>
mutate(
female_state = if_else(row_number() == 1, "new", female_state),
female_state = if_else(is.na(female_state), "divorced", female_state)
) |>
arrange(year, Male)
#> # A tibble: 9 × 7
#> # Groups: Female [5]
#> year Male Female pairing n male_state female_state
#> <int> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 1 A1 X1 A1|X1 3 new new
#> 2 1 B1 Y1 B1|Y1 1 new new
#> 3 1 C1 Z1 C1|Z1 1 new new
#> 4 2 A1 X1 A1|X1 3 reunited reunited
#> 5 2 B1 Y2 B1|Y2 2 divorced new
#> 6 2 C1 Z2 C1|Z2 1 divorced new
#> 7 3 A1 X1 A1|X1 3 reunited reunited
#> 8 3 B1 Y2 B1|Y2 2 reunited reunited
#> 9 3 C2 Z2 C2|Z2 1 new divorced
Created on 2022-05-03 by the reprex package (v2.0.1)
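Another way to express "same partner as the year before" is to join each year's pairs to the previous year's pairs. The sketch below assumes each bird appears at most once per year and that years are consecutive integers. Unlike the answer above, it labels a bird "new" whenever it was not observed in the immediately preceding year. The names prev, prev_female and prev_male are introduced only for this example.

```r
library(dplyr)

dat <- tibble(
  year   = rep(1:3, each = 3),
  Male   = c("A1", "B1", "C1", "A1", "B1", "C1", "A1", "B1", "C2"),
  Female = c("X1", "Y1", "Z1", "X1", "Y2", "Z2", "X1", "Y2", "Z2")
)

# Shift every pairing forward one year so it lines up with the following year
prev <- dat %>% mutate(year = year + 1L)

res <- dat %>%
  # partner each male had in the previous year (NA if he was not observed)
  left_join(select(prev, year, Male, prev_female = Female),
            by = c("year", "Male")) %>%
  # partner each female had in the previous year (NA if she was not observed)
  left_join(select(prev, year, Female, prev_male = Male),
            by = c("year", "Female")) %>%
  mutate(
    male_state = case_when(
      is.na(prev_female)    ~ "new",
      prev_female == Female ~ "reunited",
      TRUE                  ~ "divorced"
    ),
    female_state = case_when(
      is.na(prev_male)  ~ "new",
      prev_male == Male ~ "reunited",
      TRUE              ~ "divorced"
    )
  ) %>%
  select(year, Male, Female, male_state, female_state)

res
```

For the example data this reproduces the male_state and female_state columns of the expected output in the question.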

"spread" multiple variables using pivot_wider()

What is the best way to "spread" multiple variables using pivot_wider() in the development version of tidyr?
# https://tidyr.tidyverse.org/dev/reference/pivot_wider.html
# devtools::install_github("tidyverse/tidyr")
library(tidyverse)
library(lubridate) # for as_datetime()
have <- tibble::tribble(
~user_id, ~question, ~answer, ~timestamp,
1, "q1", "a1", "2019-07-22 16:54:43",
1, "q2", "a2", "2019-07-22 16:55:43",
2, "q1", "a1", "2019-07-22 16:56:43",
2, "q2", "a2", "2019-07-22 16:57:43",
3, "q1", "a1", "2019-07-22 16:58:43",
3, "q2", "a2", "2019-07-22 16:59:43"
) %>%
mutate(timestamp = as_datetime(timestamp))
have
# # A tibble: 6 x 4
# user_id question answer timestamp
# <dbl> <chr> <chr> <dttm>
# 1 1 q1 a1 2019-07-22 16:54:43
# 2 1 q2 a2 2019-07-22 16:55:43
# 3 2 q1 a1 2019-07-22 16:56:43
# 4 2 q2 a2 2019-07-22 16:57:43
# 5 3 q1 a1 2019-07-22 16:58:43
# 6 3 q2 a2 2019-07-22 16:59:43
want <- tibble::tribble(
~user_id, ~q1, ~q2, ~timestamp_q1, ~timestamp_q2,
1, "a1", "a2", "2019-07-22 16:54:43", "2019-07-22 16:55:43",
2, "a1", "a2", "2019-07-22 16:56:43", "2019-07-22 16:57:43",
3, "a1", "a2", "2019-07-22 16:58:43", "2019-07-22 16:59:43"
) %>%
mutate(timestamp_q1 = as_datetime(timestamp_q1)) %>%
mutate(timestamp_q2 = as_datetime(timestamp_q2))
want
# A tibble: 3 x 5
# user_id q1 q2 timestamp_q1 timestamp_q2
# <dbl> <chr> <chr> <dttm> <dttm>
#1 1 a1 a2 2019-07-22 16:54:43 2019-07-22 16:55:43
#2 2 a1 a2 2019-07-22 16:56:43 2019-07-22 16:57:43
#3 3 a1 a2 2019-07-22 16:58:43 2019-07-22 16:59:43
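This works when spreading a single pair of variables, but it fails here: timestamp is not named in names_from or values_from, so it is treated as an identifying variable alongside user_id, when only user_id should identify rows.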
have %>%
pivot_wider(names_from = question, values_from = answer)
# # A tibble: 6 x 4
# user_id timestamp q1 q2
# <dbl> <dttm> <chr> <chr>
# 1 1 2019-07-22 16:54:43 a1 NA
# 2 1 2019-07-22 16:55:43 NA a2
# 3 2 2019-07-22 16:56:43 a1 NA
# 4 2 2019-07-22 16:57:43 NA a2
# 5 3 2019-07-22 16:58:43 a1 NA
# 6 3 2019-07-22 16:59:43 NA a2

You can include multiple columns in the values_from argument to spread multiple columns in one go:
have %>%
pivot_wider(
id_cols = user_id,
names_from = question,
values_from = c(answer, timestamp)
) %>%
# remove the 'answer_' prefix from those cols
rename_all(~ str_remove(., "answer_"))
Output:
# A tibble: 3 x 5
user_id q1 q2 timestamp_q1 timestamp_q2
<dbl> <chr> <chr> <dttm> <dttm>
1 1 a1 a2 2019-07-22 16:54:43 2019-07-22 16:55:43
2 2 a1 a2 2019-07-22 16:56:43 2019-07-22 16:57:43
3 3 a1 a2 2019-07-22 16:58:43 2019-07-22 16:59:43
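As a variation, the prefix can be removed with rename_with(), which targets only the answer_ columns and has superseded rename_all() in dplyr 1.0.0 and later. This is a minimal sketch of the same idea, using base as.POSIXct() instead of as_datetime() so that it does not depend on lubridate.

```r
library(dplyr)
library(tidyr)

have <- tibble::tribble(
  ~user_id, ~question, ~answer, ~timestamp,
  1, "q1", "a1", "2019-07-22 16:54:43",
  1, "q2", "a2", "2019-07-22 16:55:43",
  2, "q1", "a1", "2019-07-22 16:56:43",
  2, "q2", "a2", "2019-07-22 16:57:43",
  3, "q1", "a1", "2019-07-22 16:58:43",
  3, "q2", "a2", "2019-07-22 16:59:43"
) %>%
  mutate(timestamp = as.POSIXct(timestamp, tz = "UTC"))

wide <- have %>%
  pivot_wider(
    id_cols = user_id,
    names_from = question,
    values_from = c(answer, timestamp)
  ) %>%
  # strip the prefix only from columns that start with "answer_"
  rename_with(~ sub("^answer_", "", .x), starts_with("answer_"))

names(wide)
#> [1] "user_id"      "q1"           "q2"           "timestamp_q1" "timestamp_q2"
```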

how to create new variables from one variable using two rules

I would appreciate any help to create new variables from one variable.
Specifically, I need to end up with one row per ID and several columns of E, where each new column (E1, E2, E3) holds the successive values of E for that ID. I tried doing this with melt followed by spread, but I get the error:
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
Additionally, I tried the solutions discussed here and here but these did not work for my case because I need to be able to create row identifiers for rows (4, 1, 2), (7, 3, 5), and (9, 6, 8). That is, E for rows (4, 1, 2) should be named E1, E for rows (7, 3, 5) should be named E2, E for rows (9, 6, 8) should be named E3, and so on.
#data
dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1",
"a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1",
"b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948,
0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119,
0.690660491345867, 0.23378944873769)), class = c("data.table",
"data.frame"), row.names = c(NA, -9L))
#my attempt
A B ID E
1: a1 b2 3 0.6211421
2: a2 b2 4 0.7421095
3: a1 b2 3 0.3943915
4: a1 b1 1 0.4069439
5: a2 b2 4 0.7796073
6: a1 b2 3 0.5505793
7: a1 b1 1 0.3526222
8: a2 b2 4 0.6906605
9: a1 b1 1 0.2337894
aTempDF <- melt(dT, id.vars = c("A", "B", "ID"))
A B ID variable value
1: a1 b2 3 E 0.6211421
2: a2 b2 4 E 0.7421095
3: a1 b2 3 E 0.3943915
4: a1 b1 1 E 0.4069439
5: a2 b2 4 E 0.7796073
6: a1 b2 3 E 0.5505793
7: a1 b1 1 E 0.3526222
8: a2 b2 4 E 0.6906605
9: a1 b1 1 E 0.2337894
aTempDF%>%spread(variable, value)
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
#expected output
A B ID E1 E2 E3
1: a1 b2 3 0.6211421 0.3943915 0.5505793
2: a2 b2 4 0.7421095 0.7796073 0.6906605
3: a1 b1 1 0.4069439 0.3526222 0.2337894
Thanks in advance for any help.

You can use dcast from data.table
library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)), value.var = "E")
# A B ID E1 E2 E3
#1 a1 b1 1 0.4069439 0.3526222 0.2337894
#2 a1 b2 3 0.6211421 0.3943915 0.5505793
#3 a2 b2 4 0.7421095 0.7796073 0.6906605
You need to create the correct 'time variable' first which is what rowid(ID) does.
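To see what rowid() contributes, it can be run on the ID column alone. This small sketch copies the ID values from the question; the variable names ids and occ are only for illustration.

```r
library(data.table)

ids <- c("3", "4", "3", "1", "4", "3", "1", "4", "1")

# rowid() numbers each occurrence of a value in order of appearance
occ <- rowid(ids)
occ
#> [1] 1 1 2 1 2 3 2 3 3

# These labels become the new column names in dcast()
paste0("E", occ)
#> [1] "E1" "E1" "E2" "E1" "E2" "E3" "E2" "E3" "E3"
```

Each (ID, label) pair is now unique, which is what spread() was missing when it reported duplicate identifiers.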

For those looking for a tidyverse solution:
library(tidyverse)
dT <- structure(
list(
A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 0.40694392882818,
0.779607277916503, 0.550579323666347, 0.352622183880119, 0.690660491345867,
0.23378944873769)),
class = c("data.table",
"data.frame"),
row.names = c(NA, -9L))
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# Just so columns are "E1", "E2", etc.
mutate(rn = glue::glue("E{row_number()}")) %>%
ungroup() %>%
spread(rn, E) %>%
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
As mentioned in the accepted answer, you need a "key" variable to spread on first. This is created using row_number() and glue where glue just gives you the proper E1, E2, etc. variable names.
The group_by piece just makes sure that the row numbers are with respect to A, B and ID.
EDIT for tidyr >= 1.0.0
The (not-so) new pivot_ functions supersede gather and spread and eliminate the need to glue the new variable names together in a mutate.
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# no longer need to glue (or paste) the names together but still need a row number
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = rn, values_from = E, names_glue = "E{rn}") %>% # names_glue builds the new column names from the names_from column
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
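When the new names only need a fixed prefix, the names_prefix argument of pivot_wider() is a simpler alternative to names_glue. The sketch below rebuilds the question's data as a plain tibble (an assumption made to keep the example independent of data.table).

```r
library(dplyr)
library(tidyr)

dT <- tibble(
  A  = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
  B  = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
  ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
  E  = c(0.621142094943352, 0.742109450696123, 0.39439152996948,
         0.40694392882818, 0.779607277916503, 0.550579323666347,
         0.352622183880119, 0.690660491345867, 0.23378944873769)
)

wide <- dT %>%
  group_by(A, B, ID) %>%
  mutate(rn = row_number()) %>%   # occurrence number within each A/B/ID group
  ungroup() %>%
  # names_prefix puts "E" in front of each value of rn
  pivot_wider(names_from = rn, values_from = E, names_prefix = "E")

wide
```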

Convert column to comma separated in R

I have two columns, A and B, in Excel with a large amount of data. Taking both columns A and B into account, I am trying to produce column C as the output. Right now I am doing everything in Excel. I think there may be a way to do this in R, but I don't know how. Any help is appreciated. Thanks.
I have
ColumnA  ColumnB  ColumnC (output column)
A1       10       A2
A2       10       A1
B1       3        B2,B3,B4
B2       3        B1,B3,B4
B3       3        B1,B2,B4
B4       3        B1,B2,B3
C1       6        C2,C3
C2       6        C1,C3
C3       6        C1,C2

We can group by ColumnB and, for each row, take the set difference between all ColumnA values in the group and the current row's ColumnA value:
library(tidyverse)
df %>%
group_by(ColumnB) %>%
mutate(ColumnC=map_chr(ColumnA, ~toString(setdiff(ColumnA, .x))))
# A tibble: 9 x 3
# Groups: ColumnB [3]
ColumnA ColumnB ColumnC
<fct> <int> <chr>
1 A1 10 A2
2 A2 10 A1
3 B1 3 B2, B3, B4
4 B2 3 B1, B3, B4
5 B3 3 B1, B2, B4
6 B4 3 B1, B2, B3
7 C1 6 C2, C3
8 C2 6 C1, C3
9 C3 6 C1, C2
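toString() separates values with ", " (comma plus space), while the desired output in the question uses a bare comma. A self-contained sketch of the same setdiff idea with paste(collapse = ",") follows. The data frame is rebuilt here from the question's table, and like any setdiff-based approach it assumes ColumnA values are unique within each group.

```r
library(dplyr)
library(purrr)

df <- tibble(
  ColumnA = c("A1", "A2", "B1", "B2", "B3", "B4", "C1", "C2", "C3"),
  ColumnB = c(10L, 10L, 3L, 3L, 3L, 3L, 6L, 6L, 6L)
)

res <- df %>%
  group_by(ColumnB) %>%
  # for each row, keep the group's other ColumnA values and join them with ","
  mutate(ColumnC = map_chr(ColumnA, ~ paste(setdiff(ColumnA, .x), collapse = ","))) %>%
  ungroup()

res
```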

The question is not phrased very clearly, but I interpret the desired result as follows: Column C should contain all the Column A values from the same Column B group, leaving out the row's own Column A value. You can do this as follows:
Nest Column A by Column B and join the nested column back onto the original data frame.
Flatten it so that each row holds a vector of the group's Column A values.
Use setdiff to get the values that are not the row's own Column A value.
Collapse them into a comma-separated string with str_c.
You can see that your desired Column C is reproduced.
library(tidyverse)
tbl <- structure(list(ColumnA = c("A1", "A2", "B1", "B2", "B3", "B4", "C1", "C2", "C3"), ColumnB = c(10L, 10L, 3L, 3L, 3L, 3L, 6L, 6L, 6L), ColumnC = c("A2", "A1", "B2,B3,B4", "B1,B3,B4", "B1,B2,B4", "B1,B2,B3", "C2,C3", "C1,C3", "C1,C2")), problems = structure(list(row = 9L, col = "ColumnC", expected = "", actual = "embedded null", file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame")), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(cols = list(ColumnA = structure(list(), class = c("collector_character", "collector")), ColumnB = structure(list(), class = c("collector_integer", "collector")), ColumnC = structure(list(), class = c("collector_character", "collector"))), default = structure(list(), class = c("collector_guess", "collector"))), class = "col_spec"))
tbl %>%
left_join(
tbl %>% select(-ColumnC) %>% nest(ColumnA)
) %>%
mutate(
data = flatten(data),
output = map2(data, ColumnA, ~ setdiff(.x, .y)),
output = map_chr(output, ~ str_c(., collapse = ","))
)
#> Joining, by = "ColumnB"
#> # A tibble: 9 x 5
#> ColumnA ColumnB ColumnC data output
#> <chr> <int> <chr> <list> <chr>
#> 1 A1 10 A2 <chr [2]> A2
#> 2 A2 10 A1 <chr [2]> A1
#> 3 B1 3 B2,B3,B4 <chr [4]> B2,B3,B4
#> 4 B2 3 B1,B3,B4 <chr [4]> B1,B3,B4
#> 5 B3 3 B1,B2,B4 <chr [4]> B1,B2,B4
#> 6 B4 3 B1,B2,B3 <chr [4]> B1,B2,B3
#> 7 C1 6 C2,C3 <chr [3]> C2,C3
#> 8 C2 6 C1,C3 <chr [3]> C1,C3
#> 9 C3 6 C1,C2 <chr [3]> C1,C2
Created on 2018-08-21 by the reprex package (v0.2.0).

My understanding is that you want to find all OTHER entries of column A that share the current value of column B.
Grouping by B and collecting all values of A in the group should do the trick; some clean-up afterward removes the current entry of A from the resulting column C.
library(tidyverse)
a <- c("a1", "a2","b1", "b2","b3", "b4","c1","c2","c3","d1")
b <- c(10,10,3,3,3,3,6,6,6,5)
dta <- data.frame(a, b, stringsAsFactors = FALSE)
dta<-dta %>%
group_by(b) %>%
mutate(c = paste0(a,collapse = ",")) %>%
ungroup() %>%
mutate(c = str_replace(c,pattern = paste0(",",a),replacement = "")) %>%
mutate(c = str_replace(c,pattern = paste0(a,","),replacement = "")) %>%
mutate(c = ifelse(c==a,NA,c))

Another version of a tidyverse solution. The separate function splits an existing column into new columns; here it creates the Group column so that all operations happen within each group. map2 and map_chr apply a function to each element of a list column. dat2 is the final output.
library(tidyverse)
dat2 <- dat %>%
separate(ColumnA, into = c("Group", "Number"), remove = FALSE, convert = TRUE, sep = 1) %>%
group_by(Group) %>%
mutate(List = list(ColumnA)) %>%
mutate(List = map2(List, ColumnA, ~.x[!(.x %in% .y)])) %>%
mutate(ColumnC = map_chr(List, ~str_c(.x, collapse = ","))) %>%
ungroup() %>%
select(starts_with("Column"))
dat2
# # A tibble: 9 x 3
# ColumnA ColumnB ColumnC
# <chr> <int> <chr>
# 1 A1 10 A2
# 2 A2 10 A1
# 3 B1 3 B2,B3,B4
# 4 B2 3 B1,B3,B4
# 5 B3 3 B1,B2,B4
# 6 B4 3 B1,B2,B3
# 7 C1 6 C2,C3
# 8 C2 6 C1,C3
# 9 C3 6 C1,C2
DATA
dat <- read.table(text = "ColumnA ColumnB
A1 10
A2 10
B1 3
B2 3
B3 3
B4 3
C1 6
C2 6
C3 6",
stringsAsFactors = FALSE, header = TRUE)

df = read.table(text = "
ColumnA ColumnB
A1 10
A2 10
B1 3
B2 3
B3 3
B4 3
C1 6
C2 6
C3 6
", header=T, stringsAsFactors=F)
library(tidyverse)
df %>%
group_by(ColumnB) %>% # for each ColumnB value
mutate(vals = list(ColumnA), # create a list of all Column A values for each row
vals = map2(vals, ColumnA, ~.x[.x != .y]), # exclude the value in Column A from that list
vals = map_chr(vals, ~paste0(.x, collapse = ","))) %>% # combine remaining values in the list
ungroup() # forget the grouping
# # A tibble: 9 x 3
# ColumnA ColumnB vals
# <chr> <int> <chr>
# 1 A1 10 A2
# 2 A2 10 A1
# 3 B1 3 B2,B3,B4
# 4 B2 3 B1,B3,B4
# 5 B3 3 B1,B2,B4
# 6 B4 3 B1,B2,B3
# 7 C1 6 C2,C3
# 8 C2 6 C1,C3
# 9 C3 6 C1,C2
