How to collapse duplicate columns to unique columns in R?

Solution
I went with the solutions provided by @MauritsEvers and @akrun below.
Question
For a data frame, I want to keep only 1 column of each set of duplicate columns. In addition, the column that is kept takes on a name that is a concatenation of all column names in the set of duplicate columns. There are multiple sets of duplicate columns in the data frame. The data frame contains tens of thousands of columns, so using a for loop might take too much time.
I have tried a combination of duplicated(), summary(), aggregate(), lapply(), apply(), and for loops.
Input data frame (df_in):
0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0
Output data frame (df_out):
0-2-5 1-4 3 6 7
0 1 0 1 1
0 1 1 0 0
1 0 0 1 0

Here is an option with tidyverse. We gather the data into 'long' format, convert the 'value' per 'key' into a string, then, grouped by 'value', paste the 'key' entries together, separate the rows of 'value' and finally spread the 'value' column to get the expected output
library(tidyverse)
gather(df_in) %>%
  group_by(key) %>%
  summarise(value = toString(value)) %>%
  group_by(value) %>%
  summarise(key = paste(key, collapse = "-")) %>%
  separate_rows(value) %>%
  group_by(key) %>%
  mutate(n = row_number()) %>%
  spread(key, value) %>%
  select(-n)
# A tibble: 3 x 5
# `0-2-5` `1-4` `3` `6` `7`
# <chr> <chr> <chr> <chr> <chr>
#1 0 1 0 1 1
#2 0 1 1 0 0
#3 1 0 0 1 0
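gather/spread are superseded in current tidyr, so the same idea can also be sketched with pivot_longer/pivot_wider (an equivalent of the pipeline above, same logic, assuming a recent tidyr):
pivot_longer(df_in, everything()) %>%
  group_by(name) %>%
  summarise(value = toString(value)) %>%
  group_by(value) %>%
  summarise(name = paste(name, collapse = "-")) %>%
  separate_rows(value) %>%
  group_by(name) %>%
  mutate(n = row_number()) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-n)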
Or another option with tidyverse would be
t(df_in) %>%
  as.data.frame %>%
  mutate(grp = group_indices(., V1, V2, V3)) %>%
  mutate(rn = row_number() - 1) %>%
  group_split(grp, keep = FALSE) %>%
  map_dfc(~ .x %>%
            mutate(rn = str_c(rn, collapse = "-")) %>%
            slice(1) %>%
            gather(key, val, -rn) %>%
            rename(!! .$rn[1] := val) %>%
            select(ncol(.)))
# A tibble: 3 x 5
# `0-2-5` `3` `7` `6` `1-4`
# <int> <int> <int> <int> <int>
#1 0 0 1 1 1
#2 0 1 0 0 1
#3 1 0 0 1 0
Or we can also do this with data.table methods
library(data.table)
dcast(melt(as.data.table(t(df_in))[, grp := .GRP, .(V1, V2, V3)][,
    c(.SD[1], cn = paste(.I - 1, collapse = "-")), .(grp)],
    id.var = c('cn', 'grp')), variable ~ cn, value.var = 'value')[,
    variable := NULL][]
# 0-2-5 1-4 3 6 7
#1: 0 1 0 1 1
#2: 0 1 1 0 0
#3: 1 0 0 1 0
data
df_in <- structure(list(`0` = c(0L, 0L, 1L), `1` = c(1L, 1L, 0L), `2` = c(0L,
0L, 1L), `3` = c(0L, 1L, 0L), `4` = c(1L, 1L, 0L), `5` = c(0L,
0L, 1L), `6` = c(1L, 0L, 1L), `7` = c(1L, 0L, 0L)),
class = "data.frame", row.names = c(NA, -3L))

You can do the following in base R
Get indices of identical columns
idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))
Sort indices from low to high
idx <- idx[order(sapply(idx, function(x) x[1]))]
Set the names of idx to the concatenation of the corresponding column names
names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))
Create final matrix
sapply(idx, function(x) df[, x[1]])
# col0_col2_col5 col1_col4 col3_col6 col7
#[1,] 0 1 1 1
#[2,] 0 1 0 0
#[3,] 1 0 1 0
Note that the resulting object is a matrix, so if you need a data.frame simply cast it with as.data.frame().
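For example, the last step wrapped in the cast:
df_out <- as.data.frame(sapply(idx, function(x) df[, x[1]]))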
Sample data
I've changed your sample data slightly to not have numbers as column names.
df <- read.table(text =
"col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = T)

Related

Extracting sequences from columns using R

I have a df which looks like this
ID X003-APP X005-APP X008-APP X003-COP X004-COP X008-PIN X009-PIN
363 NA NA 1 0 NA 4 5
364 0 2 NA 1 5 1 5
678 0 NA NA 5 NA NA NA
713 1 1 1 1 1 1 1
219 1 2 3 3 NA 4 5
234 NA NA NA 2 3 NA NA
321 2 3 1 NA NA 1 2
For each ID, I am interested in whether there is at least one non-null value across the columns whose names contain the substrings APP, COP and PIN. My required output is:
ID APP COP PIN
363 1 1 1
364 1 1 1
678 1 1 0
713 1 1 1
219 1 1 1
234 0 1 0
321 1 0 1
For reference, I am sharing the dput():
structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L),
X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L),
X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L),
X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L),
X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA),
X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA),
X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L),
X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)),
class = "data.frame", row.names = c(NA, -7L))
Edit:
Later on, I would like to analyse sequences of length 2 and 3 across IDs, i.e. count how many IDs have at least one non-null value for every substring in the sequence. My required output for a sequence of length 2 would be:
Spec_1 Spec_2 Counts
APP COP 5
APP PIN 5
COP PIN 4
Or correspondingly, my required output for a sequence of length 3 would be:
Spec_1 Spec_2 Spec_3 Counts
APP COP PIN 4
Is there an easy way to achieve this? It would be great to have a solution that could cater for longer sequences - even beyond 3. Thank you very much for your time.
You may try
library(reshape2)
library(tidyverse)
df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = as.numeric(any(!is.na(value)))) %>%
  filter(value == 1) %>%
  pivot_wider(names_from = "Spec", values_from = "value") %>%
  replace(is.na(.), 0)
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 219 1 1 1
2 234 0 1 0
3 321 1 0 1
4 363 1 1 1
5 364 1 1 1
6 678 1 1 0
7 713 1 1 1
gives the output in your edit, and
df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  summarise(Spec = combn(Spec, 2, simplify = F)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2) %>%
  summarize(Counts = n())
Spec_1 Spec_2 Counts
<chr> <chr> <int>
1 APP COP 5
2 APP PIN 5
3 COP PIN 4
gives your earlier required output. For a sequence of length 3:
df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 2) %>%
  summarise(Spec = combn(Spec, 3, simplify = F)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2, Spec_3) %>%
  summarize(Counts = n())
Spec_1 Spec_2 Spec_3 Counts
<chr> <chr> <chr> <int>
1 APP COP PIN 4
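The length-2 and length-3 pipelines only differ in the combn() size and the n() cutoff, so they can be folded into one helper for any sequence length k (a sketch following the same steps; count_sequences is a made-up name):
count_sequences <- function(df, k) {
  df %>%
    reshape2::melt(id = "ID") %>%
    separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
    group_by(ID, Spec) %>%
    summarize(value = any(!is.na(value)), .groups = "drop") %>%
    filter(value) %>%
    group_by(ID) %>%
    filter(n() >= k) %>%
    summarise(Spec = combn(Spec, k, simplify = FALSE)) %>%
    unnest_wider(Spec, names_sep = "_") %>%
    group_by(across(starts_with("Spec_"))) %>%
    summarize(Counts = n(), .groups = "drop")
}
count_sequences(df, 2) # reproduces the length-2 table above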
Try this using dplyr
library(dplyr)
df |>
  rowwise() |>
  transmute(ID,
            APP = case_when(all(is.na(c_across(contains("APP")))) ~ 0, TRUE ~ 1),
            COP = case_when(all(is.na(c_across(contains("COP")))) ~ 0, TRUE ~ 1),
            PIN = case_when(all(is.na(c_across(contains("PIN")))) ~ 0, TRUE ~ 1)) -> df1
output
# A tibble: 7 × 4
# Rowwise:
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 363 1 1 1
2 364 1 1 1
3 678 1 1 0
4 713 1 1 1
5 219 1 1 1
6 234 0 1 0
7 321 1 0 1
For your second requirement you can use
df1 |>
  transmute(AC = case_when(sum(c_across(c(APP, COP))) == 2 ~ 1, TRUE ~ 0),
            AP = case_when(sum(c_across(c(APP, PIN))) == 2 ~ 1, TRUE ~ 0),
            CP = case_when(sum(c_across(c(PIN, COP))) == 2 ~ 1, TRUE ~ 0),
            ACP = case_when(sum(c_across(c(APP, COP, PIN))) == 3 ~ 1, TRUE ~ 0)) |>
  ungroup() |>
  summarise(APP_COP = sum(AC), APP_PIN = sum(AP), COP_PIN = sum(CP), APP_COP_PIN = sum(ACP))
output
# A tibble: 1 × 4
APP_COP APP_PIN COP_PIN APP_COP_PIN
<dbl> <dbl> <dbl> <dbl>
1 5 5 4 4

How to create dummy variables that indicate the presence of a factor for other observations within a group?

I am working with a data frame like the following, where Color and Player are factor variables:
Game Player Color
A 1 Red
A 2 Green
A 3 Blue
B 1 Green
B 2 Purple
B 3 Yellow
I want to create indicator variables for each value of the Color column. However, I want those indicator variables to represent whether the color is present for other players in the same game (not whether it's present for that player). So I want the above table to turn into:
Game Player Color Blue Green Yellow Red Purple
A 1 Red 1 1 0 0 0
A 2 Green 1 0 0 1 0
A 3 Blue 0 1 0 1 0
B 1 Green 0 0 1 0 1
B 2 Purple 0 1 1 0 0
B 3 Yellow 0 1 0 0 1
I imagine the code will have group_by(Game) %>%, but I'm lost beyond that.
Data:
structure(list(Game = c("A", "A", "A", "B", "B", "B"), Player = c(1L,
2L, 3L, 1L, 2L, 3L), Color = c("Red", "Green", "Blue", "Green",
"Purple", "Yellow"), Blue = c(1L, 1L, 0L, 0L, 0L, 0L), Green = c(1L,
0L, 1L, 0L, 1L, 1L), Yellow = c(0L, 0L, 0L, 1L, 1L, 0L), Red = c(0L,
1L, 1L, 0L, 0L, 0L), Purple = c(0L, 0L, 0L, 1L, 0L, 1L)), class = "data.frame", row.names = c(NA,
-6L))
Perhaps this helps - split the 'Color' column by 'Game', create a binary matrix by comparing the elements of 'Color' (!=), convert to tibble, row bind (_dfr) and bind the dataset with the original dataset (bind_cols)
library(purrr)
library(dplyr)
library(tidyr)
map_dfr(split(df1$Color, df1$Game), ~ {
  m1 <- +(outer(.x, .x, FUN = `!=`))
  colnames(m1) <- .x
  as_tibble(m1)}) %>%
  mutate(across(everything(), replace_na, 0)) %>%
  bind_cols(df1, .)
-output
Game Player Color Red Green Blue Purple Yellow
1 A 1 Red 0 1 1 0 0
2 A 2 Green 1 0 1 0 0
3 A 3 Blue 1 1 0 0 0
4 B 1 Green 0 0 0 1 1
5 B 2 Purple 0 1 0 0 1
6 B 3 Yellow 0 1 0 1 0
Or another option is with dummy_cols and then modify the output
library(fastDummies)
library(stringr)
dummy_cols(df1, 'Color') %>%
  rename_with(~ str_remove(.x, "Color_")) %>%
  group_by(Game) %>%
  mutate(across(Blue:Yellow, ~ +(Color != cur_column() & any(.x)))) %>%
  ungroup
-output
# A tibble: 6 × 8
Game Player Color Blue Green Purple Red Yellow
<chr> <int> <chr> <int> <int> <int> <int> <int>
1 A 1 Red 1 1 0 0 0
2 A 2 Green 1 0 0 1 0
3 A 3 Blue 0 1 0 1 0
4 B 1 Green 0 0 1 0 1
5 B 2 Purple 0 1 0 0 1
6 B 3 Yellow 0 1 1 0 0
data
df1 <- structure(list(Game = c("A", "A", "A", "B", "B", "B"), Player = c(1L,
2L, 3L, 1L, 2L, 3L), Color = c("Red", "Green", "Blue", "Green",
"Purple", "Yellow")), row.names = c(NA, -6L), class = "data.frame")
Here is a way how we could do it:
First we use the model.matrix() function, multiply it by 1 and subtract 1 within a wrap of abs().
Then we get almost the desired output; the only thing left is to get zeros in case none of the colors is present. We do this with a mutate() across...:
library(dplyr)
df %>%
  cbind(abs((model.matrix(~ Color + 0, .) == 1) * 1 - 1)) %>%
  group_by(Game) %>%
  mutate(across(-c(Player, Color), ~ case_when(sum(.) == 3 ~ 0,
                                               TRUE ~ .)))
Game Player Color ColorBlue ColorGreen ColorPurple ColorRed ColorYellow
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 Red 1 1 0 0 0
2 A 2 Green 1 0 0 1 0
3 A 3 Blue 0 1 0 1 0
4 B 1 Green 0 0 1 0 1
5 B 2 Purple 0 1 0 0 1
6 B 3 Yellow 0 1 1 0 0
Here is another approach using full_join and pivot_wider from tidyverse. I believe this also gives the same result. The filter is included to avoid marking a player's own color with a 1.
library(tidyverse)
full_join(df, df, by = "Game", suffix = c("", "_Two")) %>%
  filter(Color != Color_Two) %>%
  mutate(val = 1) %>%
  pivot_wider(id_cols = c(Game, Player, Color),
              names_from = Color_Two,
              values_from = val,
              values_fill = 0)
Output
Game Player Color Green Blue Red Purple Yellow
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 Red 1 1 0 0 0
2 A 2 Green 0 1 1 0 0
3 A 3 Blue 1 0 1 0 0
4 B 1 Green 0 0 0 1 1
5 B 2 Purple 1 0 0 0 1
6 B 3 Yellow 1 0 0 1 0
Using base R, you can write a small function and evaluate using tapply:
fun <- function(x) {
  nms <- levels(x)
  tab <- tcrossprod(table(x))
  dimnames(tab) <- list(nms, nms)
  tab[x, ]
}
data.frame(df1, do.call(rbind, with(df1, tapply(factor(Color), Game, fun))), row.names = NULL)
Game Player Color Blue Green Purple Red Yellow
1 A 1 Red 1 1 0 1 0
2 A 2 Green 1 1 0 1 0
3 A 3 Blue 1 1 0 1 0
4 B 1 Green 0 1 1 0 1
5 B 2 Purple 0 1 1 0 1
6 B 3 Yellow 0 1 1 0 1
Note that out of all the options given, this one is by far the fastest, yet it uses only base R:
Here is the benchmark:
library(microbenchmark)
microbenchmark(Tarjae(df1), akrun(df1), ben(df1), onyambu(df1),
paulS(df1), unit = 'relative')
Unit: relative
expr min lq mean median uq max neval
Tarjae(df1) 18.775201 18.11495 13.533556 17.171485 15.746554 1.105045 100
akrun(df1) 9.755032 8.83519 7.137294 8.756033 8.241494 1.455906 100
ben(df1) 21.084371 18.57861 14.699821 17.950987 16.486863 3.124906 100
onyambu(df1) 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 100
paulS(df1) 33.108208 31.27110 24.918541 30.266024 27.420363 3.156215 100
For larger data frames, some of the given code breaks down, while the options that don't break down are still slow compared to the base R approach:
df2 <- transform(data.frame(Game = sample(LETTERS, 2000, TRUE),
                            Color = sample(colors(), 2000, TRUE)),
                 Player = ave(Game, Game, FUN = seq_along))
microbenchmark(Tarjae(df2), akrun(df2), onyambu(df2), paulS(df2))
Unit: milliseconds
expr min lq mean median uq max neval
Tarjae(df2) 2147.67826 2234.5575 2460.1924 2423.20994 2653.1737 3049.9455 100
akrun(df2) 108.25249 121.3167 144.6715 130.48052 152.9518 404.7286 100
onyambu(df2) 67.19992 80.3653 111.2821 91.05784 118.4877 331.6724 100
paulS(df2) 183.88836 200.6224 231.0155 215.18942 237.5717 467.1721 100
Code for the benchmark:
Tarjae <- function(df){
  df %>%
    cbind(abs((model.matrix(~ Color + 0, .) == 1) * 1 - 1)) %>%
    group_by(Game) %>%
    mutate(across(-c(Player, Color), ~ case_when(sum(.) == 3 ~ 0,
                                                 TRUE ~ .)))
}
akrun <- function(df1){
  map_dfr(split(df1$Color, df1$Game), ~ {
    m1 <- +(outer(.x, .x, FUN = `!=`))
    colnames(m1) <- .x
    as_tibble(m1)}) %>%
    mutate(across(everything(), replace_na, 0)) %>%
    bind_cols(df1, .)
}
ben <- function(df){
  full_join(df, df, by = "Game", suffix = c("", "_Two")) %>%
    filter(Color != Color_Two) %>%
    mutate(val = 1) %>%
    pivot_wider(id_cols = c(Game, Player, Color),
                names_from = Color_Two,
                values_from = val,
                values_fill = 0)
}
onyambu <- function(df1){
  fun <- function(x) {
    nms <- levels(x)
    tab <- tcrossprod(table(x))
    dimnames(tab) <- list(nms, nms)
    tab[x, ]
  }
  data.frame(df1, do.call(rbind, with(df1, tapply(factor(Color), Game, fun))), row.names = NULL)
}
paulS <- function(df){
  df %>%
    group_by(Game) %>%
    mutate(aux = list(Color)) %>%
    unnest(aux) %>%
    filter(aux != Color) %>%
    ungroup %>%
    pivot_wider(Game:Color, names_from = aux, values_from = aux, values_fill = 0,
                values_fn = length)
}
Another possible solution:
library(tidyverse)
df %>%
  group_by(Game) %>%
  mutate(aux = list(Color)) %>%
  unnest(aux) %>%
  filter(aux != Color) %>%
  ungroup %>%
  pivot_wider(Game:Color, names_from = aux, values_from = aux, values_fill = 0,
              values_fn = length)
#> # A tibble: 6 × 8
#> Game Player Color Green Blue Red Purple Yellow
#> <chr> <int> <chr> <int> <int> <int> <int> <int>
#> 1 A 1 Red 1 1 0 0 0
#> 2 A 2 Green 0 1 1 0 0
#> 3 A 3 Blue 1 0 1 0 0
#> 4 B 1 Green 0 0 0 1 1
#> 5 B 2 Purple 1 0 0 0 1
#> 6 B 3 Yellow 1 0 0 1 0

Sequential Increase in Column value based on a condition R

I have an R data frame that has an ID column with multiple records per ID. When the Flag is set to 1 for an ID, I want to create a new column (New_Timepoint) that starts at 1 and increases sequentially in increments of 6 (1, 6, 12, ...). How can I achieve this in R using dplyr?
Below is a sample data frame
ID Timepoint Flag
A 0 0
A 6 0
A 12 0
A 18 1
A 24 0
A 30 0
A 36 0
Expected Dataframe
ID Timepoint Flag New_Timepoint
A 0 0 NA
A 6 0 NA
A 12 0 NA
A 18 1 1
A 24 0 6
A 30 0 12
A 36 0 18
An option is to group by 'ID' and create the lag of 'Timepoint' (with its zeros replaced by 1), with n specified as the position of 'Flag' where the value is 1, minus 1
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(New_Timepoint = dplyr::lag(replace(Timepoint, !Timepoint, 1),
                                    n = which(Flag == 1) - 1)) %>%
  ungroup
-output
# A tibble: 7 x 4
# ID Timepoint Flag New_Timepoint
# <chr> <int> <int> <dbl>
#1 A 0 0 NA
#2 A 6 0 NA
#3 A 12 0 NA
#4 A 18 1 1
#5 A 24 0 6
#6 A 30 0 12
#7 A 36 0 18
Or use a double cumsum to create the index (applying the same replace() as above so the flagged row starts at 1 rather than 0)
df1 %>%
  group_by(ID) %>%
  mutate(New_Timepoint = replace(Timepoint, !Timepoint, 1)[na_if(cumsum(cumsum(Flag)), 0)]) %>%
  ungroup
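A quick trace of why the index works for this Flag vector: cumsum(Flag) gives 0 0 0 1 1 1 1, the second cumsum gives 0 0 0 1 2 3 4, and na_if(., 0) turns the leading zeros into NA, so indexing the replaced Timepoint vector (1, 6, 12, 18, ...) with NA NA NA 1 2 3 4 yields NA NA NA 1 6 12 18.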
data
df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A"),
                      Timepoint = c(0L, 6L, 12L, 18L, 24L, 30L, 36L),
                      Flag = c(0L, 0L, 0L, 1L, 0L, 0L, 0L)),
                 class = "data.frame", row.names = c(NA, -7L))
Another dplyr option
df %>%
  group_by(ID) %>%
  mutate(New_Timepoint = pmax(1, Timepoint - c(NA, Timepoint[Flag == 1])[cumsum(Flag) + 1])) %>%
  ungroup()
gives
ID Timepoint Flag New_Timepoint
<chr> <int> <int> <dbl>
1 A 0 0 NA
2 A 6 0 NA
3 A 12 0 NA
4 A 18 1 1
5 A 24 0 6
6 A 30 0 12
7 A 36 0 18

How to get merged data frame from two data frames having some same columns(R)

I want to merge them, adding the values of one data frame to the existing values of the other wherever the two share a column.
For example:
df1
No A B C D
1 1 0 1 0
2 0 1 2 1
3 0 0 1 0
df2
No A B E F
1 1 0 1 1
2 0 1 2 1
3 2 1 1 0
Finally, I want the output table like this.
df
No A B C D E F
1 2 0 1 0 1 1
2 0 2 2 1 2 1
3 2 1 1 0 1 0
Note: I did try merge(), but in this case, it did not work.
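(To illustrate: a plain merge(df1, df2, by = "No") just joins the rows, returning suffixed A.x/A.y and B.x/B.y pairs instead of summing them.)
merge(df1, df2, by = "No")
# No A.x B.x C D A.y B.y E F
#1 1 1 0 1 0 1 0 1 1
#2 2 0 1 2 1 0 1 2 1
#3 3 0 0 1 0 2 1 1 0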
Any help/suggestion would be appreciated.
Reproducible sample data
df1 <-
structure(list(No = 1:3, A = c(1L, 0L, 0L), B = c(0L, 1L, 0L),
C = c(1L, 2L, 1L), D = c(0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
df2 <-
structure(list(No = 1:3, A = c(1L, 0L, 2L), B = c(0L, 1L, 1L),
E = c(1L, 2L, 1L), F = c(1L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
You can also carry out this operation by left_joining these two data frames:
library(dplyr)
library(stringr)
df1 %>%
  left_join(df2, by = "No") %>%
  mutate(across(ends_with(".x"), ~ .x + get(str_replace(cur_column(), "\\.x", "\\.y")))) %>%
  rename_with(~ str_replace(., "\\.x", ""), ends_with(".x")) %>%
  select(!ends_with(".y"))
No A B C D E F
1 1 2 0 1 0 1 1
2 2 0 2 2 1 2 1
3 3 2 1 1 0 1 0
You can first row-bind the two dataframes and then compute the sum of each column while 'grouping' by the No column. This can be done like so:
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(No) %>%
  summarise(across(c(A, B, C, D, E, `F`), sum, na.rm = TRUE),
            .groups = "drop")
If a particular column doesn't exist in one data frame (e.g. columns E and F), values will be padded with NA. Adding the na.rm = TRUE argument (passed on to sum()) means that these values get treated like zeros.
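Note that passing extra arguments to across() like this is deprecated in recent dplyr (>= 1.1); an equivalent with an anonymous function would be:
bind_rows(df1, df2) %>%
  group_by(No) %>%
  summarise(across(c(A, B, C, D, E, `F`), ~ sum(.x, na.rm = TRUE)),
            .groups = "drop")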
Using data.table :
library(data.table)
rbindlist(list(df1, df2), fill = TRUE)[, lapply(.SD, sum, na.rm = TRUE), No]
# No A B C D E F
#1: 1 2 0 1 0 1 1
#2: 2 0 2 2 1 2 1
#3: 3 2 1 1 0 1 0
We can use base R (with R 4.1.0). Get the values of the objects in a list ('lst1'). Then, find the union of the column names ('nm1'). Loop over the list to assign 0-value columns for the missing names (setdiff) in each list element, rbind them and use aggregate to get the sum grouped by 'No'
lst1 <- mget(ls(pattern = '^df\\d+$'))
nm1 <- lapply(lst1, names) |>
  {\(x) Reduce(union, x)}()
lapply(lst1, \(x) {x[setdiff(nm1, names(x))] <- 0; x}) |>
  {\(x) do.call(rbind, x)}() |>
  {\(dat) aggregate(. ~ No, data = dat, FUN = sum, na.rm = TRUE,
                    na.action = na.pass)}()
# No A B C D E F
#1 1 2 0 1 0 1 1
#2 2 0 2 2 1 2 1
#3 3 2 1 1 0 1 0

Repeat a value within each ID

I have a dataset in R in long format. Each ID does not appear the same number of times (i.e. one ID might be a single row, another might appear in 79 rows).
e.g.
ID V1 V2
1 B 0
1 A 1
1 C 0
2 C 0
3 A 0
3 C 0
I want to create a variable that is 1 for every row of a given ID if any of that ID's rows have V2 == 1
e.g.
ID V1 V2 V3
1 B 0 1
1 A 1 1
1 C 0 1
2 C 0 0
3 A 0 0
3 C 0 0
In base R we can use any(), with ave() for the grouping.
DF$V3 <- with(DF, ave(V2, ID, FUN = function(x) any(x == 1)))
DF
# ID V1 V2 V3
#1 1 B 0 1
#2 1 A 1 1
#3 1 C 0 1
#4 2 C 0 0
#5 3 A 0 0
#6 3 C 0 0
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), V1 = c("B", "A",
"C", "C", "A", "C"), V2 = c(0L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"V1", "V2"), class = "data.frame", row.names = c(NA, -6L))
Here's a tidyverse solution.
If V2 can only be 0 or 1:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(V3 = max(V2))
If you want to check that V2 is exactly 1:
df %>%
  group_by(ID) %>%
  mutate(V3 = as.numeric(any(V2 == 1)))
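The same grouped any() translates directly to data.table as well, if you prefer that (a sketch, not from the original answers):
library(data.table)
setDT(df)[, V3 := +any(V2 == 1), by = ID]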
Another base R option is
df$V3 <- with(df, +(ID %in% which(rowsum(V2, ID) > 0)))
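Here rowsum(V2, ID) sums V2 within each ID; the ID %in% which(...) lookup works because the IDs are 1, 2, 3 and therefore coincide with the row positions of the rowsum matrix. A sketch that avoids relying on that, by indexing on the rownames instead:
tot <- rowsum(df$V2, df$ID)  # one row per ID; rownames are the IDs
df$V3 <- +(tot[as.character(df$ID), ] > 0)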
