How to select rows with exactly 2 unique values with tidyverse? - r

What I have:
library(magrittr)
set.seed(1234)
what_i_have <- tibble::tibble(
A = c(0, 1) |> sample(5, replace = TRUE),
B = c(0, 1) |> sample(5, replace = TRUE),
C = c(0, 1) |> sample(5, replace = TRUE)
)
It looks like this:
> what_i_have
# A tibble: 5 x 3
A B C
<dbl> <dbl> <dbl>
1 1 1 1
2 1 0 1
3 1 0 1
4 1 0 0
5 0 1 1
What I want:
what_i_want <- what_i_have %>% .[apply(., 1, function(row) row |> unique() |> length() == 2),]
It looks like this:
# A tibble: 4 x 3
A B C
<dbl> <dbl> <dbl>
1 1 0 1
2 1 0 1
3 1 0 0
4 0 1 1
My question is: is there a tidyverse way to do the things above?
I tried this:
what_i_have |>
dplyr::rowwise() |>
dplyr::filter_all(function(row) row |> unique() |> length() == 2)
but it returns the following empty tibble and I do not know why
# A tibble: 0 x 3
# Rowwise:
# … with 3 variables: A <dbl>, B <dbl>, C <dbl>
Thank you.

Here is one option with tidyverse. Here, I treat each row as a vector (via c_across), then get the number of distinct values using n_distinct and return TRUE for the rows that have 2 unique values.
library(tidyverse)
what_i_have %>%
rowwise() %>%
filter(n_distinct(c_across(everything())) == 2)
Output
A B C
<dbl> <dbl> <dbl>
1 0 1 1
2 1 0 1
3 1 0 0
4 1 1 0
A mixed method approach with apply could be:
what_i_have %>%
filter(apply(., 1, \(x)length(unique(x)))==2)
Data
what_i_have <-
structure(
list(
A = c(0, 1, 1, 1, 1),
B = c(1, 0, 0, 1, 1),
C = c(1, 1, 0, 1, 0)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-5L)
)
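As for why the rowwise() + filter_all() attempt in the question returns zero rows: filter_all() applies the predicate to each column separately, and under rowwise() each column holds a single value per row, so the count of unique values is always 1 and never 2. A minimal sketch of that difference, using the data from this answer (the names per_column and per_row are only for illustration):

```r
library(dplyr)

what_i_have <- tibble::tibble(
  A = c(0, 1, 1, 1, 1),
  B = c(1, 0, 0, 1, 1),
  C = c(1, 1, 0, 1, 0)
)

# Under rowwise(), a column reference is a single value per row,
# so the number of unique values in one column is always 1.
per_column <- what_i_have %>%
  rowwise() %>%
  mutate(n_unique_A = length(unique(A))) %>%
  ungroup()

# c_across() gathers the selected columns of the row into one vector,
# so the count reflects the whole row.
per_row <- what_i_have %>%
  rowwise() %>%
  mutate(n_unique_row = n_distinct(c_across(A:C))) %>%
  ungroup()
```

Here per_column$n_unique_A is 1 for every row, while per_row$n_unique_row distinguishes the rows that the filter should keep.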

Related

How to change column values based on duplication in another column R

My data looks like this:
data <- data.frame(grupoaih = c("09081997", "13122006", "09081997", "22031969"),
NMM_PROC_BR = c(1, 1, 0, 1),
NMM_CID = c(0, 1, 1, 0),
CPAV_PROC_BR = c(0, 0, 0, 1),
CPAV_CID = c(1, 1, 0, 1))
grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
1 09081997 1 0 0 1
2 13122006 1 1 0 1
3 09081997 0 1 0 0
4 22031969 1 0 1 1
How can I assign the value 1 when "grupoaih" is a duplicate so the other 4 variables get filled equally like this:
data2 <- data.frame(grupoaih = c("09081997", "13122006", "09081997", "22031969"),
NMM_PROC_BR = c(1, 1, 1, 1),
NMM_CID = c(1, 1, 1, 0),
CPAV_PROC_BR = c(0, 0, 0, 1),
CPAV_CID = c(1, 1, 1, 1))
grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
1 09081997 1 1 0 1
2 13122006 1 1 0 1
3 09081997 1 1 0 1
4 22031969 1 0 1 1
This only applies if grupoaih is duplicated and any of the 4 variables are filled with 1. If both are 0 in all variables, they stay as they are.
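You can use group_by() and then n() to check whether there are duplicates. Inside the formula, . stands for the original column value, and ~ introduces the formula (an anonymous function). Note that this version sets the three listed columns to 1 in every duplicated group; it reproduces the expected output for this data, but it does not check whether any value in the group is actually 1.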
library(dplyr)
data %>%
group_by(grupoaih) %>%
mutate(across(c("NMM_PROC_BR", "NMM_CID", "CPAV_CID"), ~ifelse(n() > 1, 1, .))) %>%
ungroup()
# # A tibble: 4 × 5
# grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 09081997 1 1 0 1
# 2 13122006 1 1 0 1
# 3 09081997 1 1 0 1
# 4 22031969 1 0 1 1
It could work with max after grouping
library(dplyr)
data %>%
group_by(grupoaih) %>%
mutate(across(everything(), max)) %>%
ungroup()
Output
# A tibble: 4 × 5
grupoaih NMM_PROC_BR NMM_CID CPAV_PROC_BR CPAV_CID
<chr> <dbl> <dbl> <dbl> <dbl>
1 09081997 1 1 0 1
2 13122006 1 1 0 1
3 09081997 1 1 0 1
4 22031969 1 0 1 1
Or use fmax from collapse
library(collapse)
data[-1] <- fmax(data[-1], data$grupoaih, TRA = 1)
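For comparison, the same group-wise maximum can be sketched in base R with ave(). This is only an illustration under the assumption that all value columns are 0/1, so that the group maximum is 1 exactly when any row in the group has a 1; groups with only 0s stay 0, and non-duplicated rows are unchanged.

```r
data <- data.frame(grupoaih = c("09081997", "13122006", "09081997", "22031969"),
                   NMM_PROC_BR = c(1, 1, 0, 1),
                   NMM_CID = c(0, 1, 1, 0),
                   CPAV_PROC_BR = c(0, 0, 0, 1),
                   CPAV_CID = c(1, 1, 0, 1))

# Replace every value column by its maximum within each grupoaih group.
data[-1] <- lapply(data[-1], function(col) ave(col, data$grupoaih, FUN = max))
data
```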

Assigning values to a column in the based on values of another column in the same dataframe in R

I have a dataframe with 3 columns and I want to assign values to a fourth column when a condition on the sum of the other columns is met within each row. In this example I want to assign 1 to df[,4], if df[,3]>=2 for each row.
An example of what I want as the output is:
Any help is appreciated.
Thank you,
library(tidyverse)
data <-
tribble(
~ID, ~time1, ~time2,
'jkjkdf', 1, 1,
'kjkj', 1, 0,
'fgf', 1, 1,
'jhkj', 0, 1,
'hgd', 0,0
)
mutate(data, label = if_else(time1 + time2 >= 2, 1, 0))
#> # A tibble: 5 x 4
#> ID time1 time2 label
#> <chr> <dbl> <dbl> <dbl>
#> 1 jkjkdf 1 1 1
#> 2 kjkj 1 0 0
#> 3 fgf 1 1 1
#> 4 jhkj 0 1 0
#> 5 hgd 0 0 0
#or with n time columns
data %>%
rowwise() %>%
mutate(label = if_else(sum(across(starts_with('time'))) >= 2, 1, 0))
#> # A tibble: 5 x 4
#> # Rowwise:
#> ID time1 time2 label
#> <chr> <dbl> <dbl> <dbl>
#> 1 jkjkdf 1 1 1
#> 2 kjkj 1 0 0
#> 3 fgf 1 1 1
#> 4 jhkj 0 1 0
#> 5 hgd 0 0 0
Created on 2021-06-06 by the reprex package (v2.0.0)
Do you want to assign 1 if both time1 and time2 are 1?
If there are only two columns you can do -
df$label <- as.integer(df$time1 == 1 & df$time2 == 1)
If there are many such time columns we can take help of rowSums -
cols <- grep('time', names(df))
df$label <- as.integer(rowSums(df[cols] == 1) == length(cols))
df
# a time1 time2 label
#1 a 1 1 1
#2 b 1 0 0
#3 c 1 1 1
#4 d 0 1 0
#5 e 0 0 0
data
Images are not the right way to share data, provide them in a reproducible format.
df <- data.frame(a = letters[1:5],
time1 = c(1, 1, 1, 0, 0),
time2 = c(1, 0, 1, 1, 0))
We could do this in a vectorized way using tidyverse methods: select the columns whose names start with 'time' (starts_with), reduce them to a single vector by adding (+) the corresponding elements, then use the aliases from magrittr to compare the sum against 2 and convert the result to binary for the 'label' column. Finally, the result should be assigned (<-) back to the original object if we want it to be changed.
library(dplyr)
library(purrr)
library(magrittr)
df %>%
mutate(label = select(cur_data(), starts_with('time')) %>%
reduce(`+`) %>%
is_weakly_greater_than(2) %>%
multiply_by(1))
a time1 time2 label
1 a 1 1 1
2 b 1 0 0
3 c 1 1 1
4 d 0 1 0
5 e 0 0 0
data
df <- structure(list(a = c("a", "b", "c", "d", "e"), time1 = c(1, 1,
1, 0, 0), time2 = c(1, 0, 1, 1, 0)), class = "data.frame", row.names = c(NA,
-5L))
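The row-sum threshold used in the reduce(`+`) version can also be sketched without any packages. This is just a base R illustration on the same df; the name time_cols is made up for the example.

```r
df <- data.frame(a = letters[1:5],
                 time1 = c(1, 1, 1, 0, 0),
                 time2 = c(1, 0, 1, 1, 0))

# Logical index of the columns whose names start with "time".
time_cols <- startsWith(names(df), "time")

# Sum those columns per row and flag rows whose sum is at least 2.
df$label <- as.integer(rowSums(df[time_cols]) >= 2)
df
```

With only 0/1 values and two time columns this gives the same labels as the "all equal to 1" check; with more columns the two conditions differ, so pick the one that matches the intended rule.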

dplyr mutate ifelse returning first value of group instead of by-row

I'm trying to mutate a data.frame using ifelse:
df = data.frame(grp = c('a', 'a', 'a', 'b', 'b', 'b'),
value1 = c(0, 0, 0, 0, 1, 2),
value2 = 1:6)
df %>%
group_by(grp) %>%
mutate(value2 = ifelse(all(value1 == 0), 0, value2))
which returns
# # A tibble: 6 x 3
# # Groups: grp [2]
# grp value1 value2
# <chr> <dbl> <dbl>
# 1 a 0 0
# 2 a 0 0
# 3 a 0 0
# 4 b 0 4
# 5 b 1 4
# 6 b 2 4
instead of
# # A tibble: 6 x 3
# # Groups: grp [2]
# grp value1 value2
# <chr> <dbl> <dbl>
# 1 a 0 0
# 2 a 0 0
# 3 a 0 0
# 4 b 0 4
# 5 b 1 5
# 6 b 2 6
How can I change the mutate so that the rows of "value2" are unchanged if the condition is false?
You can use if and else instead of ifelse():
df %>%
group_by(grp) %>%
mutate(value2 = if(all(value1 == 0)) 0 else value2)
grp value1 value2
<fct> <dbl> <dbl>
1 a 0 0
2 a 0 0
3 a 0 0
4 b 0 4
5 b 1 5
6 b 2 6
You can try ifelse as a mask, e.g.,
df %>%
group_by(grp) %>%
mutate(value2 = ifelse(all(value1 == 0), 0, 1)*value2)
or (thanks to @tmfmnk's comment)
df %>%
group_by(grp) %>%
mutate(value2 = any(value1 != 0)*value2)
which gives
grp value1 value2
<chr> <dbl> <dbl>
1 a 0 0
2 a 0 0
3 a 0 0
4 b 0 4
5 b 1 5
6 b 2 6
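The problem you encountered arises because all(value1 == 0) returns a single logical value, and ifelse() returns a result of the same length as its test. With a length-one test, only the first element of value2 is returned, and mutate() then recycles it across the group. You need a vector of logical values to get your desired output, e.g.,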
df %>%
group_by(grp) %>%
mutate(value2 = ifelse(rep(all(value1 == 0),n()), 0, value2))
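The length behaviour of ifelse() can be checked outside of dplyr. A small sketch, with value2 and cond standing in for one group's values and its all(value1 == 0) result:

```r
value2 <- 4:6   # the values of group "b"
cond <- FALSE   # all(value1 == 0) for group "b"

# Length-one test: the result also has length one (the first element of value2).
short <- ifelse(cond, 0, value2)

# Test repeated to the group size: the result keeps every element.
full <- ifelse(rep(cond, length(value2)), 0, value2)

# Plain if/else returns the whole vector from the chosen branch.
branch <- if (cond) 0 else value2
```

Inside a grouped mutate(), the length-one result in short is what gets recycled to 4, 4, 4.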

Create new columns based on comma-separated values in another column in R [duplicate]

I have some data similar to that below.
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
df
# id tags
# 1 1 A,B,AB,C
# 2 2 C
# 3 3 AB,E
# 4 4 <NA>
# 5 5 B,C
I'd like to create a new dummy variable column for each tag in the "tags" column, resulting in a dataframe like the following:
correct_df <- data.frame(id = 1:5,
tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"),
A = c(1, 0, 0, 0, 0),
B = c(1, 0, 0, 0, 1),
C = c(1, 1, 0, 0, 1),
E = c(0, 0, 1, 0, 0),
AB = c(1, 0, 1, 0, 0)
)
correct_df
# id tags A B C E AB
# 1 1 A,B,AB,C 1 1 1 0 1
# 2 2 C 0 0 1 0 0
# 3 3 AB,E 0 0 0 1 1
# 4 4 <NA> 0 0 0 0 0
# 5 5 B,C 0 1 1 0 0
One of the challenges is ensuring that the "A" column has 1 only for the "A" tag, so that it doesn't have 1 for the "AB" tag, for example. The following won't work for this reason, since "A" gets 1 for the "AB" tag:
df <- df %>%
mutate(A = ifelse(grepl("A", tags, fixed = T), 1, 0))
df
# id tags A
# 1 1 A,B,AB,C 1
# 2 2 C 0
# 3 3 AB,E 1 < Incorrect
# 4 4 <NA> 0
# 5 5 B,C 0
Another challenge is doing this programmatically. I can probably deal with a solution that manually creates a column for each tag, but a solution that doesn't assume which tag columns need to be created beforehand is best, since there can potentially be many different tags. Is there some relatively simple solution that I'm overlooking?
Does this work:
library(tidyr)
library(dplyr)
df %>%
  separate_rows(tags) %>%
  mutate(A = case_when(tags == 'A' ~ 1, TRUE ~ 0),
         B = case_when(tags == 'B' ~ 1, TRUE ~ 0),
         C = case_when(tags == 'C' ~ 1, TRUE ~ 0),
         E = case_when(tags == 'E' ~ 1, TRUE ~ 0),
         AB = case_when(tags == 'AB' ~ 1, TRUE ~ 0)) %>%
  group_by(id) %>%
  mutate(tags = toString(tags)) %>%
  group_by(id, tags) %>%
  summarise(across(A:AB, sum))
`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 5 x 7
# Groups: id [5]
id tags A B C E AB
<int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 A, B, AB, C 1 1 1 0 1
2 2 C 0 0 1 0 0
3 3 AB, E 0 0 0 1 1
4 4 NA 0 0 0 0 0
5 5 B, C 0 1 1 0 0
Here's a solution:
library(dplyr)
library(stringr)
library(magrittr)
library(tidyr)
#Data
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))
#Separate into rows
df %<>% mutate(t2 = tags) %>% separate_rows(t2, sep = ",")
#Create a presence/absence column
df %<>% mutate(pa = 1)
#Pivot wider and use the presence/absence
#column as entries; fill with 0 if absent
df %<>% pivot_wider(names_from = t2, values_from = pa, values_fill = 0)
df
# # A tibble: 5 x 8
# id tags A B AB C E `NA`
# <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 A,B,AB,C 1 1 1 1 0 0
# 2 2 C 0 0 0 1 0 0
# 3 3 AB,E 0 0 1 0 1 0
# 4 4 NA 0 0 0 0 0 1
# 5 5 B,C 0 1 0 1 0 0
Edit: updated the code to enable it to retain the tags column. Sorry.
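If the extra NA column is not wanted, one possible alternative is a base R sketch that builds the dummy columns from whatever tags occur, without naming them in advance. This is only an illustration, not the method of either answer above; tag_list, all_tags, dummies and result are made-up names, and the tag columns come out in sorted order rather than the order shown in the question.

```r
df <- data.frame(id = 1:5, tags = c("A,B,AB,C", "C", "AB,E", NA, "B,C"))

# Split each tags string on commas; the NA entry stays NA.
tag_list <- strsplit(as.character(df$tags), ",", fixed = TRUE)

# Every distinct tag that occurs, ignoring NA.
all_tags <- sort(unique(na.omit(unlist(tag_list))))

# One row per original row, one 0/1 column per tag.
# %in% compares whole tags, so "A" does not match "AB".
dummies <- do.call(rbind, lapply(tag_list, function(x) as.integer(all_tags %in% x)))
colnames(dummies) <- all_tags

result <- cbind(df, dummies)
result
```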

dplyr group_by_ lazy .drop = F

I am trying to incorporate the drop = F into the following dplyr function
dspreadN = function(data, ...) {
data %>% group_by_(.dots = lazyeval::lazy_dots(...), .drop = F) %>%
summarise(n = n()*100) %>% spread(value, n, fill = 0)
}
Basically, the function transforms this
id x
1 1 A
2 1 A
3 1 A
4 1 A
5 2 A
6 2 A
7 2 B
8 2 B
9 3 A
10 3 A
11 3 B
12 3 A
into that
id drop A B
<dbl> <lgl> <dbl> <dbl>
1 1 FALSE 400 0
2 2 FALSE 200 200
3 3 FALSE 300 100
I use the function in this way: dff %>% dspreadN(id, value = x)
(my real example is much more complicated, which is why I need the dplyr function).
What I would like is to keep all the levels of the x variable; here the C level is missing.
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 400 0 0
2 2 200 200 0
3 3 300 100 0
Why is the drop = F not working?
library(tidyverse)
# data
dff = data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4),
x = c('A','A','A','A', 'A','A','B','B', 'A','A','B','A', 'C', 'C', 'C', 'C'))
# remove the case to keep the C level
dff = dff[dff$id != 4, ]
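You can use the .drop = FALSE argument in count instead of group_by. In the question's output, the extra drop column suggests that group_by_ treated .drop = F as another grouping variable rather than as an option.
group_by + summarise with n() is equivalent to count.
spread has been superseded by pivot_wider.
Thanks to @Edo for useful tips in improving the post.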
library(dplyr)
library(tidyr)
dspreadN = function(data, ...) {
data %>%
count(id, x, .drop = FALSE, wt = n() * 100) %>%
pivot_wider(names_from = x, values_from = n, values_fill = 0)
}
dspreadN(dff, id, x)
# id A B C
# <dbl> <dbl> <dbl> <dbl>
#1 1 400 0 0
#2 2 200 200 0
#3 3 300 100 0
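One thing to keep in mind: .drop = FALSE can only keep empty groups for factor levels, so x has to be a factor that still carries the C level. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so a sketch that does not depend on that default could build the factor explicitly. The explicit levels and the name wide are assumptions made for this example.

```r
library(dplyr)
library(tidyr)

# x is created as a factor with an unused level "C".
dff <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  x = factor(c('A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A'),
             levels = c('A', 'B', 'C'))
)

wide <- dff %>%
  count(id, x, .drop = FALSE) %>%  # keeps the empty id/C combinations with n = 0
  mutate(n = n * 100) %>%
  pivot_wider(names_from = x, values_from = n, values_fill = 0)
wide
```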
