Tidy way to remove duplicates per row in R

I've seen various base R solutions for removing row-wise duplicates, e.g. R - find all duplicates in row and replace.
However, I'm wondering if there's a tidier way. I tried several approaches using across or a combination of rowwise with c_across, but can't get it to work.
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(1, 3, 4, 5),
                 z = c(2, 3, 5, 6))
Expected output:
x y z
1 1 NA 2
2 2 3 NA
3 3 4 5
4 4 5 6
My ideas so far (not working):
df |>
  mutate(apply(across(everything()), 1, function(x) replace(x, duplicated(x), NA)))
df |>
  mutate(apply(across(everything()), 1, function(x) {x[duplicated(x)] <- NA}))
I got part of the way there by creating a list column that flags which positions in a row are duplicates (but it also triggers the ugly warning about the usual "new names" problem). I'm unsure how to proceed from there (if that's a promising route at all), i.e. I guess it requires some form of purrr magic?
df |>
  rowwise() |>
  mutate(test = list(duplicated(c_across(everything())))) |>
  unnest_wider(test)
# A tibble: 4 × 6
x y z ...1 ...2 ...3
<dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 1 1 2 FALSE TRUE FALSE
2 2 3 3 FALSE FALSE TRUE
3 3 4 5 FALSE FALSE FALSE
4 4 5 6 FALSE FALSE FALSE

Maybe you want something like this:
library(dplyr)
df %>%
  rowwise() %>%
  do(data.frame(replace(., duplicated(unlist(.)), NA)))
Output:
# A tibble: 4 × 3
# Rowwise:
x y z
<dbl> <dbl> <dbl>
1 1 NA 2
2 2 3 NA
3 3 4 5
4 4 5 6
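Note that do() is superseded in current dplyr. A rough purrr-based equivalent of the same idea (just a sketch, assuming purrr is loaded) could be:
library(purrr)
df |>
  pmap(\(...) {
    v <- c(...)  # the current row as a named vector
    as.data.frame(as.list(replace(v, duplicated(v), NA)))
  }) |>
  bind_rows()
Each row becomes a one-row data frame with duplicates blanked out, and bind_rows() stacks them back together.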

I wouldn't say it's tidy, but it is a solution using map:
library(tidyverse)
df %>%
  group_nest(row_number()) %>%
  pull(data) %>%
  map(function(x) as.numeric(x) %>% replace(., duplicated(.), NA) %>% setNames(names(df))) %>%
  bind_rows()
# # A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
# 1 1 NA 2
# 2 2 3 NA
# 3 3 4 5
# 4 4 5 6

Just for completeness: after some trial and error, I also got the same result as provided by @Quinten, just in a much, much uglier way!
df |>
  rowwise() |>
  mutate(pos = list(which(duplicated(c_across(everything()))))) |>
  mutate(across(-pos, ~ ifelse(which(names(df) == cur_column()) %in% unlist(pos), NA, .))) |>
  select(-pos)

Related

Continuing a sequence into NAs using dplyr

I am trying to figure out a dplyr-specific way of continuing a sequence of numbers when there are NAs in that column.
For example I have this dataframe:
library(tibble)
dat <- tribble(
  ~x,       ~group,
  1,        "A",
  2,        "A",
  NA_real_, "A",
  NA_real_, "A",
  1,        "B",
  NA_real_, "B",
  3,        "B"
)
dat
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 NA A
#> 4 NA A
#> 5 1 B
#> 6 NA B
#> 7 3 B
I would like this one:
#> # A tibble: 7 × 2
#> x group
#> <dbl> <chr>
#> 1 1 A
#> 2 2 A
#> 3 3 A
#> 4 4 A
#> 5 1 B
#> 6 2 B
#> 7 3 B
When I try this I get a warning which makes me think I am probably approaching this incorrectly:
library(dplyr)
dat %>%
  group_by(group) %>%
  mutate(n = n()) %>%
  mutate(new_seq = seq_len(n))
#> Warning in seq_len(n): first element used of 'length.out' argument
#> Warning in seq_len(n): first element used of 'length.out' argument
#> # A tibble: 7 × 4
#> # Groups: group [2]
#> x group n new_seq
#> <dbl> <chr> <int> <int>
#> 1 1 A 4 1
#> 2 2 A 4 2
#> 3 NA A 4 3
#> 4 NA A 4 4
#> 5 1 B 3 1
#> 6 NA B 3 2
#> 7 3 B 3 3
It's easier if you do it in one go. Your approach is not 'wrong'; it's just that seq_len() needs a single integer, and you are giving it a vector (the column n), so seq_len() falls back to using only its first value.
dat %>%
  group_by(group) %>%
  mutate(x = seq_len(n()))
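As an aside, the warning in the question is easy to reproduce on its own (a standalone illustration, not part of the original answer):
seq_len(c(4, 4, 4))
#> Warning in seq_len(c(4, 4, 4)): first element used of 'length.out' argument
#> [1] 1 2 3 4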
Note that row_number might be even easier here:
dat %>%
  group_by(group) %>%
  mutate(x = row_number())
We could use rowid directly if the intention is to create a sequence and the group size is just an intermediate column:
library(data.table)
library(dplyr)
dat %>%
  mutate(new_seq = rowid(group))
The issue with using the column after it is created is that it is no longer a single value per group, as shown in @Maël's post. If we need to do that, use first(), since seq_len() is not vectorized (and here the intermediate column is not needed anyway):
dat %>%
  group_by(group) %>%
  mutate(n = n()) %>%
  mutate(new_seq = seq_len(first(n)))
A base R option using ave (which works in a similar way to group_by in dplyr):
> transform(dat, x = ave(x, group, FUN = seq_along))
x group
1 1 A
2 2 A
3 3 A
4 4 A
5 1 B
6 2 B
7 3 B

How to filter on column == var when var has the same name as the column? (inside pmap)

I have a tibble that I want to filter by comparing its columns against some variables. However, it's convenient for those variables to have the same names as the columns. How can I force dplyr to evaluate the variable so it doesn't confuse variable and column names?
set.seed(2)
ngrp <- 3
npergrp <- 4
tib <- tibble(grp=rep(letters[1:ngrp], each=npergrp),
              N=rep(1:npergrp, ngrp),
              val=round(runif(npergrp*ngrp))) %>% print(n=Inf)
grp <- grp_ <- 'a'
tib %>% dplyr::filter(grp==grp_) %>% glimpse() ## works
tib %>% dplyr::filter(grp==grp) %>% glimpse() ## undesired result, grp==grp always true
tib %>% dplyr::filter(grp=={{grp}}) %>% glimpse() ## hey it works!
## slightly less toy example
tib %>% dplyr::filter(grp==grp_) %>%
  dplyr::mutate(
    the_rest = purrr::pmap(
      .,
      function(grp, N, ...) {
        gg <- grp ## there must be a better way
        NN <- N
        tib %>%
          dplyr::filter(
            # grp!=grp, ## always false
            # N==N ## always true
            grp!=gg,
            N==NN
          ) %>%
          dplyr::pull(val) %>%
          sum()
      }
    ),
    no_hugs = purrr::pmap(
      .,
      function(grp, N, ...) {
        tib %>%
          dplyr::filter(
            grp!={{grp}}, ## ERROR! oh noes!
            N=={{N}}
          ) %>%
          dplyr::pull(val) %>%
          sum()
      }
    )
  ) %>%
  tidyr::unnest() %>%
  glimpse()
output:
# A tibble: 12 × 3
grp N val
<chr> <int> <dbl>
1 a 1 0
2 a 2 1
3 a 3 1
4 a 4 0
5 b 1 1
6 b 2 1
7 b 3 0
8 b 4 1
9 c 1 0
10 c 2 1
11 c 3 1
12 c 4 0
Rows: 4
Columns: 3
$ grp <chr> "a", "a", "a", "a"
$ N <int> 1, 2, 3, 4
$ val <dbl> 0, 1, 1, 0
Rows: 4
Columns: 3
$ grp <chr> "a", "a", "a", "a"
$ N <int> 1, 2, 3, 4
$ val <dbl> 0, 1, 1, 0
Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?
# the_rest should be 1, 2, 1, 1
As often happens, writing the question taught me how to embrace variables using the double curly brace operator {{}}
https://dplyr.tidyverse.org/articles/programming.html
Use dynamic name for new column/variable in `dplyr`
However, it doesn't work inside the pmap.
It would need .env to evaluate the object 'grp' from the calling environment rather than the data environment (or use !!):
library(dplyr)
tib %>%
  dplyr::filter(grp==.env$grp)
-output
# A tibble: 4 × 3
grp N val
<chr> <int> <dbl>
1 a 1 0
2 a 2 1
3 a 3 1
4 a 4 0
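For completeness, a minimal sketch of the !! alternative mentioned above (same tib and grp <- 'a' as before), where !!grp injects the value "a" before the data mask is consulted:
tib %>%
  dplyr::filter(grp == !!grp)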
The .env can be used similarly within the pmap code as well
library(purrr)
tib %>%
  dplyr::filter(grp==.env$grp_) %>%
  dplyr::mutate(the_rest = purrr::pmap_dbl(across(everything()),
                  ~ {gg <- ..1
                     NN <- ..2
                     tib %>%
                       dplyr::filter(grp != gg, N == NN) %>%
                       pull(val) %>%
                       sum()}),
                no_hugs = purrr::pmap_dbl(across(all_of(names(tib))),
                  ~ tib %>%
                      dplyr::filter(grp != .env$grp, N == ..2) %>%
                      pull(val) %>%
                      sum()))
-output
# A tibble: 4 × 5
grp N val the_rest no_hugs
<chr> <int> <dbl> <dbl> <dbl>
1 a 1 0 1 1
2 a 2 1 2 2
3 a 3 1 1 1
4 a 4 0 1 1

How to keep other values unchanged with dplyr's recode_factor

In the example below, recoding some values turns all the other values into NA. How can I keep the other values unchanged?
library(tibble)
library(dplyr)
test <- tibble(
  test_vec = as.factor(c(1, 2, 3))
)
test
#> # A tibble: 3 x 1
#> test_vec
#> <fct>
#> 1 1
#> 2 2
#> 3 3
test %>%
  mutate(test_vec = recode_factor(test_vec, `3` = 4))
#> # A tibble: 3 x 1
#> test_vec
#> <fct>
#> 1 <NA>
#> 2 <NA>
#> 3 4
You need to make your replacement the same type as the original values:
test %>%
  mutate(test_vec = recode_factor(test_vec, "3" = "4"))
# A tibble: 3 x 1
test_vec
<fct>
1 1
2 2
3 4
Using fct_recode
library(forcats)
library(dplyr)
test %>%
  mutate(test_vec = fct_recode(test_vec, `4` = '3'))
-output
# A tibble: 3 x 1
# test_vec
# <fct>
#1 1
#2 2
#3 4
So that you don't get NA values, you have to list the other values in the call as well:
test %>%
  mutate(test_vec = recode_factor(test_vec, `1` = 1, `2` = 2, `3` = 4))
Result
# A tibble: 3 x 1
test_vec
<fct>
1 1
2 2
3 4
Another way to do it is with case_when, but for this you have to start from numeric values. Here is an example that starts from numeric values and converts them to a factor at the end.
test <- tibble(
  test_vec = c(1, 2, 3))
test %>%
  mutate(test_vec = case_when(test_vec != 3 ~ test_vec,
                              test_vec == 3 ~ 4)) %>%
  mutate(across(test_vec, factor))
Result
# A tibble: 3 x 1
test_vec
<fct>
1 1
2 2
3 4

Create a list column with ranges set by existing columns

I am trying to create a list column within a data frame, specifying the range using existing columns, something like:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 c(1, 2, 3, 4, 5, 6)
2 2 5 c(2, 3, 4, 5)
3 3 4 c(3, 4)
The catch is that it would need to be created as follows:
df %>% mutate(C = c(A:B))
I have a dataset containing integers entered as ranges, i.e. someone has entered "7 to 26". I've separated the ranges into two columns A and B, or "start" and "end", and was hoping to use c(A:B) to create a list, but using dplyr I keep getting:
Warning messages:
1: In a:b : numerical expression has 3 elements: only the first used
2: In a:b : numerical expression has 3 elements: only the first used
Which gives:
# A tibble: 3 x 3
A B C
<dbl> <dbl> <list>
1 1 6 list(1:6)
2 2 5 list(1:6)
3 3 4 list(1:6)
Has anyone had a similar issue and found a workaround?
You can use map2() in purrr
library(dplyr)
df %>%
  mutate(C = purrr::map2(A, B, seq))
or do rowwise() before mutate()
df %>%
  rowwise() %>%
  mutate(C = list(A:B)) %>%
  ungroup()
Both methods give
# # A tibble: 3 x 3
# A B C
# <int> <int> <list>
# 1 1 6 <int [6]>
# 2 2 5 <int [4]>
# 3 3 4 <int [2]>
Data
df <- tibble::tibble(A = 1:3, B = 6:4)
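Since the question mentions ranges arriving as strings like "7 to 26", here is a hedged sketch of the full pipeline (the raw column name and example values are invented for illustration):
library(tidyr)
raw <- tibble::tibble(range = c("1 to 6", "2 to 5", "3 to 4"))
raw %>%
  separate(range, into = c("A", "B"), sep = " to ", convert = TRUE) %>%
  mutate(C = purrr::map2(A, B, seq))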

How to perform a group_by with elements that are contiguous in R and dplyr

Suppose we have this tibble:
group item
x 1
x 2
x 2
y 3
z 2
x 2
x 2
z 1
I want to perform a group_by by group. However, I'd rather group only the elements that are adjacent. For example, in my case, I'd have two 'x' groups, summing the 'item' elements. The result would be something like:
group item
x 5
y 3
z 2
x 4
z 1
I know how to solve this problem using 'for' loops. However, that is not fast and doesn't feel straightforward. I'd rather use some dplyr or tidyverse function with simple logic.
This question is not a duplicate. I know there's already a question about rle on SO, but my question is more general than that: I asked for general solutions.
If you want to use only base R + the tidyverse, this code exactly replicates your desired result:
mydf <- tibble(group = c("x", "x", "x", "y", "z", "x", "x", "z"),
               item = c(1, 2, 2, 3, 2, 2, 2, 1))
mydf
# A tibble: 8 × 2
group item
<chr> <dbl>
1 x 1
2 x 2
3 x 2
4 y 3
5 z 2
6 x 2
7 x 2
8 z 1
runs <- rle(mydf$group)
mydf %>%
  mutate(run_id = rep(seq_along(runs$lengths), runs$lengths)) %>%
  group_by(group, run_id) %>%
  summarise(item = sum(item)) %>%
  arrange(run_id) %>%
  select(-run_id)
Source: local data frame [5 x 2]
Groups: group [3]
group item
<chr> <dbl>
1 x 5
2 y 3
3 z 2
4 x 4
5 z 1
You can construct group identifiers with rle, but the easier route is to just use data.table::rleid, which does it for you:
library(dplyr)
df %>%
  group_by(group,
           group_run = data.table::rleid(group)) %>%
  summarise_all(sum)
#> # A tibble: 5 x 3
#> # Groups: group [?]
#> group group_run item
#> <fctr> <int> <int>
#> 1 x 1 5
#> 2 x 4 4
#> 3 y 2 3
#> 4 z 3 2
#> 5 z 5 1
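If you'd rather avoid the data.table dependency, a dplyr-only sketch (my addition, not from either answer, using the mydf tibble defined above) builds the same run id with cumsum() and lag():
mydf %>%
  mutate(run_id = cumsum(group != lag(group, default = first(group)))) %>%
  group_by(run_id, group) %>%
  summarise(item = sum(item), .groups = "drop") %>%
  select(-run_id)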
