R + dplyr: Partial Deduplication of Rows in a Tibble

A very common question is how to remove all duplicated rows from a data frame in R, which can be done with a variety of tools (I like dplyr's distinct()).
However, what if your dataset contains several duplicated rows and you want to remove only those that match some combination of the variables, rather than all of them?
I do not know how to achieve that, so any suggestion is welcome.
Please have a look at the reprex at the end of the post.
Thanks!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
df
#> # A tibble: 15 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A h
#> 7 2 B j
#> 8 3 C k
#> 9 4 D t
#> 10 5 E u
#> 11 1 A A
#> 12 2 B B
#> 13 3 C C
#> 14 4 D D
#> 15 5 E E
df_ded <- df |>
distinct()
df_ded
#> # A tibble: 10 × 3
#> x y z
#> <int> <chr> <chr>
#> 1 1 A h
#> 2 2 B j
#> 3 3 C k
#> 4 4 D t
#> 5 5 E u
#> 6 1 A A
#> 7 2 B B
#> 8 3 C C
#> 9 4 D D
#> 10 5 E E
## I want to deduplicate only the rows with x==3 and z=="k"
df_ded_partial <- df |>
distinct(x==3, z=="k") ## but this is not what I mean.
## How to achieve it?
df_ded_partial
#> # A tibble: 3 × 2
#> `x == 3` `z == "k"`
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 TRUE TRUE
#> 3 TRUE FALSE
Created on 2023-02-14 with reprex v2.0.2

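We can use group_modify() and check the condition through the .y argument, which is a one-row tibble holding the grouping keys of the current group. If the condition is met, we return distinct(.x) for that group; otherwise we return the whole group .x unchanged. Note that the result is sorted by the grouping columns, which also come first in the output.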
library(dplyr)
df |>
group_by(x, z) |>
group_modify(~ if(.y$x == 3 && .y$z == "k") distinct(.x) else .x)
#> # A tibble: 14 x 3
#> # Groups: x, z [10]
#> x z y
#> <int> <chr> <chr>
#> 1 1 A A
#> 2 1 h A
#> 3 1 h A
#> 4 2 B B
#> 5 2 j B
#> 6 2 j B
#> 7 3 C C
#> 8 3 k C
#> 9 4 D D
#> 10 4 t D
#> 11 4 t D
#> 12 5 E E
#> 13 5 u E
#> 14 5 u E
Data from OP
df <- tibble(x=rep(seq(5), 3), y=rep(LETTERS[1:5],3),
z=c(rep(c("h","j","k","t","u"), 2), LETTERS[1:5])
)
Created on 2023-02-14 by the reprex package (v2.0.1)
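If the original row order matters, a possible alternative is to flag duplicated rows first and then drop only the flagged rows that also match the condition. This is a minimal sketch and not part of the answer above; the helper column dup and the name df_partial are made up for illustration.

```r
library(dplyr)

df <- tibble(
  x = rep(seq(5), 3),
  y = rep(LETTERS[1:5], 3),
  z = c(rep(c("h", "j", "k", "t", "u"), 2), LETTERS[1:5])
)

df_partial <- df |>
  # duplicated() marks every repeat of an earlier, fully identical row
  mutate(dup = duplicated(df)) |>
  # drop a row only if it is a repeat AND matches the target condition
  filter(!(x == 3 & z == "k" & dup)) |>
  select(-dup)

nrow(df_partial)
```

With the example data this should leave 14 rows in their original order, with a single row left for x == 3 and z == "k".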

Related

Keep Top Factors based on Grouped Weight

Please have a look at the snippet at the end of the post.
I am essentially looking for a cleaner way to obtain the same result.
I have a tibble where the x column is a character vector (I did not translate it into a factor, but this is actually what it is).
Each factor appears multiple times and it always has an associated numerical value (the w column in the tibble).
I would like to keep the top 4 factors according to the sum of their associated w values and change everything else into an "other" factor.
I achieve it below, but I wonder if there is a smarter way to do the same using e.g. forcats.
Any suggestion is appreciated.
library(tidyverse)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df
#> # A tibble: 100 × 2
#> x w
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
#> 7 g 7
#> 8 h 8
#> 9 i 9
#> 10 j 10
#> # … with 90 more rows
###detect the first 4 factors based on the w column
ff <- df |>
group_by(x) |>
summarise(w_tot=sum(w)) |>
ungroup() |>
arrange(desc(w_tot)) |>
slice(1:4) |>
pull(x)
ff
#> [1] "j" "i" "h" "g"
## recode the data
df_new <- df |>
mutate(w=if_else(x %in% ff, x, "other"))
df_new
#> # A tibble: 100 × 2
#> x w
#> <chr> <chr>
#> 1 a other
#> 2 b other
#> 3 c other
#> 4 d other
#> 5 e other
#> 6 f other
#> 7 g g
#> 8 h h
#> 9 i i
#> 10 j j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
It turns out that fct_lump_n() accepts a weight argument (w), so this works:
library(tidyverse)
library(forcats)
df <- tibble(x=rep(letters[1:10], 10), w=seq(100))
df_new2 <- df |>
mutate(x2 = fct_lump_n(x, n = 4, w = w))
df_new2
#> # A tibble: 100 × 3
#> x w x2
#> <chr> <int> <fct>
#> 1 a 1 Other
#> 2 b 2 Other
#> 3 c 3 Other
#> 4 d 4 Other
#> 5 e 5 Other
#> 6 f 6 Other
#> 7 g 7 g
#> 8 h 8 h
#> 9 i 9 i
#> 10 j 10 j
#> # … with 90 more rows
Created on 2022-09-16 by the reprex package (v2.0.1)
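As a quick sanity check, the levels kept by fct_lump_n() can be compared with a manual top-4 selection. This is only a sketch; top4 and lumped are illustrative names, and count(wt = ) is used here as shorthand for the group_by() + summarise() step in the question.

```r
library(dplyr)
library(forcats)

df <- tibble(x = rep(letters[1:10], 10), w = seq(100))

# manual route: total weight per level, keep the four heaviest
top4 <- df |>
  count(x, wt = w, sort = TRUE) |>
  slice_head(n = 4) |>
  pull(x)

# forcats route: lump everything outside the weighted top 4 into "Other"
lumped <- fct_lump_n(df$x, n = 4, w = df$w)

# TRUE when both routes keep the same levels
setequal(setdiff(levels(lumped), "Other"), top4)
```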

How can I change the label of row structure into a string?

I'm trying to change the type of the row labels (the ones marked with the red rectangle) into strings (character). Any ideas or suggestions on how I can change it?
Set the rownames() for the data.frame. You might also find the rownames_to_column(), rowid_to_column(), and column_to_rownames() functions from the {tibble} package useful:
dat <- data.frame(x = 1:26)
head(dat)
#> x
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
rownames(dat) <- letters
head(dat)
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
tibble::rownames_to_column(dat, var = "rowname") |>
head()
#> rowname x
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> 6 f 6
tibble::rowid_to_column(dat, var = "rowid") |>
head()
#> rowid x
#> 1 1 1
#> 2 2 2
#> 3 3 3
#> 4 4 4
#> 5 5 5
#> 6 6 6
dat <- data.frame(x = 1:26, rowname = letters)
head(dat)
#> x rowname
#> 1 1 a
#> 2 2 b
#> 3 3 c
#> 4 4 d
#> 5 5 e
#> 6 6 f
tibble::column_to_rownames(dat, var = "rowname") |>
head()
#> x
#> a 1
#> b 2
#> c 3
#> d 4
#> e 5
#> f 6
Created on 2022-07-22 by the reprex package (v2.0.1)

collapse a dataframe in R that contains both numeric and character variables

I have the following data.frame:
data <- data.frame("ag" = rep(LETTERS[1:4],6),
"date" = c(sapply(1:3, function(x) rep(x, 8))),
"num_var1"= 1:24,
"num_var2"= 24:1,
"alpha_var1" = LETTERS[1:24],
"alpha_var2" = LETTERS[25:2] )
and I would like to summarize its rows (taking the mean) by ag and date using dplyr. The issue is that some columns contain characters: for these, I would like to get the first entry by group (the example dataset is already sorted).
Since my dataset has several entries, I would like the code to be able to recognize whether a variable is numeric (including integers) or a character. However, the best solution that I have so far is the following one:
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), mean))
which creates NAs for non-numeric variables. Do you have a better solution?
Is this what you are looking for?
library(dplyr)
data %>%
dplyr::group_by(ag, date) %>%
summarise(across(everything(), ~
if(is.numeric(.x)) mean(.x) else first(.x)))
#> `summarise()` has grouped output by 'ag'. You can override using the `.groups` argument.
#> # A tibble: 12 x 6
#> # Groups: ag [4]
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
Created on 2022-03-03 by the reprex package (v2.0.1)
Another possible solution:
library(tidyverse)
data %>%
group_by(ag, date) %>%
summarise(across(where(is.numeric), mean),
across(where(is.character), first), .groups = "drop")
#> # A tibble: 12 × 6
#> ag date num_var1 num_var2 alpha_var1 alpha_var2
#> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 A 1 3 22 A Y
#> 2 A 2 11 14 I Q
#> 3 A 3 19 6 Q I
#> 4 B 1 4 21 B X
#> 5 B 2 12 13 J P
#> 6 B 3 20 5 R H
#> 7 C 1 5 20 C W
#> 8 C 2 13 12 K O
#> 9 C 3 21 4 S G
#> 10 D 1 6 19 D V
#> 11 D 2 14 11 L N
#> 12 D 3 22 3 T F
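If the data may also contain column types other than numeric and character (factors, logicals, dates), one possible variation is to take first() for every non-numeric column. This is a hedged sketch, assuming that fallback is what you want for those types; the result name out is illustrative.

```r
library(dplyr)

data <- data.frame(ag = rep(LETTERS[1:4], 6),
                   date = c(sapply(1:3, function(x) rep(x, 8))),
                   num_var1 = 1:24,
                   num_var2 = 24:1,
                   alpha_var1 = LETTERS[1:24],
                   alpha_var2 = LETTERS[25:2])

out <- data %>%
  group_by(ag, date) %>%
  summarise(
    across(where(is.numeric), mean),          # average the numeric columns
    across(where(~ !is.numeric(.x)), first),  # first entry for everything else
    .groups = "drop"
  )
out
```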

R+dplyr: conditionally swap the elements of two columns

Consider the dataframe df at the end of the post.
I simply would like to swap the elements of columns x and y whenever x>y.
There may be other columns in the dataframe which I do not want to touch.
In a sense, I would like to sort row wise the columns x and y.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df<-tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 × 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
Created on 2021-10-06 by the reprex package (v2.0.1)
Base R solution:
Use which(df$x > df$y) to find the rows you want to change, then use rev() to swap the two columns for those rows:
df[which(df$x > df$y), c("x", "y")] <- rev(df[which(df$x > df$y), c("x", "y")])
df
# x y extra
# <int> <int> <chr>
# 1 1 10 A
# 2 2 9 B
# 3 3 8 C
# 4 4 7 D
# 5 5 6 E
# 6 5 6 F
# 7 4 7 G
# 8 3 8 H
# 9 2 9 I
# 10 1 10 J
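A dplyr version of the same idea can be sketched with pmin() and pmax(), which take the element-wise minimum and maximum of the two columns. The temporary columns lo and hi and the name df_swapped are illustrative; this assumes x and y have the same type.

```r
library(dplyr)

df <- tibble(x = 1:10, y = 10:1, extra = LETTERS[1:10])

df_swapped <- df %>%
  mutate(lo = pmin(x, y), hi = pmax(x, y)) %>%  # row-wise min and max
  mutate(x = lo, y = hi) %>%                    # overwrite the original pair
  select(-lo, -hi)                              # other columns stay untouched
df_swapped
```

Because pmin() and pmax() are vectorised, no rowwise() or apply() call is needed for the two-column case.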
Thanks everyone!
I wrote a small function which does what I need and generalizes to the case of multiple variables.
See the reprex
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1234)
set_colnames <- `colnames<-`
df<-tibble(x=1:10, y=10:1, z=rnorm(10), extra=LETTERS[1:10]) %>%
rowwise()
df
#> # A tibble: 10 × 4
#> # Rowwise:
#> x y z extra
#> <int> <int> <dbl> <chr>
#> 1 1 10 -1.21 A
#> 2 2 9 0.277 B
#> 3 3 8 1.08 C
#> 4 4 7 -2.35 D
#> 5 5 6 0.429 E
#> 6 6 5 0.506 F
#> 7 7 4 -0.575 G
#> 8 8 3 -0.547 H
#> 9 9 2 -0.564 I
#> 10 10 1 -0.890 J
sort_rows <- function(df, col_names, dec = FALSE){
temp <- df %>%
select(all_of(col_names))
extra_names <- setdiff(colnames(df), col_names)
temp2 <- df %>%
select(all_of(extra_names))
res <- t(apply(temp, 1, sort, decreasing=dec)) %>%
as_tibble %>%
set_colnames(col_names) %>%
bind_cols(temp2)
return(res)
}
col_names <- c("x", "y", "z")
df_s <- df %>%
sort_rows(col_names, dec = FALSE)
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
df_s
#> # A tibble: 10 × 4
#> x y z extra
#> <dbl> <dbl> <dbl> <chr>
#> 1 -1.21 1 10 A
#> 2 0.277 2 9 B
#> 3 1.08 3 8 C
#> 4 -2.35 4 7 D
#> 5 0.429 5 6 E
#> 6 0.506 5 6 F
#> 7 -0.575 4 7 G
#> 8 -0.547 3 8 H
#> 9 -0.564 2 9 I
#> 10 -0.890 1 10 J
Created on 2021-10-06 by the reprex package (v2.0.1)
This looks like sorting to me:
library(tidyverse)
df <- tibble(x=1:10, y=10:1, extra=LETTERS[1:10])
df
#> # A tibble: 10 x 3
#> x y extra
#> <int> <int> <chr>
#> 1 1 10 A
#> 2 2 9 B
#> 3 3 8 C
#> 4 4 7 D
#> 5 5 6 E
#> 6 6 5 F
#> 7 7 4 G
#> 8 8 3 H
#> 9 9 2 I
#> 10 10 1 J
extra_cols <- df %>% colnames() %>% setdiff(c("x", "y"))
extra_cols
#> [1] "extra"
df %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, extra_cols)) %>%
group_by_at(c("row", extra_cols)) %>%
transmute(
value = value %>% sort(),
name = c("x", "y"),
) %>%
pivot_wider() %>%
ungroup() %>%
select(-row)
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(extra_cols)` instead of `extra_cols` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> # A tibble: 10 x 3
#> extra x y
#> <chr> <int> <int>
#> 1 A 1 10
#> 2 B 2 9
#> 3 C 3 8
#> 4 D 4 7
#> 5 E 5 6
#> 6 F 5 6
#> 7 G 4 7
#> 8 H 3 8
#> 9 I 2 9
#> 10 J 1 10
Created on 2021-10-06 by the reprex package (v2.0.1)
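Try using apply() over the rows (MARGIN = 1) and transposing the result with t(), then use as_tibble() to convert it back to a tibble. This assumes df holds only the numeric columns x and y; with a character column such as extra present, apply() would coerce every value to character.
Then finally change the column names: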
> df <- as_tibble(t(apply(df, 1, sort)))
> names(df) <- c('x', 'y')
> df
# A tibble: 10 x 2
x y
<int> <int>
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10

How would one use dplyr to recursively concatenate characters in a tibble until a character repeats

I'm trying to use dplyr to concatenate characters from prior tibble rows until a character repeats. Once a character repeats, the repeated character starts the same concatenation process again. Here is a reprex that shows the source data frame (df), my failed attempt to concatenate the characters (df1), and the desired result of the proposed concatenation process (df2).
In my attempt, the concatenation appears to take place only once, when "bf" is created, and I'm not sure why. I'm still fairly new to dplyr, so I suspect I am missing something very obvious. If there is a better approach to solving this problem, I am happy to expand my horizons and knowledge.
library(tidyverse)
df <- tibble(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
cde =c("b","f","c","e","b","f","c","e","d","f","b","c","e","d"))
df
#> # A tibble: 14 x 2
#> id cde
#> <dbl> <chr>
#> 1 1 b
#> 2 2 f
#> 3 3 c
#> 4 4 e
#> 5 5 b
#> 6 6 f
#> 7 7 c
#> 8 8 e
#> 9 9 d
#> 10 10 f
#> 11 11 b
#> 12 12 c
#> 13 13 e
#> 14 14 d
df1 <- df %>%
mutate(cum_cde = "") %>%
mutate(cum_cde = if_else(id ==1,cde,cum_cde)) %>%
mutate(cum_cde = if_else(id > 1 & str_count(lag(cum_cde),(cde)) < 1,str_c(lag(cum_cde),cde,sep="",collapse=NULL),cde))
df1
#> # A tibble: 14 x 3
#> id cde cum_cde
#> <dbl> <chr> <chr>
#> 1 1 b b
#> 2 2 f bf
#> 3 3 c c
#> 4 4 e e
#> 5 5 b b
#> 6 6 f f
#> 7 7 c c
#> 8 8 e e
#> 9 9 d d
#> 10 10 f f
#> 11 11 b b
#> 12 12 c c
#> 13 13 e e
#> 14 14 d d
df2 <- tibble(id = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14),
cde =c("b","f","c","e","b","f","c","e","d","f","b","c","e","d"),
result = c("b","bf","bfc","bfce","b","bf","bfc","bfce","bfced","f","fb","fbc","fbce","fbced"))
df2
#> # A tibble: 14 x 3
#> id cde result
#> <dbl> <chr> <chr>
#> 1 1 b b
#> 2 2 f bf
#> 3 3 c bfc
#> 4 4 e bfce
#> 5 5 b b
#> 6 6 f bf
#> 7 7 c bfc
#> 8 8 e bfce
#> 9 9 d bfced
#> 10 10 f f
#> 11 11 b fb
#> 12 12 c fbc
#> 13 13 e fbce
#> 14 14 d fbced
Created on 2019-12-23 by the reprex package (v0.3.0)
An option with a for loop would be:
library(stringr)
v1 <- character(nrow(df))
j <- 1
for(i in seq_len(nrow(df))) {
v1[i] <- paste(df$cde[unique(j:i)], collapse="")
if(str_count(v1[i], df$cde[i]) > 1) {
v1[i] <- df$cde[i]
j <- i
}
}
v1
#[1] "b" "bf" "bfc" "bfce"
#[5] "b" "bf" "bfc" "bfce" "bfced"
#[10]"f" "fb" "fbc" "fbce" "fbced"
Or using accumulate() from purrr (str_count() and str_c() come from stringr, loaded above):
library(purrr)
library(dplyr)
df %>%
group_by(grp = cummax(str_count(accumulate(cde, str_c), cde))) %>%
mutate(result = accumulate(cde, str_c)) %>%
ungroup %>%
select(-grp)
# A tibble: 14 x 3
# id cde result
# <dbl> <chr> <chr>
# 1 1 b b
# 2 2 f bf
# 3 3 c bfc
# 4 4 e bfce
# 5 5 b b
# 6 6 f bf
# 7 7 c bfc
# 8 8 e bfce
# 9 9 d bfced
#10 10 f f
#11 11 b fb
#12 12 c fbc
#13 13 e fbce
#14 14 d fbced
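Another way to sketch this is a single accumulate() pass whose step function restarts the string whenever the incoming character is already present. This is an illustrative variant, not the approach above; grow_or_restart and df_res are made-up names, and grepl(fixed = TRUE) assumes each code should be matched as a literal string.

```r
library(dplyr)
library(purrr)

df <- tibble(id = 1:14,
             cde = c("b", "f", "c", "e", "b", "f", "c",
                     "e", "d", "f", "b", "c", "e", "d"))

# acc is the string built so far, ch is the next character
grow_or_restart <- function(acc, ch) {
  if (grepl(ch, acc, fixed = TRUE)) ch else paste0(acc, ch)
}

df_res <- df %>%
  mutate(result = accumulate(cde, grow_or_restart))
df_res
```

Since the restart decision is made inside the step function, no grouping column is needed.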
