Find and keep duplicated items in each column in R

Is there any way I can use something like tidyverse's add_count() %>% filter(), or distinct(), or alternatively janitor's get_dupes(), to find and keep the duplicated items in each column? There is no need to compare items across columns; each column should be considered separately.
data1 <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 1,
  "c", 2,
  "c", 3
)
The expected output would be:
colA colB
c 1
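For reference, the add_count() %>% filter() idiom mentioned above handles one column at a time; a minimal sketch for colA alone (it does not generalize across columns by itself):
library(tidyverse)
data1 %>%
  add_count(colA) %>%   # count occurrences of each colA value
  filter(n > 1) %>%     # keep values seen more than once
  distinct(colA)        # one row per duplicated value
# A tibble: 1 x 1
  colA
  <chr>
1 c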

You can try map_dfc(), which maps over the columns and returns a data frame by column-binding the outputs:
library(tidyverse)
data1 %>%
  map_dfc(~ .x[duplicated(.x)])
# A tibble: 1 x 2
colA colB
<chr> <dbl>
1 c 1
However, this results in unwanted behavior when the columns contain different numbers of duplicates, because of recycling: when an operation, such as a column bind, requires two vectors of the same length, R automatically repeats the shorter one until it matches the longer one (see the illustration below).
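Recycling is easy to see in isolation; a minimal base R illustration:
cbind(1:2, 1:4)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    1    3
[4,]    2    4
The same thing happens inside map_dfc when the duplicate counts are uneven: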
data1 <- tribble(
  ~colA, ~colB,
  "a", 1,
  "b", 1,
  "c", 2,
  "c", 3,
  "d", 1
)
data1 %>%
  map_dfc(~ .x[duplicated(.x)])
# A tibble: 2 x 2
colA colB
<chr> <dbl>
1 c 1
2 c 1
Here colA has been recycled to match the length of colB. In such a case you are better off returning a list with map():
data1 %>%
  map(~ .x[duplicated(.x)])
# output
$colA
[1] "c"

$colB
[1] 1 1

In base R:
duplicatedList <- lapply(data1, function(columnValues) {
  unique(columnValues[duplicated(columnValues)])
})
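With the five-row data1 from above, this yields:
duplicatedList
$colA
[1] "c"

$colB
[1] 1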

Another base R option:
> list2DF(Map(function(x) x[duplicated(x)], data1))
colA colB
1 c 1
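Note that list2DF() (available since R 4.0.0) recycles shorter columns to the longest length much like map_dfc, so the same caveat applies when the duplicate counts are uneven; with the five-row data1:
> list2DF(Map(function(x) x[duplicated(x)], data1))
  colA colB
1    c    1
2    c    1
In that situation, keep the plain list from Map() (or lapply()) instead.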

Related

Identify distinct arrangements of values by groups in a data.frame

I have a large data frame that is organized into groups of rows of identical length (3 rows per group in the toy example).
df <- data.frame(groups = c("gr1","gr1","gr1","gr2","gr2","gr2","gr3","gr3","gr3"),
                 no = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 colA = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 colB = c("a", "b", "c", "X_", "b", "c", "a", "b", "c"),
                 colC = c("a", "b", "c", "X_", "b", "c", "c", "b", "a"))
> df
  groups no colA colB colC
1    gr1  1    a    a    a
2    gr1  2    b    b    b
3    gr1  3    c    c    c
4    gr2  1    a   X_   X_
5    gr2  2    b    b    b
6    gr2  3    c    c    c
7    gr3  1    a    a    c
8    gr3  2    b    b    b
9    gr3  3    c    c    a
I want to identify, for each column, which group is the first example of a unique arrangement of values. So for colA it should return (T, F, F), since all three groups are identical and only group one is the first unique one. For colB it should return (T, T, F), since there are two distinct arrangements and only the third group is identical to the first. And for colC it should be (T, T, T), since the order of items matters.
So the final output could be a matrix like this:
    colA colB colC
gr1    T    T    T
gr2    F    T    T
gr3    F    F    T
I think I could figure this out by breaking the data frame down into pairs of groups and colA/B/C, identifying which ones are identical, storing the results in a vector, and then reassembling the whole thing. But I am seeing a ton of for-loops and have a hard time thinking about how to vectorize this. I have been using dplyr a bit, but I don't yet see how it can help.
Maybe there's a decent way to unstack each of the columns based on the groups and then run a comparison across the relevant subsets of new (and shorter) columns?
Edited to add:
Maybe group_by %>% summarize is a way to get at this. If the summary can essentially concatenate all values in a group per column into one long string, I could then see which of those strings is distinct per group?
Second edit:
I got as far as:
d1 <- df %>%
  group_by(groups) %>%
  summarise(colB = paste(unique(colB), collapse = ', ')) %>%
  distinct(colB)
which outputs
# A tibble: 2 x 1
  colB
  <chr>
1 a, b, c
2 X_, b, c
It identifies the distinct groups, but I still have to figure out how to compare these against the full column to get T/F for each group.
Here's a base R approach:
cols <- grep('col', names(df))
cbind(unique(df[1]), sapply(df[cols], function(x)
  !duplicated(by(x, df$groups, paste0, collapse = '-'))))
# groups colA colB colC
#1 gr1 TRUE TRUE TRUE
#4 gr2 FALSE TRUE TRUE
#7 gr3 FALSE FALSE TRUE
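Here by() pastes each group's values into a single string per column, and !duplicated() then flags the first appearance of each string. The intermediate is easy to inspect with tapply(), which behaves equivalently here (an illustration, not part of the answer):
tapply(df$colB, df$groups, paste0, collapse = '-')
#     gr1       gr2      gr3
# "a-b-c"  "X_-b-c"  "a-b-c"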
Your summarize idea is spot on:
df %>%
  group_by(groups) %>%
  summarize(across(starts_with("col"), paste, collapse = ""), .groups = "drop") %>%
  mutate(across(starts_with("col"), ~ !duplicated(.)))
# # A tibble: 3 x 4
# groups colA colB colC
# <chr> <lgl> <lgl> <lgl>
# 1 gr1 TRUE TRUE TRUE
# 2 gr2 FALSE TRUE TRUE
# 3 gr3 FALSE FALSE TRUE
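One caveat: collapsing with collapse = "" can make distinct groups collide (e.g. c("ab", "c") and c("a", "bc") both become "abc"). Pasting with a separator avoids this; a hedged variant of the same idea:
df %>%
  group_by(groups) %>%
  summarize(across(starts_with("col"), paste, collapse = "|"), .groups = "drop") %>%
  mutate(across(starts_with("col"), ~ !duplicated(.)))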
With "data.table" you can try:
library(data.table)
cols <- c("colA", "colB", "colC")
fun <- function(x) !duplicated(x)
as.data.table(df)[, lapply(.SD, toString), groups, .SDcols = cols][
  , (cols) := lapply(.SD, fun), .SDcols = cols][]
# groups colA colB colC
# 1: gr1 TRUE TRUE TRUE
# 2: gr2 FALSE TRUE TRUE
# 3: gr3 FALSE FALSE TRUE

How to match multiple columns based on lookup table

I have the following two data frames:
lookup <- data.frame(id = c("A", "B", "C"),
                     price = c(1, 2, 3))
results <- data.frame(price_1 = c(2, 2, 1),
                      price_2 = c(3, 1, 1))
I now want to go through all columns of results and add the respective matching id from lookup as new columns. So I first want to take the price_1 column and find the ids (here: "B", "B", "A") and add it as a new column to results and then I want to do the same for the price_2 column.
My real-life case would need to match 20+ columns, so I want to avoid a hard-coded manual solution and am looking for a dynamic approach, ideally in the tidyverse.
results <- results %>%
  left_join(lookup, by = c("price_1" = "price"))
would give me the manual solution for the first column, and I could repeat this for the second column, but I'm wondering if I can do this automatically for all my results columns.
Expected output:
price_1 price_2 id_1 id_2
2 3 "B" "C"
2 1 "B" "A"
1 1 "A" "A"
We could unlist the dataframe and match directly.
new_df <- results
names(new_df) <- paste0("id", seq_along(new_df))
new_df[] <- lookup$id[match(unlist(new_df), lookup$price)]
cbind(results, new_df)
# price_1 price_2 id1 id2
#1 2 3 B C
#2 2 1 B A
#3 1 1 A A
In dplyr, we can do
library(dplyr)
bind_cols(results, results %>% mutate_all(~lookup$id[match(., lookup$price)]))
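mutate_all() is superseded as of dplyr 1.0.0; an equivalent sketch with across(), using .names to label the new columns:
results %>%
  mutate(across(everything(),
                ~ lookup$id[match(.x, lookup$price)],
                .names = "id_{.col}"))
#   price_1 price_2 id_price_1 id_price_2
# 1       2       3          B          C
# 2       2       1          B          A
# 3       1       1          A          A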
You can use apply and match to look up multiple columns against the lookup table:
cbind(results, t(apply(results, 1, function(i) lookup[match(i, lookup[,2]),1])))
# price_1 price_2 1 2
#1 2 3 B C
#2 2 1 B A
#3 1 1 A A

Counting the number of sequences of a particular value for multiple columns in a data frame

I would like to create a table or a new data frame that displays, for each column in the original data frame (of which there are many), how many times a sequence of a specific value occurs. For example, if I have the following data frame:
x <- data.frame("Red" = c("a", "b", "a", "a", "c", "d"),
                "Blue" = c("b", "a", "a", "a", "a", "a"),
                "Green" = c("a", "a", "b", "a", "b", "a"))
and I want to know, for each color (Red, Blue, and Green), how many times a run of the value "a" occurs (i.e., a stretch of consecutive "a" values before it is interrupted by another value like b or c).
The correct answer would look something like this:
Color a_sequences
1 Red 2
2 Blue 1
3 Green 3
I have found solutions on this site using the rle function on single vectors but I don't know how to scale this up for an entire data frame with multiple columns and obtain a new table or data frame with the sequence counts, as in the above table. Thank you!
f = function(v, ch) sum(rle(as.character(v))$values == ch)
sapply(x, f, 'a')
# Red Blue Green
# 2 1 3
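rle() collapses consecutive repeats into single runs, so each maximal run of "a" contributes exactly one element to $values; for example:
rle(c("a", "a", "b", "a"))$values
# [1] "a" "b" "a"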
Here is a tidyverse option: gather() into 'long' format, group by 'color' and the run-length id (rleid) of 'value', keep the rows where 'value' is 'a', then group by 'color' and count the number of distinct 'grp' values.
library(tidyverse)
library(data.table)
gather(x, color, value) %>%
  group_by(color, grp = rleid(value)) %>%
  filter(value == "a") %>%
  group_by(color) %>%
  summarise(n = n_distinct(grp))
# A tibble: 3 x 2
# color n
# <chr> <int>
#1 Blue 1
#2 Green 3
#3 Red 2
Or an option with summarise_all:
x %>%
  summarise_all(list(~ n_distinct(setdiff((. == 'a') * rleid(.), 0)))) %>%
  as.list %>%
  enframe %>%
  unnest
# A tibble: 3 x 2
# name value
# <chr> <int>
#1 Red 2
#2 Blue 1
#3 Green 3
Here is another idea. We can collapse each column into a single string, split on any character that is not "a", and count how many of the resulting elements contain "a". result2 is the final outcome.
result <- sapply(x, function(x) {
  x2 <- as.character(x)
  y <- paste0(x2, collapse = "")
  z <- strsplit(y, "[^a]")[[1]]
  return(sum(grepl("a", z)))
})
result2 <- data.frame(Color = names(result), a_sequence = unname(result))
result2
# Color a_sequence
# 1 Red 2
# 2 Blue 1
# 3 Green 3
In base R you could do:
f <- as.character(unique(unlist(x)))
t(sapply(x, function(y) table(factor(rle(as.character(y))$values, f))))
      a b c d
Red   2 1 1 1
Blue  1 1 0 0
Green 3 2 0 0
It seems your whole dataset consists of factors, which is why as.character() is used.

How to add new column to R dataframe based on values in multiple columns

I have created the following dataframe
df<-data.frame("A"<-(1:5), "B"<-c("A","B", "C", "B",'C' ), "C"<-c("A", "A",
"B", 'B', "B"))
names(df)<-c("A", "B", "C")
I am trying to find the rows in which columns B and C hold the same (duplicated) value and add the corresponding values from column A. The expected data frame should be
df2
A DupValues
1 A
4 B
I am unable to do this. I request some help here.
df <- data.frame(A = 1:5,
                 B = c("A", "B", "C", "B", "C"),
                 C = c("A", "A", "B", "B", "B"), stringsAsFactors = FALSE)
library(dplyr)
df %>%
  filter(B == C) %>%            # keep rows where B equals C
  group_by(A) %>%               # for each A
  transmute(DupValues = B) %>%  # keep the duplicated value
  ungroup()                     # forget the grouping
# # A tibble: 2 x 2
# A DupValues
# <int> <chr>
# 1 1 A
# 2 4 B
Note that this works if your variables are character vectors, not factors.
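Incidentally, the group_by()/ungroup() pair is not strictly needed here; a shorter equivalent sketch:
df %>%
  filter(B == C) %>%
  transmute(A, DupValues = B)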

left_join two data frames and overwrite

I'd like to merge two data frames where df2 overwrites any values that are NA or present in df1. Merge data frames and overwrite values provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options but none seem to do this. Is there a way to do this with dplyr?
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1, 2, NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4
I think what you want is to keep the values of df2 and only add the rows of df1 that are not present in df2, which is what anti_join() does:
"anti_join() returns all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
This line gives the desired output (in a different order), but you should pay attention to the warning messages: when working with your own dataset, be sure to read y in as a character variable.
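Since dplyr 1.0.0 there is also rows_upsert(), which updates the rows of x that match y and appends the rest, and so solves this directly (a sketch, assuming y uniquely identifies rows):
library(dplyr)
rows_upsert(df1, df2, by = "y")
#   y x1
# 1 A  5
# 2 B  6
# 3 C  7
# 4 D  4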
This is the idiom I now use, since it also handles keeping columns that are not part of the update table. I use somewhat different names than the OP, but the flavor is similar. The one thing I do is create a variable for the keys used in the join, since I use them in a few spots; otherwise, it does what is desired. By itself it doesn't handle conditions like "update this row only if a value is NA"; you should apply that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")
.base_table <- tribble(
  ~key1, ~key2, ~val1, ~val2,
  "A", "a", 0, 0,
  "A", "b", 0, 1,
  "B", "a", 1, 0,
  "B", "b", 1, 1)
.join_table <- tribble(
  ~key1, ~key2, ~val2,
  "A", "b", 100,
  "B", "a", 111)
# This works
df_result <- .base_table %>%
  # Pull off rows from the base table that match the join table
  semi_join(.join_table, .keys) %>%
  # Drop cols from the base table that are in the join table, except the key columns
  select(-matches(setdiff(names(.join_table), .keys))) %>%
  # Left join on the join table columns
  left_join(.join_table, .keys) %>%
  # Bind on the rows of the base table that have no match in the join table
  bind_rows(.base_table %>% anti_join(.join_table, .keys))
df_result %>%
  print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)
