Finding duplicates in a dataframe and returning count of each duplicate record - r

I have a dataframe like
col1 col2 col3
A B C
A B C
A B B
A B B
A B C
B C A
I want to get an output in the below format:
col1 col2 col3 Count
A B C 3 Duplicates
A B B 2 Duplicates
I don't want to pass any specific column to the function that finds the duplicates; that is the reason for not using add_count from dplyr.
Using duplicated() keeps every duplicated row:
col1 col2 col3 count
2 A B C 3
3 A B B 2
5 A B C 3
So not the desired output.

We can use group_by_all to group by all columns and then remove the ones which are not duplicates by selecting rows which have count > 1.
library(dplyr)
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
# col1 col2 col3 n
# <fct> <fct> <fct> <int>
#1 A B B 2
#2 A B C 3
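Note that group_by_all() is superseded in recent dplyr versions; the same result can be written with count() over across(everything()) (a sketch, assuming dplyr >= 1.0):

```r
library(dplyr)

# Same data as in the question
df <- data.frame(col1 = c("A", "A", "A", "A", "A", "B"),
                 col2 = c("B", "B", "B", "B", "B", "C"),
                 col3 = c("C", "C", "B", "B", "C", "A"))

# count() with across(everything()) groups by every column at once,
# so no column has to be named explicitly
df %>%
  count(across(everything())) %>%
  filter(n > 1)
```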

We can use data.table
library(data.table)
setDT(df)[, .(n = .N), names(df)][n > 1]
# col1 col2 col3 n
#1: A B C 3
#2: A B B 2
Or with base R
subset(aggregate(n ~ ., transform(df, n = 1), FUN = sum), n > 1)
# col1 col2 col3 n
#2 A B B 2
#3 A B C 3

dplyr mutate compare with another data frame

I have 2 data frames like this:
A:
col1 col2
1 a
1 b
1 b
1 c
1 c
2 x
2 y
2 y
3 k
3 k
3 m
3 m
B:
col1 col2 col3
1 a 0.3
1 b 0.001
1 c 0.0004
2 x 0.005
2 y 0.09
3 k 0.00007
3 m 0.08
What I want to do is to create another column col3 on A using mutate and ifelse. If the value in col3 of B for that col2 value is less than 0.05, I want the value in col3 of A to be "other"; otherwise the value from col2 of A. The output should look like this:
A:
col1 col2 col3
1 a a
1 b other
1 b other
1 c other
1 c other
2 x x
2 y y
2 y y
3 k other
3 k other
3 m m
3 m m
I tried using a mutate and ifelse combination, but couldn't figure out how to do the comparison part between A and B.
vals_for_plot = A %>%
  mutate(col3 = ifelse( **value for col2 of A in B** < 0.05, "others", col2))
Thank you
We may do a join after modifying the second dataset
library(dplyr)
B %>%
  mutate(col3 = case_when(col3 < 0.05 ~ 'other', TRUE ~ col2)) %>%
  left_join(A, .)
Or with data.table
library(data.table)
setDT(A)[B, col3 := fcase(col3 >= 0.05, col2, default = 'other'),
    on = .(col1, col2)]
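The same join-then-recode idea can be sketched in base R with merge() and ifelse() (column names as in the question; note merge() does not preserve A's row order, so treat this as a sketch):

```r
# Data frames as described in the question
A <- data.frame(col1 = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                col2 = c("a", "b", "b", "c", "c", "x", "y", "y", "k", "k", "m", "m"))
B <- data.frame(col1 = c(1, 1, 1, 2, 2, 3, 3),
                col2 = c("a", "b", "c", "x", "y", "k", "m"),
                col3 = c(0.3, 0.001, 0.0004, 0.005, 0.09, 0.00007, 0.08))

# Join A to B on (col1, col2), then recode col3 by threshold
m <- merge(A, B, by = c("col1", "col2"), sort = FALSE)
m$col3 <- ifelse(m$col3 < 0.05, "other", m$col2)
```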

Merge nth elements from two columns while keeping the original row order in R

I am attempting to merge every nth element from col1 with the values from the same rows of col2, into a new column col3:
df <- data.frame(col1 = c('A', 'B', 'D', 'F', 'C'), col2 = c(2, 1, 2, 3, 1))
> df
col1 col2
1 A 2
2 B 1
3 D 2
4 F 3
5 C 1
If I were to merge every odd element from col1 with every even element from col2, for example, the output should look something like this:
> df
col1 col2 col3
1 A 2 A
2 B 1 1
3 D 2 D
4 F 3 3
5 C 1 C
Thanks.
We could do it with an ifelse statement, checking whether the row number is even or odd with the modulo operator %%:
library(dplyr)
df %>%
  mutate(col3 = ifelse(row_number() %% 2 == 0, col2, col1))
col1 col2 col3
1 A 2 A
2 B 1 1
3 D 2 D
4 F 3 3
5 C 1 C
In base R, we may also use a row/column indexing
df$col3 <- df[cbind(seq_len(nrow(df)), rep(1:2, length.out = nrow(df)))]
Output:
> df
col1 col2 col3
1 A 2 A
2 B 1 1
3 D 2 D
4 F 3 3
5 C 1 C
With base R:
df <- data.frame(col1 = c('A', 'B', 'D', 'F', 'C'), col2 = c(2, 1, 2, 3, 1))
df$col3 <- df$col1
df$col3[c(FALSE, TRUE)] <- df$col2[c(FALSE, TRUE)]
df
#> col1 col2 col3
#> 1 A 2 A
#> 2 B 1 1
#> 3 D 2 D
#> 4 F 3 3
#> 5 C 1 C
Created on 2022-03-06 by the reprex package (v2.0.1)
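The even/odd pick can also be written as a single base R ifelse() over the row numbers (a sketch; the result is character because col1 is character and col2 gets coerced):

```r
df <- data.frame(col1 = c('A', 'B', 'D', 'F', 'C'), col2 = c(2, 1, 2, 3, 1))

# Even row numbers take col2, odd row numbers take col1
df$col3 <- ifelse(seq_len(nrow(df)) %% 2 == 0, df$col2, df$col1)
df
```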

Removing multiple subsequent rows with a condition in R

I have df as below
df
id col1
1 D
1 D
1 D
1 B
1 C
I would like to remove rows where "D" occurs more than 2 times in a row
result
df
id col1
1 D
1 D
1 B
1 C
df[with(df, col1 != "D" | sequence(rle(col1)$lengths) <= 2),]
# id col1
#1 1 D
#2 1 D
#4 1 B
#5 1 C
Assuming that you want this done separately per id and that the occurrences need not be consecutive this is a base R one-liner:
subset(df, as.numeric(ave(col1, id, col1, FUN = seq_along)) <= 2 | col1 != "D")
(ave() returns a character vector here because col1 is character, so convert to numeric before comparing.)
Note
The input in reproducible form is assumed to be:
Lines <- "id col1
1 D
1 D
1 D
1 B
1 C"
df <- read.table(text = Lines, header = TRUE, as.is = TRUE)
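If the removal should apply only to consecutive runs of "D" within each id, the rle() idea from the first answer can be combined with ave() (a base R sketch; run_pos is a helper name introduced here, not from the answers above):

```r
df <- data.frame(id = c(1, 1, 1, 1, 1),
                 col1 = c("D", "D", "D", "B", "C"))

# Number each position within a run of identical values, per id
run_pos <- ave(seq_along(df$col1), df$id,
               FUN = function(i) sequence(rle(df$col1[i])$lengths))

# Keep non-"D" rows, and at most the first two rows of each "D" run
df[df$col1 != "D" | run_pos <= 2, ]
```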

Extract the subset of a dataframe with values not present in two other dataframes

I have three dataframes df1,df2,df3. I would like to identify the value(s) in col1 of df2 not present in col1 of df1 and/or col1 of df3.
df1 <- data.frame(col1=c('A','C','E'),col2=c(4,8,2))
df1
df2 <- data.frame(col1=c('A','B','C','E','G','I'),col2=c(4,8,2,6,1,9))
df2
df3 <- data.frame(col1=LETTERS[3:26],col2=sample(3:26))
df3
# Expected output
#2 B 8
What I have done?
table(df2$col1 %in% df1$col1)
# FALSE TRUE
# 3 3
df2[df2$col1 %in% df1$col1,]
# col1 col2
#1 A 4
#3 C 2
#4 E 6
df2[!df2$col1 %in% df1$col1,]
# col1 col2
#2 B 8
#5 G 1
#6 I 9
table(df2$col1 %in% df3$col1)
#FALSE TRUE
# 2 4
df2[df2$col1 %in% df3$col1,]
# col1 col2
#3 C 2
#4 E 6
#5 G 1
#6 I 9
df2[!df2$col1 %in% df3$col1,]
# col1 col2
#1 A 4
#2 B 8
A wrong approach:
df2[!df2$col1[!df2$col1 %in% df1$col1] %in% df3$col1,]
# col1 col2
#1 A 4
#4 E 6
How to avoid the repetition of the indices?
Is there any better approach than the one below?
df2[!df2$col1 %in% df1$col1,][!df2$col1[!df2$col1 %in% df1$col1] %in% df3$col1,]
# col1 col2
#2 B 8
While the correct approach is:
df2[!(df2$col1 %in% df1$col1 | df2$col1 %in% df3$col1),]
# col1 col2
#2 B 8
We can use anti_join
library(dplyr)
bind_rows(df1, df3) %>%
  anti_join(df2, ., by = "col1")
# col1 col2
#1 B 8
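The two %in% tests can also be collapsed into one setdiff() on the combined col1 values, a base R sketch (df3's col2 is random but unused by the comparison):

```r
df1 <- data.frame(col1 = c('A', 'C', 'E'), col2 = c(4, 8, 2))
df2 <- data.frame(col1 = c('A', 'B', 'C', 'E', 'G', 'I'), col2 = c(4, 8, 2, 6, 1, 9))
df3 <- data.frame(col1 = LETTERS[3:26], col2 = sample(3:26))

# Values of df2$col1 absent from both df1$col1 and df3$col1
only_df2 <- setdiff(df2$col1, c(df1$col1, df3$col1))
df2[df2$col1 %in% only_df2, ]
```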

Combine dataframes with common column names but different values in columns

I think I ran into a situation that (to the best of my knowledge) is not fully covered by the awesome dplyr library, so I guess it will need a bit more coding than I am capable of. I have the following 2 data frames:
df1 =
Col1 Col2 Col3
1 A 1 X
2 A 1 X
3 B 1 X
4 C 1 X
5 D 1 Y
6 D 1 Z
df2 =
Col1 Col2 Col3
1 A 2 X
2 B 2 Y
3 C 2 Y
4 G 2 Z
5 H 2 X
6 I 2 Z
I want only the rows which have common elements in Col1, this is:
out =
Col1 Col2 Col3
1 A 1 X
2 A 1 X
3 B 1 X
4 C 1 X
5 A 2 X
6 B 2 Y
7 C 2 Y
It looks like dplyr::intersect would do it, but since Col2 and Col3 have different values, it gives me a table with 0 rows. Your guidance is much appreciated. Thanks. P. Perez.
With base R you could do:
common <- intersect(df1$Col1, df2$Col1)
df3 <- rbind(df1, df2)
df3[df3$Col1 %in% common, ]
which gives:
Col1 Col2 Col3
1 A 1 X
2 A 1 X
3 B 1 X
4 C 1 X
11 A 2 X
21 B 2 Y
31 C 2 Y
And with dplyr:
bind_rows(df1, df2) %>%
  filter(Col1 %in% intersect(df1$Col1, df2$Col1))
which will give you the same output. An alternative by #Frank from the comments:
bind_rows(df1, df2, .id = "id") %>%
  group_by(Col1) %>%
  filter(n_distinct(id) == 2L)
The logic behind this is that you bind the two dataframes together, simultaneously including an id column via the .id parameter. Then group by the values of Col1 and check how many unique ids there are for each value. The values with only one unique id don't appear in both dataframes.
A similar logic can be applied with the data.table package:
library(data.table)
rbindlist(list(df1, df2), idcol = 'id')[, if (uniqueN(id) == 2L) .SD, by = Col1]
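Another dplyr spelling of the same filter is a pair of semi_join() calls, which keeps only the rows whose Col1 appears in both inputs without computing intersect() by hand (a sketch on the question's data):

```r
library(dplyr)

df1 <- data.frame(Col1 = c("A", "A", "B", "C", "D", "D"),
                  Col2 = 1,
                  Col3 = c("X", "X", "X", "X", "Y", "Z"))
df2 <- data.frame(Col1 = c("A", "B", "C", "G", "H", "I"),
                  Col2 = 2,
                  Col3 = c("X", "Y", "Y", "Z", "X", "Z"))

# semi_join() filters to rows with a match in the second table,
# so chaining one per input keeps Col1 values present in both
bind_rows(df1, df2) %>%
  semi_join(df1, by = "Col1") %>%
  semi_join(df2, by = "Col1")
```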
