Remove duplicates, keeping most frequent row

I would like to deduplicate my data, keeping the row that appears most frequently. If there is a tie between rows, I don't care which one is returned; the first in alphabetical or numeric order is fine. I would like to do this within each group of id and var.
MRE:
df <- data.frame(
  id = rep("a", 8),
  var = c(rep("b", 4), rep("c", 4)),
  val = c("d", "d", "d", "e", "f", "f", "g", "g")
)
> df
id var val
1 a b d
2 a b d
3 a b d
4 a b e
5 a c f
6 a c f
7 a c g
8 a c g
Should be:
id var val
1 a b d
2 a c f
I'm working with large datasets and tidyverse pipe chains, so a dplyr solution would be preferable.

Use table and which.max to extract the mode:
df %>%
  group_by(id, var) %>%
  summarise(val = {t <- table(val); names(t)[which.max(t)] })
# A tibble: 2 x 3
# Groups: id [?]
# id var val
# <fct> <fct> <chr>
#1 a b d
#2 a c f
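To see why this works and how ties are resolved, here is a minimal base R sketch on a bare vector. The helper name mode_of is hypothetical and not part of the answer. table() orders its counts by sorted value and which.max() returns the first maximum, so a tie resolves to the first value in sort order.

```r
# Hypothetical helper (not from the answer): most frequent value of a vector.
# On a tie, the first value in sort order is returned.
mode_of <- function(x) {
  counts <- table(x)                # counts, ordered by sorted unique value
  names(counts)[which.max(counts)]  # which.max() picks the first maximum
}

mode_of(c("d", "d", "d", "e"))  # "d"
mode_of(c("g", "f", "g", "f"))  # tie between "f" and "g": returns "f"
```

Note that the result is always a character string, because it is taken from the names of the table.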
Another way to do this in base R: create a three-way contingency table directly, and then find the most frequent level along the third axis:
apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
# var
#id b c
# a "d" "f"
Convert this to a data frame:
as.data.frame.table(
apply(table(df), c(1, 2), function(v) names(v)[which.max(v)])
)
# id var Freq
#1 a b d
#2 a c f

Using dplyr:
library(dplyr)
df %>%
  group_by(id, var, val) %>%
  summarise(n = n()) %>%
  group_by(id, var) %>%
  arrange(-n) %>%
  slice(1) %>%
  ungroup() %>%
  select(-n)
# # A tibble: 2 x 3
# id var val
# <fct> <fct> <fct>
# 1 a b d
# 2 a c f

One option could be using table and which.max as:
library(dplyr)
df %>%
  group_by(id, var) %>%
  filter(val == names(which.max(table(val)))) %>%
  slice(1)
# # A tibble: 2 x 3
# # Groups: id, var [2]
# id var val
# <fctr> <fctr> <fctr>
# 1 a b d
# 2 a c f
NOTE: group a c is a tie between f and g; which.max returns the first maximum, so f is kept. Per the OP, any record can be returned in case of a tie.

I doubt this is any faster, but another option, assuming identical values of val are contiguous within each group (as they are in the example), is
df %>%
  group_by(id, var) %>%
  filter(row_number() == rle(as.character(val))$lengths %>%
           {sum(.[1:which.max(.)])})
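The index arithmetic in that filter is easier to follow on a plain vector. The following is a minimal sketch with made-up values (the vector is an assumption, not data from the question): rle() gives the run lengths, and summing them up to the longest run gives the position of the last element of that run.

```r
# Made-up example vector; equal values must already be contiguous
val <- c("f", "g", "g", "g", "h")

runs <- rle(val)
runs$lengths  # 1 3 1

# Position of the last element of the longest run:
# sum of all run lengths up to and including that run
idx <- sum(runs$lengths[seq_len(which.max(runs$lengths))])
idx       # 4
val[idx]  # "g"
```

If the same value occurs in two separate runs, rle() counts them separately, which is why the approach depends on the ordering of the rows.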

A dplyr solution using count:
library(dplyr)
df %>%
  count(id, var, val, sort = TRUE) %>%
  group_by(id, var) %>%
  summarize_at("val", head, 1)
# # A tibble: 2 x 3
# id var val
# <fctr> <fctr> <fctr>
# 1 a b d
# 2 a c f
or maybe more idiomatic but longer:
df %>%
  count(id, var, val, sort = TRUE) %>%
  group_by(id, var) %>%
  slice(1) %>%
  select(-n) %>%
  ungroup()
Or with tally for the same output with slightly different syntax:
df %>%
  group_by(id, var, val) %>%
  tally(sort = TRUE) %>%
  slice(1) %>%
  select(-n) %>%
  ungroup()
and a base solution :
df2 <- aggregate(x ~ .,cbind(df,x=1),sum)
aggregate(val ~ id+var, df2[order(-df2$x),],head,1)
# id var val
# 1 a b d
# 2 a c f
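As a further variation on the count idea, here is a sketch that assumes dplyr 1.0.0 or later, where slice_max() is available. It is not taken from the answers above. With with_ties = FALSE exactly one row per group is kept; which of the tied rows that is should not be relied upon.

```r
library(dplyr)

df <- data.frame(
  id  = rep("a", 8),
  var = c(rep("b", 4), rep("c", 4)),
  val = c("d", "d", "d", "e", "f", "f", "g", "g")
)

res <- df %>%
  count(id, var, val) %>%                     # frequency n of each (id, var, val)
  group_by(id, var) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%  # keep a single top row per group
  ungroup() %>%
  select(-n)

res
```

The result has one row per id and var combination, with d for group b and either f or g for the tied group c.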

Here is my try:
library(dplyr)
df %>%
  group_by(id, var, val) %>%
  mutate(n = n()) %>%
  arrange(desc(n)) %>%
  group_by(id, var) %>%
  filter(row_number() == 1) %>%
  select(-n)

Related

dplyr: group over several variables in a function

I want to have two lists of grouping variables, say list1 = c("var2","var3","var4") and list2 = c("var2","var3").
dta = data.frame(var1 = c(1:8),
var2 = c(rep("AA",4),rep("BB",4)),
var3 = rep(c("C","D"),4),
var4 = c(1,1,0,0,0,0,1,1))
dta %>% group_by(var2,var3,var4) %>% summarise(mv1 = mean(var1)) %>%
group_by(var2,var3) %>% summarise(mv1_2 = mean(mv1))
How can I create a function like this?
sample_fun = function(dta, list1, list2){
dta %>% group_by(list1) %>% summarise(mv1 = mean(var1)) %>%
group_by(list2) %>% summarise(mv1_2 = mean(mv1))
}

Here are two ways to do this -
Pure dplyr solution using across :
library(dplyr)
library(rlang)
sample_fun = function(dta, list1, list2){
dta %>%
group_by(across(all_of(list1))) %>%
summarise(mv1 = mean(var1)) %>%
ungroup %>%
group_by(across(all_of(list2))) %>%
summarise(mv1_2 = mean(mv1))
}
sample_fun(dta, list1, list2)
# var2 var3 mv1_2
# <chr> <chr> <dbl>
#1 AA C 2
#2 AA D 3
#3 BB C 6
#4 BB D 7
Using non-standard evaluation with syms :
sample_fun = function(dta, list1, list2){
  dta %>%
    group_by(!!!syms(list1)) %>%
    summarise(mv1 = mean(var1)) %>%
    ungroup %>%
    group_by(!!!syms(list2)) %>%
    summarise(mv1_2 = mean(mv1))
}
sample_fun(dta, list1, list2)
# var2 var3 mv1_2
# <chr> <chr> <dbl>
#1 AA C 2
#2 AA D 3
#3 BB C 6
#4 BB D 7
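A small variant of the across version, sketched under the assumption of dplyr 1.0.0 or later: passing .groups = "drop" to summarise() returns an ungrouped result, so the explicit ungroup step is not needed. The name sample_fun2 is hypothetical and only used here to avoid overwriting the function above.

```r
library(dplyr)

dta <- data.frame(var1 = 1:8,
                  var2 = c(rep("AA", 4), rep("BB", 4)),
                  var3 = rep(c("C", "D"), 4),
                  var4 = c(1, 1, 0, 0, 0, 0, 1, 1))

# Hypothetical variant: grouping columns are passed as character vectors
sample_fun2 <- function(dta, list1, list2) {
  dta %>%
    group_by(across(all_of(list1))) %>%
    summarise(mv1 = mean(var1), .groups = "drop") %>%  # result is ungrouped
    group_by(across(all_of(list2))) %>%
    summarise(mv1_2 = mean(mv1), .groups = "drop")
}

res <- sample_fun2(dta, c("var2", "var3", "var4"), c("var2", "var3"))
res$mv1_2  # 2 3 6 7
```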

Using the value in one column to specify from which row to retrieve a value for a new column

I'm looking for an automated way of converting this:
dat = tribble(
~a, ~b, ~c
, 'x', 1, 'y'
, 'y', 2, NA
, 'q', 4, NA
, 'z', 3, 'q'
)
to:
tribble(
~a, ~b, ~d
, 'x', 1, 2
, 'z', 3, 4
)
So, the column c in dat encodes which row in dat to look at to grab a value for a new column d, and if c is NA, toss that row from the output. Any tips?

We can join dat with itself using c and a columns.
library(dplyr)
dat %>%
inner_join(dat %>% select(-c) %>% rename(d = 'b'),
by = c('c' = 'a'))
# A tibble: 2 x 4
# a b c d
# <chr> <dbl> <chr> <dbl>
#1 x 1 y 2
#2 z 3 q 4
In base R, we can do this with merge :
merge(dat, dat[-3], by.x = 'c', by.y = 'a')

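We can create d as the lead of b, filter out the rows where c is NA, and remove the c column with select. This relies on the referenced row immediately following the referencing row, as it does in the example data.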
library(dplyr)
dat %>%
mutate(d = lead(b)) %>%
filter(!is.na(c)) %>%
select(-c)
# A tibble: 2 x 3
# a b d
# <chr> <dbl> <dbl>
#1 x 1 2
#2 z 3 4
Or more compactly
dat %>%
mutate(d = replace(lead(b), is.na(c), NA), c = NULL) %>%
na.omit
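Or with fill (on the example data as given, row z ends up alone in its filled group, so this grouped variant returns NA rather than 4 for that row)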
library(tidyr)
dat %>%
mutate(c1 = c) %>%
fill(c1) %>%
group_by(c1) %>%
mutate(d = lead(b)) %>%
ungroup %>%
filter(!is.na(c)) %>%
select(-c, -c1)
Or in data.table
library(data.table)
setDT(dat)[, d := shift(b, type = 'lead')][!is.na(c)][, c := NULL][]
# a b d
#1: x 1 2
#2: z 3 4
NOTE: Both solutions are simple and don't require any joins. Besides, they give the expected output in the OP's post.
Or using match from base R
cbind(na.omit(dat), d = with(dat, b[match(c, a, nomatch = 0)]))[, -3]
# a b d
#1 x 1 2
#2 z 3 4
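A lookup by key, as with match or the self-join, does not depend on where the referenced row sits. The sketch below uses a reordered copy of the example data (the reordering is an assumption made for illustration) and plain data frame columns in base R.

```r
# Same rows as the example, but shuffled so that the referenced rows
# no longer follow the rows that point to them
dat <- data.frame(
  a = c("q", "x", "z", "y"),
  b = c(4, 1, 3, 2),
  c = c(NA, "y", "q", NA),
  stringsAsFactors = FALSE
)

keep <- !is.na(dat$c)                     # rows that reference another row
out <- data.frame(
  a = dat$a[keep],
  b = dat$b[keep],
  d = dat$b[match(dat$c[keep], dat$a)]    # look up b in the row whose a equals c
)
out
#   a b d
# 1 x 1 2
# 2 z 3 4
```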

dplyr: create new variable based upon grouping

Given this dataframe:
library(dplyr)
df.ex <- tibble(id = c(rep(1, 4), rep(2, 4), rep(3, 4)),
var1 = c('a','a','b','b','a','a','a','a','b','b','b','b'))
I would like to create a new variable var2 based on the presence of b in var1 within each id group. Each id can then contain only one value in the output column. This is the hoped-for outcome:
df.ex.outcome <- tibble(id = c(rep(1, 4), rep(2, 4), rep(3, 4)),
var1 = c('a','a','b','b','a','a','a','a','b','b','b','b'),
var2 = c(rep('foo', 4), rep('bar', 4), rep('foo', 4)))
I thought that using group_by would solve this, however it doesn't appear to work, like so:
df.ex <- df.ex %>% group_by(id) %>% mutate(var2 = if_else(var1 %in% 'b', 'foo','bar'))
Does anyone have any ideas on how to do this?

We can wrap with any
df.ex %>%
group_by(id) %>%
mutate(var2 = case_when(any(var1 == "b")~ "foo", TRUE ~ "bar"))
# A tibble: 12 x 3
# Groups: id [3]
# id var1 var2
# <dbl> <chr> <chr>
# 1 1 a foo
# 2 1 a foo
# 3 1 b foo
# 4 1 b foo
# 5 2 a bar
# 6 2 a bar
# 7 2 a bar
# 8 2 a bar
# 9 3 b foo
#10 3 b foo
#11 3 b foo
#12 3 b foo
Or reverse the arguments for %in%
df.ex %>%
group_by(id) %>%
mutate(var2 = case_when("b" %in% var1 ~ "foo", TRUE ~ "bar"))
Or using if_else
df.ex %>%
group_by(id) %>%
mutate(var2 = if_else('b' %in% var1, 'foo','bar'))
so that there will be a single TRUE/FALSE output from %in%, which we can also use with if/else
df.ex %>%
group_by(id) %>%
mutate(var2 = if("b" %in% var1) "foo" else "bar")
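The same per-group test can be sketched in base R with ave(), which applies a function within each group and recycles the single result over the group's rows. This is an illustrative alternative, not part of the answer; the intermediate name has_b is made up.

```r
df.ex <- data.frame(
  id   = rep(1:3, each = 4),
  var1 = c("a", "a", "b", "b", "a", "a", "a", "a", "b", "b", "b", "b")
)

# TRUE for every row of an id that contains at least one "b"
has_b <- ave(df.ex$var1 == "b", df.ex$id, FUN = any)
df.ex$var2 <- ifelse(has_b, "foo", "bar")

unique(df.ex[, c("id", "var2")])
```

Each id ends up with a single value of var2: foo for ids 1 and 3, bar for id 2.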

Spread multiple columns [tidyr]

I would like to spread data over multiple columns using tidyr.
dat <- data.frame(ID = rep(1,10),
col1 = LETTERS[seq(1,10)],
col2 = c(letters[seq(1,8)],NA,NA),
col3 = c(rep(NA,8),"5",NA),
col4 = c(rep(NA,8),NA,"value"))
The expected outcome is:
Out <- data.frame(t(c(1,letters[seq(1,8)],"5","value")),row.names=NULL)
colnames(Out) <- c("ID",LETTERS[seq(1,10)])
I came up with:
a <- dat %>% gather(variable, value, -(ID:col1)) %>%
unite(temp, col1, variable) %>%
spread(temp, value)
a[,-which(is.na(a))]
which is clumsy and also changes the column names. Is there a better solution for this?

We can use the na.rm=TRUE in gather, remove the 'variable' with select and use spread
library(dplyr)
library(tidyr)
gather(dat, variable, val, -(ID:col1), na.rm=TRUE) %>%
select(-variable) %>%
spread(col1, val)
# ID A B C D E F G H I J
#1 1 a b c d e f g h 5 value
Update
With tidyr 1.0.0 or later (first available in the devel version, tidyr_0.8.3.9000), we can use pivot_wider when there are multiple value columns to be considered
dat %>%
  pivot_wider(names_from = col1, values_from = paste0("col", 2:4)) %>%
  select_if(~ any(!is.na(.)))
# A tibble: 1 x 11
# ID col2_A col2_B col2_C col2_D col2_E col2_F col2_G col2_H col3_I col4_J
# <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
#1 1 a b c d e f g h 5 value
If we are using reshape2, similar option is
library(reshape2)
dcast(melt(dat, measure = 3:5, na.rm=TRUE),
ID~col1, value.var='value')
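The gather and spread steps can also be written with their successors pivot_longer and pivot_wider. This is a sketch assuming tidyr 1.0.0 or later; the data is rebuilt with stringsAsFactors = FALSE (an assumption, to keep the value columns as character), and the intermediate name long is made up.

```r
library(tidyr)

dat <- data.frame(ID = rep(1, 10),
                  col1 = LETTERS[1:10],
                  col2 = c(letters[1:8], NA, NA),
                  col3 = c(rep(NA, 8), "5", NA),
                  col4 = c(rep(NA, 9), "value"),
                  stringsAsFactors = FALSE)

# Stack col2..col4 into one value column, dropping the NA cells
long <- pivot_longer(dat, cols = col2:col4, names_to = "variable",
                     values_to = "val", values_drop_na = TRUE)

# Drop the source-column name, then spread the values by col1
out <- pivot_wider(long[, c("ID", "col1", "val")],
                   names_from = col1, values_from = val)
out
```

The result is a single row with columns ID and A to J, keeping the original letters as column names.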

Filter groups in dplyr that exclusively contain specific combinations of values

Given a table like:
id value
1 1 a
2 2 a
3 2 b
4 2 c
5 3 c
I would like to filter for:
a) the ids that only have value a, i.e. id 1.
b) the ids that contain a and b jointly, i.e. id 2.
Data:
df <- data.frame(id = c(1,2,2,2,3), value = c("a", "a", "b", "c", "c"))

Try
a)
df %>% group_by(id) %>% filter(all(value == "a"))
b)
df %>% group_by(id) %>% filter(all(c("a", "b") %in% value))
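The two conditions differ only in the direction of the test, which can be checked group by group in base R. The sketch below is illustrative; the names groups, only_a and a_and_b are made up.

```r
df <- data.frame(id = c(1, 2, 2, 2, 3), value = c("a", "a", "b", "c", "c"))

groups <- split(df$value, df$id)  # values per id

# a) every value in the group is "a"
only_a <- sapply(groups, function(v) all(v == "a"))
# b) both "a" and "b" occur somewhere in the group
a_and_b <- sapply(groups, function(v) all(c("a", "b") %in% v))

names(only_a)[only_a]    # "1"
names(a_and_b)[a_and_b]  # "2"
```

In a) the group's values are compared against the target, while in b) the targets are looked up in the group's values, so other values such as c are allowed.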

Here is an alternative approach that can be adapted to both a) and b); it is shown here for b), and for a) the pattern would be "^a+$":
df %>%
  group_by(id) %>%
  arrange(value) %>%
  summarize(value = paste(value, collapse = "")) %>%
  filter(grepl("ab", value))
Result:
id value
(dbl) (chr)
1 2 abc
Hope this helps
