Remove duplicates keeping entry with largest absolute value - r

Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:
> a <- data.frame(id=c(1,1,2,2,3,4), value=c(1,2,3,-4,-5,6))
> a
id value
1 1 1
2 1 2
3 2 3
4 2 -4
5 3 -5
6 4 6
I want to remove duplicates, keeping only one entry per ID - the one having the largest absolute value of the "value" column. I.e., this is what I want:
> a[c(2,4,5,6), ]
id value
2 1 2
4 2 -4
5 3 -5
6 4 6
How might I do this in R?

First, sort so that the less desired items come last within each id group:
aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)
Then remove every item after the first within each id group:
aa[ !duplicated(aa$id), ] # take the first row within each id
id value
2 1 2
4 2 -4
5 3 -5
6 4 6
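The two steps above can be wrapped into a small reusable function. This is only a sketch: the name keep_max_abs and its arguments are hypothetical, and when two rows in an id tie on absolute value it keeps whichever comes first in the original data.

```r
# Hypothetical helper (not from any package): for each id, keep the row
# whose value has the largest absolute value.
keep_max_abs <- function(df, id_col = "id", value_col = "value") {
  # Sort by id, then by descending absolute value; ties keep original order.
  sorted <- df[order(df[[id_col]], -abs(df[[value_col]])), , drop = FALSE]
  # duplicated() flags every row after the first one seen for each id.
  sorted[!duplicated(sorted[[id_col]]), , drop = FALSE]
}

a <- data.frame(id = c(1, 1, 2, 2, 3, 4), value = c(1, 2, 3, -4, -5, 6))
result <- keep_max_abs(a)
print(result)
```

Because the function uses only order() and duplicated(), any extra columns in the data frame are carried along unchanged.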

A data.table approach might be in order if your data set is very large:
library(data.table)
aDT <- as.data.table(a)
setkey(aDT, id)
# by = .EACHI evaluates j once per id (required in data.table >= 1.9.4)
aDT[J(unique(id)), list(value = value[which.max(abs(value))]), by = .EACHI]
Or a not as fast, but still fast, alternative:
as.data.table(a)[, .SD[which.max(abs(value))], by=id]
This version returns all the columns of a, in case there are more in the real dataset.

Here is a dplyr approach
a %>%
group_by(id) %>%
top_n(1, abs(value))
# A tibble: 4 x 2
# Groups: id [4]
# id value
# <dbl> <dbl>
#1 1 2
#2 2 -4
#3 3 -5
#4 4 6

Check out ?aggregate:
aggregate(value~id,a,function(x) x[which.max(abs(x))])
I like the answer by @DWin, but I would like to show how this could also work with metadata:
aa<-merge(aggregate(value~id,a,function(x) x[which.max(abs(x))]),a)
# Fails if the max value is duplicated for a single id without next line.
I couldn't help myself and created one last answer:
do.call(rbind,lapply(split(a,a$id),function(x) x[which.max(abs(x$value)),]))

Another approach (though the code might look a little cumbersome) is to use ave():
a[which(abs(a$value) == ave(a$value, a$id,
FUN=function(x) max(abs(x)))), ]
# id value
# 2 1 2
# 4 2 -4
# 5 3 -5
# 6 4 6

ddply(a, .(id), function(x) return(x[which(abs(x$value)==max(abs(x$value))),]))

You can do this with dplyr as follows:
a %>%
group_by(id) %>%
filter(abs(value) == max(abs(value)))


Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 on any column (i.e ax1, bx1, cx1, dx1). I would like to know if there is a way to automate filtering (or filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt by starting with
mutate_at( vars(ends_with("x1"))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired result. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. There is a similar case resolved here, but it was not solved in an automated way. I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
df %>%
filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
filter_at(vars(ends_with("x1")), all_vars(. != -1))
Using base R :
With rowSums
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply :
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at:
df %>%
filter_at(vars(ax1:dx1), ~. != as.numeric(-1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
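For many columns, the base R idea can be wrapped in a helper that takes a column-name pattern. This is a sketch under assumptions: the question's data is not shown, so the data frame below is hypothetical (rows 1, 2 and 5 match the printed output, while rows 3 and 4 are invented), and drop_rows_with is a made-up name.

```r
# Hypothetical data: rows 1, 2 and 5 match the output shown above;
# the values in rows 3 and 4 are invented for illustration.
df <- data.frame(
  id  = 1:5,
  ax1 = c(5, 3, -1, 4, 9),
  bx1 = c(0, 1, 2, -1, 3),
  cx1 = c(2, 1, 6, 2, 5),
  dx1 = c(3, 7, 4, -1, 8)
)

# Hypothetical helper: drop rows where any column matching `pattern`
# contains the value `bad`.
drop_rows_with <- function(data, pattern, bad = -1) {
  cols <- grep(pattern, names(data), value = TRUE)
  # %in% treats NA as "not bad", so rows with NA are kept rather than
  # turned into all-NA rows by the logical subset.
  has_bad <- Reduce(`|`, lapply(data[cols], function(col) col %in% bad))
  data[!has_bad, , drop = FALSE]
}

clean <- drop_rows_with(df, "x1$")
print(clean)
```

Changing the pattern (for example "x2$") would apply the same filter to another family of columns without rewriting the condition.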

How do I use dplyr::arrange to sort NA's first?

I'd like to sort flights in ascending order of dep_time with NAs first using dplyr's arrange in dplyr_0.8.0. arrange's default is to list NAs last.
I had thought that
would work but NAs still come last. In fact, both
produce the same arrangement. Why is this and how do I get the desired sort?
Edit: here's a minimal, reproducible example.
df <- tibble(x = sample(c(NA,NA,1:4)))
Here's the output.
> arrange(df,desc(,x)
# A tibble: 6 x 1
1 1
2 2
3 3
4 4
5 NA
6 NA
> arrange(df,,x)
# A tibble: 6 x 1
1 1
2 2
3 3
4 4
5 NA
6 NA
It works as expected if I mutate(ind = and then sort on the variable ind rather than the expression
Here's my sessionInfo(). All hints toward solution gratefully received.
This was fixed by downloading the latest version of dplyr_0.8.0:
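If upgrading is not an option, a base R fallback is available. The sketch below does not use dplyr::arrange at all; it relies on the na.last argument of order(), and the data frame is a hypothetical stand-in for the example above.

```r
# Hypothetical stand-in for the example data.
df <- data.frame(x = c(3, NA, 1, 4, NA, 2))

# order() puts NAs last by default; na.last = FALSE puts them first.
na_first <- df[order(df$x, na.last = FALSE), , drop = FALSE]

# Equivalent idea with an explicit sort key: FALSE sorts before TRUE,
# so rows where x is NA (where !is.na(x) is FALSE) come first.
na_first_alt <- df[order(!is.na(df$x), df$x), , drop = FALSE]

print(na_first)
```

Both versions sort the non-missing values in ascending order after the NAs.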

Remove duplicate rows based on conditions from multiple columns (decreasing order) in R

I have a 3-columns data.frame (variables: ID.A, ID.B, DISTANCE). I would like to remove the duplicates under a condition: keeping the row with the smallest value in column 3.
It is the same problem as here:
R, conditionally remove duplicate rows
(Similar one: Remove duplicates based on 2nd column condition)
But in my situation there is a second problem: I have to remove rows when the pair (ID.A, ID.B) is duplicated, and not only when ID.A is duplicated.
I tried several things, such as:
df <- ddply(df, 1:3, function(df) return(df[df$DISTANCE==min(df$DISTANCE),]))
but it didn't work
Example :
This dataset
id.a id.b dist
1 1 1 12
2 1 1 10
3 1 1 8
4 2 1 20
5 1 1 15
6 3 1 16
Should become:
id.a id.b dist
3 1 1 8
4 2 1 20
6 3 1 16
Using dplyr, and a suitable modification to Remove duplicated rows using dplyr
df %>%
group_by(id.a, id.b) %>%
arrange(dist) %>% # in each group, arrange in ascending order by distance
filter(row_number() == 1)
Another way of achieving the solution and retaining all the columns:
df %>% arrange(dist) %>%
distinct(id.a, id.b, .keep_all=TRUE)
# id.a id.b dist
# 1 1 1 8
# 2 3 1 16
# 3 2 1 20
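The same sort-then-deduplicate idea also works without dplyr. A minimal base R sketch, using the example data from the question (the variable names sorted and closest are just for illustration):

```r
df <- data.frame(
  id.a = c(1, 1, 1, 2, 1, 3),
  id.b = c(1, 1, 1, 1, 1, 1),
  dist = c(12, 10, 8, 20, 15, 16)
)

# Sort by distance so the smallest dist comes first for every pair.
sorted <- df[order(df$dist), ]

# duplicated() on the two id columns keeps the first row per (id.a, id.b).
closest <- sorted[!duplicated(sorted[c("id.a", "id.b")]), ]

# Optional: restore ordering by the id columns.
closest <- closest[order(closest$id.a, closest$id.b), ]
print(closest)
```

The original row names (3, 4, 6) are preserved, which matches the desired output in the question.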

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
tapply across the rownames and grab a sample of 1 in each ID group:
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
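The tapply idea described above might be sketched as follows. This is an illustration under assumptions, not necessarily the exact code of the answer: the data frame name dat is hypothetical, and row positions are converted to character so that sample() never receives a single number.

```r
dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

set.seed(42)  # only to make the illustration repeatable

# Row positions as character strings: sample("4", 1) returns "4",
# whereas sample(4, 1) would draw from 1:4.
row_labels <- as.character(seq_len(nrow(dat)))

# Pick one row label at random within each ID group.
picked <- tapply(row_labels, dat$ID, sample, 1)

one_per_id <- dat[as.integer(picked), ]
print(one_per_id)
```

Each ID appears exactly once in the result; IDs with a single row always return that row.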
You can do that with dplyr like so:
df %>% group_by(ID) %>% sample_n(1)
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

R: subset/group data frame with a max value?

Given a data frame like this:
gid set a b
1 1 1 1 9
2 1 2 -2 -3
3 1 3 5 6
4 2 2 -4 -7
5 2 6 5 10
6 2 9 2 0
How can I subset/group the data frame so that each unique gid keeps the row with the max set value, plus a 1/0 flag for whether its a value is greater than its b value?
So here, it'd be, uh...
Kind of a stupid simple thing in SQL but I'd like to have a bit better control over my R, so...
Piece of cake with dplyr:
dat <- read.table(text="gid set a b
1 1 1 9
1 2 -2 -3
1 3 5 6
2 2 -4 -7
2 6 5 10
2 9 2 0", header=TRUE)
dat %>%
group_by(gid) %>%
filter(row_number() == which.max(set)) %>%
mutate(greater=a>b) %>%
select(gid, set, greater)
## Source: local data frame [2 x 3]
## Groups: gid
## gid set greater
## 1 1 3 FALSE
## 2 2 9 TRUE
If you really need 1's and 0's and the dplyr groups cause any angst:
dat %>%
group_by(gid) %>%
filter(row_number() == which.max(set)) %>%
mutate(greater=ifelse(a>b, 1, 0)) %>%
select(gid, set, greater) %>%
ungroup()
## Source: local data frame [2 x 3]
## gid set greater
## 1 1 3 0
## 2 2 9 1
You could do the same thing without pipes:
ungroup(
select(
mutate(
filter(group_by(dat, gid), row_number() == which.max(set)),
greater=ifelse(a>b, 1, 0)), gid, set, greater))
but…but… why?! :-)
Here's a data.table possibility, assuming your original data is called df.
setDT(df)[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
# gid set b
# 1: 1 3 0
# 2: 2 9 1
Note that to account for multiple max(set) rows, I used set == max(set) as the subset so that this will return the same number of rows for which there are ties for the max (if that makes any sense at all).
And courtesy of @thelatemail, another data.table option:
setDT(df)[, list(set = max(set), ab = (a > b)[which.max(set)] + 0), by = gid]
# gid set ab
# 1: 1 3 0
# 2: 2 9 1
In base R, you can use ave
indx <- with(df, ave(set, gid, FUN=max)==set)
#in cases of ties
#indx <- with(df, !!ave(set, gid, FUN=function(x)
# which.max(x) ==seq_along(x)))
transform(df[indx,], greater=(a>b)+0)[,c(1:2,5)]
# gid set greater
# 3 1 3 0
# 6 2 9 1
