Remove duplicates keeping entry with largest absolute value - r

Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:
> a <- data.frame(id=c(1,1,2,2,3,4), value=c(1,2,3,-4,-5,6))
> a
  id value
1  1     1
2  1     2
3  2     3
4  2    -4
5  3    -5
6  4     6
I want to remove duplicates, keeping only one entry per ID - the one having the largest absolute value of the "value" column. I.e., this is what I want:
> a[c(2,4,5,6), ]
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6
How might I do this in R?

First, sort so that the less desired rows come last within each id group:
aa <- a[order(a$id, -abs(a$value)), ]  # sort by id, then by decreasing abs(value)
Then remove everything after the first row within each id group:
aa[!duplicated(aa$id), ]  # take the first row within each id
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6

A data.table approach might be in order if your data set is very large:
library(data.table)
aDT <- as.data.table(a)
setkey(aDT,"id")
# data.table >= 1.9.4 needs by = .EACHI to evaluate j per join group
aDT[J(unique(id)), list(value = value[which.max(abs(value))]), by = .EACHI]
Or a slightly slower, but still fast, alternative:
library(data.table)
as.data.table(a)[, .SD[which.max(abs(value))], by=id]
This version returns all the columns of a, in case there are more in the real dataset.
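On current data.table versions, another idiom that also keeps all columns collects one row index per group with .I (a sketch, reusing the aDT object from above):
library(data.table)
aDT <- as.data.table(a)
# .I holds the global row number; which.max picks the max-|value| row per id
aDT[aDT[, .I[which.max(abs(value))], by = id]$V1]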

Here is a dplyr approach
library(dplyr)
a %>%
  group_by(id) %>%
  top_n(1, abs(value))
# A tibble: 4 x 2
# Groups:   id [4]
#      id value
#   <dbl> <dbl>
# 1     1     2
# 2     2    -4
# 3     3    -5
# 4     4     6
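Note that top_n() keeps ties; since dplyr 1.0.0 it has been superseded by slice_max(), where with_ties = FALSE guarantees exactly one row per id:
library(dplyr)
a %>%
  group_by(id) %>%
  slice_max(abs(value), n = 1, with_ties = FALSE) %>%
  ungroup()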

Check out ?aggregate:
aggregate(value ~ id, a, function(x) x[which.max(abs(x))])
I like the answer by @DWin, but I would like to show how this could also work with metadata:
aa <- merge(aggregate(value ~ id, a, function(x) x[which.max(abs(x))]), a)
# Without the next line, this fails if the max value is duplicated for a single id.
aa[!duplicated(aa), ]
I couldn't help myself and created one last answer:
do.call(rbind, lapply(split(a, a$id), function(x) x[which.max(abs(x$value)), ]))

Another approach (though the code might look a little cumbersome) is to use ave():
a[which(abs(a$value) == ave(a$value, a$id,
                            FUN=function(x) max(abs(x)))), ]
#   id value
# 2  1     2
# 4  2    -4
# 5  3    -5
# 6  4     6

library(plyr)
ddply(a, .(id), function(x) x[which(abs(x$value) == max(abs(x$value))), ])

You can do this with dplyr as follows (note that this keeps all rows tied for the maximum absolute value):
library(dplyr)
a %>%
  group_by(id) %>%
  filter(abs(value) == max(abs(value))) %>%
  ungroup()

Related

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df <- data.frame(
  id  = c(1:5),
  ax1 = c(5,3,7,-1,9),
  bx1 = c(0,1,-1,0,3),
  cx1 = c(2,1,5,-1,5),
  dx1 = c(3,7,2,1,8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 in any of these columns (i.e. ax1, bx1, cx1, dx1). I would like to know if there is a way to automate filtering (or the filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt starting with
mutate_at(vars(ends_with("x1")))
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired results. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. There is a similar case resolved here, but it has not been done through automation. I want to adapt the automation to the case of large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
#   id ax1 bx1 cx1 dx1
# 1  1   5   0   2   3
# 2  2   3   1   1   7
# 3  5   9   3   5   8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
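In dplyr 1.0.4+ the dedicated verb is if_all(), and using across() inside filter() has since been deprecated in its favor; a sketch:
library(dplyr)
# keep a row only when the predicate holds in every x1 column
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))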
Using base R:
With rowSums:
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
#   id ax1 bx1 cx1 dx1
# 1  1   5   0   2   3
# 2  2   3   1   1   7
# 5  5   9   3   5   8
Or with apply:
df[!apply(df[cols] == -1, 1, any), ]
Using filter_at:
library(tidyverse)
df <- data.frame(
  id  = c(1:5),
  ax1 = c(5,3,7,-1,9),
  bx1 = c(0,1,-1,0,3),
  cx1 = c(2,1,5,-1,5),
  dx1 = c(3,7,2,1,8))
df
df %>%
  filter_at(vars(ax1:dx1), ~ . != -1)
#   id ax1 bx1 cx1 dx1
# 1  1   5   0   2   3
# 2  2   3   1   1   7
# 3  5   9   3   5   8

How do I use dplyr::arrange to sort NA's first?

I'd like to sort flights in ascending order of dep_time with NAs first, using dplyr's arrange in dplyr 0.8.0. arrange's default is to list NAs last.
I had thought that
arrange(flights, desc(is.na(dep_time)), dep_time)
would work but NAs still come last. In fact, both
desc(is.na(dep_time))
and
is.na(dep_time)
produce the same arrangement. Why is this and how do I get the desired sort?
Edit: here's a minimal, reproducible example.
library(tidyverse)
set.seed(1)
df <- tibble(x = sample(c(NA,NA,1:4)))
arrange(df, desc(is.na(x)), x)
arrange(df, is.na(x), x)
Here's the output.
...
> arrange(df, desc(is.na(x)), x)
# A tibble: 6 x 1
      x
  <int>
1     1
2     2
3     3
4     4
5    NA
6    NA
> arrange(df, is.na(x), x)
# A tibble: 6 x 1
      x
  <int>
1     1
2     2
3     3
4     4
5    NA
6    NA
It works as expected if I mutate(ind = is.na(x)) and then sort on the variable ind rather than the expression is.na(x).
Here's my sessionInfo(). All hints toward a solution gratefully received.
This was fixed by installing the latest development version of dplyr 0.8.0:
devtools::install_github("tidyverse/dplyr")
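With the fix in place, desc(is.na(x)) sorts the NA rows first as intended:
> arrange(df, desc(is.na(x)), x)
# A tibble: 6 x 1
      x
  <int>
1    NA
2    NA
3     1
4     2
5     3
6     4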

Remove duplicate rows based on conditions from multiple columns (decreasing order) in R

I have a 3-columns data.frame (variables: ID.A, ID.B, DISTANCE). I would like to remove the duplicates under a condition: keeping the row with the smallest value in column 3.
It is the same problem as here:
R, conditionally remove duplicate rows
(Similar one: Remove duplicates based on 2nd column condition)
But in my situation there is a second problem: I have to remove rows when the pair (ID.A, ID.B) is duplicated, not only when ID.A is duplicated.
I tried several things, such as:
df <- ddply(df, 1:3, function(df) return(df[df$DISTANCE==min(df$DISTANCE),]))
but it didn't work.
Example:
This dataset:
  id.a id.b dist
1    1    1   12
2    1    1   10
3    1    1    8
4    2    1   20
5    1    1   15
6    3    1   16
Should become:
  id.a id.b dist
3    1    1    8
4    2    1   20
6    3    1   16
Using dplyr, and a suitable modification of Remove duplicated rows using dplyr:
library(dplyr)
df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>%          # within each group, arrange in ascending order of distance
  filter(row_number() == 1)
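In dplyr >= 1.0.0, slice_min() expresses the same operation directly; with_ties = FALSE mirrors the row_number() == 1 filter by keeping a single row per group:
library(dplyr)
df %>%
  group_by(id.a, id.b) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()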
Another way of achieving the solution and retaining all the columns:
df %>% arrange(dist) %>%
  distinct(id.a, id.b, .keep_all=TRUE)
#   id.a id.b dist
# 1    1    1    8
# 2    3    1   16
# 3    2    1   20
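The sort-then-deduplicate idea from the first question carries over to base R as well; duplicated() accepts a data frame, so it can key on the (id.a, id.b) pair:
df2 <- df[order(df$dist), ]                 # ascending distance first
df2[!duplicated(df2[c("id.a", "id.b")]), ]  # keep the first row per (id.a, id.b)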

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
 1    10
 2     5
 2     8
 3    15
 4     7
 4     9
...
What I need is a random sample in which every ID is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat), dat$ID, FUN=sample, 1), ]
#  ID Value
#1  1    10
#3  2     8
#4  3    15
#6  4     9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))), dat$ID, FUN=sample, 1), ]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
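In dplyr 1.0.0+, sample_n() has been superseded by slice_sample():
library(dplyr)
df %>% group_by(ID) %>% slice_sample(n = 1) %>% ungroup()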
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

R: subset/group data frame with a max value?

Given a data frame like this:
  gid set  a  b
1   1   1  1  9
2   1   2 -2 -3
3   1   3  5  6
4   2   2 -4 -7
5   2   6  5 10
6   2   9  2  0
How can I subset/group the data frame to get, for each unique gid, the row with the max set value, plus a 1/0 flag for whether its a value is greater than its b value?
So here, it'd be, uh...
1,3,0
2,9,1
Kind of a stupid simple thing in SQL, but I'd like to have a bit better control over my R, so...
Piece of cake with dplyr:
dat <- read.table(text="gid set a b
1 1 1 9
1 2 -2 -3
1 3 5 6
2 2 -4 -7
2 6 5 10
2 9 2 0", header=TRUE)
library(dplyr)
dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater=a>b) %>%
  select(gid, set, greater)
## Source: local data frame [2 x 3]
## Groups: gid
##
##   gid set greater
## 1   1   3   FALSE
## 2   2   9    TRUE
If you really need 1's and 0's and the dplyr groups cause any angst:
dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater=ifelse(a>b, 1, 0)) %>%
  select(gid, set, greater) %>%
  ungroup
## Source: local data frame [2 x 3]
##
##   gid set greater
## 1   1   3       0
## 2   2   9       1
You could do the same thing without pipes:
ungroup(
  select(
    mutate(
      filter(
        group_by(dat, gid),
        row_number() == which.max(set)),
      greater=ifelse(a>b, 1, 0)),
    gid, set, greater))
but…but… why?! :-)
Here's a data.table possibility, assuming your original data is called df.
library(data.table)
setDT(df)[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
#    gid set b
# 1:   1   3 0
# 2:   2   9 1
Note that to account for multiple max(set) rows, I used set == max(set) as the subset, so ties for the max all get returned (if that makes any sense at all).
And courtesy of @thelatemail, another data.table option:
setDT(df)[, list(set = max(set), ab = (a > b)[which.max(set)] + 0), by = gid]
#    gid set ab
# 1:   1   3  0
# 2:   2   9  1
In base R, you can use ave
indx <- with(df, ave(set, gid, FUN=max) == set)
# in case of ties:
# indx <- with(df, !!ave(set, gid, FUN=function(x)
#                    which.max(x) == seq_along(x)))
transform(df[indx, ], greater=(a>b)+0)[, c(1:2, 5)]
#   gid set greater
# 3   1   3       0
# 6   2   9       1
