Remove duplicate rows based on conditions from multiple columns (decreasing order) in R

I have a three-column data.frame (variables: ID.A, ID.B, DISTANCE). I would like to remove duplicates under a condition: keep the row with the smallest value in column 3.
It is the same problem as here:
R, conditionally remove duplicate rows
(Similar one: Remove duplicates based on 2nd column condition)
But in my situation there is a second problem: I have to remove rows when the pairs (ID.A, ID.B) are duplicated, not only when ID.A is duplicated.
I tried several things, such as:
df <- ddply(df, 1:3, function(df) return(df[df$DISTANCE==min(df$DISTANCE),]))
but it didn't work (grouping by all three columns cannot work here: each distinct (ID.A, ID.B, DISTANCE) row forms its own group, so no duplicates are removed).
Example:
This dataset
id.a id.b dist
1 1 1 12
2 1 1 10
3 1 1 8
4 2 1 20
5 1 1 15
6 3 1 16
Should become:
id.a id.b dist
3 1 1 8
4 2 1 20
6 3 1 16
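For reference, the input data shown above can be reconstructed like this (a minimal sketch; the leading numbers in the tables are just default row names):
df <- read.table(text = "id.a id.b dist
1 1 12
1 1 10
1 1 8
2 1 20
1 1 15
3 1 16", header = TRUE)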

Using dplyr, and a suitable modification of Remove duplicated rows using dplyr:
library(dplyr)
df %>%
  group_by(id.a, id.b) %>%
  arrange(dist) %>% # within each group, sort ascending by dist
  filter(row_number() == 1) # keep only the first (smallest-dist) row per group
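With more recent dplyr (>= 1.0.0; an assumption about your installed version), slice_min() expresses the same idea more directly:
df %>%
  group_by(id.a, id.b) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>% # one smallest-dist row per group
  ungroup()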

Another way of achieving the same result while retaining all the columns: arrange by dist first, so that distinct(), which keeps the first occurrence in each group, retains the minimum-distance row:
df %>%
  arrange(dist) %>%
  distinct(id.a, id.b, .keep_all = TRUE)
# id.a id.b dist
# 1 1 1 8
# 2 3 1 16
# 3 2 1 20

Related

How to rearrange columns of a data frame based on values in a row

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of the first three columns, all of which have the value 1 in row 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.
We can use select: subset row 4, unlist, order the values, and pass the result to select():
library(dplyr)
df %>%
  select(order(unlist(.[4, ])))
Output:
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or we may use (flatten_dbl() is from purrr):
library(purrr)
df %>%
  select({.} %>%
    slice_tail(n = 1) %>%
    flatten_dbl %>%
    order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
or in base R (single-bracket indexing of a data.frame without a comma selects columns, which is what we want here):
df[order(unlist(tail(df, 1)))]
In all of these, order() breaks ties by first occurrence, which matches the asker's indifference to the order of the tied columns.

Add non-occurring factors to a data frame in R

I have a dataframe of factors and corresponding values like this:
df <- data.frame(week = factor(c(1,2,49,50)), occurrences = c(1,4,2,3))
week occurrences
1 1 1
2 2 4
3 49 2
4 50 3
I want to add factors for all the "missing" weeks in (1-53) with the corresponding occurrences value of 0. What is the best way to do this? I have to do this to several data frames that may not be "missing" the same factors so I would like to generalize it in a function.
You can use rbind() to append the necessary rows to your df. In this example, I first create the data frame to be appended, for clarity; setdiff() returns the week numbers currently not present in your week column:
df_to_app <- data.frame(week = factor(setdiff(1:52, df$week)), occurrences = 0)
df <- rbind(df, df_to_app)
I hope that helps!
Here's an approach with tidyr::complete. First, we need to add the additional levels to our week column. We can use forcats::fct_expand. Then tidyr::complete will fill the data.frame with those levels and we can use the fill = argument to indicate that we want 0.
library(tidyverse)
df %>%
  mutate(week = fct_expand(week, paste0(1:52))) %>%
  complete(week, fill = list(occurrences = 0))
# A tibble: 52 x 2
week occurrences
<fct> <dbl>
1 1 1
2 2 4
3 49 2
4 50 3
5 3 0
6 4 0
7 5 0
8 6 0
9 7 0
10 8 0
# … with 42 more rows
Or with a right join to a data.frame containing all weeks (replace_na() is from tidyr):
library(dplyr)
library(tidyr)
df %>%
  right_join(data.frame(week = as.factor(1:52))) %>%
  mutate(occurrences = replace_na(occurrences, 0))
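Since the question asks for something reusable across data frames, here is a minimal sketch wrapping the tidyr approach in a function (the name add_missing_weeks is hypothetical; it assumes a factor week column and a numeric occurrences column, and it defaults to weeks 1:52 as in the answers above, so pass 1:53 if that is the range you need):
library(dplyr)
library(tidyr)
library(forcats)
add_missing_weeks <- function(df, weeks = 1:52) {
  df %>%
    mutate(week = fct_expand(week, paste0(weeks))) %>% # add the missing levels
    complete(week, fill = list(occurrences = 0)) # fill new rows with 0
}
add_missing_weeks(df)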

Automate filtering to subset data based on multiple columns

Here is a data set I am trying to subset:
df<-data.frame(
id=c(1:5),
ax1=c(5,3,7,-1,9),
bx1=c(0,1,-1,0,3),
cx1=c(2,1,5,-1,5),
dx1=c(3,7,2,1,8))
The data set has a variable x1 that is measured at different time points, denoted by ax1, bx1, cx1 and dx1. I am trying to subset these data by deleting the rows with -1 on any column (i.e ax1, bx1, cx1, dx1). I would like to know if there is a way to automate filtering (or filter function) to perform this task. I am familiar with situations where the focus is to filter rows based on a single column (or variable).
For the current case, I made an attempt starting with
mutate_at(vars(ends_with("x1")), ...)
to select the required columns, but I am not sure how to combine this with the filter function to produce the desired results. The expected output would have the 3rd and 4th rows deleted. I appreciate any help on this. A similar case is resolved here, but not in an automated way; I want to adapt the automation to large data with many columns.
You can use filter() with across().
library(dplyr)
df %>%
  filter(across(ends_with("x1"), ~ .x != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8
It's equivalent to filter_at() with all_vars(), which has been superseded in dplyr 1.0.0.
df %>%
  filter_at(vars(ends_with("x1")), all_vars(. != -1))
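Note that using across() inside filter() was itself deprecated in later dplyr releases in favor of if_all() (introduced, if I remember the release right, in dplyr 1.0.4):
df %>%
  filter(if_all(ends_with("x1"), ~ .x != -1))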
Using base R, with rowSums():
cols <- grep('x1$', names(df))
df[rowSums(df[cols] == -1) == 0, ]
# id ax1 bx1 cx1 dx1
#1 1 5 0 2 3
#2 2 3 1 1 7
#5 5 9 3 5 8
Or with apply():
df[!apply(df[cols] == -1, 1, any), ]
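A further base R option, if you want to avoid the matrix coercion that apply() performs, is to combine the per-column tests with Reduce() (a sketch reusing the cols index from above):
df[Reduce(`&`, lapply(df[cols], function(x) x != -1)), ]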
Using filter_at():
library(tidyverse)
df<-data.frame(
id=c(1:5),
ax1=c(5,3,7,-1,9),
bx1=c(0,1,-1,0,3),
cx1=c(2,1,5,-1,5),
dx1=c(3,7,2,1,8))
df
df %>%
  filter_at(vars(ax1:dx1), all_vars(. != -1))
# id ax1 bx1 cx1 dx1
# 1 1 5 0 2 3
# 2 2 3 1 1 7
# 3 5 9 3 5 8

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample in which every ID is represented once and only once.
That means ID 1 will be chosen (it appears only once), one of the two rows with ID 2, ID 3, one of the two rows with ID 4, and so on.
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this, without manually subsetting and sampling each subset.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
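sample_n() has since been superseded by slice_sample() (dplyr >= 1.0.0; an assumption about your installed version), which reads the same way:
df %>%
  group_by(ID) %>%
  slice_sample(n = 1) %>%
  ungroup()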
The idea here is to reorder the rows randomly and then remove duplicates in that order:
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]
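Whichever variant you choose, they all draw at random, so a set.seed() call beforehand makes the result reproducible:
set.seed(42) # any fixed seed works; 42 is arbitrary
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]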

Remove duplicates keeping entry with largest absolute value

Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:
> a <- data.frame(id=c(1,1,2,2,3,4), value=c(1,2,3,-4,-5,6))
> a
id value
1 1 1
2 1 2
3 2 3
4 2 -4
5 3 -5
6 4 6
I want to remove duplicates, keeping only one entry per ID - the one having the largest absolute value of the "value" column. I.e., this is what I want:
> a[c(2,4,5,6), ]
id value
2 1 2
4 2 -4
5 3 -5
6 4 6
How might I do this in R?
First, sort so that the less desired rows come last within each id group:
aa <- a[order(a$id, -abs(a$value)), ] # sort by id, then by decreasing abs(value)
Then remove everything after the first row within each id group:
aa[!duplicated(aa$id), ] # keep the first row within each id
id value
2 1 2
4 2 -4
5 3 -5
6 4 6
A data.table approach might be in order if your data set is very large:
library(data.table)
aDT <- as.data.table(a)
setkey(aDT, id)
aDT[J(unique(id)), list(value = value[which.max(abs(value))]), by = .EACHI]
(In current data.table versions the explicit by = .EACHI is needed so that j is evaluated once per joined id; older versions grouped implicitly on join.)
Or an alternative that is not as fast, but still fast:
library(data.table)
as.data.table(a)[, .SD[which.max(abs(value))], by=id]
This version returns all the columns of a, in case there are more in the real dataset.
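Another common data.table idiom that also keeps all columns first computes the winning row number per id with .I, then subsets once (a sketch; V1 is the default name data.table gives the unnamed result column):
library(data.table)
aDT <- as.data.table(a)
aDT[aDT[, .I[which.max(abs(value))], by = id]$V1]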
Here is a dplyr approach:
library(dplyr)
a %>%
  group_by(id) %>%
  top_n(1, abs(value))
# A tibble: 4 x 2
# Groups: id [4]
# id value
# <dbl> <dbl>
#1 1 2
#2 2 -4
#3 3 -5
#4 4 6
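top_n() has since been superseded; with dplyr >= 1.0.0 (again an assumption about your installed version), the same result reads more clearly as:
a %>%
  group_by(id) %>%
  slice_max(abs(value), n = 1) %>%
  ungroup()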
Check out ?aggregate:
aggregate(value~id,a,function(x) x[which.max(abs(x))])
I like the answer by @DWin, but I would like to show how this could also work with metadata:
aa <- merge(aggregate(value ~ id, a, function(x) x[which.max(abs(x))]), a)
# Without the next line, this fails if the max value is duplicated for a single id.
aa[!duplicated(aa), ]
I couldn't help myself and created one last answer:
do.call(rbind,lapply(split(a,a$id),function(x) x[which.max(abs(x$value)),]))
Another approach (though the code might look a little cumbersome) is to use ave():
a[which(abs(a$value) == ave(a$value, a$id,
FUN=function(x) max(abs(x)))), ]
# id value
# 2 1 2
# 4 2 -4
# 5 3 -5
# 6 4 6
Or with plyr:
library(plyr)
ddply(a, .(id), function(x) return(x[which(abs(x$value)==max(abs(x$value))),]))
You can also do this with dplyr's filter():
library(dplyr)
a %>%
  group_by(id) %>%
  filter(abs(value) == max(abs(value))) %>%
  ungroup()
