Remove duplicates based on some conditions - r

I have two datasets: D1 and D2. D2 is a left join of D1 and a larger dataset, which I will call D3. Although the key column of D2 has the same number of unique elements as D1, it has some duplicates that I want to get rid of based on certain conditions.
There are two problems:
1) There are some rows full of NA values, except for the key value, and these rows are very important to me.
2) There are some other rows which may or may not be duplicated but don't match my standard condition.
How can I remove these duplicates conditionally based on a hierarchy?
Sample dataset:
ID Var
1 1
2 1
3 1
3 9
4 2
4 9
5 1
6 1
7 1
7 9
7 9
8 2
9
10 1
Expected dataset:
ID Var
1 1
2 1
3 1
4 2
5 1
6 1
7 1
8 2
9
10 1

duplicated does what you need.
dat[!duplicated(dat$ID),]
# ID Var
# 1 1 1
# 2 2 1
# 3 3 1
# 5 4 2
# 7 5 1
# 8 6 1
# 9 7 1
# 12 8 2
# 13 9 NA
# 14 10 1
As does something from the tidyverse:
library(dplyr)
dat %>%
group_by(ID) %>%
slice(1) %>%
ungroup()
And data.table ...
library(data.table)
as.data.table(dat)[ !duplicated(ID), ]
Data:
dat <- read.table(header = TRUE, text = "
ID Var
1 1
2 1
3 1
3 9
4 2
4 9
5 1
6 1
7 1
7 9
7 9
8 2
9 NA
10 1")
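The approaches above keep whichever row happens to come first for each ID. If the row to keep has to follow an explicit hierarchy, one option is to rank the rows first and deduplicate afterwards. The sketch below is only an illustration in base R: the vector `priority` (prefer Var 1, then 2, then 9, with NA last) is an assumption, since the question does not spell out the actual hierarchy.

```r
# Subset of the sample data, with rows deliberately out of order within ID
dat <- data.frame(ID  = c(1, 3, 3, 4, 4, 9),
                  Var = c(1, 9, 1, 9, 2, NA))

# Hypothetical hierarchy: earlier position in `priority` wins
priority <- c(1, 2, 9)
rk <- match(dat$Var, priority)   # NA for values not listed, including NA itself

# Sort by ID, then by rank (NA ranks go last), and keep the first row per ID
sorted <- dat[order(dat$ID, rk, na.last = TRUE), ]
res <- sorted[!duplicated(sorted$ID), ]
res
```

A row whose Var is NA is still kept when it is the only row for its ID, so the all-NA rows survive.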

Let's say we have the data.table below:
library(data.table)
df <- data.table(Name = c("JACK", "JOHN", "JACK", "ANNIE", "JOHN", "JACK"),
Amount = c(30, 10, 20, 24, 5, 1))
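In this case, I order by Name, which plays the role of your ID column. Once the rows are in the appropriate order, I take only the first row for each Name. Note that the ordering line below only prints a sorted copy; assign it back to df if the sort order should decide which row is kept.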
df[][order(Name, Amount)]
df[,.SD[1], by = Name]
Output:
Name Amount
1: JACK 30
2: JOHN 10
3: ANNIE 24
I hope this may help you.

Related

Subsetting data frame based on multiple criteria for deletion of rows

Consider the following data frame consisting of column names "id" and "x", where each id is repeated four times. Data is as follows:
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
The question is about how to subset the data frame by the following criteria:
(1) keep all entries of each id if its corresponding values in column x do not contain a 3, or contain a 3 only as the last number.
(2) for a given id with multiple 3s in column x, keep all the numbers up to the first 3 and delete the remaining 3s. The expected output would look like:
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
I am familiar with the use of the 'filter' function in the dplyr package to subset data, but this particular situation confuses me because of the complexity of the above criteria. Any help on this would be greatly appreciated.
Here's one solution that uses / creates some new columns to help you filter on:
library(dplyr)
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
df %>%
group_by(id) %>% # for each id
mutate(num_threes = sum(x == 3), # count number of 3s
flag = ifelse(unique(num_threes) > 0, # if there is a 3
min(row_number()[x == 3]), # keep the row of the first 3
0)) %>% # otherwise put a 0
filter(num_threes == 0 | row_number() <= flag) %>% # keep ids with no 3s or up to first 3
ungroup() %>%
select(-num_threes, -flag) # remove helpful columns
# # A tibble: 13 x 2
# id x
# <dbl> <dbl>
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 1
# 5 2 2
# 6 2 3
# 7 3 1
# 8 3 2
# 9 3 2
# 10 3 3
# 11 4 2
# 12 4 2
# 13 4 3
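The same per-id rule (keep everything up to and including the first 3) can also be sketched in base R with a running count. This is a minimal sketch, not a replacement for the dplyr pipeline above; the helper names `seen` and `keep` are introduced here for illustration only.

```r
df <- data.frame(id = rep(1:4, each = 4),
                 x  = c(2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 3, 2, 2, 3, 3))

# Running count of 3s within each id, including the current row
seen <- ave(as.integer(df$x == 3), df$id, FUN = cumsum)

# Keep rows before any 3, plus the row holding the first 3 itself
keep <- seen == 0 | (df$x == 3 & seen == 1)

res <- df[keep, ]
rownames(res) <- NULL
res
```

Because the count restarts for every id, a 3 at the end of one id and a 3 at the start of the next are not treated as neighbours.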
This works for me:
data
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
commands
library(dplyr)
df <- mutate(df, before = lag(x))
df$condition1 <- 1
df$condition1[df$x == 3 & df$before == 3] <- 0
final_df <- df[df$condition1 == 1, 1:2]
result
id x
1 2
1 2
1 1
1 1
2 2
2 3
3 1
3 2
3 2
3 3
4 2
4 2
4 3
One idea is to pick out the rows with x == 3 and apply unique() to them. Then append those unique rows, each holding a single 3 per id, to the rest of the data frame, and finally restore the original row order.
Here is a solution with base R for the idea above:
res <- (r <- with(df,rbind(df[x!=3,],unique(df[x==3,]))))[order(as.numeric(rownames(r))),]
rownames(res) <- seq(nrow(res))
which gives
> res
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
DATA
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))

How to create a column/index based on either of two conditions being met (to enable clustering of matched pairs within same dataframe)?

I have a large dataset of matched pairs (id1 and id2) and would like to create an index variable to enable me to merge these pairs into rows.
As such, the first row would be index 1 and from then on the index will increase by 1, unless either id1 or id2 match any of the values in previous rows. Where this is the case, the previously attributed index should be applied.
I have looked for weeks and most solutions seem to fall short of what I need.
Here's some data to replicate what I have:
id1 <- c(1,2,2,4,6,7,9,11)
id2 <- c(2,3,4,5,7,8,10,2)
df <- cbind(id1,id2)
df <- as.data.frame(df)
df
id1 id2
1 1 2
2 2 3
3 2 4
4 4 5
5 6 7
6 7 8
7 9 10
8 11 2
And here's what I hope to achieve:
#wanted result
index <- c(1,1,1,1,2,2,3,1)
df_indexed <- cbind(df,index)
df_indexed
id1 id2 index
1 1 2 1
2 2 3 1
3 2 4 1
4 4 5 1
5 6 7 2
6 7 8 2
7 9 10 3
8 11 2 1
It may be easier to do in igraph
library(igraph)
g <- graph_from_data_frame(df)
df$index <- components(g)$membership[as.character(df$id1)]
df$index
#[1] 1 1 1 1 2 2 3 1
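If igraph is not available, the same grouping can be approximated in base R by treating each row as an edge and merging connected ids with a small union-find. This is a rough sketch under that assumption; `nodes`, `parent` and `find_root` are names made up for the example, and the loop is not tuned for very large data.

```r
id1 <- c(1, 2, 2, 4, 6, 7, 9, 11)
id2 <- c(2, 3, 4, 5, 7, 8, 10, 2)

# All distinct ids, in order of first appearance row by row
nodes  <- unique(c(rbind(id1, id2)))
parent <- seq_along(nodes)            # every node starts as its own root

# Follow parent links until reaching a node that is its own parent
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Each row links id1 and id2: merge their two trees
for (k in seq_along(id1)) {
  a <- find_root(match(id1[k], nodes))
  b <- find_root(match(id2[k], nodes))
  if (a != b) parent[max(a, b)] <- min(a, b)
}

# Label each row by the root of its id1, numbered in order of appearance
root  <- sapply(match(id1, nodes), find_root)
index <- match(root, unique(root))
index
```

Row 8 (11, 2) ends up in group 1 because id 2 already belongs to that group, which is the behaviour asked for in the question.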

r - remove first row of condition per subject in dataframe

I have a long format dataframe with multiple subjects and multiple conditions for each subject.
I want to remove the first row of each condition (except the first one) for all subjects.
My dataframe looks like this:
> df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)), cond = (rep(c("A", "A", "B", "B"),times=3)), value = round(runif(12, min = 0, max = 10)))
> df
subj cond value
1 A 1
1 A 5
1 B 3
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
I have found the duplicated() function but it only removes the first row of each condition for the first subject:
df <- df[duplicated(df$cond),]
subj cond value
1 A 5
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
Is there a way to "reset" the finding of a duplicate whenever a new subject begins?
And how can I stop it from excluding the first row of the first condition?
Thank you all so much!
You could subset with the duplicated interaction of the two variables:
> df
subj cond value
1 1 A 5
2 1 A 7
3 1 B 4
4 1 B 8
5 2 A 5
6 2 A 2
7 2 B 8
8 2 B 5
9 3 A 8
10 3 A 1
11 3 B 1
12 3 B 5
df1 <- df[!duplicated(interaction(df$subj, df$cond)),]
> df1
subj cond value
1 1 A 5
3 1 B 4
5 2 A 5
7 2 B 8
9 3 A 8
11 3 B 1
Edit:
I've read your question again and it seems you want to remove the first row, not the last. In this case, use
df1 <- df[!duplicated(interaction(df$subj, df$cond), fromLast = TRUE),]
> df1
subj cond value
2 1 A 4
4 1 B 9
6 2 A 9
8 2 B 7
10 3 A 1
12 3 B 2
Alternative (but does depend on actual df):
df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)),
cond = (rep(c("A", "A", "B", "B"),times=3)),
value = round(runif(12, min = 0, max = 10)))
df
dummy <- as.character(df$cond) # factor to character
mask <- c(FALSE, dummy[-1] == dummy[-length(dummy)])
df[mask,]
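Read literally, the question asks to drop the first row of every condition except the first condition of each subject. A possible base R sketch of that reading combines two duplicated() calls. It assumes each subject's first condition starts at that subject's first row, and it uses fixed values instead of runif() so the result is reproducible.

```r
df <- data.frame(subj  = rep(1:3, each = 4),
                 cond  = rep(c("A", "A", "B", "B"), times = 3),
                 value = 1:12)

# First row of every subject/condition combination
first_of_cond <- !duplicated(df[c("subj", "cond")])

# First row of every subject, i.e. the start of its first condition
first_of_subj <- !duplicated(df$subj)

# Drop condition starts, except the one that opens each subject
res <- df[!(first_of_cond & !first_of_subj), ]
res
```

With this data, rows 3, 7 and 11 (the first B row of each subject) are removed and the other nine rows are kept.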

Extract Index of repeat value

How do I extract a specific row of data when a column has repeating values? My data looks like the table below. I want to extract the row at the end of each run of X (A 3 10, A 2 3, etc.), or the index of that last value.
Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3
Expected output
Index Name X M
3 A 3 10
5 A 2 3
10 A 5 3
13 B 3 10
15 B 2 3
Using base R duplicated and cumsum:
dups <- !duplicated(cumsum(dat$X == 1), fromLast=TRUE)
cbind(dat[dups,], Index=which(dups))
# Name X M Index
#3 A 3 10 3
#5 A 2 3 5
#10 A 5 3 10
#13 B 3 10 13
#15 B 2 3 15
A solution using dplyr.
library(dplyr)
df2 <- df %>%
mutate(Flag = ifelse(lead(X) < X, 1, 0)) %>%
mutate(Index = 1:n()) %>%
filter(Flag == 1 | is.na(Flag)) %>%
select(Index, X, M)
df2
# Index X M
# 1 3 3 10
# 2 5 2 3
# 3 10 5 3
# 4 13 3 10
# 5 15 2 3
Flag is a column showing whether the next number in X is smaller than the current number. If so, Flag is 1; otherwise it is 0. We can then filter for Flag == 1 or for rows where Flag is NA, which is the last row. df2 is the final filtered data frame.
DATA
df <- read.table(text = "Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3",
header = TRUE, stringsAsFactors = FALSE)
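Both answers rely on the same observation: a run ends wherever the next X does not increase. A base R sketch of that test is shown below; it additionally treats a change of Name as the end of a run, which is an assumption added here and is not part of either answer.

```r
df <- data.frame(Name = rep(c("A", "B"), times = c(10, 5)),
                 X    = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5, 1, 2, 3, 1, 2),
                 M    = c(1, 9, 10, 1, 3, 5, 6, 4, 5, 3, 1, 9, 10, 1, 3))

n <- nrow(df)

# A row ends a run if the next X is not larger, or the Name changes;
# the last row always ends a run
run_end <- c(df$X[-1] <= df$X[-n] | df$Name[-1] != df$Name[-n], TRUE)

res <- cbind(Index = which(run_end), df[run_end, ])
res
```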

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table_1.9.5. The devel version introduced new functions rleid and shift (default type is "lag" and fill is "NA") that can be useful for this.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
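For completeness, the chunk idea can also be sketched in base R without extra packages: build a run id with cumsum() and take differences within each run with ave(). The variable name `chunk` is chosen here purely for illustration.

```r
df <- data.frame(row   = 1:7,
                 value = c(2, 4, 5, 6, 11, 12, 15),
                 group = c(1, 1, 1, 2, 2, 1, 1))

# A new chunk starts whenever group differs from the previous row
chunk <- cumsum(c(TRUE, diff(df$group) != 0))

# Within each chunk: NA for the first row, then the difference to the previous value
df$expected <- ave(df$value, chunk, FUN = function(v) c(NA, diff(v)))
df
```

Rows 1-3 and 6-7 share group 1 but fall into chunks 1 and 3, so the difference restarts at row 6.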

Resources