Remove duplicate rows with a certain value in a specific column in R

I have a data frame and I want to remove rows that are duplicated in all columns but one, keeping the row whose value in that remaining column is not a certain string.
In the example below, the 3rd and 4th rows are duplicated in all columns except col3 (column c), so I want to keep only one of them. The complication is that I want to keep the 4th row rather than the 3rd, because the 3rd row's col3 value is "excluded". In general, among duplicated rows I only want to keep the ones that are not "excluded".
My real data frame has lots of duplicated rows, and within each pair of duplicates exactly one row is "excluded".
Here is a reproducible example:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
Thank you so much!
Update: alternatively, when removing duplicates, only keep the second observation (the 4th row).

dplyr with some base R should work for this:
library(dplyr)
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)
d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")
Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded
The filter keeps rows that are either the first occurrence of an (a, b) pair or not "excluded", so only the excluded duplicates get dropped. I also added a non-unique, non-excluded value ('brown') to your example to test that case.
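The same condition also works with plain base R subsetting, if you'd rather skip the dplyr dependency (a minimal sketch of the same logic):
# keep first occurrences of each (a, b) pair, plus any non-"excluded" duplicates
d[!duplicated(d[, 1:2]) | d$c != "excluded", ]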

Here is an example with a loop:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
# Row indices of duplicated rows (only the second and later occurrences are flagged)
duplicated_rows <- which(duplicated(d[c("a","b")]))
to_remove <- c()
# Loop over the duplicated rows
for(i in duplicated_rows){
  # Find all rows with the same (a, b) values
  selection <- which(d$a == d$a[i] & d$b == d$b[i])
  # Store the indices of the rows in this duplicate set whose c value is "excluded"
  to_remove <- c(to_remove, selection[which(d$c[selection] == "excluded")])
}
# Remove those rows (the guard avoids an error when to_remove is empty)
if(length(to_remove) > 0) d <- d[-to_remove, ]
print(d)
  a b c
1 1 4 red
2 2 5 green
4 3 6 orange
5 7 8 excluded

Here is a possibility ... I hope it can help :)
nquit <- (d %>%
  mutate(code = 1:nrow(d)) %>%    # attach a row index
  group_by(a, b) %>%
  mutate(nDuplicate = n()) %>%    # size of each (a, b) group
  filter(nDuplicate > 1) %>%      # keep only the duplicated groups
  filter(c == "excluded"))$code   # row indices of the excluded duplicates
e <- d[-nquit, ]                  # note the comma: we drop rows, not columns

Shortening the approach by #Klone a bit, another dplyr solution:
d %>%
  mutate(c = factor(c, ordered = TRUE,
                    levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
  arrange(c) %>%          # Sort the data frame so that "excluded" comes last
  group_by(a, b) %>%      # Group by the two columns that determine duplicates
  mutate(id = 1:n()) %>%  # Assign IDs within each group
  filter(id == 1)         # Keep only the first row of each group
Result:
# A tibble: 4 x 4
# Groups: a, b [4]
a b c id
<dbl> <dbl> <ord> <int>
1 1 4 red 1
2 2 5 green 1
3 3 6 orange 1
4 7 8 excluded 1
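If you prefer, the mutate(id)/filter(id == 1) pair can be collapsed into a single slice(1); a sketch of the same pipeline:
d %>%
  mutate(c = factor(c, ordered = TRUE,
                    levels = c("red", "green", "orange", "excluded"))) %>%
  arrange(c) %>%
  group_by(a, b) %>%
  slice(1)  # the first row per group is the non-excluded one after sorting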

Regarding your edit at the end of the question:
Update: alternatively, when removing duplicates, only keep the second observation (the 4th row).
note that, if the ordering of the rows by col3 means the row to keep is always the last one within each duplicate group, you can simply set fromLast=TRUE in duplicated() so that rows are flagged as duplicates counting from the last occurrence in each group.
Using a slightly modified version of your data (where I added more duplicate groups to better show that the process works in a more general case):
a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)
a b c
1 1 4 excluded
2 1 4 red
3 2 5 green
4 3 6 excluded
5 3 6 excluded
6 3 6 orange
7 7 8 excluded
using:
ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])
we get:
a b c
2 1 4 red
3 2 5 green
6 3 6 orange
7 7 8 excluded
Note that this doesn't require the rows in each duplicate group to be all together in the original data. The only important thing is that you want to keep the record showing up last in the data from each duplicate group.

Related

Repeat (duplicate) just one row twice in R

I'm trying to duplicate just the second row in a data frame, so that row will appear twice. A dplyr or tidyverse approach would be great. I've tried using slice(), but I can only get it to either duplicate the row I want and remove all the other data, or duplicate all the data, not just the second row.
So I want something like df1 below:
df <- data.frame(t = c(1,2,3,4,5),
                 r = c(2,3,4,5,6))
df1 <- data.frame(t = c(1,2,2,3,4,5),
                  r = c(2,3,3,4,5,6))
Thanks!
Here's also a tidyverse approach with uncount:
library(tidyverse)
df %>%
  mutate(nreps = if_else(row_number() == 2, 2, 1)) %>%
  uncount(nreps)
Basically the idea is to set the number of times you want each row to occur: here row number 2 (hence row_number() == 2) occurs twice and all the others once. Then uncount() expands the data frame by that variable (called nreps in the code). You could also construct a more complex column where every row gets its own number of repetitions, as sketched after the output below.
Output:
t r
1 1 2
2 2 3
2.1 2 3
3 3 4
4 4 5
5 5 6
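As a sketch of that more general case, a hypothetical per-row count column drives uncount() the same way (the counts here are made up):
library(tidyverse)
df %>%
  mutate(nreps = c(1, 3, 1, 2, 1)) %>%  # assumed counts, one per row of df
  uncount(nreps)                        # row 2 appears 3 times, row 4 twice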
One way with slice would be:
library(dplyr)
df %>% slice(sort(c(row_number(), 2)))
# t r
#1 1 2
#2 2 3
#3 2 3
#4 3 4
#5 4 5
#6 5 6
Also:
df %>% slice(sort(c(seq_len(n()), 2)))
In base R, this can be written as :
df[sort(c(seq(nrow(df)), 2)), ]
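The base R version also generalizes if you ever need a given row repeated more than twice; a sketch, where i and k are made-up parameters:
i <- 2  # row to duplicate (hypothetical choice)
k <- 3  # total number of copies wanted (hypothetical)
df[sort(c(seq_len(nrow(df)), rep(i, k - 1))), ]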

How to relabel column values

This seems like a simple task, but for the life of me I can't figure it out. I have a data frame column with the following structure:
df <- data.frame(col = c(1,1,2,2,3,3,4,4))
I also have the following vectors:
index <- seq(1, 2)
labels <- c('Control', 'Treatment')
The index is updated by a for loop, and all I want to do is replace all of the values in the df column that match the index with the appropriate label (e.g. all values of 1 and 2 in the df will be replaced with 'Control'). So far the closest I've gotten is:
df$col[df$col == index[1]] <- labels[1]
If index[1] is replaced with index, only the first value of the vector is matched. How can I do this so that all values are matched and replaced?
Thank you!
It is not completely clear what you want to do, but from your description I suppose what you might need is a factor:
df <- data.frame(col=c(1,1,2,2,3,3,4,4))
labs <- c("first", "second", "third", "fourth")
df$col2 <- factor(df$col, labels=labs)
df
# col col2
# 1 1 first
# 2 1 first
# 3 2 second
# 4 2 second
# 5 3 third
# 6 3 third
# 7 4 fourth
# 8 4 fourth
You can change these labels afterwards via levels(), for example:
> levels(df$col2)[3:4] <- c("tre", "fyra")
> df
col col2
1 1 first
2 1 first
3 2 second
4 2 second
5 3 tre
6 3 tre
7 4 fyra
8 4 fyra
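If you'd rather replace values directly instead of going through a factor, a match()-based sketch works too (assuming every value of col appears in index):
index <- c(1, 2, 3, 4)
labs  <- c("first", "second", "third", "fourth")
# match() returns the position of each col value in index;
# indexing labs with it gives a vectorized lookup
df$col2 <- labs[match(df$col, index)]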

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this, without subsetting and sampling each subset by hand.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
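Alternatively, the resample() wrapper from the examples in ?sample sidesteps this gotcha; a small sketch:
# never falls back to the sample(1:n) shortcut for length-1 input
resample <- function(x, ...) x[sample.int(length(x), ...)]
replicate(10, resample(4, 1))
#[1] 4 4 4 4 4 4 4 4 4 4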
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
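If you are on dplyr 1.0 or later, note that sample_n() has been superseded by slice_sample(), which should give the same one-row-per-group result:
df %>% group_by(ID) %>% slice_sample(n = 1)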
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Keep values of a data frame column in R

In my data frame df I want to get the Id numbers satisfying the condition that the value of A is greater than the value of B. In the example I would only want Id = 2.
Id Name Value
1 A 3
1 B 5
1 C 4
2 A 7
2 B 6
2 C 8
vecA <- vector()
vecB <- vector()
vecId <- vector()
i <- 1
while(i <= dim(df)[1]){
  if(df$Name[[i]] == "A"){ vecA <- c(vecA, df$Value[[i]]) }
  if(df$Name[[i]] == "B"){ vecB <- c(vecB, df$Value[[i]]) }
  # once the B for the current Id is seen, compare it against the latest A
  if(df$Name[[i]] == "B" && vecA[length(vecA)] > vecB[length(vecB)]){
    vecId <- c(vecId, df$Id[[i]])
  }
  i <- i + 1
}
First, you could convert your data from long to wide so you have one row for each ID:
library(reshape2)
(wide <- dcast(df, Id~Name, value.var="Value"))
# Id A B C
# 1 1 3 5 4
# 2 2 7 6 8
Now you can use normal indexing to get the ids with larger A than B:
wide$Id[wide$A > wide$B]
# [1] 2
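For reference, the same reshape works with tidyr's pivot_wider() (assuming tidyr >= 1.0); a sketch:
library(tidyr)
# one row per Id, with Name values spread into columns
wide <- pivot_wider(df, names_from = Name, values_from = Value)
wide$Id[wide$A > wide$B]
# [1] 2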
The first answer works well for sure. I wanted to show a route through regular subset operations as well, using some of the more recent R packages. (With three groups to compare this would get more interesting.) In the code below, exp is the exact data.frame you started with.
library(plyr)
library(dplyr)
# within each Id, drop the row holding the minimum Value,
# i.e. keep the larger of the A/B pair
comp <- exp %>%
  filter(Name %in% c("A", "B")) %>%
  group_by(Id) %>%
  filter(min_rank(Value) > 1)
# If the whole row is needed
comp[which.max(comp$Value), ]
# If not
comp[which.max(comp$Value), "Id"]

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of an id if one of its rows has the info value a. This would mean, for example, that rows 2 and 3 should be removed because row 1's info column contains the value a. Please note that the info values are not ordered (see id 3, rows 5 and 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all ids containing an "a" value
a_val <- data$id[grep("a", data$info)]
# check every id containing an "a" value
for(i in a_val) {
  temp_data <- data[which(data$id == i), ]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    for (ii in seq_len(nrow(temp_data))) {
      if (temp_data$info[ii] != "a") {
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
My solution works quite well. Nevertheless, it is very, very, very slow, as the original data contains more than 700k rows and more than 150k rows with an "a" value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone can suggest a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. #BenBarnes found out that this solution only works if the data frame is ordered according to id.
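If you cannot rely on the data frame being ordered by id, an order-independent variant using ave() avoids that caveat; a sketch:
# for each row, flag whether its id group contains an "a" anywhere
subset(data, info == "a" | !as.logical(ave(info == "a", id, FUN = any)))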
You might want to investigate the data.table package.
EDIT: If the row variable is not a sequential numbering of each row in your data (as I had assumed), you can create such a variable to retain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
If df is the one above (re Sacha's answer), you can use match, which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
Try taking a look at subset; it's quite easy to use and it will solve your problem.
You just need to specify the value of the column that you want to subset on; alternatively, you can condition on several columns.
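For instance, a minimal call on this question's data might look like this (a sketch):
# keep only the rows whose info column equals "a"
subset(df, info == "a")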
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html
