Remove duplicate rows in R based on values in all columns

I have the following dataset
col1 col2 col3
a b 1
a b 2
a b 3
unique(dataset) returns
col1 col2 col3
a b 1
dataset[!duplicated(1:3),] returns
col1 col2 col3
a b 1
a b 2
a b 3
But the same approach fails on the following dataset2:
col1 col2 col3
a b 1
a b 1
unique(dataset2) returns
col1 col2 col3
a b 1
dataset2[!duplicated(1:3),] returns
col1 col2 col3
a b 1
a b 1
NA NA NA

Use !duplicated on the data itself, not on a numeric vector: duplicated(1:3) tests the vector 1:3 for duplicates (always all FALSE), so subsetting a two-row table with it demands a nonexistent third row, which R fills with NAs. Instead:
dataset[!duplicated(dataset[c("col1", "col2", "col3")]),]
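A minimal, self-contained sketch of why the indexed version misbehaves (the two-row data frame below mirrors dataset2):

```r
# Two identical rows, mirroring dataset2
dataset2 <- data.frame(col1 = c("a", "a"),
                       col2 = c("b", "b"),
                       col3 = c(1, 1))

# duplicated(1:3) tests the *vector* 1:3 for duplicates, giving
# c(FALSE, FALSE, FALSE) -- so this asks for three rows from a
# two-row table, and R pads the result with an NA row:
bad <- dataset2[!duplicated(1:3), ]
nrow(bad)   # 3

# Passing the data frame itself makes duplicated() compare whole rows:
good <- dataset2[!duplicated(dataset2), ]
nrow(good)  # 1
```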

Finding duplicates in a dataframe and returning count of each duplicate record

I have a dataframe like
col1 col2 col3
A B C
A B C
A B B
A B B
A B C
B C A
I want to get an output in the below format:
col1 col2 col3 Count
A B C 3 Duplicates
A B B 2 Duplicates
I don't want to name any specific column in the function call to find the duplicates; that is the reason for not using add_count from dplyr.
Using duplicated() I get:
col1 col2 col3 count
2 A B C 3
3 A B B 2
5 A B C 3
So not the desired output.
We can use group_by_all to group by all columns, count each group, and then keep only the duplicated combinations by filtering for count > 1.
library(dplyr)
df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
# col1 col2 col3 n
# <fct> <fct> <fct> <int>
#1 A B B 2
#2 A B C 3
We can use data.table
library(data.table)
setDT(df1)[, .(n = .N), names(df1)][n > 1]
# col1 col2 col3 n
#1: A B C 3
#2: A B B 2
Or with base R
subset(aggregate(n ~ ., transform(df1, n = 1), FUN = sum), n > 1)
# col1 col2 col3 n
#2 A B B 2
#3 A B C 3
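For completeness, one more base R route (my addition, not from the answers above): table() on a data frame cross-tabulates all columns, so keeping observed combinations with Freq > 1 gives the same result.

```r
df1 <- data.frame(col1 = c("A", "A", "A", "A", "A", "B"),
                  col2 = c("B", "B", "B", "B", "B", "C"),
                  col3 = c("C", "C", "B", "B", "C", "A"))

# table() on a data frame counts every column combination;
# Freq > 1 keeps only the duplicated ones
counts <- subset(as.data.frame(table(df1)), Freq > 1)
counts  # A B B -> 2, A B C -> 3
```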

R find unduplicated rows based on other data's columns [duplicate]

Closed 7 years ago.
I have a data table in R, called A, which has three columns Col1, Col2, and Col3. Another table, called B, has the same three columns. I want to remove all rows in table A for which the pair (Col1, Col2) is present in table B. I tried, but I am not sure how to do this; I have been stuck on it for the last few days.
Thanks,
library(data.table)
A = data.table(Col1 = 1:4, Col2 = 4:1, Col3 = letters[1:4])
# Col1 Col2 Col3
#1: 1 4 a
#2: 2 3 b
#3: 3 2 c
#4: 4 1 d
B = data.table(Col1 = c(1,3,5), Col2 = c(4,2,1))
# Col1 Col2
#1: 1 4
#2: 3 2
#3: 5 1
A[!B, on = c("Col1", "Col2")]
# Col1 Col2 Col3
#1: 2 3 b
#2: 4 1 d
We can use anti_join
library(dplyr)
anti_join(A, B, by = c('Col1', 'Col2'))
Here's a go, using interaction:
A <- data.frame(Col1=1:3, Col2=2:4, Col3=10:12)
B <- data.frame(Col1=1:2, Col2=2:3, Col3=10:11)
A
# Col1 Col2 Col3
#1 1 2 10
#2 2 3 11
#3 3 4 12
B
# Col1 Col2 Col3
#1 1 2 10
#2 2 3 11
byv <- c("Col1","Col2")
A[!(interaction(A[byv]) %in% interaction(B[byv])),]
# Col1 Col2 Col3
#3 3 4 12
Or create a unique id for each row, and then exclude those that merged:
A[-merge(cbind(A[byv],id=seq_len(nrow(A))), B[byv], by=byv)$id,]

data.table in R: Replace a column value with a value from same column after matching two other columns values

I'm not able to find a solution for the requirement below. If a data.table (as below) has matching values in Col1 and Col3, replace the old Col2 value with New-Val.
Col1 Col2 Col3
1 old a
1 old a
1 New-Val a
After manipulating data table should look as below:
Col1 Col2 Col3
1 New-Val a
1 New-Val a
1 New-Val a
Update:
I wrote New-Val only to illustrate the requirement. I cannot match on that value, because it varies for different Col1 and Col3 values. For example:
Col1 Col2 Col3
1 blank a
1 blank a
1 New1 a
2 blank b
2 new2 b
2 new2 b
Likewise, the real data is huge. So ideally I want to match on Col1 and Col3, and the Col2 value to be replaced is always "blank", irrespective of which Col1 and Col3 values matched.
This should be manipulated to:
Col1 Col2 Col3
1 New1 a
1 New1 a
1 New1 a
2 new2 b
2 new2 b
2 new2 b
We can replace the "blank" values in "Col2" with NA and use na.locf to replace the NA with "New" values grouped by "Col1" and "Col3".
library(zoo)
dt[Col2 == "blank", Col2 := NA]
dt[, Col2 := na.locf(Col2, fromLast = TRUE), .(Col1, Col3)]
dt
# Col1 Col2 Col3
#1: 1 New1 a
#2: 1 New1 a
#3: 1 New1 a
#4: 2 new2 b
#5: 2 new2 b
#6: 2 new2 b
Or we can do it without any additional package:
dt[, Col2 := Col2[Col2 != 'blank'][1L], .(Col1, Col3)]
Another option is a binary join combined with by = .EACHI; this will work for factors too.
dt[dt[Col2 != "blank"], Col2 := i.Col2, on = c("Col1", "Col3"), by = .EACHI]
dt
# Col1 Col2 Col3
# 1: 1 New1 a
# 2: 1 New1 a
# 3: 1 New1 a
# 4: 2 new2 b
# 5: 2 new2 b
# 6: 2 new2 b
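The same grouped fill can also be sketched in dplyr (my addition, not part of the original answers): within each Col1/Col3 group, pick the first non-blank Col2 value.

```r
library(dplyr)

dt <- data.frame(Col1 = c(1, 1, 1, 2, 2, 2),
                 Col2 = c("blank", "blank", "New1", "blank", "new2", "new2"),
                 Col3 = c("a", "a", "a", "b", "b", "b"),
                 stringsAsFactors = FALSE)

filled <- dt %>%
  group_by(Col1, Col3) %>%
  mutate(Col2 = Col2[Col2 != "blank"][1]) %>%  # first non-blank value per group
  ungroup()
```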

Consolidate rows by group value [duplicate]

This question already has answers here:
Pivoting a large data set
(2 answers)
Closed 7 years ago.
I'm trying to simplify a huge, redundant dataset and would like your help moving cells around so that each row is a different "group" according to the value in column 1, with added columns for each unique old-row element that matches that group value. See below.
What I have:
col1 col2
1 a
1 b
1 c
1 d
2 a
2 c
2 d
2 e
3 a
3 b
3 d
3 e
What I want:
col1 col2 col3 col4 col5 col6
1 a b c d N/A
2 a N/A c d e
3 a b N/A d e
I hope this isn't too vague but I will update this question as soon as get notification of replies.
Thanks in advance!
We could use dcast from library(reshape2) to convert from 'long' to 'wide' format. By default it will take value.var = 'col2'; if there are more columns, we can specify value.var explicitly.
library(reshape2)
dcast(df1, col1 ~ factor(col2, labels = paste0('col', 2:6)))
# col1 col2 col3 col4 col5 col6
#1 1 a b c d <NA>
#2 2 a <NA> c d e
#3 3 a b <NA> d e
Here is another way, using reshape from the stats package:
x <- data.frame(col1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                col2 = c('a','b','c','d','a','c','d','e','a','b','d','e'))
x <- reshape(x, v.names = "col2", idvar = "col1", timevar = "col2", direction = "wide")
names(x) <- c('col1', 'col2', 'col3', 'col4', 'col5', 'col6')
Output:
col1 col2 col3 col4 col5 col6
1 1 a b c d <NA>
5 2 a <NA> c d e
9 3 a b <NA> d e

un-intersect values in R

I have two data sets of at least 420,500 observations each, e.g.
dataset1 <- data.frame(col1=c("microsoft","apple","vmware","delta","microsoft"),
col2=paste0(c("a","b","c",4,"asd"),".exe"),
col3=rnorm(5))
dataset2 <- data.frame(col1=c("apple","cisco","proactive","dtex","microsoft"),
col2=paste0(c("a","b","c",4,"asd"),".exe"),
col3=rnorm(5))
> dataset1
col1 col2 col3
1 microsoft a.exe 2
2 apple b.exe 1
3 vmware c.exe 3
4 delta 4.exe 4
5 microsoft asd.exe 5
> dataset2
col1 col2 col3
1 apple a.exe 3
2 cisco b.exe 4
3 vmware d.exe 1
4 delta 5.exe 5
5 microsoft asd.exe 2
I would like to print all the observations in dataset1 that do not match any observation in dataset2 (comparing col1 and col2 together). In this case that is everything except the last observation: observations 1 & 2 match on col2 but not col1, and observations 3 & 4 match on col1 but not col2, i.e.:
col1 col2 col3
1: apple b.exe 1
2: delta 4.exe 4
3: microsoft a.exe 2
4: vmware c.exe 3
You could use anti_join from dplyr
library(dplyr)
anti_join(df1, df2, by = c('col1', 'col2'))
# col1 col2 col3
#1 delta 4.exe -0.5836272
#2 vmware c.exe 0.4196231
#3 apple b.exe 0.5365853
#4 microsoft a.exe -0.5458808
data
set.seed(24)
df1 <- data.frame(col1 = c('microsoft', 'apple', 'vmware', 'delta',
'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
col3=rnorm(5), stringsAsFactors=FALSE)
set.seed(22)
df2 <- data.frame(col1 = c( 'apple', 'cisco', 'proactive', 'dtex',
'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
col3=rnorm(5), stringsAsFactors=FALSE)
A data.table solution, treating this as an anti-join on the keyed columns:
library(data.table) #1.9.5+
setDT(dataset1,key=c("col1","col2"))
setDT(dataset2,key=key(dataset1))
dataset1[!dataset2]
col1 col2 col3
1: apple b.exe 1
2: delta 4.exe 4
3: microsoft a.exe 2
4: vmware c.exe 3
You could also try without keying:
library(data.table) #1.9.5+
setDT(dataset1); setDT(dataset2)
dataset1[!dataset2,on=c("col1","col2")]
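If neither package is available, the same anti-join can be sketched in base R by pasting the matching columns into a composite key (my addition; the separator is arbitrary, so pick one that cannot occur in the data):

```r
dataset1 <- data.frame(col1 = c("microsoft", "apple", "vmware", "delta", "microsoft"),
                       col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
                       col3 = rnorm(5), stringsAsFactors = FALSE)
dataset2 <- data.frame(col1 = c("apple", "cisco", "proactive", "dtex", "microsoft"),
                       col2 = c("a.exe", "b.exe", "c.exe", "4.exe", "asd.exe"),
                       col3 = rnorm(5), stringsAsFactors = FALSE)

# Composite keys over the columns being compared
k1 <- paste(dataset1$col1, dataset1$col2, sep = "\r")
k2 <- paste(dataset2$col1, dataset2$col2, sep = "\r")

# Keep rows of dataset1 whose key never appears in dataset2
kept <- dataset1[!(k1 %in% k2), ]
```

Only the (microsoft, asd.exe) pair occurs in both tables, so four rows survive.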
