R find unduplicated rows based on other data's columns [duplicate]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a data table in R, called A, which has three columns: Col1, Col2, and Col3. Another table, called B, has the same three columns. I want to remove all rows in table A for which the pair (Col1, Col2) is present in table B. I have tried but am not sure how to do this, and have been stuck on it for the last few days.
Thanks,

library(data.table)
A = data.table(Col1 = 1:4, Col2 = 4:1, Col3 = letters[1:4])
# Col1 Col2 Col3
#1: 1 4 a
#2: 2 3 b
#3: 3 2 c
#4: 4 1 d
B = data.table(Col1 = c(1,3,5), Col2 = c(4,2,1))
# Col1 Col2
#1: 1 4
#2: 3 2
#3: 5 1
A[!B, on = c("Col1", "Col2")]
# Col1 Col2 Col3
#1: 2 3 b
#2: 4 1 d

We can use anti_join from dplyr:
library(dplyr)
anti_join(A, B, by = c('Col1', 'Col2'))

Here's a go, using interaction:
A <- data.frame(Col1=1:3, Col2=2:4, Col3=10:12)
B <- data.frame(Col1=1:2, Col2=2:3, Col3=10:11)
A
# Col1 Col2 Col3
#1 1 2 10
#2 2 3 11
#3 3 4 12
B
# Col1 Col2 Col3
#1 1 2 10
#2 2 3 11
byv <- c("Col1","Col2")
A[!(interaction(A[byv]) %in% interaction(B[byv])),]
# Col1 Col2 Col3
#3 3 4 12
Or create a unique id for each row, and then exclude those that merged:
A[-merge(cbind(A[byv],id=seq_len(nrow(A))), B[byv], by=byv)$id,]
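For completeness, the same anti-join can be done in base R with no packages, by pasting the key columns into a single composite key (a sketch; the "\r" separator is an arbitrary choice assumed not to occur in the data):

```r
# Keep rows of A whose (Col1, Col2) pair does not appear in B,
# by comparing pasted composite keys
A <- data.frame(Col1 = 1:4, Col2 = 4:1, Col3 = letters[1:4])
B <- data.frame(Col1 = c(1, 3, 5), Col2 = c(4, 2, 1))

keyA <- paste(A$Col1, A$Col2, sep = "\r")  # "\r" is unlikely to occur in the data
keyB <- paste(B$Col1, B$Col2, sep = "\r")
res <- A[!(keyA %in% keyB), ]
res
#   Col1 Col2 Col3
# 2    2    3    b
# 4    4    1    d
```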

Related


R data.table value from previous row with conditional statement

I would like to update a data.table value depending on whether it meets a criterion, returning either the value from another column or the value from the row above (same column).
As an example:
library( data.table )
data <- data.table( Col1 = 1:5, Col2 = letters[1:5] )
I would like to return the following:
data2 <- data.table( Col1= 1:5, Col2= letters[1:5], Col3= c("NA", "NA", "3", "3", "3"))
I have read the ?shift help page but I can't adapt it to using a conditional statement and returning a value in the same column. To get my desired outcome I have tried:
data[ , ( Col3 ) := ifelse( get( Col2 ) == "c", get( Col1 ) , shift( Col3 ))]
I would be grateful for some advice.
*Please ignore my use of get() for this example as I am aware it may not be the best approach.
This old, so far unanswered question has recently been revived.
As of today, I am aware of the following approaches:
1. zoo::na.locf()
According to Frank's comment:
data3 <- data.table(Col1= 1:10, Col2 = c(letters[1:5],letters[1:5]))
data3[Col2=='c', Col3 := Col1][, Col3 := zoo::na.locf(Col3, na.rm=FALSE)]
data3[]
Col1 Col2 Col3
1: 1 a NA
2: 2 b NA
3: 3 c 3
4: 4 d 3
5: 5 e 3
6: 6 a 3
7: 7 b 3
8: 8 c 8
9: 9 d 8
10: 10 e 8
2. cumsum()
data3 <- data.table(Col1= 1:10, Col2 = c(letters[1:5],letters[1:5]))
data3[, Col3 := Col1[which(Col2 == "c")][1], by = cumsum(Col2 == "c")]
data3[]
Col1 Col2 Col3
1: 1 a NA
2: 2 b NA
3: 3 c 3
4: 4 d 3
5: 5 e 3
6: 6 a 3
7: 7 b 3
8: 8 c 8
9: 9 d 8
10: 10 e 8
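3. data.table::nafill()
Since data.table 1.12.4, the last-observation-carried-forward step can also be done with the built-in nafill(), so the zoo dependency from approach 1 can be dropped (a sketch of the same idea):

```r
library(data.table)  # nafill() requires data.table >= 1.12.4

data3 <- data.table(Col1 = 1:10, Col2 = c(letters[1:5], letters[1:5]))
# seed Col3 where the condition holds, then carry the value forward
data3[Col2 == "c", Col3 := Col1][, Col3 := nafill(Col3, type = "locf")]
data3[]
```

This gives the same result as the zoo::na.locf() version above, including NA for the leading rows before the first "c".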

Consolidate rows by group value [duplicate]

This question already has answers here:
Pivoting a large data set
(2 answers)
Closed 7 years ago.
I am trying to simplify a huge, redundant dataset and would like your help moving cells around so that each row is a different "group" according to the value in column 1, with added columns for each unique old-row element that matches that group value. See below.
What I have:
col1 col2
1 a
1 b
1 c
1 d
2 a
2 c
2 d
2 e
3 a
3 b
3 d
3 e
What I want:
col1 col2 col3 col4 col5 col6
1 a b c d N/A
2 a N/A c d e
3 a b N/A d e
I hope this isn't too vague, but I will update this question as soon as I get notification of replies.
Thanks in advance!
We could use dcast from library(reshape2) to convert from 'long' to 'wide' format. By default it will take value.var = 'col2'; if there are more columns, we can specify value.var explicitly.
library(reshape2)
dcast(df1, col1~ factor(col2, labels=paste0('col', 2:6)))
# col1 col2 col3 col4 col5 col6
#1 1 a b c d <NA>
#2 2 a <NA> c d e
#3 3 a b <NA> d e
Here is another way, using reshape from the stats package:
x<-data.frame(col1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
col2 = c('a','b','c','d','a','c','d', 'e', 'a', 'b', 'd', 'e'))
x<-reshape(x, v.names="col2", idvar="col1", timevar="col2", direction="wide")
names(x)<-c('col1', 'col2', 'col3', 'col4', 'col5', 'col6')
Output:
col1 col2 col3 col4 col5 col6
1 1 a b c d <NA>
5 2 a <NA> c d e
9 3 a b <NA> d e
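In current tidyr (>= 1.0.0), pivot_wider() supersedes reshape2::dcast for this kind of reshaping; a sketch (the renaming step is only there to match the requested col2..col6 names):

```r
library(tidyr)

x <- data.frame(col1 = rep(1:3, each = 4),
                col2 = c('a','b','c','d', 'a','c','d','e', 'a','b','d','e'))
# use col2 both to name the new columns and to fill them
wide <- pivot_wider(x, names_from = col2, values_from = col2)
names(wide) <- c("col1", paste0("col", 2:6))
wide
```

Missing combinations come out as NA, matching the desired output.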

Rbind two vectors in R

I have a data.frame with several columns I'd like to join into one column in a new data.frame.
df1 <- data.frame(col1 = 1:3, col2 = 4:6, col3 = 7:9)
How would I create a new data.frame with a single column that's 1:9?
Since data.frames are essentially lists of columns, unlist(df1) will give you one large vector of all the values. Now you can simply construct a new data.frame from it:
data.frame(col = unlist(df1))
In case you want an indicator too:
stack(df1)
# values ind
# 1 1 col1
# 2 2 col1
# 3 3 col1
# 4 4 col2
# 5 5 col2
# 6 6 col2
# 7 7 col3
# 8 8 col3
# 9 9 col3
Just to provide a complete set of ways to do this, here is the tidyr way:
library(tidyr)
gather(df1)
key value
1 col1 1
2 col1 2
3 col1 3
4 col2 4
5 col2 5
6 col2 6
7 col3 7
8 col3 8
9 col3 9
One more, using the c function:
data.frame(col11 = c(df1,recursive=TRUE))
col11
col11 1
col12 2
col13 3
col21 4
col22 5
col23 6
col31 7
col32 8
col33 9
You could try:
as.data.frame(as.vector(as.matrix(df1)))
# as.vector(as.matrix(df1))
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
#8 8
#9 9
Another approach, just for using Reduce...
data.frame(Reduce(c, df1))
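In current tidyr, gather() is superseded by pivot_longer(); a sketch (note that pivot_longer() interleaves values row by row, so the row order differs from stack()/gather() unless you sort by name):

```r
library(tidyr)

df1 <- data.frame(col1 = 1:3, col2 = 4:6, col3 = 7:9)
long <- pivot_longer(df1, everything())
long
# a 9-row tibble with columns 'name' (col1..col3) and 'value' (1..9),
# ordered col1, col2, col3 within each original row
```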

un-intersect values in R

I have two data sets of at least 420,500 observations each, e.g.
dataset1 <- data.frame(col1 = c("microsoft","apple","vmware","delta","microsoft"),
                       col2 = paste0(c("a","b","c",4,"asd"), ".exe"),
                       col3 = c(2, 1, 3, 4, 5))
dataset2 <- data.frame(col1 = c("apple","cisco","vmware","delta","microsoft"),
                       col2 = paste0(c("a","b","d",5,"asd"), ".exe"),
                       col3 = c(3, 4, 1, 5, 2))
> dataset1
col1 col2 col3
1 microsoft a.exe 2
2 apple b.exe 1
3 vmware c.exe 3
4 delta 4.exe 4
5 microsoft asd.exe 5
> dataset2
col1 col2 col3
1 apple a.exe 3
2 cisco b.exe 4
3 vmware d.exe 1
4 delta 5.exe 5
5 microsoft asd.exe 2
I would like to print all the observations in dataset1 that do not intersect with one in dataset2 (comparing both col1 and col2 in each). In this case that would print everything except the last observation: observations 1 and 2 match on col2 but not col1, and observations 3 and 4 match on col1 but not col2, i.e.:
col1 col2 col3
1: apple b.exe 1
2: delta 4.exe 4
3: microsoft a.exe 2
4: vmware c.exe 3
You could use anti_join from dplyr:
library(dplyr)
anti_join(df1, df2, by = c('col1', 'col2'))
# col1 col2 col3
#1 delta 4.exe -0.5836272
#2 vmware c.exe 0.4196231
#3 apple b.exe 0.5365853
#4 microsoft a.exe -0.5458808
data
set.seed(24)
df1 <- data.frame(col1 = c('microsoft', 'apple', 'vmware', 'delta',
'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
col3=rnorm(5), stringsAsFactors=FALSE)
set.seed(22)
df2 <- data.frame(col1 = c( 'apple', 'cisco', 'proactive', 'dtex',
'microsoft'), col2= c('a.exe', 'b.exe', 'c.exe', '4.exe', 'asd.exe'),
col3=rnorm(5), stringsAsFactors=FALSE)
A data.table solution, using a keyed anti-join:
library(data.table) #1.9.5+
setDT(dataset1,key=c("col1","col2"))
setDT(dataset2,key=key(dataset1))
dataset1[!dataset2]
col1 col2 col3
1: apple b.exe 1
2: delta 4.exe 4
3: microsoft a.exe 2
4: vmware c.exe 3
You could also try without keying:
library(data.table) #1.9.5+
setDT(dataset1); setDT(dataset2)
dataset1[!dataset2,on=c("col1","col2")]
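Another base R option is the duplicated-after-rbind trick: stack the key columns of both data sets, flag later occurrences, and keep the unflagged rows of dataset1 (a sketch, with the data written out literally so it matches the question's printout; note that repeated pairs within dataset1 itself would also be dropped, unlike with the joins above):

```r
dataset1 <- data.frame(col1 = c("microsoft","apple","vmware","delta","microsoft"),
                       col2 = c("a.exe","b.exe","c.exe","4.exe","asd.exe"),
                       col3 = c(2, 1, 3, 4, 5))
dataset2 <- data.frame(col1 = c("apple","cisco","vmware","delta","microsoft"),
                       col2 = c("a.exe","b.exe","d.exe","5.exe","asd.exe"),
                       col3 = c(3, 4, 1, 5, 2))

byv  <- c("col1", "col2")
# dataset2's rows go first, so dataset1 rows already seen in dataset2 get flagged
dup  <- duplicated(rbind(dataset2[byv], dataset1[byv]))
keep <- !dup[-(1:nrow(dataset2))]   # flags corresponding to dataset1's rows
dataset1[keep, ]
#        col1  col2 col3
# 1 microsoft a.exe    2
# 2     apple b.exe    1
# 3    vmware c.exe    3
# 4     delta 4.exe    4
```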
