Eliminate all duplicates according to a key + keep records from a table which are not in another table - r

Question 1:
I am relatively new to R and I have two distinct questions.
I need to eliminate duplicates according to a key (single or multiple), but I need to drop every row whose key is duplicated, so unique won't do it. I also found the function duplicated, but it marks rows as TRUE only from the second occurrence onward, whereas I need to eliminate all of them.
> DT <- data.table(Key=c("a","a","a","b","c"),var=c(1:5))
> DT
Key var
1: a 1
2: a 2
3: a 3
4: b 4
5: c 5
> unique(DT)
Key var
1: a 1
2: b 4
3: c 5
> duplicated(DT)
[1] FALSE TRUE TRUE FALSE FALSE
what I want instead is
Key var
1: b 4
2: c 5
Question 2:
I have 2 data tables and I want to keep only the records from DTFrom for which the combination of values of the 2 (or more) keys is not in DTFilter (I found similar questions for SQL but not for R):
> DTFrom
key1 key2 var
1: q m 1
2: q n 2
3: q b 3
4: w n 4
5: e m 5
6: e n 6
7: e b 7
8: r n 8
9: r b 9
10: t m 10
11: t n 11
12: t b 12
13: t v 13
> DTFilter
key1 key2 var
1: q m 1
2: w n 4
3: e b 7
4: e n 6
5: r n 8
6: r b 9
7: t m 10
8: t v 13
and I want the result to be:
> DTOut
key1 key2 var
1: q n 2
2: q b 3
3: e m 5
4: t n 11
5: t b 12
Thanks in advance!

For the first question, you can use the fromLast argument in duplicated:
DT[ !(duplicated(Key) | duplicated(Key, fromLast = TRUE))]
# Key var
#1: b 4
#2: c 5
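An alternative sketch for the first question counts rows per key with .N and keeps only groups of size one. The result name singles is made up for illustration; for several key columns, by = .(key1, key2) should work the same way.

```r
library(data.table)

DT <- data.table(Key = c("a", "a", "a", "b", "c"), var = 1:5)

# Keep a group only when it has exactly one row; groups for which the
# if() returns NULL are dropped from the result.
singles <- DT[, if (.N == 1L) .SD, by = Key]
singles
#    Key var
# 1:   b   4
# 2:   c   5
```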
For the second question, you can do:
setkey(DTFrom, key1, key2)
DTFrom[!DTFilter]
# key1 key2 var
#1: e m 5
#2: q b 3
#3: q n 2
#4: t b 12
#5: t n 11
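The same not-join can also be written without setting a key by naming the join columns in on=. The sketch below rebuilds the two tables from the question; DTOut is the name the question uses for the desired result. Because no key is set, the rows keep the original order of DTFrom.

```r
library(data.table)

DTFrom <- data.table(
  key1 = c("q", "q", "q", "w", "e", "e", "e", "r", "r", "t", "t", "t", "t"),
  key2 = c("m", "n", "b", "n", "m", "n", "b", "n", "b", "m", "n", "b", "v"),
  var  = 1:13
)
DTFilter <- data.table(
  key1 = c("q", "w", "e", "e", "r", "r", "t", "t"),
  key2 = c("m", "n", "b", "n", "n", "b", "m", "v"),
  var  = c(1L, 4L, 7L, 6L, 8L, 9L, 10L, 13L)
)

# Not-join: rows of DTFrom whose (key1, key2) pair has no match in DTFilter.
DTOut <- DTFrom[!DTFilter, on = c("key1", "key2")]
DTOut
#    key1 key2 var
# 1:    q    n   2
# 2:    q    b   3
# 3:    e    m   5
# 4:    t    n  11
# 5:    t    b  12
```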

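For the first question, you can use the table() function for additional filtering (this relies on the unique keys appearing in sorted order, as they do here, because table() sorts its names):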
DT[!duplicated(Key)][table(DT$Key) == 1,]
# Key var
# 1: b 4
# 2: c 5
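The table() filter above pairs counts with rows by position, which holds when the keys already appear in sorted order. A sketch that matches by key value instead, using a deliberately unsorted example table (DT2, tab, and single_keys are names made up for illustration):

```r
library(data.table)

# Keys appear in the order c, a, a, b, which differs from table()'s sorted order.
DT2 <- data.table(Key = c("c", "a", "a", "b"), var = 1:4)

tab <- table(DT2$Key)
single_keys <- names(tab)[tab == 1]  # key values that occur exactly once

DT2[Key %in% single_keys]
#    Key var
# 1:   c   1
# 2:   b   4
```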
For the second question, the dplyr package has an anti_join function made for exactly this case:
require("dplyr")
anti_join(DTFrom, DTFilter, by = c("key1", "key2"))
# key1 key2 var
# 1 e m 5
# 2 q b 3
# 3 q n 2
# 4 t n 11
# 5 t b 12

My preferred method for the first question, assuming DT is keyed (for example via setkey(DT, Key)), is:
DT[.(DT[ , .N, by=key(DT)][N==1L, !"N"])]
Similarly we can do:
DT[.(DT[, .N, by=key(DT)][N==1L, ]), .SD]
@docendodiscimus's answer for Q2 is also what I would use.

Related

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works smoothly.
Can anyone help with sorting using column name vector?
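order(keycol) orders the character vector c("x", "y") itself and returns 1 2, so DT[order(keycol)] simply selects the first two rows. To sort by columns named in a character vector, you need ?setorderv and its cols argument: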
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
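If the original table should stay untouched, one option is to sort a copy. The sketch below also uses setorderv's order argument to mix ascending and descending columns; sorted is a name made up for illustration.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
keycol <- c("x", "y")

# copy() protects DT from the by-reference update;
# order = c(1L, -1L) means x ascending, y descending.
sorted <- setorderv(copy(DT), keycol, order = c(1L, -1L))
sorted$v
# [1] 6 5 4 3 2 1 9 8 7
```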

Creating a new data table for each row of an existing data table R while avoiding memory vector issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
w d
1: 1 5
2: 2 6
3: 3 7
> B
K m
1: 2 9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C=A[,B[],by=names(A)]
w d K m
1: 1 5 2 9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2 9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2 9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops. Memory should not be an issue: my server has up to 50 GB of RAM, and the resulting data table would certainly be smaller than that.
Does anyone know an efficient way to do this?
A hacky way to handle this might be to add an identical helper column to each table and then to allow cartesian joins:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = TRUE]
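If the helper column is unwanted, the same expansion can be sketched with plain row indices, which also reproduces the row order shown in the question. Whether this avoids the error on the real data is untested; the result still has nrow(A) * nrow(B) rows and needs memory for all of them.

```r
library(data.table)

A <- data.table(w = 1:3, d = 5:7)
B <- data.table(K = 2:4, m = 9:11)

# Repeat each row of A nrow(B) times, and cycle through B nrow(A) times,
# then bind the two expanded tables column-wise.
idxA <- rep(seq_len(nrow(A)), each = nrow(B))
idxB <- rep(seq_len(nrow(B)), times = nrow(A))
C <- cbind(A[idxA], B[idxB])
nrow(C)
# [1] 9
```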

How to sort a data.table using a target vector

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.
Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
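A closely related sketch expresses the target order as factor levels instead of calling match() directly. Ordering by the factor sorts on the level positions, so it should give the same result on this data; ordered_dt is a name made up for illustration.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 2, 3))
k <- c("2", "3", "1")

# The levels define the desired order of y; ties keep their original order.
ordered_dt <- DT[order(factor(y, levels = as.numeric(k)))]
ordered_dt$y
# [1] 2 2 2 3 3 3 1 1 1
```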
You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from the comment of Richard)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

variable usage in data.table [duplicate]

This must be a very basic thing, but I can't figure out how to use a regular variable that has the same name as a data.table column. I can use a different variable name to avoid the conflict, but I'm wondering if there's a way to evaluate the variable before DT resolves the name as a column.
> DT = data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c=13:18)
> DT
ID a b c
1: b 1 7 13
2: b 2 8 14
3: b 3 9 15
4: a 4 10 16
5: a 5 11 17
6: c 6 12 18
> DT[b == 7]
ID a b c
1: b 1 7 13
> b <- 7
> DT[b == b]
ID a b c
1: b 1 7 13
2: b 2 8 14
3: b 3 9 15
4: a 4 10 16
5: a 5 11 17
6: c 6 12 18
Since you have two variables named b, one inside DT and one outside the scope of DT, we have to go and get b <- 7 from the global environment. We can do that with get().
DT[b == get("b", globalenv())]
# ID a b c
# 1: b 1 7 13
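Update: You mention in the comments that the variables are inside a function environment. In that case, you can use parent.frame() instead of globalenv(). Note that the frame count (3 here) depends on how many calls sit between get() and the function, so treat it as fragile.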
f <- function(b, dt) dt[b == get("b", parent.frame(3))]
f(7, DT)
# ID a b c
# 1: b 1 7 13
f(12, DT)
# ID a b c
# 1: c 6 12 18
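Inside a function, a simpler sketch is to copy the argument to a local name that is not a column of the table, so no frame counting is needed. The names filter_b and target are hypothetical, and the approach assumes the table has no column called target.

```r
library(data.table)

DT <- data.table(ID = c("b", "b", "b", "a", "a", "c"),
                 a = 1:6, b = 7:12, c = 13:18)

filter_b <- function(b, dt) {
  target <- b        # local copy under a name that is not a column of dt
  dt[b == target]    # b resolves to the column, target to the local value
}

filter_b(7, DT)
#    ID a b  c
# 1:  b 1 7 13
```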

data preparation part II

There's another problem I encountered which is (I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a more aggregated view of the data. For each letter in K there should be only one row, and "A_sum" should hold the number of A values that share the same value of B. Here there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
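Since A_sum is just a row count per group, the same result can be sketched with data.table's built-in .N symbol instead of length(A):

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the number of rows in each (K, B) group.
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]
dt_new
#    K B A_sum
# 1: A 3     3
# 2: B 1     3
```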
