How to sort a data.table using a target vector - r

So, I have the following data.table
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
> DT
x y
1: b 1
2: b 2
3: b 3
4: a 1
5: a 2
6: a 3
7: c 1
8: c 2
9: c 3
And I have the following vector
k <- c("2","3","1")
I want to use k as a target vector to sort DT using y and get something like this.
> DT
x y
1: b 2
2: a 2
3: c 2
4: b 3
5: a 3
6: c 3
7: b 1
8: a 1
9: c 1
Any ideas? If I use DT[order(k)] I get a subset of the original data, and that isn't what I am looking for.

Throw a call to match() in there.
DT[order(match(y, as.numeric(k)))]
# x y
# 1: b 2
# 2: a 2
# 3: c 2
# 4: b 3
# 5: a 3
# 6: c 3
# 7: b 1
# 8: a 1
# 9: c 1
Actually DT[order(match(y, k))] would work as well, but it is probably safest to make the arguments to match() of the same class just in case.
Note: match() is known to be sub-optimal in some cases. If you have a large number of rows, you may want to switch to fastmatch::fmatch for faster matching.
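To see why order(k) alone returns only three rows while the match() version reorders all nine, here is a minimal sketch. The intermediate name rank_y is only for illustration and is not part of the answer above.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 2, 3))
k <- c("2", "3", "1")

# order(k) sorts the 3-element vector k itself, so DT[order(k)] picks 3 rows
idx_k <- order(k)

# match() gives, for every row, the position of its y value inside k;
# ordering by that position sorts all rows in the sequence defined by k
rank_y <- match(DT$y, as.numeric(k))
res <- DT[order(rank_y)]
res
```

Because order() breaks ties by original position, rows with the same y keep their original relative order (b, a, c).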

You can do this:
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
setkey(DT,y)
DT[data.table(as.numeric(k))]
or (from the comment of Richard)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,2,3))
k <- c("2","3","1")
DT[data.table(y = as.numeric(k)), on = "y"]

Related

Creating data.table from a list of unequal vector lengths

I am looking to create a data.table from a list of unequal vectors, but instead of repeating the values for the "shorter" vector, I want it to be filled with NAs. I have one possible solution, but it repeats values and does not retain the NA as needed.
Example:
library(data.table)
my_list <- list(A = 1:4, B = letters[1:5])
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 1 e
But I want it to look like:
as.data.table(do.call(cbind, my_list))
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
Thank you!
We need to make the lengths equal by appending NA to every list element that is shorter than the longest one:
mx <- max(lengths(my_list))
as.data.table(do.call(cbind, lapply(my_list, `length<-`, mx)))
-output
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: <NA> e
Instead of cbind/as.data.table, setDT is more compact
setDT(lapply(my_list, `length<-`, mx))[]
A B
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: NA e
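If the backtick form `length<-` looks cryptic, the same padding can be written with an explicit helper. This is only a sketch: pad_to is a hypothetical name introduced here, not something from data.table.

```r
library(data.table)

my_list <- list(A = 1:4, B = letters[1:5])
mx <- max(lengths(my_list))

# pad_to is a hypothetical helper: growing a vector's length fills the
# new positions with NA of the vector's own type
pad_to <- function(v, n) {
  length(v) <- n
  v
}

res <- as.data.table(lapply(my_list, pad_to, n = mx))
res
```

Since no cbind() is involved, each column keeps its own type (A stays integer, B stays character).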
You may use stringi::stri_list2matrix to pad all list elements to the same length.
my_list |>
  stringi::stri_list2matrix() |>
  data.table::as.data.table() |>
  type.convert(as.is = TRUE) |>
  setNames(names(my_list))
# A B
#1: 1 a
#2: 2 b
#3: 3 c
#4: 4 d
#5: NA e

Subsetting data.table by values with repetition

I would like to subset a data.table in a way that returns some rows repeatedly. This works well with indices, but I am not sure how to do it simply with values, especially when a value appears in more than one row.
E.g.:
library(data.table)
dt<-data.table(x1=c('a','a','b','b','c','c'),x2=c(1,2,3,4,5,6))
xsel<-c('a','b','a','a','c','b')
dt[x1%in%xsel,]
will provide this output:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: c 5
6: c 6
I would like to get it in the original order and with repetition exactly as it is in the xsel vector. Is it possible to do it in a reasonably simple way without looping? Thanks.
Using:
setkey(dt, x1) # set the key
dt[J(xsel)] # or: dt[.(xsel)]
gives:
x1 x2
1: a 1
2: a 2
3: b 3
4: b 4
5: a 1
6: a 2
7: a 1
8: a 2
9: c 5
10: c 6
11: b 3
12: b 4
Without setting the key, you could use:
dt[.(xsel), on = .(x1)]

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and drops the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works fine.
Can anyone help with sorting using column name vector?
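DT[order(keycol)] orders the character vector c("x", "y") itself, which returns 1 2, so you get only the first two rows. You need ?setorderv and its cols argument: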
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
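If the original table should stay untouched, or some columns need descending order, one option is to sort a copy and pass the order argument of setorderv. A minimal sketch; the names sorted and keycol_order are made up for this example.

```r
library(data.table)

DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
keycol <- c("x", "y")

# 1 = ascending, -1 = descending, one entry per column in keycol
keycol_order <- c(1L, -1L)

# copy() protects DT, because setorderv modifies its argument by reference
sorted <- setorderv(copy(DT), keycol, order = keycol_order)
sorted
```

Here x is sorted ascending and y descending within each x, while DT itself keeps its original row order.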

Eliminate all duplicates according to a key + keep records from a table which are not in another table

I am relatively new to R and I have two distinct questions.
Question 1:
I need to eliminate duplicates according to a key (single or multiple), but all of them, so unique won't do. I also found the function duplicated, but it marks as TRUE only the second occurrence onward, and I need to eliminate every occurrence.
> DT <- data.table(Key=c("a","a","a","b","c"),var=c(1:5))
> DT
Key var
1: a 1
2: a 2
3: a 3
4: b 4
5: c 5
> unique(DT)
Key var
1: a 1
2: b 4
3: c 5
> duplicated(DT)
[1] FALSE TRUE TRUE FALSE FALSE
What I want instead is:
Key var
1: b 4
2: c 5
Question 2:
I have 2 data tables and I want to keep only the records from DTFrom for which the combination of values from the 2 (or more) keys is not in DTFilter (I found similar questions for SQL but not for R):
> DTFrom
key1 key2 var
1: q m 1
2: q n 2
3: q b 3
4: w n 4
5: e m 5
6: e n 6
7: e b 7
8: r n 8
9: r b 9
10: t m 10
11: t n 11
12: t b 12
13: t v 13
> DTFilter
key1 key2 var
1: q m 1
2: w n 4
3: e b 7
4: e n 6
5: r n 8
6: r b 9
7: t m 10
8: t v 13
and I want the result to be:
> DTOut
key1 key2 var
1: q n 2
2: q b 3
3: e m 5
4: t n 11
5: t b 12
Thanks in advance!
For the first question, you can use the fromLast argument in duplicated:
DT[ !(duplicated(Key) | duplicated(Key, fromLast = TRUE))]
# Key var
#1: b 4
#2: c 5
For the second question, you can do:
setkey(DTFrom, key1, key2)
DTFrom[!DTFilter]
# key1 key2 var
#1: e m 5
#2: q b 3
#3: q n 2
#4: t b 12
#5: t n 11
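The same anti-join can also be written without setting a key by naming the join columns in on=. This is a sketch that rebuilds the tables from the question; only the key columns of DTFilter are needed for the join, so var is left out there.

```r
library(data.table)

DTFrom <- data.table(
  key1 = c("q", "q", "q", "w", "e", "e", "e", "r", "r", "t", "t", "t", "t"),
  key2 = c("m", "n", "b", "n", "m", "n", "b", "n", "b", "m", "n", "b", "v"),
  var  = 1:13
)
DTFilter <- data.table(
  key1 = c("q", "w", "e", "e", "r", "r", "t", "t"),
  key2 = c("m", "n", "b", "n", "n", "b", "m", "v")
)

# !DTFilter in i is a not-join: keep rows of DTFrom with no match in DTFilter
DTOut <- DTFrom[!DTFilter, on = c("key1", "key2")]
DTOut
```

Because no key is set, DTFrom is not re-sorted and the result keeps the row order of DTFrom, as in the output the question asks for.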
As for the first question, you can use the table function for additional filtering:
DT[!duplicated(Key)][table(DT$Key) == 1,]
# Key var
# 1: b 4
# 2: c 5
As for the second question, the dplyr package has an anti_join function made specifically for this case:
library(dplyr)
anti_join(DTFrom, DTFilter, by = c("key1", "key2"))
# key1 key2 var
# 1 e m 5
# 2 q b 3
# 3 q n 2
# 4 t n 11
# 5 t b 12
My preferred method for the first question is:
DT[.(DT[ , .N, by=key(DT)][N==1L, !"N"])]
Similarly we can do:
DT[.(DT[, .N, by=key(DT)][N==1L, ]), .SD]
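The two expressions above use key(DT), so they assume a key has already been set on DT. A sketch of a key-free variant that groups on the column directly and keeps only groups of size one (the name singles is made up here):

```r
library(data.table)

DT <- data.table(Key = c("a", "a", "a", "b", "c"), var = 1:5)

# For each Key, return the group's rows only when the group has exactly one
# row; groups where the if() yields NULL are dropped from the result
singles <- DT[, if (.N == 1L) .SD, by = Key]
singles
```

To use several key columns, list them all in by, for example by = .(key1, key2).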
@docendodiscimus's answer for Q2 is also the one I would give.

data preparation part II

There's another problem I encountered which is (I think) quite interesting:
dt <- data.table(K=c("A","A","A","B","B","B"),A=c(2,3,4,1,3,4),B=c(3,3,3,1,1,1))
dt
K A B
1: A 2 3
2: A 3 3
3: A 4 3
4: B 1 1
5: B 3 1
6: B 4 1
Now I want a somewhat "higher" level of the data. For each letter in K there should be only one row, and "A_sum" should hold the number of A values that share the same value of B. So there are three values for B=3 and three values for B=1.
Resulting data.table:
dt_new
K A_sum B
1: A 3 3
2: B 3 1
It's not clear how you want to treat K, but here's one option:
dt_new <- dt[, list(A_sum = length(A)), by = list(K, B)]
# K B A_sum
# 1: A 3 3
# 2: B 1 3
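Since A_sum is just a row count per group, the same result can probably be written with data.table's .N symbol instead of length(A). A minimal sketch, with the column order set afterwards to match the layout asked for in the question:

```r
library(data.table)

dt <- data.table(K = c("A", "A", "A", "B", "B", "B"),
                 A = c(2, 3, 4, 1, 3, 4),
                 B = c(3, 3, 3, 1, 1, 1))

# .N is the number of rows in each (K, B) group
dt_new <- dt[, .(A_sum = .N), by = .(K, B)]

# reorder columns by reference to K, A_sum, B as shown in the question
setcolorder(dt_new, c("K", "A_sum", "B"))
dt_new
```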

Resources