Update a data.table based on another data.table - r

I want to update the columns in an old data.table based on a new data.table, but only when the new value is not NA.
DT_old = data.table(x=rep(c("a","b","c")), y=c(1,3,6), v=1:3, l=c(1,1,1))
DT_old
x y v l
1: a 1 1 1
2: b 3 2 1
3: c 6 3 1
DT_new = data.table(x=rep(c("b","c",'d')), y=c(9,6,10), v=c(2,NA,10), z=c(9,9,9))
DT_new
x y v z
1: b 9 2 9
2: c 6 NA 9
3: d 10 10 9
I want the output to be
x y v z
1: b 9 2 9
2: c 6 3 9
3: d 10 10 9
4: a 1 1 NA
Currently I am merging the two data.tables and going through each column, replacing the NAs in the new data.table
DT_merged <- merge(DT_new, DT_old, all=TRUE, by='x')
DT_merged
x y.x v.x z y.y v.y l
1: a NA NA NA 1 1 1
2: b 9 2 9 3 2 1
3: c 6 NA 9 6 3 1
4: d 10 10 9 NA NA NA
DT_merged[is.na(y.x), y.x := y.y]
DT_merged[is.na(v.x), v.x := v.y]
DT_merged = DT_merged[, list(x=x, y=y.x, v=v.x, z=z)]
Is there a better way to do the above?

Here's how I would approach this. First, I will expand DT_new to cover the unique values of the x columns of both tables using a binary join
res <- setkey(DT_new, x)[unique(c(x, DT_old$x))]
res
# x y v z
# 1: b 9 2 9
# 2: c 6 NA 9
# 3: d 10 10 9
# 4: a NA NA NA
Then, I will update the two columns in res by reference using another binary join
setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
# x y v z
# 1: a 1 1 NA
# 2: b 3 2 9
# 3: c 6 3 9
# 4: d 10 10 9
Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such a thing in R or any other language AFAIK. Thus, your own solution could be a good option by itself.
Though, here are some other alternatives, mainly taken from a similar question I myself asked not long ago.
Using two ifelse statements
setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
                            v = ifelse(is.na(v), i.v, v))]
Two separate conditional joins
setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]
Both will give you what you need.
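For reference, applied to a freshly expanded res (i.e. the result of the first binary join, before the unconditional update shown earlier), either variant should reproduce the desired output up to row order:
res
# x y v z
# 1: a 1 1 NA
# 2: b 9 2 9
# 3: c 6 3 9
# 4: d 10 10 9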
P.S.
If you don't want warnings, you need to define the corresponding column classes correctly, e.g. the v column in DT_new should be defined as v = c(2L, NA_integer_, 10L)
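For example, DT_new could be built with an integer v from the start (a small sketch of the same example data):
DT_new = data.table(x = c("b", "c", "d"),
                    y = c(9, 6, 10),
                    v = c(2L, NA_integer_, 10L),
                    z = c(9, 9, 9))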

Related

Sort a data.table programmatically using character vector of multiple column names

I need to sort a data.table on multiple columns provided as a character vector of variable names.
This is my approach so far:
DT = data.table(x = rep(c("b","a","c"), each = 3), y = c(1,3,6), v = 1:9)
#column names to sort by, stored in a vector
keycol <- c("x", "y")
DT[order(keycol)]
x y v
1: b 1 1
2: b 3 2
Somehow it displays just 2 rows and removes the other records. But if I do this:
DT[order(x, y)]
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
It works as expected.
Can anyone help with sorting using a character vector of column names?
DT[order(keycol)] fails because keycol is evaluated as the literal character vector c("x", "y"): order(keycol) returns the ordering of that 2-element vector, so you end up subsetting just 2 rows instead of sorting by the columns the names refer to. You need ?setorderv and its cols argument:
A character vector of column names of x by which to order
library(data.table)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
#column vector
keycol <-c("x","y")
setorderv(DT, keycol)
DT
x y v
1: a 1 4
2: a 3 5
3: a 6 6
4: b 1 1
5: b 3 2
6: b 6 3
7: c 1 7
8: c 3 8
9: c 6 9
Note that there is no need to assign the output of setorderv back to DT. The function updates DT by reference.
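If you do need a sorted copy while leaving DT untouched, one option (a small sketch) is to sort an explicit copy; setorderv returns its input invisibly, so the result can be assigned:
DT_sorted <- setorderv(copy(DT), keycol)  # DT itself keeps its original order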

Creating a new data table for each row of an existing data table in R while avoiding a memory vector issue

Suppose I have two data tables:
library(data.table)
A=data.table(w=1:3,d=5:7)
B=data.table(K=2:4,m=9:11)
> A
w d
1: 1 5
2: 2 6
3: 3 7
> B
K m
1: 2 9
2: 3 10
3: 4 11
I want to do the following expansion, where I have a new B for each row of A:
C=A[,B[],by=names(A)]
w d K m
1: 1 5 2 9
2: 1 5 3 10
3: 1 5 4 11
4: 2 6 2 9
5: 2 6 3 10
6: 2 6 4 11
7: 3 7 2 9
8: 3 7 3 10
9: 3 7 4 11
However, when I do it with my real data, I get this error:
Error in `[.data.table`(A, , B[], by = names(A)) :
negative length vectors are not allowed
It turns out this is a memory error. However, I think there should be a way to do this without loops; memory is not an issue on my server, which has up to 50 GB of RAM, and the resulting data table would certainly be smaller than that.
Does anyone know an efficient way to do this?
A hacky way to handle this might be to add an identical helper column to each table and then allow a cartesian join:
library(data.table)
A = data.table(w = 1:3, d = 5:7)
B = data.table(K = 2:4, m = 9:11)
A[, j := 1]
B[, j := 1]
C = A[B, on = 'j', allow.cartesian = T]
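The helper column can then be dropped again, and a quick sanity check (a sketch using the example tables above) is that the result should have nrow(A) * nrow(B) rows:
C[, j := NULL]                 # drop the helper column again
nrow(C) == nrow(A) * nrow(B)   # full cross of A and B
# [1] TRUE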

How to join tables with OR condition in data.table

Is it possible in data.table to join tables with an OR condition?
For example:
library(data.table)
X<-data.table(x=c('a','b','c','d','e','f'),y=c(1,1,2,2,3,3),z=c(10,11,12,13,14,15))
x y z
1: a 1 10
2: b 1 11
3: c 2 12
4: d 2 13
5: e 3 14
6: f 3 15
Y<-data.table(x=c('a','e','a'),z=c(12,20,14),t=c('a','b','c'))
x z t
1: a 12 a
2: e 20 b
3: a 14 c
# and I need something like this:
X[Y,on=c("x"|"z"),.(x,y,z,i.t)]
x y z t
1: a 1 10 a
2: a 1 10 c
3: b 1 11 NA
4: c 2 12 a
5: d 2 13 NA
6: e 3 14 b
7: e 3 14 c
8: f 3 15 NA
I haven't found information about joining with OR in documentation.
Have I missed something?
The OP requested that the result set should consist of 3 subsets:
rows matching on column x
rows matching on column z
remaining rows of data.table X
So, this is a kind of right outer join of table X with Y on either column x or z.
This can be translated into 2 separate inner joins on columns x and z respectively, a union of both result sets, and a final outer join to add the remaining rows from table X.
Combined in one data.table statement this becomes
unique(rbindlist(list(
  X[Y, on = "x", .(x, y, z, t), nomatch = 0],
  X[Y, on = "z", .(x, y, z, t), nomatch = 0]
)))[X, on = .(x, y, z)]
# x y z t
#1: a 1 10 a
#2: a 1 10 c
#3: b 1 11 NA
#4: c 2 12 a
#5: d 2 13 NA
#6: e 3 14 b
#7: e 3 14 c
#8: f 3 15 NA
The inner joins are enforced by the parameter nomatch = 0. The union operation is implemented using rbindlist(list(...)). EDIT: unique() is required to remove double matches for the case where x and z match in the same row of X and of Y (thanks to filius_arator for pointing this out).
The final right outer join uses all rows of X including those which haven't been matched yet. Note that this join is on the three columns of X.
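To make the intermediate step concrete, with the example data the union of the two inner joins (before the final right outer join back onto X) contains only the matched rows:
unique(rbindlist(list(
  X[Y, on = "x", .(x, y, z, t), nomatch = 0],
  X[Y, on = "z", .(x, y, z, t), nomatch = 0]
)))
#    x y  z t
# 1: a 1 10 a
# 2: e 3 14 b
# 3: a 1 10 c
# 4: c 2 12 a
# 5: e 3 14 c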
I am not sure if this is what you want or if it is very data.table-esque, but there are no other answers at the moment:
join1 <- merge(X, Y[, c('x', 't'), with = FALSE], all.x = TRUE)
merge(join1, Y[, c('z', 't'), with = FALSE], all.x = TRUE, by = 'z')[,
  t := ifelse(!is.na(t.x), t.x, t.y)][,
  t.x := NULL][,
  t.y := NULL][]
Giving:
z x y t
1: 10 a 1 a
2: 11 b 1 NA
3: 12 c 2 a
4: 13 d 2 NA
5: 14 e 3 b
6: 15 f 3 NA
EDIT: with the updated example, here's an approach, but I'm sure there are better ways that the data.table gurus could show:
join1 <- merge(X, Y[, c('x', 't'), with = FALSE], all.x = TRUE)
merge(join1, Y[, c('z', 't'), with = FALSE], all.x = TRUE, by = 'z')[,
  id := seq(.N)][,
  .(t = list(na.omit(c(t.x, t.y)))), by = c('id', 'x', 'y', 'z')][,
  .(x = x, y = y, z = z, t = unlist(t)), by = c('id')][]
## id x y z t
## 1: 1 a 1 10 a
## 2: 2 a 1 10 c
## 3: 3 b 1 11 NA
## 4: 4 c 2 12 a
## 5: 5 d 2 13 NA
## 6: 6 e 3 14 b
## 7: 6 e 3 14 c
## 8: 7 f 3 15 NA

R - unpivot list in data.table rows

I have a dataset that contains several columns, including 1 with list entries:
DT = data.table(
  x = c(1:5),
  y = seq(2, 10, 2),
  z = list(list("a","b","a"), list("a","c"), list("b","c"), list("a","b","c"), list("b","c","b"))
)
Basically, I'm trying to unlist a, b, c from column z, and aggregate the data based on the x & y values.
Desired output:
z x sum(y)
1: a 1 4
2: b 1 2
3: a 2 4
4: c 2 4
5: b 3 6
6: c 3 6
7: a 4 8
8: b 4 8
9: c 4 8
10: b 5 20
11: c 5 10
My current method is rather roundabout: I created 2 other columns with the x and y values repeated in lists of the same length as the list entry in the z column, then unlisted all 3 columns simultaneously before aggregating, i.e. summing the y values grouped by z & x.
Code (before unlisting & aggregation):
DT[, listlen := sapply(z, function(x) length(x))]
for (a in c(1:nrow(DT))) {
  DT[a, x1 := list(list(rep(DT[a, x], DT[a, listlen])))]
  DT[a, y1 := list(list(rep(DT[a, y], DT[a, listlen])))]
}
DT_out = data.table(x = unlist(DT[,x1]), y = unlist(DT[,y1]), z = unlist(DT[,z]))
x y z listlen x1 y1
1: 1 2 <list> 3 1,1,1 2,2,2
2: 2 4 <list> 2 2,2 4,4
3: 3 6 <list> 2 3,3 6,6
4: 4 8 <list> 3 4,4,4 8,8,8
5: 5 10 <list> 3 5,5,5 10,10,10
Is there a method in the data.table or reshape packages that can help me melt the dataset / do this much more simply? I'm working with many more rows than this, and this step seems very inefficient.
Any other help regarding the aggregation step would be much appreciated too!
unlist your z column first and then just aggregate as per normal via by=:
DT[, .(z=unlist(z)), by=.(x,y)][, .(sumy=sum(y)), by=.(x,z)]
# x z sumy
# 1: 1 a 4
# 2: 1 b 2
# 3: 2 a 4
# 4: 2 c 4
# 5: 3 b 6
# 6: 3 c 6
# 7: 4 a 8
# 8: 4 b 8
# 9: 4 c 8
#10: 5 b 20
#11: 5 c 10
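Note that grouping by .(x, y) works here because each (x, y) pair identifies a single row of DT. If that weren't guaranteed, a variant (a sketch, using a hypothetical helper column rid) that groups by an explicit row id would be safer:
DT[, rid := .I]                                   # explicit row id per original row
DT[, .(x = x, y = y, z = unlist(z)), by = rid][   # one row per list element
  , .(sumy = sum(y)), by = .(x, z)]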

data.table assigning with `sapply` in a merge

I have some data.tables like so:
x <- data.table(id=rep(1:3, 2), a=1:6)
y <- data.table(id=1:3, b=2:4)
I can merge them like this:
setkey(x, id)
setkey(y, id)
x[y]
id a b
1: 1 1 2
2: 1 4 2
3: 2 2 3
4: 2 5 3
5: 3 3 4
6: 3 6 4
Now, I want to create a new column in x based on a and b which is the sum of a and b.
I can do this with:
x[y, val:=a + b]
However, now suppose for some reason that the '+' operator is not vectorised. How can I store a row-wise calculation into x where x[y] is needed for the calculation? Also, assume I cannot use mapply (because for my actual problem, mapply is not suited to the function).
I'm trying to use sapply like so to add in a row-wise manner:
x[y, sapply(1:nrow(x), function (i) a[i] + b[i])]
However this returns the incorrect result:
id V1
1: 1 3
2: 1 NA
3: 1 NA
4: 1 NA
5: 1 NA
6: 1 NA
7: 2 5
8: 2 NA
9: 2 NA
10: 2 NA
11: 2 NA
12: 2 NA
13: 3 7
14: 3 NA
15: 3 NA
16: 3 NA
17: 3 NA
18: 3 NA
If I do this it works:
x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]
# [1] 3 6 5 8 7 10
BUT when I try and assign this to a column in x, it is not stored (makes sense because it looks like I'm trying to save the new column into x[y]).
x[y][, val:=sapply(1:nrow(x), function (i) a[i] + b[i])]
Is there any way to do the above but save the output into x[, val]?
Is this how I am supposed to do it, or is there a more data.table-y way?
x[, val:=x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]]
You are doing by-without-by without knowing it (see below for the description from the help):
Advanced: Aggregation for a subset of known groups is particularly
efficient when passing those groups in i. When i is a data.table,
DT[i,j] evaluates j for each row of i. We call this by without by or
grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
This means that j is evaluated for each row of i (cycling through y one row at a time), so if you run sapply(1:nrow(x), ...) in j it will create a vector of length nrow(x) each time, which is not what you want.
So your second option is definitely a valid approach (as it is one of the recommended approaches for doing this).
Otherwise you could use .N (when grouping by i, .N is the number of rows in x matched to, for each row of i) instead of nrow(x), but you will have to think about the lengths of your objects and how your function is to be vectorized.
Take this as an example
x[y, {browser(); a+b}]
Called from: `[.data.table`(x, y, {
browser()
a + b
})
Browse[1]> a
[1] 1 4
Browse[1]> b
[1] 2
Browse[1]> .N
[1] 2
a has length two because that value of the key matches 2 rows of x. b only has length 1 because it only has length 1 in y.
I think the best approach is to correctly Vectorize your function (which is hard to give advice on without more of an example).
Another approach would be to replicate b to the length of a, e.g.
x[y, val := {
  bl <- rep_len(b, .N)
  sapply(seq_len(.N), function(i) a[i] + bl[i])
}]
x
id a val
1: 1 1 3
2: 1 4 6
3: 2 2 5
4: 2 5 8
5: 3 3 7
6: 3 6 10
Or, if you know that y has unique rows for each value of id, then you don't need to try to index any columns from it.
x[y, val2 := sapply(seq_len(.N), function(i) a[i] + b)]
# an alternative would be to use sapply on a (avoid creating another vector)
x[y, val3 := sapply(a, function(ai) ai + b)]
x
# id a val val2 val3
# 1: 1 1 3 3 3
# 2: 1 4 6 6 6
# 3: 2 2 5 5 5
# 4: 2 5 8 8 8
# 5: 3 3 7 7 7
# 6: 3 6 10 10 10
