Rolling join multiple columns independently to eliminate NAs

I am trying to do a rolling join in data.table that brings in multiple columns, but rolls over both entire missing rows, and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A, and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
                t  = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
                key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
                v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
                t  = c(10,70,20,70,10,70,20,70,10,70,20,70),
                valA = c('a','a',NA,'a',NA,'a','b','a','b','b',NA,'b'),
                valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
                key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backwards join), it rolls over all the points when a row cannot be found in B, but still includes points when the row exists but the data to be merged are NA:
B[A, , roll=-Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like to do the rolling join in such a way that it rolls over these NAs as well. For a single column, I can subset B to remove the NAs, then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the intermediate result after each added column and repeating.
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?

Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
for (col in bcols) A[, (col) :=
  # anti-join drops B's rows where this column is NA, then roll-join back onto A
  B[!.(as(NA, typeof(B[[col]]))), on = col][.SD, roll = -Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above generalizes this by building an NA of the same type as each column, since the NA in the anti-join must match the column's type.
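The typed-NA trick can be checked in isolation. Here is a minimal sketch (the table and column names are made up for illustration) showing that as() produces an NA of the column's own storage type, so the same anti-join idiom works for character, integer, or double columns alike:

```r
library(data.table)

# Toy table; valA is character here, but the idiom is type-agnostic
B <- data.table(id = 1:4, valA = c("a", NA, "b", NA))

typedNA <- as(NA, typeof(B[["valA"]]))  # NA_character_ for a character column
B[!.(typedNA), on = "valA"]             # anti-join: keeps only the non-NA rows
```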


How to replace single occurrences with the previous status

I have a data table like below:
table = data.table(x = c(1:15), y = c(1,1,1,3,1,1,2,1,2,2,3,3,3,3,3), z = c(1:15)*3)
I have to clean this data table: there are single occurrences, like a 3 in between the 1s and a 1 in between the 2s. It doesn't have to be a 3; any number which occurs only once should be replaced by the previous number.
table = data.table(x = c(1:15), y = c(1,1,1,1,1,1,2,2,2,2,3,3,3,3,3), z = c(1:15)*3)
This is the expected data table.
Any help is appreciated.
Here's one way:
library(data.table)
#Count number of rows for each group
table[, N := .N, rleid(y)]
#Change `y` value which have only one row
table[, y := replace(y, N ==1, NA)]
#Replace NA with last non-NA value
table[, y := zoo::na.locf(y)][, N := NULL]
table
# x y z
# 1: 1 1 3
# 2: 2 1 6
# 3: 3 1 9
# 4: 4 1 12
# 5: 5 1 15
# 6: 6 1 18
# 7: 7 2 21
# 8: 8 2 24
# 9: 9 2 27
#10: 10 2 30
#11: 11 3 33
#12: 12 3 36
#13: 13 3 39
#14: 14 3 42
#15: 15 3 45
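As a side note, recent data.table versions bundle nafill(), which can stand in for zoo::na.locf() here and drop the zoo dependency. A sketch of the same approach (assumes data.table >= 1.12.4; nafill() handles numeric columns only, which suffices for y here; the table is named tbl to avoid shadowing base::table):

```r
library(data.table)

tbl <- data.table(x = 1:15,
                  y = c(1,1,1,3,1,1,2,1,2,2,3,3,3,3,3),
                  z = (1:15) * 3)

tbl[, N := .N, by = rleid(y)]                  # run length of each streak
tbl[, y := nafill(replace(y, N == 1, NA),      # blank out the singletons...
                  type = "locf")][, N := NULL] # ...and carry the last value forward
```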
Here is a base R option
inds <- which(diff(c(head(table$y, 1), table$y)) * diff(c(table$y, tail(table$y, 1))) < 0)
table$y <- replace(table$y, inds, table$y[inds - 1])
such that
> table
x y z
1: 1 1 3
2: 2 1 6
3: 3 1 9
4: 4 1 12
5: 5 1 15
6: 6 1 18
7: 7 2 21
8: 8 2 24
9: 9 2 27
10: 10 2 30
11: 11 3 33
12: 12 3 36
13: 13 3 39
14: 14 3 42
15: 15 3 45

R data table combining lapply with other j arguments

I want to combine the result of lapply using .SD in j with further output columns in j. How can I do that in the same data table?
So far I'm creating two data tables (example_summary1, example_summary2) and merging them, but there should be a better way?
Maybe I don't fully understand the concept of .SD/.SDcols.
example <- data.table(id = rep(1:5, 3), numbers = rep(1:5, 3),
                      sample1 = sample(20, 15, replace = TRUE),
                      sample2 = sample(20, 15, replace = TRUE))
id numbers sample1 sample2
1: 1 1 17 18
2: 2 2 8 1
3: 3 3 17 12
4: 4 4 15 2
5: 5 5 14 18
6: 1 1 11 14
7: 2 2 12 12
8: 3 3 11 7
9: 4 4 16 13
10: 5 5 17 1
11: 1 1 10 3
12: 2 2 14 15
13: 3 3 13 3
14: 4 4 17 6
15: 5 5 1 5
example_summary1<-example[,lapply(.SD,mean),by=id,.SDcols=c("sample1","sample2")]
> example_summary1
id sample1 sample2
1: 1 12.66667 11.666667
2: 2 11.33333 9.333333
3: 3 13.66667 7.333333
4: 4 16.00000 7.000000
5: 5 10.66667 8.000000
example_summary2<-example[,.(example.sum=sum(numbers)),id]
> example_summary2
id example.sum
1: 1 3
2: 2 6
3: 3 9
4: 4 12
5: 5 15
This is the best you can do if you are using .SDcols:
example_summary1 <- example[, c(lapply(.SD, mean), .(example.sum = sum(numbers))),
by = id, .SDcols = c("sample1", "sample2", "numbers")][, numbers := NULL][]
If you don't include numbers in .SDcols it's not available in j.
Without .SDcols you can do this:
example_summary1 <- example[, c(lapply(.(sample1 = sample1, sample2 = sample2), mean),
.(example.sum = sum(numbers))),
by=id]
Or if you have a vector of column names:
cols <- c("sample1","sample2")
example_summary1 <- example[, c(lapply(mget(cols), mean),
.(example.sum = sum(numbers))),
by=id]
But I suspect that you don't get the same data.table optimizations then.
Finally, a data.table join is so fast that I would use your approach.
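For reference, the two-summary approach the answer endorses needs only a keyed join at the end, not a separate merge() call. A sketch using the question's column names (sample data regenerated since the original used random draws):

```r
library(data.table)

example <- data.table(id = rep(1:5, 3), numbers = rep(1:5, 3),
                      sample1 = sample(20, 15, replace = TRUE),
                      sample2 = sample(20, 15, replace = TRUE))

s1 <- example[, lapply(.SD, mean), by = id, .SDcols = c("sample1", "sample2")]
s2 <- example[, .(example.sum = sum(numbers)), by = id]

s1[s2, on = "id"]  # one join combines both summaries
```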

Persistent assignment in data.table with .SD

I'm struggling with .SD calls in data.table.
In particular, I'm trying to identify some logical characteristic within a grouping of data, and draw some identifying mark in another variable. Canonical application of .SD, right?
From FAQ 4.5, http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf, imagine the following table:
library(data.table) # 1.9.5
DT = data.table(a=rep(1:3,1:3),b=1:6,c=7:12)
DT[,{ mySD = copy(.SD)
mySD[1, b := 99L]
mySD },
by = a]
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
I've assigned these values to b (using the ':=' operator) and so when I re-call DT, I expect the same output. But, unexpectedly, I'm met with the original table:
DT
## a b c
## 1: 1 1 7
## 2: 2 2 8
## 3: 2 3 9
## 4: 3 4 10
## 5: 3 5 11
## 6: 3 6 12
Expected output was the original frame, with persistent modifications in 'b':
DT
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
Sure, I can copy this table into another one, but that doesn't seem consistent with the ethos.
DT2 <- copy(DT[,{ mySD = copy(.SD)
mySD[1, b := 99L]
mySD },
by = a])
DT2
## a b c
## 1: 1 99 7
## 2: 2 99 8
## 3: 2 3 9
## 4: 3 99 10
## 5: 3 5 11
## 6: 3 6 12
It feels like I'm missing something fundamental here.
The mentioned FAQ entry is just showing a workaround for modifying (a temporary copy of) .SD; it won't update your original data in place. A possible solution for your problem would be something like
DT[DT[, .I[1L], by = a]$V1, b := 99L]
DT
# a b c
# 1: 1 99 7
# 2: 2 99 8
# 3: 2 3 9
# 4: 3 99 10
# 5: 3 5 11
# 6: 3 6 12
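The .I idiom generalizes: .I exposes the original row numbers inside each group, so any per-group condition can be turned into a vector of target rows for a single by-reference update. A sketch on the same table (the which.max(c) condition is made up for illustration):

```r
library(data.table)

DT <- data.table(a = rep(1:3, 1:3), b = 1:6, c = 7:12)

# Collect one row number per group, then update those rows in one := call
rows <- DT[, .I[which.max(c)], by = a]$V1
DT[rows, b := 99L]
DT
```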

Use data.table in R to add multiple columns to a data.table with = with only one function call

This is a direct expansion of this question.
I have a dataset and I want to find all pairwise combinations of variable v within each combination of variables x and y:
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,1,6), v=1:18)
x y v
1: a 1 1
2: a 1 2
3: a 6 3
4: a 1 4
5: a 1 5
6: a 6 6
7: b 1 7
8: b 1 8
9: b 6 9
10: b 1 10
11: b 1 11
12: b 6 12
13: c 1 13
14: c 1 14
15: c 6 15
16: c 1 16
17: c 1 17
18: c 6 18
DT[, list(new1 = t(combn(sort(v), m = 2))[,1],
new2 = t(combn(sort(v), m = 2))[,2]),
by = list(x, y)]
x y new1 new2
1: a 1 1 2
2: a 1 1 4
3: a 1 1 5
4: a 1 2 4
5: a 1 2 5
6: a 1 4 5
7: a 6 3 6
8: b 1 7 8
9: b 1 7 10
10: b 1 7 11
11: b 1 8 10
12: b 1 8 11
13: b 1 10 11
14: b 6 9 12
15: c 1 13 14
16: c 1 13 16
17: c 1 13 17
18: c 1 14 16
19: c 1 14 17
20: c 1 16 17
21: c 6 15 18
The code does what I want, but calling combn twice makes it slow for larger datasets. My dataset has more than 3 million rows and more than 1.3 million combinations of x and y.
Any suggestions on how to do this faster?
I would prefer something like:
DT[, list(c("new1", "new2") = t(combn(sort(v), m = 2))), by = list(x, y)]
This should work:
DT[, {
  tmp <- combn(sort(v), m = 2)
  list(new1 = tmp[1, ], new2 = tmp[2, ])
}, by = list(x, y)]
The following also works. The trick is to convert the matrix into a data.table.
DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
If necessary, just rename the columns after
r2 <- DT[, data.table(t(combn(sort(v), m = 2))), by=list(x, y)]
setnames(r2, c("V1", "V2"), c("new1", "new2"))
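If the renaming should happen in the same call, one option is wrapping the converted matrix in setNames() inside j; a sketch (behaviour is the same as the two-step version above):

```r
library(data.table)

DT <- data.table(x = rep(c("a","b","c"), each = 6), y = c(1,1,6), v = 1:18)

# Name the combination columns directly in j instead of with setnames() after
res <- DT[, setNames(as.data.table(t(combn(sort(v), m = 2))),
                     c("new1", "new2")),
          by = .(x, y)]
```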

Add multiple columns to R data.table in one function call?

I have a function that returns two values in a list. Both values need to be added to a data.table in two new columns. Evaluation of the function is costly, so I would like to avoid having to compute the function twice. Here's the example:
library(data.table)
example(data.table)
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
Here's an example of my function. Remember I said it's costly to compute; on top of that, there is no way to deduce one return value from the other (unlike in the toy example below):
myfun <- function (y, v)
{
ret1 = y + v
ret2 = y - v
return(list(r1 = ret1, r2 = ret2))
}
Here's my way to add two columns in one statement. That one needs to call myfun twice, however:
DT[,new1:=myfun(y,v)$r1][,new2:=myfun(y,v)$r2]
x y v new1 new2
1: a 1 42 43 -41
2: a 3 42 45 -39
3: a 6 42 48 -36
4: b 1 4 5 -3
5: b 3 5 8 -2
6: b 6 6 12 0
7: c 1 7 8 -6
8: c 3 8 11 -5
9: c 6 9 15 -3
Any suggestions on how to do this? I could save r2 in a separate environment each time I call myfun, I just need a way to add two columns by reference at a time.
Since data.table v1.8.3, you can do this:
DT[, c("new1","new2") := myfun(y,v)]
Another option is storing the output of the function and adding the columns one-by-one:
z <- myfun(DT$y,DT$v)
head(DT[,new1:=z$r1][,new2:=z$r2])
# x y v new1 new2
# [1,] a 1 42 43 -41
# [2,] a 3 42 45 -39
# [3,] a 6 42 48 -36
# [4,] b 1 4 5 -3
# [5,] b 3 5 8 -2
# [6,] b 6 6 12 0
The accepted answer cannot be used when the function is not vectorized.
For example, in the following situation it will not work as intended:
myfun <- function (y, v, g)
{
ret1 = y + v + length(g)
ret2 = y - v + length(g)
return(list(r1 = ret1, r2 = ret2))
}
DT
# v y g
# 1: 1 1 1
# 2: 1 3 4,2
# 3: 1 6 9,8,6
DT[,c("new1","new2"):=myfun(y,v,g)]
DT
# v y g new1 new2
# 1: 1 1 1 5 3
# 2: 1 3 4,2 7 5
# 3: 1 6 9,8,6 10 8
It will always add the length of the whole column g, not the length of each vector in g.
A solution in that case is:
DT[, c("new1","new2") := data.table(t(mapply(myfun,y,v,g)))]
DT
# v y g new1 new2
# 1: 1 1 1 3 1
# 2: 1 3 4,2 6 4
# 3: 1 6 9,8,6 10 8
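An equivalent of the mapply line above that avoids the transpose is Map() plus rbindlist(); like the mapply version, this assumes myfun returns a two-element list of scalars for each row:

```r
library(data.table)

myfun <- function(y, v, g) {
  list(r1 = y + v + length(g), r2 = y - v + length(g))
}

DT <- data.table(v = 1, y = c(1, 3, 6), g = list(1, c(4, 2), c(9, 8, 6)))

# Map() calls myfun once per row; rbindlist() stacks the per-row result lists
DT[, c("new1", "new2") := rbindlist(Map(myfun, y, v, g))]
```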
To build on the previous answer, one can use lapply with a function that output more than one column. It's is then possible to use the function with more columns of the data.table.
myfun <- function(a,b){
res1 <- a+b
res2 <- a-b
list(res1,res2)
}
DT <- data.table(z=1:10,x=seq(3,30,3),t=seq(4,40,4))
DT
## DT
## z x t
## 1: 1 3 4
## 2: 2 6 8
## 3: 3 9 12
## 4: 4 12 16
## 5: 5 15 20
## 6: 6 18 24
## 7: 7 21 28
## 8: 8 24 32
## 9: 9 27 36
## 10: 10 30 40
col <- colnames(DT)
DT[, paste0(c('r1','r2'),rep(col,each=2)):=unlist(lapply(.SD,myfun,z),
recursive=FALSE),.SDcols=col]
## > DT
## z x t r1z r2z r1x r2x r1t r2t
## 1: 1 3 4 2 0 4 2 5 3
## 2: 2 6 8 4 0 8 4 10 6
## 3: 3 9 12 6 0 12 6 15 9
## 4: 4 12 16 8 0 16 8 20 12
## 5: 5 15 20 10 0 20 10 25 15
## 6: 6 18 24 12 0 24 12 30 18
## 7: 7 21 28 14 0 28 14 35 21
## 8: 8 24 32 16 0 32 16 40 24
## 9: 9 27 36 18 0 36 18 45 27
## 10: 10 30 40 20 0 40 20 50 30
In case a function returns a matrix, you can achieve the same behavior by wrapping it in a function that converts the matrix into a list first. I wonder if data.table should handle this automatically?
matrix2list <- function(mat) {
  unlist(apply(mat, 2, function(x) list(x)), recursive = FALSE)
}
DT <- data.table(A=1:10)
myfun <- function(x) matrix2list(cbind(x+1,x-1))
DT[,c("c","d"):=myfun(A)]
##>DT
## A c d
## 1: 1 2 0
## 2: 2 3 1
## 3: 3 4 2
## 4: 4 5 3
## 5: 5 6 4
## 6: 6 7 5
## 7: 7 8 6
## 8: 8 9 7
## 9: 9 10 8
## 10: 10 11 9
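On base R >= 3.6, asplit() can replace the hand-rolled matrix2list() helper; a sketch (the as.vector wrapper drops the 1-d array attribute that asplit() leaves on each column):

```r
library(data.table)

DT <- data.table(A = 1:10)

# asplit(m, 2) splits a matrix into a list of its columns
DT[, c("c", "d") := lapply(asplit(cbind(A + 1, A - 1), 2), as.vector)]
```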
Why not have your function take in a data frame and return a data frame directly?
myfun <- function (DT)
{
DT$ret1 = with(DT, y + v)
DT$ret2 = with(DT, y - v)
return(DT)
}
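A data.table-flavoured version of the same idea keeps the by-reference semantics: have the function take the table and add both columns with a single := call (a sketch; myfun_dt is a made-up name):

```r
library(data.table)

myfun_dt <- function(DT) {
  # one := call adds both columns by reference; the trailing [] prints on return
  DT[, `:=`(ret1 = y + v, ret2 = y - v)][]
}

DT <- data.table(y = c(1, 3, 6), v = c(42, 42, 42))
myfun_dt(DT)
```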
