I have a function that returns two values in a list. Both values need to be added to a data.table in two new columns. Evaluation of the function is costly, so I would like to avoid having to compute the function twice. Here's the example:
library(data.table)
example(data.table)
DT
x y v
1: a 1 42
2: a 3 42
3: a 6 42
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9
Here's an example of my function. Remember I said it's costly to compute; on top of that, there is no way to deduce one return value from the other or from the given values (unlike in the simplified example below):
myfun <- function(y, v) {
  ret1 <- y + v
  ret2 <- y - v
  return(list(r1 = ret1, r2 = ret2))
}
Here's my way to add two columns in one statement. That one needs to call myfun twice, however:
DT[,new1:=myfun(y,v)$r1][,new2:=myfun(y,v)$r2]
x y v new1 new2
1: a 1 42 43 -41
2: a 3 42 45 -39
3: a 6 42 48 -36
4: b 1 4 5 -3
5: b 3 5 8 -2
6: b 6 6 12 0
7: c 1 7 8 -6
8: c 3 8 11 -5
9: c 6 9 15 -3
Any suggestions on how to do this? I could save r2 in a separate environment each time I call myfun; I really just need a way to add two columns by reference at once.
Since data.table v1.8.3, you can do this:
DT[, c("new1","new2") := myfun(y,v)]
Another option is storing the output of the function and adding the columns one-by-one:
z <- myfun(DT$y,DT$v)
head(DT[,new1:=z$r1][,new2:=z$r2])
# x y v new1 new2
# [1,] a 1 42 43 -41
# [2,] a 3 42 45 -39
# [3,] a 6 42 48 -36
# [4,] b 1 4 5 -3
# [5,] b 3 5 8 -2
# [6,] b 6 6 12 0
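Since z already holds both results, those two := calls can also be collapsed into a single assignment (a small sketch; the list elements are matched to the new column names positionally):
z <- myfun(DT$y, DT$v)        # evaluate the costly function once
DT[, c("new1", "new2") := z]  # assign r1 -> new1, r2 -> new2 by reference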
The answer above cannot be used when the function is not vectorized.
For example, in the following situation it will not work as intended:
myfun <- function(y, v, g) {
  ret1 <- y + v + length(g)
  ret2 <- y - v + length(g)
  return(list(r1 = ret1, r2 = ret2))
}
DT
# v y g
# 1: 1 1 1
# 2: 1 3 4,2
# 3: 1 6 9,8,6
DT[,c("new1","new2"):=myfun(y,v,g)]
DT
# v y g new1 new2
# 1: 1 1 1 5 3
# 2: 1 3 4,2 7 5
# 3: 1 6 9,8,6 10 8
It will always add the length of the whole column g, not the length of each vector stored in g.
A solution in such a case is:
DT[, c("new1","new2") := data.table(t(mapply(myfun,y,v,g)))]
DT
# v y g new1 new2
# 1: 1 1 1 3 1
# 2: 1 3 4,2 6 4
# 3: 1 6 9,8,6 10 8
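An alternative spelling of the same per-row idea (a sketch, assuming the same non-vectorised myfun): call the function once per row with Map() and bind the per-row result lists back into two columns with rbindlist():
DT[, c("new1", "new2") := rbindlist(Map(myfun, y, v, g))]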
To build on the previous answer, one can use lapply with a function that outputs more than one column. It is then possible to apply the function to several columns of the data.table.
myfun <- function(a, b) {
  res1 <- a + b
  res2 <- a - b
  list(res1, res2)
}
DT <- data.table(z=1:10,x=seq(3,30,3),t=seq(4,40,4))
DT
## DT
## z x t
## 1: 1 3 4
## 2: 2 6 8
## 3: 3 9 12
## 4: 4 12 16
## 5: 5 15 20
## 6: 6 18 24
## 7: 7 21 28
## 8: 8 24 32
## 9: 9 27 36
## 10: 10 30 40
col <- colnames(DT)
DT[, paste0(c('r1','r2'),rep(col,each=2)):=unlist(lapply(.SD,myfun,z),
recursive=FALSE),.SDcols=col]
## > DT
## z x t r1z r2z r1x r2x r1t r2t
## 1: 1 3 4 2 0 4 2 5 3
## 2: 2 6 8 4 0 8 4 10 6
## 3: 3 9 12 6 0 12 6 15 9
## 4: 4 12 16 8 0 16 8 20 12
## 5: 5 15 20 10 0 20 10 25 15
## 6: 6 18 24 12 0 24 12 30 18
## 7: 7 21 28 14 0 28 14 35 21
## 8: 8 24 32 16 0 32 16 40 24
## 9: 9 27 36 18 0 36 18 45 27
## 10: 10 30 40 20 0 40 20 50 30
In case a function returns a matrix, you can achieve the same behaviour by wrapping it in a function that converts the matrix into a list first. I wonder whether data.table should handle this automatically?
matrix2list <- function(mat) {
  unlist(apply(mat, 2, function(x) list(x)), recursive = FALSE)
}
DT <- data.table(A=1:10)
myfun <- function(x) matrix2list(cbind(x+1,x-1))
DT[,c("c","d"):=myfun(A)]
##>DT
## A c d
## 1: 1 2 0
## 2: 2 3 1
## 3: 3 4 2
## 4: 4 5 3
## 5: 5 6 4
## 6: 6 7 5
## 7: 7 8 6
## 8: 8 9 7
## 9: 9 10 8
## 10: 10 11 9
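On R >= 3.6, the same helper can also be written with base asplit() (a sketch; matrix2list2 is just a hypothetical name for this variant):
matrix2list2 <- function(mat) lapply(asplit(mat, 2), as.vector)  # one plain vector per column
DT[, c("c", "d") := matrix2list2(cbind(A + 1, A - 1))]           # same result as above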
Why not have your function take in a data frame and return a data frame directly?
myfun <- function(DT) {
  DT$ret1 <- with(DT, y + v)
  DT$ret2 <- with(DT, y - v)
  return(DT)
}
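A data.table-flavoured variant of the same idea, adding the columns by reference instead of copying the table (a sketch; myfun_dt is a hypothetical name):
myfun_dt <- function(DT) {
  DT[, c("ret1", "ret2") := .(y + v, y - v)]  # modifies DT by reference
  DT[]
}
myfun_dt(DT)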
I have a data table like below:
table=data.table(x=c(1:15),y=c(1,1,1,3,1,1,2,1,2,2,3,3,3,3,3),z=c(1:15)*3)
I have to clean this data table where there are single occurrences like a 3 in between the 1s and a 1 in between the 2s. It doesn't have to be a 3 but any number which occurs only once should be replaced by the previous number.
table=data.table(x=c(1:15),y=c(1,1,1,1,1,1,2,2,2,2,3,3,3,3,3),z=c(1:15)*3)
This is the expected data table.
Any help is appreciated.
Here's one way :
library(data.table)
#Count number of rows for each group
table[, N := .N, rleid(y)]
#Change `y` value which have only one row
table[, y := replace(y, N ==1, NA)]
#Replace NA with last non-NA value
table[, y := zoo::na.locf(y)][, N := NULL]
table
# x y z
# 1: 1 1 3
# 2: 2 1 6
# 3: 3 1 9
# 4: 4 1 12
# 5: 5 1 15
# 6: 6 1 18
# 7: 7 2 21
# 8: 8 2 24
# 9: 9 2 27
#10: 10 2 30
#11: 11 3 33
#12: 12 3 36
#13: 13 3 39
#14: 14 3 42
#15: 15 3 45
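The same approach can stay entirely within data.table by using nafill() for the last-observation-carried-forward step instead of zoo::na.locf() (a sketch; nafill() requires y to be numeric, which it is here):
table[, N := .N, rleid(y)]
table[, y := replace(y, N == 1, NA)]
table[, y := nafill(y, type = "locf")][, N := NULL]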
Here is a base R option
# positions where y changes away from the previous value and back again in the
# opposite direction, i.e. isolated single occurrences
inds <- which(diff(c(head(table$y, 1), table$y)) * diff(c(table$y, tail(table$y, 1))) < 0)
# replace those single occurrences with the preceding value
table$y <- replace(table$y, inds, table$y[inds - 1])
such that
> table
x y z
1: 1 1 3
2: 2 1 6
3: 3 1 9
4: 4 1 12
5: 5 1 15
6: 6 1 18
7: 7 2 21
8: 8 2 24
9: 9 2 27
10: 10 2 30
11: 11 3 33
12: 12 3 36
13: 13 3 39
14: 14 3 42
15: 15 3 45
I have data which is unique at one variable Y. Another variable Z tells me how many people are in each Y. I want to create groups of 45 from these Y and Z: whenever the running total of Z reaches 45, one group is closed and the code moves on to create the next group.
My data looks something like this
ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2
What I want is something like this:
ID X Y Z CumSum Group
1 A A 1 1 1
2 A B 5 6 1
3 A C 2 8 1
4 A D 42 50 1
5 A E 10 10 2
6 A F 2 12 2
7 A G 0 12 2
8 A H 3 15 2
9 A I 0 15 2
10 A J 8 23 2
11 A K 19 42 2
12 A L 3 45 2
13 A M 1 1 3
14 A N 1 2 3
15 A O 2 4 3
16 A P 0 4 3
17 A Q 1 5 3
18 A R 2 7 3
Please let me know how I can achieve this with R.
EDIT: I extended the minimal reproducible example for more clarity.
EDIT 2: I have one extra question on this topic. What if the variable X, which is only A right now, also changes? For example, it could be B for a while and then become C. How can I prevent the code from generating groups that span two categories of X? For example, if Group = 3, how can I make sure that group 3 does not contain rows from both category A and category B?
A function for this is available in the MESS-package...
library(MESS)
library(data.table)
DT[, Group := MESS::cumsumbinning(Z, 50)][, Cumsum := cumsum(Z), by = .(Group)][]
output
ID X Y Z Group Cumsum
1: 1 A A 1 1 1
2: 2 A B 5 1 6
3: 3 A C 2 1 8
4: 4 A D 42 1 50
5: 5 A E 10 2 10
6: 6 A F 2 2 12
7: 7 A G 0 2 12
8: 8 A H 3 2 15
9: 9 A I 0 2 15
10: 10 A J 8 2 23
11: 11 A K 19 2 42
12: 12 A L 3 2 45
sample data
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3")
Define Accum, which adds x to acc, resetting to x whenever acc is 45 or more. Use Reduce to apply it to Z, giving the cumulative-sum column. The values greater than or equal to 45 mark the group ends, so build group ids by cumulatively summing those end markers from the end backwards towards the beginning, and finally shift the ids so that they start from 1. The transform() call below applies this per ID with ave(). We run it with the input in the Note at the end, which duplicates the last line several times so that 3 groups can be shown. No packages are used.
Accum <- function(acc, x) if (acc < 45) acc + x else x
applyAccum <- function(x) Reduce(Accum, x, accumulate = TRUE)
cumsumr <- function(x) rev(cumsum(rev(x))) # reverse cumsum
GroupNo <- function(x) {
  y <- cumsumr(x >= 45)
  max(y) - y + 1
}
transform(transform(DF, Cumsum = ave(Z, ID, FUN = applyAccum)),
Group = ave(Cumsum, ID, FUN = GroupNo))
giving:
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
13 12 A L 3 3 3
14 12 A L 3 6 3
Note
The input in reproducible form:
Lines <- "ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
12 A L 3
12 A L 3"
DF <- read.table(text = Lines, as.is = TRUE, header = TRUE)
One tidyverse possibility could be:
library(dplyr)
library(purrr)  # for accumulate()

df %>%
  mutate(Cumsum = accumulate(Z, ~ if_else(.x >= 45, .y, .x + .y)),
         Group = cumsum(Cumsum >= 45),
         Group = if_else(Group > lag(Group, default = first(Group)), lag(Group), Group) + 1)
ID X Y Z Cumsum Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
Not a pretty solution, but functional.
library(dplyr)  # lag() below is the vector-shifting dplyr::lag(), not base stats::lag()

df$Group <- 0
group <- 1
while (df$Group[nrow(df)] == 0) {
  df$ww[df$Group == 0] <- cumsum(df$Z[df$Group == 0])
  df$Group[df$Group == 0 & (lag(df$ww) <= 45 | is.na(lag(df$ww)) | lag(df$Group != 0))] <- group
  group <- group + 1
}
df
ID X Y Z ww Group
1 1 A A 1 1 1
2 2 A B 5 6 1
3 3 A C 2 8 1
4 4 A D 42 50 1
5 5 A E 10 10 2
6 6 A F 2 12 2
7 7 A G 0 12 2
8 8 A H 3 15 2
9 9 A I 0 15 2
10 10 A J 8 23 2
11 11 A K 19 42 2
12 12 A L 3 45 2
OK, yeah, @tmfmnk's solution is vastly better:
Unit: milliseconds
expr min lq mean median uq max neval
tm 2.224536 2.805771 6.76661 3.221449 3.990778 303.7623 100
iod 19.198391 22.294222 30.17730 25.765792 35.768616 110.2062 100
Or using data.table:
library(data.table)
n <- 45L
DT[, cs := Reduce(function(tot, z) if (tot+z > n) z else tot+z, Z, accumulate=TRUE)][,
Group := .GRP, by=cumsum(c(1L, diff(cs))<0L)]
output:
ID X Y Z cs Group
1: 1 A A 1 1 1
2: 2 A B 5 6 1
3: 3 A C 2 8 1
4: 4 A D 42 42 1
5: 5 A E 10 10 2
6: 6 A F 2 12 2
7: 7 A G 0 12 2
8: 8 A H 3 15 2
9: 9 A I 0 15 2
10: 10 A J 8 23 2
11: 11 A K 19 42 2
12: 12 A L 3 45 2
13: 13 A M 1 1 3
14: 14 A N 1 2 3
15: 15 A O 2 4 3
16: 16 A P 0 4 3
17: 17 A Q 1 5 3
18: 18 A R 2 7 3
data:
library(data.table)
DT <- fread("ID X Y Z
1 A A 1
2 A B 5
3 A C 2
4 A D 42
5 A E 10
6 A F 2
7 A G 0
8 A H 3
9 A I 0
10 A J 8
11 A K 19
12 A L 3
13 A M 1
14 A N 1
15 A O 2
16 A P 0
17 A Q 1
18 A R 2")
I have data
library(data.table)
set.seed(42)
t <- data.table(time=1:1000, value=runif(100,0,1))
p <- data.table(id=1:10, cut=sample(1:100,5))
vals <- 1:5
> head(t)
time value
1: 1 0.9148060
2: 2 0.9370754
3: 3 0.2861395
4: 4 0.8304476
5: 5 0.6417455
6: 6 0.5190959
> head(p)
id cut
1: 1 63
2: 2 22
3: 3 99
4: 4 38
5: 5 91
6: 6 63
> vals
[1] 1 2 3 4 5
where t gives some vector of values associated with time points, and p gives for each person a cutoff in time.
I would like to get for each person the time units it takes to accumulate each of the values in vals.
My approach now is to use a for-loop that computes for each person a temporary vector of cumulative sums, starting at its specific cutoff in time. Next, I use findInterval() to obtain the positions at which cumsum reaches each of the levels in vals.
out <- matrix(NA, nrow = nrow(p), ncol = length(vals)); colnames(out) <- vals
for (i in 1:nrow(p)) {
  temp <- cumsum(t$value[t$time > p$cut[i]]); temp <- temp[!is.na(temp)]
  out[i, ] <- findInterval(vals, temp)
}
which should yield
1 2 3 4 5
[1,] 1 4 5 9 12
[2,] 1 2 5 6 7
[3,] 1 2 4 5 7
[4,] 1 3 5 7 8
[5,] 2 3 5 7 8
[6,] 1 4 5 9 12
[7,] 1 2 5 6 7
[8,] 1 2 4 5 7
[9,] 1 3 5 7 8
[10,] 2 3 5 7 8
This is of course very inefficient and doesn't do justice to the power of R. Is there a way of speeding this up?
I'd do
# precompute cumsum on full table
t[, cs := cumsum(value)]
# compute one time per unique cut value, not per id
cuts = unique(p[, .(t_cut = cut)])
# look up value at cut time
cuts[t, on=.(t_cut = time), v_cut := i.cs]
# look up time at every cut value combo
cutres = cuts[, .(pt = vals + v_cut), by=t_cut][, .(
t_cut,
v = vals,
t_plus = t[.SD, on=.(cs = pt), roll=TRUE, x.time] - t_cut
)]
which gives
t_cut v t_plus
1: 63 1 1
2: 63 2 4
3: 63 3 5
4: 63 4 9
5: 63 5 12
6: 22 1 1
7: 22 2 2
8: 22 3 5
9: 22 4 6
10: 22 5 7
11: 99 1 1
12: 99 2 2
13: 99 3 4
14: 99 4 5
15: 99 5 7
16: 38 1 1
17: 38 2 3
18: 38 3 5
19: 38 4 7
20: 38 5 8
21: 91 1 2
22: 91 2 3
23: 91 3 5
24: 91 4 7
25: 91 5 8
t_cut v t_plus
If you want to map this back to id and get it in an id x vals table...
cutres[p, on=.(t_cut = cut), allow.cartesian=TRUE,
dcast(.SD, id ~ v, value.var = "t_plus")]
id 1 2 3 4 5
1: 1 1 4 5 9 12
2: 2 1 2 5 6 7
3: 3 1 2 4 5 7
4: 4 1 3 5 7 8
5: 5 2 3 5 7 8
6: 6 1 4 5 9 12
7: 7 1 2 5 6 7
8: 8 1 2 4 5 7
9: 9 1 3 5 7 8
10: 10 2 3 5 7 8
(Alternately, the key part can be done like t_plus = t[.SD, on=.(cs = pt), roll=TRUE, which=TRUE] - t_cut since t$time is the row number.)
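Spelled out, that alternative only changes the t_plus line (a sketch; it relies on t$time being exactly 1..nrow(t)):
cutres = cuts[, .(pt = vals + v_cut), by=t_cut][, .(
  t_cut,
  v = vals,
  t_plus = t[.SD, on=.(cs = pt), roll=TRUE, which=TRUE] - t_cut
)]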
I have 2 data.tables:
a.id <- c("a","a","a","b","b","c","c","c","c")
b.id <- c(1,2,3,4,5,1,3,4,5)
x <- seq(1:9)
dt1 <- data.table(a.id,b.id,x)
and
rp <- c("r","s")
t <- rep(rp, each=5)
b.id <- rep(1:5, 2)
y <- sample.int(50, 10)
dt2 <- data.table(t, b.id, y)
For each a.id of dt1, I would like to full-join each t of dt2, adding each t as a new column of dt1 and naming the column after the value of t. As this is a full join, the b.id values missing from dt1 get NA in x.
Here is the desired output (for r and s, these are random values):
a.id b.id x r s
a 1 1 14 40
a 2 2 42 25
a 3 3 32 11
a 4 NA 33 3
a 5 NA 21 1
b 1 NA 14 40
b 2 NA 42 25
b 3 NA 32 11
b 4 4 33 3
b 5 5 21 1
c 1 6 14 40
c 2 NA 42 25
c 3 7 32 11
c 4 8 33 3
c 5 9 21 1
I have tried something like:
dt1[, merge(.SD, dt2, by = "b.id", all = TRUE), by = a.id]
But it does not work.
I would appreciate your help on that problem.
Thanks for your time.
Try something like:
f<-dcast(dt2,b.id~t)
dt1[f[rep(1:nrow(f),uniqueN(dt1$a.id)),
c(.SD,list(a.id=rep(unique(dt1$a.id),each=nrow(f))))],on=c("a.id","b.id")]
# a.id b.id x r s
# 1: a 1 1 40 28
# 2: a 2 2 4 17
# 3: a 3 3 11 13
# 4: a 4 NA 49 42
# 5: a 5 NA 29 37
# 6: b 1 NA 40 28
# 7: b 2 NA 4 17
# 8: b 3 NA 11 13
# 9: b 4 4 49 42
#10: b 5 5 29 37
#11: c 1 6 40 28
#12: c 2 NA 4 17
#13: c 3 7 11 13
#14: c 4 8 49 42
#15: c 5 9 29 37
The result differs since a seed had not been set.
With a cross join one can do:
dcast(dt2, b.id~t, value.var = "y")[
dt1[CJ(a.id=a.id, b.id=b.id, unique=TRUE), on=.(a.id, b.id)], on="b.id"]
If not all possible values of b.id are in dt1$b.id, then the CJ() part should look like:
CJ(a.id=a.id, b.id=dt2$b.id, unique=TRUE)
Here is another variant:
dt1[dcast(dt2, b.id~t, value.var = "y")[
CJ(a.id=dt1$a.id, b.id=dt2$b.id, unique=TRUE), on=.(b.id)], on=.(a.id, b.id)]
# a.id b.id x r s
# 1: a 1 1 46 24
# 2: a 2 2 50 33
# 3: a 3 3 14 6
# 4: a 4 NA 40 28
# 5: a 5 NA 30 29
# 6: b 1 NA 46 24
# 7: b 2 NA 50 33
# 8: b 3 NA 14 6
# 9: b 4 4 40 28
# 10: b 5 5 30 29
# 11: c 1 6 46 24
# 12: c 2 NA 50 33
# 13: c 3 7 14 6
# 14: c 4 8 40 28
# 15: c 5 9 30 29
data:
library("data.table")
set.seed(42)
dt1 <- data.table(a.id=rep(c("a", "b", "c"), c(3,2,4)), b.id=c(1:5,1,3,4,5), x=1:9)
dt2 <- data.table(t=rep(c("r","s"), each=5), b.id=1:5, y=sample.int(50, 10))
I am trying to do a rolling join in data.table that brings in multiple columns, but rolls over both entire missing rows, and individual NAs in particular columns, even when the row is present. By way of example, I have two tables, A, and B:
library(data.table)
A <- data.table(v1 = c(1,1,1,1,1,2,2,2,2,3,3,3,3),
v2 = c(6,6,6,4,4,6,4,4,4,6,4,4,4),
t = c(10,20,30,60,60,10,40,50,60,20,40,50,60),
key = c("v1", "v2", "t"))
B <- data.table(v1 = c(1,1,1,1,2,2,2,2,3,3,3,3),
v2 = c(4,4,6,6,4,4,6,6,4,4,6,6),
t = c(10,70,20,70,10,70,20,70,10,70,20,70),
valA = c('a','a',NA,'a',NA,'a','b','a', 'b','b',NA,'b'),
valB = c(NA,'q','q','q','p','p',NA,'p',NA,'q',NA,'q'),
key = c("v1", "v2", "t"))
B
## v1 v2 t valA valB
## 1: 1 4 10 a NA
## 2: 1 4 70 a q
## 3: 1 6 20 NA q
## 4: 1 6 70 a q
## 5: 2 4 10 NA p
## 6: 2 4 70 a p
## 7: 2 6 20 b NA
## 8: 2 6 70 a p
## 9: 3 4 10 b NA
## 10: 3 4 70 b q
## 11: 3 6 20 NA NA
## 12: 3 6 70 b q
If I do a rolling join (in this case a backwards join), it rolls over all the points when a row cannot be found in B, but still includes points when the row exists but the data to be merged are NA:
B[A, , roll=-Inf]
## v1 v2 t valA valB
## 1: 1 4 60 a q
## 2: 1 4 60 a q
## 3: 1 6 10 NA q
## 4: 1 6 20 NA q
## 5: 1 6 30 a q
## 6: 2 4 40 a p
## 7: 2 4 50 a p
## 8: 2 4 60 a p
## 9: 2 6 10 b NA
## 10: 3 4 40 b q
## 11: 3 4 50 b q
## 12: 3 4 60 b q
## 13: 3 6 20 NA NA
I would like to rolling join in such a way that it rolls over these NAs as well. For a single column, I can subset B to remove the NAs, then roll with A:
C <- B[!is.na(valA), .(v1, v2, t, valA)][A, roll=-Inf]
C
## v1 v2 t valA
## 1: 1 4 60 a
## 2: 1 4 60 a
## 3: 1 6 10 a
## 4: 1 6 20 a
## 5: 1 6 30 a
## 6: 2 4 40 a
## 7: 2 4 50 a
## 8: 2 4 60 a
## 9: 2 6 10 b
## 10: 3 4 40 b
## 11: 3 4 50 b
## 12: 3 4 60 b
## 13: 3 6 20 b
But for multiple columns, I have to do this sequentially, storing the value for each added column and then repeating.
B[!is.na(valB), .(v1, v2, t, valB)][C, roll=-Inf]
## v1 v2 t valB valA
## 1: 1 4 60 q a
## 2: 1 4 60 q a
## 3: 1 6 10 q a
## 4: 1 6 20 q a
## 5: 1 6 30 q a
## 6: 2 4 40 p a
## 7: 2 4 50 p a
## 8: 2 4 60 p a
## 9: 2 6 10 p b
## 10: 3 4 40 q b
## 11: 3 4 50 q b
## 12: 3 4 60 q b
## 13: 3 6 20 q b
The end result above is the desired output, but for multiple columns it quickly becomes unwieldy. Is there a better solution?
Joins are about matching up rows. If you want to match rows multiple ways, you'll need multiple joins.
I'd use a loop, but add columns to A (rather than creating new tables C, D, ... following each join):
k = key(A)
bcols = setdiff(names(B), k)
# for each value column of B: drop the rows where that column is NA (an anti-join
# on a type-matched NA), then NOCB-roll join onto A and assign the column by reference
for (col in bcols) A[, (col) :=
  B[!.(as(NA, typeof(B[[col]]))), on = col][.SD, roll = -Inf, ..col]
][]
A
v1 v2 t valA valB
1: 1 4 60 a q
2: 1 4 60 a q
3: 1 6 10 a q
4: 1 6 20 a q
5: 1 6 30 a q
6: 2 4 40 a p
7: 2 4 50 a p
8: 2 4 60 a p
9: 2 6 10 b p
10: 3 4 40 b q
11: 3 4 50 b q
12: 3 4 60 b q
13: 3 6 20 b q
B[!.(NA_character_), on="valA"] is an anti-join that drops rows with NAs in valA. The code above attempts to generalize this (since the NA needs to match the type of the column).
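Written out per column (less general, but perhaps easier to read), the same idea is roughly (a sketch):
A[, valA := B[!is.na(valA)][.SD, on = .(v1, v2, t), roll = -Inf, x.valA]]
A[, valB := B[!is.na(valB)][.SD, on = .(v1, v2, t), roll = -Inf, x.valB]]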