data.table assigning with `sapply` in a merge

I have some data.tables like so:
x <- data.table(id=rep(1:3, 2), a=1:6)
y <- data.table(id=1:3, b=2:4)
I can merge them like this:
setkey(x, id)
setkey(y, id)
x[y]
id a b
1: 1 1 2
2: 1 4 2
3: 2 2 3
4: 2 5 3
5: 3 3 4
6: 3 6 4
Now I want to create a new column in x, based on a and b, which is the sum of a and b.
I can do this with:
x[y, val:=a + b]
However, now suppose for some reason that the '+' operator is not vectorised. How can I store a row-wise calculation into x where x[y] is needed for the calculation? Also, assume I cannot use mapply (because for my actual problem, mapply is not suited to the function).
I'm trying to use sapply like so to add in a row-wise manner:
x[y, sapply(1:nrow(x), function (i) a[i] + b[i])]
However this returns the incorrect result:
id V1
1: 1 3
2: 1 NA
3: 1 NA
4: 1 NA
5: 1 NA
6: 1 NA
7: 2 5
8: 2 NA
9: 2 NA
10: 2 NA
11: 2 NA
12: 2 NA
13: 3 7
14: 3 NA
15: 3 NA
16: 3 NA
17: 3 NA
18: 3 NA
If I do this it works:
x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]
# [1] 3 6 5 8 7 10
BUT when I try and assign this to a column in x, it is not stored (makes sense because it looks like I'm trying to save the new column into x[y]).
x[y][, val:=sapply(1:nrow(x), function (i) a[i] + b[i])]
Is there any way to do the above but save the output into x[, val]?
Is this how I am supposed to do it, or is there a more data.table-y way?
x[, val:=x[y][, sapply(1:nrow(x), function (i) a[i] + b[i])]]

You are doing by-without-by without knowing it (see the description from the help below):
Advanced: Aggregation for a subset of known groups is particularly
efficient when passing those groups in i. When i is a data.table,
DT[i,j] evaluates j for each row of i. We call this by without by or
grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
This means that j is evaluated for each row of i (cycling through y one row at a time), so if you run sapply(1:nrow(x), ...) in j it will create a vector of length nrow(x) for each group, which is not what you want.
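To make that concrete, here is a small sketch of "grouping by i" using the x and y from the question (note: in current data.table versions this per-row-of-i grouping has to be asked for explicitly with by = .EACHI; older versions, as in this thread, did it implicitly):
x[y, .(n_matched = .N, sum_a = sum(a)), by = .EACHI]
#    id n_matched sum_a
# 1:  1         2     5
# 2:  2         2     7
# 3:  3         2     9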
So your second option is definitely a valid approach (as it is one of the recommended approaches for doing this)
Otherwise you could use .N (when grouping by i, .N is the number of rows in x matched to, for each row of i) instead of nrow(x), but you will have to think about the length of your objects and how your function is to be vectorised.
Take this as an example
x[y, {browser(); a+b}]
Called from: `[.data.table`(x, y, {
browser()
a + b
})
Browse[1]> a
[1] 1 4
Browse[1]> b
[1] 2
Browse[1]> .N
[1] 2
a has length two because that value of the key matches 2 rows in x; b only has length 1 because it has length 1 in y.
I think the best approach is to correctly Vectorize your function (which is hard to give advice on without more of an example).
Another approach would be to replicate b to the length of a, e.g.
x[y, val := {
  bl <- rep_len(b, .N)
  sapply(seq_len(.N), function(i) a[i] + bl[i])
}]
x
id a val
1: 1 1 3
2: 1 4 6
3: 2 2 5
4: 2 5 8
5: 3 3 7
6: 3 6 10
Or, if you know that y has unique rows for each value of id, then you don't need to index any columns from it:
x[y, val2 := sapply(seq_len(.N), function(i) a[i] + b)]
# an alternative would be to use sapply on a (avoid creating another vector)
x[y, val3 := sapply(a, function(ai) ai + b)]
x
# id a val val2 val3
# 1: 1 1 3 3 3
# 2: 1 4 6 6 6
# 3: 2 2 5 5 5
# 4: 2 5 8 8 8
# 5: 3 3 7 7 7
# 6: 3 6 10 10 10

Related

Replace Inf/-Inf values from vector of variable names, with values from similarly named vector of variables (substr/grep/gsub)

I'm currently stumped trying to write some efficient code. I have a vector of variables (med.vars) that were transformed by the in-year global median. Sometimes the global median is 0, which creates Inf/-Inf values that I would like to replace with the pre-transformed variable value (vars). I can't figure out how to do this efficiently, either with some form of `dat[, := lapply(.SD), .SDcols = med.vars]` call or with a for loop using get(), noquote(), etc.
dat<-data.table(v1=c(2,10,7),v2=c(5,6,5),v3=c(10,15,20),v1.med=c(1,Inf,5),v2.med=c(5,6,5),v3.med=c(-Inf,2,3))
vars<-c("v1","v2","v3")
med.vars<-c("v1.med","v2.med","v3.med")
v1 v2 v3 v1.med v2.med v3.med
1: 2 5 10 1 5 -Inf
2: 10 6 15 Inf 6 2
3: 7 5 20 5 5 3
In reality these vectors are 50+ vars I pull from names(dat) with grep() and use gsub(".med","",med.vars) to create the second vector of pre-transformed variable names.
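Roughly, the two name vectors are built along these lines (just a sketch; the exact ".med" suffix pattern is assumed, and fixed = TRUE is added so the dot is not treated as a regex wildcard):
med.vars <- grep("\\.med$", names(dat), value = TRUE)  # "v1.med" "v2.med" "v3.med"
vars     <- gsub(".med", "", med.vars, fixed = TRUE)   # "v1" "v2" "v3"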
I would like to efficiently perform
dat[v1.med==Inf | v1.med==-Inf, v1.med:=v1]
dat[v3.med==Inf | v3.med==-Inf, v3.med:=v3]
for each element, med.vars[i], and its corresponding element, vars[i] such that the resulting data.table is:
v1 v2 v3 v1.med v2.med v3.med
1: 2 5 10 1 5 -10
2: 10 6 15 10 6 2
3: 7 5 20 5 5 3
Thank you for your time
OP mentions efficiency, so maybe move to long form. Then the standard syntax can be used:
DT = melt(dat, meas=list(vars, med.vars), value.name=c("var", "med"))
DT[!is.finite(med), med := sign(med)*var]
variable var med
1: 1 2 1
2: 1 10 10
3: 1 7 5
4: 2 5 5
5: 2 6 6
6: 2 5 5
7: 3 10 -10
8: 3 15 2
9: 3 20 3
As these are corresponding columns, we can make use of Map
dat[, (med.vars) := Map(function(x, y) ifelse(is.finite(y), y, x * sign(y)),
                        .SD[, vars, with = FALSE],
                        .SD[, med.vars, with = FALSE])]
dat
# v1 v2 v3 v1.med v2.med v3.med
#1: 2 5 10 1 5 -10
#2: 10 6 15 10 6 2
#3: 7 5 20 5 5 3
Or another option is to use set(), looping through the columns with a for loop:
for (j in seq_along(vars)) {
  i1 <- !is.finite(dat[[med.vars[j]]])
  v1 <- dat[[vars[j]]]
  v2 <- dat[[med.vars[j]]]
  set(dat, i = which(i1), j = med.vars[j], value = sign(v2[i1]) * v1[i1])
}
This can also be done in base R (on a data.frame)
i1 <- !sapply(dat[med.vars], is.finite)
dat[med.vars][i1] <- dat[vars][i1] * sign(dat[med.vars][i1])

Summing the number of times a value appears in either of 2 columns

I have a large data set - around 32mil rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Where N tells that Tel=1 appears 1, Tel=2 appears 1, Tel=3,4 and 5 each appear 2 times.
We can do a melt and match
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
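Broken down step by step (a sketch run on the original three columns, before N is added; nbins is supplied for safety in case the largest Tel never appears):
long <- melt(dt[, .(Tel, Origin, Destination)], id.var = "Tel")  # stack Origin and Destination into 'value'
pos  <- match(long$value, dt$Tel)                                # position of each value in Tel (NA if absent)
tabulate(pos, nbins = nrow(dt))                                  # count hits per Tel position
# [1] 1 1 2 2 2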
Or another option is to loop through columns 2 and 3, use %in% to check whether the values in 'Tel' are present, get the row-wise sum of the logical vectors with Reduce and `+`, and then assign (:=) the values to 'N':
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than akrun's answer above, but can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names = FALSE))),
                 c("Tel", "N"))
# turn the variables into integers (Tel is the name of the table above, and thus character)
temp <- temp[, lapply(temp, as.integer)]
Now, join the original table on
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))

Lowest pair sequential combination data table

I have a set with two columns. The rows are pairs of values (a,b).
require(data.table)
dt<-data.table(a=c(1,11,11,2,7,5,6), b = c(2,9,8,6,5,3,3))
I want to assign to each pair of values the lowest number. BUT if one of the values appears again in a later row, it must be compared again with the new pair and the lowest value from its history selected. The result must be this one:
res.dt<-data.table(a=c(1,11,11,2,7,5,6), b = c(2,9,8,6,5,3,3), res=c(1,9,8,1,5,3,1))
a b res
1: 1 2 1
2: 11 9 9
3: 11 8 8
4: 2 6 1
5: 7 5 5
6: 5 3 3
7: 6 3 1
To state the problem differently: For each row i, we need to iteratively update res with the smallest value in rows j <= i where (a_i,b_i) and (a_j,b_j) have a non-empty intersection.
We can do this efficiently with non-equi joins in data.table (v>=1.9.8), but since this feature only allows single-element comparisons (>,>=,==,<=, or <), we need to find intersections by comparing (a_i,a_j), (a_i,b_j), (b_i,a_j), (b_i,b_j) separately. (There is an intersection if at least one of these pairs contains identical elements.) Doing this iteratively accounts for the entire history, and we can stop when the result converges:
dt[, `:=`(idx = .I, res = pmin(a, b), prev_res = NA)]
while (dt[, !identical(res, prev_res)]) {
  dt[, prev_res := res]
  # Use non-equi joins to update 'res' for intersecting pairs downstream
  dt[dt[, .(i.a = a, i.res = res, i = .I)], on = .(a == i.a, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.a = a, i.res = res, i = .I)], on = .(b == i.a, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.b = b, i.res = res, i = .I)], on = .(a == i.b, idx > i), res := pmin(res, i.res)]
  dt[dt[, .(i.b = b, i.res = res, i = .I)], on = .(b == i.b, idx > i), res := pmin(res, i.res)]
}
The result:
> dt[, .(a,b,res)]
# a b res
# 1: 1 2 1
# 2: 11 9 9
# 3: 11 8 8
# 4: 2 6 1
# 5: 7 5 5
# 6: 5 3 3
# 7: 6 3 1
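The helper columns created at the start (idx and prev_res) are still attached to dt at this point; if they are not wanted they can be dropped afterwards:
dt[, c("idx", "prev_res") := NULL]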

R data.table filtering on group size

I am trying to find all the records in my data.table for which there is more than one row with value v in field f.
For instance, we can use this data:
dt <- data.table(f1=c(1,2,3,4,5), f2=c(1,1,2,3,3))
If looking for that property in field f2, we'd get (note the absence of the (3,2) tuple)
f1 f2
1: 1 1
2: 2 1
3: 4 3
4: 5 3
My first guess was dt[.N>2,list(.N),by=f2], but that actually keeps entries with .N==1.
dt[.N>2,list(.N),by=f2]
f2 N
1: 1 2
2: 2 1
3: 3 2
The other easy guess, dt[duplicated(dt$f2)], doesn't do the trick, as it keeps one of the 'duplicates' out of the results.
dt[duplicated(dt$f2)]
f1 f2
1: 2 1
2: 5 3
So how can I get this done?
Edited to add example
The question is not clear. Based on the title, it looks like we want to extract all groups with number of rows (.N) greater than 1.
DT[, if(.N>1) .SD, by=f]
But the value v in field f is making it confusing.
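Applied to the example data in the question, that gives (just as a check; the grouping column comes first, so the column order differs from the desired output):
dt[, if (.N > 1) .SD, by = f2]
#    f2 f1
# 1:  1  1
# 2:  1  2
# 3:  3  4
# 4:  3  5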
If I understand what you're after correctly, you'll need to do some compound queries:
library(data.table)
DT <- data.table(v1 = 1:10, f = c(rep(1:3, 3), 4))
DT[, N := .N, f][N > 2][, N := NULL][]
# v1 f
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 1
# 5: 5 2
# 6: 6 3
# 7: 7 1
# 8: 8 2
# 9: 9 3
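The same chain works on the data from the question, where "more than one row" means N > 1 rather than N > 2 (a quick sketch; note the first step also adds N to dt itself by reference):
dt[, N := .N, f2][N > 1][, N := NULL][]
#    f1 f2
# 1:  1  1
# 2:  2  1
# 3:  4  3
# 4:  5  3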

Update a data.table based on another data table

I want to update the columns in an old data.table based on a new data.table only when the value is not NA.
DT_old = data.table(x=rep(c("a","b","c")), y=c(1,3,6), v=1:3, l=c(1,1,1))
DT_old
x y v l
1: a 1 1 1
2: b 3 2 1
3: c 6 3 1
DT_new = data.table(x=rep(c("b","c",'d')), y=c(9,6,10), v=c(2,NA,10), z=c(9,9,9))
DT_new
x y v z
1: b 9 2 9
2: c 6 NA 9
3: d 10 10 9
I want the output to be
x y v z
1: b 9 2 9
2: c 6 3 9
3: d 10 10 9
4: a 1 1 NA
Currently I am merging the two data.tables, then going through each column and replacing the NAs in the new data.table:
DT_merged <- merge(DT_new, DT_old, all=TRUE, by='x')
DT_merged
x y.x v.x z y.y v.y l
1: a NA NA NA 1 1 1
2: b 9 2 9 3 2 1
3: c 6 NA 9 6 3 1
4: d 10 10 9 NA NA NA
DT_merged[is.na(y.x), y.x := y.y]
DT_merged[is.na(v.x), v.x := v.y]
DT_merged = DT_merged[, list(x, y = y.x, v = v.x, z = z)]
Is there a better way to do the above?
Here's how I would approach this. First, I will expand DT_new according to the unique combination of values in the x columns of both tables, using a binary join
res <- setkey(DT_new, x)[unique(c(x, DT_old$x))]
res
# x y v z
# 1: b 9 2 9
# 2: c 6 NA 9
# 3: d 10 10 9
# 4: a NA NA NA
Then, I will update the two columns in res by reference using another binary join
setkey(res, x)[DT_old, `:=`(y = i.y, v = i.v)]
res
# x y v z
# 1: a 1 1 NA
# 2: b 3 2 9
# 3: c 6 3 9
# 4: d 10 10 9
Following the comments section, it seems that you are trying to join each column by its own condition. There is no simple way of doing such a thing in R or any language AFAIK. Thus, your own solution could be a good option by itself.
Still, here are some other alternatives, mainly taken from a similar question I asked not long ago.
Using two ifelse statements
setkey(res, x)[DT_old, `:=`(y = ifelse(is.na(y), i.y, y),
                            v = ifelse(is.na(v), i.v, v))]
Two separate conditional joins
setkey(res, x) ; setkey(DT_old, x) ## old data set needs to be keyed too now
res[is.na(y), y := DT_old[.SD, y]]
res[is.na(v), v := DT_old[.SD, v]]
Both will give you what you need.
P.S.
If you don't want warnings, you need to define the corresponding column classes correctly, e.g. the v column in DT_new should be defined as v = c(2L, NA_integer_, 10L).
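For completeness, that would look like this (a minimal sketch; only the v column needs to change here):
DT_new = data.table(x = c("b", "c", "d"), y = c(9, 6, 10),
                    v = c(2L, NA_integer_, 10L), z = c(9, 9, 9))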
