data.table subassignment with `on = `

When making a subassignment,
the RHS length must either be 1 (single values are OK) or match the LHS length exactly,
as the error message says when the rule is not followed.
However, the following works:
tab.01 <- data.table( a = 1L:5L, b = 11L:15L )
tab.02 <- data.table( a = c(1L, 1L, 2L), x = c(11L, 12L, 22L) )
tab.01[ tab.02, x := i.x, on = "a"]
# a b x
# 1: 1 11 12
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
The column x is not functionally dependent on the column a. Yet, an assignment is made and, if my guess is right, the last element of the subgroup is assigned.
Can this default behaviour be changed, e.g. to choose the first element? The following trials do not work:
mult = "first" has no effect.
tab.01[ tab.02, x := first(i.x), on = "a" ] assigns the value 11L to all matches.
tab.01[ tab.02, x := first(i.x), on = "a", by = "a"]
results in an error, because i.x is not available anymore (or any other column in i).
tab.01[ tab.02, x := first(i.x), on = "a", by = .EACHI ] does not raise an error, but does not fix anything either. The values in the group a reassigned in the order of the rows, hence the last value is kept.
One can use a version of tab.02 with functionally dependent columns:
tab.02[ , y := f_fd(x), by = "a" ] # e.g. f_fd <- data.table::first
tab.01[ tab.02, x := y, on = "a"]
Is this the most concise way to perform this task?

I believe there's no built-in method specifically for accomplishing this. However, it is possible to do this update without modifying tab.02.
You could create a subset
tab.01[tab.02[rowid(a) == 1], x := i.x, on = "a"][]
# a b x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
or order before joining
tab.01[tab.02[order(-x)], x := i.x, on = "a"][]
# a b x
# 1: 1 11 11
# 2: 2 12 22
# 3: 3 13 NA
# 4: 4 14 NA
# 5: 5 15 NA
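A third spelling of the same idea uses data.table's unique(), which keeps the first row per group when given by (a sketch on my part; it should reproduce the rowid(a) == 1 result above, since unique() also retains the first occurrence):
tab.01[unique(tab.02, by = "a"), x := i.x, on = "a"][]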

Related

Group-wise conditional subsetting where feasible

I would like to subset rows of my data
library(data.table); set.seed(333); n <- 100
dat <- data.table(id=1:n, group=rep(1:2,each=n/2), x=runif(n,100,120), y=runif(n,200,220), z=runif(n,300,320))
> head(dat)
id group x y z
1: 1 1 109.3400 208.6732 308.7595
2: 2 1 101.6920 201.0989 310.1080
3: 3 1 119.4697 217.8550 313.9384
4: 4 1 111.4261 205.2945 317.3651
5: 5 1 100.4024 212.2826 305.1375
6: 6 1 114.4711 203.6988 319.4913
in several stages, unless a stage results in an empty subset; in that case, I would like to skip that specific subsetting. In an earlier question, Frank found a great solution for this:
f = function(x, ..., verbose = FALSE){
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose = v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}
where I can enter the data and the cut-offs for each variable manually.
> f(dat, x > 119, y > 219, y > 1e6)
cond skip
1: x > 119 FALSE
2: y > 219 FALSE
3: y > 1e+06 TRUE
id group x y z
1: 55 2 119.2634 219.0044 315.6556
I now wonder how this (or something even better!) could be applied to a case where the cut-offs are in a second data.table
c <- data.table(group=1:2, x=c(110,119), y=c(210,219), z=c(310,319))
> c
group x y z
1: 1 110 210 310
2: 2 119 219 319
and specified for each group separately.
If I were to use f(.), I would think of joining c into dat, but I can't figure it out. Perhaps there is a smarter way entirely.
First, I would change how c is constructed. You currently have it set up with one column per filter, but a long format allows multiple filters on the same column, similar to your initial example (i.e. two filters on y):
c <- data.table(group = c(1,2,1,2,1,2,1),
                variable = c("x","x","y","y","z","z","y"),
                c_val = c(110,119,210,219,310,319,1e6))
c[, c_id := 1:.N]
c
group variable c_val c_id
1: 1 x 110 1
2: 2 x 119 2
3: 1 y 210 3
4: 2 y 219 4
5: 1 z 310 5
6: 2 z 319 6
7: 1 y 1000000 7
You can then merge your filters into your data.
dat_cut <- melt(dat, id.vars = c("id", "group"), value.name = "dat_val")
output <- merge(dat_cut, c, by = c("group","variable"), allow.cartesian = TRUE)
The next line tests the filters; you can expand it if you want richer filter logic (greater than, less than, equals, etc.) and encode that logic back into c:
output <- output[dat_val > c_val]
You then want to find the rows that meet every filter that was met by at least one row of their group:
output[,req_match := uniqueN(c_id), by = .(group)] # number of filters where a condition was met.
selection <- output[,.N,by = .(id, group, req_match)][N == req_match, id]
If a filter did not match any rows, it will be excluded here.
Then you can filter your initial dataset for the solution:
dat[id %in% selection]
id group x y z
1: 3 1 119.4697 217.8550 313.9384
2: 18 1 117.2930 216.5670 310.4617
3: 35 1 110.4283 218.6130 312.0904
4: 50 1 119.2519 214.2517 318.8567
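For convenience, these steps can be collected into a helper in the spirit of f() (a sketch assembled from the answer's own code; the name f2 and its arguments are mine):
f2 <- function(dat, cuts) {
  dat_cut <- melt(dat, id.vars = c("id", "group"), value.name = "dat_val")
  met <- merge(dat_cut, cuts, by = c("group", "variable"),
               allow.cartesian = TRUE)[dat_val > c_val]
  met[, req_match := uniqueN(c_id), by = group]  # distinct filters met per group
  keep <- met[, .N, by = .(id, group, req_match)][N == req_match, id]
  dat[id %in% keep]
}
f2(dat, c)  # same selection as the step-by-step version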

paste() within another function

The first function works and the second one does not, and I am not sure why. I am solely interested in what is happening with the paste() function in this example, as all of the other code works properly. In addition to what is shown below, I have also tried the second function with a comma separator between each value.
Ideally, the list inside my function would look as follows, but produced by paste() instead of my writing out each value:
X41262.0.0 = i.X41262.0.0, X41262.0.1 = i.X41262.0.1, etc.
fread("ukb33822.csv", select= c("eid", "X2784.0.0", "X2794.0.0",
"X2804.0.0", "X2814.0.0", "X2834.0.0",
"X3536.0.0", "X3546.0.0", paste("X41262.0.", 0:65, sep = ""),
"X3581.0.0"))
biobank[biobank2, on = .(eid), `:=` (X2784.0.0 = i.X2784.0.0, X2794.0.0 = i.X2794.0.0,
X2804.0.0 = i.X2804.0.0, X2814.0.0 = i.X2814.0.0,
X2834.0.0 = i.X2834.0.0, X3536.0.0 = i.X3536.0.0,
X3546.0.0 = i.X3546.0.0, paste("X41262.0.", 0:65, " = ", "i.X41262.0.", 0:65, sep = ""),
X3581.0.0 = i.X3581.0.0)]
Error in
`[.data.table`(biobank, biobank2, on = .(eid), `:=`(X2784.0.0 = i.X2784.0.0, :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
Not having your data, it's a little contrived, but this might be enough to show you one option:
DT <- data.table(x=1:3)
DT[, c("a", "b", letters[3:5]) := c(list(1, 2), 3:5) ]
DT
# x a b c d e
# 1: 1 1 2 3 4 5
# 2: 2 1 2 3 4 5
# 3: 3 1 2 3 4 5
In this example:
"a" and "b" are your already-known names, e.g., "X2784.0.0", "X2794.0.0", etc
letters[3:5] are names you need to create programmatically, e.g., paste0("X41262.0.", 0:65)
1 and 2 are your already-known values, e.g., i.X2784.0.0, i.X2794.0.0, etc
3:5 are values you determine programmatically
It is not clear to me where your other values are found ...
If they are in the enclosing environment (and not within the actual table), then perhaps:
x1 <- 3:5
x2 <- 13:15
x3 <- 33:35
e <- environment()
DT[, c("a", "b", paste0("x", 1:3)) := c(list(1, 2), mget(paste0("x", 1:3), envir=e))]
# x a b x1 x2 x3
# 1: 1 1 2 3 13 33
# 2: 2 1 2 4 14 34
# 3: 3 1 2 5 15 35
where paste0("x", 1:3) is forming the variable names, and mget(...) actually retrieves them. You might need to define e as I have here, if they are not visible from data.table's search path.
If they are already in the data.table, then you might be able to do something with this:
DT <- data.table(x1=1:3, x2=11:13, x3=21:23)
DT[, c("a", "b", paste0("y", 1:3)) := c(list(1, 2), DT[, paste0("x", 1:3), with=FALSE]) ]
# x1 x2 x3 a b y1 y2 y3
# 1: 1 11 21 1 2 1 11 21
# 2: 2 12 22 1 2 2 12 22
# 3: 3 13 23 1 2 3 13 23
where paste0("y", 1:3) forms the names you want them to be, and paste0("x", 1:3) forms the other columns' names as they exist before this call.

Binary search for integer64 in data.table

I have an integer64-indexed data.table object:
library(data.table)
library(bit64)
some_data = as.integer64(c(1514772184120000026, 1514772184120000068, 1514772184120000042, 1514772184120000078,1514772184120000011, 1514772184120000043, 1514772184120000094, 1514772184120000085,
1514772184120000083, 1514772184120000017, 1514772184120000013, 1514772184120000060, 1514772184120000032, 1514772184120000059, 1514772184120000029))
#
n <- 10
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
Then I have a bunch of other integer64 that I need to binary-search for in the indexes of my original data.table object (x):
search_values <- some_data[(n+1):length(some_data)]
If these were native integers I could use findInterval() to solve the problem:
values_index <- findInterval(search_values, x$new_col)
but when the arguments to findInterval are integer64, I get:
Warning messages:
1: In as.double.integer64(vec) :
integer precision lost while converting to double
2: In as.double.integer64(x) :
integer precision lost while converting to double
and wrong indexes:
> values_index
[1] 10 10 10 10 10
i.e. the result wrongly suggests that every entry of search_values is larger than all entries of x$new_col, which is not true.
Edit:
Desired output:
print(values_index)
9 10 6 10 1
Why?:
value_index has as many entries as search_values. For each entry of search_values, the corresponding entry of value_index gives the rank that entry would have if it were inserted into x$new_col. So the first entry of value_index is 9 because the first entry of search_values (1514772184120000045) would have rank 9 among the entries of x$new_col.
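The warnings point at the root cause: findInterval() coerces its arguments to double, and a double's 53-bit mantissa cannot distinguish integer64 values of this magnitude (near 1.5e18, consecutive representable doubles are 256 apart). A minimal sketch of the collapse, using two values from the question:
library(bit64)
a <- as.integer64("1514772184120000026")
b <- as.integer64("1514772184120000068")
a == b                       # FALSE: distinct as integer64
as.double(a) == as.double(b) # TRUE: both round to the same representable double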
Maybe you want something like this:
findInterval2 <- function(y, x) {
  toadd <- y[!(y %in% x$new_col)] # search_values that are not in the data
  x2 <- copy(x)
  x2[, i := .I] # mark the original data set
  x2 <- rbindlist(list(x2, data.table(new_col = toadd)),
                  use.names = TRUE, fill = TRUE) # add missing search_values
  setkey(x2, new_col) # order
  x2[, index := cumsum(!is.na(i))]
  x2[match(y, new_col), index]
}
# x2 is:
# a new_col i index
# 1: 0.56602278 1514772184120000011 1 1
# 2: NA 1514772184120000013 NA 1
# 3: 0.29408237 1514772184120000017 2 2
# 4: 0.28532378 1514772184120000026 3 3
# 5: NA 1514772184120000029 NA 3
# 6: NA 1514772184120000032 NA 3
# 7: 0.66844754 1514772184120000042 4 4
# 8: 0.83008829 1514772184120000043 5 5
# 9: NA 1514772184120000059 NA 5
# 10: NA 1514772184120000060 NA 5
# 11: 0.76992760 1514772184120000068 6 6
# 12: 0.57049677 1514772184120000078 7 7
# 13: 0.14406169 1514772184120000083 8 8
# 14: 0.02044602 1514772184120000085 9 9
# 15: 0.68016024 1514772184120000094 10 10
findInterval2(search_values, x)
# [1] 1 5 3 5 3
If not, then maybe you could change the code as needed.
Update:
Look at this integer example to see that this function gives the same result as base findInterval:
now <- 10
n <- 10
n2 <- 10
some_data = as.integer(now + sample.int(n + n2, n + n2))
x <- setDT(data.frame(a = runif(n)))
x[, new_col := some_data[1:n]]
setorder(x, new_col)
search_values <- some_data[(n + 1):length(some_data)]
r1 <- findInterval2(search_values, x)
r2 <- findInterval(search_values, x$new_col)
all.equal(r1, r2)
If I get what you want, then a quick workaround could be:
toadd <- search_values[!(search_values %in% x$new_col)] # search_values that is not in data
x[, i := .I] # mark the original data set
x <- rbindlist(list(x, data.table(new_col = toadd)),
use.names = T, fill = T) # add missing search_values
setkey(x, new_col) # order
x[, index := new_col %in% search_values] # mark where the values are
x[, index := cumsum(index)] # get indexes
x <- x[!is.na(i)] # remove added rows
x$index # should contain your desired output

set() in data.table - matching on names instead of column number

Using set() for efficiency to update values inside a data.table, I ran into problems when the order of the columns changed. To prevent that, I used a workaround to match the column name instead of the column position.
I would like to know if there's a better way of addressing the column in the j part of the set query.
DT <- as.data.table(cbind(Period = 1:10, Col.Name = NA))
set(DT, i = 1L , j = as.integer(match("Col.Name",names(DT))), value = 0)
set(DT, i = 3L , j = 2L, value = 0)
So I would like to ask if there's a data.table workaround for this, perhaps some fast matching on the column names that is already available.
We can use the column name directly in 'j':
set(DT, i = 1L , j = "Col.Name", value = 0)
DT
# Period Col.Name
# 1: 1 0
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
#10: 10 NA
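Because j is matched by name, this update stays correct even if the columns are later reordered; a quick sketch (setcolorder is standard data.table, the rest mirrors the answer):
setcolorder(DT, c("Col.Name", "Period")) # move Col.Name to the front
set(DT, i = 3L, j = "Col.Name", value = 0) # still targets the intended column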

Apply function to data.table using function's character name and arguments as character vector

I would like to call functions by their character name on a data.table. Each function also has a vector of arguments (so there is a long list of functions to apply to the data.table); the arguments are data.table columns. My first thought was that do.call would be a good approach for this task. Here is a simple example with one function name to run and its vector of columns to pass:
# set up dummy data
set.seed(1)
DT <- data.table(x = rep(c("a","b"),each=5), y = sample(10), z = sample(10))
# columns to use as function arguments
mycols <- c('y','z')
# function name
func <- 'sum'
# my current solution:
DT[, do.call(func, list(get('y'), get('z'))), by = x]
# x V1
# 1: a 47
# 2: b 63
I am not satisfied with that, since it requires naming each column explicitly; I would like to pass just the character vector mycols.
Another solution that works just as I need in this case is:
DT[, do.call(func, .SD), .SDcols = mycols, by = x]
But there is a hiccup with custom functions, where the only solution that works for me is the first one:
# own dummy function
myfunc <- function(arg1, arg2){
  arg1 + arg2
}
func <- 'myfunc'
DT[, do.call(func, list(get('y'), get('z'))), by = x]
# x V1
# 1: a 6
# 2: a 6
# 3: a 11
# 4: a 17
# 5: a 7
# 6: b 15
# 7: b 17
# 8: b 10
# 9: b 11
# 10: b 10
# second solution does not work
DT[, do.call(func, .SD), .SDcols = mycols, by = x]
# Error in myfunc(y = c(3L, 4L, 5L, 7L, 2L), z = c(3L, 2L, 6L, 10L, 5L)) :
# unused arguments (y = c(3, 4, 5, 7, 2), z = c(3, 2, 6, 10, 5))
As I understand it, do.call assumes that myfunc has arguments named y and z, which is not true; the columns y and z should instead be passed positionally to the arguments arg1 and arg2.
I also tried the mget function, but with no success:
DT[, do.call(func, mget(mycols)), by = x]
# Error: value for ‘y’ not found
I could be missing something fairly obvious, thanks in advance for any guidance.
This is likely to be dependent on the types of functions you want to use, but it seems like Reduce might be of interest to you.
Here it is with both of your examples:
mycols <- c('y','z')
func <- 'sum'
DT[, Reduce(func, mget(mycols)), by = x]
# x V1
# 1: a 47
# 2: b 63
myfunc <- function(arg1, arg2){
  arg1 + arg2
}
func <- 'myfunc'
DT[, Reduce(func, mget(mycols)), by = x]
# x V1
# 1: a 6
# 2: a 6
# 3: a 11
# 4: a 17
# 5: a 7
# 6: b 15
# 7: b 17
# 8: b 10
# 9: b 11
# 10: b 10
Yes, you are missing something (though it's not really obvious; careful debugging of the error identifies the problem). Your function expects arguments named arg1 and arg2, but via do.call you are passing it arguments named y = ... and z = ... (as you noticed). The solution is to pass the list without names:
> DT[, do.call(func, unname(.SD[, mycols, with = F])), by = x]
x V1
1: a 6
2: a 6
3: a 11
4: a 17
5: a 7
6: b 15
7: b 17
8: b 10
9: b 11
10: b 10
Here is a solution that helped me to achieve what I want.
func <- 'sum'
mycols <- c('y','z')
DT[, do.call(func, lapply(mycols, function(x) get(x))), by = x]
# x V1
# 1: a 47
# 2: b 63
One can pass it base functions or custom-defined functions (it is not as restrictive as the Reduce solution).
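A tidier variant merges the two answers above, using .SDcols to select the columns and unname() to pass them positionally (a sketch; it should work for both sum and myfunc):
DT[, do.call(func, unname(.SD)), .SDcols = mycols, by = x]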
