Query data.table by key in R

I have followed the data.table introduction. A key is set on the x column of the data.table and then queried. I have tried to set the key on the v column instead, and it does not work as expected. Any ideas what I am doing wrong?
> set.seed(34)
> DT = data.table(x=c("b","b","b","a","a"),v=rnorm(5))
> DT
x v
1: b -0.1388900
2: b 1.1998129
3: b -0.7477224
4: a -0.5752482
5: a -0.2635815
> setkey(DT,v)
> DT[1.1998129,]
x v
1: b -0.7477224
EXPECTED:
x v
1: b 1.1998129

When the first argument of [.data.table is a number, it will not do a join, but a simple row-number lookup. After the setkey, your data.table looks like this:
DT
# x v
#1: b -0.7477224
#2: a -0.5752482
#3: a -0.2635815
#4: b -0.1388900
#5: b 1.1998129
Since as.integer(1.1998129) is equal to 1, you get the first row.
Now, if you intended to do a join instead, you have to use the syntax DT[J(...)] or DT[.(...)], and that will work as expected, provided you use the correct number. (As a convenience, you're not required to use J with e.g. character columns, because there is no competing default meaning for what DT["a"] would do.)
DT[J(v[5])]
# x v
#1: b 1.199813
Note that DT[J(1.1998129)] will not work, because:
DT$v[5] == 1.1998129
#[1] FALSE
You could print out a lot of digits, and that would work:
options(digits = 22)
DT$v[5]
#[1] 1.199812896606383683107
DT$v[5] == 1.199812896606383683107
#[1] TRUE
DT[J(1.199812896606383683107)]
# x v
#1: b 1.199812896606383683107
but there is an additional subtlety here, worth noting, in that R and data.table have different precisions for when floating point numbers are equal:
DT$v[5] == 1.19981289660638
#[1] FALSE
DT[J(1.19981289660638)]
# x v
#1: b 1.199812896606379908349
Long story short: be careful when joining on floating point numbers.
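A hedged aside, in case the approximate match above surprises you: the join-time tolerance is governed by data.table's numeric rounding setting, whose default has changed across versions (historically the last 2 bytes of the mantissa were rounded off; recent releases default to exact matching):
getNumericRounding()   # inspect the current setting
setNumericRounding(0)  # exact matching: the approximate join above no longer matches
setNumericRounding(2)  # round off the last 2 bytes, so nearby doubles compare equal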

Related

Can I do on-the-fly calculation on a column using data.table in R?

Hi, I am wondering if anyone knows (or could show me) how to calculate a value in a column C based on the previous value(s) in column C and another column B, saving the calculated value as the new current value in column C.
For example, suppose I first initialize the column C to 1s and the calculation I want to implement is C(1) = 1 + B(1)*0.1*1 and C(2) = C(1) + B(2)*0.1*C(1).
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
test
A B C
1: 1 1 1
2: 2 2 1
3: 3 1 1
4: 4 2 1
5: 5 1 1
What I want is:
test
A B C
1: 1 1 1.1
2: 2 2 1.32
3: 3 1 1.452
4: 4 2 1.7424
5: 5 1 1.91664
I could achieve what I want with a for loop or apply(), but I really want to know if this is doable using just data.table, to get some speedup.
Edit:
As pointed out by Frank in the comments below,
test[, C := cumprod(1 + .1*B)]
will do since multiplication is distributive. What if I want to supply a more complex custom function?
Many thanks in advance!
Using the formula as presented we have:
test[, C := Reduce(function(c, b) c + .1 * b * c, B, init = 1, accumulate = TRUE)[-1] ]
Of course, as pointed out already, it simplifies in this particular case, since we can write the body of the function as c * (1 + .1 * b), which implies a cumulative product of the parenthesized portion.
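A quick sanity check, on the example data, that the Reduce formulation and the cumulative product agree (all.equal tolerates the tiny floating point differences between the two evaluation orders):
all.equal(
  Reduce(function(c, b) c + .1 * b * c, test$B, init = 1, accumulate = TRUE)[-1],
  cumprod(1 + .1 * test$B)
)
# [1] TRUE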
It seems you need to apply the function cumulatively:
library(data.table)
library(zoo)
test=data.table(A=1:5,B=c(1,2,1,2,1),C=1)
z <- function(b){1+b*0.1}
test[,C:=cumprod(rollapply(B, width=1, FUN=z))]
But I agree that there's really no need to bring zoo here. Frank's solution is more elegant and concise.
test[,C:=cumprod(1 + .1*B)]
I don't believe there is a similar data.table function, but it seems like accumulate from purrr is what you want. Simple example below, but the input could be rows of a data.table also.
library(purrr)
accumulate(1:4, function(x, y){2*x + y})
# [1] 1 4 11 26
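Applied to the question's own data, a sketch mirroring the Reduce answer above (purrr's .init argument plays the role of Reduce's init, and the [-1] drops the initial value from the result):
library(purrr)
test[, C := accumulate(B, function(c, b) c + .1 * b * c, .init = 1)[-1]]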

Complex data.table subset and manipulation

I am trying to combine a lot of data.table manipulations into some faster code. I am creating an example with a smaller data.table, and I am hopeful someone has a better solution than the clunky (embarrassing) code I developed.
For each group, I want to:
1) Verify there is both a TRUE and FALSE in column w, and if there is:
2) Subtract the value of x corresponding to the highest value of v from each value of x in the same group, and put that number in a new column.
So in group 3, if the highest v value is 10, and in the same row x is 0.212, I would subtract 0.212 from every x value corresponding to group 3 and put the result in a new column.
3) Remove all rows corresponding to groups without both a TRUE and a FALSE in column w.
set.seed(1)
test <- data.table(v=1:12, w=runif(12)<0.5, x=runif(12),
y=sample(2,12,replace=TRUE), z=sample(letters[1:3],12,replace=TRUE) )
setkey(test,y,z)
test[,group:=.GRP,by=key(test)]
A chained version can look like this without needing to set a table key:
result <- test[
# First, identify groups to remove and store in 'rowselect'
, rowselect := (0 < sum(w) & sum(w) < .N)
, by = .(y,z)][
# Select only the rows that we need
rowselect == TRUE][
# get rid of the temp column
, rowselect := NULL][
# create a new column 'u' to store the values
, u := x - x[max(v) == v]
, by = .(y,z)]
The result looks like this:
> result
v w x y z u
1: 1 TRUE 0.6870228 1 c 0.4748803
2: 3 FALSE 0.7698414 1 c 0.5576989
3: 7 FALSE 0.3800352 1 c 0.1678927
4: 10 TRUE 0.2121425 1 c 0.0000000
5: 8 FALSE 0.7774452 2 b 0.6518901
6: 12 TRUE 0.1255551 2 b 0.0000000
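For comparison, a more compact sketch that filters and computes in a single grouped pass (a group whose j expression returns NULL is dropped; the column order differs from the chained version, with the group columns y and z first):
result2 <- test[, if (0 < sum(w) && sum(w) < .N)
                    .(v, w, x, u = x - x[which.max(v)])
                , by = .(y, z)]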

R data.table column names not working within a function

I am trying to use a data.table within a function, and I want to understand why my code is failing. I have a data.table as follows:
DT <- data.table(my_name=c("A","B","C","D","E","F"),my_id=c(2,2,3,3,4,4))
> DT
my_name my_id
1: A 2
2: B 2
3: C 3
4: D 3
5: E 4
6: F 4
I am trying to create all pairs of "my_name" with different values of "my_id", which for DT would be:
Var1 Var2
A C
A D
A E
A F
B C
B D
B E
B F
C E
C F
D E
D F
I have a function that returns all pairs of "my_name" for a given pair of values of "my_id", which works as expected:
get_pairs <- function(id1,id2,tdt) {
return(expand.grid(tdt[my_id==id1,my_name],tdt[my_id==id2,my_name]))
}
> get_pairs(2,3,DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Now, I want to execute this function for all pairs of ids, which I try to do by finding all pairs of ids and then using mapply with the get_pairs function.
> combn(unique(DT$my_id),2)
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 3 4 4
tid1 <- combn(unique(DT$my_id),2)[1,]
tid2 <- combn(unique(DT$my_id),2)[2,]
mapply(get_pairs, tid1, tid2, DT)
Error in expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name]) :
object 'my_id' not found
Again, if I try to do the same thing without mapply, it works.
get_pairs(tid1[1],tid2[1],DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure.
Alternatively, is there a different/more efficient way to accomplish this task? I have a large data.table with a third id "sample" and I need to get all of these pairs for each sample (e.g. operating on DT[sample=="sample_id",] ). I am new to the data.table package, and I may not be using it in the most efficient way.
The function debugonce() is extremely useful in these scenarios.
debugonce(mapply)
mapply(get_pairs, tid1, tid2, DT)
# Hit enter twice
# from within BROWSER
debugonce(FUN)
# Hit enter twice
# you'll be inside your function, and then type DT
DT
# [1] "A" "B" "C" "D" "E" "F"
Q # (to quit debugging mode)
which is wrong: inside the function, DT is just the column my_name, not your data.table. Basically, mapply() takes the first element of each input argument and passes it to your function. In this case you've provided a data.table, which is also a list. So, instead of passing the entire data.table, it passes each element of the list (each column) in turn.
So, you can get around this by doing:
mapply(get_pairs, tid1, tid2, list(DT))
But mapply() simplifies the result by default, and therefore you'd get a matrix back. You'll have to use SIMPLIFY = FALSE.
mapply(get_pairs, tid1, tid2, list(DT), SIMPLIFY = FALSE)
Or simply use Map:
Map(get_pairs, tid1, tid2, list(DT))
Use rbindlist() to bind the results.
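For instance, a minimal sketch combining the two steps with the objects defined above:
res <- Map(get_pairs, tid1, tid2, list(DT))
rbindlist(res)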
HTH
Enumerate all possible pairs
u_name <- unique(DT$my_name)
all_pairs <- CJ(u_name,u_name)[V1 < V2]
Enumerate observed pairs
obs_pairs <- unique(
DT[,{un <- unique(my_name); CJ(un,un)[V1 < V2]}, by=my_id][, !"my_id"]
)
Take the difference
all_pairs[!J(obs_pairs)]
CJ is like expand.grid except that it creates a data.table with all of its columns as its key. A data.table X must be keyed for a join X[J(Y)] or a not-join X[!J(Y)] (like the last line) to work. The J is optional, but makes it more obvious that we're doing a join.
Simplifications. @CathG pointed out that there is a cleaner way of constructing obs_pairs if you always have two sorted "names" for each "id" (as in the example data): use as.list(un) in place of CJ(un,un)[V1 < V2].
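A sketch of that simplification, valid under the stated assumption of exactly two sorted names per id (the unnamed list is auto-named V1, V2, matching all_pairs):
obs_pairs <- unique(
  DT[, {un <- unique(my_name); as.list(un)}, by=my_id][, !"my_id"]
)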
Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure.
The reason the function is failing has nothing to do with scoping in this case. mapply vectorizes the function: it takes each element of each parameter and passes it to the function. In your case the elements of a data.table are its columns, so mapply is passing the column my_name instead of the complete data.table.
If you want to pass the complete data.table to mapply, you should use the MoreArgs parameter. Then your function will work:
res <- mapply(get_pairs, tid1, tid2, MoreArgs = list(tdt=DT), SIMPLIFY = FALSE)
do.call("rbind", res)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
5 A E
6 B E
7 A F
8 B F
9 C E
10 D E
11 C F
12 D F

R data.table setkey on numeric column

Do we really need to add J() to select by a numeric key column?
We can get the result for a character key column without J().
library(data.table)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)
DT["a"]
# x y v
# 1: a 1 1
# 2: a 3 2
# 3: a 6 3
setkey(DT,y)
DT["1"]
# Error in `[.data.table`(DT, "1") :
# typeof x.y (double) != typeof i.y (character)
# Is it a bug?
DT[J(1)]
# y x v
# 1: 1 a 1
# 2: 1 b 4
# 3: 1 c 7
Thanks!
The reason that DT[1] is not the same as DT[J(1)] is that there are TWO different interpretations that we might want:
the first row, DT[1]
all rows for which the key equals 1, DT[J(1)]
The potential ambiguity only exists if the first argument is numeric, which is why there are two different notations for the two situations.
In the case of a character key this potential ambiguity does not arise since a character argument could only mean the second case.
Also, DT["1"] is an error in the code of the question since the key in the example is not character and data.table does not perform coercion of types here.
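To make the distinction concrete with the question's data (after setkey(DT, y)):
DT[1]     # row-number lookup: just the first row, whatever its key value
DT[J(1)]  # keyed join: all rows where y == 1, as shown above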

Filter data.table using inequalities and variable column names

I have a data.table that I want to filter based on some inequality criteria:
dt <- data.table(A=letters[1:3], B=2:4)
dt
# A B
# 1: a 2
# 2: b 3
# 3: c 4
dt[B>2]
# A B
# 1: b 3
# 2: c 4
The above works well as a vector scan solution. But I can't work out how to combine this with variable names for the columns:
mycol <- "B"
dt[mycol > 2]
# A B (nothing has changed)
# 1: a 2
# 2: b 3
# 3: c 4
How do I work around this? I know I can use binary search by setting keys using setkeyv(dt, mycol) but I can't see a way of doing a binary search based on some inequality criteria.
OK, then: use get(mycol), because you want the argument to dt[ to be the contents of the object mycol. dt[mycol > 2] first looks for a column named mycol in the data.table itself, of which of course there is no such animal; R then falls back to the calling scope, where mycol is the string "B", and since "B" > 2 evaluates to TRUE, the whole table is returned unchanged.
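For example (this gives the same result as the dt[B > 2] call above):
dt[get(mycol) > 2]
# A B
# 1: b 3
# 2: c 4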
There is an accessor approach provided for this. j is evaluated in the frame of X, i.e. your data.table, unless you specify with = FALSE. This would be the canonical way of doing it:
dt[ , mycol , with = FALSE ]
B
1: 2
2: 3
3: 4
Return column, logical comparison, subset rows...
dt[ c( dt[ , mycol , with = FALSE ] > 2 ) ]
Another alternative is to use [[ to retrieve B as a vector, and subset using this:
dt[dt[[mycol]] > 2]
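As a further aside, hedged on the version: recent data.table releases (1.14.2+, which introduced the env argument for programmatic substitution) offer a direct way to inject a variable column name:
dt[col > 2, env = list(col = mycol)]  # col is substituted by the name stored in mycol
# A B
# 1: b 3
# 2: c 4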
