I have the following data.table:
> dt = data.table(expr = c("a + b", "a - b", "a * b", "a / b"), a = c(1,2,3,4), b = c(5,6,7,8))
> dt
expr a b
1: a + b 1 5
2: a - b 2 6
3: a * b 3 7
4: a / b 4 8
My aim is to get the following data.table:
> dt
expr a b ans
1: a + b 1 5 6
2: a - b 2 6 -4
3: a * b 3 7 21
4: a / b 4 8 0.5
I tried the following:
> dt[, ans := eval(expr)]
Error in eval(expr, envir, enclos) : object 'expr' not found
> dt[, ans := eval(parse(text = expr))]
Error in parse(text = expr) : object 'expr' not found
Any idea how I can calculate the ans column based on the expression in the expr column?
If your actual expressions describe calls to vectorized functions and are repeated many times each, this may be more efficient, since it only parses and evaluates each distinct expression one time:
f <- function(e, .SD) eval(parse(text=e[1]), envir=.SD)
dt[, ans:=f(expr,.SD), by=expr, .SDcols=c("a", "b")]
# expr a b ans
# 1: a + b 1 5 6.0
# 2: a - b 2 6 -4.0
# 3: a * b 3 7 21.0
# 4: a / b 4 8 0.5
Really, there are a bunch of challenges for vectorization in such a setup. eval doesn't expect to run on a vector of expressions, nor is it set up to iterate over a vector of environments by default. Here I define a helper function to wrap much of the iteration:
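calc <- function(e, ...) {
  # evaluate one expression string against one row's values
  run <- function(x, ...) {
    eval(parse(text = x), list(...))
  }
  # walk the expressions and the named columns in parallel
  do.call("mapply", c(list(run, e), list(...)))
}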
dt[, ans:=calc(expr,a=a,b=b)]
which returns
expr a b ans
1: a + b 1 5 6.0
2: a - b 2 6 -4.0
3: a * b 3 7 21.0
4: a / b 4 8 0.5
as desired. Note that you'll need to name the parameters in the call to calc() so it knows which column to map to which variable.
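To illustrate the point about eval and vectors of expressions, here is a minimal base R sketch (the names exprs, vals, last_only and row_wise are made up for this example). parse() turns the character vector into an expression vector, and eval() on an expression vector evaluates every element against the whole columns but hands back only the last value. Iterating with mapply(), as calc() does, pairs each expression with its own row instead.

```r
exprs <- c("a + b", "a - b", "a * b", "a / b")
vals  <- list(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))

# parse() yields one expression per input string
parsed   <- parse(text = exprs)
n_parsed <- length(parsed)

# eval() runs every element but returns only the last value,
# i.e. a / b computed on the full columns
last_only <- eval(parsed, vals)

# pair each expression with the values of its own row
row_wise <- mapply(
  function(e, a, b) eval(parse(text = e), list(a = a, b = b)),
  exprs, vals$a, vals$b,
  USE.NAMES = FALSE
)
print(row_wise)
```

Parsing once per row is slow on large tables, which is why the by = expr approach above is preferable when expressions repeat.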
What I'm trying to achieve is this: say I have a data.table DT with 4 columns. I want the number of unique rows for every combination of 2 columns.
DT <- data.table(a = 1:10, b = c(1,1,1,2,2,3,4,4,5,5), c = letters[1:10], d = c(3,3,5,2,4,2,5,1,1,5))
> DT
a b c d
1: 1 1 a 3
2: 2 1 b 3
3: 3 1 c 5
4: 4 2 d 2
5: 5 2 e 4
6: 6 3 f 2
7: 7 4 g 5
8: 8 4 h 1
9: 9 5 i 1
10: 10 5 j 5
I tried the following code :
cols <- colnames(DT)
for(i in 1:(length(cols)-1)) {
for (j in i+1:length(cols)) {
print(unique(DT[,.SD, .SDcols = c(cols[i],cols[j])]))
}
}
Here, basically 'i' goes from first column to second last whereas 'j' is the combining column with 'i'. So the combinations I get are : ab, ac, ad, bc, bd, cd.
But it gives me the following error:
Error in [.data.table(DT, , .SD, .SDcols = c(cols[i], cols[j])) :
.SDcols missing at the following indices: [2]
If someone can explain why this is and a way around it, I'll be really grateful. Thanks.
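This is due to operator precedence: : is evaluated before +, so i+1:length(cols) is read as i+(1:length(cols)) and j runs past the last column. With 4 columns:
> 1+1:length(cols)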
[1] 2 3 4 5
> (1+1):length(cols)
[1] 2 3 4
The correct loop is:
for (i in 1:(length(cols) - 1)) {
  for (j in (i + 1):length(cols)) {
    print(unique(DT[, .SD, .SDcols = c(cols[i], cols[j])]))
  }
}
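As an alternative to nested index loops, the column pairs can be generated directly, which sidesteps the precedence trap altogether. This is only a sketch: it assumes the goal is the count of distinct rows per pair, and the names col_pairs and n_unique are invented here. It relies on combn() from base R and uniqueN() from data.table.

```r
library(data.table)

DT <- data.table(a = 1:10, b = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5),
                 c = letters[1:10], d = c(3, 3, 5, 2, 4, 2, 5, 1, 1, 5))

# all unordered pairs of column names: ab, ac, ad, bc, bd, cd
col_pairs <- combn(names(DT), 2, simplify = FALSE)

# number of distinct rows for each pair of columns
n_unique <- sapply(col_pairs, function(p) uniqueN(DT, by = p))
names(n_unique) <- sapply(col_pairs, paste, collapse = "")
print(n_unique)
```

If the unique rows themselves are needed rather than their count, unique(DT, by = p) can be used inside the same sapply/lapply pattern.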
Given the data.table dt <- data.table(a=c(1,NA,3), b = c(4:6))
a b
1: 1 4
2: NA 5
3: 3 6
... , the result for dt[is.na(a), a := sum(a, na.rm = T)] is:
a b
1: 1 4
2: 0 5
3: 3 6
... , instead of the expected:
a b
1: 1 4
2: 4 5
3: 3 6
What is going on? I am using data.table 1.12.8
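The i expression is.na(a) subsets the rows before j is evaluated, so sum(a, na.rm = TRUE) only sees the NA in row 2 and returns 0. To fill the gap from the whole column instead, we could use fcoalesce: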
library(data.table)
dt[, a := fcoalesce(a, sum(a, na.rm = TRUE))]
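The effect of the row subset can be checked directly. The sketch below (sum_in_subset and total are names chosen for this example) first shows what j sees when i = is.na(a), then computes the total on all rows before assigning it into the NA rows, as an alternative to fcoalesce.

```r
library(data.table)

dt <- data.table(a = c(1, NA, 3), b = 4:6)

# with i = is.na(a), j only sees row 2, whose a is NA
sum_in_subset <- dt[is.na(a), sum(a, na.rm = TRUE)]

# compute the total over all rows first, then assign into the NA rows
total <- dt[, sum(a, na.rm = TRUE)]
dt[is.na(a), a := total]
print(dt)
```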
Based on this previous post I built leftOuterJoin, a function that updates a data.table X according to another data.table Y. The function is defined as follows:
leftOuterJoin <- function(X, Y, onCol) {
.colsY <- names(Y)
X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}
The function works 99% of the time as intended, e.g.:
X <- data.table(id = 1:5, L = letters[1:5])
id L
1: 1 a
2: 2 b
3: 3 c
4: 4 d
5: 5 e
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))
id L N
1: 3 <NA> 10
2: 4 g NA
3: 5 h 12
leftOuterJoin(X, Y, "id")
X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
However, for some reason that is unknown to me, it just stops working with some data tables (I have no reproducible example at hand). There is no error, but the data table is not updated. When I step through with debug(), everything seems to work and X is updated inside the function, but the real data.table isn't. If I run the same statement outside the function, it works. Maybe it is related to the scope of the function? I am really struggling with this problem.
Spec: R v3.5.1 and data.table v1.11.4.
EDIT
Based on the comments I figured out that the problem is related to the data.table pointer. You can reproduce the problem with this code:
> save(X, file = "X.RData")
> load("X.RData")
> leftOuterJoin(X, Y, "id")
> X
id L
1: 1 a
2: 2 b
3: 3 <NA>
4: 4 g
5: 5 h
Notice that X is updated but not the way we want it. However, if we use setDT() it works properly:
> load("X.RData")
> setDT(X)
> leftOuterJoin(X, Y, "id")
> X
id L N
1: 1 a NA
2: 2 b NA
3: 3 <NA> 10
4: 4 g NA
5: 5 h 12
Is there a way to set up leftOuterJoin() such that it will not be necessary to run setDT() every time some data is loaded?
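One possibility, offered as an untested assumption rather than a confirmed answer from this thread: a table restored by load() seems to come back without a valid internal self-reference, so := works on a re-allocated shallow copy that is bound only inside the function. Existing columns are still shared, which would explain why L changes while N never appears. Since := returns the table invisibly, assigning the function's result back at the call site should pick up the new column without calling setDT() first. The temporary file path is just for the demonstration.

```r
library(data.table)

leftOuterJoin <- function(X, Y, onCol) {
  .colsY <- names(Y)
  # `:=` returns the updated table invisibly, so it is also the return value
  X[Y, (.colsY) := mget(paste0("i.", .colsY)), on = onCol]
}

X <- data.table(id = 1:5, L = letters[1:5])
Y <- data.table(id = 3:5, L = c(NA, "g", "h"), N = c(10, NA, 12))

# round trip through a temporary file to mimic the reported situation
tmp_file <- tempfile(fileext = ".RData")
save(X, file = tmp_file)
load(tmp_file)

# assumption: capturing the return value avoids relying on update by reference
X <- leftOuterJoin(X, Y, "id")
print(X)
```

The trade-off is that the function no longer behaves as a pure by-reference updater: callers have to remember the assignment.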
This is a small challenge within a big project, so I'm going to try to keep this simple.
I'm attempting to conditionally add columns to a data.table, and then process them on a conditional basis.
x <- T
y <- data.table(a = 1:10, b = c(rep(1,5), rep(2,5)))
y[ # filter some rows
a != 1
][ # conditionally add two calculated columns
,
if(x){
`:=` (
c = a*b,
d = 1/b
)
}
][ # process columns and group
,
list(
a = sum(a),
b = sum(b),
if(x) c = sum(c) # only add c if it's created above
),
by = if(x) list(b, d) else list(b) # only group by d if it's created above
]
Here is the output (error references the second set []):
Error in eval(expr, envir, enclos) : object 'd' not found
In addition: Warning message:
In deconstruct_and_eval(m, envir, enclos) :
Caught and removed `{` wrapped around := in j. := and `:=`(...) are
defined for use in j, once only and in particular ways. See help(":=").
Of course, the error is a symptom of the warning. How can I get this done?
As @Michal pointed out, putting the if() statement outside the data.table call is an option:
if(x) {
y[
...
]
} else {
y[
...
]
}
I'm hoping there's a way to get this done without repeating the code in its entirety, to simplify everything.
I can't think of a way of doing it inside the j-expression, because of how := gets evaluated in there (it really only works if it's at the root of the expression tree), but you could put it in the i-expression as a workaround:
x = FALSE
y[a != 1][x, `:=`(c = a * b, d = 1/b)][]
# a b
#1: 2 1
#2: 3 1
#3: 4 1
#4: 5 1
#5: 6 2
#6: 7 2
#7: 8 2
#8: 9 2
#9: 10 2
x = TRUE
y[a != 1][x, `:=`(c = a * b, d = 1/b)][]
# a b c d
#1: 2 1 2 1.0
#2: 3 1 3 1.0
#3: 4 1 4 1.0
#4: 5 1 5 1.0
#5: 6 2 12 0.5
#6: 7 2 14 0.5
#7: 8 2 16 0.5
#8: 9 2 18 0.5
#9: 10 2 20 0.5
Since c(1) is the same as c(1, NULL), it can be used to return complete vectors when you're not sure how many elements will compose them.
To conditionally include columns in j:
y[
,
c(
list(
a = sum(a),
b = sum(b)
),
if(x) list(c = sum(c))
)
]
And to conditionally include columns in by:
y[
,
...,
by = c("b", if(x) "d")
]
by won't accept a vector of lists, but it will accept a vector of column names.
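Putting the pieces together, a complete version might look like the sketch below. It makes a few assumptions: the wrapper summarise_y and its add_cols flag are hypothetical names standing in for x, the conditional := is written as a plain if around that single step instead of the i-expression trick, and the summary columns are renamed sum_a and sum_c so they do not clash with the grouping column b.

```r
library(data.table)

y <- data.table(a = 1:10, b = c(rep(1, 5), rep(2, 5)))

# hypothetical wrapper; add_cols plays the role of x in the question
summarise_y <- function(y, add_cols) {
  tmp <- y[a != 1]  # subsetting rows returns a new table, so y is untouched
  if (add_cols) tmp[, `:=`(c = a * b, d = 1 / b)]
  tmp[
    ,
    # c(list(...), NULL) is just the list, so sum_c drops out when not wanted
    c(list(sum_a = sum(a)), if (add_cols) list(sum_c = sum(c))),
    by = c("b", if (add_cols) "d")
  ]
}

with_cols    <- summarise_y(y, TRUE)
without_cols <- summarise_y(y, FALSE)
print(with_cols)
print(without_cols)
```

Only the := step sits inside an if, so the rest of the chain is written once.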
Here's a data.table
dt <- data.table(group = c("a","a","a","b","b","b"), x = c(1,3,5,1,3,5), y= c(3,5,8,2,8,9))
dt
group x y
1: a 1 3
2: a 3 5
3: a 5 8
4: b 1 2
5: b 3 8
6: b 5 9
And here's a function that operates on a data.table and returns a data.table
myfunc <- function(dt){
# Hyman spline interpolation (which preserves monotonicity)
newdt <- data.table(x = seq(min(dt$x), max(dt$x)))
newdt$y <- spline(x = dt$x, y = dt$y, xout = newdt$x, method = "hyman")$y
return(newdt)
}
How do I apply myfunc to each subset of dt defined by the "group" column? In other words, I want an efficient, generalized way to do this
result <- rbind(myfunc(dt[group=="a"]), myfunc(dt[group=="b"]))
result
x y
1: 1 3.000
2: 2 3.875
3: 3 5.000
4: 4 6.375
5: 5 8.000
6: 1 2.000
7: 2 5.688
8: 3 8.000
9: 4 8.875
10: 5 9.000
EDIT: I've updated my sample dataset and myfunc because I think it was initially too simplistic and invited work-arounds to the actual problem I'm trying to solve.
The whole idea of data.table is to be both memory efficient and fast. Thus we avoid $ within the data.table scope (except in very rare situations) and we don't create data.table objects within a data.table's environment (currently, even .SD has an overhead).
In your case you can take advantage of data.table's non-standard evaluation capabilities and define your function as follows:
myfunc <- function(x, y){
temp = seq(min(x), max(x))
y = spline(x = x, y = y, xout = temp, method = "hyman")$y
list(x = temp, y = y)
}
Then the implementation within the dt scope is straightforward:
dt[, myfunc(x, y), by = group]
# group x y
# 1: a 1 3.0000
# 2: a 2 3.8750
# 3: a 3 5.0000
# 4: a 4 6.3750
# 5: a 5 8.0000
# 6: b 1 2.0000
# 7: b 2 5.6875
# 8: b 3 8.0000
# 9: b 4 8.8750
# 10: b 5 9.0000
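For comparison, the question's original function, which takes and returns a data.table, can also be applied per group by passing .SD, at the cost of the overhead mentioned above. The sketch below renames that version to myfunc_dt (a name chosen here only to keep both variants side by side) and checks that the two approaches agree.

```r
library(data.table)

dt <- data.table(group = c("a", "a", "a", "b", "b", "b"),
                 x = c(1, 3, 5, 1, 3, 5), y = c(3, 5, 8, 2, 8, 9))

# the question's function: takes a data.table, returns a data.table
myfunc_dt <- function(dt) {
  newdt <- data.table(x = seq(min(dt$x), max(dt$x)))
  newdt$y <- spline(x = dt$x, y = dt$y, xout = newdt$x, method = "hyman")$y
  newdt
}

# the answer's function: takes column vectors, returns a list
myfunc <- function(x, y) {
  temp <- seq(min(x), max(x))
  list(x = temp, y = spline(x = x, y = y, xout = temp, method = "hyman")$y)
}

via_sd   <- dt[, myfunc_dt(.SD), by = group]  # .SD is the current group's rows
via_cols <- dt[, myfunc(x, y), by = group]
print(all.equal(via_sd$y, via_cols$y))
```

The .SD form is handy when an existing function cannot be rewritten; the column-vector form avoids building a data.table per group.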