Understanding optimisation messages on assignment by reference in a data.table

This is from an observation I made while answering this question from @sds here.
First, let me switch on the trace messages for data.table:
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")
Now, suppose one wants to get the sum of all columns grouped by column a, then, we could do:
dt.out <- dt[, lapply(.SD, sum), by = a]
Now, suppose I also want to add to dt.out the number of entries that belong to each group. Then, I'd normally assign it by reference as follows:
dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]
In this assignment by reference, one of the messages data.table produces is:
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
This message comes from assign.c in data.table's source directory. I don't want to paste the relevant snippet here as it's about 18 lines; if necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5, so it's not a named vector. And I don't understand what this recycled list RHS is.
But if I do:
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]
Then, I get the message:
Direct plonk of unnamed RHS, no copy.
As I understand it, the RHS has been duplicated in the first case, meaning a copy is being made (shallow or deep, I don't know). If so, why is this happening?
Even if not, why does the assignment by reference behave differently internally between the two forms? Any ideas?
To bring out the main underlying question I had in mind while writing this post (and seem to have forgotten!): is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]] (compared to the second way of doing it)?

Update: The expression,
DT[, c(..., lapply(.SD, fun), ...), by=grp]
has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:
o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).
For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet.
This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.
Where it says NAMED vector it means NAMED in the internal R sense at C level; i.e., whether an object has been assigned a symbol and is called something, not whether an atomic vector has a "names" attribute. The NAMED value in the SEXP structure takes value 0, 1 or 2. R uses that to know whether it needs to copy-on-subassign or not. See section 1.1.2 of R-ints.
What would be better is if optimization of j in data.table could handle :
DT[, c(lapply(.SD,sum),.N), by=a]
That works but may be slow. Currently only the simpler form is optimized :
DT[, lapply(.SD,sum), by=a]
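A quick way to check whether a particular j gets rewritten is to run it with the verbose trace on and look at the optimisation lines data.table prints (a sketch; the exact wording of those messages varies by version):
options(datatable.verbose = TRUE)
dt[, lapply(.SD, sum), by = a]          # look for "lapply optimization changed j from ..." / "GForce optimized j to ..."
dt[, c(lapply(.SD, sum), .N), by = a]   # rewritten only on versions with the FR #2722 change from the update above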
To answer the main question, yes, the following :
Direct plonk of unnamed RHS, no copy.
is desirable compared to :
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
Another way to achieve this is :
dt.out[, count := dt[, .N, by=a]$N]
I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
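One way to poke at this (a sketch, not from the answer above; newer R versions print REF() reference counts instead of NAM()) is to inspect the extracted vectors at C level:
x <- dt[, .N, by = a]
.Internal(inspect(x[["N"]]))   # the header line reports the NAM()/REF() count for [[ extraction
.Internal(inspect(x$N))        # compare with the count reported for $ extraction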

Related

How to select data.table columns by partial string match and update them by a constant multiplication?

I have a large data.table with several columns, where some contain values in Cubic Feet.
These are marked by an added "_cft" at the end of the column name. I want to convert the values of these columns to m³ by multiplying them with a constant and returning the updated value.
I can already select the columns and multiply them, but am not able to replace the existing values.
My code looks like follows:
dt <- dt[, lapply(.SD, function(x) x * 0.0283168), .SDcols= grepl("_cft", names(dt))]
This, however, returns only the converted columns, but I want to keep all the other columns of the original data.table as well.
I have already tried using the := operator, but it results in an error:
"Error: unexpected symbol in dt <- dt[, `:=` lapply"
How can I do this?
Note that you should not combine <- with := because the latter works by reference.
Your error message suggests that you did not do the assignment properly. You need to specify the columns you want to assign to. Doing something like
dt[, `:=` lapply(.SD, function(x) x * 0.0283168), .SDcols= grepl("_cft", names(dt))]
will not work, and that's why you got that error message.
Try the following code:
cols = grep("_cft", names(dt))
dt[, (cols) := lapply(.SD, function(x) x * 0.0283168), .SDcols=cols]
# or simply
dt[, (cols) := lapply(.SD, `*`, 0.0283168), .SDcols=cols]
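For a self-contained illustration, here is a small sketch with a made-up table (the columns vol_cft and flow_cft are placeholders, not from your data):
library(data.table)
dt <- data.table(id       = 1:3,
                 vol_cft  = c(10, 20, 30),
                 flow_cft = c(100, 200, 300),
                 site     = c("a", "b", "c"))
cols <- grep("_cft", names(dt))
dt[, (cols) := lapply(.SD, `*`, 0.0283168), .SDcols = cols]
dt   # the *_cft columns are now in cubic metres; id and site are untouched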

Select columns in a data.table using vector of characters

I am trying to select some columns from a data.table but getting unexpected results.
For the following, I want to select columns y and z and this works as expected
library(data.table)
dt <- data.table(x=1:4, y=5:8, z=9:6)
dt[, c("y", "z")]
When I try to do this using setdiff it returns nonsense
omit_var <- "x"
dt[, setdiff(c("x","y","z"), omit_var)]
Even though they are equivalent: all.equal(setdiff(c("x","y","z"), omit_var), c("y", "z")) is TRUE.
Why is this happening? I'm guessing it's a scoping issue, but can I avoid it while keeping the code similar?
(I realise I can do i <- setdiff(c("x","y","z"), omit_var); dt[, ..i].)
Following comments from @Roland: "If you call a function in j, data.table can't know that you want to subset. It assumes that you want the return value of the function." So we can use either of
dt[, mget(setdiff(c("x","y","z"), omit_var))]
dt[, setdiff(c("x","y","z"), omit_var), with = FALSE]
@sindri_baldur gave another alternative using .SDcols
dt[, .SD, .SDcols = setdiff(c("x","y","z"), omit_var)]
And more options from @DavidArenburg
dt[, .SD, .SDcols = -omit_var]
If you want to keep all columns except the one in a character string you can use
dt[, !"x"]
Or you could use the .. operator
cols <- setdiff(names(dt), omit_var)
dt[, ..cols]
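As a quick sanity check (a sketch using the toy table from the question), the alternatives all select the same two columns:
omit_var <- "x"
cols <- setdiff(names(dt), omit_var)
all.equal(dt[, c("y", "z")], dt[, ..cols])               # TRUE
all.equal(dt[, c("y", "z")], dt[, .SD, .SDcols = cols])  # TRUE
all.equal(dt[, c("y", "z")], dt[, !"x"])                 # TRUE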

Row operations in data.table using `by = .I`

Here is a good SO explanation about row operations in data.table
One alternative that came to my mind is to use a unique id for each row and then apply a function using the by argument. Like this:
library(data.table)
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)],
                 V1 = 1:5,
                 V2 = 3:7,
                 V3 = 5:1)
# create a column with row positions
dt[, rowpos := .I]
# calculate standard deviation by row
dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = rowpos ]
Questions:
Is there a good reason not to use this approach? Perhaps there are other, more efficient alternatives?
Why doesn't using by = .I work the same?
dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = .I ]
UPDATE:
Since data.table version 1.14.3, by = .I has been implemented to work as the OP expected for row-wise grouping. Note that using by = .I will create a new column in the data.table called I that holds the row numbers. The row-number column can then be kept or deleted according to preference.
The following parts of this answer record an earlier version that pertains to older releases of data.table. I keep it here for reference in case someone is still using a legacy version.
Note: section (3) of this answer was updated in April 2019, due to many changes in data.table over time rendering the original version obsolete. Also, use of the argument with= has been removed from all data.table calls, as it has since been deprecated.
1) Well, one reason not to use it, at least for the rowSums example, is performance, together with the creation of an unnecessary column. Compare to option f2 below, which is almost 4x faster and does not need the rowpos column (note that the original question used rowSums as the example function, to which this part of the answer responds; the OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant):
dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)], V1=1:5, V2=3:7, V3=5:1)
f1 <- function(dt) {
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
}
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols = 2:4]
library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608 100 b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464 100 a
2) On your second question, although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that according to ?data.table ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is for use in j, not in by. NB the value of .I is calculated internally in data.table, so is not available beforehand to be passed in as a parameter value as in by=.I.
It might also be expected that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note, the same applies to .SD, .EACHI, .N, .GRP, and .BY)
.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL
The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.
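To see that equivalence concretely (a sketch; this applies to the legacy versions the remainder of this answer is about, not to versions with the by = .I support described in the update):
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
dt[, sum(V1), by = .I]    # 15 on legacy versions: a single ungrouped result, because .I is NULL
dt[, sum(V1), by = NULL]  # 15: explicitly no grouping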
3) Although we have already seen in part 1 that in the case of rowSums, which already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?
Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set in a for loop is slower than either of the methods that use data.table's by argument for looping. However, there is negligible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column.
dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
f.rowpos <- function() {
  dt[, rowpos := .I]
  dt[, sdd := sum(.SD[, 2:4]), by = rowpos]
}
f.nrow <- function() {
  dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt))]
}
f.forset <- function() {
  for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}
microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f.rowpos() 559.1115 575.3162 580.2853 578.6865 588.5532 599.7591 5
# f.nrow() 558.4327 582.4434 584.6893 587.1732 588.6689 606.7282 5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486 5
So, in conclusion, even in situations where there is not an optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require creation of a redundant column.

using mean with .SD and .SDcols in data.table

I am writing a very simple function to summarize columns of data.tables. I am passing one column at a time to the function, and then doing some diagnostics to figure out the options for summarization, and then doing the summarization. I am doing this in data.table to allow for some very large datasets.
So, I am using .SDcols to pass in the column to summarize, and using functions on .SD in the j part of a data.table expression. Since I am passing in one column at a time, I am not using lapply. And what I am finding is that some functions work and others do not. Below is a test dataset I am working with and the results I see:
dt <- data.table(
  a = 1:10,
  b = as.factor(letters[1:10]),
  c = c(TRUE, FALSE),
  d = runif(10, 0.5, 100),
  e = c(0, 1),
  f = as.integer(c(0, 1)),
  g = as.numeric(1:10),
  h = c("cat1", "cat2", "cat3", "cat4", "cat5"))
mean(dt$a)
[1] 5.5
dt[, mean(.SD), .SDcols = "a"]
[1] NA
Warning message:
In mean.default(.SD) : argument is not numeric or logical: returning NA
dt[, sum(.SD), .SDcols = "a"]
[1] 55
dt[, max(.SD), .SDcols = "a"]
[1] 10
dt[, colMeans(.SD), .SDcols = "a"]
a
5.5
dt[, lapply(.SD, mean), .SDcols = "a"]
a
1: 5.5
Interestingly, weighted.mean gives the wrong answer (55, the sum) when I use weighted.mean(.SD) in j. But when I use lapply(.SD, weighted.mean) in j, it gives the right answer (5.5, the mean).
I tried turning off data.table optimizations to see if it was the internal data.table mean function, but that didn't change things.
Maybe this is just a problem with using mean() on a list (which seems to be what .SD returns)? I guess there is never a reason to NOT use the lapply paradigm with .SD? It seems that only the lapply option returns a data.table. The others seem to return vectors, except for colMeans which is returning something else (list?).
My main question is why mean(.SD) does not work. And the corollary is whether .SD can be used in the absence of one of the apply functions.
Thanks.
I think the appropriate way of approaching what you want is to just use the standard syntax:
dt[ , lapply(.SD, mean), .SDcols = "a"]
Alternatively, you can pass a variable by name as follows:
col_to_pass = "a"
dt[ , mean(get(col_to_pass)) ]
You can generalize this approach to multiple columns as follows:
col_to_pass = c("a", "d")
dt[ , lapply( mget(col_to_pass), mean) ]
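As to why mean(.SD) misbehaves in the first place (my reading of the dispatch, not something stated above): .SD is itself a data.table, i.e. a list of columns. sum() and max() belong to the Summary group generic, which has a data.frame method, so they work on .SD directly; mean() is a separate generic whose data.frame method was removed from R long ago, so mean.default() is called, warns, and returns NA. A small sketch:
dt[, class(.SD), .SDcols = "a"]         # "data.table" "data.frame": .SD is a list of columns
sum(data.frame(a = 1:10))               # 55, via the Summary.data.frame group method
mean(data.frame(a = 1:10))              # NA plus a warning: mean() has no data.frame method
dt[, mean(unlist(.SD)), .SDcols = "a"]  # 5.5, if you want mean() on .SD without lapply()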

Elegantly assigning multiple columns in data.table with lapply()

I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like referencing the data.table in a j expression. I also know that I can use := to assign with lapply given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Yes, you're right in the question here :
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is :
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
because the with = FALSE wasn't easy to read whether it referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because it does each column one by one, so working memory is only needed for one column at a time. That can make the difference between it working and it failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 columns' worth of new memory which has to be allocated and populated. So you need 80 columns' worth of free RAM for that operation to succeed. That RAM usage dominates compared to the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed out, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead, there is data.table::set, which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question :
I don't like this because I don't like referencing the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
However, I fear that using get in j carries an overhead. Benchmarking was done in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that, we could sugar the LHS and allow get() as well, #1381 :
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
Also, maybe the value argument of set() could be evaluated within the scope of DT, #1382.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(get(col)))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]
Is this what you are looking for?
dt[, names(dt)[20:100] := lapply(.SD, function(x) sqrt(x)), .SDcols = 20:100]
I have heard tell that using .SD is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.
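For completeness, a sketch of what has since become the usual idiom for this task: name the columns once, select them via .SDcols, and assign back to the same names with := (reusing the V20-V100 columns from the question):
n <- paste0("V", 20:100)
dt[, (n) := lapply(.SD, sqrt), .SDcols = n]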
