I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.
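For instance, the gsub case in plain data.frame terms might look like this (a quick sketch; the sample table dfc and the comma-to-decimal substitution are just for illustration):
dfc <- data.frame(id = 1:3,
                  a = c("1,5", "2,0", "3,5"),
                  b = c("x,y", "y,z", "z,x"),
                  stringsAsFactors = FALSE)
chr_cols <- sapply(dfc, is.character)   # flag the character columns
dfc[chr_cols] <- lapply(dfc[chr_cols], function(s) gsub(",", ".", s))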
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like referencing the data.table in a j expression. I also know that I can use := to assign with lapply, given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
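For illustration, a sketch of such an expression-building extension (hedged: the eval-in-j pattern can vary across data.table versions):
n <- paste0("V", 20:100)
# build the full `:=` call as text, then parse it into a call object
e <- parse(text = sprintf("`:=`(c(%s), lapply(list(%s), sqrt))",
                          toString(paste0("'", n, "'")),
                          toString(n)))[[1]]
dt[, eval(e)]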
Below are the ideas I tried, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Yes, you're right in the question here:
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
  dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is:
for (col in paste0("V", 20:100))
  dt[ , (col) := sqrt(dt[[col]])]
because with the old form it wasn't easy to read whether with = FALSE referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because it does each column one by one, so working memory is only needed for one column at a time. That can make the difference between the operation working and it failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for all 81 columns is created. That's 81 columns' worth of new memory which has to be allocated and populated. So you need 81 columns' worth of free RAM for that operation to succeed. That RAM usage dominates the subsequently instant operation of assigning (plonking) those 81 new columns into the data.table's column pointer slots.
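For concreteness, the all-at-once form under discussion can be written with .SDcols (a sketch of the idiom that also appears in another answer below; it works, it just materialises all 81 replacement columns before assigning):
n <- paste0("V", 20:100)
dt[, (n) := lapply(.SD, sqrt), .SDcols = n]  # whole RHS built before the plonk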
As #Frank pointed out, if you have a lot of columns (say 10,000 or more) then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead there is data.table::set, which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 81 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question:
I don't like this because I don't like reference the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
  dt[, (col) := sqrt(get(col))]
However, I fear that using get in j carries an overhead; benchmarking is done in #1380. Also, it is perhaps confusing to use get() on the RHS but not on the LHS. To address that, we could sugar the LHS and allow get() as well, #1381:
for (col in paste0("V", 20:100))
  dt[, get(col) := sqrt(get(col))]
Also, maybe the value argument of set could be run within the scope of DT, #1382:
for (col in paste0("V", 20:100))
  set(dt, j = col, value = sqrt(get(col)))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]
Is this what you are looking for?
dt[, names(dt)[20:100] := lapply(.SD, function(x) sqrt(x)), .SDcols = 20:100]
I have heard tell that using .SD is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.
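If in doubt, it's easy to time both idioms on your own data; a rough sketch (assumes the microbenchmark package is installed and dt is the table from the question):
library(microbenchmark)
n <- paste0("V", 20:100)
microbenchmark(
  sdcols = dt[, (n) := lapply(.SD, sqrt), .SDcols = n],
  looped = { for (col in n) set(dt, j = col, value = sqrt(dt[[col]])) },
  times = 10L
)
# both expressions keep transforming dt in place across runs,
# so treat this only as a rough gauge of overhead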
Related
My data.table contains K columns of claims, among 30 other columns. I want to subset the data.table so that only rows without a 0 in any of the claims columns remain.
So, first I get all the column names I need for filtering. For the purpose of this example, I have chosen K = 2:
> claimsCols = c("claimsnext", paste0("claims" , 1:K))
> claimsCols
[1] "claimsnext" "claims1" "claims2"
I have tried subsetting like this:
for(i in claimsCols){
  BTplan <- BTplan[ claimsCols[i] == 0, ]
  i+1
}
This doesn't work:
Error in i + 1 : non-numeric argument to binary operator
I am sure there is a better way to do this.
I would basically do what akrun does:
idx = BTplan[ , Reduce(`&`, .SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
The innovations are:
Use patterns in .SDcols to specify the columns to include by pattern
& automatically converts numeric to logical, i.e. 1.1 & 2.2 is TRUE, and becomes FALSE as soon as there's a 0 anywhere (hence filtering the corresponding row)
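A quick illustration of that coercion:
c(1.1 & 2.2, 0 & 2.2)
# [1]  TRUE FALSE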
In a future version of data.table this will be slightly more efficient and concise (and hopefully more readable):
idx = BTplan[ , pall(.SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
Keep an eye on this pull request:
https://github.com/Rdatatable/data.table/pull/4448
In the OP's code, i is each of the elements of 'claimsCols', which is character, so i + 1 won't work; in fact, it is not needed:
for(colnm in claimsCols) {
  BTplan <- BTplan[BTplan[[colnm]] != 0, ]
}
Or using data.table syntax
library(data.table)
setDT(BTplan)
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)),.SDcols = claimsCols]]
How can I program a loop so that all eight tables are calculated one after the other?
The code:
dt_M1_I <- M1_I
dt_M1_I <- data.table(dt_M1_I)
dt_M1_I[, I := as.numeric(gsub(",", ".", I))]
dt_M1_I[, day := substr(t, 1, 10)]
dt_M1_I[, hour := substr(t, 12, 16)]
dt_M1_I_median <- dt_M1_I[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
This should be calculated for:
M1_I
M2_I
M3_I
M4_I
M1_U
M2_U
M3_U
M4_U
Thank you very much for your help!
Whenever you have several variables of the same kind, especially when you find yourself numbering them, as you did, step back and replace them with a single list variable. I do not recommend doing what the other answer suggested.
That is, instead of M1_I…M4_I and M1_U…M4_U, have two variables m_i and m_u (using lower case in variable names is conventional), which are each lists of four data.tables.
Alternatively, you might want to use a single variable, m, which contains nested lists of data.tables (m = list(list(i = …, u = …), …)).
Assuming the first, you can then iterate over them as follows:
give_this_a_meaningful_name = function (df) {
  dt <- data.table(df)
  dt[, I := as.numeric(gsub(",", ".", I))]
  dt[, day := substr(t, 1, 10)]
  dt[, hour := substr(t, 12, 16)]
  dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}
m_i_median = lapply(m_i, give_this_a_meaningful_name)
(Note also the introduction of consistent spacing around operators; good readability is paramount for writing bug-free code.)
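If the eight tables already exist as separate global variables, one way to gather them into such lists is mget (a sketch, assuming the M1_I…M4_U names from the question):
m_i <- mget(paste0("M", 1:4, "_I"))   # named list: M1_I, M2_I, M3_I, M4_I
m_u <- mget(paste0("M", 1:4, "_U"))   # the _U tables would need a column-U variant of the function
m_i_median <- lapply(m_i, give_this_a_meaningful_name)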
You can use a combination of a for loop and the get/assign functions like this:
# create a vector of the data.frame names
dts <- c('M1_I', 'M2_I', 'M3_I', 'M4_I', 'M1_U', 'M2_U', 'M3_U', 'M4_U')
# iterate over each dataframe
for (dt in dts){
  # get the actual data.frame (not the string name of it)
  tmp <- get(dt)
  tmp <- data.table(tmp)
  tmp[, I := as.numeric(gsub(",", ".", I))]
  tmp[, day := substr(t, 1, 10)]
  tmp[, hour := substr(t, 12, 16)]
  tmp <- tmp[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
  # assign the modified data.table to the name you want (paste0 adds the 'dt_' prefix)
  assign(paste0('dt_', dt), tmp)
}
When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: In (2), I am forcing data.table to overwrite dt on a column by column basis, so I would assume to need extra memory on the order of column size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
Following the suggestion of #Frank, the most memory-efficient way to replace a data.table column by column, applying a function my_fun to each column, is:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This currently (v1.11.4) is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which internally is optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
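To convince yourself that neither variant copies the table itself, data.table's address() is handy (a quick check on the sample data above):
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
address(dt)    # note this value
for (col in colnames(dt)) set(dt, j = col, value = my_fun(dt[[col]]))
address(dt)    # unchanged: dt was modified by reference, never copied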
I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically, as well as use a second input, I wasn't able to find an answer.
The idea is to index two or more series to a certain date by dividing all values by the series' value on that date, e.g.:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
X1 = cumsum(rnorm(10)),
X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df)) * (df$date == indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols], 2, function(x, i) { x / x[i] }, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[, nam := dt[, nam, with = FALSE] / div, with = FALSE]
}
In particular, all the with = FALSE calls do not look very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table, making it faster.
library(data.table)
for(j in cols){
  set(dt, i = NULL, j = j, value = dt[[j]] / dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = cols]
Following up on your code and the answer given by akrun, I would recommend using .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <- as.Date("2000-01-05")
rownum <- max((dt$date == index) * (1:nrow(dt)))
dt[, lapply(.SD, function(i) i / i[rownum]), .SDcols = is.numeric]
Using .SDcols can be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
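If you also want the predicate-based selection to update dt by reference (rather than return a new table), one way is to resolve the matching names first and combine with := (a sketch, reusing rownum from above):
num_cols <- names(dt)[sapply(dt, is.numeric)]
dt[, (num_cols) := lapply(.SD, function(i) i / i[rownum]), .SDcols = num_cols]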
This is from an observation I made while answering this question from #sds.
First, let me switch on the trace messages for data.table:
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")
Now, suppose one wants to get the sum of all columns grouped by column a, then, we could do:
dt.out <- dt[, lapply(.SD, sum), by = a]
Now, suppose I'd want to add also the number of entries that belong to each group to dt.out, then I normally assign it by reference as follows:
dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]
In this assignment by reference, one of the messages data.table produces is:
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
This is a message from assign.c in data.table's source. I don't want to paste the relevant snippet here as it's about 18 lines; if necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5, so it's not a named vector. And I don't understand what this recycled list in RHS is.
But if I do:
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]
Then, I get the message:
Direct plonk of unnamed RHS, no copy.
As I understand this, the RHS has been duplicated in the first case, meaning it's making a copy (shallow/deep, this I don't know). If so, why is this happening?
Even if not, why does the assignment by reference differ internally between the two forms? Any ideas?
To bring out the main underlying question that I had in my mind while writing this post (and seem to have forgotten!): Is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]] (compared to the second way of doing it)?
Update: The expression,
DT[, c(..., lapply(.SD, fun), ...), by = grp]
has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:
o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).
For example: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet.
This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.
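With the verbose option switched on at the top of this post, you can check whether a given j gets rewritten; for example (the exact trace wording may vary by version):
dt[, c(.I, lapply(.SD, sum)), by = a]
# look for a trace line along the lines of:
#   lapply optimization changed j from 'c(.I, lapply(.SD, sum))' to ...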
Where it says NAMED vector it means NAMED in the internal R sense at C level; i.e., whether an object has been assigned a symbol and is called something, not whether an atomic vector has a "names" attribute or not. The NAMED value in the SEXP structure takes the value 0, 1 or 2. R uses that to know whether it needs to copy-on-subassign or not. See section 1.1.2 of R-ints.
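You can probe that NAMED state at the R level with the inspect internal (a rough probe; the output format varies by R version):
x <- c(5L, 5L)
.Internal(inspect(x))   # shows e.g. NAM(1) (newer R prints REF(n) instead)
y <- x                  # a second symbol now refers to the same vector
.Internal(inspect(x))   # the count rises, so subassignment would have to copy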
What would be better is if optimization of j in data.table could handle:
DT[, c(lapply(.SD, sum), .N), by = a]
That works but may be slow. Currently only the simpler form is optimized:
DT[, lapply(.SD, sum), by = a]
To answer the main question: yes, the following:
Direct plonk of unnamed RHS, no copy.
is desirable compared to:
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
Another way to achieve this is:
dt.out[, count := dt[, .N, by=a]$N]
I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.