Here is a good SO explanation about row operations in data.table
One alternative that came to my mind is to use a unique id for each row and then apply a function using the by argument. Like this:
library(data.table)
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)],
                 V1 = 1:5,
                 V2 = 3:7,
                 V3 = 5:1)
# create a column with row positions
dt[, rowpos := .I]
# calculate standard deviation by row
dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = rowpos ]
Questions:
Is there a good reason not to use this approach? Perhaps other, more efficient alternatives?
Why doesn't using by = .I work the same?
dt[ , sdd := sd(.SD[, -1, with=FALSE]), by = .I ]
UPDATE:
Since data.table version 1.14.3, by = .I has been implemented to work as the OP expected, i.e. for row-wise grouping. Note that using by = .I will create a new column in the data.table called I that holds the row numbers. The row-number column can then be kept or deleted according to preference.
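For illustration, here is a minimal hedged sketch of that newer behaviour (assuming data.table >= 1.14.3 as described above, and using sd as in the original question; the unlist() and .SDcols parts are my additions to make the example runnable):
library(data.table)
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
out <- dt[, .(sdd = sd(unlist(.SD))), by = .I, .SDcols = V1:V3]  # row-wise sd
out                  # contains an 'I' column holding the row numbers
out[, I := NULL]     # drop the row-number column if it is not wanted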
The following parts of this answer record an earlier version that pertains to older versions of data.table. I keep it here for reference in case someone is still using a legacy version.
Note: section (3) of this answer was updated in April 2019, because many changes in data.table over time rendered the original version obsolete. Also, use of the argument with= was removed from all instances of data.table, as it has since been deprecated.
1) Well, one reason not to use it, at least for the rowSums example, is performance, plus the creation of an unnecessary column. Compare to option f2 below, which is almost 4x faster and does not need the rowpos column (note that the original question used rowSums as the example function, to which this part of the answer responds; the OP edited the question afterwards to use a different function, for which part 3 of this answer is more relevant):
dt <- data.table(V0 =LETTERS[c(1,1,2,2,3)], V1=1:5, V2=3:7, V3=5:1)
f1 <- function(dt) {
  dt[, rowpos := .I]
  dt[, sdd := rowSums(.SD[, 2:4]), by = rowpos]
}
f2 <- function(dt) dt[, sdd := rowSums(.SD), .SDcols= 2:4]
library(microbenchmark)
microbenchmark(f1(dt),f2(dt))
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# f1(dt) 3.669049 3.732434 4.013946 3.793352 3.972714 5.834608 100 b
# f2(dt) 1.052702 1.085857 1.154132 1.105301 1.138658 2.825464 100 a
2) On your second question, although dt[, sdd := sum(.SD[, 2:4]), by = .I] does not work, dt[, sdd := sum(.SD[, 2:4]), by = 1:NROW(dt)] works perfectly. Given that according to ?data.table ".I is an integer vector equal to seq_len(nrow(x))", one might expect these to be equivalent. The difference, however, is that .I is for use in j, not in by. NB the value of .I is calculated internally in data.table, so is not available beforehand to be passed in as a parameter value as in by=.I.
It might also be expected that by = .I should just throw an error. But this does not occur, because loading the data.table package creates an object .I in the data.table namespace that is accessible from the global environment, and whose value is NULL. You can test this by typing .I at the command prompt. (Note, the same applies to .SD, .EACHI, .N, .GRP, and .BY)
.I
# Error: object '.I' not found
library(data.table)
.I
# NULL
data.table::.I
# NULL
The upshot of this is that the behaviour of by = .I is equivalent to by = NULL.
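As a quick hedged illustration of that equivalence (the j expression mirrors the sum example used later in this answer): grouping by NULL means no grouping at all, so j is evaluated once over the whole table rather than once per row.
dt <- data.table(V0 = LETTERS[c(1,1,2,2,3)], V1 = 1:5, V2 = 3:7, V3 = 5:1)
dt[, sum(.SD[, 2:4]), by = NULL]   # one grand total, not one value per row (same as by = .I on legacy versions)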
3) We have already seen in part 1 that, for a function like rowSums that already loops row-wise efficiently, there are much faster ways than creating the rowpos column. But what about looping when we don't have a fast row-wise function?
Benchmarking the by = rowpos and by = 1:NROW(dt) versions against a for loop with set() is informative here. We find that looping over set() in a for loop is slower than either of the methods that use data.table's by argument for looping. However, there is negligible difference in timing between the by loop that creates an additional column and the one that uses seq_len(NROW(dt)). Absent any performance difference, it seems that f.nrow is probably preferable, but only on the basis of being more concise and not creating an unnecessary column.
dt <- data.table(V0 = rep(LETTERS[c(1,1,2,2,3)], 1e3), V1=1:5, V2=3:7, V3=5:1)
f.rowpos <- function() {
  dt[, rowpos := .I]
  dt[, sdd := sum(.SD[, 2:4]), by = rowpos]
}
f.nrow <- function() {
  dt[, sdd := sum(.SD[, 2:4]), by = seq_len(NROW(dt))]
}
f.forset <- function() {
  for (i in seq_len(NROW(dt))) set(dt, i, 'sdd', sum(dt[i, 2:4]))
}
microbenchmark(f.rowpos(),f.nrow(), f.forset(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# f.rowpos() 559.1115 575.3162 580.2853 578.6865 588.5532 599.7591 5
# f.nrow() 558.4327 582.4434 584.6893 587.1732 588.6689 606.7282 5
# f.forset() 1172.6560 1178.8399 1298.4842 1255.4375 1292.7393 1592.7486 5
So, in conclusion, even in situations where there is not an optimised function such as rowSums that already operates by row, there are alternatives to using a rowpos column that, although not faster, don't require creation of a redundant column.
Related
As a natural consequence of the data.table subsetting logic in i, I often end up in situations where I have part of a variable defined for an id (like "total economic crises before 2007 per country" being counted for data < 2007 hence NA for anything later). Here is a slightly more general example:
library("data.table")
Data <- data.table(id = rep(c(1,2,3), each = 4),
                   variable = c(3,3,NA,NA,NA,NA,4,NA,NA,NA,NA,NA))
When I subsequently need this variable defined over the entire dataset, I want to fill up the NA's by group. I usually do this using max by group:
Data[, variable_full := max(variable, na.rm = T), by = id]
Data[variable_full == -Inf, variable_full := NA] # this just overwrites the result of the warning
But, for whatever reason, this takes very long in large datasets. Is there a more efficient, more data.table like way of doing this?
edit: "large datasets" is currently 8 million observations and it stops my workflow because it takes several minutes. Other data.table operations take split seconds because data.table is amazing.
perhaps a join?
Data[, variable_full := variable]
Data[is.na(variable), variable_full := Data[!is.na(variable),
                                            max(variable),
                                            by = .(id)][Data[is.na(variable), ], V1, on = .(id)]][]
a (slightly) shorter version of the line with the join is
Data[is.na(variable), variable_full := Data[!is.na(variable), max(variable), by = .(id)][.SD, V1, on = .(id)]]
here, the [Data[is.na(variable), ] part has been replaced with [.SD, because it is already derived from i (at the beginning of the line)...
If you want to install the collapse package, I think this could be much faster:
library(collapse)
Data = Data |> gby(id) |> fmutate(variable_full=fmax(variable)) |> setDT()
gby is 'group by' and fmutate is 'fast mutate'. The default output is a grouped data frame, so it needs the setDT() at the end.
I need to perform the following operation on a dataframe in R 3.4.1 on Windows:
Split the dataframe by a categorical variable -> get a list of dataframes split by that categorical variable (getting the list is not necessary, that's just how I do it).
Extract a variable from the split dataframe list.
Combine the splitted variables in a matrix.
Transpose the matrix.
Currently I'm doing these operations as follows:
t(sapply(split(df, df$date), function(x) x$avg_mean))
I'd like this operation to be more efficient, that is:
Use the least memory possible, i.e. not duplicate objects, if possible. I may need to use this with a 1.5 GB dataframe.
Be fast with large dataframes.
What is the most appropriate/efficient way of doing this in R? Parallelization is also appreciated but not strictly necessary since I'm not sure I'll be able to use it.
If you need a toy dataframe, use this.
The best approach is probably to go in the direction suggested in comments with split(df$avg_mean, df$date) and bind the results together. A pretty close second would be to just convert your vector to a matrix directly exploiting the fact that the number of observations for each date must be constant in your case. Some approaches and their speed below:
library(microbenchmark)
library(data.table)
dat <- data.frame(date = rep(c('A', 'B', 'C'), each = 1000),
                  avg_mean = rnorm(3000))
f1 <- function(dat) {
  t(sapply(split(dat, dat$date), function(x) x$avg_mean))
}
f2 <- function(dat) {
  matrix(dat$avg_mean, nrow = length(unique(dat$date)), byrow = TRUE)
}
f3 <- function(dat) {
  do.call(rbind, split(dat$avg_mean, dat$date))
}
f4 <- function(DF) {
  DF <- data.table(DF)
  DF[, index := 1:.N, by = date]
  DF_trx <- dcast(DF, index ~ date, value.var = "avg_mean")
  DF_trx$index <- NULL
  t(as.matrix(DF_trx))
}
microbenchmark(f1(dat), f2(dat), f3(dat), f4(dat))
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> f1(dat) 456.064 475.542 617.0032 489.9390 515.6205 4250.471 100
#> f2(dat) 107.062 110.907 150.3135 117.6060 124.1925 2992.862 100
#> f3(dat) 74.313 79.927 122.2712 84.4455 89.4250 2504.850 100
#> f4(dat) 3797.694 3893.886 4563.4614 4021.6505 5053.5800 15757.085 100
It seems do.call(rbind, split(dat$avg_mean, dat$date)) is probably your best bet.
I have added an index for each group and used data.table's dcast to achieve the same. You should use data.table to make it more efficient:
DF = data.table(DF)
DF[ , index := 1:.N, by=Col1]
DF_trx = dcast(DF, index~Col1, value.var = "Col2")
DF_trx$index=NULL
as.matrix(DF_trx)
I'm new to the data.table package.
I'm working on a big data.table (60 columns, 9 million rows)
and would like to replace all negative values with 0 in all columns.
My current solution is:
dt2 <- dt[, lapply(.SD,function(x) {ifelse(x < 0,0,x)})]
This takes approx. 8s per column.
I'd like to use the := operator and skip the function to make it faster.
But I don't know how I can reference the current column chosen by .SD
e.g.
dt[, lapply(.SD, .SD[<0] := 0]
How would I do that?
We can use set(), which does the assignment in place. Loop through the sequence of columns, get the row indices where the value is less than 0 (i), specify the column index in j, and set the values that correspond to those indices to 0.
for (j in seq_along(dt)) {
  set(dt, i = which(dt[[j]] < 0), j = j, value = 0)
}
Or another option is
dt[, lapply(.SD, function(x) pmax(0, x))]
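Note that the pmax() form above returns a new table rather than updating dt by reference. A minimal hedged sketch of an in-place variant (assuming, as in the question, that every column of dt is numeric):
cols <- names(dt)   # assumes all 60 columns are numeric, per the question
dt[, (cols) := lapply(.SD, pmax, 0), .SDcols = cols]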
I am trying to figure out an elegant way to use := assignment to replace many columns at once in a data.table by applying a shared function. A typical use of this might be to apply a string function (e.g., gsub) to all character columns in a table. It is not difficult to extend the data.frame way of doing this to a data.table, but I'm looking for a method consistent with the data.table way of doing things.
For example:
library(data.table)
m <- matrix(runif(10000), nrow = 100)
df <- df1 <- df2 <- df3 <- as.data.frame(m)
dt <- as.data.table(df)
head(names(df))
head(names(dt))
## replace V20-V100 with sqrt
# data.frame approach
# by column numbers
df1[20:100] <- lapply(df1[20:100], sqrt)
# by reference to column numbers
v <- 20:100
df2[v] <- lapply(df2[v], sqrt)
# by reference to column names
n <- paste0("V", 20:100)
df3[n] <- lapply(df3[n], sqrt)
# data.table approach
# by reference to column names
n <- paste0("V", 20:100)
dt[, n] <- lapply(dt[, n, with = FALSE], sqrt)
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100)) dt[, col := sqrt(dt[[col]]), with = FALSE]
I don't like this because I don't like referencing the data.table in a j expression. I also know that I can use := to assign with lapply given that I know the column names:
dt[, c("V20", "V30", "V40", "V50", "V60") := lapply(list(V20, V30, V40, V50, V60), sqrt)]
(You could extend this by building an expression with unknown column names.)
Below are the ideas I tried on this, but I wasn't able to get them to work. Am I making a mistake, or is there another approach I'm missing?
# possible data.table approaches?
# by reference to column names; assignment works, but not lapply
n <- paste0("V", 20:100)
dt[, n := lapply(n, sqrt), with = FALSE]
# by (smaller for example) list; lapply works, but not assignment
dt[, list(list(V20, V30, V40, V50, V60)) := lapply(list(V20, V30, V40, V50, V60), sqrt)]
# by reference to list; neither assignment nor lapply work
l <- parse(text = paste("list(", paste(paste0("V", 20:100), collapse = ", "), ")"))
dt[, eval(l) := lapply(eval(l), sqrt)]
Yes, you're right in your question here:
I understand it is more efficient to loop over a vector of column names using := to assign:
for (col in paste0("V", 20:100))
dt[, col := sqrt(dt[[col]]), with = FALSE]
Aside: note that the new way of doing that is :
for (col in paste0("V", 20:100))
dt[ , (col) := sqrt(dt[[col]])]
because the with = FALSE wasn't easy to read whether it referred to the LHS or the RHS of :=. End aside.
As you know, that's efficient because that does each column one by one, so working memory is only needed for one column at a time. That can make a difference between it working and it failing with the dreaded out-of-memory error.
The problem with lapply on the RHS of := is that the RHS (the lapply) is evaluated first; i.e., the result for the 80 columns is created. That's 80 column's worth of new memory which has to be allocated and populated. So you need 80 column's worth of free RAM for that operation to succeed. That RAM usage dominates vs the subsequently instant operation of assigning (plonking) those 80 new columns into the data.table's column pointer slots.
As @Frank pointed to, if you have a lot of columns (say 10,000 or more), then the small overhead of dispatching to the [.data.table method starts to add up. To eliminate that overhead, there is data.table::set, which under ?set is described as a "loopable" :=. I use a for loop for this type of operation. It's the fastest way and is fairly easy to write and read.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(dt[[col]]))
Although with just 80 columns, it's unlikely to matter. (Note it may be more common to loop set over a large number of rows than a large number of columns.) However, looped set doesn't solve the problem of the repeated reference to the dt symbol name that you mentioned in the question :
I don't like this because I don't like referencing the data.table in a j expression.
Agreed. So the best I can do is revert to your looping of := but use get instead.
for (col in paste0("V", 20:100))
dt[, (col) := sqrt(get(col))]
However, I fear that using get in j may carry an overhead. Benchmarking was done in #1380. Also, perhaps it is confusing to use get() on the RHS but not on the LHS. To address that, we could sugar the LHS and allow get() as well, #1381:
for (col in paste0("V", 20:100))
dt[, get(col) := sqrt(get(col))]
Also, maybe value of set could be run within scope of DT, #1382.
for (col in paste0("V", 20:100))
set(dt, j = col, value = sqrt(get(col)))
These should work if you want to refer to the columns by string name:
n = paste0("V", 20:100)
dt[, (n) := lapply(n, function(x) {sqrt(get(x))})]
or
dt[, (n) := lapply(n, function(x) {sqrt(dt[[x]])})]
Is this what you are looking for?
dt[ , names(dt)[20:100] :=lapply(.SD, function(x) sqrt(x) ) , .SDcols=20:100]
I have heard tell that using .SD is not so efficient because it makes a copy of the table beforehand, but if your table isn't huge (obviously that's relative depending on your system specs) I doubt it will make much of a difference.
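If in doubt, a hedged way to check that claim on your own table is to time the two forms; copy() keeps the repeated runs comparable, and timings depend on your data and machine, so no results are shown here.
library(microbenchmark)
cols <- names(dt)[20:100]
microbenchmark(
  SD_way  = { d <- copy(dt); d[, (cols) := lapply(.SD, sqrt), .SDcols = cols] },
  set_way = { d <- copy(dt); for (col in cols) set(d, j = col, value = sqrt(d[[col]])) },
  times = 10
)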
This is from an observation I made while answering this question from @sds here.
First, let me switch on the trace messages for data.table:
options(datatable.verbose = TRUE)
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")
Now, suppose one wants to get the sum of all columns grouped by column a, then, we could do:
dt.out <- dt[, lapply(.SD, sum), by = a]
Now, suppose I'd want to add also the number of entries that belong to each group to dt.out, then I normally assign it by reference as follows:
dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]
In this assignment by reference, one of the messages data.table produces is:
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
This is a message from the file assign.C in data.table's source directory. I don't want to paste the relevant snippet here as it's about 18 lines; if necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5. So it's not a named vector. And I don't understand what this recycled list in RHS is...
But if I do:
dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]
Then, I get the message:
Direct plonk of unnamed RHS, no copy.
As I understand this, the RHS has been duplicated in the first case, meaning it's making a copy (shallow/deep, this I don't know). If so, why is this happening?
Even if not, why does the assignment by reference differ between the two internally? Any ideas?
To bring out the main underlying question that I had in my mind while writing this post (and seem to have forgotten!): Is it "less efficient" to assign as dt.out[, count := dt[, .N, by=a][["N"]]] (compared to the second way of doing it)?
Update: The expression,
DT[, c(..., lapply(.SD, .), ...), by=.]
has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:
o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).
For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet.
This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.
Where it says NAMED vector it means that in the internal R sense at C level; i.e., whether an object has been assigned a symbol and is called something, not whether an atomic vector has a "names" attribute or not. The NAMED value in the SEXP structure takes value 0, 1 or 2. R uses that to know whether it needs to copy-on-subassign or not. See section 1.1.2 of R-ints.
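For the curious, one hedged way to peek at this internal field is .Internal(inspect()); the exact header printed (NAM(n) in older R, REF(n) in newer R) varies by version, so treat the comments below as indicative only.
x <- c(5, 5)
.Internal(inspect(x))   # prints the SEXP header, including the NAMED/REF count
y <- x                  # a second binding typically bumps that count
.Internal(inspect(x))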
What would be better is if optimization of j in data.table could handle :
DT[, c(lapply(.SD,sum),.N), by=a]
That works but may be slow. Currently only the simpler form is optimized :
DT[, lapply(.SD,sum), by=a]
To answer main question, yes the following :
Direct plonk of unnamed RHS, no copy.
is desirable compared to :
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
Another way to achieve this is :
dt.out[, count := dt[, .N, by=a]$N]
I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
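A hedged way to see which message each form triggers on your own installation (the exact messages, and whether a copy is actually taken, may differ across data.table versions; these are the two messages quoted above):
options(datatable.verbose = TRUE)
dt.out[, count := dt[, .N, by = a][["N"]]]   # reported above as: RHS for item 1 has been duplicated...
dt.out[, count := dt[, .N, by = a]$N]        # reported above as: Direct plonk of unnamed RHS, no copy.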