Dynamic aggregations with data.table in R using derived column names

I have this data.table with different column types.
I do not know the column names beforehand, and I would like to generate aggregations only for columns of a certain type (say, numeric). How can I achieve this with data.table?
For example, consider the below code:
dt <- data.table(ch=c('a','b','c'),num1=c(1,3,6), num2=1:9)
I need to create a function that accepts the above data.table and automatically performs calculations on the numeric fields grouped by the character field (say, sum on num1 and mean on num2, by ch). How can I achieve this dynamically?
We can find the numeric columns using sapply(dt, is.numeric), but that gives the column names as strings, and I am not sure how to plug them into data.table. Help is appreciated. The code below gives the idea of what is required, but it does not work:
DoSomething <- function(dt) {
  numCols <- names(dt)[sapply(dt, is.numeric)]
  chrCols <- names(dt)[sapply(dt, is.character)]
  # intended: sum the first numeric column and average the second, grouped by the character column
  dt[, list(sum(numCols[1]), mean(numCols[2])), by = (chrCols), with = F]
}

You can achieve this using the .SDcols argument. See the example:
require(data.table)
dt <- data.table(ch=c('a','b','c'), num1=c(1,3,6), num2=1:9)
DoSomething <- function(dt) {
  numCols <- names(dt)[sapply(dt, is.numeric)]
  chrCols <- names(dt)[sapply(dt, is.character)]
  dt[, list(sum(.SD[[1]]), mean(.SD[[2]])), by = chrCols, .SDcols = numCols]
}
DoSomething(dt)
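If you also want informative column names in the result, a small variant built on the same idea (the .sum/.mean naming scheme here is just illustrative):
DoSomethingNamed <- function(dt) {
  numCols <- names(dt)[sapply(dt, is.numeric)]
  chrCols <- names(dt)[sapply(dt, is.character)]
  # setNames() labels the aggregates, e.g. num1.sum and num2.mean
  dt[, setNames(list(sum(.SD[[1]]), mean(.SD[[2]])),
                paste0(numCols[1:2], c(".sum", ".mean"))),
     by = chrCols, .SDcols = numCols]
}
DoSomethingNamed(dt)
# expected per-group result: a -> num1.sum 3, num2.mean 4; b -> 9, 5; c -> 18, 6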

@djhurio gives a nice solution to your problem. .SD and .SDcols in data.table give you what you want.
In case you perform the same calculation on every numeric column, you can try the following code.
require(data.table)
dt <- data.table(ch=c('a','b','c'), num1=c(1,3,6), num2=1:9)
DTfunction <- function(dt) {
  numCols <- names(dt)[sapply(dt, is.numeric)]
  chrCols <- names(dt)[sapply(dt, is.character)]
  dt[, lapply(.SD, mean), by = chrCols, .SDcols = numCols]
}
Cute code, isn't it? :)
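If the calculation differs per column, one possible sketch (the funs mapping below is purely illustrative) is to pair each numeric column with its own function via Map inside j:
require(data.table)
dt <- data.table(ch = c('a','b','c'), num1 = c(1,3,6), num2 = 1:9)
numCols <- names(dt)[sapply(dt, is.numeric)]
chrCols <- names(dt)[sapply(dt, is.character)]
# illustrative mapping of numeric column -> aggregation function
funs <- list(num1 = sum, num2 = mean)
# Map() pairs funs[[k]] with the matching column of .SD; j returns a named list
dt[, Map(function(f, x) f(x), funs[numCols], .SD), by = chrCols, .SDcols = numCols]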

Related

data.table reference semantics: memory usage of iterating through all columns

When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: in (2), I am forcing data.table to overwrite dt column by column, so I would assume it needs extra memory on the order of one column's size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
Following the suggestion of @Frank, the most efficient way (from a memory point of view) to replace a data.table column by column, applying a function my_fun to each column, is
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This currently (v1.11.4) is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which is internally optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are the columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked in issue #1414.
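To see whether the lapply(.SD, ...) form actually gets rewritten on your installed version, a quick check (not a benchmark) is to run the expression with verbose output:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
# the verbose output reports whether j was internally optimised to list(my_fun(a), my_fun(b))
dt[, lapply(.SD, my_fun), verbose = TRUE]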

Standardize by group using data.table

Is it possible to use data.table to standardize a number of variables by a number of group variables?
DT <- data.table(V1 = 1:20, V2 = 40:21,
                 gr = c(rep('a', 10), rep('b', 10)),
                 grr = rep(c(rep('a', 5), rep('b', 5)), 2))
gr and grr are the group variables. I want to add to that data.table the columns V1.z and V2.z, which are the standardized scores within each gr-by-grr group.
Here is some extremely stupid code for that, just to explain what I want:
DTaa <- DT[gr=='a' & grr=='a',]
DTab <- DT[gr=='a' & grr=='b',]
DTba <- DT[gr=='b' & grr=='a',]
DTbb <- DT[gr=='b' & grr=='b',]
DTaa <- DTaa[,V1.z := scale(V1)]
DTaa <- DTaa[,V2.z := scale(V2)]
DTab <- DTab[,V1.z := scale(V1)]
DTab <- DTab[,V2.z := scale(V2)]
DTba <- DTba[,V1.z := scale(V1)]
DTba <- DTba[,V2.z := scale(V2)]
DTbb <- DTbb[,V1.z := scale(V1)]
DTbb <- DTbb[,V2.z := scale(V2)]
DTn <- rbind(DTaa, DTab, DTba, DTbb)
Probably, there is a way to do it using by in one or two lines.
I'm hoping to then use it in a function that accepts data, the target variables (in the example, V1 and V2), and group variables (in the example, gr and grr) as arguments.
If you have a solution that does not use data.table, it's also good (I tried using mutate_at from dplyr but couldn't find much documentation about that function).
After grouping by 'gr' and 'grr', loop over the Subset of Data.table (.SD), scale it (the output of scale is a matrix, so we convert it to vector with as.vector) and assign (:=) the output to the new columns.
DT[, paste0(names(DT)[1:2], ".z") := lapply(.SD, function(x) as.vector(scale(x))),
   by = .(gr, grr)]
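Wrapped into the kind of function asked for (the name standardize_by and its arguments are just illustrative), and using the DT from the question, this could look roughly like:
library(data.table)
# hypothetical wrapper: z-score the columns in `vars` within every combination of `groups`
standardize_by <- function(dt, vars, groups) {
  new_cols <- paste0(vars, ".z")
  dt[, (new_cols) := lapply(.SD, function(x) as.vector(scale(x))),
     by = groups, .SDcols = vars]
  dt[]
}
standardize_by(DT, vars = c("V1", "V2"), groups = c("gr", "grr"))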

how to deal with data.table column as.character in R?

I'm trying to use data.table rather than data.frame (for faster code). Despite the syntax differences between them, I'm having problems when I need to extract a specific character column and use it as a character vector. When I call:
library(data.table)
DT <- fread("file.txt")
vect <- as.character(DT[, 1, with = FALSE])
class(vect)
###[1] "character"
head(vect)
It returns:
[1] "c(\"uc003hzj.4\", \"uc021ofx.1\", \"uc021olu.1\", \"uc021ome.1\", \"uc021oov.1\", \"uc021opl.1\", \"uc021osl.1\", \"uc021ovd.1\", \"uc021ovp.1\", \"uc021pdq.1\", \"uc021pdv.1\", \"uc021pdw.1\")
Any ideas of how to avoid these "\" in the output?
as.character works on vectors, not on data.frame/data.table objects in the way the OP expected. So, if we need to get the first column as a character vector, subset it with .SD[[1L]] and apply as.character:
DT[, as.character(.SD[[1L]])]
If there are multiple columns, we can specify the column indices with .SDcols, loop over .SD to convert to character, and assign (:=) the output back to those columns.
DT[, (1:2) := lapply(.SD, as.character), .SDcols= 1:2]
data
DT <- data.table(Col1 = 1:5, Col2= 6:10, Col3= LETTERS[1:5])
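For completeness, a few equivalent ways to pull a single column out as a plain vector (using the same example DT), which avoids the deparsed string the question ran into:
library(data.table)
DT <- data.table(Col1 = 1:5, Col2 = 6:10, Col3 = LETTERS[1:5])
as.character(DT[[1L]])       # extract by position, then convert
as.character(DT[["Col1"]])   # extract by name
DT[, as.character(Col1)]     # convert inside j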

Changing multiple Columns in data.table r

I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically, and a second input is involved as well, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all of their values by the value at that date, e.g.:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df))*(df$date==indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols], 2,
                    function(x, i) {x / x[i]}, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for (nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[, nam := dt[, nam, with = FALSE] / div, with = FALSE]
}
Especially all the with = FALSE parts do not look very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table, which makes it faster.
library(data.table)
for (j in cols) {
  set(dt, i = NULL, j = j, value = dt[[j]] / dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) :=lapply(.SD, function(x) x/x[rownum]), .SDcols=cols]
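Both forms should give identical values; a quick sanity check under the question's setup might look like this (a sketch, re-creating the data from the question):
library(data.table)
set.seed(132)
dt1 <- data.table(date = seq(as.Date("2000-01-01"), by = "days", length.out = 10),
                  X1 = cumsum(rnorm(10)), X2 = cumsum(rnorm(10)))
dt2 <- copy(dt1)
cols <- c("X1", "X2")
rownum <- which(dt1$date == as.Date("2000-01-05"))
# variant 1: set() column by column
for (j in cols) set(dt1, j = j, value = dt1[[j]] / dt1[[j]][rownum])
# variant 2: := with .SDcols
dt2[, (cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = cols]
all.equal(dt1, dt2)   # should be TRUE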
Following up on your code and the answer given by akrun, I would recommend using .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <- as.Date("2000-01-05")
rownum <- max((dt$date == index) * (1:nrow(dt)))
dt[, lapply(.SD, function(i) i / i[rownum]), .SDcols = is.numeric]
Using .SDcols can be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
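Note that the function form .SDcols = is.numeric requires a reasonably recent data.table release. Also, the call above returns a new table; if the goal is to overwrite the numeric columns in place (as in the question), a sketch building on the same idea, reusing dt and rownum from above, is:
library(data.table)
# select the numeric columns by type, then write the rebased values back by reference
num_cols <- names(dt)[sapply(dt, is.numeric)]
dt[, (num_cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = num_cols]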

How to aggregate (using "by") a data.table with customized column name without ":="?

I know I can do this
a <- dt[,sum(x), by=y]
Also I can do this
dt[,z:=sum(x), by=y] # this would modify dt
However I don't get why I cannot do this:
a <- dt[,z=sum(x), by=y]
How should I perform a "summarize" with customized column names?
Is this the only option?
a <- copy(dt)
a[,z:=sum(x), by=y]
You're looking for list() in the j argument:
a <- dt[,list(z=sum(x)), by=y]
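For example, with a small illustrative dt, .() is data.table's shorthand for list(), and several named summaries can be returned at once:
library(data.table)
dt <- data.table(x = 1:6, y = rep(c("a", "b"), each = 3))
# named columns in j via .() / list(); .N gives the group size
a <- dt[, .(z = sum(x), m = mean(x), n = .N), by = y]
a
# expected: y "a" -> z 6, m 2, n 3;  y "b" -> z 15, m 5, n 3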
