I am looking for a way to manipulate multiple columns in a data.table in R. Because I have to address the columns dynamically and also pass in a second input, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all values by the value on that date, e.g.:
library(data.table)
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
X1 = cumsum(rnorm(10)),
X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df))*(df$date==indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols],
2,
function(x, i){x / x[i]}, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
div <- as.numeric(dt[rownum, nam, with = FALSE])
dt[ ,
nam := dt[,nam, with = FALSE] / div,
with=FALSE]
}
In particular, all the with = FALSE arguments do not look very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option is to use set(), since this involves multiple columns. set() avoids the overhead of [.data.table, which makes it faster.
library(data.table)
for(j in cols){
set(dt, i=NULL, j=j, value= dt[[j]]/dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) :=lapply(.SD, function(x) x/x[rownum]), .SDcols=cols]
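To check that the two variants agree, here is a minimal self-contained sketch. It rebuilds the question's data and applies each variant to its own copy. Looking up the row with which() is my substitution for the max(...) trick and assumes the index date occurs exactly once; dt_set and dt_sd are names made up for this example.

```r
library(data.table)

set.seed(132)
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
indexDate <- as.Date("2000-01-05")
cols <- grep("^X", names(dt), value = TRUE)

# row of the index date; assumes the date occurs exactly once
rownum <- which(dt$date == indexDate)

# variant 1: set() in a loop, on a copy so the original stays untouched
dt_set <- copy(dt)
for (j in cols) {
  set(dt_set, i = NULL, j = j, value = dt_set[[j]] / dt_set[[j]][rownum])
}

# variant 2: := with lapply(.SD, ...), again on a copy
dt_sd <- copy(dt)
dt_sd[, (cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = cols]

# both variants should give the same table
all.equal(dt_set, dt_sd)
```

Which one is faster in practice depends on the number of columns and rows, so it is worth timing both on your own data.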
Following up on your code and the answer given by akrun, I would recommend using .SDcols to extract the numeric columns and lapply to loop through them. Here's how I would do it:
index <-as.Date("2000-01-05")
rownum<-max((dt$date==index)*(1:nrow(dt)))
dt[, lapply(.SD, function (i) i/i[rownum]), .SDcols = is.numeric]
Using .SDcols can be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
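Note that the call above returns a new table holding only the numeric columns; dt itself is not changed and the date column is dropped from the result. A sketch of the in-place version follows. It collects the numeric column names with sapply() first, which I am assuming is acceptable here and which does not rely on .SDcols accepting a function (as far as I know that needs a reasonably recent data.table version). num_cols is a name made up for this example.

```r
library(data.table)

set.seed(132)
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
index <- as.Date("2000-01-05")
rownum <- which(dt$date == index)   # assumes the date occurs exactly once

# names of the numeric columns; the Date column is not numeric and is skipped
num_cols <- names(dt)[sapply(dt, is.numeric)]

# overwrite those columns by reference, keeping the date column in place
dt[, (num_cols) := lapply(.SD, function(x) x / x[rownum]), .SDcols = num_cols]
```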
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
A = runif(n),
B = rnorm(n),
C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
A = runif(n),
B = rnorm(n),
C = rexp(n))
columns <- c("A", "B", "C")
for (x in columns) {
# This doesn't work.
# target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]
# This doesn't work.
#target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]
# THIS WORKS! But it is tedious writing "get(x)" every time.
target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]
set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which could possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x- mean(x))/sd(x)),
year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
I think Ronak's answer is superior and preferable. I'm just writing this to demonstrate that a common syntax for more complicated j queries is a full {} expression:
target <- DT[ , by = year(date), {
xval = eval(as.name(x))
(xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
I replaced [ , V1] with $V1 -- the former requires the overhead of [.data.table.
You might also like to know that the base function scale will do the centering and normalizing steps more concisely (if slightly inefficiently, for being a bit too general).
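For what it's worth, here is a sketch of how scale could be combined with the .SDcols approach from the other answer. scale() returns a one-column matrix when given a vector, so I wrap it in as.vector(); that wrapper and the A.check column are my additions for illustration, not something from the question.

```r
library(data.table)

set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date("2000/1/1"), by = "day", length.out = n),
  A = runif(n), B = rnorm(n), C = rexp(n))

columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")

# scale() centers and divides by the standard deviation;
# as.vector() turns its one-column matrix back into a plain vector
DT[, (newcolumns) := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]

# compare with the explicit formula for one column (A.check is a made-up name)
DT[, A.check := (A - mean(A)) / sd(A), by = year(date)]
all.equal(DT$A.prime, DT$A.check)
```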
How can I program a loop so that all eight tables are calculated one after the other?
The code:
dt_M1_I <- M1_I
dt_M1_I <- data.table(dt_M1_I)
dt_M1_I[,I:=as.numeric(gsub(",",".",I))]
dt_M1_I[,day:=substr(t,1,10)]
dt_M1_I[,hour:=substr(t,12,16)]
dt_M1_I_median <- dt_M1_I[,list(median_I=median(I,na.rm = TRUE)),by=.(day,hour)]
This should be calculated for:
M1_I
M2_I
M3_I
M4_I
M1_U
M2_U
M3_U
M4_U
Thank you very much for your help!
Whenever you have several variables of the same kind, especially when you find yourself numbering them, as you did, step back and replace them with a single list variable. I do not recommend doing what the other answer suggested.
That is, instead of M1_I…M4_I and M1_U…M4_U, have two variables m_i and m_u (using lower case in variable names is conventional), which are each lists of four data.tables.
Alternatively, you might want to use a single variable, m, which contains nested lists of data.tables (m = list(list(i = …, u = …), …)).
Assuming the first, you can then iterate over them as follows:
give_this_a_meaningful_name = function (df) {
dt <- data.table(df)
dt[, I := as.numeric(gsub(",", ".", I))]
dt[, day := substr(t, 1, 10)]
dt[, hour := substr(t, 12, 16)]
dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}
m_i_median = lapply(m_i, give_this_a_meaningful_name)
(Note also the introduction of consistent spacing around operators; good readability is paramount for writing bug-free code.)
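To make this concrete, here is a small runnable sketch with made-up input. The timestamps, the readings, the helper make_raw and the function name median_by_hour are all hypothetical; I am only assuming that t holds timestamp strings and I holds decimal-comma strings, as the question's code suggests.

```r
library(data.table)

# hypothetical raw tables: timestamp strings in `t`, decimal-comma strings in `I`
make_raw <- function(values) {
  data.frame(t = c("2020-01-01 10:00:05", "2020-01-01 10:00:35", "2020-01-01 11:00:05"),
             I = values,
             stringsAsFactors = FALSE)
}
m_i <- list(m1 = make_raw(c("1,0", "3,0", "5,5")),
            m2 = make_raw(c("2,0", "4,0", "6,5")))

median_by_hour <- function(df) {
  dt <- data.table(df)
  dt[, I := as.numeric(gsub(",", ".", I))]   # "1,0" -> 1.0
  dt[, day := substr(t, 1, 10)]              # "2020-01-01"
  dt[, hour := substr(t, 12, 16)]            # "10:00"
  dt[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
}

m_i_median <- lapply(m_i, median_by_hour)

# optional: stack the results, keeping the list names in a `series` column
all_medians <- rbindlist(m_i_median, idcol = "series")
```

Stacking with rbindlist is optional, but a single long table is often easier to filter and plot than a list of small ones.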
You can use a combination of a for loop and the get/assign functions like this:
# create a vector of the data.frame names
dts <- c('M1_I', 'M2_I', 'M3_I', 'M4_I', 'M1_U', 'M2_U', 'M3_U', 'M4_U')
# iterate over each dataframe
for (dt in dts){
# get the actual dataframe (not the string name of it)
tmp <- get(dt)
tmp <- data.table(tmp)
tmp[, I:=as.numeric(gsub(",",".",I))]
tmp[, day:=substr(t,1,10)]
tmp[, hour:=substr(t,12,16)]
tmp <- tmp[,list(median_I=median(I,na.rm = TRUE)),by=.(day,hour)]
# assign the modified dataframe to the name you want (the paste adds the 'dt_' to the front)
assign(paste0('dt_', dt), tmp)
}
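If the tables already exist as separate variables, a possible middle ground is to collect them into a named list once with mget() and keep the results in a list as well, rather than writing each one back with assign(). This is only a sketch: the two stand-in tables and their contents are made up, and the names tables and medians are mine.

```r
library(data.table)

# stand-ins for the existing tables (hypothetical data)
M1_I <- data.frame(t = c("2020-01-01 10:00:05", "2020-01-01 10:00:35"),
                   I = c("1,0", "3,0"), stringsAsFactors = FALSE)
M2_I <- data.frame(t = c("2020-01-01 10:00:05", "2020-01-01 10:00:35"),
                   I = c("2,0", "6,0"), stringsAsFactors = FALSE)

# look the variables up by name once and keep them as a named list
tables <- mget(c("M1_I", "M2_I"), envir = environment())

medians <- lapply(tables, function(df) {
  tmp <- data.table(df)
  tmp[, I := as.numeric(gsub(",", ".", I))]
  tmp[, day := substr(t, 1, 10)]
  tmp[, hour := substr(t, 12, 16)]
  tmp[, list(median_I = median(I, na.rm = TRUE)), by = .(day, hour)]
})

# results are addressed by name, e.g. medians$M1_I
names(medians)
```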
When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: In (2), I am forcing data.table to overwrite dt on a column by column basis, so I would assume to need extra memory on the order of column size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
Following the suggestion of @Frank, the most efficient way (from a memory point of view) to replace a data.table column by column by applying a function my_fun to each column is
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This currently (v1.11.4) is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)] which internally is optimised to dt[, list(fun(a), fun(b), ...)], where a, b, ... are columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
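A small sketch to confirm that the two variants produce the same table, and that the set() loop works on the existing object rather than a copy. It does not measure peak memory, so it says nothing about which variant needs less of it. dt1, dt2 and addr_before are names made up for this example; address() is the helper exported by data.table.

```r
library(data.table)

my_fun <- function(x) x + 1

# variant (1): one := call over all columns
dt1 <- data.table(a = 1:10, b = 11:20)
all_cols <- colnames(dt1)
dt1[, (all_cols) := lapply(.SD, my_fun)]

# column-by-column variant with set()
dt2 <- data.table(a = 1:10, b = 11:20)
addr_before <- address(dt2)
for (col in all_cols) set(dt2, j = col, value = my_fun(dt2[[col]]))

# same result either way
all.equal(dt1, dt2)

# the table object is expected to sit at the same address as before the loop
identical(addr_before, address(dt2))
```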
I have this data.table with different column types.
I do not know the column names beforehand, and I would like to generate aggregations only for columns of a certain type (say, numeric). How can I achieve this with data.table?
For example, consider the below code:
dt <- data.table(ch=c('a','b','c'),num1=c(1,3,6), num2=1:9)
I need to create a function that accepts the above data.table and automatically performs calculations on the numeric fields, grouped by the character field (say, sum on num1 and mean on num2, by ch). How can I achieve this dynamically?
We can find the numeric columns using sapply(dt, is.numeric), but that gives the column names as strings, and I am not sure how to plug them into data.table. Help is appreciated. The code below gives an idea of what is required, but it does not work:
DoSomething <- function(dt)
{
numCols <- names(dt)[sapply(dt, is.numeric)]
chrCols <- names(dt)[sapply(dt, is.character)]
dt[,list(sum(numCols[1]), mean(numCols[2])), by=(chrCols), with=F]
}
You can achieve it using .SDcols argument. See example.
require(data.table)
dt <- data.table(ch=c('a','b','c'), num1=c(1,3,6), num2=1:9)
DoSomething <- function(dt) {
numCols <- names(dt)[sapply(dt, is.numeric)]
chrCols <- names(dt)[sapply(dt, is.character)]
dt[, list(sum(.SD[[1]]), mean(.SD[[2]])), by = chrCols, .SDcols = numCols]
}
DoSomething(dt)
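If the function applied to each column should itself be dynamic, one possible extension is to pass a named list of functions and pair it with .SD via Map(). This is a sketch and not part of the answer above: aggregate_by_chr and funs are hypothetical names, and the sample table is written out with nine explicit rows so that no recycling is involved.

```r
library(data.table)

# nine explicit rows, so no recycling is involved
dt <- data.table(ch = rep(c("a", "b", "c"), times = 3),
                 num1 = rep(c(1, 3, 6), times = 3),
                 num2 = 1:9)

# hypothetical helper: `funs` is a named list, one function per column to aggregate
aggregate_by_chr <- function(dt, funs) {
  chrCols <- names(dt)[sapply(dt, is.character)]
  # Map() pairs each function with its column of .SD; the names of `funs`
  # become the names of the result columns
  dt[, Map(function(f, x) f(x), funs, .SD), by = chrCols, .SDcols = names(funs)]
}

res <- aggregate_by_chr(dt, list(num1 = sum, num2 = mean))
res
```

The names of the list decide both which columns are aggregated and what the output columns are called, so there is no need to index .SD by position.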
@djhurio gives a nice solution to your problem.
.SD and .SDcols in data.table give you what you want.
If you apply the same calculation to every numeric column, you can try the following code.
require(data.table)
dt <- data.table(ch=c('a','b','c'), num1=c(1,3,6), num2=1:9)
DTfunction <- function(dt){
numCols <- names(dt)[sapply(dt, is.numeric)]
chrCols <- names(dt)[sapply(dt, is.character)]
# return the aggregated table (ending on an assignment would return it only invisibly)
dt[, lapply(.SD, mean), by = chrCols, .SDcols = numCols]
}
Cute code, isn't it? :)
I wish to change the class of selected variables in a data table, using a vectorized operation. I am new to the data.table syntax and am trying to learn as much as possible. I know the question is basic, but it will help me to better understand the data table way of thinking!
A similar question was asked here! However, the solution seems to pertain to either reclassing just one column or all columns. My question is unique to a select few columns.
### Load package
require(data.table)
### Create pseudo data
data <- data.table(id = 1:10,
height = rnorm(10, mean = 182, sd = 20),
weight = rnorm(10, mean = 160, sd = 10),
color = rep(c('blue', 'gold'), times = 5))
### Reclass all columns
data <- data[, lapply(.SD, as.character)]
### Search for columns to be reclassed
index <- grep('(id)|(height)|(weight)', names(data))
### data frame method
df <- data.frame(data)
df[, index] <- lapply(df[, index], as.numeric)
### Failed attempt to reclass columns using the data.table method
data <- data[, lapply(index, as.character), with = F]
Any help would be appreciated. My data are large and so using regular expressions to create a vector of column numbers to reclassify is necessary.
Thank you for your time.
You could avoid the overhead of the construction of .SD within j by using set
for(j in index) set(data, j =j ,value = as.character(data[[j]]))
I think that @SimonO101 did most of the job
data[, names(data)[index] := lapply(.SD, as.character) , .SDcols = index ]
You can just use the := magic
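The question's data.frame example converts the selected columns back to numeric, so here is a sketch of that direction with :=. Selecting by name with grep(..., value = TRUE) instead of by position is my choice for this example, and num_cols is a made-up name.

```r
library(data.table)

set.seed(1)
data <- data.table(id = 1:10,
                   height = rnorm(10, mean = 182, sd = 20),
                   weight = rnorm(10, mean = 160, sd = 10),
                   color = rep(c("blue", "gold"), times = 5))

# as in the question: turn every column into character first
data <- data[, lapply(.SD, as.character)]

# pick the target columns by name with a regular expression
num_cols <- grep("^(id|height|weight)$", names(data), value = TRUE)

# convert only those columns, by reference; `color` is left alone
data[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

sapply(data, class)
```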
You just need to use .SDcols with your index vector (I learnt that today!), but that will just return a data table with the reclassed columns. @dickoa's answer is what you are looking for.
data <- data[, lapply(.SD, as.character) , .SDcols = index ]
sapply(data , class)
id height weight
"character" "character" "character"