Is there any way to fill a completely empty data.table in R? I have to fill a data.table and columns are given by a function called in a loop. I don't know how many columns will be created or what length they will be before launching the function but I do know that all will be the same length.
My approach is to create an empty data.table and fill it within the loop. But this does not work, either because I cannot append columns to an empty data.table or because I cannot insert rows properly. Check the toy example below (for the sake of simplicity, let's leave out the for loop).
f <- function() c(1, 2, 3) # stand-in for the real function; just imagine it returns c(1,2,3)
# this does not work. fails with error msg: Cannot use := to add columns to a null data.table (no columns), currently
dt <- data.table()
dt[, c("b", "c") := list(f(), f())]
# this actually works, but it creates a 0-row data.table which cannot be filled later
dt <- data.table(a = numeric() )
dt[, c("b", "c") := list(numeric(), numeric())]
dt[, c("b", "c") := list(f(), f())] # no rows are added
# this workaround works but is ugly as hell
dt <- data.table(a = rep(NA, length(f())) )
dt[, c("b", "c") := list(f(), f())]
dt[, a := NULL]
So is there any elegant/efficient way of approaching this?
You can use something like this:
library("data.table")
f <- function(x) c(1,2,3)
dt <- as.data.table(lapply(11:13, f))
setnames(dt, c("a", "b", "c"))
The lapply() is doing the loop you mentioned in your question.
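If the columns really do arrive one at a time inside a for loop, one option is to collect them in a plain list and convert only once at the end. This is a minimal sketch, assuming a hypothetical stand-in f() that always returns vectors of the same length; the column names col1, col2, ... are made up for illustration.

```r
library(data.table)

# Hypothetical stand-in for the real function; assumed to return
# vectors that all have the same length.
f <- function(i) c(1, 2, 3) * i

cols <- list()                      # plain list, grows by one element per iteration
for (i in 1:3) {
  cols[[paste0("col", i)]] <- f(i)  # illustrative names, not from the question
}

dt <- as.data.table(cols)           # convert once, after the loop
```

This sidesteps the empty data.table entirely, since the table is only created once every column is known.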
Related
I have a very simple code that reproduces my issue in R (and data.table package):
If I save the column names into a variable without copy(), it gets overwritten when using the := operator later on in the code:
input_table = data.table(A = 1:3, B = 2:4)
some_value = 0.025
original_colnames <- names(input_table)
original_colnames_copy <- copy(names(input_table))
identical(original_colnames, c("A", "B")) # returns TRUE
# create a new column
input_table[, C := A + some_value]
identical(original_colnames, c("A", "B")) # returns FALSE, original_colnames contains "C" as well
identical(original_colnames_copy, c("A", "B")) # returns TRUE
This problem does not occur if I use the following code (dplyr's mutate) to create the new column:
input_table = mutate(input_table, C= A + some_value)
Is this intended (am I missing some deeper knowledge in R / data.table)?
R.version is 4.1.0, data.table version is 1.14.0.
Answered by the comments below the question:
Why does data.table update names(DT) by reference, even if I assign to another variable?
This question already has answers here:
Apply a function to every specified column in a data.table and update by reference
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
A = runif(n),
B = rnorm(n),
C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
A = runif(n),
B = rnorm(n),
C = rexp(n))
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work.
  # target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]
  # This doesn't work.
  # target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]
  # THIS WORKS! But it is tedious writing "get(x)" every time.
  target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which could possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x- mean(x))/sd(x)),
year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
I think Ronak's answer is superior and preferable; I'm just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[ , by = year(date), {
xval = eval(as.name(x))
(xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
I replaced [ , V1] with $V1 -- the former requires the overhead of [.data.table.
You might also like to know that the base function scale will do the center & normalize steps more concisely (if slightly inefficiently, for being a bit too general).
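As a sketch of how scale could be combined with the .SDcols approach: scale returns a one-column matrix, so the result is flattened with as.numeric before assignment. The setup is the one from the question; the as.numeric wrapper is my own addition, not something either answer shows.

```r
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date("2000/1/1"), by = "day", length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))

columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")

# scale() centers and divides by the standard deviation, but returns a
# matrix; as.numeric() turns it back into a plain vector for the column.
DT[, (newcolumns) := lapply(.SD, function(x) as.numeric(scale(x))),
   by = year(date), .SDcols = columns]
```

Unlike the versions in the question, this does not pass na.rm = TRUE anywhere, so check how missing values should be treated before relying on it.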
When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: In (2), I am forcing data.table to overwrite dt on a column by column basis, so I would assume to need extra memory on the order of column size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
Following the suggestion of @Frank, the most efficient way (from a memory point of view) to replace a data.table column by column by applying a function my_fun to each column, is
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This currently (v1.11.4) is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)], which internally is optimised to dt[, list(my_fun(a), my_fun(b), ...)], where a, b, ... are columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
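For reference, the two approaches can be run side by side to confirm they yield the same columns; the difference discussed here is about memory use only, which this sketch does not measure. The names dt1 and dt2 are introduced just for the comparison.

```r
library(data.table)

my_fun <- function(x) x + 1

dt1 <- data.table(a = 1:10, b = 11:20)
dt2 <- copy(dt1)                      # independent copy for the comparison

# Variant (1) from the question: a single := over all columns
dt1[, (names(dt1)) := lapply(.SD, my_fun)]

# set() loop from the answer: one column computed and replaced per iteration
for (col in names(dt2)) set(dt2, j = col, value = my_fun(dt2[[col]]))

all.equal(dt1, dt2)
```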
I often need to store complex objects in a data.table, so I use list columns a lot. Sometimes I need to initialize a column with an empty list, and sometimes I need to copy a list column to another column.
At first there seemed to be nothing wrong with these:
# the example at https://stackoverflow.com/a/22536321/3718827
dt = data.table(id = 1:2, comment = vector("list", 2L))
# my usage needs to add a column to an existing dt rather than create a data.table from scratch, so I modified it a little bit
dt <- data.table(id = 1:2)
dt[, comment := vector("list", nrow(dt))]
dt[1, comment := .(list(list(a = "a", b = "b")))]
dt[, edited_comment := comment]
However, if there is only one row, the code above no longer works:
> dt <- data.table(id = 1)
> dt[, comment := vector("list", nrow(dt))]
Warning message:
In `[.data.table`(dt, , `:=`(comment, vector("list", nrow(dt)))) :
Adding new column 'comment' then assigning NULL (deleting it).
> dt[1, comment := .(list(list(a = "a", b = "b")))]
> dt[, edited_comment := comment]
Warning message:
In `[.data.table`(dt, , `:=`(edited_comment, comment)) :
Supplied 2 items to be assigned to 1 items of column 'edited_comment' (1 unused)
I tried a different syntax and this seemed to work
> dt <- data.table(id = 1)
> dt[, comment := .(list(NULL))]
> dt[1, comment := .(list(list(a = "a", b = "b")))]
> dt[, edited_comment := list(comment)]
For the list column initialization, maybe the example code I found is only meant for a data.table() call and should not be used to add a column.
For copying a column, the behavior is understandable, but it is surprising that it works with nrow > 1 and not with nrow = 1.
My question is: what is the proper syntax for these tasks?
And is the difference in behavior between nrow > 1 and nrow = 1 in the first version expected, or is it a bug?
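One way to read the warnings above is that a bare list on the right-hand side of := is treated as a list of columns, so a length-1 list is taken as "one column", not "one list cell". A sketch that generalizes the working variant, assuming that wrapping the value in an extra list() is acceptable in your code:

```r
library(data.table)

dt <- data.table(id = 1)

# The outer list() means "one column"; the inner list is that column's value.
dt[, comment := list(vector("list", nrow(dt)))]

# Same assignment as in the question: fill the single cell with a named list.
dt[1, comment := .(list(list(a = "a", b = "b")))]

# Same wrapping when copying a list column to a new column.
dt[, edited_comment := list(comment)]
```

The same wrapped form should behave identically for tables with more than one row, which avoids special-casing nrow = 1.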
This question already has an answer here:
Apply a function to a subset of data.table columns, by column-indices instead of name
I want to be able to apply a function over a subset of columns, and return those columns that have been manipulated along with the rest of the data columns that weren't touched. Is there a way to do this with data.table? I wasn't able to figure out the syntax.
In this example I have NAs and want to overwrite them with something else for a few different columns. I need a way to also return other columns that weren't touched.
library(data.table)
# make data set
a <- sample(c(letters[1:5], NA), 50, replace=TRUE)
b <- sample(c(LETTERS[1:5], NA), 50, replace=TRUE)
c <- sample(runif(50))
x <- data.table(a,b,c)
# function to apply to a single column
overwriteNA <- function(vec, new="") ifelse(is.na(vec), new, vec)
# Only returns .SDcols but would like to also include rest of columns in data.table
x[, lapply(.SD, overwriteNA), .SDcols=c("a", "b")]
# Need something along these lines
x[, `:=` lapply(.SD, overwriteNA), .SDcols=c("a", "b")]
Try
x[, c("a", "b") := lapply(.SD, overwriteNA), .SDcols = c("a", "b")]
Edit:
Per the OP's additional request:
myCols <- c("a", "b")
x[, (myCols) := lapply(.SD, overwriteNA), .SDcols = myCols]
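If the columns are known by position instead of by name, a sketch along the same lines is to pass the positions to .SDcols and translate them to names for the left-hand side. The seed and the idx variable are assumptions added only to make the example reproducible and self-contained.

```r
library(data.table)
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

x <- data.table(
  a = sample(c(letters[1:5], NA), 50, replace = TRUE),
  b = sample(c(LETTERS[1:5], NA), 50, replace = TRUE),
  c = runif(50))

overwriteNA <- function(vec, new = "") ifelse(is.na(vec), new, vec)

idx <- 1:2                   # positions of the columns to modify
myCols <- names(x)[idx]      # translate positions to names for the LHS
x[, (myCols) := lapply(.SD, overwriteNA), .SDcols = idx]
```

Column c is not mentioned anywhere in the call, so it stays in x as it was.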