I often need to store complex objects in a data.table, so I use list columns a lot. Sometimes I need to initialize a column with empty lists, and sometimes I need to copy a list column to another column.
At first there seemed to be nothing wrong with these:
# the example at https://stackoverflow.com/a/22536321/3718827
dt = data.table(id = 1:2, comment = vector("list", 2L))
# my use case needs to add a column to an existing dt, not create a data.table from scratch, so I modified it a little bit
dt <- data.table(id = 1:2)
dt[, comment := vector("list", nrow(dt))]
dt[1, comment := .(list(list(a = "a", b = "b")))]
dt[, edited_comment := comment]
However, if there is only one row, the code above no longer works:
> dt <- data.table(id = 1)
> dt[, comment := vector("list", nrow(dt))]
Warning message:
In `[.data.table`(dt, , `:=`(comment, vector("list", nrow(dt)))) :
Adding new column 'comment' then assigning NULL (deleting it).
> dt[1, comment := .(list(list(a = "a", b = "b")))]
> dt[, edited_comment := comment]
Warning message:
In `[.data.table`(dt, , `:=`(edited_comment, comment)) :
Supplied 2 items to be assigned to 1 items of column 'edited_comment' (1 unused)
I tried a different syntax, and this seemed to work:
> dt <- data.table(id = 1)
> dt[, comment := .(list(NULL))]
> dt[1, comment := .(list(list(a = "a", b = "b")))]
> dt[, edited_comment := list(comment)]
For the list column initialization, maybe the example code I found is only meant for the data.table() call and should not be used to add a column.
For copying a column, the behavior is understandable but kind of surprising, given that it works with nrow > 1 but fails with nrow = 1.
My question is: what's the proper syntax for these tasks?
And is the behavior difference between nrow > 1 and nrow = 1 in the first version expected behavior, or a bug?
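For reference, here is the version that worked for me, consolidated into one self-contained sketch. As far as I can tell it behaves the same for nrow = 1 and nrow > 1; the extra list() wrapper keeps := from unwrapping the inner list (this generalization is my own, not from the answer I linked):
library(data.table)

dt <- data.table(id = 1)
# the outer list() protects the inner list of NULLs from being unwrapped by :=
dt[, comment := list(vector("list", nrow(dt)))]
dt[1, comment := .(list(list(a = "a", b = "b")))]
dt[, edited_comment := list(comment)]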
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by = 'day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by = 'day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work.
  # target <- DT[, (x - mean(x, na.rm = TRUE))/sd(x, na.rm = TRUE), by = year(date)]
  # This doesn't work.
  # target <- DT[, (..x - mean(..x, na.rm = TRUE))/sd(..x, na.rm = TRUE), by = year(date)]
  # THIS WORKS! But it is tedious writing "get(x)" every time.
  target <- DT[, (get(x) - mean(get(x), na.rm = TRUE))/sd(get(x), na.rm = TRUE), by = year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things which may possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x - mean(x))/sd(x)),
   by = year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
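A quick, purely illustrative check (my addition) that the update worked: within each year the transformed columns should have mean ~0 and sd ~1.
# per-year mean and sd of the new columns, rounded for readability
DT[, lapply(.SD, function(x) round(c(mean(x), sd(x)), 2)),
   by = year(date), .SDcols = newcolumns]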
I think Ronak's answer is superior & preferable; I'm just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[, by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm = TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
I replaced [ , V1] with $V1 -- the former requires the overhead of [.data.table.
You might also like to know that the base function scale will do the center & normalize steps more concisely (if slightly inefficiently, for being a bit too general).
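For example, a sketch of that idea (my addition; note that scale() returns a one-column matrix, so it has to be dropped back to a plain vector before assignment):
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
# scale() centers and divides by the sample sd, like the manual version
DT[, (newcolumns) := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]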
When iterating through all columns in an R data.table using reference semantics, what makes more sense from a memory usage standpoint:
(1) dt[, (all_cols) := lapply(.SD, my_fun)]
or
(2) lapply(colnames(dt), function(col) dt[, (col) := my_fun(dt[[col]])])[[1]]
My question is: In (2), I am forcing data.table to overwrite dt column by column, so I would assume it needs extra memory on the order of one column's size. Is this also the case for (1)? Or is all of lapply(.SD, my_fun) evaluated before the original columns are overwritten?
Some sample code to run the above variants:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
Following the suggestion of #Frank, the most efficient way (from a memory point of view) to replace a data.table column by column, applying a function my_fun to each column, is:
library(data.table)
dt <- data.table(a = 1:10, b = 11:20)
my_fun <- function(x) x + 1
all_cols <- colnames(dt)
for (col in all_cols) set(dt, j = col, value = my_fun(dt[[col]]))
This currently (v1.11.4) is not handled in the same way as an expression like dt[, lapply(.SD, my_fun)] which internally is optimised to dt[, list(fun(a), fun(b), ...)], where a, b, ... are columns in .SD (see ?datatable.optimize). This might change in the future and is being tracked by #1414.
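If you want to observe the difference yourself, a rough sketch (my addition; the gc() numbers are only indicative and vary by platform) is to reset R's memory statistics before each variant and compare the reported peaks:
library(data.table)
my_fun <- function(x) x + 1

# variant (1): the whole lapply() result is materialized before assignment
dt <- data.table(a = rnorm(1e7), b = rnorm(1e7))
invisible(gc(reset = TRUE))
dt[, (colnames(dt)) := lapply(.SD, my_fun)]
gc()  # inspect the "max used" column for the peak

# variant with set(): only one new column is alive at a time
dt <- data.table(a = rnorm(1e7), b = rnorm(1e7))
invisible(gc(reset = TRUE))
for (col in colnames(dt)) set(dt, j = col, value = my_fun(dt[[col]]))
gc()  # peak should be roughly one column's worth lower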
Is there any way to fill a completely empty data.table in R? I have to fill a data.table whose columns are given by a function called in a loop. I don't know how many columns will be created or what length they will be before launching the function, but I do know that all of them will be the same length.
My approach is to create an empty data.table and fill it within the loop. But this does not work, either because I cannot append columns to an empty data.table or because I cannot insert rows properly. Check the toy example below (for the sake of simplicity, let's avoid the for loop).
f <- function() c(1, 2, 3) # stand-in for someFunction(); just imagine it returns c(1,2,3)
# this does not work. fails with error msg: Cannot use := to add columns to a null data.table (no columns), currently
dt <- data.table()
dt[, c("b", "c") := list(f(), f())]
# this actually works, but creates a 0-row data.table which cannot be filled later
dt <- data.table(a = numeric() )
dt[, c("b", "c") := list(numeric(), numeric())]
dt[, c("b", "c") := list(f(), f())] # no rows are added
# this workaround works but is ugly as hell
dt <- data.table(a = rep(NA, length(f())) )
dt[, c("b", "c") := list(f(), f())]
dt[, a := NULL]
So, is there any elegant/efficient way of approaching this?
You can use something like this:
library("data.table")
f <- function(x) c(1,2,3)
dt <- as.data.table(lapply(11:13, f))
setnames(dt, c("a", "b", "c"))
The lapply() is doing the loop you mentioned in your question.
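If the columns really are produced one at a time inside a loop, another possible pattern (a sketch of mine, assuming every column comes out the same length) is to accumulate them in a plain named list and convert once at the end:
f <- function() c(1, 2, 3)           # stand-in for the column-producing function
cols <- list()
for (nm in c("a", "b", "c")) cols[[nm]] <- f()
setDT(cols)                          # converts the named list to a data.table by reference
cols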
I'm trying to manipulate a number of data.tables in similar ways, and would like to write a function to accomplish this. I would like to pass in a parameter containing a list of columns that would have the operations performed. This works fine when the vector declaration of columns is on the left-hand side of the := operator, but not if it is declared earlier (or passed into the function). The following code shows the issue.
dt = data.table(a = letters, b = 1:2, c=1:13)
colsToDelete = c('b', 'c')
dt[, colsToDelete := NULL] # doesn't work, but I don't understand why not.
dt[,c('b', 'c') := NULL] # works fine, but doesn't allow passing in of columns
The error is "Adding new column 'colsToDelete' then assigning NULL (deleting it)." So clearly, it's interpreting 'colsToDelete' as a new column name.
The same issue occurs when doing something along these lines
dt[, colNames := lapply(.SD, adjustValue, y=factor), .SDcols = colNames]
I'm new to R, but rather more experienced with some other languages, so this may be a silly question.
It's basically because we allow symbols on the LHS of := to add new columns, for convenience, e.g. DT[, col := val]. So, in order to distinguish col itself being the column name from the column names being whatever is stored in col, we check whether the LHS is a name or an expression.
If it's a name, the column is added with that name as-is; if it's an expression, the expression gets evaluated:
DT[, col := val] # col is the column name.
DT[, (col) := val] # col gets evaluated and replaced with its value
DT[, c(col) := val] # same as above
The preferred idiom is: dt[, (colsToDelete) := NULL]
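Applied to the function use case in the question, a minimal sketch would be:
deleteCols <- function(DT, cols) {
  # the parentheses force cols to be evaluated, so its contents name the columns
  DT[, (cols) := NULL]
}

dt <- data.table(a = letters, b = 1:2, c = 1:13)
deleteCols(dt, c("b", "c"))  # dt is modified by reference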
HTH
I am surprised no answer provided uses the set() function.
set(DT, , colsToDelete, NULL)
This should be the easiest.
To extend on the previous answer, you can delete columns by reference by doing:
# delete columns 10 to 15
dt[ , (10:15) := NULL ]
or
# delete columns 3, 5 and 10 to 15
dt[ , (c(3,5,10:15)) := NULL ]
This code did the job for me. You need the positions of the columns to be deleted, e.g. posvec, as mentioned in ?set:
j: Column name(s) (character) or number(s) (integer) to be assigned value when column(s) already exist, and only column name(s) if they are to be created.
DT_removed_selected_col = set(DT, j = posvec, value = NULL)
Also if you want to get the posvec you can try this:
selected_col = c('col_a','col_b',...)
selected_col = unlist(sapply(selected_col, function(x) grep(x, names(DT))))
namvec = names(selected_col) #col names
posvec = unname(selected_col) #col positions
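One caveat (my addition): grep() does partial matching, so 'col_a' would also match a column named 'col_a2'. If you want exact matches only, a sketch like this works, assuming all the names actually exist in DT:
selected_col = c('col_a', 'col_b')
posvec = match(selected_col, names(DT))   # exact positions; NA if a name is missing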
I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggested by #MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID = c(rep("A", 1e6), rep("B", 1e6)),
                  Quarter = c(1:1e6, 1:1e6),
                  value = rnorm(10))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
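A small illustration (my own sketch) of why the as.character() step matters:
x <- factor(c("10", "20", "30"))
as.integer(x)                # 1 2 3    -- the underlying level codes, not the values
as.integer(as.character(x))  # 10 20 30 -- the actual values
as.integer(levels(x))[x]     # same result, often faster on long vectors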
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference, because a copy is made and stored in memory (as pointed out by #Frank), which increases memory usage. It is more of an addition to explain the working of with = FALSE.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
If you don't add with = FALSE, data.table will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]