I came across a surprising result with data.table. Here is a really simple example:
library(data.table)
df <- data.table(x = 1:10)
df[,x[x>3][.N]]
[1] NA
This syntax gives NA, but this works:
df[,x[x>3][1]]
[1] 4
and of course so does this:
df[,x[.N]]
[1] 10
I know that in this simple case you can do
df[x>3,x[.N]]
but I wanted to use the df[,x[x>3][.N]] syntax together with lapply on .SD, to avoid a loop over the i selection, so something like:
df2 <- data.table(x = rep(1:10,2), y = rep(2:11,2),ID = rep(c("A","B"),each = 10))
cols = c("x","y")
df2[,lapply(.SD,function(x){x[x>3][.N]}),.SDcols = cols, by = ID]
But this fails, the same as in my simple example. Is it because .N is not implemented in this case, or am I doing something wrong?
My actual workaround:
Reduce(merge, lapply(cols, function(col) df2[get(col) > 3, setNames(list(get(col)[.N]), col), by = ID]))
   ID  x  y
1:  A 10 11
2:  B 10 11
but I am not fully happy with it; I find it less readable. Does anyone have an explanation and a better workaround?
Thank you!
df[, x[x>3]] has 7 elements, but .N is 10, so you are trying to subset the vector out of its range.
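The same thing happens with a plain vector, since x[x>3] is just a length-7 vector being indexed at position 10:
v <- (1:10)[(1:10) > 3]  # the seven values 4:10
length(v)
[1] 7
v[10]  # indexing past the end of a vector returns NA
[1] NA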
So you can access the last element of the vector in lapply using:
df2[, lapply(.SD, function(x) tail(x[x>3], 1) ), .SDcols = c('x','y'), by = ID]
Or, more idiomatically for data.table, we can use last:
df2[, lapply(.SD, function(x) last(x[x>3]) ), .SDcols = c('x','y'), by = ID]
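With the example df2 (where the last value above 3 per group is 10 for x and 11 for y), either version should return:
   ID  x  y
1:  A 10 11
2:  B 10 11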
I want to apply a transformation (whose type, loosely speaking, is "vector" -> "vector") to a list of columns in a data.table, and this transformation will involve a grouping operation.
Here is the setup and what I would like to achieve:
library(data.table)
set.seed(123)
n <- 1000
DT <- data.table(
  date = seq.Date(as.Date('2000/1/1'), by='day', length.out = n),
  A = runif(n),
  B = rnorm(n),
  C = rexp(n))
DT[, A.prime := (A - mean(A))/sd(A), by=year(date)]
DT[, B.prime := (B - mean(B))/sd(B), by=year(date)]
DT[, C.prime := (C - mean(C))/sd(C), by=year(date)]
The goal is to avoid typing out the column names. In my actual application, I have a list of columns I would like to apply this transformation to.
columns <- c("A", "B", "C")
for (x in columns) {
  # This doesn't work:
  # target <- DT[, (x - mean(x, na.rm=TRUE))/sd(x, na.rm = TRUE), by=year(date)]
  # Neither does this:
  # target <- DT[, (..x - mean(..x, na.rm=TRUE))/sd(..x, na.rm = TRUE), by=year(date)]
  # THIS WORKS! But it is tedious writing "get(x)" every time.
  target <- DT[, (get(x) - mean(get(x), na.rm=TRUE))/sd(get(x), na.rm = TRUE), by=year(date)][, V1]
  set(DT, j = paste0(x, ".prime"), value = target)
}
Question: What is the idiomatic way to achieve the above result? There are two things that could possibly be improved:
How to avoid typing out get(x) every time I use x to access a column?
Is accessing [, V1] the most efficient way of doing this? Is it possible to update DT directly by reference, without creating an intermediate data.table?
You can use .SDcols to specify the columns that you want to operate on:
library(data.table)
columns <- c("A", "B", "C")
newcolumns <- paste0(columns, ".prime")
DT[, (newcolumns) := lapply(.SD, function(x) (x - mean(x))/sd(x)),
   year(date), .SDcols = columns]
This avoids using get(x) every time and updates the data.table by reference.
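A quick sanity check: since date is sorted here, the grouped output lines up with the original row order, so the new column should compare equal to a manual computation:
all.equal(DT$A.prime, DT[, (A - mean(A))/sd(A), by = year(date)]$V1)
[1] TRUE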
I think Ronak's answer is superior & preferable; I'm just writing this to demonstrate that a common syntax for more complicated j queries is to use a full {} expression:
target <- DT[, by = year(date), {
  xval = eval(as.name(x))
  (xval - mean(xval, na.rm=TRUE))/sd(xval, na.rm = TRUE)
}]$V1
Two other small differences:
I used eval(as.name(.)) instead of get; the former is more trustworthy & IME faster
I replaced [, V1] with $V1 -- the former incurs the overhead of [.data.table.
You might also like to know that the base function scale will do the centering & normalizing steps more concisely (if slightly inefficiently, since it is a bit too general).
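One way that could look, reusing columns and newcolumns from the answer above (scale returns a one-column matrix, so it is wrapped in as.vector here):
DT[, (newcolumns) := lapply(.SD, function(x) as.vector(scale(x))),
   by = year(date), .SDcols = columns]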
I want to convert a subset of data.table columns to a new class. There's a popular question here (Convert column classes in data.table), but the answer there creates a new object rather than operating on the original one.
Take this example:
dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
What is the best way to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this:
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine but seems a bit less elegant:
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter, and one that doesn't leave a counter variable behind), or should I just use the above plus rm(j)?
Besides the option suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check that this worked:
> sapply(dat,class)
      ID  Quarter     value
"factor" "factor" "numeric"
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (though that hardly matters on a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID = c(rep("A", 1e6), rep("B", 1e6)),
                  Quarter = c(1:1e6, 1:1e6),
                  value = rnorm(10))
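A sketch of how the timing itself might be set up, assuming the microbenchmark package is available; copy() makes each expression start from an unmodified table:
library(microbenchmark)
microbenchmark(
  lapply_way = copy(dat)[, (cols) := lapply(.SD, factor), .SDcols = cols],
  set_way    = {d <- copy(dat); for (col in cols) set(d, j = col, value = factor(d[[col]]))},
  times = 10
)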
Sometimes you will have to do it a bit differently, for example when numeric values are stored as a factor. Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
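The as.character step matters because as.integer applied directly to a factor returns the underlying level codes, not the labels:
f <- factor(c("10", "20", "30"))
as.integer(f)
[1] 1 2 3
as.integer(as.character(f))
[1] 10 20 30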
WARNING: The following is not the data.table way of doing things. The data.table is not updated by reference, because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included only to explain the workings of with = FALSE.
When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check that this worked:
> sapply(dat,class)
      ID  Quarter     value
"factor" "factor" "numeric"
If you don't add with = FALSE, data.table will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
    ID Quarter
 1:  A       1
 2:  A       2
 3:  A       3
 4:  A       4
 5:  A       5
 6:  B       1
 7:  B       2
 8:  B       3
 9:  B       4
10:  B       5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]
I am looking for a way to manipulate multiple columns in a data.table in R. As I have to address the columns dynamically, and also use a second input, I wasn't able to find an answer.
The idea is to index two or more series on a certain date by dividing all values by the series' value on that date, e.g.:
set.seed(132)
# simulate some data
dt <- data.table(date = seq(from = as.Date("2000-01-01"), by = "days", length.out = 10),
                 X1 = cumsum(rnorm(10)),
                 X2 = cumsum(rnorm(10)))
# set a date for the index
indexDate <- as.Date("2000-01-05")
# get the column names to be able to select the columns dynamically
cols <- colnames(dt)
cols <- cols[substr(cols, 1, 1) == "X"]
Part 1: The Easy data.frame/apply approach
df <- as.data.frame(dt)
# get the right rownumber for the indexDate
rownum <- max((1:nrow(df))*(df$date==indexDate))
# use apply to iterate over all columns
df[, cols] <- apply(df[, cols], 2,
                    function(x, i){x / x[i]}, i = rownum)
Part 2: The (fast) data.table approach
So far my data.table approach looks like this:
for(nam in cols) {
  div <- as.numeric(dt[rownum, nam, with = FALSE])
  dt[, nam := dt[, nam, with = FALSE] / div, with = FALSE]
}
especially since all the with = FALSE calls do not look very data.table-like.
Do you know any faster/more elegant way to perform this operation?
Any idea is greatly appreciated!
One option would be to use set, as this involves multiple columns. The advantage of using set is that it avoids the overhead of [.data.table, which makes it faster.
library(data.table)
for(j in cols){
  set(dt, i = NULL, j = j, value = dt[[j]]/dt[[j]][rownum])
}
Or a slightly slower option would be
dt[, (cols) := lapply(.SD, function(x) x/x[rownum]), .SDcols = cols]
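Either way, a quick check is that the row at indexDate should now be exactly 1 in every indexed column:
dt[date == indexDate, .(X1, X2)]
   X1 X2
1:  1  1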
Following up on your code and the answer given by akrun, I would recommend using .SDcols to select the numeric columns and lapply to loop through them. Here's how I would do it:
index <- as.Date("2000-01-05")
rownum <- max((dt$date == index) * (1:nrow(dt)))
dt[, lapply(.SD, function(i) i/i[rownum]), .SDcols = is.numeric]
Using .SDcols can be especially useful if you have a large number of numeric columns and you'd like to apply this division to all of them.
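If you would rather update dt by reference than return a new table, a sketch of one way is to resolve the numeric column names first and then combine .SDcols with :=:
num_cols <- names(dt)[sapply(dt, is.numeric)]
dt[, (num_cols) := lapply(.SD, function(i) i/i[rownum]), .SDcols = num_cols]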
This is very strange. When I try to select columns on my data.table by doing
df1[, 30]
It just gives me 30 (or whatever number I put in there), not column 30.
Data here: https://github.com/pourque/country-data/blob/master/data/df1.csv
I've checked, and everything works properly when I just produce a test data.frame:
df2 <- data.frame(x = 1:3, y = 3:1, z = 7:9)
> df2[, 2]
[1] 3 2 1
Any ideas on what might be happening?
When working with data.table, you need to use one of the following to choose a column by number:
df2[, 2]
df2[, .SD, .SDcols=2]
Both of these still return a data.table, not a vector.
As with any list, you can also use the following to return a vector:
df2[[2]]
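On a small test table like the one from the question (converted to a data.table), the difference might look like this:
dt2 <- as.data.table(df2)
dt2[, .SD, .SDcols = 2]  # still a data.table
   y
1: 3
2: 2
3: 1
dt2[[2]]                 # a plain vector
[1] 3 2 1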
This is very similar to a question about applying a common function to multiple columns of a data.table using .SDcols, answered thoroughly here.
The difference is that I would like to simultaneously apply a different function on another column which is not part of the .SD subset. I post a simple example below to show my attempt to solve the problem:
dt = data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                v1 = rnorm(100),
                v2 = rnorm(100),
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
Yields the following error:
Error in `[.data.table`(dt, , list(v1 = sum(v1), lapply(.SD, mean)), by = grp,
: object 'v1' not found
Now this makes sense, because the v1 column is not included in the subset of columns, which is evaluated first. So I explored further by including it in my subset of columns:
sd.cols = c("v1","v2", "v3")
dt.out = dt[, list(sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
Now this does not cause an error, but it gives an answer containing 9 rows (for 3 groups), with the sum repeated three times in column V1 and the means for all 3 columns (as expected, but not wanted) stacked in V2, as shown below:
> dt.out
   grp        V1                  V2
1:   c -1.070608 -0.0486639841313638
2:   c -1.070608  -0.178154270921521
3:   c -1.070608  -0.137625003604012
4:   b -2.782252 -0.0794929150464099
5:   b -2.782252  -0.149529237116445
6:   b -2.782252   0.199925178109264
7:   a  6.091355   0.141659419355985
8:   a  6.091355 -0.0272192037753071
9:   a  6.091355 0.00815760216214876
Workaround Solution using 2 steps
Clearly it is possible to solve the problem in multiple steps by calculating the mean by group for the subset of columns and joining it to the sum by group for the single column as follows:
dt.out1 = dt[, sum(v1), by = grp]
dt.out2 = dt[, lapply(.SD,mean), by = grp, .SDcols = sd.cols]
dt.out = merge(dt.out1, dt.out2, by = "grp")
> dt.out
   grp        V1         v2           v3
1:   a  6.091355 -0.0272192  0.008157602
2:   b -2.782252 -0.1495292  0.199925178
3:   c -1.070608 -0.1781543 -0.137625004
I'm sure it's a fairly simple thing I am missing; thanks in advance for any guidance.
Update: Issue #495 is now solved with this recent commit, so we can now do this just fine:
require(data.table) # v1.9.7+
set.seed(1L)
dt = data.table(grp = sample(letters[1:3], 100, replace = TRUE),
                v1 = rnorm(100),
                v2 = rnorm(100),
                v3 = rnorm(100))
sd.cols = c("v2", "v3")
dt.out = dt[, list(v1 = sum(v1), lapply(.SD,mean)), by = grp, .SDcols = sd.cols]
However, note that in this case v2 would be returned as a list, because you're effectively doing list(val, list()). What you perhaps intend to do is:
dt[, c(list(v1=sum(v1)), lapply(.SD, mean)), by=grp, .SDcols = sd.cols]
#    grp        v1          v2         v3
# 1:   a -6.440273  0.16993940  0.2173324
# 2:   b  4.304350 -0.02553813  0.3381612
# 3:   c  0.377974 -0.03828672 -0.2489067
Try this:
dt[,list(sum(v1), mean(v2), mean(v3)), by=grp]
In data.table, using list() in the second argument lets you describe the set of columns that make up the resulting data.table.
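You can also name the list elements to control the resulting column names, e.g.:
dt[, list(v1 = sum(v1), v2 = mean(v2), v3 = mean(v3)), by = grp]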
For what it's worth, .SD can be quite slow [^1], so you may want to avoid it unless you truly need all of the data supplied in the subsetted data.table, as you might for a more sophisticated function.
Another option, if you have many columns for .SDcols, would be to do the merge in one line using the data.table merge syntax.
For example:
dt[, sum(v1), by=grp][dt[,lapply(.SD,mean), by=grp, .SDcols=sd.cols]]
In order to use the merge from data.table, you first need to call setkey() on your data.table so it knows how to match rows up.
So really, first you need:
setkey(dt, grp)
Then you can use the line above to produce an equivalent result.
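In more recent data.table versions you could skip setkey() and pass the join column via on= instead; a sketch of the same one-liner:
dt[, sum(v1), by = grp][dt[, lapply(.SD, mean), by = grp, .SDcols = sd.cols], on = "grp"]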
[^1]: I find this to be especially true as the number of groups approaches the total number of rows. For example, this might happen where your key is an individual ID and many individuals have just one or two observations.