Use of "list" in data.table's j argument - r

I am learning data.table properties from a blog post. I am trying to understand the part under "summary table (short and narrow)", starting by coercing data.frame(mtcars) to data.table:
> data <- as.data.table(mtcars)
> data <- data[,.(gear,cyl)]
> head(data)
gear cyl
1: 4 6
2: 4 6
3: 4 4
4: 3 6
5: 3 8
6: 3 6
Up to this point everything is fine.
Now I have tried this data[, gearsL := list(list(unique(gear))), by=cyl]
> head(data)
gear cyl gearsL
1: 4 6 4,3,5
2: 4 6 4,3,5
3: 4 4 4,3,5
4: 3 6 4,3,5
5: 3 8 3,5
6: 3 6 4,3,5
I am able to understand unique(gear) but unable to understand what list(list(unique(gear)) is doing.

A data.table -- like any data.frame -- is a list of pointers to column vectors.
When creating new columns, we write j of DT[i,j,by] so that it evaluates to a list of columns:
DT[, (newcol_names) := list(newcol_A, newcol_B)]
That's what the outermost list() in the OP's example does, for a single list column.
data[,gearsL := list(list(unique(gear))), by=cyl]
This can and should be written using the alias .(), for clarity:
data[, gearsL := .(list(unique(gear))), by=cyl]
That's all you need to know, but I've put some elaboration below.
Details. When creating a new column, we can often skip list()/.():
DT = data.table(id=1:3)
DT[, E := c(4,5,6)]
DT[, R := 3]
# this works as if we had typed
# R := c(3,3,3)
Note that E enumerates each value, while R recycles a single value over all rows. Next example:
DT[, Elist := list(hist(rpois(1,1)), hist(rpois(2,2)), hist(rpois(3,3)))]
As we did for E, we're enumerating the values of Elist here. This still uses the shortcut; list() is here only because the column is itself a list, as confirmed by
sapply(DT, class)
# id E R Elist
# "integer" "numeric" "numeric" "list"
The convenient shortcut of skipping list()/.() fails in one special case: when we are creating a list column that that recycles its value:
DT[, Rlist := list(c("a","b"))]
# based on the pattern for column R, this should work as if we typed
# Rlist := list(c("a","b"), c("a","b"), c("a","b"))
It doesn't work because the parser sees this as C2 := .( c("a", "b") ) and thinks we simply neglected to make a full enumeration with one value for each row, like Elist does. To get the desired result, skip the shortcut and wrap the vector in list()/.():
DT[, Rlist := .(list(c("a","b")))]
# id E R Elist Rlist
# 1: 1 4 3 <histogram> a,b
# 2: 2 5 3 <histogram> a,b
# 3: 3 6 3 <histogram> a,b
This is the case in the OP's example, where the outer list()/.() is necessary.

Related

Order data.table by a character vector of column names

I'd like to order a data.table by a variable holding the name of a column:
I've tried every combination of + eval, getandc` without success:
I have colVar = "someColumnName"
I'd like to apply this to: DT[order(colVar)]
data.table has special functions for that matter which will modify your data set by reference instead of copying it to a new object.
You can either use setkey or (in versions >= 1.9.4) setorder which is capable of ordering in decreasing order too.
Note the difference between setkey vs. setkeyv and setorder vs. setorderv. v notes that you can pass either a quoted variable name or a variable containing one.
Using #andrewzm data set
dtbl
# x y
# 1: 1 5
# 2: 2 4
# 3: 3 3
# 4: 4 2
# 5: 5 1
setorderv(dtbl, colVar)[] # or `sekeyv(dtbl, colVar)[]` or `setorderv(dtbl, "y")[]`
# x y
# 1: 5 1
# 2: 4 2
# 3: 3 3
# 4: 2 4
# 5: 1 5
You can use double brackets for data tables:
library(data.table)
dtbl <- data.table(x = 1:5, y = 5:1)
colVar = "y"
dtbl_sorted <- dtbl[order(dtbl[[colVar]])]
dtbl_sorted

data.table avoid column name changing [duplicate]

Pass character vectors and column names to data.table as a list of columns?
I want to be able to produce a subset of columns in R using data.table in a way that I can determine some of them earlier on and pass the predetermined list on as a character vector, then combine with a static list of columns.
That is, given this:
a <- 1:4
b <- 5:8
c <- c('aa','bb','cc','dd')
e <- 1:4
z <- data.table(a,b,c,e)
I want to do this:
z[, list(a,b)]
Which produces this output:
a b
1: 1 5
2: 2 6
3: 3 7
4: 4 8
But I want to do it in some way similar to this (which works, almost):
cols <- "b"
z[, list(get(cols), a)]
Results:
Note that it doesn't return the name of the column stored in cols
V1 a
1: 5 1
2: 6 2
3: 7 3
4: 8 4
but I need to do it with more than one element of cols (which does not work):
cols <- c('a', 'b')
z[, list(mget(cols), c)]
The above produces the following error:
Error: value for ‘a’ not found
I think my problem lies with scoping and which environments mget is looking in, but I can't figure out what exactly I am doing wrong. Also, how do I preserve the column titles?
Here are two (pretty much equivalent) options. One using lapply:
z[, c(lapply(cols, get), list(c))]
# V1 V2 V3
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
And one using mget:
z[, c(mget(cols, inherits = TRUE), c = list(c))]
# a b c
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
Note that get returns a vector which loses the information about column name (and there isn't much you can do about it besides manually adding it back in), while mget returns a named list.
Attempting to mix standard and non-standard evaluation within a single call will probably end in tears / frustration / obfusticated code.
There are a number of options in data.table
Use .. notation to "look up one level" to find the vector of column names
cols <- c('a','b')
z[, ..cols]
Use .SDcols
z[, .SD, .SDcols = cols]
But if you really want to combine the two ways of referencing, then you can use something like (introducing another option, with=FALSE, which allows more general expressions for column names than a simple vector)
ll <- function(char=NULL,uneval=NULL){
Call <- match.call()
cols <- lapply(Call$uneval,as.character)
unlist(c(char,cols))}
z[, ll(cols,c), with=FALSE]
# a b c
# 1: 1 5 aa
# 2: 2 6 bb
# 3: 3 7 cc
# 4: 4 8 dd
z[, ll(char=cols), with=FALSE]
# a b
# 1: 1 5
# 2: 2 6
# 3: 3 7
# 4: 4 8
z[, ll(uneval=c), with=FALSE]
# c
# 1: aa
# 2: bb
# 3: cc
# 4: dd
Combining a variable with column names with hard-coded column names in data.table
Given z and cols from the example above:
To combine a list of column names in a variable col with other hard coded column name c, we combine them in a new character vector c(col, 'c') in the call to data.table. We can refer to cols from within j (the second argument within []) by using the "up-one-level" notation ..:
z[, c(..cols, 'c')]
Thank you to #thelatemail for providing the base to the solution above.

How to pass a list of columns to data.table where some are predetermined

Pass character vectors and column names to data.table as a list of columns?
I want to be able to produce a subset of columns in R using data.table in a way that I can determine some of them earlier on and pass the predetermined list on as a character vector, then combine with a static list of columns.
That is, given this:
a <- 1:4
b <- 5:8
c <- c('aa','bb','cc','dd')
e <- 1:4
z <- data.table(a,b,c,e)
I want to do this:
z[, list(a,b)]
Which produces this output:
a b
1: 1 5
2: 2 6
3: 3 7
4: 4 8
But I want to do it in some way similar to this (which works, almost):
cols <- "b"
z[, list(get(cols), a)]
Results:
Note that it doesn't return the name of the column stored in cols
V1 a
1: 5 1
2: 6 2
3: 7 3
4: 8 4
but I need to do it with more than one element of cols (which does not work):
cols <- c('a', 'b')
z[, list(mget(cols), c)]
The above produces the following error:
Error: value for ‘a’ not found
I think my problem lies with scoping and which environments mget is looking in, but I can't figure out what exactly I am doing wrong. Also, how do I preserve the column titles?
Here are two (pretty much equivalent) options. One using lapply:
z[, c(lapply(cols, get), list(c))]
# V1 V2 V3
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
And one using mget:
z[, c(mget(cols, inherits = TRUE), c = list(c))]
# a b c
#1: 1 5 aa
#2: 2 6 bb
#3: 3 7 cc
#4: 4 8 dd
Note that get returns a vector which loses the information about column name (and there isn't much you can do about it besides manually adding it back in), while mget returns a named list.
Attempting to mix standard and non-standard evaluation within a single call will probably end in tears / frustration / obfusticated code.
There are a number of options in data.table
Use .. notation to "look up one level" to find the vector of column names
cols <- c('a','b')
z[, ..cols]
Use .SDcols
z[, .SD, .SDcols = cols]
But if you really want to combine the two ways of referencing, then you can use something like (introducing another option, with=FALSE, which allows more general expressions for column names than a simple vector)
ll <- function(char=NULL,uneval=NULL){
Call <- match.call()
cols <- lapply(Call$uneval,as.character)
unlist(c(char,cols))}
z[, ll(cols,c), with=FALSE]
# a b c
# 1: 1 5 aa
# 2: 2 6 bb
# 3: 3 7 cc
# 4: 4 8 dd
z[, ll(char=cols), with=FALSE]
# a b
# 1: 1 5
# 2: 2 6
# 3: 3 7
# 4: 4 8
z[, ll(uneval=c), with=FALSE]
# c
# 1: aa
# 2: bb
# 3: cc
# 4: dd
Combining a variable with column names with hard-coded column names in data.table
Given z and cols from the example above:
To combine a list of column names in a variable col with other hard coded column name c, we combine them in a new character vector c(col, 'c') in the call to data.table. We can refer to cols from within j (the second argument within []) by using the "up-one-level" notation ..:
z[, c(..cols, 'c')]
Thank you to #thelatemail for providing the base to the solution above.

How to change the last value in each group by reference, in data.table

For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site=c(rep("A",5), rep("B",4)),t=c(1:5,1:4),a=as.double(c(11:15,21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?
Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
alternative approaches:
use replace...
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapped by {} and return the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented

R data.table subsetting within a group and splitting a data table into two

I have the following data.table.
ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b
I want to subset this data.table into two. The criteria is to have approximately the first half for each group (in this case column "id") in one data table and the remaining in another data.table. So the expected result are two data.tables as follows
ts,id
1,a
2,a
3,a
4,a
1,b
2,b
and
ts,id
5,a
6,a
7,a
3,b
4,b
I tried the following,
z1 = x[,.SD[.I < .N/2,],by=dev]
z1
and got just the following
id ts
a 1
a 2
a 3
Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated.
Thanks in advance.
.I gives the row locations with respect to the whole data.table. Thus it can't be used like that within .SD.
Something like
DT[, subset := seq_len(.N) > .N/2,by='id']
subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]
subset1
# ts id
# 1: 4 a
# 2: 5 a
# 3: 6 a
# 4: 7 a
# 5: 3 b
# 6: 4 b
subset2
# ts id
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 1 b
# 5: 2 b
Should work
For more than 2 groups, you could use cut to create a factor with the appropriate number of levels
Something like
DT[, subset := cut(seq_len(.N), 3, labels= FALSE),by='id']
# you could copy to the global environment a subset for each, but this
# will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]),'names', paste0('s',1:3)), .GlobalEnv)
Here's the corrected version of your expression:
dt[, .SD[, .SD[.I <= .N/2]], by = id]
# id ts
#1: a 1
#2: a 2
#3: a 3
#4: b 1
#5: b 2
The reason yours is not working is because .I and .N are not available in the i-expression (i.e. first argument of [) and so the parent data.table's .I and .N are used (i.e. dt's).

Resources