split data.table - r

I have a data.table which I want to split into two. I do this as follows:
dt <- data.table(a=c(1,2,3,3),b=c(1,1,2,2))
sdt <- split(dt,dt$b==2)
but if I want to do something like this as a next step
sdt[[1]][,c:=.N,by=a]
I get the following warning message.
Warning message: In [.data.table(sdt[[1]], , :=(c, .N), by = a) :
Invalid .internal.selfref detected and fixed by taking a copy of the
whole table, so that := can add this new column by reference. At an
earlier point, this data.table has been copied by R. Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(),
setnames() and setattr(). Also, list(DT1,DT2) will copy the entire DT1
and DT2 (R's list() copies named objects), use reflist() instead if
needed (to be implemented). If this message doesn't help, please
report to datatable-help so the root cause can be fixed.
Just wondering if there is a better way of splitting the table so that it would be more efficient (and would not get this message)?

This works in v1.8.7 (and may work in v1.8.6 too) :
> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
a b
1: 1 1
2: 2 1
$`TRUE`
a b
1: 3 2
2: 3 2
> sdt[[1]][,c:=.N,by=a] # now no warning
> sdt
$`FALSE`
a b c
1: 1 1 1
2: 2 1 1
$`TRUE`
a b
1: 3 2
2: 3 2
But, as @mnel said, that's inefficient. Please avoid splitting if possible.

I was looking for a way to split a data.table and came across this old question.
Sometimes a split is what you want, and the data.table "by" approach is not convenient.
You can do the split by hand using only data.table operations, and it is efficient. Note that this function assumes dt is already sorted by the splitting column, so that each value forms one contiguous run of rows:
SplitDataTable <- function(dt, attr) {
  # Row positions where the value of `attr` changes, plus the two outer edges.
  # Assumes dt is sorted by `attr` and has at least one row.
  boundaries <- c(0, which(head(dt[[attr]], -1) != tail(dt[[attr]], -1)), nrow(dt))
  mapply(
    function(start, end) dt[start:end, ],
    head(boundaries, -1) + 1,
    tail(boundaries, -1),
    SIMPLIFY = FALSE
  )
}
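A minimal usage sketch of the run-boundary idea, assuming the table is first sorted by the splitting column. The helper name split_runs is hypothetical and simply restates the function above in compact form:

```r
library(data.table)

# Hypothetical helper: one list element per contiguous run of equal values in `col`
split_runs <- function(dt, col) {
  v <- dt[[col]]
  ends <- c(which(head(v, -1) != tail(v, -1)), nrow(dt))  # last row of each run
  starts <- c(1L, head(ends, -1) + 1L)                    # first row of each run
  Map(function(s, e) dt[s:e], starts, ends)
}

dt <- data.table(a = c(3, 1, 3, 2), b = c(2, 1, 2, 1))
setorderv(dt, "b")   # sort by reference so each value of b forms a single run
parts <- split_runs(dt, "b")
length(parts)
```

Without the sort, the unsorted b column (2, 1, 2, 1) would produce four one-row pieces instead of two groups.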

As mentioned above (@jangorecki), the data.table package already has its own split method. In this simplified case we can use:
> dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))
> split(dt, by = "b")
$`1`
a b
1: 1 1
2: 2 1
$`2`
a b
1: 3 2
2: 3 2
For more complex cases, I would recommend creating a new variable in the data.table with the by-reference functions := or set, and then calling split on it. If you care about performance, stay within data.table, e.g. dt[, SplitCriteria := (...)], rather than computing the splitting variable externally.
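A minimal sketch of that recommendation, assuming a data.table version whose split() accepts a by argument (as used above). The column name is_two is made up for illustration:

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Build the splitting variable by reference, inside the data.table
dt[, is_two := b == 2]

# Split on the new column; the list names come from the values of is_two
parts <- split(dt, by = "is_two")
names(parts)
```

This reproduces the original b == 2 split without computing the logical vector outside the table.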

Related

Remove data.table rows whose vector elements contain nested NAs

I need to remove from a data.table any row in which column a contains any NA nested in a vector:
library(data.table)
a = list(as.numeric(c(NA,NA)), 2,as.numeric(c(3, NA)), c(4,5) )
b <- 11:14
dt <- data.table(a,b)
Thus, rows 1 and 3 should be removed.
I tried three solutions without success:
dt1 <- dt[!is.na(a)]
dt2 <- dt[!is.na(unlist(a))]
dt3 <- dt[dt[,!Reduce(`&`, lapply(a, is.na))]]
Any ideas? Thank you.
You can do the following:
dt[sapply(dt$a, \(l) !any(is.na(l)))]
This alternative also works on this data, but you will get warnings, and it would wrongly drop any vector that contains a 0:
dt[sapply(dt$a, all)]
Better approach (thanks to r2evans, see comments)
dt[!sapply(a,anyNA)]
Output:
a b
1: 2 12
2: 4,5 14
A third option that you might prefer: You could move the functionality to a separate helper function that ingests a list of lists (nl), and returns a boolean vector of length equal to length(nl), and then apply that function as below. In this example, I explicitly call unlist() on the result of lapply() rather than letting sapply() do that for me, but I could also have used sapply()
f <- \(nl) unlist(lapply(nl,\(l) !any(is.na(l))))
dt[f(a)]
An alternative to *apply()
dt[, .SD[!anyNA(a, TRUE)], by = .I][, !"I"]
# a b
# <list> <int>
# 1: 2 12
# 2: 4,5 14
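For reference, a type-stable variant of the anyNA approach, sketched with vapply() so the filter is guaranteed to be a logical vector with one entry per row. The name keep is only illustrative:

```r
library(data.table)

a <- list(as.numeric(c(NA, NA)), 2, as.numeric(c(3, NA)), c(4, 5))
b <- 11:14
dt <- data.table(a, b)

# One TRUE/FALSE per row: TRUE when the nested vector has no NA
keep <- !vapply(dt$a, anyNA, logical(1))
res <- dt[keep]
res$b
```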

Using data.table function in lapply on a list with data.frames elements (Answer = setDT)

First question, let me know if more info or background is needed in the comments please.
Many answers on here and elsewhere deal with calling lapply in a data.table function. I want to do the opposite, which on paper should be easy, lapply(list.of.dfs, fun(x) x), but I can't get it to work with data.table functions.
I have a list that contains several data.frames with the same columns but differing numbers of rows. This comes from the output of several simulation scenarios, so they must be treated separately and not rbind'ed.
#sample list of data.frames
scenarios <- replicate(5, data.frame(a=sample(letters[1:4],10,T),
b=sample(1:2,10,T),
x=sample(1:10, 10),
y =runif(10)), simplify = FALSE)
I want to add a column to every element that is the sum of x/y by a and b.
From the data.table documentation in the examples section the process to do this for one data.frame is the following (search: add new column by reference by group in the doc page):
test <- as.data.table(scenarios[[1]]) #must specify data.table class
test[, newcol := sum(x/y), by = .(a , b)][]
I want to use lapply to do the same thing to every element in the scenarios list and return the list.
My most recent attempt:
lapply(scenarios, function(i) {as.data.table(i[, z := sum(x/y), by=.(a,b)]); i})
but I keep getting the error unused argument (by = .(a, b))
After poring over the results of this and other sites I have been unable to solve this problem, which I'm fairly sure means that there is something I don't understand about calling anonymous functions and/or using the data.table function. Is this one of those cases where you use the [ as the function? Or possibly my as.data.table is out of place.
This answer was a step in the right direction (I think), it covers the use of fun(x) {... ; x} to use an anonymous function and return x.
Thanks!
You can use setDT here instead.
scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by=.(a,b)])
scenarios[[1]]
a b x y z
1: c 2 2 0.87002174 2.298793
2: b 2 10 0.19720775 78.611837
3: b 2 8 0.47041670 78.611837
4: b 2 4 0.36705023 78.611837
5: a 1 5 0.78922686 12.774035
6: a 1 6 0.93186209 12.774035
7: b 1 3 0.83118438 3.609307
8: c 1 1 0.08248658 30.047494
9: c 1 7 0.89382050 30.047494
10: c 1 9 0.89172831 30.047494
Using as.data.table, the syntax would be
scenarios <- lapply(scenarios, function(i) {i <- as.data.table(i); i[, z := sum(x/y),
by=.(a,b)]})
but this wouldn't be recommended as it will create an additional copy, which is avoided by setDT.
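A self-contained sketch of the setDT approach with a quick check of the result. The seed and the smaller number of scenarios are assumptions added only to make the example reproducible:

```r
library(data.table)

set.seed(1)  # assumption: fixed seed for reproducibility
scenarios <- replicate(3, data.frame(a = sample(letters[1:4], 10, TRUE),
                                     b = sample(1:2, 10, TRUE),
                                     x = sample(1:10, 10),
                                     y = runif(10)), simplify = FALSE)

# Convert each element with setDT and add the grouped sum;
# assign the result back so the list holds the data.tables
scenarios <- lapply(scenarios, function(df) setDT(df)[, z := sum(x / y), by = .(a, b)])

# Every element is now a data.table with the new column
all(vapply(scenarios, is.data.table, logical(1)))
```

Assigning the result of lapply() back to scenarios, as in the answer above, is what keeps the converted tables.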

`j` doesn't evaluate to the same number of columns for each group

I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' may end up with two columns, A and B. 'A' and 'B' were returned as part of the first call to 'fetch' and only 'B' was returned as part of the second. I would like the example code to return this result.
id A B
1 1 a b
2 2 <NA> b
Unfortunately, when run I get this error.
Error in `[.data.table`(data, , fetch(.BY, .SD), by = id) :
j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real world use case plyr is running out of memory. Each call to fetch occurs rather quickly, but the memory crash occurs when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
A V1
1: 1 1
2: 2 2,3 # V1 is a list column. These aren't strings, the
3: 3 4,5,6 # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or, just make the list returned conformant :
allcols = c("A","B")
fetch <- function(by) {
if(by == 1)
list(A=c("a"), B=c("b"))[allcols]
else
list(B=c("b"))[allcols]
}
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(A=NA, B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
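Another option, not taken from the answers above, is to call fetch outside the grouped j and bind the pieces with rbindlist(fill = TRUE), which fills missing columns with NA much like rbind.fill. A minimal sketch, assuming fetch can be called directly with each id:

```r
library(data.table)

fetch <- function(id) {
  if (id == 1) data.table(A = "a", B = "b") else data.table(B = "b")
}

ids <- c(1, 2)
pieces <- lapply(ids, fetch)
names(pieces) <- ids   # names become the values of the id column

# fill = TRUE pads columns missing from a piece with NA
result <- rbindlist(pieces, fill = TRUE, idcol = "id")
result
```

Note that the id column produced this way is character, because it is taken from the list names.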

R using lubridate with data.table to match dates

The following code fails
a = data.table(date=seq(ymd('2001-6-30'),ymd('2003-6-30'),by='weeks'))
a = a[,list(date=date,a=rnorm(105),b=rnorm(105))]
b = seq(ymd('2001-6-30'),ymd('2001-07-28'),by='weeks')
a[date %in% b]
with the message
Empty data.table (0 rows) of 3 cols: date,a,b
Can anyone help identify what I'm doing wrong? It should find the data.
Nothing to do with lubridate.
Your issue is scoping. You have a column b in your data.table. data.table looks for names first among the columns and only then up the search path, so inside a[...] the name b refers to the column, not to your vector in the calling environment.
So, rename your vector in the parent (global) environment
B <- b
a[date %in% B]
date a b
1: 2001-06-30 -1.89904968 0.9230171
2: 2001-07-07 0.08599561 -0.0440927
3: 2001-07-14 -0.28606686 0.4649957
4: 2001-07-21 0.39191680 0.2907855
5: 2001-07-28 0.18732463 -0.1743267
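If renaming is inconvenient, the membership test can also be evaluated outside of [.data.table, where b can only mean the vector. A minimal sketch, assuming base as.Date() in place of lubridate's ymd() to keep it self-contained; the names dates and keep are illustrative:

```r
library(data.table)

dates <- seq(as.Date("2001-06-30"), as.Date("2003-06-30"), by = "week")
a <- data.table(date = dates, a = rnorm(length(dates)), b = rnorm(length(dates)))
b <- dates[1:5]

# Evaluated in the calling environment, so b is the date vector, not the column
keep <- a$date %in% b
res <- a[keep]
nrow(res)
```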

How to reference column names that start with a number, in data.table

If the column names in a data.table have the form number + character, for example 4PCS or 5Y, how can they be referenced as j in x[i,j] so that they are interpreted as unquoted column names?
I assume this would solve my original problem: I wanted to add several columns in a data.table whose names have the form number + character.
M <- data.table('4PCS'=1:4,'5Y'=4:1,X5Y=2:5)
> M[,4PCS+5Y]
Error: unexpected symbol in "M[,4PCS"
The new column should be the sum of 4PCS and 5Y.
Is there a way to refer to them in data.table in unquoted form? If these columns are referred to with the quoted "logic" of data.frame:
> M[,'5Y',with=FALSE]
5Y
[1,] 4
[2,] 3
[3,] 2
[4,] 1
then such a reference is limited in functionality. The addition does not work, just as it does not work in data.frame:
> M[,'4PCS'+'5Y',with=FALSE]
Error in "4PCS" + "5Y" : non-numeric argument to binary operator
data.table allows operating on columns directly. I would like a solution in the data.table logic so that I can use its ability to transform columns by referencing their names.
The question is:
How do I quote a column name that starts with a number so that data.table understands it is a column name?
I think, this is what you're looking for, not sure. data.table is different from data.frame. Please have a look at the quick introduction, and then the FAQ (and also the reference manual if necessary).
require(data.table)
dt <- data.table("4PCS" = 1:3, y=3:1)
#   4PCS y
# 1:    1 3
# 2:    2 2
# 3:    3 1
# access column 4PCS
dt[, "4PCS"]
# returns a data.table
# 4PCS
# 1: 1
# 2: 2
# 3: 3
# to access multiple columns by name
dt[, c("4PCS", "y")]
Alternatively, if you need to access the column and not result in a data.table, rather a vector, then you can access using the $ notation:
dt$`4PCS` # notice the ` because the variable begins with a number
# [1] 1 2 3
# alternatively, as mnel mentioned under comments:
dt[, `4PCS`]
# [1] 1 2 3
Or if you know the column number you can access using [[.]] as follows:
dt[[1]] # 4PCS is the first column here
# [1] 1 2 3
Edit:
Thanks @joran. I think you're looking for this:
dt[, `4PCS` + y]
# [1] 4 4 4
Fundamentally the issue is that 4PCS is not a valid variable name in R (try 4PCS <- 1, you'll get the same "unexpected symbol" error). So to refer to it, we have to use backticks (compare `4PCS` <- 1).
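Applied to the table from the question, a short sketch of adding the summed column by reference with backticked names. The column name total is made up for illustration:

```r
library(data.table)

M <- data.table(`4PCS` = 1:4, `5Y` = 4:1, X5Y = 2:5)

# Backticks let non-syntactic names be used unquoted in j
M[, total := `4PCS` + `5Y`]
M$total
```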
You can also put an 'X' immediately before the variable name, but only if the column was actually renamed that way: functions such as read.csv (with the default check.names = TRUE) prepend 'X' to names that start with a digit, whereas data.table() keeps the name unchanged by default.
So e.g. when calling 4PCS use X4PCS
as in
mydata <- X4PCS

Resources