R: using lubridate with data.table to match dates

The following code fails
a = data.table(date=seq(ymd('2001-6-30'),ymd('2003-6-30'),by='weeks'))
a = a[,list(date=date,a=rnorm(105),b=rnorm(105))]
b = seq(ymd('2001-6-30'),ymd('2001-07-28'),by='weeks')
a[date %in% b]
with the message
Empty data.table (0 rows) of 3 cols: date,a,b
Can anyone help identify what I'm doing wrong? It should find the data.

Nothing to do with lubridate.
Your issue is scoping. Your data.table has a column named b, and data.table looks for names first among its own columns and only then up along the search path. So inside a[...] the name b refers to the column, and data.table cannot tell that you meant the vector b in the parent frame.
So, rename your vector in the parent (global) environment:
B <- b
a[date %in% B]
date a b
1: 2001-06-30 -1.89904968 0.9230171
2: 2001-07-07 0.08599561 -0.0440927
3: 2001-07-14 -0.28606686 0.4649957
4: 2001-07-21 0.39191680 0.2907855
5: 2001-07-28 0.18732463 -0.1743267
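The name clash described above can be reproduced in a few lines. This is only a minimal sketch, not the asker's exact data: it assumes base as.Date instead of lubridate's ymd, uses fixed numbers instead of rnorm, and the name wanted is just an illustrative choice for the renamed vector.

```r
library(data.table)

dates <- seq(as.Date("2001-06-30"), as.Date("2003-06-30"), by = "week")
a <- data.table(date = dates,
                a = seq_along(dates) / 10,
                b = seq_along(dates) / 100)

# A vector in the calling environment that shares its name with column b
b <- dates[1:5]

# Inside a[...], b is looked up among the columns first, so the dates
# are compared with the numeric column b rather than with the vector
clash <- a[date %in% b]

# A name that is not a column name (hypothetical: wanted) avoids the clash
wanted <- b
hits <- a[date %in% wanted]
nrow(hits)
```

Checking names(a) against the names of any helper vectors before filtering is a cheap way to spot this kind of collision.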

Related

Using data.table function in lapply on a list with data.frames elements (Answer = setDT)

First question, let me know if more info or background is needed in the comments please.
Many answers on here and elsewhere deal with calling lapply inside a data.table call. I want to do the opposite, which on paper should be as easy as lapply(list.of.dfs, function(x) x), but I can't get it to work with data.table functions.
I have a list that contains several data.frames with the same columns but differing numbers of rows. This comes from the output of several simulation scenarios, so they must be treated separately and not rbind'ed.
#sample list of data.frames
scenarios <- replicate(5, data.frame(a=sample(letters[1:4],10,T),
b=sample(1:2,10,T),
x=sample(1:10, 10),
y =runif(10)), simplify = FALSE)
I want to add a column to every element that is the sum of x/y by a and b.
From the data.table documentation in the examples section the process to do this for one data.frame is the following (search: add new column by reference by group in the doc page):
test <- as.data.table(scenarios[[1]]) #must specify data.table class
test[, newcol := sum(x/y), by = .(a , b)][]
I want to use lapply to do the same thing to every element in the scenarios list and return the list.
My most recent attempt:
lapply(scenarios, function(i) {as.data.table(i[, z := sum(x/y), by=.(a,b)]); i})
but I keep getting the error unused argument (by = .(a, b))
After poring over the results of this and other sites I have been unable to solve this problem, which I'm fairly sure means that there is something I don't understand about calling anonymous functions and/or using data.table functions. Is this one of those cases where you use [ as the function? Or possibly my as.data.table is out of place.
This answer was a step in the right direction (I think), it covers the use of fun(x) {... ; x} to use an anonymous function and return x.
Thanks!
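In your attempt, i is still a data.frame when [ is called, so by is passed to the data.frame method of [, which has no such argument. You can use setDT here instead.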
scenarios <- lapply(scenarios, function(i) setDT(i)[, z := sum(x/y), by=.(a,b)])
scenarios[[1]]
a b x y z
1: c 2 2 0.87002174 2.298793
2: b 2 10 0.19720775 78.611837
3: b 2 8 0.47041670 78.611837
4: b 2 4 0.36705023 78.611837
5: a 1 5 0.78922686 12.774035
6: a 1 6 0.93186209 12.774035
7: b 1 3 0.83118438 3.609307
8: c 1 1 0.08248658 30.047494
9: c 1 7 0.89382050 30.047494
10: c 1 9 0.89172831 30.047494
Using as.data.table, the syntax would be
scenarios <- lapply(scenarios, function(i) {i <- as.data.table(i); i[, z := sum(x/y), by=.(a,b)]})
but this isn't recommended, as it creates an additional copy, which setDT avoids.
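As a small, self-contained check of the setDT approach, the sketch below uses two hand-made data.frames (illustrative values, not the asker's simulated ones) so that the group sums can be verified by eye.

```r
library(data.table)

scenarios <- list(
  data.frame(a = c("u", "u", "v"), b = c(1, 1, 2), x = c(2, 4, 9), y = c(1, 2, 3)),
  data.frame(a = c("u", "v"), b = c(1, 1), x = c(6, 8), y = c(3, 4))
)

scenarios <- lapply(scenarios, function(df) {
  # setDT converts the data.frame to a data.table by reference,
  # so [ dispatches to the data.table method and understands by
  setDT(df)[, z := sum(x / y), by = .(a, b)]
})

scenarios[[1]]$z  # group (u, 1): 2/1 + 4/2 = 4; group (v, 2): 9/3 = 3
```

Each list element comes back as a data.table with the new z column, and the elements stay separate, as the question requires.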

"object 'ansvals' not found" error - what does it mean?

Starting from a simple data.table, for example like this:
dt1 <- fread("
col1 col2 col3
AAA ab cd
BBB ef gh
BBB ij kl
CCC mn nm")
I am making a new table, for example like this:
dt1[,
.(col3, new=.N),
by=col1]
> col1 col3 new
>1: AAA cd 1
>2: BBB gh 2
>3: BBB kl 2
>4: CCC nm 1
This works fine when I give the column names explicitly. But when I have them in variables and try to use with=F, I get an error:
colBy <- 'col1'
colShow <- 'col3'
dt1[,
.(colShow, 'new'=.N),
by=colBy,
with=F]
# Error in `[.data.table`(dt1, , .(colShow, new = .N), by = colBy, with = F) : object 'ansvals' not found
I could not find any information about this error so far.
You are getting this error because with=FALSE tells data.table to treat j the way a data.frame would. It therefore expects a vector of column names, not an expression to be evaluated in j such as new=.N.
From the documentation of ?data.table about with:
By default with=TRUE and j is evaluated within the frame of x; column
names can be used as variables. When with=FALSE j is a character
vector of column names or a numeric vector of column positions to
select, and the value returned is always a data.table.
When you use with=FALSE, you have to give the column names in j without a . before the (), like this: dt1[, (colShow), with=FALSE]. Other options are dt1[, c(colShow), with=FALSE] or dt1[, colShow, with=FALSE]. The same result can be obtained with dt1[, .(col3)].
To sum up: with = FALSE is for selecting columns the data.frame way, so j should then be just a vector of column names or positions.
Also, by = colBy tells data.table to evaluate j for each group, which contradicts with = FALSE.
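The difference between the two modes of j can be seen side by side. This is a minimal sketch that rebuilds the question's table with data.table() instead of fread; the names by_name and by_expr are illustrative only.

```r
library(data.table)

dt1 <- data.table(col1 = c("AAA", "BBB", "BBB", "CCC"),
                  col2 = c("ab", "ef", "ij", "mn"),
                  col3 = c("cd", "gh", "kl", "nm"))
colShow <- "col3"

# with = FALSE: j is a character vector of column names, nothing is evaluated
by_name <- dt1[, colShow, with = FALSE]

# default with = TRUE: j is an expression evaluated over the columns
by_expr <- dt1[, .(col3)]

names(by_name)
names(by_expr)
```

Both calls select the same single column; only the second form can be combined with computed values such as .N.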
From the documentation of ?data.table about j:
A single column name, single expression of column names, list() of
expressions of column names, an expression or function call that
evaluates to list (including data.frame and data.table which are
lists, too), or (when with=FALSE) a vector of names or positions to
select.
j is evaluated within the frame of the data.table; i.e., it
sees column names as if they are variables. Use j=list(...) to return
multiple columns and/or expressions of columns. A single column or
single expression returns that type, usually a vector. See the
examples.
See also points 1.d and 1.g of the introduction vignette of data.table.
ansvals is a name used in data.table internals. You can see where it appears in the code by using ctrl+f (Windows) or cmd+f (macOS) here.
The error object 'ansvals' not found looks like a bug to me. It should either be a helpful message or just work. I've filed issue #1440 linking back to this question, thank you.
Jaap is completely correct. Following on from his answer, you can use get() in j like this:
dt1
# col1 col2 col3
#1: AAA ab cd
#2: BBB ef gh
#3: BBB ij kl
#4: CCC mn nm
colBy
#[1] "col1"
colShow
#[1] "col3"
dt1[,.(get(colShow),.N),by=colBy]
# col1 V1 N
#1: AAA cd 1
#2: BBB gh 2
#3: BBB kl 2
#4: CCC nm 1
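If the V1 name produced by get() is unwanted, two possible variations are sketched below. Both are my own assumptions rather than something the answers propose: naming the get() result and renaming it afterwards (shown is a placeholder name), or selecting the column through .SDcols so that it keeps its name.

```r
library(data.table)

dt1 <- data.table(col1 = c("AAA", "BBB", "BBB", "CCC"),
                  col2 = c("ab", "ef", "ij", "mn"),
                  col3 = c("cd", "gh", "kl", "nm"))
colBy <- "col1"
colShow <- "col3"

# Variation 1: name the get() result, then rename it by reference
res_get <- dt1[, .(shown = get(colShow), new = .N), by = colBy]
setnames(res_get, "shown", colShow)

# Variation 2: .SD restricted by .SDcols keeps the original column name;
# the single .N value is recycled to the size of each group
res_sd <- dt1[, c(.SD, list(new = .N)), by = colBy, .SDcols = colShow]

names(res_sd)
```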

`j` doesn't evaluate to the same number of columns for each group

I am trying to use data.table where my j function could and will return a different number of columns on each call. I would like it to behave like rbind.fill in that it fills any missing columns with NA.
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(B=c("b"))
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
In this case 'result' may end up with two columns, A and B. Both 'A' and 'B' were returned by the first call to 'fetch' and only 'B' by the second. I would like the example code to return this result:
id A B
1 1 a b
2 2 <NA> b
Unfortunately, when run I get this error.
Error in `[.data.table`(data, , fetch(.BY, .SD), by = id) :
j doesn't evaluate to the same number of columns for each group
I can do this with plyr as follows, but in my real world use case plyr is running out of memory. Each call to fetch occurs rather quickly, but the memory crash occurs when plyr tries to merge all of the data back together. I am trying to see if data.table might solve this problem for me.
result <- ddply(data, "id", fetch)
Any thoughts appreciated.
DWin's approach is good. Or you could return a list column instead, where each cell is itself a vector. That's generally a better way of handling variable length vectors.
DT = data.table(A=rep(1:3,1:3),B=1:6)
DT
A B
1: 1 1
2: 2 2
3: 2 3
4: 3 4
5: 3 5
6: 3 6
ans = DT[, list(list(B)), by=A]
ans
A V1
1: 1 1
2: 2 2,3 # V1 is a list column. These aren't strings, the
3: 3 4,5,6 # vectors just display with commas
ans$V1[3]
[[1]]
[1] 4 5 6
ans$V1[[3]]
[1] 4 5 6
ans[,sapply(V1,length)]
[1] 1 2 3
So in your example you could use this as follows:
library(plyr)
rbind.fill(data[, list(list(fetch(.BY))), by = id]$V1)
# A B
#1 a b
#2 <NA> b
Or, just make the returned list conformant:
allcols = c("A","B")
fetch <- function(by) {
if(by == 1)
list(A=c("a"), B=c("b"))[allcols]
else
list(B=c("b"))[allcols]
}
Here are two approaches. The first roughly follows your strategy:
data[,list(A=if(.BY==1) 'a' else NA_character_,B='b'), by=id]
And the second does things in two steps:
DT <- copy(data)[,`:=`(A=NA_character_,B='b')][id==1,A:='a']
Using a by just to check for a single value seems wasteful (maybe computationally, but also in terms of clarity); of course, it could be that your application isn't really like that.
Try
data.table(A=NA_character_, B=c("b"))
@NickAllen: I'm not sure from the comments whether you understood my suggestion. (I was posting from a mobile phone that limited my cut-paste capabilities, and I suspect my wife was telling me to stop texting to SO or she would divorce me.) What I meant was this:
fetch <- function(by) {
if(by == 1)
data.table(A=c("a"), B=c("b"))
else
data.table(A=NA_character_, B=c("b")) # NA_character_ keeps A a character column in every group
}
data <- data.table(id=c(1,2))
result <- data[, fetch(.BY), by=id]
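For the rbind.fill-like behaviour asked about, data.table's own rbindlist with fill = TRUE can pad missing columns with NA. The sketch below is one possible arrangement, not a method from the answers: it assumes that calling fetch outside of [ ... ] via lapply fits the real use case, and it rebuilds the id column from the list names with idcol.

```r
library(data.table)

fetch <- function(id) {
  if (id == 1) data.table(A = "a", B = "b") else data.table(B = "b")
}

data <- data.table(id = c(1, 2))

# One result per id; setNames attaches the id to each piece as its list name
pieces <- lapply(setNames(data$id, data$id), fetch)

# fill = TRUE adds missing columns as NA; idcol turns the list names into a column
result <- rbindlist(pieces, fill = TRUE, idcol = "id")
result
```

Note that the id column comes back as character here, because it is taken from the list names.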

How to reference column names that start with a number, in data.table

If the column names in a data.table have the form number + character, for example 4PCS, 5Y etc., how can they be referenced as j in x[i,j] so that they are interpreted as unquoted column names?
I assume this would solve my original problem: I wanted to add several columns of a data.table whose names have the form number + character.
M <- data.table('4PCS'=1:4,'5Y'=4:1,X5Y=2:5)
> M[,4PCS+5Y]
Error: unexpected symbol in "M[,4PCS"
The new column should be the sum of 4PCS and 5Y.
Is there a way to refer to them in data.table in unquoted form? If these columns are referred to with the quoted "logic" of data.frame:
> M[,'5Y',with=FALSE]
5Y
[1,] 4
[2,] 3
[3,] 2
[4,] 1
then the functionality of such a reference is limited. The addition does not work, just as it does not work in data.frame:
> M[,'4PCS'+'5Y',with=FALSE]
Error in "4PCS" + "5Y" : non-numeric argument to binary operator
The data.table functionality would allow operating over the columns. I would like to find a solution in the data.table logic, so that I can use its ability to transform columns by referencing their names.
The question is:
How do I quote a column name which starts with a number so that data.table understands that it is a column name?
I think this is what you're looking for, though I'm not sure. data.table is different from data.frame. Please have a look at the quick introduction, and then the FAQ (and also the reference manual if necessary).
require(data.table)
dt <- data.table("4PCS" = 1:3, y=3:1)
#   4PCS y
# 1:    1 3
# 2:    2 2
# 3:    3 1
# access column 4PCS
dt[, "4PCS"]
# returns a data.table
# 4PCS
# 1: 1
# 2: 2
# 3: 3
# to access multiple columns by name
dt[, c("4PCS", "y")]
Alternatively, if you need the column as a vector rather than as a data.table, you can use the $ notation:
dt$`4PCS` # notice the ` because the variable begins with a number
# [1] 1 2 3
# alternatively, as mnel mentioned under comments:
dt[, `4PCS`]
# [1] 1 2 3
Or if you know the column number you can access using [[.]] as follows:
dt[[1]] # 4PCS is the first column here
# [1] 1 2 3
Edit:
Thanks @joran. I think you're looking for this:
dt[, `4PCS` + y]
# [1] 4 4 4
Fundamentally the issue is that 4PCS is not a valid variable name in R (try 4PCS <- 1, you'll get the same "unexpected symbol" error). So to refer to it, we have to use backticks (compare `4PCS` <- 1).
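You can also put an 'X' immediately before the variable name, but only if the names were made syntactically valid when the data was created (for example by make.names, or by read.csv with its default check.names=TRUE), since that is what prepends the 'X'. R then recognises it as a name rather than parsing the number and the string as separate tokens (and hence bad syntax).
So e.g. when calling 4PCS use X4PCS
as in
mydata <- X4PCS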
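Putting the answers above together on the question's own table M, the following sketch shows backticks in j, the same with :=, and a variant for when the names are held in a character vector. The names sums, sums2, total and cols are illustrative, and the .SDcols variant is my own addition rather than something taken from the answers.

```r
library(data.table)

M <- data.table("4PCS" = 1:4, "5Y" = 4:1, X5Y = 2:5)

# Backticks let non-syntactic names be used as ordinary symbols in j
sums <- M[, `4PCS` + `5Y`]

# The same works with := to store the result as a new column
M[, total := `4PCS` + `5Y`]

# When the names are held in a character vector, .SDcols avoids quoting them in j
cols <- c("4PCS", "5Y")
sums2 <- M[, Reduce(`+`, .SD), .SDcols = cols]

# make.names shows where an X prefix would come from
make.names(cols)
```

In this table X5Y is a separate, unrelated column, which is one reason to prefer backticks over relying on an X prefix.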

split data.table

I have a data.table which I want to split into two. I do this as follows:
dt <- data.table(a=c(1,2,3,3),b=c(1,1,2,2))
sdt <- split(dt,dt$b==2)
but if I want to do something like this as a next step
sdt[[1]][,c:=.N,by=a]
I get the following warning message.
Warning message: In `[.data.table`(sdt[[1]], , `:=`(c, .N), by = a) :
Invalid .internal.selfref detected and fixed by taking a copy of the
whole table, so that := can add this new column by reference. At an
earlier point, this data.table has been copied by R. Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(),
setnames() and setattr(). Also, list(DT1,DT2) will copy the entire DT1
and DT2 (R's list() copies named objects), use reflist() instead if
needed (to be implemented). If this message doesn't help, please
report to datatable-help so the root cause can be fixed.
Just wondering if there is a better way of splitting the table so that it would be more efficient (and would not get this message)?
This works in v1.8.7 (and may work in v1.8.6 too) :
> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
a b
1: 1 1
2: 2 1
$`TRUE`
a b
1: 3 2
2: 3 2
> sdt[[1]][,c:=.N,by=a] # now no warning
> sdt
$`FALSE`
a b c
1: 1 1 1
2: 2 1 1
$`TRUE`
a b
1: 3 2
2: 3 2
But, as @mnel said, that's inefficient. Please avoid splitting if possible.
I was looking for a way to split a data.table and came across this old question.
Sometimes a split is what you want to do, and the data.table "by" approach is not convenient.
Actually you can easily do the split by hand with data.table-only instructions, and it works very efficiently:
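# Splits dt into consecutive runs of equal values in column attr,
# so dt should already be sorted by that column.
SplitDataTable <- function(dt, attr) {
  # last row of each run, bracketed by 0 and nrow(dt)
  boundaries <- c(0, which(head(dt[[attr]], -1) != tail(dt[[attr]], -1)), nrow(dt))
  mapply(
    function(start, end) dt[start:end, ],
    head(boundaries, -1) + 1,
    tail(boundaries, -1),
    SIMPLIFY = FALSE)
}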
As mentioned above (@jangorecki), the data.table package already has its own function for splitting. In that simplified case we can use:
> dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))
> split(dt, by = "b")
$`1`
a b
1: 1 1
2: 2 1
$`2`
a b
1: 3 2
2: 3 2
For more difficult/concrete cases, I would recommend creating a new variable in the data.table using the by-reference functions := or set and then calling split. If you care about performance, make sure to always remain in the data.table environment, e.g. dt[, SplitCriteria := (...)], rather than computing the splitting variable externally.
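That recommendation could look roughly like the sketch below for the question's table. It assumes a data.table version that provides the by argument of split, and the column names grp and n are placeholders chosen for illustration.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Splitting criterion computed by reference, inside the data.table
dt[, grp := b == 2]

# Per-group count added before splitting, so no := is needed on the pieces
dt[, n := .N, by = a]

# data.table's split method: one data.table per value of grp
sdt <- split(dt, by = "grp")
names(sdt)
```

Doing the grouped computation before the split keeps all the work on a single table; the pieces are only created at the end.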
