Rolling joins with data.table in R

I am trying to understand a little more about the way rolling joins work, and I am having some confusion that I hope somebody can clarify. To take a concrete example:
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")
I expected this to produce a long data.table where the values in dt2 are rolled:
dt1[dt2,roll=TRUE]
Instead, the correct way to do this seems to be:
dt2[dt1,roll=TRUE]
Could someone explain more about how joining in data.table works, as I am clearly not understanding it correctly? I thought that dt1[dt2,roll=TRUE] corresponded to the SQL equivalent of select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), except with the added LOCF functionality.
Additionally the documentation says:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one)
as an index.
This makes it seem that only things in X should be returned and that the join being done is an inner join, not an outer one. What about the case when roll=TRUE but that particular id does not exist in dt1? Playing around a bit more, I can't understand what value is being placed into the column.

That quote from the documentation appears to be from FAQ 1.12 What is the difference between X[Y] and merge(X,Y). Did you find the following in ?data.table and does it help?
roll Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF). Usually, there should be no duplicates in x's key, the
last key column is a date (or time, or datetime) and all the columns
of x's key are joined to. A common idiom is to select a
contemporaneous regular time series (dts) across a set of identifiers
(ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date)
and CJ stands for cross join.
rolltolast Like roll but the data is not rolled forward past the last
observation within each group defined by the join columns. The value
of i must fall in a gap in x but not after the end of the data, for
that group defined by all but the last join column. roll and
rolltolast may not both be TRUE.
In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14 Can you explain further why data.table is inspired by A[B] syntax
in base. That's quite a long answer so I won't paste it here.
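To make the X[Y] direction concrete, here is a minimal sketch using the question's own data (output abridged; from a recent data.table version, so formatting may differ slightly):
library(data.table)
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")
# i = dt1, x = dt2: one output row per row of dt1, with dt2's val2
# carried forward (LOCF) within each id wherever t falls in a gap
dt2[dt1, roll=TRUE][id == 1]
#     id  t val2 val1
#  1:  1  1    1    1
#  2:  1  6    6    6
#  3:  1 11    6   11   <- val2 rolled forward from t = 6
# ... (val2 stays 6 through t = 46)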

Related

Why am I getting more rows after applying the merge(x,y,all.x=T) function in r?

I have two data sets: data2 and data3.
The relevant information from data3 should be added to the respective rows of data2; the common columns in both sets are "Inschrijfjaar" and "Leeftijd".
I am using the code:
data4=merge(x=data2,y=data3, by=c("Inschrijfjaar", "Leeftijd"),all.x=TRUE)
A check up gives me:
dim(data2)
525380 5
dim(data3)
1707 7
dim(data4)
5307668 10
So the merge is not done correctly: the dimension of data4 should also be 525380 rows, because it is a left join. Instead I am getting far more rows than the left data set has. What could be the cause?
I also tried the code:
data4=merge(x=data2,y=data3,all.x=TRUE)
Sorry, I cannot comment, so this is not a full answer but is meant as a comment:
There are many different forms of joins.
I find them well explained here.
You do a left join, which returns all rows from the left table, and any rows with matching keys from the right table. So you would expect more values in your data4.
What you actually want seems to be a left semi-join: "A semi-join is like an inner join except that it only returns the columns of X (not also those of Y), and does not repeat the rows of X to match the rows of Y" (from this question, which may help you answer yours).
This behaviour occurs when there are multiple rows of data3 that have the same values in your columns c("Inschrijfjaar", "Leeftijd") (and these values appear in data2). Wherever a row of data2 can be merged with multiple records of data3, all combinations are included, leading to more records in data4 than in data2.
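To illustrate that multiplication with a hedged toy example (hypothetical values; d3 holds two rows with the same key):
library(data.table)
d2 <- data.table(Inschrijfjaar = 2020, Leeftijd = 30, x = 1)
d3 <- data.table(Inschrijfjaar = c(2020, 2020), Leeftijd = c(30, 30), y = c("a", "b"))
# all.x = TRUE keeps every row of d2, but each duplicate key in d3
# matches again, so one left row becomes two result rows
merge(d2, d3, by = c("Inschrijfjaar", "Leeftijd"), all.x = TRUE)
#    Inschrijfjaar Leeftijd x y
# 1:          2020       30 1 a
# 2:          2020       30 1 b
# a quick way to spot the offending keys in the real data3:
d3[, .N, by = .(Inschrijfjaar, Leeftijd)][N > 1]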

Last up-to-x rows per group from DT

I have a data.table from which I want to pull out the last 10,000 rows on a per-group basis. Unfortunately, I have been getting inconsistent results depending on the method employed, so I am obviously not understanding the full picture. I have concerns about each of my methods.
The data is structured in such a way that I have a set of columns that I wish to group by, where I want to grab the entries corresponding to the last 10,000 POSIXct timestamps (if that many exist; otherwise return all). Entries are defined as unique by the combination of the grouping columns and the timestamps, even though there are several other data columns. In the example below, my timestamp column is ts, and keycol1 and keycol2 are the fields that I'm grouping on. DT has 2,809,108 entries.
setkey(DT,keycol1,keycol2,ts)
DT[DT[,.I[.N-10000:.N], by=c("keycol1","keycol2")]$V1]
returns 1,181,256 entries. No warning or error is issued. My concern is what happens when .N-10000 is less than 1 for a group. When I do DT[-10:10] I receive the following error:
Error in [.data.table(DT, -10:10) : Item 1 of i is -10 and item
12 is 1. Cannot mix positives and negatives.
leading me to believe that the .I[.N-10000:.N] may not be working as intended.
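A quick precedence check supports that suspicion: the colon binds tighter than the minus, so .N-10000:.N is parsed as .N - (10000:.N), not (.N-10000):.N. Here is a sketch of the intended expression with explicit parentheses and a floor at 1 (toy data keeping the last 2 rows per group; for 10,000 it would be max(1L, .N - 9999L):.N):
1 - 2:3
# [1] -1 -2   <- parsed as 1 - (2:3)
library(data.table)
toy <- data.table(g = rep(c("a","b"), each = 3), ts = 1:6)
setkey(toy, g, ts)
# max(1L, .N - 1L) floors the start index, so groups with < 2 rows return all rows
toy[toy[, .I[max(1L, .N - 1L):.N], by = g]$V1]
#    g ts
# 1: a  2
# 2: a  3
# 3: b  5
# 4: b  6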
If I instead try to sort the timestamps backward and then use the strategy described by Jaap in his answer to this question
DT[DT[order(-ts), .I[1:10000], by=c("keycol1","keycol2")]$V1, nomatch=NULL]
returns 3,810,000 entries, some of which are all NA, suggesting that the nomatch parameter isn't being honored (nomatch=0 returns the same). Chaining a [!is.na(ts)] tells me that it returns 1,972,166 valid entries, which is more than the previous "solution". However, do the values of .I correspond to the row numbers of the original DT or of the reverse-sorted (within group) DT? That is, does the outer selection return the true matches, or will it actually return the first 10,000 entries per group rather than the last 10,000?
Okay, to resolve this question, can I have the key itself work backward?
setkey(DT,keycol1,keycol2,ts)
setorder(DT,keycol1,keycol2,-ts)
key(DT)
NULL
setkey(DT,keycol1,keycol2,-ts)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: -ts
That'd be a no then.
Can this be resolved by using .SD rather than .I?
DT[
  DT[order(-ts), .SD[1:10000], by=c("keycol1","keycol2")],
  nomatch=0, on=c("keycol1","keycol2","ts")
]
returns 1,972,166 entries. Although I'm fairly confident that these entries are the ones I want, this also duplicates the columns that are not part of the key or timestamp (as i.A, i.B, etc.). I think these are the same entries as in the .I[1:10000] example with order(-ts): if I store each result, delete the extra columns from the .SD version, run setkey(resultA, keycol1, keycol2, ts) on each, and then test identical(resultA, resultB), it returns
TRUE
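For what it's worth, a shorter per-group "last n" sketch avoids both the NA padding and the duplicated i.* columns (assuming DT is keyed so ts ascends within each group; tail() naturally returns all rows for groups with fewer than 10,000):
resC <- DT[, tail(.SD, 10000L), by = c("keycol1","keycol2")]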
Related threads:
Throw away first and last n rows
data.table - select first n rows within group
How to extract the first n rows per group?

is foverlaps deterministic with mult="last" (or "first")?

This question is somewhat related to GitHub issue #1617 of data.table.
I would like to use data.table's foverlaps function to match dates to within a range, but only match to the range that has the greatest end date (which, by the structure of my data, would also be the largest range of dates, since the start date should be the same for all matching intervals of a given Rx_ID).
Here is the relevant code snippet
setkey(dat, Rx_ID, eventtime, duptime) # eventtime = duptime (POSIXct)
setkey( sl, Rx, Rx_UTC_Start, Rx_UTC_end) # Rx_UTC_end > Rx_UTC_Start (POSIXct)
fo<-foverlaps(dat, sl, type="within", mult="last")
Questions
Will the mult="last" guarantee that it picks the Rx_UTC_end that is greatest for a given Rx_ID&Rx_UTC_Start pair, as it is part of the key for sl?
Is foverlaps the correct function to use, or should I just use a straight-up non-equijoin, given the duplication in date stamps in dat to satisfy foverlaps
What would I do if I wanted to use foverlaps to find the greatest (or least) value in some other column of sl rather than Rx_UTC_end?
Would I have to use mult="all" and sort the results, then eliminate rows with lesser values in my targeted column?
Or would foverlaps sort in the order of any setindex indices that I have for sl, despite the index operation not reordering the data in sl itself? Then I would use the mult="last" restriction.
Or is there a way to have 4 keys for sl and 3 for dat? In this scenario the third-from-last column in sl would be a spectator to the join, but it would be used to sort sl and the post-foverlaps fo prior to the mult restriction.
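For reference, here is a sketch of the straight non-equi join alternative mentioned in question 2, using the column names from the snippet above. Whether mult="last" then returns the greatest Rx_UTC_end depends on the current row order of sl, which is exactly the crux of question 1:
# each row of dat is matched to the sl rows whose interval contains eventtime;
# mult="last" keeps the last matching sl row in sl's current order
sl[dat,
   on = .(Rx == Rx_ID, Rx_UTC_Start <= eventtime, Rx_UTC_end >= eventtime),
   mult = "last"]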

Match values in each group of a data.table column to values in a vector

I recently started to use the data.table package to identify values in a table's column that conform to some conditions. Although I manage to get most things done, now I'm stuck with this problem:
I have a data table, table1, in which the first column (labels) is a group ID and the second column, o.cell, is an integer. The key is on "labels".
I have another data table, table2, containing a single column: "cell".
Now I'm trying to find, for each group in table1, the values from the column "o.cell" that are in the "cell" column of table2. table1 has some 400K rows divided into 800+ groups of unequal sizes. table2 has about 1.3M rows of unique cell numbers. Cell numbers in column "o.cell" of table1 can be found in more than one group.
This seems like a simple task but I can't find the right way to do it. Depending on the way I structure my call, it either gives me a different result than I expect, or it never completes and I have to kill the R session because it is frozen (my machine has 24 GB RAM).
Here's an example of one of the "variant" of the calls I have tried:
overlap <- table1[, list(over.cell =
                     o.cell[!is.na(o.cell) & o.cell %in% table2$cell]),
                   by = labels]
I'm pretty sure this is the wrong way to use data.tables for this task, and on top of that I can't get the result I want.
I will greatly appreciate any help. Thanks.
Sounds like this is your setup:
dt1 = data.table(labels = c('a','b'), o.cell = 1:10)
dt2 = data.table(cell = 4:7)
And you simply want to do a merge:
setkey(dt1, o.cell)
dt1[dt2]
# o.cell labels
#1: 4 b
#2: 5 a
#3: 6 b
#4: 7 a
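The same join can also be written with the newer on= syntax, which needs no key; alternatively, a %in% flag marks matching rows in place (a sketch against the toy tables above):
dt1[dt2, on = .(o.cell == cell)]        # same rows, no setkey needed (column order may differ)
dt1[, matched := o.cell %in% dt2$cell]  # or just flag the matching rows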

Remove a range in data.table

I am trying to exclude some rows from a data.table based on, let's say, day and month: excluding, for example, summer holidays that always begin on the 15th of June and end on the 15th of the next month. I can extract those days based on Date, but since the as.Date function is awfully slow to operate with, I have separate integer columns for Month and Day, and I want to do it using only them.
It is easy to select the given entries by
DT[Month==6][Day>=15]
DT[Month==7][Day<=15]
Is there any way to take the "difference" of the two data.tables (the original one and the ones I selected)? (Why not subset directly? Maybe I am missing something simple, but I don't want to exclude days like 10/6 or 31/7.)
I am aware of a way to do it with a join, but only day by day:
setkey(DT, Month, Day)
DT[-DT[J(Month,Day), which= TRUE]]
Can anyone help me solve this in a more general way?
Great question. I've edited the question title to match the question.
A simple approach avoiding as.Date which reads nicely:
DT[!(Month*100L+Day) %between% c(0615L,0715L)]
That's probably fast enough in many cases. If you have a lot of different ranges, then you may want to step up a gear:
DT[, mmdd := Month*100L + Day]
setkey(DT, mmdd)
from = DT[J(0615L), mult="first", which=TRUE]
to = DT[J(0715L), mult="last", which=TRUE]
DT[-(from:to)]
That's a bit long and error prone because it's DIY. So one idea is that a list column in an i table would represent a range query (FR#203, like a binary search %between%). Then a not-join (also not yet implemented, FR#1384) could be combined with the list column range query to do exactly what you asked:
setkey(DT,mmdd)
DT[-J(list(0615,0715))]
That would extend to multiple different ranges, or the same range for many different ids, in the usual way; i.e., more rows added to i.
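In the meantime, recent data.table versions provide %inrange%, which handles one or many ranges without DIY index arithmetic; a sketch with two hypothetical holiday windows:
DT[, mmdd := Month*100L + Day]
ranges <- list(lower = c(0615L, 1224L), upper = c(0715L, 1231L))
DT[!mmdd %inrange% ranges]   # drop rows falling inside any window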
Based on the answer here, you might try something like:
# Sample data
DT <- data.table(Month = sample(c(1,3:12), 100, replace = TRUE),
                 Day = sample(1:30, 100, replace = TRUE), key = "Month,Day")
# Dates that you want to exclude
excl <- as.data.table(rbind(expand.grid(6, 15:30), expand.grid(7, 1:15)))
DT[-na.omit(DT[excl, which = TRUE])]
If your data contain at least one entry for each day you want to exclude, na.omit might not be required.
