is foverlaps deterministic with mult="last" (or "first")?

This question is somewhat related to data.table GitHub issue #1617.
I would like to use data.table's foverlaps function to match dates to within a range, but only match to the range that has the greatest end date (which, by the structure of my data, is also the widest range, since the start date is the same for all matching intervals of a given Rx_ID).
Here is the relevant code snippet:
setkey(dat, Rx_ID, eventtime, duptime)  # eventtime == duptime (POSIXct)
setkey(sl, Rx, Rx_UTC_Start, Rx_UTC_end)  # Rx_UTC_end > Rx_UTC_Start (POSIXct)
fo <- foverlaps(dat, sl, type="within", mult="last")
Questions
Will mult="last" guarantee that it picks the greatest Rx_UTC_end for a given Rx_ID and Rx_UTC_Start pair, since Rx_UTC_end is part of the key for sl?
Is foverlaps the correct function to use, or should I just use a straight non-equi join, given the duplicated date stamps that dat needs just to satisfy foverlaps?
What would I do if I wanted to use foverlaps to find the greatest (or least) value in some other column of sl rather than Rx_UTC_end?
Would I have to use mult="all", sort the results, and then eliminate rows with lesser values in my target column (as in the sketch below)?
Or would foverlaps sort in the order of any setindex indices I have on sl, despite the index operation not reordering the data in sl itself? Then I would use the mult="last" restriction.
Or is there a way to have 4 key columns for sl and 3 for dat? In that scenario the third-from-last column in sl would be a spectator to the join, but would be used to sort sl and the post-foverlaps fo prior to applying the mult restriction.
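A minimal sketch of the mult="all" route mentioned above, with toy data (all values hypothetical). In current data.table versions the overlap matches come back in sl's key order, so mult="last" picks the row with the greatest Rx_UTC_end within each (Rx, Rx_UTC_Start) group; whether that ordering is guaranteed is exactly what issue #1617 asks about, so treat it as an implementation detail:
library(data.table)

# Toy data: two candidate intervals per Rx share a start but differ in end
sl <- data.table(Rx = c(1L, 1L, 2L, 2L),
                 Rx_UTC_Start = as.POSIXct(c("2020-01-01", "2020-01-01",
                                             "2020-02-01", "2020-02-01"), tz = "UTC"),
                 Rx_UTC_end = as.POSIXct(c("2020-01-10", "2020-01-20",
                                           "2020-02-05", "2020-02-15"), tz = "UTC"),
                 other = c(10, 5, 7, 99))
dat <- data.table(Rx_ID = c(1L, 2L),
                  eventtime = as.POSIXct(c("2020-01-05", "2020-02-03"), tz = "UTC"))
dat[, duptime := eventtime]            # zero-width interval to satisfy foverlaps

setkey(sl, Rx, Rx_UTC_Start, Rx_UTC_end)
setkey(dat, Rx_ID, eventtime, duptime)

# mult="last" keeps the last match in sl's key order, i.e. the greatest
# Rx_UTC_end within each (Rx, Rx_UTC_Start)
foverlaps(dat, sl, type = "within", mult = "last")

# For the greatest value of some other column of sl (question 3): take all
# matches, then reduce per event
fo <- foverlaps(dat, sl, type = "within", mult = "all")
fo[, .SD[which.max(other)], by = .(Rx, eventtime)]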


Last up-to-x rows per group from DT

I have a data.table from which I want to pull out the last 10,000 rows on a per-group basis. Unfortunately, I have been getting inconsistent results depending on the method employed, so I am obviously not understanding the full picture. I have concerns about each of my methods.
The data is structured such that I have a set of columns to group by, and within each group I want the entries corresponding to the last 10,000 POSIXct timestamps (if that many exist; otherwise all of them). Entries are defined as unique by the combination of the grouping columns and the timestamp, even though there are several other data columns. In the example below, my timestamp column is ts, and keycol1 and keycol2 are the fields I'm grouping on. DT has 2,809,108 entries.
setkey(DT,keycol1,keycol2,ts)
DT[DT[,.I[.N-10000:.N], by=c("keycol1","keycol2")]$V1]
returns 1,181,256 entries. No warning or error is issued. My concern is what happens when .N-10000 is less than 1 for a group. When I do a DT[-10:10] I receive the following error:
Error in [.data.table(DT, -10:10) : Item 1 of i is -10 and item
12 is 1. Cannot mix positives and negatives.
leading me to believe that the .I[.N-10000:.N] may not be working as intended.
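The suspicion is well founded: : binds tighter than binary -, so .N-10000:.N evaluates as .N - (10000:.N), a vector of zero and negative positions. A corrected sketch (not from the original thread), clamping the range at the first row of each group:
setkey(DT, keycol1, keycol2, ts)   # ascending ts within each group
n <- 10000L
DT[DT[, .I[max(1L, .N - n + 1L):.N], by = .(keycol1, keycol2)]$V1]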
If I instead try to sort the timestamps backward and then use the strategy described by Jaap in his answer to this question:
DT[DT[order(-ts), .I[1:10000], by=c("keycol1","keycol2")]$V1, nomatch=NULL]
this returns 3,810,000 entries, some of which are all NA, suggesting that the nomatch parameter isn't being honored (nomatch=0 returns the same). Chaining a [!is.na(ts)] tells me that it returns 1,972,166 valid entries, which is more than the previous "solution". However, do the values of .I correspond to the row numbers of the original DT or of the reverse-sorted (within group) DT? That is, does the outer selection return the true matches, or will it actually return the first 10,000 entries per group rather than the last 10,000?
Okay, to resolve this question: can I make the key itself work backward?
setkey(DT,keycol1,keycol2,ts)
setorder(DT,keycol1,keycol2,-ts)
key(DT)
NULL
setkey(DT,keycol1,keycol2,-ts)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: -ts
That'd be a no then.
Can this be resolved by using .SD rather than .I?
DT[
  DT[order(-ts), .SD[1:10000], by=c("keycol1","keycol2")],
  nomatch=0, on=c("keycol1","keycol2","ts")
]
returns 1,972,166 entries. Although I'm fairly confident these are the entries I want, the join also duplicates the columns that are not part of the key or timestamp (as i.A, i.B, etc.). I think these are the same entries as in the .I[1:10000] example with order(-ts): if I store each result, delete the extra columns from the .SD version, then setkey(resultA, keycol1, keycol2, ts) on each and run identical(resultA, resultB), it returns
TRUE
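For what it's worth, tail(.I, n) sidesteps both the precedence trap and the clamping arithmetic, returns at most n original row numbers per group, and duplicates no columns (a sketch, not from the original thread):
setkey(DT, keycol1, keycol2, ts)   # ascending ts, so the tail is the latest
DT[DT[, tail(.I, 10000L), by = .(keycol1, keycol2)]$V1]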
Related threads:
Throw away first and last n rows
data.table - select first n rows within group
How to extract the first n rows per group?

data.table column/data filtering execution order in R?

When applying multiple filters, what is the execution order (left-to-right or right-to-left) for a data.table?
For example,
dt[,!excludeColumns,with=F][date > as.POSIXct('2013-01-02', 'GMT')][is.na(holiday)]
In the above, the data.table is:
having a few columns excluded
being filtered to rows in a certain date range
being filtered on the holiday column (is.na(holiday))
I would like to know in which order these get executed, so that we can put the filter that produces the smallest result first, giving later steps less data to operate on and thus making the chain faster.
It should always be left to right!
vec <- 1:10
vec[vec>5][1:2]
[1] 6 7
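Each [...] returns a new data.table that the next [...] then operates on, exactly as with the vector above, so placing the most selective filter first shrinks every intermediate result. A small illustration with hypothetical columns matching the question:
library(data.table)

dt <- data.table(date = as.POSIXct("2013-01-01", tz = "GMT") + 86400 * 0:9,
                 holiday = c("newyear", rep(NA, 9)),
                 x = 1:10)

# The date filter keeps 8 of 10 rows; is.na(holiday) keeps 9 of 10.
# Running the stronger filter first means the second [] scans fewer rows.
dt[date > as.POSIXct("2013-01-02", tz = "GMT")][is.na(holiday)]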

Filtering data in R (complex)

I have a dataset with 7 million records.
I need to filter the data to only show about 9000 of these.
The first field, dmg, is effectively the primary key and takes the format 1-Apr-123456. There are about 12 occurrences of each dmg value.
Another column, O_Y, takes the value 0 or 1. It is most often 0, but is 1 on about 900 occasions.
I would like to return all the rows with the same dmg value where at least one of those records has an O_Y value of 1.
I recommend using data.table for this (its fread function will also be quite handy for reading in the large data set, as you say you have enough RAM).
I am not sure that the following is the best way to do it in data.table but, at least, it should get you started. Hopefully someone else will come along and list the most idiomatic data.table way; this is what I can think of right now:
Assume your data.table is called DT and has the two columns dmg and O_Y. Use O_Y as the key for DT and subset DT for O_Y == 1 (DT[.(1)] in data.table syntax). Then find the corresponding dmg values; the unique of these values is your keys.with.ones. All of this is done succinctly as follows:
setkey(DT, O_Y)
keys.with.ones <- unique(DT[.(1), dmg])
Next, we need to extract rows corresponding to these values of dmg. For this we need to change the key for DT to dmg and extract the rows corresponding to the keys above:
setkey(DT, dmg)
DT.filtered <- DT[.(keys.with.ones)]
And we are done. :)
Please refer to ?data.table to figure out a better method if possible and let us know.
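In later data.table versions the same filter needs no keys at all; a sketch with toy data (values hypothetical, column names as in the question):
library(data.table)

DT <- data.table(dmg = rep(c("1-Apr-000001", "1-Apr-000002", "1-Apr-000003"), each = 3),
                 O_Y = c(0, 0, 1,  0, 0, 0,  1, 0, 0))

# Keep whole groups that contain at least one O_Y == 1
DT[, if (any(O_Y == 1)) .SD, by = dmg]

# Equivalent two-step form, mirroring the keyed answer above
DT[dmg %in% DT[O_Y == 1, unique(dmg)]]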

Remove a range in data.table

I am trying to exclude some rows from a data.table based on, let's say, day and month: for example summer holidays, which always begin on the 15th of June and end on the 15th of the next month. I can extract those days based on Date, but since the as.Date function is awfully slow to operate with, I have separate integer columns for Month and Day, and I want to do it using only them.
It is easy to select the given entries by
DT[Month==6][Day>=15]
DT[Month==7][Day<=15]
Is there any way to take the "difference" of two data.tables (the original one and the rows I selected)? (Why not subset directly? Maybe I am missing something simple, but I don't want to exclude days like 10/6 or 31/7.)
I am aware of a way to do it with join, but only day by day
setkey(DT, Month, Day)
DT[-DT[J(Month,Day), which= TRUE]]
Can anyone help how to solve it in more general way?
Great question. I've edited the question title to match the question.
A simple approach avoiding as.Date which reads nicely :
DT[!(Month*100L+Day) %between% c(0615L,0715L)]
That's probably fast enough in many cases. If you have a lot of different ranges, then you may want to step up a gear :
DT[, mmdd := Month*100L + Day]
setkey(DT, mmdd)                              # sort by mmdd so the range is contiguous
from = DT[J(0615), mult="first", which=TRUE]  # first row with mmdd == 0615
to = DT[J(0715), mult="last", which=TRUE]     # last row with mmdd == 0715
DT[-(from:to)]
That's a bit long and error prone because it's DIY. So one idea is that a list column in an i table would represent a range query (FR#203, like a binary search %between%). Then a not-join (also not yet implemented, FR#1384) could be combined with the list column range query to do exactly what you asked :
setkey(DT,mmdd)
DT[-J(list(0615,0715))]
That would extend to multiple different ranges, or the same range for many different ids, in the usual way; i.e., more rows added to i.
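Both pieces have since been implemented in some form: not-joins via ! in i, and non-equi joins via on= (data.table >= 1.9.8). A sketch of the same exclusion with those tools, assuming the mmdd column from above:
# Not-join: drop every row whose mmdd falls in the range. 615:715 includes
# impossible values such as 632-699, which is harmless for an exclusion.
excl <- data.table(mmdd = 615L:715L)
DT[!excl, on = "mmdd"]

# Or compute the matching row numbers with a non-equi join and drop them
# (assumes at least one row falls in the range)
hit <- DT[.(lo = 615L, hi = 715L), on = .(mmdd >= lo, mmdd <= hi),
          which = TRUE, nomatch = NULL]
DT[-hit]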
Based on the answer here, you might try something like
# Sample data
DT <- data.table(Month = sample(c(1, 3:12), 100, replace = TRUE),
                 Day = sample(1:30, 100, replace = TRUE),
                 key = "Month,Day")
# Dates that you want to exclude
excl <- as.data.table(rbind(expand.grid(6, 15:30), expand.grid(7, 1:15)))
DT[-na.omit(DT[excl, which = TRUE])]
If your data contain at least one entry for each day you want to exclude, na.omit might not be required.

rolling joins data.table in R

I am trying to understand a little more about the way rolling joins work, and am having some confusion that I was hoping somebody could clarify for me. To take a concrete example:
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")
I expected this to produce a long data.table where the values in dt2 are rolled:
dt1[dt2,roll=TRUE]
Instead, the correct way to do this seems to be:
dt2[dt1,roll=TRUE]
Could someone explain more about how joining in data.table works, as I am clearly not understanding it correctly? I thought that dt1[dt2, roll=TRUE] corresponded to the SQL equivalent of select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), except with added LOCF functionality.
Additionally the documentation says:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one)
as an index.
This makes it seem that only things in X should be returned, and that the join being done is an inner join, not an outer one. What about the case when roll=TRUE but that particular id does not exist in dt1? Playing around a bit more, I can't understand what value is being placed into the column.
That quote from the documentation appears to be from FAQ 1.12 What is the difference between X[Y] and merge(X,Y). Did you find the following in ?data.table and does it help?
roll Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF). Usually, there should be no duplicates in x's key, the
last key column is a date (or time, or datetime) and all the columns
of x's key are joined to. A common idiom is to select a
contemporaneous regular time series (dts) across a set of identifiers
(ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date)
and CJ stands for cross join.
rolltolast Like roll but the data is not rolled forward past the last
observation within each group defined by the join columns. The value
of i must fall in a gap in x but not after the end of the data, for
that group defined by all but the last join column. roll and
rolltolast may not both be TRUE.
In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14 Can you explain further why data.table is inspired by A[B] syntax
in base. That's quite a long answer so I won't paste it here.
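A small worked example with the question's own data may make the direction concrete. dt2[dt1, roll=TRUE] returns one row per row of dt1; where an (id, t) pair of dt1 has no exact match in dt2, the last earlier val2 for that id is carried forward:
library(data.table)

dt1 <- data.table(id = rep(1:5, 10), t = 1:50, val1 = 1:50, key = "id,t")
dt2 <- data.table(id = rep(1:5, 2), t = 1:10, val2 = 1:10, key = "id,t")

# For id == 1, dt2 has observations at t = 1 and t = 6 only, so for
# t = 11, 16, ..., 46 the value val2 = 6 is rolled forward (LOCF).
dt2[dt1, roll = TRUE]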

Resources