Last up-to-x rows per group from DT

I have a data.table from which I want to pull the last 10,000 rows on a per-group basis. Unfortunately, I have been getting inconsistent results depending on the method employed, so I am obviously not understanding the full picture. I have concerns about each of my methods.
The data is structured such that I have a set of columns to group by, and within each group I want the entries corresponding to the last 10,000 POSIXct timestamps (if that many exist; otherwise return all of them). Entries are defined as unique by the combination of the grouping columns and the timestamp, even though there are several other data columns. In the example below, my timestamp column is ts, and keycol1 and keycol2 are the fields I'm grouping on. DT has 2,809,108 entries.
setkey(DT,keycol1,keycol2,ts)
DT[DT[,.I[.N-10000:.N], by=c("keycol1","keycol2")]$V1]
returns 1,181,256 entries. No warning or error is issued. My concern is what happens when .N-10000 is less than 1 for a group. When I do DT[-10:10] I receive the following error:
Error in `[.data.table`(DT, -10:10) :
  Item 1 of i is -10 and item 12 is 1. Cannot mix positives and negatives.
leading me to believe that .I[.N-10000:.N] may not be working as intended.
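For what it's worth, operator precedence would explain this: in R, : binds tighter than binary -, so .N-10000:.N is parsed as .N - (10000:.N) rather than (.N-10000):.N, producing a vector of offsets instead of the intended index range. A minimal sketch of the presumably intended expression, clamped so groups smaller than 10,000 return all of their rows:

# ":" binds tighter than "-", so .N-10000:.N means .N - (10000:.N)
# intended: the last up-to-10,000 rows of each group
DT[DT[, .I[max(1L, .N - 10000L + 1L):.N], by = c("keycol1", "keycol2")]$V1]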
If I instead try to sort the timestamps backward and then use the strategy described by Jaap in his answer to this question
DT[DT[order(-ts), .I[1:10000], by=c("keycol1","keycol2")]$V1, nomatch=NULL]
returns 3,810,000 entries, some of which are all NA, suggesting that the nomatch parameter isn't being honored (nomatch=0 returns the same). Chaining a [!is.na(ts)] tells me that it returns 1,972,166 valid entries, which is more than the previous "solution." However, do the values for .I correspond with the row numbers of the original DT or of the reverse-sorted (within group) DT? So, does the outer selection return the true matches, or will it actually result in the first 10000 entries per group, rather than the last 10000?
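A hedged note on both concerns, based on my reading of ?data.table: .I holds row numbers of the original DT even when i reorders the rows, so the outer selection does return the true last-timestamp matches; and the NA rows come from .I[1:10000] itself, which pads groups smaller than 10,000 with NA indices (nomatch applies to joins, not to numeric row selection in i). Clamping the range avoids the padding:

# take the first min(.N, 10000) rows of each reverse-sorted group,
# i.e. the last up-to-10,000 timestamps, with no NA padding
DT[DT[order(-ts), .I[1:min(.N, 10000L)], by = c("keycol1", "keycol2")]$V1]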
Okay, to resolve this question: can I have the key itself work backward?
setkey(DT,keycol1,keycol2,ts)
setorder(DT,keycol1,keycol2,-ts)
key(DT)
NULL
setkey(DT,keycol1,keycol2,-ts)
Error in setkeyv(x, cols, verbose = verbose, physical = physical) :
some columns are not in the data.table: -ts
That'd be a no, then.
Can this be resolved by using .SD rather than .I?
DT[
DT[order(-ts), .SD[1:10000], by=c("keycol1","keycol2")],
nomatch=0, on=c("keycol1","keycol2","ts")
]
returns 1,972,166 entries. Although I'm fairly confident these are the entries I want, it also duplicates the columns that are not part of the key or timestamp (as i.A, i.B, etc.). I believe these are the same entries as the .I[1:10000] example with order(-ts): if I store each result, delete the extra columns from the .SD version, run setkey(result, keycol1, keycol2, ts) on both, then identical(resultA, resultB) returns
TRUE
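For completeness, a simpler idiom avoids the self-join entirely: tail() already returns min(.N, n) rows, so small groups come back whole. A sketch, assuming DT is keyed (and therefore sorted ascending) by keycol1, keycol2, ts as above:

# last up-to-10,000 rows per group; tail() handles groups with fewer rows
result <- DT[, tail(.SD, 10000), by = .(keycol1, keycol2)]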
Related threads:
Throw away first and last n rows
data.table - select first n rows within group
How to extract the first n rows per group?


is foverlaps deterministic with mult="last" (or "first")?

This question is somewhat related to GitHub issue #1617 of data.table.
I would like to use data.table's foverlaps function to match dates to within a range, but only match to the range that has the greatest end date (which, by the structure of my data, would also be the largest range of dates, since the start date should be the same for all matching intervals for a given Rx_ID).
Here is the relevant code snippet:
setkey(dat, Rx_ID, eventtime, duptime) # eventtime = duptime (POSIXct)
setkey( sl, Rx, Rx_UTC_Start, Rx_UTC_end) # Rx_UTC_end > Rx_UTC_Start (POSIXct)
fo<-foverlaps(dat, sl, type="within", mult="last")
Questions
Will mult="last" guarantee that it picks the greatest Rx_UTC_end for a given Rx_ID and Rx_UTC_Start pair, given that Rx_UTC_end is part of the key for sl?
Is foverlaps the correct function to use, or should I just use a straight-up non-equi join, given the duplication of date stamps in dat needed to satisfy foverlaps?
What would I do if I wanted to use foverlaps to find the greatest (or least) value in some other column of sl rather than Rx_UTC_end?
Would I have to use mult="all" and sort the results, then eliminate rows with lesser values in my targeted column? (A sketch of this approach appears after this list.)
Or would foverlaps sort in the order of any setindex indices that I have for sl, despite the index operation not reordering the data in sl itself? Then I would use the mult="last" restriction.
Or is there a way to have 4 keys for sl and 3 for dat? In that scenario the third-from-last column in sl would be a spectator to the join, but would be used to sort sl and the post-foverlaps fo prior to the mult restriction.
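A hedged sketch of the mult="all" route, reusing the column names from the snippet above (one possible approach, not a confirmed answer): take all overlaps, then keep the row with the greatest Rx_UTC_end per dat row, assuming (Rx_ID, eventtime) identifies a dat row.

library(data.table)
# all overlapping intervals; drop dat rows with no overlap
fo <- foverlaps(dat, sl, type = "within", mult = "all", nomatch = NULL)
# keep, for each dat row, the overlap with the greatest Rx_UTC_end
fo_last <- fo[fo[, .I[which.max(Rx_UTC_end)], by = .(Rx_ID, eventtime)]$V1]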

Merge with multiple conditions and nearest numerical match

From looking through Stack Overflow and other sources, I believe that changing my dataframes to data.tables and using setkey, or similar, will give me what I want, but as of yet I have been unable to get a working syntax.
I have two data frames, one containing 26000 rows and the other containing 6410 rows.
The first dataframe contains the following columns:
Customer name, Base_Code, Idenity_Number, Financials
The second dataframe holds the following:
Customer name, Base_Code, Idenity_Number, Financials, Lapse
Both sets of data have identical formatting.
My goal is to join the Lapse column from the second dataframe to the first dataframe. The issue is that the numeric value in Financials does not match between the two datasets, and I only want the closest match in DF1 to receive the Lapse value from DF2.
There will be examples where there are multiple entries for the same customer ID and Base_Code in each dataframe, so I need to merge the two based on Idenity_Number and Base_Code (which match exactly) and then match each entry against the nearest financial value only.
There will never be more entries in DF2 than in DF1 for a given customer and Base_Code.
Here is an example of DF1:
Here is an example of DF2:
And finally, here is what I want to end up with:
If we use Jessica Rabbit as the example, we have a match between DF1 and DF2: the financial value of 1240 from DF1 was matched against 1058 in DF2, as that was the closest match.
I could not work out how to get a working solution using data.table, so I re-thought my approach and have come up with a solution.
First of all I merged the two datasets, and then removed any entries that had a status of "LAP"; this gave me all of the non-lapsed entries:
NON_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.x = TRUE)
NON_LAP <- NON_LAP[!grepl("LAP", NON_LAP$Status, ignore.case = FALSE), ]
Next I merged again, this time looking specifically for the lapsed cases. To work out which was the closest match I used the abs function, then ordered by the lowest difference to get the closest matches first. Finally I removed duplicates to keep only the closest matches, and separately kept the duplicates with the "LAP" status stripped out, to ensure that entries that were not the closest match remained in the data.
Finally I merged them all together giving me the required outcome.
FIND_LAP <- merge(x = Merged, y = LapsesMonth, by = c("POLICY_NO", "LOB_BASE"), all.y = FALSE)
FIND_LAP$Difference <- abs(FIND_LAP$GWP - FIND_LAP$ACTUAL_PRICE)
FIND_LAP <- FIND_LAP[order(FIND_LAP$Difference), ]  # Difference was column 27
FOUND_LAP <- FIND_LAP[!duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
NOT_LAP <- FIND_LAP[duplicated(FIND_LAP[c("POLICY_NO", "LOB_BASE")]), ]
Hopefully this will help someone else who might be new to R and encounters the same issue.
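For anyone tackling the same problem with data.table itself, roll = "nearest" can express the nearest-match step directly. A minimal sketch under the column names described above (note that, unlike the duplicate-removal approach, roll = "nearest" alone does not stop one DF2 row from matching several DF1 rows):

library(data.table)
setDT(DF1); setDT(DF2)
# for each DF1 row, pull Lapse from the DF2 row with the same
# Idenity_Number and Base_Code and the nearest Financials value
DF1[, Lapse := DF2[DF1, x.Lapse,
                   on = c("Idenity_Number", "Base_Code", "Financials"),
                   roll = "nearest"]]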

Eliminate rows with identical ID but different condition code (1 or 2)

I have a large data set with over a thousand participants. Each participant has a unique ID. Each time a participant was tested, their data was entered on a separate row. Participants were tested under two conditions, coded "1" and "2". Some participants were always tested under condition 1, some were always tested under condition 2, and still others were tested under both conditions 1 and 2.
For this analysis, I want to eliminate participants that were tested under two different conditions, retaining only participants that were always tested under the same condition.
I have to find rows with identical IDs (indicating the same participant) but different condition codes and eliminate those rows. I am familiar with subset, but I am not sure how to create the data subset I need in this case.
Any help would be appreciated.
In data.table
library(data.table)
setDT(old_data)
new_data <- old_data[ , if (uniqueN(condition_code) == 1) .SD, by = participant_id]
setDT adds the data.table class to your data.frame so it can be passed to data.table methods. uniqueN is equivalent to (but faster than) length(unique()) and this statement ensures there is exactly one unique condition code associated with a given participant (as identified by their participant_id).
.SD is a temporary data set created within each group. Without further modification, .SD simply represents the full set of columns and rows associated with a particular participant_id, so the construction says to return all data associated with participant_ids passing your condition; for those that don't pass, return nothing (NULL is technically returned, and then those rows are dropped in clean-up)
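A toy example with made-up data to illustrate: participant 2 was tested under both conditions, so all of their rows are dropped.

library(data.table)
old_data <- data.table(
  participant_id = c(1, 1, 2, 2, 3),
  condition_code = c(1, 1, 1, 2, 2)
)
old_data[, if (uniqueN(condition_code) == 1) .SD, by = participant_id]
#    participant_id condition_code
# 1:              1              1
# 2:              1              1
# 3:              3              2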

data.table column/data filtering execution order in R?

When applying multiple filters, what is the execution order (left-to-right or right-to-left) for a data.table?
For example,
dt[,!excludeColumns,with=F][date > as.POSIXct('2013-01-02', 'GMT')][is.na(holiday)]
In the above, the data.table:
has a few columns excluded
is filtered to a certain date range
is filtered to one particular holiday period
I would like to know in which order these get executed (so that we can put the filter that produces the smallest amount of data first, letting later steps operate on smaller data and thus run faster).
It should always be left to right!
vec <- 1:10
vec[vec>5][1:2]
[1] 6 7
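The same left-to-right evaluation holds for chained data.table calls: each [...] returns a new data.table that the next [...] operates on. A small made-up illustration:

library(data.table)
dt <- data.table(x = 1:5, y = letters[1:5])
# the column selection runs first, then each row filter in turn
dt[, .(x)][x > 2][1:2]
#    x
# 1: 3
# 2: 4

So yes, placing the most selective filter first means later steps operate on fewer rows.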

rolling joins data.table in R

I am trying to understand a little more about the way rolling joins work and am having some confusion; I was hoping somebody could clarify this for me. To take a concrete example:
dt1 <- data.table(id=rep(1:5, 10), t=1:50, val1=1:50, key="id,t")
dt2 <- data.table(id=rep(1:5, 2), t=1:10, val2=1:10, key="id,t")
I expected this to produce a long data.table where the values in dt2 are rolled:
dt1[dt2,roll=TRUE]
Instead, the correct way to do this seems to be:
dt2[dt1,roll=TRUE]
Could someone explain more about how joining in data.table works, as I am clearly not understanding it correctly? I thought that dt1[dt2,roll=TRUE] corresponded to the SQL equivalent of select * from dt1 right join dt2 on (dt1.id = dt2.id and dt1.t = dt2.t), except with the added functionality of LOCF (last observation carried forward).
Additionally the documentation says:
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one)
as an index.
This makes it seem that only things in X should be returned, and that the join being done is an inner join, not an outer one. What about the case when roll=TRUE but that particular id does not exist in dt1? Playing around a bit more, I can't understand what value is being placed into the column.
That quote from the documentation appears to be from FAQ 1.12 What is the difference between X[Y] and merge(X,Y). Did you find the following in ?data.table and does it help?
roll Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF). Usually, there should be no duplicates in x's key, the
last key column is a date (or time, or datetime) and all the columns
of x's key are joined to. A common idiom is to select a
contemporaneous regular time series (dts) across a set of identifiers
(ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date)
and CJ stands for cross join.
rolltolast Like roll but the data is not rolled forward past the last
observation within each group defined by the join columns. The value
of i must fall in a gap in x but not after the end of the data, for
that group defined by all but the last join column. roll and
rolltolast may not both be TRUE.
In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14 Can you explain further why data.table is inspired by A[B] syntax
in base. That's quite a long answer so I won't paste it here.
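As a concrete check, using the example data from the question: dt2[dt1, roll=TRUE] returns one row per row of dt1, rolling dt2's val2 forward wherever dt1's t falls in a gap in dt2.

library(data.table)
dt1 <- data.table(id = rep(1:5, 10), t = 1:50, val1 = 1:50, key = "id,t")
dt2 <- data.table(id = rep(1:5, 2), t = 1:10, val2 = 1:10, key = "id,t")
res <- dt2[dt1, roll = TRUE]
nrow(res)  # 50: one row per row of dt1
# for id == 1, dt2 has t = 1 and t = 6, so val2 = 6 is carried
# forward (LOCF) to t = 11, 16, ..., 46
res[id == 1 & t > 6, unique(val2)]  # 6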

Resources