relating data frames more efficiently [duplicate] - r

I have two data frames, one of hospital stays, and the other of lab results. I need to identify which hospital stay a lab result takes place in, and copy the admission and discharge dates from the hospital data frame into the row for the relevant lab result.
I am doing this with a for loop to walk through the lab results, and then if statements and subsets that look for matching entries (by patient SSN and surrounding dates) in the hospital records.
This is quite a large data set and using the for loop is very slow. Is there a way to speed this kind of problem up? (I have several similar problems, so would love an answer.)
Sample data added; note that there are multiple hospital records for each patient, with the goal being to get the dates from the record whose dates overlap the lab date. In this example, the resulting data frame should only have admission and discharge dates for patient 1, as patient 2 has no hospital data, and patient 3's records do not overlap the lab date.
testDate <- as.Date(c("2017-01-15", "2017-01-15", "2017-01-15"))
patientSSN <- c("1","2","3")
labs <- data.frame(patientSSN, testDate)
# patientSSN testDate
# 1 1 2017-01-15
# 2 2 2017-01-15
# 3 3 2017-01-15
patientSSN <- c("1","1","3","3")
admissionDate <- as.Date(c("2017-01-07", "2017-02-01", "2016-12-01", "2017-01-16"))
dischargeDate <- as.Date(c("2017-01-16", "2017-02-10", "2016-12-15", "2017-02-01"))
hospitalRec <- data.frame(patientSSN, admissionDate, dischargeDate)
for (I in 1:nrow(labs)) {
  labs[I,]$admissionDate <- hospitalRec[hospitalRec$patientSSN == labs[I,]$patientSSN & hospitalRec$admissionDate <= labs[I,]$testDate & hospitalRec$dischargeDate >= labs[I,]$testDate,]$admissionDate
  labs[I,]$dischargeDate <- hospitalRec[hospitalRec$patientSSN == labs[I,]$patientSSN & hospitalRec$admissionDate <= labs[I,]$testDate & hospitalRec$dischargeDate >= labs[I,]$testDate,]$dischargeDate
}
The desired data frame would look like:
labs:
patientSSN testDate admissionDate dischargeDate
1 2017-01-15 2017-01-07 2017-01-16
2 2017-01-15 NA NA
3 2017-01-15 NA NA
Notice that in the real data there is also the problem of multiple hospital records qualifying (discharges between departments); these records would have the same admission date but different discharge times, with the latest one being the important one. But first things first...

A non-equi join works, for example with data.table:
library(data.table)
setDT(labs); setDT(hospitalRec)
labs[hospitalRec, on = .(patientSSN, testDate >= admissionDate, testDate <= dischargeDate),
     `:=`(aDate = i.admissionDate, dDate = i.dischargeDate)]
patientSSN testDate aDate dDate
1: 1 2017-01-15 2017-01-07 2017-01-16
2: 2 2017-01-15 <NA> <NA>
3: 3 2017-01-15 <NA> <NA>
Regarding the follow-up that in the real data multiple hospital records can qualify (discharges between departments), with the same admission date but different discharge times and the latest one being the important one:
If hospitalRec is sorted, adding mult="last" above should work. See ?data.table for full documentation. Alternatively, you could create a version of the hospital records that excludes these "duplicates": sort, then
lastRec <- unique(hospitalRec, by = c("patientSSN", "admissionDate"), fromLast = TRUE)
The setorder function is the standard tool for sorting data.tables.
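Putting those pieces together, a minimal sketch (an editor's assumption that the latest discharge within each patient/admission is the row wanted; untested against the real data):
library(data.table)
# sort so the latest discharge comes last within each patient/admission
setorder(hospitalRec, patientSSN, admissionDate, dischargeDate)
# keep only the last record per patient/admission
lastRec <- unique(hospitalRec, by = c("patientSSN", "admissionDate"), fromLast = TRUE)
# then run the same non-equi update join against the deduplicated records
labs[lastRec, on = .(patientSSN, testDate >= admissionDate, testDate <= dischargeDate),
     `:=`(aDate = i.admissionDate, dDate = i.dischargeDate)]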

Assuming this is similar to what your df looks like, use dplyr::left_join:
hospital_data <- data.frame(PatientSSN = c('1234567890','9876543210'),
                            admit = c('8/1/17','8/5/17'),
                            discharge = c('8/10/17','8/15/17'))
lab_data <- data.frame(specimen_id = c('foo1','foo2','foo3','foo4','foo5','foo6','foo7'),
                       PatientSSN = c('1234567890','1234567890','1234567890','9876543210','9876543210','9876543210','8527419600'),
                       test = c('hemoglobin','inr','platelet','hemoglobin','inr','platelet','inr'))
library(dplyr)
lab_data %>% left_join(hospital_data)
specimen_id PatientSSN test admit discharge
1 foo1 1234567890 hemoglobin 8/1/17 8/10/17
2 foo2 1234567890 inr 8/1/17 8/10/17
3 foo3 1234567890 platelet 8/1/17 8/10/17
4 foo4 9876543210 hemoglobin 8/5/17 8/15/17
5 foo5 9876543210 inr 8/5/17 8/15/17
6 foo6 9876543210 platelet 8/5/17 8/15/17
7 foo7 8527419600 inr <NA> <NA>
Note that your id variable (PatientSSN) is named the same in each table, so left_join picks it up automatically.
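An editor's hedged aside: this join alone ignores the date-overlap requirement. One sketch, using the question's labs and hospitalRec frames, joins on the ID, then blanks out matches whose dates don't overlap and keeps the best row per lab:
library(dplyr)
labs %>%
  left_join(hospitalRec, by = "patientSSN") %>%
  # flag rows where the lab date falls inside the stay (FALSE when no record matched)
  mutate(ok = !is.na(admissionDate) &
           testDate >= admissionDate & testDate <= dischargeDate) %>%
  mutate(admissionDate = if_else(ok, admissionDate, as.Date(NA)),
         dischargeDate = if_else(ok, dischargeDate, as.Date(NA))) %>%
  arrange(patientSSN, desc(ok)) %>%   # overlapping matches first
  distinct(patientSSN, testDate, .keep_all = TRUE) %>%
  select(-ok)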

OK, here's a method. Just a quick heads-up, though: it's pretty unlikely that you'll ever work with EMR data that doesn't have an ID variable specific to a visit/account. I would look to use that as the unique identifier before I used SSN. Nevertheless, this should work. I used the data you provided above.
for(i in 1:nrow(labs)){
  # find the matching patient (SSN)
  ssn_match_df <- hospitalRec[which(as.character(labs$patientSSN[i]) == as.character(hospitalRec$patientSSN)),]
  # find the record where the test date falls between admission/discharge
  ssn_match_df <- ssn_match_df[which(labs$testDate[i] >= ssn_match_df$admissionDate &
                                       labs$testDate[i] <= ssn_match_df$dischargeDate),]
  if(nrow(ssn_match_df) > 0){
    labs[i,3] <- as.character(ssn_match_df[1,2])
    labs[i,4] <- as.character(ssn_match_df[1,3])
  } else {
    labs[i,3] <- NA
    labs[i,4] <- NA
  }
}
colnames(labs)[3] <- 'admitDate'
colnames(labs)[4] <- 'dischargeDate'

Related

Number of overlapping datetimes inside same table (R)

I have a table of about 50 000 rows, with four columns.
ID Arrival Departure Gender
1 10/04/2015 23:14 11/04/2015 00:21 F
1 11/04/2015 07:59 11/04/2015 08:08 F
3 10/04/2017 21:53 30/03/2017 23:37 M
3 31/03/2017 07:09 31/03/2017 07:57 M
3 01/04/2017 01:32 01/04/2017 01:35 M
3 01/04/2017 13:09 01/04/2017 14:23 M
6 10/04/2015 21:31 10/04/2015 23:17 F
6 10/04/2015 23:48 11/04/2015 00:05 F
6 01/04/2016 21:45 01/04/2016 22:48 F
6 02/04/2016 04:54 02/04/2016 07:38 F
6 04/04/2016 18:41 04/04/2016 22:48 F
10 10/04/2015 22:39 11/04/2015 00:42 M
10 13/04/2015 02:57 13/04/2015 03:07 M
10 31/03/2016 22:29 01/04/2016 08:39 M
10 01/04/2016 18:49 01/04/2016 19:44 M
10 01/04/2016 22:28 02/04/2016 00:31 M
10 05/04/2017 09:27 05/04/2017 09:28 M
10 06/04/2017 15:12 06/04/2017 15:43 M
This is a very small representation of the table. What I want to find out is, for each entry, how many others were present at the same time, and then separate them by gender. So, say for example that at the same time as the first presence of the person with ID 1, the person with ID 6 was present and the person with ID 10 was present twice in the same interval. That would mean that 2 overlaps with other people occurred at the same time. This also means that the person with ID 1 has overlapped with 1 Male and 1 Female.
So its result should look like:
ID Arrival Departure Males encountered Females encountered
1 10/04/2015 23:14 11/04/2015 00:21 1 1
How would I be able to calculate this? I have tried to work with foverlaps, and have managed to solve this with Excel, but I would like to do it in R.
Here is a data.table solution using foverlaps.
First, notice that there's an error in your data:
ID Arrival Departure Gender
3 10/04/2017 21:53 30/03/2017 23:37 M
The user arrived almost one month after he actually left. I needed to get rid of that data in order for foverlaps to run.
library(data.table)
dt <- data.table(df)
dt <- dt[Departure > Arrival, ] # filter wrong cases
setkey(dt, "Arrival", "Departure") # prepare for foverlaps
dt2 <- copy(dt) # use a different dt, inherits the key
Run foverlaps, then filter: keep only the cases where the other person's arrival (i.Arrival) is on or before this one's and the IDs differ. Then add a variable counting the male simultaneous guests and another counting the female ones, all grouped by ID and arrival.
simultaneous <- foverlaps(dt, dt2)[i.Arrival <= Arrival & ID != i.ID,
                                   .(malesEncountered = sum(i.Gender == "M"),
                                     femalesEncountered = sum(i.Gender == "F")),
                                   by = .(ID, Arrival)]
Join the findings of the previous command with our original table on ID and arrival
result <- simultaneous[dt, on = .(ID, Arrival)]
Edit: convert the NAs in malesEncountered and femalesEncountered to zero:
result[is.na(malesEncountered), malesEncountered := 0][
  is.na(femalesEncountered), femalesEncountered := 0]
Set the column order to something nicer:
setcolorder(result, c(1, 2, 5, 6, 3, 4))[]
Here's one possibility. This uses lubridate's interval and the int_overlaps function that finds date overlaps. That has a drawback though: interval doesn't work with dplyr. So this version just does all the work manually in a for loop.
It starts by making a 1000-row random dataset that matches yours: each person arrives in a two-year period and departs one or two days later.
It takes about 24 seconds for 1000 rows to run, so you can expect it to take a while for 50K! The for loop prints the row number so you can see where it is, though.
Any questions about the code, let me know.
There must be a faster vectorised way, but interval didn't seem to play nice with apply either. Someone else might have something quicker...
library(tidyverse)
library(lubridate)
#Sample data:
#(Date sampling code: https://stackoverflow.com/questions/21502332/generating-random-dates)
#Random dates between 2017 and 2019
x <- data.frame(
  ID = c(1:1000),
  Arrival = sample(seq(as.Date('2017/01/01'), as.Date('2019/01/01'), by="day"), 1000, replace = T),
  Gender = ifelse(rbinom(1000,1,0.5),'Male','Female') #Random male/female with 50% probability
)
#Make departure one or two days after arrival
x$Departure = x$Arrival + sample(1:2,1000, replace=T)
#Lubridate has a function for checking whether date intervals overlap
#https://lubridate.tidyverse.org/reference/interval.html
#So first, let's make the arrival and departure dates into intervals
x$interval <- interval(x$Arrival,x$Departure)
#Then for every person / row
#We want to know if their interval overlaps with the rest
#At the moment, dplyr doesn't play nice with interval
#https://github.com/tidyverse/dplyr/issues/3206
#So let's go through each row and do this manually
#Keep each person's result in list initially
gendercounts <- list()
#Check timing
t <- proc.time()
#Go through every row manually (sigh!)
for(i in 1:nrow(x)){
  print(paste0("Row ",i))
  #exclude self (don't want to check date overlap with myself)
  overlapcheck <- x[x$ID != x$ID[i],]
  #Find out which dates this person overlaps with - can do all other intervals in one command
  overlapcheck$overlaps <- int_overlaps(x$interval[i], overlapcheck$interval)
  #Eyeball check that it's finding the overlaps we want
  #Is this ID's date overlapping? Tick
  #View(overlapcheck[overlapcheck$overlaps,])
  #Use dplyr to find the number of overlaps for male and female
  #Keep only rows where the overlap is TRUE
  #Also drop the interval column first, as dplyr doesn't like it... (not tidy!)
  gendercount <- overlapcheck %>%
    select(-interval) %>%
    filter(overlaps) %>%
    group_by(Gender) %>%
    summarise(count = n()) %>% #Get count of observations for each sex
    complete(Gender, fill = list(count = 0)) #Keep zero counts: summarise drops them otherwise
  #We want the count for each gender in its own column, so make wide
  gendercount <- gendercount %>%
    spread(key = Gender, value = count)
  #Store for turning into a dataframe shortly
  gendercounts[[length(gendercounts)+1]] <- gendercount
}
#dplyr command: turn list into dataframe
gendercounts <- bind_rows(gendercounts)
#End result. Drop interval column, order columns
final <- cbind(x,gendercounts) %>%
select(ID,Arrival,Departure,Gender,Male,Female)
#~24 seconds per thousand
proc.time()-t

How can I implement a dynamic count within R without using a for loop?

I want to distinctly count the number of customers who have purchased from the company between each SKU's first and last purchase date. This is after I have distinctly counted the number of customers for each SKU in SQL (as well as finding the first and last purchase dates).
I have code that successfully solves this problem; however, it uses a for loop, and it is taking far too long because there are tens of thousands of SKUs. This is a short example of what my SKU table looks like:
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
Then I have another table that is about 6 million rows long, a select distinct of the sales date and the CustomerID that I call OrderTable. I can't summarize the distinct count on a day-to-day basis and sum them together because this would double count customers who have purchased on separate days. I have to re-calculate the distinct count with every FirstPurchase/LastPurchase permutation that I see in my SKUCount table. From there, I use the following code to calculate the distinct number of customers in the given time frame:
library(dplyr)
for (i in 1:nrow(SKUCount)) {
  SKUCount[i, c('DateCustomers')] <-
    sapply(OrderTable %>%
             filter(Date >= SKUCount[i,'FirstPurchase'],
                    Date <= SKUCount[i,'LastPurchase']) %>%
             select(CustomerID),
           function(x) length(unique(x)))
}
As I previously noted, this piece of code DOES work, but it's very slow (~0.5 second for each row). Is there a quicker way to calculate the distinct counts, or is there a more clever solution to my problem?
Try this one:
library("purrrlyr")
library("dplyr")
#First creating the datasets including OrderTable (please correct me if I got it wrong!):
SKUID <- c('123', '456', '789')
NumberOfCustomers <- c(204543, 92703, 305727)
SKUFirstPurchase <- c('2014-05-02', '2014-02-03', '2016-05-13')
SKULastPurchase <- c('2017-09-30', '2018-07-01', '2019-01-09')
SKUCount <- data.frame(SKUID, NumberOfCustomers,
SKUFirstPurchase, SKULastPurchase)
colnames(SKUCount) <- c('SKU', 'NumberOfCustomers',
'FirstPurchase', 'LastPurchase')
OrderTable <- data.frame(
  Date = c('2014-06-02', '2014-08-02', '2015-02-03', '2017-05-13',
           '2015-05-02', '2014-06-03', '2016-07-13', '2017-09-30',
           '2018-07-01', '2019-01-09'),
  CustomerID = c('121','212','3434','24232','121','124','212','131','412','3634'))
#changing factors to date
SKUCount$FirstPurchase<-as.Date(SKUCount$FirstPurchase,format = "%Y-%m-%d")
SKUCount$LastPurchase<-as.Date(SKUCount$LastPurchase,format = "%Y-%m-%d")
OrderTable$Date<-as.Date(OrderTable$Date,format = "%Y-%m-%d")
#defining a function, named FUN, which limit the Date from OrderTable between
#the two date arguments (FirstPurchase and LastPurchase) and returns the
#distinct count of CustomerID's from OrderTable:
FUN <- function(FirstPurchase, LastPurchase){
  Rtrn <- OrderTable %>%
    filter(Date >= FirstPurchase,
           Date <= LastPurchase) %>%
    summarize(n_distinct(CustomerID))
  as.numeric(Rtrn)
}
Next you want to take your dataset, SKUCount, and create a variable called DateCustomers by applying the function, FUN, to every row of it:
SKUCount %>%
  rowwise() %>%
  mutate(DateCustomers = FUN(FirstPurchase, LastPurchase))
# Source: local data frame [3 x 5]
# Groups: <by row>
#
# # A tibble: 3 x 5
# SKU NumberOfCustomers FirstPurchase LastPurchase DateCustomers
# <fct> <dbl> <date> <date> <dbl>
# 1 123 204543 2014-05-02 2017-09-30 6
# 2 456 92703 2014-02-03 2018-07-01 7
# 3 789 305727 2016-05-13 2019-01-09 5
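An editor's hedged aside: the same distinct count can also be expressed as a data.table non-equi join (the technique from the first answer on this page), which avoids row-by-row filtering entirely. This sketch assumes the SKUCount and OrderTable frames built above:
library(data.table)
setDT(SKUCount); setDT(OrderTable)
# for each SKU's window, count distinct customers among the orders inside it;
# na.rm guards windows that happen to contain no orders
SKUCount[, DateCustomers := OrderTable[SKUCount,
    on = .(Date >= FirstPurchase, Date <= LastPurchase),
    uniqueN(CustomerID, na.rm = TRUE), by = .EACHI]$V1]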

Create a variable based on time difference between dates on consecutive rows

I have a very large set of data driven off of an id and a date. The dataset has several hundred million rows and about 10 million id's. I am running in a non-windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment, I'm working with multidplyr, though am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call. The order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So, the first date for the id gets a gap of zero. The second date gets the value of the second date minus the date in the prior row. And so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
  # x$gap <- NA
  for(i in 1:nrow(x)) {
    x[i, ]$gap <- ifelse(i == 1, 0, x[i,]$date - x[i-1, ]$date)
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap){
  ep <- 1
  for(i in 1:nrow(x)){
    ifelse(x[i,]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
    x[i, ]$episode <- ep
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. Calculate the gap between dates, making sure the first row of each id is 0:
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)), by = id]
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't found a solution for the second part of your question yet, but I would encourage others to complement this answer if they find a solution.
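An editor's hedged sketch for step 2, checked only against the sample rows above: a new episode starts whenever the gap exceeds the allowed value, so a grouped cumulative sum can replace the loop (this assumes the setDT(df) and gap column from step 1, and allowable_gap = 30 as in the question):
# episodes increment wherever gap > allowable_gap; +1 makes them start at 1
allowable_gap <- 30
df[, episode := cumsum(gap > allowable_gap) + 1L, by = id]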

Map a list of events (instants) to a list of periods (intervals) in R (with or without lubridate)

I have two data frames: one containing time periods marked with character unique IDs, and another containing events with another set of unique IDs associated with them.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of the event to the ID of the period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF to intervals and then use some form of apply and instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate, in fact a solution with the base R will be even more welcome.
I actually managed to come up with the code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
intervalsP <- as.interval(periodStart, periodEnd)
Step 2: A function should be created for checking if an instant is located within a list of intervals. The only reason I created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
  # x - instants
  # y - intervals
  variable1 <- mapply(`%within%`, x, y)
  if (length(y[variable1]) != 0) {
    as.character(y[variable1])
  } else {
    NA
  }
}
NOTE: I had to use the interval-to-character coercion, because otherwise the intervals were coerced to their lengths in seconds by the apply function, which makes them useless for matching purposes - i.e. all four intervals in this example have the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched periods and events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The eventual DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
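An editor's hedged aside, since base R alternatives were invited: a sketch using findInterval, relying on the stated guarantee that periods never overlap (checked only against the sample data above).
# order periods by start, then locate each event's candidate period
ord <- order(periodDF$periodStart)
p <- periodDF[ord, ]
idx <- findInterval(as.numeric(eventDF$eventTime), as.numeric(p$periodStart))
# a candidate only counts if the event also falls on or before that period's end
hit <- idx > 0 & eventDF$eventTime <= p$periodEnd[pmax(idx, 1)]
eventDF$periodID <- ifelse(hit, as.character(p$periodID[pmax(idx, 1)]), NA)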

Efficient Manipulation of Time Series in R Data Table Package

Comments/answers are much appreciated.
Context: I have a large data table of daily prices of swap rates across a dozen countries. The columns are [ID, Date, X1Y, X2Y, X3Y ... X30Y], where X..Y are columns indicating the part of the yield curve (e.g. X1Y is 1-year swap, X3Y is 3-year swap). The two keys are ID (e.g. "AUD", "GBP") and Date (e.g. "2001-04-13", "2001-04-16").
Dummy Data:
set.seed(123)
dt <- cbind(ID=rep(c("AUD","GBP"),c(100,100)),X1Y=rnorm(200),X2Y=rnorm(200),X3Y=rnorm(200))
dt <- data.table(dt)
dt[,Date := seq(from=as.IDate("2013-01-01"), by="1 day", length.out=100)]
setkeyv(dt,c("ID","Date"))
Problem 1:
First, generate some dummy signals. What's the syntax if there are 100 columns, with a fairly complicated signal-generation formula coded in a separate function, say genSig(X1Y)? Here's what I mean, using just the 3 columns and some meaningless formula:
dt[, SIG1 := c(0, diff(X1Y, 1)), by = "ID"]
dt[, SIG2 := c(0, diff(X2Y, 1)), by = "ID"]
dt[, SIG3 := c(0, diff(X3Y, 1)), by = "ID"]
Problem 2:
Carry forward column(s) based on the "middle of the month". For example, using the SIG columns, I'd like to make everything after, say, the 15th of each month the same as the signal on the 15th, until the next month's 15th. The tricky thing is that the actual data contains just trading days, so some months have no 15th if it falls on a weekend/holiday. Another issue is using an efficient syntax. I could achieve something similar using a loop (I know..); just to show what I meant, for the start of each month:
for (i in 2:length(dt$Date)){
  if(as.POSIXlt(dt[i,]$Date)$mon == as.POSIXlt(dt[i-1,]$Date)$mon){
    dt[i, SIG1 := dt[i-1,SIG1]]
    dt[i, SIG2 := dt[i-1,SIG2]]
    dt[i, SIG3 := dt[i-1,SIG3]]
  }
}
I can't figure out how to deal with the "mid-month" issue, since it can fall on the 15th, 16th, or 17th. As with Problem 1, I would appreciate a smart way to insert/update multiple/dozens of columns.
As far as problem 2 goes, you can use rolling joins:
# small sample to demonstrate
dt = data.table(date = as.Date(c('2013-01-01', '2013-01-15', '2013-01-17', '2013-02-14', '2013-02-17'), '%Y-%m-%d'), val = 1:5)
dt
# date val
#1: 2013-01-01 1
#2: 2013-01-15 2
#3: 2013-01-17 3
#4: 2013-02-14 4
#5: 2013-02-17 5
setkey(dt, date)
midmonth = seq(as.Date('2013-01-15', '%Y-%m-%d'),
               as.Date('2013-12-15', '%Y-%m-%d'),
               by = '1 month')
dt[, flag := 0]
dt[J(midmonth), flag := 1, roll = -Inf]
dt
# date val flag
#1: 2013-01-01 1 0
#2: 2013-01-15 2 1
#3: 2013-01-17 3 0
#4: 2013-02-14 4 0
#5: 2013-02-17 5 1
And now you can cumsum the flag to obtain the grouping you want, to e.g. do:
dt[, val1 := val[1], by = cumsum(flag)]
dt
# date val flag val1
#1: 2013-01-01 1 0 1
#2: 2013-01-15 2 1 2
#3: 2013-01-17 3 0 2
#4: 2013-02-14 4 0 2
#5: 2013-02-17 5 1 5
# problem 1
nsig <- 3L
csig <- 1:nsig + 1L                        # columns X1Y..X3Y sit at positions 2:4
newcols <- paste('SIG', 1:nsig, sep = '')
dt[, (newcols) := 0]
for (j in csig) set(dt, j = j + nsig + 1L, value = c(0, diff(dt[[j]], 1)))
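For the "what if there are 100 columns" part of Problem 1, an editor's hedged sketch with .SD (assuming the corrected numeric dt from the aside at the end of this answer, and that genSig() maps a vector to one of equal length; the body below just mirrors the question's placeholder formula):
genSig <- function(x) c(0, diff(x))  # placeholder formula from the question
cols <- grep("^X", names(dt), value = TRUE)
dt[, paste0("SIG", seq_along(cols)) := lapply(.SD, genSig), .SDcols = cols, by = ID]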
After looking at #eddi's answer, I see that set is not so useful for problem 2. Here's what I would do:
dt[,(newcols):=lapply(newcols,function(x) get(x)[1]),by=list(ID,month(Date-14))]
According to this answer, you can subtract days from a date in this way.
Aside. Cbind-ing vectors makes a matrix. In your example, you've got a character matrix. I think you were looking for...
# Creating better data...
set.seed(123)
dt <- data.table(ID=rep(c("AUD","GBP"),c(100,100)),
X1Y=rnorm(200),X2Y=rnorm(200),X3Y=rnorm(200),
Date=seq(from=as.IDate("2013-01-01"), by="1 day", length.out=100))
