I have a data frame of users and access times. Access times can be duplicated.
I am trying to create a list of users grouped and named by a given time interval, e.g. year.
timestamp user
1 2013-03-06 01:00:00 1
2 2014-07-06 21:00:00 1
3 2014-07-31 23:00:00 2
4 2014-08-09 17:00:00 2
5 2014-08-14 20:00:00 2
6 2014-08-14 22:00:00 3
7 2014-08-16 15:00:00 3
8 2014-08-19 02:00:00 1
9 2014-12-28 18:00:00 1
10 2015-01-17 17:00:00 1
11 2015-01-22 22:00:00 2
12 2015-01-22 22:00:00 3
13 2015-03-23 15:00:00 4
14 2015-04-05 18:00:00 1
15 2015-04-06 01:00:00 2
My code example already creates a list of users grouped by year.
My problem is that this approach requires modifying the table itself (adding a year column), which becomes a problem with my tables of a million entries.
test <- structure(list(timestamp = c("2013-03-06 01:00:00", "2014-07-06 21:00:00",
"2014-07-31 23:00:00", "2014-08-09 17:00:00", "2014-08-14 20:00:00",
"2014-08-14 22:00:00", "2014-08-16 15:00:00", "2014-08-19 02:00:00",
"2014-12-28 18:00:00", "2015-01-17 17:00:00", "2015-01-22 22:00:00",
"2015-01-22 22:00:00", "2015-03-23 15:00:00", "2015-04-05 18:00:00",
"2015-04-06 01:00:00"), user = c(1L, 1L, 2L, 2L, 2L, 3L, 3L,
1L, 1L, 1L, 2L, 3L, 4L, 1L, 2L)), .Names = c("timestamp", "user"
), class = "data.frame", row.names = c(NA, -15L))
require(lubridate)
#Creating "POSIXct" object from string timestamp
timestamp <- lapply(test$timestamp,
function(x)parse_date_time(x, "y-m-d H:M:S"))
test$timestamp <- do.call(c,timestamp)
print(class(test$timestamp))
#Adding column for year
test <- cbind(test,sapply(timestamp, function(x)year(x)))
colnames(test)[3]<- "year"
#Creating list of year time intervals for users
intervals <- names(table(test$year))
users <- lapply(intervals, function(x)test[test$year %in% x,"user"])
names(users) <- intervals
without timestamps
Treat the timestamp as a character. This only works if, for every timestamp, the first 4 digits represent the year.
library(dplyr)
test %>%
group_by( user, substr(timestamp,1,4 ) ) %>%
summarise( )
# user `substr(timestamp, 1, 4)`
# <int> <chr>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
dplyr + lubridate
will extract the year from the timestamp
library( dplyr )
library( lubridate )
test %>%
mutate( timestamp = as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" ) ) %>%
group_by( user, lubridate::year( timestamp ) ) %>%
summarise( )
# # Groups: user [?]
# user `year(timestamp)`
# <int> <dbl>
# 1 1 2013
# 2 1 2014
# 3 1 2015
# 4 2 2014
# 5 2 2015
# 6 3 2014
# 7 3 2015
# 8 4 2015
table
a frequency table is also quickly made
table( test$user, substr( test$timestamp, 1, 4 ) )
# 2013 2014 2015
# 1 1 3 2
# 2 0 3 2
# 3 0 2 1
# 4 0 0 1
there are many more alternatives... pick one
edit
if speed is an issue, try data.table
dcast(
setDT( test )[, timestamp := as.POSIXct( timestamp, format = "%Y-%m-%d %H:%M:%S" )][, .N, by = list( user, year = data.table::year(timestamp) )],
user ~ year,
value.var = "N")
# user 2013 2014 2015
# 1: 1 1 3 2
# 2: 2 NA 3 2
# 3: 3 NA 2 1
# 4: 4 NA NA 1
Another option using the lightning fast data.table package:
library(data.table)
setDT(test) # make `test` a data.table 'by reference' (no copy is made at all)
test[, j=.(users=list(unique(user))),
by=.(year=substr(test$timestamp,1,4))]
year users
1: 2013 1
2: 2014 1,2,3
3: 2015 1,2,3,4
Again assuming your test$timestamp column is a character vector - otherwise substitute lubridate::year() as needed.
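For example, a sketch of that substitution, assuming timestamp has already been converted to POSIXct:
test[, .(users = list(unique(user))), by = .(year = lubridate::year(timestamp))]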
Update:
Simple change to show grouping instead by month (just as it was mentioned in a comment):
test[, j=.(users=list(unique(user))),
by=.(ym=substr(test$timestamp,1,7))]
ym users
1: 2013-03 1
2: 2014-07 1,2
3: 2014-08 2,3,1
4: 2014-12 1
5: 2015-01 1,2,3
6: 2015-03 4
7: 2015-04 1,2
Or group by day, to help demonstrate how to subset with chaining:
test[, j=.(users=list(unique(user))),
by=.(ymd=substr(test$timestamp,1,11))][ymd>='2014-08-01' & ymd<= '2014-08-21']
ymd users
1: 2014-08-09 2
2: 2014-08-14 2,3
3: 2014-08-16 3
4: 2014-08-19 1
Note on filtering/subsetting: if you are only interested in a subset of dates for a "one off" calculation (and not saving the whole aggregated set for other purposes), it will likely be more efficient to do the subset in the i of DT[i, j, by] for that "one off" calculation, as sketched below.
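For example, a rough sketch of that pattern on the same data (still treating timestamp as a character column):
test[ substr(timestamp, 1, 10) >= "2014-08-01" & substr(timestamp, 1, 10) <= "2014-08-21",
      .(users = list(unique(user))),
      by = .(ymd = substr(timestamp, 1, 10)) ]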
You could also use base (stats) function aggregate() as follows:
aggregate( x = test$user,
by = list(year=substr(test$timestamp,1,4)),
FUN = unique )
Result:
year x
1 2013 1
2 2014 1, 2, 3
3 2015 1, 2, 3, 4
The above works on the assumption that your timestamp column is initially just a character vector, exactly as included in your structured example data. In that case you can substr out the year directly with substr(test$timestamp,1,4), avoiding the need to first convert to dates.
However, if you have the timestamp column already as a date, simply substitute the lubridate::year() function you demonstrated in your attempted solution.
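For instance, a sketch of that substitution (assuming test$timestamp has already been parsed to POSIXct):
aggregate( x = test$user,
           by = list(year = lubridate::year(test$timestamp)),
           FUN = unique )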
Related
I have a cohort of data with multiple person visits and want to group visits with a common ID based on person # and the time of the visit. The condition is that if an admission starts within 24 hours of a previous exit, then I want those visits to have the same ID.
Sample of what data looks like:
dat <- data.frame(
  Person_ID = c(1,1,1,2,3,3,3,4,4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
this is what I tried to start:
dat1 <- dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID = as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
This is almost right, except it does not give the correct ID for visit 7 (person #3), because that person has three visits, the second visit is entirely within the first, and the third starts within 24 hours of the first but not the second.
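One way to express "compare each admission against the latest discharge seen so far, rather than only the previous row" is a running maximum over the discharge times; a rough dplyr sketch of that idea, using the dat defined above:
library(dplyr)
dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  # latest discharge seen before the current row (-Inf for the first visit)
  mutate(latest_exit = lag(cummax(as.numeric(Discharge_Date_Time)), default = -Inf),
         new_episode = as.numeric(Admit_Date_Time) > latest_exit + 24 * 3600,
         run = cumsum(new_episode)) %>%
  ungroup() %>%
  mutate(ID = as.integer(factor(paste(Person_ID, run))))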
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(dplyr)
library(tidyr)
library(lubridate)  # for ddays()
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 1) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
I'm getting started with R, so please bear with me.
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column by hour and sum count_starts and count_ends by the hourly time and station:
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
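A conversion along these lines (the format string here is an assumption based on the sample data above):
transactions$Time <- as.POSIXct(transactions$Time, format = "%m/%d/%Y %H:%M")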
Using base R, we can still do the above; only the hour will be one less for all of them, because cut() labels each interval by its start rather than its end:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm converts the string to a date-time class; the name means "month-day-year hour-minute", which matches the structure of the string. ceiling_date rounds a date-time object up to the unit specified. group_by groups the data by the given variables, and summarise carries out the summary operation.
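A quick illustration of the first two of those on a single value (just a sketch, with lubridate loaded):
mdy_hm("01/01/2015 00:30")
# [1] "2015-01-01 00:30:00 UTC"
ceiling_date(mdy_hm("01/01/2015 00:30"), unit = "hour")
# [1] "2015-01-01 01:00:00 UTC"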
There are basically two things required:
1) Round the Time down to the nearest 1-hour window:
library(data.table)
library(lubridate)
data=data.table(Time=c('01/01/2015 00:30','01/01/2015 00:40','01/01/2015 00:55','01/01/2015 01:17','01/01/2015 01:37'),Station=c('A','A','B','A','A'),count_starts=c(2,2,1,3,1),count_ends=c(3,1,1,1,1))
data[,Time_conv:=as.POSIXct(strptime(Time,'%m/%d/%Y %H:%M'))] # month/day/year, matching the question's format
data[,Time_round:=floor_date(Time_conv,unit="1 hour")]
2) Aggregate the data table obtained above by the rounded time and Station to get the desired result:
New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by=c('Time_round','Station')]
Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
Timestamp ticket_count
(time) (int)
1 2016-01-01 05:30:00 1
2 2016-01-01 05:32:00 1
3 2016-01-01 05:38:00 1
4 2016-01-01 05:46:00 1
5 2016-01-01 05:47:00 1
6 2016-01-01 06:07:00 1
7 2016-01-01 06:13:00 2
8 2016-01-01 06:21:00 1
9 2016-01-01 06:22:00 1
10 2016-01-01 06:25:00 1
I want to know how to calculate, for each ticket, the number of tickets sold within a certain time frame of that ticket. For example, I want to calculate the number of tickets sold up to 15 minutes after each ticket's timestamp. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t),
.(counts=sum(ticket_count)), by=.EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row in t, all rows where df$Timestamp < that row's t are fetched. And by=.EACHI instructs the expression sum(ticket_count) to run for each row in t. That gives your desired result.
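If it helps to see what the join computes, the same counts can be reproduced (much more slowly) with a plain loop over rows in base R, assuming df is the data from the question with Timestamp as POSIXct:
sapply(seq_len(nrow(df)), function(i)
  sum(df$ticket_count[df$Timestamp < df$Timestamp[i] + 15 * 60]))
# [1]  3  4  5  5  5  9 11 11 11 11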
Hope this helps.
This is a simpler version of the ugly one I wrote earlier..
# install.packages('dplyr')
library(dplyr)
your_data %>%
mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'), # adjust the format string to match how your timestamps are stored
ticket_count = as.numeric(ticket_count)) %>%
mutate(window = cut(timestamp, '15 min')) %>%
group_by(window) %>%
dplyr::summarise(tickets = sum(ticket_count))
window tickets
(fctr) (dbl)
1 2016-01-01 05:30:00 3
2 2016-01-01 05:45:00 2
3 2016-01-01 06:00:00 3
4 2016-01-01 06:15:00 3
Here is a solution using data.table. Also incorporating different stores.
Example data:
library(data.table)
dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00")+seq(60,120000,by=60),
ticket_count = sample(1:9, 2000, T),
store = c(rep(c("A","B","C","D"), 500)))
Now apply the following:
ts <- dt$Timestamp
for(x in ts) {
end <- x+900
dt[Timestamp <= end & Timestamp >= x ,CS := sum(ticket_count),by=store]
}
This gives you
Timestamp ticket_count store CS
1: 2016-01-01 05:31:00 3 A 13
2: 2016-01-01 05:32:00 5 B 20
3: 2016-01-01 05:33:00 3 C 19
4: 2016-01-01 05:34:00 7 D 12
5: 2016-01-01 05:35:00 1 A 15
---
1996: 2016-01-02 14:46:00 4 D 10
1997: 2016-01-02 14:47:00 9 A 9
1998: 2016-01-02 14:48:00 2 B 2
1999: 2016-01-02 14:49:00 2 C 2
2000: 2016-01-02 14:50:00 6 D 6
I have data that looks like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 718184 1/2/2014 1/8/2014
1 885236 1/15/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589963 3/18/2015 3/22/2015
2 589999 2/27/2015 5/9/2015
2 594226 4/11/2015 4/17/2015
2 689959 5/10/2015 6/10/2015
3 656696 5/1/2016 5/5/2016
3 669625 5/6/2016 5/22/2016
4 777777 2/21/2015 3/4/2015
4 778952 2/1/2015 2/28/2015
4 778965 3/1/2015 3/22/2015
I am working on two different problems with this. The first one was answered in a previous post about how to roll dates up (Date roll-up in R) and the second now is that I have intervals that are within intervals and I am trying to get rid of them. So the final outcome should look like
ID CLM_ID Date1 Date2
1 718182 1/1/2014 1/17/2014
1 885362 3/20/2014 3/21/2014
2 589999 2/27/2015 5/9/2015
3 656696 5/1/2016 5/22/2016
4 778952 2/1/2015 3/22/2015
Now I know I will have to create the extended intervals via the date rollup first, but then how do I get rid of these sub-intervals (a term I am making up for intervals within intervals)? I am also looking for a solution that is efficient since I actually have 75,000 records to go through (i.e. I am trying to avoid iterative solutions).
Using non-equi joins from the current development version of data.table, v1.9.7,
require(data.table) # v1.9.7+
dt[dt, .(CLM_ID = CLM_ID[.N==1L]), on=.(ID, Date1<=Date1, Date2>=Date2), by=.EACHI]
# ID Date1 Date2 CLM_ID
# 1: 1 2014-01-01 2014-01-17 718182
# 2: 1 2014-03-20 2014-03-21 885362
# 3: 2 2015-02-27 2015-05-09 589999
# 4: 2 2015-05-10 2015-06-10 689959
# 5: 3 2016-05-01 2016-05-05 656696
# 6: 3 2016-05-06 2016-05-22 669625
# 7: 4 2015-02-21 2015-03-04 777777
# 8: 4 2015-02-01 2015-02-28 778952
# 9: 4 2015-03-01 2015-03-22 778965
What this does is, for each row in dt (the one inside of square bracket), it looks up which rows match in dt (on the outside) based on the condition provided to the on argument.
The matching CLM_ID is returned iff the only match is a self-match (since the condition includes equality as well). This is done by CLM_ID[.N == 1L], where .N holds the number of observations for each group.
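The same "keep a row only if nothing else covers it" test can be spelled out longhand in base R, useful for checking a few rows by hand (a slow sketch, assuming dt has the ID, Date1, Date2 columns with Date1/Date2 as Date class):
keep <- sapply(seq_len(nrow(dt)), function(i) {
  same_id <- dt$ID == dt$ID[i]
  covers  <- dt$Date1 <= dt$Date1[i] & dt$Date2 >= dt$Date2[i]
  sum(same_id & covers) == 1L  # only the row itself matches
})
dt[keep, ]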
"I am also looking for a solution that is efficient ... (i.e. I am trying to avoid iterative solutions)."
"Your assumptions are your windows on the world. Scrub them off every once in a while, or the light won't come in." - Isaac Asimov
Below is a super fast base R iterative solution. It returns the correct results for very large data frames virtually instantly. (it also "rolls-up" the data, so there is no need to carry out two algorithms):
MakeDFSubInt <- function(df, includeCost = FALSE, printTime = FALSE) {
## Sorting the data frame to allow for fast
## creation of the "Contained" logical vector below
tempDF <- df[order(df$ID, df$Date1, df$Date2), ]
UniIDs <- unique(tempDF$ID)
Len <- length(UniIDs)
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
## Converting dates to integers so that comparison
## will be faster. Internally dates are stored as
## integers, so this isn't a problem
dte1 <- as.integer(tempDF$Date1)
dte2 <- as.integer(tempDF$Date2)
## Building logical vector in order to quickly create sub-intervals
Contained <- rep(FALSE, dfLen)
BegTime <- Sys.time() ## Included to measure time of for loop execution
for (j in 1:Len) {
Compare <- ifelse(dte2[s[j]] >= (dte1[s[j]+1L]+1L), max(dte2[s[j]], dte2[s[j]+1L]), dte2[s[j]+1L])
for (x in (s[j]+1L):e[j]) {
if (!Contained[x-1L]) {
Contained[x] <- dte2[x-1L] >= (dte1[x]-1L)
} else {
Contained[x] <- Compare >= (dte1[x]-1L)
}
## could use ifelse, but this construct is faster
if (Contained[x]) {
Compare <- max(Compare, dte2[x])
} else {
Compare <- dte2[x]
}
}
}
EndTime <- Sys.time()
TotTime <- EndTime - BegTime
if (printTime) {print(paste(c("for loop execution time was: ", format(TotTime)), collapse = ""))}
## identify sub-intervals
nGrps <- which(!Contained)
## Create New fields for our new DF
ID <- tempDF$ID[nGrps]
CLM_ID <- tempDF$CLM_ID[nGrps]
Date1 <- tempDF$Date1[nGrps]
nGrps <- c(nGrps, dfLen+1L)
## as.Date is converting numbers to dates.
## N.B. This only works if origin is supplied
Date2 <- as.Date(vapply(1L:(length(nGrps) - 1L), function(x) {
max(dte2[nGrps[x]:(nGrps[x+1L]-1L)])}, 1L), origin = "1970-01-01")
## in a related question the OP had, "Cost" was
## included to show how the algorithm would handle
## generic summary information
if (includeCost) {
myCost <- tempDF$Cost
Cost <- vapply(1L:(length(nGrps) - 1L), function(x) sum(myCost[nGrps[x]:(nGrps[x+1L]-1L)]), 100.01)
NewDf <- data.frame(ID,CLM_ID,Date1,Date2,Cost)
} else {
NewDf <- data.frame(ID,CLM_ID,Date1,Date2)
}
NewDf
}
For the example given in the question, we have:
ID <- c(rep(1,4),rep(2,4),rep(3,2),rep(4,3))
CLM_ID <- c(718182, 718184, 885236, 885362, 589963, 589999, 594226, 689959, 656696, 669625, 777777, 778952, 778965)
Date1 <- c("1/1/2014","1/2/2014","1/15/2014","3/20/2014","3/18/2015","2/27/2015","4/11/2015","5/10/2015","5/1/2016","5/6/2016","2/21/2015","2/1/2015","3/1/2015")
Date2 <- c("1/17/2014","1/8/2014","1/17/2014","3/21/2014","3/22/2015","5/9/2015","4/17/2015","6/10/2015","5/5/2016","5/22/2016","3/4/2015","2/28/2015","3/22/2015")
myDF <- data.frame(ID, CLM_ID, Date1, Date2)
myDF$Date1 <- as.Date(myDF$Date1, format = "%m/%d/%Y")
myDF$Date2 <- as.Date(myDF$Date2, format = "%m/%d/%Y")
MakeDFSubInt(myDF)
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17
2 1 885362 2014-03-20 2014-03-21
3 2 589999 2015-02-27 2015-06-10
4 3 656696 2016-05-01 2016-05-22
5 4 778952 2015-02-01 2015-03-22
From a similar question the OP posted, we can add a Cost field, to show how we would proceed with calculations for this setup.
set.seed(7777)
myDF$Cost <- round(rnorm(13, 450, sd = 100),2)
MakeDFSubInt(myDF, includeCost = TRUE)
ID CLM_ID Date1 Date2 Cost
1 1 718182 2014-01-01 2014-01-17 1164.66
2 1 885362 2014-03-20 2014-03-21 568.16
3 2 589999 2015-02-27 2015-06-10 2019.16
4 3 656696 2016-05-01 2016-05-22 990.14
5 4 778952 2015-02-01 2015-03-22 1578.68
This algorithm scales very well. For data frames the size the OP is looking for, returning the requested DF returns almost instantaneously and for very large data frames, it returns in just seconds.
First we build a function that will generate a random data frame with n rows.
MakeRandomDF <- function(n) {
set.seed(109)
CLM_Size <- ifelse(n < 10^6, 10^6, 10^(ceiling(log10(n))))
numYears <- trunc((6/425000)*n + 5)
StrtYear <- ifelse(numYears > 16, 2000, 2016 - numYears)
numYears <- ifelse(numYears > 16, 16, numYears)
IDs <- sort(sample(trunc(n/100), n, replace = TRUE))
CLM_IDs <- sample(CLM_Size, n)
StrtDate <- as.Date(paste(c(as.character(StrtYear),"-01-01"), collapse = ""))
myPossibleDates <- StrtDate+(0:(numYears*365)) ## "numYears" years of data
Date1 <- sample(myPossibleDates, n, replace = TRUE)
Date2 <- Date1 + sample(1:100, n, replace = TRUE)
Cost <- round(rnorm(n, 850, 100), 2)
tempDF <- data.frame(ID = IDs, CLM_ID = CLM_IDs, Date1, Date2, Cost) ## name the columns so MakeDFSubInt finds ID and CLM_ID
tempDF$Date1 <- as.Date(tempDF$Date1, format = "%m/%d/%Y")
tempDF$Date2 <- as.Date(tempDF$Date2, format = "%m/%d/%Y")
tempDF
}
For moderate size DFs (i.e. 75,000 rows)
TestDF <- MakeRandomDF(75000)
system.time(test1 <- MakeDFSubInt(TestDF, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.06500006 secs"
user system elapsed
0.14 0.00 0.14
nrow(test1)
[1] 7618
head(test1)
ID CLM_ID Date1 Date2 Cost
1 1 116944 2010-01-29 2010-01-30 799.90 ## The range of dates for
2 1 515993 2010-02-15 2011-10-12 20836.83 ## each row are disjoint
3 1 408037 2011-12-13 2013-07-21 28149.26 ## as requested by the OP
4 1 20591 2013-07-25 2014-03-11 10449.51
5 1 338609 2014-04-24 2014-07-31 4219.48
6 1 628983 2014-08-03 2014-09-11 2170.93
For very large DFs (i.e. > 500,000 rows)
TestDF2 <- MakeRandomDF(500000)
system.time(test2 <- MakeDFSubInt(TestDF2, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.3679998 secs"
user system elapsed
1.19 0.03 1.21
nrow(test2)
[1] 154839
head(test2)
ID CLM_ID Date1 Date2 Cost
1 1 71251 2004-04-19 2004-06-29 2715.69 ## The range of dates for
2 1 601676 2004-07-05 2004-09-23 2675.04 ## each row are disjoint
3 1 794409 2004-12-28 2005-04-05 1760.63 ## as requested by the OP
4 1 424671 2005-06-03 2005-08-20 1973.67
5 1 390353 2005-09-16 2005-11-06 785.81
6 1 496611 2005-11-21 2005-11-24 904.09
TestDF3 <- MakeRandomDF(1000000)
system.time(test3 <- MakeDFSubInt(TestDF3, includeCost = TRUE, printTime = TRUE))
[1] "for loop execution time was: 0.6930001 secs"
user system elapsed
2.68 0.08 2.79 ## 1 million rows in under 3 seconds!!!
nrow(test3)
[1] 413668
Explanation
The main part of the algorithm is generating the Contained logical vector that is used to determine the sub-intervals of continuous dates. Generation of this vector relies on the fact that the data frame is sorted, first by ID, second by Date1, and finally by Date2. We begin by locating the starting and ending rows of each group of IDs. For example, with the example provided by the OP we have:
myDF
ID CLM_ID Date1 Date2
1 1 718182 2014-01-01 2014-01-17 ## <- 1 s[1]
2 1 718184 2014-01-02 2014-01-08
3 1 885236 2014-01-15 2014-01-17
4 1 885362 2014-03-20 2014-03-21 ## <- 4 e[1]
5 2 589963 2015-03-18 2015-03-22 ## <- 5 s[2]
6 2 589999 2015-02-27 2015-05-09
7 2 594226 2015-04-11 2015-04-17
8 2 689959 2015-05-10 2015-06-10 ## <- 8 e[2]
9 3 656696 2016-05-01 2016-05-05 ## <- 9 s[3]
10 3 669625 2016-05-06 2016-05-22 ## <- 10 e[3]
11 4 777777 2015-02-21 2015-03-04 ## <- 11 s[4]
12 4 778952 2015-02-01 2015-02-28
13 4 778965 2015-03-01 2015-03-22 ## <- 13 e[4]
Below is the code that generates s and e.
## Determine starting (i.e. "s") and ending (i.e. "e")
## points of the respective groups of IDs
e <- which(diff(tempDF$ID)==1)
s <- c(1L, e + 1L)
dfLen <- nrow(tempDF)
e <- c(e, dfLen)
s
1 5 9 11
e
4 8 10 13
Now, we loop over each group and begin populating the logical vector Contained. If the date range for a particular row overlaps (or is a continuance of) the date range above it, we set that particular index of Contained to TRUE. This is why the first row in each group is set to FALSE, since there is nothing above to compare it to. As we are doing this, we are updating the largest date to compare against moving forward, hence the Compare variable. It should be noted that it isn't necessarily true that Date2[n] < Date2[n+1L]; this is why we take Compare <- max(Compare, dte2[x]) for a succession of TRUEs. The result for our example is given below.
ID CLM_ID Date1 Date2 Contained
1 1 718182 2014-01-01 2014-01-17 FALSE
2 1 718184 2014-01-02 2014-01-08 TRUE ## These two rows are contained
3 1 885236 2014-01-15 2014-01-17 TRUE ## in the date range 1/1 - 1/17
4 1 885362 2014-03-20 2014-03-21 FALSE ## This row isn't
6 2 589999 2015-02-27 2015-05-09 FALSE
5 2 589963 2015-03-18 2015-03-22 TRUE
7 2 594226 2015-04-11 2015-04-17 TRUE
8 2 689959 2015-05-10 2015-06-10 TRUE ## N.B. 5/10 is a continuance of 5/09
9 3 656696 2016-05-01 2016-05-05 FALSE
10 3 669625 2016-05-06 2016-05-22 TRUE
12 4 778952 2015-02-01 2015-02-28 FALSE
11 4 777777 2015-02-21 2015-03-04 TRUE
13 4 778965 2015-03-01 2015-03-22 TRUE
Now we can easily identify the "starting" rows by identifying all rows with a corresponding FALSE. After this, finding summary information is a breeze by simply calculating whatever you are interested in (e.g. max(Date2), sum(Cost)) over each succession of TRUEs and Voila!!
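Put differently, once Contained is known the roll-up itself is just a summary over each run of TRUEs; a compressed sketch of that last step (assuming tempDF and Contained exactly as built inside the function above):
starts <- which(!Contained)   # first row of every run
grp    <- cumsum(!Contained)  # run id for every row
rolled <- data.frame(ID     = tempDF$ID[starts],
                     CLM_ID = tempDF$CLM_ID[starts],
                     Date1  = tempDF$Date1[starts],
                     Date2  = as.Date(tapply(as.integer(tempDF$Date2), grp, max),
                                      origin = "1970-01-01"))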
Here is a not-so-pretty solution comparing each row with the dates of all other rows. I corrected the one year 3015 to 2015. The results are different from what you are expecting, though. Either I misunderstood your question, or you misread the data.
Data:
dta <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L),
CLM_ID = c(718182L, 718184L, 885236L, 885362L, 589963L, 589999L, 594226L, 689959L, 656696L, 669625L, 777777L, 778952L, 778965L),
Date1 = structure(c(1L, 3L, 2L, 9L, 8L, 6L, 10L, 12L, 11L, 13L, 5L, 4L, 7L), .Label = c("1/1/2014", "1/15/2014", "1/2/2014", "2/1/2015", "2/21/2015", "2/27/2015", "3/1/2015", "3/18/2015", "3/20/2014", "4/11/2015", "5/1/2016", "5/10/2015", "5/6/2016"), class = "factor"),
Date2 = structure(c(1L, 2L, 1L, 4L, 5L, 10L, 7L, 11L, 9L, 8L, 6L, 3L, 5L), .Label = c("1/17/2014", "1/8/2014", "2/28/2015", "3/21/2014", "3/22/2015", "3/4/2015", "4/17/2015", "5/22/2016", "5/5/2016", "5/9/2015", "6/10/2015"), class = "factor")),
.Names = c("ID", "CLM_ID", "Date1", "Date2"), class = "data.frame",
row.names = c(NA, -13L))
Code:
dta$Date1 <- as.Date(dta$Date1, format = "%m/%d/%Y")
dta$Date2 <- as.Date(dta$Date2, format = "%m/%d/%Y")
# Boolean vector to memorize results
keep <- logical(length = nrow(dta))
for(i in 1:nrow(dta)) {
match <- dta[dta$Date1 <= dta$Date1[i] & dta$Date2 >= dta$Date2[i], ]
if(nrow(match) == 1) keep[i] <- TRUE
}
# Result
dta[keep, ]
Currently I have multiple dataframes in a list with the following format:
datetime precip code
1 2015-04-15 00:00:00 NA M
2 2015-04-15 01:00:00 NA M
3 2015-04-15 02:00:00 NA M
4 2015-04-15 03:00:00 NA M
5 2015-04-15 04:00:00 NA M
6 2015-04-15 05:00:00 NA M
Each dataframe has a different start and end date, but I would like each dataframe to run from 2015-04-01 0:00:00 to 2015-11-30 23:59:59. I would like to generate rows for the missing dates in datetime in each dataframe and fill the precip column with NA, so that I have a continuous time series with nrow = 5856 in each dataframe.
Ignore the code column. If values exist for precip, do not alter them, just fill the additional datetime rows with NAs
My attempt so far yields an error:
library(dplyr)
dates <- seq.POSIXt(as.POSIXlt("2015-04-01 0:00:00"), as.POSIXlt("2015-11-30 23:59:59"), by="hour",tz="GMT")
ts <- format.POSIXct(dates,"%Y/%m/%d %H:%M")
df <- data.frame(datetime=ts)
dat=mylist
final_list <- lapply(dat, function(x) full_join(df,dat$precip))
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
link to sample file in case it is needed
Thanks for your suggestions.
As vitor pointed out above, you can only join two data.frames, not a data.frame and a vector. dplyr also plays nice with POSIXct, but not POSIXlt (Hadley has a preference), so if you store your data as actual time, it will be easier to join usefully.
Further, within lapply, you need to use the variable of the function you create (x here), or you'll just be repeating the same thing. Don't subset the data.frames, either, if you want to join them; you need a column in each with the same name and data type.
All together, you need something like:
library(dplyr)
df$datetime <- as.POSIXct(df$datetime, tz = "GMT")
df <- tbl_df(df) # not necessary, but prints nicely
list_df <- list(df, df) # fake list of data.frames
# make a data.frame of sequence to join on
seq_df <- data_frame(datetime = seq.POSIXt(as.POSIXct("2015-04-01 0:00:00", tz = 'GMT'),
as.POSIXct("2015-11-30 23:59:59", tz = 'GMT'),
by="hour",tz="GMT"))
lapply(list_df, function(x){full_join(x, seq_df)})
# Joining by: "datetime"
# Joining by: "datetime"
# [[1]]
# Source: local data frame [5,857 x 3]
#
# datetime precip code
# (POSI) (lgl) (fctr)
# 1 2015-04-15 00:00:00 NA M
# 2 2015-04-15 01:00:00 NA M
# 3 2015-04-15 02:00:00 NA M
# 4 2015-04-15 03:00:00 NA M
# 5 2015-04-15 04:00:00 NA M
# 6 2015-04-15 05:00:00 NA M
# 7 2015-04-01 04:00:00 NA NA
# 8 2015-04-01 05:00:00 NA NA
# 9 2015-04-01 06:00:00 NA NA
# 10 2015-04-01 07:00:00 NA NA
# .. ... ... ...
#
# [[2]]
# Source: local data frame [5,857 x 3]
#
# datetime precip code
# (POSI) (lgl) (fctr)
# 1 2015-04-15 00:00:00 NA M
# 2 2015-04-15 01:00:00 NA M
# 3 2015-04-15 02:00:00 NA M
# 4 2015-04-15 03:00:00 NA M
# 5 2015-04-15 04:00:00 NA M
# 6 2015-04-15 05:00:00 NA M
# 7 2015-04-01 04:00:00 NA NA
# 8 2015-04-01 05:00:00 NA NA
# 9 2015-04-01 06:00:00 NA NA
# 10 2015-04-01 07:00:00 NA NA
# .. ... ... ...
Data:
df <- structure(list(datetime = structure(c(1429056000, 1429059600, 1429063200, 1429066800,
1429070400, 1429074000), class = c("POSIXct", "POSIXt"), tzone = "GMT"), precip = c(NA,
NA, NA, NA, NA, NA), code = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "M",
class = "factor")), .Names = c("datetime", "precip", "code"), row.names = c("1",
"2", "3", "4", "5", "6"), class = c("tbl_df", "tbl", "data.frame"))