R: data.table: aggregation using referencing over time

I have a dataset with periods
active <- data.table(
  id  = c(1, 1, 2, 3),
  beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00",
                     "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
  end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00",
                     "2018-01-01 02:00:00", "2018-01-01 02:00:00")))
> active
   id                 beg                 end
1:  1 2018-01-01 01:10:00 2018-01-01 01:20:00
2:  1 2018-01-01 01:50:00 2018-01-01 02:00:00
3:  2 2018-01-01 01:50:00 2018-01-01 02:00:00
4:  3 2018-01-01 01:50:00 2018-01-01 02:00:00
during which an id was active. I would like to aggregate across ids and determine for every point in
time <- data.table(seq(from=min(active$beg),to=max(active$end),by="mins"))
the number of IDs that are inactive and the average number of minutes until they get active. That is, ideally, the table looks like
> ans
                   time inactive av.time
 1: 2018-01-01 01:10:00        2      30
 2: 2018-01-01 01:11:00        2      29
...
51: 2018-01-01 02:00:00        0       0
I believe this can be done using data.table but I cannot figure out the syntax to get the time differences.

Using dplyr, we can join by a dummy variable to create the Cartesian product of time and active. The definitions of inactive and av.time might not be exactly what you're looking for, but it should get you started. If your data is very large, I agree that data.table will be a better way of handling this.
library(tidyverse)

# note: this assumes the single column of `time` is named `time`;
# data.table(seq(...)) would name it V1 by default
time %>%
  mutate(dummy = TRUE) %>%
  inner_join({
    active %>%
      mutate(dummy = TRUE)
    # join by the dummy variable to get the Cartesian product
  }, by = c("dummy" = "dummy")) %>%
  select(-dummy) %>%
  # define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time,
                                  difftime(beg, time, units = "mins"), NA)) %>%
  # group by time and summarise
  group_by(time) %>%
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))
# A tibble: 51 x 3
                  time inactive av.time
                <dttm>    <int>   <dbl>
 1 2018-01-01 01:10:00        3      40
 2 2018-01-01 01:11:00        3      39
 3 2018-01-01 01:12:00        3      38
 4 2018-01-01 01:13:00        3      37
 5 2018-01-01 01:14:00        3      36
 6 2018-01-01 01:15:00        3      35
 7 2018-01-01 01:16:00        3      34
 8 2018-01-01 01:17:00        3      33
 9 2018-01-01 01:18:00        3      32
10 2018-01-01 01:19:00        3      31
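For reference, here is a rough data.table translation of the same Cartesian-join logic. It is a sketch with the same caveats about the definitions of inactive and av.time; it assumes the single column of time is renamed to time (data.table(seq(...)) names it V1), and it uses fifelse from data.table >= 1.12.3.
library(data.table)

setnames(time, 1, "time")
# Cartesian product via a join on a constant key
cj <- time[, .(time, k = 1)][active[, .(id, beg, end, k = 1)],
                             on = "k", allow.cartesian = TRUE][, k := NULL]
# same definitions as in the dplyr pipeline above
cj[, `:=`(inactive = time < beg | time > end,
          until = fifelse(beg > time,
                          as.numeric(difftime(beg, time, units = "mins")),
                          NA_real_))]
cj[, .(inactive = sum(inactive), av.time = mean(until, na.rm = TRUE)),
   by = time]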

Related

Creating a Survival Analysis dataset

I have a table composed of three columns: ID, Opening Date and Cancellation Date.
What I want to do is to create 36 observations per client (one per month for 3 years) with a dummy variable. Basically, I want all the monthly observations before the cancellation date to have a 1 and the others a 0. In case the cancellation date is null, all of the values would be 1.
This process should be repeated for every ID.
The desired output would be a table with five columns: ID, Opening Date, Cancellation Date, Month (from 1 to 36, starting on the opening date) and Status (1 or 0).
I've tried everything but haven't managed to solve this problem. I used seq() to create the dates, e.g. seq(table$Opening, by = "month", length.out = 36), and many other ways.
We can use complete from tidyr to create a 1-month sequence of dates for each ID, use row_number() within each group as the Month counter, and derive Status from Cancellation_Date.
library(dplyr)
library(tidyr)

df %>%
  mutate_at(vars(ends_with("Date")), as.Date, "%d/%m/%y") %>%
  mutate(Date = Opening_Date) %>%
  group_by(ID) %>%
  complete(Date = seq(Date, by = "1 month", length.out = 36)) %>%
  mutate(Month = row_number()) %>%
  fill(Opening_Date, Cancellation_Date) %>%
  mutate(Status = +(Date <= Cancellation_Date))
#       ID Date       Opening_Date Cancellation_Date Month Status
#    <dbl> <date>     <date>       <date>            <int>  <int>
#  1   336 2017-01-01 2017-01-01   2018-06-01            1      1
#  2   336 2017-02-01 2017-01-01   2018-06-01            2      1
#  3   336 2017-03-01 2017-01-01   2018-06-01            3      1
#  4   336 2017-04-01 2017-01-01   2018-06-01            4      1
#  5   336 2017-05-01 2017-01-01   2018-06-01            5      1
#  6   336 2017-06-01 2017-01-01   2018-06-01            6      1
#  7   336 2017-07-01 2017-01-01   2018-06-01            7      1
#  8   336 2017-08-01 2017-01-01   2018-06-01            8      1
#  9   336 2017-09-01 2017-01-01   2018-06-01            9      1
# 10   336 2017-10-01 2017-01-01   2018-06-01           10      1
# … with 26 more rows
In the output, the Date column is the sequence of monthly dates for each ID; it can be removed from the final output if not needed.
data
df <- data.frame(ID = 336, Opening_Date = '1/1/17',Cancellation_Date = '1/6/18')
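One detail from the question: when Cancellation_Date is missing, Status should be 1 for all 36 months, but Date <= Cancellation_Date evaluates to NA in that case. A small tweak to the last step should cover it (a sketch, untested against real data):
# force Status = 1 when there is no cancellation date
mutate(Status = +(is.na(Cancellation_Date) | Date <= Cancellation_Date))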

Group by with summarise in date difference in R

I am trying to use group_by and then summarise with a date-difference calculation. I am not sure if it's a runtime error or something wrong in what I am doing, but sometimes when I run the code I get the output in days and other times in seconds, and I am not sure what causes this change. I am not changing the dataset or the code. The dataset I am using is huge (2,304,433 rows and 40 columns). Both times the output values (digits) are the same; only the unit name changes (days to secs). I would like to see the output in days.
This is the code that I am using:
data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq = n(),
            Revenue = max(TOTAL_AMT + 0.000001/QUANTITY),
            No_Days = (max(ORDER_DT) - min(ORDER_DT) + 1)/n())
Can anyone please help me with this?
Use difftime(); you might need to specify the units explicitly.
set.seed(314)

data <- data.frame(
  PRODUCT   = sample(1:10, size = 10000, replace = TRUE),
  PERSON_ID = sample(1:10, size = 10000, replace = TRUE),
  ORDER_DT  = as.POSIXct(as.Date('2019/01/01') +
                           sample(-300:300, size = 10000, replace = TRUE)))
require(dplyr)

data %>%
  group_by(PRODUCT, PERSON_ID) %>%
  summarise(Freq  = n(),
            start = min(ORDER_DT),
            end   = max(ORDER_DT)) %>%
  mutate(No_Days = (as.double(difftime(end, start, units = "days")) + 1)/Freq)
gives:
   PRODUCT PERSON_ID  Freq start               end                 No_Days
     <int>     <int> <int> <dttm>              <dttm>                <dbl>
 1       1         1   109 2018-03-21 01:00:00 2019-10-27 02:00:00    5.38
 2       1         2   117 2018-03-23 01:00:00 2019-10-26 02:00:00    4.98
 3       1         3   106 2018-03-19 01:00:00 2019-10-28 01:00:00    5.56
 4       1         4   109 2018-03-07 01:00:00 2019-10-26 02:00:00    5.50
 5       1         5    95 2018-03-07 01:00:00 2019-10-16 02:00:00    6.2
 6       1         6    79 2018-03-09 01:00:00 2019-10-04 02:00:00    7.28
 7       1         7    83 2018-03-09 01:00:00 2019-10-28 01:00:00    7.22
 8       1         8   114 2018-03-09 01:00:00 2019-10-16 02:00:00    5.15
 9       1         9   100 2018-03-09 01:00:00 2019-10-13 02:00:00    5.84
10       1        10    91 2018-03-11 01:00:00 2019-10-26 02:00:00    6.54
# ... with 90 more rows
Why is the value divided by n()?
A simple as.integer(max(ORDER_DT) - min(ORDER_DT)) should work, but if it doesn't, please be more specific and update the question with more information.
Also, when working with datetime values it's good to know the lubridate library.
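For context, the flip between days and secs happens because the "-" operator on date-times picks the units automatically, while difftime() with explicit units pins them down. A small illustration:
as.POSIXct("2019-01-02") - as.POSIXct("2019-01-01")
# Time difference of 1 days
as.POSIXct("2019-01-01 00:00:30") - as.POSIXct("2019-01-01")
# Time difference of 30 secs
difftime(as.POSIXct("2019-01-01 00:00:30"), as.POSIXct("2019-01-01"),
         units = "days")   # always reported in days
# Time difference of 0.0003472222 days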

Using a rolling time interval to count rows in R and dplyr

Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.
             Timestamp ticket_count
                (time)        (int)
1  2016-01-01 05:30:00            1
2  2016-01-01 05:32:00            1
3  2016-01-01 05:38:00            1
4  2016-01-01 05:46:00            1
5  2016-01-01 05:47:00            1
6  2016-01-01 06:07:00            1
7  2016-01-01 06:13:00            2
8  2016-01-01 06:21:00            1
9  2016-01-01 06:22:00            1
10 2016-01-01 06:25:00            1
I want to know how to calculate, for each ticket, the number of tickets sold within a certain time frame of it. For example, I want to count the tickets sold up to 15 minutes after each ticket's timestamp. In this case, the first row would have three tickets, the second row would have four tickets, etc.
Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.
In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:
require(data.table) # v1.9.7+

window = 15L # minutes
(counts = setDT(df)[.(t = Timestamp + window*60L), on = .(Timestamp < t),
                    .(counts = sum(ticket_count)), by = .EACHI]$counts)
# [1] 3 4 5 5 5 9 11 11 11 11
# add that as a column to original data.table by reference
df[, counts := counts]
For each row in t, all rows where df$Timestamp < that value are fetched. And by = .EACHI instructs the expression sum(ticket_count) to run for each row of t. That gives your desired result.
Hope this helps.
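Note that these counts include everything sold before each window's end, not only tickets inside the window itself. If you want a strict 15-minute rolling window per ticket, a two-sided non-equi join should work; a sketch, untested against the original data:
# df is already a data.table after the setDT() call above
lookup <- df[, .(lo = Timestamp, hi = Timestamp + window*60L)]
df[lookup, on = .(Timestamp >= lo, Timestamp <= hi),
   .(counts = sum(ticket_count)), by = .EACHI]$counts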
This is a simpler version of the ugly one I wrote earlier. Note that cut() below bins the timestamps into fixed 15-minute intervals, rather than using a rolling 15-minute window around each ticket.
# install.packages('dplyr')
library(dplyr)

your_data %>%
  mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
         ticket_count = as.numeric(ticket_count)) %>%
  mutate(window = cut(timestamp, '15 min')) %>%
  group_by(window) %>%
  dplyr::summarise(tickets = sum(ticket_count))
               window tickets
               (fctr)   (dbl)
1 2016-01-01 05:30:00       3
2 2016-01-01 05:45:00       2
3 2016-01-01 06:00:00       3
4 2016-01-01 06:15:00       3
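Since the question mentions grouping by store, the same binning idea extends by grouping on a store column as well; a sketch, assuming your_data has such a column:
your_data %>%
  mutate(timestamp = as.POSIXct(timestamp, format = '%m/%d/%Y %H:%M'),
         window = cut(timestamp, '15 min')) %>%
  group_by(store, window) %>%          # one row per store per 15-min bin
  dplyr::summarise(tickets = sum(ticket_count))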
Here is a solution using data.table that also incorporates different stores.
Example data:
library(data.table)

dt <- data.table(Timestamp = as.POSIXct("2016-01-01 05:30:00") + seq(60, 120000, by = 60),
                 ticket_count = sample(1:9, 2000, TRUE),
                 store = rep(c("A", "B", "C", "D"), 500))
Now apply the following:
ts <- dt$Timestamp

for (x in ts) {
  end <- x + 900   # 900 seconds = 15 minutes
  dt[Timestamp <= end & Timestamp >= x, CS := sum(ticket_count), by = store]
}
This gives you:
                Timestamp ticket_count store CS
   1: 2016-01-01 05:31:00            3     A 13
   2: 2016-01-01 05:32:00            5     B 20
   3: 2016-01-01 05:33:00            3     C 19
   4: 2016-01-01 05:34:00            7     D 12
   5: 2016-01-01 05:35:00            1     A 15
  ---
1996: 2016-01-02 14:46:00            4     D 10
1997: 2016-01-02 14:47:00            9     A  9
1998: 2016-01-02 14:48:00            2     B  2
1999: 2016-01-02 14:49:00            2     C  2
2000: 2016-01-02 14:50:00            6     D  6

Alter values in one data frame based on comparison values in another in R

I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a separate comparison dataframe for that particular ID.
For example:
# create sample data
Time <- as.POSIXct(c("2015-10-02 08:00:00", "2015-11-02 11:00:00",
                     "2015-10-11 10:00:00", "2015-11-11 09:00:00",
                     "2015-10-24 08:00:00", "2015-10-27 08:00:00"),
                   format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 01, 02, 02, 03, 03)
data <- data.frame(Time, ID)
Which produces this:
                 Time ID
1 2015-10-02 08:00:00  1
2 2015-11-02 11:00:00  1
3 2015-10-11 10:00:00  2
4 2015-11-11 09:00:00  2
5 2015-10-24 08:00:00  3
6 2015-10-27 08:00:00  3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
# create sample comparison data
Comparison <- as.POSIXct(c("2015-10-29 08:00:00", "2015-11-02 08:00:00",
                           "2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 02, 03)
ComparisonData <- data.frame(Comparison, ID)
This should look like this:
           Comparison ID
1 2015-10-29 08:00:00  1
2 2015-11-02 08:00:00  2
3 2015-10-26 08:30:00  3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
                 Time ID
1 2015-10-02 07:00:00  1
2 2015-11-02 11:00:00  1
3 2015-10-11 09:00:00  2
4 2015-11-11 09:00:00  2
5 2015-10-24 07:00:00  3
6 2015-10-27 08:00:00  3
I have looked at similar solutions such as this one, but I cannot understand how to also check the times against the right comparison value for that particular ID.
I think ddply seems quite a promising option but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then modify the Times that are earlier than or equal to Comparison:
library(data.table) # v1.9.6+

# i.Comparison refers to the Comparison column of the joined table (ComparisonData)
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
#                   Time ID
# 1: 2015-10-02 07:00:00  1
# 2: 2015-11-02 11:00:00  1
# 3: 2015-10-11 09:00:00  2
# 4: 2015-11-11 09:00:00  2
# 5: 2015-10-24 07:00:00  3
# 6: 2015-10-27 08:00:00  3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
            Time := ifelse(Time <= i.Comparison, Time - 3600L, Time),
            on = "ID"]
data
#                   Time ID
# 1: 2015-10-02 07:00:00  1
# 2: 2015-11-02 11:00:00  1
# 3: 2015-10-11 09:00:00  2
# 4: 2015-11-11 09:00:00  2
# 5: 2015-10-24 07:00:00  3
# 6: 2015-10-27 08:00:00  3
I am sure there is going to be a better solution than this, however, I think this works.
for (i in 1:nrow(data)) {
  if (data$Time[i] < ComparisonData[data$ID[i], 1]) {
    data$Time[i] <- data$Time[i] - 3600
  }
}
#                  Time ID
# 1 2015-10-02 07:00:00  1
# 2 2015-11-02 11:00:00  1
# 3 2015-10-11 09:00:00  2
# 4 2015-11-11 09:00:00  2
# 5 2015-10-24 07:00:00  3
# 6 2015-10-27 08:00:00  3
This is going to iterate through every row in data. ComparisonData[data$ID[i], 1] gets the comparison time in ComparisonData for the corresponding ID (indexing rows by ID works here because row i of ComparisonData holds ID i). If this is greater than the Time value in data, the time is reduced by 1 hour.
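For larger data, the same comparison can be vectorised with match() instead of looping row by row; a base R sketch, using the "earlier than or equal" condition from the question:
# look up each row's comparison time by ID
cmp <- ComparisonData$Comparison[match(data$ID, ComparisonData$ID)]
# subtract an hour where Time is at or before the comparison time
data$Time <- data$Time - ifelse(data$Time <= cmp, 3600, 0)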

Given start and end times, create hourly labels to indicate whether an hour is in the duration or not

I have start and end times of some commercial event for a couple of locations. The event may or may not take place on each day and the event duration does not overlap. For example, run this:
inputdata = data.frame(
location = c('x','x','y','z','z'),
start = c(as.POSIXct("2010/1/1 8:28:00"),as.POSIXct("2010/1/2 7:20:00"),
as.POSIXct("2010/1/1 10:22:00"),
as.POSIXct("2010/1/5 13:28:00"),as.POSIXct("2010/1/7 15:39:00")),
end = c(as.POSIXct("2010/1/1 13:25:00"),as.POSIXct("2010/1/2 10:09:00"),
as.POSIXct("2010/1/1 15:24:00"),
as.POSIXct("2010/1/6 00:28:00"),as.POSIXct("2010/1/7 19:34:00"))
)
The input data looks like:
  location               start                 end
1        x 2010-01-01 08:28:00 2010-01-01 13:25:00
2        x 2010-01-02 07:20:00 2010-01-02 10:09:00
3        y 2010-01-01 10:22:00 2010-01-01 15:24:00
4        z 2010-01-05 13:28:00 2010-01-06 00:28:00
5        z 2010-01-07 15:39:00 2010-01-07 19:34:00
I want to construct an hourly dataset with three columns: 1. location, 2. hour, and 3. indicator, where each row is a pair of a location and a sharp hour (for instance, as.POSIXct("2010/1/1 13:00:00")) and indicator is a dummy, = 1 if this hour falls between some event's start and end times for that location.
For instance, let's say the output hourly data are for 2010-01-01 to 2010-01-07. Run this:
output = data.frame(
  location = rep(c('x', 'y', 'z'),
                 each = length(seq(as.POSIXct("2010/1/1"),
                                   as.POSIXct("2010/1/7 23:00:00"), "hours"))),
  hour = rep(seq(as.POSIXct("2010/1/1"),
                 as.POSIXct("2010/1/7 23:00:00"), "hours"), 3),
  indicator = rep(0, 3*length(seq(as.POSIXct("2010/1/1"),
                                  as.POSIXct("2010/1/7 23:00:00"), "hours"))))
So we get the first six rows look like this:
  location                hour indicator
1        x 2010-01-01 00:00:00         0
2        x 2010-01-01 01:00:00         0
3        x 2010-01-01 02:00:00         0
4        x 2010-01-01 03:00:00         0
5        x 2010-01-01 04:00:00         0
6        x 2010-01-01 05:00:00         0
Now, we need to change the value of indicator to 1 if the hour in the same row has an event in effect for the location in the same row.
For instance, location x has an event between 08:28 and 13:25 on 2010/1/1, so the rows for 07:00 to 14:00 should look like this:
   location                hour indicator
8         x 2010-01-01 07:00:00         0
9         x 2010-01-01 08:00:00         1
10        x 2010-01-01 09:00:00         1
11        x 2010-01-01 10:00:00         1
12        x 2010-01-01 11:00:00         1
13        x 2010-01-01 12:00:00         1
14        x 2010-01-01 13:00:00         1
15        x 2010-01-01 14:00:00         0
It seems that I could exhaustively search each pair of location and hour and update the value of indicator if the hour is between the start and end hour of some event at that location, but I doubt this is the best way.
Alternatively, I am thinking that I could first convert the input data to hourly data, where an hour appears only if it falls between some start and end hour. In other words, the converted data should look like:
   location                hour indicator
1         x 2010-01-01 08:00:00         1
2         x 2010-01-01 09:00:00         1
3         x 2010-01-01 10:00:00         1
4         x 2010-01-01 11:00:00         1
5         x 2010-01-01 12:00:00         1
6         x 2010-01-01 13:00:00         1
7         x 2010-01-02 07:00:00         1
8         x 2010-01-02 08:00:00         1
9         x 2010-01-02 09:00:00         1
10        x 2010-01-02 10:00:00         1
11        y 2010-01-01 10:00:00         1
12        y 2010-01-01 11:00:00         1
and then I go from there to get the correct indicators for each hour for each location. Though, I don't know how to convert the start/end hours to hourly observations.
This is all I have for this problem so far.
With this said, I do not have a solution and would like to ask for help.
Also, all I want is that output with three columns. When contributing, please do not be constrained by my thoughts, which may not be efficient.
It is worth mentioning that the actual problem covers 5 years and there are 30 locations, so the algorithm needs to be efficient.
Here is a way to do this with a cross join.
library(dplyr)

hours =
  data_frame(hour = seq(as.POSIXct("2010/1/1"),
                        as.POSIXct("2010/1/7 23:00:00"),
                        "hours")) %>%
  merge(inputdata %>% select(location) %>% distinct)

hours %>%
  left_join(inputdata) %>%
  filter(start <= hour & hour <= end) %>%
  right_join(hours) %>%
  mutate(indicator = +!is.na(start))
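(The + in +!is.na(start) simply coerces the logical to a 0/1 integer.) Given the mention of 5 years and 30 locations, a data.table non-equi join may scale better than the cross join; here is a sketch mirroring the same start <= hour <= end condition used above:
library(data.table)

# all location/hour pairs
grid <- CJ(location = unique(inputdata$location),
           hour = seq(as.POSIXct("2010/1/1"),
                      as.POSIXct("2010/1/7 23:00:00"), "hours"))
grid[, indicator := 0L]
# flag every location/hour pair covered by at least one event
grid[setDT(inputdata), on = .(location, hour >= start, hour <= end),
     indicator := 1L]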
