I would like to count the number of sessions started in each 15-minute interval on business days within a large dataset.
My data looks like:
df <-
Start_datetime End_datetime Duration Volume
2016-04-01 06:20:55 2016-04-01 14:41:22 08:20:27 8.360
2016-04-01 08:22:27 2016-04-01 08:22:40 00:00:13 0.000
2016-04-01 08:38:53 2016-04-01 09:31:58 00:53:05 12.570
2016-04-01 09:33:57 2016-04-01 12:37:43 03:03:46 7.320
2016-04-01 10:05:03 2016-04-01 16:41:16 06:36:13 9.520
2016-04-01 12:07:57 2016-04-02 22:22:32 34:14:35 7.230
2016-04-01 16:56:55 2016-04-02 10:40:17 17:43:22 5.300
2016-04-01 17:29:18 2016-04-01 19:50:29 02:21:11 7.020
2016-04-01 17:42:39 2016-04-01 19:45:38 02:02:59 2.430
2016-04-01 17:47:57 2016-04-01 20:26:35 02:38:38 8.090
2016-04-01 22:00:15 2016-04-04 08:22:21 58:22:06 4.710
2016-04-02 01:12:38 2016-04-02 09:49:00 08:36:22 3.150
2016-04-02 01:32:00 2016-04-02 12:49:47 11:17:47 5.760
2016-04-02 07:28:48 2016-04-04 06:58:56 47:30:08 0.000
2016-04-02 07:55:18 2016-04-05 07:55:15 71:59:57 0.240
I would like to count all sessions started per 15-minute interval, where:
For business days
Time PTU Count
00:00:00 - 00:15:00 1 10 # (where Count is the number of sessions started between 00:00:00 and 00:15:00)
00:15:00 - 00:30:00 2 6
00:30:00 - 00:45:00 3 5
00:45:00 - 01:00:00 4 3
And so on, plus the same table for the weekend.
I have tried the cut function:
df$PTU <- table(cut(df$Start_datetime, breaks = "15 minutes"))
data.frame(PTU)
EDIT: When I run this I receive the following error:
Error in cut.default(df$Start_datetime, breaks = "15 minutes") : 'x' must be numeric
I have also tried some functions from lubridate, but I can't seem to make it work. My final goal is to create a table like the one above, but with 15-minute intervals.
Here is a more or less complete process from datetime "strings" to the format you want. The start is a string vector:
Start_time <-
c("2016-04-01 06:20:55", "2016-04-01 08:22:27", "2016-04-01 08:38:53",
"2016-04-01 09:33:57", "2016-04-01 10:05:03", "2016-04-01 12:07:57",
"2016-04-01 16:56:55", "2016-04-01 17:29:18", "2016-04-01 17:42:39",
"2016-04-01 17:47:57", "2016-04-01 22:00:15", "2016-04-02 01:12:38",
"2016-04-02 01:32:00", "2016-04-02 07:28:48", "2016-04-02 07:55:18"
)
df <- data.frame(Start_time)
And this is the actual processing:
## We will use two packages
library(lubridate)
library(data.table)
# convert df to data.table, parse the datetime string
setDT(df)[, Start_time := ymd_hms(Start_time)]
# floor time by 15 min to assign the appropriate slot (new variable Start_time_slot)
df[, Start_time_slot := floor_date(Start_time, "15 min")]
# aggregate by weekday and time of day
start_time_data_frame <- df[, .N, by = .(wday(Start_time_slot), format(Start_time_slot, format="%H:%M:%S") )]
# output looks like this
start_time_data_frame
## wday time N
## 1: 6 06:15:00 1
## 2: 6 08:15:00 1
## 3: 6 08:30:00 1
## 4: 6 09:30:00 1
## 5: 6 10:00:00 1
## 6: 6 12:00:00 1
## 7: 6 16:45:00 1
## 8: 6 17:15:00 1
## 9: 6 17:30:00 1
## 10: 6 17:45:00 1
## 11: 6 22:00:00 1
## 12: 7 01:00:00 1
## 13: 7 01:30:00 1
## 14: 7 07:15:00 1
## 15: 7 07:45:00 1
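The question also asks for business days and the weekend separately. A minimal, hedged sketch on top of the result above (lubridate's wday() returns 1 for Sunday and 7 for Saturday by default):
# collapse weekdays into business day vs. weekend, then count per 15-minute slot
df[, day_type := ifelse(wday(Start_time_slot) %in% c(1, 7), "weekend", "business")]
df[, .N, by = .(day_type, time = format(Start_time_slot, format = "%H:%M:%S"))]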
There are two things you have to keep in mind when using cut on datetimes:
Make sure your data is actually a POSIXt class. I'm quite sure yours isn't, or R would be dispatching to cut.POSIXt rather than cut.default.
"15 minutes" should be "15 min". See ?cut.POSIXt
So this works:
Start_datetime <- as.POSIXct(
c("2016-04-01 06:20:55",
"2016-04-01 06:22:12",
"2016-04-01 05:30:12")
)
table(cut(Start_datetime, breaks = "15 min"))
# 2016-04-01 05:30:00 2016-04-01 05:45:00 2016-04-01 06:00:00 2016-04-01 06:15:00
# 1 0 0 2
Note that the output gives you the start of the 15 minute interval as names of the table.
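If you also want the PTU numbering from the question, one hedged sketch (using the Start_datetime vector above) is to turn the table into a data frame and number its rows:
res <- as.data.frame(table(cut(Start_datetime, breaks = "15 min")))
names(res) <- c("Time", "Count")
res$PTU <- seq_len(nrow(res))  # PTU = position of the 15-minute slot
res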
Related
I have a dataframe in which each row is the working hours of an employee defined by a start and a stop time:
DF <- EmployeeNum Start_datetime End_datetime
123 2012-02-01 07:30:00 2012-02-01 17:45:00
342 2012-02-01 08:00:00 2012-02-01 17:45:00
876 2012-02-01 10:45:00 2012-02-01 18:45:00
I'd like to find the number of employees working during each hour on each day in a timespan:
Date Hour NumberofEmployeesWorking
2012-02-01 00:00 ? (number of employees working between 00:00 and 00:59)
2012-02-01 01:00 ?
2012-02-01 02:00 ?
2012-02-01 03:00 ?
2012-02-01 04:00 ?
2012-02-01 05:00 ?
2012-02-01 06:00 ?
How do I put my working hours into bins like this?
Your data, in a more consumable format, plus one row to span midnight (for example). I changed the format to include a "T" here to make consumption easier; otherwise the middle space makes it less trivial to read with read.table(text='...'). (You can skip this since you already have your real data.)
x <- read.table(text='EmployeeNum Start_datetime End_datetime
123 2012-02-01T07:30:00 2012-02-01T17:45:00
342 2012-02-01T08:00:00 2012-02-01T17:45:00
876 2012-02-01T10:45:00 2012-02-01T18:45:00
877 2012-02-01T22:45:00 2012-02-02T05:45:00',
header=TRUE, stringsAsFactors=FALSE)
In case you haven't done it with your own data, convert all times to POSIXt, otherwise skip this, too.
x[c('Start_datetime','End_datetime')] <- lapply(x[c('Start_datetime','End_datetime')],
as.POSIXct, format='%Y-%m-%dT%H:%M:%S')
We need to generate a sequence of hourly timestamps:
startdate <- trunc(min(x$Start_datetime), units = "hours")
enddate <- round(max(x$End_datetime), units = "hours")
c(startdate, enddate)
# [1] "2012-02-01 07:00:00 PST" "2012-02-02 06:00:00 PST"
timestamps <- seq(startdate, enddate, by = "hour")
head(timestamps)
# [1] "2012-02-01 07:00:00 PST" "2012-02-01 08:00:00 PST" "2012-02-01 09:00:00 PST"
# [4] "2012-02-01 10:00:00 PST" "2012-02-01 11:00:00 PST" "2012-02-01 12:00:00 PST"
(Assumption: all end timestamps are after their start timestamps ...)
Now it's just a matter of tallying. An employee is working during an interval if their shift overlaps it, that is, the shift starts at or before the end of the interval and ends at or after its start:
counts <- mapply(function(st,en) sum(st <= x$End_datetime & x$Start_datetime <= en),
timestamps[-length(timestamps)], timestamps[-1])
data.frame(
start = timestamps[ -length(timestamps) ],
count = counts
)
# start count
# 1 2012-02-01 07:00:00 2
# 2 2012-02-01 08:00:00 2
# 3 2012-02-01 09:00:00 2
# 4 2012-02-01 10:00:00 3
# 5 2012-02-01 11:00:00 3
# 6 2012-02-01 12:00:00 3
# 7 2012-02-01 13:00:00 3
# 8 2012-02-01 14:00:00 3
# 9 2012-02-01 15:00:00 3
# 10 2012-02-01 16:00:00 3
# 11 2012-02-01 17:00:00 3
# 12 2012-02-01 18:00:00 1
# 13 2012-02-01 19:00:00 0
# 14 2012-02-01 20:00:00 0
# 15 2012-02-01 21:00:00 0
# 16 2012-02-01 22:00:00 1
# 17 2012-02-01 23:00:00 1
# 18 2012-02-02 00:00:00 1
# 19 2012-02-02 01:00:00 1
# 20 2012-02-02 02:00:00 1
# 21 2012-02-02 03:00:00 1
# 22 2012-02-02 04:00:00 1
# 23 2012-02-02 05:00:00 1
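For much larger data, an interval join may scale better than the mapply() pass. A hedged sketch with data.table::foverlaps(), reusing x and timestamps from above (edge inclusivity differs slightly from the <= test, and buckets with zero shifts are dropped):
library(data.table)
buckets <- data.table(start = timestamps[-length(timestamps)],
                      end   = timestamps[-1])
setkey(buckets, start, end)  # foverlaps() requires a keyed interval table
shifts <- data.table(start = x$Start_datetime, end = x$End_datetime)
foverlaps(shifts, buckets, type = "any")[, .N, by = start]  # count overlapping shifts per bucket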
I did not see @r2evans' answer before posting. I came up with this independently, though it looks similar. I posted it here in case it is helpful. Feel free to accept @r2evans' answer.
Data:
df1 <- read.table(text="EmployeeNum Start_datetime End_datetime
123 '2012-02-01 07:30:00' '2012-02-01 17:45:00'
342 '2012-02-01 08:00:00' '2012-02-01 17:45:00'
876 '2012-02-01 10:45:00' '2012-02-01 18:45:00'", header = TRUE )
df1 <- within(df1, Start_datetime <- as.POSIXct( Start_datetime))
df1 <- within(df1, End_datetime <- as.POSIXct( End_datetime))
Code:
Find the hourly datetime sequence for each employee and count the number of employees by Start_datetime.
Also, this code assumes that you split the original data into single days before applying it. If your data has multiple days mixed together, the IDateTime() function from the data.table package makes it possible to separate the date from the time, and to group by both while building the datetime sequence.
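As a quick, hedged illustration of IDateTime() (not part of this answer's main code), it splits a POSIXct column into an IDate part and an ITime part, which can then serve as grouping variables:
library('data.table')
dt <- data.table(stamp = as.POSIXct(c("2012-02-01 07:30:00", "2012-02-02 07:45:00")))
dt[, c("idate", "itime") := IDateTime(stamp)]  # adds the date part and the time-of-day part
dt
Now the actual code: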
library('data.table')
setDT(df1) # assign data.table class by reference
df2 <- df1[, Map( f = function(x, y) seq( from = trunc(x, "hour"),
to = round(y, "hour"),
by = "1 hour" ),
x = Start_datetime, y = End_datetime ),
by = EmployeeNum ]
colnames(df2)[ colnames(df2) == "V1" ] <- "Start_datetime" # for some reason I can't assign column name properly during the column creation step.
Output:
df2[, .N, by = .( Start_datetime, End_datetime = Start_datetime + 3599 ) ]
# Start_datetime End_datetime N
# 1: 2012-02-01 07:00:00 2012-02-01 07:59:59 1
# 2: 2012-02-01 08:00:00 2012-02-01 08:59:59 2
# 3: 2012-02-01 09:00:00 2012-02-01 09:59:59 2
# 4: 2012-02-01 10:00:00 2012-02-01 10:59:59 3
# 5: 2012-02-01 11:00:00 2012-02-01 11:59:59 3
# 6: 2012-02-01 12:00:00 2012-02-01 12:59:59 3
# 7: 2012-02-01 13:00:00 2012-02-01 13:59:59 3
# 8: 2012-02-01 14:00:00 2012-02-01 14:59:59 3
# 9: 2012-02-01 15:00:00 2012-02-01 15:59:59 3
# 10: 2012-02-01 16:00:00 2012-02-01 16:59:59 3
# 11: 2012-02-01 17:00:00 2012-02-01 17:59:59 3
# 12: 2012-02-01 18:00:00 2012-02-01 18:59:59 3
# 13: 2012-02-01 19:00:00 2012-02-01 19:59:59 1
Graph:
binwidth = 3600: the value indicates 1 hour = 60 min * 60 sec = 3600 seconds
library('ggplot2')
ggplot( data = df2,
mapping = aes( x = Start_datetime ) ) +
geom_histogram(binwidth = 3600, color = "red", fill = "white" ) +
scale_x_datetime( date_breaks = "1 hour", date_labels = "%H:%M" ) +
ylab("Number of Employees") +
xlab( "Working Hours: 2012-02-01" ) +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank(),
panel.background = element_rect( fill = "white", color = "black") )
Thank you both for your answers. I came up with a solution which is pretty similar to yours, but I was wondering if you could have a look and let me know what you think of it.
I started a new empty dataframe and then made two nested loops to look at each start and end time in each row and generate a sequence of hours in between. Then I appended each hour in the sequence to the new empty dataframe. This way, I can simply do a count later.
staffDetailHours <- data.frame("personnelNum"=integer(0),
"workDate"=character(0),
"Hour"=integer(0))
for (i in 1:dim(DF)[1]){
hoursList <- seq(as.POSIXlt(DF[i,]$START)$hour,
as.POSIXlt(DF[i,]$END)$hour)
for (j in 1:length(hoursList)) {
staffDetailHours[nrow(staffDetailHours)+1,] = list(
DF[i,]$EmployeeNum,
DF[i,]$Date,
hoursList[j]
)
}
}
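A hedged alternative sketch that avoids growing the data frame row by row (which gets slow for large DF), assuming the same DF columns START, END, EmployeeNum and Date:
# build each row's hour sequence once, then assemble the data frame in one shot
hourRanges <- lapply(seq_len(nrow(DF)), function(i)
  seq(as.POSIXlt(DF$START[i])$hour, as.POSIXlt(DF$END[i])$hour))
staffDetailHours <- data.frame(
  personnelNum = rep(DF$EmployeeNum, lengths(hourRanges)),
  workDate     = rep(DF$Date, lengths(hourRanges)),
  Hour         = unlist(hourRanges)
)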
Consider this:
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by="hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3,length(time), replace=TRUE)
df2 <- data.frame(time,group,value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value:
- over the last 5 days (not including the current observation),
- only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess having no ideas how to do that.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() would include the current observation, but you can use the shift() function to exclude the current observation from the rolling mean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
rollmean(value, 5, na.pad = TRUE, align = 'right'),
n = 1,
type = 'lag'),
time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
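Since the question mentions dplyr as an option too, here is a rough dplyr sketch of the same idea (assuming the same df2; not benchmarked):
library(dplyr)
library(zoo)
library(lubridate)
df2 %>%
  group_by(group, hr = hour(time)) %>%
  arrange(time, .by_group = TRUE) %>%
  mutate(rolling_mean_same_hour =
           lag(rollmean(value, 5, fill = NA, align = "right"))) %>%
  ungroup()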
I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have every minute from 00:00 to 23:59 hours, with a count per minute. I'd like to group the data into intervals of 15 minutes, such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, and I am sure there has to be a function or something to do it easily, but I haven't figured out yet how to do it.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time=seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by=60),
count=sample(1:50, 100, replace=TRUE))
Base R
cut the data into 15-minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>% group_by(by15=cut(time, "15 min")) %>%
summarise(count=sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment: for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second after the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is needed because cut returns a factor, and this converts it back to a date-time so that we can do math on it.
If you want the end point to be the nearest minute before the next interval (instead of the nearest second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
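As a concrete sketch of that endpoint arithmetic, using dat.summary from the base-R example above:
# end of each 15-minute interval, one second before the next interval starts
dat.summary$end15 <- as.POSIXct(as.character(dat.summary$by15)) + 60*15 - 1
head(dat.summary)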
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records.)
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
    origin = "1970-01-01") { # defaults to minute rounding
  if(!inherits(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  # floor() propagates NA, so NA values come back as NA without a special case
  # (a scalar is.na() check would error on vectors of length > 1 in R >= 4.2)
  as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
             origin = origin)
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
bad1 = as.Date(Sys.time()),
bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2007-05-06 13:45:00 <NA>
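To actually bin and aggregate with it, one possible usage sketch (reusing the dat data frame from the cut() answer above):
dat$by15 <- floor_datetime(dat$time, 15 * 60)   # floor each timestamp to its 15-minute slot
dat.summary <- aggregate(count ~ by15, FUN = sum, data = dat)  # then sum counts per slot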
You can do it in one line by using the trs() function from the foqat package:
df_15mins=trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455
I have the following data as a list of POSIXct times that span one month. Each of them represents a bike delivery. My aim is to find the average number of bike deliveries per ten-minute interval over a 24-hour period (producing a total of 144 rows). First all of the trips need to be binned into intervals and summed, then divided by the number of days. So far, I've managed to write code that sums trips per 10-minute interval, but it produces incorrect values. I am not sure where it went wrong.
The data looks like this:
head(start_times)
[1] "2014-10-21 16:58:13 EST" "2014-10-07 10:14:22 EST" "2014-10-20 01:45:11 EST"
[4] "2014-10-17 08:16:17 EST" "2014-10-07 17:46:36 EST" "2014-10-28 17:32:34 EST"
length(start_times)
[1] 1747
The code looks like this:
library(lubridate)
library(dplyr)
tripduration <- floor(runif(1747) * 1000)
time_bucket <- start_times - minutes(minute(start_times) %% 10) - seconds(second(start_times))
df <- data.frame(tripduration, start_times, time_bucket)
summarized <- df %>%
group_by(time_bucket) %>%
summarize(trip_count = n())
summarized <- as.data.frame(summarized)
out_buckets <- data.frame(out_buckets = seq(as.POSIXct("2014-10-01 00:00:00"), as.POSIXct("2014-10-31 23:00:00"), by = 600))
out <- left_join(out_buckets, summarized, by = c("out_buckets" = "time_bucket"))
out$trip_count[is.na(out$trip_count)] <- 0
head(out)
out_buckets trip_count
1 2014-10-01 00:00:00 0
2 2014-10-01 00:10:00 0
3 2014-10-01 00:20:00 0
4 2014-10-01 00:30:00 0
5 2014-10-01 00:40:00 0
6 2014-10-01 00:50:00 0
dim(out)
[1] 4459 2
test <- format(out$out_buckets,"%H:%M:%S")
test2 <- out$trip_count
test <- cbind(test, test2)
colnames(test)[1] <- "interval"
colnames(test)[2] <- "count"
test <- as.data.frame(test)
test$count <- as.numeric(test$count)
test <- aggregate(count~interval, test, sum)
head(test, n = 20)
interval count
1 00:00:00 32
2 00:10:00 33
3 00:20:00 32
4 00:30:00 31
5 00:40:00 34
6 00:50:00 34
7 01:00:00 31
8 01:10:00 33
9 01:20:00 39
10 01:30:00 41
11 01:40:00 36
12 01:50:00 31
13 02:00:00 33
14 02:10:00 34
15 02:20:00 32
16 02:30:00 32
17 02:40:00 36
18 02:50:00 32
19 03:00:00 34
20 03:10:00 39
But this is impossible, because when I sum the counts:
sum(test$count)
[1] 7494
I get 7494, whereas the total should be 1747.
I'm not sure where I went wrong, or how to simplify this code to get the correct result.
I've done what I can, but I can't reproduce your issue without your data. (One likely culprit: cbind() on a character vector and a numeric vector produces a character matrix, so test$count ends up as a factor, and as.numeric() on a factor returns the underlying level codes rather than the original values; as.numeric(as.character(test$count)) avoids that.)
library(dplyr)
I created the full sequence of 10 minute blocks:
blocks.of.10mins <- data.frame(out_buckets=seq(as.POSIXct("2014/10/01 00:00"), by="10 mins", length.out=30*24*6))
Then split the start_times into the same bins. Note: I created a baseline time of midnight to force the blocks to align to 10 minute intervals. Removing this later is an exercise for the reader. I also changed one of your data points so that there was at least one example of multiple records in the same bin.
start_times <- as.POSIXct(c("2014-10-01 00:00:00", ## added
"2014-10-21 16:58:13",
"2014-10-07 10:14:22",
"2014-10-20 01:45:11",
"2014-10-17 08:16:17",
"2014-10-07 10:16:36", ## modified
"2014-10-28 17:32:34"))
trip_times <- data.frame(start_times) %>%
mutate(out_buckets = as.POSIXct(cut(start_times, breaks="10 mins")))
The start_times and all the 10 minute intervals can then be merged
trips_merged <- merge(trip_times, blocks.of.10mins, by="out_buckets", all=TRUE)
These can then be grouped by 10 minute block and counted
trips_merged %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(time) (int)
1 2014-10-01 00:00:00 1
2 2014-10-07 10:10:00 2
3 2014-10-17 08:10:00 1
4 2014-10-20 01:40:00 1
5 2014-10-21 16:50:00 1
6 2014-10-28 17:30:00 1
Instead, if we only consider the time, not the date:
trips_merged2 <- trips_merged
trips_merged2$out_buckets <- format(trips_merged2$out_buckets, "%H:%M:%S")
trips_merged2 %>% filter(!is.na(start_times)) %>%
group_by(out_buckets) %>%
summarise(trip_count=n())
Source: local data frame [6 x 2]
out_buckets trip_count
(chr) (int)
1 00:00:00 1
2 01:40:00 1
3 08:10:00 1
4 10:10:00 2
5 16:50:00 1
6 17:30:00 1
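To get from these counts to the stated goal (the average number of trips per 10-minute time-of-day slot, 144 rows), a hedged sketch using the zero-filled out data frame built in the question:
out$slot <- format(out$out_buckets, "%H:%M")   # time-of-day label, date dropped
avg_per_slot <- aggregate(trip_count ~ slot, data = out, FUN = mean)  # mean over days
head(avg_per_slot)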
I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
#create sample data
Time<-as.POSIXct(c("2015-10-02 08:00:00","2015-11-02 11:00:00","2015-10-11 10:00:00","2015-11-11 09:00:00","2015-10-24 08:00:00","2015-10-27 08:00:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,01,02,02,03,03)
data<-data.frame(Time,ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
#create sample comparison data
Comparison<-as.POSIXct(c("2015-10-29 08:00:00","2015-11-02 08:00:00","2015-10-26 08:30:00"), format = "%Y-%m-%d %H:%M:%S")
ID<-c(01,02,03)
ComparisonData<-data.frame(Comparison,ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions, such as this one, but I cannot understand how to check the times against the right comparison time for that particular ID.
I think ddply seems quite a promising option but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then we just modify the Times that are earlier than or equal to Comparison:
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
Time := ifelse(Time <= i.Comparison,
Time - 3600L, Time),
on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
I am sure there is going to be a better solution than this; however, I think this works.
for(i in 1:nrow(data)) {
if(data$Time[i] <= ComparisonData[data$ID[i], 1]){
data$Time[i] <- data$Time[i] - 3600
}
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This is going to iterate through every row in data.
ComparisonData[data$ID[i], 1] gets the time column in ComparisonData for the corresponding ID. If the Time in data is earlier than or equal to it, the time is reduced by 1 hour. Note that this indexes ComparisonData by row position, which only works here because the IDs happen to equal the row numbers; see the sketch below for an ID-based lookup.
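A hedged, vectorised sketch that matches on ID instead of relying on row position (same sample data as above):
idx <- match(data$ID, ComparisonData$ID)                 # row of ComparisonData for each ID
earlier <- data$Time <= ComparisonData$Comparison[idx]   # compare against the right ID's time
data$Time[earlier] <- data$Time[earlier] - 3600          # subtract one hour (POSIXct is in seconds)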