How to get the number of people in the store every 5 minutes? - r

I have a data table like the one below:
library(data.table)
DT1 <- data.table(
  id = c(1, 2, 3, 4, 3, 2),
  in_time = c("2017-11-01 08:37:35", "2017-11-01 09:07:44", "2017-11-01 09:46:16",
              "2017-11-01 10:32:29", "2017-11-01 10:59:25", "2017-11-01 13:24:12"),
  out_time = c("2017-11-01 08:45:35", "2017-11-01 09:15:30", "2017-11-01 10:11:16",
               "2017-11-01 10:37:05", "2017-11-01 11:45:25", "2017-11-01 14:10:09")
)
It contains information about the time each person enters and exits the store.
Now I want to count the people in the store every 5 minutes (at the standard 5-minute marks: minute 0, 5, 10, 15, ..., 60). If there is no one in the store, I need a 0 value.
So I tried with
library(lubridate)
DT1[,time:=ymd_hms(in_time)]
DT1[,time:=ceiling_date(time,"5mins")]
DT1[,.N,by=list(time)]
which only gives how many people entered at each time, but I am now stuck on how to take the out_time into account. For example, id 1 entered at 2017-11-01 08:37:35 and left at 2017-11-01 08:45:35, so he will be in the shop for the 5-minute interval from 2017-11-01 08:40:00 to 2017-11-01 08:45:00, but not at 2017-11-01 08:50:00 and so on.
An id can repeat multiple times, as one person may drop by the store multiple times a day.
Any help is appreciated.

Here is an option using data.table::foverlaps:
# generate intervals of 5 mins (fmt and the POSIXct conversion of DT1 come from the data block below)
times <- seq(as.POSIXct("2017-11-01 00:00:00", format = fmt),
             as.POSIXct("2017-11-02 00:00:00", format = fmt),
             by = "5 min")
DT2 <- data.table(in_time = times[-length(times)], out_time = times[-1L],
                  key = c("in_time", "out_time"))
# set keys before foverlaps
setkey(DT1, in_time, out_time)
# find overlaps and count distinct ids in each 5-min interval.
# !is.na(id) only truncates the output for checking; remove it in the actual code
foverlaps(DT2, DT1)[!is.na(id), uniqueN(id), .(i.in_time, i.out_time)]
And if id is unique in each time interval, the last line of code can instead be foverlaps(DT2, DT1)[, sum(!is.na(id)), .(i.in_time, i.out_time)].
first 8 rows of output:
i.in_time i.out_time V1
1: 2017-11-01 08:35:00 2017-11-01 08:40:00 1
2: 2017-11-01 08:40:00 2017-11-01 08:45:00 1
3: 2017-11-01 08:45:00 2017-11-01 08:50:00 1
4: 2017-11-01 09:05:00 2017-11-01 09:10:00 1
5: 2017-11-01 09:10:00 2017-11-01 09:15:00 1
6: 2017-11-01 09:15:00 2017-11-01 09:20:00 1
7: 2017-11-01 09:45:00 2017-11-01 09:50:00 1
8: 2017-11-01 09:50:00 2017-11-01 09:55:00 1
data:
library(data.table)
DT1 <- data.table(
  id = c(1, 2, 3, 4, 3, 2),
  in_time = c("2017-11-01 08:37:35", "2017-11-01 09:07:44", "2017-11-01 09:46:16",
              "2017-11-01 10:32:29", "2017-11-01 10:59:25", "2017-11-01 13:24:12"),
  out_time = c("2017-11-01 08:45:35", "2017-11-01 09:15:30", "2017-11-01 10:11:16",
               "2017-11-01 10:37:05", "2017-11-01 11:45:25", "2017-11-01 14:10:09")
)
cols <- c("in_time", "out_time")
fmt <- "%Y-%m-%d %T"
DT1[, (cols) := lapply(.SD, as.POSIXct, format=fmt), .SDcols=cols]
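For convenience, here is the same code assembled in runnable order (nothing new, just the data block above moved ahead of the foverlaps steps):
library(data.table)
DT1 <- data.table(
  id = c(1, 2, 3, 4, 3, 2),
  in_time = c("2017-11-01 08:37:35", "2017-11-01 09:07:44", "2017-11-01 09:46:16",
              "2017-11-01 10:32:29", "2017-11-01 10:59:25", "2017-11-01 13:24:12"),
  out_time = c("2017-11-01 08:45:35", "2017-11-01 09:15:30", "2017-11-01 10:11:16",
               "2017-11-01 10:37:05", "2017-11-01 11:45:25", "2017-11-01 14:10:09")
)
cols <- c("in_time", "out_time")
fmt <- "%Y-%m-%d %T"
DT1[, (cols) := lapply(.SD, as.POSIXct, format = fmt), .SDcols = cols]

# 5-minute grid covering the day, stored as [start, end] intervals
times <- seq(as.POSIXct("2017-11-01 00:00:00", format = fmt),
             as.POSIXct("2017-11-02 00:00:00", format = fmt),
             by = "5 min")
DT2 <- data.table(in_time = times[-length(times)], out_time = times[-1L],
                  key = c("in_time", "out_time"))

setkey(DT1, in_time, out_time)
foverlaps(DT2, DT1)[!is.na(id), uniqueN(id), .(i.in_time, i.out_time)]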

Related

Can I aggregate time series data between an on and off date using a data table join or the aggregate function?

I would like to efficiently summarize continuous meteorological data over the periods that discrete samples are being collected.
I currently do this with a time-consuming loop, but I imagine a better solution exists. I'm new to data.table syntax, but it seems like there should be a solution with joining.
continuous <- data.frame(Time = seq(as.POSIXct("2019-01-01 0:00:00"),
                                    as.POSIXct("2019-01-01 9:00:00"), "hour"),
                         CO2 = sample(400:450, 10),
                         Temp = sample(10:30, 10))
> continuous
Time CO2 Temp
1 2019-01-01 00:00:00 430 11
2 2019-01-01 01:00:00 412 26
3 2019-01-01 02:00:00 427 17
4 2019-01-01 03:00:00 435 29
5 2019-01-01 04:00:00 447 23
6 2019-01-01 05:00:00 417 19
7 2019-01-01 06:00:00 408 12
8 2019-01-01 07:00:00 449 28
9 2019-01-01 08:00:00 445 20
10 2019-01-01 09:00:00 420 27
discrete <- data.frame(on = c(as.POSIXct("2019-01-01 0:00:00"),
                              as.POSIXct("2019-01-01 3:00:00")),
                       off = c(as.POSIXct("2019-01-01 3:00:00"),
                               as.POSIXct("2019-01-01 8:00:00")))
> discrete
on off
1 2019-01-01 00:00:00 2019-01-01 03:00:00
2 2019-01-01 03:00:00 2019-01-01 08:00:00
# which.closest() comes from an add-on package (e.g. birk::which.closest); it is not loaded above
discrete[, c("CO2.mean", "Temp.mean")] <-
  lapply(seq_along(c("CO2", "Temp")), function(k)
    unlist(lapply(seq_len(nrow(discrete)), function(i)
      mean(continuous[
        which.closest(continuous$Time, discrete$on[i]):
        which.closest(continuous$Time, discrete$off[i]),
        c("CO2", "Temp")[k]]))))
> discrete
on off CO2.mean Temp.mean
1 2019-01-01 00:00:00 2019-01-01 03:00:00 426.0 20.75000
2 2019-01-01 03:00:00 2019-01-01 08:00:00 433.5 21.83333
This works, but when aggregating tens of continuous variables into hundreds of sampling periods, it takes a very long time to run. Thank you for your help!
An option would be a nonequi join in data.table
library(data.table)
setDT(continuous)[discrete,
                  .(CO2mean = mean(CO2), Tempmean = mean(Temp)),
                  on = .(Time >= on, Time <= off), by = .EACHI]
or with a rolling join
setDT(continuous)[discrete,
                  .(CO2mean = mean(CO2), Tempmean = mean(Temp)),
                  on = .(Time = on, Time = off),
                  by = .EACHI, roll = 'nearest']

R: Merge dataframes by nearest datetime

I have two dataframes of different lengths: NROW(data) = 20000
NROW(database) = 8000
Both dataframes have date-time values in a format like YYYY-MM-DD HH-MM-SS, and the values are not the same in each dataframe.
What I want is to merge them by the nearest date-time and keep only the records that exist in database.
I tried the approach posted in another Stack Exchange post, "R - How to join two data frames by nearest time-date?", which is based on the data.table library. I tried the following but without success:
require("data.table")
database <- data.table(database)
data <- data.table(data)
setkey( data, "timekey")
setkey( database, "timekeyd")
database <- data[ database, roll = "nearest"]
But the merge was almost completely wrong. You can see how the merge was performed in the following table, which has only the two keys (timekey and timekeyd):
1 2017-11-01 00:00:00 2017-10-31 21:00:00
2 2017-11-01 00:00:00 2017-10-31 22:10:00
3 2017-11-02 19:00:00 2017-11-02 21:00:00
4 2017-11-02 19:00:00 2017-11-02 21:00:00
5 2017-11-03 20:08:00 2017-11-03 22:10:00
6 2017-11-04 19:00:00 2017-11-04 21:00:00
7 2017-11-04 19:00:00 2017-11-04 21:00:00
8 2017-11-05 19:00:00 2017-11-05 21:10:00
9 2017-11-07 18:00:00 2017-11-07 20:00:00
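For reference, a minimal sketch of the usual roll = "nearest" pattern (illustrative data and values only, assuming both key columns are genuine POSIXct, not character or factor):
library(data.table)
# small stand-ins for `data` (many rows) and `database` (fewer rows)
data <- data.table(timekey = as.POSIXct(c("2017-10-31 21:00:00",
                                          "2017-11-02 18:55:00",
                                          "2017-11-03 22:12:00")),
                   x = 1:3)
database <- data.table(timekeyd = as.POSIXct(c("2017-11-01 00:00:00",
                                               "2017-11-02 19:00:00")),
                       y = c("a", "b"))
# one output row per row of `database`, each matched to the nearest timekey in `data`
data[database, on = .(timekey = timekeyd), roll = "nearest"]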

R : how to get the rolling mean of a variable over the last few days but only at a given hour?

Consider this
library(lubridate)
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by = "hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3, length(time), replace = TRUE)
df2 <- data.frame(time, group, value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value
over the last 5 days (not including the current observation)
only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess I have no idea how to do that.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() will include the current observation, but you can use the shift() function to exclude the current observation from the rolling mean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
          rollmean(value, 5, na.pad = TRUE, align = 'right'),
          n = 1,
          type = 'lag'),
        time),
    .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
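As an aside (not part of the answer above), a rough dplyr/zoo sketch of the same idea, since the question asked for either dplyr or data.table:
library(dplyr)
library(zoo)
library(lubridate)
df2 %>%
  group_by(group, hour = hour(time)) %>%             # one series per group and hour of day
  arrange(time, .by_group = TRUE) %>%
  mutate(rolling_mean_same_hour =
           lag(rollmeanr(value, 5, fill = NA))) %>%   # lag() excludes the current observation
  ungroup()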

Create a time interval of 15 minutes from minutely data in R?

I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have data from 00:00 to 23:59 hours, with a count per minute. I'd like to group the data in intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something to do it easily, but I haven't figured out how yet.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time = seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by = 60),
                 count = sample(1:50, 100, replace = TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>%
  group_by(by15 = cut(time, "15 min")) %>%
  summarise(count = sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to be the nearest minute before the next interval (instead of the nearest second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
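As a small illustration of that arithmetic, using the dat.summary object from the base R example above:
# start and end (last second) of each 15-minute bin, recovered from the cut() labels
dat.summary$start <- as.POSIXct(as.character(dat.summary$by15))
dat.summary$end   <- dat.summary$start + 60*15 - 1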
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records.)
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if(!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  if(is.na(date_var)) return(as.POSIXct(NA)) else {
    return(as.POSIXct(floor(as.numeric(date_var) /
      (floor_seconds)) * (floor_seconds), origin = origin))
  }
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
                   bad1 = as.Date(Sys.time()),
                   bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00 <NA>
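A hedged usage sketch for the original minute-level question: note that the is.na() check in floor_datetime() is scalar, so on recent R versions you may need to relax it (for example to any(is.na(date_var))) before passing in a whole column. With that adjustment, the 15-minute aggregation could look like:
# assumes floor_datetime() has been adjusted to accept a vector (see note above)
dat$by15 <- floor_datetime(dat$time, 15 * 60)
dat.summary <- aggregate(count ~ by15, FUN = sum, data = dat)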
You can do it in one line by using the trs function from the foqat package, just like:
df_15mins = trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455

Alter values in one data frame based on comparison values in another in R

I am trying to subtract one hour from date/times within a POSIXct column that are earlier than or equal to a time stated in a different comparison dataframe for that particular ID.
For example:
# create sample data
Time <- as.POSIXct(c("2015-10-02 08:00:00", "2015-11-02 11:00:00", "2015-10-11 10:00:00",
                     "2015-11-11 09:00:00", "2015-10-24 08:00:00", "2015-10-27 08:00:00"),
                   format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 01, 02, 02, 03, 03)
data <- data.frame(Time, ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the particular ID it is associated with. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
# create sample comparison data
Comparison <- as.POSIXct(c("2015-10-29 08:00:00", "2015-11-02 08:00:00", "2015-10-26 08:30:00"),
                         format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 02, 03)
ComparisonData <- data.frame(Comparison, ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times of a certain ID to see if any are earlier than or equal to the value specified in ComparisonData and if they are, subtract one hour. This should give this data frame as an output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this, but I cannot understand how to also check the times against the correct comparison time for that particular ID.
I think ddply seems like quite a promising option, but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then just modify the Times which are less than or equal to Comparison:
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (not sure how efficient this is, though):
setDT(data)[ComparisonData,
            Time := ifelse(Time <= i.Comparison, Time - 3600L, Time),
            on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
I am sure there is going to be a better solution than this; however, I think this works.
for(i in 1:nrow(data)) {
  if(data$Time[i] < ComparisonData[data$ID[i], 1]) {
    data$Time[i] <- data$Time[i] - 3600
  }
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This is going to iterate through every row in data.
ComparisonData[data$ID[i], 1] gets the time column in ComparisonData for the corresponding ID. If this is greater than the Time value in data, the time is reduced by 1 hour.
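Not from the thread, but for comparison, a vectorised base R sketch of the same idea that looks up the comparison time by ID (and uses <= as in the question) instead of relying on row position:
# hypothetical vectorised version of the loop above
cmp <- ComparisonData$Comparison[match(data$ID, ComparisonData$ID)]
shift_back <- data$Time <= cmp
data$Time[shift_back] <- data$Time[shift_back] - 3600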
