I have data which includes Date as well as Time enter and Time exit. These latter two contain data like this: 08:02, 12:02, 23:45 etc.
I would like to manipulate the Time enter/Time exit data - for example, subtract Time enter from Time exit to work out duration, or plot the distributions of Time enter and Time exit, e.g. to see if most entries are before 10:00, or if most exits are after 17:00.
All the packages I've looked at require a date to precede the time, e.g. 01/02/2012 12:33.
Is this possible, or should I simply append an identical date to every time for the sake of calculations? This seems a bit messy!
Use the "times" class found in the chron package:
library(chron)
Enter <- c("09:12", "17:01")
Enter <- times(paste0(Enter, ":00"))
Exit <- c("10:15", "18:11")
Exit <- times(paste0(Exit, ":00"))
Exit - Enter # durations
sum(Enter < "10:00:00") # number entering before 10am
mean(Enter < "10:00:00") # fraction entering before 10am
sum(Exit > "17:00:00") # number exiting after 5pm
mean(Exit > "17:00:00") # fraction exiting after 5pm
table(cut(hours(Enter), breaks = c(0, 10, 17, 24))) # Counts for indicated hours
## (0,10] (10,17] (17,24]
## 1 1 0
table(hours(Enter)) # Counts of entries each hour
## 9 17
## 1 1
stem(hours(Enter), scale = 2)
## The decimal point is at the |
## 9 | 0
## 10 |
## 11 |
## 12 |
## 13 |
## 14 |
## 15 |
## 16 |
## 17 | 0
Graphics:
tab <- c(table(Enter), -table(Exit)) # Freq at each time. Enter is pos; Exit is neg.
plot(times(names(tab)), tab, type = "h", xlab = "Time", ylab = "Freq")
abline(v = c(10, 17)/24, col = "red", lty = 2) # vertical red lines
abline(h = 0) # X axis
Thanks for the feedback, and sorry for the confusion; I have edited the question a bit to clarify.
New Edit:
First, the chron package and strptime with a fixed format both work well, as demonstrated in the other answers. I just want to introduce lubridate a little, since it is easier to use and flexible with time formats.
Example data
df <- data.frame(TimeEnterChar = c(rep("07:58", 10), "08:02", "08:03", "08:05", "08:10", "09:00"),
TimeExitChar = c("16:30", "16:50", "17:00", rep("17:02", 10), "17:30", "18:59"),
stringsAsFactors = F)
If all you want is to count how many entry times were later than 8:00, you can compare the characters directly. The line below shows that 5 entry times were later.
sum(df$TimeEnterChar > "08:00")
If you want more, I personally like the lubridate package when dealing with time data, especially timestamps with dates, although those are not the focus of this post.
library(lubridate)
# Convert character to a "Period" class by lubridate, shows in form of H M S
df$TimeEnterTime <- hm(df$TimeEnterChar)
df$TimeExitTime <- hm(df$TimeExitChar)
head(df)
sum(df$TimeEnterTime > hm("08:00"))
You can still compare the time.
A little more about using them as numeric:
I assume only minute-level time is wanted, so I divided the number of seconds by 60 to get the number of minutes.
df$DurationMinute <- as.numeric( df$TimeExitTime - df$TimeEnterTime )/60
hist(df$DurationMinute, breaks = seq(500, 600, 5))
head(df)
TimeEnterChar TimeExitChar TimeEnterTime TimeExitTime DurationMinute
1 07:58 16:30 7H 58M 0S 16H 30M 0S 512
2 07:58 16:50 7H 58M 0S 16H 50M 0S 532
3 07:58 17:00 7H 58M 0S 17H 0M 0S 542
4 07:58 17:02 7H 58M 0S 17H 2M 0S 544
5 07:58 17:02 7H 58M 0S 17H 2M 0S 544
6 07:58 17:02 7H 58M 0S 17H 2M 0S 544
You can simply plot a histogram to see the distribution of time duration between entry and exit.
You can also look at the distribution of entry/exit time. But some effort is needed to convert the axis.
df$TimeEnterNumMin <- as.numeric(df$TimeEnterTime) / 60
df$TimeExitNumMin <- as.numeric(df$TimeExitTime) / 60
hist(df$TimeEnterNumMin, breaks = seq(0, 1440, 60), xaxt = 'n', main = "Whole by 1hr")
axis(side = 1, at = seq(0, 1440, 60), labels = paste0(seq(0, 24, 1), ":00"))
hist(df$TimeEnterNumMin, breaks = seq(420, 600, 15), xaxt = 'n', main = "Morning by 15min")
axis(side = 1, at = seq(420, 600, 60), labels = paste0(seq(7, 10, 1), ":00"))
I did not polish the plots or make the axes flexible; please adapt them to your needs. Hopefully this helps.
Below is the old, not-so-useful post (no need to read; kept so that the comments don't look weird):
Came across a similar issue and was inspired by this post. @G. Grothendieck and @David Arenburg provided great answers for transforming the time.
For comparison, I feel forcing the time into numeric helps. Instead of comparing "11:22:33" with "9:00:00", comparing as.numeric(hms("11:22:33")) (which is 40953 seconds) and as.numeric(hms("9:00:00")) (32400) would be much easier.
as.numeric(hms("11:22:33")) > as.numeric(hms("9:00:00")) & as.numeric(hms("11:22:33")) < as.numeric(hms("17:00:00"))
[1] TRUE
The above example shows 11:22:33 is between 9AM and 5PM.
To extract just the time from a date or POSIXct object, substr("2013-10-01 11:22:33 UTC", 12, 19) should work, although it looks stupid to change a time object to a string/character and back to a time again.
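For instance, a minimal round-trip sketch (my illustration; format() gives the character form that substr() needs):
library(lubridate)
x <- as.POSIXct("2013-10-01 11:22:33", tz = "UTC")
tm <- substr(format(x), 12, 19)  # "11:22:33" as character
hms(tm)                          # back to a lubridate Period for comparisons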
Converting the time to numeric should work for plotting, as @G. Grothendieck described. You can convert the numbers back to times as needed for the x-axis labels.
Would something like this work?
SubstracTimes <- function(TimeEnter, TimeExit){
  # decimal hours of exit minus decimal hours of entry
  (as.numeric(format(strptime(TimeExit, format = "%H:%M"), "%H")) +
     as.numeric(format(strptime(TimeExit, format = "%H:%M"), "%M")) / 60) -
    (as.numeric(format(strptime(TimeEnter, format = "%H:%M"), "%H")) +
       as.numeric(format(strptime(TimeEnter, format = "%H:%M"), "%M")) / 60)
}
Testing:
TimeEnter <- "08:02"
TimeExit <- "12:02"
SubstracTimes(TimeEnter, TimeExit)
## [1] 4
I have a large file of time-series data covering several years in 15-minute increments. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time<-c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id per consecutive timeflow. A consecutive timespan is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 2014-04-30 23:30 until 2014-05-01 00:15, hence 4 rows). When this flow of 15-minute iterations is disrupted (after 2014-05-01 00:15 there is no stamp at 00:30, so it is disrupted), the next timestamp should count as the start of a new consecutive timeflow, and the number of rows is again calculated until that flow is disrupted. Time is POSIXct.
As you can see in the example above, a consecutive timeflow may span different days, different months, or different years. I have many unique ids (and, as said, a very large file), so I'm looking for an approach my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table, also using a time difference, would be to make use of data.table's internal values for the group number (.GRP) and the number of rows in each group (.N):
library(data.table)
res <- setDT(mydf)[, list(number_rows = .N, flow = .GRP),
                   by = .(uniqueid,
                          cumsum(as.numeric(difftime(time, shift(time, 1L, type = "lag", fill = 0))) - 15))][, cumsum := NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the reproduction code you posted didn't quite match the subset you showed (the time column needs to be POSIXct), I have included the data I used below:
Data
time<-as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by uniqueid and the cumulative sum of the between-row time differences that are not equal to 15 minutes; that gives the flow id, and a count of rows then gives you what you need.
The justification of the logic: whenever the time difference within a uniqueid is not equal to 15, a new flow should start, so we label that row TRUE; combined with cumsum, this becomes a new flow id for the following consecutive rows:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
# (format must be a named argument; the second positional argument of as.POSIXct is tz)
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
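To see the cumsum trick in isolation, here is a minimal sketch on a plain vector of hypothetical minute gaps (not the question's data):
d <- c(15, 15, 45, 15)                 # gaps between 5 consecutive rows, in minutes
flow <- 1 + cumsum(c(FALSE, d != 15))  # any gap other than 15 starts a new flow
flow
## [1] 1 1 1 2 2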
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time<-rep(as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")),times = 20000000)
uniqueid<-rep(as.character(c("a","a","a","a","a","a","b","b","b")),times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption that uniqueid's and times are already in order is a big one, and the other solutions might be able to take better advantage of it. order() is easy enough to apply first, as sketched below.
I'm not sure about the memory impact, or about data sets that aren't so simple. It should be easy enough to break the data into chunks and process them if memory is an issue. It certainly takes more code in base R.
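If the ordering assumption doesn't hold, a pre-sort along these lines (my sketch, not benchmarked) restores it before the steps above:
# sort by id, then time, so consecutive rows within an id are adjacent
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]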
Having both "id" and "time" columns ordered, we can build a single grouping to operate on by creating a logical vector that marks wherever either "id" changes or the "time" gap is > 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id":
id_gr = c(TRUE, id[-1] != id[-length(id)])
and "time":
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
then combine them:
gr = id_gr | tm_gr
which marks wherever either "id" changed or the "time" gap was > 15.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300), sample(5e2:1e3, 1), TRUE)*60),
simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
id = as.character(dat$id); tm = dat$time
id_gr = c(TRUE, id[-1] != id[-length(id)])
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
gr = id_gr | tm_gr
tab = tabulate(cumsum(gr))
ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, included MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
by = .(id, cumsum(as.numeric(difftime(time,
shift(time, 1L, type = "lag", fill = 0),
unit = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE
An MRE is pasted below
MRE
date<-c('2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-01','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02','2001-01-02')
time<-c('07:00:00 GMT','08:00:00 GMT','09:00:00 GMT','10:00:00 GMT','11:00:00 GMT','12:00:00 GMT','13:00:00 GMT','14:00:00 GMT','15:00:00 GMT','16:00:00 GMT','17:00:00 GMT', '18:00:00 GMT','19:00:00 GMT','20:00:00 GMT','21:00:00 GMT','22:00:00 GMT','23:00:00 GMT','00:00:00 GMT', '01:00:00 GMT','02:00:00 GMT','03:00:00 GMT','04:00:00 GMT','05:00:00 GMT','06:00:00 GMT','07:00:00 GMT','08:00:00 GMT','09:00:00 GMT','10:00:00 GMT','11:00:00 GMT','12:00:00 GMT','13:00:00 GMT','14:00:00 GMT','15:00:00 GMT','16:00:00 GMT','17:00:00 GMT','18:00:00 GMT','19:00:00 GMT','20:00:00 GMT','21:00:00 GMT')
el<-c(0.257,0.687,1.861,3.288, 4.821,6.172,7.048,7.258,6.799,5.654,4.463,3.443,2.704,2.708,3.328,4.23,5.244,5.985,6.317,6.074,5.234,3.981,2.662,1.615,0.88,0.746,1.405,2.527,3.928,5.283,6.517,7.179,7.252,6.625,5.454,4.214,3.144,2.491,2.357)
library(data.table)
library(ggplot2)
Time <- as.POSIXct(paste(date, time), tz = "GMT")
wave <- data.table(Time, el)
ggplot(wave, aes(Time, el)) + geom_point() + labs(x = "time", y = "elevation") + geom_hline(yintercept = 4)
I have a wave time series, and I want a function that can tell me the frequency and the mean/median duration for which the wave is above a given elevation. In my example I have chosen 4.
I want to interpolate the time when the wave reaches 4 on the rising and falling edges and find the time difference between the two points for each wave.
I can do this with a for loop, but I think I should be able to do it much faster in data.table. I have 1 million+ points for several locations, so I do not think a for loop would be efficient.
For the rising wave I want to do something like:
wave[,timeIs4:=ifelse(elev<3 & elev[+1]>4,TRUE,FALSE )]
But instead of TRUE I would put in my interpolation calculation. I do not know how to access preceding and following values within a data.table, as I would with i+1 or i-1 in a for loop.
Desired Output
Rising leg
I want to interpolate between points 4 and 5; 15 and 16; 29 and 30.
Falling leg
I want to interpolate between points 11 and 12; 21 and 22; 36 and 37
Approximate Outcome
Rising Falling
10:28:00 17:27:00
21:45:00 3:59:00
11:03:00 18:12:00
Then I will be able to subtract Rising from Falling using difftime() to determine the amount of time the water level was above the given elevation.
This will give me the frequency and duration the water is above the given elevation.
Here's a possible solution using the devel version from GitHub. You will need it both for the shift function (as mentioned by @Jan) and for the new dcast method which accepts expressions. Also, you don't have minutes in your MRE, so I'm not sure where you got those in your expected output.
Anyway, for starters we will create an index (we will call it Wave so you will know which wave # it's coming from) that tells us whether the wave is rising or falling, using shift. Then we will dcast on the matched values while removing the unmatched ones using na.omit (you can reorder the columns later if you like, using the setcolorder function):
library(data.table) ## V 1.9.5+
wave[el <= 4 & shift(el, type = "lead") > 4, Wave := "Rising"]
wave[el > 4 & shift(el, type = "lead") <= 4, Wave := "Falling"]
dcast(na.omit(wave), cumsum(Wave == "Rising") ~ Wave, value.var = "Time")
# Wave Falling Rising
# 1: 1 2001-01-01 17:00:00 2001-01-01 10:00:00
# 2: 2 2001-01-02 03:00:00 2001-01-01 21:00:00
# 3: 3 2001-01-02 18:00:00 2001-01-02 11:00:00
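As a possible follow-up (my addition, assuming the dcast result is stored in res), the difftime() step the question mentions could then be applied directly:
res <- dcast(na.omit(wave), cumsum(Wave == "Rising") ~ Wave, value.var = "Time")
res[, Duration := difftime(Falling, Rising, units = "mins")]  # time above the threshold per wave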
Here is another possible idea:
elev = 4
#a helper function to linearly interpolate the crossing time
ff = function(el1, el2, el, time1, time2)
time1 + ((el - el1) / (el2 - el1)) * (time2 - time1)
dif = diff(findInterval(wave$el, c(-Inf, elev, Inf)))
ris = which(dif == 1) #risings
fal = which(dif == -1) #fallings
ff(wave$el[ris], wave$el[ris + 1], elev, wave$Time[ris], wave$Time[ris + 1])
#[1] "2001-01-01 10:27:52 GMT" "2001-01-01 21:44:42 GMT" "2001-01-02 11:03:11 GMT"
ff(wave$el[fal], wave$el[fal + 1], elev, wave$Time[fal], wave$Time[fal + 1])
#[1] "2001-01-01 17:27:14 GMT" "2001-01-02 03:59:05 GMT" "2001-01-02 18:12:00 GMT"
I've got logs of events that contain:
start time, end time, category id and count. They cover several months.
I'd like to aggregate them over time to be able to trace histograms over a given day, week, month.
So I assume the best way to do this is to bin the periods into buckets; I think 5 minutes would be good.
E.g. if an event starts at 1:01pm and ends at 1:07pm, I'd like to obtain 2 records for it, as it covers 2 five-minute periods (0-5 and 5-10), and replicate the rest of the original data for these new records (category and count).
If my input logs (x) are as follows:
start / end / catid / count
2012-11-17 15:05:02.0, 2012-11-17 15:12:52.0, 1, 2
2012-11-17 15:07:13.0, 2012-11-17 15:17:47.0, 2, 10
2012-11-17 15:11:00.0, 2012-11-17 15:12:33.0, 3, 5
2012-11-17 15:12:01.0, 2012-11-17 15:20:00.0, 4, 1
I'm trying to get the output bucketed in 5 minutes (b) this way:
start / catid / count
2012-11-17 15:05:00.0 1, 2
2012-11-17 15:10:00.0 1, 2
2012-11-17 15:05:00.0 2, 10
2012-11-17 15:10:00.0 2, 10
2012-11-17 15:15:00.0 2, 10
2012-11-17 15:10:00.0 3, 5
2012-11-17 15:10:00.0 4, 1
2012-11-17 15:15:00.0 4, 1
Then I can easily aggregate the new data frame (b) over category ids for the period I want (hour, day, week, month)
I'm starting with R, and I found a lot of explanations on how to bucket a time value, but not a period of time.
I've had a look at zoo and xts but I couldn't quite find what to do.
Hopefully that makes sense to some of you.
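For reference, here is a minimal sketch of the core operation being asked for (my own illustration with a hypothetical helper name, not from any answer below): floor both endpoints to the 5-minute grid and step between them.
# buckets covered by one interval, assuming POSIXct start/end in GMT
bucket_seq <- function(start, end, width_min = 5) {
  w <- width_min * 60
  from <- as.POSIXct(floor(as.numeric(start) / w) * w, origin = "1970-01-01", tz = "GMT")
  to   <- as.POSIXct(floor(as.numeric(end) / w) * w, origin = "1970-01-01", tz = "GMT")
  seq(from, to, by = w)
}
bucket_seq(as.POSIXct("2012-11-17 15:05:02", tz = "GMT"),
           as.POSIXct("2012-11-17 15:12:52", tz = "GMT"))
## "2012-11-17 15:05:00 GMT" "2012-11-17 15:10:00 GMT"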
Edit:
I've slightly modified Ram's suggestion so that the number of blocks is calculated from the rounded end time rather than the original end time. (Thanks Ram!)
mnslot=15 # size of the buckets/slot in minutes
#Round down the minutes of starttime to a multiple of mnslot
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
roundedmins <- floor(min_st/mnslot) * mnslot
st.base <- strptime(st, "%Y-%m-%d %H")
rounded_start <- st.base + (roundedmins * 60)
#Round down the minutes of the endtime to a multiple of mnslot.
en.str <- strptime(en, "%Y-%m-%d %H:%M:%S")
min_en <- as.numeric(format(en.str, "%M"))
roundedmins <- floor(min_en/mnslot) * mnslot
en.base <- strptime(en, "%Y-%m-%d %H")
rounded_end<- en.base + (roundedmins * 60)
# calculate the number of blocks based on the rounded minutes of start and end
numblocks<- as.numeric(floor((rounded_end-rounded_start)/mnslot/60)+1)
# difference of POSIXct values is in minutes,
# but difference of POSIXlt values seems to be in seconds, so we have to divide by 60 as well
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numblocks)){
for (newrow in 1:numblocks[n]){
replicated_start = c(replicated_start, rounded_start[n] + (newrow-1)*mnslot*60) # step by mnslot, not a hardcoded 300
replicated_cat = c(replicated_cat, df$catid[n])
replicated_count = c(replicated_count, df$count[n])
}
}
#Change to readable format (unix2POSIXct is defined in Ram's answer below)
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf
This produces the required output. It is a bit slow though :P
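If the nested loop is the bottleneck, a vectorized sketch along these lines (my own, reusing rounded_start, numblocks, mnslot, and unix2POSIXct from above) should produce the same frame without the loops:
idx    <- rep(seq_along(numblocks), numblocks)  # each row index repeated once per block
offset <- sequence(numblocks) - 1               # 0, 1, ... within each row's blocks
newdf <- data.frame(start = unix2POSIXct(as.numeric(rounded_start)[idx] + offset * mnslot * 60),
                    CatId = df$catid[idx],
                    Count = df$count[idx])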
Here's a fully working version. It involves step-by-step data manipulation for what you are after.
#storing the original data as a csv
df <- read.csv("tsdata.csv")
st<-as.POSIXlt(df$start)
en<-as.POSIXlt(df$end)
#a utility function to convert formats
unix2POSIXct <- function (time) structure(time, class = c("POSIXt", "POSIXct") )
#For each row, determine how many replications are needed
numdups <- as.numeric(floor((en-st)/5)+1)
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
#Round down the minutes of start to 5 minute starts. 0,5,10 etc...
roundedmins <- floor(min_st/5) * 5
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numdups)){
for (newrow in 1:numdups[n]){
replicated_start = c(replicated_start, df$rounded_start[n]+(newrow-1)*300 )
replicated_cat = c(replicated_cat, df$catid[n])
replicated_count = c(replicated_count, df$count[n])
}
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf
Which produces:
start CatId Count
1 2012-11-17 15:05:00 1 2
2 2012-11-17 15:10:00 1 2
3 2012-11-17 15:05:00 2 10
4 2012-11-17 15:10:00 2 10
5 2012-11-17 15:15:00 2 10
6 2012-11-17 15:10:00 3 5
7 2012-11-17 15:10:00 4 1
8 2012-11-17 15:15:00 4 1
That's not an easy one ... I am also missing the structure of the whole problem, so I hope it is OK if I limit myself to outlining the basic approach; if things are unclear, you can come back to me.
First (if I were you) I would install the lubridate package, which makes playing around with dates and times a lot easier.
Then maybe try something like this:
z <- strptime("17/11/12 15:05:00.0", "%d/%m/%y %H:%M:%OS")
This will define your starting point in time. If that is instead supposed to be defined by the first logs (x) time, lubridate's minute() setter is available, e.g.
z <- strptime("17/11/12 15:05:02.0", "%d/%m/%y %H:%M:%OS")
minute(z)<-5;second(z)<-0.0 #I guess, you get the concept
Then produce a sequence of 5 minute intervals
z5s<-z+minutes(seq(0,100,5))
This will produce a sequence of 21 time points spanning 20 five-minute intervals; here again I do not know how flexible the whole thing is supposed to be.
Finally, you could then play around with, for instance, integer division and modulo operations
z2<-z+minutes(2)
z2 should be the end time, I just added 2 minutes "manually" here to illustrate the concept
(as.integer(z2 - z)) %/% 5 > 0
FALSE
or if you want to see how many 5-minute spans are covered, just do (as.integer(z2-z)) %/% 5
or whatever other functions you prefer to match/distribute the log times across the z5s POSIXlt intervals.
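As one concrete option for that matching step (a sketch with made-up log times), findInterval() can place each log time into the z5s grid:
log_times <- z + minutes(c(1, 6, 12))  # hypothetical log timestamps
findInterval(as.numeric(as.POSIXct(log_times)), as.numeric(as.POSIXct(z5s)))
## [1] 1 2 3  (which 5-minute slot each time falls in)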
Hope this helps a bit, i.e. gives you some direction.