I have data which includes Date as well as Time enter and Time exit. These latter two contain data like this: 08:02, 12:02, 23:45 etc.
I would like to manipulate the Time enter and Time exit data - for example, subtract Time enter from Time exit to work out duration, or plot the distributions of Time enter and Time exit, e.g. to see if most entries are before 10:00, or if most exits are after 17:00.
All the packages I've looked at require a date to precede the time, e.g. 01/02/2012 12:33.
Is this possible, or should I simply append an identical date to every time for the sake of calculations? That seems a bit messy!
Use the "times" class found in the chron package:
library(chron)
Enter <- c("09:12", "17:01")
Enter <- times(paste0(Enter, ":00"))
Exit <- c("10:15", "18:11")
Exit <- times(paste0(Exit, ":00"))
Exit - Enter # durations
sum(Enter < "10:00:00") # number entering before 10am
mean(Enter < "10:00:00") # fraction entering before 10am
sum(Exit > "17:00:00") # number exiting after 5pm
mean(Exit > "17:00:00") # fraction exiting after 5pm
table(cut(hours(Enter), breaks = c(0, 10, 17, 24))) # Counts for indicated hours
## (0,10] (10,17] (17,24]
## 1 1 0
table(hours(Enter)) # Counts of entries each hour
## 9 17
## 1 1
stem(hours(Enter), scale = 2)
## The decimal point is at the |
## 9 | 0
## 10 |
## 11 |
## 12 |
## 13 |
## 14 |
## 15 |
## 16 |
## 17 | 0
Graphics:
tab <- c(table(Enter), -table(Exit)) # Freq at each time. Enter is pos; Exit is neg.
plot(times(names(tab)), tab, type = "h", xlab = "Time", ylab = "Freq")
abline(v = c(10, 17)/24, col = "red", lty = 2) # vertical red lines
abline(h = 0) # X axis
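If your times live in a data frame next to the Date column, as in the question, the same approach carries over directly. A minimal sketch (the data frame and column names below are made up for illustration):
library(chron)
dat <- data.frame(Date      = c("01/02/2012", "01/02/2012"),
                  TimeEnter = c("08:02", "09:12"),
                  TimeExit  = c("12:02", "17:01"),
                  stringsAsFactors = FALSE)
dat$TimeEnter <- times(paste0(dat$TimeEnter, ":00"))
dat$TimeExit  <- times(paste0(dat$TimeExit, ":00"))
dat$Duration  <- dat$TimeExit - dat$TimeEnter   # per-row durations as "times" values
mean(dat$TimeEnter < "10:00:00")                # fraction entering before 10am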
Thanks for the feedback, and sorry for the confusion; I have edited the answer a bit to clarify.
New Edit:
First, the chron package and strptime with a fixed format both work well, as demonstrated in other answers. I just want to introduce lubridate a little, since it is easy to use and flexible with time formats.
Example data
df <- data.frame(TimeEnterChar = c(rep("07:58", 10), "08:02", "08:03", "08:05", "08:10", "09:00"),
TimeExitChar = c("16:30", "16:50", "17:00", rep("17:02", 10), "17:30", "18:59"),
stringsAsFactors = F)
If all you want is to count how many entry times were later than 08:00, you can compare the character strings directly. The line below should show that 5 entry times were later.
sum(df$TimeEnterChar > "08:00")
If you want more, I personally like the lubridate package when dealing with time data, especially timestamps with dates, although that is not the focus of this post.
library(lubridate)
# Convert character to a "Period" class by lubridate, shows in form of H M S
df$TimeEnterTime <- hm(df$TimeEnterChar)
df$TimeExitTime <- hm(df$TimeExitChar)
head(df)
sum(df$TimeEnterTime > hm("08:00"))
You can still compare the time.
A little more about using them as numeric:
I assume only minute-level precision is wanted, so I divided the number of seconds by 60 to get the number of minutes.
df$DurationMinute <- as.numeric( df$TimeExitTime - df$TimeEnterTime )/60
hist(df$DurationMinute, breaks = seq(500, 600, 5))
head(df)
TimeEnterChar TimeExitChar TimeEnterTime TimeExitTime DurationMinute
1 07:58 16:30 7H 58M 0S 16H 30M 0S 512
2 07:58 16:50 7H 58M 0S 16H 50M 0S 532
3 07:58 17:00 7H 58M 0S 17H 0M 0S 542
4 07:58 17:02 7H 58M 0S 17H 2M 0S 544
5 07:58 17:02 7H 58M 0S 17H 2M 0S 544
6 07:58 17:02 7H 58M 0S 17H 2M 0S 544
You can simply plot a histogram to see the distribution of time duration between entry and exit.
You can also look at the distribution of entry/exit time. But some effort is needed to convert the axis.
df$TimeEnterNumMin <- as.numeric(df$TimeEnterTime) / 60
df$TimeExitNumMin <- as.numeric(df$TimeExitTime) / 60
hist(df$TimeEnterNumMin, breaks = seq(0, 1440, 60), xaxt = 'n', main = "Whole by 1hr")
axis(side = 1, at = seq(0, 1440, 60), labels = paste0(seq(0, 24, 1), ":00"))
hist(df$TimeEnterNumMin, breaks = seq(420, 600, 15), xaxt = 'n', main = "Morning by 15min")
axis(side = 1, at = seq(420, 600, 60), labels = paste0(seq(7, 10, 1), ":00"))
I did not polish the plots or make the axes flexible; please adjust them to your needs. Hopefully this helps.
Below is the old post (no need to read; kept so that the comments don't look out of place):
I came across a similar issue and was inspired by this post. @G. Grothendieck and @David Arenburg provided great answers for transforming the time.
For comparisons, I find coercing the time to numeric helps. Instead of comparing "11:22:33" with "9:00:00", comparing as.numeric(hms("11:22:33")) (which is 40953 seconds) with as.numeric(hms("9:00:00")) (32400) is much easier.
as.numeric(hms("11:22:33")) > as.numeric(hms("9:00:00")) & as.numeric(hms("11:22:33")) < as.numeric(hms("17:00:00"))
[1] TRUE
The above example shows 11:22:33 is between 9AM and 5PM.
To extract just the time from a date or POSIXct object, substr("2013-10-01 11:22:33 UTC", 12, 19) should work, although it feels clumsy to change a time object to a character string and back to a time again.
Converting the time to numeric should also work for plotting, as @G. Grothendieck described. You can convert the numbers back to times as needed for the x axis labels.
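For completeness, a small sketch of both ideas (the timestamp and helper variables here are mine, just for illustration): pulling the clock time out of a POSIXct value (format() is a slightly tidier route than substr()), and turning numeric seconds back into HH:MM labels for an axis.
library(lubridate)
ts <- as.POSIXct("2013-10-01 11:22:33", tz = "UTC")
format(ts, "%H:%M:%S")                # "11:22:33" -- just the clock time
substr(as.character(ts), 12, 19)      # same result via substr(), as mentioned above
# numeric seconds-since-midnight back to "HH:MM" labels (e.g. for axis())
secs <- as.numeric(hms(c("09:00:00", "11:22:33", "17:00:00")))
sprintf("%02d:%02d", as.integer(secs %/% 3600), as.integer((secs %% 3600) %/% 60))
# "09:00" "11:22" "17:00"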
Would something like this work?
SubstracTimes <- function(TimeEnter, TimeExit){
(as.numeric(format(strptime(TimeExit, format ="%H:%M"), "%H")) +
as.numeric(format(strptime(TimeExit, format ="%H:%M"), "%M"))/60) -
(as.numeric(format(strptime(TimeEnter, format ="%H:%M"), "%H")) +
as.numeric(format(strptime(TimeEnter, format ="%H:%M"), "%M"))/60)
}
Testing:
TimeEnter <- "08:02"
TimeExit <- "12:02"
SubstracTimes(TimeEnter, TimeExit)
> SubstracTimes(TimeEnter, TimeExit)
[1] 4
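Since strptime() and format() are vectorized, SubstracTimes() also works on whole columns of times at once; for example (made-up values):
TimeEnter <- c("08:02", "23:45")
TimeExit  <- c("12:02", "23:59")
SubstracTimes(TimeEnter, TimeExit)
# [1] 4.0000000 0.2333333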
Related
Suppose I have a zoo object (or it could be a data.frame) that has an index on "time of day" and has some value (see sample data below):
val
...
2006-08-01 12:00 23
2006-08-01 12:01 24
2006-08-01 12:02 25
2006-08-01 12:03 26
2006-08-01 12:04 27
2006-08-01 12:05 28
2006-08-01 12:06 29
...
2006-08-02 12:00 123
2006-08-02 12:01 124
2006-08-02 12:02 125
2006-08-02 12:03 126
2006-08-02 12:04 127
...
I would like to call a custom function (call it custom.func(vals)) on the values from 12:01 - 12:03 (i.e. something similar to zoo::rollapply) every time that interval occurs, so in this example, daily. How would I do that?
NOTES (for robustness, it would also be great to take the following edge cases into account, but that's not necessary):
Don't assume that I have values for 12:01 - 12:03 every day
Don't assume that the entire range 12:01 - 12:03 is present every day. Some days I might only have 12:01 and 12:02 but might be missing 12:03
What if I wanted my custom.func(vals) to be called on day boundaries like using val from 23:58 - 00:12?
Suppose our input is the POSIXct zoo object z given in the Note at the end.
Create a character vector times which has one element per element of z, in the form HH:MM. Then create a logical vector ok indicating which times are between the boundary values; z[ok] is then z reduced to those values. Finally, for each day apply sum (another function could be used if desired) using aggregate.zoo:
times <- format(time(z), "%H:%M")
ok <- times >= "12:01" & times <= "12:03"
aggregate(z[ok], as.Date, sum)
## 2006-08-01 2006-08-02
## 75 375
Times straddle midnight
This version is for the case where the times straddle midnight. Note that the order of the values sent to the function is not the original order, but if the function is symmetric that does not matter.
times <- format(time(z), "%H:%M")
ok <- times >= "23:58" | times <= "00:12"
aggregate(z[ok], (as.Date(format(time(z))) + (times >= "23:58"))[ok], sum)
## 2006-08-02
## 41
Variation
The prior code chunk works if the function is symmetric in the components of its argument (which is the case for many functions such as mean and sum), but if the function were not symmetric we would need a slightly different approach. We define to.sec, which translates an HH:MM string to numeric seconds, and subtract to.sec("23:58") from each POSIXct datetime. The components of z to keep are then those whose shifted times, converted to HH:MM character strings, are no greater than "00:14".
to.sec <- function(x) with(read.table(text = x, sep = ":"), 3600 * V1 + 60 * V2)
times <- format(time(z) - to.sec("23:58"), "%H:%M")
ok <- times <= "00:14"
aggregate(z[ok], as.Date(time(z)[ok] - to.sec("23:58")), sum)
## 2006-08-01
## 41
Note
Lines <- "datetime val
2006-08-01T12:00 23
2006-08-01T12:01 24
2006-08-01T12:02 25
2006-08-01T12:03 26
2006-08-01T12:04 27
2006-08-01T12:05 28
2006-08-01T12:06 29
2006-08-01T23:58 20
2006-08-02T00:01 21
2006-08-02T12:00 123
2006-08-02T12:01 124
2006-08-02T12:02 125
2006-08-02T12:03 126
2006-08-02T12:04 127"
library(zoo)
z <- read.zoo(text = Lines, tz = "", header = TRUE, format = "%Y-%m-%dT%H:%M")
EDIT
Have revised the non-symmetric code and simplified all code chunks.
I recommend the runner package, which allows you to compute any rolling function on an irregular time series. The function runner is the equivalent of rollapply, with the distinction that the window can depend on dates. runner applies any R function over a window of length k, defined on a date index idx (or any integer index). The example below calculates a regression over a 5-minute (5 * 60 sec) window. The algorithm does not care whether there is a day change; it just takes the last 5 minutes each time (for example 23:56 - 00:01).
Create data:
set.seed(1)
x <- cumsum(rnorm(1000))
y <- 3 * x + rnorm(1000)
time <- as.POSIXct(cumsum(sample(60:120, 1000, replace = TRUE)),
origin = Sys.Date()) # unequaly spaced time series
data <- data.frame(time, y, x)
Custom function to be called on sliding windows:
library(runner)
running_regression <- function(idx) {
  # fit the regression on the current window's rows only and
  # return the fitted value for its last (most recent) observation
  predict(lm(y ~ x, data = data[idx, ]))[length(idx)]
}
data$pred <- runner(seq_along(x),
k = 60 * 5,
idx = time,
f = running_regression)
Once we have created the dataset with a rolling 5-minute prediction, we can filter only particular windows - here, only the first minute of each hour. This means we always keep the {hh}:56 - {hh+1}:01 window.
library(dplyr)
library(lubridate)
filtered <-
data %>%
filter(minute(time) == 1)
plot(data$time, data$y, type = "l", col = "red")
points(filtered$time, filtered$pred, col = "blue")
There are some other examples in the vignette showing how to do this with runner.
I have 24-hour data starting from 7:30 today (for example) until 7:30 the next day. Because I didn't link the date to the line plot, R sorts the hours starting from 00:00 even though the data starts at 7:30. I am a beginner in R and I don't know where to begin to solve this problem. Should I try linking the date to the x axis as well, or is there a better solution?
My times() call somehow didn't work either; it used to work when I was plotting data in 15-minute increments.
library(chron)
d <- read.csv(file="data.csv", header = T)
t <- times(d$Time)
plot(t,d$MCO2, type="l")
Graph created from the 24-hour data I have:
Graph created from the 15-minute data using the same code:
I wanted the x axis to run from 7:30 to 7:30 the next day, but it now shows a decimal number from 0.0 to 1.0.
Here is the link to the data, just in case:
https://www.dropbox.com/s/wsg437gu00e5t08/Data%20210519.csv?dl=0
The question is actually about combining a date column and a time column to create a timestamp containing date AND time. Note that I suggest processing everything as if we are in the GMT timezone. You can pick whatever timezone you want; just stick to it.
# use ggplot
library(ggplot2)
# assume everything happens in GMT timezone
Sys.setenv( TZ = "GMT" )
# replicating the data: a measurement result sampled at a 1 sec interval
# (the original post does not define start/end; assume one day starting 2019-05-22, GMT)
start <- as.POSIXct("2019-05-22 00:00:00", tz = "GMT")
end   <- start + 24 * 60 * 60
t <- seq(start, end, by = "1 sec")
Time24 <- trimws(strftime(t, format = "%k:%M:%OS", tz="GMT"))
Date <- strftime(t, format = "%d/%m/%Y", tz="GMT")
head(Time24)
head(Date)
d <- data.frame(Date, Time24)
# this is just a random data of temperature
d$temp <- rnorm(length(d$Date),mean=25,sd=5)
head(d)
# the resulting data is as follows
# Date Time24 temp
#1 22/05/2019 0:00:00 22.67185
#2 22/05/2019 0:00:01 19.91123
#3 22/05/2019 0:00:02 19.57393
#4 22/05/2019 0:00:03 15.37280
#5 22/05/2019 0:00:04 31.76683
#6 22/05/2019 0:00:05 26.75153
# this is the answer to the question
# which is combining the the date and the time column of the data
# note we still assume that this happens in GMT
t <- as.POSIXct(paste(d$Date,d$Time24,sep=" "), format = "%d/%m/%Y %H:%M:%OS", tz="GMT")
# print the data into a plot
png(filename = "test.png", width = 800, height = 600, units = "px", pointsize = 22 )
ggplot(d,aes(x=t,y=temp)) + geom_line() +
  scale_x_datetime(date_breaks = "3 hour",
                   date_labels = "%H:%M\n%d-%b")
# close the png device so the plot file is actually written
dev.off()
The problem is that the times function does not include information about the day. This matters because your data spans two days.
The data type you use should be able to include information about the day; POSIXct is that data type. Also, since POSIXct is the go-to date-time class in R, it is much easier to plot.
Before plotting, the time column should reflect the correct difference in days. When simply transforming the column with as.POSIXct, the times from day 2 are read as if they were from day 1, which is why we have to add 24 hours to the relevant entries.
After that, it is just a matter of plotting. I added an example with ggplot2, since I prefer those plots.
You might notice that as.POSIXct adds an incorrect date to your time information. Don't worry about this; the date is just a dummy. You never use the date itself, only the difference in days.
library(ggplot2)
# Read in your data set
d <- read.csv(file="Data 210519.csv", header = T)
# Read column into R date-time object
t <- as.POSIXct(d$Time24, format = "%H:%M:%OS")
# Add 24 hours to time the time on day 2.
startOfDayTwo <- as.POSIXct("00:00:00", format = "%H:%M:%OS")
endOfDayTwo <- as.POSIXct("07:35:00", format = "%H:%M:%OS")
t[t >= startOfDayTwo & t <= endOfDayTwo] <- t[t >= startOfDayTwo & t <= endOfDayTwo] + 24*60*60
plot(t,d$MCO2, type="l")
# arguably a nicer plot
ggplot(d,aes(x=t,y=MCO2)) + geom_line() +
scale_x_datetime(date_breaks = "2 hour",
date_labels = "%I:%M %p")
My data frame looks like:
set.seed(1)
MyDates <- ISOdatetime(2012, 1, 1, 0, 0, 0, tz = "GMT") + sample(1:27000, 500)
It is fairly easy to count the number of observations per 5 minutes by using cut:
df <- data.frame(table(cut(MyDates, breaks = "5 mins")))
It gives me intervals as
00:00:00 -- 00:05:00,
00:05:00 -- 00:10:00
But what if I want to get 'customized' intervals such as
00:00:00 -- 00:05:00,
00:01:00 -- 00:06:00,
00:02:00 -- 00:07:00
Any help would be appreciated!
You just need to pass a vector of custom cut points (date-times) to cut.
An example:
data.frame(table(cut(MyDates,
c(min(MyDates), ## "Leftmost" cut
min(MyDates) + 5000, ## Custom cut 1
min(MyDates) + 17000, ## Custom cut 2
min(MyDates) + 65000),## "Rightmost" cut
right = TRUE))) ## right = TRUE: intervals are closed on the right (open on the left)
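cut() itself only produces non-overlapping bins, so if you really want the overlapping windows from the question (a 5-minute window starting at every minute), one simple, if not very fast, approach is to build a shifted 5-minute grid for each of the 5 possible offsets and call cut() once per grid. A rough sketch, reusing MyDates from above:
offsets <- 0:4   # grids starting at :00, :01, :02, :03, :04 of each 5-minute cycle
counts <- lapply(offsets, function(o) {
  grid_start <- as.POSIXct(trunc(min(MyDates), "mins")) + o * 60
  brks <- seq(grid_start, max(MyDates) + 5 * 60, by = "5 min")
  # times earlier than the first shifted break fall outside the grid and are dropped as NA
  table(cut(MyDates, breaks = brks))
})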
I've got logs of events that contain:
start time, end time, category id and count. They cover several months.
I'd like to aggregate them over time to be able to trace histograms over a given day, week, month.
So I assume the best way to do this is to bin the periods in buckets. I think 5 minutes would be good.
e.g. if an event starts at 1.01pm and ends at 1.07pm, I'd like to obtain 2 records for it, as it covers 2 five-minute periods (0-5 and 5-10), and replicate the rest of the original data for these new records (category and count).
if my input logs (x) are as such:
start / end / catid / count
2012-11-17 15:05:02.0, 2012-11-17 15:12:52.0, 1, 2
2012-11-17 15:07:13.0, 2012-11-17 15:17:47.0, 2, 10
2012-11-17 15:11:00.0, 2012-11-17 15:12:33.0, 3, 5
2012-11-17 15:12:01.0, 2012-11-17 15:20:00.0, 4, 1
I'm trying to get the output bucketed in 5 minutes (b) this way:
start / catid / count
2012-11-17 15:05:00.0 1, 2
2012-11-17 15:10:00.0 1, 2
2012-11-17 15:05:00.0 2, 10
2012-11-17 15:10:00.0 2, 10
2012-11-17 15:15:00.0 2, 10
2012-11-17 15:10:00.0 3, 5
2012-11-17 15:10:00.0 4, 1
2012-11-17 15:15:00.0 4, 1
Then I can easily aggregate the new data frame (b) over category ids for the period I want (hour, day, week, month)
I'm starting out with R and I found a lot of explanations on how to bucket a time value, but not a period of time.
I've had a look at zoo and xts but I couldn't quite find what to do.
Hopefully that makes sense to some of you.
Edit:
I've slightly modified Ram's suggestion to get the correct calculation of blocks using the rounded endtime rather than the original end time. (Thanks Ram!)
mnslot=15 # size of the buckets/slot in minutes
#Round down the minutes of starttime to a mutliple of mnslot
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
roundedmins <- floor(min_st/mnslot) * mnslot
st.base <- strptime(st, "%Y-%m-%d %H")
rounded_start <- st.base + (roundedmins * 60)
#Round down the minutes of the endtime to a multiple of mnslot.
en.str <- strptime(en, "%Y-%m-%d %H:%M:%S")
min_en <- as.numeric(format(en.str, "%M"))
roundedmins <- floor(min_en/mnslot) * mnslot
en.base <- strptime(en, "%Y-%m-%d %H")
rounded_end<- en.base + (roundedmins * 60)
# calculate the number of blocks based on the rounded minutes of start and end
numblocks<- as.numeric(floor((rounded_end-rounded_start)/mnslot/60)+1)
# the difference of POSIXct values is in minutes,
# but the difference of POSIXlt values seems to be in seconds, so we have to divide by 60 as well
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numblocks)){
  for (newrow in 1:numblocks[n]){
    # step forward one slot (mnslot minutes) per replicated row
    replicated_start = c(replicated_start, rounded_start[n] + (newrow - 1) * mnslot * 60)
    replicated_cat = c(replicated_cat, df$catid[n])
    replicated_count = c(replicated_count, df$count[n])
  }
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf
This produces the required output. It is a bit slow, though.
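For what it's worth, a hedged, vectorized alternative to the explicit double loop (assuming the rounded_start, numblocks, df and mnslot objects computed above) avoids growing vectors inside the loop and should be noticeably faster:
idx    <- rep(seq_along(numblocks), numblocks)   # each event's row index, repeated once per slot
offset <- sequence(numblocks) - 1                # 0, 1, 2, ... within each event
newdf  <- data.frame(start = rounded_start[idx] + offset * mnslot * 60,
                     CatId = df$catid[idx],
                     Count = df$count[idx])
newdf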
Here's a fully working version. It involves step-by-step data manipulation for what you are after.
# read in the original data, stored as a csv
df <- read.csv("tsdata.csv")
st<-as.POSIXlt(df$start)
en<-as.POSIXlt(df$end)
#a utility function to convert formats
unix2POSIXct <- function (time) structure(time, class = c("POSIXt", "POSIXct") )
#For each row, determine how many replications are needed
numdups <- as.numeric(floor((en-st)/5)+1)
st.str <- strptime(st, "%Y-%m-%d %H:%M:%S")
min_st <- as.numeric(format(st.str, "%M"))
#Round down the minutes of start to 5 minute starts. 0,5,10 etc...
roundedmins <- floor(min_st/5) * 5
st.base <- strptime(st, "%Y-%m-%d %H")
df$rounded_start <- st.base + (roundedmins * 60)
#Create REPLICATED Rows, depending on the size of the interval
replicated_cat = NULL
replicated_count = NULL
replicated_start = NULL
for (n in 1:length(numdups)){
for (newrow in 1:numdups[n]){
replicated_start = c(replicated_start, df$rounded_start[n]+(newrow-1)*300 )
replicated_cat = c(replicated_cat, df$catid[n])
replicated_count = c(replicated_count, df$count[n])
}
}
#Change to readable format
POSIXT <- unix2POSIXct(replicated_start)
newdf <- data.frame(POSIXT, replicated_cat, replicated_count)
names(newdf) <- c("start", "CatId", "Count")
newdf
Which produces:
start CatId Count
1 2012-11-17 15:05:00 1 2
2 2012-11-17 15:10:00 1 2
3 2012-11-17 15:05:00 2 10
4 2012-11-17 15:10:00 2 10
5 2012-11-17 15:15:00 2 10
6 2012-11-17 15:10:00 3 5
7 2012-11-17 15:10:00 4 1
8 2012-11-17 15:15:00 4 1
That's not an easy one... I am also missing the structure of the whole problem, so I hope it is OK if I limit myself to outlining the basic approach; if things are unclear you can come back to me.
First (if I were you) I would install the 'lubridate' package, which makes playing around with dates/times a lot easier.
Then maybe try something like this:
z <- strptime("17/11/12 15:05:00.0", "%d/%m/%y %H:%M:%OS")
This will define your starting point in time. If that is supposed to be defined by the first time in your logs (x), the minute() setter is available, e.g.
z <- strptime("17/11/12 15:05:02.0", "%d/%m/%y %H:%M:%OS")
minute(z)<-5;second(z)<-0.0 #I guess, you get the concept
Then produce a sequence of 5 minute intervals
z5s<-z+minutes(seq(0,100,5))
This will produce a sequence of time points 5 minutes apart spanning 100 minutes (20 intervals); here again I do not know how flexible the whole thing is supposed to be.
Finally, you could play around with, for instance, modulo operations:
z2<-z+minutes(2)
z2 should be the end time; I just added 2 minutes "manually" here to illustrate the concept.
(as.integer(z2-z))%%5 > 5
FALSE
or if you want to see how many 5 minute spans are covered only do (as.integer(z2-z))%%5
or whatever other functions you prefer to match/distribute the log times across the z5s POSIXlt intervals.
Hope this helps a bit, i.e. gives you some direction.
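To make the last step concrete, here is a hedged sketch of one way to "distribute" the log times across those slots: findInterval() (or cut()) maps each time onto the 5-minute grid. It assumes z5s from above and lubridate loaded; the two example log times are taken from the question.
log_times <- strptime(c("17/11/12 15:07:13.0", "17/11/12 15:12:01.0"),
                      "%d/%m/%y %H:%M:%OS")
slot <- findInterval(as.numeric(as.POSIXct(log_times)),
                     as.numeric(as.POSIXct(z5s)))
z5s[slot]   # the start of the 5-minute bucket each log time falls into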