How can I force R to change the boundaries of a day? What I mean is that, for example, yrDay as computed below shouldn't run from midnight to midnight but from 6pm to 6pm.
df <- data.frame(Date = seq(
  from = as.POSIXct("2012-1-1 13:00:00", tz = "UTC"),
  to   = as.POSIXct("2012-1-3 13:00:00", tz = "UTC"),
  by   = "hour"
))
df$yrDay <- as.numeric(strftime(df$Date,format="%j"))
Just add 6 hours (6 * 60 min * 60 sec) to your code:
df$yrDay <- as.numeric(strftime(df$Date + 6*60*60, format="%j"))
Date yrDay
1 2012-01-01 13:00:00 1
2 2012-01-01 14:00:00 1
3 2012-01-01 15:00:00 1
4 2012-01-01 16:00:00 1
5 2012-01-01 17:00:00 1
6 2012-01-01 18:00:00 2
7 2012-01-01 19:00:00 2
8 2012-01-01 20:00:00 2
9 2012-01-01 21:00:00 2
10 2012-01-01 22:00:00 2
...
Or 6 hours using lubridate (better approach IMHO):
df$yrDay <- lubridate::day(df$Date + 6*60*60)
But as mentioned by @ngm in the comments, this is a quick-and-dirty solution that might not be robust in all cases.
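For readers coming from pandas, the same shift trick looks like this — a minimal sketch with made-up data, not part of the original answer:

```python
import pandas as pd

# hourly timestamps spanning the 6 pm boundary
idx = pd.date_range("2012-01-01 13:00", "2012-01-01 22:00", freq="h")

# shift by 6 hours so the day-of-year flips at 18:00 instead of midnight
yr_day = (idx + pd.Timedelta(hours=6)).dayofyear

print(list(yr_day))  # 13:00-17:00 stay day 1, 18:00-22:00 become day 2
```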
Here is a more robust solution in Python for the question above.
The function below uses NumPy and pandas to build a pair of boolean conditions for each calendar date in your datetime index: one for timestamps before the specified shift time on that date, and one for timestamps at or after it. np.select then maps those conditions onto a running day counter, so the counter increments at the shift time instead of at midnight.
Hope this helps or encourages others to come up with a better solution.
import numpy as np
import pandas as pd

def count_custome_day_start_and_end(pd_index_datetime, shift_time='8:30am'):
    dates = sorted(set(pd_index_datetime.date))
    times = np.array([t.time() for t in pd_index_datetime])
    shift_time = pd.to_datetime(shift_time).time()
    # two conditions per calendar date: before and at/after the shift time
    conditions = [
        cond for date in dates for cond in [
            (times < shift_time) & (pd_index_datetime.date == date),   # before shift
            (times >= shift_time) & (pd_index_datetime.date == date),  # at/after shift
        ]
    ]
    n_days = (len(conditions) // 2) + 1
    # day counter pattern: 0 before the first shift, then 1, 1, 2, 2, ...
    choices = [0, *np.repeat(range(1, n_days), 2)]
    choices.pop()  # drop the last repeat so len(choices) == len(conditions)
    return np.select(conditions, choices, pd_index_datetime.factorize()[0])
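An equivalent, shorter way to sanity-check that counter — a sketch using np.searchsorted, not part of the original answer: each timestamp's day number is simply how many custom-day boundaries it has passed.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2012-01-01 13:00", periods=10, freq="h")
# custom day starts at 6 pm; build the boundary timestamps in range
boundaries = pd.date_range("2012-01-01 18:00", idx[-1], freq="D")

# each timestamp's day counter = number of boundaries at or before it
day_counter = np.searchsorted(boundaries, idx, side="right")
print(list(day_counter))  # five 0s before 6 pm, five 1s at/after
```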
I have a dataframe in the following format:
temp:
id time date
1 06:22:30 2018-01-01
2 08:58:00 2018-01-15
3 09:30:21 2018-01-30
The actual data set continues for 9,000 rows, with observations at times throughout the month of January. I want to write code that assigns each row a new value depending on which hour range the time variable falls in.
A couple of example hour ranges would be:
Morning peak: 06:00:00 - 08:59:00
Morning: 09:00:00 - 11:59:00
The desired output would look like this:
id time date time_of_day
1 06:22:30 2018-01-01 MorningPeak
2 08:58:00 2018-01-15 MorningPeak
3 09:30:21 2018-01-30 Morning
I have tried playing around with time objects using the chron package, with the following code to subset the different time ranges:
MorningPeak <- temp[temp$Time >= "06:00:00" & temp$Time <= "08:59:59",]
MorningPeak$time_of_day <- "MorningPeak"
Morning <- temp[temp$Time >= "09:00:00" & temp$Time <= "11:59:59",]
Morning$time_of_day <- "Morning"
The results could then be merged and then manipulated to get everything in the same column. Is there a way to do this such that the desired result is generated and no extra data manipulation is required? I am interested in learning how to make my code more efficient.
You are comparing character strings, not time/datetime objects; you need to convert them before comparing. Here it seems you can just compare the hour of the day to assign the appropriate labels.
library(dplyr)
df %>%
mutate(hour = as.integer(format(as.POSIXct(time, format = "%T"), "%H")),
time_of_day = case_when(hour >= 6 & hour < 9 ~ "MorningPeak",
hour >= 9 & hour < 12 ~ "Morning",
TRUE ~ "Rest of the day"))
# id time date hour time_of_day
#1 1 06:22:30 2018-01-01 6 MorningPeak
#2 2 08:58:00 2018-01-15 8 MorningPeak
#3 3 09:30:21 2018-01-30 9 Morning
You can add more hourly criteria if needed.
We can also use cut
cut(as.integer(format(as.POSIXct(df$time, format = "%T"), "%H")),
breaks = c(-Inf, 6, 9, 12, Inf), right = FALSE,
labels = c("Rest of the day", "MorningPeak", "Morning", "Rest of the day"))
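For comparison, the same hour binning can be sketched in pandas with pd.cut; the duplicated "Rest of the day" label requires ordered=False (an illustration with the example data, not part of the original answers):

```python
import pandas as pd

times = pd.Series(["06:22:30", "08:58:00", "09:30:21"])
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour

# left-closed hour bins: [0,6) rest, [6,9) peak, [9,12) morning, [12,24) rest
time_of_day = pd.cut(hours, bins=[0, 6, 9, 12, 24], right=False,
                     labels=["Rest of the day", "MorningPeak",
                             "Morning", "Rest of the day"],
                     ordered=False)
print(list(time_of_day))
```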
This question already has answers here: Wrong units displayed in data.table with POSIXct arithmetic (3 answers). Closed 3 years ago.
The objective is to calculate the time between events grouped by some id. Here is an example:
library(data.table)
library(lubridate)
dt <- data.table(id = c(1,1:3),
start = c("2015-01-01 12:00:00", "2015-12-01 12:00:00", "2019-01-01 12:00:00", NA),
end = c("2016-01-01 12:00:01", "2016-01-01 12:00:01", "2019-01-01 12:00:01", "2019-01-01 12:00:02"))
dt[, start := ymd_hms(start)]
dt[, end := ymd_hms(end)]
dt[, time_diff_1 := min(end) - max(start), by = .(id)]
dt[, time_diff_2 := end - start]
which results in:
id start end time_diff_1 time_diff_2
1: 1 2015-01-01 12:00:00 2016-01-01 12:00:01 31.00001 secs 31536001 secs
2: 1 2015-12-01 12:00:00 2016-01-01 12:00:01 31.00001 secs 2678401 secs
3: 2 2019-01-01 12:00:00 2019-01-01 12:00:01 1.00000 secs 1 secs
4: 3 <NA> 2019-01-01 12:00:02 NA secs NA secs
Both columns time_diff_1 and time_diff_2 display the time difference in seconds. However, time_diff_1, which resulted from the grouped calculation, mixed up the units: the result for id == 1 is actually 31 days and one second. It seems the units were chosen automatically per group and then overwritten.
Any hints on how to fix this?
When using the difftime() function the units can be set explicitly, e.g.
dt[, time_diff_3 := difftime(min(end), max(start), units = "secs"), by = .(id)]
resulting in
id start end time_diff_1 time_diff_2 time_diff_3
1: 1 2015-01-01 12:00:00 2016-01-01 12:00:01 31.00001 secs 31536001 secs 2678401 secs
2: 1 2015-12-01 12:00:00 2016-01-01 12:00:01 31.00001 secs 2678401 secs 2678401 secs
3: 2 2019-01-01 12:00:00 2019-01-01 12:00:01 1.00000 secs 1 secs 1 secs
4: 3 <NA> 2019-01-01 12:00:02 NA secs NA secs NA secs
with the expected result found in column time_diff_3.
However, there might still be room for improvement in how data.table silently overwrites the units after the grouped calculation. The results caused some head scratching before I figured out that the units had gotten mixed up.
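For what it's worth, the same grouped computation in pandas doesn't have this pitfall, because timedeltas are stored in a single unit (nanoseconds) regardless of grouping — a sketch with the NA row omitted, not part of the original answer:

```python
import pandas as pd

dt = pd.DataFrame({
    "id": [1, 1, 2],
    "start": pd.to_datetime(["2015-01-01 12:00:00",
                             "2015-12-01 12:00:00",
                             "2019-01-01 12:00:00"]),
    "end": pd.to_datetime(["2016-01-01 12:00:01",
                           "2016-01-01 12:00:01",
                           "2019-01-01 12:00:01"]),
})

# per-id min(end) - max(start), broadcast back to each row
g = dt.groupby("id")
diff = g["end"].transform("min") - g["start"].transform("max")
print(diff.dt.total_seconds().tolist())  # 31 days + 1 s for id 1, 1 s for id 2
```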
I have two datasets at 10-minute resolution spanning 34 years. In one of them, observations are made only every 3 hours, and I would like to keep only the lines with those observations. It starts at midnight (included) and goes 3am, 6am, 9am, etc.
Looks like this:
stn CODES time1 pcp_type
1 SIO - 1981-01-01 02:00:00 <NA>
2 SIO - 1981-01-01 02:10:00 <NA>
3 SIO - 1981-01-01 02:20:00 <NA>
4 SIO - 1981-01-01 02:30:00 <NA>
5 SIO - 1981-01-01 02:40:00 <NA>
6 SIO - 1981-01-01 02:50:00 <NA>
Now the idea would be to keep only the lines that correspond to every 3 hours and delete the rest.
I saw some solutions about filtering by value (e.g. "is bigger than"), but I didn't find one that filters by hour (%H == 3, etc.).
Thank you in advance.
I've already converted my time column as follows:
SYNOP_SION$time1 <- as.POSIXct(strptime(as.character(SYNOP_SION$time), format = "%Y%m%d%H%M"), tz="UTC")
Here is an example with a vector:
# Creating sample time data
time1 <- seq(from = Sys.time(), length.out = 96, by = "hours")
# To get a T/F vector you can use to filter
as.integer(format(time1, "%H")) %in% seq.int(0, 21, 3)
# To see the filtered POSIXct vector:
time1[as.integer(format(time1, "%H")) %in% seq.int(0, 21, 3)]
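The same hour-based filter in pandas, for comparison (a sketch with made-up data, not part of the original answer):

```python
import pandas as pd

# 10-minute data starting at midnight, like the SYNOP example
idx = pd.date_range("1981-01-01 00:00", periods=36, freq="10min")
df = pd.DataFrame({"pcp_type": range(36)}, index=idx)

# keep only rows falling exactly on a 3-hourly synoptic hour (00, 03, 06, ...)
mask = (df.index.minute == 0) & (df.index.hour % 3 == 0)
print(df[mask].index.strftime("%H:%M").tolist())
```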
How do you set 0:00 as the end of the day instead of 23:00 in hourly data? I struggle with this when using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following functions show periods ends at 23:00
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on data with timestamps between 00:00:00 and 23:59:59.9999999 of the day in question, so 23:00:00 is seen as the last timestamp of your day, while 00:00:00 belongs to the next day's bin.
What you can do is shift all the timestamps back one hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data: if you're generating, say, hourly bars from ticks, it makes sense to collect all the ticks between 23:00:00 and 23:59:59.99999 into one bar, with 00:00:00 to 00:59:59.9999 forming the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value 2 stamped at 2018-02-02 00:00:00 did not become the last observation of the 2018-02-01 bin, which covers 2018-02-01 00:00:00 through 2018-02-01 23:59:59.9999; it fell into the next day's bin instead.
Of course, if you want the daily timestamp to be the start of the day rather than the end (i.e. 2018-02-01 as the stamp for the first row of x3.d above), you can shift the index back by one day. You can do this relatively safely for most time zones, when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely, because there are corner cases when a time zone shifts its clocks, e.g. daylight saving time. Simply subtracting 86400 seconds can be a problem when going from a Sunday back to a Saturday in time zones where daylight saving kicks in:
# e.g. bad: daylight saving starts on this weekend for US Eastern time
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:00"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]
So I have an xts time series over the year with time zone "UTC". The time interval between rows is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want hourly data to allow comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat, on="hours", k=1), colSums)
The problem is that the first row in my new data set is stamped 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoints vector, but it keeps saying it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not: I don't want to change the names of my columns, I want to sum over a different interval.
Here a reproducible example:
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use this startpoints function, but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq<-seq(from=ISOdate(2014,12,31,23,15),length.out = 100, by="15 min", tz="UTC")
xts<-xts(rep(1,100),order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
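As a cross-check, pandas resample expresses the same work-around directly, since the bin closure and label side are explicit parameters — a sketch of the reproducible example above, not part of the original answer:

```python
import pandas as pd

# 100 points of 1s at 15-minute intervals, as in the xts example
idx = pd.date_range("2014-12-31 23:15", periods=100, freq="15min", tz="UTC")
x = pd.Series(1, index=idx)

# right-closed, right-labeled hourly bins: (23:00, 00:00] is stamped 00:00
hourly = x.resample("1h", closed="right", label="right").sum()
print(hourly.tolist())  # every bin contains exactly four observations
```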