I have a dataframe to which I need to add a datetime column. It records water levels every hour for two years. The original data frame has the wrong dates and times, e.g. the year says 2015 instead of 2020, and the day and month are also wrong. I do not know the original start date and time; however, I know the date and time of the very last recording (28-03-2022 14:00:00). I need to calculate the column from the bottom to the top to recover the original start date.
Current Code
I have this code, which populates the dates from a known start date (i.e. top down), but I want to populate the data from the bottom up. Is there a way to alter this, or is there another solution?
# recalculate date to correct date
# set start dates
startDate5 <- as.POSIXct("2020-03-05 17:00:00")
startDateMere <- as.POSIXct("2020-07-06 17:00:00")
# find length of dataframe to populate required rows.
len5 <- max(dataList$`HMB 5`$Rec)
lenMere <- max(dataList$`HM SSSI 4`$Rec)
# calculate new date column
dataList$`HMB 5`$DateTimeNew <- seq(startDate5, by='hour', length.out=len5)
dataList$`HM SSSI 4`$DateTimeNew <-seq(startDateMere, by='hour', length.out=lenMere)
Current dataframe - top 10 rows
structure(list(Rec = 1:10, DateTime = structure(c(1436202000,
1436205600, 1436209200, 1436212800, 1436216400, 1436220000, 1436223600,
1436227200, 1436230800, 1436234400), class = c("POSIXct", "POSIXt"
), tzone = "GMT"), Temperature = c(16.59, 16.49, 16.74, 17.14,
17.47, 17.71, 18.43, 18.78, 19.06, 19.18), Pressure = c(1050.64,
1050.86, 1051.28, 1051.56, 1051.48, 1051.2, 1051.12, 1050.83,
1050.83, 1050.76), DateTimeNew = structure(c(1594051200L, 1594054800L,
1594058400L, 1594062000L, 1594065600L, 1594069200L, 1594072800L,
1594076400L, 1594080000L, 1594083600L), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = c(NA, 10L), class = "data.frame")
Desired Output
This is what the desired output looks like. The one date I know is correct is '2020-07-07 02:00:00' (the value in the 10th row, final column), and I need to fill in the rest of the column from this value.
NB: I do not actually know what the original start date (2020-07-06 17:00:00) should be; it's just illustrative.
Here's a sequence method: count backwards by the hour from the known last timestamp (the illustrative startDateMere here), then reverse:
startDateMere <- as.POSIXct("2020-07-06 17:00:00")
new_date = seq(startDateMere, length.out = nrow(data), by = "-1 hour")
data$result = rev(new_date)
data
# Rec DateTime Temperature Pressure DateTimeNew result
# 1 1 2015-07-06 17:00:00 16.59 1050.64 2020-07-06 12:00:00 2020-07-06 08:00:00
# 2 2 2015-07-06 18:00:00 16.49 1050.86 2020-07-06 13:00:00 2020-07-06 09:00:00
# 3 3 2015-07-06 19:00:00 16.74 1051.28 2020-07-06 14:00:00 2020-07-06 10:00:00
# 4 4 2015-07-06 20:00:00 17.14 1051.56 2020-07-06 15:00:00 2020-07-06 11:00:00
# 5 5 2015-07-06 21:00:00 17.47 1051.48 2020-07-06 16:00:00 2020-07-06 12:00:00
# 6 6 2015-07-06 22:00:00 17.71 1051.20 2020-07-06 17:00:00 2020-07-06 13:00:00
# 7 7 2015-07-06 23:00:00 18.43 1051.12 2020-07-06 18:00:00 2020-07-06 14:00:00
# 8 8 2015-07-07 00:00:00 18.78 1050.83 2020-07-06 19:00:00 2020-07-06 15:00:00
# 9 9 2015-07-07 01:00:00 19.06 1050.83 2020-07-06 20:00:00 2020-07-06 16:00:00
# 10 10 2015-07-07 02:00:00 19.18 1050.76 2020-07-06 21:00:00 2020-07-06 17:00:00
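Applied to the real series, the same idea recovers the unknown start: anchor the backwards sequence at the known last recording and reverse it. A sketch, assuming 28-03-2022 14:00:00 is the timestamp of the last row of `HMB 5` (swap in the right series):
endDate5 <- as.POSIXct("2022-03-28 14:00:00")
len5 <- max(dataList$`HMB 5`$Rec)
# hourly sequence counting back from the last recording, then reversed
dataList$`HMB 5`$DateTimeNew <- rev(seq(endDate5, by = "-1 hour", length.out = len5))
dataList$`HMB 5`$DateTimeNew[1] # the recovered start date/time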
Related
Say I have a POSIXct vector like
timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)
I would like to round these times up to the nearest hour of the day from a given vector:
hours_of_day = c(2, 14, 20)
i.e., the following result:
timestamps result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00
Is there a vectorized solution to this (or otherwise fast)? I have a few million timestamps and need to apply it for several hours_of_day.
One way to simplify this problem is to (1) find the next hours_of_day for each lubridate::hour(timestamps) and then (2) result = lubridate::floor_date(timestamps, "day") + next_hour_of_day * 3600. But how do I do step 1 vectorized?
Convert with as.POSIXlt, which lets you extract hours, minutes, and seconds to compute decimal hours. In an lapply/sapply combination, first test where these are less than or equal to the hours-of-day vector, and pick the first qualifying hour with which.max. Then create the new date-times with ISOdate and add one day wherever the new date-time is smaller than the original time.
timestamps <- as.POSIXlt(timestamps)
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600),
`<=`, hours_of_day), which.max)]
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
tz=attr(timestamps, "tzone")[[1]]))
r[r < timestamps] <- r[r < timestamps] + 86400
Result
r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"
data.frame(timestamps, r)
# timestamps r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00
Note: I've added "2021-01-24 23:59:00 CET" and "2021-01-27 20:00:00 CET" to timestamps to demonstrate the date change and an exact match.
Benchmark
Tested on a length 1.4e6 vector.
# Unit: seconds
# expr min lq mean median uq max neval cld
# POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321 3 a
# lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316 3 b
Data:
timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520,
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)
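For a fully vectorized route without the per-element lapply/sapply step, findInterval() can do the lookup in one shot. A sketch of my own (assuming a sorted hours_of_day and plain +86400 day arithmetic, so mind DST corner cases):
round_up_to_hours <- function(timestamps, hours_of_day) {
  hours_of_day <- sort(hours_of_day)
  lt <- as.POSIXlt(timestamps)
  dec <- lt$hour + lt$min / 60 + lt$sec / 3600
  # index of the first hour_of_day >= the decimal hour; 0 means "past the last one"
  idx <- findInterval(dec, hours_of_day, left.open = TRUE) + 1
  wrap <- idx > length(hours_of_day)
  idx[wrap] <- 1 # wrap around to the first hour of the next day
  res <- as.POSIXct(trunc(lt, "days")) + hours_of_day[idx] * 3600
  res[wrap] <- res[wrap] + 86400
  res
}
round_up_to_hours(timestamps, hours_of_day)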
I would extract the hour component, use cut to bin it, and assign the binned hours back to the original:
hours_of_day = c(2, 14, 20)
library(lubridate)
library(magrittr) ## just for the pipe
new_hours = timestamps %>%
hour %>%
cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>%
as.character() %>%
as.integer()
result = floor_date(timestamps, "hour")
hour(result) = new_hours
result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 14:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"
Building on the approach by @jay.sf, I made a flooring version as well, adding support for NA values.
floor_date_to = function(timestamps, hours_of_day) {
  # Handle NA with a temporary filler so the code below doesn't break
  na_timestamps = is.na(timestamps)
  timestamps[na_timestamps] = as.POSIXct("9999-12-31")
  # Proceed as usual
  timestamps = as.POSIXlt(timestamps)
  hours_of_day = rev(hours_of_day) # floor-specific: because which.max returns the first index by default
  nearest_hour = hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), `<`, hours_of_day), function(x) which.max(-x))] # floor-specific: negated which.max()
  rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
  rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400 # floor: use minus
  rounded[na_timestamps] = NA # Overwrite the filler with NA again
  return(rounded)
}
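For example, on the benchmark data above (a sketch; with hours_of_day = c(6, 14, 20), a midnight timestamp floors to the previous day's 20:00 slot):
floor_date_to(timestamps, hours_of_day)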
I am working with water quality data and I have a list of storm events I extracted from the streamflow time series.
head(Storms)
PeakNumber PeakTime PeakHeight PeakStartTime PeakEndTime DurationHours
1 1 2019-07-21 22:15:00 81.04667 2019-07-21 21:30:00 2019-07-22 04:45:00 7.25
2 2 2019-07-22 13:45:00 66.74048 2019-07-22 13:00:00 2019-07-22 23:45:00 10.75
3 3 2019-07-11 11:30:00 49.08663 2019-07-11 10:45:00 2019-07-11 19:00:00 8.25
4 4 2019-05-29 18:45:00 37.27926 2019-05-29 18:30:00 2019-05-29 20:45:00 2.25
5 5 2019-06-27 16:30:00 33.12268 2019-06-27 16:00:00 2019-06-27 17:15:00 1.25
6 6 2019-07-11 08:15:00 31.59931 2019-07-11 07:45:00 2019-07-11 09:00:00 1.25
I would like to use these PeakStartTime and PeakEndTime points to subset my other data. The other data is a 15-minute time series in xts or data.table format (I am constantly going back and forth between the two for various functions/plots).
> head(Nitrogen)
[,1]
2019-03-20 10:00:00 2.12306
2019-03-20 10:15:00 2.13538
2019-03-20 10:30:00 2.14180
2019-03-20 10:45:00 2.14704
2019-03-20 11:00:00 2.14464
2019-03-20 11:15:00 2.15548
So I would like to create a new dataframe for each storm that is just the Nitrogen data between those PeakStartTime and PeakEndTime points. And then hopefully loop this, so it will do so for each of the peaks in the Storms dataframe.
One option is to do the comparison on each corresponding PeakStartTime/PeakEndTime pair and subset the data with xts's "start/end" range indexing:
library(xts)
do.call(rbind, Map(function(x, y) Nitrogen[paste( x, y, sep="/")],
Storms$PeakStartTime, Storms$PeakEndTime))
# [,1]
#2019-05-29 18:30:00 -0.07102752
#2019-05-29 18:45:00 -0.19454811
#2019-05-29 19:00:00 -1.69684540
#2019-05-29 19:15:00 1.09384970
#2019-05-29 19:30:00 0.20019572
#2019-05-29 19:45:00 -0.76086259
# ...
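If you want a separate object per storm rather than one combined series, keep the Map() result as a named list instead of rbind-ing it (a sketch):
stormList <- Map(function(x, y) Nitrogen[paste(x, y, sep="/")],
                 Storms$PeakStartTime, Storms$PeakEndTime)
names(stormList) <- paste0("Storm", Storms$PeakNumber)
stormList[["Storm4"]] # Nitrogen data for peak 4 only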
data
set.seed(24)
Nitrogen <- xts(rnorm(20000), order.by = seq(as.POSIXct('2019-03-20 10:00:00'),
length.out = 20000, by = '15 min'))
Storms <- structure(list(PeakNumber = 1:6, PeakTime = structure(c(1563761700,
1563817500, 1562859000, 1559169900, 1561667400, 1562847300), class = c("POSIXct",
"POSIXt"), tzone = ""), PeakHeight = c(81.04667, 66.74048, 49.08663,
37.27926, 33.12268, 31.59931), PeakStartTime = structure(c(1563759000,
1563814800, 1562856300, 1559169000, 1561665600, 1562845500), class = c("POSIXct",
"POSIXt"), tzone = ""), PeakEndTime = structure(c(1563785100,
1563853500, 1562886000, 1559177100, 1561670100, 1562850000), class = c("POSIXct",
"POSIXt"), tzone = ""), DurationHours = c(7.25, 10.75, 8.25,
2.25, 1.25, 1.25)), row.names = c("1", "2", "3", "4", "5", "6"
), class = "data.frame")
I have a log of many years of meditation sittings, each with a start and end time. I want to create nice plots of my most active times of the day. (In other words, how often, relatively, am I meditating at 7am versus other times of day?)
ID StartTime EndTime
1 2679 2019-03-23 07:00:00 2019-03-23 07:30:00
2 2678 2019-03-22 07:00:00 2019-03-22 07:30:00
3 2677 2019-03-21 07:00:00 2019-03-21 07:30:00
4 2676 2019-03-20 07:00:00 2019-03-20 07:30:00
5 2675 2019-03-19 07:00:00 2019-03-19 07:30:00
6 2674 2019-03-18 09:00:00 2019-03-18 09:30:00
7 2673 2019-03-18 09:00:00 2019-03-18 09:30:00
8 2672 2019-03-18 09:00:00 2019-03-18 10:00:00
9 2671 2019-03-15 07:00:00 2019-03-15 08:00:00
10 2670 2019-03-14 07:00:00 2019-03-14 08:00:00
dput version:
structure(list(ID = 2679:2670, StartTime = structure(c(1553324400,
1553238000, 1553151600, 1553065200, 1552978800, 1552899600, 1552899600,
1552899600, 1552633200, 1552546800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), EndTime = structure(c(1553326200, 1553239800,
1553153400, 1553067000, 1552980600, 1552901400, 1552901400, 1552903200,
1552636800, 1552550400), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-10L), class = "data.frame")
I can hack this by turning each day into a 1440-element array of included/excluded minutes and summing these up.
But I feel like there must be a better way, probably using the Interval class from lubridate and/or dplyr, but I haven't worked out how to do this.
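A minimal sketch of the minute-binning hack described above (assuming the dput above is assigned to df):
library(lubridate)
# count, for each of the 1440 minutes of the day, how many sittings cover it
minute_counts <- integer(1440)
for (i in seq_len(nrow(df))) {
  start_min <- hour(df$StartTime[i]) * 60 + minute(df$StartTime[i])
  dur_min <- as.numeric(difftime(df$EndTime[i], df$StartTime[i], units = "mins"))
  covered <- (start_min + seq_len(dur_min) - 1) %% 1440 + 1 # wraps past midnight
  minute_counts[covered] <- minute_counts[covered] + 1
}
plot(minute_counts, type = "h", xlab = "minute of day", ylab = "number of sittings")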
I have an xts time series of temperature data in 5-minute resolution.
head(dataset)
Time Temp
2016-04-26 10:00:00 6.877
2016-04-26 10:05:00 6.877
2016-04-26 10:10:00 6.978
2016-04-26 10:15:00 6.978
2016-04-26 10:20:00 6.978
I want to find every period during which the temperature exceeds a certain threshold (say 20 °C), along with its duration, and in particular the longest one.
I create a data.frame from my xts-data:
df=data.frame(Time=index(dataset),coredata(dataset))
head(df)
Time Temp
1 2016-04-26 10:00:00 6.877
2 2016-04-26 10:05:00 6.877
3 2016-04-26 10:10:00 6.978
4 2016-04-26 10:15:00 6.978
5 2016-04-26 10:20:00 6.978
6 2016-04-26 10:25:00 7.079
Then I create a subset with only the data that exceed the threshold:
sub <- subset(df, Temp > 20)
head(sub)
Time Temp
7514 2016-05-22 12:05:00 20.043
7515 2016-05-22 12:10:00 20.234
7516 2016-05-22 12:15:00 20.329
7517 2016-05-22 12:20:00 20.424
7518 2016-05-22 12:25:00 20.615
7519 2016-05-22 12:30:00 20.805
But now I'm having trouble calculating the duration of each event during which the temperature exceeds the threshold. I don't know how to identify a connected period and calculate its duration.
I would be happy if you have a solution for this question (it's my first thread, so please excuse minor mistakes). If you need more information on my data, feel free to ask.
This may work. Take this data as an example:
df <- structure(list(Time = structure(c(1463911500, 1463911800, 1463912100,
1463912400, 1463912700, 1463913000), class = c("POSIXct", "POSIXt"
), tzone = ""), Temp = c(20.043, 20.234, 6.329, 20.424, 20.615,
20.805)), row.names = c(NA, -6L), class = "data.frame")
> df
Time Temp
1 2016-05-22 12:05:00 20.043
2 2016-05-22 12:10:00 20.234
3 2016-05-22 12:15:00 6.329
4 2016-05-22 12:20:00 20.424
5 2016-05-22 12:25:00 20.615
6 2016-05-22 12:30:00 20.805
library(dplyr)
library(data.table) # for rleid()
df %>%
# add id for different periods/events
mutate(tmp_Temp = Temp > 20, id = rleid(tmp_Temp)) %>%
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)))
id event_duration
<int> <time>
1 1 5 mins
2 3 10 mins
How do you set 0:00 as the end of day instead of 23:00 in hourly data? I have this struggle while using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following calls show periods ending at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp in your data, and 00:00:00 corresponds to a value in the next day's "bin".
What you can do is shift all the timestamps back one hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you're, say, generating hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 when creating a bar; 00:00:00 to 00:59:59.9999 would then form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 stamped 2018-02-02 00:00:00 did not form the last observation of the 2018-02-01 day bin (labelled 2018-02-02 in x3.d after alignment), which runs from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999; the value of 1 at 23:59:59.999 did.
Of course, if you want the daily timestamp to be the start of the day rather than the end (i.e. 2018-02-01 as the stamp for the first row of x3.d above), you can shift the index back by one day. You can do this relatively safely in most time zones, as long as your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely, because there are corner cases when a time zone shifts, e.g. be careful with daylight saving time. Simply subtracting 86400 can be a problem when going from Sunday to Saturday in time zones where daylight saving occurs:
# e.g. bad: daylight saving starts on this weekend for US Eastern
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:00"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]