I am trying to separate events from hourly streamflow data. I have run the following code
dailyMQ <- data.frame(Date=seq(from=as.Date("01.01.2000", format="%d.%m.%Y"),
to=as.Date("01.01.2004", format="%d.%m.%Y"), by="days"),
discharge=rbeta(1462,2,20)*100)
for daily data. When I try the same for hourly data, I get errors.
Could anyone suggest how to write the code for hourly data?
Thanks
The Date class stores whole days only, so a Date sequence cannot be stepped by hours.
You could use the POSIXct date-time class instead:
HourlyMQ <- data.frame(
  Date = seq(from = as.POSIXct("01.01.2019", format = "%d.%m.%Y"),
             to = as.POSIXct("11.12.2019", format = "%d.%m.%Y"),
             by = "hours"),
  discharge = rbeta(8257, 2, 20))
HourlyMQ
#> Date discharge
#> 1 2019-01-01 00:00:00 0.2452214482
#> 2 2019-01-01 01:00:00 0.0620291334
#> 3 2019-01-01 02:00:00 0.0608788870
#> 4 2019-01-01 03:00:00 0.0697449808
#> 5 2019-01-01 04:00:00 0.0302780135
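The hard-coded 8257 above has to match the length of the hourly sequence exactly. A minimal sketch of a variant that derives the row count from the sequence itself; the fixed `tz = "UTC"`, the `set.seed()` call and the `stamps` helper variable are assumptions added for illustration, not part of the original answer:

```r
# Build the hourly index first, then size the random data from it.
# tz = "UTC" is an assumption: it avoids daylight-saving gaps in the sequence.
start  <- as.POSIXct("2019-01-01 00:00:00", tz = "UTC")
end    <- as.POSIXct("2019-12-11 00:00:00", tz = "UTC")
stamps <- seq(from = start, to = end, by = "hour")

set.seed(1)  # only to make the dummy data reproducible
HourlyMQ <- data.frame(
  Date      = stamps,
  discharge = rbeta(length(stamps), 2, 20) * 100
)

nrow(HourlyMQ)  # 344 days * 24 hours + 1 = 8257
```

Changing the start or end date then needs no manual recount.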
Related
I have a time series, that spans almost 20 years with a resolution of 15 min.
I want to extract only hourly values (00:00:00, 01:00:00, and so on...) and plot the resulting time series.
The df looks like this:
3 columns: date, time, and discharge
How would you approach this?
A reproducible example would help with this kind of question. Here is my code; I hope it helps:
library(dplyr)
library(ggplot2)
# create dummy data: one day at 15-minute resolution (96 rows)
df <- data.frame(time = seq(as.POSIXct("2018-01-01 00:00:00"), as.POSIXct("2018-01-01 23:59:59"), by = "15 min"), variable = runif(96, 0, 1))
Example output (first 5 rows only):
time variable
1 2018-01-01 00:00:00 0.331546992
2 2018-01-01 00:15:00 0.407269290
3 2018-01-01 00:30:00 0.635367577
4 2018-01-01 00:45:00 0.808612045
5 2018-01-01 01:00:00 0.258801201
df %>% filter(format(time, "%M:%S") == "00:00")
output:
1 2018-01-01 00:00:00 0.76198532
2 2018-01-01 01:00:00 0.01304103
3 2018-01-01 02:00:00 0.10729465
4 2018-01-01 03:00:00 0.74534184
5 2018-01-01 04:00:00 0.25942667
# a ggplot object is drawn when printed, so no plot() wrapper is needed
df %>%
  filter(format(time, "%M:%S") == "00:00") %>%
  ggplot(aes(x = time, y = variable)) +
  geom_line()
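The question describes separate date and time columns, whereas the dummy data above has a single POSIXct column. A minimal sketch of combining the two before filtering, assuming both columns are character strings in `YYYY-MM-DD` and `HH:MM:SS` form (the question does not say); the `raw` data frame and its values are made up for illustration:

```r
library(dplyr)

# Hypothetical input laid out as in the question: date, time, discharge.
raw <- data.frame(
  date      = rep("2018-01-01", 8),
  time      = c("00:00:00", "00:15:00", "00:30:00", "00:45:00",
                "01:00:00", "01:15:00", "01:30:00", "01:45:00"),
  discharge = c(1.2, 1.3, 1.1, 1.0, 0.9, 1.4, 1.6, 1.5),
  stringsAsFactors = FALSE
)

hourly <- raw %>%
  # paste the two columns together and parse them as one timestamp
  mutate(datetime = as.POSIXct(paste(date, time),
                               format = "%Y-%m-%d %H:%M:%S", tz = "UTC")) %>%
  # keep only rows that fall exactly on the hour
  filter(format(datetime, "%M:%S") == "00:00")

hourly$discharge  # 1.2 0.9
```

If the time column is already on the hour grid, filtering on the time string directly (for example with `endsWith(time, ":00:00")`) would avoid the parse, but the combined timestamp is what a time axis in a plot needs anyway.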
I have the data frame below (df) from ENTSO-E showing German power prices. I created the "Hour" column with the lubridate function hour(df$date). The output covers the range (1, 2, ..., 23, 0).
# to replace 0 with 24
df["Hour"][df["Hour"]=="0"]<- "24"
I need to work on an hourly basis, so I filtered each hour from 1 to 24, but I cannot filter the replaced hour, H24.
H1 <- df %>%
filter(Hour==1)
H24 <- df %>%
filter(Hour==24)
Error in match.fun(FUN) : object 'Hour' not found
The 24 values are still in the Hour column and its class is numeric, but I cannot do any calculation with the Hour column.
class(df$Hour)
[1] "numeric"
mean(german_last_4$Hour)
[1] NA
I think the problem is with the replacement step. Is there another way to produce a result that works with H24?
date                 price   Hour
2019-01-01 01:00:00  28.32   1
2019-01-01 02:00:00  10.07   2
2019-01-01 03:00:00  -4.08   3
2019-01-01 04:00:00  -9.91   4
2019-01-01 05:00:00  -7.41   5
2019-01-01 06:00:00  -12.55  6
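One possible cause, offered as a guess since the thread contains no answer: assigning the string "24" into the column coerces the whole column to character. A minimal sketch that keeps Hour numeric by replacing 0 with 24 inside mutate(); the three-row data frame and its prices are invented for illustration:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the ENTSO-E data, including one midnight row.
df <- data.frame(
  date  = as.POSIXct(c("2019-01-01 23:00:00",
                       "2019-01-02 00:00:00",
                       "2019-01-02 01:00:00"), tz = "UTC"),
  price = c(30.1, 25.4, 22.8)
)

df <- df %>%
  mutate(Hour = as.integer(hour(date)),            # 23, 0, 1
         Hour = if_else(Hour == 0L, 24L, Hour))    # numeric replacement, no strings

H24 <- df %>% filter(Hour == 24)

class(df$Hour)  # "integer"
mean(df$Hour)   # 16
```

Because both branches of if_else() are integers, the column never becomes character, so filter() and mean() behave as expected.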
I'm working on a SHM system where I have data every 15 minutes that came from sensors on a structure. I have a set of observations where there is no damage and another where some kind of damage was simulated. My objective is to take the undamaged data and use it to forecast. This forecasted data is then compared to the undamaged one and this difference will then be used to create control charts.
However my undamaged data is of around 5 months and the damaged state is of 8 months. I tried to explore the forecast package using multiple seasonality (msts) of 96 (1 day) and 35060 (1 year) since I believe it has a connection to temperature.
The models I fitted that followed a pattern resembling reality had a small amplitude, while the real data are much more volatile.
Can someone point me in the right direction as to what to do next and how to do it?
PS: When using the ts function, even though I try to make the series start at 2018-04-27 14:15:00, the plotted ts object always starts at 1-1-2018. This is probably only cosmetic, but I would appreciate knowing how to set it correctly.
ts and msts objects are not well suited to high frequency data. I suggest you try using tsibble objects via the tsibble package (http://tsibble.tidyverts.org). With tsibble, the time index is explicit. Here is an example using 30 minute data.
library(tsibble)
library(feasts)
library(ggplot2)
tsibbledata::vic_elec
#> # A tsibble: 52,608 x 5 [30m] <UTC>
#> Time Demand Temperature Date Holiday
#> <dttm> <dbl> <dbl> <date> <lgl>
#> 1 2012-01-01 00:00:00 4263. 21.0 2012-01-01 TRUE
#> 2 2012-01-01 00:30:00 4049. 20.7 2012-01-01 TRUE
#> 3 2012-01-01 01:00:00 3878. 20.6 2012-01-01 TRUE
#> 4 2012-01-01 01:30:00 4036. 20.4 2012-01-01 TRUE
#> 5 2012-01-01 02:00:00 3866. 20.2 2012-01-01 TRUE
#> 6 2012-01-01 02:30:00 3694. 20.1 2012-01-01 TRUE
#> 7 2012-01-01 03:00:00 3562. 19.6 2012-01-01 TRUE
#> 8 2012-01-01 03:30:00 3433. 19.1 2012-01-01 TRUE
#> 9 2012-01-01 04:00:00 3359. 19.0 2012-01-01 TRUE
#> 10 2012-01-01 04:30:00 3331. 18.8 2012-01-01 TRUE
#> # … with 52,598 more rows
tsibbledata::vic_elec %>% autoplot(Demand)
Created on 2019-11-27 by the reprex package (v0.3.0)
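For the 15-minute sensor data in the question, a minimal sketch of building a tsibble whose index starts at the real first timestamp, which also addresses the PS about the series appearing to start on 1-1-2018. The `sensor` data frame, the `strain` column and its values are hypothetical; only `as_tsibble()` with an `index` argument is taken from the tsibble package:

```r
library(tsibble)

# Hypothetical 15-minute readings starting at the timestamp from the question.
sensor <- data.frame(
  time   = seq(as.POSIXct("2018-04-27 14:15:00", tz = "UTC"),
               by = "15 min", length.out = 8),
  strain = c(10.2, 10.4, 10.1, 9.9, 10.0, 10.3, 10.6, 10.5)
)

# The time index is an explicit column, so the start time is kept as is.
sensor_ts <- as_tsibble(sensor, index = time)

min(sensor_ts$time)
```

Since the index carries real timestamps rather than a start value plus a frequency, plots made from the tsibble use the actual dates on the time axis.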
I have a data frame containing some daily data timestamped at midnight on each day and some hourly data timestamped at the beginning of each hour throughout the day. I want to expand the data so it's all hourly, and I'd like to do so within a tidyverse "pipe chain".
My thought was to create a data frame containing the full hourly time series and then dplyr::right_join() my data against this time series. I thought this would populate the proper values where there was a match for the daily data (at midnight) and populate NA wherever there was no match (any hour except midnight). This seems to work only when the time series in my data is daily only, rather than a mix of daily and hourly values, which was unexpected. Why does the right join not expand the daily time series when it coexists in a data frame along with another hourly time series?
I've generated a minimal example below. My representative data set that I want to expand is named allData and contains a mix of daily and hourly datasets from two different time series variables, Daily TS and Hourly TS.
dailyData <- data.frame(
DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated=3),
lubridate::ymd_hms('2019-01-07', truncated=3),
by='day'),
Name = 'Daily TS'
)
allHours <- data.frame(
DateTime = seq.POSIXt(lubridate::ymd_hms('2019-01-01', truncated=3),
lubridate::ymd_hms('2019-01-07 23:00:00'),
by='hour')
)
hourlyData <- allHours %>%
dplyr::mutate( Name = 'Hourly TS' )
allData <- rbind( dailyData, hourlyData )
This gives
head( allData, n=15 )
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Now, I thought that dplyr::right_join() of the full hourly sequence of POSIXct values against allData$DateTime would have expanded the daily time series, leaving NA values for any hours not explicitly present in the data. I could then use tidyr::fill() to fill these in over the day. However, the following code does not behave this way:
expanded_BAD <- allData %>%
dplyr::right_join( allHours, by='DateTime' ) %>%
tidyr::fill( dplyr::everything(), .direction='down' ) %>%
dplyr::arrange( Name, DateTime )
expanded_BAD shows that the daily data hasn't been expanded by the right_join(). That is, the hours in allHours missing from allData were not retained in the result, which I thought was the whole purpose of using a right join. Here's the head of the result:
head(expanded_BAD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-02 00:00:00 Daily TS
3 2019-01-03 00:00:00 Daily TS
4 2019-01-04 00:00:00 Daily TS
5 2019-01-05 00:00:00 Daily TS
6 2019-01-06 00:00:00 Daily TS
7 2019-01-07 00:00:00 Daily TS
8 2019-01-01 00:00:00 Hourly TS
9 2019-01-01 01:00:00 Hourly TS
10 2019-01-01 02:00:00 Hourly TS
11 2019-01-01 03:00:00 Hourly TS
12 2019-01-01 04:00:00 Hourly TS
13 2019-01-01 05:00:00 Hourly TS
14 2019-01-01 06:00:00 Hourly TS
15 2019-01-01 07:00:00 Hourly TS
Interestingly, if we perform the exact same right join on only the daily data, we get the desired result:
dailyData_expanded_GOOD <- dailyData %>%
dplyr::right_join( allHours, by='DateTime' ) %>%
tidyr::fill( dplyr::everything(), .direction='down' )
Here's the head:
head(dailyData_expanded_GOOD, n=15)
DateTime Name
1 2019-01-01 00:00:00 Daily TS
2 2019-01-01 01:00:00 Daily TS
3 2019-01-01 02:00:00 Daily TS
4 2019-01-01 03:00:00 Daily TS
5 2019-01-01 04:00:00 Daily TS
6 2019-01-01 05:00:00 Daily TS
7 2019-01-01 06:00:00 Daily TS
8 2019-01-01 07:00:00 Daily TS
9 2019-01-01 08:00:00 Daily TS
10 2019-01-01 09:00:00 Daily TS
11 2019-01-01 10:00:00 Daily TS
12 2019-01-01 11:00:00 Daily TS
13 2019-01-01 12:00:00 Daily TS
14 2019-01-01 13:00:00 Daily TS
15 2019-01-01 14:00:00 Daily TS
Why does the right join do different things on the full data compared to only the daily data?
I think the problem is that you are trying to bind the dataframes together too soon. I believe this gives you what you want:
result <- bind_rows(dailyData_expanded_GOOD, hourlyData)
head(result)
#> DateTime Name
#> 1 2019-01-01 00:00:00 Daily TS
#> 2 2019-01-01 01:00:00 Daily TS
#> 3 2019-01-01 02:00:00 Daily TS
#> 4 2019-01-01 03:00:00 Daily TS
#> 5 2019-01-01 04:00:00 Daily TS
#> 6 2019-01-01 05:00:00 Daily TS
The reason right_join() doesn't work is that allHours perfectly matches the rows in allData for the hourly time series. From ?right_join:
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
You're hoping that the hours in y with no daily match in x will come back as NA rows, but every row in y already matches a row in x from the hourly series. The midnight rows actually have two matches, one daily and one hourly, and right_join() simply returns both without expanding the daily time series rows.
This is different from the situation in this question, where the datetimes to be expanded do not occur in the left hand data frame. Then the strategy of merging would expand your result as expected.
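The matching behaviour described above can be checked by counting rows and NA values. A minimal sketch that rebuilds the question's data with base as.POSIXct() and an explicit `tz = "UTC"` (both assumptions, chosen to keep the example self-contained); the `joined_all` and `joined_daily` names are introduced here for illustration:

```r
library(dplyr)

allHours <- data.frame(
  DateTime = seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2019-01-07 23:00:00", tz = "UTC"), by = "hour")
)
dailyData <- data.frame(
  DateTime = seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2019-01-07 00:00:00", tz = "UTC"), by = "day"),
  Name = "Daily TS", stringsAsFactors = FALSE
)
hourlyData <- mutate(allHours, Name = "Hourly TS")
allData    <- bind_rows(dailyData, hourlyData)

joined_all   <- right_join(allData,   allHours, by = "DateTime")
joined_daily <- right_join(dailyData, allHours, by = "DateTime")

nrow(joined_all)               # 175 = 168 hourly rows + 7 daily rows
sum(is.na(joined_all$Name))    # 0: every hour already had an hourly match
sum(is.na(joined_daily$Name))  # 161: the 168 hours minus the 7 midnights
```

The zero NA count in the combined join is the reason tidyr::fill() has nothing to fill there.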
So that explains why a bare right_join() doesn't work, but doesn't solve
the problem because you have to manually split up the data, and that would
get old fast if there are varying numbers of time series. There are a couple solutions in comments, and then one additional that I will add below.
tidyr::expand()
expandedData <- allData %>%
tidyr::expand( DateTime, Name ) %>%
dplyr::arrange( Name, DateTime )
This works, but only with both time series present. If there is only
dailyData, then the result is not expanded.
The kitchen sink
expandedData1 <- allData %>%
dplyr::right_join(allHours, by = 'DateTime') %>%
tidyr::fill(dplyr::everything()) %>%
tidyr::expand( DateTime, Name) %>%
dplyr::arrange( Name, DateTime )
As pointed out in the comments, this works for all cases - both types,
only daily data, only hourly data. This solution and the next generate
warnings unless you use stringsAsFactors = FALSE in the data.frame()
calls above.
The only issue with this solution is that fill() and right_join() are
only there to deal with edge cases. I don't know if that is a real problem
or not.
"Split" in the pipe
The simple solution splits the dataset, and this can be done inside the
pipe in a couple ways.
# nest() and unnest() written with tidyr >= 1.0 syntax
expandedData2 <- allData %>%
  tidyr::nest(data = -Name) %>%
  dplyr::mutate(data = purrr::map(data, ~ dplyr::right_join(.x, allHours, by = 'DateTime'))) %>%
  tidyr::unnest(data)
The other way would use base::split() and then purrr::map_dfr()
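A minimal sketch of that second route, under the assumption that the series name can be dropped before the join and restored from the list names via the `.id` argument of map_dfr(). The data are rebuilt here with base as.POSIXct() and `tz = "UTC"` so the block is self-contained, and `expandedData3` is a name introduced for illustration:

```r
library(dplyr)
library(purrr)

allHours <- data.frame(
  DateTime = seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2019-01-07 23:00:00", tz = "UTC"), by = "hour")
)
dailyData <- data.frame(
  DateTime = seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2019-01-07 00:00:00", tz = "UTC"), by = "day"),
  Name = "Daily TS", stringsAsFactors = FALSE
)
allData <- bind_rows(dailyData, mutate(allHours, Name = "Hourly TS"))

expandedData3 <- allData %>%
  split(.$Name) %>%                        # one data frame per series
  map_dfr(~ right_join(select(.x, -Name),  # expand each series separately
                       allHours, by = "DateTime"),
          .id = "Name")                    # restore the series name

table(expandedData3$Name)  # 168 rows for each series
```

Any value columns in the real data would come back as NA for the added hours and could then be filled per series, for example with tidyr::fill() after dplyr::group_by(Name).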
Created on 2019-03-24 by the reprex package (v0.2.0).
How do you set 0:00 as the end of the day instead of 23:00 in hourly data? I struggle with this when using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following calls show that the periods end at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp of the day in your data, and 00:00:00 corresponds to a value in the next day's "bin".
What you can do is shift all the timestamps back by 1 hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you are generating, say, hourly bars from ticks, it makes sense to use all the ticks between 23:00:00 and 23:59:59.99999 to create one bar. The ticks from 00:00:00 to 00:59:59.9999 would then form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
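With random values it is hard to see which hourly observation ends up closing each day. A minimal sketch of the same shift, aggregate and align steps using the running integers 1 to 120 as values, so each daily value identifies the hourly row it came from. The `tz = "UTC"` setting and the integer data are assumptions made for illustration:

```r
library(xts)

# Hourly index over the same five days; values are simply 1, 2, ..., 120.
idx <- seq(as.POSIXct("2018-02-01 00:00:00", tz = "UTC"),
           as.POSIXct("2018-02-05 23:00:00", tz = "UTC"), by = "hour")
x1 <- xts(seq_along(idx), order.by = idx)

# 1. Shift every timestamp back by one hour.
x2 <- x1
.index(x2) <- .index(x1) - 3600

# 2. Take the last observation of each (shifted) day.
x.d2 <- to.period(x2, period = "days", OHLC = FALSE)

# 3. Round the timestamps up to the next midnight.
x.d2 <- align.time(x.d2, n = 86400)

as.numeric(coredata(x.d2))  # 1 25 49 73 97 120
```

Value 25 is the observation originally stamped 2018-02-02 00:00:00, so after the shift the midnight reading closes the day it ends rather than opening the next one. The first and last rows are partial days, as in the output above.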
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
x3.d <- align.time(x3.d, 86400)
> x3.d
# [,1]
# 2018-02-02 1.00000000
# 2018-02-03 -0.09832625
# 2018-02-04 -0.65075506
# 2018-02-05 -0.09423664
# 2018-02-06 0.33067734
Note that the value of 2 at 2018-02-02 00:00:00 did not become the last observation of the daily bar stamped 2018-02-02 (00:00:00). That bar covers 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999, so its last value is 1.
Of course, if you want the daily timestamp to mark the start of the day rather than the end, which would be 2018-02-01 for the first row of x3.d above, you could shift the index back by one day. You could do this relatively safely for most time zones when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely because there are corner cases when the clock changes in a time zone, so be careful with daylight saving time. Simply subtracting 86400 seconds can be a problem when the shift crosses a daylight saving transition, which usually happens on a weekend:
# e.g. bad: daylight saving time starts on this weekend in US Eastern time
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:59"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]