I'm working on an SHM (structural health monitoring) system with sensor data recorded every 15 minutes on a structure. I have one set of observations where there is no damage and another where some kind of damage was simulated. My objective is to take the undamaged data and use it to forecast. The forecasts are then compared to the undamaged data, and this difference is used to create control charts.
However, my undamaged data covers around 5 months and the damaged state covers 8 months. I tried to explore the forecast package using multiple seasonalities (msts) of 96 (1 day) and 35060 (1 year), since I believe the signal is connected to temperature.
The models I created that showed a pattern resembling reality had a much smaller amplitude than the real data, which was far more volatile.
Can someone point me in the right direction as to what to do next and how to do it?
PS: When using the ts function, even though I try to make the series start at 2018-04-27 14:15:00, the plotted ts object always starts at 2018-01-01. I think this is more aesthetic than anything, but setting it right would be appreciated.
ts and msts objects are not well suited to high-frequency data. I suggest you try tsibble objects via the tsibble package (http://tsibble.tidyverts.org), where the time index is explicit. Here is an example using 30-minute data.
library(tsibble)
library(feasts)
library(ggplot2)
tsibbledata::vic_elec
#> # A tsibble: 52,608 x 5 [30m] <UTC>
#>    Time                Demand Temperature Date       Holiday
#>    <dttm>               <dbl>       <dbl> <date>     <lgl>
#>  1 2012-01-01 00:00:00  4263.        21.0 2012-01-01 TRUE
#>  2 2012-01-01 00:30:00  4049.        20.7 2012-01-01 TRUE
#>  3 2012-01-01 01:00:00  3878.        20.6 2012-01-01 TRUE
#>  4 2012-01-01 01:30:00  4036.        20.4 2012-01-01 TRUE
#>  5 2012-01-01 02:00:00  3866.        20.2 2012-01-01 TRUE
#>  6 2012-01-01 02:30:00  3694.        20.1 2012-01-01 TRUE
#>  7 2012-01-01 03:00:00  3562.        19.6 2012-01-01 TRUE
#>  8 2012-01-01 03:30:00  3433.        19.1 2012-01-01 TRUE
#>  9 2012-01-01 04:00:00  3359.        19.0 2012-01-01 TRUE
#> 10 2012-01-01 04:30:00  3331.        18.8 2012-01-01 TRUE
#> # … with 52,598 more rows
tsibbledata::vic_elec %>% autoplot(Demand)
Created on 2019-11-27 by the reprex package (v0.3.0)
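For 15-minute data like yours, the conversion is analogous. Here is a minimal sketch, assuming a data frame with a POSIXct column timestamp and a numeric column reading (both names are hypothetical placeholders for your own data):
library(tsibble)

# Hypothetical 15-minute sensor data; swap in your own data frame.
shm <- data.frame(
  timestamp = seq(as.POSIXct("2018-04-27 14:15:00", tz = "UTC"),
                  by = "15 min", length.out = 96 * 7),
  reading = rnorm(96 * 7)
)

# The index keeps the real timestamps, so the series starts exactly at
# 2018-04-27 14:15:00 instead of being renumbered from an arbitrary origin.
shm_ts <- as_tsibble(shm, index = timestamp)
This also resolves the PS in your question: because the index is the actual timestamp, plots start where the data starts.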
I am trying to separate events from streamflow data. I have hourly data, and I have run this code for daily data:
dailyMQ <- data.frame(Date = seq(from = as.Date("01.01.2000", format = "%d.%m.%Y"),
                                 to = as.Date("01.01.2004", format = "%d.%m.%Y"),
                                 by = "days"),
                      discharge = rbeta(1462, 2, 20) * 100)
But when I try the same for hourly data I get errors.
Could anyone suggest how to write the code for hourly data?
Thanks
The Date class can't directly be split into hours.
You could use POSIXct datetime format:
HourlyMQ <- data.frame(Date = seq(from = as.POSIXct("01.01.2019", format = "%d.%m.%Y"),
                                  to = as.POSIXct("11.12.2019", format = "%d.%m.%Y"),
                                  by = "hours"),
                       discharge = rbeta(8257, 2, 20))
HourlyMQ
#> Date discharge
#> 1 2019-01-01 00:00:00 0.2452214482
#> 2 2019-01-01 01:00:00 0.0620291334
#> 3 2019-01-01 02:00:00 0.0608788870
#> 4 2019-01-01 03:00:00 0.0697449808
#> 5 2019-01-01 04:00:00 0.0302780135
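A small refinement (my sketch, not part of the original answer): building the sequence first and reusing its length avoids counting the 8257 rows by hand, and the code keeps working if the dates change.
hours <- seq(from = as.POSIXct("01.01.2019", format = "%d.%m.%Y"),
             to = as.POSIXct("11.12.2019", format = "%d.%m.%Y"),
             by = "hours")
# length(hours) works out to 8257 here, but it is now derived, not hard-coded
HourlyMQ <- data.frame(Date = hours, discharge = rbeta(length(hours), 2, 20))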
Before I start, I would like to state that I have seen the FAQ entry at https://tsibble.tidyverts.org/articles/faq.html; however, I am still unable to reach a workable solution.
I am using the "Aggregates (Bars)" output from polygon.io (https://polygon.io/docs/get_v2_aggs_ticker__forexTicker__range__multiplier___timespan___from___to__anchor)
Due to licensing/copyright restrictions I am unable to post the data here, but if you go to the docs link above, they have a sample there (no login required).
Polygon.io provides the timestamp as follows:
t (integer): the Unix millisecond ("Msec") timestamp for the start of the aggregate window.
My attempt so far looks like this:
library(fpp3)
library(fable.prophet)
library(jsonlite)
library(curl)
library(tidyverse)
library(lubridate)
myURL <- # *N.B. For sample data please see doc link in SO question*
myData <- fromJSON(myURL)
myData$ticker
myData$results
parse1 <- myData$results %>%
  select(t, c) %>%
  mutate(dt = as_datetime(t/1000), .keep = "unused", .before = 1)
print(head(parse1))
parse2 <- as_tsibble(parse1,index=dt)
However this yields:
Error: Can't obtain the interval due to the mismatched index class.
The issue seems to be related to the irregularity of the datetime column 'dt'. We can convert it to the Date class with as.Date, and it works:
library(dplyr)
library(tsibble)
parse1 %>%
mutate(dt = as.Date(dt)) %>%
as_tsibble(index = dt)
Output:
# A tsibble: 120 x 2 [1D]
dt c
<date> <dbl>
1 2021-01-03 1.22
2 2021-01-04 1.23
3 2021-01-05 1.23
4 2021-01-06 1.23
5 2021-01-07 1.23
6 2021-01-08 1.22
7 2021-01-09 1.22
8 2021-01-10 1.22
9 2021-01-11 1.22
10 2021-01-12 1.22
# … with 110 more rows
I was able to replicate the same error from the OP's post:
as_tsibble(parse1,index=dt)
Error: Can't obtain the interval due to the mismatched index class.
ℹ Please see `vignette("FAQ")` for details.
Run `rlang::last_error()` to see where the error occurred.
The issue is with this piece of code in as_tsibble:
...
if (unknown_interval(interval) && (nrows > vec_size(key_data))) {
abort(c(
"Can't obtain the interval due to the mismatched index class.",
i = "Please see `vignette(\"FAQ\")` for details."
))
}
...
There is a regular argument to specify whether the interval is regular. By default it is TRUE, but here the interval is not regular, so change regular to FALSE:
as_tsibble(parse1,index=dt, regular = FALSE)
# A tsibble: 120 x 2 [!] <GMT>
dt c
<dttm> <dbl>
1 2021-01-03 00:00:00 1.22
2 2021-01-04 00:00:00 1.23
3 2021-01-05 00:00:00 1.23
4 2021-01-06 00:00:00 1.23
5 2021-01-07 00:00:00 1.23
6 2021-01-08 00:00:00 1.22
7 2021-01-09 00:00:00 1.22
8 2021-01-10 00:00:00 1.22
9 2021-01-11 00:00:00 1.22
10 2021-01-12 00:00:00 1.22
# … with 110 more rows
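One follow-up sketch (not part of the original answers): with the Date index from the first approach, any weekend gaps in the daily bars stay implicit, and tsibble can report or fill them before modelling.
library(dplyr)
library(tsibble)

daily <- parse1 %>%
  mutate(dt = as.Date(dt)) %>%
  as_tsibble(index = dt)

has_gaps(daily)        # reports whether any days are missing from the [1D] index
daily %>% fill_gaps()  # makes the gaps explicit as rows with NA values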
I have a dataset of temperature values taken at specific datetimes across five locations. For whatever reason, sometimes the readings are every hour and sometimes every four hours. Another issue is that when the clocks changed for daylight saving, the readings are off by one hour. I am interested in the readings taken every four hours and would like to subset these into day and night to ultimately get daily and nightly mean temperatures.
To summarise, the readings I am interested in are either:
0800, 1200, 1600 =day
2000, 0000, 0400 =night
Recordings between 0800-1600 and 2000-0400 each day should be averaged.
During daylight savings, the equivalent times are:
0900, 1300, 1700 =day
2100, 0100, 0500 =night
Recordings between 0900-1700 and 2100-0500 each day should be averaged.
In the process, I am hoping to subset by site.
There are also some NA values or blank cells which should be ignored.
So far, I tried to subset by one hour of interest just to see if it worked, but haven't got any further than that. Any tips on how to subset by a series of times of interest? Thanks!
temperature <- read.csv("SeaTemperatureData.csv",
                        stringsAsFactors = FALSE)
temperature <- subset(temperature, select = -c(X)) # remove last column that contains comments, not needed
temperature$Date.Time <- as.POSIXct(temperature$Date.Time,
                                    format = "%d/%m/%Y %H:%M",
                                    tz = "Pacific/Auckland")
#subset data by time, we only want to include temperatures recorded at certain times
temperature.goat <- subset(temperature, Date.Time==c('01:00:00'), select=c("Goat.Island"))
Date.Time Goat.Island Tawharanui Kawau Tiritiri Noises
1 2019-06-10 16:00:00 16.820 16.892 16.749 16.677 15.819
2 2019-06-10 20:00:00 16.773 16.844 16.582 16.654 15.796
3 2019-06-11 00:00:00 16.749 16.820 16.749 16.606 15.819
4 2019-06-11 04:00:00 16.487 16.796 16.654 16.558 15.796
5 2019-06-11 08:00:00 16.582 16.749 16.487 16.463 15.867
6 2019-06-11 12:00:00 16.630 16.773 16.725 16.654 15.867
One possible solution is to extract the hours from your DateTime variable and then filter for the particular hours of interest.
Here is a fake example over 4 days:
library(lubridate)
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"),
                                ymd_hms("2020-02-05 00:00:00"),
                                by = "hour"),
                 Value = sample(1:100, 97, replace = TRUE))
DateTime Value
1 2020-02-01 00:00:00 99
2 2020-02-01 01:00:00 51
3 2020-02-01 02:00:00 44
4 2020-02-01 03:00:00 49
5 2020-02-01 04:00:00 60
6 2020-02-01 05:00:00 56
Now you can extract the hours with lubridate's hour function and subset for the desired hour:
library(lubridate)
subset(df, hour(DateTime) == 5)
DateTime Value
6 2020-02-01 05:00:00 56
30 2020-02-02 05:00:00 31
54 2020-02-03 05:00:00 65
78 2020-02-04 05:00:00 80
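Since you want a series of hours rather than a single one, %in% drops in where == was (a small extension of the same idea):
# day readings at 08:00, 12:00 and 16:00
subset(df, hour(DateTime) %in% c(8, 12, 16))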
EDIT: Getting the mean of each site per subset of hours
Per OP's request in the comments, the goal is to calculate the mean of the values at each site over different periods of time.
Basically, you want two periods per day, one from 8:00 to 17:00 and the other from 18:00 to 7:00.
Here is a more elaborate example based on the previous one:
df <- data.frame(DateTime = seq(ymd_hms("2020-02-01 00:00:00"),
                                ymd_hms("2020-02-05 00:00:00"),
                                by = "hour"),
                 Site1 = sample(1:100, 97, replace = TRUE),
                 Site2 = sample(1:100, 97, replace = TRUE))
DateTime Site1 Site2
1 2020-02-01 00:00:00 100 6
2 2020-02-01 01:00:00 9 49
3 2020-02-01 02:00:00 86 12
4 2020-02-01 03:00:00 34 55
5 2020-02-01 04:00:00 76 29
6 2020-02-01 05:00:00 41 1
....
So now you can label each time point as Daily or Night, group by this category within each date, and calculate the mean of each site using summarise_at:
library(lubridate)
library(dplyr)
df %>%
  mutate(Date = date(DateTime),
         Hour = hour(DateTime),
         Category = ifelse(between(hour(DateTime), 8, 17), "Daily", "Night")) %>%
  group_by(Date, Category) %>%
  summarise_at(vars(c(Site1, Site2)), ~ mean(., na.rm = TRUE))
# A tibble: 9 x 4
# Groups: Date [5]
Date Category Site1 Site2
<date> <chr> <dbl> <dbl>
1 2020-02-01 Daily 56.9 63.1
2 2020-02-01 Night 58.9 46.6
3 2020-02-02 Daily 54.5 47.6
4 2020-02-02 Night 36.9 41.7
5 2020-02-03 Daily 42.3 56.9
6 2020-02-03 Night 44.1 55.9
7 2020-02-04 Daily 54.3 50.4
8 2020-02-04 Night 54.8 34.3
9 2020-02-05 Night 75 16
Does that answer your question?
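To handle the daylight-saving offset mentioned in the question, here is a sketch (assuming DateTime carries the Pacific/Auckland time zone): lubridate's dst() flags timestamps that fall in daylight saving, so the hour can be mapped back to standard time before categorising.
library(dplyr)
library(lubridate)

df %>%
  # during daylight saving the readings shift by one hour (09:00 instead of 08:00),
  # so subtract 1 from the clock hour to recover the standard-time hour
  mutate(StdHour = (hour(DateTime) - as.integer(dst(DateTime))) %% 24,
         Category = ifelse(between(StdHour, 8, 17), "Daily", "Night"))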
I have spent a day searching for the answer to this question and still could not figure out how this works (I'm relatively new to R).
The data:
I have the daily revenue of a store. The starting date is November 2017 and the end date is February 2020 (so it is not the typical January-to-December, full-year data). There is no missing value, and every day's sales are recorded. There are 2 columns: date (in proper Date format) and revenue (numeric).
I am trying to build a time series forecasting model for my sales data. One prerequisite is that I need to transform my data into a ts object. All the posts I have seen online deal with yearly or monthly data; I have not yet seen anyone mention daily data.
I tried to convert my data to a ts object this way (I named my data "d"):
d_ts <- ts(d$revenue, start=min(d$date), end = max(d$date), frequency = 365)
I then got really weird results:
Start = c(17420, 1)
End = c(18311, 1)
Frequency = 365
[1] 1174.77 214.92 10.00 684.86 7020.04 11302.50 30613.55 29920.98 24546.49 22089.89 30291.65 32993.05 26517.11 39670.38 30361.32 17510.72
[17] 9888.76 3032.27 1229.74 2426.36 ....... [ reached getOption("max.print") -- omitted 324216 entries ]
There are 892 days in this dataset; how can the ts object have dimension 325,216 x 1?
I looked into this book called "Hands-On Time-Series with R" and found the following excerpt:
[screenshot of a book excerpt]
This basically means the ts() object does NOT work for daily data. Is this why my ts() conversion is messed up?
My questions are ...
(1) How can I make my daily revenue data to be a time series object before feeding into a model, if ts() does not work for daily data? All those time-series models require my data to be in time-series format though.
(2) Does the fact that my data does not start in Jan 2017 and end in Dec 2019 (i.e. the perfect 12-months-per-year data shown in many online posts) have any complications? If so, what should I adjust so that the time series forecasting is meaningful?
I have been stuck on this issue and could not wrap my head around. I really, really appreciate your help!
The ts function can work with any time interval; that's defined by the start and end points. As you're using dates, one unit corresponds to one day, since that is how they're stored internally. The help file at ?ts also shows examples of how to use annual or quarterly data.
To read in your daily data correctly you need to set frequency=1. Using some data similar in structure to what you've got:
#Compile a dataframe like yours
library(lubridate)
set.seed(0)
d <- data.frame(date = seq.Date(dmy("01/11/2017"), by = "day", length.out = 892))
d$revenue <- runif(892)
head(d)
#         date   revenue
# 1 2017-11-01 0.8966972
# 2 2017-11-02 0.2655087
# 3 2017-11-03 0.3721239
# 4 2017-11-04 0.5728534
# 5 2017-11-05 0.9082078
# 6 2017-11-06 0.2016819
#Convert to timeseries object
d_ts <- ts(d$revenue, start=min(d$date), end = max(d$date), frequency = 1)
d_ts
# Time Series:
# Start = 17471
# End = 18362
# Frequency = 1
# [1] 0.896697200 0.265508663 0.372123900 0.572853363 0.908207790 0.201681931 0.898389685 0.944675269 0.660797792
# [10] 0.629114044 0.061786270 0.205974575 0.176556753 0.687022847 0.384103718 0.769841420 0.497699242 0.717618508
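A side note of mine (not from the original answer): if day-of-week effects matter for store revenue, frequency = 7 treats each week as one seasonal cycle, so seasonal models can pick up weekly patterns.
# Assumption: weekly seasonality dominates daily sales.
# Each period is now one week of 7 days.
d_ts_weekly <- ts(d$revenue, frequency = 7)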
With daily data, you are better off using a tsibble class rather than a ts class. There are modelling and forecasting tools available via the fable package.
library(tsibble)
library(fable)
set.seed(1)
d_tsibble <- data.frame(
date = seq(as.Date("2017-11-01"), by = "day", length.out = 892),
revenue = rnorm(892)
) %>%
as_tsibble(index = date)
d_tsibble
#> # A tsibble: 892 x 2 [1D]
#> date revenue
#> <date> <dbl>
#> 1 2017-11-01 -0.626
#> 2 2017-11-02 0.184
#> 3 2017-11-03 -0.836
#> 4 2017-11-04 1.60
#> 5 2017-11-05 0.330
#> 6 2017-11-06 -0.820
#> 7 2017-11-07 0.487
#> 8 2017-11-08 0.738
#> 9 2017-11-09 0.576
#> 10 2017-11-10 -0.305
#> # … with 882 more rows
d_tsibble %>%
model(
arima = ARIMA(revenue)
) %>%
forecast(h = "14 days")
#> # A fable: 14 x 4 [1D]
#> # Key: .model [1]
#> .model date revenue .distribution
#> <chr> <date> <dbl> <dist>
#> 1 arima 2020-04-11 -0.0178 N(-1.8e-02, 1.1)
#> 2 arima 2020-04-12 -0.0117 N(-1.2e-02, 1.1)
#> 3 arima 2020-04-13 -0.00765 N(-7.7e-03, 1.1)
#> 4 arima 2020-04-14 -0.00501 N(-5.0e-03, 1.1)
#> 5 arima 2020-04-15 -0.00329 N(-3.3e-03, 1.1)
#> 6 arima 2020-04-16 -0.00215 N(-2.2e-03, 1.1)
#> 7 arima 2020-04-17 -0.00141 N(-1.4e-03, 1.1)
#> 8 arima 2020-04-18 -0.000925 N(-9.2e-04, 1.1)
#> 9 arima 2020-04-19 -0.000606 N(-6.1e-04, 1.1)
#> 10 arima 2020-04-20 -0.000397 N(-4.0e-04, 1.1)
#> 11 arima 2020-04-21 -0.000260 N(-2.6e-04, 1.1)
#> 12 arima 2020-04-22 -0.000171 N(-1.7e-04, 1.1)
#> 13 arima 2020-04-23 -0.000112 N(-1.1e-04, 1.1)
#> 14 arima 2020-04-24 -0.0000732 N(-7.3e-05, 1.1)
Created on 2020-04-01 by the reprex package (v0.3.0)
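As a usage note (assuming ggplot2 is installed), the fable output can be piped into autoplot to see the forecasts against the history:
library(ggplot2)

d_tsibble %>%
  model(arima = ARIMA(revenue)) %>%
  forecast(h = "14 days") %>%
  autoplot(d_tsibble)  # prediction intervals overlaid on the observed series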
I am trying to subtract one hour from date/times in a POSIXct column that are earlier than or equal to a time stated in a separate comparison dataframe for that particular ID.
For example:
#create sample data
Time <- as.POSIXct(c("2015-10-02 08:00:00", "2015-11-02 11:00:00",
                     "2015-10-11 10:00:00", "2015-11-11 09:00:00",
                     "2015-10-24 08:00:00", "2015-10-27 08:00:00"),
                   format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 01, 02, 02, 03, 03)
data <- data.frame(Time, ID)
Which produces this:
Time ID
1 2015-10-02 08:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 10:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 08:00:00 3
6 2015-10-27 08:00:00 3
I then have another dataframe with a key date and time for each ID to compare against. The Time in data should be compared against Comparison in ComparisonData for the matching ID. If the Time value in data is earlier than or equal to the comparison value, one hour should be subtracted from the value in data:
#create sample comparison data
Comparison <- as.POSIXct(c("2015-10-29 08:00:00", "2015-11-02 08:00:00",
                           "2015-10-26 08:30:00"),
                         format = "%Y-%m-%d %H:%M:%S")
ID <- c(01, 02, 03)
ComparisonData <- data.frame(Comparison, ID)
This should look like this:
Comparison ID
1 2015-10-29 08:00:00 1
2 2015-11-02 08:00:00 2
3 2015-10-26 08:30:00 3
In summary, the code should check all times for a given ID to see if any are earlier than or equal to the value specified in ComparisonData and, if they are, subtract one hour. This should give the following data frame as output:
Time ID
1 2015-10-02 07:00:00 1
2 2015-11-02 11:00:00 1
3 2015-10-11 09:00:00 2
4 2015-11-11 09:00:00 2
5 2015-10-24 07:00:00 3
6 2015-10-27 08:00:00 3
I have looked at similar solutions such as this, but I cannot understand how to also check the times against the right comparison time for that particular ID.
I think ddply seems quite a promising option, but I'm not sure how to use it for this particular problem.
Here's a quick and efficient solution using data.table. First we join the two data sets by ID, and then we modify the Time values that are less than or equal to Comparison:
library(data.table) # v1.9.6+
setDT(data)[ComparisonData, end := i.Comparison, on = "ID"]
data[Time <= end, Time := Time - 3600L][, end := NULL]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
Alternatively, we could do this in one step while joining, using ifelse (I'm not sure how efficient that is, though):
setDT(data)[ComparisonData,
Time := ifelse(Time <= i.Comparison,
Time - 3600L, Time),
on = "ID"]
data
# Time ID
# 1: 2015-10-02 07:00:00 1
# 2: 2015-11-02 11:00:00 1
# 3: 2015-10-11 09:00:00 2
# 4: 2015-11-11 09:00:00 2
# 5: 2015-10-24 07:00:00 3
# 6: 2015-10-27 08:00:00 3
I am sure there is a better solution than this; however, I think this works.
for(i in 1:nrow(data)) {
  # look up the comparison time for this row's ID
  # (this indexing assumes row i of ComparisonData holds ID i)
  if(data$Time[i] <= ComparisonData[data$ID[i], 1]) {
    data$Time[i] <- data$Time[i] - 3600  # subtract one hour (3600 seconds)
  }
}
# Time ID
#1 2015-10-02 07:00:00 1
#2 2015-11-02 11:00:00 1
#3 2015-10-11 09:00:00 2
#4 2015-11-11 09:00:00 2
#5 2015-10-24 07:00:00 3
#6 2015-10-27 08:00:00 3
This iterates through every row in data.
ComparisonData[data$ID[i], 1] retrieves the comparison time for the corresponding ID. If the Time column in data is earlier than or equal to that value, the time is reduced by one hour.
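For completeness, here is the same join-and-adjust logic in dplyr (a sketch, assuming one comparison row per ID; if_else is used because it preserves the POSIXct class):
library(dplyr)

data %>%
  left_join(ComparisonData, by = "ID") %>%                       # attach each ID's comparison time
  mutate(Time = if_else(Time <= Comparison, Time - 3600, Time)) %>%  # shift qualifying times back one hour
  select(-Comparison)                                            # drop the helper column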