How to create and analyze a time series with variable test frequency in R

Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) taken at varying intervals over time - no set schedule, sometimes a test a day, sometimes days between tests. I want to detect trends in each of these and alert stakeholders when any parameter is trending up or down more than a certain amount. I first fit a linear model between each variable's raw data and test time (converted to days or weeks since a fixed date) and created a table of slopes, one per variable, so that the stakeholders can view a single table for all variables and quickly see if any of them raises concern. The issue is that the data for most variables is very noisy. Someone suggested using time series functions to separate noise and seasonality from the trend and studying the trend component for a cleaner analysis. I started to look into this and already see a couple of concerns/questions:
1. Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals?
2. If one gets past the issue in #1, decomposes the data, and separates out the trend (i.e. removes the random variation/noise in particular), how would you then get a slope metric from it? Namely, if I wanted to fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis values (in this case the actual test dates/times, say converted to weeks or days from a fixed date)?
3. Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day-to-day trends.

Time series models are generally used to see whether previous observations of a variable have influence on future observations; you model under the assumption that the previous observations are able to predict the future ones. That is why most (not all) time series models require evenly spaced training data. If your data is not only very noisy but also not collected on a regular basis, then you should seriously consider whether time series modelling is the appropriate choice.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals?
What you can do is create an aggregate by widening the time bucket (shifting from daily data to a weekly average, for instance) so that every unit of time has an observation. Following your final comment, you could even average the observations over the last 3 months instead. A sketch of the weekly aggregation is below.
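A minimal sketch of such an aggregation in base R, under the assumption of a data frame with a date column and one measured variable (the data and column names here are illustrative, not from the question):
set.seed(1)
irregular <- data.frame(
  date  = as.Date("2021-01-01") + sort(sample(0:90, 40)),  # irregularly spaced test dates
  value = rnorm(40)                                        # noisy measurements
)
irregular$week <- cut(irregular$date, breaks = "week")     # weekly time buckets
weekly <- aggregate(value ~ week, data = irregular, FUN = mean)  # one average per week
Note that weeks with no tests simply do not appear in the result, so check for gaps before passing the aggregate to ts().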
If one gets past the issue in #1 above, decomposes the data, and separates out the trend (i.e. removes the random variation/noise in particular), how would you then get a slope metric from it? Namely, if I wanted to fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However, this is not always regarded as a time series model.
In the case of an autoregressive model, the independent variable would be the previous observation of what you are trying to predict - something like y(t) = φ·y(t-1), where φ is a smoothing factor. I encourage you to read Forecasting: Principles and Practice, which is an excellent book on the matter.
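As a toy contrast (my own example, not from the question's data): a linear model regresses the value on the time index, while an AR(1) model regresses it on its own previous value:
set.seed(42)
x <- cumsum(rnorm(100))          # a noisy series
lm(x ~ seq_along(x))             # linear model: the time index is the regressor
arima(x, order = c(1, 0, 0))     # AR(1): each value is modeled from the previous one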
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis values (in this case the actual test dates/times, say converted to weeks or days from a fixed date)?
The decompose() function returns a list which includes trend: a vector of the estimated trend component, each entry corresponding to its respective time value.
Let's create an example time series with a linear trend:
df <- data.frame(
  date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by = 1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by = 1))  # linear trend plus noise
time_series <- ts(df$value, frequency = 5)                # 10 points = 2 periods of length 5
df$trend <- decompose(time_series)$trend                  # moving-average trend estimate
> df
date value trend
1 2021-01-01 0.9170296 NA
2 2021-01-02 1.8899565 NA
3 2021-01-03 3.0816892 2.992256
4 2021-01-04 4.0075589 4.042486
5 2021-01-05 5.0650478 5.046874
6 2021-01-06 6.1681775 6.051641
7 2021-01-07 6.9118942 7.074260
8 2021-01-08 8.1055282 8.041628
9 2021-01-09 9.1206522 NA
10 2021-01-10 9.9018900 NA
As you can see, the trend component is already an estimate of the dependent variable at the corresponding time. In decompose(), the trend estimate is based on a centered moving average, which is why the values at the ends are NA. To get a slope metric, you can regress this trend component on the original dates, as sketched below.
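A minimal sketch of extracting the slope, using days since a fixed date as the independent variable (lm() drops the NA rows at the ends automatically):
df$days <- as.numeric(df$date - as.Date("2021-01-01"))  # days since a fixed date
slope_fit <- lm(as.numeric(trend) ~ days, data = df)    # regress the trend component on time
coef(slope_fit)["days"]                                 # slope in value units per day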

Related

Time series forecasting of outcome variable based on current performance of outcome variable in R

I have a very large dataset (~55,000 data points) for chicken crops. Chickens are grown over a ~35-day period. The dataset covers 10 sheds of ~20,000 chickens each. In the sheds are weighing platforms, and as chickens step on them they send the recorded weight to a server. They send continuously from day 0 to the final day.
The variables I have are: House (as a number, House 1 up to House 10), Weight (measured in grams, to 5 decimal places) and Day (measured as a number between two integers; e.g. 12 noon on day 0 might be 0.5, whereas day 23.3 suggests a third of the way through day 23, i.e. 8 AM. As this data is sent continuously, the numbers can be very precise).
I want to construct either a time series regression model or an ML model so that, if I take a new crop, the model can predict what the end weight will be as data is sent by the sensors. Then, as that crop cycle finishes, it can be added to the training data and the process repeats.
Currently I'm using this very simple Weight vs. Time model, but eventually I would include things like temperature, water and food consumption, humidity, etc.
I've run regression analyses on the data sets to determine the relationship between time and weight (it's likely quadratic; see the image attached) and tried using randomForest in R to create a model. The test model seemed to work well in that its MAPE value was similar to the training value, but that was by holding out one house and using it as the test set.
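For reference, a hypothetical sketch of the quadratic Weight-vs-Day baseline described above (the crops data frame and the exact column names are assumptions based on the variables listed in the question):
# 'crops' is an assumed data frame with columns Weight, Day and House
fit <- lm(Weight ~ poly(Day, 2), data = crops)    # quadratic fit of weight on day
predict(fit, newdata = data.frame(Day = 35))      # predicted weight at the end of a ~35-day cycle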
Potentially what I've tried so far is completely the wrong methodology, but this is a new area for me, so I'm really not sure of the best approach.

ERROR in R: decompose(y) : time series has no or less than 2 periods

I have a time series of daily transactions, starting from 2017-06-28 and running until 2018-11-26.
I am interested in using the decompose() or stl() function in R, but I am getting the error
decompose(y) : time series has no or less than 2 periods
when I call decompose(), and
Error in stl(y, "periodic") : series is not periodic or has less than two periods
when I call stl().
I have understood that I have to specify the period, but I am not able to work out what the period should be in my case. I have tried the following toy example:
dat <- cumsum(rnorm(51.7*10))   # 517 daily observations
y <- ts(dat, frequency = 517)   # frequency set to the full series length
plot.ts(y)
stl(y, "periodic")
But I couldn't get it to work. Any help will be highly appreciated.
The frequency parameter reflects the number of observations before the seasonal pattern repeats. As your data is daily, you may want to set the frequency to 7 or 365.25 (depending on your business seasonality).
Of course, the longer the seasonal period, the more data you need (i.e. more than 2 full periods) in order to decompose your time series. In your case you set the frequency to 517 but have data for less than two such periods, so the seasonal decomposition cannot happen. A working version of the toy example is sketched below.
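A minimal reworking of the toy example under the assumption of weekly seasonality (frequency = 7):
set.seed(1)
dat <- cumsum(rnorm(517))       # roughly 17 months of daily data
y <- ts(dat, frequency = 7)     # 517 / 7 is about 74 weekly periods
plot(decompose(y))              # more than 2 periods, so decompose() now works
stl(y, s.window = "periodic")   # stl() works as well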
For more information, see Rob Hyndman's book Forecasting: Principles and Practice.

Predicting multivariate time series with RNN

I have been experimenting with an R package called RNN.
The code is here:
https://github.com/bquast/rnn
It has a very nice example of financial time series prediction.
I have read the code, and I understand that it uses the sequence of the time series to predict the next day's value of the instrument in advance.
The following is an example run with 10 hidden nodes and 200 epochs (figure: RNN financial time series prediction).
What I would expect as a result is that the algorithm succeeds, at least in part, in predicting the value of the instrument in advance.
From what I can see, it apparently only approximates the value of the time series at the current day, without giving any prediction for the next day.
Is my expectation wrong?
This code is very simple - how would you improve it?
y <- X[, 1:input$training_amount + input$prediction_gap, as.numeric(input$target)]  # indices shifted by the gap
matrix(y, ncol = input$training_amount)  # reshape into the training layout
y.train moves all the data forward by a day, so that is what is being trained on: next-day data for the currency pair you care about. Because of R's operator precedence, 1:input$training_amount + input$prediction_gap evaluates as (1:input$training_amount) + input$prediction_gap, so every selected index is shifted forward by the prediction gap; the first data points fall off the front, and the whole series effectively moves forward by prediction_gap. The shift is illustrated below.
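A toy illustration of the index shift (training_amount and prediction_gap stand in for the input values above):
training_amount <- 5
prediction_gap  <- 1
1:training_amount + prediction_gap   # evaluates as (1:5) + 1, i.e. 2 3 4 5 6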

Time series forecasting, dealing with known big orders

I have many data sets with known outliers (big orders):
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2",
                 "10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4",
                 "13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1",
                 155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
                 135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
                 222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
                 231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8,
                 254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55),
               ncol = 2, byrow = FALSE)
The top 11 outliers of this specific series are:
outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68, 18319234.7, 12896323.62, 12718744.01, 12353002.09, 11936190.13, 11356476.28, 11351192.31, 10101527.85, 9723641.25, 9643214.018),ncol=2,byrow=FALSE)
What methods are there that I can use to forecast the time series while taking these outliers into consideration?
I have already tried replacing outliers with the next biggest value (running the data set 10 times, each time replacing one more outlier with the next biggest, until the 10th data set has all the outliers replaced).
I have also tried simply removing the outliers (again running the data set 10 times, removing one more outlier each time, until all 10 are removed in the 10th data set).
I just want to point out that removing these big orders does not delete the data point completely, as there are other deals that happen in that quarter.
My code tests the data with multiple forecasting models (ARIMA weighted on the out-of-sample, ARIMA weighted on the in-sample, ARIMA weighted, ARIMA, additive Holt-Winters weighted and multiplicative Holt-Winters weighted), so it needs to be something that can be adapted to all of these models.
Here are a couple more data sets that I used; I do not have the outliers for these series yet, though.
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3", 26393.99306, 13820.5037, 23115.82432, 25894.41036, 14926.12574, 15855.8857, 21565.19002, 49373.89675, 27629.10141, 43248.9778, 34231.73851, 83379.26027, 54883.33752, 62863.47728, 47215.92508, 107819.9903, 53239.10602, 71853.5, 59912.7624, 168416.2995, 64565.6211, 94698.38748, 80229.9716, 169205.0023, 70485.55409, 133196.032, 78106.02227), ncol=2,byrow=FALSE)
data <- matrix(c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3",3311.5124, 3459.15634, 2721.486863, 3286.51708, 3087.234059, 2873.810071, 2803.969394, 4336.4792, 4722.894582, 4382.349583, 3668.105825, 4410.45429, 4249.507839, 3861.148928, 3842.57616, 5223.671347, 5969.066896, 4814.551389, 3907.677816, 4944.283864, 4750.734617, 4440.221993, 3580.866991, 3942.253996, 3409.597269, 3615.729974, 3174.395507),ncol=2,byrow=FALSE)
If this is too complicated, then I would settle for an explanation of how, in R, the data is dealt with for forecasting once outliers have been detected using certain commands (e.g. smoothing, etc.), and how I can approach writing that code myself (not using the commands that detect outliers).
Your outliers appear to be seasonal variations, with the largest orders appearing in the 4th quarter. Many of the forecasting models you mentioned include the capability for seasonal adjustments. As an example, the simplest model could have a linear dependence on year with corrections for all seasons. The code would look like:
df <- data.frame(
  period = c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3",
             "10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2",
             "13Q3","13Q4","14Q1","14Q2","14Q3","14Q4","15Q1"),
  order  = c(155782698, 159463653.4, 172741125.6, 204547180, 126049319.8, 138648461.5,
             135678842.1, 242568446.1, 177019289.3, 200397120.6, 182516217.1, 306143365.6,
             222890269.2, 239062450.2, 229124263.2, 370575384.7, 257757410.5, 256125841.6,
             231879306.6, 419580274, 268211059, 276378232.1, 261739468.7, 429127062.8,
             254776725.6, 329429882.8, 264012891.6, 496745973.9, 284484362.55)
)
# split the period label into a numeric year and a quarter factor
seasonal <- data.frame(year = as.numeric(substr(df$period, 1, 2)),
                       qtr  = substr(df$period, 3, 4),
                       data = df$order)
ord_model <- lm(data ~ year + qtr, data = seasonal)   # linear year trend, additive quarter corrections
seasonal <- cbind(seasonal, fitted = ord_model$fitted)
library(reshape2)
library(ggplot2)
plot_fit <- melt(seasonal, id.vars = c("year", "qtr"), variable.name = "Source", value.name = "Order")
ggplot(plot_fit, aes(x = year, y = Order, colour = qtr, shape = Source)) + geom_point(size = 3)
which plots the actual and fitted orders by year, coloured by quarter.
Models with a seasonal adjustment but a nonlinear dependence on year may give better fits; one such variant is sketched below.
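A minimal sketch (my own addition), replacing the linear year term with a quadratic:
ord_model2 <- lm(data ~ poly(year, 2) + qtr, data = seasonal)  # quadratic year trend
anova(ord_model, ord_model2)   # does the quadratic term improve the fit?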
You already said you tried different ARIMA models, but as mentioned by WaltS, your series doesn't seem to contain big outliers but rather a seasonal component, which is nicely captured by auto.arima() in the forecast package:
library(forecast)
myTs <- ts(as.numeric(data[,2]), start = c(2008, 1), frequency = 4)  # quarterly series
myArima <- auto.arima(myTs, lambda = 0)
myForecast <- forecast(myArima)
plot(myForecast)
where the lambda = 0 argument to auto.arima() forces a Box-Cox transformation of the data (equivalent to taking the log) to take the increasing amplitude of the seasonal component into account.
The approach you are trying to use to cleanse your data of outliers is not going to be robust enough to identify them. I should add that there is a free outlier package in R called tsoutliers, but it won't do the things I am about to show you...
You have an interesting time series here. The trend changes over time, with the upward trend weakening a bit. If you bring in two time trend variables, the first beginning at 1 and a second beginning at period 14 and onward, you will capture this change. As for seasonality, you can capture the high 4th quarter with a dummy variable. The model is parsimonious, as the other 3 quarters are not different from the average, and there is no need for an AR12, seasonal differencing or 3 seasonal dummies. You can also capture the impact of the last two observations being outliers with two dummy variables. (In the model output, ignore the 49 above the word trend; that is just the name of the series being modeled.) A sketch of this kind of model is below.
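A minimal sketch of the regression described, reusing the seasonal data frame from the first answer (the breakpoint at period 14 and the outlier positions follow the description above; the variable names are my own):
n      <- nrow(seasonal)
trend1 <- 1:n                                 # trend beginning at period 1
trend2 <- pmax(0, 1:n - 13)                   # second trend beginning at period 14
q4     <- as.numeric(seasonal$qtr == "Q4")    # high fourth-quarter dummy
out1   <- as.numeric(1:n == n - 1)            # dummy for the next-to-last observation
out2   <- as.numeric(1:n == n)                # dummy for the last observation
fit <- lm(data ~ trend1 + trend2 + q4 + out1 + out2, data = seasonal)
summary(fit)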

Can we easily avoid the "maximum supported lag is 350" error when using the ARIMA function in R?

I am currently fitting a SARIMAX model to big data sets. Information is retrieved every 10 minutes, so I have a vector of size 52560 for a year of data. Since it represents the electricity load of a device throughout that year, we can observe a daily pattern, a weekly one and a yearly one.
There is also a trend, so I need to difference my series 4 times. Since the set covers 1 year, I can set aside the yearly seasonality and focus on the other 3. Let's say I get something like this:
dauch = diff(diff(diff(auch2), 144), 1008)
with 144 being the daily seasonal period (6 × 24 ten-minute points per day) and 1008 the weekly one.
I would like to fit a SARIMAX model which I worked on with my predecessor. He found that SARIMAX(2,1,5)(1,2,8)[144] was the best one for this series. However, I get an error whenever I try this:
themodel = arima(auch[1000:4024, 1], order = c(2, 1, 5),
                 seasonal = list(order = c(1, 2, 8), period = 144),
                 xreg = tmpf[988:4012])
tmpf being the temperature used as an exogenous variable. The error is the following:
Error in makeARIMA(trarma[[1L]], trarma[[2L]], Delta, kappa, SSinit) :
maximum supported lag is 350
I don't really understand what it means in my case, because the period I chose is 144, which is less than 350. I feel like I need to keep D = 2 in the model because of the dual differencing for the daily and weekly patterns, but I don't know how to solve this. Thanks.
