Input format for functions in package strucchange?

I'm trying to do change point detection with `monitor` from the strucchange package, but I have trouble getting useful output.
My input is a time-stamped data frame, and I would like the breaks to be returned as dates, but they are returned as observation numbers:
> cDF1 <- myDF[1:80,]
> cDF1[1:3,]
        Year Month Value
2000-10 2000   Oct     1
2001-01 2001   Jan     1
2001-04 2001   Apr     1
> me.mefp <- mefp(Value ~ 1, type = "ME", rescale = TRUE,
+   data = cDF1, alpha = 0.05)
> cDF1 <- myDF[1:104,]
> me.mefp <- monitor(me.mefp)
Break detected at observation # 98
In the strucchange manual, there are examples in which the time stamps are kept, but I can't figure out what the difference in format is.
It makes no difference if I make the data frame into a time series.
Can anybody help?
Thanks!

The mefp/monitor functions can only deal with ts time series. Hence, you can supply a data argument that is a (multivariate) ts, a data.frame whose response variable is a ts, or a standalone ts without a data argument. In your case, the data appears to be quarterly, and as there are no regressors (except a constant) a standalone time series is probably most convenient.
As an artificial example, I simulate 100 observations from a quarterly time series:
set.seed(1)
Value <- ts(rnorm(100, mean = rep(0:1, c(70, 30)), sd = 0.5),
start = c(1990, 1), freq = 4)
plot(Value)
Then I select the data up to the end of 1999 as the history period and initialize the monitoring process:
val <- window(Value, end = c(1999, 4))
m <- mefp(val ~ 1, type = "ME", rescale = TRUE, alpha = 0.05)
Then the data can arrive, say until the end of 2009:
val <- window(Value, end = c(2009, 4))
m <- monitor(m)
And then finally until the end of 2014:
val <- window(Value, end = c(2014, 4))
m <- monitor(m)
## Break detected at observation # 81
plot(m)
Here, a break is finally detected and also brought out graphically.
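Since the original question asked for dates rather than observation numbers: for a ts, the index reported by monitor() can be mapped back to a time stamp. A small sketch, using the index 81 reported above:
time(Value)[81]
## 2010, i.e. the first quarter of 2010 for this quarterly series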
P.S.: In your example, it appears as if the data were positive counts. If so, taking logs may (or may not) be useful.
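For instance, a minimal sketch (whether this helps depends on the data; counts of zero would need log1p or similar):
m <- mefp(log(val) ~ 1, type = "ME", rescale = TRUE, alpha = 0.05)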

Related

Time series daily data modeling

I am looking to forecast my time series. I have daily data for the period 2021-Jan-01 to 2022-Jul-01.
So I have a column of observations for each day.
What I have tried so far:
library(zoo)
d1 <- zoo(data, seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1))
tsdata <- ts(d1, frequency = 365)
ddata <- decompose(tsdata, "multiplicative")
I get the following error here:
Error in decompose(tsdata, "multiplicative") :
time series has no or less than 2 periods
From what I have read, it seems this is because I do not have two full years. Is that correct? I have tried doing it weekly as well:
series <- ts(data, frequency = 52, start = c(2021, 1))
but I get the same issue.
How do I decompose the series without having to extend my dataset to two years, which I do not have?
Also, when I actually try to forecast it, the result isn't a good enough forecast:
[plot with forecast omitted]
My data somewhat resembles a bell curve during that period, so is there a better-fitting time series model I can apply instead?
A weekly frequency for daily data should have frequency = 7, not 52. It's possible that this fix to your code will produce a model with a seasonal term.
I don't think you'll be able to produce a time series model with annual seasonality with less than 2 years of data.
You can either produce a model with only weekly seasonality (I expect this is what most folks would recommend), or if you truly believe in the annual seasonal pattern exhibited in your data, your "forecast" can be a seasonal naive forecast that is simply last year's value for that particular day. I wouldn't recommend this, because it just seems risky, and I don't really see the same trajectory in your screenshot over 2022 that's apparent in 2021.
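A sketch of the weekly-frequency option, assuming data is the daily vector from the question (snaive() is the seasonal naive function from the forecast package; note it repeats the last observed week rather than last year's values):
library(forecast)
tsdata <- ts(data, frequency = 7) # weekly seasonality for daily data
fit <- snaive(tsdata, h = 14)     # seasonal naive forecast, two weeks ahead
plot(fit)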
decompose requires two full cycles and that a full cycle represents one time unit. The ts class can't use the Date class anyway. To use frequency 7 we must use times 1/7 apart, such as 1, 1 + 1/7, 1 + 2/7, etc., so that one cycle (7 days) covers one unit. Then just label the plot appropriately rather than using those times on the X axis. In the code below, use %Y in place of %y if the years start in 19?? and end in 20??, so that tapply maintains the order.
# test data
set.seed(123)
s <- seq(from = as.Date("2021-01-01"), to = as.Date("2022-07-01"), by = 1)
data <- rnorm(length(s))
# frequency 7 so that one unit of ts time spans one 7-day cycle
tsdata <- ts(data, frequency = 7)
ddata <- decompose(tsdata, "multiplicative")
# suppress the numeric X axis, then label it with year/month
plot(ddata, xaxt = "n")
m <- tapply(time(tsdata), format(s, "%y/%m"), head, 1)
axis(1, m, names(m))

Add date to first element of vector to get new date and then add the next element to generate new date and so on

I am trying to generate some fake data to build a dataset so I can do some analysis. The dataset should have a haircut date and then further dates generated from a skewed normal distribution. The end goal is to predict future haircut intervals.
I have built this for one single customer, but I want to repeat it for different values of n, so I need help making it more programmatic. I have tried different loops and am coming up empty-handed. I am kind of new to programming. Thanks in advance!
#load the library for skewed normal generation
library(fGarch)
#set observations and generate data
n = 5
set.seed(1)
days_since_last = rsnorm(n, mean = 35, sd = 5, xi = 2)
days_since_last = as.integer(days_since_last)
#generate random date to start
haircut_date = sample(seq(as.Date('2018/01/01'), as.Date('2019/01/01'), by = "day"), 1)
#generate new dates
haircut_date2 = haircut_date + days_since_last[1]
haircut_date3 = haircut_date2 + days_since_last[2]
haircut_date4 = haircut_date3 + days_since_last[3]
haircut_date5 = haircut_date4 + days_since_last[4]
haircut_date6 = haircut_date5 + days_since_last[5]
#combine dates
date = c(haircut_date2, haircut_date3, haircut_date4, haircut_date5, haircut_date6)
#add dates to generated intervals in a dataframe
haircut_df = data.frame(days_since_last, date)
Slight variation, creating the data frame right after your haircut_date = sample... line:
haircut_df <- data.frame(days_since_last = c(0, days_since_last),
                         date = c(haircut_date, haircut_date + cumsum(days_since_last)))
Resulting in:
> haircut_df
  days_since_last       date
1               0 2018-07-02
2              39 2018-08-10
3              33 2018-09-12
4              41 2018-10-23
5              28 2018-11-20
6              32 2018-12-22
This should give you what you're looking for:
#load the library for skewed normal generation
library(fGarch)
#set observations and generate data
n = 10
set.seed(1)
days_since_last = as.integer(rsnorm(n, mean = 35, sd = 5, xi = 2))
# cumulative days since the first haircut
cumulative <- cumsum(days_since_last)
#generate random date to start
haircut_date = sample(seq(as.Date('2018/01/01'), as.Date('2019/01/01'), by = "day"), 1)
#initialize the date vector and loop over n
haircut_dates <- as.Date(x = integer(0), origin = "1970-01-01")
for (i in 1:n) {
  haircut_dates[i] <- haircut_date + cumulative[i]
}
#add dates to generated intervals in a dataframe
haircut_df = data.frame(haircut_dates, days_since_last)
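As a side note, the loop isn't strictly necessary: Date arithmetic in R is vectorized, so the same vector can be built in one step.
haircut_dates <- haircut_date + cumulative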

Weekly and Yearly Seasonality in R

I have daily electric load data from 1-1-2007 till 31-12-2016. I use the ts() function to load the data like so:
ts_load <- ts(data, start = c(2007, 1), end = c(2016, 12), frequency = 365)
I want to remove the yearly and weekly seasonality from my data. To decompose the data and remove the seasonality, I use the following code:
decompose_load = decompose(ts_load, "additive")
deseasonalized = ts_load - decompose_load$seasonal
My question is: am I doing it right? Is this the right way to remove the yearly seasonality? And what is the right way to remove the weekly seasonality?
A few points:
a ts series must have regularly spaced points and the same number of points in each cycle. In the question a frequency of 365 is specified, but some years, i.e. leap years, have 366 points. In particular, if you want the frequency to be a year, then you can't use daily or weekly data without adjustment, since different years have different numbers of days and the number of weeks in a year is not an integer.
decompose does not handle multiple seasonalities. If by weekly you mean remove the effect of Monday, of Tuesday, etc., and if by yearly you mean remove the effect of being 1st of the year, 2nd of the year, etc., then you are asking for multiple seasonalities.
end = c(2016, 12) means the 12th day of 2016 since the frequency is 365.
The msts function in the forecast package can handle multiple and non-integer seasonalities (a short sketch follows after the Note below).
Staying with base R, another approach is to approximate the seasonality by a linear model, avoiding all of the above problems (but ignoring correlations); that is the approach taken below.
Assuming the data shown reproducibly in the Note at the end, we define day-of-week (dow) and day-of-year (doy) variables and regress on those with an intercept and a trend; the last line of code then reconstructs just the intercept plus trend plus residuals to deseasonalize the data. Using scale to remove the mean of trend is not absolutely necessary, but it makes the three terms defining data.ds mutually orthogonal; whether or not we do this, the third term is orthogonal to the other two by the properties of linear models.
trend <- scale(seq_along(d), TRUE, FALSE) # centered trend, mean removed
dow <- format(d, "%a")                    # day of week
doy <- format(d, "%j")                    # day of year
fm <- lm(data ~ trend + dow + doy)
data.ds <- coef(fm)[1] + coef(fm)[2] * trend + resid(fm) # deseasonalized series
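To see the effect, the original and deseasonalized series can be plotted together (a quick sketch, using d and data from the Note below):
plot(d, data, type = "l", col = "grey", xlab = "", ylab = "value")
lines(d, data.ds, col = "red")
legend("topleft", legend = c("original", "deseasonalized"), col = c("grey", "red"), lty = 1)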
Note
Test data used in reproducible form:
set.seed(123)
d <- seq(as.Date("2007-01-01"), as.Date("2016-12-31"), "day")
n <- length(d)
trend <- 1:n
seas_week <- rep(1:7, length = n)
seas_year <- rep(1:365, length = n)
noise <- rnorm(n)
data <- trend + seas_week + seas_year + noise
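For completeness, here is a sketch of the msts route mentioned in the points above. tbats() from the forecast package fits multiple and non-integer seasonal periods; the settings are illustrative, not tuned:
library(forecast)
y <- msts(data, seasonal.periods = c(7, 365.25)) # weekly + yearly
fit <- tbats(y)
data.adj <- seasadj(fit) # seasonally adjusted series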
You can use the dsa function in the dsa package to adjust a daily time series. The advantage over the regression solution is that it takes into account that the impact of the season can change over time, which is usually the case.
To use that function, your data should be in the xts format (from the xts package), because in that case leap years are not ignored.
The code will then look something like this:
install.packages(c("xts", "dsa"))
data <- rnorm(round(365.25 * 10), 100, 1)
data_xts <- xts::xts(data, seq.Date(as.Date("2007-01-01"), by = "days", length.out = length(data)))
sa <- dsa::dsa(data_xts, fourier_number = 24)
# the fourier_number is used to model monthly recurring seasonal patterns in the regARIMA part
data_adjusted <- sa$output[, 1]

Time Series R with duplicate Items for daily forecast

I would like guidance on how to plot daily data and use forecasting in R.
Purchases are low on Saturdays and Sundays in this data, and on certain weekdays there are no purchases at all, which is an obstacle for the analysis.
I have around 300 rows with various item names; items are duplicated within the column, but with different dates.
For example, I bought exactly one soap three times in a week: on Monday, Wednesday and also Sunday.
This is the example data table: [table omitted]
My trouble so far is that forecasting manually in other statistical software took me a long time, so I am trying to learn R from the start and see how it could save time. The table above has been read into R, and the dates have been converted from factor to Date class using as.Date(data$Date).
Usually I use the exponential smoothing method: purchases are still low and items are sometimes out of stock, so the historical data doesn't show much of a pattern. The output of this analysis should be a daily purchase forecast for each item, so we know when to order it.
First, please consider adding a reproducible example for a more substantial answer. Look at the most upvoted question with the R tag for a how-to.
EDIT: I think this is what you want before creating the ts:
data.agg <- aggregate(data$purchase, by = list(data$date, data$item), FUN = sum)
If your data is not yet of class 'ts' you can create a time-series object with the ts() command. From the ?ts page:
ts(data = NA, start = 1, end = numeric(), frequency = 1,
deltat = 1, ts.eps = getOption("ts.eps"), class = , names = )
as.ts(x, ...)
Generally you could use the HoltWinters function for exponential smoothing like so:
data.hw <- HoltWinters(data)
data.predict <- predict(data.hw, n.ahead = x) # for x = units of time ahead you would like to predict
See also ?HoltWinters for more info on the function.
Reproducible Example for aggregate:
data <- data.frame(date = c(1, 2, 1, 2, 1, 1),
                   item = c('b', 'b', 'a', 'a', 'a', 'a'),
                   purchase = c(5, 15, 23, 7, 12, 11))
data.agg <- aggregate(data$purchase, by = list(data$date, data$item), FUN = sum)
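For reference, that toy example produces the following (Group.1, Group.2 and x are aggregate's default column names):
> data.agg
  Group.1 Group.2  x
1       1       a 46
2       2       a  7
3       1       b  5
4       2       b 15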
Reproducible Example for HoltWinters:
library(AER)
data("UKNonDurables")
nd <- window(log(UKNonDurables), end = c(1970, 4))
tsp(nd)
hw <- HoltWinters(nd)
pred <- predict(hw, n.ahead = 35)
pred
plot(hw, pred, ylim = range(log(UKNonDurables)))
lines(log(UKNonDurables))

R applying growth rate backwards

I have data with two different monthly series and a one-year overlap:
ts1 <- ts(cumsum(rnorm(120,.1,1)), start = 1995, frequency = 12)
ts2 <- ts(cumsum(rnorm(120,.2,1)), start = 2004, frequency = 12)
They do not have the same levels (there was a rebasing in 2004), but with the overlap one can use the monthly growth rates of the first series to back-project the second one to 1995.
I would like to create a variable ts_series which has the levels of ts2 after 2004 and then uses the monthly growth rates of ts1 to back-project it. I have several such series in a zoo object, so I can either use a zoo method or group them in a list and use mapply.
Many thanks
Here is one approach, using a simple linear regression on the overlapping bits to identify the relationship between the two series and then applying that model to the non-overlapping part of ts1 to estimate earlier values of ts2. The last step gives you a new ts object that represents the predicted values of ts2 for the non-overlapping period.
# Make the toy data
set.seed(1)
ts1 <- ts(cumsum(rnorm(120,.1,1)), start = 1995, frequency = 12)
ts2 <- ts(cumsum(rnorm(120,.2,1)), start = 2004, frequency = 12)
# Now do the estimation
x <- as.vector(window(ts1, start = c(2004, 1), end = c(2004, 12)))
y <- as.vector(window(ts2, start = c(2004, 1), end = c(2004, 12)))
tsmod <- lm(y ~ x)
# newdata must contain a column named x so that predict() can find the regressor
ts2preds <- predict(tsmod, newdata = data.frame(x = as.vector(window(ts1, start = c(1995, 1), end = c(2003, 12)))))
ts2prior <- ts(data = ts2preds, start = c(1995, 1), end = c(2003, 12), frequency = 12)
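To obtain the single ts_series asked for in the question (levels of ts2 from 2004 on, back-projected values before that), the two pieces can then be stitched together; a short sketch:
ts_series <- ts(c(ts2prior, ts2), start = c(1995, 1), frequency = 12)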
If you want instead to backcast ts2 on its own, though, Rob Hyndman's got you covered with the forecast() function in his forecast package. Following an example from his blog:
library(forecast)
f <- frequency(ts2) # Identify the frequency of your ts
h <- (start(ts2)[1] - start(ts1)[1]) * f # Set the number of periods you want to backcast
revx <- ts(rev(ts2), frequency = f) # Reverse time in the series you want to backcast
ts2plus <- forecast(auto.arima(revx), h) # Do the backcasting
# Reverse its elements
ts2plus$mean <- ts(rev(ts2plus$mean), end=tsp(ts2)[1] - 1/f, frequency=f)
ts2plus$upper <- ts2plus$upper[h:1,]
ts2plus$lower <- ts2plus$lower[h:1,]
ts2plus$x <- ts2 # Replace the reversed reference series in the prediction object with the original one
# Plot it
plot(ts2plus, xlim=c(tsp(ts2)[1]-h/f, tsp(ts2)[2]))
Here's the plot this produces: [plot omitted]
And here's how the two series compare:
> cor(ts2plus$mean, ts2preds)
[1] 0.9760174
If your main goal is to get the best possible point predictions of those earlier values, you might consider running both versions and averaging their results. Then this becomes a very simple multi-model ensemble forecast (or backcast).
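For instance, a sketch of that average, reusing the objects from above (both vectors cover 1995 through 2003):
backcast_avg <- ts((ts2preds + as.numeric(ts2plus$mean)) / 2,
                   start = c(1995, 1), frequency = 12)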
