I am using the stats::stl function for the first time in order to identify and remove the technological signal from a crop yields series. I am not familiar with this method and I am a newbie at programming, so I apologize in advance for any mistakes.
These are the original data I am working with:
dat <- data.frame(year= seq(1962,2014,1),yields=c(1100,1040,1130,1174,1250,1350,1450,1226,1070,1474,1526,1719,1849,1766,1342,2000,1750,1750,2270,1550,1220,2400,2750,3200,2125,3125,3737,2297,3665,2859,3574,4519,3616,3247,3624,2964,4326,4321,4219,2818,4052,3770,4170,2854,3598,4767,4657,3564,4340,4573,3834,4700,4168))
This is the ts object with frequency = 1 (annual) created as input for the STL function:
time.series <- ts(data=dat$yields, frequency = 1, start=c(1962, 1), end=c(2014, 1))
plot(time.series, xlab="Years", ylab="Kg/ha", main="Crop yields")
When I try to run the function I get the following error message:
decomposed <- stl(time.series, s.window='periodic')
> Error in stl(time.series, s.window = "periodic") : series is not periodic or has less than two periods
I know that my series is annual, so I cannot vary the frequency in the ts object, which seems to be what causes the error, because when I change the frequency I do get the seasonal, trend and remainder signals:
time.series <- ts(data=dat$yields, frequency = 12, start=c(1962, 1), end=c(2014, 1))
decomposed <- stl(time.series, s.window='periodic')
plot(decomposed)
I would like to know whether there is a way to apply the STL function to annual data with a frequency of one observation per unit of time.
On the other hand, to remove the technological signal, is it only necessary to remove the trend and remainder signals from the original series, or am I mistaken?
Many thanks for your help.
Since you're using annual data, there is no seasonal component, so a seasonal decomposition of the time series would not be appropriate. However, the stats::stl function calls the loess function to estimate the trend, which is a local polynomial regression you can adjust to your liking. You can call loess directly and estimate your own trend as follows.
dat <- data.frame(year= seq(1962,2014,1),yields=c(1100,1040,1130,1174,1250,1350,1450,1226,1070,1474,1526,1719,1849,1766,1342,2000,1750,1750,2270,1550,1220,2400,2750,3200,2125,3125,3737,2297,3665,2859,3574,4519,3616,3247,3624,2964,4326,4321,4219,2818,4052,3770,4170,2854,3598,4767,4657,3564,4340,4573,3834,4700,4168))
dat$trend <- loess(yields ~ year, data = dat)$fitted
plot(y = dat$yields, x = dat$year, type = "l", xlab="Years", ylab="Kg/ha", main="Crop yields")
lines(y = dat$trend, x = dat$year, col = "blue", type = "l")
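If the default trend is too smooth or too wiggly for your purposes, the span argument of loess controls the degree of smoothing, and subtracting the fitted trend gives a detrended series. A minimal sketch (the span value of 0.5 and the new column names are just illustrative choices):
# a smaller span gives a wigglier, more local trend (the default is 0.75)
dat$trend2 <- loess(yields ~ year, data = dat, span = 0.5)$fitted
lines(y = dat$trend2, x = dat$year, col = "red")
# one way to remove the estimated (technological) trend from the series
dat$detrended <- dat$yields - dat$trend
plot(detrended ~ year, data = dat, type = "l",
     xlab = "Years", ylab = "Kg/ha", main = "Detrended crop yields")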
I would like to do a pseudo-out-of-sample exercise with the dynamic factor model (DFM) from the Nowcasting package in R.
Let me first provide you with a replicable example using the data from the Nowcasting package.
library(nowcasting)
data(NYFED)
NYFED$legend$SeriesName
base <- NYFED$base
blocks <- NYFED$blocks$blocks
trans <- NYFED$legend$Transformation
frequency <- NYFED$legend$Frequency
delay <- NYFED$legend$delay
vintage <- PRTDB(mts = BRGDP$base, delay = BRGDP$delay, vintage = "2015-06-01")
base <- window(vintage, start = c(2005,06), frequency = 12)
x <- Bpanel(base = base, trans = BRGDP$trans)
GDP <- base[,which(colnames(base) == "PIB")]
GDP_qtr <- month2qtr(x = GDP, reference_month = 3)
y <- diff(diff(GDP_qtr,4))
y <- qtr2month(y)
data <- cbind(y,x)
frequency <- c(4,rep(12,ncol(x)))
nowca <- nowcast(formula = y~., data = data, r = 1, q = 1 , p = 1, method = "2s_agg",
frequency = frequency)
summary(nowca$reg)
nowca$yfcst
nowcast.plot(nowca, type = "fcst")
This code runs fine and creates forecasts and a plot with GDP, in-sample fit and three steps of out-of-sample forecasts.
However, I would like to do a full pseudo-out-of-sample forecasting exercise with this package. In other words, I would like to create multiple point forecasts using forecasts generated by this nowcast-function.
I have already written replicable code to do this. It uses the same data as before, but now the data is fed to the model gradually.
nowcasts_dfm <- rep(NA,nrow(data))
for (i in 12:nrow(data)){
data <- ts(data[1:i,], start=c(2005,06), frequency=12)
nowca <- nowcast(formula = y~., data = data, r = 1, q = 1 , p = 1, method = "2s_agg",
frequency = frequency)
nowcasts_dfm[i] <- nowca$yfcst[,3][!is.na(nowca$yfcst[,3])][1]
}
So, this pseudo-out-of-sample exercise uses an expanding window starting with the first 12 observations, which then expands to cover the whole sample. However, I am getting an error message.
Error in eigen(cov(x)) : infinite or missing values in 'x'
Could someone help me with this, please? How do you code an expanding-window pseudo-out-of-sample forecasting exercise with this package?
Or is there a better way to code an expanding-window dynamic factor model (DFM) in R?
Thanks!
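One thing that stands out, judging only from the code shown, is that data is overwritten inside the loop, so from the second iteration onward the full sample is no longer available; the reported eigen error may also simply mean the earliest windows are too short for the factor estimation, in which case a larger starting window could help. A minimal sketch of an expanding-window loop that keeps the full sample in a separate object (variable names as above, purely illustrative) could look like this:
# keep the full data set untouched and subset it inside the loop
full_data <- ts(data, start = c(2005, 06), frequency = 12)
nowcasts_dfm <- rep(NA, nrow(full_data))
for (i in 12:nrow(full_data)){
  window_data <- ts(full_data[1:i, ], start = c(2005, 06), frequency = 12)
  nowca <- nowcast(formula = y~., data = window_data, r = 1, q = 1, p = 1,
                   method = "2s_agg", frequency = frequency)
  nowcasts_dfm[i] <- nowca$yfcst[, 3][!is.na(nowca$yfcst[, 3])][1]
}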
I am working with time series 551 of the monthly data of the M3 competition.
So, my data is :
library(forecast)
library(Mcomp)
# Time Series
# Subset the M3 data to contain the relevant series
ts.data<- subset(M3, 12)[[551]]
print(ts.data)
I want to implement time series cross-validation for the last 18 observations of the in-sample interval.
Some people would normally call this “forecast evaluation with a rolling origin” or something similar.
How can I achieve that? What does the in-sample interval mean? Which time series must I evaluate?
I'm quite confused, so any help to clear this up would be welcome.
The tsCV function of the forecast package is a good place to start.
From its documentation,
tsCV(y, forecastfunction, h = 1, window = NULL, xreg = NULL, initial = 0, ...)
Let ‘y’ contain the time series y[1:T]. Then ‘forecastfunction’ is applied successively to the time series y[1:t], for t=1,...,T-h, making predictions f[t+h]. The errors are given by e[t+h] = y[t+h] - f[t+h].
That is, tsCV first fits a model to y[1] and forecasts y[1 + h], then fits a model to y[1:2] and forecasts y[2 + h], and so on for T - h steps.
The tsCV function returns the forecast errors.
Applying this to the training data of ts.data:
# function to fit a model and forecast
fmodel <- function(x, h){
forecast(Arima(x, order=c(1,1,1), seasonal = c(0, 0, 2)), h=h)
}
# time-series CV
cv_errs <- tsCV(ts.data$x, fmodel, h = 1)
# RMSE of the time-series CV
sqrt(mean(cv_errs^2, na.rm=TRUE))
# [1] 778.7898
In your case, it may be that you are supposed to:
fit a model to ts.data$x and then forecast ts.data$xx[1],
fit a model to c(ts.data$x, ts.data$xx[1]) and forecast ts.data$xx[2],
and so on.
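A minimal sketch of that rolling-origin loop over the out-of-sample part, reusing the fmodel function defined above (this is only an illustration of the idea, with a 1-step horizon assumed):
# rolling origin over the out-of-sample period xx, re-fitting at each step
h <- 1
errs <- rep(NA, length(ts.data$xx))
for (i in seq_along(ts.data$xx)) {
  y_train <- ts(c(ts.data$x, head(ts.data$xx, i - 1)),
                start = start(ts.data$x), frequency = frequency(ts.data$x))
  fc <- fmodel(y_train, h = h)
  errs[i] <- ts.data$xx[i] - fc$mean[1]
}
sqrt(mean(errs^2, na.rm = TRUE))  # RMSE over the out-of-sample interval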
I have an existing time series (1000 samples) and calculated the rolling mean using the filter() function in R, averaging across 30 samples each. The goal of this was to create a "smoothed" version of the time series. Now I would like to create artificial data that "look like" the original time series, i.e., are somewhat noisy, that would result in the same rolling mean if I would apply the same filter() function to the artificial data. In short, I would like to simulate a time series with the same overall course but not the exact same values as those of an existing time series. The overall goal is to investigate whether certain methods can detect similarity of trends between time series, even when the fluctuations around the trend are not the same.
To provide some data, my time series looks somewhat like this:
set.seed(576)
ts <- arima.sim(model = list(order = c(1,0,0), ar = .9), n = 1000) + 900
# save in dataframe
df <- data.frame("ts" = ts)
# plot the data
plot(ts, type = "l")
The filter function produces the rolling mean:
my_filter <- function(x, n = 30){filter(x, rep(1 / n, n), sides = 2, circular = T)}
df$rolling_mean <- my_filter(df$ts)
lines(df$rolling_mean, col = "red")
To simulate data, I have tried the following:
Adding random noise to the rolling mean.
df$sim1 <- df$rolling_mean + rnorm(1000, sd = sd(df$ts))
lines(df$sim1, col = "blue")
df$sim1_rm <- my_filter(df$sim1)
lines(df$sim1_rm, col = "green")
The problems are that (a) the variance of the simulated values is higher than the variance of the original values, (b) the rolling average, although quite similar to the original, sometimes deviates quite a bit from it, and (c) there is no autocorrelation. Having an autocorrelation structure in the data would be good, since it is supposed to resemble the original data.
Edit: Problem a) can be solved by using sd = sqrt(var(df$ts)-var(df$rolling_mean)) instead of sd = sd(df$ts).
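Spelled out, that adjusted simulation looks like this (the column name sim1b is just illustrative):
# noise sd chosen so the simulated series matches the variance of the original
df$sim1b <- df$rolling_mean + rnorm(1000, sd = sqrt(var(df$ts) - var(df$rolling_mean)))
df$sim1b_rm <- my_filter(df$sim1b)
lines(df$sim1b_rm, col = "purple")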
I tried arima.sim(), which seems like an obvious choice to specify the autocorrelation that should be present in the data. I modeled the original data using arima(), using the model parameters as input for arima.sim().
ts_arima <- arima(ts, order = c(1,0,1))
my_ar <- ts_arima$coef["ar1"]
my_ma <- ts_arima$coef["ma1"]
my_intercept <- ts_arima$coef["intercept"]
df$sim2 <- arima.sim(model = list(order = c(1,0,1), ar = my_ar, ma = my_ma), n = 1000) + my_intercept
plot(df$ts)
lines(df$sim2, col = "blue")
The resulting time series is very different from the original. Maybe a higher order for ar and ma in arima.sim() would solve this, but I think a whole different method might be more appropriate.
I have three time series and I want to predict future values for each of them.
I am using the vars package in R.
So this is the approach:
Decompose the multiplicative time series and extract the trend, seasonal, and random parts.
time_series1_components = decompose(time_series1,type="mult")
Do this for all the time series.
Apply the VAR Model on the random parts and predict the future values:
random_part1 = time_series1_components$random
random_part2 = time_series2_components$random
random_part3 = time_series3_components$random
merged_df = ts.union(random_part1, random_part2,random_part3, dframe = TRUE)
merged_mat <- data.matrix(merged_df)
merged_mat = na.exclude(merged_mat)
checklag = VARselect(merged_mat)
EstimateModel=VAR(merged_mat, p = 2, type = "const", season = NULL, exogen = NULL)
summary(EstimateModel)
roots(EstimateModel)
predict(EstimateModel)
Now, I should combine the predicted values of the random part with the trend and seasonality, and plot a graph showing the past values and predicted values (highlighted separately).
How can I achieve this?
Any pointers will be helpful.
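A rough sketch of one way to recombine the pieces for the first series, under several illustrative assumptions (monthly data, a 12-step horizon, the trend extrapolated with a simple univariate forecast, and component alignment glossed over), so treat it as a starting point rather than a recipe:
# recombine the VAR forecast of the random part with seasonal and trend components
library(forecast)
h <- 12                                          # assumed horizon
var_fcst <- predict(EstimateModel, n.ahead = h)
rand_fc  <- var_fcst$fcst$random_part1[, "fcst"] # random-part forecast for series 1
# the seasonal figure repeats, so recycle the last full cycle
seas_fc  <- rep(tail(as.numeric(time_series1_components$seasonal),
                     frequency(time_series1)), length.out = h)
# extrapolate the trend with a univariate forecast (one option among many)
trend_fc <- forecast(na.omit(time_series1_components$trend), h = h)$mean
# multiplicative decomposition, so multiply the components back together
combined_fc <- ts(as.numeric(trend_fc) * seas_fc * rand_fc,
                  start = tsp(time_series1)[2] + deltat(time_series1),
                  frequency = frequency(time_series1))
ts.plot(time_series1, combined_fc, col = c("black", "red"), lty = c(1, 2),
        main = "Observed (black) and forecast (red, dashed)")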
I have a time series for which I want to intelligently interpolate the missing values. The value at a particular time is influenced by a multi-day trend, as well as its position in the daily cycle.
Here is an example in which the tenth observation is missing from myzoo
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- zoo(obs, index)
myzoo[10] <- NA
If I had to implement this, I'd use some kind of weighted mean of close times on nearby days, or add a value for the day to a function line fitted to the larger trend, but I hope there already exists a package or function that applies to this situation.
EDIT: Modified the code slightly to clarify my problem. There are na.* methods that interpolate from nearest neighbors, but in this case they do not recognize that the missing value is at the time that is the lowest value of the day. Maybe the solution is to reshape the data to wide format and then interpolate, but I wouldn't like to completely disregard the contiguous values from the same day. It is worth noting that diff(myzoo, lag = 4) returns a vector of 10's. The solution may lie with some combination of reshape, na.spline, and diffinv, but I just can't figure it out.
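For what it's worth, a rough sketch of that diff/diffinv idea, assuming the daily pattern is roughly constant (all of this is illustrative, not a worked-out solution):
# difference at the daily lag (4 obs/day), fill the gaps there, then invert
d <- diff(coredata(myzoo), lag = 4)      # 12 values, NA wherever obs 10 is involved
d <- na.approx(d, na.rm = FALSE)         # interpolate the differenced series
d[is.na(d)] <- mean(d, na.rm = TRUE)     # fallback for any NAs left at the ends
rebuilt <- diffinv(d, lag = 4, xi = coredata(myzoo)[1:4])
zoo(rebuilt, index(myzoo))[10]           # interpolated value for the missing point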
Here are three approaches that don't work:
EDIT2. Image produced using the following code.
myzoo <- zoo(obs, index)
myzoo[10] <- NA # knock out the missing point
plot(myzoo, type="o", pch=16) # plot solid line
points(na.approx(myzoo)[10], col = "red")
points(na.locf(myzoo)[10], col = "blue")
points(na.spline(myzoo)[10], col = "green")
myzoo[10] <- 31 # replace the missing point
lines(myzoo, type = "o", lty=3, pch=16) # dashed line over the gap
legend(x = "topleft",
legend = c("na.spline", "na.locf", "na.approx"),
col=c("green","blue","red"), pch = 1)
Try this:
x <- ts(myzoo,f=4)
fit <- ts(rowSums(tsSmooth(StructTS(x))[,-2]))
tsp(fit) <- tsp(x)
plot(x)
lines(fit,col=2)
The idea is to use a basic structural model for the time series, which handles the missing value fine using a Kalman filter. Then a Kalman smooth is used to estimate each point in the time series, including any omitted.
I had to convert your zoo object to a ts object with frequency 4 in order to use StructTS. You may want to change the fitted values back to zoo again.
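For instance, putting the smoothed values back onto the original timestamps can be as simple as the following (a small illustrative step, assuming the zoo package is loaded):
# map the smoothed fit back onto the original zoo index
fit_zoo <- zoo(as.numeric(fit), index(myzoo))
fit_zoo[10]   # the estimate for the missing observation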
In this case, I think you want a seasonality correction in the ARIMA model. There's not enough data here to fit the seasonal model, but this should get you started.
library(zoo)
start <- as.POSIXct("2010-01-01")
freq <- as.difftime(6, units = "hours")
dayvals <- (1:4)*10
timevals <- c(3, 1, 2, 4)
index <- seq(from = start, by = freq, length.out = 16)
obs <- (rep(dayvals, each = 4) + rep(timevals, times = 4))
myzoo <- myzoo.orig <- zoo(obs, index)
myzoo[10] <- NA
myzoo.fixed <- na.locf(myzoo)
myarima.resid <- arima(myzoo.fixed, order = c(3, 0, 3), seasonal = list(order = c(0, 0, 0), period = 4))$residuals
myzoo.reallyfixed <- myzoo.fixed
myzoo.reallyfixed[10] <- myzoo.fixed[10] + myarima.resid[10]
plot(myzoo.reallyfixed)
points(myzoo.orig)
In my tests the ARMA(3, 3) is really close, but that's just luck. With a longer time series you should be able to calibrate the seasonal correction to give you good predictions. It would also help to have a good prior on the underlying mechanisms for both the signal and the seasonal correction, to get better out-of-sample performance.
forecast::na.interp is a good approach. From the documentation
Uses linear interpolation for non-seasonal series and a periodic stl decomposition with seasonal series to replace missing values.
library(forecast)
fit <- na.interp(myzoo)
fit[10] # 32.5, vs. 31.0 actual and 32.0 from Rob Hyndman's answer
This paper evaluates several interpolation methods against real time series, and finds that na.interp is both accurate and efficient:
From the R implementations tested in this paper, na.interp from the forecast package and na.StructTS from the zoo package showed the best overall results.
The na.interp function is also not that much slower than na.approx [the fastest method], so the loess decomposition seems not to be very demanding in terms of computing time.
Also worth noting that Rob Hyndman wrote the forecast package, and included na.interp after providing his answer to this question. It's likely that na.interp is an improvement upon this approach, even though it performed worse in this instance (probably due to specifying the period in StructTS, where na.interp figures it out).
Package imputeTS has a method for Kalman Smoothing on the state space representation of an ARIMA model - which might be a good solution for this problem.
library(imputeTS)
na_kalman(myzoo, model = "auto.arima")
It also works directly with zoo time series objects. You could also use your own ARIMA model in this function, if you think you can do better than "auto.arima". This would be done this way:
library(imputeTS)
usermodel <- arima(myts, order = c(1, 0, 1))$model
na_kalman(myts, model = usermodel)
But in this case you have to convert the zoo object back to ts first, since arima() only accepts ts objects.
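A minimal sketch of that conversion for this example (the frequency of 4 mirrors the four observations per day used earlier in the thread):
# one possible conversion: strip the zoo index and keep the daily frequency of 4
myts <- ts(coredata(myzoo), frequency = 4)
usermodel <- arima(myts, order = c(1, 0, 1))$model
imputed <- na_kalman(myts, model = usermodel)
# and back onto the original timestamps if needed
zoo(as.numeric(imputed), index(myzoo))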