Plotting Forecast and Real values in one plot using a Rolling Window - r

I have a code which takes the input as the Yield Spread (dependent var.) and Forward Rates(independent var.) and operate an auto.arima to get the orders. Afterwards, I am forecasting the next 25 dates (forc.horizon). My training data are the first 600 (training). Then I am moving the time window 25 dates, meaning using the data from 26 to 625, estimating the auto.arima and then forecasting the data from 626 to 650 and so on. My data sets are 2298 rows (date) and 30 columns (maturity).
I want to store all of the forecasts and then plot the forecasted and real values in the same plot.
This is the code I have, but it doesn't store the forecasts in a way to plot later.
forecast.func <- function(NS.spread, ind.v, maturity, training, forc.horizon){
NS.spread <- NS.spread/100
forc <- c()
j <- 0
for(i in 1:floor((nrow(NS.spread)-training)/forc.horizon)){
# test data
y <- NS.spread[(1+j):(training+j) , maturity]
f <- ind.v[(1+j):(training+j) , maturity]
# auto- arima
c <- auto.arima(y, xreg = f, test= "adf")
# forecast
e <- ind.v[(training+j+1):(training+j+forc.horizon) , maturity]
h <- forecast(c, xreg = lagmatrix(e, -1))
forc <- c(forc, list(h))
j <- j + forc.horizon
}
return(forc)
}
a <- forecast.func(spread.NS.JPM, Forward.rate.JPM, 10, 600, 25)
lapply(a, plot)
Here's a link to my two datasets:
https://drive.google.com/drive/folders/1goCxllYHQo3QJ0IdidKbdmfR-DZgrezN?usp=sharing

LOOK AT THE END for a full functional example on how to handle AUTO.ARIMA MODEL with DAILY DATA using XREG and FOURIER SERIES with ROLLING STARTING TIMES and cross validated training and test.
Without a reproducible example no one can help you, because they can't run your code. You need to provide data. :-(
Even if it's not part of StackOverflow to discuss statistics matters, why don't you do an auto.arima with xreg instead of lm + auto.arima on residuals? Especially, considering how you forecast at the end, that training method looks really wrong. Consider using:
fit <- auto.arima(y, xreg = lagmatrix(f, -1))
h <- forecast(fit, xreg = lagmatrix(e, -1))
auto.arima will automatically calculate the best parameters by max likelihood.
On your coding question..
forc <- c() should be outside of the for loop, otherwise at every run you delete your previous results.
Same for j <- 0: at every run you're setting it back to 0. Put it outside if you need to change its value at every run.
The output of forecast is an object of class forecast, which is actually a type of list. Therefore, you can't use cbind effectively.
I'm my opinion, you should create forc in this way: forc <- list()
And create a list of your final results in this way:
forc <- c(forc, list(h)) # instead of forc <- cbind(forc, h)
This will create a list of objects of class forecast.
You can then plot them with a for loop by getting access at every object or with a lapply.
lapply(output_of_your_function, plot)
This is as far as I can go without a reproducible example.
FINAL EDIT
FULL FUNCTIONAL EXAMPLE
Here I try to sum up a conclusion out of the million comments we wrote.
With the data you provided, I built a code that can handle everything you need.
From training and test to model, till forecast and finally plotting which have the X axis with the time as required in one of your comments.
I removed the for loop. lapply is much better for your case.
You can leave the fourier series if you want to. That's how Professor Hyndman suggests to handle daily time series.
Functions and libraries needed:
# libraries ---------------------------
library(forecast)
library(lubridate)
# run model -------------------------------------
.daily_arima_forecast <- function(init, training, horizon, tt, ..., K = 10){
# create training and test
tt_trn <- window(tt, start = time(tt)[init] , end = time(tt)[init + training - 1])
tt_tst <- window(tt, start = time(tt)[init + training], end = time(tt)[init + training + horizon - 1])
# add fourier series [if you want to. Otherwise, cancel this part]
fr <- fourier(tt_trn[,1], K = K)
frf <- fourier(tt_trn[,1], K = K, h = horizon)
tsp(fr) <- tsp(tt_trn)
tsp(frf) <- tsp(tt_tst)
tt_trn <- ts.intersect(tt_trn, fr)
tt_tst <- ts.intersect(tt_tst, frf)
colnames(tt_tst) <- colnames(tt_trn) <- c("y", "s", paste0("k", seq_len(ncol(fr))))
# run model and forecast
aa <- auto.arima(tt_trn[,1], xreg = tt_trn[,-1])
fcst <- forecast(aa, xreg = tt_tst[,-1])
# add actual values to plot them later!
fcst$test.values <- tt_tst[,1]
# NOTE: since I modified the structure of the class forecast I should create a new class,
# but I didnt want to complicate your code
fcst
}
daily_arima_forecast <- function(y, x, training, horizon, ...){
# set up x and y together
tt <- ts.intersect(y, x)
# set up all starting point of the training set [give it a name to recognize them later]
inits <- setNames(nm = seq(1, length(y) - training, by = horizon))
# remove last one because you wouldnt have enough data in front of it
inits <- inits[-length(inits)]
# run model and return a list of all your models
lapply(inits, .daily_arima_forecast, training = training, horizon = horizon, tt = tt, ...)
}
# plot ------------------------------------------
plot_daily_forecast <- function(x){
autoplot(x) + autolayer(x$test.values)
}
Reproducible Example on how to use the previous functions
# create a sample data
tsp(EuStockMarkets) <- c(1991, 1991 + (1860-1)/365.25, 365.25)
# model
models <- daily_arima_forecast(y = EuStockMarkets[,1],
x = EuStockMarkets[,2],
training = 600,
horizon = 25,
K = 5)
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[1]]
Example for the author of the post
# your data
load("BVIS0157_Forward.rda")
load("BVIS0157_NS.spread.rda")
spread.NS.JPM <- spread.NS.JPM / 100
# pre-work [out of function!!!]
set_up_ts <- function(m){
start <- min(row.names(m))
end <- max(row.names(m))
# daily sequence
inds <- seq(as.Date(start), as.Date(end), by = "day")
ts(m, start = c(year(start), as.numeric(format(inds[1], "%j"))), frequency = 365.25)
}
mts_spread.NS.JPM <- set_up_ts(spread.NS.JPM)
mts_Forward.rate.JPM <- set_up_ts(Forward.rate.JPM)
# model
col <- 10
models <- daily_arima_forecast(y = mts_spread.NS.JPM[, col],
x = stats::lag(mts_Forward.rate.JPM[, col], -1),
training = 600,
horizon = 25,
K = 5) # notice that K falls between ... that goes directly to the inner function
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[5]]

Related

Dynamic factor models and forecasting exercises in R (Nowcasting package)

I would like to do a pseudo-out-of-sample exercises with Dynamic factor model (DFM) from the Nowcasting-package in R.
Let me first provide you with a replicable example using the data from the Nowcasting-package.
library(nowcasting)
data(NYFED)
NYFED$legend$SeriesName
base <- NYFED$base
blocks <- NYFED$blocks$blocks
trans <- NYFED$legend$Transformation
frequency <- NYFED$legend$Frequency
delay <- NYFED$legend$delay
vintage <- PRTDB(mts = BRGDP$base, delay = BRGDP$delay, vintage = "2015-06-01")
base <- window(vintage, start = c(2005,06), frequency = 12)
x <- Bpanel(base = base, trans = BRGDP$trans)
GDP <- base[,which(colnames(base) == "PIB")]
GDP_qtr <- month2qtr(x = GDP, reference_month = 3)
y <- diff(diff(GDP_qtr,4))
y <- qtr2month(y)
data <- cbind(y,x)
frequency <- c(4,rep(12,ncol(x)))
nowca <- nowcast(formula = y~., data = data, r = 1, q = 1 , p = 1, method = "2s_agg",
frequency = frequency)
summary(nowca$reg)
nowca$yfcst
nowcast.plot(nowca, type = "fcst")
This code runs fine and creates forecasts and a plot with GDP, in-sample fit and three steps of out-of-sample forecasts.
However, I would like to do a full pseudo-out-of-sample forecasting exercise with this package. In other words, I would like to create multiple point forecasts using forecasts generated by this nowcast-function.
I have already written a replicable code to do this. It uses the same the data as before, but now the data is inputted gradually to the model.
nowcasts_dfm <- rep(NA,nrow(data))
for (i in 12:nrow(data)){
data <- ts(data[1:i,], start=c(2005,06), frequency=12)
nowca <- nowcast(formula = y~., data = data, r = 1, q = 1 , p = 1, method = "2s_agg",
frequency = frequency)
nowcasts_dfm[i] <- now$yfcst[,3][!is.na(now$yfcst[,3])][1]
}
So, this pseudo-out-of-sample uses expanding window starting with the first 12 observations. It then expands to cover the whole sample. However, I am getting a error message.
Error in eigen(cov(x)) : infinite or missing values in 'x'
Could some help me with this, please? How do you code a expanding window pseudo-out-of-sample forecasting exercise with this package?
Or is there a better way to code a expanding window Dynamic factor model (DFM) in R?
Thanks!

How to forecast and fit the optimal model for multiple time series?

I want to do batch forecasting among multiple series, for example, if I want to forecast time series with IDs that end with 1(1,11,21,31...), how can I do that?
Since you did not provide detailed information, I was not sure which forecasting method you want to use hence I give here an example of a univariate time series model:
Load required packages:
library(forecast)
library(dplyr)
We use example data from Rob Hyndman:
dta <- read.csv("https://robjhyndman.com/data/ausretail.csv")
Now change the column names:
colnames(dta) <- c("date", paste0("tsname_", seq_len(ncol(dta[,-1]))))
Select timeseries which end with 1:
dta_ends_with1 <- dplyr::select(dta, dplyr::ends_with("1"))
Create a ts object:
dta_ends_with1 <- ts(dta_ends_with1, start = c(1982,5), frequency = 12)
Specify how many steps ahead you want to forecast, here I set it to 6 steps ahead,
h <- 6
Now we prepare a matrix to save the forecast:
fc <- matrix(NA, ncol = ncol(dta_ends_with1), nrow = h)
Forecasting loop.
for (i in seq_len(ncol(dta_ends_with1))) {
fc[,i] <- forecast::forecast(forecast::auto.arima(dta_ends_with1[,i]),
h = h)$mean
}
Set the column names:
colnames(fc) <- colnames(dta_ends_with1)
head(fc)

How to loop an arima model to run over different columns of a time series data?

So I have a data set which has district wise values of covid-19 cases. Now I want to run an Arima model on each of these districts and create a similar dataset of predicted values.
library("forecast")
df <- read.csv("D:/Hackathon/Time series/Maharashtra.csv")
z = ncol(df)
for(i in z){
x = ts(c[,i],frequency = 365, start = c(2020,1,30))
plot.ts(x)
pi = auto.arima(x)
summary(pi)
q = forecast(pi,h=30)
plot.forecast(q)
write.csv(q,"D:/Hackathon/pred.csv")
}
I know for a fact this is not correct. This how the data is

Perceptron in R not converging

I am trying to understand Neural Networks better so I am trying to implement a simple perceptron from scratch in R. I know that this is very inefficient as there are many libraries that do this extemely well optimized but my goal is to understand the basics of neural networks better and work my way forward to more complex models.
I have created some artificial test data with a very simple linear decision boundary and split this into a training set and a test set. I then ran a logistic regression on the training data and checked the predictions from the test-set and got +99% accuray, which was to be expected given the simple nature of the data. I then tried implementing a perceptron with 2 inputs, 1 neuron, 1000 iterations, a learning rate of 0.1 and a sigmoid activation function.
I would expect to get very similar accuracy to the logistic regression model but my results are a lot worse (around 70% correct classifications in the training set). so I definitly did something wrong. The predictions only seem to get better after the first couple of iterations and then just go back and forth around a specific value (I tried with many different learning rates, no success). I'm attaching my script and I#m thankful for any advice! I think the problem lies in the calculation of the error or the weight adjustment but I can't put my finger on it...
### Reproducible Example for StackOverflow
#### Setup
# loading libraries
library(data.table)
#remove scientifc notation
options(scipen = 999)
# setting seed for random number generation
seed <- 123
#### Selfmade Test Data
# input points
x1 <- runif(10000,-100,100)
x2 <- runif(10000,-100,100)
# setting decision boundary to create output
output <- vector()
output[0.5*x1 + -1.2*x2 >= 50] <- 0
output[0.5*x1 + -1.2*x2 < 50] <- 1
# combining to dataframe
points <- cbind.data.frame(x1,x2,output)
# plotting all data points
plot(points$x1,points$x2, col = as.factor(points$output), main = "Self-created data", xlab = "x1",ylab = "x2")
# split into test and training sets
trainsize = 0.2
set.seed(seed)
train_rows <- sample(1:dim(points)[1], size = trainsize * dim(points)[1])
train <- points[train_rows,]
test <- points[-c(train_rows),]
# plotting training set only
plot(train$x1,train$x2, col = as.factor(train$output), main = "Self-created data (training set)", xlab = "x1",ylab = "x2")
#### Approaching the problem with logistic regression
# building model
train_logit <- glm(output ~ x1 + x2, data = train, family = "binomial", maxit = 10000)
summary(train_logit)
# testing performance in training set
table(round(train_logit$fitted.values) == train$output)
# testing performance of train_logit model in test set
table(test$output == round(predict(train_logit,test[,c(1,2)], type = "response")))
# We get 100% accuracy in the training set and near 100% accuracy in the test set
#### Approaching Problem with a Perceptron from scratch
# setting inputs, outputs and weights
inputs <- as.matrix(train[,c(1,2)])
output <- as.matrix(train[,3])
set.seed(123456)
weights <- as.matrix(runif(dim(inputs)[2],-1,1))
## Defining activation function + derivative
# defining sigmoid and it's derivative
sigmoid <- function(x) {1 / (1 + exp(-x))}
sig_dir <- function(x){sigmoid(x)*(1 - sigmoid(x))}
## Perceptron nitial Settings
bias <- 1
# number of iterations
iterations <- 1000
# setting learning rate
alpha <- 0.1
## Perceptron
# creating vectors for saving results per iteration
weights_list <- list()
weights_list[[1]] <- weights
errors_vec <- vector()
outputs_vec <- vector()
# saving results across iterations
weights_list_all <- list()
outputs_list <- list()
errors_list <- list()
# looping through the backpropagation algorithm "iteration" # times
for (j in 1:iterations) {
# Loop for backpropagation with updating weights after every datapoint
for (i in 1:dim(train)[1]) {
# taking the weights from the last iteration of the outer loop as a starting point
if (j > 1) {
weights_list[[1]] <- weights
}
# Feed Forward (Should we really round this?!)
output_pred <- round(sigmoid(sum(inputs[i,] * as.numeric(weights)) + bias))
error <- output_pred - output[i]
# Backpropagation (Do I need the sigmoid derivative AND a learning rate? Or should I only take one of them?)
weight_adjustments <- inputs[i,] * (error * sig_dir(output_pred)) * alpha
weights <- weights - weight_adjustments
# saving progress for later plots
weights_list[[i + 1]] <- weights
errors_vec[i] <- error
outputs_vec[[i]] <- output_pred
}
# saving results for each iteration
weights_list_all[[j]] <- weights_list
outputs_list[[j]] <- outputs_vec
errors_list[[j]] <- errors_vec
}
#### Formatting Diagnostics for easier plotting
# implementing empty list to transform weightslist
WeightList <- list()
# collapsing individual weightslist into datafames
for (i in 1:iterations) {
WeightList[[i]] <- t(data.table::rbindlist(weights_list_all[i]))
}
# pasting dataframes together
WeightFrame <- do.call(rbind.data.frame, WeightList)
colnames(WeightFrame) <- paste("w",1:dim(WeightFrame)[2], sep = "")
# pasting dataframes together
ErrorFrame <- do.call(rbind.data.frame, errors_list)
OutputFrame <- do.call(rbind.data.frame, outputs_list)
##### Plotting Results
# Development of Mean Error per iteration
plot(rowMeans(abs(ErrorFrame)),
type = "l",
xlab = "Sum of absolute Error terms")
# Development of Weights over time
plot(WeightFrame$w1, type = "l",xlim = c(1,dim(train)[1]), ylim = c(min(WeightFrame),max(WeightFrame)), ylab = "Weights", xlab = "Iterations")
lines(WeightFrame$w2, col = "green")
# lines(WeightFrame$w3, col = "blue")
# lines(WeightFrame$w4, col = "red")
# lines(WeightFrame$w5, col = "orange")
# lines(WeightFrame$w6, col = "cyan")
# lines(WeightFrame$w7, col = "magenta")
# Empty vector for number of correct categorizations per iteration
NoCorr <- vector()
# Computing percentage of correct predictions per iteration
colnames(OutputFrame) <- paste("V",1:dim(OutputFrame)[2], sep = "")
Output_mat <- as.matrix(OutputFrame)
for (i in 1:iterations) {
NoCorr[i] <- sum(output == Output_mat[i,]) / nrow(train)
}
# plotting number of correct predictions per iteration
plot(NoCorr, type = "l")
# Performance in training set after last iteration
table(output,round(OutputFrame[iterations,]))
First of all, welcome to the world of Neural Networks :).
Secondly, I want to recommend a great article to you, which I personally used to get a better understanding of backtracking and the whole NN learning stuff: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. Might be a bit rough to get through sometimes, and for the general implementation I think it is much easier to follow pseudocode from a NN book. However, to understand what is going on this is article is very nice!
Thirdly, I will hopefully solve your problem :)
You comment yourself already with whether you should really round that output_pred. Yes you should.. if you want to use that output_pred to make a prediction! However, if you want to use it for learning it is generally not good! The reason for this is that if you round it for learning, than an output which was rounded up from 0.51 to 1 with target output 1 will not learn anything as the output was the same as the target and thus is perfect. However, 0.99 would have been a lot better of a prediction than 0.51 and thus there is definitely something to learn!
I am not 100% sure if this solves all your problems (im not an R programmer) and gets your accuracy up to 99%, but it should solve some of it, and hopefully the intuition is also clear :)

Coding a prediction interval from a generalized additive model with a very large dataset

I have a small dataset of people, thier locations, and whether or not they know each other. It is a subset of a dataset with 1000 people. Given that each person could know any other person, the number of potential links grows just under n^2. I want to fit a model with the small subset, to get probability of linkage as a function of distance, so that I can perform simulations with the wider dataset.
I've got two problems:
I'm not sure how to create a prediction interval from a fitted GAM object.
Generating a prediction interval using either posterior simulation or using this technique from R-sig-mixed is computationally prohibitive.
The following is an example of my problem, generating the interval using the technique from R-sig-mixed. Be warned that the last step will throw an error about not being able to allocate a huge vector, unless you're on a really impressive machine.
#Some fake location data
set.seed(13)
x = runif(50)*2
y = runif(50)*2
d = cbind(ID = 1:50,as.matrix(dist(data.frame(x,y))))
I want to model links as a function of distance. More fake data:
library(reshape)
mdata <- melt(as.data.frame(d), id=c("ID"),measure.vars = colnames(d)[2:ncol(d)],variable.name="distance")
mdata$popularity = rnorm(25,sd=.3)
colnames(mdata)[colnames(mdata)=="variable"] = "knows"
colnames(mdata)[colnames(mdata)=="value"] = "distance"
mdata = subset(mdata,ID!=knows)
a = exp(1/(mdata$distance/runif(nrow(mdata))^mdata$distance)+mdata$popularity+rnorm(nrow(mdata),sd=.001))
mdata$prlink = a/(1+a)
with(mdata,plot(distance,prlink))
mdata$link = runif(nrow(mdata))<mdata$prlink
mdata$ID = as.factor(mdata$ID)
mdata$knows = as.factor(mdata$knows)
mdata$dum=1 #this facilitates predicting from the population of the model, later
Now, I model the data:
library(mgcv)
mod = gam(link~s(distance)+s(ID,bs="re",by=dum)+s(knows,bs="re",by=dum),data=mdata,family=binomial(link="logit"))
plot(mod,pages=1)
summary(mod)
Now, I want to apply the fitted model to my main dataset:
x = runif(1000)*2
y = runif(1000)*2
d = cbind(ID = 1:1000,as.matrix(dist(data.frame(x,y))))
mdata <- melt(as.data.frame(d),id.vars = "ID")
colnames(mdata)[colnames(mdata)=="variable"] = "knows"
colnames(mdata)[colnames(mdata)=="value"] = "distance"
mdata = subset(mdata,ID!=knows)
mdata$dum=0; mdata$ID=1; mdata$knows=2 #These are needed for prediction, even though I am predicting from the population of the model, not one of the levels.
Some timekeeping tools...
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
type <- match.arg(type)
assign(".type", type, envir=baseenv())
if(gcFirst) gc(FALSE)
tic <- proc.time()[type]
assign(".tic", tic, envir=baseenv())
invisible(tic)
}
toc <- function()
{
type <- get(".type", envir=baseenv())
toc <- proc.time()[type]
tic <- get(".tic", envir=baseenv())
print(toc - tic)
invisible(toc)
}
tic()
p = predict(mod,newdata=mdata,type="response")
toc()
Just predicting the point estimates takes 31 seconds on my machine. Now to try to get the prediction intervals, first get the design matrix...
tic()
Designmat = predict(mod,newdata=mdata,type="lpmatrix")
toc()
That took me 47 seconds and froze my computer while it was working.
Now here is the technique for getting a prediction interval that I found on R-sig-mixed...
CAUTION: THE FOLLOWING CODE WILL ATTEMPT TO ALLOCATE A MASSIVE AMOUNT OF MEMORY AND MIGHT CRASH YOUR MACHINE.
tic()
predvar <- diag(Designmat %*% vcov(mod) %*% t(Designmat))
SE <- sqrt(predvar)
SE2 <- sqrt(predvar+mod$sig2)
tfrac <- qt(0.975, mod$df.residual)
interval = tfrac*SE2
toc()
>Error: cannot allocate vector of size 7435.7 Gb
Is there another way???
You need to avoid calculating Designmat %*% vcov(mod) %*% t(Designmat). You only need the diagonal after all. Try this:
tmp <- Designmat %*% vcov(mod)
library(compiler)
diagMult <- cmpfun(function(m1, m2) sapply(seq_len(nrow(m1)),
function(i) m1[i,] %*% m2[,i]))
predvar <- diagMult(tmp, t(Designmat))
(Not tested thoroughly. The function should be implemented with Rcpp to improve speed, if there is not already a compiled version in some package.)

Resources