Export or save a tsclust model in R

I am doing time-series clustering on a large dataset (3,000 time series with > 50 points each). I was wondering whether, once I have finished the analysis, it would be possible to:
export the model so that I can cluster new series quickly;
export the "centroids" so that I can use them as templates for matching new series.
A simple MRE could look like this:
library(dtwclust)
my_matrix <- matrix(rnorm(1000), 100, 10)
k <- 5
clustering <- tsclust(my_matrix, k = k, trace = TRUE)
# somehow saveRDS(clustering, "clustering.rds")
# new series coming
y <- rnorm(10)
my_clusters <- readRDS("clustering.rds")
# Somehow clusternew(y, my_clusters)
Thank you,
S
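For reference, a minimal sketch of how this could work, assuming the dtwclust package: tsclust() returns a TSClusters object, which (as far as I know) supports predict() for assigning new series and stores the centroids in its @centroids slot. The clusternew() above is hypothetical; predict() plays that role here.
library(dtwclust)
my_matrix <- matrix(rnorm(1000), 100, 10)
clustering <- tsclust(my_matrix, k = 5, trace = TRUE)
# persist the fitted model and restore it later
saveRDS(clustering, "clustering.rds")
my_clusters <- readRDS("clustering.rds")
# assign a new series to the closest existing cluster
y <- rnorm(10)
predict(my_clusters, newdata = matrix(y, nrow = 1))
# the centroids, e.g. as templates for matching new series
my_clusters@centroids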

Related

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda <- plsda(X, Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[, 10] # prediction based on 10 components
confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = TRUE, name.vector = TRUE, factors.as.char = TRUE)
...and then using lapply with the get.confusion_matrix and get.BER functions. But I don't know how to do that. I have searched the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x] # prediction based on x components
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of components, so results[[1]] will be the result for 10 components. You will not get values for prediction or confusion.mat unless they are included in what get.BER returns. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 components, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
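As a sketch, that extended version could look like this (the named list elements and the names() call are additions for readability, not from the original code):
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  list(prediction = prediction, confusion.mat = confusion.mat, BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat)
names(results) <- paste0("ncomp_", 10:2)
results$ncomp_10$BER # BER for the 10-component model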

How to forecast and fit the optimal model for multiple time series?

I want to do batch forecasting among multiple series. For example, if I want to forecast the time series with IDs that end with 1 (1, 11, 21, 31, ...), how can I do that?
Since you did not provide detailed information, I am not sure which forecasting method you want to use, so here is an example with a univariate time series model.
Load required packages:
library(forecast)
library(dplyr)
We use example data from Rob Hyndman:
dta <- read.csv("https://robjhyndman.com/data/ausretail.csv")
Now change the column names:
colnames(dta) <- c("date", paste0("tsname_", seq_len(ncol(dta[,-1]))))
Select the time series whose names end with 1:
dta_ends_with1 <- dplyr::select(dta, dplyr::ends_with("1"))
Create a ts object:
dta_ends_with1 <- ts(dta_ends_with1, start = c(1982,5), frequency = 12)
Specify how many steps ahead you want to forecast; here I set it to 6:
h <- 6
Now we prepare a matrix to save the forecast:
fc <- matrix(NA, ncol = ncol(dta_ends_with1), nrow = h)
The forecasting loop:
for (i in seq_len(ncol(dta_ends_with1))) {
  fc[, i] <- forecast::forecast(forecast::auto.arima(dta_ends_with1[, i]),
                                h = h)$mean
}
Set the column names:
colnames(fc) <- colnames(dta_ends_with1)
head(fc)
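To inspect the result visually, a minimal sketch (an addition here, assuming you keep the monthly frequency set above) turns the forecast matrix into a ts object starting right after the sample and plots all series:
fc_ts <- ts(fc, start = tsp(dta_ends_with1)[2] + 1/12, frequency = 12)
plot(fc_ts)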

Time series prediction with and without NAs (ARIMA and Forecast package) in R

This is my first question on Stack Overflow.
Situation: I have two time series. Both series have the same values, but the second series has 5 NAs at the start. Hence, the first series has 105 observations, whereas the second has 110. I have fitted an ARIMA(0,1,0) model to each series separately using the Arima function, and then used the forecast package to predict 10 steps into the future.
Issue: Even though the ARIMA coefficients for both series are the same, the projections (10 steps) differ. I am uncertain why this is the case. Has anyone come across this before? Any guidance is highly appreciated.
Tried: I tried setting a seed, creating the index manually, and using auto.arima for the model fitting. However, none of these steps helped me reconcile the difference.
I have added a picture to show what I see. Please note I have hidden the middle part of the series so that you can see the start and the end. The yellow highlighted cells are the projection outputs from the forecast package. I manually added the year index after extracting the results from R.
[Image: projected and base time series in Excel]
Rates <- read.csv("Rates_for_ARIMA.csv")
set.seed(123)
#ARIMA with NA
Simple_Arima <- Arima(
  ts(Rates$Rates1),
  order = c(0, 1, 0),
  include.drift = TRUE)
fcasted_Arima <- forecast(Simple_Arima, h = 10)
fcasted_Arima$mean
#ARIMA Without NA
Rates2 <- as.data.frame(Rates$Rates2)
## Remove the trailing empty rows from the CSV
Rates2 <- Rates2[-c(106,107,108,109,110),]
Simple_Arima2 <- Arima(
  ts(Rates2),
  order = c(0, 1, 0),
  include.drift = TRUE)
fcasted_Arima2 <- forecast(Simple_Arima2, h = 10)
fcasted_Arima2$mean
The link to the data is here (CSV format).
Could you share your data and code such that others can see if there is any issue with it?
I tried to come up with an example and got the same results for both series, one that includes NAs and one that doesn't.
library(forecast)
set.seed(123)
ts1 <- arima.sim(model = list(order = c(0, 1, 0)), n = 105)
ts2 <- ts(c(rep(NA, 5), ts1), start = 1)
fit1 <- forecast::Arima(ts1, order = c(0, 1, 0))
fit2 <- forecast::Arima(ts2, order = c(0, 1, 0))
pred1 <- forecast::forecast(fit1, 10)
pred2 <- forecast::forecast(fit2, 10)
forecast::autoplot(pred1)
forecast::autoplot(pred2)
> all.equal(as.numeric(pred1$mean), as.numeric(pred2$mean))
[1] TRUE
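One thing the example above omits is the include.drift = TRUE from the question. As a sketch (an addition here, not part of the original answer), the same comparison with a drift term would be:
fit1d <- forecast::Arima(ts1, order = c(0, 1, 0), include.drift = TRUE)
fit2d <- forecast::Arima(ts2, order = c(0, 1, 0), include.drift = TRUE)
all.equal(as.numeric(forecast::forecast(fit1d, 10)$mean),
          as.numeric(forecast::forecast(fit2d, 10)$mean))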

Plotting Forecast and Real values in one plot using a Rolling Window

I have code which takes the yield spread (dependent variable) and forward rates (independent variable) as input and runs auto.arima to get the orders. Afterwards, I forecast the next 25 dates (forc.horizon). My training data are the first 600 observations (training). Then I move the time window by 25 dates, i.e. I use the data from 26 to 625, estimate auto.arima, then forecast the data from 626 to 650, and so on. My data sets have 2298 rows (dates) and 30 columns (maturities).
I want to store all of the forecasts and then plot the forecasted and real values in the same plot.
This is the code I have, but it doesn't store the forecasts in a way I can plot later.
forecast.func <- function(NS.spread, ind.v, maturity, training, forc.horizon){
  NS.spread <- NS.spread/100
  forc <- c()
  j <- 0
  for(i in 1:floor((nrow(NS.spread)-training)/forc.horizon)){
    # test data
    y <- NS.spread[(1+j):(training+j), maturity]
    f <- ind.v[(1+j):(training+j), maturity]
    # auto.arima
    c <- auto.arima(y, xreg = f, test = "adf")
    # forecast
    e <- ind.v[(training+j+1):(training+j+forc.horizon), maturity]
    h <- forecast(c, xreg = lagmatrix(e, -1))
    forc <- c(forc, list(h))
    j <- j + forc.horizon
  }
  return(forc)
}
a <- forecast.func(spread.NS.JPM, Forward.rate.JPM, 10, 600, 25)
lapply(a, plot)
Here's a link to my two datasets:
https://drive.google.com/drive/folders/1goCxllYHQo3QJ0IdidKbdmfR-DZgrezN?usp=sharing
LOOK AT THE END for a fully functional example of how to handle an AUTO.ARIMA MODEL with DAILY DATA using XREG and FOURIER SERIES with ROLLING STARTING TIMES and cross-validated training and test sets.
Without a reproducible example no one can help you, because they can't run your code. You need to provide data. :-(
Even though statistics matters are not really part of Stack Overflow, why don't you run auto.arima with xreg instead of lm + auto.arima on the residuals? Especially considering how you forecast at the end, that training method looks really wrong. Consider using:
fit <- auto.arima(y, xreg = lagmatrix(f, -1))
h <- forecast(fit, xreg = lagmatrix(e, -1))
auto.arima will automatically calculate the best parameters by maximum likelihood.
On your coding question:
forc <- c() should be outside of the for loop; otherwise you delete your previous results on every iteration.
The same goes for j <- 0: inside the loop it would be reset to 0 on every iteration. Put it outside the loop if you need to change its value on each run.
The output of forecast() is an object of class forecast, which is actually a kind of list. Therefore, you can't use cbind effectively.
In my opinion, you should create forc as a list: forc <- list()
and build your final results in this way:
forc <- c(forc, list(h)) # instead of forc <- cbind(forc, h)
This will create a list of objects of class forecast.
You can then plot them with a for loop, accessing each object in turn, or with lapply:
lapply(output_of_your_function, plot)
This is as far as I can go without a reproducible example.
FINAL EDIT
FULL FUNCTIONAL EXAMPLE
Here I try to sum up a conclusion from the million comments we wrote.
With the data you provided, I built code that can handle everything you need:
from training and test sets to the model, through forecasting, and finally plotting with time on the x axis, as required in one of your comments.
I removed the for loop; lapply is much better for your case.
You can keep the Fourier series if you want. That is how Professor Hyndman suggests handling daily time series.
Functions and libraries needed:
# libraries ---------------------------
library(forecast)
library(lubridate)
# run model -------------------------------------
.daily_arima_forecast <- function(init, training, horizon, tt, ..., K = 10){
  # create training and test sets
  tt_trn <- window(tt, start = time(tt)[init], end = time(tt)[init + training - 1])
  tt_tst <- window(tt, start = time(tt)[init + training], end = time(tt)[init + training + horizon - 1])
  # add Fourier series [if you want to; otherwise, remove this part]
  fr  <- fourier(tt_trn[,1], K = K)
  frf <- fourier(tt_trn[,1], K = K, h = horizon)
  tsp(fr) <- tsp(tt_trn)
  tsp(frf) <- tsp(tt_tst)
  tt_trn <- ts.intersect(tt_trn, fr)
  tt_tst <- ts.intersect(tt_tst, frf)
  colnames(tt_tst) <- colnames(tt_trn) <- c("y", "s", paste0("k", seq_len(ncol(fr))))
  # run the model and forecast
  aa <- auto.arima(tt_trn[,1], xreg = tt_trn[,-1])
  fcst <- forecast(aa, xreg = tt_tst[,-1])
  # add the actual values so we can plot them later!
  fcst$test.values <- tt_tst[,1]
  # NOTE: since I modified the structure of the class forecast I should create a new class,
  # but I didn't want to complicate your code
  fcst
}
daily_arima_forecast <- function(y, x, training, horizon, ...){
  # set up x and y together
  tt <- ts.intersect(y, x)
  # set up all the starting points of the training sets [name them so you can recognize them later]
  inits <- setNames(nm = seq(1, length(y) - training, by = horizon))
  # remove the last one because there would not be enough data after it
  inits <- inits[-length(inits)]
  # run the model and return a list of all your models
  lapply(inits, .daily_arima_forecast, training = training, horizon = horizon, tt = tt, ...)
}
# plot ------------------------------------------
plot_daily_forecast <- function(x){
  autoplot(x) + autolayer(x$test.values)
}
A reproducible example of how to use the previous functions:
# create sample data
tsp(EuStockMarkets) <- c(1991, 1991 + (1860 - 1)/365.25, 365.25)
# model
models <- daily_arima_forecast(y = EuStockMarkets[,1],
                               x = EuStockMarkets[,2],
                               training = 600,
                               horizon = 25,
                               K = 5)
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[1]]
An example for the author of the post:
# your data
load("BVIS0157_Forward.rda")
load("BVIS0157_NS.spread.rda")
spread.NS.JPM <- spread.NS.JPM / 100
# pre-work [outside of the function!!!]
set_up_ts <- function(m){
  start <- min(row.names(m))
  end <- max(row.names(m))
  # daily sequence
  inds <- seq(as.Date(start), as.Date(end), by = "day")
  ts(m, start = c(year(start), as.numeric(format(inds[1], "%j"))), frequency = 365.25)
}
mts_spread.NS.JPM <- set_up_ts(spread.NS.JPM)
mts_Forward.rate.JPM <- set_up_ts(Forward.rate.JPM)
# model
col <- 10
models <- daily_arima_forecast(y = mts_spread.NS.JPM[, col],
                               x = stats::lag(mts_Forward.rate.JPM[, col], -1),
                               training = 600,
                               horizon = 25,
                               K = 5) # note that K is passed through ... directly to the inner function
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[5]]

Speed up functions involving (s)apply

I've profiled my code using the lineprof package and identified the bottlenecks as the three functions perm.stat.list, G.hat, and emp.FDR. Based on the profiler output, the common theme seems to be the use of (s)apply.
Below is a simplified version of my functions, along with code to generate a reproducible example involving the three functions. I've added comments to better explain what each function is doing and the inputs required.
I'd like to speed up my code considerably, because even with B = 10 the process takes almost half an hour. The input is a large matrix (10000 x 10000), so speed is important. Ideally, I'd like to run B = 5000 permutations, which increases the computation time further.
Any tips to improve my code are greatly appreciated.
### Functions ###
perm.stat.list <- function(samp.dat, N1, N2, B){
  perm.list = NULL
  for (b in 1:B){
    # Permute the row "labels", preserving information across columns
    perm.dat.tmp = samp.dat[sample(nrow(samp.dat)), ]
    # Compute the permutation-based test statistics
    # Need to save each (1 x M) permutation vector into a list
    perm.list[[b]] = apply(perm.dat.tmp, 2, function(y) t.test(y[1:N1], y[(N1+1):(N1+N2)])$statistic)
  }
  return(perm.list)
}
G.hat = function(perm.mat, t){
  # Number of permutations
  B = nrow(perm.mat)
  # Compute an empirical distribution along each COLUMN of the permutation matrix
  out = apply(perm.mat, 2, function(x) sum(x > t, na.rm = TRUE))/B
  return(out)
}
emp.FDR <- function(t.vec, mat){
  # For each value in t.vec, apply the G.hat function
  out = sapply(t.vec, function(i) sum(G.hat(mat, i), na.rm = TRUE)/max(sum(t.vec > i, na.rm = TRUE), 1))
  return(out)
}
### Generate reproducible example ###
### Global variables ###
#Sample sizes (rows)
N1=3000
N2=7000
#Number of columns
M = 10000
#Number of permutations
B = 10
### Data ###
set.seed(1)
X1 = matrix(rnorm(N1*M),ncol=M)
X2 = matrix(rnorm(N2*M),ncol=M)
### Combine data in one large matrix of size (N1+N2) rows x M columns ###
samp.dat = rbind(X1,X2)
### Compute statistic for each column of samp.dat ###
t.stats = apply(samp.dat, 2,
                function(x) t.test(x[1:N1], x[(N1+1):(N1+N2)])$statistic)
### Sort t.stats in decreasing order (not necessarily needed for computation) ###
t.vec = sort(t.stats,decreasing=TRUE)
### Permutation matrix based on the data ###
### (perm.stat.list returns a list of B vectors; bind them into a B x M matrix, as G.hat expects) ###
perm.mat = do.call(rbind, perm.stat.list(samp.dat = samp.dat, N1 = N1, N2 = N2, B = B))
eFDR = emp.FDR(t.vec = t.vec, mat = perm.mat)
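Since the profiler points at the column-wise t.test calls, here is a minimal sketch of one way to remove that innermost apply (col.t.stats is a helper introduced here, not from the original code, and it assumes the Welch statistic that t.test computes by default):
col.t.stats <- function(dat, N1, N2){
  g1 <- dat[1:N1, , drop = FALSE]
  g2 <- dat[(N1+1):(N1+N2), , drop = FALSE]
  m1 <- colMeans(g1); m2 <- colMeans(g2)
  # column-wise sample variances without apply()
  v1 <- colSums(sweep(g1, 2, m1)^2)/(N1 - 1)
  v2 <- colSums(sweep(g2, 2, m2)^2)/(N2 - 1)
  (m1 - m2)/sqrt(v1/N1 + v2/N2)
}
# should match the apply() version up to numerical precision
all.equal(t.stats, col.t.stats(samp.dat, N1, N2), check.attributes = FALSE)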
