Time series bootstrapping in R: How can I access each simulated path using tsbootstrap?

I want to perform a bootstrap analysis of a specific time series.
I am using the function tsbootstrap of the package tseries. My problem: for values of m > 1, I cannot access each bootstrapped path individually
(m: the length of the basic blocks in the block of blocks bootstrap, see ?tsbootstrap)
library(tseries)
set.seed(1)
TS <- sample(1:20)
tsbootstrap(TS,m=2, nb=1)
gives:
Error in tsbootstrap(TS, m = 2, nb = 1) :
can only return bootstrap data for m = 1
To my knowledge, the function can only compute some statistics (e.g. the mean) over all of the simulated trajectories, but I need each simulation itself. How can I get around this problem? (I am aware of the function tsboot in the package boot, but I have not been able to get it working yet.)

The b argument is the block length. m is the "block of blocks" argument for when you want to calculate statistics for each resampled series, rather than return each resampled series itself.
library(tseries)
# Simulate a time series
set.seed(1)
TS<-arima.sim(model=list(ar=c(.8,-.2)), n=20)
plot(TS)
# 3 bootstrap samples with block size b=5
TSboot = tsbootstrap(TS, m=1, b=5, type="block", nb=3)
# Here are the individual bootstrapped series
TSboot
Time Series:
Start = 1
End = 20
Frequency = 1
[,1] [,2] [,3]
1 -0.72571390 1.94273559 1.62729703
2 -0.36463539 2.00048877 0.34495502
3 -0.30236104 1.28640888 -2.26419528
...
18 0.96532247 -0.72571390 -0.36463539
19 1.59792898 -0.36463539 -0.30236104
20 1.67918002 -0.30236104 -1.63971414
plot(TSboot)
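Since nb = 3 returns a multivariate time series with one column per replicate, each simulated path can be pulled out by column; a small sketch using the TSboot object from above:
# each column of TSboot is one bootstrapped path
path1 <- TSboot[, 1]   # first simulated series
colMeans(TSboot)       # or compute any per-path statistic yourself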

Related

Average Mean w/ Forecast Horizon > 1 in R

I use the updated greybox package in R to forecast the next 2 values (horizon = 2) with a moving average scheme (see the first code block below), where the window size is equal to 3.
For example, the overall goal is to take the average (1+2+3)/3 = 2 as the forecasted value at horizon 1 (h=1) and then reuse that predicted value for h=2, where (2+3+2)/3 ≈ 2.333.
The following forecast origin will use the window (2+3+4), where 4 is the actual value, to predict the next h1 and h2, which equal 3 and approximately 3.333 respectively.
Yet, the prediction result I want, ValuesMA[[3]], only emits one row, i.e. the values for the first horizon. It should instead have as many rows as the predefined horizon, which is two.
I have code for an AR(1) process which works perfectly (second code block below). At the end I compute the MAE to evaluate the model.
Can anyone help?
Thank you!
This is the underlying code I use:
#data
z <- c(1,2,3,4,5,6,7)
ourCall <- "mean(x=data,n.ahead=h)"
ourValue <- c("pred")
# Return a list for a forecasting horizon h
ValuesMA <- ro(z, h=2, origins=3, call=ourCall, ci=TRUE, co=TRUE)
ValuesMA[[3]]
**Which yields:**
origin3 origin4 origin5
[1,] 2 3 4
**But I want:**
origin3 origin4 origin5
[1,] 2 3 4
[2,] 2.333 3.333 4.333
#data
z <- c(1,2,3,4,5,6,7)
# ci defines constant in-sample window size, co defines whether the holdout sample window size should be constant
ourCall <- "predict(arima(x=data,order=c(1,0,0)),n.ahead=h)"
# Ask for predicted values and the standard error
ourValue <- c("pred","se")
# Return a list for a forecasting horizon h with a rolling holdout size equal to origins
ValuesAR1 <- ro(z, h=2, origins=3, call=ourCall, value=ourValue, ci=TRUE, co=TRUE)
# calculate MAE
MAE_AR1 <- apply(abs(ValuesAR1$holdout - ValuesAR1$pred),1,mean,na.rm=TRUE) / mean(ValuesAR1$actuals)
ValuesAR1[[3]]
**Which yields:**
> ValuesAR1[[3]]
origin3 origin4 origin5
h1 2 3 4
h2 2 3 4
For further reading see: https://cran.r-project.org/web/packages/greybox/vignettes/ro.html
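One way to get both horizons with a moving average is to let the call do the recursion itself. Below is a minimal sketch, assuming a hypothetical helper maForecast() (not part of greybox) that feeds each prediction back into the window:
library(greybox)
z <- c(1,2,3,4,5,6,7)
maForecast <- function(data, h, order = 3) {
  x <- as.numeric(data)
  out <- numeric(h)
  for (i in seq_len(h)) {
    out[i] <- mean(tail(x, order))  # average of the last `order` values
    x <- c(x, out[i])               # feed the forecast back in for the next step
  }
  list(pred = out)                  # named element so ro() can extract it via value
}
ourCall <- "maForecast(data=data, h=h)"
ourValue <- "pred"
ValuesMA <- ro(z, h=2, origins=3, call=ourCall, value=ourValue, ci=TRUE, co=TRUE)
ValuesMA$pred   # should now contain one row per horizon (h1 and h2)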

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validation[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER functions. But I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]  # prediction based on x components
  confusion.mat = get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the results returned by get.BER. If you want all of that, replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
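A sketch of that variant, returning named elements so they are easier to pull out afterwards (assuming the same myPredictions and data_validation objects as above):
confmat_full <- function(x) {
  prediction <- myPredictions$class$max.dist[, x]
  confusion.mat <- get.confusion_matrix(truth = data_validation[, 1], predicted = prediction)
  list(prediction = prediction, confusion.mat = confusion.mat, BER = get.BER(confusion.mat))
}
results <- lapply(10:2, confmat_full)
results[[1]]$BER   # balanced error rate for 10 components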

Weighted Kmeans R

I want to do a Kmeans clustering on a dataset (namely, Sample_Data) with three variables (columns) such as below:
   A  B  C
1 12 10  1
2  8 11  2
3 14 10  1
...
In a typical workflow, after scaling the columns and determining the number of clusters, I would use this function in R:
Sample_Data <- scale(Sample_Data)
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)
But what if there is a preference among the variables? Suppose, for instance, that variable (column) A is more important than the other two variables.
How can I insert such weights into the model?
Thank you all
You have to use a weighted k-means clustering, like the one provided by the flexclust package:
https://cran.r-project.org/web/packages/flexclust/flexclust.pdf
The function
cclust(x, k, dist = "euclidean", method = "kmeans",
weights=NULL, control=NULL, group=NULL, simple=FALSE,
save.data=FALSE)
Perform k-means clustering, hard competitive learning or neural gas on a data matrix.
weights An optional vector of weights to be used in the fitting process. Works only in combination with hard competitive learning.
A toy example using iris data:
library(flexclust)
data(iris)
cl <- cclust(iris[,-5], k=3, save.data=TRUE,weights =c(1,0.5,1,0.1),method="hardcl")
cl
kcca object of family ‘kmeans’
call:
cclust(x = iris[, -5], k = 3, method = "hardcl", weights = c(1, 0.5, 1, 0.1), save.data = TRUE)
cluster sizes:
1 2 3
50 59 41
As you can see from the output of cclust, even when using competitive learning the family is still kmeans.
The difference is related to cluster assignment during the training phase:
If method is "kmeans", the classic kmeans algorithm as given by
MacQueen (1967) is used, which works by repeatedly moving all cluster
centers to the mean of their respective Voronoi sets. If "hardcl",
on-line updates are used (AKA hard competitive learning), which work
by randomly drawing an observation from x and moving the closest
center towards that point (e.g., Ripley 1996).
The weights parameter is just a sequence of numbers; in general I use numbers between 0.01 (minimum weight) and 1 (maximum weight).
I had the same problem and the answer here is not satisfying for me.
What we both wanted was an observation-weighted k-means clustering in R. A good readable example for our question is this link: https://towardsdatascience.com/clustering-the-us-population-observation-weighted-k-means-f4d58b370002
However, the solution of using the flexclust package is not satisfying, simply because the algorithm used is not the "standard" k-means algorithm but the "hard competitive learning" algorithm. The differences are well described above and in the package description.
I looked through many sites and did not find any solution/package in R for performing a "standard" k-means algorithm with weighted observations. I was also wondering why the flexclust package explicitly does not support weights with the standard k-means algorithm. If anyone has an explanation for this, please feel free to share!
So basically you have two options: first, rewrite the flexclust algorithm to enable weights within the standard approach; or second, estimate weighted cluster centroids as starting centroids, perform a standard k-means algorithm with only one iteration, then compute new weighted cluster centroids, perform a k-means with one iteration again, and so on until you reach convergence.
I used the second alternative because it was the easier way for me. I used the data.table package; I hope you are familiar with it.
rm(list = ls())
library(data.table)
### generate dataset with sample weights
dataset <- data.table(iris)
dataset[, weights := rep(c(1, 0.7, 0.3, 4, 5), 30)]
dataset[, Species := NULL]
### initial hclust for estimating weighted centroids
clustering <- hclust(dist(dataset[, c(1:4)], method = 'euclidean'),
                     method = 'ward.D2')
no_of_clusters <- 4
### estimating starting centroids (weighted)
weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                             ncol = ncol(dataset[, c(1:4)]))
for (i in (1:no_of_clusters)) {
  weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][cutree(clustering, k = no_of_clusters) == i, ],
                                    weighted.mean,
                                    w = dataset[cutree(clustering, k = no_of_clusters) == i, weights])
}
### performing weighted k-means as explained in my post
iter <- 0
cluster_i <- 0
cluster_iminus1 <- 1
## while loop: as long as the number of iterations is smaller than 50 and cluster_i
## (result of the current iteration) is not identical to cluster_iminus1 (result of
## the former iteration), continue
while (!identical(cluster_i, cluster_iminus1) && iter < 50) {
  # update iteration counter
  iter <- iter + 1
  # k-means with the weighted centroids and one iteration (may generate warning
  # messages as no convergence is reached)
  cluster_kmeans <- kmeans(x = dataset[, c(1:4)], centers = weighted_centroids,
                           iter.max = 1)$cluster
  # estimating new weighted centroids from the current k-means assignment
  weighted_centroids <- matrix(NA, nrow = no_of_clusters,
                               ncol = ncol(dataset[, c(1:4)]))
  for (i in (1:no_of_clusters)) {
    weighted_centroids[i, ] <- sapply(dataset[, c(1:4)][cluster_kmeans == i, ],
                                      weighted.mean,
                                      w = dataset[cluster_kmeans == i, weights])
  }
  # update cluster_i and cluster_iminus1
  if (iter == 1) {cluster_iminus1 <- 0} else {cluster_iminus1 <- cluster_i}
  cluster_i <- cluster_kmeans
}
## merge final clusters to data table
dataset[, cluster := cluster_i]
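A quick way to inspect the result afterwards, e.g. the size and total weight of each cluster (a small sketch on the dataset object built above):
dataset[, .(n = .N, weight_sum = sum(weights)), by = cluster]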
If you want to increase the weight of a variable (column), just multiply it by a constant c > 1.
It's trivial to show that this increases the weight in the SSQ optimization objective.
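A minimal sketch of that idea on the question's data, assuming column A should count roughly twice as much (the factor 2 is only an illustrative choice):
Sample_Data <- scale(Sample_Data)                # scale first, as in the question
Sample_Data[, "A"] <- 2 * Sample_Data[, "A"]     # up-weight variable A
output_kmeans <- kmeans(Sample_Data, centers = 5, nstart = 50)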

How to create a sliding window in R to divide data into test and train samples to test accuracy of forecasts?

We are using the forecast package in R to read 3 weeks worth of hourly data (3*7*24 data points) and make predictions for the next 24 hours. It's a time-series with multiple seasonality.
We have the forecast model running just fine and it seems to be doing well. Now, we wish to quantify the accuracy of our approach / forecasting algorithm for our data. We wish to use the accuracy function in the forecast package for this purpose. We understand that the accuracy function works such that if f is the forecast and x is the actual observation vector, then accuracy(f,x) gives us several accuracy measurements for this forecast.
We have data from the past several months and we wish to write a sliding window algorithm that picks (3*7*24) hourly values, predicts the next 24 hours, compares these predictions against the actual data for the next day (24 hours), displays the accuracy, slides the window forward by 24 points (one day) and repeats.
The sample data is generated as follows:
library("forecast")
time <- 1:(12*168)
set.seed(1)
ds <- msts(sin(2*pi*time/24)+c(1,1,1.2,0.8,1,0,0)[((time-1)%/%24)%%7+1]+ time/400+rnorm(length(time),0,0.2),seasonal.periods=c(24,168))
plot(ds)
head(ds)
tail(ds)
length(ds)
length(time)
Forecasting procedure is as follows:
model <- tbats(ds[1:504])
fcst <- forecast(model,h=24,level=90)
accuracy(fcst,ds[505:528]) ##Test accuracy of forecast against next/actual 24 values
Now, we wish to slide the "window" by 24 and repeat the same procedure, that is, the next set of values used to build the model will be ds[25:528] and their accuracy will be tested against ds[529:552] ... and so on. How can we implement this?
Also, is there a better way to test overall accuracy of this forecasting algorithm for our scenario?
I would do this by creating a vector of times representing the front edge of the sliding windows, then using lapply to iterate the forecasting and scoring process over the windows those edges imply. Like...
# set a couple of parameters we'll use to slice the series into chunks:
# window width (w) and the time step at which you want to end the first
# training set
w = 24 ; start = 504
# now use those parameters to make a vector of the time steps at which each
# window will end
steps <- seq(start + w, length(ds), by = w)
# using lapply, iterate the forecasting-and-scoring process over the
# windows those steps define
cv_list <- lapply(steps, function(x) {
  train <- ds[(x - w - start + 1):(x - w)]  # sliding training window of length `start`
  test <- ds[(x - w + 1):x]                 # the next 24 hours to score against
  model <- tbats(train)
  fcst <- forecast(model, h = w, level = 90)
  accuracy(fcst, test)
})
Example output for the first window:
> cv_list[[1]]
ME RMSE MAE MPE MAPE MASE
Training set 0.0001587681 0.3442898 0.2689754 34.3957362 84.30841 0.9560206
Test set 0.2619029897 0.8961109 0.7868256 -0.6832273 36.64301 2.7966186
ACF1
Training set 0.02588145
Test set NA
If you want summaries of the scores for the whole list, you can do something like...
rmse <- mean(unlist(lapply(cv_list, '[[', "Test set","RMSE")))
...which produces this:
[1] 1.011177
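To look at more than one measure at a time, a small sketch along the same lines stacks the whole "Test set" row of every window into a matrix and averages it (assuming cv_list from above):
test_scores <- do.call(rbind, lapply(cv_list, function(m) m["Test set", ]))
colMeans(test_scores, na.rm = TRUE)   # average of each accuracy measure across windows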

How to get top down forecasts using `hts::combinef()`?

I am trying to compare forecast reconciliation methods from the hts package on previously existing forecasts. The forecast.gts function is not available to me, since there is no computationally tractable way to create a user-defined function that returns the values in a forecast object. Because of this, I am using the combinef() function in the package to redistribute the forecasts. I have been able to work out the proper weights to get the wls and nseries methods, and the ols version is the default. I was able to get the "bottom up" method using:
# Creates sample forecasts, taken from `combinef()` example
library(hts)
h <- 12
ally <- aggts(htseg1)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# create the weight vector
numTS <- ncol(allf) # Get the total number of series
numBaseTS <- sum(tail(htseg1$nodes, 1)[[1]]) # Get the number of bottom level series
# Create weights of 0 for all aggregate ts and 1 for the base level
weightVals <- c(rep(0, numTS - numBaseTS), rep(1, numBaseTS))
y.f <- combinef(allf, htseg1$nodes, weights = weightVals)
I was hoping that something like making the first weight 1 and the rest 0 might give me one of the three top down forecasts, but that just results in a bunch of 0s or NaN values depending on how you try to look at it.
combinef(allf, htseg1$nodes, weights = c(1, rep(0, numTS - 1)))
I know the top down methods aren't the hardest thing to compute manually, and I can just write a function to do that, but are there any tools in the hts package that can help with this? I'd like to keep the data format consistent to simplify my analysis. Most specifically, I would like to get the "top down forecasted proportions" or tdfp method.
The functions to reconcile the forecasts using the "top-down" method are currently not exported. Probably I should export them to make the "top-down" results as tractable as combinef() in the next version. The workaround is as follows:
hts:::TdFp(allf, nodes = htseg1$nodes)
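If you want to compare it with the combinef() results on the same scale, a small sketch (assuming combinef()'s keep argument, which can return just the bottom-level forecasts as a matrix):
y.bu <- combinef(allf, htseg1$nodes, weights = weightVals, keep = "bottom")
y.tdfp <- hts:::TdFp(allf, nodes = htseg1$nodes)
dim(y.bu); dim(y.tdfp)   # both expected to be h x (number of bottom-level series)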
Hope it helps.
