How to get top down forecasts using `hts::combinef()`? - r

I am trying to compare forecast reconciliation methods from the hts package on previously existing forecasts. The forecast.gts function is not available to me since there is no computationally tractable way to create a user defined function that returns the values in a forecast object. Because of this, I am using the combinef() function in the package to redistribute the forecasts. I have been able to work of the proper weights to get the wls and nseries methods, and the ols version is the default. I was able to get the "bottom up" method using:
# Creates sample forecasts, taken from `combinef()` example
library(hts)
h <- 12
ally <- aggts(htseg1)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# create the weight vector
numTS <- ncol(allf) # Get the total number of series
numBaseTS <- sum(tail(htseg1$nodes, 1)[[1]]) # Get the number of bottom level series
# Create weights of 0 for all aggregate ts and 1 for the base level
weightVals <- c(rep(0, numTS - numBaseTS), rep(1, numBaseTS))
y.f <- combinef(allf, htseg1$nodes, weights = weightVals)
I was hoping that something like making the first weight 1 and the rest 0 might give me one of the three top down forecast, but that just results in a bunch of 0s or NaN values depending on how you try to look at it.
combinef(allf, htseg1$nodes, weights = c(1, rep(0, numTS - 1)))
I know the top down methods aren't the hardest thing to compute manually, and I can just write a function to do that, but are there any tools in the hts package that can help with this? I'd like to keep the data format consistent to simplify my analysis. Most specifically, I would like to get the "top down forcasted proportions" or tdfp method.

The functions to reconcile the forecasts using the "top-down" method are currently not exported. Probably I should export them to make the "top-down" results as tractable as combinef() in the next version. The workaround is as follows:
hts:::TdFp(allf, nodes = htseg1$nodes)
Hope it helps.

Related

How to generate a multivariate spline basis in R?

I want to obtain a multivariate spline basis using R. I do not know how to do it properly or the best approach for this. According to my limited research on the Internet, I think that the package that can help me is mgcv and the functions ti and smooth.construct.tensor.smooth.spec but I am not sure.
The structure of my data is simple. I have two vectors xdata and alphadata generated as
n = 200
T = 2
xdata = as.matrix(rnorm(T*n),T*n,1)
tau = seq(-2,2,by=0.1)
tau = as.matrix(tau,length(tau),1)
So basically I have two vectors xdata and alphadata of dimension n*T and 41, respectively. My goal is then obtain a spline basis (for example a cubic spline) which should be a function of both b(alphadata,xdata).
What I have tried so far is something like this
xdata_data <- data.frame("xdata" = xdata[,1])
tau_data <- data.frame("tau" = tau[,1])
basisobj1 <- ti(tau_data, xdata_data, bs = 'cr', k = c(6, 6), fx = TRUE) #cr:cubic regression splines
xdata_data <- data.frame("xdata_data" = xdata[,1])
tau_data <- data.frame("tau_data" = tau[,1])
basisobj2 <- smooth.construct.tensor.smooth.spec(basisobj1, data = c(tau_data,xdata_data), knots = NULL)
basis <- basisobj2[["X"]]
Note that I manipulated my data, otherwise I get some errors with smooth.construct.tensor.smooth.spec.
My questions are:
(1) With the previous approach I am doing what I want?
(2) Is this a smart approach to do what I want?
(3) When I do the above, the number of rows of basis is 41 but shouldn't the number of rows of basis be equal to the product of dimensions of xdata and alphadata as the basis is a function of two vectors?

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this seperately for 10 times, but I want do that a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible there is no way to test this but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusionmat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line to the function with return(list(prediction, confusionmat, get.BER(confusion.mat)). This will produce a list of the lists so that results[[1]][[1]] will be the results of prediction for 10 PCs and results[[1]][[2]] and results[[1]][[3]] will be confusionmat and get.BER(confusion.mat) respectively.

Simulate over an xts object in R

I am looking for a function, or package, that will help me with this goal. I've looked through several packages but can't find what I am looking for:
Lets say I have an xts object with 10 columns and 250 rows.
What I want to do is run a simulation, such that I get a robust calculation of my performance metric over the period.
So, lets say that I have 250 data points, I want to run x number of simulations over random samples of the data computing the Sharpe Ratio using the function (PerformanceAnalytics::SharpeRatio) varying the random samples to be lengths 30-240, and then find the average. Keep in mind I want to do this for every column and I'd rather not have to use apply if possible. I'd also like to find something that processes the information rather quickly.
What package or functions would best serve this purpose?
Thank you!
Subsetting xts objects for the rows you want to randomly sample should be good enough, performance wise, if that is your main concern. If you want some other concrete examples, you may find it useful to look at the monte carlo simulation functions recently added to the R blotter package:
https://github.com/braverock/blotter/blob/master/R/mcsim.R
Your requirements are quite detailed and a little tricky to follow, but I think this example may be what you're after?
This solution does use apply functions though! Because it just makes life easier. If you don't use lapply, the code will expand quickly and distract from achieving the goal quickly (and you risk introducing bugs with longer, messier code; one reason to use apply family functions where you can).
library(quantmod)
library(PerformanceAnalytics)
# Set up the data:
syms <- c("GOOG", "FB", "TSLA", "SNAP", "MU")
getSymbols(syms)
z <- do.call(merge, lapply(syms, function(s) {
x <- get(s)
dailyReturn(Cl(x))
}))
# Here we have 250 rows, 5 columns:
z <- tail(z, 250)
colnames(z) <- paste0(syms, ".rets")
subSample <- function(x, n.sub = 40) {
# Assuming subsampling by row, preserving all returns and cross symbol dependence structure at a given timestamp
ii <- sample(1:NROW(x), size = n.sub, replace = FALSE)
# sort in order to preserve time ordering?
ii <- sort(ii)
xs <- x[ii, ]
xs
}
set.seed(5)
# test:
z2 <- subSample(z, n.sub = 40)
zShrp <- SharpeRatio(z2)[1, ]
# now run simulation:
nSteps <- seq(30, 240, by = 30)
sharpeSimulation <- function(x, n.sub) {
x <- subSample(x, n.sub)
SharpeRatio(x)[1, ]
}
res <- lapply(nSteps, FUN = sharpeSimulation, x = z)
res <- do.call(rbind, res)
resMean <- colMeans(res)
resMean
# GOOG.rets FB.rets TSLA.rets SNAP.rets MU.rets
# 0.085353854 0.059577882 0.009783841 0.026328660 0.080846592
Do you realise that SharpeRatio uses sapply? And it's likely other performance metrics you want to use will as well. Since you seem to have something against apply (possibly all apply functions in R), this might be worth noting.

How to do top down forecasted proportions for hts objects with 2 levels?

I had previously asked this question trying to get top down forecasted proportions forecast recombination using the hts package. The solution there works great for multilevel hierarchies, however I have found I get an error when I try to use the solution on a two level hierarchy.
library(hts)
# Create the hierarchy
newhts <- hts(htseg1$bts, list(ncol(htseg1$bts)))
# forecast creation adapted from the `combinef()` example
h <- 12
ally <- aggts(newhts)
allf <- matrix(NA, nrow = h, ncol = ncol(ally))
for(i in 1:ncol(ally))
allf[,i] <- forecast(auto.arima(ally[,i]), h = h, PI = FALSE)$mean
allf <- ts(allf, start = 51)
# Earo Wang's solution to my previous question
hts:::TdFp(allf, nodes = htseg1$nodes)
Error in *.default(fcasts[, 1L], prop) : time-series/vector length mismatch
The problem seems to arise because a two level hierarchy skips the last if conditional with the condition if (l.levels > 2L). The last statement of this conditional multiplies includes a piece where prop is multiplied by the time series flist[[k + 1L]], which converts prop into a time series matrix. When this statement is skipped, prop remains a regular matrix causing the error when the time series vector fcasts[, 1L] is multiplied by the matrix prop.
I understand that TdFp is a non exported function and therefore may not be as robust as the other functions in the package, but is there any way around this problem? Since it is a relatively simple case, I can code a solution myself, but since hts::forecast.hts() can handle two level hierarchies for method = "tdfp", I thought there might be a nice clean solution.

poLCA - Latent Class how to do the adjusted Lo-Mendell-Rubin (LMR) test with R

Good afternoon,
I am trying to perform Lo, Mendell and Rubin's (2001) adjusted test (LMR) in order to decide the optimal number of classes in LCA. I performed the command with poLCA, but I didn't find any command to perform it.
Is there someone that can help me?
Thank you very much!
Here is an example of a (ad-hoc adjusted) LMR test comparing a LCA with 3 groups (alternative model) against 2 groups (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 2, verbose = F)
# store values baseline model
n <- mod_null$Nobs #number of observations (should be equal in both models)
null_ll <- mod_null$llik #log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 3, verbose = F)
# Store values alternative model
alt_ll <- mod_alt$llik #log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
Wow really late to the game but as Im looking at similar things Ill leave for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-sq test to determine if K classes is a better fit than K-1 classes... basically.
However there is reasonable research out there suggesting that a better measure of this is the bootstrap likelihood ratio.
The former is still in common use with MPlus users, the latter is far more common in LCA packages in R, e.g. mclust. Dunno about poLCA though...

Resources