How to ensemble forecasts in R using weights

I have a couple of forecasts and am trying to figure out how to merge the two according to some widely used criterion.
In part one, I split the data and compared the forecasts against the actual values using Forecast_comb.
library(forecast)
library(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all
?Forecast_comb
fitted <- Forecast_comb(obs = test, fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
In part two, I rebuilt both models on the entire series and forecast ten values ahead. How can one merge the two sets of forecasts according to some criterion?
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted
Thanks for the help
The reason I am using ForecastCombinations is that it includes procedures for popular combination strategies. I thought that function could perhaps be modified to perform the desired ensembling.

Based on a lot of Kaggle competitions where people share/discuss their scripts, I'd say that by far the most common and most effective way is simply to manually weight and add your predictions.
pacman::p_load(forecast)
pacman::p_load(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all
?Forecast_comb
fitted_1 <- Forecast_comb(obs = test, fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_1
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 10)$mean
forc4 = forecast(fit4, h = 10)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted_2 <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all), Averaging_scheme = "best")$fitted
fitted_2
# By far the most common way to combine/weight is simply:
fitted <- fitted_2*.5+fitted_1*.5
fitted
One might ask whether you should use equal weights, or how to choose the weights. This is usually determined by:
(a) naive, equal weighting, if that's all you have time for and it seems to work fine;
(b) iterating with a holdout or cross-validation sample(s), being careful not to overfit (see the sketch after this list).
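For instance, here is a minimal sketch of option (b), reusing test, forc1/forc2 (the holdout forecasts) and forc3/forc4 (the full-sample forecasts) from the code above; the grid and the RMSE criterion are just illustrative choices, not the only ones possible:
# Illustrative sketch: pick the weight w that minimises RMSE on the holdout sample
w_grid <- seq(0, 1, by = 0.01)
rmse <- sapply(w_grid, function(w) {
  comb <- w * forc1 + (1 - w) * forc2
  sqrt(mean((test - comb)^2))
})
w_best <- w_grid[which.min(rmse)]
w_best
# reuse the learned weight on the forecasts from the models refit on all the data
forc_weighted <- w_best * forc3 + (1 - w_best) * forc4
forc_weighted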
Some people take fancier approaches. It's easy to mess that up, but if you get it right it can lead to a better forecast.
The model-based (and other fancier) approaches amount to adding another stage to the modelling process, in which your predictions on a holdout sample become the X matrix and the outcome variable is the actual y for that sample.
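A minimal sketch of that stacked, model-based idea, again reusing the objects from the code above: the holdout forecasts act as the X matrix, the holdout actuals as y, and a plain lm() stands in for whatever second-stage learner you prefer.
# Illustrative stacking sketch: learn a combination on the holdout, apply it to new forecasts
stack_df  <- data.frame(y  = as.numeric(test),
                        f1 = as.numeric(forc1),
                        f2 = as.numeric(forc2))
stack_fit <- lm(y ~ f1 + f2, data = stack_df)
# combine the forecasts from the models refit on the full series
new_forc  <- data.frame(f1 = as.numeric(forc3), f2 = as.numeric(forc4))
stacked   <- predict(stack_fit, newdata = new_forc)
stacked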
Also, check out Erin LeDell's approach in h2oEnsemble.

Related

MLR - calculating feature importance for bagged, boosted trees (XGBoost)

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with the mlr package in R. I am using xgboost to make predictions and bagging to estimate prediction uncertainty. My dataset is relatively large: approximately 10k features and observations.
The predictions work perfectly (see the code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session.
I saw some related Python code where people seem to calculate the importance for each of the bagged models here and here. I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access individual models within the object produced by mlr (the mb object in the code below). In Python this seems to be trivial; in R I can't seem to extract mb$learner.model, which seems logically closest to what I need. So I'm wondering if anyone has any experience with this issue?
Please see the code below
learn1 <- makeRegrTask(data = train.all, target = "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list(objective = "reg:squarederror", eval_metric = "error",
                         nrounds = 300, gamma = 0, booster = "gbtree", max.depth = 6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1 it might be feasible to average the feature importance values.
Basically you just have to apply over all the single models that are stored in the HomogeneousEnsembleModel. Some extra care is necessary because the order of the features gets mixed up by the sampling, even though we set it to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list(objective = "reg:squarederror", eval_metric = "error",
                        nrounds = 50, gamma = 0, booster = "gbtree", max.depth = 6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
  fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
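If you also want to inspect an individual bagged model (the part of the question about accessing single models inside the mlr object), the same slot can be indexed directly; a small sketch, assuming the mb object fitted above:
# Pull out the first of the bagged models (a WrappedModel) ...
single <- mb$learner.model$next.model[[1]]
getFeatureImportance(single)$res  # importance for that single model
# ... and, if needed, the underlying fitted xgboost booster itself
getLearnerModel(single)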

R object is not a matrix

I am new to R and am trying to save my SVM model; I have read the documentation but still do not understand what is wrong.
I am getting the error "object is not a matrix", which would seem to mean that my data is not a matrix, but it is... so something is missing.
My data is defined as:
data = read.table("data.csv")
trainSet = as.data.frame(data[,1:(ncol(data)-1)])
where the last column is my label.
I am trying to define my model as:
svm.model <- svm(type ~ ., data=trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
This seems like it should be correct but I am having trouble finding other examples.
Here is my code so far:
# load libraries
require(e1071)
require(pracma)
require(kernlab)
options(warn=-1)
# load dataset
SVMtimes = 1
KERNEL="polynomial"
DEGREE = 2
data = read.table("head.csv")
results10foldAll=c()
# Cross Fold for training and validation datasets
for (timesRun in 1:SVMtimes) {
  cat("Running SVM = ", timesRun, " result = ")
  trainSet = as.data.frame(data[, 1:(ncol(data) - 1)])
  trainClasses = as.factor(data[, ncol(data)])
  model = svm(trainSet, trainClasses, type = "C-classification",
              kernel = KERNEL, degree = DEGREE, coef0 = 1, cost = 1,
              cachesize = 10000, cross = 10)
  accAll = model$accuracies
  cat(mean(accAll), "/", sd(accAll), "\n")
  results10foldAll = rbind(results10foldAll, c(mean(accAll), sd(accAll)))
}
# create model
svm.model <- svm(type ~ ., data = trainSet, type='C-classification', kernel='polynomial',scale=FALSE)
An example of one of my samples would be:
10.135338 7.214543 5.758917 6.361316 0.000000 18.455875 14.082668 31
Here, trainSet is a data frame, but your svm() call expects data to be a matrix (you are assigning trainSet to data). Hence, set data = as.matrix(trainSet). This should work fine.
Indeed, as pointed out by @user5196900, you need a matrix to run svm(). However, beware that a matrix object means all columns have the same data type: all numeric or all categorical/factors. If this is true for your data, as.matrix() may be fine.
In practice, more often than not people want model.matrix() or sparse.model.matrix() (from the Matrix package), which gives dummy columns for categorical variables while keeping a single column for each numerical variable. But it is indeed a matrix.
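For instance, a tiny illustration with a made-up data frame (not the asker's head.csv) of how the two differ:
# as.matrix() coerces mixed columns to a single type (here: character),
# while model.matrix() produces a numeric matrix with dummy columns for factors
df <- data.frame(num = c(1.5, 2.0, 3.2),
                 cat = factor(c("a", "b", "a")))
as.matrix(df)                      # all character -- usually not what svm() wants
model.matrix(~ . - 1, data = df)   # numeric, with 'cat' expanded to dummy columns
# Matrix::sparse.model.matrix(~ . - 1, data = df) is the sparse equivalent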

R e1071 SVM leave one out cross validation function result differ from manual LOOCV

I'm using the e1071 svm() function to classify my data.
I tried two different ways of doing LOOCV.
The first one is like this:
svm.model <- svm(mem ~ ., data, kernel = "sigmoid", cost = 7, gamma = 0.009, cross = subSize)
# start from the true labels, then flip the label wherever the
# corresponding cross-validation fold was misclassified (accuracy == 0)
svm.pred = data$mem
svm.pred[which(svm.model$accuracies == 0 & svm.pred == 'good')] = NA
svm.pred[which(svm.model$accuracies == 0 & svm.pred == 'bad')] = 'good'
svm.pred[is.na(svm.pred)] = 'bad'
conMAT <- table(pred = svm.pred, true = data$mem)
summary(svm.model)
I set cross to the number of subjects to perform LOOCV, but the classification result differs from my manual version of LOOCV, which looks like this:
# CMAT and CORR are assumed to be initialised before the loop
for (i in 1:subSize) {
  data_Tst <- data[i, 1:dSize]
  data_Trn <- data[-i, 1:dSize]
  svm.model1 <- svm(mem ~ ., data = data_Trn, kernel = "linear", cost = 2, gamma = 0.02)
  svm.pred1 <- predict(svm.model1, data_Tst[, -dSize])
  conMAT <- table(pred = svm.pred1, true = data_Tst[, dSize])
  CMAT <- CMAT + conMAT
  CORR[i] <- sum(diag(conMAT))
}
In my opinion, LOOCV accuracy should not vary across runs of the code, because the SVM builds a model with all the data except one observation and repeats this until the end of the loop. However, with the svm() function and the 'cross' argument, the accuracy differs on every run.
Which way is more accurate? Thanks for reading this post! :-)
You are using different hyper-parameters (cost, gamma) and different kernels (linear vs. sigmoid) in the two approaches. If you want identical results, these should be the same in each run (see the sketch after the list below).
Also, it depends how Leave One Out (LOO) is implemented:
Does your LOO method leave one out randomly or as a sliding window over the dataset?
Does your LOO method leave one out from one class at a time or both classes at the same time?
Is the training set always the same, or are you using a randomisation procedure before splitting between a training and testing set (assuming you have a separate independent testing set)? In which case, the examples you are cross-validating would change each run.
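For instance, a minimal sketch of making the two approaches comparable, assuming the data and the binary mem factor from the question, with the kernel and hyper-parameters kept identical in both:
n <- nrow(data)
# built-in LOOCV: one fold per row
svm.cv <- svm(mem ~ ., data = data, kernel = "sigmoid",
              cost = 7, gamma = 0.009, cross = n)
builtin_acc <- mean(svm.cv$accuracies)   # accuracies are reported in percent
# manual LOOCV with exactly the same settings
correct <- logical(n)
for (i in seq_len(n)) {
  fit  <- svm(mem ~ ., data = data[-i, ], kernel = "sigmoid",
              cost = 7, gamma = 0.009)
  pred <- predict(fit, data[i, , drop = FALSE])
  correct[i] <- pred == data$mem[i]
}
manual_acc <- 100 * mean(correct)
c(builtin = builtin_acc, manual = manual_acc)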

PLS in R: Predicting new observations returns Fitted values instead

In the past few days I have developed multiple PLS models in R for spectral data (wavebands as explanatory variables) and various vegetation parameters (as individual response variables). In total, the dataset comprises 56 observations. The first 28 (the training set) have been used for model calibration; now all I want to do is predict the response values for the remaining 28 observations in the test set. For some reason, however, R keeps returning the fitted values of the calibration set for a given number of components, rather than predictions for the independent test set. Here is what the model looks like, in short.
# first simulate some data
set.seed(123)
bands=101
data <- data.frame(matrix(runif(56*bands),ncol=bands))
colnames(data) <- paste0(1:bands)
data$height <- rpois(56,10)
data$fbm <- rpois(56,10)
data$nitrogen <- rpois(56,10)
data$carbon <- rpois(56,10)
data$chl <- rpois(56,10)
data$ID <- 1:56
data <- as.data.frame(data)
caldata <- data[1:28,] # define model training set
valdata <- data[29:56,] # define model testing set
# define explanatory variables (x)
spectra <- caldata[,1:101]
# build PLS model using training data only
library(pls)
refl.pls <- plsr(height ~ spectra, data = caldata, ncomp = 10,
                 validation = "LOO", jackknife = TRUE)
It was then identified that a model with 3 components yielded the best performance without over-fitting. Hence, the following command was used to predict the values of the 28 observations in the testing set using the above calibrated PLS model with 3 components:
predict(refl.pls, ncomp = 3, newdata = valdata)
Sensible as the output may seem, I soon discovered that all this piece of code generates is the fitted values of the PLS model for the calibration/training data, rather than predictions. I discovered this because the code below, in which newdata = is omitted, yields identical results.
predict(refl.pls, ncomp = 3)
Surely something must be going wrong, although I cannot seem to find out what specifically. Is there someone out there who can, and is willing to, help me move in the right direction?
I think the problem is with the nature of the input data. Looking at ?plsr and str(yarn) (which goes with the example), plsr requires a very specific data frame that I find tricky to work with: the input data frame should have a matrix as one of its elements (in your case, the spectral data). I think the following works correctly (note that I changed the size of the training set so that it wasn't half the original data, for troubleshooting):
library("pls")
set.seed(123)
bands=101
spectra = matrix(runif(56*bands),ncol=bands)
DF <- data.frame(spectra = I(spectra),
height = rpois(56,10),
fbm = rpois(56,10),
nitrogen = rpois(56,10),
carbon = rpois(56,10),
chl = rpois(56,10),
ID = 1:56)
class(DF$spectra) <- "matrix" # just to be certain, it was "AsIs"
str(DF)
DF$train <- rep(FALSE, 56)
DF$train[1:20] <- TRUE
refl.pls <- plsr(height ~ spectra, data = DF, ncomp = 10, validation =
"LOO", jackknife = TRUE, subset = train)
res <- predict(refl.pls, ncomp = 3, newdata = DF[!DF$train,])
Note that I got the spectral data into the data frame as a matrix by protecting it with I(), which equates to AsIs. There might be a more standard way to do this, but it works. As I said, to me a matrix inside a data frame is not completely intuitive or easy to grok.
As to why your version didn't work quite right, I think the best explanation is that everything needs to be in the one data frame you pass to plsr for the data sources to be completely unambiguous.
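As a quick sanity check on the sketch above (on simulated data, so the numbers themselves mean little): the result of predict() with newdata should now have one row per held-out observation, unlike the fitted values of the calibration fit.
dim(res)   # held-out rows x 1 response x 1 choice of ncomp
head(cbind(predicted = drop(res),
           observed  = DF$height[!DF$train]))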

Set G in prior using MCMCglmm, with categorical response and phylogeny

I am new to the MCMCglmm package in R, and rather new to glm models in general. I have a dataset of species traits and whether or not they have been introduced outside of their native range.
I would like to test whether being introduced (as a binary 0/1 response variable) can be explained by any of the species traits. I would also like to correct for phylogeny between species.
I was told that for a binary response I could use family = "threshold" and that I should fix the residual variance at 1. But I am having some trouble with the other parameters needed for the prior.
I've specified the R structure (for the residual variance), but if I specify R I must also specify G, and it is not clear to me how to choose values for this parameter. I've tried putting in default values, but I get error messages:
Error in MCMCglmm(fixed, random = ~species, data = data2, family = "threshold", :
prior$G has the wrong number of structures
I have read the help pages, vignettes and Course Notes but have not found an example with a binary response, and it is not clear to me how to decide on values for the priors. This is what I have so far:
fixed = Intro_binary ~ Trait1 + Trait2 + Trait3
Ainv = inverseA(redTree1)$Ainv
binary_model = MCMCglmm(fixed, random = ~species, data = data, family = "threshold",
                        ginverse = list(species = Ainv),
                        prior = list(
                          G = list(),                 # not sure about the parameters for the random effect
                          R = list(V = 1, fix = 1)),  # to fix the residual variance at one
                        nitt = 60000, burnin = 10000)
Any help or feedback would be greatly appreciated!
This one is a bit tricky with the information you provide. I'd say you can define G as a "weak" prior using:
priors <- list(R = list(V = 1, fix = 1),                # residual variance fixed at 1, as required for family = "threshold"
               G = list(G1 = list(V = 1, nu = 0.002)))  # one (weak) G structure per random effect -- here only species
binary_model <- MCMCglmm(fixed, random = ~species, data = data,
                         family = "threshold",
                         ginverse = list(species = Ainv),
                         prior = priors,
                         nitt = 60000, burnin = 10000)
However, without more information on your analysis, I strongly suggest you plot your posteriors to have a look at the results and see whether anything looks wrong (a minimal sketch follows below). Have a look at the MCMCglmm package Course Notes for more info on how to set these priors (especially on what not to do, in section 1.5); you can also find more specific advice there on tuning the prior to your model if it fits one of the tutorial's categories.
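A minimal sketch of those posterior checks, using the fitted binary_model object from above:
# trace and density plots for the fixed effects and the (co)variance components
plot(binary_model$Sol)
plot(binary_model$VCV)
summary(binary_model)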
