Predict future values by using the opera package in R

I have been trying to understand opera ("Online Prediction by Expert Aggregation") by Pierre Gaillard and Yannig Goude. I read two posts, one by Pierre Gaillard (http://pierre.gaillard.me/opera.html) and one by Rob Hyndman (https://robjhyndman.com/hyndsight/forecast-combinations/). However, I do not understand how to predict future values. In Pierre's example, newY = Y represents the test data set (Y <- data_test$Load), which contains weekly observations of the French electric load. As shown below, the data ends in December 2009. Now, how can I forecast, say, the 2010 values? What would newY be here?
> tail(electric_load,5)
Time Day Month Year NumWeek Load Load1 Temp Temp1 IPI IPI_CVS
727 727 30 11 2009 0.9056604 63568.79 58254.42 7.220536 10.163839 91.3 88.4
728 728 7 12 2009 0.9245283 63977.13 63568.79 6.808929 7.220536 90.1 87.7
729 729 14 12 2009 0.9433962 78046.85 63977.13 -1.671280 6.808929 90.1 87.7
730 730 21 12 2009 0.9622642 66654.69 78046.85 4.034524 -1.671280 90.1 87.7
731 731 28 12 2009 0.9811321 60839.71 66654.69 7.434115 4.034524 90.1 87.7
I noticed that multiplying the weights of MLpol0 by X gives outputs similar to the online prediction values.
> weights <- predict(MLpol0, X, Y, type='weights')
> w<-weights[,1]*X[,1]+weights[,2]*X[,2]+weights[,3]*X[,3]
> predValues <- predict(MLpol0, newexpert = X, newY = Y, type='response')
Test_Data predValues w
620 65564.29 65017.11 65017.11
621 62936.07 62096.12 62096.12
622 64953.83 64542.44 64542.44
623 61580.44 60447.63 60447.63
624 71075.52 67622.97 67622.97
625 75399.88 72388.64 72388.64
626 65410.13 67445.63 67445.63
627 65815.15 62623.64 62623.64
628 65251.90 64271.97 64271.97
629 63966.91 61803.77 61803.77
630 64893.42 65793.14 65793.14
631 69226.32 67153.80 67153.80
But I am still not sure how to generate the weights without newY. Maybe we can use the final coefficients that are the output of MLpol to predict future values?
(c<-summary(MLpol <- mixture(Y = Y, experts = X, model = "MLpol", loss.type = "square"))$coefficients)
[1] 0.585902 0.414098 0.000000
I am sorry, I may be way off on this and my question may not make sense at all, but I would really appreciate any help or insight.

The idea of the opera package is a bit different from classical batch machine learning methods with a training set and a test set. The goal is to make sequential predictions:
At each round t = 1, ..., n,
1) the algorithm receives the experts' predictions for round t,
2) it makes a prediction for this time step by combining the experts,
3) it updates the weights used for the combination once the new observation is revealed.
If you have out-of-sample forecasts (i.e., the experts' forecasts for future values, without the corresponding observations), the best you can do is to take the last coefficients and use them to make a prediction with:
newexperts %*% model$coefficients
In practice, you may also want to use the averaged coefficients. You can obtain the same result by using
predict(object,        # for example, mixture(model = 'FS', loss.type = "square")
        newexperts = , # matrix of the experts' out-of-sample predictions
        online = FALSE,
        type = 'response')
With the parameter online = FALSE, the model does not need any newY and will not be updated. When you do provide newY, the algorithm does not cheat: it never uses the observation at round t to make the prediction for round t. The values of newY are only used to update the coefficients step by step, as if the predictions had been made sequentially.
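For instance, continuing the example from the question, a minimal sketch (the matrix X_future of the experts' 2010 forecasts is hypothetical; it must have the same columns as X):
# X_future: hypothetical matrix of the experts' forecasts for 2010
# Option 1: linear combination with the final coefficients
pred_2010 <- X_future %*% MLpol$coefficients
# Option 2: the equivalent predict() call, which does not update the mixture
pred_2010 <- predict(MLpol, newexperts = X_future, online = FALSE, type = 'response')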
I hope this helped.

Related

What loss function in cross-validation should I use with R package BNlearn for my data?

New to StackOverflow and R.
I have a question regarding the different loss functions for cross-validation that are provided in the R package BNlearn and which one I should use. I have continuous data (example below) with 32 rows and 8 columns; each column represents a species and each row the number of individuals of that species in a given year.
201 1.78e+08 18500000 1.87e+08 6.28e+07 1.08e+09 1.03e+08 7.22e+07 43100000
202 8.06e+07 9040000 5.04e+07 4.49e+07 6.66e+08 8.07e+07 2.58e+07 24100000
203 1.54e+08 4380000 1.51e+08 2.88e+07 9.94e+08 1.44e+08 7.32e+07 39000000
204 1.36e+08 6820000 3.80e+08 8.39e+06 7.38e+08 1.50e+08 4.25e+07 32600000
205 9.94e+07 9530000 8.99e+07 1.05e+07 6.62e+08 1.67e+08 1.90e+07 29200000
206 1.33e+08 6340000 4.27e+07 3.26e+06 5.31e+08 2.93e+08 2.70e+07 41500000
207 1.22e+08 5710000 4.41e+07 3.16e+06 4.58e+08 4.92e+08 4.02e+07 21600000
208 1.33e+08 13500000 1.20e+08 3.56e+06 4.40e+08 2.50e+08 3.93e+07 30000000
209 1.73e+08 21700000 4.35e+07 7.58e+06 5.62e+08 3.31e+08 4.98e+07 42100000
210 1.86e+08 6950000 3.40e+07 1.18e+07 4.41e+08 3.80e+08 4.83e+07 28100000
So far I have used tabu search to learn a fixed network structure and analyzed it with the cross-validation command
bn.cv(data = data, bn = bn.tabu, method = "k-fold", k = 10, runs = 100)
which gives the result
k-fold cross-validation for Bayesian networks
number of folds: 10
loss function: Log-Likelihood Loss (Gauss.)
number of runs: 100
average loss over the runs: 151.8083
standard deviation of the loss: 0.2384763
The question is: which loss function should I use for my data so that I can change the data set and still get comparable results, and what does the "average loss over the runs" mean? The end game is to build joint probability distributions and a prediction for year + 1, so basically a row 33 with numbers and their probability distributions.
Sorry for any inconsistencies, as I'm still learning statistics.
I don't know whether I understand your question correctly or not. On the second question, "what does the 'average loss over the runs' mean?": your cross-validation is repeated 100 times (runs = 100), so it is the loss averaged over those 100 runs, each run being one complete 10-fold (k = 10) cross-validation. For the first question, it is better to have a look at this page:
https://stats.stackexchange.com/questions/339897/what-is-the-difference-between-loss-function-and-mle
Sorry for the bad language; my English is not good, as you can see.
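To make the choice of loss function concrete, here is a hedged sketch using bnlearn's losses for Gaussian networks ("logl-g" is the default shown in your output; "mse-g" scores the prediction of a single node; "Sp1" is a hypothetical column name standing in for one of your species):
library(bnlearn)
# network-wide Gaussian log-likelihood loss (the default in your output)
bn.cv(data = data, bn = bn.tabu, method = "k-fold", k = 10, runs = 100,
      loss = "logl-g")
# mean squared error for predicting one species from the rest of the network
bn.cv(data = data, bn = bn.tabu, method = "k-fold", k = 10, runs = 100,
      loss = "mse-g", loss.args = list(target = "Sp1"))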

Poisson GLM with categorical data

I'm trying to fit a Poisson generalized linear model to counts of categorical data labeled s and v. Since the data were collected in sessions of different durations (see session_dur_s), I want to include this information as a predictor by putting an offset in the glm model.
Here is my table:
label session counts session_dur_s
s 1 587 6843
s 2 203 2095
s 3 187 1834
s 4 122 1340
s 5 40 1108
s 6 64 476
s 7 60 593
v 1 147 6721
v 2 57 2095
v 3 58 1834
v 4 22 986
v 5 8 1108
v 6 12 476
v 7 11 593
My data:
label <- c("s","s","s","s","s","s","s","v","v","v","v","v","v","v")
session <- c(1,2,3,4,5,6,7,1,2,3,4,5,6,7)
counts <- c(587,203,187,122,40,64,60,147,54,58,22,8,12,11)
session_dur_s <-c(6843,2095,1834,1340,1108,476,593,6721,2095,1834,986,1108,476,593)
sv_dur <- data.frame(label,session,counts,session_dur_s)
That's my code:
sv_dur_mod <- glm(counts ~ label * session, data=sv_dur, family = "poisson",offset =session_dur_s)
summary(sv_dur_mod)
plot(allEffects(sv_dur_mod),type="response")
I can't execute the glm function because I receive the beautiful error:
Error: no valid set of coefficients has been found: please supply starting values
I'm not sure how to go about it. I would be really happy if someone could point out what I can do to work this out.
If there is a better model I can use to predict the counts over time for both the s and v labels, I'm more than open to trying it.
Many thanks for comments and suggestions!
P.S. I'm running it in an R Markdown script using the packages tidyverse, effects, and dplyr.
A Poisson GLM uses a log link by default. That is, it can be executed as:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  family = poisson("log"))
Accordingly, a log offset is generally appropriate:
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  offset = log(session_dur_s),
                  family = poisson("log"))
This executes as expected. See the answer here for more information on using a log offset: https://stats.stackexchange.com/a/237980/70372
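For intuition: with the log link, the offset turns the model into log(E[counts]) = log(session_dur_s) + Xb, i.e., a model for the rate of counts per second. A minimal sketch (assuming the sv_dur data frame defined in the question) to inspect the fitted rates:
# fitted() returns expected counts including the offset; dividing by the
# session duration gives the modelled events-per-second rate
sv_dur_mod <- glm(counts ~ label * session,
                  data = sv_dur,
                  offset = log(session_dur_s),
                  family = poisson("log"))
sv_dur$fitted_rate <- fitted(sv_dur_mod) / sv_dur$session_dur_s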

Always getting the same prediction for different records using a decision tree algorithm

When using a decision tree to predict numeric values, I always get exactly the same prediction for different records, even when the independent variables in the test data differ between records. What could be the reason?
Some lines of the training data are below.
cur val lat lon ass sf nb yr_bt sasdate
562333.10 134000 33.7651 -84.266 178280 2214 4 1954 12158
201965.01 55900 33.7631 -84.270 22360 1352 2 1988 12240
322198.64 190000 33.7607 -84.264 75976 2258 2 1965 13322
355587.37 191500 33.7594 -84.2637 72520 2377 3 1961 12220
The code is below.
library(rpart)
x_train <- subset(train, select = -c(origval))
y_train <- train$val
x <- cbind(x_train, y_train)
x$yr_built <- as.numeric(x$yr)
# grow the tree
fit <- rpart(y_train ~ ., data = x, method = "anova")
# predict output
test$cur <- as.numeric(test$cur)
predicted <- predict(fit, test)
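One quick way to check whether the tree made any splits at all (a stump predicts the same training mean for every record) is to inspect the fitted object; a hedged sketch, assuming the fit object from above:
printcp(fit)               # complexity table: a single row means no splits were made
print(fit)                 # shows which variables were actually split on
length(unique(predicted))  # number of distinct predicted values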
Thank you.
Elyse

Grouping in R changes mean substantially

I have a file containing the predictions of two models (A and B) on a binary classification problem. Now I'd like to understand how well they predict the observations that they are most confident about. To do that, I want to group their predictions into 10 groups based on how confident they are. Each of these groups should have an identical number of observations. However, when I do that, the accuracy of the models changes substantially! How can that be?
I've also tested with n_groups=100, but it only makes a minor difference. The CSV file is here and the code is below:
# Grouping observations
conf <- read.table(file="conf.csv", sep=',', header=T)
n_groups <- 10
conf$model_a_conf <- pmax(conf$model_a_pred_0, conf$model_a_pred_1)
conf$model_b_conf <- pmax(conf$model_b_pred_0, conf$model_b_pred_1)
conf$conf_group_model_a <- cut(conf$model_a_conf, n_groups, labels=FALSE, ordered_result=TRUE)
conf$conf_group_model_b <- cut(conf$model_b_conf, n_groups, labels=FALSE, ordered_result=TRUE)
# Test of original mean.
mean(conf$model_a_acc) # 0.78
mean(conf$model_b_acc) # 0.777
# Test for mean in aggregated data. They should be similar.
(acc_model_a <- mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN=mean))) # 0.8491
(acc_model_b <- mean(tapply(conf$model_b_acc, conf$conf_group_model_b, FUN=mean))) # 0.7526
Edited to clarify slightly.
table(conf$conf_group_model_a)
1 2 3 4 5 6 7 8 9 10
2515 2628 2471 2128 1792 1321 980 627 398 140
The groups you are using are unbalanced: cut() splits the range of the confidence scores into intervals of equal width, not into groups of equal size. Taking the mean of each of those groups with tapply is fine, but simply averaging the group means afterwards is not the way to go.
If you want to keep this process, you need to weight each group mean by the group's size.
Something like this is quick and dirty:
weighted.mean(tapply(conf$model_a_acc, conf$conf_group_model_a, FUN = mean),
              table(conf$conf_group_model_a))
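If you want groups with (nearly) equal counts, as the question intends, a hedged sketch that bins by quantile breaks instead of cut()'s default equal-width intervals:
# equal-frequency bins; unique() guards against tied quantile breaks
brks <- quantile(conf$model_a_conf, probs = seq(0, 1, length.out = n_groups + 1))
conf$conf_group_model_a <- cut(conf$model_a_conf, breaks = unique(brks),
                               labels = FALSE, include.lowest = TRUE)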

Fitting two parameter observations into copulas

I have one set of observations containing two variables.
How can I fit a copula to it (i.e., estimate the parameter of the copula and the margin functions)?
Let's say the margins are log-normal distributions and the copula is a Gumbel copula.
The data is as below:
1 974.0304 1010
2 6094.2672 1150
3 3103.2720 1490
4 1746.1872 1210
5 6683.7744 3060
6 6299.6832 3330
7 4784.0112 1550
8 1472.4288 607
9 3758.5728 1970
10 4381.2144 1350
library(copula)
gumbel.cop <- gumbelCopula(dim = 2)
myMvd <- mvdc(gumbel.cop, c("lnorm", "lnorm"),
              list(list(meanlog = 7.1445391, sdlog = 0.4568783),
                   list(meanlog = 7.957392, sdlog = 0.559831)))
x <- rmvdc(myMvd, 1000)
fit <- fitMvdc(x, myMvd, c(7.1445391, 0.4568783, 7.957392, 0.559831))
The meanlog and sdlog values are derived from the data set. The error message is:
"Error in if (alpha - 1 < .Machine$double.eps^(1/3)) return(rCopula(n, :
missing value where TRUE/FALSE needed"
How do I choose the copula parameter given the data, with the margin distributions derived from the data set?
To close the question, as assessed in the comments: it seems that supplying a value for the copula parameter resolves the problem, as does first computing the pseudo-observations and then fitting the copula.
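For reference, a hedged sketch of both fixes, assuming x holds the two-column data from the question:
library(copula)
# (1) give the Gumbel copula an explicit parameter and include it in the
#     start vector (margin parameters first, copula parameter last)
gumbel.cop <- gumbelCopula(param = 2, dim = 2)
myMvd <- mvdc(gumbel.cop, c("lnorm", "lnorm"),
              list(list(meanlog = 7.1445391, sdlog = 0.4568783),
                   list(meanlog = 7.957392, sdlog = 0.559831)))
fit <- fitMvdc(x, myMvd, start = c(7.1445391, 0.4568783, 7.957392, 0.559831, 2))
# (2) or fit only the copula parameter on rank-based pseudo-observations
u <- pobs(x)
fit.cop <- fitCopula(gumbelCopula(dim = 2), u, method = "mpl")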
