Calculating accuracy for an already existing forecast - R

I have the original forecast data from a company (12 observations), and next to that I have the 12 real observations. I just want to calculate the accuracy of the company's current method against the real data and report the MSE, MAPE, MAD, MAE, etc. So I don't have to produce the forecast myself, just use these two datasets. I can't get my head around how to use the accuracy() function in this case. I can convert the forecast dataset to a time series object, but I still keep getting errors.
Does anyone know how to help me out?
> Forecast_data
1 8237
2 13438
3 10026
4 9651
5 11043
6 8500
7 10126
8 11560
9 11175
10 9103
11 14456
12 10308
> Real data
1 16507
2 14637
3 15210
4 17818
5 17606
6 13396
7 11603
8 11094
9 14087
10 14304
11 17887
12 14116

Look at the first "date" (1) for the moment. The actual/observed value is a[1]=16507, and the forecast/estimate is f[1]=8237.
So the error/deviation is e[1]=f[1]-a[1]=8237-16507=-8270 (you underestimate) and in percentage p[1]=e[1]/a[1]=-8270/16507=-0.501=-50.1% (you underestimate by 50%).
Do this for all the dates and you'll get a column of error in value e[i] and in percentage p[i].
The MSE (Mean Squared Error) is the average of e[i]^2.
The MAD (Mean Absolute Deviation) is the average of abs(e[i]).
The MAE (Mean Absolute Error) is the average of abs(e[i]) (the same thing).
The MAPE (Mean Absolute Percent Error) is the average of abs(p[i]).
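A minimal R sketch of these calculations (the vector names are placeholders for the two columns shown above); note that accuracy() from the forecast package also accepts two plain numeric vectors, so no fitted model or ts object is needed:

actual <- c(16507, 14637, 15210, 17818, 17606, 13396,
            11603, 11094, 14087, 14304, 17887, 14116)
fc     <- c( 8237, 13438, 10026,  9651, 11043,  8500,
            10126, 11560, 11175,  9103, 14456, 10308)

e <- fc - actual                 # errors (forecast minus actual, as above)
p <- e / actual                  # percentage errors

MSE  <- mean(e^2)                # mean squared error
MAE  <- mean(abs(e))             # mean absolute error (same as MAD here)
MAPE <- mean(abs(p)) * 100       # mean absolute percentage error

library(forecast)
accuracy(fc, actual)             # RMSE, MAE, MAPE, etc. in one call

(accuracy() defines the error as actual minus forecast, the opposite sign convention from the one above, so the signed measures ME and MPE may flip sign while the absolute measures agree.)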

Related

Errors using powerSim and powerCurve for a clmm in R

I'm new to clmm and ran into the following problem:
I want to obtain the optimal sample size for my study with R using powerSim and powerCurve. Because my data is ordinal, I'm using a clmm. Study participants (VPN) should evaluate three sentence types (SH1, SM1, and SP1) on a 5-point Likert scale (evaluation.likert). I need to account for my participants as a random factor, while the sentence types and the evaluation are my fixed factors.
Here's a glimpse of my data (count of VPN goes up to 40 for each of the parameters, I just shortened it here):
VPN parameter evaluation.likert
1 1 SH1 2
2 2 SH1 4
3 3 SH1 5
4 4 SH1 3
...
5 1 SM1 4
6 2 SM1 2
7 3 SM1 2
8 4 SM1 5
...
9 1 SP1 1
10 2 SP1 1
11 3 SP1 3
12 4 SP1 5
...
Now, with some help I created the following model:
clmm(likert~parameter+(1|VPN), data=dfdata)
With this model, I'm doing the simulation:
ps1 <- powerSim(power, test=fixed("likert:parameter", "anova"), nsim=40)
Warning:
In observedPowerWarning(sim) :
This appears to be an "observed power" calculation
print(ps1)
Power for predictor 'likert:parameter', (95% confidence interval):
0.00% ( 0.00, 8.81)
Test: Type-I F-test
Based on 40 simulations, (0 warnings, 40 errors)
alpha = 0.05, nrow = NA
Time elapsed: 0 h 0 m 0 s
nb: result might be an observed power calculation
In the above example, I tried it with 40 participants but I already also ran a simulation with 2000000 participants to check if I just need a huge amount of people. The results were the same though: 0.0%.
lastResult()$errors tells me that I'm using a method which is not applicable for clmm:
no applicable method for 'simulate' applied to an object of class "clmm"
But besides the anova I'm doing here, I've also already tried z, t, f, chisq, lr, sa, kr, pb. (And instead of test=fixed, I've also already tried test=compare, test=fcompare, test=rcompare, and even test=random())
So I guess there must be something wrong with my model? Or are really none of these methods applicable to clmms?
Many thanks in advance, your help is already very much appreciated!
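A quick way to confirm what lastResult() is reporting (a diagnostic sketch, not part of the original question): list which classes have a simulate() method registered. powerSim() has to call simulate() on the fitted model, so if no method is listed for "clmm" objects, none of the test= options will change the outcome.

library(ordinal)   # for clmm
library(simr)      # for powerSim / powerCurve
methods(simulate)  # if "clmm" is absent from this list, that explains the error above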

Write R code for an AR(2) model for time series data from an `rsav` file

I need to write R code to model time series data from an rsav file. Here is detailed information about the question:
The file "file.rsav" (which can be loaded into R using load("file.rsav")) contains a time series ("xx"). The series is a "demeaned" monthly revenue stream (in millions of dollars) for a company. There are n = 96 observations.
The series has been "demeaned"; usually that would mean we subtract off $\bar{X}$ from every data point, but pretend for now we know the mean $\mu$ exactly, so we have subtracted off $\mu$ from every data point and the new series is exactly (theoretically) mean 0. (But thus its sample mean is not precisely 0.)
We will consider possible ARMA models for the series $X_t$. We assume that the corresponding white noise is Gaussian (so $X_t$ is Gaussian). We will consider first an AR(2) model. We assume we know the true model exactly: $X_t = 1.34X_{t-1} - 0.48X_{t-2} + W_t$, $W_t \sim \text{iid } N(0, \sigma^2)$.
I was asked to compute forecasts and backcasts using the model, up to 25 time steps into the future and into the past.
Write code to do the prediction by hand (i.e., not using the predict() function). Plot the data, forecast, and 95% prediction intervals [assuming gaussianity] (all on one plot). (Note: you do not need to do a multiplicity correction for the prediction intervals.)
Code:
load('./file.rsav')
str(xx)
xx
Out:
Time-Series [1:96] from 1 to 8.92: 2.45 2.18 0.389 -1.44 -1.47 ...
Jan Feb Mar Apr May Jun Jul
1 2.45017780 2.17955829 0.38874020 -1.43979552 -1.47049807 -2.25233354 -0.82580703
2 1.92378321 1.87944208 1.07382472 1.01933130 1.88660307 -0.31109156 -0.25732342
3 0.60732330 1.53185399 1.58614371 0.63922270 0.82728407 0.28910411 -1.18154941
4 0.41375543 1.96633332 1.97402973 4.16058136 5.15474250 5.71865844 3.93136013
5 -1.51228022 -3.03396294 -3.65446772 -4.69589618 -3.51276584 -2.97682246 -3.08655352
6 3.43027017 4.68909032 6.55598795 4.95816124 4.87626503 3.17103291 0.79093946
7 -0.62481997 -0.94469455 -2.13648402 -3.64364158 -2.07214317 -3.26793808 -3.38573375
8 0.67823828 1.09908274 0.93832242 0.08791237 1.77322327 2.01201710 3.70197246
Aug Sep Oct Nov Dec
1 0.53048061 1.31994246 0.69306401 1.25916404 1.53363966
2 -0.47154459 0.52849630 0.90548093 0.71783457 0.86908457
3 -0.52525201 -0.40335058 0.73415310 0.58501633 0.29875228
4 2.50242432 1.69408297 0.96230124 0.53164036 -0.64480235
5 -1.60735865 -0.20500888 -0.44508903 -0.01443040 1.71087427
6 -0.09975821 -0.85972650 -0.41557374 -0.99876068 0.52620555
7 -2.25968715 -0.91700127 -0.49302872 -1.44275203 -0.66221559
8 4.62724761 4.17549847 3.43992950 3.15302462 4.17300576
I don't know much about the rsav file extension; could someone help me solve this issue or give me some tips? Thanks in advance.
I think "backcast" means the in-sample fit for the last 25 observations. To forecast from an AR(2) model you simply need the last two observations for the next step.
The model is: x_t = ar1 * x_{t-1} + ar2 * x_{t-2} + error
Now we just need to insert the AR parameters and the observations for x_{t-1} and x_{t-2}. For the step after that we need the first forecast and the last observation:
x_{t+1} = ar1 * x_{t} + ar2 * x_{t-1} + error
This is what we repeat 25 times. The error term is assumed to be normally distributed with mean zero, so its expected value is zero and it drops out of the point forecast.
We do the same thing for the "backcast", the in-sample fit, but here we only need the observations from the time series.
# AR coefficients from the stated model: X_t = 1.34 X_{t-1} - 0.48 X_{t-2} + W_t
phi1 <- 1.34
phi2 <- -0.48
forecast <- numeric(25)
backcast <- numeric(25)
# the first two forecasts use the observed series, then the recursion feeds on itself
forecast[1] <- phi1 * xx[length(xx)] + phi2 * xx[length(xx) - 1]
forecast[2] <- phi1 * forecast[1] + phi2 * xx[length(xx)]
for (i in 3:25) {
  forecast[i] <- phi1 * forecast[i - 1] + phi2 * forecast[i - 2]
}
# in-sample one-step fits ("backcasts") use only observed values
for (i in 1:25) {
  backcast[i] <- phi1 * xx[length(xx) - i - 1] + phi2 * xx[length(xx) - i - 2]
}
ts.plot(xx)
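The question also asks for 95% prediction intervals and for everything on one plot. Below is a minimal sketch of that part (not in the original answer), continuing from the objects above (xx, forecast, phi1, phi2) and estimating the white-noise variance from the one-step residuals:

n <- length(xx)

# crude estimate of sigma^2 from one-step residuals of the stated model
res    <- xx[3:n] - phi1 * xx[2:(n - 1)] - phi2 * xx[1:(n - 2)]
sigma2 <- mean(res^2)

# psi-weights of the AR(2); the h-step forecast error variance is
# sigma^2 * sum_{j=0}^{h-1} psi_j^2
psi <- numeric(25)
psi[1] <- 1
psi[2] <- phi1
for (j in 3:25) psi[j] <- phi1 * psi[j - 1] + phi2 * psi[j - 2]
se <- sqrt(sigma2 * cumsum(psi^2))

# data, forecasts, and approximate 95% prediction intervals on one plot
plot(as.numeric(xx), type = "l", xlim = c(1, n + 25),
     ylim = range(xx, forecast + 1.96 * se, forecast - 1.96 * se),
     xlab = "time index", ylab = "demeaned revenue")
lines((n + 1):(n + 25), forecast, col = "blue")
lines((n + 1):(n + 25), forecast + 1.96 * se, col = "red", lty = 2)
lines((n + 1):(n + 25), forecast - 1.96 * se, col = "red", lty = 2)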

Bug with VGAM? vglm family=posnegbinomial => "Error in if (take.half.step) { : missing value where TRUE/FALSE needed"

I have some actual data that I am afraid is somewhat nasty.
It's essentially a Positive Negative Binomial distribution (without any zero counts). However, there are some outliers that seem to cause some bad calculations to occur (maybe underflow or NaNs?) The first 8 or so entries are reasonable, but I'm guessing the last few are causing some problems with the fitting.
Here's the data:
> df
counts t
1 1968 1
2 217 2
3 55 3
4 26 4
5 11 5
6 5 6
7 8 7
8 3 8
9 1 10
10 1 11
11 1 12
12 1 13
13 1 15
14 1 18
15 1 26
16 1 59
This command runs for a while and then spits out the error message
> vglm(counts ~ t, data=df, family = posnegbinomial)
Error in if (take.half.step) { : missing value where TRUE/FALSE needed
BUT, if I rerun this cutting off the outliers, I get a solution for posnegbinomial
> vglm(counts ~ t, data=df[1:9,], family = posnegbinomial)
Call:
vglm(formula = counts ~ t, family = posnegbinomial, data = df[1:9,])
Coefficients:
(Intercept):1 (Intercept):2 t
7.7487404 0.7983811 -0.9427189
Degrees of Freedom: 18 Total; 15 Residual
Log-likelihood: -36.21064
If I try the family pospoisson (Positive Poisson: no zero values), I get a similar error "argument is not interpretable as logical".
I do notice that there are a number of similar questions on Stack Overflow about missing values where TRUE/FALSE is needed, but with other R packages. This indicates to me that perhaps the package writers need to better anticipate that calculations might fail.
I think your proximal problem is that the predicted means for the negative binomial for your extreme values are so close to zero that they are underflowing to zero, in a way that was not anticipated/protected against by the package authors. (One thing to realize about nonlinear optimization/fitting is that it is always possible to break a fitting method by giving it extreme data ...)
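As a rough back-of-envelope illustration (using the intercept and slope from the df[1:9,] fit shown above, so only an approximation), the fitted mean at the largest t is numerically indistinguishable from zero:

exp(7.7487404 - 0.9427189 * 59)   # predicted mean count at t = 59: on the order of 1e-21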
I couldn't get this to work in VGAM, but I'll offer a couple of other suggestions.
First, plot the data on a log scale (dd here is the question's data frame df):
plot(log(counts) ~ t, data = dd)
And eyeballing the data to get an initial estimate of parameter values (at least for the mean model):
m0 <- lm(log(counts)~t,data=subset(dd,t<10))
I thought I might be able to get vglm() to work by setting starting values, but that didn't actually pan out, even when I have fairly good values from other platforms (see below).
glmmADMB
The glmmADMB package can handle positive NB, via family="truncnbinom":
library(glmmADMB)
m1 <- glmmadmb(counts~t, data=dd, family="truncnbinom")
(there are some warning messages ...)
bbmle::mle2()
This requires a little bit more work: it failed with the standard model, but works if I set a floor on the predicted mean ...
library(VGAM) ## for dposnegbin
library(bbmle)
m2 <- mle2(counts ~ dposnegbin(size = exp(logk),
                               munb = pmax(exp(logeta), 1e-7)),
           parameters = list(logeta ~ t),
           data = dd,
           start = list(logk = 0, logeta = 0))
Again warning messages.
Compare glmmADMB, mle2, simple truncated lm fit ...
cc <- cbind(coef(m2),
c(log(m1$alpha),coef(m1)),
c(NA,coef(m0)))
dimnames(cc) <- list(c("log_k","log_int","slope"),
c("mle2","glmmADMB","lm"))
## mle2 glmmADMB lm
## log_k 0.8094678 0.8094625 NA
## log_int 7.7670604 7.7670637 7.1747551
## slope -0.9491796 -0.9491778 -0.8328487
This is in principle also possible with glmmTMB, but it runs into the same kinds of problems as vglm() ...

R - How to Speed Up Recursion and Double Summation

Since this is essentially a question about how to efficiently perform a computation in R, I will start with the code and then provide an explanation of the problem afterwards, for those who would find it useful or interesting.
I have written a script in R to generate values with a recursive function that involves a double summation. It works well for small numbers around 15 or lower, but the execution time gets prohibitively long at higher values of n and t. I need to be able to perform the calculation for every n and t pair from 1 to 30. Is there a way to write a script that won't take months to execute?
My current script is:
explProb <- function(n, t) {
  prob <- 0
  #################################
  # FIRST PART - SINGLE SUMMATION
  #################################
  i <- 0
  if (t <= n) {
    i <- c(t:n)
  }
  prob <- sum(choose(n, i[i > 0]) * ((1/3)^(i[i > 0])) * ((2/3)^(n - i[i > 0])))
  #################################
  # SECOND PART - DOUBLE SUMMATION
  #################################
  if (t >= 2) {
    for (k in 1:(t - 1)) {
      j <- c(0:(k - 1))
      prob <- prob + sum(choose(n, n - k) * ((1/6)^j) * ((1/6)^(k - j)) *
                           ((2/3)^(n - k)) * explProb(k - j, t - k))
    }
  }
  return(prob)
}
MAX_DICE <- 30
MAX_THRESHOLD <- 30
probabilities <- matrix(0, MAX_DICE, MAX_THRESHOLD)
for (dice in 1:MAX_DICE) {
  for (threshold in 1:MAX_THRESHOLD) {
    # print(sprintf("DICE = %d : THRESH = %d", dice, threshold))
    probabilities[dice, threshold] <- explProb(dice, threshold)
  }
}
I am trying to write a script to generate a set of probabilities for a particular type of dice roll in a tabletop roleplaying game (Shadowrun 5th Edition, to be specific). The type of dice roll is called an "Exploding Dice Roll". In case you are not familiar with how these rolls work in this game, let me briefly explain.
Whenever you try to accomplish a task you make a test by rolling a number of six-sided dice. Your goal is to get a predetermined number of "hits" when rolling those dice. A "hit" is defined as a 5 or 6 on a six-sided die. So, for example, if you have a dice pool of 5 dice, and you roll: 1, 3, 3, 5, 6 then you have gotten 2 hits.
In some cases you are allowed to re-roll all of the 6's that were rolled in order to try and get MORE hits. This is called an "exploding" roll. The 6's count as hits, but can be re-rolled to "explode" into even more hits. For clarification I'll give a quick example...
If you roll 10 dice and get a result of 1, 2, 2, 4, 5, 5, 6, 6, 6, 6 then you have gotten 6 hits on the first roll... However, the 4 dice that rolled 6's can be re-rolled again. If you roll those dice and get 3, 5, 6, 6 then you have 3 more hits for a total of 9 hits. But you can now re-roll the two more sixes you got... etc... You keep re-rolling the sixes, adding the 5's and 6's to your total hits, and keep going until you get a roll with no sixes.
The function listed above generates these probabilities taking an input of "# of dice" and "number of hits" (called a "threshold" here).
n = # of Dice being rolled
t = Threshold number of "hits" to be reached
Calculation with Transition Matrix
If we have n=10 dice, then the probability of 0 to 10 occurrences of an event with prob=2/6 may be efficiently calculated in R as
dbinom(0:10,10,2/6)
Since you are allowed to keep rolling until failure, any number of ultimate hits is possible (the support of the distribution is [0, Inf)), albeit with geometrically diminishing probabilities. This makes a recursive numeric solution feasible: the diminishing probabilities let us establish a cutoff at machine precision, and the threshold lets us censor.
Since rerolls are with a smaller number of dice, it makes sense to precalculate all transition probabilities.
X<-outer(0:10,0:10,function(x,size) dbinom(x,size,2/6))
Where the i-th row of the j-th column gives the probability of (i-1) successes (hits) with (j-1) trials (dice rolled). For example, the probability of exactly 1 success with 6 trials is located at X[2,7].
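As a quick sanity check of that indexing (not in the original answer), using the X defined just above:

X[2, 7]            # probability of exactly 1 success in 6 trials ...
dbinom(1, 6, 2/6)  # ... which matches the direct calculation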
Now if you start out with 10 dice, we can represent this as the vector
d<-c(rep(0,10),1)
This shows that with probability 1 we have 10 live dice, and probability 0 everywhere else.
After a single roll, the probabilities of the number of live dice is X %*% d.
After two rolls, the probabilities are X %*% X %*% d. We can calculate the live dice state probabilities after any number of rolls by iterating.
T<-Reduce(function(dn,n) X %*% dn,1:11,d,accumulate=TRUE)
Where T[1] gives the probabilities of live dice before the first roll and T[11] gives the probabilities of live dice before the 11th (after the 10th).
This is sufficient to calculate expected values, but for the distribution of cumulative sums, we'll need to track additional information in the state. The following function reshapes a state matrix at each step so that the i-th row and j-th column has the probability of (i-1) live dice with a current cumulative total of j-1.
step<-function(m) {
idx<-arrayInd(seq_along(m),dim(m))
idx[,2]<-rowSums(idx)-1
i<-idx[nrow(idx),]
m2<-matrix(0,i[1],i[2])
m2[idx]<-m
return(m2)
}
In order to recover the probabilities for cumulative totals, we use the following convenience function to sum across anti-diagonals
conv<-function(m)
tapply(c(m),c(row(m)+col(m)-2),FUN=sum)
The probabilities of continuing to roll rapidly diminish, so I've cut off at 40, and shown up to 20, rounded to 4 places
round(conv(Reduce(function(mn,n) X %*% step(mn), 1:40, X %*% d))[1:21],4)
#> 0 1 2 3 4 5 6 7 8 9
#> 0.0173 0.0578 0.1060 0.1413 0.1531 0.1429 0.1191 0.0907 0.0643 0.0428
#>
#> 10 11 12 13 14 15 16 17 18 19
#> 0.0271 0.0164 0.0096 0.0054 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001
Calculation with Simulation
This can also be calculated in reasonable time with reasonable precision using simple simulation.
We simulate a roll of n 6-sided dice with sample(1:6,n,replace=TRUE), calculate the number to re-roll, and iterate until none are available, counting "hits" along the way.
sim<-function(n) {
k<-0
while(n>0) {
roll<-sample(1:6,n,replace=TRUE)
n<-sum(roll>=5)
k<-k+n
}
return(k)
}
Now we can simply replicate a large number of trials and tabulate
prop.table(table(replicate(100000,sim(10))))
#> 0 1 2 3 4 5 6 7 8 9
#> 0.0170 0.0588 0.1053 0.1431 0.1518 0.1433 0.1187 0.0909 0.0657 0.0421
#>
#> 10 11 12 13 14 15 16 17 18 19
#> 0.0252 0.0161 0.0102 0.0056 0.0030 0.0015 0.0008 0.0004 0.0002 0.0001
This is quite feasible even with 30 dice (a few seconds even with 100,000 replications).
Efficient Calculation Using Probability Distributions
The approach in the question and in my other answer uses sums over transitions of dependent binomial distributions. The dependency, arising from the carry-over of previous successes (hits) to subsequent trials (rolls), complicates the calculations.
An alternative approach is to view each die separately. Roll a single die as long as it turns up as a hit. Each die is independent of the other, so the random variables may be summed efficiently through convolution. However, the distribution for each die is a geometric distribution, and the sum of independent geometric distributions gives rise to a negative binomial distribution.
R provides the negative binomial distribution, so the results obtained in my other answer may be had all at once by
round(dnbinom(0:19,10,prob=2/3),4)
[1] 0.0173 0.0578 0.1060 0.1413 0.1531 0.1429 0.1191 0.0907 0.0643 0.0428
[11] 0.0271 0.0164 0.0096 0.0054 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001
The probability matrix in the question, with MAX_DICE=MAX_THRESHOLD=10, has first column equal to
1-dnbinom(0,1:10,prob=2/3)
So, you might be looking for the cumulative distribution function. I have not been able to figure out your intentions with the subsequent columns, but perhaps the goal was
outer(1:10,0:10,function(size,x) 1-dnbinom(x,size,prob=2/3))
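If explProb(n, t) is meant to be the probability of getting at least t hits with n dice, then under the negative binomial representation above that is an upper-tail probability, which can be computed for the whole 30 x 30 grid at once (a guess at the intended matrix, not code from the original question or answer):

probabilities <- outer(1:30, 1:30,
                       function(n, t) pnbinom(t - 1, size = n, prob = 2/3,
                                              lower.tail = FALSE))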

Getting percentile values from gamlss centile curves

This question is related to: Selecting Percentile curves using gamlss::lms in R
I can get a centile curve from the following data and code:
age = sample(5:15, 500, replace=T)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 12 13.12974
2 14 -18.97290
3 10 42.11045
4 12 27.89088
5 11 48.03861
6 5 24.68591
h = lms(yvar, age , data=mydata, n.cyc=30)
centiles(h,xvar=mydata$age, cent=c(90), points=FALSE)
How can I now get the yvar value on the curve for each x value (5:15), i.e. the 90th percentile for the data after smoothing?
I tried to read the help pages and found fitted(h) and fv(h) to get fitted values for the entire data. But how do I get the value of the 90th centile curve at each age level? Thanks for your help.
Edit: The following figure (not reproduced here) shows what I need:
I tried the following, but it is not correct since the values are wrong:
mydata$fitted = fitted(h)
aggregate(fitted~age, mydata, function(x) quantile(x,.9))
age fitted
1 5 6.459680
2 6 6.280579
3 7 6.290599
4 8 6.556999
5 9 7.048602
6 10 7.817276
7 11 8.931219
8 12 10.388048
9 13 12.138104
10 14 14.106250
11 15 16.125688
The values are very different from the 90th quantile computed directly from the data:
> aggregate(yvar~age, mydata, function(x) quantile(x,.9))
age yvar
1 5 39.22938
2 6 35.69294
3 7 25.40390
4 8 26.20388
5 9 29.07670
6 10 32.43151
7 11 24.96861
8 12 37.98292
9 13 28.28686
10 14 43.33678
11 15 44.46269
See if this makes sense. The 90th percentile of a normal distribution with mean 'smn' and sd 'ssd' is qnorm(.9, smn, ssd). So this seems to deliver (somewhat) sensible results, albeit not the full hack of centiles that I suggested:
plot(h$xvar, qnorm(.9, fitted(h), h$sigma.fv))
(Note the massive overplotting from only a few distinct xvars but 500 points. And you may want to set the ylim so that the full range can be appreciated.)
The caveat here is that you need to check the other parts of the model to see if it is really just an ordinary Normal model. In this case it seems to be:
> h$mu.formula
y ~ pb(x)
<environment: 0x10275cfb8>
> h$sigma.formula
~1
<environment: 0x10275cfb8>
> h$nu.formula
NULL
> h$tau.formula
NULL
So the model is just mean-estimate with a fixed-variance (the ~1) across the range of the xvar, and there are no complications from higher order parameters like a Box-Cox model. (And I'm unable to explain why this is not the same as the plotted centiles. For that you probably need to correspond with the package authors.)
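If you want a single smoothed value per age (5 to 15) rather than the overplotted points, one option (a small sketch building on the same qnorm() idea, not part of the original answer) is to collapse the per-observation values by age; with only age in the model, fitted(h) and h$sigma.fv are constant within each age, so the aggregation just removes duplicates:

pred90 <- data.frame(age = mydata$age,
                     p90 = qnorm(.9, fitted(h), h$sigma.fv))
aggregate(p90 ~ age, data = pred90, FUN = mean)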
