Moving Average with Forecast Horizon > 1 in R

I use the updated greybox package in R to forecast the next two values (h=2) with a moving-average scheme (see the first code block below), where the window size equals 3.
For example, the goal is to take the average (1+2+3)/3 = 2 as the forecast for horizon 1 (h=1), and then reuse that predicted value for h=2: (2+3+2)/3 ≈ 2.3333.
The next forecast origin uses the window (2,3,4), where 4 is the actual value, and predicts h1 and h2 as 3 and (3+4+3)/3 ≈ 3.3333, respectively.
Yet the prediction result I want, ValuesMA[[3]], only contains one row, i.e. the values for the first horizon. It should have as many rows as the predefined horizon, which is two.
I have code for an AR(1) process that works as intended (second code block). At the end I add an MAE statistic to evaluate the model.
Can anyone help?
Thank you!
This is the underlying code I use:
#data
z <- c(1,2,3,4,5,6,7)
ourCall <- "mean(x=data,n.ahead=h)"
ourValue <- c("pred")
# Return a list for a forecasting horizon h
ValuesMA <- ro(z, h=2, origins=3, call=ourCall, ci=TRUE, co=TRUE)
ValuesMA[[3]]
**Which yields:**
origin3 origin4 origin5
[1,] 2 3 4
**But I want:**
origin3 origin4 origin5
[1,] 2 3 4
[2,] 2.3333 3.3333 4.3333
#data
z <- c(1,2,3,4,5,6,7)
# ci defines constant in-sample window size, co defines whether the holdout sample window size should be constant
ourCall <- "predict(arima(x=data,order=c(1,0,0)),n.ahead=h)"
# Ask for predicted values and the standard error
ourValue <- c("pred","se")
# Return a list for a forecasting horizon h with a rolling holdout size equal to origin
ValuesAR1 <- ro(z, h=2, origins=3, call=ourCall, value=ourValue, ci=TRUE, co=TRUE)
# calculate MAE, scaled by the mean of the actuals
MAE_AR1 <- apply(abs(ValuesAR1$holdout - ValuesAR1$pred),1,mean,na.rm=TRUE) / mean(ValuesAR1$actuals)
ValuesAR1[[3]]
**Which yields:**
> ValuesAR1[[3]]
origin3 origin4 origin5
h1 2 3 4
h2 2 3 4
For further reading see: https://cran.r-project.org/web/packages/greybox/vignettes/ro.html
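The recursive scheme described above can be sketched in plain R, independent of greybox (maForecast is a hypothetical helper name, not part of any package; something like it could then be handed to ro() via call = "maForecast(data,h)" and value = "pred"):

```r
# Recursive moving average: each forecast is appended and re-enters the window
maForecast <- function(data, h, window = 3) {
  y <- as.numeric(data)
  for (i in seq_len(h)) {
    y <- c(y, mean(tail(y, window)))  # append forecast, reuse it next step
  }
  list(pred = tail(y, h))
}

maForecast(c(1, 2, 3), h = 2)$pred   # 2.0000 2.3333, matching the worked example
maForecast(c(2, 3, 4), h = 2)$pred   # 3.0000 3.3333
```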

Related

Use arima.sim to simulate an ARIMA(1,1,1) with drift in R

I am trying to use arima.sim to simulate an ARIMA process with a drift, but I cannot seem to get it to work.
I need to get something like this:
My code is producing this though:
> mean(datatime)
[1] 15881.56
> sd(datatime)
[1] 8726.893
> length(datatime)
[1] 123
# The mean and variance from the original series
originalseriesmean = 15881.56
originalseriesvariance = 8726.893*8726.893
originalseriesn=123
# Simulation using arima.sim
ts.sim <- arima.sim(model=list(c(1,1,1)), n = 123, mean=190,sd=69.2863)
ts.plot(ts.sim)
How do I add a drift term to this function to make it look like the target simulation?
An ARIMA process simulated with arima.sim has no drift/trend term by definition. Inspired by this answer on Cross Validated (arima with trend), and taking into consideration the values you want:
set.seed(123)
intercept <- 4500
b <- (32000 - intercept) / 123
x <- 1:123
y <- b * x + arima.sim(model=list(c(1, 0, 1)),
n = 123, mean=intercept, sd=2000)
> sd(y)
[1] 8020
> mean(y)
[1] 18370
The argument mean gives you the intercept of the process (where it starts). To obtain the variance, you should first detrend your process: the mean and sd values you quote include the trend, whereas to simulate such a process you should decompose it into noise + trend.
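The detrending point can be checked concretely: fit a linear trend with base R's lm and compare the spread of the residuals with the spread of the raw series (a sketch reusing the simulation above; exact numbers depend on the seed):

```r
set.seed(123)
intercept <- 4500
b <- (32000 - intercept) / 123
x <- 1:123
y <- b * x + arima.sim(model = list(c(1, 0, 1)),
                       n = 123, mean = intercept, sd = 2000)

fit <- lm(y ~ x)           # estimate the deterministic trend
noise <- residuals(fit)    # the detrended process

sd(y)      # spread including the trend (large)
sd(noise)  # spread of the noise alone (near the simulated sd of 2000)
```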

Clustering with Mclust results in an empty cluster

I am trying to cluster my empirical data using Mclust. When using the following, very simple code:
library(reshape2)
library(mclust)
data <- read.csv(file.choose(), header=TRUE, check.names = FALSE)
data_melt <- melt(data, value.name = "value", na.rm=TRUE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
summary(fit, parameters = TRUE)
R gives me the following result:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 4 components:
log-likelihood n df BIC ICL
-20504.71 3258 8 -41074.13 -44326.69
Clustering table:
1 2 3 4
0 2271 896 91
Mixing probabilities:
1 2 3 4
0.2807685 0.4342499 0.2544305 0.0305511
Means:
1 2 3 4
1381.391 1381.715 1574.335 1851.667
Variances:
1 2 3 4
7466.189 7466.189 7466.189 7466.189
Edit: Here my data for download https://www.file-upload.net/download-14320392/example.csv.html
I do not readily understand why Mclust gives me an empty cluster (0), especially with a mean nearly identical to that of the second cluster. This only appears when specifically asking for a univariate, equal-variance model. Using, for example, modelNames="V", or leaving the default, does not produce this problem.
This thread: Cluster contains no observations describes a similar problem, but if I understand correctly, that one appeared to be due to randomly generated data?
I am somewhat clueless as to where my problem is or if I am missing anything obvious.
Any help is appreciated!
As you noted, the means of clusters 1 and 2 are extremely similar, and it so happens that there is quite a lot of data there (see the spike on the histogram):
set.seed(111)
data <- read.csv("example.csv", header=TRUE, check.names = FALSE)
fit <- Mclust(data$value, modelNames="E", G = 1:7)
hist(data$value,br=50)
abline(v=fit$parameters$mean,
col=c("#FF000080","#0000FF80","#BEBEBE80","#BEBEBE80"),lty=8)
Briefly, mclust and other GMM implementations are probabilistic models: they estimate the mean and variance of each cluster, along with the probability of each point belonging to each cluster. This is unlike k-means, which gives a hard assignment. The model's likelihood is then built from the probabilities of each data point under each cluster; you can check this in mclust's publication.
In this model, the means of cluster 1 and cluster 2 are near but their expected proportions are different:
fit$parameters$pro
[1] 0.28565736 0.42933294 0.25445342 0.03055627
This means that if a data point lies around the means of clusters 1 and 2, it will consistently be assigned to cluster 2. For example, let's predict data points from 1350 to 1400:
head(predict(fit,1350:1400)$z)
1 2 3 4
[1,] 0.3947392 0.5923461 0.01291472 2.161694e-09
[2,] 0.3945941 0.5921579 0.01324800 2.301397e-09
[3,] 0.3944456 0.5919646 0.01358975 2.450108e-09
[4,] 0.3942937 0.5917661 0.01394020 2.608404e-09
[5,] 0.3941382 0.5915623 0.01429955 2.776902e-09
[6,] 0.3939790 0.5913529 0.01466803 2.956257e-09
The $classification is obtained by taking the column with the maximum probability. So, in the same example, everything is assigned to cluster 2:
head(predict(fit,1350:1400)$classification)
[1] 2 2 2 2 2 2
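The mechanics can be reproduced by hand for a toy two-component univariate mixture: the membership probabilities are the prior-weighted normal densities, renormalized, and the classification is the argmax. At a point exactly midway between two equal-variance components, the posterior falls back to the mixing proportions, which is why the larger cluster always wins near-ties (toy numbers below, not fitted to the question's data):

```r
# Toy 2-component univariate GMM: means 0 and 3, common sd 1, proportions 0.7/0.3
mu <- c(0, 3); sdev <- c(1, 1); pro <- c(0.7, 0.3)

posterior <- function(x) {
  dens <- pro * dnorm(x, mu, sdev)  # prior-weighted component densities
  dens / sum(dens)                  # renormalize to membership probabilities
}

posterior(1.5)             # midpoint: posterior equals the priors, 0.7 0.3
which.max(posterior(1.5))  # hard classification: component 1, the larger cluster
```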
To answer your question: no, you did not do anything wrong; it is a quirk of at least this implementation of GMM. I would say it is a bit of overfitting, but you can simply keep only the clusters that actually received observations.
If you use modelNames="V", I see the solution is equally problematic:
fitv <- Mclust(data$value, modelNames="V", G = 1:7)
plot(fitv, what="classification")
Using scikit-learn's GMM I don't see a similar issue. So if you need a Gaussian mixture with equal variances, consider using a fuzzy k-means instead:
library(ClusterR)
# fuzzy k-means; 3 clusters chosen for illustration
fit_kmeans <- KMeans_rcpp(as.matrix(data$value), clusters = 3, fuzzy = TRUE)
plot(NULL, xlim=range(data$value), ylim=c(0,4), ylab="cluster", yaxt="n", xlab="values")
points(data$value, fit_kmeans$clusters, pch=19, cex=0.1, col=factor(fit_kmeans$clusters))
axis(2, 1:3, as.character(1:3))
If you don't need equal variance, you can use the GMM function in the ClusterR package too.

Generating different percentages of MAR data in R

The following two R functions are from the book "Flexible Imputation of Missing Data" (pages 59 and 63). The first one generates missing completely at random (MCAR) data and the second one generates missing at random (MAR) data. Both functions give approximately 50% missing values.
In the MCAR function, we can generate different percentages of missing data by changing the p value. But in the MAR function, I don't understand which parameter we should change to generate different percentages of missing data, such as 10% or 30%.
MCAR
makemissing <- function(data, p=0.5){
rx <- rbinom(nrow(data), 1, p)
data[rx==0,"y"] <- NA
return(data)
}
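For the MCAR function the relationship is direct: rows with rx == 0 are set to NA, so p is the probability of being observed and 1 - p is the expected missingness (a quick check with simulated data, reusing the function above):

```r
# MCAR generator from the question: p is the probability of being observed
makemissing <- function(data, p = 0.5) {
  rx <- rbinom(nrow(data), 1, p)  # 1 = observed, 0 = missing
  data[rx == 0, "y"] <- NA
  return(data)
}

set.seed(1)
df <- data.frame(y = rnorm(10000))
mis <- makemissing(df, p = 0.7)
mean(is.na(mis$y))  # close to 1 - 0.7 = 0.3
```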
MAR
logistic <- function(x) exp(x)/(1+exp(x))
library(MASS) # for mvrnorm
set.seed(32881)
n <- 10000
p2.marright <- 1 - logistic(-5 + y[,1])
r2.marright <- rbinom(n, 1, p2.marright)
yobs <- y
yobs[r2.marright==0, 2] <- NA
For the MCAR function, the probability of an observation being missing is 50% for every case because, by definition, the missingness does not depend on the data. For the MAR version, the probability of being missing differs per observation, since it depends on the values of y[,1]. In your code, the probability of y[,2] being observed is saved in the variable p2.marright. You can perhaps see this more easily by lining up all of the values in a data frame:
df <- data.frame(y1 = y[,1], y2_ori = y[,2], y2_mis = yobs[,2], p2.marright = p2.marright, r2.marright)
head(df)
y1 y2_ori y2_mis p2.marright r2.marright
1 2.086475 3.432803 3.432803 0.9485110 1
2 3.784675 5.005584 5.005584 0.7712399 1
3 4.818409 5.356688 NA 0.5452733 0
4 2.937422 3.898014 3.898014 0.8872124 1
5 6.422158 5.032659 5.032659 0.1943236 1
6 4.115106 5.083162 5.083162 0.7078354 1
You can see that whether or not an observation will be NA on y2 is encoded in r2.marright, which is a probabilistic binary version of p2.marright: for higher values of p2.marright, r2.marright is more likely to be 1 (observed). To change the overall rate of missingness, you can change the calculation of p2.marright to bias it higher or lower.
You can manipulate p2.marright by changing the constant in the logistic transformation (-5 in the example). If you increase it (make it less negative, e.g. -4), p2.marright decreases, resulting in more missing values on y2. If you decrease it (make it more negative, e.g. -6), you end up with fewer missing values on y2. (The reason -5 results in 50% missingness is that 5 is the mean of the variable being transformed, y1.) This works, but the mechanism is rather opaque and hard to control precisely. For example, it is not obvious what the constant should be if you want 20% missingness on y2.
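One way to make the constant less opaque is to solve for it numerically: the expected missingness on y2 is mean(logistic(cst + y[,1])), so uniroot can find the constant that hits any target rate (a sketch reusing the question's simulated y; requires MASS for mvrnorm):

```r
library(MASS)  # for mvrnorm
logistic <- function(x) exp(x) / (1 + exp(x))

set.seed(32881)
n <- 10000
y <- mvrnorm(n = n, mu = c(5, 5), Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))

target <- 0.2  # desired proportion of missing values on y2
# expected missingness minus target, as a function of the logistic constant
f <- function(cst) mean(logistic(cst + y[, 1])) - target
cst <- uniroot(f, interval = c(-20, 10))$root

p2 <- 1 - logistic(cst + y[, 1])  # probability of being observed
r2 <- rbinom(n, 1, p2)
mean(r2 == 0)                     # realized missingness, close to 0.2
```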

Print significant auto-correlation value

If I do an autocorrelation test in R (acf), I get a great graph, and the horizontal lines show the cutoff of significance.
acf also prints the individual lag values in the console; however, there I can't see which are significant. Is there an easy way to do that without looking at the graph?
So basically, for this we need to know the cutoff value. By reading the source of acf and stats:::plot.acf you can see that it may differ for other parameter values, but for the defaults here is what you should use:
set.seed(123)
x <- arima.sim(list(ar = 0.5), 100)
r <- acf(x, plot = FALSE)$acf
which(abs(r)[-1] >= qnorm(1 - 0.05 / 2) / sqrt(length(x)))
# [1] 1 2 3 9 10 12 13
where 0.05 is the significance level in this case.
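This can be wrapped into a small helper (sig_lags is a hypothetical name) that reads the sample size from the acf object itself, so it works without keeping x around:

```r
# Return the lags whose autocorrelation exceeds the default plot.acf cutoff
sig_lags <- function(x, level = 0.95) {
  r <- acf(x, plot = FALSE)
  ci <- qnorm((1 + level) / 2) / sqrt(r$n.used)  # same cutoff plot.acf draws
  lags <- r$lag[-1]                              # drop lag 0, which is always 1
  lags[abs(r$acf[-1]) >= ci]
}

set.seed(123)
x <- arima.sim(list(ar = 0.5), 100)
sig_lags(x)  # same lags as the which() approach above
```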

Time series bootstrapping in R: How can I access each simulated path using tsbootstrap?

I want to perform a bootstrap analysis of a specific time series.
I am using the function tsbootstrap from the package tseries. My problem: for values of m > 1, I cannot access each bootstrapped path individually (m is the length of the basic blocks in the block-of-blocks bootstrap; see ?tsbootstrap).
library(tseries)
set.seed(1)
TS <- sample(1:20)
tsbootstrap(TS,m=2, nb=1)
gives:
Error in tsbootstrap(TS, m = 2, nb = 1) :
can only return bootstrap data for m = 1
To my knowledge, the function can then only compute statistics (e.g. the mean) over all of the simulated trajectories, but I need each simulated series itself. How can I get around this problem? (I am aware of the function tsboot in the boot package, but I have not managed to get it to work yet.)
The b argument is the block length; m is the "block of blocks" argument, for when you want to calculate statistics for each resampled series rather than return each resampled series itself. Leave m = 1 to get the paths:
library(tseries)
# Simulate a time series
set.seed(1)
TS<-arima.sim(model=list(ar=c(.8,-.2)), n=20)
plot(TS)
# 3 bootstrap samples with block size b=5
TSboot = tsbootstrap(TS, m=1, b=5, type="block", nb=3)
# Here are the individual bootstrapped series
TSboot
Time Series:
Start = 1
End = 20
Frequency = 1
[,1] [,2] [,3]
1 -0.72571390 1.94273559 1.62729703
2 -0.36463539 2.00048877 0.34495502
3 -0.30236104 1.28640888 -2.26419528
...
18 0.96532247 -0.72571390 -0.36463539
19 1.59792898 -0.36463539 -0.30236104
20 1.67918002 -0.30236104 -1.63971414
plot(TSboot)
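If installing tseries is not an option, a minimal moving-block bootstrap that returns every path can be sketched in a few lines of base R (block_bootstrap is a hypothetical helper, not part of any package, and not an exact replica of tsbootstrap's scheme):

```r
# Moving-block bootstrap: glue together randomly chosen blocks of length b
block_bootstrap <- function(x, b, nb) {
  n <- length(x)
  nblocks <- ceiling(n / b)
  sapply(seq_len(nb), function(i) {
    starts <- sample(n - b + 1, nblocks, replace = TRUE)  # block start points
    idx <- as.vector(vapply(starts, function(s) s:(s + b - 1), integer(b)))
    x[idx[1:n]]                                           # trim to length n
  })
}

set.seed(1)
x <- cumsum(rnorm(40))                  # some autocorrelated series
paths <- block_bootstrap(x, b = 5, nb = 3)
dim(paths)                              # 40 x 3: one column per bootstrapped path
```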