Measure of prediction accuracy for discrete time series with zero values - percentage

I make predictions for the on and off times of a machine. They look like the following picture, where red shows the actual values and blue the forecast values.
For the forecasts, I would like to be able to say by what percentage they differ from the actual values.
Therefore I looked at the MAPE (mean absolute percentage error), but it doesn't work because the actual values contain zeros, which make the percentage error undefined.
What kind of prediction accuracy measurement would you recommend for time series that look like the ones in the picture?
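
One commonly suggested scale-free alternative that stays defined when the actuals contain zeros is the MASE (mean absolute scaled error), which divides the forecast error by the in-sample error of a naive forecast. A minimal sketch in R, where actual, forecast, and train are hypothetical vectors standing in for your data:

mape <- function(actual, forecast) {
  # Breaks down (Inf/NaN) whenever actual contains zeros
  mean(abs((actual - forecast) / actual)) * 100
}
mase <- function(actual, forecast, train) {
  # Scale by the mean absolute error of the in-sample one-step naive
  # forecast, which is nonzero unless the training series is constant
  scale <- mean(abs(diff(train)))
  mean(abs(actual - forecast)) / scale
}

A MASE below 1 means the forecast beats the naive benchmark on average; it is not a percentage, but it is comparable across series.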

Related

R^2 accuracy of xgboost in RStudio going to -500%

I have been building a model to predict a ratio: CTS = sales/visits. All the variables in the model are time decompositions, rolling means, and rolling lags of the dependent variable. This model is behaving very strangely in two main ways:
First, the variable importance ranks NA as the top variable, despite my never assigning it a cluster and despite there being no NA columns or values in my training set.
Second, when I predict on my validation set, R^2 drops to -547%. I know that a negative R^2 means a constant mean would predict better than the model itself; what seems very strange, however, is that if I use the same model to predict the numerator and the denominator separately, it achieves 70% accuracy or better.
I can't find a reason why this could even happen. The two differences are:
the range of values the numerator and denominator can take is much bigger than that of the dependent variable of interest;
sales and visits are non-stationary while CTS is stationary.
When I then compare the actual CTS, the calculated CTS (predicted sales divided by predicted visits), and the model-predicted CTS, the model-predicted CTS tends much more toward the mean value, while the calculated one is not great but fluctuates more in line with the actual values. Can anyone help me understand why this happens and how to fix it?
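
For reference, a minimal sketch of how R^2 is typically computed, with hypothetical actual and pred vectors standing in for the validation set; the value goes negative exactly when the residual sum of squares exceeds the variation around the mean:

r_squared <- function(actual, pred) {
  ss_res <- sum((actual - pred)^2)            # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)    # total sum of squares
  1 - ss_res / ss_tot   # negative when the model is worse than the mean
}

So an R^2 of -547% (-5.47) means the model's squared error is roughly 6.5 times that of simply predicting the mean.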

Correlation coefficients between weather conditions are zero

I want to build a model for rainfall prediction using data mining and machine learning. As one of the initial steps, I computed the correlation matrix for weather conditions, including Rain Gauge, Average Temperature, Relative Humidity, Wind Speed, and O3. I used the Spearman and Kendall correlation formulas and computed the correlation coefficients in RStudio.
The problem is that there is no significant relation between Rain Gauge and the other variables, since all the correlation coefficients are close to zero.
I want to know whether there is any other way to find the variables most related to Rain Gauge, or whether it is okay to use these variables even though the correlation coefficients are close to zero.
Since this project is a new experience for me, please excuse any mistakes.
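
For what it's worth, a minimal sketch of the computation described above, assuming a hypothetical data frame weather with numeric columns for the listed variables:

# weather is a hypothetical data frame with numeric columns:
# RainGauge, AvgTemp, RelHumidity, WindSpeed, O3
cor(weather, method = "spearman")
cor(weather, method = "kendall")

Note that Spearman and Kendall coefficients only capture monotonic association, so a near-zero coefficient does not rule out a nonlinear or interaction effect with rainfall.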

Calculating marginal effects from predicted probabilities of zeroinfl() model object

This plot, which I previously created, shows predicted probabilities of claim onset based on two variables: PIB (scaled across the x-axis) and W, presented at its 75th and 25th percentiles. Confidence intervals for the predictions are shown alongside the two lines.
(Figure: Probability of Claim Onset)
As I theorize that W and PIB have an interactive effect on claim onset, I'd like to see whether the marginal effect of W on PIB is significant. Confidence intervals of the predicted probabilities alone cannot establish whether this effect is significant, per my reading here (https://www.sociologicalscience.com/download/vol-6/february/SocSci_v6_81to117.pdf).
I know that you can calculate the marginal effect easily from predicted probabilities by subtracting one from the other. Yet I don't understand how to get confidence intervals for the marginal effect, which are obviously needed to determine when and where my two sets of probabilities are indeed significantly different from one another.
The function that I used for calculating predicted probabilities of the zeroinfl() model object and the confidence intervals of those predicted probabilities is derived from an online posting (https://stat.ethz.ch/pipermail/r-help/2008-December/182806.html). I'm happy to provide more code if needed, but as this is not a question about an error, I am not sure it is needed.
So, I'm not entirely sure this is the correct answer, but to anyone who might come across the same problem I did:
Assuming that the two prediction lines have the same variance, you can pool the standard errors before computing the interval for the difference. See the Wikipedia article on pooled variance for the formula.
## Recover the SDs from the SEs, pool the variances (equal simulation
## sizes), and take the square root before scaling by the group sizes
sd_1 <- pred_1$SE * sqrt(simulation_n)
sd_2 <- pred_2$SE * sqrt(simulation_n)
var_pooled <- (sd_1^2 + sd_2^2) / 2
SEpooled <- sqrt(var_pooled) * sqrt((1/simulation_n) + (1/simulation_n))
low_conf <- (pred_1$PP - pred_2$PP) - (1.96*SEpooled)
high_conf <- (pred_1$PP - pred_2$PP) + (1.96*SEpooled)
##Add this to the plot
lines(pred_1$x_val, low_conf, lty=2)
lines(pred_1$x_val, high_conf, lty=2)

Function to produce a single metric to compare the shape of two distributions (predictions vs actuals)

I am assessing the accuracy of a model that predicts count data.
My actual data has quite an unusual distribution: although I have a large amount of data, its shape is unlike any standard distribution (Poisson, normal, negative binomial, etc.).
As part of my assessment, I want a metric for how well the distribution of the predictions matches the distribution of the actual data. I've tried standard model performance metrics such as MAE and RMSE, but they don't seem to capture how well the predictions match the expected distribution.
My initial idea was to split the predictions into deciles and calculate what proportion falls in each decile. This would be a very rough indication of the underlying distribution. I would then calculate the same for my actuals and sum the absolute differences between the proportions.
This works to some extent, but it feels a bit clunky, and the split into deciles feels arbitrary. Is there a function in R that produces a single metric for how well two distributions match?
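
A minimal sketch of the decile idea described above, with hypothetical actuals and preds vectors of counts; unique() guards against duplicated quantile breaks, which are common with tied count data:

decile_distance <- function(actuals, preds, n_bins = 10) {
  # Common bin edges from the pooled quantiles
  breaks <- unique(quantile(c(actuals, preds),
                            probs = seq(0, 1, length.out = n_bins + 1)))
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf
  prop_actual <- table(cut(actuals, breaks)) / length(actuals)
  prop_pred   <- table(cut(preds,   breaks)) / length(preds)
  # 0 = identical binned shapes; 2 = completely disjoint
  sum(abs(prop_actual - prop_pred))
}

For a ready-made single number, base R's ks.test(preds, actuals) returns the maximum distance between the two empirical CDFs, though it assumes continuous data and will warn about ties with count data.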

Correlated residuals in time series

I use "vars" R package to do a multivariate time series analysis. The thing is when I conduct a bivariate VAR, the result of serial.test() give always a really low p-value, so we reject H0 and the residuals are correlated. The right thing to do is to increase the order of the VAR but even with a very high order (p=20 or even more) my residuals are still correlated.
How is it possible ?
I can't really give you a reproducible code because i don't know how to reproduce a VAR with residuals always correlated. For me it's a really unusual situation, but if someone know how it's possible it would be great.
This is probably a better question for Cross Validated, as it doesn't contain any R code or a reproducible example, but you're probably going to need to do more digging than "I have a low p-value". Have you tested your data for normality? Also, to say
The right thing to do is to increase the order of the VAR
is very inaccurate. What type of data are you working with that you would set a lag order as high as 20? A typical value for yearly data is 1, for quarterly data 4, and for monthly data 12. You can't just keep throwing higher and higher orders at the problem and expect it to fix issues in the underlying data.
Assuming you have an optimal lag value, your data is normally distributed, and you still have a low p-value, there are several ways to go (see the sketch after the excerpt below).
Minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables. Or, if you have an ARIMA+regressor procedure available in your statistical software, try adding an AR(1) or MA(1) term to the regression model. An AR(1) term adds a lag of the dependent variable to the forecasting equation, whereas an MA(1) term adds a lag of the forecast error. If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.
If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3 or DW stat greater than 2.6), watch out for the possibility that you may have overdifferenced some of your variables. Differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to artificial patterns of negative correlation that lagged variables cannot correct for.
If there is significant correlation at the seasonal period (e.g. at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of the following ways: (i) seasonally adjust the variables (if they are not already seasonally adjusted), or (ii) use seasonal lags and/or seasonally differenced variables (caution: be careful not to overdifference!), or (iii) add seasonal dummy variables to the model (i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2, etc.) The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)
Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of differencing, logging, and/or deflating.
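
As a rough sketch of the checks mentioned above using the vars package (y is a hypothetical two-column time series matrix; treat the settings as placeholders to adapt to your data):

library(vars)
# Choose the lag order by information criteria instead of guessing
sel <- VARselect(y, lag.max = 12, type = "const")
p <- sel$selection["AIC(n)"]
fit <- VAR(y, p = p, type = "const")
# Portmanteau test for residual serial correlation (H0: no autocorrelation)
serial.test(fit, lags.pt = 16, type = "PT.asymptotic")
# Jarque-Bera-type test for residual normality
normality.test(fit)

If the portmanteau test still rejects at a sensible lag order, that points to the structural issues described above (missing seasonality, needed transformations, or non-stationary inputs) rather than to an insufficiently large p.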
