I am wondering what a good approach for testing a time series model would be. Suppose I have a time series over the time points t1, t2, ..., tN, with inputs zt1, zt2, ..., ztN and outputs x1, x2, ..., xN.
Now, if that were a classical data mining problem, I could go with known approaches like cross-validation, leave-one-out, 70-30 or something else.
But how should I approach testing my model with time series? Should I build the model on the first t1, t2, ..., t(N-k) inputs and test it on the last k inputs? And what if we want to optimise the prediction for p steps ahead rather than k (where p < k)? I am looking for a robust solution that I can apply to my specific case.
With time series fitting, you need to be careful not to use your out-of-sample data until after you've developed your model. The main problem with modelling is that it is simply too easy to overfit.
Typically we use the first 70% of the data for in-sample modelling and the remaining 30% for out-of-sample testing/validation. When the model is used in production, the data collected day to day becomes true out-of-sample data: data you have never seen or used.
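A minimal sketch of the split described above, combined with a p-steps-ahead evaluation over the out-of-sample part. The function names and the naive "last value persists" forecaster are placeholders of mine, not part of any library; swap in your own model.

```python
def chronological_split(series, train_fraction=0.7):
    """Split in time order: earliest part in-sample, latest part out-of-sample."""
    cut = int(round(len(series) * train_fraction))
    return series[:cut], series[cut:]


def naive_forecast(history, steps_ahead):
    """Placeholder model (an assumption): the last observed value persists."""
    return history[-1]


def rolling_origin_mae(series, train_fraction, steps_ahead, forecaster):
    """Mean absolute error of p-step-ahead forecasts over the out-of-sample part.

    The forecast origin moves forward one step at a time, and each forecast
    only sees data before its origin, so no future values leak into the model.
    """
    cut = int(round(len(series) * train_fraction))
    errors = []
    for origin in range(cut, len(series) - steps_ahead + 1):
        history = series[:origin]
        target = series[origin + steps_ahead - 1]
        errors.append(abs(forecaster(history, steps_ahead) - target))
    return sum(errors) / len(errors)


series = [float(t) for t in range(1, 21)]  # toy data: 1, 2, ..., 20
in_sample, out_of_sample = chronological_split(series)
mae_1 = rolling_origin_mae(series, 0.7, 1, naive_forecast)
mae_3 = rolling_origin_mae(series, 0.7, 3, naive_forecast)
print(len(in_sample), len(out_of_sample), mae_1, mae_3)
```

Scoring at the horizon you actually care about (p) rather than over the whole hold-out block (k) is what addresses the "p < k" concern: the hold-out length and the forecast horizon are separate choices.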
Now, if you have enough data points, I'd suggest trying a rolling window fitting approach. For each time step in your in-sample data, you look back window_size time steps to fit your model and see how the parameters in your model vary over time. For example, let's say your model is a linear regression with Y = B0 + B1*X1 + B2*X2. With N in-sample points you would run the regression N - window_size + 1 times over the sample. This way, you understand how sensitive your betas are to time, among other things.
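Here is a rough sketch of the rolling window idea. To keep it short it uses a single predictor (Y = B0 + B1*X) instead of the two-predictor example above, and the helper names are my own; the same loop works with any fitting routine.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def rolling_betas(xs, ys, window_size):
    """Refit the regression on every window of `window_size` consecutive points."""
    return [
        fit_line(xs[start:start + window_size], ys[start:start + window_size])
        for start in range(len(xs) - window_size + 1)
    ]


# Toy data: the true slope changes from 2 to 5 halfway through the sample.
xs = [float(i) for i in range(20)]
ys = [2.0 * x if x < 10 else 5.0 * x - 30.0 for x in xs]
betas = rolling_betas(xs, ys, window_size=5)
for start, (b0, b1) in enumerate(betas):
    print(f"window starting at t={start}: B0={b0:.2f}, B1={b1:.2f}")
```

If the printed betas drift a lot from window to window, a single set of coefficients fitted on the whole sample is probably hiding a relationship that changes over time.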
It sounds like you have a choice between
Using the first few years of data to create the model, then seeing how well it predicts the remaining years.
Using all the years of data for some subset of input conditions, then seeing how well it predicts using the remaining input conditions.
My job is to make sure that an online retailer achieves a certain service level (in stock rate) for their products, while avoiding aging and excess stock. I have a robust cost and leadtime simulation model. One of the inputs into that model is a vector of prediction intervals for cumulative demand over the next leadtime weeks.
I've been reading about quantile regression, conforming models, gradient boosting, and quantile random forest... frankly all of these are far above my head, and they seem focused on multivariate regression of non-time-series data. I know that I can't just regress against time, so I'm not even sure how to set up a complex regression method correctly. Moreover, since I'm forecasting many thousands of items every week, the parameter setting and tuning needs to be completely automated.
To date, I've been using a handful of traditional forecast methods (TSB [variation of Croston], ETS, ARIMA, etc.) including hybrids, using R packages like forecastHybrid. My prediction intervals are almost universally much narrower than our actual results (e.g. in a sample of 500 relatively steady-selling items, 20% were below my ARIMA 1% prediction interval, and 12% were above the 99% prediction interval).
I switched to using simulation + bootstrapping the residuals to build my intervals, but the results are directionally the same as above.
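For reference, the simulation + residual bootstrap described above can be sketched roughly as follows. This is an illustration with made-up numbers and a hypothetical function name, not the code actually in use. Note that it draws residuals independently, so it ignores autocorrelation in the errors and uncertainty in the fitted parameters, both of which tend to make intervals too narrow.

```python
import random


def bootstrap_cumulative_quantiles(point_forecasts, residuals, quantiles,
                                   n_paths=5000, seed=0):
    """Simulate cumulative leadtime demand by resampling historical residuals.

    Each simulated path adds one randomly drawn residual to every per-period
    point forecast, floors demand at zero, and sums over the leadtime.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_paths):
        path = [max(0.0, f + rng.choice(residuals)) for f in point_forecasts]
        totals.append(sum(path))
    totals.sort()
    # Empirical quantiles of the simulated cumulative demand.
    return {q: totals[min(int(q * n_paths), n_paths - 1)] for q in quantiles}


point_forecasts = [10.0, 10.0, 10.0, 10.0]  # toy 4-week leadtime
residuals = [-3.0, -1.0, 0.0, 1.0, 3.0]     # toy one-step forecast errors
intervals = bootstrap_cumulative_quantiles(point_forecasts, residuals,
                                           quantiles=[0.1, 0.5, 0.9])
print(intervals)
```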
I'm looking for the simplest way to arrive at a univariate time series model with more accurate prediction intervals for cumulative demand over leadtime weeks, particularly at the upper / lower 10% and beyond. All my current models are training on MSE, so one step is probably to use something more like pinball loss scoring against cumulative demand (rather than the per-period error). Unfortunately I'm totally unfamiliar with how to write a custom loss function for the legacy forecasting libraries (much less the new sexy ones above).
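To make the scoring idea concrete, here is a small sketch of pinball loss evaluated on cumulative leadtime demand rather than per-period error. The item data and the predicted quantiles are invented for illustration; the loss itself does not depend on any forecasting library.

```python
def pinball_loss(actual, predicted_quantile, q):
    """Pinball (quantile) loss for one observation at quantile level q in (0, 1)."""
    diff = actual - predicted_quantile
    # Under-prediction is penalised by q, over-prediction by (1 - q).
    return q * diff if diff >= 0 else (q - 1.0) * diff


def mean_pinball_loss(actuals, predicted_quantiles, q):
    """Average pinball loss over many items (or many forecast origins)."""
    pairs = list(zip(actuals, predicted_quantiles))
    return sum(pinball_loss(a, p, q) for a, p in pairs) / len(pairs)


# Toy example: weekly demand per item over a 3-week leadtime.
weekly_demand = [[3.0, 4.0, 5.0], [0.0, 2.0, 1.0]]
cumulative_actuals = [sum(weeks) for weeks in weekly_demand]
predicted_q90 = [10.0, 6.0]  # hypothetical 90% quantiles of cumulative demand
score = mean_pinball_loss(cumulative_actuals, predicted_q90, q=0.9)
print(cumulative_actuals, score)
```

Even without a custom loss inside the fitting routine, a score like this can be used to compare candidate methods, or candidate interval widths, on held-out leadtime windows.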
I'd deeply appreciate any advice!
A side note: we already have an AWS setup that can compute each item from an R job in parallel, so computing time is not a major factor.
Our professor wanted us to run 10-fold cross-validation on a data set to find the model with the lowest RMSE, then use its coefficients to write a function that takes in parameters and returns a predicted "Fitness Factor" score, which ranges between 25 and 75.
He encouraged us to try transforming the data, so I did. I used scale() on the entire data set to standardize it and then ran my regression and 10-fold cross-validation. I then found the model I wanted and copied the coefficients over. The problem is that my function's predictions are WAY off when I put unstandardized parameters into it to predict y.
Did I completely screw this up by standardizing the data to a mean of 0 and sd of 1? Is there any way I can undo this mess if I did screw up?
My coefficients are extremely small numbers and I feel like I did something wrong here.
Build a proper pipeline, not just a hack with some R functions.
The problem is that you treat scaling as part of loading the data, not as part of the prediction process.
The proper protocol is as follows:
"Learn" the transformation parameters
Transform the training data
Train the model
Transform the new data
Predict the value
Inverse-transform the predicted value
During cross-validation these steps need to run separately for each fold, or you may overestimate your model's quality, because the transformation parameters would be learned partly from the held-out data.
Standardization is a linear transform, so the inverse is trivial to find.
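The protocol above can be sketched as follows. The question uses R's scale(); this is a plain Python illustration with helper names of my own and a single predictor, just to show where each step goes.

```python
from statistics import mean, stdev


def learn_scaler(values):
    """Learn the transformation parameters (from training data only)."""
    return mean(values), stdev(values)


def standardize(values, scaler):
    center, spread = scaler
    return [(v - center) / spread for v in values]


def unstandardize(values, scaler):
    """Inverse of standardize: a linear transform, so trivial to undo."""
    center, spread = scaler
    return [v * spread + center for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def predict_raw(new_x, model, x_scaler, y_scaler):
    """Transform new data, predict, then inverse-transform the prediction."""
    b0, b1 = model
    z_pred = [b0 + b1 * z for z in standardize(new_x, x_scaler)]
    return unstandardize(z_pred, y_scaler)


train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [53.0, 56.0, 59.0, 62.0, 65.0]  # toy data: y = 3x + 50

x_scaler = learn_scaler(train_x)
y_scaler = learn_scaler(train_y)
model = fit_line(standardize(train_x, x_scaler), standardize(train_y, y_scaler))
predictions = predict_raw([10.0, 0.0], model, x_scaler, y_scaler)
print(model, predictions)
```

The fitted coefficients live on the standardized scale, which is why plugging raw inputs straight into them gives nonsense. In cross-validation, learn_scaler would be called on each fold's training part only.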
I want to improve, step by step as unevenly sampled data arrive, my estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity of a projectile but do not know its final position and velocity; you are, however, (slowly) receiving measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, the data have almost the same sampling period, about 15 min. There are also some measurements without changes, because of the nature of the phenomenon (heat transfer). The data show an exponential trend and I can fit them to a known model, but a large amount of data is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progressive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But it was using preprocessed data (i.e., jitter removal, smoothing, and fitting using the theoretical functional form). It gave me the following result:
This is a real example of the problem and its "current solution"
It is good enough for me, but I would like to know whether there is an optimal way of doing this that employs the raw data (or smoothed data).
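For readers who want to see the weighting scheme in action, here is a rough sketch of a progressive weighted fit that is re-run each time a new measurement arrives. It mirrors the weights in W = diag((0.5).^(1:kk)), but it assumes a straight line as a local stand-in for the theoretical exponential model, and the data are invented.

```python
def weighted_line_fit(times, values, weights):
    """Weighted least squares for y = b0 + b1 * t; returns (b0, b1)."""
    total = sum(weights)
    mean_t = sum(w * t for w, t in zip(weights, times)) / total
    mean_y = sum(w * y for w, y in zip(weights, values)) / total
    stt = sum(w * (t - mean_t) ** 2 for w, t in zip(weights, times))
    sty = sum(w * (t - mean_t) * (y - mean_y)
              for w, t, y in zip(weights, times, values))
    slope = sty / stt
    return mean_y - slope * mean_t, slope


def progressive_slopes(times, values, decay=0.5):
    """Re-estimate the slope after each new measurement (kk = 2, 3, ...).

    Measurement i gets weight decay**i, so the earliest points dominate.
    """
    estimates = []
    for kk in range(2, len(times) + 1):
        weights = [decay ** i for i in range(1, kk + 1)]
        _, slope = weighted_line_fit(times[:kk], values[:kk], weights)
        estimates.append(slope)
    return estimates


times = [0.0, 14.0, 31.0, 44.0, 61.0]    # minutes, irregular spacing near 15 min
values = [2.0 + 0.5 * t for t in times]  # toy noiseless data with slope 0.5
slopes = progressive_slopes(times, values)
print(slopes)
```

Nothing here requires evenly spaced samples, since the fit uses the actual measurement times.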
IMO, additional data do not help improve the estimate at zero, because perturbations come into play and the correlation between the first and last samples decreases.
Also, the asymptotic behavior of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can introduce a bias in the estimate.
I would stick to the first points (say up to t=20) and fit a simple model, say quadratic.
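A minimal sketch of that suggestion: fit y = c0 + c1*t + c2*t^2 to the early points only and read off c1, which is the derivative at t = 0. The cutoff of 20 comes from the sentence above; the data and function names are made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def initial_slope_quadratic(times, values, t_max):
    """Least-squares quadratic through the points with t <= t_max.

    Returns c1, the fitted first derivative at t = 0.
    """
    pts = [(t, y) for t, y in zip(times, values) if t <= t_max]
    # Normal equations: sum(t^(i+j)) * c_j = sum(y * t^i) for i, j in 0..2.
    power_sums = [sum(t ** k for t, _ in pts) for k in range(5)]
    rhs = [sum(y * t ** k for t, y in pts) for k in range(3)]
    a = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    _, c1, _ = solve_linear_system(a, rhs)
    return c1


times = [0.0, 1.5, 4.0, 7.0, 11.0, 16.0, 19.5, 30.0, 45.0]  # uneven sampling
values = [1.0 + 0.8 * t - 0.01 * t * t for t in times]      # toy data
slope_at_zero = initial_slope_quadratic(times, values, t_max=20.0)
print(slope_at_zero)
```

With real, noisy data the estimate will depend on the cutoff, so it is worth checking how much it moves when t_max changes.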
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took first differences, the ACF plot was completely insignificant (except at lag 0, of course), which told me that the differenced series behave like white noise.
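For anyone wanting to reproduce this kind of check, here is a small sketch of the sample ACF with the usual approximate 95% white-noise bounds of ±1.96/sqrt(n). It uses a simulated random walk, not the Google Trends or stock data, and the function names are my own.

```python
import math
import random


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    center = sum(series) / n
    dev = [x - center for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + lag] for t in range(n - lag)) / c0
        for lag in range(max_lag + 1)
    ]


def significant_lags(series, max_lag):
    """Lags >= 1 whose autocorrelation falls outside +/- 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(len(series))
    acf = sample_acf(series, max_lag)
    return [lag for lag in range(1, max_lag + 1) if abs(acf[lag]) > bound]


rng = random.Random(1)
steps = [rng.gauss(0.0, 1.0) for _ in range(500)]
levels = []
total = 0.0
for step in steps:
    total += step
    levels.append(total)  # simulated I(1) series (random walk)
diffs = [b - a for a, b in zip(levels, levels[1:])]

print("levels:", significant_lags(levels, 10))
print("first differences:", significant_lags(diffs, 10))
```

Keep in mind that with many lags a few will cross the bounds by chance even for true white noise.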
Nevertheless I tried to estimate a VAR model which you can see attached.
As you can see, only one constant is significant. I have already read that because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic relationship between the two time series, and each can be modelled independently of the other with an AR(p) model.
I checked the residuals of the model. They fulfill the assumptions (the residuals are not quite normally distributed, but acceptably so; there is homoscedasticity, the model is stable, and there is no autocorrelation).
But what does it mean if no coefficient is significant, as in the case of the Stocks.ts equation? Is the model just inappropriate for the data, because the data do not follow an AR process? Or is the model so bad that a constant would describe the data better? Or a combination of the two? Any suggestions on how I could proceed with my analysis?
Thanks in advance
As per the title, I have some data that roughly follow a mixture of two normal distributions, and I would like to find the two underlying components.
I am fitting to the data distribution the sum of two normals with means m1 and m2 and standard deviations s1 and s2. The two Gaussians are scaled by weight factors w1 and w2 such that w1 + w2 = 1.
I can do this using the vglm function of the VGAM package, for example:
# Initial values are arguments of the family function, so they go inside mix2normal1()
fitRes <- vglm(mydata ~ 1,
               mix2normal1(equalsd=FALSE, iphi=w, imu=m1, imu2=m2, isd1=s1, isd2=s2))
This is painfully slow and it can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break up my data into a few (30-50) blocks and repeat the fit process for each of those.
So, here are the questions:
1) How do I speed up the fit process? I tried to use nls or mle, which look much faster, but mostly failed to get a good fit (though I succeeded in getting every possible error these functions could throw at me). It is also not clear to me how to impose limits with those functions (w in [0;1] and w1+w2=1).
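One common alternative for this kind of fit is the EM algorithm, sketched below in plain Python rather than R. This is not what vglm does internally; it is a different technique, shown on simulated data with my own function names. A side benefit is that the constraints hold automatically: the weight is an average of probabilities, so it stays in [0, 1], and the second weight is simply 1 - w.

```python
import math
import random


def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def em_two_normals(data, w, m1, m2, s1, s2, n_iter=50):
    """EM for the mixture w * N(m1, s1) + (1 - w) * N(m2, s2)."""
    n = len(data)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1.
        resp = []
        for x in data:
            p1 = w * normal_pdf(x, m1, s1)
            p2 = (1.0 - w) * normal_pdf(x, m2, s2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted updates of weight, means and standard deviations.
        n1 = sum(resp)
        n2 = n - n1
        w = n1 / n
        m1 = sum(r * x for r, x in zip(resp, data)) / n1
        m2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        s1 = math.sqrt(sum(r * (x - m1) ** 2 for r, x in zip(resp, data)) / n1)
        s2 = math.sqrt(sum((1.0 - r) * (x - m2) ** 2 for r, x in zip(resp, data)) / n2)
    return w, m1, m2, s1, s2


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(600)]
data += [rng.gauss(6.0, 1.5) for _ in range(400)]
w, m1, m2, s1, s2 = em_two_normals(data, w=0.5, m1=-1.0, m2=5.0, s1=2.0, s2=2.0)
print(round(w, 3), round(m1, 3), round(m2, 3), round(s1, 3), round(s2, 3))
```

Each iteration is a single pass over the data, so repeating the fit over 30-50 blocks should be cheap. The sketch has no safeguards against a component collapsing onto a single point, which a real implementation would need.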
2) How do I automagically choose some good starting parameters (I know this is a $1 million question, but you never know, maybe someone has the answer)? Right now I have a little interface that allows me to choose the parameters and visually see what the initial distribution would look like, which is very cool, but I would like to do it automatically for this task.
I thought of relying on the x corresponding to the 3rd and 4th quartiles of the y as starting parameters for the two means. Do you think that would be a reasonable thing to do?
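One way to automate a quantile-based start, sketched here as an assumption rather than a recommendation: seed two cluster centres at the lower and upper quartiles of the data, refine them with a few rounds of one-dimensional 2-means, and use the resulting groups for the starting weight, means and standard deviations. The function name and toy data are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def starting_values(data, n_iter=20):
    """Starting parameters for a two-normal mixture from 1-D 2-means clustering."""
    lower_q, _, upper_q = quantiles(data, n=4)
    c1, c2 = lower_q, upper_q
    for _ in range(n_iter):
        # Assign each point to the nearer centre, then recompute the centres.
        low = [x for x in data if abs(x - c1) <= abs(x - c2)]
        high = [x for x in data if abs(x - c1) > abs(x - c2)]
        c1, c2 = mean(low), mean(high)
    return {
        "w": len(low) / len(data),
        "m1": c1,
        "m2": c2,
        "s1": pstdev(low),
        "s2": pstdev(high),
    }


data = [0.0, 0.5, 1.0, 1.5, 2.0, 9.0, 10.0, 11.0]  # toy, clearly bimodal
start = starting_values(data)
print(start)
```

This works best when the two components are reasonably well separated; with heavy overlap the hard assignment biases the starting standard deviations downward.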
First things first:
Did you try searching for "fit mixture model" on RSeek.org?
Did you look at the Cluster Analysis + Finite Mixture Modeling Task View?
There has been a lot of research into mixture models so you may find something.