Recursive Bayesian with pymc - recursion

In general, Bayesian inference works like this:
prior = foo
for data in (dataSet as it arrives):
    posterior = prior+model+data
    prior = posterior
The amazing package PyMC seems to have the workflow:
prior = foo
run MCMC on prior+AllTheModel+AllTheData
posterior = trace
That is cool, but how can I close the loop? I.e., how do I convert a trace into a prior for my next data or model?
To be more specific about my use case:
I have data that comes in over the course of each work day. My current workflow with PyMC:
At the end of each day the program is run.
It starts with uninformative priors.
Then it loads in all the data for the whole history.
Runs MCMC, and generates reports from the traces.
For the first few days of the project, the program runs in a reasonable time, with a small burn and a small thin. When we are weeks into the project, the program takes a long time to run. Furthermore, it often needs to be rerun due to insufficient burn-in or high autocorrelation (acor). This is a Shlemiel the painter's algorithm. Is there a way to save the analysis done yesterday as the prior for a model of today's data?

There is no good way to do this in PyMC. The posterior that comes out is represented by a set of samples. You could set your prior to be a mixture of point masses on yesterday's samples, but that is unlikely to work well. You have to fit a parametric approximation to the posterior to effectively use it as a prior.
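As a rough illustration of fitting a parametric approximation, here is a minimal Python sketch using only the standard library. It assumes a single parameter whose posterior is close to Normal. The MCMC run is replaced by a hypothetical stand-in, run_sampler, which draws from the exact conjugate posterior of a Normal mean with known noise; in practice the trace would come from PyMC. The names fit_normal and run_sampler and the toy data are assumptions made for illustration only.

```python
import random
import statistics


def fit_normal(samples):
    """Moment-match a Normal distribution to posterior samples: (mean, sd)."""
    return statistics.fmean(samples), statistics.stdev(samples)


def run_sampler(prior_mean, prior_sd, data, noise_sd, n_draws, rng):
    """Hypothetical stand-in for an MCMC run.

    Draws from the exact posterior of a Normal mean with known noise_sd,
    given a Normal(prior_mean, prior_sd) prior. A real workflow would
    return the trace produced by the sampler instead.
    """
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(data) / noise_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_mean * prior_prec + sum(data) / noise_sd ** 2)
    return [rng.gauss(post_mean, post_var ** 0.5) for _ in range(n_draws)]


rng = random.Random(0)
true_mean, noise_sd = 3.0, 1.0

# Toy data: five "days" with 50 observations each.
days = [[rng.gauss(true_mean, noise_sd) for _ in range(50)] for _ in range(5)]

# Day 0 starts from a vague prior; each later day starts from yesterday's fit.
prior_mean, prior_sd = 0.0, 10.0
for day in days:
    trace = run_sampler(prior_mean, prior_sd, day, noise_sd, 5000, rng)
    prior_mean, prior_sd = fit_normal(trace)  # today's posterior -> tomorrow's prior

print(prior_mean, prior_sd)
```

Each run then only touches one day of data. With several parameters, fitting each marginal separately discards the posterior correlations, so a joint approximation (for example, a multivariate Normal) would be needed to keep them.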

Related

Time complexity of nlm-package in R?

I'm estimating a non-linear system (via seemingly unrelated regressions, SUR) using the nlsystemfit() function from the systemfit package, with 4 equations, 32 parameters to estimate (!) and 412 observations. But my code is taking forever (my laptop is not a super-powerful one, though). So far, the process has been running for 13 hours. I'm not an expert in computational matters, but some time ago someone explained to me the concept of the time complexity of algorithms (big-O): the time an algorithm takes can depend, through a specific functional relation, on the number of observations and/or coefficients.
Hence, I'm thinking of just stopping my process, simplifying the model (temporarily), and running something simpler, only to check whether the estimated parameters make sense so far, and then running the whole model.
But all this only makes sense if I can change key elements in my model that reduce the processing time significantly. That's why I was searching Google for the time complexity of the nlm package (the nlsystemfit() function relies on nlm), but unsuccessfully. So this is my question: does anybody know where I can find that information, or can at least give me advice on how to test non-linear systems before running the whole model?
Since you didn't provide any substantial information about your model, or any code for it, it's hard to suggest a specific improvement for your situation.
From what you said:
Hence, I'm thinking of just stopping my process, simplifying the model (temporarily), and running something simpler, only to check whether the estimated parameters make sense so far, and then running the whole model.
It seems you need benchmarking, that is, measuring the time your code takes to execute (benchmarking can also cover memory usage or other performance metrics).
There are quite a few ways to benchmark code in R: calling Sys.time() just before and right after your algorithm/function executes, wrapping the call in system.time(), or using libraries such as rbenchmark (which is a simple wrapper around the system.time function), tictoc, bench and microbenchmark.
Among these, the last two are the preferable options: the bench package provides system_time(), a higher-precision alternative to system.time(), and microbenchmark is widely used to measure and compare the execution time of R expressions/algorithms accurately.
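The same idea can be used to estimate empirically how run time grows with problem size before committing to the full model. Below is a minimal sketch in Python (not R), using only the standard library. The function toy_fit is a hypothetical stand-in for fitting the model on the first n observations, and the chosen sizes are arbitrary; the slope on a log-log scale is only a rough estimate of the growth exponent.

```python
import math
import time


def time_call(fn, *args, repeats=3):
    """Return the best wall-clock time (seconds) of fn(*args) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def toy_fit(n):
    """Hypothetical stand-in for a model fit on n observations (quadratic work)."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (i - j) ** 2
    return total


# Time the fit on growing subsets of the data.
sizes = [100, 200, 400]
timings = {n: time_call(toy_fit, n) for n in sizes}

# Rough growth exponent: slope between the smallest and largest size on a log-log scale.
exponent = math.log(timings[sizes[-1]] / timings[sizes[0]]) / math.log(sizes[-1] / sizes[0])

for n in sizes:
    print(n, timings[n])
print("estimated exponent:", exponent)
```

An exponent near 1 suggests roughly linear growth and an exponent near 2 roughly quadratic growth, which gives a crude way to extrapolate from small test runs to the full data set.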

Predicting future emissions from fitted HMM model

I've fitted an HMM to my data using the hmm.discnp package in R as follows:
library(hmm.discnp)
zs <- hmm(y=lis,K=5)
Now I want to predict the next k observations (emissions) from this model. But I am only able to get the most probable state sequence for the observations that I already have, through the Viterbi algorithm.
I already have t emissions, i.e. (y(1),...,y(t)).
I want the most probable future k emissions from the fitted HMM object, i.e. (y(t+1),...,y(t+k)).
Is there a function to calculate this? If not, how do I calculate it manually?
Generating emissions from an HMM is pretty straightforward to do manually. I am not really familiar with R, but I explain here the steps to generate data as you ask.
The first thing to keep in mind is that, by its Markovian nature, the HMM has no memory. At any time, only the current state matters; what happened before is "forgotten". This means that the state at time t+1 depends only on the state at time t, and the observation at time t+1 depends only on the state at time t+1.
If you have a sequence, the first thing you can do is find the most probable state sequence (with the Viterbi algorithm), as you did. Now you know the state that generated the last observation that you have (the one that you denote y(t)).
From this state, the transition matrix gives you the probabilities of moving to each state of the model. This is a probability mass function (pmf), and you can draw a state number from it (not by hand! In R, sample() with its prob argument draws from a pmf). The state number you draw is the state in which your system is at time t+1.
With this information, you can now draw a sample observation from the probability function that is assigned to this new state (same here, if it is a Gaussian distribution, use a Gaussian random generator that should exist in R).
From this state t+1, you can now apply the same procedure to reach a state at time t+2 and so on.
Keep in mind that if you repeat this full procedure several times (to generate data samples from time t+1 to t+k), you will end up with different results. This is due to the probabilistic nature of the model. I am not sure what you mean by most probable future emissions, and I am not sure whether there are routines to compute them. You can compute the likelihood of the full sequence you obtain at the end (from 1 to t+k). Compare it with the likelihood of the sequence up to t on a per-observation basis (for example, the average log-likelihood): with discrete emissions, the raw likelihood of the longer sequence cannot exceed that of the shorter one, because each additional observation multiplies in a probability of at most 1. The generated part should typically fit well, as it has been truly generated from the model itself.
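The steps above can be sketched in a few lines. This is a minimal illustration in Python (not R), using only the standard library. It assumes a discrete-emission HMM whose transition matrix trans and emission matrix emis are row-stochastic nested lists; the numbers below are made up, and in practice they would be read off the fitted model. The second function is an addition to the answer: it computes the marginal distribution of each future emission, which gives the most probable symbol at each step (not the jointly most probable sequence).

```python
import random


def sample_future(last_state, trans, emis, symbols, k, rng):
    """Draw k future (state, emission) pairs, starting from the last decoded state."""
    states = list(range(len(trans)))
    state, path, obs = last_state, [], []
    for _ in range(k):
        # Draw the next state from the row of the transition matrix.
        state = rng.choices(states, weights=trans[state])[0]
        # Draw an emission from the distribution attached to that state.
        obs.append(rng.choices(symbols, weights=emis[state])[0])
        path.append(state)
    return path, obs


def predictive_emission_probs(last_state, trans, emis, k):
    """Marginal emission probabilities for steps t+1 .. t+k."""
    n_states, n_symbols = len(trans), len(emis[0])
    dist = [1.0 if s == last_state else 0.0 for s in range(n_states)]
    out = []
    for _ in range(k):
        # Propagate the state distribution one step: dist <- dist * trans.
        dist = [sum(dist[i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)]
        # Mix the per-state emission distributions: dist * emis.
        out.append([sum(dist[s] * emis[s][m] for s in range(n_states))
                    for m in range(n_symbols)])
    return out


# Made-up two-state model with three emission symbols.
trans = [[0.9, 0.1], [0.3, 0.7]]
emis = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
symbols = ["a", "b", "c"]

rng = random.Random(1)
path, obs = sample_future(0, trans, emis, symbols, 5, rng)
probs = predictive_emission_probs(0, trans, emis, 5)
most_probable = [symbols[p.index(max(p))] for p in probs]
print(path, obs, most_probable)
```

Starting from a single decoded state follows the answer above; starting instead from a probability distribution over the last state would only change the initial value of dist.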

R Neural Network Forecasting - Aspect of Randomness?

I have been trying different methods of forecasting and stumbled upon the nnetar() function in the forecast package of R. I quickly realized that while this does work to forecast, it gives me something different every time I run it. Could anybody help to explain why this happens? I thought I had a decent understanding of neural nets and I don't see what could make drastic differences in forecasts, unless the nnetar() function randomly selects the number of nodes or something. Any help?
By default, 20 networks are trained with random starting values, and their predictions are then averaged when you use the function.
Because the function uses random starting values for each run, the forecasts will be different for each call too.
EDIT: new question from OP in the comments
In order to control the function and get the same random starting values each time, you can simply use the function set.seed() with the value of your choice.
For example:
set.seed(666)
forecast(nnetar(...),...)
set.seed(666)
forecast(nnetar(...),...)
set.seed(666)
forecast(nnetar(...),...)
will give the same results every time you run it with this "seed" value (666). You have to run set.seed(666) before every run of the rest of your code, of course.
EDIT 2: new new question from OP in the comments
In order to have 100 different networks to fit with random starting weights:
nnetar(...,repeats=100,...)

mlr classification training with rpart does not complete

I have a classification task that I managed to train with mlr package using LDA ("classif.lda") in a few seconds. However when I trained it using "classif.rpart" the training never ended.
Is there any different setup to be done for the different methods?
My training data is here, if needed to replicate the problem. I tried to train it simply with
pred.bin.task <- makeClassifTask(id="CountyCrime", data=dftrain, target="count.bins")
train("classif.rpart", pred.bin.task)
In general, you don't need to change anything about the setup when switching learners -- one of the main points of mlr is to make this easy! This does not mean that it'll always work though, as different learning methods do different things under the hood.
It looks like in this particular case the model simply takes a long time to train, so you probably didn't wait long enough for it to complete. You have quite a large data frame.
Looking at your data, you seem to have an interval of values in count.bins. This is treated as a factor by R (i.e. intervals are only the same if the string matches completely), which is probably not what you want here. You could encode start and end as separate (numerical) features.

How to test time series model?

I am wondering what a good approach for testing a time series model would be. Suppose I have a time series over the time points t1, t2, ..., tN, with inputs, say, zt1, zt2, ..., ztN and outputs x1, x2, ..., xN.
Now, if that were a classical data mining problem, I could go with known approaches like cross-validation, leave-one-out, 70-30 or something else.
But how should I approach the problem of testing my model with time series? Should I build the model on the first t1,t2,...t(N-k) inputs and test it on the last k inputs? But what if we want to maximise the prediction for p steps ahead and not k (where p < k). I am looking for a robust solution which I can apply to my specific case.
With time series fitting, you need to be careful not to use your out-of-sample data until after you've developed your model. The main problem with modelling is that it is very easy to overfit.
Typically, what we do is use 70% for in-sample modelling and 30% for out-of-sample testing/validation. And when we use the model in production, the data we collect day-to-day becomes true out-of-sample data: the data you have never seen or used.
Now, if you have enough data points, I'd suggest trying a rolling window fitting approach. For each time step in your in-sample, you look back window_size time steps to fit your model and see how the parameters in your model vary over time. For example, let's say your model is linear regression with Y = B0 + B1*X1 + B2*X2. You'd run the regression N - window_size times over the sample. This way, you understand how sensitive your betas are in relation to time, among other things.
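One way to combine the rolling window with the p-steps-ahead question is a rolling-origin evaluation: fit on each window, forecast p steps past its end, and average the errors. The following is a minimal Python sketch using only the standard library. The function names, the toy series and the persistence forecast (naive_forecast) are assumptions for illustration; a real model fit would replace naive_forecast.

```python
def rolling_origin_splits(n_obs, window, horizon):
    """Yield (train_indices, test_index) pairs for horizon-step-ahead evaluation.

    Each training window has length `window`; the test point lies `horizon`
    steps after the last training observation, so no future data leaks in.
    """
    for start in range(n_obs - window - horizon + 1):
        train = range(start, start + window)
        test = start + window + horizon - 1
        yield train, test


def naive_forecast(train_values, horizon):
    """Hypothetical placeholder model: repeat the last observed value."""
    return train_values[-1]


# Toy series; in practice this would be the in-sample part of the data.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
window, horizon = 5, 2

errors = []
for train_idx, test_idx in rolling_origin_splits(len(series), window, horizon):
    train = [series[i] for i in train_idx]
    forecast = naive_forecast(train, horizon)
    errors.append(abs(series[test_idx] - forecast))

mae = sum(errors) / len(errors)  # mean absolute error at the chosen horizon
print(len(errors), mae)
```

Running this once per horizon of interest gives an error estimate for each p, while the final hold-out portion of the data stays untouched until the model is chosen.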
It sounds like you have a choice between
Using the first few years of data to create the model, then seeing how well it predicts the remaining years.
Using all the years of data for some subset of input conditions, then seeing how well it predicts using the remaining input conditions.
