Density estimation of a stream of data - kernel density

What statistical methods are there that estimate the probability density of data as it arrives over time?
I need to estimate the pdf of a multivariate dataset; however, new data arrive over time, and the density estimate must be updated as they arrive.
So far I have been using kernel density estimation, storing a buffer of the data and computing a new kernel density estimate with every batch of new data; however, I can no longer keep up with the amount of data that needs to be stored. I therefore need a method that keeps track of the overall pdf/density estimate rather than the individual data points. Any suggestions would be really helpful. I work in Python, but language-agnostic algorithm suggestions would also be welcome.

SciPy's implementation of KDE includes code that accumulates the estimate one datum at a time rather than one evaluation point at a time. It sits inside an "if there are more points than data" branch, but you could probably re-purpose it for your needs.
if m >= self.n:
    # there are more points than data, so loop over data
    for i in range(self.n):
        diff = self.dataset[:, i, newaxis] - points
        tdiff = dot(self.inv_cov, diff)
        energy = sum(diff * tdiff, axis=0) / 2.0
        result = result + exp(-energy)
In this case, you could store the result of your KDE, evaluated on a fixed set of points, as result, and each time a new datum arrives you just compute its Gaussian contribution and add it in. The raw data can then be dropped as needed; you are only storing the KDE.
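As a rough sketch of that idea (assuming a fixed evaluation grid and a fixed, pre-chosen bandwidth matrix rather than SciPy's data-dependent one; the class and parameter names below are just illustrative):

import numpy as np

class StreamingKDE:
    """Accumulate a Gaussian KDE on a fixed evaluation grid, one datum at a time."""
    def __init__(self, grid, cov):
        # grid: (d, m) array of evaluation points; cov: (d, d) fixed bandwidth matrix
        self.grid = np.atleast_2d(grid)
        self.inv_cov = np.linalg.inv(cov)
        self.norm = np.sqrt(np.linalg.det(2 * np.pi * cov))
        self.total = np.zeros(self.grid.shape[1])
        self.n = 0
    def update(self, x):
        # add the Gaussian kernel centred at the new datum x (a length-d vector)
        diff = np.asarray(x, dtype=float)[:, np.newaxis] - self.grid
        energy = np.sum(diff * (self.inv_cov @ diff), axis=0) / 2.0
        self.total += np.exp(-energy) / self.norm
        self.n += 1
    def density(self):
        # current density estimate on the grid
        return self.total / self.n

# Example: a 2-D grid, feeding points one at a time and discarding them afterwards
xx, yy = np.meshgrid(np.linspace(-3, 3, 50), np.linspace(-3, 3, 50))
kde = StreamingKDE(np.vstack([xx.ravel(), yy.ravel()]), cov=0.2 * np.eye(2))
for point in np.random.randn(1000, 2):
    kde.update(point)
pdf = kde.density().reshape(xx.shape)

The price of this approach is that the bandwidth is frozen up front instead of adapting to the data, which is usually an acceptable trade-off for a stream.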

Related

Training and simulating a spatstat ppm using multiple datasets

Disclaimer: I'm very new to spatstat and spatial point modeling in general... please excuse my naivete.
I have recently tried using spatstat to fit and simulate spatial point patterns related to weather phenomena, where the spatial pattern represents a set of eye-witness reports (for example, reports of hail occurrence) and the observation window and covariate are based on some meteorological parameter (e.g. the window is the area where moisture is at least X, and the moisture variable is additionally passed as a covariate when training the model).
moistureMask = owin(mask=moisture>X)
moistureVar = im(moisture)
obsPPP = ppp(x=obsX,y=obsY,window=moistureMask)
myModel = ppm(obsPPP ~ moistureVar)
### then simulate
mySim = simulate(myModel,nsim=10)
My questions are the following:
Is it possible (or, more importantly, even valid) to take a ppm trained on one day with a specific moisture variable and mask, and apply it to another day with a different moisture field and mask? I had considered using the update function to switch out the window and covariate fields of the trained model, but haven't actually tried it yet. If the answer is yes, it's a little unclear to me how to actually do this programmatically.
Is it possible to do an online update of the ppm with additional data? For example, train the model on data from different days (each with their own window and covariate) iteratively, similar to how many machine learning models are trained on blocks of training data. For example, let's say I have 10 years of daily data which I'd like to use to train the model, and another 10 years of moisture variables over which I'd like to simulate point patterns. Again, I considered the update function here as well, but it was unclear whether the new model would be based ONLY on the new data, or on a combination of the original and new data.
Please let me know if I'm going the completely wrong direction with this. References and resources appreciated.
If you have fitted a model using ppm and you update it by specifying new data and/or new covariates, then the new data replace the old data; the updated model's parameters are determined using only the new data that you gave when you called update.
The syntax for the update command is described in the online help for update.ppm (the method for the generic update for an object of class ppm).
It seems that what you really want to do is to fit a point process model to many replicate datasets, each dataset consisting of a predictor moistureVar and a point pattern obsPPP. In that case, you should use the function mppm which fits a point process model to replicated data.
To do this, first make a list A containing the moisture regions for each day, and another list B containing the hail report location patterns for each day. That is, A[[1]] is the moisture region for day 1, and B[[1]] is the point pattern of hail report locations for day 1, and so on. Then do
h <- hyperframe(moistureVar=A, obsPPP=B)
m <- mppm(obsPPP ~ moistureVar, data=h)
This will fit a single point process model to the full set of data.
Finally, may I point out that the model
obsPPP ~ moistureVar
is very simple, because moistureVar is a binary predictor. The model will simply say that the intensity of hail reports takes one value inside the high-moisture region and another value outside that region. As an alternative, you could consider using the moisture content itself (e.g. humidity) as a predictor variable.
See Chapters 9 and 16 of the spatstat book for more detail.

Is there a numerical method for approaching the first derivative at t = 0 s in a real-time application?

I want to improve, step by step as unevenly sampled data arrive, the estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity of a projectile, but you do not know its final position and velocity; you are only receiving, slowly, measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, data have almost the same sampling frequency, i.e., it is about 15 min. Thus, there are some measurements without changes, because of the nature of the phenomenon (heat transfer). It gives an exponential tendency and I can fit data to a known model, but an important amount of information is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progressive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But that used preprocessed data (i.e., jitter removal, smoothing, and fitting with the theoretical functional). It gave me the following result:
(Figure: a real example of the problem and its "current solution".)
It is good enough for me, but I would like to know whether there is an optimal way of doing this using the raw data (or only lightly smoothed data).
IMO, additional data are not very relevant for improving the estimate at zero, because perturbations come into play and the correlation between the first and later samples keeps decreasing.
Also, the asymptotic behaviour of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can bias the estimate.
I would stick to the first points (say up to t = 20) and fit a simple model, say a quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
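A minimal sketch of that suggestion in Python (assuming the raw samples arrive as (t, y) pairs; the window limit t_max = 20 echoes the "up to t = 20" advice above and the geometric weights echo the W = diag((0.5).^(1:kk)) matrix from the question, both of which are choices to tune, not prescriptions):

import numpy as np

def initial_slope(t, y, t_max=20.0, decay=0.5):
    """Estimate dy/dt at t = 0 from unevenly sampled (t, y) data.

    Fits a quadratic to the samples with t <= t_max, geometrically
    down-weighting later samples, and returns the linear coefficient,
    i.e. the slope of the fitted curve at t = 0.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = t <= t_max
    t_fit, y_fit = t[keep], y[keep]
    if len(t_fit) < 3:
        raise ValueError("need at least 3 samples for a quadratic fit")
    w = decay ** np.arange(len(t_fit))               # progressive WLS-style weights
    # np.polyfit squares the weights internally, so pass sqrt(w) to get WLS weights w
    c2, c1, c0 = np.polyfit(t_fit, y_fit, deg=2, w=np.sqrt(w))
    return c1                                        # derivative of c2*t**2 + c1*t + c0 at t = 0

Re-running the fit each time a new measurement arrives gives the step-by-step refinement asked for; swapping np.polyfit for a robust fitter addresses the outlier case mentioned above.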

Predicting future emissions from fitted HMM model

I've fitted an HMM to my data using the hmm.discnp package in R as follows:
library(hmm.discnp)
zs <- hmm(y=lis,K=5)
Now I want to predict the next K observations (emissions) from this model, but I am only able to get the most probable state sequence for the observations I already have, via the Viterbi algorithm.
I have t emissions already, i.e. (y(1),...,y(t)).
I want the most probable K future emissions from the fitted HMM object, i.e. (y(t+1),...,y(t+k)).
Is there a function to calculate this? If not, how do I calculate it manually?
Generating emissions from an HMM is pretty straightforward to do manually. I'm not really familiar with R, but I'll explain the steps to generate the data you ask for.
The first thing to keep in mind is that, by its Markovian nature, the HMM has no memory. At any time only the current state is known; what happened before is "forgotten". This means that the generation of the sample at time t+1 depends only on the state at time t.
If you have a sequence, the first thing you can do is find the most probable state sequence (with the Viterbi algorithm), as you did. You then know the state that generated the last observation you have (the one you denote y(t)).
From this state, the transition matrix gives you the probability of moving to every other state of the model. This row is a probability mass function (pmf), and you can draw a state number from it (not by hand: in R, sample() with its prob argument does exactly this). The state you draw is the state your system is in at time t+1.
With this information, you can now draw a sample observation from the emission distribution assigned to this new state (the same idea applies: if it is a Gaussian distribution, use a Gaussian random generator; if it is discrete, draw from its pmf as above).
From the state at time t+1, you can apply the same procedure to reach a state at time t+2, and so on.
Keep in mind that if you run this full procedure several times (to generate samples from time t+1 to t+k), you will end up with different results each time. This is due to the probabilistic nature of the model. I am not sure what you mean by "most probable future emissions", and I am not sure whether routines exist to compute them. You can compute the likelihood of the full sequence you obtain at the end (from 1 to t+k). It will in general be greater than the likelihood of the sequence up to t, because the last part has been generated from the model itself and therefore fits it "perfectly" in some regards.
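A sketch of that sampling procedure in Python (the answer is language-agnostic, and the arrays tran, emis and the value last_state below are placeholders you would extract from the fitted hmm.discnp object, not part of its API):

import numpy as np

def sample_future_emissions(tran, emis, last_state, k, seed=None):
    """Draw k future emissions from a discrete HMM.

    tran[i, j] : probability of moving from state i to state j
    emis[i, s] : probability that state i emits symbol s
    last_state : Viterbi state of the last observed emission y(t)
    """
    rng = np.random.default_rng(seed)
    n_states, n_symbols = emis.shape
    state = last_state
    future = []
    for _ in range(k):
        state = rng.choice(n_states, p=tran[state])           # step to the state at t+1, t+2, ...
        future.append(rng.choice(n_symbols, p=emis[state]))   # draw an emission from that state
    return future

Running this many times and looking at the distribution of the results (for instance, the most frequent symbol at each step) gives a Monte Carlo approximation of the likely future emissions.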

Simulating returns from ARMA(1,1) - MCsGARCH(1,1) model

How can I find the expected intraday return of an ARMA(1,1)-MCsGARCH(1,1) model in R?
The sample code of the model is available at http://www.unstarched.net/2013/03/20/high-frequency-garch-the-multiplicative-component-garch-mcsgarch-model/
I think you are mixing something up here. There is no "expected intraday return"; for the ARMA(1,1)-MCsGARCH(1,1) model there is only an estimate of the volatility of the following period/day (sigma, as you've already noticed in the comments).
I assume you are referring to the last plot on the website you linked; that would mean you want the VaR (Value-at-Risk), which is calculated from the volatility obtained in the estimation procedure.
If you look at the code that was used to produce the plot:
D = as.POSIXct(rownames(roll@forecast$VaR))
VaRplot(0.01, actual = xts(roll@forecast$VaR[, 3], D), VaR = xts(roll@forecast$VaR[, 1], D))
You can see that the VaR (and the returns) were taken from the object roll. After you've run the simulation (without changing any variable names from the example), you could store them in variables for later use like this:
my_VaR = roll@forecast$VaR[, 1]
my_act = roll@forecast$VaR[, 3]
Here [, 1] selects the first column of the VaR element. If you check str(roll), you will see, pretty much at the end, that:
Element 1: stands for the alpha(1%) VaR
Element 2: stands for the alpha(5%) VaR and
Element 3: stands for the realized return.
To address what you said in your comment:
Have a look at the variable df (generated from as.data.frame(roll)); it may include what you are looking for.
"I want to compare the expected return and the actual return."
This seems to drift more in the direction of Cross Validated, but I'll try to give a brief outline.
GARCH models are primarily used for volatility forecasting and for learning about the volatility dynamics of a time series (and/or the correlation dynamics in multivariate models). Since variance is a second moment, i.e. based on squared deviations, it is always positive. But are returns always positive? Of course not. This means the volatility forecast gives us an idea of the magnitude of next period's return, but it does not tell us whether that return will be positive or negative. That's where Value-at-Risk (VaR) comes into play.
Take, for example, a portfolio manager who owns one asset. With a GARCH model he can predict the volatility of the next period (if he uses a daily return series, that would be tomorrow). Traders watch the risk of their portfolio; it is much more closely monitored than the potential gains. With the volatility forecast he can make a good guess about the risk of his asset losing value tomorrow. A 95%-VaR of, say, 1,000 EUR means that with 95% probability the loss tomorrow will not exceed 1,000 EUR. A higher confidence level gives a higher VaR, e.g. a 99%-VaR of 1,500 EUR.
To wrap this up: there is no "expected" return; there is only a volatility forecast for tomorrow that gives an indication (never certainty) of how tomorrow's return could turn out. With the VaR this can be used for risk management, which is what is being done in the last part of the article you linked.
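As an illustration of how a volatility forecast turns into a VaR figure, here is a minimal sketch (in Python, since the formula is the same in any language) assuming normally distributed returns, which is only one of the distributions rugarch supports; the numbers are made up:

from scipy.stats import norm

def parametric_var(mu, sigma, alpha=0.01, position=1.0):
    """One-period Value-at-Risk under a normal return assumption.

    mu, sigma : conditional mean and volatility forecast for the next period
    alpha     : tail probability (0.05 -> 95% VaR, 0.01 -> 99% VaR)
    position  : current value of the position (e.g. in EUR)
    """
    return -(mu + sigma * norm.ppf(alpha)) * position

# parametric_var(mu=0.0, sigma=0.02, alpha=0.05, position=50_000) is roughly 1,645 EUR:
# the loss that should not be exceeded tomorrow with 95% probability.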
"What is the difference between the ugarchsim and the roll function?"
You can check the documentation of the rugarch package; every function and its properties are explained there in more detail. At a quick glance, I would say ugarchsim is used if you want to fit a model to a complete time series; the last standard deviation is then the forecast for the next period. The documentation for ugarchroll says:
ugarchroll-methods {rugarch}: Univariate GARCH Rolling Density Forecast and Backtesting.
Description: Method for creating rolling density forecasts from ARMA-GARCH models, with the option of refitting every n periods and with parallel functionality. It is used for forecasting as well as for backtesting.
This is for testing how your model would have performed in the past. It takes, for example, the first 300 data points and gives a forecast for data point 301. The VaR (95% or 99%) is then compared with the realized return of data point 301. The model is then refitted, giving a forecast for data point 302, and so on.
Edit: added answers to the questions from the comments.

Simple algorithm for online outlier detection of a generic time series

I am working with a large number of time series.
These are basically network measurements arriving every 10 minutes; some of them are periodic (e.g. bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average to remove some of the noise, but then what? Simple things like the standard deviation, MAD, etc. computed against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample value.
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise-cancelling headphones. You have a filter which constantly adapts to the input signal, effectively matching its filter coefficients to a hypothetical short-term model of the signal source, thereby reducing the mean-square error of the output. This gives you a low-level output signal (the residual error) except when you get an outlier, which results in a spike that is easy to detect (by thresholding). Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
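A rough sketch of that idea (a normalised LMS predictor that flags samples whose prediction error is large relative to a running error scale; the filter order, step size, threshold and warm-up length below are arbitrary starting points, not tuned values):

import numpy as np

class LMSOutlierDetector:
    """Flag samples whose LMS prediction error spikes well above its usual level."""

    def __init__(self, order=8, mu=0.1, threshold=4.0, warmup=50):
        self.w = np.zeros(order)      # adaptive filter coefficients
        self.buf = np.zeros(order)    # the last `order` samples, most recent first
        self.err_scale = 0.0          # running estimate of the typical |prediction error|
        self.mu, self.threshold, self.warmup = mu, threshold, warmup
        self.n = 0

    def update(self, x):
        """Feed one new sample; returns an anomaly score (>= 1.0 suggests an outlier)."""
        err = x - self.w @ self.buf
        self.n += 1
        warming_up = self.n <= self.warmup
        score = 0.0 if warming_up else abs(err) / ((self.err_scale + 1e-12) * self.threshold)
        if warming_up or score < 1.0:
            # normalised LMS update, skipped for outliers so spikes don't corrupt the filter
            self.w += self.mu * err * self.buf / (self.buf @ self.buf + 1e-12)
            self.err_scale = 0.99 * self.err_scale + 0.01 * abs(err)
        self.buf = np.roll(self.buf, 1)
        self.buf[0] = x
        return score

Only the filter state is kept in memory, so this scales to many series; the full history is not needed once the filter has adapted.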
I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample of which you want to know its "outlierness":
Retrieve the means, covariance matrix and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
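A minimal sketch of that scheme in Python (using the correlation matrix as the PS recommends, which amounts to computing Mahalanobis distances on z-scored attributes; the 3-sigma cut for "obvious outliers" is an illustrative choice):

import numpy as np

class MahalanobisOutlierScore:
    """Score new samples by the percentile of their Mahalanobis distance among the training distances."""

    def fit(self, X, z_cut=3.0):
        X = np.asarray(X, dtype=float)                   # shape (n_samples, n_attributes)
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0) + 1e-12
        Z = (X - self.mean) / self.std
        Z = Z[(np.abs(Z) <= z_cut).all(axis=1)]          # remove obvious outliers per attribute
        self.inv_corr = np.linalg.pinv(np.corrcoef(Z, rowvar=False))
        self.train_d = np.sqrt(np.einsum('ij,jk,ik->i', Z, self.inv_corr, Z))
        return self

    def score(self, x):
        """Percentile of the sample's distance: close to 100 means an extreme outlier."""
        z = (np.asarray(x, dtype=float) - self.mean) / self.std
        d = np.sqrt(z @ self.inv_corr @ z)
        return 100.0 * np.mean(self.train_d <= d)

Scoring against the stored training distances is what makes 100% correspond to "more extreme than anything seen in training", as described in the last step above.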
