Is there a numerical method for approximating the first derivative at t = 0 s in a real-time application?

I want to refine, step by step as unevenly sampled data arrive, the estimate of the first derivative at t = 0 s. For example, suppose you want to find the initial velocity of a projectile's motion, but you do not know its final position and velocity; you are only (slowly) receiving measurements of the projectile's current position and time.
Update - 26 Aug 2018
I would like to give you more details:
"Unevenly-sampled data" means the time intervals are not regular (irregular times between successive measurements). However, data have almost the same sampling frequency, i.e., it is about 15 min. Thus, there are some measurements without changes, because of the nature of the phenomenon (heat transfer). It gives an exponential tendency and I can fit data to a known model, but an important amount of information is required. For practical purposes, I only need to know the value of the very first slope for the whole process.
I tried a progressive Weighted Least Squares (WLS) fitting procedure, with a weight matrix such as
W = diag((0.5).^(1:kk)); % where kk is the last measurement id
But it was using preprocessed data (i.e., jitter removal, smoothing, and fitting with the theoretical functional). It gave me the following result:
This is a real example of the problem and its "current solution"
It is good for me, but I would like to know if there is an optimal manner of doing that, but employing the raw data (or smoothed data).
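For what it's worth, the progressive WLS idea above can be sketched in a few lines of Python. This is a hypothetical translation, not the original code: the function name is mine, and a low-degree polynomial stands in for the theoretical model. Each sample k gets weight 0.5^k (mirroring `W = diag((0.5).^(1:kk))`), so early measurements dominate the fit, and the initial slope is read off as the derivative of the fitted polynomial at t = 0.

```python
import numpy as np

def initial_slope_wls(t, y, decay=0.5, degree=2):
    """Estimate dy/dt at t = 0 by weighted least squares.

    Sample k gets weight decay**k (mirroring W = diag(decay.^(1:kk))),
    so early measurements dominate the fit.  A low-degree polynomial
    stands in for the true (e.g. exponential) model.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    w = decay ** np.arange(1, len(t) + 1)
    # np.polyfit applies w to the unsquared residuals, so pass sqrt(w)
    # to minimize sum(w * residual**2)
    coeffs = np.polyfit(t, y, deg=degree, w=np.sqrt(w))
    # derivative of the fitted polynomial, evaluated at t = 0
    return np.polyval(np.polyder(coeffs), 0.0)
```

If the true functional is known, the polynomial would be replaced by the theoretical model; the exponential weighting scheme stays the same.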

IMO, additional data are not relevant for improving the estimate at zero: perturbations come into play, and the correlation between the first and last samples decreases.
Also, the asymptotic behavior of the phenomenon is probably not known rigorously (is it truly a first-order linear model?), and this can bias the fit.
I would stick to the first points (say up to t=20) and fit a simple model, say quadratic.
If in fact what you are trying to do is to fit a first order linear model to the data, then least-squares fitting on the raw data is fine. If there are significant outliers, robust fitting is preferable.
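As a sketch of the suggestion above (least squares on the raw data, made robust against outliers), here is a minimal iteratively reweighted least-squares line fit with Huber weights. The function name, iteration count, and tuning constant k = 1.345 are illustrative choices, not a prescribed method:

```python
import numpy as np

def fit_line_robust(t, y, n_iter=20, k=1.345):
    """Fit y ~ a + b*t by iteratively reweighted least squares with
    Huber weights, which downweight large residuals (outliers).
    Returns (a, b); b is the slope estimate."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    w = np.ones_like(y)                      # start from ordinary LS
    for _ in range(n_iter):
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        r = y - X @ beta                     # residuals of current fit
        mad = np.median(np.abs(r - np.median(r)))
        s = mad / 0.6745 if mad > 0 else 1.0  # robust scale estimate
        a = np.abs(r) / (k * s)
        w = np.where(a <= 1.0, 1.0, 1.0 / a)  # Huber weight function
    return beta
```

On clean data this reduces to ordinary least squares; a single gross outlier barely moves the slope, where plain least squares would tilt noticeably.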

Related

COMSOL: Diffusion in Transport of Diluted Species Produces Unphysical Results

I am simulating Transport of Diluted Species inside a pipe segment in COMSOL Multiphysics. I have specified an initial concentration which produces a concentration distribution around a slice through the pipe at t=0. Moreover, I have a point probe a little bit upstream (I am using laminar flow for convection). I am plotting the concentration at this point dependent on time.
To investigate whether the model produces accurate (i.e. physically realistic) results, I am varying the diffusion coefficient D. This is where I noticed unrealistic behavior: for a large range of diffusion coefficients, the concentration graph at the point probe does not change. This is unphysical, since e.g. higher diffusion coefficients should lead to a more spread-out distribution at the point probe.
I already did a mesh refinement study and found that the result strongly depends on mesh resolution. Therefore, I am now using the highest mesh resolution (extremely fine). Regardless, the concentration results still do not change for varying diffusion coefficients.
What could be the reason for this unphysical behavior? I already know it is not due to mesh resolution or relative tolerance of the solver.
After a lot of time spent on this simulation, I concluded that the undesired effects are indeed due to numerical diffusion, as suggested by 2b-t. Of course, it is impossible to be certain that this is actually the reason; however, I investigated pretty much every other potential culprit in the simulation without gaining any new insights.
To work around this issue of numerical diffusion, I switched to Particle-Based Simulation (PBS) and approximated the concentration as the normalized number of particles inside a small receiver volume. This method provides a good approximation for the concentration for large particle numbers and a small receiver volume.
By doing this, I produced results that are in very good agreement with results known from the literature.

bam() returns negative deviance explained values

I'm trying to run GAMs to analyze some temperature data. I have remote cameras and external temperature loggers, and I'm trying to model the difference in the temperatures recorded by them (camera temperature - logger temperature). Most of the time the cameras record higher temperatures, but sometimes the logger returns the higher temperature, in which case the difference is negative. The direction of the difference is something I care about, so the response must be allowed to take negative values. My explanatory variables are percent canopy cover (quantitative), direct and diffuse radiation (quantitative), and camera direction (ordered factor) as fixed effects, as well as the camera/logger pair (factor) as a random effect.
I had mostly been using the gam() function in mgcv to run my models. I'm using a scat distribution since my data is heavy-tailed. My model code is as follows:
gam(f1, family = scat(link = "identity"), data = d)
I wanted to try using bam() since I have 60,000 data points (one temperature observation per hour of the day for several months). The gam() models run fine, though they take a while to run. But the exact same model formulas run in bam() end up returning negative deviance explained values. I also get 50+ warning messages that all say:
In y - mu : longer object length is not a multiple of shorter object length
Running gam.check() on the fitted models returns identical residuals plots. The parametric coefficients, smooth terms, and R-squared values are also almost identical. The only things that have noticeably changed are the deviance explained values, which are now completely nonsensical: for the bam() models they range from -61% to -101%.
I'll admit that I'm brand new to using GAMs. I know just enough to know that the residuals plots matter more than the deviance explained values, and the residuals plots look good (way better than they did with a Gaussian distribution, at least). More than anything, I'm curious about what's going on within bam() specifically that causes it to emit that warning and return a negative deviance explained value. Is there some extra argument I can set in bam(), or some further manipulation of my data, to prevent this? Or can I ignore it and move forward, since my residuals plots look good and the outputs are mostly the same?
Thanks in advance for any help.
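For intuition on how the statistic can go negative at all: mgcv reports deviance explained as 1 - deviance(model)/deviance(null), so anything that inflates the model deviance, such as residuals computed against a wrongly recycled mu (which is what the `y - mu` length warning hints at), pushes the ratio above 1. A toy illustration (the numbers are made up):

```python
# mgcv reports "deviance explained" as 1 - dev(model)/dev(null).
def deviance_explained(model_deviance, null_deviance):
    return 1.0 - model_deviance / null_deviance

# A sane fit explains some fraction of the null deviance:
print(deviance_explained(40.0, 100.0))   # 0.6
# But if the model deviance is inflated -- e.g. residuals taken
# against a mis-recycled mu, as the `y - mu` warning suggests --
# the ratio exceeds 1 and the statistic goes negative:
print(deviance_explained(180.0, 100.0))  # -0.8
```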

Predicting future emissions from fitted HMM model

I've fitted an HMM to my data using the hmm.discnp package in R as follows:
library(hmm.discnp)
zs <- hmm(y=lis,K=5)
Now I want to predict the next K observations (emissions) from this model, but I am only able to get the most probable state sequence for the observations I already have, via the Viterbi algorithm.
I already have t emissions, i.e. (y(1), ..., y(t)).
I want the most probable K future emissions from the fitted HMM object, i.e. (y(t+1), ..., y(t+K)).
Is there a function to calculate this? If not, how do I calculate it manually?
Generating emissions from an HMM is pretty straightforward to do manually. I'm not really familiar with R, but I'll explain here the steps to generate the data you ask for.
The first thing to keep in mind is that, by its Markovian nature, the HMM has no memory: at any time, only the current state is known, and what happened before is "forgotten". This means that the generation of the sample at time t+1 depends only on the state at time t.
If you have a sequence, the first thing to do is find the most probable state sequence (with the Viterbi algorithm), as you did. Now you know the state that generated the last observation you have (the one you denote y(t)).
Now, from this state, the transition matrix gives you the probability of transitioning to every other state of the model. This is a probability mass function (pmf), and you can draw a state number from it (not by hand! R has a built-in way to draw a sample from a pmf, e.g. sample() with its prob argument). The state number you draw is the state your system is in at time t+1.
With this information, you can now draw a sample observation from the emission distribution assigned to this new state (likewise, if it is a Gaussian distribution, use a Gaussian random generator, which also exists in R).
From this state at time t+1, you can apply the same procedure to reach a state at time t+2, and so on.
Keep in mind that if you run this full procedure several times (to generate samples from time t+1 to t+k), you will get different results each time. This is due to the probabilistic nature of the model. I am not sure what you mean by the most probable future emissions, nor whether routines exist for that. You can compute the likelihood of the full sequence you obtain at the end (from 1 to t+k); it will in general be greater than the likelihood of the sequence up to t, since the last part has been generated from the model itself and thus fits "perfectly" in some regard.
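The sampling steps above can be sketched as follows. This is a generic illustration for a discrete-emission HMM, not hmm.discnp's internal representation; `A`, `emit_probs`, and `last_state` are assumed inputs you would extract from the fitted object:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_future_emissions(A, emit_probs, last_state, k):
    """Sample k future emissions from a discrete-emission HMM.

    A[i, j]       -- P(next state = j | current state = i)
    emit_probs[i] -- pmf over emission symbols for state i
    last_state    -- Viterbi state of the last observed emission
    """
    state = last_state
    out = []
    for _ in range(k):
        # draw the next hidden state from the transition pmf ...
        state = int(rng.choice(len(A), p=A[state]))
        # ... then draw an emission from that state's pmf
        out.append(int(rng.choice(emit_probs.shape[1], p=emit_probs[state])))
    return out
```

Running it repeatedly gives different sequences, as noted above; averaging many runs approximates the predictive distribution of the future emissions.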

How to test time series model?

I am wondering what a good approach for testing a time series model would be. Suppose I have a time series over a time domain t1, t2, ..., tN, with inputs zt1, zt2, ..., ztN and outputs x1, x2, ..., xN.
Now, if that were a classical data mining problem, I could go with known approaches like cross-validation, leave-one-out, 70-30 or something else.
But how should I approach testing my model on a time series? Should I build the model on the first t1, t2, ..., t(N-k) inputs and test it on the last k? But what if we want to maximise the prediction p steps ahead rather than k (where p < k)? I am looking for a robust solution that I can apply to my specific case.
With time-series fitting, you need to be careful not to use your out-of-sample data until after you've developed your model. The main problem with modelling is that it's simply easy to overfit.
Typically we use 70% of the data for in-sample modelling and 30% for out-of-sample testing/validation. When the model goes to production, the data collected day-to-day becomes true out-of-sample data: data you have never seen or used.
Now, if you have enough data points, I'd suggest trying a rolling-window fitting approach. For each time step in your in-sample period, you look back N time steps to fit your model and see how its parameters vary over time. For example, say your model is the linear regression Y = B0 + B1*X1 + B2*X2. You'd run the regression N - window_size times over the sample. This way you understand how sensitive your betas are to time, among other things.
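A minimal sketch of the rolling-window refitting described above (ordinary least squares refit on each trailing window; the function and variable names are illustrative):

```python
import numpy as np

def rolling_betas(X, y, window):
    """Refit y ~ X @ beta on each trailing window of length `window`
    to see how the coefficients drift over time."""
    betas = []
    for end in range(window, len(y) + 1):
        Xw, yw = X[end - window:end], y[end - window:end]
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        betas.append(beta)
    return np.array(betas)      # one row of betas per window
```

Plotting each column of the result against time shows how stable (or not) the corresponding coefficient is over the sample.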
It sounds like you have a choice between
Using the first few years of data to create the model, then seeing how well it predicts the remaining years.
Using all the years of data for some subset of input conditions, then seeing how well it predicts using the remaining input conditions.

Simple algorithm for online outlier detection of a generic time series

I am working with a large amount of time series.
These time series are basically network measurements arriving every 10 minutes; some of them are periodic (e.g. the bandwidth), while others aren't (e.g. the amount of routing traffic).
I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?
I'm currently using a moving average to remove some noise, but then what? Simple things like standard deviation or MAD against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:
double outlier_detection(double* vector, double value);
where vector is the array of doubles containing the historical data, and the return value is the anomaly score for the new sample value.
This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, which is typically used for applications like noise-cancelling headphones. You have a filter that constantly adapts to the input signal, effectively matching its coefficients to a hypothetical short-term model of the signal source, thereby minimizing the mean-square error of the output. This gives you a low-level output signal (the residual error) except when you get an outlier, which produces a spike that is easy to detect (threshold it). Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
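The adaptive-filtering idea can be sketched with a plain LMS one-step predictor. This is a toy illustration, not production code; the filter order and step size mu are arbitrary choices:

```python
import numpy as np

def lms_residuals(x, order=4, mu=0.05):
    """One-step-ahead LMS predictor: at each step, predict x[n] from
    the previous `order` samples, record the prediction error, then
    adapt the weights.  The residual stays small while the signal
    matches its short-term model and spikes on an outlier."""
    x = np.asarray(x, dtype=float)
    w = np.zeros(order)
    res = np.zeros(len(x))
    for n in range(order, len(x)):
        past = x[n - order:n]
        res[n] = x[n] - w @ past        # prediction error (residual)
        w += mu * res[n] * past         # LMS weight update
    return res
```

On a slowly varying signal the residual shrinks as the filter converges, so a simple threshold on the residual flags outliers.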
I suggest the scheme below, which should be implementable in a day or so:
Training
Collect as many samples as you can hold in memory
Remove obvious outliers using the standard deviation for each attribute
Calculate and store the correlation matrix and also the mean of each attribute
Calculate and store the Mahalanobis distances of all your samples
Calculating "outlierness":
For the single sample of which you want to know its "outlierness":
Retrieve the means, covariance matrix and Mahalanobis distances from training
Calculate the Mahalanobis distance "d" for your sample
Return the percentile in which "d" falls (using the Mahalanobis distances from training)
That will be your outlier score: 100% is an extreme outlier.
PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust when the measurements vary in units and scale.
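A compact sketch of the training/scoring scheme above (using the correlation matrix, per the PS; the obvious-outlier removal step is omitted for brevity, and all names are illustrative):

```python
import numpy as np

def train(samples):
    """Training: store the mean, per-attribute std, inverse correlation
    matrix, and the sorted Mahalanobis distances of all samples."""
    mu = samples.mean(axis=0)
    sd = samples.std(axis=0)
    z = (samples - mu) / sd                  # standardize each attribute
    inv_corr = np.linalg.inv(np.corrcoef(samples, rowvar=False))
    # Mahalanobis distance of every training sample
    d = np.sqrt(np.einsum('ij,jk,ik->i', z, inv_corr, z))
    return mu, sd, inv_corr, np.sort(d)

def outlier_score(x, mu, sd, inv_corr, train_d):
    """Percentile of the sample's Mahalanobis distance among the
    training distances: near 100 means an extreme outlier."""
    z = (x - mu) / sd
    d = np.sqrt(z @ inv_corr @ z)
    return 100.0 * np.searchsorted(train_d, d) / len(train_d)
```

A sample near the training mean scores near 0; a sample farther out than anything seen in training scores 100.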
