Time series smoothing, avoiding revisions - r

This time my question is more methodological than technical. I have weekly time series data which gets updated every week. Unfortunately the time series is quite volatile, so I would like to apply a filter/smoothing method. I tried Hodrick-Prescott and LOESS. Both results look fine, with the downside that when a new data point arrives which diverges strongly from the historic data points, the older smoothed values are revised/change. Does somebody know a method implemented in R which could do what I want? A name of a method/function would probably be completely sufficient. It should, however, be something more sophisticated than a left-sided moving average, because I would not like to lose data at the beginning of the time series. Every helpful comment is appreciated! Thank you very much!
Best regards,
Andreas

I think (?) that the term you may be looking for is causal filtering, i.e. filtering that doesn't depend on future values. Within this category probably the simplest/best known approach is exponential smoothing, which is implemented in the forecast and expsmooth packages (library("sos"); findFn("{exponential smoothing}")).
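For example, a minimal sketch with simple exponential smoothing from the forecast package (the series below is just a placeholder for your weekly data):
library(forecast)
y <- ts(rnorm(52, mean = 100, sd = 10), frequency = 52)   # placeholder weekly series
fit <- ses(y)              # simple exponential smoothing (a causal filter)
smoothed <- fitted(fit)    # one-step-ahead fitted values; past values are never revised
plot(y)
lines(smoothed, col = "red")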
Does that help?

It seems you need a robust two-sided smoother. The problem is that an outlier at an end-point is indistinguishable from a sudden change in the trend. It only becomes clear that it is an outlier after several more observations are collected (and even then some strong assumptions of trend smoothness are required).
I think you will find it hard to do better than loess(), but other functions that aim to do robust smoothing include:
smooth() for Tukey's smoothers;
supsmu() for Friedman's super smoother.
Note that Hodrick-Prescott smoothing is not robust to outliers.
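For example, a quick sketch comparing these robust smoothers on placeholder data (replace x and y with your own series):
set.seed(1)
x <- 1:104
y <- sin(x / 10) + rnorm(104, sd = 0.3)            # placeholder noisy series
fit_loess  <- loess(y ~ x, family = "symmetric")   # robust loess
fit_supsmu <- supsmu(x, y)                         # Friedman's super smoother
fit_smooth <- smooth(y)                            # Tukey's running-median smoother
plot(x, y, col = "grey")
lines(x, predict(fit_loess), col = "red")
lines(fit_supsmu, col = "blue")
lines(x, fit_smooth, col = "darkgreen")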


How do I numerically compare the Dymos solution to the simulated solution?

I want to conduct a convergence study for my Dymos optimization results where I vary the number of nodes and compare the simulated solution to the optimization solution. From what I understand, Dymos fits polynomials to the system dynamics to represent the timeseries solution. What is the best way to compare the polynomial trajectory of the optimization solution to the trajectory of the simulated solution? I specifically want to evaluate the difference between the two trajectories away from the collocation/control nodes... to show that the polynomial fitting actually represents the simulated solution. How would I access the polynomial fitting data?
Thanks in advance.
For some of the testing we have an assert_timeseries_near_equal function that treats the denser time series as the truth and tests that the less dense timeseries (usually the discrete solution) is reasonably close to it.
We're actually working on making this method a bit more explicit right now so that it's a little easier for users to apply in general cases, such as comparing discrete solutions from two different cases.
In general, there are a few different ways you can test your collocation results against an explicit integration. You could just verify that the final states of the two solutions are reasonably close. Since the error tends to increase over the course of the trajectory, this is often good enough for a quick check. The downside of this approach is that it doesn't test that both solutions took the same path to the final condition.
To test the solution away from the nodes I'd recommend the following: add a second timeseries output to the relevant phase that contains more segments or higher-order segments. This timeseries will have more nodes, and Dymos will interpolate from the solution's collocation grid onto this denser timeseries output grid. When you compare this against the explicit simulation, the times, controls, and parameters should still match exactly, and you'll better capture the interpolating state polynomials versus the explicitly simulated results.
There are other statistical methods out there for comparing timeseries that you can bring to bear, but visualizing the explicit trajectory plus some error bound as a "tube" into which we want to fit the discrete solution is usually how I handle it.
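If it helps, here is a rough, library-agnostic sketch of that "tube" check (written in R like the other examples on this page, with placeholder vectors standing in for the dense simulated trajectory and the sparse discrete solution):
t_dense  <- seq(0, 10, length.out = 200)
y_dense  <- sin(t_dense)                            # stands in for the simulated trajectory
t_sparse <- seq(0, 10, length.out = 25)
y_sparse <- sin(t_sparse) + rnorm(25, sd = 1e-3)    # stands in for the discrete solution
y_on_dense <- approx(t_sparse, y_sparse, xout = t_dense)$y   # interpolate onto the dense grid
max_err    <- max(abs(y_on_dense - y_dense))
stopifnot(max_err < 1e-2)                           # the tolerance "tube"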

R: Evaluate Gradient Boosting Machines (GBM) for Regression

Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how should I interpret them?
I think maybe you are overthinking this one! Take a step back and think about what matters... the error. You have forecasted values and you have observed values. The difference tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. Additionally, you can find a lot of information in the documentation, so if you haven't taken a look, I would recommend doing so! I have posted the link at the bottom.
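For instance, here is a minimal sketch (the synthetic data and column names are just placeholders) that computes hold-out MSE/MAE and the relative influence from summary():
library(gbm)
set.seed(42)
n  <- 500
df <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
df$y  <- 2 * df$x1 + sin(6 * df$x2) + rnorm(n, sd = 0.3)   # placeholder response
idx   <- sample(n, 0.7 * n)
train <- df[idx, ]
test  <- df[-idx, ]
fit <- gbm(y ~ ., data = train, distribution = "gaussian",
           n.trees = 500, interaction.depth = 2, shrinkage = 0.05, cv.folds = 5)
best_iter <- gbm.perf(fit, method = "cv")        # CV-selected number of trees
pred      <- predict(fit, test, n.trees = best_iter)
mean((test$y - pred)^2)                          # MSE
mean(abs(test$y - pred))                         # MAE
summary(fit, n.trees = best_iter)                # relative influence of each variable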
-Carmine
gbm_documentation

R - Approach to find outliers/artefacts in blood pressure curve

Do you guys have an idea how to approach the problem of finding artefacts/outliers in a blood pressure curve? My goal is to write a program that finds the start and end of each artefact. Here are some examples of different artefacts; the green area is the correct blood pressure curve and the red one is the artefact that needs to be detected:
And this is an example of a whole blood pressure curve:
My first idea was to calculate the mean of the whole curve and the means over many short intervals of the curve, and then find where they differ. But the blood pressure varies so much that I don't think this could work, because it would find too many non-existent "artefacts".
Thanks for your input!
EDIT: Here is some data for two example artefacts:
Artefact1
Artefact2
Without any data there is just the option to point you towards different methods.
First (without knowing your data, which is always a huge drawback), I would point you towards Markov switching models, which can be analysed using the HiddenMarkov-package, or the HMM-package. (Unfortunately the RHmm-package that the first link describes is no longer maintained)
You might find it worthwhile to look into Twitter's outlier detection.
Furthermore, there are many blogposts that look into change point detection or regime changes. I find this R-bloggers blog post very helpful for a start. It refers to the CPM-package, which stands for "Sequential and Batch Change Detection Using Parametric and Nonparametric Methods", the BCP-package ("Bayesian Analysis of Change Point Problems"), and the ECP-package ("Non-Parametric Multiple Change-Point Analysis of Multivariate Data"). You probably want to look into the first two as you don't have multivariate data.
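For instance, a minimal sketch with the bcp package on placeholder data (replace bp with your own blood pressure vector; the 0.5 cut-off is arbitrary):
library(bcp)
set.seed(1)
bp <- c(rnorm(200, 80, 2), rnorm(50, 120, 2), rnorm(200, 80, 2))   # placeholder readings
fit <- bcp(bp)                                  # Bayesian change point analysis
plot(fit)                                       # posterior means and change probabilities
candidates <- which(fit$posterior.prob > 0.5)   # indices with a high change probability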
Does that help you get started?
I can offer a graphical answer that does not use any statistical algorithm. From your data I observe that the "abnormal" sequences seem to show either nearly constant portions or, conversely, very high variations. Working on the derivative and setting limits on it could work. Here is a workaround:
require(forecast)
bp <- df2$BP                                    # raw blood pressure values
bp <- ma(bp, order = 50)                        # smooth away micro-variations
bp <- bp[complete.cases(bp)]
# Flag large derivatives, then smooth the 0/1 flags so an abnormal block
# is detected as one contiguous zone rather than many tiny ones.
flag <- ma(c(0, as.numeric(abs(diff(bp)) > 1)), order = 10) > 0.1
flag[is.na(flag)] <- FALSE
abnormal <- bp
abnormal[!flag] <- NA
plot(x = seq_along(bp), y = bp, type = "l")
lines(x = seq_along(bp), y = abnormal, col = "red")
What it does: first it "smooths" the data with a moving average to prevent the micro-variations from being detected. Then it applies the diff() function (derivative) and tests whether it is greater than 1 (this value has to be adjusted manually depending on the smoothing amplitude). Then, in order to get a whole "block" of abnormal sequence without tiny gaps, we apply a smoothing on the boolean flags again and test whether the result is greater than 0.1 to better capture the boundaries of the zone. Finally, I overplot the spotted portions in red.
This works for one type of abnormality. For the other type, you could instead set a low threshold on the derivative and play with the tuning parameters of the smoothing.

How to calculate first derivative of time series

I would like to calculate the first derivative (dpH/dtime) of a time series using two variables, time and pH.
Are there any functions to do this in R, or should I write my own?
Assuming pH and time are plain vectors try this:
library(pspline)
predict(sm.spline(time, pH), time, 1)
You might want to start with stats::deriv or diff.ts as Matt L suggested. Just keep in mind what a professor of mine used to tell all his students: numerical differentiation is known as an "error multiplier."
EDIT:
To clarify -- what he was warning about was that any noise in your data can throw the derivative estimate way off. It's been said that integration is a low-pass filter and differentiation is a high-pass filter.
So, the important thing is to do some smoothing on your data before calculating a derivative. Hence Gabor's excellent suggestion to use predict on sm.spline. But keep in mind that modifying the spline parameters will smooth your data to different levels, so always look at the results to make sure you removed apparent noise but not desired features.
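For example, a small sketch (with placeholder data) contrasting raw finite differences with the derivative of a smoothing spline:
library(pspline)
time <- seq(0, 10, by = 0.1)
pH   <- 7 + 0.5 * sin(time) + rnorm(length(time), sd = 0.05)   # placeholder noisy data
raw_deriv    <- diff(pH) / diff(time)                   # finite differences amplify the noise
smooth_deriv <- predict(sm.spline(time, pH), time, 1)   # derivative of the smoothing spline
plot(time[-1], raw_deriv, type = "l", col = "grey")
lines(time, smooth_deriv, col = "red")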
Here's a link to "Numerical Differentiation".
http://en.wikipedia.org/wiki/Numerical_differentiation
Here's a link describing a method based on Taylor Series Expansions:
http://ocw.usu.edu/civil_and_environmental_engineering/numerical_methods_in_civil_engineering/ODEsMatlab.pdf

Looking for interesting formula

I'm creating a game where players can make an alloy. To make it less predictable and more interesting, I thought that the durability and hardness of an alloy should not be calculated by a simple formula, because it would be extremely easy to find the extrema where the alloy has the best statistics.
So the question is: is there any formula for a function whose extrema can be found only by evaluating all points? Input values will be percentages: 0.0%-100.0%. I think it should look something like half of a sound wave.
A very simple way would be a sum of a couple of sine functions; just vary the constants and the sign for each new player. Here is one example: (sin(1.1*x) + sin(x) + sin(0.9*x))^2
If you use this between 10*pi and 20*pi you get a function that increases on average but has local minima.
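For instance, a quick sketch of that function over the suggested range:
f <- function(x) (sin(1.1 * x) + sin(x) + sin(0.9 * x))^2
curve(f, from = 10 * pi, to = 20 * pi, n = 1000)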
Modulating a simple linear or exponential function with trigonometric functions whose frequency and amplitude are dependent on the input should get you what you want.
You don't need a formula, I think — throw a bunch of random values around your domain, and then interpolate (linear interpolation will do) between them. Then you can even change the "formula" completely each time the game is run, or once in a while, or change it slowly with time, etc, etc.
If you want something that is very hard to predict then I would suggest involving a random number generator with the same seed every time. You can use it as an envelope for whatever function you come up with (trig functions or what not) to make it more jagged.
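A rough sketch combining the last two suggestions (random knots with a fixed seed, linear interpolation between them; all names are made up for illustration):
set.seed(12345)                     # same seed => the same hidden "formula" every run
knots_x <- seq(0, 100, by = 10)     # alloy percentage in [0, 100]
knots_y <- runif(length(knots_x))   # random quality value at each knot
alloy_quality <- function(p) approx(knots_x, knots_y, xout = p)$y
curve(alloy_quality, from = 0, to = 100)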
An interesting formula to use would be the gamma of the Black-Scholes option pricing model. It goes as follows:
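Gamma = N'(d1) / (S * sigma * sqrt(T)), where d1 = (ln(S/K) + (r + sigma^2/2) * T) / (sigma * sqrt(T)) and N' is the standard normal density.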
You can easily replace the variables; here is a graph of how the function looks: http://www.sqbimmer.com/aalex/gamma.png
