Smoothing of time-series data without smoothing out peak values in R - r

I have got a 3 months time series of daily data (data is recorded every 5 mins). The data is pretty noisy.
I have already tried some MA methods. They work fine and the resulting curve is fairly smooth but the problem is that the peaks are almost smoothed out.
So my question is:
Is there any method to get rid of all this noise in the graph but preserve the peak values?
I have also read something about Kalman-Filtering, but I am not sure how this works and if this is suitable for my problem.
I tried the following code:
smooth <- rollapply(PCM4 [,3], width=10, FUN=mean, align = "center", fill=NA)
I also tried some different input values for window width, which made the resulting data smoother, but also reduced the peak values which is not what I want.
data set:
DateTime h v Q T
2014-12-18 11:45:00 0.112 0.515 17.141 15.4
2014-12-18 11:50:00 0.113 0.511 17.007 15.5
2014-12-18 11:55:00 0.114 0.518 17.480 15.5
unsmoothed plot:
smoothed plot (width=10):
As you see, the second plot is quite distorted and the first peak e.g. is at about 250 L/s instead of 500 L/s.
The reason for this is, that it´s computed from the rolling mean, so it gets quite distorted.
But the question is: Is there any better solution to fit my needs??

Is there any method to get rid of all this noise in the graph but preserve the peak values?
The challenge here is that you have not really said what is noise and what is signal. Normally, a wildly different ("peak") value would be classified as noise. When people say filtering, they are usually thinking of low-pass filtering (removing high frequency noise and keeping general trends). A sudden peak is going to be noise by that definition.
A Kalman Filter would give you a tool to use if you had a mathematical understanding of your system and its noise. In the KF's "predict" step you would have a mathematical model which would produce an expected value against which you would test your measurement. If you could predict peaks (either their value, or even just when they exist) a KF could help you.
An approach that might help is http://www.lifl.fr/~casiez/1euro/ the "1 Euro" filter. The core idea is that gross movements (your sudden peaks) are likely to be essentially true, while periods of low movement are noisy and should be averaged down. That filter opens up its bandwidth suddenly whenever there's a big movement, and then gradually clamps it down. It was designed for tracking human movements without reflecting the noise from the measurements.

Related

Relrisk function and bandwidth selection in spatstat

I'm having trouble interpreting the results I got from relrisk. My data is a multiple point process containing two marks (two rodents species AA and RE), I want to know if they are spatially segregated or not.
> summary(REkm)
Marked planar point pattern: 46 points
Average intensity 0.08101444 points per square unit
*Pattern contains duplicated points*
Coordinates are given to 3 decimal places
i.e. rounded to the nearest multiple of 0.001 units
Multitype:
frequency proportion intensity
AA 15 0.326087 0.02641775
RE 31 0.673913 0.05459669
Window: rectangle = [4, 38] x [0.3, 17] units
x 16.7 units)
Window area = 567.8 square units
relkm <- relrisk(REkm)
plot(relkm, main="Relrisk default")
The bandwidth of this relrisk estimation is automatically selection by default(bw.relrisk), but when I tried other numeric number using sigma= 0.5 or 1, the results are somehow kind of weird.
How did this happened? Was it because the large proportion of blank space of my ppp?
According to chapter.14 of Spatial Point Patterns books and the previous discussion, I assume the default of relrisk will show the ratio of intensities (case divided by control, in my case: RE divided by AA), but if I set casecontrol=FALSE, I can get the spatially-varying probability of each type.
Then why the image of type RE in the Casecontrol=False looks exactly same as the relrisk estimation by default? Or they both estimate p(RE)=λRE/ λRE+λAA for each sites?
Any help will be appreciated! Thanks a lot!
That's two questions.
Why does the image for RE when casecontrol=FALSE look the same as the default output from relrisk?
The definitive source of information about spatstat functions is the online documentation in the help files. The help file for relrisk.ppp gives full details of the behaviour of this function. It says that the calculation of probabilities and risks is controlled by the argument relative. If relative=FALSE (the default), the code calculates the spatially varying probability of each type. If relative=TRUE it calculates the relative risk of each type i, defined as the ratio of the probability of type i to the probability of type c where c is the type designated as the control. If you wanted the relative risk then you should set relative=TRUE.
Very different results obtained when setting sigma=0.5 compared to the automatically selected bandwidth.
Your example output says that the window is 34 by 17 units. A smoothing bandwidth of sigma=0.5 is very small for this region. Imagine each data point being replaced by a blurry circle of radius 0.5; there would be a lot of empty space. The smoothing procedure is encountering numerical problems which are causing the funky artefacts.
You could try a range of different values of sigma, say from 1 to 15, and decide which value produces the most satisfactory result.
The plot of relrisk(REkm, casecontrol=FALSE) suggests that the automatic bandwidth selector bw.relriskppp chose a much larger value of sigma, perhaps about 10. You can investigate this by
b <- bw.relriskppp(REkm)
print(b)
plot(b)
The print command will print the chosen value of sigma that was used in the default calculation. The plot command will show the cross-validation criterion which was maximised to select the bandwidth. This gives you an idea of the range of values of sigma that are acceptable according to the automatic selector.
Read the help file for bw.relriskppp about the different options available for bandwidth selection method. Maybe a different choice of method would give you a more acceptable result from your viewpoint.

Poorly fitting curve in natural log regression

I'm fitting a logarithmic curve to 20+ data sets using the equation
y = intercept + coefficient * ln(x)
Generated in R via
output$curvePlot <- renderPlot ({
x=medianX
y=medianY
Estimate = lad(formula = y~log(x),method = "EM")
logEstimate = lad(formula = y~log(x),method = "EM")
plot(x,predict(Estimate),type='l',col='white')
lines(x,predict(logEstimate),col='red')
points(x,y)
cf <- round(coef(logEstimate),1)
eq <- paste0("y = ", cf[1],
ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " * ln(x) from 0 to ",xmax)
mtext(eq,3,line=-2,col = "red")
output$summary <- renderPrint(summary(logEstimate))
output$calcCurve <-
renderPrint(round(cf[2]*log(input$calcFeet)+cf[1]))
})
The curve consistently "crosses twice" on the data; fitting too low at low/high points on the X axis, fitting too high at the middle of the X axis.
I don't really understand where to go from here. Am I missing a factor or using the wrong curve?
The dataset is about 60,000 rows long, but I condensed it into medians. Medians were selected due to unavoidable outliers in the data, particularly a thick left tail, caused by our instrumentation.
x,y
2,6.42
4,5.57
6,4.46
8,3.55
10,2.72
12,2.24
14,1.84
16,1.56
18,1.33
20,1.11
22,0.92
24,0.79
26,0.65
28,0.58
30,0.34
32,0.43
34,0.48
36,0.38
38,0.37
40,0.35
42,0.32
44,0.21
46,0.25
48,0.24
50,0.25
52,0.23
Full methodology for context:
Samples of dependent variable, velocity (ft/min), were collected at
various distances from fan nozzle with a NIST-calibrated hot wire
anemometer. We controlled for instrumentation accuracy by subjecting
the anemometer to a weekly test against a known environment, a
pressure tube with a known aperture diameter, ensuring that
calibration was maintained within +/- 1%, the anemometer’s published
accuracy rating.
We controlled for fan alignment with the anemometer down the entire
length of the track using a laser from the center of the fan, which
aimed no more than one inch from the center of the anemometer at any
distance.
While we did not explicitly control for environmental factors, such as
outdoor air temperature, barometric pressure, we believe that these
factors will have minimal influence on the test results. To ensure
that data was collected evenly in a number of environmental
conditions, we built a robot that drove the anemometer down the track
to a different distance every five minutes. This meant that data would
be collected at every independent variable position repeatedly, over
the course of hours, rather than at one position over the course of
hours. As a result, a 24 hour test would measure the air velocity at
each distance over 200 times, allowing changes in temperature as the
room warmed or cooled throughout the day to address any confounding
environmental factors by introducing randomization.
The data was collected via Serial port on the hot wire anemometer,
saving a timestamped CSV that included fields: Date, Time, Distance
from Fan, Measured Temperature, and Measured Velocity. Analysis on the
data was performed in R.
Testing: To gather an initial set of hypotheses, we took the median of
air velocity at each distance. The median was selected, rather than
the mean, as outliers are common in data sets measuring physical
quantities. As air moves around the room, it can cause the airflow to
temporarily curve away from the anemometer. This results in outliers
on the low end that do not reflect the actual variable we were trying
to measure. It’s also the case that, sometimes, the air velocity at a
measured distance appears to “puff,” or surge and fall. This is
perceptible by simply standing in front of the fan, and it happens on
all fans at all distances, to some degree. We believe the most likely
cause of this puffing is due to eddy currents and entrainment of the
surrounding air, temporarily increasing airflow. The median result
absolves us from worrying about how strong or weak a “puff” may feel,
and it helps limit the effects on air speed of the air curving away
from the anemometer, which does not affect actual air velocity, but
only measured air velocity. With our initial dataset of medians, we
used logarithmic regression to calculate a curve to match the data and
generated our initial velocity profiles at set distances. To validate
that the initial data was accurate, we ran 10 monte carlo folding
simulations at 25% of the data set and ensured that the generated
medians were within a reasonable value of each other.
Validation: Fans were run every three months and the monte carlo
folding simulations were observed. If the error rate was <5% from our
previous test, we validated the previous test.
There is no problem with the code itself, you found the best possible fit using a logarithmic curve. I double-checked using Mathematica, and I obtain the same results.
The problem seems to reside in your model. From the data you provided and the description of the origin of the data, the logarithmic function might not the best model for your measurements. The description indicates that the velocity must be a finite value at x=0, and slowly tends towards 0 while going to infinity. However, the negative logarithmic function will be infinite at x=0 and negative after a while.
I am not a physicist, but my intuition would tend towards using the inverse-square law or using the exponential function. I tested both, and the exponential function gives way better results:

Is it feasible to denoise time irrelevant sensor reading with Kalman Filter and how to code it?

After I did some research, I can understand how to implement it with time relevant functions. However, I'm not very sure about whether can I apply it to time irrelevant scenarios.
Giving that we have a simple function y=a*x^2, where both y and x are measured at a constant interval (say 1 min/sample) and a is a constant. However, both y and x measurements have white noise.
More specifically, x and y are two independently measured variables. For example, x is air flow rate in a duct and the y is the pressure drop across the duct. Because the air flow is varying due to the variation of the fan speed, the pressure drop across the duct is also varying. The relation between the pressure drop y and flow rate x is y=a*x^2, however both measurement embedded white noise. Is that possible to use Kalman Filter to estimate a more accurate y? Both x and y are recorded in a constant time interval.
Here are my questions:
Is it feasible to implement Kalman Filter for the y reading noise reduction? Or in another word, have a better estimation of y?
If this is feasible, how to code it in R or C?
P.S.
I tried to apply Kalman Filter to single variable and it works well. The result is as below. I'll have a try Ben's suggestion then and have a look whether can I make it works.
I think you can apply some Kalman Filter like ideas here.
Make your state a, with variance P_a. Your update is just F=[1], and your measurement is just H=[1] with observation y/x^2. In other words, you measure x and y and estimate a by solving for a in your original equation. Update your scalar KF as usual. Approximating R will be important. If x and y both have zero mean Gaussian noise, then y/x^2 certainly doesn't, but you can come up with an approximation.
Now that you have a running estimate of a (which is a random constant, so Q=0 ideally, but maybe Q=[tiny] to avoid numerical issues) you can use it to get a better y.
You have y_meas and y_est=a*x_meas^2. Combine those using your variances as (R_y * a * x^2 + (P_a + R_x2) * y_meas) / (R_y + P_a + R_x2). Over time as P_a goes to zero (you become certain of your estimate of a) you can see you end up combining information from your x and y measurements proportional to your trust in them individually. Early on, when P_a is high you are mostly trusting the direct measurement of y_meas because you don't know the relationship.

How can I get a plot in rpart to use observed values rather than weights

I've successfully completed an analysis in rpart, where I have 0-1 outcome data, where I have weighted the data to deal with the problem of a scarce response. When I plot the data using prp, I want the labels to have the true proportion, rather than the weighted proportion. Is this possible?
A sample data set below (note that I am working with many more factors than I'm using here!)
require(rpart)
require(rpart.plot)
set.seed(1001)
x<-rnorm(1000)
y<-rbinom(1000,size=1,prob=1/(1+exp(-x)))
z<-10+rnorm(1000)
weights<-ifelse(y==0,1,z)
rpartfun<-rpart(y~x,
weights=z,method="class",control=list(cp=0))
rparttrim<- prune(rpartfun,cp=rpartfun$cptable[which.min(rpartfun$cptable[,"xerror"]),"CP"])
prp(rparttrim,extra=104)
[I would produce the image I get from that here, but I don't have enough reputation]
Where I would like that first node (and indeed,all the nodes!) to, instead of having .28 to .72 (the weighted proportions), have 0.65 to 0.35 (the true proportion).

Convergence of R density() function to a delta function

I'm a bit puzzled by the behavior of the R density() function in an edge case...
Suppose I add more and more points with x=0 into a simulated data set. What I expect is that the density estimate will very quickly converge (I'm being deliberately vague about what that means...) to a delta function at x=0. In practice, the fit certainly gets narrower, but very slowly, as shown by this sequence of plots:
plot(density(c(0,0)), xlim=c(-2,2))
plot(density(c(0,0,0,0)), xlim=c(-2,2))
plot(density(c(rep(0,10000))), xlim=c(-2,2))
plot(density(c(rep(0,10000000))), xlim=c(-2,2))
But if you add a tiny bit of noise to the simulated data, the behavior is much better:
plot(density(0.0000001*rnorm(10000000) + c(rep(0,10000000))), xlim=c(-2,2))
Just let sleeping dogs lie? Or am I missing something about the usage of density()?
Per ?bw.nrd0, the default bandwidth selector for density:
bw.nrd0 implements a rule-of-thumb for choosing the bandwidth of a Gaussian kernel density estimator. It defaults to 0.9 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power (= Silverman's ‘rule of thumb’, Silverman (1986, page 48, eqn (3.31)) unless the quartiles coincide when a positive result will be guaranteed.
When your data is constant, then the quartiles coincide, so the last clause guaranteeing a positive result kicks in. This basically means that the bandwidth chosen is not a continuous function of the spread of the data, at zero.
To illustrate:
> bw.nrd0(rep(0, 1e6))
[1] 0.05678616
> bw.nrd0(rnorm(1e6, s=1e-6))
[1] 5.672872e-08
Actually (...tail between legs...) I now realize that my entire question was misguided. Being fairly new to R, I had instantly assumed that density() tries to fit Gaussians of different widths to the data points, optimizing both the number of Gaussians and their individual widths. But in fact it is doing something much simpler. It just smears out each data point, and adds up the smears to give a smoothed estimate of the data. density() is just a simple smoothing algorithm. So, yes indeed, RTFM :)

Resources