Curve smoothing preserving peaks and valleys in R

I want to generate a smoothed version of a set of datapoints, but preserve any local peaks or valleys (i.e. where the previous and next value are either both greater than or both less than the current value). The standard smooth function in R does a running median, which works great for intermediate values but chops off the peaks and valleys of the data. Here's an example which hopefully helps to explain the issue:
peakValleys <- runif(20);
landform <- approx(x=seq_along(peakValleys), y=peakValleys, n=200)$y;
landform <- landform + runif(200, max=0.1);
plot(landform);
points(smooth(landform), type="l", col="red");
I'd like the red line to include any local peaks/valleys, but still look smooth (i.e. not just jumping at the peak/valley point). This is most obvious around 28, where a valley datapoint has been completely excluded.
The problem is worse if I have a larger smoothing window with the running median. Ideally, this "smoothing with peak inclusion" should work for any window size:
plot(landform);
points(runmed(landform, 7), type="l", col="red");
I care the most about the peaks/valleys, and the least about the points in between. One application of this is in finding and quantifying peaks in flow cytometry, where a small change in peak height can mean a big difference in the estimated cell fraction. Another application is in nanopore sequencing of DNA, where peaks can be very brief but are important to know about.
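For reference, here is a minimal sketch (added for illustration, not part of the original question) that indexes the local peaks and valleys as defined above, i.e. points whose previous and next values are both higher or both lower, so you can see which points the running median drops:
# find indices where the sign of the first difference changes
d <- diff(sign(diff(landform)))
extrema <- which(d != 0) + 1
plot(landform)
points(smooth(landform), type="l", col="red")
points(extrema, landform[extrema], col="blue", pch=19)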

Related

Drawing sine waves at higher frequencies

I'm trying to plot some sine waves (code example in plain JS here).
When the frequency is low (freq = 10 Hz in that case), the plot is quite nice:
The problem is when I increase the frequency (try setting var freq = 50, for example):
there are lots of ripples; the plot becomes distorted and no longer looks good. If I increase it further it gets even worse (var freq = 8030, for example, is terrible).
When I see these kinds of graphs on professional systems, they are displayed just fine.
How would you improve it? FFT, splines, whatever? Which is the right approach?
I don't really need accuracy (i.e. for waveform analysis or whatever), I just want it to plot nicely (as in Desmos https://www.desmos.com/calculator/eodkjlywjh, for example).
Like Matt said, the problem here is the same as in "Plotting a discrete-time signal shows amplitude modulation": "At higher frequencies, when the peak falls between two samples, the sampled points can be a lot lower than the peak." This causes the peaks of the waveform to vary, making a kind of ripple visual effect at the top and bottom of the plot.
Increasing the sampling resolution helps. If you can't increase sampling resolution by much, adjust the sampling resolution to the closest integer multiple of the sine's frequency and set the sampling phase so that it samples the waveform peaks. For instance for a 100 Hz wave cos(2 π (100*x + 0.3)), you could sample at 400 Hz at x[n] = n/400 - 0.003. That way there are four samples per period, with x[0], x[4], x[8], ... sampling exactly on the peaks.
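To illustrate, here is a hedged sketch in R (rather than JS), with made-up sampling rates, comparing an unaligned sampling grid with one aligned to the peaks as described above:
f <- 100
wave <- function(x) cos(2 * pi * (f * x + 0.3))
x_naive <- seq(0, 0.2, by = 1/430)     # unrelated rate: peaks fall between samples
x_aligned <- (0:80) / 400 - 0.003      # 4 samples per period, every 4th lands on a peak
op <- par(mfrow = c(2, 1))
plot(x_naive, wave(x_naive), type = "l", main = "430 Hz, unaligned")
plot(x_aligned, wave(x_aligned), type = "l", main = "400 Hz, peak-aligned")
par(op)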
Another thought: plotting with a thicker line can help to smooth over some defects. It looks like the Desmos example you linked uses something like a 3-pixel line width.

Algorithmically detecting jumps in a time-series

I have about 50 datasets that include all trades within a timeframe of 30 days for about 10 pairs on 5 exchanges. All pairs are of the same asset class, meaning they are strongly correlated and expected to have similar properties, but are on different scales. An example of this data would be:
set.seed(1)
n <- 1000
dates <- seq(as.POSIXct("2019-08-05 00:00:00", tz="UTC"), as.POSIXct("2019-08-05 23:59:00", tz="UTC"), by="1 min")
x <- data.frame("t" = sort(sample(dates, 1000)),"p" = cumsum(sample(c(-1, 1), n, TRUE)))
Roughly, I need to identify the relevant local minima and maxima, which happen daily. The yellow marks are my points of interest. Unlike this example, there is usually only one such point per day and I consider each day separately. However, it is hard to filter out noise from my actual points of interest.
My actual goal is to find the exact point at which the pair started to make a jump and the exact point at which the jump is over. This needs to be as accurate as possible, as I want to observe which asset moved first and which asset followed at which point in time (as said, they are highly correlated).
Between two extreme values, I want to minimize the distance and maximize the relative/absolute change, as my points of interest are usually close to each other and their difference is quite large.
I already looked at other questions like Finding local maxima and minima and Algorithm to locate local maxima, and also at this algorithm that has the same goal. However, my dataset is extremely noisy. I already reduced the dataset to 5-minute intervals, but this led to the functions for identifying local minima & maxima missing the relevant points. It was therefore not a good solution for my goal.
How can I achieve my goal with a reasonably accurate algorithm? Manually skimming through all the time series is not an option, since this would require me to evaluate 50 * 30 time series by hand, which is far too time-consuming. I'm really puzzled and have been trying to find a suitable solution for a week.
If more code snippets are wanted, I'm happy to share them; however, they didn't give me meaningful results, which would go against the idea of providing a minimal working example, so I've left them out for now.
EDIT:
First off, I updated the plot and added timestamps to the dataset to give you an idea of the actual resolution. Ideally, the algorithm would detect both jumps on the left: the inner two dots because they're closer together and the jump happens without interruption, and the outer dots because their values are more extreme. In fact, this perhaps answers the question of whether the algorithm is allowed to look into the future: yes, if there's another local extremum within a range of, say, 30 observations (or 30 minutes), then ignore the intermediate local extrema.
In my data, jumps have ranged from roughly 2% to 15%, so a jump needs to be at least 2% to be considered, and only if a threshold of 15 (this might be adaptable) consecutive steps in the same direction before/after the peaks and valleys is reached.
A very naive approach was to subset the data around the global minimum and maximum of a day. In most cases this denoised the data and worked as an indicator. However, it is not robust when the global extrema are not within the range of the jump.
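As a rough sketch of that naive approach (my illustration, using the example data frame x built above):
# subset the day's data to the window between its global minimum and maximum
i_min <- which.min(x$p)
i_max <- which.max(x$p)
idx <- sort(c(i_min, i_max))
x_sub <- x[idx[1]:idx[2], ]
plot(x_sub$t, x_sub$p, type = "l")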
Hope this clarifies why this isn't a statistical question (there are some tests to determine whether a jump has happened, but not for jump arrival time afaik).
In case anyone wants a real example:
this is a corresponding graph, this is the raw data of the relevant period and this is the reduced dataset.
Perhaps as a starting point, look at the function streaks in package PMwR (which I maintain). A streak is defined as a move of a specified size that is uninterrupted by a countermove of the same size. The function works with returns, not differences, so I add 100 to your data.
For instance:
set.seed(1)
n <- 1000
x <- 100 + cumsum(sample(c(-1, 1), n, TRUE))
plot(x, type = "l")
library(PMwR)
s <- streaks(x, up = 0.12, down = -0.12)
abline(v = s[, 1])
abline(v = s[, 2])
The vertical lines show the starts and ends of streaks.
Perhaps you can then filter the identified streaks by criteria such as length. Or you may play around with different thresholds for up and down moves (this is not really recommended with the current implementation, but perhaps the results are good enough). For instance, up streaks might look as follows; a green vertical line shows the start of a streak, a red line shows its end.
plot(x, type = "l")
s <- streaks(x, up = 0.12, down = -0.05)
s <- s[!is.na(s$state) & s$state == "up", ]
abline(v = s[, 1], col = "green")
abline(v = s[, 2], col = "red")
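As a hedged example of the length filtering mentioned above (reusing the s returned by streaks, with an arbitrary minimum length of 30 observations to echo the 30-observation window from the edit):
# keep only streaks that last at least 30 observations
s_long <- s[(s[, 2] - s[, 1]) >= 30, ]
abline(v = s_long[, 1], col = "green", lwd = 2)
abline(v = s_long[, 2], col = "red", lwd = 2)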

Set symmetric specific cuts for bimodal data

I have a bimodal, asymmetric distribution which I would like to cut at both ends. The specific part is that I would like to calculate symmetric boundaries on the appropriate side of each 'bell'. The figure shows an extreme case of separation between bells for simplicity.
In this case the red cuts were selected by eye, and the 1500 used on each side for the blue lines is an arbitrary value that could potentially be passed to a function for the trim. My goal is to subset everything between the blue lines.
hist(p3_cut$x,50)
abline(v=c(6200,7600),col='red')
abline(v=c(6200-1500,7600+1500),col='blue')
My guess is that the problem here is basically to find the 'edges' of each curve. I cannot use the half distance between means; I need something that recognizes the change in frequency from 0 (or a very low value) to something relatively high.
A somewhat general answer. Depending on the problem, you might need to adjust the bandwidth (the adjust argument) of the density function:
# x is the bimodal data vector (p3_cut$x in the question)
# get the density of x and normalize it so the maximum is one
dens <- density(x, adjust = 0.1)
dens$y <- dens$y / max(dens$y)
# keep all x where the density is higher than some fraction of the max (here 1%)
min_frac <- 0.01
x_keep <- dens$x[dens$y > min_frac]
# find the position of the gap in x, and get the x just before and after the gap
gap_pos <- which.max(diff(x_keep))
left_cut <- x_keep[gap_pos]
right_cut <- x_keep[gap_pos + 1]
Using this code and changing the adjust parameter in the density function, I was able to calculate almost perfect cuts, at least for this case. I am positive that this approach is flexible enough for most situations similar to this one. I show the results for the proposed cuts.
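A hedged usage sketch (my addition), mirroring the question's plotting code with the same arbitrary 1500 offset for the blue lines:
hist(x, 50)
abline(v = c(left_cut, right_cut), col = "red")
abline(v = c(left_cut - 1500, right_cut + 1500), col = "blue")
x_trimmed <- x[x > left_cut - 1500 & x < right_cut + 1500]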

Calculating the volume under a surface

I have created a 3D plot (a surface) using the wireframe function. I wonder if there are any functions with which I can calculate the volume under the surface in a 3D plot?
Here is a sample of my data plus the wireframe syntax I used to create my 3D (surface) plot:
x1<-c(13,27,41,55,69,83,97,111,125,139)
x2<-c(27,55,83,111,139,166,194,222,250,278)
x3<-c(41,83,125,166,208,250,292,333,375,417)
x4<-c(55,111,166,222,278,333,389,445,500,556)
x5<-c(69,139,208,278,347,417,487,556,626,695)
x6<-c(83,166,250,333,417,500,584,667,751,834)
x7<-c(97,194,292,389,487,584,681,779,876,974)
x8<-c(111,222,333,445,556,667,779,890,1001,1113)
x9<-c(125,250,375,500,626,751,876,1001,1127,1252)
x10<-c(139,278,417,556,695,834,974,1113,1252,1391)
df<-data.frame(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10)
df.matrix<-as.matrix(df)
library(lattice)
wireframe(df.matrix,
          aspect = c(61/87, 0.4),
          scales = list(arrows = FALSE, cex = .5, tick.number = "10", z = list(arrows = TRUE)),
          ylim = c(1:10),
          xlab = expression(phi1), ylab = "Percentile", zlab = " Loss",
          main = "Random Classifier",
          light.source = c(10, 10, 10), drape = TRUE,
          col.regions = rainbow(100, s = 1, v = 1, start = 0, end = max(1, 100 - 1)/100, alpha = 1),
          screen = list(z = -60, x = -60))
Note: my real data is a 100x100 matrix.
Thanks
The data you are feeding to wireframe is a grid of values. Hence one estimate of the volume of whatever underlying surface this is approximating is the sum of the grid values multiplied by the grid cell areas. This is just like adding up the heights of histogram bars to get the number of values in your histogram.
The problem I see with doing this on your data is that the cell areas are going to be in odd units: percentile on one axis, and phi, whose units are unknown, on the other, so your volume is going to have units of loss times units of percentile times units of phi.
This isn't a problem if you want to compare volumes of similar things on exactly the same grid, but if you have surfaces on different grids (different values of phi, or different percentiles) then you need to be careful.
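As a minimal sketch of the grid-sum estimate (assuming a grid spacing of 1 in both directions, i.e. index units, so each cell has area 1):
cell_area <- 1 * 1
volume_grid <- sum(df.matrix) * cell_area
volume_grid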
Now, noting that wireframe doesn't draw like a 3D histogram would (looking like square tower blocks), this gives us another way to estimate the volume. Your 10x10 matrix is plotted as 9x9 squares. Divide each of those squares into two triangles and then compute the volumes of the resulting 162 triangular prisms with sloping tops. Each prism's volume is its triangular base area times the height of the sloping top at the triangle's centroid, which works out to the mean of the heights at the triangle's three corners.
I thought maybe this would be in the raster package, but it isn't. There's code for computing the surface area but not the volume! I'm sure the raster maintainer would be happy to have some code for this!
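Here is a hedged sketch of that prism calculation (my code, assuming unit grid spacing): split each grid cell into two triangles and add up base area times the mean height of the three corners.
prism_volume <- function(z, dx = 1, dy = 1) {
  tri_area <- dx * dy / 2
  vol <- 0
  for (i in 1:(nrow(z) - 1)) {
    for (j in 1:(ncol(z) - 1)) {
      z11 <- z[i, j];     z12 <- z[i, j + 1]
      z21 <- z[i + 1, j]; z22 <- z[i + 1, j + 1]
      # each prism: triangle area times mean height of its three corners
      vol <- vol + tri_area * mean(c(z11, z12, z21)) +
                   tri_area * mean(c(z12, z21, z22))
    }
  }
  vol
}
prism_volume(df.matrix)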
If the points are arbitrary (i.e., don't follow a smooth function), it seems like you're looking for the volume of the convex hull (minimum surface) surrounding these points. One package to help you calculate this is alphashape3d.
You'll need a 3-column matrix of the coordinates to form the right type of object for the calculation, but it seems rather straightforward.
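A rough sketch of what that could look like (hedged: I'm assuming the alphashape3d API here, with ashape3d() building the shape and volume_ashape3d() measuring it, and an arbitrary alpha):
library(alphashape3d)
pts <- cbind(as.vector(row(df.matrix)),   # x: row index
             as.vector(col(df.matrix)),   # y: column index
             as.vector(df.matrix))        # z: surface value
as3d <- ashape3d(pts, alpha = 2)          # alpha controls how tightly the shape wraps the points
volume_ashape3d(as3d)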

What method does outline=FALSE use to determine outliers? [duplicate]

This question already has answers here: In ggplot2, what do the end of the boxplot lines represent? (4 answers, closed 10 years ago)
In R, I have used the outline=FALSE parameter to exclude outliers when plotting a box and whisker for a particular set. It's worked spectacularly, but leaves me wondering how exactly it determines which elements are outliers.
boxplot(x, horizontal = TRUE, axes = FALSE, outline = FALSE)
An "outlier" in the terminology of box-and-whisker plots is any point in the data set that falls farther than a specified distance from the median, typically approximately 2.5 times the difference between the median and the 0.25 (lower) or 0.75 (upper) quantile. To get there, see ?boxplot.stats: first, look at the definition of out in the output
out: the values of any data points which lie beyond the extremes of the whiskers (if(do.out)).
These are the "outliers".
Second, look at the definition of the whiskers, which are based on the coef parameter (1.5 by default):
the whiskers extend to the most extreme data point which is no more than coef times the length of the box away from the box.
Finally, look at the definition of the "hinges", which are the ends of the box:
The two ‘hinges’ are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4).
Put these together, and you get outliers defined (approximately) as points that are farther from the relevant quartile than 1.5 times the interquartile range, which for roughly symmetric data is about 4 times the distance between the median and that quartile. The reasons for these somewhat convoluted definitions are (I think) partly historical and partly the desire to have the components of the plots reflect actual values that are present in the data (rather than, say, the halfway point between two data points) as much as possible. (You would probably need to go back to the original literature referenced in the help page for the full justifications and explanations.)
The thing to be careful about is that points defined as "outliers" by this algorithm are not necessarily outliers in the usual statistical sense (e.g. points that are surprisingly extreme based on a particular statistical model of the data). In particular, if you have a big data set you will necessarily see lots of "outliers" (one indication that you might want to switch to a more data-hungry graphical summary such as a violin plot or beanplot).
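To make that concrete, here is a small sketch (my illustration, not from the original answer) that reproduces the rule by hand from the hinges and the default coef of 1.5:
set.seed(42)
x <- c(rnorm(50), 5)                      # one deliberately extreme value
hinges <- fivenum(x)[c(2, 4)]             # lower and upper hinge
box_len <- diff(hinges)                   # length of the box
manual_out <- x[x < hinges[1] - 1.5 * box_len | x > hinges[2] + 1.5 * box_len]
identical(sort(manual_out), sort(boxplot.stats(x)$out))   # TRUE: same points flagged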
For boxplot, outliers are the points that lie above or below the "whiskers". These, by default, extend to the data points that are no more than the interquartile range times the range argument away from the box. The default range value is 1.5, but you can change it, and in doing so you also change the list of outliers.
You can also see that with the boxplot.stats function, which performs the computation used by the plot.
For example, if you have the following vector:
v <- c(runif(10), -0.5, -1)
boxplot(v)
By default, only the -1 value is considered an outlier. You can see it with boxplot.stats:
boxplot.stats(v)$out
[1] -1
But if you change the range argument (or the coef one for boxplot.stats), then -1 is no longer considered an outlier:
boxplot(v, range=2)
boxplot.stats(v, coef=2)$out
numeric(0)
This is admittedly not immediately evident from boxplot(). Look at the range parameter:
this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
So the value of range is used, together with the interquartile range and the box (given by the quartiles), to determine where the whiskers end. And everything outside the whiskers is an outlier.
I'll be the first to agree that this definition is unintuitive. Sadly enough, it is well established by now.
