Detecting dips in a 2D plot - r

I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas?
Update:
As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines.
It looks like I now need a robust way to approximate the derivative at each point, one that ignores the little blips that remain in the data. Is there a standard approach?
y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512)
yy <- runmed(y, 41)
plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5)
points(yy, col="blue", type="l", lwd=2)

EDITED: the function now strips the regions so that they contain only their lowest part, if wanted.
Actually, using the mean is easier than using the median. It lets you find regions where the real values are continuously below the running mean. The running median is not smooth enough for an easy application.
One example function to do this would be:
FindLowRegion <- function(x, n = length(x)/4, tol = length(x)/20, p = 0.5) {
  nx <- length(x)
  # make the window width odd
  n <- 2 * (n %/% 2) + 1
  # smooth out based on means (series is NA-padded so the ends are kept)
  sx <- rowMeans(embed(c(rep(NA, n %/% 2), x, rep(NA, n %/% 2)), n), na.rm = TRUE)
  # find runs where the series lies below the running mean
  rlesx <- rle((sx - x) > 0)
  # construct start and end of regions
  int <- embed(cumsum(c(1, rlesx$lengths)), 2)
  # which regions fulfill requirements
  id <- rlesx$values & rlesx$lengths > tol
  # cut each region down to the values below its p-quantile
  # (drop = FALSE keeps a matrix even when only one region qualifies)
  regions <- apply(int[id, , drop = FALSE], 1, function(i) {
    i <- min(i):max(i)
    tmp <- x[i]
    id <- which(tmp < quantile(tmp, p))
    id <- min(id):max(id)
    i[id]
  })
  # return
  unlist(regions)
}
where
n determines how many values are used to calculate the running mean,
tol determines how many consecutive values must lie below the running mean before they count as a low region, and
p determines the cutoff (as a quantile) used for stripping the regions to their lowest part. When p=1, the complete low region is shown.
The function is tweaked to work on data like the sample you presented, but the numbers might need to be adjusted a bit to work with other data.
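To make the role of tol more concrete, here is a minimal sketch of the run-length step on a toy logical vector. The vector below and the names starts, ends and keep are made up for illustration; in the function itself the logical vector comes from comparing x with its running mean.

```r
# Toy input: TRUE marks points lying below the running mean (invented values)
below <- c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
runs <- rle(below)

# Start and end index of every run
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs of TRUE that are longer than tol points
tol <- 2
keep <- runs$values & runs$lengths > tol
data.frame(start = starts[keep], end = ends[keep])
```

Only the run covering positions 2 to 4 survives here; the shorter TRUE runs are treated as blips.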
FindLowRegion returns a set of indices, which allows you to find the low regions. Illustrated with your y vector:
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows,y,NA)
plot(y, col="blue", type="l", lwd=2)
lines(newx,newy,col="red",lwd=3)
Gives :

You have to smooth the graph in some way. Median filtering is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you simply search for the minima as usual (i.e. search for the points where the first derivative switches from negative to positive).
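A minimal sketch of that idea, assuming a synthetic noisy cosine instead of the real data and a window of 41 points (both are illustrative choices). Flat steps are skipped explicitly, because a running median tends to produce short plateaus where the difference is exactly zero.

```r
set.seed(1)
x <- seq(0, 4 * pi, length.out = 400)
y <- cos(x) + rnorm(400, sd = 0.1)   # synthetic stand-in for the real series

ys <- runmed(y, 41)                  # median filter; window width must be odd

s <- sign(diff(ys))                  # -1 going down, +1 going up, 0 on a plateau
idx <- which(s != 0)                 # ignore the flat steps
turn <- which(diff(s[idx]) > 0)      # a downward step followed by an upward step
minima <- idx[turn] + 1              # first point after the last downward step

minima                               # indices of the local minima of the smoothed curve
```

If small wiggles in the smoothed curve still produce spurious minima, a wider window or a minimum-depth check on ys[minima] might help.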

A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries package. A drawdown is commonly defined as the retreat from the most recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
  if (NCOL(x) > 1)
    stop("x is not a vector or univariate time series")
  if (any(is.na(x)))
    stop("NAs in x")
  cmaxx <- cummax(x) - x
  mdd <- max(cmaxx)
  to <- which(mdd == cmaxx)
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cmaxx[1:to[i]] == 0))
  return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc.
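One way that switch might look is sketched below. The name maxdrawup is hypothetical (it is not part of tseries), and the input vector is a toy example; the function simply mirrors maxdrawdown() by measuring the rise from the running minimum.

```r
# Hypothetical mirror image of maxdrawdown(): largest rise from a running minimum
maxdrawup <- function(x) {
  if (NCOL(x) > 1) stop("x is not a vector or univariate time series")
  if (anyNA(x)) stop("NAs in x")
  rise <- x - cummin(x)        # height above the lowest value seen so far
  mdu <- max(rise)
  to <- which(rise == mdu)     # where the largest rise is reached
  # the last point before 'to' that sat on the running minimum
  from <- vapply(to, function(k) max(which(rise[1:k] == 0)), numeric(1))
  list(maxdrawup = mdu, from = from, to = to)
}

res <- maxdrawup(c(5, 3, 4, 1, 2, 6, 4))
res$from   # 4: the bottom of the dip
res$to     # 6: the end of the recovery
```

Here from points at the lowest point of the dip, which is probably the index of interest when looking for dips.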

My first thought was something much cruder than filtering. Why not look for the big drops followed by long enough stable periods?
span.b <- 20
threshold.b <- 0.2
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))
span.f <- 10
threshold.f <- 0.05
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
abline(v = down)
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).

Related

Plotting an 'n' sized vector between a given function with given interval in R

Let me make my question clear because I don't know how to ask it properly (therefore I don't know if it was answered already or not), I will go through my whole problem:
There is a given function (which is the right side of an explicit first order differential equation if it matters):
f = function(t,y){
-2*y+3*t
}
Then there's a given interval from 'a' to 'b', this is the range the function is calculated in with 'n' steps, so the step size in the interval (dt) is:
dt=abs(a-b)/n
In this case 'a' is always 0 and 'b' is always positive, so 'b' is always greater than 'a' but I tried to be generic.
The initial condition:
yt0=y0
The calculation that determines the vector:
yt=vector("numeric",n)
for (i in 1:(n-1))
{
yt[1]=f(0,yt0)*dt+yt0
yt[i+1]=(f(dt*i,yt[i]))*dt+yt[i]
}
The created vector is 'n' long, but this is an approximate solution to the differential equation between the interval ranging from 'a' to 'b'. And here comes my problem:
When I try plotting it alongside the exact solution (using deSolve), it is not accurate. The values of the vector are accurate, but it does not know that these values belong to an approximate function that's between the interval range 'a' to 'b' .
That's why the graphs of the exact and approximate solution are not matching at all. I feel pretty burnt out, so I might not describe my issue properly, but is there a solution to this? To make it realise that its values are between 'a' and 'b' on the 'x' axis and not between '1' and 'n'?
I thank you all for the answers in advance!
The deSolve lines I used (regarding 'b' is greater than 'a'):
df = function(t, y, params) list(-2*y+3*t)
t = seq(a, b, length.out = n)
ddf = as.data.frame(ode(yt0, t, df, parms=NULL))
I tried to reconstruct the comparison between an "approximate" solution using a loop (which is in fact the Euler method) and a solution from the deSolve package. deSolve uses the lsoda solver by default, which is more precise than Euler's method, but it is of course also an approximation (the default relative and absolute tolerances are 1e-6).
As the question missed some concrete values and the plot functions, it was not clear where the original problem was, but the following example may help to re-formulate the question. I assume that the problem may be confusion between t (absolute time) and dt between the two approaches. Compare the lines marked as "original code" with the "suggestion":
library(deSolve)
f = function(t, y){
-2 * y + 3 * t
}
## some values
y0 <- 0.1
a <- 3
b <- 5
n <- 100
## Euler method using a loop
dt <- abs(a-b)/n
yt <- vector("numeric", n)
yt[1] <- f(a, y0) * dt + y0 # first step, written before the loop; it starts at t = a
for (i in 1:(n-1)) {
  #yt[i+1] = (f( dt * i, yt[i])) * dt + yt[i] # original code
  yt[i+1] <- (f(a + dt * i, yt[i])) * dt + yt[i] # suggestion
}
## Lsoda integration with package deSolve
df <- function(t, y, params) list(-2*y + 3*t)
t <- seq(a, b, length.out = n)
ddf = as.data.frame(ode(y0, t, df, parms=NULL))
## Plot of both solutions
plot(ddf, type="l", lwd=5, col="orange", ylab="y", las=1)
lines(t, yt, lwd=2, lty="dashed", col="blue")
legend("topleft", c("deSolve", "for loop"),
lty=c("solid", "dashed"), lwd=c(5, 2), col=c("orange", "blue"))
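As a cross-check that needs no extra package, the Euler loop can also be written with an explicit time grid and compared with the closed-form solution of y' = -2y + 3t. This is only a sketch: the values of a, b, n and y0 are illustrative, and the vector is given n + 1 entries so that yt[1] holds the initial condition at t = a.

```r
f <- function(t, y) -2 * y + 3 * t

## illustrative values, not taken from the question
a <- 0; b <- 2; n <- 200; y0 <- 1

dt <- (b - a) / n
tt <- seq(a, b, length.out = n + 1)   # n + 1 time points, tt[1] = a, tt[n + 1] = b
yt <- numeric(n + 1)
yt[1] <- y0                           # the initial condition sits at t = a
for (i in 1:n) {
  yt[i + 1] <- yt[i] + dt * f(tt[i], yt[i])
}

## closed-form solution with y(a) = y0, for comparison
exact <- 1.5 * tt - 0.75 + (y0 - 1.5 * a + 0.75) * exp(-2 * (tt - a))

## plotting against tt (not against the index) puts both curves on [a, b]
plot(tt, exact, type = "l", lwd = 5, col = "orange", xlab = "t", ylab = "y")
lines(tt, yt, lwd = 2, lty = "dashed", col = "blue")

max(abs(yt - exact))                  # largest deviation of the Euler solution
```

The key point for the original question is the first argument of plot() and lines(): passing the time vector places the values between a and b rather than between 1 and n.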

Find local minimum in a vector with r

Taking the ideas from the following links:
the local minimum between the two peaks
How to explain ...
I am looking for the local minimum or minima, while avoiding functions already written for this purpose [local or global max / min].
Our progress:
#DATA
simulate <- function(lambda=0.3, mu=c(0, 4), sd=c(1, 1), n.obs=10^5) {
x1 <- rnorm(n.obs, mu[1], sd[1])
x2 <- rnorm(n.obs, mu[2], sd[2])
return(ifelse(runif(n.obs) < lambda, x1, x2))
}
data <- simulate()
hist(data)
d <- density(data)
#
#https://stackoverflow.com/a/25276661/8409550
##Since the x-values are equally spaced, we can estimate dy using diff(d$y)
d$x[which.min(abs(diff(d$y)))]
#With our data we did not obtain the expected value
#
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
#we check
#1
optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
optimize(approxfun(d$x,d$y),interval=c(0,4),maximum = TRUE)$maximum
#2
tp <- pastecs::turnpoints(d$y)
summary(tp)
ind <- (1:length(d$y))[extract(tp, no.tp = FALSE, peak = TRUE, pit = TRUE)]
d$x[ind[2]]
d$x[ind[1]]
d$x[ind[3]]
My questions and request for help:
Why did this command line fail to give the expected value:
d$x[which.min(abs(diff(d$y)))]
Is it possible to eliminate the need to add one to the index in these command lines:
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
How to get the optimize function to return the two expected maximum values?
Question 1
The answer to the first question is straightforward. The line d$x[which.min(abs(diff(d$y)))] asks for the x value at which there was the smallest change in y between two consecutive points. The answer is that this happened at the extreme right of the plot where the density curve is essentially flat:
which.min(abs(diff(d$y)))
#> [1] 511
length(abs(diff(d$y)))
#> [1] 511
This is not only smaller than the difference at your local maxima /minima points; it is orders of magnitude smaller. Let's zoom in to the peak value of d$y, including only the peak and the point on each side:
which.max(d$y)
#> [1] 324
plot(d$x[323:325], d$y[323:325])
We can see that the smallest difference between two consecutive points is around 0.00005, or 5e-5. Now look at the end of the plot where it is flattest:
plot(d$x[510:512], d$y[510:512])
The difference is about 1e-7, which is why this is the flattest point.
Question 2
The answer to your second question is "no, not really". You are taking a double diff, which is two elements shorter than x, and if x is n elements long, a double diff will correspond to elements 2 to (n - 1) in x. You can remove the +1 from the index, but you will have an off-by-one error if you do that. If you really wanted to, you could concatenate dummy zeros at each stage of the diff, like this:
d$x[which(c(0, diff(sign(diff(c(d$y, 0))))) > 0)]
which gives the same result, but this is longer, harder to read and harder to justify, so why would you?
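A tiny worked example may make the index shift easier to see. The vector v is made up for illustration; each entry of the double diff describes one interior point of v, which is why one is added back.

```r
v <- c(3, 1, 2, 5, 4, 6)

turn <- diff(sign(diff(v)))   # length(v) - 2 entries, one per interior point
turn                          # 2 0 -2 2

pits  <- which(turn > 0) + 1  # shift back onto the indices of v
peaks <- which(turn < 0) + 1

v[pits]                       # 1 4
v[peaks]                      # 5
```

Dropping the + 1 would report positions 1 and 4 as pits, i.e. the values 3 and 5, which are clearly not minima.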
Question 3
The answer to the third question is that you could use the "pit" as the dividing point between the minimum and maximum value of d$x to find the two "peaks". If you really want a single call to get both at once, you could do it inside an sapply:
pit <- optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
peaks <- sapply(1:2, function(i) {
optimize(approxfun(d$x, d$y),
interval = c(min(d$x), pit, max(d$x))[i:(i + 1)],
maximum = TRUE)$maximum
})
pit
#> [1] 1.691798
peaks
#> [1] -0.02249845 3.99552521

Identify all local extrema of a fitted smoothing spline via R function 'smooth.spline'

I have a 2-dimensional data set.
I use R's smooth.spline function to smooth my point graph, following an example from this page:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.smooth.spline.html
So that I get the spline graph similar to the green line on this picture
I'd like to know the X values, where the first derivative of the smoothing spline equals zero (to determine exact minimum or maximum).
My problem is that my initial dataset (or a dataset that I could auto-generate) to feed into the predict() function does not contain such exact X values that correspond to the smoothing spline extrema.
How can I find such X values?
Here is the picture of the first derivative of the green spline line above
But the X coordinates of the extrema are still not exact.
My approximate R script to generate the pictures looks like the following
sp1 <- smooth.spline(df)
pred.prime <- predict(sp1, deriv=1)
pred.second <- predict(sp1, deriv=2)
d1 <- data.frame(pred.prime)
d2 <- data.frame(pred.second)
dfMinimums <- d1[abs(d1$y) < 1e-4, c('x','y')]
I think that there are two problems here.
You are using the original x-values and they are spaced too far apart AND
Because of the wide spacing of the x's, your threshold for where you consider the derivative "close enough" to zero is too high.
Here is basically your code but with many more x values and requiring smaller derivatives. Since you do not provide any data, I made a coarse approximation to it that should suffice for illustration.
## Coarse approximation of your data
x = runif(300, 0,45000)
y = sin(x/5000) + sin(x/950)/4 + rnorm(300, 0,0.05)
df = data.frame(x,y)
sp1 <- smooth.spline(df)
## Spline code
Sx = seq(0,45000,10)
pred.spline <- predict(sp1, Sx)
d0 <- data.frame(pred.spline)
pred.prime <- predict(sp1, Sx, deriv=1)
d1 <- data.frame(pred.prime)
Mins = which(abs(d1$y) < mean(abs(d1$y))/150)
plot(df, pch=20, col="navy")
lines(sp1, col="darkgreen")
points(d0[Mins,], pch=20, col="red")
The extrema look pretty good.
plot(d1, type="l")
points(d1[Mins,], pch=20, col="red")
The points identified look like zeros of the derivative.
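If a threshold on the derivative feels too arbitrary, another option is to bracket each zero of the derivative on a grid and then refine it with uniroot(). The sketch below uses base R only and a simplified synthetic data set; the grid size of 2000 and the helper name dspline are assumptions for illustration.

```r
set.seed(42)
x <- sort(runif(300, 0, 45000))
y <- sin(x / 5000) + rnorm(300, 0, 0.05)   # simplified stand-in for the data
sp <- smooth.spline(x, y)

## first derivative of the fitted spline as an ordinary function of x
dspline <- function(x0) predict(sp, x0, deriv = 1)$y

## bracket every root by looking for sign changes on a grid ...
grid <- seq(min(x), max(x), length.out = 2000)
br <- which(diff(sign(dspline(grid))) != 0)

## ... then refine each bracket numerically
extrema <- vapply(br, function(i) {
  uniroot(dspline, c(grid[i], grid[i + 1]), tol = 1e-9)$root
}, numeric(1))

## the sign of the second derivative tells minima from maxima
kind <- ifelse(predict(sp, extrema, deriv = 2)$y > 0, "min", "max")
data.frame(x = extrema, y = predict(sp, extrema)$y, kind = kind)
```

Each bracket yields exactly one x value, so there is no cluster of near-zero points to clean up afterwards.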
You can use my R package SplinesUtils: https://github.com/ZheyuanLi/SplinesUtils, which can be installed by
devtools::install_github("ZheyuanLi/SplinesUtils")
The functions to use are SmoothSplineAsPiecePoly and solve. I will just use the example from the documentation.
library(SplinesUtils)
## a toy dataset
set.seed(0)
x <- 1:100 + runif(100, -0.1, 0.1)
y <- poly(x, 9) %*% rnorm(9)
y <- y + rnorm(length(y), 0, 0.2 * sd(y))
## fit a smoothing spline
sm <- smooth.spline(x, y)
## coerce "smooth.spline" object to "PiecePoly" object
oo <- SmoothSplineAsPiecePoly(sm)
## plot the spline
plot(oo)
## find all stationary / saddle points
xs <- solve(oo, deriv = 1)
#[1] 3.791103 15.957159 21.918534 23.034192 25.958486 39.799999 58.627431
#[8] 74.583000 87.049227 96.544430
## predict the "PiecePoly" at stationary / saddle points
ys <- predict(oo, xs)
#[1] -0.92224176 0.38751847 0.09951236 0.10764884 0.05960727 0.52068566
#[7] -0.51029209 0.15989592 -0.36464409 0.63471723
points(xs, ys, pch = 19)
One caveat I found in @G5W's implementation is that it sometimes returns several records clustered around an extremum instead of a single one. They cannot be seen on the diagram, since they effectively fall on one point.
The following snippet (borrowed from elsewhere) keeps, for each extremum, only the point with the smallest absolute first derivative:
library(tidyverse)
df2 <- df %>%
group_by(round(y, 4)) %>%
filter(abs(d1) == min(abs(d1))) %>%
ungroup() %>%
select(-5)
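The same clean-up can be sketched in base R by grouping flagged grid indices that sit next to each other. The idx and deriv values below are invented for illustration; in practice they would be the flagged indices and the derivative values at those indices.

```r
## hypothetical flagged grid indices and the derivative at each of them
idx   <- c(157, 158, 159, 471, 472, 785, 786, 787, 788)
deriv <- c(2e-6, -1e-7, -3e-6, 4e-6, -5e-7, 6e-6, 3e-6, 2e-7, -4e-6)

## a new cluster starts whenever consecutive flagged indices are not adjacent
cluster <- cumsum(c(TRUE, diff(idx) > 1))

## within each cluster keep the position whose derivative is closest to zero
best <- tapply(seq_along(idx), cluster, function(k) k[which.min(abs(deriv[k]))])

idx[as.vector(best)]   # 158 472 787
```

Grouping by adjacency avoids relying on rounded y values, which could merge two different extrema that happen to have the same height.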

Error in rollapply: subscript out of bounds

I'd first like to describe my problem:
What I want to do is count the number of price spikes in a 24-hour window, given half-hourly data.
I have seen all Stackoverflow posts like e.g. this one:
Rollapply for time series
(If there are more relevant ones, please let me know ;) )
As I cannot and probably also should not upload my data, here's a minimal example:
I simulate a random variable, convert it to an xts object, and use a user defined function to detect "spikes" (of course pretty ridiculous in this case, but illustrates the error).
library(xts)
##########Simulate y as a random variable
y <- rnorm(n=100)
##########Add a date variable so i can convert it to a xts object later on
yDate <- as.Date(1:100)
##########bind both variables together and convert to a xts object
z <- cbind(yDate,y)
z <- xts(x=z, order.by=yDate)
##########use the rollapply function on the xts object:
x <- rollapply(z, width=10, FUN=mean)
The function works as it is supposed to: it takes the 10 preceding values and calculates the mean.
Then I defined my own function to find peaks: a peak is a local maximum (higher than the m points around it) AND is at least as big as the mean of the time series + h.
This leads to:
find_peaks <- function (x, m,h){
  shape <- diff(sign(diff(x, na.pad = FALSE)))
  pks <- sapply(which(shape < 0), FUN = function(i){
    z <- i - m + 1
    z <- ifelse(z > 0, z, 1)
    w <- i + m + 1
    w <- ifelse(w < length(x), w, length(x))
    if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])&x[i+1]>mean(x)+h) return(i + 1) else return(numeric(0))
  })
  pks <- unlist(pks)
  pks
}
This works fine. Back to the example:
plot(yDate,y)
#Is supposed to find the points which are higher than 3 points around them
#and higher than the average:
#Does so, so works.
points(yDate[find_peaks(y,3,0)],y[find_peaks(y,3,0)],col="red")
However, using the rollapply() function leads to:
x <- rollapply(z,width = 10,FUN=function(x) find_peaks(x,3,0))
#Error in `[.xts`(x, c(z:i, (i + 2):w)) : subscript out of bounds
I first thought that the error might occur because the m parameter leads to a negative index for the first points. Sadly, setting m to zero does not change the error.
I have tried to trace this error too, but could not find the source.
Can anyone help me out here?
Edit: A picture of spikes on the Australian electricity market. find_peaks(20,50) determines the red points to be spikes; find_peaks(0,50) additionally finds the blue ones to be spikes (therefore, the second parameter h is important, because the blue points are clearly not what we want to analyse when we talk about spikes).
I'm still not entirely sure what it is that you are after. On the assumption that given a window of data you want to identify whether its center is greater than the rest of the window at the same time as being greater than the mean of the window + h then you could do the following:
peakfinder = function(x,h = 0){
xdat = as.numeric(x)
meandat = mean(xdat)
center = xdat[ceiling(length(xdat)/2)]
ifelse(all(center >= xdat) & center >= (meandat + h),center,NA)
}
y <- rnorm(n=100)
z = xts(y, order.by = as.Date(1:100))
plot(z)
points(rollapply(z,width = 7, FUN = peakfinder, align = "center"), col = "red", pch = 19)
Although it would appear to me that if the center point is greater than its neighbours, it is necessarily greater than the local mean too, so this part of the function would not be necessary if h >= 0. If you want to use the global mean of the time series, just substitute the calculation of meandat with the pre-calculated global mean passed as an argument to peakfinder.
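A minimal sketch of that global-mean variant follows. To stay free of package dependencies it builds the rolling windows with base embed() rather than rollapply(); the function name peakfinder_global, the width of 7 and h = 0.5 are illustrative assumptions.

```r
## variant of peakfinder() that compares against a mean supplied by the caller
peakfinder_global <- function(x, global_mean, h = 0) {
  xdat <- as.numeric(x)
  center <- xdat[ceiling(length(xdat) / 2)]
  if (all(center >= xdat) && center >= global_mean + h) center else NA
}

set.seed(1)
y <- rnorm(100)
width <- 7

## embed() lays out every window of `width` consecutive values as one row
## (in reverse order, which does not move the center of an odd-width window)
windows <- embed(y, width)
peaks <- apply(windows, 1, peakfinder_global, global_mean = mean(y), h = 0.5)

## peaks[j] belongs to the window centered on y[j + 3]
which(!is.na(peaks)) + (width - 1) / 2
```

With a global mean, the h threshold does real work even when h >= 0, because a local maximum can still lie well below the overall level of the series.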

R : how to use variables for vector indices?

I'm a new user of R, trying to generate a k-moving-average graph of a sine function with random noise (in the range [-0.5, +0.5]).
So what I have to do is calculate the mean of (2*k+1) consecutive elements in the noisy sine vector. However, the line marked "HELP" below does not work as I expected... :(
The code seems to calculate the mean of 1 through (i-k)th element.
What's wrong with it? Help please!
set.seed(1)
x = seq(0,2*pi,pi/50)
sin_graph <- sin(x)
noise <- runif(101, -0.5, 0.5)
sin_noise <- sin_graph + noise
plot(x,sin_noise, ylim=c(-2,2))
lines(x,sin_graph, col="red")
k<-1
MA<-0
while (k<=1){
  i <- k+1
  MA_vector <- rep(NA, times=101)
  while (i<=101-k){
    MA_vector[i] <- mean(sin_noise[i-k:i+k]) #HELP!
    i <- i+1
  }
  print(MA_vector)
  plot(x, MA_vector, ylim=c(-2,2))
  lines(x,sin_graph, col="red")
  k<-k+1
}
As it stands, it's subtracting the vector k:i from i and then adding k, because : takes precedence over the arithmetic operators. With brackets (see the code below), it evaluates i-k and i+k first and creates a vector running from the first result to the second. With that change I get a smooth function.
MA_vector[i] <- mean(sin_noise[(i-k):(i+k)])
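The precedence issue is easy to check at the console with small numbers. The second half of the sketch shows one loop-free alternative using stats::filter(), which is a suggestion of mine rather than part of the original code.

```r
i <- 5; k <- 1

i - k:i + k        # parsed as (i - (k:i)) + k, giving 5 4 3 2 1
(i - k):(i + k)    # the intended window, giving 4 5 6

## centred moving average of width 2k + 1 without an explicit loop
set.seed(1)
x <- seq(0, 2 * pi, pi / 50)
sin_noise <- sin(x) + runif(length(x), -0.5, 0.5)
ma <- stats::filter(sin_noise, rep(1 / (2 * k + 1), 2 * k + 1), sides = 2)

ma[2]              # same as mean(sin_noise[1:3]); the two end points are NA
```

The namespace prefix on filter() guards against dplyr masking it when that package is loaded.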
