Error in rollapply: subscript out of bounds - r
I'd first like to describe my problem:
What I want to do is calculate the number of price spikes in a 24-hour window, given that I have half-hourly data.
I have seen all Stackoverflow posts like e.g. this one:
Rollapply for time series
(If there are more relevant ones, please let me know ;) )
As I cannot and probably also should not upload my data, here's a minimal example:
I simulate a random variable, convert it to an xts object, and use a user defined function to detect "spikes" (of course pretty ridiculous in this case, but illustrates the error).
library(xts)
##########Simulate y as a random variable
y <- rnorm(n=100)
##########Add a date variable so I can convert it to an xts object later on
yDate <- as.Date(1:100)
##########bind both variables together and convert to a xts object
z <- cbind(yDate,y)
z <- xts(x=z, order.by=yDate)
##########use the rollapply function on the xts object:
x <- rollapply(z, width=10, FUN=mean)
The function works as it is supposed to: it takes the 10 preceding values and calculates the mean.
Then I defined my own function to find peaks: a peak is a local maximum (higher than the m points around it) AND is at least as big as the mean of the time series + h.
This leads to:
find_peaks <- function (x, m,h){
shape <- diff(sign(diff(x, na.pad = FALSE)))
pks <- sapply(which(shape < 0), FUN = function(i){
z <- i - m + 1
z <- ifelse(z > 0, z, 1)
w <- i + m + 1
w <- ifelse(w < length(x), w, length(x))
if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])&x[i+1]>mean(x)+h) return(i + 1) else return(numeric(0))
})
pks <- unlist(pks)
pks
}
And works fine: Back to the example:
plot(yDate,y)
#Is supposed to find the points which are higher than 3 points around them
#and higher than the average:
#Does so, so works.
points(yDate[find_peaks(y,3,0)],y[find_peaks(y,3,0)],col="red")
However, using the rollapply() function leads to:
x <- rollapply(z,width = 10,FUN=function(x) find_peaks(x,3,0))
#Error in `[.xts`(x, c(z:i, (i + 2):w)) : subscript out of bounds
At first I thought the error might occur because, for the first points, the m parameter could lead to a negative index. Sadly, setting m to zero does not change the error.
I have tried to trace this error too, but do not find the source.
Can anyone help me out here?
Edit: A picture of spikes (figure: spikes on the Australian electricity market). find_peaks(20,50) determines the red points to be spikes; find_peaks(0,50) additionally finds the blue ones to be spikes. The second parameter h is therefore important, because the blue points are clearly not what we want to analyse when we talk about spikes.
I'm still not entirely sure what it is that you are after. On the assumption that given a window of data you want to identify whether its center is greater than the rest of the window at the same time as being greater than the mean of the window + h then you could do the following:
peakfinder = function(x,h = 0){
xdat = as.numeric(x)
meandat = mean(xdat)
center = xdat[ceiling(length(xdat)/2)]
ifelse(all(center >= xdat) & center >= (meandat + h),center,NA)
}
y <- rnorm(n=100)
z = xts(y, order.by = as.Date(1:100))
plot(z)
points(rollapply(z,width = 7, FUN = peakfinder, align = "center"), col = "red", pch = 19)
Although it would appear to me that if the center point is greater than its neighbours it is necessarily at least as large as the local mean too, so this part of the function only matters if h > 0. If you want to use the global mean of the time series, just substitute the calculation of meandat with the pre-calculated global mean passed as an argument to peakfinder.
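As a rough illustration of that last suggestion, here is a minimal sketch in which the mean is supplied by the caller. The name peakfinder_global and the argument global_mean are hypothetical (they do not appear in the answer above), and the windows are built with plain indexing instead of rollapply so that the snippet runs in base R.

```r
# Hypothetical variant of peakfinder(): the mean is passed in by the caller
peakfinder_global <- function(x, global_mean, h = 0) {
  xdat <- as.numeric(x)                       # drop xts/zoo attributes
  center <- xdat[ceiling(length(xdat) / 2)]   # middle element of the window
  if (all(center >= xdat) && center >= global_mean + h) center else NA_real_
}

set.seed(42)
y <- rnorm(100)
width <- 7

# Slide a window of `width` points over y and test the center of each window
starts <- seq_len(length(y) - width + 1)
peaks <- sapply(starts, function(s) {
  peakfinder_global(y[s:(s + width - 1)], global_mean = mean(y), h = 0.5)
})

n_spikes <- sum(!is.na(peaks))                # number of windows whose center is a spike
n_spikes
```

With xts data the same function can presumably be handed to rollapply, which forwards extra arguments to FUN, for example rollapply(z, width = 7, FUN = peakfinder_global, global_mean = mean(z), h = 0.5, align = "center").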
Related
Plotting an 'n' sized vector between a given function with given interval in R
Let me make my question clear, because I don't know how to ask it properly (and therefore I don't know whether it has been answered already). I will go through my whole problem.
There is a given function (which is the right-hand side of an explicit first-order differential equation, if it matters):
f = function(t,y){
  -2*y+3*t
}
Then there's a given interval from 'a' to 'b'. This is the range the function is calculated in with 'n' steps, so the step size in the interval (dt) is:
dt=abs(a-b)/n
In this case 'a' is always 0 and 'b' is always positive, so 'b' is always greater than 'a', but I tried to be generic.
The initial condition:
yt0=y0
The calculation that determines the vector:
yt=vector("numeric",n)
for (i in 1:(n-1))
{
  yt[1]=f(0,yt0)*dt+yt0
  yt[i+1]=(f(dt*i,yt[i]))*dt+yt[i]
}
The created vector is 'n' long, and it is an approximate solution to the differential equation on the interval from 'a' to 'b'. And here comes my problem: when I try plotting it alongside the exact solution (using deSolve), it is not accurate. The values of the vector are accurate, but the plot does not know that these values belong to an approximate function on the interval from 'a' to 'b'. That's why the graphs of the exact and approximate solution do not match at all. I feel pretty burnt out, so I might not be describing my issue properly, but is there a solution to this? How do I make it realise that its values lie between 'a' and 'b' on the 'x' axis and not between '1' and 'n'? I thank you all for the answers in advance!
The deSolve lines I used (given that 'b' is greater than 'a'):
df = function(t, y, params) list(-2*y+3*t)
t = seq(a, b, length.out = n)
ddf = as.data.frame(ode(yt0, t, df, parms=NULL))
I tried to reconstruct the comparison between an "approximate" solution using a loop (which is in fact the Euler method) and a solution with package deSolve. deSolve uses the lsoda solver by default, which is more precise than Euler's method, but it is of course also an approximation (default relative and absolute tolerance set to 1e-6). As the question was missing some concrete values and the plot functions, it was not clear where the original problem was, but the following example may help to re-formulate the question. I assume that the problem may be confusion between t (absolute time) and dt between the two approaches. Compare the lines marked as "original code" with the "suggestion":
library(deSolve)
f = function(t, y){ -2 * y + 3 * t }
## some values
y0 <- 0.1
a <- 3
b <- 5
n <- 100
## Euler method using a loop
dt <- abs(a-b)/n
yt <- vector("numeric", n)
yt[1] <- f(0, y0) * dt + y0 # written before the loop
for (i in 1:(n-1)) {
  #yt[i+1] = (f( dt * i, yt[i])) * dt + yt[i] # original code
  yt[i+1] <- (f(a + dt * i, yt[i])) * dt + yt[i] # suggestion
}
## Lsoda integration with package deSolve
df <- function(t, y, params) list(-2*y + 3*t)
t <- seq(a, b, length.out = n)
ddf = as.data.frame(ode(y0, t, df, parms=NULL))
## Plot of both solutions
plot(ddf, type="l", lwd=5, col="orange", ylab="y", las=1)
lines(t, yt, lwd=2, lty="dashed", col="blue")
legend("topleft", c("deSolve", "for loop"), lty=c("solid", "dashed"), lwd=c(5, 2), col=c("orange", "blue"))
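For a check that does not depend on deSolve, the Euler loop can also be compared against the closed-form solution of y' = -2y + 3t, which is y(t) = 1.5 t - 0.75 + (y0 - 1.5 a + 0.75) exp(-2 (t - a)) for the initial condition y(a) = y0. The following is only a sketch under assumptions: it reuses the example values from above, and unlike the code above it stores the initial value as the first element, so the grid has n + 1 points and every value is paired with its own time.

```r
f <- function(t, y) -2 * y + 3 * t

# Illustrative values (assumptions, same as in the example above)
y0 <- 0.1; a <- 3; b <- 5; n <- 100

dt <- (b - a) / n
t  <- a + dt * (0:n)           # n + 1 grid points, t[1] = a and t[n + 1] = b
y_euler <- numeric(n + 1)
y_euler[1] <- y0               # the initial condition belongs to t = a
for (k in 1:n) {
  # explicit Euler step from t[k] to t[k + 1]
  y_euler[k + 1] <- y_euler[k] + dt * f(t[k], y_euler[k])
}

# Closed-form solution of y' = -2y + 3t with y(a) = y0
y_exact <- 1.5 * t - 0.75 + (y0 - 1.5 * a + 0.75) * exp(-2 * (t - a))

max(abs(y_euler - y_exact))    # largest deviation on the grid

# Plot against t, not against the index:
# plot(t, y_exact, type = "l"); lines(t, y_euler, lty = "dashed")
```

Plotting both vectors against t is what places the approximate solution on the interval from a to b instead of on 1 to n.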
Find local minimum in a vector with r
Taking the ideas from the following links: the local minimum between the two peaks How to explain ... I am looking for the local minimum or minima, avoiding the use of functions already created for this purpose [local or global max / min]. Our progress:
#DATA
simulate <- function(lambda=0.3, mu=c(0, 4), sd=c(1, 1), n.obs=10^5) {
  x1 <- rnorm(n.obs, mu[1], sd[1])
  x2 <- rnorm(n.obs, mu[2], sd[2])
  return(ifelse(runif(n.obs) < lambda, x1, x2))
}
data <- simulate()
hist(data)
d <- density(data)
#
#https://stackoverflow.com/a/25276661/8409550
##Since the x-values are equally spaced, we can estimate dy using diff(d$y)
d$x[which.min(abs(diff(d$y)))]
#With our data we did not obtain the expected value
#
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
#we check
#1
optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
optimize(approxfun(d$x,d$y),interval=c(0,4),maximum = TRUE)$maximum
#2
tp <- pastecs::turnpoints(d$y)
summary(tp)
ind <- (1:length(d$y))[extract(tp, no.tp = FALSE, peak = TRUE, pit = TRUE)]
d$x[ind[2]]
d$x[ind[1]]
d$x[ind[3]]
My questions and request for help:
Why did this command line fail?
d$x[which.min(abs(diff(d$y)))]
Is it possible to eliminate the need to add one to the index in these command lines?
d$x[which(diff(sign(diff(d$y)))>0)+1]#pit
d$x[which(diff(sign(diff(d$y)))<0)+1]#peak
How can I get the optimize function to return the two expected maximum values?
Question 1
The answer to the first question is straightforward. The line
d$x[which.min(abs(diff(d$y)))]
asks for the x value at which there was the smallest change in y between two consecutive points. The answer is that this happened at the extreme right of the plot, where the density curve is essentially flat:
which.min(abs(diff(d$y)))
#> [1] 511
length(abs(diff(d$y)))
#> [1] 511
This is not only smaller than the difference at your local maxima/minima points; it is orders of magnitude smaller. Let's zoom in to the peak value of d$y, including only the peak and the point on each side:
which.max(d$y)
#> [1] 324
plot(d$x[323:325], d$y[323:325])
We can see that the smallest difference is around 0.00005, or 5e-5, between two consecutive points. Now look at the end of the plot where it is flattest:
plot(d$x[510:512], d$y[510:512])
The difference is about 1e-7, which is why this is the flattest point.
Question 2
The answer to your second question is "no, not really". You are taking a double diff, which is two elements shorter than x, and if x is n elements long, a double diff will correspond to elements 2 to (n - 1) in x. You can remove the +1 from the index, but you will have an off-by-one error if you do that. If you really wanted to, you could concatenate dummy zeros at each stage of the diff, like this:
d$x[which(c(0, diff(sign(diff(c(d$y, 0))))) > 0)]
which gives the same result, but this is longer, harder to read and harder to justify, so why would you?
Question 3
The answer to the third question is that you could use the "pit" as the dividing point between the minimum and maximum value of d$x to find the two "peaks". If you really want a single call to get both at once, you could do it inside an sapply:
pit <- optimize(approxfun(d$x,d$y),interval=c(0,4))$minimum
peaks <- sapply(1:2, function(i) {
  optimize(approxfun(d$x, d$y),
           interval = c(min(d$x), pit, max(d$x))[i:(i + 1)],
           maximum = TRUE)$maximum
})
pit
#> [1] 1.691798
peaks
#> [1] -0.02249845 3.99552521
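The index bookkeeping behind Question 2 may be easier to see on a tiny vector. This is a small sketch with made-up values (the vector v is not from the question); it shows why entry k of the double diff describes element k + 1 of the input.

```r
v <- c(1, 3, 2, 0, 4, 5, 3)

slope_sign <- sign(diff(v))        # length(v) - 1 entries: 1 -1 -1 1 1 -1
turn       <- diff(slope_sign)     # length(v) - 2 entries; turn[k] describes v[k + 1]

peaks <- which(turn < 0) + 1       # 2 6
pits  <- which(turn > 0) + 1       # 4

v[peaks]                           # 3 5
v[pits]                            # 0
```

Dropping the + 1 would report positions 1 and 5 as peaks, which are the points just before the actual turning points.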
vectorize computation of difference of moving averages in R?
I have a vector, and for each point in that vector I would like to compute the average of some range of the points immediately before that point minus the average of some range of the points immediately after that point. I have done this with a for loop, because filter does not seem to have an option to apply exclusively to points after a vector point (the sides parameter only accepts 1 or 2), and because I did not know how to shoehorn this into an apply statement, since I need a function that operates on each point using its position within the vector and not just its own value. Can someone show me the way? Here's how I did it with a for loop:
x = rep(c(1,1,1,1,1,10), 20)
x = x + 100
x = x - c(1:length(x))
lookahead = 4
y = x
for(i in (lookahead):(length(x)-lookahead)) {
  y[i] = mean(x[(i-lookahead):i]) - mean(x[i:(i+lookahead)])
}
plot(x)
lines(y, col="red")
You can see from the plot what the objective is: to identify spikes (but no, I don't want to be told about other ways to find spikes; I want to use my simple boxcar moving average method). There must be a better way to calculate this vector. Thank you for any suggestions.
P.S. I see someone wants to flag this as a duplicate of "Calculating moving average in R". However, my question is different, as the answers to that question (use roll_mean or filter) don't apply here without modification. If there is a way to use roll_mean or filter, I can't tell from the docs and would appreciate someone telling me how I can use either of these to calculate forward-looking moving averages instead of backward-looking moving averages. Thanks again.
The problem with your procedure is that it starts at i=4 and subsets x[0:4], where R drops the 0 index automatically.
y1 = RcppRoll::roll_mean(x, 5)
y1 = c(rep(NA, 4), y1) - c(y1, rep(NA, 4)) # you can use y1 = lag(y1, 4) - y1 instead if you have dplyr
# fill NA positions
y1[1:4]=x[1:4]
y1[116:120]=x[116:120]
y1 differs from y only at positions 4 and 116, where your loop is problematic. If you have no access to RcppRoll, you can use embed (faster than zoo::rollmean).
y1 = rowMeans(embed(x, 5)) #slightly slower than roll_mean
y1 = c(rep(NA, 4), y1) - c(y1, rep(NA, 4)) # you can use y1 = lag(y1, 4) - y1 instead if you have dplyr
# fill NA positions
y1[1:4]=x[1:4]
y1[116:120]=x[116:120]
OK. I have one solution; however, I've modified your code for the loop to go from (lookahead+1):(length(x)-lookahead). This is so that the very first mean is a mean of 5 values like all the rest. Calculate a vector of averages of 5 values:
lastIndexInY <- length(x)-lookahead
Y_ave <- (x[ 1:lastIndexInY ] + x[ 1:lastIndexInY +1] + x[ 1:lastIndexInY +2] + x[ 1:lastIndexInY +3]+ x[ 1:lastIndexInY +4] )/5
Then your result y is the same as:
y_vec <- c(x[1:4], Y_ave[1:(length(Y_ave)-4)] - Y_ave[5:length(Y_ave) ] , x[-3:0 + length(x)] )
all(y - y_vec == 0 )
[1] TRUE
(Are you sure you need to retain the first 4 and last 4 values of x?)
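Since the question asks specifically about filter, here is a hedged sketch of how a forward-looking average can be obtained from it: a backward-looking mean with sides = 1, shifted by the window length minus one. The variable names (back, fwd, y_filter) are made up for this illustration, and the first and last four positions are left as NA instead of being filled with x.

```r
x <- rep(c(1, 1, 1, 1, 1, 10), 20) + 100 - seq_len(120)
k <- 5                                   # points per average (lookahead + 1)

# Backward-looking mean: back[i] = mean(x[(i - k + 1):i]), NA for i < k
back <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

# The forward-looking mean at i equals the backward-looking mean at i + k - 1
fwd <- c(back[-seq_len(k - 1)], rep(NA, k - 1))

y_filter <- back - fwd                   # NA in the first and last k - 1 positions

# Reference: the explicit loop over the positions where both means exist
idx <- k:(length(x) - k + 1)
y_loop <- sapply(idx, function(i) mean(x[(i - k + 1):i]) - mean(x[i:(i + k - 1)]))
all.equal(y_filter[idx], y_loop)
```

Calling stats::filter with the namespace prefix avoids picking up dplyr's filter if that package happens to be attached.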
Hexbin: apply function for every bin
I would like to build a hexbin plot where, for every bin, the "ratio between class 1 and class 2 points falling into this bin" is plotted (either log or not).
library(hexbin)
x <- rnorm(10000)
y <- rnorm(10000)
h <- hexbin(x,y)
plot(h)
l <- as.factor(c( rep(1,2000), rep(2,8000) ))
Any suggestions on how to implement this? Is there a way to apply a function to every bin based on bin statistics?
@cryo111's answer has the most important ingredient: IDs = TRUE. After that it's just a matter of figuring out what you want to do with Inf's and how much you need to scale the ratios by to get integers that will produce a pretty plot.
library(hexbin)
library(data.table)
set.seed(1)
x = rnorm(10000)
y = rnorm(10000)
h = hexbin(x, y, IDs = TRUE)
# put all the relevant data in a data.table
dt = data.table(x, y, l = c(1,1,1,2), cID = h@cID)
# group by cID and calculate whatever statistic you like
# in this case, ratio of 1's to 2's,
# and then Inf's are set to be equal to the largest ratio
dt[, list(ratio = sum(l == 1)/sum(l == 2)), keyby = cID][,
   ratio := ifelse(ratio == Inf, max(ratio[is.finite(ratio)]), ratio)][,
   # scale up (I chose a scaling manually to get a prettier graph)
   # and convert to integer and change h
   as.integer(ratio*10)] -> h@count
plot(h)
You can determine the number of class 1 and class 2 points in each bin by
library(hexbin)
library(plyr)
x=rnorm(10000)
y=rnorm(10000)
#generate hexbin object with IDs=TRUE
#the object includes then a slot with a vector cID
#cID maps point (x[i],y[i]) to cell number cID[i]
HexObj=hexbin(x,y,IDs = TRUE)
#find count statistics for first 2000 points (class 1) and the rest (class 2)
CountDF=merge(count(HexObj@cID[1:2000]),
              count(HexObj@cID[2001:length(x)]),
              by="x",
              all=TRUE
)
#replace NAs by 0
CountDF[is.na(CountDF)]=0
#check if all points are included
sum(CountDF$freq.x)+sum(CountDF$freq.y)
But plotting them is another story. For instance, what if there are no class 2 points in one bin? The fraction is not defined then. In addition, as far as I understand, hexbin is just a two-dimensional histogram. As such, it counts the number of points that fall into a given bin. I do not think that it can handle non-integer data as in your case.
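The per-bin counting step can also be sketched in base R with table(), once a cell-ID vector is available. In the sketch below cID and cls are small made-up stand-ins for the cID slot of the hexbin object and for the class labels, and the "+ 1" smoothing is only one possible convention for avoiding undefined ratios, not something prescribed by hexbin.

```r
# Stand-ins for the cell IDs and the class labels (made-up values for illustration)
cID <- c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3)
cls <- factor(c(1, 2, 1, 1, 2, 2, 2, 2, 2, 2), levels = c(1, 2))

counts <- table(cID, cls)                 # one row per bin, one column per class
ratio  <- counts[, "1"] / counts[, "2"]   # Inf if a bin has no class 2 points
ratio

# Assumed convention: add 1 to both counts so that the log ratio stays finite
log_ratio <- log((counts[, "1"] + 1) / (counts[, "2"] + 1))
log_ratio
```

For this toy input the ratios are 1, 2 and 0 for bins 1, 2 and 3.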
Detecting dips in a 2D plot
I need to automatically detect dips in a 2D plot, like the regions marked with red circles in the figure below. I'm only interested in the "main" dips, meaning the dips have to span a minimum length in the x axis. The number of dips is unknown, i.e., different plots will contain different numbers of dips. Any ideas? Update: As requested, here's the sample data, together with an attempt to smooth it using median filtering, as suggested by vines. Looks like I need now a robust way to approximate the derivative at each point that would ignore the little blips that remain in the data. Is there any standard approach? y <- c(0.9943,0.9917,0.9879,0.9831,0.9553,0.9316,0.9208,0.9119,0.8857,0.7951,0.7605,0.8074,0.7342,0.6374,0.6035,0.5331,0.4781,0.4825,0.4825,0.4879,0.5374,0.4600,0.3668,0.3456,0.4282,0.3578,0.3630,0.3399,0.3578,0.4116,0.3762,0.3668,0.4420,0.4749,0.4556,0.4458,0.5084,0.5043,0.5043,0.5331,0.4781,0.5623,0.6604,0.5900,0.5084,0.5802,0.5802,0.6174,0.6124,0.6374,0.6827,0.6906,0.7034,0.7418,0.7817,0.8311,0.8001,0.7912,0.7912,0.7540,0.7951,0.7817,0.7644,0.7912,0.8311,0.8311,0.7912,0.7688,0.7418,0.7232,0.7147,0.6906,0.6715,0.6681,0.6374,0.6516,0.6650,0.6604,0.6124,0.6334,0.6374,0.5514,0.5514,0.5412,0.5514,0.5374,0.5473,0.4825,0.5084,0.5126,0.5229,0.5126,0.5043,0.4379,0.4781,0.4600,0.4781,0.3806,0.4078,0.3096,0.3263,0.3399,0.3184,0.2820,0.2167,0.2122,0.2080,0.2558,0.2255,0.1921,0.1766,0.1732,0.1205,0.1732,0.0723,0.0701,0.0405,0.0643,0.0771,0.1018,0.0587,0.0884,0.0884,0.1240,0.1088,0.0554,0.0607,0.0441,0.0387,0.0490,0.0478,0.0231,0.0414,0.0297,0.0701,0.0502,0.0567,0.0405,0.0363,0.0464,0.0701,0.0832,0.0991,0.1322,0.1998,0.3146,0.3146,0.3184,0.3578,0.3311,0.3184,0.4203,0.3578,0.3578,0.3578,0.4282,0.5084,0.5802,0.5667,0.5473,0.5514,0.5331,0.4749,0.4037,0.4116,0.4203,0.3184,0.4037,0.4037,0.4282,0.4513,0.4749,0.4116,0.4825,0.4918,0.4879,0.4918,0.4825,0.4245,0.4333,0.4651,0.4879,0.5412,0.5802,0.5126,0.4458,0.5374,0.4600,0.4600,0.4600,0.4600,0.3992,0.4879,0.4282,0.4333,0.3668,0
.3005,0.3096,0.3847,0.3939,0.3630,0.3359,0.2292,0.2292,0.2748,0.3399,0.2963,0.2963,0.2385,0.2531,0.1805,0.2531,0.2786,0.3456,0.3399,0.3491,0.4037,0.3885,0.3806,0.2748,0.2700,0.2657,0.2963,0.2865,0.2167,0.2080,0.1844,0.2041,0.1602,0.1416,0.2041,0.1958,0.1018,0.0744,0.0677,0.0909,0.0789,0.0723,0.0660,0.1322,0.1532,0.1060,0.1018,0.1060,0.1150,0.0789,0.1266,0.0965,0.1732,0.1766,0.1766,0.1805,0.2820,0.3096,0.2602,0.2080,0.2333,0.2385,0.2385,0.2432,0.1602,0.2122,0.2385,0.2333,0.2558,0.2432,0.2292,0.2209,0.2483,0.2531,0.2432,0.2432,0.2432,0.2432,0.3053,0.3630,0.3578,0.3630,0.3668,0.3263,0.3992,0.4037,0.4556,0.4703,0.5173,0.6219,0.6412,0.7275,0.6984,0.6756,0.7079,0.7192,0.7342,0.7458,0.7501,0.7540,0.7605,0.7605,0.7342,0.7912,0.7951,0.8036,0.8074,0.8074,0.8118,0.7951,0.8118,0.8242,0.8488,0.8650,0.8488,0.8311,0.8424,0.7912,0.7951,0.8001,0.8001,0.7458,0.7192,0.6984,0.6412,0.6516,0.5900,0.5802,0.5802,0.5762,0.5623,0.5374,0.4556,0.4556,0.4333,0.3762,0.3456,0.4037,0.3311,0.3263,0.3311,0.3717,0.3762,0.3717,0.3668,0.3491,0.4203,0.4037,0.4149,0.4037,0.3992,0.4078,0.4651,0.4967,0.5229,0.5802,0.5802,0.5846,0.6293,0.6412,0.6374,0.6604,0.7317,0.7034,0.7573,0.7573,0.7573,0.7772,0.7605,0.8036,0.7951,0.7817,0.7869,0.7724,0.7869,0.7869,0.7951,0.7644,0.7912,0.7275,0.7342,0.7275,0.6984,0.7342,0.7605,0.7418,0.7418,0.7275,0.7573,0.7724,0.8118,0.8521,0.8823,0.8984,0.9119,0.9316,0.9512) yy <- runmed(y, 41) plot(y, type="l", ylim=c(0,1), ylab="", xlab="", lwd=0.5) points(yy, col="blue", type="l", lwd=2)
EDITED: the function strips the regions to contain nothing but the lowest part, if wanted.
Actually, using the mean is easier than using the median. This allows you to find regions where the real values are continuously below the mean. The median is not smooth enough for an easy application. One example function to do this would be:
FindLowRegion <- function(x,n=length(x)/4,tol=length(x)/20,p=0.5){
  nx <- length(x)
  n <- 2*(n %/% 2) + 1
  # smooth out based on means
  sx <- rowMeans(embed(c(rep(NA,n/2),x,rep(NA,n/2)),n),na.rm=TRUE)
  # find which series are far from the mean
  rlesx <- rle((sx-x)>0)
  # construct start and end of regions
  int <- embed(cumsum(c(1,rlesx$lengths)),2)
  # which regions fulfill requirements
  id <- rlesx$values & rlesx$lengths > tol
  # Cut regions to be in general smaller than median
  regions <- apply(int[id, , drop=FALSE],1,function(i){
    i <- min(i):max(i)
    tmp <- x[i]
    id <- which(tmp < quantile(tmp,p))
    id <- min(id):max(id)
    i[id]
  })
  # return
  unlist(regions)
}
where n determines how many values are used to calculate the running mean, tol determines how many consecutive values should be lower than the running mean to talk about a low region, and p determines the cutoff used (as a quantile) for stripping the regions to their lowest part. When p=1, the complete lower region is shown.
The function is tweaked to work on data as you presented it, but the numbers might need to be adjusted a bit to work with other data.
This function returns a set of indices, which allows you to find the low regions. Illustrated with your y vector:
Lows <- FindLowRegion(y)
newx <- seq_along(y)
newy <- ifelse(newx %in% Lows,y,NA)
plot(y, col="blue", type="l", lwd=2)
lines(newx,newy,col="red",lwd=3)
This draws the series in blue with the detected low regions overlaid in red.
You have to smooth the graph in some way. Median filtration is quite useful for that purpose (see http://en.wikipedia.org/wiki/Median_filter). After smoothing, you will simply have to search for the minima, just as usual (i.e. search for the points where the 1st derivative switches from negative to positive).
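A minimal sketch of that recipe, using runmed() from base R on synthetic data (a cosine with a few isolated blips, chosen purely for illustration). One practical detail is an assumption of this sketch and not part of the answer: a median filter tends to leave exactly flat stretches, so zero differences are skipped before looking for the sign change, and each minimum is reported as the middle of its flat stretch.

```r
n <- 201
t <- seq(0, 4 * pi, length.out = n)
y <- cos(t)                              # true minima at t = pi and 3 * pi (indices 51 and 151)
y[c(25, 125)] <- y[c(25, 125)] + 1.5     # isolated upward blips
y[c(80, 180)] <- y[c(80, 180)] - 1.5     # isolated downward blips

s <- runmed(y, 21)                       # running median over 21 points
d <- diff(s)                             # discrete first derivative of the smoothed series
nz <- which(d != 0)                      # skip flat stretches left by the median filter
turn <- which(diff(sign(d[nz])) > 0)     # slope switches from negative to positive

# Report each minimum as the middle of its flat stretch
minima <- round((nz[turn] + 1 + nz[turn + 1]) / 2)
minima
```

On real data such as the y vector from the question the window width (21 here, 41 in the question) will need tuning, and small residual wiggles may still produce extra minima that have to be filtered by depth or width.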
A simpler answer (which also does not require smoothing) could be provided by adapting the maxdrawdown() function from the tseries package. A drawdown is commonly defined as the retreat from the most recent maximum; here we want the opposite. Such a function could then be used in a sliding window over the data, or over segmented data.
maxdrawdown <- function(x) {
  if(NCOL(x) > 1)
    stop("x is not a vector or univariate time series")
  if(any(is.na(x)))
    stop("NAs in x")
  cmaxx <- cummax(x)-x
  mdd <- max(cmaxx)
  to <- which(mdd == cmaxx)
  from <- double(NROW(to))
  for (i in 1:NROW(to))
    from[i] <- max(which(cmaxx[1:to[i]] == 0))
  return(list(maxdrawdown = mdd, from = from, to = to))
}
So instead of using cummax(), one would have to switch to cummin() etc.
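One way to read that last remark is to mirror the function so that it measures the largest rise from the running minimum. The sketch below does that; the name maxrunup is hypothetical, and whether this is exactly the adaptation the answer has in mind is an assumption.

```r
# Hypothetical mirror image of maxdrawdown(): largest rise from the running minimum
maxrunup <- function(x) {
  if (NCOL(x) > 1) stop("x is not a vector or univariate time series")
  if (any(is.na(x))) stop("NAs in x")
  rise <- x - cummin(x)              # distance above the lowest value seen so far
  mru  <- max(rise)
  to   <- which(rise == mru)         # where the largest rise is reached
  # last position before each `to` at which the series sat on its running minimum
  from <- sapply(to, function(j) max(which(rise[1:j] == 0)))
  list(maxrunup = mru, from = from, to = to)
}

maxrunup(c(5, 3, 1, 2, 6, 4))        # maxrunup 5, from 3, to 5
```

Here from marks the bottom of the dip (position 3, value 1) and to marks the end of the recovery that follows it.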
My first thought was something much cruder than filtering. Why not look for the big drops followed by long enough stable periods?
span.b <- 20
threshold.b <- 0.2
dy.b <- c(rep(NA, span.b), diff(y, lag = span.b))
span.f <- 10
threshold.f <- 0.05
dy.f <- c(diff(y, lag = span.f), rep(NA, span.f))
down <- which(dy.b < -1 * threshold.b & abs(dy.f) < threshold.f)
abline(v = down)
The plot shows that it's not perfect, but it doesn't discard the outliers (I guess it depends on your take on the data).