What is the best R idiom to compute sums of elements within a sliding window?
Conceptually I want the following:
for (i in 1:(length(input) - lag + 1))
  output[i] <- sum(input[i:(i + lag - 1)])
In other words, every output element should be the sum of a fixed number of input elements (called lag here), resulting in an appropriately shorter result vector. I know that I can theoretically write this as
output = diff(cumsum(c(0, input)), lag = lag)
but I am worried about the precision here. I have a setup in mind where all the values would have the same sign, and the vectors would be pretty large. Summing up the values up front might lead to pretty large numbers, so there won't be many significant digits left for the individual differences. This feels bad.
I would imagine that it should be possible to do better than that, at least when using a single function instead of two. An implementation could maintain the current sum, always adding one element and subtracting another for each iteration. Since that would still accumulate rounding errors along the way, one could perform the computations separately from both ends, and if the results at the center were too far off, compute a fresh result from the center and thus increase precision in a divide-and-conquer approach.
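For reference, here is a minimal sketch of the plain running-sum idea, without the divide-and-conquer refinement; rounding errors can accumulate over a long vector, which is exactly what the refinement described above would be meant to address:

runsum_naive <- function(input, lag) {
  n <- length(input)
  output <- numeric(n - lag + 1)
  s <- sum(input[1:lag])        # sum of the first window
  output[1] <- s
  if (n > lag) {
    for (i in 2:(n - lag + 1)) {
      s <- s - input[i - 1] + input[i + lag - 1]   # drop one element, add one
      output[i] <- s
    }
  }
  output
}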
Do you know of any implementation which does anything like this?
Or is there a reason why this won't work as I think it should?
Or perhaps a reason why the diff(cumsum(…)) approach isn't as bad as it seems?
Edit: I had some off-by-one mistakes in my above formulations, making them inconsistent. Now they seem to agree on test data. lag should be the number of elements summed, and I'd expect a shorter vector as a result. I'm not dealing with time series objects, so absolute time alignment is not that relevant to me.
I had seen some noisy-looking things in my real data, which I had assumed to be due to such numeric problems. Since several different approaches to computing these values, based on different suggestions from answers and comments, still led to similar results, it may be that the strangeness of my data is not in fact due to numeric issues.
So in order to evaluate answers, I used the following setup:
library(Rmpfr)
library(caTools)
len <- 1024*1024*8
lag <- 3
precBits <- 128
taillen <- 6
set.seed(42) # reproducible
input <- runif(len)
input <- input + runif(len, min=-1e-9, max=1e-9) # use >32 bits
options(digits = 22)
# Reference: sum everything separately using high precision.
output <- mpfr(rep(0, taillen), precBits = precBits)
for (i in 1:taillen)
  output[i] <- sum(mpfr(input[(len-taillen+i-lag+1):(len-taillen+i)],
                        precBits=precBits))
output
addResult <- function(data, name) {
  n <- c(rownames(resmat), name)
  r <- rbind(resmat, as.numeric(tail(data, taillen)))
  rownames(r) <- n
  assign("resmat", r, parent.frame())
}
# reference solution, rounded to nearest double, assumed to be correct
resmat <- matrix(as.numeric(output), nrow=1)
rownames(resmat) <- "Reference"
# my original solution
addResult(diff(cumsum(c(0, input)), lag=lag), "diff+cumsum")
# filter as suggested by Matthew Plourde
addResult(filter(input, rep(1, lag), sides=1)[lag:length(input)], "filter")
# caTools as suggested by Joshua Ulrich
addResult(lag*runmean(input, lag, alg="exact", endrule="trim"), "caTools")
The result for this looks as follows:
[,1] [,2]
Reference 2.380384891521345469556 2.036472557725210297264
diff+cumsum 2.380384892225265502930 2.036472558043897151947
filter 2.380384891521345469556 2.036472557725210741353
caTools 2.380384891521345469556 2.036472557725210741353
[,3] [,4]
Reference 1.999147923481302324689 1.998499369297661143463
diff+cumsum 1.999147923663258552551 1.998499369248747825623
filter 1.999147923481302324689 1.998499369297661143463
caTools 1.999147923481302324689 1.998499369297661143463
[,5] [,6]
Reference 2.363071143676507723796 1.939272651346203080180
diff+cumsum 2.363071143627166748047 1.939272651448845863342
filter 2.363071143676507723796 1.939272651346203080180
caTools 2.363071143676507723796 1.939272651346203080180
The result indicates that diff+cumsum is still surprisingly accurate. (It appeared even more accurate before I thought of adding that second runif vector.) filter and caTools both are almost indistinguishable from the perfect result. As for performance, I haven't tested that (yet). I only know that the Rmpfr cumsum with 128 bits was slow enough that I didn't feel like waiting on its completion. Feel free to edit this question if you have a performance benchmark, or new suggestions to add to the comparison.
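A rough starting point for such a benchmark (only a sketch of single runs on the input and lag defined above, not something I have measured; a package like microbenchmark would average over repetitions) could be:

system.time(diff(cumsum(c(0, input)), lag = lag))
system.time(filter(input, rep(1, lag), sides = 1))
system.time(runmean(input, lag, alg = "exact", endrule = "trim"))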
I can't speak to whether this is such an implementation, but there is
filter(input, sides=2, filter=rep(1, lag+1))
Looking at the body of filter, it looks like the hard work gets passed off to a C routine, C_rfilter, so perhaps you could examine this to see if it satisfies your precision requirements. Otherwise, @JoshuaUlrich's suggestion sounds promising.
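For the trailing window of exactly lag elements that the question asks about, the one-sided form used in the question's benchmark is the closer match:

filter(input, rep(1, lag), sides = 1)[lag:length(input)]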
This answer is based on the comment from Joshua Ulrich.
The package caTools provides a function runmean which computes my partial sum, divided by the window size (or rather by the number of non-NA elements in the window in question). Quoting from its documentation:
In case of runmean(..., alg="exact") function a special algorithm is used (see references section) to ensure that round-off errors do not accumulate. As a result runmean is more accurate than filter(x, rep(1/k,k)) and runmean(..., alg="C") functions.
Note:
Function runmean(..., alg="exact") is based by code by Vadim Ogranovich, which is based on Python code (see last reference), pointed out by Gabor Grothendieck.
References:
About round-off error correction used in runmean: Shewchuk, Jonathan
Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates
More on round-off error correction can be found at:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/393090
The code stores the sum of the current window using a sequence of double precision floating point values, where smaller values represent the round-off error incurred by larger elements. Therefore there shouldn't be any accumulation of rounding errors even if the input data is processed in a single pass, adding one element and removing another at each step. The final result should be as exact as double precision arithmetic can represent it.
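The core building block of such error-free summation is Knuth's two-sum transformation; the following is only an illustrative R sketch of that idea, not the actual caTools code:

two_sum <- function(a, b) {
  s   <- a + b                       # rounded double-precision sum
  bv  <- s - a
  err <- (a - (s - bv)) + (b - bv)   # round-off lost by the addition
  c(sum = s, err = err)              # s + err equals a + b exactly
}
two_sum(1e16, 1)   # the 1 would be lost in s alone, but survives in err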
Algorithms other than exact seem to yield somewhat different results, though, so I probably wouldn't suggest these.
It seems a bit unfortunate that the source code contains a runsum_exact function, but it is commented out. The division to obtain the mean, combined with the multiplication to get back to the sum, will introduce rounding errors that could have been avoided. On this point, the CHANGES file says:
11) caTools-1.11 (Dec 2010)
Fully retired runsum.exact, which was not working for a while, use runmean with "exact" option instead.
At the moment (caTools version 1.14 from 2012-05-22) the package appears to be orphaned.
While doing certain computations involving the Rogers L-function, Wolfram Alpha returned a closed form for an integral as a linear combination of pi^2 and zeta(3).
I wanted to verify this result in Pari/GP by means of the lindep function, so I calculated the integral to 20 digits in WA, yielding:
11.3879638800312828875
Then, I used the following code in Pari/GP:
lindep([zeta(2), zeta(3), 11.3879638800312828875])
As pi^2 = 6*zeta(2), one would expect the output to be a vector along the lines of:
[12,12,-3]
because that's the linear dependency suggested by WA's result. However, I got a very elaborate vector from Pari/GP:
[35237276454, -996904369, -4984618961]
I think the first vector should be the "right" output of the Pari code sample.
Questions:
Why is the lindep function in Pari/GP not yielding the output one would expect in this case?
What can I do to make it give the vector that would be more appropriate in this situation?
It comes down to Pari treating your rounded values as exact. Since you must round your values, lindep's solution will not always match the relation you expect, because of that rounding error.
You can try changing the accuracy of lindep using the second argument. The manual states that you should choose this to be smaller than the number of correct decimal digits. I believe this should solve the issue.
lindep(v, {flag = 0}) finds a small nontrivial integral linear combination between components of v. If none can be found return an empty vector. If v is a vector with real/complex entries we use a floating point (variable precision) LLL algorithm. If flag = 0 the accuracy is chosen internally using a crude heuristic. If flag > 0 the computation is done with an accuracy of flag decimal digits. To get meaningful results in the latter case, the parameter flag should be smaller than the number of correct decimal digits in the input.
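For example, since the integral is given here to about 20 digits, one could presumably call something like lindep([zeta(2), zeta(3), 11.3879638800312828875], 18) to request 18 digits of working accuracy, staying below the number of correct digits in the input.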
As I was working out how Epi generates the basis for its spline functions (via the function Ns), I was a little confused by how it handles the detrend argument.
When detrend=T I would have expected that Epi::Ns(...) would more or less project the basis given by splines::ns(...) onto the orthogonal complement of the column space of [1 t] and finally extract the set of linearly independent columns (so that we have a basis).
However, this doesn't appear to be exactly the case; I tried
library(Epi)
x=seq(-0.75, 0.75, length.out=5)
Ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1), detrend=T)
and
library(splines)
detrend(ns(x, knots=c(-0.5,0,0.5), Boundary.knots=c(-1,1)), x)
The matrices produced by the above code are not the same; however, they do have the same column space (in this example), suggesting that if plugged into a linear model, the fitted coefficients will be different but the fit itself will be the same.
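One way to check the same-fit claim empirically in this example (only a sketch; the denser x and the response y below are made up purely for illustration):

library(Epi)
library(splines)
set.seed(1)
x <- seq(-0.75, 0.75, length.out = 50)
y <- sin(2 * x) + rnorm(50, sd = 0.1)   # arbitrary made-up response
B1 <- Ns(x, knots = c(-0.5, 0, 0.5), Boundary.knots = c(-1, 1), detrend = TRUE)
B2 <- detrend(ns(x, knots = c(-0.5, 0, 0.5), Boundary.knots = c(-1, 1)), x)
all.equal(fitted(lm(y ~ B1)), fitted(lm(y ~ B2)))   # TRUE would mean identical fits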
The first question I had was: is this true in general?
The second question is: why are the two different?
Regarding the second question - when detrend is specified, Epi::Ns gives a warning that fixsl is ignored.
Diving into the Epi GitHub source (NS.r): in the construction of the basis, in the call to Epi::Ns above with detrend=T, the worker ns.ld() is called (a function almost identical to the guts of splines::ns()), which passes c(NA,NA) along to splines::spline.des as the derivs argument when determining a matrix const:
const <- splines::spline.des( Aknots, Boundary.knots, 4, c(2-fixsl[1],2-fixsl[2]))$design
This is the difference between what happens in Ns(detrend=T) and the call to ns() above which passes c(2,2) to splineDesign as the derivs argument.
So that explains how they are different, but not why. Does anyone have an explanation for why fixsl=c(NA,NA) is used instead of fixsl=c(F,F) in Epi::Ns()?
And does anyone have a proof/or an answer to the first question?
I think the orthogonal complement of const's column space is used so that the second (or desired) derivatives are zero at the boundary (via projection of the general spline basis), but I'm not sure about this step, as I haven't dug into the mathematics; I'm just going by my feel for it. Perhaps if I understood this better, the difference in const between the calls to splineDesign/spline.des (in ns() and Ns() respectively) would explain why the two matrices are not the same from the start, yet yield the same fit.
The fixsl=c(NA,NA) was a bug that was fixed a while ago. See the commits on the CRAN GitHub mirror.
I have nevertheless sent an email to the maintainer to ask whether the fix could be made a little more consistent with the condition, but in principle this could be considered closed.
I'm doing some statistical analysis with R software (bootstrapped Kolmogorov-Smirnov tests) of very large data sets, meaning that my p values are all incredibly small. I've Bonferroni corrected for the large number of tests that I've performed meaning that my alpha value is also very small in order to reject the null hypothesis.
The problem is, R presents me with p values of 0 in some cases where the p value is presumably so small that it cannot be represented (these are usually for the very large sample sizes). While I can happily reject the null hypothesis for these tests, the data is for publication, so I'll need to write p < ....., but I don't know what the lowest reportable value in R is.
I'm using the ks.boot function in case that matters.
Any help would be much appreciated!
.Machine$double.xmin gives you the smallest non-zero normalized floating-point number. On most systems that's 2.225074e-308. However, I don't believe this is a sensible limit.
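A quick illustration of where doubles stop being able to represent nonzero values (assuming IEEE-754 doubles, as on most systems):

.Machine$double.xmin   # smallest normalized double, about 2.2e-308
1e-320                 # still nonzero: a subnormal ("denormal") value
1e-330                 # smaller than any subnormal, so it prints as 0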
Instead I suggest that in Matching::ks.boot you change the line
ks.boot.pval <- bbcount/nboots
to
ks.boot.pval <- log(bbcount) - log(nboots)
and work on the log scale.
Edit:
You can use trace to modify the function.
Step 1: Look at the function body, to find out where to add additional code.
as.list(body(ks.boot))
You'll see that element 17 is ks.boot.pval <- bbcount/nboots, so we need to add the modified code directly after that.
Step 2: trace the function.
trace(ks.boot, quote(ks.boot.pval <- log(bbcount) - log(nboots)), at = 18)
Step 3: Now you can use ks.boot and it will return the logarithm of the bootstrap p-value as ks.boot.pvalue. Note that you cannot use summary.ks.boot since it calls format.pval, which will not show you negative values.
Step 4: Use untrace(ks.boot) to remove the modifications.
I don't know whether ks.boot has methods in the packages Rmpfr or gmp, but if it does, or if you feel like rolling your own code, you can work with arbitrary-precision and arbitrary-size numbers.
I am working on designing a new sensor, and so I have a vector of measured values and a vector of truth values. To represent error, it's simply measured - truth. Since there's a lot of variation in the truth, I would like to represent the normalized error. My initial thought would be error./truth to get percent error, but there are many cases where my truth value is zero! Can anyone think of a better way to represent the normalized data while avoiding the divide-by-zero? I'm working in Matlab, though the question is a bit language-agnostic as well.
PS, feel free to push this to another stackexchange if you think it's better suited
Try error = (measured-truth)/norm2(truth) for each vector.
where norm2() is the Frobenius norm:
norm2(x) = sqrt( sum( x[i]^2, i = 1..N ) )
This can only fail if all the values of truth are zero. You can mitigate this by adding a small positive number like 1e-12 to the norm, or by avoiding the division when the norm is below some threshold.
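A minimal sketch of this in R (the question is about Matlab, but the idea carries over; measured and truth below are made-up placeholder vectors):

measured <- c(0.0, 1.1, 2.1)   # made-up example data
truth    <- c(0.0, 1.0, 2.0)
eps      <- 1e-12              # guard against an all-zero truth vector
norm_err <- (measured - truth) / (sqrt(sum(truth^2)) + eps)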
I'd suggest separating the results that have a zero (or smaller than 10e-6, for example) truth vector from those with a non-zero truth vector. You can't treat them by the same means (since you can't normalize a zero truth vector), and you should define what to do in that case.
I can't suggest anything specific because I don't know the problem statement, but you should decide for yourself how to deal with it. Or if you post your problem here, I hope we can help you.
I have been researching the log-sum-exp problem. I have a list of numbers stored as logarithms which I would like to sum and store in a logarithm.
The naive algorithm is:
import math

def naive(listOfLogs):
    return math.log10(sum(10**x for x in listOfLogs))
Many websites, including "logsumexp implementation in C?" and http://machineintelligence.tumblr.com/post/4998477107/, recommend using:
def recommend(listOfLogs):
    maxLog = max(listOfLogs)
    return maxLog + math.log10(sum(10**(x-maxLog) for x in listOfLogs))
aka
def recommend(listOfLogs):
    maxLog = max(listOfLogs)
    return maxLog + naive((x-maxLog) for x in listOfLogs)
What I don't understand is: if the recommended algorithm is better, why shouldn't we call it recursively?
Would that provide even more benefit?
def recursive(listOfLogs):
    maxLog = max(listOfLogs)
    return maxLog + recursive((x-maxLog) for x in listOfLogs)
While I'm asking: are there other tricks to make this calculation more numerically stable?
Some background for others: when you're computing an expression of the following type directly
ln( exp(x_1) + exp(x_2) + ... )
you can run into two kinds of problems:
exp(x_i) can overflow (x_i is too big), resulting in numbers that you can't add together
exp(x_i) can underflow (x_i is too small), resulting in a bunch of zeroes
If all the values are big, or all are small, we can divide by some exp(const) and add const to the outside of the ln to get the same value. Thus if we can pick the right const, we can shift the values into some range to prevent overflow/underflow.
The OP's question is, why do we pick max(x_i) for this const instead of any other value? Why don't we recursively do this calculation, picking the max out of each subset and computing the logarithm repeatedly?
The answer: because it doesn't matter.
The reason? Let's say x_1 = 10 is big, and x_2 = -10 is small. (These numbers aren't even very large in magnitude, right?) The expression
ln( exp(10) + exp(-10) )
will give you a value very close to 10. If you don't believe me, go try it. In fact, in general, ln( exp(x_1) + exp(x_2) + ... ) will be very close to max(x_i) if some particular x_i is much bigger than all the others. (As an aside, this functional form, asymptotically, actually lets you mathematically pick the maximum from a set of numbers.)
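A quick check of that claim (sketched in R, with natural logs as in the expressions above):

log(exp(10) + exp(-10))   # about 10.000000002, i.e. essentially 10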
Hence, the reason we pick the max instead of any other value is because the smaller values will hardly affect the result. If they underflow, they would have been too small to affect the sum anyway, because it would be dominated by the largest number and anything close to it. In computing terms, the contribution of the small numbers will be less than an ulp after computing the ln. So there's no reason to waste time computing the expression for the smaller values recursively if they will be lost in your final result anyway.
If you wanted to be really persnickety about implementing this, you'd divide by exp(max(x_i) - some_constant) or so to 'center' the resulting values around 1 to avoid both overflow and underflow, and that might give you a few extra digits of precision in the result. But avoiding overflow is much more important than avoiding underflow, because the former determines the result and the latter doesn't, so it's much simpler just to do it this way.
Not really any better to do it recursively. The problem's just that you want to make sure your finite-precision arithmetic doesn't swamp the answer in noise. By dealing with the max on its own, you ensure that any junk is kept small in the final answer because the most significant component of it is guaranteed to get through.
Apologies for the waffly explanation. Try it with some numbers yourself (a sensible list to start with might be [1E-5,1E25,1E-5]) and see what happens to get a feel for it.
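For instance (a quick sketch in R, using natural logs; the same behaviour shows up with the Python version above):

x <- c(1e-5, 1e25, 1e-5)
log(sum(exp(x)))            # naive: exp(1e25) overflows, so this gives Inf
m <- max(x)
m + log(sum(exp(x - m)))    # shifted by the max: 1e25, as expected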
As you have defined it, your recursive function will never terminate. That's because ((x - maxLog) for x in listOfLogs) still has the same number of elements as listOfLogs.
I don't think that this is easily fixable either, without significantly impacting either the performance or the precision (compared to the non-recursive version).