Sliding FFT in R

Is there a function or package in R for calculating the Sliding FFT of a sample? By this I mean that given the output of fft(x[n:m]), calculate fft(x[1+(n:m)]) efficiently.
Ideally I'd find both an online version (where I don't have access to the full time series at the beginning, or it's too big to fit in memory, and I'm not going to try to save the whole running FFT in memory either) and a batch version (where I give it the whole sample x and tell it the running window width w, resulting in a complex matrix of dimension c(w,length(x)/w)).
An example of such an algorithm is presented here (but I've never tried implementing it in any language yet):
http://cnx.org/content/m12029/latest/
If no such thingy already exists in R, it doesn't look too hard to implement, I guess.

As usually happens when I post something here, I kept working on it and came up with a solution:
fft.up <- function(x1, xn, prev) {
  # Sliding DFT update: given prev = fft(x[n:m]), the departing sample x1 = x[n],
  # and the arriving sample xn = x[m+1], return fft(x[(n+1):(m+1)]).
  b <- length(prev)
  vec <- exp(2i * pi * seq.int(0, b - 1) / b)  # twiddle factors exp(2*pi*i*k/b)
  (prev - x1 + xn) * vec
}
# Test it out
x <- runif(6)
all.equal(fft.up(x[1], x[6], fft(x[1:5])), fft(x[2:6]))
# [1] TRUE
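For the batch version I asked about, a naive wrapper around fft.up might look like this (just a sketch with a made-up name; note it slides by one sample, so it returns length(x)-w+1 columns rather than length(x)/w):
# Batch sliding FFT sketch: one full fft() for the first window,
# then incremental updates for each subsequent window.
fft.sliding <- function(x, w) {
  n.win <- length(x) - w + 1
  out <- matrix(0i, nrow = w, ncol = n.win)
  out[, 1] <- fft(x[1:w])
  for (k in seq_len(n.win - 1)) {
    out[, k + 1] <- fft.up(x[k], x[k + w], out[, k])
  }
  out
}
all.equal(fft.sliding(x, 5)[, 2], fft(x[2:6]))  # TRUE, matches the test above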
Still interested to know if some library offers this, because then it might offer other handy things too. =) But for now my problem's solved.

Related

CRAN package submission: "Error: C stack usage is too close to the limit"

Right up front: this is an issue I encountered when submitting an R package to CRAN, so I
don't have control of the stack size (the issue occurred on one of CRAN's platforms)
can't provide a reproducible example (I don't know the exact configurations on CRAN)
Problem
When trying to submit the cSEM.DGP package to CRAN, the automatic pretest (for Debian x86_64-pc-linux-gnu; not for Windows!) failed with the NOTE: C stack usage 7975520 is too close to the limit.
I know this is caused by a function with three arguments whose body is about 800 lines long. The function body consists of additions and multiplications of these arguments. It is the function varzeta6(), which you can find here (from line 647 onwards).
How can I address this?
Things I can't do:
provide a reproducible example (at least I would not know how)
change the stack size
Things I am thinking of:
try to break the function into smaller pieces, but I don't know how best to do that
somehow precompile the function (to be honest, I am just guessing) so CRAN doesn't complain?
Let me know your ideas!
Details / Background
The reason why varzeta6() (and varzeta4() / varzeta5(), and even more so varzeta7()) are so long and R-inefficient is that they are essentially copy-pasted from Mathematica (after simplifying the Mathematica code as much as possible and adapting it to be valid R code). Hence, the code is by no means R-optimized (which @MauritsEvers rightly pointed out).
Why do we need Mathematica? Because what we need is the general form of the model-implied construct correlation matrix of a recursive structural equation model with up to 8 constructs, as a function of the parameters of the model equations. In addition, there are constraints.
To get a feel for the problem, let's take a system of two equations that can be solved recursively:
Y2 = beta1*Y1 + zeta1
Y3 = beta2*Y1 + beta3*Y2 + zeta2
What we are interested in are the covariances E(Y1*Y2), E(Y1*Y3), and E(Y2*Y3) as functions of beta1, beta2, and beta3, under the constraints that
E(Y1) = E(Y2) = E(Y3) = 0,
E(Y1^2) = E(Y2^2) = E(Y3^2) = 1,
E(Yi*zeta_j) = 0 (with i = 1, 2, 3 and j = 1, 2).
For such a simple model, this is rather trivial:
E(Y1*Y2) = E(Y1*(beta1*Y1 + zeta1)) = beta1*E(Y1^2) + E(Y1*zeta1) = beta1
E(Y1*Y3) = E(Y1*(beta2*Y1 + beta3*(beta1*Y1 + zeta1) + zeta2)) = beta2 + beta3*beta1
E(Y2*Y3) = ...
But you can see how quickly this gets messy when you add Y4, Y5, and so on up to Y8.
In general, the model-implied construct correlation matrix can be written as follows (the expression actually looks more complicated because we also allow for up to 5 exogenous constructs, which is why varzeta1() already looks complicated; but ignore this for now):
V(Y) = (I - B)^-1 V(zeta) (I - B)'^-1
where I is the identity matrix, B is a lower triangular matrix of model parameters (the betas), and V(zeta) is a diagonal matrix. The functions varzeta1(), varzeta2(), ..., varzeta7() compute the main diagonal elements. Since we constrain Var(Yi) to always be 1, the variances of the zetas follow. Take, for example, the equation Var(Y2) = beta1^2*Var(Y1) + Var(zeta1) --> Var(zeta1) = 1 - beta1^2. This looks simple here, but it becomes extremely complicated when we take the variance of, say, the 6th equation in such a chain of recursive equations, because Var(zeta6) depends on all previous covariances between Y1, ..., Y5, which are themselves dependent on their respective previous covariances.
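Purely to make the matrix form concrete, here is a minimal R sketch for the two-equation example above (the beta values are made up, and the Var(zeta) entries are filled in by hand from the unit-variance constraints; this is an illustration, not code from the package):
beta1 <- 0.5; beta2 <- 0.3; beta3 <- 0.4   # illustrative values
B <- matrix(0, 3, 3)                       # lower triangular matrix of betas
B[2, 1] <- beta1
B[3, 1] <- beta2
B[3, 2] <- beta3
# Var(zeta) entries follow from the unit-variance constraints:
Vzeta <- diag(c(1, 1 - beta1^2, 1 - beta2^2 - beta3^2 - 2*beta1*beta2*beta3))
IB <- solve(diag(3) - B)
V.Y <- IB %*% Vzeta %*% t(IB)              # V(Y) = (I - B)^-1 V(zeta) (I - B)'^-1
diag(V.Y)                                  # all 1, as constrained
V.Y[1, 2]                                  # equals beta1 = E(Y1*Y2) from above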
OK, I don't know if that makes things any clearer. Here are the main points:
The code for varzeta1(), ..., varzeta7() is copy-pasted from Mathematica and hence not R-optimized.
Mathematica is required because, as far as I know, R cannot handle symbolic calculations.
I could R-optimize "by hand" (which is extremely tedious).
I think the structure of the varzetaX() functions must be taken as given. The question therefore is: can I somehow use these functions anyway?
One conceivable approach is to try to convince the CRAN maintainers that there's no easy way for you to fix the problem. This is a NOTE, not a WARNING; the CRAN Repository Policy says:
In principle, packages must pass R CMD check without warnings or significant notes to be admitted to the main CRAN package area. If there are warnings or notes you cannot eliminate (for example because you believe them to be spurious) send an explanatory note as part of your covering email, or as a comment on the submission form
So, you could take a chance that your well-reasoned explanation (in the comments field on the submission form) will convince the CRAN maintainers. In the long run it would be best to find a way to simplify the computations, but it might not be necessary to do it before submission to CRAN.
This is a bit too long as a comment, but hopefully this will give you some ideas for optimising the code for the varzeta* functions; or at the very least, it might give you some food for thought.
There are a few things that confuse me:
All varzeta* functions have arguments beta, gamma and phi, which seem to be matrices. However, in varzeta1 you don't use beta, yet beta is the first function argument.
I struggle to link the details you give at the bottom of your post with the code for the varzeta* functions. You don't explain where the gamma and phi matrices come from, nor what they denote. Furthermore, seeing that the betas are the model's parameter estimates, I don't understand why beta should be a matrix.
As I mentioned in my earlier comment, I would be very surprised if these expressions cannot be simplified. R can do a lot of matrix operations quite comfortably; there shouldn't really be a need to pre-calculate individual terms.
For example, you can use crossprod and tcrossprod to calculate cross products, and %*% implements matrix multiplication.
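A quick illustration of those equivalences (A is just a throwaway example matrix):
A <- matrix(rnorm(6), 2, 3)
all.equal(crossprod(A), t(A) %*% A)    # TRUE: crossprod(A) is t(A) %*% A
all.equal(tcrossprod(A), A %*% t(A))   # TRUE: tcrossprod(A) is A %*% t(A)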
Secondly, a lot of mathematical operations in R are vectorised. I already mentioned that you can simplify
1 - gamma[1,1]^2 - gamma[1,2]^2 - gamma[1,3]^2 - gamma[1,4]^2 - gamma[1,5]^2
as
1 - sum(gamma[1, ]^2)
since the ^ operator is vectorised.
Perhaps more fundamentally, this seems somewhat of an XY problem to me where it might help to take a step back. Not knowing the full details of what you're trying to model (as I said, I can't link the details you give to the cSEM.DGP code), I would start by exploring how to solve the recursive SEM in R. I don't really see the need for Mathematica here. As I said earlier, matrix operations are very standard in R; analytically solving a set of recursive equations is also possible in R. Since you seem to come from the Mathematica realm, it might be good to discuss this with a local R coding expert.
If you must use those scary varzeta* functions (and I really doubt that), an option may be to rewrite them in C++ and then compile them with Rcpp to turn them into R functions. Perhaps that will avoid the C stack usage limit?
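As a rough illustration of that route (a sketch only: varzeta_demo and its body are invented stand-ins for the real generated expressions):
library(Rcpp)
cppFunction('
double varzeta_demo(NumericMatrix gamma) {
  // compiled arithmetic runs on the machine stack,
  // not through the R expression evaluator
  double v = 1.0;
  for (int j = 0; j < gamma.ncol(); j++) v -= gamma(0, j) * gamma(0, j);
  return v;
}')
varzeta_demo(matrix(0.3, 1, 5))  # matches 1 - sum(gamma[1, ]^2)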

Is this what rnorm(x) does if x is a vector, and how could I have found out faster?

I’m looking for R resources, and I started looking at “An Introduction to R” here at r-project.org. I did and got stumped immediately.
I think I've figured out what’s going on, and my question is basically
Are there resources to help me figure out something like this more
easily?
The preface of the Introduction to R suggests starting with the introductory session in Appendix A, and right at the start is this code and remark.
x <- rnorm(50)
y <- rnorm(x)
Generate two pseudo-random normal vectors of x- and y-coordinates.
The documentation says the (first and only non-optional) parameter to rnorm is the length of the result vector. So x <- rnorm(50) produces a vector of 50 random values from a normal distribution with mean 0 and standard deviation 1.
So far so good. But why does rnorm(x) seem to do what y <- rnorm(50) or y <- rnorm(length(x)) would have done? Either of these alternatives seems clearer to me.
My guess as to what happens is this:
The wrapper for rnorm didn’t care what kind of thing x is and just passed to the underlying C function a pointer to the C struct for x as an R object.
R objects represented in C are structs followed by “data”; the data of the C representation of an R vector of reals starts with two integers, the first of which is the vector's length. (The vector elements follow those integers.) I found this out by reading up on R internals here.
If a C function were written to find the value of an R integer from a passed pointer-to-R-object, and it were called with a pointer to an R vector of reals, it would find the vector’s length in the place it would look for the single integer.
In addition to my main question of “How can I figure out something like this more easily?”, I wouldn’t mind knowing whether what I think is going on is correct, and whether rnorm(x) is idiomatic R in this context or more of a sloppy choice. Given that it does something useful, can it be relied upon, or is it just lucky behavior for an expression that isn’t well-defined in R?
I’m used to strongly-typed languages like C or SQL, which have easier-to-follow (for me) semantics and which also have more comprehensive references available, so any references for R that have a programming-language-theory focus or are aimed at people used to strong typing would be good, too.
It is documented behavior. From ?rnorm:
Usage: [...]
rnorm(n, mean = 0, sd = 1)
Arguments:
[...]
n: number of observations. If ‘length(n) > 1’, the length is
taken to be the number required.
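A quick way to confirm this behaviour at the console:
x <- rnorm(50)
length(rnorm(x))        # 50: since length(x) > 1, only its length is used
length(rnorm(c(7, 7)))  # 2, not 7: the values themselves are ignored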

Polynomials as functions in R

Sorry in advance if the question is stupid or was answered somewhere else; I could not find any nice solution.
Based on the idea of power series I have
A) a vector of real coefficients of length n, which comes from another loop and which can be rather long, but let's assume it is simple, for instance,
a<-1:10
and
B) a real center, e.g.
c<-3
I would like to define the polynomial (in my example)
a[1]+a[2]*(x-3)+ a[3]*(x-3)^2+ .... + a[10]*(x-3)^9
as a function. Unfortunately,
1) the function as.polynomial(a) only allows center 0 (as far as I understand), so I cannot use it;
2) the list of coefficients can be long, too long to do it by hand; and
3) I might later even need a multivariable version.
I would prefer to use a loop to define this "finite power series", but I do not know how loops and sums of functions can be realized in a clean fashion (and I did not find it either).
Something like (very naive)
t <- function(x) 0
for (i in 1:length(a)) {
  t <- function(x) t(x) + a[i]*(x - c)^(i - 1)
}
Thanks so much for your help.
I think this works:
my_polynomial <- function(x) {
  # a and c are taken from the enclosing environment; works for scalar x
  sum(sapply(seq_along(a), function(ii) a[ii] * (x - c)^(ii - 1L)))
}
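If you also need it vectorised over x, one possible sketch uses outer() to build the powers (cc stands in for your center c, to avoid masking base::c):
a <- 1:10
cc <- 3
poly_fun <- function(x) drop(outer(x - cc, seq_along(a) - 1, `^`) %*% a)
poly_fun(3)  # a[1] = 1: all higher powers of (x - cc) vanish
poly_fun(4)  # sum(a) = 55: (x - cc) = 1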
Just for future reference: to change the center using the polynom package, use change.origin.
For example:
change.origin(as.polynomial(a),3)

Efficient programming to overcome memory limit in R

I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to calculate a person-fit index for binary response data called HT. It divides the covariance between the response vectors of two respondents (e.g. persons i and j) by the maximum possible covariance between the two response patterns, which can be calculated using the means of the response vectors (e.g. Bi). The function is:
fit <- function(Data) {
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r/L)                 # proportion correct for each response pattern
  sig.ij <- var(t(Data), t(Data))  # covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij, 1, sum)
  H.denom1 <- matrix(p.cor.n, N, 1) %*% matrix(1 - p.cor.n, 1, N)  # Bi(1-Bj)
  H.denom2 <- matrix(1 - p.cor.n, N, 1) %*% matrix(p.cor.n, 1, N)  # (1-Bi)Bj
  H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm, 1, sum)
  HT <- H.num / H.denom
  return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20), but when I increased the number of rows (e.g. to 10000) I ran into a memory limitation problem. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line that demands less memory?
P.S.: you can try data<-matrix(rbinom(200000,1,.7),10000,20).
Thanks.
Well, here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take, but here goes. I wrote up an Rcpp function that specifically implements ifelse in the sense you use it above; it only works for square matrices like in your example. BTW, I wasn't really trying to optimize R's ifelse, because I'm pretty sure it already calls internal C functions. I was just curious whether a C++ function designed to do exactly what you are trying to do, and nothing more, would be faster. I shaved 11 seconds off. (It selects the smaller of the two values, matching your ifelse call.)
C++ Function:
library(Rcpp)
library(inline)
code <-"
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans (x.nrow(), y.ncol());
int ii, jj;
for (ii=0; ii &lt x.nrow(); ii++){
for (jj=0; jj &lt x.ncol(); jj++){
if(x(ii,jj) &lt y(ii,jj)){
ans(ii,jj) = y(ii,jj);
} else {
ans(ii,jj) = x(ii,jj);
}
}
}
return(ans);"
matIfelse <- cxxfunction(signature(xs = "numeric", ys = "numeric"),
                         plugin = "Rcpp",
                         body = code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad, roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, just in this very specific instance. Cheers
P.S. I forgot to mention that to use Rcpp you need to have Rtools installed, and during the install make sure the environment path variables are added for Rtools and gcc. On my machine those look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... I'm not sure if you are running a 32- or 64-bit machine, but you probably just need to allow R to increase the amount of RAM it can use. I'll assume you are running 32-bit to be safe, so you should be able to let R take at least 2 GB of RAM. Give this a try: memory.limit(size=1900). The size is in megabytes, so I went for 1.9 GB just to be safe; I'd imagine this is plenty of memory for what you need.
Do you actually intend to do NxN independent ifelse(H.denom1 > H.denom2, ...) operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or, alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.
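For what it's worth, base R's pmin expresses the same elementwise selection directly and avoids the intermediate copies that ifelse creates (a sketch using the objects from the function above):
H.denomm <- pmin(H.denom1, H.denom2)  # elementwise minimum, same result as the ifelse line
diag(H.denomm) <- 0
H.denom <- rowSums(H.denomm)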

matrix multiplication in R (incredibly slow)

I have the following piece of code:
Y.hat.tr <- array(0, c(nXtr, 2))
for (i in 1:nXtr) {
  #print(i)
  Y.hat.tr[i, 2] <- ktr[, i] %*% solve(K + a*In) %*% Ytr
  #Y.hat.tr[i, 2] <- ktr[, i] %*% chol2inv(chol(K + a*In)) %*% Ytr
}
Y.hat.tr[, 1] <- Ytr
My problem is that nXtr = 300 and ktr is a 300x300 matrix. This routine takes approx. 30 seconds to run in R version 3.0.1. I have tried various approaches to reduce the run time, but to no avail.
Any ideas would be gratefully received. If any other information is required, please let me know.
I have now taken solve(K + a*In) %*% Ytr out of the loop, which has helped, but I was hoping to somehow vectorise this piece of code. Having thought about it for a while, and after looking through various posts, I cannot see how this can be done.
Maybe I am missing something (and without sample or simulated data to test on it is harder to check), but isn't your loop equivalent to:
Y.hat.tr[,2] <- t(ktr) %*% solve(K + a*In) %*% Ytr
?
Removing the loop altogether and using internal vectorized code may speed things up.
Also, you are using solve with one argument; often you can speed things up by using solve with two arguments (fewer internal calculations), something like:
t(ktr) %*% solve( K + a*In, Ytr )
Your loop is of the type called embarrassingly parallel, which means that if you want to keep the loop and are working on a computer with more than one core (or have easy access to a cluster), then you could use the parallel package (maybe simplest via the foreach package) to run the calculations in parallel, which can sometimes greatly speed up the process.
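A rough sketch of that parallel route (assuming a doParallel backend, and with the solve hoisted out of the loop as described above):
library(doParallel)
registerDoParallel(cores = 2)   # or however many cores are available
Minv.Y <- solve(K + a*In, Ytr)  # two-argument solve, computed once
Y.hat.tr[, 2] <- foreach(i = 1:nXtr, .combine = c) %dopar% {
  drop(ktr[, i] %*% Minv.Y)
}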
