I have a function in R that calculates a person-fit index, called HT, for a matrix of binary response data. It divides the covariance between the response vectors of two respondents (e.g. persons i and j) by the maximum possible covariance between the two response patterns, which can be calculated from the means of the response vectors (e.g. Bi). The function is:
fit <- function(Data){
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r/L)                  # proportion correct for each response pattern
  sig.ij <- var(t(Data), t(Data))   # covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij, 1, sum)
  H.denom1 <- matrix(p.cor.n, N, 1) %*% matrix(1 - p.cor.n, 1, N)   # Bi(1-Bj)
  H.denom2 <- matrix(1 - p.cor.n, N, 1) %*% matrix(p.cor.n, 1, N)   # (1-Bi)Bj
  H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm, 1, sum)
  HT <- H.num / H.denom
  return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20), but when I increased the number of rows (e.g. to 10000) I ran into a memory limitation. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line so that it demands less memory?
P.S.: you can try data<-matrix(rbinom(200000,1,.7),10000,20).
Thanks.
Well, here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take, but here goes. I wrote up an Rcpp function that specifically implements ifelse in the sense you use it above. It only works for square matrices like in your example. BTW, I wasn't really trying to optimize R's ifelse, because I'm pretty sure it already calls internal C functions; I was just curious whether a C++ function designed to do exactly what you are trying to do, and nothing more, would be faster. It shaved about 11 seconds off. (This selects the smaller of the two values, matching your ifelse.)
C++ Function:
library(Rcpp)
library(inline)
code <-"
Rcpp::NumericMatrix x(xs);
Rcpp::NumericMatrix y(ys);
Rcpp::NumericMatrix ans (x.nrow(), y.ncol());
int ii, jj;
for (ii=0; ii < x.nrow(); ii++){
for (jj=0; jj < x.ncol(); jj++){
if(x(ii,jj) < y(ii,jj)){
ans(ii,jj) = y(ii,jj);
} else {
ans(ii,jj) = x(ii,jj);
}
}
}
return(ans);"
matIfelse <- cxxfunction(signature(xs = "numeric", ys = "numeric"),
                         plugin = "Rcpp",
                         body = code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad: roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, just in this very specific instance. Cheers.
P.S. I forgot to mention that to use Rcpp you need to have Rtools installed, and during the install make sure the environment path variables are added for Rtools and gcc. On my machine those look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems. I'm not sure whether you are running a 32- or 64-bit machine, but you may just need to allow R to use more RAM. I'll assume you are on 32-bit to be safe, so you should be able to let R take at least 2 GB of RAM. Give this a try: memory.limit(size=1900). The size is in megabytes, so that is roughly 1.9 GB, which I'd imagine is plenty of memory for what you need.
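One more thing worth trying, since the question was also about memory: base R's pmin() computes exactly the element-wise minimum that your ifelse() call selects, and in my experience it tends to be both faster and lighter than ifelse() (a minimal sketch, untimed here):
H.denomm <- pmin(H.denom1, H.denom2)   # same result as ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)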
Do you actually intend to do N x N independent ifelse(H.denom1 > H.denom2, ...) operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.
Related
Based on what I've read before, vectorization is a form of parallelization known as SIMD. It allows processors to execute the same instruction (such as addition) on an array simultaneously.
However, I got confused when reading The Relationship between Vectorized and Devectorized Code regarding Julia's and R's vectorization performance. The post claims that devectorized Julia code (via loops) is faster than the vectorized code in both Julia and R, because:
"This confuses some people who are not familiar with the internals of R. It is therefore worth noting how one improves the speed of R code. The process of performance improvement is quite simple: one starts with devectorized R code, then replaces it with vectorized R code and then finally implements this vectorized R code in devectorized C code. This last step is unfortunately invisible to many R users, who therefore think of vectorization per se as a mechanism for increasing performance. Vectorization per se does not help make code faster. What makes vectorization in R effective is that it provides a mechanism for moving computations into C, where a hidden layer of devectorization can do its magic."
It claims that R turns vectorized code, written in R, into devectorized code in C. If vectorization is faster (as a form of parallelization), why would R devectorize the code and why is that a plus?
"Vectorization" in R, is a vector processing in R's interpreter's view. Take the function cumsum as an example. On entry, R interpreter sees that a vector x is passed into this function. However, the work is then passed to C language that R interpreter can not analyze / track. While C is doing work, R is just waiting. By the time that R's interpreter comes back to work, a vector has been processed. So in R's view, it has issued a single instruction but processed a vector. This is an analogy to the concept of SIMD - "single instruction, multiple data".
It is not only functions like cumsum, which take a vector and return a vector, that count as "vectorization" in R; a function like sum, which takes a vector and returns a scalar, is also "vectorization".
Simply put: whenever R calls compiled code for a loop, it is "vectorization". If you wonder why this kind of "vectorization" is useful, it is because a loop written in a compiled language is faster than a loop written in an interpreted language. A C loop is translated into machine language that the CPU can understand; however, if the CPU wants to execute an R loop, it needs the R interpreter's help to read it, iteration by iteration. It is as if you know Chinese, so you can respond to someone speaking Chinese to you directly; otherwise, you need a translator to first translate the Chinese to you, sentence by sentence, into English, then you respond in English, and the translator turns it back into Chinese sentence by sentence. The effectiveness of communication is greatly reduced.
x <- runif(1e+7)

## R loop
system.time({
  sumx <- 0
  for (x0 in x) sumx <- sumx + x0
  sumx
})
#   user  system elapsed
#  1.388   0.000   1.347

## C loop
system.time(sum(x))
#   user  system elapsed
#  0.032   0.000   0.030
Be aware that "vectorization" in R is just an analogy to SIMD but not a real one. A real SIMD uses CPU's vector registers for computations hence is a true parallel computing via data parallelism. R is not a language where you can program CPU registers; you have to write compiled code or assembly code for that purpose.
R's "vectorization" does not care how a loop written in a compiled language is really executed; after all that is beyond R's interpreter's knowledge. Regarding whether these compiled code will be executed with SIMD, read Does R leverage SIMD when doing vectorized calculations?
More on "vectorization" in R
I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this, because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.
This reveals a downside of R's vectorization: speed often comes at the price of memory usage. I am not saying that Julia won't have this problem (I don't use it, so I don't know), but it is definitely true for R.
Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. rowSums(A * B) is "vectorization" in R, as both "*" and rowSums are coded in C as loops. However, R cannot fuse them into a single C loop, so it cannot avoid generating the temporary matrix C = A * B in RAM.
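To make that concrete, here is a small illustration of the trade-off (a sketch; the dimensions are arbitrary):
A <- matrix(rnorm(2e6), nrow = 1e5)
B <- matrix(rnorm(2e6), nrow = 1e5)

d1 <- rowSums(A * B)                  # fast, but materializes the temporary A * B in RAM
d2 <- vapply(seq_len(nrow(A)),        # no large temporary, but the loop runs in interpreted R
             function(i) sum(A[i, ] * B[i, ]),
             numeric(1))
all.equal(d1, d2)                     # same result, different speed/memory profile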
Another example is R's recycling rule, or any computation relying on that rule. For example, when you add a scalar a to a matrix A by A + a, what effectively happens is that a is first replicated into a matrix B with the same dimensions as A, i.e., B <- matrix(a, nrow(A), ncol(A)), and then an addition between two matrices is calculated: A + B. Clearly the generation of the temporary matrix B is undesirable, but you can't do better unless you write your own C function for A + a and call it from R. This is described as "such a fusion is possible only if explicitly implemented" in Bogumił Kamiński's answer.
To deal with the memory effects of many temporary results, R has a sophisticated mechanism called garbage collection. It helps, but memory can still explode if you generate a really big temporary result somewhere in your code. A good example is the function outer. I have written many answers using this function, but it is particularly memory-unfriendly.
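For a sense of scale (a rough sketch; the sizes are just illustrative):
x <- rnorm(1e4)
y <- rnorm(1e4)
print(object.size(outer(x, y)), units = "MB")   # roughly 800 million bytes: 1e4 * 1e4 doubles at 8 bytes each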
I may have gone off-topic in this edit, as I began to discuss the side effects of "vectorization". Use it with care.
Keep memory usage in mind; there might be a more memory-efficient vectorized implementation. For example, as mentioned in the linked thread on row-wise dot products between two matrices, c(crossprod(x, y)) is better than sum(x * y) (see the short sketch after this list).
Be prepared to use CRAN packages that ship compiled code. If the existing vectorized functions in R prove too limited for your task, explore CRAN for packages that can do it. You can also ask a question about your bottleneck on Stack Overflow, and somebody may point you to the right function in the right package.
Be happy to write your own compiled code.
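To make the first point above concrete (a minimal sketch; both expressions return the same value up to floating-point error, but sum(x * y) allocates the temporary vector x * y while crossprod hands the work to BLAS without it):
x <- rnorm(1e7)
y <- rnorm(1e7)
all.equal(sum(x * y), c(crossprod(x, y)))   # same value, different memory profile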
I think it is worth noting that the post you are referring to does not cover all of the current vectorization functionality in Julia.
The important thing is that vectorization in Julia is implemented in Julia, as opposed to R, where it is implemented outside of the language. This is explained in detail in this post: https://julialang.org/blog/2017/01/moredots.
The consequence of this fact is that Julia can fuse any sequence of broadcasted operations into a single loop. In other languages that provide vectorization, such fusion is possible only if explicitly implemented.
In summary:
In Julia you can expect that vectorized code is as fast as a loop.
If you perform a sequence of vectorized operations then in general you can expect Julia to be faster than R as it can avoid allocation of intermediate results of the computations.
EDIT:
Following the comment of 李哲源, here is an example showing that Julia is able to avoid any allocations if you want to increase all elements of a vector x by 1:
julia> using BenchmarkTools
julia> x = rand(10^6);
julia> #benchmark ($x .+= 1)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 819.230 μs (0.00% GC)
median time: 890.610 μs (0.00% GC)
mean time: 929.659 μs (0.00% GC)
maximum time: 2.802 ms (0.00% GC)
--------------
samples: 5300
evals/sample: 1
In the code, .+= performs the addition in place (adding $ in front of the expression is only needed for benchmarking; in normal code it would be x .+= 1). We can see that no memory was allocated.
If we compare this to a possible implementation in R:
> library(microbenchmark)
> x <- runif(10^6)
> microbenchmark(x <- x + 1)
Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
 x <- x + 1 2.205764 2.391911 3.999179 2.599051 5.061874 30.91569   100
we can see that the Julia approach not only saves memory but also leads to faster execution of the code.
I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64-bit, and I'm running Microsoft R Open, which is a multicore build of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I have heard that parallel computing may help with R, but I don't know how to use it, and I have no idea whether it speeds up only code that I write myself or whether it can also speed up an existing package function such as optimumLHS. I don't necessarily have to use the lhs package; my only requirement is to generate an LHS design that is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worst comes to worst, I could even accept a solution in an environment other than R, but it would have to be either MATLAB or an open-source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)

performance <- c()
for(i in 1:100){
  ptm <- proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time <- print(proc.time() - ptm)[[3]]
  performance <- rbind(performance, data.frame(time = time, n = i))
}

ggplot(performance, aes(x = n, y = time)) +
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points you need to know the locations of all the sample points. I think your only options for speeding this up are to take a smaller sample or to get access to a faster computer. It strikes me that since this only really has to be done once, perhaps there is a resource where you could simply obtain a properly sampled and optimized design that has already been computed?
So it looks like roughly 650 hours on my machine, which is very comparable to yours, to compute the n = 400 case.
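For reference, this is roughly how you can extrapolate from the timings above to n = 400 (a back-of-the-envelope sketch; it assumes the run time grows like a power of n, and the exact figure will depend on your machine):
grow <- lm(log(time) ~ log(n), data = subset(performance, time > 0))
est_seconds <- exp(predict(grow, data.frame(n = 400)))
est_seconds / 3600   # rough estimate in hours for n = 400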
I have the following piece of code:
Y.hat.tr <- array(0, c(nXtr, 2))
for (i in 1:nXtr){
  #print(i)
  Y.hat.tr[i, 2] <- ktr[, i] %*% solve(K + a*In) %*% Ytr
  #Y.hat.tr[i, 2] <- ktr[, i] %*% chol2inv(chol((K + a*In))) %*% Ytr
}
Y.hat.tr[, 1] <- Ytr
My problem is that nXtr = 300 and ktr is a 300 x 300 matrix. This routine takes approximately 30 seconds to run in R version 3.0.1. I have tried various approaches to reduce the run time, but to no avail.
Any ideas would be gratefully received. If any other information is required, please let me know.
I have now taken solve(K + a*In)%*%Ytr out of the loop, which has helped, but I was hoping to somehow vectorise this piece of code. Having thought about it for a while, and after looking through various posts, I cannot see how this can be done.
Maybe I am missing something (and without sample or simulated data to test on it is harder to check), but isn't your loop equivalent to:
Y.hat.tr[,2] <- t(ktr) %*% solve(K + a*In) %*% Ytr
?
Removing the loop altogether and using internal vectorized code may speed things up.
Also, you are using solve with one argument; often you can speed things up by using solve with two arguments (fewer internal calculations), something like:
t(ktr) %*% solve( K + a*In, Ytr )
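A further small tweak, if it helps (a sketch; crossprod(ktr, z) is just another way of writing t(ktr) %*% z, so it avoids building the transposed copy explicitly):
Y.hat.tr[, 2] <- crossprod(ktr, solve(K + a * In, Ytr))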
Your loop is of the type called embarrassingly parallel, which means that if you want to keep the loop and are working on a computer with more than one core (or have easy access to a cluster), then you could use the parallel package (perhaps most simply via the foreach package) to run the calculations in parallel, which can sometimes greatly speed up the process.
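If you do want to go that route, here is a rough sketch of what it could look like with foreach/doParallel (untested against your data; object names follow your question, and the solve is hoisted out of the loop):
library(doParallel)
library(foreach)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

M <- solve(K + a * In, Ytr)   # solve once, outside the loop

Y.hat.tr[, 2] <- foreach(i = 1:nXtr, .combine = c) %dopar% {
  as.numeric(ktr[, i] %*% M)  # one dot product per iteration
}

stopCluster(cl)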
I've been working on a simple collection of functions for my supervisor that will do some simple initial genome-scale stats, quick indications of which future analyses may merit more time, for example RDP4 or BioConductor (just to explain why I haven't gone straight to BioC). I'd like to speed some things up to allow larger contig sizes, so I've decided to use doParallel and foreach to adapt some for loops. Below is one simple function which identifies bases in some sequences (stored as a matrix) which are identical and removes them.
strip.invar <- function(x) {
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min = 0, max = ncol(x), style = 3)
  removals <- c()
  for(i in 1:ncol(x)){
    setTxtProgressBar(prog, i)
    if(length(unique(x[, i])) == 1) {
      removals <- append(removals, i)
    }
  }
  newDnaMatrix <- x[, -removals]
  return(newDnaMatrix)
}
After reading the introduction to doParallel and foreach I tried to make a version that accommodates more cores - on my Mac this is 8: a quad core with two threads per core, i.e. 8 virtual cores:
strip.invar <- function(x, coresnum = detectCores()){
  cat("\nNow removing invariant sites from DNA data matrix, this may take some time...\n")
  prog <- txtProgressBar(min = 0, max = ncol(x), style = 3)
  removals <- c()
  if(coresnum > 1) {
    cl <- makeCluster(coresnum)
    registerDoParallel(cl)
    foreach(i = 1:ncol(x)) %dopar% {
      setTxtProgressBar(prog, i)
      if(all(x[, i] == x[[1, i]])){
        removals <- append(removals, i)
      }
    }
  } else {
    for(i in 1:ncol(x)){
      setTxtProgressBar(prog, i)
      if(length(unique(x[, i])) == 1) {
        removals <- append(removals, i)
      }
    }
  }
  newDnaMatrix <- x[, -removals]
  return(newDnaMatrix)
}
However, if I run this with the number of cores set to 8, I'm not entirely sure it works - I can't see the progress bar doing anything, but then I've heard that printing to screen and anything involving graphics devices is tricky with parallel computing in R. It still seems to take some time and my laptop gets very hot, so I'm not sure whether I've done this correctly. I've tried after seeing a few examples (I successfully ran the nice bootstrap example in the vignette), but I'm bound to hit learning bumps. As an aside, I thought I'd also ask people's opinion: what is the best speed-up for R code bottlenecks where loops or apply are involved - parallelising it, or Rcpp?
Thanks.
My other answer was not correct, since the column mean being equal to the first value is not a sufficient test for the number of unique values. So here is another answer:
You can optimize the loop by using apply.
set.seed(42)
dat <- matrix(sample(1:5, 1e5, replace = TRUE), nrow = 2)
res1 <- strip.invar(dat)

strip.invar2 <- function(dat) {
  ix <- apply(dat, 2, function(x) length(unique(x)) > 1)
  dat[, ix]
}
res2 <- strip.invar2(dat)

all.equal(res1, res2)
#TRUE

library(microbenchmark)
microbenchmark(strip.invar(dat), strip.invar2(dat), times = 10)
#Unit: milliseconds
#              expr       min        lq    median        uq      max neval
#  strip.invar(dat) 2514.7995 2529.2827 2547.6751 2678.464 2792.405    10
# strip.invar2(dat)  933.3701  945.5689  974.7564 1008.589 1018.400    10
This improves performance quite a bit, though not as much as you could achieve if vectorization was possible.
Parallelization won't give better performance here, since each iteration does not require much computation on its own, so the parallelization overhead will actually increase the time needed. However, you could split the data and process the chunks in parallel.
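If you want to go down that road, here is a rough sketch of the chunked approach (it assumes doParallel/foreach are installed; dat is the matrix from above and the number of chunks is arbitrary):
library(doParallel)
library(foreach)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

n.chunks <- 4
chunks <- split(seq_len(ncol(dat)),
                cut(seq_len(ncol(dat)), n.chunks, labels = FALSE))

keep <- foreach(idx = chunks, .combine = c) %dopar% {
  # return the column indices in this chunk that have more than one unique value
  idx[apply(dat[, idx, drop = FALSE], 2,
            function(col) length(unique(col)) > 1)]
}

res3 <- dat[, sort(keep)]
stopCluster(cl)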
Firstly, try running cl <- makeCluster(coresnum - 1). The master R process is already using one of your cores and is needed to dispatch and receive results from the slave jobs, so you really have 7 free cores for the slave jobs. With all 8 requested, I think you will effectively be queuing one of your foreach iterations to wait until one of the previous ones finishes, and the job will take longer to complete.
Secondly, what you would normally see on the console when running this function sequentially is still printed, it's just that each job's output goes to the slave process's console, so you won't see it. You can, however, save the output from the different foreach iterations to a text file and examine it there.
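One way to do that (a sketch; this uses the outfile argument of makeCluster, and the file name is arbitrary) is to redirect the workers' console output to a file:
cl <- makeCluster(coresnum - 1, outfile = "worker_log.txt")
registerDoParallel(cl)

foreach(i = 1:ncol(x)) %dopar% {
  cat("processing column", i, "\n")   # lands in worker_log.txt rather than your console
  length(unique(x[, i])) == 1
}

stopCluster(cl)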
Your laptop will get very hot because all of your cores are working at 100% capacity while you are running this job.
I have found the foreach package to provide an excellent set of functions for simple parallel processing. Rcpp may (will?!) give you much greater performance, but how comfortable are you writing C++ code, what is the runtime of this function, and how often will it be used? I always think about these things first.
Is there a function or package in R for calculating the Sliding FFT of a sample? By this I mean that given the output of fft(x[n:m]), calculate fft(x[1+(n:m)]) efficiently.
Ideally I'd like to find both an online version (where I don't have access to the full time series at the beginning, or it's too big to fit in memory, and I'm not going to try to save the whole running FFT in memory either) and a batch version (where I give it the whole sample x and tell it the running window width w, resulting in a complex matrix of dimension c(w, length(x)/w)).
An example of such an algorithm is presented here (but I've never tried implementing it in any language yet):
http://cnx.org/content/m12029/latest/
If no such thingy exists already in R, that doesn't look too hard to implement I guess.
As usually happens when I post something here, I kept working on it and came up with a solution:
fft.up <- function(x1, xn, prev) {
  # Sliding DFT update: given prev = fft of a length-b window, drop the oldest
  # sample x1, append the new sample xn, and rotate by the twiddle factors.
  b <- length(prev)
  vec <- exp(2i*pi*seq.int(0, b-1)/b)
  (prev - x1 + xn) * vec
}
# Test it out
x <- runif(6)
all.equal(fft.up(x[1], x[6], fft(x[1:5])), fft(x[2:6]))
# [1] TRUE
Still interested to know if some library offers this, because then it might offer other handy things too. =) But for now my problem's solved.
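In case it is useful to anyone, here is a rough sketch of the batch version I mentioned, built on fft.up (the function name and the column layout, with column j holding fft(x[j:(j + w - 1)]), are my own choices, not from any package):
fft.running <- function(x, w) {
  n.win <- length(x) - w + 1
  out <- matrix(0+0i, nrow = w, ncol = n.win)
  out[, 1] <- fft(x[1:w])                               # one full FFT to start
  for (j in seq_len(n.win - 1)) {
    out[, j + 1] <- fft.up(x[j], x[j + w], out[, j])    # slide the window by one sample
  }
  out
}

# Spot check one window against a direct FFT
x <- runif(32)
all.equal(fft.running(x, 8)[, 10], fft(x[10:17]))       # should be TRUE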