I have heard that writing for loops in R is particularly slow. I have the following code, which needs to run through 122,000 rows of 513 columns each and transform them using the fft() function:
for (i in 2:100000) {
  Data1[i, 2:513] <- fft(as.numeric(Data1[i, 2:513]), inverse = TRUE) / 512
}
I tried running this for 1,000 rows and it took a few minutes. Is there a way to make this loop faster, perhaps by not using a loop at all or by doing it in C?
mvfft (documented on the fft help page) was designed to do this all at once. It's hard to imagine how you could do it any faster: less than three seconds (on an older Xeon workstation) for a dataset exactly your size.
n.row <- 122e3
X <- matrix(rnorm(n.row * 512), n.row)
system.time(
  Y <- mvfft(t(X), inverse = TRUE) / 512
)
user system elapsed
2.34 0.39 2.75
Note that the discrete FFT in this case has complex values.
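If you need the result back in the original one-series-per-row layout, a transpose does it (a minimal sketch; Data1.ifft is just a hypothetical name, kept separate from the original data frame because the values are complex):
Data1.ifft <- t(Y)   # n.row x 512 complex matrix, rows aligned with the rows of X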
FFTs are fast. Typically they can be computed in less time than it takes to read the data from an ASCII file (the character-to-numeric conversions in the read take more time than the calculations in the FFT). Your limiting resources are therefore I/O throughput and RAM. But 122,000 vectors of 512 complex values (16 bytes each) occupy "only" about a gigabyte, so you should be OK.
Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase the available/addressable memory somehow? However, these 7.6 GB are probably beyond some hard limit that cannot be extended. In any case, the system has 16 GB of memory and the same amount of swap. By "Gb", R seems to mean gigabytes, not gigabits, so 7.6 GB already puts us dangerously close to a hard limit.
Perhaps a different distance measure (say DTW) instead of Euclidean would be more memory efficient? However, as explained below, the limit seems to be the resulting matrix, not the memory required during the computation.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only the parts relevant for the lower triangle) that can later be reassembled? (This looks similar to the solution proposed here for a similar problem.) It seems a rather messy solution, though, and I will need the 45K x 45K matrix in the end anyway, which appears to hit the same limit: the system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems; R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions about R running out of memory, like this one, suggest using a more memory-efficient method of computation.
This very helpful answer suggests deleting other large objects to make room for the memory-intensive operation.
Here, the idea of splitting the data set and computing distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/, which the system by default (depending on configuration) limits to half the size of the RAM. You can increase this at runtime with, for instance, mount -o remount,size=12Gb /dev/shm, but even that may still not allow usage of the full 12 GB (I do not know why; perhaps the memory-management configuration becomes inconsistent). Also, you may end up crashing your system if you are not careful.
R apparently does allow access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are partly based on speculation, so I do not know to what degree this is right.
R apparently does allow access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before, as described in the original question, but with a matrix of NAs instead of rnorm random variates. Reassigning one of the values in the matrix to a numeric (memorytestmatrix[1,1] <- 0.5) still works and recasts the matrix as a numeric matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot build it the way the dist function attempts to. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, it works.
library(mefa) # for the vec2dist function

euclidian <- function(series1, series2) {
  return((sum((series1 - series2)^2))^.5)
}

mx <- nrow(ctab2)
distMatrixE <- vec2dist(0.0, size = mx)
for (coli in 1:(mx - 1)) {
  for (rowi in (coli + 1):mx) {
    # Element indices in dist objects count down the rows, column by column,
    # from left to right in the lower triangular matrix, without the main diagonal.
    # From row and column indices, the element index for the dist object is computed like so:
    element <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi, ], ctab2[coli, ])
  }
}
(Note that addressing in dist objects is a bit tricky: they are not matrices but one-dimensional vectors of length (N²-N)/2, interpreted as lower triangular N x N matrices without the main diagonal. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
Also note that it may be possible to speed this up with apply or sapply by computing more than one value at a time, as sketched below.
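For instance, a sketch that fills one column of the dist object per outer iteration (same index formula and euclidian helper as above; whether this is noticeably faster on your data is untested):
for (coli in 1:(mx - 1)) {
  rows <- (coli + 1):mx
  elements <- (mx^2 - mx)/2 - ((mx - coli + 1)^2 - (mx - coli + 1))/2 + rows - coli
  # one subassignment per column instead of an inner loop over rows
  distMatrixE[elements] <- apply(ctab2[rows, , drop = FALSE], 1,
                                 function(r) euclidian(r, ctab2[coli, ]))
}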
There exist good algorithms that do not need a full distance matrix in memory, for example SLINK, DBSCAN, and OPTICS.
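As a minimal sketch (my addition, not part of the answer above), the dbscan package clusters directly on the data matrix, so the full 45K x 45K distance matrix is never built; the eps and minPts values here are placeholders that would need tuning:
library(dbscan)
cl <- dbscan(as.matrix(ctab2), eps = 0.5, minPts = 5)   # placeholder parameters
table(cl$cluster)                                       # cluster 0 collects noise points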
I have a data set (after normalising and preprocessing) containing a data frame with 5 columns and 133,763 rows.
I am trying to apply the k-means algorithm and a hierarchical algorithm in order to do the clustering. However, my problem is that RStudio keeps trying to do the calculation and then throws an out-of-memory exception, even though I am using a MacBook Pro with an i7 and 16 GB of RAM.
My code for hierarchical clustering is:
dist.cards<-dist(cardsNorm)
As I said, that takes forever to run. However, if I do this
dist.cards<-dist(cardsNorm[1:10])
it works fine, because it only uses 10 rows.
For k-means, this is my code:
cardsKMS<-kmeans(cardsNorm,centers=3,iter.max = 100,nstart = 25)
It works fine, but when I try to evaluate the model using this code
a <- silhouette(cardsKMS$cluster,dist(cardsNorm))
it takes forever and never finishes the calculation.
Help, please.
Creating a distance matrix between n = 133763 observations requires (n^2-n)/2 ≈ 8.9 billion pairwise comparisons. Given that each distance is stored as an 8-byte double, the entire matrix requires roughly 70 GB of RAM. So unfortunately you don't have enough.
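A quick back-of-the-envelope check in R (8 bytes per double in the dist vector):
n <- 133763
(n^2 - n) / 2              # ~ 8.9 billion pairwise distances
(n^2 - n) / 2 * 8 / 2^30   # ~ 67 GiB for the dist object alone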
Algorithms based on distance matrices scale very poorly with data set size (they are inherently quadratic in memory and CPU), so I am afraid you will need to try some other clustering algorithm.
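One pragmatic workaround (my addition, not part of the answer above) is to estimate the silhouette on a random subsample instead of all 133,763 rows:
library(cluster)
set.seed(1)
idx <- sample(nrow(cardsNorm), 5000)                       # hypothetical subsample size
a <- silhouette(cardsKMS$cluster[idx], dist(cardsNorm[idx, ]))
mean(a[, "sil_width"])                                     # approximate average silhouette width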
I wish to calculate the speedup of my MPI application against the number of parallel processes/nodes.
The application mostly performs huge matrix computations in parallel.
I can measure the elapsed time using MPI_Wtime(), something like this:
double start = MPI_Wtime();
....
double end = MPI_Wtime();
double elapsed = end - start;
But how can I relate this to the degree of parallelization?
The usual definition of speedup is the time on 1 process divided by the time on p processes.
If you wish to present the performance of your code, it is good to pick a range of p from 1 up to the largest count you have access to, run at each p, and plot the results on a speedup vs. p plot; a small sketch of such a plot is below.
Note that, strictly speaking, speedup should compare the time on p processes against the best possible sequential code, not just your parallel code run sequentially. This may seem like a moot point, but in some areas parallel codes are pretty awful in the sequential case; in the sparse matrix world, for example, you can find parallel codes 10-50x slower than the best sequential code.
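For example, a minimal R sketch (the process counts and timings below are hypothetical placeholders, not measurements):
# Speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p from MPI_Wtime() timings
p       <- c(1, 2, 4, 8, 16)        # hypothetical process counts
elapsed <- c(120, 63, 34, 19, 11)   # hypothetical elapsed times in seconds
speedup    <- elapsed[1] / elapsed
efficiency <- speedup / p
plot(p, speedup, type = "b", xlab = "processes p", ylab = "speedup")
abline(0, 1, lty = 2)               # ideal linear speedup for reference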
I have a function that calculates an index in R for a matrix of binary data. The goal of this function is to compute a person-fit index for binary response data called HT. It divides the covariance between the response vectors of two respondents (e.g. persons i and j) by the maximum possible covariance between the two response patterns, which can be calculated from the means of the response vectors (e.g. Bi). The function is:
fit <- function(Data) {
  N <- dim(Data)[1]
  L <- dim(Data)[2]
  r <- rowSums(Data)
  p.cor.n <- (r / L)                  # proportion correct for each response pattern
  sig.ij <- var(t(Data), t(Data))     # covariance of response patterns
  diag(sig.ij) <- 0
  H.num <- apply(sig.ij, 1, sum)
  H.denom1 <- matrix(p.cor.n, N, 1) %*% matrix(1 - p.cor.n, 1, N)   # Bi(1-Bj)
  H.denom2 <- matrix(1 - p.cor.n, N, 1) %*% matrix(p.cor.n, 1, N)   # (1-Bi)Bj
  H.denomm <- ifelse(H.denom1 > H.denom2, H.denom2, H.denom1)
  diag(H.denomm) <- 0
  H.denom <- apply(H.denomm, 1, sum)
  HT <- H.num / H.denom
  return(HT)
}
This function works fine with small matrices (e.g. 1000 by 20), but when I increased the number of rows (e.g. to 10000) I ran into a memory limitation problem. The source of the problem is this line in the function:
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
which selects the denominator for each response pattern. Is there any other way to rewrite this line so that it demands less memory?
P.S.: you can try data <- matrix(rbinom(200000, 1, .7), 10000, 20).
Thanks.
Well, here is one way you could shave a little time off. Overall I still think there might be a better theoretical answer in terms of the approach you take, but here goes. I wrote an Rcpp function that specifically implements ifelse in the sense you use it above; it only works for square matrices like in your example. By the way, I wasn't really trying to out-optimize R's ifelse, because I'm pretty sure it already calls internal C functions; I was just curious whether a C++ function designed to do exactly what you are trying to do, and nothing more, would be faster. It shaved 11 seconds off. (The function keeps the smaller of the two values, matching the ifelse above.)
C++ Function:
library(Rcpp)
library(inline)

code <- "
  Rcpp::NumericMatrix x(xs);
  Rcpp::NumericMatrix y(ys);
  Rcpp::NumericMatrix ans(x.nrow(), y.ncol());
  int ii, jj;
  for (ii = 0; ii < x.nrow(); ii++) {
    for (jj = 0; jj < x.ncol(); jj++) {
      // keep the smaller of the two values, like ifelse(x > y, y, x)
      if (x(ii,jj) > y(ii,jj)) {
        ans(ii,jj) = y(ii,jj);
      } else {
        ans(ii,jj) = x(ii,jj);
      }
    }
  }
  return(ans);"

matIfelse <- cxxfunction(signature(xs="numeric", ys="numeric"),
                         plugin="Rcpp",
                         body=code)
Now if you replace ifelse in your function above with matIfelse you can give it a try. For example:
H.denomm <- matIfelse(H.denom1,H.denom2)
# Time for old version to run with the matrix you suggested above matrix(rbinom(200000,1,.7),10000,20)
# user system elapsed
# 37.78 3.36 41.30
# Time to run with dedicated Rcpp function
# user system elapsed
# 28.25 0.96 30.22
Not bad: roughly 36% faster. Again, though, I don't claim that this is generally faster than ifelse, only in this very specific instance. Cheers.
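As an aside (my addition, not from the original answer): base R's pmin() computes the same element-wise minimum in a single vectorized call and avoids the intermediate matrices that ifelse() allocates, which may help with memory as well as speed:
# hedged alternative to the ifelse line; untested at the 10000-row scale
H.denomm <- pmin(H.denom1, H.denom2)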
P.S. I forgot to mention that to use Rcpp you need to have Rtools installed, and during the install make sure the environment path variables are added for Rtools and gcc. On my machine those look like: c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin
Edit:
I just noticed that you were running into memory problems... I'm not sure whether you are running a 32- or 64-bit machine, but you probably just need to allow R to increase the amount of RAM it can use. I'll assume you are running 32-bit to be safe, so you should be able to let R take at least 2 GB of RAM. Give this a try: memory.limit(size=1900). The size is in megabytes, so I went for 1.9 GB just to be safe. I'd imagine this is plenty of memory for what you need.
Do you actually intend to do N x N independent ifelse(H.denom1 > H.denom2, ...) operations?
H.denomm <- ifelse(H.denom1>H.denom2,H.denom2,H.denom1)
If you really do, look for a library or, alternatively, a better decomposition.
If you told us in general terms what this code is trying to do, it would help us answer it.
I'm running a linear regression on a TIFF image. The image dimensions are:
ncol=6350, nrow=2077, nlayers=26
What I did before running the calculation is just read the TIFF image into R using
ndvi2000<-raster("img2000.tif")
Then I wrote the following script in the R console. The calculation has been running for more than 20 minutes and is still going. Is it normal for this to take so long on a big image? The regression script is:
time<-sort(sample(97:297, nlayers(ndvi2000)))
t.lm.pred<-function(x) {if (is.na(x[1])) {NA} else{predict(lm(x~time))}}
f.pred<-calc(ndvi2000,t.lm.pred)
The number of values you have is very large, so I'm not in the least surprised that it takes very long. Simply making a vector of random numbers the size of your TIFF file:
x = runif(6350 * 2077 * 26)
object.size(x) / (1024 * 1024)
2616.216
That is over 2.5 GB, and that is just one variable. A rule of thumb is that you need roughly three times as much RAM as the size of your dataset. So, assuming you load some more images, you'll need more than 10-20 GB of RAM. If you don't have enough RAM, your operating system will start swapping memory to disk, which makes your analysis veeeery slow.
I think it would be a good idea to rethink your analysis, either that or rent a 64 GB RAM EC2 instance. You could look only at the temporal average, or the spatial average, or only at specific locations, etc. Simply brute-forcing through all the values in your data might not be the best approach here.
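For instance, a minimal sketch of the suggested simplifications, using the object names from the question (calc and cellStats are raster-package aggregators; treat this as a starting point rather than a drop-in replacement):
library(raster)
mean.map    <- calc(ndvi2000, fun = mean, na.rm = TRUE)  # temporal mean per pixel (one layer)
mean.series <- cellStats(ndvi2000, stat = "mean")        # spatial mean per layer (26 values)
plot(time, mean.series)                                  # one short series to examine instead of a regression per pixel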