Dealing with very large vector in R

Dealing with very large vector in R - r

I am dealing with some large data in R:
I have a vector of normally distributed random numbers with length about 6400*50000, I need to sum every 4 elements in this vector to get a smaller one.
Is there any efficient way to do this in R?
My thoughts till now:
using a matrix with ncol=10 and use apply function-- failed because the matrix size is too big;
Try paralell and foreach package but no progress yet;
example code:
library(parallel)
library(RcppZiggurat)
library(doParallel)
library(foreach)
coreNums<-detectCores()
N1=6400
M=4
N2=N1/M
cl<-makeCluster(getOption("cl.cores", coreNums))
registerDoParallel(cl)
vector1<-zrnorm(N1*K)
vector2=foreach(i=1:(N2*K)) %dopar% {sum(vector1[M*(i-1)+1:M*i])}
vector2=unlist(vector)

I think colSums is the function you are looking for.
vector1 = rnorm(1000*50000)
dim(vector1) = c(10, length(vector1)/10)
vector2 = colSums(vector1)
In my opinion, the task is too simple for parallelization.
Also, I did not get any problems with the matrix size.
If you want to use less memory, here is the code doing the same in parts of 10,000 values in vector1.
vector2 = double(length(vector1)/10);
for( i in seq_len(length(vector2)/10000) ){
part = vector1[((i-1)*10000+1):(i*10000)]
dim(part) = c(10, 1000)
vector2[((i-1)*1000+1):(i*1000)] = colSums(part)
}

Related

Improving the computational time of R for loops

Trying to run a spline function by rows (690075) in a dataframe (Camera1) with 4096 columns (each column represents a position on the x axis) where the input variable to the function is a column in another dataset of the same length (test$vr) using a for loop; but I am having serious computational time issues.
I have tried converting the dataframe to a matrix and storing the output in a list amongst others, but to no avail. I have to do this for 2 other dataframes (camera2,camera3) of the same size.
Code
# Note camera1 and test$vr are of the same length
# Initialize
final.data1 <- data.frame()
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096),3)
system.time({
for (i in 1:690075) {
w1 = (as.numeric(colnames(camera1[-1]))) * (1.0 + test$vr[i]/299792.458)
my.data1<-as.data.frame(t(splinefun(x = w1, y = camera1[i,][-1])(y1)))
colnames(my.data1)=y1
final.data1 <- bind_rows(final.data1, my.data1)
} })
Running on a Ubuntu box with 344GB ram and 30 core Intel(R) Xeon(R) CPU E5-2695 # 2.30GHz
Any suggestions would be greatly appreciated.
Thank you.

Without seeing the data it's not easy to optimize your code, but I would start with something along the lines of the following.
final.data1 <- matrix(nrow = 690075, ncol = 4096)
#new wavelength range
y1 <- round(seq(from = 4714 , to = 4900, length.out = 4096), 3)
system.time({
w1 <- (as.numeric(colnames(camera1[-1]))) * (1.0 + test$vr/299792.458)
for (i in 1:690075) {
my.data1 <- t(splinefun(x = w1[i], y = camera1[i, ][-1])(y1))
final.data1[i, ] <- my.data1
}
})
final.data1 <- as.data.frame(final.data1)
colnames(final.data1) <- y1
Explanation:
I start by defining an object of class matrix to hold the results. I believe I got the dimensions of your final data.frame right. This reduces the running time because
Matrices are much faster than data frames, they are just folded vectors and indexing is fast. Data frames, on the contrary, are lists that can hold all types of data, numeric, character, logical, other lists, etc., and therefore it's slow to access their members.
By reserving the result's full memory in one operation saves R's memory management routines a lot of work. To extend final.data1 every iteration through the loop, is very time consuming.
w1 is computed outside the loop, taking advantage of R's vectorized nature. Besides, you were repeating the computation of as.numeric(colnames(camera1[-1])) 690k times!
Test this code and if it doesn't produce the same final result, just say so and I will see if I can do something to debug it.

First remove all instructions that can be done once, and put them outside the for loop. For example: colnames and as.numeric.
Second, try to vectorize. It seems that the w1 calculation can be vectorized, and so estimated once outside the for loop, by just removing the [i].
Third, initialize the final.data1 to the final dimension. For each row added to this data.frame, R will create a new data.frame with one more row, then remove the previous data.frame. It will take long time. Thus, final.data1 <- matrix(NA, ncol=length(y1), nrow=NROW).
And finally, if you want to use more than one core, try to replace the for loop by the parallelized foreach loop. It is possible if all rows are independant:
require(foreach)
require(doSNOW)
cl <- makeCluster(25, type="FORK") # FORK not usable in Windows
registerDoSNOW(cl) # register the cluster
clusterExport(cl, c("objects", "needed", "by", "each", "iteration"), envir=environment()) # for example y1, w1 and camera1
final.data1<- foreach(i=icount(NROW), .combine=rbind, inorder=FALSE) %dopar%
{
# your R code
}
stopCluster(cl)

Dealing with multidimensional output in parallel programming

I am currently working on a program to evaluate the out-of-sample performance of several forecasting models on simulated data. For those who are familiar with finance, it works exactly like backtesting a trading strategy, except that I would evaluate forecasts and not transactions.
Some of the objects I currently manipulate using for loops for this type of task are 7 dimensional arrays (dimensions stand for Monte Carlo replications, data generating processes, forecast horizons, 3 dimensions for model parameter selection, and one dimension for all the periods covered in the out-of-sample analysis). Obviously, it is painfully slow, so parallel computing has became a must for me.
My problem is: how do I keep track of more than 2 dimensions in R? Let's just show you using 'for loops' and only 3 dimensions what I mean:
x <- array(dim=c(2,2,2))
for (i in 1:2){
for (j in 1:2){
for (k in 1:2){
x[i,j,k] <- i+j+k
}
}
}
If I use something like 'foreach', I am very annoyed by the fact that, to my knowledge, available combining functionalities will return lists, matrices or vectors -- but not arbitrarily large multidimensional arrays. For instance:
library(doParallel)
library(foreach)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
x <- foreach(i=1:2, .combine=rbind)%:%
foreach(j=1:2, .combine=cbind)%:%
foreach(k=1:2, .combine=c)%dopar%{
i+j+k
}
Here, I basically combine results into vectors, then matrices and, finally, I pile up matrices by rows. Another option would be to use lists, or pile matrices through columns, but you can imagine the mess when you have 7 dimensions and millions of iterations to track.
I suppose I could also write my own 'combine' function and get the kind of output I want, but I suspect that I am not the first person to encounter this problem. Either there is a way to do exactly what I want, or someone here can point out a way to think differently about storing my results. It wouldn't be surprising that I am taking an absurdly inefficient path toward solving this problem -- I am an economist, not a data scientist, after all!
Any help would be greatly appreciated. Thanks in advance.

There is one available solution that I finally stumbled upon tonight. I can create an appropriate combination function along the dimension of my choice using the 'abind' function of the 'abind' package:
library(abind)
# Get the number of cores to use
no_cores <- max(1, detectCores()-1)
# Make cluster object using no_cores
cl <- makeCluster(no_cores)
# Initialize cluster for parallel computing
registerDoParallel(cl)
mbind <- function(...) abind(..., along=3)
x <- foreach(i=1:2, .combine=mbind)%:%
foreach(j=1:2, .combine=cbind)%:%
foreach(k=1:2, .combine=c)%dopar%{
i+j+k
}
I would still like to see if someone has other means of doing what I want to do, however. There might be many ways to do it and I am new to R, yet this solution is a distinct possibility.

What I would do and I already use in one of my packages, bigstatsr.
Take only one dimension and cut it in no_cores blocks. It should have sufficient iterations (e.g. 20 for 4 cores). For each iteration, construct part of the array you want and store it in a temporary file. The, use the content of these files to fill the whole array. By doing so, you fill only preallocated objects, which should be faster and easier.
Example:
x.all <- array(dim=c(20,2,2))
no_cores <- 3
tmpfile <- tempfile()
range.parts <- bigstatsr:::CutBySize(nrow(x.all), nb = no_cores)
library(foreach)
cl <- parallel::makeCluster(no_cores)
doParallel::registerDoParallel(cl)
foreach(ic = 1:no_cores) %dopar% {
ind <- bigstatsr:::seq2(range.parts[ic, ])
x <- array(dim = c(length(ind), 2, 2))
for (i in seq_along(ind)){
for (j in 1:2){
for (k in 1:2){
x[i,j,k] <- ind[i]+j+k
}
}
}
saveRDS(x, file = paste0(tmpfile, "_", ic, ".rds"))
}
parallel::stopCluster(cl)
for (ic in 1:no_cores) {
ind <- bigstatsr:::seq2(range.parts[ic, ])
x.all[ind, , ] <- readRDS(paste0(tmpfile, "_", ic, ".rds"))
}
print(x.all)
Instead of writing files, you could also directly return the no_cores parts of the array in foreach and combine them with the right abind.

parallel programming for a function taking two arguments from a list of data frames using R

I have a function called DTW in similarity measure package. It takes two matrix or data frame as its arguments and returns the Dynamic time warping distance. Those data frames are the longitudes and latitudes of trajectory.
My program looks like this and all the data frames like df1, df2,df3 and so on are available:
distance <- function(arg1,arg2) {
DTW(arg1, arg2)
}
for(i in 1:length(LIST)){
for(j in 1:length(LIST)){
a <- get(paste0("df",i))
b <- get(paste0("df",j))
ddist[i,j] <- distance(a,b)
print(ddist)
}
}
I am making a matrix ddist in which all the values are inserted returned by distance function. The program is working fine. I want to make it fast using parallel programming like parapply or parlapply function.

Here is a simple method to give you an idea of how to make it parallel
k<-length(LIST)
ddist<-matrix(0,k,k)
library("doParallel")
cl<-makeCluster(4,outfile='')
registerDoParallel(cl)
for(i in 1:k) {
a <- get(paste0("df",i))
ddist[i,]=foreach(j = 1:k , .combine='cbind' ,.export=paste0("df",1:k)) %dopar% {
b <- get(paste0("df",j))
distance(a,b)
}
}
stopCluster(cl)
Having said that , things to evaluate
if the distance function takes more than 2 seconds ,then only use
parallel
df1 , df2 etc may not be a good idea , store each
dataframe as df[[1]] , df[[2]]. Better than using the get function
if length(k) is very huge , then the amount of time taken for
transferring the exported df1,df2 etc is quite a long time , hence
try to hit the sweet spot of performance with various iterations
You can see the option of data.table where there is inplace edit,
use this instead of the ddist as it might be faster
If this code is called within a function , then you might also need to
export the function ddist , like .export=c(ddist,paste0("df",1:k))
Change the "4" in makeCluster to chose the cores you want, as a
thumbrule , keep it as detectCores()-1

Block bootstrap for genomic data

I am trying to implement a block bootstrap procedure, but I haven't figured out a way of doing this efficiently.
My data.frame has the following structure:
CHR POS var_A var_B
1 192 0.9 0.7
1 2000 0.8 0.3
2 3 0.21 0.76
2 30009 0.36 0.15
...
The first column is the chromosome identification, the second column is the position, and the last two columns are variables for which I want to calculate a correlation. The problem is that each row is not entirely independent to one another, depending on the distance between them (the closer the more dependent), and so I cannot simply do cor(df$var_A, df$var_B).
The way out of this problem that is commonly used with this type of data is performing a block bootstrap. That is, I need to divide my data into blocks of length X, randomly select one row inside that block, and then calculate my statistic of interest. Note, however, that these blocks need to be defined based on the column POS, and not based on the row number. Also, this procedure needs to be done for each chromosome.
I tried to implement this, but I came up with the slowest code possible (it didn't even finish running) and I am not 100% sure it works.
x = 1000
cors = numeric()
iter = 1000
for(j in 1:iter) {
df=freq[0,]
for (i in unique(freq$CHR)) {
t = freq[freq$CHR==i,]
fim = t[nrow(t),2]
i = t[1,2]
f = i + x
while(f < fim) {
rows = which(t$POS>=i & t$POS<f)
s = sample(rows)
df = rbind(df,t[s,])
i = f
f = f + x
}
}
cors = c(cors, cor(df$var_A, df$var_B))
}
Could anybody help me out? I am sure there is a more efficient way of doing this.
Thank you in advance.

One efficient way to try would be to use the 'boot' package, of which functions include parallel processing capabilities.
In particular, the 'tsboot', or time series boot function, will select ordered blocks of data. This could work if your POS variable is some kind of ordered observation.
The boot package functions are great, but they need a little help first. To use bootstrap functions in the boot package, one must first wrap the statistic of interest in a function which includes an index argument. This is the device the bootstrap generated index will use to pass sampled data to your statistic.
cor_hat <- function(data, index) cor(y = data[index,]$var_A, x = data[index,]$var_B)
Note cor_hat in the arguments below. The sim = "fixed", l = 1000 arguments, which indicate you want fixed blocks of length(l) 1000. However, you could do blocks of any size, 5 or 10 if your trying to capture nearest neighbor dynamics moving over time. The multicore argument speaks for itself, but it maybe "snow" if you are using windows.
library(boot)
tsboot(data, cor_hat, R = 1000, sim = "fixed", l = 1000, parallel = "multicore", ncpus = 4)
In addition, page 194 of Elements of Statistical Learning provides a good example of the framework using the traditional boot function, all of which is relevant to tsboot.
Hope that helps, good luck.
Justin
r

I hope I understood you right:
# needed for round_any()
library(plyr)
res <- lapply(unique(freq$CHR),function(x){
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
})
This should return a list with an entry for each chromosome. Within each entry, there's an observation per 1kb-block if present. The number of blocks is determined by the maximum POS value.
EDIT:
library(doParallel)
library(foreach)
library(plyr)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res <- foreach(x=unique(freq$CHR),.packages = 'plyr') %dopar% {
freq_sel <- freq[freq$CHR==x,]
blocks <- lapply(seq(1,round_any(max(freq_sel$POS),1000,ceiling),1000), function(ix) freq_sel[freq_sel$POS > ix & freq_sel$POS <= ix+999,])
do.call(rbind,lapply(blocks,function(x) if (nrow(x) > 1) x[sample(1:nrow(x),1),] else x))
}
stopCluster(cl)
This is a simple parallelisation with foreach on each Chromosome. It could be better to restructure the function and base the parallel processing on another level (such as the 1000 iterations or maybe the blocks). In any case, I can just stress again what I was saying in my comment: Before you work on parallelising your code, you should be sure that it's as efficient as possible. Meaning you might want to look into the boot package or similar to get an increase in efficiency. That said, with the number of iterations you're planning, parallel processing might be useful once you're comfortable with your function.

So, after a while I came up with an answer to my problem. Here it goes.
You'll need the package dplyr.
l = 1000
teste = freq %>%
mutate(w = ceiling(POS/l)) %>%
group_by(CHR, w) %>%
sample_n(1)
This code creates a new variable named w based on the position in the genome (POS). This variable w is the window to which each row was assigned, and it depends on l, which is the length of your window.
You can repeat this code several times, each time sampling one row per window/CHR (with the sample_n(1)) and apply whatever statistic of interest that you want.

Need help on combine function in a Parallel Simulation study using doMC

I want to ask for some help on writing a combine function for foreach(). Consider the function below:
library(mvtnorm)
library(doMC)
mySimFunc <- function(){
myNum <- runif(1)
myVec <- rnorm(10)
myMat <- rmvnorm(5, rep(0, 3), diag(3))
myListRslt <- list("myNum" = myNum, "myVec" = myVec, "myMat" = myMat)
return (myListRslt)
}
Now I'd like to run the code above for 1000 times using foreach() %dopar% and in each iteration I'd like to:
return myNum as is
get average of myVec and return it
get colMeans() of myMat and return it.
I'd like foreach() %dopar% to return a final list including:
a vector of length 1000 including 1000 myNum each corresponding to an iteration
a vector of length 1000 including 1000 average of myVec in each iteration
a matrix with 1000 rows where each row includes colMeans of myMat in that iteration
My Ideal solution
My ideal solution is o find a way that foreach() acts exactly like for so that I can simply define:
myNumRslt <- NULL
myVecRslt <- NULL
myMatRslt <- NULL
# and then simply aggregate result of each iteration to the variables above as:
foreach(i = 1:1000) %dopar%{
rslt <- mySimFunc()
myNumRslt <- c(myNumRslt, rslt$myNum)
myVecRslt <- c(myVecRslt, mean(rslt$myVec))
myMatRslt.tmp <- colMeans(rslt$myMat)
myMatRslt <- rbind(myMatRslt, myMatRslt.tmp)
}
BUT, unfortunately seems that it's not possible to do that with foreach() so then I think the only solution is to write a combine function that does similar to result aggregation above.
Challenge
1) How could I write a combine function that returns what I explained above?
2) When we do %dopar% (suppose using doMC package), does doMC distribute each iteration to a CPU or it goes further and divide each iteration to further pieces and distribute them?
3) Is there any better (more efficient) way than using doMC and foreach() ?
idea's
In this question Brian mentioned a brilliant way to deal with lists including numeric values. In my case, I have numeric values as well as vectors and matrices. I don't know how to extend Brian's idea in my case.
Thanks very much for your help.

Edit
Cleaned up, generalizable solution using .combine:
#modify function to include aggregation
mySimFunc2 <- function(){
myNum <- runif(1)
myVec <- mean(rnorm(10))
myMat <- colMeans(rmvnorm(5, rep(0, 3), diag(3)))
myListRslt <- list("myNum" = myNum, "myVec" = myVec, "myMat" = myMat)
return (myListRslt)
}
#.combine function
MyComb1 <- function(...) {
lst=list(...)
vec<-sapply(1:length(lst), function (i) return(lst[[i]][[1]] ))
vecavg<-sapply(1:length(lst),function (i) return(lst[[i]][[2]] ))
colmeans<-t(sapply(1:length(lst), function (i) return(lst[[i]][[3]])))
final<-list(vec,vecavg,colmeans)
names(final)<-c("vec","vecavg","colmeans")
return(final)
}
library(doParallel)
cl <- makeCluster(3) #set cores
registerDoParallel(cl)
foreach(i=1:1000,.export=c("mySimFunc2","MyComb1"),.combine=MyComb1,
.multicombine=TRUE,.maxcombine=1000, .packages=c("mvtnorm"))%dopar%{mySimFunc2()}
You should now have a list output containing the desired three objects, which I've titled respectively as vec, vecavg, and colmeans. Note you must set .maxcombine to the number of iterations if iterations are greater than 100.
As a side note, it does not make sense to parallelize for this example task, although I'm guessing the real task may be more complex.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Dealing with very large vector in R - r

Related

Improving the computational time of R for loops

Dealing with multidimensional output in parallel programming

parallel programming for a function taking two arguments from a list of data frames using R

Block bootstrap for genomic data

Need help on combine function in a Parallel Simulation study using doMC

Categories

Resources