Working with multiple cores and sparse matrices in R

I am working on a project that requires large matrices with a large number of zeros. Unfortunately, as some of these matrices can have more than 1e10 elements, working with "standard" R matrices is not an option due to RAM constraints. I also need to work on multiple cores, as the computation would otherwise take far too long.
So far, I have been working with the foreach package, and converted the results (which come in standard matrices) to sparse matrices afterwards. I can't help but think that there must be a smarter way.
Here is a minimal example of what I have been doing so far:
library(Matrix)
library(foreach)
library(doSNOW)   # also attaches snow, which provides makeSOCKcluster

# 'lambda' (a vector of means) and 'iter' (the number of draws) are
# defined elsewhere in the real code
cl <- makeSOCKcluster(8)
registerDoSNOW(cl)
Mat <- foreach(j=1:length(lambda), .combine='cbind') %dopar% {
  replicate(iter, rpois(n=1, lambda[j]))
}
Mat <- Matrix(Mat, sparse=TRUE)   # convert the dense result afterwards
stopCluster(cl)
The lambdas are all quite small, so that only every 5th element or so is different from zero, making it sensible to store the results in a sparse matrix.
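To get a sense of the savings, here is a quick sketch (not from the original code) comparing dense and sparse storage for draws with a small lambda:
library(Matrix)
m <- matrix(rpois(1e6, lambda=0.2), ncol=100)  # ~82% of entries are zero
print(object.size(m), units="MB")                       # dense storage
print(object.size(Matrix(m, sparse=TRUE)), units="MB")  # sparse dgCMatrix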
Unfortunately, it has now become necessary to increase the number of iterations from 1e6 to at least 1e7, so that the matrix produced by the foreach loop is too large to fit into 8 GB of RAM. What I now want to do is split the task into steps of 1e6 iterations each, and combine these into a single sparse matrix.
I now have the following as an idea:
library(Matrix)
library(snow)

cl <- makeSOCKcluster(8)

iter <- 1e6
steps <- 1e5
numsteps <- iter / steps

# 'x' is the column passed in by parApply; it is ignored here
draws <- function(x, lambda, steps){
  replicate(n=steps, rpois(n=1, lambda=lambda))
}

for(i in 1:numsteps){
  # a sparse placeholder, just to give parApply something to iterate over
  Mat <- Matrix(0, nrow=steps, ncol=96, sparse=TRUE)
  Mat <- Matrix(
    parApply(cl=cl, X=Mat, MARGIN=2, FUN=draws, lambda=0.2, steps=steps),
    sparse=TRUE)
  # rBind() comes from the Matrix package; recent versions accept plain rbind()
  if(!exists("fullmat")) fullmat <- Mat else fullmat <- rBind(fullmat, Mat)
  rm(Mat)
}
stopCluster(cl)
It works fine, but I had to fix lambda to a single value. For my application, I need the values in the jth column to come from a Poisson distribution with mean equal to the jth element of the lambda vector. This obviously worked fine in the foreach loop, but I have yet to find a way to make it work in an apply loop.
My questions are:
1. Is it possible to have the apply function "know" which column it is operating on, and pass a corresponding argument to the function?
2. Is there a way to work with foreach and sparse matrices without first creating a standard matrix and then converting it into a sparse one?
3. If neither of the above, is there a way to manually assign tasks to the slave R processes? That is, could I tell one process to work on column 1, another on column 2, and so on, each creating a sparse vector, and only combine these in the last step?

I was able to find a solution to my problem.
In my case, I am able to define a unique ID for each of the columns, and can address the parameters by that. The following code should illustrate what I mean:
library(snow)
library(Matrix)

iter <- 1e6
steps <- 1e5
numsteps <- iter / steps

# define a unique id per column
SZid <- seq(from=1, to=10, by=1)
# generate random parameters (one per id) so the example is self-contained
SZlambda <- runif(n=10, min=0, max=.5)
SZmu <- runif(n=10, min=10, max=15)
SZsigma <- runif(n=10, min=1, max=3)

cl <- makeSOCKcluster(8)
clusterExport(cl, list=c("SZlambda", "SZmu", "SZsigma"))

MCSZ <- function(SZid, steps){ # Monte Carlo simulation
  # each id looks up its own parameters
  lambda <- SZlambda[SZid]; mu <- SZmu[SZid]; sigma <- SZsigma[SZid]
  replicate(steps, sum(rlnorm(meanlog=mu, sdlog=sigma,
                              n=rpois(n=1, lambda))))
}

for(i in 1:numsteps){
  Mat <- Matrix(
    parSapply(cl, X=SZid, FUN=MCSZ, steps=steps), sparse=TRUE)
  if(!exists("LossSZ")) LossSZ <- Mat else LossSZ <- rBind(LossSZ, Mat)
  rm(Mat)
}
stopCluster(cl)
The trick is to apply the function not over the matrix, but over a vector of unique ids that line up with the indices of the parameters.
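As a minimal serial sketch of the same trick (values made up for illustration): sapply() walks over the ids, and each id picks out its own parameter, which plain apply() over a matrix cannot do, since the function never sees which column it is working on.
lambda <- c(0.1, 0.2, 0.3)
ids <- seq_along(lambda)
draws <- sapply(ids, function(id) rpois(n=5, lambda=lambda[id]))
# column j of 'draws' now uses lambda[j]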

Related

R aborts when using function DIST (110 GB vector)

I need to run a hierarchical clustering algorithm in R on a dataset with 173000 rows and 17 columns.
When I run the function dist() on the dataset, R aborts. I have also tried it on a Windows PC, where the error message is "cannot allocate vector of size 110.5 Gb".
Both my Mac and my Windows PC have 4 GB of RAM.
Is there a way to still do this in R? I know hierarchical algorithms are not the best for large datasets, but it is required by a university assignment.
Thank you
The problem can be solved by writing a function that computes the pairwise Euclidean distances between the columns of the data set, assumed below to be in tabular form. For other distances, a similar function can be written.
dist2 <- function(X){
  cmb <- combn(seq_len(ncol(X)), 2)   # all pairs of column indices
  d <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(X))
  if(!is.null(colnames(X)))
    dimnames(d) <- list(colnames(X), colnames(X))
  for(i in seq_len(ncol(cmb))){
    ix <- cmb[1, i]
    iy <- cmb[2, i]
    res <- sqrt(sum((X[, ix] - X[, iy])^2))  # Euclidean distance
    d[ix, iy] <- d[iy, ix] <- res            # fill both triangles
  }
  diag(d) <- 0   # a column's distance to itself is zero
  d
}
Now test the function with a data.frame of the dimensions in the question.
set.seed(2021)
m <- replicate(17, rnorm(173000))
m <- as.data.frame(m)
dist2(m)
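As noted above, only the line computing res needs to change for another metric; for example, a Manhattan-distance variant would use (a sketch):
res <- sum(abs(X[, ix] - X[, iy]))   # Manhattan instead of Euclidean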
First and foremost, it would be very nice of you to provide a reprex (reproducible example). Make sure to do so next time.
Speaking of the issue, you can use the sample_frac() function from the dplyr package (part of the tidyverse). For example, sample_frac(your_data, .5) will sample 50% of your data frame. That reduces the size of the data to be clustered, making the task easier for your laptop.
The other way is to extend the memory limit with memory.limit(size = n), where n is a number in megabytes (this works on Windows only).
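A minimal sketch of the sampling route (your_data stands in for the 173000-row data set; the 5% fraction is an arbitrary choice):
library(dplyr)
set.seed(1)
small <- sample_frac(your_data, 0.05)  # keep 5% of the rows
hc <- hclust(dist(small))              # ~8650 rows needs roughly 0.3 GB for dist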

Matrix computation with for loop

I am a newcomer to R, having migrated from GAUSS because of license verification issues.
I want to speed up the following code, which creates an n×k matrix A. Given the n×1 vector x and the k-dimensional parameter vectors mu and sig, A is created as A[i,j] = dnorm(x[i], mu[j], sig[j]). The following code works fine for small sizes n=40, k=4, but slows down significantly when n is around 10^6 and k is about n^(1/3).
I am doing a simulation experiment to verify bootstrap validity, so I need to repeatedly compute the matrix A (#simulations × #bootstraps times), and it becomes a little time consuming as I want to experiment with many different values of n and k. I have vectorized the code as much as I could (thanks to the vector argument of dnorm), but can I ask for more speed-up?
Preemptive thanks for any help.
x = rnorm(40)
mu = c(-1,0,4,5)
sig = c(2^2,0.5^2,2^2,3^2)
n = length(x)
k = length(mu)
A = matrix(NA,n,k)
for(j in 1:k){
  A[,j] = dnorm(x, mu[j], sig[j])
}
Your method can be put into a function like this
A.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  A <- matrix(NA, n, k)
  for(j in 1:k) A[,j] <- dnorm(x, mu[j], sig[j])
  A
}
and it's clear that you are filling the matrix A column by column.
R stores the entries of a matrix columnwise (just like Fortran).
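A quick illustration of that layout:
matrix(1:6, nrow=2)   # filled column by column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6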
This means that the matrix can be filled with a single call of dnorm, using suitable repetitions of x, mu, and sig. The vector z below contains the columns of the desired matrix stacked on top of each other; the matrix to be returned can then be formed from that vector simply by specifying the number of rows and columns. See the following function:
B.fill <- function(x, mu, sig) {
  k <- length(mu)
  n <- length(x)
  z <- dnorm(rep(x, times=k), rep(mu, each=n), rep(sig, each=n))
  B <- matrix(z, nrow=n, ncol=k)
  B
}
Let's make an example with your data and test this as follows:
N <- 40
set.seed(11)
x <- rnorm(N)
mu <- c(-1,0,4,5)
sig <- c(2^2,0.5^2,2^2,3^2)
A <- A.fill(x,mu,sig)
B <- B.fill(x,mu,sig)
all.equal(A,B)
# [1] TRUE
Note that no recycling is involved here: all three rep() calls produce vectors of length n*k, so the function works for any n and k.
Addition
As noted in the comments, B.fill is quite slow for large values of n.
The reason lies in the rep(..., each=...) construct.
So is there a way to speed up A.fill?
I tested this function:
C.fill <- function(x, mu, sig) {
  k <- length(mu)
  sapply(1:k, function(j) dnorm(x, mu[j], sig[j]), simplify=TRUE)
}
This function is about 20% faster than A.fill.
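A rough way to compare the three versions yourself (a sketch; n is kept below the question's 10^6 so the check runs quickly, and timings will vary by machine):
n <- 1e5
k <- round(n^(1/3))   # k of about n^(1/3), as in the question
x <- rnorm(n)
mu <- rnorm(k)
sig <- runif(k, 0.5, 3)
system.time(A.fill(x, mu, sig))
system.time(B.fill(x, mu, sig))
system.time(C.fill(x, mu, sig))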

Parallelizing a double for loop in R

I've been using the parallel package in R to do loops like:
library(parallel)
cl <- makeCluster(getOption("cl.cores", 6))
result <- parSapply(cl, 1:k, function(i){ ... })
Is there a natural way to parallelize a nested for loop in R using this package? Or perhaps another package? I know there are several ways to implement parallelism in R.
My loop looks something like this. I simplified a bit but it gets the message across:
k <- 100000   # number of iterations
sigma <- seq(from=0.1, to=10, by=0.2)
sup_mse <- matrix(0, nrow=k, ncol=length(sigma))  # k and sigma must be defined first
for(i in 1:k){
  for(j in 1:length(sigma)){
    sup <- supsmu(x, y)
    sup_mse[i,j] <- mean((m(x)-sup$y)^2)
  }
}
Thanks for making the reproducible example! I prefer snowfall for my parallel processing, so here's how it looks in there.
install.packages('snowfall')
require(snowfall)
### wasn't sure what you were using for x or y
set.seed(1001)
x <- sample(seq(1,100), 20)
y <- sample(seq(1,100), 20)
k <- 100
sigma <- seq(0.1, 10, 0.2)
### make a local cluster on 4 cores and give each core the data it will need
sfInit(parallel=TRUE, cpus=4, type="SOCK", socketHosts=rep("localhost",4))
sfExport('x','y','k','sigma')
answers <- sfSapply(seq(1,k), function(M)
  sapply(seq(1,length(sigma)), function(N)
    mean((mean(x)-supsmu(x,y)$y)^2)  ## wasn't sure what you meant by m(x), so I guessed mean
  )
)
sup_mse <- t(answers)  ## gives a matrix with length(sigma) columns and k rows
sfStop()
I remember reading somewhere that you only want to use sfSapply for the outermost loop and then use the regular apply functions inside it. Hope this helps!

Parallel Monte Carlo Simulation in R using snowfall

I am trying to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by its two shape parameters alpha and beta.
I now draw 100,000 samples from every distribution. As the final result, I want an ordering of the distributions by their probability of producing the highest value in each sample draw.
My first approach was to use lapply to generate a matrix of N * NDRAWS numeric values, which consumed too much memory once N got beyond 10,000 (10,000 * 100,000 * 8 bytes).
So I decided on a sequential approach: order every single draw, cumulatively sum the orders over all draws, and obtain the final order, as shown in the example below:
set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T), beta=sample(1:200, N, replace=T))
vec <- vector(mode = "integer", length = N )
for(i in 1:NDRAWS){
  # order probabilities after a single draw for every theta
  pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta))
  # sum up winning positions for every theta
  vec[pos] <- vec[pos] + 1:N
}
# order thetas
ord <- order(-vec)
df[ord,]
This consumes only N * 4 bytes of memory, as there is no giant matrix, just a single vector of length N. My question now is: how can I speed this operation up using snowfall (or any other multicore package), taking advantage of my 4 CPU cores instead of using just one?
# parallelize using snowfall pckg
library(snowfall)
sfInit( parallel=TRUE, cpus=4, type="SOCK")
sfLapply( 1:NDRAWS, function(x) ?????? )
sfStop()
Any help is appreciated!
This can be parallelized in the same way that one would parallelize random forest or bootstrapping. You just perform the sequential code on each of the workers but with each using a smaller number of iterations. That is much more efficient than splitting each iteration of the for loop into a separate parallel task.
Here's your complete example converted to use the foreach package with the doParallel backend:
set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T),
                 beta=sample(1:200, N, replace=T))
library(doParallel)
nworkers <- detectCores()
cl <- makePSOCKcluster(nworkers)
clusterSetRNGStream(cl, 1234)   # seed the workers' RNG streams; takes a single integer seed
registerDoParallel(cl)
# each worker processes a chunk of the draws and returns its own tally;
# .combine='+' then adds the per-worker vectors together
vec <- foreach(ndraws=rep(ceiling(NDRAWS/nworkers), nworkers),
               .combine='+') %dopar% {
  v <- integer(N)
  for(i in 1:ndraws) {
    pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta))
    v[pos] <- v[pos] + 1:N
  }
  v
}
ord <- order(-vec)
df[ord,]
Note that this gives different results than the sequential version because different random numbers are generated by the workers. I used the parallel random number support provided by the parallel package since that is good practice.
Well, the functionality is there. I'm not sure though what you'd be returning with each iteration.
Perhaps try this?
myFunc <- function(xx, N) {
  pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta))
  v <- integer(N)
  v[pos] <- 1:N   # scatter the ranks back to the positions they belong to
  v
}
Using doParallel will then let you add the results together:
require(doParallel)
registerDoParallel(cores=4)
vec <- foreach(i=1:NDRAWS, .combine='+') %dopar% myFunc(i, N)

Fill matrix with loop

I am trying to create an n by k matrix with k MVN covariates using a loop.
Quite simple, but it is not working so far... Here is my code:
n=1000
k=5
p=100
mu=0
sigma=1
x=matrix(data=NA, nrow=n, ncol=k)
for (i in 1:k){
  x[[i]] = mvrnorm(n, mu, sigma)
}
What's missing?
I see several things here:
You may want to set the random seed for replicability (set.seed(20430)). This means that every time you run the code, you will get exactly the same set of pseudorandom variates.
Next, your data will just be independent variates; they won't actually have any multivariate structure (although that may be what you want). In general, if you want to generate multivariate data, you should use mvrnorm() from the MASS package (see ?mvrnorm).
As a minor point, if you want standard normal data, you don't need to specify mu = 0 and sigma = 1, as those are the default values for rnorm().
You don't need a loop to fill a matrix in R; just generate as many values as you like and pass them directly via the data= argument of the matrix() function. If you really were committed to using a loop, you would probably use a double loop, so that you are looping over the columns and, within each column, looping over the rows. (Note that this is a very inefficient way to code in R, although I do things like that all the time ;-).
Lastly, I can't tell what p is supposed to be doing in your code.
Here is a basic way to do what you seem to be going for:
set.seed(20430)
n = 1000
k = 5
dat = rnorm(n*k)
x = matrix(data=dat, nrow=n, ncol=k)
If you really wanted to use loops you could do it like this:
mu = 0
sigma = 1
x = matrix(data=NA, nrow=n, ncol=k)
for(j in 1:k){
  for(i in 1:n){
    x[i,j] = rnorm(1, mu, sigma)
  }
}
Define the matrix first:
E <- matrix(data=0, nrow=10, ncol=10)
Then run two loops, iterating i over the rows and j over the columns; mine is an exchangeable correlation structure:
for (i in 1:10){
  for (j in 1:10){
    if (i==j) E[i,j] = 1 else E[i,j] = 0.6
  }
}
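For what it's worth, the same exchangeable structure can be built without any loops:
E <- matrix(0.6, nrow=10, ncol=10)  # common off-diagonal correlation
diag(E) <- 1                        # unit diagonal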
A <- c(2,3,4,5)  # in your case, the row terms
B <- c(3,4,5,6)  # in your case, the column terms
x <- matrix(NA, nrow=length(A), ncol=length(B))
for (i in 1:length(A)){
  for (j in 1:length(B)){
    x[i,j] <- A[i]*B[j]  # or a similarity function, simi(A[i], B[j])
  }
}
x  # the matrix is filled
I was answering from the perspective of my own problem.
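As a compact alternative, outer() produces the same fill without explicit loops; for a custom similarity function taking two scalars, something like outer(A, B, Vectorize(simi)) should work (simi being the hypothetical function from the comment above):
A <- c(2,3,4,5)
B <- c(3,4,5,6)
x <- outer(A, B)   # x[i,j] == A[i]*B[j], same as the double loop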
