Difference matrix for a large vector in R

I've got a large vector (length: 250k) and want to calculate the difference between each element and all the others.
One way I've done it on a smaller size is this:
n = 1000
set.seed(35)
values = sample(1:1e3, n, replace = TRUE)
mat_temp = matrix(values, n, n, byrow=TRUE) - matrix(values, n, n, byrow=FALSE)
mat_temp = abs(mat_temp)
It's not ideal because I really only need the lower triangle below the diagonal (the matrix of absolute differences is symmetric).
And the main issue: how can I efficiently run it for the full 250k x 250k matrix (n = 250000)? With 16 GB of RAM, is that possible at all? I tried bigmemory, but it fails to initialise such a big matrix.
Is there a way (I only need the differences)?
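For scale: a dense 250000 x 250000 double matrix needs 250000^2 x 8 bytes, about 500 GB, and even the lower triangle alone is about 250 GB, so it cannot be held in 16 GB of RAM. A minimal sketch of one common workaround is to stream over the differences in column blocks, so only one slab is in memory at a time (the block size and the placeholder processing step are assumptions):
n <- 250000
set.seed(35)
values <- sample(1:1e3, n, replace = TRUE)
block_size <- 500 # assumed; each slab is then 250000 x 500 doubles, about 1 GB
for (start in seq(1, n, by = block_size)) {
  idx <- start:min(start + block_size - 1, n)
  slab <- abs(outer(values, values[idx], "-")) # n x length(idx) block of absolute differences
  # summarise or write out the slab here instead of keeping it
}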

Filling a matrix with for-loop output

I want to fill a matrix with data simulated by using a for loop containing the rbinom function. This loop executes the rbinom function 100 times, thus generating a different outcome every run. However, I can't find a way to get these outcomes into a matrix for further analysis. When I assign the for loop to an object, this object appears empty in the environment and thus can't be used in the matrix ('data' must be of a vector type, was 'NULL').
When I don't include the rbinom function in a for loop, it can be assigned to an object and I'm able to use the output in the matrix. Every column, however, contains the exact same sequence of numbers. When only running the for loop containing the rbinom function, I do get different sequences, as it runs the rbinom function 100 times instead of once. I just don't know how to integrate the loop into the matrix.
The two pieces of code I have:
n = 100
size = 7
loop_vill <- for (i in 1:100) {
  print(rbinom(n=n, size=size, prob=0.75)) #working for-loop
}
vill <- rbinom(n=n, size=size, prob=0.75)
sim_data_vill <- matrix(data=vill, nrow=length(loop_vill), ncol=100)
#creates a matrix in which all columns are exact copies; should be solved
#when able to use the outputs of loop_vill
sim_data_vill
When calling sim_data_vill, it (logically) contains a matrix of 100 rows and 100 columns, with all columns being the same. However, I would like to see a matrix with all columns being different (thus containing the output of a new run of the rbinom-function each time).
Hello, as far as I can see you have a few problems.
You are not running the for loop for each column (only the one vector is saved in vill).
The for loop only prints the rbinom draws instead of storing them; a for loop itself returns NULL, which is why loop_vill appears empty.
Now there are a few ways to achieve what you want. (Scroll to the last example for the efficient way.)
Method 1: for loop
Using your idea, we can use a for loop. The best approach is to pre-allocate an empty matrix first and fill it in with the for loop:
nsim <- 100 # how many rbinom simulations (columns) we want
n <- 100000
size = 7
prob = 0.75
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for (i in seq(nsim)) # iterate from 1 to nsim
  sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) # fill in 1 column at a time
Now this will work, but is a bit slow, and requires a whopping 3 lines of code for the simulation part!
Method 2: apply
We can remove the for loop and the pre-assigned matrix by using one of the myriad apply-like functions. One such function is replicate. This reduces the massive 3 lines of code to:
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
Huh, that was short! But can we do even better? Actually, running functions such as rbinom multiple times can be rather slow and costly.
Method 3: using vectorized functions (very fast)
One word you will hear whispered (or shouted) when it comes to programming in R is vectorized. Basically, every function call induces overhead, so if you work with a vectorized function and call it once, you incur the overhead only once instead of many times. All distribution functions in R, such as rbinom, are vectorized. So what if we just do all the simulation in one go?
sim_data_vill_vectorized_functions <- matrix(rbinom(nsim * n, size, prob), ncol = nsim, nrow = n, byrow = FALSE) #perform all simulations in 1 rbinom call, and fill in 1 matrix.
So let's quickly check how much faster this is compared to using a for loop or apply. This can be done using the microbenchmark package:
library(microbenchmark)
microbenchmark(for_loop = {
  sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
  for (i in seq(nsim)) # iterate from 1 to nsim
    sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) # fill in 1 column at a time
},
apply = {
  sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
},
vectorized = {
  sim_data_vill_vectorized <- matrix(rbinom(nsim * n, size = size, prob = prob), ncol = nsim, nrow = n, byrow = FALSE)
}
)
Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
   for_loop 751.6121 792.5585 837.5512 812.7034 848.2479 1058.4144   100
      apply 752.4156 781.3419 837.5626 803.7456 901.6601 1154.0365   100
 vectorized 696.9429 720.2255 757.7248 737.6323 765.3453  921.3982   100
Looking at the median times, running all the simulations at once is roughly 70 ms faster than using a for loop. Here that is not a big deal, but in other cases it might be (reverse n and nsim, and you will start seeing the call overhead become a big part of the calculation; a sketch of this follows below).
Even when it is not a big deal, using vectorized computations wherever they pop up is always preferable: it makes code more readable and avoids unnecessary bottlenecks that have already been optimized in the underlying implementation.
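To see that overhead effect directly, here is a quick sketch with the shapes reversed (few observations per call, many calls); size and prob are reused from above, and times = 10 just keeps the benchmark short:
library(microbenchmark)
n <- 100 # few observations per call
nsim <- 100000 # many separate calls
microbenchmark(
  many_calls = replicate(nsim, rbinom(n, size = 7, prob = 0.75)),
  one_call = matrix(rbinom(nsim * n, size = 7, prob = 0.75), nrow = n),
  times = 10
)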

Same sparse matrix, different object sizes

I was working on creating some adjacency matrices and stumbled on a weird issue.
I have one matrix full of 1s and 0s. I want to multiply its transpose by it (t(X) %*% X) and then run some other stuff. Since the routine started to get really slow, I converted the matrix to a sparse one, which obviously made things faster.
However, the sparse matrix ends up twice the size depending on when I convert it to sparse format.
Here is some generic example that runs into the same issue
set.seed(666)
nr = 10000
nc = 1000
bb = matrix(rnorm(nc * nr), ncol = nc, nrow = nr)
bb = apply(bb, 2, function(x) as.numeric(x > 0))
# Slow and unintelligent method
op1 = t(bb) %*% bb
op1 = Matrix(op1, sparse = TRUE)
# Fast method
B = Matrix(bb, sparse = TRUE)
op2 = t(B) %*% B
# weird
identical(op1, op2) # returns FALSE
object.size(op2)
#12005424 bytes
object.size(op1) # almost half the size
#6011632 bytes
# now it works...
ott1 = as.matrix(op1)
ott2 = as.matrix(op2)
identical(ott1, ott2) # returns TRUE
Then I got curious. Anybody knows why this happens?
The class of op1 is dsCMatrix, whereas op2 is a dgCMatrix. dsCMatrix is a class for symmetric matrices, which therefore only needs to store the upper half plus the diagonal (roughly half as much data as the full matrix).
The Matrix() call that converts a dense to a sparse matrix is smart enough to choose a symmetric class for symmetric matrices, hence the saving. You can see this in the code of the function Matrix, which explicitly performs the test isSym <- isSymmetric(data).
%*%, on the other hand, is optimised for speed and does not perform this check.
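If you want the compact symmetric storage without a round trip through a dense matrix, a minimal sketch (forceSymmetric is part of the Matrix package; crossprod should likewise return a symmetric sparse class directly, though that is worth verifying on your version):
library(Matrix)
op3 <- forceSymmetric(op2) # dgCMatrix -> dsCMatrix; stores one triangle only
object.size(op3)           # roughly half of object.size(op2)
op4 <- crossprod(B)        # t(B) %*% B in one call
class(op4)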

Creating this covariance matrix manually in R

I have two samples, each of length 1000, and I need to construct covariance matrices for these two samples.
Each sample is made up of 10 clusters of size 100. Now, each unit has a variable attached to it that identifies the cluster it came from, and the covariance between two units will be X if they are from the same cluster, or Y if they are from different clusters.
So I need a way to construct a covariance matrix made of 100x100 blocks of X along the diagonal (one block per cluster) and Y everywhere else.
Is there any method of doing this easily? The matrix is far too big to create it by manually inputting the data, and the procedure needs to be repeated thousands of times within a loop.
You mean something like this?
m <- c(rep(1, 100), rep(0, 300),
       rep(0, 100), rep(1, 100), rep(0, 200),
       rep(0, 200), rep(1, 100), rep(0, 100),
       rep(0, 300), rep(1, 100))
m <- matrix(m, nrow = 4, byrow = TRUE) # nrow = 4 is needed; otherwise matrix() returns a single 1600 x 1 column
m
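To go from this 4 x 400 cluster-indicator matrix to the covariance matrix the question asks for, a hedged sketch (the X and Y values here are placeholders):
X <- 1; Y <- 0.2            # assumed covariance values
same <- crossprod(m)        # t(m) %*% m: 400 x 400, with 1 where two units share a cluster
sigma <- Y + (X - Y) * same # X inside the diagonal blocks, Y everywhere else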
I managed to find a straightforward solution requiring no extra packages, so I'll post it here in case other people encounter the same problem.
The easiest way for me was to create a double loop that goes through each index of the matrix and manually enters the item. Obviously this is very computationally expensive, so if you need to do this many times I'd recommend a much more efficient approach (a vectorized sketch follows the code below).
m <- matrix(NA, nrow = 1000, ncol = 1000)
for (i in 1:1000) {
  for (j in 1:1000) {
    if (sampleA$cluster[i] == sampleA$cluster[j]) {
      m[i, j] <- "X"
    } else {
      m[i, j] <- "Y"
    }
  }
}
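As for the more efficient approach mentioned above, a vectorized sketch using outer (the cluster labels and the numeric X/Y values are assumptions standing in for sampleA$cluster):
cluster <- rep(1:10, each = 100)      # 10 clusters of 100 units each, as in the question
X <- 1; Y <- 0.2                      # assumed numeric covariance values
same <- outer(cluster, cluster, "==") # TRUE where two units share a cluster
m <- ifelse(same, X, Y)               # 1000 x 1000 covariance matrix in one shot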

expand.grid very big vectors exceeding memory limit

I have a problem in R.
I have 6 vectors of data, and each vector has a weight.
I need to calculate the quantiles over all possible scenarios.
For example :
v1=c(1,2)
v2=c(0,5)
weights=c(1/3,2/3)
I would normally use :
scenarios=data.matrix(expand.grid(v1,v2))
results=scenarios %*% weights
And finally to get all the quantiles from 1% to 100% :
quantiles=quantile(results,seq(0.01,1,0.01),names=FALSE)
The problem is that my 6 vectors have 51, 236, 234, 71, 7 and 8 observations respectively, which would give about 11 billion scenarios...
I get an error from R that I exceed the memory limit, with a vector of 47 GB...
Do you see an alternative I can use to bypass this big matrix? I'm thinking of something like looping over the values of one vector and writing each batch of results to a file.
But then I don't know how I would calculate the percentiles from these separate files...
Rather than generate the whole population, how about sampling to generate your pdf?
N <- 1e6
scenarios <- unique(matrix(c(sample(1:51, N, replace=T),
sample(1:236, N, replace=T),
sample(1:234, N, replace=T),
sample(1:71, N, replace=T),
sample(1:7, N, replace=T),
sample(1:8, N, replace=T)), nrow=N))
N <- nrow(scenarios)
weights <- matrix(rep(1/6, 6))
quantiles <- quantile(scenarios %*% weights, seq(0.01,1,0.01), names=FALSE)
if OP strictly wants the whole population, I will take this post down
Alright! Thanks for your help, guys!
Looks like sampling was the way to go!
Here's the code I used in the end, with chinson12's help.
I did a bootstrap to see whether the sampling converges towards the right value.
N <- 1e6
B <- 2
results <- 1:100
for (i in 1:B) {
  scenarios <- unique(matrix(c(sample(v1, N, replace = TRUE), sample(v2, N, replace = TRUE),
                               sample(v3, N, replace = TRUE), sample(v4, N, replace = TRUE),
                               sample(v5, N, replace = TRUE), sample(v6, N, replace = TRUE)),
                             nrow = N))
  weightedSum <- round(scenarios %*% weights, 4)
  results <- cbind(results, quantile(weightedSum, seq(0.01, 1, 0.01), names = FALSE))
}
write(t(results), "ouput.txt", ncolumns = B + 1)
The output file looks great! To 4 decimal places, all of my percentiles are the same, so at least they converge to a value!
That being said, are those percentiles unbiased estimates of my population percentiles?
Thanks

R: Compute only a band of a correlation matrix

I would like to compute correlations between columns of a matrix only for some band of the correlations matrix. I know how to get the whole correlation matrix:
X <- matrix(rnorm(20*30), nrow=20)
cor(X)
But I'm only interested in a band below the main diagonal.
I could try to cleverly subset the original matrix to get only the little square blocks that cover the band, but this seems cumbersome.
Do you have a better idea/solution to the problem?
EDIT
I forgot to mention this, but I can hardly use a for loop in R, since the correlation matrix is rather large (about 2000x2000) and I have to repeat the process around 100 times.
You’re probably right that cor on the whole matrix is faster than using manual loops, since the internal workings of cor are highly optimised for matrices. But the bigger the matrix (and, conversely, the smaller the band), the more benefit you could reap from manually looping over the band.
That said, maybe just give it a try – the code for the manual loop is trivial:
cor_band = function (x, band_width, method = c('pearson', 'kendall', 'spearman')) {
  method = match.arg(method) # resolve the default vector to a single method name
  out = matrix(nrow = ncol(x), ncol = ncol(x))
  for (i in 1 : ncol(x))
    for (j in i : min(i + band_width, ncol(x)))
      out[j, i] = cor(x[, j], x[, i], method = method)
  out
}
Note that the indices in out are reversed so that we get the band below the diagonal rather than above. Since the correlation matrix is symmetrical, either works.
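Hypothetical usage with the dimensions from the question (the band_width value is arbitrary):
set.seed(1)
X <- matrix(rnorm(20 * 30), nrow = 20)
res <- cor_band(X, band_width = 5)
res[1:8, 1:3] # correlations on and below the diagonal within the band, NA elsewhere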
Try a for loop:
band_width = 5 # assumed band width; set as needed
band_cor_mat = matrix(NA, nrow = ncol(X), ncol = ncol(X)) # the correlation matrix of the columns is ncol x ncol
for (cc in 1:ncol(X)) { # Diagonal
  for (mm in seq_len(min(band_width, ncol(X) - cc))) { # Band; seq_len() handles the last column, where the band is empty
    band_cor_mat[cc + mm, cc] = cor(X[, cc + mm], X[, cc])
  }
}
You will have a correlation matrix, with correlation values in the band, and NAs for the rest.
