Calculating a GP correlation matrix outside of a loop - r

So I am trying to calculate the correlation matrix associated with a Gaussian Process using R and was hoping for some suggestions for doing so without using the triple for-loop I have written below. Mainly I want to try and condense the code for readable purposes and also to speed up calculations.
#Example Data
n = 500
x1 = sample(1:100,n,replace=T)
x2 = sample(1:100,n,replace=T)
x3 = sample(1:100,n,replace=T)
X = cbind(x1,x2,x3)
R = matrix(NA,nrow=n,ncol=n)
for(i in 1:nrow(X)){
for(j in 1:nrow(X)){
temp = 0
for(k in 1:ncol(X)){
temp = -abs(X[i,k]-X[j,k])^1.99 + temp
}
R[i,j] = exp(temp)
}
}
So as n gets large, the code gets much slower. Also worth noting, since this is a correlation matrix, the matrix is syymetric and the diagonal is equal to 1.

It's much faster using this:
y <- t(X)
R <- exp(-sapply(1:ncol(y), function(i) colSums((y-y[,i])^2)))
If you want ot keep your original formula:
R <- exp(-sapply(1:ncol(y), function(i) colSums(abs(y-y[,i])^1.99)))

I'm wondering if you could cut your calculation and looping times in half by changing these two lines? (Actually the timing was improved by more than 50% 14.304 secs improved to 6.234 secs )
1: for(j in 1:nrow(X)){
2: R[i,j] = exp(temp)
To:
1: for(j in i:nrow(X)){
2: R[i,j] = R[j,i]= exp(temp)
Tested:
> all.equal(R, R2)
[1] TRUE
That way you populate the lower triangle without doing any calculations.BTW, what's with the 1.99? This is perhaps a problem more suited to submitting as a C program. The Rcpp package supports this and there are a lot of worked examples on SO. Perhaps a search on: [r] rcpp nested loops

Related

Speed of Daisy Function

I'm working on improving the speed of a function (for a dissimilarity measure) I'm writing which is quite similar mathematically to the Euclidean distance function. However, when I time my function compared to that implemented in the daisy function from the cluster package, I find quite a significant difference in speed, with daisy performing much better. Given that (I'm assuming) a dissimilarity measure would require O(n x p) time due to the need to compare each object to itself over all variables (where n is number of objects and p is number of variables), I find it difficult to understand how the daisy function performs so well (near constant time, from the few experiments I've done) relative to my simple and direct implementation. I present the code I have used both to implement and test below. I have tried looking through the r source code for the implementation of the daisy function, but I found it difficult to understand. I found no nested for loop. Any help with understanding why this function performs so fast and how I could possibly modify my code to have similar speed would be very highly appreciated.
euclidean <- function (df){
no_obj <- nrow(df)
dist <- array(0, dim = c(no_obj, no_obj))
for (i in 1:no_obj){
for (j in 1:no_obj){
dist_v <- 0
if(i != j){
for (v in 1:ncol(df)){
dist_v <- dist_v + sqrt((df[i,v] - df[j,v])^2)
}
}
dist[i,j] <- dist_v
}
}
return(dist)
}
data("iris")
tic <- Sys.time()
dst <- euclidean(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Euclidean]: ", time))
tic <- Sys.time()
dst <- daisy(iris[,1:4])
time <- difftime(Sys.time(), tic, units = "secs")[[1]]
print(paste("Time taken [Daisy]: ", time))
one option:
euclidean3 <- function(df) {
require(data.table)
n <- nrow(df)
i <- CJ(1:n, 1:n) # generate all row combinations
dl <- sapply(df, function(x) sqrt((x[i[[1]]] - x[i[[2]]])^2)) # loop over columns
dv <- rowSums(dl) # sum values of columns
d <- matrix(dv, n, n) # fill in matrix
d
}
dst3 <- euclidean3(iris[,1:4])
all.equal(euclidean(iris[,1:4]), dst3) # TRUE
[1] "Time taken [Euclidean3]: 0.008"
[1] "Time taken [Daisy]: 0.002"
Largest bottleneck in your code is selecting data.frame elements in loop (df[j,v])). Maybe changing it to matrix also could improver speed. I believe there could be more performant approach on stackoverflow, you just need to search by correct keywords...

R aborts when using function DIST (110 GB vector)

I need to run a hierarchical clustering algorithm in R on a dataset with 173000 rows and 17 columns.
When running the function dist() on the dataset, R aborts. I have also tried it with a Windows pc and the error message I get is "cannot allocate vector of size 110.5 Gb".
My Mac and my Windows pc have 4 GB of RAM.
Is there a way to still do this in R? I know hierarchical algorithms are not the best for large datasets but it is requireed by a University assignment.
Thank you
The problem can be solved by writing a function to compute the pairwise euclidian distances between columns of the data set, assumed below to be in tabular form. For other distances, a similar function can be written.
dist2 <- function(X){
cmb <- combn(seq_len(ncol(X)), 2)
d <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(X))
if(!is.null(colnames(X)))
dimnames(d) <- list(colnames(X), colnames(X))
for(i in seq_len(ncol(cmb))){
ix <- cmb[1, i]
iy <- cmb[2, i]
res <- sqrt(sum((X[, ix] - X[, iy])^2))
d[ix, iy] <- d[iy, ix] <- res
diag(d) <- 0
}
d
}
Now test the function with a data.frame of the dimensions in the question.
set.seed(2021)
m <- replicate(17, rnorm(173000))
m <- as.data.frame(m)
dist2(m)
First and foremost, it would be very nice of you to provide a reprex (reproducible example). Make sure you will do it later.
Speaking about the issue, you can use sample_frac function (if I am not mistaken, this is a function from tidyverse package). For example, sample_frac(your_data, .5) will sample 50% of your dataframe. It will reduce the size of data to be clustered and it will be easier for your laptop.
The other way is to extend the memory.limit(size = n) where n is a number in megabytes.

simulation of binomial distribution and storing value in matrix in r

set.seed(123)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.30)
result[[m]]=u
}
result
for (m in 1:40) if (any(result[[m]] == 1)) break
m
m is the exit time for company, as we change the probability it will give different result. Using this m as exit, I have to find if there was a funding round inbetween, so I created a random binomial distribution with some prob, when you will get a 1 that means there is a funding round(j). if there is a funding round i have to find the limit of round using the random uniform distribution. I am not sure if the code is right for rbinom and is running till m. And imat1<- matrix(0,nrow = 40,ncol = 2) #empty matrix
am gettin the y value for all 40 iteration I Need it when I get rbinom==1 it should go to next loop. I am trying to store the value in matrix but its not getting stored too. Please help me with that.
mat1<- matrix(0,nrow = 40,ncol = 2) #empty matrix
for(j in 1:m) {
k<- if(any(rbinom(1e3,40,0.42)==1)) #funding round
{
y<- runif(j, min = 0, max = 1) #lower and upper bound
mat1[l][0]<-j
mat1[l][1]<-y #matrix storing the value
}
}
resl
mat1
y
The answer to your first question:
result <- vector("list",40)
for(m in 1:40)
{
u <- rbinom(1e3,40,0.05)
print(u)
result[[m]]=u
}
u
The second question is not clear. Could you rephrase it?
To generate 40 vectors of random binomial numbers you don't need a loop at all, use ?replicate.
u <- replicate(40, rbinom(1e3, 40, 0.05))
As for your second question, there are several problems with your code. I will try address them, it will be up to you to say if the proposed corrections are right.
The following does basically nothing
for(k in 1:40)
{
n<- (any(rbinom(1e3,40,0.05)==1)) # n is TRUE/FALSE
}
k # at this point, equal to 40
There are better ways of creating a T/F variable.
#matrix(0, nrow = 40,ncol = 2) # wrong, don't use list()
matrix(0, nrow = 40,ncol = 2) # or maybe NA
Then you set l=0 when indices in R start at 1. Anyway, I don't believe you'll need this variable l.
if(any(rbinom(1e3,40,0.30)==1)) # probably TRUE, left as an exercise
# in probability theory
Then, finally,
mat1[l][0]<-j # index `0` doesn't exist
Please revise your code, and tell us what you want to do, we're glad to help.

Parallelizing a double for loop in R

I've been using the parallel package in R to do loops like:
cl <- makeCluster(getOption("cl.cores", 6))
result <- parSapply(cl,1:k,function(i){ ... })
Is there a natural way to parallelize a nested for loop in R using this package? Or perhaps another package? I know there are several ways to implement parallelism in R.
My loop looks something like this. I simplified a bit but it gets the message across:
sup_mse <- matrix(0,nrow=k,ncol=length(sigma))
k <- 100000 #Number of iterations
sigma <- seq(from=0.1,to=10,by=0.2)
for(i in 1:k){
for(j in 1:length(sigma)){
sup<-supsmu(x,y)
sup_mse[i,j] <- mean((m(x)-sup$y)^2)
}
}
Thanks for making the reproducible example! I prefer snowfall for my parallel processing, so here's how it looks in there.
install.packages('snowfall')
require(snowfall)
### wasn't sure what you were using for x or y
set.seed(1001)
x <- sample(seq(1,100),20)
y <- sample(seq(1,100),20)
k <- 100
sigma <- seq(0.1, 10, 0.2)
### makes a local cluster on 4 cores and puts the data each core will need onto each
sfInit(parallel=TRUE,cpus=4, type="SOCK",socketHosts=rep("localhost",4))
sfExport('x','y','k','sigma')
answers <- sfSapply(seq(1,k), function(M)
sapply(seq(1,length(sigma)), function(N)
mean((mean(x)-supsmu(x,y)$y)^2) ## wasn't sure what you mean by m(x) so guessed mean
)
)
sup_mse <- t(answers) ## will give you a matrix with length(sigma) columns and k rows
sfStop()
I remember reading somewhere that you only want to use sfSapply in the outer loops and then use your regular apply functions inside of that loop. Hope this helps!

Fill matrix with loop

I am trying to create a matrix n by k with k mvn covariates using a loop.
Quite simple but not working so far... Here is my code:
n=1000
k=5
p=100
mu=0
sigma=1
x=matrix(data=NA, nrow=n, ncol=k)
for (i in 1:k){
x [[i]]= mvrnorm(n,mu,sigma)
}
What's missing?
I see several things here:
You may want to set the random seed for replicability (set.seed(20430)). This means that every time you run the code, you will get exactly the same set of pseudorandom variates.
Next, your data will just be independent variates; they won't actually have any multivariate structure (although that may be what you want). In general, if you want to generate multivariate data, you should use ?mvrnorm, from the MASS package. (For more info, see here.)
As a minor point, if you want standard normal data, you don't need to specify mu = 0 and sigma = 1, as those are the default values for rnorm().
You don't need a loop to fill a matrix in R, just generate as many values as you like and add them directly using the data= argument in the matrix() function. If you really were committed to using a loop, you should probably use a double loop, so that you are looping over the columns, and within each loop, looping over the rows. (Note that this is a very inefficient way to code in R--although I do things like that all the time ;-).
Lastly, I can't tell what p is supposed to be doing in your code.
Here is a basic way to do what you seem to be going for:
set.seed(20430)
n = 1000
k = 5
dat = rnorm(n*k)
x = matrix(data=dat, nrow=n, ncol=k)
If you really wanted to use loops you could do it like this:
mu = 0
sigma = 1
x = matrix(data=NA, nrow=n, ncol=k)
for(j in 1:k){
for(i in 1:n){
x[i,j] = rnorm(1, mu, sigma)
}
}
define the matrix first
E<-matrix(data=0, nrow=10, ncol=10);
run two loops to iterate i for rows and j for columns, mine is a exchangeable correlation structure
for (i in 1:10)
{
for (j in 1:10)
{
if (i==j) {E[i,j]=1}
else {E[i,j]=0.6}
}
};
A=c(2,3,4,5);# In your case row terms
B=c(3,4,5,6);# In your case column terms
x=matrix(,nrow = length(A), ncol = length(B));
for (i in 1:length(A)){
for (j in 1:length(B)){
x[i,j]<-(A[i]*B[j])# do the similarity function, simi(A[i],B[j])
}
}
x # matrix is filled
I was thinking in my problem perspective.

Resources