Avoid nested for loops when summing over matrix indices - r

I have a fairly simple computation I need to do, but I cannot figure out how to do it in a way that is even close to efficient. I have a large n x n matrix A, and I need to compute the following sum over its indices:
sum_{i=1}^{n} sum_{j=1}^{n} sum_{k=1}^{n} A[i,j] * A[j,k]
I'm still fairly inexperienced at coding, and so the only way that comes to my mind is to do the straightforward thing and use 3 for loops to move across the indexes:
# accumulate the triple sum with three nested loops
sum <- 0
for (i in 1:n) {
  for (j in 1:n) {
    for (k in 1:n) {
      sum <- sum + A[i, j] * A[j, k]
    }
  }
}
Needless to say, for any decent size matrix this takes forever to run. I know there must be a better, more efficient way to do this, but I cannot figure it out.

If you set the sums over i and k aside for a moment, you can see that the inner sum over j is just the (i, k) entry of the matrix product of A with itself. In R, that product is obtained with the %*% operator. After calculating the product matrix, you just need to sum all of its elements together:
sum(A %*% A)
should give the result you are seeking.
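A quick sanity check on a small random matrix (sizes here are arbitrary) confirms the two approaches agree:
set.seed(1)
n <- 50
A <- matrix(rnorm(n * n), n, n)
total <- 0
for (i in 1:n) for (j in 1:n) for (k in 1:n) total <- total + A[i, j] * A[j, k]
all.equal(total, sum(A %*% A))  # TRUE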

Related

Matrix operation efficiency in R

I have 3 matrices X, K and M as follows.
X <- matrix(c(1,2,3,1,2,3,1,2,3), ncol=3)
K <- matrix(c(4,5,4,5,4,5),ncol=3)
M <- matrix(c(0.1,0.2,0.3),ncol=1)
Here is what I need to accomplish: compute a matrix Y, with one row per row of X and one column per row of K, where Y[i,j] is the sum over m of (X[i,m] - K[j,m])^2 * M[m,1]^2.
For example,
Y(1,1)=(1-4)^2*0.1^2+(1-4)^2*0.2^2+(1-4)^2*0.3^2
Y(1,2)=(1-5)^2*0.1^2+(1-5)^2*0.2^2+(1-5)^2*0.3^2
...
Y(3,2)=(3-5)^2*0.1^2+(3-5)^2*0.2^2+(3-5)^2*0.3^2
Currently I use 3 for loops to calculate the final matrix in R, but for large matrices this takes extremely long. I also need to vary the elements of M to find the values that produce the minimal squared error. Is there a better way to code it up, e.g. using a Euclidean norm?
# note: the loop bounds N, K and M here are counts (rows of X, rows of the matrix K,
# columns of X), which unfortunately reuse the matrix names; Y is assumed pre-allocated
for (lin in 1:N) {
  for (col in 1:K) {
    Y[lin,col] <- 0
    for (m in 1:M) {
      Y[lin,col] <- Y[lin,col] + (X[lin,m] - K[col,m])^2 * M[m,1]^2
    }
  }
}
Edit:
I ended up using Rcpp to write the code in C++ and call it from R. It is significantly faster! It takes 2-3 seconds to fill up a 2000 * 2000 matrix.
Thank you. I was able to figure this out. The change made my calculation twice as fast as before. For anyone who may be interested, I replaced the last for loop for(m in 1:M) with the following:
Y[lin,col] <- norm(as.matrix((X[lin,]-K[col,]) * M[1,]),"F")^2
Note that I transposed the matrix M so that it has 3 columns instead of 1.
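Going further, the whole matrix can be computed without any explicit loop. A possible fully vectorized sketch, using the original 3 x 1 column vector M of weights (not the transposed version mentioned above) and expanding (x - k)^2 into x^2 - 2xk + k^2 so everything becomes matrix products:
w2 <- as.vector(M)^2                               # squared weights, length ncol(X)
Y  <- outer(drop(X^2 %*% w2), rep(1, nrow(K))) -   # sum_m w2[m] * X[i,m]^2
      2 * (X %*% (t(K) * w2)) +                    # -2 * sum_m w2[m] * X[i,m] * K[j,m]
      outer(rep(1, nrow(X)), drop(K^2 %*% w2))     # sum_m w2[m] * K[j,m]^2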

Operations over Rows or Columns

I have an operation I am running in R, and I want to know if there is any set of rules that can help me determine whether the operation should be performed over rows or over columns, given that transposing a matrix is otherwise just a matter of programming preference.
The only regular advice I have so far is: test it on a subsample every time. Can we do better than that in any way, say something like "division works best on long matrices rather than wide ones"? If we can't do better than that, why not?
I have programmed my specific operation of interest as follows, but keep in mind I am more interested in the general question than in this specific case:
support_n: some matrix I'm investigating. It is (N) x (K choose N), with K > 50 and N > 4.
fz(): A bland function of several variables, polynomials, max, and min.
fz <- function(z, vec_l) {
  if (z %in% vec_l) {          # if z is equivalent to any element of vec_l, return 0
    out <- 0
  } else if (z > max(vec_l)) {
    out <- z^2 * max(vec_l)^2
  } else {
    out <- z^2 + min(vec_l)^2
  }
  out
}
library(foreach)
library(doParallel)    # cl is assumed to be a cluster created earlier, e.g. with makeCluster()
registerDoParallel(cl)
system.time(
  payoff <- foreach(y = 1:n, .combine = 'cbind') %:%
    foreach(x = 1:ncol(support_n), .combine = 'c') %dopar% {
      fz(support_n[y, x], support_n[-y, x])
    }
)
So should I run this over y's or x's first, in general? Why?
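One relevant general fact: R stores matrices in column-major order, so walking down a column touches contiguous memory while walking along a row does not. A minimal timing sketch (toy size chosen arbitrarily) to test this on a subsample, in the spirit of the advice above:
m <- matrix(rnorm(4000 * 4000), nrow = 4000)
system.time(for (j in 1:ncol(m)) tmp <- m[, j])   # extract columns (contiguous)
system.time(for (i in 1:nrow(m)) tmp <- m[i, ])   # extract rows (strided)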

Change entries in matrix using entries of the matrix

I am trying to make my matrix tc symmetric (using R) by adding the corresponding entries and dividing by the sum of the corresponding diagonal entries, i.e. (tc[i,j]+tc[j,i])/(tc[i,i]+tc[j,j]). I tried it with loops, but it does not give me the right values, let alone make the matrix symmetric. This is my code so far:
for (i in 1:end) {
  for (j in 1:end) {
    tc[i,j] <- (tc[i,j] + tc[j,i]) / (tc[i,i] + tc[j,j])
  }
}
It's probably a super obvious mistake but I can't figure it out. Can anyone help me? =)
Well, if you think about it, you are summing using values that you have already updated (since you are looping over each i and j).
What if you make a new matrix with the same dimensions as tc and then run your loop?
newTc <- matrix(0, nrow=nrow(tc), ncol=ncol(tc))
for (i in 1:end) {
  for (j in 1:end) {
    newTc[i,j] <- (tc[i,j] + tc[j,i]) / (tc[i,i] + tc[j,j])
  }
}
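For what it's worth, the loops can also be replaced entirely with a vectorized expression; a sketch, assuming tc is square:
# (tc[i,j] + tc[j,i]) / (tc[i,i] + tc[j,j]) for every pair (i, j), computed at once
newTc <- (tc + t(tc)) / outer(diag(tc), diag(tc), "+")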

How to avoid a loop here in R?

In my R program I have a "for" loop of the following form:
for (i in 1:I) {
  res[i] <- a[i:I] %*% b[i:I]
}
where res, a and b are vectors of length I.
Is there any straightforward way to avoid this loop and calculate res directly? If so, would that be more efficient?
Thanks in advance!
This is the "reverse cumsum" of a*b
rev(cumsum(rev(a) * rev(b)))
So long as res is already of length I, the for loop isn't "incorrect" and the apply solutions will not really be any faster. However, using apply can be more succinct (if potentially less readable).
Something like this:
res <- sapply(seq_along(a), function(i) a[i:I] %*% b[i:I])
should work as a one-liner.
Expanding on my first sentence: while using the inherent vectorization available in R is very handy and often the fastest way to go, it isn't always critical to avoid for loops. Under the hood, the apply family determines the size of the output and pre-allocates it before "looping".
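For what it's worth, a quick check on made-up data (values arbitrary) that the loop, the reverse-cumsum, and the sapply versions all agree:
I <- 5; a <- runif(I); b <- runif(I)
res_loop <- numeric(I)
for (i in 1:I) res_loop[i] <- a[i:I] %*% b[i:I]
res_rev    <- rev(cumsum(rev(a) * rev(b)))
res_sapply <- sapply(seq_along(a), function(i) a[i:I] %*% b[i:I])
all.equal(res_loop, res_rev)      # TRUE
all.equal(res_loop, res_sapply)   # TRUE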

How to calculate Euclidean distance (and save only summaries) for large data frames

I've written a short for loop to find the minimum Euclidean distance between each row in a data frame and all the other rows (and to record which row is closest). In theory this avoids the memory errors associated with trying to compute a full distance matrix for very large data. However, although very little is held in memory at once, it is very, very slow for large matrices (my use case of ~150K rows is still running).
I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.
Thanks in advance (and for your patience).
require(proxy)
df <- data.frame(matrix(runif(10*10), nrow=10, ncol=10), row.names=paste("site", 1:10))

min.dist <- function(df) {
  # data frame for results
  all.min.dist <- data.frame()
  # loop over rows
  for (k in 1:nrow(df)) {
    # calculate dissimilarity between this row and all other rows
    df.dist <- dist(df[k,], df[-k,])
    # find minimum distance
    min.dist <- min(df.dist)
    # get rowname for minimum distance (id of nearest point)
    closest.row <- row.names(df)[-k][which.min(df.dist)]
    # combine outputs
    all.min.dist <- rbind(all.min.dist,
                          data.frame(orig_row = row.names(df)[k],
                                     dist = min.dist,
                                     closest_row = closest.row))
  }
  # return results
  return(all.min.dist)
}

# example
min.dist(df)
This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.
min.dist <- function(df) {
  # the helper receives the transposed data, so each column is one original row (site)
  which.closest <- function(k, mat) {
    d <- colSums((mat[, -k] - mat[, k])^2)   # squared Euclidean distances to all other sites
    m <- which.min(d)
    data.frame(orig_row    = colnames(mat)[k],
               dist        = sqrt(d[m]),
               closest_row = colnames(mat)[-k][m])
  }
  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}
If this is still too slow, a suggested improvement would be to compute the distances for a block of points at a time instead of a single one. The block size will need to be a compromise between speed and memory usage.
Edit: Also read https://stackoverflow.com/a/16670220/1201032
Usually, built-in functions are faster than coding it yourself (because they are coded in Fortran or C/C++ and optimized).
It seems that the function dist {stats} answers your question spot on:
Description
This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
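For completeness, a sketch of how stats::dist could produce the same per-row summaries; note that it materializes the full n x n distance matrix, so it is only practical when that fits in memory (not for ~150K rows):
d <- as.matrix(dist(df))                  # Euclidean distances between all rows of df
diag(d) <- Inf                            # ignore each row's distance to itself
closest <- apply(d, 1, which.min)         # index of the nearest other row
data.frame(orig_row    = row.names(df),
           dist        = d[cbind(seq_len(nrow(df)), closest)],
           closest_row = row.names(df)[closest])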
