I have an R function that calculates the Hamming distance of two vectors:
Hamming = function(x,y){
get_dist = sum(x != y, na.rm=TRUE)
return(get_dist)
}
that I would like to apply to every row of two matrices M1, M2 without using a for loop. What I currently have (where L is the number of rows in M1 and M2) is the very time-consuming loop:
xdiff = c()
for(i in 1:L){
xdiff = c(xdiff, Hamming(M1[i,],M2[i,]))
}
I thought that this could be done by executing
mapply(Hamming, t(M1), t(M2))
(with the transpose because mapply works across columns), but this doesn't generate a length L vector of Hamming distances for each row, so perhaps I'm misunderstanding what mapply is doing.
Is there a straightforward application of mapply or something else in the R apply family that would work?
If dim(M1) and dim(M2) are identical, then you can simply do:
rowSums(M1 != M2, na.rm = TRUE)
Your attempt with mapply didn't work because m-by-n matrices are stored as m*n-length vectors, and mapply handles them as such. To accomplish this with mapply, you would need to split each matrix into a list of row vectors:
mapply(Hamming, asplit(M1, 1L), asplit(M2, 1L))
vapply would be better, though:
vapply(seq_len(nrow(M1)), function(i) Hamming(M1[i, ], M2[i, ]), 0L)
In any case, just use rowSums.
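As a quick sanity check, here is a minimal sketch that runs the three approaches on two small made-up matrices (the values of M1 and M2 below are illustrative assumptions, not the asker's data) and shows that they return the same distances:

```r
# Hamming distance between two vectors, as defined in the question
Hamming <- function(x, y) sum(x != y, na.rm = TRUE)

# Small illustrative matrices (assumed data, not from the question)
M1 <- matrix(c(1, 2, 3,
               4, 5, 6), nrow = 2, byrow = TRUE)
M2 <- matrix(c(1, 0, 3,
               0, 0, 6), nrow = 2, byrow = TRUE)

# Vectorized comparison of the whole matrices, summed per row
d_rowsums <- rowSums(M1 != M2, na.rm = TRUE)

# Row-wise application after splitting each matrix into a list of rows
d_mapply <- mapply(Hamming, asplit(M1, 1L), asplit(M2, 1L))

# Row-wise application by index, with the result type declared as integer
d_vapply <- vapply(seq_len(nrow(M1)),
                   function(i) Hamming(M1[i, ], M2[i, ]), 0L)
```

Note that asplit requires R 3.6.0 or later.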
I have two matrices: A (k rows, m columns) and B (k rows, n columns).
I want to operate on all pairs of columns (one from A and one from B). The result should be a matrix C (m rows, n columns) where C[i,j] = f(A[,i],B[,j]).
Now, if the function f were the dot product, the whole thing would be a simple matrix multiplication (C = t(A) %*% B),
but my f is different. Specifically, I count the number of equal entries:
f = function(x,y) sum(x==y)
My question is whether there is a simple (and fast, because my matrices are big) way to compute the result,
preferably in R, but possibly in Python (numpy). I thought about using outer(A,B,"=="), but this results in a 4-dimensional array and I haven't figured out what to do with it.
Any help is appreciated.
In R, we can split each matrix into a list of columns and apply the function f with a nested lapply/sapply (this returns a list with one vector per column of A):
lapply(asplit(A, 2), function(x) sapply(asplit(B, 2), function(y) f(x, y)))
Or use outer after converting to data.frame, because then the unit is a column, whereas for a matrix the unit is a single element (a matrix is a vector with a dim attribute):
outer(as.data.frame(A), as.data.frame(B), FUN = Vectorize(f))
data
A <- cbind(1:5, 6:10)
B <- cbind(c(1:3, 1:2), c(5:7, 6:7))
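If a plain m-by-n matrix is wanted, one possible variant (a sketch, not the answerer's code) is to run outer over the column indices rather than over the matrices themselves. It reuses f, A and B from above; the index helper passed to Vectorize is an illustrative assumption:

```r
f <- function(x, y) sum(x == y)

A <- cbind(1:5, 6:10)
B <- cbind(c(1:3, 1:2), c(5:7, 6:7))

# outer() expands all (i, j) index pairs; Vectorize() makes the helper
# accept those index vectors and call f once per pair of columns
C <- outer(seq_len(ncol(A)), seq_len(ncol(B)),
           Vectorize(function(i, j) f(A[, i], B[, j])))
```

Here C[i, j] holds the number of positions where column i of A equals column j of B. This still calls f once per pair, so it is convenient rather than fast for very large matrices.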
I have a vector X of length n, and a list L of index vectors of variable length. Let F be a function from R^m to R. I want to apply the function F to each subvector X[L[[i]]]. That is, I want to calculate F( X[ L[[i]] ] )
For example, suppose that F is the mean
set.seed(123)
X <- rnorm(100)
L <- list()
for(i in 1:10) L[[i]] <- sample(1:100,30,replace = FALSE)
By brute force I could calculate
out <- vector()
for(i in 1:10) out[i] <- mean(X[ L[[i]] ])
However, this for loop is rather slow for larger dimensions. Is there a more direct way of calculating out? I have tried to use lapply, but it does not seem to work for the combination of a vector + a list of indices + a function.
You can simply use lapply to loop over your list and use each element to subset your vector X. Once you have the subset, calculate its mean (this returns a list; use sapply if you want a vector), i.e.
lapply(L, function(i) mean(X[i]))
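To get a numeric vector directly, comparable to out from the loop, a minimal sketch with vapply could look like this. The data setup mirrors the question; the names out_loop and out_vapply are introduced here only for the comparison:

```r
set.seed(123)
X <- rnorm(100)

# List of 10 index vectors, each holding 30 distinct positions in X
L <- lapply(1:10, function(i) sample(1:100, 30, replace = FALSE))

# Reference result from an explicit loop, with the output preallocated
out_loop <- numeric(length(L))
for (i in seq_along(L)) out_loop[i] <- mean(X[L[[i]]])

# Same computation with vapply; numeric(1) declares one number per element
out_vapply <- vapply(L, function(idx) mean(X[idx]), numeric(1))
```

vapply checks that every call returns a single numeric value, which catches mistakes that sapply would silently turn into a list or matrix.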
Problem
Say I have a function that is currently not vectorized. The following is just an example:
FunctionNotVectorized = function(x,y,some_options) return(x[1]+y[1])
which has, say, 10 different options. I would like to
1) define a matrix of size 1e5 x 1e5 for each option.
2) then, for each matrix, assign values for their corresponding indices.
First, I defined a matrix of size 1e5 x 1e5 for each option, using a for loop:
for (k in 1:10){
assign(sprintf("res%02d", k), matrix(0,1e5,1e5))
}
which defines matrices named res01, ... res10.
Second, I tried to assign values for their corresponding indices for each matrix. But I'm stuck here
Try
What I would like to do:
for (i in 1:1e5){
for (j in 1:1e5){
for (k in 1:10){
assign(sprintf("res%02d[i,j]", k),
FunctionNotVectorized(i,j,some_options=k))
}
}
}
but clearly, assign(sprintf("res%02d[i,j]", k) does not work. Any help will be appreciated.
Avoid explicit loops in R where you can, because they can make the calculation much slower. For a small number of iterations (say, fewer than 100), for/while/etc. are fine.
Use lapply to operate on a collection of objects in the same way, then do.call to combine the results from the list. Use lists instead of assign; lapply and lists work well together.
Here is an example for matrices with sizes 15x15:
mtxs = list() # create an empty list to be filled
for(k in 1:10){ # loop over the 10 matrices
  mtx = do.call(c, lapply(1:15, function(x){ # outer lapply: columns
    do.call(c, lapply(1:15, # inner lapply: rows
      function(y){ FunctionNotVectorized(y, x, k) }))}))
  mtxs[[k]] = matrix(mtx, 15, 15) # store the k-th matrix in the list
}
Simply use a named list without need to use assign to add objects to global environment:
# BUILD LIST OF MATRICES
my_matrix_list <- setNames(replicate(10, matrix(0,1e5,1e5), simplify = FALSE),
paste0("res", 1:10, "d"))
# DYNAMICALLY ASSIGN VALUE BY OBJECT NAME
for (i in 1:1e5){
for (j in 1:1e5){
for (k in 1:10){
my_matrix_list[[paste0("res", k, "d")]][i,j] <-
FunctionNotVectorized(i,j,some_options=k)
}
}
}
# REFERENCE ITEMS IN LIST
my_matrix_list$res1d
my_matrix_list$res2d
my_matrix_list$res3d
...
I am working with the consumer price index (CPI), and in order to calculate it I have to multiply the index matrix by the corresponding weights:
grossCPI77_10 <- grossIND1977 %*% weights1910/100
grossCPI82_10 <- grossIND1982 %*% weights1910/100
Of course, I would rather have code like the following:
grossIND1982 <- replicate(20, cbind(1:61))
grossIND1993 <- replicate(20, cbind(1:61))
weights1910_sc <- c(1:20)
grossIND_list <- mget(ls(pattern = "grossIND...."))
totalCPI <- mapply("*", grossIND_list, weights1910_sc)
The problem is that it gives me a 1200x20 matrix. I expected a normal matrix (61x20) by vector (20x1) multiplication, which should result in a 20x1 vector. Could you explain what I am doing wrong? Thanks.
Part of your problem is that you don't have matrices but 3D arrays with one singleton dimension. The other issue is that mapply tries to combine the results into a matrix, and that constant arguments should be passed via MoreArgs. But really, this is more a case for lapply.
grossIND1982 <- replicate(20, cbind(1:61))[,1,]
grossIND1993 <- replicate(20, cbind(1:61))[,1,]
weights1910_sc <- c(1:20)
grossIND_list <- mget(ls(pattern = "grossIND...."))
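# Option 1: mapply, with the constant argument passed via MoreArgs
totalCPI <- mapply("*", grossIND_list, MoreArgs=list(e2 = weights1910_sc), SIMPLIFY = FALSE)
# Option 2 (simpler): lapply, which gives the same result
totalCPI <- lapply(grossIND_list, "*", e2 = weights1910_sc)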
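Note that "*" multiplies element by element. If a true matrix product is intended, as the %*% lines at the top of the question suggest, a sketch along the following lines may be closer to the goal. The list is built by hand here instead of with mget, and the matrix contents are illustrative assumptions:

```r
# Illustrative 61 x 20 index matrices (same shape as in the question)
grossIND_list <- list(
  grossIND1982 = matrix(1:61, nrow = 61, ncol = 20),
  grossIND1993 = matrix(1:61, nrow = 61, ncol = 20)
)
weights1910_sc <- 1:20

# Matrix product: (61 x 20) %*% (20 x 1) gives a 61 x 1 column per matrix;
# %*% binds tighter than /, so the division applies to the product
totalCPI <- lapply(grossIND_list, function(m) m %*% weights1910_sc / 100)
```

Each element of totalCPI is then a 61 x 1 matrix, one weighted value per row of the index matrix.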
I am not sure I understood all aspects of your problem (especially which should be columns, which should be rows, and in which order the matrix product should be applied), but I will try to cover at least some of them. See the comments in the code below for clarification of what you did and what you might want. I hope it helps; let me know if this is what you need.
#instead of using mget, I recommend to use a list structure
#otherwise you might capture other variables with similar names
#that you do not want
INDlist <- sapply(c("1990", "1991"), function(x) {
#this is how to set up a matrix correctly, check `?matrix`
#I think your combination of cbind and replicate did not give you what you wanted
matrix(rep(1:61, 20), nrow = 61)
}, USE.NAMES = TRUE, simplify = FALSE)
weights <- list(c(1:20))
#the first argument of mapply needs to be a function, in this case of two variables
#the body of the function calculates the cross product
#you feed the arguments (both lists) in the following part of mapply
#I have repeated your weights, but you might assign different weights for each year
res <- mapply(function(x, y) {x %*% y}, INDlist, rep(weights, length(INDlist)))
dim(res)
#[1] 61 2
I have two vectors x and y, and I need to generate a matrix whose elements are given by a function f(x,y) applied to those two vectors. That is, the matrix M will have the typical element
M[i,j] <- f(x[i], y[j])
What is the most efficient way to do this if I want to avoid loops? I can generate the matrix rows with sapply, i.e.
M[i, ] <- sapply(y, f, x = x[i])
But I still need a loop over the other dimension, which is very slow because x is huge. Is it possible to use the apply family of functions and avoid loops completely?
That is exactly what the outer function does:
outer(x, y, f)
If f is not vectorized, you need:
outer(x, y, Vectorize(f))
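As a small illustration, the sketch below uses a hypothetical f that only works on single values (the function and the input vectors are assumptions made up for this example), so it needs the Vectorize form:

```r
# Hypothetical scalar-only function: `if` expects a single TRUE/FALSE
f <- function(a, b) if (a > b) a - b else 0

x <- c(1, 5, 10)
y <- c(2, 4)

# outer() passes all (x[i], y[j]) pairs as two long vectors;
# Vectorize() wraps f so that it is called once per pair
M <- outer(x, y, Vectorize(f))
```

M has length(x) rows and length(y) columns, with M[i, j] equal to f(x[i], y[j]). Vectorize still calls f once per pair through mapply, so it saves typing, not computation.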