Replacing for loop with apply fuctions - r

I want to replace nested for loops with appropriate apply function in R.
I declare a matrix with the following dimensions - ncol is 412 and nrow 2164
dist.name.enh <- matrix(NA, ncol = length(WW_name),nrow = length(Px_name))
The for loops for calculating the string distances are as below
for(i in 1:length(WW_name)) {
for(j in 1:length(Px_name)) {
dist.name.enh[j,i]<-stringdist(tolower(WW_name)[i],
tolower(Px_name)[j],
method = "lv")
}
}
How do I avoid the for loops as it is taking very long to return the matrix.
The code is looked up from R-bloggers website

You can use outer here which will apply the function to every combination of x and y.
outer(tolower(WW_name), tolower(Px_name), stringdist::stringdist, method = "lv")
For example,
x <- c("random", "strings", "to", "test")
y <- c("another", "string", "test")
outer(x, y, stringdist::stringdist, method = "lv")
# [,1] [,2] [,3]
#[1,] 6 6 6
#[2,] 7 1 6
#[3,] 6 5 3
#[4,] 6 5 0

Related

double sum calculation, what is the most efficient approach?

I need to calculate this
where x is a vector of length n and f is a function.
What is the most efficient calculation for this in R?
One method is a double for loop, but that is obviously slow.
One fast way to do is the following:
Assume we have this vector:
x = c(0,1,2)
i.e. n=3, and assume f is a multiplication function:
Now, we use expand.grid.unique custom function which produces unique combinations within vector; in other words, it is similar to expand.grid base function but with unique combinations:
expand.grid.unique <- function(x, y, include.equals=FALSE)
{
x <- unique(x)
y <- unique(y)
g <- function(i)
{
z <- setdiff(y, x[seq_len(i-include.equals)])
if(length(z)) cbind(x[i], z, deparse.level=0)
}
do.call(rbind, lapply(seq_along(x), g))
}
In our vector case, when we cal expand.grid.unique(x,x), it produces the following result:
> expand.grid.unique(x,x)
[,1] [,2]
[1,] 0 1
[2,] 0 2
[3,] 1 2
Let's assign two_by_two to it:
two_by_two <- expand.grid.unique(x,x)
Since our function is assumed to be multiplication, then we need to calculate sum-product, i.e. dot product of first and second columns of two_by_two. For this we need %*% operator:
output <- two_by_two[,1] %*% two_by_two[,2]
> output
[,1]
[1,] 2
See ?combn
x <- 0:2
combn(x, 2)
# unique combos
[,1] [,2] [,3]
#[1,] 0 0 1
#[2,] 1 2 2
sum(combn(x, 2))
#[1] 6
combn() creates all the unique combinations. If you have a function that you want to sum, you can add a FUN to the call:
random_f <- function(x){x[1] + 2 * x[2]}
combn(x, 2, FUN = random_f)
#[1] 2 4 5
sum(combn(x, 2, FUN = random_f))
#[1] 11

Finding cumulative sum and then average the values in R

I want to compute cumulative sum for the first (n-1) columns(if we have n columns matrix) and subsequently average the values. I created a sample matrix to do this task. I have the following matrix
ma = matrix(c(1:10), nrow = 2, ncol = 5)
ma
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
I wanted to find the following
ans = matrix(c(1,2,2,3,3,4,4,5), nrow = 2, ncol = 4)
ans
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
The following are my r function.
ColCumSumsAve <- function(y){
for(i in seq_len(dim(y)[2]-1)) {
y[,i] <- cumsum(y[,i])/i
}
}
ColCumSumsAve(ma)
However, when I run the above function its not producing any output. Are there any mistakes in the code?
Thanks.
There were several mistakes.
Solution
This is what I tested and what works:
colCumSumAve <- function(m) {
csum <- t(apply(X=m, MARGIN=1, FUN=cumsum))
res <- t(Reduce(`/`, list(t(csum), 1:ncol(m))))
res[, 1:(ncol(m)-1)]
}
Test it with:
> colCumSumAve(ma)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
which is correct.
Explanation:
colCumSumAve <- function(m) {
csum <- t(apply(X=m, MARGIN=1, FUN=cumsum)) # calculate row-wise colsum
res <- t(Reduce(`/`, list(t(csum), 1:ncol(m))))
# This is the trickiest part.
# Because `csum` is a matrix, the matrix will be treated like a vector
# when `Reduce`-ing using `/` with a vector `1:ncol(m)`.
# To get quasi-row-wise treatment, I change orientation
# of the matrix by `t()`.
# However, the output, the output will be in this transformed
# orientation as a consequence. So I re-transform by applying `t()`
# on the entire result at the end - to get again the original
# input matrix orientation.
# `Reduce` using `/` here by sequencial list of the `t(csum)` and
# `1:ncol(m)` finally, has as effect `/`-ing `csum` values by their
# corresponding column position.
res[, 1:(ncol(m)-1)] # removes last column for the answer.
# this, of course could be done right at the beginning,
# saving calculation of values in the last column,
# but this calculation actually is not the speed-limiting or speed-down-slowing step
# of these calculations (since this is sth vectorized)
# rather the `apply` and `Reduce` will be rather speed-limiting.
}
Well, okay, I could do then:
colCumSumAve <- function(m) {
csum <- t(apply(X=m[, 1:(ncol(m)-1)], MARGIN=1, FUN=cumsum))
t(Reduce(`/`, list(t(csum), 1:ncol(m))))
}
or:
colCumSumAve <- function(m) {
m <- m[, 1:(ncol(m)-1)] # remove last column
csum <- t(apply(X=m, MARGIN=1, FUN=cumsum))
t(Reduce(`/`, list(t(csum), 1:ncol(m))))
}
This is actually the more optimized solution, then.
Original Function
Your original function makes only assignments in the for-loop and doesn't return anything.
So I copied first your input into a res, processed it with your for-loop and then returned res.
ColCumSumsAve <- function(y){
res <- y
for(i in seq_len(dim(y)[2]-1)) {
res[,i] <- cumsum(y[,i])/i
}
res
}
However, this gives:
> ColCumSumsAve(ma)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1.5 1.666667 1.75 9
[2,] 3 3.5 3.666667 3.75 10
The problem is that the cumsum in matrices is calculated in column-direction instead row-wise, since it treats the matrix like a vector (which goes columnwise through the matrix).
Corrected Original Function
After some frickeling, I realized, the correct solution is:
ColCumSumsAve <- function(y){
res <- matrix(NA, nrow(y), ncol(y)-1)
# create empty matrix with the dimensions of y minus last column
for (i in 1:(nrow(y))) { # go through rows
for (j in 1:(ncol(y)-1)) { # go through columns
res[i, j] <- sum(y[i, 1:j])/j # for each position do this
}
}
res # return `res`ult by calling it at the end!
}
with the testing:
> ColCumSumsAve(ma)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
Note: dim(y)[2] is ncol(y) - and dim(y)[1] is nrow(y) -
and instead seq_len(), 1: is shorter and I guess even slightly faster.
Note: My solution given first will be faster, since it uses apply, vectorized cumsum and Reduce. - for-loops in R are slower.
Late Note: Not so sure that the first solution is faster. Since R-3.x it seems that for loops are faster. Reduce will be the speed limiting funtion and can be sometimes incredibly slow.
k <- t(apply(ma,1,cumsum))[,-ncol(k)]
for (i in 1:ncol(k)){
k[,i] <- k[,i]/i
}
k
This should work.
All you need is rowMeans:
nc <- 4
cbind(ma[,1],sapply(2:nc,function(x) rowMeans(ma[,1:x])))
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
Here's how I did it
> t(apply(ma, 1, function(x) cumsum(x) / 1:length(x)))[,-NCOL(ma)]
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 2 3 4 5
This applies the cumsum function row-wise to the matrix ma and then divides by the correct length to get the average (cumsum(x) and 1:length(x) will have the same length). Then simply transpose with t and remove the last column with [,-NCOL(ma)].
The reason why there is no output from your function is because you aren't returning anything. You should end the function with return(y) or simply y as Marius suggested. Regardless, your function doesn't seem to give you the correct response anyway.

Use of function over all row-pairs of two matrices

If I want to calculate the n-dimensional distance of two vectors, I can use a function such as:
a = c(1:10)
b = seq(20, 23, length.out = length(a))
test_fun =
function(x,y) {
return(
sqrt(
sum(
(x - y) ^ 2
)
)
)
}
n_distance = test_fun(a,b)
Now, I want to expand this to a matrix setting: I want to calculate the n-dimensional distance for each pair of rows of two matrices.
set.seed(123)
a_mtx = matrix(1:30, ncol = 5)
b_mtx = matrix(sample(1:15,15), ncol = 5)
n_distance_mtx =
matrix(
NA,
nrow = nrow(b_mtx),
ncol = nrow(a_mtx)
)
for(i in 1:nrow(b_mtx)) {
for(j in 1:nrow(a_mtx)) {
n_distance_mtx[i,j] =
test_fun(a_mtx[j,], b_mtx[i,])
}
}
Where each column of n_distance_mtx contains the distance metrics between each row of a_mtx and b_mtx (so n_distance_mtx[,1] is the distance between a_mtx[1,] and b_mtx[1:3,].
If I calculate column means on n_distance_mtx I can obtain the mean distance between each row in a_mtx and all rows of b_mtx.
colMeans(n_distance_mtx)
#[1] 23.79094 24.90281 26.15618 27.53303 29.01668 30.59220
So 23.79094 is the mean distance between a_mtx[1,] and b_mtx[1:3,], and 24.90281 is the mean distance between a_mtx[2,] and b_mtx[1:3,], and so on.
Question: How can I arrive at the same solution without using for-loops?
I want to apply this method to matrices with much larger dimension (on the order of hundreds of thousands of rows). Looking at this and this, it seems there must be a way to accomplish this with a Vectorized outer function, but I have been unable to generate such a function.
test_fun_vec =
Vectorize(
function(x,y) {
outer(
x,
y,
test_fun
)
}
)
test_fun_vec(a_mtx,b_mtx)
#[1] 4 0 2 7 4 6 3 5 1 5 7 5 10 0 9 11 15 17 8 11 9 12 10 16
#[25] 10 22 20 25 15 24
We can use Vectorize with outer
f1 <- Vectorize(function(i, j) test_fun(a_mtx[j, ], b_mtx[i, ]))
out <- outer(seq_len(nrow(b_mtx)), seq_len(nrow(a_mtx)), FUN = f1)
out
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 20.88061 21.84033 22.97825 24.26932 25.69047 27.22132
#[2,] 24.87971 25.57342 26.43861 27.45906 28.61818 29.89983
#[3,] 25.61250 27.29469 29.05168 30.87070 32.74141 34.65545
colMeans(out)
#[1] 23.79094 24.90281 26.15618 27.53303 29.01668 30.59220
identical(n_distance_mtx, out)
#[1] TRUE
If I unsderstood your question right, you want the Euclidean distance between each vector (row) in a_mtx to the other vectors in b_mtx.
If so, you could use apply twice like this:
result = apply(a_mtx, 1, function(x){ apply(b_mtx, 1, function(y){ test_fun(x,y) })})
This gives a distance matrix:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 20.88061 21.84033 22.97825 24.26932 25.69047 27.22132
[2,] 24.87971 25.57342 26.43861 27.45906 28.61818 29.89983
[3,] 25.61250 27.29469 29.05168 30.87070 32.74141 34.65545
where the row index is the corresponding vector (row) from b_mtx and the column index is the corresponding vector from a_mtx
Finally, obtain the mean distance using:
colMeans(result)
[1] 23.79094 24.90281 26.15618 27.53303 29.01668 30.59220

Outer function R - maintain coordinate subtraction

I have two matrices, call them A (n x 2) and B (q x 2). I'd like to get an n x q x 2 array C, such that C[1,5,] represents the difference between the first row of A and the fifth row of B, taking the subtraction of the first element in the first row of A with the first element in the fifth row of B and the second element similarly subtracted.
I'm trying to perform this function via the outer function, but it also gives me the "non-diagonal" subtractions; i.e. it will also subtract A[1,1] - B[5,2] and A[1,2] - B[5,1] which I am not interested in. Does anyone have a fast, easy way to do this?
Current code
>diffs <- outer(A,B,FUN ='-')
>diffs[1,,5,]
[,1] [,2]
[1,] **-0.3808701** 0.7591052
[2,] 0.2629293 **1.4029046**
I've added the stars to indicate what I actually want.
Thanks for any help in advance
(EDIT)
Here's a simpler case for illustrative purposes
> A <- matrix(1:10, nrow = 5, ncol = 2)
> B <- matrix(4:9, nrow = 3, ncol = 2)
> A
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> B
[,1] [,2]
[1,] 4 7
[2,] 5 8
[3,] 6 9
>diffs <- outer(A,B,FUN ='-')
>diffs[1,,3,] == (A[1,] - B[3,])
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE TRUE
>diffs[1,,3,]
[,1] [,2]
[1,] -5 -8
[2,] 0 -3
Before worrying about the shape of the output I think we should make sure we're getting the correct values.
A <- matrix(1:10, nrow=5, ncol=2)
B <- matrix(4:9, nrow=3, ncol=2)
# long-winded method
dia_long <- c(
c(A[1,] - B[1,]),
c(A[1,] - B[2,]),
c(A[1,] - B[3,]),
c(A[2,] - B[1,]),
c(A[2,] - B[2,]),
c(A[2,] - B[3,]),
c(A[3,] - B[1,]),
c(A[3,] - B[2,]),
c(A[3,] - B[3,]),
c(A[4,] - B[1,]),
c(A[4,] - B[2,]),
c(A[4,] - B[3,]),
c(A[5,] - B[1,]),
c(A[5,] - B[2,]),
c(A[5,] - B[3,]))
# loop method
comb <- expand.grid(1:nrow(A), 1:nrow(B))
dia_loop <- list()
for (i in 1:nrow(comb)) {
dia_loop[[i]] <- A[comb[i, 1], ] - B[comb[i, 2], ]
}
dia_loop <- unlist(dia_loop)
# outer/apply method
dia_outer <- apply(outer(A, B, FUN='-'), c(3, 1), diag)
# they all return the same values
all.identical <- function(l) {
all(sapply(2:length(l), FUN=function(x) identical(l[1], l[x])))
}
all.identical(lapply(list(dia_long, dia_loop, dia_outer), sort))
# TRUE
table(dia_long)
# dia_long
# -5 -4 -3 -2 -1 0 1 2 3
# 1 2 4 5 6 5 4 2 1
Are these the values you are looking for?
My solution: use nested lapply and sapply functions to extract the diagonals. I then needed to do some post-processing (not related to this specific problem), before I then turned it into an array. Should be noted that this is a q x 2 x n array, which turned out to be better for my purposes - this could be permuted with aperm from here though to solve the original question.
A <- matrix(1:10, nrow = 5, ncol = 2)
B <- matrix(4:9, nrow = 3, ncol = 2)
diffs <- outer(A,B, FUN = '-')
diffs <- lapply(X = 1:nrow(A),FUN = function(y){
t(sapply(1:ncol(B), FUN = function(x) diag(diffs[y,,x,])))})
diffs <- array(unlist(lapply(diffs, FUN = t)), dim = c(nrow(B),2,nrow(A)))

R - Add combinations of elements within a vector

Does anyone know of a way to add up combinations of numbers within a vector?
Suppose I am going through a for loop and each time I end up with a vector of different lengths, how could I combine each element of this vector such that I have the sum of 2, 3, etc elements?
For example if I have:
vector <- c(1:5)
And want to go through it as in:
element 1 + element 2; element 2 + element 3, etc
But also:
element 1 + element 2 + element 3
How would I do this? It's important to note that in many of the vectors the lengths will be different. So whilst one vector might contain 3 elements another might contain 12.
I know you can do vector[1]+vector[2], but I need some way to iterate throughout the vector wherein it takes into account the above note.
Use you can use combn:
> combn(vector, 3, FUN = NULL, simplify = TRUE)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 1 1 1 2 2 2 3
[2,] 2 2 2 3 3 4 3 3 4 4
[3,] 3 4 5 4 5 5 4 5 5 5
The trick here is that each call will return a matrix of results, and you will have to decide how you want to aggregate and store all the various combinations.
If you don't mind having a list, then the following should do the trick:
> sapply(c(1:length(vector)),
function(x) {
combn(vector, x, FUN = NULL, simplify = TRUE)
})
Generate pair IDs
In this case, we need to get the pairs:
combn(3, 2)
Output:
[,1] [,2] [,3]
[1,] 1 1 2
[2,] 2 3 3
Pairs are generated by column.
Sum Over Vector Elements (Using a Subset)
To access each element and perform a summation, we opt to define a helper function that takes the combination and the vector.
# Write a helper function
# sums of the index of the vector
comb_subset_sum = function(x, vec){
return(sum(vec[x]))
}
From this, we can use combn directly or use sapply.
Summing for 1 k:
combn directly:
# Input Vector
vec = 1:5
# Length of vector
n = length(vec)
# Generate pairwise combinations and obtain pair_sum
# Specify the k (m in R)
m = combn(n, m = 2, FUN = comb_subset_sum, vec = vec)
sapply usage:
# Input Vector
vec = 1:5
# Number of Observations
n = length(vec)
# Combinations
# Specify the k (m in R)
combinations = combn(n, m = 2)
# Obtain vectorized sum over subset
subset_summed = apply(combinations, 2, comb_subset_sum, vec = vec)
Example Output:
combinations:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 1 2 2 2 3 3 4
[2,] 2 3 4 5 3 4 5 4 5 5
subset_summed:
[1] 3 4 5 6 5 6 7 7 8 9
Trace:
vec[1]+vec[2]=3
vec[1]+vec[3]=4
vec[1]+vec[4]=5
vec[1]+vec[5]=6
vec[2]+vec[3]=5
vec[2]+vec[4]=6
vec[2]+vec[5]=7
vec[3]+vec[4]=7
vec[3]+vec[5]=8
vec[4]+vec[5]=9
To obtain the trace output, add the following before return() in comb_subset_sum():
cat(paste0("vec[",x,"]", collapse = "+"), "=", sum(vec[x]), "\n")
Summing for multiple k:
Here, we apply the same logic, just in a way that enables the k value of the combination to take multiple values.
# Input Vector
vec = 1:5
# Length of Vec
n = length(vec)
# Store output
o = vector('list',n)
for(i in seq_along(vec)){
o[[i]] = combn(n, i, FUN = comb_subset_sum, vec = vec)
}
Note: The size of each element of o will vary as the number of combinations will increase and then decrease.
Summing over combinations
If we do not care about vector element values, we can then just sum over the actual combinations in a similar way to how we obtained the vector elements.
To generate pairs and then sum, use:
# Input Vector
vec = 1:5
# Length of Vec
n = length(vec)
# Generate all combinations (by column)
# Specify the k (m in R)
m = combn(n, m = 2)
# Obtain sum by going over columns
sum_m = apply(m, 2, sum)
Or do it in one go:
# Specify the k (m in R)
sum_inplace = combn(n, m = 2, FUN = sum)
Equality:
all.equal(sum_m,sum_inplace)
Sum over k uses
And, as before, we can set it up to get all sums under different k by using:
# Input Vector
vec = 1:5
# Length of Vec
n = length(vec)
# Store output (varying lengths)
o = vector('list',n)
for(i in seq_along(vec)){
o[[i]] = combn(n, i, FUN = sum)
}
The following relies on the binary representation of number. Basically, you have 2^n combinations to check. By writing any number between 1 and 2^n in binary with 'n' bits, you have all the permutations of elements you might want.
The number2binary function comes from Paul Hiestra's answer in this tread: How to convert integer number into binary vector?
number2binary = function(number, noBits) {
binary_vector = rev(as.numeric(intToBits(number)))
if(missing(noBits)) {
return(binary_vector)
} else {
binary_vector[-(1:(length(binary_vector) - noBits))]
}
}
vector <- 1:5
n <- length(vector)
comp_sum <- function(x) {
binary <- number2binary(x, noBits = n)
result <- sum(vector[which(binary==1)])
names(result) <- paste(which(binary == 1), collapse = "+")
return(result)
}
binaries <- sapply(1:2^n-1, comp_sum)
Note: I only go up to 2^n - 1 as you do not need the "zero". By adding some conditions in your comp_sum function, you can pick only sums of two elements or of three elements...
You might be looking for rollsum from zoo package, where you can specify the number of elements you want to add up:
lapply(2:5, function(i) zoo::rollsum(1:5, i))
[[1]]
[1] 3 5 7 9 # two elements roll sum
[[2]]
[1] 6 9 12 # three elements roll sum
[[3]]
[1] 10 14 # four elements roll sum
[[4]]
[1] 15 # five elements roll sum

Resources