I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the row and column indices, since the sparse matrix slots are 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax like mat[rn, cn] (whether mat is a dense or sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9
rn <- 1:2
cn <- 2:3
A[rn, cn]
#     [,1] [,2]
#[1,]    4    7
#[2,]    5    8
What you intend to do is to select only (rn[1], cn[1]), (rn[2], cn[2]), and so on. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame giving the (i, j, x) triplets, where both i and j are indices starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there an even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce the problem size to what my laptop can handle
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
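If you want to see the gap for yourself, here is a rough timing sketch on the reduced m1 and m2 from the session above (not a rigorous benchmark, and the logical-indexing version may take a while even at this size):
## @x-based replacement: touches only the x slot
system.time({
    ind <- m1@x > 0 & m1@x < 0.05
    m2b <- m2
    m2b@x[ind] <- 1
})
## logical sparse-matrix indexing: natural syntax, but far slower
system.time({
    logi <- m1 > 0 & m1 < 0.05
    m2c <- m2
    m2c[logi] <- 1
})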
I'm working with the popbio package on a population model. It looks something like this:
library(popbio)
babies <- 0.3
kids <- 0.5
teens <- 0.75
adults <- 0.98
A <- c(0,0,0,0,teens*0.5,adults*0.8,
babies,0,0,0,0,0,
0,kids,0,0,0,0,
0,0,kids,0,0,0,
0,0,0,teens,0,0,
0,0,0,0,teens,adults
)
A <- matrix(A, ncol = 6, byrow = TRUE)
N <- c(10, 10, 10, 10, 10, 10)
N <- matrix(N, ncol = 1)
model <- pop.projection(A,N,iterations=10)
model
I'd like to know how I can randomise the input so that at each iteration, which represents a year in this case, I'd get a different input for the matrix elements. So, for instance, my model runs for 10 years, and I'd like to have the baby survival rate change for each year. babies <- rnorm(1, 0.3, 0.1) doesn't do it because that still leaves me with a single value, just randomly selected.
Update: This is distinct from running 10 separate models with different initial, random values. I'd like the update to occur within a single model run, which itself has 10 iterations in the pop.projection function.
Hope you can help.
I know this answer is very late, but here's one approach using expressions. First, use an expression to create the matrix.
vr <- list( babies=0.3, kids=0.5, teens=0.75, adults=0.98 )
Ax <- expression( matrix(c(
0,0,0,0,teens*0.5,adults*0.8,
babies,0,0,0,0,0,
0,kids,0,0,0,0,
0,0,kids,0,0,0,
0,0,0,teens,0,0,
0,0,0,0,teens,adults), ncol=6, byrow = TRUE ))
A1 <- eval(Ax, vr)
lambda(A1)
[1] 1.011821
Next, use an expression to create vital rates with rnorm or other functions.
vr2 <- expression( list( babies=rnorm(1,0.3,0.1), kids=0.5, teens=0.75, adults=0.98 ))
A2 <- eval(Ax, eval( vr2))
lambda(A2)
[1] 1.014586
Apply the expression to 100 matrices.
x <- sapply(1:100, function(x) lambda(eval(Ax, eval(vr2))))
quantile(x, c(.05,.95))
5% 95%
0.996523 1.025900
Finally, make two small changes to pop.projection by adding the vr option and a line to evaluate A at each time step.
pop.projection2 <- function (Ax, vr, n, iterations = 20)
{
    x <- length(n)
    t <- iterations
    stage <- matrix(numeric(x * t), nrow = x)
    pop <- numeric(t)
    change <- numeric(t - 1)
    for (i in 1:t) {
        stage[, i] <- n
        pop[i] <- sum(n)
        if (i > 1) {
            change[i - 1] <- pop[i]/pop[i - 1]
        }
        ## evaluate Ax
        A <- eval(Ax, eval(vr))
        n <- A %*% n
    }
    colnames(stage) <- 0:(t - 1)
    w <- stage[, t]
    pop.proj <- list(lambda = pop[t]/pop[t - 1], stable.stage = w/sum(w),
                     stage.vectors = stage, pop.sizes = pop, pop.changes = change)
    pop.proj
}
n <-c(10,10,10,10,10,10)
pop.projection2(Ax, vr2, n, 10)
$lambda
[1] 0.9874586
$stable.stage
[1] 0.33673579 0.11242588 0.08552367 0.02189786 0.02086656 0.42255023
$stage.vectors
0 1 2 3 4 5 6 7 8 9
[1,] 10 11.590000 16.375700 19.108186 20.2560223 20.5559445 20.5506251 20.5898222 20.7603581 20.713271
[2,] 10 4.147274 3.332772 4.443311 5.6693931 1.9018887 6.8455597 5.3879202 10.5214540 6.915534
[3,] 10 5.000000 2.073637 1.666386 2.2216556 2.8346965 0.9509443 3.4227799 2.6939601 5.260727
[4,] 10 5.000000 2.500000 1.036819 0.8331931 1.1108278 1.4173483 0.4754722 1.7113899 1.346980
[5,] 10 7.500000 3.750000 1.875000 0.7776139 0.6248948 0.8331209 1.0630112 0.3566041 1.283542
[6,] 10 17.300000 22.579000 24.939920 25.8473716 25.9136346 25.8640330 25.9715930 26.2494195 25.991884
$pop.sizes
[1] 60.00000 50.53727 50.61111 53.06962 55.60525 52.94189 56.46163 56.91060 62.29319 61.51194
$pop.changes
[1] 0.8422879 1.0014610 1.0485765 1.0477793 0.9521023 1.0664832 1.0079517 1.0945797 0.9874586
I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row in df1 in terms of hamming distance.
I used the e1071 package to compute hamming distance. The hamming distance between two vectors x and y can be computed, for example, as:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
library(e1071)  # for hamming.distance()
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method would be a nested double loop over all rows, which looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integer vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute the hamming distance between two integer vectors of equal length x and y. This only uses base R, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
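If you want to see the speed difference yourself, you can time both approaches on a longer pair of vectors. This is only a rough sketch (the vector length and seed are arbitrary choices of mine), not a rigorous benchmark:
library(e1071)
set.seed(1)
x <- sample.int(.Machine$integer.max, 10000)
y <- sample.int(.Machine$integer.max, 10000)
## the question's approach: one e1071 call per element
system.time({
    xm <- sapply(x, intToBits)
    ym <- sapply(y, intToBits)
    d1 <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[, i], ym[, i])))
})
## the vectorized base R approach
system.time(d2 <- hmd0(x, y))
c(d1, d2)  ## both give the same total distance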
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x, y) {
    rawx <- intToBits(x)
    rawy <- intToBits(y)
    nx <- length(rawx)
    ny <- length(rawy)
    if (nx == ny) {
        ## quick return
        return(sum(as.logical(xor(rawx, rawy))))
    } else if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
    }
    if (nx %% ny) stop("unconformable length!") else {
        nc <- nx / ny  ## number of cycles
        return(unname(tapply(as.logical(xor(rawx, rawy)), rep(1:nc, each = ny), sum)))
    }
}
Note that:
hmd performs computation column-wise. It is designed to be CPU cache friendly. For this reason, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply() (see the small illustration below).
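To see what the rep()/tapply() pair is doing, here is a tiny hand-made illustration (toy logical values of my own, not taken from the answer): two "columns" of 4 bits each are packed into one vector, labelled with their column id, and the TRUEs are summed within each label.
bits <- c(TRUE, FALSE, TRUE, FALSE,   # column 1: two differing bits
          TRUE, TRUE,  TRUE, FALSE)   # column 2: three differing bits
tapply(bits, rep(1:2, each = 4), sum)
# 1 2
# 2 3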
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices df1 and df2, computing the distance between each row of df2 and all rows of df1. The argument p is an integer specifying how many results you want to retain; p = 3 will keep the smallest 3 distances with their row ids in df1.
foo <- function(df1, df2, p) {
    ## check p
    if (p > nrow(df2)) p <- nrow(df2)
    ## transpose for CPU cache friendly code
    xt <- t(as.matrix(df1))
    yt <- t(as.matrix(df2))
    ## after transpose, we compute hamming distance column by column
    ## a for loop is decent; no performance gain from apply family
    n <- ncol(yt)
    id <- integer(n * p)
    d <- numeric(n * p)
    k <- 1:p
    for (i in 1:n) {
        distance <- hmd(xt, yt[, i])
        minp <- order(distance)[1:p]
        id[k] <- minp
        d[k] <- distance[minp]
        k <- k + p
    }
    ## recode "id" and "d" into data frames and return
    id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
    colnames(id) <- paste0("min.", 1:p)
    d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
    colnames(d) <- paste0("mindist.", 1:p)
    list(id = id, d = d)
}
Note that:
transposition is done at the beginning, for the reasons given before;
a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using the *apply family, since we ask for multiple outputs (the row ids id and the distances d).
Experiment
This part uses small dataset to test/demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min1 min2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
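In case it is not obvious, a minimal sketch with the toy df1, df2 and foo() from above would be:
res <- foo(df1, df2, p = 2)
## append the best-match row ids and distances as new columns of df2
df2 <- cbind(df2, res$id, res$d)
df2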
Please don't be surprised that I am taking another section. This part gives something relevant; it is not what the OP asks for, but it may help other readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
    nx <- length(x)
    ny <- length(y)
    rawx <- intToBits(x)
    rawy <- intToBits(y)
    if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x), intToBits(y)))))
    if (nx < ny) {
        ## pivoting
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
    }
    if (nx %% ny) stop("unconformable length!") else {
        bits <- length(intToBits(0))  ## 32-bit or 64 bit?
        result <- unname(tapply(as.logical(xor(rawx, rawy)), rep(1:ny, each = bits), sum))
    }
    if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
fun <- function(X, Y) {
    # Convert integers to bits
    X <- intToBits(t(X))
    # Reshape into matrix: one column per row of the original X
    # (nrows and ncols are taken from the calling environment, see below)
    dim(X) <- c(ncols * 32, nrows)
    # Convert integers to bits
    Y <- intToBits(t(Y))
    # Reshape into matrix
    dim(Y) <- c(ncols * 32, nrows)
    # Calculate pairwise hamming distances using matrix
    # multiplication.
    # Columns of H index into Y; rows index into X.
    # The code for the hamming() function was retrieved
    # from this page:
    # https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
    H <- hamming(X, Y)
    # Now, for each row in Y, find the two best matches
    # in X. In other words: for each column in H, find
    # the two smallest values and their row indices.
    t(apply(H, 2, function(h) {
        mindists <- sort(h, partial = 1:2)
        c(
            ind1 = which(h == mindists[1])[1],
            val1 = mindists[1],
            hmd2 = which(h == mindists[2])[1],
            val2 = mindists[2]
        )
    }))
}
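Note that fun() relies on a hamming() helper taken from the linked blog post, which is not reproduced here. If you cannot fetch it, a minimal stand-in along the same matrix-multiplication lines (my own sketch, not the blog's code) could look like this; it expects the bit matrices built inside fun(), i.e. one column of bits per row of the original data:
hamming <- function(X, Y) {
    ## convert the raw bit matrices to 0/1 numerics, keeping dimensions
    X <- matrix(as.numeric(X), nrow = nrow(X))
    Y <- matrix(as.numeric(Y), nrow = nrow(Y))
    ## a bit contributes to the distance when exactly one of the two is 1:
    ## xor(a, b) = a * (1 - b) + (1 - a) * b, summed over all bits
    t(X) %*% (1 - Y) + t(1 - X) %*% Y
}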
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.
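To check the timing on your own machine, wrapping the call is enough:
system.time(res <- fun(X, Y))
dim(res)  # one row per row of Y, four columns as described above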
I have the following code, which returns the model with the lowest AIC, but I want all the models with their AIC in ascending or descending order, without using the built-in sort function in R.
sp <- rnorm(100) ## just some toy data to make code work!
spfinal.aic <- Inf
spfinal.order <- c(0,0,0)
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
    spcurrent.aic <- AIC(arima(sp, order = c(p, d, q)))
    if (spcurrent.aic < spfinal.aic) {
        spfinal.aic <- spcurrent.aic
        spfinal.order <- c(p, d, q)
        spfinal.arima <- arima(sp, order = spfinal.order)
    }
}
I want spfinal.order <- c(p, d, q) to be a list of all models in ascending or descending order of AIC. How can I do that?
The code below does what you want. Since you want a record of all models tried, no comparison is done inside the loop. A vector aic.vec will hold the AIC values of all models, while a matrix order.matrix will hold, column by column, the ARIMA specifications. In the end, we order both by ascending AIC values, so that you know the best model is the first one.
sp <- rnorm(100) ## just some toy data to make code work!
order.matrix <- matrix(0, nrow = 3, ncol = 4 * 2 * 4)
aic.vec <- numeric(4 * 2 * 4)
k <- 1
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
    order.matrix[, k] <- c(p, d, q)
    aic.vec[k] <- AIC(arima(sp, order = c(p, d, q)))
    k <- k + 1
}
ind <- order(aic.vec, decreasing = FALSE)
aic.vec <- aic.vec[ind]
order.matrix <- order.matrix[, ind]
I did not use a list to store the ARIMA specifications, because I think a matrix is better. At the moment the matrix is in wide format, i.e., with 3 rows and many columns. You can transpose it for better printing:
order.matrix <- t(order.matrix)
Maybe you also want to bind order.matrix and aic.vec together for better presentation? Do this:
result <- cbind(order.matrix, aic.vec)
colnames(result) <- c("p", "d", "q", "AIC")
I think this makes it easier for you to inspect. Example output (first 5 rows):
> result
p d q AIC
[1,] 2 0 2 305.5698
[2,] 3 0 3 305.8882
[3,] 1 0 3 307.8365
[4,] 2 1 3 307.9137
[5,] 1 1 2 307.9952
I have a matrix in which I would like to find those columns that are very similar (I am not looking to find identical columns)
# to generate a matrix
Mat<- matrix(rexp(200, rate=.1), ncol=1000, nrow=400)
I personally thought of "cor" or "all.equal" and I did as follows, but it did not work.
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
What I need as output is to show which columns are highly similar and the degree of their similarity (it can be the correlation coefficient).
Similar means their values are close within some threshold (for example, over 75% of the element-wise differences (e.g. column1 - column2) are smaller than 0.5 in absolute value).
I would also love to see how this is different from correlation. Do they give identical results?
Using correlation you could try (with a simpler matrix for demonstration)
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)
data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])
row col correlation
1 8 1 0.3387738
2 6 2 0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
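You can verify any flagged pair directly against the original matrix, for example:
cor(Mat[, 8], Mat[, 1])
# [1] 0.3387738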
I'd take a linear regression approach:
Mat <- matrix(rexp(200, rate = .1), ncol = 100, nrow = 400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for (i in 1:ncol(combinations)) {
    # residual standard error for the i-th pair of columns
    sigma <- c(sigma, summary(lm(Mat[, combinations[1, i]] ~ Mat[, combinations[2, i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
And use the residual standard error as an optional criterion.
You can further order the data frame by sigma and get the best/worst combinations, as sketched below.
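For example, a small sketch of that ordering:
## order the pairs from best fit (smallest residual standard error) to worst
sigma_ordered <- sigma[order(sigma$sigma), ]
head(sigma_ordered)
## recover the actual column indices of the best pair
combinations[, sigma_ordered$comb_nr[1]]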
If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol=100, nrow=400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)
# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
    i <- col.combs[k, 1]
    j <- col.combs[k, 2]
    # Difference within threshold?
    diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
    pair.corr <- cor(Mat[, i], Mat[, j])
    if (mean(diff.thresh) > 0.75)
        similar <- rbind(similar, c(i, j, pair.corr, 100 * mean(diff.thresh)))
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
Col1 Col2 Corr Pct.Diff
1 1 2 -0.003187894 76.75
2 1 3 0.074061019 76.75
3 1 4 0.082668387 78.00
4 1 5 0.001713751 75.50
5 1 8 0.052228907 75.75
6 1 12 -0.017921978 78.00
Perhaps it's not the best solution, but it gets the job done.
Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
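For example, with a short logical vector:
v <- c(TRUE, TRUE, FALSE, TRUE)
sum(v)   # 3 values are TRUE
mean(v)  # 0.75, the fraction of TRUEs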