How to find the most similar columns in a matrix? - R

I have a matrix in which I would like to find the columns that are very similar (I am not looking for identical columns).
# to generate a matrix
Mat <- matrix(rexp(400 * 1000, rate = .1), ncol = 1000, nrow = 400)
I personally thought of cor() or all.equal() and tried the following, but it did not work:
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
What I need as output is which columns are highly similar and the degree of their similarity (it can be the correlation coefficient).
"Similar" means their values are close within some threshold, for example over 75% of the element-wise residuals (e.g. column1 - column2) are less than 0.5 in absolute value.
I would also love to see how this differs from using correlation. Do they give identical results?

Using correlation you could try (with a simpler matrix for demonstration):
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <- which(lower.triangle(corr) > .3, arr.ind = TRUE)
data.frame(res[res[, 1] != res[, 2], ], correlation = corr[res[res[, 1] != res[, 2], ]])
  row col correlation
1   8   1   0.3387738
2   6   2   0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
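You can check this directly; with the same seed and Mat as above, it should reproduce the value reported in the table:
cor(Mat[, 8], Mat[, 1])
# [1] 0.3387738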

I'd take a linear regression approach:
Mat <- matrix(rexp(400 * 100, rate = .1), ncol = 100, nrow = 400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for (i in 1:ncol(combinations)) {
  # regress one column of each pair on the other and keep the residual standard error
  sigma <- c(sigma, summary(lm(Mat[, combinations[1, i]] ~ Mat[, combinations[2, i]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
The residual standard error (sigma) serves as the similarity criterion. You can then order the data frame by sigma to get the best/worst combinations, as sketched below.
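For example, a minimal sketch of that ordering step, using the names defined above:
sigma <- sigma[order(sigma$sigma), ]
head(sigma)                          # pairs with the smallest residual standard error, i.e. the most similar
combinations[, head(sigma$comb_nr)]  # the column indices forming those pairs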

If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol=100, nrow=400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)
# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
  i <- col.combs[k, 1]
  j <- col.combs[k, 2]
  # Which element-wise differences are within the threshold?
  diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
  pair.corr <- cor(Mat[, i], Mat[, j])
  if (mean(diff.thresh) > 0.75) {
    similar <- rbind(similar, c(i, j, pair.corr, 100 * mean(diff.thresh)))
  }
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
  Col1 Col2         Corr Pct.Diff
1    1    2 -0.003187894    76.75
2    1    3  0.074061019    76.75
3    1    4  0.082668387    78.00
4    1    5  0.001713751    75.50
5    1    8  0.052228907    75.75
6    1   12 -0.017921978    78.00
Perhaps it's not the best solution, but it gets the job done.
Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.
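For example:
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)    # 3, the number of TRUE elements
mean(x)   # 0.75, the fraction of TRUE elements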

Related

How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1@i, j = summary(m1)$j, x = m1@x)
position_matrix_from_m2 <- rbind(i = m2@i, j = summary(m2)$j, x = m2@x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses too much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2@x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax like mat[rn, cn] (whether mat is a dense or sparse matrix) selects all rows in rn and all columns in cn, so you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
#     [,1] [,2] [,3]
#[1,]    1    4    7
#[2,]    2    5    8
#[3,]    3    6    9
rn <- 1:2
cn <- 2:3
A[rn, cn]
#     [,1] [,2]
#[1,]    4    7
#[2,]    5    8
What you intend to do is to select (rn[1], cn[1]), (rn[2], cn[2]), ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame giving the (i, j, x) triplets, where both i and j are indices starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
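For a quick feel of what summary() returns, here is a tiny illustrative example (a hypothetical 3 x 2 matrix, unrelated to the data above, with the Matrix package already loaded):
small <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(0.5, 2))
summary(small)  # a data frame with columns i, j, x, using 1-based indices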
Is there an even better solution? Yes! Note that the nonzero elements in m1 and m2 are located at the same positions, so basically you just need to change m2@x according to m1@x.
ind <- m1@x > 0 & m1@x < 0.05
m2@x[ind] <- 1
m1@x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1@x > 0 & m1@x < 0.05
m2b@x[ind] <- 1
m1b@x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
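To see the difference yourself, you can wrap the two variants in system.time() (a rough sketch, reusing the reduced-size m1 and m2 from the session above; m2_fast and m2_slow are just illustrative copies, and exact timings depend on your machine):
system.time({                    # direct slot manipulation (the "best solution" above)
  m2_fast <- m2
  ind <- m1@x > 0 & m1@x < 0.05
  m2_fast@x[ind] <- 1
})
system.time({                    # logical sparse-matrix indexing (this caveat)
  m2_slow <- m2
  logi <- m1 > 0 & m1 < 0.05     # the intermediate 'm1 < 0.05' is TRUE at the structural zeros as well
  m2_slow[logi] <- 1
})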

Looping 10 possible N samples and calculate sums of columns

I'm generating n samples, each of dimension m, and populating an m x n matrix. Then I use apply to go over every column of the matrix (every sample generated) and return the sum of the elements of each column. At the end I calculate the mean of all of those sums.
data = replicate(n, rnorm(m, mean = mu, sd = variance))
sum_of_column <- function(col) {
  s <- sum(col)
}
sums <- apply(data, 2, sum_of_column)
me <- mean(sums)
sums is the vector whose elements are the sums of the respective columns, and me is the mean of that vector.
But n is a single value, and I want it to be a sequence of numbers (like 1:10), meaning I want to run this algorithm for every possible n = 1, n = 2, ..., n = 10, storing sums and calculating their mean each time. I may end up with a two-dimensional structure (a data frame) where one column holds the n's and the other the corresponding mean of sums for that n.
In other words, I need to loop the algorithm I coded and store the value for each n-iteration, like:
n mean(sums)
1 123
2 13
...
10 94
I thought of doing this with a for loop, but would there be a smarter way to do this without explicitly looping? Maybe using apply for 3 dimensions?
You could put the logic into a function FUN. In its arguments, predefine m, mu, and sigma. n will be defined dynamically in the loop.
FUN <- \(n, m=1e5, mu=0, sigma=1) {
  mxn <- replicate(n, rnorm(m, mean=mu, sd=sigma))
  return(c(n=n, mean_of_sums=mean(colSums(mxn))))
}
FUN(1)
# n mean_of_sums
# 1 1 -226.6016
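Side note: colSums(mxn) above is just the vectorised form of the apply(data, 2, sum_of_column) step in the question; a quick check confirms the equivalence:
set.seed(1)
data <- replicate(3, rnorm(10))
all.equal(apply(data, 2, sum), colSums(data))
# [1] TRUE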
To loop over the n, you could use vapply, which is similar to sapply but predefines FUN.VALUE in the third argument; this saves work for R and is therefore faster. To get the n into rows, transpose the result.
n <- 1:100
set.seed(42)
r <- t(vapply(n, \(n) FUN(n), c(0, 0)))
r <- as.data.frame(r) ## if wanted
head(r)
# n mean_of_sums
# 1 1 -412.6182
# 2 2 -114.6650
# 3 3 304.1592
# 4 4 75.8026
# 5 5 -208.2705
# 6 6 126.6526
plot(r, type='l', col=4)
abline(h=0, col=8)

Is there a way to obtain and store positions of matrix element in JAGS?

I am developing a Bayesian hierarchical model in R with BUGS code in JAGS.
In my model, I have two matrices that contain related information at exactly the same matrix positions. My information is structured by rows. I apply a mathematical operation to the first matrix, Distmat, by row:
diffmat[i,j] <- abs(Distmat[birthterr[i],j] - Dist[i])
I would like to record the column position of the minimum value in each row of diffmat in a new vector, and then apply this vector to the second matrix. This would be relatively easy in regular R code using which() or which.min():
a <- numeric()
for (i in 1:dim(diffmat)[1])
  for (j in 1:dim(diffmat)[2])
    a[i] <- which.min(diffmat[i, ])
And then apply vector "a" to the second matrix (terrmat) to obtain the values associated with Distmat positions:
b <- numeric(0)
for (i in 1:dim(diffmat)[1])
  for (j in 1:dim(diffmat)[2])
    b[i] <- terrmat[i, a[i]]
However, BUGS code apparently recognizes neither which() nor which.min(), and I am struggling to find a way to store these matrix row positions in vectors. Perhaps there is a very simple solution, but I am really stuck. I hope my information is clear enough.
Any suggestions would be very appreciated. Thanks for your time!
Here's a minimal working example. The analogue here is that x plays the role of your diffmat. I'm drawing it at random, but the approach should still work if you define it otherwise. Essentially, you rank the values of x in each row and build a dummy matrix e that is coded 1 where x[i,j] is ranked 1 and 0 otherwise. Then you take the inner product of that row with a vector of values 1:ncol(terrmat), assuming terrmat and diffmat have the same dimensions. That gives you the column index of the first-ranked value for observation i. The ymat in the example below is where your terrmat would go. I think it will be pretty slow on any real-sized problem, but it appears to work, as the output below shows.
dl <- list(
  ymat = matrix(1:3, ncol=3, nrow=5, byrow=TRUE),
  yinds = 1:3
)
mods <- "model{
  for(i in 1:5){
    for(j in 1:3){
      x[i,j] ~ dnorm(0,1)
      e[i,j] <- equals(rx[i,j], 1)
    }
    rx[i,1:3] <- rank(x[i,1:3])
    ind[i] <- inprod(e[i,], yinds)
    yval[i] <- ymat[i,ind[i]]
  }
}"
library(runjags)
out <- run.jags(mods, data=dl, monitor="yval")
out
#
# JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
#
# Lower95 Median Upper95 Mean SD Mode MCerr MC%ofSD SSeff AC.10 psrf
# yval[1] 1 2 3 1.9973 0.8139 2 0.0058421 0.7 19409 -0.0077146 0.99996
# yval[2] 1 2 3 2.0067 0.81605 3 0.0057704 0.7 20000 0.00049096 1.0003
# yval[3] 1 2 3 1.9895 0.8142 2 0.0057573 0.7 20000 0.00066309 1
# yval[4] 1 2 3 1.9973 0.81638 1 0.0057727 0.7 20000 -0.00040016 0.99998
# yval[5] 1 2 3 1.993 0.81611 1 0.0057708 0.7 20000 -0.0027988 0.99996
#
# Total time taken: 0.7 seconds
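To see why the equals/rank/inprod construction recovers the which.min index, here is the same trick written in plain R (illustrative only, not JAGS code):
x   <- c(2.3, 0.4, 1.7)          # one row of diffmat
rx  <- rank(x)                   # 3 1 2
e   <- as.numeric(rx == 1)       # dummy vector: 0 1 0
ind <- sum(e * seq_along(x))     # inner product with 1:ncol -> 2
ind == which.min(x)              # TRUE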

Choosing values in a Matrix in R

I have a 25x25 matrix with numeric values and I want to select values that meet some conditions. For example, I want only the values from 0 to 0.2, stored in another matrix. How can I do this?
x <- matrix(rnorm(25*25), 25, 25)
which(x > 0.2)  # indices where x > 0.2
n <- 40
h <- hist(x, breaks = seq(min(x), max(x), length.out = n + 1), plot = FALSE)  # for multiple ranges and counts
h$breaks  # n+1 break points
h$counts  # n counts of numbers between those breakpoints
What you want can be done with simple logical operations; see the file R-intro.pdf that comes with your distribution of R, section 2.7 "Index vectors; selecting and modifying subsets of a data set".
set.seed(1356) # make the results reproducible
m <- matrix(rnorm(25*25), 25) # input matrix
i <- 0 <= m & m <= 0.2 # logical index into 'm'
# create a result matrix with the same dimensions as the input
m2 <- matrix(NA, nrow = nrow(m), ncol = ncol(m))
m2[i] <- m[i] # assign the values you want
m2
sum(i) # count of values in [0, 0.2]
sum(m < 0) # count of values less than zero
sum(m > 0.2) # count of values greater than 0.2
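If you only need the values themselves rather than a same-shaped matrix with NAs, logical indexing returns them directly as a vector:
vals <- m[i]   # the values in [0, 0.2] as a plain vector
length(vals)   # equals sum(i)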

How to apply a function to array margin and create pairwise combination matrix

I am using R to apply a self-written function, which takes as input two numeric vectors plus a numeric parameter, over the column margins of a data frame. Each column in the data frame is a numeric vector, and I want to perform pairwise computations and create a matrix that has all possible combinations of the columns together with the result of the computation for each pair. Essentially, I want to reproduce the behaviour of the cor() function.
# Data
> head(d)
            1         2         3            4
1 -1.01035342 1.2490665 0.7202516  0.101467379
2 -0.50700743 1.4356733 0.9032172 -0.001583743
3 -0.09055243 0.4695046 2.4487632 -1.082570048
4  1.11230416 0.2885735 0.3534247 -0.728574628
5 -1.96115691 0.4831158 1.5650052  0.648675605
6  1.20434218 1.7668086 0.2170858 -0.161570792
> cor(d)
            1           2           3           4
1  1.00000000  0.08320968 -0.06432155  0.04909430
2  0.08320968  1.00000000 -0.04557743 -0.01092765
3 -0.06432155 -0.04557743  1.00000000 -0.01654762
4  0.04909430 -0.01092765 -0.01654762  1.00000000
I found this useful answer: Perform pairwise comparison of matrix
Based on this, I wrote the function below, which makes use of another self-written function, compareFunctions():
createProbOfNonEqMatrix <- function(df, threshold) {
  combinations <- combn(ncol(df), 2)
  predDF <- matrix(nrow = length(density(df[,1])$y)) # df creation for predicted values from density function
  for (i in 1:ncol(df)) {
    predCol <- density(df[,i])$y # convert df of original values to df of predicted values from density function
    predDF <- cbind(predDF, predCol)
  }
  predDF <- predDF[, 2:ncol(predDF)]
  colnames(predDF) <- colnames(df) # give the predicted values column names as in the original df
  predDF <- as.matrix(predDF)
  out.mx <- apply(X = combinations, MARGIN = 2, FUN = "compareFunctions",
                  predicted_by_first = predDF[, combinations[1]],
                  predicted_by_second = predDF[, combinations[2]],
                  threshold = threshold)
  return(out.mx)
}
The predicted_by_first, predicted_by_second and threshold arguments are inputs to compareFunctions. However, I get the following error:
Error in FUN(newX[, i], ...) : unused argument (newX[, i])
In desperation I tried this:
createProbOfNonEqMatrix <- function(df, threshold) {
  combinations <- combn(ncol(df), 2)
  predDF <- matrix(nrow = length(density(df[,1])$y))
  for (i in 1:ncol(df)) {
    predCol <- density(df[,i])$y
    predDF <- cbind(predDF, predCol)
  }
  predDF <- predDF[, 2:ncol(predDF)]
  colnames(predDF) <- colnames(df)
  predDF <- as.matrix(predDF)
  out.mx <- apply(
    X = combinations, MARGIN = 2, FUN = function(x) {
      diff <- abs(predDF[, x[1]] - predDF[, x[2]])
      boolean <- diff < threshold
      acceptCount <- length(boolean[boolean == TRUE])
      probability <- acceptCount / length(diff)
      return(probability)
    }
  )
  return(out.mx)
}
It does seem to be working, but instead of returning the pairwise matrix it gives me a vector:
> createProbOfNonEqMatrix(d,0.001)
[1] 0.10351562 0.08203125 0.13476562 0.13085938 0.14843750 0.10937500
Will you be able to guide me on how to produce the desired pairwise matrix, even if it means rewriting the function code inside apply()? Also, if you could give me an idea of how to keep track of which pairwise comparisons are performed, it would be greatly appreciated.
Thank you,
Alex
Your output gives you the result of the calculation in the order of the pairs in combinations: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). If you want to organise this into a symmetric square matrix you can do a basic manipulation on the result, e.g. as follows:
out.mx <- c(0.10351562, 0.08203125, 0.13476562, 0.13085938, 0.14843750, 0.10937500)
out.mtx <- matrix(nrow = ncol(d), ncol = ncol(d))
out.mtx[,] <- 1
for (i in 1:length(combinations[1,])) {
  a <- combinations[1, i]
  b <- combinations[2, i]
  out.mtx[a, b] <- out.mtx[b, a] <- out.mx[i]
}
out.mtx
which gives you
           [,1]      [,2]       [,3]      [,4]
[1,] 1.00000000 0.1035156 0.08203125 0.1347656
[2,] 0.10351562 1.0000000 0.13085938 0.1484375
[3,] 0.08203125 0.1308594 1.00000000 0.1093750
[4,] 0.13476562 0.1484375 0.10937500 1.0000000
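As a side note, here is a vectorised sketch of the same reshaping (assuming combinations is the 2-row output of combn(ncol(d), 2) used in the question):
out.mtx <- matrix(1, ncol(d), ncol(d))
out.mtx[t(combinations)] <- out.mx           # upper-triangle positions of each pair
out.mtx[t(combinations[2:1, ])] <- out.mx    # mirrored lower-triangle positions
out.mtx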
