Sorting ARIMA AIC in R - r

I have the following code which returns the model with the lowest AIC, but I want all the models with their AIC in ascending or descending order without using the built-in sort function in R
sp <- rnorm(100) ## just some toy data to make code work!
spfinal.aic <- Inf
spfinal.order <- c(0,0,0)
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
spcurrent.aic <- AIC(arima(sp, order=c(p, d, q)))
if (spcurrent.aic < spfinal.aic) {
spfinal.aic <- spcurrent.aic
spfinal.order <- c(p, d, q)
spfinal.arima <- arima(sp, order=spfinal.order)
}
}
I want spfinal.order<-c(p,d,p) to be a list of all models in ascending or descending order of AIC. How can I do that?

The code below does what you want. As you want a record of all models having been tried, no comparison is done inside the loop. A vector aic.vec will hold AIC values of all models, while a matrix order.matrix will hold column-by-column the ARIMA specification. In the end, we sort both by ascending AIC values, so that you know the best model is the first one.
sp <- rnorm(100) ## just some toy data to make code work!
order.matrix <- matrix(0, nrow = 3, ncol = 4 * 2 * 4)
aic.vec <- numeric(4 * 2 * 4)
k <- 1
for (p in 1:4) for (d in 0:1) for (q in 1:4) {
order.matrix[, k] <- c(p,d,q)
aic.vec[k] <- AIC(arima(sp, order=c(p, d, q)))
k <- k+1
}
ind <- order(aic.vec, decreasing = FALSE)
aic.vec <- aic.vec[ind]
order.matrix <- order.matrix[, ind]
I did not use a list to store the ARIMA specification, because I think a matrix is better. At the moment the matrix is in wide format, i.e., with 3 rows while many columns. You can transpose it for better printing:
order.matrix <- t(order.matrix)
Maybe you also want to bind order.matrix and aic.vec together for better presentation? Do this:
result <- cbind(order.matrix, aic.vec)
colnames(result) <- c("p", "d", "q", "AIC")
I think this makes it easier for you to inspect. Example output (first 5 rows):
> result
p d q AIC
[1,] 2 0 2 305.5698
[2,] 3 0 3 305.8882
[3,] 1 0 3 307.8365
[4,] 2 1 3 307.9137
[5,] 1 1 2 307.9952

Related

How to modify non-zero elements of a large sparse matrix based on a second sparse matrix in R

I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1#i, j = summary(m1)$j, x = m1#x)
position_matrix_from_m2 <- rbind(i = m2#i, j = summary(m2)$j, x = m2#x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses to much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2#x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax as mat[rn, cn] (whether mat is a dense or sparse matrix) is selecting all rows in rn and all columns in cn. So you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select (rc[1], cn[1]), (rc[2], cn[2]) ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame, giving (i, j, x) triplet, where both i and j are index starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there a even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2#x according to m1#x.
ind <- m1#x > 0 & m1#x < 0.05
m2#x[ind] <- 1
m1#x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1#x > 0 & m1#x < 0.05
m2b#x[ind] <- 1
m1b#x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.

Saving quantities is a for loop

I am having problems when saving the results in a for loop.
I am computing a variance (this is not relevant I think) and my code is:
library(dirmult)
n <- 50
p <- 20
size <- 5*p
prob_true <- rep(1/p, p)
multinom <- as.matrix(rmultinom(n, size, prob = prob_true))
zeros <- round(0.5*p*n)
a <- c(as.matrix(multinom))
a[sample(1:(p*n), zeros)] <- 0
data_zeros <- matrix(a, p, n)
dirmult <- dirmult(t(data_zeros))
alpha <- dirmult$gamma
sum_alpha <- (1-dirmult$theta)/dirmult$theta
for (j in ncol(data_zeros)){
A <- alpha/sum_alpha
B <- 1 - A
N <- colSums(data_zeros)
C <- 1 + sum_alpha
var_s_dirm <- list()
var_s_dirm[[j]] <- N[j]*A*B*((N[j]+sum_alpha)/C)
}
In particular I can say that alpha is a vector with 20 values, sum_alpha is a scalar data_zeros is my dataset which has 20 rows and 50 columns and N is the sum of each column of the dataset, so it is a vector with 50 values.
It seems very simple to do what I wanted to do:
I want to get a list with 50 vectors where each one differs form the other by the fact that I multiply for a different value of N.
I really hope that somebody can help me finding the error.
The problem is (probably) you are setting constants in each time j is increased, and in each step you clear the list with the line var_s_dirm <- list()...
See if this works for you
library(dirmult)
n <- 50
p <- 20
size <- 5*p
prob_true <- rep(1/p, p)
multinom <- as.matrix(rmultinom(n, size, prob = prob_true))
zeros <- round(0.5*p*n)
a <- c(as.matrix(multinom))
a[sample(1:(p*n), zeros)] <- 0
data_zeros <- matrix(a, p, n)
dirmult <- dirmult(t(data_zeros))
alpha <- dirmult$gamma
sum_alpha <- (1-dirmult$theta)/dirmult$theta
A <- alpha/sum_alpha
B <- 1 - A
N <- colSums(data_zeros)
C <- 1 + sum_alpha
var_s_dirm <- list()
for (j in 1:ncol(data_zeros)){
var_s_dirm[[j]] <- N[j]*A*B*((N[j]+sum_alpha)/C)
}
output
var_s_dirm
[[1]]
[1] 2.614833 2.327105 2.500483 3.047700 2.233528 2.130223 2.700103 2.869699 2.930213 2.575903 2.198459 2.846096
[13] 2.425448 3.517559 3.136266 2.565345 2.578267 2.763113 2.709707 3.420792
[[2]]
[1] 2.568959 2.286279 2.456615 2.994231 2.194343 2.092850 2.652732 2.819353 2.878806 2.530712 2.159889 2.796165
[13] 2.382897 3.455848 3.081244 2.520339 2.533034 2.714637 2.662168 3.360778
[[3]]
[1] 3.211199 2.857849 3.070769 3.742790 2.742930 2.616064 3.315916 3.524193 3.598509 3.163391 2.699862 3.495207
[13] 2.978622 4.319811 3.851556 3.150424 3.166294 3.393297 3.327711 4.200974
....

Creating this minimization function

I have a sample S with predictor vectors Xs and Binary responses Ys. I would like to create the following function:
Basically, for a given X_i it gives me the minimum of the l2 norm of its difference from observations with a different outcome. I know that you can probably create a function that just loops over all the differences one by one. But I wonder if there's some vectorized way?
My attempt:
The data set is n by p+1, n = sample size, p = # predictors, Y as last row:
g <- vector();
for(i in 1:n){
temp <- t(data[data$Y[i] != data$Y, 1:p]) - as.vector(t(data[i,1:p]))
g[i]<- min(col.norm(temp))
}
Not sure how big is your data. I guess you want to vectorise in order to optimise for performance. Here you have a possible solution using slqdf.
(I looked for alternatives like data.table but I did not succeed to perform a cartesian product with your condition that Y != Y on each side of the join...)
First I create a toy dataframe and run it with your code.
library(tilting)
data <- data.frame(p1=seq(1, 6), p2=seq(12, 17), p3=seq(14,9), Y=c(rep(0,3), rep(1,3)))
g <- vector();
n <- nrow(data)
p <- 3
for(i in 1:n){
temp <- t(data[data$Y[i] != data$Y, 1:p]) - as.vector(t(data[i,1:p]))
g[i]<- min(col.norm(temp))
}
Where the result for g is,
[1] 5.196152 3.464102 1.732051 1.732051 3.464102 5.196152
Then I add an index to the data using rownames and perform the cartesian product,
data$idx <- rownames(data)
library(sqldf)
predictors <- c("p1", "p2", "p3")
cart <- sqldf(paste0("select d1.idx,",
paste("d1", predictors, collapse=",", sep="."), ",",
paste("d2", predictors, collapse=",", sep="."),
" from data as d1",
" join data as d2",
" on d1.Y != d2.Y"))
Then I calculate your function,
cart$d <- sqrt(rowSums((cart[, 2:4] - cart[, 5:7])^2))
and minimising is just like aggregating results.
> aggregate(cart, by=list(cart$idx), FUN=min)[, c("idx", "d")]
idx d
1 1 5.196152
2 2 3.464102
3 3 1.732051
4 4 1.732051
5 5 3.464102
6 6 5.196152
As I said, not sure of your objective, but I hope it helps.

Computing pairwise Hamming distance between all rows of two integer matrices/data frames

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row to df1 in terms of hamming distance.
I used e1071 package to compute hamming distance. Hamming distance between two vectors x and y can be computed as for example:
x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
24197, 610187, 402471, 157122, 866381, 582868, 878)
y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)
xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)
distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))
and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a double loop nest and looks terribly slow.
Any ideas how to do this more efficiently? In the end I need to append to df2:
a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.
Thanks.
Fast computation of hamming distance between two integers vectors of equal length
As I said in my comment, we can do:
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
to compute hamming distance between two integers vectors of equal length x and y. This only uses R base, yet is more efficient than e1071::hamming.distance, because it is vectorized!
For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)
Fast hamming distance between a matrix and a vector
If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
hmd <- function(x,y) {
rawx <- intToBits(x)
rawy <- intToBits(y)
nx <- length(rawx)
ny <- length(rawy)
if (nx == ny) {
## quick return
return (sum(as.logical(xor(rawx,rawy))))
} else if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
nc <- nx / ny ## number of cycles
return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
}
}
Note that:
hmd performs computation column-wise. It is designed to be CPU cache friendly. In this way, if we want to do some row-wise computation, we should transpose the matrix first;
there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames
This is what you want. The following function foo takes two data frames or matrices df1 and df2, computing the distance between df1 and each row of df2. argument p is an integer, showing how many results you want to retain. p = 3 will keep the smallest 3 distances with their row ids in df1.
foo <- function(df1, df2, p) {
## check p
if (p > nrow(df2)) p <- nrow(df2)
## transpose for CPU cache friendly code
xt <- t(as.matrix(df1))
yt <- t(as.matrix(df2))
## after transpose, we compute hamming distance column by column
## a for loop is decent; no performance gain from apply family
n <- ncol(yt)
id <- integer(n * p)
d <- numeric(n * p)
k <- 1:p
for (i in 1:n) {
distance <- hmd(xt, yt[,i])
minp <- order(distance)[1:p]
id[k] <- minp
d[k] <- distance[minp]
k <- k + p
}
## recode "id" and "d" into data frame and return
id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
colnames(id) <- paste0("min.", 1:p)
d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
colnames(d) <- paste0("mindist.", 1:p)
list(id = id, d = d)
}
Note that:
transposition is done at the beginning, according to reasons before;
a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using *apply family, since we ask for multiple output (row id id and distance d).
Experiment
This part uses small dataset to test/demonstrate our functions.
Some toy data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
Test hmd first (needs transposition):
hmd(t(as.matrix(df1)), df2[1, ]) ## df1 & first row of df2
# [1] 2 4 6 2 4
Test foo:
foo(df1, df2, p = 2)
# $id
# min1 min2
# 1 1 4
# 2 2 3
# 3 5 2
# $d
# mindist.1 mindist.2
# 1 2 2
# 2 1 3
# 3 1 3
If you want to append some columns to df2, you know what to do, right?
Please don't be surprised why I take another section. This part gives something relevant. It is not what OP asks for, but may help any readers.
General hamming distance computation
In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:
set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)
we will end up with a scalar:
hmd0(x,y)
# 13
What if we want to compute pairwise hamming distance of two vectors?
In fact, a simple modification to our function hmd will do:
hamming.distance <- function(x, y, pairwise = TRUE) {
nx <- length(x)
ny <- length(y)
rawx <- intToBits(x)
rawy <- intToBits(y)
if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
if (nx < ny) {
## pivoting
tmp <- rawx; rawx <- rawy; rawy <- tmp
tmp <- nx; nx <- ny; ny <- tmp
}
if (nx %% ny) stop("unconformable length!") else {
bits <- length(intToBits(0)) ## 32-bit or 64 bit?
result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:ny, each = bits), sum))
}
if (pairwise) result else sum(result)
}
Now
hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
Hamming distance matrix
If we want to compute the hamming distance matrix, for example,
set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)
The distance matrix between x and y is:
outer(x, y, hamming.distance) ## pairwise argument has no effect here
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 2 3 4 3 4 4 2
# [2,] 7 6 3 4 3 3 3
# [3,] 4 5 4 3 6 4 2
# [4,] 2 3 2 5 6 4 2
# [5,] 4 3 4 3 2 0 2
We can also do:
outer(x, x, hamming.distance)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 0 5 2 2 4
# [2,] 5 0 3 5 3
# [3,] 2 3 0 2 4
# [4,] 2 5 2 0 4
# [5,] 4 3 4 4 0
In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.
Here's an alternative solution that uses only base R, and should be very fast, especially when your df1 and df2 have many rows. The main reason for this is that it does not use any R-level looping for calculating the Hamming distances, such as for-loops, while-loops, or *apply functions. Instead, it uses matrix multiplication for computing the Hamming distance. In R, this is much faster than any approach using R-level looping. Also note that using an *apply function will not necessarily make your code any faster than using a for loop. Two other efficiency-related features of this approach are: (1) It uses partial sorting for finding the best two matches for each row in df2, and (2) It stores the entire bitwise representation of df1 in one matrix (same for df2), and does so in one single step, without using any R-level loops.
The function that does all the work:
# INPUT:
# X corresponds to your entire df1, but is a matrix
# Y corresponds to your entire df2, but is a matrix
# OUTPUT:
# Matrix with four columns corresponding to the values
# that you specified in your question
fun <- function(X, Y) {
# Convert integers to bits
X <- intToBits(t(X))
# Reshape into matrix
dim(X) <- c(ncols * 32, nrows)
# Convert integers to bits
Y <- intToBits(t(Y))
# Reshape into matrix
dim(Y) <- c(ncols * 32, nrows)
# Calculate pairwise hamming distances using matrix
# multiplication.
# Columns of H index into Y; rows index into X.
# The code for the hamming() function was retrieved
# from this page:
# https://johanndejong.wordpress.com/2015/10/02/faster-hamming-distance-in-r-2/
H <- hamming(X, Y)
# Now, for each row in Y, find the two best matches
# in X. In other words: for each column in H, find
# the two smallest values and their row indices.
t(apply(H, 2, function(h) {
mindists <- sort(h, partial = 1:2)
c(
ind1 = which(h == mindists[1])[1],
val1 = mindists[1],
hmd2 = which(h == mindists[2])[1],
val2 = mindists[2]
)
}))
}
To call the function on some random data:
# Generate some random test data with no. of columns
# corresponding to your data
nrows <- 1000
ncols <- 26
# X corresponds to your df1
X <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
# Y corresponds to your df2
Y <- matrix(
sample(1e6, nrows * ncols, replace = TRUE),
nrow = nrows,
ncol = ncols
)
res <- fun(X, Y)
The above example with 1000 rows in both X (df1) and Y (df2) took about 1.1 - 1.2 seconds to run on my laptop.

how to find the most similar columns in a matrix?

I have a matrix in which I would like to find those columns that are very similar (I am not looking to find identical columns)
# to generate a matrix
Mat<- matrix(rexp(200, rate=.1), ncol=1000, nrow=400)
I personally thought of "cor" or "all.equal" and I did as follows, but did not work.
indexmax <- apply(Mat, MARGIN = 2, function(x) which(cor(x) >= 0.5, arr.ind = TRUE))
what I need as output is show which columns are highly similar and the degrees of their similarity (it can be correlation coefficient)
similar means their values are similar within some threshold (for example over 75% of the values residuals (e.g. column1-column2) are less than abs(0.5)
I would also love to see how then this is different from correlated. do they result in identical results ?
Using correlation you could try (with a simpler matrix for demonstration)
set.seed(123)
Mat <- matrix(rnorm(300), ncol = 10)
library(matrixcalc)
corr <- cor(Mat)
res <-which(lower.triangle(corr)>.3, arr.ind = TRUE)
data.frame(res[res[,1] != res[,2],], correlation = corr[res[res[,1] != res[,2],]])
row col correlation
1 8 1 0.3387738
2 6 2 0.3350891
Both row and col actually refer to the columns in your original matrix. So, for example, the correlation between column 8 and column 1 is 0.3387738
I'd take linear regression approach:
Mat<- matrix(rexp(200, rate=.1), ncol=100, nrow=400)
combinations <- combn(1:ncol(Mat), m = 2)
sigma <- NULL
for(i in 1:ncol(combinations)){
sigma <- c(sigma, summary(lm(Mat[,combinations[1,1]] ~ Mat[,combinations[2,1]]))$sigma)
}
sigma <- data.frame(sigma = sigma, comb_nr = 1:ncol(combinations))
And residual standard error as an optional criteria.
You can further order data frame by sigma and get best/worst combinations.
If you want a (not so elegant) straightforward approach that's likely to be very slow for matrices of your size, you can do this:
set.seed(1)
Mat <- matrix(runif(40000), ncol=100, nrow=400)
col.combs <- t(combn(1:ncol(Mat), 2))
similar <- data.frame(Col1=NULL, Col2=NULL, Corr=NULL, Pct.Diff=NULL)
# Compare each pair of columns
for (k in 1:nrow(col.combs)) {
i <- col.combs[k, 1]
j <- col.combs[k, 2]
# Difference within threshold?
diff.thresh <- (abs(Mat[, i] - Mat[, j]) < 0.5)
pair.corr <- cor(Mat[, 1], Mat[, 2])
if (mean(diff.thresh) > 0.75)
similar <- rbind(similar, c(i, j, pair.corr, 100*mean(diff.thresh)))
}
In this example there are 2590 distinct pairs of columns with more than 75% of their values within 0.5 of each other (elementwise). You can check the actual difference and correlation coefficient by looking at the resulting data frame.
> head(similar)
Col1 Col2 Corr Pct.Diff
1 1 2 -0.003187894 76.75
2 1 3 0.074061019 76.75
3 1 4 0.082668387 78.00
4 1 5 0.001713751 75.50
5 1 8 0.052228907 75.75
6 1 12 -0.017921978 78.00
Perhaps it's not the best solution, but gets the job done.
Also, if you're unsure why I used mean(diff.thresh), it's because the sum of a logical vector is the number of TRUE elements. The mean is the sum divided by the length, which means that in this case it's the fraction of values within the threshold.

Resources