construct new supermatrix from block matrices - r

How can I construct (in R) a matrix made of subcomponents that are matrices?
For example, starting from matrices
A <- matrix(1:9,nrow=3,ncol=3)
B <- matrix(5:10,nrow=2,ncol=3)
C <- matrix(11:20,nrow=2,ncol=5)
I want to construct a block matrix
A 0
B C
where 0 represents a zero-filled block with the appropriate dimensions.
There are other questions on SO about constructing block-diagonal matrices
(Matrix::bdiag is very good for this), but I can't find one that answers this question.
(I'm posting this question because I had just about finished answering it when it was deleted by its original poster ...)

I tried writing a general-purpose function. The usage is similar to matrix() but the first argument is a list of matrices (or vectors that will be recycled). It does not have all the bells and whistles (dimnames, byrow), but it is a decent start. I wouldn't be surprised to find out that a better and more complete function already exists in a package, but at least it was a fun exercise:
supermatrix <- function(list.of.mat, nrow = 1L, ncol = 1L) {
  stopifnot(length(list.of.mat) == nrow * ncol)
  is.mat <- vapply(list.of.mat, is.matrix, logical(1L))
  is.vec <- vapply(list.of.mat, is.vector, logical(1L))
  if (any(!is.mat & !is.vec)) stop("the list items must be matrices or vectors")
  is.mat.mat <- matrix(is.mat, nrow, ncol)
  if (any(rowSums(is.mat.mat) == 0L))
    stop("we need at least one matrix per super row")
  if (any(colSums(is.mat.mat) == 0L))
    stop("we need at least one matrix per super column")
  # record the dimensions of the matrix blocks (NA for the vector blocks)
  na.mat <- matrix(NA, nrow, ncol)
  nrow.mat <- replace(na.mat, is.mat, vapply(list.of.mat[is.mat], nrow, integer(1L)))
  ncol.mat <- replace(na.mat, is.mat, vapply(list.of.mat[is.mat], ncol, integer(1L)))
  is.not.uniq <- function(x) length(table(x)) > 1L
  if (any(apply(nrow.mat, 1, is.not.uniq))) stop("row dim mismatch")
  if (any(apply(ncol.mat, 2, is.not.uniq))) stop("col dim mismatch")
  # dimensions of each super row / super column
  nrow.vec <- rowMeans(nrow.mat, na.rm = TRUE)
  ncol.vec <- colMeans(ncol.mat, na.rm = TRUE)
  nrow.mat <- matrix(nrow.vec, nrow, ncol, byrow = FALSE)
  ncol.mat <- matrix(ncol.vec, nrow, ncol, byrow = TRUE)
  # expand every block (recycling vectors) and compute each element's target position
  all.mat <- Map(matrix, list.of.mat, nrow.mat, ncol.mat)
  i1.idx <- unlist(Map(rep, row(na.mat), lapply(all.mat, length)))
  j1.idx <- unlist(Map(rep, col(na.mat), lapply(all.mat, length)))
  i2.idx <- unlist(lapply(all.mat, row))
  j2.idx <- unlist(lapply(all.mat, col))
  o.idx <- order(j1.idx, j2.idx, i1.idx, i2.idx)
  matrix(unlist(all.mat)[o.idx], sum(nrow.vec), sum(ncol.vec))
}
Example usage:
A <- matrix(1:9,nrow=3,ncol=3)
B <- matrix(5:10,nrow=2,ncol=3)
C <- matrix(11:20,nrow=2,ncol=5)
supermatrix(list(A, B, 0, C), 2, 2)
supermatrix(list(A, B, A, 1, 0, C, 2, C), 4, 2)

We need a zero matrix that will have compatible dimensions with A and C:
z <- matrix(0,nrow=nrow(A),ncol=ncol(C))
Now we just use rbind() and cbind():
rbind(cbind(A,z),cbind(B,C))
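With A, B, C, and z defined as above, the result should be the following 5-by-8 matrix:
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]    1    4    7    0    0    0    0    0
# [2,]    2    5    8    0    0    0    0    0
# [3,]    3    6    9    0    0    0    0    0
# [4,]    5    7    9   11   13   15   17   19
# [5,]    6    8   10   12   14   16   18   20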

Related

How to quantify the frequency of all possible row combinations of a binary matrix in R in a more efficient way?

Let's assume I have a binary matrix with 24 columns and 5000 rows.
The columns are Parameters (P1 - P24) of 5000 subjects. The parameters are binary (0 or 1).
(Note: my real data can contain as much as 40,000 subjects)
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
Now I would like to determine what are all possible combinations of the 24 measured parameters:
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
The final question is: How often does each of the possible row combinations from comb appear in matrix m?
I managed to write code for this and created a new column in comb to hold the counts. But my code is really slow and would take 328 days to run, so the code below only considers the first 20 combinations:
comb$count <- 0
for (k in 1:20){ # considers only the first 20 combinations of comb
  for (i in 1:nrow(m)){
    if (all(m[i,] == comb[k,1:24])){
      comb$count[k] <- comb$count[k] + 1
    }
  }
}
Is there computationally a more efficient way to compute this above so I can count all combinations in a short time?
Thank you very much for your help in advance.
data.table is fast at this type of operation:
m <- matrix(, nrow = 5000, ncol = 24)
m <- apply(m, c(1,2), function(x) sample(c(0,1),1))
colnames(m) <- paste("P", c(1:24), sep = "")
comb <- expand.grid(rep(list(0:1), 24))
colnames(comb) <- paste("P", c(1:24), sep = "")
library(data.table)
data_t = data.table(m)
ans = data_t[, .N, by = P1:P24]
dim(ans)
head(ans)
The core of the call is by = P1:P24, which groups by all the columns, and .N, which gives the number of records in each group.
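For example (assuming the ans object created above), sorting by count in decreasing order shows the most frequent combinations first:
head(ans[order(-N)])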
I used this as inspiration - How does one aggregate and summarize data quickly? - and the data.table intro vignette: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html
If all you need is the combinations that occur in the data and how many times, this will do it:
m2 <- apply(m, 1, paste0, collapse="")
m2.tbl <- xtabs(~m2)
head(m2.tbl)
m2
# 000000000001000101010010 000000000010001000100100 000000000010001110001100 000000000100001000010111 000000000100010110101010 000000000100101000101100
# 1 1 1 1 1 1
You can use apply to paste together the values in each row and table to count the frequency of each combination.
table(apply(m, 1, paste0, collapse = '-'))
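If you really do need a count for every one of the 2^24 rows of comb (including combinations that never occur in m), one possible sketch is to build the same character key for comb and look the observed counts up by name; note that materialising 2^24 keys is slow and memory-hungry, so only do this if you truly need the zero counts:
tab <- table(apply(m, 1, paste0, collapse = "-"))
key <- apply(comb, 1, paste0, collapse = "-")
comb$count <- as.integer(tab[key])
comb$count[is.na(comb$count)] <- 0L  # combinations never observed in m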

Optimise row wise matrix comparison in R

I've googled extensively and can't seem to find an answer to my problem. Apologies if this has been asked before. I have two matrices, a and b, each with the same dimensions. What I am trying to do is iterate over the rows of a (from i = 1 to the number of rows in a) and check if any elements found in row i of matrix a appear in the corresponding row i of matrix b. I have a solution using sapply, but this becomes quite slow with very large matrices. Is it possible to vectorise my solution somehow? Examples below:
# create example matrices
a = matrix(1:9, nrow = 3)
b = matrix(4:12, nrow = 3)
# iterate over rows in a....
# returns TRUE for each row of a where any element in ith row is found in the corresponding row i of matrix b
sapply(1:nrow(a), function(x){ any(a[x,] %in% b[x,])})
# however, for large matrices this performs quite poorly. is it possible to vectorise?
a = matrix(runif(14000000), nrow = 7000000)
b = matrix(runif(14000000), nrow = 7000000)
system.time({
  sapply(1:nrow(a), function(x){ any(a[x,] %in% b[x,]) })
})
Use apply on the elementwise differences to find rows where a and b agree in the same position (a difference of 0):
a <- sample(1:3, 9, replace = TRUE)
b <- sample(1:3, 9, replace = TRUE)
a <- matrix(a, ncol = 3)
b <- matrix(b, ncol = 3)
diff <- (a - b)
apply(diff, 1, function(x) which(x == 0)) # actual indexes = 0
apply(diff, 1, function(x) any(x == 0)) # row check only
or
Maybe you can try intersect + asplit like below
lengths(Map(intersect, asplit(a, 1), asplit(b, 1))) > 0
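As a quick sanity check on the small example from the question (redefining a and b, since they were overwritten by the large ones above), both the sapply solution and this one should flag every row, because each row of a shares values with the corresponding row of b:
a = matrix(1:9, nrow = 3)
b = matrix(4:12, nrow = 3)
lengths(Map(intersect, asplit(a, 1), asplit(b, 1))) > 0
# expected: TRUE TRUE TRUE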

Assign the same index if two vectors have a common intersection

I need help with a question closely related to some other question of mine.
How to merge two different groupings if they are not disjoint with dplyr
As the title of the question says, I want to generate a vector of indices that assigns the same index to vectors in a list if they have an intersection or, if not, if they both intersect with some other vector in the list, and so on...
This is a question involving graph theory/networks - I want to find indirectly connected vectors.
The question above solved my problem for two columns of a dataframe, but I don't know how to generalize this to a list whose elements may have different lengths.
This is an example: list(1:3, 3:5, 5, 6) should give me c(1, 1, 1, 2)
EDIT:
I've tried using the fact that the powers of an adjacency matrix count the possible paths from one vertex to another.
library(dplyr)  # %>%, group_by, across, group_indices
library(purrr)  # set_names, accumulate, reduce

find_connections <- function(list_vectors){
  list_vectors <- list_vectors %>%
    set_names(paste0("x", 1:length(list_vectors)))
  # adjacency matrix of the list elements: entry (i, j) > 0 if vectors i and j share a value
  x <- crossprod(table(stack(list_vectors)))
  power <- nrow(x) - 2
  x <- ifelse(x >= 1, 1, 0)
  if(power > 0){
    # sum the powers of the adjacency matrix to capture indirect connections
    z <- accumulate(replicate(power, x, simplify = FALSE),
                    `%*%`, .init = x) %>%
      reduce(`+`)
  } else{
    z <- x
  }
  z <- ifelse(z >= 1, 1, 0)
  # vectors with identical connection patterns belong to the same group
  w <- z %>%
    as.data.frame() %>%
    group_by(across()) %>%
    group_indices()
  return(w)
}
The problem is that it took too long to run my code. Each matrix is not very large, but I do need to run the function on a large number of them.
Is it possible to improve this?
This is one way to do it. It creates a cycle over the elements of each vector (linking each element to the next, and the last back to the first) and then uses the same trick as the previous answer to find clusters.
library(data.table)
library(igraph)

x <- list(1:3, 3:5, 5, 6)
# one edge from each element to the next within a vector (the last wraps around to the first)
dt <- rbindlist(lapply(x, function(r) data.table(from = r, to = shift(r, -1, fill = r[1]))))
dg <- graph_from_data_frame(dt, directed = FALSE)
# the component membership of a vector's first element is the index for that vector
unname(sapply(x, function(v) components(dg)$membership[as.character(v[1])]))
#> [1] 1 1 1 2

Similarity / distance between many pairs of matrices

I want to quantify group similarity by computing, for each pair of groups, the mean of the distances between all pairs of (multidimensional) points drawn from the two groups.
I can do this easily enough for each pair of groups manually, like so:
library(dplyr)
library(tibble)
library(proxy)
# dummy data
set.seed(123)
df1 <- data.frame(x = rnorm(100, 0, 4),
                  y = rnorm(100, 1, 5),
                  z = rbinom(100, 1, 0.1))
df2 <- data.frame(x = rnorm(100, -1, 3),
                  y = rnorm(100, 0, 6),
                  z = rbinom(100, 1, 0.1))
df3 <- data.frame(x = rnorm(100, -30, 4),
                  y = rnorm(100, 10, 2),
                  z = rbinom(100, 1, 0.9))
# compute distance (unscaled, uncentred data)
dist(df1, df2, method = "gower") %>% mean
dist(df1, df3, method = "gower") %>% mean
dist(df2, df3, method = "gower") %>% mean
But I'd like to somehow vectorise this as my actual data has 30+ groups. A simple for loop can achieve this like so:
# combine data and scale, centre
df <- rbind(df1, df2, df3) %>%
  mutate(id = rep(1:3, each = 100))
df <- df %>%
  select(-id) %>%
  transmute_all(scale) %>%
  add_column(id = df$id)
# create empty matrix for comparisons
n <- df$id %>% unique %>% length
m <- matrix(nrow = n, ncol = n)
# loop through each pair once
for(i in 1:n) {
  for(j in 1:i) { # omit top right corner
    if(i == j) {
      m[i,j] <- NA # omit diagonal
    } else {
      m[i,j] <- dist(df[df$id == i, 1:3], df[df$id == j, 1:3], method = "gower") %>% mean
    }
  }
}
m
[,1] [,2] [,3]
[1,] NA NA NA
[2,] 0.2217443 NA NA
[3,] 0.8446070 0.8233932 NA
However, this method scales predictably badly; a quick benchmark suggests this will take 90+ hours with my actual data which has 30+ groups with 1000+ rows per group.
Can anyone suggest a more efficient solution, or perhaps a fundamentally different way to frame the problem which I'm missing?
I'm not sure if this will do well, but here's another approach. You can use ls to obtain the names of the data objects, combn to generate all pairs of two, and then get to retrieve the objects for calculating dist:
do.call(rbind,
        combn(ls(pattern = "df\\d+"), 2, FUN = function(x)
          data.frame(pair = toString(x),
                     dist = mean(dist(get(x[1]), get(x[2]), method = "gower")),
                     stringsAsFactors = FALSE),
          simplify = FALSE))
# pair dist
#1 df1, df2 0.2139304
#2 df1, df3 0.8315169
#3 df2, df3 0.8320911
You could take each pair of groups, concatenate them, and then just calculate the dissimilarity matrix within that group. Obviously this means you're comparing a group to itself to an extent, but it may still work for your use case, and with daisy it is reasonably quick for your size of data.
library(cluster)
n <- 30
df_list <- vector("list", n)
# dummy data
set.seed(123)
for(i in 1:n) {
  df_list[[i]] <- data.frame(x = rnorm(1000, ceiling(runif(1, -10, 10)), ceiling(runif(1, 2, 4))),
                             y = rnorm(1000, ceiling(runif(1, -10, 10)), ceiling(runif(1, 2, 4))),
                             z = rbinom(1000, 1, runif(1, 0.1, 0.9)))
}
m <- matrix(nrow = n, ncol = n)
# loop through each pair once
for(i in 1:n) {
  for(j in 1:i) { # omit top right corner
    if(i == j) {
      m[i,j] <- NA # omit diagonal
    } else {
      # concatenate groups
      dat <- rbind(df_list[[i]], df_list[[j]])
      # compute all distances (between groups and within groups), return matrix
      mm <- dat %>%
        daisy(metric = "gower") %>%
        as.matrix
      # retain only distances between groups
      mm <- mm[(nrow(df_list[[i]])+1):nrow(dat), 1:nrow(df_list[[i]])]
      # write mean distance to global comparison matrix
      m[i,j] <- mean(mm)
    }
  }
}
proxy can work with lists of matrices (or data frames) as input; you only need to define a wrapper function that does what you want:
nested_gower <- function(x, y, ...) {
  mean(proxy::dist(x, y, ..., method = "gower"))
}

proxy::pr_DB$set_entry(
  FUN = nested_gower,
  names = c("ngower"),
  distance = TRUE,
  loop = TRUE
)

df_list <- list(df1, df2, df3)
proxy::dist(df_list, df_list, method = "ngower")
[,1] [,2] [,3]
[1,] 0.1978306 0.2139304 0.8315169
[2,] 0.2139304 0.2245903 0.8320911
[3,] 0.8315169 0.8320911 0.2139049
This will still be slow, but it should be faster than for loops in plain R (proxy uses C in the background).
Important: note that the diagonal of the resulting cross-distance matrix doesn't have zeros. If you were to call dist like proxy::dist(df_list, method = "ngower"), proxy would assume that distance(x, y) = distance(y, x) (symmetry) and that distance(x, x) = 0, the latter of which is not true in this case. Passing two arguments to dist prevents this assumption. If you really don't care about the diagonal, pass only one argument to save some extra time by avoiding the calculation of the upper triangular. Alternatively, if you do care about the diagonal but still want to avoid calculating the upper triangular, call dist first with one argument and then call proxy::dist(df_list, df_list, method = "ngower", pairwise = TRUE).
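In code, that last suggestion would look roughly like this (a sketch of the two calls just described):
d_lower <- proxy::dist(df_list, method = "ngower")                           # lower triangular, no diagonal
d_diag <- proxy::dist(df_list, df_list, method = "ngower", pairwise = TRUE)  # the "self" distances element by element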
Side note: if you want to imitate this behavior with the gower package (as suggested by d.b),
you could define the wrapper function as:
nested_gower <- function(x, y, ...) {
  distmat <- sapply(seq_len(nrow(y)), function(y_row) {
    gower::gower_dist(x, y[y_row, , drop = FALSE], ...)
  })
  mean(distmat)
}
However, the values returned seem to change depending on how many records are passed to the functions, so it's hard to tell what would be the best approach.
*Use proxy::pr_DB$delete_entry("ngower") first if you want to redefine a function in proxy.
If you prefer proxy's version of the Gower cross-distance matrix, it occurs to me that you could leverage some of the functionality of my dtwclust package to do the calculations in parallel:
library(dtwclust)
library(doParallel)
custom_dist <- new("tsclustFamily", dist = "ngower", control = list(symmetric = TRUE))@dist
workers <- makeCluster(detectCores())
registerDoParallel(workers)
distmat <- custom_dist(df_list)
stopCluster(workers); registerDoSEQ()
This might be faster for your actual use case (not so much for the small sample data here). The same caveat about the diagonal applies (so use custom_dist(df_list, df_list) or custom_dist(df_list, pairwise = TRUE)). See section 3.2 here and the documentation of tsclustFamily if you'd like more info.

Creating block matrix via loop

I'm trying to create a block matrix in R using a loop, where the result depends on some variable I call T. The two matrices used to construct the block matrix could look like this:
A=matrix(c(1,0.3,0.3,1.5),nrow=2)
B=matrix(c(0.5,0.3,0.3,1.5),nrow=2)
So depending on what I set T to, I need different results. For T=2:
C=rbind(cbind(A,B),cbind(B,A))
For T=3:
C=rbind(cbind(A,B,B),cbind(B,A,B),cbind(B,B,A))
For T=5:
C=rbind(cbind(A,B,B,B,B),cbind(B,A,B,B,B),cbind(B,B,A,B,B),cbind(B,B,B,A,B),cbind(B,B,B,B,A))
So basically, I'm just trying to create a loop or something similar, where I can just specify my T and it will create the block matrix for me depending on T.
Thanks
You can do that:
N <- nrow(A)
C <- matrix(NA, N*T, N*T)
for (i in 1:T){
  for (j in 1:T){
    if (i == j)
      C[(i-1)*N + 1:N, (j-1)*N + 1:N] <- A
    else
      C[(i-1)*N + 1:N, (j-1)*N + 1:N] <- B
  }
}
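As a quick check (a sketch assuming A and B from the question and T set to 3), the result should match the hand-written version:
A <- matrix(c(1, 0.3, 0.3, 1.5), nrow = 2)
B <- matrix(c(0.5, 0.3, 0.3, 1.5), nrow = 2)
T <- 3
# ... run the loop above, then:
identical(C, rbind(cbind(A, B, B), cbind(B, A, B), cbind(B, B, A)))
# expected: TRUE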
From your explanation I suppose that you want a single A and T-1 Bs in each row of blocks in your final matrix.
If that is correct then here is a quick try using the permn function from the combinat library. All I am doing is generating the expression using the permutation and then evaluating it.
library(combinat)

A = matrix(c(1, 0.3, 0.3, 1.5), nrow = 2)
B = matrix(c(0.5, 0.3, 0.3, 1.5), nrow = 2)
T = 5
x = c("A", rep("B", T - 1))
perms = unique(permn(x)) # permn generates non-unique permutations
perms = lapply(perms, function(xx) {xx = paste(xx, collapse = ","); xx = paste("cbind(", xx, ")")})
perms = paste(perms, collapse = ",")
perms = paste("C = rbind(", perms, ")", collapse = ",")
eval(parse(text = perms))
With the blockmatrix package this is pretty straightforward.
library(blockmatrix)
# create toy matrices (block matrix elements)
# with values that make it easier to track them in the block matrix in this example
A <- matrix("a", nrow = 2, ncol = 2)
B <- matrix("b", nrow = 2, ncol = 2)
# function for creating the block matrix
# n: number of repeating blocks in each dimension
# (I use n instead of T, to avoid confusion with T as in TRUE)
# m_list: the two matrices in a list
block <- function(n, m_list){
  # create a 'layout matrix' of the block matrix elements
  m <- matrix("B", nrow = n, ncol = n)
  diag(m) <- "A"
  # build block matrix
  as.matrix(blockmatrix(dim = dim(m_list[[1]]), value = m, list = m_list))
}
# try with different n
block(n = 2, m_list = list(A = A, B = B))
block(n = 3, m_list = list(A = A, B = B))
block(n = 5, m_list = list(A = A, B = B))
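As an aside (not part of the answers above, just a base R identity for this particular A-on-the-diagonal pattern, assuming the numeric A and B from the question rather than the character toy matrices used here): because the diagonal blocks are A and every off-diagonal block is B, the same matrix can be written as a sum of two Kronecker products:
n <- 3
kronecker(diag(n), A - B) + kronecker(matrix(1, n, n), B)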
