Cluster groups of 1s in a binary matrix

I'm looking to create clusters around groups of connected 1s. Similar to Minesweeper, I want to basically "draw a circle" around all connected 1s, and create a border where 0s exist.
I have tried using hclust() and creating a distance matrix, but the actual table I am working with is very large, and I have run into problems with run time.
test_matrix <- matrix(c(1,1,0,0,0,0,1,
                        1,1,1,0,0,1,0,
                        0,1,0,0,0,1,0,
                        0,0,0,1,1,1,0,
                        0,0,0,1,1,1,1), nrow = 5)
Result looks like this:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    0    1    0    1    0
[2,]    1    1    0    0    0    1    1
[3,]    0    1    1    0    0    0    1
[4,]    0    1    0    0    0    0    1
[5,]    0    1    0    1    1    0    1
My rules are as follows: if any 1 is connected to any other 1 via UP, DOWN, LEFT, RIGHT, or DIAGONAL (any direction), continue growing the "cluster". Based on these rules (8 points of connection for every point), I can spot four unique clusters, one of which is a single isolated 1.
How would you write code to find these groups?

I think clustering is the right approach here, but you chose a poor (computationally expensive) method for the task. I would go for DBSCAN, like this:
library(dbscan)
## slightly altered test matrix to include a "cluster" with a single 1
test_matrix <- matrix(c(1,1,0,0,0,0,1,
                        1,1,1,0,0,1,0,
                        0,1,0,0,0,1,0,
                        0,0,0,1,1,1,0,
                        1,0,0,1,1,1,1),
                      nrow = 5, byrow = TRUE)
## find rows and columns of 1s
ones_pos <- which(test_matrix > 0, arr.ind = TRUE)
## perform DBSCAN clustering
## setting eps = sqrt(2) + .1 corresponds to your neighbourhood definition
## setting minPts = 2 will mark clusters of one point as noise
clust <- dbscan(ones_pos, eps = sqrt(2) + .1, minPts = 2)
## find the indices of noise elements
singular_ones <- ones_pos[clust$cluster == 0, ]
singular_ones
#> row col
#> 5 1
To find all clusters (including those that consist of just a single 1), set minPts to 1. In that case there can be no noise. The cluster membership is stored in clust$cluster.
I am quite certain this approach will also be quite fast with large matrices.
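For completeness, here is a minimal sketch of the minPts = 1 variant, reusing ones_pos from above; every 1 then receives a cluster label and nothing is marked as noise:
clust_all <- dbscan(ones_pos, eps = sqrt(2) + .1, minPts = 1)
table(clust_all$cluster)  # size of each cluster; no 0 (noise) label appears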


Create a matrix of 0s and 1s, such that each row has only one 1 and each column has at least two 1s

I want to create a 100*4 matrix of 0s and 1s, such that each row has only one 1 and each column has at least two 1s, in R.
library(Matrix)  # provides rsparsematrix()
MyMat <- as.matrix(rsparsematrix(nrow=100, ncol=4, nnz = 100))
I am thinking of rsparsematrix, but I am not sure how to apply my required conditions.
Edit: my other try would be dummy_cols, but no matter what, I am still stuck on applying the two conditions. I guess there must be a more straightforward way of creating such a matrix.
1) A matrix consisting of 25 4x4 identity matrices stacked one on top of another satisfies these requirements:
m <- matrix(1, 25) %x% diag(4)
2) Exchanging the two arguments of %x% would also work and gives a different matrix which also satisfies this.
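A one-line sketch of point (2); the rows come out grouped by column rather than cycling:
m2 <- diag(4) %x% matrix(1, 25)  # first 25 rows are (1,0,0,0), next 25 are (0,1,0,0), ...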
3) Any permutation of the rows and the columns of the two solution matrices in (1) and (2) would also satisfy the conditions.
m[sample(100), sample(4)]
4) If the objective is to generate a random table containing 0/1 values whose row sums are each 1 and whose column sums are each 25 then use r2dtable:
r <- r2dtable(1, rep(1, 100), rep(25, 4))[[1]]
5) Or, if it is desired to allow any column sums of at least 2:
rsums <- rep(1, 100)
csums <- rmultinom(1, 92, rep(0.25, 4)) + 2
r <- r2dtable(1, rsums, csums)[[1]]
Stochastically, with two rules:
All rows must have exactly one 1; and
All columns must have at least two 1s.
I control the first implicitly by construction; I test against the second.
nr <- 100 ; nc <- 4
set.seed(42)
lim <- 10000
while (lim > 0) {
  lim <- lim - 1
  M <- t(replicate(nr, sample(c(1, rep(0, nc-1)))))
  if (all(colSums(M > 0) >= 2)) break
}
head(M)
# [,1] [,2] [,3] [,4]
# [1,] 1 0 0 0
# [2,] 0 0 0 1
# [3,] 0 0 0 1
# [4,] 0 1 0 0
# [5,] 0 0 0 1
# [6,] 0 1 0 0
colSums(M)
# [1] 25 30 21 24
lim
# [1] 9999
My use of lim is hardly needed in this example, but is there as a mechanism to stop this from running infinitely: if you change the dimensions and/or the rules, it might become highly unlikely or infeasible to meet all rules, so this keeps the execution time limited. (10000 is completely arbitrary.)
My point in the comment is that it would be rather difficult to find a 100x4 matrix that matches rule 1 but does not match rule 2. In fact, since the odds of a 0 or a 1 in any one cell are 0.75 and 0.25, respectively, the probability of finding a column (among 100 rows) that contains fewer than two 1s is around 1.1e-11.
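That figure is easy to verify: under rule 1 each row independently places its single 1 in one of the 4 columns, so a given column's count of 1s is binomial(100, 0.25), and base R's pbinom gives the per-column probability directly (a quick sanity check, not part of the original answer):
pbinom(1, size = 100, prob = 0.25)  # P(fewer than two 1s in a given column), ~1.1e-11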
Here is a simple way to generate the 100 rows with the 1s randomly positioned, and then create the matrix by transposing the rows object. The matrix generation is wrapped in a while loop (THX r2evans) to ensure each column contains at least two 1s.
minval <- 0
while (minval < 2) {
  rows <- replicate(100, sample(c(0,0,0,1), 4))
  m <- t(rows)
  minval <- min(colSums(m))
}
m
[,1] [,2] [,3] [,4]
[1,] 0 0 0 1
[2,] 1 0 0 0
[3,] 0 0 0 1
[4,] 0 0 1 0
[5,] 1 0 0 0
[6,] 0 0 0 1
[7,] 1 0 0 0
[8,] 0 0 1 0
[9,] 0 1 0 0
[10,] 1 0 0 0
Code:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
m <- diag(length(v))[sample(rep(seq_along(v), v)),]
Result check:
> dim(m)
[1] 100 4
> range(rowSums(m))
[1] 1 1
> range(colSums(m))
[1] 20 30
This works with any matrix size - just adjust the numbers 4 and 100. The first one controls the number of columns and the second one the number of rows:
v <- tabulate(sample(1:10, 200-2*10, replace=TRUE), nbins=10) + 2
m <- diag(length(v))[sample(rep(seq_along(v), v)),]
> dim(m)
[1] 200 10
> range(rowSums(m))
[1] 1 1
> range(colSums(m))
[1] 15 31
Explanation: this works backwards from the properties of the resulting matrix. If you have 100 rows and 4 columns, with each row having only one 1, then the matrix will have 100 1s in total, which means that the sum of all column sums should also be 100. So we start with a vector of numbers (summing to 100) that represents how many 1s each column will have. Say this vector is c(50,25,20,5): it tells us that there will be 50 rows of the form (1,0,0,0), 25 rows of the form (0,1,0,0), and so on. The final step is to generate all these rows and shuffle them.
The trick here:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
is to generate random column sums while making sure the minimum is at least 2. We do this by generating values summing to 92 and then adding 2 to each value (which, across the 4 columns, adds the remaining 8).
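A quick check that the construction behaves as claimed (not part of the original answer): the column sums always total 100 and never drop below 2:
v <- tabulate(sample(1:4, 100-2*4, replace=TRUE), nbins=4) + 2
sum(v)  # always 100: one 1 per row in the final matrix
min(v)  # always >= 2: every column gets at least two 1s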

Mapping edgelists to an adjacency matrix (and sum them together)

I want to map a number of (undirected) friendship networks (in edgelist format) to an adjacency matrix consisting of all possible nodes (i.e., persons) using R. To begin with, I construct a smaller 4-person circle x <- c(1, 2, 3, 4) which consists of 6 unique edges (1-2, 1-3, 1-4, 2-3, 2-4, 3-4). I then collapsed this set of 6 unique edges into a single list, such that it can be converted into a symmetric matrix using igraph applications (see below).
library(igraph)  # for make_graph() and as_adjacency_matrix()
library(purrr)
x <- c(1,2,3,4)
x_pairs <- combn(x, 2)
List <- split(x_pairs, rep(1:ncol(x_pairs), each = nrow(x_pairs)))
new_list <- purrr::flatten(List)
g <- make_graph(unlist(new_list), directed = FALSE)
m <- as_adjacency_matrix(g, sparse = FALSE)
m
[,1] [,2] [,3] [,4]
[1,] 0 1 1 1
[2,] 1 0 1 1
[3,] 1 1 0 1
[4,] 1 1 1 0
My dataset has more than one such smaller friendship circle, with members drawn from a total of 50 persons, and the memberships of these circles may or may not overlap. So my question is: how do I map a series of smaller matrices like m above onto a 50 by 50 adjacency matrix, in two different ways:
(1) without repeating: say, if 3 and 4 are friends in one circle but they are also linked in another circle, the edge between 3 and 4 should remain 1 (but not add up to 2)
(2) cumulatively: if relationship in multiple circles indicates stronger friendship, then it might be more informative to map these circles into a weighted adjacency matrix where each cell in the matrix represents the cumulative counts of row and column id's friendship in different circles. In 3 and 4's situation, their edge value should be 1 + 1 = 2.
I've checked out this and other previous threads but can't seem to figure out how to do this; it would be really appreciated if someone could enlighten me on this.
There are various ways to achieve it. It looks like doing it in graph theoretical terms in igraph is a little more tedious than dealing directly with adjacency matrices. Let
circles <- list(1:3, 2:4) # Friendship circles with identities 1, ..., n
n <- max(unlist(circles)) # Total number of people
nM <- matrix(0, n, n) # n x n matrix of zeroes
Then
adjs <- lapply(circles, function(cr) {nM[cr, cr] <- 1; nM[cbind(cr, cr)] <- 0; nM})
is a list of n x n adjacency matrices for each friendship circle (mostly zeroes in each case).
Then the two types of aggregate matrices can be obtained by
(adj1 <- Reduce(`+`, adjs))
# [,1] [,2] [,3] [,4]
# [1,] 0 1 1 0
# [2,] 1 0 2 1
# [3,] 1 2 0 1
# [4,] 0 1 1 0
(adj2 <- 1 * (adj1 > 0))
# [,1] [,2] [,3] [,4]
# [1,] 0 1 1 0
# [2,] 1 0 1 1
# [3,] 1 1 0 1
# [4,] 0 1 1 0
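For the 50-person dataset described in the question, the same recipe applies; the only assumed change is fixing n to the known population size instead of deriving it from the circles, so the aggregates are 50 x 50 even if some persons appear in no circle:
n <- 50                    # known total number of persons
nM <- matrix(0, n, n)
adjs <- lapply(circles, function(cr) {nM[cr, cr] <- 1; nM[cbind(cr, cr)] <- 0; nM})
adj1 <- Reduce(`+`, adjs)  # (2) weighted: counts shared circles
adj2 <- 1 * (adj1 > 0)     # (1) binary: friends in at least one circle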

back and forth to dummy variables in R

So, I've been using R on and off for two years now and been trying to get this whole idea of vectorization. Since I deal a lot with dummy variables from multiple response sets from surveys I thought it would be interesting to learn with this case.
The idea is to go from multiple responses to dummy variables (and back), for example: "Of these 8 different chocolates, which are your favorite ones (choose up to 3) ?"
Sometimes we code this as dummy variables (1 for person likes "Cote d'Or", 0 for person doesn't like it), with 1 variable per option, and some times as categorical (1 for person likes "Cote d'Or", 2 for person likes "Lindt", and so on), with 3 variables for the 3 choices.
So, basically I can end up with a matrix whose rows look like
1,0,0,1,0,0,1,0
Or a matrix with rows like
1,4,7
And the idea, as mentioned, is to go from one to the other. So far I have a loop solution for each case and a vectorized solution for going from dummy to categorical. I would appreciate any further insight into this matter and a vectorized solution for the categorical-to-dummy step.
DUMMY TO NOT DUMMY
vecOrig <- matrix(0, nrow=18, ncol=8) # From this one
vecDest <- matrix(0, nrow=18, ncol=3) # To this one
# Populating the original matrix.
# I'm pretty sure this could have been added to the definition of the matrix,
# but I kept getting repeated numbers.
# How would you vectorize this?
vec <- c(rep(1, 3), rep(0, 5)) # 'vec' was undefined in the original; assuming three 1s among eight options
for (i in 1:length(vecOrig[,1])) {
  vecOrig[i,] <- sample(vec)
}
# Now, how would you vectorize this following step...
for (i in 1:length(vecOrig[,1])) {
  vecDest[i,] <- grep(1, vecOrig[i,])
}
# Vectorized solution, I had to transpose it for some reason.
vecDest2 <- t(apply(vecOrig, 1, function(x) grep(1, x)))
NOT DUMMY TO DUMMY
matOrig <- matrix(0, nrow=18, ncol=3) # From this one
matDest <- matrix(0, nrow=18, ncol=8) # To this one.
# We populate the origin matrix. Same thing as the other case.
for (i in 1:length(matOrig[,1])) {
  matOrig[i,] <- sample(1:8, 3, FALSE)
}
# this works, but how to make it vectorized?
for (i in 1:length(matOrig[,1])) {
  for (j in matOrig[i,]) {
    matDest[i,j] <- 1
  }
}
# Not a clue of how to vectorize this one.
# The 'model.matrix' solution doesn't look neat.
Vectorized solutions:
Dummy to not dummy
vecDest <- t(apply(vecOrig == 1, 1, which))
Not dummy to dummy (back to the original)
nCol <- 8
vecOrig <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
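A quick round-trip check of the two one-liners, assuming (as in the question) an 18 x 8 matrix with exactly three 1s per row:
set.seed(42)
vec <- c(rep(1, 3), rep(0, 5))              # three 1s among eight options
vecOrig <- t(replicate(18, sample(vec)))    # 18 x 8 dummy matrix
vecDest <- t(apply(vecOrig == 1, 1, which)) # 18 x 3 position matrix
nCol <- 8
vecBack <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
all(vecBack == vecOrig)                     # TRUE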
This might provide some insight for the first part:
#Create example data
set.seed(42)
vecOrig<-matrix(rbinom(20,1,0.2),nrow=5,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 1
[2,] 1 0 0 1
[3,] 0 0 1 0
[4,] 1 0 0 0
[5,] 0 0 0 0
Note that this does not assume that the number of 1s is equal in each row (e.g., you wrote "choose up to 3").
#use algebra to create position numbers
vecDest <- t(t(vecOrig)*1:ncol(vecOrig))
[,1] [,2] [,3] [,4]
[1,] 1 0 0 4
[2,] 1 0 0 4
[3,] 0 0 3 0
[4,] 1 0 0 0
[5,] 0 0 0 0
Now we remove the zeros. Since rows can end up with different numbers of positions, we have to turn the object into a list.
vecDest <- split(t(vecDest), rep(1:nrow(vecDest), each = ncol(vecDest)))
lapply(vecDest,function(x) x[x>0])
$`1`
[1] 1 4
$`2`
[1] 1 4
$`3`
[1] 3
$`4`
[1] 1
$`5`
numeric(0)

How to easily create dissimilarity matrix from vector of differences?

In my research each subject was given n*(n-1)/2 questions about his subjective opinion about dissimilarity between n=5 objects (for later use with 3-way multidimensional scaling).
I want to create a dissimilarity matrix from the 10-item vector v, arranged e.g. in the following fashion (for n=5):
1
2 5
3 6 8
4 7 9 10
This is sample code for achieving it for this particular n:
dissim<-rep(0,n*n)
dim(dissim)<-c(5,5)
dissim[2,1]<-v[1]
dissim[3,1]<-v[2]
dissim[4,1]<-v[3]
dissim[5,1]<-v[4]
dissim[3,2]<-v[5]
dissim[4,2]<-v[6]
dissim[5,2]<-v[7]
dissim[4,3]<-v[8]
dissim[5,3]<-v[9]
dissim[5,4]<-v[10]
Is there any utility function which helps doing it for any n? I know I can use two nested loops to do it, but the code would be more clear if I used a dedicated function.
And maybe I would learn about the existence of another useful library in the process?
n <- 5
mat <- matrix(0, ncol = n, nrow = n)
mat[lower.tri(mat)] <- 1:10
mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 0 0 0
[3,] 2 5 0 0 0
[4,] 3 6 8 0 0
[5,] 4 7 9 10 0
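The same pattern works for any n as the utility you asked about; filling the lower triangle with the response vector v itself (of length n*(n-1)/2) reproduces exactly the layout sketched in the question:
n <- 5
v <- 1:(n*(n-1)/2)              # stand-in for the real responses
dissim <- matrix(0, ncol = n, nrow = n)
dissim[lower.tri(dissim)] <- v  # fills column by column, matching the layout above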
Er... by chance I found the solution myself. It so happens that the internal structure of the dist object is just the vector v. So what works is this:
dissim <- v
class(dissim) <- "dist"
attr(dissim, "Size") <- 5
dissim <- as.dist(dissim)
It works now, but I am not sure if this is a documented way and will always be valid.
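If relying on the internal layout feels fragile, a safer route is to combine this with the lower.tri answer above: build the lower-triangular matrix and convert it with as.dist(), whose behaviour on square matrices is documented:
mat <- matrix(0, ncol = n, nrow = n)
mat[lower.tri(mat)] <- v
dissim <- as.dist(mat)  # as.dist() reads the lower triangle of a square matrix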

random sampling - matrix

How can I take a sample of n random points from a matrix populated with 1s and 0s?
a=rep(0:1,5)
b=rep(0,10)
c=rep(1,10)
dataset=matrix(cbind(a,b,c),nrow=10,ncol=3)
dataset
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 1 0 1
[3,] 0 0 1
[4,] 1 0 1
[5,] 0 0 1
[6,] 1 0 1
[7,] 0 0 1
[8,] 1 0 1
[9,] 0 0 1
[10,] 1 0 1
I want to be sure that the positions (row, col) from where I take the N samples are random.
I know sample {base}, but it doesn't seem to allow me to do that; the other methods I know are spatial methods that would force me to add x,y coordinates, convert the matrix to a spatial object, and then convert it back to a normal matrix.
More information
By random I also mean spread inside the "matrix space": e.g. if I sample 4 points, I don't want to end up with 4 neighboring points; I want them spread across the "matrix space".
Knowing the position (row, col) in the matrix where I took the random points would also be important.
There is a very easy way to sample a matrix that works if you understand that R represents a matrix internally as a vector.
This means you can use sample directly on your matrix. For example, let's assume you want to sample 10 points with replacement:
n <- 10
replace <- TRUE
Now just use sample on your matrix:
set.seed(1)
sample(dataset, n, replace=replace)
[1] 1 0 0 1 0 1 1 0 0 1
To demonstrate how this works, let's decompose it into two steps. Step 1 is to generate an index of sampling positions, and step 2 is to find those positions in your matrix:
set.seed(1)
mysample <- sample(length(dataset), n, replace=replace)
mysample
[1] 8 12 18 28 7 27 29 20 19 2
dataset[mysample]
[1] 1 0 0 1 0 1 1 0 0 1
And, hey presto, the results of the two methods are identical.
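Since knowing the (row, col) positions matters to you, base R's arrayInd() converts those linear indices back to matrix coordinates:
arrayInd(mysample, dim(dataset))  # one (row, col) pair per sampled element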
sample seems the best bet for you. To get 1000 random positions you can do something like:
rows = sample(1:nrow(dataset), 1000, replace = TRUE)
columns = sample(1:ncol(dataset), 1000, replace = TRUE)
I think this gives what you want, but of course I could be mistaken.
Extracting the items from the matrix can be done like:
random_sample = mapply(function(row, col) dataset[row, col],
                       row = rows, col = columns)
Sampling strategies
In the comments you say that your sample needs to have spread. A random sample gives no guarantee that there will be no clusters, because of its random nature. There are several more sampling schemes that might be interesting to explore:
Regular sampling: skip the randomness and just sample regularly. This samples the entire matrix space evenly, but there is no randomness.
Stratified random sampling: divide your matrix space into regular subsets, and then sample randomly within those subsets. This presents a mix between random and regular; a sketch follows below.
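A minimal sketch of the stratified idea on the example dataset, assuming five equal row bands with one random point drawn per band (the band count is an arbitrary choice):
set.seed(1)
n_strata <- 5                                # split the 10 rows into 5 bands of 2
bands <- split(seq_len(nrow(dataset)),
               cut(seq_len(nrow(dataset)), n_strata, labels = FALSE))
positions <- t(sapply(bands, function(rs)
  c(row = sample(rs, 1), col = sample(ncol(dataset), 1))))
positions           # one random (row, col) position per band
dataset[positions]  # the sampled values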
To check if your random sampling produces good results, I'd repeat the random sampling a few times and compare the results (as I assume that the sampling will be input for another analysis?).
