generate (overlapping) sets of mutually similar elements from binary similarity matrix [duplicate] - r

Given a symmetric binary similarity matrix M (1 = similarity), I want to extract all (potentially overlapping) subsets, where all elements within a set are mutually similar.
A B C D E
A 1 1 0 0 0
B 1 1 1 1 0
C 0 1 1 1 1
D 0 1 1 1 1
E 0 0 1 1 1
Also, sets contained within other sets should be discarded (e.g. {D,E} is contained in {C,D,E}). For the matrix the result would be: {A,B}, {B,C,D}, {C,D,E}
How can I easily achieve this?
I suspect there is a clustering algorithm for this, but I don't know what this type of problem is called. To which (mathematical) class of problems does this task belong?
Code
M <- matrix(c(1,1,0,0,0,
1,1,1,1,0,
0,1,1,1,1,
0,1,1,1,1,
0,0,1,1,1), ncol = 5, byrow = TRUE)
colnames(M) <- rownames(M) <- LETTERS[1:5]
PS. This may smell like a homework assignment, but it's actually a problem I encountered in my job :)

A clique is a subgraph in which every pair of vertices is connected.
What you are looking for is hence maximal clique enumeration: listing the cliques that are not contained in any larger clique.
https://en.wikipedia.org/wiki/Clique_problem
Beware that the result can be much larger than you anticipate. Consider a random graph in which each edge is present with probability p. For p close to 1, almost any subset is a clique, and finding the cliques becomes expensive. p can also be chosen to maximize the number of maximal cliques...
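To make the clique idea concrete, here is a minimal base R sketch of the basic Bron-Kerbosch recursion (without pivoting), applied to the matrix M from the question. The function name maximal_cliques is my own and not from any package; igraph's max_cliques() should do the same job if you already use igraph.

```r
M <- matrix(c(1,1,0,0,0,
              1,1,1,1,0,
              0,1,1,1,1,
              0,1,1,1,1,
              0,0,1,1,1), ncol = 5, byrow = TRUE)
colnames(M) <- rownames(M) <- LETTERS[1:5]

# Hypothetical helper: enumerate maximal cliques of a symmetric 0/1 matrix
# that has row names. R = current clique, P = candidates, X = excluded.
maximal_cliques <- function(adj) {
  diag(adj) <- 0  # self-similarity is not an edge
  nbrs <- lapply(seq_len(nrow(adj)), function(v) unname(which(adj[v, ] == 1)))
  found <- list()
  expand <- function(R, P, X) {
    if (length(P) == 0 && length(X) == 0) {
      # nothing can extend R, so R is maximal
      found[[length(found) + 1]] <<- rownames(adj)[sort(R)]
      return(invisible(NULL))
    }
    for (v in P) {
      expand(c(R, v), intersect(P, nbrs[[v]]), intersect(X, nbrs[[v]]))
      P <- setdiff(P, v)  # v has been fully explored
      X <- c(X, v)
    }
  }
  expand(integer(0), seq_len(nrow(adj)), integer(0))
  found
}

cliques <- maximal_cliques(M)
```

Because only maximal cliques are reported, subsets such as {D,E} never show up, which is the "discard contained sets" requirement from the question. The worst-case cost is still exponential, as warned above.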

Related

How to create all non-isomorphic trees with n=6 nodes?

I need to create all non-isomorphic trees with n=6 nodes. I have found the degree sequences and tried to generate the trees with the degree.sequence.game() function:
library(igraph)
set.seed(46)
par(mfrow=c(2, 3))
degs <- matrix(c(1,1,1,2,2,3,
1,1,1,3,2,2,
1,1,2,2,2,2,
1,1,1,1,2,4,
1,1,1,1,1,5,
1,1,1,1,3,3), nrow=6, byrow=T)
for(i in 1:6){
g6 <- degree.sequence.game(degs[i,], method="vl")
plot(g6, vertex.label=NA)
}
The output is:
One can see that graphs A and B in the left figure are isomorphic.
The expected result is in the right figure.
Question. What is an alternative method to create non-isomorphic trees?
Update
It seems I misunderstood your objective. One possible solution is to use the simple.no.multiple.uniform method of degree.sequence.game, i.e.,
g6 <- degree.sequence.game(degs[i, ], method = "simple.no.multiple.uniform")
and we can obtain
BTW, I am using igraph 1.3.5 (the version is shown when you type sessionInfo() in the console). You can try this version, which hopefully helps to address your problem as well.
Previous Answer
I think the pain point in your problem is "How do we find all distinct degree sequences of a tree with a given number of vertices?".
We can break this primary problem into two sub-problems:
What is the sum of the degrees of a tree with n vertices? The answer is 2*(n-1), because a tree has n-1 edges and each edge adds 2 to the degree sum.
How do we partition 2*(n-1) into n positive integers, ignoring order? The answer is partitions::restrictedparts
library(partitions)
n <- 6
degs <- t(restrictedparts(2*(n-1), n, include.zero = FALSE))
and you will see
> degs
[1,] 1 1 1 1 1 5
[2,] 1 1 1 1 2 4
[3,] 1 1 1 1 3 3
[4,] 1 1 1 2 2 3
[5,] 1 1 2 2 2 2
Then you can call degree.sequence.game(degs[i,], method="vl") for each i from 1 to nrow(degs).
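If you would rather not depend on the partitions package, the two sub-problems can be sketched in base R as below. The function name tree_degree_sequences is made up for this illustration. Note that a degree sequence does not pin down a tree uniquely: for n = 6 there are 5 sequences but 6 non-isomorphic trees, because 1,1,1,2,2,3 is realised by two different trees, so this step alone cannot produce the full set.

```r
# Hypothetical helper: all ways to write 2*(n-1) as a sum of n positive
# integers, listed in non-increasing order (one candidate sequence per row).
tree_degree_sequences <- function(n) {
  rec <- function(total, parts, max_part) {
    if (parts == 0) return(if (total == 0) list(integer(0)) else list())
    out <- list()
    # leave at least 1 for each of the remaining parts
    hi <- min(max_part, total - (parts - 1))
    if (hi < 1) return(out)
    for (k in seq_len(hi)) {
      for (rest in rec(total - k, parts - 1, k)) {
        out[[length(out) + 1]] <- c(k, rest)
      }
    }
    out
  }
  do.call(rbind, rec(2 * (n - 1), n, 2 * (n - 1)))
}

degs6 <- tree_degree_sequences(6)
```

Each row of degs6 sums to 2*(n-1) = 10 and can be passed to degree.sequence.game() in the same way as the rows of degs above.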

What's the best way to find all 'cycles/cliques' in a matrix in R?

Suppose I have a square matrix with a bunch of 0's and 1's, like this example:
1 2 3 4 5 6
1 0 1 0 1 0 0
2 1 0 0 1 0 0
3 0 0 0 0 0 0
4 1 1 0 0 0 0
5 0 0 0 0 0 1
6 0 0 0 0 1 0
We can consider this an adjacency matrix of sorts, with 6 vertices, and all diagonal elements are necessarily 0. And of course, the matrix must be symmetric about the diagonal.
What is the most efficient way, in R, given a bunch of matrices like this, of varying sizes, to find all 'cliques' in each matrix and give the members of each clique? (By 'clique' I mean a set of vertices each of which has an edge going to each other vertex in the set) For example, in the above matrix, there are two cliques; denoting the vertices by the row/column numbers, the cliques are (1, 2, 4) and (5, 6). So, for output, I want a list of all vertices (row or column names) in each clique, for all cliques in that matrix. And I want to do this for a large number of matrices of varying sizes (though all would be square matrices). Any ideas on what the best way to do this is?
I had been thinking of using a for loop within a for loop, but the fact that the size of possible cliques is indeterminate (though at most equal to the number of rows/columns in the matrix) makes me think I may need a while loop, though I'm not sure precisely how.
I will paste some code I just wrote that was supposed to do what I've described, with the list 'cliquelist' containing all cliques in the matrix by the end of the loop. The adjacency matrix is called 'mat.'
cliquelist <- NULL
for(i in 1:nrow(mat)){
for(j in 1:ncol(mat)){
if(mat[i,j]==1){
clique <- c(i, j)
pool <- j:ncol(mat)
while(length(pool)>0) {
add <- which(mat[,pool[1]]==1)
if(length(add)==0){
pool <- NULL
}else{
pool <- pool[which(!pool %in% add)]
clique <- c(clique, add)
}
}
cliquelist[length(cliquelist)+1] <- clique
}
}
}
I'm sure this code has some bugs in it; I have not figured out what they are since it is still running (after many minutes) on the example matrix I showed above, so I suspect there's an infinite loop in there somewhere.
But there are several problems with it that even make me disinclined to go back and debug it. 1) Most obviously, it will add each clique to the list as many times as there are elements in the list, so there's a lot of redundancy. 2) It doesn't filter out smaller cliques that are subsumed within bigger cliques. So, if a clique consists of vertices 1, 3, and 5, then something like the code above will return (1, 3), (1, 5), (3, 5), (1, 3, 5), etc. But I only want unique cliques not subsumed in larger cliques. I am wondering if there is a different, more efficient way to do this in R.
Any suggestions are much appreciated.
The igraph library is purpose-built for graph analysis. With dat as the name of your matrix:
library(igraph)
z <- graph.adjacency(dat)
cluster_walktrap(z)
IGRAPH clustering walktrap, groups: 3, mod: 0.38
+ groups:
$`1`
[1] "X5" "X6"
$`2`
[1] "X1" "X2" "X4"
$`3`
[1] "X3"
Here's a plot:
plot(z)
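One caveat: as far as I know, cluster_walktrap() is a community detection method, so it happens to recover the cliques in this small example but is not guaranteed to in general. A sketch that asks igraph for maximal cliques directly could look like this; the matrix is retyped from the question and the object names are my own.

```r
library(igraph)

# Adjacency matrix from the question (symmetric, zero diagonal)
mat <- matrix(c(0,1,0,1,0,0,
                1,0,0,1,0,0,
                0,0,0,0,0,0,
                1,1,0,0,0,0,
                0,0,0,0,0,1,
                0,0,0,0,1,0), nrow = 6, byrow = TRUE,
              dimnames = list(as.character(1:6), as.character(1:6)))

g <- graph_from_adjacency_matrix(mat, mode = "undirected")

# min = 2 skips isolated vertices such as vertex 3;
# as_ids() turns each vertex sequence into plain vertex names
cliques <- lapply(max_cliques(g, min = 2), function(vs) sort(as_ids(vs)))
```

For a large number of matrices, the last two lines can be wrapped in a function and applied with lapply() over a list of matrices.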

how to write a loop of the number of for loops in R?

this is probably a simple one, but I somehow got stuck...
I need many loops to get the result for every sample in my support, like the usual nested loops:
for (a in 1:N1){
for (b in 1:N2){
for (c in 1:N3){
...
}
}
}
but the number of the for loops needed in this messy system depends on another random variable, let's say,
for(f in 1:N.for)
so how can I write a for loop to deal with this? Or are there more elegant ways to do this?
note that the difference is that the nested for loops above (the variables a,b,c,...) do matter in my calculations, but the variable f of the for loop that controls for the number of for loops needed does not go into any of my calculations for my real purpose - all it does is count/ensure the number of for loops needed is correct.
Did I make it clear?
So what I am actually trying to do is generate all the possible combinations of a number of people's preferences towards others.
Let's say I have 6 people (the simplest case for my purpose): Abi, Bob, Cath, Dan, Eva, Fay.
Abi and Bob have preference lists of C D E F ( 4!=24 possible permutations for each of them);
Cath and Dan have preference lists of A B and E F, respectively (2! * 2! = 4 possible permutations for each of them);
Eva and Fay have preference lists of A B C D (4!=24 possible permutations for each of them);
So altogether there should be 24*24*4*4*24*24 possible permutations of preferences when taking all six of them together.
I am just wondering what is a clear, easy and systematic way to generate them all at once?
I'd want them in the format such as
c.prefs <- as.matrix(data.frame(Abi = c("Eva", "Fay", "Dan", "Cath"),Bob = c("Dan", "Eva", "Fay", "Cath")))
but any clear format is fine...
Thank you so much!!
I'll assume you have a list of each loop variable and its maximum value, ordered from the outermost to innermost variable.
loops <- list(a=2, b=3, c=2)
You could create a data frame with all the loop variable values in the correct order with:
(indices <- rev(do.call(expand.grid, lapply(rev(loops), seq_len))))
# a b c
# 1 1 1 1
# 2 1 1 2
# 3 1 2 1
# 4 1 2 2
# 5 1 3 1
# 6 1 3 2
# 7 2 1 1
# 8 2 1 2
# 9 2 2 1
# 10 2 2 2
# 11 2 3 1
# 12 2 3 2
If the code run at the innermost point of the nested loop doesn't depend on the previous iterations, you could use something like apply to process each iteration independently. Otherwise you could loop through the rows of the data frame with a single loop:
for (i in seq_len(nrow(indices))) {
# You can get "a" with indices$a[i], "b" with indices$b[i], etc.
}
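Applied to the preference problem in the question, the same expand.grid idea can index the permutations of each person's list instead of loop counters. This is only a sketch on a cut-down example (three people, made-up lists) so that the grid stays small; perms and profile are hypothetical helper names, not functions from a package.

```r
# Hypothetical helper: all orderings of a vector, as a list of vectors
perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (i in seq_along(x)) {
    for (p in perms(x[-i])) out[[length(out) + 1]] <- c(x[i], p)
  }
  out
}

# Cut-down illustration, not the full six-person case
prefs <- list(Abi  = c("Cath", "Dan", "Eva"),
              Bob  = c("Cath", "Dan"),
              Cath = c("Abi", "Bob"))

orderings <- lapply(prefs, perms)                  # every ordering per person
grid <- expand.grid(lapply(orderings, seq_along))  # one row per joint profile

# Look up the actual orderings for one row of the grid
profile <- function(row) {
  Map(function(p, k) p[[k]], orderings, as.list(grid[row, ]))
}
```

Here nrow(grid) is 3! * 2! * 2! = 24, and profile(i) returns a named list with one ordered preference vector per person. For the full six-person case the grid has 24*24*4*4*24*24 rows, so you would probably want to process rows one at a time rather than materialise every profile.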
For the calculation itself, one option is to use the Reduce function or some other higher-order function.
Since your data is not inherently ordered (an individual is part of a set, and its preferences are part of the set), I would keep the individuals in a factor and store e.g. the preferences in lists named after the individuals. If you have large data you can store it in an environment.
The first part of the code only builds reproducible example data. I used graph-style names (verts, edges) because the problem resembles a graph. To change the setup, you just need to change the first line and the runif call.
# people
verts <- factor(LETTERS[1:10])
# possible relations: everyone except yourself
edges <- lapply(seq_along(verts), function(ind) {
  levels(verts)[-ind]
})
names(edges) <- levels(verts)
# directions (the preferences each person actually has)
# say you have these stored in a list or something
pool <- levels(verts)
directions <- lapply(pool, function(vert) {
  relations <- pool[unique(round(runif(5, 1, 10)))]
  relations[relations != vert]  # disallow preferring yourself
})
names(directions) <- pool
num_prefs <- lapply(directions, length)
names(num_prefs) <- names(directions)
# First take the factorial of each person's number of preferences,
# then reduce that with multiplication
combinations <- Reduce(`*`, sapply(num_prefs, factorial))
I hope this answers your question!

R - Get a matrix with the reduced number of features with SVD

I'm using SVD in R and I'm able to reduce the dimensionality of my matrix by replacing the smallest singular values with 0. But when I recompose the matrix I still have the same number of features; I could not find how to actually drop the least useful features of the source matrix in order to reduce its number of columns.
For example what I'm doing for the moment:
This is my source matrix A:
A B C D
1 7 6 1 6
2 4 8 2 4
3 2 3 2 3
4 2 3 1 3
If I do:
s = svd(A)
s$d[3:4] = 0 # Replacement of the 2 smallest singular values by 0
A' = s$u %*% diag(s$d) %*% t(s$v)
I get A', which has the same dimensions (4x4), was reconstructed with only 2 "components" and is an approximation of A (containing a little less information, maybe less noise, etc.):
[,1] [,2] [,3] [,4]
1 6.871009 5.887558 1.1791440 6.215131
2 3.799792 7.779251 2.3862880 4.357163
3 2.289294 3.512959 0.9876354 2.386322
4 2.408818 3.181448 0.8417837 2.406172
What I want is a sub-matrix with fewer columns that still reproduces the distances between the different rows, something like this (obtained using PCA, let's call it A''):
PC1 PC2
1 -3.588727 1.7125360
2 -2.065012 -2.2465708
3 2.838545 0.1377343 # The similarity between rows 3
4 2.815194 0.3963005 # and 4 in A is conserved in A''
Here is the code to get A'' with PCA:
p = prcomp(A)
A'' = p$x[,1:2]
The final goal is to reduce the number of columns in order to speed up clustering algorithms on huge datasets.
Thank you in advance if someone can guide me :)
I would check out this chapter on dimensionality reduction or this Cross Validated question. The idea is that the entire data set can be reconstructed using less information. It's not like PCA, where you might choose to keep only 2 out of 10 principal components.
When you do the kind of trimming you did above, you're really just taking out some of the "noise" in your data. The data still has the same dimension.
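That said, the SVD does give you a matrix with fewer columns if you keep the scores instead of recomposing. A minimal sketch, assuming the matrix A from the question and k = 2 retained components (the object names are my own): the rows of U_k D_k have the same pairwise distances as the rows of the rank-k reconstruction, and centering the columns first gives the PCA scores up to sign.

```r
A <- matrix(c(7, 6, 1, 6,
              4, 8, 2, 4,
              2, 3, 2, 3,
              2, 3, 1, 3), nrow = 4, byrow = TRUE,
            dimnames = list(1:4, c("A", "B", "C", "D")))
k <- 2

s <- svd(A)
# Reduced representation: 4 rows, k columns (equivalently A %*% s$v[, 1:k])
scores <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k)

# The rank-k reconstruction still has 4 columns ...
A_k <- scores %*% t(s$v[, 1:k])
# ... but its row-to-row distances equal those of the k-column scores
same_dist <- all.equal(as.vector(dist(scores)), as.vector(dist(A_k)))

# Centering the columns first reproduces prcomp's scores up to sign
sc <- svd(scale(A, center = TRUE, scale = FALSE))
pca_like <- sc$u[, 1:k] %*% diag(sc$d[1:k], nrow = k)
same_pca <- all.equal(abs(unname(prcomp(A)$x[, 1:k])), abs(pca_like))
```

So for speeding up clustering you would feed scores (or pca_like) to the clustering algorithm rather than the recomposed matrix.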

Dynamic Network In R

I am currently working on dynamic temporal network.
Header: Time Sender Receiver
1 1 2
1 1 3
2 2 1
2 2 1
3 1 2
3 1 2
The above is a sample of my dataset.
There are 3 time periods (sessions) and the edgelists between nodes.
I want to compute centrality measures by each time period.
I am thinking about writing a script that computes centrality measures within each time period.
However, I am wondering whether there are R libraries that can handle this problem.
Does anyone know of one?
Jinie
I tried to write the code for subsetting data based on Time as follows:
uniq <-unique(unlist(df$Time))
uniq
[1] 1 2 3
for (i in 1:length(uniq)){
t[i]<-subset(df, Time==uniq[i])
net[i] <-as.matrix(t[i])
netT[i]<-net[i][,-3] #removing time column
#### getting edgelist
netT[i][,1]=as.character(net[i][,1])
netT[i][,2]=as.character(net[i][,2])
g [i]=graph.edgelist(netT [i], directed=T)
g[i]
}
however, I've got a error message ( Error in t[i] <- subset(df, Time == uniq[i]) :
object of type 'closure' is not subsettable)
Do you know why? I am kind of new to R so it is hard to figure it out.
I guess t[i] is the problem. I don't know how to assign t[i] as a data frame.
The networkDynamic R library is helpful for this sort of thing (disclaimer: I'm a package maintainer)
library(networkDynamic)
# a data frame with your input data
raw<-data.frame(time=c(1,1,2,2,3,3),
sender=c(1,1,2,2,1,1),
receiver=c(2,3,1,1,2,2))
# add another time column to define a start and end time for each edge spell
raw2<-cbind(raw$time,raw$time+1,raw$sender,raw$receiver)
# create a networkDynamic object using this edge timing info
nd<-networkDynamic(edge.spells=raw2)
# load the sna library with static network measures
library(sna)
# apply degree measure to static networks extracted at default time points
lapply(get.networks(nd),degree)
[[1]]
[1] 2 1 1
[[2]]
[1] 1 1 0
[[3]]
[1] 1 1 0
You could try the igraph library. I'm not familiar with it, but I find this question interesting enough to code up an answer, so here we go:
Because you've got a directed network (senders and receivers), you're going to need two measures of centrality: indegree and outdegree.
Calculating this is fairly simple, the complication is splitting by time points. So for each time point we need to do the following:
Create an adjacency matrix indicating for each row (sender) the number of connections to each column (receiver).
From that we can simply add up the connections in the rows to get the outdegree, and the connections in the columns for the indegree.
Assuming your data is stored in a data.frame named df we can use split to split your data.frame by time point:
nodes <- unique(c(unique(df$Sender), unique(df$Receiver)))
centrality <- lapply(split(df, df$Time), function(time.df) {
adj <- matrix(0, length(nodes), length(nodes), dimnames=list(nodes, nodes))
for (i in 1:nrow(time.df)) {
sender <- time.df[i, "Sender"]
receiver <- time.df[i, "Receiver"]
adj[sender, receiver] <- adj[sender, receiver] + 1
}
list(indegree=colSums(adj), outdegree=rowSums(adj))
})
names(centrality) <- paste0("Time.Point.", 1:length(centrality))
If we run the code on your data (I've replaced the Senders and Receivers with letters for clarity):
> centrality
$Time.Point.1
$Time.Point.1$indegree
a b c
0 1 1
$Time.Point.1$outdegree
a b c
2 0 0
$Time.Point.2
$Time.Point.2$indegree
a b c
2 0 0
$Time.Point.2$outdegree
a b c
0 2 0
$Time.Point.3
$Time.Point.3$indegree
a b c
0 2 0
$Time.Point.3$outdegree
a b c
2 0 0
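On the error in the question: t is base R's transpose function, so t[i] <- ... tries to index a function, which is what "object of type 'closure' is not subsettable" means. Storing the per-period pieces in a list avoids that. As a sketch, the degree counts above can also be obtained without the inner loop by tabulating senders and receivers; the data frame is retyped from the question.

```r
df <- data.frame(Time     = c(1, 1, 2, 2, 3, 3),
                 Sender   = c(1, 1, 2, 2, 1, 1),
                 Receiver = c(2, 3, 1, 1, 2, 2))

nodes <- sort(unique(c(df$Sender, df$Receiver)))

# split() returns a named list with one data frame per time period
by_time <- split(df, df$Time)

centrality <- lapply(by_time, function(d) {
  # factor(..., levels = nodes) keeps nodes with zero edges in the counts
  list(indegree  = table(factor(d$Receiver, levels = nodes)),
       outdegree = table(factor(d$Sender,   levels = nodes)))
})
```

centrality[["1"]]$outdegree then holds the outdegree of every node in period 1, and so on for the other periods.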
