Algorithm for automating pairwise significance grouping labels in R

After struggling with this problem for a while, I am hoping to get some advice here. Is anyone aware of an automated method for determining pairwise grouping labels based on significance? The question is independent of the significance test (e.g. Tukey for parametric or Mann-Whitney for non-parametric): given a set of pairwise comparisons, boxplot-type figures often represent the resulting groupings with letter labels above each box, where levels that share a letter are not significantly different.
I have done the example below by hand, which can be quite tedious. I think that the sequence of labeling in the algorithm should be based on the number of levels in each group - e.g. those groups containing single levels that are significantly different from all other levels should be named first, then groups containing 2 levels, then 3, etc., all the while checking that each new grouping is actually needed and does not violate any significant differences.
In the example below, the tricky part is getting the algorithm to recognize that level 1 should be grouped with 3 and 5, but 3 and 5 should not be grouped (i.e. share a label).
Example code:
set.seed(1)
n <- 7
n2 <- 100
mu <- cumsum(runif(n, min=-3, max=3))
sigma <- runif(n, min=1, max=3)
dat <- vector(mode="list", n)
for(i in seq(dat)){
  dat[[i]] <- rnorm(n2, mean=mu[i], sd=sigma[i])
}
df <- data.frame(group=as.factor(rep(seq(n), each=n2)), y=unlist(dat))
bp <- boxplot(y ~ group, df, notch=TRUE)
kr <- kruskal.test(y ~ group, df)
kr
mw <- pairwise.wilcox.test(df$y, df$group)
mw
mw$p.value > 0.05 # TRUE means that the levels are not significantly different at the p=0.05 level
#       1     2     3     4     5     6
# 2 FALSE    NA    NA    NA    NA    NA
# 3  TRUE FALSE    NA    NA    NA    NA
# 4 FALSE FALSE FALSE    NA    NA    NA
# 5  TRUE FALSE FALSE FALSE    NA    NA
# 6 FALSE FALSE FALSE  TRUE FALSE    NA
# 7 FALSE FALSE FALSE FALSE FALSE FALSE
text(x=1:n, y=bp$stats[4,], labels=c("AB", "C", "A", "D", "B", "D", "E"), col=1, cex=1.5, pos=3, font=2)

First let me restate the problem in the language of graph theory. Define a graph as follows. Each sample gives rise to a vertex that represents it. Between two vertices, there is an edge if and only if some test indicates that the samples represented by those vertices could not be distinguished statistically. In graph theory, a clique is a set of vertices such that, between every two vertices in the set, there is an edge. We're looking for a collection of cliques such that every edge in the graph belongs to (at least? exactly?) one of the cliques. We'd like to use as few cliques as possible. (This problem is called clique edge cover, not clique cover.) We then assign each clique its own letter and label its members with that letter. Each sample distinguishable from all others gets its own letter as well.
For example, the graph corresponding to your sample input could be drawn like this.
3---1---5 4--6
My proposed algorithm is the following. Construct the graph and use the Bron--Kerbosch algorithm to find all maximal cliques. For the graph above, these are {1, 3}, {1, 5}, and {4, 6}. The set {1}, for example, is a clique, but it is not maximal because it is a subset of the clique {1, 3}. The set {1, 3, 5} is not a clique because there is no edge between 3 and 5. In the graph
  1
 / \
3---5   4--6,
the maximal cliques would be {1, 3, 5} and {4, 6}.
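For reference, this step can be checked by machine (a quick sketch of mine, not part of the original answer; the posted solution below uses the same igraph function, and min = 2 skips trivial single-vertex cliques):
library(igraph)
# the graph 3---1---5   4--6
g_a <- graph_from_edgelist(rbind(c(3,1), c(1,5), c(4,6)), directed = FALSE)
maximal.cliques(g_a, min = 2)  # {1,3}, {1,5}, {4,6}
# the triangle version with the extra 3--5 edge
g_b <- graph_from_edgelist(rbind(c(1,3), c(1,5), c(3,5), c(4,6)), directed = FALSE)
maximal.cliques(g_b, min = 2)  # {1,3,5}, {4,6}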
Now search recursively for a small clique edge cover. The input to our recursive function is the set of edges remaining to be covered and the list of maximal cliques. Find the least edge in the remaining set, where, e.g., edge (1,2) < (1,5) < (2,3) < (2,5) < (3,4). For each maximal clique that contains this edge, construct a candidate solution comprising that clique and the output of a recursive call where the clique's edges are removed from the set of remaining edges. Output the best candidate.
Unless there are very few edges, this may be too slow. The first performance improvement is memoization: maintain a map from inputs to outputs of the recursive function so that we can avoid doing the same work twice. If that doesn't work, then R should have an interface to an integer program solver, and we can use integer programming to determine the best collection of cliques. (I'll explain this more if the other approach is insufficient.)
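Here is a rough sketch of that recursion with memoization (my own illustration of the idea above, not tested beyond toy inputs; it assumes edges are kept as a lexicographically sorted list of c(i, j) pairs so that edges[[1]] is always the least remaining edge):
cover <- function(edges, cliques, memo = new.env(parent = emptyenv())) {
  # edges: sorted list of c(i, j) pairs still to be covered
  # cliques: list of all maximal cliques (integer vectors)
  if (length(edges) == 0) return(list())
  key <- paste(vapply(edges, paste, "", collapse = ","), collapse = ";")
  if (exists(key, envir = memo, inherits = FALSE)) return(get(key, envir = memo))
  e <- edges[[1]]                          # the least uncovered edge
  best <- NULL
  for (cl in cliques) {
    if (all(e %in% cl)) {                  # only cliques containing e qualify
      covered <- vapply(edges, function(x) all(x %in% cl), logical(1))
      cand <- c(list(cl), cover(edges[!covered], cliques, memo))
      if (is.null(best) || length(cand) < length(best)) best <- cand
    }
  }
  assign(key, best, envir = memo)
  best
}
# Toy run on the example graph 3---1---5   4--6:
cover(edges = list(c(1,3), c(1,5), c(4,6)),
      cliques = list(c(1,3), c(1,5), c(4,6)))  # all three cliques are needed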

I thought I would post the solution that I was able to derive, with additional help from a related question:
set.seed(1)
n <- 7
n2 <- 100
mu <- cumsum(runif(n, min=-3, max=3))
sigma <- runif(n, min=1, max=3)
dat <- vector(mode="list", n)
for(i in seq(dat)){
  dat[[i]] <- rnorm(n2, mean=mu[i], sd=sigma[i])
}
df <- data.frame(group=as.factor(rep(seq(n), each=n2)), y=unlist(dat))
bp <- boxplot(y ~ group, df, notch=TRUE)
#significance test
kr <- kruskal.test(y ~ group, df)
mw <- pairwise.wilcox.test(df$y, df$group)
#matrix showing connections between levels
g <- as.matrix(mw$p.value > 0.05)
g <- cbind(rbind(NA, g), NA)
g <- replace(g, is.na(g), FALSE)
g <- g + t(g)
diag(g) <- 1
rownames(g) <- 1:n
colnames(g) <- 1:n
g
#install.packages("igraph")
library(igraph)
# Build the edge list from the connection matrix
same <- which(g==1)
topology <- data.frame(N1=((same-1) %% n) + 1, N2=((same-1) %/% n) + 1)
topology <- topology[order(topology[[1]]),] # Get rid of loops and ensure right naming of vertices
g3 <- simplify(graph.data.frame(topology,directed = FALSE))
get.data.frame(g3)
# Plot graph
plot(g3)
# Calculate the maximal cliques
res <- maximal.cliques(g3)
# Reorder cliques by their smallest levels
res <- sapply(res, sort)
res <- res[order(sapply(res,function(x)paste0(sort(x),collapse=".")))]
ml <- max(sapply(res, length))
reord <- do.call(order, data.frame(
  do.call(rbind,
          lapply(res, function(x) c(sort(x), rep.int(0, ml - length(x))))
  )
))
res <- res[reord]
lab.txt <- vector(mode="list", n)
lab <- letters[seq(res)]
for(i in seq(res)){
  for(j in res[[i]]){
    lab.txt[[j]] <- paste0(lab.txt[[j]], lab[i])
  }
}
bp <- boxplot(y ~ group, df, notch=TRUE, outline=FALSE, ylim=range(df$y)+c(0,1))
text(x=1:n, y=bp$stats[5,], labels=lab.txt, col=1, cex=1, pos=3, font=2)

Cool code.
I think you need to quote the function order() when calling do.call:
reord<-do.call("order", data.frame(
do.call(rbind,
lapply(res, function(x) c(sort(x), rep.int(0, ml-length(x))))
)
))

Related

How to generate the N most dissimilar combinations in R

I have a set of 6 colour codes (x), a set of N individuals that each need to be labelled with a unique colour code, and four locations on each animal, each of which can carry a different colour.
So, the codes for two individuals might be:
1. red, blue, blue, white
2. white, yellow, pink, yellow
However, as the colour at each position can fall off, I would like to generate a redundant labelling scheme that would still allow an individual to be distinguished from others, even after it loses the colour at one (or even two?) locations.
Even though 6 colours and 4 positions gives 1296 combinations, I am finding it difficult to select the N most dissimilar combinations:
Reproducible example:
library(gtools)
x <- c("white", "red", "green", "blue", "pink", "yellow")
Perms <- permutations(n=6,r=4,v=x,repeats.allowed=T)
print(nrow(Perms))
head(Perms)
Note that the first 6 combinations differ in the colour at only 1 position - loss of this code by >1 individual would mean that they can no longer be distinguished!
So, for values of N between 50-150, how to select the N most dissimilar combinations?
Thanks!
I cannot conclusively answer your question, but I have an idea that might help you.
Build string codes with the first letter of each color:
library(gtools)
x <- c("w", "r", "g", "b", "p", "y")
Perms <- permutations(n=6,r=4,v=x,repeats.allowed=T)
m <- apply(Perms, 1, paste, collapse = "")
> head(m)
[1] "bbbb" "bbbg" "bbbp" "bbbr" "bbbw" "bbby"
Sample n codes:
set.seed(1)
n <- 50
y <- sample(m, n)
Create an n*n matrix of Levenshtein distances:
library(vwr)
lvmat <- sapply(y, function(x) levenshtein.distance(x, y))
> lvmat[1:5, 1:5]
     grrp pgpg rprr yprw gggp
grrp    0    4    3    3    2
pgpg    4    0    4    4    3
rprr    3    4    0    2    4
yprw    3    4    2    0    4
gggp    2    3    4    4    0
Now you could maximize sum(lvmat), maybe via bootstrapping or whatever floats your boat, to get the sample of most dissimilar combinations.
A reproducible example of LAP's suggestion above.
Note: due to the reliance upon random sampling, this still does not guarantee that there will be no code pairs that differ at only one position. Still, it's a good start - thanks, LAP!
# install.packages("gtools")
library(gtools)
library(vwr)
## Available colours
x <- c("W", "R", "G", "B", "P", "Y")
## Generate all possible colour combinations, for 6 colours & 4 positions
body <- data.frame(permutations(n=6, r=4, v=x, repeats.allowed=T), stringsAsFactors=F)
colnames(body) <- c("Head","Thorax","L_gaster","R_gaster")
## concatenate each colour-code to a sequence without spaces, etc
m <- paste( body$Head, body$Thorax, body$L_gaster, body$R_gaster, sep="")
##
set.seed(1)
COLONY_SIZE <- 50 ## How many adult workers in the colony excluding the queen
N_Attempts <- 1000 ## How many alternative solutions to generate - the more the better, but it takes longer
## prepare data-containers
Summary <- NULL
LvList <- list()
for (TRY in 1:N_Attempts){
  print(paste(TRY, "of", N_Attempts))
  y <- sample(m, COLONY_SIZE) ## randomly sample COLONY_SIZE codes
  ## measure pairwise Levenshtein distances for all pair combinations
  Matrix <- sapply(y, function(x) levenshtein.distance(x, y))
  diag(Matrix) <- NA ## eliminate self-self measure (distance = 0)
  Matrix[lower.tri(Matrix)] <- NA ## dist i-j = dist j-i
  ## store solution
  LvList[[TRY]] <- Matrix
  ## summarize each solution using three metrics:
  ## (i) the average pair distance (higher is better)
  ## (ii) the number of 'close' code pairs (those with the minimum distance of 1 - lower is better)
  ## (iii) the maximum number of 'close' code pairs for any single code (lower is better)
  Summary <- rbind(Summary, data.frame(Mean_Distance = mean(Matrix, na.rm=T),
                                       N_close_pairs = sum(Matrix[!is.na(Matrix)]==1),
                                       N_close_pairs_per_ant = max(rowSums(Matrix==1, na.rm=T))))
}
## ***Find the solution with the fewest pairs with the lowest distance***
Summary$Mean_Distance_Rank <- rank(Summary$Mean_Distance)
Summary$N_close_pairs_Rank <- rank(-Summary$N_close_pairs)
Summary$N_close_pairs_per_ant_Rank <- rank(-Summary$N_close_pairs_per_ant)
Summary$Rank_Total <- Summary$Mean_Distance_Rank + Summary$N_close_pairs_Rank + Summary$N_close_pairs_per_ant_Rank
solution <- rownames( LvList[[which.max(Summary$Rank_Total)]] )
## Highlight candidate solutions
Colour <- rep(rgb(0, 0, 0, alpha=0.1), nrow(Summary))
Colour [which.max(Summary$Rank_Total) ] <- "red"
pairs(Summary[,c("Mean_Distance","N_close_pairs","N_close_pairs_per_ant")], col=Colour, bg=Colour, pch=21, cex=1.4)
## format into a table
SOLUTION <- data.frame(Code=1:COLONY_SIZE, t(as.data.frame(sapply(solution, strsplit, ""))))
colnames(SOLUTION)[2:5] <- c("Head","Thorax","L_gaster","R_gaster")
Here's a better approach that does not rely upon blind sampling, but instead represents the similarity between each code pair as an edge in a network, and then uses the igraph function largest_ivs to search for the largest sets of mutually dissimilar codes:
rm(list=ls())
library(gtools)
library(igraph)
##
outputfolder <- "XXXXXXXXXX"
dir.create(outputfolder,showWarnings = F)
setwd(outputfolder)
## Available colours
x <- c("W", "R", "G", "B", "P", "Y")
## Generate all possible colour combinations, for 6 colours & 4 positions
body <- data.frame(permutations(n=6, r=4, v=x, repeats.allowed=T), stringsAsFactors=F)
colnames(body) <- c("Head","Thorax","L_gaster","R_gaster")
write.table(body,file="Paint_marks_full_list.txt",col.names=T,row.names=F,quote=F,append=F)
## Generate edge list
edge_list <- data.frame(comb_1=character(),comb_2=character(),similarity=character())
if (!file.exists("Edge_list.txt")){
  write.table(edge_list, file="Edge_list.txt", col.names=T, row.names=F, quote=F, append=F)
}else{
  edge_list <- read.table("Edge_list.txt", header=T, stringsAsFactors=F)
}
## defaults so the resume check below also works on a fresh run
last_i <- 1; last_j <- 1
if (nrow(edge_list)>0){
  last_i <- edge_list[nrow(edge_list),"comb_1"]
  last_j <- edge_list[nrow(edge_list),"comb_2"]
}
if (!(last_i==(nrow(body)-1) & last_j==nrow(body))){
  for (i in last_i:(nrow(body)-1)){
    print(paste("Combination", i))
    for (j in (i+1):nrow(body)){
      if (i>last_i | j>last_j){
        simil <- length(which(body[i,]==body[j,]))
        if (simil>0){
          write.table(data.frame(comb_1=i, comb_2=j, similarity=simil),
                      file="Edge_list.txt", col.names=F, row.names=F, quote=F, append=T)
        }
      }
    }
  }
}
###### Let's make 3 graphs with edges representing overlap between combinations ###
## First graph, in which ANY overlap between two combinations is treated as an edge. Will be used to produce a list of paint combinations with no overlap
net1 <- graph.data.frame(edge_list[c("comb_1","comb_2")],directed=F)
## Second graph, in which only overlaps of 2 or more spots are treated as edges. Will be used to produce a list of paint combinations with no more than 1 spot in common
net2 <- graph.data.frame(edge_list[which(edge_list$similarity>=2),c("comb_1","comb_2")],directed=F)
## Third graph, in which only overlaps of 3 or more spots are treated as edges. Will be used to produce a list of paint combinations with no more than 2 spots in common
net3 <- graph.data.frame(edge_list[which(edge_list$similarity>=3),c("comb_1","comb_2")],directed=F)
####### Now let's use the ivs function to get independent vertex sets, i.e., sets of vertices with no connections between any of them
no_overlap_list <- largest_ivs(net1)
max_one_spot_overlap_list <- largest_ivs(net2)
max_two_spots_overlap_list <- largest_ivs(net3)

Sampling from a subset of data

I have the following problem.
I have multiple subarrays (say 2) that I have populated with character labels (1, 2, 3, 4, 5). My algorithm selects labels at random based on occurrence probabilities.
How can I get R to instead select labels 1:3 for subarray 1 and 4:5 for subarray 2, say, without using subsetting (i.e., [])? That is, I want a random subset of labels to be selected for each subarray, instead of all labels being assigned to each subarray manually using [].
I know sample() should help.
Using subsetting (which I don't want) one would do
x <- 1:5
sample(x[1:3], size, prob = probs[1:3])
but this assigns labels 1:3 to ALL subarrays.
Would
sample(sample(x), size, replace = TRUE, prob = probs)
work?
Any ideas? Please let me know if this is unclear.
Here is a small example, which selects labels from 1:5 for each of 10 subarrays.
set.seed(1)
N <- 10
K <- 2
Hstar <- 5
probs <- rep(1/Hstar, Hstar)
perms <- 5
## Set up container(s) to hold the identity of each individual from each permutation ##
num.specs <- ceiling(N / K)
## Create an ID for each haplotype ##
haps <- 1:Hstar
## Assign individuals (N) to each subpopulation (K) ##
specs <- 1:num.specs
## Generate permutations, assume each permutation has N individuals, and sample those individuals' haplotypes from the probabilities ##
gen.perms <- function() {
  sample(haps, size = num.specs, replace = TRUE, prob = probs) # I would like each subarray to contain a random subset of 1:5.
}
pop <- array(dim = c(perms, num.specs, K))
for (i in 1:K) {
  pop[,, i] <- replicate(perms, gen.perms())
}
pop
Hopefully this helps.
I think what you actually want is something like this:
num.specs <- 3
haps[sample(seq(haps),size = num.specs,replace = F)]
[1] 3 5 4
That is a random subset of your vector haps - is that what you want?
Not quite what you want (this returns a list of matrices instead of a 3D array), but it might help:
lapply(split(1:5, cut(1:5, breaks=c(0, 2, 5))), function(i) matrix(sample(i, 25, replace=TRUE), ncol=5))
Use cut and split to partition your vector of character labels before sampling them. Here I split your character labels at the value 2. Also, rather than sampling 5 numbers 5 times, you can sample 25 numbers once and convert to a matrix.
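For clarity, here is the partition on its own (a small illustration of the cut/split step, my addition):
split(1:5, cut(1:5, breaks=c(0, 2, 5)))
# $`(0,2]`
# [1] 1 2
#
# $`(2,5]`
# [1] 3 4 5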

Sort list into hash table according to specific comparisson criteria in R

I am looking for a way, in R, to convert a list into a hash table, grouping elements that are similar according to a specific criterion.
The details are specific to "graph theory", as explained below, but I suppose the answer is a general procedure for hashing based on some specific criterion.
The list is comprised of "graph" objects (from igraph package).
library(igraph)
#Creating the list of graphs
edgeList <- data.frame(
  idA=c(008, 001, 001, 010, 047, 002, 005, 005),
  idB=c(100, 010, 020, 030, 030, 001, 011, 111)
)
edgeList$idB <- edgeList$idB + 0.1
g <- graph_from_data_frame(edgeList, directed = TRUE)
g_list <- decompose(g, mode = "weak")
#from the 8 edges we obtain 5 graphs (connected components of the original graph)
The similarity criterion is that graphs must be isomorphic:
isomorphic(g_list[[1]],g_list[[4]])
How can I hash the indexes for the elements in g_list into a hash table?
For this toy example the expected result should be:
g_inded_hash
[[1]]
[1] 1 4
[[2]]
[1] 2 5
[[3]]
[1] 3
(not necessarily a list, but some data structure that groups graphs (1 and 4) and (2 and 5) which are similar)
In reality, I have 40 million (small) graphs that I need to group according to isomorphism.
From searching, I found that the answer must be related to the hash package or environments, but I could not adapt that into a solution.
EDIT: changed directed = TRUE in graph_from_data_frame(), above.
Since isomorphism is transitive, we can look at all the pairs of components (i,j), such that i < j, then build a graph where the nodes are the components and the edges are defined by the isomorphism property. The hash table can then be extracted from the connected components of this new graph.
# all pairs (i,j) such that i < j
combinations <- unlist(sapply(seq_along(g_list),
                              function(j) lapply(seq_len(j-1),
                                                 function(i) c(i,j))),
                       recursive = FALSE)
# filter the isomorphic pairs
iso <- Filter(function(pair) isomorphic(g_list[[pair[1]]], g_list[[pair[2]]]),
              combinations)
# convert to data frame
df <- data.frame(matrix(unlist(iso), ncol = 2, byrow = TRUE))
# build graph where the vertices are the components
# and the edges indicate the isomorphic property
g_iso <- graph_from_data_frame(df, directed = FALSE)
# identify groups that share the same property
groups <- clusters(g_iso)$membership
# the names are the indices of g_list
g_hash <- lapply(unique(groups),
                 function(i) as.integer(names(which(groups == i))))
Result:
> g_hash
[[1]]
[1] 2 3 5
[[2]]
[1] 1 4
This does not match the expected result in the question but isomorphic(g_list[[2]],g_list[[3]]) and isomorphic(g_list[[3]],g_list[[5]]) are true.
It's probably not the most straightforward way to do this but that's what came to mind.
I managed to write a solution for my problem. It is probably not very "Rish" and not very efficient, with all the loops, but I think it works. Please let me know of a better way to do this.
gl_hash <- list()
gl_hash[1] <- 1
j <- 1
for(i in 2:length(gl)) {
  m <- 0
  for(k in 1:j){
    if(isomorphic(gl[[ gl_hash[[k]][1] ]], gl[[i]])) {
      gl_hash[[k]] <- c(gl_hash[[k]], i)  # append i to the matching group k
      m <- 1
      break
    }
  }
  if(m==0) {
    j <- j + 1
    gl_hash[j] <- i
  }
}
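Given the 40 million graphs mentioned in the question, pairwise checking will not scale well. A hedged sketch of the environment-as-hash-table idea (my addition, using g_list from the question): bucket the graphs by a cheap invariant such as vertex count plus sorted degree sequence, so the expensive isomorphic() checks only run inside matching buckets. The invariant is necessary but not sufficient for isomorphism, so each bucket still needs pairwise checks like the loop above.
library(igraph)
bucket_key <- function(g) {
  # cheap invariant: isomorphic graphs always share this key
  paste(vcount(g), paste(sort(degree(g)), collapse="."), sep="|")
}
buckets <- new.env(parent = emptyenv())
for (i in seq_along(g_list)) {
  k <- bucket_key(g_list[[i]])
  old <- if (exists(k, envir=buckets, inherits=FALSE)) get(k, envir=buckets) else integer(0)
  assign(k, c(old, i), envir=buckets)
}
as.list(buckets)  # candidate groups; confirm members with isomorphic()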

Neighbor groups based on cluster assignment is slow

I am doing some analysis using igraph in R, and I am currently doing a calculation that is very expensive. I need to do it across all of the nodes in my graph, so if someone knows a more efficient way to do it, I would appreciate it.
I start out with a graph, g, and first do some community detection on it:
library(igraph)
adj_matrix <- matrix(rbinom(8000 * 8000, 1, 0.5), ncol = 8000, nrow = 8000)
g <- graph_from_adjacency_matrix(adj_matrix, mode = 'undirected', diag = FALSE)
c <- cluster_louvain(g)
Then, I basically assign each cluster to 1 of 2 groups
nc <- length(c)
assignments <- rbinom(nc, 1, .5)
Now, for each node, I want to find out what percentage of its neighbors are in a given group (as defined by the cluster assignments). I currently do it this way:
pct_neighbors_1 <- function(g, vertex, c, assignments) {
  sum(
    ifelse(assignments[membership(c)[neighbors(g, vertex)]] == 1, 1, 0)
  ) / length(neighbors(g, vertex))
}
And then, given that I have a dataframe with each row corresponding to one vertex in the graph, I do this for all vertices with
data$pct_neighbors_1 <- sapply(1:nrow(data),
                               pct_neighbors_1,
                               g = g, c = c,
                               assignments = assignments)
Is there somewhere in here that I can make things more efficient? Thanks!
This should be faster:
library(igraph)
# for reproducibility's sake
set.seed(1234)
# create a random 1000 vertices graph
nverts <- 1000
g <- igraph::random.graph.game(nverts,0.1,type='gnp',directed=FALSE)
# clustering
c <- cluster_louvain(g)
# assignments
nc <- length(c)
assignments <- rbinom(nc, 1, .5)
# precalculate if a vertex belongs to the assigned communities
vertsInAssignments <- membership(c) %in% which(assignments==1)
# compute probabilities
probs <- sapply(1:vcount(g), FUN=function(i){
  neigh <- neighbors(g, i)
  sum(vertsInAssignments[neigh]) / length(neigh)
})
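If that is still too slow, a fully vectorised variant is possible via the sparse adjacency matrix (my sketch, untested at scale, reusing g, vertsInAssignments, and probs from above):
library(Matrix)
# row i of A %*% group1 counts vertex i's neighbours in group 1;
# dividing by the degree turns the count into a fraction
A <- as_adjacency_matrix(g, sparse = TRUE)
group1 <- as.numeric(vertsInAssignments)
probs2 <- as.vector(A %*% group1) / degree(g)
all.equal(probs, probs2)  # should agree with the sapply version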

Colored graph isomorphisms: 1(red)->2(blue) vs 1(blue)->2(red)

Given two simple graphs:
library(igraph)
g <- graph.empty()
g <- g + vertices(1,2,3)
g <- g + path(1,2,3)
g1 <- g
V(g1)$color = c(1,2,2)
g2 <- g
V(g2)$color = c(2,1,1)
which look like:
par(mfrow=c(1,2))
palette(rainbow(3))
plot(g1)
plot(g2)
How come they are not isomorphic?
graph.isomorphic.vf2(g1,g2)$iso
FALSE
And most importantly, if this is not an isomorphism, how can I detect this kind of equivalence within igraph?
(I post a first hack as an answer to keep the question uncluttered. This hack does not always work and is therefore faulty; see the second example below.
For a hack that does work, please see either my second answer or the other answers!)
I find the canonical permutation of labels, then a canonical coloring of this new canonical graph, and then I can use vf2.
Our function to re-color the graph is:
# Convert aaabbccdefaa -> 111223345611
canonical <- function(input){
  labels <- unique(input)
  match(input, labels)
}
And now back to business:
g <- graph.empty()
g <- g + vertices(1,2,3)
g <- g + path(1,2,3)
g1 <- g
V(g1)$color = c(1,2,2)
g2 <- g
V(g2)$color = c(2,1,1)
# Find canonical topological labeling and then canonical coloring
g1 <- permute.vertices(g1, canonical.permutation(g1)$labeling)
g2 <- permute.vertices(g2, canonical.permutation(g2)$labeling)
V(g1)$color <- canonical(V(g1)$color)
V(g2)$color <- canonical(V(g2)$color)
par(mfrow=c(1,2))
palette(rainbow(3))
plot(g1)
plot(g2)
These will now be detected as isomorphic:
#vf2 wants colors to be the same, not "up to a relabeling"
# this is why we use canonical colors
graph.isomorphic.vf2(g1, g2)$iso
TRUE
Failure example:
For this example, it does not work:
g1 <- graph.empty()
g1 <- g1 + vertices(1,2)
g1 <- g1 + edge(1,2)
V(g1)$color = c(1,2)
g2 <- graph.empty()
g2 <- g2 + vertices(1,2)
g2 <- g2 + edge(2,1)
V(g2)$color = c(2,1)
# Find canonical topological labeling and then canonical coloring
g1 <- permute.vertices(g1, canonical.permutation(g1)$labeling)
g2 <- permute.vertices(g2, canonical.permutation(g2)$labeling)
V(g1)$color <- canonical(V(g1)$color)
V(g2)$color <- canonical(V(g2)$color)
par(mfrow=c(1,2))
palette(rainbow(3))
plot(g1)
plot(g2)
graph.isomorphic.vf2(g1,g2)$iso
# FALSE
Indeed, graph.isomorphic.vf2 wants the colour labels to match. The solution is to permute all colour labels and test whether one of the permutations gives isomorphic graphs. If one does, then your graphs are isomorphic.
library(combinat)
colour_isomorphic <- function(g1, g2){
  g2_copy <- g2
  colour2 <- unique(V(g2)$color)
  colour2_permutations <- permn(colour2)
  for(p in colour2_permutations){
    names(p) <- as.character(colour2)  # map each original colour to its permuted value
    V(g2_copy)$color <- sapply(V(g2)$color, function(x) p[as.character(x)])
    test_result <- graph.isomorphic.vf2(g1, g2_copy)$iso
    if (test_result) {return(TRUE)}
  }
  return(FALSE)
}
colour_isomorphic(g1,g2) should now return TRUE and it should also work in the other test case of the other answer given.
The only place where it can fail is if the colour labels are not systematically chosen as the first n natural numbers (1,2,3,4,...), in which case you need to convert them to that form first.
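A minimal sketch of that conversion (my addition, the same idea as the canonical() helper in my earlier answer):
all_cols <- unique(c(V(g1)$color, V(g2)$color))
V(g1)$color <- match(V(g1)$color, all_cols)  # relabel colours as 1..n
V(g2)$color <- match(V(g2)$color, all_cols)  # using one shared mapping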
@bisounours_tronconneuse correctly points out that you could just consider every mapping from the colors of one graph to the colors of the other, using graph.isomorphic.vf2 to check if the relabeled graphs are isomorphic. While this is mathematically true, it is computationally challenging because it requires n! (n factorial) isomorphism checks for a pair of graphs with n colors. This is 3.6 million checks for graphs with 10 colors and roughly 2.4e18 checks for graphs with 20 colors, so clearly it could only be used in a setting with a very small number of colors.
We could potentially be much more efficient by considering one additional fact: a pair of graphs can only be isomorphic if their color frequency distributions exactly match. This means we only need to consider mappings between colors with the same frequency in the pair of graphs. In your question, there is only one possible mapping because in each input graph there is one color appearing once and one color appearing twice. Except in pathological cases where many colors have identical frequencies in your graph, this should lead to a much more efficient procedure for checking for isomorphism.
library(igraph)
iso.josilber <- function(g1, g2) {
  freq1 <- table(V(g1)$color)
  freq2 <- table(V(g2)$color)
  col2 <- as.character(V(g2)$color)
  if (length(freq1) != length(freq2)) {
    return(FALSE) # Different numbers of colors
  }
  relabels <- as.matrix(do.call(expand.grid, lapply(freq2, function(x) as.numeric(names(freq1[freq1 == x])))))
  relabels <- relabels[apply(relabels, 1, function(x) length(unique(x)) == length(x)),]
  print(paste("Number of reorderings to check:", nrow(relabels)))
  if (nrow(relabels) == 0) {
    return(FALSE) # No valid relabels based on frequency distribution
  }
  for (i in seq(nrow(relabels))) {
    V(g2)$color <- relabels[i,][col2]
    if (graph.isomorphic.vf2(g1, g2)$iso) {
      return(TRUE) # Found an isomorphic relabeling
    }
  }
  return(FALSE) # Checked all valid relabelings; none were isomorphic
}
iso.josilber(g1, g2) returns TRUE for both of the tiny graph pairs you posed in your question and your answer. To stress test the procedure, consider g1, a random graph with 100 nodes, 0.5 density, and 15 randomly selected colors, and g2, an identical graph with a randomly relabeled version of these colors (aka it is isomorphic).
set.seed(144)
g1 <- erdos.renyi.game(100, 0.5)
V(g1)$color <- sample(1:15, 100, replace=T)
g2 <- g1
V(g2)$color <- sample(1:15)[V(g1)$color]
system.time(print(iso.josilber(g1, g2)))
# [1] "Number of reorderings to check: 144"
# [1] TRUE
# user system elapsed
# 0.172 0.004 0.189
Note that the approach that exhaustively checks all color mappings would have needed to check 15! color mappings, or more than one trillion.
One word of warning --- though this procedure may be more efficient on many graph pairs than a more naive approach, it still has exponential worst-case runtime, meaning there are classes of graphs where it will still perform quite slowly.
To avoid colour permutations, Bertrand Jouve pointed me to this trick suggested in the nauty user guide (pages 58-59). The idea is to recolour all vertices to a single colour, and then give all vertices that used to share a colour an edge to a common auxiliary vertex. Then we can apply a classic vf2 for colored graphs.
My implementation:
library(igraph)
isocolor.setup <- function(g){
  # Transform a graph so that it can be used in colored isomorphism algorithms
  # Args:
  #   g: graph
  # Returns:
  #   Transformed graph
  nvertices <- vcount(g)
  colors <- unique(V(g)$color)
  g <- add.vertices(g, length(colors), color=max(colors)+1)
  for(i in 1:length(colors)){
    group <- V(g)[V(g)$color==colors[i]]
    aux.id <- nvertices + i
    g[from = group, to = rep(aux.id, length(group))] <- TRUE
  }
  V(g)[1:nvertices]$color <- 1
  V(g)[V(g)$color != 1]$color <- 2
  return(g)
}
Examples:
setup_palette <- function(g){
  palette(rainbow(max(2, length(unique(V(g)$color)))))
}
par(mfrow=c(3,2))
# First graph
g1 <- graph.ring(6)
V(g1)$color <- c(1,1,2,2,3,3)
setup_palette(g1)
plot(g1)
g1.mapped <- isocolor.setup(g1)
setup_palette(g1.mapped)
plot(g1.mapped)
# Second graph
g2 <- graph.ring(6)
V(g2)$color <- c(2,3,2,3,1,1)
setup_palette(g2)
plot(g2)
g2.mapped<- isocolor.setup(g2)
setup_palette(g2.mapped)
plot(g2.mapped)
title(paste("\ng1 iso g2?", graph.isomorphic.vf2(g1.mapped, g2.mapped)$iso))
# Third graph
g3 <- graph.ring(6)
V(g3)$color <- c(1,1,3,3,2,2)
setup_palette(g3)
plot(g3)
g3.mapped<- isocolor.setup(g3)
setup_palette(g3.mapped)
plot(g3.mapped)
title(paste("\ng1 iso g3?", graph.isomorphic.vf2(g1.mapped, g3.mapped)$iso))
Of course we should check, as a first filter, whether both graphs have the same colour frequencies, as explained by @josilber.
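A minimal sketch of that frequency filter (my addition, applied to the example ring graphs above):
same_colour_freq <- function(ga, gb) {
  # colour-isomorphic graphs must have identical colour frequency distributions
  identical(sort(as.integer(table(V(ga)$color))),
            sort(as.integer(table(V(gb)$color))))
}
same_colour_freq(g1, g2)  # TRUE: each has three colours appearing twice
same_colour_freq(g1, g3)  # TRUE here too, so vf2 is still needed to decide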
