Creating subgraphs with overlapping vertices - graph

I've been looking for packages that can create subgraphs with overlapping vertices.
From what I understand, NetworkX and METIS can partition a graph into two or more parts, but I couldn't find how to partition a graph into subgraphs with overlapping nodes.
Suggestions on libraries that support partitioning with overlapping vertices would be really helpful.
EDIT: I tried the angel algorithm in CDLIB to partition the original graph into subgraphs with 4 overlapping nodes.
import networkx as nx
from cdlib import algorithms

if __name__ == '__main__':
    g = nx.karate_club_graph()
    coms = algorithms.angel(g, threshold=4, min_community_size=10)
    print(coms.method_name)
    print(coms.method_parameters)  # Clustering parameters
    print(coms.communities)
    print(coms.overlap)
    print(coms.node_coverage)
Output:
ANGEL
{'threshold': 4, 'min_community_size': 10}
[[14, 15, 18, 20, 22, 23, 27, 29, 30, 31, 32, 8], [1, 12, 13, 17, 19, 2, 21, 3, 7, 8], [14, 15, 18, 2, 20, 22, 30, 31, 33, 8]]
True
0.6470588235294118
From the communities returned, I understand communities 1 and 3 have an overlap of 4 nodes, but 1 and 2 or 2 and 3 don't have an overlap size of 4 nodes.
It is not clear to me how the overlap size (4 overlapping nodes) should be specified in algorithms.angel(g, threshold=4, min_community_size=10). I tried setting threshold=4 to define an overlap size of 4 nodes. However, from the documentation available for angel:
:param threshold: merging threshold in [0,1].
I am not sure how to translate the 4 overlaps to a value within the bounds [0, 1]. Suggestions would be really helpful.

You can check out CDLIB:
They have a large number of community detection algorithms applicable to NetworkX graphs, including some overlapping-community algorithms.
On a side note:
The return type of these functions is called NodeClustering, which might be a little confusing at first, so here are the methods applicable to it; usually you simply want to convert the result to a Python dictionary.
Specifically about the angel algorithm in CDLIB:
According to the paper ANGEL: efficient, and effective, node-centric community discovery in static and dynamic networks, the threshold is not the overlap threshold; it is used as follows:
If the ratio is greater than (or equal to) a given threshold, the merge is applied and the node label updated.
Basically, this value determines whether to further merge nodes into bigger communities; it is not equivalent to the number of overlapping nodes.
Also, don't confuse these "labels" with node labels (as in nx.relabel_nodes(G, labels)). The "labels" referred to here come from the Label Propagation Algorithm that ANGEL uses internally.
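Incidentally, you can check the actual overlap sizes of the communities returned in the question directly with set intersections; a minimal sketch, reusing the coms.communities output printed above:

from itertools import combinations

communities = [
    [14, 15, 18, 20, 22, 23, 27, 29, 30, 31, 32, 8],
    [1, 12, 13, 17, 19, 2, 21, 3, 7, 8],
    [14, 15, 18, 2, 20, 22, 30, 31, 33, 8],
]
# Compare every pair of communities and count the shared nodes
for (i, a), (j, b) in combinations(enumerate(communities, start=1), 2):
    print(f"communities {i} and {j} share {len(set(a) & set(b))} nodes")

# node_coverage is the fraction of the graph's 34 nodes covered by any community
covered = set().union(*communities)
print(len(covered) / 34)  # 22 covered nodes -> 0.6470588235294118

For the output above this reports overlaps of 1, 8 and 2 nodes for the pairs (1,2), (1,3) and (2,3), which again shows that threshold does not directly control the overlap size.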
As for the effects of varying this threshold:
[...] Increasing the threshold, we obtain a higher number of communities since lower quality merges cannot take place.
[based on the comment by @J. M. Arnold]
From ANGEL's GitHub repository you can see that when threshold >= 1, only the min_comsize value is used:
self.threshold = threshold
if self.threshold < 1:
    self.min_community_size = max([3, min_comsize, int(1. / (1 - self.threshold))])
else:
    self.min_community_size = min_comsize
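As a quick illustration of that rule, here is how the effective community-size floor behaves for a few threshold values (a minimal sketch, with min_comsize fixed at 10 as in the question):

min_comsize = 10
for threshold in (0.5, 0.9, 0.99, 1.0):
    if threshold < 1:
        size = max([3, min_comsize, int(1. / (1 - threshold))])
    else:
        size = min_comsize
    # Thresholds close to (but below) 1 push the floor well above min_comsize
    print(threshold, size)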

Related

Julia: How to make the histogram have same number of bins for two vectors of equal size?

I want to calculate frequencies of occurrence in multiple vectors and want the resulting number of bins to be consistent across vectors, so it's easier to calculate the Wasserstein distance among them.
The following code shows that the fitted histograms end up with different numbers of bins:
using StatsBase
for i in 1:10
    h = fit(Histogram, randn(1000), nbins=10); println(size(h.weights))
end
How to make number of bins consistent?
One way to be completely consistent across runs is to supply not just the number of bins but their exact positions. With Julia's StatsBase you do that by supplying the "edges" (bin boundaries). Here's a demo where the bins run from i to i+1:
julia> fit(Histogram, randn(1000), -5:5)
Histogram{Int64, 1, Tuple{UnitRange{Int64}}}
edges:
-5:5
weights: [0, 2, 23, 139, 319, 355, 143, 18, 1, 0]
closed: left
isdensity: false
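The same trick carries over to other languages; for comparison, a minimal Python/numpy sketch (an illustration added here, not part of the original Julia answer), where fixed edges keep the counts aligned across vectors:

import numpy as np

edges = np.arange(-5, 6)  # bin edges -5, -4, ..., 5 -> always exactly 10 bins
for _ in range(3):
    counts, _ = np.histogram(np.random.randn(1000), bins=edges)
    print(counts.shape)  # (10,) every time, so the histograms are directly comparable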

How to plan the most efficient route for patio lights Part 2

This is a continuation of some questions I posed earlier, How to plan the most efficient route for patio lights and Christmas Light Route Efficiency (CS), about my attempt to cover a screened-in structure with patio lights as efficiently as possible.
Here are the rules:
Minimize light overlapping
Each string of lights is 234" long (this is important because I can't start a new branch of lights unless it's at the end of another branch).
Think of these as Christmas lights; you have a male and a female side:
start (male) end (female)
=[}~~~o~~~o~~~o~~~o~~~o~~~o~~~o~~~{=]
<- to outlet to other lights ->
So multiple strands can daisy-chain as long as there's a female for the male to plug into, like this:
A female plug must supply power to the next strand of lights via a male plug; a male plug can't give power to another male plug.
Here is a diagram of my structure:
Pink Circle = Place to hang lights (No, there is not a place to hang lights at the intersection of 10, 11 & 12 - that is not a mistake).
"Start" = The only available electrical outlet.
Yellow Dots = Parts of the structure I want to run the lights along.
Based on my previous questions, I began looking into "route efficiency problem" algorithms. I used this post, Solving Chinese Postman algorithm with eulerization, to get started, which led me to the code below (with thanks to @DamianoFantini for his help in my previous post setting the graph up correctly):
library(igraph)

gg <- graph_from_edgelist(cbind(c(1:4, 6, 8, 10, 12, 14, 16:19, 1, 6, 8, 21, 12, 14, 5, 7, 9, 11, 13, 15),
                                c(2:5, 7, 9, 11, 13, 15, 17:20, 6, 8, 10, 12, 14, 16, 7, 9, 11, 13, 15, 20)))
ll <- matrix(
  c( 0,0, 75.25,0, 150.5,0, 225.8125,0, 302.8125,0,
     0,-87, 302.8125,-87,
     0,-173.8125, 302.8125,-173.8125,
     0,-260.9375, 302.8125,-260.9375,
     16,-384.3125, 302.8125,-384.3125,
     16,-435.9575, 302.8125,-435.9375,
     16,-525.1875, 75.25,-525.1875, 150.5,-525.1875, 225.8125,-525.1875, 302.8175,-525.1875, 16,-260.9375),
  ncol = 2, byrow = TRUE)
# SOURCE: https://stackoverflow.com/q/40576910/1152809
make.eulerian <- function(graph){
  # Carl Hierholzer (1873) explained how Eulerian cycles exist for graphs that are
  # 1) connected, and 2) contain only vertices with even degrees. Based on this proof,
  # the possibility of an Eulerian cycle existing in a graph can be tested by checking
  # these two conditions.
  #
  # This function assumes a connected graph.
  # It adds edges to a graph to ensure that every node eventually has an even degree. It
  # tries to maintain the structure of the graph by primarily adding duplicates of already
  # existing edges, but can also add "structurally new" edges if the structure of the
  # graph does not allow otherwise.

  # Bookkeeping for the output
  info <- c("broken" = FALSE, "Added" = 0, "Successful" = TRUE)

  # Is a number even?
  is.even <- function(x){ x %% 2 == 0 }

  # Graphs with an even number of vertices of uneven degree will more easily converge
  # as Eulerian.
  # Should we even out the number of unevenly degreed vertices?
  search.for.even.neighbor <- !is.even(sum(!is.even(degree(graph))))

  # Loop to add edges, but never change nodes that have been set to have even degree
  for(i in V(graph)){
    set.j <- NULL
    # Neighbors of i with an uneven number of edges are good candidates for new edges
    uneven.neighbors <- !is.even(degree(graph, neighbors(graph, i)))
    if(!is.even(degree(graph, i))){
      # This node needs a new connection. That edge e(i,j) needs an appropriate j:
      if(sum(uneven.neighbors) == 0){
        # There is no neighbor of i that has uneven degree. We will
        # have to break the graph structure and connect nodes that
        # were not connected before:
        if(sum(!is.even(degree(graph))) > 0){
          # Only break the structure if it's absolutely necessary
          # to force the graph into a structure where an Eulerian
          # cycle exists:
          info["broken"] <- TRUE
          # Find candidates for j amongst any unevenly degreed nodes
          uneven.candidates <- !is.even(degree(graph, V(graph)))
          # Suggest a new edge between i and any node with uneven degree
          if(sum(uneven.candidates) != 0){
            set.j <- V(graph)[uneven.candidates][[1]]
          }else{
            # No candidate with uneven degree exists!
            # If all edges except the last have even degrees, this
            # function will fail to make the graph Eulerian:
            info["Successful"] <- FALSE
          }
        }
      }else{
        # A "structurally duplicated" edge may be formed between i and one of
        # the nodes of uneven degree that is already connected to it.
        # Suggest a new edge between i and its first neighbor with uneven degree
        set.j <- neighbors(graph, i)[uneven.neighbors][[1]]
      }
    }else if(search.for.even.neighbor == TRUE & is.null(set.j)){
      # This only happens once (probably) at the beginning of the loop when
      # treating graphs that have an uneven number of vertices with uneven
      # degree. It creates a duplicate edge between a node and one of its evenly
      # degreed neighbors (if possible)
      info["Added"] <- info["Added"] + 1
      set.j <- neighbors(graph, i)[ !uneven.neighbors ][[1]]
      # Never do this again if a j is correctly set
      if(!is.null(set.j)){search.for.even.neighbor <- FALSE}
    }

    # Add the new edge to alter degrees in the desired direction,
    # but only if a j was actually assigned
    if(!is.null(set.j)){
      # i may not link to itself
      if(i != set.j){
        graph <- add_edges(graph, edges = c(i, set.j))
        info["Added"] <- info["Added"] + 1
      }
    }
  }

  # Return the graph and the bookkeeping info
  (list("graph" = graph, "info" = info))
}
# Look at what we did
eulerian <- make.eulerian(gg)
g <- eulerian$graph
par(mfrow=c(1,2))
plot(gg)
plot(g)
Here's the result of the code:
Which, I think, translates to this (but I am a graph/algorithm noob, so correct me if I'm wrong):
Obviously, there are some issues here:
I have no idea where the end/beginning of each strand of lights should be (and neither does the algorithm, I think).
Node 1 is supplying power independently. This will not work in reality. All power must come from the "Start" position.
The distances and structure do not seem to be accounted for.
Is there a way to add these constraints into the algorithm? Is there another algorithm I could use that would make this easier?
https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
You can implement Dijkstra's algorithm using different edge data for different path metrics, such as light overlap or, for example, the total illuminance of the lights along each edge. I assume you might need a higher density of light in deep corners...
So the goal can be the widest area of low light, or the perceived visibility of obstacles, or a path that creates a homogeneous ambient light. Regardless of how it is tuned, though, I believe Dijkstra's algorithm is a pretty standard go-to for finding these things.
Update:
In the case of creating the widest covered area of light, you would want a spanning tree rather than an optimal-path algorithm. This might be more like what you have in mind:
https://en.wikipedia.org/wiki/Prim%27s_algorithm
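For a rough feel of both suggestions, here is a minimal Python/networkx sketch; the node names and edge lengths are hypothetical, not the structure from the question:

import networkx as nx

g = nx.Graph()
# Hypothetical edges: (node, node, length in inches)
g.add_weighted_edges_from([
    ("start", "a", 75.25), ("a", "b", 75.25),
    ("start", "c", 87.0), ("a", "c", 115.0), ("b", "c", 150.5),
])

# Dijkstra: cheapest path from the outlet to one corner
print(nx.dijkstra_path(g, "start", "b", weight="weight"))

# Prim: a minimum spanning tree touching every node
mst = nx.minimum_spanning_tree(g, weight="weight", algorithm="prim")
print(sorted(mst.edges(data="weight")))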

Plot kernel density estimation with the kernels over the individual observations in R

Well, to keep things short, what I want to achieve is a plot like the right one:
I would like to obtain a standard KDE plot with its individual kernels plotted over the observations.
The best solution would be one that supports all the different kernel functions (e.g. rectangular, triangular, etc.).
Well, after reading this answer I managed to come up with a solution.
# Create some input data
x <- c(19, 20, 10, 17, 16, 13, 16, 10, 7, 18)
# Calculate the KDE
kde <- density(x, kernel = "gaussian", bw = bw.SJ(x) * 0.2)
# Calculate the single kernels/PDFs making up the KDE, one per observation
A.kernel <- sapply(x, function(i) {
  density(i, kernel = "gaussian", bw = kde$bw)
}, simplify = FALSE)
# Rescale each kernel by 1/n so that together they integrate to the KDE's area
sapply(1:length(A.kernel), function(i) {
  A.kernel[[i]][["y"]] <<- A.kernel[[i]][["y"]] / length(x)
}, simplify = FALSE)
# Plot everything together, ensuring the right scale
plot(kde)
rug(x, col = 2, lwd = 2.5)
sapply(A.kernel, function(i) {
  lines(i, col = "red")
})
The result looks like this:
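For what it's worth, the same overlay can be sketched in Python with scipy/matplotlib (an illustration in a different language, not part of the original answer; gaussian_kde only supports the Gaussian kernel):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde, norm

x = np.array([19, 20, 10, 17, 16, 13, 16, 10, 7, 18])
kde = gaussian_kde(x)
bw = np.sqrt(kde.covariance[0, 0])  # bandwidth in data units

grid = np.linspace(x.min() - 5, x.max() + 5, 400)
plt.plot(grid, kde(grid))
plt.plot(x, np.zeros_like(x), "r|", markersize=10)  # rug of observations
for xi in x:
    # Each kernel is a normal PDF centered on an observation, scaled by 1/n,
    # so the red curves sum exactly to the blue KDE
    plt.plot(grid, norm.pdf(grid, xi, bw) / len(x), color="red", linewidth=0.8)
plt.show()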

Extracting information on terminal nodes in partykit::ctree with a large number of multivariate responses

I am using partykit::ctree to explore my dataset, which is a set of about 15,000 beach surveys investigating the number of pieces of debris found across 50 different categories. There are lots of zeros in the data and a large spread of total debris amounts. I also have a series of independent variables, including some factors, some count data, and some continuous data.
Here is a very small sample dataset:
Counts <- as.data.frame(matrix(rpois(100, 1), ncol = 5))
colnames(Counts) <- c("Glass", "HardPlastic", "SoftPlastic", "PlasticBag", "Fragments")
State <- rep(c("CA", "OR", "WA"), each = 6)
Counts$State <- c(State, "CA", "OR")
County <- rep((1:9), each = 2)
Counts$County <- c(County, 1, 4)
Counts$Distance <- c(10, 15, 13, 19, 18, 23, 38, 40, 49, 44, 47, 45, 52, 53, 55, 59, 51, 53, 14, 33)
Year <- rep(c("2010", "2011", "2012"), times = 7)
Counts$Year <- Year[1:20]
I have used the following code to partition my data:
M.2 <- ctree(Glass + HardPlastic + SoftPlastic + PlasticBag + Fragments ~
               as.factor(State) + as.factor(County) + Distance + as.factor(Year),
             data = Counts)
plot(M.2, terminal_panel = node_barplot, cex = 0.5)
This comes up with a lovely graph, but how do I extract the membership of each of the terminal nodes? I can see it in the graph if there are only a few items, but once the number of possible categories increases to 50, it becomes much harder to look at it graphically. I would like to see the information contained within the nodes; particularly the relative probabilities of each individual category being contained in each terminal node.
I know that if this were a BinaryTree class I could use the nodes argument, but when I query class(M.2) it tells me it is from the constparty class, and I haven't been able to find out how to get node information from this class.
I have also run into a secondary problem: when I run ctree on my sample data set, it crashes R every time! It works fine with my actual data set, but I can't figure out what is wrong with the sample set.
EDIT: The desired output would be something along the lines of:
Node15:
Hard Plastic 30
Glass 5
Soft Plastic 23
Plastic Bag 6
Fragments 12
I just e-mailed the package maintainer (Torsten Hothorn) and principal author of ctree(), to whom such requests would really best be directed. (He currently does not participate in SO.) Apparently, this is a bug in the partykit version of ctree() and he is working on resolving it. For the time being it is best to use the old party version for this, and hopefully a fixed partykit version will become available soon.

Can breadth first search traverse a disconnected graph?

I have an exam question:
Consider the undirected graph with vertices the numbers 3..16, and edges given by the following rule: two vertices a and b are linked by an edge whenever a is a factor of b, or b is a factor of a. List the vertices in BFS order starting from 14. (Many different orders are possible with a BFS, you may list any one of these).
I'm considering two answers:
Because the graph is not connected, from 14 BFS can only traverse to 7, so the result is 14, 7.
List out all of the disconnected vertices at the first level, then traverse to their child nodes,
so: 14, 16, 15, 13, 12, 11, 10, 9, 7, 8, 6, 5, 4, 3
Which one is correct?
Can BFS traverse to disconnected vertices?
Answer 2 doesn't make sense in my opinion, because it requires knowing in advance which nodes are disconnected, and that is exactly what you need the algorithm to find out. So I'd say answer 1 is correct.
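You can also confirm answer 1 mechanically; a minimal networkx sketch that builds the divisibility graph from the exam question and runs BFS from 14:

import networkx as nx

g = nx.Graph()
g.add_nodes_from(range(3, 17))
# Link a and b whenever one is a factor of the other
for a in range(3, 17):
    for b in range(a + 1, 17):
        if b % a == 0:
            g.add_edge(a, b)

order = [14] + [v for _, v in nx.bfs_edges(g, source=14)]
print(order)  # [14, 7]: BFS never leaves 14's connected component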
