I want a subset of all simple paths in a relatively large network (~1k nodes and 5k-50k edges). Currently, I generate all simple paths (igraph::all_simple_paths) followed by subsetting. However, all_simple_paths is memory- and time-intensive for larger networks, and since I don't need all paths, running it seems like a waste.
My current strategy is randomly downsample network to a reasonable number of edges --> all_simple_paths --> subset. Repeating this a few times and combining seems to give reasonable results. However this still spends time generating paths I don't use. What are better ways to do this?
For clarity, any algorithm that generates a representative subset of all simple paths is sufficient. I assume that "representative" means an unbiased random sample.
Edit: I see this answer addressing a similar problem, but it's many times slower than all_simple_paths-->subset for my networks.
Related
The problem
I have a set of locations on a plane (actually they are pins in a KML file) and I want to partition this graph into subgraphs. Connectivity is pretty good - as with all real world road networks - so I assume that if two locations are close they have some kind of connection. The resulting set of subgraphs should adhere to these constraints:
Every node has to be covered by a subgraph
Every node should be in exactly 1 subgraph
Every node within a subgraph should be close to each other (L2 norm distances)
Every subgraph should contain at least 5 locations
The amount of subgraphs should be minimal
Right now the amount of locations is no more than 100 so I thought about brute forcing through every possibility but this obviously won't scale well.
I thought about using some k-Nearest-Neighbors algorithm (e.g. using QuickGraph) but I can't get my head around where to start and how to extend/shrink the subgraphs on the way. Maybe it's possible to map this problem to another problem that can easily be solved with some numerical procedure (e.g. Simplex) ...
Maybe someone has experience in this kind of optimization problems and is willing to help me find a solution? I don't have access to Mathematica/Matlab or the like ... but sufficient .NET programming skills and hmm Excel :-)
Thanks a lot!
As soon as there are multiple criteria that need to be appeased in the best possible way simultanously, it is usually starting to get difficult.
A numerical solution could work as follows: You could define yourself a utility function, that maps partitionings of your locations to positive real values, describing how "good" a partition is by assigning it a "rating" (good could be high "bad" could be near zero).
Once you have such a function assigning partitions their according "values", you simply need to optimize it and then you hopefully obtain a good solution if you defined your utility function reasonably. Evolutionary algorithms are good at that task since your utility function is probably analytically too complex to solve due to its discrete nature.
The problem is then only how you assign "values" to partitions via this utility function. This is then your task. It can be done for example by weighing each criterion with a factor and summing the results up, or even more complex functions (least squares etc.). The factors you use in the definition of the utility function are tuning parameters and can be varied until the result seems to be good.
Some CA software wold help a lot for testing if you can get your hands on one, bit I guess to obtain a black box solver for your partitioning problem, you need to implement the complete procedure yourself using a language of your choice.
I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands seqtreeand seqdist, which works fine when I for example use a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.
I would like to use all the observations and I do have access to a supercomputer who should be able to do just that. However, this doesn't help much as the process runs on a single core only. Therefore my question, is it possible to apply parallel computing technics to the above mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you need to do it in C++. The loop is located in the source file "distancefunctions.cpp" and you need to look at the two loops located around line 300 in function "cstringdistance" (Sorry but all comments are in French). Unfortunately, the second important optimization is that the memory is shared between all computations. For this reason, I think that parallelization would be very complicated.
Apart from selecting a sample, you should consider the following optimizations:
aggregation of identical sequences (see here: Problem with big data (?) during computation of sequence distances using TraMineR )
If relevant, you can try to reduce the time granularity. Distance computation time is highly dependent on sequence length (O^2). See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence
Reducing time granularity may also increase the number of identical sequences, and hence, the impact of optimization one.
There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in testing phase (that's why it is hidden), but it should replace the actual algorithm in a future version. To use it, set method="OMopt", instead of method="OM". Depending on your sequences, it may reduce computation time.
I want to carry out Graph Clustering in a huge undirected graph with millions of edges and nodes. Graph is almost clustered with different clusters joined together only by some nodes(kind of ambiguous nodes which can relate to multiple clusters). There will be very few or almost no edges between two clusters. This problem is almost similar to finding vertex cut set of a graph, with one exception that graph needs to be partitioned into many components(their number being unknown).(Refer this picture https://docs.google.com/file/d/0B7_3zLD0XdtAd3ZwMFAwWDZuU00/edit?pli=1)
Its almost like different strongly connected components sharing a couple of nodes between them and i am supposed to remove those nodes to separate those strongly connected components. Edges are weigthed but this problem is more like finding structures in a graph, so edge weigths won't be of relevance. (Another way to think about the problem would be to visualize Solid Spheres touching each other at some points with Spheres being those strongly connected components and touching points being those ambiguous nodes)
I am prototyping something, so am quiet short of time to pick up Graph Clustering Algorithms by myself and to select the best possible. Plus i need a solution that would cut nodes and not edges since different clusters share nodes and not edges in my case.
Is there any research paper, blog that addresses this or somewhat related problem? Or can anyone come up with a solution to this problem howsoever dirty.
Since millions of nodes and edges are involved, i would need a MapReduce implementation of the solution. Any inputs, links for that too?
Is there any current open source implementation in MapReduce that can i directly use?
I think this problem is analogous to Finding Communities in online social networks by removing vertices.
Your problem is not so simple. I am afraid that it is related to the clique problem, which is NP complete, so unless you quantify somehow the statement "there are almost no edges between the clusters", your problem might be still very difficult. But what I would do in your shoes, would be to try one dirty, greedy approach, namely regarding the nodes as the following kind of quasi-neural net:
Each vertex I would consider to have inputs, outputs and a sigmoid activation function which convert the input value (sum of inputs) into the output value. The output value, and I consider this important, would not be cloned and sent to all the neighbors, but rather divided evenly between the neighbors. In addition to this, I would define a logarithmic decay of activity in a neuron (self-suppression, suppressive connection to itself), defined by a decay parameter global for the net.
Now, I would start simulation with all the neurons starting from activity 0.5 (activity range is 0 to 1) with very high decay parameter, which would lead to all the neuronst quickly stabilizing in 0 state. I would then gradually decrease the decay parameter until the steady state result would yield the first clique with non-zero stable activity.
The question is what to do next. One possibility is to subtract the found clique from the graph and run the same process again until we find all the cliques. This greedy approach might succeed if your graph is indeed as well behaved (really almost clustered) as you say, but might lead to unexpected results otherwise. Another possibility is to give the found clique a unique clique smell that would be repulsive (mutual suppresion) to other cliques an rerun the algorithm until the second clique is found, give it a different clique smell repulsive to all others etc., until each node has its own assigned smell.
I think this would be as many big ideas as i have about this.
The key is, that since it is probably not possible to solve this problem in the general case (likely NP complete), you need to take use of whatever special properties your graph has. That means you need to play with parameters for a while until the algorithm solves 99% of the cases that you encounter. I don't think that it is possible to give the numerically precise answer to your question without long experimentation with the actual datasets that you encounter.
Since millions of nodes and edges are involved, i would need a MapReduce implementation of the solution. Any inputs, links for that too?
In my experience I doubt if using Map/Reduce over here would be truly advantageous. First 10^6 order of nodes isn't really that large [that too in a non hyper-connected graph, since you are considering clustering], and the over head of using Map/Reduce [unless you already have setup your hardware/software for it] for your problem will not be worth it.
Map/Reduce will work much better, where once you have solved the clustering issue, and then want to process each cluster with similar analysis. Basically when you can break your task into relatively isolated sub-tasks, which can be performed in parallel. This of course can be cascaded to several layers.
In a relatively similar situation, I personally first modelled my graph into a graph database (I used Neo4J, and would recommend it highly) and then ran my analytic and queries on it. You will be surprised as to how white board friendly this solution is, and even massively joined and connected queries will be executed near instantaneously especially at the scale of only a few million nodes. For example, you can do a filtered analysis, based on degrees of separation, followed by listing of commons.
To give a bit of the context, I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for performance problem. Performance is measured in absolute runtime for a number of benchmarks, and usually for a number of configurations of a VM variating over used number of CPU cores, different benchmark parameters, etc. To get reliable results, each configuration is measure like 100 times. Thus, I end up with quite a number of measurements for all kind of different parameters where I am usually interested in the speedup for all of them, comparing the VM with and the VM without a certain optimization.
What I currently do is to pick one specific series of measurements. Lets say the measurements for a VM with and without optimization (VM-norm/VM-opt) running benchmark A, on 1 core.
Since I want to compare the results of the different benchmarks and number of cores, I can not use absolute runtime, but need to normalize it somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.
When I do that taking the measurements just in the order I got them, I obviously have quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So, I thought, ok, let's assume the variation in my measurements come from non-deterministic effects and the same effects cause variation in the same way for VM-opt and VM-norm. So, naively, it should be ok to sort the measurements before pairing them up. And, as expected, that reduces the variation of course.
However, my half-knowledge tells me that is not the best way and perhaps not even correct.
Since I am eventually interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested to use the cartesian product instead of pairing sorted measurements. That sounds like it would account better for the random nature of two arbitrary measurements paired up for comparison. But, I am still wondering what a statistician would suggest for such a problem.
In the end, I am really interested to plot the distribution of ratios with R as bean or violin plots. Simple boxplots, or just mean+stddev tell me too few about what is going on. These distributions usually point at artifacts that are produced by the complex interaction on these much to complex computers, and that's what I am interested in.
Any pointers to approaches of how to work with and how to produce such ratios in a correct way a very welcome.
PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly
I found it puzzling that you got such a minimal response on "Cross Validated". This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question, but if that is the case then the [R] forum is even worse, since we generally tackle problems where data is actually provided. We deal with the requests for implementation construction in our language. I agree that violin plots are preferred to boxplots for the examination of distributions (when there is sufficient data and I am not sure that 100 samples per group makes the grade in that instance), but in any case that means the "R answer" is that you just need to refer to the proper R help page:
library(lattice)
?xyplot
?panel.violin
Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".
One further graphical method: If you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, then you can examine them by plotting and then plotting iso-ratio lines by repeatedly using abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2d density methods if you can gather more data.
I'm playing arround with a Genetic Algorithm in which I want to evolve graphs.
Do you know a way to apply crossover and mutation when the chromosomes are graphs?
Or am I missing a coding for the graphs that let me apply "regular" crossover and mutation over bit strings?
thanks a lot!
Any help, even if it is not directly related to my problem, is appreciated!
Manuel
I like Sandor's suggestion of using Ken Stanley's NEAT algorithm.
NEAT was designed to evolve neural networks with arbitrary topologies, but those are just basically directed graphs. There were many ways to evolve neural networks before NEAT, but one of NEAT's most important contributions was that it provided a way to perform meaningful crossover between two networks that have different toplogies.
To accomplish this, NEAT uses historical markings attached to each gene to "line up" the genes of two genomes during crossover (a process biologists call synapsis). For example:
(source: natekohl.net)
(In this example, each gene is a box and represents a connection between two nodes. The number at the top of each gene is the historical marking for that gene.)
In summary: Lining up genes based on historical markings is a principled way to perform crossover between two networks without expensive topological analysis.
You might as well try Genetic Programming. A graph would be the closest thing to a tree and GP uses trees... if you still want to use GAs instead of GPs then take a look at how crossover is performed on a GP and that might give you an idea how to perform it on the graphs of your GA:
(source: geneticprogramming.com)
Here is how crossover for trees (and graphs) works:
You select 2 specimens for mating.
You pick a random node from one parent and swap it with a random node in the other parent.
The resulting trees are the offspring.
As others have mentioned, one common way to cross graphs (or trees) in a GA is to swap subgraphs (subtrees). For mutation, just randomly change some of the nodes (w/ small probability).
Alternatively, if you are representing a graph as an adjacency matrix, then you might swap/mutate elements in the matrices (kind of like using a two-dimensional bit string).
I'm not sure if using a bitstring is the best idea, I'd rather represent at least the weights with real values. Nevertheless bitstrings may also work.
If you have a fixed topology, then both crossover and mutation are quite easy (assuming you only evolve the weights of the network):
Crossover: take some weights from one parent, the rest from the other, can be very easily done if you represent the weights as an array or list. For more details or alternatives see http://en.wikipedia.org/wiki/Crossover_%28genetic_algorithm%29.
Mutation: simply select some of the weights and adjust them slightly.
Evolving some other stuff (e.g. activation function) is pretty similar to these.
If you also want to evolve the topology then things become much more interesting. There are quite some additional mutation possibilities, like adding a node (most likely connected to two already existing nodes), splitting a connection (instead of A->B have A->C->B), adding a connection, or the opposites of these.
But crossover will not be too easy (at least if the number of nodes is not fixed), because you will probably want to find "matching" nodes (where matching can be anything, but likely be related to a similar "role", or a similar place in the network). If you also want to do it I'd highly recommend studying already existing techniques. One that I know and like is called NEAT. You can find some info about it at
http://en.wikipedia.org/wiki/Neuroevolution_of_augmenting_topologies
http://nn.cs.utexas.edu/?neat
and http://www.cs.ucf.edu/~kstanley/neat.html
Well, I have never played with such an implementation, but eventually for crossover you could pick a branch of one of the graphs and swap it with a branch from another graph.
For mutation you could randomly change a node inside the graph, with small probability.