What is the recommended value for the pagerank algorithm parameter in the workflow of NebulaGraph Explorer? - nebula-graph

The PageRank algorithm calculates the relevance and importance of vertices based on their relationships.
I would like to know the use of parameters ITERATIONS and EPS and their recommended value.

IS_DIRECTED is whether to consider the direction of the edges. If set to false, the system automatically adds the reverse edge.
EPS is the convergence accuracy. When the difference between the result of two iterations is less than the EPS value, the iteration is not continued.
With regard to recommended values, they usually depend on your business scenario, and you will need to debug these values several times to fit your business scenario, such as maybe 10 iterations to meet the business requirements, or maybe 12 iterations.

Related

OpenMDAO: conditional statement depending on the number of iterations

During an optimisation using OpenMDAO, is there any way to access the number of iterations or the values of the design variables in previous iterations during optimisation?
I would like to create a conditional statement depending on the corresponding number of iterations.
I have created a continuous function representing discrete points linked by exponential functions. I would like to increase the exponent of the intermediate function with the number of iterations so that it penalises the intermediate values and the optimisation converges close to one of the discrete values.
Thank you in advance.
What you are describing sounds like a form of continuation/smoothing. I can suggest two different approaches:
Set a reasonable max-iteration limit on the optimizer and add an outside for-loop around the call to run_driver. You could even adapt the iteration limit after each stopping point is reached. Start with a very small iteration limit, and let it grow as you converge more.
Pros:
fairly simple to implement
uses existing OpenMDAO Driver APIs
Cons:
Limited ability to set your own stoping conditions (only really have iteration limit)
Restarting the optimization does not preserve the prior hessian approximation and may lead to poor convergence for quasi-newton method
Skip the OpenMDAO driver interface, and roll your own. This approach was suggested in the 2020 OpenMDAO Reverse Hackathon, for users who find the OpenMDAO Driver interface doesn't meet their need.
Pros:
A lot more flexibility
total control
Cons:
A lot more work

pop.size argument in GenMatch() respectively genoud()

I am using genetic matching in R using GenMatch in order to find comparable treatment and control groups to estimate a treatment effect. The default code for matching looks as follows:
GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1, weights=NULL,
pop.size = 100, max.generations=100,...)
The description for the pop.size argument in the package is:
Population Size. This is the number of individuals genoud uses to
solve the optimization problem. The theorems proving that genetic
algorithms find good solutions are asymptotic in population size.
Therefore, it is important that this value not be small. See genoud
for more details.
Looking at gnoud the additional description is:
...There are several restrictions on what the value of this number can
be. No matter what population size the user requests, the number is
automatically adjusted to make certain that the relevant restrictions
are satisfied. These restrictions originate in what is required by
several of the operators. In particular, operators 6 (Simple
Crossover) and 8 (Heuristic Crossover) require an even number of
individuals to work on—i.e., they require two parents. Therefore, the
pop.size variable and the operators sets must be such that these three
operators have an even number of individuals to work with. If this
does not occur, the population size is automatically increased until
this constraint is satisfied.
I want to know how gnoud (resp. GenMatch) incorporates the population size argument. Does the algorithm randomly select n individuals from the population for the optimization?
I had a look at the package description and the source code, but did not find a clear answer.
The word "individuals" here does not refer to individuals in the sample (i.e., individual units your dataset), but rather to virtual individuals that the genetic algorithm uses. These individuals are individual draws of a set of the variables to be optimized. They are unrelated to your sample.
The goal of genetic matching is to choose a set of scaling factors (which the Matching documentation calls weights), one for each covariate, that weight the importance of that covariate in a scaled Euclidean distance match. I'm no expert on the genetic algorithm, but my understanding of what it does is that it makes a bunch of guesses at the optimal values of these scaling factors, keeps the ones that "do the best" in the sense of optimizing the criterion (which is determined by fit.func in GenMatch()), and creates new guesses as slight perturbations of the kept guesses. It then repeats this process many times, simulating what natural selection does to optimize traits in living things. Each guess is what the word "individual" refers to in the description for pop.size, which corresponds to the number of guesses at each generation of the algorithm.
GenMatch() always uses your entire sample (unless you have provided a restriction like a caliper, exact matching requirement, or common support rule); it does not sample units from your sample to form each guess (which is what bagging in is other machine learning contexts).
Results will change over many runs because the genetic algorithm itself is a stochastic process. It may converge to a solution asymptotically, but because it is optimizing over a lumpy surface, it will find different solutions each time in finite samples with finite generations and a finite population size (i.e., pop.size).

Handling-Costraints in Genetic Algorithms: implementing the death penalty

I would like to compare the "death penalty method" with other penalty methods proposed in the Genetic Algorithms' literature.
I'm using the R software, so I need to write the codes of these penalty methods. I've finding lots of difficulties because I have not understood one thing about the death penalty function: how I have to handle the infeasible offsprings since the population size usually is fixed in genetic algorithms?
I mean, I understand that, in order to use appropriately the death penalty, I have to initialize the genetic algorithm with all feasible solutions. But even if I have all feasible solutions in the first population (t=0), I could have infeasible solutions in the next generation since the crossover and the mutations are "blind" operators.
So, since the death penalty rejects all the infeasible solutions, then what happen?
Will the next generation have a population side smaller (original dim size - number of infeasible solutions) or I have to select more parents to put in the mating pool for reproduction until the next generation is composed by "original dim size" feasible offsprings or I have to try again the genetic operators until all the individuals in t+1 are feasible?
I do not know R, but the theory of the death penalty implies that you should generate more offspring.
I would generate do the following pseudo-code (translate to R):
n=<desired_population_size>;
while (n>0) {
generate n offspring;
eliminate the non feasible ones
add the feasible ones to the new generation
n=<desired population size> - <current new generation population size>
}
The only problem with this loop is the risk that it may go on forever (if we never generate feasible solutions). Even though it is quite small, if you want to protect yourself from it, you can limit the number of iterations allowed in the while loop, using a simple counter.
There is a pretty interesting article on this by Michalewicz. Have a look.

What is the meaning of "Inf" in S_Dbw output in R commander?

I have ran clv package which consists of S_Dbw and SD validity indexes for clustering purposes in R commander. (http://cran.r-project.org/web/packages/clv/index.html)
I evaluated my clustering results from DBSCAN, K-Means, Kohonen algorithms with S_Dbw index. but for all these three algorithms S_Dbw is "Inf".
Is it "Infinite" meaning? Why did i confront with "Inf". Is there any problem in my clustering results?
In general, when is S_Dbw index result "Inf"?
Be careful when comparing different algorithms with such an index.
The reason is that the index is pretty much an algorithm in itself. One particular clustering will necessarily be the "best" for each index. The main difference between an index and an actual clustering algorithm is that the index doesn't tell you how to find the "best" solution.
Some examples: k-means minimizes the distances from cluster members to cluster centers. Single-link hierarchical clustering will find the partition with the optimal minimum distance between partitions. Well, DBSCAN will find the partitioning of the dataset, where all density-connected points are in the same partition. As such, DBSCAN is optimal - if you use the appropriate measure.
Seriously. Do not assume that because one algorithm scores higher than another in a particular measure means that the algorithm works better. All that you find out this way is that a particular algorithm is more (cor-)related to a particular measure. Think of it as a kind of correlation between the measure and the algorithm, on a conceptual level.
Using a measure for comparing different results of the same algorithm is different. Then obviously there shouldn't be a benefit from one algorithm over itself. There might still be a similar effect with respect to parameters. For example the in-cluster distances in k-means obviously should go down when you increase k.
In fact, many of the measures are not even well-defined on DBSCAN results. Because DBSCAN has the concept of noise points, which the indexes do not AFAIK.
Do not assume that the measure will either give you an indication of what is "true" or "correct". And even less, what is useful or new. Because you should be using cluster analysis not to find a mathematical optimum of a particular measure, but to learn something new and useful about your data. Which probably is not some measure number.
Back to the indices. They usually are totally designed around k-means. From a short look at S_Dbw I have the impression that the moment one "cluster" consists of a single object (e.g. a noise object in DBSCAN), the value will become infinity - aka: undefined. It seems as if the authors of that index did not consider this corner case, but only used it on toy data sets where such situations did not arise. The R implementation can't fix this, without diverting from the original index and instead turning it into yet another index. Handling noise objects and singletons is far from trivial. I have not yet seen an index that doesn't fail in one way or another - typically, a solution such as "all objects are noise" will either score perfect, or every clustering can trivially be improved by putting each noise object to the nearest non-singleton cluster. If you want your algorithm to be able to say "this object doesn't belong to any cluster" then I do not know any appropriate index.
The IEEE floating point standard defines Inf and -Inf as positive and negative infinity respectively. It means your result was too large to represent in the given number of bits.

What are the differences between community detection algorithms in igraph?

I have a list of about 100 igraph objects with a typical object having about 700 vertices and 3500 edges.
I would like to identify groups of vertices within which ties are more likely. My plan is to then use a mixed model to predict how many within-group ties vertices have using vertex and group attributes.
Some people may want to respond to other aspects of my project, which would be great, but the thing I'm most interested in is information about functions in igraph for grouping vertices. I've come across these community detection algorithms but I'm not sure of their advantages and disadvantages, or whether some other function would be better for my case. I saw the links here as well, but they aren't specific to igraph. Thanks for your advice.
Here is a short summary about the community detection algorithms currently implemented in igraph:
edge.betweenness.community is a hierarchical decomposition process where edges are removed in the decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths simply because in many cases they are the only option to go from one group to another. This method yields good results but is very slow because of the computational complexity of edge betweenness calculations and because the betweenness scores have to be re-calculated after every edge removal. Your graphs with ~700 vertices and ~3500 edges are around the upper size limit of graphs that are feasible to be analyzed with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut the dendrogram to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).
fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.
walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3-4-5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).
spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c to as high as 200, which is likely to be enough for your purposes. 2) There may be less than c communities in the end as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconencted) parts of the networks have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about that. The method is not particularly fast and not deterministic (because of the simulation itself), but has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).
leading.eigenvector.community is a top-down hierarchical approach that optimizes the modularity function again. In each step, the graph is split into two parts in a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups to be split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.
label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively and re-assigns labels to nodes in a way that each node takes the most frequent label of its neighbors in a synchronous manner. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results based on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which could be tedious.
igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.
Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate other methods when it turns out that these two are not suitable for a particular problem for some reason.
A summary of the different community detection algorithms can be found here: http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Notably, the InfoMAP algorithm is a recent newcomer that could be useful (it supports directed graphs too).

Resources