Graph Clustering - graph

I've been searching for review papers on graph clustering methods, but none of them satisfied me.
Please tell me which method you think is best for graph clustering. Sorry if my question is very general.
Thanks

With such an open question, the best I can do is recommend that you try WEKA.
It has a nice set of user interfaces to let you import your dataset and then try and compare various classification and clustering algorithms on your data, without writing even one line of code.
Once you have identified an algorithm that works for your problem, you can search for a nice and fast implementation in the programming language of your choice.
EDIT: since you mentioned the graph tag, you should have a look at the Markov Cluster Algorithm (MCL). Otherwise, you will have a hard time representing your graph data in a format suitable for the distance-based clustering algorithms in WEKA.

Related

igraph Components: Which Algorithm (citation)?

I'm using igraph in academic research and I need to provide a proper citation for the algorithm used in the components() command. This algorithm returns the connected components of the graph. The command in question is documented here. It's part of the R/CRAN igraph library.
I think the algorithm used is the one below, which seems to be the canonical workhorse algorithm cited on the Wikipedia page for connected components.
Hopcroft, J.; Tarjan, R. (1973), "Algorithm 447: efficient algorithms for graph manipulation", Communications of the ACM, 16 (6): 372–378, doi:10.1145/362248.362272
Does anyone know what algorithm is used?
Note that igraph in R is actually written in C/C++. If you want to dig into the details of how components is implemented, you should trace it back to its C or C++ source code.
Here is a link to the source code for components:
https://github.com/igraph/igraph/blob/f9b6ace881c3c0ba46956f6665043e43b95fa196/src/components.c
However, it seems the algorithm applied is not mentioned in the source code. I guess you can reach the author by email and ask for help.
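For orientation while reading that source file, here is a minimal sketch of the standard linear-time way to label connected components, using breadth-first search. It is a generic illustration in Python, not a claim about what igraph actually implements; the function name `connected_components` and the edge-list input format are assumptions made for this example.

```python
from collections import deque


def connected_components(n, edges):
    """Label the connected components of an undirected graph with BFS.

    n is the number of vertices (numbered 0..n-1) and edges is a list of
    (u, v) pairs. Returns (membership, count), where membership[v] is the
    component index of vertex v.
    """
    # Build adjacency lists; each undirected edge is stored in both directions.
    adjacency = [[] for _ in range(n)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    membership = [-1] * n  # -1 means "not visited yet"
    count = 0
    for start in range(n):
        if membership[start] != -1:
            continue
        # Start a new component and flood it with a breadth-first search.
        membership[start] = count
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if membership[v] == -1:
                    membership[v] = count
                    queue.append(v)
        count += 1
    return membership, count


membership, count = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(membership, count)
```

Every vertex and edge is visited a constant number of times, so the running time is linear in the size of the graph whether the traversal is breadth-first or depth-first.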

Write igraph clustering to file

I am currently testing various community detection algorithms in the igraph package to compare against my implementation.
I am able to run the algorithms on different graphs but I was wondering if there was a way for me to write the clustering to a file, where all nodes in one community are written to one line and so on. I am able to obtain the membership of each node using membership(communities_object) and write that to a file using dput() but I don't know how to write it the way I want.
This is the first time I am working with R as well. I apologize if this has been asked before.
This has little to do with igraph; the clustering is given by a simple numeric vector. See ?write.
write(membership(communities_object), file="myfile", ncolumns=1)
write(communities_object$membership, file="myfile", ncolumns=1) also works.
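The calls above write one membership value per line. If the goal is the layout asked for in the question (all nodes of one community on one line), the grouping step can be sketched as follows. This is a Python illustration that assumes the membership vector has already been exported; the function name `write_communities` and the 1-based node numbering (chosen to mirror R's vertex ids) are assumptions for this example.

```python
import io
from collections import defaultdict


def write_communities(membership, out):
    """Write one line per community, listing the nodes that belong to it.

    membership[i] is the community id of node i + 1 (1-based node ids are
    an assumption here). out is any writable text stream.
    """
    groups = defaultdict(list)
    for node, community in enumerate(membership, start=1):
        groups[community].append(node)
    # Emit communities in increasing id order, nodes separated by spaces.
    for community in sorted(groups):
        out.write(" ".join(str(node) for node in groups[community]) + "\n")


buffer = io.StringIO()
write_communities([1, 2, 1, 3, 2], buffer)
print(buffer.getvalue(), end="")
```

To write to disk, pass an open file object instead of the `io.StringIO` buffer.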

Community detection with InfoMap algorithm producing one massive module

I am using the InfoMap algorithm in the igraph package to perform community detection on a directed and non-weighted graph (34943 vertices, 206366 edges). In the graph, vertices represent websites and edges represent the existence of a hyperlink between websites.
A problem I have encountered after running the algorithm is that the majority of vertices have a membership in a single massive community (32920 or 94%). The rest of the vertices are dispersed into hundreds of other tiny communities.
I have tried different settings with the nb.trials parameter (i.e. 50, 100, and now running 500). However, this doesn't seem to change the result much.
I am feeling rather exasperated because the run-time on the algorithm is quite high, so I have to wait each time for the results (with no luck yet!!).
Many thanks.
Thanks for all the excellent comments. In the end, I got it working by downloading and running the source code for Infomap, which is available at: http://www.mapequation.org/code.html.
Due to licence issues, the latest code has not been integrated with igraph.
This solved the problem of too many nodes being 'lumped' into a single massive community.
Specifically, I used the following options from the command line: -N 10 --directed --two-level --map
Kudos to Martin Rosvall from the Infomap project for providing me with detailed help to resolve this problem.
For the interested reader, here is more information about this issue:
When a network collapses into one major cluster, it is most often because of a very dense and random link structure ... In the code for directed networks implemented in iGraph, teleportation is encoded. If many nodes have no outlinks, the effect of teleportation can be significant because it randomly connect nodes. We have made new code available here: http://www.mapequation.org/code.html that can cluster network without encoding the random teleportation necessary to make the dynamics ergodic. For details, see this paper: http://pre.aps.org/abstract/PRE/v85/i5/e056107
I was going to put this in a comment, but it ended up being too long and hard to read in that format, so this is a tangentially related answer.
One thing you should do is assess whether the algorithm is doing a good job at finding community structure. You can try to visualise your network to establish:
Is the algorithm returning community structures that make sense? Maybe there is one massive community?
If not, does the visualisation provide insight as to why?
This will help inform your next steps. Maybe the structure of the network requires a different algorithm?
One thing I find useful for large networks is plotting your edges as a heatmap. This is simple to do if you have your edges stored in an adjacency matrix.
For this, you can use the image function, passing in your matrix of edges as the argument z. Hopefully this will allow you to see by eye the community structure.
However, you also want to assess the correctness of your algorithm, so sort the nodes (the rows and columns of your adjacency matrix) by the community they have been assigned to.
Another thing to note is that if your edges are directed, it may be more difficult to assess by eye, as edges can appear on either side of the diagonal of the heatmap. One thing you can do instead is plot the underlying graph, that is, the adjacency matrix assuming your edges are undirected.
If your algorithm is doing a good job, you would expect to see square blocks along the diagonal, one for each detected community.
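The reordering described above can be sketched as follows, using plain nested lists so the example stays self-contained. The function names `symmetrise` and `sort_by_community` are hypothetical, and the tiny four-node graph is made up for illustration; in practice the reordered matrix would be handed to a heatmap function such as R's image.

```python
def symmetrise(adjacency):
    """Return the adjacency matrix of the underlying undirected graph."""
    n = len(adjacency)
    return [
        [1 if adjacency[i][j] or adjacency[j][i] else 0 for j in range(n)]
        for i in range(n)
    ]


def sort_by_community(adjacency, membership):
    """Reorder rows and columns so nodes of the same community are adjacent."""
    # sorted() is stable, so nodes keep their relative order within a community.
    order = sorted(range(len(membership)), key=lambda i: membership[i])
    reordered = [[adjacency[i][j] for j in order] for i in order]
    return order, reordered


# Directed toy graph: 0 -> 2 and 3 -> 1, with communities {0, 2} and {1, 3}.
adjacency = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
membership = [0, 1, 0, 1]

order, blocks = sort_by_community(symmetrise(adjacency), membership)
print(order)
for row in blocks:
    print(row)
```

After sorting, the non-zero entries of this toy matrix sit in two 2x2 blocks on the diagonal, which is the pattern to look for in the heatmap.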

Rules-of-thumb doc for mathematical programming in R?

Does there exist a simple, cheatsheet-like document which compiles the best practices for mathematical computing in R? Does anyone have a short list of their best-practices? E.g., it would include items like:
For large numerical vectors x, instead of computing x^2, one should compute x*x. This speeds up calculations.
To solve a system $Ax = b$, never compute $A^{-1}$ and left-multiply $b$. Cheaper and more numerically stable approaches exist (e.g., Gaussian elimination).
I did find a nice numerical analysis cheatsheet here. But I'm looking for something quicker, dirtier, and more specific to R.
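To make the second rule of thumb concrete, here is a minimal sketch of solving $Ax = b$ directly by Gaussian elimination with partial pivoting, without ever forming $A^{-1}$. It is written in plain Python for illustration; the function name `solve_linear_system` is an assumption for this example. In R the same idea corresponds to calling solve(A, b) rather than solve(A) %*% b.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a square matrix given as a list of rows, b is the right-hand side.
    The inputs are not modified.
    """
    n = len(a)
    # Work on an augmented copy [A | b] in floating point.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    # Forward elimination: reduce the matrix to upper triangular form.
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


solution = solve_linear_system([[2, 1], [1, 3]], [3, 5])
print(solution)
```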
Dirk Eddelbuettel has posted a bunch of stuff on "high performance computing with R". He's also a regular, so he will probably come along and grab some well-deserved reputation points. While you are waiting, you can read some of his stuff here:
http://dirk.eddelbuettel.com/papers/ismNov2009introHPCwithR.pdf
There is an archive of the r-devel mailing list where discussions about numerical analysis issues relating to R performance occur. I will often put its URL in the Google advanced search page domain slot when I want to see what might have been said in the past: https://stat.ethz.ch/pipermail/r-devel/

Which particular software development tasks have you used math for? And which branch of math did you use?

I'm not looking for a general discussion on if math is important or not for programming.
Instead I'm looking for real world scenarios where you have actually used some branch of math to solve some particular problem during your career as a software developer.
In particular, I'm looking for concrete examples.
I frequently find myself using De Morgan's theorem, as well as general Boolean algebra, when trying to simplify conditionals.
I've also occasionally written out truth tables to verify changes, as in the example below (found during a recent code review):
(showAll and s.ShowToUser are both of type bool.)
// Before
(showAll ? (s.ShowToUser || s.ShowToUser == false) : s.ShowToUser)
// After!
showAll || s.ShowToUser
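The truth-table check for this simplification can also be automated. The sketch below is a Python transcription of the two expressions (the function names `before` and `after` are made up for this example); it enumerates all four input combinations and asserts that both versions agree.

```python
from itertools import product


def before(show_all, show_to_user):
    # Transcription of: showAll ? (s.ShowToUser || s.ShowToUser == false) : s.ShowToUser
    return (show_to_user or show_to_user == False) if show_all else show_to_user


def after(show_all, show_to_user):
    # Transcription of: showAll || s.ShowToUser
    return show_all or show_to_user


# Exhaustive truth table over both boolean inputs.
for show_all, show_to_user in product([False, True], repeat=2):
    assert before(show_all, show_to_user) == after(show_all, show_to_user)
    print(show_all, show_to_user, after(show_all, show_to_user))
```

With only two boolean inputs there are four rows, so an exhaustive check is a complete proof that the refactoring preserves behaviour.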
I also used some basic right-angle trigonometry a few years ago when working on some simple graphics - I had to rotate and centre a text string along a line that could be at any angle.
Not revolutionary...but certainly maths.
Linear algebra for 3D rendering and also for financial tools.
Regression analysis for the same financial tools, like correlations between financial instruments and indices, and such.
Statistics: I had to write several methods to compute statistical values, such as the F probability distribution and the Pearson product-moment correlation coefficient, along with some linear algebra correlations, interpolations and extrapolations, to implement arbitrage pricing theory for asset and stock pricing.
Discrete math for everything, linear algebra for 3D, analysis for physics especially for calculating mass properties.
[Linear algebra for everything]
Projective geometry for camera calibration
Identification of time series / statistical filtering for sound & image processing
(I guess) basic mechanics and hence calculus for game programming
Computing sizes of caches to optimize performance. Not as simple as it sounds when this is your critical path, and you have to go back and work out the times saved by using the cache relative to its size.
I'm in medical imaging, and I use mostly linear algebra and basic geometry for anything related to 3D display, anatomical measurements, etc...
I also use numerical analysis for handling real-world noisy data, and a good deal of statistics to prove algorithms, design support tools for clinical trials, etc...
Games with trigonometry and AI with graph theory in my case.
Graph theory to create a weighted graph to represent all possible paths between two points and then find the shortest or most efficient path.
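As an illustration of the kind of computation described here, the following is a minimal sketch of Dijkstra's algorithm on a weighted graph. It is not the answerer's code: the dictionary-of-dictionaries graph format, the function name `shortest_path`, and the toy graph are assumptions for this example, and edge weights are assumed to be non-negative.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; graph maps node -> {neighbour: weight}.

    Returns (distance, path). If target is unreachable, returns (inf, []).
    Edge weights must be non-negative.
    """
    infinity = float("inf")
    distance = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > distance.get(u, infinity):
            continue  # stale heap entry, a shorter route was already found
        for v, weight in graph.get(u, {}).items():
            candidate = d + weight
            if candidate < distance.get(v, infinity):
                distance[v] = candidate
                previous[v] = u
                heapq.heappush(heap, (candidate, v))

    if target not in distance:
        return infinity, []
    # Walk the predecessor links backwards to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return distance[target], path


graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
print(shortest_path(graph, "A", "D"))
```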
Also statistics for plotting graphs and risk calculations. I used both normal distribution and cumulative normal distribution calculations. These are pretty commonly used functions in Excel, I would guess, but I actually had to write them myself since there is no built-in support in the .NET libraries. Sadly, the built-in Math support in .NET seems pretty basic.
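For readers who need the same two functions, here is a minimal sketch of the normal density and cumulative distribution written from their definitions, with the CDF expressed through the error function. It is a Python illustration rather than the answerer's .NET code; the function names `normal_pdf` and `normal_cdf` are assumptions for this example.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of the normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative normal distribution, via the error function erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


print(normal_pdf(0.0))
print(normal_cdf(0.0))
print(normal_cdf(1.96))
```

On platforms without an erf function, the CDF has to be approximated numerically instead.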
I've used trigonometry the most and also a small amount a calculus, working on overlays for GIS (mapping) software, comparing objects in 3D space, and converting between coordinate systems.
A general mathematical understanding is very useful if you're using 3rd party libraries to do calculations for you, as you often need to appreciate their limitations.
I often use math and programming together, but the goal of my work IS the math, so I use software to achieve that.
As for the math I use: mostly calculus (FFTs, analysing continuous and discrete signals) with a dash of linear algebra (CORDIC) to do trig on an MCU with no floating-point chip.
I used analytic geometry for a simple 3D engine in OpenGL, in a hobby project in high school.
I also used some geometry computations for dynamically printed reports, where the layout was rotated by 90°.
A year ago I used some derivatives and integrals for store analysis (product item movement in a store).
But all of these computations can be found on the internet or in a high-school textbook.
Statistics - mean and standard deviation, for our analysts.
Linear algebra - particularly Gauss-Jordan elimination and
Calculus - derivatives in the form of difference tables for generating polynomials from a table of (x, f(x))
Linear algebra and complex analysis in electronic engineering.
Statistics in analysing data and translating it into other units (different project).
I used probability and log odds (log of the ratio of two probabilities) to classify incoming emails into multiple categories. Most of the heavy lifting was done by my colleague Fidelis Assis.
Real world scenarios: better rostering of staff, more efficient scheduling of flights, shortest paths in road networks, optimal facility/resource locations.
Branch of maths: Operations Research. Vague definition: construct a mathematical model of a (normally complex) real world business problem, and then use mathematical tools (e.g. optimisation, statistics/probability, queuing theory, graph theory) to interrogate this model to aid in the making of effective decisions (e.g. minimise cost, maximise efficiency, predict outcomes etc).
Statistics for scientific data analyses such as:
calculation of distributions, z-standardisation
Fisher's Z
Reliability (Alpha, Kappa, Cohen)
Discriminant analyses
scale aggregation, pooling, etc.
In actual software development I've only really used quite trivial linear algebra, geometry and trigonometry. Certainly nothing more advanced than the first college course in each subject.
I have however written lots of programs to solve really quite hard math problems, using some very advanced math. But I wouldn't call any of that software development since I wasn't actually developing software. By that I mean that the end result wasn't the program itself, it was an answer. Basically someone would ask me what is essentially a math question and I'd write a program that answered that question. Sure I’d keep the code around for when I get asked the question again, and sometimes I’d send the code to someone so that they could answer the question themselves, but that still doesn’t count as software development in my mind. Occasionally someone would take that code and re-implement it in an application, but then they're the ones doing the software development and I'm the one doing the math.
(Hopefully this new job I’ve started will actually let me do both, so we’ll see how that works out)

Resources