Graph: extract faces [duplicate]

This question already has answers here:
How to find closed loops in graph networks
Closed 10 years ago.
I have a question regarding graphs. I need to extract all the faces of a graph (imagine a street network where I have to extract all the "blocks"). If you think of a typical checkerboard pattern (e.g. Manhattan), most faces have 4 edges and 4 nodes, but the whole thing should work for other cases too (where a face has more than 4 edges, for instance).
How can I do that? I have tried various approaches and searched Google, but I did not find a satisfying answer.
Thanks!!

You might be looking for all cycles of length n. Modulo certain conditions, the set of all such cycles will correspond to the "faces" you seek.
If you use this approach, it will matter whether your graph is directed or not.
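For a planar, undirected street grid, a quick way to prototype the idea is to enumerate short cycles and treat them as candidate faces. A rough sketch in R with igraph (the 3x3 lattice and the fixed cycle length of 4 are placeholder assumptions; a real network would need cycles of varying length, and you would still have to check that each cycle actually bounds a face):

    # Sketch: find all 4-cycles in a grid-like graph via subgraph isomorphism
    library(igraph)

    g <- make_lattice(c(3, 3))             # toy 3x3 "street grid": 4 blocks/faces

    # Every 4-cycle is reported once per automorphism of C4 (8 times),
    # so deduplicate by the sorted vertex set.
    maps   <- subgraph_isomorphisms(make_ring(4), g)
    blocks <- unique(lapply(maps, function(m) sort(as.integer(m))))
    length(blocks)                         # 4 for the 3x3 lattice

For irregular networks, a proper planar embedding plus a face-walking traversal is more robust than a fixed-length cycle search.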


Amino acid frequencies histogram with letters [duplicate]

This question already has answers here:
Plotting a "sequence logo" using ggplot2?
(6 answers)
Closed 5 years ago.
I'm trying to get a graphical view of the amino acid composition and frequencies in a peptide library.
I know how to create a basic histogram with R, but I often see this kind of plot in publications.
Can I achieve something similar with R?
That type of figure is normally produced using WebLogo or similar software.
There's an R wrapper for WebLogo; it's a few years old, so I don't know how well it works.
There is also ggseqlogo (more recent, looks pretty good) and, on Bioconductor, seqLogo; I don't know whether the latter handles peptides.
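As a minimal sketch of the ggseqlogo route (the peptide strings below are made up; they just need to be aligned, i.e. of equal length):

    # Sketch: amino-acid sequence logo with ggseqlogo
    library(ggseqlogo)

    peps <- c("ACDEFG", "ACDQFG", "ACDKFG", "TCDEFG")   # toy aligned peptides
    ggseqlogo(peps, seq_type = "aa", method = "prob")   # method = "bits" for information content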

Generate a random network as a reference for small-worldness comparison with Gephi [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I am trying to determine whether my directed graph G is a "small world". The graph is created from my dataset, which consists of 500 nodes, but only 60 nodes have edges (a total of 150 edges). I believe that to do so I need to compare the clustering coefficient and the average path length with those of a random graph R with the same number of nodes and edges.
Q1: Gephi has an embedded "generate Random Graph" capability - what is the algorithm it uses?
Q2: should I just generate a graph R with 60 nodes and 150 edges, or with 500 nodes and 150 edges?
Q3: I found small differences between definitions of the small-worldness test. The one I am using is taken from Humphries, M. D. and Gurney, K. (2008), "Network 'small-world-ness': a quantitative method for determining canonical network equivalence," PLoS ONE 3(4): e0002051: "The network G is said to be a small-world network if Lg ≥ Lr and Cg ≫ Cr" (L is the average path length and C is the global clustering coefficient). Any insights on this definition?
Thanks in advance for any help!
This paper gives better insight into empirically testing whether a graph is a small world. Indeed, you have to create a random reference graph to test yours against.
Q1: As one can see from the code, the Gnp and Gnm generators are based on Batagelj and Brandes, "Efficient Generation of Large Random Networks," Sunbelt Conf. on Social Networks, 2004 (available here).
Q2: The way the clustering coefficient is defined takes into account the number of triangles and connected triples. This notion automatically discards all isolated nodes (unless you compute the average clustering coefficient by averaging over the number of nodes, which is not what you want here).
The bottom line is that it shouldn't make any difference which graph you choose in your case.
Q3: Check the paper mentioned at the start of my answer for a more intuitive definition.
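For concreteness, a rough sketch of that comparison in R with igraph, computing the Humphries & Gurney index S = (Cg/Cr)/(Lg/Lr) against Erdos-Renyi G(n, m) references with the same node and edge counts (sample_gnm() below is only a stand-in for your real graph):

    # Sketch: small-world-ness index against random G(n, m) references
    library(igraph)

    g  <- sample_gnm(60, 150, directed = TRUE)    # placeholder for your observed graph
    Cg <- transitivity(g, type = "global")        # global clustering coefficient
    Lg <- mean_distance(g)                        # average shortest path length

    ref <- replicate(100, {                       # average over 100 random references
      r <- sample_gnm(vcount(g), ecount(g), directed = TRUE)
      c(transitivity(r, type = "global"), mean_distance(r))
    })
    Cr <- mean(ref[1, ]); Lr <- mean(ref[2, ])

    S <- (Cg / Cr) / (Lg / Lr)                    # S > 1 suggests small-world structure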

Best data structure to store coordinates for efficient look-up?

There seems to be a similar question here, but I was satisfied with neither the clarity of the answer nor its practicality. I was asked in a recent interview what data structure I would use to store a large set of floating-point numbers so that I could look up a new arrival, either exactly or by its closest neighbor. I said I would use a binary search tree and try to keep it balanced to achieve O(log n) look-ups.
Then the question was extended to two dimensions: what data structure would I use to store a large set of (x, y) pairs, such as geographical coordinates, for fast look-up? I couldn't think of a satisfactory answer and gave up entirely when it was extended to k dimensions. Using a k-dimensional tree directly, using coordinate values to "split" the space, doesn't seem to work, since two close points near the origin but in different quadrants may end up in far-away leaves.
After the interview, I remembered that Voronoi diagrams partition k-dimensional space nicely. What data structure is best for implementing this, and how would the look-ups be performed? I feel this problem is so common in computer science that by now there must be a dedicated data structure for it.
You can use a grid and sort the points into grid cells. There is a similar question here: Distance Calculation for massive number of devices/nodes. You can also use a space-filling curve (quadkey) or a quadtree, e.g. an R-tree, when you need additional information such as a hierarchy.
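A toy grid-bucketing sketch in base R (the cell size and the 3x3 neighborhood search are assumptions; if the true nearest neighbor can lie more than one cell away, you have to widen the search ring):

    # Sketch: bucket 2-D points into grid cells, then search only nearby cells
    set.seed(1)
    pts  <- matrix(runif(2000), ncol = 2)        # 1000 random points in [0,1]^2
    cell <- 0.05                                 # cell size, tune to point density
    key  <- function(p) paste(floor(p[, 1] / cell), floor(p[, 2] / cell))

    buckets <- split(seq_len(nrow(pts)), key(pts))

    nearest <- function(q) {
      cx <- floor(q[1] / cell); cy <- floor(q[2] / cell)
      keys <- as.vector(outer(cx + -1:1, cy + -1:1, paste))   # 3x3 block of cells
      cand <- unlist(buckets[keys], use.names = FALSE)
      if (length(cand) == 0) cand <- seq_len(nrow(pts))       # fall back to full scan
      d <- sqrt(rowSums((pts[cand, , drop = FALSE] -
                         rep(q, each = length(cand)))^2))
      cand[which.min(d)]                                      # index of nearest candidate
    }

    nearest(c(0.5, 0.5))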

Unsupervised Learning in R? Classify Matrices - what is the right package? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
Recently I watched a lot of Stanford's hilarious Open Classroom video lectures. The part about unsupervised machine learning particularly caught my attention. Unfortunately, it stops where it might get even more interesting.
Basically, I am looking to classify discrete matrices with an unsupervised algorithm. The matrices just contain discrete values of the same range. Let's say I have thousands of 20x15 matrices with values ranging from 1 to 3. I have just started to read through the literature, and I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there.
I also looked at the Machine Learning and Cluster CRAN Task Views but do not know where to start with a practical example.
So my question is: which package / algorithm would be a good pick to start playing around and working on the problem in R?
EDIT:
I realized that I might have been too imprecise: my matrices contain discrete choice data, so mean-based clustering might(!) not be the right idea. I do understand what you said about vectors and observations, but I am hoping for a function that accepts matrices or data.frames, because I have several observations over time.
EDIT2:
I realize that a package, function, or introduction that focuses on unsupervised classification of categorical data is what would help me the most right now.
... classify discrete matrices by an unsupervised algorithm
You must mean cluster them. Classification is commonly done by supervised algorithms.
I feel that image classification is way more complex (color histograms) and that my case is rather a simplification of what is done there
Without knowing what your matrices represent, it's hard to tell what kind of algorithm you need. But a starting point might be to flatten your 20x15 matrices into length-300 vectors; each element of such a vector would then be a feature (or variable) to base the clustering on. This is the way most ML packages, including the cluster package you link to, work: "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable."
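A rough sketch of that flattening approach with the cluster package, treating every cell as categorical and clustering on Gower dissimilarities (the list name mats, the toy data, and k = 4 are all assumptions):

    # Sketch: flatten discrete matrices, then cluster with daisy() + pam()
    library(cluster)

    mats <- replicate(1000, matrix(sample(1:3, 20 * 15, replace = TRUE), 20, 15),
                      simplify = FALSE)                     # toy stand-in data

    flat   <- as.data.frame(do.call(rbind, lapply(mats, as.vector)))
    flat[] <- lapply(flat, factor)                          # discrete, not numeric

    d   <- daisy(flat, metric = "gower")                    # handles categorical columns
    fit <- pam(d, k = 4)                                    # pick k via e.g. silhouette width
    table(fit$clustering)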
So far I have found daisy from the cluster package, specifically the argument "gower", which refers to Gower's similarity coefficient for handling mixed types of data. Gower seems to be a fairly old distance metric, but it's what I found for use with categorical data.
You might want to start from here : http://cran.r-project.org/web/views/MachineLearning.html

What do Planning Poker numbers represent? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 6 years ago.
The numbers used to vote when planning are 0, 0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100.
Is there a meaning behind why those numbers were chosen? Why don't we just use 1, 2, 3, 4, ... for the sake of simplicity?
The point is that as the estimates get bigger, they become less likely to be accurate anyway. There's no point in debating the merits of 34 vs. 35; at that scale you're likely to be miles out anyway. This way just makes it easier: does this feel more like a 20-point task or a 40-point task? Not having the numbers between 21 and 39 forces you to look at it in this "bigger" way. It should also be a hint that you should break the task down further before you come close to doing it.
All the details are explained here: http://en.wikipedia.org/wiki/Planning_poker
The sequence you give was introduced by Mike Cohn in his book "Agile Estimating & Planning" (the sequence is therefore copyrighted; you need to obtain permission to use it, or you can buy decks from his online shop).
The original planning poker sequence is a bit different and is described here by its original inventor (James Grenning): http://renaissancesoftware.net/papers/14-papers/44-planing-poker.html
This sequence lets you compare backlog items to each other. It is impossible to say that one item is exactly two times bigger than another; using this sequence, you always decide whether it is more or less than two times bigger.
For example:
The first item is estimated at 3 SP.
Now you are estimating a second item, and someone says it is about two times "bigger" than the first one. Development tasks are never exactly the same size, or exactly some multiple bigger or smaller, so you need to decide whether it is less than or more than two times bigger (it could be 5 SP or 8 SP).
If you have many estimated items in your backlog, you can use these numbers for some statistics. The statistics work because of the law of large numbers: http://en.wikipedia.org/wiki/Law_of_large_numbers
Using this sequence, you build some uncertainty into the numbers, so the probability that these statistics will work for you becomes higher.
Another simple answer to your question is: Mike Cohn chose these numbers after many experiments because they seem to work best over long periods of time for various teams.
Everything I wrote above is theory that was developed from those experiments.
I've never seen that sequence used; the Fibonacci series (1, 2, 3, 5, 8, 13, 21, 34) is more common. The idea is to avoid tricking yourself into thinking there is precision when there isn't.
Planning poker numbers represent the complexity of a task. You should not assume that a story valued at 8 is double the effort or time of a size-4 story, for example. You could use many different representations for these values (like T-shirt sizes); you just need the idea that one value is more complex than another, and that some other value is bigger still. The Planning Poker application attempts to illustrate this complexity with drawings tied to the numbers in order to reinforce this idea.

Resources