Neo4j design performance: Do I have to avoid large node degrees?

I'm in the middle of designing a data model to be implemented using Neo4j. It is about a transportation system that has stations, with vehicles traveling between them.
Some stations have a huge number of trips, say one million each month. So I want to know: is there any performance penalty for having some nodes with millions of edges coming out of them? Is it better to keep degrees lower with some design tricks (probably at the cost of making the design a little worse)?

Relationship degree matters mostly when those relationships are actually traversed: that is, in expansions that traverse any relationship type and direction, or that traverse the specific type (and direction) of relationship that has the high degree.
So if there are 100k :TRAVELS_TO relationships to a specific location, 100 :VISITED relationships to the location, and only 1 :TRAVELS_TO relationship from that location, then you'll only be paying a high cost when traversing those :TRAVELS_TO relationships to the location. If you're traversing relationships of a different type and/or direction, you won't be paying any higher cost because of the 100k other relationships.
So diversifying your types and/or directions can certainly help out.
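The point about types and directions can be illustrated with a toy model in Python. This is only a sketch under the assumption that relationships are grouped per (type, direction), as the answer describes; it is not Neo4j's actual storage engine, and `Node`, `connect`, and `expand` are hypothetical names made up for the illustration.

```python
from collections import defaultdict


class Node:
    """Toy node: relationships are grouped by (type, direction)."""

    def __init__(self, name):
        self.name = name
        # (relationship type, "IN" or "OUT") -> list of neighbour names
        self.rels = defaultdict(list)


def connect(src, rel_type, dst):
    """Add a directed relationship src -[rel_type]-> dst."""
    src.rels[(rel_type, "OUT")].append(dst.name)
    dst.rels[(rel_type, "IN")].append(src.name)


def expand(node, rel_type, direction):
    """Return the neighbours reached and how many relationships were touched."""
    group = node.rels.get((rel_type, direction), [])
    return list(group), len(group)


# The numbers from the answer: 100k incoming :TRAVELS_TO,
# 100 incoming :VISITED, and 1 outgoing :TRAVELS_TO.
hub = Node("Central")
for i in range(100_000):
    connect(Node(f"origin{i}"), "TRAVELS_TO", hub)
for i in range(100):
    connect(Node(f"person{i}"), "VISITED", hub)
connect(hub, "TRAVELS_TO", Node("Airport"))

_, cost_in = expand(hub, "TRAVELS_TO", "IN")
_, cost_visited = expand(hub, "VISITED", "IN")
_, cost_out = expand(hub, "TRAVELS_TO", "OUT")
print(cost_in, cost_visited, cost_out)
```

In this model only the expansion over incoming :TRAVELS_TO touches the 100k group; the other two expansions touch 100 and 1 relationships respectively.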
You may want to check Max De Marzi's blog for his approaches to constructing a flight/air-travel graph; you may find good ideas there that apply here.

Related

Reverse PageRank so connections to many less important nodes is better?

Is there an easy way to modify the PageRank algorithm so that being connected to many other nodes still increases a node's PageRank, but it's best if the nodes are less important?
I'm not sure if I'm explaining this well, but what I'm thinking of is applying this to hockey scoring. So if Gretzky, for example, has 4 edges, but none of his connected nodes are connected to anything else, and Lemieux also has 4 edges, but his are interconnected (more important), I'd want Gretzky to have a higher score.
In other words, I want PageRank to "adjust" for the quality of your teammates, so you get a higher score by being connected to lower-quality teammates, in contrast to how the algorithm normally adjusts so you get a higher score by being connected to higher-quality teammates.
Below is an ugly diagram of what I'm trying to explain:
Any ideas if something like this exists, and, if so, how to implement it? I use R and igraph for most of my graph theory-related analyses, for whatever it's worth, and if it isn't clear, I'm not really knowledgeable about graph theory.
EDIT: I've looked into personalized PageRank, and it seems somewhat relevant, though I'm not sure how I could set the weights so that the algorithm performs as I want it to.

Can I use SQLite to model arbitrary graphs (i.e. a logical map with cycles)?

I'm new to SQL and learning about Adjacency Lists, Nested Sets, Closure Tables, but from what I understood, these solutions usually apply to acyclic data.
I'm aware that this sort of problem may be better suited to a graph database engine such as Neo4j, and I am exploring that also. But for this question, I specifically want to know if I can achieve this goal in SQLite.
Before running off with a possible answer for this, please help me understand how to better define or illustrate the problem. Once the problem definition is refined, then point me in the right direction (technique, reference material) and let me try to figure it out.
Objectives:
Maintain a list of areas and how they are connected.
Areas can have different types: Country, Highway, State, City, Neighborhood.
Areas can be connected in cycles (undirected).
Areas can have multiple exits.
Maintain a weighted list from one exit to another, within the area.
Extract optimal path from one area to another (from this neighborhood to nearest highway).
Assumptions:
Will use SQLite 3 (newest version).
Small data set ( < 1,000 areas and connections, < 5s DB creation).
Relatively static ( < 5 inserts or updates/year ).
May be simpler to recreate database from scratch than update?
Highways are areas, not connectors.
Streets are logical connectors, no length, no weight.
Areas and connections are like a house with many rooms with multiple doors. The doors connect the rooms. There is no traversal weight going through a door. The weight in selecting a door comes from the distance between the doors. A hallway is like an extended door, so it has a weight and is considered a type of room. A room may have a large size, but if its only two doors are near each other, it may have a small weight. It's not the size of the room that counts for my purposes, but the distance between the doors.
As always, thank you for taking the time to read, and for constructive comments.
Yes, it is possible to use SQLite to store this kind of data.
It is not the most practical choice, though, and you may run into performance issues. If you plan to store a huge amount of such data and want a solution that scales well, you should go for a graph DB.
If you are going to store ~1000 nodes, that can work with simple relations in SQLite.
Especially since you are going to have very few updates, you could pre-calculate the distances. That way you don't have to recalculate them each time; you just load them from the DB.
I think you should represent your problem as a graph.
Nodes could be the "doors" and edges the distances between them.
You could store this easily in a relational database, with tables such as Areas(Id, Name), Doors(Id, Area1, Area2), and DoorDoorDistance(Door1, Door2, Distance).
Once you have stored this data, you can calculate the shortest path from every door to every other door. You could store the results in a new table: Distances(Door1, Door2, Path, Distance).
To calculate the shortest paths, you can choose among several algorithms:
Shortest path algorithms
After this you have the shortest path between each pair of doors.
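As a minimal sketch of the pre-calculation step, the following uses Python's built-in `sqlite3` module with the table layout suggested above and the Floyd-Warshall all-pairs algorithm (one possible choice among the shortest-path algorithms; the answer does not name a specific one). The column types, the sample areas, doors, and distances, and the comma-separated encoding of `Path` are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Areas (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Doors (Id INTEGER PRIMARY KEY, Area1 INTEGER, Area2 INTEGER);
CREATE TABLE DoorDoorDistance (Door1 INTEGER, Door2 INTEGER, Distance REAL);
CREATE TABLE Distances (Door1 INTEGER, Door2 INTEGER, Path TEXT, Distance REAL);
""")

# Made-up sample data: three areas connected in a cycle by three doors.
conn.executemany("INSERT INTO Areas VALUES (?, ?)",
                 [(1, "Neighborhood"), (2, "City"), (3, "Highway")])
conn.executemany("INSERT INTO Doors VALUES (?, ?, ?)",
                 [(1, 1, 2), (2, 2, 3), (3, 1, 3)])
# Distance between two doors that share an area (undirected).
conn.executemany("INSERT INTO DoorDoorDistance VALUES (?, ?, ?)",
                 [(1, 3, 4.0), (1, 2, 1.0), (2, 3, 2.0)])

doors = [row[0] for row in conn.execute("SELECT Id FROM Doors ORDER BY Id")]
INF = float("inf")
dist = {a: {b: (0.0 if a == b else INF) for b in doors} for a in doors}
nxt = {a: {b: (b if a == b else None) for b in doors} for a in doors}

for d1, d2, w in conn.execute("SELECT Door1, Door2, Distance FROM DoorDoorDistance"):
    for a, b in ((d1, d2), (d2, d1)):  # store both directions
        if w < dist[a][b]:
            dist[a][b] = w
            nxt[a][b] = b

# Floyd-Warshall: allow each door k in turn as an intermediate stop.
for k in doors:
    for i in doors:
        for j in doors:
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]
                nxt[i][j] = nxt[i][k]


def path(a, b):
    """Rebuild the door sequence from a to b using the successor table."""
    if nxt[a][b] is None:
        return []
    hops = [a]
    while a != b:
        a = nxt[a][b]
        hops.append(a)
    return hops


rows = [(a, b, ",".join(map(str, path(a, b))), dist[a][b])
        for a in doors for b in doors if a != b and dist[a][b] < INF]
conn.executemany("INSERT INTO Distances VALUES (?, ?, ?, ?)", rows)

result = conn.execute(
    "SELECT Path, Distance FROM Distances WHERE Door1 = 1 AND Door2 = 3"
).fetchone()
print(result)
```

With fewer than 1,000 doors the cubic cost of Floyd-Warshall should be acceptable for a table that is rebuilt only a few times a year; cycles in the graph are not a problem for it.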
The only remaining question is which door to take out of your starting area and which door to use to enter your destination area.
If you don't need to be that precise, you can just take the pair of doors with the shortest path between them. Otherwise you have to maintain the distances from a starting point within the area to each door.
A: You can assume that you start from the center of the area, so you store the door distances from the center.
B: You can be more precise by storing exact door locations and calculating door distances from an exact starting point.
In both cases you should select the doors with the lowest total cost, both in the starting area and in the destination area:
Total cost: (Walk to door distance) + (starting Door to destination Door Path) + (Walk to destination in the destination area)
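The total-cost formula above can be sketched as a small search over door pairs. This is an illustration only: `best_route` is a hypothetical helper, and the walking distances and door-to-door distances are made-up numbers standing in for values that would be loaded from the pre-calculated table.

```python
def best_route(start_walk, door_dist, dest_walk):
    """Pick the (exit door, entry door) pair with the lowest total cost.

    start_walk: exit door -> walking distance from the starting point
    door_dist:  (door, door) -> pre-calculated shortest distance
    dest_walk:  entry door -> walking distance to the destination point
    Returns (total cost, exit door, entry door), or None if nothing connects.
    """
    best = None
    for s, walk_out in start_walk.items():
        for d, walk_in in dest_walk.items():
            # The same door can serve as exit and entry for adjacent areas.
            between = 0.0 if s == d else door_dist.get((s, d))
            if between is None:
                continue  # no known path between these two doors
            total = walk_out + between + walk_in
            if best is None or total < best[0]:
                best = (total, s, d)
    return best


# Hypothetical numbers: the starting area has doors 1 and 3,
# the destination area has doors 2 and 3.
start_walk = {1: 5.0, 3: 1.0}
dest_walk = {2: 2.0, 3: 6.0}
door_dist = {(1, 2): 1.0, (1, 3): 3.0, (3, 2): 2.0}

best = best_route(start_walk, door_dist, dest_walk)
print(best)
```

Here the route through doors 3 and 2 wins with a total of 1.0 + 2.0 + 2.0 = 5.0, even though doors 1 and 2 have the shortest door-to-door path, because the walk to door 1 is long.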
I would do this like this. I hope I helped, have fun!

What is the difference between a node and a vertex?

What is the difference (if any) between a node and a vertex? I can't find the answer after looking at countless sites! Even my book doesn't specify it so I am kind of lost!
It is worth mentioning that I am looking for the difference besides the fact that it is called a 'vertex' when used in a graph and a 'node' when used in a tree.
There is no difference between the words "node" and "vertex". Some books on graph theory and graph algorithms even say so explicitly:
Vertex denoted by v, and sometimes it's called nodes also
There are neither major nor minor differences between them.
This is mentioned in the book: Data Structures and Algorithms with Object-Oriented Design Patterns in C#, by Bruno R. Preiss.
In "The Practitioner's Guide to Graph Data", the authors avoid the term "node/nodes" and use only vertex/vertices, and they explain why as follows:
...because we are focusing on distributed graphs, and nodes has different meanings in distributed systems, graph theory and computer science.
In distributed systems, a node can be a client, server, or peer, while in computer networking it can be a computer or a modem. In computer science, as you already pointed out, it can refer to an element of either a graph or a tree.
So in the context of graph theory, node and vertex are used interchangeably. But if you would like to make it clear and avoid any misunderstanding, vertex/vertices is the way to go.
I think the two terminologies come from different perceptions of graphs and networks. Albert-László Barabási writes in his recent textbook:
"In the scientific literature the terms network and graph are used interchangeably:
Network science    Graph theory
Network            Graph
Node               Vertex
Link               Edge
Yet, there is a subtle distinction between the two terminologies: the {network, node, link} combination often refers to real systems: The WWW is a network of web documents linked by URLs; society is a network of individuals linked by family, friendship or professional ties; the metabolic network is the sum of all chemical reactions that take place in a cell. In contrast, we use the terms {graph, vertex, edge} when we discuss the mathematical representation of these networks: We talk about the web graph, the social graph (a term made popular by Facebook), or the metabolic graph. Yet, this distinction is rarely made, so these two terminologies are often synonyms of each other."
<tl;dr> Same, same, but different.
There is no difference between a node and a vertex. Most books use V to represent the vertices of a graph. I've seen "node" mostly associated with trees.
For instance, you may have come across O(V + E) being used to represent the time complexity for depth first search and breadth first search graph traversals.
Similarly, V is used as part of time complexity analysis for other graph algorithms like Prim's, Kruskal's, etc.
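To show where the V and the E in O(V + E) come from, here is a small breadth-first search sketch that counts how often it looks at an edge. The graph and the `edge_checks` counter are made up for the illustration.

```python
from collections import deque


def bfs(adj, source):
    """Breadth-first search; returns visit order and number of edge checks."""
    order = []
    seen = {source}
    queue = deque([source])
    edge_checks = 0
    while queue:
        v = queue.popleft()      # each vertex is dequeued once: the V part
        order.append(v)
        for w in adj[v]:         # each adjacency entry is read once: the E part
            edge_checks += 1
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order, edge_checks


# Undirected square a-b-d-c-a: 4 vertices, 4 edges (each stored twice).
adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
order, edge_checks = bfs(adj, "a")
print(order, edge_checks)
```

Each of the 4 vertices is visited once, and each of the 4 undirected edges is examined twice (once from each end), which is why the running time is proportional to V + E.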

Detecting all cycles in a directed graph with millions of nodes in Ocaml

I have graphs with thousands of nodes to millions of nodes. I want to detect all possible cycles in such graphs.
I use a hash table to store the edges: ( (source node, edge weight) -> (target node) ).
What would be an efficient way of implementing this in OCaml?
It looks like Tarjan's algorithm is the best one.
What would be the most efficient implementation of it?
Yes, Tarjan's algorithm for strongly connected components is a good solution. You may also use so-called path-based strong component algorithms which have (when done carefully) comparable linear complexity.
If you pick reasonable data structures, they should work. It's hard to say much more before you have implemented and profiled a prototype.
I don't understand what your graph representation is: are your hashed keys really a (node, weight) pair? Then how do you find all neighbors of a given node? For a large graph structure you should optimize access time, of course, but also memory efficiency.
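For reference, here is a sketch of Tarjan's strongly connected components algorithm. It is written in Python rather than OCaml, purely to illustrate the structure, and it assumes the graph is available as a map from each node to its list of successors (not the (node, weight) keyed table from the question). The explicit work stack replaces recursion, which matters for graphs with millions of nodes. Note that it reports which nodes lie on cycles together; it does not enumerate individual cycles.

```python
def tarjan_scc(adj):
    """Strongly connected components of a directed graph.

    adj: dict mapping node -> list of successor nodes.
    Returns a list of components, each a list of nodes.
    """
    index = {}         # discovery order of each node
    lowlink = {}       # smallest index reachable from the node's subtree
    on_stack = set()
    stack = []         # Tarjan's node stack
    sccs = []
    counter = 0

    for root in adj:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(adj[root]))]   # explicit DFS stack
        while work:
            v, successors = work[-1]
            descended = False
            for w in successors:
                if w not in index:
                    # Tree edge: descend into w.
                    index[w] = lowlink[w] = counter
                    counter += 1
                    stack.append(w)
                    on_stack.add(w)
                    work.append((w, iter(adj.get(w, ()))))
                    descended = True
                    break
                if w in on_stack:
                    # Edge back into the current DFS stack.
                    lowlink[v] = min(lowlink[v], index[w])
            if descended:
                continue
            # All successors of v are done.
            work.pop()
            if work:
                parent = work[-1][0]
                lowlink[parent] = min(lowlink[parent], lowlink[v])
            if lowlink[v] == index[v]:
                # v is the root of a component: pop it off the node stack.
                component = []
                while True:
                    w = stack.pop()
                    on_stack.discard(w)
                    component.append(w)
                    if w == v:
                        break
                sccs.append(component)
    return sccs


graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: []}
components = tarjan_scc(graph)
cyclic = [c for c in components if len(c) > 1]
print(components, cyclic)
```

A component with more than one node (or a single node with a self-loop) contains at least one cycle; singleton components without a self-loop contain none.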
If you really want to find all possible cycles, the problem seems at least exponential in the worst case. For a complete graph, every nonempty subset of nodes gives you a different cycle (including a link from the last back to the first). Furthermore, every cyclic permutation of every subset gives you a different cycle. Depending on the sparsity of your graphs, the problem could be tractable in practice.
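To get a feeling for how fast this grows, the count can be worked out directly. The sketch below assumes a complete directed graph (an edge in both directions between every pair of nodes, no self-loops) and counts simple cycles of length 2 or more: a subset of k nodes can be arranged into (k-1)! distinct directed cycles. `count_cycles_complete` is a hypothetical helper name.

```python
from math import comb, factorial


def count_cycles_complete(n):
    """Number of simple directed cycles in a complete directed graph on n nodes.

    Choose k of the n nodes, then one of the (k-1)! distinct cyclic orders.
    """
    return sum(comb(n, k) * factorial(k - 1) for k in range(2, n + 1))


for n in (3, 4, 10, 20):
    print(n, count_cycles_complete(n))
```

Already at 10 nodes the count is above one million, so enumerating every cycle is only realistic when the graph is sparse.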

Distributed physics simulation help/advice

I'm working in a distributed-memory environment. My task is to simulate big 3D objects, modeled as particles tied together by springs, by dividing them into smaller pieces, with each piece simulated by a different computer. I'm using a 3rd-party physics engine to run the simulation. The problem I am facing is how to transmit the particle information at the extremities where the object is divided. This information is needed to compute the forces between interacting particles. The line in the image shows where the cut has been made. Because the number of particles is big, the communication overhead will be big as well. Is there a good way to transmit such information, or is there a way to transmit another value that helps me determine the information I need? Any help is much appreciated. Thank you.
PS: By particle information I mean the new positions from which to compute the resulting force to be applied to the particles simulated on the local machine.
"Big" means lots of things. Here the number of points with data being communicated may be "big" in that it's much more than one, but if you have say a million particles in a lattice, and are dividing it between 4 processors (say) by cutting it into squares, you're only communicating 500 particles across each boundary; big compared to one but very small compared to 1,000,000.
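The arithmetic behind the 500 can be spelled out. This sketch assumes a square 2D lattice and a processor count that is a perfect square, so the lattice splits into equal square blocks; `boundary_particles` is a hypothetical helper.

```python
import math


def boundary_particles(n_particles, n_procs):
    """Particles along one shared edge between two neighbouring square blocks."""
    side = math.isqrt(n_particles)      # particles per lattice side
    blocks = math.isqrt(n_procs)        # blocks per lattice side
    if side * side != n_particles or blocks * blocks != n_procs:
        raise ValueError("expects a square lattice and a square processor grid")
    return side // blocks


edge = boundary_particles(1_000_000, 4)
per_block = 1_000_000 // 4
print(edge, per_block, edge / per_block)
```

A 1000 x 1000 lattice cut into four 500 x 500 blocks has 500 particles along each shared edge, which is 0.2% of the 250,000 particles each processor owns.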
A library very commonly used for these sorts of distributed-memory computations is MPI. (Distributed-memory computing is somewhat different from distributed computing, which suggests nodes scattered all over the internet; this sort of computation, involving tightly coupled elements, is usually best done on a set of nearby computers in a lab or in a cluster.) This pattern of communication is very common, and is called "halo exchange", "guardcell exchange", "ghostzone exchange", or some combination; you should be able to find lots of examples by searching for those terms. (There are a few questions on this site on the topic, but they typically focus on very specific implementation questions.)
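The idea behind a halo exchange can be shown without any MPI calls. The sketch below is a single-process Python simulation of a 1D chain of particles joined by springs, cut into two pieces; the "exchange" is just a variable assignment standing in for the message that would travel between machines in a real run. The spring constant, rest length, and positions are made-up values, and `spring_forces` and `forces_with_halo` are hypothetical helpers.

```python
def spring_forces(pos, k=1.0, rest=1.0):
    """Net force on each particle of a 1D chain with springs between neighbours."""
    forces = [0.0] * len(pos)
    for i in range(len(pos) - 1):
        stretch = (pos[i + 1] - pos[i]) - rest
        forces[i] += k * stretch       # pulled toward the right neighbour
        forces[i + 1] -= k * stretch   # pulled toward the left neighbour
    return forces


def forces_with_halo(left, right):
    """Compute forces piece by piece, sharing only the boundary particles."""
    # Stand-in for the exchange: each side receives one ghost position.
    ghost_for_left = right[0]
    ghost_for_right = left[-1]
    # Each piece computes with its ghost attached, then discards the
    # force on the ghost, which belongs to the other piece.
    f_left = spring_forces(left + [ghost_for_left])[:-1]
    f_right = spring_forces([ghost_for_right] + right)[1:]
    return f_left + f_right


positions = [0.0, 1.1, 1.9, 3.2, 4.0, 5.3]
whole = spring_forces(positions)
split = forces_with_halo(positions[:3], positions[3:])
print(whole)
print(split)
```

Only one position per side crosses the cut each step, yet the forces match those of the undivided chain. In 2D or 3D the ghost layer becomes a line or a surface of particles, which is the small boundary fraction discussed above.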

Resources