Find all paths where nodes satisfy a condition - graph

I need help solving a problem and understanding the mechanics of Neo4j better. The text below is long because I have tried to describe my problem and my attempts in as much detail as possible.
To introduce the problem, take the simple structure below as an example. Each node is labelled Point and has two properties, point_id and num; point_id is used to identify a node.
CREATE (a:Point {point_id: 1, num: 1}),
(b:Point {point_id: 2, num: 1}),
(c:Point {point_id: 3, num: 1}),
(d:Point {point_id: 4, num: 2}),
(e:Point {point_id: 5, num: 2}),
(f:Point {point_id: 6, num: 1}),
(g:Point {point_id: 7, num: 2}),
(h:Point {point_id: 8, num: 2}),
(i:Point {point_id: 9, num: 1}),
(a)-[:NEXT]->(b),
(a)-[:NEXT]->(c),
(c)-[:NEXT]->(d),
(b)-[:NEXT]->(e),
(b)-[:NEXT]->(f),
(f)-[:NEXT]->(g),
(f)-[:NEXT]->(i),
(g)-[:NEXT]->(h),
(h)-[:NEXT]->(i);
Let's say I want to select all paths/nodes in paths starting from the Point with point_id = 1, where all nodes in a path need to satisfy the same filtering condition (like num = 1).
In the above graph, the result (with point_id identifying each node) would be the paths 1->2, 1->3, 1->2->6 and 1->2->6->9, i.e. the nodes 1, 2, 3, 6 and 9.
The query below returns the expected result:
MATCH p=((a:Point {point_id: 1})-[:NEXT*]->(b:Point))
WHERE ALL (x IN NODES(p) WHERE x.num = 1)
RETURN p
Now let's consider my real environment: I have a graph representing a road network with approximately 2 million nodes and 2.8 million relationships (or twice that amount if represented bidirectionally). Each node has 3 attributes, and their values follow this distribution: 0 (30%), 1 (30%), 2 (20%), 3 (14%), 4 (5%) and 5 (1%). Any attribute can be used as a filter with any possible value.
The node structure is:
(:Intersection {
intersection_id: int,
num_hotels: int,
num_restaurants: int,
num_gas_stations: int
})
The relationships are labelled as [:CONNECTED_TO]. Nodes represent intersections/endpoints and the relationships represent roads. The attribute intersection_id is indexed.
I've tried to solve the problem exemplified above for some time but didn't succeed. Without specifying a maximum depth for the traversal, Neo4j's memory usage explodes and the query runs indefinitely until I cancel the operation or kill the process (sometimes I have to, because the system nearly freezes due to the lack of available memory). I understand this behavior, since query complexity increases exponentially with each new depth level accessed.
However, the operation is still quite costly even if I set a maximum depth, such as 15 levels. Depending on the start node, a traversal of 10 to 15 levels can involve either a large share of the database (about a quarter of the nodes) or a relatively small number of nodes (a few thousand, <2k). The behavior described above happens in both cases.
I'm using the values with a 20% or 30% share as filters, since it's difficult to map the values with a 5% or 1% share due to the size of the graph. I tried filtering on the attribute num_hotels both with and without an index.
Below are some of the queries I've tried:
MATCH p=((a:Intersection {intersection_id: 562818})-[:CONNECTED_TO*]->(b:Intersection))
WHERE ALL (x IN NODES(p) WHERE x.num_hotels = 0)
RETURN p
MATCH path=(a:Intersection {intersection_id: 562818})-[:CONNECTED_TO*]->(b:Intersection {num_hotels: 0})
WHERE ALL(x IN NODES(path) WHERE SINGLE(y IN NODES(path) WHERE y = x))
RETURN path;
In some cases, where the number of nodes involved in the traversal was low (300-600), I obtained plausible results, but the queries didn't always execute normally. Transactions often seemed to freeze, so I had to end and restart them to get a result, and even that wasn't always guaranteed.
I would like tips for solving the problem, as well as some explanation of Neo4j's behavior in this kind of operation. My impression so far is that Neo4j finds all the paths first and only then applies the filter, regardless of how I organize the query.
Neo4j version: 3.3.1
OS: Linux Mint Cinnamon 18.2
Memory: 6GB, about 4.5GB available for the tests.

Related

Slow query response with NebulaGraph version v3.1.0

NebulaGraph version: v3.1.0
graphd: 1 (128 GB, 2 TB SSD)
metad: 1 (128 GB, 2 TB SSD)
storaged: 3 (128 GB, 2 TB SSD)
The query below took about 20 minutes:
MATCH (s:Student)-[r]-(a:CourseTcode)-[rr]-(b)
WHERE a.CourseTcode.id == 522687
RETURN s, r, a, rr, b limit 3
Below is the profile
id name dependencies profiling data
18 Project 16 ver: 0, rows: 3, execTime: 18355us, totalTime: 18365us
16 Limit 14 ver: 0, rows: 3, execTime: 25528291us, totalTime: 25528300us
14 Filter 6 ver: 0, rows: 11636144, execTime: 8150513us, totalTime: 8150522us
I changed my query as shown below; a little improvement, but not enough:
MATCH (s:Student)-[r]-(a:CourseTcode)-[rr]-(b)
WHERE id(a) == "522687"
RETURN s, r, a, rr, b limit 3
Below is the profile
id name dependencies profiling data
18 Projection 16 ver: 0, rows: 3, execTime: 25216us, totalTime: 25227us
16 Limit 14 ver: 0, rows: 3, execTime: 20186664us, totalTime: 20186672us
14 Filter 7 ver: 0, rows: 11636144, execTime: 5799073us, totalTime: 5799088us
Regarding the profile, it would be helpful to have the full profile output to see how the time consumption is distributed.
1. As you can see from the profile/explain output, the query starts by seeking a, as it is the only node with a filter condition for now. As you tested, id(a) == "522687" should be faster, but it rarely helps much because that seek is not the major slow phase; still, please prefer id(foo) == xxx over property conditions whenever possible.
2. Due to the query/storage separation design, it is costly to fetch a lot of data from the storage side into the query engine when some of the filters/limits cannot be pushed down to the storage side.
2.1 On the NebulaGraph side, introducing more optimization rules and storage pushdown operators will help here (progress: https://github.com/vesoft-inc/nebula/issues/2533). Here I can see that Filter/Limit is really costly, so there may be room for optimization.
2.2 On the query-composing side, adding more information to reduce the amount of data being traversed will help:
2.2.1 MATCH (s:Student)-[r:EdgeTypeA|:EdgeTypeB|:EdgeTypeC]-(a:CourseTcode)-[rr:EdgeTypeE|:EdgeTypeF|:EdgeTypeG]-(b): if the relationships are not of all edge types, specify the types as narrowly as possible; the same applies to the label of b.
2.2.2 Another approach is to limit the traversal in the middle rather than only in the final phase:
i. It could be something like this, where, if you check its plan, the limit is applied in the first part of the traversal:
match (s:player)-[r]-(a:player)
where a.player.name == "Tim Duncan"
with s,r,a limit 100
match (a:player)-[rr]-(b)
return s,r,a,rr,b limit 3
ii. Or, going even further, use GO / FETCH / LOOKUP for an equivalent query (querying step by step and applying a limit at each step) to get better-optimized performance; this is highly recommended for huge-data-volume queries whenever possible.
2.3 From the super-node perspective, where a few vertices can be connected to a huge number of vertices: if all queries only target a sample (limit/topN) of the data instead of fetching all of it, or if we would like to truncate data for those super nodes, the storaged configuration max_edge_returned_per_vertex can be set, e.g. to 1000 or another value.

Optimizing (minimizing) the number of lines in a file: an optimization problem involving permutations and agenda scheduling

I have a calendar, typically a csv file containing a number of lines. Each line corresponds to an individual and is a sequence of consecutive values '0' and '1' where '0' refers to an empty time slot and '1' to an occupied slot. There cannot be two separated sequences in a line (e.g. two sequences of '1' separated by a '0' such as '1,1,1,0,1,1,1,1').
The problem is to minimize the number of lines by combining the individuals and resolving the collisions between time slots. Note the time slots cannot overlap. For example, for 4 individuals, we have the following sequences:
id1:1,1,1,0,0,0,0,0,0,0
id2:0,0,0,0,0,0,1,1,1,1
id3:0,0,0,0,1,0,0,0,0,0
id4:1,1,1,1,0,0,0,0,0,0
One can arrange them to end up with two lines, while keeping track of which individuals were combined (for the record). In our example this yields:
1,1,1,0,1,0,1,1,1,1 (id1 + id2 + id3)
1,1,1,1,0,0,0,0,0,0 (id4)
The constraints are the following:
The number of individuals ranges from 500 to 1000,
The length of the sequence will never exceed 30
Each sequence in the file has the exact same length,
The algorithm needs to be reasonable in execution time because this task may be repeated up to 200 times.
We don't necessarily search for the optimal solution; a near-optimal solution would suffice.
We need to keep track of the combined individuals (as in the example above)
Genetic algorithms seem a good option, but how do they scale (in terms of execution time) with the size of this problem?
A suggestion in Matlab or R would be (greatly) appreciated.
Here is a sample test:
id1:1,1,1,0,0,0,0,0,0,0
id2:0,0,0,0,0,0,1,1,1,1
id3:0,0,0,0,1,0,0,0,0,0
id4:1,1,1,1,1,0,0,0,0,0
id5:0,1,1,1,0,0,0,0,0,0
id6:0,0,0,0,0,0,0,1,1,1
id7:0,0,0,0,1,1,1,0,0,0
id8:1,1,1,1,0,0,0,0,0,0
id9:1,1,0,0,0,0,0,0,0,0
id10:0,0,0,0,0,0,1,1,0,0
id11:0,0,0,0,1,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id13:0,0,0,1,1,1,0,0,0,0
id14:0,0,0,0,0,0,0,0,0,1
id15:0,0,0,0,1,1,1,1,1,1
id16:1,1,1,1,1,1,1,1,0,0
Solution(s)
#Nuclearman provided a working solution in O(NT) (where N is the number of individuals (ids) and T is the number of time slots (columns)) based on the Greedy algorithm.
I study algorithms as a hobby and I agree with Caduchon on this one that greedy is the way to go, though to be more accurate I believe this is actually the clique cover problem: https://en.wikipedia.org/wiki/Clique_cover
Some ideas on how to approach building cliques can be found here: https://en.wikipedia.org/wiki/Clique_problem
Clique problems are related to independence set problems.
Considering the constraints, and that I'm not familiar with matlab or R, I'd suggest this:
Step 1. Build the independent-set time slot data. For each time slot, create a mapping (for fast lookup) of all records that have a 1 in that slot. None of these can be merged into the same row (they all need to be merged into different rows). For example, for the first column (slot), the subset of the data looks like this:
id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
The data would be stored as something like 0: Set(id1,id4,id8,id9,id16) (zero-indexed slots, so we start at slot 0 instead of slot 1, though that probably doesn't matter here). The idea here is to have O(1) lookup: you may need to quickly tell that id2 is not in that set. You can also use nested hash tables for that, i.e. 0: { id1: true, id4: true, ... }. Sets also allow the use of set operations, which may help quite a bit when determining unassigned columns/ids.
In any case, none of these 5 can be grouped together. That means that, at best (given that slot), you must have at least 5 rows (if the other individuals can be merged into those 5 without conflict).
Performance of this step is O(NT), where N is the number of individuals and T is the number of time slots.
Step 2. Now we have options for how to attack things. To start, we pick the time slot with the most individuals and use that as our starting point; that gives us the minimum possible number of rows. In this case we actually have a tie, where the 2nd and 5th slots both have 7. I'm going with the 2nd, which looks like:
id1 :1,1,1,0,0,0,0,0,0,0
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,0,0,0,0,0,0
id8 :1,1,1,1,0,0,0,0,0,0
id9 :1,1,0,0,0,0,0,0,0,0
id12:0,1,1,1,0,0,0,0,0,0
id16:1,1,1,1,1,1,1,1,0,0
Step 3. Now that we have our starting groups, we need to add to them while trying to avoid conflicts between new members and old group members (which would require an additional row). This is where we get into NP-complete territory, as there are a huge number of ways (roughly 2^N, to be more accurate) to assign things.
I think the best approach might be a randomized one, as you can theoretically run it as many times as you have time for to get better results. So here is the randomized algorithm:
1. Given the starting column and ids (1,4,5,8,9,12,16 above), mark this column and those ids as assigned.
2. Randomly pick an unassigned column (time slot). If you want a perhaps "better" result, pick the one with the least (or most) unassigned ids. For a faster implementation, just loop over the columns.
3. Randomly pick an unassigned id. For a better result, perhaps the one with the most/least groups it could be assigned to. For a faster implementation, just pick the first unassigned id.
4. Find all groups that the unassigned id could be assigned to without creating a conflict.
5. Randomly assign it to one of them. For a faster implementation, just pick the first one that doesn't cause a conflict. If there is no group without a conflict, create a new group and assign the id to it as its first id. The optimal result is that no new groups have to be created.
6. Update the data for that row (turn 0s into 1s as needed).
7. Repeat steps 3-5 until no unassigned ids for that column remain.
8. Repeat steps 2-6 until no unassigned columns remain.
Example with the faster-implementation approach, which happens to give an optimal result (there cannot be fewer than 7 rows, and there are only 7 rows in the result).
First 3 columns: no unassigned ids have a 1 there. Skipped.
4th column: assigned id13 to the id9 group (13=>9). The id9 group now looks like this, showing that the group that started with id9 also includes id13:
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
5th Column: 3=>1, 7=>5, 11=>8, 15=>12
Now it looks like:
id1 :1,1,1,0,1,0,0,0,0,0 (+id3)
id4 :1,1,1,1,1,0,0,0,0,0
id5 :0,1,1,1,1,1,1,0,0,0 (+id7)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0
We'll just quickly look at the remaining columns and see the final result:
7th Column: 2=>1, 10=>4
8th column: 6=>5
Last column: 14=>4
So the final result is:
id1 :1,1,1,0,1,0,1,1,1,1 (+id3,id2)
id4 :1,1,1,1,1,0,1,1,0,1 (+id10,id14)
id5 :0,1,1,1,1,1,1,1,1,1 (+id7,id6)
id8 :1,1,1,1,1,0,0,0,0,0 (+id11)
id9 :1,1,0,1,1,1,0,0,0,0 (+id13)
id12:0,1,1,1,1,1,1,1,1,1 (+id15)
id16:1,1,1,1,1,1,1,1,0,0
Conveniently, even this "simple" approach allowed us to assign the remaining ids to the original 7 groups without conflict. This is unlikely to happen in practice with, as you say, 500-1000 ids and up to 30 columns, but it is far from impossible. Roughly speaking, 500 / 30 is about 17 and 1000 / 30 is about 34, so I would expect you to be able to get down to roughly 10-60 rows, with about 15-45 being likely, but it's highly dependent on the data and a bit of luck.
In theory, the performance of this method is O(NT), where N is the number of individuals (ids) and T is the number of time slots (columns). It takes O(NT) to build the data structure (basically converting the table into a graph). After that, for each column it requires checking and assigning at most O(N) individual ids, though they might be checked multiple times. In practice, since O(T) is roughly O(sqrt(N)) and performance improves as you go through the algorithm (similar to quicksort), it is likely O(N log N) or O(N sqrt(N)) on average, though it's probably more accurate to use O(E), where E is the number of 1s (edges) in the table. Each edge likely gets checked and iterated over a fixed number of times, so that is probably a better indicator.
The NP-hard part comes into play in working out which ids to assign to which groups such that no new groups (rows) are created, or the lowest possible number of new groups is created. I would run the "fast implementation" and the "random" approaches a few times and see how many extra rows (beyond the known minimum) you get; if it's only a small number, that is probably good enough.
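For concreteness, here is a minimal Python sketch of the "fast implementation" variant described above (the function name pack_rows and the dict-based input format are my own assumptions, not part of the answer; rows maps each id to its 0/1 slot list):

def pack_rows(rows):
    # rows: dict mapping id -> list of 0/1 slots.
    # Returns a list of [member_ids, merged_row] groups.
    n_slots = len(next(iter(rows.values())))

    # Step 1: for every time slot, the set of ids occupying it (O(NT)).
    slot_to_ids = {t: {i for i, r in rows.items() if r[t] == 1}
                   for t in range(n_slots)}

    # Step 2: seed one group per id in the busiest slot; none of them can share a row.
    seed_slot = max(slot_to_ids, key=lambda t: len(slot_to_ids[t]))
    groups = [[[i], list(rows[i])] for i in sorted(slot_to_ids[seed_slot])]
    assigned = set(slot_to_ids[seed_slot])

    # Step 3 ("fast" variant): put every still-unassigned id into the first
    # group it does not collide with, opening a new group only if necessary.
    for t in range(n_slots):
        for i in sorted(slot_to_ids[t] - assigned):
            for members, merged in groups:
                if all(not (a and b) for a, b in zip(merged, rows[i])):
                    members.append(i)
                    for k, v in enumerate(rows[i]):
                        merged[k] = merged[k] or v
                    break
            else:  # no conflict-free group: open a new row
                groups.append([[i], list(rows[i])])
            assigned.add(i)
    return groups

On the 4-individual example from the question this returns two groups, (id1, id3, id2) and (id4), matching the expected result; replacing the "first fit" choices with random picks gives the randomized variant.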
This problem, contrary to some comments, is not NP-complete due to the restriction that "There cannot be two separated sequences in a line". This restriction implies that each line can be considered to be representing a single interval. In this case, the problem reduces to a minimum coloring of an interval graph, which is known to be optimally solved via a greedy approach. Namely, sort the intervals in descending order according to their ending times, then process the intervals one at a time in that order always assigning each interval to the first color (i.e.: consolidated line) that it doesn't conflict with or assigning it to a new color if it conflicts with all previously assigned colors.
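As an illustration only (the function name and input format are my own, and each id is assumed to have at least one occupied slot), this greedy interval coloring could be sketched in Python like so:

def combine_lines(rows):
    # rows: dict id -> 0/1 list with a single contiguous block of 1s.
    # Each line is the interval [first 1, last 1]; process intervals by
    # descending end time and put each one on the first line it fits on.
    intervals = {i: (r.index(1), len(r) - 1 - r[::-1].index(1))
                 for i, r in rows.items()}
    order = sorted(intervals, key=lambda i: intervals[i][1], reverse=True)

    lines = []  # each entry: [member_ids, earliest_start_on_this_line]
    for i in order:
        start, end = intervals[i]
        for line in lines:
            if line[1] > end:      # nothing on this line overlaps the new interval
                line[0].append(i)
                line[1] = start    # the new interval now starts earliest here
                break
        else:
            lines.append([[i], start])
    return [members for members, _ in lines]

On the 4-individual example it produces the two lines (id2, id3, id4) and (id1): a different grouping from the one in the question, but also using the minimum of two lines.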
Consider a constraint programming approach. Here is a question very similar to yours: Constraint Programming: Scheduling with multiple workers.
A very simple MiniZinc model could also look like this (sorry, no Matlab or R):
include "globals.mzn";
%int: jobs = 4;
int: jobs = 16;
set of int: JOB = 1..jobs;
%array[JOB] of var int: start = [0, 6, 4, 0];
%array[JOB] of var int: duration = [3, 4, 1, 4];
array[JOB] of var int: start = [0, 6, 4, 0, 1, 8, 4, 0, 0, 6, 4, 1, 3, 9, 4, 1];
array[JOB] of var int: duration = [3, 4, 1, 5, 3, 2, 3, 4, 2, 2, 1, 3, 3, 1, 6, 8];
var int: machines;
constraint cumulative(start, duration, [1 | j in JOB], machines);
solve minimize machines;
This model does not, however, tell which jobs are scheduled on which machines.
Edit:
Another option would be to transform the problem into a graph coloring problem. Let each line be a vertex in a graph. Create edges for all overlapping lines (the 1-segments overlap). Find the chromatic number of the graph. The vertices of each color then represent a combined line in the original problem.
Graph coloring is a well-studied problem; for larger instances, consider a local search approach using tabu search or simulated annealing.
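To illustrate the transformation (not the tabu search or simulated annealing itself), a possible Python sketch using networkx's greedy coloring heuristic could look like this (function name and input format are my own):

import networkx as nx

def combine_by_coloring(rows):
    # rows: dict id -> 0/1 list. Vertices = lines, edges = overlapping lines;
    # every color class of the coloring becomes one combined line.
    ids = list(rows)
    G = nx.Graph()
    G.add_nodes_from(ids)
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            if any(a and b for a, b in zip(rows[ids[x]], rows[ids[y]])):
                G.add_edge(ids[x], ids[y])  # their 1-segments overlap

    coloring = nx.greedy_color(G, strategy="largest_first")  # heuristic
    lines = {}
    for node, color in coloring.items():
        lines.setdefault(color, []).append(node)
    return list(lines.values())

Note that greedy_color is only a heuristic and does not necessarily find the chromatic number; for the interval structure of this particular problem the interval-based greedy above is already optimal, so this mainly shows the modelling idea.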

How to partition UUID space into N equal-size partitions?

Take a UUID in its hex representation: '123e4567-e89b-12d3-a456-426655440000'
I have a lot of such UUIDs, and I want to separate them into N buckets, where N is of my choosing, and I want to generate the bounds of these buckets.
I can trivially create 16 buckets with these bounds:
00000000-0000-0000-0000-000000000000
10000000-0000-0000-0000-000000000000
20000000-0000-0000-0000-000000000000
30000000-0000-0000-0000-000000000000
...
e0000000-0000-0000-0000-000000000000
f0000000-0000-0000-0000-000000000000
ffffffff-ffff-ffff-ffff-ffffffffffff
just by iterating over the options for the first hex digit.
Suppose I want 50 equal-size buckets (equal in terms of the number of UUID possibilities contained within each bucket), or 2000 buckets, or N buckets.
How do I generate such bounds as a function of N?
Your UUIDs above are 32 hex digits in length. So that means you have 16^32 ≈ 3.4e38 possible UUIDs. A simple solution would be to use a big int library (or a method of your own) to store these very large values as actual numbers. Then, you can just divide the number of possible UUIDs by N (call that value k), giving you bucket bounds of 0, k, 2*k, ... (N-1)*k, UMAX.
This runs into a problem if N doesn't divide the number of possible UUIDs. Obviously, not every bucket will have the same number of UUIDs, but in this case, they won't even be evenly distributed. For example, if the number of possible UUIDs is 32, and you want 7 buckets, then k would be 4, so you would have buckets of size 4, 4, 4, 4, 4, 4, and 8. This probably isn't ideal. To fix this, you could instead make the bucket bounds at 0, (1*UMAX)/N, (2*UMAX)/N, ... ((N-1)*UMAX)/N, UMAX. Then, in the inconvenient case above, you would end up with bounds at 0, 4, 9, 13, 18, 22, 27, 32 -- giving bucket sizes of 4, 5, 4, 5, 4, 5, 5.
You will probably need a big int library or some other method to store large integers in order to use this method. For comparison, a long long in C++ (in some implementations) can only store up to 2^64 ≈ 1.8e19.
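In Python, arbitrary-precision integers are built in, so the second scheme (bounds at floor(i*UMAX/N)) can be sketched directly; uuid.UUID is used here only to format the bounds back into dashed hex form:

import uuid

UMAX = 2**128 - 1  # ffffffff-ffff-ffff-ffff-ffffffffffff

def bucket_bounds(n):
    # n+1 boundary values delimiting n buckets of (nearly) equal size
    return [str(uuid.UUID(int=(i * UMAX) // n)) for i in range(n + 1)]

for b in bucket_bounds(5):
    print(b)
# 00000000-0000-0000-0000-000000000000
# 33333333-3333-3333-3333-333333333333
# 66666666-6666-6666-6666-666666666666
# 99999999-9999-9999-9999-999999999999
# cccccccc-cccc-cccc-cccc-cccccccccccc
# ffffffff-ffff-ffff-ffff-ffffffffffff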
If N is a power of 2, then the solution is obvious: you can split on bit boundaries as for 16 buckets in your question.
If N is not a power of 2, the buckets mathematically cannot all be exactly the same size, so the question becomes how much inequality you are willing to tolerate in the name of efficiency.
As long as N<2^24 or so, the simplest thing to do is just allocate UUIDs based on the first 32 bits into N buckets each of size 2^32/N. That should be fast enough and equal enough for most applications, and if N needs to be larger than that allows, you could easily double the bits with a smallish penalty.
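A possible sketch of that allocation in Python, reading only the leading 8 hex digits (the first 32 bits) of each UUID:

def bucket_index(u, n):
    # bucket 0..n-1 for UUID string u, based on its first 32 bits
    top32 = int(u.replace('-', '')[:8], 16)
    return (top32 * n) >> 32

print(bucket_index('123e4567-e89b-12d3-a456-426655440000', 50))  # -> 3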

Why is the knapsack not filled correctly by this R code?

Consider the following R implementation of a knapsack problem (which deliberately has inefficient runtime behavior, for educational purposes).
Given a collection of items with weights of 3, 2 and 4 pounds and (in the same order) values of 7, 4 and 10 euros respectively:
#data:
itemsSize <- c(3, 2, 4)
itemsValue <- c(7, 4, 10)
Expected behavior of the following R function sack(): we want the maximum sum of euros that does not violate the capacity limit of the sack. The solution for sack(5), i.e. the optimal value of the sack given a 5 lb capacity limit, should be 11.
max <- 0
sack <- function(cap) {
  for (i in (1:length(itemsSize))) {
    if ((space <- (cap - itemsSize[i])) >= 0) {
      if ((rised <- (sack(space) + itemsValue[i])) > max) {
        max <- rised
      }
    }
    return(max)
  }
}
Phenomena: a) No matter which item is put first, the code only optimizes with respect to that item (tested by changing the order of itemsSize along with the order of itemsValue in #data). The expected behaviour would be to combine items as i counts up. b) The code puts the same item into the sack more than once (tested with e.g. itemsSize <- c(2,4) and itemsValue <- c(4,10) in #data). The expected behavior would be to pick each item at most once.
Overall, a) together with b): the code only packs the first item (and potentially several instances of it) until the sack is full.
Why do these phenomena occur - what did I do wrong?

Random tree with specific branching factor in Mathematica

Do you know if it's possible to somehow generate a random tree graph with a specific branching factor? I don't want it to be a k-ary tree.
It would be also great if I could define both the branching factor and the maximum depth. I want to randomly generate a bunch of trees that would differ in branching factor and depth.
TreePlot with random integer input returns something that's almost what I want:
TreePlot[RandomInteger[#] -> # + 1 & /@ Range[0, 100]]
but I can't figure out a way to get a tree with a specific branching factor.
Thanks!
I guess I'm a bit late, but I like the question. Instead of creating a tree in the form
{0 -> 1, 0 -> 5, 1 -> 2, 1 -> 3, 1 -> 4}
I will use the following form of nested calls, where every argument is a child, which represents another node
0[1[2, 3, 4], 5]
Both forms are equivalent and can be transformed into each other.
Row[{
TreeForm[0[1[2, 3, 4], 5]],
TreePlot[{0 -> 1, 0 -> 5, 1 -> 2, 1 -> 3, 1 -> 4}]
}]
Here is how the algorithm works: As arguments we need a function f which gives a random number of children and is called when we create a node. Additionally, we have a depth d which defines the maximum depth a (sub-)tree can have.
1. [Choose branching] Define a branching function f which can be called as f[] and returns a random number of children. If you want a tree with either 2 or 4 children, you could use e.g. f[] := RandomChoice[{2, 4}]. This function will be called for each node created in the tree.
2. [Choose tree depth] Choose a maximum depth d of the tree. At this point I'm not sure how you want randomness to be incorporated into the generation of the tree. What I do here is that when a new node is created, the depth of the tree below it is randomly chosen between the depth of its parent minus one and zero.
3. [Create ID counter] Create a unique counter variable count and set it to zero. This will give us increasing node IDs. When a new node is created, it is increased by 1.
4. [Create a node] Increase count and use it as the node ID. If the current depth d is zero, give back a leaf with ID count; otherwise call f to decide how many children the node should get. For every new child, randomly choose the depth of its sub-tree, which can be 0,...,d-1, and call step 4 for it. When all recursive calls have returned, the tree is built.
Fortunately, in Mathematica code this procedure is not very verbose and consists of only a few lines. I hope you can recognize in the code what I described above:
With[{counter = Unique[]},
  generateTree[f_, d_] := (counter = 0; builder[f, d]);
  builder[f_, d_] := Block[
    {nodeID = counter++, childs = builder[f, #] & /@ RandomInteger[d - 1, f[]]},
    nodeID @@ childs
  ];
  builder[f_, 0] := (counter++);
]
Now you can create a random tree as follows:
branching[] := RandomChoice[{2, 4}];
t = generateTree[branching, 6];
TreeForm[t]
Or, if you like, you can use the next function to convert the tree into the form accepted by TreePlot:
transformTree[tree_] := Module[{transform},
  transform[(n_Integer)[childs__]] := (
    Sow[n -> # & /@ ({childs} /. h_Integer[__] :> h)];
    transform /@ {childs});
  Flatten@Last@Reap[transform[tree]]
]
and use it to create many random trees
trees = Table[generateTree[branching, depth], {depth, 3, 7}, {5}];
GraphicsGrid[Map[TreePlot[transformTree[#]] &, trees, {2}]]
