Slow query response with NebulaGraph version v3.1.0 - nebula-graph

NebulaGraph version: v3.1.0
graphd: 1 (128 GB RAM, 2 TB SSD)
metad: 1 (128 GB RAM, 2 TB SSD)
storaged: 3 (128 GB RAM, 2 TB SSD)
The query below took about 20 minutes:
MATCH (s:Student)-[r]-(a:CourseTcode)-[rr]-(b)
WHERE a.CourseTcode.id == 522687
RETURN s, r, a, rr, b limit 3
Below is the profile:
id | name    | dependencies | profiling data
18 | Project | 16           | ver: 0, rows: 3, execTime: 18355us, totalTime: 18365us
16 | Limit   | 14           | ver: 0, rows: 3, execTime: 25528291us, totalTime: 25528300us
14 | Filter  | 6            | ver: 0, rows: 11636144, execTime: 8150513us, totalTime: 8150522us
I changed my query as below; there was a little improvement, but not enough:
MATCH (s:Student)-[r]-(a:CourseTcode)-[rr]-(b)
WHERE id(a) == "522687"
RETURN s, r, a, rr, b limit 3
Below is the profile:
id | name       | dependencies | profiling data
18 | Projection | 16           | ver: 0, rows: 3, execTime: 25216us, totalTime: 25227us
16 | Limit      | 14           | ver: 0, rows: 3, execTime: 20186664us, totalTime: 20186672us
14 | Filter     | 7            | ver: 0, rows: 11636144, execTime: 5799073us, totalTime: 5799088us

Regarding the profile, it would be helpful to have the full profile output so we can see the whole time-consumption distribution.
1. As you can see from the profile/explain output, the query starts by seeking vertex a, since it is the only one with a filter condition. As you tested, id(a) == "522687" should be faster, but it barely helps here because that seek is not the major slow phase at all. Still, please prefer id(foo) == xxx over property conditions whenever possible.
2. Due to the query/storage separation design, it is costly to fetch a lot of data from storage into the query engine when some of the filters/limits cannot be pushed down to the storage side.
2.1 On the NebulaGraph side, introducing more optimization rules and storage pushdown operators will help here (progress: https://github.com/vesoft-inc/nebula/issues/2533 ). In your profile, Filter/Limit is really costly, so there may be room for optimization.
2.2 On the query-composition side, adding more information to reduce the data being traversed will help:
2.2.1 If the traversal does not need all edge types, specify them as narrowly as possible, e.g. MATCH (s:Student)-[r:EdgeTypeA|:EdgeTypeB|:EdgeTypeC]-(a:CourseTcode)-[rr:EdgeTypeE|:EdgeTypeF|:EdgeTypeG]-(b); the same applies to the tag of b.
2.2.2 Another approach could be to limit the traversal in the middle rather than only in the final phase:
i. It could be something like this; if you check its plan, the limit is applied in the first part of the traversal:
match (s:player)-[r]-(a:player)
where a.player.name == "Tim Duncan"
with s,r,a limit 100
match (a:player)-[rr]-(b)
return s,r,a,rr,b limit 3
ii. Or, going even further, use GO/FETCH/LOOKUP to compose an equivalent query step by step, with a limit at each step, to enable better-optimized performance. This is highly recommended for huge-data-volume queries whenever possible.
2.3 From the super-node perspective: when a few vertices are connected to a huge number of vertices, and all of the queries only target sample (limit/topN) data instead of fetching everything, or when we would like to truncate data for those super nodes, the storaged configuration max_edge_returned_per_vertex can be set, e.g. to 1000 or another value.
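For example, in nebula-storaged.conf (assuming the usual gflags format of the storaged configuration file):
--max_edge_returned_per_vertex=1000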

Related

Calculate the number of trips in graph traversal

Hello Stack Overflow Community,
I'm attempting to solve this problem:
https://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&page=show_problem&problem=1040
The problem is to find the best path based on capacity between edges. I get that this can be solved using dynamic programming, but I'm confused by the example they provide:
According to the problem description, if someone is trying to get 99 people from city 1 to 7, the route should be 1-2-4-7, which I get, since the weight of each edge represents the maximum number of passengers that can go at once. What I don't get is that the description says it takes at least 5 trips. Where does the 5 come from? 1-2-4-7 is 3 hops. If I take this route I calculate 4 trips: since 25 is the most limiting hop on the route, I would say you need 99/25, or at least 4 trips. Is this a typo, or am I missing something?
Given the first line of the problem statement:
Mr. G. works as a tourist guide.
It is likely that Mr. G must always be present on the bus, i.e. the guide occupies a seat on every trip, so the equation for the number of trips is:
x = (ceil(x) + number_of_passengers) / best_route_capacity
rather than simply:
x = number_of_passengers / best_route_capacity
or, for your numbers:
x = (ceil(x) + 99) / 25
which is solved by:
x == 4.16, i.e. ceil(x) == 5 trips
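Equivalently, since the guide occupies one of the 25 seats on every trip, each trip can move at most capacity - 1 passengers. A quick Python sketch of that idea (names are mine, not from the problem statement):
import math

def trips_needed(passengers, route_capacity):
    # Mr. G rides along on every trip, so each trip carries at most route_capacity - 1 passengers.
    return math.ceil(passengers / (route_capacity - 1))

print(trips_needed(99, 25))  # 5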

Find all paths where nodes satisfy a condition

I need help solving a problem and to better understand the mechanics of Neo4j. The text below is long because I have tried to detail my problem and my attempts as much as possible.
To introduce the problem, take the simple structure below as an example. Each node is labelled as a Point and has 2 attributes: point_id and num, and point_id is used as a node representative.
CREATE (a:Point {point_id: 1, num: 1}),
(b:Point {point_id: 2, num: 1}),
(c:Point {point_id: 3, num: 1}),
(d:Point {point_id: 4, num: 2}),
(e:Point {point_id: 5, num: 2}),
(f:Point {point_id: 6, num: 1}),
(g:Point {point_id: 7, num: 2}),
(h:Point {point_id: 8, num: 2}),
(i:Point {point_id: 9, num: 1}),
(a)-[:NEXT]->(b),
(a)-[:NEXT]->(c),
(c)-[:NEXT]->(d),
(b)-[:NEXT]->(e),
(b)-[:NEXT]->(f),
(f)-[:NEXT]->(g),
(f)-[:NEXT]->(i),
(g)-[:NEXT]->(h),
(h)-[:NEXT]->(i);
Let's say I want to select all paths/nodes in paths starting from the Point with point_id = 1, where all nodes in a path need to satisfy the same filtering condition (like num = 1).
In the above graph, the result would be (point_id as a node representative):
The query below returns the expected result:
MATCH p=((a:Point {point_id: 1})-[:NEXT*]->(b:Point))
WHERE ALL (x IN NODES(p) WHERE x.num = 1)
RETURN p
Now let's consider my real environment: I have a graph representing a road network with approximately 2 million nodes and 2.8 million relationships (or twice that amount if represented bidirectionally). Each node has 3 numeric attributes (besides its id) whose values follow this distribution: 0 (30%), 1 (30%), 2 (20%), 3 (14%), 4 (5%) and 5 (1%). Any attribute can be used as a filter with any possible value.
The node structure is:
(:Intersection {
intersection_id: int,
num_hotels: int,
num_restaurants: int,
num_gas_stations: int
})
The relationships are labelled as [:CONNECTED_TO]. Nodes represent intersections/endpoints and the relationships represent roads. The attribute intersection_id is indexed.
I've tried to solve the problem exemplified above for some time but didn't succeed. Without specifying a maximum depth for traversing, Neo4j's memory usage explodes and the query runs indefinitely until I cancel the operation or end the process (sometimes I need to do this because the system nearly freezes due to the lack of available memory). I understand this behavior because query complexity increases exponentially with each new depth level accessed.
However, the operation is still quite costly even if I set a maximum depth level, such as 15 levels. Depending on the start node, a traversal of 10 to 15 levels can involve either a large number of nodes, such as 1/4 of the database, or a relatively low number, such as a few thousand nodes (<2k). The behavior mentioned above happens in both cases.
I'm using the values with 20% or 30% distribution as filters, since it's difficult to map the values with 5% or 1% distribution due to the size of the graph. I tried filtering on the attribute num_hotels both with and without an index.
Below are some of the queries I've tried:
MATCH p=((a:Intersection {intersection_id: 562818})-[:CONNECTED_TO*]->(b:Intersection))
WHERE ALL (x IN NODES(p) WHERE x.num_hotels = 0)
RETURN p
MATCH path=(a:Intersection {intersection_id: 562818})-[:CONNECTED_TO*]->(b:Intersection {num_hotels: 0})
WHERE ALL(x IN NODES(path) WHERE SINGLE(y IN NODES(path) WHERE y = x))
RETURN path;
In some cases, where the number of nodes involved in the traversal was low (300-600), I obtained plausible results, but the queries didn't always execute normally. Transactions often seemed to freeze, so I had to end them and start again to get a result, and even that was not always guaranteed.
I would like tips to solve the problem, as well as some explanation about the behavior of Neo4j in this type of operation. The impression I've had so far is that Neo4j finds all the paths first and only then applies the filter, regardless of how I organize the query.
Neo4j version: 3.3.1
OS: Linux Mint Cinnamon 18.2
Memory: 6GB, about 4.5GB available for the tests.

Oracle query to count rows based on value from next record

Input values to the query : 1-20
Values in the database : 4,5, 15,16
I would like a query that gives me results as following
Value - Count
===== - =====
1 - 3
6 - 9
17 - 3
So basically: first generate the continuous numbers from 1 to 20, then count the available numbers in each gap.
I wrote a query but I cannot get it to fully work:
with avail_ip as (
SELECT (0) + LEVEL AS val
FROM DUAL
CONNECT BY LEVEL < 20),
grouped_tab as (
select val,lead(val,1,0) over (order by val) next_val
from avail_ip u
where not exists (
select 'x' from (select 4 val from dual) b
where b.val=u.val) )
select
val,next_val-val difference,
count(*) over (partition by next_val-val) avail_count
from grouped_tab
order by 1
It gives me the count, but I am not sure how to compress the rows down to just three rows.
I was not able to add the complete query; I kept getting an 'error occurred while submission' message. For some reason it does not like the union clause, so I am attaching the query as an image :(
More details of the exact requirement:
I am writing an IP management module and I need to find the available (free) IP addresses within an IP block. A block could be /16 or /24 or even /12. To make it even more challenging, I also support IPv6, so there will be far more numbers to manage. All issued IP addresses are stored in decimal format. So my thought is to first generate all IP decimals within the block range, from the network address to the broadcast address. For example, a /24 contains 256 addresses and a /16 about 64K.
Then, secondly, find all used addresses within the block and work out the number of available addresses from each starting IP. So in the above example, starting at IP 1, 3 addresses are available; starting at 6, 9 are available.
My last concern is that the query should run fast enough to work through millions of numbers.
And sorry again if my original question was not clear enough.
Similar sort of idea to what you tried:
with all_values as (
select :start_val + level - 1 as val
from dual
connect by level <= (:end_val - :start_val) + 1
),
missing_values as (
select val
from all_values
where not exists (select null from t42 where id = val)
),
chains as (
select val,
val - (row_number() over (order by val) + :start_val - 1) as chain
from missing_values
)
select min(val), count(*) as gap_count
from chains
group by chain
order by min(val);
With start_val as 1 and end_val as 20, and your data in table t42, that gets:
  MIN(VAL)  GAP_COUNT
---------- ----------
         1          3
         6          9
        17          4
I've made end_val inclusive though; not sure if you want it to be inclusive or exclusive. And I've perhaps made it more flexible than you need - your version also assumes you're always starting from 1.
The all_values CTE is basically the same as yours, generating all the numbers between the start and end values - 1 to 20 (inclusive!) in this case.
The missing_values CTE removes the values that are in the table, so you're left with 1,2,3,6,7,8,9,10,11,12,13,14,17,18,19,20.
The chains CTE does the magic part. This gets the difference between each value and where you would expect it to be in a contiguous list. The difference - what I've called 'chain' - is the same for all contiguous missing values; 1,2,3 all get 0, 6 to 14 all get 2, and 17 to 20 all get 4. That chain value can then be used to group by, and you can use the aggregate count and min to get the answer you need.
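If it helps to see the same gaps-and-islands trick outside SQL, here is a small Python sketch of the idea, using the question's numbers (purely illustrative, not part of the Oracle solution):
from itertools import groupby

used = {4, 5, 15, 16}                       # values already in the table
missing = [v for v in range(1, 21) if v not in used]

# Within a contiguous run of missing values, value - position is constant,
# which is exactly what the "chain" column computes.
for _, group in groupby(enumerate(missing), key=lambda t: t[1] - t[0]):
    run = [v for _, v in group]
    print(run[0], len(run))                 # prints: 1 3, then 6 9, then 17 4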
SQL Fiddle of a simplified version that is specifically for 1-20, showing the data from each intermediate step. This would work for any upper limit, just by changing the 20, but assumes you'll always start from 1.

Cost Optimization across Different Suppliers for a Product

I have the following optimization problem. A company produces a product, say Big A. To produce this product, it requires 5 processes (please find the detailed table below). For each process, there are a number of suppliers that supply raw material for that particular process. E.g. for process 1, there are 3 suppliers: 1, 2 & 3.
The constraint for the CEO of this company, say C, is that for each process he has to purchase supplies from Supplier 1 first, then any additional supplies from the 2nd supplier, and so on.
The optimization problem is: if C wants 700 units of total material to produce 1 unit of Big A, how will he do it at minimum cost? And how does the optimization change if the required amount increases to 1500 units?
I'll be grateful for a solution to this problem, but if somebody can suggest some references regarding this kind of problem, that would be a great help too. I'm mainly using R here.
Process Supplier Cost Units Cumm_Cost Cumm_Unit
1 1 10 100 10 100
1 2 20 110 30 210
1 3 10 200 40 410
2 1 20 100 20 100
2 2 30 150 50 250
2 3 10 150 60 400
3 1 40 130 40 130
3 2 30 140 70 270
3 3 50 120 120 390
4 1 20 120 20 120
4 2 40 120 60 240
4 3 20 180 80 420
5 1 30 180 30 180
5 2 10 160 40 320
5 3 30 140 70 460
Regards,
I will start by solving the specific problem that you have posted and then will demonstrate how to formulate the problem more abstractly. For simplicity, I will use Excel's Solver add-in to solve the problem, but any combination of a modeling language (such as AIMMS, AMPL, LINGO, OPL, MOSEL and numerous others) with a solver (CPLEX, GUROBI, GLPK, CBC and numerous others) can be used. If you would like to use R, there is an lpSolve package that calls the lpSolve solver (which is not the best one in the world, to be honest, but it is free of charge).
Note that for "real" (large-scale) integer problems, the commercial solvers CPLEX, GUROBI and XPRESS perform a lot better than the others. The first completely free solver that performs decently in most tests (including Hans Mittelmann's benchmark page) is CBC. CBC can be hooked up to Excel and solve the built-in Solver model without restrictions on the number of constraints or variables, by using this add-in. Therefore, assuming that most CPU time is going to be spent by the optimization algorithm, using CBC/OpenSolver seems like an efficient choice.
SPREADSHEET SETUP
I follow some conventions for convenience:
Decision variable cells are marked Green.
Constraints are marked red.
Data are marked grey.
Objective function is marked blue.
First, let's augment the table you presented as follows:
The added columns, explained briefly:
Selected?: equals 1 if the (Process, Supplier) combo is allowed to produce a positive quantity, zero otherwise.
Quantity: the quantity produced, defined for each (Process, Supplier) combo.
Max Quantity?: equals 1 if the Supplier produces the maximum amount of units for that particular Process.
Quantity UB: equals Units * Selected?. This makes the upper bound either equal to Units, when the Supplier is allowed to produce this Process, or zero otherwise.
Quantity LB: equals Units * Max Quantity?. This is to ensure that whenever the Max Quantity? column is 1, the produced quantity will be equal to Units.
Selection: for the 1st supplier, it equals 0. For the 2nd and 3rd suppliers, it equals the Max Quantity? of the previous supplier (row) minus the Selected? of the current supplier (row).
A screenshot with formulas:
There exist two more constraints:
There must be at least one item produced from each process and
The total number of items should be 700 (or later 1,500).
Here is their setup:
and here are the formulas:
In brief, we use SUMIF to sum the quantities for each process, which we are then going to constrain to be at least 1 item per process.
To finish the spreadsheet setup, we need to calculate the objective function, namely the cost of the allocation. This is easily done by taking the SUMPRODUCT of columns Quantity and Cost. Note that the cumulative quantities are derived data and not very useful in the current context.
After the above steps, the spreadsheet looks like below:
SOLVER MODEL
For the solver model we need to declare
The Objective
The Decisions
The Constraints
The Solver (and tweak some parameters if necessary).
For ease of exposition, I have given each range the name of its header. The solver model looks as follows:
It should all be self-explanatory, except possibly the Selection >= 0 part. The Selection column equals the binary Max Quantity? of the previous supplier minus the Selected? of the current supplier. Selection >= 0 means Max Quantity? of the previous supplier >= Selected? of the current supplier. Therefore, if the previous supplier does not produce at max quantity (binary = 0), the current supplier cannot produce.
Then we need to make sure that the solver settings are OK:
and solve the problem.
Solution for req = 700 :
As we can see, the model tries to avoid processes 3 and 5 as much as possible, and satisfies the "at least 1 item per process" constraint by picking exactly 1 item for processes 3 and 5. The objective function value is 11,710.
Solution for req = 1,500 :
Here we need more capacity, but process 3 still seems expensive, so the model tries to avoid it by allocating only what is necessary (just 1 unit, to supplier 1).
I hope this helps. The spreadsheet can be downloaded here. I include the definition of the mathematical model below, in case you would like to transfer it to another language.
MATHEMATICAL FORMULATION
A formal definition of your problem is as follows.
SETS:
PARAMETERS:
Decisions:
Objective:
Constraints:
Constraint explanation:
C1: A supplier cannot produce anything from a process if he has not been allocated to that process.
C2: If a supplier's maximum indicator is set to 1, then the production variable should be the maximum possible.
C3: We cannot select supplier s for process p if we have not produced the max quantity available from the previous supplier s-1.
C4: We need to produce at least 1 item from each process.
C5: the total production from all processes and suppliers should equal the required amount.
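For reference, here is one way the same model could be sketched in Python with the PuLP package (this is my own transcription of the formulation above, not the spreadsheet itself; in R, lpSolve/lpSolveAPI would play the same role):
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary, lpSum, value

# (process, supplier) -> (unit cost, units available), copied from the question's table
data = {
    (1, 1): (10, 100), (1, 2): (20, 110), (1, 3): (10, 200),
    (2, 1): (20, 100), (2, 2): (30, 150), (2, 3): (10, 150),
    (3, 1): (40, 130), (3, 2): (30, 140), (3, 3): (50, 120),
    (4, 1): (20, 120), (4, 2): (40, 120), (4, 3): (20, 180),
    (5, 1): (30, 180), (5, 2): (10, 160), (5, 3): (30, 140),
}
required = 700  # set to 1500 for the second scenario
processes = sorted({p for p, s in data})

prob = LpProblem("supplier_allocation", LpMinimize)
q   = {(p, s): LpVariable(f"q_{p}_{s}", lowBound=0) for (p, s) in data}      # quantity bought
sel = {(p, s): LpVariable(f"sel_{p}_{s}", cat=LpBinary) for (p, s) in data}  # supplier selected?
mx  = {(p, s): LpVariable(f"max_{p}_{s}", cat=LpBinary) for (p, s) in data}  # bought full capacity?

prob += lpSum(data[k][0] * q[k] for k in data)                   # objective: total cost

for (p, s), (cost, units) in data.items():
    prob += q[(p, s)] <= units * sel[(p, s)]                     # C1: nothing unless selected
    prob += q[(p, s)] >= units * mx[(p, s)]                      # C2: max flag forces full capacity
    if (p, s - 1) in data:
        prob += sel[(p, s)] <= mx[(p, s - 1)]                    # C3: exhaust the previous supplier first
for p in processes:
    prob += lpSum(q[(pp, s)] for (pp, s) in data if pp == p) >= 1  # C4: at least 1 item per process
prob += lpSum(q.values()) == required                            # C5: total requirement

prob.solve()
print("total cost:", value(prob.objective))                      # should reproduce the 11,710 reported above
for k in sorted(data):
    if q[k].value():
        print(k, q[k].value())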
Looks like you should look at the simplex algorithm (or some existing implementation of it).
Wikipedia has a fairly nice description of the algorithm, http://en.wikipedia.org/wiki/Simplex_algorithm

What data structure should be used while designing algo for multiplication and division problems?

Consider a basic example of multiplication, where 12*24 = 288. I am looking for a single data structure (or multiple data structures) where I can keep every piece of information from the intermediate steps performed during the multiplication, e.g. 2*4 fetches 8, 1*4 fetches 4, etc.
I need to store this intermediate information so I can tell the user exactly where he went wrong in his operations.
Focus first on the capability you need to provide.
For example, the user will enter one digit of his answer, and you need to check it and give feedback. Say, in 28 x 57, assuming you are teaching traditional "long multiplication", the user needs to multiply 28 by 7, recording 6 in the units, carrying 5, then writing 9 (remembering to add the carried 5), and then 1. Suppose he enters 4 in the tens column; you might want to say "Yes, 7 x 2 is 14, but don't forget to add the 5 you carried".
So to support this you need functions such as
getCorrectWorkingDigit( int leftDigitIndex, int rightDigitIndex)
In this case we'd call
getCorrectWorkingDigit( 1, 0 ) and get 9 as the answer
and
getWorkingCarryDigit( int leftDigitIndex, int rightDigitIndex)
so
getWorkingCarryDigit( 1, 0 ) and get 5 as the answer
You will need a few other such functions, including functions for the final answer's digits.
Now, what data structure would allow you to do this? Your requirement is to enable those functions to be implemented. Clearly you could build some kind of array of objects representing each working position and each position in the final answer. But I think that's overkill: you can implement the functions directly against the question. All you actually need are the two integers (28 and 57 in my example); you can compute the function values on the fly, with no need to keep the target.
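As a rough Python sketch of that on-the-fly approach (the function and parameter names follow the ones above, but the exact signatures are my assumption):
def working_row(left, right_digit):
    # Digits of left * right_digit, least significant first: 28 * 7 -> [6, 9, 1]
    return [int(d) for d in reversed(str(left * right_digit))]

def get_correct_working_digit(left, right, left_digit_index, right_digit_index):
    # Digit the user should write in column left_digit_index of the working row
    # produced by digit right_digit_index of the right operand.
    right_digit = int(str(right)[::-1][right_digit_index])
    row = working_row(left, right_digit)
    return row[left_digit_index] if left_digit_index < len(row) else 0

def get_working_carry_digit(left, right, left_digit_index, right_digit_index):
    # Carry brought into column left_digit_index of that working row.
    right_digit = int(str(right)[::-1][right_digit_index])
    left_digits = [int(d) for d in str(left)[::-1]]
    carry = 0
    for i in range(left_digit_index):
        carry = (left_digits[i] * right_digit + carry) // 10
    return carry

print(get_correct_working_digit(28, 57, 1, 0))  # 9
print(get_working_carry_digit(28, 57, 1, 0))    # 5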
Having written all that, I've just realised that you probably also want to keep the values the user entered, and for that a data structure might be useful; keeping the individual digits will be convenient.
For each "row" of working, and for the final result, how about an array of digits, where the index corresponds to the power of 10, so 196 is represented as
[6, 9, 1]
and for the working rows, put those in a map keyed by the power of ten of the right-hand digit. In my 28 x 57 example:
1 -> [6, 9, 1] // this is 7 x 28
10 -> [0, 4, 1] // this is 5 x 28
