Recursive query with sub-graph aggregation - recursion

I am trying to use Neo4j to write a query that aggregates quantities along a particular sub-graph.
We have two stores, Store1 and Store2, one with supplier S1 and the other with supplier S2. We move 100 units from Store1 into Store3 and 200 units from Store2 into Store3.
We then move 100 units from Store3 to Store4. So now Store4 has 100 units, of which approximately 33 originated from supplier S1 and 66 from supplier S2.
I need the query to return this information, e.g.
S1, 33
S2, 66
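For reference, a minimal Cypher setup for this scenario (a sketch; it assumes the MOVE_TO relationship type and the Name, Supplier and Quantity property names used in the queries below):
CREATE (s1:Store { Name: 'Store1', Supplier: 'S1' }),
       (s2:Store { Name: 'Store2', Supplier: 'S2' }),
       (s3:Store { Name: 'Store3' }),
       (s4:Store { Name: 'Store4' }),
       (s1)-[:MOVE_TO { Quantity: 100 }]->(s3),
       (s2)-[:MOVE_TO { Quantity: 200 }]->(s3),
       (s3)-[:MOVE_TO { Quantity: 100 }]->(s4)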
I have a recursive query to aggregate all the movements along each path
MATCH p=(store1:Store)-[m:MOVE_TO*]->(store2:Store { Name: 'Store4'})
RETURN store1.Supplier, reduce(amount = 0, n IN relationships(p) | amount + n.Quantity) AS reduction
Returns:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 200       |
| S2              | 300       |
| null            | 100       |
Desired:
| store1.Supplier | reduction |
|-----------------|-----------|
| S1              | 33.33     |
| S2              | 66.67     |

What about this one:
MATCH (s:Store) WHERE s.name = 'Store4'
MATCH (s)<-[t:MOVE_TO]-()<-[r:MOVE_TO]-(supp)
WITH t.qty as total, collect(r) as movements
WITH total, movements, reduce(totalSupplier = 0, r IN movements | totalSupplier + r.qty) as supCount
UNWIND movements as movement
RETURN startNode(movement).name as supplier, round(100.0*movement.qty/supCount) as pct
Which returns:
supplier pct
Store1 33
Store2 67
Returned 2 rows in 151 ms

So the following is pretty ugly, but it works for the example you've given.
MATCH (s4:Store { Name:'Store4' })<-[r1:MOVE_TO]-(s3:Store)<-[r2:MOVE_TO*]-(s:Store)
WITH s3, r1.Quantity as Factor, SUM(REDUCE(amount = 0, r IN r2 | amount + r.Quantity)) AS Total
MATCH (s3)<-[r1:MOVE_TO*]-(s:Store)
WITH s.Supplier as Supplier, REDUCE(amount = 0, r IN r1 | amount + r.Quantity) AS Quantity, Factor, Total
RETURN Supplier, Quantity, Total, toFloat(Quantity) / toFloat(Total) * Factor as Proportion
I'm sure it can be improved.

Related

How to combine R data frames with non-exact criteria (greater/less than condition)

I have to combine two R data frames which have trade and quote information. Like a join, but based on a timestamp in seconds. I need to match each trade with the most recent quote. There are many more quotes than trades.
I have this table with stock quotes. The Timestamp is in seconds:
+--------+-----------+-------+-------+
| Symbol | Timestamp | bid | ask |
+--------+-----------+-------+-------+
| IBM | 10 | 132 | 133 |
| IBM | 20 | 132.5 | 133.3 |
| IBM | 30 | 132.6 | 132.7 |
+--------+-----------+-------+-------+
And these are trades:
+--------+-----------+----------+-------+
| Symbol | Timestamp | quantity | price |
+--------+-----------+----------+-------+
| IBM | 25 | 100 | 132.5 |
| IBM | 31 | 80 | 132.7 |
+--------+-----------+----------+-------+
I think a native R function or dplyr could do it - I've used both for basic purposes but not sure how to proceed here. Any ideas?
So the trade at 25 seconds should match with the quote at 20 seconds, and the trade #31 matches the quote #30, like this:
+--------+-----------+----------+-------+-------+-------+
| Symbol | Timestamp | quantity | price | bid | ask |
+--------+-----------+----------+-------+-------+-------+
| IBM | 25 | 100 | 132.5 | 132.5 | 133.3 |
| IBM | 31 | 80 | 132.7 | 132.6 | 132.7 |
+--------+-----------+----------+-------+-------+-------+
Consider merging on a calculated field by increments of 10. Specifically, calculate a column for multiples of 10 in both datasets, and merge on that field with Symbol.
Below transform and within are used to assign and de-assign the helper field, mult10. In this use case, both base functions are interchangeable:
final_df <- transform(merge(within(quotes, mult10 <- floor(Timestamp / 10) * 10),
                            within(trades, mult10 <- floor(Timestamp / 10) * 10),
                            by = c("Symbol", "mult10")),
                      mult10 = NULL)
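If you prefer dplyr (which you mention in the question), an equivalent sketch of the same bucketing idea, assuming your data frames are named quotes and trades:
library(dplyr)
quotes10 <- mutate(quotes, mult10 = floor(Timestamp / 10) * 10)
trades10 <- mutate(trades, mult10 = floor(Timestamp / 10) * 10)
# join on Symbol plus the shared 10-second bucket, then drop the helper column
final_df <- select(inner_join(trades10, quotes10, by = c("Symbol", "mult10")), -mult10)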
Now if a multiple of 10 does not suit your needs, adjust it to the level you require, such as 15, 5, 2, etc.
within(quotes, mult10 <- floor(Timestamp / 15) * 15)
within(quotes, mult10 <- floor(Timestamp / 5) * 5)
within(quotes, mult10 <- floor(Timestamp / 2) * 2)
You may even need to go the other way and use ceiling and floor for the two data sets respectively, to calculate the nearest multiple above the quote's Timestamp and the nearest multiple below the trade's Timestamp:
within(quotes, mult10 <- ceiling(Timestamp / 15) * 15)
within(trades, mult10 <- floor(Timestamp / 5) * 5)
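If you need the exact most-recent quote for each trade rather than a shared bucket, here is a base R sketch using findInterval (assumptions: a single Symbol as in your example, quotes sorted by Timestamp, and every trade preceded by at least one quote):
quotes <- quotes[order(quotes$Timestamp), ]               # ensure quotes are in time order
idx <- findInterval(trades$Timestamp, quotes$Timestamp)   # index of the last quote at or before each trade
final_df <- cbind(trades, quotes[idx, c("bid", "ask")])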

SQLite find table row where a subset of columns satisfies a specified constraint

I have the following SQLite table
CREATE TABLE visits(urid INTEGER PRIMARY KEY AUTOINCREMENT,
hash TEXT,dX INTEGER,dY INTEGER,dZ INTEGER);
Typical content would be
# select * from visits;
urid | hash | dx | dY | dZ
------+-----------+-------+--------+------
1 | 'abcd' | 10 | 10 | 10
2 | 'abcd' | 11 | 11 | 11
3 | 'bcde' | 7 | 7 | 7
4 | 'abcd' | 13 | 13 | 13
5 | 'defg' | 20 | 21 | 17
What I need to do here is identify the urid for the table row which satisfies the constraint
hash = 'abcd' AND nearby >= (abs(dX - tX) + abs(dY - tY) + abs(dZ - tZ))
with the smallest deviation - in the sense of smallest sum of absolute distances
In the present instance with
nearby = 7
tX = tY = tZ = 12
there are three rows that meet the above constraint but with different deviations
urid | hash   | dX | dY | dZ | deviation
-----+--------+----+----+----+----------
1    | 'abcd' | 10 | 10 | 10 | 6
2    | 'abcd' | 11 | 11 | 11 | 3
4    | 'abcd' | 13 | 13 | 13 | 3
in which case I would like to have reported urid = 2 or urid = 4 - I don't actually care which one gets reported.
Left to my own devices I would fetch the full set of matching rows and then drill down to the one that matches my secondary constraint - smallest deviation - in my own Java code. However, I suspect that is not necessary and that it can be done in SQL alone. My knowledge of SQL is sadly too limited here. I hope that someone here can put me on the right path.
I now have managed to do the following
CREATE TEMP TABLE h1(v1 INTEGER,v2 INTEGER);
INSERT INTO h1 SELECT urid, (abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) AS devi FROM visits WHERE hash = 'abcd';
which gives
--SELECT * FROM h1
urid | devi |
-------+-----------+
1 | 6 |
2 | 3 |
4 | 3 |
following which I issue
select urid from h1 order by v2 asc limit 1;
which yields urid = 2, the result I am after. Whilst this works, I would like to know if there is a better/simpler way of doing this.
You're so close! You have all of the components you need, you just have to put them together into a single query.
Consider:
SELECT urid
, (abs(dx - :tx) + abs(dy - :ty) + abs(dz - :tz)) AS devi
FROM visits
WHERE hash=:hashval AND devi < :nearby
ORDER BY devi
LIMIT 1
Line by line: first you list the columns and computed values you want from the visits table (:tx, :ty, :tz, :hashval and :nearby are placeholders; in your code you would prepare the statement and then bind values to the placeholders before executing it).
Then in the WHERE clause you restrict what rows get returned to those matching the particular hash (That column should have an index for best results... CREATE INDEX visits_idx_hash ON visits(hash) for example), and that have a devi that is less than the value of the :nearby placeholder. (I think devi < :nearby is clearer than :nearby >= devi).
Then you say that you want those results sorted in increasing order according to devi, and LIMIT the returned results to a single row because you don't care about any others (If there are no rows that meet the WHERE constraints, nothing is returned).
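For the concrete numbers in your question (tX = tY = tZ = 12, nearby = 7), the query with the values filled in would look like this; it should report urid = 2 (or urid = 4, since the tie at devi = 3 is broken arbitrarily):
SELECT urid
     , (abs(dX - 12) + abs(dY - 12) + abs(dZ - 12)) AS devi
FROM visits
WHERE hash = 'abcd' AND devi < 7
ORDER BY devi
LIMIT 1;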

Levenshtein logic to get all the strings with minimum difference

Suppose I have a data frame with values
Mtemp:
-----+
code |
-----+
Ram |
John |
Tracy|
Aman |
I want to compare it with the data frame
M2:
------+
code |
------+
Vivek |
Girish|
Rum |
Rama |
Johny |
Stacy |
Jon |
I want to get a result such that for each value in Mtemp I get at most 2 possible matches in M2 within Levenshtein distance 2.
I have used
tp<-as.data.frame(amatch(Mtemp$code,M2$code,method = "lv",maxDist = 2))
tp$orig<-Mtemp$code
colnames(tp)<-c('Res','orig')
and I am getting the result as follows
Res |orig
-----+-----
3 |Ram
5 |John
6 |Tracy
4 |Aman
Please let me know a way to get 2 values (if possible) for every Mtemp string with Levenshtein distance = 2.
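One way to get up to two matches per string is to build the full distance matrix instead of using amatch (which only ever returns one match per element). A sketch using stringdistmatrix from the same stringdist package, assuming the Mtemp and M2 data frames shown above:
library(stringdist)
# Levenshtein distance matrix: rows = Mtemp$code, columns = M2$code
d <- stringdistmatrix(Mtemp$code, M2$code, method = "lv")
# for each Mtemp string, keep at most the 2 closest M2 strings within distance 2
res <- lapply(seq_len(nrow(d)), function(i) {
  hits <- which(d[i, ] <= 2)
  hits <- head(hits[order(d[i, hits])], 2)
  if (length(hits) == 0) return(NULL)
  data.frame(orig = Mtemp$code[i], match = M2$code[hits], dist = d[i, hits])
})
do.call(rbind, res)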

Access 2010: trying to create a pie chart from one row of data

I have data that looks like this
Sum Of a | Sum Of b | Sum Of C | Sum Of d
100 | 200 | 300 | 400
In order to create a pie chart I need to change it to a format something like this
Sum Of | Value
a | x
c | y
d | z
Question: how can I create a new table like that from the first table, by query or any other suggestion?
I was able to find this link that I found useful on creating charts from MS Access.
Here we see that the trick is to create a saved UNION query named [SomethingDataForPieChart]...
SELECT "DONE" AS PieCategory, [DONE] AS PieValue, [AREA] FROM [TABLE]
UNION ALL
SELECT "REMAIN" AS PieCategory, [REMAIN] AS PieValue, [AREA] FROM [TABLE]
...returning...
PieCategory | PieValue | AREA
----------- | -------- | -----
DONE | 100 | AREA1
DONE | 200 | AREA2
DONE | 200 | AREA3
REMAIN | 200 | AREA1
REMAIN | 300 | AREA2
REMAIN | 700 | AREA3
...and this is how you start to do pie chart.
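Applied to your one-row table, a sketch of the same trick might look like this (assuming the source table is named [YourTable]):
SELECT "a" AS PieCategory, [Sum Of a] AS PieValue FROM [YourTable]
UNION ALL
SELECT "b", [Sum Of b] FROM [YourTable]
UNION ALL
SELECT "c", [Sum Of c] FROM [YourTable]
UNION ALL
SELECT "d", [Sum Of d] FROM [YourTable]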
Please read How to add a pie chart to my Access report as it has many images and step-by-step instructions.
Credit to Gord Thompson for this answer.

Neo4j shortest path with rels in both directions

I have a graph set up with the statement...
create (a:station {name:"a"}),
(b:station {name:"b"}),
(c:station {name:"c"}),
(d:station {name:"d"}),
(e:station {name:"e"}),
(f:station {name:"f"}),
(a)-[:CONNECTS_TO {time:8}]->(b),
(a)-[:CONNECTS_TO {time:4}]->(c),
(a)-[:CONNECTS_TO {time:10}]->(d),
(b)-[:CONNECTS_TO {time:3}]->(c),
(b)-[:CONNECTS_TO {time:9}]->(e),
(c)-[:CONNECTS_TO {time:40}]->(f),
(d)-[:CONNECTS_TO {time:5}]->(e),
(e)-[:CONNECTS_TO {time:3}]->(f)
and using the query
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p =(startStation)-[r*]->(endStation)
WITH extract(x IN rels(p)| x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`, extract(x IN nodes(p)| x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`
and it returns the results...
+-------------------------------------------------------------+
| Route | Times | Total Time | Number of Stops |
+-------------------------------------------------------------+
| ["a","d","e","f"] | [10,5,3] | 18 | 3 |
| ["a","b","e","f"] | [8,9,3] | 20 | 3 |
| ["a","c","f"] | [4,40] | 44 | 2 |
| ["a","b","c","f"] | [8,3,40] | 51 | 3 |
+-------------------------------------------------------------+
This is fine, except that because it is a directed graph and there is no path from c -> b, it doesn't return (for instance) [a, c, b, e, f], which is a valid path of length 4.
So, if I add the inverse paths...
MATCH (START)-[r:CONNECTS_TO]->(END )
CREATE UNIQUE (START)<-[:CONNECTS_TO { time:r.time }]-(END )
and run the query again, I get the following (for paths of length 1..4)...
+---------------------------------------------------------------------+
| Route | Times | Total Time | Number of Stops |
+---------------------------------------------------------------------+
| ["a","d","e","f"] | [10,5,3] | 18 | 3 |
| ["a","c","b","e","f"] | [4,3,9,3] | 19 | 4 |
| ["a","b","e","f"] | [8,9,3] | 20 | 3 |
| ["a","c","f"] | [4,40] | 44 | 2 |
| ["a","c","b","c","f"] | [4,3,3,40] | 50 | 4 |
| ["a","c","f","e","f"] | [4,40,3,3] | 50 | 4 |
| ["a","b","c","f"] | [8,3,40] | 51 | 3 |
| ["a","b","a","c","f"] | [8,8,4,40] | 60 | 4 |
| ["a","d","a","c","f"] | [10,10,4,40] | 64 | 4 |
+---------------------------------------------------------------------+
This does include the path [a, c, b, e, f], but it also includes [a, c, b, c, f], which uses c twice, and [a, c, f, e, f], which uses f (the destination?!) twice.
Is there a way of filtering the paths so each path only includes the same node once?
You could do the filtering after the fact, but it might not be the fastest thing.
Something like this:
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p = (startStation)-[r*..4]->(endStation)
WHERE length(reduce (a=[startStation], n IN nodes(p) | CASE WHEN n IN a THEN a ELSE a + n END)) = length(nodes(p))
WITH extract(x IN rels(p)| x.time) AS Times, length(p) AS `Number of Stops`, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`, extract(x IN nodes(p)| x.name) AS Route
RETURN Route, Times, `Total Time`, `Number of Stops`
ORDER BY `Total Time`
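An alternative sketch of the same "no repeated nodes" check, using ALL and SINGLE instead of reduce (it should behave like the filter above, so treat it as an equivalent formulation rather than a faster one):
START startStation=node:node_auto_index(name = "a"), endStation=node:node_auto_index(name = "f")
MATCH p = (startStation)-[r*..4]->(endStation)
WHERE ALL(n IN nodes(p) WHERE SINGLE(m IN nodes(p) WHERE m = n))
RETURN extract(x IN nodes(p)| x.name) AS Route, reduce(totalTime = 0, x IN rels(p)| totalTime + x.time) AS `Total Time`
ORDER BY `Total Time`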
I created a GraphGist with your question and answers as an executable, live document.
See here: Neo4j shortest path with rels in both directions
