Graph DB get the next best recommended node in Neo4j cypher - graph

I have a graph in Neo4j and am currently trying to build a simple recommendation system that is better than text-based search.
Nodes are created such as: Album, People, Type, Chart
Relationships are created such as:
People - [:role] -> Album
where roles are: Artist, Producer, Songwriter
Album-[:is_a_type_of]->Type (type is basically Pop, Rock, Disco...)
People -[:POPULAR_ON]->Chart (Chart is which Billboard they might have been)
People -[:SIMILAR_TO]->People (Predetermined similarity connection)
I have written the following cypher:
MATCH (a:Album { id: { id } })-[:is_a_type_of]->(t)<-[:is_a_type_of]-(recommend)
WITH recommend, t, a
MATCH (recommend)<-[:ARTIST_OF]-(p)
OPTIONAL MATCH (p)-[:POPULAR_ON]->()
RETURN recommend, count(DISTINCT t) AS type
ORDER BY type DESC
LIMIT 25;
It works; however, it easily repeats itself when an album has only one type of music connected to it, because all such albums then share the same neighbors.
Is there a suggested way to say:
Find me the next best album, the one with the most connected relationships in common with the starting Album.
Any recommendation for a tie-breaker scenario? Right now it is ordered by type (so an album with more than one type of music is valued more, but if every album has the same number, nothing distinguishes them any more).
- I made the [:SIMILAR_TO] link to give that relationship priority, but I haven't got a working Cypher query that uses it.
- The same goes for [:POPULAR_ON] (maybe drop this relationship?).

You can define four weights, one per case below, and order albums by the resulting score. Keep each weight between 0 and 1 (e.g. 0.6).
a. People popular on a Chart and similar
b. People popular on a Chart and not similar
c. People not popular on a Chart and similar
d. People not popular on a Chart and not similar
For each album, multiply the number of matching People in each case by its weight and sum the four values. The higher the sum, the more strongly the album is recommended.
I have temporarily set the weights to a = 1, b = 0.8, c = 0.6, d = 0.4, and assumed that a relationship exists which records that some People like an Album (LIKES in the query below). If your logic is based on Chart only, use a and b only.
MATCH (me:People)
WHERE id(me) = 123
// the starting album is named "album" so it does not clash with the People variable a
MATCH (album:Album { id: 456 })-[:is_a_type_of]->(t:Type)<-[:is_a_type_of]-(recommend)
OPTIONAL MATCH (recommend)<-[:ARTIST_OF]-(a:People)-[:POPULAR_ON]->(:Chart)
WHERE exists((me)-[:SIMILAR_TO]->(a))
OPTIONAL MATCH (recommend)<-[:ARTIST_OF]-(b:People)-[:POPULAR_ON]->(:Chart)
WHERE NOT exists((me)-[:SIMILAR_TO]->(b))
OPTIONAL MATCH (recommend)<-[:LIKES]-(c:People)
WHERE exists((me)-[:SIMILAR_TO]->(c))
OPTIONAL MATCH (recommend)<-[:LIKES]-(d:People)
WHERE NOT exists((me)-[:SIMILAR_TO]->(d))
// DISTINCT keeps the stacked OPTIONAL MATCHes from inflating the counts
RETURN recommend, (count(DISTINCT a)*1 + count(DISTINCT b)*0.8 + count(DISTINCT c)*0.6 + count(DISTINCT d)*0.4) AS rec_order
ORDER BY rec_order DESC
LIMIT 10;
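To make the weighting concrete, here is a minimal Python sketch of the scoring step only, assuming the four counts per album have already been fetched from the graph. The names `rec_order` and `rank` and the sample counts are made up for illustration; the weights are the temporary ones given above.

```python
# Weights from the answer above: a = 1, b = 0.8, c = 0.6, d = 0.4.
WEIGHTS = {"a": 1.0, "b": 0.8, "c": 0.6, "d": 0.4}


def rec_order(counts):
    """Weighted sum of the four per-album counts (missing cases count as 0)."""
    return sum(WEIGHTS[case] * counts.get(case, 0) for case in WEIGHTS)


def rank(albums, limit=10):
    """Album names by descending score; ties are broken by name for stable output."""
    return sorted(albums, key=lambda name: (-rec_order(albums[name]), name))[:limit]


# Hypothetical counts, not real data.
sample = {
    "Album X": {"a": 2, "b": 1},  # 2*1.0 + 1*0.8
    "Album Y": {"b": 3},          # 3*0.8
    "Album Z": {"c": 1, "d": 4},  # 1*0.6 + 4*0.4
}

print(rank(sample))
```

Breaking ties on a second key (here the name, but it could be any other signal) is one way to address the tie-breaker concern from the question.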


Invisible graphs cause report to slow

I have a report with a parameter where the end user chooses a practice name that corresponds to a group of people. Most of these groups have fewer than 10 people, but a small number of them have as many as 150. When there are more than 15 people in a given group, they want separate graphs, each with no more than 15 people. So for most of the groups, we only need one graph. For a few, we need a lot of graphs.
Behind the scenes, I created a graph for each multiple of 15 people, and set them to only be visible if there are actually that many people in the group. This does what I need it to, but it makes the report super slow. As near as I can tell, when an end user runs the report it's still somehow rendering the hidden graphs and slowing it all to heck. (I did find this link, which I think suggests this is a known bug.)
I need to have one report where the end user selects the practice name, so I can't make two reports, "My practice is normal" and "My practice is ginormous". I thought maybe I could make a conditional sub-report split into those two reports based on the practice name parameter, but that doesn't appear to be possible; you can play around with visibility but I'm guessing that will still cause the invisible graph rendering problem and not help my speed.
Are there any other cool tips I can try to speed up my report, or is this just a case of too many graphs spoiling the broth?
The easiest way would be to generate a group number for every 15 people and then use a list control to repeat the chart for each group.
Here's a very quick example of this in action. I just used some sample data from one of the Adventure Works sample databases.
Here's my query that returns every person in each selected department. Note that I have commented out the DECLAREs as these were just in there for testing.
--DECLARE @Department varchar(50) = ''
--DECLARE @chartMax int = 5
SELECT
    GroupName, v.Department, v.FirstName, v.LastName
    , ChartGroup = (ROW_NUMBER() OVER(PARTITION BY Department ORDER BY LastName, FirstName)-1) / @chartMax -- calc which chart number the person belongs to
    , Salary = ((ABS(CHECKSUM(NewId())) % 100) * 500) + (ABS(CHECKSUM(NewId())) % 1000) + 10000 -- Just some random number to plot
FROM [HumanResources].[vEmployeeDepartment] v
WHERE Department IN (@Department)
ORDER BY Department
The key bit is the ChartGroup column
ChartGroup = (ROW_NUMBER() OVER(PARTITION BY Department ORDER BY LastName, FirstName)-1) / @chartMax
This will give the first 5 rows in each department a ChartGroup of 0, the next 5 a ChartGroup of 1, and so on. I used 5 rather than 15 just so it's easier to demo.
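The integer division behind ChartGroup can be checked outside SQL. The following Python sketch mirrors the expression for a single department, assuming row numbers start at 1 as ROW_NUMBER() does; `chart_group` is a name made up for this illustration.

```python
def chart_group(row_number, chart_max):
    """Mirror (ROW_NUMBER() - 1) / chartMax using integer division."""
    return (row_number - 1) // chart_max


# Twelve people in one department, five people per chart.
groups = [chart_group(n, 5) for n in range(1, 13)]
print(groups)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
```

Subtracting 1 before dividing is what keeps the fifth row in group 0 rather than pushing it into group 1.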
Here's the dataset results
Now, in your report, add a List and set its dataset property to the dataset containing your main data (the query above in my case).
Now edit the 'details' row group properties and add a grouping by Practice and ChartGroup (Department and ChartGroup in this example).
In the list box's textbox, right-click then insert a chart.
Set the chart up as required, in my example, I used salary as the values on a pie chart and the employee names as the labels.
Here's the final design ..
Note that I set the department as a multi-value parameter and also set the number of persons per chart (chartMax) as a report parameter.
When I preview the report I get this for 'Engineering' which has 6 employees
Sales has 18 employees so we get this
...and so on; it will generate a new chart for every 15 people or part thereof.

Cypher Query - Excluding certain relationships

I am querying my graph where it has the following nodes:
Customer
Account
Fund
Stock
With the following relationships:
HAS (a customer HAS an account)
PURCHASED (an account PURCHASES a fund or stock)
HOLDS (a fund HOLDS a stock)
The query I am trying to achieve is returning all Customers that have accounts that hold Microsoft through a fund. The following is my query:
MATCH (c:Customer)-[h:HAS]->(a:Account)-[p:PURCHASED]-(f:Fund)-[holds:HOLDS]->(s:Stock {ticker: 'MSFT'})
WHERE exists((f)-[:HOLDS]->(s:Stock))
AND exists ((f:Fund)-[holds]->(s:Stock))
AND NOT exists((a:Account {account_type: 'Individual'})-[p:PURCHASED]->(s:Stock))
RETURN *
This almost gets me the desired results but I keep getting 2 relationships out of the Microsoft stock that is tied to an Individual account where I do not want those included.
Any help would be greatly appreciated!
Result:
Desired Result:
There are duplications in your query. Lines 2 and 3 are the same, and line 2 is a subgraph of line 1. You are also using the variables a, p and s more than once, in line 1 and line 4. The query below is not tested, but give it a try. Please tell me whether it works for you or not.
MATCH (c:Customer)-[h:HAS]->(a:Account)-[p:PURCHASED]-(f:Fund)-[holds:HOLDS]->(s:Stock {ticker: 'MSFT'})
WHERE NOT exists((:Account{account_type: 'Individual'})-[:PURCHASED]->(:Stock))
RETURN *
It seems to me that you should just uncheck the "Connect result nodes" option in the Neo4j Browser.

Arango DB performance: edge vs. DOCUMENT()

I'm new to ArangoDB with graphs. I simply want to know whether it is faster to build edges or to use DOCUMENT() for very simple 1:1 connections where querying the graph is not needed.
LET a = DOCUMENT(@from)
FOR v IN OUTBOUND a
    CollectionAHasCollectionB
    RETURN MERGE(a,{b:v})
vs
LET a = DOCUMENT(@from)
RETURN MERGE(a,{b:DOCUMENT(a.bId)})
A simple benchmark you can try:
Create the collections products, categories and an edge collection has_category. Then generate some sample data:
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Product ", i)} INTO products
FOR i IN 1..10000
    INSERT {_key: TO_STRING(i), name: CONCAT("Category ", i)} INTO categories
FOR p IN products
    LET random_categories = (
        FOR c IN categories
            SORT RAND()
            LIMIT 5
            RETURN c._id
    )
    LET category_subset = SLICE(random_categories, 0, RAND()*5+1)
    UPDATE p WITH {
        categories: category_subset,
        categoriesEmbedded: DOCUMENT(category_subset)[*].name
    } INTO products
    FOR cat IN category_subset
        INSERT {_from: p._id, _to: cat} INTO has_category
Then compare the query times for the different approaches.
Graph traversal (depth 1..1):
FOR p IN products
    RETURN {
        product: p.name,
        categories: (FOR v IN OUTBOUND p has_category RETURN v.name)
    }
Look-up in categories collection using DOCUMENT():
FOR p IN products
    RETURN {
        product: p.name,
        categories: DOCUMENT(p.categories)[*].name
    }
Using the directly embedded category names:
FOR p IN products
    RETURN {
        product: p.name,
        categories: p.categoriesEmbedded
    }
Graph traversal is the slowest of the three; the lookup in another collection is faster than the traversal, but by far the fastest query is the one with embedded category names.
If you query the categories for just one or a few products, however, the response times should be in the sub-millisecond range regardless of the data model and query approach, and therefore should not pose a performance problem.
The graph approach should be chosen if you need to query for paths with variable depth, long paths, shortest path etc. For your use case, it is not necessary. Whether the embedded approach is suitable or not is something you need to decide:
Is it acceptable to duplicate information, and potentially have inconsistencies in the data? (If you want to change the category name, you need to change it in all product records instead of just one category document, that products can refer to via the immutable ID)
Is there a lot of additional information per category? If so, all that data needs to be embedded into every product document that has that category - basically trading memory / storage space for performance
Do you need to retrieve a list of all (distinct) categories often? You can do this type of query really cheap with the separate categories collection. With the embedded approach, it will be much less efficient, because you need to go over all products and collect the category info.
Bottom line: you should choose the data model and approach that fits your use case best. Thanks to ArangoDB's multi-model nature you can easily try another approach if your use case changes or you run into performance issues.
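The consistency trade-off between referencing and embedding can be shown with a small Python sketch. It uses plain dictionaries as hypothetical stand-ins for the collections, with made-up category names; nothing here is an ArangoDB API call.

```python
# In-memory stand-in for the categories collection.
categories = {
    "categories/1": {"name": "Toys"},
    "categories/2": {"name": "Games"},
}

# Reference model: the product stores category ids and resolves them when read.
product_ref = {"name": "Product 1", "categories": ["categories/1", "categories/2"]}

# Embedded model: the product stores its own copies of the category names.
product_emb = {"name": "Product 1", "categoriesEmbedded": ["Toys", "Games"]}


def names_by_reference(product):
    """Look up each referenced category, similar in spirit to DOCUMENT(p.categories)[*].name."""
    return [categories[cid]["name"] for cid in product["categories"]]


# Rename one category in its own document only.
categories["categories/1"]["name"] = "Kids"

print(names_by_reference(product_ref))    # picks up the rename
print(product_emb["categoriesEmbedded"])  # stale until every product is rewritten
```

The embedded copy is the cheapest to read, but it stays out of date until each product document that holds it has been updated.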
Generally speaking, the latter variant
LET a = DOCUMENT(@from)
RETURN MERGE(a,{b:DOCUMENT(a.bId)})
should have lower overhead than the full-featured traversal variant. This is because the DOCUMENT variant will do a point lookup of a document whereas the traversal variant is very general purpose: it can return zero to many results from a variable number of collections, needs to keep track of the path seen etc.
When I tried both variants in a local test case, the non-traversal variant was also a lot faster, supporting this claim.
However, the traversal-based variant is more flexible: it can also be used should there be multiple edges (no 1:1 mapping) and for longer paths.

Get the counts of all first generation nodes neo4j

My data structure is very simple: one label, Customer, with a one-to-one, one-directional "referred" relationship to another customer. What is the correct query to retrieve, for each node, the counts of referred nodes at every degree that resulted from it?
In other words, if the database consisted of
CustomerA referred CustomerB,
CustomerB referred CustomerC
the resulting table should be:
Customer  1st gen referrals  2nd gen referrals
A         1                  1
B         1                  0
C         0                  0
You could match on nodes and find the sizes of the desired patterns:
MATCH (c:Customer)
RETURN c as Customer,
size((c)-[:REFERRED]->()) as firstGenRef,
size((c)-[:REFERRED*2]->()) as secondGenRef
EDIT
As far as returning the counts of all levels of referrals, that's likely going to be an expensive query, depending on how interconnected your data is.
You can give this a try, and if it takes too long or hangs, you may want to switch to APOC Procedures, specifically apoc.path.spanningTree(), which uses NODE_GLOBAL uniqueness to retain only a single path to each node encountered; that usually performs better.
MATCH (c:Customer)-[r:REFERRED*]->(ref)
WITH DISTINCT c, size(r) as gen, ref
WITH c, gen, count(gen) as referrals
ORDER BY gen ASC
RETURN c as Customer, collect({gen:gen, referrals:referrals}) as referrals
This will get you each customer on a row, along with collected maps of the generation and number of referrals at each generation, down to the maximum generation depth per customer.
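As a rough cross-check of the expected numbers, here is a Python sketch that counts referrals per generation for the A, B, C example from the question. It walks the graph level by level and counts each customer once, at the shallowest generation where it appears, which is closer to the NODE_GLOBAL behaviour mentioned above than to the variable-length MATCH. The dictionary `referred` and the function name are assumptions made for this illustration.

```python
# The example from the question: A referred B, B referred C.
referred = {"A": ["B"], "B": ["C"], "C": []}


def referrals_by_generation(start):
    """Map generation number -> count of customers first reached at that depth."""
    counts = {}
    seen = {start}
    frontier = [start]
    gen = 0
    while frontier:
        gen += 1
        next_frontier = []
        for customer in frontier:
            for ref in referred.get(customer, []):
                if ref not in seen:  # count each customer only once
                    seen.add(ref)
                    next_frontier.append(ref)
        if next_frontier:
            counts[gen] = len(next_frontier)
        frontier = next_frontier
    return counts


for customer in referred:
    print(customer, referrals_by_generation(customer))
```

Generations with no referrals are simply absent from the result, which matches the collected maps stopping at the maximum depth per customer.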

Adding a new user to neo4j

A totally neo4j noob is talking here,
I like to create a graph to store a set of users, a typical user is as follows:
CREATE
(node_1 {FullName:"Peter Parker",FirstName:"peter",FamilyName:"parker"}),
(node_2 {Address:"Newyork",CountryCode:"US"}),
(node_3 {Location:"Hidden"}),
(node_4 {phoneNumber:11111}),
(node_5 {InternetEmailAddress:"peter@peterland.com"})
Now the problem is:
Every time I execute this, I add 5 more nodes.
I know I need to use a unique key, but all the examples I saw use a unique key for a specific node. So how can I make sure a user doesn't get added if it already exists? (I can use the email address as the unique key.)
How do I update the nodes if some changes occur? For example, after a week I want to update the graph to contain the following instead of the previous one (no duplicates):
CREATE(node_1 {FullName:"Peter Parker",FirstName:"peter",FamilyName:"parker"}),(node_2 {Address:"Newyork",CountryCode:"US"}),(node_3 {Location:"public"}),(node_4 {phoneNumber:11111}),(node_5 {InternetEmailAddress:"peter@peterland.com"}),(node_6 {status:"Jailed"})
(NOTE: the new update changed the location to "public" and added a new node for Peter.)
Seeing as you had a load of nodes anyway.
Some of the data you have modelled as Nodes are probably properties, as the other answer suggests; some are possibly correctly modelled as Nodes; and one could probably form the relationship, or a part of it.
Location public/hidden can be modelled in one of three ways: as a property on the Person, as a property on the relationship between the Person and the Location, or as the relationship type. To understand that, you first need to have a relationship.
Your address at the moment is another Node. I think this is correct, but possibly you would want two nodes, related something like this:
(s:State)-[:IN_COUNTRY]->(c:Country)
YMMV, and clearly that's a US-centric model, but you can extend it easily enough.
Now you could create Peter with a LIVES_IN relationship:
CREATE (p:Person{fullName:"Peter Parker"}), (s:State{name:"New York"}), (c:Country{code:"US"}),
(p)-[:LIVES_IN]->(s), (s)-[:IN_COUNTRY]->(c)
For speed, you are better off modelling two relationship types, e.g. LIVES_IN_PUBLIC and LIVES_IN_HIDDEN; to perform the update you want above, you then have to delete one and create the other. However, if speed is not of the essence, it is also common to use properties on the relationship.
CREATE (p:Person{fullName:"Peter Parker"}), (s:State{name:"New York"}), (c:Country{code:"US"}),
(p)-[:LIVES_IN{public:false}]->(s), (s)-[:IN_COUNTRY]->(c)
So your complete Q&A:
CREATE (p:Person {fullName:"Peter Parker",firstName:"peter",familyName:"parker", phoneNumber:11111, internetEmailAddress:"peter@peterland.com"}),
(s:State {name:"New York"}), (c:Country {code:"US"}),
(p)-[:LIVES_IN{public:false}]->(s), (s)-[:IN_COUNTRY]->(c)
MATCH (p:Person {internetEmailAddress:"peter@peterland.com"})-[li:LIVES_IN]->()
SET li.public = true, p.status = "jailed"
When adding other People you probably do not want to recreate States and Countries, rather you want to match them, and possibly Merge them, but we'll stick to Create.
MATCH (s:State{name:"New York"})
CREATE (p:Person{name:"John Smith", internetEmailAddress:"john@google.com"})-[:LIVES_IN{public:false}]->(s)
John Smith now implicitly lives in the US too as you can follow the relationship through the State Node.
Treatise complete.
I think you're modeling your data incorrectly here - you're setting up each property of the person as a separate node, which is not a good idea. You don't have any linkages between those nodes, so with this data pattern, later on you won't be able to tell what Peter Parker's address is. You're also not using node labels, which I think could really help here.
The quick answer to your question about updating nodes is that you have to MATCH them, then use SET to modify a property. So if you had a person, you might do this:
MATCH (p:Person { FullName: "Peter Parker" })
SET p.Address = "123 Fake Street"
RETURN p;
But notice I'm making assumptions about the way your data is structured. Taking the same data you provided, this might be a better way of creating it:
CREATE (node_1:Person {FullName:"Peter Parker",
    FirstName:"peter",
    FamilyName:"parker",
    Address:"Newyork",CountryCode:"US",
    Location:"Hidden",
    phoneNumber:11111,
    InternetEmailAddress:"peter@peterland.com"});
The difference with this suggestion is that I'm putting all the properties into a single node (instead of one property per node) and I'm applying the Person label to the node.
If you structured the data like this, then the update query I provided would work. Structured the way you have it, it's not possible to update Peter Parker's address, because there's no relationship between your node_1 and node_2.
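The match-then-set idea, combined with the question's suggestion of using the email address as the unique key, can be sketched in plain Python. This is only an analogy built on a dictionary, not Neo4j driver code; `people` and `upsert_person` are names made up for the illustration.

```python
people = {}  # hypothetical store: email address -> properties of one person


def upsert_person(email, **props):
    """Create the person if the email is new, otherwise update the existing record."""
    person = people.setdefault(email, {"InternetEmailAddress": email})
    person.update(props)
    return person


# First load, then the update a week later: same key, so no duplicate is created.
upsert_person("peter@peterland.com", FullName="Peter Parker", Location="Hidden")
upsert_person("peter@peterland.com", Location="public", status="Jailed")

print(len(people), people["peter@peterland.com"]["Location"])
```

Because every property lives on the one record found by its key, running the load a second time changes values instead of adding nodes.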
