Hi, I'm fairly new to Neo4j and Cypher in general, and I just started playing around with the default movie database provided with Neo4j's installation.
I'm trying to obtain a measure that reflects the importance of each movie based on the actors who appeared in it, one that iteratively takes into account the importance of each actor based on the number of movies they acted in.
I guess my best option here would be to use the PageRank score:
CALL algo.pageRank(label, relationship, ...)
What would be the best and most correct way to obtain what I'm looking for in this situation?
I'm quite lost, and googling around didn't help as much as I hoped. Thanks for reading, and sorry for my bad English.
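To make the iterative idea above concrete, here is a minimal sketch of power-iteration PageRank in plain Python, run outside the database. It treats every "acted in" relationship as undirected, so importance flows from actors to movies and back. The sample data, the function name pagerank, and the damping factor of 0.85 are assumptions for illustration; this is not how algo.pageRank is implemented or configured.

```python
from collections import defaultdict

# Hypothetical sample data: (actor, movie) pairs standing in for ACTED_IN.
ACTED_IN = [
    ("actor_a", "movie_x"),
    ("actor_a", "movie_y"),
    ("actor_a", "movie_z"),
    ("actor_b", "movie_x"),
    ("actor_c", "movie_y"),
]

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as edge pairs."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = list(neighbours)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[m] / len(neighbours[m]) for m in neighbours[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

scores = pagerank(ACTED_IN)
movies = {movie for _, movie in ACTED_IN}
for movie in sorted(movies, key=scores.get, reverse=True):
    print(f"{movie}: {scores[movie]:.3f}")
```

In this toy data, movie_z only features actor_a, while movie_x also features actor_b, so movie_x ends up with the higher score.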
Related
Preface
I'm new to Gremlin and working through Kelvin Lawrence's awesome eBook on the topic in order to solve a specific use-case.
Due to the sheer amount to learn, I'm asking this question to get recommendations on how I might approach the challenge so that, as I read the eBook, I'll better know the sections to which to pay extra attention.
I intend to use AWS Neptune in the pursuit of solving this, so I tagged that topic as well.
Question
Respecting departure/arrival times of legs + other constraints, can the shortest path (the real-world, logistical meaning of "path") between origin and destination be "queried" (i.e., can I use the Gremlin console with a single statement)? Or is the use-case of such complexity that I will effectively need to write a program to accomplish it?
Use-Case / Detail
I hope to answer the question:
Starting at ORIGIN on DAY, can I get to DESTINATION while respecting [CONDITIONS]?
The good news is that I only need a true/false response (so limit(1)?) and a lack of a result (e.g., []) suffices for "no".
What are the conditions?
Flight schedules need to be respected. Instead of simple flight routes (i.e., a connection exists between BOSton and DALlas), I have actual flight schedules (i.e., on Wednesday, 9 Nov 2022 at 08:40, flight XYZ will depart BOSton and then arrive DALlas at 13:15) ... consequently, if/when there are connections, I need to respect arrival and departure times + some sort of buffer (i.e., a path for which a Traveler would arrive at 13:05 and depart on another leg at 13:06 isn't actually a valid path);
Aggregate travel time / cost limits. The answer to the question needs to be "No" if a path's aggregate travel time or aggregate cost exceeds specified limits. (Here, I believe I'll need to use sack() to track the cost - financial and time - of each leg and bail out of the repeat() until loop when either is hit?)
I apologize b/c I know this isn't a good StackOverflow question, since it's not technically specific -- my hope is that, at least, some specific technical recommendations might result.
The use-case seems like the varsity / pro version of the flight routes example presented in the eBook, which is perfect for someone brand-new to Gremlin ... 😅
There are a number of ways you might model this. One way I have seen used effectively is to essentially have two graphs. The first just knows about routes. You use that one to find ways to get from A to Z in x hops. Then, taking the results from the first search, you use the second graph, which tracks actual flights, to look for flights within the time constraints you need to impose. So there is really the data modeling question and then the query writing part. Obviously the data model should enable the queries to be as efficient as possible.
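As a rough illustration of the two-phase idea, here is a minimal sketch in plain Python rather than Gremlin. Phase 1 derives candidate airport sequences from a routes-only view, and phase 2 checks them against scheduled flights with a connection buffer plus time and cost limits. The Flight tuple, the sample schedule, the function names, and the definition of travel time (last arrival minus first departure, all times in one time zone) are assumptions made for this example.

```python
from datetime import date, datetime, timedelta
from typing import NamedTuple

class Flight(NamedTuple):
    number: str
    origin: str
    destination: str
    departs: datetime
    arrives: datetime
    cost: int

# Hypothetical schedule, all times assumed to be in one time zone.
FLIGHTS = [
    Flight("XY1", "BOS", "DAL", datetime(2022, 11, 9, 8, 40), datetime(2022, 11, 9, 13, 15), 200),
    Flight("XY2", "DAL", "LAX", datetime(2022, 11, 9, 13, 20), datetime(2022, 11, 9, 15, 0), 150),
    Flight("XY3", "DAL", "LAX", datetime(2022, 11, 9, 14, 30), datetime(2022, 11, 9, 16, 10), 180),
]

def candidate_routes(flights, origin, destination, max_hops):
    """Phase 1: airport sequences with at most max_hops legs (routes only)."""
    routes = {}
    for f in flights:
        routes.setdefault(f.origin, set()).add(f.destination)
    stack = [[origin]]
    while stack:
        path = stack.pop()
        if path[-1] == destination:
            yield path
        elif len(path) <= max_hops:
            for nxt in routes.get(path[-1], ()):
                if nxt not in path:
                    stack.append(path + [nxt])

def feasible(flights, route, day, buffer, max_duration, max_cost):
    """Phase 2: can the route be flown with real flights within the limits?"""
    def search(leg, ready_at, started_at, cost):
        if leg == len(route) - 1:
            return True
        for f in flights:
            if (f.origin, f.destination) != (route[leg], route[leg + 1]):
                continue
            if leg == 0 and f.departs.date() != day:
                continue  # the first leg must leave on the requested day
            if leg > 0 and f.departs < ready_at:
                continue  # connection too tight
            start = f.departs if leg == 0 else started_at
            if f.arrives - start > max_duration or cost + f.cost > max_cost:
                continue  # limit exceeded, abandon this branch
            if search(leg + 1, f.arrives + buffer, start, cost + f.cost):
                return True
        return False
    return search(0, None, None, 0)

def can_travel(flights, origin, destination, day, buffer, max_duration, max_cost, max_hops=3):
    return any(
        feasible(flights, route, day, buffer, max_duration, max_cost)
        for route in candidate_routes(flights, origin, destination, max_hops)
    )

print(can_travel(FLIGHTS, "BOS", "LAX", date(2022, 11, 9),
                 timedelta(minutes=45), timedelta(hours=8), 400))
```

With a 45-minute buffer the 13:20 connection is rejected and only the 14:30 flight qualifies, which is the kind of check that the cumulative time and cost tracking would need to perform inside a traversal.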
There are a couple of useful blog posts related to your question. They mention Neo4j but are really quite generic and mainly focus on the data modeling aspects of your question.
https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
https://maxdemarzi.com/2017/05/24/flight-search-with-neo4j/
I would focus on the data model, and once you have that, focus on the Gremlin queries. Amazon Neptune also now supports openCypher as an alternative property graph query language.
If you already have a data model worked out and can share a sample, I'm happy to update the answer with an example query or two.
I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model, I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges but edge labels can also. In some cases you may find a label such as DIRECTED-IN-2005 is a useful short cut to avoid checking a label and a property on an edge.
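To illustrate the single "person" node with labelled edges, here is a small in-memory sketch in Python. The edge tuples, the sample data, and the helper name people are all made up for the example; in Gremlin the same filters would be expressed with edge labels and property checks.

```python
# Hypothetical in-memory edges: (person, edge_label, movie, properties).
EDGES = [
    ("person_a", "ACTED_IN", "movie_x", {"year": 2005}),
    ("person_a", "DIRECTED", "movie_x", {"year": 2005}),
    ("person_b", "ACTED_IN", "movie_x", {"year": 2005}),
    ("person_b", "DIRECTED", "movie_y", {"year": 2010}),
]

def people(edges, label, movie=None, **props):
    """People with a `label` edge, optionally filtered by movie and edge properties."""
    return {
        person
        for person, edge_label, target, edge_props in edges
        if edge_label == label
        and (movie is None or target == movie)
        and all(edge_props.get(key) == value for key, value in props.items())
    }

# One kind of node, two kinds of relationship: no duplicate node is needed
# for somebody who both acts and directs.
actor_directors = people(EDGES, "ACTED_IN") & people(EDGES, "DIRECTED")
directed_2005 = people(EDGES, "DIRECTED", year=2005)
print(sorted(actor_directors), sorted(directed_2005))
```

The year filter here checks a label and a property on each edge, which is the two-step check that a combined label such as DIRECTED-IN-2005 would avoid.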
I am exploring ArangoDB and I am not sure I understand correctly how to use the graph concept in ArangoDB.
Let's say I am modelling a social network. Should I create a single graph for the whole social network, or should I create a graph for every person and their connections?
I've got the feeling I should use a single graph... But is there any performance/functionality issue related to that choice?
Maybe the underlying question is this: should I consider the graph concept in ArangoDB as a technical or as a business-related concept?
Thanks
You should not use a graph per person. The quick answer is to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "find the friends of a user?", "find the friends of the friends of a user?" or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, like for example in the shortest path example.
Coming back to the original question: You should start by looking at the queries you will run against your data. You then want to arrange things so that these queries map conveniently onto the standard graph operations (or indeed other queries) your data store can answer. This then tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship-relations, so the edges in these chains must be in the same graph. However, in more complicated cases it is for example conceivable that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.
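As a small illustration of why the friendship edges need to live in one graph, here is a sketch in plain Python of the typical questions mentioned above, friends of friends and shortest path. The names and the helper functions are invented for the example; a graph database would answer the same questions with its built-in traversals.

```python
from collections import deque

# Hypothetical friendship edges, all stored in a single graph.
FRIENDS = [("ann", "bob"), ("bob", "cid"), ("cid", "dee"), ("ann", "eve")]

def build_adjacency(edges):
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def friends_of_friends(adjacency, user):
    direct = adjacency.get(user, set())
    return {fof for friend in direct for fof in adjacency[friend]} - direct - {user}

def shortest_path(adjacency, start, goal):
    """Breadth-first search; the path length is not known in advance."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in adjacency.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # no chain of friendships connects the two people

graph = build_adjacency(FRIENDS)
print(friends_of_friends(graph, "ann"))
print(shortest_path(graph, "ann", "dee"))
```

If each person had their own graph, the search from ann could never follow bob's edges to reach dee, because those edges would sit in a different graph.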
I'm currently working on my first application that uses a Graph database (Neo4J). I'm in the process of modelling my graph on a whiteboard. My colleague and I are in a pickle on whether or not we should introduce a 'collection node'.
We have something like this (Cypher syntax, Fictive example):
(parking:Parking) - Parking node
(car:Car) - Car node
Obviously, a Parking can have multiple Cars; let's say it can have up to 1 million cars.
Is it, in this case, better to introduce a new node:
(carCollection:CarCollection) - Car collection node?
A Parking could have a relationship to the 'Car collection node', which in turn can have a lot of cars. This should prevent a simple query on the Parking node itself (let's say you want to query the number of available seats) from losing performance.
Is this a good idea? Or is this bogus and should you model it as it is, and does this not influence performance?
If anyone can provide a link or book with some graph modelling best practices, that would be awesome as well :).
Thx in advance.
Gr
Kwinten
Either way, there is no way to gain performance once you need a node for each of the 1 million cars.
If you simply query your Parking node for just one car, it will be as fast as if there were just one car in the car collection.
If you need to return all 1 million cars, then there is nothing to gain either (the main problem, however, would simply be the network connection needed to stream all the data).
You can play with labels, but I suggest keeping the millions of relationships directly on the Parking node. If you could provide an example scenario with a query, then maybe we can figure something out.
I need to model airline flight data in a graph database (I am specifically working with neo4j, though I will consider others if that becomes problematic). My question is more about how to model this data in a way that will ease traversal and discovery of different flight options. A few specific examples of the type of data I would like to both store and later query:
1) A direct flight scenario like JFK->LAX. Seems straightforward, simple two node relationship. But there are many flights that may be of interest between these two nodes. So, if I need to store individual flight detail, is that best in an array on the relationship between the JFK and LAX nodes?
2) A flight scenario with multiple stops, like JFK->LAX->SAN. In this scenario, it seems like modeling the relationships between the three nodes may be of limited utility if I'm only interested in the departure and arrival city? i.e. I could have a relationship from JFK->SAN, and the fact that there is a layover in LAX could be a property on that relationship?
If I need to query or traverse the graph based on arrays of data in relationships between nodes, and those arrays become large (e.g. 100 different flights between JFK and LAX), will that introduce performance or scalability problems?
Hopefully this question isn't too open-ended - I'm just trying to avoid building something that works for a small example model with ~5 nodes but can't scale to hundreds of airports and tens of thousands of flights.
Hundreds of airports and tens of thousands of flights is still a very small data set and I'd be surprised if that would be a problem in neo4j.
Off the top of my head you could perhaps have all the airports as their own nodes and each route could be its own node with relationships to all the airports it touches, maybe with an "order" property on each relationship which is local to the route.
        (ROUTE1)-----------
         /      \          \
*order=1/        \*order=2  \*order=3
       v          v          v
    (JFK)       (LAX)      (SAN)
I'm sure there are better solutions.
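For what it's worth, the route-as-node layout drawn above could be sketched in Python like this. The route ids, the sample data, and the helper name routes_between are invented for illustration; in Neo4j the "order" value would be a property on each relationship from the route node to an airport node.

```python
# Hypothetical data: route id -> (order, airport) pairs, one per relationship.
ROUTE_STOPS = {
    "ROUTE1": [(1, "JFK"), (2, "LAX"), (3, "SAN")],
    "ROUTE2": [(1, "JFK"), (2, "LAX")],
    "ROUTE3": [(1, "SAN"), (2, "LAX"), (3, "JFK")],
}

def routes_between(route_stops, origin, destination):
    """Routes that visit origin before destination, mapped to their layovers."""
    found = {}
    for route, stops in route_stops.items():
        order = {airport: position for position, airport in stops}
        if origin in order and destination in order and order[origin] < order[destination]:
            ordered = [airport for _, airport in sorted(stops)]
            layovers = ordered[ordered.index(origin) + 1 : ordered.index(destination)]
            found[route] = layovers
    return found

print(routes_between(ROUTE_STOPS, "JFK", "SAN"))
```

Here the layover in LAX falls out of the "order" values, so it does not need to be stored as a separate property on a direct JFK->SAN relationship.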
Check out Neo4j's contribution page.
One of the winners of their contest was a gist describing US flights and airports; it is very well done.
These links may be useful for you: http://maxdemarzi.com/?s=flights and http://gist.neo4j.org/?6619085