I'm asking if it's possible to order results by distance in a Discover query. I've performed a query for train stations; however, the results seem to be ordered somewhat randomly. I'm only interested in the nearest train station, so having the closest result first would be valuable.
I appreciate that it's entirely possible to achieve this programmatically, but before I burn time working on a solution, I figured it was worth reaching out to see if this was already possible within the API.
If it helps, this is the query I was making: https://discover.search.hereapi.com/v1/discover?at=-34.0337365,151.0847004&q=station&limit=10&apikey=. You may see that the closest result is 4th down.
Happy to hear an alternative solution if there is one.
Thanks!
You can do this in two different ways, both using the browse endpoint.
Use the browse endpoint instead of discover and the results will be ordered by distance. You can filter either by name or by category; the requests will look like:
https://browse.search.hereapi.com/v1/browse?at=-34.0337365,151.0847004&name=station&limit=10&apikey=
OR
https://browse.search.hereapi.com/v1/browse?at=-34.0337365,151.0847004&categories=400-4100-0035&limit=10&apikey=
Places category system
https://developer.here.com/documentation/geocoding-search-api/dev_guide/topics-places/places-category-system-full.html
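If you would rather stay on the discover endpoint, sorting on the client side is also simple. A minimal Python sketch, assuming each returned item carries a numeric `distance` field (metres from the `at` position); the sample items below are made up for illustration and are not real API output:

```python
def nearest_first(items):
    """Return result items sorted by ascending distance."""
    # Items that lack a 'distance' value are pushed to the end.
    return sorted(items, key=lambda item: item.get("distance", float("inf")))


# Hypothetical, trimmed-down items shaped like the "items" array of a response.
sample_items = [
    {"title": "Station C", "distance": 1450},
    {"title": "Station A", "distance": 320},
    {"title": "Station B", "distance": 980},
]

nearest = nearest_first(sample_items)[0]
print(nearest["title"])  # Station A
```

Note that this only orders the items that were returned, so with a small `limit` the truly nearest station could still be missing from a discover response.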
Preface
I'm new to Gremlin and working through Kelvin Lawrence's awesome eBook on the topic in order to solve a specific use-case.
Due to the sheer amount to learn, I'm asking this question to get recommendations on how I might approach the challenge so that, as I read the eBook, I'll better know the sections to which to pay extra attention.
I intend to use AWS Neptune in the pursuit of solving this, so I tagged that topic as well.
Question
Respecting departure/arrival times of legs + other constraints, can the shortest path (the real-world, logistical meaning of "path") between origin and destination be "queried" (i.e., can I use the Gremlin console with a single statement)? Or is the use-case of such complexity that I will effectively need to write a program to accomplish it?
Use-Case / Detail
I hope to answer the question:
Starting at ORIGIN on DAY, can I get to DESTINATION while respecting [CONDITIONS]?
The good news is that I only need a true/false response (so limit(1)?) and a lack of a result (e.g., []) suffices for "no".
What are the conditions?
Flight schedules need to be respected. Instead of simple flight routes (i.e., a connection exists between BOSton and DALlas), I have actual flight schedules (i.e., on Wednesday, 9 Nov 2022 at 08:40, flight XYZ will depart BOSton and then arrive DALlas at 13:15) ... consequently, if/when there are connections, I need to respect arrival and departure times + some sort of buffer (i.e., a path for which a Traveler would arrive at 13:05 and depart on another leg at 13:06 isn't actually a valid path);
Aggregate travel time / cost limits. The answer to the question needs to be "No" if a path's aggregate travel time or aggregate cost exceeds specified limits. (Here, I believe I'll need to use sack() to track the cost - financial and time - of each leg and bail out of the repeat() until loop when either is hit?)
I apologize b/c I know this isn't a good StackOverflow question, since it's not technically specific -- my hope is that, at least, some specific technical recommendations might result.
The use-case seems like the varsity / pro version of the flight routes example presented in the eBook, which is perfect for someone brand-new to Gremlin ... 😅
There are a number of ways you might model this. One way I have seen used effectively is to essentially have two graphs. The first just knows about routes. You use that one to find ways to get from A to Z in x hops. Then, using the second graph, which tracks actual flights, you take the results from the first search and look for flights within the time constraints you need to impose. So there is really the data modeling question and then the query writing part. Obviously the data model should enable the queries to be as efficient as possible.
There are a couple of useful blog posts related to your question. They mention Neo4j but are really quite generic and mainly focus on the data modeling aspects of your question.
https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
https://maxdemarzi.com/2017/05/24/flight-search-with-neo4j/
I would focus on the data model, and once you have that, focus on the Gremlin queries. Amazon Neptune also now supports openCypher as an alternative property graph query language.
If you already have a data model worked out and can share a sample, I'm happy to update the answer with an example query or two.
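To make the constraints concrete before committing to a graph model, here is a minimal in-memory sketch in Python. It is not a Gremlin query and not a recommended data model; it only shows the checks a traversal would need to enforce (connection buffer, total-time limit, total-cost limit). The `Flight` type, the `can_travel` function, the default limits, and the sample schedule are all hypothetical, and times are assumed to be in a single time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Flight:
    origin: str
    dest: str
    depart: datetime
    arrive: datetime
    cost: float


def can_travel(flights, origin, dest, start, *,
               buffer=timedelta(minutes=45),
               max_total=timedelta(hours=24),
               max_cost=1000.0):
    """Return True if dest is reachable from origin under all constraints."""
    by_origin = {}
    for f in flights:
        by_origin.setdefault(f.origin, []).append(f)

    deadline = start + max_total  # total travel time is measured from `start`
    # Each entry: (airport, earliest allowed departure, cost so far, airports visited)
    stack = [(origin, start, 0.0, frozenset([origin]))]
    while stack:
        airport, ready, spent, seen = stack.pop()
        if airport == dest:
            return True
        for f in by_origin.get(airport, []):
            if f.dest in seen:               # do not revisit an airport
                continue
            if f.depart < ready:             # missed it, or connection too tight
                continue
            if f.arrive > deadline:          # exceeds the total-time limit
                continue
            if spent + f.cost > max_cost:    # exceeds the cost limit
                continue
            stack.append((f.dest, f.arrive + buffer,
                          spent + f.cost, seen | {f.dest}))
    return False


day = datetime(2022, 11, 9)
schedule = [
    Flight("BOS", "DAL", day.replace(hour=8, minute=40), day.replace(hour=13, minute=15), 300.0),
    Flight("DAL", "SFO", day.replace(hour=13, minute=30), day.replace(hour=15, minute=30), 200.0),
    Flight("DAL", "SFO", day.replace(hour=14, minute=30), day.replace(hour=16, minute=30), 250.0),
]
print(can_travel(schedule, "BOS", "SFO", day))
```

The search is a plain depth-first enumeration with pruning, which is fine for small schedules but can grow exponentially; in Gremlin the same checks would typically live inside a repeat() with sack() carrying the running cost and time.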
I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can do that as well. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
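As a rough illustration of the "one kind of person node, differently labelled edges" option, here is a tiny in-memory sketch in Python. It is only meant to show the shape of the model, not how Neptune stores or traverses data; the `edges` list and the `out` helper are hypothetical names.

```python
from collections import defaultdict

# Each edge is (source vertex, edge label, target vertex).
edges = [
    ("Clint Eastwood", "ACTED_IN", "Unforgiven"),
    ("Clint Eastwood", "DIRECTED", "Unforgiven"),
    ("Gene Hackman", "ACTED_IN", "Unforgiven"),
]

# Index edges by (source, label) so a lookup touches only the relevant edges.
out_edges = defaultdict(list)
for src, label, dst in edges:
    out_edges[(src, label)].append(dst)


def out(vertex, label):
    """Vertices reached from `vertex` over outgoing edges with `label`."""
    return out_edges.get((vertex, label), [])


directed = out("Clint Eastwood", "DIRECTED")
acted_in = out("Clint Eastwood", "ACTED_IN")
```

The same person appears once and plays both roles through its edges, which is the main thing a separate actor/director node type would make awkward.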
I'm working on a real-time prediction problem. For example, I have a node (A) (:Person) who has friends, and a node (B) as (:Games).
Node (A) has liked a certain game and his friends have liked other games, so I recommend those other games to him. The catch is that I need to exclude the games he has already liked or played.
It seems like this should be easy with NOT, but I haven't found the right query yet, although I've tried a lot of approaches.
The closest I've come is:
match (A:Person)-[:Friend]-(n:Person)
where A <> n
with distinct n
match (n)-[:LIKED]-(B:Game)-[:ON]-(:steam), (k:Person{name:'John'})
where not ((k)-[:LIKED]-(:Game)-[:ON]-(:steam))
return B
This is meant to recommend the games John's friends liked, excluding the games John has already liked.
However, when I run this, the graph just freezes for a while and then shuts down, which is another problem I wanted to ask about.
Thanks for the help.
The last WHERE clause has very few constraints on it, which probably explains the hang/timeout. It may help to have a variable name for each label, either to constrain the query or to receive the nodes. Something more like this:
where not ((k:Person{name:'John'})-[:LIKED]->(B:Game)-[:ON]->(C:steam))
return B
Specify directional -> relationships (as above) in Cypher queries if possible; this usually gives the answer you want and is faster.
Adding variable names and relationship directions also makes the query easier to read and understand, and helps if you need to inspect node/relationship values when debugging.
I may be wrong, but the :steam label doesn't look right to me. What are example values? I'm wondering if you meant to have a :service node, and steam would be a node instance?
Note: if you provide create nodes/rels script to create a small example of this database (e.g. a dozen nodes with these relationships) it would be easier to provide a working cypher example.
If you want to find the distinct games on Steam that John's friends liked but John has not yet liked or played, something like this should work:
MATCH (j:Person{name:'John'})-[:FRIEND]-(:Person)-[:LIKED]->(g:Game)-[:ON]-(:steam)
WHERE NOT (j)-[:LIKED|PLAYED]->(g)
RETURN DISTINCT g
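The logic of that query is a set difference: everything John's friends liked, minus everything John has already liked or played. A small Python sketch of the same idea, with made-up data and a hypothetical `recommend` helper (the Steam filter is left out for brevity):

```python
def recommend(person, friends, liked, played):
    """Games liked by the person's friends that the person has not liked or played."""
    already = liked.get(person, set()) | played.get(person, set())
    candidates = set()
    for friend in friends.get(person, set()):
        candidates |= liked.get(friend, set())
    return candidates - already


# Hypothetical sample data.
friends = {"John": {"Ann", "Bob"}}
liked = {"John": {"Portal"}, "Ann": {"Portal", "Hades"}, "Bob": {"Celeste"}}
played = {"John": {"Celeste"}}

suggestions = recommend("John", friends, liked, played)
```

Using sets gives the DISTINCT behaviour for free, and the exclusion is evaluated against John's own games only, which is what keeps the Cypher version cheap.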
First, I'm using azure cosmos graph db.
I see this sort of pattern quite a bit:
out('an-edge').fold().coalesce(unfold(),addV('incoming-schedule'))
I want to add an edge immediately after I do an addV in the coalesce. I've been trying to do it in a simple example:
g.V('any-vertex-id').as('a').out('an-edge').fold().coalesce(unfold(),addV('new-vertex').addE('to-v').from('a'))
"a" seems to no longer exist after a fold() since it's a barrier step. I tried store and aggregate but I must not understand those properly. Is it possible to get a reference after a fold()? I need it because it may reference a previous addV in the query to which I wouldn't have the id yet.
What is your requirement here? Do you want to create a new vertex and an edge only when out('an-edge') is not present?
If that's the case, I would try this:
g.V('any-vertex-id').as('a').coalesce(out('an-edge'), addV('new-vertex').addE('to-v').from(select('a')))
fold() is typically used when one needs to aggregate all the output from the preceding step. I don't think that is necessary in this case.
http://tinkerpop.apache.org/docs/current/reference/#fold-step
It looks like I can store and then select from it when adding the edge.
g.V('any-vertex-id').store('a').out('an-edge').fold()
.coalesce(unfold(),addV('new-vertex')
.addE('to-v').from(select('a').unfold()))
Not sure if someone has a better alternative or a better suggestion than store, but this seems to work, at least in my scenario.
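For readers unfamiliar with the pattern, the intent of these traversals is "get or create": return the existing neighbours if there are any, otherwise add a vertex and connect it to the starting vertex. A minimal in-memory Python sketch of that behaviour, not tied to Cosmos DB or any Gremlin driver; the `get_or_create_out` helper and the id scheme are hypothetical, and for simplicity it uses one edge label for both the lookup and the new edge:

```python
def get_or_create_out(vertices, edges, source_id, edge_label, new_label):
    """Return ids of existing out-neighbours, or create one vertex and edge."""
    existing = [dst for (src, label, dst) in edges
                if src == source_id and label == edge_label]
    if existing:
        return existing  # first branch of coalesce: something is already there

    # Second branch: add the vertex, then the edge back from the source.
    new_id = f"{new_label}-{len(vertices)}"
    vertices[new_id] = new_label
    edges.append((source_id, edge_label, new_id))
    return [new_id]


vertices = {"any-vertex-id": "thing"}
edges = []
first = get_or_create_out(vertices, edges, "any-vertex-id", "an-edge", "new-vertex")
second = get_or_create_out(vertices, edges, "any-vertex-id", "an-edge", "new-vertex")
```

Calling it twice creates the vertex and edge only once, which is the property the coalesce step is being used for.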
I'm experiencing issues querying a large graph involving repeat steps that aim at making "hops" across vertices and edges. My intention is to infer indirect relationships between objects. Consider the following:
John--livesIn-->Paris
Paris--isIn-->France
What I expect to come up with is that John is based in France. Simple enough, and this works great with a small data set.
The query that I use is the following, where I make no more than 2 hops:
g.V().has('name','John')
.emit(loops().is(lt(2)))
.repeat(__.bothE().bothV().simplePath())
.inE('isIn').outV().path()
This is working as expected, until I apply this to a graph made of about 1000 vertices and 3000 edges. Then, after a few minutes, I get various kinds of error (over the REST API) with no clear logic:
Error: Error encountered evaluating script
Error: 504 Gateway Time-out
Error: Java heap space
Error
I suspect that I am doing something wrong in my query. For example, setting the number of "hops" to 1 (direct relationship) with .emit(loops().is(lt(1))), I would expect the results to be delivered swiftly since it would not go into the repeat loop. However, this triggers the same issue.
Many thanks for your help!
Olivier
So it looks like you have a few things going on here. First let me take a shot at answering your question then let's look at why your traversal may be taking a long time to complete.
Based on your description of wanting to return John and France the following traversal should get your data:
g.V().has('name','John').as('person')
.out('livesIn')
.out('isIn').as('country').select('person', 'country')
That will select all countries that a person named 'John' lives in.
Now to understand why your traversal was taking a long time. First, you are using several steps which are very memory and resource intensive, such as bothE and bothV. Each of these steps navigates the relationship in both directions. Since you know that the direction of the edge you are trying to traverse is out in both cases, it is much quicker and less resource intensive to just use out(), as this will traverse the specified edge name (if supplied) and end on the adjacent vertex. Additionally, simplePath is another resource (specifically memory) intensive step, as it must track the path of each traverser and drop the traverser once its path contains a repeated object. This, combined with the extra traversers created by the usage of loops, bothE and bothV, is likely the cause of the slow query. I suspect that the query above will perform significantly better.
If you would like to see exactly what your query is doing, I would suggest taking a look at the explain and profile steps, which provide detailed information on your query's performance.
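To see why direction and edge labels matter, here is a small in-memory Python sketch that compares a directed, labelled two-hop walk with an undirected one over the same made-up edges. The `out` and `both` helpers are hypothetical stand-ins for the Gremlin steps of the same name, not a driver API.

```python
# Each edge is (source, label, target); sample data for illustration only.
edges = [
    ("John", "livesIn", "Paris"),
    ("Paris", "isIn", "France"),
    ("Mary", "livesIn", "Paris"),
    ("Lyon", "isIn", "France"),
]


def out(vertex, label):
    """Follow outgoing edges with the given label."""
    return [dst for src, lbl, dst in edges if src == vertex and lbl == label]


def both(vertex):
    """Follow every edge touching the vertex, in either direction."""
    outgoing = [dst for src, _, dst in edges if src == vertex]
    incoming = [src for src, _, dst in edges if dst == vertex]
    return outgoing + incoming


# Directed and labelled: only the edges that can lead to a country are explored.
directed = [(city, country)
            for city in out("John", "livesIn")
            for country in out(city, "isIn")]

# Undirected and unlabelled: every neighbour of every neighbour is explored.
undirected = [(a, b) for a in both("John") for b in both(a) if b != "John"]
```

Even on four edges the undirected walk visits an unrelated person (Mary); on a graph with thousands of edges that extra fan-out compounds at every hop.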