Convert AQL query to Gremlin query - gremlin

I'm doing a project related to GraphDB, and I'm using ArangoDB to create a graph and try some queries with it. I have two JSON files (below) which I have already imported into ArangoDB, and I created a graph from them (airports: document collection, flights: edge collection).
I have 2 examples of AQL queries for the graph, but I'm struggling with converting them to Gremlin queries.
ex1 (flights that leave JFK airport):
FOR v,e,p IN 1..1 OUTBOUND
'airport/JFK'
GRAPH 'flights'
RETURN p
ex2 (flights from SF to KOA international airport that have VIP lounges):
FOR airport IN airports
FILTER airport.city == "San Francisco"
FILTER airport.VIP == true
FOR v,e,p IN 1..1 OUTBOUND
airport flights
FILTER v._id == 'airports/KOA'
RETURN p
Can you help me with this? Thank you.

You may find the examples here [1] of interest as they are specific to Gremlin and air route use cases. Without having your data model or some sample data it is not possible to give you a 100% accurate translation of your queries. However, the Gremlin will look something like this:
// Flights that leave JFK airport
g.V().has('airport','code','JFK').
out().
path()
// Flights from SFO to KOA that have a VIP lounge
g.V().has('airport','city','San Francisco').
has('VIP', true).
out().
has('code','KOA').
path()
[1] http://www.kelvinlawrence.net/book/PracticalGremlin.html

Related

Order the result by the number of relationships

I have a directed multigraph. With this query I tried to find all the nodes that are connected to the node with the uuid n1_34:
MATCH (n1:Node{uuid: "n1_34"}) -[r]- (n2:Node) RETURN n2, r
This gives me a list of n2 nodes (n1_1187, n2_2280, n2_1834, n2_932 and n2_722) and their relationships among themselves, which is exactly what I need.
Nodes n1_1187, n2_2280, n2_1834, n2_932 and n2_722 are connected to the node n1_34
Now I need to order them based on the number of relationships each has within this subgraph. So, for example, n1_1187 should be on top with 4 relationships while the others have 1 relationship each.
I followed this post: Extract subgraph from Neo4j graph with Cypher but it gives me the same result as the query above. I also tried to return count(r) but it gives me 1 since it counts all unique relationships not the relationships with a common source/target.
Usually with networkx I can copy this result into a subgraph then count the relationships of each node. Can I do that with neo4j without modifying the current graph? How?
Please help. Or is there another way?
This snippet will recreate your graph for testing purposes:
WITH ['n1_34,n1_1187','n1_34,n2_2280','n1_34,n2_1834','n1_34,n2_722', 'n1_34,n2_932','n1_1187,n2_2280','n1_1187,n2_932','n1_1187,n2_1834', 'n1_1187,n2_722'] AS node_relationships
UNWIND node_relationships as relationship
with split(relationship, ",") as nodes
merge(n1:Node{label:nodes[0]})
merge(n2:Node{label:nodes[1]})
merge(n1)-[:LINK]-(n2)
Once that is run, you will have a graph matching the one described in the question.
Then this CQL will select the nodes in the subgraph and then subsequently count up each of their respective associated links, but only to other nodes existing already in the subgraph:
match(n1:Node{label:'n1_34'})-[:LINK]-(n2:Node)
with collect(distinct(n2)) as subgraph_nodes
unwind subgraph_nodes as subgraph_node
match(subgraph_node)-[r:LINK]-(n3:Node)
where n3 in subgraph_nodes
return subgraph_node.label, count(r) order by count(r) DESC
Running the above yields n1_1187 with a count of 4, followed by n2_2280, n2_1834, n2_932 and n2_722 with a count of 1 each.
This query should do what you need:
MATCH (n1:Node{uuid: "n1_34"})-[r]-(n2:Node)
RETURN n1, n2, count(*) AS freq
ORDER BY freq DESC
Using PROFILE to assess the efficiency of some of the existing solutions with @DarrenHick's sample data, the following is the most efficient one I have found, needing only 84 DB hits:
MATCH (n1:Node{label:'n1_34'})-[:LINK]-(n2:Node)
WITH COLLECT(n2) AS nodes
UNWIND nodes AS n
RETURN n, SIZE([(n)-[:LINK]-(n3) WHERE n3 IN nodes | null]) AS cnt
ORDER BY cnt DESC
Darren's solution (adjusted to return subgraph_node instead of subgraph_node.label, for parity) requires 92 DB hits.
@LuckyChandrautama's own solution (provided in a comment to Darren's answer, and adjusted to match Darren's sample data) uses 122 DB hits.
This shows the importance of using PROFILE to assess the performance of different Cypher solutions against the actual data. You should try doing that with your actual data to see which one works best for you.
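For example, prefixing a query with PROFILE in the Neo4j Browser or cypher-shell prints the execution plan along with the db-hit counts, so you can compare the candidates directly:
PROFILE
MATCH (n1:Node {uuid: "n1_34"})-[r]-(n2:Node)
RETURN n1, n2, count(*) AS freq
ORDER BY freq DESC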

Street address map plotting

I have 30 thousand home addresses and want to geocode them (i.e., convert "123 ABC Street" to a latitude and longitude).
I have researched whether there is any good tool available, but it is very confusing.
Can anyone suggest a resource?
Here's a function that will get you one address from the Google Maps geocoding API:
geocodeAddress <- function(address) {
  base <- "https://maps.googleapis.com/maps/api/geocode/json?address="
  key  <- "your_google_maps_api_key_here"
  url  <- URLencode(paste0(base, address, "&key=", key))
  RJSONIO::fromJSON(url, simplify = FALSE)
}
And how to use it:
result <- geocodeAddress("1600 Amphitheatre Parkway Mountain View, CA 94043")
You can pull out just the lat and lng with, e.g.:
result_lat <- result$results[[1]]$geometry$location$lat
result_lng <- result$results[[1]]$geometry$location$lng
For your 30k addresses, you can loop over them individually. More info available at developers.google.com. Last I checked, there are limits on the number of requests per second and total number of free requests per day, but I suspect the cost for 30k isn't very high.
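For illustration, a minimal loop might look like the sketch below; it assumes your addresses are stored in a character vector named addresses and uses the geocodeAddress() function defined above.
# Sketch: geocode a vector of addresses one by one, pausing briefly between
# requests to respect the per-second rate limit; 'addresses' is an assumed
# character vector holding the 30k addresses.
results <- lapply(addresses, function(addr) {
  res <- geocodeAddress(addr)
  Sys.sleep(0.1)  # stay under the requests-per-second limit
  if (length(res$results) > 0) {
    data.frame(address = addr,
               lat = res$results[[1]]$geometry$location$lat,
               lng = res$results[[1]]$geometry$location$lng)
  } else {
    data.frame(address = addr, lat = NA, lng = NA)
  }
})
geocoded <- do.call(rbind, results)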
Alternatively, you can upload data in csv format to UCLA's geocoder: gis.ucla.edu/geocoder.
A third alternative is to use Texas A&M's geocoder: geoservices.tamu.edu.
I suggest the free tidygeocoder package (https://jessecambon.github.io/tidygeocoder/).
Depending on the size of your data frame, you may also find my suggestion for parallelization useful: Is it possible to parallelize the geocode function from the tidygeocoder package in R?
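For instance, a minimal tidygeocoder call over a data frame of addresses looks roughly like this (the address column name and sample rows are assumptions; the free OSM/Nominatim backend needs no API key but is rate limited):
# Sketch: batch geocode a data frame column with tidygeocoder
library(dplyr)
library(tidygeocoder)
addresses_df <- tibble::tibble(
  address = c("1600 Amphitheatre Parkway Mountain View, CA 94043",
              "1600 Pennsylvania Ave NW, Washington, DC 20500")
)
geocoded <- addresses_df %>%
  geocode(address, method = "osm", lat = latitude, long = longitude)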

How do I do simple calculations in CosmosDB using the Gremlin API

I am using CosmosDB with the Gremlin API and I would like to perform simple calculations even though CosmosDB does not support the math() step.
Imagine that I have a vertex "Person" with the property Age that can have an edge "Owns" to another vertex "Pet" that also has the property Age.
I would like to know if a given person has a cat that is younger than the person but not more than 10 years younger.
The query (I know this is just a part of it but this is where my problem is)
g.V().hasLabel("Person").has("Name", "Jonathan Q. Arbuckle").as("owner").values("age").inject(-10).sum().as("minAge").select("owner")
Returns an empty result but
g.V().hasLabel("Person").has("Name", "Jonathan Q. Arbuckle").as("owner").values("age").inject(-10).as("minAge").select("owner")
Returns the selected owner.
It seems that if I do a sum() or a count() in the query, then I cannot do 'select("owner")' anymore.
I do not understand this behaviour. What should I do to be able to do a 'select("owner")' and still be able to filter the Pets based on their age?
Is there some other way I can write this query?
Thank you in advance
Steps like sum, count and max are known as reducing barrier steps. They cause what has happened earlier in the traversal to essentially be forgotten. One way to work around this is to use a project step. As I do not have your data I used the air-routes data set and used airport elevation as a substitute for age in your graph.
gremlin> g.V(3).
project("elev","minelev","city").
by("elev").
by(values("elev").inject(-10).sum()).
by("city")
==>[elev:542,minelev:532,city:Austin]
I wrote some notes about reducing barrier steps here: http://kelvinlawrence.net/book/PracticalGremlin.html#rbarriers
UPDATED
If you want to find airports with an elevation lower than that of the starting airport, but by no more than 10, while avoiding the math step, you can use this formulation:
g.V(3).as('a').
project('min').by(values('elev').inject(-10).sum()).as('p').
select('a').
out().
where(lt('a')).by('elev').
where(gt('p')).by('elev').by('min')

How to ingest data effectively in Neo4j

I am looking for advice on the best way to handle data going into Neo4j.
I have a set of structured data in CSV format which relates to journeys. The data is:
"JourneyID" - unique ref#/ Primary Key e.g 1234
"StartID" - ref# , this is a station e.g Station1
"EndIID" - ref# this is a station, e.g Station1 (start and end can be the same)
"Time" – integer e.g. 24
Assume I have 100 journeys/rows of data, showing journeys between 10 different stations.
I can see and work with this data in SQL or Excel. I want to work with this in Neo4j.
This is what I currently have:
StartID with JourneyID as a label
EndID with JourneyID as a label
This means that each row from the CSV for a station is its own node. I then created a relationship between Start and End using the JourneyID (primary key).
The effect was just 100 nodes connected to 100 nodes, e.g. a connection from Station1 to Station2, Station1 to Station3, and Station1 to Station4. It didn't show the relationships between starting Station1 and ending Stations 1, 2 and 3, which is what I want to show.
How best do I model this data so that the graph sees 10 unique StartIDs connecting to the different EndIDs, showing the relationships between them?
Thanks in advance
(new to Graphs!)
This sample query, which uses MERGE to avoid creating duplicate nodes and relationships, should help you get started:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (j:Journey {id: row.JourneyID})
ON CREATE SET j.time = toInteger(row.Time)
MERGE (j)-[:FROM]->(start)
MERGE (j)-[:TO]->(end)
I don't think you want a Journey to be a node; you want the Journey ID to be an attribute of the edge:
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MERGE (start:Station {id: row.StartID})
MERGE (end:Station {id: row.EndID})
MERGE (start)-[:JOURNEY {id:row.JourneyID}]->(end)
That more intuitively describes the data, and you could even extend this to different relationship types, if you can describe Journeys in more detail.
Edit:
This is to answer your question, but I can't speak as to how this scales up. I think it depends on the types of queries you plan to make.
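For example, with the journeys stored as relationships you could count how many journeys run between each pair of stations; this is a sketch against the edge-per-journey model above:
// Sketch: count journeys per station pair
MATCH (start:Station)-[j:JOURNEY]->(end:Station)
RETURN start.id AS origin, end.id AS destination, count(j) AS journeys
ORDER BY journeys DESC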

Neighbourhood reverse geocoding - GeoNames API error code 15 - R language

I am encountering a problem with the reverse geocoding GeoNames API package (geonames) in R. I have a dataset of nearly 900k rows containing latitude and longitude, and I am using the GNneighbourhood(lat, lng)$name function to get the neighbourhood for every pair of coordinates (my dataset contains incidents in San Francisco).
Now, while the function works perfectly for the large majority of points, there are times when it gives the error code 15 message: we are afraid we could not find a neighbourhood for latitude and longitude. The same procedure can be performed with the revgeocode function (Google reverse geocoding API) of the ggmap package, and in fact it works even for the points that fail with the geonames package. The reason I am not using it is the query limit per day.
Successful example
GNneighbourhood(37.7746,-122.4259)$name
[1] "Western Addition"
Failure
GNneighbourhood(37.76569,-122.4718)$name
Error in getJson("neighbourhoodJSON", list(lat = lat, lng = lng)) :
error code 15 from server: we are afraid we could not find a neighbourhood for latitude and longitude :37.76569,-122.4718
Searching for the above point in Google Maps works fine, and we can also see that the incident is not on water or in any other inappropriate location (unless the park nearby indicates something, I don't know).
Does anyone have experience with this procedure and this specific package? Is it possible for the API to be incomplete? It clearly states that it can handle all US cities. Some help would be appreciated.
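One way to keep a long batch running despite these failures is to wrap the call in tryCatch() and record NA for points without a neighbourhood; this is only a sketch, and the incidents data frame and its lat/lng column names are assumptions:
# Sketch: make error code 15 ("no neighbourhood found") yield NA instead of
# stopping the whole 900k-row run; 'incidents' is an assumed data frame.
library(geonames)
safe_neighbourhood <- function(lat, lng) {
  tryCatch(GNneighbourhood(lat, lng)$name,
           error = function(e) NA_character_)
}
incidents$neighbourhood <- mapply(safe_neighbourhood,
                                  incidents$lat, incidents$lng)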

Resources