How to speed up a two-hop query in TitanDB with Cassandra

I am testing TitanDB + Cassandra.
The graph schema looks like this:
VERTEX: USER(userId), IP(ip), SESSION_ID(sessionId), DEVICE(deviceId)
EDGE: USER->IP, USER->SESSION_ID, USER->DEVICE
DATA SIZE: 100 million vertices, 1 billion edges
Index: vertex-centric index on every edge label, plus indexes on userId, ip, sessionId, and deviceId.
Vertex partitioning is enabled for IP, DEVICE and SESSION_ID, with 32 partitions in total.
Cassandra hosts: AWS EC2 I2 (2xlarge) x 24.
Currently, every host holds about 30 GB of data.
Use case: given a userId and an edge label, find all related users through the vertices that edge points to.
for example: g.V().has(T.label, 'USER').has('USER_ID', '12345').out('USER_IP').in().valueMap();
But this kind of query is pretty slow, sometimes taking hundreds of seconds.
One user can have many related IPs (hundreds), and those IPs can in turn lead to lots of USERs (thousands).
Does Titan run this kind of query in parallel against all partitions of the storage backend?
I tried using limit:
g.V().has(T.label, 'USER').has('USER_ID', '12345').out('USER_IP').limit(50).in().limit(100).valueMap()
It's also slow. I hope this kind of query can complete within 5 seconds.
How does Titan's limit() work? Does it fetch all results first and then apply the limit?
How can I improve the performance? Can anyone give some advice?

One quick performance gain could come from Titan's vertex-centric indices, which allow very fast hops from one vertex to another. For example, you could try something like this:
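mgmt = graph.openManagement()
userId = mgmt.getPropertyKey('userId')
userIp = mgmt.getEdgeLabel('USER_IP')
// Assumption: 'time' is a property key defined on USER_IP edges (the sort key must be an edge property)
time = mgmt.getPropertyKey('time')
mgmt.buildEdgeIndex(userIp, 'userIdByUserIP', Direction.BOTH, Order.decr, time)
mgmt.commit()
This creates a simple vertex-centric index.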
If you want to look up multiple user IPs from multiple user vertices, you could try Titan-Hadoop. However, that is a more involved process.
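The question of how limit() behaves can be illustrated with a toy model. TinkerPop traversals are pull-based iterators, so a limit() step stops requesting input once it has emitted enough results. The plain-Python sketch below is not Titan code; the function names and sample data are made up for illustration. It mimics out('USER_IP').limit(2).in().limit(3) with generators and records which elements were actually touched. It models only the step pipeline and says nothing about how many rows Titan reads from Cassandra for each vertex.

```python
from itertools import islice

# Hypothetical sample data: one user with three IPs, each IP shared by some users.
USER_TO_IPS = {"u1": ["ip1", "ip2", "ip3"]}
IP_TO_USERS = {"ip1": ["u1", "u2"], "ip2": ["u3", "u4"], "ip3": ["u5"]}


def out_ips(user, log):
    """First hop: lazily yield the IPs adjacent to a user."""
    for ip in USER_TO_IPS[user]:
        log.append(("ip", ip))
        yield ip


def in_users(ips, log):
    """Second hop: lazily yield the users adjacent to each incoming IP."""
    for ip in ips:
        for user in IP_TO_USERS[ip]:
            log.append(("user", user))
            yield user


def two_hop(user, ip_limit, user_limit, log):
    """Chain both hops; islice plays the role of limit()."""
    ips = islice(out_ips(user, log), ip_limit)
    users = islice(in_users(ips, log), user_limit)
    return list(users)


touched = []
related = two_hop("u1", ip_limit=2, user_limit=3, log=touched)
print(related)
print(touched)
```

In this model the third IP and the users behind it are never visited, because the second limit is satisfied before the pipeline asks for them. Whether a real query benefits in the same way depends on how the storage backend fetches each vertex's edges.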

Related

NebulaGraph Database: How to get all the vertices of each tag?

I want to get all the vertices of each tag in the NebulaGraph database.
I tried fetch prop on player * yield properties(vertex), but it failed:
(root@nebula) [basketballplayer]> fetch prop on player * yield properties(vertex)
[ERROR (-1004)]: SyntaxError: syntax error near `* yield '
I also tried the Neo4j-style statement match (v:player) return v, but that didn't work either:
(root@nebula) [basketballplayer]> match (v:player) return v
[ERROR (-1005)]: Scan vertices or edges need to specify a limit number, or limit number can not push down.
Can anyone show me how to do this correctly in NebulaGraph?
By design, scans per tag or edge type (like a table scan in a tabular DBMS) are prohibited by default.
This is because NebulaGraph stores data in a linked, graph-oriented way (think of a graph traversal that starts from known nodes and then expands multiple hops along the edges/relationships). Enabling a non-graph scan of the data in a distributed graph database like NebulaGraph is therefore costly.
To enable such queries, an index needs to be explicitly created beforehand, or a LIMIT clause [1] must be used to sample the data (which also avoids a full scan).
[1]: example of a query (need index for starting node) with a LIMIT clause
MATCH (v:player) RETURN v LIMIT 100
Note: the index is only used to locate the starting node of the query pattern.

Gremlin `elementMap()` step returns fewer elements than are actually present

I have an application with more than 3000 vertices that share the same label, let's say ABC. My application needs to list all the vertices and their properties so the user can choose an entity and interact with it. For that I am writing a GetAllVertices query for label ABC.
The IDs of the vertices are numbers.
Ex: 1, 2, 3, ...
The following query returns the correct number of vertices, about 3000:
g.V().hasLabel('ABC').dedup().count()
The following query, however, returns only around 1600 entries:
g.V().hasLabel('ABC').elementMap()
I am trying to understand what is happening and how I can get the elementMap for all the vertices I am interested in. I suspect that elementMap() uses a hash function whose keys collide, so that some entries are overwritten by others.
Using TinkerGraph I am not able to reproduce this behavior.
gremlin> g.inject(0).repeat(addV('ABC').property(id,loops())).times(3000)
==>v[2999]
gremlin> g.V().hasLabel('ABC').count()
==>3000
gremlin> g.V().hasLabel('ABC').elementMap().count()
==>3000
If you can say more about the data in your graph I can do some additional tests and try to reproduce what you are seeing.
UPDATED 2022-08-03
I ran the same test on Amazon Neptune version 1.1.1.0.R4 from a Neptune notebook, and it worked there as well.
%%gremlin
g.inject(0).repeat(addV('ABC').property('p1',loops())).times(3000)
v[a6c131cc-42e8-3713-c82d-faa193b118a0]
%%gremlin
g.V().hasLabel('ABC').count()
3000
%%gremlin
g.V().hasLabel('ABC').elementMap().count()
3000

How to have an optional primary key with DynamoDB?

I have a relationship whereby each SITE can have one or more CAMERAs.
So the parent-child relationship would be SITE->CAMERA[s].
99% of my queries will be "Give me all the cameras at a given site", "Give me camera XYZ" and "Give me all cameras where enabled===true", at roughly a 1:1:1 ratio.
The DynamoDB design, if I understand correctly, would be to have the partition key be 'SITE_ID' and the sort key be 'CAMERA_ID'. Done and done.
....
However, not every CAMERA belongs to a SITE. About 10% of my CAMERAs are not associated with a SITE. I could just put 'noSite' or something similar as the partition key, but that seems like a kludge... or is it?
I'm new to DynamoDB and unsure how best to set up this relationship. I've always just used MongoDB and never spent time in the SQL world, so I have no experience worrying about indexes. Cost is more important than raw speed, and the DB will remain fairly small (currently around 500 cameras and likely never more than 10k).
What is the best way to set up this table?
Detailed question first: a noSite key is not a bad design choice for unassigned cameras. SiteID is important and the key cannot be blank.
Your access patterns give you flexibility. Your low data volumes reduce the stakes of the design decisions.
What are the Partition Key and Sort Key names? Regardless of which "columns" you end up selecting for the keys, naming the keys PK and SK gives you the option to add other record types in a single-table design later. This is a common practice.
What are the PK and SK columns?
You have two good options for PK and SK for your Camera records:
# Option 1 - marginally better, CameraID has the higher cardinality
PK: CameraID, SK: SiteID
# Option 2
PK: SiteID, SK: CameraID
At this point, one of your "queries" will be executed as a query (faster and cheaper) and the other two as scans (slower and more expensive). Scanning 500 records is nothing, though, so you could be "done and done", as you say.
Sooner, Later or Never
If required, we can remove the scan operations by adding secondary indexes. Secondary indexes add storage cost (records are literally duplicated) but reduce access costs. The net change in cost depends on the case. Performance will improve.
# Add an index to query "Give me all the cameras at a given site"
GSI1PK: SiteID, GSI1SK: CameraID # reverse your choice for primary keys
# or, to get fancy and be able to query enabled cameras by site, too, use a concatenated SK with a begins_with query
GSI1PK: SiteID, GSI1SK: Enabled#True#CameraID
# Add an index to query "Give me all cameras where enabled===true"
# Concatenate SiteID and CameraID in the GSI Sort Key to enable 2 types of queries
# 1. all enabled cameras? GSI2PK = true and GSI2SK > ""
# 2. all enabled cameras at Site123? GSI2PK = true and GSI2SK begins_with("Site123")
GSI2PK: Enabled, GSI2SK: SiteID#CameraID
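To make the concatenated sort key concrete, here is a small in-memory sketch in plain Python. It does not call DynamoDB or boto3; the sample cameras and the helper names (build_gsi2, query_gsi2) are hypothetical. It only models how items would be grouped by GSI2PK, ordered by GSI2SK, and filtered with a begins_with-style prefix.

```python
# Hypothetical camera records; 'noSite' marks cameras without a site.
CAMERAS = [
    {"CameraID": "cam1", "SiteID": "Site123", "Enabled": True},
    {"CameraID": "cam2", "SiteID": "Site123", "Enabled": False},
    {"CameraID": "cam3", "SiteID": "Site456", "Enabled": True},
    {"CameraID": "cam4", "SiteID": "noSite", "Enabled": True},
]


def gsi2_keys(item):
    """Derive the index keys: GSI2PK = Enabled, GSI2SK = SiteID#CameraID."""
    pk = "true" if item["Enabled"] else "false"
    sk = f'{item["SiteID"]}#{item["CameraID"]}'
    return pk, sk


def build_gsi2(items):
    """Group items by partition key and keep each group ordered by sort key."""
    index = {}
    for item in items:
        pk, sk = gsi2_keys(item)
        index.setdefault(pk, []).append((sk, item))
    for rows in index.values():
        rows.sort(key=lambda row: row[0])
    return index


def query_gsi2(index, pk, sk_prefix=""):
    """Model a query on one partition with an optional begins_with prefix."""
    return [item["CameraID"] for sk, item in index.get(pk, []) if sk.startswith(sk_prefix)]


gsi2 = build_gsi2(CAMERAS)
all_enabled = query_gsi2(gsi2, "true")
enabled_at_site = query_gsi2(gsi2, "true", "Site123#")
print(all_enabled)
print(enabled_at_site)
```

Including the `#` delimiter in the prefix is a deliberate choice in this sketch: a bare "Site123" prefix would also match a site named "Site1234".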

Why does looking up a Gremlin vertex after adding one cost so much?

I discovered the following Gremlin query that was charging 60K RUs in Cosmos DB:
g.addV(label, 'plannedmeal')
.property('partitionKey', '84ca17dd-c284-4f47-a839-a75bc27f9097')
.as('meal')
.V('19760224-7ac1-4316-b9a8-1f7a979274b8') <--- problem
.as('food')
.select('meal')
.addE('contains')
.to('food')
.select('meal')
Through process of elimination, I learned that the .V('19760224-7ac1-4316-b9a8-1f7a979274b8') is the expensive part. I can easily split the query into 2, such as:
g.addV(label, 'plannedmeal')
.property('partitionKey', '84ca17dd-c284-4f47-a839-a75bc27f9097')
g.V('ID_OF_NEW_ITEM')
.as('meal')
.V('19760224-7ac1-4316-b9a8-1f7a979274b8')
.as('food')
.select('meal')
.addE('contains')
.to('food')
.select('meal')
For reference, this costs about 50 RUs total. My question is - why is there a 59,950 RU difference between these 2 approaches?
Edit
After reviewing the execution profile of the query, the GetVertices operation that occurs in the problematic step seems to scan every vertex in my graph. This is the problem, but it's still not clear why requesting a V by an id is so expensive.
This is caused by a known limitation of Cosmos DB.
Index utilization for Gremlin queries with mid-traversal .V() steps: Currently, only the first .V() call of a traversal will make use of the index to resolve any filters or predicates attached to it. Subsequent calls will not consult the index, which might increase the latency and cost of the query.
I adjusted the query to use one of the workarounds suggested by the documentation and it dropped to 23 RU.
g.addV(label, 'plannedmeal')
.property('partitionKey', '84ca17dd-c284-4f47-a839-a75bc27f9097')
.as('meal')
.map(
__.V('19760224-7ac1-4316-b9a8-1f7a979274b8')
)
.as('food')
.select('meal')
.addE('contains')
.to('food')
.select('meal')

Gremlin OLAP traversal query error regarding local star-graph

I'm trying to execute an OLAP traversal for a query that needs to check whether a vertex has a neighbour of a certain type.
I keep getting:
Local traversals may not traverse past the local star-graph on GraphComputer
My query looks something like:
g.V().hasLabel('label1').
where(__.out().hasLabel('label2'))
I'm using the TraversalVertexProgram.
Needless to say, the same query runs without problems in OLTP mode.
Is there a way to execute such logic?
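That is a limitation of the TinkerPop OLAP GraphComputer. It operates on 'star-graph' objects: a vertex and its connected edges only. Internally it uses a message-passing engine. So you have to rewrite your query.
Option 1: start from label2 and go to label1. This should return the same vertices, although a label1 vertex with several label2 neighbours is emitted once per neighbour, so append dedup() if that matters: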
g.V().hasLabel('label2').in().hasLabel('label1')
Option 2: use unique edge labels and check the edge label instead:
g.V().hasLabel('label1').where(__.outE('label1_label2'))
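The relationship between the original filter and the reversed traversal of Option 1 can be checked on a tiny hand-made graph. The sketch below is plain Python rather than Gremlin, and the vertices, labels, and function names are invented for illustration. It shows that both approaches select the same label1 vertices, but the reversed form yields one result per matching edge.

```python
# Hypothetical graph: directed edges (source, target) and a label per vertex.
EDGES = [("a", "x"), ("a", "y"), ("b", "z"), ("c", "x")]
LABELS = {"a": "label1", "b": "label1", "c": "label1",
          "x": "label2", "y": "label2", "z": "label3"}


def filter_by_neighbour():
    """Models V().hasLabel('label1').where(out().hasLabel('label2'))."""
    result = []
    for vertex, label in LABELS.items():
        has_match = any(src == vertex and LABELS[dst] == "label2" for src, dst in EDGES)
        if label == "label1" and has_match:
            result.append(vertex)
    return result


def reverse_traversal():
    """Models V().hasLabel('label2').in().hasLabel('label1'), one result per edge."""
    return [src for src, dst in EDGES
            if LABELS[dst] == "label2" and LABELS[src] == "label1"]


filtered = filter_by_neighbour()
reversed_hits = reverse_traversal()
print(filtered)
print(reversed_hits)
print(sorted(set(reversed_hits)))
```

Vertex a has two label2 neighbours, so the reversed form reports it twice; removing duplicates makes the two results agree.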

Resources