Neo4j: Java heap space error with a 100k-node graph

I have a Neo4j graph with a little more than 100,000 nodes. When I run the following Cypher query over REST, I get a Java heap space error. The query produces 2-itemsets from a set of purchases.
MATCH (a)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN a.id,c.id
The cross product of the two node types, Type1 (a, c) and Type2 (b), is of order 80k * 20k.
Is there a more optimized query for the same purpose? I am still a newbie to Cypher. (I have two indexes, on all Type1 and all Type2 nodes respectively, which I could use.)
Or should I just go ahead and increase the Java heap size?
I am using py2neo for the REST queries.
Thanks.

As you said, the cross product is 80k * 20k, so you are probably pulling all of those results across the wire?
That is probably not what you want. Usually such a query is bound by a start user or a start product.
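For example, a minimal bounded version (a sketch; it assumes Type1 nodes carry an id property, and the id value is illustrative):
MATCH (a:Type1 {id: 42})<-[:BOUGHT]-(b)-[:BOUGHT]->(c)
RETURN a.id, c.id
LIMIT 1000
That way you page through one product's co-purchases instead of materializing all 1.6 billion paths at once.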
You might try to run this query in the neo4j-shell:
MATCH (a:Type1)<-[:BOUGHT]-(b)-[:BOUGHT]->(c) RETURN count(*)
If you have a label on the nodes (e.g. Type1), you can use that label to drive the match.
That will show you how many paths you are looking at; 80k times 20k is 1.6 billion paths.
And I'm not sure whether the version of py2neo you are using (which one is it?) already streams results for that. Try the transactional Cypher endpoint with py2neo (i.e. the cypher Session's create_transaction() API).
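A minimal sketch of that, assuming py2neo 1.6's transactional Cypher API (names differ in other versions; the URL and parameter value are illustrative):

from py2neo import cypher

# Open a session against the transactional Cypher endpoint
session = cypher.Session("http://localhost:7474")
tx = session.create_transaction()
# A bounded, parameterized statement; results come back in batches
tx.append("MATCH (a:Type1 {id: {pid}})<-[:BOUGHT]-(b)-[:BOUGHT]->(c) "
          "RETURN a.id, c.id LIMIT 1000", {"pid": 42})
results = tx.commit()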

Related

Azure CosmosDB GraphDB Gremlin API high cost

We are investigating CosmosDB GraphDB more deeply, and it looks like the price of simple queries is very high.
This simple query, which returns 1000 vertices with 4 properties each, costs 206 RU (all vertices live on the same partition key and all documents are indexed):
g.V('0').out()
Here 0 is the id of the start vertex.
The longer query is no better (208 RU):
g.V('0').outE('knows').inV()
Are we doing something wrong, or is this the expected price?
I have been working with CosmosDB Graph too, and like you I am still trying to develop a feel for RU consumption.
Some of my experiences may be relevant for your use case:
Adding filters to your queries can keep the engine from scanning through all the available vertices.
.dedup() performed small miracles for me. I faced a situation where I had two vertices A and B connected to C, with C in turn connected to other, irrelevant vertices. By running small chunks of my query and using .executionProfile(), I realized that in the first step of my traversal, where I get the vertices to which A and B connect, C would show up twice. This means that when processing my original query, the engine would go to the effort of computing C twice when continuing on to the other irrelevant vertices. By adding a .dedup() step here, I reduced the results of that first step from two C records to one. Considering this example, you can imagine that the irrelevant vertices may also connect to further vertices, so duplicates can appear at each step.
You may not always need every single vertex in the output of a query; using the .range() step to limit the results to an amount that suits your needs can also reduce RUs.
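For instance, a sketch combining these steps (the label, edge label, and property values are illustrative):
g.V().has('person', 'name', 'alice').out('knows').dedup().range(0, 20)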
Be aware of the known limitations presented by MS.
I hope this helps in some way.
Instead of returning complete vertices, you can try returning only the vertex properties you need, using the Gremlin .project() step.
Note that CosmosDB does not seem to support the other Gremlin steps for retrieving just some properties of a vertex.
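For example, a sketch that returns two properties per adjacent vertex (the property names are assumptions):
g.V('0').out().project('name', 'price').by('name').by('price')
Note that .by('key') fails for vertices missing that property; .by(values('key').fold()) is a safer variant that yields an empty list instead.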

Faster JanusGraph queries

Previously my JanusGraph instance worked fine with nearly 1,000 elements. I grew the graph to 100k and now some queries are very slow. I'm using Cassandra as the storage backend and Elasticsearch as the index backend.
For example:
g.V().hasLabel('Person').union(count(), range(0, 15))
This takes 1.3 minutes, and I just want to get 15 Person vertices; it should be fast.
g.E().count() takes 1.4 minutes. Why is it so slow? I only have 200k nodes and 400k edges.
I also have composite indexes for my Person node type, covering all of its properties.
Thanks for any advice.
Counting vertices with a specific label is a frequent use case that JanusGraph can only address by running a full table scan, so for larger graphs counting becomes unbearably slow. Because the use case is so common, an issue was created for it.
If you use JanusGraph with an indexing backend for mixed indexes, it is possible to run a so-called direct index query, which can return totals of index hits much faster. This does not work on label filters, though, only on indexed properties.
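A sketch of such a direct index query, assuming a mixed index named personByName over a name property (both names are illustrative, and the method names follow recent JanusGraph versions):
// Total hits, answered by the index backend instead of a table scan
graph.indexQuery("personByName", "v.name:john").vertexTotals()
// The first 15 matching vertices
graph.indexQuery("personByName", "v.name:john").limit(15).vertexStream()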

gremlin hasLabel query times out

I have a test graph with less than a million nodes and probably a slightly higher number of edges. I'm using a remote Gremlin client to connect to a JanusGraph/Gremlin Server instance backed by 3 Scylla nodes.
I have nodes with several different labels, i.e. url, domain, host, and brand. The graph contains mainly url, domain, and host nodes; there is one brand node in the entire graph. The brand node looks like this:
{
  label: brand
  properties: {
    brand: string
  }
}
I am able to run the following query in 1.5 ms; the brand property has a composite index.
g.V().hasLabel('brand').has('brand','stackoverflow');
The query below hits the 30 s timeout. I expect it to return only one result, based on the data I imported into the graph (I verified this by testing with a limit):
g.V().hasLabel('brand')
My questions:
Why does this timeout?
Is JanusGraph scanning through all nodes in the graph to try to find a single node labeled 'brand'? Is there no default index on labels?
Why does the first query execute fine when the first steps for both are the same?
Thank you
Why does this timeout?
Is JanusGraph scanning through all nodes in the graph to try to find a single node labeled 'brand'? Is there no default index on labels?
As you have guessed, this is likely timing out due to a full graph scan, since vertex labels are not indexed in JanusGraph. There is an open issue for this: https://github.com/JanusGraph/janusgraph/issues/283
Why does the first query execute fine when the first steps for both are the same?
In that case I suspect JanusGraph's optimizer is able to rewrite the traversal plan to use the composite index on the brand property.
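A common workaround is to mirror the label in an indexed property and filter on that instead; a sketch, with illustrative property and index names:
// Schema: a 'type' property backed by a composite index
mgmt = graph.openManagement()
type = mgmt.makePropertyKey('type').dataType(String.class).make()
mgmt.buildIndex('byType', Vertex.class).addKey(type).buildCompositeIndex()
mgmt.commit()
// Write the label into the property at insert time...
g.addV('brand').property('type', 'brand').property('brand', 'stackoverflow')
// ...and this lookup then uses the composite index instead of a scan
g.V().has('type', 'brand')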

Gremlin count() query in DataStax is too slow

I have 3 nodes in DataStax Enterprise and loaded 65 million vertices and edges onto them. When I use DSE Studio or the Gremlin console and run a Gremlin query on my graph, the query is too slow. I defined every kind of index and tested again, but it had no effect.
When I run a query such as g.V().count(), CPU usage and load average barely change, while if I run a CQL query, it is distributed across all the nodes and CPU usage and load average change significantly on all of them.
What are the best practices or configurations for efficient Gremlin queries in this case?
count()-based traversals should be executed via OLAP with Spark for graphs of the size you are working with. If you use standard OLTP-based traversals, you can expect long wait times for this type of query.
Note that this rule holds true for any graph computation that must do a "table scan" (i.e. touch all or a very large portion of vertices/edges in the graph). This issue is not specific to DSE Graph either and will apply to virtually any graph database.
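For instance, in DSE Graph such a count is typically run against the analytics (OLAP) traversal source, which executes it with Spark; the graph name below is illustrative:
:remote config alias g my_graph.a
g.V().count()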
After many tests with different queries, I came to the conclusion that Gremlin seems to have a problem with count queries over millions of vertices. When you define an index on a vertex property and look up a specific vertex, for example g.V().hasLabel('member').has('C_ID','4242833'), the query takes less than 1 second, which is acceptable. The question is: why does Gremlin have a problem with count queries over millions of vertices?

Riak: how are queries using secondary indices implemented?

Consider a query that uses secondary indices. Does this cause the node that received the query to send out a request to all other nodes? That is, does the use of secondary indices require communicating with all other nodes to find data that matches the index lookup?
The best source for information on how querying of secondary indexes works can be found here:
http://docs.basho.com/riak/latest/dev/advanced/2i/
I believe that the portion of the explanation that is relevant to your question is:
"When issuing a query, the system must read from a “covering” set of partitions and then merge the results. The system looks at how many replicas of data are stored—the N value or n_val—and determines the minimum number of partitions that it must examine (1 / n_val) to retrieve a full set of results, also taking into account any offline nodes."
Also note that "For all 2i queries, the R parameter is set to 1" - http://docs.basho.com/riak/latest/dev/using/2i/#Querying
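As a rough worked example, assuming Riak's default ring size: with 64 partitions and n_val = 3, a covering set is about 64 / 3 ≈ 22 partitions, so the coordinating node contacts roughly a third of the ring (not every node) and merges the partial results. Over HTTP, such a 2i lookup has the form:
GET /buckets/<bucket>/index/<index_name>/<value>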
