How query parallelism is implemented in apache ignite? - bigdata

I want to know how query parallelism is implemented in apache ignite. The resulting numbers are totally different from the results without parallelism.
Thanks

Without query parallelism Ignite splits query execution between nodes: map request for each node and reduce on node-requester. To perform better on multiprocessor machines, cache indexes are split into smaller parts, just like you're working with nodes_num * queryParallelism.
In that way each node may process same query that was split on query parallelism independent threads.

Related

High memory consumption of gremlin query on AWS Neptune

I'm trying to parallelize a query to count vertices that have an edge of a given label in AWS Neptune on a large Graph by partitioning vertices by ID:
g.V().hasLabel('Label').
hasId(startingWith('prefix')).
where(outE('EdgeType')).
count()
however, the query seems to consume a lot of memory and I run into OOM exceptions. Is there an explanation for this? And, what would in general be a good strategy to parallelize/run such a query efficiently?
The graph has about ~500M vertices where ~100M have the label of interest and ~90% of those have an edge with the desired label.
Any of the text predicates (startingwith(), endingWith(), containing()) are non-indexed operations in Neptune as Neptune does not maintain a native full-text-search index. This means that any query using those may need to perform a full or range "scan" of the graph to find the results and this can expensive (as Kelvin mentions). You can, however, leverage integration with OpenSearch [1] if these types of queries are common in your use case.
Also note that Neptune's execution model is presently designed for highly-concurrent, transactional queries. If you have queries that are going to touch a large portion of the dataset, you may need to break those queries up into multiple, parallel queries. Each Neptune instance has a number of query execution threads equal to 2x the number of vCPUs on an instance. Memory allocation is divided up such that a large portion of instance memory is reserved for buffer pool cache and the remaining memory is allocated for the OS of the instance and the execution threads. Each execution thread will use a portion of the memory allocated for those threads. So an Out-Of-Memory Exception occurs when the thread runs out of memory, not the instance running out of memory. If the query execution is running out of memory, you can try increasing the instance size (to allocate more memory to the threads), or you may need to divide the query into multiple parts and execute them concurrently.
[1] https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html

Neptune and Gremlin - wild CPU utilization

Issue
I'm seeing extremely high CPU utilization when making [what seems like] a fairly common ask to a graph database. In fact, the utilization is so large that Amazon Neptune seems to "tap out" when I execute multiple of the queries in rapid succession / simultaneously.
I've experimented with even the largest (db.r5d.24xlarge) instance, which costs $14k/ month (😅), and I still see the general behavior -- memory is fine, CPU utilization goes wild.
Further, the query isn't exactly quick (~5 seconds), so I'm kind of getting the worst of both worlds...
Ask
Can someone help me understand what I can do to address this / review my query? My hope was, just as a relational database can handle thousands of concurrent requests, Amazon Neptune could do similar.
I suspect I might be using .limit() in the wrong spot. That said, the results produced are correct. If the limit() is moved up before the .and(), I'm concerned I'd end up with suboptimal results, since -- due to the until() clauses -- it could just be paths that stopped w/o making it to the toPortId.
Update: I’m wondering if it might make sense to add a timeLimit() clause to the repeat() clause … anecdotally, after some testing, it seems like I’m paying a significant time penalty to rigorously verify that there are, in fact, no routes to certain places.
Detail
I'm new to Gremlin / Neptune / graph databases in general, so it's possible/likely that I'm either doing something wrong (or have something incorrectly configured) -or- I overestimated the ability of these tools to handle queries.
My use-case requires that I "qualify" (or disqualify) a large number of candidate destinations -- that is:
Starting at ORIGIN (fromPortId) are there / what are the paths to CANDIDATE (toPortId) for which the duration properties sums to less than LIMIT (travelTimeBudget)?
Offending Query
const result = await g.withSack(INITIAL_CHECKIN_PENALTY)
.V(fromPortId)
.repeat(
(__.outE().hasLabel('HAS_VOYAGE_TO').sack(operator.sum).by('duration'))
.sack(operator.sum).by(__.constant(LAYEROVER_PENALTY))
.inV()
.simplePath()
)
.until(
__.or(
__.has('code', toPortId),
__.sack().is(gte(travelTimeBudget)),
__.loops().is(maxHops))
)
.and(
__.has('code', toPortId),
__.sack().is(lte(travelTimeBudget))
)
.limit(4)
.order().by(__.sack(), order.asc)
.local(
__.union(
__.path().by('code').by('duration'),
__.sack()
).fold()
)
.local(
__.unfold().unfold().fold()
)
.toList()
.then(data => {
return data
})
.catch(error => {
console.log('ERROR', error);
return false
});
Execution
Database: Amazon Neptune (db.r5.large)
Querying from / runtime: AWS Lambda running Node.js
Graph
5,300 vertices (Port)
42,000 edges (HAS_VOYAGE_TO)
Data Model
Note: For this issue, I'm only executing the "upper plane" / top half.
To address the CPU discussion, perhaps a small explanation of the Neptune instance architecture will be beneficial. Each Neptune instance has a worker thread pool. The number of workers in the pool is twice the number of vCPU the instance has. That controls how many queries (maximum) can be running concurrently on a given instance. Each instance also has a query queue that will hold queries sent to the instance until a worker becomes available. For a somewhat complex query, against a small instance type (like the db.r5.large), seeing 20 to 30% CPU utilization is not unexpected. A significant portion of the instance memory is used to cache graph data locally in a most recently used fashion. The remaining memory is allocated to the worker threads for query execution. So larger instances have (a) more workers (b) more memory for caching graph data locally and (c) more memory for use during query execution.
Without having the data it's hard to attempt to modify your query and test it, but it may well be possible to optimize various parts of it to improve its overall efficiency.
As there are a lot of topics that we could discuss here, and I am more than happy to do that, it might be easiest, if you are able, to open an AWS support case and ask that the support engineer create a ticket and have it assigned to me. I will be happy to then get on the phone with you and we can spend some time discussing your use case if that would help.
If you are not able to open a support case, we can connect other ways. I'm easy to find on LinkedIn if you prefer to reach out that way.
I very much want to help you with this and your other questions, but I feel the most expeditious way might be for us to get on the phone.
I'm also happy to keep discussing here of course.

Gremlin console keeps returning "Connection to server is no longer active" error

I tried to run a Gremlin query adding a property to vertex through Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after query is running for one or two minutes.
I tried some simple queries like g.V().limit(10) and it works fine.
Since the affected vertex count is more than 4 million, not sure if it is failing due to timeout issue.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for first few batches and started failing again.
Is there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using as well as the capacity of the hardware being used.
The capacity of the hardware where Gremlin Server is running in terms of number of CPUs and most importantly, memory, will also be a factor as will the setting of the query timeout value.
To do this in Gremlin, if you had a way to identify distinct ranges of vertices easily you could split this up into multiple threads each doing batches of updates. If the example you show is representative of your actual need then that is likely not possible in this case.
Likewise some graph databases provide a bulk load capability that is often a good way to do large batch updates but probably not an option here as you need to do essentially a conditional update based on looking at the current presence (or not) of a property.
Without more information about your data model and hardware etc. the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the sweet spot for your environment as the capacity of the hardware will definitely play a role in situations like this as well as how you write your query.

Hadoop suitability for recursive data processing

I have a filtering algorithm that needs to be applied recursively and I am not sure if MapReduce is suitable for this job. W/o giving too much away, I can say that each object that is being filtered is characterized by a collection if ordered list or queue.
The data is not huge, just about 250MB when I export from SQL to
CSV.
The mapping step is simple: the head of the list contains an object that can classify the list as belonging to one of N mapping nodes. the filtration algorithm at each node works on the collection of lists assigned to the node and at the end of the filtration, either a list remains the same as before the filtration or the head of the list is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as much as 2000 nodes.
Simple, but it requires perhaps up to a 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work on a large number of machines. In order to fully benefit from Hadoop your application has to be characterized, at least by the following three things:
work with large amounts of data (data which is distributed in the cluster of machines) - which would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that out of these 3, your application has only the last 2 characteristics (with the observation that you are trying to recursively use a scatter - gather procedure - which means a large number of jobs - equal to the recursion depth; see last paragraph why this might not be appropriate for hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing than on distributed data intensive processing. Of course, using the processing power of a networked cluster is tempting but this comes at a cost: mainly the time inefficiency given by the network communication (network being the most contended resource in a hadoop cluster) and by the I/O. In scenarios which are well-fitted to the Hadoop framework these inefficiency can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As I can see, you need 1000 jobs. The setup and the cleanup of all those jobs would be an unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
Recursive algos are hard in the distributed systems since they can lead to a quick starvation. Any middleware that would work for that needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
THe litmus test on distributed continuations: try to develop a naive fibonacci implementation in distributed context using recursive calls. Here's the GridGain's example that implements this using continuations.
Hope it helps.
Q&D, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!

How many mappers/reducers should be set when configuring Hadoop cluster?

When configuring a Hadoop Cluster whats the scientific method to set the number of mappers/reducers for the cluster?
There is no formula. It depends on how many cores and how much memory do you have. The number of mapper + number of reducer should not exceed the number of cores in general. Keep in mind that the machine is also running Task Tracker and Data Node daemons. One of the general suggestion is more mappers than reducers. If I were you, I would run one of my typical jobs with reasonable amount of data to try it out.
Quoting from "Hadoop The Definite Guide, 3rd edition", page 306
Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization.
The amount of oversubscription depends on the CPU utilization of jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
A processor in the quote above is equivalent to one logical core.
But this is just in theory, and most likely each use case is different than another, some tests need to be performed. But this number can be a good start to test with.
Probably, you should also look at reducer lazy loading, which allows reducers to start later when required, so basically, number of maps slots can be increased. Don't have much idea on this though but, seems useful.
Taken from Hadoop Gyan-My blog:
No. of mappers is decided in accordance with the data locality principle as described earlier. Data Locality principle : Hadoop tries its best to run map tasks on nodes where the data is present locally to optimize on the network and inter-node communication latency. As the input data is split into pieces and fed to different map tasks, it is desirable to have all the data fed to that map task available on a single node.Since HDFS only guarantees data having size equal to its block size (64M) to be present on one node, it is advised/advocated to have the split size equal to the HDFS block size so that the map task can take advantage of this data localization. Therefore, 64M of data per mapper. If we see some mappers running for a very small period of time, try to bring down the number of mappers and make them run longer for a minute or so.
No. of reducers should be slightly less than the number of reduce slots in the cluster (the concept of slots comes in with a pre-configuration in the job/task tracker properties while configuring the cluster) so that all the reducers finish in one wave and make full utilisation of the cluster resources.

Resources