1. Does the Enterprise edition support distributed graph algorithms? Or can Neo4j graph data and graph computation be distributed over cloud infrastructure? If so, how does it work?
2. If I have a server (16-core CPU, 256 GB memory, 2 TB HDD) and each node or relationship carries about 1 KB of data, how many nodes and relationships can the server hold? The ratio of nodes to relationships is 1:5.
If we want to import more data than that, what should we do?
3. For fast importing we used the BatchInserter, but a single Lucene index has a document limit of 2^32, so we can import fewer than 2^32 nodes. What can we do about this limit, other than using more indexes?
4. After two days of importing, the import speed (200-600 nodes per second) has become too slow to accept. It is only 1% of what it was at the beginning! I can see that memory is full; what should we do to improve the speed?
It has imported about 0.2B nodes and 0.5B relationships so far. That's half of my data. And my server has 32 GB of memory.
Thanks a lot.
You have many questions here that might be better suited as individual questions or asked on the Neo4j slack channel.
I started to write this as a comment, but ran out of chars so I'll try to point you to some resources:
1) Neo4j distributed graph model
I'm not sure exactly what you're asking here. See this document for general information on Neo4j scalability. If you're asking if graph traversals can be distributed across machines in a Neo4j cluster, the answer is no.
2) Hardware sizing
This depends a bit on your access patterns. See the hardware sizing calculator that can take workload / access patterns into account.
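As a very rough back-of-the-envelope using only the numbers in your question (1 KB per element, a 1:5 node-to-relationship ratio, 2 TB of disk), and ignoring Neo4j's per-record store format and index overhead, the raw capacity works out like this:

class CapacityEstimate {
    public static void main(String[] args) {
        long diskBytes   = 2L << 40;                 // 2 TB of disk
        long bytesPerRec = 1L << 10;                 // ~1 KB per node/relationship (from the question)
        long totalRecs   = diskBytes / bytesPerRec;  // ~2.1 billion elements at best
        long nodes       = totalRecs / 6;            // 1:5 ratio -> ~358M nodes...
        long rels        = nodes * 5;                // ...and ~1.79B relationships
        System.out.println(nodes + " nodes, " + rels + " relationships");
    }
}

Treat that strictly as an upper bound; the calculator linked above accounts for the overheads this ignores.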
3-4) Import
Can you create a new question and share your code for this? You should be able to achieve much better performance than this.
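For reference, here is a minimal BatchInserter sketch (Neo4j 3.x-era API; names vary slightly across versions, and CsvRow/CsvEdge are hypothetical stand-ins for your own input readers). The key performance idea is to avoid per-insert Lucene lookups by keeping the external-key-to-node-ID mapping in memory:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

interface CsvRow  { String key(); }                  // hypothetical input types
interface CsvEdge { String from(); String to(); }

class BulkImport {
    static void load(Iterable<CsvRow> rows, Iterable<CsvEdge> edges) throws Exception {
        BatchInserter inserter = BatchInserters.inserter(new File("/data/graph.db"));
        Map<String, Long> idMap = new HashMap<>();   // external key -> internal node id
        try {
            for (CsvRow row : rows) {
                Map<String, Object> props = new HashMap<>();
                props.put("key", row.key());
                idMap.put(row.key(), inserter.createNode(props, Label.label("Item")));
            }
            for (CsvEdge e : edges) {                // O(1) map access, no index lookup
                inserter.createRelationship(idMap.get(e.from()), idMap.get(e.to()),
                        RelationshipType.withName("RELATED"), null);
            }
        } finally {
            inserter.shutdown();                     // flushes the store files; required
        }
    }
}

Note that at 0.2B+ nodes a plain HashMap will itself need a lot of heap; a primitive-keyed long map is the usual workaround at that scale.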
I tried to run a Gremlin query that adds a property to vertices through the Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after the query has been running for one or two minutes.
I tried some simple queries like g.V().limit(10), and they work fine.
Since the affected vertex count is more than 4 million, I am not sure if it is failing due to a timeout.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for the first few batches and then started failing again.
Are there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using as well as the capacity of the hardware being used.
The capacity of the hardware where Gremlin Server is running, in terms of the number of CPUs and, most importantly, memory, will also be a factor, as will the setting of the query timeout value.
To do this in Gremlin, if you had a way to identify distinct ranges of vertices easily you could split this up into multiple threads each doing batches of updates. If the example you show is representative of your actual need then that is likely not possible in this case.
Likewise some graph databases provide a bulk load capability that is often a good way to do large batch updates but probably not an option here as you need to do essentially a conditional update based on looking at the current presence (or not) of a property.
Without more information about your data model and hardware etc. the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the sweet spot for your environment as the capacity of the hardware will definitely play a role in situations like this as well as how you write your query.
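As a sketch of that combination (small batches, repeated until nothing is left to update), here is the loop in Java/TinkerPop form; the same traversal works from the console. The host, port, and the batch size of 1000 are placeholders to tune for your environment:

import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.VertexProperty.Cardinality;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

class BatchUpdate {
    public static void main(String[] args) throws Exception {
        GraphTraversalSource g = traversal().withRemote(
                DriverRemoteConnection.using("localhost", 8182, "g"));  // placeholder endpoint
        long updated;
        do {
            updated = g.V().hasLabel("user").has("status", "valid")
                    .hasNot("type")
                    .limit(1000)                                        // tune this batch size
                    .property(Cardinality.single, "type", "valid")
                    .count().next();            // one server round trip per batch
        } while (updated > 0);                  // done when no unset vertices remain
        g.close();
    }
}

Because each pass filters on hasNot("type"), a batch that times out and rolls back is simply picked up again on the next pass, so the loop can be restarted safely at any point.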
I want to build a serverless app using Lambda and DynamoDB. To create a cluster, I need to know how a particular configuration will behave given the structure and size of my table. To find that out I was going to use CloudWatch metrics, but it turns out they do not reflect reality and cannot show the "needs" of the cluster at a particular moment in time. Has anyone encountered this problem and can suggest how best to determine the cluster configuration with respect to the table parameters and the number and type of requests?
So much of it depends on the particular workload, your expected hit rate, distribution of key accesses, etc. There are a few rules of thumb, but these may change over time due to changes in the service, so it's always best to do your own testing with your own workloads:
Within a family (t2, r3, r4) latency is pretty much constant, although bigger node types tend to be more consistent (lower p99).
Throughput scales ~linearly with node size (i.e. a 2xl is ~2x the throughput of an xl)
Throughput scales ~linearly with cluster size
TPS scales ~inversely with response size - if a node handles 50,000 1 kB gets, it'll do about 5,000 10 kB gets.
My recommendation is figure out your workload, test on a few different cluster sizes to get some baselines, and use the notes above to scale. Do note that DAX doesn't currently allow changing the node type for a cluster, and scaling a cluster out only increases throughput, not cacheable memory.
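If it helps, those rules of thumb can be folded into a rough sizing estimate. This is only a sketch, not an official formula, and the baseline throughput must come from your own test:

class DaxSizing {
    // baselineTps: measured gets/sec for ONE node of your base type at ~1 kB items
    static int nodesNeeded(double baselineTps, double nodeSizeMultiplier,
                           double itemSizeKb, double requiredTps) {
        double perNode = baselineTps * nodeSizeMultiplier / itemSizeKb; // ~linear in node size, ~inverse in item size
        return (int) Math.ceil(requiredTps / perNode);                  // throughput ~linear in cluster size
    }
    public static void main(String[] args) {
        // e.g. measured 50,000 1 kB gets/s on an xl; need 200,000 10 kB gets/s using 2xls
        System.out.println(nodesNeeded(50_000, 2.0, 10.0, 200_000));    // -> 20 nodes
    }
}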
As for better CloudWatch metrics, it would be helpful to know what you're looking for - it might better to start a thread in the AWS forums for that discussion.
From what I've read, Riak treats all nodes in a cluster as equal. However, we'd like to have a heterogeneous cluster where CPU/memory/disk are not always equal; in fact, they can be very different. Each node would of course meet the minimal requirements for a node.
Questions:
1) What are the consequences of creating a cluster composed of machines with wildly varying specifications (CPU, disk space, disk speed, amount and speed of memory, network speed)?
2) Can the cluster detect and compensate for such differences automatically? (assuming no here)
3) Are there other ways to take care of this problem? Think of: prioritizing nodes in the load balancer based on their hardware. Something else?
I'll answer your questions; however, operating Riak in this manner is strongly discouraged, as Riak assumes identical capabilities among nodes.
1) You could possibly see wildly varying performance characteristics for operations against your nodes. In general, the "weakest node" in the system can affect operations throughout the cluster. For instance, during the PUT phase of an operation, a replica of the data could be routed to the weakest node, and the duration of that step could affect the entire PUT operation, depending on the PUT operation's quorum value.
2) No, the cluster assumes identical hardware.
3) There really is no way to compensate for this.
I know there are some famous graph partitioning tools like METIS, implemented by the Karypis Lab (http://glaros.dtc.umn.edu/gkhome/metis/metis/overview), but I want to know: is there any way to partition a graph stored in Neo4j? Or do I have to dump Neo4j's data and transform the node and edge format manually to fit the METIS input format?
Regarding new-ish and interesting algorithms, this is by no means exhaustive or state of the art, but these are the first places I would look:
Specific Algorithm: DiDiC (Distributed Diffusive Clustering) - I used it once in my thesis (Partitioning Graph Databases)
You iterate over all nodes, and for each node retrieve all of its neighbors, in order to spread some portion of "some unit" to each neighbor.
Easy to implement.
Can be made deterministic
Iterative - as it's based on iterations (like Super Steps in Pregel) you can stop it at any time. The longer you leave it the better the result, in theory (though in some cases, on certain graph shapes it can be unstable)
When we implemented this we ran it for 100 iterations on a machine with ~30GB RAM, for up to ~4 million nodes - it took no more than two days to complete.
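A minimal sketch of the diffusive pattern DiDiC is built on follows; this shows the general idea only, not the published algorithm, and the keep fraction is an arbitrary assumption:

import java.util.List;

class DiffusionSketch {
    // adj: adjacency lists; load[c][v]: cluster c's load at vertex v.
    // Each iteration, every vertex keeps a fraction of its load and
    // spreads the rest evenly over its neighbors; load concentrates in
    // densely connected regions, which is what defines the clusters.
    static void iterate(List<List<Integer>> adj, double[][] load,
                        int iterations, double keep) {
        int n = adj.size();
        for (int it = 0; it < iterations; it++) {
            for (int c = 0; c < load.length; c++) {
                double[] next = new double[n];
                for (int v = 0; v < n; v++) {
                    List<Integer> nbrs = adj.get(v);
                    if (nbrs.isEmpty()) { next[v] += load[c][v]; continue; }
                    next[v] += keep * load[c][v];
                    double share = (1 - keep) * load[c][v] / nbrs.size();
                    for (int u : nbrs) next[u] += share;
                }
                load[c] = next;
            }
        }
        // afterwards, assign each vertex v to the cluster c maximizing load[c][v]
    }
}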
Specific Algorithm: EvoCut "Finding sparse cuts locally using evolving sets" - local probabilistic algorithm from Microsoft - related to these papers
Difficult to implement
Local algorithm - BFS-like access patterns (random walks)
It's been a while since I read that paper, but I remember it was built on clean abstractions:
EvoNibble (pluggable - decides how much of the neighborhood to add to the current cluster)
EvoCut (calls EvoNibble multiple times to find the local cluster)
EvoPartition (calls EvoCut repeatedly to partition entire graph)
Not deterministic
General Algorithm Family: Hierarchical Graph Clustering
From a high level:
Coarsen the graph by collapsing nodes into aggregate nodes
coarsening strategy is selectable
Find clusters in the coarsened/smaller graph
clustering strategy is selectable
Incrementally decoarsen the graph, refining the clustering at each step
refining strategy is selectable
Notes:
If the graph changes slowly (or results don't need to be right up to date) it may be possible to coarsen once (or infrequently) then work with the coarsened graph - to save computation
I don't know of a specific algorithm to recommend
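The coarsening step itself is easy to sketch, though. Here is one pass of random matching (METIS's heavy-edge matching instead picks the heaviest incident edge, and real implementations visit vertices in random order):

import java.util.Arrays;
import java.util.List;

class CoarsenSketch {
    // One coarsening pass: match each unmatched vertex with a free neighbor
    // and collapse the pair into a single coarse vertex. Returns a map from
    // fine vertex -> coarse vertex; build the coarse adjacency from it and
    // repeat until the graph is small enough to cluster directly.
    static int[] coarsenOnce(List<List<Integer>> adj) {
        int[] coarseOf = new int[adj.size()];
        Arrays.fill(coarseOf, -1);                    // -1 = not yet matched
        int next = 0;
        for (int v = 0; v < adj.size(); v++) {
            if (coarseOf[v] != -1) continue;
            coarseOf[v] = next;
            for (int u : adj.get(v)) {
                if (coarseOf[u] == -1) {              // first free neighbor wins
                    coarseOf[u] = next;
                    break;
                }
            }
            next++;
        }
        return coarseOf;
    }
}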
General limitations - the things few clustering algorithms do:
Node types not acknowledged - i.e., all nodes treated equally
Relationship types not acknowledged - i.e., all relationships treated equally
Relationship direction not acknowledged - i.e., relationships treated as undirected
Having worked independently with METIS and Neo4j in the past, I am not aware of any tool for generating a METIS file from Neo4j. That being said, writing such a tool should be an easy task and would be a great community contribution.
Another approach for integrating METIS with Neo4j might be in connecting METIS to Neo4j from C++ via JNI. However this is going to be much more involved as it would have to take care of things like transactions, concurrency etc.
On the more general question of partitioning graphs, it is quite possible to implement some of the more known and simple algorithms with reasonable effort.
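For anyone attempting that contribution, a rough sketch with the embedded Neo4j 3.x Java API might look like the following. METIS expects a header line "n m" followed by one 1-indexed neighbor list per vertex; note this skips the self-loop and parallel-edge cleanup that METIS requires:

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;

class MetisExport {
    static void export(GraphDatabaseService db, PrintWriter out) {
        try (Transaction tx = db.beginTx()) {
            // METIS vertices are numbered 1..n, so remap Neo4j node ids first.
            Map<Long, Integer> index = new HashMap<>();
            List<Node> nodes = new ArrayList<>();
            long degreeSum = 0;
            for (Node node : db.getAllNodes()) {
                index.put(node.getId(), nodes.size() + 1);
                nodes.add(node);
                degreeSum += node.getDegree();
            }
            out.println(nodes.size() + " " + degreeSum / 2);  // header: n m (each edge once)
            for (Node node : nodes) {
                StringJoiner line = new StringJoiner(" ");
                for (Relationship rel : node.getRelationships()) {
                    line.add(String.valueOf(index.get(rel.getOtherNode(node).getId())));
                }
                out.println(line);                            // neighbor list of vertex i
            }
            tx.success();
        }
    }
}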
I have a filtering algorithm that needs to be applied recursively, and I am not sure if MapReduce is suitable for this job. Without giving too much away, I can say that each object being filtered is characterized by a collection of ordered lists or queues.
The data is not huge, just about 250 MB when I export from SQL to CSV.
The mapping step is simple: the head of the list contains an object that classifies the list as belonging to one of N mapping nodes. The filtration algorithm at each node works on the collection of lists assigned to that node, and at the end of the filtration either a list remains the same as before or its head is removed.
The reduce function is simple too: all the map jobs' lists are brought together and may have to be written back to disk.
When all the N nodes have returned their output, the mapping step is repeated with this new set of data.
Note: N can be as many as 2000 nodes.
Simple, but it may require up to 1000 recursions before the algorithm's termination conditions are met.
My question is would this job be suitable for Hadoop? If not, what are my options?
The main strength of Hadoop is its ability to transparently distribute work across a large number of machines. In order to fully benefit from Hadoop, your application has to be characterized by at least the following three things:
work with large amounts of data (data distributed across the cluster of machines) that would be impossible to store on one machine
be data-parallelizable (i.e. chunks of the original data can be manipulated independently from other chunks)
the problem which the application is trying to solve lends itself nicely to the MapReduce (scatter - gather) model.
It seems that, of these three, your application has only the last two characteristics (with the observation that you are trying to use a scatter-gather procedure recursively, which means a large number of jobs, equal to the recursion depth; see the last paragraph for why this might not be appropriate for Hadoop).
Given the amount of data you're trying to process, I don't see any reason why you wouldn't do it on a single machine, completely in memory. If you think you can benefit from processing that small amount of data in parallel, I would recommend focusing on multicore processing rather than on distributed, data-intensive processing. Of course, using the processing power of a networked cluster is tempting, but it comes at a cost: mainly the time inefficiency caused by network communication (the network being the most contended resource in a Hadoop cluster) and by I/O. In scenarios that are well suited to the Hadoop framework, these inefficiencies can be ignored because of the efficiency gained by distributing the data and the associated work on that data.
As far as I can see, you need 1000 jobs. The setup and cleanup of all those jobs would be unnecessary overhead for your scenario. Also, the overhead of network transfer is not necessary, in my opinion.
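To make the single-machine suggestion concrete, here is a sketch using Java parallel streams. The classify and filter logic are placeholders, since the actual algorithm wasn't shared:

import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class InMemoryFilter {
    // Repeat map (group lists by head) and filter (maybe pop head) until stable.
    static void run(List<Deque<String>> lists) {
        boolean changed = true;
        while (changed) {
            Map<String, List<Deque<String>>> byNode = lists.parallelStream()
                    .filter(l -> !l.isEmpty())
                    .collect(Collectors.groupingByConcurrent(Deque::peekFirst));
            changed = byNode.values().parallelStream()
                    .map(InMemoryFilter::filterGroup)
                    .reduce(false, Boolean::logicalOr);
        }
    }
    static boolean filterGroup(List<Deque<String>> group) {
        boolean changed = false;
        for (Deque<String> list : group) {
            if (shouldRemoveHead(list)) {      // placeholder for the real filtration test
                list.pollFirst();
                changed = true;
            }
        }
        return changed;
    }
    static boolean shouldRemoveHead(Deque<String> list) {
        return list.size() > 1;                // stub: stop once lists reach length 1
    }
}

Each group holds distinct lists, so the parallel passes never mutate the same deque from two threads, and 250 MB of queues fits comfortably in memory on one machine.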
Recursive algorithms are hard in distributed systems, since they can quickly lead to starvation. Any middleware that would work for this needs to support distributed continuations, i.e. the ability to make a "recursive" call without holding the resources (like threads) of the calling side.
GridGain is one product that natively supports distributed continuations.
The litmus test for distributed continuations: try to develop a naive Fibonacci implementation in a distributed context using recursive calls. Here's GridGain's example that implements this using continuations.
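To see the flavor of this without any particular middleware, here is a plain-Java analogue (this is just java.util.concurrent, not GridGain's API): the recursive step registers a continuation instead of parking a thread until the sub-results arrive, so deep recursion can't starve the pool the way blocking joins would:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FibContinuation {
    static CompletableFuture<Long> fib(long n, Executor pool) {
        if (n <= 1) return CompletableFuture.completedFuture(n);
        return CompletableFuture
                .supplyAsync(() -> n, pool)                // hop onto the pool
                .thenCompose(k -> fib(k - 1, pool)         // no thread blocks here
                        .thenCombine(fib(k - 2, pool), Long::sum));
    }
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        System.out.println(fib(20, pool).join());          // prints 6765
        pool.shutdown();
    }
}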
Hope it helps.
Quick and dirty, but I suggest you read a comparison of MongoDB and Hadoop:
http://www.osintegrators.com/whitepapers/MongoHadoopWP/index.html
Without knowing more, it's hard to tell. You might want to try both. Post your results if you do!