Anybody tried Neo4j vs Titan - pros and cons? [closed]

Can anybody please provide or point me to a good comparison between Neo4j and Titan?
One thing I can see is in terms of scale: Titan scales out and requires an underlying scalable datastore like Cassandra, while Neo4j only offers HA clustering and uses its own embedded database. Any other pros and cons? Any specific use cases? (Is Titan being used anywhere currently?)
I also have the following link: http://architects.dzone.com/articles/16-graph-databases-compared which gives an objective comparison of graph databases, but not much on the pros and cons of Neo4j versus Titan.

We have a social graph to which we add almost a million nodes and twice as many edges every day. We started with Neo4j because, yes, it is very fast, thanks to the fact that its storage sits on the same machine the graph engine runs on. But here are the experiences with Neo4j that we would like to share with you.
It is not a good fit for real-time queries. We have a social structure like Twitter's: we have to show the latest 20 activities (and their associated activities) of all the users that a user follows on his timeline.
We have some users who follow more than 1,000 users. The Gremlin query we wrote for this (we can share it if you are interested) produced so much GC pressure that a server with 8 CPUs and 48 GB of RAM would freeze, and we had to restart the server to bring it back online.
We observed network partitions many times.
There are no vertex-centric indexes, which are very much needed in a graph database.
Ultimately we were so fed up with server performance on Gremlin queries that we had to switch databases to Titan.
On Titan we are getting reasonable performance, and scaling is also very easy since we use Cassandra as the backend storage. But mind you: using Gremlin here is not a great idea either, as multiget queries are very ugly to write, and without multiget the queries become very slow.
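For readers hitting the same wall, below is a rough sketch of the two things mentioned above: defining a vertex-centric index so that "latest N activities" never has to scan all of a user's edges, and the corresponding timeline traversal. It assumes the Titan 1.0 / TinkerPop 3 Java API; the labels and property names (follows, posted, time, userId) are invented for illustration, and older Titan releases use slightly different class names.

```java
import com.thinkaurelius.titan.core.EdgeLabel;
import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Direction;
import org.apache.tinkerpop.gremlin.structure.Vertex;

import java.util.List;

public class TimelineSketch {
    public static void main(String[] args) {
        // Cassandra-backed Titan instance; the config path is illustrative.
        TitanGraph graph = TitanFactory.open("conf/titan-cassandra.properties");

        // Vertex-centric index: per-vertex ordering of 'posted' edges by time,
        // so reading a user's newest activity does not touch all of their edges.
        TitanManagement mgmt = graph.openManagement();
        PropertyKey time = mgmt.makePropertyKey("time").dataType(Long.class).make();
        EdgeLabel posted = mgmt.makeEdgeLabel("posted").make();
        mgmt.makeEdgeLabel("follows").make();
        mgmt.buildEdgeIndex(posted, "postedByTime", Direction.OUT, Order.decr, time);
        mgmt.commit();

        // Timeline traversal: newest 20 activities across everyone this user follows.
        GraphTraversalSource g = graph.traversal();
        List<Vertex> feed = g.V().has("userId", 42)
                .out("follows")
                .outE("posted").order().by("time", Order.decr).limit(20)
                .inV()
                .toList();
        feed.forEach(System.out::println);
        graph.close();
    }
}
```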

Great to see you exploring graph databases. I will speak to the Neo4j part of your question:
More than 30 of the Global 2000 now use Neo4j in production for a wide range of use cases, many of them surprising, even to us! (And we invented the property graph!)
A partial list of customers can be found below:
www.neotechnology.com/customers
Neo4j has been in 24x7 production for 10 years, and while the product has of course evolved significantly since then, it's built on a very solid foundation.
Most of the companies moving to graph databases--speaking for Neo4j, which is what I know about--are doing so because either a) their RDBMSs weren't able to handle the scope & scale of their connected query requirements, and/or b) they want the immense convenience and speed that comes from modeling domains that are naturally a graph (social, network & data center management, fraud, portfolios, identity, etc.) as a graph, not as tables.
For kicks, you can find a number of customer talks here, from the four (soon five) GraphConnect conferences that were held this year in major cities around the world:
http://watch.neo4j.org/
If you're in London, the last one will be held next week:
http://www.graphconnect.com
You'll find a summary below of some of the technology behind Neo4j, with some customer examples. To speak very directly to your question about scaling: Neo4j has a unique architecture designed to optimize query response time & query predictability, by allowing horizontal scale-out in such a way that each instance can access the graph without having to hop over the network. (Need more read throughput? Just add instances.) It turns out that this approach works well for 95+% of the graphs out there, including some production customers who have more than half of the Facebook social graph running in a single Neo4j cluster, backing an "always on" 24x7 web site.
www.neotechnology.com/neo4j-scales-for-the-enterprise/
One of the world's largest postal delivery services does all of its real-time package routing with Neo4j. Railroads are building routing systems on Neo4j. Some of the world's largest companies are using it for HR and data governance, alternate-path routing, network & data center management, real-time fraud detection, bioinformatics, etc.
Neo4j's Cypher query language is the only declarative query language built expressly for property graphs. It takes all of the lessons learned from our 13-year-old native Java API (which was the basis for Blueprints, which some of the other graph databases have since adopted) and rolls them into a next-generation language. Cypher is a great way to learn graphs and to develop applications; and there's always the native Java API if you have special needs or value "bare metal" performance (i.e. sub-millisecond vs. single-digit millisecond) above convenience. Neo4j is built from the ground up to support graphs, with a storage engine designed to store graphs; unlike some of the more recent additions to the graph database ecosystem, which are architected as graph libraries on top of non-graph databases and are subject to some inherent limitations. (For example, FlockDB, because it is based on MySQL, will still be very slow for anything greater than one hop.)
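If it helps to see what that looks like in code, here is a minimal, hedged sketch of running a Cypher query through the embedded Java API. It assumes the Neo4j 3.x embedded API; the label, property names, store path, and data are invented for the example, and older 2.x releases use {name} instead of $name parameter syntax.

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

import java.io.File;
import java.util.Map;

public class CypherExample {
    public static void main(String[] args) {
        // Embedded database; the store path is just an example.
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db"));

        try (Transaction tx = db.beginTx()) {
            // Friends-of-friends for one user, expressed declaratively in Cypher.
            Result result = db.execute(
                    "MATCH (me:User {name: $name})-[:FOLLOWS]->()-[:FOLLOWS]->(fof) " +
                    "RETURN DISTINCT fof.name AS name LIMIT 20",
                    Map.of("name", "alice"));
            while (result.hasNext()) {
                System.out.println(result.next().get("name"));
            }
            tx.success();
        }
        db.shutdown();
    }
}
```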
Definitely feel free to contact the Neo team if you need anything more specific. We'll be more than happy to help you! http://info.neotechnology.com/ContactUs.html
Good luck!


Architectural choices for a CRM

I am looking for some guidance amid the complexity of architectural choices, before starting the development of a CMS, CRM, or ERP.
I was able to find this similar question: A CRM architecture (open source app)
But it seems rather old.
I have recently watched and read several conference talks and discussions about monolith vs. distributed systems, DDD philosophy, CQRS and event-driven design, etc.
And I am now even more anxious than before about the architectural choice, having taken into account the flaws of each approach (I think).
What I find unfortunate with all the examples of microservices and distributed systems that are easy to find on the net is that they always take e-commerce as an example (Customers, Orders, Products...). And for that kind of example, several databases (in general, one NoSQL DB per microservice) exist.
I more or less see the advantage: keeping a minimal representation of the data each context needs.
But how do I go about it with a single relational database? I really think I need one. Having worked in a company producing a CRM (without access to the source code, but with access to the database structure), I could see the importance of the relational model: it is necessary for listings, reports, and for browsing the links between entities within the CRM (a contact can have several companies and vice versa; each user has several actions and tasks, but each of his tasks can also be assigned to other users, or even linked to other items such as "contact", "company", "publication", "calendarDate", etc.). There can be a lot of records in each table (100,000+ rows), so the choice of indexes will be quite important, and transactions are omnipresent because there will be a lot of concurrent access to data records.
What I'm telling myself is that if I choose a microservice system, there will be a lot of microservices to build, because there really would be a lot of different contexts, and a high probability of ending up with a bunch of different domain models. And then I will have the impression of having to light each little bulb in a string of lights, with perhaps too many processes running at the same time.
To try to be precise and not go off in all directions, I have two questions:
Can we easily mix the DDD philosophy with a monolithic system, while decoupling only a very small part (for the few services that absolutely must be set apart, for various reasons)?
If so, could I ask for resources where I can learn a lot more about this?
Do we necessarily have to work with a multitude of databases, and do they necessarily have to be of the MongoDB/NoSQL kind?
I can imagine the answer is no, but could I ask you to elaborate a little more, or redirect me to articles that give clear enough answers?
Thank you in advance!
(It would be .NET Core, draft is here: https://github.com/Jin-K/simple-cms)
DDD works perfectly as an approach to designing your CRM. I used it in my last project (a web-based CRM) and it was exactly what I needed. As a matter of fact, if I hadn't used DDD it would have been impossible to manage. The CRM that I created (as the only architect and developer) was very complex and very custom. It integrates with many external systems (e.g. the email server and the phone-call system).
The first thing you should do is discover the main parts of your system. This is the hardest part, and you will probably get it wrong the first time. The good thing is that this is an iterative process, and it should stabilize before the system reaches production, because after that it is harder to refactor (e.g. you need to migrate data, and that is painful). These main parts are called Bounded Contexts (BCs) in DDD.
For each BC I created a module. I didn't need microservices; a modular monolith was just perfect. I used Conway's Law to discover the BCs: I noticed that every department had some needs in common with the others, but also its own distinct needs from the CRM.
There were some generic BCs that were common to every department, like email receiving/sending, customer-activity recording, task scheduling, and notifications. The behavior was almost the same for all departments.
The department-specific BCs had very different behavior for similar concepts. For example, the Sales department and the Data Processing department had different requirements for a Contract, so I created two Aggregates named Contract that shared the same ID but had different data and behavior. To keep them "synchronized" I used a Saga/Process manager. For example, when a Contract was activated (manually or after the first payment), a DataProcessingDocument was created, containing data based on the contract's content.
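As a rough illustration of that process manager (plain Java, all names hypothetical; this is not code from my CRM, just the shape of the idea):

```java
// Hypothetical domain event published by the Sales bounded context.
final class ContractActivated {
    final String contractId;
    final String contractContent;
    ContractActivated(String contractId, String contractContent) {
        this.contractId = contractId;
        this.contractContent = contractContent;
    }
}

// Minimal stand-ins for the Data Processing bounded context.
final class DataProcessingDocument {
    final String id;
    final String payload;
    private DataProcessingDocument(String id, String payload) {
        this.id = id;
        this.payload = payload;
    }
    static DataProcessingDocument fromContract(String contractId, String content) {
        return new DataProcessingDocument(contractId, content);
    }
}

interface DataProcessingDocumentRepository {
    void save(DataProcessingDocument doc);
}

// Process manager / saga: reacts to the Sales event and drives the
// Data Processing bounded context, keeping the two Contract aggregates in sync.
final class ContractActivationSaga {
    private final DataProcessingDocumentRepository documents;

    ContractActivationSaga(DataProcessingDocumentRepository documents) {
        this.documents = documents;
    }

    // Invoked by whatever event bus / messaging infrastructure you use.
    void on(ContractActivated event) {
        // Same identity, different data and behavior per bounded context.
        DataProcessingDocument doc =
                DataProcessingDocument.fromContract(event.contractId, event.contractContent);
        documents.save(doc);
    }
}
```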
Another important aspect is to discover and respect the sources of truth. For example, the source of truth for received emails is the Email Server. The CRM should reflect this in its UI: it should be very clear that it is only a delayed reflection of what is happening on the Email Server, and there may be received emails that are not shown in the CRM for technical reasons.
The source of truth for draft emails is the CRM, with its Email composer module. If a Draft is not shown anymore, it means it has been deleted by a CRM user.
When the CRM is not the source of truth, the code should have little or no behavior and the data should be mostly immutable. Here you could have plain CRUD, unless you have performance problems (e.g. millions of entries), in which case you could use CQRS.
There can be a lot of records in each table (100,000+ rows), so the choice of indexes will be quite important, and transactions are omnipresent because there will be a lot of concurrent access to data records.
CQRS helped me a lot in building a performant, responsive system. You don't have to use it for every module, just where you have a lot of data and/or different behavior for writes and reads. For example, for recording customer activity I used CQRS to get performant listings (so there I used CQRS for performance reasons).
I also used CQRS where I had a lot of different views/projections/interpretations of the same events.
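To make the read side concrete, here is a minimal sketch of such a projection (plain Java, invented names; in the real system the view would be persisted in a collection or table rather than an in-memory list):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical event appended by the "customer activity" write side.
record ActivityRecorded(String customerId, String description, Instant at) {}

// Read-model row, denormalized purely for fast listings.
record ActivityListItem(String customerId, String description, Instant at) {}

// Projection: consumes events and keeps a query-optimized view up to date.
final class ActivityListProjection {
    private final List<ActivityListItem> view = new ArrayList<>();

    void on(ActivityRecorded e) {
        view.add(new ActivityListItem(e.customerId(), e.description(), e.at()));
    }

    // Query side: no joins, no domain logic, just a filtered, ordered read.
    List<ActivityListItem> latestFor(String customerId, int limit) {
        return view.stream()
                .filter(i -> i.customerId().equals(customerId))
                .sorted(Comparator.comparing(ActivityListItem::at).reversed())
                .limit(limit)
                .toList();
    }
}
```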
Do we necessarily have to work with a multitude of databases, and do they necessarily have to be of the MongoDB/NoSQL kind? I can imagine the answer is no, but could I ask you to elaborate a little more, or redirect me to articles that give clear enough answers?
Of course not. Use whatever works. I used MongoDB in 95% of cases and MySQL only for the Search module. Keeping the number of database systems small made them easier to manage, and the performance/scalability/availability was good enough.
I hope these thoughts help you. Good luck!

What are the factors to consider while choosing a graph DB for about 30 TB of data?

I'm in the process of developing a software system (a graph database) to study the interconnections between multiple components. It could end up with about 30 TB of data. I would like to know which factors to consider in choosing the right database.
Some of the options I'm looking at are Apache Giraph and TitanDB. I'm also wondering if a smaller-scale DB like Neo4j or OrientDB might itself work.
This is a very broad question, so I would define exactly what you are looking for, because size alone can be a bit vague.
I think any of the example graph dbs you provided can model data that large.
A few "more detailed" questions you could ask yourself include:
Do you care about horizontal scaling? If yes, then you should be looking at TitanDB, OrientDB, or DSE Graph, because Neo4j (at the time of writing) does not scale horizontally and is therefore limited by the size of a single server.
Does a standardised query/traversal language matter? If yes, then maybe you should be looking more at TinkerPop vendors such as TitanDB, OrientDB, DSE Graph, and others (see the sketch after this list). If not, any option will suit you.
Does my data have supernodes? If yes, then you should see how each vendor deals with supernodes. Some vendors shard; others use clever graph-partitioning algorithms.
How much support do you want? If you need a lot, then maybe you should look at strong enterprise solutions such as DSE, OrientDB, or Neo4j. Neo4j is currently considered the most popular graph DB, and with that comes a large support base.
Do you want to use open-source software? If yes, then TitanDB, Neo4j, or OrientDB may be for you.
These are just some of the things you can look into when deciding between the vendors. Note: there are many other vendors you could consider as well, Blazegraph and HypergraphDB, just to name a few.
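To illustrate the "standardised traversal language" point from the list above, the sketch below uses the TinkerPop 3 Java API against the in-memory TinkerGraph reference implementation (with TinkerPop's bundled "modern" toy graph), purely so it runs standalone; the intent is that the same traversal works unchanged against any TinkerPop-enabled backend such as TitanDB, OrientDB, or DSE Graph.

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

import java.util.List;

public class PortableTraversal {
    public static void main(String[] args) {
        // TinkerGraph is used here only to make the sketch runnable;
        // swap in any TinkerPop-enabled vendor without touching the traversal.
        Graph graph = TinkerFactory.createModern();
        GraphTraversalSource g = graph.traversal();

        // "Who does marko know?" - identical Gremlin regardless of the vendor underneath.
        List<Object> names = g.V().has("person", "name", "marko")
                .out("knows")
                .values("name")
                .toList();
        System.out.println(names);
    }
}
```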

Data mining with Neo4j [closed]

I'm quite new to graph databases and I'm trying to decide if Neo4j is the right tool to use for data mining on network graphs or if there is something more suitable out there.
I'm planning on using a graph database to perform analyses on some large graphs (millions of nodes, tens to hundreds of millions of edges), but I'll be looking to apply algorithms and calculate metrics for everybody in the graph. For example:
for each person, how many people in their extended network have a certain attribute;
how many steps each person is from someone with a certain attribute;
performing community detection;
running PageRank.
From looking into it a bit, it seems like Neo4j is very suited to running queries starting from a certain node, but is it also suited to applying a calculation over everybody in the network? I've come across the term 'graph compute engine' as a distinction between the two, but can't seem to find much on it.
Are there any other tools that would be useful at this scale? (Gephi and similar tools won't handle the volume of data I need to use.)
Since you need a graph analytics engine in addition to a graph database, you might be interested in Faunus. This is their description:
Faunus is a Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
I know of it because I keep an eye on their graph database, Titan, which integrates nicely with TinkerPop, but I have not used Faunus itself.
So by using Faunus you can also have a graph backend, which IMO goes hand in hand with what you want to do.
Another really good graph analytics engine is GraphLab (and its single-machine version, GraphChi). Very impressive performance - see: http://graphlab.com/
Mirroring other comments (and to keep this from becoming a product thread, which would get it locked on SO): Neo4j is a graph database, very useful for queries, exploration, etc. GraphLab and the other examples given are more about whole-graph analytics: things like PageRank, graph triangle counts, and so on.
It doesn't look like Neo4j is what you are looking for here. In my opinion you really need a graph engine rather than a graph database.
With a graph database you can perform queries, and it will perform very fast when dealing with highly connected data. For instance, Neo4j should be lightning fast at picking a node, finding its friends, and then finding the friends of friends of the starting node in a social graph. In this scenario the graph database outperforms the SQL model when dealing with a high number of nodes. Note that the efficiency comes precisely from the fact that your engine doesn't have to look over the whole graph to answer your query.
With a graph engine you can perform computations on the whole graph, as you describe.
If you want to scale and analyse a high number of nodes, I'd suggest you take a look at the MapReduce approach; see Hadoop (and perhaps Mahout).
Hope this helps!
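To make the distinction concrete, here is a deliberately naive, self-contained PageRank sketch in plain Java (no framework; the graph and parameters are made up). It is the kind of "touch every node on every iteration" workload a graph compute engine (Giraph, GraphLab, GraphX, Faunus) parallelises for you, and exactly what a point-query-oriented graph database is not optimised for:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankSketch {
    // Whole-graph computation: every node is visited on every iteration.
    static Map<Integer, Double> pageRank(Map<Integer, List<Integer>> outLinks,
                                         double damping, int iterations) {
        int n = outLinks.size();
        Map<Integer, Double> rank = new HashMap<>();
        outLinks.keySet().forEach(v -> rank.put(v, 1.0 / n));

        for (int i = 0; i < iterations; i++) {
            Map<Integer, Double> next = new HashMap<>();
            outLinks.keySet().forEach(v -> next.put(v, (1 - damping) / n));
            for (Map.Entry<Integer, List<Integer>> e : outLinks.entrySet()) {
                // Each node shares its current rank equally among its out-links.
                double share = rank.get(e.getKey()) / Math.max(1, e.getValue().size());
                for (int target : e.getValue()) {
                    next.merge(target, damping * share, Double::sum);
                }
            }
            rank.clear();
            rank.putAll(next);
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny example graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
        Map<Integer, List<Integer>> graph = Map.of(
                1, List.of(2, 3),
                2, List.of(3),
                3, List.of(1));
        System.out.println(pageRank(graph, 0.85, 20));
    }
}
```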
I realise this is late, but for the benefit of future Googlers:
You might also want to try the GraphX project built on Spark. It's in alpha as of now, but it looks good for large-scale graph analytics.
https://spark.apache.org/graphx/
If you want a pure Neo4j solution, you should check this project.
Implemented algorithms:
1. PageRank
2. Triangle Count
3. Label Propagation for Community Detection
4. Modularity (for Community Detection)
Hope it helps

Why is an RDBMS bad for storing a considerably big graph?

I started using a graph database to store a big graph that I'm generating. But I'm not convinced why I should use a graph database for this job rather than a conventional RDBMS. My specific question is: why is a relational database bad, or rather, why is a graph database BETTER than an RDBMS for storing graphs?
After reading Graph Databases (early release, available here: http://graphdatabases.com/), it all comes down to performance. The deeper or more recursive your query, the longer your RDBMS will take to return results relative to a graph database. Traditional RDBMSs are not designed for the quick traversal of relationships between entities, whereas graph databases are built specifically for this. This may not be an issue if your recursion is only 2 levels deep, but beyond that level performance apparently degrades significantly.
Please take this with a grain of salt. This is coming directly from Graph Databases and I have not replicated these results.
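A hedged illustration of why depth hurts the RDBMS more (the table, column, label, and relationship names are all invented): the same three-hop question expressed both ways.

```java
public class DepthQueryComparison {
    // Three hops in an RDBMS: one self-join per hop on a friendships(user_id, friend_id)
    // table. Each extra hop adds another join over the whole table, and the planner
    // re-runs an index lookup at every level.
    static final String SQL_THREE_HOPS =
            "SELECT DISTINCT f3.friend_id " +
            "FROM friendships f1 " +
            "JOIN friendships f2 ON f2.user_id = f1.friend_id " +
            "JOIN friendships f3 ON f3.user_id = f2.friend_id " +
            "WHERE f1.user_id = ?";

    // The same question in Cypher: the variable-length pattern follows pointers
    // (index-free adjacency) outward from the start node, so cost grows with the
    // size of the neighbourhood rather than the size of the whole dataset.
    static final String CYPHER_THREE_HOPS =
            "MATCH (me:User {id: $id})-[:FRIEND*3]->(fofof) " +
            "RETURN DISTINCT fofof.id";
}
```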

Any alternatives to Virtuoso as a graph store? [closed]

I like it (very much) that it supports SPARQL/Update and comes with a SPARQL endpoint, but:
I'm a little worried about vendor lock in
I think it is overkill for my requirements (I want a graph store with half a billion triples)
I would love to use an open-source and free product instead
So far I couldn't find any decent, comparable products (commercial or otherwise). They pretty much look immature or experimental to me.
Ideas?
What you might be looking for is http://4store.org/ and you might also try searching for questions very much like this over on http://www.semanticoverflow.com/ (link is defunct).
Two others besides the 4store that dajobe has already mentioned are Dydra and the Talis platform. Vendor lock-in should not, in general, be a problem if you stick to language features specified in the SPARQL standards.
Having used a lot of different triple stores as storage layers in my research project, I would recommend the following two:
4store - already mentioned by dajobe; it is very good and has frequent releases to fix bugs and add new features as SPARQL 1.1 continues to be standardised. It also has the benefit of being totally free.
AllegroGraph - free for up to 50 million triples, though it tends to be quite a RAM hog even at relatively low numbers of triples (e.g. it used around 3 GB of my 4 GB of RAM when I had about 1.5 million triples). Actual memory usage will vary with how you use it; in my case I was running an app that required my entire dataset to be loaded into memory. I haven't used version 4, so I can't say whether they have improved this.
While Virtuoso is very good at some things, it has a seriously bad case of feature creep and a lot of non-standard/proprietary features which, as you imply, might lead to vendor lock-in.
Like Ian says, stick to using the core language features in the SPARQL standards and you can easily move to a different triple store as your needs change. When developing your application, try to design it to be storage agnostic so you can plug in a different storage layer as you need to. How easy this is will depend on your programming environment/language/API, but doing it will be beneficial in the long run.
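As a sketch of that storage-agnostic approach (using Apache Jena 3.x purely as an example client library; the endpoint URL and query are placeholders), the only vendor-specific detail ends up being the endpoint address, while the query itself is standard SPARQL and runs unchanged against Virtuoso, 4store, Fuseki, and the rest:

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class PortableSparqlClient {
    public static void main(String[] args) {
        // The endpoint URL is the only store-specific detail (example: a local Virtuoso).
        String endpoint = "http://localhost:8890/sparql";
        Query query = QueryFactory.create(
                "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10");

        QueryExecution exec = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("s") + " " + row.get("p") + " " + row.get("o"));
            }
        } finally {
            exec.close();
        }
    }
}
```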
We have positive experience with Bigdata. 4Store (as mentioned above) is also good, but does not have support for transactions.
I'm a little worried about vendor lock in
OpenLink Software (my employer) works very hard to implement open standards and specifications where they exist and are sufficient. We add extensions, and document that we've done so, when necessary -- as with the aggregate and other analytics functions which were not part of SPARQL 1.0, but are part of SPARQL 1.1 and/or will be part of SPARQL 2.0.
If you stick with the published standards, you won't be locked in. If you need the extensions, we think we're not so much locking you in as enabling and empowering you... but your mileage may vary.
I think it is overkill for my requirements (I want a graph store with half a billion triples)
By all means, consider all the functionality you need when making your decision. But it seems likely to me that you'll be doing more than storing your triples. Queries, reasoning, query optimization, Federated SPARQL (joins against other remote SPARQL endpoints, formerly known as SPARQL-FED), and other functionality may not be so much overkill as simply not-yet-needed.
It's worth noting that Virtuoso can be run in a minimized form (LiteMode=1) which disables many of the features perceived as "overkill" and makes it much more like an embedded DBMS -- but still hybrid at the core. When Lite mode is on:
Web services are not initialized, i.e., no web server, DAV, SOAP, POP3, etc.
replication is stopped
PL debugging is disabled
plugins are disabled
Bonjour/Rendezvous is disabled
tables relevant to the above are not created
index tree maps are set to 8 if no other setting is given
memory reserve is not allocated
the DisableTcpSocket setting is treated as 1, regardless of its value in the INI file
I would love to use an open-source and free product instead
Virtuoso has two flavors -- commercial (VCE), and open source (VOS). Commercial includes shared-nothing elastic clustering which brings linear scalability, SPARQL GEO indexing and querying, result transformation to CXML for exploration with PivotViewer, and other features which VOS lacks ... but use the one that makes sense to you.
