Need advice to choose graph database [closed]

I'm looking at options for a graph database to use in a project. I expect ~100,000 writes (vertex + edge) per day and far fewer reads (a few per hour). The most frequent query is a traversal two edges deep that I expect to return ~10-20 nodes.
I don't have experience with graph databases and want to work with Gremlin so that I can switch to another graph database if needed. Right now I'm considering two options: Neo4j and Titan.
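For reference, the frequent query is essentially this 2-hop traversal, sketched below with TinkerPop's gremlinpython client (modern tooling; a Titan-era setup would use the Gremlin-Groovy shell, though the traversal reads the same). The label and property names are made up.

```python
# Hedged sketch of the 2-edge-depth query via TinkerPop's gremlinpython
# client. Server address, label, and property names are hypothetical.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Start at one vertex, follow outgoing edges two hops, deduplicate,
# and cap the result at the expected ~10-20 nodes.
nodes = (g.V().has('user', 'name', 'alice')
          .out().out()
          .dedup()
          .limit(20)
          .toList())

print(nodes)
conn.close()
```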
As far as I can see there is a large community and plenty of information and tools for Neo4j, so I'd prefer to start with it. Its capacity numbers should be enough for our needs (~34 billion nodes, ~34 billion edges). But I'm not sure what hardware requirements I'll face in that case. I also haven't seen any parallelisation options for its queries.
On the other hand, Titan is built for horizontal scalability and integrates with massively parallel tools like Spark, so I can expect hardware requirements to scale roughly linearly. But there is much less information, community, and tooling for Titan.
I'll be glad to hear your suggestions.

Sebastian Good gave a wonderful presentation comparing several databases to each other. You might have a look at his results here.
A quick summary of the presentation is here.
For benchmarks on each graph database with different datasets, node sizes, and caches, have a look at this GitHub repository by socialsensor. Just so you know, the results in the repo differ a bit from the ones in the presentation.
My personal recommendation is:
If you have deep pockets, go for Neo4j. With the technical support and easy Cypher, things will go pretty quickly.
If you support open source (and are patient with its development cycles), go for Titan with the Amazon DynamoDB backend. This will give you "infinite" scalability and good performance with both EC2 machines and DynamoDB tables. Check here for the docs and here for their code for more information.
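To illustrate the "easy Cypher" point, here is a minimal sketch of the asker's 2-hop query through the official Neo4j Python driver; the connection URL, credentials, label, and property are assumptions, not anything from the question.

```python
# Minimal sketch: the 2-hop traversal in Cypher via the official neo4j
# Python driver. URL, credentials, label, and property are assumed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    result = session.run(
        # Variable-length pattern: follow outgoing edges 1 to 2 hops.
        "MATCH (s:User {name: $name})-[*1..2]->(n) "
        "RETURN DISTINCT n LIMIT 20",
        name="alice",
    )
    for record in result:
        print(record["n"])
driver.close()
```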

Bigdata use cases [closed]

We are trying to create a dashboard using Big Data. The data is currently transacted in SQL Server and the front end is in MVC. As the data flow is far too high to analyse in SQL Server itself, it was decided to use Big Data tooling. I have chosen Cloudera Manager (CDH), with Sqoop to import data from SQL Server into Hive and Impala to run the analytics. We decided to surface the results with MicroStrategy to provide the charts on mobile platforms to the clients.
Any ideas or suggestions to improve the process are welcome.
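For concreteness, here is a hedged sketch of what the Impala analytics step could look like from Python with the impyla client; the host, database, table, and column names are hypothetical, not from the question.

```python
# Hypothetical sketch of the Impala analytics step using the impyla
# client; host, database, table, and column names are placeholders.
from impala.dbapi import connect

conn = connect(host='impalad.example.com', port=21050)
cursor = conn.cursor()
cursor.execute("""
    SELECT order_date, COUNT(*) AS orders
    FROM sales.orders          -- table imported from SQL Server via Sqoop
    GROUP BY order_date
    ORDER BY order_date
""")
for order_date, orders in cursor.fetchall():
    print(order_date, orders)
conn.close()
```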
Looks like you're off to a great start. Remember your analytics can be done with a multitude of tools, not just Impala.
Once you're in Hadoop, Hive and Pig give you a lot of power (more is available with UDFs) with an easy learning curve.
If you eventually want to do some iterative use cases (and exploit machine learning), you might want to check out Spark (those two things are in its wheelhouse), which is not constrained by MapReduce.
Tons of great tools available. Enjoy the journey.
I would consider using two stages: data analysis and data visualisation.
Using two stages makes the solution more flexible and decouples the responsibilities.
Data Analysis
Ingest the data (including cleaning). Sqoop can do the ingest step; it might require extra steps to clean the data.
Explore/analyse the data. Apache Spark is a very flexible and powerful tool for this (see the sketch after this list).
Store the analysis results in a specified format.
Data Visualisation
Load the data from the data analysis phase.
Visualise it using Highcharts/Kibana/Dashing, or use D3 to create a customised dashboard.
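Here is a hedged sketch of the analysis stage with PySpark, reading a Sqoop-imported Hive table and persisting the aggregate for the visualisation stage; the table name and output path are assumptions.

```python
# Hedged sketch of the analysis stage: read the Sqoop-imported Hive
# table, aggregate, and persist the result for the visualisation stage.
# The table name and output path are assumptions, not from the question.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dashboard-analysis")
         .enableHiveSupport()      # lets Spark read the Hive metastore
         .getOrCreate())

orders = spark.table("staging.orders")
daily = (orders.groupBy("order_date")
               .count()
               .withColumnRenamed("count", "orders_per_day"))

# The visualisation stage loads this directory instead of re-querying.
daily.write.mode("overwrite").parquet("/warehouse/analysis/daily_orders")
spark.stop()
```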

How to make a choice between OpenTSDB and InfluxDB or other TSDS? [closed]

Both are open source distributed time series databases: OpenTSDB is for metrics and is based on HBase, while InfluxDB is for metrics and events and has no external dependencies.
Are there any other comparisons between them?
And if I want to store and query/analyse time series metrics in real time, with no loss of precision, which would be better?
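For reference, this is the kind of write/query pattern I mean, sketched with the InfluxDB 1.x Python client; the measurement, tags, and values are made up.

```python
# Illustrative sketch of the write/query pattern in question, using the
# influxdb 1.x Python client; measurement, tags, and values are made up.
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='metrics')
client.create_database('metrics')   # no-op if it already exists

client.write_points([{
    "measurement": "cpu_load",
    "tags": {"host": "server01"},
    "fields": {"value": 0.64},
}])

# Aggregate the last hour into 5-minute buckets.
result = client.query(
    'SELECT MEAN("value") FROM "cpu_load" '
    'WHERE time > now() - 1h GROUP BY time(5m)')
print(list(result.get_points()))
```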
At one of the conferences I heard of people running something like Graphite/OpenTSDB for collecting metrics centrally, and InfluxDB locally on each server to collect metrics only for that server. (InfluxDB was chosen for local storage because it is easy to deploy and light on memory.)
This is not directly related to your question, but the idea appealed to me so much that I wanted to share it.
Warp 10 is another option worth considering (I'm part of the team building it); check it out at http://www.warp10.io/.
It is based on HBase but also has a standalone version that works fine for volumes in the low hundreds of billions of data points, so it should fit most use cases out there.
Among the strengths of Warp 10 is the WarpScript language, which is built from the ground up for manipulating (Geo) Time Series.
Yet another open-source option is Blueflood: http://blueflood.io.
Disclaimer: like Paul Dix, I'm biased by the fact that I work on Blueflood.
Based on your short list of requirements, I'd say Blueflood is a good fit. If you can specify the size of your dataset, the type of analysis you need to run, or any other requirements that make your project unique, we could steer you towards a more precise answer. Without knowing more about what you want to do, it's hard for us to answer meaningfully.

Are graph databases better suited to store trees than key-val stores? [closed]

My application data consists of a huge tree that grows as users interact with the system. Are graph databases better suited to storing big trees than key-value stores? Is the loss in scalability (graph databases are usually harder to shard) compensated for by other features?
It depends.
If you use a key-value store, I imagine you would do a lot of lookups for the children, and this could be a long list: your key would be the parent node and your value the children, and you could end up with a lot of movement and querying of the table. This is essentially the join problem you have in relational databases.
A graph database is great here because you do not do joins but traversals: you start at the root and specify a depth or an end condition, then let the traversal follow outgoing relationships to reach your result.
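A toy sketch of the key-value pattern just described (parent key mapping to a list of child ids), showing why a depth-limited walk costs one lookup per visited node; all names and data are made up.

```python
# Toy sketch of the key-value pattern above: key = parent id,
# value = list of child ids. Every visited node costs one lookup,
# which against a real store means one round trip.
store = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}

def descendants(node, depth):
    """Collect all children up to `depth` levels below `node`."""
    if depth == 0:
        return []
    children = store.get(node, [])   # one store lookup per node
    found = list(children)
    for child in children:
        found.extend(descendants(child, depth - 1))
    return found

print(descendants("root", 2))  # ['a', 'b', 'a1', 'a2', 'b1']
```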
I agree with you that sharding is not a good option for graph databases, at least not in the sense of cross-store relationship traversals. But I believe that with proper modeling of your data this shouldn't be a problem, at least not if the graph database is smart.
Neo4j has a problem with dense nodes, where a node with many (500k+) relationships can slow down traversals, but you can use indexing to get around this. Aside from that, it's great for large data: its storage on disk is efficient and its traversals are very fast.
Define "huge". If you can fit within the confines/limitations of Neo4j or have a natural and logical sharding model, Neo4j would be a much cleaner, simpler, and more powerful approach and would require a lot less code. As Nicholas said, if your database will have a lot of "hot spot" nodes (many relationships), you might have some challenges with Neo4j, though there are generally application design approaches you can use to work around that limitation.

Simple installed tool for digital Scrum Board [closed]

I am looking for a basic and simple-to-install digital version of a Scrum board.
I do prefer physical index cards, but in this case logistics makes it hard. Thus, I need to have it on the computer.
No real need to share data between several clients. To us it is enough if it runs on one single machine.
Just need basic functionality. A drag-drop board and a sprint burndown would do fine.
Due to regulatory constraints I cannot use an online SaaS; I must keep the data local.
Time is short, so simple install and ready-to-go.
Does not need to be free, but of course price is interesting.
I have not had this set of constraints before, so I am unfamiliar with the options.
I have done some research and have some general experience. For example, VersionOne, Mingle, and Hansoft seem to have good reputations. Can anyone comment on how those fit the above list? Does anyone have other recommendations?
This thread is a bit old now, but I'm leaving my find here in the hope of helping others searching the same topic.
If you are looking for a simple tool for developers to collaborate on a Scrum project, http://trello.com/ is very simple and intuitive. Absolutely no clutter and easily lets a small team manage their cards.
I would have a look at Atlassian Jira with the GreenHopper plugin - it has a nice dashboard.
http://www.atlassian.com/software/greenhopper/
Have a look at Mingle from ThoughtWorks. A really great tool.
Free download/install for 1 year / 5 users.
Excel (or OpenOffice) spreadsheet? Why do you need a special tool for this?
I had a similar decision to make a year ago and went for Version One Team Edition - which is free.
http://www.versionone.com/Product/Compare_Editions.asp
It's easy to deploy the SQL database wherever you want it - so locally in your case.
Our team found using the software easy and intuitive.
The free version (up to 10 users) has ample features - the sprints/stories/tasks are easy to setup and view. The burndown chart is good.
All in all, I have no regrets about choosing Version One: it's easy to install, easy to use, and free.

Software Design Description Practice [closed]

How many people actually write an SDD document before writing a single line of code?
How do you handle large CSCIs?
What standard do you use for SDD content?
What tailoring have you done?
I certainly have. Historically and on recent projects.
Years ago I worked in organisations where templates were everything.
Then I worked at other places where the templates were looser, non-existent, or didn't fit the projects I was working on.
Now the content of the software design is pretty much governed by what I need to describe to get the idea across to the audience.
"before writing a single line of code" there wouldn't be a a lot of detail. The documents I produce before I start coding are meant to get the idea of what we need to build across to the affected teams and senior management so they introduce high level architecture, functionality, technologies, risks and scope. Those last two are really important. The rest is to show other teams where you need to interface with them and to leave managers with a lingering notion that cool stuff is happening.
Most big software companies have their own practices. For example, Motorola has detailed documentation for every aspect of the software development process, with standard templates for each type of document. Having strict standards lets them effectively maintain a huge number of documents and integrate them with different tools. Each document gets a tracking number from a dedicated document-tracking system. They even have a system (the last time I saw it, it was at an early stage of development) for automatic requirements tracing: you can say which line of code relates to a given requirement/design guideline.
I would suppose that most people who write SDD documents and use terminology like CSCI are following a specific software development methodology and most likely work for some serious government customer. They usually take their preparations quite seriously, and the documents are ready and approved before any development starts.
In an Agile process, the development and the design document can proceed in parallel. This means there will be plenty of refactoring to do, but it usually delivers very good results in the end.
In more formal processes (like RUP), an SAD document is mostly created during the elaboration/prototyping phase, based on the team's research.
