Data mining with Neo4j [closed]

I'm quite new to graph databases and I'm trying to decide if Neo4j is the right tool to use for data mining on network graphs or if there is something more suitable out there.
I'm planning to use a graph database to perform analyses on some large graphs (millions of nodes, tens to hundreds of millions of edges), and I'll be looking to apply algorithms and calculate metrics for everyone in the graph. For example (a rough sketch of the first item follows this list):
For each person, count how many people in their extended network have a certain attribute.
For each person, find how many steps they are from someone with a certain attribute.
Perform community detection.
Run PageRank.
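To make the first item concrete, here is a minimal sketch that runs a Cypher aggregate over every person from the embedded Java API. It is illustrative only: the Neo4j 3.x embedded signatures, the :Person label, the :KNOWS relationship type, the has_attr property and the 3-hop cutoff are all assumptions, not part of the question.

import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class ExtendedNetworkCounts {
    public static void main(String[] args) {
        // Embedded store; path, label, relationship type, property name and
        // hop limit are illustrative assumptions.
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("data/graph.db"));

        String cypher =
                "MATCH (p:Person) "
              + "OPTIONAL MATCH (p)-[:KNOWS*1..3]-(other:Person {has_attr: true}) "
              + "RETURN p.name AS person, count(DISTINCT other) AS inExtendedNetwork";

        try (Transaction tx = db.beginTx();
             Result rows = db.execute(cypher)) {
            while (rows.hasNext()) {
                System.out.println(rows.next());   // one row per person
            }
            tx.success();
        }
        db.shutdown();
    }
}

Note that a query like this has to touch every node, which is exactly the whole-graph workload the rest of the question asks about, so it shows the shape of the computation rather than promising it will be fast at this scale.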
From looking into it a bit, it seems like Neo4j is well suited to running queries starting from a certain node, but is it also suited to applying a calculation over everyone in the network? I've come across the term 'graph compute engine' as a distinction between the two, but can't seem to find much on it.
Are there any other tools that would be useful at this scale? (Gephi and similar tools won't handle the volume of data I need to use.)

Since you need a graph analytics engine, you might be interested in Faunus. This is their description:
Faunus is a Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.
I know of it because I keep an eye on their graph database, Titan, which integrates nicely with TinkerPop, but I have not used Faunus itself.
So by using Faunus you can also have a graph backend, which IMO goes hand in hand with what you want to do.

Another really good graph analytics engine is GraphLab (and its single-machine version, GraphChi). Very impressive performance; see http://graphlab.com/
Mirroring other comments (and to keep this from becoming a product thread, which would get it locked on SO): Neo4j is a graph database, very useful for queries, exploration, and so on. GraphLab and the other examples given are more for whole-graph analytics: things like PageRank, graph triangle counts, etc.

It doesn't look like Neo4j is what you are looking for here. In my opinion you really need a graph engine rather than a graph database.
With a graph database you should be able to perform queries, and it will be very fast when dealing with highly connected data. For instance, Neo4j should be lightning fast at picking a node, finding its friends, and then finding the friends of friends of the starting node in a social graph. In this scenario the graph database outperforms SQL models when dealing with a high number of nodes. Note that the efficiency comes precisely from the fact that your engine doesn't have to look over the whole graph to answer your query.
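As a rough illustration of the kind of anchored query this is describing (the label, relationship type and starting name are assumptions, and the db handle reuses the embedded setup sketched in the question above):

// Friends of friends of one starting node; only Alice's neighbourhood is
// traversed, not the whole graph. Names and types are illustrative.
String fof =
        "MATCH (me:Person {name: 'Alice'})-[:KNOWS]-()-[:KNOWS]-(fof) "
      + "WHERE fof <> me "
      + "RETURN DISTINCT fof.name";
try (Transaction tx = db.beginTx();
     Result rows = db.execute(fof)) {
    rows.forEachRemaining(System.out::println);
    tx.success();
}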
With a graph engine you can perform computations on the whole graph, as you describe.
If you want to scale and analyse a large number of nodes, I'd suggest you take a look at the MapReduce approach; see Hadoop (and perhaps Mahout).
Hope this helps!

I realise this is late, but for the benefit of future Googlers:
You might also want to try the GraphX project built on Spark. It's in alpha as of now, but looks good for large-scale graph analytics.
https://spark.apache.org/graphx/

If you want a pure Neo4j solution, you should check out this project.
Implemented algorithms:
1. PageRank
2. Triangle Count
3. Label Propagation for Community Detection
4. Modularity (for Community Detection)
Hope it helps

Related

What are the factors to consider while choosing a Graph DB for about 30 TB of data

I'm in the process of developing a software system (a graph database) to study the interconnections between multiple components. It could end up with about 30 TB of data. I would like to know what factors to consider in choosing the right database.
Some of the options I'm looking at are Apache Giraph and TitanDB. I'm also wondering if a smaller-scale DB like Neo4j or OrientDB might itself work.
This is a very broad question, so I would define exactly what you are looking for, because size alone can be a bit vague.
I think any of the example graph DBs you mentioned can model data that large.
A few "more detailed" questions you could ask yourself include:
Do you care about horizontal scaling? If yes, then you should be looking at TitanDB, OrientDB or DSE Graph, because Neo4j (at the time of writing) does not scale horizontally and is therefore limited by the size of the server.
Does a standardised query/traversal language matter? If yes, then maybe you should be looking more at TinkerPop vendors such as TitanDB, OrientDB, DSE Graph, and others. If no, then any option will suit you.
Does my data have supernodes? If yes, then you should see how each vendor deals with supernodes. Some vendors shard; others use clever graph partitioning algorithms.
How much support do you want? If you need a lot, then maybe you should look at strong enterprise solutions such as DSE, OrientDB or Neo4j. Neo4j is currently considered the most popular graph DB, and with that comes a large support base.
Do you want to use open-source software? If yes, then TitanDB, Neo4j, or OrientDB may be for you.
These are just some of the things you can look into when making a better decision between all the vendors. Note: there are many other vendors you could consider, such as Blazegraph and HypergraphDB, to name just a few.

Predictive Analysis using Java [closed]

I am working on a Spring-based web application which performs predictive analysis based on users' historical data and comes up with offers for them. I need to implement predictive analysis or some kind of regression functionality which gives a confidence score/prediction for presenting those offers. I am a Java developer and have looked at Weka and Mahout to get the desired result, but both tools lack good documentation and it is very difficult to proceed with them. I need a suggestion for a Java-based analytics API that processes my data using regression, neural networks, or decision trees and provides a confidence score reflecting a customer's probability of buying the product in the future.
Any help in this regard is highly appreciated.
I've just finished a long project that involved building a GUI with JavaFX and R using the JRI package; it uses forecasting from the forecast package in R.
If you choose this solution (JavaFX + R), all the statistical packages of R will be available to you. R has great documentation for this, but the JRI interface is a challenge.
The program I built runs in stand-alone mode, not as a Web Start application.
Most of the fuss concerns setting up all the environment variables and passing parameters to the JVM. The big problem is deployment: you need to make sure your clients have R, and set up all the links between R and Java on their PCs.
If you're interested in any predictive analysis (trees, regressions, ...) in R using Java/JRI, let me know and I'll post it.
I'd advise you to keep trying with Weka. It's a great tool, not only for implementation but also to get an idea of which algorithms will work for you, what your data looks like, etc.
The book is worth the price, but if you're not willing to buy it, this wiki page might be a good starting point.
It might be best to start with testing, not programming; I believe the quote goes "60% of the difficulty of machine learning is understanding the dataset". Play around with the Weka GUI, find out what works best for you and your data, and do try some of the meta-classifiers (boosting, bagging, stacking); they often give great results (at the cost of processing time).
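To make the Weka route concrete, here is a minimal sketch of training a classifier and reading a per-customer confidence score. The ARFF file name, the choice of J48, and the assumption that class index 1 means "buys" are illustrative, not from the question.

import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OfferScorer {
    public static void main(String[] args) throws Exception {
        // Load historical purchase data; the last attribute is the class
        // (bought / not bought). The file name is an assumption.
        Instances data = new DataSource("customer_history.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // A decision tree; any Weka classifier exposes the same interface.
        Classifier tree = new J48();
        tree.buildClassifier(data);

        // distributionForInstance returns class probabilities, usable
        // directly as a confidence score for presenting an offer.
        Instance customer = data.instance(0);
        double[] dist = tree.distributionForInstance(customer);
        System.out.printf("P(buys) = %.3f%n", dist[1]);
    }
}

In practice you would evaluate the model first (Weka's Evaluation class does cross-validation) and only then use the probabilities to rank offers.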

Anybody tried Neo4j vs Titan? Pros and cons [closed]

Can anybody please provide, or point me to, a good comparison between Neo4j and Titan?
One thing I can see is scale: Titan scales out and requires an underlying scalable datastore like Cassandra, while Neo4j supports only HA and has its own embedded database. Any other pros and cons? Any specific use cases? (Is Titan being used anywhere currently?)
I also have the following link: http://architects.dzone.com/articles/16-graph-databases-compared, which gives an objective comparison of graph databases but not much on the pros and cons of Neo4j versus Titan.
We have a social graph to which we add almost 1 million nodes a day, and twice as many edges. We started with Neo4j because, yes, it is very fast, owing to the fact that its storage is on the same machine the graph engine runs on. But the following are the experiences we would like to share about Neo4j.
Not a good fit for real-time queries. We have a social structure like Twitter's: we have to show the latest 20 activities (and their associated activities) of all the users that a user follows on his timeline.
We have some users who follow more than 1,000 users. The Gremlin query we wrote for this (we can share it if you are interested; see the sketch after this list for the general shape) produced so much GC pressure that a server with 8 CPUs and 48 GB of RAM would freeze, and we had to restart it to get it online again.
Network partitions were observed many times.
There is no vertex-centric index, which is very much needed in a graph database.
Ultimately we were so fed up with server performance on Gremlin queries that we switched the database to Titan.
On Titan we are getting reasonable performance, and scaling is also very easy as we are using Cassandra as the backend storage. But mind you: using Gremlin here is also not pleasant, as multiget queries are very ugly to write, and without multiget the queries become very slow.
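For illustration only, the general shape of that timeline traversal might look like the following against the modern TinkerPop Java API. The original was a Gremlin-Groovy query on TinkerPop 2, and the labels, edge names and 'timestamp' property here are assumptions, not the poster's actual query.

import java.util.List;

import org.apache.tinkerpop.gremlin.process.traversal.Order;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;

public class TimelineSketch {
    // Latest 20 activities posted by anyone the given user follows,
    // newest first. Purely a sketch of the traversal shape described above.
    static List<Vertex> latestActivities(GraphTraversalSource g, Object userId) {
        return g.V(userId)
                .out("follows")                       // users this user follows
                .outE("posted")
                .order().by("timestamp", Order.desc)  // newest first
                .limit(20)
                .inV()                                // the activity vertices
                .toList();
    }
}

With users following 1,000+ accounts, this fan-out is exactly where a vertex-centric index (ordering the 'posted' edges by timestamp on each vertex) earns its keep.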
Great to see you exploring graph databases. I will speak to the Neo4j part of your question:
More than 30 of the Global 2000 now use Neo4j in production for a wide range of use cases, many of them surprising, even to us! (And we invented the property graph!)
A partial list of customers can be found below:
www.neotechnology.com/customers
Neo4j has been in 24x7 production for 10 years, and while the product has of course evolved significantly since then, it's built on a very solid foundation.
Most of the companies moving to graph databases (speaking for Neo4j, which is what I know about) are doing so because either a) their RDBMSs weren't able to handle the scope and scale of their connected query requirements, and/or b) they want the immense convenience and speed that comes from modeling domains that are a graph (social, network and data center management, fraud, portfolios, identity, etc.) as a graph, not as tables.
For kicks, you can find a number of customer talks here, from the four (soon five) GraphConnect conferences that were held this year in major cities around the world:
http://watch.neo4j.org/
If you're in London, the last one will be held next week:
http://www.graphconnect.com
You'll find a summary below of some of the technology behind Neo4j, with some customer examples. To speak very directly to your question about scaling: Neo4j has a unique architecture designed to maximize query speed and query predictability, by allowing horizontal scale-out in such a way that each instance can access the graph without having to hop over the network. (Need more read throughput? Just add instances.) It turns out that this approach works well for 95+% of the graphs out there, including some production customers who have more than half of the Facebook social graph running in a single Neo4j cluster, backing an "always on" 24x7 web site.
www.neotechnology.com/neo4j-scales-for-the-enterprise/
One of the world's largest postal delivery services does all of its real-time package routing with Neo4j. Railroads are building routing systems on Neo4j. Some of the world's largest companies are using it for HR and data governance, alternate-path routing, network and data center management, real-time fraud detection, bioinformatics, etc.
Neo4j's Cypher query language is the only declarative query language built expressly for property graphs. It takes all of the lessons learned from our 13-year-old native Java API (which was the basis for Blueprints, which some of the other graph databases have since adopted) and rolls them into a next-generation language. Cypher is a great way to learn graphs and to develop applications; and there's always the native Java API if you have special needs or value "bare metal" performance (i.e. sub-millisecond vs. single-digit millisecond) above convenience. Neo4j is built from the ground up to support graphs, and has a graph storage engine built to store graphs; unlike some of the more recent additions to the graph database ecosystem, which are architected as graph libraries on top of non-graph databases and are subject to some inherent limitations. (e.g. FlockDB, because it is based on MySQL, will still be very slow for anything greater than one hop.)
Definitely feel free to contact the Neo team if you need anything more specific. We'll be more than happy to help you! http://info.neotechnology.com/ContactUs.html
Good luck!

Essential skills of a Data Scientist [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
Improve this question
What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?
A few ideas germane to this discussion:
Knowing SQL and the use of a DB such as MySQL or PostgreSQL was great until the advent of NoSQL and non-relational databases. MongoDB, CouchDB, etc. are becoming popular for working with web-scale data.
Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and others to the list.
Data now comes in the form of text, URLs, and multimedia, to name a few, and there are different paradigms associated with manipulating each.
What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop?
OLS regression now has artificial neural networks, random forests, and other relatively exotic machine learning/data mining algorithms for company.
Thoughts?
To quote from the intro to Hadley's PhD thesis:
First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future
Step 1 almost certainly involves data munging, and may involve database access or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)
Step 2 means visualisation/plotting skills.
Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.
The final step is mostly about soft skills like introspection and management-type skills.
Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.
Just to throw in some ideas for others to expound upon:
At some ridiculously high level of abstraction all data work involves the following steps:
Data Collection
Data Storage/Retrieval
Data Manipulation/Synthesis/Modeling
Result Reporting
Story Telling
At a minimum a data scientist should have at least some skills in each of these areas. But depending on specialty one might spend a lot more time in a limited range.
JD's points are great, and for a bit more depth on these ideas, read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:
Skill #1: Statistics (Studying)
Skill #2: Data Munging (Suffering)
Skill #3: Visualization (Story telling)
At dataist the question is addressed in a general way with a nice Venn diagram.
JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.
The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.
I think it's important to have command of a commercial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.
In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.
Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.
Matrix algebra is my top pick
The ability to collaborate.
Great science, in almost any discipline, is rarely done by individuals these days.
There are several computer science topics that are useful for data scientists; many of them have been mentioned: distributed computing, operating systems, and databases.
Analysis of algorithms, that is, understanding the time and space requirements of a computation, is the single most important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection, and for determining your computational needs, such as how much RAM or how many Hadoop nodes you require.
Patience: both for getting results out in a reasonable fashion and for being able to go back and change them to what was 'actually' required.
Study linear algebra with MIT OpenCourseWare 18.06 and supplement your study with the book "Introduction to Linear Algebra". Linear algebra is one of the essential skill sets in data analytics, in addition to the skills mentioned above.

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
Proprietary implementation (matrix) and graph traversal algorithms.
RDBMS and SQL option (too space-consuming).
RDF and SPARQL option (too slow).
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each one is relatively small, no more than 200 vertices, 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say, it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.
You didn't give us enough information to respond with a well-thought-out answer. For example: what size are these graphs? With what frequency do you expect to query them? Do you need real-time responses to these queries? More information on what your application is for and what your purpose is would be helpful.
Anyway, to counter the usual responses that suppose SQL-based DBMSes are unable to handle graph structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varro, K. Friedl, D. Varro, presented at International Workshop on Graph-Based Tools (GraBaTs) 2004;
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf relational databases. After sketching the main concepts of our approach, we carried out several test cases to evaluate our prototype implementation by comparing it to the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational databases provide a promising candidate as an implementation framework for graph transformation engines. We call attention to the fact that our promising experimental results were obtained using a worst-case assessment method, i.e. by recalculating the views of the next rule to be applied from scratch, which is still highly inefficient, especially for model transformations with a large number of independent matches of the same rule. ...
They used PostgreSQL as the DBMS, which is probably not particularly well suited to this kind of application. You can try LucidDB and see if it is better, as I suspect it would be.
Incremental SQL Queries (more than one paper here; you should concentrate on "Maintaining Transitive Closure of Graphs in SQL"):
.. we showed that transitive closure, alternating paths, same generation, and other recursive queries, can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using at most auxiliary relations of arity 2. ..
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
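Those papers are about incrementally maintaining such relations; as a plain, non-incremental baseline, reachability can already be expressed with a recursive query in standard SQL. The sketch below assumes an edges(src, dst) table, which is not from the cited papers.

public class ReachabilityQuery {
    // Full transitive closure of a directed graph stored as edges(src, dst).
    // UNION (rather than UNION ALL) removes duplicates, so the recursion
    // terminates even when the graph contains cycles.
    static final String TRANSITIVE_CLOSURE =
            "WITH RECURSIVE reach(src, dst) AS ( "
          + "    SELECT src, dst FROM edges "
          + "  UNION "
          + "    SELECT r.src, e.dst FROM reach r JOIN edges e ON r.dst = e.src "
          + ") "
          + "SELECT * FROM reach";
}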
Edit: you gave more details, so... I think the best way is to experiment a little with both a dedicated main-memory graph library and a DBMS-based solution, then carefully evaluate the pros and cons of both.
For example: a DBMS needs to be installed (unless you use an "embeddable" DBMS like SQLite); only you know if/where your application needs to be deployed and who your users are. On the other hand, a DBMS gives you immediate benefits, like persistence (I don't know what support graph libraries give for persisting their graphs), transaction management, and countless others. Are these relevant for your application? Again, only you know.
The first option you mentioned seems best. If your graph won't have many edges (|E| = O(|V|)), then you might get better time and space complexity using a Dictionary:
// adjacency list: each vertex maps to the set of vertices it points to
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
An interesting graph library is QuickGraph. Never used it but it seems promising :)
I have written and designed quite a few graph algorithms for various programming contests and in production code, and I noticed that every time I need one, I have to develop it from scratch, assembling concepts from graph theory (BFS, DFS, topological sorting, etc.).
Perhaps a lack of experience is the reason, but it seems to me that there's still no reasonable general-purpose query language for solving graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you the best performance and space consumption, but will also require an understanding of basic graph theory concepts and their limitations.
And the last one: do not use SQL for graphs.
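As a rough sketch of the "solve it in a programming language" route, here is a minimal in-memory version in Java that mirrors the Dictionary-based adjacency list suggested above; the String vertex type and the method names are illustrative. For graphs of a few hundred vertices, a BFS like this answers a query far inside the 4-5 second budget mentioned in the edit.

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class GraphQueries {
    // Directed graph as an adjacency map, the Java analogue of
    // Dictionary<Vertex, HashSet<Vertex>> above.
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    public void addEdge(String from, String to) {
        adjacency.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // BFS distance (number of hops) from 'source' to 'target'; -1 if unreachable.
    public int distance(String source, String target) {
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                return dist.get(current);
            }
            for (String next : adjacency.getOrDefault(current, Collections.emptySet())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }
}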
