Is using Titan Graph DB the right choice? - amazon-dynamodb

I am planning to use Titan Graph DB for my project.
The reason for selecting it is because it is the only graph database which can use DynamoDB as the storage backend. Thus I can free myself of the scalability/throughput worries.
But when I am trying to find any tutorial to get started with Titan, I am not finding many of them. This makes me doubt whether to use Titan or choose another graph database, like Neo4j or OrientDB.
Can someone tell me if Titan being used widely?
Is the community active?
Can I expect proper releases?
The last blogpost on ThinkAurelius is dated Feb 3, 2015 regarding acquisition by DataStax. DataStax website has no mentioning about Titan.

Your question is more of an opinion question which might be suited for a different forum, but I will attempt to answer the main questions you stated anyway.
Is Titan being used widely? This is hard to tell since Aurelius doesn't disclose much information publicly about their users other than their client list. The Amazon Fulfillment gave a session on their usage of Titan with DynamoDB. This blog post identifies NASA's Jet Propulsion Laboratory and AdAgility as users also.
Is the community active? Somewhat. There are discussions occurring on the Titan mailing list and new messages come in every day. Commits are being made on the titan11 stream in GitHub. The most recent commit was on April 4, 2016.
Can I expect proper releases? "Expecting" releases of software is not something I would recommend in general. Software (both open source and proprietary) is released when the maintainers think the code is ready. For Titan specifically, the core maintainers have largely been absent from the community because they are busy working on DataStax Enterprise Graph.

Related

Is RocksDB and LevelDB just like Riak?

I have a question regarding some NoSQL databases. In Ehcache we have for example the JCache API, in MapDB the Map Interface and in Riak KV we have a own process with clusters. How do I exactly find out which database fits to which implementation type? For example for RocksDB (I assume that it is a process) and same for LevelDB.
For reference, RocksDB and LevelDB perform very similar functions and can be interchangeable in some situations.
Given your question of Is RocksDB and LevelDB just like Riak?, I can say that they are not the same as Riak provides a scalable distributed platform to run on that can connect to one or more backend databases simultaneoulsy (currently supported backends are Bitcask, LevelDB, Leveled and memory). RocksDB and LevelDB are essentially stand alone database platforms that can be used as such or can utilised by other software such as Riak as a backend. While you could technically implement RocksDB as a backend for Riak KV without needing a mountain of custom code, you probably wouldn't want to as RocksDB does not scale well.
How do I exactly find out which database fits to which implementation type? is rather a broad question. I think you might want to rephrase it as Which databases offer me {my list of desired implementations/functions}? to make it easier for community members to answer. Please note that some NoSQL databases have multiple uses available e.g. as you mentioned Riak KV, we have Maps, Sets, GSets, Flags, Registers, Solr Search, 2i and the standard CRDT options as well but some of those may be tied to other requirements e.g. 2i only works with a LevelDB/Leveled backend, Solr Search requires the Yokozuna package version of Riak KV 3.0.0 and above but is built in for all Riak 2.x.x versions etc.
What you may also want to try to do is download a few different options to a VM or bare metal rig, have a play and see how it works out. There are often cases where two competing products do something very similar on paper but in your specific use case, one outperforms the other significantly.
To get you started, here are links to Riak 2.9.8 (the latest release of the 2.x.x series) and to the Riak 2.2.6 docs (the 2.9.x docs should be out later this month).
I'm not sure if this has directly answered your question but, hopefully, it will give you some pointers as to where to go next.

Is luwak production ready?

I'm implementing a blob store over Riak KV to make an experiment storing mail attachment. Riak CS seems over reaching for this goal.
I already have a prototype implemented in Python and many ideas to keep working on it. Today I stumbled upon luwak, which has a similar design, however much more complete and consistent with the Merkle tree metadata.
Is luwak abandoned in favor of Riak CS? Is it production ready?
Looks like it has been abandoned:
Hi all,
I wanted to take a moment and let the community-at-large know
that we are going to end-of-life the Luwak functionality in Riak, as
part of our 1.1 release in February. This simply means that the Luwak
repo on Github will not be actively developed/supported by Basho, or
included as a default Riak dependency. In keeping with our commitment
to Open Source, the code will still be available if someone else wants
to develop on it. You can also continue to use it with Riak if you are
willing to edit the deps and compile from source -- the Luwak README
will contain information on how to do this.
While the idea of Luwak
was interesting, we ultimately decided that it wasn’t the
architectural path we wanted to pursue for storing larger values in
Riak. If Luwak is a piece of your Riak deployment, we’d love to hear
from you and incorporate your feedback into future directions.
Please don’t hesitate to reach out to myself or Mark Phillips.
Thanks, D.
The last commit on the open-source project was a little over 3 years ago, and it's had an outstanding pull request open for 10 months (the comments on there are not encouraging either). So yeah, I'd say it's pretty dead.

Alfresco Community 5 Share Clustering

I'm seeing a lot of conflicting information on the internet about Alfresco Share clustering. From what I can find, it looks like clustering was removed completely from Alfresco Community in versions 4.2 and above.
I did find some documentation showing that Alfresco One 5 has Share clustering and I noticed that I can enable hazelcast in Alfresco Community 5 but the clustering doesn't work at all.
Is there a way to have more than 1 instance of Alfresco Community 5 behind a load balancer and have proper synchronization/replication/clustering occur between the share instances?
Short answer
There is no cluster and no load balancer support for the Alfresco Community version (I know of). Alfresco removed that feature from the community version starting with 4.2 when they refactored the whole cluster thing.
Long answer
What are you trying to archive?
If scalability is your goal you should focus on the bottlenecks in the Alfresco architecture which will not be solved by clustering / load balancing. I haven't seen a system where Share tier was the bottleneck.
quite the contrary: If load from share against the repository tier is too high you will fall into a timeout and thread escalation since Alfresco follows the "retrying transaction" principle: If errors occur, share will retry - which means: if repositry is answering too slow share will create new requests/threads until the OS reaches kernel or process limits without any result.
So instead you should focus on optimizing the repository tier to become as fast as possible to avoid thread escalations in share (This also can't be achived by clustering):
transformation --> understand, replace or disable sync transfomation stuff running on repository tier
search --> understand, optimize tracking and run SOLR on separate host(s), but tracking will rely on the transformation performance of the repository tier
caching --> use smart reverse proxys to cache Share stuff on client and proxy side to minimize traffic
very fast/smart storage concepts on db and index tier
If availability is your concern you may get better results by using HA features from virtualisation platforms like VMWare ESX and your support efforts will be a fraction compared to clustered Alfresco.

Alfresco Community Enterprise Feature Comparison

I've seen this question but the answers are simply not good enough. I've searched the web and could find a clear listing of the main differences.
I am particularly surprised to see contradictions in the above link, that holds only 4 short answers.
So the question is, beyond support, what are (all) the differences between Alfresco Community and Enterprise editions (for the current versions of course)?
Are there functional or technical features that available in the Enterprise edition, that are not in the community edition?
I find it strange that it's so difficult to get a clear list. Looking at the forums to find this answer is not a serious option from a business perspective.
Until now, I found this link to be useful, but it's from 2009.
In particular, I find the platform support interesting, with the community edition supporting only lamp stuff:
Linux
MySQL
Tomcat
OpenLDAP
Firefox
And the enterprise edition supporting:
Windows
SQL Server
WebLogic, WebSphere
AD/Kerberos
IE and Safari
Apparently, these features are only available in the enterprise edition:
JMX monitoring
Runtime admininstration: What's that exactly? And what's in the community edition then?
Runtime indexing consistency check and update: What's in the community edition then?
High performance and availability: How is that implemented and what's in the community edition then?
Storage policies
Open source and proprietary technology stack support: which ones exaclty? Which ones are supported in the community edition?
If anyone could guide me towards serious documentation about these differences, that would be great.
I also went through the wiki but could not find an answer to my questions in there.
differences between Enterprise and Community vary in detail from version to version and are mainly visible for administrators. We see or maintain both flavors of Alfresco in midsize to very large environments and I would say it's more or less a question of taste and budget what the best decision / edition is for you. Excellent skills in infrastructure and java are highly advisable for both editions to run Alfresco in production.
The technical differences are not as dramatic as not being able to provide very similar functionality for the users - so if you're actually in a decision you should focus on a good technical partner, the support services and maybe the fact that you only get official patches in the Enterprise subscription, not on the Community. BTW Alfresco Enterprise is not Open Source but this is not a real point of interest for most end users. You can access the code as a subscription customer but it is not public available/accessible.
The main differences in features are already named more or less:
Administration
Enterprise has more views and setting in the admin web GUI. In Community you can access most configuration only from the command line. This may be a restriction but in real live Administrators prefer the command line and scripting automation.
Enterprise lets you change some Alfresco settings during runtime (most settings still require restart). Some can be change in the GUI and more in the jmx interface. Also you're able to stop and start subsystems like the CIFS protocol server. We use this feature to switch a system in read only mode. This point is meant with "runtime admininstration". Community requires restart of the service for most configuration changes. It is possible to work around this by advanced scripting like groovy or by implementing modules.
Indexing
Runtime indexing consistency check and update is not a self healing functionality as expected. You will have to learn (at least for now) that you have to recreate the Alfresco index from time to time even in Enterprise environments and that it is better to focus on good strategies how to speed recreation or how to setup standby indexes instead of hunting failed indexing transactions using the check and update methods. For major document model changes you need to recreate the index anyway.
High performance and availability
This is mainly the cluster and replication functionality which is no longer available in Community. It's similar to MS Clusters: It's a lot, lot work for very view more availability since some concepts are missing. The price is high in terms of complexity and can end up in loss of robustness. Even with enterprise support it's a hard job to keep a alfresco cluster running - so you need very good arguments why to go this way. But of course: its possible and available!
High performance: There shouldn't be any difference and if - I'm very curious about the explanation.
Technology stack
The main difference is the database support. In the Community you only can choose between MySQL and Postgres (No Oracle or MS SQL for Community). All other technologies are independent from Enterprise or Community (AD, Kerberos, OS, Browser, ...)
Java Container: I believe over 95% of all Alfresco installations run in tomcat. That's the configuration which is documented, tested and scales. Using WebLogic or WebSphere gives you no added value except new challenges - quite the contrary: You have to solve most issues for yourself and can't benefit from others experience.
Storage policies: I'm not pretty sure and should check in 4.2.x if the Content Store Selector / Storage policies is no longer available in the Community, but it was there in the 3.x versions.
[Edit]: storage policies have been removed in Community 4.2.x:
NoSuchBeanDefinitionException: No bean named 'storeSelectorContentStoreBase' is defined
If there is a really need for this functionality someone may re-enable that feature by coding a module for Community.
Regards
This page explains the difference between the editions:
https://wiki.alfresco.com/wiki/Enterprise_Edition
This page is the canonical, comprehensive list of the differences.
If you are considering an Enterprise Subscription and you have a question that isn't answered by what you can find on that page, you should talk to your account rep.
Well, regarding JMX monitoring:
Runtime administration: Alfresco enterprise allows to perform certain actions on Alfresco subsystems without restarting the server. This allows you to be very fast during debugging/developing and also making changes in production environment. Also you can access the JMX interface that supports JMX Remoting.
There is no consistency check or update, until you restart the server (during the startup you have to validate/check/rebuild your indexes). There is an option in alfresco.global.properties (or the original repository.properties config file) for that. If you have some inconsistencies in the Alfresco Community index, you're gonna have a bad time xD.
Alfresco Enterprise has specific license for clustering your architecture, the Community edition doesn't support those systems. Replicate and cluster Alfresco is one of the main improvements in performance/scalability/availability you could achieve.
The storage policies allow you to use Content Store selectors in Alfresco Enterprise. You can manage a primary and a secondary file store, and map/connect these stores in your architecture. The Community Edition allows you only to use one content store at a time.
These include everything inside Alfresco (Spring Framework, Apache-Lucene/Solr, Tomcat, and so on), because with the Enterprise license you have also the full support with everything inside the Alfresco package. The difference is that the Community is based on daily builds, supported by community, and therefor not guaranteed. The Enterprise support helps you resolve many problems that you might encounter during developing and in production environment, not only Alfresco related, but also on some configurations on supported platforms (Windows/Linux), your web application servers, and so on.
Hope it helps.

Cassandra and asp.net (C#)

I am interested to create portal on cassandra services, since I faced some performance and scale issues starting from 1 million of records.
Definitely, it could be solved, but I am interested on other options.
My main issues is cost of updating all necessary indexes, to make reading fast.
First, is cassandra is good way for asp.net programmers? I mean, maybe there is some other projects, which worth to take a look
And second, can you provide any documentation samples on how to start with cassandra programming from C#?
since I faced performance and scale issues starting from 1 million of records.
Maybe your design was not that good, NoSQL is not a magic bullet for bad design. I have multi billion row tables and 95% of the response is sub second. Also what do you mean by updating indexes, do you mean updating statistics or rebuilding indexes?
since I faced performance and scale
issues starting from 1 million of
records.
You know, the one million mark for modern databases is where it is not something "totally ridiculously small" where you can ignore actually knowing what you do. Below one million is "tiny". I have a 800 million row table and get a LOT of sql running through with it - no problem at all.
First, is cassandra is good way for
asp.net programmers?
I would more suggest a basic book about SQL, reading the documentation and POSSIBLY throwing some hardware on the problem. As in: having totally bad hardware will kill all data management systems.
If you are using Cassandra for your .NET Application take a look at Aquiles. I developed it based on my company needs. If you find it useful or need any help let me know.
You can't really speak of Cassandra documentation. There's a myriad of partial tutorials on the web.
You may want to setup Linux in a virtual machine, because the windows build process is quite challenging, to say the least. (http://www.virtualbox.org, http://www.ubuntu.com)
Here's the howto:
http://www.ridgway.co.za/archive/2009/11/06/net-developers-guide-to-getting-started-with-cassandra.aspx
Note that the cassandra SVN url and the code sample have changed since the writing of this tutorial.
Here's another C# client:
http://github.com/mattvv/hectorsharp
And here some sample code:
http://www.copypastecode.com/26752/
Note that you need to download the latest Java Development Kit (JDK) from Sun for Linux.
It's not in the repositories of Ubuntu 10.04.
Then you need to type
export JAVA_HOME="/path/to/jdk"
in order for Cassandra to find your Java installation.
You might also want to take a look at:
http://en.wikipedia.org/wiki/NoSQL
Especially the taxonomy section is interesting.
Make sure Cassandra is the right type of NoSQL solution for your problem, e.g. use Neo4J if your problem actually is a graph problem.
Also, you need to make sure your NoSQL solution is ACID-compliant.
For example, Neo4J is the only ACID-compliant NoSQL graph engine.
Edit: Here's a jumpstart guide for Windows, without compiling:
http://coderjournal.com/2010/03/cassandra-jump-start-for-the-windows-developer/
http://www.ronaldwidha.net/2010/06/23/running-cassandra-on-windows-first-attempt/
http://www.yafla.com/dforbes/Getting_Started_with_Apache_Cassandra_a_NoSQL_frontrunner_on_Windows/
Instead of cassandra you might take a look at: ravendb. Supposedly it is a document store made with and created for .Net. It has Linq integration, and is (again supposedly) very fast.
As with any new technology, read if it helps you with your specific case, and check if it is proven technology (Do they have mainstream clients using it).
Before you go into this route see if you can't optimize your current solution first. Check if your queries are fast, if the indexes are done correctly, and if you can't remove load by adding caching.
Last nut not least, if adding some processors to your SQL machine might fix issues, it is typically a much cheaper solution.
If you want to do something new, then instead of going for noSQL, you might want to consider trying a database cluster.
The idea is when two machines each search half of the original database at the same time, you have half the search time without totally redesigning your existing database.

Resources