Are RocksDB and LevelDB just like Riak?

I have a question regarding some NoSQL databases. In Ehcache we have, for example, the JCache API; in MapDB, the Map interface; and Riak KV runs as its own process with clusters. How do I find out exactly which database fits which implementation type? For example for RocksDB (I assume that it runs as its own process) and the same for LevelDB.

For reference, RocksDB and LevelDB perform very similar functions and can be interchangeable in some situations.
Given your question of Are RocksDB and LevelDB just like Riak?, I can say that they are not the same: Riak provides a scalable, distributed platform to run on that can connect to one or more backend databases simultaneously (currently supported backends are Bitcask, LevelDB, Leveled and memory). RocksDB and LevelDB are essentially standalone databases that can be used as such or be utilised by other software, such as Riak, as a backend. While you could technically implement RocksDB as a backend for Riak KV without needing a mountain of custom code, you probably wouldn't want to, as RocksDB does not scale well.
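To make the distinction concrete, here is a minimal sketch of RocksDB used as an embedded store via its RocksJava binding (the path and keys are arbitrary placeholders). Note there is no server to deploy; the database lives inside your own process:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class EmbeddedRocksExample {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary(); // load the native library once per process
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            db.put("user:1".getBytes(), "alice".getBytes());
            byte[] value = db.get("user:1".getBytes());
            System.out.println(new String(value)); // prints "alice"
        }
    }
}

Riak, by contrast, is a cluster of nodes that you operate and talk to over the network, with a backend such as LevelDB handling the on-disk storage on each node.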
How do I find out exactly which database fits which implementation type? is rather a broad question. You might want to rephrase it as Which databases offer me {my list of desired implementations/functions}? to make it easier for community members to answer. Please note that some NoSQL databases have multiple uses available, e.g. in Riak KV, as you mentioned, we have Maps, Sets, GSets, Flags, Registers, Solr Search, 2i and the standard CRDT options as well, but some of those may be tied to other requirements, e.g. 2i only works with a LevelDB/Leveled backend, and Solr Search requires the Yokozuna package version of Riak KV 3.0.0 and above but is built in for all Riak 2.x.x versions, etc.
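As a taste of those data types, here is a rough sketch of updating a Set CRDT with the official Riak Java client; the host, the "sets" bucket type, the bucket and the key are all assumptions, and the bucket type would need to be created and activated on the cluster first:

import com.basho.riak.client.api.RiakClient;
import com.basho.riak.client.api.commands.datatypes.SetUpdate;
import com.basho.riak.client.api.commands.datatypes.UpdateSet;
import com.basho.riak.client.core.query.Location;
import com.basho.riak.client.core.query.Namespace;

public class RiakSetExample {
    public static void main(String[] args) throws Exception {
        // assumes a Riak node on localhost and an activated "sets" bucket type
        RiakClient client = RiakClient.newClient("127.0.0.1");
        Location cities = new Location(new Namespace("sets", "travel"), "cities");
        SetUpdate su = new SetUpdate().add("Toronto").add("Montreal");
        client.execute(new UpdateSet.Builder(cities, su).build());
        client.shutdown();
    }
}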
What you may also want to try to do is download a few different options to a VM or bare metal rig, have a play and see how it works out. There are often cases where two competing products do something very similar on paper but in your specific use case, one outperforms the other significantly.
To get you started, here are links to Riak 2.9.8 (the latest release of the 2.x.x series) and to the Riak 2.2.6 docs (the 2.9.x docs should be out later this month).
I'm not sure if this has directly answered your question but, hopefully, it will give you some pointers as to where to go next.

Related

GRPC Services: Central Proto Repository or Distributed

We plan to keep a central proto repository holding all proto definitions and their generated code. We would keep messages as well as service definitions in a central Git repo, and we plan to drive our API design standard from this central repository.
But any service which wants to use this to expose a server or generate clients would have to import from this repo (the generated .pb.go files).
Do you see any issue with this approach? Or do you see keeping service proto files individually in the service repos as a better alternative?
PS: I'm a starter on the gRPC journey of building microservices, still learning the right way to structure and distribute code here.
This question occurs regularly, and I suspect there's no published guidance because the answer depends more on your needs than on the technology.
The specific issue of many vs. one is not dissimilar to whether you prefer a monorepo, and only you can effectively determine that. Perhaps one way to decide is to understand, now and in the future, how many shared dependencies your services will have. Another may be to determine how many repos you'll end up with (how complex would it be to manage tens or hundreds of repos?).
In my experience, it's a good practice to keep the protos distinct (i.e. separate repo) from code that uses them. Not only may you want to version protos independently from implementations (across languages) but the implementations themselves are independent; in one use-case I must clone a repo containing an entire system (written mostly in one language) in order to get its protos to generate bindings in another language. In this case, it would be preferable if the repo were limited to just the protos.
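As a concrete (and entirely hypothetical) illustration, a dedicated proto repo might be laid out like this, with each service repo depending on bindings generated from a pinned version of the proto repo rather than storing them alongside any one implementation:

protos/                        (its own Git repo, versioned with tags)
    billing/v1/billing.proto
    users/v1/users.proto
service-a/                     (separate repo, imports the generated .pb.go files)
service-b/                     (separate repo, imports the generated Java bindings)

This keeps the protos versionable on their own and lets a consumer in any language generate bindings without cloning an entire system.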
You could look to examples for guidance. The gRPC repo keeps a bunch of stuff rooted on the grpc package in addition to math. Although less broad, Google bundles its well-known types under google.protobuf.

Why Cloudera's Impala is still "incubating"?

We are using Impala in my company, and we have used it in my previous one without problems.
Is there something affecting possible production use (for example, it breaks under heavy usage, or memory leaks possible, or concurrent access is not recommended)?
Quoting from http://incubator.apache.org/:
The Apache Incubator has two primary goals:
Ensure all donations are in accordance with the ASF legal standards
Develop new communities that adhere to our guiding principles
According to the above, the Incubator doesn't have anything to do with performance or stability; it is rather about attracting an active, diverse community to a given project and reviewing the licenses involved.

What is the prescriptive approach to supporting multiple RDBMS's with Flyway?

I have an application that supports multiple RDBMSs. The SQL needed to build the data model is different for each of the RDBMSs that I need to support. The differences aren't small either; they stem from the fact that one of the supported systems is intended for light use (development, small installations) while the others are expected to handle heavy use. Simply standardizing on a single supported RDBMS is not an option.
As it stands, I need to be able to apply migrations to my application on all of the supported RDBMSs. Where possible I'd like to share migration scripts to reduce the amount of duplication involved, but I imagine that isn't entirely possible.
The only approach I can come up with so far is to keep separate directories in source control for each of the supported environments. Then at runtime, pick the appropriate directory for the RDBMS that the system is connected to.
Is having one directory per supported RDBMS the prescriptive approach or is there a better way?
Right from the FAQ: What is the best strategy for handling database-specific sql?
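That FAQ entry describes essentially the approach you propose: one scripts location per database, optionally combined with a shared location, selected at runtime. A minimal sketch with the Flyway Java API, where the directory layout and the vendor detection are my assumptions rather than anything Flyway mandates:

import org.flywaydb.core.Flyway;

public class MultiVendorMigration {
    public static void main(String[] args) {
        // hypothetical layout: db/migration/common holds shared scripts,
        // db/migration/postgresql holds the PostgreSQL-specific ones
        String vendor = "postgresql"; // in real code, derive this from the JDBC URL
        Flyway flyway = Flyway.configure()
                .dataSource("jdbc:postgresql://localhost/app", "app", "secret")
                .locations("classpath:db/migration/common",
                           "classpath:db/migration/" + vendor)
                .load();
        flyway.migrate();
    }
}

(If you are on Spring Boot, its spring.flyway.locations property supports a {vendor} placeholder that does this per-database selection for you.)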

How much data do I need to have to make use of Presto?

How much data do I need to have to make use of Presto? The web site states that it can query data sizes from gigabytes to petabytes. I understand how it is used to query very large datasets, but is anyone using it for hundreds of gigabytes?
Currently, Presto is most useful if you already have an existing Hive installation. If you are using Hive, you should definitely try Presto. If all your data fits in a relational database like PostgreSQL or MySQL on a single machine, and you are happy with the performance, then keep using that.
However, Presto should be much faster than either of those databases on a single machine for analytic queries because it executes a query in parallel. Neither of those databases parallelize the execution of individual queries. At the moment, using Presto requires setting up HDFS and Hive (even on a single machine), so getting started will be more work than if you already have an existing Hive installation.
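If you do try it on a few hundred gigabytes, note that client access is ordinary JDBC through the Presto driver. A minimal sketch, where the host, catalog and table names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoQueryExample {
    public static void main(String[] args) throws Exception {
        // assumes presto-jdbc is on the classpath and a coordinator
        // is listening on localhost:8080 with a Hive catalog configured
        String url = "jdbc:presto://localhost:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM page_views")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}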
Or, you can take a look at Impala - which has been available as production-ready software for six months. Like Presto, Impala is a distributed SQL query engine for data in HDFS that circumvents MapReduce. Unlike Presto, there is a commercial vendor providing support (Cloudera).
That said, David's comments about data size still apply. Use the right tool for the job.

Cassandra and asp.net (C#)

I am interested in building a portal on Cassandra services, since I have faced performance and scale issues starting from one million records.
They could certainly be solved, but I am interested in other options.
My main issue is the cost of updating all the necessary indexes to keep reads fast.
First, is Cassandra a good fit for ASP.NET programmers? I mean, maybe there are other projects worth taking a look at?
And second, can you provide any documentation or samples on how to start with Cassandra programming from C#?
since I have faced performance and scale issues starting from one million records.
Maybe your design was not that good; NoSQL is not a magic bullet for bad design. I have multi-billion-row tables and 95% of responses are sub-second. Also, what do you mean by updating indexes: updating statistics or rebuilding indexes?
since I have faced performance and scale issues starting from one million records.
You know, for modern databases the one million mark is where the data stops being so trivially small that you can get away with not knowing what you are doing. Below one million is "tiny". I have an 800 million row table and get a LOT of SQL running through it - no problem at all.
First, is Cassandra a good fit for ASP.NET programmers?
I would sooner suggest a basic book about SQL, reading the documentation and POSSIBLY throwing some hardware at the problem. As in: having totally inadequate hardware will kill any data management system.
If you are using Cassandra for your .NET application, take a look at Aquiles. I developed it based on my company's needs. If you find it useful or need any help, let me know.
You can't really speak of Cassandra documentation; there's a myriad of partial tutorials on the web.
You may want to set up Linux in a virtual machine, because the Windows build process is quite challenging, to say the least (http://www.virtualbox.org, http://www.ubuntu.com).
Here's the howto:
http://www.ridgway.co.za/archive/2009/11/06/net-developers-guide-to-getting-started-with-cassandra.aspx
Note that the Cassandra SVN URL and the code sample have changed since this tutorial was written.
Here's another C# client:
http://github.com/mattvv/hectorsharp
And here some sample code:
http://www.copypastecode.com/26752/
Note that you need to download the latest Java Development Kit (JDK) from Sun for Linux.
It's not in the repositories of Ubuntu 10.04.
Then you need to type
export JAVA_HOME="/path/to/jdk"
in order for Cassandra to find your Java installation.
You might also want to take a look at:
http://en.wikipedia.org/wiki/NoSQL
The taxonomy section is especially interesting.
Make sure Cassandra is the right type of NoSQL solution for your problem, e.g. use Neo4J if your problem actually is a graph problem.
Also, you need to make sure your NoSQL solution is ACID-compliant.
For example, Neo4J is the only ACID-compliant NoSQL graph engine.
Edit: Here's a jumpstart guide for Windows, without compiling:
http://coderjournal.com/2010/03/cassandra-jump-start-for-the-windows-developer/
http://www.ronaldwidha.net/2010/06/23/running-cassandra-on-windows-first-attempt/
http://www.yafla.com/dforbes/Getting_Started_with_Apache_Cassandra_a_NoSQL_frontrunner_on_Windows/
Instead of Cassandra you might take a look at RavenDB. Supposedly it is a document store written in and created for .NET. It has LINQ integration and is (again supposedly) very fast.
As with any new technology, read up on whether it helps with your specific case, and check whether it is proven technology (do they have mainstream clients using it?).
Before you go down this route, see if you can't optimize your current solution first. Check that your queries are fast, that the indexes are done correctly, and whether you can remove load by adding caching.
Last but not least, if adding some processors to your SQL machine would fix the issue, that is typically a much cheaper solution.
If you want to do something new, then instead of going for NoSQL you might want to consider trying a database cluster.
The idea is that when two machines each search half of the original database at the same time, you halve the search time without totally redesigning your existing database.
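To illustrate the idea, here is a rough sketch of fanning one aggregate query out to two shards in parallel over JDBC; the shard URLs, credentials and table name are all invented for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ShardedCount {
    public static void main(String[] args) throws Exception {
        // hypothetical shards, each holding half of the rows
        List<String> shards = List.of(
                "jdbc:postgresql://shard1/portal",
                "jdbc:postgresql://shard2/portal");
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        List<Future<Long>> partials = new ArrayList<>();
        for (String url : shards) {
            partials.add(pool.submit(() -> countRows(url))); // query shards concurrently
        }
        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get(); // combine the per-shard results
        }
        pool.shutdown();
        System.out.println("total rows: " + total);
    }

    private static long countRows(String url) throws SQLException {
        try (Connection c = DriverManager.getConnection(url, "portal", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM records")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}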
