How much data do I need to have to make use of Presto?

How much data do I need to have to make use of Presto? The web site states that it can query data sizes from gigabytes to petabytes. I understand how it is used to query very large datasets, but is anyone using it for hundreds of gigabytes?

Currently, Presto is most useful if you already have an existing Hive installation. If you are using Hive, you should definitely try Presto. If all your data fits in a relational database like PostgreSQL or MySQL on a single machine, and you are happy with the performance, then keep using that.
However, Presto should be much faster than either of those single-machine databases for analytic queries because it executes each query in parallel; neither of those databases parallelizes the execution of an individual query. At the moment, using Presto requires setting up HDFS and Hive (even on a single machine), so getting started will be more work than if you already have an existing Hive installation.
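For a sense of what using Presto looks like from a client, here is a minimal sketch with the presto-python-client package, assuming a coordinator on localhost:8080, the Hive connector, and a hypothetical table web_events:

    import prestodb

    # Coordinator address, user, and table names are all placeholders.
    conn = prestodb.dbapi.connect(
        host='localhost', port=8080, user='analyst',
        catalog='hive', schema='default',
    )
    cur = conn.cursor()
    # An aggregation like this is what Presto splits across its workers.
    cur.execute('SELECT country, count(*) FROM web_events GROUP BY country')
    for country, cnt in cur.fetchall():
        print(country, cnt)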

Or you can take a look at Impala, which has been available as production-ready software for six months. Like Presto, Impala is a distributed SQL query engine for data in HDFS that circumvents MapReduce. Unlike Presto, it has a commercial vendor providing support (Cloudera).
That said, David's comments about data size still apply. Use the right tool for the job.

Related

Are RocksDB and LevelDB just like Riak?

I have a question regarding some NoSQL databases. In Ehcache we have, for example, the JCache API; in MapDB, the Map interface; and Riak KV runs as its own process with clusters. How exactly do I find out which database fits which implementation type? For example, RocksDB (I assume it runs as a process), and the same for LevelDB.
For reference, RocksDB and LevelDB perform very similar functions and can be interchangeable in some situations.
Given your question, "Are RocksDB and LevelDB just like Riak?", I can say that they are not the same: Riak provides a scalable distributed platform that can connect to one or more backend databases simultaneously (currently supported backends are Bitcask, LevelDB, Leveled and memory). RocksDB and LevelDB are essentially standalone database engines that can be used as such or be utilised by other software, such as Riak, as a backend. While you could technically implement RocksDB as a backend for Riak KV without needing a mountain of custom code, you probably wouldn't want to, as RocksDB does not scale well.
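To make that distinction concrete, here is a minimal sketch using plyvel, a Python binding for LevelDB (the path and keys are made up): the whole "database" is a local directory opened in-process, with no server or cluster involved.

    import plyvel

    # LevelDB is embedded: no server to connect to, unlike Riak.
    db = plyvel.DB('/tmp/example-db', create_if_missing=True)
    db.put(b'user:1', b'alice')
    print(db.get(b'user:1'))  # b'alice'
    # Keys are kept sorted, so prefix scans are cheap.
    for key, value in db.iterator(prefix=b'user:'):
        print(key, value)
    db.close()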
"How exactly do I find out which database fits which implementation type?" is rather a broad question. I think you might want to rephrase it as "Which databases offer me {my list of desired implementations/functions}?" to make it easier for community members to answer. Please note that some NoSQL databases have multiple uses available. As you mentioned Riak KV: we have Maps, Sets, GSets, Flags, Registers, Solr Search, 2i and the standard CRDT options as well, but some of those may be tied to other requirements, e.g. 2i only works with a LevelDB/Leveled backend, and Solr Search requires the separate Yokozuna package from Riak KV 3.0.0 onwards but is built in for all Riak 2.x.x versions.
What you may also want to try to do is download a few different options to a VM or bare metal rig, have a play and see how it works out. There are often cases where two competing products do something very similar on paper but in your specific use case, one outperforms the other significantly.
To get you started, here are links to Riak 2.9.8 (the latest release of the 2.x.x series) and to the Riak 2.2.6 docs (the 2.9.x docs should be out later this month).
I'm not sure if this has directly answered your question but, hopefully, it will give you some pointers as to where to go next.

How do I extract non-HANA ECC tables into R?

I find that there's very little documentation on how to extract SAP tables into R.
I'm not talking about SAP HANA.
Currently it's very troublesome: I need to manually extract SAP tables using a GUI interface and export them into a tabular format. Only then can I import them using my R script.
The current solution I'm exploring is to have my SAP colleagues export those SAP tables into an SQL database, so that I can query the tables from R.
Ideally I want to cut this seemingly unnecessary step of having the SAP tables exported into a database.
For SAP R/3 systems (or what you call ECC), your best bet would be executing remote function calls (i.e. RFC).
Normally these would be supported by open source interfaces for at least the more recent versions (e.g. 4.6 or above).
However, they are fairly scarce, and I know of only one such implementation in R: RSAP. You'd also need to download the NW RFC SDK, and there may be further requirements depending on your OS (e.g. which Visual C++ you'd need for Windows, etc.).
There's also a slightly more widely recognised equivalent in Python, PyRFC.
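As a rough illustration of the RFC route, here is a minimal PyRFC sketch that reads rows through the standard RFC_READ_TABLE function module (the connection parameters and the MARA table are placeholders; RSAP follows the same pattern in R):

    from pyrfc import Connection

    # All connection parameters here are placeholders.
    conn = Connection(ashost='sap.example.com', sysnr='00',
                      client='100', user='RFC_USER', passwd='secret')
    # RFC_READ_TABLE returns rows as delimited strings under DATA/WA.
    result = conn.call('RFC_READ_TABLE',
                       QUERY_TABLE='MARA',  # example table
                       DELIMITER='|',
                       ROWCOUNT=10)
    for row in result['DATA']:
        print(row['WA'].split('|'))
    conn.close()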
On the other hand, you may try Robotic Process Automation (RPA) to interact with the GUI in an automated way. One of the options is UiPath, but there are others. This way you could automate the table extraction; at the same time, you can also call R scripts directly from the RPA.
Overall, to be honest, the solution of extracting the tables into a separate database does seem to be the best alternative compared to what I've described above.
Note: the above presumes that, for whatever reason (usually security), you cannot access the database underlying ECC directly through ODBC calls; otherwise the instructions for connecting and calling SQL from R are the same as for HANA or similar.
Consider using RODBC. This package lets you add different ODBC sources and use them in RStudio.
Follow this article and don't be put off by the word "HANA"; this approach allows using any database, not only HANA.
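The same ODBC pattern, sketched here in Python with pyodbc for illustration (the DSN, credentials and table are hypothetical; in R, RODBC's odbcConnect and sqlQuery play the same roles):

    import pyodbc

    # 'ECCDB' is a hypothetical ODBC data source configured on the machine.
    conn = pyodbc.connect('DSN=ECCDB;UID=reporting;PWD=secret')
    cursor = conn.cursor()
    cursor.execute('SELECT MATNR, ERSDA FROM MARA')  # example table/columns
    for row in cursor.fetchmany(10):
        print(row)
    conn.close()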

What is the prescriptive approach to supporting multiple RDBMSs with Flyway?

I have an application that supports multiple RDBMSs. The SQL needed to build the data model is different for each of the RDBMSs I need to support. The differences aren't small, either: they stem from the fact that the supported systems range from light use (development, small installations) to heavy use. Simply standardizing on a single supported RDBMS is not an option.
As it stands, I need to be able to apply migrations to my application on all of the supported RDBMSs. Where possible I'd like to share migration scripts to reduce the amount of duplication involved, but I imagine that isn't entirely possible.
The only approach I can come up with so far is to keep separate directories in source control for each of the supported environments. Then at runtime, pick the appropriate directory for the RDBMS that the system is connected to.
Is having one directory per supported RDBMS the prescriptive approach or is there a better way?
Right from the FAQ: What is the best strategy for handling database-specific sql?
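The FAQ's answer is essentially the layout described in the question: shared migrations in a common folder plus one folder per database, selected through Flyway's locations setting. A minimal sketch of the runtime selection, done here in Python around the Flyway CLI (directory names are hypothetical):

    import subprocess

    def migrate(vendor: str) -> None:
        """Run shared migrations plus the vendor-specific directory.

        Hypothetical layout:
            sql/common/      -- migrations valid on every RDBMS
            sql/postgresql/  -- PostgreSQL-only migrations
            sql/mysql/       -- MySQL-only migrations
        """
        locations = f'filesystem:sql/common,filesystem:sql/{vendor}'
        subprocess.run(['flyway', f'-locations={locations}', 'migrate'],
                       check=True)

    migrate('postgresql')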

How does the DBMS affect application performance? And Informix GUI tools?

I use Informix DBMS in all my web applications. My question has two parts:
1. Does the DBMS have a big effect on the performance of my applications, and if so, how do Informix and MS SQL Server compare in this regard?
2. I want some GUI tools to make my job easier when writing queries, creating databases, relationships, ERDs, etc. The Informix client is so bad; there are no facilities at all. I want some tools like SQL Server Management Studio.
As a GUI tool for Informix you can use Aqua Data Studio from Aquafold. It also supports MS SQL Server.
As for performance: it depends. How well is your database designed? Do you use indexes? Are your queries well written? It's very hard to answer your question; we just don't know enough.
To design a solution that performs best, one needs to know the nature of the application you are building. For example, if you are building a system that needs to process and compute large volumes of data, and the computations can be distributed, a "traditional" relational database is not a good option no matter what vendor you choose. You would be better off with an option that supports sharding, such as Hadoop, likely based on some kind of NoSQL solution.
If you are sticking with an RDBMS and building something that has a lot of reads and not a lot of writes, go for a database that supports snapshot isolation, which allows your readers not to be blocked by writers.
Cost also plays into this: some RDBMS products are free, some are not. Your question is way too general to be answered specifically.
Aqua Data Studio is good but quite expensive. The open source SQL Workbench/J is also an effective tool for Informix.
Informix has its own charm, but I would not say that MS SQL Server is slower or performs poorly. You may choose a DBMS according to the nature of your application. There are many techniques to optimize database performance: applying indexes, avoiding too many joins, optimizing queries, using stored procedures, splitting across multiple databases, etc.
Once, when I needed to develop a social media site, I used MySQL for the project but installed MongoDB just for posts.

Cassandra and asp.net (C#)

I am interested in building a portal on Cassandra, since I faced performance and scaling issues starting from 1 million records.
They could definitely be solved, but I am interested in other options.
My main issue is the cost of updating all the indexes necessary to make reads fast.
First, is Cassandra a good way to go for ASP.NET programmers? I mean, maybe there are other projects worth taking a look at.
And second, can you provide any documentation or samples on how to start Cassandra programming from C#?
since I faced performance and scaling issues starting from 1 million records.
Maybe your design was not that good; NoSQL is not a magic bullet for bad design. I have multi-billion-row tables and 95% of responses are sub-second. Also, what do you mean by updating indexes: updating statistics or rebuilding indexes?
since I faced performance and scaling issues starting from 1 million records.
You know, for modern databases the one million mark is where the data stops being "totally ridiculously small", the regime where you can ignore actually knowing what you do. Below one million rows is "tiny". I have an 800 million row table and run a LOT of SQL through it with no problem at all.
First, is Cassandra a good way to go for ASP.NET programmers?
I would sooner suggest a basic book about SQL, reading the documentation, and POSSIBLY throwing some hardware at the problem. As in: having totally bad hardware will kill any data management system.
If you are using Cassandra in your .NET application, take a look at Aquiles. I developed it based on my company's needs. If you find it useful or need any help, let me know.
You can't really speak of Cassandra documentation; there's a myriad of partial tutorials on the web.
You may want to set up Linux in a virtual machine, because the Windows build process is quite challenging, to say the least. (http://www.virtualbox.org, http://www.ubuntu.com)
Here's the howto:
http://www.ridgway.co.za/archive/2009/11/06/net-developers-guide-to-getting-started-with-cassandra.aspx
Note that the Cassandra SVN URL and the code sample have changed since this tutorial was written.
Here's another C# client:
http://github.com/mattvv/hectorsharp
And here's some sample code:
http://www.copypastecode.com/26752/
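Those samples are C#; purely to illustrate the general client pattern, here is a minimal sketch with the Python cassandra-driver package, which speaks CQL rather than the Thrift API these 2010-era clients used (the node address, keyspace and table are made up):

    from cassandra.cluster import Cluster

    # Assumes a node on localhost and a pre-created keyspace 'portal'
    # with a table users(id int PRIMARY KEY, name text).
    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('portal')
    session.execute('INSERT INTO users (id, name) VALUES (%s, %s)',
                    (1, 'alice'))
    row = session.execute('SELECT name FROM users WHERE id = %s', (1,)).one()
    print(row.name)
    cluster.shutdown()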
Note that you need to download the latest Java Development Kit (JDK) from Sun for Linux.
It's not in the repositories of Ubuntu 10.04.
Then you need to type
export JAVA_HOME="/path/to/jdk"
in order for Cassandra to find your Java installation.
You might also want to take a look at:
http://en.wikipedia.org/wiki/NoSQL
The taxonomy section is especially interesting.
Make sure Cassandra is the right type of NoSQL solution for your problem, e.g. use Neo4J if your problem actually is a graph problem.
Also, you need to make sure your NoSQL solution is ACID-compliant.
For example, Neo4J is the only ACID-compliant NoSQL graph engine.
Edit: Here's a jumpstart guide for Windows, without compiling:
http://coderjournal.com/2010/03/cassandra-jump-start-for-the-windows-developer/
http://www.ronaldwidha.net/2010/06/23/running-cassandra-on-windows-first-attempt/
http://www.yafla.com/dforbes/Getting_Started_with_Apache_Cassandra_a_NoSQL_frontrunner_on_Windows/
Instead of Cassandra you might take a look at RavenDB. Supposedly it is a document store made with and created for .NET. It has LINQ integration and is (again supposedly) very fast.
As with any new technology, read up on whether it helps with your specific case, and check whether it is a proven technology (do they have mainstream clients using it?).
Before you go down this route, see if you can't optimize your current solution first. Check that your queries are fast, that the indexes are set up correctly, and whether you can remove load by adding caching.
Last but not least, adding some processors to your SQL machine might fix the issue and is typically a much cheaper solution.
If you want to do something new, then instead of going for NoSQL you might want to consider trying a database cluster.
The idea is that when two machines each search half of the original database at the same time, you get half the search time without totally redesigning your existing database.
