Tools for running analysis on data held in BigQuery?

I have about 100GB of data in BigQuery, and I'm fairly new to data analysis tools. I want to grab about 3000 extracts via a programmatic series of SQL queries, and then run some statistical analysis to compare kurtosis across those extracts.
Right now my workflow is as follows:
1. Running on my local machine, use the BigQuery Python client APIs to grab the data extracts and save them locally.
2. Running on my local machine, run the kurtosis analysis over the extracts using scipy.
The second of these works fine, but the first is slow and painful: saving all 3000 data extracts locally runs into network timeouts and similar issues.
Is there a better way of doing this? Basically I'm wondering if there's some kind of cloud tool where I could quickly run the calls to get the 3000 extracts, then run the Python to do the kurtosis analysis.
I had a look at https://cloud.google.com/bigquery/third-party-tools but I'm not sure if any of those do what I need.

So far Cloud Datalab is your best option
https://cloud.google.com/datalab/
It is in beta, so some surprises are possible.
Datalab is built on top of the Jupyter/IPython option below and runs entirely in the cloud.
Another option is the Jupyter/IPython Notebook:
http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/
Our data science team started with the second option long ago with great success and is now moving toward Datalab.
For the rest of the business (prod, BI, ops, sales, marketing, etc.), though, we had to build our own workflow/orchestration tool, as nothing we found was good or relevant enough.

Two easy ways:
1: If your issue is the network, as you say, use a Google Compute Engine machine to do the analysis, in the same zone as your BigQuery tables (US, EU, etc.). It will not have network issues getting data from BigQuery and will be super fast.
The machine will only cost you for the minutes you use it. Save a snapshot of your machine to reuse the setup anytime (the snapshot also has a monthly cost, but much lower than keeping the machine up).
2: Use Google Cloud Datalab (beta as of Dec. 2015), which supports BigQuery sources and gives you all the tools you need to do the analysis and later share it with others:
https://cloud.google.com/datalab/
from their docs: "Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of modules and a robust knowledge base. Cloud Datalab enables analysis of your data on Google BigQuery, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions)."
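To make that concrete for the original question, here is a minimal sketch of the notebook workflow, assuming the google-cloud-bigquery client library and a hypothetical table my_project.my_dataset.my_table with segment_id and value columns; the kurtosis step is the same scipy call used locally:
from google.cloud import bigquery
from scipy.stats import kurtosis
client = bigquery.Client()  # picks up the notebook VM's default credentials
results = {}
for segment_id in range(3000):  # one extract per query, as in the question
    sql = f"""
        SELECT value
        FROM `my_project.my_dataset.my_table`
        WHERE segment_id = {segment_id}"""
    df = client.query(sql).to_dataframe()        # extract arrives as an in-memory DataFrame
    results[segment_id] = kurtosis(df["value"])  # Fisher kurtosis, same scipy call as before
Because the notebook runs in the same cloud as the dataset, the extracts never cross a home network connection, which is what made the local workflow painful.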

You can check out Cooladata.
It allows you to query BQ tables as external data sources.
You can either schedule your queries and export the results to Google Cloud Storage, where you can pick up from there, or use the built-in reporting tool to answer your 3000 queries.
It also provides all the BI tools you will need for your business.

Related

Cloudera - difference between Cloudera BDR and DistCp

I am trying to replicate data from Cloudera HDFS to AWS cloud storage. This is not a one-time movement; I need to replicate data between the systems on an ongoing basis. I see there are two options available: Cloudera BDR and DistCp.
Could someone please let me know what the difference between those two is? If I understand correctly, Cloudera BDR only uses DistCp underneath to copy data.
I know BDR provides:
job scheduling, and
Hive metadata extraction and movement to the cloud (which I don't require).
However, apart from that, what extra features does it provide that would make the case for choosing BDR?
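For comparison, here is a rough sketch of what the "DistCp only" side looks like when you provide the scheduling yourself (e.g. from cron, Oozie, or a wrapper like the one below); the paths and bucket name are placeholders, and it assumes the S3A connector is configured on the cluster:
import subprocess

def replicate(src="hdfs://namenode:8020/data/warehouse",
              dst="s3a://my-replica-bucket/data/warehouse"):
    # -update copies only new/changed files and -delete removes files gone
    # from the source, so repeated runs behave as incremental replication.
    subprocess.run(["hadoop", "distcp", "-update", "-delete", src, dst], check=True)

if __name__ == "__main__":
    replicate()
BDR essentially wraps the same underlying copy with the scheduling and Hive metadata handling listed above, which is what you would otherwise rebuild around a script like this.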

Azure Synapse replicated to Cosmos DB?

We have an Azure data warehouse database (Azure Synapse) that will need to be consumed by read-only users around the world, and we would like to replicate the needed objects from the data warehouse, potentially to a Cosmos DB. Is this possible, and if so, what are the available options (transactional, merge, etc.)?
Synapse is mainly about getting your data in for analysis. I don't think it has a direct export option of the kind you have described above.
However, what you can do is use Azure Stream Analytics, and then you should be able to integrate/stream whatever you want to any destination you need, like an app or a database and so on.
More details here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-azure-stream-analytics
I think you can also pull the data into Power BI, and perhaps set up some kind of automatic export from there.
More details here: https://learn.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi
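If near real time is not a hard requirement, another option (not covered by the links above) is a plain batch copy job; a hedged sketch using pyodbc against the Synapse SQL endpoint and the azure-cosmos Python SDK, with placeholder server, credential, table, and container names:
import pyodbc
from azure.cosmos import CosmosClient
# Placeholder connection details; a Synapse dedicated SQL pool is reachable
# through the standard SQL Server ODBC driver.
synapse = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;Database=mydw;UID=reader;PWD=<password>"
)
cosmos = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = cosmos.get_database_client("analytics").get_container_client("facts")
cursor = synapse.cursor()
# Cast to JSON-friendly types; Cosmos DB requires a string 'id' on every item.
cursor.execute(
    "SELECT CAST(order_id AS varchar(36)) AS id, region, CAST(amount AS float) AS amount "
    "FROM dbo.FactOrders"
)
columns = [c[0] for c in cursor.description]
while True:
    rows = cursor.fetchmany(500)
    if not rows:
        break
    for row in rows:
        container.upsert_item(dict(zip(columns, row)))
Run on a schedule, something like this keeps a read-only Cosmos DB copy refreshed without a streaming pipeline.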

Kafka Connector for Oracle Database Source

I want to build a Kafka Connector in order to retrieve records from a database in near real time. My database is Oracle Database 11g Enterprise Edition Release 11.2.0.3.0, and the tables have millions of records. First of all, I would like to add the minimum load to my database by using CDC. Secondly, I would like to retrieve records based on a LastUpdate field whose value is after a certain date.
Searching the Confluent site, the only open-source connector that I found was the "Kafka Connect JDBC" connector. I think this connector doesn't have a CDC mechanism, and it isn't possible to retrieve millions of records when the connector starts for the first time. The alternative solution I thought of is Debezium, but there is no Debezium Oracle connector on the Confluent site, and I believe it is in a beta version.
Which solution would you suggest? Is something wrong with my assumptions about Kafka Connect JDBC or the Debezium connector? Is there any other solution?
For query-based CDC, which is less efficient, you can use the JDBC source connector.
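To make the query-based option concrete, here is a hedged sketch of registering the Confluent JDBC source connector in timestamp mode on the LastUpdate column through the Kafka Connect REST API; host names, credentials, and the table name are placeholders:
import requests

connector = {
    "name": "oracle-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@//dbhost:1521/ORCL",
        "connection.user": "kafka_reader",
        "connection.password": "<password>",
        "mode": "timestamp",                    # query-based CDC on a timestamp column
        "timestamp.column.name": "LASTUPDATE",
        "table.whitelist": "MYSCHEMA.MYTABLE",
        "topic.prefix": "oracle-",
        "poll.interval.ms": "10000",
        # Recent connector versions also accept "timestamp.initial" (epoch ms)
        # to skip history before a given date; check your connector version.
    },
}
resp = requests.post("http://connect:8083/connectors", json=connector)  # default Connect REST port
resp.raise_for_status()
This still issues polling queries against the table (hence the extra load the question worries about), which is exactly the trade-off versus the log-based options below.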
For log-based CDC, I am aware of a couple of options; however, some of them require a license:
1) Attunity Replicate, which allows users to use a graphical interface to create real-time data pipelines from producer systems into Apache Kafka without having to do any manual coding or scripting. I have been using Attunity Replicate for Oracle -> Kafka for a couple of years and have been very satisfied.
2) Oracle GoldenGate, which requires a license.
3) Oracle LogMiner, which does not require any license and is used by both Attunity and kafka-connect-oracle, a Kafka source connector for capturing all row-based DML changes from an Oracle database and streaming these changes to Kafka. Its change data capture logic is based on Oracle LogMiner.
We have numerous customers using IBM's IIDR (InfoSphere Data Replication) product to replicate data from Oracle databases (as well as IBM Z mainframes, IBM i, SQL Server, etc.) into Kafka.
Regardless of which source is used, data can be normalized into one of many formats in Kafka. An example of an included, selectable format is:
https://www.ibm.com/support/knowledgecenter/en/SSTRGZ_11.4.0/com.ibm.cdcdoc.cdckafka.doc/tasks/kcopauditavrosinglerow.html
The solution is highly scalable and has been measured replicating changes at hundreds of thousands of rows per second.
We also have a proprietary ability to reconstitute data written in parallel to Kafka back into its original source order. So, despite the data having been written to numerous partitions and topics, the original total order can be known. This functionality is known as the TCC (transactionally consistent consumer).
See the video and slides here...
https://kafka-summit.org/sessions/exactly-once-replication-database-kafka-cloud/

What is the best way to get (stream) data from BigQuery to R (Rstudio server in Docker)

I have a number of large tables in Google BigQuery, containing data to be processed in R. I am running RStudio via Docker on Google Cloud Platform using the Container Engine.
I have tested a few routes with a table of 38 million rows (three columns) with a table size of 862 MB in BigQuery.
The first route I tested was using the R package bigrquery. This option was preferred, as data can be queried directly from BigQuery and the data acquisition can be incorporated in R loops. Unfortunately, this option is very slow; it takes close to an hour to complete.
The second option I tried was exporting the BigQuery table to a CSV file on Google Cloud Storage (approx. 1 minute) and using the public link to import it into RStudio (another 5 minutes). This route entails quite a bit of manual handling, which is not desirable.
In the Google Cloud Console I noticed that VM instances can be granted access to BigQuery. Also, RStudio can be configured to have root access in its Docker container.
So finally my question: is there a way to use this backdoor to enable fast data transfer from BigQuery into an R dataframe in an automated way? Or are there other ways to achieve this goal?
Any help is highly appreciated!
Edit:
I have loaded the same table into a MySQL database hosted on Google Cloud SQL, and this time it took only about 20 seconds to load the same amount of data. So some kind of transfer from BigQuery via Cloud SQL is an option too.
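Not an R-native fix, but the manual part of the second route can be automated; a rough sketch with the BigQuery Python client, using placeholder project, dataset, table, and bucket names:
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_ref = client.dataset("my_dataset").table("my_table")
# A wildcard lets BigQuery shard large exports across several CSV files.
destination_uri = "gs://my-bucket/exports/my_table-*.csv"
extract_job = client.extract_table(table_ref, destination_uri)
extract_job.result()  # blocks until the export job finishes
R then only has to read the exported file(s) from the bucket, which removes the manual handling without changing the overall route.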

Could a TrainedModel be re-used in Google Prediction?

I am using Google Prediction v1.5 (Java client) with the "PredictionSample.java" sample program.
In "PredictionSample.java", I specify "MODEL_ID" as "mymodel", "APPLICATION_NAME" as "MyApplication" and train the model using "language_id.txt" stored in my Google Cloud Storage bucket. The sample program runs OK and performs several predictions using some input features.
However, I wonder where the "mymodel" TrainedModel is stored. Is it stored under my "Google APIs console" project? (It seems that I could not find "mymodel" in my "Google APIs console" project.)
The FAQ of the Google Prediction API says that "You cannot currently download your model." It seems that the TrainedModel ("mymodel") is stored somewhere on the Google Prediction server. I wonder where exactly the storage location is, and how I could re-use this TrainedModel to perform predictions using the Google Prediction v1.5 Java client (i.e. without re-training the model for future predictions).
Does anyone have ideas on this? Thanks for any suggestions.
The models are stored in Google's Cloud and can't be downloaded.
Models trained in Prediction API 1.5 and earlier are associated with the user that created them - and cannot be shared with other users.
If you use Prediction API 1.6 (released earlier this month), models are now associated with a project (similar to Google Cloud Storage, BigQuery, Google Compute Engine and other Google Cloud Platform products) which means that the same model can now be shared amongst all members of the project team.
I finally found out that I do not need to re-train the TrainedModel for future predictions. I just delete the
train(prediction);
statement in "PredictionSample.java", and I can still perform predictions from the "mymodel" TrainedModel.
The TrainedModel is in fact stored somewhere on the Google Prediction server. I can "list" it through the APIs Explorer for the Prediction API. I can perform predictions from it once it has been trained/inserted, but I cannot download it for my own use.
