Is migrating from RDS to Elastic MapReduce + Hive the right choice?

First of all, I should make clear that I am a newbie, so excuse me if I don't use the correct terminology in my question.
This is my scenario:
I need to analyze large quantities of text like tweets, comments, mails, etc. The data is currently inserted into an Amazon RDS MySQL instance as it arrives.
Later I run an R job locally using RTextTools (http://www.rtexttools.com/) over that data to produce my desired results. At this point it might be important to make clear that the R script analyzes the data and writes the results back into a MySQL table, which is later used for display.
The issue I am having lately is that the job takes about an hour each time I run it, and I need to run it at least twice a day, so using my local computer is no longer an option.
Looking for alternatives, I started to read about Amazon Elastic MapReduce, which at first sight seems to be what I need, but this is where my questions and confusion begin.
I read that data for EMR should be pulled from an S3 bucket. If that's the case, then I must start storing my data as JSON or a similar format in an S3 bucket rather than in my RDS instance, right?
At this point I read it is a good idea to create Hive tables and then use RHive to read the data so that RTextTools can do its job and write the results back to my RDS tables. Is this right?
And now the final and most important question: is all this trouble worth it compared to running an EC2 instance with R and running my R scripts there? Will it reduce computing time?
Thanks a lot for your time and any tip in the right direction will be much appreciated

Interesting. I would like to suggest a few things.
You can certainly store data in S3, but you will first have to write your data to a file (txt, JSON, etc.) and then push that file to S3; you cannot insert raw records into S3 the way you do with a database. You could probably benefit from CloudFront deployed over S3 for fast retrieval of the data. You can also keep using RDS; you will have to analyze the performance difference yourself.
Writing results back to RDS shouldn't be an issue. EMR basically creates EC2 instances, ElasticMapReduce-master and ElasticMapReduce-slave, which can be used to communicate with RDS.
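To make the flow concrete, here is a minimal sketch of the "write a file, push it to S3, write results back to RDS" idea, assuming boto3 and PyMySQL are installed; the bucket, key, table, and column names are all hypothetical.

```python
# Rough sketch of the flow described above. Bucket, file, and table names are
# hypothetical, and the RDS credentials are placeholders.
import json

import boto3
import pymysql

# 1. Dump a batch of raw text records to a local file and push it to S3
records = [{"id": 1, "text": "sample tweet"}, {"id": 2, "text": "sample comment"}]
with open("batch_0001.json", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

boto3.client("s3").upload_file("batch_0001.json", "my-text-bucket", "input/batch_0001.json")

# 2. After the EMR/R job produces its scores, write them back to the RDS MySQL table
conn = pymysql.connect(host="my-rds-endpoint", user="user", password="***", database="analytics")
with conn.cursor() as cur:
    cur.executemany(
        "UPDATE documents SET sentiment = %s WHERE id = %s",
        [("positive", 1), ("negative", 2)],  # placeholder results
    )
conn.commit()
conn.close()
```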
I think it's worth trying an EC2 instance with R, but to reduce the computation time you might have to use an expensive EC2 instance, or set up autoscaling and divide the task between different instances. That is essentially implementing the whole parallel-computation logic yourself, whereas with EMR you get all of the MapReduce logic built in. So first try EMR, and if it doesn't work out well for you, try a fresh EC2 instance with R.
Let me know how it goes, thank you.

You should consider trying EMR. S3 + EMR is very much worth trying out if the 1-hour window is a constraint. For your type of processing workload, you might save cycles by using a scalable, on-demand Hadoop/Hive platform. Obviously, there are some learning, re-platforming, and ongoing cluster-management costs related to the trial and the switch; they are non-trivial. Alternatively, consider services such as Qubole, which also runs on EC2 + S3 and provides higher-level (and potentially easier to use) abstractions.
Disclaimer: I am a product manager at Qubole.

Related

Snowpipe vs Airflow for continuous data loading into Snowflake

I have a question related to Snowflake. In my current role, I am planning to migrate data from ADLS (Azure Data Lake Storage) to Snowflake.
I am currently looking at two options:
Create a Snowpipe to load the updated data
Create an Airflow job for the same
I am still trying to understand which will be the better way and what the pros and cons of choosing each are.
It depends on what you are trying to do as part of this migration. If it is a plain-vanilla (no transformations, no complex validations), as-is migration of data from ADLS to Snowflake, then you may be fine with Snowpipe (but please also check whether your scenario is better suited to Snowpipe or to a bulk COPY: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro.html#recommended-load-file-size).
If you have many steps before you move the data to Snowflake, and there is a chance you may need to change your workflow in the future, it is better to use Airflow, which will give you more flexibility. In one of my migrations I used Airflow, and in another CONTROL-M.
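For a sense of what the Airflow option looks like, here is a minimal sketch assuming the Snowflake provider package is installed and an Airflow connection named snowflake_default exists; the DAG id, schedule, stage, and table names are all hypothetical.

```python
# Hypothetical Airflow DAG that runs a COPY into Snowflake on a batch schedule.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="adls_to_snowflake_copy",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",   # batch cadence; Snowpipe would be event-driven instead
    catchup=False,
) as dag:
    load_updates = SnowflakeOperator(
        task_id="copy_into_target",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO analytics.target_table
            FROM @analytics.adls_stage/updates/
            FILE_FORMAT = (TYPE = PARQUET)
        """,
    )
```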
You'll be able to load higher volumes of data with lower latency if you use Snowpipe instead of Airflow. It'll also be easier to manage Snowpipe in my opinion.
Airflow is a batch scheduler, and using it to schedule anything that runs more frequently than every 5 minutes becomes painful to manage. Also, you'll have to manage the scaling yourself with Airflow. Snowpipe is a serverless option that can scale up and down based on the volume it sees, and you're going to see your data land within about 2 minutes.
The only thing that should restrict your usage of Snowpipe is cost. Although, you may find that Snowpipe ends up being cheaper in the long run if you consider that you'll need someone to manage your Airflow pipelines too.
There are a few considerations. Snowpipe can only run a single copy command, which has some limitations itself, and Snowpipe imposes further limitations as per its Usage Notes. The main pain is that it does not support PURGE = TRUE | FALSE (i.e. automatic purging while loading); the docs say:
Note that you can manually remove files from an internal (i.e. Snowflake) stage (after they've been loaded) using the REMOVE command.
Regrettably, the Snowflake docs are famously vague, written in an ambiguous, colloquial style. While they say you "can" remove the files manually, in reality any user running Snowpipe as advertised for "continuous fast ingestion" must remove the files, or suffer the performance/cost impact of the copy command having to skip over a very large number of previously loaded files. The docs on the cost and performance of "table directories", which are implicit to stages, describe 1M files as a lot of files. By way of an official example, the default pipe flush time for the Snowflake Kafka connector's Snowpipe is 120 s; assuming data ingests continually and you produce one file per flush, that is roughly 720 files a day, and you will reach 1M files within a few years. Yet using Snowpipe is supposed to imply low latency: lower the flush to 30 s and you may hit the 1M-file mark in about a year.
If you want a fully automated process with no manual intervention, this could mean that after you have pushed files into a stage and invoked the pipe, you need logic that polls the API to learn which files were eventually loaded; your logic can then remove the loaded files. The official Snowpipe Java example code has some logic that pushes files and then polls the API to check when the files are eventually loaded. The Snowflake Kafka connector also polls to check which files the pipe has eventually, asynchronously completed. Alternatively, you might write an Airflow job to ls @the_stage, look for files whose last_modified is older than some safe threshold, and then rm @the_stage/path/file.gz for the older files.
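A rough sketch of that cleanup job, using the snowflake-connector-python package; the connection details, stage name, and two-day safety threshold are hypothetical, and the exact format of the LIST output should be verified against your own account.

```python
# Hypothetical "list, then remove already-loaded files" cleanup sketch.
from datetime import datetime, timedelta

import snowflake.connector

SAFE_AGE = timedelta(days=2)  # assume anything older than this has been loaded

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()
cur.execute("LIST @the_stage")
for name, size, md5, last_modified in cur.fetchall():
    # last_modified looks like "Mon, 1 Apr 2024 12:00:00 GMT"
    modified = datetime.strptime(last_modified, "%a, %d %b %Y %H:%M:%S %Z")
    if datetime.utcnow() - modified > SAFE_AGE:
        # LIST usually returns the path prefixed with the stage name; adjust the
        # path handling if your stage reports paths differently
        cur.execute(f"REMOVE @{name}")
conn.close()
```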
The next limitation is that the copy command is a "COPY INTO your_table" command that can only target a single table. You can, however, do advanced transformations using SQL in the copy command.
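As a hedged illustration of such a transform (made-up table, stage, and JSON field names), the SELECT inside the COPY can reshape the staged data on the way in:

```python
# Hypothetical COPY with an inline SQL transform, run via the Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
conn.cursor().execute("""
    COPY INTO events (event_id, user_id, received_at)
    FROM (
        SELECT $1:id::STRING,
               $1:user::STRING,
               TO_TIMESTAMP_NTZ($1:ts::NUMBER, 3)
        FROM @the_stage/events/
    )
    FILE_FORMAT = (TYPE = JSON)
""")
conn.close()
```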
Another thing to consider is that neither latency nor throughput is guaranteed with Snowpipe. The documentation very clearly says you should measure the latency yourself. It would be a complete "free lunch" if Snowpipe, which runs on shared infrastructure to reduce your costs, ran instantly and as fast as if you were paying for hot warehouses. It is reasonable to assume a higher tail latency when using shared "on-demand" infrastructure (i.e. a small percentage of invocations will have a high delay).
You have no control over the size of the warehouse used by Snowpipe. This will affect the performance of any SQL transforms used in the copy command. In contrast, if you run the load from Airflow you assign a warehouse to run the copy command, and you can assign as big a warehouse as you need to run your transforms.
A final consideration is that to use Snowpipe you need to make a Snowflake API call. That is significantly more complex code to write than making a regular database connection to load data into a stage. For example, the regular Snowflake JDBC database connection has advanced methods that make it efficient to stream data into stages without having to write OAuth code to call the Snowflake API.
Be very clear that if you carefully read the Snowpipe documentation, you will see that Snowpipe is simply a restricted COPY INTO table command running on shared infrastructure and eventually run at some point, whereas you yourself can run a full copy command, as part of a more complex SQL script, on a warehouse that you can size and suspend. If you can live with the restrictions of Snowpipe, can figure out how to remove the files in the stage yourself, and can accept that tail latency and throughput are likely to be worse than with a dedicated warehouse you pay for, then it could be a good fit.

Trying to consolidate multiple Amazon DynamoDB tables into one

Scenario:
I've got a semi-structured dataset in JSON format. I'm storing the 3 subsets (new_records, updated_records, and deleted_records) from the dataset in 3 different Amazon DynamoDB tables, scheduled to be truncated and loaded daily.
I'm trying to create a mapping to source data from these DynamoDB tables, append a few metadata columns (date_created, date_modified, is_active), and consolidate the data in a master DynamoDB table.
Issues and Challenges:
I tried AWS Glue: I created a Data Catalog for the source tables using a crawler. I understand AWS Glue doesn't support DynamoDB as a write target, so I changed the target to Amazon S3. However, the AWS Glue job ends up creating some sort of reduced form of the data (Parquet objects) in my Amazon S3 bucket. I've limited experience with PySpark, Pig, and Hive, so excuse me if I'm unable to explain clearly.
Quick research on Google suggested reading the Parquet objects on Amazon S3 using Amazon Athena or Redshift Spectrum.
I'm not sure, but this looks like overkill, doesn't it?
I read about AWS Data Pipeline, which offers to quickly transfer data between different AWS services, although I'm not sure whether it provides some mechanism to create mappings between source and target (in order to append additional columns) or whether it just dumps data straight from one service to another.
Can anyone hint at a lucid and minimalistic solution?
-- Update --
I've been able to consolidate the data from Amazon DynamoDB to Amazon Redshift using AWS Glue, which turned out to be actually quite simple.
However, Amazon Redshift has a few characteristic issues: its relational nature, and its inability to directly perform a single merge or upsert to update a table, are the major things I'm weighing here.
I'm considering whether Amazon Elasticsearch Service can be used here to index and consolidate the data from Amazon DynamoDB.
I'm not sure about your needs and assumptions. But let me post my thoughts that may help!
Why are you planning to do this migration? Think about this carefully.
Moving from 3 tables to 1 table, table size should not be an issue with DynamoDB, but think about read/write capacity units.
Athena is a good option: you write SQL to query your data and pay based on the data scanned by each query. But Athena has a 30-minute query timeout (I think you can request an increase for that, but I'm not sure).
I think Data Pipeline is worth trying; yes, you can process the data while moving it. A plain scripted alternative is sketched below.
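If a plain script turns out to be simpler than Glue or Data Pipeline, the consolidation itself is small enough to do with boto3 (for example from a Lambda or a scheduled job). This is only a sketch: the table names, key schema, and metadata columns below are assumptions taken from the question.

```python
# Hypothetical consolidation of the three staging tables into a master table.
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
master = dynamodb.Table("master_records")

SOURCES = {
    "new_records": {"is_active": True},
    "updated_records": {"is_active": True},
    "deleted_records": {"is_active": False},
}

now = datetime.now(timezone.utc).isoformat()

with master.batch_writer() as batch:
    for table_name, flags in SOURCES.items():
        table = dynamodb.Table(table_name)
        page = table.scan()
        while True:
            for item in page["Items"]:
                item.update(flags)                      # append metadata columns
                item.setdefault("date_created", now)
                item["date_modified"] = now
                batch.put_item(Item=item)
            if "LastEvaluatedKey" not in page:          # paginate through the scan
                break
            page = table.scan(ExclusiveStartKey=page["LastEvaluatedKey"])
```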

Setting up an Amazon Web Services (AWS) database and EC2

Currently, I have Python code that builds machine learning models. The data for these models comes from a local SQLite database (my client provides the data to us in an S3 bucket; I download it to my machine and push it into the SQLite database). At a very high level, these are the 3 steps I perform on my machine (a rough sketch follows the list):
Download the data from S3 and load to SQLite
Connect to SQLite using python and perform data cleaning, aggregation, and model building in python
Write the results again to the SQLite
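A minimal sketch of this workflow, where the bucket, key, and table names are placeholders and the aggregation step stands in for the real model-building code:

```python
# Rough sketch of the current local workflow (placeholder names throughout).
import sqlite3

import boto3
import pandas as pd

# 1. Download the client's file from S3 and load it into SQLite
boto3.client("s3").download_file("client-bucket", "exports/latest.csv", "latest.csv")
conn = sqlite3.connect("local.db")
pd.read_csv("latest.csv").to_sql("raw_data", conn, if_exists="replace", index=False)

# 2. Clean/aggregate in Python (stand-in for the real model building)
df = pd.read_sql_query("SELECT * FROM raw_data", conn)
results = df.groupby("customer_id").size().reset_index(name="n_rows")

# 3. Write the results back to SQLite
results.to_sql("model_results", conn, if_exists="replace", index=False)
conn.close()
```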
Our client has asked us to provide specifications for setting up an Amazon server so that we can run all these processes every day, as an application, at the click of a button. We planned to provide all the information after implementing the above-mentioned end-to-end steps using our own AWS account. I have no prior experience setting up AWS or databases but want to learn more. These are the questions I have:
Can the above process be replicated on AWS? I use Python 2.7 and a SQLite DB.
We don't use any relational features of the SQLite DB while reading or writing data (like PK constraints, etc.). So is it better to read and write directly from the S3 bucket?
What are the different components on AWS that I need? As I understand it, to run the code I need EC2 (which provides the CPU, processors, etc.), and to store, read, and write the data I need a data-storage component. (Sorry for the layman's terms; I'm a newbie trying to learn.)
Anything I need to keep in mind? Links to resources that can help me toward a solution would be appreciated.
Regards,
Eswar

Using AWS for parallel processing with R

I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon Web Services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:
The segue package;
A presentation on parallel web-services using Amazon.
What I don't understand is:
How do I get the data into the ten nodes?
How do I send and execute the R functions on the nodes?
I would be very grateful if you could share suggestions and hints to point me in the right direction.
PS I am using the free usage account on AWS but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).
You can build everything up manually on AWS. You have to build your own Amazon compute cluster with several instances. There is a nice tutorial video available on the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw
But it will take you several hours to get everything running:
starting 11 EC2 instances (one instance for each group, plus one head instance)
installing R and MPI on all machines (check for preinstalled images)
configuring MPI correctly (probably adding a security layer)
ideally, a file server that is mounted on all nodes (to share data)
With this infrastructure, the best solution is to use the snow or foreach package (with Rmpi).
The segue package is nice but you will definitely get data communication problems!
The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform provides easy access to compute clusters in the cloud. You can test a small cluster for 5 hours for free! Check the slides from the useR! conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference
I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)
A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:
How many customers are in the training data?
How big is the training data (what you will send to AWS)?
What's the expected average run time to fit a model to one customer... all runs?
When you fit your model to one customer, how much data is generated (what you will return from AWS)?
Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may, or may not, be a good tool to help you solve.
In answer to your specific question about how to get the data onto AWS, Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, and then using the Elastic MapReduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose), and Segue takes care of the rest. Keep in mind, though, that the very first thing Segue does when you call emrlapply() is to serialize (sometimes slowly) and upload your data to S3, so depending on the size of the data and your internet connection's upload speed, this can be slow. I take issue with Markus's assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300 MB/1 GB with some regularity. But I tend to run these simulations from an AWS instance, so I am sending and receiving from one AWS rack to another, which makes everything much faster.
If you're wanting to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install/build much. To upload data to your running machine, once you set up your ssh certificates, you can use scp to upload files to your instance.
I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.

Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB

I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."
Current Scenario:
Extract the client's Google Analytics data and insert it into the database.
On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.
When the API receives a product ID as the request, it looks in the database, retrieves the recommended product IDs (using association rules), and sends them back as the response.
The list of these product IDs is then processed on the client side to get the product details (image, price, etc.) and displayed on the website.
Currently I am using PHP and MySQL with the gapi package and a REST API, with storage on Amazon EC2.
My Question is:
Now, if I have to choose amongst the following, which would be the best choice to implement the above-mentioned concept?
PHP with SimpleDB or BigQuery.
R with BigQuery.
RHIPE (R and Hadoop) with SimpleDB.
Apache Mahout.
Please help!
This isn't so easy to answer, because the constraints are fairly specialized.
The following considerations can be made, though:
BigQuery is not yet public. Thus, with a small user base, even if you are in the preview population, it will be harder to get advice on improvement.
Each of your options pairs a modeling system with a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view on the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.
As a result, this eliminates the 1st, 2nd, and 4th options.
What I don't quite get is the need for a real-time server to utilize Hadoop and RHIPE. That should be done in your batch processing for developing the recommendation models, not in real-time. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.
(Update 1) Other interface options include RServe (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.
(Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass to R or, if you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB - I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes already prepared with packages loaded and data in RAM; thus you would only need to pass whatever data needs to be used for prediction. If your new data is a small vector then RApache should be fine, and it seems this is correct for the data being processed in real-time.
If you want a real-time API for recommendations based on data in a database, Apache Mahout does this directly. You want to use ReloadFromJDBCDataModel, put a GenericItemBasedRecommender on top of it, and use the servlet-based wrapper in the examples module. It's probably a day or two of work to get familiar with the code and customize it to your needs, but it's pretty simple.
When you get past about 100M data points, you would need to look at distributing the computation with Hadoop. That's a fair bit more complex. Mahout has a distributed recommender too, which you can customize.
