Currently, I have Python code that builds machine learning models. The data for these models comes from a local SQLite database (my client provides the data in an S3 bucket; I download it to my machine and push it to the SQLite database). At a very high level, these are the 3 steps I perform on my machine:
Download the data from S3 and load to SQLite
Connect to SQLite using python and perform data cleaning, aggregation, and model building in python
Write the results again to the SQLite
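As a rough sketch of the three steps above (an illustration only, not the actual pipeline): the snippet below loads CSV text into SQLite, aggregates it, and writes the result back. The table names `raw_sales` and `results`, the column names, and the sample data are all hypothetical. The S3 download is left out, the CSV text stands in for a downloaded file, and the aggregation stands in for the cleaning and model-building step. It is written for Python 3, so it would need small changes under Python 2.7.

```python
import csv
import io
import sqlite3


def load_rows(conn, csv_text):
    """Step 1: load CSV text (standing in for a file fetched from S3) into SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_sales (customer TEXT, amount REAL)")
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["customer"], float(r["amount"])) for r in reader]
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def aggregate_and_store(conn):
    """Steps 2 and 3: aggregate per customer and write the result back to SQLite."""
    conn.execute("DROP TABLE IF EXISTS results")
    conn.execute(
        "CREATE TABLE results AS "
        "SELECT customer, SUM(amount) AS total, COUNT(*) AS n "
        "FROM raw_sales GROUP BY customer"
    )
    conn.commit()
    query = "SELECT customer, total, n FROM results ORDER BY customer"
    return conn.execute(query).fetchall()


# Hypothetical sample data; an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
loaded = load_rows(conn, "customer,amount\na,1.5\nb,2.0\na,3.5\n")
results = aggregate_and_store(conn)
```

Because nothing here depends on the machine it runs on, the same script could in principle run on a cloud instance with only the database path and the download step changed.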
Our client has asked us to provide specifications for setting up an Amazon server so that we can run all these processes every day as an application at the click of a button. We plan to provide this information after implementing the end-to-end steps above in our own AWS account. I have no prior experience in setting up AWS or databases, but I want to learn more. These are the questions I have:
Can the above process be replicated on AWS? I use Python 2.7 and a SQLite db.
We don't use any relational features in the SQLite db while reading or writing data (PK constraints, etc.). So is it better to read and write directly from the S3 bucket?
What are the different components on AWS that I need? As I understand it, for running the code I need EC2 (which provides CPU, processors, etc.) and for storing, reading, and writing the data I need a data storage component. (Sorry for the layman's terms; I'm a newbie and trying to learn.)
Is there anything I need to keep in mind? Links to resources that can help me with the solution would be appreciated.
Regards,
Eswar
TL;DR: I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that copies my (entire) BigQuery DB over to MongoDB, and then use mongoose to write a Node/Express API against MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database, with a react/node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-bigquery library to create a node API that is directly connected to BigQuery. My app would change from MERN stack to (BQ)ERN stack. I would have to re-write the node/express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database if I am looking for a single entry, since it's not meant to be used like Mongo or a SQL database (it has no indexes, so a one-row retrieval runs slowly as a full table scan). Most of my API calls are for very little data from the database.
I am not sure which approach is best. I don't know if having 2 databases for 1 web application is bad practice. I don't know if it's possible to do (1) with the daily transfers from one db to the other, and I don't know how slow BigQuery will be if I use it directly with my API. I think that if it is easy to add (1) to my data engineering workflow, it is the preferred option, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a python script that queries tables from BigQuery, transforms, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), however this is much easier to handle than writing a whole new node/bigquery API.
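A minimal sketch of what the incremental part of such a script might look like, assuming each row carries an `updated_at` timestamp (an assumption, the real tables may track changes differently). To stay self-contained it uses a plain list in place of the BigQuery query result and a plain dict in place of the Mongo collection; all function and field names are hypothetical.

```python
from datetime import datetime


def rows_changed_since(rows, watermark):
    """Keep only rows modified after the last successful sync."""
    return [r for r in rows if r["updated_at"] > watermark]


def to_document(row):
    """Transform a flat warehouse row into the document shape the API expects."""
    return {"_id": row["id"], "name": row["name"], "updated_at": row["updated_at"]}


def sync(rows, collection, watermark):
    """Upsert changed rows into `collection` (a dict keyed by _id).

    Returns the new watermark to store for the next run.
    """
    changed = rows_changed_since(rows, watermark)
    for row in changed:
        doc = to_document(row)
        collection[doc["_id"]] = doc  # upsert: insert or overwrite
    return max([r["updated_at"] for r in changed], default=watermark)


# Hypothetical data: only row 2 changed after the previous sync.
rows = [
    {"id": 1, "name": "a", "updated_at": datetime(2020, 1, 1)},
    {"id": 2, "name": "b", "updated_at": datetime(2020, 1, 3)},
]
collection = {}
new_watermark = sync(rows, collection, datetime(2020, 1, 2))
```

Storing the returned watermark between runs is what keeps the daily job from rewriting every collection in full.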
FWIW, in a past life I worked on an e-commerce web site that had 4 different DB back ends (Mongo, MySQL, Redis, ElasticSearch), so more than 1 is not an issue at all. But you need to treat one as the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and ElasticSearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Having MySQL and Mongo at the same time was a bit odd, in that we were doing a slow-roll migration. This means various record types were being transitioned from MySQL over to Mongo. This process looked a bit like:
- ORM layer writes to both mysql and mongo, reads still come from MySql.
- data is regularly compared.
- a few months elapse with no irregularities and writes to MySql are turned off and reads are moved to Mongo.
The end goal was no more MySql, everything was Mongo. I ran down that tangent because it seems like you could do similar - write to both DB's in whatever DB abstraction layer you used ( ORM, DAO, other things I don't keep up to date with etc.) and eventually move the reads as appropriate to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing was reached before sending it.
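To make the dual-write idea concrete, here is a small sketch of such an abstraction layer with buffering. It is an illustration under simplifying assumptions, not the code we actually ran: both stores are plain Python lists, and the class and function names are hypothetical.

```python
class BufferedDualWriter:
    """Buffers writes and sends each batch to both stores; reads use the primary."""

    def __init__(self, primary, secondary, batch_size=3):
        self.primary = primary      # the DB of record
        self.secondary = secondary  # the store being migrated to
        self.batch_size = batch_size
        self._buffer = []

    def write(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self.primary.extend(batch)
        self.secondary.extend(batch)

    def read_all(self):
        # Reads stay on the primary until the comparison has been clean for a while.
        return list(self.primary)


def find_mismatches(primary, secondary):
    """The regular comparison step: records present in one store but not the other."""
    missing_in_secondary = [r for r in primary if r not in secondary]
    missing_in_primary = [r for r in secondary if r not in primary]
    return missing_in_secondary + missing_in_primary


primary, secondary = [], []
writer = BufferedDualWriter(primary, secondary, batch_size=2)
for record in ({"id": 1}, {"id": 2}, {"id": 3}):
    writer.write(record)
flushed_before = len(primary)  # the third record is still buffered
writer.flush()
mismatches = find_mismatches(primary, secondary)
```

A real version would also have to decide what happens when one of the two writes fails, which this sketch ignores.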
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is the data is always stale by a day.
Scenario:
I've got a semi-structured dataset in JSON format. I'm storing the 3 subsets (new_records, updated_records, and deleted_records) from the dataset in 3 different Amazon DynamoDB tables. The tables are scheduled to truncate and load daily.
I'm trying to create a mapping, to source data from these DynamoDB tables, append a few metadata columns (date_created, date_modified, is_active) and consolidate the data in a master DynamoDB table
Issues and Challenges:
I tried AWS Glue - Created Data Catalogue for source tables using Crawler. I understand AWS Glue doesn't provide provisions to store data in DynamoDB, so I changed the target to Amazon S3. However, the AWS Glue job results in creating some sort of reduced form of the data (parquet objects) in my Amazon S3 bucket. I've limited experience with PySpark, Pig, and Hive, so excuse me if I'm unable to explain clearly.
Some quick research on Google suggested reading the parquet objects on Amazon S3 using Amazon Athena or Redshift Spectrum.
I'm not sure, but this looks like overkill, doesn't it?
I read about Amazon Data Pipelines, which offers to quickly transfer data between different AWS services. However, I'm not sure whether it provides some mechanism to create mappings between source and target (in order to append additional columns), or whether it simply dumps data from one service to another.
Can anyone hint at a lucid and minimalistic solution?
-- Update --
I've been able to consolidate the data from Amazon DynamoDB to Amazon Redshift using AWS Glue, which turned out to be actually quite simple.
However, Amazon Redshift has a few characteristic issues: its relational nature and its inability to directly perform a single merge or upsert to update a table are the major things I'm considering here.
I'm wondering whether Amazon Elasticsearch can be used here to index and consolidate the data from Amazon DynamoDB.
I'm not sure about your needs and assumptions. But let me post my thoughts that may help!
Why are you planning to do this migration? Think about this carefully.
Moving from 3 tables to 1 table, table size should not be an issue with DynamoDB. But think about read/write capacity units.
Athena is a good option: you write SQL to query your data, pay based on the data scanned by your query, ... But Athena has a 30-minute query timeout. (I think you can request an increase for that, not sure!)
I think it is worth trying Data Pipelines. Yes, you can process the data while moving it.
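Whichever service does the moving, the mapping itself is small. Here is a sketch of the consolidation logic in plain Python, assuming each record has an `id` key and that deletes are handled as soft deletes via `is_active` (both are assumptions about your data). The master table is modelled as a dict, and the function name is hypothetical.

```python
def consolidate(master, new_records, updated_records, deleted_records, today):
    """Apply the three daily subsets to `master`, a dict keyed by record id."""
    for rec in new_records:
        master[rec["id"]] = dict(
            rec, date_created=today, date_modified=today, is_active=True
        )
    for rec in updated_records:
        # Merge onto the existing item so date_created is preserved.
        merged = dict(master.get(rec["id"], {}), **rec)
        merged.setdefault("date_created", today)
        merged.setdefault("is_active", True)
        merged["date_modified"] = today
        master[rec["id"]] = merged
    for rec in deleted_records:
        if rec["id"] in master:
            # Soft delete: keep the item but flag it inactive.
            master[rec["id"]].update(date_modified=today, is_active=False)
    return master


# Hypothetical two-day run.
master = {}
consolidate(master, [{"id": "a", "v": 1}, {"id": "b", "v": 2}], [], [], "2020-01-01")
consolidate(master, [], [{"id": "a", "v": 10}], [{"id": "b"}], "2020-01-02")
```

The same three loops map onto put and update calls against the master table, so the read/write capacity point above still applies.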
I'm thinking about learning JanusGraph to use in my new big project, but I can't understand some things.
JanusGraph can be used like any database and supports "insert", "update", and "delete" operations, so JanusGraph will write data into Cassandra or another database to store it, right?
Where does JanusGraph store the nodes, edges, attributes, etc.? It will write these into the database, right?
Will this data be loaded into memory by JanusGraph, or will it be read from Cassandra all the time?
Must the data that JanusGraph reads be loaded into JanusGraph on every query, or will it do selects in the database to retrieve the data I need?
Is the data retrieved from the database only what I need, or will JanusGraph read all records in the database all the time?
Should I use JanusGraph in my project in production or should I wait until it becomes production ready?
I'm developing a kind of social network that needs to store friendships, posts, comments, and user blocks, and to do some Elasticsearch searching too. In this case, what database backend should I use?
Janus will write data into Cassandra or other database to store these data, right?
Where Janus store the Nodes, Edges, Attributes etc, it will write these into database, right?
JanusGraph will write the data into whatever storage backend you configure it to use. This includes Cassandra. It writes this data into the underlying database using the data model roughly outlined here
These data should be loaded in memory by Janus or will be read from Cassandra all the time?
The data retrieved in database is only what I need or Janus will read all records in database all the time?
JanusGraph will only load into memory the vertices and edges which you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
Janus will read and load into memory only the vertices with that label. So you don't need to worry about initializing a graph connection and then waiting for the entire graph to be serialised into memory before you can query. Janus is a lazy reader.
Should I use Janus in my project in production or should I wait until it becomes production ready?
That is entirely up to you and your use case. Janus is being used in production already, as can be seen here at the bottom of the page. Janus was forked from and improved on TitanDB, which is also used in several production use cases. So if you are wondering "is it ready?" then I would say yes, it's clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Bigtable and that seems very powerful as well. However, it's only really suited for VERY big data and it's also only available in the cloud, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends though (all you need to do is adjust some configs and check your dependencies are in place), so during development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more sure of each backend.
When considering what storage backend to use for a new project it's important to consider what tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational dbs
Not needing to migrate schemas increases productivity when rapidly iterating on a new project
Traversing a heavily normalized data-model is not as expensive as with JOINs in an RDBMS
Most include in-memory configurations which are great for experimenting & testing.
Support for multi-machine clusters and Partition Tolerance.
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.
First of all, I must make clear that I am a newbie, and I apologize if I don't use the correct terminology in my question.
This is my scenario:
I need to analyze large quantities of text like tweets, comments, mails, etc. The data is currently inserted into an Amazon RDS MySQL instance as it occurs.
Later I run an R job locally using RTextTools (http://www.rtexttools.com/) over that data to output my desired results. At this point it might be important to make clear that the R script analyzes the data and writes data back into the MySQL table, which will later be used to display it.
The issue I am having lately is that the job takes about 1 hour each time I run it and I need to do it at least 2 times a day...so using my local computer is not an option anymore.
Looking for alternatives, I started to read about Amazon Elastic MapReduce, which at first sight seems to be what I need, but this is where my questions and confusion start.
I read that data for EMR should be pulled from an S3 bucket. If that's the case, then I must start storing my data as JSON or similar within an S3 bucket and not in my RDS instance, right?
At this point I read it is a good idea to create HIVE tables and then use RHive to read the data in order for RTextTools to do its job and write the results back to my RDS tables, is this right?
And now the final and most important question: is taking all this trouble worth it vs. running an EC2 instance with R and running my R scripts there? Will I reduce computing time?
Thanks a lot for your time; any tip in the right direction will be much appreciated.
Interesting. I would like to suggest a few things.
You can totally store data in S3, but S3 stores objects rather than rows, so you will first have to serialize your data into a file or object (txt, JSON, etc.) and then push it to S3. You can probably get the benefit of CloudFront deployed over S3 for fast retrieval of data. You can also use RDS. You will have to analyze the performance difference yourself.
Writing results back to RDS shouldn't be an issue. EMR basically creates two EC2 instances, ElasticMapReduce-master and ElasticMapReduce-slave, which can be used to communicate with RDS.
I think it's worth trying an EC2 instance with R, but to reduce the computation time you might have to go with an expensive EC2 instance, or set up autoscaling and divide the task between different instances. That amounts to implementing the whole parallel computation logic yourself, whereas with EMR you get the MapReduce logic built in. So first try EMR, and if it doesn't work out well for you, try a new EC2 instance with R.
Let me know how it goes, thank you.
You should consider trying EMR. S3+EMR is very much worth trying out if the 1-hour window is a constraint. For your type of processing workload, you might save cycles by using a scalable, on-demand Hadoop/Hive platform. Obviously, there are some learning, re-platforming, and ongoing cluster management costs related to the trial and switch, and they are non-trivial. Alternatively, consider services such as Qubole, which also runs on EC2+S3 and provides higher-level (and potentially easier to use) abstractions.
Disclaimer: I am a product manager at Qubole.
I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon web-services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:
The segue package;
A presentation on parallel web-services using Amazon.
What I don't understand is:
How do I get the data into the ten nodes?
How do I send and execute the R functions on the nodes?
I would be very grateful if you could share suggestions and hints to point me in the right direction.
PS I am using the free usage account on AWS but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).
You can build up everything manually at AWS. You have to build your own amazon computer cluster with several instances. There is a nice tutorial video available at the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw
But it will take you several hours to get everything running:
starting 11 EC2 instances (for every group one instance + one head instance)
R and MPI on all machines (check for preinstalled images)
configuring MPI correctly (probably add a security layer)
in best case a file server which will be mounted to all nodes (share data)
with this infrastructure the best solution is the use of the snow or foreach package (with Rmpi)
The segue package is nice but you will definitely get data communication problems!
The simplest solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform provides you with easy access to computer clusters in the cloud. You can test 5 hours for free with a small computer cluster in the cloud! Check the slides from the useR conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference
I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)
A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:
How many customers are in the training data?
How big is the training data (what you will send to AWS)?
What's the expected average run time to fit a model to one customer... all runs?
When you fit your model to one customer, how much data is generated (what you will return from AWS)?
Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may, or may not, be a good tool to help you solve.
In answer to your specific question about how to get the data onto AWS, Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, then using the Elastic Map Reduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose) and Segue takes care of the rest. But keep in mind that the very first thing Segue does when you call emrlapply() is to serialize (sometimes slowly) and upload your data to S3. So depending on the size of the data and the upload speed of your internet connection, this can be slow. I take issue with Markus' assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300MB/1GB with some regularity. But I tend to run these simulations from an AWS instance so I am sending and receiving from one AWS rack to another, which makes everything much faster.
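To show the shape of the "list plus function" pattern without any AWS setup, here is a local stand-in written in Python rather than R. It is not Segue and not emrlapply(); the names `parallel_lapply`, `split_into_groups`, and `fit_mean_model` are hypothetical, and the "model" is just a mean. It uses threads only to keep the sketch simple, so it illustrates the interface rather than any real speed-up for CPU-bound fitting.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_groups(items, n_groups):
    """Deal items round-robin into n_groups lists of near-equal size."""
    return [items[i::n_groups] for i in range(n_groups)]


def parallel_lapply(data_list, fn, max_workers=10):
    """Apply fn to each element of data_list concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, data_list))


def fit_mean_model(shopper_rows):
    """Toy stand-in for a per-customer model: the mean spend of one shopper."""
    return sum(shopper_rows) / len(shopper_rows)


# Hypothetical data: one list of spend values per shopper.
shoppers = [[1.0, 3.0], [2.0], [4.0, 6.0]]
models = parallel_lapply(shoppers, fit_mean_model)

# Splitting 25 customer ids into ten groups, as in the question.
groups = split_into_groups(list(range(25)), 10)
```

The point is that the caller only supplies the list and the function; where the work runs is hidden behind the apply call.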
If you're wanting to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install/build much. To upload data to your running machine, once you set up your ssh certificates, you can use scp to upload files to your instance.
I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.