Export DynamoDB table to S3 with client-side encryption

I'm trying to use Data Pipeline to export data from DynamoDB to S3. However, I can't figure out how to apply client-side encryption before the file is written to S3. Is there a way to do this with Data Pipeline? I can set up everything except the client-side encryption. The ideal flow is a DynamoDB source node, an activity to encrypt, and an S3 destination node.
I also looked at Elastic MapReduce, but I don't see what a mapper and reducer would do here since I'm not transforming any data, only moving it to an encrypted file on S3. I should be able to use EMR with a Hive program, but I'm struggling to understand how to use EMR without writing custom map/reduce code. Ideally, no code is stored in S3.
Server-side encryption isn't an option; the data needs to be encrypted before it is written to S3.
I'm looking for ideas on how to do this, or for anyone who has faced a similar challenge.

The current Data Pipeline solution doesn't support hooks for custom pre- or post-processing.
How large is your table? How long is acceptable for the export process to complete?
It should be possible to do this with DynamoDB parallel scan: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan
Essentially, you would write a program that uses multiple threads to process the segments of a parallel scan, performs the encryption, and stores the encrypted items in S3. Each DynamoDB scan page returns up to ~1 MB of data, so you could aggregate multiple pages before publishing to S3.
To restore the data, you would load the S3 files, decrypt, and then write back to DynamoDB.
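A rough sketch of that approach in Python, assuming boto3, a Fernet key from the cryptography package as a stand-in for whatever client-side cipher you actually use, and hypothetical table/bucket names:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3
from cryptography.fernet import Fernet  # stand-in for your actual client-side cipher

TABLE_NAME = "my-table"            # hypothetical
BUCKET = "my-encrypted-exports"    # hypothetical
TOTAL_SEGMENTS = 4                 # tune to table size / desired parallelism

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")
fernet = Fernet(Fernet.generate_key())  # in practice, load and manage this key securely

def export_segment(segment):
    """Scan one parallel-scan segment, encrypt the aggregated items, write one S3 object."""
    items = []
    kwargs = {"TableName": TABLE_NAME, "Segment": segment, "TotalSegments": TOTAL_SEGMENTS}
    while True:
        page = dynamodb.scan(**kwargs)          # each page is up to ~1 MB of items
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    ciphertext = fernet.encrypt(json.dumps(items).encode("utf-8"))  # encrypt client-side
    s3.put_object(Bucket=BUCKET, Key=f"export/segment-{segment}.enc", Body=ciphertext)

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as pool:
    list(pool.map(export_segment, range(TOTAL_SEGMENTS)))
```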

If this is acceptable for your use case, you can do client-side encryption before writing your data to DynamoDB. You could then use Data Pipeline to export the already-encrypted data to S3.
I have a similar setup for my application using a client-side encryption library provided by aws-labs. We export the tables daily to keep backups. Restoring the data works as long as the encryption metadata is exported with it.
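Not the aws-labs library itself, but as a minimal illustration of the idea (encrypt the sensitive attributes client-side, then write the item as usual, so any later export only ever contains ciphertext), assuming boto3 and the cryptography package, with hypothetical names:

```python
import boto3
from cryptography.fernet import Fernet  # illustrative cipher, not the aws-labs client

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name
fernet = Fernet(Fernet.generate_key())                # in practice, manage this key properly

def put_encrypted(item_id, payload):
    """Encrypt the sensitive attribute before the item ever leaves the client."""
    table.put_item(Item={
        "id": item_id,
        "payload": fernet.encrypt(payload.encode("utf-8")),  # stored as binary ciphertext
    })

put_encrypted("order-1", "sensitive value")
# A normal Data Pipeline export of this table now only ever writes ciphertext to S3.
```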

Related

Possible to create pipeline that writes an SQL database to MongoDB daily?

TL;DR: I'd like to combine the power of BigQuery with my MERN-stack application. Is it better to (a) use nodejs-bigquery to write a Node/Express API directly against BigQuery, or (b) create a daily job that writes my (entire) BigQuery DB over to MongoDB, and then use Mongoose to write a Node/Express API against MongoDB?
I need to determine the best approach for combining a data ETL workflow that creates a BigQuery database with a React/Node web application. The data ETL uses Airflow to create a workflow that (a) backs up daily data into GCS, (b) writes that data to the BigQuery database, and (c) runs a bunch of SQL to create additional tables in BigQuery. It seems to me that my only two options are to:
Do a daily write/convert/transfer/migrate (whatever the correct verb is) from BigQuery database to MongoDB. I already have a node/express API written using mongoose, connected to a MongoDB cluster, and this approach would allow me to keep that API.
Use the nodejs-bigquery library to create a Node API that is directly connected to BigQuery. My app would change from a MERN stack to a (BQ)ERN stack. I would have to rewrite the Node/Express API to work with BigQuery, but I would no longer need MongoDB (nor have to transfer data daily from BigQuery to Mongo). However, BigQuery can be a very slow database when looking up a single entry, since it's not meant to be used like Mongo or a SQL database (it has no indexes, so retrieving one row runs as a full table scan). Most of my API calls fetch very little data from the database.
I am not sure which approach is best. I don't know whether having 2 databases for 1 web application is bad practice. I don't know if it's possible to do (1) with daily transfers from one DB to the other, and I don't know how slow BigQuery will be if I use it directly from my API. I think that if it's easy to add (1) to my data engineering workflow, it's the preferred option, but again, I am not sure.
I am going with (1). It shouldn't be too much work to write a Python script that queries tables from BigQuery, transforms the rows, and writes collections to Mongo. There are some things to handle (incremental changes, etc.), but this is much easier than writing a whole new Node/BigQuery API.
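For what it's worth, a minimal sketch of such a script, assuming the google-cloud-bigquery and pymongo packages, hypothetical project/dataset/collection names, and a full-refresh strategy rather than incremental changes:

```python
from google.cloud import bigquery   # pip install google-cloud-bigquery
from pymongo import MongoClient     # pip install pymongo

bq = bigquery.Client()                                  # uses application default credentials
mongo = MongoClient("mongodb://localhost:27017")        # hypothetical connection string
collection = mongo["appdb"]["daily_snapshot"]           # hypothetical db / collection names

# Pull the finished BigQuery table and replace the Mongo collection wholesale.
rows = bq.query("SELECT * FROM `my_project.my_dataset.my_table`").result()
docs = [dict(row) for row in rows]   # note: DATE/NUMERIC values may need converting for Mongo

collection.delete_many({})           # full refresh; incremental logic is left out for brevity
if docs:
    collection.insert_many(docs)
```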
FWIW, in a past life I worked on a web e-commerce site that had 4 different DB back ends (Mongo, MySQL, Redis, Elasticsearch), so more than 1 is not an issue at all, but you need to treat one as the DB of record, i.e. if anything does not match between them, one is the source of truth and the other is suspect. In my example, Redis and Elasticsearch were nearly ephemeral: blow them away and they get recreated from the underlying MySQL and Mongo sources. Having MySQL and Mongo at the same time was a bit odd, and that was because we were doing a slow-roll migration, meaning various record types were being transitioned from MySQL over to Mongo. The process looked a bit like this:
- The ORM layer writes to both MySQL and Mongo; reads still come from MySQL.
- Data is regularly compared.
- A few months elapse with no irregularities, then writes to MySQL are turned off and reads are moved to Mongo.
The end goal was no more MySQL; everything was Mongo. I ran down that tangent because it seems like you could do something similar: write to both DBs in whatever DB abstraction layer you use (ORM, DAO, other things I don't keep up to date with, etc.) and eventually move the reads, as appropriate, to wherever they need to go. If you need large batches for writes, you could buffer at that abstraction layer until a threshold of your choosing is reached before sending.
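A very rough sketch of what that dual-write abstraction layer can look like, assuming SQLAlchemy for MySQL and pymongo for Mongo, with hypothetical connection strings and schema:

```python
from pymongo import MongoClient
from sqlalchemy import create_engine, text

mysql = create_engine("mysql+pymysql://user:pass@localhost/shop")          # hypothetical DSN
mongo_orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]  # hypothetical

class OrderDAO:
    """Writes go to both stores; reads stay on MySQL (the DB of record) until cut-over."""

    def save(self, order_id: int, total: float) -> None:
        with mysql.begin() as conn:   # MySQL remains the source of truth
            conn.execute(
                text("REPLACE INTO orders (id, total) VALUES (:id, :total)"),
                {"id": order_id, "total": total},
            )
        mongo_orders.replace_one(     # shadow write to Mongo for later comparison
            {"_id": order_id}, {"_id": order_id, "total": total}, upsert=True
        )

    def load(self, order_id: int):
        with mysql.connect() as conn:
            return conn.execute(
                text("SELECT id, total FROM orders WHERE id = :id"), {"id": order_id}
            ).fetchone()
```

Flipping load() to read from Mongo later is then a small change behind the same interface.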
With all that said, depending on your data complexity, a nightly ETL job would be completely doable as well, but you do run into the extra complexity of managing and monitoring that additional process. Another potential downside is that the data is always stale by a day.

Trying to consolidate multiple Amazon DynamoDB tables into one

Scenario:
I've got a semi-structured dataset in JSON format. I'm storing the 3 subsets (new_records, updated_records, and deleted_records) from the dataset in 3 different Amazon DynamoDB tables, scheduled to truncate and load daily.
I'm trying to create a mapping to source data from these DynamoDB tables, append a few metadata columns (date_created, date_modified, is_active), and consolidate the data into a master DynamoDB table.
Issues and Challenges:
I tried AWS Glue: I created a Data Catalog for the source tables using a Crawler. I understand AWS Glue doesn't provide a way to write data to DynamoDB, so I changed the target to Amazon S3. However, the AWS Glue job ends up creating some sort of reduced form of the data (Parquet objects) in my Amazon S3 bucket. I've limited experience with PySpark, Pig, and Hive, so excuse me if I'm unable to explain this clearly.
Quick research on Google suggested reading the Parquet objects on Amazon S3 using Amazon Athena or Redshift Spectrum.
I'm not sure, but this looks like overkill, doesn't it?
I read about AWS Data Pipeline, which can quickly transfer data between different AWS services, although I'm not sure whether it provides a mechanism to create mappings between source and target (in order to append additional columns) or whether it just dumps data straight from one service to another.
Can anyone hint at a lucid and minimalistic solution?
-- Update --
I've been able to consolidate the data from Amazon DynamoDB to Amazon Redshift using AWS Glue, which turned out to be actually quite simple.
However, with Amazon Redshift there are a few characteristic issues: its relational nature, and its inability to directly perform a single merge or upsert to update a table, are the major things I'm weighing here.
I'm considering if Amazon ElasticSearch can be used here, to index and consolidate the data from Amazon DynamoDB.
I'm not sure about your needs and assumptions, but let me post some thoughts that may help!
Why are you planning to do this migration? Think about this carefully.
Moving from 3 tables to 1 table: table size should not be an issue with DynamoDB, but think about read/write capacity units.
Athena is a good option: you write SQL to query your data and pay based on the data scanned by your query. But Athena has a 30-minute query timeout. (I think you can request an increase for that, but I'm not sure!)
I think it is worth trying Data Pipeline. Yes, you can process the data while moving it.
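If you do end up doing the consolidation in plain code rather than through Data Pipeline or Glue, a minimal boto3 sketch of the scan-append-write loop might look like this (the table names are hypothetical, and the metadata handling is simplified):

```python
import datetime
import boto3

dynamodb = boto3.resource("dynamodb")
master = dynamodb.Table("master_records")        # hypothetical master table
sources = {                                      # the three daily-truncated source tables
    "new_records": {"is_active": True},
    "updated_records": {"is_active": True},
    "deleted_records": {"is_active": False},
}
now = datetime.datetime.utcnow().isoformat()

for table_name, flags in sources.items():
    table = dynamodb.Table(table_name)
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        with master.batch_writer() as batch:
            for item in page["Items"]:
                # Append the metadata columns; real logic would preserve the original
                # date_created for updated records instead of overwriting it.
                item.update({"date_created": now, "date_modified": now, **flags})
                batch.put_item(Item=item)
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```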

Setting up of Amazon Web Services (AWS) Database and EC2

Currently, I have Python code that builds machine learning models. The data for these models comes from a local SQLite database (my client provides the data to us in an S3 bucket; I download it to my machine and push it into the SQLite database). At a very high level, these are the 3 steps I perform on my machine:
Download the data from S3 and load to SQLite
Connect to SQLite from Python and perform data cleaning, aggregation, and model building
Write the results back to SQLite
Our client has asked us to provide specifications for setting up an Amazon server so that we can run all these processes every day, as an application, at the click of a button. We planned on providing all the information after implementing the above-mentioned end-to-end steps using our own AWS account. I have no prior experience in setting up AWS or databases, but I want to learn more. These are the questions I have:
Can the above process be replicated on AWS? I use Python 2.7 and a SQLite DB.
We don't use any relational features of the SQLite DB while reading or writing data (like PK constraints, etc.), so is it better to read and write directly from an S3 bucket?
What are the different components on AWS that I need? As I understand it, for running the code I need EC2 (which provides CPU, processors, etc.), and for storing, reading, and writing the data I need some data storage component. (Sorry for the layman's terms; I'm a newbie and trying to learn.)
Anything I need to keep in mind? Links to resources that can help me find a solution would be appreciated.
Regards,
Eswar
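As a rough illustration, steps 1 and 3 map onto an EC2 instance with essentially no changes; here is a sketch assuming boto3 and hypothetical bucket, key, and table names, with the cleaning/aggregation/model-building code left exactly as it runs locally:

```python
import sqlite3
import boto3

s3 = boto3.client("s3")
BUCKET = "client-data-bucket"          # hypothetical bucket name

# Step 1: download the client's file from S3 and load it into SQLite.
s3.download_file(BUCKET, "incoming/data.csv", "/tmp/data.csv")
conn = sqlite3.connect("/tmp/models.db")
conn.execute("CREATE TABLE IF NOT EXISTS raw_data (line TEXT)")
with open("/tmp/data.csv") as f:
    conn.executemany("INSERT INTO raw_data (line) VALUES (?)",
                     ((line.rstrip(),) for line in f))
conn.commit()

# Step 2: ... data cleaning, aggregation, and model building happen here, unchanged ...

# Step 3: write results back to SQLite, then push the DB file to S3 as a backup.
conn.execute("CREATE TABLE IF NOT EXISTS results (metric TEXT, value REAL)")
conn.execute("INSERT INTO results VALUES (?, ?)", ("accuracy", 0.87))
conn.commit()
conn.close()
s3.upload_file("/tmp/models.db", BUCKET, "outputs/models.db")
```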

Dynamodb streams in python

I would like to read data from a DynamoDB stream in Python, and the alternatives I have found so far are:
Use the DynamoDB Streams low-level library functions (as described here): this solution, however, seems almost impossible to maintain in a production environment, with the application having to track the status of shards, etc.
Use the KCL library designed for reading Kinesis streams: the Python version of the library seems unable to read from a DynamoDB stream.
What are the options to successfully process dynamodb streams in python? (links to possible examples would be super helpful)
PS: I have considered using a Lambda function to process the DynamoDB stream, but for this task I would like to read the stream in an application, as it has to interact with other components in a way that cannot be done from a Lambda function.
I would still suggest using Lambda. The setup is very easy as well as very robust (it's easy to manage retries, batching, downtime, ...).
Then, from your Lambda invocation, you can forward the data to your existing application in whatever way is convenient (including, but not limited to: SNS, SQS, a custom server webhook, or a custom pub/sub service you own).
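A minimal sketch of that pattern, assuming the Lambda is subscribed to the table's stream and forwards each record to an SQS queue that your application already polls (the queue URL is hypothetical):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/dynamo-changes"  # hypothetical

def handler(event, context):
    # event["Records"] is the standard DynamoDB Streams event shape Lambda receives.
    for record in event["Records"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "event": record["eventName"],          # INSERT / MODIFY / REMOVE
                "keys": record["dynamodb"]["Keys"],    # key attributes in DynamoDB JSON
                "new_image": record["dynamodb"].get("NewImage"),  # present if the stream view includes it
            }),
        )
    return {"forwarded": len(event["Records"])}
```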

Bi-Directional Sync on Android Using SyncAdapter

I am planning to create a SQLite table in my Android app. The data comes from the server via a web service.
I would like to know what is the best way to do this.
Should I transfer the data from the web service as a SQLite DB file and merge it, should I get all the data as a SOAP request and parse it into the table, or should I use a REST call?
The data is roughly 2 MB, with 100 columns.
Please advise on the best approach to get this data quickly, with as little load on the device as possible.
My Workflow is:
Download a set of 20000 addresses and save them to the device's SQLite database. This operation happens only once, when the app runs for the first time or when the whole app's data needs to be refreshed.
Update these records whenever there is a change on the server.
I can get this data from the server either as JSON, as XML, or as a raw SQLite file. I want to know the fastest way to store this data in the Android database.
I tried all the above methods and found that getting the database file from the server and copying its data into the database is faster than getting the data as XML or JSON and parsing it. Please advise whether I am right or wrong.
If you are planning to use sync adapters, then you will need to implement a content provider (or at least a stub) and an authenticator. Here is a good example that you can follow.
Also, you have not explained the use case of the web service in enough detail to decide what web-service architecture to suggest. But REST is a good style for writing your services, and using JSON over XML is advisable for data-format efficiency (or, better yet, give Protocol Buffers a shot).
And yes, sync adapters are better to use, as they already provide a great set of features that you would otherwise have to implement yourself in a background service (e.g., periodic sync, auto sync, exponential backoff, etc.).
To reduce the load on the device, you can implement a sync adapter backed by a content provider. You serialize/deserialize data when you upload/download it from the server. When you need to persist data from the server, you can use the bulkInsert() method of the content provider and persist all your data in a single transaction.
