AWS Glue is great for transforming data from a raw form into whichever format you need, and keeping the source and destination data sets synchronized.
However, I have a scenario where data lands in a 'landing area' bucket from untrusted external sources, and the first ETL step needs to be a data validation step which only allows valid data to pass into the data lake, while invalid data is moved to a quarantine bucket for manual inspection.
Invalid data includes:
bad file formats/encodings
unparseable contents
mismatched schemas
even some sanity checks on the data itself
The 'landing area' bucket is not part of the data lake; it is only a temporary dead drop for incoming data, so I need the validation job to delete the files from this bucket once it has moved them to the lake and/or quarantine buckets.
Is this possible with Glue? If the data is deleted from the source bucket, won't Glue end up removing it downstream in a subsequent update?
Am I going to need a different tool (e.g. StreamSets, NiFi, or Step Functions with AWS Batch) for this validation step, and only use Glue once the data is in the lake?
(I know I can set lifecycle rules on the bucket itself to delete the data after a certain time, like 24 hours, but in theory this could delete data before Glue has processed it, e.g. in case of a problem with the Glue job)
Please see purge_s3_path in the docs:
glueContext.purge_s3_path(s3_path, options={}, transformation_ctx="")
Deletes files from the specified Amazon S3 path recursively.
Also, make sure your AWSGlueServiceRole has s3:DeleteObject permissions.
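For example, the validation job could purge the landing path after it has written the validated and quarantined copies. A minimal sketch (the bucket name and prefix are placeholders; retentionPeriod=0 means objects of any age are deleted):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# ... read from the landing path, validate, and write the results to the
# lake and/or quarantine buckets first ...

# retentionPeriod=0 deletes all objects under the path regardless of age
glue_context.purge_s3_path(
    "s3://landing-bucket/incoming/",
    options={"retentionPeriod": 0},
)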
Your Glue environment comes with boto3. You would be better off using the boto3 S3 client/resource to delete the landing files after you've completed processing the data via Glue.
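A rough sketch of that approach (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.resource("s3")

# once the Glue job has written the validated/quarantined copies,
# remove the originals from the landing bucket
landing_bucket = s3.Bucket("landing-bucket")
landing_bucket.objects.filter(Prefix="incoming/").delete()

# or delete individual keys with the client API:
# boto3.client("s3").delete_object(Bucket="landing-bucket", Key="incoming/file-1.csv")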
I would like to take a single DynamoDB table which contains a data field with JSON data. This data has a schema that is dependent on the user associated with the table entry. Let's assume that the schema is part of the entry for simplicity.
I would like to stream this data into S3 as Parquet with embedded schema, transformation (i.e. just sending the data field) and custom file naming based on the user ID.
I am using CDK v2.
What I found so far:
I can go from DynamoDB to a Kinesis Stream to Firehose, but Glue is required - I don't need it, nor am I sure how I would provide it with these various "dynamic" schemas.
CDK asks for the S3 filename - I see there may be the possibility of a dynamic field in the name, but I'm not sure how I would use that (I've seen date, for example - I would need it to be something coming from the transform Lambda).
I think that using a Kinesis stream directly in the DynamoDB config may not be what I want, and that I should just use regular DynamoDB Streams, but then... would I transform the data and pass it to Firehose? Where does file naming etc. come in?
I've read so many docs, but they all seem to deal with a standard table-to-file setup and Athena.
Summary: How can I append streaming DynamoDB data to various Parquet files and transform/determine the file name from a Lambda in the middle? I think I have to go from a DynamoDB Streams Lambda handler and append directly to S3, but I'm not finding many examples. I'm also concerned about buffering, etc.
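For reference, a minimal sketch of the "Streams Lambda handler writing directly to S3" idea described above. The awswrangler library, the userId/data attribute names, and the bucket are assumptions, and each invocation writes a small Parquet file, so buffering/compaction would still need to be handled separately:

import awswrangler as wr
import pandas as pd
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def handler(event, context):
    rows = []
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        # convert the DynamoDB-typed image into plain Python values
        image = {k: deserializer.deserialize(v)
                 for k, v in record["dynamodb"]["NewImage"].items()}
        # assumes the "data" attribute is stored as a DynamoDB map
        rows.append({"user_id": image["userId"], **image["data"]})

    if not rows:
        return

    # one small Parquet object per user per invocation
    for user_id, group in pd.DataFrame(rows).groupby("user_id"):
        wr.s3.to_parquet(
            df=group.drop(columns="user_id"),
            path=f"s3://my-bucket/users/{user_id}/",  # placeholder bucket
            dataset=True,
        )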
I am new to Airflow. I took some courses about it but did not come across any example for my use case. I would like to:
Fetch data from Postgres (usually 1M+ rows, which I assume is too large for XComs)
In some cases process the data if needed (though this can usually be done inside the query itself)
Insert the data into Oracle
I tend to see workflows that export the data to a CSV first (from Postgres) and then load it into the destination database. However, I feel it would be best to do all three tasks in a single PythonOperator (for example, looping with a cursor and bulk inserting), but I'm not sure if this is suitable for Airflow.
Any ideas on possible solutions to this situation? What is the general approach?
As you mentioned, there are several options.
To name a few:
Do everything in one Python task.
Create a pipeline.
Create a custom operator.
All approaches are valid; each one has advantages and disadvantages.
First approach:
You can write a Python function that uses PostgresHook to create a dataframe and then load it into Oracle.
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def all_in_one(**context):
    pg_hook = PostgresHook(postgres_conn_id='postgres_conn')
    df = pg_hook.get_pandas_df('SELECT * FROM table')
    # do some transformation on df as needed and load it to Oracle (e.g. with OracleHook)

op = PythonOperator(task_id='all_in_one_task',
                    python_callable=all_in_one,
                    dag=dag)
Advantages:
Easy coding (for people who are used to writing Python scripts).
Disadvantages:
Not suitable for large transfers, as everything is held in memory.
If you need to backfill or rerun, the entire function is executed, so if there is an issue with loading into Oracle you will still rerun the code that fetches the records from PostgreSQL.
Second approach:
You can implement your own MyPostgresqlToOracleTransfer operator with any logic you wish. This is useful if you want to reuse the functionality in different DAGs.
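A minimal sketch of what such an operator might look like (the constructor parameters and the chunked insert are illustrative, and it assumes the postgres and oracle provider packages are installed):

from airflow.models import BaseOperator
from airflow.providers.oracle.hooks.oracle import OracleHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

class MyPostgresqlToOracleTransfer(BaseOperator):
    """Copy the result of a Postgres query into an Oracle table."""

    def __init__(self, sql, target_table, postgres_conn_id='postgres_conn',
                 oracle_conn_id='oracle_conn', **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.target_table = target_table
        self.postgres_conn_id = postgres_conn_id
        self.oracle_conn_id = oracle_conn_id

    def execute(self, context):
        pg_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        oracle_hook = OracleHook(oracle_conn_id=self.oracle_conn_id)
        rows = pg_hook.get_records(self.sql)
        # insert_rows commits in batches; tune commit_every as needed
        oracle_hook.insert_rows(self.target_table, rows, commit_every=5000)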
Third approach:
Work with files (data-lake style).
The file can be on the local machine if you have only one worker; if not, the file must be uploaded to shared storage (S3, Google Cloud Storage, or any other disk that can be accessed by all workers).
A possible pipeline can be:
PostgreSQLToGcs -> GcsToOracle
Depending on what service you are using, some of the required operators may already be implemented by Airflow.
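For illustration, a sketch of that pipeline with the Google provider (assuming Airflow 2.4+; the DAG id, bucket, and file names are placeholders; PostgresToGCSOperator ships with the Google provider, while the GCS-to-Oracle step would typically be a custom operator):

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from pendulum import datetime

with DAG("pg_to_oracle_via_gcs", start_date=datetime(2023, 1, 1), schedule=None) as dag:
    export = PostgresToGCSOperator(
        task_id="postgres_to_gcs",
        postgres_conn_id="postgres_conn",
        sql="SELECT * FROM table",
        bucket="my-staging-bucket",            # placeholder bucket
        filename="exports/table_{{ ds }}.csv",
        export_format="csv",
    )
    # the GCS -> Oracle task would be a custom operator, e.g. one that
    # downloads the file with GCSHook and loads it with OracleHook:
    # export >> gcs_to_oracle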
Advantages:
Each task stands on its own, so if you have successfully exported the data to disk, in the event of a backfill/failure you can just execute the failed operators and not the whole pipeline. You can also save the exported files in cold storage in case you need to rebuild from history.
Suitable for large transfers.
Disadvantages:
Adding another service which is "not needed" (a shared disk resource).
Summary
I prefer the 2nd/3rd approaches. I think they are more suitable to what Airflow provides and allow more flexibility.
Is it possible to update and insert newly added data into Grakn in an online manner?
I have read this tutorial https://dev.grakn.ai/docs/query/updating-data but I cannot find the answer there.
Thanks.
It would help to define what you mean by "online" - if it means what I think it does (keeping valid data/queryable data as you load), you should be able to see newly loaded data as soon as a transaction commits.
So, you can load data (https://dev.grakn.ai/docs/query/insert-query), and on commit you can see the committed data, and you can modify data (your link), which can modify committed data.
In general, you want to load data in many transactions, which allows you to see in-progress data sets being loaded in an "online" manner.
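For example, with the Grakn 1.x Python client (the keyspace, schema, and queries below are illustrative), data inserted in one transaction is visible to transactions opened after commit() returns:

from grakn.client import GraknClient  # grakn-client 1.x

with GraknClient(uri="localhost:48555") as client:
    with client.session(keyspace="social_network") as session:
        # write transaction: insert new data
        with session.transaction().write() as tx:
            tx.query('insert $p isa person, has name "Alice";')
            tx.commit()  # after this, the data is visible to new transactions

        # a read transaction opened afterwards sees the committed data
        with session.transaction().read() as tx:
            answers = tx.query('match $p isa person, has name "Alice"; get;')
            print(len(list(answers)))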
I have a requirement to get the results of different transaction codes (TCodes); those extracts have to be loaded into a SQL database. Some TCodes have complex logic, so replicating the logic is not an option.
Is there any way to have a daily process that runs all the TCodes and places the extracts in OneDrive or any other location?
I just need the same result as if a user went into the TCode, executed it, and extracted the output to a CSV file.
I'm writing a simple WordPress plugin for work and am wondering if using the Transients API is practical in this case, or if I should seek out another way.
The plugin's purpose is simple. I'm making a call to USZip Web Service (http://www.webservicex.net/uszip.asmx?op=GetInfoByZIP) to retrieve data. Our sales team is using a Lead Intake sheet that the plugin will run on.
I wanted to reduce the number of API calls, so I thought of setting a transient for each zip code as the key and storing the incoming data (city and zip). If the corresponding data for a given zip code already exists, then there is no need to make an API call.
Here are my concerns:
1. After a quick search, I realized that the transient data is stored in the wp_options table, and storing the data would balloon that table in no time. Would this cause a significant performance issue if the DB becomes huge?
2. Is it horrible practice to create this many transient keys? It could easily become thousands in a few months' time.
If using Transient is not the best way, could you please help point me in the right direction? Thanks!
P.S. I opted for the Transients API vs the Options API. I know zip codes don't change often, but they sometimes do. I set an expiration time of 3 months.
A less-inflated solution would be:
Store a single option called uszip with a serialized array inside the option
Grab the entire array each time and simply check if the zip code exists
If it doesn't exist, grab the data and save the whole transient again
You should make sure you don't hit the upper bounds of a serialized array in this table (9,000 elements) considering 43,000 zip codes exist in the US. However, you will most likely have a very localized subset of zip codes.