Snowflake: can it be used for very long, time-consuming tasks? - Airflow

I have a task which does 4 steps:
download an encrypted CSV file from SFTP
decrypt the file
place it in an S3 bucket
ingest the data into Snowflake
When I do the above process manually for a 1TB file, the time each step takes is:
download from SFTP (20MB/s) ---> 12 hrs
decrypt the file ---> 24 hrs
place it in the S3 bucket ---> 4 hrs
ingest the data into Snowflake ---> 8 hrs
As a proof of concept I have created a DAG in Airflow with 4 tasks for the above 4 steps. The process works fine for a 20GB file, but I am not sure whether the Airflow DAG will work without any issues for a 1TB file.
We also need to check the download and decryption progress; when done manually we can watch the progress bar of the SFTP client or the decryption software.
Any suggestions on how to deal with long-running processes (which may take 2-3 days) in Airflow with progress monitoring?
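
For reference, here is a minimal sketch of what such a proof-of-concept DAG could look like (Airflow 2.x imports assumed; the dag_id, the timeout values and the four callables are placeholders for your own implementations):

    # Minimal sketch of the 4-step pipeline as an Airflow DAG.
    # The four callables are placeholders for the real download/decrypt/upload/ingest logic.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def download_from_sftp(**context): ...     # stream the encrypted file from SFTP
    def decrypt_file(**context): ...           # decrypt it locally
    def upload_to_s3(**context): ...           # upload the decrypted file to S3
    def ingest_into_snowflake(**context): ...  # COPY the staged file into Snowflake

    with DAG(
        dag_id="sftp_to_snowflake_1tb",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,   # triggered manually for the proof of concept
        catchup=False,
        default_args={
            "retries": 1,
            "retry_delay": timedelta(minutes=30),
            # each step can run for many hours, so give tasks a generous timeout
            "execution_timeout": timedelta(days=2),
        },
    ) as dag:
        download = PythonOperator(task_id="download_from_sftp", python_callable=download_from_sftp)
        decrypt = PythonOperator(task_id="decrypt_file", python_callable=decrypt_file)
        upload = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
        ingest = PythonOperator(task_id="ingest_into_snowflake", python_callable=ingest_into_snowflake)

        download >> decrypt >> upload >> ingest

For progress visibility, one option is to have each callable log bytes processed at regular intervals, so the Airflow task log takes the place of the progress bar you would otherwise watch in the SFTP client or decryption tool.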

Related

Airflow + Docker + Redshift: task is failing even with the query executed on Redshift

I have a query which takes ~30 minutes to complete; its output is an UNLOAD of some Parquet files from Redshift into S3.
I'm using the RedshiftToS3Operator, and after ~5 minutes I receive this error:
struct.error: unpack_from requires a buffer of at least 5 bytes
Trying the PostgresOperator instead, I receive a different error after the same 5 minutes:
psycopg2.OperationalError: SSL SYSCALL error: EOF detected
After some research, I think this happens because the connection drops after 5 minutes of idling. I was able to run the same code in a Jupyter notebook and everything went well, which makes me think Docker is the problem.
Every time, even with Airflow displaying an error, the query was successfully executed in Redshift.
I also tried running raw psycopg2 code instead of the operator abstractions; this time I got past the 5-minute mark, but instead of failing, the task just stays in the Running state and never updates, even after the query finishes on Redshift.
Basically I'm not able to track whether the query worked or not without opening the Redshift UI.
Adding this to the docker-compose service definition of my Airflow scheduler fixed it:
    sysctls:
      - net.ipv4.tcp_keepalive_time=200
      - net.ipv4.tcp_keepalive_intvl=200
      - net.ipv4.tcp_keepalive_probes=5
The TCP settings of the Docker container appear to be the culprit.
From https://github.com/docker/for-mac/issues/3679#issuecomment-642885218
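
If you end up driving the query through raw psycopg2 as described above, an application-level alternative is to enable libpq TCP keepalives on the connection itself. A sketch; the host, credentials and the UNLOAD statement are placeholders:

    # Sketch: TCP keepalives set at the connection level via libpq parameters.
    # Host, dbname, user, password and the SQL below are placeholders.
    import psycopg2

    LONG_RUNNING_UNLOAD_SQL = "UNLOAD ('SELECT ...') TO 's3://...' IAM_ROLE '...' FORMAT AS PARQUET"

    conn = psycopg2.connect(
        host="my-redshift-cluster.example.com",
        port=5439,
        dbname="analytics",
        user="airflow",
        password="...",
        keepalives=1,             # turn TCP keepalives on
        keepalives_idle=120,      # seconds of idle time before the first probe
        keepalives_interval=30,   # seconds between probes
        keepalives_count=5,       # failed probes before the connection is considered dead
    )
    with conn, conn.cursor() as cur:
        cur.execute(LONG_RUNNING_UNLOAD_SQL)   # the ~30-minute query goes here
    conn.close()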

Airflow - How to configure all of a DAG's tasks to run on one worker

I have a DAG with 2 tasks:
download_file_from_ftp >> transform_file
My concern is that the tasks can be executed on different workers: the file would be downloaded on the first worker but transformed on another, and an error would occur because the file is missing on the second worker. Is it possible to configure the DAG so that all tasks are performed on one worker?
It's bad practice. Even if you find a workaround, it will be very unreliable.
In general, if your executor allows it, you can configure tasks to execute on a specific worker type. For example, with the CeleryExecutor you can assign tasks to a specific queue. Assuming there is only 1 worker consuming from that queue, your tasks will be executed on the same worker, BUT the fact that it's 1 worker doesn't mean it will be the same machine. It depends heavily on the infrastructure you use. For example: when you restart your machines, do you get the exact same machine back, or is a new one spawned?
I strongly advise you not to go down this road.
To solve your issue, either download the file to shared storage like S3, Google Cloud Storage, etc., so all workers can read the file from the cloud, or combine the download and transform into a single operator so both actions are executed together (a sketch of the latter follows).
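
For illustration, a minimal sketch of the combined approach, where download and transform run inside one task so the intermediate file never has to cross worker boundaries; download_file and transform_file are hypothetical helpers and the snippet belongs inside your DAG definition:

    # Sketch: download + transform as a single task. download_file and
    # transform_file are hypothetical helpers; place this inside your DAG.
    import tempfile

    from airflow.operators.python import PythonOperator

    def download_file(remote_path, dest_dir): ...   # placeholder FTP download
    def transform_file(local_path): ...             # placeholder transformation

    def download_and_transform(**context):
        # both steps run on whichever worker picks up this one task
        with tempfile.TemporaryDirectory() as tmpdir:
            local_path = download_file(remote_path="/inbox/data.csv", dest_dir=tmpdir)
            transform_file(local_path)

    download_and_transform_task = PythonOperator(
        task_id="download_and_transform",
        python_callable=download_and_transform,
    )

If you do try the queue route with the CeleryExecutor instead, the queue is set per task (for example PythonOperator(..., queue="single_worker_queue")), but as noted above that only pins the task to whichever worker consumes that queue, not to a specific machine.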

Sharing large intermediate state between Airflow tasks

We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step of some file in a BashOperator or PythonOperator.
However, in our understanding the tasks of a given DAG may not always be scheduled on the same machine.
The options for state sharing between tasks I've gathered so far:
Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company
Use XCom - does this have a size limit? Probably unsuitable for large files
Write custom Operators for every combination of tasks that need local processing in between. This approach reduces modularity of tasks and requires replicating existing operators' code.
Use Celery queues to route DAGs to the same worker (docs) - This option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues?
Use a shared network storage in all machines that run executors - Seems like an additional infrastructure burden, but is a possibility.
What is the recommended way to do sharing of large intermediate state, such as files, between tasks in Airflow?
To clarify something: no matter how you set up Airflow, there will only ever be one executor running.
The executor runs on the same machine as the scheduler.
Currently (Airflow 1.9.0 at the time of writing) there is no safe way to run multiple schedulers, so there will only ever be one executor running.
The LocalExecutor executes tasks on the same machine as the scheduler.
The CeleryExecutor just puts tasks on a queue to be worked on by the Celery workers.
However, the question you are asking does apply to Celery workers: if you use the CeleryExecutor you will probably have multiple Celery workers.
Using network shared storage solves multiple problems:
Each worker machine sees the same dags because they have the same dags folder
Results of operators can be stored on a shared file system
The scheduler and webserver can also share the dags folder and run on different machines
I would use network storage and write the output file name to XCom. Then, when you need the output of a previous task as input, you read the file name from that task's XCom and process that file (see the sketch below).
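
A minimal sketch of that pattern, assuming every worker mounts the same shared path (the /mnt/shared directory below is a placeholder):

    # Sketch: pass only the file name through XCom; the data itself lives on
    # shared storage that every Celery worker mounts. /mnt/shared is a placeholder.
    import os
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    SHARED_DIR = "/mnt/shared/airflow-intermediate"

    def produce(**context):
        path = os.path.join(SHARED_DIR, f"{context['run_id']}_extract.csv")
        with open(path, "w") as f:
            f.write("col_a,col_b\n1,2\n")    # stand-in for the real output
        return path                          # the returned value is pushed to XCom

    def consume(**context):
        path = context["ti"].xcom_pull(task_ids="produce")
        with open(path) as f:
            data = f.read()                  # real processing would go here
        print(f"read {len(data)} bytes from {path}")

    with DAG(dag_id="shared_storage_example", start_date=datetime(2023, 1, 1),
             schedule_interval=None, catchup=False) as dag:
        produce_task = PythonOperator(task_id="produce", python_callable=produce)
        consume_task = PythonOperator(task_id="consume", python_callable=consume)
        produce_task >> consume_task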
Change the datatype of the value column in the xcom table of the Airflow metastore (it is the value column, not key, that holds the payload).
Its default datatype is BLOB.
Change it to LONGBLOB (on a MySQL metastore); that lets you store up to 4GB between intermediate tasks.
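
If you take that route on a MySQL-backed metastore, the change is a single ALTER statement; a sketch via SQLAlchemy, with a placeholder connection URI (and the usual caveat to back up the metadata database first):

    # Sketch: widening the XCom value column on a MySQL metastore.
    # The connection URI is a placeholder.
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://airflow:PASSWORD@metastore-host/airflow")

    with engine.begin() as conn:
        conn.execute(text("ALTER TABLE xcom MODIFY COLUMN `value` LONGBLOB"))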

Understanding Hadoop YARN memory on data nodes vs. Unix memory

We have 20 data nodes and 3 management nodes. Each data node has 45GB of RAM.
Data node RAM capacity: 45GB x 20 = 900GB total RAM
Management node RAM capacity: 100GB x 3 = 300GB RAM
In the Hadoop ResourceManager UI I can see that memory is almost completely occupied (890GB of the 900GB), and submitted jobs are in a waiting state.
I have therefore raised a request to increase the memory capacity, so we stop running at 890GB out of 900GB.
However, the Unix team says that on each data node about 80% of the 45GB RAM is free according to the free -g command (counting cache/buffers as free), while the Hadoop ResourceManager UI says memory is completely occupied and several jobs are on hold because of it. I would like to know how Hadoop calculates the memory shown in the ResourceManager, and whether upgrading the memory is worthwhile, given that it fills up every time users submit Hive jobs.
Who is right here: the Hadoop figures in the RM, or the Unix free command?
The Unix free command is correct, because the RM shows reserved memory, not memory actually used.
If I submit a MapReduce job with 1 map task requesting 10GB of memory per map task, but the map task only uses 2GB, then the operating system will show only 2GB used. The RM will show 10GB used because it has to reserve that amount for the task, even if the task doesn't use all of it.
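
To see the difference yourself, you can pull the cluster metrics from the ResourceManager REST API and put them next to what free reports on a node. A sketch, assuming the RM web UI is on the default port 8088 and that the metric field names (totalMB, allocatedMB, availableMB) match your Hadoop version:

    # Sketch: YARN's view (container reservations) vs. the OS view of memory.
    # rm-host is a placeholder; field names may vary slightly by Hadoop version.
    import subprocess

    import requests

    RM_METRICS_URL = "http://rm-host:8088/ws/v1/cluster/metrics"

    metrics = requests.get(RM_METRICS_URL, timeout=10).json()["clusterMetrics"]
    total_mb = metrics["totalMB"]          # memory the NodeManagers offer to YARN
    allocated_mb = metrics["allocatedMB"]  # memory *reserved* for running containers
    available_mb = metrics["availableMB"]

    print(f"YARN view: {allocated_mb}/{total_mb} MB allocated, {available_mb} MB available")

    # OS view on the node where this runs: actual usage, with cache/buffers
    # reclaimable - this is what the Unix team is looking at with free -g.
    print(subprocess.run(["free", "-m"], capture_output=True, text=True).stdout)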

Setting up a Daemon in Unix

I have scheduled a crontab job on AIX 6.1 that runs twice an hour, all day, to process a load.
The load that crontab kicks off needs files to arrive in a particular directory; if the files have arrived, I process them.
It so happens that of the 24 times crontab kicks off my load, only about 7 times do I actually receive files. The rest of the time no files arrive and the load just reports that no files were received, and I cannot predict when files will arrive, as they can come in at any time.
I heard that a daemon can be set up in Unix to monitor for files arriving at a particular destination and then call other scripts.
I was wondering whether it is possible to create a daemon that monitors a particular destination and, if files are present, calls a shell script.
Is this possible, and if so, how can I code the daemon? I have no prior experience with daemons. Can anyone clarify?
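
For a sense of what such a watcher could look like: a minimal polling sketch (inotify is Linux-specific, so on AIX plain polling is the straightforward approach). The watched directory, polling interval and the shell script it calls are placeholders:

    # Sketch: a long-running polling watcher that hands newly arrived files to
    # an existing shell script. Directory, interval and script are placeholders;
    # run it under nohup or as an SRC/inittab-managed process to daemonize it.
    import os
    import subprocess
    import time

    WATCH_DIR = "/data/incoming"
    SCRIPT = "/usr/local/bin/process_load.sh"
    POLL_SECONDS = 60

    seen = set()
    while True:
        try:
            current = set(os.listdir(WATCH_DIR))
        except OSError:
            current = set()
        for name in sorted(current - seen):
            # hand each new file to the existing processing script
            subprocess.run([SCRIPT, os.path.join(WATCH_DIR, name)], check=False)
        seen = current
        time.sleep(POLL_SECONDS)

In practice you would also want to confirm a file has finished transferring (for example, a stable size across two polls) before handing it off.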
