Keeping state in airflow pipeline - airflow

I am new to Airflow and feel like I may be missing some convention or concept.
Context: I have files being periodically dropped into an S3 bucket. My pipeline will need to grab new files and process them.
Basically: How do I avoid re-processing?
It is not unlikely that some part of the pipeline will change in the future and I will want to re-process files. But on a day-to-day basis I don't want to re-process files. Additionally there will likely be other pipelines in the future which would need to start from the beginning and process all the files for a different output.
I have plenty of scrappy ways of preserving state (a local json file, or checking the existence of output files) - but I'm wondering if there's a convention in airflow. What makes the most sense to me at the moment is to re-use the postgres that exists for airflow (maybe bad form?), add another DB and start creating tables in there where I list input files if they have been processed for workflow X, workflow Y, etc.
How would you do this?
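The per-workflow tracking table described above could look roughly like the sketch below. It uses Python's built-in sqlite3 as a stand-in for a separate Postgres database; the table name processed_files, its columns, and all helper names are assumptions made for illustration, not an Airflow convention.

```python
import sqlite3


def init_ledger(conn):
    """Create the (hypothetical) tracking table if it does not exist yet."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS processed_files (
            workflow TEXT NOT NULL,
            file_key TEXT NOT NULL,
            PRIMARY KEY (workflow, file_key)
        )
        """
    )


def unprocessed(conn, workflow, candidate_keys):
    """Return the candidate keys that `workflow` has not processed yet."""
    rows = conn.execute(
        "SELECT file_key FROM processed_files WHERE workflow = ?", (workflow,)
    )
    done = {row[0] for row in rows}
    return [key for key in candidate_keys if key not in done]


def mark_processed(conn, workflow, file_key):
    """Record a processed file; repeating the call is harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO processed_files (workflow, file_key) VALUES (?, ?)",
        (workflow, file_key),
    )
    conn.commit()


def reset_workflow(conn, workflow):
    """Forget a workflow's history so that every file is processed again."""
    conn.execute("DELETE FROM processed_files WHERE workflow = ?", (workflow,))
    conn.commit()
```

Because rows are keyed by (workflow, file_key), a new workflow Y starts with every file unprocessed, and reset_workflow covers the case where a pipeline changes and everything should be re-run. INSERT OR IGNORE is SQLite syntax; on Postgres the equivalent would be INSERT ... ON CONFLICT DO NOTHING.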

Here is how I solved a similar problem with a 4-task DAG.
1.) Write a custom S3 sensor that extends BaseSensorOperator. This sensor uses the boto3 library and watches a specific folder in the bucket. If any files appear there, it posts all the file paths to XCom. This sensor is the first operator in the DAG.
2.) The next operator is a PythonOperator that reads the list from the previous task's XCom. It moves all the files to another folder in the same bucket, again publishing the new paths to XCom.
3.) The next operator processes each of these files.
4.) The last operator triggers this same DAG again (so we start back at the custom S3 file sensor, because the DAG retriggers itself).
The DAG must not have a schedule_interval and needs to be triggered once manually. It will then watch the bucket indefinitely, or until something breaks.
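The listing and moving steps could be sketched as below. This is a minimal sketch rather than the answerer's actual code: it assumes a boto3 S3 client is passed in as s3_client, and the function names and prefixes are hypothetical. Inside a sensor, the poke method would return True when list_keys finds anything and push the keys to XCom.

```python
def list_keys(s3_client, bucket, prefix):
    """List every object key under `prefix`, following pagination."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            # Skip zero-byte "folder" placeholder keys such as "incoming/".
            if not obj["Key"].endswith("/"):
                keys.append(obj["Key"])
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


def move_keys(s3_client, bucket, keys, src_prefix, dst_prefix):
    """Copy each key to the destination prefix, then delete the original."""
    moved = []
    for key in keys:
        new_key = dst_prefix + key[len(src_prefix):]
        s3_client.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=bucket, Key=key)
        moved.append(new_key)
    return moved
```

S3 has no native move operation, so the sketch copies and then deletes. Moving files out of the watched prefix is what keeps the next sensor run from seeing them again.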

Related

Sharing information between DAGs in airflow

I have one dag that tells another dag what tasks to create in a specific order.
Dag 1 -> a file that has a task order
This runs every 5 minutes or so to keep this file fresh.
Dag 2 -> runs the task
this runs daily.
How can I pass this data between the two DAGs using Airflow?
Solutions and problems
The problem with using Airflow Variables is that I cannot set them at runtime.
The problem with using XComs is that they can only be used while a task is running, and once the tasks are created in Dag 2, they're set and cannot be changed, correct?
The problem with pushing the file to s3 is that the airflow instance doesn't have permission to pull from s3 due to security reasons decided by a team that I have no control over.
So what can I do? What are some choices I have?
What is the file format of the output from the 1st DAG? I would recommend the following workflow:
Dag 1 -> Update the tasks order and store it in a yaml or json file inside the airflow environment.
Dag 2 -> Read the file to create the required tasks and run them daily.
You need to understand that airflow is constantly reading your dag files to have the latest configuration, so no extra step would be required.
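The reading side of this workflow might look like the sketch below. It assumes Dag 1 writes a JSON array of task names; the file layout and the helper names are hypothetical. Since the DAG file is re-parsed constantly, the loader is written so that a missing or half-written file yields an empty list instead of an exception.

```python
import json
from pathlib import Path


def load_task_order(path):
    """Return the ordered task names written by the first DAG.

    Falls back to an empty list when the file is missing or malformed, so
    that parsing the second DAG's file never raises.
    """
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        return []
    if not isinstance(data, list):
        return []
    return [str(name) for name in data]


def chain_pairs(task_names):
    """Pair each task with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(task_names, task_names[1:]))
```

In the DAG file for Dag 2, one operator would be created per name, and each (upstream, downstream) pair from chain_pairs would be wired together with `>>`.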
I have had a similar issue in the past and it largely depends on your setup.
If you are running Airflow on Kubernetes this might work.
You create a PV (Persistent Volume) and a PVC (Persistent Volume Claim).
You start your application with a KubernetesPodOperator and mount the PVC to it.
You store the result on the PVC.
You mount the PVC to the other pod.

Do Airflow workers share the same file system, or are they isolated?

I have a task in Airflow which downloads a file from GitHub to the local file system, passes it to spark-submit, and then deletes it. I wanted to know if this will create any issues.
Is it possible that two workers running the same task concurrently, for two different DAG runs, end up referencing the same file?
Sample code -->
def python_task_callback():
    download_file(file_name='script.py')
    spark_submit(path='/temp/script.py')
    delete_file(path='/temp/script.py')
For your use case, if you do all of the actions you mentioned (download, submit, delete) in a single task, then you will have no problems regardless of which executor you are running.
If you are splitting the actions between several tasks, then you should use shared storage such as S3 or Google Cloud Storage. In that case it will also work regardless of which executor you are using.
A possible workflow can be:
1st task: copy file from github to S3
2nd task: submit the file to processing
3rd task: delete the file from S3
As for your general question if tasks share disk - that depends on the executor that you are using.
With the Local Executor you have only 1 worker, so all tasks run on the same machine and share its disk.
With the Celery Executor, the Kubernetes Executor, and others, tasks may run on different workers.
However, as mentioned, don't assume that tasks share a disk: if you later need to scale up from the Local Executor to Celery, you don't want to find yourself having to refactor your code.
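On the concurrency part of the question: a fixed path such as /temp/script.py can collide when two runs land on the same machine, since one run may delete the file while the other is still using it. A minimal sketch of one way around that is to give every task run its own temporary directory. The download and submit callables below are hypothetical stand-ins for the question's download_file and spark_submit helpers.

```python
import os
import tempfile


def run_with_private_copy(download, submit, file_name="script.py"):
    """Download into a fresh directory, submit, then clean up.

    `download(dest_path)` and `submit(path)` are hypothetical callables
    standing in for the question's download_file and spark_submit helpers.
    """
    # TemporaryDirectory creates a uniquely named directory and deletes it
    # (with its contents) on exit, even if submit() raises.
    with tempfile.TemporaryDirectory(prefix="airflow_task_") as workdir:
        path = os.path.join(workdir, file_name)
        download(path)
        submit(path)
        return path
```

Each call gets a different directory, so concurrent runs never touch each other's copy, and the explicit delete step is no longer needed.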

Can Airflow persist access to metadata of short-lived dynamically generated tasks?

I have a DAG that, whenever there are files detected by FileSensor, generates tasks for each file to (1) move the file to a staging area, (2) trigger a separate DAG to process the file.
FileSensor -> Move(File1) -> TriggerDAG(File1) -> Done
           |-> Move(File2) -> TriggerDAG(File2) -^
In the DAG definition file, the middle tasks are generated by iterating over the directory that FileSensor is watching, a bit like this:
# def generate_move_task(f: Path) -> BashOperator
# def generate_dag_trigger(f: Path) -> TriggerDagRunOperator
with dag:
    for filepath in Path(WATCH_DIR).glob('*'):
        sensor_task >> generate_move_task(filepath) >> generate_dag_trigger(filepath)
The Move task moves the files that lead to the task generation, so the next DAG run won't have FileSensor re-trigger either Move or TriggerDAG tasks for this file. In fact, the scheduler won't generate the tasks for this file at all, since after all files go through Move, the input directory has no contents to iterate over anymore.
This gives rise to two problems:
1.) After execution, the task logs and renderings are no longer available. The Graph View only shows the DAG as it is now (empty), not as it was at runtime. (The Tree View shows the tasks' runs and states, but clicking on a "square" and picking any details leads to an Airflow error.)
2.) The downstream tasks can be memory-holed due to a race condition. The first task is to move the originating file to a staging area. If that takes longer than the scheduler polling period, the scheduler no longer collects the downstream TriggerDAG(File1) task, which means that task is not scheduled to be executed even though the upstream task ran successfully. It's as if the downstream task never existed.
The race condition issue is solved by changing the task sequence to Copy(File1) -> TriggerDAG(File1) -> Remove(File1), but the broader problem remains: is there a way to persist dynamically generated tasks, or at least a way to consistently access them through the Airflow interface?
While it isn't clear, I'm assuming that the downstream DAG(s) that you trigger via your orchestrator DAG are NOT dynamically generated for each file (like your Move & TriggerDAG tasks); in other words, unlike your Move tasks that keep appearing and disappearing (based on files), the downstream DAGs are static and always stay there.
You've already built a relatively complex workflow that does advanced stuff like generating tasks dynamically and triggering external DAGs. I think with a slight modification to your DAG structure, you can get rid of your troubles (which are also quite advanced IMO):
1.) Relocate the Move task(s) from your upstream orchestrator DAG to the downstream (per-file) process DAG(s).
2.) Make the upstream orchestrator DAG do two things:
    a.) Sense / wait for files to appear.
    b.) For each file, trigger the downstream processing DAG (which in effect you are already doing).
For the orchestrator DAG, you can do it either way:
1.) Have a single task that does file sensing + triggering downstream DAGs for each file.
2.) Have two tasks (I'd prefer this):
    a.) The first task senses files and, when they appear, publishes their list in an XCom.
    b.) The second task reads that XCom and, for each file, triggers its corresponding DAG.
But whichever way you choose, you'll have to replicate the relevant bits of code from
    FileSensor (to be able to sense files and then publish their names in XCom) and
    TriggerDagRunOperator (so as to be able to trigger multiple DAGs with a single task).
here's a diagram depicting the two tasks approach
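In code, the two-task approach might be sketched as follows. This is plain Python covering only the logic of each task: the Airflow wiring (pushing the list to XCom, actually triggering the DAG runs) is left out, and the function names, the file_path conf key, and the process_file DAG id are assumptions made for illustration.

```python
from pathlib import Path


def sense_files(watch_dir):
    """Task 1: return the files currently waiting, as strings for XCom."""
    return sorted(str(p) for p in Path(watch_dir).iterdir() if p.is_file())


def build_trigger_requests(file_paths, target_dag_id):
    """Task 2: build one trigger request per file read from XCom."""
    return [
        {"dag_id": target_dag_id, "conf": {"file_path": path}}
        for path in file_paths
    ]
```

Both tasks are always present in the DAG definition, so their logs and XCom values stay visible in the UI; only the data passed between them changes from run to run. For example, build_trigger_requests(sense_files(watch_dir), "process_file") yields one request per waiting file, each carrying the path the downstream DAG should move and process.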
The short answer to the title question is, as of Airflow 1.10.11, no, this doesn't seem possible as stated. To render DAG/task details, the Airflow webserver always consults the DAGs and tasks as they are currently defined and collected to DagBag. If the definition changes or disappears, tough luck. The dashboard just shows the log entries in the table; it doesn't probe the logs for prior logic (nor does it seem to store much of it other than the headline).
y2k-shubham provides an excellent solution to the unspoken question of "how can I write DAGs/tasks so that the transient metadata are accessible". The subtext of his solution: convert the transient metadata into something Airflow stores per task run, but keep the tasks themselves fixed. XCom is the solution he uses here, and it does show up in the task instance details / logs.
Will Airflow implement persistent interface access to fleeting one-time tasks whose definition disappears from the DagBag? It's possible but unlikely, for two reasons:
It would require the webserver to probe the historical logs instead of just the current DagBag when rendering the dashboard, which would require extra infrastructure to keep the web interface snappy, and could make the display very confusing.
As y2k-shubham notes in a comment to another question of mine, fleeting and changing tasks/DAGs are an Airflow anti-pattern. I'd imagine that would make this a tough sell as the next feature.

Determining if a DAG is executing

I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that polls an SFTP site to find new files. If any are found, then I create custom task IDs for the dynamically created tasks and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)
for file_path in directory_list:
    ...  # SFTP code that GETs the remote files
That part works fine. It seems both the airflow webserver and airflow scheduler are iterating through all the DAGs once a second and actually running the code that retrieves the directory_list. This means I'm hitting the SFTP site ~2 x a second to authenticate and pull a list of files. I'd like to have some conditional code that only executes if the DAG is actually being run.
When an SFTP site uses password authentication, the # of times I connect really isn't an issue. One site requires key authentication and if there are too many authentication failures in a short timespan, the account is locked. During my testing, this seems to happen occasionally for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or executing manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?
You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG is executed with each scheduler loop. Or put another way, every time the scheduler loop processes the files in your DAG directory it is interpreting all the code in your DAG files. Anything not in a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler as well as any external systems you are making calls to.
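The difference between parse-time and run-time code can be illustrated with a small sketch. The list_remote_files function below is a hypothetical stand-in for the SFTP directory listing, and listing_calls exists only to make the number of connections visible.

```python
listing_calls = []


def list_remote_files():
    """Hypothetical stand-in for the expensive SFTP directory listing."""
    listing_calls.append("connect")
    return ["/a.csv", "/b.csv"]


# Anti-pattern: a module-level call such as
#     directory_list = list_remote_files()
# would run on every parse of this file, not just when the DAG runs.


def fetch_new_files():
    """Intended as a task callable: the listing happens only at execution."""
    return [f"fetched {path}" for path in list_remote_files()]
```

Importing this module makes no connection at all; list_remote_files is only reached when fetch_new_files is invoked, which in a DAG would be when the task wrapping it executes.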
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs through the UI well. This is mostly the result of the Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exists in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
Adding a task to a DAG makes it appear (in the UI) in all earlier DAG runs as well, even though those runs never ran that task. In those runs the task will have a None state, and the DAG run will still be set to success or failed depending on the outcome of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks you will lose the ability to review history of the DAG. For example, if you run a DAG with task_x in the first 20 DAG runs but remove it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when the DAG runs are idempotent. This means that re-running any DAG run should have the same effect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding and removing tasks to previous DAG runs so that the results of re-running are not the same.
Solution Options
You have at least two options moving forward
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if not using distributed executor) or an Airflow Variable (this will result in more reads to the Airflow DB) and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files so that every file that exists is processed within a single task run. This will make the DAGs idempotent and you will maintain accurate history through the logs.
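The core of option 2 could be sketched as a single function that the overloaded operator's execute method would call. This is a minimal sketch under assumptions: transfer_all and transfer_one are hypothetical names, and transfer_one stands for whatever fetches one remote file and stores it.

```python
def transfer_all(file_paths, transfer_one):
    """Process every listed file in one task run and report the outcome.

    `transfer_one(path)` is a hypothetical callable that fetches one remote
    file and stores it; it should raise on failure.
    """
    succeeded, failed = [], {}
    for path in file_paths:
        try:
            transfer_one(path)
            succeeded.append(path)
        except Exception as exc:  # keep going so one bad file does not block the rest
            failed[path] = str(exc)
    if failed:
        # Raising marks the task as failed so the problem shows up in the UI.
        raise RuntimeError(
            f"{len(failed)} of {len(file_paths)} transfers failed: {failed}"
        )
    return succeeded
```

The DAG shape stays fixed at one task regardless of how many files exist, and the per-file detail ends up in that task's log and return value rather than in dynamically created tasks.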
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.

How to successfully exit a task midway within an Airflow dag?

I have a DAG that checks for files on an FTP server (Airflow runs on a separate server). If file(s) exist, the file(s) get moved to S3 (we archive here). From there, the filename is passed to a Spark submit job. The Spark job will process the file via S3 (the Spark cluster is on a different server). I'm not sure if I need to have multiple DAGs, but here's the flow. What I'm looking to do is to only run a Spark job if a file exists in the S3 bucket.
I tried using an S3 sensor, but it fails once it reaches its timeout, and therefore the whole DAG run is set to failed.
check_for_ftp_files -> move_files_to_s3 -> submit_job_to_spark -> archive_file_once_done
I only want to run everything after the script that does the FTP check ONLY when a file or files were moved into S3.
You can have 2 different DAGs. One only has the S3 sensor and keeps running, let's say, every 5 minutes. If it finds the file, it triggers the second DAG. The second DAG submits the Spark job and archives the file once done. You can use TriggerDagRunOperator in the first DAG for the triggering.
The answer Him gave will work.
Another option is using the "soft_fail" parameter that sensors have (it is a parameter of BaseSensorOperator). If you set this parameter to True, then instead of failing the task, the sensor will skip it, and all following tasks in the branch will also be skipped.
See airflow code for more info.
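The effect of soft_fail can be modelled in a few lines of plain Python. This is only an illustration of the behaviour described above, not Airflow's implementation: run_sensor and max_pokes are hypothetical, and a real sensor is bounded by its timeout and poke interval rather than a poke count.

```python
def run_sensor(poke, max_pokes, soft_fail=False):
    """Poke up to `max_pokes` times and report the resulting task state."""
    for _ in range(max_pokes):
        if poke():
            return "success"
    # Timed out: soft_fail turns the failure into a skip.
    return "skipped" if soft_fail else "failed"
```

With soft_fail=True, a run in which no file ever arrives ends as "skipped" instead of "failed", so the DAG run is not marked as failed just because there was nothing to process.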
