So here's a stupid idea...
I created (many) DAGs in Airflow... and it works... however, I would like to package it up somehow so that I could run a single DAG run without having Airflow installed; i.e. have it self-contained so I don't need all the web servers, databases, etc.
I mostly instantiate new DAG runs with trigger_dag anyway, and I noticed that the overhead of running Airflow appears quite high (workers have high loads while doing essentially nothing, and it can sometimes take tens of seconds before dependent tasks are queued, etc.).
I'm not too bothered about all the logging, etc.
You can create a script which executes Airflow operators, although this loses all the metadata that Airflow provides. You still need to have Airflow installed as a Python package, but you don't need to run any web servers, etc. A simple example could look like this:
from dags.my_dag import operator1, operator2, operator3


def main():
    # execute pipeline
    # operator1 -> operator2 -> operator3
    operator1.execute(context={})
    operator2.execute(context={})
    operator3.execute(context={})


if __name__ == "__main__":
    main()
It sounds like your main concern is the resources wasted by idling workers, more so than the overhead of Airflow itself.
I would suggest running Airflow with the LocalExecutor on a single box. This will give you the benefits of concurrent execution without the hassle of managing workers.
As for the database: there is no way to remove the database component without modifying the Airflow source itself. One alternative would be to use the SequentialExecutor with SQLite, but that removes the ability to run concurrent tasks and is not recommended for production.
First, I'd say you need to tweak your Airflow setup.
But if that's not an option, another approach is to write your main logic in code outside the DAG (this is also a best practice). For me this makes the code easier to test locally as well; see the sketch at the end of this answer.
A shell script is an easy way to tie a few processes together.
You won't get the benefit of operators or dependencies, but you can probably script your way around that. And if you can't, just use Airflow.
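A minimal sketch of what I mean (the module and function names here are made up): keep the pipeline logic in a plain Python module with no Airflow imports, so it runs and tests locally on its own, and let the DAG file only wire those functions into operators.

# etl_logic.py -- hypothetical module with no Airflow imports
def extract():
    print("extracting")

def transform():
    print("transforming")

def load():
    print("loading")

def run_pipeline():
    extract()
    transform()
    load()

if __name__ == "__main__":
    run_pipeline()  # `python etl_logic.py` runs the pipeline standalone

The DAG file then just imports extract, transform and load and wraps each in a PythonOperator, so the same code serves both the scheduled runs and the self-contained script the question asks about.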
You can overload the imported Airflow modules if they fail to import. So for example, if you are using from airflow.decorators import dag, task you can replace the @dag and @task decorators with no-op stand-ins:
from datetime import datetime

try:
    from airflow.decorators import dag, task
except ImportError:
    mock_decorator = lambda f=None, **d: f if f else lambda x: x
    dag = mock_decorator
    task = mock_decorator


@dag(schedule=None, start_date=datetime(2022, 1, 1), catchup=False)
def mydag():

    @task
    def task_1():
        print("task 1")

    @task
    def task_2(input):
        print("task 2")

    task_2(task_1())


_dag = mydag()
In our Big Data project there are ~3000 tables to load, and each of these tables should be processed by a separate DAG in Airflow.
In our solution a single Python file generates every type of table loader, so they can be triggered separately, in an event-based manner, via the REST API from a Cloud Function.
Therefore we generate our DAGs by using:
Airflow variables used for the DAG generator logic
list of table names to generate
table type: insert append, truncate load, scd1, scd2
Airflow variables used by the specific operators of the table loader DAGs, e.g.:
RR_TableN = {}  # Python dict for the operator handling RawToRaw
RC_TableN = {}  # Python dict for the operator handling RawToCuration
user defined macros:
we try not to put "static" Python code in between task definitions, because it would be executed during the DAG-generation process
user-defined macros are evaluated only at DAG-execution time (a minimal sketch of what we mean follows below)
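To illustrate (the table name and helper function below are made up for the example), the Jinja expression is only rendered when the task actually executes, not while the DAG file is parsed/generated:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x import path

def stage_table(table_name):
    # hypothetical helper; anything heavy here runs at task execution time only
    return 'stg_{}'.format(table_name)

dag = DAG(
    'macro_example',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    user_defined_macros={'stage_table': stage_table},
)

# the {{ ... }} template is rendered when the task runs, not during DAG generation
t = BashOperator(
    task_id='echo_stage_table',
    bash_command="echo {{ stage_table('table_n') }}",
    dag=dag,
)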
Unfortunately, we are bound to Airflow v1.x.x.
Problem:
We have noticed that Airflow/Cloud Composer is significantly slower between task executions when many DAGs are generated.
When only 10-20 DAGs are generated, the time between task executions is much shorter than when we have 100-200 DAGs.
When 1000 DAGs are generated, it takes minutes to start a new task after a preceding task of a given DAG has finished, even when no other DAGs are executing.
We don't understand why task execution times are affected so heavily by the number of generated DAGs.
Shouldn't it take near-constant time for Airflow to look up the required parameters for the TaskInstances in its metadata database?
We are not sure if the Cloud Composer is configured/scaled/managed properly by Google.
Questions:
What's the reason behind this slowdown from Airflow's side?
How could we reduce the waiting times between the task executions and speed up the whole process?
Is what we are implementing (a generator plus user-defined macros processing Airflow variables) a "bad design pattern"?
If so, how could we achieve something similar (table-separated DAGs, a single codebase, etc.) in a more effective way?
This is a very simple example of the generator code that we use:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime


def create_dag(dag_id, schedule, dag_number, default_args):

    def example(*args):
        print('Example DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id, schedule_interval=schedule, default_args=default_args)

    with dag:
        t1 = PythonOperator(task_id='example', python_callable=example)

    return dag


for dag_number in range(1, 5000):
    dag_id = 'Example_{}'.format(str(dag_number))
    default_args = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
    globals()[dag_id] = create_dag(dag_id, '@daily', dag_number, default_args)
Yes. That is a known problem. It's been fixed in Airflow 2.
It's inherent to how the processing of DAG files was done in Airflow 1 (mainly the number of queries generated).
Other than migrating to Airflow 2, there is not much you can do. Fixing that required a complete refactoring and half-rewrite of the Airflow scheduler logic.
One way to mitigate it: rather than generating all DAGs from a single file, you could split it into many files. For example, instead of generating the DAG objects in one Python file, you could generate 3000 separate, dynamically generated small DAG files. This will scale much better; a rough sketch of the idea follows below.
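For example, a small generator script (the file names and paths here are assumptions) that writes one tiny DAG file per table into the DAGs folder, so the scheduler parses many small files instead of one huge one:

# generate_dag_files.py -- hypothetical generator script, not itself a DAG file
import os

DAG_TEMPLATE = """\
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def example(*args):
    print('Example DAG: {dag_number}')


dag = DAG(
    'Example_{dag_number}',
    schedule_interval='@daily',
    default_args={{'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}},
)

t1 = PythonOperator(task_id='example', python_callable=example, dag=dag)
"""

DAGS_FOLDER = '/path/to/dags'  # assumption: adjust to your environment

for dag_number in range(1, 5000):
    path = os.path.join(DAGS_FOLDER, 'example_{}.py'.format(dag_number))
    with open(path, 'w') as f:
        f.write(DAG_TEMPLATE.format(dag_number=dag_number))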
However, the good news is that in Airflow 2 this is many, many times faster and more scalable. And Airflow 1.10 reached end-of-life, is no longer supported, and will not receive any more updates. So rather than changing the process, I'd heartily recommend migrating.
The current problem that I am facing is that I have documents in a MongoDB collection which each need to be processed and updated by tasks which need to run in an acyclic dependency graph. If a task upstream fails to process a document, then none of the dependent tasks may process that document, as that document has not been updated with the prerequisite information.
If I were to use Airflow, this leaves me with two solutions:
Trigger a DAG for each document, and pass in the document ID with --conf. The problem with this is that this is not the intended way for Airflow to be used; I would never be running a scheduled process, and based on how documents appear in the collection, I would be making 1440 Dagruns per day.
Run a DAG every period for processing all documents created in the collection for that period. This follows how Airflow is expected to work, but the problem is that if a task fails to process a single document, none of the dependent tasks may process any of the other documents. Also, if a document takes longer than other documents do to be processed by a task, those other documents are waiting on that single document to continue down the DAG.
Is there a better method than Airflow? Or is there a better way to handle this in Airflow than the two methods I currently see?
From the knowledge I gained in my attempt to answer this question, I've come to the conclusion that Airflow is just not the tool for the job.
Airflow is designed for scheduled, idempotent DAGs. A DagRun must also have a unique execution_date; this means that running the same DAG at the exact same start time (in the case that we receive two documents at the same time) is quite literally impossible. Of course, we can schedule the next DagRun immediately in succession, but this limitation should demonstrate that any attempt to use Airflow in this fashion will always be, to an extent, a hack.
The most viable solution I've found is to instead use Prefect, which was developed with the intention of overcoming some of the limitations of Airflow:
"Prefect assumes that flows can be run at any time, for any reason."
Prefect's equivalent of a DAG is a Flow; one key advantage of a Flow that we may take advantage of is the ease of parametrization. Then, with some threads, we're able to have a Flow run for each element in a stream. Here is an example streaming ETL pipeline:
import time
from prefect import task, Flow, Parameter
from threading import Thread


def stream():
    for x in range(10):
        yield x
        time.sleep(1)


@task
def extract(x):
    # If 'x' referenced a document, in this step we could load that document
    return x


@task
def transform(x):
    return x * 2


@task
def load(y):
    print("Received y: {}".format(y))


with Flow("ETL") as flow:
    x_param = Parameter('x')
    e = extract(x_param)
    t = transform(e)
    l = load(t)


for x in stream():
    thread = Thread(target=flow.run, kwargs={"x": x})
    thread.start()
You could change trigger_rule from "all_success" to "all_done"
https://github.com/apache/airflow/blob/62b21d747582d9d2b7cdcc34a326a8a060e2a8dd/airflow/example_dags/example_latest_only_with_trigger.py#L40
You could also create a branch with trigger_rule set to "one_failed" that handles failed documents differently (e.g. moves them to a "failed" folder and sends a notification). A minimal sketch of both trigger rules is below.
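For illustration only (the task names and callables here are hypothetical), this is roughly how those trigger rules could be wired, using Airflow 1.x-style imports:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def process_document(**context):
    # hypothetical per-document processing step
    pass


def handle_failures(**context):
    # hypothetical: move failed documents to a "failed" folder, notify, etc.
    pass


with DAG('trigger_rule_example', start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    process = PythonOperator(task_id='process', python_callable=process_document)

    # runs regardless of whether 'process' succeeded or failed
    downstream = DummyOperator(task_id='downstream', trigger_rule='all_done')

    # runs only if at least one upstream task failed
    on_failure = PythonOperator(task_id='on_failure',
                                python_callable=handle_failures,
                                trigger_rule='one_failed')

    process >> [downstream, on_failure]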
I would be making 1440 Dagruns per day.
With a good Airflow architecture, this is quite possible.
Choking points might be:
executor - use the Celery Executor instead of the Local Executor, for example
backend database - monitor and tune as necessary (indexes, proper storage, etc.)
webserver - for thousands of DagRuns, tasks, etc., perhaps only use the webserver for dev/QA environments, and not for production where you have a higher rate of task/DagRun submissions. You could use the CLI instead.
Another approach is scaling out by running multiple Airflow instances - partition the documents into, say, ten buckets, and assign each partition's documents to just one Airflow instance (a rough sketch of the bucketing is below).
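A rough sketch of the bucketing idea, assuming nothing Airflow-specific (the hashing scheme and instance count are just illustrative assumptions):

import hashlib

NUM_INSTANCES = 10  # assumption: ten Airflow instances, one per partition

def pick_instance(document_id):
    # stable hash so the same document always routes to the same instance
    digest = hashlib.md5(document_id.encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_INSTANCES

instance = pick_instance('document-123')
# then trigger the DAG only on that instance, e.g. via that instance's own
# `airflow trigger_dag` CLI or REST endpoint
print('route document-123 to Airflow instance #{}'.format(instance))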
I'd process the heavier tasks in parallel and feed successful operations downstream. As far as I know, you can't feed successes asynchronously to downstream tasks, so you would still need to wait for every thread to finish before moving downstream, but this would still be far more acceptable than spawning one DAG per record. Something along these lines:
Task 1: read Mongo, filtering by some timestamp (remember idempotence), and feed the downstream tasks (e.g. via XCom);
Task 2: do stuff in parallel via a PythonOperator, or even better via a K8s Pod, e.g.:
import threading
from queue import Queue


def thread_fun(ret):
    # each worker drains jobs from the shared queue
    while not job_queue.empty():
        job = job_queue.get()
        try:
            ret.append(stuff_done(job))  # stuff_done: your per-record processing
        except Exception:
            pass
        job_queue.task_done()
    return ret


# Create workers and queue
threads = []
ret = []  # a mutable object collecting results from all workers
job_queue = Queue(maxsize=0)

for thr_nr in range(appropriate_thread_nr):  # appropriate_thread_nr: chosen thread count
    worker = threading.Thread(target=thread_fun, args=(ret,))
    worker.daemon = True
    threads.append(worker)

# Populate queue with jobs coming from the upstream task
for row in xcom_pull(task_ids=upstream_task):
    job_queue.put(row)

# Start threads
for thr in threads:
    thr.start()

# Wait for them to finish their jobs
for thr in threads:
    thr.join()

xcom_push(ret)
Task 3: do more stuff with the output of the previous task, and so on.
We have built a system that queries MongoDB for a list and generates a Python file per item, each containing one DAG (note: having each DAG in its own Python file helps Airflow scheduler efficiency, given its current design). The generator DAG runs hourly, right before the scheduled hourly run of all the generated DAGs.
I am trying to see if Airflow is a good fit for this scenario. At present, I have a DAG that looks for a trigger file on S3, creates an EMR cluster, submits a Spark job, and then deletes the EMR cluster.
My requirement is to convert this into an on-demand run. There will be many users running the export from the application. For each export run, I will have to call this DAG, which means more than one instance of the same DAG will be running at the same time.
I know we can make an API call to trigger a DAG, but I am not sure if we can run more than one instance of a DAG at the same time. Has anyone had a similar use case?
I am handling this with max_active_runs:
dag = DAG(
    'dev_clickstream_v1',
    max_active_runs=5,
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=2),
    params=PARAMS
)
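To actually kick off the on-demand runs, each export can call the trigger endpoint. A rough sketch follows; the URL, auth and payload are assumptions that depend on your deployment and Airflow version (Airflow 2 uses /api/v1/dags/{dag_id}/dagRuns instead of the experimental API shown here):

import json
import requests

AIRFLOW_URL = 'http://localhost:8080'  # assumption: adjust to your webserver

def trigger_export(export_id):
    # Airflow 1.10 experimental API; depending on your api auth backend
    # you may also need to send authentication headers.
    endpoint = '{}/api/experimental/dags/dev_clickstream_v1/dag_runs'.format(AIRFLOW_URL)
    payload = {'conf': json.dumps({'export_id': export_id})}
    response = requests.post(endpoint, json=payload)
    response.raise_for_status()
    return response.json()

# each user-initiated export becomes its own DAG run; up to max_active_runs
# of them execute concurrently, the rest stay queued
trigger_export('export-001')
trigger_export('export-002')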
I read the API reference and couldn't find anything on it. Is that possible?
Currently, there is no feature that does this out of the box, but you can write some custom code in your DAG to get around it. For example, use a PythonOperator (or a MySQL operator, if your metadata DB is MySQL) to get the status of the last X runs of the DAG.
Then use a BranchPythonOperator to see if the number is more than X; if it is, use a BashOperator to run the airflow pause CLI command.
You can also collapse this into two steps by putting the PythonOperator's logic inside the BranchPythonOperator. This is just one idea; you can use different logic. A rough sketch is below.
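A minimal sketch of that idea, assuming Airflow 1.x-style imports and CLI (the DAG id, threshold and task names are made up; DagRun.find queries the metadata DB through the ORM rather than raw MySQL):

from datetime import datetime
from airflow import DAG
from airflow.models import DagRun
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.utils.state import State

MAX_RUNS = 10  # hypothetical threshold


def check_run_count():
    # count successful runs of this DAG in the metadata DB
    runs = DagRun.find(dag_id='my_limited_dag', state=State.SUCCESS)
    if len(runs) >= MAX_RUNS:
        return 'pause_dag'
    return 'keep_going'


with DAG('my_limited_dag', start_date=datetime(2021, 1, 1),
         schedule_interval='@daily', catchup=False) as dag:
    branch = BranchPythonOperator(task_id='check_run_count',
                                  python_callable=check_run_count)

    # pause the DAG via the CLI once the threshold is reached
    # (in Airflow 2 the command is `airflow dags pause my_limited_dag`)
    pause_dag = BashOperator(task_id='pause_dag',
                             bash_command='airflow pause my_limited_dag')

    keep_going = DummyOperator(task_id='keep_going')

    branch >> [pause_dag, keep_going]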
I am using Airflow 1.9.0 with a custom SFTPOperator. I have code in my DAGs that polls an SFTP site to find new files. If any are found, I create custom task IDs for the dynamically created tasks and retrieve/delete the files.
directory_list = sftp_handler('sftp-site', None, '/', None, SFTPToS3Operation.LIST)

for file_path in directory_list:
    # ... SFTP code that GETs the remote files
That part works fine. It seems both the Airflow webserver and the Airflow scheduler are iterating through all the DAGs once a second and actually running the code that retrieves the directory_list. This means I'm hitting the SFTP site ~2x a second to authenticate and pull a list of files. I'd like to have some conditional code that only executes if the DAG is actually being run.
When an SFTP site uses password authentication, the number of times I connect really isn't an issue. One site requires key authentication, and if there are too many authentication failures in a short time span, the account is locked. During my testing, this seems to happen occasionally for reasons I'm still trying to track down.
However, if I were authenticating only when the DAG was scheduled to execute, or was executed manually, this would not be an issue. It also seems wasteful to spend so much time connecting to an SFTP site when it's not scheduled to do so.
I've seen a post that can check to see if a task is executing, but that's not ideal as I'd have to create a long-running task, using up resources I shouldn't require, just to perform that test. Any thoughts on how to accomplish this?
You have a very good use case for Airflow (SFTP to _____ batch jobs), but Airflow is not meant for dynamic DAGs as you are attempting to use them.
Top-Level DAG Code and the Scheduler Loop
As you noticed, any top-level code in a DAG is executed with each scheduler loop. Or put another way, every time the scheduler loop processes the files in your DAG directory it is interpreting all the code in your DAG files. Anything not in a task or operator is interpreted/executed immediately. This puts undue strain on the scheduler as well as any external systems you are making calls to.
Dynamic DAGs and the Airflow UI
Airflow does not handle dynamic DAGs through the UI well. This is mostly the result of the Airflow DAG state not being stored in the database. DAG views and history are rendered based on what exists in the interpreted DAG file at any given moment. I personally hope to see this change in the future with some form of DAG versioning.
In a dynamic DAG you can both add and remove tasks from a DAG.
Adding Tasks Dynamically
Adding a task to a DAG will make it appear (in the UI) in all previous DAG runs, even though those runs never actually ran that task. Those task instances will have a None state, and the DAG run will be set to success or failed depending on the outcome of the DAG run.
Removing Tasks Dynamically
If your dynamic DAG ever removes tasks you will lose the ability to review history of the DAG. For example, if you run a DAG with task_x in the first 20 DAG runs but remove it after that, it will fail to show up in the UI until it is added back into the DAG.
Idempotency and Airflow
Airflow works best when the DAG runs are idempotent. This means that re-running any DAG run should have the same effect no matter when you run it or how many times you run it. Dynamic DAGs in Airflow break idempotency by adding and removing tasks from previous DAG runs, so that the results of re-running are not the same.
Solution Options
You have at least two options moving forward:
1.) Continue to build your SFTP DAG dynamically, but create another DAG that writes the available SFTP files to a local file (if you are not using a distributed executor) or to an Airflow Variable (this will result in more reads to the Airflow DB), and build your DAG dynamically from that.
2.) Overload the SFTPOperator to take a list of files so that every file that exists is processed within a single task run. This will make the DAGs idempotent, and you will maintain accurate history through the logs (a rough sketch of this is below).
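As a rough illustration of option 2 (the operator name and its arguments are hypothetical, since the SFTPOperator in the question is custom):

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class SFTPBatchOperator(BaseOperator):
    """Hypothetical operator: processes every available file in one task run."""

    @apply_defaults
    def __init__(self, sftp_conn_id, remote_paths, *args, **kwargs):
        super(SFTPBatchOperator, self).__init__(*args, **kwargs)
        self.sftp_conn_id = sftp_conn_id
        self.remote_paths = remote_paths  # a list of files, or a callable returning one

    def execute(self, context):
        # connect only at execution time, never during scheduler parsing
        paths = self.remote_paths() if callable(self.remote_paths) else self.remote_paths
        for path in paths:
            self.log.info("Processing %s", path)
            # ... GET / upload / delete the remote file here ...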
I apologize for the extended explanation, but you're touching on one of the rough spots of Airflow and I felt it was appropriate to give an overview of the problem at hand.