Airflow - branching without joining

I am still very new to airflow and am trying to make a workflow that only processes data on business days. The flow logic should be:
if date.is_busday():
    process_data()
else:
    do_nothing()  # but mark the run as successful or skipped
I used the BranchPythonOperator but did not join the two branches at the end. If it is not a business day, I have it run a dummy operator branch. Otherwise, it goes on to a normal operator branch. Here is what the flow looks like:
DAG diagram
I am able to successfully skip non-business days using this method. However, I get this warning: WARNING - 'No DAG RUN present this should not happen'
Additionally, when I set SLAs, I started getting emails for all the dummy tasks that were never run.
When I started reading about branch operators, all the examples I found eventually joined the branches (branch documentation, ex. 1, ex. 2, ex. 3). This question is the closest I found to mine, but it seems to have been solved by joining the branches: similar question. This question doesn't join the branches, but its use case seems unique and complicated: ex. 4.
Since I get warnings and cannot find any examples of un-joined branches, I am now doubting my approach. My questions are:
What is the best/simplest way to make this work in airflow?
If my approach is correct, how do I get rid of the warning message or is it just something I have to live with?
If I had to join the branches, could I join the load_data branch and the dummy branch with another dummy branch? This seems kind of absurd to me but could it be the right way to avoid all the warnings?
Thanks in advance.

I would suggest you use the ShortCircuitOperator instead of the branch operator.
The branch operator always expects at least two downstream tasks.
The ShortCircuitOperator lets you evaluate the condition: if it is a business day, the flow continues; if not, the ShortCircuitOperator stops the flow and downstream tasks are skipped.
ShortCircuitOperator also expects a python callable; that function can simply return True or False to continue or exit.
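For illustration, here is a minimal sketch of that pattern; it assumes Airflow 2 import paths, uses numpy's is_busday for the business-day check, and the DAG/task names are made up:

from datetime import datetime
import numpy as np
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator

def is_business_day(ds, **_):
    # ds is the run's logical date as a "YYYY-MM-DD" string;
    # returning False makes the ShortCircuitOperator skip everything downstream
    return bool(np.is_busday(ds))

def process_data():
    print("processing data")

with DAG("business_days_only", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    check = ShortCircuitOperator(task_id="check_business_day",
                                 python_callable=is_business_day)
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    check >> process

On non-business days the check task returns False, the downstream task is marked skipped, and the DAG run still ends successfully, so no joining dummy branch is needed.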

Related

In Airflow, what do they want you to do instead of passing data between tasks?

In the docs, they say that you should avoid passing data between tasks:
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.
I fundamentally don't understand what they mean. If there's no data to pass between two tasks, why are they part of the same DAG?
I've got half a dozen different tasks that take turns editing one file in place, and each send an XML report to a final task that compiles a report of what was done. Airflow wants me to put all of that in one Operator? Then what am I gaining by doing it in Airflow? Or how can I restructure it in an Airflowy way?
Fundamentally, each instance of an operator in a DAG is mapped to a different task.
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator.
The sentence above means that if some information needs to be shared between two different tasks, it is best to combine them into one task instead of using two different tasks; on the other hand, if you must use two different tasks and you need to pass some information from one to the other, you can do it using Airflow's XCom, which is similar to a key-value store.
In a data engineering use case, checking the file schema before processing is important. Imagine two tasks as follows:
Files_Exist_Check: the purpose of this task is to check whether particular files exist in a directory or not before continuing.
Check_Files_Schema: the purpose of this task is to check whether the file schema matches the expected schema or not.
It would only make sense to start your processing if the Files_Exist_Check task succeeds, i.e. you have some files to process.
In this case, you can "push" a key such as "file_exists" to XCom in the Files_Exist_Check task, with the value being the count of files present in that particular directory.
Now, you "pull" this value using the same key in the Check_Files_Schema task; if it returns 0 then there are no files for you to process, so you can raise an exception and fail the task, or handle it gracefully.
Hence, sharing information across tasks using XCom does come in handy in this case.
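A rough sketch of that push/pull pattern (the directory path and DAG name are made up, and the import paths assume Airflow 2):

import os
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def files_exist_check(ti, **_):
    count = len(os.listdir("/data/incoming"))      # hypothetical landing directory
    ti.xcom_push(key="file_exists", value=count)   # push the file count under a key

def check_files_schema(ti, **_):
    count = ti.xcom_pull(task_ids="Files_Exist_Check", key="file_exists")
    if not count:
        raise ValueError("No files to process")
    # ... actual schema validation would go here ...

with DAG("file_pipeline", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    exist_check = PythonOperator(task_id="Files_Exist_Check",
                                 python_callable=files_exist_check)
    schema_check = PythonOperator(task_id="Check_Files_Schema",
                                  python_callable=check_files_schema)
    exist_check >> schema_check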
You can refer to the following link for more info:
https://www.astronomer.io/guides/airflow-datastores/
Airflow - How to pass xcom variable into Python function
What you have to do to avoid having everything in one operator is to save the data somewhere. I don't quite understand your flow, but if, for instance, you want to extract data from an API and insert it into a database, you would need (as sketched below):
A PythonOperator (or BashOperator, whatever) that takes the data from the API and saves it to S3/local file/Google Drive/Azure Storage...
A SQL-related operator that takes the data from the storage and inserts it into the database
Anyway, if you know which files you are going to edit, you may also use Jinja templates or read info from a text file and build a loop or something in the DAG. I could help you more if you clarify your actual flow a little bit.
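For illustration only, a minimal sketch of that two-step pattern; the URL, file path and DAG name are placeholders, and in a real pipeline the second step would typically be a SQL-related or transfer operator writing to your database rather than a print statement:

import json
from datetime import datetime
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_to_file():
    # Step 1: pull from the API and stage the data somewhere durable
    data = requests.get("https://example.com/api/data").json()
    with open("/tmp/api_data.json", "w") as handle:
        json.dump(data, handle)

def load_to_db():
    # Step 2: stand-in for a SQL-related operator that loads the staged file
    with open("/tmp/api_data.json") as handle:
        rows = json.load(handle)
    print(f"would insert {len(rows)} rows here")

with DAG("extract_then_load", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract_api_to_storage", python_callable=fetch_to_file)
    load = PythonOperator(task_id="load_storage_to_db", python_callable=load_to_db)
    extract >> load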
I've decided that, as mentioned by @Anand Vidvat, they are making a distinction between Operators and Tasks here. What I think is that they don't want you to write two Operators that inherently need to be paired together and pass data to each other. On the other hand, it's fine to have one task use data from another; you just have to provide filenames etc. in the DAG definition.
For example, many of the builtin Operators have constructor parameters for files, like the S3FileTransformOperator. Confusing documentation, but oh well!

Airflow dynamic dag creation

Could someone please tell me whether a DAG in Airflow is just a graph (like a placeholder) without any actual data (like arguments) associated with it, OR whether a DAG is like an instance (for a fixed set of arguments)?
I want a system where the set of operations to perform (given a set of arguments) is fixed, but the input will be different every time the set of operations is run. In simple terms, the pipeline is the same but the arguments to the pipeline will be different every time it is run.
I want to know how to configure this in Airflow. Should I create a new DAG for every new set of arguments, or is there another method?
In my case, the graph is the same but I want to run it on different data (from different users) as it comes in. So, should I create a new DAG every time for new data?
Yes, you are correct; a DAG is basically a kind of one-way graph. You can create a DAG once by chaining multiple operators together to form your "structure".
Each operator can then take in multiple arguments that you can pass from the DAG definition file itself (if needed).
Or you can pass a configuration object to the DAG and access custom data from it using the context (sketched below).
I would recommend reading the Airflow Docs for more examples: https://airflow.apache.org/concepts.html#tasks
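As an illustration of the second option, a DAG can read per-run parameters from dag_run.conf; this sketch assumes Airflow 2 and the parameter name user_id is made up:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def process(**context):
    # conf holds whatever dictionary was passed when the run was triggered
    conf = context["dag_run"].conf or {}
    user_id = conf.get("user_id", "unknown")   # hypothetical per-run argument
    print(f"processing data for user {user_id}")

with DAG("parameterised_pipeline", start_date=datetime(2023, 1, 1),
         schedule_interval=None) as dag:
    PythonOperator(task_id="process", python_callable=process)

A run with arguments could then be triggered with something like: airflow dags trigger parameterised_pipeline --conf '{"user_id": "abc"}'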
You can think of Airflow DAG as a program made of other programs, with the exception that it can't contain loops(acyclic). Will you change your program every time input data changes? Of course, it all depends on how you write your program, but usually you'd like you program to generalise, right? You don't want two different programs to do 2+2 and 3+3. But you'll have different programs to show Facebook pages and to play Pokemon Go. If you want to do the same thing to a similar data then you want to write your DAG once, and maybe only change environment arguments(DB connection, date, etc) - Airflow is perfectly suitable for that.
You do not need to create a new DAG every time, if the structure of the graph is the same.
Airflow DAGs are created via code, so you are free to create a code structure that allows you to pass in arguments each time. How you do that will require some creative thinking.
You could, for example, create a web form that accepts the arguments, stores them in a DB and then schedules the DAG with the Airflow REST API. The DAG code would then need to be written to retrieve the params from the database.
There are several other ways to accomplish what you are asking, they all just depend on your use case.
One caveat: the Airflow scheduler does not perform well if you change the start date of the DAG. For your idea above you will need to set the start date earlier than your first DAG run and then set the schedule interval to None (i.e. off). This way you have a start date that doesn't change and dynamically triggered DAG runs.
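As a hedged sketch of that idea, the Airflow 2 stable REST API lets you create a DAG run with a conf payload; the URL, credentials and parameter names below are placeholders, and the API requires an auth backend to be enabled:

import requests

# POST /api/v1/dags/{dag_id}/dagRuns creates a new, externally triggered run
response = requests.post(
    "http://localhost:8080/api/v1/dags/parameterised_pipeline/dagRuns",
    auth=("admin", "admin"),              # placeholder basic-auth credentials
    json={"conf": {"user_id": "abc"}},    # hypothetical per-run arguments
)
response.raise_for_status()
print(response.json())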

Best way to copy tasks from one DAG into another?

Say I have two pre-existing DAGs, A and B. Is it possible in Airflow to "copy" all tasks from B into A, preserving dependencies and propagating default arguments and the like for all of B's tasks? With the end goal being to have a new DAG A' that contains all of the tasks of both A and B.
I understand that it might not be possible or feasible to reconcile DAG-level factors, e.g. scheduling, and propagate across the copying, but is it possible to at least preserve the dependency orderings of the tasks such that each task runs as expected, when expected—just in a different DAG?
If it is possible, what would be the best way to do so? If it's not supported, is there work in progress to support this sort of "native DAG composition"?
UPDATE-1
Based on the clarification of the question, I infer that the requirement is not to replicate a DAG inside another but to append it after another DAG.
The techniques mentioned in the original answer below are still applicable (to a variable extent).
But for this specific use-case there are a few more options:
i. Use TriggerDagRunOperator: invoke your 2nd DAG at the end of the 1st DAG (see the sketch after this list)
ii. Use SubDagOperator: wrap your 2nd DAG into a SubDAG and attach it at the end of the 1st DAG
But do check out the Wiring top-level DAGs together thread (question / answer plus comments) for ideas / loopholes in each of the above-mentioned techniques.
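A minimal sketch of option (i), assuming Airflow 2 import paths; the dag_ids and task names are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG("dag_a", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    last_task_of_a = DummyOperator(task_id="last_task_of_a")   # stands in for DAG A's final task
    trigger_b = TriggerDagRunOperator(task_id="trigger_dag_b",
                                      trigger_dag_id="dag_b")  # dag_id of the 2nd DAG
    last_task_of_a >> trigger_b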
ORIGINAL ANSWER
I can think of 3 possible ways:
The recommended way would be to programmatically construct your DAG. In other words, if possible, iterate over a list of configs (each config for one task) read from an external source (such as an Airflow Variable, a database or JSON files) and build your DAG as per your business logic. Here, you'll just have to alter the dag_id and you can re-use the same script to build an identical DAG to your original one (a minimal sketch follows after this list).
A modification of the 1st approach above is to generalize your DAG-construction logic by employing a simple idea like ajbosco/dag-factory or a full-fledged wrapper framework like etsy/boundary-layer.
Finally, if none of the above approaches is easily adaptable for you, then you can hand-code the task-replication logic to regenerate the same structure as your original DAG. You can write a single robust script and re-use it across your entire project to replicate DAGs as and when needed. Here you'll have to go through DAG traversal and some traditional data-structure and algorithmic work. Here's an example of BFS-like traversal over the tasks of an Airflow DAG.
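As a minimal sketch of the first approach (the config list, task ids and callables here are purely illustrative):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# In practice this list would come from an Airflow Variable, a database or JSON files
task_configs = [
    {"task_id": "extract", "message": "extracting"},
    {"task_id": "transform", "message": "transforming"},
    {"task_id": "load", "message": "loading"},
]

with DAG("generated_dag", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    previous = None
    for cfg in task_configs:
        task = PythonOperator(
            task_id=cfg["task_id"],
            python_callable=lambda msg=cfg["message"]: print(msg),
        )
        if previous:
            previous >> task   # chain tasks in config order
        previous = task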

How can one set a variable for use only during a certain dag_run

How do I set a variable for use during a particular dag_run? I'm aware of setting values in XCom, but not all the operators that I use have XCom support. I also would not like to store the value in the Variables datastore, in case another DAG run that needs to store different values begins while the current one is running.
The question is not entirely clear, but from what I can infer, I'll try to clear up your doubts.
not all the operators that I use have XCom support
Apparently you've mistaken XCom for some other construct, because the XCom feature is part of TaskInstance, and the functions xcom_push() and xcom_pull() are defined in BaseOperator itself (which is the parent of all Airflow operators).
I also would not like to store the value in the Variables datastore, in case another DAG run that needs to store different values begins while the current one is running.
It is straightforward (and a no-brainer) to separate out Variables on a per-DAG basis (see point (6)); but yes, for different DagRuns of a single DAG, this kind of isolation would be a challenge. I can think of XCom as the easiest workaround for this. Have a look at this for some insights on the usage of XComs.
Additionally, if you want to manipulate Variables (or any other Airflow model) at runtime (though I would recommend you avoid it, particularly for Variables), Airflow also gives you complete liberty to exploit the underlying SQLAlchemy ORM framework for that. You can take inspiration from this.
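For example, a hedged sketch of reading per-DAG-prefixed Variables through the underlying SQLAlchemy session (the key prefix is made up, and the internal schema can change between Airflow versions, so treat this as a last resort):

from airflow import settings
from airflow.models import Variable

session = settings.Session()
try:
    # Query only the Variables that follow a per-DAG key-prefix convention
    rows = session.query(Variable).filter(Variable.key.like("my_dag__%")).all()
    for row in rows:
        print(row.key, row.val)
finally:
    session.close()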

Spark inconsistency when running count command

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:
imp_sample.where(col("location").isNotNull()).count()
And I am getting slightly different results every time I run it (141,830, then 142,314)!
Or this:
imp_sample.where(col("location").isNull()).count()
and getting 2,587,013, and then 2,586,943. How is it even possible?
Thank you!
As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fractions of rows; it takes a sample where each record's probability of being included equals the given fraction, so the result can vary from run to run.
Regarding your monotonically_increasing_id question in the comments, it only guarantees that the next id is larger than the previous one; it doesn't guarantee that the ids are consecutive (i, i+1, i+2, etc.).
Finally, you can persist a DataFrame by calling persist() on it.
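A small, self-contained illustration of that behaviour (the column and fraction values are arbitrary); without a seed, two identical sampleBy calls can return different counts, while fixing a seed makes a given draw reproducible on the same data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.range(100000).withColumn("bucket", (col("id") % 2).cast("string"))

# Two identical calls without a seed may give slightly different counts
print(df.sampleBy("bucket", fractions={"0": 0.5, "1": 0.5}).count())
print(df.sampleBy("bucket", fractions={"0": 0.5, "1": 0.5}).count())

# With a fixed seed, the draw is reproducible on the same data
print(df.sampleBy("bucket", fractions={"0": 0.5, "1": 0.5}, seed=42).count())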
Ok, I have suffered majorly from this in the past. I had a seven- or eight-stage pipeline that normalised a couple of tables, added ids, joined them and grouped them. Consecutive runs of the same pipeline gave different results, although not in any coherent pattern I could understand.
Long story short, I traced this behaviour to my usage of the function monotonically_increasing_id, supposedly resolved by this JIRA ticket, but still evident in Spark 2.2.
I do not know exactly what your pipeline does, but please understand that my fix is to force Spark to persist the results after calling monotonically_increasing_id. I never saw the issue again after I started doing this.
Let me know if a judicious persist resolves this issue.
To persist an RDD or DataFrame, call either df.cache (which defaults to in-memory persistence) or df.persist([some storage level]), for example
df.persist(StorageLevel.DISK_ONLY)
Again, it may not help you, but in my case it forced Spark to flush out and write id values which were behaving non-deterministically given repeated invocations of the pipeline.
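A hedged sketch of that fix, reusing the imp_sample DataFrame from the question (the new column name is arbitrary): persist immediately after assigning the ids and materialise the result, so later actions reuse the same values instead of recomputing them:

from pyspark import StorageLevel
from pyspark.sql import functions as F

df_with_ids = imp_sample.withColumn("row_id", F.monotonically_increasing_id())
df_with_ids.persist(StorageLevel.DISK_ONLY)
df_with_ids.count()   # force materialisation so the ids are fixed for subsequent actions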
