Airflow Broken DAG Message - Extra data - airflow

I am trying to get a DAG to populate in Airflow but am getting the following error message in the Airflow UI. What does this error mean and how can it be fixed?
Broken DAG: [/usr/local/airflow/dags/finance_data_dag.py] Extra data: line 3 column 12 (char 41)
NOTE: Line 3 of the finance_data_dag.py file contains only the following Python import statement, so I'm not sure what "line 3 column 12 (char 41)" even refers to.
import numpy as np
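The "Extra data: line … column … (char …)" wording is the message format of Python's json parser, so it most likely points at a JSON document being parsed while the DAG file is imported (for example a config file read at module level), not at the Python source itself. A minimal sketch with hypothetical data (not from the question) that reproduces the message format:
import json

# json.loads raises JSONDecodeError with the same "Extra data" wording when
# a document has trailing content after the first valid JSON value.
try:
    json.loads('{"ticker": "AAPL"}\n{"ticker": "MSFT"}')
except json.JSONDecodeError as e:
    print(e)  # Extra data: line 2 column 1 (char 19)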

Related

How does Airflow determine when to re-import a DAG file?

I'm seeing some interesting behavior in how Airflow treats an existing DAG file. For example, I have a DAG file like this:
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
In the code above, DagGenerator.generate() returns a dictionary with two items:
{
    'my_dag': DAG(...),
    'my_op': DummyOperator(...),
}
Technically, this code should be equivalent to:
# my_pipeline_equivalent.py
my_dag = DAG(...)
my_op = DummyOperator(...)
For some reason, the file my_pipeline.py is not being picked up by Airflow while my_pipeline_equivalent.py can be picked up without any problem.
However, if I add the following code in my_pipeline.py, both dags (i.e. my_dag and my_dag2) can be picked up.
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
# newly added
from airflow import DAG
from datetime import datetime
from airflow.operators.dummy_operator import DummyOperator
x = DAG(
    'my_dag2',
    schedule_interval="@daily",
    start_date=datetime(2021, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,  # If a task fails, it will retry 2 times.
    },
    tags=['example'],
)
b = DummyOperator(task_id='d3', dag=x)
What makes this stranger is that if I now comment out the newly added part, as below, my_dag can still be picked up while my_dag2 is gone.
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
# newly added
#from airflow import DAG
#from datetime import datetime
#from airflow.operators.dummy_operator import DummyOperator
#x = DAG(
#    'my_dag2',
#    schedule_interval="@daily",
#    start_date=datetime(2021, 1, 1),
#    catchup=False,
#    default_args={
#        "retries": 2,  # If a task fails, it will retry 2 times.
#    },
#    tags=['example'],
#)
#b = DummyOperator(task_id='d3', dag=x)
However, if I actually delete the commented code, my_dag is gone as well. To get my_dag back, I have to add the my_dag2 code back without the comments (the commented my_dag2 code, which worked previously, doesn't work now).
Could anyone help me understand what's going on here? If I remember correctly, Airflow has some logic to determine when/if to import or re-import a Python file. Does anyone know where that logic lives in the code?
Thanks.
Some additional findings: once I have both DAGs (my_dag and my_dag2) being picked up, if I change my_pipeline.py to keep only the DAG import, my_dag can still be picked up even though the import is commented out, but I cannot remove that line. If I do remove it, my_dag is gone again.
# my_pipeline.py
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
# from airflow import DAG
My guess is that Airflow must be reading the Python file and looking for that import as a string.
If you follow the steps that Airflow takes to process the files, you will get to this line of code:
return all(s in content for s in (b'dag', b'airflow'))
This means that the Airflow file processor will ignore files that don't contain both of the strings dag and airflow.
Since your module already contains the word dag, you can get Airflow to process it by adding a comment at the beginning that contains the word airflow:
# airflow
from dag_gen import DagGenerator
g = globals()
g.update(DagGenerator.generate())
Or just rename the class DagGenerator to AirflowDagGenerator to solve the problem in all the modules.
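To see the heuristic in action locally, here is a minimal sketch of the check (mirroring the line quoted above; Airflow's actual function also handles zip files and lower-cases the content in recent versions):
def might_contain_dag(path):
    # Airflow's safe-mode heuristic: only parse files whose raw bytes contain
    # both b"dag" and b"airflow".
    with open(path, "rb") as f:
        content = f.read().lower()
    return all(s in content for s in (b"dag", b"airflow"))

print(might_contain_dag("dags/my_pipeline.py"))  # False until "airflow" appears somewhere in the file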

Airflow: Importing decorated Task vs all tasks in a single DAG file?

I recently started using Apache Airflow and one of its newer concepts, the TaskFlow API. I have a DAG with multiple decorated tasks, where each task has 50+ lines of code, so I decided to move each task into a separate file.
After referring to Stack Overflow I was able to move the tasks in the DAG into a separate file per task. Now, my questions are:
Do both of the code samples shown below work the same? (I am worried about the scope of the tasks.)
How will they share data between them?
Is there any difference in performance? (I read that SubDAGs are discouraged due to performance issues; this is not a SubDAG, I'm just concerned.)
All the code samples I see on the web (and in the official documentation) put all the tasks in a single file.
Sample 1
import logging
from airflow.decorators import dag, task
from datetime import datetime

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def No_Import_Tasks():
    # Task 1
    @task()
    def Task_A():
        logging.info("Task A: Received param None")
        # Some 100 lines of code
        return "A"

    # Task 2
    @task()
    def Task_B(a):
        logging.info(f"Task B: Received param {a}")
        # Some 100 lines of code
        return str(a + "B")

    a = Task_A()
    ab = Task_B(a)

No_Import_Tasks = No_Import_Tasks()
Sample 2 Folder structure:
- dags
  - tasks
    - Task_A.py
    - Task_B.py
  - Main_DAG.py
File Task_A.py
import logging
from airflow.decorators import task

@task()
def Task_A():
    logging.info("Task A: Received param None")
    # Some 100 lines of code
    return "A"
File Task_B.py
import logging
from airflow.decorators import task

@task()
def Task_B(a):
    logging.info(f"Task B: Received param {a}")
    # Some 100 lines of code
    return str(a + "B")
File Main_Dag.py
from airflow.decorators import dag
from datetime import datetime
from tasks.Task_A import Task_A
from tasks.Task_B import Task_B

default_args = {"owner": "airflow", "start_date": datetime(2021, 1, 1)}

@dag(default_args=default_args, schedule_interval=None)
def Import_Tasks():
    a = Task_A()
    ab = Task_B(a)

Import_Tasks_dag = Import_Tasks()
Thanks in advance!
There is virtually no difference between the two approaches, neither from a logic nor a performance point of view.
Tasks in Airflow share data between them using XCom (https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html), effectively exchanging data via the database (or other external storage). Two tasks in Airflow, no matter whether they are defined in one file or many, can be executed on completely different machines (there is no task affinity in Airflow; each task execution is totally separate from the others), so again it does not matter whether they are in one or many Python files.
Performance should be similar. Maybe splitting into several files is a tiny bit slower, but that should be totally negligible and possibly not measurable at all; it depends on your deployment, the way you distribute files, and so on, but I cannot imagine it having any observable impact.
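To make the data-sharing point concrete, here is a minimal sketch (assumed names, not from the answer): a TaskFlow task's return value is pushed to XCom under the key return_value, and the downstream task receives it as an argument regardless of which file the functions live in.
from airflow.decorators import dag, task
from datetime import datetime

@dag(schedule_interval=None, start_date=datetime(2021, 1, 1), catchup=False)
def xcom_demo():
    @task()
    def produce():
        # The return value is pushed to XCom (key="return_value") by the TaskFlow API.
        return "A"

    @task()
    def consume(a):
        # The value is pulled from XCom and passed in, even if produce() ran on another worker.
        print(a + "B")

    consume(produce())

xcom_demo_dag = xcom_demo()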

How to capture the log details on pytest-html as well as writing in to Console?

In my pytest script, I need to customize the pytest-html report to capture stdout while also writing to the console, because my automated test requires user input.
test_TripTick.py
import os
import sys
import pytest
from Process import RunProcess
from recordtype import recordtype
from pip._vendor.distlib.compat import raw_input

@pytest.fixture(scope="module")
def Process(request):
    # print('\nProcess setup - module fixture')
    fileDirectory = os.path.abspath(os.path.dirname(__file__))
    configFilePath = os.path.join(fileDirectory, 'ATR220_Config.json')
    process = RunProcess.RunProcess()
    process.SetConfigVariables(configFilePath)
    process.GetComPort(["list=PID_0180"])
    def fin():
        sys.exit()
    request.addfinalizer(fin)
    return process

def test_WipeOutReader(Process):
    assert Process.WipeOutTheReader() == True

def test_LoadingKeysIntoMemoryMap(Process):
    assert Process.LoadingKeysIntoMemoryMap() == True

def test_LoadingFW(Process):  # only use bar
    assert Process.LoadingFW() == True

def test_LoadingSBL(Process):
    assert Process.LoadingSBL() == True

def test_CCIDReadForPaymentCards(Process):
    assert Process.CCIDReadWrite('Payment Card') == True
Currently, if I run the following command from the Windows command line, I get output on the console but no captured output in the HTML report.
pytest C:\Projects\TripTickAT\test_TripTick.py -s --html=Report.html --verbose
Also, I would like to know a programmatic way of customizing the HTML report, where I can change the file name, order tests by execution time, and capture stdout.
I have tried additional flags with the pytest command: --capture=sys together with -rP for passed tests, and --capture=sys together with -rF for failed tests.
With these I can see the console log in the HTML document after clicking the Show all details button; the grey area of the report is the console output. I am not sure of a command-line flag that works regardless of whether tests passed or failed, but here is a temporary solution that prints logs to the command-line console as well as to the HTML report.
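A sketch of one common setup (assumed, not the answerer's original code): enable pytest's live logging so log records are echoed to the console while each test's captured log still appears in the pytest-html report, for passed and failed tests alike.
# run with: pytest test_TripTick.py -o log_cli=true --log-cli-level=INFO --html=Report.html
import logging

logger = logging.getLogger(__name__)

def test_example():
    logger.info("visible in the console (live log) and in the report's captured log")
    assert True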

airflow TimeDeltaSensor fails with unsupported operand type

In my DAG I have a TimeDeltaSensor created using:
from datetime import datetime, timedelta
from airflow.operators.sensors import TimeDeltaSensor

wait = TimeDeltaSensor(
    task_id='wait',
    delta=timedelta(seconds=300),
    dag=dag
)
However, when it runs I get the following error:
Subtask: [2018-07-13 09:00:39,663] {models.py:1427} ERROR - unsupported operand type(s) for +=: 'NoneType' and 'datetime.timedelta'
Airflow version is 1.8.1.
The code is basically lifted from Example Pipeline definition so I'm nonplussed as to what the problem could be. Any ideas?
Looking into the source code you linked, there is one line that strikes me as interesting in this case:
target_dttm = dag.following_schedule(context['execution_date'])
This means that if you haven't set up a proper DAG schedule, this sensor will try to add its time delta to None.
I am not sure if the code in the question is just an example or the whole thing. My suggestion: give the DAG a schedule_interval other than None.
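As a sketch of that suggestion (the dag_id and the daily schedule are assumptions, using the same 1.8-era imports as the question): with a real schedule_interval, dag.following_schedule(execution_date) returns a datetime instead of None, so the sensor has something to add its delta to.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.sensors import TimeDeltaSensor

# Giving the DAG a schedule other than None makes following_schedule() usable.
dag = DAG(
    dag_id='wait_example',
    start_date=datetime(2018, 7, 1),
    schedule_interval='@daily',
)

wait = TimeDeltaSensor(
    task_id='wait',
    delta=timedelta(seconds=300),
    dag=dag,
)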

Get DAG from airflow

I'm trying to render the DAG as a tree for documentation. Is there a direct way to get this?
Right now I'm manually generating DOT files with (partial code):
for task in dag.tasks:
    print("\t%s;" % task.task_id)
    relatives = [r.task_id for r in task.get_direct_relatives()]
    for r in relatives:
        print("\t%s -- %s;" % (task.task_id, r))
This works, but I need to dynamically import all DAGs externally.
You can use the airflow.models.DagBag object to enumerate the DAG objects.
from airflow.models import DagBag

for dag in DagBag().dags.values():
    for task in dag.tasks:
        [...]
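Putting the two pieces together, a sketch that combines DagBag with the question's DOT snippet (the graph wrapper lines are an assumption, not an Airflow API):
from airflow.models import DagBag

# Emit one graph block per DAG found in the configured DAGs folder,
# using the same edge format as the snippet in the question.
for dag_id, dag in DagBag().dags.items():
    print("graph %s {" % dag_id)
    for task in dag.tasks:
        print("\t%s;" % task.task_id)
        for r in (r.task_id for r in task.get_direct_relatives()):
            print("\t%s -- %s;" % (task.task_id, r))
    print("}")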
