I was trying out ZenML, which says it can convert my .py pipeline into an Airflow DAG.
I followed every step here: https://docs.zenml.io/guides/low-level-api/chapter-7, and everything succeeded.
The pipeline runs fine locally, but why can't I see the DAG in the Airflow UI? The UI is completely empty.
The problem seems to be that ZenML copies the .py pipeline I wrote the ZenML way and expects it to run in Airflow, but in my case that doesn't work. Does anyone know how I can get ZenML to run my code through Airflow successfully?
Here's my ZenML .py code:
import pandas as pd
import numpy as np
import os
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score
from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps.step_output import Output
from zenml.steps.base_step_config import BaseStepConfig

class pipeline_config(BaseStepConfig):
    """
    Params used in the pipeline
    """
    label: str = 'species'

@step
def split_data(config: pipeline_config) -> Output(
    X=pd.DataFrame, y=pd.DataFrame
):
    path_to_csv = os.path.join('~/airflow/data', 'leaf.csv')
    df = pd.read_csv(path_to_csv)
    label = config.label
    y = df[[label]]
    X = df.drop(label, axis=1)
    return X, y

@step
def train_evaltor(
    config: pipeline_config,
    X: pd.DataFrame,
    y: pd.DataFrame
) -> float:
    y = y[config.label]
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
    lgbm = lgb.LGBMClassifier(objective='multiclass', random_state=10)
    metrics_lst = []
    for train_idx, val_idx in folds.split(X, y):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]
        lgbm.fit(X_train, y_train)
        y_pred = lgbm.predict(X_val)
        cv_balanced_accuracy = balanced_accuracy_score(y_val, y_pred)
        metrics_lst.append(cv_balanced_accuracy)
    avg_performance = np.mean(metrics_lst)
    print(f"Avg Performance: {avg_performance}")
    return avg_performance

@pipeline
def super_mini_pipeline(
    data_spliter,
    train_evaltor
):
    X, y = data_spliter()
    train_evaltor(X=X, y=y)

# run the pipeline
pipeline = super_mini_pipeline(data_spliter=split_data(),
                               train_evaltor=train_evaltor())
pipeline.run()
Okay, so it worked!
The reason is that the words airflow and dag have to appear somewhere in the file if safe_mode is on (which it is by default). This is Airflow-specific discovery logic: during DAG discovery, files that don't contain both strings are skipped.
So all I did was change the last few lines:
# run the pipeline airflow
pipeline = super_mini_pipeline(data_spliter=split_data(),
train_evaltor=train_evaltor())
DAG = pipeline.run()
You can also change the airflow.cfg file and turn safe mode off:
In $HOME/.config/zenml/airflow_root/<UUID>/airflow.cfg
# When discovering DAGs, ignore any files that don't contain the strings ``DAG`` and ``airflow``.
dag_discovery_safe_mode = False
Edit:
There might be another reason: Airflow DAG discovery also relies on the DAG object being present in globals(), so we may also have needed to capture it with DAG = pipeline.run(). In any case, it works!
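To make the discovery rules concrete on the Airflow side (independent of ZenML), here is a minimal sketch of what the scheduler looks for in a file inside the DAG folder; the dag_id and operator are placeholders:
# Airflow's safe-mode heuristic only parses files that contain the literal
# strings "airflow" and "dag", and the DAG object must end up in the module's
# globals() for it to show up in the UI.
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# A module-level assignment puts the DAG in globals(), so discovery picks it up.
dag = DAG(
    dag_id="discovery_example",  # placeholder dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
)

DummyOperator(task_id="noop", dag=dag)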
I am new to Apache Airflow and I am trying to figure out how to unit/integration test my DAGs/tasks.
Here is my directory structure:
/airflow
    /dags
    /tests/dags
I created a simple DAG which has a task that reads data from a Postgres table:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook

def read_files(ti):
    sql = "select id from files where status='NEW'"
    pg_hook = PostgresHook(postgres_conn_id="metadata")
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(sql)
    files = cursor.fetchall()
    ti.xcom_push(key="files_to_process", value=files)

with DAG(dag_id="check_for_new_files", schedule_interval=timedelta(minutes=30),
         start_date=datetime(2022, 9, 1), catchup=False) as dag:
    check_files = PythonOperator(task_id="read_files",
                                 python_callable=read_files)
Is it possible to test this by mocking the Airflow/Postgres connection, etc.?
Yes, it is possible to test DAGs. Here is an example of the basic checks you can do:
import unittest
from airflow.models import DagBag

class TestCheckForNewFilesDAG(unittest.TestCase):
    """Checks for the check_for_new_files DAG"""

    def setUp(self):
        self.dagbag = DagBag()

    def test_task_count(self):
        """Check the task count of the DAG"""
        dag_id = 'check_for_new_files'
        dag = self.dagbag.get_dag(dag_id)
        self.assertEqual(len(dag.tasks), 1)

    def test_contain_tasks(self):
        """Check the tasks contained in the check_for_new_files DAG"""
        dag_id = 'check_for_new_files'
        dag = self.dagbag.get_dag(dag_id)
        tasks = dag.tasks
        task_ids = list(map(lambda task: task.task_id, tasks))
        self.assertListEqual(task_ids, ['read_files'])

    def test_dependencies_of_read_files_task(self):
        """Check the task dependencies of read_files in the check_for_new_files DAG"""
        dag_id = 'check_for_new_files'
        dag = self.dagbag.get_dag(dag_id)
        read_files_task = dag.get_task('read_files')

        # to be used in case you have upstream tasks
        upstream_task_ids = list(map(lambda task: task.task_id,
                                     read_files_task.upstream_list))
        self.assertListEqual(upstream_task_ids, [])
        downstream_task_ids = list(map(lambda task: task.task_id,
                                       read_files_task.downstream_list))
        self.assertListEqual(downstream_task_ids, [])

suite = unittest.TestLoader().loadTestsFromTestCase(TestCheckForNewFilesDAG)
unittest.TextTestRunner(verbosity=2).run(suite)
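A check that is often added alongside these (it is not in the original answer, so treat it as an optional extra) is asserting that the DagBag loaded without import errors, which catches syntax and dependency problems in any DAG file early:
import unittest
from airflow.models import DagBag

class TestDagIntegrity(unittest.TestCase):
    """Fails if any DAG file in the dags folder cannot be imported."""

    def test_no_import_errors(self):
        dagbag = DagBag(include_examples=False)
        self.assertEqual(
            len(dagbag.import_errors), 0,
            f"DAG import failures: {dagbag.import_errors}"
        )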
For verifying that the manipulated data/files are moved correctly, the documentation suggests:
https://airflow.apache.org/docs/apache-airflow/2.0.1/best-practices.html#self-checks
Self-Checks
You can also implement checks in a DAG to make sure the tasks are producing the results as expected. As an example, if you have a task that pushes data to S3, you can implement a check in the next task. For example, the check could make sure that the partition is created in S3 and perform some simple checks to determine if the data is correct.
I think this is an excellent and straightforward way to verify a specific task.
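As a rough sketch of that idea applied to this DAG (the check logic is an assumption, not from the original question), a follow-up task could verify that read_files actually found rows before the rest of the pipeline continues:
from airflow.operators.python import PythonOperator

def check_files_found(ti):
    # Pull what read_files pushed to XCom and fail loudly if nothing was found.
    files = ti.xcom_pull(key="files_to_process", task_ids="read_files")
    if not files:
        raise ValueError("read_files returned no rows with status='NEW'")

# Inside the same `with DAG(...)` block as read_files:
check_results = PythonOperator(
    task_id="check_files_found",
    python_callable=check_files_found,
)
check_files >> check_results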
Here are some other useful links you can use:
https://www.youtube.com/watch?v=ANJnYbLwLjE
The next ones talk about mocking:
https://www.astronomer.io/guides/testing-airflow/
https://medium.com/@montadhar/apache-airflow-testing-guide-7956a3f4bbf5
https://godatadriven.com/blog/testing-and-debugging-apache-airflow/
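Since the question specifically asks about mocking the Postgres connection, here is a minimal sketch of how that could look with unittest.mock; the import path of read_files is an assumption about your project layout:
import unittest
from unittest import mock

# Adjust this import to wherever read_files lives under your dags folder.
from dags.check_for_new_files import read_files

class TestReadFiles(unittest.TestCase):
    @mock.patch("dags.check_for_new_files.PostgresHook")
    def test_read_files_pushes_new_files(self, mock_hook):
        # Make the mocked hook's cursor return two fake rows.
        cursor = mock_hook.return_value.get_conn.return_value.cursor.return_value
        cursor.fetchall.return_value = [(1,), (2,)]

        ti = mock.Mock()  # stand-in for the task instance
        read_files(ti)

        cursor.execute.assert_called_once_with(
            "select id from files where status='NEW'"
        )
        ti.xcom_push.assert_called_once_with(
            key="files_to_process", value=[(1,), (2,)]
        )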
I followed this example:
Created the example timetable .py file and put it in $HOME/airflow/plugins.
Created the example DAG file and put it in $HOME/airflow/dags.
After restarting the scheduler and webserver, I get a DAG import error. In the web UI, the last line of the detailed error message is:
airflow.exceptions.SerializationError: Failed to serialize DAG 'example_timetable_dag2': Timetable class 'AfterWorkdayTimetable' is not registered
But if I run airflow plugins, I can see the timetable in the name and source list.
How do I fix this error?
Details of plugins/AfterWorkdayTimetable.py:
from datetime import timedelta
from typing import Optional

from pendulum import Date, DateTime, Time, timezone

from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

UTC = timezone("UTC")

class AfterWorkdayTimetable(Timetable):
    def infer_data_interval(self, run_after: DateTime) -> DataInterval:
        weekday = run_after.weekday()
        if weekday in (0, 6):  # Monday and Sunday -- interval is last Friday.
            days_since_friday = (run_after.weekday() - 4) % 7
            delta = timedelta(days=days_since_friday)
        else:  # Otherwise the interval is yesterday.
            delta = timedelta(days=1)
        start = DateTime.combine((run_after - delta).date(), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=(start + timedelta(days=1)))

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: Optional[DataInterval],
        restriction: TimeRestriction,
    ) -> Optional[DagRunInfo]:
        if last_automated_data_interval is not None:  # There was a previous run on the regular schedule.
            last_start = last_automated_data_interval.start
            last_start_weekday = last_start.weekday()
            if 0 <= last_start_weekday < 4:  # Last run on Monday through Thursday -- next is tomorrow.
                delta = timedelta(days=1)
            else:  # Last run on Friday -- skip to next Monday.
                delta = timedelta(days=(7 - last_start_weekday))
            next_start = DateTime.combine((last_start + delta).date(), Time.min).replace(tzinfo=UTC)
        else:  # This is the first ever run on the regular schedule.
            next_start = restriction.earliest
            if next_start is None:  # No start_date. Don't schedule.
                return None
            if not restriction.catchup:
                # If the DAG has catchup=False, today is the earliest to consider.
                next_start = max(next_start, DateTime.combine(Date.today(), Time.min).replace(tzinfo=UTC))
            elif next_start.time() != Time.min:
                # If earliest does not fall on midnight, skip to the next day.
                next_day = next_start.date() + timedelta(days=1)
                next_start = DateTime.combine(next_day, Time.min).replace(tzinfo=UTC)
            next_start_weekday = next_start.weekday()
            if next_start_weekday in (5, 6):  # If next start is in the weekend, go to next Monday.
                delta = timedelta(days=(7 - next_start_weekday))
                next_start = next_start + delta
        if restriction.latest is not None and next_start > restriction.latest:
            return None  # Over the DAG's scheduled end; don't schedule.
        return DagRunInfo.interval(start=next_start, end=(next_start + timedelta(days=1)))

class WorkdayTimetablePlugin(AirflowPlugin):
    name = "workday_timetable_plugin"
    timetables = [AfterWorkdayTimetable]
Details of dags/test_afterwork_timetable.py:
import datetime

from airflow import DAG
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="example_workday_timetable",
    start_date=datetime.datetime(2021, 1, 1),
    timetable=AfterWorkdayTimetable(),
    tags=["example", "timetable"],
) as dag:
    DummyOperator(task_id="run_this")
If I run airflow plugins:
name | source
==================================+==========================================
workday_timetable_plugin | $PLUGINS_FOLDER/AfterWorkdayTimetable.py
I had a similar issue.
Either you need to add an __init__.py file, or you can try the following to debug your issue:
Get all plugin manager objects:
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
plugins_manager.timetable_classes
I got this result: {'quarterly.QuarterlyTimetable': <class 'quarterly.QuarterlyTimetable'>}
Compare your result with the exception message. If the timetable_classes dictionary lists your plugin under a different name than the one in the exception, you should change the plugin file path (or the import) so the names match.
You could also try this inside the DAG python file:
from AfterWorkdayTimetable import AfterWorkdayTimetable
from airflow import plugins_manager
print(plugins_manager.as_importable_string(AfterWorkdayTimetable))
This will help you find the name that Airflow tries to use when searching through the timetable_classes dictionary.
You need to register the timetable in the "timetables" array via the plugin interface. See:
https://airflow.apache.org/docs/apache-airflow/stable/plugins.html
I encountered the same issue.
These are the steps I followed (a consolidated sketch of the resulting files follows the list):
1. Add the timetable file (custom_tt.py) to the plugins folder.
2. Make sure the plugins folder has an __init__.py file in it.
3. Change lazy_load_plugins in airflow.cfg to False:
lazy_load_plugins = False
4. Add the import statement in the DAG file:
from custom_tt import CustomTimeTable
5. In the DAG:
DAG(timetable=CustomTimeTable())
6. Restart the webserver and scheduler.
Problem fixed.
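Putting those steps together, a minimal sketch of the two files might look like this; the file names, the class name CustomTimeTable, and the dag_id are the names assumed in the steps above, and the timetable logic itself is elided:
# plugins/custom_tt.py  (plugins/__init__.py must also exist)
from airflow.plugins_manager import AirflowPlugin
from airflow.timetables.base import Timetable

class CustomTimeTable(Timetable):
    ...  # implement infer_manual_data_interval and next_dagrun_info here

class CustomTimeTablePlugin(AirflowPlugin):
    name = "custom_timetable_plugin"
    timetables = [CustomTimeTable]

# dags/uses_custom_timetable.py  (hypothetical DAG file)
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator
from custom_tt import CustomTimeTable  # imported relative to the plugins folder

with DAG(
    dag_id="uses_custom_timetable",
    start_date=datetime(2022, 1, 1),
    timetable=CustomTimeTable(),
) as dag:
    DummyOperator(task_id="run_this")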
They have found the resolution to this, but it doesn't seem like they have updated the documentation to reflect the fix just yet.
Your function
def infer_data_interval(self, run_after: DateTime) -> DataInterval:
should be
def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
See reference:
Apache airflow.timetables.base Documentation
After updating the function with the correct name and extra parameter, everything else should work for you as it did for me.
I was running into this as well. airflow plugins reports that the plugin is registered, and running the DAG script on the command line works fine, but the web UI reports that the plugin is not registered. @Bakuchi's answer pointed me in the right direction.
In my case, the problem was how I was importing the Timetable - airflow apparently expects you to import it relative to the $PLUGINS_FOLDER, not from any other directory, even if that other directory is also on the PYTHONPATH.
For a concrete example:
export PYTHONPATH=/path/to/my/code:$PYTHONPATH
# airflow.cfg
plugins_folder = /path/to/my/code/airflow_plugins
# dag.py
import sys
from airflow_plugins.custom_timetable import CustomTimetable as Bad
from custom_timetable import CustomTimetable as Good
from airflow import plugins_manager
plugins_manager.initialize_timetables_plugins()
print(sys.path) # /path/to/my/code:...:/path/to/my/code/airflow_plugins
print(plugins_manager.as_importable_string(Bad)) # airflow_plugins.custom_timetable.CustomTimetable
print(plugins_manager.as_importable_string(Good)) # custom_timetable.CustomTimetable
print(plugins_manager.timetable_classes) # {'custom_timetable.CustomTimetable': <class 'custom_timetable.CustomTimetable'>}
A bad lookup in plugins_manager.timetable_classes is ultimately what ends up raising the _TimetableNotRegistered error, so the fix is to make the keys match by changing how the timetable is imported.
I submitted a bug report: https://github.com/apache/airflow/issues/21259
Let's say I have an ML model in the MLflow server artifacts. I want to run this model from an Airflow DAG, and after running it in Airflow, the metric logs should be visible in MLflow.
How can I achieve this?
There are connections in Airflow, but I couldn't find any connection type for MLflow.
First, you should run the Airflow and MLflow servers and set the artifact paths and databases for both. You can do this locally or on the cloud; there are many sources on YouTube on how to achieve it.
I will show only the coding part of how you can use Airflow and MLflow together. The code below is a partial and simplified version, but it explains how you can do it:
# Import libraries
from numpy import loadtxt
import xgboost as xgb
import mlflow
from airflow.models import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.dummy_operator import DummyOperator
from sklearn.metrics import mean_squared_error
from typing import Any
import pickle
from datetime import datetime, timedelta
import logging

artifact_path = "models_mlflow"
local_path = "/PATH/"
tag = "some tag"
model_name = "my_model"
metric = "metrics.rmse ASC"

def find_best_params(ti: Any, metric: str, max_results: int) -> None:
    # Some code here to find the best params
    best_params = {'param_key_1': "param_value_1", 'param_key_2': "param_value_2"}
    # Push the best params to XCom
    ti.xcom_push(key='best_params', value=best_params)

def run_best_model(ti: Any, tag: str) -> None:
    """
    Runs the model with the best parameters searched and found
    at the earlier phase. Then saves the model and info in the artifacts
    folder or bucket.
    """
    best_params = ti.xcom_pull(key='best_params', task_ids=['find_best_params'])
    best_params = best_params[0]
    logging.info(f"Best params '{best_params}' are retrieved from XCom.")

    # Load data from the local disk.
    X_train = loadtxt(f"{local_path}/data/X_train.csv", delimiter=',')
    X_val = loadtxt(f"{local_path}/data/X_val.csv", delimiter=',')
    y_train = loadtxt(f"{local_path}/data/y_train.csv", delimiter=',')
    y_val = loadtxt(f"{local_path}/data/y_val.csv", delimiter=',')
    logging.info("Training and validation datasets are retrieved from the local storage.")

    # Convert to DMatrix data structure for XGBoost.
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)
    logging.info("Training and validation matrix datasets are created for XGBoost.")

    with mlflow.start_run() as run:
        # Get the run_id of the best model
        best_run_id = run.info.run_id
        logging.info(f"Best run id: '{best_run_id}'")

        mlflow.set_tag("model", tag)
        mlflow.log_params(best_params)

        # Train the XGBoost model with the best parameters
        booster = xgb.train(
            params=best_params,
            dtrain=train,
            num_boost_round=100,
            evals=[(valid, 'validation')],
            early_stopping_rounds=50
        )
        y_pred = booster.predict(valid)
        rmse = mean_squared_error(y_val, y_pred, squared=False)
        mlflow.log_metric("rmse", rmse)

        # Save the model (xgboost_model.bin) locally in the folder "../models/" (in case we want)
        with open(f"{local_path}/models/xgboost_model.bin", "wb") as f_out:
            pickle.dump(booster, f_out)
        logging.info(f"XGBoost model is saved on the path '{local_path}/models/xgboost_model.bin' of the local machine.")

        # Save the model (xgboost_model.bin) using 'log_artifact' in the defined artifacts folder/bucket (in case we want)
        # This is defined on the CLI and as artifact path parameter on AWS Parameter Store:
        # s3://bucket/mlflow/ ... /models_mlflow/
        mlflow.log_artifact(local_path=f"{local_path}/models/xgboost_model.bin", artifact_path=artifact_path)
        logging.info(f"Artifacts are saved on the artifact path '{artifact_path}'.")

        # Save the model (booster) using 'log_model' in the defined artifacts folder/bucket
        # This is defined on the CLI and as artifact path parameter on AWS Parameter Store:
        # s3://bucket/mlflow/ ... /models_mlflow/
        mlflow.xgboost.log_model(booster, artifact_path=artifact_path)
        logging.info(f"XGBoost model is saved on the artifact path '{artifact_path}'.")
        logging.info(f"Default artifacts URI: '{mlflow.get_artifact_uri()}'")

    # Push the best run id to XCom
    ti.xcom_push(key='best_run_id', value=best_run_id)
    logging.info(f"The best run id '{best_run_id}' of the model '{model_name}' is pushed to XCom.")

default_args = {
    'owner': 'me',
    'start_date': datetime(2022, 8, 25, 2),
    'end_date': datetime(2022, 12, 25, 2),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(seconds=10),
}

with DAG(
    dag_id="my_dag_v1",
    default_args=default_args,
    description="Training dag",
    schedule_interval="@monthly",
    catchup=False,
) as dag:

    task_start_dag = DummyOperator(
        task_id="start_dag",
    )

    task_find_best_params = PythonOperator(
        task_id='find_best_params',
        python_callable=find_best_params,
        op_kwargs={"metric": metric, "max_results": 5000},
    )

    task_run_best_model = PythonOperator(
        task_id='run_best_model',
        python_callable=run_best_model,
        op_kwargs={"tag": tag},
    )

    task_end_dag = DummyOperator(
        task_id="end_dag",
    )

    task_start_dag >> task_find_best_params >> task_run_best_model >> task_end_dag
If you run the servers locally (Airflow on http://0.0.0.0:8080 and MLflow on http://0.0.0.0:5000), you can see the results on both web pages. The above code is designed to set the artifacts path on the cloud; you need to set a local path if you want it on the local machine.
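One detail the snippet above leaves out is pointing the DAG at your MLflow tracking server. A minimal sketch, assuming the server runs at http://0.0.0.0:5000 and an experiment name of your choosing, placed near the top of the DAG file:
import mlflow

# Point the MLflow client at the tracking server the DAG should log to.
# The URI and experiment name are assumptions -- replace them with your own.
mlflow.set_tracking_uri("http://0.0.0.0:5000")
mlflow.set_experiment("xgboost-training")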
I am trying to generate Airflow DAGs from a template in Python code, using globals() as described here to define the DAG object and save it. Below is my code:
import datetime as dt
import sys

import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

argumentList = sys.argv
owner = argumentList[1]
dag_name = argumentList[2]
taskID = argumentList[3]
bashCommand = argumentList[4]

default_args = {
    'owner': owner,
    'start_date': dt.datetime(2019, 6, 1),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

def dagCreate():
    with DAG(dag_name,
             default_args=default_args,
             schedule_interval=None,
             ) as dag:
        print_hello = BashOperator(task_id=taskID, bash_command=bashCommand)
    return dag

globals()[dag_name] = dagCreate()
I have kept this Python code outside the dag_folder and am executing it as follows:
python bash-dag-generator.py Airflow test_bash_generate auto_bash_task ls
But I don't see any DAG generated in the Airflow webserver UI. I am not sure where I am going wrong.
As per the official documentation:
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
So unless your code is actually inside the DAG_FOLDER, it will not be registered as a DAG.
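In other words, the generation has to happen inside a file that lives in the DAG_FOLDER and gets re-parsed by the scheduler, where sys.argv is not available. A minimal sketch of that pattern (the config list is a stand-in for wherever your parameters really come from):
# dags/bash_dag_generator.py -- hypothetical file inside the DAG_FOLDER
import datetime as dt
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

# Stand-in for your real parameter source (a config file, an Airflow Variable, a database, ...).
DAG_CONFIGS = [
    {"owner": "Airflow", "dag_name": "test_bash_generate", "task_id": "auto_bash_task", "command": "ls"},
]

for cfg in DAG_CONFIGS:
    with DAG(
        cfg["dag_name"],
        default_args={"owner": cfg["owner"], "start_date": dt.datetime(2019, 6, 1)},
        schedule_interval=None,
    ) as dag:
        BashOperator(task_id=cfg["task_id"], bash_command=cfg["command"])
    # The module-level globals() entry is what makes Airflow pick up each DAG.
    globals()[cfg["dag_name"]] = dag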
The way I have been able to implement dynamic DAGs is by using an Airflow Variable.
In the example below I have a CSV file that contains a list of Bash commands like ls, echo, etc. As part of the read_file_task I update the file location in the Airflow Variable. The part where we read the CSV file and loop through the commands is where the dynamic tasks get created.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import csv

'''
Orchestrate the Dynamic Tasks
'''

def read_file_task():
    print('I am reading a File and setting variables ')
    Variable.set('dynamic-dag-sample', '/home/bashoperator.csv')

with DAG('dynamic-dag-sample',
         start_date=datetime(2018, 11, 1)) as dag:

    read_file_task = PythonOperator(task_id='read_file_task',
                                    python_callable=read_file_task,
                                    provide_context=True,
                                    dag=dag)

    dynamic_dag_sample_file_path = Variable.get("dynamic-dag-sample")
    if dynamic_dag_sample_file_path != None:
        with open(dynamic_dag_sample_file_path) as csv_file:
            reader = csv.DictReader(csv_file)
            line_count = 0
            for row in reader:
                bash_task = BashOperator(task_id=row['Taskname'], bash_command=row['Command'])
                read_file_task.set_downstream(bash_task)
I'm currently trying to build an extra link on the DatabricksRunNowOperator in Airflow so I can quickly access the Databricks run without having to rummage through the logs. As a starting point I'm simply trying to add a link to Google in the task instance menu. I've followed the procedure shown in this tutorial, creating the following code, which I placed in my Airflow home plugins folder:
from airflow.plugins_manager import AirflowPlugin
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator

class DBLogLink(BaseOperatorLink):
    name = 'run_link'
    operators = [DatabricksRunNowOperator]

    def get_link(self, operator, dttm):
        return "https://www.google.com"

class AirflowExtraLinkPlugin(AirflowPlugin):
    name = "extra_link_plugin"
    operator_extra_links = [DBLogLink(), ]
However, the extra link does not show up, even after restarting the webserver, etc.
Here's the code I'm using to create the DAG:
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
from datetime import datetime, timedelta

DATABRICKS_CONN_ID = '____'

args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 2, 13),
    'retries': 0
}

dag = DAG(
    dag_id = 'testing_notebook',
    default_args = args,
    schedule_interval = timedelta(days=1)
)

DatabricksRunNowOperator(
    task_id = 'mail_reader',
    dag = dag,
    databricks_conn_id = DATABRICKS_CONN_ID,
    polling_period_seconds=1,
    job_id = ____,
    notebook_params = {____}
)
I feel like I'm missing something really basic, but I just can't figure it out.
Additional info
Airflow version 1.10.9
Running on ubuntu 18.04.3
I've worked it out. You need to have your webserver running with RBAC enabled. This means setting up Airflow with authentication and adding users. RBAC can be turned on by setting rbac = True in your airflow.cfg file.
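For reference, this is roughly what that setting looks like in airflow.cfg on 1.10.x (the section name is from memory, so double-check it against your own config), followed by a webserver restart:
# airflow.cfg
[webserver]
# Use the RBAC UI instead of the legacy UI; operator extra links require this on 1.10.x.
rbac = True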