Cannot Create Extra Operator Link on DatabricksRunNowOperator in Airflow - airflow

I'm currently trying to build an extra link on the DatabricksRunNowOperator in airflow so I can quickly access the databricks run without having to rummage through the logs. As a starting point I'm simply trying to add a link to google in the task instance menu. I've followed the procedure shown in this tutorial creating the following code placed within my airflow home plugins folder:
from airflow.plugins_manager import AirflowPlugin
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
class DBLogLink(BaseOperatorLink):
name = 'run_link'
operators = [DatabricksRunNowOperator]
def get_link(self, operator, dttm):
return "https://www.google.com"
class AirflowExtraLinkPlugin(AirflowPlugin):
name = "extra_link_plugin"
operator_extra_links = [DBLogLink(), ]
However the extra link does not show up, even after restarting the webserver etc:
Here's the code I'm using to create the DAG:
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
from datetime import datetime, timedelta
DATABRICKS_CONN_ID = '____'
args = {
'owner': 'Airflow',
'depends_on_past': False,
'start_date': datetime(2020, 2, 13),
'retries': 0
}
dag = DAG(
dag_id = 'testing_notebook',
default_args = args,
schedule_interval = timedelta(days=1)
)
DatabricksRunNowOperator(
task_id = 'mail_reader',
dag = dag,
databricks_conn_id = DATABRICKS_CONN_ID,
polling_period_seconds=1,
job_id = ____,
notebook_params = {____}
)
I feel like I'm missing something really basic, but I just can't figure it out.
Additional info
Airflow version 1.10.9
Running on ubuntu 18.04.3

I've worked it out. You need to have your webserver running as RBAC. This means setting up airflow with authentication and adding users. RBAC can be turned on by setting rbac = True in your airflow.cfg file.

Related

Generating airflow DAGs dynamically

I am trying to generate airflow dags using a template in a python code, and using globals() as defined here
To define dag object and saving it. Below is my code :
import datetime as dt
import sys
import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
argumentList = sys.argv
owner = argumentList[1]
dag_name = argumentList[2]
taskID = argumentList[3]
bashCommand = argumentList[4]
default_args = {
'owner': owner,
'start_date': dt.datetime(2019, 6, 1),
'retries': 1,
'retry_delay': dt.timedelta(minutes=5),
}
def dagCreate():
with DAG(dag_name,
default_args=default_args,
schedule_interval=None,
) as dag:
print_hello = BashOperator(task_id=taskID, bash_command=bashCommand)
return dag
globals()[dag_name] = dagCreate()
I have kept this python code outside dag_folder, and executing it as follows :
python bash-dag-generator.py Airflow test_bash_generate auto_bash_task ls
But I don't see any DAG generated in the airflow webserver UI. I am not sure where I am going wrong.
As per the official documentation:
DAGs are defined in standard Python files that are placed in Airflow’s DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
So unless your code is actually inside the DAG_FOLDER, it will not be registered as a DAG.
The way I have been able to implement Dynamic DAGs is by using Airflow Variable.
In the below example I have a csv file that has list of Bash command like ls, echo etc. As part of the read_file task I am updating the file location to the Airflow Variable. The part where we read the csv file and loop through the commands is where the dynamic DAGs get created.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import csv
'''
Orchestrate the Dynamic Tasks
'''
def read_file_task():
print('I am reading a File and setting variables ')
Variable.set('dynamic-dag-sample','/home/bashoperator.csv')
with DAG('dynamic-dag-sample',
start_date=datetime(2018, 11, 1)) as dag:
read_file_task = PythonOperator(task_id='read_file_task',
python_callable=read_file_task, provide_context=True,
dag=dag)
dynamic_dag_sample_file_path = Variable.get("dynamic-dag-sample")
if dynamic_dag_sample_file_path != None:
with open(dynamic_dag_sample_file_path) as csv_file:
reader = csv.DictReader(csv_file)
line_count = 0
for row in reader:
bash_task = BashOperator(task_id=row['Taskname'], bash_command=row['Command'])
read_file_task.set_downstream(bash_task)

Is it possible to use wildcard for Application JAR names in a SparkSubmitOperator Airflow DAG?

I have an Airflow DAG which I use to submit a spark job and for that I use the SparkSubmitOperator. In the DAG, I have to specify the application JAR that needs to be run. At the moment, it is hardcoded to spark-job-1.0.jar as following:
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta
args = {
'owner': 'joe',
'start_date': datetime(2020, 5, 23)
}
dag = DAG('spark_job', default_args=args)
operator = SparkSubmitOperator(
task_id='spark_sm_job',
conn_id='spark_submit',
java_class='com.mypackage',
application='/home/ubuntu/spark-job-1.0.jar', <---------
total_executor_cores='1',
executor_cores='1',
executor_memory='500M',
num_executors='1',
name='airflow-spark-job',
verbose=False,
driver_memory='500M',
application_args=["yarn", "10.11.21.12:9092"],
dag=dag,
)
The problem is, the release name will increment and I tried using a wildcard spark-job-*.jar but it didn't work! - Is it possible to use a wildcard, or is there any other way to get around this?
Assuming that there will always be only one file matching the glob /home/ubuntu/spark-job-*.jar, you can do the following:
import glob
...
application=glob.glob('/home/ubuntu/spark-job-*.jar')[0],
...

Getting *** Task instance did not exist in the DB as error when running gcs_to_bq in composer

While executing the following python script using cloud-composer, I get *** Task instance did not exist in the DB under the gcs2bq task Log in Airflow
Code:
import datetime
import os
import csv
import pandas as pd
import pip
from airflow import models
#from airflow.contrib.operators import dataproc_operator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.utils import trigger_rule
from airflow.contrib.operators import gcs_to_bq
from airflow.contrib.operators import bigquery_operator
print('''/-------/--------/------/
-------/--------/------/''')
yesterday = datetime.datetime.combine(
datetime.datetime.today() - datetime.timedelta(1),
datetime.datetime.min.time())
default_dag_args = {
# Setting start date as yesterday starts the DAG immediately when it is
# detected in the Cloud Storage bucket.
'start_date': yesterday,
# To email on failure or retry set 'email' arg to your email and enable
# emailing here.
'email_on_failure': False,
'email_on_retry': False,
# If a task fails, retry it once after waiting at least 5 minutes
'retries': 1,
'retry_delay': datetime.timedelta(minutes=5),
'project_id': 'data-rubrics'
#models.Variable.get('gcp_project')
}
try:
# [START composer_quickstart_schedule]
with models.DAG(
'composer_agg_quickstart',
# Continue to run DAG once per day
schedule_interval=datetime.timedelta(days=1),
default_args=default_dag_args) as dag:
# [END composer_quickstart_schedule]
op_start = BashOperator(task_id='Initializing', bash_command='echo Initialized')
#op_readwrite = PythonOperator(task_id = 'ReadAggWriteFile', python_callable=read_data)
op_load = gcs_to_bq.GoogleCloudStorageToBigQueryOperator( \
task_id='gcs2bq',\
bucket='dr-mockup-data',\
source_objects=['sample.csv'],\
destination_project_dataset_table='data-rubrics.sample_bqtable',\
schema_fields = [{'name':'a', 'type':'STRING', 'mode':'NULLABLE'},{'name':'b', 'type':'FLOAT', 'mode':'NULLABLE'}],\
write_disposition='WRITE_TRUNCATE',\
dag=dag)
#op_write = PythonOperator(task_id = 'AggregateAndWriteFile', python_callable=write_data)
op_start >> op_load
UPDATE:
Can you remove dag=dag from gcs2bq task as you are already using with models.DAG and run your dag again?
It might be because you have a dynamic start date. Your start_date should never be dynamic. Read this FAQ: https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
We recommend against using dynamic values as start_date, especially datetime.now() as it can be quite confusing. The task is triggered once the period closes, and in theory an #hourly DAG would never get to an hour after now as now() moves along.
Make your start_date static or use Airflow utils/macros:
import airflow
args = {
'owner': 'airflow',
'start_date': airflow.utils.dates.days_ago(2),
}
Okay, this was a stupid question on my part and apologies for everyone who wasted time here. I had a Dag running due to which the one I was shooting off was always in the que. Also, I did not write the correct value in destination_project_dataset_table. Thanks and apologies to all who spent time.

How to perform S3 to BigQuery using Airflow?

Currently there is no S3ToBigQuery operator.
My choices are:
Use the S3ToGoogleCloudStorageOperator and then use the GoogleCloudStorageToBigQueryOperator
This is not something i'm eager to do. This means paying double for storage. Even if removing the file from either one of the storage that still involves payment.
Download the file from S3 to local file system and load it to BigQuery from file system - However there is no S3DownloadOperator This means writing the whole process from scratch without Airflow involvement. This misses the point of using Airflow.
Is there another option? What would you suggest to do?
This is what I ended up with.
This should be converted to a S3toLocalFile Operator.
def download_from_s3(**kwargs):
hook = S3Hook(aws_conn_id='project-s3')
result = hook.read_key(bucket_name='stage-project-metrics',
key='{}.csv'.format(kwargs['ds']))
if not result:
logging.info('no data found')
else:
outfile = '{}project{}.csv'.format(Variable.get("data_directory"),kwargs['ds'])
f=open(outfile,'w+')
f.write(result)
f.close()
return result
If the first option is cost restrictive, you could just use the S3Hook to download the file through the PythonOperator:
from airflow.hooks.S3_hook import S3Hook
from datetime import timedelta, datetime
from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2018, 1, 1),
'email_on_failure': False,
'email_on_retry': False,
'retries': 0
}
def download_from_s3(**kwargs):
hook = S3Hook(aws_conn_id='s3_conn')
hook.read_key(bucket_name='workflows-dev',
key='test_data.csv')
dag = DAG('s3_download',
schedule_interval='#daily',
default_args=default_args,
catchup=False)
with dag:
download_data = PythonOperator(
task_id='download_data',
python_callable=download_from_s3,
provide_context=True
)
What you can do instead is use S3ToGoogleCloudStorageOperator and then use GoogleCloudStorageToBigQueryOperator with an external_table table flag i.e pass external_table =True.
This will create an external data that points to GCS location and doesn't store your data in BigQuery but you can still query it.

Airflow dynamic DAG and Task Ids

I mostly see Airflow being used for ETL/Bid data related jobs. I'm trying to use it for business workflows wherein a user action triggers a set of dependent tasks in future. Some of these tasks may need to be cleared (deleted) based on certain other user actions.
I thought the best way to handle this would be via dynamic task ids. I read that Airflow supports dynamic dag ids. So, I created a simple python script that takes DAG id and task id as command line parameters. However, I'm running into problems making it work. It gives dag_id not found error. Has anyone tried this? Here's the code for the script (call it tmp.py) which I execute on command line as python (python tmp.py 820 2016-08-24T22:50:00 ):
from __future__ import print_function
import os
import sys
import shutil
from datetime import date, datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
execution = '2016-08-24T22:20:00'
if len(sys.argv) > 2 :
dagid = sys.argv[1]
taskid = 'Activate' + sys.argv[1]
execution = sys.argv[2]
else:
dagid = 'DAGObjectId'
taskid = 'Activate'
default_args = {'owner' : 'airflow', 'depends_on_past': False, 'start_date':date.today(), 'email': ['fake#fake.com'], 'email_on_failure': False, 'email_on_retry': False, 'retries': 1}
dag = DAG(dag_id = dagid,
default_args=default_args,
schedule_interval='#once',
)
globals()[dagid] = dag
task1 = BashOperator(
task_id = taskid,
bash_command='ls -l',
dag=dag)
fakeTask = BashOperator(
task_id = 'fakeTask',
bash_command='sleep 5',
retries = 3,
dag=dag)
task1.set_upstream(fakeTask)
airflowcmd = "airflow run " + dagid + " " + taskid + " " + execution
print("airflowcmd = " + airflowcmd)
os.system(airflowcmd)
After numerous trials and errors, I was able to figure this out. Hopefully, it will help someone. Here's how it works: You need to have an iterator or an external source (file/database table) to generate dags/task dynamically through a template. You can keep the dag and task names static, just assign them ids dynamically in order to differentiate one dag from the other. You put this python script in the dags folder. When you start the airflow scheduler, it runs through this script on every heartbeat and writes the DAGs to the dag table in the database. If a dag (unique dag id) has already been written, it will simply skip it. The scheduler also look at the schedule of individual DAGs to determine which one is ready for execution. If a DAG is ready for execution, it executes it and updates its status.
Here's a sample code:
from airflow.operators import PythonOperator
from airflow.operators import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta
import sys
import time
dagid = 'DA' + str(int(time.time()))
taskid = 'TA' + str(int(time.time()))
input_file = '/home/directory/airflow/textfile_for_dagids_and_schedule'
def my_sleeping_function(random_base):
'''This is a function that will run within the DAG execution'''
time.sleep(random_base)
def_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.now(), 'email_on_failure': False,
'retries': 1, 'retry_delay': timedelta(minutes=2)
}
with open(input_file,'r') as f:
for line in f:
args = line.strip().split(',')
if len(args) < 6:
continue
dagid = 'DAA' + args[0]
taskid = 'TAA' + args[0]
yyyy = int(args[1])
mm = int(args[2])
dd = int(args[3])
hh = int(args[4])
mins = int(args[5])
ss = int(args[6])
dag = DAG(
dag_id=dagid, default_args=def_args,
schedule_interval='#once', start_date=datetime(yyyy,mm,dd,hh,mins,ss)
)
myBashTask = BashOperator(
task_id=taskid,
bash_command='python /home/directory/airflow/sendemail.py',
dag=dag)
task2id = taskid + '-X'
task_sleep = PythonOperator(
task_id=task2id,
python_callable=my_sleeping_function,
op_kwargs={'random_base': 10},
dag=dag)
task_sleep.set_upstream(myBashTask)
f.close()
From How can I create DAGs dynamically?:
Airflow looks in you [sic] DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this all we need is a way to dynamically assign variable in the global namespace, which is easily done in python using the globals() function for the standard library which behaves like a simple dictionary.
for i in range(10):
dag_id = 'foo_{}'.format(i)
globals()[dag_id] = DAG(dag_id)
# or better, call a function that returns a DAG object!
copying my answer from this question. Only for v2.3 and above:
This feature is achieved using Dynamic Task Mapping, only for Airflow versions 2.3 and higher
More documentation and example here:
Official Dynamic Task Mapping documentation
Tutorial from Astronomer
Example:
#task
def make_list():
# This can also be from an API call, checking a database, -- almost anything you like, as long as the
# resulting list/dictionary can be stored in the current XCom backend.
return [1, 2, {"a": "b"}, "str"]
#task
def consumer(arg):
print(list(arg))
with DAG(dag_id="dynamic-map", start_date=datetime(2022, 4, 2)) as dag:
consumer.expand(arg=make_list())
example 2:
from airflow import XComArg
task = MyOperator(task_id="source")
downstream = MyOperator2.partial(task_id="consumer").expand(input=XComArg(task))
The graph view and tree view are also updated:
Relevant issues here:
https://github.com/apache/airflow/projects/12

Resources