Unable to execute spark job using SparkSubmitOperator - airflow

I am able to run a Spark job using BashOperator, but I want to use SparkSubmitOperator for it, with Spark in standalone mode.
Here's my DAG for SparkSubmitOperator and stack-trace
args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 24)
}
dag = DAG('spark_job', default_args=args, schedule_interval="*/10 * * * *")
operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    application='/home/ubuntu/test.py',
    total_executor_cores='1',
    executor_cores='1',
    executor_memory='2g',
    num_executors='1',
    name='airflow-spark',
    verbose=False,
    driver_memory='1g',
    conf={'master':'spark://xx.xx.xx.xx:7077'},
    dag=dag,
)
Looking at the source of spark_submit_hook, it seems _resolve_connection() always sets master=yarn. How can I change the master value to the Spark standalone master URL? Which properties can I set to run a Spark job in standalone mode?

You can either create a new connection using the Airflow Web UI or change the spark_default connection, and reference it through the operator's conn_id argument (spark_default is the default).
The master is taken from the connection's Host field (plus Port, if set) and can be local, yarn, spark://HOST:PORT, mesos://HOST:PORT or k8s://https://<HOST>:<PORT>.
You can also supply the following options in the extras:
{"queue": "root.default", "deploy_mode": "cluster", "spark_home": "", "spark_binary": "spark-submit", "namespace": "default"}
Either the "spark-submit" binary must be on the PATH, or the spark-home must be set in the extra on the connection.
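As a quick sanity check before saving the connection, the master URL shapes listed above can be validated in plain Python. This is only a minimal sketch: is_supported_master and its patterns are illustrative assumptions, not part of Airflow or Spark, and the patterns are deliberately not exhaustive.

```python
import re

# Master URL shapes from the list above; local[K] / local[*] are Spark's
# threaded local modes. These patterns are an assumption for illustration.
_MASTER_PATTERNS = [
    r"local(\[(\d+|\*)\])?",
    r"yarn",
    r"spark://[^:/\s]+:\d+",
    r"mesos://[^:/\s]+:\d+",
    r"k8s://https?://[^:/\s]+:\d+",
]


def is_supported_master(master: str) -> bool:
    """Return True if `master` matches one of the shapes above (hypothetical helper)."""
    return any(re.fullmatch(pattern, master) for pattern in _MASTER_PATTERNS)


standalone_ok = is_supported_master("spark://10.0.0.1:7077")
missing_scheme = is_supported_master("10.0.0.1:7077")
```

For the setup in the question, a value of the form spark://HOST:7077 passes the check, while a bare HOST:PORT without the spark:// scheme does not.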

Related

how to pass default values for run time input variable in airflow for scheduled execution

I have come across an issue while running a DAG in Airflow. My code works in two scenarios but fails in one.
My scenarios are:
Manual trigger with input - Running Fine
Manual trigger without input - Running Fine
Scheduled Run - Failing
Below is my code:
def decide_the_flow(**kwargs):
    cleanup=kwargs['dag_run'].conf.get('cleanup','N')
    print("IP is :",cleanup)
    return cleanup
I am getting below error,
cleanup=kwargs['dag_run'].conf.get('cleanup','N')
AttributeError: 'NoneType' object has no attribute 'get'
I tried to define default variables like this:
default_dag_args = {
    'start_date':days_ago(0),
    'params': {
        "cleanup": "N"
    },
    'retries': 0
}
but it doesn't work.
I am using BranchPythonOperator to call this function.
Can anyone please guide me here? What am I missing?
As a workaround I am using the code below:
try:
    cleanup=kwargs['dag_run'].conf.get('cleanup','N')
except:
    cleanup="N"
You can read the parameters from params in the context dict: Airflow copies dag_run.conf into it and fills in the default values for anything that is missing:
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
def decide_the_flow(**kwargs):
    cleanup = kwargs['params']["cleanup"]
    print(f"IP is : {cleanup}")
    return cleanup
with DAG(
    dag_id='airflow_params',
    start_date=datetime(2022, 8, 25),
    schedule_interval="* * * * *",
    params={
        "cleanup": "N",
    },
    catchup=False
) as dag:
    branch_task = BranchPythonOperator(
        task_id='test_param',
        python_callable=decide_the_flow
    )
    task_n = EmptyOperator(task_id="N")
    task_m = EmptyOperator(task_id="M")
    branch_task >> [task_n, task_m]
I just tested it in scheduled and manual (with and without conf) runs, it works fine.
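The reason params works in all three scenarios can be illustrated without Airflow. The following is a simplified model of the merge, not Airflow's implementation: resolve_params is a hypothetical helper, and the None value stands for the dag_run.conf of a scheduled run, as reported by the AttributeError in the question.

```python
def resolve_params(default_params, dag_run_conf):
    """Merge DAG-level defaults with run-time conf; conf may be None (hypothetical helper)."""
    merged = dict(default_params)        # start from the declared defaults
    merged.update(dag_run_conf or {})    # a missing conf (None) behaves like an empty dict
    return merged


defaults = {"cleanup": "N"}
scheduled = resolve_params(defaults, None)                      # scheduled run
manual_no_input = resolve_params(defaults, {})                  # manual trigger without input
manual_with_input = resolve_params(defaults, {"cleanup": "Y"})  # manual trigger with input
```

Reading the value from the merged dict never touches a None object, which is what the try/except workaround in the question was guarding against.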

Cannot Create Extra Operator Link on DatabricksRunNowOperator in Airflow

I'm currently trying to build an extra link on the DatabricksRunNowOperator in airflow so I can quickly access the databricks run without having to rummage through the logs. As a starting point I'm simply trying to add a link to google in the task instance menu. I've followed the procedure shown in this tutorial creating the following code placed within my airflow home plugins folder:
from airflow.plugins_manager import AirflowPlugin
from airflow.models.baseoperator import BaseOperatorLink
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
class DBLogLink(BaseOperatorLink):
    name = 'run_link'
    operators = [DatabricksRunNowOperator]
    def get_link(self, operator, dttm):
        return "https://www.google.com"
class AirflowExtraLinkPlugin(AirflowPlugin):
    name = "extra_link_plugin"
    operator_extra_links = [DBLogLink(), ]
However, the extra link does not show up, even after restarting the webserver.
Here's the code I'm using to create the DAG:
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksRunNowOperator
from datetime import datetime, timedelta
DATABRICKS_CONN_ID = '____'
args = {
    'owner': 'Airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 2, 13),
    'retries': 0
}
dag = DAG(
    dag_id = 'testing_notebook',
    default_args = args,
    schedule_interval = timedelta(days=1)
)
DatabricksRunNowOperator(
    task_id = 'mail_reader',
    dag = dag,
    databricks_conn_id = DATABRICKS_CONN_ID,
    polling_period_seconds=1,
    job_id = ____,
    notebook_params = {____}
)
I feel like I'm missing something really basic, but I just can't figure it out.
Additional info
Airflow version 1.10.9
Running on ubuntu 18.04.3
I've worked it out. You need to have your webserver running with the RBAC UI. This means setting up Airflow with authentication and adding users. RBAC can be turned on by setting rbac = True in the [webserver] section of your airflow.cfg file.

Problem Running MySQL Script with Airflow MySQL Operator

I'm not understanding something about using the MySQL operator to call a MySQL script with Apache Airflow.
When I run this task...
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql = '/home/user/DBScripts/MySQLScript/SampleMySQLScript.sql',
    mysql_conn_id = 'mysql_db_connect',
    autocommit = True,
    database = 'segments'
)
I get this error in traceback...
jinja2.exceptions.TemplateNotFound: /home/user/DBScripts/MySQLScript/SampleMySQLScript.sql
The DAG task runs fine if I provide the entire SQL script as a parameter.
I'm not familiar with Jinja templating.
Is it easier to learn to write my scripts as a Jinja template? Should I import the text of the script and assign it to a variable that I pass? Is there a way to write the Airflow task so that it isn't expecting a Jinja template?
This error message means that the .sql file is not found.
Using:
MySQLTest = MySqlOperator(
    task_id='MySQLTest',
    sql='test.sql',
    mysql_conn_id='mysql_db_connect',
    autocommit=True,
    database='segments',
    dag=dag
)
where test.sql is located in the same folder as the DAG file, works fine.
If the path of the .sql file is not relative to the DAG file, you can use template_searchpath to define the list of folders (absolute paths) where Jinja will look for the templates.
So your code could look like:
default_args = { # pylint: disable=invalid-name
    'owner': 'airflow',
    'start_date': datetime(2020, 12, 3),
}
with DAG(
    dag_id='my_sql_dag',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=['/home/user/DBScripts/MySQLScript']
) as dag:
    MySQLTest = MySqlOperator(
        task_id='MySQLTest',
        sql='SampleMySQLScript.sql',
        mysql_conn_id='mysql_db_connect',
        autocommit=True,
        database='segments'
    )
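The effect of template_searchpath can be pictured as a lookup of the template name in a list of folders. The sketch below is a rough stand-in for that lookup, not Jinja's or Airflow's actual code: find_template is a hypothetical helper and the temporary folders merely play the roles of the DAG folder and the scripts folder.

```python
import tempfile
from pathlib import Path


def find_template(name, search_dirs):
    """Return the first existing search_dir/name, or None (hypothetical helper)."""
    for folder in search_dirs:
        candidate = Path(folder) / name
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as dag_folder, tempfile.TemporaryDirectory() as scripts:
    # The script lives only in the separate scripts folder.
    (Path(scripts) / "SampleMySQLScript.sql").write_text("SELECT 1;")
    # Searching the DAG folder alone fails, like the TemplateNotFound error.
    only_dag_folder = find_template("SampleMySQLScript.sql", [dag_folder])
    # Adding the scripts folder to the search list makes the lookup succeed.
    with_searchpath = find_template("SampleMySQLScript.sql", [dag_folder, scripts])
    found_name = with_searchpath.name
```

In this model the sql argument stays a bare file name, and the folder it lives in is supplied separately through the search list.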

Dynamic dags not getting added by scheduler

I am trying to create dynamic DAGs and then get them picked up by the scheduler. I tried the reference from https://www.astronomer.io/guides/dynamically-generating-dags/ which works well. I changed it a bit, as in the code below, and need help debugging the issue.
What I tried:
1. Test-running the file. The DAG gets executed and globals() prints all the DAG objects, but somehow they are not listed by list_dags or in the UI.
from datetime import datetime, timedelta
import requests
import json
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.http_operator import SimpleHttpOperator
def create_dag(dag_id,
               dag_number,
               default_args):
    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))
    dag = DAG(dag_id,
              schedule_interval="@hourly",
              default_args=default_args)
    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py,
            dag_number=dag_number)
    return dag
def fetch_new_dags(**kwargs):
    for n in range(1, 10):
        print("=====================START=========\n")
        dag_id = "abcd_" + str(n)
        print(dag_id)
        print("\n")
        globals()[dag_id] = create_dag(dag_id, n, default_args)
    print(globals())
default_args = {
    'owner': 'diablo_admin',
    'depends_on_past': False,
    'start_date': datetime(2019, 8, 8),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
    'trigger_rule': 'none_skipped'
    # 'schedule_interval': '0 * * * *'
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
dag = DAG('testDynDags', default_args=default_args, schedule_interval='*/1 * * * *')
# schedule_interval='*/1 * * * *'
check_for_dags = PythonOperator(dag=dag,
                                task_id='tst_dyn_dag',
                                provide_context=True,
                                python_callable=fetch_new_dags
                                )
check_for_dags
I expected 10 DAGs to be created dynamically and added to the scheduler.
I guess doing the following would fix it:
Completely remove the global testDynDags dag and the tst_dyn_dag task (instantiation and invocation).
Invoke your fetch_new_dags(..) method with the requisite arguments in global scope.
Explanation
Dynamic dags / tasks merely means that you have a well-defined logic at the time of writing dag-definition file that can help create tasks / dags having a known structure in a pre-defined fashion.
You can NOT determine the structure of your DAG at runtime (task execution). So, for instance, you cannot add n identical tasks to your DAG if the upstream task returned an integer value n. But you can iterate over a YAML file containing n segments and generate n tasks / dags.
So clearly, wrapping dag-generation code inside an Airflow task itself makes no sense.
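To make the point concrete, here is a minimal sketch of generating the objects while the file is being parsed rather than inside a task. FakeDag is a hypothetical stand-in for airflow.DAG so that the snippet runs without Airflow, and the final scan is only a rough model of how module-level DAG objects get discovered.

```python
class FakeDag:
    """Hypothetical stand-in for airflow.DAG so the sketch runs without Airflow."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def create_dag(dag_id):
    return FakeDag(dag_id)


# This loop runs at module level, i.e. when the file is parsed,
# not when some task executes.
for n in range(1, 10):
    dag_id = f"abcd_{n}"
    globals()[dag_id] = create_dag(dag_id)

# Rough model of discovery: collect module-level names bound to DAG objects.
discovered = sorted(
    name for name, obj in list(globals().items()) if isinstance(obj, FakeDag)
)
```

Because the loop is at module level, every parse of the file registers the objects. In the question's version the same assignment to globals() only happens inside a running task, in a different process from the one that lists DAGs.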
UPDATE-1
From what is indicated in the comments, I infer that the requirement is to revise the external source that feeds inputs (how many DAGs or tasks to create) to your DAG / task-generation script. While this is indeed a complex use case, a simple way to achieve it is to create 2 separate DAGs:
One DAG runs once in a while and generates the inputs, which are stored in an external resource such as an Airflow Variable (or any other external store like a file / S3 / database etc.).
The second DAG is constructed programmatically by reading that same data source, which was written by the first DAG.
You can take inspiration from the Adding DAGs based on Variable value section
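The two-DAG arrangement described above can be sketched in plain Python with a JSON file playing the role of the external store. Everything here is an assumption for illustration: the helper names, the file name dag_inputs.json, and the placeholder dicts that stand in for real DAG objects.

```python
import json
import tempfile
from pathlib import Path


def write_dag_inputs(store, dag_ids):
    """Role of the first DAG: persist which DAGs should exist (hypothetical helper)."""
    Path(store).write_text(json.dumps(dag_ids))


def build_dags_from_store(store):
    """Role of the second file: read the store at parse time, one entry per id."""
    path = Path(store)
    dag_ids = json.loads(path.read_text()) if path.exists() else []
    # Placeholder dicts stand in for real DAG objects.
    return {dag_id: {"dag_id": dag_id} for dag_id in dag_ids}


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "dag_inputs.json"
    before = build_dags_from_store(store)          # nothing written yet
    write_dag_inputs(store, ["abcd_1", "abcd_2"])  # the first DAG has run
    after = build_dags_from_store(store)           # next parse picks up the ids
```

The generating side never creates DAG objects itself. It only changes the data that the DAG-definition file reads the next time it is parsed.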

Airflow schedule getting skipped if previous task execution takes more time

I have two tasks in my Airflow DAG. One triggers an API call (HTTP operator) and the other keeps checking its status using another API (HTTP sensor). This DAG is scheduled to run every hour, at 10 minutes past the hour. But sometimes one execution can take a long time to finish, for example 20 hours. In such cases, none of the schedules that fall due while the previous run is still in progress are executed.
For example, if the job at 01:10 takes 10 hours to finish, the schedules at 02:10, 03:10, 04:10, ... 11:10 etc., which are supposed to run, get skipped and only the one at 12:10 is executed.
I am using local executor. I am running airflow server & scheduler using below script.
start_server.sh
export AIRFLOW_HOME=./airflow_home;
export AIRFLOW_GPL_UNIDECODE=yes;
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
airflow initdb;
airflow webserver -p 7200;
start_scheduler.sh
export AIRFLOW_HOME=./airflow_home;
# Connection string for connecting to REST interface server
export AIRFLOW_CONN_REST_API=http://localhost:5000;
export AIRFLOW_CONN_MANAGEMENT_API=http://localhost:8001;
#export AIRFLOW__SMTP__SMTP_PASSWORD=**********;
airflow scheduler;
my_dag_file.py
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(2),
    'email': admin_email_ids,
    'email_on_failure': False,
    'email_on_retry': False
}
DAG_ID = 'reconciliation_job_pipeline'
MANAGEMENT_RES_API_CONNECTION_CONFIG = 'management_api'
DA_REST_API_CONNECTION_CONFIG = 'rest_api'
recon_schedule = Variable.get('recon_cron_expression',"10 * * * *")
dag = DAG(DAG_ID, max_active_runs=1, default_args=default_args,
          schedule_interval=recon_schedule,
          catchup=False)
dag.doc_md = __doc__
spark_job_end_point = conf['sip_da']['spark_job_end_point']
fetch_index_record_count_config_key = conf['reconciliation'][
    'fetch_index_record_count']
fetch_index_record_count = SparkJobOperator(
    job_id_key='fetch_index_record_count_job',
    config_key=fetch_index_record_count_config_key,
    exec_id_req=False,
    dag=dag,
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_job',
    data={},
    method='POST',
    endpoint=spark_job_end_point,
    headers={
        "Content-Type": "application/json"}
)
job_endpoint = conf['sip_da']['job_resource_endpoint']
fetch_index_record_count_status_job = JobStatusSensor(
    job_id_key='fetch_index_record_count_job',
    http_conn_id=DA_REST_API_CONNECTION_CONFIG,
    task_id='fetch_index_record_count_status_job',
    endpoint=job_endpoint,
    method='GET',
    request_params={'required': 'status'},
    headers={"Content-Type": "application/json"},
    dag=dag,
    poke_interval=15
)
fetch_index_record_count >> fetch_index_record_count_status_job
SparkJobOperator & JobStatusSensor are my custom classes extending SimpleHttpOperator & HttpSensor.
If I set depends_on_past to True, will it work as expected? Another problem I have with this option is that sometimes the status check job will fail, but the next schedule should still get triggered. How can I achieve this behavior?
I think the main point here is that you have set catchup=False. With that setting the Airflow scheduler skips the runs that fell due in the meantime, which gives the behavior you describe.
It sounds like you need to perform catchup when the previous run takes longer than expected. You can try changing it to catchup=True.
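The difference between the two settings can be modelled with a few lines of date arithmetic. This is a simplified sketch under stated assumptions, not the scheduler's actual logic: runs_to_create is a hypothetical helper, the calendar date is arbitrary, and the model ignores details such as how Airflow assigns execution dates to intervals.

```python
from datetime import datetime, timedelta


def runs_to_create(last_run, now, interval, catchup):
    """List schedule points after last_run up to now (simplified model)."""
    pending = []
    point = last_run + interval
    while point <= now:
        pending.append(point)
        point += interval
    # Without catchup only the most recent schedule point is kept.
    return pending if catchup else pending[-1:]


# The example from the question: the 01:10 run blocks everything until 12:10.
last_run = datetime(2019, 1, 1, 1, 10)
now = datetime(2019, 1, 1, 12, 10)
hour = timedelta(hours=1)
with_catchup = runs_to_create(last_run, now, hour, catchup=True)
without_catchup = runs_to_create(last_run, now, hour, catchup=False)
```

In this model catchup=True yields every hourly point from 02:10 through 12:10, which with max_active_runs=1 would then be worked through one at a time, while catchup=False yields only the 12:10 point.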
