How to connect HDFS in Airflow? - airflow

How to perform HDFS operations in Airflow?
Make sure you install the following Python package:
pip install apache-airflow-providers-apache-hdfs
#Code Snippet
#Import packages
from airflow import DAG, settings
from airflow.models import Connection
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.bash import BashOperator

#Define a new DAG
dag_execute_hdfs_commands = DAG(
    dag_id='connect_hdfs',
    schedule_interval='@once',
    start_date=days_ago(1),
    dagrun_timeout=timedelta(minutes=60),
    description='executing hdfs commands',
)
#Establish a connection to HDFS
conn = Connection(
    conn_id='webhdfs_default1',
    conn_type='HDFS',
    host='localhost',
    login='usr_id',
    password='password',
    port=9000,
)
session = settings.Session()
#The following lines add the new connection to your Airflow metadata DB.
#Make sure that once the DAG runs successfully you comment them out,
#because we do not want to add the same connection "webhdfs_default1" every time we perform HDFS operations.
session.add(conn)  #On your next run comment this out
session.commit()   #Without a commit the pending connection is discarded when the session closes
session.close()
if __name__ == '__main__':
    dag_execute_hdfs_commands.cli()
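If you would rather not comment out session.add(conn) after the first run, one option is to guard the insert so it only happens when the connection does not exist yet; a minimal sketch (my own addition, not part of the original answer):
from airflow import settings
from airflow.models import Connection

session = settings.Session()
# Only insert the connection if no connection with this conn_id exists yet,
# so repeated DAG-file parses do not create duplicates.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()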
Once the above DAG runs successfully, you can perform HDFS operations hereafter.
For example, if you wish to list files in an HDFS directory, try the following code:
#File listing operation
start_task = BashOperator(
    task_id="start_task",
    bash_command="hdfs dfs -ls /",
    dag=dag_execute_hdfs_commands,
)
start_task

You cannot use the connection webhdfs_default with BashOperator; it works with the WebHDFSHook hook, which creates a client to query the WebHDFS server. Currently there are two implemented methods:
check_for_path: to check if a file exists in hdfs
load_file: to upload a file to hdfs
You can access the client to do other operations:
webHDFS_hook = WebHDFSHook(webhdfs_conn_id="<your conn id>")
client = webHDFS_hook.get_conn()
client.<operation>
The client is an instance of hdfs.InsecureClient if the config core.security is not kerberos, and hdfs.ext.kerberos.KerberosClient if it is. Here is the documentation of the hdfs CLI clients; you can check which operations are available and use them.
There are a lot of available operations, like download, delete, list, read, make_dir, ..., which you can call in a new Airflow operator.
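For example, a listing task can be built on top of the hook's client; here is a minimal sketch (the connection id webhdfs_default1 and the DAG object are reused from the first answer, and wrapping it in a PythonOperator is my own illustration):
from airflow.operators.python import PythonOperator
from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook

def list_hdfs_dir(path="/"):
    # Reuse the WebHDFS connection configured earlier (hypothetical conn id)
    client = WebHDFSHook(webhdfs_conn_id="webhdfs_default1").get_conn()
    # client is an hdfs.InsecureClient (or KerberosClient); list() is one of its operations
    for entry in client.list(path):
        print(entry)

list_hdfs_root = PythonOperator(
    task_id="list_hdfs_root",
    python_callable=list_hdfs_dir,
    dag=dag_execute_hdfs_commands,
)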

Related

Cannot add JDBC driver in Sqoop command when running import command using Airflow 2.5.0

I am running a Sqoop import command which imports a table from a MySQL DB and loads it into HDFS. I have created the below DAG which performs this activity.
from airflow.models import DAG
from airflow.contrib.operators.sqoop_operator import SqoopOperator
from airflow.utils.dates import days_ago

Dag_Sqoop_Import = DAG(dag_id="SqoopImport",
                       schedule_interval="* * * * *",
                       start_date=days_ago(2))

sqoop_mysql_import = SqoopOperator(conn_id="sqoop_local",
                                   table="shipmethod",
                                   cmd_type="import",
                                   target_dir="/airflow_sqoopImport",
                                   num_mappers=1,
                                   task_id="SQOOP_Import",
                                   dag=Dag_Sqoop_Import)

sqoop_mysql_import
I have also created a SqoopImport connection in Airflow as below.
But when I trigger the job, I assume it should run the below command:
sqoop import --connect jdbc:mysql://192.168.0.15:3306/adventureworks?characterEncoding=latin1 --driver com.mysql.jdbc.Driver --username xxxx --password xxxxxx --autoreset-to-one-mapper --table workorder --target-dir /user/adminn/workorder
But when I check the logs, it is actually running the below command:
Executing command: sqoop import --username xxxx --password MASKED --num-mappers 1 --connect jdbc:mysql://192.168.0.15:3306/adventureworks?characterEncoding=latin1 --target-dir /airflow_sqoopImport --as-textfile --table shipmethod
And the DAG fails giving the below error. I know the cause of this error: I need to add the --driver com.mysql.jdbc.Driver parameter, which would solve it, but I am struggling to add it. Can you please let me know where I am going wrong?
ERROR manager.SqlManager: Error reading from database: java.sql.SQLException: Streaming result set com.mysql.jdbc.RowDataDynamic@5906ebcb is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.
Replies Appreciated, thanks.
You should provide the driver class as an argument to the operator, not in the connection:
sqoop_mysql_import = SqoopOperator(conn_id="sqoop_local",
                                   table="shipmethod",
                                   cmd_type="import",
                                   target_dir="/airflow_sqoopImport",
                                   driver="com.mysql.jdbc.Driver",
                                   num_mappers=1,
                                   task_id="SQOOP_Import",
                                   dag=Dag_Sqoop_Import)

Airflow: Can operators run on an external service, and communicate with Airflow to update the progress of the DAG?

I'm currently experimenting with a new concept where an operator delegates its work to an external service instead of running locally, and the external service can communicate with Airflow to update the progress of the DAG.
For example, let's say we have a bash operator:
bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo \"This Message Shouldn't Run Locally on Airflow\"",
)
That is part of a DAG:
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime

# dag_id and start_date are placeholders added so the snippet is valid
with DAG(dag_id="external_service_dag", start_date=datetime(2022, 1, 1)) as dag:
    t1 = BashOperator(
        task_id="bash_task1",
        bash_command="echo \"t1:This Message Shouldn't Run Locally on Airflow\""
    )
    t2 = BashOperator(
        task_id="bash_task2",
        bash_command="echo \"t2:This Message Shouldn't Run Locally on Airflow\""
    )
    t1 >> t2
Is there a method in the Airflow code that will allow an external service to tell the DAG that t1 has started/completed and that t2 has started/completed, without actually running the DAG on the Airflow instance?
Airflow has a concept of Executors, which are responsible for running tasks, sometimes via or on external services such as Kubernetes, Dask, or a Celery cluster.
https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html
The worker process communicates back to Airflow, often via the metadata DB, about the progress of the task.
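As an illustration, an external service can also read a task's progress through the Airflow 2 stable REST API instead of touching the metadata DB directly; a minimal sketch (the base URL, credentials and ids are placeholders, and this assumes the stable API with a basic-auth backend is enabled):
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"  # placeholder base URL

def get_task_state(dag_id, run_id, task_id):
    # Read the state of a single task instance ("queued", "running", "success", ...)
    resp = requests.get(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns/{run_id}/taskInstances/{task_id}",
        auth=("airflow", "airflow"),  # placeholder credentials
    )
    resp.raise_for_status()
    return resp.json()["state"]

print(get_task_state("my_dag", "manual__2022-01-01T00:00:00+00:00", "bash_task1"))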

Airflow Only works with the Celery, CeleryKubernetes or Kubernetes executors

I've got this DAG; nevertheless, when trying to run it, it gets stuck in the Queued state. When I then try to run it manually I get an error:
Error:
Only works with the Celery, CeleryKubernetes or Kubernetes executors
Code:
from airflow import DAG
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.operators.python import PythonOperator
from datetime import datetime

def helloWorld():
    print('Hello World')

def take_clients():
    hook = PostgresHook(postgres_conn_id="postgres_robert")
    df = hook.get_pandas_df(sql="SELECT * FROM clients;")
    print(df)
    # do what you need with the df....

with DAG(dag_id="test",
         start_date=datetime(2021, 1, 1),
         schedule_interval="@once",
         catchup=False) as dag:
    task1 = PythonOperator(
        task_id="hello_world",
        python_callable=helloWorld)
    task2 = PythonOperator(
        task_id="get_clients",
        python_callable=take_clients)
    task1 >> task2
I guess you are trying to use the Run button from the UI.
This button is enabled only for executors that support it.
In your Airflow setup you are using an executor that doesn't support this command.
In newer Airflow versions the button is simply disabled if you are using an executor that doesn't support it.
I assume that what you are after is to create a new run; in that case you should use the Trigger Run button. If you are looking to re-run a specific task, then use the Clear button.
You are running it with the LocalExecutor; you have to change your executor to the Celery, CeleryKubernetes, Kubernetes or Dask executor.
If you are using docker-compose, add:
AIRFLOW__CORE__EXECUTOR: CeleryExecutor
Otherwise, see the Airflow Executor documentation.

SFTP with Google Cloud Composer

I need to upload a file via SFTP into an external server through Cloud Composer. The code for the task is as follows:
from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

def make_sftp():
    import paramiko
    import pysftp
    import os
    from airflow.contrib.hooks.ssh_hook import SSHHook
    import subprocess
    ssh_hook = SSHHook(ssh_conn_id="conn_id")
    sftp_client = ssh_hook.get_conn().open_sftp()
    return 0

etl_dag = DAG("dag_test",
              start_date=datetime.now(tz=local_tz),  # local_tz is defined elsewhere in the original code
              schedule_interval=None,
              default_args={
                  "owner": "airflow",
                  "depends_on_past": False,
                  "email_on_failure": False,
                  "email_on_retry": False,
                  "retries": 5,
                  "retry_delay": timedelta(minutes=5)})

sftp = PythonVirtualenvOperator(task_id="sftp",
                                python_callable=make_sftp,
                                requirements=["sshtunnel", "paramiko"],
                                dag=etl_dag)

start_pipeline = DummyOperator(task_id="start_pipeline", dag=etl_dag)

start_pipeline >> sftp
In "conn_id" I have used the following options: {"no_host_key_check": "true"}, the DAG runs for a couple of seconds and the fail with the following message:
WARNING - Remote Identification Change is not verified. This wont protect against Man-In-The-Middle attacks
[2022-02-10 10:01:59,358] {ssh_hook.py:171} WARNING - No Host Key Verification. This wont protect against Man-In-The-Middle attacks
Traceback (most recent call last):
  File "/tmp/venvur4zvddz/script.py", line 23, in <module>
    res = make_sftp(*args, **kwargs)
  File "/tmp/venvur4zvddz/script.py", line 19, in make_sftp
    sftp_client = ssh_hook.get_conn().open_sftp()
  File "/usr/local/lib/airflow/airflow/contrib/hooks/ssh_hook.py", line 194, in get_conn
    client.connect(**connect_kwargs)
  File "/opt/python3.6/lib/python3.6/site-packages/paramiko/client.py", line 412, in connect
    server_key = t.get_remote_server_key()
  File "/opt/python3.6/lib/python3.6/site-packages/paramiko/transport.py", line 834, in get_remote_server_key
    raise SSHException("No existing session")
paramiko.ssh_exception.SSHException: No existing session
Do I have to set other options? Thank you!
Configuring the SSH connection with key pair authentication
To SSH into the host as a user with username "user_a", an SSH key pair should be generated for that user and the public key should be added to the host machine. The following are the steps to create an SSH connection for the "user_a" user, which has the required write permissions.
Run the following commands on the local machine to generate the required SSH key:
ssh-keygen -t rsa -f ~/.ssh/sftp-ssh-key -C user_a
“sftp-ssh-key” → Name of the pair of public and private keys (Public key: sftp-ssh-key.pub, Private key: sftp-ssh-key)
“user_a” → User in the VM that we are trying to connect to
chmod 400 ~/.ssh/sftp-ssh-key
Now, copy the contents of the public key sftp-ssh-key.pub into ~/.ssh/authorized_keys of your host system. Check for necessary permissions for authorized_keys and grant them accordingly using chmod.
I tested the setup with a Compute Engine VM. In the Compute Engine console, edit the VM settings to add the contents of the generated SSH public key into the instance metadata. Detailed instructions can be found here. If you are connecting to a Compute Engine VM, make sure that the instance has the appropriate firewall rule to allow the SSH connection.
Upload the private key to the client machine. In this scenario, the client is the Airflow DAG so the key file should be accessible from the Composer/Airflow environment. To make the key file accessible, it has to be uploaded to the GCS bucket associated with the Composer environment. For example, if the private key is uploaded to the data folder in the bucket, the key file path would be /home/airflow/gcs/data/sftp-ssh-key.
Configuring the SSH connection with password authentication
If password authentication is not configured on the host machine, follow the below steps to enable password authentication.
Set the user password using the below command and enter the new password twice.
sudo passwd user_a
To enable SSH password authentication, you must SSH into the host machine as root to edit the sshd_config file.
/etc/ssh/sshd_config
Then, change the line PasswordAuthentication no to PasswordAuthentication yes. After making that change, restart the SSH service by running the following command as root.
sudo service ssh restart
Password authentication has been configured now.
Creating connections and uploading the DAG
1.1 Airflow connection with key authentication
Create a connection in Airflow with the below configuration or use the existing connection.
Extra field
The Extra JSON dictionary would look like this. Here, we have uploaded the private key file to the data folder in the Composer environment's GCS bucket.
{
    "key_file": "/home/airflow/gcs/data/sftp-ssh-key",
    "conn_timeout": "30",
    "look_for_keys": "false"
}
1.2 Airflow connection with password authentication
If the host machine is configured to allow password authentication, these are the changes to be made in the Airflow connection.
The Extra parameter can be empty.
The Password parameter is user_a's password on the host machine.
The task logs show that the password authentication was successful.
INFO - Authentication (password) successful!
Upload the DAG to the Composer environment and trigger the DAG. I was facing a key validation issue with the latest version of the paramiko library (2.9.2). I tried downgrading paramiko, but the older versions do not seem to support OPENSSH keys. I found an alternative, paramiko-ng, in which the validation issue has been fixed, and changed the Python dependency from paramiko to paramiko-ng in the PythonVirtualenvOperator.
from airflow import DAG
from airflow.operators.python_operator import PythonVirtualenvOperator
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime, timedelta

def make_sftp():
    import paramiko
    from airflow.contrib.hooks.ssh_hook import SSHHook
    ssh_hook = SSHHook(ssh_conn_id="sftp_connection")
    sftp_client = ssh_hook.get_conn().open_sftp()
    print("=================SFTP Connection Successful=================")
    remote_host = "/home/sftp-folder/sample_sftp_file"      # file path in the host system
    local_host = "/home/airflow/gcs/data/sample_sftp_file"  # file path in the client system
    sftp_client.get(remote_host, local_host)                # GET operation to copy the file from host to client
    sftp_client.close()
    return 0

etl_dag = DAG("sftp_dag",
              start_date=datetime.now(),
              schedule_interval=None,
              default_args={
                  "owner": "airflow",
                  "depends_on_past": False,
                  "email_on_failure": False,
                  "email_on_retry": False,
                  "retries": 5,
                  "retry_delay": timedelta(minutes=5)})

sftp = PythonVirtualenvOperator(task_id="sftp",
                                python_callable=make_sftp,
                                requirements=["sshtunnel", "paramiko-ng", "pysftp"],
                                dag=etl_dag)

start_pipeline = DummyOperator(task_id="start_pipeline", dag=etl_dag)

start_pipeline >> sftp
Results
The sample_sftp_file has been copied from the host system to the specified Composer bucket.
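Since the original question was about uploading a file rather than downloading one, the same client can be used in the other direction; a minimal sketch (my own addition, with placeholder paths and the same assumed "sftp_connection" id):
def upload_via_sftp():
    from airflow.contrib.hooks.ssh_hook import SSHHook
    ssh_hook = SSHHook(ssh_conn_id="sftp_connection")
    sftp_client = ssh_hook.get_conn().open_sftp()
    local_path = "/home/airflow/gcs/data/sample_file_to_upload"  # placeholder path in the Composer bucket
    remote_path = "/home/sftp-folder/sample_file_to_upload"      # placeholder path on the SFTP server
    sftp_client.put(local_path, remote_path)                     # PUT operation copies from client to host
    sftp_client.close()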

Airflow 1.10.3 SubDag can only run 1 task in parallel even the concurrency is 8

Recently, I upgraded Airflow from 1.9 to 1.10.3 (the latest one).
However, I noticed a performance issue related to SubDag concurrency. Only 1 task inside the SubDag can be picked up, which is not the way it should be; our concurrency setting for the SubDag is 8.
See the following:
get_monthly_summary-214 and get_monthly_summary-215 are the two SubDags; they can run in parallel, controlled by the parent DAG's concurrency.
But when zooming into one of the SubDags, say get_monthly_summary-214:
You can definitely see that there is only 1 task running at a time, the others are queued, and it keeps running this way. When we check the SubDag concurrency, it is actually 8, as we specified in the code:
We did set up the pool slot size (it is 32), we have 8 Celery workers to pick up the queued tasks, and our Airflow config associated with concurrency is as follows:
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
worker_concurrency = 16
Also, all the SubDags are configured to use a queue called mini, while all their inner tasks use the default queue called default, since we had some deadlock problems before when running both the SubDag operators and the SubDag inner tasks on the same queue. I also tried to use the default queue for all the tasks and operators; it does not help.
The old version 1.9 seems to be fine in that each SubDag can execute multiple tasks in parallel. Did we miss anything?
Based on the finding of @kaxil posted below, a workaround solution, if you still would like to execute tasks inside a SubDag in parallel, is creating a wrapper function to explicitly pass the executor when constructing the SubDagOperator:
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors import GetDefaultExecutor

def sub_dag_operator_with_default_executor(subdag, *args, **kwargs):
    return SubDagOperator(subdag=subdag, executor=GetDefaultExecutor(), *args, **kwargs)
Call sub_dag_operator_with_default_executor when you create your SubDag operator. To address the SubDag operator performance concerns noted in that change:
We should change the default executor for subdag_operator to SequentialExecutor. Airflow pool is not honored by subdagoperator, hence it could consume all the worker resources(e.g in celeryExecutor). This causes issues mentioned in airflow-74 and limits the subdag_operator usage. We use subdag_operator in production by specifying using sequential executor.
We suggest creating a special queue (we specify queue='mini' in our case) and a dedicated Celery worker to handle the subdag_operator, so that it does not consume all your normal Celery workers' resources. As follows:
dag = DAG(
    dag_id=DAG_NAME,
    description=f"{DAG_NAME}-{__version__}",
    ...
)
with dag:
    ur_operator = sub_dag_operator_with_default_executor(
        task_id="your_task_id",
        subdag=load_sub_dag(
            parent_dag_name=DAG_NAME,
            child_dag_name="your_child_dag_name",
            args=args,
            concurrency=dag_config.get("concurrency_in_sub_dag") or DEFAULT_CONCURRENCY,
        ),
        queue="mini",
        dag=dag
    )
Then, when you create your special Celery worker (we are using a lightweight host, like 2 cores and 3 GB of memory), specify AIRFLOW__CELERY__DEFAULT_QUEUE as mini. Depending on how many SubDag operators you would like to run in parallel, you should create multiple special Celery workers to load-balance the resources; we suggest that each special Celery worker take care of at most 2 SubDag operators at a time, or it will be exhausted (e.g., run out of memory on a 2-core, 3 GB host).
Also, you can adjust the concurrency inside your SubDag via the Airflow Variable concurrency_in_sub_dag created in the Airflow UI Variables configuration page.
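Reading that value inside the DAG file could look like the following minimal sketch (the variable name matches the snippet above; the fallback default is my own assumption):
from airflow.models import Variable

# Read the SubDag concurrency from the Airflow Variable set in the UI, with a fallback default
DEFAULT_CONCURRENCY = 8  # assumed fallback value
concurrency_in_sub_dag = int(Variable.get("concurrency_in_sub_dag", default_var=DEFAULT_CONCURRENCY))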
Update [22/05/2020]: the wrapper above only works for Airflow >= 1.10.0 and <= 1.10.3.
For Airflow beyond 1.10.3, please use
from airflow.executors import get_default_executor
instead.
That is because in Airflow 1.9.0, the default executor was used by SubDagOperator.
Airflow 1.9.0:
https://github.com/apache/airflow/blob/1.9.0/airflow/operators/subdag_operator.py#L33
class SubDagOperator(BaseOperator):
    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=GetDefaultExecutor(),
            *args, **kwargs):
However, from Airflow 1.10 onwards, the default executor for SubDagOperator was changed to SequentialExecutor.
Airflow >=1.10:
https://github.com/apache/airflow/blob/1.10.0/airflow/operators/subdag_operator.py#L38
class SubDagOperator(BaseOperator):
    template_fields = tuple()
    ui_color = '#555'
    ui_fgcolor = '#fff'

    @provide_session
    @apply_defaults
    def __init__(
            self,
            subdag,
            executor=SequentialExecutor(),
            *args, **kwargs):
The commit that changed it is https://github.com/apache/airflow/commit/64d950166773749c0e4aa0d7032b080cadd56a53#diff-45749879e4753a355c5bdb5203584698
And the detailed reason it was changed can be found in https://github.com/apache/airflow/pull/3251
We should change the default executor for subdag_operator to SequentialExecutor. Airflow pool is not honored by subdagoperator, hence it could consume all the worker resources(e.g in celeryExecutor). This causes issues mentioned in airflow-74 and limits the subdag_operator usage. We use subdag_operator in production by specifying using sequential executor.
Thanks!
I changed the code a little bit for the latest Airflow (1.10.5), since GetDefaultExecutor is not working anymore:
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors.celery_executor import CeleryExecutor

def sub_dag_operator_with_celery_executor(subdag, *args, **kwargs):
    return SubDagOperator(subdag=subdag, executor=CeleryExecutor(), *args, **kwargs)
Thanks to @kaxil and @kevin-li for their answers. They served as the foundation for the below. The simplest way to solve this is to skip the wrapper function and call the SubDagOperator directly within the DAG flow (in my opinion it improves readability a tad). Please note the below should still be treated as pseudo code, but it should provide guidance on the pattern needed to scale without consuming all workers with a large-scale SubDag:
# The below works for Airflow versions above 1.10.3. See @kevin-li's answer for details on lower versions.
from airflow.executors import get_default_executor
from airflow.models import DAG
from datetime import datetime
from airflow.operators.subdag_operator import SubDagOperator

dag = DAG(
    dag_id="special_dag_with_sub",
    schedule_interval="5 4 * * *",
    start_date=datetime(2021, 6, 1),
    concurrency=concurrency  # pseudo code: the parent DAG concurrency of your choice
)

with dag:
    subdag_queue = "subdag_queue"
    operator_var = SubDagOperator(
        task_id="your_task_id",
        subdag=special_sub_dag(
            parent_dag_name=dag.dag_id,
            child_dag_name="your_child_dag_name",
            queue=subdag_queue,
            concurrency=DAG_CONCURRENCY_VALUE_OF_YOUR_CHOICE_HERE,
            args=args,
        ),
        executor=get_default_executor(),
        queue=subdag_queue,
        dag=dag
    )
While having the SubDagOperator owned by a specific worker queue is important, I would argue it's also important to pass the queue to the tasks within it. That can be done like the following:
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def special_sub_dag(parent_dag_name, child_dag_name, concurrency, queue, *args):
    dag = DAG(
        dag_id=f"{parent_dag_name}.{child_dag_name}",
        schedule_interval="5 4 * * *",
        start_date=datetime(2021, 6, 1),
        concurrency=concurrency,
    )
    do_this = PythonOperator(
        task_id="do_this",
        dag=dag,
        python_callable=lambda: "hello world",
        queue=queue,
    )
    then_this = DummyOperator(
        task_id="then_this",
        dag=dag,
        queue=queue,
    )
    do_this >> then_this
    return dag  # the factory must return the sub-DAG so the SubDagOperator above can use it
The above approach is working for one of our larger-scale DAGs (Airflow 1.10.12), so please let me know if there are issues in implementing it.
