I'm learning Airflow and I'm trying to understand how connections works.
I have a first dag with the following code:
c = Connection(
conn_id='aws_credentials',
conn_type='Amazon Web Services',
login='xxxxxxxx',
password='xxxxxxxxx'
)
def list_keys():
hook = S3Hook(aws_conn_id=c.conn_id)
logging.info(f"Listing Keys from {bucket}/{prefix}")
keys = hook.list_keys(bucket, prefix=prefix)
for key in keys:
logging.info(f"- s3://{bucket}/{key}")
In this case It's working fine. The connection is well passed to the S3Hook.
Then I have a second dag:
redshift_connection = Connection(
conn_id='redshift',
conn_type='postgres',
login='duser',
password='xxxxxxxxxx',
host='xxxxxxxx.us-west-2.redshift.amazonaws.com',
port=5439,
schema='db'
)
aws_connection = Connection(
conn_id='aws_credentials',
conn_type='Amazon Web Services',
login='xxxxxxxxx',
password='xxxxxxxx'
)
def load_data_to_redshift(*args, **kwargs):
aws_hook = AwsHook(aws_connection.conn_id)
credentials = aws_hook.get_credentials()
redshift_hook = PostgresHook(redshift_connection.conn_id)
sql_stmnt = sql_statements.COPY_STATIONS_SQL.format(aws_connection.login, aws_connection.password)
redshift_hook.run(sql_stmnt)
dag = DAG(
's3_to_Redshift',
start_date=datetime.datetime.now()
)
create_table = PostgresOperator(
task_id='create_table',
postgres_conn_id=redshift_connection.conn_id,
sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
dag=dag
)
This dag return me the following error: The conn_idredshiftisn't defined
Why is that? What are the differences between my first and second dag? Why the connection does seems to work in the first example and not in the second situation?
Thanks.
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook it successfully retrieves the credentials (from the Airflow's database backend, not from the Connection object that you created).
But, you did not create a connection with the ID redshift, therefore, AwsHook complains that it is not defined. You have to create the connection as described in the documentation first.
Note: The reason for not defining connections in the DAG code is that the DAG code is usually stored in a version control system (e.g., Git). And it would be a security risk to store credentials there.
Related
Is there a way to create the Mount objects of DockerOperator dynamically so that I can use the filesystem connection stored in Airflow db? I don't want to change the dag code if the connection changes.
At the moment I need to hardcode the paths like this
incoming_path = "/incoming/XYZ"
output_path = "/output_path/ABC"
input_mount = {"source": incoming_path,
"target": incoming_path,
"type": "bind"}
output_mount = {"source": output_path,
"target": output_path,
"type": "bind"}
create_stuff = DockerOperator(
task_id = f'create_stuff',
user = 1234,
queue = 'default',
image = 'image_name',
api_version='auto',
auto_remove=True,
mount_tmp_dir=False,
mounts=[Mount(**input_mount),
Mount(**output_mount)],
environment={
'AF_EXECUTION_DATE': "{{ ds }}",
'AF_OWNER': "{{ task.owner }}",
},
command = f"do stuff",
entrypoint='',
docker_url='unix://var/run/docker.sock',
network_mode='bridge'
)
I tried to use the FSHook outside an Operator, but then it returns an empty string when I call
with DAG(...) as dag:
...
#THIS WORKS
#task
def task1():
incoming_hook = FSHook('fs_incoming')
incoming_path = incoming_hook.get_path()
...
#THIS RETURNS AN EMPTY STRING
incoming_hook = FSHook('fs_incoming')
incoming_path = incoming_hook.get_path()
So another phrasing for the question would be is there a way to get the path from the connection outside an Operator?
I'm using Airflow 2.4.1
Given that you just need to share a common string between multiple DAGs, I'd recommend either using an Airflow Variable or using some kind of shared config file in your /dags directory. You can find more detail in this answer.
Forcing the use of a hook here is unnecessary as Airflow Connections are "used for storing credentials and other information necessary for connecting to external services". I don't believe that applies here, but if you did need to use a Connection, you wouldn't need a Hook; you could just use the Connection API to access Connection properties.
Hooks give you additional functionality for actually interacting with external systems. FSHook would allow you to actually interact with a filesystem rather than just share the path value across DAGs.
I need to to trigger a Data Fusion pipeline located on a GCP project called myDataFusionProject through a Data Fusion operator (CloudDataFusionStartPipelineOperator) inside a DAG whose Cloud Composer instance is located on another project called myCloudComposerProject.
I have used the official documentation as well as the source code to write the code that roughly resembles the below snippet:
LOCATION = "someLocation"
PIPELINE_NAME = "myDataFusionPipeline"
INSTANCE_NAME = "myDataFusionInstance"
RUNTIME_ARGS = {"output.instance":"someOutputInstance", "input.dataset":"someInputDataset", "input.project":"someInputProject"}
start_pipeline = CloudDataFusionStartPipelineOperator(
location=LOCATION,
pipeline_name=PIPELINE_NAME,
instance_name=INSTANCE_NAME,
runtime_args=RUNTIME_ARGS,
task_id="start_pipeline",
)
My issue is that, every time I trigger the DAG, Cloud Composer looks for myDataFusionInstance inside myCloudComposerProject instead of myDataFusionProject, which gives an error like this one:
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://datafusion.googleapis.com/v1beta1/projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance?alt=json returned "Resource 'projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance' was not found". Details: "[{'#type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance'}]"
So the question is: how can I force my operator to use the Data Fusion project instead of the Cloud Composer project? I suspect I may do that by adding a new runtime argument but I'm not sure how to do that.
Last piece of information: the Data Fusion pipeline simply extracts data from a BigQuery source and sends everything to a BigTable sink.
As a recommendation while developing operators on airflow, we should check the classes that are implementing the operators as documentation may lack some information due to versioning.
As commented, if you check CloudDataFusionStartPipelineOperator you will find that it makes use of a hook that gets the instance base on a project-id. This project-id its optional, so you can add your own project-id.
class CloudDataFusionStartPipelineOperator(BaseOperator):
...
def __init__(
...
project_id: Optional[str] = None, ### NOT MENTION IN THE DOCUMENTATION
...
) -> None:
...
self.project_id = project_id
...
def execute(self, context: dict) -> str:
...
instance = hook.get_instance(
instance_name=self.instance_name,
location=self.location,
project_id=self.project_id, ### defaults your project-id
)
api_url = instance["apiEndpoint"]
...
Adding the parameter to your operator call should fix your issue.
start_pipeline = CloudDataFusionStartPipelineOperator(
location=LOCATION,
pipeline_name=PIPELINE_NAME,
instance_name=INSTANCE_NAME,
runtime_args=RUNTIME_ARGS,
project_id=PROJECT_ID,
task_id="start_pipeline",
)
As a final note, besides the official documentation site you can also explore the files of apache airflow on github.
I'm trying to understand how to pass values via airflow xcom functionality. The specific usecase I am trying to build is to write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt
DAG = DAG(
dag_id='xcom_test_dag',
start_date=dt.datetime.now(),
schedule_interval='#once'
)
def push_function(**context):
file_name = 'test_file_{date}'.format(date=dt.datetime.now())
return context['task_instance'].xcom_push(key='filename', value=file_name)
def pull_function(**context):
dir(context['task_instance'].xcom_pull())
push_task = PythonOperator(
task_id='push_task',
python_callable=push_function,
provide_context=True,
dag=DAG)
pull_task = PythonOperator(
task_id='pull_task',
python_callable=pull_function,
provide_context=True,
dag=DAG)
push_task >> pull_task
If I want to reference the file name in the pull_task so I could perform read the file - how should I call that? Trying to access context['task_instance'] does not contain a value. Further - is it best practices to try and reference a file name like this from task to task/operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.
I am new to using airflow and what I need to do is to use MssqlHook but I do not know how. What elements should I give in the constructor?
I have a connection in airflow with name connection_test.
I do not fully understand the attributes in the class:
class MsSqlHook(DbApiHook):
"""
Interact with Microsoft SQL Server.
"""
conn_name_attr = 'mssql_conn_id'
default_conn_name = 'mssql_default'
supports_autocommit = True
I have the following code:
sqlhook=MsSqlHook(connection_test)
sqlhook.get_conn()
And when I do this the error is Connection failed for unknown reason.
How should I do in order to make it work with the airflow connection ?
What I need is to call function .get_conn() for the MsSqlHook.
See the standard examples of Airflow.
https://github.com/gtoonstra/etl-with-airflow/blob/master/examples/mssql-example/dags/mssql_bcp_example.py
E.g.:
t1 = MsSqlImportOperator(task_id='import_data',
table_name='test.test',
generate_synth_data=generate_synth_data,
mssql_conn_id='mssql',
dag=dag)
EDIT
hook = MsSqlHook(mssql_conn_id="my_mssql_conn")
hook.run(sql)
You need to provide the connection defined in Connections. Also if using Hooks looking in the respective Operators usually yields some information about usage. This code is from the MSSQLOperator.
I want to run an airflow dag like so ->
I have 2 airflow workers W1 and W2.
In W1 I have scheduled a single task (W1-1) but in W2, I want to create X number of tasks (W2-1, W2-2 ... W2-X).
The number X and the bash command for each task will be derived from a DB call.
All tasks for worker W2 should run in parallel after W1 completes.
This is my code
dag = DAG('deploy_single', catchup=False, default_args=default_args, schedule_interval='16 15 * * *')
t1 = BashOperator(
task_id='dummy_task',
bash_command='echo hi > /tmp/hi',
queue='W1_queue',
dag=dag)
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
i = 1
for record in records:
t = BashOperator(
task_id='script_test_'+str(i),
bash_command="{full_command} ".format(full_command=str(record[0])),
queue=str(record[1]),
dag=dag)
t.set_upstream(t1)
i += 1
cursor.close()
connection.close()
However, when I run this, the task on W1 completed successfully but all tasks on W2 failed. In the airflow UI, I can see that it can resolve the correct number of tasks (10 in this case) but each of these 10 failed.
Looking at the logs, I saw that on W2 (which is on a different machine), airflow could not find the db_creds.json file.
I do not want to provide the DB creds file to W2.
My question is how can an airflow task be created dynamically in this case?
Basically i want to run a DB query on the airflow server and assign tasks to one or more workers based on the results of that query. The DB will contain updated info about which engines are active etc I want the DAG to reflect this. From logs, it looks like each worker runs the DB query. Providing access to DB to each worker is not an option.
Thank you #viraj-parekh and #cwurtz.
After much trial and error, found the correct way to use airflow variables for this case.
Step 1) We create another script called gen_var.pyand place it in the dag folder. This way, the scheduler will pick up and generate the variables. If the code for generating variables is within the deploy_single dag then we run into the same dependency issue as the worker will try and process the dag too.
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
import json
import psycopg2
from airflow.models import Variable
from psycopg2.extensions import AsIs
get_all_engines = "select full_command, queue_name from internal_airflow_hosts where logical_group = 'live_engines';"
db_creds = json.loads(open('/opt/airflow/db_creds.json').read())
conn_dict = db_creds["airflowdb_local"]
connection = psycopg2.connect(**conn_dict)
cursor = connection.cursor()
cursor.execute(get_all_engines)
records = cursor.fetchall()
hosts = {}
i = 1
for record in records:
comm_dict = {}
comm_dict['full_command'] = str(record[0])
comm_dict['queue_name'] = str(record[1])
hosts[i] = comm_dict
i += 1
cursor.close()
connection.close()
Variable.set("hosts",hosts,serialize_json=True)
Note the call to serialize_json. Airflow will try to store the variable as a string. If you want it to be stored as a dict, then use serialize_json=True. Airflow will still store it as string via json.dumps
Step 2) Simplify the dag and call this "hosts" variable (now deserialize to get back the dict) like so -
hoztz = Variable.get("hosts",deserialize_json=True)
for key in hoztz:
host = hoztz.get(key)
t = BashOperator(
task_id='script_test_'+str(key),
bash_command="{full_command} ".format(full_command=str(host.get('full_command'))),
queue=str(host.get('queue_name')),
dag=dag)
t.set_upstream(t1)
Hope it helps someone else.
One way to do this would be to store the information in an Airflow Variable.
You can fetch the information needed to dynamically generate the DAG (and necessary configs) in a Variable and have W2 access it from there.
Variables are an airflow model that can be used to store static information (information that does not have an associated timestamp) that all tasks can access.