Is there a way to create the Mount objects for DockerOperator dynamically, so that I can use a filesystem connection stored in the Airflow DB? I don't want to change the DAG code if the connection changes.
At the moment I need to hardcode the paths like this:
incoming_path = "/incoming/XYZ"
output_path = "/output_path/ABC"

input_mount = {"source": incoming_path,
               "target": incoming_path,
               "type": "bind"}

output_mount = {"source": output_path,
                "target": output_path,
                "type": "bind"}

create_stuff = DockerOperator(
    task_id='create_stuff',
    user=1234,
    queue='default',
    image='image_name',
    api_version='auto',
    auto_remove=True,
    mount_tmp_dir=False,
    mounts=[Mount(**input_mount),
            Mount(**output_mount)],
    environment={
        'AF_EXECUTION_DATE': "{{ ds }}",
        'AF_OWNER': "{{ task.owner }}",
    },
    command="do stuff",
    entrypoint='',
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge'
)
I tried to use FSHook outside an operator, but then it returns an empty string when I call it like this:
with DAG(...) as dag:
    ...

    # THIS WORKS
    @task
    def task1():
        incoming_hook = FSHook('fs_incoming')
        incoming_path = incoming_hook.get_path()
        ...

    # THIS RETURNS AN EMPTY STRING
    incoming_hook = FSHook('fs_incoming')
    incoming_path = incoming_hook.get_path()
So another way to phrase the question: is there a way to get the path from the connection outside an operator?
I'm using Airflow 2.4.1.
Given that you just need to share a common string between multiple DAGs, I'd recommend either using an Airflow Variable or using some kind of shared config file in your /dags directory. You can find more detail in this answer.
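For example, a minimal sketch of the Variable approach (assuming a Variable named incoming_path has been created via the UI or CLI; the name here is illustrative):

from airflow.models import Variable

# "incoming_path" is a hypothetical Variable key; the default mirrors the hardcoded value.
incoming_path = Variable.get("incoming_path", default_var="/incoming/XYZ")

input_mount = {"source": incoming_path,
               "target": incoming_path,
               "type": "bind"}

Note that a Variable.get at the top level of a DAG file is evaluated on every DAG parse, so keep such lookups cheap.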
Forcing the use of a hook here is unnecessary as Airflow Connections are "used for storing credentials and other information necessary for connecting to external services". I don't believe that applies here, but if you did need to use a Connection, you wouldn't need a Hook; you could just use the Connection API to access Connection properties.
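If you did go that route, a hedged sketch of reading the path straight from the Connection (assuming a Filesystem connection fs_incoming that stores the path in its extra field, which is where FSHook looks for it):

from airflow.hooks.base import BaseHook

# "fs_incoming" matches the connection ID used in the question.
conn = BaseHook.get_connection("fs_incoming")
# For a Filesystem connection the path is typically kept in the extra JSON under "path".
incoming_path = conn.extra_dejson.get("path", "")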
Hooks give you additional functionality for actually interacting with external systems; FSHook would let you interact with a filesystem, rather than just share a path value across DAGs.
I need to trigger a Data Fusion pipeline located in a GCP project called myDataFusionProject through a Data Fusion operator (CloudDataFusionStartPipelineOperator), inside a DAG whose Cloud Composer instance is located in another project called myCloudComposerProject.
I have used the official documentation as well as the source code to write code that roughly resembles the snippet below:
LOCATION = "someLocation"
PIPELINE_NAME = "myDataFusionPipeline"
INSTANCE_NAME = "myDataFusionInstance"
RUNTIME_ARGS = {"output.instance": "someOutputInstance",
                "input.dataset": "someInputDataset",
                "input.project": "someInputProject"}

start_pipeline = CloudDataFusionStartPipelineOperator(
    location=LOCATION,
    pipeline_name=PIPELINE_NAME,
    instance_name=INSTANCE_NAME,
    runtime_args=RUNTIME_ARGS,
    task_id="start_pipeline",
)
My issue is that, every time I trigger the DAG, Cloud Composer looks for myDataFusionInstance inside myCloudComposerProject instead of myDataFusionProject, which gives an error like this one:
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://datafusion.googleapis.com/v1beta1/projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance?alt=json returned "Resource 'projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance' was not found". Details: "[{'#type': 'type.googleapis.com/google.rpc.ResourceInfo', 'resourceName': 'projects/myCloudComposerProject/locations/someLocation/instances/myDataFusionInstance'}]"
So the question is: how can I force my operator to use the Data Fusion project instead of the Cloud Composer project? I suspect I may do that by adding a new runtime argument but I'm not sure how to do that.
Last piece of information: the Data Fusion pipeline simply extracts data from a BigQuery source and sends everything to a BigTable sink.
As a general recommendation when developing with Airflow operators, check the classes that implement them, since the documentation may lack some information depending on the version.
As commented, if you check CloudDataFusionStartPipelineOperator you will find that it uses a hook that fetches the instance based on a project ID. This project ID is optional, so you can pass your own.
class CloudDataFusionStartPipelineOperator(BaseOperator):
    ...
    def __init__(
        ...
        project_id: Optional[str] = None,  ### NOT MENTIONED IN THE DOCUMENTATION
        ...
    ) -> None:
        ...
        self.project_id = project_id
        ...

    def execute(self, context: dict) -> str:
        ...
        instance = hook.get_instance(
            instance_name=self.instance_name,
            location=self.location,
            project_id=self.project_id,  ### falls back to the default (Composer) project if not set
        )
        api_url = instance["apiEndpoint"]
        ...
Adding the parameter to your operator call should fix your issue.
start_pipeline = CloudDataFusionStartPipelineOperator(
    location=LOCATION,
    pipeline_name=PIPELINE_NAME,
    instance_name=INSTANCE_NAME,
    runtime_args=RUNTIME_ARGS,
    project_id=PROJECT_ID,
    task_id="start_pipeline",
)
As a final note, besides the official documentation site, you can also explore the Apache Airflow source files on GitHub.
I'm learning Airflow and I'm trying to understand how connections work.
I have a first dag with the following code:
c = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxx',
    password='xxxxxxxxx'
)

def list_keys():
    hook = S3Hook(aws_conn_id=c.conn_id)
    logging.info(f"Listing Keys from {bucket}/{prefix}")
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")
In this case it works fine; the connection is correctly passed to the S3Hook.
Then I have a second dag:
redshift_connection = Connection(
    conn_id='redshift',
    conn_type='postgres',
    login='duser',
    password='xxxxxxxxxx',
    host='xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    schema='db'
)

aws_connection = Connection(
    conn_id='aws_credentials',
    conn_type='Amazon Web Services',
    login='xxxxxxxxx',
    password='xxxxxxxx'
)

def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook(aws_connection.conn_id)
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook(redshift_connection.conn_id)
    sql_stmnt = sql_statements.COPY_STATIONS_SQL.format(aws_connection.login, aws_connection.password)
    redshift_hook.run(sql_stmnt)

dag = DAG(
    's3_to_Redshift',
    start_date=datetime.datetime.now()
)

create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id=redshift_connection.conn_id,
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
    dag=dag
)
This DAG returns the following error: The conn_id 'redshift' isn't defined.
Why is that? What is the difference between my first and second DAG? Why does the connection seem to work in the first example but not in the second?
Thanks.
Connections are usually created using the UI or CLI as described here and stored by Airflow in the database backend. The operators and the respective hooks then take a connection ID as an argument and use it to retrieve the usernames, passwords, etc. for those connections.
In your case, I suspect you created a connection with the ID aws_credentials using the UI or CLI. So, when you pass its ID to S3Hook it successfully retrieves the credentials (from the Airflow's database backend, not from the Connection object that you created).
But you did not create a connection with the ID redshift, so Airflow complains that it is not defined. You have to create the connection as described in the documentation first.
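As an illustration, once the connection exists in the metadata database, the DAG code only references it by ID; a minimal sketch based on the question's snippet (the import path varies across Airflow versions):

from airflow.operators.postgres_operator import PostgresOperator

# "redshift" must already exist as a connection created via the UI or CLI;
# no Connection object is constructed in the DAG file.
create_table = PostgresOperator(
    task_id='create_table',
    postgres_conn_id='redshift',
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
    dag=dag,
)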
Note: The reason for not defining connections in the DAG code is that DAG code is usually stored in a version control system (e.g., Git), and it would be a security risk to store credentials there.
I'm trying to understand how to pass values via Airflow's XCom functionality. The specific use case I am trying to build is: write a file, then move it, then run another command. The idea is that I pass the file name from one operator to the next.
Here is what I have:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime as dt

DAG = DAG(
    dag_id='xcom_test_dag',
    start_date=dt.datetime.now(),
    schedule_interval='@once'
)

def push_function(**context):
    file_name = 'test_file_{date}'.format(date=dt.datetime.now())
    return context['task_instance'].xcom_push(key='filename', value=file_name)

def pull_function(**context):
    dir(context['task_instance'].xcom_pull())

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)

push_task >> pull_task
If I want to reference the file name in pull_task so I can read the file - how should I call that? Trying to access context['task_instance'] does not contain a value. Also, is it best practice to reference a file name like this from task to task / operator to operator?
When pulling data from XCOM, you want to provide the task ID of the task where you push the data. In your example, the task_id of your push task is push_task, so you'd want to do something like:
value = context['task_instance'].xcom_pull(task_ids='push_task')
However, from the airflow documentation, note that:
By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If you're pushing data to XCOM manually with specific keys, you may need to include that key when calling xcom_pull. In your example, you push a key called filename in your push task, so you'd likely need to do something like this in your pull task:
value = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
This information is outlined in further detail in the Airflow documentation: https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom#concepts-xcom
As for your question regarding "best practices" - for communicating between Airflow Tasks/Operators, XCOM is the best way to go. However, if you're wanting to read a file from disk across multiple operators, you would need to ensure that all your workers have access to where the file is stored. If that isn't possible, an alternative could be to have the push task store that file remotely (e.g. in AWS S3) and push the S3 URL to XCOM. The pull task could then read the S3 URL from XCOM, and download the file from S3.
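Putting that together, a minimal sketch of the pull side (assuming the push task stays as in the question, so the value is stored under the key filename, and assuming the worker running this task can see the file):

def pull_function(**context):
    # Pull the value pushed by push_task under the key "filename".
    file_name = context['task_instance'].xcom_pull(task_ids='push_task', key='filename')
    # Reading the file only works if this worker has access to wherever it was written.
    with open(file_name) as f:
        contents = f.read()
    return contents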
I am new to Airflow and I need to use MsSqlHook, but I do not know how. What arguments should I pass to the constructor?
I have a connection in Airflow named connection_test.
I do not fully understand the attributes of the class:
class MsSqlHook(DbApiHook):
    """
    Interact with Microsoft SQL Server.
    """
    conn_name_attr = 'mssql_conn_id'
    default_conn_name = 'mssql_default'
    supports_autocommit = True
I have the following code:
sqlhook=MsSqlHook(connection_test)
sqlhook.get_conn()
When I do this, the error is: Connection failed for unknown reason.
What should I do to make this work with the Airflow connection?
What I need is to call .get_conn() on the MsSqlHook.
See the standard examples of Airflow.
https://github.com/gtoonstra/etl-with-airflow/blob/master/examples/mssql-example/dags/mssql_bcp_example.py
E.g.:
t1 = MsSqlImportOperator(task_id='import_data',
                         table_name='test.test',
                         generate_synth_data=generate_synth_data,
                         mssql_conn_id='mssql',
                         dag=dag)
EDIT
hook = MsSqlHook(mssql_conn_id="my_mssql_conn")
hook.run(sql)
You need to provide the ID of a connection defined under Connections. Also, when using hooks, looking at the respective operators usually yields some information about usage; the snippet above is taken from the MsSqlOperator.
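For your case, a minimal sketch (assuming connection_test is defined as an MSSQL connection in Airflow; the import path shown is for the provider package and may differ in older Airflow versions):

from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

# Pass the Airflow connection ID as mssql_conn_id, not a Connection object.
hook = MsSqlHook(mssql_conn_id='connection_test')
conn = hook.get_conn()   # a DB-API connection object

cursor = conn.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchone())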
I'm trying to use Riak's MapReduce via HTTP. This is what I'm sending:
{
  "inputs": {
    "bucket": "test",
    "key_filters": [["matches", ".*"]]
  },
  "query": [
    {
      "map": {
        "language": "erlang",
        "source": "value(RiakObject, _KeyData, _Arg) -> Key = riak_object:key(RiakObject), Count = riak_kv_crdt:value(RiakObject, <<\"riak_kv_pncounter\">>), [ {Key, Count} ]."
      }
    }
  ]
}
Riak fails with "[worker_startup_failed]", which isn't very informative. Could anyone please help me get this to actually execute the function?
WARNING
Allowing arbitrary Erlang functions via map-reduce is a security risk. Any valid Erlang can be executed, including sending your entire data set offsite or formatting the hard drive.
You have been warned.
However, if you implicitly trust any client that may connect to your cluster, you can allow Erlang source to be passed in a map-reduce request by setting {allow_strfun, true} in the riak_kv section of app.config (or in advanced.config if you are using riak.conf).
Once you have allowed passing an Erlang function in a map-reduce phase, you need to pass in a function of the form fun(RiakObject,KeyData,Arg) -> [result] end. Note that this must be an anonymous fun, so fun is a keyword, not a name, and it must end with end.
Your function should handle the case where {error,notfound} is passed as the first argument instead of an object. Simply adding a catch-all clause to the function could accomplish that.
Perhaps something like:
{
  "inputs": {
    "bucket": "test",
    "key_filters": [["matches", ".*"]]
  },
  "query": [
    {
      "map": {
        "language": "erlang",
        "source": "fun(RiakObject, _KeyData, _Arg) ->
                       Key = riak_object:key(RiakObject),
                       Count = riak_kv_crdt:value(
                                   RiakObject,
                                   <<\"riak_kv_pncounter\">>),
                       [ {Key, Count} ];
                      (_,_,_) -> [{error,0}]
                   end."
      }
    }
  ]
}
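If you are submitting this over HTTP, a minimal sketch of the request (an assumption-laden example: it presumes Riak's HTTP interface is listening on localhost:8098 and simply mirrors the JSON above):

import json
import requests

query = {
    "inputs": {
        "bucket": "test",
        "key_filters": [["matches", ".*"]]
    },
    "query": [
        {"map": {
            "language": "erlang",
            "source": ("fun(RiakObject, _KeyData, _Arg) -> "
                       "Key = riak_object:key(RiakObject), "
                       "Count = riak_kv_crdt:value(RiakObject, <<\"riak_kv_pncounter\">>), "
                       "[{Key, Count}]; (_,_,_) -> [{error,0}] end.")
        }}
    ]
}

# Riak exposes MapReduce at /mapred on its HTTP port (8098 by default).
resp = requests.post("http://localhost:8098/mapred",
                     data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
print(resp.status_code, resp.text)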
Allowing the source to be passed in the request is very useful while developing and debugging. For production, you really should put the functions in a dedicated pre-compiled module that you copy to the code path of each node so that the phase spec can specify the module and function by name instead of providing arbitrary code.
{"map":{
"language":"erlang",
"module":"yourprecompiledmodule",
"function":"functionname"}}
You need to enable allow_strfun on all nodes in your cluster. To do so in Riak 2, you will need to use the advanced.config file to add this to the riak_kv configuration:
[
  {riak_kv, [
    {allow_strfun, true}
  ]}
].
The other option is to create your own Erlang module by using the compiler shipped with Riak and placing the *.beam file in a well-known location for Riak to find. The basho-patches directory is one such place.
Please see the documentation as well:
advanced.config
Installing custom Erlang code
HTTP MapReduce
Using MapReduce
Advanced MapReduce
MapReduce / curl example