In Airflow I am creating branches of different operators with a for loop; my code looks like this:
for table in ['messages', 'conversations']:
    # Operator1 has task_id = 'operator1_{}'.format(table)
    # and pushes an XCom: kwargs['ti'].xcom_push(key='file_name', value='y')
    # Operator2 is a BashOperator that needs to run:
    bash_command = "echo {{ ti.xcom_pull(task_ids='operator1_{}', key='file_name') }}".format(table)
    Operator1 >> Operator2
But in the UI the commands are rendered like this:
echo { ti.xcom_pull(task_ids='operator1_messages', key='file_name') }
echo { ti.xcom_pull(task_ids='operator1_conversations', key='file_name') }
How should I write the bash_command so that Airflow interprets the template correctly?
If I write the command directly, e.g.
bash_command = "echo {{ ti.xcom_pull(task_ids='operator1_messages', key='file_name') }}"
it works, but I want to build this command in a for loop.
Thanks!
It's doing this because the .format(table) call on your bash command treats the doubled {{ and }} as escaped braces and collapses them to single { and }, so the Jinja delimiters are stripped off. You may be able to fix this with the following instead:
bash_command = "echo {{ ti.xcom_pull(task_ids='operator1_" + table + "', key='file_name') }}"
Whether this is the best way to do it is probably another question.
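For completeness, here is a minimal sketch of the whole loop under these assumptions: Airflow 2 import paths, an existing dag object, and illustrative task ids. The string concatenation keeps the Jinja braces intact so Airflow renders them at run time.
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def push_file_name(**kwargs):
    # Push the value that the downstream BashOperator will pull via XCom.
    kwargs['ti'].xcom_push(key='file_name', value='y')

for table in ['messages', 'conversations']:
    push = PythonOperator(
        task_id='operator1_{}'.format(table),
        python_callable=push_file_name,
        dag=dag,  # assumes a DAG object named `dag` already exists
    )
    echo = BashOperator(
        task_id='operator2_{}'.format(table),
        bash_command=(
            "echo {{ ti.xcom_pull(task_ids='operator1_" + table + "', key='file_name') }}"
        ),
        dag=dag,
    )
    push >> echo
Alternatively, keep .format() and double the Jinja braces, e.g. "echo {{{{ ti.xcom_pull(task_ids='operator1_{}', key='file_name') }}}}".format(table); .format() collapses the quadrupled braces back to {{ and }}.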
I am trying to write a bash script that will take a single argument ("prod" or "staging") and use it to conditionally set global environment variables, specifically to switch between my staging and prod AWS keys. However, even though the logs from the script show what I expect, running echo $AWS_ACCESS_KEY in my terminal after running the script does not show that it was updated. I have tried adding source ~/.zshrc, but I don't think that is needed. What can I change to update $AWS_ACCESS_KEY globally?
#!/bin/bash
tmpAccess="access"
tmpSecret="secret"
if [ "$1" == "prod" ];
then
echo "Setting the AWS KEYS to $1 keys"
tmpAccess=$PROD_ACCESS_KEY
tmpSecret=$PROD_SECRET_KEY
elif [ "$1" == "staging" ];
then
echo "Setting the AWS KEYS to $1 keys"
tmpAccess=$STAGING_ACCESS_KEY
tmpSecret=$STAGING_SECRET_KEY
else
echo "Unknown env passed in: $1"
fi
export AWS_ACCESS_KEY=$tmpAccess
export AWS_SECRETS_KEY=$tmpSecret
echo "Updated AWS_ACCESS_KEY: $AWS_ACCESS_KEY"
echo "Current tmpAccess: $tmpAccess"
echo "AWS_ACCESS_KEY has been updated to $AWS_ACCESS_KEY for env $1"
echo "AWS_SECRETS_KEY has been updated to $AWS_SECRETS_KEY for env $1"
source ~/.zshrc
My zshrc file looks similar to:
export STAGING_ACCESS_KEY=1234
export STAGING_SECRETS_KEY=abcd
export PROD_ACCESS_KEY=5678
export PROD_SECRETS_KEY=efgh
It's not possible for a script to modify a variable of the calling shell unless you source it (see Setting environment variable in shell script does not make it visible to the shell).
There is another solution. Put your script content in a function:
myfunctionName () {
tmpAccess="access"
tmpSecret="secret"
if [ "$1" == "prod" ];
then
echo "Setting the AWS KEYS to $1 keys"
tmpAccess=$PROD_ACCESS_KEY
tmpSecret=$PROD_SECRET_KEY
elif [ "$1" == "staging" ];
then
echo "Setting the AWS KEYS to $1 keys"
tmpAccess=$STAGING_ACCESS_KEY
tmpSecret=$STAGING_SECRET_KEY
else
echo "Unknown env passed in: $1"
fi
export AWS_ACCESS_KEY=$tmpAccess
export AWS_SECRETS_KEY=$tmpSecret
echo "Updated AWS_ACCESS_KEY: $AWS_ACCESS_KEY"
echo "Current tmpAccess: $tmpAccess"
echo "AWS_ACCESS_KEY has been updated to $AWS_ACCESS_KEY for env $1"
echo "AWS_SECRETS_KEY has been updated to $AWS_SECRETS_KEY for env $1"
}
and put this function in your .zshrc file.
After that, open a new terminal and call myfunctionName just as you would have called the script (for example, myfunctionName prod). Because the function runs in the current shell rather than in a child process, the exported variables stay set after it returns.
When I do something like:
some_value = "{{ dag_run.get_task_instance('start').start_date }}"
print(f"some interpolated value: {some_value}")
I see this in the airflow logs:
some interpolated value: {{ dag_run.get_task_instance('start').start_date }}
but not the actual value itself. How can I easily see what the value is?
Everything in the task run's context comes through as kwargs (before 1.10.12 you needed to add provide_context=True; from Airflow 2.0 the full context is always provided).
To get something out of kwargs, do something like this in your python callable:
run_id = kwargs['run_id']
print(f'run_id = {run_id}')
Additional info:
To get at the kwargs, add them to your callable's signature, like so:
def my_func(**kwargs):
    run_id = kwargs['run_id']
    print(f'run_id = {run_id}')
And just call this from your DAG task like:
my_task = PythonOperator(
    task_id='my_task',
    dag=dag,
    python_callable=my_func,
)
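If you specifically want to see a templated string like the one in your example rendered, one option is to pass it through a templated field such as op_kwargs, which Airflow renders before calling the function. A minimal sketch (the task id and function name are assumptions; Airflow 2 import path):
from airflow.operators.python import PythonOperator

def show_value(some_value, **kwargs):
    # some_value arrives already rendered by Jinja.
    print(f"some interpolated value: {some_value}")

show_start_date = PythonOperator(
    task_id='show_start_date',
    python_callable=show_value,
    op_kwargs={'some_value': "{{ dag_run.get_task_instance('start').start_date }}"},
    dag=dag,
)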
I'm not sure what your current code structure is, I'm afraid, as you haven't provided more detail.
I'm trying to pass a parameter from Google Composer into a Dataflow template in the following way, but it does not work.
# composer code
trigger_dataflow = DataflowTemplateOperator(
    task_id="trigger_dataflow",
    template="gs://mybucket/my_template",
    dag=dag,
    job_name='appsflyer_events_daily',
    parameters={
        "input": 'gs://my_bucket/' + "{{ ds }}" + "/*.gz"
    }
)
# template code
import apache_beam as beam
from apache_beam.io.fileio import MatchFiles
from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--input',
            default='gs://my_bucket/*.gz',
            help='path of input file')

def main():
    pipeline_options = PipelineOptions()
    user_options = pipeline_options.view_as(UserOptions)
    p = beam.Pipeline(options=pipeline_options)
    lines = (
        p
        | MatchFiles(user_options.input)
    )
You can pass it like the following:
DataflowTemplateOperator(
    task_id="task1",
    template=get_variable_value("template"),
    on_failure_callback=update_job_message,
    parameters={
        "fileBucket": get_variable_value("file_bucket"),
        "basePath": get_variable_value("path_input"),
        "Day": "{{ json.loads(ti.xcom_pull(key=run_id))['day'] }}",
    },
)
We are using Java, and in our Dataflow jobs the options class has getters and setters like the following:
public interface MyOptions extends CommonOptions {
    @Description("The output bucket")
    @Validation.Required
    ValueProvider<String> getFileBucket();

    void setFileBucket(ValueProvider<String> value);
}
We need to create a template for these Dataflow jobs, and that template is triggered by the Composer DAG.
Moving from a Dataflow classic template to a Flex Template fixed the issue: classic templates only see runtime parameters through ValueProvider, whereas a Flex Template builds the pipeline at launch time, so ordinary options work.
I am currently running this query in Airflow's MySqlOperator.
How can I replace region, s3 bucket with parameters using Jinja template?
Airflow version: 2.0.2
Python: 3.7
sql = """SELECT * FROM test
INTO OUTFILE S3 's3-ap-southeast-1://my-s3-bucket/my-key'
CHARACTER SET utf8
FORMAT CSV HEADER
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '\\n'
OVERWRITE ON;
"""
mysql_to_s3 = MySqlOperator(
    task_id="mysql_to_s3",
    dag=dag,
    sql=sql,
    mysql_conn_id=MYSQL_CONN_ID,
    parameters={
        "s3_bucket": "my-s3-bucket",
        "s3_key_prefix": "my-key",
        "region": "ap-southeast-1",
    },
    autocommit=False,
    database="test",
)
You can use params to pass dynamic values to your SQL:
sql = """SELECT * FROM test
INTO OUTFILE S3 '{{ params.region }}://{{ params.s3_bucket }}/{{ params.s3_key_prefix }}'
CHARACTER SET utf8
FORMAT CSV HEADER
FIELDS
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '\\n'
OVERWRITE ON;
"""
mysql_to_s3 = MySqlOperator(
    task_id="mysql_to_s3",
    dag=dag,
    sql=sql,
    mysql_conn_id=MYSQL_CONN_ID,
    params={
        "s3_bucket": "my-s3-bucket",
        "s3_key_prefix": "my-key",
        "region": "ap-southeast-1",
    },
    autocommit=False,
    database="test",
)
If the values are stored in Airflow Variables (region, s3_bucket, s3_key_prefix), then you can remove the params dict from the operator and change your sql to:
INTO OUTFILE S3 '{{ var.value.region }}://{{ var.value.s3_bucket }}/{{ var.value.s3_key_prefix }}'
In both options, Airflow will template the SQL string and replace the placeholders with the values when the operator is executed. You can see the actual values in the task's Rendered tab in the UI.
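If you go the Variables route, the values can be created once up front; a minimal sketch using the Python API (the variable names match the template above):
from airflow.models import Variable

# Create the Airflow Variables referenced by the templated SQL.
Variable.set("region", "ap-southeast-1")
Variable.set("s3_bucket", "my-s3-bucket")
Variable.set("s3_key_prefix", "my-key")
They can equally be created in the Airflow UI under Admin > Variables.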
You can use airflow variables - https://airflow.apache.org/docs/apache-airflow/stable/concepts/variables.html
Airflow jinja template support - https://airflow.apache.org/docs/apache-airflow/stable/concepts/operators.html#concepts-jinja-templating
I have a DAG with three bash tasks which is scheduled to run every day.
I would like to access the unique ID of the DAG run (something like a PID) in all of the bash scripts.
Is there any way to do this?
I am looking for functionality similar to Oozie, where we can access WORKFLOW_ID in the workflow XML or Java code.
Can somebody point me to Airflow documentation on how to use built-in and custom variables in an Airflow DAG?
Many Thanks
Pari
An object's attributes can be accessed with dot notation in Jinja2 (see https://airflow.apache.org/code.html#macros). In this case, it would simply be:
{{ dag.dag_id }}
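For example, in one of the bash tasks (the task id and dag object are assumptions; Airflow 2 import path), both the DAG id and the unique run id are available as template variables:
from airflow.operators.bash import BashOperator

print_ids = BashOperator(
    task_id='print_ids',
    # dag_id identifies the DAG, run_id uniquely identifies this DAG run.
    bash_command='echo "dag_id={{ dag.dag_id }} run_id={{ run_id }}"',
    dag=dag,
)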
I made use of the fact that the Python object for the DAG prints out as a string containing the name of the current DAG, so I just use Jinja2 filters to strip it down to the DAG name:
{{ dag | replace( '<DAG: ', '' ) | replace( '>', '' ) }}
It's a bit of a hack, but it works.
Therefore:
clear_upstream = BashOperator(
    task_id='clear_upstream',
    trigger_rule='all_failed',
    bash_command="""
echo airflow clear -t upstream_task -c -d -s {{ ts }} -e {{ ts }} {{ dag | replace( '<DAG: ', '' ) | replace( '>', '' ) }}
""",
)
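The same task can also be written without the string-replace hack by using dag.dag_id directly, as in the answer above; a sketch under the same assumptions:
clear_upstream = BashOperator(
    task_id='clear_upstream',
    trigger_rule='all_failed',
    bash_command="""
echo airflow clear -t upstream_task -c -d -s {{ ts }} -e {{ ts }} {{ dag.dag_id }}
""",
)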