I have a DAG without a schedule (it is run manually as needed). It has many tasks. Sometimes I want to 'skip' some initial tasks by changing the task state to SUCCESS manually. Changing task state of a manually executed DAG fails, seemingly because of a bug in parsing the execution_date.
Is there another way to set individual task states for a manually executed DAG?
Example run below. The execution date of the task is 01-13T17:27:13.130427, and I believe the fractional seconds (microseconds) are not being parsed correctly.
Traceback
Traceback (most recent call last):
File "/opt/conda/envs/jumpman_prod/lib/python3.6/site-packages/airflow/www/views.py", line 2372, in set_task_instance_state
execution_date = datetime.strptime(execution_date, '%Y-%m-%d %H:%M:%S')
File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 565, in _strptime_datetime
tt, fraction = _strptime(data_string, format)
File "/opt/conda/envs/jumpman_prod/lib/python3.6/_strptime.py", line 365, in _strptime
data_string[found.end():])
ValueError: unconverted data remains: ..130427
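The failure can be reproduced outside Airflow with the standard library alone. The sketch below uses a made-up timestamp (the year is an assumption, since the original value is not shown in full). It suggests that the format string from the traceback rejects fractional seconds, while adding `%f` accepts them.

```python
from datetime import datetime

# Hypothetical value; only the month, day and time come from the question.
raw = "2018-01-13 17:27:13.130427"

# Format used in the traceback: no directive for fractional seconds.
try:
    datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
    error = None
except ValueError as exc:
    error = str(exc)

# Adding %f lets strptime consume the microseconds as well.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

print(error)
print(parsed.microsecond)
```

This only illustrates the parsing behaviour; it is not a patch for the Airflow view itself.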
This doesn't work from the Task Instances page, but you can do it from another page:
- open the DAG's graph view
- select the run you need (screen 1) and click Go
- select the task you need
- in the popup window, click Mark success (screen 2)
- then confirm.
PS: this applies to Airflow 1.9.
Screen 1
Screen 2
To accomplish this you may want to use branching, which, as the name suggests, allows you to follow different execution paths according to some condition, just like an if in any programming language.
You can use the BranchPythonOperator (see the Airflow documentation) to attain this goal: the operator is configured with a python_callable, a function that returns the task_id to execute next (which should, of course, be a task directly downstream of the BranchPythonOperator itself).
Using branching will set the skipped tasks to the proper state automatically, as mentioned in the documentation:
All other “branches” or directly downstream tasks are marked with a state of skipped so that these paths can’t move forward. The skipped states are propagated downstream to allow for the DAG state to fill up and the DAG run’s state to be inferred.
The resulting DAG would look something like the following:
[Figure: example DAG with a branching task (source: apache.org)]
Branching is documented in the official Apache Airflow documentation.
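As a rough illustration of the python_callable described above, the sketch below picks the next task_id from the run's conf. The conf key start_from and the task ids extract and transform are hypothetical, and a SimpleNamespace stands in for the DagRun so the function can be tried without Airflow.

```python
from types import SimpleNamespace


def choose_first_task(**context):
    """Return the task_id that the branch should follow.

    Assumed convention (not from Airflow itself): the run is triggered
    with a conf such as {"start_from": "transform"}; without it, the
    DAG starts from the hypothetical "extract" task.
    """
    dag_run = context.get("dag_run")
    # conf can be None when the run was triggered without one.
    conf = (dag_run.conf or {}) if dag_run is not None else {}
    return conf.get("start_from", "extract")


# Stand-ins for a DagRun object, for trying the function locally.
chosen = choose_first_task(dag_run=SimpleNamespace(conf={"start_from": "transform"}))
default_choice = choose_first_task(dag_run=SimpleNamespace(conf=None))
print(chosen, default_choice)
```

In a DAG, this function would be passed as the python_callable of a BranchPythonOperator whose direct downstream tasks carry those ids (on Airflow 1.x, with provide_context=True so the context is passed in). A task that sits downstream of both a skipped and a chosen branch will probably need a trigger rule other than the default all_success.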
Related
dag1:
start >> clean >> end
I have a DAG where I run a few tasks, but I want to modify it so that the clean step only runs if another DAG, "dag2", is not running at the moment.
Is there any way I can import information about "dag2", check its status, and proceed to the clean step only if it is in a success state?
Something like this:
start >> wait_for_dag2 >> clean >> end
How can I achieve the wait_for_dag2 part?
There are different answers depending on what you want to do:
- if you have two DAGs with the same schedule interval and you want each run of the second DAG to wait for the same run of the first one, you can use an ExternalTaskSensor on the last task of the first DAG
- if you want to run dag2 after each run of dag1, even when dag1 is triggered manually, you need to update dag1 to add a TriggerDagRunOperator and set the schedule interval of the second DAG to None
I want to modify it such that the clean steps only runs if another dag "dag2" is not running at the moment.
- if you have two DAGs and you don't want them to run at the same time, to avoid a conflict on an external server or service, you can use one of the first two options, or give the tasks of the first DAG a higher priority and put the conflicting tasks in the same pool (with 1 slot); note that you will lose parallelism on these tasks.
Hossein's approach is the way people usually go. However, if you want to get information about any DAG run, you can use Airflow's own functionality to look it up. The following approach is useful when you do not want (or are not allowed) to modify the other DAG:
from airflow.models.dagrun import DagRun
from airflow.utils.state import DagRunState

# Look up the runs of the other DAG; the last entry is taken as the latest run.
dag_runs = DagRun.find(dag_id='the_dag_id_you_want_to_check')
if not dag_runs:
    print('the dag has no runs yet')
else:
    last_run = dag_runs[-1]
    if last_run.state == DagRunState.SUCCESS:
        print('the dag run was successful!')
    else:
        print('the dag state is -->: ', last_run.state)
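For the original question (run clean only while dag2 is not running), the same lookup could be wrapped in a function that returns True or False. This is a minimal sketch: is_idle and dag2_is_idle are hypothetical names, and the Airflow import is kept inside the function so the pure check can be tried without Airflow installed.

```python
def is_idle(run_states):
    """True when none of the given DAG-run states is 'running'."""
    return all(state != "running" for state in run_states)


def dag2_is_idle():
    # Imported lazily; DagRun.find is the same lookup used in the answer.
    from airflow.models.dagrun import DagRun

    runs = DagRun.find(dag_id="dag2")
    return is_idle(run.state for run in runs)


# The pure check can be exercised with plain strings.
print(is_idle(["success", "failed"]))
print(is_idle(["success", "running"]))
```

A function like dag2_is_idle could then serve as the condition of a sensor-style wait_for_dag2 task (for example a PythonSensor on Airflow versions that provide it), which is an assumption about the setup rather than something the answer prescribes.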
I have a use case where I need two BigQueryOperator tasks that write to the same destination table: one should run daily, and the other should run only when I trigger it manually.
Below is an illustration of the Tree View:
| task_3rd_adhoc
| task_3rd
|---- task_2nd
|---- task_1st_a
|---- task_1st_b
In the example above, the DAG runs daily, and I want the tasks to behave as follows:
1. task_1st_a and task_1st_b run first. The target tables are:
project.dataset.table_1st_a with _PARTITIONTIME = execution date, and
project.dataset.table_1st_b with _PARTITIONTIME = execution date.
2. task_2nd runs after task_1st_a and task_1st_b finish. The BigQueryOperator uses TriggerRule.ALL_SUCCESS. The target table is:
project.dataset.table_2nd with _PARTITIONTIME = execution date.
3. task_3rd runs after task_2nd succeeds. The BigQueryOperator uses TriggerRule.ALL_SUCCESS. The target table is:
project.dataset.table_3rd with _PARTITIONTIME = D-2 from execution date.
4. task_3rd_adhoc should not run in the daily job. I only need it when I want to backfill table project.dataset.table_3rd. The target table is:
project.dataset.table_3rd with _PARTITIONTIME = execution_date
But I still can't find the correct TriggerRule for step #4 above. I tried TriggerRule.DUMMY because I thought it could be used to set no trigger, but task_3rd_adhoc still ran in the daily job when I created the DAG above.
(based on this doc dependencies are just for show, trigger at will)
First of all, you've misunderstood TriggerRule.DUMMY.
Usually, when you wire tasks together task_a >> task_b, B would run only after A is complete (success / failed, based on B's trigger_rule).
TriggerRule.DUMMY means that even after wiring tasks A & B together as before, B would run independently of A (run at will). It doesn't mean run at your will; rather, it runs at Airflow's will (Airflow will trigger it whenever it feels like). So tasks with the dummy trigger rule will pretty much ALWAYS run, albeit at an unpredictable time.
What you need here (to always have a particular task in the DAG but run it only when manually specified) is a combination of:
- AirflowSkipException
- Variable
Here's roughly how you can do it:
- A Variable should hold the command for this task (whether or not it should run). You can, of course, edit this Variable anytime from the UI, thereby controlling whether or not that task runs in the next DagRun.
- In the operator's code (the execute() method for a custom operator, or just the python_callable in the case of PythonOperator), check the value of the Variable (whether or not the task is supposed to run).
- Based on the Variable value, if the task is NOT supposed to run, raise an AirflowSkipException so that the task is marked as skipped. Otherwise, it just runs as usual.
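The steps above could be sketched as follows. The Variable name run_task_3rd_adhoc and the accepted "on" values are assumptions made for illustration, and Variable.get is called with default_var as in Airflow 1.10/2.x. The Airflow imports sit inside the callable so the flag parsing can be tried on its own.

```python
def should_run(flag_value):
    """Interpret the Variable's text value as an on/off switch."""
    return str(flag_value).strip().lower() in {"1", "true", "yes", "on"}


def adhoc_task_callable(**context):
    # Imported here so should_run() above works without Airflow installed.
    from airflow.exceptions import AirflowSkipException
    from airflow.models import Variable

    # "run_task_3rd_adhoc" is a hypothetical Variable name, editable in the UI.
    flag = Variable.get("run_task_3rd_adhoc", default_var="false")
    if not should_run(flag):
        # Raising this exception marks the task as skipped instead of failed.
        raise AirflowSkipException("run_task_3rd_adhoc is off; skipping")
    # ... the actual ad hoc work would go here ...


print(should_run("True"), should_run("false"))
```

With the flag defaulting to "false", the ad hoc task would be skipped in every daily run until someone switches the Variable on, and it needs switching off again afterwards.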
I have a sensor that waits for a file to appear in an external file system
The sensor uses mode="reschedule"
I would like to trigger a specific behavior after X failed attempts.
Is there any straightforward way to know how many times the sensor has already attempted to run the poke method?
My quick fix so far has been to push an XCom with the attempt number, and increase it every time the poke method returns False. Is there any built-in mechanism for this?
Thank you
I had a similar problem when sensor mode = "reschedule", trying to poke a different path to a file based on the current time without directly referencing pendulum.now or datetime.now
I used task_reschedules (as done in the base sensor operator to get try_number for reschedule mode https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/sensors/base.html#BaseSensorOperator.execute)
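# requires: from airflow.models.taskreschedule import TaskReschedule
def execute(self, context):
    # Every reschedule of this task instance is recorded, so the count gives the number of earlier pokes.
    task_reschedules = TaskReschedule.find_for_task_instance(context['ti'])
    self.poke_number = len(task_reschedules) + 1
    return super().execute(context)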
then self.poke_number can be used within poke(), and current time is approximately execution_date + (poke_number * poke_interval).
Apparently, the XCom thing isn't working, because pushed XComs don't seem to be available between pokes; they always return undefined.
try_number inside task_instance doesn't help either, as pokes don't count as a new try number
I ended up computing the attempt number by hand:
attempt_no = math.ceil((pendulum.now(tz='utc') - kwargs['ti'].start_date).total_seconds() / kwargs['task'].poke_interval)
The code works as long as individual executions of the poke method don't last longer than the poke interval (which they shouldn't).
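The same calculation can be written as a standalone function to check the arithmetic. This is a sketch with plain datetime objects instead of the task instance; the function name and the lower bound of 1 are my own additions. It uses total_seconds() so that waits longer than a day are counted in full.

```python
import math
from datetime import datetime, timedelta, timezone


def attempt_number(start, now, poke_interval_seconds):
    """Estimate which poke is running from the elapsed time."""
    elapsed = (now - start).total_seconds()
    # max(1, ...) is an added guard so the very first poke counts as 1.
    return max(1, math.ceil(elapsed / poke_interval_seconds))


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
after_150s = attempt_number(start, start + timedelta(seconds=150), 60)
after_2_days = attempt_number(start, start + timedelta(days=2), 3600)
print(after_150s, after_2_days)
```

Here 150 seconds at a 60 second interval lands in the third poke, and two days at an hourly interval gives 48.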
Best
I have an Airflow DAG that works perfectly when files are present, but fails with an error when the source files are not there.
I receive files from a given source at random times, and my DAG picks them up and processes them. While I need to run the DAG daily, files are not necessarily there daily. They could arrive on Monday, Wednesday, or even Sunday evening.
I'm not worried about days with no new files. I worry about days when new files come and it breaks.
How do I tell the DAG to exit gracefully with success when no file exists?
My DAG below (please ignore schedule setting. I'm still in development mode):
import airflow
from airflow import models
from airflow.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.operators.gcs_to_gcs import GoogleCloudStorageToGoogleCloudStorageOperator

args = {
    'owner': 'Airflow',
    'start_date': airflow.utils.dates.days_ago(2),
    'email': ['email#gmail.com'],
    'email_on_failure': True,
    'schedule_interval': 'None',
}

dag = models.DAG(
    dag_id='Source1_Ingestion',
    default_args=args
)

# [START load ATTOM File to STAGING]
load_File_to_Source1_RAW = GoogleCloudStorageToBigQueryOperator(
    task_id='Source1_GCS_to_GBQ_Raw',
    bucket='Source1_files',
    source_objects=['To_Process/*.txt'],
    destination_project_dataset_table='Source1.Source1_RAW',
    schema_fields=[
        {'name': 'datarow', 'type': 'STRING', 'mode': 'NULLABLE'},
    ],
    field_delimiter='§',
    write_disposition='WRITE_TRUNCATE',
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    bigquery_conn_id='GCP_EDW_Staging',
    dag=dag)
# [END howto_operator_gcs_to_bq]

# [START move files to Archive]
archive_Source1_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='Archive_Source1_Files',
    source_bucket='Source1_files',
    source_object='To_Process/*.txt',
    destination_bucket='Source1_files',
    destination_object='Archive/',
    move_object=True,
    google_cloud_storage_conn_id='GCP_EDW_Staging',
    dag=dag
)
# [END move files to archive]

# The archive step runs only after the load step.
load_File_to_Source1_RAW.set_downstream(archive_Source1_files)
One way to approach this would be to add a Sensor Operator to the workflow.
Nehil Jain describes sensors nicely:
Sensors are a special kind of airflow operator that will keep running until a certain criterion is met. For example, you know a file will arrive at your S3 bucket during certain time period, but the exact time when the file arrives is inconsistent.
For your use case, it looks like there's a Google Cloud Sensor, which "checks for the existence of a file in Google Cloud Storage." The reason you'd incorporate a sensor is that you're decoupling the operation "determine if a file exists" from the operation "get the file (and do something with it)".
By default, sensors have two methods (source):
poke: the code to run at poke_interval times, which tests to see if the condition is true
execute: use the poke method to test for a condition on a schedule defined by the poke_interval; fails out when the timeout argument is reached
In a common file-detection sensor, the operator receives instructions to check a source for a file on a schedule (e.g. check every 5 minutes for up to 3 hours to see if the file exists). If the sensor succeeds in meeting its test condition, it succeeds and allows the DAG to continue downstream to the next operator(s). If it fails to find the file, it times out and the sensor operator is marked failed.
With just a sensor operator, you've already succeeded in separating the error cases: the DAG fails at the GoogleCloudStorageObjectSensor instead of the GoogleCloudStorageToBigQueryOperator when the file doesn't exist, and fails at the GoogleCloudStorageToBigQueryOperator when something is wrong with the transfer logic. Importantly for your use case, Airflow supports a soft_fail argument, which "mark[s] the task as SKIPPED on failure".
I'll caveat this next part by explicitly stating that I'm not intimately familiar with the GoogleCloudStorage operators. If the sensor doesn't allow wildcards, you may need to rewire the sensor's poke method to allow for more complex, pattern-based file detection. This is where Airflow's plug-in architecture can really shine, allowing you to modify and extend existing operators to meet your exact needs.
The example I'll give you here is that the SFTPSensor only supports poking for a specific file out of the box. I needed wildcard based poking, so I wrote a plugin that modifies the SFTPSensor to support regular expressions in file identification. In my case, it was just modifying the poke to switch from polling for the existence of a single file to polling a list of files and then passing it through a regular expression to filter the list.
At a cursory glance, it looks like the way that the GoogleCloudStorageSensor pokes for an object is with the hook.exists method. I can't speak to whether a wildcard would work there, but if it doesn't, it looks like there's a hook.list method which would allow you to implement a similar workflow to what I did for the SFTPRegexSensor.
I've included some of the source code for the SFTPRegexSensor Plugin's poke method, modified for how I think it'd work with GCS in case it's helpful:
# requires: import re
def poke(self, context):
    # Create a hook here (SSH/SFTP intricacies removed for simplicity); `hook` is assumed to come from that step.
    # List the objects under the prefix. Define operator parameters for the choices that are
    # dynamic in the poke (e.g. which bucket, what the file prefix is); GCS args swapped in.
    files = hook.list(self.bucket, prefix=self.prefix)
    # Keep only the object names that match the regular expression.
    regex = re.compile(self.remote_filename)
    files = list(filter(regex.search, files))
    # The sensor succeeds as soon as at least one object matches.
    return bool(files)
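The list-then-filter step of that poke can be tried on its own with plain strings. In this sketch the object names and the pattern are invented, and matching_objects is a hypothetical helper rather than part of any Airflow hook.

```python
import re


def matching_objects(object_names, pattern):
    """Return the names that match the regular expression."""
    regex = re.compile(pattern)
    return [name for name in object_names if regex.search(name)]


# Invented object names standing in for the result of a bucket listing.
names = ["To_Process/a_2020.txt", "To_Process/readme.md", "Archive/old.txt"]
found = matching_objects(names, r"^To_Process/.*\.txt$")
print(found)
```

A poke built on this would return bool(found), so the sensor keeps waiting while the list is empty.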
I have an hourly shell script job that takes a date and hour as input parameters. The date and hour are used to construct the input path from which the job's logic fetches data. When a job fails and I need to rerun it (by clicking "Clear" on the failed task node to reset its status and trigger a new run), how can I make sure the date and hour used for the rerun are the same as those of the failed run, given that the rerun could happen in a different hour from the original run?
You have 3 options:
- Hover over the failed task you are going to clear; its tooltip shows a value with the key Run:, which is its execution date and time.
- Click on the failed task you are going to clear; the heading of the popup that contains the Clear option reads [taskname] on [executiondatewithtime].
- Open the task log; the first line after the attempt count includes a string of the form Executing <Task([TaskName]): task_id> on [ExecutionDate withTime].
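Clearing a task re-runs the same task instance, so its execution date does not change. A job that derives its date and hour from the execution date, rather than from the wall clock, would therefore read the same input on a rerun. A minimal sketch, where build_input_path and the /data/input base directory are hypothetical:

```python
from datetime import datetime


def build_input_path(execution_date, base="/data/input"):
    """Build the input path from the run's execution date, not from now()."""
    return f"{base}/{execution_date:%Y-%m-%d}/{execution_date:%H}"


# The same execution date gives the same path, whenever the rerun happens.
path = build_input_path(datetime(2024, 3, 5, 7, 0))
print(path)
```

In a templated field such as a BashOperator's bash_command, the same values are available as {{ ds }} for the date and {{ execution_date.strftime('%H') }} for the hour.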