I was trying to create a DAG which has only one task. Can I mark the task with a particular status, like skipped or no status?
Requirement: I will be checking an S3 bucket every minute, and if files are available I will do some processing; otherwise I will do nothing. I want this to be visible in the UI, so I was trying to mark the task status as skipped.
Is this the right way to do it? Is there another way to achieve this?
Thanks
If you want to mark a task as skipped, you can raise an AirflowSkipException. When raised, the execution of the task will stop and the task will be marked as skipped.
This example Airflow DAG with a DummySkipOperator demonstrates an operator that gets marked as skipped by raising the above exception.
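For reference, here is a minimal sketch of the same idea with a plain PythonOperator (assuming Airflow 2.x import paths; the DAG and task names are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python import PythonOperator

def check_s3_for_files(**context):
    # Stand-in for an actual S3 listing call.
    files = []
    if not files:
        # Stops execution here and marks this task instance as "skipped".
        raise AirflowSkipException("No files in the bucket; nothing to do.")
    # ... process the files ...

with DAG(
    dag_id="s3_poll_example",        # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="* * * * *",   # every minute, per the requirement
    catchup=False,
) as dag:
    PythonOperator(task_id="check_s3", python_callable=check_s3_for_files)
```

When no files are found, the run still appears in the UI with the task in the "skipped" state, which gives you the visibility you asked about.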
I know Airflow distinguishes between manual and scheduled triggers of a DAG - the pattern of their IDs is different, and the Tree View in the UI shows an outlined circle for one but not the other.
I have a DAG that uses a Python callable as its on_failure_callback to emit a failure alert. I now want to modify this DAG so that it emits one type of alert when a manually triggered run fails, and a different alert when the run was triggered by the scheduler.
I can do this by simply parsing the string in {{ execution_date }}. However, that seems hacky. Is there a flag that I can check instead?
Seems very related to this question:
Can I programmatically determine if an Airflow DAG was scheduled or manually triggered?
Looks like you can look up the run_id and parse that instead, which might be slightly better.
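As a rough sketch of what that could look like in the failure callback (manual run IDs typically start with "manual__" and scheduled ones with "scheduled__"; the alert helpers here are hypothetical stand-ins for your own notification code):

```python
def notify_on_failure(context):
    dag_run = context["dag_run"]
    # Manual runs normally get IDs like "manual__2021-01-01T00:00:00+00:00",
    # scheduled runs "scheduled__...". Some Airflow versions also expose
    # dag_run.external_trigger / dag_run.run_type, which may be cleaner to check.
    if dag_run.run_id.startswith("manual__"):
        send_manual_failure_alert(context)      # hypothetical helper
    else:
        send_scheduled_failure_alert(context)   # hypothetical helper
```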
The API in Airflow seems to suggest it is built around backfilling, catching up, and scheduling runs at regular intervals.
I have an ETL that extracts data on S3, with versions tied to the previous node (where the data comes from) in the DAG. For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through the scaled-to-100x100 transformation pipeline but produce different data, since the inputs are different.
As you can see, no date is involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. The nodes won't run at a regular interval. Instead, as soon as someone updates the code for a node, I have to run the downstream nodes manually one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph and trigger a run with one click.
I think it's fine to use Airflow as long as you find the DAG representation useful. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually via the web interface.
If you want to run specific tasks you can trigger your DAG and mark tasks as success or clear them using the web interface.
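As a sketch of how that could look for this pipeline (assuming Airflow 2.x import paths; the DAG and task IDs are illustrative, and the bash commands are the ones from the question):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="imagenet_pipeline",   # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,       # never scheduled; runs only when triggered
    catchup=False,
) as dag:
    get_removed_red = BashOperator(
        task_id="get_removed_red",
        bash_command="python GetImageNet.py removed-red",
    )
    scale_removed_red = BashOperator(
        task_id="scale_removed_red",
        bash_command="python scale.py 100 100 ImageNet-removed-red",
    )
    get_removed_red >> scale_removed_red  # downstream runs after upstream
```

You can then kick off the whole graph with a single click in the web UI, or from the CLI with airflow dags trigger imagenet_pipeline.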
I have several jobs that will run in sequence. Is it possible to create a dependency between them based only on completion, without requiring the prior job to complete successfully?
If a job fails, it should remain red, and the next job should still kick off and continue running.
It is mandatory that these jobs run in sequence and not in parallel.
As Mark outlined, you can simply create an On-Do action within the parent job to add a condition when the job ends Not OK. The parent job will still go red, and the successor job will kick off.
Yes, on the Actions tab you create an On/Do step and specify that when the job ends Not OK, it should add the output condition. This way the next job will run (in sequence) regardless of what happens to the predecessor job.
I have a box job that is dependent on another job finishing. The first job normally finishes by 11pm and my box job then kicks off and finishes in about 15 minutes. Occasionally, however, the job may not finish until much later. If it finishes later than 4am, I'd like to have it send an alert.
My admin told me that since it is dependent on a prior job, and not set to start at a specific time, it is not possible to set a time-based alert. Is this true? Does anybody have a workaround they can suggest? I'd rather not set the alert on the prior job (as suggested by my admin), as that may not always catch those instances when my job runs longer.
Thanks!
You can set a max run alarm time, which will alert if that time is exceeded.
We ended up adding a job to the box with a start time of 4am that checks for the existence of the files the rest of the job creates. We also did this for the job's predecessors to make sure we are notified if we are at risk of not finishing by 4am.
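For anyone curious what such a check job might run, here is a hedged sketch of a file-existence script (the paths are purely illustrative); the job ends Not OK when files are missing, which is what drives the alert:

```python
import os
import sys

# Hypothetical outputs the box is expected to have produced by 4am.
EXPECTED_FILES = [
    "/data/out/report_part1.dat",
    "/data/out/report_part2.dat",
]

missing = [path for path in EXPECTED_FILES if not os.path.exists(path)]
if missing:
    print("Missing expected files: " + ", ".join(missing))
    sys.exit(1)  # non-zero exit fails the job, triggering the alert
print("All expected files present.")
```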
Is there a way to use an activity to stop the execution of its workflow?
I have multiple TryCatch activities and If activities and it would be nice to be able to stop the workflow after catching an exception or if certain criteria aren't met in my If activities.
You can use the TerminateWorkflow activity to stop a workflow.
Just a word of warning for anyone else who might run into this: TerminateWorkflow won't actually terminate your workflow if you put it inside a CompositeActivity (it will only terminate the composite).