Airflow Rendered Template changes when task starts running?

I'm having a very weird airflow bug.
Problem
I have a DAG that has a BashOperator as step 1 and a KubernetesPodOperator as step 2. The issue is with the KubernetesPodOperator. Basically, I had been giving the task image X for quite some time, and I recently changed the image the task receives to Y.
The issue I'm having is that within Task Instance Details the image is correct: Y. However, in the Rendered Template the image starts out as X, and as soon as the task starts running, it changes to Y.
I know this is very vague and I can't provide a whole lot more, but I'm mostly looking for possibilities of things that could be happening, as I'm out of ideas.
What I've Tried
Delete Serialized Dags from DB
Delete Rendered Task Details from DB
Airflow db reset
Airflow db init (After nuking the whole thing)
Deleting the EC2 nodes and trying with new ones
EDIT
So, I tried running airflow tasks render dag_id task_id execution_date, and the result here is image X! Image Y only shows up on DAG runs.

Answering my own question here in case anyone runs into this issue. Very simple fix... I carelessly had a different image name for the workers in the kubernetes_pod_template file. Changing that solved the issue.
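For anyone hitting the same thing, here is a minimal sketch (the DAG name, image tag, and commands are placeholders I made up) of pinning the image explicitly on the KubernetesPodOperator so that the value shown in Rendered Template is the one that actually runs. If you use the KubernetesExecutor, keep it consistent with the image declared in your worker pod_template_file:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,  # newer provider releases expose this from operators.pod
)

with DAG("my_dag", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    step_2 = KubernetesPodOperator(
        task_id="step_2",
        name="step-2",
        namespace="default",
        image="my-registry/app:Y",  # pin the intended image here explicitly
        cmds=["python", "run.py"],
        get_logs=True,
    )
```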

Related

Airflow - Questions on batch jobs and running a task in a DagRun multiple times

I am trying to solve the following problem with Airflow:
I have a data pipeline where I want to run several processes on a number of Excel documents (e.g. 5,000 Excel files a day). My idea for a DAG is below:
Task 1 = Take an Excel file and add a new sheet to it.
Task 2 = Convert the returned Excel file to a PDF.
Tasks 1 and 2 in the DAG would call a processing tool running outside Airflow via an API call (so the actual data processing isn't happening inside Airflow).
I seem to be going around in circles figuring out the best approach to this workflow. Some questions I keep having are:
1. Should each DagRun be one Excel file, or should the DagRun take in a batch of Excel files?
2. If taking in a batch (which I presume is the correct approach), what is the recommended batch size?
3. How would I pass the returned values from Task 1 to Task 2? Would it be an XCom dictionary with a reference to each newly saved Excel file? I read somewhere that the max size of an XCom should be 48 KB, so an XCom of 5,000 Excel file paths would probably be larger than that.
4. The last, trickiest question: I would obviously want to start Task 2 as soon as even one Excel file from Task 1 has completed, because I wouldn't want to wait for the entire batch of Task 1 to finish before starting Task 2. How can I run Task 2 multiple times within the same DagRun, once for each new result that Task 1 produces? Or should Task 2 be its own DAG?
Am I approaching this problem the right way? How should I be tackling it?
Assumptions
I made some assumptions since I don't know all the details of the Excel file processing:
You cannot merge the Excel files since you need them separate.
The Excel files are accessible from the Airflow DAG (same filesystem or similar).
If any of that is not true, please clarify accordingly.
Answers
That being said, I'll first answer your questions and then comment on some thoughts:
1. I think you can process in batches, since using one run per file will be very slow (mostly because of scheduler overhead, which adds time between processing each Excel file). You're also not using all the available resources, so it's better to keep Airflow busier.
2. The batch size will depend on the processing load and the task design. From your question I assume you're thinking about handling the batch inside the task, but if the service that processes the Excel files can handle a good level of parallelism, I'd rather recommend one task per Excel file. Having 5,000 tasks (one for each file) would be a bad idea (it would be difficult to follow in the UI), but the exact number of files per batch depends mostly on your resources and the service's SLA.
3. From my experience, I recommend using one task for both steps: you can call the service in parallel, and right after the service completes you can convert the Excel file to a PDF directly (see the sketch after this list).
4. This is solved by the answer to question #3.
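To make answer #3 concrete, here is a minimal sketch of a callable that does both steps for one file. The service URLs, JSON fields, and the process_excel name are placeholders invented for illustration, not the actual API:

```python
import requests


def process_excel(path: str) -> str:
    """Call the external tool to add the new sheet, then convert the result to a PDF."""
    # Step 1: ask the external service to add the new sheet.
    resp = requests.post("https://processing-service.example/add-sheet", json={"file": path})
    resp.raise_for_status()
    updated_path = resp.json()["output_path"]

    # Step 2: convert the updated workbook to a PDF right away, so there is no
    # need to push thousands of file paths through XCom between two tasks.
    resp = requests.post("https://processing-service.example/to-pdf", json={"file": updated_path})
    resp.raise_for_status()
    return resp.json()["pdf_path"]
```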
Solution overview
The solution I imagine is something like:
A first task that checks whether there are pending files. You can fork here using a BranchPythonOperator.
Then X parallel tasks that process the Excel files (call the service) and convert them to PDF. Each could be a single PythonOperator task; if you use Airflow 2, you can simply use the @task() decorator to simplify the code. X could be anywhere from 10 to 100, for example, depending on your resources and the service throughput.
A final task that triggers the DAG again to process more files. This can be implemented with a TriggerDagRunOperator. A rough sketch of this layout follows below.
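A sketch of that layout, assuming Airflow 2. The pending-files directory, the value of X, and process_excel() (from the earlier sketch) are placeholders, not a tested pipeline:

```python
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

PENDING_DIR = "/data/excel/pending"  # placeholder: where unprocessed files land
X = 10                               # number of parallel processing tasks


def check_pending():
    # Fork: run the processing tasks if anything is pending, otherwise stop here.
    if os.listdir(PENDING_DIR):
        return [f"process_slot_{i}" for i in range(X)]
    return "nothing_to_do"


def process_slot(slot, **_):
    # Each slot takes every X-th pending file and, for each one, calls the
    # external service and converts the result to a PDF (see process_excel()).
    for name in sorted(os.listdir(PENDING_DIR))[slot::X]:
        ...  # process_excel(os.path.join(PENDING_DIR, name))


with DAG("excel_to_pdf", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    branch = BranchPythonOperator(task_id="check_pending", python_callable=check_pending)
    stop = DummyOperator(task_id="nothing_to_do")
    workers = [
        PythonOperator(task_id=f"process_slot_{i}", python_callable=process_slot, op_args=[i])
        for i in range(X)
    ]
    retrigger = TriggerDagRunOperator(
        task_id="trigger_next_batch",
        trigger_dag_id="excel_to_pdf",  # re-trigger this same DAG for the next batch
    )

    branch >> stop
    branch >> workers >> retrigger
```

With the default trigger rule, the re-trigger task is skipped whenever the branch chooses "nothing_to_do", so the DAG stops re-triggering itself once the pending directory is empty.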

Is Airflow a good fit for a DAG that doesn’t care about execution date/time?

The API in Airflow seems to suggest it is built around backfilling, catching up, and scheduling to run regularly at an interval.
I have an ETL that extracts data on S3 with the versions of the previous node (where the data comes from) in the DAG. For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through the scaled-to-100x100 transformation pipeline, but they produce different data since the input is different.
As you can see, there is no date involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually, with fewer than 10 nodes. The nodes won't run at a regular interval; instead, as soon as someone updates the code for a node, I have to run the downstream nodes manually one by one: python GetImageNet.py removed-red, then python scale.py 100 100 ImageNet-removed-red, and then python scale.py 100 100 ImageNet-mono. I am looking for a way to manage the graph so that one click triggers the run.
I think it's fine to use Airflow as long as you find the DAG representation useful. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually through the web interface.
If you want to run only specific tasks, you can trigger your DAG and mark tasks as successful or clear them using the web interface.
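For example, here is a minimal sketch of such an unscheduled DAG for the graph above; the DAG id and the task wiring are my own illustration, and the commands are just the scripts mentioned in the question:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="imagenet_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # no regular interval; run only when triggered
    catchup=False,           # no backfilling
) as dag:
    mono = BashOperator(task_id="ImageNet-mono", bash_command="python GetImageNet.py mono")
    removed_red = BashOperator(
        task_id="ImageNet-removed-red", bash_command="python GetImageNet.py removed-red"
    )
    mono_scaled = BashOperator(
        task_id="ImageNet-mono-scaled-to-100x100",
        bash_command="python scale.py 100 100 ImageNet-mono",
    )
    removed_red_scaled = BashOperator(
        task_id="ImageNet-removed-red-scaled-to-100x100",
        bash_command="python scale.py 100 100 ImageNet-removed-red",
    )

    mono >> mono_scaled
    removed_red >> removed_red_scaled

# One-click run: trigger from the UI, or from the CLI with
#   airflow dags trigger imagenet_pipeline
```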

Control-M: is it possible to continue running if the first job fails?

I have several jobs that will run in sequence. Is it possible to create a dependency between them only on completion, rather than requiring the prior job to complete successfully?
If a job fails, it should remain red and the next job should continue running.
It is mandatory that these jobs run in sequence and not in parallel.
As Mark outlined, you can simply create an On-Do action within the parent job to add a condition when the job ends Not OK. The parent job will still go red and the successor job will kick off.
Yes, on the Actions tab you create an On/Do step and specify that when the job ends Not OK, it should still add the output condition. That way the next job will run (in sequence) regardless of what happens to the predecessor job.

Is the DAG runtime displayed anywhere in the UI?

I find it strange that after a DAG run completes there is nowhere in the UI that tells me how long it took. Yes, I can go into (say) the Graph View or the Gantt chart to see when the first task started and when the last task ended, then subtract one from the other, but it would be nice if the run duration were simply displayed somewhere in the UI.
Perhaps it is displayed and I'm simply not seeing it. Does anyone know whether the duration of a DAG run is shown anywhere in the UI?

Autosys box job not finishing

I am new to Autosys and am facing difficulty setting up some jobs. I have a box job containing a few command jobs. One of those command jobs may or may not run. The problem is that when this job doesn't run (it remains in the ACTIVATED state), it keeps the box running, and I have to terminate the job or the box every time this situation arises.
Is there a way to handle this?
Thanks
The answer depends on how your jobs are supposed to work, so posting the JIL or a more detailed description would help, but:
You can add a box_success JIL attribute to your box job to define the conditions under which the box is considered complete, even without all the jobs running or completing (for example, something along the lines of box_success: s(job_a) & s(job_b), listing only the jobs that always run).
You could consider moving the optional job outside of the box so that the issue goes away. But note that the box could then complete while the optional job is still running, so make sure you don't end up with jobs running out of sequence or overlapping when they shouldn't.
