I need to process a file as multiple batches of rows, one batch at a time. I was originally imagining Task A splitting the file into x batches and submitting each batch to Task B, so that B can process the incoming batch (calling another REST endpoint). Is there a way for a task (A) to call another task (B) in a loop as we construct each batch of rows sequentially, so B can process that batch?
If not, do you have a suggestion for a better way to design this batch processing system? One of the requirements is the ability to stop/start this file processing in flight and the ability to run multiple files at the same time.
Let's assume my DAG converts a large dataset from CSV format to Parquet. If the DAG fails for some reason while running, is it possible to restore the progress when I re-run it?
It shouldn't start from scratch after I re-run the DAG.
An Airflow DAG is a collection of tasks, organized in a way that reflects their relationships and dependencies. So if you have a DAG with 3 tasks, A -> B -> C, and task C fails, you can just re-run it without re-running A and B. But if you re-run the DAG itself, that means you re-run task A with all of its downstream tasks (B and C).
If you want to restore progress within a task, you can do that in your job logic, but this is not related to Airflow; it depends only on the technology you use and the logic you want to implement. For example, if your dataset consists of multiple files, you can create a state store on cloud storage or in a database that records the processing state of each file; if a file has already been processed, you skip it and move on to the next one.
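A minimal sketch of that idea, assuming local directories, one marker file per input file as the state store, and pandas/pyarrow for the CSV-to-Parquet conversion (all paths and names are placeholders):

# Sketch only: the directories and the ".done" marker convention are assumptions.
import os
import pandas as pd

INPUT_DIR = "/data/csv"        # hypothetical input folder
OUTPUT_DIR = "/data/parquet"   # hypothetical output folder
STATE_DIR = "/data/state"      # one marker file per successfully processed input

def convert_pending_files():
    """Convert each CSV to Parquet, skipping files already marked as done."""
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    os.makedirs(STATE_DIR, exist_ok=True)
    for name in sorted(os.listdir(INPUT_DIR)):
        if not name.endswith(".csv"):
            continue
        marker = os.path.join(STATE_DIR, name + ".done")
        if os.path.exists(marker):          # processed in an earlier run: skip it
            continue
        df = pd.read_csv(os.path.join(INPUT_DIR, name))
        df.to_parquet(os.path.join(OUTPUT_DIR, name[:-4] + ".parquet"))
        open(marker, "w").close()           # record success so a re-run can resume

On a re-run of the DAG, already-converted files are skipped, so only the remaining files get processed.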
I have written a single test case in Robot Framework that fetches data about more than 1000 locations from an Excel sheet and runs each location. The whole execution takes more than 12 hours to complete. I want to minimize the execution time. Is there any way to execute it in parallel? I have looked at Pabot, but that executes test cases in parallel, and I have only one test case.
DataDriver works for me, combined with pabot:
DataDriver: https://github.com/Snooz82/robotframework-datadriver
Pabot: https://pabot.org/PabotLib.html
How to run:
pabot --testlevelsplit --pabotlib ...
No, there is no way for Robot Framework to split a single test case into multiple parallel threads or processes.
If you want multiple keywords to run in parallel, you'll have to rewrite your test suite to contain multiple tests, or create your own keywords to do work in parallel threads or processes.
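For the second option, the per-location fan-out can live in a custom keyword library written in Python. The sketch below is only an illustration, assuming the work for one location can be expressed as a Python function; the library name, the helper, and the endpoint are hypothetical:

# ParallelLocations.py -- hypothetical keyword library; load it in the suite with
# "Library    ParallelLocations.py" and call the "Process Locations In Parallel" keyword.
from concurrent.futures import ThreadPoolExecutor

import requests  # assumes the per-location work is an HTTP call; adapt as needed


def _process_one_location(location):
    # Placeholder for whatever the existing test does for a single location.
    response = requests.get("https://example.invalid/api", params={"loc": location})
    response.raise_for_status()
    return location


def process_locations_in_parallel(locations, max_workers=20):
    """Run the per-location work in a thread pool; fails if any location fails."""
    with ThreadPoolExecutor(max_workers=int(max_workers)) as pool:
        # list() drains the iterator, so the first exception is re-raised here.
        return list(pool.map(_process_one_location, locations))

The trade-off is that Robot Framework then sees one keyword call instead of one result per location, so per-location reporting is coarser.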
I have two batch jobs in different flows. The first does an upsert in Salesforce and, when it finishes, it calls the second flow, which has another batch job.
But when I look at the console log, the log of the second batch is sometimes mixed with the log of the first.
I get the feeling that the batch processes are asynchronous and that the second batch is called even though the first batch is still being processed.
Am I wrong? Should I pay attention to the order of the logs?
If I wanted it to be totally synchronous, what would be the best way?
Mule Batch is asynchronous; it is fire and forget. If you want to call the second batch only after the first batch has completed, invoke the second batch in the 'On Complete' phase of the first batch.
If you want to perform some processing before invoking the second batch, you need to use a request-reply scope to make the batch component behave synchronously.
Yes, the batch job is asynchronous. As soon as the batch job is triggered, the flow will move on to the next event processor.
If batch job 2 needs to run only after batch job 1, you can use the On Complete phase of the first batch job to trigger an event indicating that the first has finished, which can then be used to trigger the second batch job.
Alternatively, if the batch jobs are that closely related, you might be able to combine them into a single batch job with multiple batch steps.
I am trying to figure out whether Airflow can express a workflow where multiple instances of the same task are started based on the output of a parent task. Airflow supports multiple workers, so I naively expect that it can be used to orchestrate workflows involving batch processing. So far I have failed to find any recipe or direction that fits this model. What is the right way to leverage Airflow for a batch processing workflow like the one below? Assume there is a pool of Airflow workers.
Example of a workflow:
1. Start Task A to produce multiple files
2. For each file start an instance of Task B (might be another workflow)
3. Wait for all instances of Task B, then start Task C
As a hack to parallelize processing of input data in Airflow, I use a custom operator that splits the input into a predetermined number of partitions. The downstream operator gets replicated for each partition and, if needed, the results can be merged again. For local files, the operator runs the split command. On Kubernetes, this works nicely with cluster autoscaling.
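A minimal sketch of that shape with a fixed partition count, written against the Airflow 2 operators (the file paths, partition count, and processing body are placeholders, and the split could just as well be a custom operator):

# Sketch only: split a file into N chunks, process each chunk in its own task, then merge.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

NUM_PARTITIONS = 4  # fixed up front so the DAG shape stays static


def process_partition(partition_index):
    # Placeholder: call the REST endpoint / convert the partition here.
    print(f"processing partition {partition_index}")


with DAG(
    dag_id="split_process_merge",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    split = BashOperator(
        task_id="split_input",
        # GNU split: cut the input into NUM_PARTITIONS line-aligned chunks.
        bash_command="split -n l/{{ params.n }} /data/input.csv /data/part_",
        params={"n": NUM_PARTITIONS},
    )
    merge = BashOperator(task_id="merge_results", bash_command="echo merging results")

    for i in range(NUM_PARTITIONS):
        worker = PythonOperator(
            task_id=f"process_partition_{i}",
            python_callable=process_partition,
            op_kwargs={"partition_index": i},
        )
        split >> worker >> merge

Because every partition task sits between the split and the merge, step 3 of the workflow (wait for all instances of Task B, then start Task C) falls out of the dependency graph for free.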
I was wondering if there's a way of running tasks asynchronously in the background (using Celery, for example) so that certain tasks never run simultaneously.
That is, each task can run simultaneously with other instances of itself, but not with other tasks that interfere with its actions.
For example:
Task A: reads from a file (can run simultaneously with itself, i.e. with other tasks that read from files).
Task B: writes to a file (should not run simultaneously with the read tasks, i.e. with Task A).
Essentially, what I need is a way for tasks A and B to find out whether the other task is running and, if it is, to delay themselves and wait until it's done (probably by blocking the task queue).
Does defining a queue for the tasks solve the problem, or is it just a queue for the execution of tasks (so it will execute the second task in the queue without waiting for the result of the first one)?
Is using a lock my only solution here?
If the lock solution is the only one, what's the correct way of implementing this?
I have found this:
Ensuring a task is only executed one at a time
But it uses Django's cache as a lock, and since I'm not running my programs in a Django environment, it doesn't work for me.
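For what it's worth, the same recipe works outside Django with any shared store for the lock. Below is a minimal sketch using a Redis lock; the broker URL, key name, timeout, and file path are all assumptions, and it is a plain mutual-exclusion lock rather than a true reader/writer lock, so it serializes the read tasks as well:

# tasks.py -- sketch only: swaps Django's cache for a Redis-based lock.
import redis
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")
redis_client = redis.Redis(host="localhost", port=6379, db=1)

LOCK_KEY = "shared-file-lock"   # one key shared by every task that touches the file
LOCK_TIMEOUT = 60 * 10          # seconds; the lock expires if a worker dies mid-task


@app.task(bind=True, max_retries=None)
def write_to_file(self, payload):
    lock = redis_client.lock(LOCK_KEY, timeout=LOCK_TIMEOUT)
    if not lock.acquire(blocking=False):
        # Another task holds the lock: retry later instead of blocking a worker.
        raise self.retry(countdown=5)
    try:
        with open("/tmp/shared.txt", "a") as fh:   # hypothetical shared file
            fh.write(str(payload) + "\n")
    finally:
        lock.release()

A read task would acquire the same lock before reading; letting reads overlap each other but not overlap writes needs a proper reader/writer scheme on top of this.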