What is task parallelism in the context of Airflow?

Task parallelism in general is when multiple tasks run on the same or different sets of data. But what does it mean in the context of Airflow, when I change the parallelism parameter in the airflow.cfg file?
For instance, say I want to run a data processor on a batch of data. Will setting parallelism to 32 split the data into 32 sub-batches and run the same task on those sub-batches?
Or, if I somehow have 32 batches of data originally instead of 1, am I able to run the data processor on all 32 batches (i.e. 32 tasks running at the same time)?

The setting won't "split the data" within your DAG.
From the docs:
parallelism: This variable controls the number of task instances that
runs simultaneously across the whole Airflow cluster
If you want parallel execution of a task, you will need to break it up further, meaning create more tasks where each task does less work. That can come in handy for some ETLs.
For example:
Let's say you want to copy yesterday's records from MySQL to S3.
You could do it with a single MySQLToS3Operator that reads yesterday's data in a single query. However, you could also break it into 2 MySQLToS3Operator tasks each reading 12 hours of data, or 24 operators each reading hourly data. That is up to you and the limitations of the services you are working with.
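For illustration, a minimal sketch of the hourly variant, assuming the Amazon provider's MySQLToS3Operator and a hypothetical events table; the exact import path and parameter names depend on your provider version, so treat it as a sketch rather than copy-paste code:

from datetime import datetime

from airflow import DAG
# Import path / class name varies by provider version (it was later renamed SqlToS3Operator).
from airflow.providers.amazon.aws.transfers.mysql_to_s3 import MySQLToS3Operator

with DAG(
    dag_id="mysql_to_s3_hourly",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # 24 independent tasks, one per hour of the day being copied. How many of
    # them run at the same time is bounded by parallelism (cluster-wide), the
    # DAG-level concurrency settings, and whatever MySQL/S3 can handle.
    for hour in range(24):
        MySQLToS3Operator(
            task_id=f"copy_hour_{hour}",
            # 'events' is a hypothetical table; {{ ds }} renders to the run's date.
            query=(
                "SELECT * FROM events "
                "WHERE DATE(created_at) = '{{ ds }}' "
                "AND HOUR(created_at) = " + str(hour)
            ),
            s3_bucket="my-bucket",                           # hypothetical bucket
            s3_key="events/{{ ds }}/hour_" + str(hour) + ".csv",
            mysql_conn_id="mysql_default",
            aws_conn_id="aws_default",
        )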

Related

Is there a way to control parallel task groups in Airflow? Can we use pools to do that?

I am trying to execute multiple similar tasks for different sets in parallel, but only want to run some of them while making the other task groups wait for completion. For example, if I have 5 task groups, I want to run 3 of these groups in parallel and only trigger the others when one of them completes. Basically, only have 3 running in parallel at a time. What's the best way to do that?
Why don't you use an ExternalTaskSensor? What it does is hold a task back from running until the task it depends on has finished; you can search the Airflow documentation for it.
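Since the question also asks about pools: one way to cap things at 3 is to put every task of the groups into a pool with 3 slots. The sketch below makes several assumptions (the pool name, the one-task-per-group simplification, and the process function are all made up; the pool itself is created beforehand in Admin -> Pools or with the airflow pools set CLI):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process(group_id):
    # Stand-in for the real per-group work.
    print(f"processing group {group_id}")

with DAG(
    dag_id="five_groups_three_at_a_time",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    for group_id in range(5):
        PythonOperator(
            task_id=f"group_{group_id}",
            python_callable=process,
            op_kwargs={"group_id": group_id},
            # "group_pool" is a hypothetical pool created with 3 slots, so at
            # most 3 of these tasks run at once; the other 2 wait for a free slot.
            pool="group_pool",
        )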

Garbage Collection in R process running inside docker

I have a process where we currently handle a large amount of data per day, perform a map-reduce style set of functions on it, and use only the output of those functions. We currently run a code sequence that looks like the below:
lapply(start_times, function(start_time){
  <get_data>
  <setofoperations>
})
So currently we loop through start times, which helps us get data for a particular day, analyse it, and output dataframes of results per output per day. The set of operations is a series of functions that keep working on dataframes and returning them.
While running this in a Docker container with a memory limit, we often see that the process runs out of memory when it's dealing with large data (around 250-500 MB) over periods of days, and R isn't able to effectively do garbage collection.
I'm trying an approach where I monitor each process using cAdvisor and notice spikes, but I'm not really able to understand them better.
If R does lazy GC, ideally the process should be able to reuse the memory over and over; is there something that is not being captured through the GC process?
How can an R process reclaim more memory when it's the only primary process running in the Docker container?

How to configure Apache Airflow with Celery to run concurrent tasks?

I am interested in this use case for my proof of concept, where I read from a file containing a huge list of IDs and I want to process these IDs concurrently, e.g. func(id) for each.
Is it possible to configure Airflow with the CeleryExecutor to achieve this?
I saw this link:
Running more than 32 concurrent tasks in Apache Airflow
But what if the number of IDs is unknown and could be anywhere from 10,000 to even 100,000, and I want to process them around 500-1000 at a time?
Airflow can execute tasks in parallel, and it can use Celery to achieve this. Everything else is up to you to implement however you see fit; there are no specifics related to Airflow/Celery regarding your intended use.
In the end, if all you care about is parallelizing your work and don't care much about other Airflow features, you could be better off using Celery alone.
There are many different ways to go about this, but here is some food for thought to get you started:
Airflow tasks should be as "dumb" as possible, i.e. take an input, process it, and store the output. Don't put your file-splitting logic here. You can have a dedicated DAG for that if needed. For example, you could have a DAG which reads the input file and chunks it up via some logic, then stores the chunks somewhere for tasks to pick up (a convenient file structure, message queue, db, etc.).
Decide on a place for your input data such that tasks can easily pick up a limited amount of input. For example, if you're using a file structure where one chunk to be processed is a single file, a task can read a single file and remove it. Repeat until no chunks/files are left. The same goes for any other approach, e.g. if using a message queue you can consume the chunks. Make sure you have that original DAG ready to split up the input file into chunks again if needed. You are free to make this as simple or as complex as you want (see the sketch after these points).
Watch out for idempotency, e.g. make sure your process can be repeated without side-effects. If you lose data in some step, you can just restart everything without issues.
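Here is a rough sketch of the file-based variant described in the points above. The paths, chunk size, worker count, and per-id function are all hypothetical, and a real setup would also need some locking or a queue so two workers don't grab the same chunk:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

INPUT_FILE = "/data/ids.txt"    # hypothetical: one id per line
CHUNK_DIR = "/data/chunks"      # hypothetical: pending chunk files live here
CHUNK_SIZE = 500

def func(id_):
    # Stand-in for the real per-id processing.
    print("processing", id_)

def split_input():
    # "Splitter" DAG logic: chunk the big id list into CHUNK_SIZE-sized files.
    os.makedirs(CHUNK_DIR, exist_ok=True)
    with open(INPUT_FILE) as f:
        ids = [line.strip() for line in f if line.strip()]
    for i in range(0, len(ids), CHUNK_SIZE):
        with open(os.path.join(CHUNK_DIR, f"chunk_{i // CHUNK_SIZE:05d}.txt"), "w") as out:
            out.write("\n".join(ids[i:i + CHUNK_SIZE]))

def process_one_chunk():
    # "Worker" task logic: pick up one pending chunk, process it, remove it.
    chunks = sorted(os.listdir(CHUNK_DIR))
    if not chunks:
        return  # nothing left to do
    path = os.path.join(CHUNK_DIR, chunks[0])
    with open(path) as f:
        for id_ in f:
            func(id_.strip())
    os.remove(path)  # deleting the file marks the chunk as done

with DAG("split_ids", start_date=datetime(2021, 1, 1), schedule_interval=None) as split_dag:
    PythonOperator(task_id="split_input", python_callable=split_input)

with DAG("process_ids", start_date=datetime(2021, 1, 1), schedule_interval=None) as process_dag:
    # Identical "dumb" workers; how many run in parallel is capped by the
    # parallelism setting and the DAG's own concurrency limits.
    for n in range(20):
        PythonOperator(task_id=f"worker_{n}", python_callable=process_one_chunk)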

How to run Airflow DAG for specific number of times?

How to run an Airflow DAG a specified number of times?
I tried using TriggerDagRunOperator; this operator works for me.
In the callable function we can check states and decide whether to continue or not.
However, the current count and states need to be maintained.
Using the above approach I am able to repeat a DAG 'run'.
Need expert opinion: is there any other, more sound way to run an Airflow DAG X number of times?
Thanks.
I'm afraid that Airflow is ENTIRELY about time-based scheduling.
You can set a schedule to None and then use the API to trigger runs, but you'd be doing that externally, and thus maintaining the counts and states that determine when and why to trigger externally.
When you say that your DAG may have 5 tasks which you want to run 10 times and a run takes 2 hours and you cannot schedule it based on time, this is confusing. We have no idea what the significance of 2 hours is to you, or why it must be 10 runs, nor why you cannot schedule it to run those 5 tasks once a day. With a simple daily schedule it would run once a day at approximately the same time, and it won't matter that it takes a little longer than 2 hours on any given day. Right?
You could set the start_date to 11 days ago (a fixed date though, don't set it dynamically) and the end_date to today (also fixed), then add a daily schedule_interval and a max_active_runs of 1. You'll get exactly 10 runs, and it'll run them back to back without overlapping while advancing the execution_date accordingly, then stop. Or you could just use airflow backfill with a None-scheduled DAG and a range of execution datetimes.
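As a sketch of that option (the dates, DAG id, and task below are placeholders; with classic execution_date semantics a daily schedule between these two fixed dates yields the 10 back-to-back runs described):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="exactly_ten_runs",
    start_date=datetime(2021, 1, 1),   # fixed, never dynamic
    end_date=datetime(2021, 1, 11),    # also fixed; bounds the number of runs
    schedule_interval="@daily",
    max_active_runs=1,                 # back to back, never overlapping
    catchup=True,                      # let the scheduler work through the window
) as dag:
    BashOperator(task_id="do_work", bash_command="echo working on {{ ds }}")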
Do you mean that you want it to run every 2 hours continuously, but sometimes it will be running longer and you don't want it to overlap runs? Well, you definitely can schedule it to run every 2 hours (0 0/2 * * *) and set the max_active_runs to 1, so that if the prior run hasn't finished the next run will wait then kick off when the prior one has completed. See the last bullet in https://airflow.apache.org/faq.html#why-isn-t-my-task-getting-scheduled.
If you want your DAG to run exactly every 2 hours on the dot [give or take some scheduler lag, yes that's a thing] and to leave the prior run going, that's mostly the default behavior, but you could add depends_on_past to some of the important tasks that themselves shouldn't be run concurrently (like creating, inserting to, or dropping a temp table), or use a pool with a single slot.
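A small sketch of those two knobs on a sensitive task (the DAG, task, and pool name are hypothetical; the single-slot pool would be created separately with exactly one slot):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("two_hourly_etl", start_date=datetime(2021, 1, 1),
         schedule_interval="0 */2 * * *") as dag:
    BashOperator(
        task_id="load_temp_table",
        bash_command="echo create, fill and drop the temp table here",
        # Don't run this task for a new execution until the previous run's
        # instance of the same task has succeeded.
        depends_on_past=True,
        # Alternatively (or additionally), a single-slot pool serializes it
        # across runs; "temp_table_pool" is hypothetical and defined with 1 slot.
        pool="temp_table_pool",
    )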
There isn't any feature to kill the prior run if your next schedule is ready to start. It might be possible to skip the current run if the prior one hasn't completed yet, but I forget how that's done exactly.
That's basically most of your options there. Also, you could create manual dag_runs for an unscheduled DAG, creating 10 at a time whenever you feel like it (using the UI or CLI instead of the API, but the API might be easier).
Do any of these suggestions address your concerns? Because it's not clear why you want a fixed number of runs, how frequently, or with what schedule and conditions, it's difficult to provide specific recommendations.
This functionality isn't natively supported by Airflow,
but by exploiting the meta-db, we can cook up this functionality ourselves:
write a custom operator / PythonOperator that,
before running the actual computation, checks whether 'n' runs of the task already exist in the meta-db (TaskInstance table; refer to task_command.py for help),
and if they do, just skips the task (raise AirflowSkipException, reference).
This excellent article can be used for inspiration: Use apache airflow to run task exactly once
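A rough sketch of that check, assuming the Airflow 2 ORM models and a hypothetical cap of 10 runs (the "real computation" is a stand-in):

from airflow.exceptions import AirflowSkipException
from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State

MAX_RUNS = 10  # hypothetical cap on the number of runs

def do_actual_work():
    # Stand-in for the real computation.
    print("doing the real work")

@provide_session
def run_at_most_n_times(dag_id, task_id, session=None, **_):
    # Count how many successful instances of this task the meta-db already has.
    done = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.task_id == task_id,
            TaskInstance.state == State.SUCCESS,
        )
        .count()
    )
    if done >= MAX_RUNS:
        # Skipping (rather than failing) keeps the DAG green while doing nothing more.
        raise AirflowSkipException(f"{task_id} already ran {done} times")
    do_actual_work()

Wrapped in a PythonOperator (passing the DAG and task ids via op_kwargs or the template context), this gives the skip-after-n-runs behaviour described above.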
Note
The downside of this approach is that it assumes historical runs of the task (TaskInstances) are preserved forever (and correctly);
in practice though, I've often found task_instances to be missing (we have catchup set to False);
furthermore, on large Airflow deployments, one might need to set up routine cleanup of the meta-db, which would make this approach impossible.

How to restrict the number of concurrently running map tasks?

My Hadoop version is 1.0.2. Now I want at most 10 map tasks running at the same time. I have found 2 variables related to this question.
a) mapred.job.map.capacity
but in my Hadoop version, this parameter seems to have been abandoned.
b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml)
I set this variable like below:
Configuration conf = new Configuration();
conf.set("date", date);
conf.set("mapred.job.queue.name", "hadoop");
conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");
DistributedCache.createSymlink(conf);
Job job = new Job(conf, "ConstructApkDownload_" + date);
...
The problem is that it doesn't work. There are still more than 50 maps running as the job starts.
After looking through the Hadoop documentation, I can't find another way to limit the number of concurrently running map tasks.
Hope someone can help me, thanks.
=====================
I have found the answer to this question; sharing it here for others who may be interested.
Use the Fair Scheduler, with the configuration parameter maxMaps to set a pool's maximum concurrent task slots, in the allocation file (fair-scheduler.xml).
Then when you submit jobs, just set the job's queue to the corresponding pool.
You can set the value of mapred.jobtracker.maxtasks.per.job to something other than -1 (the default). This limits the number of simultaneous map or reduce tasks a job can employ.
This variable is described as:
The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.
I think there were plans to add mapred.max.maps.per.node and mapred.max.reduces.per.node to job configs, but they never made it to release.
If you are using Hadoop 2.7 or newer, you can use mapreduce.job.running.map.limit and mapreduce.job.running.reduce.limit to restrict map and reduce tasks at each job level.
See the JIRA ticket for the fix.
mapred.tasktracker.map.tasks.maximum is the property to restrict the number of map tasks that can run at a time. Have it configured in your mapred-site.xml.
Refer to question 2.7 in http://wiki.apache.org/hadoop/FAQ
The number of mappers fired is decided by the input split size. The input split size is the size of the chunks into which the data is divided and sent to different mappers while it is read from HDFS. So in order to control the number of mappers we have to control the split size.
It can be controlled by setting the parameters mapred.min.split.size and mapred.max.split.size while configuring the job in MapReduce. The value is to be set in bytes. So if we have a 20 GB file and we want to fire 40 mappers, then we need to set it to 20480 / 40 = 512 MB each. So for that the code would be:
conf.set("mapred.min.split.size", "536870912");
conf.set("mapred.max.split.size", "536870912");
where conf is an object of the org.apache.hadoop.conf.Configuration class.
Read about scheduling jobs in Hadoop (for example the Fair Scheduler). You can create a custom queue with the desired configuration and then assign your job to that queue. If you limit your custom queue's maximum map tasks to 10, then each job assigned to the queue will have at most 10 concurrent map tasks.
