I want to use airflow for image processing.
I have 4 Tasks: Image Pre process (A) ,bounding box finder (B), classification (C), image finalize (D).
the chart look like this:
A -> B1 -> C \
-> B2 -> C - D
-> B3 -> C /
-> Bn -> C /
the output of Image Pre process task is a list of bounding box proposals, for each bounding box I run classification and once all classification tasks ends I run the image finalize.
I want everything to run in parallel
This will run on 10000 images per day so if I will have different presentation of pipeline in the UI for each image, I can't keep track of the pipeline...
Is it possible in airflow ?
Dynamically creating tasks like this is not something Airflow is best for. Take a look at the answer here to get some insight: Airflow dynamic tasks at runtime.
Airflow is better suited as a scheduling tool, so I propose you delegate the actual work and parallelization to another tool like Celery. You can still use Airflow to schedule this work, in a way that your B step is a simple operator which reads the output from A (via XCom or similar) and distributes actual work to some remote workers.
Can you know in advance the maximum possible number of B tasks? If that's manageable, you could get away with creating the max B tasks, and then skipping some of them as needed depending on the outcome of A.
The implementation might not be trivial, but you could get some hints from this discussion: Launch a subdag with variable parallel tasks in airflow.
I am looking to calculate the wait in a queue per position or a general time based on your queue position. It is a FIFO.
List of current performance status of the service
Size AvTime Queue Processing AvgFileSize(mb)
1 (0 - 1 mb) 2.57 18 3 0.21
2 (1 - 5 mb) 12.43 2 4 2.16
3 (5 - 10 mb) 23.38 9 8 6.72
4 (10 - 25 mb) 38.17 1 4 12.52
5 (>= 25 mb) 109.31 0 0 32.41
The current list of processing and queued batch files. Only lists the current users files so that is why there are queue numbers missing.
Queue Filename Status
30 Batch (3456).XML(2) Queue
20 Batch (2399).xml(3) Queue
14 batch (1495).xml(1) Queue
12 batch (1497).xml(1) Queue
15 batch (1499).xml(1) Queue
10 batch (1500).xml(4) Queue
13 batch (1496).xml(1) Queue
11 batch (1501).xml(1) Queue
9 batch (1498).xml(1) Queue
8 batch (1494).xml(1) Queue
7 batch (1493).xml(1) Queue
6 batch (1492).xml(1) Queue
5 batch (1491).xml(1) Queue
4 batch (1490).xml(1) Queue
3 batch (1).xml(1) Queue
2 Batch1.xml(1) Queue
1 Batch1.XML(2) Queue
Batch1.xml(1) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(2) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
So I am looking to add more information to the list how long until a batch file at position 20 will be waiting in the queue before it starts processing.
Queue Filename Status
30 (*30min) Batch (3456).XML(2) Queue
20 (*10min) Batch (2399).xml(3) Queue
...
*estimated
Your question doesn't quite provide enough context to make it possible to answer, but I can make some guesses based on the sample displays you provided.
Looks like you have a "single queue, multiple server" setup. In other words, you have a single FIFO queue, and a some fixed number N of jobs that can be in processing at any given time. Is that right?
For your algorithm, let's assume you have the following information:
Position of our job in queue (position N means there are N jobs ahead
of us)
Size of our job
Size of each job ahead of us in the queue
Pool of jobs being processed, with a certain maximum size N
Size of each job currently being processed
Elapsed time for each job currently in process (how long since that job started)
First of all, you will need a function ExpectedJobDuration(jobsize) that computes an expected job processing time for a job of a given size, based on the statistics shown in your "performance status" table. This looks pretty straightforward. Given a job size, first figure out which of your five size categories it falls into (0: 0-1mb, 1: 1-5mb, etc.) Then take your job size and multiply by the average time divided by the average size of jobs in that category. That will give you an estimate of ExpectedJobDuration(jobsize), which will tell you how long it takes to run a job of a given size, under the assumption that job time is proportional to job size, for jobs within a particular size range.
Now, for a job of a given size that's already been in process for a given time ElapsedProcessingTime, how long do we expect it to to take complete? A simple answer would be something like:
ExpectedRemainingTime = ExpectedJobDuration(jobsize) - ElapsedProcessingTime.
For jobs sitting the the queue this will be exactly the same as the expected job duration; for jobs already being processed we subtract the time the job has already been in work. However, if there is some random variation in job processing times, this is not exactly right, and could turn out to be negative. This is sort of like the actuarial problem: the average lifespan of a person is X years, how long do we expect someone to live if they are already Y years old? You would need a lot more statistical data to compute this, so for practical purposes, if the answer comes out negative, just set it to zero. (If someone is 100 years old, and the average human lifespan is 90, expect them to die at any moment. That's not quite right, but perhaps OK as a first approximation. Unless you are the 100 year old person, and not yet ready to die. :-))
OK, now we have a way to compute how long each job ahead of us in the queue should take, and how long it should take to complete jobs already in process.
If the number of jobs currently being processed is less than N (the max that can be processed at any given time) then our job can start right away. So in that case we have the answer - expected delay until our job can start is zero seconds.
Now let's look at the case where we are in position 0 in the queue. That means there are no jobs ahead of us in the queue, so our expected time to start is the minimum of the ExpectedRemainingTime of the jobs in the processing pool.
Now that gives us the basis for a recursive function that computes delay until our expected start time.
DelayUntilStart(jobPool, currentJob, queue) {
find minJob in jobPool with minimum ExpectedRemainingTIme
if currentJob is in position zero of queue
return expectedRemainingTime(minJob)
else
remove minJob from jobPool
pop the top job from the queue and put it in the jobPool
return ExpectedRemainingTime(minJob) + DelayUntilStart(jobPool, currentJob, queue)
done
}
Note - we may have a very long job ahead of us in the queue - but that doesn't mean we have to wait for it to complete. We just have to wait for it to get into the pool of jobs currently being processed, and then a shorter job might complete and let us into the pool.
The algorithm I just described is going to be an approximation. But it's probably about as good you are going to get without a lot of statistics about job processing times. For practical purposes I bet it would work pretty well.
I am trying to solve an extension of Assignment problem, where both tasks and the man hours are divisible.
for instance, a man X has 4 hours available in a day, can do 1/3 of task A in 2 hours, 1/4 of task B in 4 hours. Man Y has 10 hours available can do 1/5 of a task A in 1.3 hours, 1/8 of task B in 6 hours. Is there an extension of BiPartite matching which can solve this?
Thanks!
I don't think that you can easily model this as a bipartite matching. However, it should be fairly easy to create a linear program for your problem. Just have for every worker a set of variables x_{i,j} which indicates how much of person i's time is allocated to task j.
Let h_i be the number of hours available for person i. Then, for every person i it must hold that
Let a_{i,j} be the "efficiency" of person i at task j, i.e., how much of the task the person can do in one hour. Then, for every task j it must hold that:
That's it. No integrality constraints or anything.
I have around 10 automation tasks in R everyday.Some tasks are dependent on others.Normally, I use taskscheduleR package to set each tasks at different times (30 min difference),eg: 8:00, 8:30,..., 12:00. My problem is:
1) For some tasks: the time to run ranges from 5 to 25 minutes ---> 30 min difference is too much
2) Due to some infrequent errors, some task take more than 30 mins to run
Therefore, I just want to set a batch of tasks to complete, one after others. I prefer the solutions can be run in R
Thanks in advance,
Just come up with the answer, I never thought that it will be that easy.
Just consider 10 tasks as part of a big task, and schedule the big task:
#big task: bigtask.R
source('task1.R')
source('task2.R')
....
source('task10.R)
All we need is to schedule the 'bigtask.R' by using taskscheduleR addins
I'm running into a situation where a cron job I thought was running every 55 minutes is actually running at 55 minutes after the hour and at the top of the hour. Actually, it's not a cron job, but it's a PHP scheduling application that uses cron syntax.
When I ask this application to schedule a job every 55 minutes, it creates a crontab line like the following.
*/55 * * * *
This crontab line ends up not running a job every 55 minutes. Instead a job runs at 55 minutes after the hours, and at the top of the hour. I do not desire this. I've run this though a cron tester, and it verifies the undesired behavior is correct cron behavior.
This leads me to looking up what the / actually means. When I looked at the cron manual I learned the slash indicated "steps", but the manual itself is a little fuzzy on that that means
Step values can be used in conjunction with ranges. Following a range with "<number>" specifies skips of the number's value through the range. For example, "0-23/2" can be used in the hours field to specify command execution every other hour (the alternative in the V7 standard is "0,2,4,6,8,10,12,14,16,18,20,22"). Steps are also permitted after an asterisk, so if you want to say "every two hours", just use "*/2".
The manual's description ("specifies skips of the number's value through the range") is a little vague, and the "every two hours" example is a little misleading (which is probably what led to the bug in the application)
So, two questions:
How does the unix cron program use the "step" information (the number after a slash) to decide if it should skip running a job? (modular division? If so, on what? With what conditions deciding a "true" run, and which decisions not? Or is it something else?)
Is it possible to configure a unix cron job to run every "N" minutes?
Step values can be used in conjunction with ranges. Following a range
with "<number>" specifies skips of the number's value through the range. For
example, "0-23/2" can be used in the hours field to specify command
execution every other hour (the alternative in the V7 standard is
"0,2,4,6,8,10,12,14,16,18,20,22"). Steps are also permitted after an
asterisk, so if you want to say "every two hours", just use "*/2".
The "range" being referred to here is the range given before the /, which is a subrange of the range of times for the particular field. The first field specifies minutes within an hour, so */... specifies a range from 0 to 59. A first field of */55 specifies all minutes (within the range 0-55) that are multiples of 55 -- i.e., 0 and 55 minutes after each hour.
Similarly, 0-23/2 or */2 in the second (hours) field specifies all hours (within the range 0-23) that are multiples of 2.
If you specify a range starting other than at 0, the number (say N) after the / specifies every Nth minute/hour/etc starting at the lower bound of the range. For example, 3-23/7 in the second field means every 7th hour starting at 03:00 (03:00, 10:00, 17:00).
This works best when the interval you want happens to divide evenly into the next higher unit of time. For example, you can easily specify an event to occur every 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, or 30 minutes, or every 1, 2, 3, 4, 6, or 12 hours. (Thank the Babylonians for choosing time units with so many nice divisors.)
Unfortunately, cron has no concept of "every 55 minutes" within a time range longer than an hour.
If you want to run a job every 55 minutes (say, at 00:00, 00:55, 01:50, 02:45, etc.), you'll have to do it indirectly. One approach is to schedule a script to run every 5 minutes; the script then checks the current time, and does its work only once every 11 times it's called.
Or you can use multiple lines in your crontab file to run the same job at 00:00, 00:55, 01:50, etc. -- except that a day is not a multiple of 55 minutes. If you don't mind having a longer or shorter interval once a day, week, or month, you can write a program to generate a large crontab with as many entries as you need, all running the same command at a specified time.
I came across this website that is helpful with regard to cron jobs.
https://crontab.guru
And specific to your case with * /55
https://crontab.guru/#*/55_*_*_*_*
It helped to get a better understanding of the concept behind it.
There is another tool named at that should be considered. It can be used instead of cron to achieve what the topic starter wants. As far as I remember, it is pre-installed in OS X but it isn't bundled with some Linux distros like Debian (simply apt install at).
It runs a job at a specific time of day and that time can be calculated using a complex specification. In our case the following can be used:
You can also give times like now + count time-units, where the time-units can be minutes, hours, days, or weeks and you
can tell at to run the job today by suffixing the time with today and to run the job tomorrow by suffixing the time with tomorrow.
The script every2min.sh is executed every 2 minutes. It delays next execution every time the instance is running:
#!/bin/sh
at -f ./every2min.sh now + 2 minutes
echo "$(date +'%F %T') running..." >> /tmp/every2min.log
Which outputs
2019-06-27 14:14:23 running...
2019-06-27 14:16:00 running...
2019-06-27 14:18:00 running...
As at does not know about "seconds" unit, the execution time will be rounded to full minute after the first run. But for a given task (with 55 minutes range) it should not be a big problem.
There also might be security considerations
For both at and batch, commands are read from standard input or the file specified with the -f option and executed. The working directory, the environment (except for the variables BASH_VERSINFO, DISPLAY, EUID, GROUPS, SHELLOPTS, TERM, UID, and _) and the umask are retained from the time of invocation.
This is the easiest way to schedule something to be ran every X minutes I've seen so far.