what all changes can we make in control-m job scheduling to minimize charges if we get charged on the basis of no of jobs ordered in a day in active schedule.
This is costing us a lot.
If some of your jobs are commands and share broad characteristics (nodeid, user, no alerts) then use conditional operators. E.g. linking commands with a semicolon will mean that each command is executed once. Linking with && will mean that the second command is only executed if the first runs successfully.
As Alex has mentioned, this is a broad area. And would be very difficult to anser to the point. But below are the few tips tht can be considered.
1. Check is the same script is being run by different jobs. This can be combined together by the help of scheduling tab.
2. File watcher jobs. If there is a requirement for checking the incoming file and then trigger a specific job for processign the file. [This makes up 2 jobs: Job1 - File watcher, Job2 - File Processing] This function can be achived by using AFT jobs. AFT jobs combine this two functions in one.
3. Low priority jobs, where alerting is not required, can be moved to unix/shell scripts if jobs are costly.
4. If Job2 is succeeding Job1, and Job2 is having only 1 IN CONDITION i.e from Job1, then rather than having two jobs, script (of Job2) can be called from the script (of Job1). So, ultimately we are doing two functions in Job1. Also, if the script (Job2) fails, then the Job1 will not get success return code. And you can get the details from log.
5. Keep the archiving functionality to the scripts and no need to bring it into the Control M jobs unless it is very important. And rather than doigni it every day for past 6 months, it would be better to schedule it once in a week or once in two weeks.
6. Sort the jobs in such a way that the regular jobs are in one table and the adhoc jobs (which is run only on request) into another. Keep the 'UserDaily' only for the jobs that are regular. Not keeping the 'UserDaily' for the jobs that are adhoc will not call these jobs in to the EM daily, and thus you see only those jobs that run daily and not the one that might or moght not run daily.
Hope this helps.
You can use ctmudly command to order only the user daily you want.
You can try using crontab in unix to schedule non-priority jobs which doesn't need manual intervention or observation.
You can avoid the FW jobs by including file checker logic in your Shell script which executes the actual process.
Related
I have a flow that needs to be trigger automatically on Tuesday every week.
In this, I have 4 jobs all dependent on each other.
JobA<-JobB<-JobC<-JobD
I have put all 4 jobs in one SMART folder.
Now I have scheduled this flow so that as soon as a job gets completed another job should trigger.
How can I achieve it.
Control-M has a design element called "Conditions" that defines the flowchart of a process mesh. These conditions determine the Jobs that are successors and predecessors through the Input and Output Conditions.
Keep in mind that a Condition in Control-M is a prerequisite for the execution of a Job and is composed of the following elements:
A Name,
A date
A sign
I am trying to solve the following problem with airflow:
I have a data pipeline where I want to run several processes on a number of excel documents (eg: 5,000 excel files a day). My idea for a DAG is below:
Task 1 = Take an excel file, and adds a new sheet to it.
Task 2 = Convert this returned excel to a PDF.
Task 1 and 2 in the DAG would call a processing tool running outside airflow via an API call (So the actual data processing isnt happening inside airflow).
I seem to be going around in circles with figuring out the best approach to this workflow. Some questions I keep having are:
Should each DagRun be one excel, or should the DagRun take in a batch
of excels?
If taking in a batch (which I presume is the correct approach), what is the recommend batch amount?
How would I pass the returned values from task 1 to task 2. Would it be an XCOM dictionary with a reference to each newly saved excel? I read somewhere that the max size of an xcom should be 48kb. So if i have a XCOM of 5,000 excel filepaths, that will probabaly be larger than 48kb.
The last, most tricky question I have is, I would obviously want to start processing task 2 as soon as even 1 excel from Task 1 had completed, because i wouldnt want to wait for the entire batch of Task 1 to complete before starting Task 2. How can I run Task 2, multiple times within the same DagRun for each new result that Task 1 produces? Or should Task 2 be its own DAG?
Am I approaching this problem the right way? How should I be tackling this problem?
Assumptions
I made some assumptions since I don't know all the details of the Excel file processing:
You cannot merge the Excel files since you need them separate.
Excel files are accessible from Airflow DAG (same filesystem or similar).
If something of that is not true, please clarify accordingly.
Answers
That being said, I'll first answer your questions and then comment on some thoughts:
I think you can do in batches, since using one run per file will be very slow (because of the scheduler time mostly, that will add time between Excel files processing). You're also not using all the available resources, so better push Airflow to be more busy.
The batch amount will depend on the processing load and the task design. From your question I assume you're thinking about having the batch inside the task, but if the service that process the Excel files could handle good parallelism, I'd rather recommend one task per Excel file. Having 5000 tasks (one for each file) will be a bad idea (because that'll be difficult so see in the UI), but the exact number of processes per batch depends on your resources and service SLA mostly.
From my experience I recommend using one task for everything, since you can call the service in parallel and right after the service completes, you can directly transform the Excel file in PDF.
This gets solved with the answer from question #3.
Solution overview
The solution I imagine is something like:
First task for checking existence of pending files. You can do a fork using a BranchPythonOperator (example here).
Then you have X parallel tasks to process Excel (call the service) and transform that to PDF. Could be one PythonOperator task. If you use Airflow 2, you can simply use #task() decorator to simplify the code. The X could be from 10 to 100 for example, depending on the resources and the service throughput.
Have a final task that triggers the DAG again to process more files. This could be implemented using a TriggerDagRunOperator (example here).
I have an Oozie job that processes data incrementally. Going forward, I would like to run this job on an hourly basis to prepare the results as soon as possible. But to backfill old data, it would be faster to run sequential jobs processing a week's worth of data at a time.
Is it possible to have a single coordinator.xml file that allows for both of these modes, and simply choose between them based on a flag specified ad-hoc when the job is scheduled?
In the parameters of the <coordinator-app> tag in coordinator.xml, there is a single frequency, which suggests that this is not possible, at least not in a natural way.
I don't think there is an easy way to do different frequencies within a coordinator. Based on your description you would not need the weekly job after the backfill happens.
I imagine you'd have to change the parameterization of the workflow as well to process more or less data.
On the other hand, you could just start the coordinator in the past with the frequency you'd like and tweak parameters like concurrency, throttle and execution in the app definition so Oozie can chew through the backlog by executing the workflow in parallel.
My eventual solution was to create the workflow at a given frequency (say, daily), and then create a second "backfill" workflow with a different frequency (weekly or monthly) that calls the original workflow as a sub-workflow.
The API in Airflow seems to suggest it is build around backfilling, catching up and scheduling to run regularly in interval.
I have an ETL that extract data on S3 with the versions of the previous node (where the data comes from) in DAG. For example, here are the nodes of the DAG:
ImageNet-mono
ImageNet-removed-red
ImageNet-mono-scaled-to-100x100
ImageNet-removed-red-scaled-to-100x100
where ImageNet-mono is the previous node of ImageNet-mono-scaled-to-100x100 and
where ImageNet-removed-red is the previous node of ImageNet-removed-red-scaled-to-100x100
Both of them go through transformation of scaled-to-100x100 pipeline but producing different data since the input is different.
As you can see there is no date is involved. Is Airflow a good fit?
EDIT
Currently, the graph is simple enough to be managed manually with less than 10 nodes. They won't run in regularly interval. But instead as soon as someone update the code for a node, I would have to run the downstream nodes manually one by one python GetImageNet.py removed-red and then python scale.py 100 100 ImageNet-removed-redand then python scale.py 100 100 ImageNet-mono. I am looking into a way to manage the graph with a way to one click to trigger the run.
I think it's fine to use Airflow as long as you find it useful to use the DAG representation. If your DAG does not need to be executed on a regular schedule, you can set the schedule to None instead of a crontab. You can then trigger your DAG via the API or manually via the web interface.
If you want to run specific tasks you can trigger your DAG and mark tasks as success or clear them using the web interface.
I am working on an automation, where I will get the list of jobs that did not start to run even though the scheduled time has crossed. I am going to get the list based on an 2 hour time gap.
Now my question is how to get the list of jobs that are scheduled on a particular time period on that particular day.
For eg., 22-03-3018 08:00 - 10:00 am list of jobs scheduled on this period
I want to execute the command in unix.
Depending on how your linux system is set up, you can look in:
/var/spool/cron/* (user crontabs)
/etc/crontab (system-wide crontab)
also, many distros have:
/etc/cron.d/* These configurations have the same syntax as /etc/crontab
/etc/cron.hourly, /etc/cron.daily, /etc/cron.weekly, /etc/cron.monthly
These are simply directories that contain executables that are executed hourly, daily, weekly or monthly, per their directory name.
On top of that, you can have at jobs (check /var/spool/at/), anacron (/etc/anacrontab and /var/spool/anacron/) and probably others I'm forgetting.
In Autosys there is no easy way through using autosys native cmd's
However you can get all this information from the DB it is in the WCC DB and the table is dbo.MON_JOB query would be to get you started:
SELECT [NAME]
,[NEXT_TIME]
FROM [CAAutosys11_3_5_WCC_PRD].[dbo].[MON_JOB]
WHERE [NEXT_TIME] > '1970'
ORDER BY [NEXT_TIME] ASC
Let me know if you need more clarification.