I have two Windows machines and one Unix machine. The AutoSys JILs are defined on the Unix machine.
I have one cleanup job that I want to run on both Windows machines at the same time every day.
I don't want to create two different jobs (JILs). In the JIL, can I add two machine names, like machine: WindowMachine1, WindowMachine2? Is this correct, or is there another way to do this?
Defining multiple machines as a single virtual entity will cause AutoSys to treat them as a pool and load balance when sending them work, causing your cleanup job to only run on one of the two machines.
To have a job run on both machines simultaneously, you will need two separate CMD jobs, one sent to each machine.
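To avoid maintaining two hand-written definitions, the pair of jobs could be generated from one template. The following is only a sketch: the helper name `render_cleanup_jobs`, the job naming scheme, the command path, and the 02:00 start time are all hypothetical, and the JIL attributes used should be checked against your AutoSys version before loading the output with the jil utility.

```python
# Hypothetical helper: render one JIL CMD job per target machine.
# Attribute values (command, start time) are placeholders, not from the question.
JIL_TEMPLATE = """\
insert_job: {job_name}
job_type: CMD
machine: {machine}
command: {command}
date_conditions: 1
days_of_week: all
start_times: "{start_time}"
"""


def render_cleanup_jobs(base_name, machines, command, start_time):
    """Return JIL text with one CMD job per machine, all sharing one start time."""
    blocks = []
    for machine in machines:
        blocks.append(
            JIL_TEMPLATE.format(
                job_name=f"{base_name}_{machine}",  # assumed naming scheme
                machine=machine,
                command=command,
                start_time=start_time,
            )
        )
    # Separate the job definitions with a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(
        render_cleanup_jobs(
            "cleanup",
            ["WindowMachine1", "WindowMachine2"],
            r"C:\scripts\cleanup.bat",  # placeholder command
            "02:00",                    # placeholder start time
        )
    )
```

This still produces two jobs in AutoSys, as the answer requires, but only one definition has to be edited when the schedule or command changes.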
Trying to split out Airflow processes onto 2 servers. Server A, which has been already running in standalone mode with everything on it, has the DAGs and I'd like to set it as the worker in the new setup with an additional server.
Server B is the new server which would host the metadata database on MySQL.
Can I have Server A run LocalExecutor, or would I have to use CeleryExecutor? Does the airflow scheduler have to run on the server that has the DAGs, or does it have to run on every server in a cluster? I'm confused about what dependencies there are between the processes.
This article does an excellent job demonstrating how to cluster Airflow onto multiple servers.
Multi-Node (Cluster) Airflow Setup
A more formal setup for Apache Airflow is to distribute the daemons across multiple machines as a cluster.
Benefits
Higher Availability
If one of the worker nodes were to go down or be purposely taken offline, the cluster would still be operational and tasks would still be executed.
Distributed Processing
If you have a workflow with several memory intensive tasks, then the tasks will be better distributed to allow for higher utilization of data across the cluster and provide faster execution of the tasks.
Scaling Workers
Horizontally
You can scale the cluster horizontally and distribute the processing by adding more executor nodes to the cluster and allowing those new nodes to take load off the existing nodes. Since workers don’t need to register with any central authority to start processing tasks, the machine can be turned on and off without any downtime to the cluster.
Vertically
You can scale the cluster vertically by increasing the number of celeryd daemons running on each node. This can be done by increasing the value in the ‘celeryd_concurrency’ config in the {AIRFLOW_HOME}/airflow.cfg file.
Example:
celeryd_concurrency = 30
You may need to increase the size of the instances in order to support a larger number of celeryd processes. This will depend on the memory and cpu intensity of the tasks you’re running on the cluster.
Scaling Master Nodes
You can also add more Master Nodes to your cluster to scale out the services that are running on the Master Nodes. This will mainly allow you to scale out the Web Server Daemon in case there are too many HTTP requests coming in for one machine to handle, or if you want to provide Higher Availability for that service.
One thing to note is that there can only be one Scheduler instance running at a time. If you have multiple Schedulers running, there is a possibility that multiple instances of a single task will be scheduled. This could cause some major problems with your Workflow and cause duplicate data to show up in the final table if you were running some sort of ETL process.
If you would like, the Scheduler daemon may also be set up to run on its own dedicated Master Node.
Apache Airflow Cluster Setup Steps
Pre-Requisites
The following nodes are available with the given host names:
master1 - Will have the role(s): Web Server, Scheduler
master2 - Will have the role(s): Web Server
worker1 - Will have the role(s): Worker
worker2 - Will have the role(s): Worker
A Queuing Service is Running. (RabbitMQ, AWS SQS, etc)
You can install RabbitMQ by following these instructions: Installing RabbitMQ
If you’re using RabbitMQ, it is recommended that it also be set up as a cluster for High Availability. Set up a Load Balancer to proxy requests to the RabbitMQ instances.
Additional Documentation
Documentation: https://airflow.incubator.apache.org/
Install Documentation: https://airflow.incubator.apache.org/installation.html
GitHub Repo: https://github.com/apache/incubator-airflow
All airflow processes need to have the same contents in their airflow_home folder. This includes configuration and dags. If you only want server B to run your MySQL database, you do not need to worry about any airflow specifics. Simply install the database on server B and change your airflow.cfg's sql_alchemy_conn parameter to point to your database on Server B and run airflow initdb from Server A.
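As a rough illustration of the sql_alchemy_conn change described above, the setting could be rewritten with Python's standard configparser. This is a sketch under assumptions: the function name `point_airflow_at_db`, the user, password, and database name in the URL are hypothetical, and it assumes sql_alchemy_conn lives in the [core] section of airflow.cfg, as it does in the Airflow releases that use airflow initdb.

```python
import configparser
import io


def point_airflow_at_db(cfg_text, conn_url):
    """Return airflow.cfg text with [core] sql_alchemy_conn set to conn_url."""
    # interpolation=None leaves any '%' characters in other settings untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("core"):
        parser.add_section("core")
    parser.set("core", "sql_alchemy_conn", conn_url)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    sample_cfg = (
        "[core]\n"
        "executor = LocalExecutor\n"
        "sql_alchemy_conn = sqlite:////tmp/airflow.db\n"
    )
    # Placeholder credentials and host name for Server B.
    print(point_airflow_at_db(
        sample_cfg, "mysql://airflow_user:secret@serverB:3306/airflow"
    ))
```

Note that configparser drops comments when it writes the file back, so editing the one line by hand may be preferable on a heavily commented airflow.cfg.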
If you also want to run airflow processes on server B, you would have to look into scaling using the CeleryExecutor.
I am looking to set up a machine container in Autosys to look like the below example:
Example_Example_MAIN
Example_Example_MAIN.Machine_Name1
Example_Example_MAIN.Machine_Name2
Example_Example_MAIN.Machine_Name3
Example_Example_MAIN.Machine_Name4
The way I am currently controlling these machines is to send 2, 3 and 4 Offline and leave 1 Online. Then, if 1 goes Offline, I send 2 Online and the batch runs on that machine.
Is it possible to leave all machines inside a container Online but specify a machine priority? For example, if I leave all machines Online, the batch will automatically target Machine_Name1, but if 1 goes Offline, the batch will automatically target machine 2, and so on.
Sorry if this is a silly question, I'm still only a beginner!
Thank you in advance!
Cameron.
Yes, you can place all of your machines in a single pool. Autosys will only send jobs to the machines in the pool that are Online.
To do further load balancing than that, you'll have to configure the Factor (how fast it is relative to your other machines) and Max_Load (how much work it can handle at once) for every machine in the pool, as well as setting Job_Load units on each of your jobs indicating how much CPU they consume when running.
Refer to Chapter 3 of the Autosys user guide for the full details.
I need to launch a Condor job on a cluster with multiple slots per machine.
I have an additional requirement that two jobs can not be placed at the same time in the same physical machine. This is due to some binary that I can not control which performs some networking (poorly).
This is a somewhat related question: Limiting number of concurrent processes scheduled by condor
but it does not completely solve my problem. I understand I could restrict where jobs can run in the following way: Requirements = (name == "slot1#machine1") || (name == "slot1#machine2") ...
However, this is too restrictive, as I don't care which slot the jobs run in as long as two jobs are not together on the same machine.
Is there a way to achieve this?
If this is not possible how can I tell condor to pick the machine that has the most slots available?
You can try the condor_status command to check the status of the pool of machines.
The first column shows the name of the slots and machines
Now check the State - Activity:
Unclaimed : Slot is idle
Claimed-Busy : Slot is running Condor jobs
We have a production environment (cluster) where there are two physical servers and 3 (three) plone/zope instances running inside each one.
We scheduled a job (with apscheduler) that needs to run in only one instance, but it is being executed by all 6 (six) instances.
To solve this, I think I need to verify that the job is running on server1 and that it is an instance that listens on a specific port.
So, how can I get information about a zope/plone instance programmatically?
I have a job running using Hadoop 0.20 on 32 spot instances. It has been running for 9 hours with no errors. It has processed 3800 tasks during that time, but I have noticed that just two tasks appear to be stuck and have been running alone for a couple of hours (apparently responding because they don't time out). The tasks don't typically take more than 15 minutes. I don't want to lose all the work that's already been done, because it costs me a lot of money. I would really just like to kill those two tasks and have Hadoop either reassign them or just count them as failed. Until they stop, I cannot get the reduce results from the other 3798 maps!
But I can't figure out how to do that. I have considered trying to figure out which instances are running the tasks and then terminate those instances, but
I don't know how to figure out which instances are the culprits
I am afraid it will have unintended effects.
How do I just kill individual map tasks?
Generally, on a Hadoop cluster you can kill a particular task by issuing:
hadoop job -kill-task [attempt_id]
This will kill the given map task and resubmit it on a different node with a new id.
To get the attempt_id, navigate in the JobTracker's web UI to the map task in question, click on it, and note its id (e.g. attempt_201210111830_0012_m_000000_0).
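The attempt id encodes which job and task it belongs to, which helps to double-check that you are about to kill the right attempt. Below is a small sketch that splits an id of the form shown above and builds the argument list for the kill command. The function names are hypothetical, and the pattern only covers the map/reduce id layout seen in the example.

```python
import re

# Layout seen in the example: attempt_<jobtracker id>_<job seq>_<m|r>_<task no>_<attempt no>
ATTEMPT_RE = re.compile(r"^attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)$")


def parse_attempt_id(attempt_id):
    """Split a task attempt id into its job id, task id, task type and attempt number."""
    match = ATTEMPT_RE.match(attempt_id)
    if match is None:
        raise ValueError(f"not a recognised attempt id: {attempt_id!r}")
    tracker_id, job_seq, kind, task_no, attempt_no = match.groups()
    return {
        "job_id": f"job_{tracker_id}_{job_seq}",
        "task_id": f"task_{tracker_id}_{job_seq}_{kind}_{task_no}",
        "task_type": "map" if kind == "m" else "reduce",
        "attempt": int(attempt_no),
    }


def kill_task_command(attempt_id):
    """Return the argv list for killing one attempt; raises on a malformed id."""
    parse_attempt_id(attempt_id)  # validation only
    return ["hadoop", "job", "-kill-task", attempt_id]


if __name__ == "__main__":
    example = "attempt_201210111830_0012_m_000000_0"
    print(parse_attempt_id(example))
    print(" ".join(kill_task_command(example)))
```

The returned list could be passed to subprocess.run on the master node; it is only printed here so that nothing gets killed by accident.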
ssh to the master node as mentioned by Lorand, and execute:
bin/hadoop job -list
bin/hadoop job -kill <JobID>