SGE / UGE suspend running jobs - grid

I'm aware that one can suspend running jobs with the qmod -sj [jobid] command, and in principle that works: the jobs go into the suspended (s) state. Fine so far, but:
I expected that if I put all running jobs into the suspended state and then qsub new ones to GE, or already have waiting jobs, those would get to run, which is not the case.
Some searching on this topic led me to http://gridengine.org/pipermail/users/2011-February/000050.html, which does point in the direction that suspended jobs should free up GE to run other ones.

See here:
In a workload manager with "built-in" preemption, like Platform LSF,
it works by temporarily relaxing the slot count limit on a node and
then resolving the oversubscription by bumping the lowest job on the
totem pole to get the number of jobs back under the slot count limit.
In Sun Grid Engine, the same thing happens, except that instead of the
scheduler temporarily relaxing the slot count limits, you as the
administrator configure the host with more slots than you need and a
set of rules that create an artificial lower limit on the job count
that is enforced by bumping the lowest priority jobs.
Slightly different topic, but the same principle seems to hold: to run other jobs while keeping your current ones suspended, temporarily increase the slot counts on the relevant nodes.
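For example, something along these lines (a sketch only; the queue name all.q, the host node01, and the slot numbers are made up and depend entirely on your setup):

# Suspend the jobs you want to park.
qmod -sj <jobid>

# Temporarily raise the slot count on the relevant queue instance so that
# newly submitted / waiting jobs can be scheduled alongside the suspended ones.
qconf -mattr queue slots 32 all.q@node01

# Later: resume the suspended jobs and put the slot count back.
qmod -usj <jobid>
qconf -mattr queue slots 16 all.q@node01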

Related

How does Galera Cluster guarantee consistency?

I'm searching for a highly available SQL solution! One of the articles I read was about "virtually synchronized" replication in Galera Cluster: https://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
He says
When the writeset is actually applied on a given node, any locking
conflicts it detects with open (not-yet-committed) transactions on
that node cause that open transaction to get rolled back.
and
Writesets being applied by replication threads always win
What will happen if the WriteSet conflicts with a committed transaction?
He also says:
Writesets are then “certified” on every node (in order).
How does Galera Cluster keep WriteSets ordered across the cluster? Is there some hidden master node that orders the WriteSets, something like ZooKeeper? Or what?
This is for the second question (about how Galera orders the writesets).
Galera implements Extended Virtual Synchrony (EVS) based on the Totem protocol. The Totem protocol implements a form of token passing, where only the node with the token is allowed to send out new requests (as I understand it). So the writes are ordered since only one node at a time has the token.
For the academic background, you can look at these:
The Totem Single-Ring Ordering and Membership Protocol
The database state machine and group communication issues
(This Answer does not directly tackle your Question, but it may give you confidence that Galera is 'good'.)
In Galera (PXC, etc), there are two general times when a transaction can fail.
On the node where the transaction is being run, the actions are compared to what is currently running on the same node. If there is a conflict, either one of the transactions is stalled (think innodb_lock_wait_timeout) or is deadlocked (and rolled back).
At COMMIT time, info is sent to all the other nodes; they check your transaction against anything on the node or pending (in gcache). If there is a conflict, a message is sent back saying that there would be trouble. So, the originating node has the COMMIT fail. For this reason, you must check for errors even on the COMMIT statement.
As with single-node systems, a deadlock is usually resolved by replaying the entire transaction.
In the case of autocommit, there is a small, configurable number of retries, after which the statement will fail. So, again, check for errors. However, since the retries have already been attempted, you may want to abort the program.
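A minimal sketch of that error handling from the shell, using the mysql command-line client (the mydb database and accounts table are made up for illustration): if the COMMIT is rejected because of a certification conflict, the client exits non-zero and the script replays the whole transaction.

# Replay the entire transaction a few times if COMMIT is rejected.
for attempt in 1 2 3; do
    if mysql mydb -e "START TRANSACTION;
                      UPDATE accounts SET balance = balance - 10 WHERE id = 1;
                      COMMIT;"; then
        break    # COMMIT was accepted on this node
    fi
    echo "COMMIT failed on attempt $attempt, replaying transaction" >&2
done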
Currently (in my opinion) Galera, with at least 3 nodes in at least 3 different physical locations, is the best available HA solution for MySQL. It can effectively survive any single-point-of-failure. (Group Replication / InnoDB Cluster, from Oracle, is coming soon, and is very promising.)
One thing to note is that the "critical read" problem has a solution in Galera, but you have to take action. See wsrep_sync_wait. (As of this writing, InnoDB Cluster has no solution.)
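A small sketch of that "action" for a critical read (same made-up table as above): setting wsrep_sync_wait in the session makes the node apply all earlier writesets locally before it runs the SELECT.

# Wait for earlier writesets to be applied on this node, then read.
mysql mydb -e "SET SESSION wsrep_sync_wait = 1;
               SELECT balance FROM accounts WHERE id = 1;"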
See http://mysql.rjweb.org/doc.php/galera for tips (some of which are included above) on coding differences when moving to PXC/Galera.

Synchronous vs Asynchronous Clustering

I was reading the MariaDB knowledge base on Galera Cluster and I came across this:
The basic difference between synchronous and asynchronous replication is that "synchronous" guarantees that if changes happened on one node of the cluster, they happened on other nodes "synchronously", or at the same time. "Asynchronous" gives no guarantees about the delay between applying changes on "master" node and the propagation of changes to "slave" nodes. The delay can be short or long. This also implies that if master node crashes, some of the latest changes may be lost
Regarding the last sentence: I have always understood that even though the updates on the slave in an asynchronous cluster setup are not performed at the same time, the master logs these updates to a binlog file as they are made. So in the case that the master crashes before all the data has been passed on to the slave, the updates will still go through once the master is restored, since the binlog recorded them. Can somebody please tell me whether my understanding is wrong and clarify the matter for me? Thanks.
In your example of a normal replication pair, the slave would catch up after the master comes back. Assuming the master does come back, you wouldn't really lose the data but if the master is permanently dead, the data is lost. The knowledge base article you mention is talking about the replication delay and not the overall integrity of the replication stream.
With normal replication, if the slave io thread (the part that gets the replication events from the master) is able to keep up with the master, then the slave may only lose a couple seconds if the master crashes. However, if it cannot keep up and is for example 1 hour behind, the slave would lose access to 1 hour of data. Another way you could lose access to data on the slave is if you have a max relay log size set and that is reached.
Galera makes sure that the write is sent to every node in the cluster before it is actually committed on any of the nodes, so once the node the write was done on commits it, all of the other nodes will commit the same write. With Galera, all writes basically happen at the same time on every node. Losing any node at any time during normal operation will not cause any data loss.

How to architect a multi-step process using a message queue?

Say I have a multi-step, asynchronous process with these restrictions:
Individual steps can be performed by any worker
Steps must be performed in-order
The approach I'm considering:
Insert a db row that represents the entire process, with a "Steps completed" column to keep track of the progress.
Subscribe to a queue that will receive a message when the entire process is done.
Upon completion of each step, update the db row and queue the next step in the process.
After the last step is completed, queue the "process is complete" message.
Delete the db row.
Thoughts? Pitfalls? Smarter ways to do it?
I've built a system very similar to what you've described in a large, task-intensive document processing system, and have had to live with both the pros and the cons for the last 7 years now. Your approach is solid and workable, but I see some drawbacks:
Potentially vulnerable to state change (i.e., if the process inputs change before all steps are queued, the later steps could end up with inputs inconsistent with the earlier steps)
More infrastructure than you'd like, involving both a DB and a queue = more points of failure, harder to set up, more documentation required = doesn't quite feel right
How do you keep multiple workers from acting on the same step concurrently? In other words, the DB row says 4 steps are completed, how does a worker process know if it can take #5 or not? Doesn't it need to know whether another process is already working on this? One way or another (DB or MQ) you need to include additional state for locking.
Your example is robust to failure, but doesn't address concurrency. When you add state to address concurrency, then failure handling becomes a serious problem. For example, a process takes step 5, and then puts the DB row into "Working" state. Then when that process fails, step 5 is stuck in "Working" state.
Your orchestrator is a bit heavy, as it is doing a lot of synchronous DB operations, and I would worry that it might not scale as well as the rest of the architecture, as there can be only one of those...this would depend on how long-running your steps were compared to a database transaction--this would probably only become an issue at very massive scale.
If I had it to do over again, I would definitely push even more of the orchestration onto the worker processes. So, the orchestration code is common and could be called by any worker process, but I would keep the central, controlling process as light as possible. I would also use only message queues and not any database to keep the architecture simple and less synchronous.
I would create an exchange with 2 queues: IN and WIP (work in progress)
The central process is responsible for subscribing to process requests, and checking the WIP queue for timed out steps.
1) When the central process received a request for a given processing (X), it invokes the orchestration code, and it loads the first task (X1) into the IN queue
2) The first available worker process (P1) transactionally dequeues X1, and enqueues it into the WIP queue, with a conservative time-to-live (TTL) timeout value. This dequeueing is atomic, and there are no other X tasks in IN, so no second process can work on an X task.
3) If P1 terminates suddenly, no architecture on earth can save this process except for a timeout. At the end of the timeout period, the central process will find the timed out X1 in WIP, and will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications.
4) If P1 terminates abnormally but gracefully, then the worker process will transactionally dequeue X1 from WIP and enqueue it back into IN, providing the appropriate notifications. Depending on the exception, the worker process could also choose to reset the TTL and retry the step.
5) If P1 hangs indefinitely, or exceeds its TTL, same result as #3. The central process handles it, and presumably the worker process will at some point be recycled--or the rule could be to recycle the worker process anytime there's a timeout.
6) If P1 succeeds, then the worker process will determine the next step, either X2 or X-done. If the next step is X2, then the worker process will transactionally dequeue X1 from WIP and enqueue X2 into IN. If the next step is X-done, then the processing is complete, and the appropriate action can be taken; perhaps this would be enqueueing X-done into IN for subsequent processing by the orchestrator.
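A rough sketch of that message flow with RabbitMQ's rabbitmqadmin tool, just to make the queue choreography concrete (the JSON payloads are made up, a real worker would do the IN-to-WIP handoff through a client library with proper acknowledgements, and the exact get/ack flags differ a little between rabbitmqadmin versions):

# One-time setup: the two queues.
rabbitmqadmin declare queue name=IN durable=true
rabbitmqadmin declare queue name=WIP durable=true

# Step 1: the central process loads the first task for request X into IN.
rabbitmqadmin publish exchange=amq.default routing_key=IN payload='{"request":"X","step":1}'

# Step 2: a worker pulls the task from IN and parks a copy in WIP with a deadline;
# in a real worker this pull/park pair is done with acknowledgements so it
# behaves like one atomic handoff.
rabbitmqadmin get queue=IN ackmode=ack_requeue_false
rabbitmqadmin publish exchange=amq.default routing_key=WIP payload='{"request":"X","step":1,"deadline":"<now + TTL>"}'

# Step 6: on success the worker removes its WIP entry and enqueues the next step
# ({"request":"X","step":2}) into IN, or X-done once the last step has finished.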
The benefits of my suggested approach are:
Contention between worker processes is eliminated (no two workers can hold the same task at the same time)
All possible failure scenarios (crash, exception, hang, and success) are handled
Simple architecture can be completely implemented with RabbitMQ and no database, which makes it more scalable
Since workers handle determining and enqueueing the next step, there is a more lightweight orchestrator, leading to a more scalable system
The only real drawback is that it is potentially vulnerable to state change, but often this is not a cause for concern. Only you can know whether this would be an issue in your system.
My final thought on this is: you should have a good reason for this orchestration. After all, if process P1 finishes task X1 and now it is time for some process to work on next task X2, it seems P1 would be a very good candidate, as it just finished X1 and is now available. By that logic, a process should just gun through all the steps until completion--why mix and match processes if the tasks need to be done serially? The only async boundary really would be between the client and the worker process. But I will assume that you have a good reason to do this, for example, the processes can run on different and/or resource-specialized machines.

How do I kill running map tasks on Amazon EMR?

I have a job running using Hadoop 0.20 on 32 spot instances. It has been running for 9 hours with no errors. It has processed 3800 tasks during that time, but I have noticed that just two tasks appear to be stuck and have been running alone for a couple of hours (apparently responding because they don't time out). The tasks don't typically take more than 15 minutes. I don't want to lose all the work that's already been done, because it costs me a lot of money. I would really just like to kill those two tasks and have Hadoop either reassign them or just count them as failed. Until they stop, I cannot get the reduce results from the other 3798 maps!
But I can't figure out how to do that. I have considered trying to figure out which instances are running the tasks and then terminate those instances, but
I don't know how to figure out which instances are the culprits
I am afraid it will have unintended effects.
How do I just kill individual map tasks?
Generally, on a Hadoop cluster you can kill a particular task by issuing:
hadoop job -kill-task [attempt_id]
This will kill the given map task and resubmit it on a different node with a new attempt ID.
To get the attempt_id, navigate in the JobTracker's web UI to the map task in question, click on it, and note its ID (e.g. attempt_201210111830_0012_m_000000_0).
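If you'd rather stay on the command line than dig through the JobTracker UI, many Hadoop 0.20-era builds can also list the attempt IDs directly (check hadoop job -help on your cluster to confirm these options exist in your version):

# List the attempt IDs of the map tasks still running for the job.
hadoop job -list-attempt-ids <JobID> map running

# Kill the stuck attempt (it is rescheduled and does not count as a failure),
# or use -fail-task instead if you want it counted as failed.
hadoop job -kill-task <attempt_id>
hadoop job -fail-task <attempt_id>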
ssh to the master node as mentioned by Lorand, and execute:
bin/hadoop job -list
bin/hadoop job -kill <JobID>

Difference between process group id and job id in UNIX

Please tell me the difference between a process group ID and a job ID. Is the job ID a notion built into the shell, or is it related to the kernel? What are the uses of each of them? When a process is run in the background, is only the job ID set, or is the pgid set as well?
What are the uses of the setpgid() function?
When a process is run in the background, is the kernel also involved, or does the shell alone keep track of which processes are in the background and which are in the foreground?
Good questions. The job id is mostly just a shell construct. There is support in the kernel in the form of the signals that are involved in job control, and the way in which the kernel knows exactly which processes to send the job control signals to.
Strictly speaking, the answer to your first question is that the job id is purely a shell creation. It exists because a pipeline (or, rarely, another shell grouped construct) may consist of multiple processes that should be controlled as a unit.
To answer your last question, the shell starts all processes by first doing a fork(2) and then doing an execve(2). The only difference with & is that the shell does not do a wait(2) (or a related variant) and so the program can continue "in the background". There is actually little distinction in Unix between foreground and background.
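You can see this from the shell itself; a tiny illustration:

# The shell forks and execs 'sleep', but because of '&' it skips the wait(2),
# so you get your prompt back immediately.
sleep 30 &
echo "background pid: $!"

# Asking the shell to wait now is exactly what it would have done by itself
# for a foreground command.
wait $!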
The process group is an association defined by shells so that the kernel knows which single process group is in the "foreground" on a terminal and which ones are in the "background". This is mainly important so that a background process will generate a signal (SIGTTIN) should it decide to suddenly read from the terminal (that terminal typically being connected to standard input). This causes the "job" to stop, and the shell will then prompt the user to do something.
Try (sleep 5; read x)& and after 6 seconds type a return or something so that the shell wakes up. That's when you see something like...
[1]+ Stopped ( sleep 5; read x )
...and you then type fg to pull it into the foreground.
Originally, Unix had pipelines, and it had &, but there was no way to move a command or pipeline between foreground and background and no way to help a background process that suddenly decided to read standard input.
Job control and the kernel support for it were added by Bill Joy and others in early versions of BSD and csh(1). These were picked up line-for-line by commercial Unix and cloned for the work-alike Linux kernel.
Regarding the questions about process groups and ps(1)...
In order to support job control in shells, the kernel process state includes a process group ID and a session ID. A process group and a job are the same thing, but a job number is just a handle the shell makes up. A process is a session leader if the session ID is the same as the pid, and a process is a process group leader if the pgid is the same as the pid. The + that ps(1) prints is slightly more subtle: each terminal knows what its foreground process group is, and a process is marked with + if its pgid is the foreground process group of its controlling terminal.
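You can look at these fields directly; with a Linux (procps) ps, something like:

# PGID and SID are the process group and session the kernel tracks for each
# process; TPGID is the terminal's current foreground process group, and STAT
# shows a '+' for processes that are in that foreground group.
ps -o pid,ppid,pgid,sid,tpgid,stat,comm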
In summary, the kernel keeps several items of state: pid, pgid, sid, and a process may have a controlling terminal and a terminal may have a foreground pgid. These credentials are mostly intended to support job control but are also used to revoke access to a terminal when a user logs out.
