qsub: not sharing nodes with other users - mpi

Is there a way to request full machines? At my department I have the problem that when running a large job, some processes get allocated to shared machines. Processes on those shared machines are sometimes extremely slow, possibly because of what the other user is doing.
I want to avoid this, so ideally I would be able to request not to share nodes when invoking qsub. Is this possible?
We are using SGE, and different nodes have different numbers of cores, so I can't just use ppn=4.


Web load testing - what does a wavy response time graph indicate?

I am running a load test on an API using JMeter. When I host the API on the same pc as the test (the database is remote though) I get ok results.
However, when I tried running the load test through the same API but hosted on a different pc on the same network, I got this wavy pattern in my test results.
Each of the four grouped lines are response times for a particular API endpoint and the blue line is active thread count.
The question is: does this wavy pattern mean anything? This pattern isn't visible when the API is hosted on the same machine as the test.
The results are very different and I am thinking this pattern might be correlated to the problem.
I used 200 active threads and no specific configuration which would produce the requests in this pattern.
You need to pay attention to the following points:
Connect Time and Latency metrics: Elapsed Time includes both Connect Time and Latency (Latency is the time until the first part of the response arrives and itself includes Connect Time), so these "waves" might be caused by networking issues.
The pattern might indicate that the application under test is doing, e.g., garbage collection, or is using the swap file, which is much slower than memory, due to a lack of resources. Make sure that it has enough headroom to operate in terms of CPU, RAM, network and disk I/O. These metrics can be checked using, e.g., the JMeter PerfMon Plugin. The same applies to JMeter itself: if JMeter is not able to send requests fast enough, you will see drops in throughput.
The most efficient way to get to the bottom of the issue is to run your application under a profiling tool; its telemetry will allow you to identify the heaviest functions, the largest objects in the heap, etc.
Consider checking your database as well and look for slow queries, as the issue might be caused by the database (including its networking layer).
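To check the first point against your own results, you could split each sample's elapsed time into its parts. The following is only a minimal sketch: it assumes the results were saved as CSV with the default JMeter column names (elapsed, label, Latency, Connect), and the SAMPLE data and function names are made up for illustration.

```python
import csv
import io
from statistics import mean


def split_elapsed(row):
    """Split one JMeter sample (times in ms) into connect, server wait and download."""
    elapsed = int(row["elapsed"])
    latency = int(row["Latency"])   # time to first response, includes connect time
    connect = int(row["Connect"])
    return {
        "label": row["label"],
        "connect": connect,
        "server_wait": latency - connect,  # waiting for the first response byte
        "download": elapsed - latency,     # receiving the rest of the response
    }


def summarize(csv_text):
    """Average each component per sampler label."""
    by_label = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        parts = split_elapsed(row)
        by_label.setdefault(parts["label"], []).append(parts)
    return {
        label: {key: mean(p[key] for p in samples)
                for key in ("connect", "server_wait", "download")}
        for label, samples in by_label.items()
    }


# Hypothetical two-sample result file, trimmed to the columns used here.
SAMPLE = """timeStamp,elapsed,label,Latency,Connect
1000,120,GET /users,100,20
2000,480,GET /users,460,300
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

If the connect component rises and falls with the waves while server wait stays flat, the network between the two machines is the more likely suspect; if server wait is what moves, look at the application or the database.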

Is it "okay" to host a small wordpress blog on one AWS EC2 Instance without load balancers/beanstalk?

This is a very simple question for those with the knowledge, but I'm a newbie.
In essence, I just need to know if it would be considered okay to run a small Bitnami WordPress blog (approx. 700 visitors/day) on just one t2.medium EC2 instance (without any auto-scaling or Beanstalk).
Am I at risk of it crashing? What stats should I monitor to be aware of potential dangers? Sorry for the basic nature of these questions, but this is new to me.
tl;dr: It might be "okay", but it's not ideal.
If your question is because of:
Initial setup time - Load-balancing and auto-scaling will be less expensive (more time-efficient) over time.
Cost - Auto-scaling spins down instances that aren't being used to reduce cost.
Minimal setup for a great user experience - The goal of a great AWS setup is to ensure that capacity matches demand
Am I at risk of it crashing?
Possibly, yes. If you average 700 visitors a day, then the risk is a traffic spike where many visitors hit at the same time. It also depends on what your maximum visitor count is, which could vary widely from the average (or not).
What stats should I monitor or be aware of to be aware of potential dangers?
Monitor the usage on high-traffic days (e.g. public holiday sales)
Set up billing alerts
Set up the right metrics:
See John Rotenstein's SO answer:
CPU Utilization is not always the right measure to use -- your application might only be able to handle a limited number of connections, it might be squeezed on RAM and the types of requests might vary too.
You can use normal monitoring tools, or you can write something that pushes metrics to Amazon CloudWatch, so that you go beyond the basic CPU and Network metrics that CloudWatch normally provides. You could even use the Load Balancer's Latency metric to trigger scaling when the application slows down (custom code required).
I'd start with:
Two or more instances - to deal with instance redundancy (an instance going down)
Several t2.small rather than one t2.medium can work out to be more cost-efficient, and more cost efficient than EC in some use cases.
Add auto-scaling - automatically spin up or down instances based on minimum and maximum counts
Load balancing - to re-route users from unhealthy to healthy instances, and also to keep all of the spun-up instances working as evenly as possible (rather than a single instance handling 80% of the workload while the others bludge).
You can always reduce your instances after time with monitoring.
In my opinion, with 700 visitors a day, the safer option would be to run a load-balanced, auto-scaling environment on Elastic Beanstalk with at least 2 instances. The problem with running just one instance is that you are at considerable risk of downtime if traffic increases or the instance goes down, because with only one instance running you have no fallback. You can easily set up CloudWatch monitoring on NetworkIn and NetworkOut to get a sense of the traffic your site is receiving and serving, and set up CPU usage monitoring as well. The trade-off with running a load-balanced environment over a single-instance environment is that the cost might increase significantly as you introduce other components, such as a load balancer. If you do introduce a load balancer, consider reducing the instance size to maybe a t2.small, which could help reduce the cost.
It actually depends. The question is broad, and you have multiple options here.
You can use a single EC2 instance for that number of visitors, or even more if your application allows. You can also consider caching if your app needs it.
You may add the instance to an auto-scaling group, so that if you need more resources you can scale out horizontally.
You can add load balancers later on as well. You just need to add user data to the launch configuration attached to the auto-scaling group, so that when an instance comes up it automatically registers itself with your load balancer.
For monitoring, you can check the request metrics in CloudWatch for the ELB. You have to keep an eye on your CPU and trigger the scale-out policy once it reaches a particular threshold.
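To make the threshold idea concrete, here is a rough sketch of the decision a CPU-based scale-out policy makes. It is plain Python for illustration only, not the AWS API: the function names, the 70% threshold, the three-period window and the capacity limits are all assumptions.

```python
def should_scale_out(cpu_samples, threshold=70.0, periods=3):
    """Return True when the last `periods` CPU samples all exceed `threshold` (percent)."""
    if len(cpu_samples) < periods:
        return False  # not enough data to decide yet
    return all(sample > threshold for sample in cpu_samples[-periods:])


def next_capacity(current, cpu_samples, minimum=1, maximum=4):
    """Add one instance when the alarm condition holds, clamped to [minimum, maximum]."""
    desired = current + 1 if should_scale_out(cpu_samples) else current
    return max(minimum, min(maximum, desired))


if __name__ == "__main__":
    # Hypothetical per-period average CPU readings.
    print(next_capacity(1, [40.0, 75.0, 80.0, 90.0]))
```

Requiring several consecutive periods above the threshold keeps a single short spike from launching an instance; the maximum caps how far the bill can grow.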

Load balance background tasks in Azure Web App

I am developing an ASP.NET application that will be hosted as an Azure web app. Part of the app will continuously record multiple web-based cameras by retrieving a snapshot every N seconds. I would like to design the app so that the processes that record the cameras can be run on multiple instances. I would like it to load balance between all instances, but not duplicate effort for any one camera.
For example, if I have 100 cameras, and am running on 2 instances, I want each instance to get 50 cameras to process. If I have 5 instances, each instance should get 20 cameras to process. As I add cameras or scale instances up/down I would like for the system to load balance the work evenly.
If it's feasible, I would rather not spin up dedicated VMs just for processing cameras, due to increased cost.
I'm somewhat familiar with Akka.NET, Hangfire, and WebJobs, but am unclear if these will help in this scenario. I have used Hangfire and WebJobs to do background processing, but not with this sort of load-balancing requirement. Will these or some other framework or tool help me load balance these background tasks evenly across Azure Web App Instances? How should I go about setting up these or another framework to do this?
I honestly don't think you want to try to "balance" the servers. I think you just want to make sure the work is well distributed. If I were you, I would use a queue system like SQS to queue up all of the cameras that need a snapshot and let each instance worker dequeue one at a time and process it.
A good approach could be to have a master server responsible for queueing up the snapshots, and then have all of your worker servers simply work out of this shared queue. Even if one server happens to process more than the others, that is fine, since they are all working out of the same queue. It just means that this server was able to process its jobs more quickly than the others.
To be honest, there are a lot of ways to approach this. You could do something as simple as having a shared list of your cameras, with a timestamp for the last snapshot, and work off of that. Each server would request a camera: it would look at the list, find one that was stale, update the timestamp and perform the snapshot for that camera. The downside to something like this is that you are going to struggle with non-atomic operations and the possibility of multiple workers making the request at the same time and both working on the same camera. These are the types of things that a queue system will help you with, because as soon as a queue item is in flight, it is no longer available to other workers. And also, because each server is responsible for invalidating its items once it has finished them, if a server were to crash mid-snapshot, this work would simply go back into the queue.
No matter which solution you choose, it is going to boil down to having a central system/list for serving up stale cameras.
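The shared-queue idea can be sketched in a few lines. This is a single-process illustration only: an in-memory queue and threads stand in for a real queue service and separate instances, and the names run_round and worker are hypothetical. It does not show the redelivery of in-flight items after a crash, which a hosted queue would provide.

```python
import queue
import threading


def run_round(camera_ids, worker_count):
    """Queue every camera once and let the workers drain the queue.

    Returns a dict mapping camera id -> list of workers that handled it.
    """
    work = queue.Queue()
    for camera_id in camera_ids:
        work.put(camera_id)

    handled = {}
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # Once an item is taken, no other worker can receive it.
                camera_id = work.get_nowait()
            except queue.Empty:
                return
            # A real worker would fetch and store the snapshot here.
            with lock:
                handled.setdefault(camera_id, []).append(name)

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


if __name__ == "__main__":
    result = run_round(range(100), worker_count=5)
    print(len(result), "cameras handled")
```

No worker is assigned a fixed share of cameras. A faster worker simply takes more items, and adding or removing workers needs no rebalancing step.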
The Azure WebJob SDK uses the Storage Account you set up to balance the work between the various instances that are running your Jobs. You can gain finer control by using a Queue to divide up the work that needs doing and then scale your App Service Plan based on the Queue length.

ASP.NET hosting with unlimited single-node scalability

Since this question is from a user's (developer's) perspective I figured it might fit better here than on Server Fault.
I'd like an ASP.NET hosting that meets the following criteria:
The application seemingly runs on a single server (so no need to worry about e.g. session state or even static variables)
There is an option to scale storage, memory, DB size and CPU-power up and down on demand, in an "unlimited" way
I have researched this, but there does not seem to be a platform that completely abstracts the underlying architecture away and thus combines the ease of use of simple shared hosting with "unlimited" scalability.
"Single server" and "scalability" are mutually exclusive, I'm afraid. But a good load-balancer will apply affinity to requests so you don't need to needlessly double-cache data on multiple servers.
However, well-designed web applications are easy to port to a multiple-server scenario.
I think your best option is something like Windows Azure Websites (separate from Azure Web Workers), which run on a VM you don't have access to. The VM itself provides as much power as is necessary to run your website, so you don't need to worry about allocating extra CPU power or RAM.
Things like SQL Server are handled separately, but are very cheap to run, and you can drag a slider to give yourself more storage space.
This can still be accomplished by using a cloud host like www.gearhost.com. Apps live in the cloud and by default get 1 node worker, so session stickiness is maintained. You can then scale that application to larger workers to accomplish what you need, all while maintaining HA and LB. Going further, you can add multiple web workers. Each visitor is tied to a particular node to maintain session state, even though you might have 10 workers, for example. It's an easy and cheap way to scale a site from 100 visitors to many millions in just a few clicks.
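The affinity (stickiness) mentioned in these answers boils down to routing every request from the same session to the same worker. Below is a minimal sketch of one way to do that, by hashing a session id. It is an illustration under assumptions, not how any particular host implements it: the worker names and the function pick_worker are made up, and real load balancers often use a cookie instead.

```python
import hashlib


def pick_worker(session_id, workers):
    """Map a session id to one worker, always the same one for a fixed worker list."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


if __name__ == "__main__":
    workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical node names
    print(pick_worker("session-123", workers))
```

With plain modulo hashing, changing the number of workers remaps many sessions to a different node, which is one reason in-memory session state and static variables stop being safe once an application runs on more than one worker.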

Erlang node connection timeout

I have a bunch of erlang nodes running on a single machine, and they are all connected in a network. Sometimes the machine our application is on will be under extremely heavy load for several minutes. Often, after things return to normal, my erlang nodes think that they were disconnected, and I have to manually call net_adm:ping on each one of them to get them to re-connect to the network.
Any ideas on how I can avoid this situation?
You can increase the value of the net_ticktime kernel configuration option so that nodes are ticked less frequently and tolerate a longer period of unresponsiveness before a connection is considered down. See also net_kernel:set_net_ticktime. Note, however, that all communicating nodes should have the same net_ticktime value specified.
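To get a feel for which value to pick, the timing can be worked out by hand. The sketch below is plain arithmetic in Python, not Erlang code, and assumes the documented default behaviour: ticks are sent every net_ticktime / 4 seconds, and an unresponsive node is detected somewhere between net_ticktime minus a quarter and net_ticktime plus a quarter. The function name tick_window is made up.

```python
def tick_window(net_ticktime=60):
    """Return (tick interval, earliest detection, latest detection) in seconds."""
    quarter = net_ticktime / 4
    return quarter, net_ticktime - quarter, net_ticktime + quarter


if __name__ == "__main__":
    for value in (60, 300):
        interval, earliest, latest = tick_window(value)
        print(f"net_ticktime={value}: tick every {interval}s, "
              f"node considered down after {earliest}s to {latest}s")
```

With the default of 60 seconds, a node that cannot respond for as little as 45 seconds may be dropped, which fits a machine that is overloaded for several minutes. A value of 300 moves the earliest detection to 225 seconds, at the cost of noticing genuinely dead nodes later.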
