A compute node is running on a physical server with 40 CPUs. Although cpu_allocation_ratio is set to 4.0 and scheduler_default_filters is set to "RetryFilter,AvailabilityZoneFilter,RamFilter,ComputeFilter,ImagePropertiesFilter,JsonFilter,CoreFilter", when we check the vCPU count via nova hypervisor-stats it only lists 40 vCPUs.
Shouldn't it be 160 vCPUs?
Oversubscription ratios are built into the scheduler logic to figure out how many resources are available, but this data doesn't make it into Horizon or other areas. If you have 20 physical CPUs with hyperthreading you'll end up with 40 vCPUs, which is what Nova is aware of. When you set the allocation ratio to 4.0, you still have 40 vCPUs, but you are allowing Nova to oversubscribe them by 4x.
It would be helpful to see the total of available vCPUs based on the oversubscription, but that number would not be accurate. Instead we end up with a negative free-resource count, which shows how many vCPUs have been used beyond the total, which is 40 in this case. When we hit 41 we have used all 40 plus 1, which gives us -1 available vCPUs.
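As a rough sketch of the arithmetic (the option name is the one from the question; the file path is the usual default and is an assumption here):

# /etc/nova/nova.conf on the compute node (assumed default path)
cpu_allocation_ratio = 4.0

# nova hypervisor-stats keeps reporting the physical total:
#   vcpus        = 40
#   schedulable  = 40 x 4.0 = 160  (used only inside the CoreFilter)
#   available    = 40 - vcpus_used (goes negative once usage passes 40,
#                  e.g. 41 used -> -1 available)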
I am trying to decide which process launcher, mpirun or srun, is better at optimizing resource usage. Let's say one compute node in a cluster has 16 cores in total and I have a job I want to run using 10 processes.
If I launch it with mpirun -n10, will it detect that my request needs fewer cores than are available in each node and automatically assign all 10 processes to cores on a single node? Unlike srun, which has -N <number> to specify the number of nodes, mpirun doesn't seem to have such a flag. I am thinking that running all processes on one node can reduce communication time.
In the example above, let's further assume that each node has 2 CPUs and the cores are distributed equally, so 8 cores/CPU, and the specification says there is 48 GB of memory per node (or 24 GB/CPU, or 3 GB/core). Suppose that each spawned process in my job requires 2.5 GB, so all processes together use 25 GB. When does one say that a program exceeds the memory limit? Is it when the total required memory:
exceeds the per-node memory (hence my program is good: 25 GB < 48 GB), or
exceeds the per-CPU memory (hence my program is bad: 25 GB > 24 GB), or
the memory per process exceeds the per-core memory (hence my program is good: 2.5 GB < 3 GB)?
mpirun has no information about the cluster's resources. It will not request the resources itself; you must first request an allocation, typically with sbatch or salloc, and then Slurm will set up the environment so that mpirun knows on which node(s) to start processes. So you should have a look at the sbatch and salloc options to create a request that matches your needs. By default, Slurm will try to 'pack' jobs onto the minimum number of nodes.
srun can also work within an allocation created by sbatch or salloc, but it can also make the request by itself.
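As an example, a minimal sketch of a submission script for the scenario above (10 processes on a single 16-core node, ~2.5 GB per process; the script and program names are hypothetical):

#!/bin/bash
#SBATCH --nodes=1            # keep all tasks on one node
#SBATCH --ntasks=10          # 10 MPI processes
#SBATCH --mem-per-cpu=2500M  # ~2.5 GB per allocated CPU/task

mpirun -n 10 ./my_mpi_program   # or: srun ./my_mpi_program

With --mem-per-cpu the limit Slurm enforces is per allocated CPU (so the 3 GB/core figure is the relevant one for this request), whereas --mem sets a per-node limit instead.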
I'm executing a Spark job on EMR. The job is currently bottlenecked by the network (reading data from S3). Looking at metrics in Ganglia, I get a straight line at around 600 MB/s. I'm using the i2.8xlarge instance type, which is supposed to give 10 Gbps, i.e. ~1280 MB/s. I have verified that enhanced networking is turned on and the VirtualizationType is hvm. Am I missing something? Is there any other way to increase network throughput?
Networking capacity of Amazon EC2 instances is based upon Instance Type. The larger the instance, the more networking capacity is available. You are using the largest instance type within the i2 family, so that is good.
Enhanced Networking lowers network latency and jitter and is available on a limited number of instance types. You are using it, so that is good.
The i2.8xlarge is listed as having 10 Gbps of network throughput, but this is limited to traffic within the same Placement Group. My testing shows that EMR instances are not launched within a Placement Group, so they might not receive the full network throughput possible.
You could experiment by using more smaller instances rather than fewer large instances. For example, 2 x i2.4xlarge cost the same as 1 x i2.8xlarge.
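If you want to test whether a Placement Group makes a difference on a self-managed cluster, a hedged sketch with the AWS CLI (the group name and AMI ID are placeholders):

# create a cluster placement group and launch two instances into it
aws ec2 create-placement-group --group-name spark-pg --strategy cluster
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type i2.8xlarge \
    --count 2 --placement GroupName=spark-pg

EMR launches and manages its own instances, so this experiment only applies to EC2 instances you launch yourself.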
The S3->EC2 bandwidth is actually rate limited at 5 Gbps (625 MB/s) on the largest instance types of each EC2 family, even on instances with Enhanced Networking and a 20 Gbps network interface. This has been confirmed by the S3 team, and it matches what I observed in my experiments. Smaller instances get rate limited at lower rates.
S3's time-to-first-byte is about 80-100 ms, and after the first byte it is able to deliver data to a single thread at 85 MB/s, in theory. However, we've only observed about 60 MB/s per thread on average (IIRC). S3 confirmed that this is expected, and slightly higher than what their customers observe. We used an HTTP client that kept connections alive to the S3 endpoint. The main reason small objects yield low throughput is the high time-to-first-byte.
The following is the max bandwidth we've observed (in MB/s) when downloading from S3 using various EC2 instances:
Instance MB/s
C3.2XL 114
C3.4XL 245
C3.8XL 600
C4.L 67
C4.XL 101
C4.2XL 266
C4.4XL 580
C4.8XL 600
I2.8XL 600
M3.XL 117
M3.2XL 117
M4.XL 95
M4.10XL 585
X1.32XL 612
We ran the above test with 32 MB objects and a thread count between 10 and 16.
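For reference, a rough sketch of how such a measurement can be reproduced (bucket and object names are hypothetical, assuming 16 objects of ~32 MB named part-0000 through part-0015):

# stream the objects in parallel to /dev/null and time the aggregate
time seq -f "part-%04g" 0 15 | \
    xargs -P 16 -I{} sh -c 'aws s3 cp s3://my-bucket/{} - > /dev/null'

# aggregate MB/s ~= (16 x 32 MB) / elapsed seconds, bounded by the
# per-thread (~60 MB/s) and per-instance (~600 MB/s) limits above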
Also,
The network performance quoted in the EC2 instance type matrix is benchmarked as described here; it is the network bandwidth between Amazon EC2 Linux instances in the same VPC. What we are observing between S3 and EC2 instances is not what is promised there.
EC2 Instance Network performance appears to be categorized as:
Low
Moderate
High
10 Gigabit
20 Gigabit
Determining Network bandwidth on an instance designated Low, Moderate, or High appears to be done on a case-by-case basis.
C3, C4, R3, I2, M4 and D2 instances use the Intel® 82599 Virtual Function interface and provide Enhanced Networking with 10 Gigabit interfaces in the largest instance size.
10 and 20 Gigabit interfaces are only able to achieve that speed when communicating within a common Placement Group, typically in support of HPC. Network traffic outside a placement group has a maximum of 5 Gbps.
Summary: the quoted network bandwidth is between two instances, not between S3 and EC2. Even between two instances, only when they are in the same placement group (typically in support of HPC) can we achieve something around 10/20 Gigabit.
I have deployed 2 identical compute nodes in an OpenStack environment (Mitaka).
Each compute node has 2 physical CPUs with 12 cores each.
I would like to create a single VM which has as many processors as possible.
I don't want to oversubscribe pCPUs to vCPUs, i.e. I want to keep the physical-to-virtual ratio at 1:1.
However, it seems I am only allowed to create a maximum of 24 vCPUs in a single VM, even though I have 48 vCPUs in my resource pool (summed over the 2 compute nodes, each contributing 24 vCPUs).
Does anyone have an idea how to create more vCPUs in my case?
You cannot create an instance that spans multiple compute nodes with OpenStack ... or with any open-source virtualization platform that I am aware of.
The proprietary vSMP product (vendor ScaleMP) can do this and there may be other products.
The other approach that you could take is to build a cluster consisting of multiple instances, and use a batch scheduler and / or some kind of message passing framework to perform computations spanning the cluster.
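For example, a minimal sketch of the message-passing route with Open MPI (host IPs are placeholders, and it assumes passwordless SSH between the two 24-vCPU instances):

# hostfile listing both instances
cat > hosts <<EOF
10.0.0.11 slots=24
10.0.0.12 slots=24
EOF

# run 48 ranks spread across the two VMs
mpirun -np 48 --hostfile hosts ./my_parallel_app

A batch scheduler such as Slurm can manage the same kind of allocation across many instances.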
SETUP 1
A 3-node Cassandra cluster. Each node is on a different machine with 4 cores, 32 GB RAM, an 800 GB SSD (disk), and 1 Gbit/s = 125 MB/s network bandwidth.
2 cassandra-stress client machines with the exact same configuration as above.
Experiment 1: Ran one client on one machine, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 8 MB/s, with CPU usage of 85-90 percent on both the Cassandra node and the client.
Experiment 2: Ran two clients on two different machines, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 12 MB/s, with CPU usage of 90 percent on both the Cassandra node and both clients.
I did not see double the throughput even though my clients were running on two different machines, but I can understand that the Cassandra node is CPU bound and that's probably why. That led me to setup 2.
SETUP 2
A 3-node Cassandra cluster. Each node is on a different machine with 8 cores, 32 GB RAM, an 800 GB SSD (disk), and 1 Gbit/s = 125 MB/s network bandwidth.
2 cassandra-stress client machines with 4 cores, 32 GB RAM, an 800 GB SSD (disk), and 1 Gbit/s = 125 MB/s network bandwidth.
Experiment 3: Ran one client on one machine, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 18 MB/s, with CPU usage of 65-70 percent on the Cassandra node and >90% on the client node.
Experiment 4: Ran two clients on two different machines, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 22 MB/s, with CPU usage of <=75 percent on the Cassandra node and >90% on both client nodes.
So the question here is: with one client node I was able to push 18 MB/s (network throughput), yet with two client nodes running on two different machines I was only able to push a peak of 22 MB/s (network throughput)? I wonder why this is the case, even though this time the CPU usage on the Cassandra node is around 65-70 percent on an 8-core machine.
Note: I stopped Cassandra and ran a tool called iperf3 on two different EC2 machines, and I was able to see a network bandwidth of 118 MB/s. I am converting everything into bytes rather than bits to avoid any sort of confusion.
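For context, the kinds of commands involved look roughly like this (host addresses are placeholders; the exact cassandra-stress invocation used is not given in the question):

# raw network bandwidth between two machines
iperf3 -s                 # on one machine
iperf3 -c 10.0.0.5        # on the other

# load generation similar to the experiments above
cassandra-stress write n=1000000 cl=quorum -rate threads=1000 -node 10.0.0.1,10.0.0.2,10.0.0.3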
I have a topology running on AWS. I use m3.xlarge machines with 15 GB RAM and 8 supervisors. My topology is simple; I read from
kafka spout -> [db o/p1] -> [db o/p2] -> [dynamo fetch] -> [dynamo write & kafka write] -> kafka
The db o/p bolts are conditional, with latency around 100-150 ms.
But I have never been able to achieve a throughput of more than 300 msgs/sec.
What configuration changes are to be made so that I can get a throughput of more than 3k msgs/sec?
The dynamo fetch bolt execute latency is around 150-220 ms,
and the dynamo read bolt execute latency is also around this number.
four bolts with parallelism 90 each and one spout with parallelism 30 (30 kafka partitions)
overall latency is greater than 4 secs.
topology.message.timeout.secs: 600
worker.childopts: "-Xmx5120m"
no. of worker ports per machine : 2
no of workers : 6
no of threads : 414
executor send buffer size 16384
executor receive buffer size 16384
transfer buffer size: 34
no of ackers: 24
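For reference, a hedged mapping of those settings onto the standard Storm configuration keys (a sketch; the two worker port numbers are an assumption, the other values are copied from above):

topology.message.timeout.secs: 600
worker.childopts: "-Xmx5120m"
supervisor.slots.ports: [6700, 6701]          # 2 worker ports per machine
topology.workers: 6
topology.executor.send.buffer.size: 16384
topology.executor.receive.buffer.size: 16384
topology.transfer.buffer.size: 34
topology.acker.executors: 24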
Looking at the console snapshot I see...
1) The overall latency for the Spout is much greater than the sum of the execute latencies of the bolts, which implies that there's a backlog on one of the streams, and
2) The capacity for SEBolt is much higher than that of the other bolts, implying that Storm feels the need to run that bolt more than the others
So I think your bottleneck is the SEBolt. Look into increasing the parallelism hint on that one. If the total number of tasks is getting too high, reduce the parallelism hint for the other bolts to offset the increase for SEBolt.
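A hedged sketch of how that could be applied at runtime via the Storm CLI (the topology name is a placeholder; SEBolt and the 6 workers come from the discussion above):

# double the executors for the suspected bottleneck bolt, keeping 6 workers
storm rebalance my-topology -n 6 -e SEBolt=180

Rebalancing can only raise an executor count up to the bolt's task count, so if that is exceeded the topology has to be resubmitted with a higher parallelism hint.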