I'm executing a Spark job on EMR. The job is currently bottlenecked by network I/O (reading data from S3). Looking at the metrics in Ganglia, I see a flat line at around 600 MB/s. I'm using the i2.8xlarge instance type, which is supposed to give 10 Gbps, i.e. ~1280 MB/s. I have verified that enhanced networking is turned on and the VirtualizationType is hvm. Am I missing something? Is there any other way to increase network throughput?
Networking capacity of Amazon EC2 instances is based upon Instance Type. The larger the instance, the more networking capacity is available. You are using the largest instance type within the i2 family, so that is good.
Enhanced Networking lowers network latency and jitter and is available on a limited number of instance types. You are using it, so that is good.
The i2.8xl is listed as having 10Gbps of network throughput, but this is limited to traffic within the same Placement Group. My testing shows that EMR instances are not launched within a Placement Group, so they might not receive the full network throughput possible.
You could experiment by using more smaller instances rather than fewer large instances. For example, 2 x i2.4xlarge cost the same as 1 x i2.8xlarge.
The S3->EC2 bandwidth is actually rate limited at 5Gbps (625MB/s) on the largest instance types of each EC2 family, even on instances with Enhanced Networking and a 20Gbps network interface. This has been confirmed by the S3 team, and it matches what I observed in my experiments. Smaller instances get rate limited at lower rates.
S3's time-to-first-byte is about 80-100ms, and after the first byte it is able to deliver data to a single thread at 85MB/s, in theory. However, we've only observed about 60MB/s per thread on average (IIRC). S3 confirmed that this is expected, and slightly higher than what their customers observe. We used an HTTP client that kept connections alive to the S3 endpoint. The main reason small objects yield low throughput is the high time-to-first-byte.
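As a rough sanity check, the per-thread numbers above can be plugged into a simple model: each GET pays the time-to-first-byte once, then streams the object. The 90 ms TTFB, 85 MB/s stream rate, and 600 MB/s instance cap are taken from this answer; the model itself is a simplifying assumption, not how S3 is documented to behave:

```python
# Rough model of S3 download throughput: each GET pays the time-to-first-byte
# once, then streams the object at the per-thread rate. Numbers are from the
# measurements above; the model is a simplification.

def per_thread_mbps(object_mb, ttfb_s=0.09, stream_mbps=85.0):
    """Effective MB/s for one thread fetching objects of object_mb MB."""
    seconds_per_object = ttfb_s + object_mb / stream_mbps
    return object_mb / seconds_per_object

def aggregate_mbps(threads, object_mb, instance_cap_mbps=600.0):
    """Aggregate MB/s for N threads, capped by the instance's S3 rate limit."""
    return min(threads * per_thread_mbps(object_mb), instance_cap_mbps)

print(per_thread_mbps(32))     # ~68.6 MB/s for 32 MB objects
print(per_thread_mbps(1))      # ~9.8 MB/s: small objects are dominated by TTFB
print(aggregate_mbps(10, 32))  # 10 threads already saturate the ~600 MB/s cap
```

This also explains why small objects yield such low throughput: a 1 MB object spends most of its wall-clock time waiting for the first byte.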
The following is the max bandwidth we've observed (in MB/s) when downloading from S3 using various EC2 instances:
Instance   MB/s
C3.2XL      114
C3.4XL      245
C3.8XL      600
C4.L         67
C4.XL       101
C4.2XL      266
C4.4XL      580
C4.8XL      600
I2.8XL      600
M3.XL       117
M3.2XL      117
M4.XL        95
M4.10XL     585
X1.32XL     612
We ran the above test with 32 MB objects and a thread count between 10 and 16.
Also, the network performance quoted in the EC2 instance matrix is benchmarked as described here: it is the network bandwidth between Amazon EC2 Linux instances in the same VPC. The bandwidth we observe between S3 and EC2 instances is not what that figure promises.
EC2 Instance Network performance appears to be categorized as:
Low
Moderate
High
10 Gigabit
20 Gigabit
Determining Network bandwidth on an instance designated Low, Moderate, or High appears to be done on a case-by-case basis.
C3, C4, R3, I2, M4 and D2 instances use the Intel® 82599 Virtual Function interface and provide Enhanced Networking with 10 Gigabit interfaces in the largest instance size.
10 and 20 Gigabit interfaces are only able to achieve that speed when communicating within a common Placement Group, typically in support of HPC. Network traffic outside a placement group has a max limit of 5 Gbps.
Summary: the network bandwidth quoted is between two EC2 instances, not between S3 and EC2. And even between two instances, we can only achieve something around 10/20 Gigabit when they are in the same placement group (typically for HPC workloads).
Related
In my company we are running a Kafka cluster across 3 AZs in a single AWS region. There are multiple topics with many partitions and their replicas. We know Amazon does not provide official stats for inter-AZ latency, but when we test it, the latency is extremely low, sub-millisecond (under 1 ms). Our producers and consumers are also in the same AWS region, within those same AZs.
To save on operational costs, I am currently investigating the impact of moving this entire Kafka cluster, along with its producers and consumers, back on-prem, creating a similar cluster across 3 DCs.
I am asking this question because the inter-DC latency in the company is roughly 10 ms one-way, end to end. So what would be the impact if this latency increases tenfold, from ~1 ms to ~10 ms?
I am asking because, apart from the producers and consumers, the brokers also communicate with each other across DCs (e.g. replicas read messages from leaders, the controller informs other brokers about changes), and some changes are written as metadata to ZooKeeper. What are the pitfalls that I should watch out for?
Sorry for such an open question, but I'd like to know if anyone has experience and wants to share issues, pitfalls, etc. How are the Kafka brokers and ZooKeeper impacted if a cluster spans DCs where the latency is higher than AWS inter-AZ latency? Is it even practical?
I have a problem that is memory bandwidth limited -- I need to read a lot (many GB) of data sequentially from RAM, do some quick processing and write it sequentially to a different location in RAM. Memory latency is not a concern.
Is there any benefit from dividing the work between two or more cores in different NUMA zones? Equivalently, does working across zones reduce the available bandwidth?
For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on the details of the remote accesses.
In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)
The STREAM Benchmark, for example, is typically run in a configuration that allows almost all accesses from every thread to be "local". This is implemented by assuming "first touch" NUMA placement -- when allocated memory is first written, the OS has to create mappings from the process virtual address space to physical addresses, and (by default) the OS chooses physical addresses that are in the same NUMA node as the core that executed the store instruction.
"Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips. The overall ratio of local to remote bandwidth also varies significantly across processor generations. For some processors (e.g., Xeon E5 v3 and probably v4), the interconnect is relatively fast, so jobs with poor locality can often be run with all of the memory interleaved between the two sockets.
Local bandwidths have increased significantly since then, with more recent processors generally strongly favoring local access.
Example from the Intel Xeon Platinum 8160 (2 UPI links between chips):
Local Bandwidth for Reads (each socket) ~112 GB/s
Remote Bandwidth for Reads (one-direction at a time) ~34 GB/s
Local bandwidth scales perfectly in two-socket systems, and remote bandwidth also scales very well when using both sockets (each socket reading data from the other socket).
It gets more complicated with combined read and write traffic between sockets, because the read traffic from node 0 to node 1 competes with the write traffic from node 1 to node 0, etc.
Local Bandwidth for 1R:1W (each socket) ~101 GB/s (reduced due to read/write scheduling overhead)
Remote Bandwidth for 1R:1W (one socket running at a time) ~50 GB/s -- more bandwidth is available because both directions are being used, but this also means that if both sockets are doing the same thing, there will be conflicts. I see less than 60 GB/s aggregate when both sockets are running 1R:1W remote at the same time.
Of course different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if the threads are doing local accesses at the same time, then remote accesses at the same time, there will be more contention in the remote access portion (compared to a case in which the threads are doing their remote accesses at different times).
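One way to reason about mixed local/remote traffic is a simple time-weighted (harmonic) model: each byte costs 1/BW of the link it uses, so the effective bandwidth follows from the local fraction. The 112 GB/s and 34 GB/s defaults are the Xeon Platinum 8160 read figures quoted above; the model ignores the contention effects just described, so treat it as an optimistic sketch:

```python
# Harmonic mix of local and remote bandwidth for a single socket.
# Ignores read/write contention on the interconnect, so it is optimistic.

def effective_bw(local_fraction, local_gbs=112.0, remote_gbs=34.0):
    """Effective GB/s when local_fraction of bytes hit local DRAM."""
    time_per_gb = local_fraction / local_gbs + (1 - local_fraction) / remote_gbs
    return 1.0 / time_per_gb

print(effective_bw(1.0))  # 112 GB/s, all local
print(effective_bw(0.5))  # ~52 GB/s: the slow remote link dominates the time
print(effective_bw(0.0))  # 34 GB/s, all remote
```

Note how quickly the remote link dominates: even a 50/50 mix runs at less than half the local rate, which is why first-touch placement matters so much.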
SETUP 1
3-node Cassandra cluster. Each node is on a different machine with 4 cores, 32 GB RAM, 800 GB SSD (disk), and 1 Gbit/s = 125 MB/s network bandwidth.
2 cassandra-stress client machines with same exact configuration as above.
Experiment 1: Ran one client on one machine, creating anywhere from 1 to 1000 threads, with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 8 MB/s, with CPU usage of 85-90 percent on both the Cassandra node and the client.
Experiment 2: Ran two clients on two different machines, creating anywhere from 1 to 1000 threads, with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 12 MB/s, with CPU usage of 90 percent on the Cassandra node and both clients.
I did not see double the throughput even though my clients were running on two different machines, but I can understand that the Cassandra node is CPU-bound and that's probably why. That led me to Setup 2.
SETUP 2
3-node Cassandra cluster. Each node is on a different machine with 8 cores, 32 GB RAM, 800 GB SSD (disk), and 1 Gbit/s = 125 MB/s network bandwidth.
2 cassandra-stress client machines with 4 cores, 32 GB RAM, 800 GB SSD (disk), and 1 Gbit/s = 125 MB/s network bandwidth.
Experiment 3: Ran one client on one machine, creating anywhere from 1 to 1000 threads, with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 18 MB/s, with CPU usage of 65-70 percent on the Cassandra node and >90% on the client node.
Experiment 4: Ran two clients on two different machines, creating anywhere from 1 to 1000 threads, with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 22 MB/s, with CPU usage of <=75 percent on the Cassandra node and >90% on both client nodes.
So the question is: with one client node I was able to push 18 MB/s of network throughput, but with two client nodes running on two different machines I could only reach a peak of 22 MB/s. Why is this the case, even though this time the CPU usage on the Cassandra node is only around 65-70 percent on an 8-core machine?
Note: I stopped Cassandra and ran a tool called iperf3 between two EC2 machines, and measured a network bandwidth of 118 MB/s. I am converting everything into bytes rather than bits to avoid any sort of confusion.
Coming from a background of vSphere VMs with vNICs defined on creation, do GCE instances' internal and public IP network connections use a particular virtualised NIC, and if so, what speed is it: 100 Mbit/s, 1 Gb or 10 Gb?
I'm not so much interested in bandwidth from the public internet in, but rather in what kind of connection is possible between instances, given that networks can span regions.
Is it right to think of a GCE project network as a logical 100 Mbit/s, 1 Gb or 10 Gb network spanning the Atlantic that I plug my instances into? Or should there be no minimum expectation, because too many variables exist, like noisy neighbours and inter-region bandwidth, not to mention physical distance?
The virtual network adapter advertised in GCE conforms to the virtio-net specification (specifically virtio-net 0.9.5 with multiqueue). Within the same zone we offer up to 2Gbps/core of network throughput. The NIC itself does not advertise a specific speed. Performance between zones and between regions is subject to capacity limits and quality-of-service within Google's WAN.
The performance relevant features advertised by our virtual NIC as of December 2015 are support for:
IPv4 TCP Transport Segmentation Offload
IPv4 TCP Large Receive Offload
IPv4 TCP/UDP Tx checksum calculation offload
IPv4 TCP/UDP Rx checksum verification offload
Event based queue signaling/interrupt suppression.
In our testing, for best performance it is advantageous to enable all of these features. Images supplied by Google will take advantage of all the features available in the shipping kernel (that is, some images ship with older kernels for stability and may not be able to take advantage of all of these features).
I can see up to 1 Gb/s between instances within the same zone, but AFAIK that is not something which is guaranteed, especially for transatlantic communication. Things might change in the future, so I'd suggest following official product announcements.
There have been a few enhancements in the years since the original question and answers were posted. In particular, the "2Gbps/core" (really, per vCPU) is still there but there is now a minimum cap of 10 Gbps for VMs with two or more vCPUs. The maximum cap is currently 32 Gbps, with 50 Gbps and 100 Gbps caps in the works.
The per-VM egress caps remain "guaranteed not to exceed" not "guaranteed to achieve."
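Under the figures in this answer (2 Gbps per vCPU, a 10 Gbps minimum for VMs with two or more vCPUs, and a 32 Gbps maximum), the per-VM egress cap can be sketched as follows. This is a simplification for illustration, not the exact formula GCE applies, and the numbers may change:

```python
# Per-VM egress cap sketch based on the figures quoted above:
# 2 Gbps per vCPU, floored at 10 Gbps for >= 2 vCPUs, capped at 32 Gbps.
# Assumes single-vCPU VMs get just 2 Gbps/vCPU (the floor only applies at 2+).

def egress_cap_gbps(vcpus):
    if vcpus < 2:
        return 2 * vcpus
    return min(max(2 * vcpus, 10), 32)

print(egress_cap_gbps(1))   # 2
print(egress_cap_gbps(4))   # 10 (floor applies: 2*4 = 8 < 10)
print(egress_cap_gbps(8))   # 16
print(egress_cap_gbps(32))  # 32 (max cap)
```

Remember that these are caps ("guaranteed not to exceed"), not throughput you are promised to achieve.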
In terms of achieving peak, trans-Atlantic performance, one suggestion would be the same as for any high-latency path. Ensure that your sources and destinations are tuned to allow sufficient TCP window to achieve the throughput you desire. In particular, this formula would be in effect:
Throughput <= WindowSize / RoundTripTime
Of course that too is a "guaranteed not to exceed" rather than a "guaranteed to achieve" thing. As was stated before "Performance between zones and between regions is subject to capacity limits and quality-of-service within Google's WAN."
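The formula above turns into concrete numbers for a transatlantic path. The 70 ms round-trip time below is an illustrative assumption (a typical transatlantic RTT), not a figure from this answer:

```python
# Throughput <= WindowSize / RoundTripTime.
# The 70 ms RTT is an illustrative transatlantic value, not a GCE figure.

def max_throughput_mbs(window_bytes, rtt_s):
    """Upper bound on a single TCP stream's throughput, in MB/s."""
    return window_bytes / rtt_s / 1e6

def window_needed_mb(target_mbs, rtt_s):
    """TCP window (in MB) needed to sustain target_mbs over the given RTT."""
    return target_mbs * rtt_s

rtt = 0.070
print(max_throughput_mbs(64 * 1024, rtt))  # ~0.94 MB/s with a classic 64 KB window
print(window_needed_mb(1250, rtt))         # ~87.5 MB of window to fill 10 Gbps
```

In other words, without window scaling and generous socket buffers, a single stream cannot come anywhere near the link's capacity on a high-latency path.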
I'm using IT Guru's Opnet to simulate different networks. I've run the basic HomeLAN scenario and by default it uses an ethernet connection running at a data rate of 20Kbps. Throughout the scenarios this is changed from 20K to 40K, then to 512K and then to a T1 line running at 1.544Mbps. My question is - does increasing the data rate for the line increase the throughput?
I have this graph output from the program to display my results:
Please note it's the graph in the foreground which is of interest.
In general, the signaling capacity of a data path is only one factor in the net throughput.
For example, TCP is known to be sensitive to latency. For any particular TCP implementation and path latency, there will be a maximum speed beyond which TCP cannot go regardless of the path's signaling capacity.
Also consider the source and destination of the traffic: changing the network capacity won't change the speed if the source is not sending the data any faster or if the destination cannot receive it any faster.
In the case of network emulators, also be aware that buffer sizes can affect throughput. The size of the network buffer must be at least as large as the signal rate multiplied by the latency (the bandwidth-delay product). I am not familiar with the particulars of Opnet, but I have seen other emulators where it is possible to set a buffer size too small to support the selected rate and latency.
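For example, with the T1 line from the question and a hypothetical 100 ms path latency (the latency is an assumption for illustration, not a value from the simulation), the minimum buffer works out as follows:

```python
# Minimum buffer = bandwidth-delay product: signal rate (bytes/s) * latency.

def bdp_bytes(rate_bits_per_s, latency_s):
    """Bandwidth-delay product in bytes."""
    return rate_bits_per_s / 8 * latency_s

# T1 line at 1.544 Mbps with an assumed 100 ms latency:
print(bdp_bytes(1.544e6, 0.100))  # ~19300 bytes (~19 KB)
```

A buffer smaller than this keeps the pipe from ever being full, so raising the line's data rate alone would not raise the measured throughput.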
I have written a couple of articles related to these topics which may be helpful:
This one discusses common network bottlenecks: Common Network Performance Problems
This one discusses emulator configuration issues: Network Emulators