GCE instances: network connection speeds on the internal network

Coming from a background of vSphere VMs with vNICs defined on creation, I'm wondering: do GCE instances' internal and public IP network connections use a particular virtualised NIC, and if so, what speed is it: 100 Mbit/s, 1 Gb/s or 10 Gb/s?
I'm not so much interested in the bandwidth in from the public internet, but in what kind of connection is possible between instances, given that networks can span regions.
Is it right to think of a GCE project network as a logical 100 Mbit/s, 1 Gb/s or 10 Gb/s network spanning the Atlantic that I plug my instances into, or should there be no minimum expectation because too many variables exist: noisy neighbours and inter-region bandwidth, not to mention physical distance?

The virtual network adapter advertised in GCE conforms to the virtio-net specification (specifically virtio-net 0.9.5 with multiqueue). Within the same zone we offer up to 2Gbps/core of network throughput. The NIC itself does not advertise a specific speed. Performance between zones and between regions is subject to capacity limits and quality-of-service within Google's WAN.
The performance-relevant features advertised by our virtual NIC as of December 2015 are support for:
IPv4 TCP Transport Segmentation Offload
IPv4 TCP Large Receive Offload
IPv4 TCP/UDP Tx checksum calculation offload
IPv4 TCP/UDP Rx checksum verification offload
Event based queue signaling/interrupt suppression.
In our testing, for best performance it is advantageous to enable all of these features. Images supplied by Google will take advantage of all the features available in the shipping kernel (that is, some images ship with older kernels for stability and may not be able to take advantage of all of these features).
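For reference, on a Linux guest you can check which of these offloads the kernel has actually enabled on the virtio-net device. A minimal sketch in Python, assuming a Linux image with ethtool installed and an interface named eth0 (adjust the name for your image):

    # Sketch: list the offload features the guest kernel has enabled on a NIC.
    # Assumes a Linux guest with ethtool installed; "eth0" is an assumed name.
    import subprocess

    def offload_features(interface="eth0"):
        # "ethtool -k" prints one "feature: on/off" line per offload.
        out = subprocess.run(["ethtool", "-k", interface],
                             capture_output=True, text=True, check=True).stdout
        features = {}
        for line in out.splitlines()[1:]:        # first line is just a header
            name, sep, state = line.partition(":")
            if sep and state.strip():
                features[name.strip()] = state.split()[0]   # "on", "off", ...
        return features

    if __name__ == "__main__":
        feats = offload_features()
        for name in ("tcp-segmentation-offload", "large-receive-offload",
                     "tx-checksumming", "rx-checksumming"):
            print(name, "=>", feats.get(name, "unknown"))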

I can see up to 1 Gb/s between instances within the same zone, but AFAIK that is not something that is guaranteed, especially for transatlantic communication. Things might change in the future, so I'd suggest following official product announcements.

There have been a few enhancements in the years since the original question and answers were posted. In particular, the "2Gbps/core" (really, per vCPU) is still there but there is now a minimum cap of 10 Gbps for VMs with two or more vCPUs. The maximum cap is currently 32 Gbps, with 50 Gbps and 100 Gbps caps in the works.
The per-VM egress caps remain "guaranteed not to exceed" not "guaranteed to achieve."
In terms of achieving peak trans-Atlantic performance, one suggestion would be the same as for any high-latency path: ensure that your sources and destinations are tuned to allow a TCP window large enough to achieve the throughput you desire. In particular, this formula applies:
Throughput <= WindowSize / RoundTripTime
Of course that too is a "guaranteed not to exceed" rather than a "guaranteed to achieve" thing. As was stated before "Performance between zones and between regions is subject to capacity limits and quality-of-service within Google's WAN."
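To make the window-size point concrete, here is a back-of-the-envelope calculation; the 70 ms RTT and 5 Gbps target are illustrative assumptions, not measured values:

    # Rough bandwidth-delay-product math for a long, fat path.
    # The RTT and target throughput below are illustrative assumptions.
    def required_window_bytes(target_bps, rtt_seconds):
        # Throughput <= WindowSize / RoundTripTime
        # => WindowSize >= Throughput * RoundTripTime
        return target_bps / 8 * rtt_seconds

    rtt = 0.070                 # ~70 ms, a plausible trans-Atlantic RTT
    target = 5 * 10**9          # 5 Gbps target throughput
    window = required_window_bytes(target, rtt)
    print(f"Need a TCP window of at least {window / 2**20:.1f} MiB")
    # => roughly 42 MiB, well above common default TCP buffer limits, so both
    #    ends need their maximum window/buffer sizes raised to get near 5 Gbps.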

Related

Software Routing

"Commercial software routers from companies such as Vyatta can typically only attain transfer data at speeds of up to three gigabits per second. That isn’t fast enough to take advantage of the full speed of a typical network card, which operates at 10 gigabits per second." [1]
How is the speed of the network interface card relevant in this scenario? Aren't software routers connecting multiple Virtual Machines running on the same physical host? [2] Unless a PC has multiple network interface cards, it is unlikely that it functions as a packet switch between different physical hosts.
My interpretation is that there are two different kinds of software routing: (1) embedding a real-time operating system on an actual router; (2) writing application-layer code on a PC that handles packets transmitted between the virtual machines running on that very PC. Is this correct?
It depends on what your router is doing. If it's literally just looking at a static route table and forwarding packets out another interface, there isn't much of a performance hit.
It's when you get into things like NAT, crypto, QoS, SPI and so on that you will see performance degradation. Hardware vendors usually use custom silicon to process the more advanced features, which allows for higher-throughput packet forwarding.
Now that merchant silicon is fast enough and the open source applications are getting better, the performance gap is closing.
Which one to use really depends on your use case. I've gone with both and not seen performance hits, but the software versions weren't handling high-throughput workloads.
Performance of the link from the virtual network to the physical network eventually becomes important at any reasonable scale. You're right that, within the same physical host, things can be pretty quick, but that requires being able to fit everything you need in one box.
While merchant silicon has come a long way in improving the performance of networking equipment, greater gains have come from getting CPUs to handle networking tasks better. Both AMD and Intel have improved their architectures to the point where 10 Gbps forwarding is a reality. Intel has developed a specialized library (DPDK Wiki Page) that takes care of a lot of low-level networking functions at high performance.

Does Google Compute Engine offer SR-IOV (Single Root I/O Virtualization)?

Amazon / AWS EC2 offers SR-IOV (Single Root I/O Virtualization) instances, which it dubs "enhanced networking" -- does Google offer this on Compute Engine?
Specifically, are any GCE instance types able to bypass the hypervisor and have direct access to a multi-queue NIC?
Is SR-IOV support needed to take advantage of Scylla DB's architecture?
HN Discussion: https://news.ycombinator.com/item?id=10262719
Currently Google Compute Engine does not offer SR-IOV. That said, SR-IOV is not strictly necessary to take advantage of Scylla's architecture.
GCE offers multi-queue networking, and it is possible to assign the virtio-net queues directly to user space using Intel's DPDK. This should allow our virtio-net NIC to work with Scylla, although at least at one point DPDK made certain qemu-specific assumptions with respect to virtio-net (in particular, it assumed Tx/Rx queue depths of 256 descriptors; the virtio-net NIC in GCE currently advertises 16,384-entry queues, although this is likely to change in the near future).
For applications like Scylla this should offer superior network performance and lower in-guest compute overhead than using the kernel TCP/IP stack.
Additionally, for all GCE instances with >= 1 core (i.e., not fractional-core instances) we offer multi-Gbps throughput subject to fabric availability. Latency is likely to be lowest in zones with Haswell processors. We do not currently guarantee specific network characteristics, but we offer up to 2 Gbps/core of network throughput shared between the virtual NIC and any attached persistent disk volumes (Local SSD throughput does not count against this limit). Throughput-wise this makes 8-vCPU and larger instances comparable to EC2 Enhanced Networking.
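As a quick sanity check that the guest really sees a multi-queue virtio-net device, you can count the per-queue directories the Linux kernel exposes in sysfs. A small sketch, with the interface name assumed:

    # Sketch: count the rx/tx queues Linux exposes for a NIC via sysfs.
    # Assumes a Linux guest; "eth0" is an assumed interface name.
    from pathlib import Path

    def queue_counts(interface="eth0"):
        qdir = Path("/sys/class/net") / interface / "queues"
        rx = sum(1 for p in qdir.iterdir() if p.name.startswith("rx-"))
        tx = sum(1 for p in qdir.iterdir() if p.name.startswith("tx-"))
        return rx, tx

    if __name__ == "__main__":
        rx, tx = queue_counts()
        print(f"rx queues: {rx}, tx queues: {tx}")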
At the moment, nothing that we offer is similar to AWS' "enhanced networking".
You are more than welcome to post this as a Feature Request on our Compute Engine issue tracker, though, so we can look at implementing a similar feature.

Effect of increasing number of network interface cards of a machine

What is the effect of increasing the number of NICs in a machine? Does this increase the number of requests processed per second? (In the case of a server, does it allow the server to handle more concurrent client requests?)
There are several layers in network infrastructure. Some are implemented in HW (and FW) and some in SW. Adding NICs adds to the HW/FW layers only. So there is indeed more bandwidth for the lower-level traffic, but for point-to-point connections the overall capacity is limited by the hosting machine. If it is powerful enough, an increase in the number of NICs can indeed increase the number of concurrent requests served (a server is a good example).
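As a rough illustration of how a server can actually make use of several NICs, one common pattern is to bind one listening socket per interface address and multiplex accepts across them. A hedged sketch; the addresses below are placeholders for the per-NIC addresses:

    # Sketch: accept connections on one listening socket per NIC address.
    # The IP addresses are placeholders for the addresses of two NICs.
    import selectors
    import socket

    NIC_ADDRESSES = ["192.0.2.10", "192.0.2.11"]   # assumed per-NIC addresses
    PORT = 8080

    sel = selectors.DefaultSelector()
    for addr in NIC_ADDRESSES:
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((addr, PORT))
        srv.listen(128)
        srv.setblocking(False)
        sel.register(srv, selectors.EVENT_READ)

    while True:
        for key, _ in sel.select():
            conn, peer = key.fileobj.accept()
            print("connection from", peer, "via", key.fileobj.getsockname())
            conn.close()            # real request handling would go here

Whether this actually increases requests per second still depends on the host having the CPU and memory bandwidth to drive all of the NICs, as noted above.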

Does changing the data rate of a line increase throughput?

I'm using IT Guru's Opnet to simulate different networks. I've run the basic HomeLAN scenario and by default it uses an ethernet connection running at a data rate of 20Kbps. Throughout the scenarios this is changed from 20K to 40K, then to 512K and then to a T1 line running at 1.544Mbps. My question is - does increasing the data rate for the line increase the throughput?
I have graph output from the program displaying my results; please note it's the plot in the foreground which is of interest.
In general, the signaling capacity of a data path is only one factor in the net throughput.
For example, TCP is known to be sensitive to latency. For any particular TCP implementation and path latency, there will be a maximum speed beyond which TCP cannot go regardless of the path's signaling capacity.
Also consider the source and destination of the traffic: changing the network capacity won't change the speed if the source is not sending the data any faster or if the destination cannot receive it any faster.
In the case of network emulators, also be aware that buffer sizes can affect throughput. The size of the network buffer must be at least as large as the signal rate multiplied by the latency (the bandwidth-delay product). I am not familiar with the particulars of Opnet, but I have seen other emulators where it is possible to set a buffer size too small to support the selected rate and latency.
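To tie that to application code, the bandwidth-delay product can be computed and used to size the socket buffers explicitly. A minimal sketch, using the T1 rate from the question and an assumed 100 ms round-trip time:

    # Sketch: size TCP socket buffers to at least the bandwidth-delay product.
    # The link rate matches the question's T1 line; the RTT is an assumption.
    import socket

    link_bps = 1.544 * 10**6    # T1 line rate
    rtt = 0.100                 # assumed 100 ms round-trip time
    bdp_bytes = int(link_bps / 8 * rtt)

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Request buffers of at least one BDP before connecting; the kernel may
    # round the values up (Linux doubles them internally for bookkeeping).
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    print("BDP:", bdp_bytes, "bytes; SO_RCVBUF now",
          s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))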
I have written a couple of articles related to these topics which may be helpful:
This one discusses common network bottlenecks: Common Network Performance Problems
This one discusses emulator configuration issues: Network Emulators

Determine asymmetric latencies in a network

Imagine you have many clustered servers, across many hosts, in a heterogeneous network environment, such that the connections between servers may have wildly varying latencies and bandwidth. You want to build a map of the connections between servers by transferring data between them.
Of course, this map may become stale over time as the network topology changes, but let's ignore those complexities for now and assume the network is relatively static.
Given the latencies between nodes in this host graph, calculating the bandwidth is a relatively simple timing exercise. I'm having more difficulty with the latencies, however. To get round-trip time, it is a simple matter of timing a return-trip ping from the local host to a remote host: both timing events (start, stop) occur on the local host.
What if I want one-way times, under the assumption that the latency is not equal in both directions? Assuming that the clocks on the various hosts are not precisely synchronized (at least, that their error is of the same magnitude as the latencies involved), how can I calculate the one-way latency?
In a related question: is this asymmetric latency (where a link is quicker in one direction than the other) common in practice? For what reasons/hardware configurations? Certainly I'm aware of asymmetric bandwidth scenarios, especially on last-mile consumer links such as DSL and cable, but I'm not so sure about latency.
Added: After considering the comment below, the second portion of the question is probably better off on serverfault.
To the best of my knowledge, asymmetric latencies -- especially "last mile" asymmetries -- cannot be automatically determined, because any network time synchronization protocol is equally affected by the same asymmetry, so you don't have a point of reference from which to evaluate the asymmetry.
If each endpoint had, for example, its own GPS clock, then you'd have a reference point to work from.
In Fast Measurement of LogP Parameters for Message Passing Platforms, the authors note that latency measurement requires clock synchronization external to the system being measured. (Boldface emphasis mine, italics in original text.)
Asymmetric latency can only be measured by sending a message with a timestamp ts, and letting the receiver derive the latency from tr - ts, where tr is the receive time. This requires clock synchronization between sender and receiver. Without external clock synchronization (like using GPS receivers or specialized software like the network time protocol, NTP), clocks can only be synchronized up to a granularity of the roundtrip time between two hosts [10], which is useless for measuring network latency.
No network-based algorithm (such as NTP) will eliminate last-mile link issues, though, since every input to the algorithm will itself be uniformly subject to the performance characteristics of the last-mile link and is therefore not "external" in the sense given above. (I'm confident it's possible to construct a proof, but I don't have time to construct one right now.)
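To make the tr - ts formula from the quote concrete, here is a minimal sketch: the sender puts its wall-clock send time in each UDP datagram and the receiver subtracts it from its local receive time. The result is only meaningful if the two clocks are already synchronized by something external (GPS, PTP, well-conditioned NTP); the port number is arbitrary and the receiver host is supplied on the command line:

    # Sketch of one-way delay measurement: receiver computes tr - ts.
    # Only valid if sender and receiver clocks are externally synchronized.
    import socket
    import struct
    import sys
    import time

    PORT = 9999     # arbitrary

    def sender(receiver_host):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        for _ in range(10):
            ts = time.time()                 # sender's wall-clock send time
            s.sendto(struct.pack("!d", ts), (receiver_host, PORT))
            time.sleep(0.2)

    def receiver():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("", PORT))
        while True:
            data, addr = s.recvfrom(64)
            tr = time.time()                 # receiver's wall-clock receive time
            (ts,) = struct.unpack("!d", data)
            print(f"one-way delay from {addr[0]}: {(tr - ts) * 1000:.3f} ms")

    if __name__ == "__main__":
        if len(sys.argv) > 2 and sys.argv[1] == "send":
            sender(sys.argv[2])
        else:
            receiver()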
There is a project called One-Way Ping (OWAMP) specifically to solve this issue. Activity can be seen in the LKML for adding high resolution timestamps to incoming packets (SO_TIMESTAMP, SO_TIMESTAMPNS, etc) to assist in the calculation of this statistic.
http://www.internet2.edu/performance/owamp/
There's even a Java version:
http://www.av.it.pt/jowamp/
Note that packet timestamping really needs hardware support, and many current-generation NICs only offer millisecond resolution, which may be out of sync with the host clock. There are MSDN articles in the DDK about synchronizing host and NIC clocks that demonstrate the potential problems. Nanosecond timestamps from the TSC are problematic due to per-core differences and may require the Nehalem architecture to work properly at the required resolutions.
http://msdn.microsoft.com/en-us/library/ff552492(v=VS.85).aspx
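For completeness, the kernel receive timestamps mentioned above (SO_TIMESTAMPNS) can be read from user space as ancillary data on recvmsg. A Linux-only sketch; the numeric fallback is the x86 Linux value of the option, in case the Python build does not expose the constant:

    # Sketch: read kernel receive timestamps (SO_TIMESTAMPNS) on a UDP socket.
    # Linux-only; 35 is the x86 Linux value of SO_TIMESTAMPNS/SCM_TIMESTAMPNS.
    import socket
    import struct

    SO_TIMESTAMPNS = getattr(socket, "SO_TIMESTAMPNS", 35)

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", 9998))
    s.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPNS, 1)

    data, ancdata, flags, addr = s.recvmsg(2048, socket.CMSG_SPACE(16))
    for cmsg_level, cmsg_type, cmsg_data in ancdata:
        if cmsg_level == socket.SOL_SOCKET and cmsg_type == SO_TIMESTAMPNS:
            sec, nsec = struct.unpack("ll", cmsg_data[:16])  # struct timespec
            print(f"{len(data)} bytes from {addr[0]}, "
                  f"kernel rx time {sec}.{nsec:09d}")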
You can measure latency asymmetry on a link by sending different-sized packets to a port that returns a fixed-size packet, for example by sending UDP packets to a port that replies with an ICMP error message. The ICMP error message is always the same size, but you can adjust the size of the UDP packet you're sending.
see http://www.cs.columbia.edu/techreports/cucs-009-99.pdf
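A rough sketch of that probing idea: send UDP datagrams of varying sizes to a (presumably closed) port and time how long the fixed-size ICMP port-unreachable reply takes to come back. On Linux, with a connected UDP socket, that ICMP error surfaces as ConnectionRefusedError on the next receive; the target address is a placeholder and the timing is only approximate:

    # Sketch: time variable-size probes against a fixed-size ICMP reply.
    # Target host/port are placeholders; assumes the port is closed and that
    # ICMP port-unreachable messages are not filtered along the path.
    import socket
    import time

    TARGET = ("203.0.113.5", 33434)    # placeholder host, traceroute-style port

    def probe(payload_size):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(2.0)
        s.connect(TARGET)
        try:
            t0 = time.monotonic()
            s.send(b"x" * payload_size)
            s.recv(1)                    # wakes up when the ICMP error arrives
        except ConnectionRefusedError:
            return time.monotonic() - t0 # forward leg carried payload_size bytes
        except socket.timeout:
            return None                  # probe or reply lost/filtered
        finally:
            s.close()

    for size in (64, 512, 1400):
        rtt = probe(size)
        print(size, "bytes:", "lost" if rtt is None else f"{rtt * 1000:.2f} ms")

Comparing how the measured times grow with probe size gives an indication of the forward direction's contribution, since the reply size stays constant.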
In the absence of a synchronized clock, the asymmetry cannot be measured, as proven in the 2011 paper "Fundamental limits on synchronizing clocks over networks".
https://www.researchgate.net/publication/224183858_Fundamental_Limits_on_Synchronizing_Clocks_Over_Networks
The sping tool is a new development in this space, which uses clock synchronization against nearby NTP servers, or an even more accurate source in the form of a GNSS box, to estimate asymmetric latencies.
The approach is covered in more detail in this blog post.
