Is it a good idea to build fault tolerance around an Amazon SNS call? Since Amazon services are built to be resilient, should we add one more layer of fault tolerance, or trust Amazon to handle that part?
These are the fallacies of distributed computing:
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous
Many network issues can occur. Your application should be resilient even when network latency is high or the network is down on your side; it doesn't really matter which remote service you are calling.
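Adding that resilience at your end doesn't have to be heavyweight. Here is a minimal sketch using boto3; the topic ARN, retry count, and backoff parameters are placeholders, and the right fallback depends on your application:

```python
import random
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

sns = boto3.client("sns")

def publish_with_retry(topic_arn, message, max_attempts=5):
    """Publish to SNS, retrying with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return sns.publish(TopicArn=topic_arn, Message=message)
        except (BotoCoreError, ClientError):
            if attempt == max_attempts:
                raise  # give up; let the caller fall back (queue, log, alert)
            # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
            time.sleep(2 ** (attempt - 1) + random.random())
```

Note that the AWS SDKs already retry some failures internally, so a wrapper like this is really about deciding what happens when those retries are exhausted (queue the message locally, alert, degrade gracefully, etc.).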
As per the docs, the NATS server takes a "server first" design approach when it comes to protecting against "lazy clients": lazy clients are simply booted under poor performance scenarios.
Because of this, I have internalized the assumption that anything connecting from beyond the edge should not connect directly to the NATS server, but should instead go through a middle-layer service that manages the more poorly performing external connections and maps them onto internally initiated connections to the NATS server.
For example, consider a client service node on a remote customer facility that accesses back end services that are serviced over NATS in some form.
Is my stated assumption true in a VERY STRICT sense, such that it is NEVER advisable for that remote node to connect to a NATS server directly, even given a low number of possible client connections AND a stable network service to those facilities?
Or, is it ok-ish to connect directly from that remote node IF, and ONLY IF, I have a known solid infrastructure (low latency, high bandwidth, dependable, etc.)?
Finally, what about when the path to that remote node is not a very solid network service? Specifically, is it better to A) use an intermediary service on the back end, as I described, to manage the endpoint connection requests and pass them along over internal connections, or B) just let the node connect directly to the server, let NATS boot it as needed, and use a reconnect-on-drops approach to keep the connection up as best it can?
A good example here is a mobile endpoint, which could be up and down regularly for various reasons and due to no fault of the device or infrastructure at all.
Currently, I am designing every NATS solution as a "backend connections only" design. If this is overcomplicating my designs needlessly, of course I want to stop forcing that design constraint. :)
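For option B above, a minimal sketch of the reconnect-on-drops approach using the nats-py client might look like the following; the server URL, subject, and timings are assumptions for illustration, not recommendations:

```python
import asyncio
import nats

async def main():
    async def on_disconnect():
        print("disconnected; client keeps retrying in the background")

    async def on_reconnect():
        print("reconnected; subscriptions are replayed automatically")

    # Connect with unlimited reconnect attempts so drops are survived.
    nc = await nats.connect(
        "nats://nats.internal:4222",  # hypothetical server address
        max_reconnect_attempts=-1,    # -1 = retry forever
        reconnect_time_wait=2,        # seconds between attempts
        disconnected_cb=on_disconnect,
        reconnected_cb=on_reconnect,
    )

    async def handler(msg):
        print(msg.subject, msg.data)

    await nc.subscribe("updates", cb=handler)
    await asyncio.Event().wait()  # keep running until cancelled

asyncio.run(main())
```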
I'm on a high-availability project that includes deploying a 2-node high-availability cluster for hot replacement of services (applications) running on the cluster nodes. The applications have inbound and outbound TCP connections and also process UDP traffic (mainly for communicating with an NTP server).
The problem is pretty standard until one needs to provide hot migration of services to the backup node with all the data stored in RAM. The applications are agnostic of any backup mechanism, and it is highly undesirable to modify them.
The only approach I've come up with is duplication: both cluster nodes run the same applications, repeating each other's calculations. If the primary server fails, the backup server becomes the primary.
However, I have not found any ready-made proxy that does synchronous port mirroring. No existing proxy server (HAProxy, Dante, 3proxy, etc.) supports such a feature as far as I know. Have I missed something, or should I write a new one from scratch?
A rough sketch of the functionality can be found here:
P.S. I assume that it is possible to compare the traffic from the two clones of the same application...
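For what it's worth, the core of such a "tee" proxy is small enough to sketch. Below is a rough asyncio outline, with placeholder addresses and ports and the response-comparison step left out; it mirrors inbound TCP bytes to a primary and a shadow backend and returns only the primary's responses:

```python
import asyncio

PRIMARY = ("10.0.0.1", 9000)  # hypothetical primary node
SHADOW = ("10.0.0.2", 9000)   # hypothetical backup node

async def pump(reader, writer):
    """Copy bytes from reader to writer until EOF."""
    while data := await reader.read(4096):
        writer.write(data)
        await writer.drain()
    writer.close()

async def handle_client(client_r, client_w):
    prim_r, prim_w = await asyncio.open_connection(*PRIMARY)
    shad_r, shad_w = await asyncio.open_connection(*SHADOW)

    async def client_to_backends():
        # Mirror every inbound chunk to both backends before reading more.
        while data := await client_r.read(4096):
            prim_w.write(data)
            shad_w.write(data)
            await asyncio.gather(prim_w.drain(), shad_w.drain())
        prim_w.close()
        shad_w.close()

    async def drain_shadow():
        # Discard the shadow's responses (or compare them to the primary's).
        while await shad_r.read(4096):
            pass

    await asyncio.gather(
        client_to_backends(),
        pump(prim_r, client_w),  # only the primary answers the client
        drain_shadow(),
    )

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 8000)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```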
Amazon / AWS EC2 offers SR-IOV (Single Root I/O Virtualization) instances, which it dubs "enhanced networking" -- does Google offer this on Compute Engine?
Specifically, are any GCE instance types able to bypass the hypervisor and have direct access to a multi-queue NIC?
Is SR-IOV support needed to take advantage of Scylla DB's architecture?
HN Discussion: https://news.ycombinator.com/item?id=10262719
Currently Google Compute Engine does not offer SR-IOV. That said, SR-IOV is not strictly necessary to take advantage of Scylla's architecture.
GCE offers multi-queue networking, and it is possible to assign the virtio-net queues directly to user mode using Intel's DPDK. This should allow our virtio-net NIC to work with Scylla, although at least at one point DPDK made certain qemu-specific assumptions with respect to virtio-net (in particular, it assumed Tx/Rx queue depths of 256 descriptors; the virtio-net NIC in GCE currently advertises 16,384-entry queues, although this is likely to change in the near future).
For applications like Scylla, this should offer superior network performance and lower in-guest compute overhead compared with the kernel TCP/IP stack.
Additionally, for all GCE instances with one or more whole cores (i.e., not fractional-core instances), we offer multi-Gbps throughput subject to fabric availability. Latency is likely to be lowest in zones with Haswell processors. We do not currently guarantee specific network characteristics, but we offer up to 2 Gbps/core of network throughput shared between the virtual NIC and any attached persistent-disk volumes (Local SSD throughput does not count against this limit). Throughput-wise, this makes 8-vCPU and larger instances comparable to EC2 Enhanced Networking.
At the moment, nothing that we offer is similar to AWS' "enhanced networking".
You are more than welcome to post this as a feature request on our Compute Engine issue tracker, though, so we can look into implementing a similar feature.
Rackspace and Amazon do handle UDP, but Azure (like GAE, the platform most similar to it) does not.
I am wondering what the expected benefits of this restriction are. Does it help with fine-tuning the network? Does it ease load balancing? Does it help secure the network?
I suspect the reason is that UDP traffic has neither a defined lifetime nor a defined packet-to-packet relationship. That makes it hard to load balance and hard to manage: when you don't know how long to hold the path open, you end up using timers, which is a problem for some NAT implementations too.
There's another angle not really explored here so far. UDP traffic is also a huge source of security problems, specifically DDoS attacks.
By blocking all UDP traffic, Azure can more effectively mitigate these attacks. Nearly all large-bandwidth attacks, which are by far the hardest to deal with, are amplification attacks of some sort, and most often UDP-based. Allowing that traffic past the border of the network greatly increases the likelihood of service disruption, regardless of QoS assurances.
A second facet of the same story is that by blocking UDP they prevent people from hosting insecure DNS servers, and thus prevent Azure from being the source of these large-scale amplification attacks. This is actually a very good thing for the internet overall, as I'd think the connectivity of Azure's data centers is significant. By contrast, I've had servers in AWS send non-stop UDP attacks at our data center for months on end, and I could not get the abuse team to respond.
The only thing that comes to my mind is that maybe they wanted to avoid their cloud being accessed over an unreliable transport protocol.
Along with scalability, reliability is one of the key aspects of Azure. For example, SQL Azure and Azure Storage data is always replicated in at least three places, and roles with at least two instances have 99.95% uptime in their SLA.
Of course, despite its partial unreliability, UDP has its use cases, some of them enumerated in the comments on the feature-voting site, but maybe those use cases are not a target for the Azure platform.
I want to develop a simple serverless LAN chat program just for fun. How can I do this? What type of architecture should I use?
Last year I worked on a TCP/UDP client/server application project. It was simple (the server listens on a certain port/socket, the client connects to the server's port, etc.). But I have no idea how to develop a "serverless" LAN chat program. How can I do this? UDP, TCP, multicast, broadcast? Or should the program behave as both server and client?
The simplest way would be to use UDP and simply broadcast your messages all over the network.
A little more advanced version would be to use the broadcast only to discover other nodes in the network (see the sketch after these steps):
Every node maintains a list of known peers.
Messages are sent with TCP to all known peers.
When a node starts up, it sends out a UDP broadcast to discover other nodes.
When a node receives a discovery broadcast, it sends "itself" to the source of the broadcast in order to make itself known. The receiving node adds the broadcaster to its own list of known peers.
When a node drops out of the network, it sends another broadcast in order to inform the remaining nodes that they should remove the dropped client from their list.
You would also have to consider handling nodes that drop out without informing the rest of the network.
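Here is a minimal sketch of the discovery part in Python; the port number and message format are arbitrary choices for illustration:

```python
import socket

DISCOVERY_PORT = 50000  # arbitrary port for discovery broadcasts

def announce(name):
    """Broadcast our presence to everyone on the LAN."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.sendto(f"HELLO {name}".encode(), ("255.255.255.255", DISCOVERY_PORT))
    s.close()

def listen(my_name, peers):
    """Record announcing peers and introduce ourselves back via unicast."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", DISCOVERY_PORT))
    while True:
        data, (addr, _port) = s.recvfrom(1024)
        msg = data.decode()
        if msg.startswith("HELLO "):
            peers[addr] = msg.split(" ", 1)[1]
            # Reply with a different verb so the exchange terminates
            # instead of ping-ponging HELLOs forever.
            s.sendto(f"WELCOME {my_name}".encode(), (addr, DISCOVERY_PORT))
        elif msg.startswith("WELCOME "):
            peers[addr] = msg.split(" ", 1)[1]
```

A real implementation would also ignore its own broadcasts (the sender receives them too) and would carry the TCP chat port in the message rather than assuming one.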
The Spread toolkit may be a bit of overkill for what you want, but it is an interesting starting point.
From the blurb:
Spread is an open source toolkit that provides a high performance messaging service that is resilient to faults across local and wide area networks. Spread functions as a unified message bus for distributed applications, and provides highly tuned application-level multicast, group communication, and point to point support. Spread services range from reliable messaging to fully ordered messages with delivery guarantees.
Spread can be used in many distributed applications that require high reliability, high performance, and robust communication among various subsets of members. The toolkit is designed to encapsulate the challenging aspects of asynchronous networks and enable the construction of reliable and scalable distributed applications.
Spread consists of a library that user applications are linked with, a binary daemon which runs on each computer that is part of the processor group, and various utility and demonstration programs.
Some of the services and benefits provided by Spread:
Reliable and scalable messaging and group communication.
A very powerful but simple API simplifies the construction of distributed architectures.
Easy to use, deploy and maintain.
Highly scalable from one local area network to complex wide area networks.
Supports thousands of groups with different sets of members.
Enables message reliability in the presence of machine failures, process crashes and recoveries, and network partitions and merges.
Provides a range of reliability, ordering and stability guarantees for messages.
Emphasis on robustness and high performance.
Completely distributed algorithms with no central point of failure.
Apple's iChat is an example of the very product you are envisioning. It uses Bonjour (Apple's zero-configuration networking protocol) to identify peers on a LAN. You can then chat or audio/video chat with them.
I'm not entirely sure how Bonjour works inside, but I know it uses multicast. Clients "register" services on the LAN, and the Bonjour protocol lets each host pull up a directory of hosts for a given service (all without central management).
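If you want to play with the same pattern yourself, something similar can be sketched in Python with the third-party zeroconf package; the service type, name, address, and port below are made up for illustration:

```python
import socket
from zeroconf import ServiceBrowser, ServiceInfo, ServiceListener, Zeroconf

SERVICE_TYPE = "_lanchat._tcp.local."  # hypothetical service type

class ChatListener(ServiceListener):
    def add_service(self, zc, type_, name):
        info = zc.get_service_info(type_, name)
        if info:
            print("peer appeared:", name, info.parsed_addresses(), info.port)

    def remove_service(self, zc, type_, name):
        print("peer left:", name)

    def update_service(self, zc, type_, name):
        pass

zc = Zeroconf()

# Register ourselves so other hosts can discover us...
info = ServiceInfo(
    SERVICE_TYPE,
    f"alice.{SERVICE_TYPE}",                       # our advertised name
    addresses=[socket.inet_aton("192.168.1.10")],  # our LAN address (assumed)
    port=5000,                                     # where our chat app listens
)
zc.register_service(info)

# ...and browse for everyone else advertising the same service.
browser = ServiceBrowser(zc, SERVICE_TYPE, ChatListener())

input("Press Enter to exit\n")
zc.close()
```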