Kafka cluster & Inter Data centre network latency relationship - networking

In my company we run a Kafka cluster across 3 AZs in a single AWS region. There are multiple topics with many partitions and their replicas. Amazon does not publish official figures for inter-AZ latency, but when we test it the latency is extremely low, roughly sub-millisecond (< 1ms). Our producers and consumers also run in the same region, within those same AZs.
To save on operational costs, I am investigating the impact of moving this entire Kafka cluster, together with its producers and consumers, back on-prem and building a similar cluster across 3 data centres (DCs).
The inter-DC latency in our company is roughly 10ms one way, end to end. What would be the impact of this latency increasing roughly tenfold, from about 1ms to 10ms?
I ask because, apart from the producers and consumers, the brokers also talk to each other across DCs (replicas fetch messages from leaders, the controller informs other brokers about changes), and some changes are written as metadata to ZooKeeper. What are the pitfalls I should watch out for?
Sorry for such an open question, but I would like to hear from anyone with experience of the issues and pitfalls. How are the Kafka brokers and ZooKeeper affected when the cluster spans DCs with higher latency than between AWS AZs? Is it even practical?
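For a rough sense of the producer-side impact, a comparison like the sketch below can help (assuming the kafka-python package; the broker hostnames and the "latency-test" topic, with replication factor 3 and min.insync.replicas=2, are hypothetical). With acks='all' every produce waits for followers in the other DCs to replicate, so the inter-DC round trip lands directly on the producer's critical path, while acks=1 only waits for the leader.

    import time
    from kafka import KafkaProducer

    def average_produce_latency_ms(acks, samples=100):
        # Hypothetical bootstrap servers, one per DC; topic "latency-test"
        # is assumed to exist with replication.factor=3, min.insync.replicas=2.
        producer = KafkaProducer(
            bootstrap_servers=["dc1-broker:9092", "dc2-broker:9092", "dc3-broker:9092"],
            acks=acks,
            linger_ms=0,  # send each record immediately so we time single round trips
        )
        latencies = []
        for _ in range(samples):
            start = time.time()
            producer.send("latency-test", b"ping").get(timeout=10)  # block until acked
            latencies.append((time.time() - start) * 1000)
        producer.close()
        return sum(latencies) / len(latencies)

    print("acks=1  :", average_produce_latency_ms(1), "ms")
    print("acks=all:", average_produce_latency_ms("all"), "ms")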

Related

What is the purpose of an MQ concentrator topology?

Amazon MQ (and various on-prem MQs) can be configured in a "concentrator" topology, with a larger number of brokers that forward their messages to a smaller number of central brokers.
What is the benefit of such a setup? Is the idea that the brokers at the bottom are distributed across different AZs, or even regions?

DynamoDB DAX and High Availability

What's your preferred strategy for dealing with DAX's maintenance windows?
DynamoDB itself has no MWs and is very highly available. When DAX is introduced into the mix, if it's the sole access point of clients to DDB then it becomes a SPOF. How do you then handle degradation gracefully during DAX scheduled downtimes?
My thinking was to not use the DAX Client directly but introduce some abstraction layer that allows it to fall back to direct DDB access when DAX is down. Is that a good approach?
A DAX maintenance window doesn't take the cluster offline unless it is a one-node cluster. DAX provides availability through multiple nodes in the cluster: for a multi-node cluster, each node goes through maintenance in a specific order so that the cluster remains available. With retries configured on the DAX client, your workload shouldn't see an impact during maintenance windows.
Beyond maintenance windows, the cluster's nodes should be spread across multiple AZs so that it stays available if an AZ goes down.
An abstraction layer that falls back to DynamoDB is not a bad idea, but make sure the table has enough provisioned capacity configured to handle the load spike when the cache is bypassed.
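A minimal sketch of that fallback idea (the table name and DAX endpoint are hypothetical; it assumes boto3 and the amazondax client package): reads go to DAX first and drop back to plain DynamoDB if the DAX cluster is unreachable.

    import boto3
    from amazondax import AmazonDaxClient

    session = boto3.Session(region_name="us-east-1")
    ddb = session.client("dynamodb")
    # Hypothetical cluster endpoint; in practice taken from the DAX console/API.
    dax = AmazonDaxClient(session, endpoints=["my-dax-cluster.example:8111"])

    def get_item(key):
        try:
            return dax.get_item(TableName="my-table", Key=key)
        except Exception:
            # DAX unreachable (node replacement, maintenance, AZ issue):
            # degrade to direct DynamoDB. The table needs enough provisioned
            # capacity to absorb the un-cached read load when this happens.
            return ddb.get_item(TableName="my-table", Key=key)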

Increasing network IO on EC2

I'm executing a Spark job on EMR. The job is currently bottlenecked by the network (reading data from S3). Looking at the metrics in Ganglia, I see a flat line at around 600 MB/s. I'm using the i2.8xlarge instance type, which is supposed to provide 10Gbps, i.e. ~1280MB/s. I have verified that enhanced networking is turned on and that the VirtualizationType is hvm. Am I missing something? Is there any other way to increase network throughput?
Networking capacity of Amazon EC2 instances is based upon Instance Type. The larger the instance, the more networking capacity is available. You are using the largest instance type within the i2 family, so that is good.
Enhanced Networking lowers network latency and jitter and is available on a limited number of instance types. You are using it, so that is good.
The i2.8xl is listed as having 10Gbps of network throughput, but this is limited to traffic within the same Placement Group. My testing shows that EMR instances are not launched within a Placement Group, so they might not receive the full network throughput possible.
You could experiment by using more smaller instances rather than fewer large instances. For example, 2 x i2.4xlarge cost the same as 1 x i2.8xlarge.
The S3->EC2 bandwidth is actually rate limited at 5Gbps (625MB/s) on the largest instance types of each EC2 family, even on instances with Enhanced Networking and a 20Gbps network interface. This has been confirmed by the S3 team, and it matches what I observed in my experiments. Smaller instances get rate limited at lower rates.
S3's time-to-first-byte is about 80-100ms, and after the first byte it is able to deliver data to a single thread at 85MB/s, in theory. However, we've only observed about 60MB/s per thread on average (IIRC). S3 confirmed that this is expected, and slightly higher than what their customers observe. We used an HTTP client that kept connections alive to the S3 endpoint. The main reason small objects yield low throughput is the high time-to-first-byte.
The following is the max bandwidth we've observed (in MB/s) when downloading from S3 using various EC2 instances:
Instance MB/s
C3.2XL 114
C3.4XL 245
C3.8XL 600
C4.L 67
C4.XL 101
C4.2XL 266
C4.4XL 580
C4.8XL 600
I2.8XL 600
M3.XL 117
M3.2XL 117
M4.XL 95
M4.10XL 585
X1.32XL 612
We ran the above test with 32MB objects and a thread count between 10 and 16.
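For reference, a test of that shape can be reproduced with something like the sketch below (the bucket and key names are hypothetical; it assumes boto3 and that the 32MB objects have already been uploaded): each thread streams whole objects and the aggregate MB/s is reported per thread count.

    import time
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-throughput-test-bucket"                   # hypothetical
    KEYS = [f"bench/object-{i:03d}" for i in range(64)]    # pre-uploaded 32MB objects

    def fetch(key):
        # Stream the object in 1MB chunks and return the number of bytes read.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"]
        size = 0
        for chunk in iter(lambda: body.read(1024 * 1024), b""):
            size += len(chunk)
        return size

    def benchmark(threads):
        start = time.time()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            total = sum(pool.map(fetch, KEYS))
        return total / (time.time() - start) / (1024 * 1024)  # aggregate MB/s

    for threads in (10, 12, 16):
        print(f"{threads} threads: {benchmark(threads):.0f} MB/s")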
Also,
The network performance quoted in the EC2 instance matrix is benchmarked as described here: it is the bandwidth between Amazon EC2 Linux instances in the same VPC. The throughput we are observing between S3 and EC2 instances is not covered by that promise.
EC2 Instance Network performance appears to be categorized as:
Low
Moderate
High
10 Gigabit
20 Gigabit
Determining Network bandwidth on an instance designated Low, Moderate, or High appears to be done on a case-by-case basis.
C3, C4, R3, I2, M4 and D2 instances use the Intel® 82599 Virtual Function Interface and provide Enhanced Networking with 10 Gigabit interfaces in the largest instance size.
10 and 20 Gigabit interfaces are only able to achieve that speed when communicating within a common Placement Group, typically in support of HPC. Network traffic outside a placement group has a maximum of 5 Gbps.
Summary: the quoted network bandwidth is between two EC2 instances, not between S3 and EC2. Even between two instances, the 10/20 Gigabit figures are only achievable when both are in the same placement group (typically HPC setups).

Will using two availability zones in EC2 introduce network partitions?

Currently I am using one availability zone in my EC2 launch config. It is important that my app does not hit network partitions, as RabbitMQ does not handle them well when clustering and HA are used (which I am using).
I am very fuzzy on the concept of network partitions. Would it be safe for me to use two availability zones?
The different Amazon EC2 Availability Zones are in different physical locations. While the connections between availability zones are quite good, it is still a WAN connection.
From the RabbitMQ docs
RabbitMQ clusters do not tolerate network partitions well. If you are thinking of clustering across a WAN, don't. You should use federation or the shovel instead
(emphasis mine)
https://www.rabbitmq.com/partitions.html
In short, an interruption in connectivity of around a minute will cause a network partition. While this would be an unusual event on EC2, it can and sometimes will happen.
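If you do cluster across AZs anyway, it is worth monitoring for partitions rather than discovering them through application errors. A rough sketch (assuming the RabbitMQ management plugin is enabled at a hypothetical host, and that its /api/nodes endpoint reports any partitions each node has seen):

    import requests

    # Hypothetical management endpoint and credentials.
    MGMT_URL = "http://rabbit-mgmt.example:15672/api/nodes"
    AUTH = ("monitor", "secret")

    def check_partitions():
        nodes = requests.get(MGMT_URL, auth=AUTH, timeout=5).json()
        for node in nodes:
            partitions = node.get("partitions", [])
            if partitions:
                # The node can no longer see these peers: raise an alert here.
                print(f"PARTITION: {node['name']} cannot reach {partitions}")

    if __name__ == "__main__":
        check_partitions()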

Capacity planning for service oriented architecture?

I have a collection of SOA components that can handle a series of business processes. For example one SOA component imports user data, another runs analytics on it.
I'm familiar with business process modeling for manufacturing, i.e. calculating WIP, throughput, cycle times, utilization etc. for each process. Little's Law, theory of constraints, etc.
Can I apply this approach to capacity planning for my SOA architecture, or is there a more rigorous / more widely accepted approach?
A bit of a broad question; there is no single right answer, but here are some guidelines.
What you are looking for is Business Activity Monitoring (BAM) used together with performance metrics reported from your servers.
BAM will let you measure things like how many orders per second you are processing and how many sales you have made today. You then monitor and collect information such as CPU usage, network bandwidth, disk I/O performance, memory usage and other technical performance metrics. On Windows you can use performance counters for this; in the Linux world there are various tools and techniques you can use.
Using the number of orders placed, you can then look at the performance statistics of the systems behind the order-placing software to get an indication of what is happening.
For example: we process 10 orders a second on average, using roughly 8GB of RAM on the ESB server where the order service is hosted, and we are seeing an average increase of 25% per month in orders coming through. We have noticed several alerts about swapping to disk when orders are at their peak. At 25% monthly growth, demand roughly doubles every three months, so to keep up we would need to double the server's memory about every quarter; compounded over a year, the 8GB footprint grows to well over 100GB. At that point you decide on the implementation: keep scaling the single server up, or spread the load across a cluster of smaller machines behind a load balancer.
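As a toy projection of that example (all figures are the hypothetical numbers from the paragraph above, and memory is assumed to scale linearly with order rate):

    BASE_ORDERS_PER_SEC = 10      # current average load
    BASE_MEMORY_GB = 8            # current footprint on the ESB server
    MONTHLY_GROWTH = 1.25         # 25% more orders each month

    for month in range(0, 13, 3):
        factor = MONTHLY_GROWTH ** month
        print(f"month {month:2d}: ~{BASE_ORDERS_PER_SEC * factor:5.1f} orders/s, "
              f"~{BASE_MEMORY_GB * factor:5.1f} GB")
    # month  0: ~ 10.0 orders/s, ~  8.0 GB
    # month  3: ~ 19.5 orders/s, ~ 15.6 GB
    # month  6: ~ 38.1 orders/s, ~ 30.5 GB
    # month  9: ~ 74.5 orders/s, ~ 59.6 GB
    # month 12: ~145.5 orders/s, ~116.4 GB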
Using this information you can start to get a good idea of where your limits are and what you need to budget for in the future.
Go look at some BAM tools and some monitoring tools and see what suits you.
