Increasing throughput in a network-heavy Storm topology with DynamoDB - bigdata

I have a topology running on AWS. I use m3.xlarge machines with 15 GB RAM and 8 supervisors. My topology is simple; I read from
Kafka spout -> [DB op 1] -> [DB op 2] -> [DynamoDB fetch] -> [DynamoDB write & Kafka write] -> Kafka
The DB ops are conditional, with latency around 100-150 ms.
But I have never been able to achieve a throughput of more than 300 msgs/sec.
What configuration changes should I make to get a throughput of more than 3k msgs/sec?
The DynamoDB fetch bolt's execute latency is around 150-220 ms,
and the DynamoDB read bolt's execute latency is also in that range.
There are four bolts with parallelism 90 each and one spout with parallelism 30 (30 Kafka partitions).
The overall latency is greater than 4 seconds.
topology.message.timeout.secs: 600
worker.childopts: "-Xmx5120m"
no. of worker ports per machine: 2
no. of workers: 6
no. of threads: 414
executor send buffer size: 16384
executor receive buffer size: 16384
transfer buffer size: 34
no. of ackers: 24

Looking at the console snapshot, I see:
1) The overall latency for the spout is much greater than the sum of the bolts' execute latencies, which implies that there's a backlog on one of the streams, and
2) The capacity for SEBolt is much higher than that of the other bolts, implying that Storm feels the need to run that bolt more than the others.
So I think your bottleneck is the SEBolt. Look into increasing the parallelism hint on that one. If the total number of tasks is getting too high, reduce the parallelism hint for the other bolts to offset the increase for SEBolt.
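If it helps, here is a minimal sketch of what shifting the parallelism hints toward the bottleneck might look like with Storm's Java API. The component names and the spout/bolt classes (MyKafkaSpout, DbOp1Bolt, DbOp2Bolt, SEBolt, DynamoKafkaWriteBolt) are placeholders for your own implementations, and the executor counts are illustrative rather than tuned values:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class RebalancedTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // 30 Kafka partitions -> keep the spout at 30 executors.
        // MyKafkaSpout and the *Bolt classes stand in for your existing implementations.
        builder.setSpout("kafka-spout", new MyKafkaSpout(), 30);

        // Trim the bolts that are not the bottleneck...
        builder.setBolt("db-op-1", new DbOp1Bolt(), 45).shuffleGrouping("kafka-spout");
        builder.setBolt("db-op-2", new DbOp2Bolt(), 45).shuffleGrouping("db-op-1");

        // ...and give the bottleneck (SEBolt, the DynamoDB-facing bolt) more executors,
        // since its ~150-220 ms execute latency caps per-executor throughput.
        builder.setBolt("se-bolt", new SEBolt(), 180).shuffleGrouping("db-op-2");
        builder.setBolt("dynamo-kafka-write", new DynamoKafkaWriteBolt(), 90).shuffleGrouping("se-bolt");

        Config conf = new Config();
        conf.setNumWorkers(6);
        conf.setNumAckers(24);
        conf.setMessageTimeoutSecs(600);

        StormSubmitter.submitTopology("dynamo-topology", conf, builder.createTopology());
    }
}

The same change can also be applied to a running topology with storm rebalance (e.g. storm rebalance <topology-name> -e se-bolt=180), as long as the new executor count does not exceed the component's task count.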

Related

Increasing network IO on EC2

I'm executing a Spark job on EMR. The job is currently bottlenecked by the network (reading data from S3). Looking at metrics in Ganglia, I get a straight line at around 600 MB/s. I'm using the i2.8xlarge instance type, which is supposed to give 10 Gbps, i.e. ~1280 MB/s. I have verified that enhanced networking is turned on and VirtualizationType is hvm. Am I missing something? Is there any other way to increase network throughput?
Networking capacity of Amazon EC2 instances is based upon Instance Type. The larger the instance, the more networking capacity is available. You are using the largest instance type within the i2 family, so that is good.
Enhanced Networking lowers network latency and jitter and is available on a limited number of instance types. You are using it, so that is good.
The i2.8xl is listed as having 10Gbps of network throughput, but this is limited to traffic within the same Placement Group. My testing shows that EMR instances are not launched within a Placement Group, so they might not receive the full network throughput possible.
You could experiment by using more smaller instances rather than fewer large instances. For example, 2 x i2.4xlarge cost the same as 1 x i2.8xlarge.
The S3->EC2 bandwidth is actually rate limited at 5Gbps (625MB/s) on the largest instance types of each EC2 family, even on instances with Enhanced Networking and a 20Gbps network interface. This has been confirmed by the S3 team, and it matches what I observed in my experiments. Smaller instances get rate limited at lower rates.
S3's time-to-first-byte is about 80-100ms, and after the first byte it is able to deliver data to a single thread at 85MB/s, in theory. However, we've only observed about 60MB/s per thread on average (IIRC). S3 confirmed that this is expected, and slightly higher than what their customers observe. We used an HTTP client that kept connections alive to the S3 endpoint. The main reason small objects yield low throughput is the high time-to-first-byte.
The following is the max bandwidth we've observed (in MB/s) when downloading from S3 using various EC2 instances:
Instance MB/s
C3.2XL 114
C3.4XL 245
C3.8XL 600
C4.L 67
C4.XL 101
C4.2XL 266
C4.4XL 580
C4.8XL 600
I2.8XL 600
M3.XL 117
M3.2XL 117
M4.XL 95
M4.10XL 585
X1.32XL 612
We've done the above test with 32MB objects and a thread count between 10-16.
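For illustration, a test along those lines could look like the following sketch with the AWS SDK for Java v1. The bucket name, key layout, object count and thread count are assumptions for the sake of the example, not the exact harness used for the numbers above:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class S3ThroughputProbe {
    // Bucket, key prefix and thread count are placeholders.
    private static final String BUCKET = "my-test-bucket";
    private static final int THREADS = 16;

    public static void main(String[] args) throws Exception {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        AtomicLong bytes = new AtomicLong();

        long start = System.nanoTime();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < THREADS; i++) {
            final String key = "testdata/object-" + i;   // e.g. one 32 MB object per thread
            futures.add(pool.submit(() -> {
                S3Object obj = s3.getObject(new GetObjectRequest(BUCKET, key));
                try (InputStream in = obj.getObjectContent()) {
                    byte[] buf = new byte[1 << 20];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes.addAndGet(n);   // count bytes actually delivered
                    }
                }
                return null;
            }));
        }
        for (Future<?> f : futures) f.get();
        pool.shutdown();

        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("Downloaded %.1f MB in %.1f s -> %.1f MB/s%n",
                bytes.get() / 1e6, secs, bytes.get() / 1e6 / secs);
    }
}

Because per-thread throughput tops out well below the instance's network capacity, the aggregate figure depends heavily on the number of concurrent downloads and on object size (small objects pay the time-to-first-byte penalty on every request).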
Also, the network performance quoted in the EC2 instance type matrix is benchmarked as described here: it is the network bandwidth between Amazon EC2 Linux instances within the same VPC. The bandwidth we observe between S3 and EC2 instances is not what that figure promises.
EC2 Instance Network performance appears to be categorized as:
Low
Moderate
High
10 Gigabit
20 Gigabit
Determining Network bandwidth on an instance designated Low, Moderate, or High appears to be done on a case-by-case basis.
C3, C4, R3, I2, M4 and D2 instances use the Intel® 82599 Virtual Function interface and provide Enhanced Networking with 10 Gigabit interfaces in the largest instance size.
10 and 20 Gigabit interfaces are only able to achieve that speed when communicating within a common Placement Group, typically in support of HPC. Network traffic outside a placement group has a maximum of 5 Gbps.
Summary: the network bandwidth quoted is between two EC2 instances, not between S3 and EC2. Even between two instances, the 10/20 Gigabit figures are only achievable when they are in the same placement group, typically for HPC workloads.

I can see Cassandra is CPU-bound for write-heavy workloads, but is it network-bound as well?

SETUP 1
3-node Cassandra cluster. Each node is on a different machine with 4 cores, 32 GB RAM, an 800 GB SSD, and 1 Gbit/s = 125 MB/s network bandwidth.
2 cassandra-stress client machines with exactly the same configuration as above.
Experiment 1: Ran one client on one machine, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 8 MB/s, with CPU usage of 85-90 percent on both the Cassandra node and the client.
Experiment 2: Ran two clients on two different machines, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 12 MB/s, with CPU usage of 90 percent on the Cassandra node and both clients.
I did not see double the throughput even though my clients were running on two different machines, but I can understand that the Cassandra node is CPU-bound and that's probably why. That led me to setup 2.
SETUP 2
3-node Cassandra cluster. Each node is on a different machine with 8 cores, 32 GB RAM, an 800 GB SSD, and 1 Gbit/s = 125 MB/s network bandwidth.
2 cassandra-stress client machines with 4 cores, 32 GB RAM, an 800 GB SSD, and 1 Gbit/s = 125 MB/s network bandwidth.
Experiment 3: Ran one client on one machine, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 18 MB/s, with CPU usage of 65-70 percent on the Cassandra node and >90% on the client node.
Experiment 4: Ran two clients on two different machines, creating anywhere from 1 to 1000 threads with a consistency level of QUORUM. The max network throughput on a Cassandra node was around 22 MB/s, with CPU usage of <=75 percent on the Cassandra node and >90% on both client nodes.
So the question is: with one client node I was able to push 18 MB/s of network throughput, but with two client nodes running on two different machines I was only able to reach a peak of 22 MB/s. Why is this the case, even though this time the CPU usage on the Cassandra node is around 65-70 percent on an 8-core machine?
Note: I stopped Cassandra and ran a tool called iperf3 between two different EC2 machines, and I saw a network bandwidth of 118 MB/s. I am converting everything into bytes rather than bits to avoid any confusion.
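For reference, this is not cassandra-stress itself, but a minimal sketch with the DataStax Java driver (3.x) of a comparable write-heavy workload at consistency level QUORUM. The contact point, keyspace/table names, payload size and concurrency limit are illustrative placeholders:

import com.datastax.driver.core.*;
import java.util.concurrent.Semaphore;

public class WriteLoad {
    public static void main(String[] args) throws Exception {
        // Contact point, keyspace/table names and payload size are placeholders.
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS stress WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 3}");
            session.execute("CREATE TABLE IF NOT EXISTS stress.kv (id int PRIMARY KEY, payload text)");

            PreparedStatement insert = session
                    .prepare("INSERT INTO stress.kv (id, payload) VALUES (?, ?)")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);

            String payload = new String(new char[300]).replace('\0', 'x');

            // Bound the number of in-flight async writes instead of spawning 1000 threads.
            Semaphore inFlight = new Semaphore(256);
            int writes = 1_000_000;
            long start = System.nanoTime();
            for (int i = 0; i < writes; i++) {
                inFlight.acquire();
                session.executeAsync(insert.bind(i, payload))
                       .addListener(inFlight::release, Runnable::run);
            }
            inFlight.acquire(256); // wait for the tail of in-flight requests

            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%d writes in %.1f s -> %.0f ops/s%n", writes, secs, writes / secs);
        }
    }
}

The bounded executeAsync loop is meant to show that what drives load on the cluster is the number of in-flight requests rather than the raw client thread count, so it may be worth checking how much concurrency the stress clients actually sustain in each experiment.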

What is the meaning of the Amazon DynamoDB free tier?

In my Android app I use Amazon DynamoDB. I created 10 tables, each with a read capacity of 10 and a write capacity of 5. Today I received an email from Amazon: it cost me $11.36.
I don't understand the meaning of the free tier. Here is what I read from Amazon:
DynamoDB customers get 25 GB of free storage, as well as up to 25 write capacity units and 25 read capacity units of ongoing throughput capacity (enough throughput to handle up to 200 million requests per month) and 2.5 million read requests from DynamoDB Streams for free.
Please tell me more clearly about the meaning of free tier: 25 read and 25 write capacity units!
Amazon considers the aggregate read and write capacity across all tables, not the capacity of individual tables.
In your case the total read capacity is 10 tables x 10 = 100 units and the total write capacity is 10 x 5 = 50 units. The free tier covers 25 of each, so you are charged for 100 - 25 = 75 read capacity units and 50 - 25 = 25 write capacity units for every hour they are provisioned.
Please plan the read and write capacity of each table properly, otherwise you will end up with bigger bills.
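If you want to check where you stand, here is a minimal sketch (assuming the AWS SDK for Java v1, using your default credentials and region) that sums the provisioned capacity across all tables and compares it with the 25 + 25 free-tier allowance:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughputDescription;

public class CapacityAudit {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        long totalRead = 0, totalWrite = 0;
        // listTables() returns one page (up to 100 names); fine for a handful of tables.
        for (String table : client.listTables().getTableNames()) {
            ProvisionedThroughputDescription pt =
                    client.describeTable(table).getTable().getProvisionedThroughput();
            totalRead += pt.getReadCapacityUnits();
            totalWrite += pt.getWriteCapacityUnits();
        }

        // The free tier covers 25 read and 25 write capacity units in aggregate;
        // anything above that is billed per provisioned hour.
        System.out.printf("Provisioned RCU: %d (billable above free tier: %d)%n",
                totalRead, Math.max(0, totalRead - 25));
        System.out.printf("Provisioned WCU: %d (billable above free tier: %d)%n",
                totalWrite, Math.max(0, totalWrite - 25));
    }
}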

Encrypted Vs Unencrypted EBS Volumes AWS

We are testing a standard EBS volume and an encrypted EBS volume on an EBS-optimized m3.xlarge EC2 instance.
While analyzing the test results, we found that
the encrypted EBS volume takes less time for read, write, and read/write operations than the volume without encryption.
I would expect some extra latency on the encrypted EBS volume because of the encryption overhead on every I/O request.
What would explain why encrypted EBS volumes are faster than normal EBS volumes?
The expected result is that the plain EBS volume should yield better results than the encrypted one.
Results:
Encrypted EBS results:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Initializing random number generator from timer.
Extra file open flags: 16384
8 files, 512Mb each
4Gb total file size
Block size 16Kb
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.
Operations performed: 0 Read, 262144 Write, 8 Other = 262152 Total
Read 0b Written 4Gb Total transferred 4Gb (11.018Mb/sec)
705.12 Requests/sec executed
Test execution summary:
total time: 371.7713s
total number of events: 262144
total time taken by event execution: 2973.6874
per-request statistics:
min: 1.06ms
avg: 11.34ms
max: 3461.45ms
approx. 95 percentile: 1.72ms
EBS results:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Initializing random number generator from timer.
Extra file open flags: 16384
8 files, 512Mb each
4Gb total file size
Block size 16Kb
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.
Operations performed: 0 Read, 262144 Write, 8 Other = 262152 Total
Read 0b Written 4Gb Total transferred 4Gb (6.3501Mb/sec)
406.41 Requests/sec executed
Test execution summary:
total time: 645.0251s
total number of events: 262144
total time taken by event execution: 5159.7466
per-request statistics:
min: 0.88ms
avg: 19.68ms
max: 5700.71ms
approx. 95 percentile: 6.31ms
Please help me resolve this issue.
That's certainly unexpected conceptually, and the Amazon EBS Encryption documentation confirms it should not be the case:
[...] and you can expect the same provisioned IOPS performance on encrypted volumes as you would with unencrypted volumes with a minimal effect on latency. You can access encrypted Amazon EBS volumes the same way you access existing volumes; encryption and decryption are handled transparently and they require no additional action from you, your EC2 instance, or your application. [...] [emphasis mine]
Amazon EBS Volume Performance provides more details on EBS performance in general. From that angle (and this is pure speculation), maybe the use of encryption implies some default pre-warming; see Pre-Warming Amazon EBS Volumes:
When you create any new EBS volume (General Purpose (SSD), Provisioned IOPS (SSD), or Magnetic) or restore a volume from a snapshot, the back-end storage blocks are allocated to you immediately. However, the first time you access a block of storage, it must be either wiped clean (for new volumes) or instantiated from its snapshot (for restored volumes) before you can access the block. This preliminary action takes time and can cause a 5 to 50 percent loss of IOPS for your volume the first time each block is accessed. [...]
Either way, I suggest rerunning the benchmark after pre-warming both new EBS volumes, in case you haven't done so already.
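Pre-warming just means touching every block once; the AWS documentation typically suggests dd or fio for this. For completeness, here is a minimal Java sketch of the same idea, assuming the volume is attached as /dev/xvdf (the device name is an assumption, adjust it) and the process has permission to read it:

import java.io.FileInputStream;
import java.io.IOException;

public class PreWarm {
    public static void main(String[] args) throws IOException {
        // Device name is an assumption; pass the real one as an argument. Requires root.
        String device = args.length > 0 ? args[0] : "/dev/xvdf";

        byte[] buf = new byte[1 << 20]; // read in 1 MiB chunks
        long total = 0;
        long start = System.nanoTime();
        try (FileInputStream in = new FileInputStream(device)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n; // touching each block forces it to be wiped or instantiated
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("Read %d MiB in %.0f s (%.1f MB/s)%n",
                total >> 20, secs, total / 1e6 / secs);
    }
}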

Keeping 1000s of TCP connections open in spite of very few (10/20) actually communicating

For a socket server, what are the costs involved in keeping thousands of TCP connections open when only a few clients are actually communicating? I'm using a single-threaded poll/select-based server. Also, what is the maximum open-socket limit of the Linux kernel? (Please don't give config-file suggestions; I want the theoretical limit.)
The main cost (for typical TCP/IP stack implementations) is in the transmit and receive buffers living in kernel memory -- the recommended size in this good (if maybe dated) short essay is socket buffer size = 2 * bandwidth * delay. So, if the delay is 100 ms (the ping time, a round trip of two delays, would then be 200 ms), and the bandwidth is about a gigabit/second (call it 100 MB per second for ease of computation;-), to get top performance you need a socket buffer size of about 20 MB per socket (I've heard of no kernel configured so generously by default, but you can control the buffer size per socket just before you open the socket...).
A typical socket buffer size might be, say, 256 KB; with the same hypothetical 100 ms delay, that should be fine for a bandwidth of up to 5 megabit/second -- i.e., with that buffer size and delay, you're not going to be fully exploiting the bandwidth of even a moderate quality broadband connection ("long narrow pipes" -- large bandwidth and large delay, so their product is quite large -- are notoriously difficult to exploit fully). Anyway, with that buffer size, you'd be eating up 256 MB of kernel memory by keeping 1000 TCP sockets open and connected (albeit inactive).
I don't think (at least on a 64-bit kernel) there are other intrinsic limits besides the amount of RAM allocated to the kernel (where the buffers live). Of course you can configure your kernel to limit that number (to avoid all of kernel memory going to socket buffers;-).
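To make the buffer-size arithmetic concrete, here is a minimal Java sketch that computes the 2 * bandwidth * delay product for the gigabit / 100 ms example above and requests matching per-socket buffers before connecting. The host and port are placeholders, and the kernel may still clamp the request to its configured per-socket maximum:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BdpSocket {
    public static void main(String[] args) throws IOException {
        // Figures from the answer above: ~100 MB/s bandwidth, 100 ms one-way delay.
        long bandwidthBytesPerSec = 100L * 1024 * 1024;
        double delaySecs = 0.100;
        int bufferBytes = (int) (2 * bandwidthBytesPerSec * delaySecs); // ~20 MB

        Socket socket = new Socket();
        // Set before connect() so the receive window can be negotiated accordingly;
        // the kernel may still clamp it to its per-socket maximum.
        socket.setReceiveBufferSize(bufferBytes);
        socket.setSendBufferSize(bufferBytes);
        socket.connect(new InetSocketAddress("example.com", 80)); // host/port are placeholders

        System.out.println("Requested buffer: " + bufferBytes
                + " bytes, granted receive buffer: " + socket.getReceiveBufferSize());
        socket.close();
    }
}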
Costs: the socket fd; the kernel data structures corresponding to it; the TCP connection tuple (protocol;source:port;destination:port) and its data structures; the socket send and receive buffers; and the data structures in your code that keep track of connections.
The limit depends on your configuration but it's generally much higher than 1000.
Since sockets are made available as file descriptors, the limit can be no more than the maximum number of open files, which you can get via
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit limit;
    getrlimit(RLIMIT_NOFILE, &limit);
    /* limit.rlim_cur is the current (soft) limit for the process; it can be
       raised at runtime via setrlimit(), up to limit.rlim_max. The two are
       often already equal. */
    printf("max open files: %llu\n", (unsigned long long) limit.rlim_cur);
    return 0;
}
This hard-limit can be adjusted, but how depends on what platform you are on. Linux has /etc/security/limits.conf. I don't know about other platforms.
