Encrypted vs. unencrypted EBS volumes on AWS

We are testing a standard EBS volume and an encrypted EBS volume on an EBS-optimized m3.xlarge EC2 instance.
While analyzing the test results, we found that
the encrypted EBS volume takes less time for read, write, and read/write operations than the unencrypted volume.
I expected the encrypted volume to show higher latency because of the extra encryption overhead on every I/O request.
What would explain encrypted EBS volumes being faster than normal EBS volumes?
The expected result was that the plain EBS volume would outperform the encrypted one.
Results:
Encrypted EBS results:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Initializing random number generator from timer.
Extra file open flags: 16384
8 files, 512Mb each
4Gb total file size
Block size 16Kb
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.
Operations performed: 0 Read, 262144 Write, 8 Other = 262152 Total
Read 0b Written 4Gb Total transferred 4Gb (11.018Mb/sec)
705.12 Requests/sec executed
Test execution summary:
total time: 371.7713s
total number of events: 262144
total time taken by event execution: 2973.6874
per-request statistics:
min: 1.06ms
avg: 11.34ms
max: 3461.45ms
approx. 95 percentile: 1.72ms
Unencrypted EBS results:
sysbench 0.4.12: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 8
Initializing random number generator from timer.
Extra file open flags: 16384
8 files, 512Mb each
4Gb total file size
Block size 16Kb
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Threads started!
Done.
Operations performed: 0 Read, 262144 Write, 8 Other = 262152 Total
Read 0b Written 4Gb Total transferred 4Gb (6.3501Mb/sec)
406.41 Requests/sec executed
Test execution summary:
total time: 645.0251s
total number of events: 262144
total time taken by event execution: 5159.7466
per-request statistics:
min: 0.88ms
avg: 19.68ms
max: 5700.71ms
approx. 95 percentile: 6.31ms
Please help me resolve this issue.

That's certainly unexpected conceptually, and the documentation on Amazon EBS Encryption confirms that encryption should make little difference:
[...] and you can expect the same provisioned IOPS performance on encrypted volumes as you would with unencrypted volumes with a minimal effect on latency. You can access encrypted Amazon EBS volumes the same way you access existing volumes; encryption and decryption are handled transparently and they require no additional action from you, your EC2 instance, or your application. [...] [emphasis mine]
Amazon EBS Volume Performance provides more details on EBS performance in general. From that angle (pure speculation on my part), maybe the use of encryption implies some default pre-warming, as described in Pre-Warming Amazon EBS Volumes:
When you create any new EBS volume (General Purpose (SSD), Provisioned IOPS (SSD), or Magnetic) or restore a volume from a snapshot, the back-end storage blocks are allocated to you immediately. However, the first time you access a block of storage, it must be either wiped clean (for new volumes) or instantiated from its snapshot (for restored volumes) before you can access the block. This preliminary action takes time and can cause a 5 to 50 percent loss of IOPS for your volume the first time each block is accessed. [...]
Either way, I suggest rerunning the benchmark after pre-warming both new EBS volumes, in case you haven't done so already.
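For reference, pre-warming just means touching every block of the volume once before measuring; the usual tool is dd, but here is a minimal C sketch of the same idea. The device name /dev/xvdf is a hypothetical attachment point, so adjust it to yours (build with gcc -O2 -o prewarm prewarm.c, run as root):

/* prewarm.c -- sequentially read every block of an attached EBS volume once,
 * so each backend block has been touched before benchmarking. Sketch only. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    char *buf;
    ssize_t n;
    if (posix_memalign((void **)&buf, 4096, 1 << 20) != 0)    /* O_DIRECT needs an aligned buffer */
        return 1;
    int fd = open("/dev/xvdf", O_RDONLY | O_DIRECT);           /* hypothetical device; bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }
    while ((n = read(fd, buf, 1 << 20)) > 0)
        ;                                                       /* discard the data; we only need the touch */
    if (n < 0) perror("read");
    close(fd);
    return 0;
}

Equivalently, dd if=/dev/xvdf of=/dev/null bs=1M on each volume does the same thing; after pre-warming both volumes, rerun the identical sysbench command and compare again.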

Related

The first DB call is much slower

The first call in the morning takes 15 seconds:
FOR EACH ... NO-LOCK:
END.
while the second call takes only 1.5 seconds.
What causes this delay?
What can I log to identify it?
Even when I restart the DB I can't reproduce the behaviour of the first call.
(For complex queries I measure a difference of 15 minutes versus 2 seconds.)
The most likely cause of this is caching. There are two caches in play:
The -B buffer pool of the database, which caches database blocks in memory. It is a typical observation that queries execute much faster once this cache has warmed up after a restart of the DB server. Of course, this all depends on the size of your DB and the size of the -B buffer pool; a relatively small database may fit largely into a relatively large -B buffer pool.
The OS disk cache will also play its part in your observation (see the sketch below).
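To see the OS disk cache effect in isolation, you can time two consecutive sequential reads of the same large file; the second pass is served from the page cache. A minimal, generic C sketch (not Progress-specific, and the file name data.bin is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void read_all(const char *path) {
    char buf[1 << 16];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return; }
    while (read(fd, buf, sizeof buf) > 0)
        ;                                  /* just pull the file through the page cache */
    close(fd);
}

int main(void) {
    double t0 = now(); read_all("data.bin"); double t1 = now();
    read_all("data.bin"); double t2 = now();
    printf("cold pass: %.3f s, warm pass: %.3f s\n", t1 - t0, t2 - t1);
    return 0;
}

For a truly cold first pass, drop the page cache beforehand (echo 3 > /proc/sys/vm/drop_caches as root). The same warm-up effect is what you are seeing on the first query of the day.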

Difference between perf events for Intel processors

What is the difference between the following perf events for Intel processors?
UNC_CHA_DIR_UPDATE.HA: Counts only multi-socket cacheline Directory state updates memory writes issued from the HA pipe. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
UNC_CHA_DIR_UPDATE.TOR: Counts only multi-socket cacheline Directory state updates due to memory writes issued from the TOR pipe which are the result of remote transaction hitting the SF/LLC and returning data Core2Core. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
UNC_M2M_DIRECTORY_UPDATE.ANY: Counts when the M2M (Mesh to Memory) updates the multi-socket cacheline Directory to a new state.
The above descriptions of the perf events are taken from here.
In particular, if there is a directory update caused by a memory write request coming from a remote socket, which perf event (if any) will account for it?
As per my understanding, since the CHA is responsible for handling requests coming from remote sockets via UPI, directory updates caused by remote requests should be reflected in UNC_CHA_DIR_UPDATE.HA or UNC_CHA_DIR_UPDATE.TOR. But when I run a program (described below), the UNC_M2M_DIRECTORY_UPDATE.ANY count is much larger (more than 34M), whereas the other two events only count on the order of a few thousand. Since there are no writes happening other than those coming from the remote socket, it seems that UNC_M2M_DIRECTORY_UPDATE.ANY, and not the other two events, measures the number of directory updates caused by remote writes.
Description of the system
Intel Xeon Gold 6242 CPU (Cascade Lake architecture)
4 sockets, with each socket having PMEM attached to it
part of the PMEM is configured to be used as system RAM on sockets 2 and 3
OS: Linux (kernel 5.4.0-72-generic)
Description of the program (a C sketch follows the list):
Note: numactl is used to bind the process to node 2, which is a DRAM node
Allocate two buffers of size 1GB each
Initialize these buffers
Move the second buffer to the PMEM attached to socket 3
Perform a data copy from the first buffer to the second buffer
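Here is a minimal sketch of the program described above. The exact mechanism for "moving" the buffer is an assumption on my part (page migration via move_pages()), and the NUMA node id of the PMEM region on socket 3 is hypothetical. Build with gcc -O2 -o copytest copytest.c -lnuma and run under numactl --cpunodebind=2 --membind=2.

#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE  (1UL << 30)   /* 1 GB per buffer */
#define PMEM_NODE 5             /* hypothetical node id of the PMEM attached to socket 3 */

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) return 1;
    memset(src, 1, BUF_SIZE);   /* first touch: pages are placed on the local DRAM node */
    memset(dst, 2, BUF_SIZE);

    /* Migrate every page of the destination buffer to the PMEM-backed node. */
    size_t npages = BUF_SIZE / page;
    void **pages = malloc(npages * sizeof(void *));
    int *nodes   = malloc(npages * sizeof(int));
    int *status  = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++) {
        pages[i] = dst + i * page;
        nodes[i] = PMEM_NODE;
    }
    if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");

    /* The copy: every store to dst is now a write to remote (PMEM) memory. */
    memcpy(dst, src, BUF_SIZE);
    return 0;
}

Counting the three uncore events while the memcpy runs should make it clear which one scales with the remote writes.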

Does NUMA impact memory bandwidth, or just latency?

I have a problem that is memory bandwidth limited -- I need to read a lot (many GB) of data sequentially from RAM, do some quick processing and write it sequentially to a different location in RAM. Memory latency is not a concern.
Is there any benefit from dividing the work between two or more cores in different NUMA zones? Equivalently, does working across zones reduce the available bandwidth?
For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on details of the remote accesses.
In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)
The STREAM Benchmark, for example, is typically run in a configuration that allows almost all accesses from every thread to be "local". This is implemented by assuming "first touch" NUMA placement -- when allocated memory is first written, the OS has to create mappings from the process virtual address space to physical addresses, and (by default) the OS chooses physical addresses that are in the same NUMA node as the core that executed the store instruction.
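As an illustration of the first-touch placement described above, here is a sketch of how such an array is typically initialized in a threaded C/OpenMP program (assuming the default kernel NUMA policy; compile with -fopenmp):

#include <stdlib.h>
#include <omp.h>

/* Allocate an array and let each thread write ("first touch") the chunk it
 * will later work on, so those pages end up on that thread's NUMA node. */
double *alloc_first_touch(size_t n) {
    double *a = malloc(n * sizeof(double));
    if (!a) return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;             /* the first write decides each page's NUMA node */
    return a;
}

If a single thread initialized the whole array instead, all of its pages would land on that thread's node and the threads on the other socket would only ever make remote accesses.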
"Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips. The overall ratio of local to remote bandwidth also varies significantly across processor generations. For some processors (e.g., Xeon E5 v3 and probably v4), the interconnect is relatively fast, so jobs with poor locality can often be run with all of the memory interleaved between the two sockets.
Local bandwidths have increased significantly since then, with more recent processors generally strongly favoring local access.
Example from the Intel Xeon Platinum 8160 (2 UPI links between chips):
Local Bandwidth for Reads (each socket) ~112 GB/s
Remote Bandwidth for Reads (one-direction at a time) ~34 GB/s
Local bandwidth scales perfectly in two-socket systems, and remote bandwidth also scales very well when using both sockets (each socket reading data from the other socket).
It gets more complicated with combined read and write traffic between sockets, because the read traffic from node 0 to node 1 competes with the write traffic from node 1 to node 0, etc.
Local Bandwidth for 1R:1W (each socket) ~101 GB/s (reduced due to read/write scheduling overhead)
Remote Bandwidth for 1R:1W (one socket running at a time) ~50 GB/s -- more bandwidth is available because both directions are being used, but this also means that if both sockets are doing the same thing, there will be conflicts. I see less than 60 GB/s aggregate when both sockets are running 1R:1W remote at the same time.
Of course different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if the threads are doing local accesses at the same time, then remote accesses at the same time, there will be more contention in the remote access portion (compared to a case in which the threads are doing their remote accesses at different times).

Increase throughput in a network-load-heavy Storm topology with DynamoDB

I have a topology running on AWS. I use m3.xlarge machines with 15 GB RAM and 8 supervisors. My topology is simple; I read from
kafka spout -> [db o/p1] -> [db o/p2] -> [dynamo fetch] -> [dynamo write & kafka write] kafka
The DB operations are conditional, with latency around 100-150 ms.
But I have never been able to achieve a throughput of more than 300 msgs/sec.
What configuration changes are needed so that I can get a throughput of more than 3k msgs/sec?
The dynamo fetch bolt execute latency is around 150-220 ms,
and the dynamo read bolt execute latency is also around this number.
Four bolts with parallelism 90 each and one spout with parallelism 30 (30 Kafka partitions).
Overall latency is greater than 4 seconds.
topology.message.timeout.secs: 600
worker.childopts: "-Xmx5120m"
no. of worker ports per machine : 2
no of workers : 6
no of threads : 414
executor send buffer size 16384
executor receive buffer size 16384
transfer buffer size: 34
no of ackers: 24
Looking at the console snapshot I see...
1) The overall latency for the Spout is much greater than the sum of the execute latencies of the bolts, which implies that there's a backlog on one of the streams, and
2) The capacity for SEBolt is much higher than that of the other bolts, implying that Storm feels the need to run that bolt more than the others
So I think your bottleneck is the SEBolt. Look into increasing the parallelism hint on that one. If the total number of tasks is getting too high, reduce the parallelism hint for the other bolts to offset the increase for SEBolt.

Munin Graphs meaning

I've been using Munin for some days and I find the information very interesting, but I don't understand some of the graphs and how they can be read to get information for improving the system.
The ones I don't understand are:
Disk
Disk throughput per device
Inode usage in percent
IOstat
Firewall Throughput
Processes
Fork rate
Number of threads
VMstat
System
Available entropy
File table usage
Individual interrupts
Inode table usage
Interrupts and context switches
Thanks!
Munin creates graphs that enable you to see trends. This is very useful for checking whether a change you made negatively impacts the performance of the system.
Disk
Disk throughput per device & IOstat
The amount of data written to or read from a disk device. Disks are always slow compared to memory. A lot of disk reads could, for example, indicate that your database server doesn't have enough RAM.
Inode usage in percent
Every filesystem has an index where information about the files is stored, such as name, permissions, and location on the disk. With many small files, the space available for this index can run out. If that happens, no new files can be saved to that filesystem, even if there is enough free space on the device.
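If you want to check the same number from the command line, df -i reports it per filesystem; the percentage can also be derived with statvfs(), as in this small C sketch (not Munin's actual plugin):

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "/";
    struct statvfs s;
    if (statvfs(path, &s) != 0) { perror("statvfs"); return 1; }
    unsigned long total = s.f_files, avail = s.f_ffree;   /* total and free inodes */
    if (total == 0) { printf("%s: filesystem reports no inode limit\n", path); return 0; }
    printf("%s: %.1f%% of %lu inodes used\n",
           path, 100.0 * (total - avail) / total, total);
    return 0;
}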
Firewall Throughput
Just like it says, the number of packets going through the iptables firewall. Often this firewall is active on all interfaces on the system. This is only really interesting if you run Munin on a router/firewall/gateway system.
Processes
Fork rate
Processes are created by forking an existing process into two processes. This is the rate at which new processes are created.
Number of threads
The total number of processes running in the system.
VMstat
Usage of CPU time.
running: time spent running non-kernel code
I/O sleep: time spent waiting for I/O
System
Available entropy: a measure of the random data available in the kernel's entropy pool (exposed via /proc/sys/kernel/random/entropy_avail). Random numbers are needed, for example, to create SSL connections. If you create a large number of SSL connections, this randomness pool could possibly run out of real random numbers.
File table usage
The total number of files open in the system. If this number suddenly rises, there might be a program that is not releasing its file handles properly.
