Reservoir sampling: evict specific items from stream?

In the reservoir sampling algorithm, we always evict random items as new items arrive from the stream.
Is there a way to evict the smallest sampled items (instead of random items)?
What are the ways to do that?
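To make the question concrete, here is a minimal sketch of both behaviours. It assumes that "evict the smallest" means always keeping the k largest items seen so far; the function names are illustrative only. Under that assumption the result is a deterministic top-k selection, not a random sample.

```python
import heapq
import random


def reservoir_sample(stream, k, rng=random):
    """Algorithm R: uniform random sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a random
            # reservoir entry only if the slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def keep_largest(stream, k):
    """Assumed variant: always evict the smallest kept item."""
    heap = []  # min-heap, so heap[0] is the smallest kept item
    for item in stream:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Evict the current smallest and insert the new item.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)
```

For example, keep_largest([5, 1, 9, 3, 7, 2], 3) keeps 9, 7 and 5, while reservoir_sample returns any 3 of the 6 items with equal probability.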

Related

Does DynamoDB latency depend on number of items per partition

I'm new to DynamoDB (DDB), although I've been using a DDB table for a year now. Recently, I compressed the payload using gzip (storing it as a binary attribute) and started writing the new data to a newly created beta table. Overall compression was 3x. I expected the read latency (GetItem) to improve as well, since less data is transported over the wire. However, the read latency has increased from ~50 ms p99.9 to ~114 ms p99.9. I'm not sure how that happened, and I wonder whether it is because compression leaves me with many more rows per partition (which I think is defined as <= 10 GB): I now have 3-4x more rows per partition. Once DynamoDB determines the right partition for a partition key, how does it find the correct item within that partition? My gut feeling is that this shouldn't increase latency, since a simplified representation of a partition could be a giant hashmap, making it a simple lookup. I'd appreciate any help here.
My DDB schema:
partition-key - user-id,dataset-name
range-key - update-timestamp
payload - used to be string, now is compressed/binary.
In my GetItem requests, I specify both partition key and range key.
According to your description, your change included two unrelated parts: you compressed the payload, and you increased the number of items per partition. The first change, the compression, probably has little effect on the p99 latency. It could have a more noticeable effect on the mean latency (which, according to Little's Law, is related to throughput if your client has fixed concurrency), but I'd expect it to go down, not up.
Some guesses as to what might have increased the p99 latency:
More items per partition means that DynamoDB (which uses a B-tree) needs to do more disk reads to find a specific item. Since each disk access has rare delays caused by queueing, this adds to the tail latency.
You said that the change caused each partition to hold more items; I guess this means you now have fewer partitions. If you have too few of them, you can start getting unbalanced load across the DynamoDB partitions, with more contention and latency on specific "hot" partitions.
I don't know how you measure your latency. Your client now needs (I guess) to uncompress the returned result, so maybe it is now busier, adding queueing delays in the client? Can you lower your client's concurrency (how many client threads run in parallel) and see whether the high tail latency is an artifact of the server design or of the client's design?

LMDB: How to store large value sizes efficiently

I have been using LMDB to store key-value pairs where the value sizes are on the order of 200 bytes. I am running into scenarios where value sizes could grow up to 8 KB or more.
According to https://lmdb.readthedocs.io/en/release/#storage-efficiency-limits and https://github.com/lmdbjava/benchmarks/blob/master/results/20160710/README.md, LMDB is most efficient for value sizes in page-size (4096-byte) increments; otherwise it can lead to fragmentation due to overflow pages.
My main questions are:
Do I need to break down my value into page size increments for optimal performance?
Are lexicographic sorted keys in LMDB placed in adjacent pages? Let's say my value is about 14KB and I break it down into 8K, 4K and 2K chunks, with key values :key_chunk1, key_chunk_2, key_chunk_3, will they be in adjacent pages? Let's say the last chunk (The 2KB value) is on a new page, and the next lexicographically sorted key is of 4K, will this be in a new page as it cannot fit in the existing page?
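As a rough way to reason about the overflow-page cost, the sketch below assumes the model described in the first linked page: a large value is written to dedicated pages of 4096 bytes, preceded by a 16-byte header, and the unused tail of the last page cannot be shared. The page size, header size, and function names are assumptions for illustration, not LMDB API.

```python
import math

PAGE_SIZE = 4096      # assumed OS page size in bytes
OVERFLOW_HEADER = 16  # assumed header size, per the linked py-lmdb docs


def overflow_pages(value_size, page_size=PAGE_SIZE, header=OVERFLOW_HEADER):
    """Number of dedicated pages needed for one large value."""
    return math.ceil((value_size + header) / page_size)


def wasted_bytes(value_size, page_size=PAGE_SIZE, header=OVERFLOW_HEADER):
    """Unused bytes left at the end of the last overflow page."""
    pages = overflow_pages(value_size, page_size, header)
    return pages * page_size - (value_size + header)
```

Under these assumptions an 8192-byte value needs 3 pages and wastes 4080 bytes, while a value of 8176 bytes (8192 minus the header) fits exactly in 2 pages.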

Does the Limit option in an AWS DynamoDB query limit the capacity units used?

I have a question.
If I have 1000 items with the same partition key in a table and I query that partition key with a limit of 10, does it consume read capacity units for all 1000 items or for just 10 items?
Please clarify.
I couldn't find the exact point in the DynamoDB documentation. In my experience, the consumed capacity is based only on the items read up to the limit, which is 10 (not 1000).
You can quickly check this yourself: specify the ReturnConsumedCapacity parameter in
a Query request to obtain this information.
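A minimal sketch of such a request with boto3's low-level client is shown here. The helper name, table name, key name, and key value are hypothetical, and the partition key is assumed to be a string attribute; only the request is built locally, so the actual call is left in comments.

```python
def build_query_kwargs(table_name, pk_name, pk_value, limit):
    """Build Query parameters that also report consumed capacity."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": pk_name},
        # Assumes the partition key is a string ("S") attribute.
        "ExpressionAttributeValues": {":pk": {"S": pk_value}},
        "Limit": limit,
        "ReturnConsumedCapacity": "TOTAL",
    }


# Usage against a real table (requires AWS credentials):
#   import boto3
#   client = boto3.client("dynamodb")
#   response = client.query(**build_query_kwargs("my-table", "user_id", "u-123", 10))
#   print(response["ConsumedCapacity"]["CapacityUnits"])
```

Running the query once with Limit set to 10 and once without it, and comparing the reported capacity units, should settle the question for a given table.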
The limit option limits the number of results returned. The capacity consumed depends on the size of the items and on how many of them are accessed to produce the results. I say accessed because, if you have filters in place, items that get filtered out still consume capacity, so more capacity may be consumed than the returned items alone would account for.
The reason I mention this is that, for queries, each 4 KB of data read consumes 1 read capacity unit.
Why is this important? Because if your items are small, then for each capacity unit consumed you could return multiple items.
For example, if each item is 200 bytes in size, you could be returning up to 20 items for each capacity unit.
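The arithmetic above can be sketched as follows, assuming a Query sums the sizes of the items it reads and rounds the total up to the next 4 KB (taken here as 4096 bytes), with eventually consistent reads costing half. The function name is illustrative.

```python
import math


def query_rcu(item_sizes, strongly_consistent=True):
    """Estimate read capacity units consumed by one Query."""
    total_bytes = sum(item_sizes)
    # The total size read is rounded up to the next 4 KB boundary.
    units = math.ceil(total_bytes / 4096)
    return units if strongly_consistent else units / 2
```

With 200-byte items, reading 10 or 20 items costs 1 unit, a 21st item pushes the total past 4 KB and costs 2, and reading all 1000 items would cost 49.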
According to the AWS documentation:
The maximum number of items to evaluate (not necessarily the number of matching items). If DynamoDB processes the number of items up to the limit while processing the results, it stops the operation and returns the matching values up to that point, and a key in LastEvaluatedKey to apply in a subsequent operation, so that you can pick up where you left off.
It seems to me that it means that it will not consume the capacity units for all the items with the same partition key. According to your example the consumed capacity units will be for your 10 items.
However since I did not test it I cannot be sure, but that is how I understand the documentation.

On what basis can we calculate the number of partitions in Hazelcast?

I want to calculate an optimum number of partitions for my Hazelcast cluster, but I am unable to find a parameter to base this calculation on.
I am not sure whether the default partition count of 271 is sufficient.
For simplicity's sake, if I assume that my cluster will have about 50 million entries split across 50 nodes, what would be the ideal number of partitions, and how do I derive this number?
Thank you,
Dilish
A partition shouldn't be bigger than 50-100 MB; 50 MB is better, so that migration is fast when scaling or after a failure. If the size is OK, the count mainly depends on the number of configured partition threads. In general, more partitions per node is better for scalability. However, if you use EntryProcessors a lot, you also want to increase the number of partition threads, to make sure partitions won't block each other (most often multiple partitions share a single partition thread). Last but not least, you should round up to the next higher prime number for statistical distribution.
In terms of performance you can also think of it like this: how many threads can I run? Then pick a prime number that is, let's say, 10 times bigger.
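The sizing rule above can be sketched like this. The average entry size is not given in the question, so the 1 KB figure in the note below is purely an assumption, and the helper names are illustrative.

```python
import math


def is_prime(n):
    """Trial-division primality check, fine for partition-sized numbers."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def next_prime(n):
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def suggest_partition_count(entries, avg_entry_bytes, max_partition_mb=50):
    """Partitions needed to keep each partition under max_partition_mb."""
    total_bytes = entries * avg_entry_bytes
    needed = math.ceil(total_bytes / (max_partition_mb * 1024 * 1024))
    return next_prime(max(needed, 2))
```

With 50 million entries and an assumed 1 KB per entry, suggest_partition_count(50_000_000, 1024) gives 977. The thread-based rule of thumb for, say, 16 threads gives next_prime(10 * 16), which is 163.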

How to calculate Read Capacity Unit and Write Capacity Unit for DynamoDB

How do I calculate RCU and WCU given a read throughput of 32 GB/s and a write throughput of 16 GB/s?
DynamoDB Provisioned Throughput is based upon a certain size of units, and the number of items being written:
In DynamoDB, you specify provisioned throughput requirements in terms of capacity units. Use the following guidelines to determine your provisioned throughput:
One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for items up to 4 KB in size. If you need to read an item that is larger than 4 KB, DynamoDB will need to consume additional read capacity units. The total number of read capacity units required depends on the item size, and whether you want an eventually consistent or strongly consistent read.
One write capacity unit represents one write per second for items up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB will need to consume additional write capacity units. The total number of write capacity units required depends on the item size.
Therefore, when determining your desired capacity, you need to know how many items you wish to read and write per second, and the size of those items.
Rather than seeking a particular GB/s, you should be seeking a given number of items that you wish to read/write per second. That is the functionality that your application would require to meet operational performance.
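As a sketch of that calculation, the functions below turn an item size and a request rate into capacity units, following the 4 KB read and 1 KB write rules quoted above. The function names and the sample numbers in the note are illustrative only.

```python
import math


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=True):
    """RCUs needed to read items of a given size at a given rate."""
    units_per_read = math.ceil(item_bytes / 4096)  # 4 KB per unit
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_bytes, writes_per_second):
    """WCUs needed to write items of a given size at a given rate."""
    return math.ceil(item_bytes / 1024) * writes_per_second  # 1 KB per unit
```

For example, 100 strongly consistent reads per second of 6000-byte items need 200 RCUs (100 if eventually consistent), and 50 writes per second of 1500-byte items need 100 WCUs.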
There are also some DynamoDB limits that would apply, but these can be changed upon request:
US East (N. Virginia) Region:
Per table – 40,000 read capacity units and 40,000 write capacity units
Per account – 80,000 read capacity units and 80,000 write capacity units
All Other Regions:
Per table – 10,000 read capacity units and 10,000 write capacity units
Per account – 20,000 read capacity units and 20,000 write capacity units
At 40,000 read capacity units x 4KB x 2 (eventually consistent) = 320MB/s
If my calculations are correct, your requirements are 100x this amount, so it would appear that DynamoDB is not an appropriate solution for such high throughputs.
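The estimate above can be reproduced with a few lines. This is only a sketch of the same arithmetic, using 1 MB = 1000 KB as the answer does; the function name is illustrative.

```python
def max_read_throughput_mb_per_s(rcu, eventually_consistent=True):
    """Upper bound on read throughput for a given number of RCUs."""
    # Each RCU covers 4 KB per second, doubled for eventually consistent reads.
    kb_per_s = rcu * 4 * (2 if eventually_consistent else 1)
    return kb_per_s / 1000


ceiling = max_read_throughput_mb_per_s(40_000)  # per-table limit quoted above
required = 32 * 1000                            # 32 GB/s expressed in MB/s
shortfall = required / ceiling                  # how many times over the ceiling
```

Here ceiling is 320 MB/s and shortfall is 100, matching the figures in the answer.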
Are your speeds correct?
Then comes the question of how you are generating so much data per second. A full-duplex 10GFC fiber runs at 2550MB/s, so you would need multiple fiber connections to transmit such data if it is going into/out of the AWS cloud.
Even 10Gb Ethernet only provides 10Gbit/s, so transferring 32GB would require 28 seconds -- and that's to transmit one second of data!
Bottom line: Your data requirements are super high. Are you sure they are realistic?
If you click on the Capacity tab of your DynamoDB table, there is a capacity calculator link next to Estimated cost. You can use it to determine the read and write capacity units along with the estimated cost.
Read capacity units depend on the type of read that you need (strongly consistent or eventually consistent), the item size, and the throughput that you want.
Write capacity units are determined by throughput and item size only.
