Get an object from a bucket in riak without knowing its key - riak

I am using a riak bucket to store a list of messages, using a UUID as the key and a json message as value. This is working fine.
What I need is an efficient way to get a single message from the bucket without knowing its key, at least in one of these two scenarios:
Get the last inserted object (this is my preferred approach).
Get a random object from the bucket (if the first alternative is not possible).
Is there any efficient way to achieve that?
I think one alternative could be to retrieve the keys in the bucket and then get the first one. But this means making two calls to riak, one to obtain all the keys (just to discard all but one) and a second one to obtain the object. It does not seem very efficient.

As Riak is a key-value store, by far the most efficient way to retrieve data is by key. Listing or retrieving all keys in a bucket, even if you only end up using the one returned first, is one of the least efficient operations you can perform, as it causes Riak to scan ALL keys in the system (not just the bucket). It is usually recommended NEVER to use this on a production system.
The most efficient way to get the last inserted object would probably be to store its id in a separate, known record in a different bucket. This would require two writes on every insert and two reads on every read, but each of these is a direct key operation, which is the most efficient kind. You could possibly implement a post-commit hook on the bucket containing the messages to have the system perform the update for you, which would remove the need for the client to issue the second write. The hook would have to be written in Erlang, as it is currently not possible to write records from JavaScript functions.
If you write a lot of data to the bucket containing messages, you may want to adjust the separate bucket so that it does not allow multiple values and that the last value wins. This way you would reduce the risk of having lots of siblings created due to frequent updates to this single record across the system. This would always give you one of the last written records, but not necessarily the last one (especially if you frequently write messages to the database), as Riak does not support any type of atomicity and is an eventually consistent database.
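A minimal sketch of the pointer-record idea, in Python. The `store` dictionary is an in-memory stand-in for Riak, not a Riak client API, and the bucket and key names (`messages`, `meta`, `last_message`) are hypothetical. With a real client, each dictionary access below would be one network read or write.

```python
import json
import uuid

# In-memory stand-in for Riak: maps (bucket, key) -> stored value.
# This is an assumption for illustration, not a Riak client API.
store = {}

POINTER_BUCKET = "meta"        # hypothetical bucket holding the pointer record
POINTER_KEY = "last_message"   # hypothetical, well-known key


def insert_message(message):
    """Write the message, then record its key in the known pointer record."""
    key = str(uuid.uuid4())
    store[("messages", key)] = json.dumps(message)   # write 1: the message
    store[(POINTER_BUCKET, POINTER_KEY)] = key       # write 2: the pointer
    return key


def get_last_message():
    """Read the pointer record, then fetch the message it refers to."""
    key = store.get((POINTER_BUCKET, POINTER_KEY))   # read 1: the pointer
    if key is None:
        return None                                  # nothing inserted yet
    return json.loads(store[("messages", key)])      # read 2: the message
```

In this single-process sketch the pointer always names the latest message. Against a real cluster with concurrent writers it would only name one of the most recent ones, for the reasons given above.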
You could also create one or more secondary indexes if you are using the leveldb backend, and use these to limit your scan to recent records only, which would be more efficient than a scan of all keys. You could then select either the most recent key or a random one through mapreduce, but this would be much less efficient than the previously described approach.
I cannot think of any efficient way to retrieve a random record in a bucket from Riak unless you know the range of keys you have inserted and can decide randomly on the client which one to get. One way to do this would be to generate all keys in sequence rather than using a UUID, but that is naturally not a good idea in a highly concurrent distributed system.
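To illustrate the client-side random choice, here is a minimal sketch. It assumes keys were generated as a hypothetical prefix (`msg-`) followed by a sequential integer with no gaps, and that the client knows the first and last id in use. None of these names come from Riak itself.

```python
import random


def random_key(first_id, last_id, prefix="msg-"):
    """Pick one key uniformly from a known, gap-free range of sequential ids."""
    if last_id < first_id:
        raise ValueError("empty key range")
    # randint includes both endpoints, so every id in the range can be chosen.
    return f"{prefix}{random.randint(first_id, last_id)}"
```

The returned key can then be fetched with a single get. The hard part, as noted, is maintaining the sequence safely under concurrent writers.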

The first task is pretty easy to implement:
Add a post-commit hook that writes the last inserted key to a predefined bucket/key location
Get the key from that predefined bucket/key location and issue a get query using it
It's still two operations, but both are plain gets, which are fast. The hook adds some overhead, but nothing too heavy.
The second scenario is also easy, but far too inefficient to be practical:
Get all keys (extremely expensive operation)
Pick a random one
Issue a get for it

I have come across the same scenario. In my case I have to save users, and for that I needed an auto-increment id. So I placed the last inserted key in a separate bucket, as mentioned by "Christian Dahlqvist". That bucket holds only one value, under the key "LastKey", which is always known to us. Every time I want to insert a new record, I fetch the last inserted key from that bucket, increment it to get the new key, and then write the new key back. This way the key bucket always contains the latest key.

Related

I want to increase number of records read using queryPage in dynamoDB

I have a requirement where I need to get only a certain attribute from the matching records when querying a DynamoDB table. I have used withSelect(Select.SPECIFIC_ATTRIBUTES).withProjectionExpression(<attribute_name>) to get that attribute. But the number of records read by the queryPage operation is the same in both cases (1. using withSelect and 2. without using withSelect). The only advantage of using withSelect is that the operations are processed very quickly, but this in turn causes a lot of DynamoDB reads. Is there any way I can read more records in a single query, thereby reducing my number of DB reads?
The reason you are seeing that the number of reads is the same is due to the fact that projection expressions are applied after each item is retrieved from the storage nodes, but before it is collected into the response object. The net benefit of projection expressions is to save network bandwidth, which in turn can save latency. But it will not result in consumed capacity savings.
If you want to save consumed capacity and be able to retrieve more items per request, your only options are:
create an index and project only the attributes you need to query; this can be a local secondary index, or a global secondary index, depending whether you need to change the partition key for the index
try to optimize the schema of your data stored in the table; perhaps you can compress your items, or just generally work out encodings that result in smaller documents
Some things to keep in mind if you do decide to go with an index: a local secondary index would probably work best in your example but you would need to create a new table for that (local secondary indexes can only be created when you create the table); a global secondary index would also work but only if your application can tolerate eventually consistent reads on the index (and of course, there is a higher cost associated with these).
Read more about using indexes with DynamoDB here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-indexes.html

How can one get count, with a where condition in DynamoDB

Let us say we have a situation where, instead of getting the total count of a table, we need the count of records with a particular status.
We know DynamoDB is schemaless and still has to count each record one by one to get the total count.
Given that, how can we meet this need using DynamoDB queries?
While normally "Query" or "Scan" requests return all the matching items, you can pass the Select=COUNT parameter and ask to retrieve only the number of matching items, instead of the actual items. But before you go doing that, there are a few things you should know:
DynamoDB will still be reading - and you will still be paying for - all the data, even if just for being counted. Doing a "Scan" with a filter is in almost all cases out of the question, because it will read the entire data set every time. With a "Query" you can ask to read just one partition, or a contiguous range of sort-keys in one partition, which in some cases may be reasonable enough (but please think if it is, in your use case).
Even if you're not actually reading the data, and just counting, DynamoDB still does Scan and Query with "paging", i.e., your read request will read just 1MB of data from disk, return you the partial count, and ask you to submit another request to resume the scan. Your DynamoDB library probably has a way to automate this resumption, so for example it can run thousands or whatever number of queries are needed until it finally finishes the scan and calculates the total sum.
In some cases, it may make sense to maintain a counter in addition to the data. Writes will be more expensive (e.g., each write adds data and increments the counter), but reads that need this counter will be hugely cheaper - so it all depends on how much of each your workload needs.
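The paging behaviour described above can be sketched as a loop that sums partial counts. This is a minimal sketch in Python, assuming a client whose `query` method accepts `Select` and `ExclusiveStartKey` and returns `Count` and `LastEvaluatedKey` the way the DynamoDB Query API does. `FakePagedClient` is a hypothetical stand-in that serves canned pages so the example runs without AWS.

```python
def count_matching(client, **query_kwargs):
    """Sum the partial counts of a paged Query issued with Select=COUNT."""
    total = 0
    kwargs = dict(query_kwargs, Select="COUNT")
    while True:
        page = client.query(**kwargs)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return total                          # no more pages to read
        kwargs["ExclusiveStartKey"] = last_key    # resume where the page ended


class FakePagedClient:
    """Stand-in for a DynamoDB client, for illustration only."""

    def __init__(self, page_counts):
        self._page_counts = page_counts
        self.calls = []

    def query(self, **kwargs):
        self.calls.append(kwargs)
        index = kwargs.get("ExclusiveStartKey", {}).get("page", 0)
        page = {"Count": self._page_counts[index]}
        if index + 1 < len(self._page_counts):
            page["LastEvaluatedKey"] = {"page": index + 1}
        return page
```

With a real client you would also pass the table name and key condition through `query_kwargs`. Every page still consumes read capacity, so the loop only saves bandwidth, not cost.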

Fetch new entities only

I thought Datastore's key was ordered by insertion date, but apparently I was wrong. I need to periodically look for new entities in the Datastore, fetch them and process them.
Until now, I would simply store the last fetched key and wrongly query for anything greater than it.
Is there a way of doing so?
Thanks in advance.
Datastore's automatically generated keys are uniformly distributed, in order to make lookups more performant. You will not be able to tell from the keys which entities were added last.
Instead, you can try a couple of different approaches.
Use Pub/Sub and architect your app so that a background task consumes the newly added entities. When you add an entity to the DB, you publish a new event to Pub/Sub with the key id. Your event listener (a separate routine) will receive it.
Use key names and generate your own custom names. But since you want sequentially growing names, this will cause a performance hit even on fairly small ranges of data. You can find more about this in the Best Practices for Google Datastore.
https://cloud.google.com/datastore/docs/best-practices#keys
You can add an additional creation-time property and still use automatic key generation.
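As a rough illustration of the creation-time approach, here is a minimal in-memory sketch in Python. It is not Datastore client code: `entities` is a plain list of dictionaries, and the property name `created` is hypothetical. In a real app the filter and sort would be a query on that property.

```python
def fetch_new(entities, checkpoint):
    """Return entities created after `checkpoint` (oldest first) and the new checkpoint."""
    fresh = sorted(
        (e for e in entities if e["created"] > checkpoint),
        key=lambda e: e["created"],
    )
    # Advance the checkpoint to the newest entity seen, or keep it unchanged.
    new_checkpoint = fresh[-1]["created"] if fresh else checkpoint
    return fresh, new_checkpoint
```

Each periodic run stores the returned checkpoint and passes it to the next run. Note that an entity sharing exactly the checkpoint's timestamp would be skipped by the strict comparison, so the timestamps need enough resolution for your write rate.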

Retrieving Random Single Items in Dynamo

We are trying to get our heads wrapped around a design question, which is not really easy in any DB. We have 100,000 random items (could be a lot more; we are talking about a truly random key, we'll use UUIDs) and we want to hand them out one at a time. Order is not important. We are thinking that we'll create a Dynamo table of the items, then delete them from that table as they are assigned. We can do a conditional delete to make sure that we have not already given the item away. But when trying to find an item in the first place, if we do a scan or a query with a limit of 1, will it always hit the same first available record? I'm wondering what the ramifications are. Dynamo will shard on the UUID. We are worried about everyone trying to hit the same record all the time. The first one would of course get deleted, then they could all hit the second one, etc.
We could set up a memcache/redis instance in ElastiCache and keep a list of the available UUIDs in there. We can do a random select of items from this using redis SPOP, which gets a random item and deletes it. We might have a problem where the two get out of sync, but for the most part this would work.
Any thoughts on how to do this without the cache would be great. If dynamo does scans starting at different points, that would be dandy.
I have the same situation as you: a set of millions of UUIDs as keys in DynamoDB, and I need to randomly select some of them in an API call. For performance and ease of implementation, I did use Redis as you said.
add the UUID to a Set in Redis
when the call comes, SPOP a UUID from the set
delete the item with that UUID from DynamoDB
The performance of the Scan operation is bad, so you should try to avoid it as much as you can.
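The three steps above can be sketched as follows. This is a minimal sketch in Python, assuming a Redis client with `sadd` and `spop` methods (as redis-py provides) and a hypothetical `delete_item` callable that removes the item from DynamoDB. `InMemorySet` and the set name `available_ids` are stand-ins so the example runs without a Redis server.

```python
import random


def hand_out_item(redis_client, delete_item, set_name="available_ids"):
    """Pop a random id from the Redis set, then delete that item from the table."""
    item_id = redis_client.spop(set_name)
    if item_id is None:
        return None              # nothing left to hand out
    delete_item(item_id)         # hypothetical callable wrapping the DynamoDB delete
    return item_id


class InMemorySet:
    """Stand-in for a Redis client, for illustration only."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        self._sets.setdefault(name, set()).update(values)

    def spop(self, name):
        members = self._sets.get(name)
        if not members:
            return None
        value = random.choice(sorted(members))   # remove and return a random member
        members.remove(value)
        return value
```

Because SPOP removes the member as it returns it, two concurrent callers never receive the same id. If the DynamoDB delete then fails, the id is gone from Redis but still in the table, which is the out-of-sync case mentioned in the question.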

Maximum records can be stored at Riak database

Can anyone give an example of the maximum record limit in a Riak database, with specific hardware details? Please help me with this. I'm going to build a CDR information system. Would Riak be a suitable choice of database?
Riak uses the 2^160 SHA-1 hash value to identify the partitions to store data in. Data is then stored in the identified partitions based on the bucket and key name. The size of the hash space is therefore not related to the amount of data that can be stored. Two different objects that happen to hash to the same value will therefore not overwrite each other.
When working with Riak, it is important to model your data correctly and consider how it needs to be retrieved and queried during the design process. Ideally you should try to ensure that the vast majority of your queries can be done through direct key access. It is often recommended to de-normalise your data and use natural keys. For CDRs this may mean creating an object holding all CDRs for a subscriber per day. These objects can be named based on the subscriber id and date, making it easy to retrieve data directly by key. It is also often more efficient to retrieve a few larger objects than many small ones and perform filtering in the application rather than try to just get the exact data that is needed. I have described this approach in greater detail here.
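As an illustration of natural keys for the CDR case, here is a minimal sketch in Python. The key format (subscriber id, underscore, ISO date) and the function names are assumptions made for this example, not a Riak convention.

```python
from datetime import date, timedelta


def cdr_key(subscriber_id, day):
    """Build the key of the object holding one subscriber's CDRs for one day."""
    return f"{subscriber_id}_{day.isoformat()}"


def cdr_keys(subscriber_id, first_day, last_day):
    """List the keys covering an inclusive range of days for one subscriber."""
    days = (last_day - first_day).days
    return [
        cdr_key(subscriber_id, first_day + timedelta(days=n))
        for n in range(days + 1)
    ]
```

Because the keys can be computed from what the application already knows, a query such as "all CDRs for this subscriber last week" becomes seven direct gets, with any finer filtering done in the application.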
The limit to the number of records (or key/value pairs) you can store in Riak is governed only by the size of the hash space: 2^160. According to WolframAlpha, this is the number:
1461501637330902918203684832716283019655932542976
In other words, go nuts. :)
