I am storing messages into a riak DB with sequence numbers as keys. When I am retrieving the messages, I used to sort all the keys and get each message in sorted order. Unfortunately that method has become non viable. Therefore I was wondering if riak itself timestamps each write, and if it is possible to retrieve messages in the order that they were written. I tried googling around but couldn't find anything.
No, Riak does not store the time a key was stored in any query-able form. If you are storing the messages in a JSON format that includes a time, you could use Riak Search to retrieve them by time. If you are using the LevelDB backend, you could add a secondary index to each item and retrieve them that way. Both of these methods only update the index when the key is stored, so it won't help already stored data, unless you read and rewrite it.
Related
I am preparing for an interview.
At Geeks for Geeks I found the following report of an interview:
Asked me a technical question for implementing a database with key value pairs that can be stored efficiently where the key and value pairs are only valid for a specific time. Modified the version of the question every time to make it more efficient. It includes implementation of hashing, dictionary, garbage collection, and heap.
I cannot figure out the part where we have to make sure the key value pair is valid for a specific time.
I am assuming the first bit where a database with key value pairs is to be implemented is a dictionary/hashtable but I can only make that much out. Looking for someone to help me out with the rest.
"Database" means durability, and to implement that efficiently, you're going to use an append-only transaction log combined with asynchronously updated snapshots of the dictionary. The in-memory dictionary is updated as updates are written to the log, and when the system is restarted, the dictionary is reconstituted from the latest available snapshot and any logged updates that it's missing.
Expiry is easily implemented in a system like that, and actually makes it simpler. In addition to updating the in-memory dictionary as updates are logged, you also process the log on a delay equal to the expiry time, and delete any records corresponding to delayed updates, if they are still current. You don't even need the snapshots in this scheme, since the dictionary can be reconstituted from only the recent portion of the log.
You don't often get to take use an expiry mechanism like this in real life, though, because they usually let you specify a different expiry time for each record. It will usually boil down to a similar kind of process involving multiple "expiry buckets", though.
If you want a lot of detail, you should look online for design information about memcache and redis.
From what I understand one of the uses cases of DynamoDB Streams is to maintain/update duplicated data.
Let's say I have a User object, and its name attribute is replicated in many Invoice objects.
When a User edits/updates its name, I will have a lambda using DynamoDb Streams to then update all Invoices related to this user with his new name.
There could be thousands of Invoices related to this user so this updating could take a while, specially because I will want to do a rate limited batch_write so that this operation doesn't throttle my table.
The question is : How can my (web)application know that the lambda has finished updating? For example, I want to show a loading screen to the client using the application untill the duplicated data updating is done, so that he doesn't see any outdated information on his browser.
Or is there other ways of rapidly dealing with updating thousands of duplicated data?
Why aren't you capturing the output of Lambda. You can make Lambda return successful status, once all the updates are persisting to DDB.
Invoice can keep a reference to User object instead of storing the exact name and can fetch name at the time of generating/printing
I thought Datastore's key was ordered by insertion date, but apparently I was wrong. I need to periodically look for new entities in the Datastore, fetch them and process them.
Until now, I would simply store the last fetched key and wrongly query for anything greater than it.
Is there a way of doing so?
Thanks in advance.
Datastore automatically generated keys are generated with uniform distribution, in order to make search more performant. You will not be able to understand which entity where added last using keys.
Instead, you can try couple of different approaches.
Use Pub/Sub and architecture your app so another background task will consume this last added entities. On entities add in DB, you will just publish new Event into Pub/Sub with key id. You event listener (separate routine) will receive it.
Use names and generate you custom names. But, as you want to create sequentially growing names, this will case performance hit on even not big ranges of data. You can find more about this in Best Practices of Google Datastore.
https://cloud.google.com/datastore/docs/best-practices#keys
You can add additional creation time column, and still use automatic keys generation.
Can anyone give an example of maximum record limit in Riak database with specific hardware details? please help me in this case.I'm going to build a CDR information system. Will it be suitable to select Riak as my database?
Riak uses the 2^160 SHA-1 hash value to identify the partitions to store data in. Data is then stored in the identified partitions based on the bucket and key name. The size of the hash space is therefore not related to the amount of data that can be stored. Two different objects that happen to hash to the same value will therefore not overwrite each other.
When working with Riak, it is important to model your data correctly and consider how it needs to be retrieved and queried during the design process. Ideally you should try to ensure that the vast majority of your queries can be done through direct key access. It is often recommended to de-normalise your data and use natural keys. For CDRs this may mean creating an object holding all CDRs for a subscriber per day. These objects can be named based on the subscriber id and date, making it easy to retrieve data directly by key. It is also often more efficient to retrieve a few larger objects than many small ones and perform filtering in the application rather than try to just get the exact data that is needed. I have described this approach in greater detail here.
The limit to the number of records (or key/value pairs) you can store in Riak is governed only by the size of the hash space: 2^160. According to WolframAlpha, this is the number:
1461501637330902918203684832716283019655932542976
In other words, go nuts. :)
I am using a riak bucket to store a list of messages, using a UUID as the key and a json message as value. This is working fine.
What I need is an efficient way to get a single message from the bucket without knowing its key, at least in one of these two scenarios:
Get the last inserted object (this is my prefered approach).
Get a random object from the bucket (if the first alternative is not possible).
Is there any efficient way to achieve that?
I think one alternative could be to retrieve the keys in the bucket and then get the first one. But this means making two calls to riak, one to obtain all the keys (just to discard all but one) and a second one to obtain the object. It does not seem very efficient.
As Riak is a key-value store, the by far most efficient way to retrieve data is through the keys. Listing or retrieving all keys in a bucket, even if you only end up using the one returned first, is one of the least efficient operations you can perform as it causes Riak to scan ALL keys in the system (not just the bucket), and it is usually recommended NEVER to use this on a production system.
The most efficient way to get the last inserted object would probably be to store the id in a separate, known record in a different bucket. This would however require you to perform two writes on every insert and two reads for every read, but would do so in the most efficient way. You could possibly implement a post-commit hook (would have to be in Erlang as it is not currently not possible to write records using JavaScript functions) on the bucket containing messages to get the system to perform the update for you, which would remove the need for the last write.
If you write a lot of data to the bucket containing messages, you may want to adjust the separate bucket so that it does not allow multiple values and that the last value wins. This way you would reduce the risk of having lots of siblings created due to frequent updates to this single record across the system. This would always give you one of the last written records, but not necessarily the last one (especially if you frequently write messages to the database), as Riak does not support any type of atomicity and is an eventually consistent database.
You could also create one or more secondary indexes if you are using the leveldb backend, and use this to limit your scan to only recent records, which would be more efficient than a scann of all keys. You could then either select the most recent key or a random one through mapreduce, but this would be much less efficient than the previously described approach.
I can not think of any efficient way to retrieve a random record in a bucket from Riak unless you know the range of keys you have inserted and can decide randomly on the client which one to get. One way to do this would be to generate all keys in sequence rather than using a UUID, but that is naturally not a good idea in a highly concurrent distributed system.
1st task is pretty easy to implement:
Add post-commit hook that will write the last inserted key to some predefined key/bucket place
Get the key from that predefined key/bucket and issue a get query using them
It's still two operations but both are just gets that are fast. Plus additional overhead on hook but nothing too heavy either.
2nd scenario is also easy, but it is way too inefficient to be used practically:
Get all keys (extremely expensive operation)
Pick random
Issue get
I have come up with the same scenario. In My scenario I have to save the users. For that I required an auto increment Id. So what I did is, I placed the last inserted key in a separate bucket as like mentioned by "Christian Dahlqvist", every time I want to insert new record I fetch the last inserted key from that key bucket. Here we have only one value in that bucket with the key as "LastKey" which is always known to us. And I incremented the key based on the fetched key and again updated the key bucket. So always the key bucket contains the latest key in it.