Iterating when using OptimizeForPointLookup - rocksdb

I have a requirement where I only do point lookups, but I also need to iterate, though not in any specific order. I used OptimizeForPointLookup together with the iterator API and everything seems to work fine. However, the RocksDB code documents the following in options.h against the OptimizeForPointLookup API:
// Use this if you don't need to keep the data sorted, i.e. you'll never use
// an iterator, only Put() and Get() API calls
Is there something I am missing? Interestingly, the iteration also seems to happen in sorted order.

The OptimizeForPointLookup() API makes Get()/Put() operations faster by creating a bloom filter and setting the index type to kHashSearch. As the name suggests, kHashSearch builds a hash over the keys and makes point lookups faster.
For normal iterator operation, the index type is set to kBinarySearch.
By default, RocksDB inserts data into the memtable in sorted order. Optimizing for point lookups does not affect this insert behaviour of RocksDB.
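
For illustration, here is a minimal sketch using the RocksJava bindings (the block-cache size, path, and keys are arbitrary): point lookups are optimized, yet an iterator still walks the keys in sorted order.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class PointLookupIteration {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                     .setCreateIfMissing(true)
                     .optimizeForPointLookup(64);   // 64 MB block cache
             RocksDB db = RocksDB.open(options, "/tmp/point-lookup-db")) {

            db.put("k2".getBytes(), "v2".getBytes());
            db.put("k1".getBytes(), "v1".getBytes());

            // Iteration still works, and the keys come back in sorted order
            // because the memtable and SST files keep them sorted.
            try (RocksIterator it = db.newIterator()) {
                for (it.seekToFirst(); it.isValid(); it.next()) {
                    System.out.println(new String(it.key()) + " = " + new String(it.value()));
                }
            }
        }
    }
}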

Related

How to do contains search in Redis?

I am new to the Redis cache implementation.
I want to search for a value across all the keys.
The values may or may not be nested collections of lists.
What command should I use to search the data?
https://github.com/antirez/redis/issues/6802
I am implementing the same in .net core.
https://github.com/StackExchange/StackExchange.Redis
If you just want to search inside a single hash key, you can use HSCAN to traverse all the fields of the hash; this returns the values as well. Then test for the value client-side. Or, you can move this logic into a Lua script to do it Redis-server-side.
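For illustration, here is a rough Jedis sketch of the HSCAN-and-filter-client-side approach; the hash name "myhash" and the substring "needle" are placeholders:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;
import java.util.Map;

public class HashValueSearch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String cursor = ScanParams.SCAN_POINTER_START; // "0"
            ScanParams params = new ScanParams().count(100);
            do {
                // HSCAN returns a page of field/value pairs plus the next cursor.
                ScanResult<Map.Entry<String, String>> page = jedis.hscan("myhash", cursor, params);
                for (Map.Entry<String, String> field : page.getResult()) {
                    if (field.getValue().contains("needle")) {
                        System.out.println("match: " + field.getKey() + " = " + field.getValue());
                    }
                }
                cursor = page.getCursor();
            } while (!cursor.equals(ScanParams.SCAN_POINTER_START)); // cursor "0" again means done
        }
    }
}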
If you want to search in all the keys, consider the following (a rough client-side sketch follows below):
You will need to traverse the whole keyspace, key by key, using SCAN.
Depending on the type, perform the search inside each key.
Sets and sorted sets can be searched for values with SSCAN and ZSCAN, using the MATCH option.
For all other types, you need to do the search on your own.
Again, you can implement the above in a Lua script for a more efficient, server-side implementation. This answer can get you started.
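Here is what that keyspace walk could look like in Jedis; it is only a sketch (the "needle" pattern is a placeholder, and for brevity only the first SSCAN/ZSCAN page of each set is checked):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class KeyspaceSearch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String cursor = ScanParams.SCAN_POINTER_START;
            do {
                // SCAN walks the keyspace incrementally without blocking the server.
                ScanResult<String> page = jedis.scan(cursor, new ScanParams().count(500));
                for (String key : page.getResult()) {
                    switch (jedis.type(key)) {
                        case "string":
                            if (jedis.get(key).contains("needle")) System.out.println(key);
                            break;
                        case "set":
                            // SSCAN supports MATCH, so this part of the filtering is server-side.
                            if (!jedis.sscan(key, ScanParams.SCAN_POINTER_START,
                                    new ScanParams().match("*needle*")).getResult().isEmpty()) {
                                System.out.println(key);
                            }
                            break;
                        case "zset":
                            if (!jedis.zscan(key, ScanParams.SCAN_POINTER_START,
                                    new ScanParams().match("*needle*")).getResult().isEmpty()) {
                                System.out.println(key);
                            }
                            break;
                        default:
                            // Hashes, lists, etc.: fetch the members and search client-side
                            // (see the HSCAN example above).
                            break;
                    }
                }
                cursor = page.getCursor();
            } while (!cursor.equals(ScanParams.SCAN_POINTER_START));
        }
    }
}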

Firebase Cloud Firestore ArrayList

I'm studying Cloud Firestore. I know it can also save a list type. Let's assume we have data at indexes 0 through 3. Is it impossible to write another value directly at index 4, leaving the existing values unchanged?
Do I have to rewrite the whole list every time?
You can't do this -- you have to overwrite the entire array. (Or use a dictionary instead)
Arrays tend to be problematic in an environment like Cloud Firestore where many clients could theoretically append or remove elements from an array at the same time -- if instructions arrive in a slightly different order, you could end up with out-of-bounds errors, corrupted data, or just a really bad time.
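A short sketch of both options with the Firestore Java client (the collection, document, and field names are made up, and the document is assumed to already exist):

import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.FieldPath;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;
import java.util.Arrays;
import java.util.List;

public class FirestoreListUpdate {
    public static void main(String[] args) throws Exception {
        Firestore db = FirestoreOptions.getDefaultInstance().getService();
        DocumentReference doc = db.collection("demo").document("myDoc");

        // Option 1: overwrite the whole array field, old values plus the new one.
        List<String> items = Arrays.asList("a", "b", "c", "d", "e");
        doc.update("items", items).get();

        // Option 2: store the values in a map keyed by index ("0", "1", ...); a single
        // entry can then be written without touching the others.
        doc.update(FieldPath.of("itemsByIndex", "4"), "e").get();
    }
}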

How to get Riak keys last modified since X?

Is there a way to get a list of keys from Riak, which were modified since a specified time? A stream of changes would be equally good.
MapReduce is not a recommended option.
There are a couple of possible solutions to this problem (all of which have their advantages and disadvantages):
Search (Solr) range queries if your object is a JSON or XML document (http://docs.basho.com/riak/kv/2.2.0/developing/usage/search/)
Secondary indexes and range queries where the date is the 2i (http://docs.basho.com/riak/kv/2.2.0/developing/usage/secondary-indexes/); a sketch of this option follows below
Date bounded sets (http://docs.basho.com/riak/kv/2.2.0/developing/data-types/sets/) that contain a list of keys added during a predefined time period
If you can use Riak TS, it supports SQL and makes selecting records by date/time range quite easy.
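As a rough illustration of the secondary-index option, here is a sketch against Riak's HTTP interface; the bucket name, index name, and body are invented, and 2i requires the leveldb or memory backend:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RiakModifiedSince {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Write an object and tag it with an integer secondary index holding its
        // last-modified time (seconds since the epoch).
        long now = System.currentTimeMillis() / 1000;
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8098/buckets/messages/keys/msg-1"))
                .header("Content-Type", "application/json")
                .header("x-riak-index-modified_int", Long.toString(now))
                .PUT(HttpRequest.BodyPublishers.ofString("{\"body\":\"hello\"}"))
                .build();
        http.send(put, HttpResponse.BodyHandlers.discarding());

        // Range query over the index: all keys modified in the last hour.
        long since = now - 3600;
        HttpRequest query = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8098/buckets/messages/index/modified_int/"
                        + since + "/" + now))
                .GET()
                .build();
        HttpResponse<String> keys = http.send(query, HttpResponse.BodyHandlers.ofString());
        System.out.println(keys.body()); // e.g. {"keys":["msg-1"]}
    }
}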
It seems that commit hooks are the closest thing to a solution.
Pre-commit hooks may be written in JavaScript, so I can trigger an HTTP request or append to a change log.
You can also use secondary indexes to tag your keys with the time they were added, and perform time-range requests.

Is it okay to filter using code instead of the NoSQL database?

We are using DynamoDB and have some complex queries that would be very easily handled using code instead of trying to write a complicated DynamoDB scan operation. Is it better to write a scan operation or just pull the minimal amount of data using a query operation (query on the hash key or a secondary index) and perform further filtering and reduction in the calling code itself? Is this considered bad practice or something that is okay to do in NoSQL?
Unfortunately, it depends.
If you have an even modestly large table, a table scan is not practical.
If you have complicated query needs, the best way to tackle them in DynamoDB is with Global Secondary Indexes (GSIs) that act as projections on the fields you want. You can use techniques such as sparse indexes (creating a GSI on fields that only exist on a subset of the objects) and composite attribute keys (concatenating two or more attributes and using the result as a new attribute to create a GSI on).
However, to directly address the question "Is it okay to filter using code instead of the NoSQL database?", the answer is yes, that is an acceptable approach. The reason for performing filters in DynamoDB is not to reduce the "cost" of the query (that is actually the same) but to decrease unnecessary data transfer over the network.
The ideal solution is to use a GSI to reduce the scope of what is returned to as close to what you want as possible; if some additional filtering is still necessary, it is fine to eliminate the remaining records either through a filter in DynamoDB or in your own code.
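A hedged sketch of that pattern with the AWS SDK for Java v2 (the table, index, and attribute names are hypothetical): a GSI query narrows the result set server-side, and the more complex condition is applied in code afterwards.

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GsiQueryThenFilter {
    public static void main(String[] args) {
        DynamoDbClient dynamo = DynamoDbClient.create();

        // Query the GSI so only items with the matching partition value are returned.
        QueryRequest request = QueryRequest.builder()
                .tableName("Orders")
                .indexName("status-index")
                .keyConditionExpression("#s = :status")
                .expressionAttributeNames(Map.of("#s", "status"))
                .expressionAttributeValues(
                        Map.of(":status", AttributeValue.builder().s("OPEN").build()))
                .build();
        QueryResponse response = dynamo.query(request);

        // Apply the remaining, more complex filtering in code.
        List<Map<String, AttributeValue>> expensiveOpenOrders = response.items().stream()
                .filter(item -> Double.parseDouble(item.get("total").n()) > 100.0)
                .collect(Collectors.toList());
        System.out.println(expensiveOpenOrders.size());
    }
}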

Get an object from a bucket in riak without knowing its key

I am using a riak bucket to store a list of messages, using a UUID as the key and a json message as value. This is working fine.
What I need is an efficient way to get a single message from the bucket without knowing its key, at least in one of these two scenarios:
Get the last inserted object (this is my preferred approach).
Get a random object from the bucket (if the first alternative is not possible).
Is there any efficient way to achieve that?
I think one alternative could be to retrieve the keys in the bucket and then get the first one. But this means making two calls to riak, one to obtain all the keys (just to discard all but one) and a second one to obtain the object. It does not seem very efficient.
As Riak is a key-value store, the by far most efficient way to retrieve data is through the keys. Listing or retrieving all keys in a bucket, even if you only end up using the one returned first, is one of the least efficient operations you can perform as it causes Riak to scan ALL keys in the system (not just the bucket), and it is usually recommended NEVER to use this on a production system.
The most efficient way to get the last inserted object would probably be to store its id in a separate, known record in a different bucket. This would however require you to perform two writes on every insert and two reads for every read, but would do so in the most efficient way. You could possibly implement a post-commit hook (which would have to be in Erlang, as it is not currently possible to write records using JavaScript functions) on the bucket containing messages to get the system to perform the update for you, which would remove the need for the last write.
If you write a lot of data to the bucket containing messages, you may want to adjust the separate bucket so that it does not allow multiple values and that the last value wins. This way you would reduce the risk of having lots of siblings created due to frequent updates to this single record across the system. This would always give you one of the last written records, but not necessarily the last one (especially if you frequently write messages to the database), as Riak does not support any type of atomicity and is an eventually consistent database.
You could also create one or more secondary indexes if you are using the leveldb backend, and use these to limit your scan to only recent records, which would be more efficient than a scan of all keys. You could then select either the most recent key or a random one through MapReduce, but this would be much less efficient than the previously described approach.
I cannot think of any efficient way to retrieve a random record in a bucket from Riak unless you know the range of keys you have inserted and can decide randomly on the client which one to get. One way to do this would be to generate all keys in sequence rather than using a UUID, but that is naturally not a good idea in a highly concurrent distributed system.
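A rough sketch of the "separate known record" idea over Riak's HTTP interface (the bucket and key names are invented, and remember this is not atomic; it simply performs the extra write and read described above):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class LastInsertedMessage {
    private static final String RIAK = "http://127.0.0.1:8098";

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Write the message under a fresh UUID key ...
        String key = UUID.randomUUID().toString();
        put(http, RIAK + "/buckets/messages/keys/" + key, "{\"body\":\"hello\"}");
        // ... and record that key in a well-known pointer record (the second write).
        put(http, RIAK + "/buckets/pointers/keys/last-message", key);

        // To read the last message: fetch the pointer, then the message itself (two reads).
        String lastKey = get(http, RIAK + "/buckets/pointers/keys/last-message");
        String lastMessage = get(http, RIAK + "/buckets/messages/keys/" + lastKey);
        System.out.println(lastMessage);
    }

    private static void put(HttpClient http, String url, String body) throws Exception {
        http.send(HttpRequest.newBuilder(URI.create(url))
                        .header("Content-Type", "text/plain")
                        .PUT(HttpRequest.BodyPublishers.ofString(body))
                        .build(),
                HttpResponse.BodyHandlers.discarding());
    }

    private static String get(HttpClient http, String url) throws Exception {
        return http.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();
    }
}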
1st task is pretty easy to implement:
Add post-commit hook that will write the last inserted key to some predefined key/bucket place
Get the key from that predefined key/bucket and issue a get query using them
It's still two operations, but both are just gets, which are fast. There is some additional overhead from the hook, but nothing too heavy either.
2nd scenario is also easy, but it is way too inefficient to be used practically:
Get all keys (extremely expensive operation)
Pick random
Issue get
I have come up with the same scenario. In my case I had to save users, and for that I needed an auto-incrementing id. So, as mentioned by "Christian Dahlqvist", I placed the last inserted key in a separate bucket. That bucket holds a single value under the key "LastKey", which is always known to us. Every time I want to insert a new record, I fetch the last inserted key from that key bucket, increment it to produce the new key, and then update the key bucket again. That way the key bucket always contains the latest key.
