Redis Multiple Key Set Counts

So I add keys to my Redis implementation for wallpaper view counts like this...
(the values are there for demonstration purposes but the overall format is the same)
SADD wallpapers:100:2015-12-31 "127.0.0.1"
SADD wallpapers:100:2016-01-01 "127.0.0.1"
SADD wallpapers:100:2016-01-01 "192.168.1.1"
SADD wallpapers:100:2016-01-02 "127.0.0.1"
So that should add the IPs to the associated sets. My question is: does Redis allow some sort of pattern-based count?
SCARD wallpapers:100:2016-01-01
For example, the above command would return 2, since there are two IPs stored in that set. But is there a way to run something like the command below to get the counts for all the dates?
SCARD wallpapers:100:*

Actually, it's easier than you might think: store less specific sets so you can get what you want directly.
For example, if you need wallpapers:100:*, that means you just need a set called wallpapers:100 where you store the unique IP addresses.
That is, whenever you add an IP address to one of the specific sets (i.e. the daily sets), also add it to the global set for the given wallpaper identifier.
Working with Redis is like working with a manual index: index your data in a way that lets you use it efficiently. That's all! This also means that data redundancy is a good approach.
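
A minimal sketch of that write path with the redis-py client (the key names follow the question; the global wallpapers:100 set is this example's assumption):

import redis

r = redis.Redis()

def record_view(wallpaper_id, ip, day):
    # Index the same IP twice: once in the daily set, once in the
    # global per-wallpaper set.
    r.sadd(f"wallpapers:{wallpaper_id}:{day}", ip)
    r.sadd(f"wallpapers:{wallpaper_id}", ip)

record_view(100, "127.0.0.1", "2016-01-01")
record_view(100, "192.168.1.1", "2016-01-01")

print(r.scard("wallpapers:100"))  # unique IPs across all days -> 2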

EVAL "local total = 0 for _, key in ipairs(redis.call('keys', ARGV[1])) do total = total + redis.call('scard', key) end return total" 0 wallpapers:100:*
This command returns the total number of elements across all keys matching wallpapers:100:* (an IP that appears in several daily sets is counted once per set).
If you want the total number of unique values from all the keys combined:
EVAL "return redis.call('SUNIONSTORE', 'wallpapers:temp', unpack(redis.call('keys', ARGV[1])))" 0 wallpapers:100:*
This will return the number of unique values from all the keys combined, and it also creates a key wallpapers:temp as a side effect of SUNIONSTORE.
You can delete this key later with DEL wallpapers:temp.
Refer to EVAL.
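
If you're calling these from application code, here is a sketch with redis-py (note that KEYS scans the whole keyspace, so both scripts are best kept to small datasets or offline jobs):

import redis

r = redis.Redis()

# Sum of SCARD over every matching daily set (repeat IPs counted per day).
TOTAL = """
local total = 0
for _, key in ipairs(redis.call('keys', ARGV[1])) do
    total = total + redis.call('scard', key)
end
return total
"""

# Unique IPs across all matching sets, materialized into wallpapers:temp.
UNIQUE = "return redis.call('SUNIONSTORE', 'wallpapers:temp', unpack(redis.call('keys', ARGV[1])))"

print(r.eval(TOTAL, 0, "wallpapers:100:*"))
print(r.eval(UNIQUE, 0, "wallpapers:100:*"))
r.delete("wallpapers:temp")  # clean up the temporary key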


Best way to model high score data in DynamoDB

I believe this would be easier with PostgreSQL or MongoDB, both of which I'm familiar with, but I'm using DynamoDB with my project for the sake of learning how to use it and getting comfortable with it. I've never used it before.
I want to use DynamoDB to store high scores for my typing test project. There are 4 data attributes to be stored:
name (doesn't need to be unique)
WPM
number of errors
test type (because I have 2 different kinds of typing tests)
At first, my partition key was testType, and my sort key was WPM. Then I realized that if anyone got the same WPM as a previous user, it would overwrite the previous user's data, because testType and WPM, the two key components, were identical. So ties did not work.
So now name is my partition key and WPM is my sort key. To filter by testType, I just use JS array filter methods. This still doesn't seem optimal, though, for multiple reasons. For my small typing test project I think it's OK, but it's still possible for 2 people to input the same name, get the same WPM, and overwrite each other.
What would be a better way to set this up with DynamoDB?
Assuming you want the top X many WPM results for a given test type:
Set the partition key to be the test type. Set the sort key to <WPM>#<username>, and make sure to zero-pad the WPM so it's always 3 digits even when the score is below 100; that keeps the string's lexicographic sort order consistent with numeric order.
With this key structure you have a sorted list (in the sort key) of all the scores for a given test type. You can Query against the test type and use ScanIndexForward=false to get descending high scores.
Notice how multiple identical scores by different usernames won’t overwrite each other. The username can be pulled from the returned sort key or from an attribute on the item, along with other metadata about the high score event.
If you have multiple users with the same username, well, that’s kinda weird. Presumably you have an internal identifier. You can use that as the suffix in the sort key instead of the username.
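
A sketch of that layout with boto3 (the table name and attribute names here are illustrative assumptions):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("HighScores")  # hypothetical table

def put_score(test_type, username, wpm, errors):
    # Zero-pad WPM so the lexicographic sort key orders numerically.
    table.put_item(Item={
        "testType": test_type,              # partition key
        "scoreKey": f"{wpm:03d}#{username}",  # sort key, e.g. "087#alice"
        "wpm": wpm,
        "errors": errors,
    })

def top_scores(test_type, limit=10):
    # Descending order over the sort key = highest WPM first.
    resp = table.query(
        KeyConditionExpression=Key("testType").eq(test_type),
        ScanIndexForward=False,
        Limit=limit,
    )
    return resp["Items"]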

How do I specify a Range including everything in a RocksDB column?

I want to publish metrics about my RocksDB instance, including the disk size of each column over time. I've come across a likely API method to calculate disk size for a column:
GetApproximateSizes():
Status s = GetApproximateSizes(options, column_family, ranges.data(), NUM_RANGES, sizes.data());
This is a nice API, but I don't know how to provide a Range that will specify my entire column. Is there a way to do so without finding the min/max key in the column?
For the whole database, you can approximate it by using 0x00 (or the empty byte string) as the start key and an arbitrarily large key such as 0xFFFFFF as the end key.
Otherwise, if the column's keys share a common prefix, use the following function to compute the end key:
def strinc(key: bytes) -> bytes:
    key = key.rstrip(b"\xff")  # trailing 0xff bytes cannot be incremented
    if not key:
        raise ValueError("key is all 0xff bytes; no upper bound exists")
    return key[:-1] + bytes([key[-1] + 1])  # bump the last byte
strinc computes the smallest byte string that is not prefixed by key; together, key and strinc(key) bound the whole keyspace that has key as a prefix.
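
For example, for a column whose keys all start with a hypothetical prefix col1/:

print(strinc(b"col1/"))  # b'col10' -- the first key beyond the col1/ prefix
# Pass (b"col1/", strinc(b"col1/")) as the Range to GetApproximateSizes().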

Use RocksDB to support key-key-value (RowKey->Containers) by splitting the container

Suppose I have a key/value store where the value is a logical list of strings to which I can append. To avoid re-writing the entire list whenever a single string item is inserted into the queue, I'd use multiple key-value pairs to represent it:
Key -> metadata of the value such as length and subkey format
Key-l1 -> value of item 1 in list
Key-l2 -> value of item 2 in list
Key-ln -> the latest value in the list
I'd override the key comparator in RocksDB so that keys of the Key-ln format sort by the Key part first and by ln second (i.e. group by Key, and within the same Key sort by ln). This way, all the list items, along with their root key and metadata, are grouped together in the SST during the initial bulk insert and during later SST compactions.
Appending a new list item becomes: (1) read Key-metadata to get the current list size n; (2) insert Key-l(n+1) with the new value. Deleting a list item works as it normally does in RocksDB: delete Key-ln and update the metadata.
To ensure consistency, (1) and (2) will be done inside a RocksDB transaction.
Does this design seem OK?
Now, if I want to add another feature, a TTL for the entire key-value (list), I'd use the TTL support already in RocksDB. My understanding is that removal of expired items happens during compaction. However, such compaction is not done under a transaction, and RocksDB doesn't know that the Key-metadata and Key-ln entries are related. It is entirely possible that there is a time window where Key->metadata (the root node) is deleted while the child nodes (Key-ln) are not yet deleted (or the reverse order). If someone reads or updates the list during this window, they will get an inconsistent view of the Key-list. Any remedy for it?
Thanks
You should use a Merge Operator; it's designed for exactly this value-append use case. Your design is read-before-write, which carries a performance penalty and should generally be avoided where possible: What's read-before-write in NoSQL?.
Options options;
options.merge_operator.reset(new StringAppendOperator(','));
DB::Open(options, kDBPath, &db);
...
db->Merge(WriteOptions(), "key", "value1");
db->Merge(WriteOptions(), "key", "value2");
db->Get(ReadOptions(), "key", &result); // result == "value1,value2"
The above example uses the predefined StringAppendOperator, which simply appends new values at the end. You can define your own MergeOperator to customize the merge operation.
Under the hood, the merge operation is applied on the read path (and during compaction, which collapses the stacked merge operands so reads stay cheap); details: Merge Operator Implementation.
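
To make that read-path behaviour concrete, here is a toy Python model of merge semantics (an illustration only, not the RocksDB API): writes just stack operands, and reads collapse them with the merge function.

class ToyMergeStore:
    def __init__(self, merge_fn):
        self.merge_fn = merge_fn   # associative merge, e.g. comma-append
        self.ops = {}              # key -> list of pending operands

    def merge(self, key, value):
        self.ops.setdefault(key, []).append(value)  # O(1) write, no read

    def get(self, key):
        operands = self.ops.get(key)
        if not operands:
            return None
        result = operands[0]
        for op in operands[1:]:    # collapse operands on the read path
            result = self.merge_fn(result, op)
        return result

store = ToyMergeStore(lambda a, b: a + "," + b)
store.merge("key", "value1")
store.merge("key", "value2")
print(store.get("key"))            # -> "value1,value2"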

Generating Order Numbers - Keep unique across multiple machines - Unique string seed

I'm attempting to create an order number for customers to use. I will have multiple machines that do not have access to the same database (so can't use primary keys and generate a unique ID).
I will have a unique string that I could use as a seed for some algorithm that generates a unique-looking alphanumeric ID for the order number. I do not want to use this unique string itself as the order # because its contents would not look appropriate to a customer.
Would it be possible to combine the use of a GUID & my unique string with some algorithm to create a unique order #?
Open to any suggestions.
If you have a relatively small number of machines and each one can have its own configuration file or setting, you can assign a letter to each machine (A, B, C...) and then append the letter onto the order number, which could just be an auto-incrementing integer in each DB.
i.e.
Starting each database ID at 1000:
1001A // First order on database A
1001B // First order on database B
1001C // First order on database C
1002A // Second order on database A
1003A // Third order on database A
1004A // etc...
1002B
1002C
Your order table in each database would have an ID column (integer) and "machine" identifier (character A,B,C...) so in case you ever needed to combine DBs into one, each order would still be unique.
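
A quick Python sketch of that scheme (the machine letter would come from each machine's configuration; the module-level counter stands in for the per-database auto-increment):

MACHINE_ID = "A"   # from this machine's configuration file
_counter = 1000    # stands in for the database auto-increment, starting at 1000

def next_order_number():
    global _counter
    _counter += 1
    return f"{_counter}{MACHINE_ID}"   # integer part + machine letter

print(next_order_number())  # 1001A
print(next_order_number())  # 1002A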
Just use a straight-up GUID/UUID. Version 1 UUIDs take the MAC address of the network interface into account to make them unique to that machine.
http://en.wikipedia.org/wiki/Uuid
You can use IDs as a primary key if you generate the ID from a stored procedure (or, in Oracle, using a sequence).
What you have to do is make each machine generate in a different range, e.g. machine A from 1 to 1,000,000, machine B from 1,000,001 to 2,000,000, etc.
You say you have a unique string that would not be 'appropriate' to show to customers.
If it's only inappropriate-looking and not sensitive (i.e. security/privacy related), you could just transform it somehow; a simple example would be ROT13.
But generally I too would suggest using a UUID (version 4, though) for random numbers. The probability of generating duplicates is extremely low, and there are libraries for many programming languages available.
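
In Python, for instance, that is a one-liner:

import uuid

order_id = uuid.uuid4()  # 122 random bits; collisions are vanishingly rare
print(order_id)          # e.g. 16fd2706-8baf-433b-82eb-8c7fada847da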

What exactly are hashtables?

What are they and how do they work?
Where are they used?
When should I (not) use them?
I've heard the word over and over again, yet I don't know its exact meaning.
What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?
(Notice: this is not my homework; I go to school, but they only teach us the basics in informatics)
Wikipedia seems to have a pretty nice answer to what they are.
You should use them when you want to look up values by some index.
As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)
You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values; the hash function's speed is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case).
The tricky bit is that hash functions are usually imperfect. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value: one (open addressing) is to probe onward from the location given by the hash, in a pre-determined pattern, until a free slot is found; the other (chaining) is to hang a linked list off each entry in the array (so you do a linear lookup over what is hopefully a short list). The production code whose source I've read has all used chaining, with dynamic rebuilding of the hash table when the load factor gets excessive.
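
A minimal Python sketch of the chaining variant just described, including the rebuild when the load factor gets high (the 0.75 threshold and initial capacity are arbitrary choices for illustration):

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def _bucket(self, key):
        # The hash function maps an arbitrary key to an array index.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: replace value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or new key: chain it
        self.size += 1
        if self.size > 0.75 * len(self.buckets):
            self._rehash()               # rebuild when load factor is high

    def get(self, key):
        for k, v in self._bucket(key):   # linear scan of a (short) chain
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self):
        old, self.buckets = self.buckets, [[] for _ in range(2 * len(self.buckets))]
        for bucket in old:
            for k, v in bucket:
                self._bucket(k).append((k, v))

t = ChainedHashTable()
t.put("apple", 3)
t.put("banana", 7)
print(t.get("apple"))  # 3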
Good hash functions are one-way functions that produce a well-distributed value from any given input, so you get an effectively unique value for each input. They are also repeatable: a given input will always generate the same output.
Examples of good hash functions are SHA-1 and SHA-256.
Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.
While any of these columns could have duplicates, let's assume that no rows are exactly the same.
In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.
We could look up the user record like this from our database:
SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';
We have to search through 4 different columns, using 4 different indexes, to find the record.
However, you could create a new "hash" column, and store the hash value of all four columns combined.
String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");
You might get a hash value like AE32ABC31234CAD984EA8.
You store this hash value as a column in the database and index on that. You now only have to search one index.
SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';
Once we have the id for the requested user, we can use that value to look up related data in other tables.
The idea is that the hash function offloads work from the database server.
Collisions are not likely. If two users have the same hash, it's most likely that they have duplicate data.
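
A small Python sketch of computing such a hash column (the concatenation mirrors the example above; a real implementation would want a delimiter so that field boundaries can't collide):

import hashlib

def user_hash(first_name, last_name, address, phone):
    # Join with a delimiter so ("ab","c") and ("a","bc") hash differently.
    data = "|".join((first_name, last_name, address, phone)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()

print(user_hash("Marcus", "Adams", "1234 Main St", "555-1212"))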
