Allocating datastore long IDs - but segmented so different Kinds have different ranges - google-cloud-datastore

My program has 3 Kinds that are closely related, and I want to be able to store and manipulate their long IDs interchangeably, e.g. I might have an array of long IDs that can be for any of the 3 Kinds.
Using the allocateIds API I can allocate the IDs for the 3 Kinds in the same namespace, but I also sometimes need to be able to tell which Kind one of these IDs refers to (e.g. in order to do a datastore operation on the right Kind).
I understand that the 'normal' way to do this is to store the whole Key, rather than just the long ID, but there will be a huge number of these - it will be more efficient if I can just use long values rather than Key values.
So, I'd like to be able to segment the ID ranges, so I can call a simple function with an ID and it will tell me which of the 3 Kinds the ID is for.
(I'm using Java, but I don't think that matters.)

Allocate my own IDs
I guess the most straightforward way to do this is to simply allocate my own IDs. I believe that, in order to allocate sequential IDs, I would need to do an extra datastore write for every allocation (to track the allocations), or get into some complicated system of pre-allocating ranges of IDs to each live instance. This sounds like a bad idea.
So I could generate random 54-bit IDs, reserving 2 bits to use as flags to indicate the type. But it is my understanding that random or hash allocation dramatically reduces the number of allocations that can be made safely. The Internet tells me that the chance of a collision is approximately k^2 / (2N), where k is the number of allocations and N is the size of the allocation space. So, if I'm willing to accept a 0.1% chance of collision, then k = sqrt(2 * 2^54 / 1000) ≈ 6 million. Since I really have no idea how many entities I will need to store, this is unacceptable.
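For reference, that estimate is easy to reproduce; a quick sketch in plain Java, nothing datastore-specific, using the 54-bit space and 0.1% budget from above:

```java
// Rough birthday-bound check: p ≈ k^2 / (2N), so k ≈ sqrt(2 * N * p).
// Numbers are the ones from the question: 54 random bits, 0.1% collision budget.
public class CollisionEstimate {
    public static void main(String[] args) {
        double n = Math.pow(2, 54);   // size of the random ID space
        double p = 0.001;             // acceptable collision probability (0.1%)
        double k = Math.sqrt(2 * n * p);
        System.out.printf("Safe allocations at p=%.3f: ~%.1f million%n", p, k / 1e6);
        // Prints roughly 6.0 million - far too few if the entity count is unknown.
    }
}
```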

Reserve some bits in the Long ID to indicate the Kind
Another solution would be to use 2 bits of the long value as flags to indicate the type. The easiest way to do this would be to take advantage of the fact that the allocator now only uses the low 56 bits of a long, so I could use the high bits as flags to indicate the Kind. The problem with that solution is that I lose the ability to manipulate these numbers in JavaScript - the reason for the 56-bit limit in the first place.
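If losing JavaScript compatibility were acceptable, the tagging itself would be plain bit arithmetic. A minimal sketch, assuming the allocator really does stay within the low 56 bits and using an arbitrary 2-bit kind numbering of 0-2:

```java
// Sketch of tagging an allocated ID with a 2-bit kind flag in the high bits.
// Assumes allocated IDs always fit in the low 56 bits, as described above.
public final class KindTaggedIds {
    private static final int KIND_SHIFT = 56;                 // bits 56-57 hold the kind tag
    private static final long ID_MASK = (1L << KIND_SHIFT) - 1;

    static long tag(long rawId, int kind) {                   // kind is 0, 1 or 2
        return rawId | ((long) kind << KIND_SHIFT);
    }

    static int kindOf(long taggedId) {
        return (int) (taggedId >>> KIND_SHIFT) & 0x3;
    }

    static long rawIdOf(long taggedId) {
        return taggedId & ID_MASK;
    }
}
```

A kind of 0, 1 or 2 survives a round trip through tag() and kindOf(), and rawIdOf() recovers the original allocated value.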
An alternative to this - to maintain the option of manipulating these numbers in JavaScript - is to use allocateIdRange and pre-allocate (and throw away) the ID ranges corresponding to bits 54 and 55. Actually, I could use any bits, but specifying the ID ranges is much easier if I use the high bits.
But I know little about how the datastore and the allocator actually work, so I don't know whether this 'pre-allocate and discard' technique is a good idea.
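Whatever mechanism ends up reserving the ranges, the "which Kind is this ID?" function would just inspect bits 54 and 55. A sketch, with an assumed and purely illustrative mapping of bit patterns to the 3 Kinds:

```java
// Illustrative classifier for the "reserve bits 54 and 55" scheme.
// The mapping of bit patterns to Kinds below is an assumption, not something
// fixed by the datastore; it just has to match however the ranges were reserved.
static String kindForId(long id) {
    boolean bit54 = (id & (1L << 54)) != 0;
    boolean bit55 = (id & (1L << 55)) != 0;
    if (!bit54 && !bit55) return "KindA";   // IDs below 2^54
    if (bit54 && !bit55)  return "KindB";   // IDs in [2^54, 2^55)
    return "KindC";                          // IDs with bit 55 set
}
```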

Related

Encrypting small messages

I need to implement a coupon-code feature. Because of the number of codes required and some other constraints, I can't store them in a database. In addition, the displayed codes need to be short (around 10 characters).
My original idea was to use a cryptographic function to create codes by encrypting an ongoing counter, but I'm at a loss as to what method to use.
Because of the counter I would be encoding only a couple of bytes, and I am aware that many algorithms are not secure when used with very short messages.
Is my approach a good idea?
What algorithm could I use?
I'm not sure if this is what you're after, and as per my comment, you have no real guarantee of security, but one possible answer is to seed a PRNG with some number and give out the first x numbers as codes. As long as x is much smaller than the total possible number of outcomes, the chance of repetition is small, and codes can be validated by re-generating the sequence (you may want to hash parts of it for speed purposes).
If you use base 62 ([a-z A-Z 0-9]) with 10 characters, there are over 839 quadrillion possible outcomes. If you were to give everyone on the planet a unique code, you would have used roughly 0.0000009% of your addressable space.
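A minimal sketch of that idea in Java (the seed, the code length, and the alphabet below are arbitrary placeholders, and as noted this gives no real cryptographic guarantee):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch: derive coupon codes from a deterministically seeded PRNG, so codes
// can be re-generated for validation instead of being stored.
public class CouponCodes {
    private static final String ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"; // base 62
    private static final long SEED = 123456789L;   // placeholder secret seed
    private static final int CODE_LENGTH = 10;

    // Returns the first `count` codes of the sequence defined by SEED.
    static List<String> firstCodes(int count) {
        Random rng = new Random(SEED);              // same seed -> same sequence
        List<String> codes = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            StringBuilder code = new StringBuilder(CODE_LENGTH);
            for (int j = 0; j < CODE_LENGTH; j++) {
                code.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
            }
            codes.add(code.toString());
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(firstCodes(5));          // hand these out as coupon codes
    }
}
```

Validation would then just re-generate the first x codes from the same seed and check membership, hashing them into a set if lookup speed matters.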

How can I create my own GUID algorithm with a smaller "global"?

I have my own application with a far smaller "global" than the real globe, and I want a shorter version of a GUID. Now suppose I have a concrete number of IDs that I estimate I will never exceed (for example, 100 million IDs). How can I determine the number of random bits required to have the same properties as a GUID (globally unique, requiring no central authority to generate one)? Using a normal GUID would be overkill.
My "overkill" refers to this: I need the ID to be as easy to type/say/write down as possible, while at the same time having a collision chance as astronomically low as a GUID's. I heard a GUID can be assigned to every grain of sand on Earth. My application is a game; each player gets one generated ID, and obviously my players are nowhere near as numerous as the grains of sand on Earth.
It would be best if a player could say something like "My ID is XXXX-XXXX". In that case, I'm not sure whether 8 characters of randomized hex is enough, or too much, for 100 million players. (In reality I encode it as A-Z 0-9 instead of hex, though.) My game is not online-only, so I would like each player to be able to obtain a unique ID even when offline (there is no server to check for ID collisions).
A GUID has been designed to be globally unique, but I don't know why that results in a 128-bit sequence. Maybe they just chose a "very large" size that is a power of 2? I don't know what they were thinking when designing the GUID to ensure that it will not clash. (Did they use the world population times something? If that is the case, I too can use 10 million times something.)
A 128-bit GUID will generally perform well, because most compilers are smart enough to reduce operations on it to a pair of 64-bit operations (and on some CPUs, a single 128-bit extended operation). Java and C#/VB.NET would likely have quite a bit more overhead than C++, but if you are using Java or C#/VB.NET, you've already accepted quite a bit more overhead, and a GUID won't add much to it.
However, if you really need smaller values, you could manually reduce GUIDs by XOR-ing the upper 64 bits with the lower 64 bits (thereby preserving some of the uniqueness of the original) to create a compact 64-bit, mostly-unique number.
You could reduce to 32 bits or 48 bits in a similar way, always folding down from the full size of the original GUID. This has the advantage that you are starting out with a number that is intended to be unique across a very large set. However, keep in mind that 100 million items require a fairly large number of bits to preserve a non-overlapping guarantee, so you may just be setting yourself up for a very difficult-to-find problem later on if you aren't careful.
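In Java terms (to keep one language across this thread), the 64-bit fold is a single XOR of the two halves that java.util.UUID already exposes; a sketch:

```java
import java.util.UUID;

// Sketch: fold a 128-bit GUID down to 64 (or 32) bits by XOR-ing its halves.
public class FoldedGuid {
    static long fold64(UUID guid) {
        return guid.getMostSignificantBits() ^ guid.getLeastSignificantBits();
    }

    static int fold32(UUID guid) {
        long folded = fold64(guid);
        return (int) (folded ^ (folded >>> 32));   // fold again: upper 32 XOR lower 32
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.printf("%s -> %016x -> %08x%n", id, fold64(id), fold32(id));
    }
}
```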
A crude but probably equally effective approach is to use a cryptographically secure random number generator and construct a number as large as you need (probably a minimum of 48 bits). It is important not to do modulo operations on the results, or you could significantly reduce the uniqueness (due to the period of the random number generator).
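A sketch of that alternative, using java.security.SecureRandom and a power-of-two mask rather than a modulo by an arbitrary number (the 48-bit width is just the example figure from above):

```java
import java.security.SecureRandom;

// Sketch: build a 48-bit identifier from a cryptographically secure RNG.
public class Random48 {
    private static final SecureRandom RNG = new SecureRandom();

    static long next48() {
        return RNG.nextLong() & ((1L << 48) - 1);   // keep the low 48 bits
    }

    public static void main(String[] args) {
        System.out.printf("%012x%n", next48());     // 48 bits = 12 hex digits
    }
}
```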
I am assuming you cannot use a sequential id, although you may want to revisit that idea and see if there is a way to make a sequential id work. For example, you could use a sequential id paired with a random seed number, guaranteeing uniqueness without requiring a large number, and allowing internal indexing operations and similar optimizations that are common with large data sets.
OK, I discussed this with a friend and we came up with a solution. This is how we decided the number of "characters" in my game ID.
A character consists of 0-9 and A-Z instead of hex; that's 36 possible characters. We took out 0, O, 1 and I so the ID prints legibly in a variety of fonts without confusion, which leaves 32 possible characters.
Then, if every character is pseudo-randomized, how many players can we safely have?
We used the birthday paradox's square approximation, n ≈ 0.5 + sqrt(0.25 + 2*d*ln(2)), which indicates how many people are necessary for a 50% chance that 2 of them collide. It gives 22.99 people for the classic birthday problem (d = 365 possible choices).
Now we substitute 32^(number of characters) for 365 in the equation; that gives the number of players at which there is a 50% chance of 2 players having the same ID.
Finally, we agreed on a 9-character ID, so the game can register up to about 6.9 million players before there is a 50% chance that just 2 of those 6.9 million players share the same ID.
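For reference, a few lines that reproduce that figure; this is just the approximation above evaluated for d = 32^characters (the loop bounds are arbitrary):

```java
// Evaluate the birthday square approximation n ≈ 0.5 + sqrt(0.25 + 2*d*ln(2))
// for d = 32^chars, i.e. IDs drawn from a 32-character alphabet.
public class IdLengthEstimate {
    public static void main(String[] args) {
        for (int chars = 6; chars <= 12; chars++) {
            double d = Math.pow(32, chars);
            double n = 0.5 + Math.sqrt(0.25 + 2 * d * Math.log(2));
            System.out.printf("%2d chars -> 50%% collision at ~%.2f million players%n",
                    chars, n / 1e6);
        }
        // 9 chars prints about 6.98 million, in line with the ~6.9 million above.
    }
}
```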
The game isn't even online-only! A collision only matters if those 2 players are both still actively playing at the same time and both decide to send a score to the scoreboard in the same week, because of the weekly score reset. So the actual number of players the game can hold is somewhat higher than that. (The game will probably never have that many players - it is just the small happy dream of every game startup. Well, at least the computation was fun.)
The ID will probably look like this, for easier reading: 5XT-339-A67

How to generate a unique GUID from two unique GUIDs, which are order-insignificant

I have an application whereby users have their own IDs.
The IDs are unique.
The IDs are GUIDs, so they include letters and numbers.
I want a formula whereby, if I have both IDs, I can find their combined GUID regardless of which order I use them in.
These GUIDs are 16 digits long; for the example below I will pretend they are 4.
user A: x43y
user B: f29a
If I use formula X, which takes two arguments, X(a, b), I want the produced code to give the same result regardless of whether a is user A's or user B's GUID.
I do not require a method to recover either user's ID from the result - i.e. it can be a one-way method.
Thank you for any answers or direction
So I'll turn my comment into an answer. Then this question can get answered, the answer accepted (if it is good enough) and we can all move on.
Sort the GUIDs lexicographically and append the second to the first. The result is unique, and has all the other characteristics you've asked for.
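A minimal sketch of that in Java, treating the IDs as strings (any consistent ordering would do; the toy 4-character IDs from the question are used below):

```java
// Sketch: combine two GUID strings into one order-insensitive identifier by
// sorting them lexicographically and concatenating. combine(a, b) == combine(b, a).
public class CombinedId {
    static String combine(String guidA, String guidB) {
        return guidA.compareTo(guidB) <= 0
                ? guidA + guidB
                : guidB + guidA;
    }

    public static void main(String[] args) {
        System.out.println(combine("x43y", "f29a"));   // f29ax43y
        System.out.println(combine("f29a", "x43y"));   // f29ax43y - same result
    }
}
```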
Can you compress it (I know you wrote shorten, but bear with me) down to 16 characters? No, you can't - not, that is, if you want to be able to decompress it again and recover the original bits. (You've written that you don't need to be able to recover the original GUIDs; skip the next paragraph if you want to.)
A GUID is, essentially, a random sequence of 128 bits. Random sequences can't, by definition, be compressed. If a sequence of 128 bits is compressible it can't be random, there would have to be some algorithm for inflating the compressed version back to 128 bits. I know that since GUIDs are generated algorithmically they're not truly random. However, in practice there is almost no point in regarding them as anything other than truly random; I certainly don't think you should waste your time trying to compress them.
Given that the total population of possible GUIDs is large, you might be satisfied by a method which takes the first half of each individual GUID and assembles a pseudo-GUID from them. Depending on how many GUIDs your system is likely to be working with, and your appetite for risk, this might satisfy your practical needs.

Count the frequency of bytes in a purely functional language

If we had an assignment:
Given a block of binary data, count the frequency of the bytes within it.
and you were supposed to do it in C, the answer would be trivial and reasonably fast even for large binary blocks. How would one go about implementing this in a purely functional language, without side effects?
For example, if you wrote a function that accepted the frequency counts for each byte plus the rest of the list of bytes, and returned modified frequency counts, it would have to do an awful lot of work for a data set of 100M bytes.
Also, if you sorted the data and then somehow counted the runs of consecutive identical bytes, the sort itself would take a lot of time.
Is there a reasonable way to implement this?
The straightforward way to do it is indeed to pass in and return data structures mapping bytes to counts. This would probably be implemented as some kind of tree (since that's what you get out of the standard library containers, as far as I know). In pure functional programming when you're passed in a tree and you need to return a new tree with a difference in only one node, the returned tree ends up sharing almost all of its structure and data with the original tree.
There is some overhead in traversing the tree to get to the count, but since you're counting bytes the tree never has more than 256 elements, so the overhead is at most log(256), which is a constant. It doesn't get larger for large data sets - it doesn't change the big-O complexity of the algorithm. That's actually true even if you use the greatest possible overhead of copying around a full 256-entry array of counts with no sharing.
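To make that worst case concrete, here is a sketch (in Java rather than a purely functional language, so take it only as an illustration of the copying strategy): every update behaves like a pure function that returns a fresh 256-entry table, and only the newest table is ever kept.

```java
// Sketch of the "no sharing at all" worst case: every update copies the whole
// 256-entry count table instead of mutating it. Still O(n) overall, just with a
// constant factor of 256 per byte.
public class ByteFrequencies {
    // Pure-style update: never modifies its input, returns a new table.
    static int[] incremented(int[] counts, byte b) {
        int[] next = counts.clone();
        next[b & 0xFF]++;
        return next;
    }

    static int[] frequencies(byte[] data) {
        int[] counts = new int[256];
        for (byte b : data) {
            counts = incremented(counts, b);   // only the newest table is retained
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] freq = frequencies("hello".getBytes());
        System.out.println(freq['l']);   // 2
    }
}
```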
If you want to optimise this, you can take advantage of the fact that the "intermediate" frequency counts are never needed except as part of the computation of the next set of counts. That means you can use various techniques for getting the implementation to use destructive updates even while you're still semantically writing functional code. An STRef in Haskell basically lets you do this manually.
Theoretically the compiler could notice that you're replacing a never-needed-again value with a new one, so it could do the update in place for you. I don't know whether any actual production-ready compilers are currently able to make this optimisation.

A couple of questions about Hash Tables

I've been reading a lot about hash tables and how to implement one in C, and I think I have almost all the concepts in my head so I can start to code my own. I just have a couple of questions that I have yet to properly understand.
As a reference, I've been reading this:
http://eternallyconfuzzled.com/jsw_home.aspx
1) As I've read on the site above, a power of two or a prime number is recommended for the hash table size. The table is basically an array, and an array has a fixed size, so I can quickly look up the value I'm looking for. I can't declare a small array if I have a large input, as it won't fit, and I can't declare a very large array if my input data is not that large, because that's wasted memory.
What is the optimum size for the Hash Table? What should I base my decision on?
2) Also, on that site there are a couple of hashing functions, which I have yet to read through. It also states that it's always best to use a good known algorithm rather than roll my own. And I might do just that: I'll pick one from that site, test it out on my code, and see if it minimizes collisions based on my input data.
What's bugging me is how to control the hash range. The hash can't return an integer larger than the hash table size or we'll have a serious problem. How do I deal with this?
1) What you are referring to is the load factor of the hash table - the percentage of buckets that are expected to be filled. Wikipedia has this to say:
"With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases."
I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.
2) Just use the modulo operator (%) to keep the bucket index legal. The second operand should be the size of your bucket array.
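As a one-method sketch (shown in Java to keep one language across these examples; in C with an unsigned hash type the sign-bit mask is unnecessary):

```java
// Clamp an arbitrary hash value to a legal bucket index.
// The mask clears the sign bit, which matters for Java's signed ints;
// with an unsigned hash type (as in typical C code) the plain % is enough.
static int bucketIndex(int hash, int tableSize) {
    return (hash & 0x7fffffff) % tableSize;
}
```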
Pick a small size for your hash table. As you add stuff to your table, check what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements - make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.
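A sketch of that bookkeeping (the 70% grow threshold is the one mentioned above; the shrink threshold here is deliberately lower than 60% so a freshly grown table does not immediately shrink again, and the rehashing of elements into the new array is left out):

```java
// Decide the new capacity based on the current load factor.
// resize() elsewhere would allocate the new bucket array and re-insert every element.
static int desiredCapacity(int elementCount, int currentCapacity) {
    double loadFactor = (double) elementCount / currentCapacity;
    if (loadFactor > 0.70) return currentCapacity * 2;                // too full: grow
    if (loadFactor < 0.25 && currentCapacity > 16) return currentCapacity / 2; // sparse: shrink
    return currentCapacity;                                           // leave it alone
}
```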
I only say this because you seem to have known input data:
If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)
As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?

Resources