Can we assume anything about them? Are they globally unique (across all of Firebase)? Is there any sort of ordering? Does the client matter?
Is there a public library / documentation so I can generate those IDs as well?
I am referring to the ones generated by push
There is a blog post on it, as well as a Gist.
From the blog post, here's the core of What's in a Push Id:
A push ID contains 120 bits of information. The first 48 bits are a
timestamp, which both reduces the chance of collision and allows
consecutively created push IDs to sort chronologically. The timestamp
is followed by 72 bits of randomness, which ensures that even two
people creating push IDs at the exact same millisecond are extremely
unlikely to generate identical IDs. One caveat to the randomness is
that in order to preserve chronological ordering if a client creates
multiple push IDs in the same millisecond, we just ‘increment’ the
random bits by one.
To turn our 120 bits of information (timestamp + randomness) into an
ID that can be used as a Firebase key, we basically base64 encode it
into ASCII characters, but we use a modified base64 alphabet that
ensures the IDs will still sort correctly when ordered
lexicographically (since Firebase keys are ordered lexicographically).
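As a rough illustration of the scheme quoted above, here is a minimal Python sketch. It is not the official SDK code: the class name `PushIdGenerator` and the `now_ms` parameter are made up for this example, and the alphabet is assumed to be the 64 URL-safe characters ordered by ASCII value, which is what makes the encoded IDs sort lexicographically.

```python
import secrets
import time

# Assumed alphabet: 64 characters in ascending ASCII order, so that
# comparing encoded strings matches comparing the underlying numbers.
PUSH_CHARS = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"


class PushIdGenerator:
    """Sketch of a push-ID style generator: 48-bit timestamp + 72 random bits."""

    def __init__(self):
        self._last_time = None
        self._last_rand = [0] * 12  # 12 characters x 6 bits = 72 bits

    def generate(self, now_ms=None):
        # now_ms is a hypothetical hook so the clock can be fixed in tests.
        now = int(time.time() * 1000) if now_ms is None else now_ms
        duplicate_time = now == self._last_time
        self._last_time = now

        # Encode the timestamp as 8 characters, most significant first.
        ts_chars = [""] * 8
        for i in range(7, -1, -1):
            ts_chars[i] = PUSH_CHARS[now % 64]
            now //= 64

        if not duplicate_time:
            # New millisecond: draw 72 fresh random bits.
            self._last_rand = [secrets.randbelow(64) for _ in range(12)]
        else:
            # Same millisecond: "increment" the previous random bits by one,
            # carrying over any positions that are already at the maximum.
            i = 11
            while i >= 0 and self._last_rand[i] == 63:
                self._last_rand[i] = 0
                i -= 1
            if i >= 0:
                self._last_rand[i] += 1

        return "".join(ts_chars) + "".join(PUSH_CHARS[r] for r in self._last_rand)
```

Two IDs generated in the same millisecond share their first 8 characters and differ only in the incremented random part, so they still sort in creation order.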
Also worth noting are the ports to several different languages, done by the community:
Ruby
PHP
Python
Java
Nimrod
Go
Lua
Swift
So perhaps the best way to learn is to pick a language not on that list and port it!
I'm developing a chat app with Realtime Database as the backend, and this is the way I save the data into the DB:
I identify each message with the full uid of the user sending it.
Do you think this is necessary, or can I save only the first 10 characters (for example) of the uid in order to reduce bytes? My concern is whether at some point 2 different users will have the same first 10 characters.
There is no guarantee that the first 10 characters of a UID are going to be unique, so using those as an identifier is not a great idea.
If you want to use shorter IDs, the two first options that come to mind are:
Create your own identifier for each user in the room, for example by giving them a sequential ID, and then store that.
Use an actual hash function to determine a shorter unique ID for each user. While there is still a chance of collisions (multiple users getting the same ID), the chances are likely smaller than when you just take the first 10 characters.
In all cases, I'd highly recommend calculating the cost savings that you'll accomplish with this though. Message length is more typically the dominant factor in the size of each message.
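To make the hash-function option concrete, here is a small Python sketch. The function name `short_id` and the choice of SHA-256 with URL-safe base64 are assumptions for illustration only. Ten base64 characters carry 60 bits of the digest, so collisions are still possible, just spread evenly over the whole UID rather than depending on its first characters.

```python
import base64
import hashlib


def short_id(uid: str, length: int = 10) -> str:
    """Derive a shorter identifier from a full UID (hypothetical helper)."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    # URL-safe base64 avoids '/' and '+', neither of which is convenient in
    # database keys; truncate to the requested number of characters.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]
```

The same UID always maps to the same short ID, so it can be recomputed on any client without storing a lookup table.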
I'm a bit worried that I will reach the free data limits of Firebase in a student project.
Basically my question is:
Is it possible to append to the end of the string, instead of retrieving the key and value, appending, and uploading again?
What I want to achieve:
I have to create statistics of user right/wrong answers for particular questions.
I want to have a kvp:
answers: 1r/5w/3r
Where the number is the number of the user's guesses and r/w means right/wrong. Whenever the guessing session ends I want to add /numberOfGuesses+RightOrWrongAnswer at the end.
I'm using Unity 2018.
Thank you in advance for all the help!
I don't know how your game is architected or how many people are playing, but I'd be surprised if you hit your free limit on a student project (you can store 1GB and download 10GB). That string is 8 bytes; let's assume the worst-case scenario: as a UTF-32 string, that would be 32 bytes of data, so you'd have to pull it down roughly 312 million times to hit the 10GB download cap (there'll be some overhead, but I can't imagine it being hugely impactful). If you're afraid of being charged, you can opt to not have a credit card on file to be doubly sure you stay on a student budget.
If you want to reduce the amount of reading/writing though, I might suggest that instead of:
key: <value_string> (so, instead of session_id: "1r/5w/3r")
you structure more like:
key:
- wrong: 5
- right: 3
So have two more values nested under your key. One for all the wrong answers, just an incrementing integer. Then one for all the right answers: just an incrementing integer.
The mechanism to "append" would be a transaction, and you should use these whether you're mutating a string or counter. Firebase tries to be smart with data usage and offline caching, but you don't get much more control other than that.
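A transaction is driven by an update function that receives the current value and returns the new one. Below is a minimal Python sketch of such a function for the right/wrong counters suggested above. The name `record_answer` is made up for this example. In the Firebase Admin Python SDK a function like this could be handed to `Reference.transaction`; in Unity the equivalent would be written in C#, so treat this as an outline of the logic rather than drop-in code.

```python
def record_answer(current, is_right):
    """Return the new counter node given the current one (may be None).

    The database may call the update function more than once if another
    client writes in between, so it must not have side effects.
    """
    # Copy instead of mutating the value that was passed in.
    counts = dict(current) if current else {"right": 0, "wrong": 0}
    key = "right" if is_right else "wrong"
    counts[key] = counts.get(key, 0) + 1
    return counts
```

Because only two small integers are read and written, the payload stays constant in size no matter how many sessions are recorded.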
If order really matters, you might want to get cleverer. You'll generally want to work with the abstractions Realtime Database gives you though to maximize any inherent optimizations (it likes to think in terms of JSON documents, so think about your data layout similarly). This may not be as data optimal, but you may want to consider instead using a ledger of some kind (perhaps using ServerValue.Timestamp to record a single right or wrong answer, and having a cloud function listening to sum up the results in the background after a game - this would be especially useful if you plan on having a lot of users trying to write the same key at the same time).
The auto id's generated by an Android client in a Firestore collection seem to all meet certain criteria for me:
20 characters of length
Start with a - dash
Seem to cycle through characters based on time?
With the last point I mean that the first characters will look very similar if the creation happened in a similar time frame, e.g. -LZ.., -L_.., and -La... This describes the Flutter implementation.
However, looking at the Javascript implementation of auto id, I would assume that the only common criterion of all clients is the length of 20 characters. Is this assumption correct?
Across all clients, the auto id has a length of 20 characters:
iOS
Android
JavaScript (Web)
Flutter
You're referring to two types of IDs:
The push IDs as they are generated by the Firebase Realtime Database SDK when you call DatabaseReference.push() (or childByAutoId in iOS). These are described in The 2^120 Ways to Ensure Unique Identifiers, and a JavaScript implementation can be found here.
The auto IDs that are generated by the Cloud Firestore SDK when you call add(..) or doc() (without arguments). The JavaScript implementation of this can indeed be found in the Firestore SDK repo.
The only things these two IDs have in common are that they're designed to ensure enough entropy that realistically they will be globally unique, and that they're both 20 characters long.
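For comparison with push IDs, a purely random 20-character ID can be sketched in a few lines of Python. This is an illustration, not the SDK's code: the function name `auto_id` is made up, and the 62-character alphanumeric alphabet is an assumption based on what the generated IDs look like.

```python
import secrets
import string

# Assumed alphabet: upper-case letters, lower-case letters and digits.
AUTO_ID_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def auto_id(length: int = 20) -> str:
    """Return a random ID with no time component (hypothetical helper)."""
    return "".join(secrets.choice(AUTO_ID_ALPHABET) for _ in range(length))
```

Since no timestamp is involved, IDs created one after another share no common prefix and do not sort by creation time.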
I am working on a project where Firestore random keys are kind of important, so my question is: what are the chances that Firebase Firestore or the Realtime Database generates two or more identical random keys?
According to this blog link : The 2^120 Ways to Ensure Unique Identifiers
How Push IDs are Generated
Push IDs are string identifiers that are generated client-side. They
are a combination of a timestamp and some random bits. The timestamp
ensures they are ordered chronologically, and the random bits ensure
that each ID is unique, even if thousands of people are creating push
IDs at the same time.
What's in a Push ID?
A push ID contains 120 bits of information. The first 48 bits are a
timestamp, which both reduces the chance of collision and allows
consecutively created push IDs to sort chronologically. The timestamp
is followed by 72 bits of randomness, which ensures that even two
people creating push IDs at the exact same millisecond are extremely
unlikely to generate identical IDs. One caveat to the randomness is
that in order to preserve chronological ordering if a client creates
multiple push IDs in the same millisecond, we just 'increment' the
random bits by one.
To turn our 120 bits of information (timestamp + randomness) into an
ID that can be used as a Firebase key, we basically base64 encode it
into ASCII characters, but we use a modified base64 alphabet that
ensures the IDs will still sort correctly when ordered
lexicographically (since Firebase keys are ordered lexicographically).
While Gastón Saillén's answer is 100% correct regarding the pushed key from Firebase realtime database, I'll try to add a few more details.
When using DatabaseReference's push() method, it generates a key that has a time component, so two events can theoretically take place within the same millisecond, but there is an astronomically small chance that two users generate a key at the exact same moment and with the exact same randomness. Please also note that these keys are generated entirely on the client, without consulting the Firebase server. If you are interested, here is the algorithm that generates those keys. In the end, I can tell you that I haven't heard of a person who reported a problem with key collisions so far.
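To put a number on "astronomically small", the standard birthday approximation can be used. The Python sketch below assumes the IDs in question were all created in the same millisecond by independent clients, so only the 72 random bits distinguish them; the function name is made up for this example.

```python
import math


def collision_probability(n_ids: int, random_bits: int) -> float:
    """Approximate chance that at least two of n_ids random values coincide.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2.0 ** random_bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))
```

With 1,000 clients all pushing in the very same millisecond, `collision_probability(1000, 72)` comes out at roughly 1e-16.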
Unlike Firebase Realtime Database keys, Cloud Firestore ids are actually purely random. There's no time component included. This built-in generator for unique ids, which is used in Firestore when you call CollectionReference's add() method or CollectionReference's document() method without passing any parameters, generates random and highly unpredictable ids, which prevents hitting certain hotspots in the backend infrastructure. That's also the reason why there is no order if you check the documents in a collection in the Firebase console. Collisions of ids in this case are incredibly unlikely and you can/should assume they'll be completely unique. That's what they were designed for. Regarding the algorithm, you can check Frank van Puffelen's answer from this post. So you don't have to be concerned about these ids.
I've decided to use a GUID as the primary key for many of my project's DB tables. I think it is a good practice, especially with scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID and am searching for an alternative approach. I was actually interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know that it won't be guaranteed to be unique. I first thought about doing something like this:
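DateTime dt1970 = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
DateTime current = DateTime.UtcNow; // use UTC so the local time zone offset is not included
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds);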
Result Example:
1350433430523.66
This prints the total milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer a solution that is not a BIGINT auto-increment because it causes a lot less headache when scaling the DB with 3rd party tools, and backup/restore is less problematic because I can transfer data between servers and such if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number, which will give me a unique numerical string. Because the user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 figures and still get a unique ID, because I assume that a user won't insert data more than once per second at most, but I probably won't do that (what do you think about this idea?)
So I ask for your help. My data is expected to grow very big, 2TB a year with tens of thousands of new rows each second. I want URLs to look as "friendly" as possible, and prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube-like IDs you can see this answer. They are basically keeping a database table of all the random video IDs they generate. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key, however if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so each server doesn't generate the same unique identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long IDs is to use snowflake IDs. This is what Twitter uses to generate its IDs. All generated IDs are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the ID's use timestamp as the first component, they are time sortable (which is very important for query performance).
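The bit layout above (41-bit timestamp, 10-bit machine id, 12-bit sequence) can be sketched in Python as follows. This is an illustration under assumptions, not Twitter's implementation: the class name, the injectable `clock` parameter and the custom epoch of 2020-01-01 UTC are all made up for this example.

```python
import threading
import time

# Hypothetical custom epoch: 2020-01-01T00:00:00Z in milliseconds.
EPOCH_MS = 1577836800000


class SnowflakeGenerator:
    def __init__(self, machine_id, clock=None, epoch_ms=EPOCH_MS):
        if not 0 <= machine_id < 1024:  # machine id must fit in 10 bits
            raise ValueError("machine_id must be in [0, 1023]")
        self._machine_id = machine_id
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._epoch_ms = epoch_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the 12-bit local counter.
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            # timestamp (41 bits) | machine id (10 bits) | sequence (12 bits)
            return ((now - self._epoch_ms) << 22) | (self._machine_id << 12) | self._sequence
```

Each machine only needs its own `machine_id`, so no global lock or shared table is required, and the timestamp in the high bits keeps the IDs roughly time-sortable.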
Base64 Encoded GUID's
You can use ShortGuid which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive which may not be good for URL's if you are lower-casing them.
Base32 Encoded GUID's
There is also base32 encoding of GUID's, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
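Both encodings are easy to try out with the Python standard library. The sketch below is only an illustration: the function names are made up, and the byte order differs from .NET's `Guid.ToByteArray()`, so its output will not match what ShortGuid produces for the same GUID.

```python
import base64
import uuid


def to_base64_id(value: uuid.UUID) -> str:
    """22-character, case-sensitive, URL-safe form (padding stripped)."""
    return base64.urlsafe_b64encode(value.bytes).rstrip(b"=").decode("ascii")


def to_base32_id(value: uuid.UUID) -> str:
    """26-character form that can be written entirely in lower case."""
    return base64.b32encode(value.bytes).rstrip(b"=").decode("ascii").lower()


def from_base64_id(text: str) -> uuid.UUID:
    # Restore the two '=' padding characters removed by to_base64_id.
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


def from_base32_id(text: str) -> uuid.UUID:
    # Restore the six '=' padding characters removed by to_base32_id.
    return uuid.UUID(bytes=base64.b32decode(text.upper() + "======"))
```

Both forms are reversible, so the database can keep storing the 16-byte GUID while URLs carry the shorter string.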
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made i.e. their next post would be 2. You could also use YouTube's approach with their video ID but specific to a user, this could lead to some ridiculously short URL's.
The first, simplest and most practical scheme for unique keys
is an increasing sequence number in write order.
This represents the record number inside one database and provides unique numbering on a local scale: this is the often-met application-level requirement.
Next, the numerical approach based on a concatenation of time and counters is commonly used to ensure that concurrent transactions in same wagons will have unique ids before writing.
When the system gets highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the evoked scenario occurs : backup, restore, scale, migrate and perhaps prove some authenticity.
This article Generating Globally Unique Identifiers for Use with MongoDB
by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight about database and informatics.
UUIDs are 128 bits long. They introduce an amount of entropy
high enough to ensure the practical uniqueness of labels.
They can be represented by a string of 32 hex characters.
That is enough to write about 3.4 × 10^38 distinct values.
Here are a few more questions that can occur when considering the overall principle and the analysis:
Should database primary keys
and URLs (Uniform Resource Locators) be kept as two different entities?
Does this numbering destroy the sequentiality in the system?
Does providing a machine host number (h),
followed by a user number (u) and time (t) along a write index (i)
guarantee the PK huti to stay unique ?
Now considering the DB system:
primary keys should be kept numerical (even if written in hexadecimal)
the database system relies on it and this implies performance considerations.
their size should be fixed,
the system must answer rapidly to tell if it's potentially dealing with a PK or not.
Hashids
The hashing technique of YouTube is hashids.
It's a good choice:
the hashes are short and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it's designed to hash positive numbers.
However, it is a hash, and as such there is a probability that a collision happens. Collisions can be detected: a unique constraint is violated before they are stored, and in that case the generation should be run again.
Consider the comment to this answer to figure out how much entropy it's possible to get from a shorten sha1+b64 recipe.
Anticipating the collision scenario
calls for estimating the future size of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand with a nice synthetic style
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into a purely numeric string by converting all the letters in the GUID into numbers. Here is an example of what that would look like. It's a bit long, but if that is not a problem this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. But I would probably recommend using a long primary key instead; although it can be a bit of a pain, it's probably a safer way to do it than the function below.
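// Requires: using System; using System.Text;
string generateKey()
{
    // "N" formats the GUID as 32 hex digits without dashes.
    string hex = Guid.NewGuid().ToString("N");
    var newKey = new StringBuilder();
    foreach (char c in hex)
    {
        if (char.IsLetter(c))
        {
            // Hex letters (a-f) become their character codes (97-102).
            newKey.Append((int)c);
        }
        else
        {
            // Digits are kept as they are.
            newKey.Append(c);
        }
    }
    return newKey.ToString();
}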
Edit:
I did some testing taking only the first 20 digits, and out of 5000000 generated keys 4999978 were unique. But when using the first 25 digits it is 5000000 out of 5000000. I would recommend doing some more testing if you go with this method.
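The truncation test described above can be repeated with a short script. The Python sketch below mirrors the letter-to-number conversion and counts how many truncated keys stay unique. The function names are made up for this example, and Python's `uuid4` is assumed to be a reasonable stand-in for `Guid.NewGuid()`.

```python
import uuid


def numeric_key(guid: uuid.UUID) -> str:
    """Replace each hex letter (a-f) with its character code; keep digits."""
    return "".join(str(ord(c)) if c.isalpha() else c for c in guid.hex)


def count_unique_prefixes(n_keys: int, prefix_len: int) -> int:
    """Generate n_keys keys and count the distinct prefixes of prefix_len digits."""
    return len({numeric_key(uuid.uuid4())[:prefix_len] for _ in range(n_keys)})
```

Calling `count_unique_prefixes(5_000_000, 20)` and `count_unique_prefixes(5_000_000, 25)` re-runs the experiment; the exact counts will vary from run to run.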