I was wondering what the purpose is of the codes defined in each validator, e.g. in https://github.com/symfony/symfony/blob/master/src/Symfony/Component/Validator/Constraints/NotBlank.php#L24
class NotBlank extends Constraint
{
const IS_BLANK_ERROR = 'c1051bb4-d103-4f74-8988-acbcafc7fdc3';
I could not find any documentation about them, not even in http://symfony.com/doc/master/validation/custom_constraint.html. What algorithm is used to generate them?
It seems to be a UUID. From Wikipedia:
A universally unique identifier (UUID) is a 128-bit number used to
identify information in computer systems. The term globally unique
identifier (GUID) is also used.
When generated according to the standard methods, UUIDs are for
practical purposes unique, without depending for their uniqueness on a
central registration authority or coordination between the parties
generating them, unlike most other numbering schemes. While the
probability that a UUID will be duplicated is not zero, it is close
enough to zero to be negligible.
In PHP you can generate one using the UUID PECL package or a library like this one.
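For illustration only, here is a minimal sketch in Python (standard library, not the PHP packages mentioned above) of generating a random UUID and inspecting the constant quoted from NotBlank. Whether the Symfony maintainers produced their constants this way is an assumption; the sketch only shows that the quoted value parses as a version 4 (random) UUID.

```python
import uuid

# Generate a random (version 4) UUID; it has the same 8-4-4-4-12
# hexadecimal layout as the constant in NotBlank.
error_code = str(uuid.uuid4())
print(error_code)

# Parse the constant quoted in the question and read its version field.
is_blank_error = uuid.UUID("c1051bb4-d103-4f74-8988-acbcafc7fdc3")
print(is_blank_error.version)
```

The version number is encoded in the first hex digit of the third group (the "4" in "4f74").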
According to the docs, the property id is special in Azure Cosmos DB documents: it must always be set and must have a unique value per partition. It also has additional restrictions on its content:
The following characters are restricted and cannot be used in the Id
property: '/', '\', '?', '#'
Obviously, this field is one of the document "keys" (in addition to _rid) and is used somehow in the internal plumbing. Other than the restrictions above, it is unclear how exactly this key is used internally and, more importantly for practitioners, which values make technically better ids than others.
Wild guess 1: In some database worlds one would prefer short primary key values, since the PK is included in index entries and shorter keys allow a more compact index for storage and lookup. Does the length of the id field matter at all beyond the one-time storage cost?
Wild guess 2: In some systems better throughput is achieved if common prefixes are avoided in names (e.g. Azure Storage container/blob names), and some even suggest adding a small random hash as a prefix. Does Cosmos DB care about id prefix similarities?
Anything else one should consider?
EDIT: To clarify, I'm interested in what is good for the Cosmos DB server's storage/execution side, given that my data model is still being designed and/or has multiple candidate keys the data designer can choose from.
First and foremost, let's clear something up. The id property is NOT unique across the collection. Your collection can have multiple documents with the exact same id. The id is ONLY unique within its own logical partition.
That said, based on everything we know from the documentation and talks, it doesn't really matter which value you choose. It is a string and Cosmos DB will treat it as such, but it is also considered a "primary key" internally, so restrictions apply, such as ordering by it.
Where it does matter is in your consuming application's business logic. The id plays a double role: it is both a Cosmos DB property and your own property. You get to set it. This is the value you are going to use to make direct reads against the database. If you use any other value, it is no longer a read but a query, which makes it more expensive and slower.
A good value to set is the id of the entity that is hosted in this collection. That way you can use the entity's id to read quickly and efficiently.
Scenario: I need to store a document accepted by the customer in my database. The customer needs to be sure that I don't modify it over time, and I need to be able to prove that the stored document was accepted by the customer.
Do you know of proven ways to achieve this that leave no doubt on either side?
I think I can create a checksum of the stored data for the customer, but I need to ensure that this checksum cannot be modified by the customer. Any ideas?
PS. If you have a better idea for how to title this question, please tell me.
PS. Please let me know if you know of a better forum for this question.
In cryptography, this is called data integrity.
To ensure that the data has not been changed by you or anyone else, your customer can calculate the hash of the file with a cryptographic hash function. Such functions are designed to be collision resistant, i.e.
Hash(Original) != Hash(Modified) // equality is almost impossible
In short, it is practically impossible (in cryptographic terms, the probability is negligible) for a modified document to have the same hash value as the original.
Your customer can use the SHA-3 hash function, which is standardized by NIST.
Don't use SHA-1, for which a practical collision has been demonstrated (the SHAttered attack).
If you want to go further, your customer can use an HMAC, a keyed hash construction that provides both data integrity and authentication of the data.
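The hashing and HMAC steps described above can be sketched with Python's standard library. This is only an illustration: the document contents and the shared key below are made-up placeholders, and SHA3-256 is picked as one member of the SHA-3 family.

```python
import hashlib
import hmac

# Hypothetical document contents, before and after a small change.
original = b"Contract v1: customer agrees to terms A and B."
modified = b"Contract v1: customer agrees to terms A and C."

# Plain hash: any change to the document changes the digest.
digest_original = hashlib.sha3_256(original).hexdigest()
digest_modified = hashlib.sha3_256(modified).hexdigest()
print(digest_original != digest_modified)

# HMAC: a keyed hash, so only holders of the key can produce a valid tag.
shared_key = b"hypothetical-shared-secret"  # placeholder, not a real key
tag = hmac.new(shared_key, original, hashlib.sha3_256).hexdigest()


def verify(key, message, expected_tag):
    """Recompute the tag and compare it in constant time."""
    candidate = hmac.new(key, message, hashlib.sha3_256).hexdigest()
    return hmac.compare_digest(candidate, expected_tag)


print(verify(shared_key, original, tag))
print(verify(shared_key, modified, tag))
```

Note that an HMAC relies on a key shared by both parties, so either of them could have produced the tag; it does not by itself prove which party accepted the document.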
The second part can be solved with digital signatures. Your customer signs the message
Sign(hash(message))
and gives you
( Sign(hash(message)), message )
and his public key.
You can verify the signature with the customer's public key to see whether the data has been changed. Digital signatures give us non-repudiation.
This part actually solves both of your problems. Even third parties can check that the data has not been modified and that it comes from the signer (your customer).
Note: don't use checksums that are not cryptographically secure; with most of them it is easy to modify the document in such a way that it keeps the same checksum.
In Corda, an OwnableState must specify an AbstractParty as an owner. There are two types of AbstractParty:
Party, with a well-known identity
AnonymousParty, identified solely by public key
If I create a CompositeKey to own the OwnableState, who then will store it in their vault as part of FinalityFlow?
At the moment nobody will unless lower level APIs are used.
The vault needs more work to fully understand multi-sig states, e.g. with cash, we need a way to select coins that we're participants of.
It's quite an advanced feature because composite keys have so many use cases. This is typical in the blockchain space: Bitcoin supported CHECKMULTISIG outputs in the protocol long before wallets that knew how to use them existed. And when wallets did start to appear, they had different code and features for different use cases. E.g. using multisig/composite keys for more secure wallets is different from using them for dispute mediation protocols.
At least with flows we have a straightforward way to implement support - we can make flows that understand composite keys and either have the certs linking the components to real parties, or know who they are some other way, and then go gather the signatures automatically.
I have decided to implement the following ID strategy for my documents, which combines the document "type" with the ID:
doc.id = "docType_" + Guid.NewGuid().ToString("n");
// create document in collection
This results in IDs such as the following for my documents:
usr_19d17037ea7f41a9b20db1a90f71d30d
usr_89fe82c93b264076aa1b6e1fb4813aaf
usr_2aa58c1c970a4c5eaa206a755c1c7bf4
msg_ec43510732ae47a6a5d5f323b7461d68
msg_3b03ceeb7e06490d998c3e368b435851
With a RangeIndex policy in place on the ID, I should be able to query the collection for specific types. For example:
SELECT * FROM c WHERE STARTSWITH(c.id, 'usr_') AND ...
Since this is a web application with many different document types, many of my app's queries would implement this STARTSWITH filter by default.
My main concern here is the use of a random GUID string on the ID. I know that in SQL Server I have had issues with index performance and fragmentation while using random GUIDs on the primary key in a clustered index.
Is there a similar concern here? It seems that in DocumentDB, the care of managing indexes has been abstracted away from you. Would a sequential ID be more ideal/performant in any way?
tl;dr: Use separate fields for the type and a GUID-only ID and use hash indexes on both.
This answer is necessarily going to be somewhat opinionated based upon the nature of your questions. Let me first address what appears to be your primary concern, namely the fragmentation of indexes affecting performance.
DocumentDB assumes the use of GUIDs and a hash index (as opposed to a range index) is ideally suited to finding the one matching entity by GUID. On the other hand, if you want to find a set of documents by looking at the beginning of the string, I suspect that would probably be more performant with a range index. This assumes that STARTSWITH is only optimized when used with range indexes, but I don't know for a fact that it is optimized even when you have a range index.
My recommendation would be to use separate fields for the type and a GUID-only ID and use hash indexes on both. This gives you the advantage of being assured that queries like the one you show would be highly performant and that queries which combine a type clause with other parameters would also be able to use at least one index. Note that hash indexes of this type (say 2x 3 bytes = 6 bytes/document) are highly space efficient, so don't worry about needing two of them. Those two combined should be much smaller than one range index, which needs enough precision to cover the entire length of your type+GUID.
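As a rough sketch of that document shape (in Python rather than the C# of the question, and with "type" as an assumed field name), the ID stays a bare GUID and the type lives in its own property:

```python
import uuid


def new_document(doc_type, **fields):
    """Build a document with a GUID-only id and a separate type field.

    The field name "type" is an assumption made for this illustration.
    """
    doc = {"id": uuid.uuid4().hex, "type": doc_type}
    doc.update(fields)
    return doc


user = new_document("usr", name="alice")
# Other documents reference the user by the bare GUID, not by "usr_<guid>".
message = new_document("msg", author_id=user["id"], body="hello")
print(user)
print(message)
```

With this layout the type filter becomes an equality test on the separate field (something like WHERE c.type = 'usr' instead of STARTSWITH on the id), and references such as author_id do not need to change if the type of the referenced document is later renamed or split.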
Other than the performance and space reasons already discussed, I can see a couple of other disadvantages to combining the type with the GUID: 1) when trying to retrieve a single document (both for direct use and as part of a foreign key lookup), having the GUID separate and using a hash index will be faster and more space efficient than using a range index on the combined field; 2) Combining the type with the ID greatly complicates certain migrations that commonly need to be done at a later date. Let's say that you decide to break your users into authors and readers for example. Users are foreign key referenced in other document types (blog post author, reader comment, etc.) by the user ID. If that ID includes the type, then you would need to not only change the user documents to accomplish the migration but you'd also need to find and change every foreign key. If the two fields (GUID and type) were separate, then you'd only need to change the user documents. Agile software craftsmanship is largely about making decisions that provide flexibility down the road.
As for the use of a sequential index, the trend in databases in general and NoSQL in particular, is that the complexity of providing a monotonically increasing sequential ID is greater than the space-efficiency advantages of that over a GUID. If you are going to stick with DocumentDB, I recommend that you just go with the flow and use GUIDs.
I've decided to use GUIDs as the primary key for many of my project's DB tables. I think it is good practice, especially with scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID format, so I am looking for an alternative approach. I was interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know that it won't be guaranteed to be unique. I first thought about doing something like this:
// Milliseconds elapsed since 1970-01-01.
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds);
Result Example:
1350433430523.66
This prints the total number of milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer a solution other than a BIGINT auto-increment because it causes far fewer headaches when scaling the DB with 3rd-party tools, and it makes backup/restore less problematic, because I can transfer data between servers if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number, which will give me a unique numerical string. Because a single user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, because I assume that a user won't insert data more than once per second at most, but I probably won't do that (what do you think about this idea?).
So I am asking for your help. My data is expected to grow very big: 2 TB a year, with tens of thousands of new rows each second. I want URLs to look as "friendly" as possible, and I prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube-like IDs you can see this answer. They basically keep a database table of all the random video IDs they have generated. When they need a new one, they generate a candidate and check the table for collisions. If they find a collision, they generate another one.
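A minimal sketch of that generate-check-retry loop, in Python. The in-memory set stands in for the database table, and the 11-character length and 64-symbol alphabet are assumptions modelled on the example URL in the question, not a description of YouTube's real implementation.

```python
import secrets
import string

# 64 URL-safe symbols (assumed alphabet for this illustration).
ALPHABET = string.ascii_letters + string.digits + "-_"


def new_video_id(existing_ids, length=11, max_attempts=10):
    """Draw random ids until one is not already in the collision table."""
    for _ in range(max_attempts):
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if candidate not in existing_ids:
            existing_ids.add(candidate)  # reserve the id
            return candidate
    raise RuntimeError("no free id found; consider a longer id length")


collision_table = set()  # stands in for the database table
print(new_video_id(collision_table))
```

In a real database the check and the insert should be one atomic step, for example by inserting under a unique constraint and retrying when the constraint is violated.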
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key. However, if you have multiple servers generating unique identifiers, you'll have to partition them somehow or introduce a global lock so that no two servers generate the same identifier.
Twitter Snowflake IDs
One solution to the partitioning problem with long IDs is to use Snowflake IDs. This is what Twitter uses to generate its IDs. All generated IDs are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the IDs use the timestamp as the first component, they are time sortable (which is very important for query performance).
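The bit layout above can be sketched as follows in Python. This is a simplified illustration, not Twitter's implementation: the custom epoch (2020-01-01 UTC) and the class and function names are assumptions made for the example.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_577_836_800_000  # assumed epoch: 2020-01-01T00:00:00Z
TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095


def compose_id(timestamp_ms, machine_id, sequence):
    """Pack timestamp, machine id and sequence into one 63-bit integer."""
    elapsed = timestamp_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp outside the 41-bit range")
    if not 0 <= machine_id < (1 << MACHINE_BITS):
        raise ValueError("machine id must fit in 10 bits")
    if not 0 <= sequence <= MAX_SEQUENCE:
        raise ValueError("sequence must fit in 12 bits")
    return (
        (elapsed << (MACHINE_BITS + SEQUENCE_BITS))
        | (machine_id << SEQUENCE_BITS)
        | sequence
    )


class SnowflakeGenerator:
    """Hands out increasing ids for one machine; safe to call from threads."""

    def __init__(self, machine_id):
        self.machine_id = machine_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                now = self._last_ms  # clock moved backwards: stay put
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 ids already issued this millisecond: wait.
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            return compose_id(now, self.machine_id, self._sequence)


generator = SnowflakeGenerator(machine_id=5)
print(generator.next_id())
```

Because the machine id is part of every value, two servers configured with different machine ids cannot produce the same ID, so no global lock is needed.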
Base64 Encoded GUIDs
You can use ShortGuid, which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive, which may not be good for URLs if you are lower-casing them.
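The same idea can be sketched in Python with the standard library. This is not the ShortGuid library itself, and its output is not guaranteed to be byte-for-byte identical to ShortGuid's (.NET orders some GUID bytes differently); the function names are made up for the example.

```python
import base64
import uuid


def encode_short(guid):
    """Encode the 16 GUID bytes as 22 URL-safe base64 characters."""
    return base64.urlsafe_b64encode(guid.bytes).rstrip(b"=").decode("ascii")


def decode_short(text):
    """Restore the padding that was stripped and rebuild the GUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


guid = uuid.uuid4()
short = encode_short(guid)
print(short, len(short))
print(decode_short(short) == guid)
```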
Base32 Encoded GUIDs
There is also base32 encoding of GUIDs; see this answer for details. These are slightly longer than the ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4), but the advantage is that they can be all lower case.
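A comparable sketch for base32, again in Python with made-up function names. It uses the standard RFC 4648 alphabet, which may differ from the alphabet used in the linked answer.

```python
import base64
import uuid


def encode_base32(guid):
    """Encode the 16 GUID bytes as 26 lower-case base32 characters."""
    return base64.b32encode(guid.bytes).decode("ascii").rstrip("=").lower()


def decode_base32(text):
    """Upper-case, re-pad to a multiple of 8 characters, then decode."""
    padded = text.upper() + "=" * (-len(text) % 8)
    return uuid.UUID(bytes=base64.b32decode(padded))


guid = uuid.uuid4()
encoded = encode_base32(guid)
print(encoded, len(encoded))
print(decode_base32(encoded) == guid)
```

Since the alphabet is case insensitive, lower-casing a URL that contains such an ID does not change which GUID it decodes to.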
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video ID but make it specific to a user; this could lead to some ridiculously short URLs.
The first, simplest and most practical scheme for unique keys is an increasing sequence number that follows the write order.
This represents the record number inside one database and provides unique numbering on a local scale: this is the application-level requirement most often met.
Next, a numerical approach based on concatenating time and counters is commonly used to ensure that concurrent transactions travelling in the same "wagon" have unique ids before they are written.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the scenarios you mention occur: backup, restore, scaling, migration and perhaps proving some authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight into databases and computing.
UUIDs are 128 bits long. They introduce enough entropy to ensure the practical uniqueness of labels.
They can be represented as a string of 32 hexadecimal characters, enough to encode 2^128 (about 3.4 x 10^38) distinct values.
Here are a few more questions that can arise when considering the overall principle and the analysis:
Should database primary keys and Uniform Resource Locators be kept as two different entities?
Does this numbering destroy sequentiality in the system?
Does providing a machine host number (h), followed by a user number (u) and time (t) along with a write index (i), guarantee that the PK huti stays unique?
Now considering the DB system:
Primary keys should be kept numerical (even if hexadecimal): the database system relies on them, and this has performance implications.
Their size should be fixed: the system must be able to tell rapidly whether or not it is potentially dealing with a PK.
Hashids
The hashing technique used by YouTube resembles hashids.
It's a good choice:
the hashes are short and their length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it is designed to hash positive numbers.
However, it is a hash, and as such there is a probability that a collision happens. Collisions can be detected: the unique constraint is violated before they are stored, and in that case the generation should be run again.
Consider the comment on this answer to figure out how much entropy it's possible to get from a shortened sha1+b64 recipe.
Anticipating the collision scenario calls for an estimate of the future size of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
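To put numbers on that estimate, the usual birthday approximation can be used: with n uniformly random ids drawn from a space of N = 2^bits values, the probability of at least one collision is roughly 1 - exp(-n(n-1)/(2N)). The Python sketch below is only that approximation and assumes the ids are uniformly random; the function name is made up for the example.

```python
import math


def collision_probability(num_ids, bits):
    """Birthday approximation for uniformly random ids of the given bit size."""
    space = 2.0 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-num_ids * (num_ids - 1) / (2.0 * space))


# About 4.3 billion random 64-bit ids: a collision is already quite likely.
print(collision_probability(2 ** 32, 64))
# A trillion random 128-bit ids: a collision is still extremely unlikely.
print(collision_probability(10 ** 12, 128))
```

Shortening an id reduces the number of bits, so the same formula shows how quickly the risk grows as the expected number of records increases.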
Milliseconds since epoch
Quoted from the previous article, which provides most of the answer to the problem at hand in a nicely concise style:
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into a purely numeric string by converting all the letters in the GUID into numbers. Here is an example of what that would look like. It's a bit long, but if that is not a problem this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead, though; it can be a bit of a pain, but it's probably safer than the function below.
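string generateKey()
{
    Guid guid = Guid.NewGuid();
    string newKey = "";
    // Walk the 32 hex digits of the GUID (dashes removed).
    foreach (char c in guid.ToString().Replace("-", ""))
    {
        if (char.IsLetter(c))
        {
            // Letters a-f are replaced by their character codes (97-102).
            newKey += (int)c;
        }
        else
        {
            // Digits are copied through unchanged.
            newKey += c;
        }
    }
    return newKey;
}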
Edit:
I did some testing: taking only the first 20 digits, 4999978 out of 5000000 generated keys were unique, but taking the first 25 digits, all 5000000 were unique. I would recommend doing some more testing if you go with this method.