DocumentDB auto-generated ID: GUID or UUID? Which variant?

TL;DR: Are the IDs that are auto-generated by DocumentDB supposed to be GUIDs or UUIDs, and is there actually a difference? If they are UUIDs, then which variant/version of UUID?
Background: Some of the DocumentDB client libraries will auto generate an ID for you if you do not provide one. I have seen it mentioned in the Azure blog and in several related questions that the generated IDs are GUIDs. I know there is some discussion over whether GUIDs are UUIDs, with many people saying that they are.
The problem: However, I have noticed that some of the IDs that DocumentDB auto-generates do not follow the UUID RFC, which allows only the digits 1-5 in the "version" nibble (V in xxxxxxxx-xxxx-Vxxx-xxxx-xxxxxxxxxxxx). DocumentDB generates IDs with any hex digit in that nibble, for example d981befd-d19b-ee48-35bd-c1b507d3ec4f, whose version nibble is the first e of ee48.
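To make the check concrete, here is a minimal Python sketch of how the version nibble can be inspected. The helper names `version_nibble` and `has_rfc4122_version` are made up for illustration; the only library call is the standard `uuid.uuid4()`.

```python
import uuid


def version_nibble(id_text: str) -> str:
    """Return the 'V' character in xxxxxxxx-xxxx-Vxxx-xxxx-xxxxxxxxxxxx."""
    groups = id_text.split("-")
    if [len(g) for g in groups] != [8, 4, 4, 4, 12]:
        raise ValueError(f"not in 8-4-4-4-12 form: {id_text!r}")
    return groups[2][0].lower()


def has_rfc4122_version(id_text: str) -> bool:
    """True if the version nibble is one of the digits 1-5."""
    return version_nibble(id_text) in "12345"


# ID quoted in the question above
sample = "d981befd-d19b-ee48-35bd-c1b507d3ec4f"
print(version_nibble(sample), has_rfc4122_version(sample))

# A version 4 UUID from the Python standard library, for comparison
generated = str(uuid.uuid4())
print(version_nibble(generated), has_rfc4122_version(generated))
```

The sample ID fails the check because its version nibble is `e`, while anything produced by `uuid.uuid4()` has `4` in that position.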
It is possible that this depends on which client is used to create the documents. In our DocumentDB database, we have documents whose third grouping is dde5, 627a, fe95, and so on. These documents were stored from within a stored procedure by calling Collection.createDocument() with the options {'disableAutomaticIdGeneration': false}. Other documents that I create through the third-party DocumentDB Studio application always have 4xxx in the third grouping, which is a valid UUID version. However, documents that I create through the Azure portal have non-standard third groupings like b359.
Question: Are the auto-generated DocumentDB IDs supposed to be GUIDs or UUIDs, and is there actually a difference? If UUIDs, then which variant?

Poking around in the source code on GitHub, I found that the various client and server side libraries use several different methods for creating what they're calling a GUID (in some libraries) or a UUID (in other libraries).
The Node.js client, JavaScript client, and server-side library manufacture what they call a GUID by concatenating runs of random hex digits and hyphens. The results are random, but they do not follow the rules for RFC 4122 version 4 UUIDs, which fix the version nibble to 4 and constrain the variant bits.
The Python client and Java client call their respective standard library methods to generate a random (version 4) UUID.
The .NET client is available via NuGet, but the source code is not yet published.
Summary:
Microsoft is not making a distinction between GUID and UUID in their client libraries. They are using the terms interchangeably.
What you get for a GUID/UUID depends on which client library you're using to call DocumentDB when you create your documents.
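The difference between the two approaches can be sketched in Python. Note that `naive_guid` below is a hypothetical stand-in for the "concatenate hex digits and hyphens" approach, not the actual code from any DocumentDB client; `uuid.uuid4()` is the standard-library call for a random (version 4) UUID.

```python
import random
import uuid

HEX_DIGITS = "0123456789abcdef"


def naive_guid(rng: random.Random) -> str:
    """Hypothetical stand-in: random hex digits joined by hyphens.

    No version or variant bits are set, so any hex digit can appear anywhere.
    """
    return "-".join(
        "".join(rng.choice(HEX_DIGITS) for _ in range(length))
        for length in (8, 4, 4, 4, 12)
    )


def is_rfc4122_v4(id_text: str) -> bool:
    """Check the version nibble (4) and the RFC 4122 variant nibble (8, 9, a or b)."""
    groups = id_text.lower().split("-")
    return groups[2][0] == "4" and groups[3][0] in "89ab"


rng = random.Random()
naive_ids = [naive_guid(rng) for _ in range(1000)]
standard_ids = [str(uuid.uuid4()) for _ in range(1000)]

print(sum(is_rfc4122_v4(i) for i in naive_ids), "of 1000 naive IDs look like v4")
print(sum(is_rfc4122_v4(i) for i in standard_ids), "of 1000 uuid4() IDs look like v4")
```

Only a small fraction of the naive IDs pass the check by chance, whereas every `uuid4()` result does. That matches the mix of third groupings described in the question.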

Related

Pact. How to test a REST GET with automatically generated ID in the URL

I want to test a REST service that returns the detail of a given entity identified by a UUID, i.e. my consumer pact has an interaction requesting a GET like this:
/cities/123e4567-e89b-12d3-a456-426655440000
So I need this specific record to exist in the database for the pact verifier to find it. In other projects I've achieved this by executing an SQL INSERT in the state setup, but in this case I'd prefer to use the microservice's JPA utilities for accessing the DB, because the data model is quite complex and using these utilities would save me a lot of effort and make the test much more maintainable.
The problem is that these utilities do not allow specifying the identifier when you create a new record (they assign an automatic ID). So after creating the entity (in the state setup) I'd like to tell the pact verifier to use the generated ID rather than the one specified by the consumer pact.
As far as I know, Pact matching techniques are not useful here because I need the microservice to receive this specific ID. Is there any way for the verifier to be aware of the correct ID to use in the call to the service?
You have two options here:
Option 1 - Find a way to use the UUID from the pact file
This option (in my opinion) would be the better one, because you are using well-known values for your verification. With JPA, I think you may be able to disable the auto-generation of the ID. And if you are using Hibernate as the JPA provider, it may not generate an ID if you have provided one (i.e. by setting the ID on the entity to the one from the pact file before saving it). This is what I have done recently.
Using a generator (as mentioned by Beth) would be a good mechanism for this problem, but there is no current way to provide a generator to use a specific value. They generate random ones on the fly.
Option 2 - Replace the ID in the URL
Depending on how you run the verification, you could use a request filter to change the UUID in the URL to the one which was created during the provider state callback. However, I feel this is a potentially bad thing to do, because you could change the request in a way that weakens the contract. You will not be verifying that your provider adheres to what the consumer specified.
If you choose this option, be careful to only change the UUID portion of the URL and nothing else.
For information on request filters, have a look at Gradle - Modifying the requests before they are sent and JUnit - Modifying the requests before they are sent in the Pact-JVM readmes.
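As a language-neutral illustration of "only change the UUID portion of the URL", here is a small Python sketch of the substitution itself. It is not the Pact-JVM request filter API; `replace_uuid_in_path` and the replacement ID are hypothetical, and in a real filter the new ID would come from the provider state callback.

```python
import re

# Matches the textual 8-4-4-4-12 form, without checking version or variant bits
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def replace_uuid_in_path(path: str, generated_id: str) -> str:
    """Swap exactly one UUID in the path for generated_id and touch nothing else."""
    new_path, count = UUID_PATTERN.subn(lambda _match: generated_id, path)
    if count != 1:
        # Refuse to guess: zero or several UUIDs means the request is not what we expect
        raise ValueError(f"expected exactly one UUID in {path!r}, found {count}")
    return new_path


pact_path = "/cities/123e4567-e89b-12d3-a456-426655440000"
generated = "0f8fad5b-d9cb-469f-a165-70867728950e"  # made-up ID for the example
print(replace_uuid_in_path(pact_path, generated))
```

Failing loudly when the path does not contain exactly one UUID keeps the filter from silently rewriting requests it was never meant to touch.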
Unfortunately not. The provider side verifier takes this information from the pact file itself and so can't know how to send anything else.
The best option is to use provider states to inject the specific record prior to this test case (or to just have the correct record in there in the first place).
You can use the JPA libraries during the provider state setup to change the UUID in the record to the one you're expecting.
If you are using pact-jvm on both the consumer and provider sides, I believe you may be able to use 'generators', but you'll need to look up the documentation on that as I haven't used them.

How should I specify the resource database via HTTP Requests

I have a REST API that will be facilitating CRUD against multiple databases. These databases all represent the same data for different locations within the organization (i.e. we have 20 or so implementations of a software package and we want to read from all of the supporting databases via one API).
I was wondering what the "best practice" would be for specifying which database to access resources from?
For example, right now in my request headers I have a custom "X-" header that would represent the database id. Unfortunately, this sort of thing feels a bit like a workaround.
I was thinking of a few other options:
I could bake the Database Id into the URI (/:db_id/resource/...)
I could modify the Accept Header like someone would with an API version
I could split up the API to be one service per database
Would one of the aforementioned options be considered "better" than the others, and if not what is considered the "best" option for this sort of architecture?
I am, at the moment, using ASP.NET Web API 2.
"These databases all represent the same data for different locations within the organization"
I think this is the key to your answer - you don't want to expose internal implementation details (like database IDs etc.) outside your API - what if you consolidate? or change your internal implementation one day?
However, this sentence reveals a distinction that is meaningful to the business - the location.
So - I'd make the location part of the URI:
/api/location/{locationId}/resource...
Then map the locationId internally to a database ID. LocationId could also be a name, or a code, or something unique that would be meaningful to the API client.
Then - if you later consolidate multiple locations to the same database or otherwise change your internal implementation, the clients don't have to change.
In addition, whoever is configuring the client applications, can do so thinking about something meaningful to the business - the location they are interested in.
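A rough sketch of that mapping, in Python rather than Web API 2 for brevity. The location codes, database names, and function names are all invented for the example; the point is only that clients see locations while the lookup table stays internal.

```python
# Internal mapping: clients never see the database identifiers on the right.
# Both the location codes and the database names here are made up.
LOCATION_TO_DATABASE = {
    "london": "db-07",
    "berlin": "db-12",
    "madrid": "db-12",  # two locations consolidated into one database
}


def parse_location(path: str) -> str:
    """Extract {locationId} from /api/location/{locationId}/resource..."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[:2] != ["api", "location"]:
        raise ValueError(f"unexpected path: {path!r}")
    return parts[2]


def resolve_database(location_id: str) -> str:
    """Map a business-level location to the internal database identifier."""
    try:
        return LOCATION_TO_DATABASE[location_id.lower()]
    except KeyError:
        raise LookupError(f"unknown location: {location_id!r}") from None


location = parse_location("/api/location/berlin/resource/42")
print(location, "->", resolve_database(location))
```

If Berlin and Madrid are later merged into one database, as in the table above, only the mapping changes and the URIs stay the same.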

Best UI interface/Language to query MarkLogic Data

We will be moving from Oracle and use MarkLogic 8 as our datastore and will be using MarkLogic's Java api to talk with data.
I am looking for a UI tool (like SQL Developer for Oracle) that can be used with ML. I found that ML's Query Manager can be used for accessing data. But I see multiple options wrt language:
SQL
SPARQL
XQuery
JavaScript
We need to perform CRUD operations and search for data, and our testing team is aware of SQL (for Oracle), so I am confused which route I should follow and on what basis I should decide which one/two will be better to explore. We are most likely to use JSON document type.
Any help/suggestions would be helpful.
You already mention you will be using the MarkLogic Java Client API. That should cover most common needs, including search, CRUD, facets, lexicon values, and also custom extensions through REST extensions, as the Client API leverages the MarkLogic REST API. It saves you from having to code inside MarkLogic to a large extent.
Apart from that you can run ad hoc commands from the Query Console, using one of the above mentioned languages. SQL will require the presence of a so-called SQL view (see also your earlier question Using SQL in Query Manager in MarkLogic). SPARQL will require enabling the triple index, and ingestion of RDF data.
That leaves XQuery and JavaScript, which have pretty much identical expressive power and performance. If you are unfamiliar with XQuery and XML languages in general, JavaScript might be more appealing.
HTH!

Lightweight method for adding persistent data to ASP.NET website?

Aside from creating SQL Server tables, is there a lightweight technology or method for adding persistent data to an ASP.NET website which works with LINQ and preferably doesn't require installing many packages into the project or learning a large framework?
Session state is one option but only if it is run out of process and configured for SQL Server which doesn't fit my needs.
Options to satisfy the question:
1. Session State -> Only if configured for out-of-process and SQL Server.
2. NoSQL Database Solutions -> MongoDB, RavenDB. SQLite (sqlite.org) is another lightweight option, although it is an embedded relational database rather than a NoSQL store.
3. SQL Server Key/Value Singleton -> Create a table and store key/value pairs as a single entry in the table, or create a generic key/value table. Keys will need to be unique, and values will need to be scalars only, or multiple values crammed into one key using a delimiter. A generic key/value table will need to store all values as strings and rely on type conversion, either implicit in the program or driven by a type stored in an extra column. See below:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
How to design a product table for many kinds of product where each product has many parameters
4. Create an XML file or other flat file and store/write key/values to it. May require special permissions.
I will likely go with option #3 because it satisfies my current requirements best but will explore the NoSQL solutions for future projects.
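For reference, the generic key/value table from option #3 might look roughly like this. The sketch uses Python with an in-memory SQLite database purely as a stand-in for SQL Server, and the table, column, and function names are invented; the extra `value_type` column is what drives the conversion back from strings.

```python
import sqlite3

# Supported scalar types and how to convert their stored text back
CONVERTERS = {"int": int, "float": float, "str": str}


def put(conn: sqlite3.Connection, key: str, value) -> None:
    """Store a scalar as text, remembering its type in an extra column."""
    type_name = type(value).__name__
    if type_name not in CONVERTERS:
        raise TypeError(f"unsupported value type: {type_name}")
    conn.execute(
        "INSERT OR REPLACE INTO settings (setting_key, setting_value, value_type) "
        "VALUES (?, ?, ?)",
        (key, str(value), type_name),
    )


def get(conn: sqlite3.Connection, key: str):
    """Read a value back and convert it using the stored type name."""
    row = conn.execute(
        "SELECT setting_value, value_type FROM settings WHERE setting_key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    text, type_name = row
    return CONVERTERS[type_name](text)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settings ("
    "setting_key TEXT PRIMARY KEY, "  # keys must be unique
    "setting_value TEXT NOT NULL, "
    "value_type TEXT NOT NULL)"
)
put(conn, "max_items", 25)
put(conn, "site_name", "demo")
print(get(conn, "max_items"), get(conn, "site_name"))
```

The primary key enforces the uniqueness requirement, and writing the same key twice simply overwrites the earlier value.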

Generating UUIDs in a web farm environment

I am planning on using sequential guids as primary keys/uuids as detailed in the post below
What are the performance improvement of Sequential Guid over standard Guid?
I am wondering if there are any gotchas as far as generating these guids across multiple web servers in a web farm environment. I suspect the chances of collision are impossibly low, but the fact that the MAC address of the web server and the timestamp would doubtless be involved in generating these guids gives me pause. I wonder whether, on a high-traffic website, the ordering could get messed up and the benefit of using sequential guids might be lost.
Please let me know what your experience is.
For what it is worth, my environment is ASP.NET 3.5, IIS 7 using Oracle 11g.
Incidentally, what is the data type I should use for guids in Oracle? I know that Sql Server has "uniqueidentifier"
Thanks for your advice
-Venu
Since I wrote the post you're referring to, I can answer this.
We're using the C# code shown in the post (without the reordering modification described in one of the replies, which I feel could improve performance a little further) in web farms with 2 to 8 application servers, and we have never had concurrency problems. I believe that the SequentialGuid function implemented in the Windows core DLLs already takes care of creating different GUIDs on different machines.
In database terms, having different machines insert different GUIDs means that each application server in the web farm writes data that resides in a specific region of the database (i.e. one application server writes GUIDs starting with 12345 and another writes GUIDs starting with 62373), so index updates still work efficiently because page splits happen rarely, if ever.
So, in my experience, no specific problem arises in a web farm environment if you use the strategy I outlined in my original message, as long as you use the proper method to generate the GUIDs.
I would avoid generating GUIDs in your own code, and I would also avoid generating them centrally.
Regarding the data type, we used char(36) because we like to waste a lot of space! Joking aside, we decided to store the data in a long and verbose form because a readable format makes maintenance much easier, but you can use an Oracle GUID or simply a RAW(16) data type (they're basically the same) and save 20 bytes per row. To make browsing and editing the data easier, you can give your customer a couple of functions that encode and decode raw GUID data so that the textual representation of the GUID is shown.
Guid for Oracle
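The encode/decode pair mentioned above could look something like this in Python, using only the standard `uuid` module. The function names are made up, and the byte order shown is the plain big-endian one; if the raw values are produced by .NET's `Guid.ToByteArray()`, which uses a mixed-endian layout, the corresponding `bytes_le` attribute would be the one to look at.

```python
import uuid


def encode_guid(text: str) -> bytes:
    """Textual GUID (36 characters) -> 16 raw bytes, suitable for a RAW(16) column."""
    return uuid.UUID(text).bytes


def decode_guid(raw: bytes) -> str:
    """16 raw bytes -> canonical lowercase 8-4-4-4-12 text."""
    return str(uuid.UUID(bytes=raw))


text = str(uuid.uuid4())
raw = encode_guid(text)
print(len(text), "characters as text,", len(raw), "bytes as raw")
print(decode_guid(raw) == text)
```

The 36 versus 16 comparison is where the 20 bytes per row come from.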
You might want to take a look at how NHibernate's Guid Comb generator works. I have never heard of a collision.
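To give a feel for the general "comb" idea (random bytes combined with a timestamp so that values generated later sort later), here is a simplified Python sketch. It is an illustration only, not NHibernate's actual algorithm; the byte layout, the millisecond resolution, and the `comb_guid` name are assumptions made for the example.

```python
import os
import time
import uuid
from typing import Optional


def comb_guid(now_ms: Optional[int] = None) -> uuid.UUID:
    """Build a 16-byte value from 10 random bytes followed by a 6-byte timestamp.

    now_ms can be passed in to make the result's time part predictable;
    by default the current time in milliseconds is used.
    """
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    random_part = os.urandom(10)
    # Keep the low 48 bits of the timestamp, most significant byte first
    time_part = (now_ms & 0xFFFFFFFFFFFF).to_bytes(6, "big")
    return uuid.UUID(bytes=random_part + time_part)


earlier = comb_guid(1_000)
later = comb_guid(2_000)
print(earlier, later)
print(earlier.bytes[10:] < later.bytes[10:])
```

Each server contributes its own random bytes, so no coordination between servers is needed, while the trailing time bytes keep values from roughly the same moment close together. Whether that ordering helps depends on how the database compares the stored bytes.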
To ensure that you have unique GUIDs only 1 server can be the creator of said GUIDs.
If memory serves, Oracle doesn't support the creation of MS's "Guid for OLE" but you should be able to generate something highly similar utilizing this: RAWTOHEX(SYS_GUID())
Alternatively, you could have a separate application residing on a single server that is solely responsible for generating GUIDs (for example, a web service located on a specific server, whose sole purpose is to generate and return GUIDs).
Finally, GUIDs are not sequential. Even if you generate one right after another, they won't increment in the same fashion as an integer (i.e. the last digit won't go from C to D in one step). Sequencing requires integers or some other numeric data type.

Resources