I am working on DICOM gated (PET) data.
I would like to artificially create a DICOM image series that includes gated data. My question is about the values of SOPInstanceUID, which labels each image slice in each phase or gate.
These values differ for each slice in a gate and appear to be incremented between gates, but I can't work out the logic by which they are chosen.
Is there a reference to where and how these values are written?
Several algorithms for generating DICOM UIDs, along with their drawbacks, are explained in this answer.
As per the DICOM specification, all UIDs, including the SOPInstanceUID in question, must be unique. This holds regardless of what kind of data (gated PET or otherwise) you are working with.
Following is from specifications:
2017a Part 5 - Data Structures and Encoding (9 Unique Identifiers (UIDs))
Unique Identifiers (UIDs) provide the capability to uniquely identify a wide variety of items. They guarantee uniqueness across multiple countries, sites, vendors and equipment. Different classes of objects, instance of objects and information entities can be distinguished from one another across the DICOM universe of discourse irrespective of any semantic context.
UID consists of two parts:
Organization root:
This part of the UID ensures uniqueness across organizations. There are service providers who offer a root for free. Medical Connections is the one I am aware of. You can contact them to get one for free.
Suffix:
Further, you should generate the suffix in such a way that it guarantees uniqueness inside your organization.
Following are the general rules for DICOM UID:
Total length must be <= 64 characters, including the stops
Must contain only digits 0-9 and full stops
Each numeric "component" (between stops) must be a valid and unambiguous integer number, and so must not have a leading zero (unless the whole component is zero)
Must be guaranteed to be unique - this means:
It must be derived from a proper official root under your sole control.
It must not be created by appending digits (however special you consider the combination!) to someone else's UID.
In particular, series UIDs for secondary capture images, KIN objects etc. must not be created as derivatives of the Study UID (unless you own that root!)
Related to the above, there is no expectation or requirement that the Study UID, Series UID and Instance UID for images should be derived from the same root (though in practice, Series UID and Instance UID normally are, as both must be generated internally by the equipment which generates the images)
Date and Time are useful for generating UIDs, but only if:
Each machine has a unique root (normally your company UID root + a machine-specific suffix such as a serial number)
If it is possible for UIDs to be generated at > 1 per second, then a sequential counter should also be used
If on a multi-threaded machine, then the thread ID or a properly interlocked counter is needed to prevent 2 applications or 2 threads in the same application from generating identical UIDs simultaneously.
Do not put the time in a component of its own - it is too easy to end up with a leading zero, e.g. 20060724.093017; use 20060724093017 instead
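The syntactic rules above (length, allowed characters, no leading zeros) can be checked mechanically. The following is a minimal Python sketch; the function name is_valid_uid is hypothetical, and it checks syntax only, since no code can verify that a UID is actually unique or derived from a root you own.

```python
import re

# One component is either "0" or a digit string without a leading zero.
_COMPONENT = r"(?:0|[1-9][0-9]*)"
_UID_PATTERN = re.compile(rf"{_COMPONENT}(?:\.{_COMPONENT})*")


def is_valid_uid(uid: str) -> bool:
    """Check length, character set and leading zeros (syntax only)."""
    return len(uid) <= 64 and _UID_PATTERN.fullmatch(uid) is not None


# The time component "093017" has a leading zero, so the second UID is rejected.
print(is_valid_uid("1.2.840.10008.1.2"))
print(is_valid_uid("1.2.20060724.093017"))
```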
The same can be found in the specification.
The following example of generating a UID is from the DICOM specification. Please note that this is an informative section.
2017a Part 5 - Data Structures and Encoding (B Creating a Privately Defined Unique Identifier (Informative))
B.1 Organizationally Derived UID:
The following example presents a particular choice made by a specific
organization in defining its suffix to guarantee uniqueness of a SOP
Instance UID.
"1.2.840.xxxxx.3.152.235.2.12.187636473"
In this example, the root is:
1 Identifies ISO
2 Identifies ANSI Member Body
840 Country code of a specific Member Body (U.S. for ANSI)
xxxxx Identifies a specific Organization.(assigned by ANSI)
In this example the first two components of the suffix relate to the
identification of the device:
3 Manufacturer defined device type
152 Manufacturer defined serial number
The remaining four components of the suffix relate to the
identification of the image:
235 Study number
2 Series number
12 Image number
187636473 Encoded date and time stamp of image acquisition
In this example, the organization has chosen these components to
guarantee uniqueness. Other organizations may choose an entirely
different series of components to uniquely identify its images. For
example it may have been perfectly valid to omit the Study Number,
Series Number and Image Number if the time stamp had a sufficient
precision to ensure that no two images might have the same date and
time stamp. Because of the flexibility allowed by the DICOM Standard
in creating organizationally derived UIDs, implementations should not
depend on any assumed structure of UIDs and should not attempt to
parse UIDs to extract the semantics of some of its components.
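As an illustration of an organizationally derived UID, here is a minimal Python sketch that combines a root, device identifiers, a timestamp and an interlocked counter. Everything in it is an assumption made for the example: the root "1.2.3.4" is a placeholder and not a registered root, the device type and serial number are borrowed from the example above, and the counter only protects against duplicates within one running process (a restart within the same second would need extra care, such as a persisted counter).

```python
import threading
import time

ROOT = "1.2.3.4"     # placeholder only, NOT a real registered root
DEVICE_TYPE = 3      # hypothetical manufacturer-defined device type
SERIAL_NUMBER = 152  # hypothetical device serial number

_lock = threading.Lock()
_counter = 0


def new_uid() -> str:
    """Return root.device.serial.timestamp.counter as a UID string."""
    global _counter
    with _lock:  # interlocked counter: two threads never see the same value
        _counter += 1
        count = _counter
    # Date and time in one component, so a leading zero cannot occur.
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())
    uid = f"{ROOT}.{DEVICE_TYPE}.{SERIAL_NUMBER}.{stamp}.{count}"
    if len(uid) > 64:
        raise ValueError("UID exceeds 64 characters")
    return uid


print(new_uid())
```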
There is one more way mentioned in the specification:
2017a Part 5 - Data Structures and Encoding (B Creating a Privately Defined Unique Identifier (Informative))
B.2 UUID Derived UID:
UID may be constructed from the root "2.25." followed by a decimal representation of a Universally Unique Identifier (UUID). That decimal representation treats the 128 bit UUID as an integer, and may thus be up to 39 digits long (leading zeros must be suppressed).
A UUID derived UID may be appropriate for dynamically created UIDs, such as SOP Instance UIDs, but is usually not appropriate for UIDs determined during application software design, such as private SOP Class or Transfer Syntax UIDs, or Implementation Class UIDs.
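A UUID-derived UID is easy to produce with a standard library. The following Python sketch assumes a random (version 4) UUID is acceptable; the function name is hypothetical.

```python
import uuid


def uuid_derived_uid() -> str:
    """Return "2.25." followed by the UUID as a decimal integer."""
    # str() of an int never has leading zeros, as the rule requires.
    return "2.25." + str(uuid.uuid4().int)


print(uuid_derived_uid())
```

The result is at most 5 + 39 = 44 characters long, well within the 64-character limit.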
Meteor uses its internal Random package to generate Mongo IDs for documents, where the set of characters used is defined as:
var UNMISTAKABLE_CHARS = "23456789ABCDEFGHJKLMNPQRSTWXYZabcdefghijkmnopqrstuvwxyz";
The method description for Random.id also states:
Return a unique identifier, such as "Jjwjg6gouWLXhMGKW", that is likely to be unique in the whole world.
which is defined for the default length of an Id (17 chars; each one of UNMISTAKABLE_CHARS).
Now I would like to use only the first N characters of the Id to shorten my URLs (which include the Ids to dynamically load pages that require a specific document, which is determined by the Id).
So if my original Id is
`v5sw59HEdX9KM5KQE`
I would like to use for example (consider a totally random-picked N=5 here):
{
_id:"v5sw59HEdX9KM5KQE",
short: "v5sw5"
}
as document schema and fetch the respective document by this Id using { short } as query in my Mongo.Collection.
Now my question is: how many characters are sufficient to prevent collisions if the number of documents (and thus Ids) is between 5000 and 10000?
Note: I have some tools on entropy calculation and all these values (character set, length of the original Ids, number of documents) in front of me but I don't know how to wire this all up to safely calculate N.
If I understand correctly, besides the normal 17-character id generated for your documents' _id, you would like a shorter id so that URLs look less scary when they contain that id.
In your example you truncate the id, hence creating an explicit association between your shorter id and the original document id.
This sounds like git's shortened commit hashes: How does Git(Hub) handle possible collisions from short SHAs?
You could follow a similar path, i.e. first determine an initial default length that is reasonable to avoid probable collisions (as explained in Peter O.'s answer), but explicitly check for uniqueness server-side and increase the length of any new shortened version in case of a collision, until it becomes unique again.
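The "lengthen on collision" idea can be sketched as follows in Python. This is only an illustration under assumptions: the function name assign_short_id is hypothetical, and the in-memory set stands in for a server-side uniqueness check (for example a unique index on the short field).

```python
def assign_short_id(full_id: str, taken: set, min_length: int = 5) -> str:
    """Return the shortest unused prefix of full_id, at least min_length long."""
    for length in range(min_length, len(full_id) + 1):
        candidate = full_id[:length]
        if candidate not in taken:
            taken.add(candidate)  # reserve it so later ids cannot reuse it
            return candidate
    raise ValueError("the full id itself is already taken")


taken = set()
print(assign_short_id("v5sw59HEdX9KM5KQE", taken))  # first id gets 5 characters
print(assign_short_id("v5sw5ZZZdX9KM5KQE", taken))  # same 5-char prefix, so 6 are used
```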
Generating identifiers at random already runs the risk, at least in theory, of generating a duplicate identifier. For the default length of MongoIDs (17 characters, so there are 55^17 of them), the chance of having a duplicate MongoID reaches 50% after generating almost 731156 billion random MongoIDs (see "Birthday problem"), so the chance of a duplicate is negligible in practice for most applications.
Shortening a random identifier makes the collision problem much worse. In this case, for an ID length of 5 characters (resulting in 55^5 or 503284375 different IDs), the chance of having a duplicate ID reaches 50% after generating only about 26415 random IDs.
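These figures can be reproduced with the usual birthday-problem approximation, p ≈ 1 - exp(-n(n-1)/(2N)) for n IDs drawn from N possibilities. The Python sketch below assumes uniformly random IDs over the 55-character alphabet; the function names and the example risk threshold of one in a million are my own choices, not something Meteor defines.

```python
import math

ALPHABET_SIZE = 55  # number of characters in UNMISTAKABLE_CHARS


def collision_probability(n_ids: int, id_length: int) -> float:
    """Approximate chance of at least one duplicate among n_ids random IDs."""
    space = ALPHABET_SIZE ** id_length
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


def ids_for_even_odds(id_length: int) -> float:
    """Approximate number of IDs at which a duplicate becomes 50% likely."""
    return math.sqrt(2 * math.log(2) * ALPHABET_SIZE ** id_length)


def min_length(n_ids: int, max_risk: float) -> int:
    """Smallest ID length whose collision probability stays below max_risk."""
    length = 1
    while collision_probability(n_ids, length) > max_risk:
        length += 1
    return length


print(ids_for_even_odds(5))              # roughly 26 thousand
print(collision_probability(10_000, 5))  # roughly 9%
print(min_length(10_000, 1e-6))
```

With 10000 documents and a tolerated risk of one in a million, this approximation asks for 8 characters; a laxer threshold gives a shorter ID.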
Since it appears that you can't control how MongoIDs are generated as easily as you can control how shortened "unique IDs" are generated, one thing you can do is the following:
Create a document that assigns each MongoID to a uniquely assigned number (such as a monotonically increasing counter).
To make the numbers assigned this way "look random", feed each number to a so-called "full-period" linear congruential generator to get a unique but "randomized" number within the generator's period.
The numbers (encoded similarly to MongoID strings) can then serve as short identifiers for your purposes.
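A minimal Python sketch of this counter-plus-generator idea follows. It assumes 5-character IDs over Meteor's alphabet, so the modulus is 55^5. The multiplier and increment are example constants I picked to satisfy the Hull-Dobell full-period conditions for that modulus (the increment shares no factor with 55, and the multiplier minus one is divisible by 5 and 11); they are not taken from any library. One generator step is applied to each counter value, which is a one-to-one mapping, so distinct counters always give distinct IDs.

```python
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTWXYZabcdefghijkmnopqrstuvwxyz"
LENGTH = 5
MODULUS = len(ALPHABET) ** LENGTH  # 55**5 possible short IDs

# Example constants (assumptions): full period for MODULUS = 5**5 * 11**5.
MULTIPLIER = 55 * 7919 + 1
INCREMENT = 1234567


def scramble(counter: int) -> int:
    """Apply one LCG step; this permutes the range 0 .. MODULUS - 1."""
    if not 0 <= counter < MODULUS:
        raise ValueError("counter out of range for this ID length")
    return (MULTIPLIER * counter + INCREMENT) % MODULUS


def encode(value: int) -> str:
    """Write value as a fixed-width base-55 string over ALPHABET."""
    chars = []
    for _ in range(LENGTH):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def short_id(counter: int) -> str:
    return encode(scramble(counter))


print([short_id(n) for n in range(3)])
```

Anyone who knows the constants can predict or reverse these IDs, so this only makes them look random.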
But consider whether you are comfortable with the short identifiers created this way being predictable. If you need identifiers that cannot be guessed, short identifiers hardly achieve that goal.
If you wish to go the route of using shortened MongoIDs, see "Birthday problem" for formulas you can use to estimate how many random numbers it takes for the risk of collision to remain tolerable.
For further information on how Meteor generates MongoIDs, see also this question; one of its answers includes a way you can have MongoDB generate MongoIDs on the server rather than have Meteor do so on the client. It appears, too, that Meteor doesn't check the MongoIDs it generates for uniqueness before inserting them into a document.
I would argue that if you want to avoid collisions on a small collection then you don't want to use random ids, but either go with fully deterministic IDs or at least reduce the randomness to something more controlled. Along those lines, another option for you to consider would be to use MONGO for idGeneration in your collection. Those IDs are generated following a known recipe. Accordingly you could take characters 1-4 and 12 of that ID and would get a guarantee for no hash collisions as long as no more than N documents are created in the same second, where N is the number of characters used in MongoIDs (which I don't know off hand).
While working with the DICOM study, series and media concepts, I wondered if these values are to be unique over all data, or only to the patient they belong to.
Phrased otherwise; can I have 2 patients having a study/series/sop instance uid that is the same value for both patients?
Or does the DICOM standard simply not care about that, leaving it to the implementor to decide?
In DICOM, a Study (identified by its Study Instance UID) is always associated with a single Patient. See DICOM standard part 3 for details.
To answer your initial question/thought: a Unique Identifier (UID) has to be globally unique, i.e. world-wide over all patients, devices, hospitals, etc.
A UID in DICOM (no matter which UID) is always globally unique. So, to answer your question, uniqueness is not limited to the Patient level or any other level.
Following is from specifications:
2017a Part 5 - Data Structures and Encoding (9 Unique Identifiers (UIDs))
Unique Identifiers (UIDs) provide the capability to uniquely identify a wide variety of items. They guarantee uniqueness across multiple countries, sites, vendors and equipment. Different classes of objects, instance of objects and information entities can be distinguished from one another across the DICOM universe of discourse irrespective of any semantic context.
More details about DICOM UID can be found in this answer.
Your comment on the question, quoted below:
My question was more about what to do in case I choose to clone a patient in my system and attach the same dicom(s) to it. Should I regenerate the dicom-uid's or could I keep them as-is.
I am not sure what you mean by "clone". If the dataset changes while cloning, you should regenerate the SOP Instance UID. Even if you merely apply a lossy transfer syntax to your dataset, you should regenerate the SOP Instance UID. Any action that differentiates the dataset from the original requires a new SOP Instance UID. So, if you change patient demographics while cloning, a new UID should be generated. Whether a new Study Instance UID should also be generated depends on what is changed.
OTOH, if you are just copying your dataset to a different location, it is still the same dataset. You do not need to regenerate UIDs in this case.
Unfortunately, although the standard states that a UID should be globally unique, in my experience you cannot rely on it at the series level. I have come across series with duplicate IDs across studies. To protect yourself, use StudyUID + SeriesUID together as the unique series key.
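A defensive composite key of this kind could look like the following Python sketch. The function name and the record layout (tuples of study UID, series UID and some payload) are assumptions made for illustration.

```python
def index_series(records):
    """Group payloads by (study UID, series UID) rather than series UID alone."""
    index = {}
    for study_uid, series_uid, payload in records:
        index.setdefault((study_uid, series_uid), []).append(payload)
    return index


# Two studies that (incorrectly) share one series UID stay separate.
records = [
    ("1.2.3.1", "1.2.3.9", "slice A"),
    ("1.2.3.2", "1.2.3.9", "slice B"),
]
print(index_series(records))
```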
Can we assume anything about them? Are they globally unique (across all of Firebase)? Is there any sort of ordering? Does the client matter?
Is there a public library / documentation so I can generate those IDs as well?
I am referring to the ones generated by push
There is a blog post on it, as well as a Gist.
From the blog post, here's the core of What's in a Push Id:
A push ID contains 120 bits of information. The first 48 bits are a
timestamp, which both reduces the chance of collision and allows
consecutively created push IDs to sort chronologically. The timestamp
is followed by 72 bits of randomness, which ensures that even two
people creating push IDs at the exact same millisecond are extremely
unlikely to generate identical IDs. One caveat to the randomness is
that in order to preserve chronological ordering if a client creates
multiple push IDs in the same millisecond, we just ‘increment’ the
random bits by one.
To turn our 120 bits of information (timestamp + randomness) into an
ID that can be used as a Firebase key, we basically base64 encode it
into ASCII characters, but we use a modified base64 alphabet that
ensures the IDs will still sort correctly when ordered
lexicographically (since Firebase keys are ordered lexicographically).
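To make the description concrete, here is a Python sketch of a generator that follows it: 8 characters for the 48-bit timestamp followed by 12 characters for the 72 random bits, using a 64-character alphabet in ASCII order. It is an illustration written from the description above rather than a copy of the official code; the class name is hypothetical and the alphabet string is an assumption that should be checked against the Gist before relying on compatibility.

```python
import secrets
import time

# 64 characters in ascending ASCII order, so string order equals numeric order.
PUSH_CHARS = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"


class PushIdGenerator:
    def __init__(self):
        self._last_time = None
        self._last_rand = [0] * 12  # twelve 6-bit values = 72 bits

    def next_id(self, now_ms=None):
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        if now_ms == self._last_time:
            # Same millisecond: "increment" the random part by one.
            i = 11
            while i >= 0 and self._last_rand[i] == 63:
                self._last_rand[i] = 0
                i -= 1
            if i >= 0:
                self._last_rand[i] += 1
        else:
            self._last_rand = [secrets.randbelow(64) for _ in range(12)]
        self._last_time = now_ms

        # Encode the timestamp as 8 base-64 digits, most significant first.
        time_chars = []
        for _ in range(8):
            time_chars.append(PUSH_CHARS[now_ms % 64])
            now_ms //= 64
        rand_chars = [PUSH_CHARS[v] for v in self._last_rand]
        return "".join(reversed(time_chars)) + "".join(rand_chars)


gen = PushIdGenerator()
print(gen.next_id())
```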
Also worth noting are the ports to several different languages, done by the community:
Ruby
PHP
Python
Java
Nimrod
Go
Lua
Swift
So perhaps the best way to learn is pick a language not on that list and port it!
I've decided to use a GUID as the primary key for many of my project's DB tables. I think it is good practice, especially with scalability, backup and restore in mind. The problem is that I don't want to use the regular GUID format and am searching for an alternative approach. I was actually interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know that it won't be guaranteed to be unique. I first thought about doing something like this:
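// Milliseconds elapsed since the Unix epoch (1970-01-01 UTC)
DateTime dt1970 = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
DateTime current = DateTime.UtcNow;
TimeSpan span = current - dt1970;
double totalMilliseconds = span.TotalMilliseconds;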
Result Example:
1350433430523.66
This gives the total milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer a solution other than a BIGINT auto-increment because it makes scaling the DB with 3rd-party tools much less of a headache and makes backup/restore less problematic, since I can transfer data between servers if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (unique and not changeable by the user), so I can combine the numerical value of the name with the millisecond number, which will give me a unique numerical string. Because the user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, because I assume that a user won't insert data at more than 1 row per second at most, but I probably won't do that (what do you think about this idea?)
So I ask for your help. My data is assumed to grow very big: 2TB a year, with tens of thousands of new rows each second. I want URLs to look as "friendly" as possible, and prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube like GUID's you can see this answer. They are basically keeping a database table of all random video ID's they are generating. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
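The generate-check-retry loop described here can be sketched in Python as follows. The 11-character length matches the example URL in the question; the alphabet, the function name and the in-memory set (standing in for the database table with a unique constraint) are assumptions for illustration.

```python
import secrets
import string

# URL-safe alphabet of 64 characters (an assumption, not YouTube's documented one).
ID_ALPHABET = string.ascii_letters + string.digits + "-_"


def new_random_id(existing: set, length: int = 11, max_attempts: int = 10) -> str:
    """Draw random IDs until one is not yet in the collision table."""
    for _ in range(max_attempts):
        candidate = "".join(secrets.choice(ID_ALPHABET) for _ in range(length))
        if candidate not in existing:
            existing.add(candidate)  # in a real system: INSERT with a unique key
            return candidate
    raise RuntimeError("could not find a free ID; consider a longer length")


table = set()
print(new_random_id(table))
```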
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key; however, if you have multiple servers generating unique identifiers, you'll have to partition the ID space somehow or introduce a global lock so that two servers never generate the same identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long IDs is to use snowflake IDs. This is what Twitter uses to generate its IDs. All generated IDs are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the ID's use timestamp as the first component, they are time sortable (which is very important for query performance).
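The bit layout above can be sketched in Python as follows. This is an illustration of the layout, not Twitter's implementation: the class name, the custom epoch (2020-01-01 UTC) and the choice to busy-wait when the per-millisecond sequence is exhausted are all assumptions.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_577_836_800_000  # 2020-01-01 00:00:00 UTC (arbitrary choice)


class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        if not 0 <= machine_id < 1024:  # must fit in 10 bits
            raise ValueError("machine_id must be in 0..1023")
        self._machine_id = machine_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms() -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF  # 12-bit counter
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            # 41 bits of time | 10 bits of machine id | 12 bits of sequence
            return ((now - CUSTOM_EPOCH_MS) << 22) | (self._machine_id << 12) | self._sequence


gen = SnowflakeGenerator(machine_id=7)
print(gen.next_id())
```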
Base64 Encoded GUID's
You can use ShortGuid which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case sensitive which may not be good for URL's if you are lower-casing them.
Base32 Encoded GUID's
There is also base32 encoding of GUID's, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4) but the advantage is that they can be all lower case.
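Both encodings can be tried out with a few lines of Python. This sketch uses the standard uuid and base64 modules; the function names are hypothetical, and the byte order follows Python's UUID.bytes, which may differ from what a .NET Guid produces, so the strings are not necessarily interchangeable with ShortGuid output.

```python
import base64
import uuid


def to_base64(u: uuid.UUID) -> str:
    """22 characters, URL-safe alphabet, case sensitive."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def to_base32(u: uuid.UUID) -> str:
    """26 characters, lower case only."""
    return base64.b32encode(u.bytes).rstrip(b"=").decode("ascii").lower()


def from_base64(text: str) -> uuid.UUID:
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


def from_base32(text: str) -> uuid.UUID:
    return uuid.UUID(bytes=base64.b32decode(text.upper() + "======"))


u = uuid.uuid4()
print(to_base64(u), to_base32(u))
```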
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video ID but make it specific to a user; this could lead to some very short URLs.
The first, simplest and most practical scheme for unique keys is an increasing sequence that follows the write order.
This represents the record number inside one database and provides unique numbering on a local scale, which is the requirement most often met at the application level.
Next, a numerical approach based on concatenating a time and counters is commonly used to ensure that concurrent transactions in the same batch get unique IDs before writing.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the scenarios you mention occur: backup, restore, scale, migrate and perhaps proving some authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight into databases and computing.
UUIDs are 128 bits long. They introduce enough entropy to ensure the practical uniqueness of labels.
They can be represented as 32-character hex strings.
A 128-bit value can take about 3.4 x 10^38 different values.
Here are a few more questions that can occur when considering the overall principle and the analysis:
Should the primary keys of the database and the Uniform Resource Locator be kept as two different entities?
Does this numbering destroy the sequentiality in the system?
Does providing a machine host number (h), followed by a user number (u) and a time (t) along with a write index (i), guarantee that the PK huti stays unique?
Now considering the DB system:
primary keys should be kept numerical (possibly hexadecimal),
the database system relies on them and this has performance implications.
their size should be fixed,
the system must be able to tell quickly whether it is potentially dealing with a PK or not.
Hashids
The hashing technique used for YouTube-style IDs resembles hashids.
It's a good choice:
the hashes are short and the length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it's designed to hash positive numbers.
However, any scheme that shortens a hash carries some probability of a collision. Collisions can be detected: the unique constraint is violated before the row is stored, and in that case the generation should be run again.
Consider the comment on this answer to figure out how much entropy it's possible to get from a shortened sha1+b64 recipe.
Anticipating the collision scenario calls for an estimate of the future size of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Cited from the previous article, which answers most of the problem at hand in a nicely concise style:
It may not be necessary for you to encode every time since 1970
however. If you are only interested in keeping recent records close to
each other, you only need enough values to ensure that you don’t have
more values with the same prefix than your database can cache at once
What you could do is convert a GUID into a numeric-only string by converting all the letters in the GUID into numbers. Here is an example of what that would look like. It's a bit long, but if that is not a problem this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead, though; it can be a bit of a pain, but it's probably a safer way to do it than the function below.
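// Builds a numeric-only key from a new GUID by replacing each hex letter
// with its character code (e.g. 'a' becomes 97) and keeping digits as-is.
string generateKey()
{
    Guid guid = Guid.NewGuid();
    string newKey = "";
    foreach (char c in guid.ToString().Replace("-", "").ToCharArray())
    {
        if (char.IsLetter(c))
        {
            newKey += (int)c;
        }
        else
        {
            newKey += c;
        }
    }
    return newKey;
}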
Edit:
I did some testing taking only the first 20 digits, and out of 5000000 generated keys 4999978 were unique. When using the first 25 digits, 5000000 out of 5000000 were unique. I would recommend doing some more testing if you go with this method.
DICOM already provides a unique enough identifier for the Series (e.g. Series Instance UID), so why also include one on the lower level objects (e.g. SOPInstanceUID)?
What I find really annoying is the fact that when referencing other objects - for example when an RTPlan object references an RTStruct object via ReferencedStructureSetSequence / ReferencedSOPInstanceUID - it's done using the SOP Instance UID. However, the DICOM SCPs - such as find/move - don't work with the SOP Instance UID; they work with the Series Instance UID. So what gives? Do I have to load the whole Series to find all the referenced objects?
This question was from quite a while ago, but I thought I'd add that, ignoring QR altogether, a SeriesInstanceUID is a globally unique identifier for a single series. SOPInstanceUID is a globally unique identifier for a DICOM file. A series can have multiple DICOM files, so each would share that same SeriesInstanceUID, but each file would have its own SOPInstanceUID.
As you probably know, DICOM has a hierarchy of identifiers for each individual SOP (Service Object Pair) Instance (Patient ID / Study Instance UID / Series Instance UID / SOP Instance UID). This hierarchy is built into the Query/Retrieve mechanism in DICOM, and is also used to identify specific SOP Instances.
In the specific case you're mentioning, I believe there could be the possibility of multiple RT Structure Sets within a Series/Study. The individual SOP Instance must be referenced so that you know which Structure Set the RT Plan is referencing.
As for products supporting retrieving by SOP Instance UID, unfortunately, relational queries are not widely supported in DICOM Query/Retrieve SCPs, as you've discovered, and some DICOM servers do not support Image level queries. In this specific case, you could query at the series level specifically for the RTSTRUCT modality, and only retrieve the Series that have this modality, thus narrowing down which data you need to download to just the RT Structure Sets.
SOPInstanceUID is the separate UID of each DICOM image file. The study, series and SOP instance UIDs are based on the data model. The StudyUID identifies a particular study, which is divided into different series; the Series Instance UID is used for those. The SOP Instance UID identifies an individual DICOM image. It's a hierarchical structure. I never used SOPInstanceUID when I developed a PACS workstation in Java. In my experience, the Study and Series UIDs are enough to represent a patient's data. But SOPInstanceUID still gives a unique identity to each DICOM image.
SOP Instance UID: represents a unique identifier for an IOD instance. It is a Type 1 attribute, so it must be present and have a value.
For example:
Each DICOM image has a unique identifier.
Series reference is not specific enough. In the case of structure sets the Reference SOP Instance UID ties the contours in the structure set to the specific slice in the dataset. It's not enough to just reference the series because you have to ensure that the contour is exactly aligning with a slice.
SOPInstanceUID is for image-level identification.
Think of it like this:
A study can have multiple series, and a series can have multiple images/DICOM files.
So,
to identify a study uniquely we use StudyInstanceUID,
to identify a series uniquely we use SeriesInstanceUID, and
to identify an image/DICOM file uniquely we use SOPInstanceUID
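The hierarchy, and why a reference needs the SOP Instance UID, can be illustrated with a small in-memory index in Python. The class, the function name and all UID values below are made up for the example; this does not use any DICOM library and only models the three identifiers plus the modality.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    study_uid: str
    series_uid: str
    sop_uid: str
    modality: str


def build_indexes(instances):
    """Index instances by SOP Instance UID and group them by series."""
    by_sop = {inst.sop_uid: inst for inst in instances}
    by_series = defaultdict(list)
    for inst in instances:
        by_series[inst.series_uid].append(inst)
    return by_sop, by_series


# One series holding two structure sets (placeholder UIDs).
instances = [
    Instance("1.2.3.1", "1.2.3.1.1", "1.2.3.1.1.1", "RTSTRUCT"),
    Instance("1.2.3.1", "1.2.3.1.1", "1.2.3.1.1.2", "RTSTRUCT"),
]
by_sop, by_series = build_indexes(instances)

# A series UID alone matches both objects; the SOP Instance UID picks one.
print(len(by_series["1.2.3.1.1"]))
print(by_sop["1.2.3.1.1.2"])
```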