How to handle centimeter from/to inches - math

I have an international application that handles lengths and weights of people, and stores these in a database. I was wondering how to deal with this in case users can switch between using centimeters/inches in the application.
I was thinking of always using centimeters in the database and converting to inches if the user chooses inches. But of course, if the user enters a length in inches and it is converted to and stored as centimeters, the value may change slightly because of rounding errors.
How would you handle this scenario?

There is more to consider here than the information you have given. Before deciding how to store and convert the values, you must know what your acceptable error is. For instance, if you are calculating a trajectory to intercept an incoming missile with another missile, extremely fine precision is necessary to be successful. If this is a medical application used to precisely control medication formulation, precision matters far more than if you are simply calculating BMI.
In short, pick a standard, whether metric or something else, and stick with it for your storage type. Depending on the precision required, choose the smallest unit of that measurement system that gives you the accuracy you need. Any display in a different measurement system is then converted from this base measurement.
And try not to over-engineer the solution. If it will never conceivably be important to measure out to 52 decimal places, you are wasting effort and injecting unnecessary complication by accounting for that scenario.

Personally I would use one of two methods.
Always store it in the same unit of measure.
Store a unit of measure in a separate field so that you know whether the value is in centimeters or inches.
I prefer the first method since it makes processing easier.

So you would have rounding errors whether you convert from inches to centimeters or from centimeters to inches. The problem is the same no matter what you store in the DB.
You could store the values in the database in millimeters rather than centimeters. The smaller the unit, the more precise the stored value will be, even after conversion.

If the same user should be able to switch units in the front end, you should definitely store a single field holding the value in whichever unit you decide on, because the rounding errors will happen anyway.
If you have a group of users dealing only with inches and another dealing only with cm, and each of these groups has its own database or at least its "own values", then opt for two fields, value/unit (e.g. same software, different customer installations in different countries).
I'd store non-float values representing, for example, micrometers (with an unsigned 32-bit integer you can represent everything from about 4.3 km down to 0.001 mm).
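A minimal sketch of that idea, assuming a Java backend (the class and method names are invented, and a long is used instead of an unsigned 32-bit value for simplicity): store whole micrometers and convert only for display, so repeated unit switches never drift.

public final class LengthValue {
    // Store lengths as whole micrometers to avoid floating-point drift in the database.
    private static final double MICROMETERS_PER_INCH = 25_400.0; // exact: 1 in = 25.4 mm
    private static final double MICROMETERS_PER_CM   = 10_000.0;

    private final long micrometers; // a long comfortably covers human heights and far beyond

    public LengthValue(long micrometers) {
        this.micrometers = micrometers;
    }

    public static LengthValue fromInches(double inches) {
        return new LengthValue(Math.round(inches * MICROMETERS_PER_INCH));
    }

    public static LengthValue fromCentimeters(double cm) {
        return new LengthValue(Math.round(cm * MICROMETERS_PER_CM));
    }

    // Conversion (and rounding) happens only at display time.
    public double toInches()      { return micrometers / MICROMETERS_PER_INCH; }
    public double toCentimeters() { return micrometers / MICROMETERS_PER_CM; }
}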

Not sure why you would need a database unless you were storing your conversion rates.
There would be no way to detect metric or imperial because they are just numbers.
Your rounding errors will happen in accordance with the degree of accuracy you wish to display.

Depending on what you're going to be doing with those values (whether you need to do much aggregation at the DB layer, etc.), the best way to ensure there is no cumulative rounding error is to store the value in its original unit of measure with that unit of measure (id) in a separate column, and have a separate conversion table that you use for on-the-fly calculations when comparing, aggregating, etc.
This will not be super-efficient or convenient, however: you will always have to join to the conversion table before doing any work with the stored values.
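A rough sketch of that approach (the unit ids and the Java map here stand in for a real conversion table joined on unit id; the names are illustrative):

import java.util.Map;

public class UnitConversion {
    // Conversion table: factor from each stored unit to a common base unit (millimeters).
    // In the database this would be a separate table joined on unit id.
    private static final Map<String, Double> TO_MILLIMETERS = Map.of(
        "cm", 10.0,
        "in", 25.4
    );

    // Normalize a stored (value, unit) pair on the fly, e.g. before comparing or aggregating.
    public static double toMillimeters(double storedValue, String unitId) {
        Double factor = TO_MILLIMETERS.get(unitId);
        if (factor == null) {
            throw new IllegalArgumentException("Unknown unit: " + unitId);
        }
        return storedValue * factor;
    }
}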

This can be done very simply. There is just a single coefficient for converting between inches and centimeters, so you do not need to save the converted results in the DB at all. Just multiply or divide the number by the ratio, depending on the direction, to get the result. You can see how it works in this example I found in 2 minutes: http://inchpro.com/metric-system/convert-inches-to-centimeters
I think you know that storing all the converted values in a database takes up a lot of space.

Related

How can I create my own GUID algorithm with smaller "global"?

I have my own application with a far smaller "global" than the real globe, and I wanted a shorter version of a GUID. Now suppose I have a concrete number of IDs that I estimate will never be exceeded (for example 100 million IDs). How can I determine the number of random bits required to have the same properties as a GUID (globally unique, requiring no central authority to generate one)? Using a normal GUID would be overkill.
My "overkill" refers to this: I need the ID to be as easy to type/say/write down as possible, while still having an astronomically low collision chance like a GUID. I heard a GUID can be assigned to every grain of sand on Earth. My application is a game, each player gets one generated ID, and obviously my players are not as numerous as the grains of sand on Earth.
It would be best if a player could say something like "My ID is XXXX-XXXX". In that case, I'm not sure whether 8 characters of randomized hex would be enough or too much for 100 million players. (In reality I encode it to A-Z 0-9 instead of hex, though.) My game is not online-only, so I would like each player to be able to obtain a unique ID even when offline (no server to check for ID collisions).
A GUID has been designed to be globally unique. But I don't know why that results in a 128-bit sequence. Maybe they just chose a "very large" number that is a power of 2? I don't know what they were thinking when designing the GUID to ensure it will not clash. (Did they use the world population times something? If that is the case, I could likewise use 10 million times something.)
A 128-bit guid will generally perform well, because most compilers are smart enough to reduce operations on it to a pair of 64-bit operations (and on some CPUs, a single 128-bit extended operation). Java and C#/VB.NET would likely have quite a bit more overhead than C++, but if you are using Java or C#/VB.NET, you've already accepted quite a bit more overhead, and a GUID won't add much to it.
However, if you really need smaller values, you could manually reduce GUIDs, by XOR-ing the upper 64 bits with the lower 64 bits (thereby preserving some of the uniqueness of the original) to create a compact 64-bit mostly-unique number.
You could reduce to 32 bits or 48 bits in a similar way, folding the original GUID down rather than generating something new. This has the advantage that you are starting from a number that is intended to be unique across a very large set. However, keep in mind that 100 million items require a fairly high number of bits to keep the chance of overlap negligible, so you may just be setting yourself up for a very difficult-to-find problem later on if you aren't careful.
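A minimal sketch of that folding idea, assuming java.util.UUID as the source of the original 128-bit value:

import java.util.UUID;

public class CompactId {
    // Fold a 128-bit UUID down to 64 bits by XOR-ing its two halves,
    // preserving some of the randomness of the original GUID.
    public static long fold64(UUID uuid) {
        return uuid.getMostSignificantBits() ^ uuid.getLeastSignificantBits();
    }

    // Fold again to 32 bits in the same way, e.g. fold32(UUID.randomUUID()).
    public static int fold32(UUID uuid) {
        long x = fold64(uuid);
        return (int) (x ^ (x >>> 32));
    }
}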
A crude but probably equally effective approach is to use a cryptographically secure random number generator and construct a number as large as you need (probably 48 bits at minimum). It is important not to take the result modulo a smaller range, or you could significantly reduce the uniqueness (by biasing the values toward part of the range).
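And a sketch of the random-number approach using Java's SecureRandom, assembling 48 bits from raw bytes rather than taking a modulo:

import java.security.SecureRandom;

public class RandomId {
    private static final SecureRandom RNG = new SecureRandom();

    // Draw 6 random bytes (48 bits) and assemble them into a long,
    // avoiding any modulo step so the full range stays uniformly covered.
    public static long next48() {
        byte[] bytes = new byte[6];
        RNG.nextBytes(bytes);
        long id = 0;
        for (byte b : bytes) {
            id = (id << 8) | (b & 0xFF);
        }
        return id;
    }
}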
I am assuming you cannot use a sequential id, although you may want to revisit that idea and see if there is a way to make a sequential id work. For example, you could use a sequential id paired with a random seed number, guaranteeing uniqueness without requiring a large number, and allowing internal indexing operations and similar optimizations that are common with large data sets.
OK, I have discussed this with a friend and we came up with a solution. This is how we decided the number of "characters" in my game ID.
A character consists of 0-9 and A-Z instead of hex; that's 36 possible characters. We took out 0, O, 1 and I so the ID prints unambiguously in a variety of fonts, which leaves 32 possible characters.
Then, if every character is pseudo-randomized, how many players can we safely have?
We used the birthday paradox's square approximation. The formula on that page indicates how many people are needed to have a 50% chance of 2 people colliding. It is 22.99 people for the birthday problem (365 possible choices).
Now we substitute 32^(number of characters) into the equation in place of 365, which gives the number of players at which there is a 50% chance of 2 players having the same ID.
Finally, we agreed on a 9-character ID, so the game can register up to about 6.9 million players before there is a 50% chance that just 2 of them share the same ID.
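For reference, a small sketch of the approximation used above, n ≈ 1/2 + sqrt(1/4 + 2·N·ln 2): the number of random IDs drawn from a space of size N at which a collision reaches a 50% chance. It reproduces the ~23 (22.99) figure for N = 365 and gives roughly 7 million for N = 32^9, consistent with the ~6.9 million figure above:

public class BirthdayBound {
    // Birthday-paradox square approximation: number of random draws from a space of
    // size n at which the probability of at least one collision reaches 50%.
    static double fiftyPercentCollisionCount(double n) {
        return 0.5 + Math.sqrt(0.25 + 2.0 * n * Math.log(2.0));
    }

    public static void main(String[] args) {
        System.out.printf("birthdays (N=365): %.2f people%n", fiftyPercentCollisionCount(365));    // ~23
        System.out.printf("9 chars, 32 symbols: %.2f million players%n",
                fiftyPercentCollisionCount(Math.pow(32, 9)) / 1e6);                                // ~6.98
    }
}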
The game isn't even online-only! A collision only matters if those 2 players are both still actively playing at the same time and decide to send scores to the scoreboard in the same week, because of the weekly score reset. So the actual number of players the game can hold is somewhat higher than that. (The game will probably never have that many players... it is just the small happy dream of every game startup. Well, at least the computation was fun.)
It will probably look like this, for easier reading: 5XT-339-A67

Load factor of hash tables with tombstones

So the question came up about whether tombstones should be included when calculating the load factor of a hash table.
I thought that, given that the load factor is used to determine when to expand capacity, tombstones should not be included. An obvious example is if you almost fill and then remove every value in a hash table. Here insertions are super easy (no collisions) so I believe the load factor shouldn't include them.
But you could look at this and think that with all the tombstones lookups will be slow (potentially searching almost the entire space).
So I thought I'd ask the question. Should the load factor of a hashtable include tombstones in the calculation?
Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for the dynamic system (a growing/shrinking hash table is a dynamic system).
Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified; dynamic systems behave suboptimally. Its advantages:
Well, simplicity of understanding and implementation.
The hash table data structure doesn't need to store many numbers with various thresholds -- likely only one number. This is meaningful when the hash table is very small and the size of the header affects the total memory efficiency of the data structure (in bytes per entry).
In a certain (and common) case -- an append/update-only hash table -- more complex behaviour models degenerate to the "just load factor" model; in other words, the load factor model defines relatively optimal behaviour for that case.
See also my answer on the load factor model. I prefer a [min load, target load, max load] + growth factor frame model.
If you are developing a general-purpose hash table with tombstones, I think you can just pick up my results (below). I spent maybe several weeks solely developing this model. Maybe you can make some improvements or do further research; I would be glad.
Two main hash table dynamic behaviour patterns are targeted:
a growing hash table (maybe in a growing phase), with little or no removals, e.g. the initial fill of a hash table when the proper capacity was not specified (or is unknown)
a hash table that remains the same or nearly the same size, where the number of removals is equal or nearly equal to the number of insertions, e.g. caches with an upper size bound, LRUs, tables with entry expiration
Two thresholds are defined:
max size (i.e. the maximum number of alive entries), computed as table size * max load
min number of free slots (i.e. slots holding neither an alive entry nor a tombstone), computed by a magic formula
If the hash table's size exceeds max size, we assume we are in the "growing pattern" and rehash so the table can hold current size * growth factor entries, i.e. choose the table size closest possible to current size * growth factor / target load.
If the number of free slots falls below the min number of free slots, we are in the "cache pattern" and rehash "to the current size", i.e. to the table size closest possible to current size / target load.
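A minimal sketch of that two-threshold decision in Java (the constants and field names below are assumptions for illustration, not the actual source):

public class RehashPolicy {
    // Assumed tuning constants; the real values come from the model described above.
    static final double MAX_LOAD = 0.8;
    static final double TARGET_LOAD = 0.6;
    static final double GROWTH_FACTOR = 2.0;

    long tableSize;    // number of slots
    long size;         // number of alive entries
    long freeSlots;    // slots holding neither an alive entry nor a tombstone
    long minFreeSlots; // computed by the "magic formula" mentioned above

    void maybeRehash() {
        if (size > tableSize * MAX_LOAD) {
            // "Growing pattern": make room for size * growth factor entries.
            rehashTo((long) (size * GROWTH_FACTOR / TARGET_LOAD));
        } else if (freeSlots < minFreeSlots) {
            // "Cache pattern": rehash to the current size, purging tombstones.
            rehashTo((long) (size / TARGET_LOAD));
        }
    }

    void rehashTo(long newTableSize) {
        // Reallocate and reinsert the alive entries; details omitted in this sketch.
    }
}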
Read the source where all of the above logic is coded.
Also, the article Tombstones purge from hashtable: theory and practice sheds some light.
If you are developing a special-purpose hash table whose dynamic properties are known (or can be studied), I recommend developing your own model that fits your case. Don't rely on pure math and CS theory alone; evaluate your model in benchmarks.

Siemens DICOM Individual Slice Time (Private_0019_1029)

I'm seeing that the individual slice time information from the Private_0019_1029 field of the DICOM header sometimes has negative values and sometimes only positive values.
I assumed that these times are with respect to the Volume Acquisition time recorded in the header.
Going by that assumption, it would mean that the Acquisition time varies. But upon checking the difference between successive volume acquisition times, I see that it's equal to TR.
So I'm at a loss about what's happening.
I'm trying to look at the raw fMRI data without slice time correction; hence it's necessary to have the individual slice times.
Does the moco series do time shifting in addition to motion correction? (I don't believe it used to, but your experience may show otherwise).
This indicates how their slice timing is measured. Try the computations with the raw and the moco series and see if the times line up. That may give you your answer.
When dealing with a private tag, you should really include the private vendor ID, in your case the value of tag (0019,0010).
You may also want to have a look at the output of:
gdcmdump --csa input.dcm
This will dump the SIEMENS CSA header directly from the DICOM attribute.

Allocating datastore long ID's - but segmented so different Kinds have different ranges

My program has 3 Kinds that are closely related, and I want to be able to store and manipulate their long ids interchangeably, e.g. I might have an array of long ids that can be for any of the 3 Kinds.
Using the allocateIds API I can allocate the ids for the 3 Kinds in the same namespace, but I also sometimes need to be able to tell which Kind one of these ids refers to (e.g. in order to do a datastore operation on the right Kind).
I understand that the 'normal' way to do this is to store the whole Key type rather than just the long id, but there will be a huge number of these - it will be more efficient if I can just use long values rather than Key values.
So I'd like to be able to segment the id ranges, so that I can call a simple function with an id and it will tell me which of the 3 Kinds the id is for.
(I'm using Java, but I don't think that matters.)
Allocate my own ids
I guess the most straightforward way to do this is to simply allocate my own ids. I believe that, in order to allocate sequential ids, I would need either an extra datastore write for every allocation (to track the allocations) or some complicated system of pre-allocating ranges of ids to each live instance. This sounds like a bad idea.
So I could generate random 54-bit ids, reserving 2 bits to use as flags to indicate the type. But it is my understanding that random or hash allocation dramatically reduces the number of allocations that can be made safely. The Internet tells me that the chance of a collision is approximately k^2 / 2N, where k is the number of allocations and N is the size of the allocation space. So, if I'm willing to accept a 0.1% chance of collision, then k = sqrt(2 * 2^54 / 1000), which is about 6 million. Since I really have no idea how many entities I will need to store, this is unacceptable.
Reserve some bits in the Long ID to indicate the Kind
Another solution would be to use 2 bits of the long value as flags to indicate the type. The easiest way to do this would be to take advantage of the fact that the allocator currently only uses the low 56 bits of a long, so I could use the high bits as flags to indicate the Kind. The problem with that solution is that I lose the ability to manipulate these numbers in JavaScript - the reason for the 56-bit limit in the first place.
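A sketch of what those flag bits could look like in Java (the shift and mask values are illustrative; as noted above, ids tagged in the high bits can no longer be handled safely in JavaScript):

public class KindTaggedId {
    // Use the top two bits of the long as a Kind flag; the allocator only uses the low 56 bits.
    private static final int KIND_SHIFT = 62;
    private static final long ID_MASK = (1L << KIND_SHIFT) - 1;

    public static long tag(long rawId, int kind) {       // kind in 0..2 for the 3 Kinds
        return ((long) kind << KIND_SHIFT) | (rawId & ID_MASK);
    }

    public static int kindOf(long taggedId) {
        return (int) (taggedId >>> KIND_SHIFT);
    }

    public static long rawIdOf(long taggedId) {
        return taggedId & ID_MASK;
    }
}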
An alternative to this - keeping the option of manipulating these numbers in JS - is to use allocateIdRange and pre-allocate (and throw away) the id ranges corresponding to bits 54 and 55. Actually, I could use any bits, but specifying the id ranges is much easier if I use the high bits.
But I know little about the datastore and how the allocator actually works, so I don't know whether this 'pre-allocate and discard' technique is a good idea.

A couple of questions about Hash Tables

I've been reading a lot about hash tables and how to implement one in C, and I think I have almost all the concepts in my head so I can start to code my own. I just have a couple of questions that I have yet to properly understand.
As a reference, I've been reading this:
http://eternallyconfuzzled.com/jsw_home.aspx
1) As I've read on the site above, a power of two or a prime number is recommended for the hash table size. The table is basically an array, and an array has a fixed size so I can quickly look up the value I'm looking for. I can't declare a small array if I have a large input as it won't fit, and I can't declare a very large array if my input data is not that large because it's wasted memory.
What is the optimum size for the Hash Table? What should I base my decision on?
2) Also, on that site there are a couple of hashing functions, which I have yet to read through. It also states that it's always best to use a good, known algorithm rather than roll my own. And I might do just that; I'll pick one from that site, test it on my code, and see if it minimizes collisions based on my input data.
What's bugging me is how I control the hash range. The hash can't return an integer larger than the hash table size or we'll have a serious problem. How do I deal with this?
1) What you are referring to is the load factor of the hash table - the percentage of buckets that are expected to be filled. Wikipedia has this to say:
With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.
I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.
2) Just use the modulo operator (%) to keep the bucket index legal. The second operand should be the size of your bucket array.
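A tiny illustration (written in Java here, though the same expression works in C); the mask keeps the index non-negative when the hash is a signed int:

public class Buckets {
    // Map an arbitrary hash value to a legal bucket index.
    // Masking off the sign bit keeps the result non-negative; in C, use an unsigned hash instead.
    static int bucketIndex(int hash, int tableSize) {
        return (hash & 0x7fffffff) % tableSize;
    }
}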
Pick a small size for your hash table. As you add stuff to your table, check to see what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements-- make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.
I only say this because you seem to have known input data:
If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)
As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?