Consistent handling of DateTimes in different RDBMSs

I'm planning a distributed system of applications that will communicate with different types of RDBMS. One of the requirements is consistent handling of DateTimes across all RDBMS types. All DateTime values must be at millisecond precision, include the TimeZone info and be stored in a single column.
Since different RDBMSs handle dates and times differently, I'm worried I can't rely on their native column types in this case, and so I'll have to come up with a different solution. (If I'm wrong here, you're welcome to show me the way.)
The solution, whatever it may be, should ideally allow for easy sorting and comparisons on the SQL level. Other aspects, such as readability and ability to use SQL datetime functions, are not important, since this will all be handled by a gateway service.
I'm toying with an idea of storing my DateTime values in an unsigned largeint column type (8 bytes). I haven't checked whether all the RDBMSs in question (MSSQL, Oracle, DB2, PostgreSQL, MySQL, maybe a few others) actually have such a type, but at this point I just assume they do.
As for the storage format... For example, 2009-01-01T12:00:00.999+01:00 could be stored as something like A20090101120000999BB (the A and BB digits are explained below), which fits within 8 bytes.
The minimum DateTime I'd be able to store this way would be 0001-01-01T00:00:00.000+xx:xx, and the maximum would be 8000-12-31T23:59:59.999+xx:xx, which gives me more than enough of a span.
Since maximum unsigned largeint value is 18446744073709551615, this leaves me with the following 3 digits (marked by A and BB) to store the TimeZone info: AxxxxxxxxxxxxxxxxxBB.
Taking into account the maximum year span of 0001..8000, A can be either 0 or 1, and BB can be anywhere from 00 to 99.
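Here's a rough sketch (C#, untested) of the packing I have in mind; the zone-code scheme for the BB digits is still undecided:

using System;
using System.Globalization;

// Pack A yyyyMMddHHmmssfff BB into one unsigned 64-bit value.
// A (0 or 1) and BB (00..99) are the three spare digits for time zone info.
static ulong Pack(DateTimeOffset value, uint zoneCode) // zoneCode: 00..99
{
    ulong a = 0; // must stay 0 or 1, or we overflow 18446744073709551615
    ulong datePart = ulong.Parse(
        value.ToString("yyyyMMddHHmmssfff", CultureInfo.InvariantCulture));
    return a * 10_000_000_000_000_000_000UL + datePart * 100UL + zoneCode;
}

One thing I notice already: since the digits encode the local wall-clock time, the packed values sort by local time rather than by the actual instant, unless I normalize everything to UTC before packing.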
And now the questions:
What do you think about my proposed solution? Does it have merit or is it just plain stupid?
If no better way exists, how do you propose the three remaining digits be used for TimeZone info best?

One of the requirements is consistent handling of DateTimes across all RDBMS types.
Be aware that date-time handling capabilities vary radically across various database systems. This ranges from virtually no support (SQLite) to excellent (Postgres). Some such as Oracle have legacy data-types that may confuse the situation, so study carefully without making assumptions.
Rather than establish a requirement that broadly says we must support "any or all database", you should get more specific. Research exactly what databases might realistically be candidates for deployment in the real-world. A requirement of "any or all databases" is naïve and unrealistic because databases vary in many capabilities — date-time handling is just the beginning of your multi-database support concerns.
The SQL standard barely touches on the subject of date-time, broadly defining a few types with little discussion of the nuances and complexities of date-time work.
Also be aware that most programming platforms provide atrociously poor support for date-time handling. Note that Java leads the industry in this field, with its brilliantly designed java.time classes. That framework evolved from the Joda-Time project for Java which was ported to .Net platform as NodaTime.
All DateTime values must be at millisecond precision,
Good that you have specified that important detail. Understand that various systems resolve date-time values to whole seconds, milliseconds, microseconds, nanoseconds, or something else.
include the TimeZone info and be stored in a single column.
Define time zone precisely.
Understand the difference between an offset-from-UTC and a time zone: the first is a number of hours-minutes-seconds, plus or minus; the second has a name in Continent/Region format and is a history of the past, present, and future changes to the offset used by the people of a particular region.
The 2-4 letter abbreviations such as CST, PST, IST, and so on are not formal time zone names, are not standardized, and are not even unique (avoid them).
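In .NET terms, using the NodaTime library mentioned above, the distinction looks like this (a minimal sketch; the calls follow NodaTime's documented API, but verify against the current version):

using System;
using NodaTime;

// An offset is merely a fixed shift from UTC.
Offset offset = Offset.FromHoursAndMinutes(1, 0); // +01:00

// A time zone is a named history of offset changes for a region.
DateTimeZone zone = DateTimeZoneProviders.Tzdb["Europe/Paris"];

// The same zone yields different offsets at different moments (summer time):
Console.WriteLine(zone.GetUtcOffset(Instant.FromUtc(2009, 1, 1, 12, 0))); // +01
Console.WriteLine(zone.GetUtcOffset(Instant.FromUtc(2009, 7, 1, 12, 0))); // +02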
Since different RDBMS's handle dates and times differently, I'm worried I can't rely on their native column types in this case and so I'll have to come up with a different solution.
The SQL standard does define a few types that are supported by some major databases.
TIMESTAMP WITH TIME ZONE represents a moment, a specific point on the timeline. I vaguely recall hearing of a database that actually stored the incoming time zone. But most, such as Postgres, use the time zone indicated on the incoming value to adjust into UTC, then store that UTC value, and lastly, discard the zone info. When retrieved, you get back a UTC value. Beware of tools and middleware with the confusing anti-feature of applying a default time zone after retrieval and before display to the user.
TIMESTAMP WITHOUT TIME ZONE represents a date with time-of-day, but purposely lacking the context of a time zone or offset. Without a zone/offset, such a value does not represent a moment. You could apply a time zone to determine a moment in a range of about 26-27 hours, the range of time zones around the globe.
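To make that ambiguity concrete, a small C# sketch (IANA zone IDs are assumed to resolve, as they do on .NET 6+):

using System;

// The same wall-clock reading, with no zone attached, maps to different
// moments depending on which zone you apply to it.
var wallClock = new DateTime(2009, 1, 1, 12, 0, 0, DateTimeKind.Unspecified);

var kiritimati = TimeZoneInfo.FindSystemTimeZoneById("Pacific/Kiritimati"); // UTC+14
var pagoPago   = TimeZoneInfo.FindSystemTimeZoneById("Pacific/Pago_Pago");  // UTC-11

DateTime utcA = TimeZoneInfo.ConvertTimeToUtc(wallClock, kiritimati);
DateTime utcB = TimeZoneInfo.ConvertTimeToUtc(wallClock, pagoPago);
Console.WriteLine(utcB - utcA); // 25 hours: same reading, very different moments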
There are other types in the standard as well such as date-only (DATE) and time-only (TIME).
See this table I made for Java; in this context the column of SQL standard types is relevant. Be aware that TIME WITH TIME ZONE makes no sense logically and should not be used.
If you have narrowed down your list of candidate databases, study their documentation to learn if they have a type akin to the standard types in which you are interested, and what the name of that type is (not always the standard name).
I'm toying with an idea of storing my DateTime values in an unsigned largeint column type (8 bytes).
A 64-bit value is not likely appropriate. For example, the java.time classes use a pair of numbers: a count of whole seconds since the epoch reference of the first moment of 1970 in UTC, plus another number for the count of nanoseconds within the fractional second.
It is really best to use the database's date-time data types if they are similar across your list of candidate databases. Using a count-from-epoch is inherently ambiguous, which makes identifying erroneous data difficult.
Storing your own count-from-epoch number is possible. If you must go that way, be sure the entire team understands what epoch reference was chosen. At least a couple dozen have been in use in various computing systems. Beware of staff persons assuming a particular epoch reference is in use.
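For instance, .NET alone carries two different epoch counts side by side; a quick sketch:

using System;

// Unix-style count: milliseconds since 1970-01-01T00:00Z.
long unixMillis = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();

// .NET ticks: 100-nanosecond units since 0001-01-01T00:00.
// Different epoch AND different resolution; mixing them corrupts data silently.
long dotNetTicks = DateTime.UtcNow.Ticks;

Console.WriteLine(unixMillis);
Console.WriteLine(dotNetTicks);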
Another way to define your own date-time tracking is to use text in the standard ISO 8601 formats. Such strings, sorted alphabetically, are also in chronological order. One exception to that sorting is the optional but commonly used Z at the end to indicate an offset-from-UTC of zero (pronounced “Zulu”).
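A short C# sketch of that sorting property (note it only holds if every stored value uses the same offset, so normalize to UTC first):

using System;
using System.Globalization;

string earlier = new DateTime(2009, 1, 1, 11, 0, 0, DateTimeKind.Utc)
    .ToString("yyyy-MM-dd'T'HH:mm:ss.fff'Z'", CultureInfo.InvariantCulture);
string later = new DateTime(2009, 6, 1, 11, 0, 0, DateTimeKind.Utc)
    .ToString("yyyy-MM-dd'T'HH:mm:ss.fff'Z'", CultureInfo.InvariantCulture);

// Ordinal (alphabetic) comparison agrees with chronological order.
Console.WriteLine(string.CompareOrdinal(earlier, later) < 0); // True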
The minimum DateTime I'd be able to store this way would be 0001-01-01T00:00:00.000+xx:xx,
Taking into account the maximum year span of 0001..8000
Are you really storing values from the time of Christ? Is this software really going to be around executing transactions for the year 8000?
This is an area where the responsible stakeholders should define their real needs. For example, for many business systems you may need only data from the year of the product's launch and run out only a century or two into the future.
The minimum/maximum value range varies widely between different databases. If you choose to use a built-in data type in each database system, investigate its limits. Some, for example, may go only to the year 2038, the common Y2038 problem.
To sum up my recommendation:
Get real about your date-time needs: min/max range, resolution, and various types (moment versus not a moment, date-only, etc.).
Get real about your possible databases for deployment.
If you need enterprise-quality reliability in a classic RDBMS, your candidate list is likely only a few: Postgres, Microsoft SQL Server, Oracle, and maybe IBM Db2.
Keep this list of supported databases as short as possible. Each database you agree to support is a huge commitment, now and in the future.
Be sure your chosen database(s) have a database driver available for your chosen programming language(s). For example JDBC for Java.
If at all possible, use the built-in data types offered by the database.
Be sure you and your team understand date-time handling. Many do not, in my experience, as (a) the subject is rarely taught, and (b) many programmers & admins mistakenly believe their quotidian intuitive understanding of date-time is sufficient for programming work. (Ignorance is bliss, as they say.)
Identify other areas of functionality beyond date-time handling, and compare which databases support those areas.

I would suggest storing the datetime information as milliseconds since 1970 (Java style).
It's a standard way of storing datetime information, and it's also more space-efficient than your suggestion, because in your scheme some digits are "wasted", i.e. the month digits can store only 00-12 (instead of 00-99) and so on.
You didn't specify your development language, but I am sure you can find many code snippets that transform a date to milliseconds.
If you are developing in .NET, it has a similar concept of ticks (you can use that as well).
Regarding the time zone, I would add another column to store only the time zone indication.
Remember that any format you choose should maintain ordering between two dates, i.e. if D1 > D2 then format(D1) > format(D2); this way you can query the DB for changes since some date, or for changes between two dates.
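A sketch of this scheme in C# (the column names and types are up to you; the BIGINT milliseconds column is what carries the ordering property above):

using System;

DateTimeOffset original = DateTimeOffset.Parse("2009-01-01T12:00:00.999+01:00");

// Column 1 (BIGINT): UTC milliseconds since 1970; sortable and comparable in SQL.
long utcMillis = original.ToUnixTimeMilliseconds();

// Column 2 (SMALLINT): the offset, stored separately as minutes.
short offsetMinutes = (short)original.Offset.TotalMinutes;

// Reconstructing the original value on read:
DateTimeOffset restored = DateTimeOffset
    .FromUnixTimeMilliseconds(utcMillis)
    .ToOffset(TimeSpan.FromMinutes(offsetMinutes));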

Related

How to ingest historical data with proper creation time?

When ingesting historical data, we would like it to become consistent with streamed data with respect to caching and retention, hence we need to set proper creation time on the data extents.
The options I found:
creationTime ingestion property,
with(creationTime='...') query ingestion property,
creationTimePattern parameter of LightIngest.
All options seem to have very limited usability as they require manual work or scripting to populate creationTime with some granularity based on the ingested data.
In case the "virtual" ingestion time can be extracted from data in form of a datetime column or otherwise inherited (e.g. based on integer ID), is it possible to instruct the engine to set creation time as an expression based on the data row?
If such a feature is missing, what could be other handy alternatives?
creationTime is a tag on an extent/shard.
The idea is to be able to effectively identify and drop / cool data at the end of the retention time.
In this context, your suggested capability raises some serious issues.
If all records have the same date, no problem, we can use this date as our tag.
If we have different dates, but they span a short period, we might decide to take the min / avg / max date.
However -
What behavior would you expect in the case of a file that contains dates that span a long period?
Fail the ingestion?
Use the current time as the creationTime?
Use the min / avg / max date, although they clearly don't fit the data well?
Park the records in a temp store until (if ever) we get enough records with similar dates to create the batches?
Scripting seems the most reasonable way to go here.
If your files are indeed homogeneous in their record dates, then you don't need to scan all records; just read the first record and use its date.
If the dates are heterogeneous, then we are in the scenario described by the "However" part.

Performance implications of datefield in sqlite

Are there any performance implications in SQLite of having a date field and searching for records from a particular year based on the year attribute of the date field, as opposed to having a dedicated year int field and searching based on that?
SQLite doesn't have a date type, so dates are stored in one of a few different formats, and calculations on those dates are performed using built-in date functions. Those date functions will probably add some overhead, but whether that will actually have any performance implication really comes down to your data, the size of your db, etc.
The best thing you can do is run some of your own tests, then decide for yourself whether the performance gain you get from breaking the date into multiple columns is worth the added schema complexity.
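For concreteness, the two approaches might be compared like this (a sketch using Microsoft.Data.Sqlite; table and column names are made up):

using System;
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=test.db");
conn.Open();

// 1) Derive the year from a TEXT ISO-8601 column with a built-in date function.
//    The function runs per row, so an ordinary index on event_date won't help here.
var byFunction = conn.CreateCommand();
byFunction.CommandText =
    "SELECT COUNT(*) FROM events WHERE strftime('%Y', event_date) = '2009'";
Console.WriteLine(byFunction.ExecuteScalar());

// 2) Query a dedicated INTEGER year column, which is directly indexable.
var byColumn = conn.CreateCommand();
byColumn.CommandText = "SELECT COUNT(*) FROM events WHERE event_year = 2009";
Console.WriteLine(byColumn.ExecuteScalar());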

Alternative to GUID with Scalablity in mind and Friendly URL

I've decided to use GUIDs as primary keys for many of my project's DB tables. I think it is a good practice, especially with scalability, backup, and restore in mind. The problem is that I don't want to use the regular GUID, and I'm searching for an alternative approach. I was actually interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know that it isn't guaranteed to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds); // total milliseconds since 1970-01-01
Result Example:
1350433430523.66
This prints the total milliseconds since 1970. But what happens if I have hundreds of thousands of writes per second?
I mainly prefer the non-BIGINT auto-increment solution because it creates a lot less headache when scaling the DB using 3rd-party tools, as well as less problematic backup/restore functionality, because I can transfer data between servers and such if I want.
Another, more sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (unique, and it can't be changed by the user), so I can combine the numerical value of the name with the millisecond count, which will give me a unique numerical string. Because a user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, since I assume a user won't insert data more than about once per second, but I probably won't do that (what do you think about this idea?).
So I ask for your help. My data is expected to grow very large: 2 TB a year, with tens of thousands of new rows each second. I want the URLs to look as "friendly" as possible, and I'd prefer not to use the 'regular' GUID.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube-like IDs you can see this answer. They basically keep a database table of all the random video IDs they generate. When they request a new one, they check the table for any collisions. If they find a collision, they try to generate a new one.
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key; however, if you have multiple servers generating unique identifiers, you'll have to partition them somehow or introduce a global lock, so that no two servers generate the same identifier.
Twitter Snowflake IDs
One solution to the partitioning problem with long IDs is to use snowflake IDs. This is what Twitter uses to generate its IDs. All generated IDs are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the IDs use the timestamp as their first component, they are time-sortable (which is very important for query performance).
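A minimal single-threaded sketch of such a generator (hypothetical; real implementations also need thread safety and protection against the clock moving backwards):

using System;

class SnowflakeIdGenerator
{
    // Layout: 41 bits timestamp | 10 bits machine id | 12 bits sequence.
    static readonly DateTimeOffset CustomEpoch =
        new DateTimeOffset(2020, 1, 1, 0, 0, 0, TimeSpan.Zero);

    readonly long machineId; // 0..1023
    long lastTimestamp = -1;
    long sequence;           // rolls over at 4096

    public SnowflakeIdGenerator(long machineId) => this.machineId = machineId & 0x3FF;

    public long NextId()
    {
        long ts = (long)(DateTimeOffset.UtcNow - CustomEpoch).TotalMilliseconds;
        if (ts == lastTimestamp)
        {
            sequence = (sequence + 1) & 0xFFF;
            if (sequence == 0) // sequence exhausted: spin until the next millisecond
                while (ts <= lastTimestamp)
                    ts = (long)(DateTimeOffset.UtcNow - CustomEpoch).TotalMilliseconds;
        }
        else
        {
            sequence = 0;
        }
        lastTimestamp = ts;
        return (ts << 22) | (machineId << 12) | sequence;
    }
}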
Base64 Encoded GUIDs
You can use ShortGuid, which encodes a GUID as a Base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case-sensitive, which may not be good for URLs if you are lower-casing them.
Base32 Encoded GUIDs
There is also Base32 encoding of GUIDs, which you can see this answer for. These are slightly longer than ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4), but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video IDs, but specific to a user; this could lead to some ridiculously short URLs.
The first, simplest, and most practical scheme for unique keys is an increasing numbering sequence in write order. This represents the record number inside one database and provides unique numbering on a local scale: the application-level requirement most often met.
Next, a numerical approach based on a concatenation of time and counters is commonly used to ensure that concurrent transactions in the same batch get unique IDs before writing.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This gives the database one more level of integrity when the scenarios you evoke occur: backup, restore, scaling, migration, and perhaps proving authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB by Alexander Marquardt (a Senior Consulting Engineer at MongoDB) covers the question in detail and gives some insight about databases and informatics.
UUIDs are 128 bits long. They introduce enough entropy to make labels unique in practice, and they can be represented as 32-hex-character strings. 128 bits is room for about 3.4 × 10^38 distinct values.
Here are a few more questions that can occur when considering the overall principle and the analysis:
should primary keys of the database and Unique Resource Locations be kept as two different entities?
does this numbering destroy the sequentiality in the system?
does providing a machine host number (h), followed by a user number (u) and time (t), along with a write index (i), guarantee that the PK huti stays unique?
Now considering the DB system:
primary keys should be kept numeric (be it hex): the database system relies on them, and this implies performance considerations;
their size should be fixed: the system must answer rapidly whether it is potentially dealing with a PK or not.
Hashids
The hashing technique of YouTube is hashids.
It's a good choice:
the hashes are short and their length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it is designed to hash positive numbers.
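For illustration, usage with the Hashids.net package might look like the following sketch (the constructor and EncodeLong/DecodeLong calls follow that library's documentation, but treat the details as assumptions to verify):

using HashidsNet;

// Salted, reversible encoding of positive numbers into short strings.
var hashids = new Hashids(salt: "this is my salt", minHashLength: 8);

string hash = hashids.EncodeLong(275001120966638272L); // output depends on the salt
long[] back = hashids.DecodeLong(hash);                // reversible: the original number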
However, it is a hash, and as such the probability exists that a collision happens. Collisions can be detected (the unique constraint is violated before they are stored), and in such a case the generation should be run again.
Consider the comment on this answer to figure out how much entropy you can get from a shortened SHA-1 + Base64 recipe.
Anticipating the collision scenario calls for estimating the future dimension of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Cited from the previous article, which provides most of the answer to the problem at hand in a nicely synthetic style:
It may not be necessary for you to encode every time since 1970 however. If you are only interested in keeping recent records close to each other, you only need enough values to ensure that you don’t have more values with the same prefix than your database can cache at once.
What you could do is convert a GUID into a numbers-only string by converting all the letters in the GUID into numbers. Here is an example of what that would look like. It's a bit long, but if that is not a problem, this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead; although it can be a bit of a pain, it's probably a safer way to go than the function below.
using System;
using System.Text;

string generateKey()
{
    Guid guid = Guid.NewGuid();
    var newKey = new StringBuilder();

    // Walk the 32 hex characters of the GUID (the "N" format omits the dashes):
    // digits pass through unchanged, letters become their numeric character code.
    foreach (char c in guid.ToString("N"))
    {
        if (char.IsLetter(c))
            newKey.Append((int)c);
        else
            newKey.Append(c);
    }
    return newKey.ToString();
}
Edit:
I did some testing: taking only the first 20 digits, 4,999,978 out of 5,000,000 generated keys were unique. When using the first 25 digits, it was 5,000,000 out of 5,000,000. I would recommend doing some more testing if you go with this method.

What temporal patterns exist for neo4j or graph databases?

I'm looking for patterns (ideally with advantages/disadvantages) that can be used for databases concerning time.
One I can think of is to have a node representing a point in time or time period.
What others are there? What others have you used?
Not a good question for SO
This question is very open-ended, and SO is meant for questions with specific technical answers.
TL;DR: Graph patterns are infinite. Start from the problem, not the possibilities.
Graph patterns are case-specific, not data-type specific
There isn't a set of temporal graph patterns, and even if there were, each pattern would be unique to a specific use-case and close to useless elsewhere. What you should be asking yourself:
What will my queries need to look like?
What kind of information am I representing?
What information is relevant?
Should it be more granular, or more general?
Time, Date, or Datetime? Microtime? BC?
Context really matters.
Modelling information flow in a datacenter's network? Probably only need seconds and microseconds in a property on the relevant data.
Modelling evolution on the tree of life? Probably don't need anything from Time or even Date, instead using a float and an int for exponential notation, or a single int representing thousands of years.
What is time?
Or at least, what is it to your data?
The three most common patterns I've seen (because they're the most flexible and easiest to work with in queries):
Just stick dates or datetimes wherever they're relevant.
(cause)-->(event {datetime})
(event)-->(datetime) and (datetime)-[:NEXT]->(datetime)-[:NEXT]->(datetime)
However, even with these patterns there are still many open-ended questions. Consider a case of tracking modifications to files...
Simply put create and modified dates on the File nodes?
Put dates on a relationship between the user and the file?
Just a datetime, or read/write and duration?
Event itself as a node with start, end, and duration, with relationships to the user and the file, and the change-set applied to the file?
Should that event have a relationship to its chronological neighbors, or should that relationship be kept between the change-sets alone?

Override DateTime serialization for ASP.NET WebMethod parameters

I am working on cleaning up a bug in a large code base where no one was paying attention to local time vs. UTC time.
What we want is a way of globally ignoring time zone information on DateTime objects sent to and from our ASP.NET web services. I've got a solution for retrieve operations. Data is only returned in datasets, and I can look for DateTime columns and set the DateTimeMode to Unspecified. That solves my problem for all data passed back and forth inside a data set.
However DateTime objects are also often passed directly as parameters to the web methods. I'd like to strip off any incoming time zone information. Rather than searching through our client code and using DateTime.SpecifyKind(..) to set all DateTime vars to Undefined, I'd like to do some sort of global ASP.NET override to monitor incoming parameters and strip out the time zone information.
Is such a thing possible? Or is there another easier way to do what I want to do?
Just to reiterate: I don't care about time zones; everyone is in the same time zone. But a couple of users have badly configured machines, wrong time zones, etc. So when they send in July 1, 2008, I'm getting June 30, 2008 22:00:00 on the server side, where it's automatically converted from their local time to the server's local time.
Update: One other possibility would be if it were possible to make a change on the client side .NET code to alter the way DateTime objects with Kind 'Undefined' are serialized.
I have dealt with this often in many applications, services, and on different platforms (.NET, Java, etc.). Please believe me that you do NOT want the long term consequences of pretending that you don't care about the time zone. After chasing lots of errors that are enormously difficult and expensive to fix, you will wish you had cared.
So, rather than stripping the time zone, you should either capture the correct time zone or force a specific time zone. If you reasonably can, get the various data sources fixed to provide a correct time zone. If they are out of your control, then force them either to the server's local time zone or to UTC.
The general industry convention is to force everything to UTC, and to set all production hardware clocks to UTC (that means servers, network devices like routers, etc.). Then you should translate to/from the user's local time zone in the UI.
If you fix it correctly now, it can be easy and cheap. If you intentionally break it further because you think that will be cheaper, then you will have no excuses later when you have to untangle the awful mess.
Note that this is similar to the common issue with strings: there is no such thing as plain text (a string devoid of a character encoding), and there is no such thing as a plain (no time zone) date/time. Pretending otherwise is the source of much pain, heartache, and embarrassing errors.
OK, I do have a workaround for this, which depends on the fact that I only actually need the date portion of the DateTime. I attach this attribute to every Date or DateTime parameter in the system:
<XmlElement(DataType:="date")>
This changes the generated wsdl to have the type s:date instead of s:dateTime. (Note that simply having the type of the .NET method parameter be a Date rather than a DateTime did NOT accomplish this). So the client now only sends the date portion of the DateTime, no time info, no time zone info.
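For reference, the C# equivalent of that attribute on a web method parameter might look like this sketch (service and parameter names are made up):

using System;
using System.Web.Services;
using System.Xml.Serialization;

public class MyService : WebService
{
    [WebMethod]
    public void SaveRecord(
        // Serialized as xs:date: only the date portion crosses the wire,
        // so no time-of-day or zone shifting can occur.
        [XmlElement(DataType = "date")] DateTime effectiveDate)
    {
        // effectiveDate arrives with its time component zeroed
    }
}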
If I ever need to send a Date and Time value to the server, I'll have to use some other workaround, like making it a string parameter.
I've had issues with the time zone information as well. The problem is that I'm already providing the datetime fields in UTC; then serialization occurs and the local offset becomes part of the date/time. The dates/times for our vendor in a different time zone were pretty messed up. I got around this problem by using the T-SQL CONVERT function on the datetime fields in the SELECT statement I used to populate my datasets. This converts the fields to a string, which translates nicely to a datetime value automatically on the client side. If you just want to pass the date, you can use style 101 to provide just the date. I used style 126 to provide the date and time exactly as they appear in my database columns, with the time zone information stripped out.
