I have a piece of software that logs data into a table with the current date and time, accurate to milliseconds. There is no problem in PostgreSQL or MSSQL Server, but with MDB I get a primary key violation. When I look at the table using MS Access, it shows datetimes accurate only to seconds.
Could milliseconds be written to MDB at all?
The Date/Time field in Access is accurate only to the second (it's actually stored as a floating-point number, but it reports and accepts values in whole seconds). If you want to store milliseconds, you could store them in a separate field.
You can store the millisecond part of the date/time in an integer field and then use a composite primary key spanning the two fields. I've never heard a solid argument against composite primary keys, but the design is odd at best.
Just for completeness, it's probably worth mentioning that you could save the timestamps as yyyy-mm-dd hh:nn:ss.fff in a TEXT(23) column. That column could serve as a primary key, and the values could be sorted and compared directly as long as all the numeric parts are zero-padded. Date arithmetic would be a bit awkward, however.
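For illustration, a minimal C# sketch producing such a zero-padded 23-character key (note that .NET writes minutes as "mm" where Access format strings use "nn"):

using System;

class TimestampKey
{
    static void Main()
    {
        // Zero-padded, fixed-width timestamp: sortable and comparable as text.
        string key = DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss.fff"); // 23 characters
        Console.WriteLine(key); // e.g. 2012-10-17 01:03:50.523
    }
}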
Related
Is there a datatype equivalent to tsrange or daterange (or actually any of the range types available in PostgreSQL) in e.g. sqlite?
If not, what is the best way to work around it?
Sorry, there are only 5 datatypes in SQLite: https://www.sqlite.org/datatype3.html
NULL, INTEGER, REAL, TEXT, and BLOB (plus NUMERIC, which is a type affinity rather than a separate storage class).
As @a_horse_with_no_name mentioned, "You would typically use two columns of that type that store the start and end value of the range". This does make it a bit tricky if you want to do database calculations on intervals, although such functionality might be available as a run-time loadable extension.
You would typically use two columns of that type that store the start and end value of the range. – a_horse_with_no_name
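As a minimal sketch of that two-column workaround (the bookings table and its Unix-second start_ts/end_ts columns are illustrative, not from the question), an overlap query in SQLite with Microsoft.Data.Sqlite could look like this; the test mirrors PostgreSQL's && operator, since two ranges overlap exactly when each starts before the other ends:

using Microsoft.Data.Sqlite;

class RangeDemo
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=:memory:");
        conn.Open();

        var ddl = conn.CreateCommand();
        ddl.CommandText = @"CREATE TABLE bookings (
            id       INTEGER PRIMARY KEY,
            start_ts INTEGER NOT NULL,
            end_ts   INTEGER NOT NULL,
            CHECK (start_ts <= end_ts))";
        ddl.ExecuteNonQuery();

        // Find every range overlapping the queried interval.
        var query = conn.CreateCommand();
        query.CommandText = @"SELECT id FROM bookings
                              WHERE start_ts <= $qEnd AND end_ts >= $qStart";
        query.Parameters.AddWithValue("$qStart", 1700000000);
        query.Parameters.AddWithValue("$qEnd", 1700086400);
        using var reader = query.ExecuteReader();
        while (reader.Read())
            System.Console.WriteLine(reader.GetInt64(0));
    }
}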
Be cautious; SQLite is quite forgiving in what it accepts for each data type:
SQLite uses a more general dynamic type system. In SQLite, the datatype of a value is associated with the value itself, not with its container. The dynamic type system of SQLite is backwards compatible with the more common static type systems of other database engines in the sense that SQL statements that work on statically typed databases should work the same way in SQLite. However, the dynamic typing in SQLite allows it to do things which are not possible in traditional rigidly typed databases.
This means that you can throw text at INTEGER columns: if the text can be converted to an integer, it will be, and you will get an integer back when you retrieve it. If it can't be converted, that's fine too; it will be stored and returned to you as a text string. This can make programming with SQLite databases interesting.
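A small sketch demonstrating this behaviour with the Microsoft.Data.Sqlite package (the table name is illustrative):

using Microsoft.Data.Sqlite;

class AffinityDemo
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=:memory:");
        conn.Open();

        var cmd = conn.CreateCommand();
        cmd.CommandText = "CREATE TABLE t (n INTEGER)";
        cmd.ExecuteNonQuery();

        // Both inserts succeed: '123' is coerced to the integer 123,
        // while 'abc' keeps the TEXT storage class.
        cmd.CommandText = "INSERT INTO t VALUES ('123'), ('abc')";
        cmd.ExecuteNonQuery();

        cmd.CommandText = "SELECT n, typeof(n) FROM t";
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            System.Console.WriteLine($"{reader.GetValue(0)} -> {reader.GetString(1)}");
        // Prints: 123 -> integer, then: abc -> text
    }
}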
I am developing an application that allows users to read books. I am using DynamoDB for storing details of the books that user reads and I plan to use the data stored in DynamoDB for calculating statistics, such as trending books, authors, etc.
My current schema looks like this:
user_id | timestamp | book_id | author_id
user_id is the partition key, and timestamp is the sort key.
The problem I am having is that with this schema I am only able to query the details of the books that a single user (the partition key) has read. That is one of my requirements.
The other requirement is to query all the records that have been created in a certain date range, e.g. records created in the past 7 days. With this schema, I am unable to run this query.
I have looked into so many other options, and haven't figured out a way to create a schema that would allow me to run both queries.
Retrieve the records of the books read by a single user (Can be done).
Retrieve the records of books read by all the users in last x days (Unable to do it).
I do not want to run a scan, since it will be expensive. I looked into the option of using a GSI on the timestamp, but it requires me to specify a hash key, so I cannot query all the records created between two dates.
One naive solution would be to create a GSI with a constant hash key across all books and timestamp as a range key. This will allow you to perform your type of queries.
The problem with this approach is that it is likely to become a scaling bottleneck, since the same hash key means the same partition. One workaround for this problem is sharding: create a set of hash keys (e.g. from 1 to 10) and assign a random key from this set to every book. When you make a query, you then make 10 queries and merge the results, as sketched below. You can even make this set size dynamic, so that it scales with your data.
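A hedged sketch of that sharded read path with the AWS SDK for .NET (AWSSDK.DynamoDBv2); the table name, index name, and attribute names are illustrative assumptions, not from the question:

using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

class ShardedQuery
{
    const int ShardCount = 10; // size of the assumed shard-key set

    // Queries every shard of the assumed "shard-timestamp-index" GSI
    // and merges the results client-side.
    static async Task<List<Dictionary<string, AttributeValue>>> ReadsSince(
        IAmazonDynamoDB client, string isoTimestamp)
    {
        var results = new List<Dictionary<string, AttributeValue>>();
        for (int shard = 1; shard <= ShardCount; shard++)
        {
            var request = new QueryRequest
            {
                TableName = "reads",
                IndexName = "shard-timestamp-index",
                KeyConditionExpression = "shard = :s AND #ts >= :since",
                // "timestamp" is a DynamoDB reserved word, hence the alias
                ExpressionAttributeNames = new Dictionary<string, string> { ["#ts"] = "timestamp" },
                ExpressionAttributeValues = new Dictionary<string, AttributeValue>
                {
                    [":s"] = new AttributeValue { N = shard.ToString() },
                    [":since"] = new AttributeValue { S = isoTimestamp }
                }
            };
            var response = await client.QueryAsync(request);
            results.AddRange(response.Items);
        }
        return results;
    }
}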
I would also suggest looking into other tools (not DynamoDB) for this use case, as DDB is not the best tool for data analysis. You might, for example, feed DynamoDB data into CloudSearch or ElasticSearch and do your analysis there.
One solution could be to use a GSI with two more columns: whenever you ingest a record, store the date as the partition key (e.g. 2017-07-02) and the timestamp as the range key (e.g. 04:22:33.000).
Maintain a separate checkpoint table containing the process name and the timestamp of the last read. Every time you read from the table, update the checkpoint so you can fetch incremental data. If you want the last 7 days of data, set the checkpoint timestamp to the date 7 days back and fetch everything between then and the current time.
You can run this as a query by passing the date as the partition key and using a BETWEEN condition on the timestamp range key, as sketched below.
Calculate the difference in days between the checkpoint and the current date, and fetch the data day by day.
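A hedged sketch of the per-day read path with the AWS SDK for .NET; again, every table, index, and attribute name here is an illustrative assumption:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

class PerDayQuery
{
    // Walks the date partitions one day at a time, from the checkpoint
    // date up to today, and merges the results.
    static async Task<List<Dictionary<string, AttributeValue>>> ReadSince(
        IAmazonDynamoDB client, DateTime checkpointDate)
    {
        var items = new List<Dictionary<string, AttributeValue>>();
        for (var day = checkpointDate.Date; day <= DateTime.UtcNow.Date; day = day.AddDays(1))
        {
            var request = new QueryRequest
            {
                TableName = "reads",
                IndexName = "date-time-index",
                KeyConditionExpression = "#d = :day AND #t BETWEEN :from AND :to",
                // "date" and "time" collide with reserved words, hence the aliases
                ExpressionAttributeNames = new Dictionary<string, string>
                {
                    ["#d"] = "date",
                    ["#t"] = "time"
                },
                ExpressionAttributeValues = new Dictionary<string, AttributeValue>
                {
                    [":day"] = new AttributeValue { S = day.ToString("yyyy-MM-dd") },
                    [":from"] = new AttributeValue { S = "00:00:00.000" },
                    [":to"] = new AttributeValue { S = "23:59:59.999" }
                }
            };
            var response = await client.QueryAsync(request);
            items.AddRange(response.Items);
        }
        return items;
    }
}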
I am populating a medium-sized table (60 GB, 500 million rows). The process completes reasonably fast if the table has no primary key (about 1 hour using bulk insert), but it takes roughly 10 times longer if I create the table with the primary key. I assume this is because it takes time to verify the uniqueness constraint and to update the index at each insert.
I thought a good workaround would be to add the primary key later, since indexing an already-populated table should be much faster than updating the index incrementally. But SQLite doesn't seem to offer a way to add a primary key after the table is created (I'm not sure why).
I guess I could just not use a primary key at all, and instead just add a unique index after the table is populated. Is there any disadvantage to that?
Or is there a better solution you would recommend?
From a purely technical point of view, a unique index has exactly the same effect as a primary key. (In SQLite, some primary keys even allow NULLs, for backwards compatibility.)
The only difference is that the primary key constraint does not show up in the table definition itself, which might be a bad thing for documentation purposes.
Also see Is CREATE UNIQUE INDEX or INTEGER PRIMARY KEY more performant in SQLite.
Run the bulk insert inside a transaction and you'll avoid quite a few things that slow inserts down.
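A minimal sketch combining both suggestions, assuming Microsoft.Data.Sqlite and illustrative table and column names: populate inside one transaction, then build the unique index once the data is loaded:

using Microsoft.Data.Sqlite;

class BulkInsert
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=big.db");
        conn.Open();

        var ddl = conn.CreateCommand();
        ddl.CommandText = "CREATE TABLE t (id INTEGER, payload TEXT)"; // deliberately no PK
        ddl.ExecuteNonQuery();

        using (var tx = conn.BeginTransaction())
        {
            var insert = conn.CreateCommand();
            insert.Transaction = tx;
            insert.CommandText = "INSERT INTO t (id, payload) VALUES ($id, $p)";
            var pId = insert.CreateParameter(); pId.ParameterName = "$id"; insert.Parameters.Add(pId);
            var pP = insert.CreateParameter(); pP.ParameterName = "$p"; insert.Parameters.Add(pP);

            for (int i = 0; i < 1_000_000; i++)
            {
                pId.Value = i;
                pP.Value = $"row {i}";
                insert.ExecuteNonQuery();
            }
            tx.Commit(); // one commit for the whole batch, not one per row
        }

        // Add the uniqueness guarantee after the bulk load.
        var index = conn.CreateCommand();
        index.CommandText = "CREATE UNIQUE INDEX t_id ON t(id)";
        index.ExecuteNonQuery();
    }
}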
I just found this, which is a great write-up on how to speed things up in SQLite:
Improve INSERT-per-second performance of SQLite?
I'm creating Android apps and need to save the date/time when each record is created. The SQLite docs say, however, that "SQLite does not have a storage class set aside for storing dates and/or times" and that it is "capable of storing dates and times as TEXT, REAL, or INTEGER values".
Is there a technical reason to use one type over another? And can a single column store dates in any of the three formats from row to row?
I will need to compare dates later. For instance, in my apps I will show all records created between date A and date B. I am worried that not having a true DATETIME column could make comparisons difficult.
SQLite does not have a specific datetime type. You can use the TEXT, REAL or INTEGER types, whichever suits your needs.
Straight from the DOCS
SQLite does not have a storage class set aside for storing dates and/or times. Instead, the built-in Date And Time Functions of SQLite are capable of storing dates and times as TEXT, REAL, or INTEGER values:
TEXT as ISO8601 strings ("YYYY-MM-DD HH:MM:SS.SSS").
REAL as Julian day numbers, the number of days since noon in Greenwich on November 24, 4714 B.C. according to the proleptic Gregorian calendar.
INTEGER as Unix Time, the number of seconds since 1970-01-01 00:00:00 UTC.
Applications can choose to store dates and times in any of these formats and freely convert between formats using the built-in date and time functions.
SQLite built-in Date and Time functions can be found here.
One of the powerful features of SQLite is letting you choose the storage type. Advantages and disadvantages of each of the three different possibilities:
ISO8601 string
String comparison gives valid results
Stores fraction seconds, up to three decimal digits
Needs more storage space
You will directly see its value when using a database browser
Need for parsing for other uses
"default current_timestamp" column modifier will store using this format
Real number
High precision regarding fraction seconds
Longest time range
Integer number
Lowest storage space
Quick operations
Small time range
Possible year 2038 problem
If you need to compare different types or export to an external application, you're free to use SQLite's own datetime conversion functions as needed.
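As a small illustration, here is a sketch that round-trips one timestamp through all three formats using the built-in conversion functions (Microsoft.Data.Sqlite assumed):

using Microsoft.Data.Sqlite;

class ConvertDemo
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=:memory:");
        conn.Open();

        var cmd = conn.CreateCommand();
        // TEXT -> INTEGER (Unix seconds), TEXT -> REAL (Julian day),
        // and INTEGER -> TEXT (ISO8601), all in one SELECT.
        cmd.CommandText = @"SELECT strftime('%s', '2013-10-07 08:23:19'),
                                   julianday('2013-10-07 08:23:19'),
                                   datetime(1381134199, 'unixepoch')";
        using var reader = cmd.ExecuteReader();
        reader.Read();
        System.Console.WriteLine(reader.GetString(0)); // 1381134199
        System.Console.WriteLine(reader.GetDouble(1)); // 2456572.8495...
        System.Console.WriteLine(reader.GetString(2)); // 2013-10-07 08:23:19
    }
}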
(This quotes the same passage from the SQLite docs as above: dates and times can be stored as TEXT ISO8601 strings, REAL Julian day numbers, or INTEGER Unix time, and converted freely with the built-in date and time functions.)
Having said that, I would use INTEGER and store seconds since Unix epoch (1970-01-01 00:00:00 UTC).
For practically all date and time matters I prefer to keep things very, very simple... down to seconds stored in integers.
Integers will always be supported as integers in databases, flat files, etc. You do a little math, cast it into another type, and you can format the date any way you want.
Doing it this way, you don't have to worry when [insert current favorite database here] is replaced with [future favorite database] which coincidentally didn't use the date format you chose today.
It's just a little math overhead (e.g. a couple of helper methods; takes two seconds, I'll post a gist if necessary) and it simplifies a lot of date/time operations later.
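A tiny sketch of that math in C# (matching the other code in this document; the names are illustrative):

using System;

class EpochSeconds
{
    static void Main()
    {
        long now = DateTimeOffset.UtcNow.ToUnixTimeSeconds(); // the INTEGER you store
        DateTimeOffset back = DateTimeOffset.FromUnixTimeSeconds(now);
        Console.WriteLine($"{now} -> {back:yyyy-MM-dd HH:mm:ss} UTC");

        // Range queries become plain integer comparisons:
        long weekAgo = DateTimeOffset.UtcNow.AddDays(-7).ToUnixTimeSeconds();
        Console.WriteLine($"WHERE created >= {weekAgo} AND created < {now}");
    }
}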
Store it in a field of type long. See Date.getTime() and new Date(long)
I've decided to use GUIDs as primary keys for many of my project's DB tables. I think it is a good practice, especially with scalability, backup, and restore in mind. The problem is that I don't want to use regular GUIDs, and I'm searching for an alternative approach. I was actually interested to know what Pinterest is using as a primary key. When you look at the URL you see something like this:
http://pinterest.com/pin/275001120966638272/
I prefer the numerical representation, even if it is stored as a string. Is there any way to achieve this?
Furthermore, YouTube also uses a different kind of hashing technique which I can't figure out:
http://www.youtube.com/watch?v=kOXFLI6fd5A
This reminds me of a URL-shortener-like scheme.
I prefer the shortest one, but I know it isn't guaranteed to be unique. I first thought about doing something like this:
DateTime dt1970 = new DateTime(1970, 1, 1);
DateTime current = DateTime.Now;
TimeSpan span = current - dt1970;
Console.WriteLine(span.TotalMilliseconds); // writes the result shown below
Result Example:
1350433430523.66
This prints the total milliseconds since 1970, but what happens if I have hundreds of thousands of writes per second?
I mainly prefer a solution that avoids BIGINT auto-increment, because it causes a lot less headache when scaling the DB with 3rd-party tools, and it makes backup/restore less problematic, since I can transfer data between servers if I want.
Another sophisticated approach is to tailor the solution to my application. In the database, the primary key will also contain the username (which is unique and can't be changed by the user), so I can combine the numerical value of the name with the millisecond number, which will give me a unique numerical string. Because a user doesn't insert data at such a high rate, the numerical ID is guaranteed to be unique. I could also remove the last 5 digits and still get a unique ID, since I assume a user won't insert data more than once per second, but I probably won't do that (what do you think about this idea?)
So I ask for your help. My data is expected to grow very big: 2 TB a year, with tens of thousands of new rows each second. I want the URLs to look as "friendly" as possible, and I prefer not to use 'regular' GUIDs.
I am developing my app using ASP.NET 4.5 and MySQL
Thanks.
Collision Table
For YouTube-like GUIDs you can see this answer. They are basically keeping a database table of all the random video IDs they generate. When they request a new one, they check the table for any collisions; if they find a collision, they try to generate a new one.
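A hedged sketch of that generate-and-retry loop; SQLite via Microsoft.Data.Sqlite stands in for the real database here, and a video_ids table with a UNIQUE id column is assumed:

using System;
using Microsoft.Data.Sqlite;

class CollisionTable
{
    static readonly Random Rng = new Random();

    // Generates a random 11-character id and retries until the INSERT
    // into the collision table succeeds.
    static string NewVideoId(SqliteConnection conn)
    {
        const string alphabet =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
        while (true)
        {
            var chars = new char[11];
            for (int i = 0; i < chars.Length; i++)
                chars[i] = alphabet[Rng.Next(alphabet.Length)];
            var id = new string(chars);

            var cmd = conn.CreateCommand();
            cmd.CommandText = "INSERT INTO video_ids (id) VALUES ($id)";
            cmd.Parameters.AddWithValue("$id", id);
            try
            {
                cmd.ExecuteNonQuery();
                return id; // no collision: the id is now reserved
            }
            catch (SqliteException e) when (e.SqliteErrorCode == 19)
            {
                // SQLITE_CONSTRAINT: id already taken, generate another one
            }
        }
    }
}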
Long Primary Keys
You could use a long (e.g. 275001120966638272) as a primary key; however, if you have multiple servers generating unique identifiers you'll have to partition them somehow or introduce a global lock, so that no two servers generate the same identifier.
Twitter Snowflake ID's
One solution to the partitioning problem with long IDs is to use Snowflake IDs. This is what Twitter uses to generate its IDs. All generated IDs are made up of the following parts:
Epoch timestamp in millisecond precision - 41 bits (gives us 69 years with a custom epoch)
Configured machine id - 10 bits (gives us up to 1024 machines)
Sequence number - 12 bits (A local counter per machine that rolls over every 4096)
One extra bit is reserved for future purposes. Since the IDs use the timestamp as the first component, they are time-sortable (which is very important for query performance).
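A minimal sketch of composing such a 64-bit ID in C#; the custom epoch value and the class name are illustrative, and the field widths follow the layout above:

using System;

class SnowflakeGenerator
{
    const long CustomEpochMs = 1288834974657L; // Twitter's epoch, for illustration
    readonly int _machineId;                   // 0..1023 (10 bits)
    long _lastTimestamp = -1;
    int _sequence;                             // 0..4095 (12 bits)

    public SnowflakeGenerator(int machineId) => _machineId = machineId;

    public long NextId()
    {
        lock (this)
        {
            long ts = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() - CustomEpochMs;
            if (ts == _lastTimestamp)
            {
                _sequence = (_sequence + 1) & 0xFFF;  // 12-bit counter
                if (_sequence == 0)                   // counter exhausted:
                    while (ts <= _lastTimestamp)      // spin until the next millisecond
                        ts = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds() - CustomEpochMs;
            }
            else
            {
                _sequence = 0;
            }
            _lastTimestamp = ts;

            // 41 bits timestamp | 10 bits machine id | 12 bits sequence
            return (ts << 22) | ((long)_machineId << 12) | (long)_sequence;
        }
    }
}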
Base64 Encoded GUID's
You can use ShortGuid, which encodes a GUID as a base64 string. The downside is that the output is a little ugly (e.g. 00amyWGct0y_ze4lIsj2Mw) and it's case-sensitive, which may not be good for URLs if you lower-case them.
Base32 Encoded GUID's
There is also base32 encoding of GUIDs, which you can see this answer for. These are slightly longer than the ShortGuid above (e.g. lt7fz44kdqlu5pt7wnyzmu4ov4), but the advantage is that they can be all lower case.
Multiple Factors
One alternative I have been thinking about is to introduce multiple factors, e.g. if Pinterest used a username and an ID for extra uniqueness:
https://pinterest.com/some-user/1
Here the ID 1 is unique to the user some-user and could be the number of posts they've made, i.e. their next post would be 2. You could also use YouTube's approach with their video IDs but make them specific to a user; this could lead to some ridiculously short URLs.
The first, simplest and most practical scheme for unique keys is an increasing numbering sequence that follows write order. This represents the record number inside one database and provides unique numbering on a local scale; this is the application-level requirement most often met.
Next, a numerical approach based on concatenating time and counters is commonly used to ensure that concurrent transactions arriving together get unique IDs before being written.
When the system becomes highly threaded and distributed, as in highly concurrent situations, some constraints need to be relaxed before they become a penalty for scaling.
Universally unique identifier as primary key
Yes, it's a good practice.
A key reference system can provide independence from the underlying database system.
This provides one more level of integrity for the database when the scenarios evoked above occur: backup, restore, scaling, migration, and perhaps proving some authenticity.
The article Generating Globally Unique Identifiers for Use with MongoDB, by Alexander Marquardt (a Senior Consulting Engineer at MongoDB), covers the question in detail and gives some insight into databases and informatics.
UUIDs are 128 bits long. They introduce an amount of entropy high enough to ensure practical uniqueness of labels, and they can be represented as 32-hex-character strings: enough to write several thousand billion billion distinct decimal numbers.
Here are a few more questions that can occur when considering the overall principle and the analysis:
should the database primary key and the Uniform Resource Locator be kept as two different entities?
does this numbering destroy the sequentiality in the system?
does providing a machine host number (h), followed by a user number (u) and a time (t), along with a write index (i), guarantee that the PK huti stays unique?
Now considering the DB system:
primary keys should be kept numerical (even if written in hex); the database system relies on this, and it has performance implications.
their size should be fixed; the system must answer rapidly whether it is potentially dealing with a PK or not.
Hashids
The hashing technique of YouTube is hashids.
It's a good choice:
the hashes are short and their length can be controlled,
the alphabet can be customized,
it is reversible (and as such interesting as a short reference to the primary keys),
it can use a salt,
it's designed to encode positive numbers.
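A hedged sketch with the Hashids.net package; the salt, minimum length, and input value are illustrative assumptions:

using HashidsNet;

class HashidsDemo
{
    static void Main()
    {
        // Salt and minimum length are illustrative; choose your own.
        var hashids = new Hashids("my illustrative salt", minHashLength: 8);

        string encoded = hashids.EncodeLong(275001120966638272L); // short, URL-friendly id
        long[] decoded = hashids.DecodeLong(encoded);             // reversible: back to the number

        System.Console.WriteLine($"{encoded} -> {decoded[0]}");
    }
}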
However, it is a hash, and as such the probability exists that a collision happens. Collisions can be detected when the unique constraint is violated before the value is stored; in such a case, the generation should be run again.
Consider the comments on this answer to figure out how much entropy it's possible to get from a shortened sha1+base64 recipe.
Anticipating the collision scenario calls for estimating the future dimension of the database, that is, the potential number of records. Recommended reading: Z. Bloom, How Long Does An ID Need To Be?
Milliseconds since epoch
Quoted from the previous article, which provides most of the answer to the problem at hand in a nicely concise style:
It may not be necessary for you to encode every time since 1970, however. If you are only interested in keeping recent records close to each other, you only need enough values to ensure that you don't have more values with the same prefix than your database can cache at once.
What you could do is convert a GUID into a purely numeric string by converting each letter in the GUID into a number. Here is an example of what that would look like. It's a bit long, but if that is not a problem, this could be one way of generating the keys.
1004234499987310234371029731000544986101469898102
Here is the code I used to generate the string above. I would probably recommend using a long primary key instead; although that can be a bit of a pain, it's probably a safer approach than the function below.
string generateKey()
{
    Guid guid = Guid.NewGuid();
    var newKey = new StringBuilder(); // System.Text

    // Walk the 32 hex characters of the GUID ("N" omits the hyphens);
    // letters (a-f) become their character codes, digits stay as-is.
    foreach (char c in guid.ToString("N"))
    {
        if (char.IsLetter(c))
            newKey.Append((int)c);
        else
            newKey.Append(c);
    }
    return newKey.ToString();
}
Edit:
I did some testing taking only the first 20 digits, and out of 5,000,000 generated keys, 4,999,978 were unique. When using the first 25 digits, it was 5,000,000 out of 5,000,000. I would recommend doing some more testing if you go with this method.