Are there any performance implications in SQLite of having a date field and searching for records from a particular year based on the year attribute of that date field, as opposed to having a dedicated year int field and searching on that?
SQLite doesn't have a date type, so dates are stored in one of a few different formats, and calculations on those dates are performed using built-in date functions. Those date functions will probably add some overhead, but whether that actually has any performance implication really comes down to your data, the size of your DB, etc.
The best thing you can do is run some of your own tests, then decide for yourself whether the performance gain you get from breaking the date into multiple columns is worth the added schema complexity.
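For illustration, a minimal sketch of such a test using Python's sqlite3 module (the table and column names are made up). It contrasts deriving the year from the date column with a built-in date function against filtering on a dedicated integer column; note that a range condition on the date column (e.g. created_at >= '2023-01-01' AND created_at < '2024-01-01') can also use an index without an extra column.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (created_at TEXT, year INTEGER)")
    conn.execute("INSERT INTO events VALUES ('2023-06-15 10:30:00', 2023)")
    conn.execute("INSERT INTO events VALUES ('2024-01-02 08:00:00', 2024)")

    # 1) Derive the year with a built-in date function. The expression is
    #    evaluated per row, so it generally can't use a plain index on created_at.
    rows_a = conn.execute(
        "SELECT * FROM events WHERE strftime('%Y', created_at) = '2023'"
    ).fetchall()

    # 2) Use a dedicated integer column, which can be indexed directly.
    rows_b = conn.execute(
        "SELECT * FROM events WHERE year = 2023"
    ).fetchall()

    print(rows_a, rows_b)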
When ingesting historical data, we would like it to become consistent with streamed data with respect to caching and retention, hence we need to set proper creation time on the data extents.
The options I found:
creationTime ingestion property,
with(creationTime='...') query ingestion property,
creationTimePattern parameter of Lightingest.
All options seem to have very limited usability as they require manual work or scripting to populate creationTime with some granularity based on the ingested data.
In case the "virtual" ingestion time can be extracted from the data in the form of a datetime column or otherwise inherited (e.g. based on an integer ID), is it possible to instruct the engine to set the creation time as an expression based on the data row?
If such a feature is missing, what could be other handy alternatives?
creationTime is a tag on an extent/shard.
The idea is to be able to effectively identify and drop / cool data at the end of the retention time.
In this context, your suggested capability raises some serious issues.
If all records have the same date, no problem, we can use this date as our tag.
If we have different dates, but they span a short period, we might decide to take the min / avg / max date.
However -
What behavior would you expect in the case of a file that contains dates spanning a long period?
Fail the ingestion?
Use the current time as the creationTime?
Use the min / avg / max date, although they clearly don't fit the data well?
Park the records in a temp store until (if ever) we get enough records with similar dates to create the batches?
Scripting seems the most reasonable way to go here.
If your files are indeed homogeneous in their records' dates, then you don't need to scan all the records; just read the first record and use its date.
If the dates are heterogeneous, then we are back at the scenario described in the "However" part above.
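If you go the scripting route, a minimal sketch of the "read the first record" idea might look like this in Python. The folder, the CSV format, and the Timestamp column name are assumptions; the actual ingestion call depends on whichever client or tool you use (e.g. passing the value as the creationTime ingestion property or via LightIngest).

    import csv
    from pathlib import Path

    DATA_DIR = Path("ingest")       # hypothetical folder of homogeneous CSV files
    DATETIME_COLUMN = "Timestamp"   # hypothetical name of the datetime column

    def creation_time_for(file_path):
        """Read only the first data row and use its datetime as the extent's creationTime."""
        with file_path.open(newline="") as f:
            reader = csv.DictReader(f)
            first = next(reader)
            return first[DATETIME_COLUMN]

    for path in sorted(DATA_DIR.glob("*.csv")):
        creation_time = creation_time_for(path)
        # Hand the value to your ingestion mechanism of choice; the exact call
        # depends on the client/tool, so only the property is shown here.
        print(f"{path.name}: with (creationTime='{creation_time}')")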
We are new to DynamoDB and struggling with what seems like it would be a simple task.
It is not actually related to stocks (it's about recording machine results over time) but the stock example is the simplest I can think of that illustrates the goal and problems we're facing.
The two query scenarios are:
All historical values of given stock symbol <= We think we have this figured out
The latest value of all stock symbols <= We do not have a good solution here!
Assume that updates are not synchronized, e.g. the moment of the last update record for TSLA may be different from that for AMZN.
The 3 attributes are just { Symbol, Moment, Value }. We could make the hash_key Symbol and the range_key Moment, and we believe we could achieve the first query easily and efficiently.
We also assume we could get the latest value for a single, specified Symbol following https://stackoverflow.com/a/12008398
The SQL solution for getting the latest value for each Symbol would look a lot like https://stackoverflow.com/a/6841644
But... we can't come up with anything efficient for DynamoDB.
Is it possible to do this without either retrieving everything or making multiple round trips?
The best idea we have so far is to somehow use update triggers or streams to track the latest record per Symbol and essentially keep that cached. That could be in a separate table, or in the same table with extra info like a column IsLatestForMachineKey (effectively a bool). With every insert, you'd grab the row where IsLatestForMachineKey=1, compare the Moment, and if the insertion is newer, set the new one to 1 and the older one to 0.
This is starting to feel complicated enough that I question whether we're taking the right approach at all, or maybe DynamoDB itself is a bad fit for this, even though the use case seems so simple and common.
There is a way that is fairly straightforward, in my opinion.
Rather than using a GSI, just use two tables with (almost) the exact same schema. The hash key of both should be symbol. They should both have moment and value. Pick one of the tables to be stocks-current and the other to be stocks-historical. stocks-current has no range key. stocks-historical uses moment as a range key.
Whenever you write an item, write it to both tables. If you need strong consistency between the two tables, use the TransactWriteItems api.
If your data might arrive out of order, you can add a ConditionExpression to prevent newer data in stocks-current from being overwritten by out of order data.
The read operations are pretty straightforward, but I’ll state them anyway. To get the latest value for everything, scan the stocks-current table. To get historical data for a stock, query the stocks-historical table with no range key condition.
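For illustration, here is a rough boto3 sketch of that pattern. The table and attribute names are assumptions, and the condition expression is just one way to guard against out-of-order data; for strong consistency the two writes could instead go through the TransactWriteItems API as noted above.

    import boto3
    from boto3.dynamodb.conditions import Key
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb")
    current = dynamodb.Table("stocks-current")        # hash key: symbol
    historical = dynamodb.Table("stocks-historical")  # hash key: symbol, range key: moment

    def record_value(symbol, moment, value):
        # Always append to the historical table.
        historical.put_item(Item={"symbol": symbol, "moment": moment, "value": value})

        # Update the "latest" table only if this item is newer than what's stored,
        # so out-of-order arrivals don't overwrite newer data.
        try:
            current.put_item(
                Item={"symbol": symbol, "moment": moment, "value": value},
                ConditionExpression="attribute_not_exists(symbol) OR moment < :m",
                ExpressionAttributeValues={":m": moment},
            )
        except ClientError as e:
            if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise  # anything other than "an older item arrived late" is a real error

    # Latest value of every symbol: scan the small stocks-current table.
    latest = current.scan()["Items"]

    # Full history of one symbol: query stocks-historical by hash key only.
    history = historical.query(KeyConditionExpression=Key("symbol").eq("TSLA"))["Items"]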
I need to date/timestamp various transactions, and can add that explicitly into the data structure.
Firebase creates an ID like IuId2Du7p9rJoT-BARu using some algorithm.
Is there a way I can decode the date/time from the firebase-created ID and avoid storing a separate date/timestamp?
Short answer: no.
I've asked the same question previously, because my engineering instincts tell me I should never duplicate data. The conclusion I came to after thinking this through to its logical end is that even in a SQL database there exists tons of duplication. It's simply hidden under the covers (as indices, temporary tables, and memory caches). That's just part of dealing with large and active data.
So drop the timestamp in the data and go have lunch; save yourself some energy :)
Alternatively, skip the timestamp entirely. You know that the records are already stored in timestamp order, assuming you haven't provided your own priority, so you should be good to go.
I'm planning a distributed system of applications that will communicate with different types of RDBMS. One of the requirements is consistent handling of DateTimes across all RDBMS types. All DateTime values must be at millisecond precision, include the TimeZone info and be stored in a single column.
Since different RDBMS's handle dates and times differently, I'm worried I can't rely on their native column types in this case and so I'll have to come up with a different solution. (If I'm wrong here, you're welcome to show me the way.)
The solution, whatever it may be, should ideally allow for easy sorting and comparisons on the SQL level. Other aspects, such as readability and ability to use SQL datetime functions, are not important, since this will all be handled by a gateway service.
I'm toying with an idea of storing my DateTime values in an unsigned largeint column type (8 bytes). I haven't made sure if all RDBMS's in question (MSSQL, Oracle, DB2, PostgreSQL, MySQL, maybe a few others) actually /have/ such a type, but at this point I just assume they do.
As for the storage format... For example, 2009-01-01T12:00:00.999+01:00 could be stored as something like A20090101120000999BB (the A and BB digits are explained below), which fits within 8 bytes.
The minimum DateTime I'd be able to store this way would be 0001-01-01T00:00:00.000+xx:xx, and the maximum would be 8000-12-31T23:59:59.999+xx:xx, which gives me more than enough of a span.
Since maximum unsigned largeint value is 18446744073709551615, this leaves me with the following 3 digits (marked by A and BB) to store the TimeZone info: AxxxxxxxxxxxxxxxxxBB.
Taking into account the maximum year span of 0001..8000, A can be either 0 or 1, and BB can be anywhere from 00 to 99.
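To make the proposed layout concrete, here is a small Python sketch of the packing. The A and BB digits are left as opaque placeholders, since how to encode the zone in them is exactly the open question below.

    from datetime import datetime, timezone, timedelta

    def pack(dt, a_digit, bb_digits):
        """Pack a datetime into the proposed A yyyymmddhhmmssSSS BB decimal layout.

        a_digit (0..1) and bb_digits (00..99) are placeholders for whatever
        time-zone encoding is eventually chosen; the offset itself is not packed here.
        """
        middle = dt.strftime("%Y%m%d%H%M%S") + f"{dt.microsecond // 1000:03d}"
        return int(f"{a_digit}{middle}{bb_digits:02d}")

    dt = datetime(2009, 1, 1, 12, 0, 0, 999000, tzinfo=timezone(timedelta(hours=1)))
    packed = pack(dt, a_digit=1, bb_digits=99)
    print(packed)                    # 12009010112000099999
    print(packed < 2**64 - 1)        # True: the 20-digit layout fits an unsigned 64-bit integer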
And now the questions:
What do you think about my proposed solution? Does it have merit or is it just plain stupid?
If no better way exists, how do you propose the three remaining digits be used for TimeZone info best?
One of the requirements is consistent handling of DateTimes across all RDBMS types.
Be aware that date-time handling capabilities vary radically across various database systems. This ranges from virtually no support (SQLite) to excellent (Postgres). Some such as Oracle have legacy data-types that may confuse the situation, so study carefully without making assumptions.
Rather than establish a requirement that broadly says we must support "any or all database", you should get more specific. Research exactly what databases might realistically be candidates for deployment in the real-world. A requirement of "any or all databases" is naïve and unrealistic because databases vary in many capabilities — date-time handling is just the beginning of your multi-database support concerns.
The SQL standard barely touches on the subject of date-time, broadly defining a few types with little discussion of the nuances and complexities of date-time work.
Also be aware that most programming platforms provide atrociously poor support for date-time handling. Note that Java leads the industry in this field, with its brilliantly designed java.time classes. That framework evolved from the Joda-Time project for Java which was ported to .Net platform as NodaTime.
All DateTime values must be at millisecond precision,
Good that you have specified that important detail. Understand that various systems resolve date-time values to whole seconds, milliseconds, microseconds, nanoseconds, or something else.
include the TimeZone info and be stored in a single column.
Define time zone precisely.
Understand the difference between an offset-from-UTC and a time zone: The first is a number of hours-minutes-seconds plus-or-minus, the second has a name in format Continent/Region and is a history of past, present, and future changes to the offset used by the people of a particular region.
The 2-4 letter abbreviations such as CST, PST, IST, and so on are not formal time zone names, are not standardized, and are not even unique (avoid them).
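For illustration, a small Python sketch of that difference, using America/Chicago as an example zone: the named zone knows its own history of offset changes, while a fixed offset is just a number.

    from datetime import datetime, timedelta, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # A named time zone carries the full history of offset changes (DST, reforms).
    chicago = ZoneInfo("America/Chicago")
    print(datetime(2021, 1, 15, 12, tzinfo=chicago).isoformat())  # 2021-01-15T12:00:00-06:00 (standard time)
    print(datetime(2021, 7, 15, 12, tzinfo=chicago).isoformat())  # 2021-07-15T12:00:00-05:00 (daylight saving)

    # A fixed offset-from-UTC is just a number; it never changes with the seasons.
    fixed = timezone(timedelta(hours=-6))
    print(datetime(2021, 7, 15, 12, tzinfo=fixed).isoformat())    # 2021-07-15T12:00:00-06:00, regardless of date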
Since different RDBMS's handle dates and times differently, I'm worried I can't rely on their native column types in this case and so I'll have to come up with a different solution.
The SQL standard does define a few types that are supported by some major databases.
TIMESTAMP WITH TIME ZONE represents a moment, a specific point on the timeline. I vaguely recall hearing of a database that actually stored the incoming time zone. But most, such as Postgres, use the time zone indicated on the incoming value to adjust into UTC, then store that UTC value, and lastly, discard the zone info. When retrieved, you get back a UTC value. Beware of tools and middleware with the confusing anti-feature of applying a default time zone after retrieval and before display to the user.
TIMESTAMP WITHOUT TIME ZONE represents a date with time-of-day, but purposely lacking the context of a time zone or offset. Without a zone/offset, such a value does not represent a moment. You could apply a time zone to determine a moment in a range of about 26-27 hours, the range of time zones around the globe.
There are other types in the standard as well such as date-only (DATE) and time-only (TIME).
See this table I made for Java, but in this context the column of SQL standard types is relevant. Be aware that TIME WITH TIME ZONE makes no sense logically, and should not be used.
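As a rough analogy in Python's datetime module (not in the database itself): an aware value adjusted to UTC mirrors the Postgres handling of TIMESTAMP WITH TIME ZONE described above, while a naive value mirrors TIMESTAMP WITHOUT TIME ZONE.

    from datetime import datetime, timezone, timedelta

    # An aware datetime is a moment (analogous to TIMESTAMP WITH TIME ZONE).
    moment = datetime(2021, 3, 14, 9, 30, tzinfo=timezone(timedelta(hours=1)))
    stored = moment.astimezone(timezone.utc)       # adjust into UTC before storing
    print(stored.isoformat())                      # 2021-03-14T08:30:00+00:00

    # A naive datetime deliberately lacks any zone/offset (TIMESTAMP WITHOUT TIME ZONE);
    # it does not represent a moment until a zone is applied.
    wall_clock = datetime(2021, 3, 14, 9, 30)
    print(wall_clock.isoformat())                  # 2021-03-14T09:30:00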
If you have narrowed down your list of candidate databases, study their documentation to learn if they have a type akin to the standard types in which you are interested, and what the name of that type is (not always the standard name).
I'm toying with an idea of storing my DateTime values in an unsigned largeint column type (8 bytes).
A 64-bit value is not likely appropriate. For example, the java.time classes use a pair of numbers, a number of whole seconds since the epoch reference of first moment of 1970 in UTC, plus another number for the count of nanoseconds in the fractional second.
It is really best to use the database's date-time data types if they are similar across your list of candidate databases. Using a count-from-epoch is inherently ambiguous, which makes identifying erroneous data difficult.
Storing your own count-from-epoch number is possible. If you must go that way, be sure the entire team understands what epoch reference was chosen. At least a couple dozen have been in use in various computing systems. Beware of staff persons assuming a particular epoch reference is in use.
Another way to define your own date-time tracking is to use text in the standard ISO 8601 formats. Such strings will alphabetically sort as chronological. One exception to that sorting is the optional but commonly used Z at the end to indicate an offset-from-UTC of zero (pronounced “Zulu”).
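A quick illustration of that sorting property, assuming all values are normalized to UTC (mixing different offsets in the stored strings would break the alphabetical-equals-chronological guarantee):

    # ISO 8601 strings in UTC sort alphabetically in chronological order.
    stamps = [
        "2021-07-01T00:00:00.250Z",
        "2020-12-31T23:59:59.999Z",
        "2021-07-01T00:00:00.100Z",
    ]
    print(sorted(stamps))
    # ['2020-12-31T23:59:59.999Z', '2021-07-01T00:00:00.100Z', '2021-07-01T00:00:00.250Z']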
The minimum DateTime I'd be able to store this way would be 0001-01-01T00:00:00.000+xx:xx,
Taking into account the maximum year span of 0001..8000
Are you really storing values from the time of Christ? Is this software really going to be around executing transactions for the year 8000?
This is an area where the responsible stakeholders should define their real needs. For example, for many business systems you may need only data from the year of the product's launch and run out only a century or two into the future.
The minimum/maximum value range varies widely between different databases. If you choose to use a built-in data type in each database system, investigate its limits. Some, for example, may go only to the year 2038, the common Y2038 problem.
To sum up my recommendation:
Get real about your date-time needs: min/max range, resolution, and various types (moment versus not a moment, date-only, etc.).
Get real about your possible databases for deployment.
If you need enterprise-quality reliability in a classic RDBMS, your candidate list is likely only a few: Postgres, Microsoft SQL Server, Oracle, and maybe IBM Db2.
Keep this list of supported databases as short as possible. Each database you agree to support is a huge commitment, now and in the future.
Be sure your chosen database(s) have a database driver available for your chosen programming language(s). For example JDBC for Java.
If at all possible, use the built-in data types offered by the database.
Be sure you and your team understand date-time handling. Many do not, in my experience, as (a) the subject is rarely taught, and (b) many programmers & admins mistakenly believe their quotidian intuitive understanding of date-time is sufficient for programming work. (Ignorance is bliss, as they say.)
Identify other areas of functionality beyond date-time handling, and compare which databases support those areas.
I would suggest you store the datetime information as milliseconds since 1970 (Java style).
It's a standard way of storing datetime information, and it's more space-efficient than your suggestion, because in your scheme some digits are "wasted", i.e. the month digits can only hold 01-12 (instead of 00-99), and so on.
You didn't specify your development language, but I'm sure you can find many code snippets that transform a date to milliseconds.
If you are developing in .NET, there is a similar concept of ticks, which you can use as well.
Regarding the time zone, I would add another column to store only the TimeZone indication.
Remember that any format you choose should maintain consistency between two dates, i.e. if D1 > D2 then format(D1) > format(D2); this way you can query the DB for changes since some date, or for changes between two dates.
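For illustration, a small Python sketch of that representation (the same idea applies in any language): milliseconds since 1970-01-01T00:00:00Z, with the zone kept in a separate column.

    from datetime import datetime, timezone, timedelta

    EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

    def to_epoch_millis(dt):
        """Whole milliseconds since the Java-style epoch (1970-01-01T00:00:00Z)."""
        return (dt - EPOCH) // timedelta(milliseconds=1)

    d1 = datetime(2009, 1, 1, 12, 0, 0, 999000, tzinfo=timezone(timedelta(hours=1)))
    d2 = datetime(2009, 1, 1, 11, 0, 0, 0, tzinfo=timezone.utc)

    m1, m2 = to_epoch_millis(d1), to_epoch_millis(d2)
    print(m1, m2)                    # 1230807600999 1230807600000
    print((d1 > d2) == (m1 > m2))    # True: the ordering property above holds
    # The offset (+01:00 for d1) would go into the separate time-zone column.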
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, as I've stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
Problem is, after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page is loaded, you calculate them offline, either via a nightly batch process or incrementally as each log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them later. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries, as in the sketch below.
Another kind of aggregation is called online analytical processing, which aggregates along only some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
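As a rough sketch of the incremental variant (the table and column names are made up, and SQLite merely stands in for whatever RDBMS the site actually uses): keep a small summary table keyed by zone and day, and bump a counter at insert time instead of recounting raw hits when the report page loads.

    import sqlite3
    from datetime import date

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (zone TEXT, user_id TEXT, hit_time TEXT)")
    conn.execute("""
        CREATE TABLE daily_zone_counts (
            zone TEXT, day TEXT, hit_count INTEGER,
            PRIMARY KEY (zone, day)
        )""")

    def record_hit(zone, user_id):
        """Write the raw hit and incrementally maintain the aggregate."""
        today = date.today().isoformat()
        conn.execute("INSERT INTO hits VALUES (?, ?, datetime('now'))", (zone, user_id))
        conn.execute("""
            INSERT INTO daily_zone_counts (zone, day, hit_count) VALUES (?, ?, 1)
            ON CONFLICT (zone, day) DO UPDATE SET hit_count = hit_count + 1
        """, (zone, today))

    record_hit("Minisite", "user-42")
    record_hit("Minisite", "user-43")

    # The report now reads a handful of summary rows instead of scanning raw hits.
    print(conn.execute("SELECT * FROM daily_zone_counts").fetchall())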
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This, along with the aggregation ideas mentioned earlier, will improve the reporting response time.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- it depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it grows with your data, whereas a partition covers only one day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
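For example, in PostgreSQL (used here purely as an illustration; the DDL and the partitioning features differ per database), monthly range partitioning of a hypothetical hits table could look roughly like this:

    import psycopg2  # assumes a reachable PostgreSQL server and credentials

    DDL = """
    CREATE TABLE hits (
        zone       text        NOT NULL,
        user_id    text        NOT NULL,
        hit_time   timestamptz NOT NULL
    ) PARTITION BY RANGE (hit_time);

    -- One sub-table per month; queries constrained to a date range only touch
    -- the partitions that overlap it (partition pruning).
    CREATE TABLE hits_2024_01 PARTITION OF hits
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    CREATE TABLE hits_2024_02 PARTITION OF hits
        FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
    """

    with psycopg2.connect("dbname=stats") as conn:   # hypothetical DSN
        with conn.cursor() as cur:
            cur.execute(DDL)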