Teradata Partitioning Usage in Views - teradata

I have a table which is partitioned on a date column.
TRANS_DT BETWEEN '2010-01-01' AND '2030-12-31' EACH INTERVAL '1' MONTH.
A view is created on the table in which TRANS_DT is exposed as CAST(CAST(TRANS_DT AS DATE FORMAT 'MMM-YYYY') AS VARCHAR(8)), so it shows values like 'MAR-2021'.
When SQL runs against the view with the condition TRANS_DT = 'MAR-2021', the explain plan shows a full-table scan instead of a partition-level scan.
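For reference, here is a minimal sketch of the setup being described; the table, view, and non-date column names are stand-ins, while the partition expression and the CAST are taken from the question:
-- Hypothetical table and view names; only TRANS_DT, the partition expression,
-- and the CAST come from the question.
CREATE TABLE sales_tbl (
    trans_id INTEGER,
    trans_dt DATE
)
PRIMARY INDEX (trans_id)
PARTITION BY RANGE_N(trans_dt BETWEEN DATE '2010-01-01' AND DATE '2030-12-31' EACH INTERVAL '1' MONTH);

REPLACE VIEW sales_vw AS
SELECT trans_id,
       CAST(CAST(trans_dt AS DATE FORMAT 'MMM-YYYY') AS VARCHAR(8)) AS trans_dt
FROM sales_tbl;

-- Filtering on the reformatted string means the predicate no longer references the
-- DATE value the table is partitioned on, which is why the optimizer falls back to
-- a full-table scan instead of partition elimination.
SELECT * FROM sales_vw WHERE trans_dt = 'MAR-2021';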

Related

DynamoDB - Extract date and Query

I have the following table in my DynamoDB.
I want to get/extract all the data using the following conditions or filters:
This month's data: the set of records from the 1st of this month to today. (I think I can achieve this using the BEGINS_WITH filter, but I'm not sure whether that is the correct approach.)
This quarter's data: the set of records belonging to this quarter, basically from 1st April 2021 to 30th June 2021.
This year's data: the set of records belonging to this entire year.
Question: How can I filter/query the data using the date column from the above table to get these three types of data (month, quarter, year)?
Other Details
Table Size : 25 GB
Item Count : 4,081,678
It looks like you have time-based access patterns (e.g. fetch by month, quarter, year, etc).
Because your sort key starts with a date, you can implement your access patterns using the between condition on your sort key. For example (in pseudo code):
Fetch User 1 data for this month
query where user_id = 1 and date between 2021-06-01 and 2021-06-30
Fetch User 1 data for this quarter
query where user_id = 1 and date between 2021-04-01 and 2021-06-30
Fetch User 1 data for this year
query where user_id = 1 and date between 2021-01-01 and 2021-12-31
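If it helps to see the pseudo code as something executable, here is roughly how the first of those queries could be written with PartiQL for DynamoDB; the table name user_events is an assumption, and the date attribute is assumed to hold ISO-8601 strings so that lexical order matches chronological order:
-- Assumed table "user_events" with partition key user_id and sort key "date".
SELECT * FROM "user_events"
WHERE user_id = 1 AND "date" BETWEEN '2021-06-01' AND '2021-06-30';
Because the partition key is constrained by equality and the sort key by a single range, DynamoDB can serve this as a Query rather than a Scan.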
If you need to fetch across all users, you could use the same approach using the scan operation. While scan is commonly considered wasteful/inefficient, it's a fine approach if you run this type of query infrequently.
However, if this is a common access pattern, you might want to consider re-organizing your data to make this operation more efficient.
As mentioned in the answer above by @Seth Geoghegan, this table design is not ideal; you should think carefully before choosing your partition key and sort key. Still, for people like me who already have this kind of scenario, here are the steps I followed to mitigate my issue.
Enabled DynamoDB Streams.
Re-triggered the data so that it would pass through the DDB Streams (I added one additional column, updated_dttm, to all of my records using a script).
Processed the stream records; in my case I broke the date column above down into three more columns, event_date, category, and sub_category, and wrote them back to the original records using a Lambda.
I was then able to query my data using the event_date column; I can also create an index over the event_date column to make my queries/searches more effective.
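As a rough sketch, assuming a hypothetical table named my_table with a GSI named event_date-index keyed on event_date (both names are illustrative, not my actual ones), a PartiQL query against that index could look like this:
-- Query the assumed GSI "event_date-index" on the assumed table "my_table".
SELECT *
FROM "my_table"."event_date-index"
WHERE event_date = '2021-06-15';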
Points to Consider
Cost for updating records so that they can go to DDB Streams
Cost of reprocessing the records
Cost for updating records back to DDB

Formatted local time in a virtual generated column of SQLite

I want to store dates in milliseconds, but I also want to see a formatted representation of them.
To avoid wasting drive space, it makes sense to use a virtual generated column for this.
I wrote this:
DROP TABLE IF EXISTS example;
CREATE TABLE IF NOT EXISTS example (
  time INTEGER,
  formatted_time GENERATED ALWAYS AS (strftime('%Y.%m.%d %H:%M', time/1000, 'unixepoch')) VIRTUAL
);
INSERT INTO example (time) VALUES (1605960000000);
INSERT INTO example (time) VALUES (1615413202000);
It works, but I can't set the second modifier of strftime(format,timestring,modifier,modifier...) to get the local time. (It returns UTC time by default.)
When I use:
formatted_time GENERATED ALWAYS AS (strftime('%Y.%m.%d %H:%M', time/1000, 'unixepoch', 'localtime')) VIRTUAL
It throws Result: non-deterministic use of strftime() in a generated column when I insert data.
Meanwhile, this works as expected:
select strftime('%Y.%m.%d %H:%M', 1615413202000/1000, 'unixepoch', 'localtime');
How to create a virtual column with formatted local time?
It's not possible.
Based on these answers:
Creating generate column based on today's date in SQLite
Computed column 'Month' in table cannot be persisted because the column is non-deterministic
SQLite requires that the value of a generated column be the same on any machine in any time zone (i.e. be deterministic), even for a virtual generated column.
In my case, for the time value 1615413202000, formatted_time would differ between the same table opened in different time zones, so the generated column would be "non-deterministic".
As a workaround, it is possible to create a view:
CREATE VIEW example_view AS
SELECT time, strftime('%Y.%m.%d %H:%M', time/1000, 'unixepoch', 'localtime') as formatted_time_local
FROM example;
(based on Shawn's answer)
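With the two rows inserted above, querying the view returns the raw millisecond value alongside its local-time rendering, for example:
-- Returns each stored timestamp plus its formatted local-time string.
SELECT time, formatted_time_local FROM example_view;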

How to cluster raw events tables from Firebase Analytics in BQ in field event_name?

I would like to cluster the raw table of Firebase event data in BQ, but without reprocessing or creating other tables (keeping costs at a minimum).
The main idea is to find a way to cluster the tables at the moment they are created from the intraday table.
I tried to create empty tables with a pre-defined schema (the same as the previous events tables), but partitioned by the _partition_time column (NULL partition) and clustered by the event_name column.
After Firebase inserts all the data from the intraday table, event_name still shows in the table's Details tab as the clustering field, but no cost reduction happens when querying.
What could be another solution, or a way to make this work?
Thanks in advance.
/edit:
Our table's Details tab shows:
(screenshot: details tab of the table)
After running this query:
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
WHERE event_name = 'screen_view'
the result is:
(screenshot: the query scanned the whole table)
So, no cost reduction.
But if I try to create the same table clustered by event_name manually with:
CREATE TABLE `aaaa.aaaa.events_20181222`
PARTITION BY DATE(event_timestamp)
CLUSTER BY event_name
AS
SELECT * FROM `ooooooo.ooooooo_ooooo.events_20181222`
Then the same query as in the first screenshot, applied to this created table, processes only 5 MB - so clustering really works.

Separating DATE / DATETIME elements into different columns for fast querying in MySQL?

I want to store data in MySQL and query it based on the current day, and I want to know the best practice for doing so.
I want to store daily totals so that queries for total data will be quick. I thought about modeling my table as follows:
TotalsByCountry
- Year
- Month
- Day
- countryId
- totalNumber
When I query the totals for a specific day and a specific country, I will query the table on 4 columns: Year, Month, Day, and countryId.
I wanted to know whether this is good practice, or whether there is a better way, like using one column that holds the year, month, and day, and querying only two columns: the datetime column and the countryId.
I need your help choosing the right way to model the table. I also want to make another table that stores totals based on gender, so take that into consideration too.
The data will need to be accessed frequently, maybe in real time, because I want to show data changes in real time. I will be developing the web app in ASP.NET and will probably use web sockets to keep a constant connection that updates the data for the user in real time, so when data changes it is reflected on the user's web page immediately. That's why I need table modeling that is ready for many queries. I will use caching for a few seconds so it won't stress the DB too much.
I hope I provided enough information, if not, please comment and I will reply.
Having three separate columns to store each individual element of a date (year/month/day) will add unnecessary overhead to your database in terms of insert performance and disk space.
What you will want to do is simply have a single DATETIME column to store the date and time, and have a composite index set up on (countryId, datetime_col).
Even if you want to query all rows for a specific day or month, MySQL will still be able to utilize an index on the DATETIME field, provided that you write your queries the right way and never wrap the DATETIME column in a function when you perform your conditional check.
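As a concrete (purely illustrative) sketch of that layout, using the same table and column names as the queries below:
-- One DATETIME column plus a composite index on (countryId, datetime_col).
CREATE TABLE tbl (
    id           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    countryId    INT UNSIGNED NOT NULL,
    datetime_col DATETIME     NOT NULL,
    totalNumber  INT UNSIGNED NOT NULL,
    KEY idx_country_datetime (countryId, datetime_col)
);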
Here is how you can write your query so that it will still be able to utilize indexes:
-- Get the sum of totalNumber of all rows based on current day
-- where countryId = 1
SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CURDATE() AS DATETIME) AND
datetime_col < CAST(CURDATE() + INTERVAL 1 DAY AS DATETIME)
By making the comparison on the bare DATETIME column, the query remains sargable (i.e. able to utilize index range scans) and MySQL will be able to use indexes to quickly look up rows.
On the other hand, if you were to try to wrap the DATETIME column within a function to make the comparison:
-- Get the sum of totalNumber of all rows based on current day
-- where countryId = 1
SELECT SUM(totalNumber) AS totalsum
FROM tbl
WHERE countryId = 1 AND
DATE(datetime_col) = CURDATE()
...It would be quite inefficient because the DATE() function that wraps the column effectively renders the query as non-sargable, and any kind of index you have set up containing the DATETIME column will not be utilized.
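If you want to confirm the difference on your own data, EXPLAIN shows whether the composite index can be used for each form of the query (output omitted here, since it depends on your data and MySQL version):
-- Sargable form: MySQL should be able to use a range scan on (countryId, datetime_col).
EXPLAIN SELECT SUM(totalNumber) FROM tbl
WHERE countryId = 1 AND
      datetime_col >= CAST(CURDATE() AS DATETIME) AND
      datetime_col < CAST(CURDATE() + INTERVAL 1 DAY AS DATETIME);

-- Non-sargable form: the DATE() wrapper keeps the index from being used for the
-- date part of the condition.
EXPLAIN SELECT SUM(totalNumber) FROM tbl
WHERE countryId = 1 AND DATE(datetime_col) = CURDATE();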
You can also efficiently query for the total sum of all rows in the current month:
-- Get the sum of totalNumber of all rows based on current month
-- where countryId = 1
SELECT SUM(totalNumber) AS monthsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CONCAT(YEAR(NOW()), '-', MONTH(NOW()), '-01') AS DATETIME) AND
datetime_col < CAST(CONCAT(YEAR(NOW()), '-', MONTH(NOW()), '-01') AS DATETIME) + INTERVAL 1 MONTH
And within the current year:
-- Get the sum of totalNumber of all rows based on current year
-- where countryId = 1
SELECT SUM(totalNumber) AS yearsum
FROM tbl
WHERE countryId = 1 AND
datetime_col >= CAST(CONCAT(YEAR(NOW()), '-01-01') AS DATETIME) AND
datetime_col < CAST(CONCAT(YEAR(NOW()), '-01-01') AS DATETIME) + INTERVAL 1 YEAR
My argument is:
If you want fast database lookups, you need well-built queries that use indexes.
Your approach requires indexing four columns (which means slower inserts); with a single date column you only need a two-column index. Query complexity will also increase if you ever need to search over date ranges.

Does a multi-column index work for single column selects too?

I've got (for example) an index:
CREATE INDEX someIndex ON orders (customer, date);
Does this index only accelerate queries where customer and date are used or does it accelerate queries for a single-column like this too?
SELECT * FROM orders WHERE customer > 33;
I'm using SQLite.
If the answer is yes, why is it possible to create more than one index per table?
Yet another question: how much faster is a combined index compared with two separate indexes when you use both columns in a query?
marc_s has the correct answer to your first question. The first key in a multi-key index can work just like a single-key index, but any subsequent keys will not.
As for how much faster the composite index is, that depends on your data and how you structure your index and query, but it is usually significant. The indexes essentially allow SQLite to do a binary search on the fields.
Using the example you gave if you ran the query:
SELECT * FROM orders WHERE customer > 33 AND date > 99
SQLite would first get all results using a binary search on the entire table where customer > 33. Then it would do a binary search on only those results looking for date > 99.
If you did the same query with two separate indexes on customer and date, SQLite would have to binary search the whole table twice, first for the customer and again for the date.
So how much of a speed increase you will see depends on how you structure your index with regard to your query. Ideally, the first field in your index and your query should be the one that eliminates the most possible matches as that will give the greatest speed increase by greatly reducing the amount of work the second search has to do.
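If you want to see what SQLite actually decides for a particular query, EXPLAIN QUERY PLAN reports which index (if any) is used; a quick sketch (output omitted, since its wording varies between SQLite versions):
-- Check index usage for the single-column condition...
EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer > 33;
-- ...and for the two-column condition.
EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer > 33 AND date > 99;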
For more information see this:
http://www.sqlite.org/optoverview.html
I'm pretty sure this will work, yes - it does in MS SQL Server anyway.
However, this index doesn't help you if you need to select on just the date, e.g. a date range. In that case, you might need to create a second index on just the date to make those queries more efficient.
Marc
I commonly use combined indexes to sort through data I wish to paginate or request "streamily".
Assume a customer can make more than one order, customers 0 through 11 exist, and there are several orders per customer, all inserted in random order. I want to sort a query by customer number followed by date. You should also sort on the id field last, to split sets where a customer has several identical dates (even if that may never happen).
sqlite> CREATE INDEX customer_asc_date_asc_index_asc ON orders
(customer ASC, date ASC, id ASC);
Get page 1 of a sorted query (limited to 10 items):
sqlite> SELECT id, customer, date FROM orders
ORDER BY customer ASC, date ASC, id ASC LIMIT 10;
2653|1|1303828585
2520|1|1303828713
2583|1|1303829785
1828|1|1303830446
1756|1|1303830540
1761|1|1303831506
2442|1|1303831705
2523|1|1303833761
2160|1|1303835195
2645|1|1303837524
Get the next page:
sqlite> SELECT id, customer, date FROM orders WHERE
(customer = 1 AND date = 1303837524 and id > 2645) OR
(customer = 1 AND date > 1303837524) OR
(customer > 1)
ORDER BY customer ASC, date ASC, id ASC LIMIT 10;
2515|1|1303837914
2370|1|1303839573
1898|1|1303840317
1546|1|1303842312
1889|1|1303843243
2439|1|1303843699
2167|1|1303849376
1544|1|1303850494
2247|1|1303850869
2108|1|1303853285
And so on...
Having the indexes in place reduces the server-side index scanning you would otherwise get from a query using OFFSET coupled with LIMIT: the higher the offset goes, the longer the query takes and the harder the drives seek. Using this method eliminates that.
Using this method is advised if you plan on joining data later but only need a limited set of data per request. Join against a SUBSELECT as described above to reduce memory overhead for large tables.
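A rough sketch of that pattern, assuming a separate customers table (hypothetical name and columns) that you want to join in:
-- Page the narrow orders result first, then join only that page to the wider table.
SELECT page.id, page.customer, page.date, c.name
FROM (
    SELECT id, customer, date
    FROM orders
    ORDER BY customer ASC, date ASC, id ASC
    LIMIT 10
) AS page
JOIN customers AS c ON c.id = page.customer;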
