Join index or collect stats: which is better in Teradata?

I am facing an issue with one of my FACT tables.
Through the same job,
I call a procedure to load this FACT table and then a second procedure to collect stats on this fact table.
As part of a new requirement I need to create a join index which will also include the above mentioned fact table.
I believe that the join index will be maintained whenever there is a change in any of the involved tables. So what will happen in the above scenario? Will my collect stats procedure wait for the join index maintenance to complete, or will there be any contention because of the simultaneous occurrence of collect stats and join index maintenance?
Regards,
Anoop

The Join Index will automatically be maintained by Teradata when ETL processes add, change, or delete data in the table(s) referenced by the Join Index. The Join Index will have to be removed if you apply DDL changes to table(s) referenced in the Join Index that affect the columns participating in the Join Index or before you can DROP the table(s) referenced in the Join Index.
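For reference, a simple (non-aggregate) join index over a fact table and one dimension could be defined along these lines; the table, column, and index names here are hypothetical and the exact options depend on your Teradata release:
-- Hypothetical fact (sales_fact) and dimension (store_dim) tables
CREATE JOIN INDEX ji_sales_store AS
SELECT s.sale_id, s.store_id, st.region, s.sales_amt
FROM sales_fact s
INNER JOIN store_dim st
ON s.store_id = st.store_id
PRIMARY INDEX (store_id);
Once defined, Teradata maintains its rows automatically as part of every insert, update, or delete against sales_fact or store_dim, which is the maintenance overhead described above.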
Statistics collection on either the Join Index or Fact table should be reserved until after the ETL for the Fact table has been completed or during a regular stats maintenance period. Whether you collect stats after each ETL process or only during a regular stats maintenance period is dependent on how much of the data in your Fact table is changing during each ETL cycle. I would hazard a guess that if you are creating a join index to improve performance of querying the fact table you likely do not need to collect stats on the same fact table after each ETL cycle unless this ETL cycle is a monthly or quarterly ETL process. Stats collection on the JI and fact table can be run in parallel. The lock required for COLLECT STATS is no higher than a READ. (It may in fact be an ACCESS lock.)
Depending on your release of Teradata you may be able to take advantage of the THRESHOLD options to allow the optimizer to determine whether or not statistics do in fact need to be recollected. I believe this was included in Teradata 14 as a stepping stone toward the automated statistics maintenance introduced in Teradata 14.10.
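As a rough illustration only (the table and column names are placeholders and the USING syntax varies by release), a threshold-based collection might look something like this:
-- Skip recollection unless roughly 10 percent of sales_fact has changed
COLLECT STATISTICS
USING THRESHOLD 10 PERCENT
COLUMN (sale_date)
ON sales_fact;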

Related

SQL Server Data Archiving

I have a SQL Azure database on which I need to perform some data archiving operation.
The plan is to move all the irrelevant data from the actual tables into Archive_* tables.
I have tables which have up to 8-9 million records.
One option is to write a stored procedure that inserts the data into the new Archive_* tables and also deletes it from the actual tables.
But this operation is really time consuming and runs for more than 3 hours.
I am in a situation where I can't have more than an hour's downtime.
How can I make this archiving faster?
You can use Azure Automation to schedule execution of a stored procedure every day at the same time, during a maintenance window, where this stored procedure will archive only the oldest week or month of data each time it runs. The stored procedure should archive data older than X weeks/months/years only. Please read this article to create the runbook. In a few days you will have all the old data archived, and the runbook will continue to do the job from then on.
You can't make it faster, but you can make it seamless. The first option is to have a separate task that moves data in portions from the source to the archive tables. In order to prevent table lock escalations and overall performance degradation, I would suggest limiting the size of a single transaction: e.g. start a transaction, insert N records into the archive table, delete those records from the source table, commit the transaction. Continue for a few days until all the necessary data is transferred. The advantage of this approach is that if there is some kind of failure, you may restart the archival process and it will continue from the point of the failure.
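A minimal sketch of that loop in T-SQL, assuming hypothetical dbo.Orders / dbo.Archive_Orders tables with identical column layouts and an OrderDate cutoff; it uses a single DELETE ... OUTPUT INTO per batch so the insert into the archive and the delete from the source happen in one statement:
-- Move rows older than one year in batches of 10,000
DECLARE @batch int = 10000;
DECLARE @moved int = 1;
WHILE @moved > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (@batch) FROM dbo.Orders
    OUTPUT DELETED.* INTO dbo.Archive_Orders
    WHERE OrderDate < DATEADD(YEAR, -1, GETDATE());

    -- Capture the batch size before COMMIT resets @@ROWCOUNT
    SET @moved = @@ROWCOUNT;

    COMMIT TRANSACTION;
END;
Keeping each batch in its own short transaction is what avoids lock escalation and lets you stop and restart the job at any point.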
The second option, which does not exclude the first one, really depends on how critical the performance of the source tables is for you and how many updates are happening to them. If it is not a problem, you can write triggers that pour every inserted/updated record into an archive table. Then, when you want a cleanup, all you need to do is delete the obsolete records from the source tables; their copies will already be in the archive tables.
In both cases you will not need any downtime.

Should ANALYZE be run in a transaction?

In SQLite (specifically version 3), should ANALYZE be run in a transaction?
If so, and I'm at the end of a long transaction that made lots of changes, is it okay to run ANALYZE in that same transaction or should that transaction be committed first and begin another transaction for the ANALYZE?
The documentation doesn't say anything about this one way or another.
ANALYZE reads the data from indexed columns and writes statistical information into some internal table.
This is somewhat similar to the following query:
INSERT OR REPLACE INTO sqlite_statXXX
SELECT 'MyTable', 'MyColumn', COUNT(*), AVG(MyColumn) FROM MyTable
done once for every indexed column.
Like any other SQL statement that writes a small amount of data to the database, the transaction overhead will be much larger than the actual effort to write the data itself.
In your case, there is no need for your changed data to become visible before the updated statistics do, so you could just as well run the ANALYZE in the same transaction.
If the database is so big that ANALYZE runs for a long time, it might make sense to delay its execution until later when it does not conflict with more important transactions.
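For instance, assuming a hypothetical bulk-load script, running ANALYZE as the last statement of the same transaction is valid SQLite and makes the new data and the new statistics appear together:
BEGIN;
-- ... many INSERTs / UPDATEs ...
INSERT INTO MyTable (MyColumn) VALUES (42);
-- refresh optimizer statistics before everything becomes visible at once
ANALYZE;
COMMIT;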

Is there a way to find the SQL that updated a particular field at a particular time?

Let's assume that I know when a particular database record was updated. I know that a history of all executed SQL exists somewhere, perhaps accessible only by a DBA. If I could access this history, I could SELECT from it where the query text is LIKE '%fieldname%'. While this would pretty much pull up any transactional query containing the field name I am looking for, it's a great start, especially if I can filter the recordset down to a particular date/time range.
I've discovered the dbc.DBQLogTbl view, but it doesn't appear to work as I expect. Is there another view that contains the information I am looking for?
It depends on the level of database query logging (DBQL) that has been enabled by the DBA.
Some DBAs may elect not to capture detailed information for tactical queries, so it is best to consult with your DBA team to understand what is being captured. You can also query DBC.DBQLRules to determine what level of logging has been enabled.
The following data dictionary objects will be of particular interest to your question:
DBC.QryLog contains the details about the query with respect to the user, session, application, type of statement, CPU, IO, and other fields associated with a particular query.
DBC.QryLogSQL contains the SQL statements. If a SQL statement exceeds a certain length it is split across multiple rows, which is denoted by a column in this table. If you join this to the main Query Log table, care must be taken if you are aggregating any metrics from the Query Log table, although more often than not, if you are joining the Query Log table to the SQL table, you are not doing any aggregation.
DBC.QryLogObjects contains the objects used by a particular query and how they were used. This includes tables, columns, and indexes referenced by a particular query.
These tables can be joined together in DBC via QueryID and ProcID. There are a few other tables that capture information about the queries but are beyond the scope of this particular question. You can find out about those in the Teradata Manuals.
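As a rough sketch only (column names such as SqlTextInfo and StartTime can vary by release, so verify them against the manuals), a search of the log for a given field name within a date range might look like this:
SELECT l.QueryID,
       l.UserName,
       l.StartTime,
       s.SqlTextInfo
FROM DBC.QryLog l
JOIN DBC.QryLogSQL s
ON l.QueryID = s.QueryID
AND l.ProcID = s.ProcID
WHERE l.StartTime BETWEEN TIMESTAMP '2016-01-01 00:00:00'
                      AND TIMESTAMP '2016-01-02 00:00:00'
AND s.SqlTextInfo LIKE '%fieldname%';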
Check with your DBA team to determine the level of logging being done and where the historical DBQL data is retained. Often DBQL data is moved nightly to a historical database, and there is often a ten minute delay before data is flushed from cache to the DBC tables. Your DBA team can tell you where to find historical DBQL data.

Oracle materialized views or aggregated tables in datawarehouse

Are Oracle (11g) materialized views good practice for aggregated tables in data warehousing?
We have DW processes that replace 2 months of data each day. Sometimes that means a few gigs for each month (~100K rows).
On top of them are materialized views that get refreshed after the nightly cycle of data transfer.
My question is would it be better to create aggregated tables instead of the MVs?
I think that one case where aggregated tables might be beneficial is where the aggregation can be effectively combined with the atomic-level data load, best illustrated with an example.
Let's say that you load a large volume of data into a fact table every day via a partition exchange. A materialized view refresh using partition change tracking is going to be triggered during or after the partition exchange, and it is going to scan the modified partitions and apply the changes to the MVs.
It is possible that, as part of populating the table(s) that you are going to exchange with the fact table partitions, you could also compute aggregates at various levels using CUBE/ROLLUP, and use a multi-table insert to load up tables that you can then partition-exchange into one or more aggregation tables. Not only might this be inherently more efficient by avoiding rescanning the atomic-level data, but your aggregates are computed prior to the fact table partition exchange, so if anything goes wrong you can suspend the modification of the fact table itself.
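A rough sketch of that idea with hypothetical table and partition names, using two separate direct-path inserts rather than a single multi-table INSERT ALL for brevity:
-- Load the atomic staging table from the daily feed
INSERT /*+ APPEND */ INTO stage_sales_fact
SELECT sale_date, product_id, store_id, amount
FROM ext_sales_load;
-- commit so the direct-path loaded rows can be read again in this session
COMMIT;

-- Compute the aggregate from the same staged rows (ROLLUP also emits subtotal rows)
INSERT /*+ APPEND */ INTO stage_sales_agg
SELECT sale_date, product_id, SUM(amount) AS amount
FROM stage_sales_fact
GROUP BY ROLLUP (sale_date, product_id);
COMMIT;

-- Exchange both staging tables into their partitioned targets
ALTER TABLE sales_fact EXCHANGE PARTITION p_20160101 WITH TABLE stage_sales_fact;
ALTER TABLE sales_agg EXCHANGE PARTITION p_20160101 WITH TABLE stage_sales_agg;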
Other thoughts might occur later ... I'll open the answer up as a community wiki if others have ideas.

SQL Database Design - Cache Tables?

What's a common/best practice for database design when it comes to improving performance on count(1) queries? (I'm currently using SQLite)
I've normalized my data; it exists in multiple tables, and for simple things I want to do on a single table with a good index, queries are acceptably quick for my purposes.
eg:
SELECT count(1) from actions where type='3' and area='5' and employee='2533';
But when I start getting into multiple table queries, things get too slow (> 1 second).
SELECT count(1)
from
(SELECT SID from actions
where type='3' and employee='2533'
INTERSECT
SELECT SID from transactions where currency='USD') x;
How should I cache my results? What is a good design?
My natural reaction is to add a table solely for storing rows of cached results per employee?
Edit
Design patterns like Command Query Responsibility Segregation (CQRS) specifically aim to improve the read side performance of data access, often in distributed systems and at enterprise scale.
Commands are issued to indicate 'transactions' or 'change / updates' to data
When a system processes these commands (e.g. by updating database tables), the new state of the affected objects is 'broadcast'
Systems which are interested (such as a user interface or a queryable REST API) will then subscribe to these data changes, and then 'shape' the updated data to their specific needs
This updated data is then cached (often called a 'Read Store')
Another pattern commonly associated with CQRS is "Event Sourcing", which stores, and then allows 'replay' of Commands, for various use cases.
The above may be overkill for your scenario, but a very simple implementation of caching at an internal app level could be via a SQLite trigger.
Assuming that there are many more 'reads' than writes to your actions or transactions tables,
you could create cache tables specifically for 'SID for actions by type by employee' and one for 'SID for transactions by currency', or even combine the two (depending on what other scenarios you have for querying).
You would then need to update these cache table(s) every time the underlying actions or transactions tables update. One cheap (and nasty) way would be to provide INSERT, UPDATE and DELETE triggers on the actions and transactions tables, which would then update the appropriate cache table(s).
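A minimal sketch of that trigger approach for the actions table only, keeping a per-employee, per-type row count current; the cache table name and the INSERT OR IGNORE plus UPDATE pattern are my own choices, not something from the question (an UPDATE trigger would follow the same pattern):
-- Hypothetical cache table keyed by the columns the count query filters on
CREATE TABLE IF NOT EXISTS action_counts (
    employee TEXT NOT NULL,
    type     TEXT NOT NULL,
    cnt      INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (employee, type)
);

CREATE TRIGGER actions_cache_ins AFTER INSERT ON actions
BEGIN
    -- create the counter row if it does not exist yet, then bump it
    INSERT OR IGNORE INTO action_counts (employee, type, cnt)
    VALUES (NEW.employee, NEW.type, 0);
    UPDATE action_counts
    SET cnt = cnt + 1
    WHERE employee = NEW.employee AND type = NEW.type;
END;

CREATE TRIGGER actions_cache_del AFTER DELETE ON actions
BEGIN
    UPDATE action_counts
    SET cnt = cnt - 1
    WHERE employee = OLD.employee AND type = OLD.type;
END;
The count query then becomes a single-row primary key lookup against action_counts instead of a scan.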
Your 'query' interface would now primarily interact with the cache tables, using the 'derived' data (such as the counts).
You may still however need to handle cache miss scenarios, such as the initial 'seed' of these cache tables, or if the cache tables need to be regenerated.
In addition to a local relational database like SQLite, NoSQL databases like MongoDB, Cassandra and Redis are frequently used as read-side caches in read-heavy environments (depending on the type and format of data that you need to cache). You would however need an alternative way to synchronize data from your 'master' (e.g. SQLite) database to these cache read stores; triggers obviously won't cut it here.
Original Answer
If you are 100% sure that you are always repeating exactly the same query for the same customer, sure, persist the result.
However, in most other instances, RDBMS usually handles caching just fine.
The INTERSECT with the query
SELECT SID from transactions where currency='USD'
Could be problematic if there are a large number of transaction records with USD.
Possibly you could replace this with a join?
SELECT count(1) from
(
    SELECT t.[SID]
    from transactions as t
    inner join
    (
        SELECT SID from actions where type='3' and employee='2533'
    ) as a
    on t.SID = a.SID
    where t.currency = 'USD'
) as a
You might just check your indexes however:
For
SELECT count(1) from actions where
type='3' and area='5' and
employee='2533'
SELECT SID from actions where
type='3' and employee='2533'
An index on Actions(Employee, Type) or Actions(Employee, Type, Area) would make sense (assuming Employee has highest selectivity, and depending on the selectivity of Type and Area).
You can also compare this to an index on Actions(Employee, Type, Area, SID) as a covering index for your second query.
And for the join above, you need an index on Transactions(SID, Currency)
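For example, those suggestions translate into index definitions along these lines (the column order reflects the selectivity assumption above):
-- supports the three-column count(1) filter and the employee/type lookup
CREATE INDEX idx_actions_emp_type_area ON actions (employee, type, area);
-- covering alternative for the second query: SID is included so the table itself is never touched
CREATE INDEX idx_actions_emp_type_sid ON actions (employee, type, SID);
-- supports the join / INTERSECT on transactions filtered by currency
CREATE INDEX idx_transactions_sid_currency ON transactions (SID, currency);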
