How to change the PI of an existing huge table in Teradata?

We have a huge table in Teradata holding transactional data, approximately 13 TB in size.
As is typical for an enhancement case in an old data warehouse, we have now found that the PI selected for this table in the past is not the best choice. Because of this, we are facing lots of performance issues with the queries pulling data from this table.
So the first idea we wanted to implement was to load the data into a temp table, then either alter the PI of the existing table or create a new table with the new PI, and load the data from the temp table into the new/altered table.
But the challenge is that the table is live, and this solution is not ideal because of the size of the data movement. We also considered a delta load (into the new table) combined with a delta delete (from the main table), but that is not a great workaround either, since the table is live and it would involve a lot of effort both in moving the data and in reconciling the row counts between the source and target tables.
I need your ideas on this scenario: how can I change the PI of this table with minimal effort and without any system downtime?
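Roughly, the "new table with the new PI, then swap" idea we had in mind looks like the sketch below; the table and column names (sales_txn, txn_id, txn_date) are just placeholders, and the copy would be chunked (for example by date range) rather than done in one shot:

    -- Placeholder names only: sales_txn, txn_id, txn_date are not the real objects.
    -- Empty copy of the table, but with the new primary index.
    CREATE TABLE sales_txn_new AS (
        SELECT * FROM sales_txn
    ) WITH NO DATA
    PRIMARY INDEX (txn_id);

    -- Copy the data across in manageable slices, e.g. one month at a time.
    INSERT INTO sales_txn_new
    SELECT * FROM sales_txn
    WHERE txn_date BETWEEN DATE '2015-01-01' AND DATE '2015-01-31';

    -- After reconciling row counts, swap the tables during a short cutover window.
    RENAME TABLE sales_txn TO sales_txn_old;
    RENAME TABLE sales_txn_new TO sales_txn;

Even with this, the final rename still needs a brief window where nothing writes to the table, which is exactly the downtime we are trying to avoid.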

Related

Adding new fields to historical tables in BigQuery

I'm getting daily exports of Google Analytics data into BigQuery and these form the basis for our main reporting dataset.
Over time I need to add new columns for additional things we use to enrich the data, like, say, a mapping from URL to 'reporting category'.
This is easy to just add as a new column onto the processed tables (there are about 10 processing steps at the moment for all the enrichment we do).
The issue is when stakeholders then ask: can we add that new column to the historical data?
Currently I then need to rerun all the daily jobs, which is very slow and costly.
This is coming up frequently enough that I'm seriously thinking about redesigning my data pipelines to allow for the fact that I often need to essentially drop and recreate ALL the data from time to time when I need to add a new field or correct old dirty data.
I'm just wondering if there are better ways to:
1. Add a new column to an old table in BQ (I would be happy to do this by hand for those instances where I can just join the new column based on the GA [hit_key] I have defined, which is basically a row key).
2. (Less common) Update existing tables based on some WHERE condition.
I'm just wondering what best practices are, whether anyone has had similar issues where you basically need to update a historical schema, and whether there are ways to do it without just dropping and recreating everything, which is essentially what I'm currently doing.
To be clearer on my current approach: I'm taking the [ga_sessions_yyyymmdd] table and making a series of [ga_data_prepN_yyyymmdd] tables, where each step either adds new columns or reduces the data in some way. There are now 11 of these steps, and each time I'm taking all 100 or more columns along for the ride. This is what I'm going to try to design away from, as currently 90% of the columns at each stage don't even need to be touched; they could just be joined back on at the end, maybe based on hit_key or something.
It's a little bit messy to try and pick apart, though.
Adding new columns to the schema of existing historical tables is possible, but the values for the newly added columns will be NULL. If you do need to populate values into these columns, probably the best approach is to use an UPDATE DML statement. More details on how to try it out are here: Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?
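In current BigQuery standard SQL that combination looks roughly like the sketch below; the project, dataset, table and mapping-table names (my_project.analytics.*, url_category_map) and the reporting_category column are placeholders, while the hit_key join key comes from the question:

    -- Placeholder names throughout; only hit_key is taken from the question.
    -- 1. Add the new, initially NULL column to the historical table.
    ALTER TABLE `my_project.analytics.ga_data_prep1_20170101`
    ADD COLUMN reporting_category STRING;

    -- 2. Backfill it from a small mapping table, joining on the row key.
    UPDATE `my_project.analytics.ga_data_prep1_20170101` t
    SET reporting_category = m.reporting_category
    FROM `my_project.analytics.url_category_map` m
    WHERE t.hit_key = m.hit_key;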

Data revisioning/versioning best practices

I am struggling to define an effective revisioning process. We have some data spread across multiple tables. We cannot delete or update; we need to create new issues of the same data. I know the solution of a history table containing all revisions, etc., but that seems to work fine only as long as you want to keep revisions of simple structures, such as a blogging platform.
What if you have a database with many complex structures, where even the simplest of them is something like the TableA/TableB pair described below?
If you change something in TableA, you can keep the old data in a history table. What happens, though, if you change something in TableB, which defines what a record in TableA is? It almost forces you to create a copy of TableA (a new ID, in other words) and recreate its underlying structures (more new IDs). The whole process of creating a new ID each time a mistake is corrected or some peripheral data is added doesn't feel right.
Is there any good practice for such cases? I read somewhere about keeping the whole old data structure revisioned in XML, but that approach copes poorly with schema changes and is not easily queryable. Technologies such as Flashback don't cover the whole spectrum of our needs either.
Tip: We're using Oracle v11.2.
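One pattern that sometimes helps is to separate the stable identity of a record from its revisions, so that child tables reference the identity rather than one specific revision. A minimal Oracle-style sketch with made-up names (document_head, document_rev, attachment_rev), not a prescription for the actual schema:

    -- Made-up tables for illustration only.
    CREATE TABLE document_head (
        doc_id      NUMBER PRIMARY KEY              -- stable identity, never changes
    );

    CREATE TABLE document_rev (
        doc_id      NUMBER NOT NULL REFERENCES document_head (doc_id),
        rev_no      NUMBER NOT NULL,                -- 1, 2, 3, ... one row per correction/issue
        valid_from  DATE   NOT NULL,
        payload     VARCHAR2(4000),
        CONSTRAINT pk_document_rev PRIMARY KEY (doc_id, rev_no)
    );

    -- Dependent structures reference the stable doc_id, so issuing a new revision
    -- of the parent does not force new IDs all the way down the graph.
    CREATE TABLE attachment_rev (
        doc_id      NUMBER NOT NULL REFERENCES document_head (doc_id),
        att_id      NUMBER NOT NULL,
        rev_no      NUMBER NOT NULL,
        file_name   VARCHAR2(255),
        CONSTRAINT pk_attachment_rev PRIMARY KEY (doc_id, att_id, rev_no)
    );

Queries for the current issue then pick the highest rev_no per doc_id, and a correction only adds rows to the revision tables that actually changed.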

websql performance, can we shard tables

I am using WebSQL to store data in a PhoneGap application. One of the tables has a lot of data, say from 2,000 to 10,000 rows. So when I read from this table, which is just a simple SELECT statement, it is very slow. I then debugged and found that as the size of the table increases, the performance decreases exponentially. I read somewhere that to get performance you have to divide the table into smaller chunks; is that possible, and how?
One idea is to look for something to group the rows by and consider breaking them into separate tables based on some common category, instead of one shared table for everything.
I would also consider fine-tuning the queries to make sure they are optimal for the given table.
Make sure you're not just running a simple SELECT query without a WHERE clause to limit the result set.
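Since WebSQL is backed by SQLite, an index on the filter column together with a WHERE/LIMIT often goes a long way before you resort to splitting tables. A sketch with placeholder names (items, category, created_at):

    -- Placeholder table and columns; create the index once at startup.
    CREATE INDEX IF NOT EXISTS idx_items_category ON items (category);

    -- Read only the slice you actually need instead of the whole table.
    SELECT id, name, created_at
    FROM items
    WHERE category = 'news'
    ORDER BY created_at DESC
    LIMIT 50;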

Tables with data that will never be deleted or changed

This is a more in-depth follow-up to a question I asked yesterday about storing historical data (Storing data in a side table that may change in its main table), and I'm trying to narrow down my question.
If you have a table that represents a data object at the application level and you need that table for historical purposes, is it considered bad practice to set it up so that the information can't be deleted? Basically, I have a table representing safety requirements for a worker, and I want to make it so that these requirements can never be deleted or changed. So if a change needs to be made, a new record is created.
Is this not a good idea? What are the best practices for dealing with data like this? I have a table with historical safety training data, and it points to the table with requirement data (as well as some other key tables), so I can't let the requirements be changed or the historical table will be pointing to the wrong information.
Is this not a good idea?
Your scenario sounds perfectly valid to me. If you have historical data that you need to keep, there are various ways to meet that requirement.
Option 1:
Store all historical data and current data in one table (make sure you store a creation date so you know what's old and what's new). When you need to retrieve the most recent record for someone, just base it on the most recent date that exists in the table.
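A minimal sketch of that "latest row per person" lookup, with placeholder table and column names (requirement, worker_id, created_at):

    -- Placeholder names; all versions live in one table and the newest row per worker wins.
    SELECT r.*
    FROM   requirement r
    WHERE  r.created_at = (
        SELECT MAX(r2.created_at)
        FROM   requirement r2
        WHERE  r2.worker_id = r.worker_id
    );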
Option 2:
Store all historical data in a separate table and keep current data in another. This might be beneficial if you're working with millions of records so you don't degrade performance of any applications built on top of it. Either at the time of creating a new record or through some nightly job you can move old data into the other table to keep your current table lightweight.
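That nightly move could be as simple as the following, assuming a placeholder requirement_history table with the same columns and some predicate (here a hypothetical superseded flag) that marks rows as no longer current:

    -- Placeholder tables and flag; both statements would run in one transaction.
    INSERT INTO requirement_history
    SELECT * FROM requirement
    WHERE  superseded = 1;

    DELETE FROM requirement
    WHERE  superseded = 1;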
Here is one alternative, that is not necessarily "better" but is something to keep in mind...
You could have separate "active" and "historical" tables, then create a trigger so whenever a row in the active table is modified or deleted, the old row values are copied to the historical table, together with the timestamp.
This way, the application can work with the active table in a natural way, while the accurate history of changes is automatically generated in the historical table. And since this works at the DBMS level, you'll be more resistant to application bugs.
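A sketch of such a trigger in Oracle-style syntax, with placeholder table and column names; most DBMSs can express the same idea:

    -- Placeholder tables/columns; copies the pre-change values plus a timestamp.
    CREATE OR REPLACE TRIGGER trg_requirement_history
    BEFORE UPDATE OR DELETE ON requirement
    FOR EACH ROW
    BEGIN
        INSERT INTO requirement_history (req_id, worker_id, description, changed_at)
        VALUES (:OLD.req_id, :OLD.worker_id, :OLD.description, SYSTIMESTAMP);
    END;
    /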
Of course, things can get much messier if you need to maintain a history of the whole graph of objects (i.e. several tables linked via FOREIGN KEYs). Probably the simplest option is to simply forgo referential integrity for historical tables and just keep it for active tables.
If that's not enough for your project's needs, you'll have to somehow represent a "snapshot" of the whole graph at the moment of change. One way to do it is to treat the connections as versioned objects too. Alternatively, you could just copy all the connections with each version of the endpoint object. Either case will complicate your logic significantly.

How to create INDEX for large data in ORACLE (150 million)

I want to create an index on a table.
I tried doing it manually by writing the index syntax, like:
create index index_name on table_name (column_name);
I tried this, but it is taking a very long time for large data (150 million rows).
I heard that we can also create the index in the back end on the server, with something like an "index.sh" script. I don't know exactly how this works.
I think when someone advised you to "create the index in the back end", they might have meant "create it as a background task", maybe by submitting a job during off-peak hours.
If your table is large, you should consider partitioning it and then creating local/global index partitions. Without knowing about the application (data warehouse vs. OLTP), the type of data, the table structures, and how the table is accessed, it is difficult to come up with the appropriate solution.
Also, if you are trying to build an index on a really large table, you should consider creating the index with NOLOGGING. It comes with its own gotchas (you have to take a backup after the index is created), so you should see if it makes sense in your case.
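For illustration, a NOLOGGING/parallel build and a local index might look roughly like this; the table, column and index names are placeholders, and the LOCAL example assumes the table is already range-partitioned:

    -- Placeholder names; build fast, then restore normal attributes.
    CREATE INDEX idx_big_tab_cust ON big_tab (customer_id)
        NOLOGGING
        PARALLEL 8;

    ALTER INDEX idx_big_tab_cust LOGGING;
    ALTER INDEX idx_big_tab_cust NOPARALLEL;

    -- If big_tab is range-partitioned (e.g. by date), a LOCAL index keeps each
    -- partition's index segment smaller and independently rebuildable.
    CREATE INDEX idx_big_tab_txn_dt ON big_tab (txn_date)
        LOCAL
        NOLOGGING
        PARALLEL 8;

Because NOLOGGING work is not recorded in the redo log, the backup mentioned above matters if you ever need to recover.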
