Teradata Change data capture - teradata

My team is thinking about developing a real-time application (a bunch of charts, gauges, etc.) that reads from the database. On the back end we have a high-volume Teradata database. We expect other applications to be constantly feeding data into this database.
Now we are wondering how to feed changes from the database to the application. Polling from the application would not be a viable option in our case.
Are there any tools available within Teradata that would help us achieve this?
Any directions on this would be greatly appreciated.

We faced a similar requirement, but in our case the client asked us to provide daily changes to a purchase orders table. That meant running a batch of scripts every day to capture the changes occurring in the table.
So we collected the data every day and stored it in a sparse history format in another table. The process is simple. On the first day, we store each purchase order detail record in the history table against that day's date. The next day, we compare that day's feed record against the history record and identify any change. If any column of the purchase order record has changed, we collect that record and keep it in a final reporting table, which is shown to the client.
If you run the batch scripts once a day and a record changes more than once in a day, this method cannot give you the full set of changes. For that you may need to run the batch scripts more than once a day, depending on your requirement.
Please let us know if you find any other solution. Hope this helps.
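The compare step described here could be sketched roughly as follows. This is only an illustration in Python, assuming each day's feed is available as a dictionary keyed by purchase order ID; the function name `detect_changes` and the fields `status` and `qty` are hypothetical and not taken from our actual scripts.

```python
def detect_changes(history, todays_feed):
    """Compare today's feed against the stored history snapshot.

    Both arguments map a purchase order ID to a dict of column values
    (hypothetical layout). Returns the records that are new or whose
    columns differ, and updates the history in place so the next run
    compares against today's state.
    """
    changed = {}
    for po_id, record in todays_feed.items():
        if history.get(po_id) != record:
            changed[po_id] = record
            history[po_id] = dict(record)  # sparse history: store only on change
    return changed


history = {}
day1 = {"PO-1": {"status": "open", "qty": 10}, "PO-2": {"status": "open", "qty": 5}}
day2 = {"PO-1": {"status": "open", "qty": 10}, "PO-2": {"status": "shipped", "qty": 5}}

first_run = detect_changes(history, day1)   # every record is new on the first day
second_run = detect_changes(history, day2)  # only PO-2 differs from the history
```

Because the comparison only sees the state at the time of each run, a record that changes twice between runs shows up as a single change.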

There is a change data capture tool from WisdomForce:
http://www.wisdomforce.com/resources/docs/databasesync/DatabaseSyncBestPracticesforTeradata.pdf
It would probably work in this case.

Are triggers with stored procedures an option?
CREATE TRIGGER db_name.trigger_name
AFTER INSERT ON db_name.tbl_name
REFERENCING NEW AS new_row
FOR EACH ROW
(CALL db_name.stored_procedure(new_row.col1);)
Theoretically speaking, you could write external stored procedures that call UDFs written in Java, C/C++, etc., which push the row data to your application in near real time.
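To make the idea concrete, here is a rough sketch that uses SQLite through Python's `sqlite3` module as a stand-in, since it runs without a Teradata system. It is not Teradata syntax. The table `orders`, the function `push_row`, and the in-memory list standing in for the application are all assumptions made for illustration.

```python
import sqlite3


def connect_with_push(sink):
    """Open an in-memory database whose insert trigger calls a Python UDF."""
    conn = sqlite3.connect(":memory:")

    def push_row(order_id, amount):
        # Stand-in for "push the row to the application"; a real system
        # might publish to a message queue or a socket instead.
        sink.append((order_id, amount))
        return 1

    # Register the Python function so SQL (including triggers) can call it.
    conn.create_function("push_row", 2, push_row)
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
        BEGIN
            SELECT push_row(NEW.order_id, NEW.amount);
        END;
        """
    )
    return conn


received = []
conn = connect_with_push(received)
conn.execute("INSERT INTO orders (order_id, amount) VALUES (1, 9.5)")
conn.execute("INSERT INTO orders (order_id, amount) VALUES (2, 20.0)")
conn.commit()
```

Note that in this sketch the function fires when the row is inserted, not when the transaction commits, so a rolled-back insert would still have been pushed. Whether that matters depends on your application.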

Related

What design concept am I dealing with?

I have two data objects- customers and jobs. Job records are created based on certain fields of the customer record. A job record is created for each service visit to the customer, which happens on a recurring weekly basis.
So I'm considering the best way to create job records. I can either:
1. Use a server function to create job records on the back end, in batches (say, quarterly), so I'd have job records for 12 weeks ahead. This way I can just query the jobs table for any operations in the presentation layer.
2. Use fields on the customers table to create jobs in the presentation layer, creating the job record only after some interaction with the presentation layer. This way, jobs are always created from updated data.
I think I should go with the second approach but it seems like I might be committing a design transgression when it comes to handling data and presentation layers.
Is there some concept that encapsulates this type of problem?
--
Drawback to first approach: The server function would have to run after any changes to the customer record so that jobs are updated. I suppose I could schedule the function to run every night (cron job) so I'm getting updated records every day. But I think there should be a simpler way.
This is kind of an opinion question, and I suspect it might get removed.
But I would go with #2, always. With #1 you're creating a lot of empty data records that hold no value and may or may not get used. #2 also gives you an opportunity to present the data to the user for verification before saving the job.
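A minimal sketch of what the second approach might look like, in Python. The proposed visits are computed from the customer's current fields each time they are needed, and a job record is only stored once the user confirms one. The names `upcoming_visits`, `confirm_job`, and the `visit_weekday` field are hypothetical; your schema will differ.

```python
from datetime import date, timedelta


def upcoming_visits(visit_weekday, start, weeks):
    """Return the next `weeks` visit dates on or after `start`.

    visit_weekday uses date.weekday() numbering (Monday is 0).
    """
    days_ahead = (visit_weekday - start.weekday()) % 7
    first = start + timedelta(days=days_ahead)
    return [first + timedelta(weeks=i) for i in range(weeks)]


def confirm_job(jobs, customer_id, visit_date):
    """Persist a job record only once the user confirms the visit."""
    job = {"customer_id": customer_id, "visit_date": visit_date}
    jobs.append(job)
    return job


jobs = []  # stand-in for the jobs table
# Wednesday visits (weekday 2), starting from Monday 2024-01-01.
proposed = upcoming_visits(visit_weekday=2, start=date(2024, 1, 1), weeks=3)
confirm_job(jobs, customer_id=42, visit_date=proposed[0])
```

Nothing is written for the unconfirmed visits, so a change to the customer record simply changes what gets proposed next time.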

SQL Server Data Archiving

I have a SQL Azure database on which I need to perform some data archiving operation.
The plan is to move all the irrelevant data from the actual tables into Archive_* tables.
I have tables which have up to 8-9 million records.
One option is to write a stored procedure that inserts the data into the new Archive_* tables and deletes it from the actual tables.
But this operation is really time-consuming and runs for more than 3 hours.
I am in a situation where I can't have more than an hour's downtime.
How can I make this archiving faster?
You can use Azure Automation to schedule execution of a stored procedure every day at the same time, during a maintenance window. Each time it runs, the stored procedure archives only the oldest week or month of data, and it should only touch data older than X weeks/months/years. Please read this article to create the runbook. In a few days you will have all the old data archived, and the runbook will continue to do the job from then on.
You can't make it faster, but you can make it seamless. The first option is to have a separate task that moves data in portions from the source to the archive tables. To prevent lock escalation and overall performance degradation, I would suggest limiting the size of a single transaction. E.g. start a transaction, insert N records into the archive table, delete those records from the source table, commit the transaction. Continue for a few days until all the necessary data is transferred. The advantage of this approach is that if there is a failure, you can restart the archival process and it will continue from the point of failure.
The second option, which does not exclude the first one, depends on how critical the performance of the source tables is for you and how many updates they receive. If it is not a problem, you can write triggers that copy every inserted/updated record into an archive table. Then, when you want a cleanup, all you need to do is delete the obsolete records from the source tables; their copies will already be in the archive tables.
In both cases you will not need any downtime.
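The batch loop from the first option can be sketched like this. The example uses Python with SQLite so that it runs anywhere; it is not T-SQL, and the table names `orders` and `archive_orders`, the `created` column, and the function `archive_in_batches` are assumptions made for illustration.

```python
import sqlite3


def archive_in_batches(conn, cutoff, batch_size):
    """Move rows older than `cutoff` to the archive table, one small
    transaction per batch. Returns the total number of rows moved."""
    moved = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM orders WHERE created < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size))]
        if not ids:
            return moved
        marks = ",".join("?" * len(ids))
        with conn:  # commits on success, rolls back if either statement fails
            conn.execute(
                f"INSERT INTO archive_orders SELECT * FROM orders WHERE id IN ({marks})",
                ids)
            conn.execute(f"DELETE FROM orders WHERE id IN ({marks})", ids)
        moved += len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE TABLE archive_orders (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, f"2020-01-{i:02d}") for i in range(1, 11)])
conn.commit()

total = archive_in_batches(conn, cutoff="2020-01-08", batch_size=3)
```

Each batch is committed on its own, so an interrupted run leaves already-moved rows in the archive and the next run picks up the remaining ones. On SQL Server the insert-and-delete pair is often written as a single DELETE with an OUTPUT ... INTO clause, but check that against your own schema before relying on it.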
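As a rough illustration of the trigger option, the sketch below again uses SQLite from Python as a stand-in rather than SQL Server. The tables `orders` and `archive_orders` and the trigger names are hypothetical. Every insert or update is mirrored into the archive table, so cleanup is just a delete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE archive_orders (id INTEGER PRIMARY KEY, status TEXT);

    -- Mirror every new row into the archive table.
    CREATE TRIGGER orders_ai AFTER INSERT ON orders
    BEGIN
        INSERT OR REPLACE INTO archive_orders VALUES (NEW.id, NEW.status);
    END;

    -- Keep the archived copy in step with later updates.
    CREATE TRIGGER orders_au AFTER UPDATE ON orders
    BEGIN
        INSERT OR REPLACE INTO archive_orders VALUES (NEW.id, NEW.status);
    END;
    """
)

conn.execute("INSERT INTO orders VALUES (1, 'open')")
conn.execute("UPDATE orders SET status = 'closed' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")  # cleanup: the copy already exists
conn.commit()

archived = conn.execute("SELECT id, status FROM archive_orders").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The cost is extra work on every write to the source table, which is why this only makes sense when write performance there is not critical.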

Can I add a field to the app_events_intraday table in BigQuery?

I am currently extracting my Firebase event data from BigQuery to an onsite database for analysis. I extract the Firebase intraday table(s) along with the previous 4 days (since previous days' tables continue to be updated) every time I run the ETL job. Since there is no key or unique ID for events, I am deleting & re-inserting the past 4 days of data locally in order to refresh the data from BigQuery.
Would it be possible for me to create a new field called event_dim.etl_status on the intraday table to keep track of events that have been moved locally? And if so, would this field make its way into the app_events_yyyymmdd table once it is renamed from *_intraday to *_yyyymmdd?
Edit:
Some more context based on comments from dsesto:
A magical Firebase-BigQuery wizard automatically copies/renames the Event "intraday" table into a daily table, so I have no way to reproduce or test this. It is part of the Firebase->BigQuery black box.
Since I only have a production environment (Firebase has no mechanism for a sandbox environment), testing this theory would require potentially breaking my production environment which is why I posed a "is it possible" scenario in case someone else has done something similar.

MS SQL product list with filtering

I'm building an application in ASP.NET (VB) with an MS SQL database. It is a search tool for cars that has a list of every car and all of its attributes (colors, # of doors, gas mileage, mfg. year, etc.). The tool outputs the results in a gridview, and the user has the ability to perform advanced searches and filtering. The filtering needs to be very fine-grained (range of gas mileage, color(s), mfg. year range, etc.) and I cannot seem to find the best way to do this filtering without a large SQL WHERE clause that is going to greatly impact SQL performance and page load. I feel like I'm missing something very obvious here; thank you for any help. I'm not sure what other details would be helpful.
This is not an OLTP database you're building--it's really an analytics database. There really isn't a way around the problem of having to filter. The question is whether the organization of the data will allow seeks most of the time, or will it require scans; and also whether the resulting JOINs can be done efficiently or not.
My recommendation is to go ahead and create the data normalized and all, as you are doing. Then, build a process that spins it into a data warehouse, denormalizing like crazy as needed, so that you can do filtering by WHERE clauses that have to do a lot less work.
For every single possible search result, you have a row in a table that doesn't require joining to other tables (or only a few fact tables).
You can reduce complexity a bit for some values, such as gas mileage, by striping the mileage into bands of, say, 5 mpg (15-19, 20-24, 25-29, etc.).
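The banding itself is simple to compute when loading the warehouse. A minimal sketch in Python, assuming whole-number mpg values and a hypothetical helper called `mpg_band`:

```python
def mpg_band(mpg, width=5):
    """Return the label of the fixed-width band that contains `mpg`."""
    low = int(mpg // width) * width
    return f"{low}-{low + width - 1}"


bands = [mpg_band(value) for value in (18, 20, 24, 25, 31)]
```

Storing the band label as its own column lets a filter such as "20 to 29 mpg" become a match on two band values instead of a range comparison on every row.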
As you need to add to the data and change it, your data-warehouse-loading process (that runs once a day perhaps) will keep the data warehouse up to date. If you want more frequent loading that doesn't keep clients offline, you can build the data warehouse to an alternate node, then swap them out. Let's say it takes 2 hours to build. You build for 2 hours to a new database, then swap to the new database, and all your data is only 2 hours old. Then you wipe out the old database and use the space to do it again.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table in my DB each time a user enters a specific zone (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow pre-parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data. Instead of calculating them "online" when the page that displays them loads, you calculate them offline, either in a nightly batch process or incrementally when each log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
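A sketch of the per-session counter, using Python with SQLite so that it is runnable as-is. The table `zone_hits`, its columns, and the function `record_hit` are assumptions for illustration, not part of your schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zone_hits (session_id TEXT, zone TEXT, hits INTEGER, "
    "PRIMARY KEY (session_id, zone))"
)


def record_hit(conn, session_id, zone):
    """Increment the counter for (session, zone) instead of adding a row per hit."""
    with conn:
        cur = conn.execute(
            "UPDATE zone_hits SET hits = hits + 1 WHERE session_id = ? AND zone = ?",
            (session_id, zone))
        if cur.rowcount == 0:  # first hit for this session and zone
            conn.execute("INSERT INTO zone_hits VALUES (?, ?, 1)", (session_id, zone))


for session_id, zone in [("s1", "Category"), ("s1", "Category"),
                         ("s2", "Category"), ("s1", "Minisite")]:
    record_hit(conn, session_id, zone)

unique_sessions = conn.execute(
    "SELECT COUNT(*) FROM zone_hits WHERE zone = 'Category'").fetchone()[0]
total_hits = conn.execute(
    "SELECT SUM(hits) FROM zone_hits WHERE zone = 'Category'").fetchone()[0]
```

The reporting query now reads one row per session and zone rather than one row per hit, at the price of an extra update on each write.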
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered. That can lead to very significant time savings, even better than an index (an index covers every row, so it grows with your data, whereas a partition covers only one period, such as a day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
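To show why partitioning helps, here is a hand-rolled imitation in Python with SQLite: one table per month, and a query that only touches the months overlapping the requested range. Databases that support partitioning do this for you behind a single table name. The `stats_YYYY_MM` naming and the helper functions are made up for this sketch.

```python
import sqlite3
from datetime import date


def partition_name(day):
    return f"stats_{day.year}_{day.month:02d}"


def insert_hit(conn, day, zone):
    table = partition_name(day)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (day TEXT, zone TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (day.isoformat(), zone))


def months_between(start, end):
    """Yield the first day of every month from start's month to end's month."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def count_hits(conn, start, end):
    """Count hits in [start, end], scanning only the partitions that overlap it."""
    existing = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    total, scanned = 0, []
    for first_of_month in months_between(start, end):
        table = partition_name(first_of_month)
        if table in existing:
            scanned.append(table)
            total += conn.execute(
                f"SELECT COUNT(*) FROM {table} WHERE day BETWEEN ? AND ?",
                (start.isoformat(), end.isoformat())).fetchone()[0]
    return total, scanned


conn = sqlite3.connect(":memory:")
for day in [date(2023, 11, 30), date(2023, 12, 5), date(2023, 12, 20), date(2024, 1, 2)]:
    insert_hit(conn, day, "Website")

total, scanned = count_hits(conn, date(2023, 12, 1), date(2023, 12, 31))
```

Only the December table is read for a December query, no matter how many other months of data have accumulated.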
