I'm getting daily exports of Google Analytics data into BigQuery and these form the basis for our main reporting dataset.
Over time I need to add new columns for additional things we use to enrich the data - say, a mapping from URL to 'reporting category', for example.
This is easy to add as a new column onto the processed tables (there are about 10 processing steps at the moment for all the enrichment we do).
The issue is when stakeholders then ask: can we add that new column to the historical data?
Currently I then need to rerun all the daily jobs, which is very slow and costly.
This is coming up frequently enough that I'm seriously thinking about redesigning my data pipelines to account for the fact that I often need to essentially drop and recreate ALL the data from time to time when I add a new field or correct old dirty data.
I'm just wondering if there are better ways to:
Add a new column to an old table in BQ (I'd be happy to do this by hand in the cases where I can just join the new column on the GA [hit_key] I have defined, which is basically a row key).
(Less common) Update existing tables based on some WHERE condition.
Just wondering what the best practices are, whether anyone has had similar issues where you basically need to update a historical schema, and whether there are ways to do it without just dropping and recreating everything, which is essentially what I'm doing now.
To be clearer about my current approach: I'm taking the [ga_sessions_yyyymmdd] table and making a series of [ga_data_prepN_yyyymmdd] tables where I either add new columns at each step or reduce the data in some way. There are now 11 of these steps, and each time I'm taking all 100 or more columns along for the ride. This is what I'm going to try to design away from, as currently 90% of the columns at each stage don't even need to be touched and could just be joined back on at the end, perhaps based on hit_key.
It's a little bit messy to try and pick apart, though.
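For what it's worth, the redesign I'm leaning towards would look roughly like this (the table and column names below are just placeholders, not my real schema; 20170101 stands in for yyyymmdd): each prep step carries only hit_key plus the columns it actually produces, and one final query joins the wide GA columns back on.

    -- One enrichment step keeps only hit_key and the column it adds
    CREATE OR REPLACE TABLE `project.dataset.ga_enrichment_20170101` AS
    SELECT
      hit_key,
      reporting_category              -- the new enrichment column only
    FROM `project.dataset.ga_data_prep3_20170101`;

    -- A single final join stitches the untouched GA columns back on
    CREATE OR REPLACE TABLE `project.dataset.ga_reporting_20170101` AS
    SELECT
      base.*,                         -- the 100+ columns that were never touched
      enr.reporting_category
    FROM `project.dataset.ga_data_prep1_20170101` AS base
    LEFT JOIN `project.dataset.ga_enrichment_20170101` AS enr
      USING (hit_key);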
Adding new columns to the schema of the existing historical tables is possible, but the values for the newly added columns will be NULL. If you do need to populate values into these columns, the best approach is probably to use an UPDATE DML statement. More details on how to try it out are here: Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?
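As a minimal sketch of what that backfill could look like (the table and column names are made up, and ALTER TABLE ... ADD COLUMN assumes a recent enough BigQuery - on older versions you would add the column through a schema update instead):

    -- Add the column to one historical daily table (it will be NULL everywhere)
    ALTER TABLE `project.dataset.ga_data_prep1_20170101`
    ADD COLUMN reporting_category STRING;

    -- Backfill it from a small mapping table, joining on the row key
    UPDATE `project.dataset.ga_data_prep1_20170101` AS t
    SET reporting_category = m.reporting_category
    FROM `project.dataset.url_category_map` AS m
    WHERE t.hit_key = m.hit_key;

Run this per daily table (or script the loop), which is still far cheaper than re-running the whole enrichment pipeline.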
Could you help me understand this approach?
I have to run a query that performs some operations. I do not want to use containers, since I read that temp tables are faster, at least for my case, but I don't get how they work.
The web service I will use to insert into the temp table will be consumed by several people at the same time, and the values will be different for each user - that is exactly why I want to do this. But I don't understand how the temp table will manage the data for each user. It is only one table, so if one user calls the WS the table will contain some rows, but another user could call the WS at the same time and fill the table with other values. How does that work?
Are temp tables saved per user, or how does it work in my case?
Thanks in advance.
Both kinds of temp tables are scope-based: when the variable/buffer goes out of scope, the table is dropped. So each user or WS call gets its own table.
You can find specs here:
https://msdn.microsoft.com/en-us/library/gg845661.aspx
https://msdn.microsoft.com/en-us/library/bb314749.aspx
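As a small illustration (T-SQL, with made-up names): a local temporary table is private to the session that created it, so two users calling the web service on separate connections each work with their own copy.

    -- The leading # makes this a local temp table, visible only to this session
    CREATE TABLE #work (
        user_id INT,
        amount  DECIMAL(10, 2)
    );

    INSERT INTO #work (user_id, amount) VALUES (42, 19.99);

    SELECT user_id, SUM(amount) AS total
    FROM #work
    GROUP BY user_id;

    DROP TABLE #work;   -- optional: it is dropped automatically when the session ends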
I am struggling to define an effective revisioning process. We have data spread across multiple tables. We cannot delete or update; we need to create new revisions of the same data. I know the solution of a history table containing all revisions, etc., but that seems to work well only as long as you want to keep revisions of simple structures, such as a blogging platform.
What if you have a database with many complex structures, where even the simplest of them looks like the one below?
If you change something in TableA, you can keep the old data in a history table. What happens, though, if you change something in TableB, which defines what a record in TableA is? It almost forces you to create a copy of TableA (a new ID, in other words) and recreate its underlying structures (more new IDs). The whole process of creating a new ID each time a mistake is corrected or some peripheral data is added doesn't feel right.
Is there any good practice for such cases? I read somewhere about keeping the whole old data structure revisioned in XML, but that practice copes poorly with schema changes and is not easily queryable. Technologies such as Flashback don't cover the whole spectrum of our needs either.
Note: We're using Oracle 11.2.
This is a more in-depth follow-up to a question I asked yesterday about storing historical data (Storing data in a side table that may change in its main table), and I'm trying to narrow down my question.
If you have a table that represents a data object at the application level and you need that table for historical purposes, is it considered bad practice to set it up so that the information can't be deleted? Basically, I have a table representing safety requirements for a worker, and I want to make it so that these requirements can never be deleted or changed. If a change needs to be made, a new record is created instead.
Is this not a good idea? What are the best practices for dealing with data like this? I have a table with historical safety training data, and it points to the table with requirement data (as well as some other key tables), so I can't let the requirements be changed or the historical table will end up pointing to the wrong information.
Is this not a good idea?
Your scenario sounds perfectly valid to me. If you have historical data that you need to keep, there are various ways to meet that requirement.
Option 1:
Store all historical data and current data in one table (make sure you store a creation date so you know what's old and what's new). When you need to retrieve the most recent record for someone, just base it on the most recent date that exists in the table.
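For instance, the "latest record wins" lookup can be a simple window-function query (the table and column names here are invented for illustration):

    -- Pick the most recent requirement row per worker
    SELECT worker_id, requirement_text, created_at
    FROM (
        SELECT worker_id, requirement_text, created_at,
               ROW_NUMBER() OVER (PARTITION BY worker_id
                                  ORDER BY created_at DESC) AS rn
        FROM safety_requirements
    ) AS ranked
    WHERE rn = 1;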
Option 2:
Store all historical data in a separate table and keep current data in another. This might be beneficial if you're working with millions of records so you don't degrade performance of any applications built on top of it. Either at the time of creating a new record or through some nightly job you can move old data into the other table to keep your current table lightweight.
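A nightly job for that could be as simple as the sketch below (table names and the cutoff rule are placeholders; the exact date handling varies by DBMS):

    -- Copy rows older than the cutoff into the history table
    -- (assumes both tables share the same column layout)
    INSERT INTO safety_requirements_history
    SELECT *
    FROM safety_requirements
    WHERE created_at < '2023-01-01';

    -- ...then remove them from the current table
    DELETE FROM safety_requirements
    WHERE created_at < '2023-01-01';

Wrapping both statements in a transaction keeps the two tables consistent if the job fails halfway.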
Here is one alternative that is not necessarily "better", but is something to keep in mind...
You could have separate "active" and "historical" tables, then create a trigger so whenever a row in the active table is modified or deleted, the old row values are copied to the historical table, together with the timestamp.
This way, the application can work with the active table in a natural way, while the accurate history of changes is automatically generated in the historical table. And since this works at the DBMS level, you'll be more resistant to application bugs.
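As an illustration only (SQL Server syntax, with invented table and column names; Oracle or PostgreSQL would use a row-level trigger instead), such a trigger could look like this:

    -- Whenever a row in the active table is updated or deleted,
    -- copy its old values into the history table with a timestamp
    CREATE TRIGGER trg_requirements_history
    ON safety_requirements
    AFTER UPDATE, DELETE
    AS
    BEGIN
        INSERT INTO safety_requirements_history
            (requirement_id, worker_id, requirement_text, archived_at)
        SELECT requirement_id, worker_id, requirement_text, GETDATE()
        FROM deleted;   -- the "deleted" pseudo-table holds the pre-change values
    END;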
Of course, things can get much messier if you need to maintain a history of the whole graph of objects (i.e. several tables linked via FOREIGN KEYs). Probably the simplest option is to simply forgo referential integrity for historical tables and just keep it for active tables.
If that's not enough for your project's needs, you'll have to somehow represent a "snapshot" of the whole graph at the moment of change. One way to do it is to treat the connections as versioned objects too. Alternatively, you could just copy all the connections with each version of the endpoint object. Either case will complicate your logic significantly.
I have a Tariffs table for international dialing codes,
with StartDate and EndDate columns.
I'm using an ASP.NET application to import Excel offers into this table. Each offer contains about 10,000 rows, so it is a large table (about 3 million rows).
What is the faster approach in SQL Server 2008: a stored procedure or a trigger that, when a new row with a new rate is inserted, changes the previous EndDate for the same tariff (same prefix, same destination)?
And how can I undo saving an offer of 10,000 rows, getting the table back and restoring the updated records to their previous state?
Thank you,
The information in your question seems a bit jumbled, partly because of the ideas within it but also because of unhelpful grammar/whitespace (sorry to be so blunt, but these things matter), but I'll try my best to answer.
In general, assume that a trigger is slower than a stored proc. They also add a higher level of complexity than many other things, like procs, so always be sure you really need one before using one.
But I don't understand why you'd need a trigger if you're only inserting into one table. Triggers are usually used to implement a complex chain of logic. If it's a straight insert or update, then keep it simple and use a proc.
If it's just an insert, then the quickest way of all is a bulk insert.
Since you want to keep the previous state, my advice would be to create an archive/audit table (basically a duplicate, possibly with some extra fields like WhenInserted), move the existing rows into the archive on insert (i.e. insert them into the archive table and then delete them from the original), and then do a bulk insert for the new rows.
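A rough sketch combining that advice with the EndDate update you described (the table and column names are guesses from your question, OfferStaging is a hypothetical staging table you would fill via BULK INSERT or SqlBulkCopy from the Excel file, and here I keep the old rows in Tariffs and just close their EndDate rather than deleting them):

    BEGIN TRANSACTION;

    -- Archive the rows that the new offer supersedes
    INSERT INTO TariffsArchive (Prefix, Destination, Rate, StartDate, EndDate, WhenInserted)
    SELECT t.Prefix, t.Destination, t.Rate, t.StartDate, t.EndDate, GETDATE()
    FROM Tariffs AS t
    WHERE EXISTS (SELECT 1 FROM OfferStaging AS s
                  WHERE s.Prefix = t.Prefix AND s.Destination = t.Destination);

    -- Close the previous EndDate (assuming currently open rows have a NULL EndDate)
    UPDATE t
    SET EndDate = s.StartDate
    FROM Tariffs AS t
    JOIN OfferStaging AS s
      ON s.Prefix = t.Prefix AND s.Destination = t.Destination
    WHERE t.EndDate IS NULL;

    -- Bulk-load the new offer rows
    INSERT INTO Tariffs (Prefix, Destination, Rate, StartDate, EndDate)
    SELECT Prefix, Destination, Rate, StartDate, NULL
    FROM OfferStaging;

    COMMIT TRANSACTION;

Rolling back the transaction before the commit (or restoring from TariffsArchive afterwards) is what lets you undo a bad offer.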
But you use the word "change", so it's difficult to know what you really want. Hope that helps.
I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is to store an entry in a statistics table each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users, as I've stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever a user sees a product and the lead submission form.
The problem is that after a month my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought about writing a service that would somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data-parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of computing them "online" when the reporting page is loaded, you compute them offline, either via a nightly batch process or incrementally as each log record is written.
A simple enhancement would be to store counts per user/session instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor on the order of the number of hits per session. Of course, it would increase processing costs when inserting log entries.
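As an illustrative sketch (the table and column names are invented), a nightly roll-up of raw hits into a summary table per day and zone might look like this:

    -- Aggregate yesterday's raw hits into one row per day per zone
    INSERT INTO zone_hits_daily (visit_date, zone, unique_sessions, total_hits)
    SELECT CAST(hit_time AS DATE)       AS visit_date,
           zone,
           COUNT(DISTINCT session_id)   AS unique_sessions,
           COUNT(*)                     AS total_hits
    FROM statistics_hits
    WHERE hit_time >= '2023-01-01' AND hit_time < '2023-01-02'
    GROUP BY CAST(hit_time AS DATE), zone;

The reporting pages then read from zone_hits_daily, which stays small no matter how many raw hits accumulate.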
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are needed in the queries. Periodically export data from the transaction database to the reporting database. This will improve reporting response time, along with the aggregation ideas mentioned earlier.
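For example, the periodic export could be a scheduled job roughly like this (the names are invented, and the three-part naming assumes both databases live on the same SQL Server instance):

    -- Copy yesterday's rows into a denormalized reporting table
    INSERT INTO ReportingDB.dbo.hits_flat
        (hit_time, zone, session_id, product_id, category_name)
    SELECT h.hit_time, h.zone, h.session_id, h.product_id, c.category_name
    FROM TransactionalDB.dbo.statistics_hits AS h
    LEFT JOIN TransactionalDB.dbo.categories AS c
      ON c.category_id = h.category_id
    WHERE h.hit_time >= '2023-01-01' AND h.hit_time < '2023-01-02';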
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" - choosing the partition based on the range a value falls into. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month - it depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data outside that range will not even be considered; that can lead to very significant time savings, often better than an index (an index still grows with your data and has to be traversed, whereas partition pruning simply skips the sub-tables you don't need).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
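As a rough illustration in SQL Server syntax (the names are invented; other databases spell this differently, e.g. PARTITION BY RANGE in MySQL, PostgreSQL and Oracle):

    -- Map date ranges (one per month here) to partitions
    CREATE PARTITION FUNCTION pf_hits_by_month (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01');

    CREATE PARTITION SCHEME ps_hits_by_month
    AS PARTITION pf_hits_by_month ALL TO ([PRIMARY]);

    -- The table is partitioned on its timestamp column
    CREATE TABLE statistics_hits (
        hit_time   DATETIME NOT NULL,
        zone       VARCHAR(50),
        session_id INT
    ) ON ps_hits_by_month (hit_time);

    -- A query restricted to one month only touches that month's partition
    SELECT zone, COUNT(*) AS hits
    FROM statistics_hits
    WHERE hit_time >= '2023-02-01' AND hit_time < '2023-03-01'
    GROUP BY zone;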