Best Handle Intraday GA Data in BigQuery - google-analytics

I have configured a Google Analytics raw data export to BigQuery.
Could anyone from the community suggest efficient ways to query the intraday data? We noticed a problem with the intraday sync (e.g. with a 15-minute delay): the streamed data grows exponentially with the sync frequency.
For example:
Every day, the (T-1) batch data (ga_sessions_yyyymmdd) syncs 15-20GB with 3.5M-5M records.
On the other side, the intraday data streams (with 15 min delay) more than ~150GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
This makes it not cost-effective to persist and query the data.
Also, is this a product bug or expected behavior? With this kind of growth, the data cannot be used cost-effectively.
Querying BigQuery costs $5 per TB and streaming inserts cost ~$50 per TB.

In my view, it is not a bug; it is a consequence of how data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, every time a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).
I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.
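To make the re-export effect concrete, here is a minimal sketch of a simplified model, not Google's actual export logic: it assumes the whole session so far is written again after every batch of new hits. The function name and the `hits_per_export` parameter are made up for illustration.

```python
def exported_hits(hits_in_session: int, hits_per_export: int = 1) -> int:
    """Total hits written when the full session is re-exported after each batch.

    Simplified model (an assumption): every export writes all hits seen so far.
    """
    if hits_per_export <= 0:
        raise ValueError("hits_per_export must be positive")
    total = 0
    seen = 0
    while seen < hits_in_session:
        seen = min(seen + hits_per_export, hits_in_session)
        total += seen  # the whole session so far is written again
    return total


# A 10-hit session exported after every hit writes 1 + 2 + ... + 10 hits.
print(exported_hits(10))      # 55
# Waiting and exporting only once writes each hit a single time.
print(exported_hits(10, 10))  # 10
```

Under this model the stored volume grows roughly with the square of the session length, which is why exporting more often inflates the intraday tables so much compared with the daily batch.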
Some advice I can give you about querying the ga_realtime_sessions table:
Only query the columns you really need (no select *).
Use the view that is exported alongside the daily ga_realtime_sessions_yyyymmdd table. It doesn't affect the size of the query, but it will prevent you from using duplicated data.
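What the view buys you can be sketched in plain Python: when the same session has been exported several times, keep only its most recent export. This only illustrates the idea; the field names `session_key`, `export_time` and `hits` are hypothetical and are not the real export schema.

```python
from typing import Dict, Iterable, List


def latest_export_per_session(rows: Iterable[dict]) -> List[dict]:
    """Keep, for each session, only the row with the highest export time."""
    latest: Dict[str, dict] = {}
    for row in rows:
        key = row["session_key"]  # hypothetical session identifier
        if key not in latest or row["export_time"] > latest[key]["export_time"]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"session_key": "a", "export_time": 1, "hits": 1},
    {"session_key": "a", "export_time": 2, "hits": 3},  # re-export of session "a"
    {"session_key": "b", "export_time": 1, "hits": 2},
]
deduplicated = latest_export_per_session(rows)
print(len(deduplicated))  # 2
```

All three rows still have to be read to find the latest one per session, which matches the point above that the view does not make the query smaller.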

Related

Optimising Firestore costs for time series data?

I have been using Postgres to store time-series sensor data, but I am weighing the cost of using Firestore because I prefer the serverless nature of Firestore. My only concern is the cost of Firestore, because I am paying for every read. I want to be able to display this sensor information on my web app. Right now, I am taking data every 10 seconds and there are 400+ sensor points (400 columns per row in my Postgres table).
Currently, if a user queries for a week's worth of data, that's about 60,000 rows, but I optimise it by just taking every nth value to "feather" the data. So by taking every 20th row, for example, I have reduced the returned data to 3000 rows, which is manageable, and the chart still shows a clear trend.
I want to be able to do this in Firestore to save costs, because if a user queries for a week's data, I am paying for 60000 document reads, even though I can't display all those data points on the web app anyway. I have tried searching for ways to query Firestore to take the Nth row of data, but haven't found any concrete solutions.
Does anybody have any recommendation how I can optimise my Firestore costs for time series data or perhaps any other cheap serverless methods to manage this data?
Firestore doesn't offer any way to "feather" data from queries, as you say. What you could do instead is put an integer in each document that describes its "Nth" value, then query for only those "N" that you want.
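One possible reading of this suggestion, sketched in plain Python with an in-memory list standing in for the collection (no Firestore calls are made): at write time, store on each document the largest stride that divides its sequence number, then filter on that single field. The field names `seq` and `stride` and the set of strides are assumptions for illustration.

```python
STRIDES = (1, 5, 20, 100)  # assumed sampling rates; each one divides the next


def stride_level(seq: int) -> int:
    """Largest stride in STRIDES that evenly divides the sequence number."""
    return max(s for s in STRIDES if seq % s == 0)


def build_docs(count: int) -> list:
    """Simulate documents written in order, each tagged with its stride."""
    return [{"seq": i, "stride": stride_level(i)} for i in range(count)]


def every_nth(docs: list, n: int) -> list:
    """Select every n-th document using only a filter on the stored field."""
    if n not in STRIDES:
        raise ValueError("n must be one of the precomputed strides")
    return [d for d in docs if d["stride"] >= n]


week = build_docs(60000)
print(len(every_nth(week, 20)))  # 3000
```

Because each stride divides the next, a single range filter on `stride` returns exactly the documents whose sequence number is a multiple of the requested rate, so only those documents would be read.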

Firebase BigQuery Export Schema Size Difference

We have migrated all of our old Firebase BigQuery events tables to the new schema using the provided script. One thing we noticed was that the size of the daily tables increased dramatically.
For example, the data from 4/1/18 in the old schema was 3.5MM rows and 8.7 Gig. Once migrated, the new table from the same date is 32.3MM rows and 27 Gig. This is nearly 10 times larger in terms of number of rows and over 3X larger by space size.
Can someone tell me why the same data is so much larger in the new schema?
The result is that we are getting charged significantly more in BigQuery query costs when reading the tables from the new schema versus the old schema.
firebaser here
While increasing the size of the exported data definitely wasn't a goal, it is an expected side-effect of the new schema.
In the old storage format the events were stored in bundles. While I don't exactly know how the events are bundled, it was definitely always a bunch of events with their own unique and with shared properties. This meant that you frequently had to unnest the data in your query or cross join the tables with themselves, to get to the raw data, and then combine and group it again to fit your requirements.
In the new storage format, each event is stored separately. This definitely increases the storage size, since properties that were shared between events in a bundle, now are duplicated for each event. But the queries you write on the new format should be easier to read and can process the data faster, since they don't have to unnest it first.
So the larger storage size should come with a slightly faster processing speed. But I can totally imagine the sticker shock when you see the difference, and realize the improved speed doesn't always make up for that. I apologize if that is the case, and have been assured that we don't have any other big schema changes planned from here on.
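The size increase can be illustrated with a small sketch. The bundle layout below (a `shared` dict plus a list of `events`) is purely hypothetical, since the real bundle format is not described here; it only shows how flattening copies shared properties onto every event.

```python
def flatten_bundle(bundle: dict) -> list:
    """Turn one bundle into one row per event, copying the shared properties."""
    shared = bundle["shared"]
    return [{**shared, **event} for event in bundle["events"]]


bundle = {
    "shared": {"user_id": "u1", "platform": "ios"},  # stored once per bundle
    "events": [
        {"name": "session_start"},
        {"name": "screen_view"},
        {"name": "purchase"},
    ],
}
rows = flatten_bundle(bundle)
print(len(rows))  # 3
```

One bundle row becomes three event rows, and `user_id` and `platform` are now stored three times instead of once, which is the trade-off between storage size and simpler queries described above.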

Google analytics realtime data in BigQuery

We have enabled continuous export of Google Analytics data to BigQuery which means we get ga_realtime_sessions_YYYYMMDD tables with data dumps throughout the day.
These tables are – usually! – left in place, so we accumulate a stack of the realtime tables for the previous n dates (n does not seem to be configurable).
However, every once in a while, one of the tables disappears, so there will be gaps in the sequence of dates and we might not have a table for e.g. yesterday.
Is this behaviour documented somewhere?
It would be nice to know which guarantees we have, as we might rely on e.g. realtime data from yesterday while we wait for the “finished” ga_sessions_YYYYMMDD table to show up. The support document linked above does not mention this.
As stated in this help article, these internal ga_realtime_sessions_YYYYMMDD tables should not be used for queries and the ga_realtime_sessions_view_YYYYMMDD view should be used instead for your queries, in order to obtain the fresh data and to avoid unexpected results.
If you want to use data from a previous day while you wait for today's internal ga_realtime_sessions_YYYYMMDD tables to be created, you can copy the data obtained from querying the ga_realtime_sessions_view_YYYYMMDD view into a separate table at the end of each day.

Firebase database high delay after a long standby

I'm currently testing Firebase on a non-production Firebase app which I am the only one who works on.
When I try to query the database to retrieve data after there has not been any query during the last 24 hours, the query takes about 8 seconds. After that query is done, the next ones take a normal amount of time (about 100ms).
This is not about caching the queries, by "next queries" I mean new queries which are not the same.
To reproduce it:
Create a database node called users, users children are user data (first name, last name, age, gender, etc)
Add 500,000 users to this node
Get a user by its UID and measure the time. (It should take about 100ms)
Wait 24 hours (I don't know the exact time, but I'm sure about 24 hours)
Get any user by its UID and measure the time. (It should take about 8sec)
Get any user by its UID and measure the time. (It should take about 100ms)
I want to know if this is a known issue to Firebase realtime database or not?
I reached Firebase support, they were able to recreate the issue and faced a wait time of about 6 seconds. Here is their answer after the investigation:
It looks like this is intended behavior. The realtime database queries work by building the index in-memory, which takes time linear to the number of nodes at that location. Once the index is built things are very fast, but the initial build can take a bit to build, especially for large locations.
If you want the index to stay in memory on the database you should have a listener always listening for this query.
So basically the database takes a long time to process the query because of indexing the large database.
The problem can be solved by keeping a listener on the database or querying the database every few hours.
In production you are not very likely to face this problem, because the database is being accessed by users all the time. But if your database is not accessed all the time and you don't want users to experience that long wait, you should use the solution discussed above.
Firebase keeps recently used data in its internal cache. This cache is cleared after a few minutes.
But the exact numbers depend on how much data you're loading and how you're loading that data. Without seeing a specific setup that shows how to reproduce these numbers there really isn't much anyone can say.

How to handle large amounts of data for a web statistics module

I'm developing a statistics module for my website that will help me measure conversion rates, and other interesting data.
The mechanism I use is - to store a database entry in a statistics table - each time a user enters a specific zone in my DB (I avoid duplicate records with the help of cookies).
For example, I have the following zones:
Website - a general zone used to count unique users as I stopped trusting Google Analytics lately.
Category - self descriptive.
Minisite - self descriptive.
Product Image - whenever user sees a product and the lead submission form.
The problem is that after a month, my statistics table is packed with a lot of rows, and the ASP.NET pages I wrote to parse the data load really slowly.
I thought maybe writing a service that will somehow parse the data, but I can't see any way to do that without losing flexibility.
My questions:
How do large-scale data parsing applications like Google Analytics load the data so fast?
What is the best way for me to do it?
Maybe my DB design is wrong and I should store the data in only one table?
Thanks to anyone who helps,
Eytan.
The basic approach you're looking for is called aggregation.
You are interested in certain functions calculated over your data, and instead of calculating them "online" when the page displaying them loads, you calculate them offline, either via a nightly batch process or incrementally when the log record is written.
A simple enhancement would be to store counts per user/session, instead of storing every hit and counting them. That would reduce your analytic processing requirements by a factor in the order of the hits per session. Of course it would increase processing costs when inserting log entries.
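As a minimal sketch of the incremental variant, the class below bumps a counter per zone and day when a visit is logged, so a report sums a handful of counters instead of scanning every raw record. The class and method names are made up for illustration, and an in-memory `Counter` stands in for an aggregate table.

```python
from collections import Counter
from datetime import date


class ZoneCounters:
    """Pre-aggregated visit counts keyed by (zone, day)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, zone: str, day: date) -> None:
        """Called when a log entry is written: update the aggregate in place."""
        self.counts[(zone, day)] += 1

    def total(self, zone: str, start: date, end: date) -> int:
        """Report query: sum the daily counters instead of counting raw rows."""
        return sum(
            n for (z, d), n in self.counts.items()
            if z == zone and start <= d <= end
        )


stats = ZoneCounters()
stats.record("Category", date(2024, 1, 1))
stats.record("Category", date(2024, 1, 1))
stats.record("Minisite", date(2024, 1, 2))
print(stats.total("Category", date(2024, 1, 1), date(2024, 1, 31)))  # 2
```

The extra work on each insert is the increased processing cost mentioned above; in exchange, report time depends on the number of zone/day pairs rather than the number of hits.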
Another kind of aggregation is called online analytical processing, which only aggregates along some dimensions of your data and lets users aggregate the other dimensions in a browsing mode. This trades off performance, storage and flexibility.
It seems like you could do well by using two databases. One is for transactional data and it handles all of the INSERT statements. The other is for reporting and handles all of your query requests.
You can index the snot out of the reporting database, and/or denormalize the data so fewer joins are used in the queries. Periodically export data from the transaction database to the reporting database. This act will improve the reporting response time along with the aggregation ideas mentioned earlier.
Another trick to know is partitioning. Look up how that's done in the database of your choice - but basically the idea is that you tell your database to keep a table partitioned into several subtables, each with an identical definition, based on some value.
In your case, what is very useful is "range partitioning" -- choosing the partition based on the range into which a value falls. If you partition by date range, you can create separate sub-tables for each week (or each day, or each month -- depends on how you use your data and how much of it there is).
This means that if you specify a date range when you issue a query, the data that is outside that range will not even be considered; that can lead to very significant time savings, even better than an index (an index has to consider every row, so it will grow with your data; a partition is one per day).
This makes both online queries (ones issued when you hit your ASP page), and the aggregation queries you use to pre-calculate necessary statistics, much faster.
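The pruning effect can be sketched in Python with a dict of per-day lists standing in for the sub-tables. This is a toy model of the idea, not how any particular database implements partitioning, and the class name is made up for illustration.

```python
from datetime import date, timedelta


class DayPartitionedTable:
    """Toy table that keeps one list of rows (a "sub-table") per day."""

    def __init__(self) -> None:
        self.partitions: dict = {}

    def insert(self, day: date, row: dict) -> None:
        self.partitions.setdefault(day, []).append(row)

    def query(self, start: date, end: date) -> tuple:
        """Return (rows, rows_scanned), touching only partitions in range."""
        rows = []
        day = start
        while day <= end:
            rows.extend(self.partitions.get(day, []))
            day += timedelta(days=1)
        return rows, len(rows)


table = DayPartitionedTable()
for offset in range(30):  # 30 days with 100 rows each
    for i in range(100):
        table.insert(date(2024, 1, 1) + timedelta(days=offset), {"hit": i})

rows, scanned = table.query(date(2024, 1, 10), date(2024, 1, 12))
print(scanned)  # 300
```

Only 300 of the 3000 stored rows are looked at for a three-day query, and that number stays the same no matter how many other days accumulate in the table.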

Resources