Saving data in BigQuery simultaneously from a source other than Firebase

We are trying to log some events from our application. There are two types of events:
Client-side events: logged by an Android application using the Firebase SDK, which Firebase saves to BigQuery
Server-side events: logged using BigQuery's Go client
Now, Firebase stores the current day's events in the events_intraday_$date table and then flushes that table into the partitioned events_$date table.
So I also logged the current day's events into the events_intraday_$date table.
The events were logged successfully, but they were deleted the next day when the events_intraday_$date table was flushed into the events_$date table.
I'm not able to understand how that is happening.

Looks like this is intended behaviour:
Within each dataset, a table is imported for each day of export. Daily tables have the format "ga_sessions_YYYYMMDD".
Intraday data is imported approximately three times a day. Intraday tables have the format "ga_sessions_intraday_YYYYMMDD". During the same day, each import of intraday data overwrites the previous import in the same table.
When the daily import is complete, the intraday table from the previous day is deleted. For the current day, until the first intraday import, there is no intraday table. If an intraday-table write fails, then the previous day's intraday table is preserved.
Data for the current day is not final until the daily import is complete. You may notice differences between intraday and daily data based on active user sessions that cross the time boundary of last intraday import.
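Given that behaviour, the practical fix is to write server-side events into a table you own rather than into the Firebase-managed events_intraday_$date table, which is deleted after the daily import. Here is a minimal sketch using BigQuery's Go client; the project ID, dataset name, server_events table, and row schema are all assumptions for illustration:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
)

// ServerEvent is a hypothetical row schema for server-side events.
// The struct tags map fields to BigQuery column names.
type ServerEvent struct {
	EventName      string    `bigquery:"event_name"`
	EventTimestamp time.Time `bigquery:"event_timestamp"`
	UserID         string    `bigquery:"user_id"`
}

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Insert into a table we own instead of events_intraday_$date, so the
	// rows survive Firebase's daily intraday-to-daily flush.
	inserter := client.Dataset("analytics_123456789").Table("server_events").Inserter()
	rows := []*ServerEvent{
		{EventName: "purchase_verified", EventTimestamp: time.Now().UTC(), UserID: "user-42"},
	}
	if err := inserter.Put(ctx, rows); err != nil {
		log.Fatalf("inserter.Put: %v", err)
	}
}
```

The server-side rows can then be combined with the Firebase export tables at query time (for example via a view or a UNION ALL), without ever being touched by the daily flush.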

Related

Firebase Export to BigQuery: event_timestamp changes when going from intraday to full day table

I have a mobile application that is connected to a BigQuery warehouse through the Firebase export. To keep my dashboards up to date, I run incremental (dbt) jobs several times a day to extract data from the tables BigQuery creates for the imported Firebase data (see this article).
For real-time data streaming, a table with the suffix "_intraday" is created. Once the day is over, the data is moved to the table that contains only full days, and the intraday table is deleted.
It looks like when this move happens (intraday to full day), the event_timestamp (UNIX) changes slightly (by a few milliseconds) for each entry. The problem: I defined the combination of user_id and event_timestamp as my unique key. Because of this issue, the first job that touches the moved table identifies every row as a new, unique row, exactly doubling my resulting data.
Has anyone seen this issue before and knows whether it's expected? Do you know any solution other than implementing an event ID on the client, giving each event a unique identifier (through custom event params) and using that instead of user_id + timestamp?
Thank you.
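One direction worth exploring before adding a client-side event ID: build the unique key from fields that may survive the intraday-to-daily move instead of event_timestamp. Whether user_pseudo_id + event_name + event_bundle_sequence_id is actually stable and unique across the move is an assumption that needs verifying against your own export; the sketch below simply counts collisions for such a candidate key (the project and dataset names are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Candidate surrogate key that avoids event_timestamp; events_2* matches
	// the daily tables only (it excludes events_intraday_*). Note: two events
	// with the same name in one upload bundle would collide; refine if needed.
	q := client.Query(
		"SELECT FARM_FINGERPRINT(CONCAT(user_pseudo_id, '|', event_name, '|', " +
			"CAST(event_bundle_sequence_id AS STRING))) AS surrogate_key, " +
			"COUNT(*) AS dupes " +
			"FROM `my-project.analytics_123456789.events_2*` " +
			"GROUP BY surrogate_key HAVING dupes > 1 LIMIT 10")

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatalf("q.Read: %v", err)
	}
	for {
		var row struct {
			SurrogateKey int64 `bigquery:"surrogate_key"`
			Dupes        int64 `bigquery:"dupes"`
		}
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatalf("it.Next: %v", err)
		}
		fmt.Printf("key %d occurs %d times\n", row.SurrogateKey, row.Dupes)
	}
}
```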

BigQuery - Large Amount of Rows in Newest Event Table

I recently linked my Firebase project's Analytics with BigQuery using the free-tier sandbox, and I'm now nearing the 10GB storage ceiling.
I noticed that the majority of the exported data went to the earliest event table (that table was 6.5GB while the others were ~50-100MB), so yesterday I deleted the earliest event table to get rid of all of those old rows I didn't want.
However, checking today, I noticed that the newest event table is roughly the same size as the one I deleted.
My questions are:
Is the latest table so big because those older rows are repopulating it?
Is it possible to delete that large chunk of data from storage without a similarly sized amount flowing into the next event table that's created?
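Before deleting anything else, it can help to confirm exactly which tables hold the bytes. A small sketch that reports per-table storage via the dataset's __TABLES__ metadata (the project and dataset names are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// __TABLES__ exposes size_bytes per table; list the largest first.
	q := client.Query(
		"SELECT table_id, ROUND(size_bytes / POW(10, 9), 2) AS size_gb " +
			"FROM `my-project.analytics_123456789.__TABLES__` " +
			"ORDER BY size_bytes DESC")

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatalf("q.Read: %v", err)
	}
	for {
		var row struct {
			TableID string  `bigquery:"table_id"`
			SizeGB  float64 `bigquery:"size_gb"`
		}
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatalf("it.Next: %v", err)
		}
		fmt.Printf("%s: %.2f GB\n", row.TableID, row.SizeGB)
	}
}
```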

Why is the intraday table sometimes missing from the BigQuery dataset?

My team has linked our Firebase and BigQuery projects and set up the intraday table. However, the table is created unpredictably. I was able to use it yesterday (events_intraday_20200701), but it is already noon as of writing this and there is still no intraday table (events_intraday_20200702) in the dataset (the regular event tables are there, as usual). In the streaming area of the Firebase console I can see hundreds of events being generated, but I cannot query an intraday table to see them in real time.
I have also struggled to find resources clarifying when the table is created, beyond "raw event data is streamed into a separate intraday BigQuery table in real-time" from https://support.google.com/firebase/answer/6318765?hl=en. Are there reasons why the table might not be created, or more details about what time of day I can expect it to exist?
On a related note, is it true that Web events are not supported for the intraday table?
Thanks!
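Since the intraday table's creation time isn't guaranteed, one defensive approach is to probe for it and fall back to the most recent daily table. A sketch, where the dataset name is a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	ds := client.Dataset("analytics_123456789")
	today := time.Now().UTC().Format("20060102")
	yesterday := time.Now().UTC().AddDate(0, 0, -1).Format("20060102")

	// Probe for today's intraday table; the metadata fetch fails with a
	// 404 if the table hasn't been created yet.
	target := ds.Table("events_intraday_" + today)
	if _, err := target.Metadata(ctx); err != nil {
		log.Printf("no intraday table yet (%v); falling back to daily table", err)
		target = ds.Table("events_" + yesterday)
	}
	fmt.Println("querying table:", target.TableID)
}
```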

Best Handle Intraday GA Data in BigQuery

I have configured the Google Analytics raw data export to BigQuery.
Could anyone from the community suggest efficient ways to query the intraday data? We have noticed a problem with the intraday sync (e.g. the 15-minute delay): the volume of streamed data grows dramatically with the sync frequency.
For example:
Every day, the (T-1) batch data (ga_sessions_yyyymmdd) syncs 15-20GB with 3.5M-5M records.
On the other side, the intraday stream (with the 15-minute delay) writes more than ~150GB per day with ~30M records.
https://issuetracker.google.com/issues/117064598
It's not cost-effective to persist and query this data.
Is this a product bug or expected behavior, given that the exponentially growing data is not cost-effectively usable?
Querying BigQuery costs $5 per TB, and streaming inserts cost ~$50 per TB.
In my view, it is not a bug; it is a consequence of how data is structured in Google Analytics.
Each row is a session, and inside each session you have a number of hits. Since we can't afford to wait until a session has completely finished, every time a new hit (or group of hits) occurs, the whole session has to be exported to BQ again. Updating the row is not an option in a streaming system (at least in BigQuery).
I have built some streaming pipelines in Google Dataflow with session windows (not sure if that is what Google uses internally), and I faced the same dilemma: wait and export each aggregate only once, or export continuously and accept the exponential growth.
Two pieces of advice I can give you about querying the ga_realtime_sessions table:
Only query the columns you really need (no SELECT *);
Use the view that is exported alongside the daily ga_realtime_sessions_yyyymmdd table; it doesn't reduce the size of the query, but it prevents you from reading duplicated data. (See the sketch below.)
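As a concrete illustration of both points, here is a sketch that selects only two columns from the deduplicating view (GA exports it alongside the realtime table, typically named ga_realtime_sessions_view_YYYYMMDD; the project and dataset IDs are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Select only the columns needed (no SELECT *), and read from the view
	// rather than the underlying table to avoid duplicated session rows.
	q := client.Query(
		"SELECT fullVisitorId, visitId " +
			"FROM `my-project.1234567.ga_realtime_sessions_view_20200702` " +
			"LIMIT 100")
	// Note: the query dialect must match the view's; the Go client defaults
	// to standard SQL. If the exported view is defined in legacy SQL, set
	// q.UseLegacySQL = true and use the [project:dataset.view] syntax instead.

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatalf("q.Read: %v", err)
	}
	for {
		var row struct {
			FullVisitorID string             `bigquery:"fullVisitorId"`
			VisitID       bigquery.NullInt64 `bigquery:"visitId"`
		}
		if err := it.Next(&row); err == iterator.Done {
			break
		} else if err != nil {
			log.Fatalf("it.Next: %v", err)
		}
		fmt.Println(row.FullVisitorID, row.VisitID)
	}
}
```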

Can I add a field to the app_events_intraday table in BigQuery?

I am currently extracting my Firebase event data from BigQuery to an on-site database for analysis. Each time the ETL job runs, I extract the Firebase intraday table(s) along with the previous 4 days (since previous days' tables continue to be updated). Since there is no key or unique ID for events, I delete and re-insert the past 4 days of data locally in order to refresh the data from BigQuery.
Would it be possible for me to create a new field called event_dim.etl_status on the intraday table to keep track of events that have already been moved locally? And if so, would this field make its way into the app_events_yyyymmdd table once it is renamed from *_intraday to *_yyyymmdd?
Edit:
Some more context based on comments from dsesto:
A magical Firebase-BigQuery wizard automatically copies/renames the "intraday" event table into a daily table, so I have no way to reproduce or test this; it is part of the Firebase → BigQuery black box.
Since I only have a production environment (Firebase has no mechanism for a sandbox environment), testing this theory would risk breaking my production environment, which is why I posed an "is it possible" scenario in case someone else has done something similar.
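For what it's worth, a common alternative is to leave the Firebase-managed table untouched and keep the ETL bookkeeping in a table you own, since the export controls the managed schema and replaces the intraday table daily. A sketch, where the etl_metadata dataset and etl_status table are hypothetical:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
)

// EtlStatus is a hypothetical bookkeeping row: which daily/intraday table
// suffix was extracted, and when.
type EtlStatus struct {
	TableSuffix string    `bigquery:"table_suffix"` // e.g. "20200702"
	ExtractedAt time.Time `bigquery:"extracted_at"`
}

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "my-project") // assumed project ID
	if err != nil {
		log.Fatalf("bigquery.NewClient: %v", err)
	}
	defer client.Close()

	// Record ETL progress in a dataset we own, leaving the Firebase-managed
	// export tables untouched.
	ins := client.Dataset("etl_metadata").Table("etl_status").Inserter()
	row := &EtlStatus{
		TableSuffix: time.Now().UTC().Format("20060102"),
		ExtractedAt: time.Now().UTC(),
	}
	if err := ins.Put(ctx, []*EtlStatus{row}); err != nil {
		log.Fatalf("ins.Put: %v", err)
	}
}
```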
