Google BigQuery export with single wildcard URI - google-analytics

I am currently trying to export the daily Google Analytics data (automatically linked from GA with daily cadence) in BigQuery to buckets created in Google Cloud Storage.
When I export the daily GA table in BigQuery to GCS with a single wildcard URI, it automatically splits the table into multiple sharded files (around 1 GB per file), which land in the designated buckets in GCS. But when I copy the daily GA table to a manually created table in BigQuery, exporting that manually created table to GCS results in more sharded files (around 300 MB per file), even though the two tables have the same size and row counts.
I am trying to figure out why exporting this manually created table causes BigQuery to shard it this way. Essentially it triples the number of files sitting in GCS. Ideally I would like to limit the number of sharded files in GCS so I do not need to process so many of them.
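For reference, this is roughly the kind of export I am running; the project, dataset, table, and bucket names below are placeholders for my own:

    bq extract \
      --destination_format=NEWLINE_DELIMITED_JSON \
      'my-project:analytics_123456789.ga_sessions_20240101' \
      'gs://my-bucket/ga_sessions_20240101_*.json'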

Related

Crashlytics BigQuery integration table expiration

Is it possible to set the default table expiration that is used when enabling the Crashlytics Big Query integration in Firebase?
We are trying to reduce our monthly Firebase costs (Blaze plan) that are due to the amount of data exported automatically and now sitting in our BigQuery tables. These are the costs that appear in our Firebase billing reports as "non Firebase services".
To reduce the costs we would like to allow the data to expire automatically and adjust the "time to expire" shown below for all ongoing data exported from Firebase to BigQuery.
Is this possible from within the Firebase console itself? Or can this only be done in BigQuery using the CLI? This page doesn’t seem to give any indication that this is possible from the Firebase Console itself: https://firebase.google.com/docs/crashlytics/bigquery-export
But we can see from the BigQuery docs that Table Expiration appears to be what we need to set. Our question is essentially how to do this so it applies to all existing and future tables streamed from Firebase Crashlytics (but also Events and Performance data).
Thanks for any advice!
You can limit data in BigQuery by setting the retention time right in the BigQuery console to whatever length of time you prefer:
Set default table expiration times here
Update a particular table's expiration time here
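If you would rather do it from the CLI, a minimal sketch with the bq tool (project, dataset, and table names are placeholders; expiration values are in seconds):

    # Default expiration applied to NEW tables created in the dataset (30 days here).
    bq update --default_table_expiration 2592000 my-project:firebase_crashlytics

    # Expiration for one EXISTING table; repeat per table, it does not affect future tables.
    bq update --expiration 2592000 my-project:firebase_crashlytics.com_example_app_ANDROID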
The size of the exported data depends heavily on product usage. Moreover, especially for Crashlytics, the size of the stack traces in the data is completely unpredictable.
To get an idea of the cost, you can check the following links:
Schema of the exported table
Columns present regardless of the stack trace
BigQuery free operations
Additionally, the documentation on exporting data to BigQuery gives a clearer picture of what is exported.
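As a rough way to see how much the export is actually storing, you can also check the table sizes directly; a sketch, with a placeholder project name and the Crashlytics export dataset assumed:

    bq query --use_legacy_sql=false '
    SELECT table_id,
           ROUND(size_bytes / POW(10, 9), 2) AS size_gb,
           row_count
    FROM `my-project.firebase_crashlytics.__TABLES__`
    ORDER BY size_bytes DESC'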

Exporting Firebase Crashlytics Data to BigQuery Partially

Is it possible to filter the data to be exported to BigQuery? For example, I only want fatal crashes (is_fatal=TRUE) to be exported, but not non-fatal exceptions, which take up much more space in my case.
I checked out data transfer options but could not find anything related to filtering or schema customization.
The only configuration options for exporting Crashlytics data to BigQuery are to:
Turn it on or off
Enable streaming of intra-day events (if your project is on the Blaze plan)
It's not possible to control what crash data is exported beyond that.
If you want less data to be stored in BigQuery, you'll have to copy the data you want to keep over to new tables, and delete the ones generated by the integration.
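A minimal sketch of that approach with the bq CLI, assuming you only want to keep fatal crashes; the project, dataset, and table names are placeholders for your own Crashlytics export:

    # Write only the fatal crashes into a new, smaller table.
    bq query --use_legacy_sql=false \
      --destination_table='my-project:firebase_crashlytics.fatal_only' \
      --replace \
      'SELECT * FROM `my-project.firebase_crashlytics.com_example_app_ANDROID` WHERE is_fatal = TRUE'

    # Then delete the table generated by the integration to free the storage.
    bq rm -f -t 'my-project:firebase_crashlytics.com_example_app_ANDROID'

Note that the integration keeps exporting on its own schedule, so something like this would need to run regularly.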

How to extract server timestamp from Firebase BigQuery Export?

I am trying to extract near real-time analytics data from Firebase Analytics. The data is exported to BigQuery in real time to the intraday table. In order to sync the real-time data to my database, I need to know when each record was logged to the table, so that I avoid reading the same records multiple times.
However, according to the Firebase document below, the event_timestamp field is the timestamp when the event was logged on the client device, and the event_server_timestamp_offset field is the offset between collection time and upload time.
https://support.google.com/firebase/answer/7029846?hl=en
So, I assumed the server-logged time can be found by event_timestamp + event_server_timestamp_offset.
But I've found that event_server_timestamp_offset has both negative and positive values.
Does anyone know what event_server_timestamp_offset is for? Uploading data before collecting it should be impossible.
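For what it's worth, this is the arithmetic I am assuming, as a query against the intraday table (both fields are in microseconds; the project, dataset, and table names are placeholders):

    bq query --use_legacy_sql=false '
    SELECT
      event_name,
      TIMESTAMP_MICROS(event_timestamp) AS client_time,
      TIMESTAMP_MICROS(event_timestamp + IFNULL(event_server_timestamp_offset, 0)) AS assumed_server_time
    FROM `my-project.analytics_123456789.events_intraday_20240101`
    LIMIT 10'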

Refresh Firebase data to BigQuery to display in Data Studio

I am researching a way to regularly sync Firebase data to BigQuery and then display that data in Data Studio. I saw this instruction in the documentation:
https://support.google.com/firebase/answer/6318765?hl=en
According to the instructions above, once Firebase is linked to BigQuery, the data from Firebase is streamed to BigQuery in real time.
Let's say I have an initial export of Firebase data to BigQuery (before linking) and I made a Data Studio visualization out of that initial data; call it Dataset A. Then I link Firebase to BigQuery. I want Dataset A to be in sync with Firebase every 3 hours.
Based on the documentation, does this mean I don't have to use an external program to synchronize Firebase data to BigQuery every 3 hours, since it is already streaming in real time? After linking, does the streamed data from Firebase automatically go to Dataset A?
I am asking because I don't want to break the visualization if the streaming behaves differently than expected (expected meaning that Firebase streams to BigQuery's Dataset A consistently with the original schema). If it does break the original dataset, or does not stream into the original dataset, I might as well write a program that does the syncing.
Once you link your Firebase project to BigQuery, Firebase will continuously export the data to BigQuery, until you unlink the project. As the documentation says, the data is exported to daily tables, and a single fixed intraday table. There is no way for you to control the schedule of the data export beyond enabling/disabling it.
If you're talking about Analytics data, schema changes to the exported data are very rare. So far there's been a schema change once, and there are currently no plans to make any more schema changes. If a schema change ever were to happen again though, all collaborators on the project will be emailed well in advance of the change.
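If the goal is a single source in Data Studio that always reflects both the daily tables and the fixed intraday table, one common pattern is to point the report at a wildcard query (or a view built on one); a sketch with placeholder project and dataset names:

    bq query --use_legacy_sql=false '
    SELECT event_name, COUNT(*) AS events
    FROM `my-project.analytics_123456789.events_*`
    GROUP BY event_name
    ORDER BY events DESC'

The events_* wildcard matches the daily events_YYYYMMDD tables as well as the events_intraday_ table, so newly streamed data shows up without any extra syncing.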

Getting data from BigQuery back into Google Analytics

I have data in BigQuery from my Google Analytics account, along with some extra tables where I have transformed some of this data.
I would like to export some of my transformed data from BigQuery and import it into Google Analytics as a custom dimension.
I have done this manually, by downloading a CSV from my table in BigQuery and importing this using the GA admin UI. I would like to automate the process, but not sure where to start.
What would be the most efficient tool to automate this process? The process being:
Run a SQL query on my BQ data every day and overwrite a table.
Export this table as a file and upload it to a GA account as a query time import.
Not sure why you'd want to do this, but one (rather clunky) solution that pops into my head is to spin up a small GCE instance and use the gcloud tooling and some simple bash (a rough sketch follows the list) to:
Run a BigQuery query job (SQL) to truncate your table
Monitor the progress of that query job, i.e. wait
When it's finished, trigger an export job and dump the table to GCS
Monitor the progress of that BigQuery export job i.e. wait
When it's finished, download the file from GCS
Upload the file to GA using the management API (https://developers.google.com/analytics/devguides/config/mgmt/v3/mgmtReference/management/uploads/uploadData)
Schedule a cron job to run the above bash script daily
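A rough sketch of such a script, assuming the bq and gsutil CLIs are installed and authenticated; the query, table, bucket, and the GA account/property/data-source IDs are all placeholders you would replace with your own:

    #!/usr/bin/env bash
    set -euo pipefail

    PROJECT="my-project"                          # placeholder project
    TABLE="${PROJECT}:ga_work.custom_dimensions"  # placeholder destination table
    BUCKET="gs://my-export-bucket"                # placeholder bucket
    LOCAL_CSV="/tmp/custom_dimensions.csv"

    # Steps 1-2: run the query and overwrite the destination table.
    # bq runs the job synchronously, so the "wait" is handled for us.
    bq query --use_legacy_sql=false \
      --destination_table="${TABLE}" --replace \
      'SELECT * FROM `my-project.ga_work.transformed_dimensions`'

    # Steps 3-4: export the table to GCS as CSV; again bq waits for the job.
    bq extract --destination_format=CSV "${TABLE}" "${BUCKET}/custom_dimensions.csv"

    # Step 5: download the exported file from GCS.
    gsutil cp "${BUCKET}/custom_dimensions.csv" "${LOCAL_CSV}"

    # Step 6: upload to GA Data Import via the Management API uploadData method.
    # ACCOUNT_ID, PROPERTY_ID and DATASOURCE_ID are placeholders, and the access
    # token needs the analytics.edit scope.
    curl -X POST \
      -H "Authorization: Bearer $(gcloud auth print-access-token)" \
      -H "Content-Type: application/octet-stream" \
      --data-binary "@${LOCAL_CSV}" \
      "https://www.googleapis.com/upload/analytics/v3/management/accounts/ACCOUNT_ID/webproperties/PROPERTY_ID/customDataSources/DATASOURCE_ID/uploads?uploadType=media"

    # Step 7: schedule it, e.g. a crontab entry like:
    # 0 3 * * * /path/to/this_script.sh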
A nicer way would be to use Cloud Functions listening on the GCS bucket, but in my opinion CFs are not designed for long-running batch/data workloads. They have time limits (540s), for example. Also, if GA supported loading directly from GCS it would be much better, but I wasn't able to find support for that.
