I have scanned a BigQuery table from the Google DLP Console. The scan results are saved back into a BigQuery table. DLP has identified sensitive information, but the row_index ("location.content_locations.record_location.table_location.row_index") is shown as null. Can anyone help me understand why?
We no longer populate row_index for BigQuery, as it's not meaningful: BigQuery tables are unordered. If you want to identify the row a finding came from, I suggest using identifyingFields, which lives in BigQueryOptions, when you create your job.
https://cloud.google.com/dlp/docs/creating-job-triggers#job-identifying-fields
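For reference, a minimal sketch of how that might look in the dlpJobs.create request body (the project, dataset, table, and column names below are placeholders, and the inspectConfig is trimmed to a single infoType):

{
  "inspectJob": {
    "storageConfig": {
      "bigQueryOptions": {
        "tableReference": {
          "projectId": "my-project",
          "datasetId": "my_dataset",
          "tableId": "my_table"
        },
        "identifyingFields": [
          {"name": "row_id"}
        ]
      }
    },
    "inspectConfig": {
      "infoTypes": [{"name": "EMAIL_ADDRESS"}]
    }
  }
}

The values of the identifying fields should then accompany each finding, so you can join findings back to the source rows.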
I have a mobile application that is connected to a BigQuery warehouse through the Firebase export. To keep my dashboards up to date, I run incremental jobs (dbt) several times a day to extract data from the auto-created tables that contain the imported Firebase data (see this article).
For real-time data streaming, a table with the suffix "_intraday" is created. Once that day is over, the data is moved to the table that contains only full days, and the intraday table is deleted.
It looks like when this happens (moving from intraday to full day), the event_timestamp (UNIX) is slightly changed (by a few milliseconds) for each entry. The problem: I defined the combination of user_id and event_timestamp as the unique key. Because of the shifted timestamps, the first job that processes the moved table identifies every row as new and unique, exactly doubling my resulting data.
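For illustration, my incremental model is set up roughly like this (a simplified sketch; the source, model, and column names are placeholders):

-- dbt incremental model keyed on user_id + event_timestamp (simplified sketch)
{{ config(
    materialized='incremental',
    unique_key=['user_id', 'event_timestamp']
) }}

SELECT
  user_id,
  event_timestamp,
  event_name
FROM {{ source('firebase', 'events') }}
{% if is_incremental() %}
  -- only pick up events newer than what the target table already holds
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}

When the timestamps shift during the intraday-to-daily move, none of the moved rows match an existing key, so they are all inserted again.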
Has anyone seen this issue before and knows whether it's expected? Do you know any solution other than implementing an event ID on the client, giving each event a unique identifier (through custom event params), and using that instead of user_id + timestamp?
Thank you.
Via Google BigQuery, I'd like to visualize some data in Google Data Studio. Because the dataset is quite large and I'd like to maximize the efficiency of data processing, I first nested the data (on both hit and product level) with the following query (strongly simplified for illustration purposes), which takes as input a Google Analytics table (as imported by default from Google Analytics into BigQuery):
#standardSQL
SELECT
  visitorId, visitNumber, visitId, visitStartTime, date,
  ARRAY(
    SELECT AS STRUCT
      hits.hitNumber, hits.time, hits.hour,
      ARRAY(
        SELECT AS STRUCT
          product.productSKU, product.v2ProductName, product.productVariant
        FROM
          hits.product) AS productInfo
    FROM
      t.hits
    ORDER BY
      hits.hitNumber) AS hitInfo
FROM
  `[projectID].[DatasetID].ga_sessions_*` AS t
WHERE
  _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Because I noticed that Google Data Studio has issues dealing with nested data (incorrect aggregations), a solution I read about elsewhere is to first flatten (i.e. unnest) the data in a materialized view and connect the flattened data from this materialized view to Google Data Studio.
(Note: I could also have chosen to directly unnest the data in the above query and connect that to Google Data Studio, but I'd like to go for the materialized-view solution because of the gain in data-processing efficiency.)
Now, my question is: does anyone know how to convert the data to an unnested format in the materialized-view query in this specific case? According to the documentation, UNNEST() is not supported in materialized views, so I'm unsure how to do this.
Thanks in advance!
According to the Limitations chapter of the BigQuery materialized views documentation:
A materialized view is limited to referencing only a single table, and is not able to use joins or UNNEST functionality.
That said, and following the discussion comments posted by @Timo Rietveld and @Martin Weitzmann, I assume that if you plan to visualize data in Google Data Studio and perform some aggregation functions, it will probably be necessary to flatten the data into a simple format using the UNNEST operator to achieve the best result.
For instance, event tables imported from Google Analytics carry multiple key-value parameters represented as an array within each event record. You can flatten each array element into its own row, simplifying the data structure as well.
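As a minimal sketch (assuming a Google Analytics events export; the table name is a placeholder), such a flattening query could look like this:

#standardSQL
-- sketch: flatten the repeated event_params array into one row per parameter
SELECT
  event_name,
  param.key AS param_key,
  param.value.string_value AS param_value
FROM
  `my-project.my_dataset.events_*` AS t,
  UNNEST(t.event_params) AS param

Each key-value parameter becomes its own row, which Data Studio can aggregate without the pitfalls of nested fields.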
Any further comments or remarks are highly appreciated; feel free to extend this wiki answer to help other contributors in their research.
I have linked a Firebase project with BigQuery. How do I calculate session length and the number of sessions from the data? Is there any column or event_name similar to '.session_start / .session_stop' that stores the session details, like in AWS Pinpoint?
Take a look at the following response; it seems it will be useful for your question. Additionally, the BigQuery export schema and the following query samples will be useful as a starting point for writing queries over the exported Firebase Analytics data.
(I don't have enough reputation to add this as a comment.)
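As one more starting point, here is a sketch based on the automatically collected session_start event (the table name and date range are placeholders), approximating the number of sessions per user:

#standardSQL
-- sketch: count automatically collected session_start events per user
SELECT
  user_pseudo_id,
  COUNTIF(event_name = 'session_start') AS session_count
FROM
  `my-project.analytics_123456789.events_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20190101' AND '20190107'
GROUP BY
  user_pseudo_id

Session length can be derived in a similar spirit from the minimum and maximum event_timestamp of the events belonging to the same session.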
We have enabled continuous export of Google Analytics data to BigQuery, which means we get ga_realtime_sessions_YYYYMMDD tables with data dumps throughout the day.
These tables are – usually! – left in place, so we accumulate a stack of the realtime tables for the previous n dates (n does not seem to be configurable).
However, every once in a while, one of the tables disappears, so there will be gaps in the sequence of dates and we might not have a table for e.g. yesterday.
Is this behaviour documented somewhere?
It would be nice to know what guarantees we have, as we might rely on e.g. realtime data from yesterday while we wait for the “finished” ga_sessions_YYYYMMDD table to show up. The support document linked above does not mention this.
As stated in this help article, these internal ga_realtime_sessions_YYYYMMDD tables should not be used for queries; instead, the ga_realtime_sessions_view_YYYYMMDD view should be queried in order to obtain fresh data and avoid unexpected results.
If you want to use data from some days ago while you wait for the internal ga_realtime_sessions_YYYYMMDD tables to be created for today, you can copy the data obtained from querying the ga_realtime_sessions_view_YYYYMMDD view into a separate table at the end of each day for this purpose.
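For example, such an end-of-day copy could be done with a simple query (a sketch; the project, dataset, and date suffix are placeholders):

#standardSQL
-- sketch: snapshot the realtime view into a permanent table at end of day
CREATE TABLE `my-project.my_dataset.realtime_snapshot_20190101` AS
SELECT
  *
FROM
  `my-project.my_dataset.ga_realtime_sessions_view_20190101`

Run on a schedule shortly before midnight, this leaves a stable copy even if the underlying realtime table later disappears.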
We have a set of tables with data about users' online interactions, and we want to create a table with a schema similar to the GA BigQuery Export Schema (this feature is not yet available in Russia).
I couldn't find any information on how to create a RECORD field in BigQuery by querying existing tables.
On the contrary, the documentation says that "This type is only available when using JSON source files."
Is there any workaround, or is this feature expected in the near future? Can I submit a feature request?
Currently, the only way to get nested and repeated records into BigQuery is to load JSON files. Once a query is run, all structure is flattened.
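For illustration, a newline-delimited JSON source file like the following (the field names are invented for the example) loads into a schema with nested, repeated RECORD fields:

{"visitorId": "A1", "hits": [{"hitNumber": 1, "product": [{"productSKU": "SKU-1"}]}]}
{"visitorId": "B2", "hits": [{"hitNumber": 1, "product": [{"productSKU": "SKU-2"}, {"productSKU": "SKU-3"}]}]}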
Feature request noted; hopefully BigQuery will support emitting nested record results!