BigQuery: How to flatten a table in a materialized view?

Via Google BigQuery, I'd like to visualize some data in Google Data Studio. Because the dataset is quite large and I'd like to maximize data-processing efficiency, I nested the data first (at both hit and product level) with the following query (strongly simplified for illustration purposes), taking as input a Google Analytics table as imported by default from Google Analytics into BigQuery:
#standardSQL
SELECT
  visitorid, visitNumber, visitId, visitStartTime, date,
  ARRAY(
    SELECT AS STRUCT
      hits.hitNumber, hits.time, hits.hour,
      ARRAY(
        SELECT AS STRUCT
          product.productSKU, product.v2ProductName, product.productVariant
        FROM hits.product) AS productInfo
    FROM t.hits
    ORDER BY hits.hitNumber) AS hitInfo
FROM
  `[projectID].[DatasetID].ga_sessions_*` AS t
WHERE
  _TABLE_SUFFIX BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
Because I noticed that Google Data Studio has issues dealing with nested data (incorrect aggregations), a solution I read elsewhere proposed flattening (i.e. unnesting) the data in a materialized view first and connecting the flattened data from this materialized view to Google Data Studio.
(Note: I could also have chosen to unnest the data directly in the above query and connect that to Google Data Studio, but I'd like to go for the materialized-view solution because of the gain in data-processing efficiency.)
Now, my question is: does anyone know how to convert to an unnested format in the materialized-view query in this specific case? Reading the documentation, UNNEST() is not supported in materialized views, so I'm unsure how to do this.
Thanks in advance!

According to the Limitations section of the BigQuery materialized views documentation:
A materialized view is limited to referencing only a single table, and
is not able to use joins or UNNEST functionality
Having said this, and following the discussion comments posted by @Timo Rietveld and @Martin Weitzmann, I assume that if you plan to visualize data in Google Data Studio and perform some aggregation functions, you will probably need to flatten the data into a simple format using the UNNEST operator to achieve the best result.
For instance, event tables imported from Google Analytics keep multiple key-value parameters represented as an array in the event column. You can then flatten each array element into a single row, simplifying the data structure as well.
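A minimal sketch of that flattening, applied to the query from the question. Since UNNEST is not allowed in materialized views, this sketch uses an ordinary (logical) view instead; a scheduled query that writes the flattened rows to a regular table would work similarly:
#standardSQL
CREATE OR REPLACE VIEW `[projectID].[DatasetID].ga_sessions_flat` AS
SELECT
  t.visitorid, t.visitNumber, t.visitId, t.visitStartTime, t.date,
  h.hitNumber, h.time, h.hour,
  p.productSKU, p.v2ProductName, p.productVariant
FROM
  `[projectID].[DatasetID].ga_sessions_*` AS t,
  UNNEST(t.hits) AS h,
  UNNEST(h.product) AS p
Connecting Data Studio to this flat view avoids the nested-aggregation issues, at the cost of losing the precomputation a materialized view would provide.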
Any further comments or remarks will be highly appreciated. I have made this a wiki answer to help other contributors in their research.

Related

ADX Data Pagination for use in client API

I am exploring using ADX as a timeseries data store for sensor metrics. Our current solution is storing data in MSSQL and I'm testing ADX as an alternative. I was able to set up data ingestion and I can perform basic queries, and with the added aggregation functions, computing insights and statistics seems to be much faster.
As part of the solution, we have an API data access layer used by clients and our web portal to query data for display and analysis. I am currently transforming the MSSQL queries to their KQL versions and I'm hitting a stumbling block on data pagination.
We have a function to query historical data using a combination of:
- a start/end date,
- a device identifier,
- and some paging options:
  - records per page,
  - current page,
  - column sorting / additional filtering
Currently this is handled in a SQL stored procedure on the back-end: we get the total number of records and pages (which is set as output on the API so that the front-end can use it in the table view), then fetch the records based on the input parameters and pagination details to return a record set. Quite straightforward.
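For reference, a rough sketch of what that stored procedure does today (table, column, and parameter names are my own, not from the question):
-- Hypothetical T-SQL: total count for the API output, then one page of rows.
DECLARE @TotalRecords INT;
SELECT @TotalRecords = COUNT(*)
FROM dbo.SensorReadings
WHERE DeviceId = @DeviceId
  AND ReadingTime BETWEEN @StartDate AND @EndDate;

SELECT ReadingTime, DeviceId, MetricValue
FROM dbo.SensorReadings
WHERE DeviceId = @DeviceId
  AND ReadingTime BETWEEN @StartDate AND @EndDate
ORDER BY ReadingTime
OFFSET @PageSize * (@PageNumber - 1) ROWS
FETCH NEXT @PageSize ROWS ONLY;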
Any suggestions on how to achieve effective pagination using ADX/KQL?
I found a section in the docs on pagination of stored query results, but since the queries are dynamic based on user input, this does not sound like a viable option.
When you paginate (for example, viewing results 21-30) you need to consider whether you are taking a snapshot of the result and paging through it, or viewing live data. If you expect new rows coming in not to affect your pagination, then stored query results are that snapshot. Once you generate it, you can select specific rows from it based on your page calculation.
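A minimal sketch of that approach in KQL (table, column, and value names are assumptions for illustration):
// Materialize the filtered, sorted result once as a snapshot.
.set stored_query_result DevicePage with (expiresAfter = 1h) <|
    SensorMetrics
    | where Timestamp between (datetime(2024-01-01) .. datetime(2024-01-31))
    | where DeviceId == "device-001"
    | order by Timestamp asc
    | extend RowNum = row_number()

// Page 3 with 10 records per page: rows 21-30 of the snapshot.
stored_query_result("DevicePage")
| where RowNum between (21 .. 30)
The total record count your API exposes can come from a summarize count() over the same snapshot.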

How can I query NoSQL data from Firestore with BigQuery SQL queries?

I exported hierarchically structured data with Firebase hash-keys to BigQuery. However, since the data is not structured in tables, I don't know how to use SQL queries to get the desired information. Is this possible in principle, or do I need to convert/flatten the data into tables first? Google seems to advise visualizing data in Data Studio using BigQuery as the source (not Firebase/Firestore directly). Yet I cannot find any useful information or sample queries for this case. Thank you very much in advance.
I'm not familiar with the "hierarchically structured data with Firebase hash-keys", but the general guideline for querying such 'blob/text' data (from a BigQuery perspective) is:
- As you said, use a separate pipeline to load/save the data into a BQ table structure, or
- Define a function to access your data. Since the function body can be JavaScript, you have full flexibility to parse/read your text/blob data:
CREATE FUNCTION yourDataset.getValue(input STRING, key STRING)
RETURNS STRING
LANGUAGE js
AS """
  // Example body: treat the blob as JSON and return the value for `key`.
  try {
    var obj = JSON.parse(input);
    return obj[key] == null ? null : String(obj[key]);
  } catch (e) {
    return null;
  }
""";

Filtering results from ClickHouse using values from dictionaries

I'm a little unfamiliar with ClickHouse and am still learning it by trial and error. I've got a question about it.
I'm talking about the star schema of data representation, with dimensions and facts. Currently I keep everything in PostgreSQL, but OLAP queries with aggregations are starting to show bad timing, so I'm going to move some fact tables to ClickHouse. Initial tests of CH show incredible performance; however, in real life the queries need to include joins to dimension tables from PostgreSQL. I know I can connect them as dictionaries.
Question: I found that using dictionaries I can make requests similar to LEFT JOINs in a good old RDBMS, i.e. values from the result set can be joined with corresponding values from the dictionary. But can they be filtered by some restrictions on dictionary keys (as in an INNER JOIN)? For example, in PostgreSQL I have a table users (id, name, ...) and in ClickHouse I have a table visits (user_id, source, medium, session_time, timestamp, ...) with metrics about their visits to the site. Can I make a query to CH to fetch aggregated metrics (number of daily visits for a given date range) for users whose name matches some condition (LIKE 'EVE%', for example)?
It sounds like the ODBC table function is what you're looking for. ClickHouse has a bunch of table functions that work like Postgres foreign tables. The setup is similar to dictionaries, but you gain the traditional JOIN behavior. It currently doesn't show up in the official documentation; you can refer to this integration test instead: https://github.com/yandex/ClickHouse/blob/master/dbms/tests/integration/test_odbc_interaction/test.py#L84 . And in the near future (this year), ClickHouse will have standard JOIN statements supported.
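A minimal sketch of that, using the tables from the question (this assumes a reasonably recent ClickHouse and an ODBC DSN named 'postgresql' configured for the Postgres server):
SELECT
    toDate(v.timestamp) AS day,
    count() AS visits
FROM visits AS v
INNER JOIN odbc('DSN=postgresql', 'public', 'users') AS u
    ON u.id = v.user_id
WHERE u.name LIKE 'EVE%'
  AND v.timestamp >= toDateTime('2018-10-01 00:00:00')
  AND v.timestamp <  toDateTime('2018-11-01 00:00:00')
GROUP BY day
ORDER BY day
The INNER JOIN drops visits whose user does not match the name condition, which is exactly the filtering behavior a dictionary lookup alone cannot give you.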
The dictionary will basically replace the value first. As I understand it, your dictionary would be based on your users table.
Here is an example. Hopefully I am understanding your question.
SELECT
    dictGetString('accountidmap', 'domain', tuple(toString(account_id))) AS domain,
    sum(session) AS sessions
FROM session_distributed
WHERE date = '2018-10-15'
  AND like(domain, '%cats%')
GROUP BY domain
This is a real query on our database, so if there is something you want to try or confirm, let me know.

How to sort exported Firebase data on BigQuery without exceeding resources?

I use Firebase Analytics and export data to BigQuery. Now I'd like to filter the data and sort it by timestamp.
I wrote this query:
SELECT
  event_timestamp, event_name, event_params, user_id,
  user_pseudo_id, user_properties,
  STRUCT(device.category, device.time_zone_offset_seconds,
    device.is_limited_ad_tracking) AS device,
  platform
FROM `myTable`
ORDER BY event_timestamp;
But this results in an error: Resources exceeded during query execution: The query could not be executed in the allotted memory. Sort operator used for ORDER BY used too much memory... I think the data is too large to fit in the memory BigQuery allots for sorting.
The reason why I have to sort the data is that I'd like to download it and parse it in my on-premises application in ascending order of timestamp. And if I moved the sorting into my application instead, it would take a lot of time.
I don't know the features of Google Cloud Platform very well. Is there any good way to sort huge data on GCP?
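One possible workaround (my own sketch, not an answer from this thread): sort in bounded chunks, e.g. one day at a time, so each ORDER BY only has to sort a slice that fits in memory, then concatenate the chunks in order on the client:
#standardSQL
-- Hypothetical: fetch one day's slice; event_timestamp is in microseconds.
SELECT event_timestamp, event_name, user_pseudo_id
FROM `myTable`
WHERE event_timestamp >= UNIX_MICROS(TIMESTAMP('2023-11-01'))
  AND event_timestamp <  UNIX_MICROS(TIMESTAMP('2023-11-02'))
ORDER BY event_timestamp;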

Google analytics realtime data in BigQuery

We have enabled continuous export of Google Analytics data to BigQuery which means we get ga_realtime_sessions_YYYYMMDD tables with data dumps throughout the day.
These tables are – usually! – left in place, so we accumulate a stack of the realtime tables for the previous n dates (n does not seem to be configurable).
However, every once in a while, one of the tables disappears, so there will be gaps in the sequence of dates and we might not have a table for e.g. yesterday.
Is this behaviour documented somewhere?
It would be nice to know what guarantees we have, as we might rely on e.g. realtime data from yesterday while we wait for the "finished" ga_sessions_YYYYMMDD table to show up. The support document linked above does not mention this.
As stated in this help article, these internal ga_realtime_sessions_YYYYMMDD tables should not be used for queries; the ga_realtime_sessions_view_YYYYMMDD view should be used instead, in order to obtain fresh data and avoid unexpected results.
If you want to use data from some day ago while you wait for today's internal ga_realtime_sessions_YYYYMMDD tables to be created, you can copy the data obtained from querying the ga_realtime_sessions_view_YYYYMMDD view into a separate table at the end of each day for this purpose.
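A minimal sketch of that end-of-day copy (project, dataset, and date are placeholders):
#standardSQL
CREATE OR REPLACE TABLE `[projectID].[DatasetID].ga_realtime_snapshot_20231114` AS
SELECT *
FROM `[projectID].[DatasetID].ga_realtime_sessions_view_20231114`;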
