BigQuery _TABLE_SUFFIX for ga_sessions - google-analytics

This query works fine:
SELECT page.* FROM `zentinel-datawarehouse.xxx.ga_sessions_20170601` ga,
UNNEST (hits) hits,
UNNEST (hits.page) page
but when I need to use _TABLE_SUFFIX:
SELECT page.* FROM `zentinel-datawarehouse.xxx.ga_sessions_*` ga,
UNNEST (hits) hits,
UNNEST (hits.page) page
WHERE _TABLE_SUFFIX >= '20170601'
it doesn't work any more.
This happens for that date because hits.page is a repeated record there; for months 08 or 09 it works fine because the record is NULLABLE.
Any ideas?
Regards

If you get the error message below:
Error: Values referenced in UNNEST must be arrays. UNNEST contains
expression of type STRUCT at [3:9]
then some tables have a different schema. Try to locate when the schema change was applied: it might be that Jan-Feb has one schema and the tables since March have an updated one.
With the Google Analytics export you encounter this kind of schema change frequently.
What you can do here is patch your tables, e.g. fix the schema in a direction that will help you.
Without doing the fix, you would need two different queries to target both schemas (and more schemas will follow if the GA team changes things on the go).
You should have a script that constantly propagates to older tables all the schema changes introduced with newer updates.
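To locate the change, a minimal sketch using INFORMATION_SCHEMA (this view did not exist when the question was asked; the project and dataset names are taken from the question):
SELECT table_name, field_path, data_type
FROM `zentinel-datawarehouse.xxx.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS`
WHERE table_name LIKE 'ga_sessions_2017%'
  AND field_path = 'hits.page'
ORDER BY table_name
The data_type column should flip between ARRAY<STRUCT<...>> and STRUCT<...> on the day the export schema changed.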

hits.page is not an array but a struct. You're already cross joining hits (which is an array); that should be sufficient.
You can only cross join arrays or tables. You want to remove the second UNNEST, as it only works where hits.page is NULL:
SELECT
  h.page.*
FROM
  `project.dataset.ga_sessions_201712*` t, t.hits h
LIMIT
  1000

Related

Custom Definitions in Bigquery

I'm pretty new to BigQuery/Firebase/GA and even SQL. (By the way, if you have good learning resources or recommendations on where to start, that would be great!)
But I have a main issue with BigQuery that needs solving right now. I'm trying all the sources I can get info/tips from, and I hope this community will be one of them.
My issue is with custom definitions. We have them defined in Google Analytics, and we want to divide users by these definitions and analyze them separately.
My question is: where/how can I find these custom definitions in BigQuery to filter my data? I have the normal fields, like user ID, timestamps etc., but I can't find the custom definitions.
I have been doing some research but still don't have a clear answer; if someone can give me some tips or maybe a solution, I would be forever in debt! xD
I got one solution from another community which looks like this, but I couldn't make it work; my BigQuery doesn't recognize customDimensions, as it says in the error:
select cd.* from table, unnest(customDimensions) cd
You can create your own custom function or stored procedure in BigQuery as per your requirements.
To apply a filter over fields like user ID and timestamp, you can simply apply a standard SQL filter as given below:
SELECT * FROM DATA WHERE USER_ID = 'User1' OR Timestamps = 'YYYY-MM-DDTHH:MM'
Moreover, UNNEST is used to split data on repeated fields; do you have data which needs to be split?
I could help you more if you share what you are expecting from your SQL.
Your custom dimensions sit in arrays called customDimensions. These arrays are basically a list of structs, where each struct has 2 fields: index and value. So they look like this example: [ {index:1, value:'banana'}, {index:4, value:'yellow'}, {index:8, value:'premium'} ], where index is the number of the custom dimension you've set up in Google Analytics.
There are 3 customDimensions arrays! Two of them are nested within other arrays. If you want to work with those, you really need to get proficient in working with arrays; e.g. the function UNNEST() turns arrays into a table format on which you can run SQL.
customDimensions[]
hits[]
    customDimensions[]
    product[]
        customDimensions[]
Example 1, with a subquery on session-scoped custom dimension 3:
select
  fullvisitorid,
  visitStartTime,
  (select value from unnest(customDimensions) where index=3) cd3
from
  ga_sessions_20210202
Example 2, with a lateral cross join - you're enlarging the table here, which is not ideal:
select
  fullvisitorid,
  visitStartTime,
  cd.*
from ga_sessions_20210202 cross join unnest(customDimensions) as cd
All names are case-sensitive - in one of your screenshots you used a wrong name, with a "c" in uppercase.
This page can help you up your array game: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays - just work through all the examples and play around in the query editor
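Since two of the three customDimensions arrays are nested inside hits and hits.product, reaching a hit-scoped dimension takes one more UNNEST. A minimal sketch, assuming the standard ga_sessions export schema and a hypothetical hit-scoped dimension 5:
select
  fullvisitorid,
  h.hitNumber,
  (select value from unnest(h.customDimensions) where index=5) hit_cd5
from ga_sessions_20210202 t, unnest(t.hits) as h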

CosmosDB: DISTINCT for only one Column

I have the following query:
SELECT DISTINCT c.deviceId, c._ts FROM c
ORDER BY c._ts DESC
I would like to receive only one pair (c.deviceId, c._ts) per deviceId, but because the c._ts value is distinct for all entries, I am getting the value pairs for all deviceIds - in other words, my whole DB.
I have tried to use Question: Distinct for only one value as a guide, but I see that CosmosDB does not support GROUP BY.
Is there a way to do this in CosmosDB?
Though it's a common requirement, I think, I can't implement it on my side either: the DISTINCT keyword can't work on one single column across the whole query result.
The GROUP BY feature has been in active development for a long period; based on the latest comment on the feedback item, it is coming soon.
If your need is urgent, as a workaround you could follow this case and use the documentdb-lumenize package, which supports aggregate functions.
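For reference, once GROUP BY ships, one row per deviceId should presumably be expressible directly; a sketch (not runnable at the time of this answer, and Cosmos DB does not combine GROUP BY with ORDER BY):
SELECT c.deviceId, MAX(c._ts) AS lastSeen
FROM c
GROUP BY c.deviceId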

How to put a part of a code as a string in table to use it in a procedure?

I'm trying to resolve the issue below.
I need to prepare a table that consists of 3 columns:
user_id,
month,
value.
Each of the over 200 users has different values of the parameters that determine the expected value, which are: LOB, CHANNEL, SUBSIDIARY. So I decided to store them in the table ASYSTENT_GOALS_SET. But I wanted to avoid multiplying rows, and thought it would be nice to put all the conditions into a piece of code that I would use in the WHERE clause further on in the procedure.
So, as an example - instead of multiple rows:
I created such an entry:
So far I have created the testing table ASYSTENT_TEST (where I collect month and value for a certain user), and I wrote a piece of a procedure where I used BULK COLLECT:
declare
  type test_row is record
  (
    month NUMBER,
    value NUMBER
  );
  type test_tab is table of test_row;
  BULK_COLLECTOR test_tab;
  p_lob varchar2(10) := 'GOSP';
  p_sub varchar2(14);
  p_ch  varchar2(10) := 'BR';
begin
  select subsidiary into p_sub from ASYSTENT_GOALS_SET where user_id = '40001001';
  execute immediate 'select mc, sum(ppln_wartosc) plan from prod_nonlife.mis_report_plans
    where report_id = (select to_number(value) from prod_nonlife.view_parameters where view_name=''MIS'' and parameter_name=''MAX_REPORT_ID'')
    and year=2017
    and month between 7 and 9
    and ppln_jsta_symbol in (:subsidiary)
    and dcs_group in (:lob)
    and kanal in (:channel)
    group by month order by month' bulk collect into BULK_COLLECTOR
    using p_sub, p_lob, p_ch;
  forall x in BULK_COLLECTOR.first..BULK_COLLECTOR.last
    insert into ASYSTENT_TEST values BULK_COLLECTOR(x);
end;
Now, when the SUBSIDIARY column (varchar) in table ASYSTENT_GOALS_SET contains the string 12_00_00 (which is the code of one of the subsidiaries), everything works fine. But the problem appears when a user works in two subsidiaries, let's say 12_00_00 and 13_00_00. I have no clue how to write that down. Should the SUBSIDIARY column contain:
'12_00_00','13_00_00'
or
"12_00_00","13_00_00"
or maybe
12_00_00','13_00_00
I have tried a lot of options after digging into topics like "Dealing with single/escaped/double quotes".
Maybe I should change something in the execute immediate as well?
Or maybe my approach to this issue is completely wrong from the very beginning (hopefully not :) ).
I would be grateful for support.
I didn't create the table function described there, but that article inspired me to go back and try the regexp_substr function again.
I changed ppln_jsta_symbol in (:subsidiary) to:
ppln_jsta_symbol in (select regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''), ''[^,]+'', 1, level) from dual
  connect by regexp_substr((select subsidiary from ASYSTENT_GOALS_SET where user_id=''fake_num''), ''[^,]+'', 1, level) is not null)
Now it works like a charm! Thank you @Dessma very much for your time and suggestion!
"I wanted to avoid multiplying rows and thought it would be nice to put all conditions as a part of the code that I would use in 'where' clause further in procedure"
This seems a misguided requirement. You shouldn't worry about number of rows: databases are optimized for storing and retrieving rows.
What they are not good at is dealing with "multi-value" columns. As your own solution proves, it is not nice, it is very far from nice, in fact it is a total pain in the neck. From now on, every time anybody needs to work with subsidiary they will have to invoke a function. Adding, changing or removing a user's subsidiary is much harder than it ought to be. Also there is no chance of enforcing data integrity i.e. validating that a subsidiary is valid against a reference table.
Maybe none of this matters to you. But there are very good reasons why Codd mandated "no repeating groups" as a criterion of First Normal Form, the foundation step of building a sound data model.
The correct solution, industry best practice for almost forty years, would be to recognise that SUBSIDIARY exists at a different granularity to CHANNEL and so should be stored in a separate table.
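A minimal sketch of that separate table (names and column sizes are illustrative, not taken from the question):
CREATE TABLE asystent_user_subsidiary (
  user_id    VARCHAR2(10) NOT NULL,
  subsidiary VARCHAR2(14) NOT NULL,
  CONSTRAINT pk_asystent_user_subsidiary PRIMARY KEY (user_id, subsidiary)
);
The dynamic IN-list then collapses to an ordinary subquery, with no quoting gymnastics:
ppln_jsta_symbol in (select subsidiary from asystent_user_subsidiary where user_id = :user_id)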

How to Handle BQ GA Export Changes?

I'm trying to reprocess ga_sessions_yyyymmdd data, but am finding that ga_sessions never used to have a field called [channelGrouping], while it does in more recent data.
So my jobs work fine for the latest version of ga_sessions, but when I try to reprocess earlier ga_sessions data, the job fails as it's missing the [channelGrouping] field.
Obviously this is usually what you want, but in this case it's not. I want to make sure I'm sticking to the latest ga_sessions schema, and would like the job to just set missing columns to NULL where they did not exist.
Is there any way around this?
Perhaps I need to make an empty table called ga_sessions_template_latest and union it onto whatever ga_sessions_ daily table I'm handling - maybe this will 'upgrade' the old ga_sessions to the new structure.
Attached is a screenshot of exactly what I mean (my union idea will actually be horrible due to the nested fields in ga_sessions).
I don't have such a script yet, but since the tables are under your project, you are able to update them. You can write a script that updates the schema on all tables missing columns from the most recent schema set.
I envision a script that gets the most recent table schema, then goes back one by one through the past tables, does a compare, identifies the missing columns, defines them as nullable and not required, applies the additional columns to the schema, and runs the update on each table. Data won't be modified; you will just have additional columns with NULL values.
You can also try some of this from the Web UI.
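For top-level columns like channelGrouping, a sketch of the per-table patch in today's DDL (ALTER TABLE ADD COLUMN did not exist when this was answered, and it only appends NULLABLE top-level columns; new nested fields inside records still require a schema update through the API or the bq CLI; the table name is illustrative):
ALTER TABLE `project.dataset.ga_sessions_20160601`
ADD COLUMN IF NOT EXISTS channelGrouping STRING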

BigQuery error: Cannot query the cross product of repeated fields

I am running the following query in the Google BigQuery web interface, on data provided by Google Analytics:
SELECT *
FROM [dataset.table]
WHERE
  hits.page.pagePath CONTAINS "my-fun-path"
I would like to save the results into a new table; however, I get the following error message when using Flatten Results = False:
Error: Cannot query the cross product of repeated fields
customDimensions.value and hits.page.pagePath.
This answer implies that it should be possible: Is there a way to select nested records into a table?
Is there a workaround for this issue?
Depending on what kind of filtering is acceptable to you, you may be able to work around this by switching from WHERE to OMIT IF. It will give different results, but perhaps such different results are acceptable.
The following removes an entire hit record unless some page inside of it meets the criteria. Note two things here:
it uses OMIT hits IF, instead of the more commonly used OMIT RECORD IF;
the condition is inverted, because OMIT IF is the opposite of WHERE.
The query is:
SELECT *
FROM [dataset.table]
OMIT hits IF EVERY(NOT hits.page.pagePath CONTAINS "my-fun-path")
Update: see the related thread; I am afraid this is no longer possible.
It would be possible to use the NEST function and group by a field, but that's a long shot.
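For completeness, a heavily hedged sketch of that NEST idea in legacy SQL (the grouping key and fields are illustrative; like the query above, it needs a destination table with flatten results disabled):
SELECT fullVisitorId, NEST(hits.page.pagePath) AS matchingPaths
FROM [dataset.table]
WHERE hits.page.pagePath CONTAINS "my-fun-path"
GROUP BY fullVisitorId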
Using a FLATTEN call in the query:
SELECT *
FROM flatten([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910], customDimensions)
WHERE
  hits.page.pagePath CONTAINS "m"
and then, in the web UI:
setting a destination table,
allowing large results,
and NO flatten results
does the job correctly, and the produced table matches the original schema.
I know this is an old question, but it can now be achieved by just using the standard SQL dialect instead of legacy SQL (note that standard SQL has no CONTAINS operator, and hits must be unnested before reaching page):
#standardSQL
SELECT t.*
FROM `dataset.table` t, UNNEST(t.hits) AS h
WHERE
  h.page.pagePath LIKE '%my-fun-path%'
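If duplicated session rows (one per matching hit) are not wanted, a hedged variant with EXISTS keeps one row per session:
#standardSQL
SELECT t.*
FROM `dataset.table` t
WHERE EXISTS (
  SELECT 1 FROM UNNEST(t.hits) AS h
  WHERE h.page.pagePath LIKE '%my-fun-path%'
)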
