BigQuery error: Cannot query the cross product of repeated fields - google-analytics

I am running the following query in the Google BigQuery web interface, on data provided by Google Analytics:
SELECT *
FROM [dataset.table]
WHERE
  hits.page.pagePath CONTAINS "my-fun-path"
I would like to save the results into a new table; however, I get the following error message when using Flatten Results = False:
Error: Cannot query the cross product of repeated fields
customDimensions.value and hits.page.pagePath.
This answer implies that this should be possible: Is there a way to select nested records into a table?
Is there a workaround for this issue?

Depending on what kind of filtering is acceptable to you, you may be able to work around this by switching from WHERE to OMIT IF. It will give different results, but perhaps such different results are acceptable.
The following will remove an entire hit record if some page inside it meets the criteria. Note two things here:
It uses OMIT hits IF, instead of the more commonly used OMIT RECORD IF.
The condition is inverted, because OMIT IF is the opposite of WHERE.
The query is:
SELECT *
FROM [dataset.table]
OMIT hits IF EVERY(NOT hits.page.pagePath CONTAINS "my-fun-path")

Update: see the related thread; I am afraid this is no longer possible.
It would be possible to use the NEST function and group by a field, but that's a long shot.
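For the record, a NEST-based sketch might look like the following; this is only an illustration of the idea (aggregating matching page paths back into a repeated field), not a tested query from the original answer:
SELECT fullVisitorId, NEST(hits.page.pagePath) AS pagePaths
FROM [dataset.table]
WHERE hits.page.pagePath CONTAINS "my-fun-path"
GROUP BY fullVisitorId
Note that this preserves only the one aggregated column as repeated, not the original schema.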
Using a FLATTEN call in the query:
SELECT *
FROM flatten([google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910],customDimensions)
WHERE
  hits.page.pagePath CONTAINS "m"
Then, in the web UI:
setting a destination table
allowing large results
and NO flatten results
does the job correctly and the produced table matches the original schema.

I know this is an old question, but it can now be achieved by using the standard SQL dialect instead of legacy SQL:
#standardSQL
SELECT t.*
FROM `dataset.table` t, UNNEST(hits) AS hit
WHERE
  hit.page.pagePath LIKE "%my-fun-path%"
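Note that the cross join above outputs one row per matching hit. If the goal is to keep one row per session with the original nested schema (as with Flatten Results = False), a filter via EXISTS should do it; this is a sketch along those lines, not part of the original answer:
#standardSQL
SELECT t.*
FROM `dataset.table` t
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(t.hits) AS hit
  WHERE hit.page.pagePath LIKE "%my-fun-path%"
)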

Related

Custom Definitions in Bigquery

I'm pretty new to BigQuery/Firebase/GA and even SQL. (By the way, if you have good learning resources or recommendations on where to start, that would be great!)
But I have one main issue with BigQuery that needs solving right now. I'm trying all the sources I can get info/tips from, and I hope this community will be one of them.
So my issue is with custom definitions. We have them defined in Google Analytics, and we want to divide users by this definition and analyze them separately.
My question is: where/how can I find these custom definitions in BigQuery to filter my data? I have the normal fields, like user ID, timestamps, etc., but I can't find these custom definitions.
I have been doing some research but still don't have a clear answer. If someone can give me some tips or maybe a solution, I would be forever in debt! xD
I got one solution from another community, which looks like this, but I couldn't make it work; my BigQuery doesn't recognize customDimensions, as it says in the error.
select cd.* from table, unnest(customDimensions) cd
You can create your own custom function or stored procedure in BigQuery as per your requirements.
To apply a filter over fields like user ID and timestamp, you can simply apply a standard SQL filter as given below:
SELECT * FROM DATA WHERE USER_ID = 'User1' OR Timestamps = 'YYYY-MM-DDTHH:MM'
Moreover, UNNEST is used to split data on repeated fields; do you have data which needs to be split?
I could help you more if you share what you are expecting from your SQL.
Your custom dimensions sit in arrays called customDimensions. These arrays are basically a list of structs, where each struct has 2 fields: index and value. So they basically look like this example: [ {index:1, value:'banana'}, {index:4, value:'yellow'}, {index:8, value:'premium'} ], where index identifies the custom dimension you've set up in Google Analytics.
There are 3 customDimensions arrays! Two of them are nested within other arrays. If you want to work with those, you really need to get proficient in working with arrays. E.g. the function unnest() turns arrays into table format, on which you can run SQL.
customDimensions[]
hits[]
  customDimensions[]
  product[]
    customDimensions[]
Example 1, with a subquery on session-scoped custom dimension 3:
select
  fullvisitorid,
  visitStartTime,
  (select value from unnest(customDimensions) where index=3) as cd3
from
  ga_sessions_20210202
Example 2, with a lateral cross join - you're enlarging the table here, which is not ideal:
select
  fullvisitorid,
  visitStartTime,
  cd.*
from ga_sessions_20210202 cross join unnest(customDimensions) as cd
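For the nested arrays (e.g. hit-scoped custom dimensions), you unnest twice: first the hits array, then each hit's customDimensions. A sketch in the same spirit as the examples above, assuming the standard GA export schema:
select
  fullvisitorid,
  visitStartTime,
  hit.hitNumber,
  (select value from unnest(hit.customDimensions) where index=3) as hit_cd3
from ga_sessions_20210202, unnest(hits) as hit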
All names are case-sensitive - in one of your screenshots you used a wrong name with a "c" being uppercase.
This page can help you up your array game: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays - just work through all the examples and play around in the query editor

Error with SQLite query, What am I missing?

I've been attempting to increase my knowledge by trying out some challenges. I've been going at this for a solid two weeks now and have finished most of the challenge, but this one part remains. The error is shown below; what am I not understanding?
Error in sqlite query: update users set last_browser= 'mozilla' + select sql from sqlite_master'', last_time= '13-04-2019' where id = '14'
edited for clarity:
I'm trying a CTF challenge and I'm completely new to this kind of thing, so I'm learning as I go. There is a login page with test credentials we can use for obtaining many of the flags. I have obtained most of the flags, and this is the last one that remains.
After I log in to the webapp with the provided test credentials, the following messages appear: this link
The question for the flag is "What value is hidden in the database table secret?"
So, from the previous image, I have attempted to use SQL injection to obtain the value. This is done by using Burp Suite and attempting to inject through the User-Agent.
I have tried many variants of the injection attempt shown above. I'm struggling to find out where I am going wrong, especially since the second single quote is added automatically in the query. I've gone through the SQLite documentation and examples of SQL injection, but I cannot seem to understand what I am doing wrong or how to get it to work.
A subquery such as select sql from sqlite_master should be enclosed in parentheses.
So you'd want
update users set last_browser= 'mozilla' + (select sql from sqlite_master), last_time= '13-04-2019' where id = '14';
Although I don't think that will achieve what you want, which isn't clear. A simple test results in last_browser being set to 0, because + in SQLite is numeric addition and both operands coerce to 0.
You may want a concatenation of the strings, so instead of + use ||, e.g.
update users set last_browser= 'mozilla' || (select sql from sqlite_master), last_time= '13-04-2019' where id = '14';
In which case you'd get something like 'mozilla' immediately followed by the CREATE statement text pulled from sqlite_master.
Thanks for everyone's input, I've worked this out.
The SQL query was set up like this:
update users set last_browser= '$user-agent', last_time= '$current_date' where id = '$id_of_user'
I edited the User-Agent with Burp Suite to be:
Mozilla', last_browser=(select sql from sqlite_master where type='table' limit 0,1), last_time='13-04-2019
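Substituted into that template, the statement the server actually runs becomes (reconstructed here for illustration):
update users set last_browser= 'Mozilla', last_browser=(select sql from sqlite_master where type='table' limit 0,1), last_time='13-04-2019', last_time= '13-04-2019' where id = '14'
SQLite keeps only the rightmost assignment to a column, so the subquery's result lands in last_browser, and changing the limit offset walks through the schema one table at a time.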
Iterating with that, I found all the tables, columns, and flags. Rather time-consuming, but I could not find a way to optimise it.

UNION of tables using bigquery LegacySQL

I'm trying, without luck, to write a query that retrieves the union of two tables of events using legacy SQL, as standard SQL is not yet supported by Data Studio.
In standard SQL that would be something like:
SELECT
*
FROM
`com_myapp_ANDROID.app_events_*`,
`com_myapp_IOS.app_events_*`
However, in legacy SQL I get an error when trying to refer to app_events_*. How do I include all the tables of my events, so I can filter afterwards in Data Studio, if I can't use the wildcard?
I've tried something like:
select * from (TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events_"'))
But I'm not sure if this is the right approach; I get:
Cannot output multiple independently repeated fields at the same time.
Found user_dim_user_properties_value_index and event_dim_date
Edit: in the end, this is the resulting query, as you can't use FLATTEN directly with TABLE_QUERY:
select
  *
from
  FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')), user_dim.user_properties),
  FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_IOS, 'table_id CONTAINS "app_events"')), user_dim.user_properties)
Table wildcards don't work in legacy SQL, as you have guessed, so you have to use the TABLE_QUERY() function.
Your approach is right, but the first parameter of the TABLE_QUERY function should be the dataset name, not the first part of the table name. Assuming your dataset name is app_events, that would look like this:
TABLE_QUERY(app_events,'table_id CONTAINS "app_events"')
In legacy SQL, the union table operator is the comma:
select * from [table1],[table2]
For TABLE_QUERY, you include the dataset name as the first parameter and the expression as the second:
select * from (TABLE_QUERY([dataset], 'table_id CONTAINS "event"'))
To read more about how to debug TABLE_QUERY, see this linked answer.
The web UI automatically flattens the results for you, but when there are independently repeated fields you need to flatten with the FLATTEN wrapper.
It takes two parameters: the table and the repeated field, e.g. FLATTEN(table, tags).
Also, if TABLE_QUERY is involved, you probably need to subselect, like:
select
  *
from
  FLATTEN((SELECT * FROM TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"')), user_dim.user_properties)
The particular issue you are experiencing is not UNION related - you will see the same error message even with just one table, if that table has multiple independently repeated fields and you try to output them all at once. This scenario is specific to legacy SQL and can be resolved with the FLATTEN clause.
At the same time, most likely you don't actually mean to use SELECT *, which is what causes those repeated fields to appear in the output at the same time. If you can narrow down your output list, you have a chance to avoid the error, as in the sketch below - and if a few independently repeated fields still need to be in the output, you can use the FLATTEN technique.
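For instance, selecting fields from only one of the repeated records sidesteps the cross product entirely. A sketch assuming the old Firebase Analytics export schema (the field list is illustrative):
SELECT user_dim.app_info.app_instance_id, event_dim.name, event_dim.date
FROM (TABLE_QUERY(com_myapp_ANDROID, 'table_id CONTAINS "app_events"'))
Here event_dim is the only repeated field in the output, so no FLATTEN is needed.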

CustTableListPage filtering is too slow

When I try to filter the CustAccount field on CustTableListPage, it takes too long to filter. On the other fields there is no latency. I'm filtering on just part of the account number, like "*123".
I have reindexed CustTable and also updated statistics, but there is no appreciable difference at all.
When I add the list page's query to a view, it filters the CustAccount field as quickly as the other fields.
Any suggestions?
Edit:
Our version is AX 2012 R2 CU8. It is not a user-specific problem; it occurs for every user. The interaction class has some customizations, but just for setting some buttons' enabled/disabled properties, etc. I tried to look at the query execution; what I found is not clear, something like FETCH_API_CURSOR_000000..x
Record a trace of this execution and locate the bottleneck.
Keep in mind that wildcards (such as *) have to be used with care. Using a filter string that starts with a wildcard kills performance, because the SQL indexes cannot be used.
Using a wildcard at the end
Imagine that you have a dictionary and have to list all the words starting with 'Foo'. You can skip all entries before 'F', then all those before 'Fo', then all those before 'Foo', and start your result list from there.
Similarly, asking the underlying SQL engine to list all CustAccount entries starting with '123' (= filter string '123*') allows using an index on CustAccount to quickly skip to the relevant data.
Using a wildcard at the start
Imagine that you still have that dictionary and have to list all the words ending with 'ing'. You would have no choice but to go through the entire dictionary and check the ending of every word (due to the alphabetical sorting).
This explains why asking the SQL engine to list all CustAccount entries ending with '123' (= filter string '*123') means that all CustAccount values must be examined. So the AOS loops through all the entries and uses an SQL cursor to do this. That is the FETCH_API_CURSOR statement you see at the SQL level.
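In plain T-SQL terms, the difference looks like this sketch (assuming the ACCOUNTNUM column that backs CustAccount in a standard AX database):
-- Prefix match: an index on ACCOUNTNUM can be used to seek straight to the '123...' range
SELECT * FROM CUSTTABLE WHERE ACCOUNTNUM LIKE '123%'
-- Suffix match: every ACCOUNTNUM value must be checked, forcing a scan (the cursor fetch above)
SELECT * FROM CUSTTABLE WHERE ACCOUNTNUM LIKE '%123'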
Possible solutions
Educate your end user that using a wildcard at the beginning of a filter string will always be slow on a large table.
Step up the SQL server hardware / allocated resources (faster CPU, more RAM, faster disk, ...).
Create a full-text index on CustAccount (I'm not a fan of this one, and the performance impact should be thoroughly investigated).
I've solved the problem. The CustTableListPage query had a sort on the DirPartyTable.Name field. When I removed this sorting, filtering with a wildcard worked like a charm.

What is Better for Mimicking PL/SQL Returning SQL in Interactive Reports: Collection or Pipelined-Function

The worst aspect of the Interactive Report (IR) is that you cannot create it from PL/SQL returning a SQL statement. I have gotten around this using two methods:
1) APEX_COLLECTION.CREATE_COLLECTION in a Before Header process, which takes a SQL statement (constructed in PL/SQL in the process), with the IR's source being select c001 alias1, c002 alias2 ... from apex_collections a where collection_name = '...'
2) Make a badass pipelined function with a parameter list as long as you need, and have the IR's source be select * from table(package_name.pipelined_function_name(:P1_parameter1, :P1_Parameter2))
Is there a performance difference? I originally used the first method, but then ran into a case where it gave me a bug, so I tried the pipelined function; I found I just liked it better and have tended to use it ever since, unless it was inappropriate to do so (namely, when there is a large number of items to pass as parameters).
The first method gives you the opportunity to cache data by re-creating the collection only when you need to. Using the n00X and d00X columns will give you some additional performance and the right column types for the report definition. You can also create a view based on that collection, with type casting and column aliases, for more convenience:
create or replace view apx_my_report
as
select n001 id, c001 data, d001 some_date
from apex_collections
where collection_name = 'MY_REPORT'
/
In that case your report source will look like this:
select id, data, some_date from apx_my_report
/
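The conditional re-creation in the Before Header process might look like this sketch (the collection name, query, and page item are illustrative assumptions, not from the original answer):
begin
  -- only (re)build the collection when it does not already exist in this session
  if not apex_collection.collection_exists(p_collection_name => 'MY_REPORT') then
    apex_collection.create_collection_from_query(
      p_collection_name => 'MY_REPORT',
      p_query           => q'[select id, data, some_date from my_table where owner = v('P1_OWNER')]');
  end if;
end;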
On the other hand, when you need to execute an ad-hoc query every time the page is rendered, the collection unavoidably has to be re-created each time, and performance goes down because of the unwanted transaction maintenance: undo, redo, etc.
So, it depends.
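For comparison, a minimal sketch of the pipelined-function approach from the question (the object type, table, and parameter names are illustrative):
create or replace type t_report_row as object (id number, data varchar2(4000));
/
create or replace type t_report_tab as table of t_report_row;
/
create or replace function pipelined_report(p_parameter1 in varchar2)
  return t_report_tab pipelined
as
begin
  -- pipe one object row per fetched record
  for r in (select id, data from my_table where owner = p_parameter1) loop
    pipe row (t_report_row(r.id, r.data));
  end loop;
  return;
end;
/
The IR source would then be select * from table(pipelined_report(:P1_parameter1)).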
