BigQuery Window ORDER BY is not allowed if DISTINCT is specified - window-functions

I'm investigating porting some BigQuery legacy SQL containing windowed distinct counts like this
count(distinct brand_id) over (partition by user_id order by order_placed_at range between 7 * 24 * 60 * 60 * 1000000 PRECEDING AND 1 PRECEDING) as last_7_day_buyer_brands
to standard SQL, but I get this error:
Window ORDER BY is not allowed if DISTINCT is specified
For reference, I've tried the APPROX_COUNT_DISTINCT function with no luck.
Is there a better way to get this to work other than writing subqueries and GROUP BYs?
Most of the other queries have ported to standard SQL with only minor changes.

Per the documentation:
OVER clause requirements:
PARTITION BY: Optional.
ORDER BY: Optional. Disallowed if DISTINCT is present.
window_frame_clause: Optional. Disallowed if DISTINCT is present.
(Note: the emphasis above is mine, not as in the documentation.)
As you can see, not only ORDER BY but also the window frame clause (RANGE BETWEEN) is disallowed when DISTINCT is used.
I think a subquery is the way to go.
In case you need direction for this, use the simple example below:
#standardSQL
SELECT
  user_id,
  order_placed_at,
  brand_id,
  (SELECT COUNT(DISTINCT brand)
   FROM UNNEST(last_7_day_buyer_brands_with_dups) AS brand
  ) AS last_7_day_buyer_brands
FROM (
  SELECT
    user_id,
    order_placed_at,
    brand_id,
    ARRAY_AGG(brand_id) OVER(
      PARTITION BY user_id ORDER BY order_placed_at
      RANGE BETWEEN 7 * 24 * 60 * 60 * 1000000 PRECEDING AND 1 PRECEDING
    ) AS last_7_day_buyer_brands_with_dups
  FROM yourTable
)
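For intuition, the per-row computation the workaround performs (collect the preceding 7 days of values, then count distinct) can be sketched in plain Python. The timestamps and brand names below are hypothetical; the frame's upper bound of 1 microsecond PRECEDING simply means strictly earlier rows:

```python
from datetime import datetime, timedelta

# Hypothetical orders for a single user, sorted by order_placed_at.
orders = [
    (datetime(2023, 1, 1), "brand_a"),
    (datetime(2023, 1, 3), "brand_b"),
    (datetime(2023, 1, 5), "brand_a"),
    (datetime(2023, 1, 20), "brand_c"),
]

window = timedelta(days=7)
results = []
for ts, _brand in orders:
    # Frame: 7 days PRECEDING (inclusive) to 1 microsecond PRECEDING,
    # i.e. strictly earlier rows within the last 7 days.
    in_window = {b for t, b in orders if ts - window <= t < ts}
    results.append(len(in_window))

print(results)  # [0, 1, 2, 0]
```

In the real query, the ARRAY_AGG window does the collecting and the correlated subquery over UNNEST does the distinct count, which sidesteps the DISTINCT restriction.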


BigQuery: Scheduled Query error: Resources exceeded during query execution: Not enough resources for query planning

In BigQuery, the following error appears after execution of a scheduled query:
Resources exceeded during query execution: Not enough resources for
query planning - too many subqueries or query is too complex
I admit the query is quite complex, with multiple OVER() clauses including PARTITION BY and ORDER BY, which are expensive from a computational perspective. However, these OVER() clauses are needed to get the desired resulting table. The query processes approximately 50 GB.
The scheduled query covers 4 days of Google Analytics-related data.
Remarkably, however, when I run the same query manually, it executes without any problems (approximately 35 seconds of query time). Even when I manually execute it over 365 days of GA data, it succeeds; that run processes approximately 4 TB (about 280 seconds of query time).
Does anyone know why scheduled queries fail in my case while manual queries execute without errors? And, given that the scheduling is important, is there a fix so that the scheduled query can be executed without errors?
Basically, it's the query below. Note that I hid the input table, which reduced the query length a bit. The input table is just a collection of SELECT queries merging multiple input tables with UNION ALL.
Note as well that I am trying to connect hits from separate sources, Firebase Analytics (app data) and Universal Analytics (web data), into custom session IDs where needed; where this is not needed, I use the regular visit IDs from GA.
SELECT
*,
MAX(device) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS device_new,
IF
(mix_app_web_session = 'mixed',
CONCAT('mixed_',MAX(app_os) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)),
browser) AS browser_new,
MAX(channel) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS channel_new,
IF
(mix_app_web_session = 'mixed',
MAX(app_os) OVER (PARTITION BY country, date, custvisitid_unified RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
os) AS os_new
FROM (
SELECT
*,
IF
(COUNT(DISTINCT webshop) OVER (PARTITION BY country, date, custvisitid_unified) > 1,
'mixed',
'single') AS mix_app_web_session
FROM ( # define whether custvisitid_unified has hits from both app and web
SELECT
*,
IF
(user_id_anonymous_wide IS NULL
AND mix_app_web_user = 'single',
custvisitid,
CONCAT(MAX(custvisitid) OVER (PARTITION BY country, date, user_id_anonymous_wide RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),cust_session_counter)) AS custvisitid_unified
FROM ( # create custvisitid_unified, in which the linked app and web hits have been assigned a unified custom visitid. Only apply a custom visitid to user_id_anonymous_wide = 'mixed' hits (since it is a bit tricky), otherwise just use the regular visitid from GA
SELECT
*,
IF
(COUNT(DISTINCT webshop) OVER (PARTITION BY country, date, user_id_anonymous_wide) > 1,
'mixed',
'single') AS mix_app_web_user,
(COUNT(new_session) OVER (PARTITION BY country, date, user_id_anonymous_wide ORDER BY timestamp_microsec)) + 1 AS cust_session_counter
FROM ( # define session counter
SELECT
*,
IF
((timestamp_microsec-prev_timestamp_microsec) > 2400000000,
'new',
NULL) AS new_session
FROM ( # Where timestamp is greater than 40 mins (actually 30 mins, but some margin is applied to be sure)
SELECT
*,
IF
(user_id_anonymous_wide IS NOT NULL,
LAG(timestamp_microsec,1) OVER (PARTITION BY country, date, user_id_anonymous_wide ORDER BY timestamp_microsec),
NULL) AS prev_timestamp_microsec
FROM ( # define previous timestamp to calculate difference in timestamp between consecutive hits
SELECT
*,
MAX(user_id_anonymous) OVER (PARTITION BY country, date, custvisitid RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS user_id_anonymous_wide, # user_id_anonymous_wide: propagate the user_id_anonymous value (initially present on only 1 of the hits) to all hits within the existing session id
IF
(webshop = 'appshop',
os,
NULL) AS app_os
FROM (
# SELECT many dimensions FROM multiple tables (merged into 1 table with UNION ALLs)
) ))))))
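The inner layers of the query are a standard sessionization pattern: take the previous timestamp with LAG, flag a new session when the gap exceeds the threshold, and turn the running count of flags into a session counter. A minimal sketch of that pattern in Python, with hypothetical timestamps in microseconds:

```python
SESSION_GAP = 2_400_000_000  # 40 minutes in microseconds, as in the query

# Hypothetical hit timestamps for one user/partition, already sorted.
timestamps = [0, 1_000_000, 3_000_000_000, 3_100_000_000, 7_000_000_000]

session_counters = []
counter = 1  # COUNT(new_session) OVER (...) + 1 starts at 1
prev = None
for ts in timestamps:
    # Equivalent of LAG(timestamp_microsec) plus the gap check.
    if prev is not None and ts - prev > SESSION_GAP:
        counter += 1
    session_counters.append(counter)
    prev = ts

print(session_counters)  # [1, 1, 2, 2, 3]
```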
Update: I fixed the issue by splitting the query into two queries.

Querying "in" an eventAction array in BigQuery

I apologize if this has been asked before, but I can't seem to find a doc describing how to do this. We are importing our GA data into BigQuery. I simply need to see which visitors on our site have viewed two or more pages and completed at least one of a few actions. I am fairly new to BQ, and the docs I have read talk of using UNNEST; unfortunately, this is the issue I am seeing when I run this query:
SELECT visitId, totals.pageviews
FROM `analytics-acquisition-funnel.119485123.ga_sessions_20181009`
WHERE totals.pageviews > 2
  AND 'modal-click' IN UNNEST(hits.eventInfo.eventAction)
ORDER BY totals.pageviews DESC
LIMIT 100000
I get the following error; shouldn't this work? I have been reading this doc, but I feel like my use case is simpler than most shown:
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#scanning-arrays
Cannot access field eventInfo on a value with type ARRAY> at [2:30]
Below is for BigQuery Standard SQL
#standardSQL
SELECT visitId, totals.pageviews
FROM `analytics-acquisition-funnel.119485123.ga_sessions_20181009`
WHERE totals.pageviews > 2
AND (SELECT COUNTIF(eventInfo.eventAction = 'modal-click') FROM UNNEST(hits)) > 0
ORDER BY totals.pageviews DESC
LIMIT 100000
OR
#standardSQL
SELECT visitId, totals.pageviews
FROM `analytics-acquisition-funnel.119485123.ga_sessions_20181009`
WHERE totals.pageviews > 2
AND EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventInfo.eventAction = 'modal-click')
ORDER BY totals.pageviews DESC
LIMIT 100000
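Both variants treat each session's repeated hits field as a small table and keep the session if at least one hit matches; the EXISTS form corresponds to Python's any(). The records below are hypothetical, simplified stand-ins for the GA export schema:

```python
# Hypothetical, simplified session records.
sessions = [
    {"visitId": 1, "pageviews": 5,
     "hits": [{"eventAction": "modal-click"}, {"eventAction": "scroll"}]},
    {"visitId": 2, "pageviews": 3,
     "hits": [{"eventAction": "scroll"}]},
    {"visitId": 3, "pageviews": 1,
     "hits": [{"eventAction": "modal-click"}]},
]

# EXISTS(SELECT 1 FROM UNNEST(hits) WHERE eventAction = 'modal-click')
qualifying = [
    s["visitId"]
    for s in sessions
    if s["pageviews"] > 2
    and any(h["eventAction"] == "modal-click" for h in s["hits"])
]
print(qualifying)  # [1]
```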

Sqlite fetch a particular value not present within the specified limit

I have a table Customer which has the following columns,
user_name,current_id,id,params,display,store.
I am writing a query like this,
SELECT * FROM Customer WHERE user_name='Mike' AND current_id='9845' AND id='Get_Owner' AND params='owner=1' order by(display) limit 6 offset 0
Now, there are times when I want to fetch a particular value that is not in the first six; I want to fetch that particular value plus 5 other values, in the same way as above. How can I do that?
For example I want something like this
SELECT * FROM Customer WHERE user_name='Mike' AND current_id='9845' and id='Get_Owner' AND params='owner=1' AND stored='Shelly.Am'
I want Shelly.Am and 5 other values, as in my first query.
You can combine two queries by using a compound query.
The ORDER BY/LIMIT clauses would apply to the entire compound query, so the second query must be moved into a subquery:
SELECT *
FROM Customer
WHERE user_name='Mike'
AND current_id='9845'
AND id='Get_Owner'
AND params='owner=1'
AND stored='Shelly.Am'
UNION ALL
SELECT *
FROM (SELECT *
FROM Customer
WHERE user_name='Mike'
AND current_id='9845'
AND id='Get_Owner'
AND params='owner=1'
AND stored!='Shelly.Am'
ORDER BY display
LIMIT 5);
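Here is a runnable sketch of the compound query using Python's built-in sqlite3 module, with a toy table containing only the two columns the trick depends on (the other WHERE conditions are omitted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (stored TEXT, display INTEGER)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [("Shelly.Am", 99)] + [(f"user{i}", i) for i in range(10)],
)

# Pinned row first, then the top 5 of the remaining rows by display order.
rows = conn.execute("""
    SELECT * FROM Customer WHERE stored = 'Shelly.Am'
    UNION ALL
    SELECT * FROM (SELECT * FROM Customer
                   WHERE stored != 'Shelly.Am'
                   ORDER BY display
                   LIMIT 5)
""").fetchall()
print(rows)
```

This prints the Shelly.Am row followed by user0 through user4, six rows in total.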

Selecting unique rows from a table that are not found in two other tables in sqlite

I have three tables (forward, reverse and probe) that all contain the same columns (Query, QueryLength, Hit, Start, End, Strand, Description, Length, Percent_id, Score, Evalue).
I want to get the unique rows from 'forward' where the 'Hit' is not found in either the 'reverse' table or the 'probe' table. With 'AND' I don't get any results, with 'OR' I get the comparison only with reverse.
CREATE TABLE f AS SELECT * FROM forward WHERE forward.Hit NOT IN (SELECT Hit from reverse) OR (SELECT Hit FROM probe)
Thanks for your help.
When you write ... OR (SELECT Hit FROM probe), you have a scalar subquery, i.e., the database looks only at the first value returned by the subquery, and treats it as 'true' if the value is non-zero.
To check whether a value is found in another table, you have to use IN (in both cases):
SELECT * FROM forward
WHERE Hit NOT IN (SELECT Hit FROM reverse)
AND Hit NOT IN (SELECT Hit FROM probe)
If you want to check entire records instead of only one column, you can use EXCEPT:
SELECT * FROM forward
EXCEPT
SELECT * FROM reverse
EXCEPT
SELECT * FROM probe
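A runnable sqlite3 check of the NOT IN version, with toy single-column tables (the real tables have more columns, but only Hit matters here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("forward", "reverse", "probe"):
    conn.execute(f"CREATE TABLE {t} (Hit TEXT)")
conn.executemany("INSERT INTO forward VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("INSERT INTO reverse VALUES ('b')")
conn.execute("INSERT INTO probe VALUES ('c')")

# Rows from forward whose Hit appears in neither other table.
rows = conn.execute("""
    SELECT * FROM forward
    WHERE Hit NOT IN (SELECT Hit FROM reverse)
      AND Hit NOT IN (SELECT Hit FROM probe)
""").fetchall()
print(rows)  # [('a',)]
```

One caveat worth knowing: NOT IN matches no rows at all if the subquery yields a NULL, so EXCEPT (or NOT EXISTS) is the safer choice when Hit can be NULL.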

How to randomly delete 20% of the rows in a SQLite table

Good afternoon. We were wondering how to randomly delete 20% of the rows in an SQLite table with 15,000 rows. We noticed that this question was solved on Stack Overflow for SQL Server: Select n random rows from SQL Server table.
But the SQL Server script does not appear to function properly in SQLite. How can we convert the SQL Server script to an SQLite equivalent? Thank you.
Alternatively, since the random() function in SQLite returns a signed 64-bit integer, we can calculate a cut-off within this space as (2^63) * 0.6. Signed integers greater than this make up 40% of the positive signed 64-bit integers, and therefore 20% of the whole set.
Truncated to the integer below, this is 5534023222112865484.
Therefore you should be able to get 20% of your rows with a simple:
SELECT * FROM table WHERE random() > 5534023222112865484
Or, in your case, since you want to delete that many:
DELETE FROM table WHERE random() > 5534023222112865484
I hope you enjoy this approach. It may actually be suitable if you want high performance from such an operation, but it could be hardware- or version-dependent, so it is probably not worth the risk.
Not quite 'random', but if you have an identity column on the table you could DELETE FROM mytable WHERE ID % 5 = 0, which should statistically delete very close to a fifth of the rows.
Try:
DELETE FROM TABLE
WHERE ROWID IN (SELECT ROWID FROM TABLE ORDER BY RANDOM() LIMIT 3000)
If you want to calculate the 20% in a subquery: LIMIT (SELECT CAST(COUNT(id) * 0.2 AS INT) FROM TABLE)
SQLite - ORDER BY RAND() provides a hint, so this may work:
DELETE FROM table WHERE id IN(
SELECT id FROM table ORDER BY RANDOM() LIMIT (
SELECT CAST( ( COUNT(id) * 0.2 ) AS INT ) FROM table
)
);
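The last variant can be verified end to end with Python's sqlite3. The toy table below has 100 rows, so the scalar subquery in LIMIT computes 20:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t (id) VALUES (?)", [(i,) for i in range(1, 101)])

# ORDER BY RANDOM() picks the victims; the scalar subquery in LIMIT
# computes 20% of the current row count.
conn.execute("""
    DELETE FROM t WHERE id IN (
        SELECT id FROM t ORDER BY RANDOM() LIMIT (
            SELECT CAST(COUNT(id) * 0.2 AS INT) FROM t
        )
    )
""")
remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # 80
```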
