Grouping by quantiles with large data in BigQuery - window-functions

I have a large dataset of millions of rows with values x and y, and want to get avg(y) within different quantiles of x. One way to do that is the code below, but with a large dataset the rank() is too intensive and I get a memory usage error in BigQuery.
SELECT
  CAST(100 * ord / num_rows AS INT64) AS percentile,
  AVG(y) AS avg_y
FROM
(
  SELECT
    RANK() OVER (ORDER BY x) AS ord,
    COUNT(*) OVER () AS num_rows,
    y
  FROM table
)
GROUP BY 1
I understand that window functions are expensive for large datasets, but as I'm only interested in a bucketized percentile at low granularity, this should be computationally feasible. Is there a way to achieve this in BQ?
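One possible direction (a sketch added here, not from the original thread): since only ~100 buckets are needed, the exact rank() can be replaced with approximate percentile boundaries from APPROX_QUANTILES, and RANGE_BUCKET can then assign each row to a bucket. The table name is a placeholder and the resulting percentiles are approximate.

WITH bounds AS (
  -- 101 approximate boundaries of x: min, 1st percentile, ..., max
  SELECT APPROX_QUANTILES(x, 100) AS qs
  FROM `yourdataset.yourtable`
)
SELECT
  RANGE_BUCKET(t.x, b.qs) - 1 AS percentile,  -- roughly 0..100
  AVG(t.y) AS avg_y
FROM `yourdataset.yourtable` AS t
CROSS JOIN bounds AS b
GROUP BY percentile
ORDER BY percentile

This avoids sorting every row on a single worker, which is what makes rank() over the whole table exceed memory.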

Related

Combining COUNTA() and AVERAGEX() in DAX / Power BI

I have a simple data set with training sessions for some athletes. Let's say I want to visualize the average number of training sessions per athlete, either in total or split by the clubs that exist. I hope the data set is somewhat self-describing.
To normalize the number of activities by the number of athletes I use two measures:
TotalSessions = COUNTA(Tab_Sessions[Session key])
AvgAthlete = AVERAGEX(VALUES(Tab_Sessions[Athlete]),[TotalSessions])
I use AvgAthlete as the value in both visuals shown below. If I filter on the clubs, the values are as expected, but with no filter applied I get some strange values.
What I guess happens is that since Athlete B doesn't do any strength training, Athlete B is not included in the norming factor for strength. Is there a DAX function that can solve this?
If I didn't have the training sessions as a hierarchy (Type-Intensity), it would be pretty straightforward to do some kind of workaround with a calculated column, but that won't work with hierarchical categories. The expected results, calculated in Excel, are shown below:
Data set as csv:
Session key;Club;Athlete;Type;Intensity
001;Fast runners;A;Cardio;High
002;Fast runners;A;Strength;Low
003;Fast runners;B;Cardio;Low
004;Fast runners;B;Cardio;High
005;Fast runners;B;Cardio;High
006;Brutal boxers;C;Cardio;High
007;Brutal boxers;C;Strength;High
If you specifically want to aggregate this across whatever choice you have made in your Club selection, then you can write a simple measure that does that:
AvgAthlete =
VAR _athletes =
    CALCULATE (
        DISTINCTCOUNT ( 'Table'[Athlete] ),
        ALLEXCEPT ( 'Table', 'Table'[Club] )
    )
RETURN
    DIVIDE ( [Sessions], _athletes )
Here we use a distinct count of values in the Athlete column, with all filters removed apart from on the Club column. This is, as far as I interpret your question, the denominator you are after.
Divide the total number of sessions by this number of athletes.

Can I speed up calculations between R and Sqlite by using data.tables?

I have a sqlite database of about 1.4 million rows and 16 columns.
I have to run an operation on 80,000 ids:
Get all rows associated with that id
Convert to R date objects and sort by date
Calculate the difference between the 2 most recent dates
For each id I have been querying sqlite from R using dbSendQuery and dbFetch for step 1, while steps 2 and 3 are done in R. Is there a faster way? Would it be faster or slower to load the entire sqlite table into a data.table ?
It heavily depends on how you are working on that problem.
Normally, loading the whole query result into memory and then doing the operations on it will be faster, from what I have experienced and from benchmarks I have seen, though I can't show you one right now. Logically it makes sense, because otherwise you have to repeat several operations multiple times on multiple data.frames. In my experience, 80k rows are handled pretty fast, faster than three separate chunks of ~26k rows.
However, you could have a look at the parallel package and use multiple cores on your machine to load subsets of your data and process them in parallel, one subset per core.
Here you can find information on how to do this:
http://jaehyeon-kim.github.io/2015/03/Parallel-Processing-on-Single-Machine-Part-I
If you're doing all that in R and fetching rows from the database 80,000 times in a loop... you'll probably have better results doing it all in one go in sqlite instead.
Given a skeleton table like:
CREATE TABLE data(id INTEGER, timestamp TEXT);
INSERT INTO data VALUES (1, '2019-07-01'), (1, '2019-06-25'), (1, '2019-06-24'),
(2, '2019-04-15'), (2, '2019-04-14');
CREATE INDEX data_idx_id_time ON data(id, timestamp DESC);
a query like:
SELECT id
     , julianday(first_ts)
       - julianday((SELECT max(d2.timestamp)
                    FROM data AS d2
                    WHERE d.id = d2.id AND d2.timestamp < d.first_ts)) AS days_difference
FROM (SELECT id, max(timestamp) AS first_ts FROM data GROUP BY id) AS d
ORDER BY id;
will give you
id          days_difference
----------  ---------------
1           6.0
2           1.0
An alternative for modern versions of sqlite (3.25 or newer) (EDIT: On a test database with 16 million rows and 80000 distinct ids, it runs considerably slower than the above one, so you don't want to actually use it):
WITH cte AS
  (SELECT id, timestamp
        , lead(timestamp, 1) OVER id_by_ts AS next_ts
        , row_number() OVER id_by_ts AS rn
   FROM data
   WINDOW id_by_ts AS (PARTITION BY id ORDER BY timestamp DESC))
SELECT id, julianday(timestamp) - julianday(next_ts) AS days_difference
FROM cte
WHERE rn = 1
ORDER BY id;
(The index is essential for performance in both versions. You probably also want to run ANALYZE on the table at some point after it's populated and your index(es) are created.)

BigQuery export to csv and load in R for huge tables

I have a 100 GB table which I want to process in R. When I export it to CSV I get 500 CSV files; when I read them into R as data.tables and bind them, I get a huge data table that can't be saved or loaded (even when I increase the memory of the virtual instance that R is installed on). I wanted to try a different approach: split the original table, export the parts, then process each table separately in R. The problem is that I don't want the split to "break" in the middle of some grouping. For example, my key variable is "visit", and each visit may have several rows. I don't want a visit to be broken across different sub-tables (because all my processing in R uses visit as the grouping variable of the data table). What is the best way to do it? I tried to order the visit ids by time, to export only their names to a separate csv, etc.; all the ORDER BY attempts ended with an error (not enough resources). The table currently contains more than 100M rows, with 64 variables.
I wanted to try a different approach - split the original table …
The problem is that I don't want the split to "break" in the middle of some grouping.
Below is how to identify batches such that rows for the same visitId end up in the same batch.
For each batch, the min and max visitId are identified, so you can then extract only the rows whose visitId lies between those min and max values, thus controlling the size of the data to be processed.
1 – Batching by number of rows
Replace 1000000 below with whatever you want the batch size to be, in number of rows.
#legacySQL
SELECT
  batch,
  SUM(size) AS size,
  COUNT(visitId) AS visitids_count,
  MIN(visitId) AS visitId_min,
  MAX(visitId) AS visitId_max
FROM (
  SELECT
    visitId,
    size,
    INTEGER(CEIL(total/1000000)) AS batch
  FROM (
    SELECT
      visitId,
      size,
      SUM(size) OVER(ORDER BY visitId) AS total
    FROM (
      SELECT visitId, COUNT(1) AS size
      FROM [yourproject:yourdataset.yourtable]
      GROUP BY visitId
    )
  )
)
GROUP BY batch
2 – Batching by bytes size of batch
Replace 1000000000 below with whatever you want the batch size to be, in bytes.
And replace 123 below with the estimated average size of one row in bytes.
#legacySQL
SELECT
  batch,
  SUM(size) AS size,
  COUNT(visitId) AS visitids_count,
  MIN(visitId) AS visitId_min,
  MAX(visitId) AS visitId_max
FROM (
  SELECT
    visitId,
    size,
    INTEGER(CEIL(total/1000000000)) AS batch
  FROM (
    SELECT
      visitId,
      size,
      SUM(size) OVER(ORDER BY visitId) AS total
    FROM (
      SELECT visitId, SUM(123) AS size
      FROM [yourproject:yourdataset.yourtable]
      GROUP BY visitId
    )
  )
)
GROUP BY batch
The above helps you prepare for properly splitting your original table using each batch's min and max values (a sketch of the extraction step follows the Standard SQL version below).
Hope this helps you proceed further.
Note: the above assumes a reasonably even distribution of rows across visitIds and a relatively large number of rows in the table (as in your example), so batches will be reasonably evenly sized.
Note 2: I realized I wrote this quickly in Legacy SQL, so below is a version in Standard SQL in case you want to migrate or are already using it.
#standardSQL
SELECT
  batch,
  SUM(size) AS size,
  COUNT(visitId) AS visitids_count,
  MIN(visitId) AS visitId_min,
  MAX(visitId) AS visitId_max
FROM (
  SELECT
    visitId,
    size,
    CAST(CEIL(total/1000000) AS INT64) AS batch
  FROM (
    SELECT
      visitId,
      size,
      SUM(size) OVER(ORDER BY visitId) AS total
    FROM (
      SELECT visitId, COUNT(1) AS size
      FROM `yourproject.yourdataset.yourtable`
      GROUP BY visitId
    )
  )
)
GROUP BY batch
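Once the batch boundaries are computed, pulling out a single batch is just a range filter on visitId. A minimal sketch (my addition, not part of the original answer; @batch_min and @batch_max are query parameters you would fill with one batch's visitId_min and visitId_max from the output above):

#standardSQL
SELECT *
FROM `yourproject.yourdataset.yourtable`
WHERE visitId BETWEEN @batch_min AND @batch_max

Each such slice can then be materialized and exported to CSV on its own, so no visit is split across files.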

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-SQL COUNT DISTINCT operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1).
This is going to be repeated over every column in every table in the DB. That includes hundreds of tables, some of which have over 1M rows. Because this needs to be done for every single column, adding indexes for every column is not a good option.
This will be done as part of an ASP.NET site, so integration with code logic is also OK (i.e. this doesn't have to be completed as part of one query, though if it can be done with good performance, then even better).
What is the most efficient way to do this?
Update After Testing
I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest
Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)
Testing Results (in seconds):
Method            20 Columns   5 Columns   1 Column (Large)   1 Column (Small)
1) Count/GroupBy        10.8         4.8                2.8               0.14
2) CountDistinct        12.4         4.8                3.0               0.7
3) dense_rank          226          30                  6                 4.33
4) Count+Max            98.5        44                 16                12.5
Notes:
Interestingly enough, the two methods that were fastest (by far, with only a small difference between them) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps that's because the gains achieved by limiting the number of table scans are small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale it just cannot compete with the performance of Count.
Thanks for the help and suggestions!
SELECT COUNT(*)
FROM (SELECT ColumnName
FROM TableName
GROUP BY ColumnName) AS s;
GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.
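To make the difference concrete, here is a tiny illustration (the temp table and values are made up purely for demonstration):

-- Three distinct non-NULL values plus one NULL
CREATE TABLE #t (ColumnName INT);
INSERT INTO #t (ColumnName) VALUES (1), (2), (2), (3), (NULL);

SELECT COUNT(DISTINCT ColumnName) FROM #t;                   -- 3: NULL is ignored

SELECT COUNT(*)
FROM (SELECT ColumnName FROM #t GROUP BY ColumnName) AS s;   -- 4: NULL forms its own group

DROP TABLE #t;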
I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.
;with C as
(
select dense_rank() over(order by Col1) as dnCol1,
dense_rank() over(order by Col2) as dnCol2
from YourTable
)
select max(dnCol1) as CountCol1,
max(dnCol2) as CountCol2
from C
Test the query at SE-Data
A development on OP's own solution:
SELECT
COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)
Select Count(Distinct COLUMNNAME) +
Case When Exists
(Select * from TABLENAME Where COLUMNNAME is Null)
Then 1 Else 0 End
From TABLENAME
You can try:
count(
distinct coalesce(
your_table.column_1, your_table.column_2
-- cast them if the columns are not the same type
)
) as COUNT_TEST
The coalesce function helps you combine two columns, replacing NULL values.
I used this in my case and got the correct result.
Not sure this would be the fastest, but it might be worth testing. Use CASE to give NULL a value. Clearly you would need to select a value for NULL that would not occur in the real data. According to the query plan this would be a dead heat with the count(*) (group by) solution proposed by Cheran S.
SELECT
COUNT( distinct
(case when [testNull] is null then 'dbNullValue' else [testNull] end)
)
FROM [test].[dbo].[testNullVal]
With this approach you can also count more than one column:
SELECT
COUNT( distinct
(case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
),
COUNT( distinct
(case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
)
FROM [test].[dbo].[testNullVal]
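Since the counts have to be repeated over every column of every table, one way to drive whichever method you settle on is to generate the per-column queries from INFORMATION_SCHEMA. A sketch of my own (not from any of the answers above), here generating the Count/GroupBy form that includes NULL as its own group:

-- Emits one COUNT query per column in the database; run the generated
-- statements from your ASP.NET code or paste them into SSMS.
SELECT 'SELECT COUNT(*) AS distinct_count FROM (SELECT ' + QUOTENAME(c.COLUMN_NAME) +
       ' FROM ' + QUOTENAME(c.TABLE_SCHEMA) + '.' + QUOTENAME(c.TABLE_NAME) +
       ' GROUP BY ' + QUOTENAME(c.COLUMN_NAME) + ') AS s;'
FROM INFORMATION_SCHEMA.COLUMNS AS c;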

How to calculate amount of contingency tables?

If I want to calculate the number of k-dimensional contingency tables, which formula should I use?
For example, if I have 16 categorical variables in my dataset and want to calculate the number of 1-dimensional contingency tables, then it's clear, there is only 1 table. If I want to calculate the number of 2-dimensional contingency tables then I assume there are 120. But how do I calculate it? And what if I have many more variables and k-dimensional tables?
I'm searching for one equation that gives me the number of possible contingency tables, given the dimension (k) and the number of variables (n).
For moron - a contingency table is defined here.
Sebi - I think you do need to clarify the problem a bit, but let me plow ahead. If I had 16 categorical variables and need to define a contingency table for each pair of variables, that would be C(16,2) = 120 tables. (Combinations of 16 choose 2). Is that what you mean by k-dimension tables?
If so, the number of k dimension tables is simply C(16,k). The excel function is Combin(n,k).
C(16,3) = 560
C(16,4) = 1820
C(16,5) = 4368
C(16,6) = 8008... and so on....
If I understand this correctly, you are trying to select distinct subsets of size k from the n variables. I suspect the formula will be:
number of tables = n! / ( (n-k)! k!)
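As a quick check, this formula reproduces the 120 two-dimensional tables assumed in the question:

\[
\binom{16}{2} = \frac{16!}{2!\,(16-2)!} = \frac{16 \cdot 15}{2} = 120
\]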

Resources