Change the way Join Operator renames like columns - azure-data-explorer

Is there a way to change how the join operator appends a '1' to the end of columns that have the same name in both joined tables? I would like to do this without renaming the columns explicitly.
datatable(TableKey:string, CreatedDate:datetime)
['1', datetime('2022-01-01')]
| join kind=inner (
    datatable(TableKey:string, CreatedDate:datetime)
    ['1', datetime('2022-01-02')]
) on TableKey
Result:
| TableKey | CreatedDate | TableKey1 | CreatedDate1 |
| -------- | --------------------------- | --------- | --------------------------- |
| 1 | 2022-01-01 00:00:00.0000000 | 1 | 2022-01-02 00:00:00.0000000 |
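As far as I know, the suffix itself is not configurable. The usual workaround (shown as a sketch below, and admittedly an explicit rename, which the question hopes to avoid) is to project-rename the clashing right-hand columns before the join and drop the duplicated key afterwards:

datatable(TableKey:string, CreatedDate:datetime)
['1', datetime('2022-01-01')]
| join kind=inner (
    datatable(TableKey:string, CreatedDate:datetime)
    ['1', datetime('2022-01-02')]
    // rename before the join so CreatedDate no longer collides
    | project-rename RightCreatedDate = CreatedDate
) on TableKey
// the join key still comes back twice, so drop the right-hand copy
| project-away TableKey1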

Related

SQL: grouping to have exact rows

Let's say there is a schema:
|date|value|
DBMS is SQLite.
I want to get N groups and calculate AVG(value) for each of them.
Sample:
2020-01-01 10:00|2.0
2020-01-01 11:00|2.0
2020-01-01 12:00|3.0
2020-01-01 13:00|10.0
2020-01-01 14:00|2.0
2020-01-01 15:00|3.0
2020-01-01 16:00|11.0
2020-01-01 17:00|2.0
2020-01-01 18:00|3.0
Result (N=3):
2020-01-01 11:00|7.0/3
2020-01-01 14:00|15.0/3
2020-01-01 17:00|16.0/3
I need to use a windowing function, like NTILE, but it seems NTILE is not usable after GROUP BY. It can create buckets, but then how can I use these buckets for aggregation?
SELECT
/*AVG(*/value/*)*/,
NTILE (3) OVER (ORDER BY date) bucket
FROM
test
/*GROUP BY bucket*/
/*GROUP BY NTILE (3) OVER (ORDER BY date) bucket*/
I also dropped the test data and this query into DBFiddle.
You can use the NTILE() window function to create the groups and then aggregate:
SELECT
    -- the midpoint between the bucket's first and last timestamp
    DATETIME(MIN(date), ((STRFTIME('%s', MAX(date)) - STRFTIME('%s', MIN(date))) / 2) || ' second') date,
    ROUND(AVG(value), 2) avg_value
FROM (
    SELECT *, NTILE(3) OVER (ORDER BY date) grp
    FROM test
)
GROUP BY grp;
Note that the number 3 inside the parentheses of NTILE() is the number of buckets, not the bucket size; change it to change how many groups (and therefore how many rows per group) you get.
See the demo.
Results:
| date | avg_value |
| ------------------- | --------- |
| 2020-01-01 11:00:00 | 2.33 |
| 2020-01-01 14:00:00 | 5 |
| 2020-01-01 17:00:00 | 5.33 |
> I need to use a windowing function, like NTILE, but it seems NTILE is not usable after GROUP BY. It can create buckets, but then how can I use these buckets for aggregation?
You first use NTILE to assign bucket numbers in a subquery, then group by it in an outer query.
Using a sub-query:
SELECT bucket
     , AVG(value) AS avg_value
FROM ( SELECT value
            , NTILE(3) OVER ( ORDER BY date ) AS bucket
       FROM test
     ) x
GROUP BY bucket
ORDER BY bucket
Using a WITH clause:
WITH x AS (
    SELECT date
         , value
         , NTILE(3) OVER ( ORDER BY date ) AS bucket
    FROM test
)
SELECT bucket
     , COUNT(*) AS bucket_size
     , MIN(date) AS from_date
     , MAX(date) AS to_date
     , MIN(value) AS min_value
     , AVG(value) AS avg_value
     , MAX(value) AS max_value
     , SUM(value) AS sum_value
FROM x
GROUP BY bucket
ORDER BY bucket
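For the nine sample rows above, this yields the following (computed by hand from the sample data, not from a live run; averages rounded to two decimals):

| bucket | bucket_size | from_date | to_date | min_value | avg_value | max_value | sum_value |
| ------ | ----------- | ---------------- | ---------------- | --------- | --------- | --------- | --------- |
| 1 | 3 | 2020-01-01 10:00 | 2020-01-01 12:00 | 2.0 | 2.33 | 3.0 | 7.0 |
| 2 | 3 | 2020-01-01 13:00 | 2020-01-01 15:00 | 2.0 | 5.0 | 10.0 | 15.0 |
| 3 | 3 | 2020-01-01 16:00 | 2020-01-01 18:00 | 2.0 | 5.33 | 11.0 | 16.0 |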

Grouping query in Redshift takes a huge amount of time

I have the following requirement: I have a table in the following format,
and this is what I want it to be transformed into:
Basically, I want the number of users for each combination of activities.
I want this format because I want to create a TreeMap visualization out of it.
This is what I have done till now.
First, find the number of users for each activity grouping:
WITH lookup AS (
    SELECT listagg(name, ',') AS groupings,
           processed_date,
           guid
    FROM warehouse.test
    GROUP BY processed_date,
             guid
)
SELECT groupings AS activity_groupings,
       LENGTH(groupings) - LENGTH(REPLACE(groupings, ',', '')) + 1 AS count,
       processed_date,
       COUNT(guid) AS users
FROM lookup
GROUP BY processed_date,
         groupings
I put the results into a separate table.
Then I do a split and coalesce like this:
-- (outer SELECT reconstructed from the GROUP BY; the pasted snippet was clipped)
SELECT grouping_1,
       grouping_2,
       grouping_3,
       SUM(num_users) AS num_users
FROM (SELECT NULLIF(SPLIT_PART(groupings, ',', 1), '') AS grouping_1,
             COALESCE(NULLIF(SPLIT_PART(groupings, ',', 2), ''), grouping_1) AS grouping_2,
             COALESCE(NULLIF(SPLIT_PART(groupings, ',', 3), ''), grouping_2, grouping_1) AS grouping_3,
             num_users
      FROM warehouse.groupings) AS expr_qry
GROUP BY grouping_1,
         grouping_2,
         grouping_3
The problem is that the first query takes more than 90 minutes to execute, as I have more than 250M rows.
There must be a better, more efficient way to do this.
Any pointers would be greatly appreciated.
Thanks
You do not need complex string-manipulation functions (LISTAGG(), SPLIT_PART()). You can achieve what you're after with the ROW_NUMBER() function and simple aggregates.
-- Create sample data
CREATE TEMP TABLE test_data (id, guid, name) AS
          SELECT 1::INT, 1::INT, 'cooking'
UNION ALL SELECT 2::INT, 1::INT, 'cleaning'
UNION ALL SELECT 3::INT, 2::INT, 'washing'
UNION ALL SELECT 4::INT, 4::INT, 'cooking'
UNION ALL SELECT 6::INT, 5::INT, 'cooking'
UNION ALL SELECT 7::INT, 3::INT, 'cooking'
UNION ALL SELECT 8::INT, 3::INT, 'cleaning'
;
-- Assign a row number to each name per guid
WITH name_order AS (
    SELECT guid
         , name
         , ROW_NUMBER() OVER (PARTITION BY guid ORDER BY id) row_n
    FROM test_data
)
-- Use MAX() to collapse each guid's data to 1 row
, groupings AS (
    SELECT guid
         , MAX(CASE WHEN row_n = 1 THEN name END) grouping_1
         , MAX(CASE WHEN row_n = 2 THEN name END) grouping_2
    FROM name_order
    GROUP BY guid
)
-- Count the guids per grouping
SELECT grouping_1
     , COALESCE(grouping_2, grouping_1) AS grouping_2
     , COUNT(guid) num_users
FROM groupings
GROUP BY 1, 2
;
-- Output
grouping_1 | grouping_2 | num_users
------------+------------+-----------
washing | washing | 1
cooking | cleaning | 2
cooking | cooking | 2
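The question's data allows up to three activities per user (hence grouping_3 in the SPLIT_PART query); extending this pattern is just one more MAX(CASE ...) branch per slot. A sketch of the extra lines, reusing the column names above:

-- inside the groupings CTE:
, MAX(CASE WHEN row_n = 3 THEN name END) grouping_3
-- and in the final SELECT (with GROUP BY 1, 2, 3):
, COALESCE(grouping_3, grouping_2, grouping_1) AS grouping_3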

BQ array lookup: similar to NTH, but based on index, not position

The NTH function is really useful for extracting nested array elements in BQ, but its utility for a given table depends on each row's nested array containing the same number of elements, in the same order. If I have a 2+ column nested array where one column is a variable name/ID, and different rows' arrays have inconsistent naming and/or ordering, is there an elegant way to fetch/pivot a variable based on the variable name/ID?
For example, if row1 has customDimensions array:
index value
4 aaa
23 bbb
70 ccc
and row2 has customDimensions array:
index value
4 ddd
70 eee
I'd want to run something like
SELECT
NTHLOOKUP(70, customdims.index, customdims.value) as val70,
NTHLOOKUP(4, customdims.index, customdims.value) as val4,
NTHLOOKUP(23, customdims.index, customdims.value) as val23
from my_table;
And get:
val70 val4 val23
ccc aaa bbb
eee ddd (null)
I've been able to get this sort of result by making a subquery for each desired index value, unnesting the array in each and filtering WHERE index = (value), but that gets really ugly as the variables pile up. Is there an alternative?
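For reference, the per-index subquery approach described above looks roughly like this (a sketch against the sample customDimensions arrays; one scalar subquery per desired index):

SELECT
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 70) AS val70,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 4) AS val4,
  (SELECT value FROM UNNEST(customDimensions) WHERE index = 23) AS val23
FROM my_table;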
EDIT: Based on Mikhail's answer below (thank you!!) I was able to write my query more elegantly. Not quite as slick as an NTHLOOKUP, but I'll take it:
select id,
       max(case when index = 41 then value[OFFSET(0)] else '' end) as val41,
       max(case when index = 59 then value[OFFSET(0)] else '' end) as val59
from (
    select concat(array1.thing1, array1.thing2) as id,
           cd.index,
           ARRAY_AGG(distinct cd.value) as value
    FROM my_table g,
         unnest(array1) as array1,
         unnest(array1.customDimensions) as cd
    where index in (41, 59)
    group by 1, 2
    order by 1, 2
) x
group by 1
order by 1
The best I can "offer" is below (BigQuery Standard SQL):
#standardSQL
WITH `project.dataset.my_table` AS (
    SELECT ARRAY<STRUCT<index INT64, value STRING>>
        [(4, 'aaa'), (23, 'bbb'), (70, 'ccc')] customDimensions
    UNION ALL
    SELECT ARRAY<STRUCT<index INT64, value STRING>>
        [(4, 'ddd'), (70, 'eee')] customDimensions
)
SELECT cd.index, ARRAY_AGG(cd.value) VALUES
FROM `project.dataset.my_table`,
     UNNEST(customDimensions) cd
GROUP BY cd.index
with the result as below:

| Row | index | values |
| --- | ----- | -------- |
| 1 | 4 | aaa, ddd |
| 2 | 23 | bbb |
| 3 | 70 | ccc, eee |
I would recommend staying with this flattened version, as it serves most of the practical cases I can think of.
But if you still want to pivot it further, there are quite a number of posts about how to pivot in BigQuery.
> I've been able to get this sort of result by making a subquery for each desired index value, unnesting the array in each and filtering WHERE index = (value), but that gets really ugly as the variables pile up. Is there an alternative?
Yes, you can use a user-defined function to encapsulate the common logic. For example,
CREATE TEMP FUNCTION NTHLOOKUP(
  targetIndex INT64,
  customDimensions ARRAY<STRUCT<index INT64, value STRING>>
) AS (
  (SELECT value
   FROM UNNEST(customDimensions)
   WHERE index = targetIndex)
);

SELECT
  NTHLOOKUP(70, customDimensions) AS val70,
  NTHLOOKUP(4, customDimensions) AS val4,
  NTHLOOKUP(23, customDimensions) AS val23
FROM my_table;
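Run against the two sample rows from the question, this should reproduce the desired pivot; the scalar subquery returns NULL where an array has no matching index, which is where the (null) for val23 comes from:

val70 val4 val23
ccc aaa bbb
eee ddd (null)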

Zipping rows with the same "key" while joining tables

I have two tables: one with objects and one with properties of the objects. Both tables have a person ID and a date as their "key", but since one person can place multiple orders of objects on a single day, the match is not unique. I do know, however, that the entries were entered in the same order in both tables, so it is possible to join on that order whenever the PersonID and Date are the same.
This is what I want to accomplish:
Table 1:

| PersonID | Date | Object |
| -------- | ---------- | ------ |
| 1 | 20-08-2013 | A |
| 2 | 13-11-2013 | B |
| 2 | 13-11-2013 | C |
| 2 | 13-11-2013 | D |
| 3 | 21-11-2013 | E |

Table 2:

| PersonID | Date | Property |
| -------- | ---------- | -------- |
| 4 | 05-05-2013 | $ |
| 1 | 20-08-2013 | ^ |
| 2 | 13-11-2013 | / |
| 2 | 13-11-2013 | * |
| 2 | 13-11-2013 | + |
| 3 | 21-11-2013 | & |

Result:

| PersonID | Date | Object | Property |
| -------- | ---------- | ------ | -------- |
| 4 | 05-05-2013 | | $ |
| 1 | 20-08-2013 | A | ^ |
| 2 | 13-11-2013 | B | / |
| 2 | 13-11-2013 | C | * |
| 2 | 13-11-2013 | D | + |
| 3 | 21-11-2013 | E | & |
So what I want to do is join the two tables and "zip" the groups of entries that share the same (PersonID, Date) "key".
Something called "Slick" seems to have this (see here), but I'd like to do it in SQLite.
Any advice would be amazing!
You are on the right track. Why not just do a LEFT JOIN between the tables, like this:
select t2.PersonID,
       t2.Date,
       t1.Object,
       t2.Property
from table2 t2
left join table1 t1 on t2.PersonID = t1.PersonID
order by t2.PersonID
Use an additional column to make every key unique in both tables. For example, in SQLite you could use RowIDs to keep track of the insertion order. Storing this additional column in the database itself might be useful for other queries as well, but you do not have to store it.
First add the column ID to both tables; the DDL queries should now look like this (make sure you do not add a primary key constraint until both tables are filled):
CREATE TABLE table1 (
    ID,
    PersonID,
    Date,
    Object
);
CREATE TABLE table2 (
    ID,
    PersonID,
    Date,
    Property
);
Now populate the ID column. You can adjust the ID to your liking. Make sure you do this for table2 as well:
UPDATE table1
SET ID = (
    SELECT table1.PersonID || '-' || table1.Date || '-' || count( * )
    FROM table1 tB
    WHERE table1.RowID >= tB.RowID
      AND table1.PersonID == tB.PersonID
      AND table1.Date == tB.Date
);
Now you can join them:
SELECT t2.PersonID,
       t2.Date,
       t1.Object,
       t2.Property
FROM table2 t2
LEFT JOIN table1 t1
       ON t2.ID = t1.ID;
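On SQLite 3.25 or newer, the same zip works without storing an ID column at all, by pairing rows with ROW_NUMBER() (a sketch under that version assumption; rowid supplies the insertion order, just as above):

SELECT t2.PersonID,
       t2.Date,
       t1.Object,
       t2.Property
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY PersonID, Date ORDER BY rowid) AS rn
      FROM table2) t2
LEFT JOIN (SELECT *, ROW_NUMBER() OVER (PARTITION BY PersonID, Date ORDER BY rowid) AS rn
           FROM table1) t1
  ON t1.PersonID = t2.PersonID
 AND t1.Date = t2.Date
 AND t1.rn = t2.rn;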

How to concatenate a selection of sqlite columns spread over multiple tables?

I am aware that you can concatenate multiple columns within a single table with a query like this:
SELECT ( column1 || column2 || column3 || ... ) AS some_name FROM some_table
Is it possible to concatenate columns from multiple tables with a single sqlite query?
Very simple example - Two tables and the result:
Table1       | Table2 | Result
Col1   Col2  | Col3   |
-------------+--------+-------
A      B     | C      | ABC
If this is possible, what would the query be?
I can always manually concatenate multiple single-table concatenation results, but having sqlite do all the work would be great :)
Try this:
SELECT a.col1 || a.col2 || b.col3 AS "SUM"
FROM table1 a, table2 b
WHERE a.id = b.id;
In the WHERE clause you have to specify the join condition between the tables.
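A minimal self-contained check (a sketch; the id column is hypothetical, since the example tables as posted share no join key):

CREATE TABLE table1 (id INTEGER, col1 TEXT, col2 TEXT);
CREATE TABLE table2 (id INTEGER, col3 TEXT);
INSERT INTO table1 VALUES (1, 'A', 'B');
INSERT INTO table2 VALUES (1, 'C');

-- returns one row: ABC
SELECT a.col1 || a.col2 || b.col3 AS "SUM"
FROM table1 a, table2 b
WHERE a.id = b.id;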
