The query below consumes more than 1 TB of spool space because the COL11 column has more than 5 million values.
SELECT A.COL1, B.COL2
FROM TAB1 A
JOIN TAB2 B
ON B.COL2= A.COL3 AND B.COL4= A.COL4 AND B.COL6= 'XYZ'
JOIN TAB3 D
ON D.COL5= A.COL5 AND D.COL4= A.COL4 AND D.COL6= 'EFG'
JOIN VIEW1 C
ON A.COL9= C.COL9 AND A.COL4= C.COL4 AND C.COL6='EFG' AND A.COL7= C.COL8
JOIN Vpd VPRD ON VPRD.COL10=D.COL10
WHERE A.COL4= 2017
AND ((C.COL8 IN('13')) OR (((C.COL8 IN('01', '1B', '8E')))))
AND ( ((B.COL11= 'ABC')))
AND ( ( A.COL1 BETWEEN '2017/07/01' AND '2017/08/01')
OR ( A.COL1 BETWEEN '2016/07/01' AND '2016/09/01') )
GROUP BY 1,2;
Its explain plan is as below:
6) We do an all-AMPs RETRIEVE step from VPRD by way of an
all-rows scan with a condition of ("NOT
(VPRD.COL10 IS NULL)") into Spool 4 (all_amps)
(compressed columns allowed), which is duplicated on all AMPs.
The size of Spool 4 is estimated with high confidence to be 2,160
rows (45,360 bytes). The estimated time for this step is 0.01
seconds.
7) We do an all-AMPs JOIN step from a single partition of
TAB2 by way of index # 8 "TAB2.COL4 = 2017,
TAB2.COL6 = 'XYZ ', TAB2.COL11 = 'ABC'" with no residual
conditions, which is joined to Spool 4 (Last Use) by way of an
all-rows scan. TAB2 and Spool 4 are joined using a
product join, with a join condition of ("(1=1)"). The result goes
into Spool 5 (all_amps) (compressed columns allowed), which is
duplicated on all AMPs. Then we do a SORT to order Spool 5 by the
hash code of (TAB2.COL4,VPRD.COL10, 'EFG'). The size of Spool 5
is estimated with high confidence to be 60,480 rows (1,814,400
bytes). The estimated time for this step is 0.01 seconds.
8) We execute the following steps in parallel.
1) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of
an all-rows scan, which is joined to TAB3 by way of
a traversal of index # 16 without accessing the base table
extracting row ids only. Spool 5 and TAB3 are
joined using a nested join, with a join condition of (
"(COL4 = TAB3.COL4) AND ((COL10 = TAB3.COL10) AND
(TAB3.COL6 = ('EFG' )))"). The result
goes into Spool 6 (all_amps), which is built locally on the
AMPs. Then we do a SORT to order Spool 6 by field Id 1. The
size of Spool 6 is estimated with low confidence to be
162,567 rows (6,340,113 bytes). The estimated time for this
step is 0.01 seconds.
2) We do an all-AMPs RETRIEVE step from
TAB3 by way of an all-rows scan
with a condition of "(TAB3.COL4 = 2017) AND
((TAB3.COL6 = 'EFG') AND((TAB3.COL8 IN
('13','01','1B','8E')) AND (TAB3.HRCY_LVL_ID = 'TERR')))")
into Spool 7 (all_amps) (compressed columns allowed), which
is duplicated on all AMPs. Spool 7 is built as in-memory
optimized spool with 4 column partitions. The size of Spool
7 is estimated with low confidence to be 187,380 rows (
8,994,240 bytes). The estimated time for this step is 0.01
seconds.
9) We do an all-AMPs JOIN step from Spool 6 (Last Use) by way of an
all-rows scan, which is joined to a single partition of
TAB3 with a condition of ("(TAB3.COL4 =
2017) AND (TAB3.COL6 = 'EFG')"). Spool 6
and TAB3 are joined using a row id join, with a join
condition of ("COL10 = TAB3.COL10").
The result goes into Spool 8 (all_amps) (compressed columns
allowed), which is redistributed by the rowkey of (
TAB3.COL5)), TAB2.COL2) to all
AMPs. Spool 8 is built as in-memory optimized spool with 4 column
partitions. The size of Spool 8 is estimated with low confidence
to be 162,567 rows (5,364,711 bytes). The estimated time for this
step is 0.04 seconds.
10) We do an all-AMPs JOIN step from Spool 7 (Last Use), which is
joined to 5 partitions of TAB1 with a condition of (
"(TAB1.COL4 = 2017) AND
((TAB1.COL1 IN (DATE '2016-07-01'TO DATE
'2016-09-01',DATE '2017-07-01'TO DATE '2017-08-01')) AND
(TAB1.COL7 IN ('13','8E','1B','01')))").
Spool 7 and TAB1 are joined using a in-memory dynamic
hash join, with a join condition of (
"(TAB1.COL9 = COL9) AND ((TAB1.COL4 = COL4) AND
(TAB1.COL7 = COL8 ))"). The result
goes into Spool 9 (all_amps) (compressed columns allowed), which
is redistributed by the rowkey of (TAB1.COL5)),
TAB1.COL3) to all AMPs. Spool 9 is built as
in-memory optimized spool with 4 column partitions. The size of
Spool 9 is estimated with low confidence to be 7,493,759 rows (
277,269,083 bytes). The estimated time for this step is 0.45
seconds.
11) We do an all-AMPs JOIN step from Spool 8 (Last Use), which is
joined to Spool 9 (Last Use). Spool 8 and Spool 9 are joined
using a single partition in-memory hash join, with a join
condition of ("(COL5 = COL5) AND (((COL4 =
COL4) AND ((COL4 = COL4) AND ((COL4 =
COL4) AND (COL4 = COL4 )))) AND (COL2 =
COL3 ))"). The result goes into Spool 3 (all_amps)
(compressed columns allowed), which is built locally on the AMPs.
The size of Spool 3 is estimated with low confidence to be 385,641
rows (8,869,743 bytes). The estimated time for this step is 0.03
seconds.
12) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 (
TAB1.COL1 ,TAB2.COL2).
Aggregate Intermediate Results are computed globally, then placed
in Spool 1. The size of Spool 1 is estimated with low confidence
to be 169 rows (6,253 bytes). The estimated time for this step is
0.02 seconds.
13) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.57 seconds.
If I rewrite the above query in the fashion below, I get the output using much less spool space:
SELECT * FROM(
SELECT A.COL1,
CASE WHEN (B.COL11= 'ABC') THEN B.COL2
ELSE NULL END AS COL2
FROM TAB1 A
JOIN TAB2 B
ON B.COL2= A.COL3 AND B.COL4= A.COL4 AND B.COL6= 'XYZ'
JOIN TAB3 D
ON D.COL5= A.COL5 AND D.COL4= A.COL4 AND D.COL6= 'EFG'
JOIN VIEW1 C
ON A.COL9= C.COL9 AND A.COL4= C.COL4 AND C.COL6='EFG' AND A.COL7= C.COL8
JOIN Vpd VPRD ON VPRD.COL10=D.COL10
WHERE A.COL4= 2017
AND ((C.COL8 IN('13')) OR (((C.COL8 IN('01', '1B', '8E')))))
AND ( ((B.COL11= 'ABC')))
AND ( ( A.COL1 BETWEEN '2017/07/01' AND '2017/08/01')
OR ( A.COL1 BETWEEN '2016/07/01' AND '2016/09/01') )
GROUP BY 1,2)X
WHERE X.COL2 IS NOT NULL
Its explain plan is as below:
6) We do an all-AMPs RETRIEVE step from VPRD by way of an
all-rows scan with a condition of ("NOT
(VPRD.COL10 IS NULL)") into Spool 4 (all_amps)
(compressed columns allowed), which is duplicated on all AMPs.
Then we do a SORT to order Spool 4 by the hash code of (
VPRD.COL10, 'EFG', 2017). The size of
Spool 4 is estimated with high confidence to be 2,160 rows (
47,520 bytes). The estimated time for this step is 0.00 seconds.
7) We execute the following steps in parallel.
1) We do an all-AMPs JOIN step from Spool 4 (Last Use) by way of
an all-rows scan, which is joined to TAB3 by way of
a traversal of index # 16 without accessing the base table
extracting row ids only. Spool 4 and TAB3 are
joined using a nested join, with a join condition of (
"(COL10 = TAB3.COL10) AND ((TAB3.COL4 = 2017) AND
(TAB3.COL6 = ('EFG' )))"). The result
goes into Spool 5 (all_amps), which is built locally on the
AMPs. Then we do a SORT to order Spool 5 by field Id 1. The
size of Spool 5 is estimated with no confidence to be 5,806
rows (179,986 bytes). The estimated time for this step is
0.01 seconds.
2) We do an all-AMPs RETRIEVE step from
TAB3 by way of an all-rows
scan with a condition of "(TAB3.COL4 = 2017) AND
((TAB3.COL6 = 'EFG') AND((TAB3.COL8 IN
('13','01','1B','8E')) AND (TAB3.HRCY_LVL_ID = 'TERR')))")
into Spool 6 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs. Spool 6 is built
as in-memory optimized spool with 4 column partitions. The
size of Spool 6 is estimated with low confidence to be
187,380 rows (8,994,240 bytes). The estimated time for this
step is 0.01 seconds.
8) We do an all-AMPs JOIN step from Spool 5 (Last Use) by way of an
all-rows scan, which is joined to a single partition of
TAB3 with a condition of ("(TAB3.COL4 =
2017) AND (TAB3.COL6 = 'EFG')"). Spool 5
and TAB3 are joined using a row id join, with a join
condition of ("COL10 = TAB3.COL10").
The result goes into Spool 7 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs. Spool 7 is built as
in-memory optimized spool with 4 column partitions. The size of
Spool 7 is estimated with no confidence to be 1,045,080 rows (
30,307,320 bytes). The estimated time for this step is 0.08
seconds.
9) We do an all-AMPs JOIN step from Spool 6 (Last Use), which is
joined to 5 partitions of TAB1 with a condition of (
"(TAB1.COL4 = 2017) AND ((TAB1.COL1 IN (DATE '2016-07-01'TO DATE
'2016-09-01',DATE '2017-07-01'TO DATE '2017-08-01')) AND
(TAB1.COL7 IN ('13','8E','1B','01')))").
Spool 6 and TAB1 are joined using a in-memory dynamic
hash join, with a join condition of (
"(TAB1.COL9 = COL9) AND ((TAB1.COL4 = COL4) AND
(TAB1.COL7 = COL8 ))"). The result
goes into Spool 8 (all_amps) (compressed columns allowed), which
is built locally on the AMPs. Spool 8 is built as in-memory
optimized spool with 4 column partitions. The size of Spool 8 is
estimated with low confidence to be 7,493,759 rows (337,219,155
bytes). The estimated time for this step is 0.28 seconds.
10) We do an all-AMPs JOIN step from Spool 7 (Last Use), which is
joined to Spool 8 (Last Use). Spool 7 and Spool 8 are joined
using a single partition in-memory hash join, with a join
condition of ("(COL5 = COL5) AND ((COL4 =
COL4) AND (COL4 = COL4 ))"). The result goes into
Spool 9 (all_amps) (compressed columns allowed), which is built
locally on the AMPs. Spool 9 is built as in-memory optimized
spool with 4 column partitions. The size of Spool 9 is estimated
with no confidence to be 182,784 rows (6,763,008 bytes). The
estimated time for this step is 0.03 seconds.
11) We do an all-AMPs RETRIEVE step from a single partition of
TAB2 with a condition of ("TAB2.COL4 = 2017")
with a residual condition of ("(TAB2.COL4 = 2017)
AND ((TAB2.COL6 = 'XYZ ') AND (NOT (( CASE
WHEN (TAB2.COL11 = 'ABC') THEN
(TAB2.COL2) ELSE (NULL) END )IS NULL )))") into
Spool 10 (all_amps) (compressed columns allowed), which is
duplicated on all AMPs. Spool 10 is built as in-memory optimized
spool with 4 column partitions. The size of Spool 10 is estimated
with no confidence to be 2,384,460 rows (88,225,020 bytes). The
estimated time for this step is 0.09 seconds.
12) We do an all-AMPs JOIN step from Spool 9 (Last Use), which is
joined to Spool 10 (Last Use). Spool 9 and Spool 10 are joined
using a single partition in-memory hash join, with a join
condition of ("(COL4 = COL4) AND (((COL4 = COL4) AND (COL4 = COL4 ))
AND (COL2 = COL3 ))"). The result goes into Spool 3 (all_amps)
(compressed columns allowed), which is built locally on the AMPs.
The size of Spool 3 is estimated with no confidence to be 15,701
rows (486,731 bytes). The estimated time for this step is 0.01
seconds.
13) We do an all-AMPs SUM step to aggregate from Spool 3 (Last Use) by
way of an all-rows scan , grouping by field1 (
TAB1.COL1 ,( CASE WHEN (TAB2.COL11 = 'ABC') THEN
(TAB2.COL2) ELSE (NULL) END)). Aggregate
Intermediate Results are computed globally, then placed in Spool 1.
The size of Spool 1 is estimated with no confidence to be 13,247
rows (490,139 bytes). The estimated time for this step is 0.02
seconds.
14) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.52 seconds.
Can someone please explain what difference this makes during execution? Why does the derived-table version consume less spool space than the original query? Thanks.
Related
I have a situation where I have to create a moving sum over the past 6 months. My data looks like:
A B 20-Jan-18 20
A B 20-Mar-18 45
A B 10-Apr-18 15
A B 21-May-18 30
A B 30-Jul-18 10
A B 15-Aug-18 25
And the expected result is
A B 20-Jan-18 20 20 Sum of row1
A B 20-Mar-18 45 65 Sum of row1+2
A B 10-Apr-18 15 80 Sum of row1+2+3
A B 21-May-18 30 110 Sum of row1+2+3+4
A B 30-Jul-18 10 100 Sum of row2+3+4+5 (as row1 is > 6 months in the past)
A B 15-Aug-18 25 125 Sum of row2+3+4+5+6
I tried to use the solution proposed in an earlier thread: inserting dummy records for dates that have no record and then using ROWS BETWEEN 181 PRECEDING AND CURRENT ROW.
But there may be situations where there are multiple records on the same day, which means that choosing the last 181 rows would drop the earliest records.
I have checked a lot of cases on this forum and others but can't find a solution for this moving sum where the window size is not constant. Please help.
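For concreteness, a minimal sketch of that frame-based attempt, using placeholder column names a, b, datecol and sumcol (assumed here, since the post does not show the real table definition):
SELECT a, b, datecol, sumcol,
       SUM(sumcol) OVER (PARTITION BY a, b
                         ORDER BY datecol
                         ROWS BETWEEN 181 PRECEDING AND CURRENT ROW) AS moving_sum
FROM mytable;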
Teradata doesn't implement RANGE in windowed aggregates, but you can use old-style SQL to get the same result. If the number of rows per group is not too high it's very efficient, but it needs an intermediate table (unless the GROUP BY columns are the PI of the source table). The self-join on the PI columns results in an AMP-local direct join plus local aggregation; without matching PIs it will be a less efficient join plus global aggregation.
create volatile table vt as
( select a,b,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.a, t1.b, t1.datecol
,sum(t2.sumcol)
from vt as t1
join vt as t2
on t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
group by 1,2,3
Of course this will not work as expected if there are multiple rows per day (they inflate the sum because of the n*m join). You need some unique column combination; the defect_id might be useful, as in the sketch below.
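A possible sketch with defect_id added to the join query, assuming defect_id (together with a, b and datecol) uniquely identifies a row, so each base row forms its own group and its window sum is counted only once (the volatile table would then need to carry defect_id, as in the second variant below):
select t1.a, t1.b, t1.defect_id, t1.datecol
      ,sum(t2.sumcol)
from vt as t1
join vt as t2
  on t1.a = t2.a
 and t1.b = t2.b
 and t2.datecol between t1.datecol - 181 and t1.datecol
group by 1,2,3,4;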
Otherwise you need to switch to a scalar subquery, which takes care of the non-uniqueness but is usually less efficient:
create volatile table vt as
( select a,b,defect_id,datecol,sumcol
from mytable
) with data
primary index(a,b);
select t1.*
,(select sum(t2.sumcol)
from vt as t2
where t1.a=t2.a
and t1.b=t2.b
and t2.datecol between t1.datecol -181 and t1.datecol
)
from vt as t1
To use your existing approach, you must first aggregate those multiple rows per day, e.g. as sketched below.
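A minimal sketch of that pre-aggregation step, again with the assumed column names a, b, datecol and sumcol:
SELECT a, b, datecol, SUM(sumcol) AS day_sum
FROM mytable
GROUP BY 1,2,3;
-- then apply the existing ROWS BETWEEN 181 PRECEDING AND CURRENT ROW frame
-- over this one-row-per-day result (plus the dummy rows for missing dates).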
What I want to do is this: when I select records from the table, the last column should be the difference of two columns. In the first record, the last column (i.e. the difference of the two columns) will be [Value1] - [Value2], where [Value1] and [Value2] are columns of the table.
The second record will then be: value of (previous row's last column) + ([Value1] - [Value2]), and so on for every subsequent record.
The columns are as below,
[ID],[Value1],[Value2]
Now the records will be like below,
[ID] [Value1] [Value2] [Result]
1 10 5 10 - 5 = 5
2 15 7 5 + (15 - 7) = 13
3 100 50 13 + (100 - 50) = 63
and so on.
SQLite doesn't support running totals, but for your data and your desired result it's possible to factor out the arithmetic and write the query like this:
SELECT t.id, t.value1, t.value2, SUM(t1.value1 - t1.value2)
FROM table1 AS t
JOIN table1 AS t1 ON t.id >= t1.id
GROUP BY t.id, t.value1, t.value2
http://sqlfiddle.com/#!7/efaf1/2/0
This query will slow down as your row count increases. So, if you're planning to run this on a large table, you may want to run the calculation outside of SQLite.
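Note that if your SQLite build is 3.25.0 or newer, window functions are available and the running total can be written directly; a minimal sketch against the same table1:
SELECT id, value1, value2,
       SUM(value1 - value2) OVER (ORDER BY id
                                  ROWS UNBOUNDED PRECEDING) AS result
FROM table1;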
Here is some sample data from table daily_user. Each row represents an active user on a specific day; the revenue is the money generated by that user on that day. The earliest date in this table is 1/1.
date user_id group revenue
1/1 1 a 1
1/1 2 b 0
1/1 3 a 0
1/2 2 b 10
1/2 3 a 0
1/3 3 a 1
The output I want (basically, each row tells me, for each group, how many users have ever paid from 1/1 up to that observation date; for example, the last row means that from 1/1 to 1/3, group b has 1 paying user in total):
end_date group # users who ever paid
1/1 a 1
1/1 b 0
1/2 a 1
1/2 b 1
1/3 a 2
1/3 b 1
There seem to be some UDFs to do a cumulative sum, but I am not sure if there is any cumulative distinct count function that I can leverage here. Is there any way to structure a Hive query to implement this?
I think the solution is to collect_set the users (i.e. collect the unique values) and take the size of the array, for small numbers of users (i.e. sets that fit in memory):
SELECT size( collect_set( user_id ) ) as uniques,
end_date, group
FROM daily_user
GROUP BY end_date, group;
For large numbers of uniques, you'll need a probabilistic data structure, like sketch sets or HyperLogLogs, available as UDFs from the Brickhouse library ( http://github.com/klout/brickhouse ). This will give you an estimate which is close to, but not exactly, the number of uniques:
SELECT estimated_reach( sketch_set( user_id )) as uniques_est,
end_date, group
FROM daily_user
GROUP BY end_date, group;
You can also merge these, so that you can combine pre-calculated collections/sketches from previous days. Something like:
SELECT size(combine_unique( unique_set ) ) as uniques,
group
FROM daily_uniques
WHERE end_date > date_add( today, -30 )
GROUP BY group;
or
SELECT estimated_reach( union_sketch( unique_sketch) ) as uniques,
group
FROM daily_uniques
WHERE end_date > date_add( today, -30 )
GROUP BY group;
The function if(revenue=0,1,0) will have value 1 if the revenue is 0, and will have value 0 otherwise. Summing this function will give you the total number of people who had revenue of 0:
select
date as end_date,
group,
sum(if(revenue=0,1,0)) as number_of_users_who_never_paid
from
daily_user
group by
date,
group
The simplest way of doing this, without writing a custom UDF, would be to do some sort of cartesian join:
select
du.date as end_date,
du.group,
sum(if(mon.user_id is not null AND mon.date <= du.date,1,0)) as cumulative_spenders
from
daily_user du
LEFT OUTER JOIN
(
select
distinct
user_id,
date,
group
from
daily_user
where
revenue > 0
) mon
ON
(du.user_id=mon.user_id and du.group=mon.group)
group by
du.date,
du.group
This will generate a row per spending transaction per entry in the original table, then aggregate from there.
I have a large dataset and a lookup table. I need to return, for each row in the dataset, the smallest value among the lookup rows whose conditions are met.
Given the size of my dataset I'm reluctant to hack an iffy solution together by cross-joining, as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) leverages base R or data.table, since these are already in efficient use.
Example
library(data.table)
A<-seq(1e4,9e4,1e4)
B<-seq(0,1e4,1e3)
dt1<-data.table(expand.grid(A,B),ID=1:nrow(expand.grid(A,B)))
setnames(dt1, c("Var1","Var2"),c("A","B"))
lookup<-data.table(minA=c(1e4,1e4,2e4,2e4,5e4),
maxA=c(2e4,3e4,7e4,6e4,9e4),
minB=rep(2e3,5),
Val=seq(.1,.5,.1))
# Sample Desired Value
A B ID Val
99: 90000 10000 99 0.5
In SQL, I would then write something along the lines of
SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
and dt1.A<=lookup.maxA
and dt1.B>=lookup.minB
GROUP BY ID, A, B
Which would join all matching records from lookup to dt1 and return the smallest Val.
Update
My solution so far looks like:
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
dt1.lookup<- CJ.table(dt1,lookup)[A>=minA & A<=maxA & B>=minB,
list(Val=Val[which.min( Val)]),
by=list(ID,A,B)]
dt1.lookup<-rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])
This retrieves all records and allows the return of additional columns from the lookup table if I need them. It also has the benefit of enforcing the pick of the minimum Val.
A solution I found without cross joining first needs to prepare the data by getting rid of rows where A and B are out of range entirely:
Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]
Then you make a data table of where each of the conditions are met that correspond to the lowest possible Val:
Indices = Prep[,list(min(which(A >= lookup$minA)),
min(which(A <= lookup$maxA)),
min(which(B >= lookup$minB)), A, B),by=ID]
Then you must get Val at the lowest point where all three conditions are satisfied:
Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]
See if this gets you what you're looking for:
ID Val A B
1: 19 0.1 10000 2000
2: 20 0.1 20000 2000
3: 21 0.2 30000 2000
4: 22 0.3 40000 2000
5: 23 0.3 50000 2000
6: 24 0.3 60000 2000
7: 25 0.3 70000 2000
8: 26 0.5 80000 2000
9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
My first thought was trying to make an index like Senor O did. However, the min(Val) made the index table tougher for me to think through. The way I thought to do it was to loop through the lookup table.
dt1[,Val:=as.numeric(NA)]
for (row in 1:NROW(lookup)) {
  dt1[A >= lookup[order(Val)][row, minA] &
      A <= lookup[order(Val)][row, maxA] &
      B >= lookup[order(Val)][row, minB] &
      is.na(Val),
      Val := lookup[order(Val)][row, Val]]
}
I think this should work because it first sets the new column to NA values.
Then it puts the lookup table in order by Val, so you're going to apply its lowest values first.
Each loop iteration will only potentially change values in dt1 where Val is still NA, and since we're looping through lookup in order from smallest Val to biggest, this ensures you get the min(Val) that you wanted.
Replace the rbind.fill line with
rbindlist(list(dt1.lookup,dt1[!ID %in% dt1.lookup[,ID]][,list(ID, A, B, Val=as.numeric(NA))]))
It will eliminate the reliance on the plyr package (which provides rbind.fill) and I think it'll be faster.
I have one table with comma-separated values like "1,2,3,4,5,6,7" in each row:
ID Value
101 5,6,7
201 8,9,3
301 3,4,5
The values in the Value column are foreign keys referencing another table, B:
Table B
5 A
6 C
7 N
Is there any way I can join these two tables together in one query?
I want to pass 101 and get the values A, C, N.
If your model is as shown, something like this?
select a.id, listagg(new_value, ',') within group (order by new_value) new_value
from a
inner join b
on ','||a.value||',' like '%,'|| b.value ||',%'
group by a.id
http://www.sqlfiddle.com/#!4/74e46/1
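A usage sketch for the "pass 101" case, reusing the assumed model from the answer above (table b with a key column value and a text column new_value):
select a.id,
       listagg(b.new_value, ',') within group (order by b.new_value) as new_value
from a
inner join b
  on ','||a.value||',' like '%,'|| b.value ||',%'
where a.id = 101
group by a.id;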