I have data in a table with the following schema:
Id INTEGER,
date DATETIME,
value REAL
id is the primary key, and I have an index on the date column to speed up querying values within a specific date range.
What should I do if I need N equal date ranges between specific start and end dates, and want to query aggregated data for each date range?
For example:
Start date: 2015-01-01
End date: 2019-12-31
N: 5
In this case equal date intervals should be:
2015-01-01 ~ 2015-12-31
2016-01-01 ~ 2016-12-31
2017-01-01 ~ 2017-12-31
2018-01-01 ~ 2018-12-31
2019-01-01 ~ 2019-12-31
The query should then aggregate (AVG) all values within each of those intervals, so I expect 5 rows in total after execution.
Maybe something with CTE?
There are two ways to do it.
Both use recursive CTEs, but they return different results because they split the range differently.
The 1st one with NTILE():
with
dates as (select '2015-01-01' mindate, '2019-12-31' maxdate),
alldates as (
select mindate date from dates
union all
select date(a.date, '1 day')
from alldates a cross join dates d
where a.date < d.maxdate
),
groups as (
select *, ntile(5) over (order by date) grp
from alldates
),
cte as (
select min(date) date1, max(date) date2
from groups
group by grp
)
select * from cte;
Results:
| date1 | date2 |
| ---------- | ---------- |
| 2015-01-01 | 2016-01-01 |
| 2016-01-02 | 2016-12-31 |
| 2017-01-01 | 2017-12-31 |
| 2018-01-01 | 2018-12-31 |
| 2019-01-01 | 2019-12-31 |
And the 2nd builds the groups with math:
with
dates as (select '2015-01-01' mindate, '2019-12-31' maxdate),
cte1 as (
select mindate date from dates
union all
select date(
c.date,
((strftime('%s', d.maxdate) - strftime('%s', d.mindate)) / 5) || ' second'
)
from cte1 c inner join dates d
on c.date < d.maxdate
),
cte2 as (
select date date1, lead(date) over (order by date) date2
from cte1
),
cte as (
select date1,
case
when date2 = (select maxdate from dates) then date2
else date(date2, '-1 day')
end date2
from cte2
where date2 is not null
)
select * from cte
Results:
| date1 | date2 |
| ---------- | ---------- |
| 2015-01-01 | 2015-12-31 |
| 2016-01-01 | 2016-12-30 |
| 2016-12-31 | 2017-12-30 |
| 2017-12-31 | 2018-12-30 |
| 2018-12-31 | 2019-12-31 |
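The two result sets differ because of how each query splits the range. NTILE() distributes the 1826 calendar days into 5 groups, so the first group gets the one extra day. The second query steps forward by one fifth of the 1825-day span, i.e. 365 days at a time, and 2016 is a leap year. Here is a small Python sketch of the second query's arithmetic; the variable names are mine and purely illustrative:

```python
from datetime import date, timedelta

start, end, n = date(2015, 1, 1), date(2019, 12, 31), 5

# Span between the two dates, as the strftime('%s') difference measures it.
span_days = (end - start).days          # 1825
step_days = span_days // n              # 365, same as the integer division in SQL

# The recursive CTE emits the start date and then keeps adding one step.
boundaries = [start + timedelta(days=step_days * i) for i in range(n + 1)]

# Pair consecutive boundaries; every range except the last ends one day
# before the next boundary, mirroring the CASE expression in the query.
ranges = []
for i in range(n):
    first, nxt = boundaries[i], boundaries[i + 1]
    last = nxt if nxt == end else nxt - timedelta(days=1)
    ranges.append((first.isoformat(), last.isoformat()))

for r in ranges:
    print(r)
```

Because every step is exactly 365 days, the boundaries drift back by one day after the leap year, which is what the second result table shows.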
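In both cases you can get the averages by joining the table to the cte (if the date column also stores a time part as text, rows on the last day of a range may fall outside BETWEEN, so compare on date(t.date) or use a half-open range in that case):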
select c.date1, c.date2, avg(t.value) avg_value
from cte c inner join tablename t
on t.date between c.date1 and c.date2
group by c.date1, c.date2
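If it helps to see the whole flow end to end, here is a minimal sketch in Python with the built-in sqlite3 module. It is not the queries above: it computes the ranges in Python (NTILE-style, the first ranges absorb the remainder) and then runs one AVG per range. The helper name `equal_date_ranges` and the sample rows are my own assumptions; `tablename` is the placeholder used in this answer.

```python
import sqlite3
from datetime import date, timedelta

def equal_date_ranges(start, end, n):
    """Split [start, end] (inclusive) into n contiguous whole-day ranges.

    Like NTILE, the first (total_days % n) ranges get one extra day.
    """
    total_days = (end - start).days + 1
    base, extra = divmod(total_days, n)
    ranges, cursor = [], start
    for i in range(n):
        length = base + (1 if i < extra else 0)
        last = cursor + timedelta(days=length - 1)
        ranges.append((cursor, last))
        cursor = last + timedelta(days=1)
    return ranges

ranges = equal_date_ranges(date(2015, 1, 1), date(2019, 12, 31), 5)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (id INTEGER PRIMARY KEY, date DATETIME, value REAL)"
)
# Hypothetical sample rows, only to have something to average.
conn.executemany(
    "INSERT INTO tablename (date, value) VALUES (?, ?)",
    [("2015-06-01", 10.0), ("2015-12-31 13:00:00", 20.0), ("2017-03-15", 5.0)],
)

averages = []
for first, last in ranges:
    # Half-open comparison: keeps rows with a time part on the last day
    # and still lets SQLite use the index on the date column.
    upper = last + timedelta(days=1)
    (avg_value,) = conn.execute(
        "SELECT AVG(value) FROM tablename WHERE date >= ? AND date < ?",
        (first.isoformat(), upper.isoformat()),
    ).fetchone()
    averages.append((first.isoformat(), last.isoformat(), avg_value))

for row in averages:
    print(row)
```

Ranges with no rows come back with an average of None (NULL), so you still get exactly N result rows.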
I have a table of roughly 65M records in Hive that contains patient, facility, service start and service end dates. The table looks similar to the MWE below:
CREATE TABLE <your_db>.example
(accountId string,
provider string,
startdate timestamp,
enddate timestamp);
INSERT INTO TABLE <your_db>.example VALUES
('123A', 'smith', '2019-03-01 00:00:00', '2019-03-04 00:00:00'),
('456B', 'rogers', '2019-03-02 00:00:00', '2019-03-03 00:00:00'),
('123A', 'smith', '2019-03-03 00:00:00', '2019-03-06 00:00:00'),
('123A', 'smith', '2019-03-07 00:00:00', '2019-03-08 00:00:00'),
('456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00'),
('456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00'),
('123A', 'smith', '2019-03-10 00:00:00', '2019-03-12 00:00:00');
SELECT * FROM <your_db>.example;
# example.accountid example.provider example.startdate example.enddate
#1 123A smith 2019-03-01 00:00:00.0 2019-03-04 00:00:00.0
#2 456B rogers 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 123A smith 2019-03-03 00:00:00.0 2019-03-06 00:00:00.0
#4 123A smith 2019-03-07 00:00:00.0 2019-03-08 00:00:00.0
#5 456B daniels 2019-03-04 00:00:00.0 2019-03-05 00:00:00.0
#6 456B daniels 2019-03-06 00:00:00.0 2019-03-09 00:00:00.0
#7 123A smith 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
I want to define the continuous startdate and enddate for accountId and provider combination, where there is no more than 1 day between a record's enddate and the next record's startdate, then calculate the number of days in the continuous block (called "los" for length-of-stay). This grouping is called a "case". Below is what the case output needs to look like:
# results.accountid results.provider results.los results.startdate results.enddate
#1 123A smith 7 2019-03-01 00:00:00.0 2019-03-08 00:00:00.0
#2 456B rogers 1 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 456B daniels 5 2019-03-04 00:00:00.0 2019-03-09 00:00:00.0
#4 123A smith 2 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
We are currently using the accepted answer to this question, but it becomes a very expensive operation with our actual (65M record) table. I'm thinking that a more efficient solution would be to first consolidate and define each case's startdate and enddate, and then run a datediff calculation (instead of exploding each date range), but I'm not sure how to pull that off in HiveQL.
Thanks in advance!
Digging through our company's repos, I found the creative solution below that does what we're looking for. Have yet to test out its performance improvement over the current 'explode' solution. It does what I asked for in the original question, but it is a bit complex (though well commented).
/*
STEP 1: Input
*/
DROP TABLE IF EXISTS <your_db>.tmp_completedatepairs;
CREATE TABLE <your_db>.tmp_completedatepairs AS
SELECT CONCAT(COALESCE(accountid, ''), "-", COALESCE(provider, '')) AS tag
, startdate
, enddate
FROM <your_db>.example
WHERE startdate IS NOT NULL
AND enddate IS NOT NULL;
/*
STEP 2: Create new pairs of start and end dates that are
better time span tiles across the stay period
*/
DROP TABLE IF EXISTS <your_db>.tmp_respaned_input;
CREATE TABLE <your_db>.tmp_respaned_input AS
SELECT SD.tag
, SD.startdate
, ED.enddate
FROM (SELECT *
, ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS rnsd
FROM <your_db>.tmp_completedatepairs) AS SD
LEFT JOIN
(SELECT *
, ROW_NUMBER() OVER (PARTITION BY tag ORDER BY enddate ASC) AS rned
FROM <your_db>.tmp_completedatepairs) AS ED
ON SD.tag=ED.tag
AND SD.rnsd=ED.rned;
/*
STEP 3: Find gaps >1day and define stays around them
This consists of several substeps:
(a) Isolate all start dates that are more than 1 day after a preceding end date with the same tag, or are the earliest date for the tag. Number them in order.
(b) Isolate all end dates that are more than 1 day before a following start date with the same tag, or are the last date for the tag. Number them in order.
(c) Match together corresponding start and end dates after SELECTing only those dates that terminate a case (rather than dates that occur within case boundaries)
*/
DROP TABLE IF EXISTS <your_db>.results;
CREATE TABLE <your_db>.results AS
-- (c) Match together corresponding start and end dates after SELECTing only those dates that terminate a case (rather than dates that occur within case boundaries)
SELECT SPLIT(tag,'-')[0] AS accountid
, SPLIT(tag,'-')[1] AS provider
, DATEDIFF(enddate, startdate) AS los
, startdate
, enddate
FROM
-- (a) Isolate all start dates that are more than 1 day after a preceding end date with the same tag, or are the earliest date for the tag. Number them in order.
(SELECT tag
, startdate
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS string)) AS rnlink
FROM (SELECT L.tag
, L.startdate AS startdate
, DATEDIFF(L.startdate, R.enddate) AS d
FROM (SELECT *
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS string)) AS rnstart
FROM <your_db>.tmp_respaned_input) L
LEFT JOIN
(SELECT *
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY enddate ASC) + 1 AS string)) AS rnstart
FROM <your_db>.tmp_respaned_input) R
ON L.rnstart = R.rnstart) X
WHERE d > 1 OR d IS NULL) S
LEFT JOIN
-- (b) Isolate all end dates that are more than 1 day before a following start date with the same tag, or are the last date for the tag. Number them in order.
(SELECT enddate
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY enddate ASC) AS string)) AS rnlink
FROM (SELECT L.tag
, L.enddate AS enddate
, DATEDIFF(R.startdate, L.enddate) AS d
FROM (SELECT *
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY enddate ASC) AS string)) AS rnend
FROM <your_db>.tmp_respaned_input) L
LEFT JOIN
(SELECT *
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY startdate ASC) - 1 AS string)) AS rnend
FROM <your_db>.tmp_respaned_input) R
ON L.rnend = R.rnend) X
WHERE d > 1 or d IS NULL) E
ON S.rnlink = E.rnlink;
-- Print results
SELECT *
FROM <your_db>.results
ORDER BY startdate ASC;
# results.accountid results.provider results.los results.startdate results.enddate
#1 123A smith 7 2019-03-01 00:00:00.0 2019-03-08 00:00:00.0
#2 456B rogers 1 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 456B daniels 5 2019-03-04 00:00:00.0 2019-03-09 00:00:00.0
#4 123A smith 2 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
This is my solution, please see comments in the code:
--configuration
set hive.cli.print.header=true;
set hive.execution.engine=tez;
set hive.mapred.reduce.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=36;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.reduce.groupby.enabled=true;
set hive.map.aggr=true;
with example as (--this is your data example
select stack (9, '123A', 'smith', '2019-03-01 00:00:00', '2019-03-04 00:00:00',
'456B', 'rogers', '2019-03-02 00:00:00', '2019-03-03 00:00:00',
'123A', 'smith', '2019-03-03 00:00:00', '2019-03-06 00:00:00',
'123A', 'smith', '2019-03-07 00:00:00', '2019-03-08 00:00:00',
'456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00',
'456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00',
'123A', 'smith', '2019-03-10 00:00:00', '2019-03-12 00:00:00',
--I added one more case
'123A', 'smith', '2019-03-14 00:00:00', '2019-03-17 00:00:00',
'123A', 'smith', '2019-03-18 00:00:00', '2019-03-19 00:00:00'
) as (accountId, provider, startdate, enddate )
)
select --aggregate start and end dates for the whole case, count LOS
accountId, provider, datediff(max(enddate),min(startdate)) as los, min(startdate) startdate , max(enddate) enddate
from
(
select --distribute case_id across all records in the same case
accountId, provider, startdate, enddate,
last_value(case_id, true) over(partition by accountid, same_case_flag order by startdate ) as case_id --Bingo!!! we have case_id
from
(
select --generate UUID as case_id if previous same_case_flag != current one or previous was NULL.
--One UUID will be generated for each new case
accountId, provider, startdate, enddate, same_case_flag,
case when lag(same_case_flag) over(partition by accountid order by startdate) = same_case_flag
then NULL else java_method("java.util.UUID", "randomUUID")
end case_id
from
(
select --calculate same case flag
accountId, provider, startdate, enddate,
case when datediff(startdate,lag(enddate) over(partition by accountId order by startdate)) <=1 --startdate - prev_enddate
OR
datediff(lead(startdate) over(partition by accountId order by startdate), enddate) <=1 --next_startdate-enddate
then true else false
end as same_case_flag
from example s
)s)s)s
group by accountId, provider, case_id
order by startdate; --remove order by if not necessary, to speed up processing !!! I added it to get the same ordering as in your example
Result:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 4 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 5 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 6 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 06/06 [==========================>>] 100% ELAPSED TIME: 10.79 s
--------------------------------------------------------------------------------
OK
accountid provider los startdate enddate
123A smith 7 2019-03-01 00:00:00 2019-03-08 00:00:00
456B rogers 1 2019-03-02 00:00:00 2019-03-03 00:00:00
456B daniels 5 2019-03-04 00:00:00 2019-03-09 00:00:00
123A smith 2 2019-03-10 00:00:00 2019-03-12 00:00:00
123A smith 5 2019-03-14 00:00:00 2019-03-19 00:00:00
Time taken: 29.049 seconds, Fetched: 5 row(s)
Remove the ORDER BY to get rid of the last reducer.
Depending on your data, you could probably assign case_id with concat(accountid, rand()), or also concatenate startdate, or something similar instead of randomUUID if separate cases with the same accountid are rare, but randomUUID is safer because it is always unique.
This approach does not use joins at all.
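To make the "consolidate first, then datediff" logic easier to follow outside Hive, here is a small Python sketch. It is a plain sort-and-sweep over each (accountId, provider) group, not a translation of the window-function query above, and the function name `consolidate_cases` is my own. It assumes a new case starts whenever a record begins more than 1 day after the running end date of the current case.

```python
from datetime import date
from itertools import groupby

rows = [
    ("123A", "smith",   date(2019, 3, 1),  date(2019, 3, 4)),
    ("456B", "rogers",  date(2019, 3, 2),  date(2019, 3, 3)),
    ("123A", "smith",   date(2019, 3, 3),  date(2019, 3, 6)),
    ("123A", "smith",   date(2019, 3, 7),  date(2019, 3, 8)),
    ("456B", "daniels", date(2019, 3, 4),  date(2019, 3, 5)),
    ("456B", "daniels", date(2019, 3, 6),  date(2019, 3, 9)),
    ("123A", "smith",   date(2019, 3, 10), date(2019, 3, 12)),
]

def consolidate_cases(rows, max_gap_days=1):
    """Merge records per (account, provider) into cases; return them by start date."""
    cases = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    for (account, provider), group in groupby(ordered, key=lambda r: (r[0], r[1])):
        case_start = case_end = None
        for _, _, start, end in group:
            if case_start is not None and (start - case_end).days <= max_gap_days:
                # Still the same case: extend the running end date.
                case_end = max(case_end, end)
            else:
                # Gap of more than max_gap_days: close the previous case, if any.
                if case_start is not None:
                    los = (case_end - case_start).days
                    cases.append((account, provider, los, case_start, case_end))
                case_start, case_end = start, end
        # Close the last open case of this group.
        los = (case_end - case_start).days
        cases.append((account, provider, los, case_start, case_end))
    return sorted(cases, key=lambda c: c[3])

cases = consolidate_cases(rows)
for case in cases:
    print(case)
```

On the 7 sample rows from the question this gives the four expected cases with los 7, 1, 5 and 2.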
How to generate dates between two date columns for each row
A row generator technique should be used, such as:
SQL> alter session set nls_date_format = 'dd.mm.yyyy';
Session altered.
SQL> with test (sno, start_date, end_date) as
2 (select 1, date '2018-01-01', date '2018-01-05' from dual union
3 select 2, date '2018-01-03', date '2018-01-05' from dual
4 )
5 select sno, start_date + column_value - 1 datum
6 from test,
7 table(cast(multiset(select level from dual
8 connect by level <= end_date - start_date + 1)
9 as sys.odcinumberlist))
10 order by sno, datum;
SNO DATUM
---------- ----------
1 01.01.2018
1 02.01.2018
1 03.01.2018
1 04.01.2018
1 05.01.2018
2 03.01.2018
2 04.01.2018
2 05.01.2018
8 rows selected.
SQL>
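For readers who find the MULTISET / CONNECT BY construct hard to parse, the same row-generator idea can be sketched in a few lines of Python: for every input row, emit one output row per day from start_date to end_date inclusive. The function name `expand_rows` is made up for this illustration.

```python
from datetime import date, timedelta

def expand_rows(rows):
    """Return one (sno, day) pair for each day in [start_date, end_date]."""
    expanded = []
    for sno, start_date, end_date in rows:
        # end_date - start_date + 1 days, like the CONNECT BY LEVEL bound.
        for offset in range((end_date - start_date).days + 1):
            expanded.append((sno, start_date + timedelta(days=offset)))
    return expanded

test = [
    (1, date(2018, 1, 1), date(2018, 1, 5)),
    (2, date(2018, 1, 3), date(2018, 1, 5)),
]

expanded = expand_rows(test)
for sno, datum in expanded:
    print(sno, datum.strftime("%d.%m.%Y"))
```

This produces the same 8 rows as the SQL*Plus session above.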
I need to compare a value in one column with the previous row's value in another column. For example, I have this table:
id | create_date | end_date
1 | 2016-12-31 | 2017-01-25
2 | 2017-01-26 | 2017-05-21
3 | 2017-05-22 | 2017-08-26
4 | 2017-09-01 | 2017-09-02
I need to compare create_date for id = 2 with end_date for id = 1
and compare create_date for id = 3 with end_date for id = 2 etc.
Result: show me the ids where create_date (id = n) <> end_date (id = n-1) + interval '1' day
Should I use the lag() function? How can I compare them? Which function should I use, and how?
Thank you
Older Teradata releases don't have lag/lead, but you can still get the same functionality:
select
id,
create_date,
end_date,
max(end_date) over (order by id rows between 1 preceding and 1 preceding) as prev_end_date
...
qualify
create_date <> prev_end_date + INTERVAL '1' day;
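To check the logic against the sample data, here is the same previous-row comparison as a short Python sketch (the function name `ids_with_gap` is hypothetical). Like the query, it skips the first row, which has no previous end_date to compare with.

```python
from datetime import date, timedelta

rows = [
    (1, date(2016, 12, 31), date(2017, 1, 25)),
    (2, date(2017, 1, 26),  date(2017, 5, 21)),
    (3, date(2017, 5, 22),  date(2017, 8, 26)),
    (4, date(2017, 9, 1),   date(2017, 9, 2)),
]

def ids_with_gap(rows):
    """Return ids whose create_date is not the previous row's end_date + 1 day."""
    ordered = sorted(rows)  # order by id, as in the window definition
    one_day = timedelta(days=1)
    return [
        cur[0]
        for prev, cur in zip(ordered, ordered[1:])
        if cur[1] != prev[2] + one_day
    ]

gap_ids = ids_with_gap(rows)
print(gap_ids)
```

With the table from the question, only id 4 is returned, because 2017-09-01 is not the day after 2017-08-26.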
I have the table below (I also have a calendar table if needed):
ID Start_dt End_dt
1 1/9/2016 3/10/2016
Expected Output:
ID Start_dt End_dt Month ActiveCustomerPerMonth
1 1/9/16 3/10/2016 201601 1
1 1/9/16 3/10/2016 201602 1
1 1/9/16 3/10/2016 201603 0 (Not Active end of Month)
I need this because I'm working on a query that will use a CASE expression to count whether the customer was active for that month. If the member was active on the last day of the month, the member is considered active for that month. But I need to be able to count all months for that customer.
CASE
WHEN LAST_DAY(x.END_DT) = x.END_DT
THEN '1'
WHEN TO_CHAR(X.END_DT,'MM/DD/YYYY') != '01/01/3000'
OR X.DISCHARGE_REASON IS NOT NULL
THEN '0'
WHEN X.FIRST_ASSGN_DT IS NULL
THEN '0'
ELSE '1'
END ActiveMemberForMonth
I'm new to Oracle and was reading about CONNECT BY, but I did not understand the process and I'm not sure whether this is the proper place to use it.
Something like this.
with
test_data ( id, start_dt, end_dt ) as (
select 1, to_date('1/9/2016' , 'mm/dd/yyyy'), to_date('3/10/2016', 'mm/dd/yyyy')
from dual union all
select 2, to_date('1/23/2016', 'mm/dd/yyyy'), to_date('5/31/2016', 'mm/dd/yyyy')
from dual
)
-- end of test data; solution (SQL query) begins below this line
select id, start_dt, end_dt,
to_char(add_months(trunc(start_dt, 'mm'), level - 1), 'yyyymm') as mth,
case when end_dt < last_day(end_dt)
and level = 1 + months_between(trunc(end_dt, 'mm'), trunc(start_dt, 'mm'))
then 0 else 1 end as active_at_month_end
from test_data
connect by level <= 1 + months_between(trunc(end_dt, 'mm'), trunc(start_dt, 'mm'))
and prior id = id
and prior sys_guid() is not null
order by id, mth -- optional
;
ID START_DT END_DT MTH ACTIVE_AT_MONTH_END
--- ---------- ---------- ------ -------------------
1 2016-01-09 2016-03-10 201601 1
1 2016-01-09 2016-03-10 201602 1
1 2016-01-09 2016-03-10 201603 0
2 2016-01-23 2016-05-31 201601 1
2 2016-01-23 2016-05-31 201602 1
2 2016-01-23 2016-05-31 201603 1
2 2016-01-23 2016-05-31 201604 1
2 2016-01-23 2016-05-31 201605 1
8 rows selected.
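If the CONNECT BY part is the confusing bit, it may help to see the same month-by-month expansion written out procedurally. The Python sketch below (the function name `months_active` is mine) walks from the month of start_dt to the month of end_dt and flags a month as active only if end_dt reaches that month's last day, which is equivalent to the CASE expression in the query.

```python
import calendar
from datetime import date

def months_active(start_dt, end_dt):
    """Return (yyyymm, active_at_month_end) for every month from start_dt to end_dt."""
    result = []
    year, month = start_dt.year, start_dt.month
    while (year, month) <= (end_dt.year, end_dt.month):
        last_day = calendar.monthrange(year, month)[1]
        month_end = date(year, month, last_day)
        # Active only if the period still covers the last day of this month.
        active = 1 if end_dt >= month_end else 0
        result.append((f"{year}{month:02d}", active))
        # Move to the first month of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return result

first = months_active(date(2016, 1, 9), date(2016, 3, 10))
second = months_active(date(2016, 1, 23), date(2016, 5, 31))
print(first)
print(second)
```

The output matches the MTH and ACTIVE_AT_MONTH_END columns above for both test ids.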