History handling scenario issue - teradata

I need help with a query to handle the scenarios below.
The ID below is active:
ID start_dt status end_dt
-------------------------------------------------
18,593,122 1/15/07 14:38 A 12/11/07 8:45
18,593,122 12/11/07 8:45 C 12/11/07 8:45
18,593,122 12/11/07 8:45 A 11/13/11 0:00
18,593,122 11/13/11 0:00 C 12/26/11 10:36
18,593,122 12/26/11 10:36 A ?
The ID below is closed:
ID start_dt status end_dt
-------------------------------------------------
18,593,122 1/15/07 14:38 A 12/11/07 8:45
18,593,122 12/11/07 8:45 C ?
I have to insert records into the table where records are not correctly ended.
For example, there are records like this:
ID start_dt status end_dt
-------------------------------------------------
18,593,122 1/15/07 14:38 A 12/11/07 8:45
In the record above, the closing ('C') record is missing.
I have to identify such records and insert the missing rows into the table.
The sample below is getting impacted:
10,866 7/29/96 0:01 A 12/27/03 14:16
10,866 7/25/00 0:01 A 8/20/00 23:59
10,866 8/20/00 23:59 C 10/2/02 13:00
10,866 10/2/02 13:00 A 7/25/04 14:11
10,866 12/27/03 14:16 C 7/25/04 14:11
10,866 7/25/04 14:11 C 7/25/04 14:11
10,866 7/25/04 14:11 A ?
10,866 5/28/11 16:24 T 5/28/11 16:24
The scenario below is one I'm not able to handle:
records with accs_meth_status_type_cd = 'A' and a non-null end date (highlighted below).
Expected: a record with accs_meth_status_type_cd = 'C' should be inserted.
Actual: the record with accs_meth_status_type_cd = 'C' is not getting inserted.
10,866 7/29/96 0:01 A 12/27/03 14:16
10,866 7/25/00 0:01 A 8/20/00 23:59
10,866 8/20/00 23:59 C 10/2/02 13:00
10,866 10/2/02 13:00 A 7/25/04 14:11
10,866 12/15/04 14:16 A ?

From what I understand, a correctly closed record (i.e. ID) consists of:
Row #1: Status <> 'C' and End_Dt IS NOT NULL
Row #2: Status = 'C' and End_Dt IS NULL
End_Dt in Row #1 = Start_Dt in Row #2
Assuming that's correct, you can do something like this to find records that should be closed (i.e. have Row #1, but are missing Row #2):
INSERT INTO mytbl (ID, Start_Dt, Status, End_Dt)
SELECT
    a.ID,
    a.End_Dt,  -- Use the unclosed row's End_Dt as the Start_Dt for the new "to-be-inserted" row
    'C',
    NULL
FROM mytbl a
WHERE a.Status <> 'C'
  AND a.End_Dt IS NOT NULL  -- Get rows that should be considered closed
  AND (a.ID, a.End_Dt) NOT IN (
      -- Check for corresponding records that do not also have a 'C' row
      -- (you can also do this as a LEFT JOIN above)
      SELECT ID, Start_Dt
      FROM mytbl
      WHERE Status = 'C'    -- Check for presence of 'C' rows
        AND End_Dt IS NULL
  )
QUALIFY ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY a.End_Dt DESC) = 1  -- Only return one row per "unclosed" record
;
This won't handle all the edge cases, but it should get you started. Let me know if that's what you're looking for.
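If you want to try the shape of this query outside Teradata, here is a minimal sketch in Python using the built-in sqlite3 module (SQLite 3.25+). SQLite has no QUALIFY, so the ROW_NUMBER() filter moves into a subquery; the table and column names follow the example above, with ISO date strings standing in for Teradata timestamps:

```python
import sqlite3

# In-memory table with one improperly closed 'A' row (no matching 'C' row).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytbl (ID INTEGER, Start_Dt TEXT, Status TEXT, End_Dt TEXT);
INSERT INTO mytbl VALUES
  (18593122, '2007-01-15 14:38', 'A', '2007-12-11 08:45');
""")

# Same logic as the Teradata query; QUALIFY is emulated with a subquery.
con.execute("""
INSERT INTO mytbl (ID, Start_Dt, Status, End_Dt)
SELECT ID, End_Dt, 'C', NULL
FROM (
  SELECT a.*,
         ROW_NUMBER() OVER (PARTITION BY a.ID ORDER BY a.End_Dt DESC) AS rn
  FROM mytbl a
  WHERE a.Status <> 'C'
    AND a.End_Dt IS NOT NULL
    AND (a.ID, a.End_Dt) NOT IN (
        SELECT ID, Start_Dt FROM mytbl
        WHERE Status = 'C' AND End_Dt IS NULL)
)
WHERE rn = 1
""")

rows = con.execute("SELECT * FROM mytbl ORDER BY Start_Dt").fetchall()
for r in rows:
    print(r)
# The second row printed is the newly inserted 'C' row with a NULL End_Dt.
```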
Update
I ran the query above against the 5 new rows you provided, and this is the result I got:
id start_dt status end_dt
1 10,866 7/25/2000 00:01:00.000000 A 8/20/2000 23:59:00.000000
2 10,866 8/20/2000 23:59:00.000000 C 10/2/2002 13:00:00.000000
3 10,866 7/29/1996 00:01:00.000000 A 12/27/2003 14:16:00.000000
4 10,866 10/2/2002 13:00:00.000000 A 7/25/2004 14:11:00.000000
5 10,866 12/27/2003 14:16:00.000000 A ?
6 10,866 7/25/2004 14:11:00.000000 C ?
Rows #1-5 are the original rows. Row #6 is the newly inserted 'C' row, which corresponds to row #4, the old improperly closed 'A' row. Is this what you are expecting or no?

Related

How to query SQLite data hourly?

The data table looks like the following:
ID DATE
1 2020-12-31 10:10:00
2 2020-12-31 20:30:00
3 2020-12-31 20:50:00
4 2021-01-02 17:10:00
5 2021-01-02 17:20:00
6 2021-01-02 17:30:00
7 2021-01-03 23:10:00
..
And I would like to query only the last entry per hour per day, with results like:
ID DATE
1 2020-12-31 10:10:00
3 2020-12-31 20:50:00
6 2021-01-02 17:30:00
7 2021-01-03 23:10:00
..
I tried to look for an hourly query and found the following:
strftime('%H', " + DATE + ", '+1 hours')
However, I'm not sure how to use it properly (e.g. with GROUP BY? Then how do I ensure it takes the latest entry of the hour?), so it would be great to have some help here!
You can do it with the ROW_NUMBER() window function:
SELECT ID, DATE
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY strftime('%Y%m%d%H', DATE) ORDER BY DATE DESC) rn
    FROM tablename
)
WHERE rn = 1
ORDER BY ID
Instead of strftime('%Y%m%d%H', DATE) you could also use substr(DATE, 1, 13).
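As a quick sanity check, the window-function version can be run against the sample data with Python's built-in sqlite3 module (SQLite 3.25+ for window functions):

```python
import sqlite3

# Load the sample data from the question into an in-memory table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tablename (ID INTEGER, DATE TEXT);
INSERT INTO tablename VALUES
  (1, '2020-12-31 10:10:00'), (2, '2020-12-31 20:30:00'),
  (3, '2020-12-31 20:50:00'), (4, '2021-01-02 17:10:00'),
  (5, '2021-01-02 17:20:00'), (6, '2021-01-02 17:30:00'),
  (7, '2021-01-03 23:10:00');
""")

# Last entry per calendar hour: partition by year/month/day/hour,
# keep the row with the latest DATE in each partition.
rows = con.execute("""
SELECT ID, DATE
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY strftime('%Y%m%d%H', DATE)
                              ORDER BY DATE DESC) rn
    FROM tablename
)
WHERE rn = 1
ORDER BY ID
""").fetchall()
print(rows)  # IDs 1, 3, 6, 7 survive
```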
For versions of SQLite prior to 3.25.0, which do not support window functions, you can do it with NOT EXISTS:
SELECT t1.*
FROM tablename t1
WHERE NOT EXISTS (
    SELECT 1
    FROM tablename t2
    WHERE strftime('%Y%m%d%H', t2.DATE) = strftime('%Y%m%d%H', t1.DATE)
      AND t2.DATE > t1.DATE
)
See the demo.
Results:
> ID | DATE
> -: | :------------------
> 1 | 2020-12-31 10:10:00
> 3 | 2020-12-31 20:50:00
> 6 | 2021-01-02 17:30:00
> 7 | 2021-01-03 23:10:00

Creating a SQLite query

I have a SQLite database, and I want to create a query that will group records if their DateTimes are within 60 minutes of each other. The hard part is that the grouping is cumulative: if we have 3 records with DateTimes 2019-12-14 15:40:00, 2019-12-14 15:56:00 and 2019-12-14 16:55:00, they all fall in one group (each is within 60 minutes of the previous one). Please see the Hands table and the desired output of the query to help you understand the requirement.
Database Table "Hands"
ID DateTime Result
1 2019-12-14 15:40:00 -100
2 2019-12-14 15:56:00 1000
3 2019-12-14 16:55:00 -2000
4 2012-01-12 12:00:00 400
5 2016-10-01 21:00:00 900
6 2016-10-01 20:55:00 1000
Desired output of query
StartTime Count Result
2019-12-14 15:40:00 3 -1100
2012-01-12 12:00:00 1 400
2016-10-01 20:55:00 2 1900
You can use window functions to indicate at which record a new group should start (because its datetime difference with the previous record is 60 minutes or larger), and then turn that information into a unique group number. Finally, you can group by that group number and apply the aggregate functions:
with base as (
    select DateTime, Result,
           coalesce(cast((
               julianday(DateTime) - julianday(
                   lag(DateTime) over (order by DateTime)
               )
           ) * 24 >= 1 as integer), 1) as firstInGroup
    from Hands
), step as (
    select DateTime, Result,
           sum(firstInGroup) over (
               order by DateTime
               rows between unbounded preceding and current row
           ) as grp
    from base
)
select min(DateTime) DateTime,
       count(*) Count,
       sum(Result) Result
from step
group by grp;
DB-fiddle
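Here is the same query run against the sample Hands data with Python's sqlite3 module (SQLite 3.25+), as a sanity check:

```python
import sqlite3

# Sample "Hands" data from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Hands (ID INTEGER, DateTime TEXT, Result INTEGER);
INSERT INTO Hands VALUES
  (1, '2019-12-14 15:40:00', -100),
  (2, '2019-12-14 15:56:00', 1000),
  (3, '2019-12-14 16:55:00', -2000),
  (4, '2012-01-12 12:00:00', 400),
  (5, '2016-10-01 21:00:00', 900),
  (6, '2016-10-01 20:55:00', 1000);
""")

# firstInGroup = 1 when the gap to the previous row is >= 1 hour (or no
# previous row); the running sum of that flag is the group number.
rows = con.execute("""
with base as (
    select DateTime, Result,
           coalesce(cast((
               julianday(DateTime) - julianday(
                   lag(DateTime) over (order by DateTime))
           ) * 24 >= 1 as integer), 1) as firstInGroup
    from Hands
), step as (
    select DateTime, Result,
           sum(firstInGroup) over (
               order by DateTime
               rows between unbounded preceding and current row) as grp
    from base
)
select min(DateTime), count(*), sum(Result)
from step
group by grp
""").fetchall()
print(rows)
```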

How to get count of multiple distinct columns with one column as date

I have a table with columns id, date, name:
id date name
1 2019-08-01 00:00:00 abc
1 2019-08-01 00:00:00 def
2 2019-08-01 00:00:00 pqr
1 2019-08-31 00:00:00 def
I want to get the count of id for a given month.
The expected count of id for month 8 is 3.
SELECT strftime('%Y/%m/%d', date) as vdate, count(DISTINCT vdate, id) AS totalcount
FROM cardtable
WHERE date BETWEEN date('" + $rootScope.mydate + "', 'start of month')
               AND date('" + $rootScope.mydate + "', 'start of month', '+1 month', '-1 day')
GROUP BY vdate
Basically, I want to count rows where id and date together are distinct. For example, if there are 2 entries on date 2019-08-01 with the same id, the count should be 1; if there are 3 entries on 2019-08-01, of which 2 have id 1 and the 3rd has id 2, the count should be 2; and if there are 2 entries with id 1 on different dates, say one on 2019-08-01 and the other on 2019-08-31, the count for month 8 should be 2. How can I modify the above query?
Use a subquery which returns the distinct values that you want to count:
SELECT COUNT(*) AS totalcount
FROM (
    SELECT DISTINCT strftime('%Y/%m/%d', date), id
    FROM cardtable
    WHERE date(date) BETWEEN
          date('" + $rootScope.mydate + "', 'start of month')
          AND
          date('" + $rootScope.mydate + "', 'start of month', '+1 month', '-1 day')
)
See the demo.
Results:
| totalcount |
| ---------- |
| 3 |
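To see the subquery in action, here is a sketch using Python's sqlite3 module; a literal date '2019-08-15' stands in for the $rootScope.mydate application variable:

```python
import sqlite3

# Sample data from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cardtable (id INTEGER, date TEXT, name TEXT);
INSERT INTO cardtable VALUES
  (1, '2019-08-01 00:00:00', 'abc'),
  (1, '2019-08-01 00:00:00', 'def'),
  (2, '2019-08-01 00:00:00', 'pqr'),
  (1, '2019-08-31 00:00:00', 'def');
""")

# The inner SELECT DISTINCT keeps one row per (day, id) pair; the outer
# COUNT(*) counts those pairs within the month.
(total,) = con.execute("""
SELECT COUNT(*) AS totalcount
FROM (
    SELECT DISTINCT strftime('%Y/%m/%d', date), id
    FROM cardtable
    WHERE date(date) BETWEEN
          date('2019-08-15', 'start of month')
          AND
          date('2019-08-15', 'start of month', '+1 month', '-1 day')
)
""").fetchone()
print(total)  # -> 3
```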

Efficient Date Case Logic in Hive

I have a table of roughly 65M records in Hive that contains patient, facility, service start, and service end dates. The table looks similar to the MWE below:
CREATE TABLE <your_db>.example
(accountId string,
provider string,
startdate timestamp,
enddate timestamp);
INSERT INTO TABLE <your_db>.example VALUES
('123A', 'smith', '2019-03-01 00:00:00', '2019-03-04 00:00:00'),
('456B', 'rogers', '2019-03-02 00:00:00', '2019-03-03 00:00:00'),
('123A', 'smith', '2019-03-03 00:00:00', '2019-03-06 00:00:00'),
('123A', 'smith', '2019-03-07 00:00:00', '2019-03-08 00:00:00'),
('456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00'),
('456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00'),
('123A', 'smith', '2019-03-10 00:00:00', '2019-03-12 00:00:00');
SELECT * FROM <your_db>.example;
# example.accountid example.provider example.startdate example.enddate
#1 123A smith 2019-03-01 00:00:00.0 2019-03-04 00:00:00.0
#2 456B rogers 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 123A smith 2019-03-03 00:00:00.0 2019-03-06 00:00:00.0
#4 123A smith 2019-03-07 00:00:00.0 2019-03-08 00:00:00.0
#5 456B daniels 2019-03-04 00:00:00.0 2019-03-05 00:00:00.0
#6 456B daniels 2019-03-06 00:00:00.0 2019-03-09 00:00:00.0
#7 123A smith 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
I want to define the continuous startdate and enddate for each accountId and provider combination, where there is no more than 1 day between a record's enddate and the next record's startdate, and then calculate the number of days in the continuous block (called "los" for length-of-stay). This grouping is called a "case". Below is what the case output needs to look like:
# results.accountid results.provider results.los results.startdate results.enddate
#1 123A smith 7 2019-03-01 00:00:00.0 2019-03-08 00:00:00.0
#2 456B rogers 1 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 456B daniels 5 2019-03-04 00:00:00.0 2019-03-09 00:00:00.0
#4 123A smith 2 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
We are currently using the accepted answer to this question, but it becomes a very expensive operation with our actual (65M record) table. I'm thinking that a more efficient solution would be to first consolidate and define each cases' startdate and enddate, and then run a datediff calculation (instead of exploding each date range), but I'm not sure how to pull that off in HiveQL.
Thanks in advance!
Digging through our company's repos, I found the creative solution below that does what we're looking for. I have yet to test its performance improvement over the current 'explode' solution. It does what I asked for in the original question, but it is a bit complex (though well commented).
/*
STEP 1: Input
*/
DROP TABLE IF EXISTS <your_db>.tmp_completedatepairs;
CREATE TABLE <your_db>.tmp_completedatepairs AS
SELECT CONCAT(isnull(accountid, ''), "-", isnull(provider, '')) AS tag
, startdate
, enddate
FROM <your_db>.example
WHERE startdate IS NOT NULL
AND enddate IS NOT NULL;
/*
STEP 2: Create new pairs of start and end dates that are
better time span tiles across the stay period
*/
DROP TABLE IF EXISTS <your_db>.tmp_respaned_input;
CREATE TABLE <your_db>.tmp_respaned_input AS
SELECT SD.tag
, SD.startdate
, ED.enddate
FROM (SELECT *
, ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS rnsd
FROM <your_db>.tmp_completedatepairs) AS SD
LEFT JOIN
(SELECT *
, ROW_NUMBER() OVER (PARTITION BY tag ORDER BY enddate ASC) AS rned
FROM <your_db>.tmp_completedatepairs) AS ED
ON SD.tag=ED.tag
AND SD.rnsd=ED.rned;
/*
STEP 3: Find gaps >1day and define stays around them
This consists of several substeps:
(a) Isolate all start dates that are more than 1 day after a preceding end date with the same tag, or are the earliest date for the tag. Number them in order.
(b) Isolate all end dates that are more than 1 day before a following start date with the same tag, or are the last date for the tag. Number them in order.
(c) Match together corresponding start and end dates after SELECTing only those dates that terminate a case (rather than dates that occur within case boundaries)
*/
DROP TABLE IF EXISTS <your_db>.results;
CREATE TABLE <your_db>.results AS
-- (c) Match together corresponding start and end dates after SELECTing only those dates that terminate a case (rather than dates that occur within case boundaries)
SELECT SPLIT(tag,'-')[0] AS accountid
, SPLIT(tag,'-')[1] AS provider
, DATEDIFF(enddate, startdate) AS los
, startdate
, enddate
FROM
-- (a) Isolate all start dates that are more than 1 day after a preceding end date with the same tag, or are the earliest date for the tag. Number them in order.
(SELECT tag
, startdate
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS string)) AS rnlink
FROM (SELECT L.tag
, L.startdate AS startdate
, DATEDIFF(L.startdate, R.enddate) AS d
FROM (SELECT *
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS string)) AS rnstart
FROM <your_db>.tmp_respaned_input) L
LEFT JOIN
(SELECT *
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY enddate ASC) + 1 AS string)) AS rnstart
FROM <your_db>.tmp_respaned_input) R
ON L.rnstart = R.rnstart) X
WHERE d > 1 OR d IS NULL) S
LEFT JOIN
-- (b) Isolate all end dates that are more than 1 day before a following start date with the same tag, or are the last date for the tag. Number them in order.
(SELECT enddate
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY enddate ASC) AS string)) AS rnlink
FROM (SELECT L.tag
, L.enddate AS enddate
, DATEDIFF(R.startdate, L.enddate) AS d
FROM (SELECT *
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY enddate ASC) AS string)) AS rnend
FROM <your_db>.tmp_respaned_input) L
LEFT JOIN
(SELECT *
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY startdate ASC) - 1 AS string)) AS rnend
FROM <your_db>.tmp_respaned_input) R
ON L.rnend = R.rnend) X
WHERE d > 1 or d IS NULL) E
ON S.rnlink = E.rnlink;
-- Print results
SELECT *
FROM <your_db>.results
ORDER BY startdate ASC;
# results.accountid results.provider results.los results.startdate results.enddate
#1 123A smith 7 2019-03-01 00:00:00.0 2019-03-08 00:00:00.0
#2 456B rogers 1 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 456B daniels 5 2019-03-04 00:00:00.0 2019-03-09 00:00:00.0
#4 123A smith 2 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
This is my solution, please see comments in the code:
--configuration
set hive.cli.print.header=true;
set hive.execution.engine=tez;
set hive.mapred.reduce.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=36;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.reduce.groupby.enabled=true;
set hive.map.aggr=true;
with example as (--this is your data example
select stack (9, '123A', 'smith', '2019-03-01 00:00:00', '2019-03-04 00:00:00',
'456B', 'rogers', '2019-03-02 00:00:00', '2019-03-03 00:00:00',
'123A', 'smith', '2019-03-03 00:00:00', '2019-03-06 00:00:00',
'123A', 'smith', '2019-03-07 00:00:00', '2019-03-08 00:00:00',
'456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00',
'456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00',
'123A', 'smith', '2019-03-10 00:00:00', '2019-03-12 00:00:00',
--I added one more case
'123A', 'smith', '2019-03-14 00:00:00', '2019-03-17 00:00:00',
'123A', 'smith', '2019-03-18 00:00:00', '2019-03-19 00:00:00'
) as (accountId, provider, startdate, enddate )
)
select --aggregate start and end dates for the whole case, count LOS
accountId, provider, datediff(max(enddate),min(startdate)) as los, min(startdate) startdate , max(enddate) enddate
from
(
select --distribute case_id across all records in the same case
accountId, provider, startdate, enddate,
last_value(case_id, true) over(partition by accountid, same_case_flag order by startdate ) as case_id --Bingo!!! we have case_id
from
(
select --generate UUID as case_id if previous same_case_flag != current one or previous was NULL.
--One UUID will be generated for each new case
accountId, provider, startdate, enddate, same_case_flag,
case when lag(same_case_flag) over(partition by accountid order by startdate) = same_case_flag
then NULL else java_method("java.util.UUID", "randomUUID")
end case_id
from
(
select --calculate same case flag
accountId, provider, startdate, enddate,
case when datediff(startdate,lag(enddate) over(partition by accountId order by startdate)) <=1 --startdate - prev_enddate
OR
datediff(lead(startdate) over(partition by accountId order by startdate), enddate) <=1 --next_startdate-enddate
then true else false
end as same_case_flag
from example s
)s)s)s
group by accountId, provider, case_id
order by startdate; --remove the ORDER BY if not necessary, to speed up processing! I added it to get the same ordering as in your example
Result:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 4 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 5 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 6 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 06/06 [==========================>>] 100% ELAPSED TIME: 10.79 s
--------------------------------------------------------------------------------
OK
accountid provider los startdate enddate
123A smith 7 2019-03-01 00:00:00 2019-03-08 00:00:00
456B rogers 1 2019-03-02 00:00:00 2019-03-03 00:00:00
456B daniels 5 2019-03-04 00:00:00 2019-03-09 00:00:00
123A smith 2 2019-03-10 00:00:00 2019-03-12 00:00:00
123A smith 5 2019-03-14 00:00:00 2019-03-19 00:00:00
Time taken: 29.049 seconds, Fetched: 5 row(s)
Remove the ORDER BY to get rid of the last reducer.
Depending on your data, for assigning case_id you could probably use concat(accountid, rand()), or also concat the startdate, or something like this instead of randomUUID if there are rare separate cases with the same accountid, but randomUUID is safer because it is always unique.
This approach does not use joins at all.
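For readers who find the nested HiveQL hard to follow, here is a plain-Python sketch of the underlying case logic (the 1-day gap threshold and the min/max aggregation), not a translation of either query:

```python
from datetime import datetime, timedelta
from itertools import groupby

# The example rows from the question: (accountId, provider, startdate, enddate).
rows = [
    ('123A', 'smith',   '2019-03-01 00:00:00', '2019-03-04 00:00:00'),
    ('456B', 'rogers',  '2019-03-02 00:00:00', '2019-03-03 00:00:00'),
    ('123A', 'smith',   '2019-03-03 00:00:00', '2019-03-06 00:00:00'),
    ('123A', 'smith',   '2019-03-07 00:00:00', '2019-03-08 00:00:00'),
    ('456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00'),
    ('456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00'),
    ('123A', 'smith',   '2019-03-10 00:00:00', '2019-03-12 00:00:00'),
]

def parse(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

# For each (accountId, provider), walk stays in startdate order; extend the
# open case while the gap to its enddate is <= 1 day, else close it and
# start a new one. los = datediff(max(enddate), min(startdate)).
cases = []
key = lambda r: (r[0], r[1])
for (acct, prov), grp in groupby(sorted(rows, key=lambda r: (r[0], r[1], r[2])), key):
    current = None
    for _, _, sd, ed in grp:
        sd, ed = parse(sd), parse(ed)
        if current and (sd - current[1]) <= timedelta(days=1):
            current[1] = max(current[1], ed)  # still the same case: extend it
        else:
            if current:
                cases.append((acct, prov, (current[1] - current[0]).days,
                              current[0], current[1]))
            current = [sd, ed]                # gap > 1 day: open a new case
    cases.append((acct, prov, (current[1] - current[0]).days,
                  current[0], current[1]))

for c in cases:
    print(c)
```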

Oracle: Creating a range on month between two dates

I have the table below; I also have a calendar table if needed.
ID Start_dt End_dt
1 1/9/2016 3/10/2016
Expected Output:
ID Start_dt End_dt Month ActiveCustomerPerMonth
1 1/9/16 3/10/2016 201601 1
1 1/9/16 3/10/2016 201602 1
1 1/9/16 3/10/2016 201603 0 (Not Active end of Month)
I need this because I'm working on a query that uses a case statement to determine whether the customer was active for that month. If the member was active on the last day of the month, the member is considered active for that month. But I need to be able to count all months for that customer.
CASE
WHEN LAST_DAY(x.END_DT) = x.END_DT
THEN '1'
WHEN TO_CHAR(X.END_DT,'MM/DD/YYYY') != '01/01/3000'
OR X.DISCHARGE_REASON IS NOT NULL
THEN '0'
WHEN X.FIRST_ASSGN_DT IS NULL
THEN '0'
ELSE '1'
END ActiveMemberForMonth
I'm new to Oracle and have been reading about CONNECT BY, but I didn't understand the process and am not sure whether this is the proper place to use it.
Something like this:
with
  test_data ( id, start_dt, end_dt ) as (
    select 1, to_date('1/9/2016',  'mm/dd/yyyy'), to_date('3/10/2016', 'mm/dd/yyyy') from dual
    union all
    select 2, to_date('1/23/2016', 'mm/dd/yyyy'), to_date('5/31/2016', 'mm/dd/yyyy') from dual
  )
-- end of test data; solution (SQL query) begins below this line
select id, start_dt, end_dt,
       to_char(add_months(trunc(start_dt, 'mm'), level - 1), 'yyyymm') as mth,
       case when end_dt < last_day(end_dt)
             and level = 1 + months_between(trunc(end_dt, 'mm'), trunc(start_dt, 'mm'))
            then 0 else 1
       end as active_at_month_end
from test_data
connect by level <= 1 + months_between(trunc(end_dt, 'mm'), trunc(start_dt, 'mm'))
       and prior id = id
       and prior sys_guid() is not null
order by id, mth  -- optional
;
ID START_DT END_DT MTH ACTIVE_AT_MONTH_END
--- ---------- ---------- ------ -------------------
1 2016-01-09 2016-03-10 201601 1
1 2016-01-09 2016-03-10 201602 1
1 2016-01-09 2016-03-10 201603 0
2 2016-01-23 2016-05-31 201601 1
2 2016-01-23 2016-05-31 201602 1
2 2016-01-23 2016-05-31 201603 1
2 2016-01-23 2016-05-31 201604 1
2 2016-01-23 2016-05-31 201605 1
8 rows selected.
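The CONNECT BY expansion can be cross-checked with a small Python sketch of the same logic: one row per month from start_dt through end_dt, active unless end_dt falls before the last day of its final month. The month_rows helper is illustrative, not part of the Oracle solution:

```python
from datetime import date
import calendar

def month_rows(id_, start_dt, end_dt):
    """One (id, yyyymm, active_at_month_end) row per month in the range."""
    rows = []
    y, m = start_dt.year, start_dt.month
    while (y, m) <= (end_dt.year, end_dt.month):
        # Last calendar day of the current month.
        last_day = date(y, m, calendar.monthrange(y, m)[1])
        is_final_month = (y, m) == (end_dt.year, end_dt.month)
        # Inactive only when the range ends before this month's last day.
        active = 0 if (is_final_month and end_dt < last_day) else 1
        rows.append((id_, f"{y}{m:02d}", active))
        m += 1
        if m == 13:
            y, m = y + 1, 1
    return rows

print(month_rows(1, date(2016, 1, 9), date(2016, 3, 10)))
# -> [(1, '201601', 1), (1, '201602', 1), (1, '201603', 0)]
```

Note how id 2 ends on 5/31/2016, the last day of May, so its final month stays active, matching the query output above.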
