I have a table with columns as id,date,name
id date name
1 2019-08-01 00:00:00 abc
1 2019-08-01 00:00:00 def
2 2019-08-01 00:00:00 pqr
1 2019-08-31 00:00:00 def
I want to get the count of id for given month.
The expected result for count of id for month 8 must be 3
SELECT strftime('%Y/%m/%d', date) as vdate,count(DISTINCT vdate,id) AS totalcount FROM cardtable WHERE date BETWEEN date('" + $rootScope.mydate + "', 'start of month') AND date('" + $rootScope.mydate + "','start of month','+1 month','-1 day') group by vdate
Basically i want to count if id and date both are distinct.for example if there are 2 entries on date 2019-08-01 with same id than it should give count as 1,if there 3 entries on date 2019-08-01 in which 2 entries are with id 1 and 3rd entry with 2 than it should count 2 and when there are 2 entries with id 1 and on different date lets say 1 entry on 2019-08-01 with id 1 and other on 2019-08-31 with id 1 than count id for month 8 must 2.How can i modify the above query.
Use a subquery which returns the distinct values that you want to count:
SELECT COUNT(*) AS totalcount
FROM (
SELECT DISTINCT strftime('%Y/%m/%d', date), id
FROM cardtable
WHERE date(date) BETWEEN
date('" + $rootScope.mydate + "', 'start of month')
AND
date('" + $rootScope.mydate + "','start of month','+1 month','-1 day')
)
See the demo.
Results:
| totalcount |
| ---------- |
| 3 |
I have a 65M~ record table in Hive that contains patient, facility, service start and service end dates. The table looks similar to the MWE below:
CREATE TABLE <your_db>.example
(accountId string,
provider string,
startdate timestamp,
enddate timestamp);
INSERT INTO TABLE <your_db>.example VALUES
('123A', 'smith', '2019-03-01 00:00:00', '2019-03-04 00:00:00'),
('456B', 'rogers', '2019-03-02 00:00:00', '2019-03-03 00:00:00'),
('123A', 'smith', '2019-03-03 00:00:00', '2019-03-06 00:00:00'),
('123A', 'smith', '2019-03-07 00:00:00', '2019-03-08 00:00:00'),
('456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00'),
('456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00'),
('123A', 'smith', '2019-03-10 00:00:00', '2019-03-12 00:00:00');
SELECT * FROM <your_db>.example;
# example.accountid example.provider example.startdate example.enddate
#1 123A smith 2019-03-01 00:00:00.0 2019-03-04 00:00:00.0
#2 456B rogers 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 123A smith 2019-03-03 00:00:00.0 2019-03-06 00:00:00.0
#4 123A smith 2019-03-07 00:00:00.0 2019-03-08 00:00:00.0
#5 456B daniels 2019-03-04 00:00:00.0 2019-03-05 00:00:00.0
#6 456B daniels 2019-03-06 00:00:00.0 2019-03-09 00:00:00.0
#7 123A smith 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
I want to define the continuous startdate and enddate for accountId and provider combination, where there is no more than 1 day between a record's enddate and the next record's startdate, then calculate the number of days in the continuous block (called "los" for length-of-stay). This grouping is called a "case". Below is what the case output needs to look like:
# results.accountid results.provider results.los results.startdate results.enddate
#1 123A smith 7 2019-03-01 00:00:00.0 2019-03-08 00:00:00.0
#2 456B rogers 1 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 456B daniels 5 2019-03-04 00:00:00.0 2019-03-09 00:00:00.0
#4 123A smith 2 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
We are currently using the accepted answer to this question, but it becomes a very expensive operation with our actual (65M record) table. I'm thinking that a more efficient solution would be to first consolidate and define each cases' startdate and enddate, and then run a datediff calculation (instead of exploding each date range), but I'm not sure how to pull that off in HiveQL.
Thanks in advance!
Digging through our company's repos, I found the creative solution below that does what we're looking for. Have yet to test out its performance improvement over the current 'explode' solution. It does what I asked for in the original question, but it is a bit complex (though well commented).
/*
STEP 1: Input
*/
DROP TABLE IF EXISTS <your_db>.tmp_completedatepairs;
CREATE TABLE AS <your_db>.tmp_completedatepairs AS
SELECT CONCAT(isnull(accountid, ''), "-", isnull(provider, '')) AS tag
, startdate
, enddate
FROM <your_db>.example
WHERE startdate IS NOT NULL
AND enddate IS NOT NULL;
/*
STEP 2: Create new pairs of start and end dates that are
better time span tiles across the stay period
*/
DROP TABLE IF EXISTS <your_db>.tmp_respaned_input;
CREATE TABLE <your_db>.tmp_respaned_input AS
SELECT SD.tag
, SD.startdate
, ED.enddate
FROM (SELECT *
, ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS rnsd
FROM <your_db>.tmp_completedatepairs) AS SD
LEFT JOIN
(SELECT *
, ROW_NUMBER() OVER (PARTITION BY tag ORDER BY enddate ASC) AS rned
FROM <your_db>.tmp_completedatepairs) AS ED
ON SD.tag=ED.tag
AND SD.rnsd=ED.rned;
/*
STEP 3: Find gaps >1day and define stays around them
This consists of several substeps:
(a) Isolate all start dates that are more than 1 day after a preceding start date with the same tag, or are the earliest date for the tag. Number them in order.
(b) Isolate all end dates that are more than 1 day before a following end date with the same tag, or are the last date for the tag. Number them in order.
(c) Match together corresponding start and end dates after SELECTing only those dates that terminate a case (rather than dates that occur within case boundaries)
*/
DROP TABLE IF EXISTS <your_db>.results;
CREATE TABLE <your_db>.resuts AS
-- (c) Match together corresponding start and end dates after SELECTing only those dates that terminate a case (rather than dates that occur within case boundaries)
SELECT SPLIT(tag,'-')[0] AS accountid
, SPLIT(tag,'-')[1] AS provider
, DATEDIFF(enddate, startdate) AS los
, startdate
, enddate
FROM
-- (a) Isolate all start dates that are more than 1 day after a preceding end date with the same tag, or are the earliest date for the tag. Number them in order.
(SELECT tag
, startdate
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS string)) AS rnlink
FROM (SELECT L.tag
, L.startdate AS startdate
, DATEDIFF(L.startdate, R.enddate) AS d
FROM (SELECT *
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY startdate ASC) AS string)) AS rnstart
FROM <your_db>.tmp_respaned_input) L
LEFT JOIN
(SELECT *
, CONCAT(tag, CAST(ROW_NUMBER() OVER (PARTITION BY tag ORDER BY enddate ASC) + 1 AS string)) AS rnstart
FROM <your_db>.tmp_respaned_input) R
ON L.rnstart = R.rnstart) X
WHERE d > 1 OR d IS NULL) S
LEFT JOIN
-- (b) Isolate all end dates that are more than 1 day before a following start date with the same tag, or are the last date for the tag. Number them in order.
(SELECT enddate
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY enddate ASC) AS string)) AS rnlink
FROM (SELECT L.tag
, L.enddate AS enddate
, DATEDIFF(R.startdate, L.enddate) AS d
FROM (SELECT *
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY enddate ASC) AS string)) AS rnend
FROM <your_db>.tmp_respaned_input) L
LEFT JOIN
(SELECT *
, CONCAT(tag, CAST(row_number() over (PARTITION BY tag ORDER BY startdate ASC) - 1 AS string)) AS rnend
FROM <your_db>.tmp_respaned_input) R
ON L.rnend = R.rnend) X
WHERE d > 1 or d IS NULL) E
ON S.rnlink = E.rnlink;
-- Print results
SELECT *
FROM <your_db>.results
ORDER BY startdate ASC;
# results.accountid results.provider results.los results.startdate results.enddate
#1 123A smith 7 2019-03-01 00:00:00.0 2019-03-08 00:00:00.0
#2 456B rogers 1 2019-03-02 00:00:00.0 2019-03-03 00:00:00.0
#3 456B daniels 5 2019-03-04 00:00:00.0 2019-03-09 00:00:00.0
#4 123A smith 2 2019-03-10 00:00:00.0 2019-03-12 00:00:00.0
This is my solution, please see comments in the code:
--configuration
set hive.cli.print.header=true;
set hive.execution.engine=tez;
set hive.mapred.reduce.tasks.speculative.execution=false;
set mapred.reduce.tasks.speculative.execution=false;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=36;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.vectorized.execution.reduce.groupby.enabled=true;
set hive.map.aggr=true;
with example as (--this is your data example
select stack (9, '123A', 'smith', '2019-03-01 00:00:00', '2019-03-04 00:00:00',
'456B', 'rogers', '2019-03-02 00:00:00', '2019-03-03 00:00:00',
'123A', 'smith', '2019-03-03 00:00:00', '2019-03-06 00:00:00',
'123A', 'smith', '2019-03-07 00:00:00', '2019-03-08 00:00:00',
'456B', 'daniels', '2019-03-04 00:00:00', '2019-03-05 00:00:00',
'456B', 'daniels', '2019-03-06 00:00:00', '2019-03-09 00:00:00',
'123A', 'smith', '2019-03-10 00:00:00', '2019-03-12 00:00:00',
--I added one more case
'123A', 'smith', '2019-03-14 00:00:00', '2019-03-17 00:00:00',
'123A', 'smith', '2019-03-18 00:00:00', '2019-03-19 00:00:00'
) as (accountId, provider, startdate, enddate )
)
select --aggregate start and end dates for the whole case, count LOS
accountId, provider, datediff(max(enddate),min(startdate)) as los, min(startdate) startdate , max(enddate) enddate
from
(
select --distribute case_id across all records in the same case
accountId, provider, startdate, enddate,
last_value(case_id, true) over(partition by accountid, same_case_flag order by startdate ) as case_id --Bingo!!! we have case_id
from
(
select --generate UUID as case_id if previous same_case_flag != current one or previous was NULL.
--One UUID will be generated for each new case
accountId, provider, startdate, enddate, same_case_flag,
case when lag(same_case_flag) over(partition by accountid order by startdate) = same_case_flag
then NULL else java_method("java.util.UUID", "randomUUID")
end case_id
from
(
select --calculate same case flag
accountId, provider, startdate, enddate,
case when datediff(startdate,lag(enddate) over(partition by accountId order by startdate)) <=1 --startdate - prev_enddate
OR
datediff(lead(startdate) over(partition by accountId order by startdate), enddate) <=1 --next_startdate-enddate
then true else false
end as same_case_flag
from example s
)s)s)s
group by accountId, provider, case_id
order by startdate; --remove order by if not necessary to sppeed-up processing !!! I added it to get the same ordering as in your example
Result:
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 4 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 5 ...... SUCCEEDED 1 1 0 0 0 0
Reducer 6 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 06/06 [==========================>>] 100% ELAPSED TIME: 10.79 s
--------------------------------------------------------------------------------
OK
accountid provider los startdate enddate
123A smith 7 2019-03-01 00:00:00 2019-03-08 00:00:00
456B rogers 1 2019-03-02 00:00:00 2019-03-03 00:00:00
456B daniels 5 2019-03-04 00:00:00 2019-03-09 00:00:00
123A smith 2 2019-03-10 00:00:00 2019-03-12 00:00:00
123A smith 5 2019-03-14 00:00:00 2019-03-19 00:00:00
Time taken: 29.049 seconds, Fetched: 5 row(s)
Remove order to get rid of last reducer.
Depending on your date, probably for assigning case_id you can use concat(accountid, rand()) or concat also startdate, or something like this instead of randomUUID if there are rear separate cases with the same accountid, but randomUUID is safer because it is always unique.
This approach does not use joins at all.
enter image description here
How to generate dates between tow date column based on each row
A row generator technique should be used, such as:
SQL> alter session set nls_date_format = 'dd.mm.yyyy';
Session altered.
SQL> with test (sno, start_date, end_date) as
2 (select 1, date '2018-01-01', date '2018-01-05' from dual union
3 select 2, date '2018-01-03', date '2018-01-05' from dual
4 )
5 select sno, start_date + column_value - 1 datum
6 from test,
7 table(cast(multiset(select level from dual
8 connect by level <= end_date - start_date + 1)
9 as sys.odcinumberlist))
10 order by sno, datum;
SNO DATUM
---------- ----------
1 01.01.2018
1 02.01.2018
1 03.01.2018
1 04.01.2018
1 05.01.2018
2 03.01.2018
2 04.01.2018
2 05.01.2018
8 rows selected.
SQL>
I need the the below output on Teradata :
DATE_HOME WORKING_DAY
01/01/2018 0
02/01/2018 1
03/01/2018 1
04/01/2018 1
05/01/2018 1
06/01/2018 0
07/01/2018 0
08/01/2018 1
09/01/2018 1
Output required
DATE_HOME WORKING_DAY Updated_DATE
01/01/2018 0 02/01/2018
02/01/2018 1 02/01/2018
03/01/2018 1 03/01/2018
04/01/2018 1 04/01/2018
05/01/2018 1 05/01/2018
06/01/2018 0 08/01/2018
07/01/2018 0 08/01/2018
08/01/2018 1 08/01/2018
09/01/2018 1 09/01/2018
That's a simple task for first_value:
first_value(case when WORKING_DAY = 1 then DATE_HOME end ignore nulls)
over (order by DATE_HOME
rows between current date and unbounded following)
Change the non-business dates into NULL and then search for the first non-NULL value.
Edit:
In fact there's no need for first_value as you sort by the same column, a simple min works, too:
min(case when WORKING_DAY = 1 then DATE_HOME end)
over (order by DATE_HOME
rows between current date and unbounded following)
Good lord, this is ugly, but it seems to work. I don't have access to a TD system, so it's more verbose than it could be:
SELECT
date_home,
working_day,
CAST(
CASE
-- If current date is a non-work day, add appropriate number of days to non-work day to get next work-day
WHEN working_day = 0 THEN date_home + INTERVAL '1' DAY * (ROW_NUMBER() OVER(PARTITION BY update_date ORDER BY date_home DESC))
ELSE date_home
END
AS DATE) AS Update_date
FROM (
SELECT
date_home,
working_day,
CASE
-- If current and previous days were non-work days, group row by PrevWorkDay value
WHEN MIN(working_day) OVER(ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) = 0 AND working_day = 0 THEN MIN(PrevWorkDay) OVER(ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
ELSE PrevWorkDay
END AS Update_Date
FROM (
SELECT
date_home,
working_day,
CASE
-- Track "baseline" previous date_home value for new group of "non-work day" rows
WHEN COALESCE(MIN(working_day) OVER(ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 1) = 1 AND working_day = 0 THEN COALESCE(MIN(Date_Home) OVER(ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), date_home)
ELSE NULL
END AS PrevWorkDay
FROM holiday_calendar
) src
) src
ORDER BY date_home
This assumes your source data is stored in a table called "holiday_calendar".
The COALESCE's are used to handle the first row in the result set, which can't compute values for the previous row, since there is no previous row.
Give it a try and let me know.
I have a table(pay_period) as following
pay_period
period_id list_id start_date end_date price
1 100 2017-01-01 2017-08-31 100
2 100 2017-09-01 2017-12-31 110
3 101 2017-01-01 2017-08-31 75
Now I have list_id, checkin_date, checkout_date
list_id 100
checkin_date 2017-08-25
checkout_date 2017-09-10
I need to calculate the price of a list for the period from checkin date to checkout date.
therefore the calculation is supposed to be
7 * 100 + 10 * 110
I am thinking to do it with a for loop, if there is any other better way to do it, can you please suggest?
You have to see if the checkin_date and checkout_date are into the same period_id.
1.1 If yes, you multiply the price with the nunmber of days.
1.2 If no, you have count the days between checkin_day untill the end of your period 1 and multiply with the corresponding price, then do the same with checkout_day and beginning of next period.
Note: i guess it might happen to have more than 2 prices per list_id. for example:
period_id list_id start_date end_date price
1 100 2017-01-01 2017-04-30 100
2 100 2017-05-01 2017-09-30 110
3 100 2017-10-01 2017-12-31 120
4 101 2017-01-01 2017-08-31 75
and the calculation period to be:
list_id 100
checkin_date 2017-03-01
checkout_date 2017-11-10
In this case, yes, the solution would be to have a CURSOR where to keep the prices for list_id and periods; loop through it and compare the checkin_date and checkout_date with each record.
Best,
Mikcutu.
You can do the following for a much cleaner code. Although it is purely sql, I am using a function to make it code better to understand.
Create a generic function which gets you the number of overlapping days in 2 different date range.
CREATE OR REPLACE FUNCTION fn_count_range
( p_start_date1 IN DATE,
p_end_date1 IN DATE,
p_start_date2 IN DATE,
p_end_date2 IN DATE ) RETURN NUMBER AS
v_days NUMBER;
BEGIN
IF p_end_date1 < p_start_date1 OR p_end_date2 < p_start_date2 THEN
RETURN 0;
END IF;
SELECT COUNT(*) INTO v_days
FROM (
(SELECT p_start_date1 + LEVEL - 1
FROM dual CONNECT BY LEVEL <= p_end_date1 - p_start_date1 + 1 ) INTERSECT
(SELECT p_start_date2 + LEVEL - 1
FROM dual CONNECT BY LEVEL <= p_end_date2 - p_start_date2 + 1 ) );
RETURN v_days;
END;
/
Now, your query to calculate the total price is simplified.
WITH lists ( list_id,
checkin_date,
checkout_date) AS
( SELECT 100,
TO_DATE('2017-08-25','YYYY-MM-DD'),
TO_DATE('2017-09-10','YYYY-MM-DD')
FROM dual) --Not required if you have a lists table.
SELECT l.list_id,
SUM(fn_count_range(start_date,end_date,checkin_date,checkout_date) * price) total_price
FROM pay_period p
JOIN lists l ON p.list_id = l.list_id
GROUP BY l.list_id;