Continuous Date Span T-Sql - teradata

I am trying to combine continuous date spans whenever they exist
ID_NBR START_DT END_DT
22 20120101 20120131
22 20120201 20120731
22 20120801 20121231
22 20130201 20131231
22 20140101 20151231
22 20160101 20160131
22 20160201 20160430
22 20160601 20160630
22 20160701 99991231
and want the result to be like below:
ID_NBR START_DT END_DT
22 20120101 20121231
22 20130201 20160430
22 20160601 99991231
Obviously I'm not trying to be spoon-fed, so here is what I have so far, but I really think there has to be a simpler way:
SELECT
s1.ID_NBR,
s1.START_DT,
MIN(t1.END_DT) AS END_DT,
ROW_NUMBER() OVER(ORDER BY s1.START_DT) AS Sequence_ID
FROM MEM s1
INNER JOIN MEM t1
ON t1.ID_NBR=s1.ID_NBR
AND s1.START_DT <= t1.END_DT
AND NOT EXISTS (
SELECT*FROM MEM t2
WHERE t2.ID_NBR=t1.ID_NBR
AND (t1.END_DT+1) >= t2.START_DT
AND t1.END_DT < t2.END_DT
)
WHERE NOT EXISTS(SELECT * FROM MEM s2
WHERE s2.ID_NBR=s1.ID_NBR
AND s1.START_DT > s2.START_DT AND (s1.START_DT-1) <= s2.END_DT)
GROUP BY s1.ID_NBR,s1.START_DT

In Teradata TD14.10 there's a simple way to combine overlapping periods using SELECT NORMALIZE. The implementation is based on the PERIOD data type, which includes the start date but excludes the end date. As your data includes the end date, you must adjust it for the calculation and adjust it back when splitting the period into separate columns again:
SELECT ID_NBR,
Begin(pd), -- get the start date
Last(pd) -- adjust the end date
FROM
(
SELECT NORMALIZE
ID_NBR,
-- periods are [inclusive..exclusive[
PERIOD(START_DT,CASE WHEN END_DT = DATE '9999-12-31' THEN END_DT ELSE END_DT + 1 END) AS pd
FROM tab
) AS dt
If your dates are actually Decimal(38,0) (which is quite wrong), you need to cast them to dates first using
Cast(start_dt - 19000000 AS DATE)
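For readers who want to check the expected output outside the database, the merge rule behind the question can be sketched in plain Python. This is only an illustration of the logic, not of how NORMALIZE is implemented; the function name merge_spans and the tuple layout are assumptions made for this sketch.

```python
from datetime import date, timedelta

def merge_spans(rows):
    """Merge (id, start, end) spans that overlap or touch (end + 1 day == next start)."""
    merged = []
    for id_nbr, start, end in sorted(rows):
        if merged and merged[-1][0] == id_nbr:
            last_end = merged[-1][2]
            # 9999-12-31 is date.max and cannot be incremented, so leave it as-is.
            limit = last_end if last_end == date.max else last_end + timedelta(days=1)
            if start <= limit:
                merged[-1][2] = max(last_end, end)
                continue
        merged.append([id_nbr, start, end])
    return [tuple(m) for m in merged]

# Sample rows from the question.
rows = [
    (22, date(2012, 1, 1), date(2012, 1, 31)),
    (22, date(2012, 2, 1), date(2012, 7, 31)),
    (22, date(2012, 8, 1), date(2012, 12, 31)),
    (22, date(2013, 2, 1), date(2013, 12, 31)),
    (22, date(2014, 1, 1), date(2015, 12, 31)),
    (22, date(2016, 1, 1), date(2016, 1, 31)),
    (22, date(2016, 2, 1), date(2016, 4, 30)),
    (22, date(2016, 6, 1), date(2016, 6, 30)),
    (22, date(2016, 7, 1), date(9999, 12, 31)),
]
result = merge_spans(rows)
for row in result:
    print(row)
```

The three rows printed correspond to the three spans the question asks for.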

Related

Impala - Working hours between two dates in impala

I have two timestamps, #starttimestamp and #endtimestamp. How do I calculate the number of working hours between the two?
Working hours are defined as follows:
Mon-Thursday (9:00-17:00)
Friday (9:00-13:00)
This has to work in Impala.
I think I found a better solution.
We will create a series of numbers using a large table. You can use a time-dimension table too. Make sure it doesn't get truncated. I am using a large table from my db.
Use this series to generate a date range between the start and end date.
date_add (t.start_date,rs.uniqueid) -- create range of dates
join (select row_number() over ( order by mycol) as uniqueid -- create range of unique ids
from largetab) rs
where end_date >=date_add (t.start_date,rs.uniqueid)
Then we calculate the total difference in hours between the two timestamps using unix_timestamp, which takes both the date and the time into account.
(unix_timestamp(endtimestamp) - unix_timestamp(starttimestamp)) / 3600
Exclude the non-working hours: 16 hours on Mon-Thu, 20 hours on Fri, and 24 hours on Sat-Sun. Note that Impala's dayofweek returns 1 for Sunday through 7 for Saturday, so Friday is 6.
case when dayofweek ( dday) in (1,7) then 24
when dayofweek ( dday) =6 then 20
else 16 end as non_work_hours
Here is the complete SQL.
select
end_date, start_date,
diff_in_hr - sum(case when dayofweek ( dday) in (1,7) then 24
when dayofweek ( dday) =6 then 20
else 16 end ) total_workhrs
from (
select (unix_timestamp(end_date)- unix_timestamp(start_date))/3600 as diff_in_hr , end_date, start_date,date_add (t.start_date,rs.uniqueid) as dDay
from tdate t
join (select row_number() over ( order by mycol) as uniqueid from largetab) rs
where end_date >=date_add (t.start_date,rs.uniqueid)
)rs2
group by 1,2,diff_in_hr
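To sanity-check numbers coming out of a query like this, the working-hours rule from the question can be sketched in Python. It is a different technique from the SQL above (it adds up the overlap with each day's working window instead of subtracting non-working hours), and the names WORK_WINDOWS and working_hours are made up for this sketch.

```python
from datetime import datetime, time, timedelta

# Working windows by weekday (Monday=0 ... Sunday=6), as defined in the question.
WORK_WINDOWS = {
    0: (time(9), time(17)),
    1: (time(9), time(17)),
    2: (time(9), time(17)),
    3: (time(9), time(17)),
    4: (time(9), time(13)),  # Friday is a short day
}

def working_hours(start, end):
    """Sum the overlap between [start, end] and each day's working window."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        window = WORK_WINDOWS.get(day.weekday())  # None on Saturday and Sunday
        if window:
            win_start = datetime.combine(day, window[0])
            win_end = datetime.combine(day, window[1])
            overlap = min(end, win_end) - max(start, win_start)
            if overlap > timedelta():
                total += overlap
        day += timedelta(days=1)
    return total.total_seconds() / 3600

# 2024-01-04 is a Thursday, 2024-01-08 the following Monday.
print(working_hours(datetime(2024, 1, 4, 16), datetime(2024, 1, 8, 10)))
```

In this example the Thursday contributes 1 hour, the Friday 4 hours, the weekend nothing, and the Monday 1 hour.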

MariaDB running total up to N and rows NOT included in its calculation

I have a table which, amongst other columns, has amt and created (a timestamp). I'm trying to do two things:
1. Calculate the running total of amt up to N.
2. Get all the rows not included in the calculation leading to the sum up to N.
I'm doing this in code but was wondering if there is a way to get these with SQL, ideally in one query.
Looking around, it's easy to find examples of calculating the running total, like
https://stackoverflow.com/a/1290936/400048, but less so to find a running total up to N that then returns only the rows not involved in calculating N.
You can use the window version of the SUM aggregate function to get the running total for each row.
CREATE TABLE TEST (ID BIGINT PRIMARY KEY, AMT INT, CREATED TIMESTAMP);
INSERT INTO TEST VALUES
(1, 1, TIMESTAMP '2000-01-01 00:00:00'),
(2, 2, TIMESTAMP '2000-01-02 00:00:00'),
(3, 1, TIMESTAMP '2000-01-03 00:00:00'),
(4, 3, TIMESTAMP '2000-01-04 00:00:00'),
(5, 5, TIMESTAMP '2000-01-05 00:00:00'),
(6, 1, TIMESTAMP '2000-01-07 00:00:00');
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST ORDER BY CREATED;
> ID AMT RT CREATED
> -- --- -- -------------------
> 1 1 1 2000-01-01 00:00:00
> 2 2 3 2000-01-02 00:00:00
> 3 1 4 2000-01-03 00:00:00
> 4 3 7 2000-01-04 00:00:00
> 5 5 12 2000-01-05 00:00:00
> 6 1 13 2000-01-07 00:00:00
Then you can use a non-standard QUALIFY clause in H2 or a subquery (in both MariaDB and H2) to filter out rows below the limit.
If N is a running total limit and by “rows not included in the calculation” you mean rows above the limit, the queries will look like these:
-- Simple non-standard query for H2
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
QUALIFY RT > 10 ORDER BY CREATED;
-- Equivalent standard query with subquery for MariaDB, H2, and many others
SELECT * FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT > 10 ORDER BY CREATED;
> ID AMT RT CREATED
> -- --- -- -------------------
> 5 5 12 2000-01-05 00:00:00
> 6 1 13 2000-01-07 00:00:00
RT - AMT in the first row here is a running total of all previous rows. You can select it separately, if you wish:
-- Non-standard query for H2
SELECT SUM(AMT) OVER (ORDER BY CREATED) RT FROM TEST
QUALIFY RT < 10 ORDER BY CREATED DESC FETCH FIRST ROW ONLY;
-- Non-standard query for MariaDB or H2
SELECT RT FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT < 10 ORDER BY CREATED DESC LIMIT 1;
-- Standard query for H2 and others (but not for MariaDB)
SELECT RT FROM (
SELECT ID, AMT, SUM(AMT) OVER (ORDER BY CREATED) RT, CREATED FROM TEST
) T WHERE RT < 10 ORDER BY CREATED DESC FETCH FIRST ROW ONLY;
> RT
> --
> 7
If you meant something else, the QUALIFY or WHERE criteria will be different.
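Since the question mentions doing this in application code, here is a minimal Python sketch of the same two results, using the sample amounts from the TEST table above and the same limit of 10. The variable names are chosen for this sketch only.

```python
from itertools import accumulate

# (id, amt) pairs from the TEST table, already ordered by CREATED.
rows = [(1, 1), (2, 2), (3, 1), (4, 3), (5, 5), (6, 1)]
limit = 10

# Running total per row, equivalent to SUM(AMT) OVER (ORDER BY CREATED).
running = list(accumulate(amt for _, amt in rows))

# Rows whose running total is above the limit (the WHERE RT > 10 query).
above = [(row_id, amt, rt) for (row_id, amt), rt in zip(rows, running) if rt > limit]

# Last running total still below the limit (the WHERE RT < 10 ... LIMIT 1 query).
below = [rt for rt in running if rt < limit]
last_below = below[-1] if below else None

print(running)
print(above)
print(last_below)
```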

Google Analytics: Conversion rate in time intervals

I've got Google Analytics running on a website and am now trying to determine the conversion rate in certain time intervals. I therefore have a table that contains
interval_id
i.interval_start_time_utc
i.interval_stop_time_utc
Sadly, the following BigQuery query that would assign each order to an interval will not work:
SELECT
totals.transactions,
totals.visits,
i.interval_id
FROM [123456.ga_sessions_20160609]
INNER JOIN intervals i ON i.interval_start_time_utc < visitStartTime AND visitStartTime < i.interval_end_time_utc
This throws the error
ON clause must be AND of = comparisons of one field name from each table [...]
so I gather that BigQuery simply doesn't do range joins. Is there another way to do this short of doing a full join and then paring down? Are there entirely different, better approaches for this sort of thing?
BigQuery Standard SQL doesn't have this limitation - see Enabling Standard SQL
If you want to do this with BigQuery Legacy SQL, try something like the following:
SELECT
totals.transactions,
totals.visits,
i.interval_id
FROM [123456.ga_sessions_20160609]
CROSS JOIN intervals i
WHERE i.interval_start_time_utc < visitStartTime
AND visitStartTime < i.interval_end_time_utc
To present the idea, let's simplify the example.
And let's remember: we want to do this with BigQuery Legacy SQL, not with Standard SQL, where it is trivial!
Challenge
Assume we have visits table:
SELECT visit_time FROM
(SELECT 2 AS visit_time),
(SELECT 12 AS visit_time),
(SELECT 22 AS visit_time),
(SELECT 32 AS visit_time)
and intervals table:
SELECT before, after, event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
We want to extract all visits which fall within an event's before and after values.
This can be done simply with a CROSS JOIN like below:
SELECT
visit_time, event, before, after
FROM (
SELECT visit_time FROM
(SELECT 2 AS visit_time),
(SELECT 12 AS visit_time),
(SELECT 22 AS visit_time),
(SELECT 32 AS visit_time)
) AS visits
CROSS JOIN (
SELECT before, after, event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
) AS intervals
WHERE visit_time BETWEEN before AND after
With result as:
visit_time event before after
2 3 1 5
22 23 21 25
Potential Issue
When both tables are big enough, this cross join becomes quite expensive!
Hint
It turned out (from the user's comments) that the intervals always extend x units to the left and right of the event.
Solution
Below is a proposed solution that uses this fact and makes use of a JOIN instead of a CROSS JOIN between two big tables.
The key here is to generate (on the fly) a new table that holds all possible interval values based on event and x:
SELECT event, event + delta AS point
FROM (
SELECT event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
) AS events
CROSS JOIN (
SELECT pos - 1 - 2 AS delta FROM (
SELECT ROW_NUMBER() OVER() AS pos, * FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + 2 * 2, '.'),'') AS h FROM (SELECT NULL)),h
)))
) AS deltas
In the above code x = 2, but you can change it in two places; for example, if x = 5 you should have
SELECT pos - 1 - 5 AS delta FROM (
SELECT ROW_NUMBER() OVER() AS pos, * FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + 2 * 5, '.'),'') AS h FROM (SELECT NULL)),h
)))
The CROSS JOIN in the above code is inexpensive because the deltas table is quite small.
So, finally, you can get your result with the query below:
SELECT
visit_time, event
FROM (
SELECT visit_time FROM
(SELECT 2 AS visit_time),
(SELECT 12 AS visit_time),
(SELECT 22 AS visit_time),
(SELECT 32 AS visit_time)
) AS visits
JOIN (
SELECT event, event + delta AS point
FROM (
SELECT event FROM
(SELECT 1 AS before, 5 AS after, 3 AS event),
(SELECT 6 AS before, 10 AS after, 8 AS event),
(SELECT 21 AS before, 25 AS after, 23 AS event),
(SELECT 33 AS before, 37 AS after, 35 AS event)
) AS events
CROSS JOIN (
SELECT pos - 1 - 2 AS delta FROM (
SELECT ROW_NUMBER() OVER() AS pos, * FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + 2 * 2, '.'),'') AS h FROM (SELECT NULL)),h
)))
) AS deltas
) AS points
ON points.point = visits.visit_time
With expected result
visit_time event
2 3
22 23
I think the above approach can work for you, but you will need to adapt it to your particular case.
I think this can be done relatively easily if you round all the times involved to whole minutes.
Hope this helps.
Share the result with us if you get this to work :o)
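The idea of turning a range join into an equality join can also be illustrated outside BigQuery. The following Python sketch uses the simplified visits and events from this answer with x = 2; the dictionary point_to_event plays the role of the generated points table. It assumes the intervals do not overlap, since a dictionary keeps only one event per point.

```python
# Simplified sample data from the answer.
visits = [2, 12, 22, 32]
events = [3, 8, 23, 35]
x = 2  # each interval spans x units to the left and right of its event

# Equivalent of the deltas table: -x .. +x.
deltas = range(-x, x + 1)

# Equivalent of the points table: every point covered by an interval, mapped to its event.
# Assumption: intervals do not overlap, so each point belongs to at most one event.
point_to_event = {event + delta: event for event in events for delta in deltas}

# Equality lookup instead of a range comparison against every interval.
matches = [(visit, point_to_event[visit]) for visit in visits if visit in point_to_event]
print(matches)
```

Each visit needs a single lookup, which is the same reason the JOIN on points.point = visits.visit_time is cheaper than the CROSS JOIN with a range filter.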

Calculating occupancy level between a date range

I'm having trouble trying to wrap my head around how to write this query to calculate the occupancy level of a hotel and then list the results by date. Consider the following type of data from a table called reservations:
Arrival     Departure   Guest  Confirmation
08/01/2015  08/05/2015  John   13234
08/01/2015  08/03/2015  Bob    34244
08/02/2015  08/03/2015  Steve  32423
08/02/2015  08/02/2015  Mark   32411
08/02/2015  08/04/2015  Jenny  24422
Output Data would ideally look like:
Date        Occupancy
08/01/2015  2
08/02/2015  4
08/03/2015  2
08/04/2015  1
08/05/2015  0
The query should also accept the date range as a variable. The hardest part, and where I'm stuck, is getting the count per night and listing it by date.
You can generate a list of dates first. In Oracle you can do this by using connect by. This will make a recursive query. For instance, to get the next 30 days, you can select today and keep connecting until you've got the desired number of days. level indicates the level of recursion.
select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by level <= 30;
On that list, you can query the number of reservations for each day in that period:
select THEDATE,
(select count(*)
from reservations r
-- a guest occupies a room from the arrival date up to, but not including, the departure date
where r.Arrival <= THEDATE and
r.Departure > THEDATE) as RESERVATIONCOUNT
from
( select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by level <= 30)
Instead of generating a fixed number of dates, you can also derive the count from the data, for instance to cover at least 30 days into the future, or further if there are later reservations:
select THEDATE,
(select count(*)
from reservations r
-- a guest occupies a room from the arrival date up to, but not including, the departure date
where r.Arrival <= THEDATE and
r.Departure > THEDATE) as RESERVATIONCOUNT
from
( select trunc(sysdate) + level - 1 as THEDATE
from dual
connect by
level <= greatest(30, (select trunc(max(DEPARTURE) - sysdate)
from reservations)))
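The counting rule can be checked against the sample data with a short Python sketch. It assumes a guest counts for every night from the arrival date up to, but not including, the departure date, and it assumes Jenny's departure is 08/04/2015 (the 2014 in the question looks like a typo). The function name occupancy is made up for this sketch.

```python
from datetime import date, timedelta

# Sample reservations from the question: (arrival, departure, guest).
# Assumption: Jenny's departure year is 2015, not 2014 as typed in the question.
reservations = [
    (date(2015, 8, 1), date(2015, 8, 5), "John"),
    (date(2015, 8, 1), date(2015, 8, 3), "Bob"),
    (date(2015, 8, 2), date(2015, 8, 3), "Steve"),
    (date(2015, 8, 2), date(2015, 8, 2), "Mark"),
    (date(2015, 8, 2), date(2015, 8, 4), "Jenny"),
]

def occupancy(reservations, first_day, last_day):
    """Count, for each day in the range, the reservations with arrival <= day < departure."""
    result = {}
    day = first_day
    while day <= last_day:
        result[day] = sum(
            1 for arrival, departure, _ in reservations if arrival <= day < departure
        )
        day += timedelta(days=1)
    return result

counts = occupancy(reservations, date(2015, 8, 1), date(2015, 8, 5))
for day, count in counts.items():
    print(day.strftime("%m/%d/%Y"), count)
```

The loop over days corresponds to the generated list of dates, and the condition inside sum corresponds to the correlated subquery.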

SQL Server 2008 R2 looking for a way to get the night hours for an employee

Using SQL Server 2008 R2, we are looking for a way to select the hours of an employee's shift that fall during the night, which in this case is between 22.00 and 06.00 the next day.
Our problem is how to get the hours when the shift crosses midnight, or how to get the overlap when a shift runs from 05.30 to 22.30 and so overlaps the night at both the beginning and the end of the shift.
Here is an example; these are the data available in the database and the result we are looking for:
startDateTime | endDateTime | nightHours
--------------------------+---------------------------+----------------
2012-07-04 05:00:00.000 2012-07-04 23:00:00.000 2
2012-07-04 18:00:00.000 2012-07-05 05:00:00.000 7
Does anyone have an example or a few good pointers that we can use?
This may be overly complex, but it does work. We use a number of CTEs to construct useful intermediate representations:
declare @Times table (
ID int not null,
StartTime datetime not null,
EndTime datetime not null
)
insert into @Times (ID,StartTime,EndTime)
select 1,'2012-07-04T05:00:00.000','2012-07-04T23:00:00.000' union all
select 2,'2012-07-04T18:00:00.000','2012-07-05T05:00:00.000'
;With Start as (
select MIN(DATEADD(day,DATEDIFF(day,0,StartTime),0)) as StartDay from @Times
), Ends as (
select MAX(EndTime) EndTime from @Times
), Nights as (
select DATEADD(hour,-2,StartDay) as NightStart,DATEADD(hour,6,StartDay) as NightEnd from Start
union all
select DATEADD(DAY,1,NightStart),DATEADD(DAY,1,NightEnd) from Nights n
inner join Ends e on n.NightStart < e.EndTime
), Overlaps as (
select
t.ID,
CASE WHEN n.NightStart > t.StartTime THEN n.NightStart ELSE t.StartTime END as StartPeriod,
CASE WHEN n.NightEnd < t.EndTime THEN n.NightEnd ELSE t.EndTime END as EndPeriod
from
@Times t
inner join
Nights n
on
t.EndTime > n.NightStart and
t.StartTime < n.NightEnd
), Totals as (
select ID,SUM(DATEDIFF(hour,StartPeriod,EndPeriod)) as TotalHours
from Overlaps
group by ID
)
select
*
from
@Times t
inner join
Totals tot
on
t.ID = tot.ID
Result:
ID StartTime EndTime ID TotalHours
----------- ----------------------- ----------------------- ----------- -----------
1 2012-07-04 05:00:00.000 2012-07-04 23:00:00.000 1 2
2 2012-07-04 18:00:00.000 2012-07-05 05:00:00.000 2 7
You'll note that I had to add an ID column in order to get my correlation to work.
The Start CTE finds the earliest applicable midnight. The Ends CTE finds the last time for which we need to find overlapping nights. Then, the recursive Nights CTE computes every night between those two points in time. We then join this back to the original table (in Overlaps) to find those periods in each night which apply. Finally, in Totals, we compute how many hours each overlapping period contributed.
This should work for multi-day events. You might want to change the Totals CTE to use minutes, or apply some other rounding functions, if you need to count partial hours.
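The same overlap idea (generate each night window, clip it to the shift, add up what remains) can be sketched in Python for checking results by hand. The function name night_hours is an assumption of this sketch, and it returns fractional hours rather than the whole hours DATEDIFF produces.

```python
from datetime import datetime, time, timedelta

def night_hours(start, end):
    """Hours of [start, end] that fall inside any 22:00-06:00 night window."""
    total = timedelta()
    # The first candidate night begins at 22:00 on the day before the shift starts.
    day = start.date() - timedelta(days=1)
    while datetime.combine(day, time(22)) < end:
        night_start = datetime.combine(day, time(22))
        night_end = night_start + timedelta(hours=8)  # 06:00 the next morning
        overlap = min(end, night_end) - max(start, night_start)
        if overlap > timedelta():
            total += overlap
        day += timedelta(days=1)
    return total.total_seconds() / 3600

# The two sample shifts from the question.
print(night_hours(datetime(2012, 7, 4, 5), datetime(2012, 7, 4, 23)))
print(night_hours(datetime(2012, 7, 4, 18), datetime(2012, 7, 5, 5)))
```

For the 05.30 to 22.30 shift mentioned in the question this gives half an hour at each end, which is the kind of partial hour the note about switching Totals to minutes refers to.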
I think the best way would be a function that takes the start time and end time of the shift. Inside the function, handle two cases: the first when the shift starts and ends on the same day, the second when it starts on one day and finishes on the next.
For the case when it starts and finishes on the same day, do
@TotalOvernightHours = 0
@AMDifference = Datediff(hh, @ShiftStart, @6amOnThatDay);
if @AMDifference > 0 then @TotalOvernightHours = @TotalOvernightHours + @AMDifference
@PMDifference = Datediff(hh, @10pmOnThatDay, @ShiftEnd)
if @PMDifference > 0 then @TotalOvernightHours = @TotalOvernightHours + @PMDifference
For the case when start and finish are on different days, pretend it is two shifts: the first starts at @ShiftStart and finishes at midnight; the second starts at midnight and finishes at @ShiftEnd. Apply the logic above to each of them.
In case you have shifts that are longer than 24 hours, break them up into smaller sub-shifts, with midnight as the divider. So if you have a shift starting on 1 Jun 19:00 and finishing on 3 Jun 5:00, you would end up with three sub-shifts:
1 Jun 19:00 - 1 Jun 24:00
2 Jun 00:00 - 2 Jun 24:00
3 Jun 00:00 - 3 Jun 5:00
Then calculate the overnight hours for every sub-shift.
I'd probably write one function that calculates overnight hours for one 24-hour period and another function that breaks the whole shift into 24-hour chunks, then feeds them into the first function.
p.s. this is not SQL, only pseudo-code.
p.p.s. This would work only if you have the ability to create functions. And it would get you clean, easy-to-read code.
