Teradata - calculate difference between two dates from different rows

I want to move my Excel calculation to Teradata, but I'm not sure how to do it. In Excel it is rather easy: I use a simple IF to give me DIFF =IF(A2=A3, (C2-B3) * 24, "")
NO  | T_DATE              | L_DATE              | DIFF
AAA | 10/08/2019 17:02:00 | 10/08/2019 20:35:00 | 5.83
AAA | 10/08/2019 14:45:00 | 10/08/2019 15:10:00 | 11.78
AAA | 10/08/2019 03:23:00 | 10/08/2019 10:25:00 | 17.32
AAA | 09/08/2019 17:06:00 | 10/08/2019 01:11:00 | 25.70
AAA | 08/08/2019 23:29:00 | 09/08/2019 10:27:00 |
BBB | 08/08/2019 09:34:00 | 08/08/2019 21:19:00 | 22.23
BBB | 07/08/2019 23:05:00 | 08/08/2019 06:09:00 | 18.03
BBB | 07/08/2019 12:07:00 | 07/08/2019 20:25:00 | 22.32
BBB | 06/08/2019 22:06:00 | 07/08/2019 08:53:00 | 22.77
BBB | 06/08/2019 10:07:00 | 06/08/2019 19:44:00 |
Is there a way of doing this in Teradata? As in Excel, I want the difference in hours between one row's L_DATE and the next row's T_DATE, for each NO.

You can use window functions to achieve this. It's important to note that when you subtract two dates or timestamps (timestamps in this case), you get back an INTERVAL type, so you will need to specify what kind of INTERVAL you want as well as its size (SECOND, MINUTE, HOUR, DAY, etc.).
CREATE MULTISET VOLATILE TABLE yourtable (
    ID VARCHAR(3)
    ,T_DATE TIMESTAMP(0)
    ,L_DATE TIMESTAMP(0)
    ,DIFF NUMERIC(6,2)
) PRIMARY INDEX (ID) ON COMMIT PRESERVE ROWS;
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-10-08 17:02:00','2019-10-08 20:35:00',5.83);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-10-08 14:45:00','2019-10-08 15:10:00',11.78);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-10-08 03:23:00','2019-10-08 10:25:00',17.32);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-09-08 17:06:00','2019-10-08 01:11:00',25.70);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-08-08 23:29:00','2019-09-08 10:27:00',NULL);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-08-08 09:34:00','2019-08-08 21:19:00',22.23);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-07-08 23:05:00','2019-08-08 06:09:00',18.03);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-07-08 12:07:00','2019-07-08 20:25:00',22.32);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-06-08 22:06:00','2019-07-08 08:53:00',22.77);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-06-08 10:07:00','2019-06-08 19:44:00',NULL);
SELECT yourtable.*,
CAST(((LEAD(T_DATE) OVER (PARTITION BY ID ORDER BY T_DATE) - L_DATE) HOUR(4)) AS INTEGER)
FROM yourtable;
+-----+---------------------+---------------------+--------+-------------------------------------------+
| ID | T_DATE | L_DATE | DIFF | (LEAD (<value expression>) - L_DATE) HOUR |
+-----+---------------------+---------------------+--------+-------------------------------------------+
| AAA | 2019-08-08 23:29:00 | 2019-09-08 10:27:00 | <null> | 7 |
| AAA | 2019-09-08 17:06:00 | 2019-10-08 01:11:00 | 25.70 | 2 |
| AAA | 2019-10-08 03:23:00 | 2019-10-08 10:25:00 | 17.32 | 4 |
| AAA | 2019-10-08 14:45:00 | 2019-10-08 15:10:00 | 11.78 | 2 |
| AAA | 2019-10-08 17:02:00 | 2019-10-08 20:35:00 | 5.83 | <null> |
| BBB | 2019-06-08 10:07:00 | 2019-06-08 19:44:00 | <null> | 3 |
| BBB | 2019-06-08 22:06:00 | 2019-07-08 08:53:00 | 22.77 | 4 |
| BBB | 2019-07-08 12:07:00 | 2019-07-08 20:25:00 | 22.32 | 3 |
| BBB | 2019-07-08 23:05:00 | 2019-08-08 06:09:00 | 18.03 | 3 |
| BBB | 2019-08-08 09:34:00 | 2019-08-08 21:19:00 | 22.23 | <null> |
+-----+---------------------+---------------------+--------+-------------------------------------------+
The reason this looks so ugly is that you are trying to compare (subtract) values in two different records. In a database there is no relationship between one record and another, and there is no ordering; records live independently of one another. This is radically different from Excel, where rows (records) have an order (a row number).
We use the window function LEAD() to establish a group of records as a group (a partition) using the PARTITION BY clause, and we give that partition an ordering with the ORDER BY clause. Then we use LEAD() to say "give me the very next T_DATE in this ordered partition, relative to this record".
Then we do our date math and subtract the two timestamps. We specify that we want an INTERVAL of type HOUR(4) back; this will hold up to 9999 hours, and it will error if the difference exceeds 9999 hours.
Lastly, we cast that interval to INTEGER so you can do math on it. You don't have to do the cast if you don't want to; I added it because oftentimes we want to add hours together and the like.
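Since your Excel DIFF values are fractional hours, here is a variant of the same query that subtracts as minutes and divides by 60 to get decimal hours. This is a minimal sketch of my own (not part of the original answer), assuming the same yourtable; the DIFF_HOURS alias is just illustrative:
SELECT yourtable.*,
    -- INTERVAL MINUTE(4) holds at most 9999 minutes (about 166 hours); larger gaps will error
    CAST(((LEAD(T_DATE) OVER (PARTITION BY ID ORDER BY T_DATE) - L_DATE) MINUTE(4)) AS INTEGER) / 60.00 AS DIFF_HOURS
FROM yourtable;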
If you are working on an older version of Teradata that doesn't have the LEAD() function (it's a newer addition), you can use MAX() or MIN() with some extra syntax in your windowing definition to explicitly say "just the next record's T_DATE", like:
MAX(T_DATE) OVER (PARTITION BY ID ORDER BY T_DATE ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
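For example, a minimal complete query built on that windowed MAX() (a sketch of mine, assuming the same yourtable as above):
SELECT yourtable.*,
    -- "the next record's T_DATE" via an explicit one-row-ahead window
    CAST(((MAX(T_DATE) OVER (PARTITION BY ID ORDER BY T_DATE
        ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - L_DATE) HOUR(4)) AS INTEGER)
FROM yourtable;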

Related

Trying to determine if two ranges of dates overlap using R

I have a dataset that includes information about the schools that a student has attended within an academic year and their entry and withdrawal dates from each school. While most students only attend one school, there are others who have attended up to four different schools. I would like to make sure that none of the date ranges overlap. Below is an example of the data that I have (the dates are structured as dates):
|---------------------|------------------|---------------------|------------------|
| entry_date_1 | withdrawal_date_1| entry_date_2 | withdrawal_date_2|
|---------------------|------------------|---------------------|------------------|
| 2017-11-09 | 2018-05-24 | NA | NA |
|---------------------|------------------|---------------------|------------------|
| 2017-08-14 | 2017-12-15 | 2017-12-16 | 2018-05-24 |
|---------------------|------------------|---------------------|------------------|
| 2017-08-14 | 2018-06-01 | 2018-01-16 | 2018-03-20 |
|---------------------|------------------|---------------------|------------------|
| 2018-01-24 | 2018-02-25 | 2018-04-03 | 2018-05-24 |
|---------------------|------------------|---------------------|------------------|
What I would ideally like is a column that gives me a logical value, like this:
|---------------------|------------------|---------------------|------------------|------------------|
| entry_date_1 | withdrawal_date_1| entry_date_2 | withdrawal_date_2| overlap? |
|---------------------|------------------|---------------------|------------------|------------------|
| 2017-11-09 | 2018-05-24 | NA | NA | NA |
|---------------------|------------------|---------------------|------------------|------------------|
| 2017-08-14 | 2017-12-15 | 2017-12-16 | 2018-05-24 | FALSE |
|---------------------|------------------|---------------------|------------------|------------------|
| 2017-08-14 | 2018-06-01 | 2018-01-16 | 2018-03-20 | TRUE |
|---------------------|------------------|---------------------|------------------|------------------|
| 2018-01-24 | 2018-02-25 | 2018-04-03 | 2018-05-24 | FALSE |
|---------------------|------------------|---------------------|------------------|------------------|
I tried doing this using the %overlaps% function in the DescTools package, but it doesn't yield a logical value for any row - just NA. If someone could help me troubleshoot the issue, that would be great, and any other suggestions would also be helpful. I'm most comfortable with the tidyverse and base R, and less comfortable with data.table.
Below is a snippet of data for a reproducible example:
my_data <- data.frame(
  "student_id" = 1:6,
  "entry_date_1" = as.Date(c("2017-11-09", "2017-08-14", "2017-08-14", "2018-01-24", "2017-10-04", "2017-08-14")),
  "withdrawal_date_1" = as.Date(c("2018-05-24", "2017-12-15", "2018-06-01", "2018-02-25", "2017-11-11", "2018-05-24")),
  "entry_date_2" = as.Date(c(NA, "2017-12-16", "2018-01-16", "2018-04-03", "2017-12-12", NA)),
  "withdrawal_date_2" = as.Date(c(NA, "2018-05-24", "2018-03-20", "2018-05-24", "2018-05-24", NA))
)
Thanks in advance for any help!
You can use int_overlaps() in lubridate.
library(dplyr)
library(lubridate)
my_data %>%
  mutate(overlap = int_overlaps(interval(entry_date_1, withdrawal_date_1),
                                interval(entry_date_2, withdrawal_date_2)))
# student_id entry_date_1 withdrawal_date_1 entry_date_2 withdrawal_date_2 overlap
# 1 1 2017-11-09 2018-05-24 <NA> <NA> NA
# 2 2 2017-08-14 2017-12-15 2017-12-16 2018-05-24 FALSE
# 3 3 2017-08-14 2018-06-01 2018-01-16 2018-03-20 TRUE
# 4 4 2018-01-24 2018-02-25 2018-04-03 2018-05-24 FALSE
# 5 5 2017-10-04 2017-11-11 2017-12-12 2018-05-24 FALSE
# 6 6 2017-08-14 2018-05-24 <NA> <NA> NA
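Since you mentioned being comfortable with base R, here is a hedged alternative (my sketch, not part of the answer above) using the standard interval-overlap condition: two closed date ranges overlap exactly when each one starts on or before the other ends. NA dates propagate to NA, matching the desired output.
# base R: per-row overlap test for two closed date ranges
my_data$overlap <- with(my_data,
  entry_date_1 <= withdrawal_date_2 & entry_date_2 <= withdrawal_date_1)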

R: merge two datasets within range of dates

I have one dataset x that looks something like this:
id | date
1 | 2014-02-04
1 | 2014-03-15
2 | 2014-02-04
2 | 2014-03-15
And I would like to merge it with another dataset, y, by id and date, with the date from x being the same as or preceding the date in dataset y for every observation. Dataset y looks like this:
id | date | value
1 | 2014-02-07 | 100
2 | 2014-02-04 | 20
2 | 2014-03-22 | 80
So I would want my final dataset to be:
id | date.x | date.y | value
1 | 2014-02-04 | 2014-02-07 | 100
1 | 2014-03-15 | |
2 | 2014-02-04 | 2014-02-04 | 20
2 | 2014-03-15 | 2014-03-22 | 80
I really do not have a lead on how to approach something like this, any help is appreciated. Thanks!
This is easy in data.table, using the roll argument.
First, create sample data with actual dates:
library( data.table )
DT1 <- fread("id | date
1 | 2014-02-04
1 | 2014-03-15
2 | 2014-02-04
2 | 2014-03-15")
DT2 <- fread("id | date | value
1 | 2014-02-07 | 100
2 | 2014-02-04 | 20
2 | 2014-03-22 | 80")
DT1[, date := as.Date( date ) ]
DT2[, date := as.Date( date ) ]
Now, perform an update join on DT1, where the columns date.y and value are the result of the (left, rolling) join DT2[ DT1, .( x.date, value), on = .(id, date), roll = -Inf ].
This code joins on two columns, id and date; the roll argument -Inf applies to the last one (i.e. date). To make sure the date value from DT2 is returned, and not the date from DT1, we ask for x.date instead of date (which would return the date value from DT1).
#rolling update join
DT1[, c("date.y", "value") := DT2[ DT1, .( x.date, value), on = .(id, date), roll = -Inf ]][]
# id date date.y value
# 1: 1 2014-02-04 2014-02-07 100
# 2: 1 2014-03-15 <NA> NA
# 3: 2 2014-02-04 2014-02-04 20
# 4: 2 2014-03-15 2014-03-22 80
Another option is to full_join by year & month. (Note that this only matches rows whose dates fall in the same calendar month, which happens to hold for this sample; it is not a true rolling match.)
Firstly we need to add an additional column that extracts the month and year from the date column:
library(zoo)
library(dplyr)
xx <- x %>%
  mutate(y_m = as.yearmon(date))
yy <- y %>%
  mutate(y_m = as.yearmon(date))
Then we need to fully join by id and y_m:
out <- full_join(xx, yy, by = c("id", "y_m")) %>%
  select(-y_m)
> out
# A tibble: 4 x 4
id date.x date.y value
<dbl> <date> <date> <dbl>
1 1 2014-02-04 2014-02-07 100
2 1 2014-03-15 NA NA
3 2 2014-02-04 2014-02-04 20
4 2 2014-03-15 2014-03-22 80

Average the first row by group from data.table lookup

I wish to average, for each individual, the most recent company rows that occur on or before a specified date.
In other words, for each individual and each date, I would like to average the most recent (per company) preceding alpha values.
table1 <- fread(
"individual_id | date
1 | 2018-01-02
1 | 2018-01-04
1 | 2018-01-05
2 | 2018-01-02
2 | 2018-01-05",
sep ="|"
)
table1$date = as.IDate(table1$date)
table2 <- fread(
"individual_id | date2 | company_id | alpha
1 | 2018-01-02 | 62 | 1
1 | 2018-01-04 | 62 | 1.5
1 | 2018-01-05 | 63 | 1
2 | 2018-01-01 | 71 | 2
2 | 2018-01-02 | 74 | 1
2 | 2018-01-05 | 74 | 4",
sep = "|"
)
So for example:
Observation 1 in table 1 is individual "1" on 2018-01-02.
To achieve this I look in table 2 and see that individual 1 has one instance prior to or on 2018-01-02, for company 62. Hence there is only one value to average, and the mean alpha is 1.
Example 2:
The observation for individual 2 on 2018-01-05.
Here there are 3 observations for individual 2: one for company 71 and two for company 74, so we choose the most recent for each company, which leaves us with two observations (71 on 2018-01-01 and 74 on 2018-01-05) with alpha values of 2 and 4; the mean alpha is then 3.
The result should look like:
table1 <- fread(
"individual_id | date | mean alpha
1 | 2018-01-02 | 1
1 | 2018-01-04 | 1.5
1 | 2018-01-05 | (1.5+1)/2 = 1.25
2 | 2018-01-02 | (2+1)/2 = 1.5
2 | 2018-01-05 | (2+4)/2 = 3",
sep ="|"
)
I can get the sub sample of the first row from table2 using:
table2[, .SD[1], by=company_id]
But I am unsure how to limit by the date and combine this with the first table.
Edit
This produces the result for each individual but not by company.
table1[, mean_alpha :=
table2[.SD, on=.(individual_id, date2 <= date), mean(alpha, na.rm = TRUE), by=.EACHI]$V1]
individual_id date mean_alpha
1 2018-01-02 1.000000
1 2018-01-04 1.250000
1 2018-01-05 1.166667
2 2018-01-02 1.500000
2 2018-01-05 2.333333
Here is another possible approach:
# ensure that order is correct before using the most recent for each company
setorder(table2, individual_id, company_id, date2)

table1[, mean_alpha :=
  # perform non-equi join
  table2[table1, on = .(individual_id, date2 <= date),
    # for each row of table1,
    by = .EACHI,
    # get most recent alpha by company_id and average the alphas
    mean(.SD[, last(alpha), by = .(company_id)]$V1)]$V1
]
output:
individual_id date mean_alpha
1: 1 2018-01-02 1.00
2: 1 2018-01-04 1.50
3: 1 2018-01-05 1.25
4: 2 2018-01-02 1.50
5: 2 2018-01-05 3.00
data:
library(data.table)
table1 <- fread(
"individual_id | date
1 | 2018-01-02
1 | 2018-01-04
1 | 2018-01-05
2 | 2018-01-02
2 | 2018-01-05",
sep ="|"
)
table1[, date := as.IDate(date)]
table2 <- fread(
"individual_id | date2 | company_id | alpha
1 | 2018-01-02 | 62 | 1
1 | 2018-01-04 | 62 | 1.5
1 | 2018-01-05 | 63 | 1
2 | 2018-01-01 | 71 | 2
2 | 2018-01-02 | 74 | 1
2 | 2018-01-05 | 74 | 4",
sep = "|"
)
table2[, date2 := as.IDate(date2)]
table2[table1, on = "individual_id", allow.cartesian = TRUE
  ][date2 <= date,
  ][order(-date2)
  ][, .SD[1, ], by = .(individual_id, company_id, date)
  ][, mean(alpha), by = .(individual_id, date)
  ][order(individual_id, date)]
What I did there: joined tables 1 and 2 on individual_id, allowing all possible combinations. Then I filtered out the combinations in which date2 was greater than date, so we keep only dates2 on or prior to date. Ordering in descending order by date2 lets us select only the most recent occurrence (that's what .SD[1, ] does) for each individual_id, company_id and date combination.
After that, it's just calculating the mean by individual and date, and sorting the table to match your expected output.

R data tables vlookup nearest date

I am trying to do a vlookup in R using data.table. I am looking up the value for a specific date; if it is not available, I would like the nearest next value.
table1 <- fread(
"id | date_created
1 | 2018-01-02
1 | 2018-01-03
2 | 2018-01-08
2 | 2018-01-09",
sep ="|"
)
table2<- fread(
"otherid | date | value
1 | 2018-01-02 | 1
2 | 2018-01-04 | 5
3 | 2018-01-07 | 3
4 | 2018-01-08 | 5
5 | 2018-01-11 | 3
6 | 2018-01-12 | 2",
sep = "|"
)
The result should look like:
table1 <- fread(
"id | date | value2
1 | 2018-01-02 | 1
1 | 2018-01-03 | 5
2 | 2018-01-08 | 5
2 | 2018-01-09 | 3",
sep ="|"
)
Edit
I fixed it, this works:
table1[, value2:= table2[table1, value, on = .(date=date_created), roll = -7]]
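One hedged aside (my note, not the OP's): roll = -7 only looks up to 7 days ahead, which happens to be enough for this sample. If the next available value can be arbitrarily far in the future, roll = -Inf rolls to the nearest following row with no limit:
# same join, but with an unbounded forward roll
table1[, value2 := table2[table1, value, on = .(date = date_created), roll = -Inf]]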

Why is the datetime shown differently from what is in the DB?

I have the following rows in the database:
id | date_order | name | origin
----+---------------------+----------+---------
38 | 2016-05-10 14:00:00 | OT/00024 | GI/00005:
39 | 2016-05-26 14:00:00 | OT/00025 | GI/00005:
40 | 2016-06-11 14:00:00 | OT/00026 | GI/00005:
41 | 2016-06-27 14:00:00 | OT/00027 | GI/00005:
42 | 2016-07-13 14:00:00 | OT/00028 | GI/00005:
but it is shown in the views as:
date_order | name | origin
--------------------+----------+-------------
10/05/2016 15:00:00 | OT/00024 | GI/00005:
26/05/2016 15:00:00 | OT/00025 | GI/00005:
11/06/2016 14:00:00 | OT/00026 | GI/00005:
27/06/2016 14:00:00 | OT/00027 | GI/00005:
13/07/2016 15:00:00 | OT/00028 | GI/00005:
I changed the timezone but I still get the difference!
When you store the datetime, you should use context like this:
from openerp.osv import fields
from datetime import datetime
...
my_date = fields.datetime.context_timestamp(cr, uid, datetime.now(), context=context)
The date stored in the database is in the UTC (GMT+0) timezone. Assume the user's timezone is set to GMT-5: while storing the value to the database, 5 hours (exactly 5, not a little more or less) are added to the entered time, and thus we get the UTC time to store in the database. When displaying the same value, the system checks the user's timezone, finds GMT-5, subtracts 5 hours (again exactly 5) from the database time, and displays the result to the user.
This is great for a system used across different timezones. So the understanding is: input is taken in the user's timezone, stored in UTC (GMT+0), and displayed in the viewing user's timezone (even if the viewing user is in a different timezone, the time will be accurate for their timezone).
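To illustrate that round trip outside of Odoo, here is a minimal sketch using the pytz package (the zone name is just an illustrative UTC-5 zone, not something from the question):
import pytz
from datetime import datetime

user_tz = pytz.timezone("America/Bogota")                 # an example UTC-5 zone
local_dt = user_tz.localize(datetime(2016, 5, 10, 9, 0))  # user enters 09:00 local time
stored_utc = local_dt.astimezone(pytz.utc)                # 14:00 UTC is what the DB stores
displayed = stored_utc.astimezone(user_tz)                # rendered back to the user as 09:00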
Odoo displays the datetime field in the user's timezone. In this case the timezone is GMT+1, but it becomes GMT+0 in June because of Ramadan (daylight saving time is suspended then); that's why the displayed offsets differ.
