I have a dataset that includes information about the schools that a student has attended within an academic year and their entry and withdrawal dates from each school. While most students only attend one school, there are others who have attended up to four different schools. I would like to make sure that none of the date ranges overlap. Below is an example of the data that I have (the dates are structured as dates):
|---------------------|------------------|---------------------|------------------|
| entry_date_1 | withdrawal_date_1| entry_date_2 | withdrawal_date_2|
|---------------------|------------------|---------------------|------------------|
| 2017-11-09 | 2018-05-24 | NA | NA |
|---------------------|------------------|---------------------|------------------|
| 2017-08-14 | 2017-12-15 | 2017-12-16 | 2018-05-24 |
|---------------------|------------------|---------------------|------------------|
| 2017-08-14 | 2018-06-01 | 2018-01-16 | 2018-03-20 |
|---------------------|------------------|---------------------|------------------|
| 2018-01-24 | 2018-02-25 | 2018-04-03 | 2018-05-24 |
|---------------------|------------------|---------------------|------------------|
What I would ideally like is a column that holds a logical value, like this:
|---------------------|------------------|---------------------|------------------|------------------|
| entry_date_1 | withdrawal_date_1| entry_date_2 | withdrawal_date_2| overlap? |
|---------------------|------------------|---------------------|------------------|------------------|
| 2017-11-09 | 2018-05-24 | NA | NA | NA |
|---------------------|------------------|---------------------|------------------|------------------|
| 2017-08-14 | 2017-12-15 | 2017-12-16 | 2018-05-24 | FALSE |
|---------------------|------------------|---------------------|------------------|------------------|
| 2017-08-14 | 2018-06-01 | 2018-01-16 | 2018-03-20 | TRUE |
|---------------------|------------------|---------------------|------------------|------------------|
| 2018-01-24 | 2018-02-25 | 2018-04-03 | 2018-05-24 | FALSE |
|---------------------|------------------|---------------------|------------------|------------------|
I tried doing this using the %overlaps% operator in the DescTools package, but it doesn't return TRUE or FALSE for any row, just NA. If someone could help me troubleshoot the issue, that would be great. Any other suggestions would also be helpful. I'm most comfortable with the tidyverse and base R and less comfortable with data.table.
Below is a snippet of data for a reproducible example:
my_data <- data.frame("student_id" = 1:6,
"entry_date_1" = as.Date(c("2017-11-09","2017-08-14","2017-08-14","2018-01-24","2017-10-04","2017-08-14")),
"withdrawal_date_1" = as.Date(c("2018-05-24","2017-12-15","2018-06-01","2018-02-25","2017-11-11","2018-05-24")),
"entry_date_2" = as.Date(c(NA,"2017-12-16","2018-01-16","2018-04-03","2017-12-12",NA)),
"withdrawal_date_2" = as.Date(c(NA,"2018-05-24","2018-03-20","2018-05-24","2018-05-24",NA)))
Thanks in advance for any help!
You can use int_overlaps() in lubridate.
library(dplyr)
library(lubridate)
my_data %>%
mutate(overlap = int_overlaps(interval(entry_date_1, withdrawal_date_1),
interval(entry_date_2, withdrawal_date_2)))
# student_id entry_date_1 withdrawal_date_1 entry_date_2 withdrawal_date_2 overlap
# 1 1 2017-11-09 2018-05-24 <NA> <NA> NA
# 2 2 2017-08-14 2017-12-15 2017-12-16 2018-05-24 FALSE
# 3 3 2017-08-14 2018-06-01 2018-01-16 2018-03-20 TRUE
# 4 4 2018-01-24 2018-02-25 2018-04-03 2018-05-24 FALSE
# 5 5 2017-10-04 2017-11-11 2017-12-12 2018-05-24 FALSE
# 6 6 2017-08-14 2018-05-24 <NA> <NA> NA
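For anyone who would rather avoid the extra dependency, the same check can be sketched in base R. This is only a minimal sketch, not the lubridate implementation: the helper name ranges_overlap is made up for illustration, and it treats both ranges as closed, so two ranges that share a single day count as overlapping.

```r
# First four students from the question's example data
my_data <- data.frame(
  student_id = 1:4,
  entry_date_1 = as.Date(c("2017-11-09", "2017-08-14", "2017-08-14", "2018-01-24")),
  withdrawal_date_1 = as.Date(c("2018-05-24", "2017-12-15", "2018-06-01", "2018-02-25")),
  entry_date_2 = as.Date(c(NA, "2017-12-16", "2018-01-16", "2018-04-03")),
  withdrawal_date_2 = as.Date(c(NA, "2018-05-24", "2018-03-20", "2018-05-24"))
)

# Hypothetical helper: two closed ranges overlap when each one starts
# on or before the day the other one ends. Missing dates give NA.
ranges_overlap <- function(start1, end1, start2, end2) {
  start1 <= end2 & start2 <= end1
}

my_data$overlap <- with(
  my_data,
  ranges_overlap(entry_date_1, withdrawal_date_1, entry_date_2, withdrawal_date_2)
)
my_data$overlap
```

With up to four schools per student, one option is to call the helper once per pair of date ranges and combine the results with |.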
Final Output
| Date | New_Date |
|-----------| --------- |
|1967-07-01 | |
|1967-07-02 | |
|1967-07-03 | |
|1967-07-04 | |
|1967-07-05 | |
|1967-07-06 | |
|1967-07-07 | 07-July |
|1967-07-08 | |
|1967-07-09 | |
|1967-07-10 | |
|1967-07-11 | |
|1967-07-12 | |
|1967-07-13 | |
|1967-07-14 | 14-July |
Is there any function or library I can use to get "New_Date" (the final output, filled in on every 7th day)?
I've tried this code, but I am not getting the desired final output:
df <- df %>%
mutate(New_Date <- seq.Data(Date, by = 7),
format(New_Date, format = "%d-%b))
We can use case_when
library(dplyr)
df %>%
mutate(New_date = case_when((row_number() -1) %% 7 + 1 == 7 ~
format(Date, '%d-%b'), TRUE ~ ''))
-output
Date New_date
1 1967-07-01
2 1967-07-02
3 1967-07-03
4 1967-07-04
5 1967-07-05
6 1967-07-06
7 1967-07-07 07-Jul
8 1967-07-08
9 1967-07-09
10 1967-07-10
11 1967-07-11
12 1967-07-12
13 1967-07-13
14 1967-07-14 14-Jul
data
df <- data.frame(Date = seq(as.Date('1967-07-01'), length.out = 14, by = '1 day'))
You can create an index of every 7th row, format those dates, and assign them to a new column.
inds <- seq(7, nrow(df), 7)
df$New_Date <- ''
df$New_Date[inds] <- format(df$Date[inds], '%d-%b')
df
# Date New_Date
#1 1967-07-01
#2 1967-07-02
#3 1967-07-03
#4 1967-07-04
#5 1967-07-05
#6 1967-07-06
#7 1967-07-07 07-Jul
#8 1967-07-08
#9 1967-07-09
#10 1967-07-10
#11 1967-07-11
#12 1967-07-12
#13 1967-07-13
#14 1967-07-14 14-Jul
If the Date column is not of class Date, run df$Date <- as.Date(df$Date) first.
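The question's desired output spells the month out in full ("07-July"), while '%d-%b' gives the abbreviated form. A minimal sketch of the same index approach with '%B' instead, assuming an English locale (month names come from the session's locale, so the exact text may differ elsewhere):

```r
df <- data.frame(Date = seq(as.Date("1967-07-01"), length.out = 14, by = "1 day"))

# Rows 7, 14, ... get a label; every other row stays blank
inds <- seq(7, nrow(df), by = 7)
df$New_Date <- ""

# %B is the full month name, %b the abbreviated one
df$New_Date[inds] <- format(df$Date[inds], "%d-%B")
df
```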
So I would like to transform the following:
days <- c("MONDAY", "SUNDAY", "MONDAY", "SUNDAY", "MONDAY", "SUNDAY")
dates <- c("2020-03-02", "2020-03-08", "2020-03-09", "2020-03-15", "2020-03-16", "2020-03-22")
df <- cbind(days, dates)
+--------+------------+
| days | dates |
+--------+------------+
| MONDAY | 2020.03.02 |
| SUNDAY | 2020.03.08 |
| MONDAY | 2020.03.09 |
| SUNDAY | 2020.03.15 |
| MONDAY | 2020.03.16 |
| SUNDAY | 2020.03.22 |
+--------+------------+
Into this:
+------------+------------+
| MONDAY | SUNDAY |
+------------+------------+
| 2020.03.02 | 2020.03.08 |
| 2020.03.09 | 2020.03.15 |
| 2020.03.16 | 2020.03.22 |
+------------+------------+
Do you have any hints how should I do it? Thank you in advance!
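In base R (df built with cbind() is a character matrix, so convert it to a data frame first):
df <- as.data.frame(df, stringsAsFactors = FALSE)
sapply(split(df, df$days), function(x) x$dates)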
MONDAY SUNDAY
[1,] "2020-03-02" "2020-03-08"
[2,] "2020-03-09" "2020-03-15"
[3,] "2020-03-16" "2020-03-22"
Here is a solution in tidyr which takes into account JohannesNE's pertinent comment.
You can think of this as the 'trick' you were referring to in your reply (assuming each consecutive Monday and Sunday form a pair):
df <- as.data.frame(df) # tidyr needs a df object
df <- cbind(pair = rep(1:3, each = 2), df) # the 'trick'!
pair days dates
1 1 MONDAY 2020-03-02
2 1 SUNDAY 2020-03-08
3 2 MONDAY 2020-03-09
4 2 SUNDAY 2020-03-15
5 3 MONDAY 2020-03-16
6 3 SUNDAY 2020-03-22
Now the tidyr implementation:
library(tidyr)
df %>% pivot_wider(names_from = days, values_from = dates)
# A tibble: 3 x 3
pair MONDAY SUNDAY
<int> <chr> <chr>
1 1 2020-03-02 2020-03-08
2 2 2020-03-09 2020-03-15
3 3 2020-03-16 2020-03-22
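The pair index above is hard-coded with rep(1:3, each = 2). A minimal sketch of deriving it from the data instead, assuming the nth MONDAY always belongs with the nth SUNDAY. The column name pair follows the answer above, and unstack() is shown as a base R alternative to pivot_wider() for this simple case where every day has the same number of rows.

```r
days <- c("MONDAY", "SUNDAY", "MONDAY", "SUNDAY", "MONDAY", "SUNDAY")
dates <- c("2020-03-02", "2020-03-08", "2020-03-09", "2020-03-15", "2020-03-16", "2020-03-22")
df <- data.frame(days, dates, stringsAsFactors = FALSE)

# Number each day's occurrences: 1st MONDAY -> 1, 2nd MONDAY -> 2, and so on
df$pair <- ave(seq_along(df$days), df$days, FUN = seq_along)

# Base R reshape: one column per distinct value of days
wide <- unstack(df, dates ~ days)
wide
```

The data frame with the computed pair column can also be passed to pivot_wider() exactly as in the answer above.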
I want to move my Excel calculation to Teradata but am not sure how to do it. In Excel it is rather easy: I use a simple IF to give me DIFF: =IF(A2=A3, (C2-B3) * 24, "")
NO T_DATE L_DATE DIFF
AAA 10/08/2019 17:02:00 10/08/2019 20:35:00 5.83
AAA 10/08/2019 14:45:00 10/08/2019 15:10:00 11.78
AAA 10/08/2019 03:23:00 10/08/2019 10:25:00 17.32
AAA 09/08/2019 17:06:00 10/08/2019 01:11:00 25.70
AAA 08/08/2019 23:29:00 09/08/2019 10:27:00
BBB 08/08/2019 09:34:00 08/08/2019 21:19:00 22.23
BBB 07/08/2019 23:05:00 08/08/2019 06:09:00 18.03
BBB 07/08/2019 12:07:00 07/08/2019 20:25:00 22.32
BBB 06/08/2019 22:06:00 07/08/2019 08:53:00 22.77
BBB 06/08/2019 10:07:00 06/08/2019 19:44:00
Is there a way of doing it in Teradata? I want to have again the difference in hours between L_DATE and T_DATE for each NO.
You can use window functions to achieve this. It's important to note that when you subtract two dates or timestamps (as in this case), the result is an INTERVAL type, so you need to specify which kind of INTERVAL you want as well as its size (SECOND, MINUTE, HOUR, DAY, etc.).
CREATE MULTISET VOLATILE TABLE yourtable(
ID VARCHAR(3)
,T_DATE TIMESTAMP(0)
,L_DATE TIMESTAMP(0)
,DIFF NUMERIC(6,2)
) PRIMARY INDEX (ID) ON COMMIT PRESERVE ROWS;
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-10-08 17:02:00','2019-10-08 20:35:00',5.83);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-10-08 14:45:00','2019-10-08 15:10:00',11.78);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-10-08 03:23:00','2019-10-08 10:25:00',17.32);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-09-08 17:06:00','2019-10-08 01:11:00',25.70);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('AAA','2019-08-08 23:29:00','2019-09-08 10:27:00',NULL);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-08-08 09:34:00','2019-08-08 21:19:00',22.23);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-07-08 23:05:00','2019-08-08 06:09:00',18.03);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-07-08 12:07:00','2019-07-08 20:25:00',22.32);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-06-08 22:06:00','2019-07-08 08:53:00',22.77);
INSERT INTO yourtable(ID,T_DATE,L_DATE,DIFF) VALUES ('BBB','2019-06-08 10:07:00','2019-06-08 19:44:00',NULL);
SELECT yourtable.*,
CAST(((LEAD(T_DATE) OVER (PARTITION BY ID ORDER BY T_DATE) - L_DATE) HOUR(4)) AS INTEGER)
FROM yourtable;
+-----+---------------------+---------------------+--------+-------------------------------------------+
| ID | T_DATE | L_DATE | DIFF | (LEAD (<value expression>) - L_DATE) HOUR |
+-----+---------------------+---------------------+--------+-------------------------------------------+
| AAA | 2019-08-08 23:29:00 | 2019-09-08 10:27:00 | <null> | 7 |
| AAA | 2019-09-08 17:06:00 | 2019-10-08 01:11:00 | 25.70 | 2 |
| AAA | 2019-10-08 03:23:00 | 2019-10-08 10:25:00 | 17.32 | 4 |
| AAA | 2019-10-08 14:45:00 | 2019-10-08 15:10:00 | 11.78 | 2 |
| AAA | 2019-10-08 17:02:00 | 2019-10-08 20:35:00 | 5.83 | <null> |
| BBB | 2019-06-08 10:07:00 | 2019-06-08 19:44:00 | <null> | 3 |
| BBB | 2019-06-08 22:06:00 | 2019-07-08 08:53:00 | 22.77 | 4 |
| BBB | 2019-07-08 12:07:00 | 2019-07-08 20:25:00 | 22.32 | 3 |
| BBB | 2019-07-08 23:05:00 | 2019-08-08 06:09:00 | 18.03 | 3 |
| BBB | 2019-08-08 09:34:00 | 2019-08-08 21:19:00 | 22.23 | <null> |
+-----+---------------------+---------------------+--------+-------------------------------------------+
This looks so ugly because you are trying to compare (subtract) values in two different records. In a database table there is no inherent relationship between one record and another, and no ordering; records live independently of one another. This is radically different from Excel, where rows (records) have an order (a row number).
We use the Window Function LEAD() to establish a group of records as being in a group (a partition) using the PARTITION BY clause, and we give that partition an ordering with the ORDER BY clause. Then we use that LEAD() to say "The very next T_DATE in this ordered partition for this record".
Then we do our date math and subtract the two timestamps. We specify that we want an INTERVAL of type HOUR(4) back. This will hold up to 9999 hours and it will error if it goes over 9999 hours.
Lastly we cast that thing to integer so you can do math on it. You do not, however, have to do the casting if you don't want. I added it because often times we want to add hours together and whatnot.
If you are working on an older version of Teradata that doesn't have the LEAD() function (it's a newer addition), you can use MAX() or MIN() with some extra syntax in your window definition to explicitly say "just the next record's T_DATE", like:
MAX(T_DATE) OVER (PARTITION BY ID ORDER BY T_DATE ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
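As a cross-check outside the database, the question's DIFF column can be reproduced in a few lines of base R. This is only a sketch under two assumptions: the question's timestamps are read as day/month/year (which is what the posted DIFF values imply), and the rows stay in the question's order, so the "following row" is the one directly below within the same NO, as in the Excel formula.

```r
fmt <- "%d/%m/%Y %H:%M:%S"
x <- data.frame(
  NO = rep(c("AAA", "BBB"), each = 5),
  T_DATE = as.POSIXct(c(
    "10/08/2019 17:02:00", "10/08/2019 14:45:00", "10/08/2019 03:23:00",
    "09/08/2019 17:06:00", "08/08/2019 23:29:00", "08/08/2019 09:34:00",
    "07/08/2019 23:05:00", "07/08/2019 12:07:00", "06/08/2019 22:06:00",
    "06/08/2019 10:07:00"), format = fmt, tz = "UTC"),
  L_DATE = as.POSIXct(c(
    "10/08/2019 20:35:00", "10/08/2019 15:10:00", "10/08/2019 10:25:00",
    "10/08/2019 01:11:00", "09/08/2019 10:27:00", "08/08/2019 21:19:00",
    "08/08/2019 06:09:00", "07/08/2019 20:25:00", "07/08/2019 08:53:00",
    "06/08/2019 19:44:00"), format = fmt, tz = "UTC")
)

# T_DATE of the following row within the same NO (NA on the last row of
# each group); this plays the role of the per-partition "next row" lookup
next_t <- ave(as.numeric(x$T_DATE), x$NO, FUN = function(v) c(v[-1], NA))

# Difference in hours between this row's L_DATE and the next row's T_DATE
x$DIFF <- round((as.numeric(x$L_DATE) - next_t) / 3600, 2)
x
```

Comparing these numbers with what the query returns is a quick way to confirm that the date format, the sort direction, and the order of the subtraction all match what the spreadsheet does.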
I am trying to do a VLOOKUP-style lookup in R using data.table. I am looking up the value for a specific date; if it is not available, I would like the nearest next value.
table1 <- fread(
"id | date_created
1 | 2018-01-02
1 | 2018-01-03
2 | 2018-01-08
2 | 2018-01-09",
sep ="|"
)
table2<- fread(
"otherid | date | value
1 | 2018-01-02 | 1
2 | 2018-01-04 | 5
3 | 2018-01-07 | 3
4 | 2018-01-08 | 5
5 | 2018-01-11 | 3
6 | 2018-01-12 | 2",
sep = "|"
)
The result should look like:
table1 <- fread(
"id | date | value2
1 | 2018-01-02 | 1
1 | 2018-01-03 | 5
2 | 2018-01-08 | 5
2 | 2018-01-09 | 3",
sep ="|"
)
Edit
I fixed it, this works:
table1[, value2:= table2[table1, value, on = .(date=date_created), roll = -7]]
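To see what the rolling join is doing, here is a minimal base R sketch of the same "value on that date, else the nearest next one" lookup, using plain data frames and vectors rather than data.table. The 7-day cutoff mirrors the intent of roll = -7; how data.table treats a gap of exactly 7 days is not verified here, so take that boundary as an assumption.

```r
table2 <- data.frame(
  date = as.Date(c("2018-01-02", "2018-01-04", "2018-01-07",
                   "2018-01-08", "2018-01-11", "2018-01-12")),
  value = c(1, 5, 3, 5, 3, 2)
)
date_created <- as.Date(c("2018-01-02", "2018-01-03", "2018-01-08", "2018-01-09"))

# For each requested date, the index of the first table2 row on or after it
# (table2 must be sorted by date; NA when no such row exists)
idx <- vapply(
  seq_along(date_created),
  function(i) which(table2$date >= date_created[i])[1],
  integer(1)
)
value2 <- table2$value[idx]

# Assumed limit: ignore matches more than 7 days ahead
gap_days <- as.numeric(table2$date[idx] - date_created)
value2[!is.na(gap_days) & gap_days > 7] <- NA
value2
```

For this example the result matches the value2 column in the expected output above.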
I have a data frame of stocks and dates. I want to add a "next date" column. How should I do this?
The data is this:
df = data.frame(ticker = c("BHP", "BHP", "BHP", "BHP", "ANZ", "ANZ", "ANZ"), date = c("1999-05-31", "2000-06-30", "2001-06-29", "2002-06-28", "1999-09-30", "2000-09-29", "2001-09-28"))
df$date = as.POSIXct(df$date)
In human-readable form:
ticker | date
-----------------
BHP | 1999-05-31
BHP | 2000-06-30
BHP | 2001-06-29
BHP | 2002-06-28
ANZ | 1999-09-30
ANZ | 2000-09-29
ANZ | 2001-09-28
What I want is to add a column for the next date:
ticker | date | next_date
------------------------------------
BHP | 1999-05-31 | 2000-06-30
BHP | 2000-06-30 | 2001-06-29
BHP | 2001-06-29 | 2002-06-28
BHP | 2002-06-28 | NA # (or some default value)
ANZ | 1999-09-30 | 2000-09-29
ANZ | 2000-09-29 | 2001-09-28
ANZ | 2001-09-28 | NA
library(dplyr)
df %>%
group_by(ticker) %>%
mutate(next_date = lead(date))
We can use ave from base R to do this
df$next_date <- with(df, ave(as.Date(date), ticker, FUN = function(x) c(x[-1], NA)))
df$next_date
#[1] "2000-06-30" "2001-06-29" "2002-06-28" NA "2000-09-29" "2001-09-28" NA
Or we can use data.table
library(data.table)
setDT(df)[, next_date := shift(date, type = "lead"), by = ticker]