Calculate the days between dates from a grouped data.frame R dplyr

I would like to calculate the number of days between rows of a data.frame grouped by a couple of fields, so if I have the following data.frame:
da <- read.table(text="i j data date
2 682 147 2008-05-26
2 682 317 2010-11-13
2 682 217 2019-08-05
3 682 147 2008-05-26
3 682 317 2010-11-13
10 682 220 2019-08-08", header=TRUE)
require(dplyr)
da %>% count(i,j)
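For reference, count reports the group sizes, which spells out the expected number of periods per group (a quick sketch of its output):
#    i   j n
# 1  2 682 3
# 2  3 682 2
# 3 10 682 1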
I would like to calculate two periods for the first group, one period for the second, and none for the last.
I can calculate the intervals between the first and last dates:
require(lubridate)
da %>%
  group_by(i, j) %>%
  summarize(fini = ymd(min(date)), fend = ymd(max(date)), deltaD = as.numeric(fend - fini))
`summarise()` regrouping output by 'i' (override with `.groups` argument)
# A tibble: 3 x 5
# Groups: i [3]
i j fini fend deltaD
<int> <int> <date> <date> <dbl>
1 2 682 2008-05-26 2019-08-05 4088
2 3 682 2008-05-26 2010-11-13 901
3 10 682 2019-08-08 2019-08-08 0
That is fine if I have two rows in the group, but I can't figure out how to do it if I have three or more.

Do you need something like this?
library(dplyr)
da %>%
  mutate(date = as.Date(date)) %>%
  arrange(i, j, date) %>%
  group_by(i, j) %>%
  transmute(fini = date, fend = lead(date), deltaD = as.numeric(fend - fini)) %>%
  na.omit() %>%
  ungroup()
# i j fini fend deltaD
# <int> <int> <date> <date> <dbl>
#1 2 682 2008-05-26 2010-11-13 901
#2 2 682 2010-11-13 2019-08-05 3187
#3 3 682 2008-05-26 2010-11-13 901
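For what it's worth, a more compact sketch (not from the answers here) keeps the consecutive gaps per group as a list-column via base diff; groups with a single date simply get an empty vector:
library(dplyr)
da %>%
  mutate(date = as.Date(date)) %>%
  group_by(i, j) %>%
  summarise(deltaD = list(as.numeric(diff(sort(date)))), .groups = "drop")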

The following seems to work. I used lead (after arrange, to order by date), and some tweaks to avoid dropping groups with only one date:
da %>%
  group_by(i, j) %>%
  dplyr::arrange(date) %>%
  dplyr::mutate(lead_date = dplyr::lead(date),
                one_in_group = n() == 1) %>% # to maintain the row with i = 10
  dplyr::filter(!is.na(lead_date) | one_in_group) %>%
  dplyr::mutate(lead_date = ifelse(one_in_group, date, lead_date), # handles solo dates
                date = lubridate::ymd(date),
                lead_date = lubridate::ymd(lead_date),
                deltaD = as.numeric(lead_date - date)) %>%
  dplyr::select(-one_in_group) %>%
  dplyr::arrange(i, j, date)
# A tibble: 4 x 6
# Groups: i, j [3]
i j data date lead_date deltaD
<int> <int> <int> <date> <date> <dbl>
1 2 682 147 2008-05-26 2010-11-13 901
2 2 682 317 2010-11-13 2019-08-05 3187
3 3 682 147 2008-05-26 2010-11-13 901
4 10 682 220 2019-08-08 2019-08-08 0

Related

How do I use dplyr to correlate each column in a for loop?

I have a dataframe of 19 stocks, including the S&P 500 (SPX), over time. I want to correlate each of these stocks with the S&P for each month (Jan-Dec), making 18 x 12 = 216 different correlations, and store these in a list called stockList.
> tokens
# A tibble: 366 x 21
Month Date SPX TZERO .....(16 more columns of stocks)...... MPS
<dbl> <dttm> <dbl> <dbl> <dbl>
1 2020-01-02 3245.50 0.95 176.72
...
12 2020-12-31 3733.42 2.90 .....(16 more columns of stocks)..... 360.73
Here's where my error pops up, when using the index [i], or [[i]], in the cor() function:
stockList <- list()
for(i in 1:18) {
  stockList[[i]] <- tokens %>%
    group_by(Month) %>%
    summarize(correlation = cor(SPX, tokens[i+3], use = 'complete.obs'))
}
Error in summarise_impl(.data, dots) :
Evaluation error: incompatible dimensions.
How do I use column indexing in the cor() function when trying to summarize? Is there an alternative way?
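As a side note, the loop itself can be fixed by selecting each stock column with the .data pronoun instead of indexing the whole tokens data frame inside summarize. A hedged sketch; it assumes the 18 stock columns sit in positions 4 to 21, as the i+3 indexing suggests:
library(dplyr)
stock_names <- names(tokens)[4:21]  # assumed positions of the 18 stock columns
stockList <- list()
for (nm in stock_names) {
  stockList[[nm]] <- tokens %>%
    group_by(Month) %>%
    summarize(correlation = cor(SPX, .data[[nm]], use = "complete.obs"))
}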
First, to recreate data like yours:
library(tidyquant)
# Get gamestop, apple, and S&P 500 index prices
prices <- tq_get(c("GME", "AAPL", "^GSPC"),
                 get = "stock.prices",
                 from = "2020-01-01",
                 to = "2020-12-31")
library(tidyverse)
prices_wide <- prices %>%
  select(date, close, symbol) %>%
  pivot_wider(names_from = symbol, values_from = close) %>%
  mutate(Month = lubridate::month(date)) %>%
  select(Month, Date = date, GME, AAPL, SPX = `^GSPC`)
This should look like your data:
> prices_wide
# A tibble: 252 x 5
Month Date GME AAPL SPX
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2020-01-02 6.31 75.1 3258.
2 1 2020-01-03 5.88 74.4 3235.
3 1 2020-01-06 5.85 74.9 3246.
4 1 2020-01-07 5.52 74.6 3237.
5 1 2020-01-08 5.72 75.8 3253.
6 1 2020-01-09 5.55 77.4 3275.
7 1 2020-01-10 5.43 77.6 3265.
8 1 2020-01-13 5.43 79.2 3288.
9 1 2020-01-14 4.71 78.2 3283.
10 1 2020-01-15 4.61 77.8 3289.
# … with 242 more rows
Then I put that data in longer "tidy" format where each row has the stock value and the SPX value so I can compare them:
prices_wide %>%
  # I want every row to have Month, Date, and SPX
  pivot_longer(cols = -c(Month, Date, SPX),
               names_to = "symbol",
               values_to = "price") %>%
  group_by(Month, symbol) %>%
  summarize(correlation = cor(price, SPX)) %>%
  ungroup()
# A tibble: 24 x 3
Month symbol correlation
<dbl> <chr> <dbl>
1 1 AAPL 0.709
2 1 GME -0.324
3 2 AAPL 0.980
4 2 GME 0.874
5 3 AAPL 0.985
6 3 GME -0.177
7 4 AAPL 0.956
8 4 GME 0.873
9 5 AAPL 0.792
10 5 GME -0.435
# … with 14 more rows
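If a list like stockList is still wanted at the end, the summarised tibble can be split into one element per stock; monthly_cors below is a hypothetical name for the result of the pipeline above:
stockList <- monthly_cors %>%
  group_split(symbol)  # one tibble of 12 monthly correlations per stock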

Convert HMM/HHMM time column to timestamp in R

I am new here, please be gentle ;)
I have two time columns in a dataframe in R that use the HMM/HHMM format as a numeric. For example, 03:13 would be 313 and 14:14 would be 1414. An example would be sched_arr_time and sched_dep_time in the nycflights13 package.
I need to calculate the time difference in minutes. My SQL knowledge tells me I would substring this with a CASE WHEN and then glue it back together as a time format somehow, but I was hoping there is a more elegant way to deal with this in R.
Many thanks for your help!
This would explain the data:
library(nycflights13)
flights %>% select(sched_dep_time, sched_arr_time)
We can convert to a time class with as.ITime after changing the format to HH:MM with str_pad and str_replace, and then take the difference using difftime:
library(dplyr)
library(stringr)
library(data.table)
flights %>%
  head %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate_all(~ str_pad(., width = 4, pad = 0) %>%
               str_replace(., '^(..)', '\\1:') %>%
               as.ITime) %>%
  mutate(diff = difftime(sched_arr_time, sched_dep_time, unit = 'min'))
# A tibble: 6 x 3
# sched_dep_time sched_arr_time diff
# <ITime> <ITime> <drtn>
#1 05:15:00 08:19:00 184 mins
#2 05:29:00 08:30:00 181 mins
#3 05:40:00 08:50:00 190 mins
#4 05:45:00 10:22:00 277 mins
#5 06:00:00 08:37:00 157 mins
#6 05:58:00 07:28:00 90 mins
If we want to add a 'Date' as well, we can prepend the current date and parse with ymd_hms:
library(lubridate)
flights %>%
  head %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate_all(~ str_pad(., width = 4, pad = 0) %>%
               str_replace("^(..)(..)", "\\1:\\2:00") %>%
               str_c(Sys.Date(), ., sep = ' ') %>%
               ymd_hms) %>%
  mutate(diff = difftime(sched_arr_time, sched_dep_time, unit = 'min'))
Here is another option, using strptime:
as_time <- function(x)
  as.POSIXct(strptime(if_else(nchar(x) == 3, paste0("0", x), as.character(x)), "%H%M"))
flights %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate(diff_in_mins = difftime(as_time(sched_arr_time), as_time(sched_dep_time), units = "mins"))
## A tibble: 336,776 x 3
# sched_dep_time sched_arr_time diff_in_mins
# <int> <int> <drtn>
# 1 515 819 184 mins
# 2 529 830 181 mins
# 3 540 850 190 mins
# 4 545 1022 277 mins
# 5 600 837 157 mins
# 6 558 728 90 mins
# 7 600 854 174 mins
# 8 600 723 83 mins
# 9 600 846 166 mins
#10 600 745 105 mins
## … with 336,766 more rows
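A class-free arithmetic sketch (not from either answer above): since HHMM is just hours * 100 + minutes, integer division and modulo convert it directly to minutes since midnight. The to_minutes helper is hypothetical, and note the difference goes negative for flights scheduled to arrive after midnight:
library(dplyr)
library(nycflights13)
to_minutes <- function(x) (x %/% 100) * 60 + x %% 100  # HHMM -> minutes since midnight
flights %>%
  select(sched_dep_time, sched_arr_time) %>%
  mutate(diff_in_mins = to_minutes(sched_arr_time) - to_minutes(sched_dep_time))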

subset data getSymbols quantmod

Subset data, e.g. all of the previous year, and store it as a new object.
mtdl <- na.omit(getSymbols("MTDL.JK", auto.assign = F, src = "yahoo", periodicity = "weekly"))
week.year.mtdl <- mtdl %>%
  filter(DATE >= as.Date("2018-01-01") & DATE <= as.Date("2018-12-31"))
Here are a few ways to go about this if you want to use dplyr.
1. Transform the xts object into a data.frame:
df_mtdl <- data.frame(date = index(mtdl), coredata(mtdl))
week.year.mtdl <- df_mtdl %>%
  filter(date >= as.Date("2018-01-01") & date <= as.Date("2018-12-31"))
head(week.year.mtdl)
date MTDL.JK.Open MTDL.JK.High MTDL.JK.Low MTDL.JK.Close MTDL.JK.Volume MTDL.JK.Adjusted
1 2018-01-01 650 650 620 630 78200 609.6684
2 2018-01-08 630 650 610 610 291800 590.3138
3 2018-01-15 610 750 600 700 9390700 677.4093
4 2018-01-22 700 730 640 700 6816200 677.4093
5 2018-01-29 700 745 685 685 119900 662.8934
6 2018-02-05 695 715 630 635 1533000 614.5070
2. Use tidyquant. This returns a tibble instead of an xts object. tidyquant is built on top of quantmod and a lot of other packages.
library(tidyquant)
tq_mtdl <- tq_get("MTDL.JK", complete_cases = TRUE, periodicity = "weekly")
week.year.mtdl <- tq_mtdl %>%
  filter(date >= as.Date("2018-01-01") & date <= as.Date("2018-12-31"))
head(week.year.mtdl)
# A tibble: 6 x 7
date open high low close volume adjusted
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-01-04 645 645 620 625 137000 605.
2 2018-01-11 620 660 600 645 1460000 624.
3 2018-01-18 645 750 635 660 13683700 639.
4 2018-01-25 680 745 665 685 1359700 663.
5 2018-02-01 700 715 675 700 922200 677.
6 2018-02-08 695 695 630 690 673700 668.
Or use packages timetk (used as part of tidyquant) or tsbox to transform the data from xts to data.frame or tibble.
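For instance, a minimal timetk sketch, assuming the weekly xts object mtdl from above:
library(timetk)
tbl_mtdl <- tk_tbl(mtdl, rename_index = "date")  # xts -> tibble; the index becomes a 'date' column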
This will give the 2018 rows of an xts object:
mtdl["2018"]
All of these also work:
subset(mtdl, time(.) >= "2018-01-01" & time(.) <= "2018-12-31")
subset(mtdl, start = "2018-01-01", end = "2018-12-31")
window(mtdl, start = "2018-01-01", end = "2018-12-31")
dates <- seq(as.Date("2008-01-01"), as.Date("2008-12-31"), "day")
window(mtdl, dates)
mtdl[dates] # dates is from above
mtdl[ format(time(mtdl), "%Y") == 2018 ]
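As an aside, xts subsetting also accepts ISO-8601 ranges, which avoids building the dates vector:
mtdl["2018-01-01/2018-12-31"]  # full-year range
mtdl["2018-01/2018-06"]        # January through June 2018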

R - create dataset by removing duplicates based on a condition with filter

I have a data frame where for each day, I have several prices.
I would like to modify my data frame with the following code:
newdf <- Data %>%
  filter(
    if (Data$Date == Data$Echeance) {
      Data$Close == lag(Data$Close, 1)
    } else {
      Data$Close == Data$Close
    }
  )
However, it is not giving me what I want, that is:
create a new data frame where the variable Close takes its normal value, unless the day of Date is equal to the day of Echeance. In this case, take the following Close value.
I added filter because I wanted to remove the duplicate dates, and keep only one date per day where Close satisfies the condition above.
There is no error message, it just doesn't give me the right database.
Here is a glimpse of my data:
Date Echeance Compens. Open Haut Bas Close
1 1998-03-27 00:00:00 1998-09-10 00:00:00 125. 828 828 820 820. 197
2 1998-03-27 00:00:00 1998-11-10 00:00:00 128. 847 847 842 842. 124
3 1998-03-27 00:00:00 1999-01-11 00:00:00 131. 858 858 858 858. 2
4 1998-03-30 00:00:00 1998-09-10 00:00:00 125. 821 821 820 820. 38
5 1998-03-30 00:00:00 1998-11-10 00:00:00 129. 843 843 843 843. 1
6 1998-03-30 00:00:00 1999-01-11 00:00:00 131. 860 860 860 860. 5
Thanks a lot in advance.
Sounds like a use case for ifelse, with dplyr:
library(dplyr)
Data %>%
  mutate(Close = ifelse(Date == Echeance, lead(Close, 1), Close))
Here is an example:
dat %>%
  mutate(var_new = ifelse(date1 == date2, lead(var, 1), var))
# A tibble: 3 x 4
# date1 date2 var var_new
# <date> <date> <int> <int>
# 1 2018-03-27 2018-03-27 10 11
# 2 2018-03-28 2018-01-01 11 11
# 3 2018-03-29 2018-02-01 12 12
The function lead shifts the vector by one position. Also note that I created var_new just to show the difference, but you can mutate var directly.
Data used:
dat <- tibble(date1 = seq(from = as.Date("2018-03-27"), to = as.Date("2018-03-29"), by = "day"),
              date2 = c(as.Date("2018-03-27"), as.Date("2018-01-01"), as.Date("2018-02-01")),
              var = 10:12)
dat
# A tibble: 3 x 3
# date1 date2 var
# <date> <date> <int>
# 1 2018-03-27 2018-03-27 10
# 2 2018-03-28 2018-01-01 11
# 3 2018-03-29 2018-02-01 12
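One caveat worth noting (not part of the original answer): base ifelse drops attributes, so if the column being replaced were a Date rather than a number, dplyr's type-stable if_else is the safer choice. A minimal sketch with the dat tibble above:
dat %>%
  mutate(date_new = if_else(date1 == date2, lead(date2, 1), date2))  # keeps the Date class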

filtering based on two conditions: values less than a threshold on matching dates?

I'm missing some basic filtering know-how. What are some dplyr or other ways to return the instances when value is, say, less than 300 on the same date, for both scenario a and scenario b?
library(tidyverse)
library(lubridate)
scenario <- c("a","a","a","a","a","a","a","a","a","a","a","a","a","a",
"b","b","b","b","b","b","b","b","b","b","b","b","b","b")
str_tstep <- c("2/29/1924", "3/31/1924", "4/30/1924", "5/31/1924", "6/30/1924", "7/31/1924",
"8/31/1924", "9/30/1924", "10/31/1924", "11/30/1924", "12/31/1924", "1/31/1925",
"3/31/1926", "9/30/1926", "1/31/1922", "1/31/1924", "2/29/1924", "5/31/1924",
"10/31/1924","11/30/1924", "12/31/1924", "1/31/1925", "2/28/1925", "1/31/1926",
"2/28/1926", "3/31/1926", "1/31/1927", "1/31/1928")
tstep <- mdy(str_tstep)
value <- c(260,396,348,347,368,397,418,419,190,290,504,323,800,800,355,408,250,365,222,
299,504,323,800,397,288,800,387,415)
df <- data.frame(scenario, tstep, value)
We can group by 'tstep' and keep only the groups where all of the 'value's are less than 300:
df %>%
  group_by(tstep) %>%
  filter(all(value < 300))
# A tibble: 7 x 3
# Groups: tstep [4]
# scenario tstep value
# <fct> <dttm> <dbl>
#1 a 1924-02-29 00:00:00 260
#2 a 1924-10-31 00:00:00 190
#3 a 1924-11-30 00:00:00 290
#4 b 1924-02-29 00:00:00 250
#5 b 1924-10-31 00:00:00 222
#6 b 1924-11-30 00:00:00 299
#7 b 1926-02-28 00:00:00 288
If the number of distinct 'scenario's is less than 2 for some 'tstep' and we want to filter those out:
df %>%
  group_by(tstep) %>%
  filter(n_distinct(scenario) == 2, all(value < 300))
# A tibble: 6 x 3
# Groups: tstep [3]
# scenario tstep value
# <fct> <dttm> <dbl>
#1 a 1924-02-29 00:00:00 260
#2 a 1924-10-31 00:00:00 190
#3 a 1924-11-30 00:00:00 290
#4 b 1924-02-29 00:00:00 250
#5 b 1924-10-31 00:00:00 222
#6 b 1924-11-30 00:00:00 299
Something like this would do it (assuming I have interpreted the question correctly)...
df %>%
  filter(value < 300) %>%                 # remove values 300+
  group_by(tstep) %>%
  filter(all(c("a","b") %in% scenario))   # check both scenarios exist for each tstep
scenario tstep value
1 a 1924-02-29 260.
2 a 1924-10-31 190.
3 a 1924-11-30 290.
4 b 1924-02-29 250.
5 b 1924-10-31 222.
6 b 1924-11-30 299.
This will give you the dates that appear in BOTH a and b with values below 300 (unlike akrun's first solution, which also includes dates that appear in only one of a or b).
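An equivalent route (a sketch, not from the original answers) keeps the sub-300 rows first and then semi-joins to the dates that still have both scenarios:
library(dplyr)
low <- df %>% filter(value < 300)
low %>%
  semi_join(low %>%
              distinct(scenario, tstep) %>%
              count(tstep) %>%
              filter(n == 2),
            by = "tstep")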
