I'm trying to compute a rolling count over a fixed time window. Suppose the window is 48 hours. I would like to count every data point that falls between the date of the current observation and 48 hours before it. For example, if the datetime of the current observation is 05-07-2022 14:15:28, for that row I would like a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df = tibble(id = 1:7,
date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
id = 1:7,
date_time = ymd_hm(
'2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
'2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
)
)
df %>%
mutate(
count = slide_index_int(
.x = id,
.i = date_time,
.f = length,
.before = dhours(48)
)
)
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
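The window length is the only thing that changes when you adjust the interval. Here is a base-R sketch of the same counting logic with the window as a parameter (no package dependencies; `rolling_count` is a name made up for illustration):

```r
# Count, for each timestamp, the observations in (t - window, t].
# This mirrors the irregular-index rolling count in plain base R.
rolling_count <- function(times, window_hours) {
  window <- window_hours * 3600  # hours -> seconds
  vapply(
    seq_along(times),
    function(i) sum(times <= times[i] & times > times[i] - window),
    integer(1)
  )
}

date_time <- as.POSIXct(c(
  "2022-05-07 15:00", "2022-05-09 13:45", "2022-05-09 13:51", "2022-05-09 17:00",
  "2022-05-10 15:25", "2022-05-10 17:18", "2022-05-11 14:00"
), tz = "UTC")

rolling_count(date_time, 48)
#> [1] 1 2 3 3 4 5 4
```

Note the detail of the window bounds: `slide_index_int()` with `.before` includes both endpoints, while this sketch uses a half-open window; unless an observation lands exactly on the window edge, the results agree.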
How about this...
df %>%
mutate(count48 = map_int(date_time,
~sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
I have a data.frame with some prices per day. I would like to get the average daily price in another column (avg_price). How can I do that?
date price avg_price
1 2017-01-01 01:00:00 10 18.75
2 2017-01-01 01:00:00 10 18.75
3 2017-01-01 05:00:00 25 18.75
4 2017-01-01 04:00:00 30 18.75
5 2017-01-02 08:00:00 10 20
6 2017-01-02 08:00:00 30 20
7 2017-01-02 07:00:00 20 20
library(lubridate)
library(tidyverse)
df %>%
group_by(day = day(date)) %>%
summarise(avg_price = mean(price))
# A tibble: 2 x 2
day avg_price
<int> <dbl>
1 1 18.8
2 2 20
df %>%
group_by(day = day(date)) %>%
mutate(avg_price = mean(price))
# A tibble: 7 x 4
# Groups: day [2]
date price avg_price day
<dttm> <dbl> <dbl> <int>
1 2017-01-01 01:00:00 10 18.8 1
2 2017-01-01 01:00:00 10 18.8 1
3 2017-01-01 05:00:00 25 18.8 1
4 2017-01-01 04:00:00 30 18.8 1
5 2017-01-02 08:00:00 10 20 2
6 2017-01-02 08:00:00 30 20 2
7 2017-01-02 07:00:00 20 20 2
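One caveat: `day()` extracts only the day of the month, so rows from, say, 2017-01-01 and 2017-02-01 would fall into the same group. Grouping by the full calendar date avoids that collision. As a base-R sketch of the same per-group mean, using `ave()`:

```r
df <- data.frame(
  date = as.POSIXct(c(
    "2017-01-01 01:00:00", "2017-01-01 01:00:00", "2017-01-01 05:00:00",
    "2017-01-01 04:00:00", "2017-01-02 08:00:00", "2017-01-02 08:00:00",
    "2017-01-02 07:00:00"
  ), tz = "UTC"),
  price = c(10, 10, 25, 30, 10, 30, 20)
)

# ave() computes the group mean and recycles it back to the original rows;
# grouping on as.Date(date) keeps months distinct.
df$avg_price <- ave(df$price, as.Date(df$date), FUN = mean)
df$avg_price
#> [1] 18.75 18.75 18.75 18.75 20.00 20.00 20.00
```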
I am trying to expand the data below to a daily basis, based on the range given by the start and end date columns, and to get a daily sum of inventory.
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data
df %>%
mutate(across(2:3, ~ as.Date(.x,
format = "%d/%m/%Y"))) %>%
pivot_longer(cols = c(start, end), values_to = "date") %>%
arrange(date) %>%
select(date, inventory)
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
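The join above leaves NA on days that are not a start or end row. If the desired output is a daily total across overlapping ranges (my reading of the requested "sum"), one base-R sketch is to expand every row to its full run of days and then aggregate:

```r
df <- data.frame(
  id = 1:4,
  start = as.Date(c("2022-05-01", "2022-05-10", "2022-05-11", "2022-05-14")),
  end   = as.Date(c("2022-05-02", "2022-05-15", "2022-05-21", "2022-05-17")),
  inventory = c(100, 50, 80, 10)
)

# One row per id per day it is in range
daily <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(date = seq(df$start[i], df$end[i], by = "day"),
             inventory = df$inventory[i])
}))

# Daily total across all overlapping ranges
out <- aggregate(inventory ~ date, data = daily, FUN = sum)
out$inventory[out$date == as.Date("2022-05-14")]
#> [1] 140
```

On 2022-05-14, for instance, rows 2, 3, and 4 are all in range, so the total is 50 + 80 + 10 = 140.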
I have the following simulated dataset, a y column over the fixed trading days (say 250) of 2018.
data
# A tibble: 249 × 2
Date y
<dttm> <dbl>
1 2018-01-02 00:00:00 0.409
2 2018-01-03 00:00:00 -1.90
3 2018-01-04 00:00:00 0.131
4 2018-01-05 00:00:00 -0.619
5 2018-01-08 00:00:00 0.449
6 2018-01-09 00:00:00 0.448
7 2018-01-10 00:00:00 0.124
8 2018-01-11 00:00:00 -0.346
9 2018-01-12 00:00:00 0.775
10 2018-01-15 00:00:00 -0.948
# … with 239 more rows
with tail
> tail(data,n=10)
# A tibble: 10 × 2
Date y
<dttm> <dbl>
1 2018-12-13 00:00:00 -0.00736
2 2018-12-14 00:00:00 -1.30
3 2018-12-17 00:00:00 0.227
4 2018-12-18 00:00:00 -0.671
5 2018-12-19 00:00:00 -0.750
6 2018-12-20 00:00:00 -0.906
7 2018-12-21 00:00:00 -1.74
8 2018-12-27 00:00:00 0.331
9 2018-12-28 00:00:00 -0.768
10 2018-12-31 00:00:00 0.649
I want to calculate the rolling sd of column y with window 60, and then find the exact number of trading days, not calendar days, spanned by a window (perhaps it can be done from the index? I don't know).
data2 <- data %>%
  mutate(date = as.Date(Date))
data3 <- data2[, -1]
head(data3)
roll_win <- 60
data3$a <- c(rep(NA_real_, roll_win - 1), zoo::rollapply(data3$y, roll_win, sd))
dat <- subset(data3, !is.na(a))
dat_max <- dat[dat$a == max(dat$a, na.rm = TRUE), ]
dat_max$date_start <- dat_max$date - (roll_win - 1)
dat_max
It turns out that the period of high volatility is:
dat_max
# A tibble: 1 × 4
y date a date_start
<dbl> <date> <dbl> <date>
1 0.931 2018-04-24 1.18 2018-02-24
Now if I subtract the two dates I get:
> dat_max$date - dat_max$date_start
Time difference of 59 days
Which is true in calendar days, but these are NOT THE TRADING DAYS.
I asked a similar question here, but it didn't solve the problem; that question was about how to obtain the days of high volatility.
Any help on how I can obtain these trading days? Thanks in advance.
EDIT
FOR FULL DATA
library(gsheet)
data= gsheet2tbl("https://docs.google.com/spreadsheets/d/1PdZDb3OgqSaO6znUWsAh7p_MVLHgNbQM/edit?usp=sharing&ouid=109626011108852110510&rtpof=true&sd=true")
data
Start date for each time window
If the question is how to calculate the start date for each window then using the data in the Note at the end and a window of 3:
w <- 3
out <- mutate(data,
sd = zoo::rollapplyr(y, w, sd, fill = NA),
start = dplyr::lag(Date, w - 1)
)
out
giving:
Date y sd start
1 2018-12-13 -0.00736 NA <NA>
2 2018-12-14 -1.30000 NA <NA>
3 2018-12-17 0.22700 0.8223515 2018-12-13
4 2018-12-18 -0.67100 0.7674388 2018-12-14
5 2018-12-19 -0.75000 0.5427053 2018-12-17
6 2018-12-20 -0.90600 0.1195840 2018-12-18
7 2018-12-21 -1.74000 0.5322894 2018-12-19
8 2018-12-27 0.33100 1.0420146 2018-12-20
9 2018-12-28 -0.76800 1.0361488 2018-12-21
10 2018-12-31 0.64900 0.7435068 2018-12-27
Largest sd's with their start and end dates
and the largest 4 sd's and their start and end dates are:
head(dplyr::arrange(out, -sd), 4)
giving:
Date y sd start
8 2018-12-27 0.331 1.0420146 2018-12-20
9 2018-12-28 -0.768 1.0361488 2018-12-21
3 2018-12-17 0.227 0.8223515 2018-12-13
4 2018-12-18 -0.671 0.7674388 2018-12-14
Rows between two dates
If the question is how many rows are between and include two dates that appear in data then:
d1 <- as.Date("2018-12-14")
d2 <- as.Date("2018-12-20")
diff(match(c(d1, d2), data$Date)) + 1
## [1] 5
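Applied to the original question, the same `match()` idea counts trading days for any pair of dates in the calendar; a minimal sketch using the Note data:

```r
dates <- as.Date(c("2018-12-13", "2018-12-14", "2018-12-17", "2018-12-18",
                   "2018-12-19", "2018-12-20", "2018-12-21", "2018-12-27",
                   "2018-12-28", "2018-12-31"))

# Row positions in the trading calendar, not calendar arithmetic,
# give the number of trading days between two dates (inclusive).
trading_days <- function(d1, d2, calendar) {
  diff(match(c(d1, d2), calendar)) + 1L
}

trading_days(as.Date("2018-12-14"), as.Date("2018-12-20"), dates)
#> [1] 5
```

(By construction, every rollapply window of width w contains exactly w trading rows, so for the window itself the count is simply w.)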
Note
Lines <- " Date y
1 2018-12-13T00:00:00 -0.00736
2 2018-12-14T00:00:00 -1.30
3 2018-12-17T00:00:00 0.227
4 2018-12-18T00:00:00 -0.671
5 2018-12-19T00:00:00 -0.750
6 2018-12-20T00:00:00 -0.906
7 2018-12-21T00:00:00 -1.74
8 2018-12-27T00:00:00 0.331
9 2018-12-28T00:00:00 -0.768
10 2018-12-31T00:00:00 0.649"
data <- read.table(text = Lines)
data$Date <- as.Date(data$Date)
Here's what I would like to achieve as a function in Excel, but I can't seem to find a solution to do it in R.
This is what I tried to do but it does not seem to allow me to operate with the previous values of the new column I'm trying to make.
Here is a reproducible example:
library(dplyr)
set.seed(42) ## for sake of reproducibility
dat <- data.frame(date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"))
This would be the output of the dataframe:
dat
date
1 2020-12-26
2 2020-12-27
3 2020-12-28
4 2020-12-29
5 2020-12-30
6 2020-12-31
Desired output:
date periodNumber
1 2020-12-26 1
2 2020-12-27 2
3 2020-12-28 3
4 2020-12-29 4
5 2020-12-30 5
6 2020-12-31 6
My try at this:
dat %>%
mutate(periodLag = dplyr::lag(date)) %>%
mutate(periodNumber = ifelse(is.na(periodLag)==TRUE, 1,
ifelse(date == periodLag, dplyr::lag(periodNumber), (dplyr::lag(periodNumber) + 1))))
Excel formula screenshot:
You could use dplyr's cur_group_id():
library(dplyr)
set.seed(42)
# I used a larger example
dat <- data.frame(date=sample(seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"), size = 30, replace = TRUE))
dat %>%
arrange(date) %>% # needs sorting because of the random example
group_by(date) %>%
mutate(periodNumber = cur_group_id())
This returns
# A tibble: 30 x 2
# Groups: date [6]
date periodNumber
<date> <int>
1 2020-12-26 1
2 2020-12-26 1
3 2020-12-26 1
4 2020-12-26 1
5 2020-12-26 1
6 2020-12-26 1
7 2020-12-26 1
8 2020-12-26 1
9 2020-12-27 2
10 2020-12-27 2
11 2020-12-27 2
12 2020-12-27 2
13 2020-12-27 2
14 2020-12-27 2
15 2020-12-27 2
16 2020-12-28 3
17 2020-12-28 3
18 2020-12-28 3
19 2020-12-29 4
20 2020-12-29 4
21 2020-12-29 4
22 2020-12-29 4
23 2020-12-29 4
24 2020-12-29 4
25 2020-12-30 5
26 2020-12-30 5
27 2020-12-30 5
28 2020-12-30 5
29 2020-12-30 5
30 2020-12-31 6
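If you'd rather not group at all, the same index can be built directly by counting how many distinct dates have been seen so far (this assumes the vector is already sorted, as in the answer above):

```r
dates <- sort(as.Date(c("2020-12-27", "2020-12-26", "2020-12-28",
                        "2020-12-27", "2020-12-26", "2020-12-31")))

# !duplicated() flags the first appearance of each date;
# the cumulative sum turns those flags into a running period number.
periodNumber <- cumsum(!duplicated(dates))
periodNumber
#> [1] 1 1 2 2 3 4
```

Inside a `mutate()`, `dplyr::dense_rank(date)` produces the same numbering.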
I have the following dataset and need the times (not the dates) placed into a separate column alongside the date; the id column can be used to join each time with its date.
dataset:
# A tibble: 10 x 2
origintime id
<dttm> <int>
1 2021-03-04 18:44:25 1
2 2021-03-04 18:28:32 2
3 2021-03-04 18:25:55 3
4 2021-03-04 18:23:00 4
5 2021-03-04 18:20:00 5
6 2021-03-04 18:15:58 6
7 2021-03-04 18:11:41 7
8 2021-03-04 18:10:57 8
9 2021-03-04 18:10:33 9
10 2021-03-04 18:07:01 10
outcome:
# A tibble: 10 x 3
   origintime time        id
   <date>     <chr>    <int>
 1 2021-03-04 18:44:25      1
 2 2021-03-04 18:28:32      2
 3 2021-03-04 18:25:55      3
 4 2021-03-04 18:23:00      4
 5 2021-03-04 18:20:00      5
 6 2021-03-04 18:15:58      6
 7 2021-03-04 18:11:41      7
 8 2021-03-04 18:10:57      8
 9 2021-03-04 18:10:33      9
10 2021-03-04 18:07:01     10
reproducible code:
structure(list(origintime = structure(c(1614883465.299, 1614882512.721,
1614882355.215, 1614882180.074, 1614882000.671, 1614881758.214,
1614881501.122, 1614881457.527, 1614881433.217, 1614881221.603
), tzone = "UTC", class = c("POSIXct", "POSIXt")), id = 1:10), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Just use format with %T to extract the time component from the 'origintime' column, while converting 'origintime' to Date class:
library(dplyr)
df1 <- df1 %>%
mutate(time = format(origintime, '%T'), origintime = as.Date(origintime))
Or use separate and return as character columns
library(tidyr)
df1 %>%
separate(origintime, into = c('origintime', 'time'), sep=" ")
Or with data.table, splitting on the space with tstrsplit (origintime must be converted to character first, since strsplit only accepts character input):
library(data.table)
setDT(df1)
df1[, c('origintime', 'time') := tstrsplit(as.character(origintime), " ")]
df1
# origintime id time
# 1: 2021-03-04 1 18:44:25
# 2: 2021-03-04 2 18:28:32
# 3: 2021-03-04 3 18:25:55
# 4: 2021-03-04 4 18:23:00
# 5: 2021-03-04 5 18:20:00
# 6: 2021-03-04 6 18:15:58
# 7: 2021-03-04 7 18:11:41
# 8: 2021-03-04 8 18:10:57
# 9: 2021-03-04 9 18:10:33
#10: 2021-03-04 10 18:07:01
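Splitting on the space leaves time as a character column. If the time needs to behave numerically (sorting, differences), a base-R sketch is to keep it as seconds since midnight instead (UTC assumed here):

```r
origintime <- as.POSIXct(c("2021-03-04 18:44:25", "2021-03-04 18:07:01"),
                         tz = "UTC")

# Seconds since midnight: sorts and subtracts correctly,
# unlike an "HH:MM:SS" string.
secs <- as.numeric(origintime) %% 86400
secs
#> [1] 67465 65221

# Round-trip back to a clock-time string for printing
format(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"), "%T")
#> [1] "18:44:25" "18:07:01"
```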