I have the following simulated dataset of a column y over the fixed trading days (say 250) of 2018.
data
# A tibble: 249 × 2
Date y
<dttm> <dbl>
1 2018-01-02 00:00:00 0.409
2 2018-01-03 00:00:00 -1.90
3 2018-01-04 00:00:00 0.131
4 2018-01-05 00:00:00 -0.619
5 2018-01-08 00:00:00 0.449
6 2018-01-09 00:00:00 0.448
7 2018-01-10 00:00:00 0.124
8 2018-01-11 00:00:00 -0.346
9 2018-01-12 00:00:00 0.775
10 2018-01-15 00:00:00 -0.948
# … with 239 more rows
and its tail:
> tail(data,n=10)
# A tibble: 10 × 2
Date y
<dttm> <dbl>
1 2018-12-13 00:00:00 -0.00736
2 2018-12-14 00:00:00 -1.30
3 2018-12-17 00:00:00 0.227
4 2018-12-18 00:00:00 -0.671
5 2018-12-19 00:00:00 -0.750
6 2018-12-20 00:00:00 -0.906
7 2018-12-21 00:00:00 -1.74
8 2018-12-27 00:00:00 0.331
9 2018-12-28 00:00:00 -0.768
10 2018-12-31 00:00:00 0.649
I want to calculate the rolling sd of column y with a window of 60 and then find the exact number of trading days, not the usual calendar days (maybe it can be done from the index? I don't know).
data2 <- data %>%
  mutate(date = as.Date(Date))
data3 <- data2[, -1]; head(data3)
roll_win <- 60
data3$a <- c(rep(NA_real_, roll_win - 1), zoo::rollapply(data3$y, roll_win, sd))
dat <- subset(data3, !is.na(a))
dat_max <- dat[dat$a == max(dat$a, na.rm = TRUE), ]
dat_max$date_start <- dat_max$date - (roll_win - 1)
dat_max
It turns out that the period of high volatility is:
dat_max
# A tibble: 1 × 4
y date a date_start
<dbl> <date> <dbl> <date>
1 0.931 2018-04-24 1.18 2018-02-24
Now if I subtract the two dates I get:
> dat_max$date - dat_max$date_start
Time difference of 59 days
This is correct in calendar days, but these are NOT THE TRADING DAYS.
I have asked a similar question here, but it didn't solve the problem; that question was about how to obtain the days of high volatility.
Any help on how I can obtain these trading days? Thanks in advance.
EDIT
For the full data:
library(gsheet)
data= gsheet2tbl("https://docs.google.com/spreadsheets/d/1PdZDb3OgqSaO6znUWsAh7p_MVLHgNbQM/edit?usp=sharing&ouid=109626011108852110510&rtpof=true&sd=true")
data
Start date for each time window
If the question is how to calculate the start date for each window then using the data in the Note at the end and a window of 3:
w <- 3
out <- mutate(data,
  sd = zoo::rollapplyr(y, w, sd, fill = NA),
  start = dplyr::lag(Date, w - 1)
)
out
giving:
Date y sd start
1 2018-12-13 -0.00736 NA <NA>
2 2018-12-14 -1.30000 NA <NA>
3 2018-12-17 0.22700 0.8223515 2018-12-13
4 2018-12-18 -0.67100 0.7674388 2018-12-14
5 2018-12-19 -0.75000 0.5427053 2018-12-17
6 2018-12-20 -0.90600 0.1195840 2018-12-18
7 2018-12-21 -1.74000 0.5322894 2018-12-19
8 2018-12-27 0.33100 1.0420146 2018-12-20
9 2018-12-28 -0.76800 1.0361488 2018-12-21
10 2018-12-31 0.64900 0.7435068 2018-12-27
Largest sd's with their start and end dates
and the largest 4 sd's and their start and end dates are:
head(dplyr::arrange(out, -sd), 4)
giving:
Date y sd start
8 2018-12-27 0.331 1.0420146 2018-12-20
9 2018-12-28 -0.768 1.0361488 2018-12-21
3 2018-12-17 0.227 0.8223515 2018-12-13
4 2018-12-18 -0.671 0.7674388 2018-12-14
Rows between two dates
If the question is how many rows lie between, and including, two dates that appear in data, then:
d1 <- as.Date("2018-12-14")
d2 <- as.Date("2018-12-20")
diff(match(c(d1, d2), data$Date)) + 1
## [1] 5
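Putting the pieces together for the original window of 60: the start date of each window comes from dplyr::lag() and the number of trading days in a window follows from the row indices (it is simply the window length). A sketch, assuming the full data from the EDIT above is loaded as data with columns Date and y:
library(dplyr)
roll_win <- 60
out60 <- data %>%
  mutate(
    sd    = zoo::rollapplyr(y, roll_win, sd, fill = NA),
    start = dplyr::lag(Date, roll_win - 1)
  )
i <- which.max(out60$sd)                                     # row with the highest rolling sd
out60[i, c("start", "Date", "sd")]                           # start and end of that window
diff(match(c(out60$start[i], out60$Date[i]), data$Date)) + 1 # trading days in the window
## [1] 60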
Note
Lines <- " Date y
1 2018-12-13T00:00:00 -0.00736
2 2018-12-14T00:00:00 -1.30
3 2018-12-17T00:00:00 0.227
4 2018-12-18T00:00:00 -0.671
5 2018-12-19T00:00:00 -0.750
6 2018-12-20T00:00:00 -0.906
7 2018-12-21T00:00:00 -1.74
8 2018-12-27T00:00:00 0.331
9 2018-12-28T00:00:00 -0.768
10 2018-12-31T00:00:00 0.649"
data <- read.table(text = Lines)
data$Date <- as.Date(data$Date)
Related
I'm trying to calculate a rolling window over a fixed time interval. Suppose the interval is 48 hours. I would like to get every data point contained between the date of the current observation and 48 hours before that observation. For example, if the datetime of the current observation is 05-07-2022 14:15:28, then for that position I would like a count of every occurrence between that date and 03-07-2022 14:15:28. Seconds are not fundamental to the analysis.
library(tidyverse)
library(lubridate)
df = tibble(id = 1:7,
            date_time = ymd_hm('2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
                               '2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'))
# A tibble: 7 × 2
id date_time
<int> <dttm>
1 1 2022-05-07 15:00:00
2 2 2022-05-09 13:45:00
3 3 2022-05-09 13:51:00
4 4 2022-05-09 17:00:00
5 5 2022-05-10 15:25:00
6 6 2022-05-10 17:18:00
7 7 2022-05-11 14:00:00
With the example window of 48 hours, that would yield:
# A tibble: 7 × 4
id date_time lag_48hours count
<int> <dttm> <dttm> <dbl>
1 1 2022-05-07 15:00:00 2022-05-05 15:00:00 1
2 2 2022-05-09 13:45:00 2022-05-07 13:45:00 2
3 3 2022-05-09 13:51:00 2022-05-07 13:51:00 3
4 4 2022-05-09 17:00:00 2022-05-07 17:00:00 3
5 5 2022-05-10 15:25:00 2022-05-08 15:25:00 4
6 6 2022-05-10 17:18:00 2022-05-08 17:18:00 5
7 7 2022-05-11 14:00:00 2022-05-09 14:00:00 4
I added the lag column for illustration purposes. Any idea how to obtain the count column? I need to be able to adjust the window (48 hours in this example).
I'd encourage you to use slider, which allows you to do rolling window analysis using an irregular index.
library(tidyverse)
library(lubridate)
library(slider)
df = tibble(
  id = 1:7,
  date_time = ymd_hm(
    '2022-05-07 15:00', '2022-05-09 13:45', '2022-05-09 13:51', '2022-05-09 17:00',
    '2022-05-10 15:25', '2022-05-10 17:18', '2022-05-11 14:00'
  )
)
df %>%
  mutate(
    count = slide_index_int(
      .x = id,
      .i = date_time,
      .f = length,
      .before = dhours(48)
    )
  )
#> # A tibble: 7 × 3
#> id date_time count
#> <int> <dttm> <int>
#> 1 1 2022-05-07 15:00:00 1
#> 2 2 2022-05-09 13:45:00 2
#> 3 3 2022-05-09 13:51:00 3
#> 4 4 2022-05-09 17:00:00 3
#> 5 5 2022-05-10 15:25:00 4
#> 6 6 2022-05-10 17:18:00 5
#> 7 7 2022-05-11 14:00:00 4
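To adjust the window, pass a different period to .before, e.g. .before = dhours(24) for a 24-hour window or .before = ddays(7) for a week.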
How about this...
df %>%
  mutate(count48 = map_int(date_time,
                           ~sum(date_time <= . & date_time > . - 48 * 60 * 60)))
# A tibble: 7 × 3
id date_time count48
<int> <dttm> <int>
1 1 2022-05-07 15:00:00 1
2 2 2022-05-09 13:45:00 2
3 3 2022-05-09 13:51:00 3
4 4 2022-05-09 17:00:00 3
5 5 2022-05-10 15:25:00 4
6 6 2022-05-10 17:18:00 5
7 7 2022-05-11 14:00:00 4
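Note that this map_int approach compares every timestamp against every other one, so it scales quadratically with the number of rows; for longer series the slider approach above should scale better. The window is adjusted by changing the 48 * 60 * 60 term (seconds).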
This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
I have a data.frame with some prices per day. I would like to get the average daily price in another column (avg_price). How can I do that?
date price avg_price
1 2017-01-01 01:00:00 10 18.75
2 2017-01-01 01:00:00 10 18.75
3 2017-01-01 05:00:00 25 18.75
4 2017-01-01 04:00:00 30 18.75
5 2017-01-02 08:00:00 10 20
6 2017-01-02 08:00:00 30 20
7 2017-01-02 07:00:00 20 20
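The answers below use a data frame df; a hypothetical reconstruction of the posted data (column names date and price as in the table above):
library(tibble)
library(lubridate)

df <- tibble(
  date  = ymd_hms(c("2017-01-01 01:00:00", "2017-01-01 01:00:00",
                    "2017-01-01 05:00:00", "2017-01-01 04:00:00",
                    "2017-01-02 08:00:00", "2017-01-02 08:00:00",
                    "2017-01-02 07:00:00")),
  price = c(10, 10, 25, 30, 10, 30, 20)
)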
library(lubridate)
library(tidyverse)
df %>%
  group_by(day = day(date)) %>%
  summarise(avg_price = mean(price))
# A tibble: 2 x 2
day avg_price
<int> <dbl>
1 1 18.8
2 2 20
df %>%
  group_by(day = day(date)) %>%
  mutate(avg_price = mean(price))
# A tibble: 7 x 4
# Groups: day [2]
date price avg_price day
<dttm> <dbl> <dbl> <int>
1 2017-01-01 01:00:00 10 18.8 1
2 2017-01-01 01:00:00 10 18.8 1
3 2017-01-01 05:00:00 25 18.8 1
4 2017-01-01 04:00:00 30 18.8 1
5 2017-01-02 08:00:00 10 20 2
6 2017-01-02 08:00:00 30 20 2
7 2017-01-02 07:00:00 20 20 2
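One caveat: day(date) returns only the day of the month, so observations from different months (e.g. 2017-01-01 and 2017-02-01) would fall into the same group. Grouping by the calendar date avoids that:
df %>%
  group_by(day = as.Date(date)) %>%
  mutate(avg_price = mean(price))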
I have to calculate the number of frost change days (NFCD) on a weekly basis.
That is, the number of days on which the temperature crosses 0°C (the daily minimum is below 0°C and the daily maximum is above 0°C).
Let's say I work with the years 1957-1980 with hourly temperatures.
Example data (a couple of rows):
Date Time (UTC) temperature
1957-07-01 00:00:00 5
1957-07-01 03:00:00 6.2
1957-07-01 05:00:00 9
1957-07-01 06:00:00 10
1957-07-01 07:00:00 10
1957-07-01 08:00:00 14
1957-07-01 09:00:00 13.2
1957-07-01 10:00:00 15
1957-07-01 11:00:00 15
1957-07-01 12:00:00 16.3
1957-07-01 13:00:00 15.8
Expected data:
year month week NFCD
1957 7 1 1
1957 7 2 5
dat <- data.frame(date=c(rep("A",5),rep("B",5)), time=rep(1:5, times=2), temp=c(1:5,-2,1:4))
dat
# date time temp
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 4
# 5 A 5 5
# 6 B 1 -2
# 7 B 2 1
# 8 B 3 2
# 9 B 4 3
# 10 B 5 4
aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
# date temp
# 1 A FALSE
# 2 B TRUE
(then rename temp to NFCD)
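For example, with out as a placeholder name for the result:
out <- aggregate(temp ~ date, data = dat, FUN = function(z) min(z) <= 0 && max(z) > 0)
names(out)[names(out) == "temp"] <- "NFCD"
out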
Using the data from r2evans's answer you can also use tidyverse logic:
library(tidyverse)
dat %>%
  group_by(date) %>%
  summarize(NFCD = min(temp) < 0 & max(temp) > 0)
which gives:
# A tibble: 2 x 2
date NFCD
<chr> <lgl>
1 A FALSE
2 B TRUE
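To get the weekly summary the question asks for, the per-day flag can be aggregated again by year, month, and week. A minimal sketch with made-up hourly data (the names temps, datetime, and temp are illustrative, not from the original post):
library(tidyverse)
library(lubridate)

# hypothetical hourly readings: two days, the second of which crosses 0 °C
temps <- tibble(
  datetime = ymd_h("1957-07-01 00") + hours(c(0, 3, 6, 24, 27, 30)),
  temp     = c(1, 5, 6, -2, 1, 4)
)

temps %>%
  mutate(date = as_date(datetime)) %>%
  group_by(date) %>%
  summarise(fcd = min(temp) < 0 & max(temp) > 0, .groups = "drop") %>%
  group_by(year = year(date), month = month(date), week = isoweek(date)) %>%
  summarise(NFCD = sum(fcd), .groups = "drop")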
I have a data set with a lot of countries and currency data, like this:
iso date SPOT
<chr> <date> <dbl>
1 AUD 2000-01-03 0.658
2 AUD 2000-01-04 0.655
3 AUD 2000-01-05 0.658
4 AUD 2000-01-06 0.653
5 AUD 2000-01-07 0.655
6 AUD 2000-01-10 0.656
7 AUD 2000-01-11 0.658
8 AUD 2000-01-12 0.659
9 AUD 2000-01-13 0.668
10 AUD 2000-01-14 0.666
and I want to create an exact date index where each observation is mapped to the same day one year earlier, i.e., mapping the data to "LAG1" like this, where LAG1 = date - years(1):
iso date SPOT LAG1
<chr> <date> <dbl> <date>
1 AUD 2000-01-03 0.658 1999-01-03
2 AUD 2000-01-04 0.655 1999-01-04
3 AUD 2000-01-05 0.658 1999-01-05
4 AUD 2000-01-06 0.653 1999-01-06
5 AUD 2000-01-07 0.655 1999-01-07
6 AUD 2000-01-10 0.656 1999-01-10
7 AUD 2000-01-11 0.658 1999-01-11
8 AUD 2000-01-12 0.659 1999-01-12
9 AUD 2000-01-13 0.668 1999-01-13
10 AUD 2000-01-14 0.666 1999-01-14
This was my solution:
df %>%
  mutate(LAG1 = date - years(1)) %>%
  select(iso, LAG1 = date, LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "LAG1")) %>%
  as_tibble()
but I don't like it because it's a bunch of lines for something I think should be simpler, and I want to make it into a function.
Is there a better way to do this?
I think your intent of merging/joining is the right way to go. In fact, it's "right" because it will naturally deal with data anomalies better. I also think there are a couple of small logic errors in your code.
Since your data doesn't go back far enough to look at past years, here is some fake data. I'm making SPOT a simple sequence to help visualize the ordering, but otherwise it doesn't matter much. I'm also going to introduce two anomalies into the data to demonstrate how they show up in the end.
library(dplyr)
library(lubridate)
dates <- seq.Date(as.Date("2020-03-15"), by = "day", length.out = 5)
df <- tibble(
  iso = rep(c("AUD", "USD"), each = 10),
  date = rep(c(dates - years(1), dates), times = 2),
  SPOT = 1:20
)
# data missingness
df <- df[-3,]
# repeated date
df$date[12] <- df$date[13]
df
# # A tibble: 19 x 3
# iso date SPOT
# <chr> <date> <int>
# 1 AUD 2019-03-15 1
# 2 AUD 2019-03-16 2
# 3 AUD 2019-03-18 4
# 4 AUD 2019-03-19 5
# 5 AUD 2020-03-15 6
# 6 AUD 2020-03-16 7
# 7 AUD 2020-03-17 8
# 8 AUD 2020-03-18 9
# 9 AUD 2020-03-19 10
# 10 USD 2019-03-15 11
# 11 USD 2019-03-16 12
# 12 USD 2019-03-18 13
# 13 USD 2019-03-18 14
# 14 USD 2019-03-19 15
# 15 USD 2020-03-15 16
# 16 USD 2020-03-16 17
# 17 USD 2020-03-17 18
# 18 USD 2020-03-18 19
# 19 USD 2020-03-19 20
Using your code from above, we see this:
df %>%
  mutate(date = date - years(1)) %>%
  rename(LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "date"))
# # A tibble: 19 x 4
# iso date LAG1_SPOT SPOT
# <chr> <date> <int> <int>
# 1 AUD 2019-03-15 6 1
# 2 AUD 2019-03-16 7 2
# 3 AUD 2019-03-18 9 4
# 4 AUD 2019-03-19 10 5
# 5 AUD 2020-03-15 NA 6
# 6 AUD 2020-03-16 NA 7
# 7 AUD 2020-03-17 NA 8
# 8 AUD 2020-03-18 NA 9
# 9 AUD 2020-03-19 NA 10
# 10 USD 2019-03-15 16 11
# 11 USD 2019-03-16 17 12
# 12 USD 2019-03-18 19 13
# 13 USD 2019-03-18 19 14
# 14 USD 2019-03-19 20 15
# 15 USD 2020-03-15 NA 16
# 16 USD 2020-03-16 NA 17
# 17 USD 2020-03-17 NA 18
# 18 USD 2020-03-18 NA 19
# 19 USD 2020-03-19 NA 20
Since I believe your intent is to compare this year's data with last year's data, the above shows that we have paired them, but the date of reference is last year. I suggest you should be using +:
df %>%
  mutate(date = date + years(1)) %>%
  rename(LAG1_SPOT = SPOT) %>%
  right_join(., df, by = c("iso", "date"))
# # A tibble: 20 x 4
# iso date LAG1_SPOT SPOT
# <chr> <date> <int> <int>
# 1 AUD 2019-03-15 NA 1
# 2 AUD 2019-03-16 NA 2
# 3 AUD 2019-03-18 NA 4
# 4 AUD 2019-03-19 NA 5
# 5 AUD 2020-03-15 1 6
# 6 AUD 2020-03-16 2 7
# 7 AUD 2020-03-17 NA 8
# 8 AUD 2020-03-18 4 9
# 9 AUD 2020-03-19 5 10
# 10 USD 2019-03-15 NA 11
# 11 USD 2019-03-16 NA 12
# 12 USD 2019-03-18 NA 13
# 13 USD 2019-03-18 NA 14
# 14 USD 2019-03-19 NA 15
# 15 USD 2020-03-15 11 16
# 16 USD 2020-03-16 12 17
# 17 USD 2020-03-17 NA 18
# 18 USD 2020-03-18 13 19
# 19 USD 2020-03-18 14 19
# 20 USD 2020-03-19 15 20
This also shows how data anomalies present themselves. First, in AUD we see that 03-17 is missing last year's data, so we have nothing to compare the 8 spot against. This is simply missing data: unavoidable, though a lag here would have returned a value, likely from the wrong date. Second, because of the duplicated data (acquisition systems are imperfect!), we now have two rows for USD on 2020-03-18, which is certainly suspect (but outside the scope of your question). We have compared both of 2019's values with the single 2020 value.
Even if the data anomalies never show up in your data, I still think a join is the correct method for dealing with this: if a lag ever picks the wrong row (leap years?), you will never know that it failed; you'll get data and use it with no indication.
BTW: if you are just looking to reduce the four lines of code, this is perfectly equivalent:
transmute(df, iso, date = date + years(1), LAG1_SPOT = SPOT) %>%
right_join(., df, by = c("iso", "date"))
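And since the question mentions turning this into a function, a small sketch of a wrapper (the name add_lagged_spot and the lag_by argument are illustrative):
library(dplyr)
library(lubridate)

add_lagged_spot <- function(df, lag_by = years(1)) {
  transmute(df, iso, date = date + lag_by, LAG1_SPOT = SPOT) %>%
    right_join(df, by = c("iso", "date"))
}

# usage: add_lagged_spot(df), or add_lagged_spot(df, lag_by = months(6))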
So I basically have a while-loop function that creates 1s in the "algorithm_column" based on the highest percentages in the "percent" column, until a certain total percentage is reached (90% or so). The remaining rows that are not taken into account get a value of 0 in the "algorithm_column" (Create while loop function that takes next largest value untill condition is met).
Based on what the loop function found, I want to get the min and max times of the "timeinterval" column (the min is where the 1s start and the max is the last row with a 1; the 0s are out of scope), and then finally build a time interval from this.
So given the data below, I want to create another column, let's say "total_time", holding the span from the min time 09:00 (where the 1s start in the algorithm_column) to 11:15, which gives a time interval of 02:15 hours in the "total_time" column.
algorithm
# pc4 timeinterval stops percent idgroup algorithm_column
#1 5464 08:45:00 1 1.3889 1 0
#2 5464 09:00:00 5 6.9444 2 1
#3 5464 09:15:00 8 11.1111 3 1
#4 5464 09:30:00 7 9.7222 4 1
#5 5464 09:45:00 5 6.9444 5 1
#6 5464 10:00:00 10 13.8889 6 1
#7 5464 10:15:00 6 8.3333 7 1
#8 5464 10:30:00 4 5.5556 8 1
#9 5464 10:45:00 7 9.7222 9 1
#10 5464 11:00:00 6 8.3333 10 1
#11 5464 11:15:00 5 6.9444 11 1
#12 5464 11:30:00 8 11.1111 12 0
I have multiple pc4 groups, so it should look at every group and calculate a total_time for each group.
I have this function, but I'm not sure it is what I need.
test <- function(x) {
  ind <- x[["algorithm_column"]] == 0
  Mx <- max(x[["timeinterval"]][ind], na.rm = TRUE)
  ind <- x[["algorithm_column"]] == 1
  Mn <- min(x[["timeinterval"]][ind], na.rm = TRUE)
  list(Mn, Mx)  ## or return(list(Mn, Mx))
}
test(algorithm)
Here is a dplyr solution.
library(dplyr)
algorithm %>%
  mutate(tmp = cumsum(c(0, diff(algorithm_column) != 0))) %>%
  filter(algorithm_column == 1) %>%
  group_by(pc4, tmp) %>%
  summarise(first = first(timeinterval),
            last = last(timeinterval)) %>%
  select(-tmp)
## A tibble: 1 x 3
## Groups: pc4 [1]
# pc4 first last
# <int> <fct> <fct>
#1 5464 09:00:00 11:15:00
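To turn those endpoints into the requested total_time, the times can be converted with as.difftime(); a sketch on the same data (the helper name to_hours is an assumption):
library(dplyr)

to_hours <- function(x) as.difftime(as.character(x), format = "%H:%M:%S", units = "hours")

algorithm %>%
  filter(algorithm_column == 1) %>%
  group_by(pc4) %>%
  summarise(total_time = to_hours(last(timeinterval)) - to_hours(first(timeinterval)))
which gives 2.25 hours (i.e. 02:15) for the example group.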
Data.
algorithm <- read.table(text = "
pc4 timeinterval stops percent idgroup algorithm_column
1 5464 08:45:00 1 1.3889 1 0
2 5464 09:00:00 5 6.9444 2 1
3 5464 09:15:00 8 11.1111 3 1
4 5464 09:30:00 7 9.7222 4 1
5 5464 09:45:00 5 6.9444 5 1
6 5464 10:00:00 10 13.8889 6 1
7 5464 10:15:00 6 8.3333 7 1
8 5464 10:30:00 4 5.5556 8 1
9 5464 10:45:00 7 9.7222 9 1
10 5464 11:00:00 6 8.3333 10 1
11 5464 11:15:00 5 6.9444 11 1
12 5464 11:30:00 8 11.1111 12 0
", header = TRUE)