I have a dataset that contains start and end time stamps, as well as a performance percentage. I'd like to calculate group statistics over hourly blocks, e.g. "the average performance for the midnight hour was x%."
My question is whether there is a more efficient way to do this than a series of ifelse() statements.
# some sample data
pre.starting <- data.frame(
  starting = format(seq.POSIXt(from = as.POSIXct(Sys.Date()),
                               to = as.POSIXct(Sys.Date() + 1),
                               by = "5 min"),
                    "%H:%M", tz = "GMT")
)
pre.ending <- data.frame(ending = pre.starting[seq(1, nrow(pre.starting), 2), ])
ending2 <- pre.ending[-c(1), ]
starting2 <- data.frame(
  pre.starting = pre.starting[!(pre.starting$starting %in% pre.ending$ending), ]
)
dataset <- data.frame(starting = starting2,
                      ending = ending2,
                      perct = rnorm(nrow(starting2), 0.5, 0.2))
For example, I could create hour blocks with code along the lines of the following:
library(dplyr)

dataset2 <- dataset %>%
  # HH:MM strings compare correctly in lexicographic order;
  # note the column is named pre.starting (see the construction above)
  mutate(hour = ifelse(pre.starting >= "00:00" & ending < "01:00", 12,
                ifelse(pre.starting >= "01:00" & ending < "02:00", 1,
                ifelse(pre.starting >= "02:00" & ending < "03:00", 2,
                       NA)))) %>%   # and so on for the remaining hours
  group_by(hour) %>%
  summarise(mean.perct = mean(perct, na.rm = TRUE))
Is there a way to make this code more efficient, or improve beyond ifelse()?
We can use cut to bin the ending timestamps into hourly intervals after converting them to POSIXct, and then take the mean for each hour.
library(dplyr)
dataset %>%
mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
group_by(ending_hour = cut(ending, breaks = "1 hour")) %>%
summarise(mean.perct = mean(perct, na.rm = TRUE))
# ending_hour mean.perct
# <fct> <dbl>
# 1 2019-09-30 00:00:00 0.540
# 2 2019-09-30 01:00:00 0.450
# 3 2019-09-30 02:00:00 0.612
# 4 2019-09-30 03:00:00 0.470
# 5 2019-09-30 04:00:00 0.564
# 6 2019-09-30 05:00:00 0.437
# 7 2019-09-30 06:00:00 0.413
# 8 2019-09-30 07:00:00 0.397
# 9 2019-09-30 08:00:00 0.492
#10 2019-09-30 09:00:00 0.613
# … with 14 more rows
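Since the question phrases the goal as "the average performance for the midnight hour", a hedged variant of the same pipeline groups on the hour-of-day label instead, so the same clock hour from different days lands in one bin:

# a sketch: bin by hour of day rather than by calendar hour block
dataset %>%
  mutate_at(vars(pre.starting, ending), as.POSIXct, format = "%H:%M") %>%
  group_by(hour = format(ending, "%H")) %>%
  summarise(mean.perct = mean(perct, na.rm = TRUE))

With a single day of sample data the result matches the cut() version; over several days it pools each clock hour across days.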
I want to filter my time series based on a variable time interval. More specifically, consider the time of day t_i of a timestamp t. I want to filter my time series such that what remains contains only timestamps whose time of day runs from t_i - 15 min up to and including t_i + 15 min.
Here's what I tried:
library(lubridate)
library(dplyr)
mv <- 2 # moving window
t <- as.POSIXct("2020-06-20 12:00", tz="UTC") # time stamp
time <- seq(ymd_hm('2020-01-01 00:00'),ymd_hm('2020-12-31 23:45'), by = '15 mins')
ts <- tibble(time=time, data=sin(seq(1,length(time),1)))
# What I did:
ts %>%
filter(time >= t - mv*24*60*60) %>%
filter(time <= t) %>%
filter(strftime(time, format = "%H:%M", tz = "UTC") >= strftime(t-15*60, format = "%H:%M", tz = "UTC")) %>%
filter(strftime(time, format = "%H:%M", tz = "UTC") <= strftime(t+15*60, format = "%H:%M", tz = "UTC"))
Output:
# A tibble: 7 x 2
time data
<dttm> <dbl>
1 2020-06-18 12:00:00 -0.435
2 2020-06-18 12:15:00 0.523
3 2020-06-19 11:45:00 0.298
4 2020-06-19 12:00:00 0.964
5 2020-06-19 12:15:00 0.744
6 2020-06-20 11:45:00 0.885
7 2020-06-20 12:00:00 0.0870
This is exactly what I want but it breaks down when t <- as.POSIXct("2020-06-20 23:45", tz="UTC") (also with 00:00):
# A tibble: 0 x 2
# … with 2 variables: time <dttm>, data <dbl>
I included an if-else statement to circumvent this but it is far from elegant and doesn't give me exactly what I want:
t <- as.POSIXct("2020-06-20 23:45", tz="UTC") # time stamp
if(strftime(t, format = "%H:%M", tz = "UTC") %in% c("23:45","00:00")){
ts %>%
filter(time >= t - mv*24*60*60) %>%
filter(time <= t) %>%
filter(strftime(time, format = "%H:%M", tz = "UTC") >= strftime(t-15*60, format = "%H:%M", tz = "UTC"))
} else {
ts %>%
filter(time >= t - mv*24*60*60) %>%
filter(time <= t) %>%
filter(strftime(time, format = "%H:%M", tz = "UTC") >= strftime(t-15*60, format = "%H:%M", tz = "UTC")) %>%
filter(strftime(time, format = "%H:%M", tz = "UTC") <= strftime(t+15*60, format = "%H:%M", tz = "UTC"))
}
Output:
# A tibble: 5 x 2
time data
<dttm> <dbl>
1 2020-06-18 23:45:00 0.543
2 2020-06-19 23:30:00 -0.177
3 2020-06-19 23:45:00 -0.924
4 2020-06-20 23:30:00 -0.936
5 2020-06-20 23:45:00 -0.209
Desired output:
# A tibble: 7 x 2
time data
<dttm> <dbl>
1 2020-06-18 23:45:00 0.543
2 2020-06-19 00:00:00 -0.413
3 2020-06-19 23:30:00 -0.177
4 2020-06-19 23:45:00 -0.924
5 2020-06-20 00:00:00 -0.821
6 2020-06-20 23:30:00 -0.936
7 2020-06-20 23:45:00 -0.209
There seems to be an issue with the shift between days but I'm not sure how to solve it and I haven't been able to find similar questions. Is there a way to achieve this (elegantly)?
It appears that strftime(ts$time[1], format = "%H:%M", tz = "UTC") > strftime(t, format = "%H:%M", tz = "UTC") is evaluated to FALSE, which makes sense depending on how you look at it.
To mitigate this you'd need the full YYYY-MM-DD HH:MM so that the comparison is evaluated 'correctly', which will be the case if you compare the full string instead of only the hours.
We can get the window by adding a dummy variable, time_, that holds the HH:MM part as a string, and then matching against it:
library(stringr)  # needed for str_detect

# troublesome timestamp
t <- ymd_hm("2020-06-20 23:45", tz = "UTC")

ts %>%
  filter(between(time,
                 left  = t - mv*24*60*60 - 15*60,
                 right = t)) %>%
  mutate(time_ = strftime(time, format = "%H:%M", tz = "UTC")) %>%
  filter(str_detect(
    time_,
    pattern = seq(t - 15*60, t + 15*60, by = "15 mins") %>%
      strftime(format = "%H:%M", tz = "UTC") %>%
      paste(collapse = "|")
  ))
Which gives the output,
# A tibble: 8 x 3
time data time_
<dttm> <dbl> <chr>
1 2020-06-18 23:30:00 1.00 23:30
2 2020-06-18 23:45:00 0.543 23:45
3 2020-06-19 00:00:00 -0.413 00:00
4 2020-06-19 23:30:00 -0.177 23:30
5 2020-06-19 23:45:00 -0.924 23:45
6 2020-06-20 00:00:00 -0.821 00:00
7 2020-06-20 23:30:00 -0.936 23:30
8 2020-06-20 23:45:00 -0.209 23:45
Note that this also keeps 2020-06-18 23:30 (eight rows rather than the seven desired) because the between() window is widened by 15 minutes on the left.
Alternatively, working with the time of day in seconds since midnight:
ts %>%
filter(between(time, t - days(mv), t)) %>%
mutate(aux = as.numeric(time) %% (60 * 60 * 24)) %>%
filter(between(aux,
(as.numeric(t) %% (60 * 60 * 24) - 900),
(as.numeric(t) %% (60 * 60 * 24) + 900)) |
aux == 0) %>%
select(-aux)
gives
# # A tibble: 7 x 2
# time data
# <dttm> <dbl>
# 1 2020-06-18 23:45:00 0.543
# 2 2020-06-19 00:00:00 -0.413
# 3 2020-06-19 23:30:00 -0.177
# 4 2020-06-19 23:45:00 -0.924
# 5 2020-06-20 00:00:00 -0.821
# 6 2020-06-20 23:30:00 -0.936
# 7 2020-06-20 23:45:00 -0.209
It's probably very particular to this specific task and a bit hard to read. The interval reflects a duration (a fixed number of seconds).
For similar cases where the date rolls over, you need to change the offsets and adjust the values by 86400. This version doesn't work if t is at midnight, nor if the offset differs from 15 minutes.
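A hedged sketch that sidesteps both limitations: measure the circular (wrap-around-midnight) distance, in seconds, between each row's time of day and t's, and keep rows within 900 seconds. Assuming the same ts, t, and mv as above:

ts %>%
  filter(between(time, t - days(mv), t)) %>%
  mutate(d = (as.numeric(time) - as.numeric(t)) %% 86400,
         d = pmin(d, 86400 - d)) %>%   # circular distance handles the midnight wrap
  filter(d <= 15 * 60) %>%
  select(-d)

For the 23:45 example this gives the seven desired rows, and neither the midnight case nor the 15-minute offset is hard-wired.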
If you have just 2 days, this would also be an approach (using periods instead of durations):
ts %>%
filter(between(time, t - days(mv), t)) %>%
filter(between(time, t - minutes(15), t + minutes(15)) |
between(time, t - days(1) - minutes(15), t - days(1) + minutes(15)) |
between(time, t - days(2) - minutes(15), t - days(2) + minutes(15)))
which gives the same result in this case.
If you want to adjust the margins, you need to change the values.
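For more than 2 days, a hedged generalization of the same idea builds the per-day windows programmatically (purrr is an assumption here; it is not loaded above):

library(purrr)

centers <- t - days(0:mv)
ts %>%
  filter(between(time, t - days(mv), t)) %>%
  filter(map(centers, ~ between(time, .x - minutes(15), .x + minutes(15))) %>%
           reduce(`|`))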
By the way: you should NOT use t as the name for an object in R, because it's already the name of a function (matrix transpose).
HTH
I am trying to fit linear models to a time series where the regression begins at midnight each day and uses all data until 0600 the following morning (covering a total of 30 hrs). I want to do this for every day in the time series, and it also needs to be applied by a grouping factor.
What I ultimately need is the regression coefficients added to the data frame for the day where the regression started. I am familiar with rolling and window regressions and how to apply functions across groups using dplyr. Where I am struggling is how to code that the regression needs to start at midnight each day. If I were to use a window function, after the first day it would be shifted ahead six hours from midnight, and I am not sure how to shift the window back to midnight. It seems like I need to specify a window size and a lag/lead at each iteration, but I can't visualize how to implement that. Any insight is appreciated.
Here is some sample data. I would like to model dv ~ datetime, by = grp:
df <- dplyr::arrange(
  data.frame(
    datetime = seq(as.POSIXct("2020-09-19 00:00:00"),
                   as.POSIXct("2020-09-30 00:00:00"), "hour"),
    grp = rep(c('a', 'b', 'c'), 265),
    dv = rnorm(795)
  ),
  grp, datetime
)
We assume that we want each regression to cover 30 rows (except for any stub at the end) and that we should move forward by 24 hours for each regression so that there is one regression per date within grp.
library(dplyr)
library(zoo)

ans <- df %>%
  group_by(grp) %>%
  group_modify(~ {
    # left-aligned 30-row windows advancing 24 rows (one day) at a time,
    # so every window starts at midnight; partial = TRUE keeps the short
    # windows near the end of the series
    r <- rollapplyr(1:nrow(.), 30, by = 24,
                    function(ix) coef(lm(dv ~ datetime, ., subset = ix)),
                    align = "left", partial = TRUE)
    data.frame(date = head(unique(as.Date(.$datetime)), nrow(r)),
               coef1 = r[, 1], coef2 = r[, 2])
  }) %>%
  ungroup
giving:
> ans
# A tibble: 36 x 4
grp date coef1 coef2
<chr> <date> <dbl> <dbl>
1 a 2020-09-19 -7698. 0.00000481
2 a 2020-09-20 -2048. 0.00000128
3 a 2020-09-21 -82.0 0.0000000514
4 a 2020-09-22 963. -0.000000602
5 a 2020-09-23 2323. -0.00000145
6 a 2020-09-24 5886. -0.00000368
7 a 2020-09-25 7212. -0.00000450
8 a 2020-09-26 -17448. 0.0000109
9 a 2020-09-27 1704. -0.00000106
10 a 2020-09-28 15731. -0.00000982
# ... with 26 more rows
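As an aside, coef2 is so small because lm sees datetime as seconds since the epoch. If a per-hour slope reads more naturally, one hedged tweak is to rescale the regressor inside the same pipeline:

ans_hourly <- df %>%
  group_by(grp) %>%
  group_modify(~ {
    # identical windows; I(as.numeric(datetime) / 3600) makes coef2 "dv per hour"
    r <- rollapplyr(1:nrow(.), 30, by = 24,
                    function(ix) coef(lm(dv ~ I(as.numeric(datetime) / 3600),
                                         ., subset = ix)),
                    align = "left", partial = TRUE)
    data.frame(date = head(unique(as.Date(.$datetime)), nrow(r)),
               coef1 = r[, 1], coef2 = r[, 2])
  }) %>%
  ungroup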
Old answer
After re-reading the question I replaced this with the above.
Within each group, create g, which groups the rows since the last 6 am, and let width be the number of rows since the most recent 6 am row. Then run rollapplyr using the width vector to define the widths to regress over.
library(dplyr)
library(zoo)
ans <- df %>%
group_by(grp) %>%
group_modify(~ {
g <- cumsum(format(.$datetime, "%H") == "06") # increments at each 6 am row
width <- 1:nrow(.) - match(g, g) + 1 # rows since the most recent 6 am row
r <- rollapplyr(1:nrow(.), width,
function(ix) coef(lm(dv ~ datetime, ., subset = ix)),
partial = TRUE, fill = NA)
mutate(., coef1 = r[, 1], coef2 = r[, 2])
}) %>%
ungroup
giving:
> ans
# A tibble: 795 x 5
grp datetime dv coef1 coef2
<chr> <dttm> <dbl> <dbl> <dbl>
1 a 2020-09-19 00:00:00 -0.560 -0.560 NA
2 a 2020-09-19 01:00:00 -0.506 -24071. 0.0000150
3 a 2020-09-19 02:00:00 -1.76 265870. -0.000166
4 a 2020-09-19 03:00:00 0.0705 -28577. 0.0000179
5 a 2020-09-19 04:00:00 1.95 -248499. 0.000155
6 a 2020-09-19 05:00:00 0.845 -205918. 0.000129
7 a 2020-09-19 06:00:00 0.461 0.461 NA
8 a 2020-09-19 07:00:00 0.359 45375. -0.0000284
9 a 2020-09-19 08:00:00 -1.40 412619. -0.000258
10 a 2020-09-19 09:00:00 -0.446 198902. -0.000124
# ... with 785 more rows
Note
Input used
set.seed(123)
df <- dplyr::arrange(
  data.frame(
    datetime = seq(as.POSIXct("2020-09-19 00:00:00"),
                   as.POSIXct("2020-09-30 00:00:00"), "hour"),
    grp = rep(c('a', 'b', 'c'), 265),
    dv = rnorm(795)
  ),
  grp, datetime
)
I am dealing with a huge dataset (years of 1-minute-interval observations of energy usage). I want to convert it from a 1-minute interval to a 15-minute interval.
I have written a for loop which does this successfully (tested on a small subset of the data); however, when I tried running it on the main data, it was executing very slowly: it would have taken over 175 hours to run the full loop (I stopped it mid-execution).
The data to be converted to the 15-minute interval is the kWh usage; converting it simply requires taking the average of the first 15 observations, then the next 15, and so on. This is the loop that works:
# Opening the file
data <- read.csv("1.csv",colClasses="character",na.strings="?")
# Adding an index to each row
total <- nrow(data)
data$obsnum <- seq.int(nrow(data))
# Calculating 15 min kwH usage
data$use_15_min <- data$use
for (i in 1:total) {
  int_used <- floor((i - 1) / 15)
  obsNum <- 15 * int_used
  sum <- 0
  for (j in 1:15) {
    usedIndex <- as.numeric(obsNum + j)
    sum <- as.numeric(data$use[usedIndex]) + sum
  }
  data$use_15_min[i] <- sum / 15
}
I have been searching for a function that can do the same, but without using loops, as I imagine this should save much time. Yet, I haven't been able to find one. How is it possible to achieve the same functionality without using a loop?
Try data.table:
library(data.table)
DT <- data.table(data)
n <- nrow(DT)
DT[, use_15_min := mean(use), by = gl(n, 15, n)]
Note
The question is missing the input data so we used this:
data <- data.frame(use = 1:100)
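For reference, gl(n, 15, n) builds the grouping factor: each level repeats 15 times, truncated to n values, so rows fall into consecutive blocks of 15. On a small scale:

as.integer(gl(10, 3, 10))
# [1] 1 1 1 2 2 2 3 3 3 4

The same factor also drives a base R one-liner equivalent to the asker's loop (ave broadcasts each block mean back to every row of the block; as.numeric is needed because the file was read with colClasses = "character"):

data$use_15_min <- ave(as.numeric(data$use), gl(nrow(data), 15, nrow(data)))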
A potential solution is to calculate the running mean (e.g. using TTR::runMean) and then select every 15th observation. Here is an example:
df = data.frame(x = 1:100, y = runif(100))
df['runmean'] = TTR::runMean(df$y, n = 15)
# runMean is a trailing mean: row 15 holds the mean of rows 1-15, row 30 of rows 16-30, ...
df_15 = df[seq(15, nrow(df), 15), ]
I cannot test it, as I do not have your data, but perhaps:
data$use_15_min <- TTR::runMean(as.numeric(data$use), n = 15)  # use was read in as character
data_15_min <- data[seq(15, nrow(data), 15), ]
I would use lubridate::floor_date to create the 15-minute groupings.
library(tidyverse)
library(lubridate)
df <- tibble(
date = seq(ymd_hm("2019-01-01 00:00"), by = "min", length.out = 60 * 24 * 7),
value = rnorm(n = 60 * 24 * 7)
)
df
#> # A tibble: 10,080 x 2
#> date value
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 0.182
#> 2 2019-01-01 00:01:00 0.616
#> 3 2019-01-01 00:02:00 -0.252
#> 4 2019-01-01 00:03:00 0.0726
#> 5 2019-01-01 00:04:00 -0.917
#> 6 2019-01-01 00:05:00 -1.78
#> 7 2019-01-01 00:06:00 -1.49
#> 8 2019-01-01 00:07:00 -0.818
#> 9 2019-01-01 00:08:00 0.275
#> 10 2019-01-01 00:09:00 1.26
#> # ... with 10,070 more rows
df %>%
mutate(
nearest_15_mins = floor_date(date, "15 mins")
) %>%
group_by(nearest_15_mins) %>%
summarise(
avg_value_at_15_mins_int = mean(value)
)
#> # A tibble: 672 x 2
#> nearest_15_mins avg_value_at_15_mins_int
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 -0.272
#> 2 2019-01-01 00:15:00 -0.129
#> 3 2019-01-01 00:30:00 0.173
#> 4 2019-01-01 00:45:00 -0.186
#> 5 2019-01-01 01:00:00 -0.188
#> 6 2019-01-01 01:15:00 0.104
#> 7 2019-01-01 01:30:00 -0.310
#> 8 2019-01-01 01:45:00 -0.173
#> 9 2019-01-01 02:00:00 0.0137
#> 10 2019-01-01 02:15:00 0.419
#> # ... with 662 more rows
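Mapped onto the asker's frame, the same idea might look like this; the timestamp column name (stamp) and its format are assumptions, since the question never shows them:

# a sketch; stamp is a hypothetical character timestamp column in data
data %>%
  mutate(
    stamp = ymd_hms(stamp),   # hypothetical column name and format
    use = as.numeric(use)     # read.csv() imported everything as character
  ) %>%
  group_by(interval_15 = floor_date(stamp, "15 mins")) %>%
  summarise(use_15_min = mean(use, na.rm = TRUE))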
I have a per-minute time series covering a number of years.
I need to compute the following value for each minute data point:
q <- (Fn-Fd)/Fn
where Fn is the average F value at night time, between 12 and 1 AM, and Fd is the minute data point itself.
Now obviously Fn changes each day, so one approach would be to calculate Fn, perhaps using a dplyr function, and I would need to create a loop of some kind or reorganise my data frame...
dummy data:
#string of dates for a one month
datetime <- seq(
from=as.POSIXct("2012-1-1 0:00:00", tz="UTC"),
to=as.POSIXct("2012-2-1 0:00:00", tz="UTC"),
by="min"
)
#variable F
F <- runif(44641, min = 0, max =2)
#dataframe
df <- as.data.frame(cbind(datetime,F))
library(lubridate)
#make sure its in "POSIXct" "POSIXt" format
df$datetime <- as_datetime(df$datetime)
Or a less elegant way might be to get Fn on its own, between those times, using dplyr first. I think it will be something like this:
Fn <- df %>%
filter(between(as.numeric(format(datetime, "%H")), 0, 1)) %>%
group_by(hour=format(datetime, "%Y-%m-%d %H:")) %>%
summarise(value=mean(df$F))
But I am not sure my syntax is correct there? Am I calculating the mean F between 12 and 1 AM per day?
Then I could just write the average Fn value for each day back to my data frame per minute and do the simple calculation to get q.
Thanks in advance for advice here.
Maybe something like this ?
library(dplyr)
library(lubridate)
df %>%
group_by(Date = as.Date(datetime)) %>%
mutate(F_mean = mean(F[hour(datetime) == 0]),
value = (F_mean - F)/F_mean) %>%
ungroup() %>%
select(-F_mean, -Date)
# datetime F value
# <dttm> <dbl> <dbl>
# 1 2012-01-01 00:00:00 1.97 -0.902
# 2 2012-01-01 00:01:00 0.194 0.813
# 3 2012-01-01 00:02:00 1.52 -0.467
# 4 2012-01-01 00:03:00 1.66 -0.599
# 5 2012-01-01 00:04:00 0.765 0.262
# 6 2012-01-01 00:05:00 1.31 -0.267
# 7 2012-01-01 00:06:00 1.62 -0.565
# 8 2012-01-01 00:07:00 0.642 0.380
# 9 2012-01-01 00:08:00 1.62 -0.560
#10 2012-01-01 00:09:00 1.68 -0.621
# ... with 44,631 more rows
We first group_by each date, get the mean value for the 0th hour (values between 00:00 and 00:59) of each day, and calculate value using the formula given.
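If you prefer the asker's own two-step plan (compute Fn per day, then attach it back), a sketch of the corrected version follows. Note that mean(df$F) in the attempt above averages the entire column; plain mean(F) inside summarise averages only the grouped rows:

Fn_daily <- df %>%
  group_by(Date = as.Date(datetime)) %>%
  summarise(Fn = mean(F[hour(datetime) == 0]))   # nightly mean, 00:00-00:59

df %>%
  mutate(Date = as.Date(datetime)) %>%
  left_join(Fn_daily, by = "Date") %>%
  mutate(q = (Fn - F) / Fn) %>%
  select(-Date, -Fn)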
I am working with a multi-year dataset that has columns for date (%Y-%m-%d) and daily values for several variables.
In R, how do I subset the data by a date range (i.e., June 29 +/- 5 days) but capture the data from all years?
DATE A B C
1996-06-10 12:00:00 178.0 24.1 1.7
1996-06-11 12:00:00 184.1 30.2 1.1
1996-06-12 12:00:00 187.2 29.4 1.8
1996-06-13 12:00:00 194.4 35.0 5.3
1996-06-14 12:00:00 200.3 35.9 1.5
1996-06-15 12:00:00 138.9 15.1 0.0
...
1) Base R
Let yrs be all unique years in the data and targets be each of those years with the target's month and day. Then create dates, which contains all dates within delta days of any value in targets. Note that sapply strips dates of their "Date" class, but that does not matter since dates is only used subsequently in %in%, and that ignores the class. Finally, subset DF down to those rows whose DATE is in dates. No packages are used.
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
DATE <- as.Date(DF$DATE)
yrs <- unique(format(DATE, "%Y"))
targets <- as.Date(paste(yrs, target, sep = "-"))
dates <- c(sapply(targets, "+", seq(-delta, delta)))
DF[DATE %in% dates, ]
giving:
DATE A B C
5 1996-06-14 12:00:00 200.3 35.9 1.5
6 1996-06-15 12:00:00 138.9 15.1 0.0
2) sqldf
Alternatively, this can be done using a single SQL statement. Note that we assume the DATE column is character, since the question referred to it being in a particular format. Using the same inputs, the inner select generates a target date for each year, and the outer select joins DF to those rows within delta days of any target date. We use the H2 database backend here since it has better date support than SQLite.
library(sqldf)
library(RH2)
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
fn$sqldf("select DF.* from DF
join (select distinct cast(substr(DATE, 1, 4) || '-' || '$target' as DATE) as target
from DF)
on cast(substr(DATE, 1, 10) as DATE) between target - $delta and target + $delta")
giving:
DATE A B C
1 1996-06-14 12:00:00 200.3 35.9 1.5
2 1996-06-15 12:00:00 138.9 15.1 0.0
We could simplify the SQL somewhat if DATE is of R's "Date" class. That is, replace the sqldf statement above with:
DF2 <- transform(DF, DATE = as.Date(DATE))
fn$sqldf("select DF2.* from DF2
join (select distinct cast(year(DATE) || '-' || '$target' as DATE) as target from DF2)
on DATE between target - $delta and target + $delta")
giving:
DATE A B C
1 1996-06-14 200.3 35.9 1.5
2 1996-06-15 138.9 15.1 0.0
Note
The input DF is assumed to be:
DF <- structure(list(DATE = c("1996-06-10 12:00:00", "1996-06-11 12:00:00",
"1996-06-12 12:00:00", "1996-06-13 12:00:00", "1996-06-14 12:00:00",
"1996-06-15 12:00:00"), A = c(178, 184.1, 187.2, 194.4, 200.3,
138.9), B = c(24.1, 30.2, 29.4, 35, 35.9, 15.1), C = c(1.7, 1.1,
1.8, 5.3, 1.5, 0)), .Names = c("DATE", "A", "B", "C"), row.names = c(NA,
-6L), class = "data.frame")
A base R attempt.
Stealing the example data from the other answer by Kevin:
df <- data.frame(
my_date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
Set your variables for the selection:
month_num <- 6
day_num <- 29
bound <- 5
Find the key dates in your range of years:
keydates <- as.Date(sprintf(
"%d-%02d-%02d",
do.call(seq, as.list(as.numeric(range(format(df$my_date, "%Y"))))),
month_num,
day_num
))
Make a selection:
out <- df[df$my_date %in% outer(keydates, -bound:bound, `+`),]
Check that it worked:
table(format(out$my_date, "%m-%d"))
#06-24 06-25 06-26 06-27 06-28 06-29 06-30 07-01 07-02 07-03 07-04
# 10 10 10 10 10 10 10 10 10 10 10
One valid value for each day/month for each year 1990 to 1999, centred on "06-29" with a range of 5 days either side.
You can use lubridate intervals to provide valid date ranges and then use a purrr map to run each interval over your data to filter.
library(dplyr)
library(lubridate)
library(magrittr) # only because I've used the "exposition" (%$%) pipe
library(purrr)
df <- tibble(
my_date = as.POSIXct(
seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
tz = "UTC"
),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
month_num <- 6
day_num <- 29
bound <- 5
date_span <- df %>%
select(my_date) %>%
filter(month(my_date) == month_num & day(my_date) == day_num) %>%
mutate(
start = my_date - days(bound),
end = my_date + days(bound)
) %$%
interval(start, end, tzone = "UTC")
map_dfr(date_span, ~filter(df, my_date %within% .x))
# # A tibble: 110 x 4
# my_date x y z
# <dttm> <dbl> <dbl> <dbl>
# 1 1990-06-24 10:00:00 0.404 1.33 1.58
# 2 1990-06-25 10:00:00 0.351 -1.73 0.665
# 3 1990-06-26 10:00:00 -0.512 1.01 1.72
# 4 1990-06-27 10:00:00 1.55 0.417 -0.126
# 5 1990-06-28 10:00:00 1.86 1.18 0.322
# 6 1990-06-29 10:00:00 -0.0193 -0.105 0.356
# 7 1990-06-30 10:00:00 0.844 -0.712 1.51
# 8 1990-07-01 10:00:00 -0.431 0.451 -2.19
# 9 1990-07-02 10:00:00 1.74 -0.0650 -0.866
# 10 1990-07-03 10:00:00 0.965 -0.506 -0.0690
# # ... with 100 more rows
You could also go via the Julian day, which allows you to do basic arithmetic operations (e.g. ± 5 days) without the need to convert back and forth between Date and character objects. Keep in mind that your target date translates into a different Julian day during leap years, so you'll need to extract this piece of information somehow (use lubridate::leap_year if you don't like the base R approach below):
## dat is assumed to be DF from the Note above
## convert dates to julian day (as integer, so the comparisons below are numeric)
dat$JULDAY <- as.integer(format(as.Date(dat$DATE), "%j"))

## target date (here 19 June) as julian day
dat$TARGET <- ifelse(
  as.integer(format(as.Date(dat$DATE), "%y")) %% 4 == 0,
  171, # leap year
  170  # common year
)

## create subset
subset(
  dat,
  JULDAY >= (TARGET - 5) & JULDAY <= (TARGET + 5),
  select = c("DATE", "A", "B", "C")
)
# DATE A B C
# 5 1996-06-14 12:00:00 200.3 35.9 1.5
# 6 1996-06-15 12:00:00 138.9 15.1 0.0
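For completeness, the lubridate::leap_year route mentioned above would be a sketch like:

library(lubridate)
dat$TARGET <- ifelse(leap_year(as.Date(dat$DATE)), 171, 170) # same 171/170 julian days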