I am having an issue with the length of time it takes to run a double for loop with an if statement in R.
One data set has about 3,000,000 rows (DF1) and the other has about 22 (DF2). Examples of the two data frames are given below.
DF1
DateTime REG
2018-07-01 12:00:00 NHDG
2018-07-12 11:55:23 NSKR
DF2
StartDateTime EndDateTime Direction
2018-07-01 07:55:11 2018-07-01 12:01:56 W
2018-07-12 11:00:23 2018-07-12 11:45:00 E
I want to flag anything in DF1 when the DateTime is between StartDateTime and EndDateTime. Hence the output will be as follows:
DF1
DateTime REG Flag
2018-07-01 12:00:00 NHDG 1
2018-07-12 11:55:23 NSKR 0
The code I am currently using is:
#Flag if in delay or not
DF1$Flag <- 0
for (i in 1:nrow(DF1)) {
  for (j in 1:nrow(DF2)) {
    if ((DF1$DateTime[i] >= DF2$StartDateTime[j]) & (DF1$DateTime <= DF2$EndDateTime[j])) {
      DF1$Flag[i] <- 1
    } else {
      DF1$Flag[i] <- DF1$Flag
    }
  }
}
I am more than happy for this code to be taken out of the for loops if possible.
If I understand properly, the value of Flag in DF1 should be set to 1 if the DateTime falls within any interval from DF2, right?
Then, the following base code would do the job:
DF1$Flag = sapply(DF1$DateTime,
function(x) as.integer(sum(x >= DF2$StartDateTime &
x <= DF2$EndDateTime) > 0))
# DateTime REG Flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
The idea is to vectorize the comparison: for each DateTime in DF1 (sapply handles the "looping"), you compare the value against all intervals (StartDateTime and EndDateTime) from DF2 and sum the results. If the sum is greater than 0, there is at least one row in DF2 whose start and end dates bracket that DateTime from DF1. as.integer then converts the boolean output of sum(...) > 0 to 1 or 0.
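The same comparison can also be done fully vectorized with outer(); a sketch, assuming DF2 is small enough that the nrow(DF1) x nrow(DF2) logical matrix fits in memory:
# Build one logical matrix of all DateTime/interval comparisons at once,
# then flag the rows of DF1 that match at least one interval of DF2.
hits <- outer(DF1$DateTime, DF2$StartDateTime, ">=") &
        outer(DF1$DateTime, DF2$EndDateTime, "<=")
DF1$Flag <- as.integer(rowSums(hits) > 0)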
And if you want a faster solution, here is one using dplyr:
library(dplyr)
df1 = full_join(mutate(DF1, foo=1), mutate(DF2, foo=1), by='foo') %>%
  mutate(Flag = as.integer(DateTime >= StartDateTime & DateTime <= EndDateTime)) %>%
  group_by(DateTime) %>% slice(which.max(Flag)) %>%
  select(DateTime, REG, Flag)
Otherwise:
There seems to be a problem with your second loop, over the rows of DF2 (the j loop): for each row of DF1, you compare the date to the start and end dates of every row of DF2 in turn, overwriting the resulting Flag value each time, so only the comparison with the very last row of DF2 survives.
Note also that DF1$DateTime in the condition is missing its [i] index, and DF1$Flag[i] <- DF1$Flag assigns a whole vector to a single element.
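For reference, a minimal corrected version of the loop itself (restoring the [i] indices and breaking at the first match; the same fix appears in the benchmark script further down) would be:
DF1$Flag <- 0
for (i in 1:nrow(DF1)) {
  for (j in 1:nrow(DF2)) {
    if (DF1$DateTime[i] >= DF2$StartDateTime[j] &
        DF1$DateTime[i] <= DF2$EndDateTime[j]) {
      DF1$Flag[i] <- 1
      break # no need to check the remaining intervals
    }
  }
}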
So if you just want to compare against the overall min and max date range from DF2, you can simply do:
DF1$Flag = as.integer((DF1$DateTime >= min(DF2$StartDateTime)) & (DF1$DateTime <= max(DF2$EndDateTime)))
Note this flags anything inside the overall span, including gaps between intervals: for example, a timestamp on 2018-07-05 would be flagged 1 here but 0 by the per-interval methods above.
What about this?
library(data.table)
DF1$flag <- as.numeric(sapply(seq(nrow(DF1)), function(x)
  DF1[x, "DateTime"] %between% c(min(DF2$StartDateTime), max(DF2$EndDateTime))))
# DateTime REG flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
Data
> dput(DF1)
structure(list(DateTime = structure(1:2, .Label = c("2018-07-01 12:00:00",
"2018-07-12 11:55:23"), class = "factor"), REG = structure(1:2, .Label = c("NHDG",
"NSKR"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
> dput(DF2)
structure(list(StartDateTime = structure(1:2, .Label = c("2018-07-01 07:55:11",
"2018-07-12 11:00:23"), class = "factor"), EndDateTime = structure(1:2, .Label = c("2018-07-01 12:01:56",
"2018-07-12 11:45:00"), class = "factor"), Direction = structure(2:1, .Label = c("E",
"W"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
DF1$DateTime <- as.POSIXct(DF1$DateTime)
DF2$StartDateTime <- as.POSIXct(DF2$StartDateTime)
DF2$EndDateTime <- as.POSIXct(DF2$EndDateTime)
Could also go for foverlaps:
library(data.table)
setDT(DF1)[, DateTime := as.POSIXct(DateTime)][, EndDateTime := DateTime]
setDT(DF2)[, `:=` (StartDateTime = as.POSIXct(StartDateTime),
                   EndDateTime = as.POSIXct(EndDateTime))]
setkey(DF1, DateTime, EndDateTime)
setkey(DF2, StartDateTime, EndDateTime)
DF1[, Flag := foverlaps(DF1, DF2, type = "within", which = TRUE, mult = "first")][
is.na(Flag), Flag := 0][, EndDateTime := NULL]
This will check, for every date in DF1, whether it falls within any interval in DF2.
It'll also be fast, at least according to my tests. A benchmark against the sapply approach:
Unit: milliseconds
expr min lq mean median uq max neval
DT 4.752853 5.247319 18.38787 5.42855 6.950966 311.1944 25
sapply 9413.337014 10598.926908 11206.14866 10892.91751 11746.901293 13568.7995 25
This is on a dataset with 10 000 rows in DF1 and 12 in DF2.
I only ran it once on 300 000 / 22 rows, and this is what I get:
Unit: seconds
expr min lq mean median uq max neval
DT 11.60865 11.60865 11.60865 11.60865 11.60865 11.60865 1
sapply 674.05823 674.05823 674.05823 674.05823 674.05823 674.05823 1
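For what it's worth, a sketch of how such a benchmark could be set up with microbenchmark (an assumed harness, not the code actually used), run after the keys are set as above and before the temporary EndDateTime column is dropped:
library(microbenchmark)
microbenchmark(
  DT = foverlaps(DF1, DF2, type = "within", which = TRUE, mult = "first"),
  sapply = sapply(DF1$DateTime, function(x)
    as.integer(sum(x >= DF2$StartDateTime & x <= DF2$EndDateTime) > 0)),
  times = 25
)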
One faster way would be to use crossing() from tidyr to cross df1 and df2, set the flag per row in the new data frame, then use aggregate() to reduce the rows back down. This method assumes that there are no duplicate entries in df1; if there are, they will be combined.
> df1
DateTime REG
1 2018-07-01 12:00:00 NHDG
2 2018-07-12 11:55:23 NSKR
> df2
StartDateTime EndDateTime Direction
1 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> library(tidyr)
> # Create a DF with rows for each combination of df1 rows with df2 rows
> tmp <- crossing(df1, df2)
> tmp
DateTime REG StartDateTime EndDateTime Direction
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a new column for the flag
> tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
> tmp
DateTime REG StartDateTime EndDateTime Direction flag
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W TRUE
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W FALSE
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
> # Drop the unwanted columns
> tmp <- tmp[,c("DateTime", "REG", "flag")]
> tmp
DateTime REG flag
1 2018-07-01 12:00:00 NHDG TRUE
2 2018-07-01 12:00:00 NHDG FALSE
3 2018-07-12 11:55:23 NSKR FALSE
4 2018-07-12 11:55:23 NSKR FALSE
> # Sum all flags for a given df1 date and limit total to 1
> df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
> df1
DateTime REG flag
1 2018-07-01 12:00:00 NHDG 1
2 2018-07-12 11:55:23 NSKR 0
>
Running with many more dates and comparing against your original for loop and the sapply() method above:
Original for loop method: 6.282 sec elapsed
sapply() method: 1.65 sec elapsed
crossing() and aggregate(): 0.385 sec elapsed
The full script is here:
#!/usr/bin/env Rscript
library(tictoc)
library(tidyr)
# Setup: generate a lot of dates for performance comparison
beg <- as.POSIXct("2018-07-01 12:00:00")
end <- as.POSIXct("2100-12-01 12:00:00")
dates <- seq(beg, end, 60*60*24)
#df1 <- data.frame(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), c("NHDG","NSKR"))
df1 <- data.frame(dates, rep(c("NHDG","NSKR"), length(dates)/2))
df2 <- data.frame(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), c("W","E"))
colnames(df1) <- c("DateTime", "REG")
colnames(df2) <- c("StartDateTime","EndDateTime","Direction")
df1$DateTime <- as.POSIXct(df1$DateTime, tz = "America/Los_Angeles")
df2$StartDateTime <- as.POSIXct(df2$StartDateTime, tz = "America/Los_Angeles")
df2$EndDateTime <- as.POSIXct(df2$EndDateTime, tz = "America/Los_Angeles")
# Original (fixed)
tic(sprintf("%30s", "Original for loop method"))
for (i in 1:nrow(df1)){
df1$flag[i] <- 0
for (j in 1:nrow(df2)){
if ((df1$DateTime[i] >= df2$StartDateTime[j]) & (df1$DateTime[i] <= df2$EndDateTime[j])){
df1$flag[i]<-1
break
}
}
}
toc()
result1 <- df1
df1$flag <- NULL
# Sapply
tic(sprintf("%30s", "sapply() method"))
df1$flag = sapply(df1$DateTime,
function(x) as.integer(sum(x >= df2$StartDateTime &
x <= df2$EndDateTime) > 0))
toc()
result2 <- df1
df1$flag <- NULL
# Aggregate
tic(sprintf("%30s", "crossing() and aggregate()"))
# Create a DF with rows for each combination of df1 rows with df2 rows
tmp <- crossing(df1, df2)
# Create a new column for the flag
tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
# Drop the unwanted columns
tmp <- tmp[,c("DateTime", "REG", "flag")]
# Sum all flags for a given df1 date and limit total to 1
df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
# Sort the rows by date
df1 <- df1[order(df1$DateTime),]
# Reset the row names (for comparison below)
rownames(df1) <- NULL
toc()
result3 <- df1
# Prove that results are the same
if (!isTRUE(all.equal(result1, result2))) {
  print("MISMATCH")
  stop()
}
if (!isTRUE(all.equal(result1, result3))) {
  print("MISMATCH")
  stop()
}
print("PASS")
I am trying to extract the start and end time index for each label and store them separately.
EDIT
As suggested in the comments, I prepared an example dataset:
library(xts)
data <- rnorm(11)
dates1 <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:6*60
dates2 <- as.POSIXct("2019-03-19 08:30:00", tz = "CET") + 0:3*60
dates <- append(dates1, dates2)
R <- xts(x = data, order.by = dates)
colnames(R) <- "R"
R$Label[1:7] <- 1
R$Label[8:11] <- 2
Output:
R Label
2019-03-18 10:30:00 1.193363635 1
2019-03-18 10:31:00 -0.558021057 1
2019-03-18 10:32:00 0.670440862 1
2019-03-18 10:33:00 0.073794492 1
2019-03-18 10:34:00 -0.416108940 1
2019-03-18 10:35:00 -0.596981420 1
2019-03-18 10:36:00 0.002006772 1
2019-03-19 08:30:00 -1.245200719 2
2019-03-19 08:31:00 0.417944923 2
2019-03-19 08:32:00 1.699169683 2
2019-03-19 08:33:00 0.861448103 2
The class of R is xts, zoo.
Now I would like to store the start and end time index for labels 1 and 2 separately. I have many more data with more labels, so it needs to be automated. I would really appreciate it if you could help. Thank you
Using the data you have posted:
library(xts)
library(dplyr)
library(tibble)
set.seed(42)
data <- rnorm(11)
dates1 <- as.POSIXct("2019-03-18 10:30:00", tz = "CET") + 0:6*60
dates2 <- as.POSIXct("2019-03-19 08:30:00", tz = "CET") + 0:3*60
dates <- append(dates1, dates2)
R <- xts(x = data, order.by = dates)
colnames(R) <- "R"
R$Label <- 1 # note I have removed the indexing here
R$Label[8:11] <- 2
R %>%
as.data.frame() %>%
rownames_to_column() %>%
group_by(Label) %>%
summarise(min = min(rowname), max = max(rowname) )
# A tibble: 2 x 3
Label min max
<dbl> <chr> <chr>
1 1 2019-03-18 09:30:00 2019-03-18 09:36:00
2 2 2019-03-19 07:30:00 2019-03-19 07:33:00
If we split it into components and then use start and end on each component, we can get the start and end times of each group.
s <- split(R, R$Label)
do.call("c", lapply(s, start)) # start of each group
do.call("c", lapply(s, end)) # end of each group
If you want the row numbers instead, do the same thing but change the index to 1, 2, 3, ...:
R2 <- zoo(coredata(R))
s <- split(R2, R2$Label)
do.call("c", lapply(s, start)) # start of each group
do.call("c", lapply(s, end)) # end of each group
I'm trying to duplicate what you'd do with the AVERAGEIFS function in Excel on my dataset:
EG_df <- data.frame(id = c("red_blue", "white_blue", "red_yellow","white_yellow", "brown_blue", "brown_yellow"),
StartDate = as.Date(c('2019-1-1','2019-3-1','2019-7-1','2018-1-1','2018-3-1','2018-7-1')),
EndDate = as.Date(c('2019-6-1','2019-12-1','2019-8-1','2018-1-1','2018-3-1','2018-7-1')),
avg_Value = NA
)
source <- data.frame(source.id = c("red_blue", "red_blue", "red_blue","brown_yellow", "brown_yellow", "brown_yellow"),
source.Date = as.Date(c('2019-1-1','2019-2-1','2019-3-1','2018-7-1','2018-8-1','2018-9-1')),
source.Value = c(22,56,32,31,14,7)
)
The logic I need to fill in EG_df$avg_Value:
For each row in EG_df, return the average of source.Value over the rows of source where source.id matches id and source.Date is between StartDate and EndDate.
The Excel Formula, for clarification:
=AVERAGEIFS(source.value, source.id, id, source.Date, ">="&StartDate, source.Date, "<="&EndDate)
Any help would be greatly appreciated!
You can do this pretty efficiently with a non-equi join:
library(data.table)
setDT(source); setDT(EG_df)
EG_df[, avg_Value :=
source[copy(.SD), on=.(source.id = id, source.Date >= StartDate, source.Date <= EndDate), mean(x.source.Value), by=.EACHI]$V1
]
id StartDate EndDate avg_Value
1: red_blue 2019-01-01 2019-06-01 36.66667
2: white_blue 2019-03-01 2019-12-01 NA
3: red_yellow 2019-07-01 2019-08-01 NA
4: white_yellow 2018-01-01 2018-01-01 NA
5: brown_blue 2018-03-01 2018-03-01 NA
6: brown_yellow 2018-07-01 2018-07-01 31.00000
(There are NAs since I'm just using the excerpt source provided rather than the full table.)
How it works
x[i, j] subsets using i and then evaluates j, inside of which .SD refers to the Subset of Data.
When x and i are both tables, x[i, on=, j, by=.EACHI] is a join, with on= specifying the join conditions, and j evaluated for each row of i.
Because j = mean(x.source.Value) returns an unnamed column, it gets the default name of V1.
Inside j of x[i, j], v := val creates or modifies column v by assigning val to it.
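A toy illustration of by=.EACHI on made-up data (not from the question) may make the join-then-aggregate step clearer:
library(data.table)
x <- data.table(id = c("a", "a", "b"), v = 1:3)
i <- data.table(id = c("a", "b"))
# j (here sum(v)) is evaluated once per row of i, over the rows of x it matches
x[i, on = .(id), sum(v), by = .EACHI]
#    id V1
# 1:  a  3
# 2:  b  3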
Using the dplyr library
library(dplyr)
df = EG_df %>%
left_join(source, by = c('id' = 'source.id')) %>%
filter((StartDate <= source.Date) & (source.Date <= EndDate)) %>%
group_by(id, StartDate, EndDate) %>%
summarise(value = mean(source.Value))
using the tidyverse
dplyr::inner_join(source,EG_df,by = c("source.id"="id")) %>%
dplyr::filter(source.Date >= StartDate,
source.Date <= EndDate) %>%
dplyr::group_by(source.id,StartDate,EndDate) %>%
dplyr::summarise(avg_Value = mean(source.Value))
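Note that the inner join drops ids with no qualifying source rows. A sketch of restoring those rows with a second join (assumed names, mirroring the merge-back in the base answer below):
avgs <- dplyr::inner_join(source, EG_df, by = c("source.id" = "id")) %>%
  dplyr::filter(source.Date >= StartDate,
                source.Date <= EndDate) %>%
  dplyr::group_by(source.id, StartDate, EndDate) %>%
  dplyr::summarise(avg_Value = mean(source.Value))
EG_df %>%
  dplyr::select(-avg_Value) %>%
  dplyr::left_join(avgs, by = c("id" = "source.id", "StartDate", "EndDate"))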
Consider the base package, running a merge > subset > aggregate to get the averages by id group and date range, then merging this resultset back to the original dataset.
# MERGE > SUBSET > AGGREGATE
agg_df <- aggregate(cbind(avgValue=source.Value) ~ id + StartDate + EndDate,
subset(merge(EG_df, source, by.x="id", by.y="source.id", all.x=TRUE),
source.Date >= StartDate & source.Date <= EndDate),
FUN=mean)
# MERGE WITH ORIGINAL DATASET
merge(EG_df, agg_df, by=c("id", "StartDate", "EndDate"), all.x=TRUE)
# id StartDate EndDate avgValue
# 1 brown_blue 2018-03-01 2018-03-01 NA
# 2 brown_yellow 2018-07-01 2018-07-01 31.00000
# 3 red_blue 2019-01-01 2019-06-01 36.66667
# 4 red_yellow 2019-07-01 2019-08-01 NA
# 5 white_blue 2019-03-01 2019-12-01 NA
# 6 white_yellow 2018-01-01 2018-01-01 NA
Aside - This is similar to SQL's greatest-n-per-group problem (official StackOverflow tag) where the agg_df would be a subquery or CTE joined back to original table.
I have a data frame with three columns: employee ID, start date (ydm), and end date (ydm). My objective is to create another data frame with two columns: employee ID and date. The second data frame is built from the first, taking the IDs from the first data frame, with the date column holding all the months between that employee's start date and end date. In simple words, I want to expand the data in the first data frame by month according to each employee's start and end dates.
I actually successfully created the code using a for loop. The problem is that it is very slow, and I have read that one should avoid loops in R. Is there a way to do the same thing much more quickly?
An example of my data frame and code is below:
library(lubridate)
# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
EndDate= c('2018-1-3','2018-1-9','2018-1-8'), stringsAsFactors = F)
a$StartDate <- ydm(a$StartDate)
a$EndDate <- ydm(a$EndDate)
#second empty data frame
a1 <-a
a1 <- a1[0,1:2]
#my code starts
r <- 1
r.1 <- 1
for (id in a$employeeid) {
  #r.1 <- 1
  for (i in format(seq(a[r, 2], a[r, 3], by = "month"), "%Y-%m-%d")) {
    a1[r.1, 1] <- a[r, 1]
    a1[r.1, 2] <- i
    r.1 <- r.1 + 1
  }
  r <- r + 1
}
This gives the result I want; I just want it to run a bit quicker.
Almost a one-liner with the tidyverse:
> result
# A tibble: 12 x 2
employeeid date
<chr> <date>
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-11-01
10 c 2018-12-01
11 c 2019-01-01
12 c 2019-02-01
Code
result <- df %>%
group_by(employeeid) %>%
summarise(date = list(seq(StartDate,
EndDate,
by = "month"))) %>%
unnest()
Data
library(tidyverse)
library(lubridate)
df <- data.frame(employeeid = c('a', 'b', 'c'),
StartDate = ymd(c('2018-1-1', '2018-5-1', '2018-11-1')),
EndDate = ymd(c('2018-3-1', '2018-9-1', '2019-02-1')),
stringsAsFactors = FALSE)
I'd try to solve this by using apply and a custom function that calculates the difference between end and start.
I'm not sure what your desired output looks like, but in the following example all months between start and end are pasted into a string.
library(lubridate)
# Creating Data frame
a<- data.frame(employeeid =c('a','b','c'), StartDate= c('2018-1-1','2018-1-5','2018-11-2'),
EndDate= c('2018-2-3','2019-1-9','2020-1-8'), stringsAsFactors = F)
a$StartDate <- ymd(a$StartDate)
a$EndDate <- ymd(a$EndDate)
# month-number to month-name mapping
month_names = month.abb[1:12]
month_dif = function(dates) # expects a length-2 vector: start date, then end date
{
  start = dates[1] # first element is the start date
  end = dates[2]   # second element is the end date
  start_month = month(start)
  end_month = month(end)
  start_year = year(start)
  end_year = year(end)
  year_dif = end_year - start_year
  if(year_dif == 0){ # start and end in the same year: months run from start to end
    return(paste(month_names[start_month:end_month], collapse = ", "))
  } else { # year boundary crossed: start to December, any full years in between, then January to end
    paste(c(month_names[start_month:12],
            rep(month_names, year_dif - 1),
            month_names[1:end_month]), collapse = ", ")
  }
}
apply(a[2:3], 1, month_dif)
output:
> apply(a[2:3], 1, month_dif)
[1] "Jan, Feb"
[2] "Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
[3] "Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
You can use a combination of apply and do.call:
out_apply_list <- apply(X=a, MARGIN=1,
FUN=function(x) {
data.frame(id= x[1],
date=seq(from = as.Date(x[2], "%Y-%d-%m"),
to = as.Date(x[3], "%Y-%d-%m"),
by = "month"),
row.names = NULL)
})
df <- do.call(what = rbind, args = out_apply_list)
which gives you the following output:
> df
id date
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-02-11
10 c 2018-03-11
11 c 2018-04-11
12 c 2018-05-11
13 c 2018-06-11
14 c 2018-07-11
For the sake of completeness, here is a concise one-line with data.table:
library(data.table)
setDT(a)[, .(StartDate = seq(StartDate, EndDate, by = "month")), by = employeeid]
employeeid StartDate
1: a 2018-01-01
2: a 2018-02-01
3: a 2018-03-01
4: b 2018-05-01
5: b 2018-06-01
6: b 2018-07-01
7: b 2018-08-01
8: b 2018-09-01
9: c 2018-02-11
10: c 2018-03-11
11: c 2018-04-11
12: c 2018-05-11
13: c 2018-06-11
14: c 2018-07-11
I have one data table that contains just a sequence of times, and another data table containing two columns: start_time and end_time. I want to add a column to the first table whose value is the count of rows in the second table where the time from the first table falls between the start and end time. Here is my code:
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
Here is what I want to do, but this is wrong and gives an error. What's the right way to write this?
all_dates[, BinCount := input_data[start_times < Bin & end_times > Bin, .N] ]
In the end I should get something like
Bin BinCount
2017-01-31 17:00:00 1
2017-01-31 17:01:00 5
...
The problem can be solved very easily using sqldf, as it provides an easy way to join tables with range checking. Hence one solution could be:
The data from OP:
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
library(sqldf)
result <- sqldf("SELECT all_dates.bin, count(*) as BinCount
FROM all_dates, input_data
WHERE all_dates.bin > input_data.start_times AND
all_dates.bin < input_data.end_times
GROUP BY bin" )
result
Bin BinCount
1 2017-01-31 17:01:00 1
2 2017-01-31 17:02:00 1
3 2017-01-31 17:03:00 1
4 2017-01-31 17:04:00 1
5 2017-01-31 17:05:00 1
6 2017-01-31 17:06:00 1
...........
...........
497 2017-02-01 01:17:00 6
498 2017-02-01 01:18:00 5
499 2017-02-01 01:19:00 5
500 2017-02-01 01:20:00 4
[ reached getOption("max.print") -- omitted 460 rows ]
In data.table you're after a range join.
library(data.table)
start_date <- as.POSIXct(x = "2017-01-31 17:00:00", format = "%Y-%m-%d %H:%M:%S")
end_date <- as.POSIXct(x = "2017-02-01 09:00:00", format = "%Y-%m-%d %H:%M:%S")
all_dates <- as.data.table(seq(start_date, end_date, "min"))
colnames(all_dates) <- c("Bin")
set.seed(123)
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- data.table(start_times, end_times)
## doing the range-join and calculating the number of items per bin in one chained step
input_data[
all_dates
, on = .(start_times < Bin, end_times > Bin)
, nomatch = 0
, allow.cartesian = T
][, .N, by = start_times]
# start_times N
# 1: 2017-01-31 17:01:00 1
# 2: 2017-01-31 17:02:00 1
# 3: 2017-01-31 17:03:00 1
# 4: 2017-01-31 17:04:00 1
# 5: 2017-01-31 17:05:00 1
# ---
# 956: 2017-02-01 08:56:00 6
# 957: 2017-02-01 08:57:00 4
# 958: 2017-02-01 08:58:00 4
# 959: 2017-02-01 08:59:00 5
# 960: 2017-02-01 09:00:00 5
Note:
I've put the all_dates object on the right-hand side of the join, so the result contains the names of the input_data columns, even though they are your Bins (see this issue for the discussion on this topic)
I've used set.seed(), as you're taking samples
Wasn't requested, but here is a compact alternative solution using the tidyverse. Uses lubridate parsers, interval, and %within%, as well as purrr::map_int to generate the desired bin counts.
library(tidyverse)
library(lubridate)
start_date <- ymd_hms(x = "2017-01-31 17:00:00") # lubridate parsers
end_date <- ymd_hms(x = "2017-02-01 09:00:00")
all_dates <- tibble(seq(start_date, end_date, "min")) # tibble swap for data.table
colnames(all_dates) <- c("Bin")
start_times <- sample(seq(start_date,end_date,"min"), 100)
offsets <- sample(seq(60,7200,60), 100)
end_times <- start_times + offsets
input_data <- tibble(
start_times,
end_times,
intvl = interval(start_times, end_times) # Add interval column
)
all_dates %>% # Checks date in Bin and counts intervals it lies within
mutate(BinCount = map_int(.$Bin, ~ sum(. %within% input_data$intvl)))
# A tibble: 961 x 2
Bin BinCount
<dttm> <int>
1 2017-01-31 17:00:00 0
2 2017-01-31 17:01:00 0
3 2017-01-31 17:02:00 0
4 2017-01-31 17:03:00 0
5 2017-01-31 17:04:00 0
6 2017-01-31 17:05:00 0
7 2017-01-31 17:06:00 0
8 2017-01-31 17:07:00 1
9 2017-01-31 17:08:00 1
10 2017-01-31 17:09:00 1
# ... with 951 more rows
I am working with a multi-year dataset that has columns for date (%Y-%m-%d) and daily values for several variables.
In R, how do I subset the data by a date range (e.g., June 29 +/- 5 days) but capture the data from all years?
DATE A B C
1996-06-10 12:00:00 178.0 24.1 1.7
1996-06-11 12:00:00 184.1 30.2 1.1
1996-06-12 12:00:00 187.2 29.4 1.8
1996-06-13 12:00:00 194.4 35.0 5.3
1996-06-14 12:00:00 200.3 35.9 1.5
1996-06-15 12:00:00 138.9 15.1 0.0
...
1) Base R
Let yrs be all unique years in the data and targets be each of those years combined with the target's month and day. Then create dates, which contains all dates within delta days of any value in targets. Note that sapply strips dates of their "Date" class, but that does not matter since dates is only used afterwards in %in%, which ignores the class. Finally, subset DF down to those rows whose DATE is in dates. No packages are used.
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
DATE <- as.Date(DF$DATE)
yrs <- unique(format(DATE, "%Y"))
targets <- as.Date(paste(yrs, target, sep = "-"))
dates <- c(sapply(targets, "+", seq(-delta, delta)))
DF[DATE %in% dates, ]
giving:
DATE A B C
5 1996-06-14 12:00:00 200.3 35.9 1.5
6 1996-06-15 12:00:00 138.9 15.1 0.0
2) sqldf
Alternately, this can be done using a single SQL statement. Note that we assume that the DATE column is character since the question referred to it being in a particular format. Now, using the same inputs the inner select generates target dates from each year and then the outer select joins DF to those rows within delta days of any target date. We use the H2 database backend here since it has better date support than SQLite.
library(sqldf)
library(RH2)
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
fn$sqldf("select DF.* from DF
join (select distinct cast(substr(DATE, 1, 4) || '-' || '$target' as DATE) as target
from DF)
on cast(substr(DATE, 1, 10) as DATE) between target - $delta and target + $delta")
giving:
DATE A B C
1 1996-06-14 12:00:00 200.3 35.9 1.5
2 1996-06-15 12:00:00 138.9 15.1 0.0
We could simplify the SQL somewhat if DATE is of R's "Date" class. That is, replace the sqldf statement above with:
DF2 <- transform(DF, DATE = as.Date(DATE))
fn$sqldf("select DF2.* from DF2
join (select distinct cast(year(DATE) || '-' || '$target' as DATE) as target from DF2)
on DATE between target - $delta and target + $delta")
giving:
DATE A B C
1 1996-06-14 200.3 35.9 1.5
2 1996-06-15 138.9 15.1 0.0
Note
The input DF is assumed to be:
DF <- structure(list(DATE = c("1996-06-10 12:00:00", "1996-06-11 12:00:00",
"1996-06-12 12:00:00", "1996-06-13 12:00:00", "1996-06-14 12:00:00",
"1996-06-15 12:00:00"), A = c(178, 184.1, 187.2, 194.4, 200.3,
138.9), B = c(24.1, 30.2, 29.4, 35, 35.9, 15.1), C = c(1.7, 1.1,
1.8, 5.3, 1.5, 0)), .Names = c("DATE", "A", "B", "C"), row.names = c(NA,
-6L), class = "data.frame")
A base R attempt.
Stealing the example data from the other answer by Kevin:
df <- data.frame(
my_date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
Set your variables for the selection:
month_num <- 6
day_num <- 29
bound <- 5
Find the key dates in your range of years:
keydates <- as.Date(sprintf(
"%d-%02d-%02d",
do.call(seq, as.list(as.numeric(range(format(df$my_date, "%Y"))))),
month_num,
day_num
))
Make a selection:
out <- df[df$my_date %in% outer(keydates, -bound:bound, `+`),]
Check that it worked:
table(format(out$my_date, "%m-%d"))
#06-24 06-25 06-26 06-27 06-28 06-29 06-30 07-01 07-02 07-03 07-04
# 10 10 10 10 10 10 10 10 10 10 10
One value for each day/month in each year from 1990 to 1999, centred on "06-29" with a range of 5 days either side.
You can use lubridate intervals to provide valid date ranges and then use a purrr map to run each interval over your data to filter.
library(dplyr)
library(lubridate)
library(magrittr) # only because I've used the "exposition" (%$%) pipe
library(purrr)
df <- tibble(
my_date = as.POSIXct(
seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
tz = "UTC"
),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
month_num <- 6
day_num <- 29
bound <- 5
date_span <- df %>%
select(my_date) %>%
filter(month(my_date) == month_num & day(my_date) == day_num) %>%
mutate(
start = my_date - days(bound),
end = my_date + days(bound)
) %$%
interval(start, end, tzone = "UTC")
map_dfr(date_span, ~filter(df, my_date %within% .x))
# # A tibble: 110 x 4
# my_date x y z
# <dttm> <dbl> <dbl> <dbl>
# 1 1990-06-24 10:00:00 0.404 1.33 1.58
# 2 1990-06-25 10:00:00 0.351 -1.73 0.665
# 3 1990-06-26 10:00:00 -0.512 1.01 1.72
# 4 1990-06-27 10:00:00 1.55 0.417 -0.126
# 5 1990-06-28 10:00:00 1.86 1.18 0.322
# 6 1990-06-29 10:00:00 -0.0193 -0.105 0.356
# 7 1990-06-30 10:00:00 0.844 -0.712 1.51
# 8 1990-07-01 10:00:00 -0.431 0.451 -2.19
# 9 1990-07-02 10:00:00 1.74 -0.0650 -0.866
# 10 1990-07-03 10:00:00 0.965 -0.506 -0.0690
# # ... with 100 more rows
You could also go via the Julian day, which allows you to do basic arithmetic operations (e.g. ± 5 days) without the need to convert back and forth between Date and character objects. Keep in mind that your target date translates into a different Julian day during leap years, so you'll need to extract this piece of information somehow (use lubridate::leap_year if you don't like the base R approach below):
## convert dates to julian day (as integer, so the arithmetic comparisons below work)
dat$JULDAY = as.integer(format(
  dat$DATE
  , "%j"
))
## target date (here 19 june) as julian day
dat$TARGET = ifelse(
as.integer(
format(
dat$DATE
, "%y"
)
) %% 4 == 0
, 171 # leap year
, 170 # common year
)
## create subset
subset(
dat
, JULDAY >= (TARGET - 5) & JULDAY <= (TARGET + 5)
, select = c("DATE", "A", "B", "C")
)
# DATE A B C
# 5 1996-06-14 12:00:00 200.3 35.9 1.5
# 6 1996-06-15 12:00:00 138.9 15.1 0.0