Subset a dataframe between 2 dates - r

I am working with daily returns from a Brazilian index (IBOV) going back to 1993, and I am trying to figure out the best way to subset the data for the period between two dates.
The data frame (IBOV_RET) is as follows:
head(IBOV_RET)
DATE 1D_RETURN
1 1993-04-28 -0.008163265
2 1993-04-29 -0.024691358
3 1993-04-30 0.016877637
4 1993-05-03 0.000000000
5 1993-05-04 0.033195021
6 1993-05-05 -0.012048193
...
I set 2 variables DATE1 and DATE2 as dates
DATE1 <- as.Date("2014-04-01")
DATE2 <- as.Date("2014-05-05")
I was able to create a new subset using this code:
TEST <- IBOV_RET[IBOV_RET$DATE >= DATE1 & IBOV_RET$DATE <= DATE2,]
It worked, but I was wondering if there is a better way to subset the data between two dates, maybe using subset.

As already pointed out by @MrFlick, you can't get around the basic logic of subsetting. One way to make it easier to subset your specific data.frame is to define a function that takes two inputs, like DATE1 and DATE2 in your example, and returns the subset of IBOV_RET that falls between them.
myfunc <- function(x,y){IBOV_RET[IBOV_RET$DATE >= x & IBOV_RET$DATE <= y,]}
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")
Test <- myfunc(DATE1,DATE2)
#> Test
# DATE X1D_RETURN
#2 1993-04-29 -0.02469136
#3 1993-04-30 0.01687764
#4 1993-05-03 0.00000000
#5 1993-05-04 0.03319502
You can also enter the specific dates directly into myfunc:
myfunc(as.Date("1993-04-29"),as.Date("1993-05-04")) #will produce the same result
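The function above reads IBOV_RET from the global environment. A slightly more general sketch passes the data frame in as an argument. The name subset_dates and the col parameter are made up for this illustration, and the sample rows are the six shown in the question (with the return column named X1D_RETURN, as in the printed output above).

```r
# Hypothetical helper: keep the rows of `df` whose date column lies in [from, to].
subset_dates <- function(df, from, to, col = "DATE") {
  stopifnot(inherits(df[[col]], "Date"))   # the column must already be a Date
  keep <- df[[col]] >= from & df[[col]] <= to
  df[keep, , drop = FALSE]
}

# Sample rows copied from the question
IBOV_RET <- data.frame(
  DATE = as.Date(c("1993-04-28", "1993-04-29", "1993-04-30",
                   "1993-05-03", "1993-05-04", "1993-05-05")),
  X1D_RETURN = c(-0.008163265, -0.024691358, 0.016877637,
                 0.000000000, 0.033195021, -0.012048193)
)

Test <- subset_dates(IBOV_RET, as.Date("1993-04-29"), as.Date("1993-05-04"))
print(Test)
```

Because the data frame is an argument, the same helper can be reused for other return series without editing the function body.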

You can use the subset() function with the & operator. The condition compares the DATE column against both bounds:
subset(IBOV_RET, DATE >= DATE1 & DATE <= DATE2)
Updating for a more "tidyverse-oriented" approach (requires dplyr):
library(dplyr)
IBOV_RET %>%
filter(DATE >= DATE1, DATE <= DATE2) # a comma between conditions acts like &

There is no real other way to extract date ranges. The logic is the same as extracting a range of numeric values as well, you just need to do the explicit Date conversion as you've done. You can make your subsetting shorter as you would with any other subsetting task with subset or with. You can break ranges into intervals with cut (there is a specific cut.Date overload). But base R does not have any way to specify Date literals so you cannot avoid the conversion. I can't imagine what other sort of syntax you may have had in mind.
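To illustrate the points about with and cut, here is a small sketch using the six rows from the question. Splitting by month is just one possible use of cut, and the MONTH column name is an assumption made for this example.

```r
# Sample rows copied from the question
IBOV_RET <- data.frame(
  DATE = as.Date(c("1993-04-28", "1993-04-29", "1993-04-30",
                   "1993-05-03", "1993-05-04", "1993-05-05")),
  X1D_RETURN = c(-0.008163265, -0.024691358, 0.016877637,
                 0.000000000, 0.033195021, -0.012048193)
)
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")

# with() avoids repeating IBOV_RET$ inside the condition
TEST <- IBOV_RET[with(IBOV_RET, DATE >= DATE1 & DATE <= DATE2), ]

# cut() on a Date vector bins the dates into calendar intervals;
# each level is labelled with the first day of its month
IBOV_RET$MONTH <- cut(IBOV_RET$DATE, breaks = "month")
by_month <- split(IBOV_RET, IBOV_RET$MONTH)
print(names(by_month))
```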

What about:
DATE1 <- as.Date("1993-04-29")
DATE2 <- as.Date("1993-05-04")
# create a date range from the start date to the end date:
dates <- seq(DATE1, DATE2, by="days")
IBOV_RET <- subset(IBOV_RET, DATE %in% dates)

I believe lubridate could help here:
library(lubridate)
daterange <- interval(DATE1, DATE2)
TEST <- IBOV_RET[which(IBOV_RET$DATE %within% daterange),]

I rather like the dplyr package. Load it first:
library("dplyr")
and then, as you did:
Date1 <- as.Date("2014-04-01")
Date2 <- as.Date("2014-05-05")
Finally:
test <- filter(IBOV_RET, DATE > Date1 & DATE < Date2) # use >= and <= to include the end dates

You can use dplyr's between() function after converting the strings to dates (both bounds are inclusive):
df %>%
filter(between(date_column, as.Date("string-date-lower-bound"), as.Date("string-date-upper-bound")))

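Test <- IBOV_RET[IBOV_RET$DATE >= as.Date("2014-04-01") | IBOV_RET$DATE <= as.Date("1993-05-04"), ]
Here I am using the "or" operator |, so a row is kept when its date is on or after the first date, or on or before the second date. Note that this selects the rows outside the interval between the two dates; to keep only the rows between two dates, combine the two conditions with & instead.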
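The difference between & and | can be checked on the six sample dates from the question. This is only a sketch; lo and hi are names chosen for the example.

```r
# Sample dates copied from the question
IBOV_RET <- data.frame(
  DATE = as.Date(c("1993-04-28", "1993-04-29", "1993-04-30",
                   "1993-05-03", "1993-05-04", "1993-05-05"))
)
lo <- as.Date("1993-04-29")
hi <- as.Date("1993-05-04")

# & keeps rows that satisfy both conditions: dates inside [lo, hi]
inside <- IBOV_RET[IBOV_RET$DATE >= lo & IBOV_RET$DATE <= hi, , drop = FALSE]

# | keeps rows that satisfy either condition: here, dates outside [lo, hi]
outside <- IBOV_RET[IBOV_RET$DATE < lo | IBOV_RET$DATE > hi, , drop = FALSE]

# With lo <= hi, "DATE >= lo | DATE <= hi" is true for every row
everything <- IBOV_RET[IBOV_RET$DATE >= lo | IBOV_RET$DATE <= hi, , drop = FALSE]

print(c(inside = nrow(inside), outside = nrow(outside), everything = nrow(everything)))
```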

Related

Trying to count rows with non-missing date/category observations: why is this code not working?

I've got a data frame with five categorical variables and two date variables.
I'd like to get a count of the observations for which none of the categorical variables is missing AND for which the difference between the dates is less than or equal to six months. So for this data frame, it would be a count of 1 as only one observation (row 1) meets the criteria.
The code I've tried so far works on the minimal working example but doesn't work when I run it on my actual data set. When I take the code apart the bits and pieces work (eg as.numeric(difftime(white$dnf_DateDeath, white$RecruitmentFinal, units = "days")) <= 182.52) but when together as below I get [1] NA. I have no idea why.
Is there a way of building an ifelse() tree, so that the expressions might get evaluated step-wise? Any help would be much appreciated.
Starting point:
df <-
data.frame(sports=c(1,NA,1,1),car=c(1,NA,NA,1),hobbies=c(1,NA,1,1),
home=c(1,NA,NA,1),office=c(1,1,NA,1), start_date=c("01/01/2016",
"01/01/2016","01/01/2016","01/01/2016"),
leave_date=c("01/04/2016","01/03/2016",NA,"01/12/2016"))
I've tried using:
library(lubridate)
sum(!is.na(df$sports) &!is.na(df$hobbies) & !is.na(df$car) &
!is.na(df$home) & !is.na(df$office) &
as.period(interval(df$start_date, df$leave_date)) <= months(6))
And I've also tried:
sum(!is.na(df$sports) &!is.na(df$hobbies) & !is.na(df$car) &
!is.na(df$home) & !is.na(df$office) &
as.numeric(difftime(df$leave_date, df$start_date, units = "days"))
<= 182.52)
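The following seems to work as expected. The dates are stored as day/month/year strings, so they are converted with as.Date() before the difference is taken.
df2 <- df[complete.cases(df), ]
start <- as.Date(df2$start_date, format = "%d/%m/%Y")
leave <- as.Date(df2$leave_date, format = "%d/%m/%Y")
df2[abs(difftime(leave, start, units = "days")) <= 365.25/2, ]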
# sports car hobbies home office start_date leave_date
#1 1 1 1 1 1 01/01/2016 01/04/2016
EDIT.
If you want to use the lubridate package for date arithmetic, you could do
library(lubridate)
inx <- dmy(df2$start_date) + months(6) > dmy(df2$leave_date)
df2[inx, ]
# sports car hobbies home office start_date leave_date
#1 1 1 1 1 1 01/01/2016 01/04/2016
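Since the question asks for a count, here is a base R sketch that returns the number directly. One likely reason for the NA on real data (an assumption, since the real data isn't shown) is a row with complete categories but a missing date: TRUE & NA is NA, and sum() of a vector containing NA is NA unless na.rm = TRUE is set. The names cats, complete_cats, within_6m and n_ok are chosen for this example.

```r
df <- data.frame(
  sports = c(1, NA, 1, 1), car = c(1, NA, NA, 1), hobbies = c(1, NA, 1, 1),
  home = c(1, NA, NA, 1), office = c(1, 1, NA, 1),
  start_date = c("01/01/2016", "01/01/2016", "01/01/2016", "01/01/2016"),
  leave_date = c("01/04/2016", "01/03/2016", NA, "01/12/2016"),
  stringsAsFactors = FALSE
)

# Parse the day/month/year strings into Date objects (NA stays NA)
start <- as.Date(df$start_date, format = "%d/%m/%Y")
leave <- as.Date(df$leave_date, format = "%d/%m/%Y")

# TRUE where none of the five categorical variables is missing
cats <- c("sports", "car", "hobbies", "home", "office")
complete_cats <- complete.cases(df[, cats])

# TRUE where the dates are at most half a year apart; NA if a date is missing
within_6m <- as.numeric(leave - start) <= 365.25 / 2

# na.rm = TRUE drops any NA left over from missing dates
n_ok <- sum(complete_cats & within_6m, na.rm = TRUE)
print(n_ok)
```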

R how to avoid a loop. Counting weekends between two dates in a row for each row in a dataframe

I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates each of which are in their own column (5 & 7 in this example code). I need to repeat this process for each row of my dataframe. The end results will be one column that represents the number of Saturdays and Sundays within the date range defined by two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5],Date2[1,7],"days")) %in% c("Saturday",'Sunday'))
The answer to this will be 3. But, if I take out the "1" in the row position of date1 and date2 I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and have one vector that lists the number of Saturdays and Sundays between the two dates in column 5 and 7 without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function to each pair of elements in turn. They're pretty fast too!
Here's an example. Note that the _int requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday",'Sunday'))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with() to assign the columns in the data frame directly.
Essentially, use a reference date and calculate the number of full weeks between it and each date (with floor). Then take the difference between the two counts. The code counts the days after the start date up to and including the end date, so a start date that itself falls on a Saturday or Sunday is not counted.
# weekdays(as.Date(0, origin = "1970-01-01")) -> "Thursday"
library(dplyr)
startDate <- as.Date(0, origin = "1970-01-01") # this is a Thursday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with() to define the columns directly instead of %>%
# day 0 is a Thursday, so remainder 2 is a Saturday and remainder 3 is a Sunday
df <- df %>%
mutate(originDate = startDate) %>%
mutate(startDayDiff = as.numeric(start-originDate), endDayDiff = as.numeric(end-originDate)) %>%
mutate(startWeekDiff = floor(startDayDiff/7),endWeekDiff = floor(endDayDiff/7)) %>%
mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7>=2,1,0),
NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7>=3,1,0),
NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2,1,0),
NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3,1,0)
) %>%
mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
Dates are number of days since 1970-01-01, a Thursday.
So the following is the number of Saturdays or Sundays since that date
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
As follows
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
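As a sanity check, the vectorized g() can be compared against a brute-force count that builds every day in the range. This is a sketch: brute is a name made up here, and it uses format(x, "%u") (ISO weekday number, where 6 is Saturday and 7 is Sunday) instead of weekdays() so that the check does not depend on the locale's day names.

```r
# Vectorized functions from the answer above
f <- function(d) { d <- as.numeric(d); r <- d %% 7; 2 * (d %/% 7) + (r >= 2) + (r >= 3) }
g <- function(d1, d2) f(d2) - f(d1 - 1)

# Brute-force reference: enumerate each day and count Saturdays and Sundays
brute <- function(d1, d2) {
  days <- seq(d1, d2, by = "day")
  sum(format(days, "%u") %in% c("6", "7"))
}

d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)

fast <- g(d$Date1, d$Date2)
slow <- vapply(seq_len(nrow(d)),
               function(i) brute(d$Date1[i], d$Date2[i]),
               integer(1))
print(rbind(fast = fast, slow = slow))
```

The brute-force version is only there for checking; on millions of rows the arithmetic in g() avoids creating a sequence of dates per row.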

How do I create a string containing quotes and then parse and evaluate it?

I have a data frame with 3 years worth of sales data that I'm trying to convert to a time series. Manually creating subsets for each of the 36 months:
mydfJan2011 <- subset(myDataFrame,
as.Date("2011-01-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2011-01-31"))
...
mydfDec2013 <- subset(myDataFrame,
as.Date("2013-12-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2013-12-31"))
and then summing them up and putting them into a vector
counts[1] <- sum(mydfJan2011$itemsSold)
...
counts[36] <- sum(mydfDec2013$itemsSold)
to get the values for the time series works fine, but I'd like to make it a little more automatic as I have to create more than one time series, so I'm trying to turn it into a loop.
In order to do that, I need to create a string with a subset command like this:
"subset(myDataFrame,
as.Date("2011-01-01") <= myDataFrame$Dates &
myDataFrame$Dates <= as.Date("2011-01-31"))"
But when I use paste, the result is this:
myString
>"subset(myDataFrame, as.Date(\"2011-02-01\") <= myDataFrame$Dates & myDataFrame$Dates <= as.Date(\"2011-02-28\"))"
and
eval(parse(text = myString))
results in the following error message:
Error in charToDate(x) :
character string is not in a standard unambiguous format
whereas just typing in the command (without escapes) results in the subset I'm trying to create.
I've tried playing around with single and double quotes, substitute and deparse, but none of it results in any kind of subset of my data frame.
Any suggestions?
Even another way of splitting up the data by month and summing it up would be welcome.
Thanks,
Signe
Here is a solution using tapply:
with(sales, tapply(itemsSold, substr(Dates, 1, 7), sum))
Produces monthly sums (I limited my data to 9 months for illustrative purposes, but this extends to longer periods):
2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09
1592.097 1468.427 1594.386 1563.014 1595.489 1560.361 1553.128 1663.705 1325.519
tapply computes the sum of the values in a vector (sales$itemsSold) grouped by the values of another vector (substr(sales$Dates, 1, 7), which is basically "yyyy-mm"). with saves me from typing sales$ repeatedly. You should almost never have to use eval(parse(...)). There is almost always a better, faster way to do it without resorting to that.
And here is the data I used:
set.seed(1)
sales <- data.frame(Dates=seq(as.Date("2011-01-01"), as.Date("2011-09-30"), by="+1 day"))
sales$itemsSold <- runif(nrow(sales), 1, 100)
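If a data frame is more convenient than a named vector, the same monthly sums can be sketched with aggregate() in base R. The month column and the monthly and counts names are assumptions for this example; format(Dates, "%Y-%m") plays the same role as substr() above.

```r
set.seed(1)
sales <- data.frame(Dates = seq(as.Date("2011-01-01"), as.Date("2011-09-30"), by = "day"))
sales$itemsSold <- runif(nrow(sales), 1, 100)

# Label every row with its year and month, e.g. "2011-01"
sales$month <- format(sales$Dates, "%Y-%m")

# One row per month with the summed itemsSold
monthly <- aggregate(itemsSold ~ month, data = sales, FUN = sum)

# Vector of monthly totals, in calendar order, ready for ts() or similar
counts <- monthly$itemsSold
print(monthly)
```

This replaces the 36 hand-written subsets with a single grouping step, so no command strings need to be pasted together or evaluated.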
For reference, there are also several 3rd party packages that simplify this type of computation (see data.table, dplyr).
Here's a data.table approach that aggregates by year and month, using the first of the month as the respective group label:
library(data.table)
##
mDt <- Dt[
,list(monthSold=sum(itemsSold)),
keyby=list(mDay=as.Date(paste0(
year(Dates),"-",month(Dates),"-01")))]
##
R> head(mDt)
mDay monthSold
1: 2012-01-01 179
2: 2012-02-01 128
3: 2012-03-01 152
4: 2012-04-01 160
5: 2012-05-01 152
6: 2012-06-01 141
Data:
set.seed(123)
Dt <- data.table(
Dates=seq.Date(
from=as.Date("2012-01-01"),
to=as.Date("2014-12-31"),
by="day"),
itemsSold=rpois(1096,5))
