Creating a date vector from an existing date vector - R

Date Price
2006-01-03 12.02
2006-01-04 11.84
2006-01-05 11.83
...
EXPIRATION DATES
2006-01-18
2006-02-15
2006-03-22
...
Hello, I have a data frame of daily futures prices with corresponding dates. I also have a vector of all the relevant contract expiration dates for the futures prices.
The price column is the price of the contract expiring in the nearest month (12-month expiration cycle). For example, the 12.02 contract price on 2006-01-03 expires on 2006-01-18. I want to create a column that lists the relevant expiration date for each futures price so I can calculate days until expiration for each daily price. The logic would be:
all dates between 2006-01-03 and 2006-01-18 would have 2006-01-18 in the new expiration date column and so on for all the 127 expiration dates I have.
I tried playing around with mutate() and subset(), but I've had no luck. I assume this will be tedious, but I just need someone to help me get started.
Thanks

Assuming the two data.frames are called df and df2 and the date columns are already Date class, with dplyr:
# add a row with a different expiration date to make sure it's working
df[4,] <- list(as.Date('2006-02-04'), 12)
library(dplyr)
df %>%
  rowwise() %>%
  mutate(days_left = min(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > Date] - Date))
## Source: local data frame [4 x 3]
## Groups: <by row>
##
## # A tibble: 4 x 3
## Date Price days_left
## <date> <dbl> <S3: difftime>
## 1 2006-01-03 12.02 15 days
## 2 2006-01-04 11.84 14 days
## 3 2006-01-05 11.83 13 days
## 4 2006-02-04 12.00 11 days
or in base,
df$days_left <- lapply(df$Date, function(x) {
  min(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > x] - x)
})
df
## Date Price days_left
## 1 2006-01-03 12.02 15
## 2 2006-01-04 11.84 14
## 3 2006-01-05 11.83 13
## 4 2006-02-04 12.00 11
Subtracting dates calls difftime(), which may be worth calling explicitly so you can specify units:
# dplyr
df %>%
  rowwise() %>%
  mutate(days_left = df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > Date] %>%
           difftime(Date, units = 'days') %>%
           min())

# base
df$days_left <- lapply(df$Date, function(x) {
  min(difftime(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES > x], x, units = 'days'))
})
Depending on your data it may not make a difference, but it is a more robust approach than simple subtraction.
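If you also want the expiration date itself in a column (as originally asked), the same rowwise pattern works. This is just a sketch under the same df/df2 assumptions, using >= so that a price observed on an expiration day maps to that day:
df %>%
  rowwise() %>%
  mutate(exp_date  = min(df2$EXPIRATION.DATES[df2$EXPIRATION.DATES >= Date]),  # nearest expiration on or after Date
         days_left = difftime(exp_date, Date, units = 'days'))
This assumes every price date falls on or before the last expiration date; otherwise min() sees an empty vector and returns Inf with a warning.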

Disclaimer: I dislike pipes (I have my reasons) and when I can find a good "Base R" solution, I go for that one first. So here's my old fart solution.
I added more data to make sure it really works as expected.
# Create main dataframe
df1 <- read.table(text=
"Date Price
2006-01-03 12.02
2006-01-18 12.04
2006-01-22 12.05
2006-02-01 11.99
2006-02-16 11.84
2006-03-21 11.83
2006-03-22 11.90
2006-03-29 12.00
", head=T, stringsAsFactors=FALSE)
# Convert Date column to a proper Date-classed column
df1$Date <- as.Date(df1$Date)
# Generate an expiration dates vector
exp_dates <- as.Date(c("2006-01-18", "2006-02-15", "2006-03-22", "2006-04-18"))
# initialize df1$exp_dates
df1$exp_date <- NA
class(df1$exp_date) <- "Date"
# Loop over rows and find the closest expiration date that is not before the price date
for (i in 1:nrow(df1))
  df1$exp_date[i] <- exp_dates[which.max((df1$Date[i] - exp_dates) <= 0)]
(Yeah, I also use loops, and I even like them! :^p)
df1
Date Price exp_date
1 2006-01-03 12.02 2006-01-18
2 2006-01-18 12.04 2006-01-18
3 2006-01-22 12.05 2006-02-15
4 2006-02-01 11.99 2006-02-15
5 2006-02-16 11.84 2006-03-22
6 2006-03-21 11.83 2006-03-22
7 2006-03-22 11.90 2006-03-22
8 2006-03-29 12.00 2006-04-18
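For what it's worth, the same lookup can be vectorized with findInterval() instead of a loop. A sketch, assuming exp_dates is sorted and every price date is on or before the last expiration date:
# findInterval(Date - 1, exp_dates) counts the expiration dates strictly before each price date,
# so adding 1 gives the index of the first expiration on or after it
idx <- findInterval(df1$Date - 1, exp_dates) + 1
df1$exp_date  <- exp_dates[idx]
df1$days_left <- as.integer(df1$exp_date - df1$Date)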

Related

Applying a function to a subset of an xts object (quantmod)

I'm trying to get the standard deviation of a stock price by year, but I'm getting the same value for every year.
I tried with dplyr (group_by, summarise) and also with a function, but had no luck with either; both return the same value of 67.0.
It is probably passing the whole data frame without subsetting it. How can this issue be fixed?
library(quantmod)
library(tidyr)
library(dplyr)
#initial parameters
initialDate = as.Date('2010-01-01')
finalDate = Sys.Date()
ybeg = format(initialDate,"%Y")
yend = format(finalDate,"%Y")
ticker = "AAPL"
#getting stock prices
stock = getSymbols.yahoo(ticker, from=initialDate, auto.assign = FALSE)
stock = stock[,4] #working only with closing prices
With dplyr:
#Attempt 1 with dplyr - not working, all values by year return the same
stock = stock %>% zoo::fortify.zoo()
stock$Date = stock$Index
separate(stock, Date, c("year","month","day"), sep="-") %>%
  group_by(year) %>%
  summarise(stdev = sd(stock[,2]))
# A tibble: 11 x 2
# year stdev
# <chr> <dbl>
# 1 2010 67.0
# 2 2011 67.0
#....
#10 2019 67.0
#11 2020 67.0
And with function:
#Attempt 2 with function - not working - returns only one value instead of multiple
#getting stock prices
stock = getSymbols.yahoo(ticker, from=initialDate, auto.assign = FALSE)
stock = stock[,4] #working only with closing prices
#subsetting
years = as.character(seq(ybeg,yend,by=1))
years
calculate_stdev = function(series, years) {
  series[years]          # subsetting by years, to be equivalent to stock["2010"], stock["2011"], etc.
  sd(series[years][,1])  # calculate stdev on closing prices of the current subset
}
yearly.stdev = calculate_stdev(stock,years)
> yearly.stdev
[1] 67.04185
Use apply.yearly() (a convenience wrapper around the more general period.apply()) to call a function on yearly subsets of the xts object returned by getSymbols().
You can use the Cl() function to extract the close column from objects returned by getSymbols().
stock = getSymbols("AAPL", from = "2010-01-01", auto.assign = FALSE)
apply.yearly(Cl(stock), sd)
## AAPL.Close
## 2010-12-31 5.365208
## 2011-12-30 3.703407
## 2012-12-31 9.568127
## 2013-12-31 6.412542
## 2014-12-31 13.371293
## 2015-12-31 7.683550
## 2016-12-30 7.640743
## 2017-12-29 14.621191
## 2018-12-31 20.593861
## 2019-12-31 34.538978
## 2020-06-19 29.577157
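apply.yearly() is just period.apply() with yearly endpoints; spelling that out is handy if you later want a different period. A sketch:
# identical to apply.yearly(Cl(stock), sd)
period.apply(Cl(stock), INDEX = endpoints(stock, on = "years"), FUN = sd)

# the same idea for quarterly standard deviations
period.apply(Cl(stock), INDEX = endpoints(stock, on = "quarters"), FUN = sd)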
I don't know dplyr, but here's how to do it with data.table:
library(data.table)
# convert data.frame to data.table
setDT(stock)
# convert your Date column with content like "2020-06-17" from character to Date type
stock[, Date := as.Date(Date)]
# calculate sd(price) grouped by year, assuming here your price column is named "price"
stock[, sd(price), by = year(Date)]
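With the column names actually produced in the question (the fortified object has Index and AAPL.Close rather than Date and price), that would look roughly like this sketch:
library(data.table)
library(quantmod)

stock <- getSymbols.yahoo("AAPL", from = "2010-01-01", auto.assign = FALSE)
dt <- as.data.table(zoo::fortify.zoo(Cl(stock)))   # columns: Index (Date), AAPL.Close
dt[, .(stdev = sd(AAPL.Close)), by = year(Index)]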
Don't pass the name of the dataframe again in your summarise function. Use the variable name instead.
separate(stock, Date, c("year","month","day"), sep="-") %>%
  group_by(year) %>%
  summarise(stdev = sd(AAPL.Close)) # <-- here
# A tibble: 11 x 2
# year stdev
# <chr> <dbl>
# 1 2010 5.37
# 2 2011 3.70
# 3 2012 9.57
# 4 2013 6.41
# 5 2014 13.4
# 6 2015 7.68
# 7 2016 7.64
# 8 2017 14.6
# 9 2018 20.6
#10 2019 34.5
#11 2020 28.7
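If you'd rather not split the date into three character columns, you can group on the year directly. A minimal sketch, starting again from getSymbols.yahoo() so the only price column is AAPL.Close (stock_xts is just a placeholder name):
library(dplyr)
library(lubridate)
library(quantmod)

stock_xts <- getSymbols.yahoo("AAPL", from = "2010-01-01", auto.assign = FALSE)[, 4]  # close only
stock_xts %>%
  zoo::fortify.zoo() %>%                   # xts -> data.frame with an Index (Date) column
  group_by(year = year(Index)) %>%
  summarise(stdev = sd(AAPL.Close))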

How to select the earliest date in a month from a Date series in R?

I have a database containing the values of different indices at different frequencies (weekly, monthly, daily). I want to calculate monthly returns by extracting the beginning-of-month value from the time series.
I have tried using a loop to partition the time series month by month and then min() to get the earliest date in each month. However, I am wondering whether there is a more efficient way to speed up the calculation.
library(data.table)
df<-fread("statistic_date index_value funds_number
2013-1-1 1000.000 0
2013-1-4 996.096 21
2013-1-11 1011.141 21
2013-1-18 1057.344 21
2013-1-25 1073.376 21
2013-2-1 1150.479 22
2013-2-8 1150.288 19
2013-2-22 1112.993 18
2013-3-1 1148.826 20
2013-3-8 1093.515 18
2013-3-15 1092.352 17
2013-3-22 1138.346 18
2013-3-29 1107.440 17
2013-4-3 1101.897 17
2013-4-12 1093.344 17")
I expect to filter down to the rows with the earliest date of each month, such as:
2013-1-1 1000.000 0
2013-2-1 1150.479 22
2013-3-1 1148.826 20
2013-4-3 1101.897 17
Your help will be much appreciated!
Using the tidyverse and lubridate packages,
library(lubridate)
library(tidyverse)
df %>%
  mutate(statistic_date = ymd(statistic_date),  # convert statistic_date to Date format
         month = month(statistic_date),         # create month and year columns
         year = year(statistic_date)) %>%
  group_by(month, year) %>%                     # group by month and year
  arrange(statistic_date) %>%                   # make sure the df is sorted by date
  filter(row_number() == 1)                     # select the first row within each group
# A tibble: 4 x 5
# Groups: month, year [4]
# statistic_date index_value funds_number month year
# <date> <dbl> <int> <dbl> <dbl>
#1 2013-01-01 1000 0 1 2013
#2 2013-02-01 1150. 22 2 2013
#3 2013-03-01 1149. 20 3 2013
#4 2013-04-03 1102. 17 4 2013
First make statistic_date a Date:
df$statistic_date <- as.Date(df$statistic_date)
Then you can use nth_day() to find the first day of every month in statistic_date.
library("datetimeutils")
dates <- nth_day(df$statistic_date, period = "month", n = "first")
## [1] "2013-01-01" "2013-02-01" "2013-03-01" "2013-04-03"
df[statistic_date %in% dates]
## statistic_date index_value funds_number
## 1: 2013-01-01 1000.000 0
## 2: 2013-02-01 1150.479 22
## 3: 2013-03-01 1148.826 20
## 4: 2013-04-03 1101.897 17
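Since the data is already read with data.table::fread(), a grouped data.table version is another option. A sketch (it assumes statistic_date has already been converted to Date as above):
library(data.table)
# first row per (year, month), after ordering by date
df[order(statistic_date), .SD[1],
   by = .(year = year(statistic_date), month = month(statistic_date))]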

Summarize R data frame based on a date range in a second data frame

I have two data frames, one that includes data by day, and one that includes data by irregular time multi-day intervals. For example:
A data frame precip_range with precipitation data by irregular time intervals:
start_date<-as.Date(c("2010-11-01", "2010-11-04", "2010-11-10"))
end_date<-as.Date(c("2010-11-03", "2010-11-09", "2010-11-12"))
precipitation<-(c(12, 8, 14))
precip_range<-data.frame(start_date, end_date, precipitation)
And a data frame precip_daily with daily precipitation data:
day<-as.Date(c("2010-11-01", "2010-11-02", "2010-11-03", "2010-11-04", "2010-11-05",
"2010-11-06", "2010-11-07", "2010-11-08", "2010-11-09", "2010-11-10",
"2010-11-11", "2010-11-12"))
precip<-(c(3, 1, 2, 1, 0.25, 1, 3, 0.33, 0.75, 0.5, 1, 2))
precip_daily<-data.frame(day, precip)
In this example, precip_daily represents daily precipitation estimated by a model and precip_range represents measured cumulative precipitation for specific date ranges. I am trying to compare modeled to measured data, which requires synchronizing the time periods.
So, I want to summarize the precip column in data frame precip_daily (count of observations and sum of precip) over the date ranges between start_date and end_date in the data frame precip_range. Any thoughts on the best way to do this?
You can use the start_date values from precip_range as breaks to cut() to group your daily values. For example:
rng <- cut(precip_daily$day,
           breaks = c(precip_range$start_date, max(precip_range$end_date)),
           include.lowest = TRUE)
Here we cut the values in precip_daily$day using the start dates in the range data.frame, making sure to include the lowest value and to stop at the largest end value. If we merge that with the daily values we see
cbind(precip_daily, rng)
# day precip rng
# 1 2010-11-01 3.00 2010-11-01
# 2 2010-11-02 1.00 2010-11-01
# 3 2010-11-03 2.00 2010-11-01
# 4 2010-11-04 1.00 2010-11-04
# 5 2010-11-05 0.25 2010-11-04
# 6 2010-11-06 1.00 2010-11-04
# 7 2010-11-07 3.00 2010-11-04
# 8 2010-11-08 0.33 2010-11-04
# 9 2010-11-09 0.75 2010-11-04
# 10 2010-11-10 0.50 2010-11-10
# 11 2010-11-11 1.00 2010-11-10
# 12 2010-11-12 2.00 2010-11-10
which shows that the values have been grouped. Then we can do
aggregate(cbind(count=1, sum=precip_daily$precip)~rng, FUN=sum)
# rng count sum
# 1 2010-11-01 3 6.00
# 2 2010-11-04 6 6.33
# 3 2010-11-10 3 3.50
to get the count and total for each of those ranges (labeled with their start date).
Or, with zoo and data.table:
library(zoo)
library(data.table)
temp <- merge(precip_daily, precip_range, by.x = "day", by.y = "start_date", all.x = T)
temp$end_date <- na.locf(temp$end_date)
setDT(temp)[, list(Sum = sum(precip), Count = .N), by = end_date]
## end_date Sum Count
## 1: 2010-11-03 6.00 3
## 2: 2010-11-09 6.33 6
## 3: 2010-11-12 3.50 3
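data.table's non-equi joins can also match each day to its range in one step, without filling the end dates forward. A rough sketch with the same column names:
library(data.table)
setDT(precip_daily)
setDT(precip_range)

# join each day to the range containing it, keeping the needed columns explicitly
joined <- precip_daily[precip_range,
                       on = .(day >= start_date, day <= end_date),
                       .(start_date = i.start_date, end_date = i.end_date,
                         measured = i.precipitation, precip = x.precip)]
joined[, .(count = .N, modeled_sum = sum(precip)), by = .(start_date, end_date, measured)]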

identify date format in R before converting

I have a simple data set which has a date column and a value column. I noticed that the date sometimes comes in mm/dd/yy (%m/%d/%y) format and other times in mm/dd/YYYY (%m/%d/%Y) format. What is the best way to standardize the dates so that I can do other calculations without this formatting causing issues?
I tried the answers provided here
Changing date format in R
and here
How to change multiple Date formats in same column
Neither of these were able to fix the problem.
Below is a sample of the data
Date, Market
12/17/09,1.703
12/18/09,1.700
12/21/09,1.700
12/22/09,1.590
12/23/2009,1.568
12/24/2009,1.520
12/28/2009,1.500
12/29/2009,1.450
12/30/2009,1.450
12/31/2009,1.450
1/4/2010,1.440
When I read it in using something like this
dt <- as.Date(inp$Date, format="%m/%d/%y")
I get the following output for the above segment
dt Market
2009-12-17 1.703
2009-12-18 1.700
2009-12-21 1.700
2009-12-22 1.590
2020-12-23 1.568
2020-12-24 1.520
2020-12-28 1.500
2020-12-29 1.450
2020-12-30 1.450
2020-12-31 1.450
2020-01-04 1.440
As you can see, the year jumps from 2009 to 2020 at 12/23 because of the change in formatting. Any help is appreciated. Thanks.
Strip the century from four-digit years with gsub() so every year has two digits, then parse everything with %y:
> dat$Date <- gsub("[0-9]{2}([0-9]{2})$", "\\1", dat$Date)
> dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
> dat
#          Date Market
# 1 2009-12-17 1.703
# 2 2009-12-18 1.700
# 3 2009-12-21 1.700
# 4 2009-12-22 1.590
# 5 2009-12-23 1.568
# 6 2009-12-24 1.520
# 7 2009-12-28 1.500
# 8 2009-12-29 1.450
# 9 2009-12-30 1.450
# 10 2009-12-31 1.450
# 11 2010-01-04 1.440
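Another option is a parser that accepts two- and four-digit years in the same column, e.g. lubridate's mdy(). A sketch (two-digit years are resolved with lubridate's default century cutoff, so sanity-check the result):
library(lubridate)
dat$Date <- mdy(dat$Date)   # parses both "12/17/09" and "12/23/2009"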

R Search for a particular time from index

I use an xts object. The index of the object is as below. There is one for every hour of the day for a year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
The columns contain values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 across the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.
You can do this by using interaction to create a factor from the combination of weekdays and .indexhour, then use split to select the relevant observations from your xts object.
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
# count sd
# Friday.0 60 1.0301030
# Monday.0 59 0.9204670
# Saturday.0 60 0.9842125
# Sunday.0 60 0.9500347
# Thursday.0 60 0.9506620
# Tuesday.0 59 0.8972697
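If you only need the single Monday 18:59 series from the question, you can also subset the index directly before taking sd(). A sketch against your own xts object (called obj here), assuming its timestamps fall exactly on 18:59:
# Mondays at 18:59 only (weekdays() is locale-dependent; "Monday" assumes an English locale)
mon_1859 <- obj[weekdays(index(obj)) == "Monday" & format(index(obj), "%H:%M") == "18:59"]
c(count = length(mon_1859), sd = sd(coredata(mon_1859)))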
You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
fxts[.indexmon(fxts)==0]  # it's zero-based (!) and gives you all the January values
fxts[.indexmday(fxts)==1] # beginning of month
fxts[.indexwday(fxts)==1] # Mondays
require(quantmod)
> fxts
value
2011-01-02 19:58:00 1
2011-01-02 20:59:00 2
2011-01-03 18:59:00 3
2011-01-09 19:58:00 4
2011-01-09 20:59:00 5
2011-01-10 18:59:00 6
2011-01-16 18:59:00 7
2011-01-16 19:58:00 8
2011-01-16 20:59:00 9
fxts[.indexwday(fxts)==1] #this gives you all the Mondays
For subsetting by time of day you use
fxts["T19:30/T20:00"] # this will give you the time period you are looking for
and here is how you combine the weekday and the time period:
fxts["T18:30/T20:00"] & fxts[.indexwday(fxts)==1] # to get a logical vector or
fxts["T18:30/T21:00"][.indexwday(fxts["T18:30/T21:00"])==1] # to get the values
> value
2011-01-03 18:59:00 3
2011-01-10 18:59:00 6
