I am working in R. I have a date sequence and I would like to label each date as either term time or a school holiday. I plan to do this with a data frame column in which each row is labelled "Holiday" or "Term".
My approach is to create a time sequence and specify each holiday individually as a vector: the first element is the start date of the holiday and the second is its end date.
I then use an if statement with an "or" condition to test whether each time lies within any of the holiday date ranges specified in the vectors.
Here is my code so far:
start <- as.POSIXlt("2015-10-10 00:00:00")
end <- as.POSIXlt("2016-03-31 00:00:00")
DateSeq <- seq(from=start, to=end, by="mins")
# Each holiday is defined as a vector: start date, then end date
H1 <- c("2015-10-26", "2015-11-3") #October half term
H2 <- c("2015-12-16", "2016-01-05") #Christmas holiday
H3 <- c("2016-02-15", "2016-02-19") #Feb half term
H4 <- c("2016-03-24", "2016-03-31") #Easter holiday
date_table <- data.frame(Time = DateSeq)
if ((round(date_table$Time, units = "days")== H1[1] <> H1[2]) | (round(date_table$Time, units = "days") == H2[1] <> H2[2])) {
date_table$Holiday <- "Holiday"
} else {
date_table$Holiday <- "Term"
}
As you can see this code does not work and simply labels all the rows as "Term".
Therefore I am wondering the following:
How I can specify the date range in the holiday vectors so it can be used within the if statement?
Whether this is the best approach to use? I am quite new to R and I was thinking of alternatives such as creating a set of individual sequences for term and holidays and then stitching them together. This approach however seems fiddly but would appreciate your thoughts.
Thanks for your help.
This is similar to your approach but avoids if-else.
# Or alternatively by="mins"
DateSeq <- seq(from=start, to=end, by="hours")
date_table <- data.frame(Time = DateSeq)
# Put all holidays together
H<-rbind(H1,H2,H3,H4)
# Or alternatively set to "Term"
date_table$Holiday<-0
# Assign all the holidays
for (i in 1:nrow(H)){
date_table[date_table$Time > H[i,1] & date_table$Time < H[i,2],"Holiday"]<-1 # or "Holiday"
}
# Check they are correctly assigned
plot(date_table)
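One thing to watch in the loop above: the strict comparisons against midnight of the end date leave out the whole final day of each holiday. The following is only a minimal sketch of one way to make both ends inclusive. It assumes that the holidays are meant to include their start and end dates and that UTC is an acceptable time zone; the holidays data frame and the is_holiday vector are names introduced here for illustration.

```r
# Sketch: label each timestamp by comparing its calendar day with
# inclusive holiday ranges (assumption: both end points count as holiday).
start <- as.POSIXct("2015-10-10 00:00:00", tz = "UTC")
end   <- as.POSIXct("2016-03-31 00:00:00", tz = "UTC")
date_table <- data.frame(Time = seq(from = start, to = end, by = "hours"))

# Hypothetical helper table: one row per holiday
holidays <- data.frame(
  from = as.Date(c("2015-10-26", "2015-12-16", "2016-02-15", "2016-03-24")),
  to   = as.Date(c("2015-11-03", "2016-01-05", "2016-02-19", "2016-03-31"))
)

# Reduce each timestamp to its calendar day before comparing
day <- as.Date(date_table$Time, tz = "UTC")

is_holiday <- rep(FALSE, length(day))
for (i in seq_len(nrow(holidays))) {
  is_holiday <- is_holiday | (day >= holidays$from[i] & day <= holidays$to[i])
}

date_table$Holiday <- ifelse(is_holiday, "Holiday", "Term")
table(date_table$Holiday)
```

Comparing calendar days rather than timestamps means that every hour of the last holiday day is labelled "Holiday".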
You could use cut() to cut DateSeq into term and holiday.
res <- cut(DateSeq, breaks = as.POSIXlt( c(H1, H2, H3, H4), format = "%Y-%m-%d"), labels = c("October", "term1", "Christmas", "term2", "Feb", "Term3", "Easter"))
table(res)
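If only a two-level "Holiday"/"Term" label is needed, the same interval idea can be sketched with findInterval() from base R. This is an illustration rather than part of the answer above: it assumes daily data and inclusive end dates, and the names starts, ends, breaks and label are introduced here.

```r
# Sketch: interleave holiday starts and (end + 1 day) as break points,
# then odd interval numbers are holidays and even ones are term time.
day <- seq(as.Date("2015-10-10"), as.Date("2016-03-31"), by = "day")

starts <- as.Date(c("2015-10-26", "2015-12-16", "2016-02-15", "2016-03-24"))
ends   <- as.Date(c("2015-11-03", "2016-01-05", "2016-02-19", "2016-03-31"))

# start1, end1 + 1, start2, end2 + 1, ... as plain day counts
breaks <- as.vector(rbind(as.numeric(starts), as.numeric(ends) + 1))

# 0 = before the first holiday, 1 = inside holiday 1, 2 = term after it, ...
idx <- findInterval(as.numeric(day), breaks)
label <- ifelse(idx %% 2 == 1, "Holiday", "Term")

table(label)
```

Adding one day to each end date turns the inclusive ranges into the half-open intervals that findInterval() works with.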
I'm creating a function but I need some help with best practices.
Active.Test <- function(date) {
date <- rep(date,length(df$Start.Date))
active <- rep(0,length(df$Start.Date))
active[date > df$Start.Date & date < df$End.Date] <- 1
active[df$Start.Date == df$End.Date ] <- df$Active.Time
return (active)
}
I basically want to check if a date (which is passed to the function) is between the start and end date in my data frame. If it is, assign a 1. If the start and end dates are equal, get the result from the same row in Active.Time column. Everything else has a default value of 0.
This returns an error as it's retrieving a vector which is of a different size for the second test.
I can re-write the above as:
Active.Test <- function(date) {
date <- rep(date,length(df$Start.Date))
active <- rep(0,length(df$Start.Date))
active[date > df$Start.Date & date < df$End.Date] <- 1
active[df$Start.Date == df$End.Date] <- df$Active.Time[df$Start.Date == df$End.Date]
return (active)
}
This now gets the correct elements from the Active.Time column, but it doesn't seem to be an elegant way to write it. I'm also guessing it's slower, as I'm performing the same check twice.
Could you please help me re-write this using best practices?
EDIT: Here's some code to create a few rows of data and then test the function by checking whether the start and end dates encompass 25/05/2016.
#Create a data frame
df <- data.frame(End.Date = as.Date(c("1/05/2016","28/05/2016", "25/05/2016"), format = "%d/%m/%Y"), Start.Date = as.Date(c("20/04/2016 11:00","20/05/2016 23:00", "25/05/2016 10:00"), format = "%d/%m/%Y" ), Active.Time = as.numeric(c(0.5,0.4,0.8)))
#Test the function
df$new <- Active.Test(as.Date("25/05/2016", format = "%d/%m/%Y"))
Thanks
# Using the data.table approach
library(data.table)
# Make data table instead of data.frame (you can also do as.data.table(df) to get a data.table)
my_dt <- data.table(Start.Date=as.Date(c("20/04/2016 11:00","20/05/2016 23:00", "25/05/2016 10:00"), format = "%d/%m/%Y" ),
End.Date=as.Date(c("1/05/2016","28/05/2016", "25/05/2016"), format = "%d/%m/%Y"),
Active.Time = as.numeric(c(0.5,0.4,0.8))
)
setkey(my_dt)
# Sample date to test
datte <- as.Date("25/05/2016", format = "%d/%m/%Y")
# Create function with conditions and result to return
Active.Test <- function(datte, Start.Date, End.Date, Active.Time) {
if(datte > Start.Date & datte < End.Date){
return(1)
}
else if(Start.Date==End.Date){
return(Active.Time)
}
else{return(0)}
}
# Test function
my_dt[, res:=Active.Test(datte, Start.Date, End.Date, Active.Time), by=1:nrow(my_dt)]
See data.table vignette for more on data.table. Also, in your function above, note the warning you get when you run df$new <- Active.Test(as.Date("25/05/2016", format = "%d/%m/%Y"))!
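For comparison, a fully vectorised base R version is also possible. The sketch below is only one way to do it, under the assumption that rows whose start and end dates are equal should take their Active.Time value regardless of the test date (as in the question's second version). The function name active_test and its data argument are introduced here; passing the data frame in explicitly avoids relying on a global df.

```r
# Example data (times dropped, since as.Date() ignores them anyway)
df <- data.frame(
  Start.Date  = as.Date(c("20/04/2016", "20/05/2016", "25/05/2016"), format = "%d/%m/%Y"),
  End.Date    = as.Date(c("1/05/2016", "28/05/2016", "25/05/2016"), format = "%d/%m/%Y"),
  Active.Time = c(0.5, 0.4, 0.8)
)

# Hypothetical vectorised replacement for Active.Test
active_test <- function(date, data) {
  same_day <- data$Start.Date == data$End.Date           # computed once
  inside   <- date > data$Start.Date & date < data$End.Date
  # same-day rows take Active.Time; otherwise 1 if inside the range, else 0
  ifelse(same_day, data$Active.Time, as.numeric(inside))
}

df$new <- active_test(as.Date("25/05/2016", format = "%d/%m/%Y"), df)
df
```

The equality check is evaluated once and reused, and no rep() calls are needed because the comparisons recycle the single test date automatically.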
I have a sequence of dates in R, and for each date I need to get the year, month, and day. I tried to use the strftime function to print out the year, but R behaves very strangely. This code fails:
# sequence of dates
dates <- seq(as.Date("1987-03-29"), as.Date("1991-12-31"), by=1)
# this fails with "'origin' must be supplied" error:
for (d in dates) {
year <- strftime(d, "%Y")
print(year)
}
The exact error message is: Error in as.POSIXlt.numeric(x, tz = tz) : 'origin' must be supplied
On the other hand, this code works without any error:
# sequence of dates
dates <- seq(as.Date("1987-03-29"), as.Date("1991-12-31"), by=1)
# this works
for (i in 1: length(dates)) {
year <- strftime(dates[i], "%Y")
print(year)
}
Why does the first example fail while the second example works? I suspect that in the first example R is trying to convert my date to some kind of POSIXct object and in the second example it doesn't? I'm confused why there's any difference and I'd appreciate an explanation of what's going on. I'm using R version 3.2.2.
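The for loop drops the "Date" class, so d is a plain number (days since 1970-01-01) and strftime cannot convert it without an origin. Here are two approaches.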
Below the comments were removed and only the code lines marked ## have been changed.
1) list. Iterate over a list, which keeps the class of each element:
dates <- seq(as.Date("1987-03-29"), as.Date("1991-12-31"), by=1)
for (d in as.list(dates)) { ##
year <- strftime(d, "%Y")
print(year)
}
2) as.Date. Alternatively, convert d back to "Date" class:
dates <- seq(as.Date("1987-03-29"), as.Date("1991-12-31"), by=1)
for (d in dates) {
year <- strftime(as.Date(d, origin = "1970-01-01"), "%Y") ##
print(year)
}
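If the loop is only there to pull out the year, month and day, it can be avoided altogether, because format() is vectorised over Date vectors. A minimal sketch (the data frame name parts is introduced here for illustration):

```r
dates <- seq(as.Date("1987-03-29"), as.Date("1991-12-31"), by = 1)

# Extract all components at once, without a loop
parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)
head(parts)

# Inside a for loop the element has lost its "Date" class
for (d in dates[1]) print(class(d))
```

The last line shows why the original loop failed: the loop variable is a bare number rather than a Date.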
I have been trying to find percentage price jumps (a change of +/-15% or more) in a series of closing prices and, for each jump, the corresponding date. Once the first jump is found, the code should break out of the inner for loop and store the corresponding data in the temp.cyc data frame.
The program does not raise any error, but it repeats the same number throughout, and when I check the value of test.df it shows NA.
Can anyone help me understand what's going on? I am new to R, so it would be helpful if you could give your answer in detail. Thank you :)
# Defining variables
row= nrow(price.close)
col=ncol(price.close)
#Defining Matrix
m<-matrix(0,ncol=1,nrow=row)
p<-matrix(0,ncol=5,nrow=row)
# Vector to temporarily store the percentage change
test.df<- vector(mode="numeric", length=nrow(price.close))
# Dataframe to extract required Values
temp.cyc<-as.data.frame(p)
colnames(temp.cyc)<-c("cyc.duration","Start.date","End.date","Start.date.value","End.date.value")
for( j in 1:row)
{
for(i in j:row)
{
test.df<-(price.close[(i+1),2]-price.close[j,2])/price.close[j,2]
if(test.df >= 0.15 | test.df <= -0.15 | is.na(test.df)== TRUE )
{
temp.cyc$Start.date.value = price.close[j,2]
temp.cyc$End.date.value <- price.close[i,2]
temp.cyc$Start.date <- price.close[j,1]
temp.cyc$End.date <- price.close[i,1]
}
break
}
}
It seems to me you are using this for financial data, e.g. stock prices. If that assumption is right, I suggest using a package that already has this functionality, such as quantmod.
Here's a short example of how to get the dates when the price goes up or down by more than 15%.
library(quantmod)
# create some dummy stock data over 10 days period
# next time I hope you will attach some of your data
stockClose <- c(100,50,75,70,68,100,115,120,130,100)
stockDates <- seq(as.Date("2014-01-01"),length=10,by="+1 days")
stock.xts <- as.xts(stockClose,stockDates)
# calculate change , check '?Delt' help for more info
change <- Delt(stock.xts)
# keep only the rows where the price change, in either direction, is greater than 15%
specialDays <- change[coredata(change) < -0.15 | coredata(change) > 0.15,]
#get dates
justDates <- index(specialDays)
which gives us "2014-01-02" ,"2014-01-03" ,"2014-01-06" and "2014-01-10"
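For readers without quantmod, the same day-over-day calculation can be sketched in base R. This is an illustrative equivalent on the same dummy data, not the package's implementation; the names change and jump are introduced here.

```r
stockClose <- c(100, 50, 75, 70, 68, 100, 115, 120, 130, 100)
stockDates <- seq(as.Date("2014-01-01"), by = "day", length.out = 10)

# Relative change from the previous day; the first day has no predecessor
change <- c(NA, diff(stockClose) / head(stockClose, -1))

# Flag moves strictly larger than 15% in either direction, skipping the NA
jump <- !is.na(change) & abs(change) > 0.15

stockDates[jump]
```

Note that the move from 100 to 115 is exactly 15% and is therefore not flagged by the strict comparison.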
If you actually want to compare all possible combinations of entry and exit points for a position, you can use something like this:
library(quantmod)
calculatePeriods <- function(){
stockClose <- c(100,50,75,70,68,100,115,120,130,100)
stockDates <- seq(as.Date("2014-01-01"),length=10,by="+1 days")
stock.xts <- as.xts(stockClose,stockDates)
# you will be adding rows to this df
result <- data.frame()
for(i in 1:(length(stock.xts)-1)){
for(j in (i+1):length(stock.xts)){ # only exits that come after the entry
change <- (coredata(stock.xts[j])-coredata(stock.xts[i]))/coredata(stock.xts[i])
if(change < (-0.15) | change > (0.15)){
row <- data.frame("cyc.duration"=as.numeric(index(stock.xts[j])-index(stock.xts[i]),units="days"),"Start.date"=index(stock.xts[i]),"End.date"=index(stock.xts[j]),"Start.date.value"=coredata(stock.xts[i]),"End.date.value"=coredata(stock.xts[j]))
result <- rbind(result,row)
}
}
}
return(result)
}
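The original question seems to want something slightly different: for each start date, only the first later date at which the price has moved by 15% or more. The following base R sketch works under that reading of the question. The function name first_jump, its arguments and the inclusive threshold are assumptions made for illustration; the column names follow the question.

```r
# Sketch: for every start index j, find the first later index i whose
# price differs from price[j] by at least `threshold` (relative change).
first_jump <- function(price, dates, threshold = 0.15) {
  n <- length(price)
  rows <- vector("list", n)
  for (j in seq_len(n - 1)) {
    later  <- (j + 1):n
    change <- (price[later] - price[j]) / price[j]
    hit    <- which(abs(change) >= threshold)
    if (length(hit) > 0) {
      i <- later[hit[1]]                      # first qualifying date only
      rows[[j]] <- data.frame(
        cyc.duration     = as.numeric(dates[i] - dates[j]),  # in days
        Start.date       = dates[j],
        End.date         = dates[i],
        Start.date.value = price[j],
        End.date.value   = price[i]
      )
    }
  }
  # Drop start dates that never reach the threshold, then stack the rest
  do.call(rbind, Filter(Negate(is.null), rows))
}

stockClose <- c(100, 50, 75, 70, 68, 100, 115, 120, 130, 100)
stockDates <- seq(as.Date("2014-01-01"), by = "day", length.out = 10)
first_jump(stockClose, stockDates)
```

Collecting the rows in a list and binding them once at the end avoids growing a data frame inside the loop.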
I have a really odd issue... I am using the to.weekly and to.period function to convert a daily xts object to weekly data. In most instances, I get the week-ending date as a Friday (day.of.week function will return 5) (e.g. "2010-01-08", "2011-02-11"), but there are a few cases where I get something other than Friday (Saturday/Sunday/Thursday/etc.)
I have tried to.weekly and to.period(x, period = 'weeks') and both return the same problem.
Why is this happening? Is there a work-around for this??
Thanks!!
[EDIT: EXAMPLE BELOW]
test.dates <- as.Date(c("2010-04-27","2010-04-28","2010-04-29","2010-04-30","2010-05-03","2010-05-04","2010-05-05","2010-05-06","2010-05-07","2010-05-10","2010-05-11","2010-05-12","2010-05-13","2010-05-14","2010-05-17","2010-05-18","2010-05-19","2010-05-20","2010-05-21","2010-05-22","2010-05-24","2010-05-25","2010-05-26","2010-05-27","2010-05-28","2010-06-01","2010-06-02","2010-06-03","2010-06-04"))
test.data <- rnorm(length(test.dates),mean=1,sd=2)
test.xts <- xts(x=test.data,order.by=test.dates)
#Function that takes in a vector of zoo/xts objects (e.g. "2010-01-08") and returns the day of the week for each
dayofweek <- function(x) {
placeholder <- vector("list",length=length(x))
names(placeholder) <- x
for(i in 1:length(x)) {placeholder[[i]] <- month.day.year(x[i])}
placeholder2 <- rep(NA,times=length(x))
for(i in 1:length(x)) {placeholder2[i] <- day.of.week(placeholder[[i]][[1]],placeholder[[i]][[2]],placeholder[[i]][[3]])}
return(placeholder2)}
This returns the date(s) that are not Friday: time(to.weekly(test.xts))[dayofweek(time(to.weekly(test.xts))) != 5]
You have 2 problems with your example:
Your dayofweek function is a bit cumbersome, and probably incorrect in its results.
Your example dates are missing some dates, such as 05-23-2010.
Here is a cleaned-up version of your code:
library(xts)
test.dates <- as.Date(c("2010-04-27","2010-04-28","2010-04-29","2010-04-30","2010-05-03","2010-05-04","2010-05-05","2010-05-06","2010-05-07","2010-05-10","2010-05-11","2010-05-12","2010-05-13","2010-05-14","2010-05-17","2010-05-18","2010-05-19","2010-05-20","2010-05-21","2010-05-22","2010-05-24","2010-05-25","2010-05-26","2010-05-27","2010-05-28","2010-06-01","2010-06-02","2010-06-03","2010-06-04"))
test.data <- rnorm(length(test.dates),mean=1,sd=2)
test.xts <- xts(x=test.data,order.by=test.dates)
test.weekly <- to.weekly(test.xts)
library(lubridate)
test.weekly[wday(test.weekly, label = TRUE, abbr = TRUE) != "Fri"]
The only row this returns is:
test.xts.Open test.xts.High test.xts.Low test.xts.Close
2010-05-22 -1.705749 1.273982 -2.084203 -1.502611
The problem, of course, is that this week ends on 05-23-2010 (a Sunday), but that date is not present in the time series. Therefore, to.weekly uses the last available date in that week, 05-22-2010 (a Saturday), as the end point. This is the source of your problem.
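The effect can be reproduced in base R without xts. The sketch below is only an illustration of the idea: it assumes that weeks run from Monday to Sunday and that each week is labelled by its last available observation, and the names week_id and week_end are introduced here.

```r
# Two weeks from the example; the first one contains a Saturday observation
test.dates <- as.Date(c("2010-05-17", "2010-05-18", "2010-05-19", "2010-05-20",
                        "2010-05-21", "2010-05-22", "2010-05-24", "2010-05-25",
                        "2010-05-26", "2010-05-27", "2010-05-28"))

# 1970-01-01 was a Thursday, so shifting by 3 days makes weeks start on Monday
week_id <- (as.integer(test.dates) + 3L) %/% 7L

# Last available observation in each week (dates are already sorted)
week_end <- test.dates[!duplicated(week_id, fromLast = TRUE)]

# POSIXlt weekday: 0 = Sunday, ..., 5 = Friday, 6 = Saturday
week_end[as.POSIXlt(week_end)$wday != 5]
```

Only the week that contains the Saturday observation ends on a non-Friday; removing 2010-05-22 from the input makes both weeks end on a Friday.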
Here is a better example, which reveals no issue with the to.weekly function.
library(lubridate); library(xts)
test.dates <- seq(as.Date("1900-01-01"),as.Date("2011-10-01"),by='days')
test.dates <- test.dates[wday(test.dates)!=1 & wday(test.dates)!=7] #Remove weekends
test.data <- rnorm(length(test.dates),mean=1,sd=2)
test.xts <- xts(x=test.data,order.by=test.dates)
test.weekly <- to.weekly(test.xts)
test.weekly[wday(test.weekly, label = TRUE, abbr = TRUE) != "Fri"]
The following function does work, but the last as.Date part was more or less the result of trial and error that I do not fully understand.
### This function creates a real date column out of year / period that is saved in
### in separate columns, plus it handles a 13th period in case of overlapping period
### terminology. Turns quarters into months.
realDate <- function (table,year="year_col",period="period_col"){
if (is.character(table) == TRUE)
{
dframe <- get(table)
}
else{
dframe <- table
}
x <- expression({resDate <- with(dframe,
as.Date(paste(get(year),"-",
ifelse(get(period) > 9, get(period),
paste("0", get(period), sep = "")),
"-01", sep = "")))
})
y <- expression({resDate <- with(dframe,as.Date(paste(get(year) + 1,"-","01","-01",sep="")))})
#### I do not get this? Why do I have to do this?
a <- ifelse(get(period) == 13,eval(y),eval(x))
a <-as.Date(a, origin="1970-01-01")
return(a)
}
Instead I tried to do it like this (because it was more intuitive to me):
{ ....
ifelse(get(period) == 13,eval(y),eval(x))
return(resDate)
}
This returned the correct values whenever the condition was FALSE (no) but returned NA if the condition was TRUE (yes). Why is that? And if I use the function above, why do I have to define the origin again? Why do I even have to call as.Date again?
EDIT:
a <- rep(2002:2010,2)
b <- rep(1:13,2)
d<-cbind(a,b[1:length(a)])
names(d) <- c("year_col","period_col")
P.S.:
I found this thread on vectorized ifelse.
Your construct is "interesting" at least. To start with, neither x nor y gives output. I wonder why you use an assignment in your eval(). This gives you a resDate vector that holds whatever the last evaluated call assigned. That does not depend on the condition: each expression is evaluated in full for the whole vector, not element by element, and eval(x) is evaluated last, so resDate ends up holding the result of eval(x).
Plus, the output you get is the numeric representation of your dates, not a Date object; the Date object is in resDate. I guess that ifelse cannot determine the class of the output vector as you use the eval() inside. I'm surprised you get output at all; in fact you're effectively using something that could be called a "bug" in R (Microsoft would call it a feature :-) ).
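The loss of the Date class can be shown in isolation. As documented in ?ifelse, the result takes its attributes from the test argument, not from yes or no, so dates come back as plain numbers. A minimal sketch with made-up values (x, y and cond are names introduced here):

```r
x <- as.Date("2010-01-01")
y <- as.Date("2011-01-01")
cond <- c(TRUE, FALSE)

# ifelse() returns the underlying day counts, without the "Date" class
res <- ifelse(cond, x, y)
class(res)

# Option 1: restore the class afterwards, which is what the
# as.Date(a, origin = "1970-01-01") call in the question does
as.Date(res, origin = "1970-01-01")

# Option 2: index assignment, which keeps the class throughout
out <- rep(y, length(cond))
out[cond] <- x
out
```

This is why the extra as.Date call with an origin was needed in the question, and why assigning by index sidesteps the issue.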
Your mistake is in your ifelse: get(period) doesn't exist. It should be get(period, dframe). Then it works. The only reason it works on your computer is presumably that the object it looks up happens to exist in your workspace. A classic problem when debugging.
In any case, I'd make it:
realDate <- function (table,year="year_col",period="period_col"){
if (is.character(table)){ # is.character(table) returns a boolean already.
dframe <- get(table)
} else {
dframe <- table
}
year <- get(year,dframe)
period <- get(period,dframe)
year[period==13] <- year[period==13]+1
period[period==13] <- 1
as.Date(paste(year,"-",period,"-01",sep=""))
}
This is quite a bit faster than your own, has fewer pitfalls and conversions, and is more the R way of doing it. You could replace the year[...] and period[...] assignments with ifelse constructs, but using indices is generally faster.
EDIT :
This is easier for the data generation:
dframe <- data.frame(
year_col= rep(2006:2007,each=13),
period_col = rep(1:13,2)
)
realDate(dframe)
 [1] "2006-01-01" "2006-02-01" "2006-03-01" "2006-04-01" "2006-05-01" "2006-06-01" "2006-07-01" "2006-08-01" "2006-09-01"
[10] "2006-10-01" "2006-11-01" "2006-12-01" "2007-01-01" "2007-01-01" "2007-02-01" "2007-03-01" "2007-04-01" "2007-05-01"
[19] "2007-06-01" "2007-07-01" "2007-08-01" "2007-09-01" "2007-10-01" "2007-11-01" "2007-12-01" "2008-01-01"