R: Pad dates with weekends or bank holidays - r

Assume the following dataset. I get closing prices for all working days, but there are no rows for dates without an observation (weekends and bank holidays). How can I add a row for every calendar day all the way to the present? I need this because I want to average by week, and variable-length time windows make that impossible.
Here is my code:
library(quantmod)
from="2012-09-01"
sym = c("BARC")
prices = Map(function(n)
{
print(n)
tryCatch(getSymbols(n, src="google", env=NULL, from=from)[, 4], error =
function(e) NA)
}, sym)
N = length(prices)
# identify symbols returning valid data
i = ! unlist(Map(function(i) is.na(prices[i]), seq(N)))
# combine returned prices list into a matrix, one column for each symbol
prices = Reduce(cbind, prices[i])
colnames(prices) = sym[i]
If you see the "prices" data frame you will see the point I am making.

You can create a blank xts with all the dates first, and then merge with your prices object.
full_dates <- xts(,order.by = seq(from = start(prices), to = end(prices), by= "day"))
full_prices <- merge(full_dates,prices, all = TRUE)
You can also fill the missing prices forward with the last observed value:
na.locf(full_prices)
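As a self-contained illustration of the steps above, here is a minimal sketch on made-up prices (the dates and values are invented for the example, not taken from the question's data). It pads the series to every calendar day, fills forward, and then averages by week with apply.weekly() from xts:

```r
library(xts)

# Toy closing prices with gaps (a weekend and one missing weekday); values are invented
prices <- xts(c(10, 11, 12, 13),
              order.by = as.Date(c("2020-01-02", "2020-01-03",
                                   "2020-01-06", "2020-01-08")))
colnames(prices) <- "BARC"

# Empty xts holding every calendar day between the first and last observation
full_dates <- xts(, order.by = seq(start(prices), end(prices), by = "day"))

# Outer merge: dates without an observation get NA
full_prices <- merge(full_dates, prices, all = TRUE)

# Carry the last observed price forward into the padded days
filled <- na.locf(full_prices)

# Weekly mean over the now regular daily series
weekly_mean <- apply.weekly(filled, mean)
```

Whether filling forward before averaging is appropriate depends on the analysis, since it weights a Friday close three times.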

Related

Counting the number of days excluding Sundays between two dates and creating a new column in R DataFrame

I have a data.frame in R that includes two variables, a Start-Date and an End-Date. I would like to add a new column with the number of days between the two dates, reduced by the number of Sundays in each interval. I tried it as below, but it doesn't work:
Data$Start <- as.Date(Data$Start, "%d.%m.%y")
Data$End <- as.Date(Data$End,"%d.%m.%y")
interval <- difftime(Data$Start, Data$End, units = "days")
sundays <- seq(from = Data$Start, to = Data$End, by = "days")
number.sundays <- length(which(wday(sundays)==1))
Data$DaysAhead <- interval - number.sundays
I get an error message from the seq() function saying that the argument has to have length 1, but I don't understand how to handle this. Can someone help me out with that?
Here's an example that works:
Data <- data.frame(
Start = c("01.01.2020", "01.06.2020"),
End = c("01.03.2020", "01.09.2020")
)
Data$Start <- as.Date(Data$Start, "%d.%m.%Y")
Data$End <- as.Date(Data$End,"%d.%m.%Y")
interval <- difftime(Data$End, Data$Start, units = "days")
sundays <- lapply(1:nrow(Data), function(i)seq(from = Data$Start[i], to = Data$End[i], by = "days"))
number.sundays <- sapply(sundays, function(x)length(which(lubridate::wday(x)==1)))
Data$DaysAhead <- interval - number.sundays
The problem is that seq() isn't vectorized: it assumes a single start and a single end point. If you put it inside a loop (like lapply()), it will work and generate the relevant sequence for each start and end time. Then you can use sapply() to count the Sundays in each sequence; since each returned value is a scalar, the result of sapply() is a vector of the same length as interval.
I realized with an updated data set that there's a problem with the solution above when Start-Date and End-Date aren't in the same year. I still want to count the days except Sundays, for example from 20.12.2020 until 10.01.2021. The error message in that case says that the sign of the "by" argument is wrong. I just can't manage to get it running. If I swap the dates, the output makes no sense and the number of days is too high. What do I have to do to get this running over the year-end?
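One possible cause, assuming the updated data uses four-digit years as in 20.12.2020: the format "%d.%m.%y" reads only two digits for the year, so 10.01.2021 would be parsed as 2020-01-10, which puts End before Start and makes seq() complain about the sign of "by". A minimal sketch with "%Y", using the example dates from the follow-up (the helper name count_days_ahead is made up for illustration):

```r
Data <- data.frame(Start = "20.12.2020", End = "10.01.2021")

# %Y parses four-digit years; %y would read only "20" and yield the year 2020
Data$Start <- as.Date(Data$Start, "%d.%m.%Y")
Data$End   <- as.Date(Data$End,   "%d.%m.%Y")

# Hypothetical helper: days between two dates minus the Sundays in the interval
count_days_ahead <- function(start, end) {
  days <- seq(from = start, to = end, by = "day")
  n_sundays <- sum(as.POSIXlt(days)$wday == 0)  # wday 0 is Sunday in base R
  as.numeric(end - start) - n_sundays
}

# Loop over rows by index, because seq() needs a single start and end
Data$DaysAhead <- vapply(seq_len(nrow(Data)),
                         function(i) count_days_ahead(Data$Start[i], Data$End[i]),
                         numeric(1))
```

For this interval there are 21 days between the dates and four Sundays (20 and 27 December, 3 and 10 January), so DaysAhead is 17.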

is there a way to look for same values by month by month in R?

I'll start by showing a part of my dataframe.
So for each patient I have a combination of molecules, the date he/she started it, and the theoretical date he/she finishes the treatment.
What I need to do is flag the patients who stay on the same combination of molecules for at least six consecutive months.
I already found a way to create sequences of dates between two dates and to create a new df with them, but I don't know how to do the last part in R: getting the patients who are on the same treatment for six consecutive months.
Thanks in advance for the help.
Here is an attempt to solve your question. I think it's also possible with an apply function, but I haven't found how yet.
library(lubridate)
df = data.frame(id = c(1,2), traitement = c("A+B+C","D+E"),
begin = c('01/03/2012','01/01/2012'),
end = c('05/06/2012','05/07/2012'),
stringsAsFactors = FALSE)
df$begin = dmy(df$begin)
df$end = dmy(df$end)
df$time = 0
df$toolong = FALSE
for(i in 1:nrow(df)){
df$time[i] = month(as.period(interval(df$begin[i],df$end[i]))) # months component of the treatment period
if(df$time[i] >= 6){df$toolong[i] = TRUE} # flag patients who stay on the treatment for six months or more
}
df
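The loop above can probably be avoided, since lubridate intervals are vectorized. The sketch below is one way to do it, not the answerer's method: it uses integer division of an interval by months(1) to count whole months. Note that month(as.period(...)) returns only the months component, so a treatment of one year and two months would give 2, whereas the division below would give 14.

```r
library(lubridate)

df <- data.frame(id = c(1, 2), traitement = c("A+B+C", "D+E"),
                 begin = dmy(c("01/03/2012", "01/01/2012")),
                 end   = dmy(c("05/06/2012", "05/07/2012")),
                 stringsAsFactors = FALSE)

# Whole months between begin and end, computed for all rows at once
df$time <- interval(df$begin, df$end) %/% months(1)

# TRUE for patients who stay on the same combination for six months or more
df$toolong <- df$time >= 6
df
```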

Periodicity for overnight stock data

I frequently use to.daily to convert 1 min OHLC data to a daily format but am trying to find a way to do the same with overnight data. I was hoping to see the option to specify what time a "day" starts and ends but didn't see that.
Overnight session being 18:00 to 09:30.
Does anyone have a simple way to do this?
You could use time-of-day subsetting with which.i = TRUE to find all of the observations you don't want. Then subset the original data with the negative of the result, so all the non-overnight observations will be dropped.
# assume data are in a xts object named 'x'
DayObs <- x["T09:30/T18:30", which.i = TRUE]
Overnight <- x[-DayObs,]
You might need to change the start and end times in the time-of-day subset call.
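A small runnable sketch of this subsetting on made-up hourly data (the index and values are invented, and the day window 09:30 to 18:00 is an assumption based on the overnight session described in the question):

```r
library(xts)

# Two days of hourly toy observations in UTC
idx <- seq(as.POSIXct("2020-01-06 00:00", tz = "UTC"), by = "hour", length.out = 48)
x <- xts(seq_along(idx), order.by = idx)

# Row numbers of the observations that fall inside the day session
day_obs <- x["T09:30/T18:00", which.i = TRUE]

# Dropping those rows leaves only the overnight observations
overnight <- x[-day_obs, ]
```

If no observation falls inside the day window, day_obs is empty and x[-day_obs, ] returns an empty object, so that case may need a guard.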
If you already have your data subset so that it only includes the overnight session, you can aggregate to "daily" using period.apply() and custom endpoints. Assuming your data are in an object named x:
ep <- c(0, which(diff(.indexhour(x) > 9 & .indexmin(x) > 30) == 1))
makeOHLC <- function(x) {
op <- as.numeric(first(x))
cl <- as.numeric(last(x))
c(Open = op, High = max(x), Low = min(x), Close = cl)
}
period.apply(x, ep, makeOHLC)
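To show the shape of the result, here is a sketch on invented overnight-only data. It swaps in a different way of finding the endpoints than the one above: it assumes a new session starts wherever consecutive timestamps are more than one hour apart, which only holds if the data have no other gaps.

```r
library(xts)

# Two toy overnight sessions (18:00 to 09:00 the next day), hourly, in UTC
s1 <- seq(as.POSIXct("2020-01-06 18:00", tz = "UTC"), by = "hour", length.out = 16)
s2 <- seq(as.POSIXct("2020-01-07 18:00", tz = "UTC"), by = "hour", length.out = 16)
x <- xts(1:32, order.by = c(s1, s2))

# Assumption: a gap of more than 3600 seconds marks the end of a session
ep <- c(0, which(diff(.index(x)) > 3600), nrow(x))

makeOHLC <- function(x) {
  op <- as.numeric(first(x))
  cl <- as.numeric(last(x))
  c(Open = op, High = max(x), Low = min(x), Close = cl)
}

# One OHLC row per overnight session, stamped with the session's last timestamp
overnight_ohlc <- period.apply(x, ep, makeOHLC)
```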

Multiply a timeseries by a factor in R

I have the following time series:
ts <- cbind(data.frame(date=seq(as.Date("2017/11/01"), by = "day", length.out = 30)),value=rep(5,30))
ts <- ts[order(ts$date, decreasing=T),]
I would like to adjust it by the below cumulative factor that has a value on some given dates:
cf <- cbind(data.frame(date=as.Date(c("2017/11/28", "2017/11/25","2017/11/04","2017/09/25"))),cumfactor=c(0.8,0.7,0.6,.05))
Such that, the value on each date on ts will be multiplied (adjusted) by the cumfactor on cf on the corresponding date and that cumfactor will be used for subsequent (earlier) dates until the next cumfactor shows up for an earlier date. The first (latest) dates in ts should not be adjusted if they are later than the first (latest) cumfactor date.
I am looking for the following result:
result <- cbind(data.frame(date=seq(as.Date("2017/11/01"), by = "day", length.out = 30)),value=c(3,3,3,3,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,3.5,4,4,4,5,5))
result <- result[order(result$date, decreasing=T),]
My guess is that a for loop could be the best option but I haven't been successful at obtaining this result.
Merge ts and cf, carry forward the factors and multiply them.
library(zoo)
m <- merge(ts, cf, all.x = TRUE)[nrow(ts):1, ]
transform(m, value = value * na.fill(na.locf0(cumfactor), 1))
We have retained your descending sequence of dates from the question but note that in R normally time series are represented in ascending order of date.
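For comparison, a sketch of the same adjustment with the dates kept in ascending order. Because each factor applies to its own date and to earlier dates, the factors are carried backwards with fromLast = TRUE. The data frame is named ts_df here (a name chosen for this example) to avoid masking stats::ts:

```r
library(zoo)

ts_df <- data.frame(date = seq(as.Date("2017-11-01"), by = "day", length.out = 30),
                    value = 5)
cf <- data.frame(date = as.Date(c("2017-11-28", "2017-11-25",
                                  "2017-11-04", "2017-09-25")),
                 cumfactor = c(0.8, 0.7, 0.6, 0.05))

# Left join on date; merge() returns the rows sorted by ascending date
m <- merge(ts_df, cf, all.x = TRUE)

# Carry each factor backwards to earlier dates; dates after the latest
# factor stay NA and are filled with 1, i.e. left unadjusted
m$cumfactor <- na.fill(na.locf0(m$cumfactor, fromLast = TRUE), 1)
m$adjusted <- m$value * m$cumfactor
```

The factor dated 2017/09/25 falls before the start of the series, so all.x = TRUE drops it.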

R check consistency of separated timeseries-table

I have a timeseries-table like this, which goes up to 2000 31 12 23 (12/31/2000 23:00):
I'd like to add temperature values from several weather stations to it. The problem is that the different time series obviously don't match in their number of rows, so there must be gaps.
How can I check whether these data frames consistently follow the pattern of 0-24 hours and 1-12 months, and find out where the gaps are?
If your data is in the format of the link then you can probably convert it to a POSIXct object by doing the following (assuming your data frame is called data):
data = as.data.frame(list(YY = rep("1962",10),
MM = rep("01",10),
DD = rep("01",10),
HH = c("00","01","02","03","04",
"05","06","07","08","09")))
date = paste(data$YY,data$MM,data$DD,sep="-")
data$dateTime = as.POSIXct(paste(date,data$HH,sep=" "),format="%Y-%m-%d %H")
That should put your data into a POSIXct format. If your temperature dataset also has a column called "dateTime" and it's a POSIXct object you should be able to use the merge function and it will combine the two data frames
temp = as.data.frame(list(YY = rep("1962",10),
MM = rep("01",10),
DD = rep("01",10),
HH = c("00","01","02","03","04",
"05","06","07","08","09")))
date1 = paste(temp$YY,temp$MM,temp$DD,sep="-")
temp$dateTime = as.POSIXct(paste(date1,temp$HH,sep=" "),format="%Y-%m-%d %H")
temp$temp = round(rnorm(10,0,5),1)
temp = temp[,c("dateTime","temp")]
#let's say your temperature dataset is missing an entry for a certain timestamp
temp = temp[-3,]
# this data frame won't have an entry for 02:00:00
data1 = merge(data,temp)
data1
# if you want to look at time differences you can try something like this
diff(data1$dateTime)
# this one will fill in the temp value as NA at 02:00:00
data2 = merge(data,temp,all.x = T)
data2
diff(data2$dateTime)
I hope that helps. I often use the merge function when I'm trying to match up timestamps from ecological datasets.
Thank you for your answer and sorry for my late reply.
I couldn't have made it without your helpful hints, though I have now managed to merge all my time series in a slightly different way:
Sys.setenv(TZ='UTC') # set the system time zone to UTC to avoid DST gaps
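To list the gaps directly rather than reading them off diff(), one option is to compare the observed timestamps against a complete hourly sequence. A minimal sketch on invented timestamps (the two removed hours are chosen for the example):

```r
# Observed hourly timestamps with two hours removed to simulate gaps
obs <- seq(as.POSIXct("1962-01-01 00:00", tz = "UTC"), by = "hour", length.out = 10)
obs <- obs[-c(3, 7)]

# Complete hourly sequence over the same span
full <- seq(min(obs), max(obs), by = "hour")

# Timestamps present in the full sequence but absent from the observations
missing_times <- full[!(as.numeric(full) %in% as.numeric(obs))]
missing_times
```

Working in UTC avoids spurious gaps or duplicates around daylight saving time changes.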
# creating empty hourly timeseries for following join
start = strptime("1962010100", format="%Y%m%d%H")
end = strptime("2000123123", format= "%Y%m%d%H")
series62_00 <- data.frame(
MESS_DATUM=seq(start, end, by="hour",tz ='UTC'), t = NA)
# join all temperature series covering the same time span using the "plyr" package
library("plyr")
t_allstations <- list(series62_00,t282,t867,t1270,t2261,t2503
,t2597,t3668,t3946,t4752,t5397,t5419,t5705)
t_omain_DWD <- join_all(t_allstations, by = "MESS_DATUM", type = "left")
Using join_all with type = "left" makes sure that the date column (MESS_DATUM) is not changed and that missing temperature values are filled in as NAs.
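If plyr is not available, a similar chain of left joins can be sketched in base R with Reduce() and merge(). The station data frames t_a and t_b below are invented stand-ins for the real station tables:

```r
# Complete hourly grid that every station is joined onto
grid <- data.frame(
  MESS_DATUM = seq(as.POSIXct("1962-01-01 00:00", tz = "UTC"),
                   by = "hour", length.out = 5))

# Hypothetical station tables, each with some hours missing
t_a <- data.frame(MESS_DATUM = grid$MESS_DATUM[c(1, 2, 4, 5)],
                  t_a = c(1.0, 1.5, 0.5, 0.0))
t_b <- data.frame(MESS_DATUM = grid$MESS_DATUM[c(1, 3, 5)],
                  t_b = c(2.0, 2.5, 3.0))

# Successive left joins: every grid row is kept, missing values become NA
joined <- Reduce(function(a, b) merge(a, b, by = "MESS_DATUM", all.x = TRUE),
                 list(grid, t_a, t_b))
joined
```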
