I am trying to write a function that will return the number of business days between two dates (excluding not just weekends, but holidays as well). I'm approaching it by building a matrix whose rownames are the days of the week and whose elements are either 1 or 0: a 0 marks a holiday or one of the few extra elements needed to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe:
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications later in your code. If you use as.Date rather than strptime, you won't have this problem because Date objects don't carry a time component.
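To see how the fractional difference turns into the warning, the arithmetic from the question can be replayed with the 129.0417-day value shown above. This is only a sketch: `pad_length` is a hypothetical helper that mirrors the `v`, `g` and `u` steps of the original function and is not part of the original code.

```r
# Hypothetical helper: repeats the padding arithmetic from the question
# for a day difference f and returns the length of the vector that
# would be handed to matrix().
pad_length <- function(f) {
  v <- rep(1, f)                       # rep() truncates a fractional count
  g <- ((trunc(f / 7) + 1) * 7) - f    # padding needed to reach a multiple of 7
  u <- rep(0, g)                       # truncated again when g is fractional
  length(c(v, u))
}

# Difference reported by the strptime() example above (129 days plus one hour)
pad_length(129 + 1 / 24)   # 132, not a multiple of 7

# The difference between two Date objects is a whole number of days
whole <- as.numeric(as.Date("2014-12-31") - as.Date("2014-08-24"))
whole                      # 129
pad_length(whole)          # 133, which is 19 * 7
```

Both `rep()` calls drop the fractional part, so one element goes missing and the total ends up at 132, the length quoted in the warning.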
But I'm not really sure why you're even bothering with the matrix at all. I think a simpler implementation would look like:
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
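For experimenting without the `cal` file, the same idea can be sketched as a function that takes both dates and the holidays as arguments. The name `business_days` and its parameters are assumptions made for this illustration; they do not come from the original post.

```r
# Hypothetical variant: count weekdays between two dates (inclusive),
# skipping any dates listed in `holidays`.
business_days <- function(from, to, holidays = as.Date(character(0))) {
  from <- as.Date(from)
  to <- as.Date(to)
  if (to < from) return(0L)
  days <- seq(from, to, by = "1 day")
  # wday is 0 for Sunday and 6 for Saturday, so 1:5 is Monday to Friday
  is_weekday <- as.POSIXlt(days)$wday %in% 1:5
  sum(is_weekday & !(days %in% as.Date(holidays)))
}

# Monday 2014-08-25 through Sunday 2014-08-31
business_days("2014-08-25", "2014-08-31")                           # 5
# Two work weeks with one holiday (Monday 2014-09-01)
business_days("2014-08-25", "2014-09-05", holidays = "2014-09-01")  # 9
```

Passing the holidays in explicitly keeps the function independent of the working directory, which makes it easier to test.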
I have a dataframe with 105 months and 20 columns. The example below is simplified and shows that some of the columns start at January 2014 and some don't. Some others are zeroed:
df <- data.frame(months = c('2014-01-01','2014-02-01',
'2014-03-01','2014-04-01','2014-05-01',
'2014-06-01','2014-07-01'),
series2 = c(1754,3345,12226,1712,6703,8172,1545),
series3 = c(NA,NA,NA,NA,554,222,321),
series4 = c(NA,NA,NA,NA,0,0,0)
)
My objective is to seasonally adjust the series which can be seasonally adjusted and write a similar dataframe, keeping the seasonally adjusted series in the same order and position as in the original dataframe.
I have made a for loop to decide which columns can be seasonally adjusted. The for loop also finds out the initial date of every column.
library(seasonal)
library(dplyr) # for pull()
library(lubridate) # for year(), month() and %m+%
# determine initial and final date in the first column of dataframe
initial_date <- as.POSIXct(pull(df[1,1]),format = "%Y-%m-%d")
final_date <- as.POSIXct(pull(df[nrow(df),1]),format = "%Y-%m-%d")
# create an empty dataframe to be completed with seasonally adjusted data
seas_adj_df<-data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
# decide which series should be seasonally adjusted
for(i in 2:ncol(df)) { # Head of for-loop
# if a certain column contains only zeros...
if(sum(df[,i] != 0, na.rm=TRUE)==0) {
seas_adj_df[,i]<-as.numeric(NA) #fill the column with NA
} else {
#determine the number of values of the column
n_values_column<-length(df[,i][!is.na(df[,i])])
#how many months after the beginning of the dataframe did the column start?
months_to_add<-nrow(df)-n_values_column
#calculates the initial date of the column
column_initial_date<-initial_date %m+% months(months_to_add)
#transform the column values into a time series
time_series <- ts(df[,i],start = c(year(column_initial_date),
month(column_initial_date)),
end = c(year(final_date), month(final_date)),
freq = 12)
#perform seasonal adjustment
time_series_sa<- final(seas(time_series, multimode = "R"))
#insert seasonally adjusted series into the new dataframe
seas_adj_df[,i]<-time_series_sa #this part is wrong
}}
However, I receive the following error:
Error in `[<-.data.frame`(`*tmp*`, , i, value = c(928.211662624947, 993.311013042665, : replacement has 81 rows, data has 105
This happens because some of my columns have 81 values (the rest are filled with "NA").
My two questions are:
Is there a way to seasonally adjust all series, but asking R to "jump" NA columns and 0 columns? My final dataframe must have the seasonally adjusted series in the exact same position as the original ones (example: series5_SA must be in column 5, even if series4 couldn't be seasonally adjusted).
Using my code (or similar code), how could I add a time series with 81 values into a dataframe with 105 rows? How can I tell R that the column should be inserted from line (105-81=24) on?
You can use lapply together with an error-handling function such as tryCatch instead of a for loop. Such functions try to perform an operation and, if they encounter an error, return a fallback result that you specify (here, the original unadjusted series). The order of the time series is not affected. Here is an example with the AirPassengers data set:
> library(seasonal)
> library(lubridate)
>
> data(AirPassengers)
>
> df <- replicate(5, AirPassengers)
> df <- cbind.data.frame(date_decimal(as.numeric(time(AirPassengers))), df)
>
> ## Adding NA to second and fourth data columns
> df[sample(1:nrow(df), 10), 3] <- NA
> df[sample(1:nrow(df), 10), 5] <- NA
>
> initial_date <- as.Date(df[1,1], format = "%Y-%m-%d")
>
> time_series <- lapply(df[, -1], function(x){
+ ts(x, start = c(year(initial_date), month(initial_date)), frequency = 12)
+ })
>
> time_series_sa <- lapply(time_series, function(x) {
+ tryCatch(final(seas(x, multimode = "R")), error = function(e) {x})
+ })
>
> summary(time_series_sa)
Length Class Mode
1 144 ts numeric
2 144 ts numeric
3 144 ts numeric
4 144 ts numeric
5 144 ts numeric
Hope it helps.
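The second question (placing an 81-value series into a 105-row data frame) can be handled by assigning into the last rows only. This is a base R sketch under assumed data: `adjusted` is a stand-in for the output of `final(seas(...))`, and the three-column frame is made up for illustration.

```r
n_rows <- 105    # rows in the full data frame

# Stand-in for a seasonally adjusted series that starts 24 months late
adjusted <- ts(seq_len(81), start = c(2016, 1), frequency = 12)

# Empty result frame, filled with NA
seas_adj_df <- data.frame(matrix(NA_real_, nrow = n_rows, ncol = 3))

# The series occupies the last 81 rows, so it starts at row 105 - 81 + 1 = 25
first_row <- n_rows - length(adjusted) + 1
seas_adj_df[first_row:n_rows, 2] <- as.numeric(adjusted)

sum(is.na(seas_adj_df[, 2]))   # 24 leading NAs remain
```

Note that the first filled row is 25 rather than 24, since rows 1 to 24 are the ones left as NA.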
I am using a for loop to iterate through a list of fund codes, make an API call and then combine this data in one large zoo object. I would like to see something like this (shortened version), where there is a simple change between each column (Zoo1, Zoo2, Zoo3):
However currently I have column headers like this:
I have researched into ways to do this (for example if there was a parameter of merge.zoo for a custom name) however I can't seem to find anything.
Currently my code is looking like this:
## Imported libraries
library(httr) ## Library to make the HTTP get request
library(jsonlite) ## Library to de-code the JSON
library(lattice) ## Help with the output plots
library(zoo) ## Use with data frames
library(ggplot2) ## Used for plotting
## Percentage Change - calculates the percentage difference from the first value to all of the next values, in a dataframe
PercentageChange <- function(dataframe) {
Values <- as.vector(dataframe$navPrice) ## Collect the navPrice as a vector
ReturnValues <- c() ## Create a vector for the values that we are going to return
StartValue <- Values[length(Values)] ## Find the start value that we can base all of the data on
for (Value in Values) { ## Loop through all of the price values.
PercentageChangeValue <- (Value-StartValue)/StartValue*100 ## Calculate the percentage change difference between the current and start value using the percentage change formula
ReturnValues <- append(ReturnValues, PercentageChangeValue) ## Add that value to the values that we will return
}
return(ReturnValues) ## Return the vector that we created
}
## Call API - based on the fund code given to us, collect the data from the API, for a certain date range
CallAPI <- function(Fund, Start="1950-01-01", End="2050-01-01") {
Data <- read.csv("Data/VanguardFundApiNumbers.csv") ## Collect the data from the CSV file with the API Lookup codes and put that into a dataframe
Code <- Data[Data$Ticker == Fund, ]$VanguardCode ## Look up the row with the fund code and then take the ticker code for the API
Res <- GET(paste("https://www.vanguardinvestor.co.uk/api/fund-data/", Code, "/S/price-history?startDate=", Start, "&endDate=", End, sep="")) ## Create the URL that we will send a request to, and then send it
FundData <- fromJSON(rawToChar(Res$content)) ## Turn the raw data into a string, and then into JSON which can be assigned into the return variable
return(FundData) ## Return the return data
}
## Create Just Date Values - returns a list of date times without the time in a more readable format
CreateJustDateValues <- function(Dates) {
ReturnList <- c() ## Create a list for the return values
for (Date in Dates) { ## Loop through the dates
ReturnList <- append(ReturnList, gsub("-", ".", substr(Date, 1, 10))) ## For each date, take the first ten chars, replace any dashes with . and then add that to the return list
}
return(ReturnList) ## Return the list we have created
}
## Create a list of the funds that we want to look at
Funds <- c("VDEE", "VDWE", "VGAC", "VGSC", "VUSE", "VJSI", "VLSH")
StartDate <- "2021-06-01"
EndDate <- "2050-01-01"
## Creates the first fund for testing and to help set up the variable
FundData <- CallAPI(Funds[1], Start=StartDate)
FundData$percentChange <- PercentageChange(FundData)
Combination <- zoo(FundData$percentChange, FundData$date)
## Loops through the remaining funds
for (Fund in Funds[2:length(Funds)]) {
FundData <- CallAPI(Fund, Start=StartDate) ## Returns a DataFrame with the data
FundData$percentChange <- PercentageChange(FundData) ## Create a percentage change column that we can analyse
ZooObject <- zoo(FundData$percentChange, FundData$date) ## Turn the wanted data into a zoo object
Combination <- merge.zoo(Combination, ZooObject) ## Combine it to the previous objects
}
Any explanation of how I could improve the column headers would be great!
If the problem is to merge zoo vector z0 onto zoo object z such that the new column has name "X", then make z0 into a column vector zoo object and use setNames:
library(zoo)
z <- read.zoo(BOD)
z0 <- 10*z
merge(z, setNames(cbind(z0), "X"))
giving:
z X
1 8.3 83
2 10.3 103
3 19.0 190
4 16.0 160
5 15.6 156
7 19.8 198
If hard coding X is ok then it can be written:
merge(z, X = z0)
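When there are several series, as in the loop over fund codes in the question, one option is to collect the zoo objects in a named list and merge them in a single call, since merge uses the argument names as column names. The values below are invented for illustration; only the fund codes are taken from the question.

```r
library(zoo)

dates <- as.Date("2021-06-01") + 0:2

# Made-up percentage changes, one zoo series per fund code
series_by_fund <- list(
  VDEE = zoo(c(0, 1.5, 2.0), dates),
  VDWE = zoo(c(0, -0.5, 0.3), dates),
  VGAC = zoo(c(0, 0.2, 0.9), dates)
)

# The list names are passed to merge() as argument names
# and become the column names of the result
combined <- do.call(merge, series_by_fund)
colnames(combined)
```

Building the list first also avoids growing the merged object inside the loop.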
data looks like sample1
sample2
I have 1000 csv files, all of which have two columns: the first column is the date, the second column is the price. The files cover different time periods; some data start in 1995, some in 2000. Since I need to do a co-integration test, all data in the test should have exactly the same time points.
I need to extract the same time period from all 1000 csv files, for example starting from 1998-4-20.
It works if I do it individually using:
newdata208 <- subset(data208, Date >= "1998-04-20")
but when I try to loop over them, an error occurs. Could anyone help me fix it?
v1 <- list()
for (i in 1:length(datasets)) {
v1[i] <- subset(datasets[i], Date >= "1998-04-20")
}
Error in subset.default(datasets[i], Date >= "1998-04-20") :
object 'Date' not found
For the original problem, adding get() inside the loop fixes it. I do not know why; could anyone tell me?
v1 <- list()
for (i in 1:length(datasets)) {
data <- get(datasets[i]) # look up the data frame by its name
v1[[i]] <- subset(data, Date >= "1998-04-20")}
then, problem is fixed
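As for why get() helps: judging by the error coming from subset.default, datasets appears to hold the names of the data frames as character strings rather than the data frames themselves, so subset() was being applied to a string. get() looks up the object that a name refers to. A small sketch with two made-up data frames (file001 and file002 are hypothetical stand-ins for the real files):

```r
# Two made-up data frames standing in for the imported files
file001 <- data.frame(Date = as.Date(c("1998-04-17", "1998-04-20", "1998-04-21")),
                      C = c(12.5, 12.667, 12.587))
file002 <- data.frame(Date = as.Date(c("1998-04-20", "1998-04-22")),
                      C = c(95.9, 97.7))

# A character vector of object names, not the objects themselves
datasets <- c("file001", "file002")
class(datasets[1])   # "character", so subset() cannot find a Date column

# mget() fetches all named objects at once and returns a named list
frames <- mget(datasets, envir = environment())
trimmed <- lapply(frames, function(d) subset(d, Date >= as.Date("1998-04-20")))
sapply(trimmed, nrow)
```

Keeping the data frames in a list from the start (for example with lapply over the file names) would remove the need for assign() and get() altogether.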
Update: here is my full code so far.
# set working directory
setwd("F:/xxx/folder")
dataset <- list.files(pattern = "*.CSV")
datasets <- c()
for (i in 1:1000) {
datasets[i] <- substr(dataset[i], 1, (nchar(dataset[i])-4))
}
# we only need closing price column and date column
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%m/%d/%Y"))
# read date column and closing price column
for (i in 1:length(temps)) {
assign(temps[i], read.csv(temp[i],
colClass = c("myDate", rep("NULL", 4),
rowClass = "numeric",
rep("NULL", 2)), stringsAsFactor = FALSE, header = TRUE))
}
# extract same time period
v1 <- c()
for (i in temps) {
data <- get(i)
v1[i] <- subset(data, Date >= "1998-04-20", select = C)
}
# lengths are different, file505 has short time period
index <- subset(file505, Date >= "1998-04-20")
indexs <- index$Date
# try use index to extract data
selectdate <- which(file001$Date %in% indexs)
file001CLOSE <- file001[selectdate, "C"]
#redo loop to get same period
v2 <- c()
for (i in datasets) {
data2 <- get(i)
v2[[i]] <- data2[selectdate, "C"]
}
v2table <- do.call(cbind,v2)
# right now, data is wonderful, let's begin do time series.
# test co-integration
install.packages("urca")
library("urca")
comb <- combn(1000, 2)
pairs <- c()
for (i in 1:499500) {
pairs[[i]] <- v2table[, comb[, i]]
}
# test:FF <- pairs[[88]], it is working wonderful, display all details in result
# do ca.jo
testresults <- list()
for (i in 1:499500) {
testdata <- pairs[[i]]
testresults[[i]] <- ca.jo(testdata, ecdet = "const", type = "eigen", K = 1)
}
This creates a huge list containing all test results. I need to split out the test statistics and the critical values of each test, and find all pairs that are co-integrated.
I've got a two step solution for you:
library(data.table) # provides fread() and rbindlist()
x <- list.files(path = "your directory", pattern = ".csv")
y <- lapply(x,fread)
data <- rbindlist(y)
This reads in all the csvs in your directory as data tables, then binds them together.
After that I'd just subset as follows:
data <- data[Date >= "your dates",]
EDIT*
I get the feeling you want to bring in your data and bind it column wise/merge. As it stands that would be incorrect as you have the same "C" value in each column. After looking at your samples, if you were to stack those row wise you'd just be getting a very long C column. I wonder whether each C column represents the same or a different variable. If it is a different variable, I've written up some code which would truncate your data appropriately.
I've used the first 6 rows of your sample data 1 and sample data 2
files <- list.files(path = dir, pattern = ".csv")
data_mock <- lapply(files,fread)
data_mock[[1]][, Date := data_mock[[2]][,Date]]
#I change the dates here because your sample dates are too far apart to test for date truncation to work
for (i in 1:length(data_mock)){
data_mock[[i]]$Date <- as.Date(data_mock[[i]]$Date, format = "%Y-%m-%d")
}
for (i in 1:length(data_mock)){
setnames(data_mock[[i]], old = names(data_mock[[i]]), new = c("Date", paste0("C",i)))
}
#I change the variable names here because I'm not sure whether you want to stack Cs ontop of one another or whether each C is a different variable.
#I've assumed each C is different.
start_finish <- function(data, start, finish){
data[Date >= start & Date <= finish,]
}
results <- list()
for (i in 1:length(data_mock)){
results[[i]] <- start_finish(data_mock[[i]], "1987-01-15", "1987-01-17")
}
This is what the original data looked like:
[[1]]
Date C
1: 1998-04-20 12.667
2: 1998-04-21 12.587
3: 1998-04-22 12.625
4: 1998-04-23 12.601
5: 1998-04-24 12.584
6: 1998-04-25 12.624
[[2]]
Date C
1: 1987-01-14 95.89
2: 1987-01-15 97.72
3: 1987-01-16 98.10
4: 1987-01-17 97.07
5: 1987-01-18 98.86
6: 1987-01-19 99.95
This is what it looks like once you run a loop over the "start_finish" function I wrote:
[[1]]
Date C
1: 1987-01-15 12.587
2: 1987-01-16 12.625
3: 1987-01-17 12.601
[[2]]
Date C2
1: 1987-01-15 97.72
2: 1987-01-16 98.10
3: 1987-01-17 97.07
I believe you wanted your data to start and end at the same time? You merely need to alter the "start" and "finish" dates in the pretty simple function I wrote.
Is that what you're after?
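If the goal is one table in which every series shares exactly the same dates, a base R alternative is to rename each price column and merge the frames on Date, which keeps only the dates present in all of them. The data and the names file001 and file002 below are made up for this sketch.

```r
# Made-up series with partly overlapping dates
a <- data.frame(Date = as.Date("1998-04-20") + 0:4,
                C = c(12.667, 12.587, 12.625, 12.601, 12.584))
b <- data.frame(Date = as.Date("1998-04-20") + c(1, 2, 3, 5),
                C = c(97.72, 98.10, 97.07, 99.95))
frames <- list(file001 = a, file002 = b)

# Give each price column the name of its source so the columns stay distinct
renamed <- Map(function(d, nm) setNames(d, c("Date", nm)), frames, names(frames))

# merge() does an inner join by default, so only shared dates survive
aligned <- Reduce(function(x, y) merge(x, y, by = "Date"), renamed)
aligned
```

With many files this repeated merge can be slow, so it may be worth restricting each frame to the wanted date range first.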
I have several .csv files containing hourly data. Each file represents data from a point in space. The start and end date is different in each file.
The data can be read into R using:
lstf1<- list.files(pattern=".csv")
lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))
head(lst2[[800]])
datetime precip code
1 2003-12-30 00:00:00 NA M
2 2003-12-30 01:00:00 NA M
3 2003-12-30 02:00:00 NA M
4 2003-12-30 03:00:00 NA M
5 2003-12-30 04:00:00 NA M
6 2003-12-30 05:00:00 NA M
datetime is in YYYY-MM-DD HH:MM:SS format, precip is the data value, and code can be ignored.
For each dataframe (df) in lst2 I want to select data for the period 2015-04-01 to 2015-11-30 based on the following conditions:
1) If precip in a df contains all NAs within this period, delete it (do not select)
2) If precip is not all NAs, select it.
The desired output (lst3) contains the sub-setted data for the period 2015-04-01 to 2015-11-30.
All dataframes in lst3 should have equal length, with days and hours without precip denoted as NA.
Then I can write the files in lst3 to my directory using something like:
sapply(names(lst2),function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
The link to a sample file can be found here (~200 KB)
It's a little hard to understand exactly what you are trying to do, but this example (using dplyr, which has nice filter syntax) on the file you provided should get you close:
library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip),
datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Then add df.result to your list of data frames...
} # else, don't add it to your list
I think you are saying that you want to retain NAs in the data frame if there are also valid precip values--you only want to discard if there are NAs for the entire period. If you just want to strip all NAs, then just use the first filter statement and you are done. You obviously don't need to use POSIXct if you've already got your dates encoded correctly another way.
EDIT: w/ function wrapper so you can use lapply:
library(dplyr)
# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA
# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)
# Function to do the filtering; returns list of data frames to keep or null
filterprecip <- function(d) {
if (nrow(filter(d, !is.na(precip), datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01"))) >
0) {
return(filter(d, datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01")))
}
}
# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x)
# Filter the list
results <- compact(lapply(df.list, filterprecip))
# Check that you got a list of 2 data frames in the right date range
str(results)
Based on what you've written, it sounds like you're just interested in subsetting your list of files if data exists in the precip column for this specific date range.
> valuesExist <- function(df,start="2015-04-01 0:00:00",end="2015-11-30 23:59:59"){
+ sub.df <- df[df$datetime>=start & df$datetime<=end,]
+ if(sum(is.na(sub.df$precip))==nrow(sub.df)){return(FALSE)}else{return(TRUE)}
+ }
> lst2.bool <- sapply(lst2, valuesExist)
> lst2 <- lst2[lst2.bool]
> lst3 <- lapply(lst2, function(x) {x[x$datetime>="2015-04-01 0:00:00" & x$datetime<="2015-11-30 23:59:59",]})
> sapply(names(lst2), function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
If you want to have a dynamic start and end time, pass variables with these values into the valuesExist function and replace the string timestamps in the lst3 assignment with those same variables.
If you wanted to combine the two lapply loops into one, be my guest, but I prefer having a boolean variable when I'm subsetting.
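The question also asks that every data frame in lst3 have the same length, with missing hours shown as NA. One way to get that, sketched here in base R, is to build the full hourly grid for the period and left-join the observations onto it. `pad_hourly` is a hypothetical helper, and the sketch assumes datetime has already been converted to POSIXct in UTC.

```r
# Hypothetical helper: return one row per hour between start and end,
# with precip set to NA for hours that are absent from `d`.
pad_hourly <- function(d, start, end, tz = "UTC") {
  grid <- data.frame(
    datetime = seq(as.POSIXct(start, tz = tz), as.POSIXct(end, tz = tz), by = "hour")
  )
  # all.x = TRUE keeps every hour of the grid, matched or not
  merge(grid, d[, c("datetime", "precip")], by = "datetime", all.x = TRUE)
}

# Made-up observations with the 01:00 and 03:00 readings missing
obs <- data.frame(
  datetime = as.POSIXct(c("2015-04-01 00:00:00", "2015-04-01 02:00:00"), tz = "UTC"),
  precip = c(0.2, 1.1)
)

padded <- pad_hourly(obs, "2015-04-01 00:00:00", "2015-04-01 03:00:00")
padded
```

Applied with lapply to the filtered list, this would give every data frame the same number of rows for the chosen period.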
Is there a good package in R that allows subsetting (i.e. indexing into) time series by times that are not in the time series?
E.g. for financial applications, indexing a price series by a time stamp that is not in the database should return the latest available price before the time stamp.
In code, this is what I would like:
n =15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series=zoo(rep(1,length(series.dates)), series.dates)
series[full.dates[11]]
this returns
Data:
numeric(0)
Index:
character(0)
however, I would like this to return the value of the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract index of the observations in your zoo object. The index can then be used for subsetting the object. Step by step to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx and this answer)
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[10]]
# 2014-01-03
# 1
If you would rather have missing values in your incomplete series replaced by "next observation carried backward", you need to merge your series with a 'dummy' zoo object which contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
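The same "last observation at or before a time stamp" lookup can also be sketched without zoo, using base R's findInterval on sorted times. `last_value_at` is a hypothetical helper written for this illustration, and the dates and prices are made up.

```r
# Hypothetical helper: for each element of `when`, return the value whose
# time is the latest one at or before it (NA if nothing precedes it).
# `times` must be sorted in increasing order.
last_value_at <- function(times, values, when) {
  pos <- findInterval(as.numeric(when), as.numeric(times))
  pos[pos == 0] <- NA    # no observation at or before `when`
  values[pos]
}

start <- as.Date("2014-01-01")
series_dates <- start + c(0:9, 11, 14)          # day 10 is missing
prices <- as.numeric(seq_along(series_dates))   # made-up prices 1, 2, ..., 12

last_value_at(series_dates, prices, start + 10)  # 10, carried from day 9
last_value_at(series_dates, prices, start + 11)  # 11, an exact match
last_value_at(series_dates, prices, start - 1)   # NA, before the first date
```

An exact match returns that day's own value, so a strictly "before" lookup would need the query shifted back by one time unit.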
na.locf(x, xout = newdate) seems not much worse than subscripting but at any rate here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
if (!missing(i) && inherits(i, class(index(x)))) {
zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
} else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different than what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if(NROW(series.subset)==0) {
# merge series with an empty zoo object
# that contains the index value you want
prior <- merge(series, zoo(,full.dates[11]))
# lag *back* one period so the NA is on the prior value
prior <- lag(prior, 1)
# get the index value at the prior value
prior <- index(prior)[is.na(prior)]
# subset again
series.subset <- series[prior]
}