Is there a good package in R that allows subsetting (i.e. indexing into) a time series by times that are not in the series?
E.g. for financial applications, indexing a price series by a time stamp that is not in the database should return the latest available price before the time stamp.
In code, this is what I would like:
n =15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series=zoo(rep(1,length(series.dates)), series.dates)
series[full.dates[11]]
this returns
Data:
numeric(0)
Index:
character(0)
however, I would like this to return the value of the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract the index of the observations in your zoo object. The index can then be used to subset the object. Step by step, to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx and this answer)
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[10]]
# 2014-01-03
# 1
If you would rather have missing values in your incomplete series replaced by "next observation carried backward", you need to merge your series with a 'dummy' zoo object which contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
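For a single lookup, the same "last observation at or before a date" rule can also be sketched in base R with findInterval. This is only an illustration under assumptions: last_obs is a hypothetical helper (not part of zoo), the dates must be sorted in increasing order, the start date is fixed at 2013-12-25 so the result is reproducible, and the values are made distinct so the match is visible.

```r
# Hypothetical helper (not part of zoo): for each date in 'when', return the
# value observed at the last date <= 'when', or NA if there is none.
last_obs <- function(dates, values, when) {
  # findInterval() needs a sorted vector; it returns 0 when 'when' is too early
  pos <- findInterval(as.numeric(when), as.numeric(dates))
  out <- rep(NA_real_, length(when))
  out[pos > 0] <- values[pos[pos > 0]]
  out
}

# Assumed fixed start date, so that full.dates[10] is 2014-01-03
full.dates <- seq(as.Date("2013-12-25"), by = "day", length.out = 15)
keep <- c(1:10, 12, 15)
series.dates <- full.dates[keep]
values <- as.numeric(keep)  # distinct stand-in values

# full.dates[11] is missing from the series, so the 10th value is returned
last_obs(series.dates, values, full.dates[11])
```

A date earlier than the first observation yields NA rather than an error.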
na.locf(x, xout = newdate) seems not much worse than subscripting but at any rate here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
if (!missing(i) && inherits(i, class(index(x)))) {
zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
} else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different than what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if(NROW(series.subset)==0) {
# merge series with an empty zoo object
# that contains the index value you want
prior <- merge(series, zoo(,full.dates[11]))
# lag *back* one period so the NA is on the prior value
prior <- lag(prior, 1)
# get the index value at the prior value
prior <- index(prior)[is.na(prior)]
# subset again
series.subset <- series[prior]
}
I have a dataframe with 105 months and 20 columns. The example below is simplified and shows that some of the columns start at January 2014 and some don't. Some others are zeroed:
df <- data.frame(months = c('2014-01-01','2014-02-01',
'2014-03-01','2014-04-01','2014-05-01',
'2014-06-01','2014-07-01'),
series2 = c(1754,3345,12226,1712,6703,8172,1545),
series3 = c(NA,NA,NA,NA,554,222,321),
series4 = c(NA,NA,NA,NA,0,0,0)
)
My objective is to seasonally adjust the series which can be seasonally adjusted and write a similar dataframe, keeping the seasonally adjusted series in the same order and position as in the original dataframe.
I have made a for loop to decide which columns can be seasonally adjusted. The for loop also finds out the initial date of every column.
library(seasonal)
library(dplyr) # pull()
library(lubridate) # year(), month(), %m+%
# determine initial and final date in the first column of dataframe
initial_date <- as.POSIXct(pull(df[1,1]),format = "%Y-%m-%d")
final_date <- as.POSIXct(pull(df[nrow(df),1]),format = "%Y-%m-%d")
# create an empty dataframe to be completed with seasonally adjusted data
seas_adj_df<-data.frame(matrix(ncol = ncol(df), nrow = nrow(df)))
# decide which series should be seasonally adjusted
for(i in 2:ncol(df)) { # Head of for-loop
# if a certain column contains only zeros...
if(sum(df[,i] != 0, na.rm=TRUE)==0) {
seas_adj_df[,i]<-as.numeric(NA) # fill the column with NA
} else {
#determine the number of values of the column
n_values_column<-length(df[,i][!is.na(df[,i])])
#how many months after the beginning of the dataframe did the column start?
months_to_add<-nrow(df)-n_values_column
#calculates the initial date of the column
column_initial_date<-initial_date %m+% months(months_to_add)
#transform the column values into a time series
time_series <- ts(df[,i],start = c(year(column_initial_date),
month(column_initial_date)),
end = c(year(final_date), month(final_date)),
freq = 12)
#perform seasonal adjustment
time_series_sa<- final(seas(time_series, multimode = "R"))
#insert seasonally adjusted series into the new dataframe
seas_adj_df[,i]<-time_series_sa #this part is wrong
}}
However, I receive the following error:
Error in `[<-.data.frame`(`*tmp*`, , i, value = c(928.211662624947, 993.311013042665, : replacement has 81 rows, data has 105
This happens because some of my columns have 81 values (the rest are filled with "NA").
My two questions are:
Is there a way to seasonally adjust all series, but asking R to "jump" NA columns and 0 columns? My final dataframe must have the seasonally adjusted series in the exact same position as the original ones (example: series5_SA must be in column 5, even if series4 couldn't be seasonally adjusted).
Using my code (or similar code), how could I add a time series with 81 values into a dataframe with 105 rows? How can I tell R that the column should be inserted from line (105-81=24) on?
You can use lapply in conjunction with an error-handling function (like tryCatch) instead of a for loop. This type of function tries to perform an operation, and if it encounters an error it returns a fallback result that you specify (here, the original, unadjusted ts). The order of the time series is not affected. Here is an example with the AirPassengers data set:
> library(seasonal)
> library(lubridate)
>
> data(AirPassengers)
>
> df <- replicate(5, AirPassengers)
> df <- cbind.data.frame(date_decimal(as.numeric(time(AirPassengers))), df)
>
> ## Adding NA to second and fourth data columns
> df[sample(1:nrow(df), 10), 3] <- NA
> df[sample(1:nrow(df), 10), 5] <- NA
>
> initial_date <- as.Date(df[1,1], format = "%Y-%m-%d")
>
> time_series <- lapply(df[, -1], function(x){
+ ts(x, start = c(year(initial_date), month(initial_date)), frequency = 12)
+ })
>
> time_series_sa <- lapply(time_series, function(x) {
+ tryCatch(final(seas(x, multimode = "R")), error = function(e) {x})
+ })
>
> summary(time_series_sa)
Length Class Mode
1 144 ts numeric
2 144 ts numeric
3 144 ts numeric
4 144 ts numeric
5 144 ts numeric
Hope it helps.
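For the second question (placing an 81-value adjusted series into a 105-row data frame), one option is to left-pad the shorter series with NA so that it fills the trailing rows. The sketch below is an assumption-laden illustration: pad_front is a hypothetical helper, the ts object is a stand-in for the output of final(seas(...)), and it assumes every column ends on the last row of the data frame. Note that with 24 leading NAs the values start at row 25, not row 24.

```r
# Hypothetical helper: left-pad a shorter series with NA up to n_rows values
pad_front <- function(x, n_rows) {
  stopifnot(length(x) <= n_rows)
  c(rep(NA_real_, n_rows - length(x)), as.numeric(x))
}

n_rows <- 105

# Stand-in for an adjusted series with 81 monthly values (not real output)
adjusted <- ts(seq_len(81), start = c(2016, 1), frequency = 12)

# Empty result frame; column 2 receives the padded series
seas_adj_df <- data.frame(matrix(NA_real_, nrow = n_rows, ncol = 3))
seas_adj_df[, 2] <- pad_front(adjusted, n_rows)

# Rows 1..24 stay NA, rows 25..105 hold the 81 values
head(which(!is.na(seas_adj_df[, 2])), 1)
```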
I am using a for loop to iterate through a list of fund codes, make an API call and then combine this data in one large zoo object. I would like to see something like this (shortened version), where there is a simple change between each column (Zoo1, Zoo2, Zoo3):
However currently I have column headers like this:
I have looked for ways to do this (for example, a parameter of merge.zoo that sets a custom name), but I can't seem to find anything.
Currently my code is looking like this:
## Imported libraries
library(httr) ## Library to make the HTTP get request
library(jsonlite) ## Library to de-code the JSON
library(lattice) ## Help with the output plots
library(zoo) ## Use with data frames
library(ggplot2) ## Used for plotting
## Percentage Change - calculates the percentage difference from the first value to all of the next values, in a dataframe
PercentageChange <- function(dataframe) {
Values <- as.vector(dataframe$navPrice) ## Collect the navPrice as a vector
ReturnValues <- c() ## Create a vector for the values that we are going to return
StartValue <- Values[length(Values)] ## Find the start value that we can base all of the data on
for (Value in Values) { ## Loop through all of the price values.
PercentageChangeValue <- (Value-StartValue)/StartValue*100 ## Calculate the percentage change difference between the current and start value using the percentage change formula
ReturnValues <- append(ReturnValues, PercentageChangeValue) ## Add that value to the values that we will return
}
return(ReturnValues) ## Return the vector that we created
}
## Call API - based on the fund code given to us, collect the data from the API, for a certain date range
CallAPI <- function(Fund, Start="1950-01-01", End="2050-01-01") {
Data <- read.csv("Data/VanguardFundApiNumbers.csv") ## Collect the data from the CSV file with the API Lookup codes and put that into a dataframe
Code <- Data[Data$Ticker == Fund, ]$VanguardCode ## Look up the row with the fund code and then take the ticker code for the API
Res <- GET(paste("https://www.vanguardinvestor.co.uk/api/fund-data/", Code, "/S/price-history?startDate=", Start, "&endDate=", End, sep="")) ## Create the URL that we will send a request to, and then send it
FundData <- fromJSON(rawToChar(Res$content)) ## Turn the raw data into a string, and then into JSON which can be assigned into the return variable
return(FundData) ## Return the return data
}
## Create Just Date Values - returns a list of date times without the time in a more readable format
CreateJustDateValues <- function(Dates) {
ReturnList <- c() ## Create a list for the return values
for (Date in Dates) { ## Loop through the dates
ReturnList <- append(ReturnList, gsub("-", ".", substr(Date, 1, 10))) ## For each date, take the first ten chars, replace any dashes with . and then add that to the return list
}
return(ReturnList) ## Return the list we have created
}
## Create a list of the funds that we want to look at
Funds <- c("VDEE", "VDWE", "VGAC", "VGSC", "VUSE", "VJSI", "VLSH")
StartDate <- "2021-06-01"
EndDate <- "2050-01-01"
## Creates the first fund for testing and to help set up the variable
FundData <- CallAPI(Funds[1], Start=StartDate)
FundData$percentChange <- PercentageChange(FundData)
Combination <- zoo(FundData$percentChange, FundData$date)
## Loops through the remaining funds
for (Fund in Funds[2:length(Funds)]) {
FundData <- CallAPI(Fund, Start=StartDate) ## Returns a DataFrame with the data
FundData$percentChange <- PercentageChange(FundData) ## Create a percentage change column that we can analyse
ZooObject <- zoo(FundData$percentChange, FundData$date) ## Turn the wanted data into a zoo object
Combination <- merge.zoo(Combination, ZooObject) ## Combine it to the previous objects
}
Any explanation of how I could improve the column headers would be great!
If the problem is to merge the zoo vector z0 onto the zoo object z such that the new column has the name "X", then make z0 into a one-column zoo object and use setNames:
library(zoo)
z <- read.zoo(BOD)
z0 <- 10*z
merge(z, setNames(cbind(z0), "X"))
giving:
z X
1 8.3 83
2 10.3 103
3 19.0 190
4 16.0 160
5 15.6 156
7 19.8 198
If hard coding X is ok then it can be written:
merge(z, X = z0)
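When the series are built in a loop, the named-argument form can be applied to all of them at once by collecting the zoo objects in a named list and merging that list with do.call. The following is a sketch, not the asker's code: the fund codes are reused from the question, while the dates and values are made-up stand-ins for the API results.

```r
library(zoo)

Funds <- c("VDEE", "VDWE", "VGAC")
dates <- as.Date("2021-06-01") + 0:4

# Stand-in for the API call: one zoo vector of fake percentage changes per fund
FundSeries <- lapply(seq_along(Funds), function(k) zoo(k * (0:4), dates))
names(FundSeries) <- Funds

# The list names become the argument names of merge(), hence the column names
Combination <- do.call(merge, FundSeries)
colnames(Combination)
```

Merging once at the end also avoids the nested suffixes that accumulate when the same two variable names are merged repeatedly.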
I want to store values in "yy", but my code below stores only one row (the last value). Please see the output below. Can somebody help me store all the values in "yy"?
Thanks in advance. I am a beginner to R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
for (ik in length(arrYear)) {
newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
yy <- newPolicy[which.min(newPolicy$min_dist),]
}
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try and do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
x = TN_97_Lau_Cot,
y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
by = c("YEAR","POLICY"),
all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year and policy without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
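The same "group minimum, then filter" idea can also be written in one step with base R's ave, which returns the group minimum aligned to the original rows. This is a small sketch on made-up data; the toy data frame only borrows the column names YEAR, POLICY and min_dist from the question.

```r
# Made-up data with the same grouping columns as the question
toy <- data.frame(
  YEAR     = c(2015, 2015, 2016, 2016, 2016),
  POLICY   = c("A", "A", "A", "B", "B"),
  min_dist = c(3.1, 2.6, 4.0, 1.5, 2.2)
)

# Minimum of min_dist within each YEAR/POLICY group, repeated on every row
group_min <- ave(toy$min_dist, toy$YEAR, toy$POLICY, FUN = min)

# Keep the rows that attain their group minimum
closest <- toy[toy$min_dist == group_min, ]
closest
```

Ties within a group are all kept, which matches the behaviour of the merge-based version before unique is applied.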
I use the following source and get an error:
>source("raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…)
Error in source("raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…) : raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…: unexpected input 1: ï»
Since I have to use it, what is the error and how can I fix it?
Here is my code (the last line is the relevant command):
library(zoo)
library (xts)
library(data.table)
source("https://raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/na.interp1.R")
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 19:00,0.03
4,21/11/2014 16:00,0.04
5,21/11/2014 17:00,0.06
6,21/11/2014 20:00,0.10"
z <- read.zoo(text = Lines, tz = "", format = "%d/%m/%Y %H:%M", sep = ",")
## Source 1 begins
startdate <- as.character((start(z)))
# set the start date/time as the 1st entry in the time series and make
# this a character vector.
start <- as.POSIXct(startdate)
# transform the character vector to a POSIXct object
enddate <- as.character((end(z)))
# set the end date/time as the last entry in the time series and make
# this a character vector.
end <- as.POSIXct(enddate)
# transform the character vector to a POSIXct object
gridtime <- seq(from = start, by = 3600, to = end)
# create a sequence beginning with the start date/time with a 60 minute
# interval ending at the end date/time
## Source 1 ends
## Source 2 begins
timeframe <- data.frame(rep(NA, length(gridtime)))
# create 1 NA column spaced out by the gridtime to complement the single
# column of z
timelength <- xts(timeframe, order.by = gridtime)
# create a xts time series object using timeframe and gridtime
zDate <- merge(timelength, z)
# merge the z zoo object and the timelength xts object
## Source 2 ends
Lines <- as.data.frame(zDate)
# to data.frame from zoo
Lines[, "D1"] <- rownames(Lines)
# create column named D1
Lines <- setDT(Lines)
# create data.table out of data.frame
setcolorder(Lines, c(3, 2, 1))
# set the column order as the 3rd column followed by the 2nd and 1st
# columns
Lines <- Lines[, 3 := NULL]
# remove the 3rd column
setnames(Lines, 2, "diff")
# change the name of the 2nd column to diff
Lines <- setDF(Lines)
# return to data.frame
rowsinterps1 <- which(is.na(Lines$diff == TRUE))
# index of rows of Lines that have NA (to be interpolated)
xi <- as.numeric(Lines[which(is.na(Lines$diff == TRUE)), 1])
# the Date-Times for diff to be interpolated in numeric format
interps1 <- na.interp1(as.numeric(Lines$Time), Lines$diff, xi = xi, na.rm = FALSE, maxgap = 3)
# the interpolated values where only gap sizes of 3 are filled
The package was updated; that is the reason the code didn't work.
I wish the people that make the points to drop would return them back. The question was OK.
I am trying to write a function that will return the number of business days between two dates (not just excluding weekends, but holidays as well). I'm approaching it by building a matrix with rownames corresponding to days of the week with the elements of the matrix either a 1 or a 0: a 0 if it is a holiday or the extra couple elements to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe:
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications later in your code. If you use as.Date rather than strptime then you won't have this problem, because Date objects don't care about time.
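The difference between the two classes can be checked directly. In this sketch the time zone "America/New_York" is an assumption chosen to reproduce the EDT/EST output above; it relies on the system's time zone database being available.

```r
# Date objects: the difference is always a whole number of days
d1 <- as.Date("2014-08-24")
d2 <- as.Date("2014-12-31")
date_diff <- as.numeric(difftime(d2, d1, units = "days"))
date_diff  # 129

# POSIXct in a zone with DST: the clock change in November adds one hour
p1 <- as.POSIXct("2014-08-24", tz = "America/New_York")
p2 <- as.POSIXct("2014-12-31", tz = "America/New_York")
posix_diff <- as.numeric(difftime(p2, p1, units = "days"))
posix_diff
```

Because rep(1, f) truncates a fractional count, a non-integer difference such as 129.0417 produces a vector whose padded length is no longer a multiple of 7, which is what the warning reports.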
But I'm not really sure why you're bothering with the matrix at all. I think a simpler implementation would look like this:
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
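To try the counting logic without the cal file, the same idea can be written with the start date and the holidays passed in explicitly. This is a sketch only: business_days and its arguments are hypothetical names, and the single holiday used here is an assumption for the example.

```r
# Hypothetical variant: count weekdays between two dates (inclusive),
# skipping any dates listed in 'holidays'
business_days <- function(from, to, holidays = as.Date(character(0))) {
  days <- seq(as.Date(from), as.Date(to), by = "1 day")
  # POSIXlt wday: 0 = Sunday, 1..5 = Monday..Friday, 6 = Saturday
  is_weekday <- as.POSIXlt(days)$wday %in% 1:5
  sum(is_weekday & !(days %in% holidays))
}

# Monday 2014-12-22 to Friday 2014-12-26, with 2014-12-25 as a holiday
business_days("2014-12-22", "2014-12-26", holidays = as.Date("2014-12-25"))
```

Both end points are counted, and seq raises an error if the target date lies before the start date, so callers may want to check the order first.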