How to align dates for merging two xts files? - r

I'm trying to analyze 1-year %-change data in R on two data series by merging them into one file. One series is weekly and the other is monthly. Converting the weekly series to monthly is the problem. Using apply.monthly() on the weekly data creates a monthly file but with intra-monthly dates that don't match the first-day-of-month format in the monthly series after combining the two files via merge.xts(). Question: How to change the resulting merged file (sample below) to one monthly entry for both series?
2012-11-01 0.02079801 NA
2012-11-24 NA -0.03375796
2012-12-01 0.02052502 NA
2012-12-29 NA 0.04442094
2013-01-01 0.01881466 NA
2013-01-26 NA 0.06370272
2013-02-01 0.01859883 NA
2013-02-23 NA 0.02999318

You can pass indexAt="firstof" in a call to to.monthly to get monthly data using the first of the month for the index.
library(quantmod)
getSymbols(c("USPRIV", "ICSA"), src="FRED")
merge(USPRIV, to.monthly(ICSA, indexAt="firstof", OHLC=FALSE))

Something like this:
do.call(rbind, by(d[-1], d[[1]] - as.POSIXlt(d[[1]])$mday, FUN=apply, 2, sum, na.rm=TRUE))
## V2 V3
## 2012-10-31 0.02079801 -0.03375796
## 2012-11-30 0.02052502 0.04442094
## 2012-12-31 0.01881466 0.06370272
## 2013-01-31 0.01859883 0.02999318
Note that the dates are encoded as row names, not as a column in the result.

This is a frequently occurring issue, and since I sometimes forget my own solution and Google does not easily lead to one, I am posting it here.
Basically, just convert the index of the monthly aggregated series to yearmon. You can optionally convert it back to yyyy-mm-dd format (the 1st of each month) with as.Date. Once the exact days are stripped and the indices are 'homogenised', all the columns align perfectly.
library(zoo) # provides as.yearmon()
# Here with the %>% pipe (magrittr, re-exported by dplyr)
time(myxts) <- time(myxts) %>% as.yearmon() %>% as.Date()
# or without the pipe
time(myxts) <- as.Date(as.yearmon(time(myxts)))
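As a rough end-to-end illustration of the index-snapping idea, here is a minimal sketch on made-up data. The series names and values (monthly, weekly, weekly_m) are hypothetical and not taken from the question, and mean is just one possible aggregation for apply.monthly.

```r
library(xts)  # also attaches zoo, which provides as.yearmon()

# Hypothetical monthly series indexed on the first day of each month
monthly <- xts(c(0.020, 0.021, 0.019),
               order.by = as.Date(c("2012-11-01", "2012-12-01", "2013-01-01")))

# Hypothetical weekly series (Saturdays) covering the same three months
weekly_dates <- seq(as.Date("2012-11-03"), as.Date("2013-01-26"), by = "week")
weekly <- xts(seq_along(weekly_dates) / 100, order.by = weekly_dates)

# Aggregate to monthly; the index is still the last weekly date in each month
weekly_m <- apply.monthly(weekly, mean)

# Snap every index value to the first day of its month
index(weekly_m) <- as.Date(as.yearmon(index(weekly_m)))

# Both series now share the same index, so merge() yields one row per month
merged <- merge(monthly, weekly_m)
print(merged)
```

After the index is rewritten, merge() should produce one row per month with no NA padding, since both inputs carry first-of-month dates.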

Related

bfastts for monthly data

I am working with data collected monthly. In my dataset, there are some months where no data was collected and thus, there is no entry in my data. I have previously used bfastts for similar occurrences when data was collected daily, so that I may have NA values in my data. How may I do the same for monthly data, using bfastts or some other function?
eg. below if needed
2006-06-01 2.260121
2006-07-01 2.306800
2006-08-01 2.246624
2006-09-01 1.724565
2006-11-01 1.630561
2007-05-01 2.228918
2007-06-01 2.228918
2007-07-01 2.22891
I wish to have NA fields for December to March.
The question did not specify what class of object is desired, but here are three. zoo supports an irregularly spaced index, so it does not need to insert NA's. ts does not support one, so converting from zoo to ts automatically inserts NA's. Convert the ts object back to zoo, or to a data frame, to get a zoo or data frame object with NA's.
The zoo and data frame objects use yearmon class for the index which internally represents year/month as year + fraction where fraction is 0, 1/12, ..., 11/12 for Jan, Feb, ..., Dec and displays in meaningful form. as.Date can be used to convert yearmon objects to Date objects although in this case yearmon probably makes more sense since it directly represents year and month without day.
If you want to go in the other direction and remove NA's use na.omit(z_na) or na.omit(DF_na) .
library(zoo)
# zoo object - no NA's
z <- read.zoo(DF, FUN = as.yearmon)
# ts object with NA's
tt <- as.ts(z)
# zoo object with NA's
z_na <- as.zoo(tt)
# data.frame with NA's
DF_na <- fortify.zoo(tt)
Note
Lines <- "2006-06-01 2.260121
2006-07-01 2.306800
2006-08-01 2.246624
2006-09-01 1.724565
2006-11-01 1.630561
2007-05-01 2.228918
2007-06-01 2.228918
2007-07-01 2.22891"
DF <- read.table(text = Lines)
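If pulling in zoo is not an option, the same NA padding can be sketched in base R by joining the data onto a complete monthly sequence. This is an alternative to the approach above, not part of it; the column names date and value and the object names all_months and DF_full are assumptions made for the example.

```r
Lines <- "2006-06-01 2.260121
2006-07-01 2.306800
2006-08-01 2.246624
2006-09-01 1.724565
2006-11-01 1.630561
2007-05-01 2.228918
2007-06-01 2.228918
2007-07-01 2.22891"

# Read the sample data; the column names are chosen here for illustration
DF <- read.table(text = Lines, col.names = c("date", "value"))
DF$date <- as.Date(DF$date)

# Every first-of-month date between the first and last observation
all_months <- data.frame(date = seq(min(DF$date), max(DF$date), by = "month"))

# Left join: months without an observation get NA in the value column
DF_full <- merge(all_months, DF, by = "date", all.x = TRUE)
print(DF_full)
```

This relies on the dates all falling on the first of the month, as they do in the sample; otherwise the dates would need to be truncated to month starts before the join.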

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
rowwise() %>%
mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (basically due to rowwise), you might want to use data.table for pure speed
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
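Note that both sets of totals shown above include the end date: 144.8444 equals sum(daily$value[1:6]), not sum(daily$value[1:5]). Subtracting .Machine$double.eps from a Date whose numeric value is around 14,000 is lost to floating-point rounding, so the upper bound stays inclusive. A minimal base-R sketch of the end-exclusive sum the question asks for, using an explicit date < end comparison, might look like this (the vapply row loop is only one illustrative way to do it):

```r
set.seed(1)
date.range <- seq(as.Date("2008-03-01"), as.Date("2008-03-31"), by = "day")
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# For each interval, sum the daily values with start <= date < end
intervals$nhdd <- vapply(seq_len(nrow(intervals)), function(i) {
  in_range <- daily$date >= intervals$start[i] & daily$date < intervals$end[i]
  sum(daily$value[in_range])
}, numeric(1))

print(intervals)
```

The same logical condition could be used inside a rowwise mutate in place of between(), which treats both bounds as inclusive.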

R turn irregular time interval into regular ones using previous numbers

I have an irregular time interval like this
df=data.frame(Date=c("2013-01-08","2013-01-11","2013-01-13","2013-01-21","2013-02-06"), runningtotal=c(800,910,1060,1210,660))
I found that, through a zoo object, it can be merged with a regular time interval with missing values filled in as 0. However, I need to fill in the previous value instead, except at the start of each month, where it should be filled with 0. So the end output is like this:
date runningtotal
2013-01-01 0
2013-01-02 0
...
2013-01-08 800
2013-01-09 800
2013-01-10 800
2013-01-11 910
2013-01-12 910
2013-01-13 1060
...
2013-02-01 0
And also, does it make sense to fill in value like this for forecasting purpose?
Thanks.
Try approxfun with the constant method. I don't have lubridate and just deal with regular Date objects. For instance:
df<-data.frame(Date=c("2013-01-08","2013-01-11","2013-01-13","2013-01-21","2013-02-06"), runningtotal=c(800,910,1060,1210,660))
df$Date<-as.Date(as.character(df$Date))
#create some new dates
newDates<-seq(df$Date[1],df$Date[5],length.out=10)
intfun<-approxfun(df$Date,df$runningtotal,method="constant",yleft=0,yright=0)
data.frame(newDates,intfun(newDates))
I would use na.locf from the zoo package, but you should prepare the data before applying it. The code below assumes DF$Date is a POSIXct column and uses day() from lubridate.
library(lubridate)
## generate a vector of dates
mm <- min(DF$Date)
day(mm) <- 1
seq_dates <- seq.POSIXt(mm,max(DF$Date),by='days')
## add zero values for the beginning of each month
DF <- rbind(DF,data.frame(Date=seq_dates[day(seq_dates)==1],runningtotal=0))
library(zoo)
## merge with the sequence of dates, and apply na.locf to carry previous values forward
na.locf(merge(seq_dates,DF,by=1,all.x=TRUE))
The idea is to apply na.locf, which replaces missing values with the previous non-missing value. Merging your data with a sequence of dates (from the start of the first month to the last date) inserts the missing values.
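For reference, a self-contained variant of the same idea using plain Date objects (so lubridate is not needed) could be sketched as follows. The helper names first_day, all_days, month_starts, anchors and filled are made up for this example.

```r
library(zoo)  # na.locf()

df <- data.frame(
  Date = as.Date(c("2013-01-08", "2013-01-11", "2013-01-13",
                   "2013-01-21", "2013-02-06")),
  runningtotal = c(800, 910, 1060, 1210, 660)
)

# Daily grid from the first day of the first month to the last observation
first_day <- as.Date(format(min(df$Date), "%Y-%m-01"))
all_days <- seq(first_day, max(df$Date), by = "day")

# Month starts get an explicit 0 unless an observation already falls on that day
month_starts <- all_days[format(all_days, "%d") == "01"]
month_starts <- month_starts[!month_starts %in% df$Date]
anchors <- rbind(df, data.frame(Date = month_starts,
                                runningtotal = rep(0, length(month_starts))))

# Left join onto the grid, then carry the last observation forward
filled <- merge(data.frame(Date = all_days), anchors, by = "Date", all.x = TRUE)
filled$runningtotal <- na.locf(filled$runningtotal)

print(head(filled, 12))
```

Because the grid always starts on a month start with value 0, there are no leading NA's for na.locf to drop, so the filled column keeps the same length as the grid.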

calculate Value at Risk in a data frame

My data set has thousands of hedge fund returns for 140 months, and I was trying to calculate Value at Risk (VaR) using the VaR command in the PerformanceAnalytics package. However, I have come up with many questions when using this function. I have created a sample data frame to show my problem.
df=data.frame(matrix(rnorm(24),nrow=8))
df$X1<-c('2007-01','2007-02','2007-03','2007-04','2007-05','2007-06','2007-07','2007-08')
df[2,2]<-NA
df[2,3]<-NA
df[1,3]<-NA
df
I got a data frame:
X1 X2 X3
1 2007-01 -1.4420195 NA
2 2007-02 NA NA
3 2007-03 -0.4503824 -0.78506597
4 2007-04 1.4083746 0.02095307
5 2007-05 0.9636549 0.19584430
6 2007-06 1.1935281 -0.14175623
7 2007-07 -0.3986336 1.58128683
8 2007-08 0.8211377 -1.13347168
I then run
apply(df,2,FUN=VaR, na.rm=TRUE)
and received a warning message:
The data cannot be converted into a time series. If you are trying to pass in names from a data object with one column, you should use the form 'data[rows, columns, drop = FALSE]'. Rownames should have standard date formats, such as '1985-03-15'.
I have tried to convert my data frame into a combination of time series using zoo() but it didn't help. Can someone help me figure out what I should do now?
@user2893255, you should convert your data frame into an xts object before using the apply function. Note that as.Date(df$X1, "%Y-%m") returns NA because the day is missing, so append a day before converting:
library(xts)
df.xts <- as.xts(df[,2:3],order.by=as.Date(paste0(df$X1,"-01")))
and then
apply(df.xts,2,FUN=VaR, na.rm=TRUE)
gives you the result without the time-series conversion warning.
Try dropping the Date column:
apply(df[,-1L], 2, FUN=VaR, na.rm=TRUE)

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame where NAs have been removed (Link of picture of data is given below). Data has been collected 3 times a day, but sometimes due to NAs, there is just 1 or 2 entries per day; some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the data of "dist" of one day and dividing it by number of entries per day (so 3 if there is no data missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that for every day, it should sum up "dist" and divide it by the number of entries that are available for every day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested, however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data :
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of :
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a dataframe back, you can look at the package plyr :
Data <- data.frame(dates, values)
library(plyr)
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))
Keep in mind that ddply is not fully supporting the date format (yet).
Look at the data.table package, especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
dt <- data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = "date"]
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
However, if for some reason you really want to use a loop, you could try something like
days <- unique(Dis_sub$date_only)
newFrame <- data.frame()
for (i in seq_along(days)) {
  # index into days so that d keeps its Date class
  d <- days[i]
  meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
  newFrame <- rbind(newFrame, data.frame(date = d, meanDist = meanDist))
}
But keep in mind that this will be slow and memory-inefficient.
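To make the date-stripping step concrete, here is a minimal sketch that builds a small Dis_sub-like data frame and aggregates it by calendar day. The timestamps and dist values are borrowed from the output shown in the question purely as sample input, the UTC time zone is an assumption, and daily_mean is a name made up for the example.

```r
# Hypothetical stand-in for the asker's data: up to three readings per day
Dis_sub <- data.frame(
  date = as.POSIXct(c("2006-10-06 12:00:00", "2006-10-06 20:00:00",
                      "2006-10-07 04:00:00", "2006-10-07 12:00:00",
                      "2006-10-07 20:00:00", "2006-10-08 04:00:00",
                      "2006-10-08 12:00:00", "2006-10-08 20:00:00"),
                    tz = "UTC"),
  dist = c(636.5395, 859.0109, 301.8548, 649.3357,
           944.8272, 136.7393, 360.9560, NA)
)

# Drop the time of day so that all readings from one day share a key
Dis_sub$date_only <- as.Date(Dis_sub$date)

# Mean of dist per calendar day; the formula interface drops NA rows by default
daily_mean <- aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)
print(daily_mean)
```

Grouping on the original date column instead would give one group per timestamp, which is why the means in the question looked unaggregated.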
