Interpolation of constrained gaps - r

As a follow-up to the following question:
Efficient dynamic addition of rows in dataframe and dynamic calculation in R
I have the following table:
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 19:00,0.03 <-- Gap I
4,21/11/2014 16:00,0.04
5,21/11/2014 17:00,0.06 <-- Gap II
6,21/11/2014 20:00,0.10"
As can be seen, there is a gap at 18:00 on 20/11/2014 and two gaps, at 18:00 and 19:00, on 21/11/2014.
An additional gap lies between 20/11/2014 19:00 and 21/11/2014 16:00.
I would like to interpolate (fill in) the values only where the gap between rows is up to 3 hours.
The required result should be as follows (in data frame format):
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 18:00,0.025<-- Added lines
4,20/11/2014 19:00,0.03
5,21/11/2014 16:00,0.04
6,21/11/2014 17:00,0.06
7,21/11/2014 18:00,0.073 <--
8,21/11/2014 19:00,0.086 <--
9,21/11/2014 20:00,0.10"
Here is the code I use that fills in the gap between days that is over 3 hours:
library(zoo)
z <- read.zoo(text = Lines, tz = "", format = "%d/%m/%Y %H:%M", sep = ",")
interpolated1 <- na.approx(z, xout = seq(start(z), end(z), "hours"))

We can merge z with a zero-width zoo series z0 built on an hourly grid. This transforms z into an hourly series with NAs at the missing hours. Then use the maxgap argument of na.approx, as shown below, to fill only the desired gaps. This still leaves NAs in the longer gaps, so remove them with na.omit.
fortify.zoo(z3) would convert the result to a data frame, but since z3 (the resulting series with only gaps of up to length 3 filled) is a time series, this is probably not a good idea; it is better to leave it as a zoo object so that you can use all the facilities of zoo.
No packages other than zoo are used.
z0 <- zoo(, seq(start(z), end(z), "hours"))
z3 <- na.omit(na.approx(merge(z, z0), maxgap = 3))
giving:
> z3
2014-11-20 16:00:00 2014-11-20 17:00:00 2014-11-20 18:00:00 2014-11-20 19:00:00
0.01000000 0.02000000 0.02500000 0.03000000
2014-11-21 16:00:00 2014-11-21 17:00:00 2014-11-21 18:00:00 2014-11-21 19:00:00
0.04000000 0.06000000 0.07333333 0.08666667
2014-11-21 20:00:00
0.10000000
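For a self-contained check of the steps above, here is a minimal sketch. It assumes the question's data with the row numbers and the `<--` annotations stripped, and it fixes the time zone to UTC (an assumption made here to sidestep daylight-saving effects). fortify.zoo() appears only at the end, for readers who really do need a data frame.

```r
library(zoo)

# Question data without row numbers or annotations (assumed clean input)
Lines <- "D1,Diff
20/11/2014 16:00,0.01
20/11/2014 17:00,0.02
20/11/2014 19:00,0.03
21/11/2014 16:00,0.04
21/11/2014 17:00,0.06
21/11/2014 20:00,0.10"

z <- read.zoo(text = Lines, header = TRUE, sep = ",", tz = "UTC",
              format = "%d/%m/%Y %H:%M")

# Zero-width series on an hourly grid; merging exposes the missing hours as NA
z0 <- zoo(, seq(start(z), end(z), by = "hour"))

# Fill runs of at most 3 consecutive NAs, then drop what is still NA
z3 <- na.omit(na.approx(merge(z, z0), maxgap = 3))

# Optional: data frame with the index as its first column
df3 <- fortify.zoo(z3)
print(df3)
```

The overnight gap contains 20 missing hours, so maxgap = 3 leaves it untouched and na.omit drops those rows.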

Source 1: Creating a specific sequence of date/times in R. Answer by mnel on Sep 13 2012 and edit by Matt Dowle on Sep 13 2012
&
Source 2: Creating regular 15-minute time-series from irregular time-series. Answer by mnel on Sep 13 2012 and edit by Dirk Eddelbuettel on May 3 2012
library(zoo)
library(xts)
library(data.table)
library(devtools)
devtools::install_github("iembry-USGS/ie2misc")
library(ie2misc)
# iembry released a version of ie2misc so you should be able to install
# the package now
# `na.interp1` is a function that combines zoo's `na.approx` and pracma's
# `interp1`
The rest of the code starts after the creation of your z zoo object
## Source 1 begins
startdate <- as.character((start(z)))
# set the start date/time as the 1st entry in the time series and make
# this a character vector.
start <- as.POSIXct(startdate)
# transform the character vector to a POSIXct object
enddate <- as.character((end(z)))
# set the end date/time as the last entry in the time series and make
# this a character vector.
end <- as.POSIXct(enddate)
# transform the character vector to a POSIXct object
gridtime <- seq(from = start, by = 3600, to = end)
# create a sequence beginning with the start date/time with a 60 minute
# interval ending at the end date/time
## Source 1 ends
## Source 2 begins
timeframe <- data.frame(rep(NA, length(gridtime)))
# create 1 NA column spaced out by the gridtime to complement the single
# column of z
timelength <- xts(timeframe, order.by = gridtime)
# create a xts time series object using timeframe and gridtime
zDate <- merge(timelength, z)
# merge the z zoo object and the timelength xts object
## Source 2 ends
The next steps involve the process of interpolating your data as requested.
Lines <- as.data.frame(zDate)
# to data.frame from zoo
Lines[, "D1"] <- rownames(Lines)
# create column named D1
Lines <- setDT(Lines)
# create data.table out of data.frame
setcolorder(Lines, c(3, 2, 1))
# set the column order as the 3rd column followed by the 2nd and 1st
# columns
Lines <- Lines[, 3 := NULL]
# remove the 3rd column
setnames(Lines, 2, "diff")
# change the name of the 2nd column to diff
Lines <- setDF(Lines)
# return to data.frame
rowsinterps1 <- which(is.na(Lines$diff))
# index of rows of Lines that have NA (to be interpolated)
dtnum <- as.numeric(as.POSIXct(Lines$D1))
# D1 holds the Date-Times as character strings, so convert them to numeric
xi <- dtnum[rowsinterps1]
# the Date-Times for diff to be interpolated in numeric format
interps1 <- na.interp1(dtnum, Lines$diff, xi = xi,
na.rm = FALSE, maxgap = 3)
# the interpolated values where only gaps of up to 3 consecutive NAs are filled
Lines[rowsinterps1, 2] <- interps1
# replace the NAs in diff with the interpolated diff values
Lines <- na.omit(Lines) # remove rows with NAs
Lines
This is the Lines data.frame:
Lines
D1 diff
1 2014-11-20 16:00:00 0.01000000
2 2014-11-20 17:00:00 0.02000000
3 2014-11-20 18:00:00 0.02500000
4 2014-11-20 19:00:00 0.03000000
25 2014-11-21 16:00:00 0.04000000
26 2014-11-21 17:00:00 0.06000000
27 2014-11-21 18:00:00 0.07333333
28 2014-11-21 19:00:00 0.08666667
29 2014-11-21 20:00:00 0.10000000

Related

apply.yearly() works with subset but not on full time series dataset in R

When I run the following code on my dataset, I get an output (partial one shown) like this:
all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))]
Senegal Muslims Serbia Muslims Seychelles Muslims
1970-01-01 3693807 200000 170
2000-01-01 8936283 529322 730
2010-01-01 11713126 527598 821
2015-01-01 13621382 471414 844
However, when I try to use the function apply.yearly on it to sum across the years, I just get an NA result:
apply.yearly(all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
1970-01-01 NA
2000-01-01 NA
2010-01-01 NA
2015-01-01 NA
The funny thing is that it works with some inputs but not others. For example, if I use input "Agnostics" instead of "Muslims", I get a good result. There isn't an error, so I can't seem to figure out what exactly is happening here.
all_countries_ts is stored as an xts object. One thing to note is that apply.yearly() always works on a subset of this dataset. I have written two functions, shown below:
sum_by_category <- function(religious_group, dataset) {
apply.yearly(dataset[,grepl(paste(religious_group), colnames(dataset))], FUN =
sum)
}
country_search <- function(country_name, z){
z <- foreach(i = 1:length(country_name), .combine = merge.xts) %do%{
all_countries_ts[,grepl(country_name[i], colnames(all_countries_ts))]
}
return(z)}
When I type in the following, it works perfectly:
sum_by_category("Muslims", country_search("Senegal"))
Senegal Muslims
1970-01-01 3693807
2000-01-01 8936283
2010-01-01 11713126
2015-01-01 13621382
I really can't figure out what's going on since it works with some inputs and not others. Thanks in advance for any help / insights!
xts::apply.yearly expects its x argument to be coercible to an xts object. Perhaps your data.frame is not xts-compatible.
The help for apply.yearly explains:
Arguments
x an time-series object coercible to xts
FUN an R function
I have created sample data based on the data shared by the OP and converted it to the xts class. apply.yearly works correctly on it.
library(xts)
# Convert data.frame to xts class
all_countries_ts <- xts(df[,-1], order.by = df$Date)
#Now one can use `apply.yearly`
apply.yearly(all_countries_ts[,grepl("Muslims", colnames(all_countries_ts))], FUN = sum)
# [,1]
# 1970-01-01 3893977
# 2000-01-01 9466335
# 2010-01-01 12241545
# 2015-01-01 14093640
Edited: A review of the OP's data suggests that it contains NA in many columns, which causes the total sum to be shown as NA. The fix is simple; the OP needs to pass na.rm = TRUE:
apply.yearly(all_countries_ts[,grepl("Muslims",colnames(all_countries_ts))],
FUN = sum, na.rm = TRUE)
# [,1]
# 1970-01-01 570772699
# 2000-01-01 1292170756
# 2010-01-01 1571250533
# 2015-01-01 1734531709
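The effect of na.rm can be seen on a tiny example. This is only a sketch with made-up numbers (the object x and its columns a and b are hypothetical, not the OP's data): one NA in a year makes that year's sum NA unless na.rm = TRUE is passed through to sum.

```r
library(xts)

# Toy data (made up): column b has a missing value in the first year
x <- xts(cbind(a = c(1, 2), b = c(NA, 5)),
         order.by = as.Date(c("2000-01-01", "2010-01-01")))

# Extra arguments to apply.yearly are passed on to FUN
with_na    <- apply.yearly(x, FUN = sum)                # NA propagates
without_na <- apply.yearly(x, FUN = sum, na.rm = TRUE)  # NA is ignored

print(with_na)
print(without_na)
```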
Data:
df <- read.table(text =
" Date 'Senegal Muslims' 'Serbia Muslims' 'Seychelles Muslims' Others
1970-01-01 3693807 200000 170 200
2000-01-01 8936283 529322 730 100
2010-01-01 11713126 527598 821 300
2015-01-01 13621382 471414 844 500",
header = TRUE, stringsAsFactors = FALSE)
#convert Date column to Date format
df$Date <- as.Date(df$Date)

Sum xts elements on a list row by row

I have an xts object called data which contains 5 min returns for the period from 2015-01-01 17:00:00 to 2015-12-31 17:00:00. Each trading day starts at 17:00:00 and finishes the next day at the same time for a total of 288 daily returns[(24hours*60 minutes) / 5 minutes = 288 intraday returns]. The returns are denoted as
head(data, 5)
DPRICE
2015-01-01 17:00:00 0.000000e+00
2015-01-01 17:05:00 9.797714e-05
2015-01-01 17:10:00 2.027022e-04
2015-01-01 17:15:00 2.735798e-04
2015-01-01 17:20:00 7.768653e-05
tail(data, 5)
DPRICE
2015-12-31 16:40:00 0.0001239429
2015-12-31 16:45:00 0.0001272704
2015-12-31 16:50:00 0.0010186764
2015-12-31 16:55:00 0.0006841370
2015-12-31 17:00:00 0.0002481227
I am trying to standardize the data by their average absolute value for each 5-minute intra-day interval according to McMillan and Speight Daily FX Volatility Forecasts (2012).
The mathematical formula is :
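Going by the description above (each return divided by the average absolute return of its 5-minute slot), the standardisation can be sketched as follows. The symbols $r_{t,i}$ (return on day $t$ in intraday interval $i$), $T$ (number of trading days) and $r^{*}_{t,i}$ (standardised return) are notation assumed here, not copied from the paper:

```latex
r^{*}_{t,i} \;=\; \frac{r_{t,i}}{\dfrac{1}{T}\sum_{s=1}^{T} \lvert r_{s,i} \rvert},
\qquad i = 1, \dots, 288, \quad t = 1, \dots, T
```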
My code is:
library(xts)
std_data = abs(data) #create absolute returns
D <- split(std_data, "days") #splits data to days
mts.days <- lapply(seq_along(D) - 1, function(i) {
if (i > 0) rbind(D[[i]]["T17:00:00/T23:55:00"], D[[i + 1]]["T00:00:00/T16:55:00"])
}) #creates a list with 365 elements each containing 288 unique returns
dummy = mapply(sum, mts.days) #add the first,second... observations from each element
With this code I create a list with 365 xts elements each having dimensions
> dim(mts.days[[2]])
[1] 288 1
I want to add the same observations from each element to create the denominator of the function above.
I don't understand your request, but will give it a shot nevertheless.
## generate bogus data
library(quantmod)
set.seed(123)
ndays <- 3
ndatperday <- 288
data <- cumsum(do.call("rbind", lapply(13:15, function(dd){
xts(rnorm(ndatperday)/1e4,
seq(as.POSIXct(paste0("2016-08-",dd," 17:00:00")),
length = ndatperday, by = 300))
})))
colnames(data) <- "DPRICE"
## calculate percentage returns
ret <- ROC(data, type="discrete")
## this is probably not what you need: returns divided by the overall mean
ret/mean(abs(ret), na.rm=TRUE)
## I suspect indeed that you need returns divided by the daily mean return
library(dplyr)
ret.df <- data.frame(ret)
## create a factor identifying the 3 days of bogus data
ret.df$day <- rep(paste0("2016-08-",13:15),each=ndatperday)
## compute daily mean return
dail <- ret.df %>%
group_by(day) %>%
summarise(mean=mean(abs(DPRICE), na.rm=TRUE))
## attach daily mean returns to the days they actually are associated to
ret.df <- ret.df %>% left_join(dail)
## normalize
ret.df$DPRICE <- ret.df$DPRICE/ret.df$mean
%%%%%%%%%
Second shot: after reading the paper (http://onlinelibrary.wiley.com/doi/10.1002/for.1222/full) I might have understood what you were after:
library(quantmod)
library(dplyr)
set.seed(123)
## generate bogus 5-min series
ndays <- 365
ndatperday <- 288
data <- as.xts(zoo(0.1+cumsum(rt(ndays*ndatperday, df=3))/1e4,
seq(as.POSIXct("2015-01-01 17:00"),
as.POSIXct("2015-12-31 17:00"), by=300)))
colnames(data) <- "DPRICE"
## calculate 5-min percentage returns
ret <- ROC(data, type="discrete")
## create a factor identifying the 5-minute intra-day interval
ret.df <- as.data.frame(ret)
ret.df$intra5 <- strftime(index(ret), format="%H:%M")
## compute mean returns (over the year) for each of the 288 5-minute intra-day intervals
dail <- ret.df %>%
group_by(intra5) %>%
summarise(mean=mean(abs(DPRICE), na.rm=TRUE))
## attach mean returns to each datapoint
ret.df <- ret.df %>% left_join(dail)
## normalize
ret.df$DPRICE <- ret.df$DPRICE/ret.df$mean
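The same per-interval normalisation can also be written without dplyr. The following is a sketch under stated assumptions: the returns are simulated (three days of made-up 5-minute data in UTC), and base R's ave() is swapped in for the group_by/left_join steps.

```r
library(xts)

set.seed(1)
# Three days of made-up 5-minute returns starting at 17:00 UTC
idx <- seq(as.POSIXct("2015-01-01 17:00", tz = "UTC"), by = 300,
           length.out = 288 * 3)
ret <- xts(rnorm(length(idx)) / 1e4, order.by = idx)

# Label each observation with its intraday slot, e.g. "17:05"
slot <- format(index(ret), "%H:%M", tz = "UTC")

# Mean absolute return per slot, repeated for every observation in that slot
denom <- ave(abs(as.numeric(ret)), slot, FUN = mean)

# Standardised returns: each return divided by its slot's mean absolute return
std <- ret / denom
```

By construction, the mean absolute standardised return within every slot equals 1, which is a quick way to check the result.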

How to insert zeros in a data frame of R

I have following data.frame, DF
DF is already in R; we do not need to load it with read.csv or anything similar.
timeStamp count
1 2014-01-15 14:30:00 2
2 2014-01-15 16:30:00 3
3 2014-01-15 17:00:00 2
4 2014-01-15 17:15:00 1
I have an "independent seq of timestamps", say tmpSeq from 2014-01-15 14:00:00 to 2014-01-22 13:00:00. I want to get a List of counts from this data.frame and insert zeros for timeStamp not present in data.frame but in the tmpSeq
Assuming your sequence is in 15 minute increments:
DF <- data.frame(timeStamp=as.POSIXct(c("2014-01-15 14:30:00","2014-01-15 16:30:00",
"2014-01-15 17:00:00","2014-01-15 17:15:00")),
count=c(2,3,2,1))
tmpSeq <- seq(as.POSIXct("2014-01-15 14:00:00"),
as.POSIXct("2014-01-22 13:00:00"), by="15 mins")
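# join on timeStamp only, then turn the unmatched (NA) counts into zeros
DF <- merge(data.frame(timeStamp=tmpSeq), DF, by="timeStamp", all.x=TRUE)
DF$count[is.na(DF$count)] <- 0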
should do it.
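If only the vector of counts is needed, a lookup with match() is another option. This is a sketch, not part of the answer above: it assumes the timestamps in DF fall exactly on the grid, uses UTC throughout, and shortens tmpSeq to a few hours to keep the output small.

```r
# Data from the question, with an explicit time zone (assumption)
DF <- data.frame(
  timeStamp = as.POSIXct(c("2014-01-15 14:30:00", "2014-01-15 16:30:00",
                           "2014-01-15 17:00:00", "2014-01-15 17:15:00"),
                         tz = "UTC"),
  count = c(2, 3, 2, 1)
)

# Shortened grid for illustration; the question's grid runs to 2014-01-22 13:00
tmpSeq <- seq(as.POSIXct("2014-01-15 14:00:00", tz = "UTC"),
              as.POSIXct("2014-01-15 18:00:00", tz = "UTC"), by = "15 mins")

# Look up each grid point in DF; grid points without a match give NA
counts <- DF$count[match(tmpSeq, DF$timeStamp)]
counts[is.na(counts)] <- 0

print(data.frame(timeStamp = tmpSeq, count = counts))
```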
Generally, it is better to work with a time series package when you deal with time series objects. With the xts package you can use rbind to combine two time series.
First, I create the short time series.
Then I generate the long time series, assuming it is regular with a 15-minute interval.
Finally, I combine the two series using rbind. Note that timestamps present in both series then appear twice (once with 0 and once with the real count).
Here is my code:
library(xts)
dat = as.xts(read.zoo(text='
time Stamp count ## a small hack here to read your data
1 2014-01-15 14:30:00 2
2 2014-01-15 16:30:00 3
3 2014-01-15 17:00:00 2
4 2014-01-15 17:15:00 1',
header=TRUE,
index=1:2,
format='%Y-%m-%d %H:%M:%S',tz=''))
## generate the long ts
tmpSeq <-
seq.POSIXt(as.POSIXct('2014-01-15 14:00:00'),
as.POSIXct('2014-01-22 13:00:00'),by = '15 mins')
tmpSeq <-
xts(x=rep(0,length(tmpSeq)),tmpSeq)
## insert dat values in tmpSeq
rbind(tmpSeq,dat)
It seems what you are looking for is a merge. Look at this post: How to join (merge) data frames (inner, outer, left, right)?
You need a right outer join (if you make tmpSeq your right data frame).
Edit:
Adding the merge statement to make the answer clearer. Join on timeStamp only; the timestamps missing from DF come back with count = NA, which can then be set to 0:
Right outer: merged <- merge(x = DF, y = data.frame(timeStamp=tmpSeq), by = "timeStamp", all.y=TRUE)
merged$count[is.na(merged$count)] <- 0

How to merge couples of Dates and values contained in a unique csv

We have a csv file with Dates in Excel format and Nav for Manager A and Manager B as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure [Dates, Manager A Nav, Manager B Nav].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck with the fact that Dates are in Excel format and don't know how to fix that stuff. Is the problem set in a correct way?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date. Your numbers represent the integer and fractional number of days that have passed since the origin date. Usually this is Jan 0 1900!!! Go figure, but be careful, as I don't think this is always the case. You can try this...
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
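If you are interested in keeping the times as well, you can use as.POSIXct on the original numeric column. as.POSIXct counts in seconds, so multiply the day count by 86400, and specify the time zone explicitly (the output below corresponds to UTC):
data$Date <- as.POSIXct(data$Date * 86400, origin = "1899/12/30", tz = "UTC")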
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
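To get from the converted dates to the single zoo object the question asks for, one possible route is to build one series per manager and merge them. This is only a sketch: it uses the first rows of the question's file inline, drops the time of day with floor(), and the names to_date, zA, zB and nav are made up for illustration.

```r
library(zoo)

# A few rows in the question's layout; Manager A's series is shorter
csv <- "Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
,,40945.6666666667,99.9999936397787"

# read.csv renames the columns to Date, Manager.A, Date.1, Manager.B
data <- read.csv(text = csv, header = TRUE)

# Excel serial day -> Date (origin 1899-12-30); floor() drops the time of day
to_date <- function(x) as.Date(floor(x), origin = "1899-12-30")

# Skip the padding rows where Manager A has no date
okA <- !is.na(data$Date)
zA <- zoo(data$Manager.A[okA], to_date(data$Date[okA]))
zB <- zoo(data$Manager.B, to_date(data$Date.1))

# Outer join on the dates; a manager without a value on a date gets NA
nav <- merge(ManagerA = zA, ManagerB = zB)
print(nav)
```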

R: Computing monthly averages from hourly data and then plotting

After converting a date/time character string into POSIXlt using strptime, I am left with the following (data truncated for ease here):
DateTime North South West East Seast System
1 2008-09-12 01:00:00 1919.9 3721.4 2085.9 2565.5 2571.1 12863.8
2 2008-09-12 02:00:00 1827.0 3518.1 1965.3 2396.9 2410.7 12118.0
3 2008-09-12 03:00:00 1755.4 3388.4 1866.8 2338.7 2335.2 11684.5
4 2008-09-12 04:00:00 1733.5 3327.1 1810.0 2295.6 2290.2 11456.4
5 2008-09-12 05:00:00 1742.7 3327.3 1831.4 2314.2 2302.3 11517.9
6 2008-09-12 06:00:00 1912.2 3504.4 1986.7 2515.0 2502.6 12420.9
I then have aggregated the data (seemingly right) into year-month averages using the following snippet of code:
North_Monthly_Avg <- aggregate(North, list(Date=format(DateTime, "%Y-%m")),mean)
which yields the following:
Date x
1 2008-09 2192.066
2 2008-10 1885.074
3 2008-11 1675.373
4 2008-12 1637.231
5 2009-01 1752.693
6 2009-02 1743.393
I can plot the 'x' values but cannot get the year-months to label properly on the x-axis since it is only plotting the index. Not sure what I am missing...I have played around with axis.POSIXct, but have no luck.
Try zoo and lattice:
library(zoo)
library(lattice)
dat <- 'Date Time North South West East Seast System
2008-09-12 01:00:00 1919.9 3721.4 2085.9 2565.5 2571.1 12863.8
2008-09-12 02:00:00 1827.0 3518.1 1965.3 2396.9 2410.7 12118.0
2008-09-12 03:00:00 1755.4 3388.4 1866.8 2338.7 2335.2 11684.5
2008-09-12 04:00:00 1733.5 3327.1 1810.0 2295.6 2290.2 11456.4
2008-09-12 05:00:00 1742.7 3327.3 1831.4 2314.2 2302.3 11517.9
2008-09-12 06:00:00 1912.2 3504.4 1986.7 2515.0 2502.6 12420.9'
z <- read.zoo(text = dat, header = TRUE, index.column = 1:2, tz = "")
xyplot(z)
zAgg <- aggregate(z$North, by = as.yearmon, FUN = mean)
dat2 <- 'Date x
2008-09 2192.066
2008-10 1885.074
2008-11 1675.373
2008-12 1637.231
2009-01 1752.693
2009-02 1743.393'
zAgg <- read.zoo(text = dat2, header = TRUE, FUN = as.yearmon)
plot(zAgg, xaxt = "n")
tt <- time(zAgg)
m <- format(tt, "%m")
axis(side = 1, at = tt, labels = ifelse(m == "01", trunc(tt), m), cex.axis = .7)
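The aggregation step can be tried on its own with synthetic data. In this sketch the hourly values are made up (a simple counter over 60 days, in UTC); the point is only that aggregate() with as.yearmon returns a series whose index is a proper year-month class rather than a string.

```r
library(zoo)

# 60 days of made-up hourly values starting 2008-09-12 01:00 UTC
idx <- seq(as.POSIXct("2008-09-12 01:00", tz = "UTC"), by = "hour",
           length.out = 24 * 60)
z <- zoo(seq_along(idx), idx)

# Monthly means; as.yearmon maps every timestamp to its year-month
zAgg <- aggregate(z, as.yearmon, mean)
print(zAgg)

# The index can be formatted freely for axis labels
labels <- format(index(zAgg), "%Y-%m")
```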
Try using as.integer() on the date. The format must not contain a dash, otherwise as.integer() returns NA:
North_Monthly_Avg <- aggregate(North, list(Date=as.integer(format(DateTime, "%Y%m"))),mean)
@user1062431,
To edit the tick names to your preferred format, edit the m <- format(tt, "%m") line in Oscar's answer.
To get the format 12 - 2008 you need to modify:
m <- format(tt, "%m") to m <- format(tt, "%m - %Y")
To get the format dec 2008 you need to modify:
m <- format(tt, "%m") to m <- format(tt, "%b %Y")
I think the problem is that there is no date. You will have to settle with a 1st of the month or 15th of the month and apply that to your aggregated table.
I came up with this:
North_Monthly_Avg=aggregate(North,by=list(format(DateTime,'%Y-%m')),mean)
names(North_Monthly_Avg)=c('Month','North')
North_Monthly_Avg$day=15
North_Monthly_Avg$Date=paste(North_Monthly_Avg$Month,North_Monthly_Avg$day,sep='-')
North_Monthly_Avg$Date=strptime(North_Monthly_Avg$Date,'%Y-%m-%d')
m=North_Monthly_Avg # short alias used in the plotting calls
plot(m$Date,m$North,xaxt='n') # the xaxt='n' removes any ticks on the x axis
axis(1,as.numeric(m$Date),labels=format(m$Date,'%Y-%m')) # formats the x axis to your liking
I am fairly new to R, so this may not be the most elegant solution, but it will work.
Replace the 15 with 1 in the $day line if you prefer 1st of the month and the sep in paste should be changed to '-0'.
The problem you're having is because you are using format to create the groupings to use for the subdivision. This makes the values into strings, so that plotting functions don't know to plot them like dates.
The cut function has a method for date-times (cut.POSIXt) that will do what you need: it groups by calendar month and labels each group with the first day of that month, so the labels can be converted back to dates for plotting.
Instead of
North_Monthly_Avg <- aggregate(North, list(Date=format(DateTime, "%Y-%m")),mean)
Just use
North_Monthly_Avg <- aggregate(North, list(Date = cut(DateTime, "month")), mean)
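A small sketch of the cut() approach on synthetic data follows. The values in North are made up (a counter over 60 hourly days, in UTC), and the final conversion with as.Date is an extra step assumed here so that plotting functions see real dates instead of factor labels.

```r
# 60 days of made-up hourly data starting 2008-09-12 01:00 UTC
DateTime <- seq(as.POSIXct("2008-09-12 01:00:00", tz = "UTC"), by = "hour",
                length.out = 24 * 60)
North <- seq_along(DateTime)

# cut() yields a factor whose levels are the first day of each month
monthly <- aggregate(North, list(Date = cut(DateTime, "month")), mean)

# Convert the factor labels back to Date so the x-axis is a time axis
monthly$Date <- as.Date(as.character(monthly$Date))
print(monthly)

# plot(monthly$Date, monthly$x, type = "b")  # x-axis is labelled with dates
```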
You could also try the openair package and use its function timeAverage.
Hourly to monthly:
library(openair)
mydata$date <- as.POSIXct(strptime(mydata$date, format = "%d/%m/%Y %H:%M", tz = "GMT"))
monthly <- timeAverage(mydata, avg.time = "month")

Resources