R time series missing values - r

I was working with a time series dataset having hourly data. The data contained a few missing values so I tried to create a dataframe (time_seq) with the correct time value and do a merge with the original data so the missing values become 'NA'.
> data
date value
7980 2015-03-30 20:00:00 78389
7981 2015-03-30 21:00:00 72622
7982 2015-03-30 22:00:00 65240
7983 2015-03-30 23:00:00 47795
7984 2015-03-31 08:00:00 37455
7985 2015-03-31 09:00:00 70695
7986 2015-03-31 10:00:00 68444
//converting the date in the data to POSIXct format.
> data$date <- format.POSIXct(data$date,'%Y-%m-%d %H:%M:%S')
// creating a dataframe with the correct sequence of dates.
> time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00"),
to = as.POSIXct("2015-04-30 23:00:00"), by = "hour")
> df <- data.frame(date=time_seq)
> df
date
8013 2015-03-30 20:00:00
8014 2015-03-30 21:00:00
8015 2015-03-30 22:00:00
8016 2015-03-30 23:00:00
8017 2015-03-31 00:00:00
8018 2015-03-31 01:00:00
8019 2015-03-31 02:00:00
8020 2015-03-31 03:00:00
8021 2015-03-31 04:00:00
8022 2015-03-31 05:00:00
8023 2015-03-31 06:00:00
8024 2015-03-31 07:00:00
// merging with the original data
> a <- merge(data,df, x.by = data$date, y.by = df$date ,all=TRUE)
> a
date value
4005 2014-07-23 07:00:00 37003
4006 2014-07-23 07:30:00 NA
4007 2014-07-23 08:00:00 37216
4008 2014-07-23 08:30:00 NA
The values I get after merging are incorrect and they contain half-hourly values. What would be the correct approach for solving this?
Why are is the merge result in 30 minute intervals when both my dataframes are hourly?
PS:I looked into this question : Fastest way for filling-in missing dates for data.table and followed the steps but it didn't help.

You can use the padr package to solve this problem.
library(padr)
library(dplyr) #for the pipe operator
data %>%
pad() %>%
fill_by_value()

Related

R: Merge dataframes by nearest datetime

I have two dataframes of different lengths: NROW(data) = 20000
NROW(database) = 8000
Both of dataframes have date time values in a format as : YYYY-MM-DD HH-MM-SS which are not the same in each dataframe
What I want is to merge them by the nearest date-time and keep only the records that exist in database.
I tried the approach posted in another stackexchange post
[R – How to join two data frames by nearest time-date?][1]
based on data.table library. I tried following but without success:
require("data.table")
database <- data.table(database)
data <- data.table(data)
setkey( data, "timekey")
setkey( database, "timekeyd")
database <- data[ database, roll = "nearest"]
But the merge was almost completely wrong. You can see how the merged was performed in the following table that has only the two keys (timekey and timekeyd)
1 2017-11-01 00:00:00 2017-10-31 21:00:00
2 2017-11-01 00:00:00 2017-10-31 22:10:00
3 2017-11-02 19:00:00 2017-11-02 21:00:00
4 2017-11-02 19:00:00 2017-11-02 21:00:00
5 2017-11-03 20:08:00 2017-11-03 22:10:00
6 2017-11-04 19:00:00 2017-11-04 21:00:00
7 2017-11-04 19:00:00 2017-11-04 21:00:00
8 2017-11-05 19:00:00 2017-11-05 21:10:00
9 2017-11-07 18:00:00 2017-11-07 20:00:00

Using subset on dates giving shifted dates from the desired time frame

I have a data frame (called homeAnew) from which the head is as follows.
date total
1 2014-01-01 00:00:00 0.756
2 2014-01-01 01:00:00 0.717
3 2014-01-01 02:00:00 0.643
4 2014-01-01 03:00:00 0.598
5 2014-01-01 04:00:00 0.604
6 2014-01-01 05:00:00 0.638
I wanted to extract explicit dates and I originally used:
Hourly <- subset(homeAnew,date >= "2014-04-10 00:00:00" & date <= "2015-04-10 00:00:00")
However the result was a dataframe that started at 2014-04-09 12:00:00 and ended 2015-04-09 12:00:00. Basically it was shifted back 12 hours from where I wanted it.
I was able to use
Date1<-as.Date("2014-04-10 00:00:00")
Date2<-as.Date("2015-04-10 00:00:00")
Hourly<-homeAnew[homeAnew$date>=Date1 & homeAnew$date<=Date2,]
To get what was after but I was wondering if someone could explain to me why subset would work like that?

R : how to get the rolling mean of a variable over the last few days but only at a given hour?

Consider this
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by="hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3,length(time), replace=TRUE)
df2 <- data.frame(time,group,value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value
over the last 5 days (not including the current observation)
only considering observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess having no ideas how to do that.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and hour of the time variable, normally the rollmean() will include the current observation, but you can use shift() function to exclude the current observation from the rollmean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
rollmean(value, 5, na.pad = TRUE, align = 'right'),
n = 1,
type = 'lag'),
time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00

Check if Posixct time is within interval

The problem:
I have two dataframes that I would like to merge depending on the date/time of one dataframe being in the interval of the other dataframe.
traffic: Date and Time (Posixct), Frequency
mydata: Interval, Sum of Frequency
I would now like to calculate if the Posixct time from traffic is within the interval of mydata and if this is TRUE I would like to count the frequency in the column "Sum of Frequencies" in mydata.
The two problems, that I encountered:
1. traffic data frame has significantly more rows than mydata. I dont know how to tell R to loop through every observation in traffic to check for one row in mydata.
There can be more than one observation fitting in the frequency interval of mydata. I want R to add up all frequencies of the different traffic observations to get a total score of frequencies. Also the intervals are overlapping.
Here is the data:
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
mydata$SumFrequency <- NA
The expected outcome should be something like this:
mydata$SumFrequency <- c(24, 2, 2)
head(mydata)
I tried int_overlaps from package lubridate.
Any tips on how to solve this are higly appreciated!
A short solution with foverlaps from the data.table package:
mydata <- data.table(DateTime1, DateTime2, key = c("DateTime1", "DateTime2"))
traffic <- data.table(start = DateTime, end = DateTime, Frequency, key = c("start","end"))
foverlaps(traffic, mydata, type="within", nomatch=0L)[, .(sumFreq = sum(Frequency)),
by = .(DateTime1, DateTime2)]
which gives:
DateTime1 DateTime2 sumFreq
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
On a data.table approach with between to filter traffic dataset on time:
setDT(traffic)
setDT(mydata)
mydata[,SumFrequency := as.numeric(SumFrequency)] # coerce logical to numeric for next step.
mydata[,SumFrequency := sum( traffic[ DateTime %between% c(DateTime1, DateTime2), Frequency] ), by=1:nrow(mydata)]
which give:
DateTime1 DateTime2 Interval SumFrequency
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2
If there's a lot of row in mydata, it could be better to create an index column and use it in by clause:
mydata[, idx := .I]
mydata[, SumFrequency := sum( traffic[DateTime %between% c(DateTime1, DateTime2),Frequency] ),by=idx]
And this gives:
DateTime1 DateTime2 Interval SumFrequency idx
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24 1
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2 3
I see two solutions :
With data.frame and plyr
You could do it using %within% function in lubridate and with a for-loop or using plyr loop functions like dlply
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
library(plyr)
# Create a group-by variable
mydata$NumInt <- 1:nrow(mydata)
mydata$SumFrequency <- dlply(mydata, .(NumInt),
function(row){
sum(
traffic[traffic$DateTime %within% row$Interval, "Frequency"]
)
})
mydata
#> DateTime1 DateTime2
#> 1 2014-11-01 04:00:00 2014-11-01 04:15:00
#> 2 2015-08-01 04:03:00 2015-08-01 04:13:00
#> 3 2015-08-01 14:00:00 2015-08-01 14:15:00
#> Interval NumInt SumFrequency
#> 1 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 1 24
#> 2 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
#> 3 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 3 2
With data.table and functions foverlaps
data.table has implemented a function for overlapping joins that you could use in your case with a little trick.
This functions is foverlaps (I uses below data.table 1.9.6)
(see How to perform join over date ranges using data.table? and this presentation)
Notice that you do not need to create interval with lubridate
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.table(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
mydata <- data.table(DateTime1 = as.POSIXct(DateTime1), DateTime2 = as.POSIXct(DateTime2))
# Use function `foverlaps` for overlapping joins
# Here's the trick : create a dummy variable to artificially have an interval
traffic[, dummy:=DateTime]
setkey(mydata, DateTime1, DateTime2)
# do the join
mydata2 <- foverlaps(traffic, mydata, by.x=c("DateTime", "dummy"), type ="within", nomatch=0L)[, dummy := NULL][]
mydata2
#> DateTime1 DateTime2 DateTime Frequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 1
#> 2: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:03:00 2
#> 3: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:06:00 3
#> 4: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:08:00 5
#> 5: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:10:00 12
#> 6: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:12:00 1
#> 7: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:13:00 2
#> 8: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:15:00 1
#> 9: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:13:00 1
# summarise with a sum by grouping by each line of mydata
setkeyv(mydata2, key(mydata))
mydata2[mydata, .(SumFrequency = sum(Frequency)), by = .EACHI]
#> DateTime1 DateTime2 SumFrequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
#> 2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
#> 3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
As far as point 2 is concerned you can use aggregate for instance
aggData <- aggregate(traffic$Frequency~format(traffic$DateTime, "%Y%m%d h:m"), data=traffic, sum)
This sums all frequencies in minute intervals.
And for point 1. Wouldn't a merge work?
merge(x = myData, y = aggData, by = "DateTime", all.x = TRUE)
The outer merge is explained here
Using a for.loop we could do something like this:
for(i in 1:nrow(mydata)) {
mydata$SumFrequency[i] <- sum(traffic$Frequency[traffic$DateTime %within% mydata$Interval[i]])
}
> mydata
# DateTime1 DateTime2 Interval SumFrequency
#1 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
#2 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
#3 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2

subset by vector in r

I am trying to subset an xts object of OHLC hourly data with a vector.
If i create the vector myself with the following command
lookup = c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
testdfx[lookup]
testdfx[lookup]
I get the correct data displayed which shows all the hours that match the dates in the vector (00:00 to 23:00.
> head(testdfx[lookup])
open high low close
2012-01-12 00:00:00 1.27081 1.27217 1.27063 1.27211
2012-01-12 01:00:00 1.27212 1.27216 1.27089 1.27119
2012-01-12 02:00:00 1.27118 1.27166 1.27017 1.27133
2012-01-12 03:00:00 1.27134 1.27272 1.27133 1.27261
2012-01-12 04:00:00 1.27260 1.27262 1.27141 1.27183
2012-01-12 05:00:00 1.27183 1.27230 1.27145 1.27165
> tail(testdfx[lookup])
open high low close
2012-03-19 18:00:00 1.32451 1.32554 1.32386 1.32414
2012-03-19 19:00:00 1.32417 1.32465 1.32331 1.32372
2012-03-19 20:00:00 1.32373 1.32415 1.32340 1.32372
2012-03-19 21:00:00 1.32373 1.32461 1.32366 1.32376
2012-03-19 22:00:00 1.32377 1.32424 1.32359 1.32366
2012-03-19 23:00:00 1.32364 1.32406 1.32333 1.32336
However when I extract a dates from an object and create a vector to use for subsetting I only get the hours of 00:00-19:00 displayed in my subset.
> head(testdfx[dates])
open high low close
2007-01-05 00:00:00 1.3092 1.3093 1.3085 1.3088
2007-01-05 01:00:00 1.3087 1.3092 1.3075 1.3078
2007-01-05 02:00:00 1.3079 1.3091 1.3078 1.3084
2007-01-05 03:00:00 1.3083 1.3084 1.3073 1.3074
2007-01-05 04:00:00 1.3073 1.3080 1.3061 1.3071
2007-01-05 05:00:00 1.3070 1.3072 1.3064 1.3069
> tail(euro[nfp.releases])
open high low close
2014-01-10 14:00:00 1.35892 1.36625 1.35728 1.36366
2014-01-10 15:00:00 1.36365 1.36784 1.36241 1.36743
2014-01-10 16:00:00 1.36742 1.36866 1.36693 1.36719
2014-01-10 17:00:00 1.36720 1.36752 1.36579 1.36617
2014-01-10 18:00:00 1.36617 1.36663 1.36559 1.36624
2014-01-10 19:00:00 1.36630 1.36717 1.36585 1.36702
I have compared both objects containing the require dates and they appear to be the same.
> class(lookup)
[1] "character"
> class(nfp.releases)
[1] "character"
> str(lookup)
chr [1:4] "2012-01-12" "2012-01-31" "2012-03-05" "2012-03-19"
> str(nfp.releases)
chr [1:86] "2014-02-07" "2014-01-10" "2013-12-06" "2013-11-08" ..
I am new to R but have tried everything over the past 3 days to get this to work. If I can't to it this way I will end up having to create a variable by hand but as its got 86 dates this may take some time.
Thanks in advance.
I cannot reproduce your problem
lookup = c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
time_index <- seq(from = as.POSIXct("2012-01-01 07:00"), to = as.POSIXct("2012-05-17 18:00"), by = "hour")
set.seed(1)
value <- matrix(rnorm(n = 4*length(time_index)),length(time_index),4)
testdfx <- xts(value, order.by = time_index)
testdfx[lookup[1]]
testdfx["2012-01-12"]
Thanks for the response guys I actually thought i had deleted this thread but obviously not.
The problem in the case above was to be found around 3' from the computer. When looking through the data I was only interested in Fridays which also means that the FX market is closing down for the week end.
Sorry to have wasted your time and Admin please remove.

Resources