Subset by vector in R

I am trying to subset an xts object of OHLC hourly data with a vector.
If I create the vector myself with the following command:
lookup = c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
testdfx[lookup]
I get the correct data, which shows all the hours that match the dates in the vector (00:00 to 23:00).
> head(testdfx[lookup])
open high low close
2012-01-12 00:00:00 1.27081 1.27217 1.27063 1.27211
2012-01-12 01:00:00 1.27212 1.27216 1.27089 1.27119
2012-01-12 02:00:00 1.27118 1.27166 1.27017 1.27133
2012-01-12 03:00:00 1.27134 1.27272 1.27133 1.27261
2012-01-12 04:00:00 1.27260 1.27262 1.27141 1.27183
2012-01-12 05:00:00 1.27183 1.27230 1.27145 1.27165
> tail(testdfx[lookup])
open high low close
2012-03-19 18:00:00 1.32451 1.32554 1.32386 1.32414
2012-03-19 19:00:00 1.32417 1.32465 1.32331 1.32372
2012-03-19 20:00:00 1.32373 1.32415 1.32340 1.32372
2012-03-19 21:00:00 1.32373 1.32461 1.32366 1.32376
2012-03-19 22:00:00 1.32377 1.32424 1.32359 1.32366
2012-03-19 23:00:00 1.32364 1.32406 1.32333 1.32336
However, when I extract dates from an object and create a vector to use for subsetting, I only get the hours 00:00 to 19:00 in my subset.
> head(testdfx[dates])
open high low close
2007-01-05 00:00:00 1.3092 1.3093 1.3085 1.3088
2007-01-05 01:00:00 1.3087 1.3092 1.3075 1.3078
2007-01-05 02:00:00 1.3079 1.3091 1.3078 1.3084
2007-01-05 03:00:00 1.3083 1.3084 1.3073 1.3074
2007-01-05 04:00:00 1.3073 1.3080 1.3061 1.3071
2007-01-05 05:00:00 1.3070 1.3072 1.3064 1.3069
> tail(euro[nfp.releases])
open high low close
2014-01-10 14:00:00 1.35892 1.36625 1.35728 1.36366
2014-01-10 15:00:00 1.36365 1.36784 1.36241 1.36743
2014-01-10 16:00:00 1.36742 1.36866 1.36693 1.36719
2014-01-10 17:00:00 1.36720 1.36752 1.36579 1.36617
2014-01-10 18:00:00 1.36617 1.36663 1.36559 1.36624
2014-01-10 19:00:00 1.36630 1.36717 1.36585 1.36702
I have compared both objects containing the required dates and they appear to be the same.
> class(lookup)
[1] "character"
> class(nfp.releases)
[1] "character"
> str(lookup)
chr [1:4] "2012-01-12" "2012-01-31" "2012-03-05" "2012-03-19"
> str(nfp.releases)
chr [1:86] "2014-02-07" "2014-01-10" "2013-12-06" "2013-11-08" ..
I am new to R but have tried everything over the past 3 days to get this to work. If I can't do it this way I will end up having to create a variable by hand, but as it has 86 dates this may take some time.
Thanks in advance.

I cannot reproduce your problem:
library(xts)
lookup = c("2012-01-12", "2012-01-31", "2012-03-05", "2012-03-19")
time_index <- seq(from = as.POSIXct("2012-01-01 07:00"), to = as.POSIXct("2012-05-17 18:00"), by = "hour")
set.seed(1)
value <- matrix(rnorm(n = 4*length(time_index)),length(time_index),4)
testdfx <- xts(value, order.by = time_index)
testdfx[lookup[1]]
testdfx["2012-01-12"]

Thanks for the responses, guys. I actually thought I had deleted this thread, but obviously not.
The problem in the case above was to be found about 3 feet from the computer. When looking through the data I was only interested in Fridays, which is also when the FX market closes down for the weekend, so the data for those days stops early.
Sorry to have wasted your time. Admin, please remove.
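A quick way to catch this kind of surprise is to check which weekday each lookup date falls on. The following is a minimal sketch, assuming a few of the dates shown in the str(nfp.releases) output above; `dates_to_check` is a hypothetical name. The `%u` format gives the ISO weekday number (5 means Friday), so the result does not depend on the locale.

```r
# Dates copied from the str(nfp.releases) output in the question
dates_to_check <- c("2014-02-07", "2014-01-10", "2013-12-06", "2013-11-08")

# ISO weekday number as a string: "1" = Monday ... "7" = Sunday
weekday_number <- format(as.Date(dates_to_check), "%u")

# TRUE for every date that falls on a Friday
is_friday <- weekday_number == "5"
print(data.frame(date = dates_to_check, is_friday = is_friday))
```

If every date is a Friday, a subset that stops before 23:00 is more likely a property of the data than of the subsetting call.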

Related

Calculating mean and sd of bedtime (hh:mm) in R - problem are times before/after midnight

I got the following dataset:
data <- read.table(text="
wake_time sleep_time
08:38:00 23:05:00
09:30:00 00:50:00
06:45:00 22:15:00
07:27:00 23:34:00
09:00:00 23:00:00
09:05:00 00:10:00
06:40:00 23:28:00
10:00:00 23:30:00
08:10:00 00:10:00
08:07:00 00:38:00", header=T)
I used the chron-package to calculate the average wake_time:
> mean(times(data$wake_time))
[1] 08:20:12
But when I do the same for the variable sleep_time, this happens:
> mean(times(data$sleep_time))
[1] 14:04:00
I guess the result is distorted because the sleep_time contains times before and after midnight.
But how can I solve this problem?
Additionally:
How can I calculate the sd of the times? I want to report it as, for example, "mean wake-up time 08:20 ± 44 min".
The time values are stored as numbers between 0 and 1 representing a fraction of a day. If the sleep time is earlier than the wake time, you can "add a day" before taking the mean. For example
library(chron)
wake <- times(data$wake_time)
sleep <- times(data$sleep_time)
times(mean(ifelse(sleep < wake, sleep+1, sleep)))
# [1] 23:40:00
And since the values are parts of a day, if you want the sd in minutes, you'd take the partial day values and convert to minutes
sd(ifelse(sleep < wake, sleep+1, sleep) * 24*60)
# [1] 47.60252
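The same idea can be sketched in base R without chron, working in minutes since midnight and building the "mean ± sd" string the question asks for. This is only an illustration under assumptions: `to_minutes`, `wake_min`, `sleep_min`, and `label` are hypothetical names, and "+/-" is used in place of "±" to keep the code ASCII-only.

```r
# Convert "hh:mm:ss" strings to minutes since midnight (base R only)
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)), numeric(1))
}

wake_min <- to_minutes(c("08:38:00", "09:30:00", "06:45:00", "07:27:00", "09:00:00",
                         "09:05:00", "06:40:00", "10:00:00", "08:10:00", "08:07:00"))
sleep_min <- to_minutes(c("23:05:00", "00:50:00", "22:15:00", "23:34:00", "23:00:00",
                          "00:10:00", "23:28:00", "23:30:00", "00:10:00", "00:38:00"))

# Bedtimes after midnight are earlier than the wake time: push them to the next day
sleep_min <- ifelse(sleep_min < wake_min, sleep_min + 1440, sleep_min)

# Round the mean to whole minutes, then wrap back into a 24-hour clock
mean_min <- round(mean(sleep_min)) %% 1440
label <- sprintf("%02d:%02d +/- %.0f min",
                 as.integer(mean_min %/% 60),
                 as.integer(mean_min %% 60),
                 sd(sleep_min))
print(label)  # "23:40 +/- 48 min"
```

The "earlier than wake time" rule is the same heuristic as in the answer above; it would need rethinking for data where someone goes to bed after midnight but also wakes very early.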

R: Merge dataframes by nearest datetime

I have two dataframes of different lengths: NROW(data) = 20000
NROW(database) = 8000
Both dataframes have date-time values in the format YYYY-MM-DD HH:MM:SS, but the values are not the same in each dataframe.
What I want is to merge them by the nearest date-time and keep only the records that exist in database.
I tried the approach posted in another Stack Exchange post,
"R – How to join two data frames by nearest time-date?",
based on the data.table library. I tried the following, but without success:
require("data.table")
database <- data.table(database)
data <- data.table(data)
setkey( data, "timekey")
setkey( database, "timekeyd")
database <- data[ database, roll = "nearest"]
But the merge was almost completely wrong. You can see how the merge was performed in the following table, which shows only the two keys (timekey and timekeyd):
1 2017-11-01 00:00:00 2017-10-31 21:00:00
2 2017-11-01 00:00:00 2017-10-31 22:10:00
3 2017-11-02 19:00:00 2017-11-02 21:00:00
4 2017-11-02 19:00:00 2017-11-02 21:00:00
5 2017-11-03 20:08:00 2017-11-03 22:10:00
6 2017-11-04 19:00:00 2017-11-04 21:00:00
7 2017-11-04 19:00:00 2017-11-04 21:00:00
8 2017-11-05 19:00:00 2017-11-05 21:10:00
9 2017-11-07 18:00:00 2017-11-07 20:00:00
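For comparison, a nearest-time match can be sketched in base R so that the pairing and the remaining gap are easy to inspect. Everything below is illustrative: the toy rows and the `reading`, `id`, `nearest`, and `gap_min` names are hypothetical, and both keys are deliberately parsed with tz = "UTC". The roughly two-hour gap in most rows of the table above might point to the two keys having been parsed in different time zones, but the post does not confirm that.

```r
# Hypothetical toy data; the key column names follow the question
data <- data.frame(
  timekey = as.POSIXct(c("2017-11-01 00:00:00", "2017-11-02 19:00:00",
                         "2017-11-03 20:08:00"), tz = "UTC"),
  reading = c(10, 20, 30)
)
database <- data.frame(
  timekeyd = as.POSIXct(c("2017-10-31 23:50:00", "2017-11-02 19:20:00",
                          "2017-11-03 20:00:00"), tz = "UTC"),
  id = c("a", "b", "c")
)

# For each database row, the index of the data row with the smallest absolute gap
nearest <- vapply(
  as.numeric(database$timekeyd),
  function(t) which.min(abs(as.numeric(data$timekey) - t)),
  integer(1)
)

# Keep every database row and attach its nearest data row
merged <- cbind(database, data[nearest, , drop = FALSE])
rownames(merged) <- NULL

# Gap in minutes; a near-constant value would hint at a systematic offset
merged$gap_min <- as.numeric(difftime(merged$timekey, merged$timekeyd, units = "mins"))
print(merged)
```

This brute-force search compares every pair of rows, so it is meant for checking a sample rather than replacing the keyed rolling join on the full tables.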

Using subset on dates giving shifted dates from the desired time frame

I have a data frame (called homeAnew) whose head is as follows.
date total
1 2014-01-01 00:00:00 0.756
2 2014-01-01 01:00:00 0.717
3 2014-01-01 02:00:00 0.643
4 2014-01-01 03:00:00 0.598
5 2014-01-01 04:00:00 0.604
6 2014-01-01 05:00:00 0.638
I wanted to extract explicit dates and I originally used:
Hourly <- subset(homeAnew,date >= "2014-04-10 00:00:00" & date <= "2015-04-10 00:00:00")
However the result was a dataframe that started at 2014-04-09 12:00:00 and ended 2015-04-09 12:00:00. Basically it was shifted back 12 hours from where I wanted it.
I was able to use
Date1<-as.Date("2014-04-10 00:00:00")
Date2<-as.Date("2015-04-10 00:00:00")
Hourly<-homeAnew[homeAnew$date>=Date1 & homeAnew$date<=Date2,]
to get what I was after, but I was wondering if someone could explain to me why subset would work like that?
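One possible explanation, which the post does not confirm, is a time zone mismatch: when a POSIXct column is compared with a character string, R converts the string with as.POSIXct() in the session's time zone, which may differ from the zone the column was created in. A minimal sketch that avoids the ambiguity by building the bounds explicitly follows; the toy data, the "UTC" zone, and the `start`/`end` names are assumptions.

```r
# Hypothetical hourly data stored in UTC
homeAnew <- data.frame(
  date = seq(as.POSIXct("2014-04-09 00:00:00", tz = "UTC"),
             as.POSIXct("2014-04-11 00:00:00", tz = "UTC"), by = "hour")
)
homeAnew$total <- seq_len(nrow(homeAnew))

# Build the bounds as POSIXct in the same time zone as the column
start <- as.POSIXct("2014-04-10 00:00:00", tz = "UTC")
end   <- as.POSIXct("2014-04-10 23:00:00", tz = "UTC")

Hourly <- subset(homeAnew, date >= start & date <= end)

# Print the first and last timestamps in the zone used for the bounds
print(format(range(Hourly$date), "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

Checking attr(homeAnew$date, "tzone") and Sys.timezone() on the real data would show whether the two zones actually differ.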

Split time series data hourly in R

I have time-series data sampled at a 10-minute rate. I want to split it hour-wise, but to my surprise split.xts is not producing the intended results. The steps used are:
library(xts)
set.seed(123)
Sys.setenv(TZ="Asia/Kolkata")
timeind <- seq(as.POSIXct("2017-01-20 00:00:00 IST"),
as.POSIXct("2017-01-20 23:59:59 IST"),by="10 min") #for indexing
df <- xts(runif(length(timeind),30,50),timeind) #xts data frame
split(df,"hours",k=1)
OUTPUT IS:
[[1]]
[,1]
2017-01-20 00:00:00 31.24343
2017-01-20 00:10:00 32.57921
2017-01-20 00:20:00 40.17684
[[2]]
[,1]
2017-01-20 00:30:00 41.89185
2017-01-20 00:40:00 30.93997
2017-01-20 00:50:00 31.76651
2017-01-20 01:00:00 49.07364
2017-01-20 01:10:00 34.79113
2017-01-20 01:20:00 48.13881
Expected output is:
[[1]]
[,1]
2017-01-20 00:00:00 31.24343
2017-01-20 00:10:00 32.57921
2017-01-20 00:20:00 40.17684
2017-01-20 00:30:00 41.89185
2017-01-20 00:40:00 30.93997
2017-01-20 00:50:00 31.76651
[[2]]
2017-01-20 01:00:00 49.07364
2017-01-20 01:10:00 34.79113
2017-01-20 01:20:00 48.13881
...
Why is split.xts not working properly?
It's a known bug. If the index timezone happens to be one that is not a round hour offset from UTC, endpoints does not work correctly (because its calculations are based on UTC).
For example, Asia/Kolkata is UTC+0530, so endpoints aligns on half-hours.
A possible work-around would be to subtract 30 minutes from the index before calling split, then add 30 minutes back to each element of the result (shifting in the other direction works equally well, since either way the local hours land on round UTC hours). This might cause issues around daylight saving time, if the timezone observes it.
df_adjusted <- df
.index(df_adjusted) <- .index(df_adjusted) - 60 * 30
by_hour <- lapply(split(df_adjusted, "hours"),
function(x) { .index(x) <- .index(x) + 60 * 30; x })
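The half-hour misalignment can be seen directly from the underlying epoch seconds, using base R only. This is a small sketch, assuming the system's time zone database knows "Asia/Kolkata"; `midnight_ist` and the other variable names are hypothetical.

```r
# Local midnight in a UTC+05:30 zone
midnight_ist <- as.POSIXct("2017-01-20 00:00:00", tz = "Asia/Kolkata")

# Seconds past the most recent round UTC hour; 1800 means half past
seconds_past_utc_hour <- as.numeric(midnight_ist) %% 3600
print(seconds_past_utc_hour)  # 1800

# After shifting by 30 minutes the local hour boundary sits on a round UTC hour
shifted <- midnight_ist - 60 * 30
seconds_after_shift <- as.numeric(shifted) %% 3600
print(seconds_after_shift)  # 0
```

Because the hour boundaries are computed on the UTC scale, a local hour that starts at 1800 seconds past a UTC hour gets cut in the middle, which matches the 00:30 break in the output above.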

R time series missing values

I was working with a time series dataset having hourly data. The data contained a few missing values so I tried to create a dataframe (time_seq) with the correct time value and do a merge with the original data so the missing values become 'NA'.
> data
date value
7980 2015-03-30 20:00:00 78389
7981 2015-03-30 21:00:00 72622
7982 2015-03-30 22:00:00 65240
7983 2015-03-30 23:00:00 47795
7984 2015-03-31 08:00:00 37455
7985 2015-03-31 09:00:00 70695
7986 2015-03-31 10:00:00 68444
# converting the date in the data to POSIXct format
> data$date <- format.POSIXct(data$date,'%Y-%m-%d %H:%M:%S')
# creating a dataframe with the correct sequence of dates
> time_seq <- seq(from = as.POSIXct("2014-05-01 00:00:00"),
to = as.POSIXct("2015-04-30 23:00:00"), by = "hour")
> df <- data.frame(date=time_seq)
> df
date
8013 2015-03-30 20:00:00
8014 2015-03-30 21:00:00
8015 2015-03-30 22:00:00
8016 2015-03-30 23:00:00
8017 2015-03-31 00:00:00
8018 2015-03-31 01:00:00
8019 2015-03-31 02:00:00
8020 2015-03-31 03:00:00
8021 2015-03-31 04:00:00
8022 2015-03-31 05:00:00
8023 2015-03-31 06:00:00
8024 2015-03-31 07:00:00
# merging with the original data
> a <- merge(data,df, x.by = data$date, y.by = df$date ,all=TRUE)
> a
date value
4005 2014-07-23 07:00:00 37003
4006 2014-07-23 07:30:00 NA
4007 2014-07-23 08:00:00 37216
4008 2014-07-23 08:30:00 NA
The values I get after merging are incorrect and they contain half-hourly values. What would be the correct approach for solving this?
Why is the merge result in 30-minute intervals when both my dataframes are hourly?
PS: I looked into the question "Fastest way for filling-in missing dates for data.table" and followed the steps, but it didn't help.
You can use the padr package to solve this problem.
library(padr)
library(dplyr) #for the pipe operator
data %>%
pad() %>%
fill_by_value()
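A base R alternative can be sketched as well. Two details of the question's code are worth noting: format.POSIXct() returns character strings rather than POSIXct values, and merge() takes by, by.x, and by.y rather than x.by and y.by. Whether these account for the half-hourly rows is not established by the post. The toy rows and the "UTC" zone below are assumptions.

```r
# Hypothetical hourly data with a gap between 23:00 and 08:00
data <- data.frame(
  date = as.POSIXct(c("2015-03-30 22:00:00", "2015-03-30 23:00:00",
                      "2015-03-31 08:00:00", "2015-03-31 09:00:00"), tz = "UTC"),
  value = c(65240, 47795, 37455, 70695)
)

# Complete hourly grid over the observed range, same class and time zone as data$date
time_seq <- seq(from = min(data$date), to = max(data$date), by = "hour")
df <- data.frame(date = time_seq)

# Left join on the shared column; hours missing from `data` get NA in `value`
a <- merge(df, data, by = "date", all.x = TRUE)
print(a)
```

Keeping both date columns as POSIXct in one explicit time zone means the join compares instants rather than formatted strings.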
