Search for a particular time in an xts index in R

I have an xts object whose index looks like the sample below; there is one entry for every hour of the day for a year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
The columns hold values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 over the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.

You can do this by using interaction to create a factor from the combination of weekdays() and .indexhour(), then using split to select the relevant observations from your xts object.
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
#            count        sd
# Friday.0      60 1.0301030
# Monday.0      59 0.9204670
# Saturday.0    60 0.9842125
# Sunday.0      60 0.9500347
# Thursday.0    60 0.9506620
# Tuesday.0     59 0.8972697
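Since the result is a matrix whose rows are named weekday.hour, you can pull out just the combination the question asks about; a minimal sketch (assuming hour 18 is the one holding the 18:59 stamps, as in the OP's data):
# standard deviation (and count) for Mondays during hour 18
output["Monday.18", ]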

You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
require(quantmod)         # loads xts, which provides the .index* functions
fxts[.indexmon(fxts)==0]  # it's zero-based (!) and gives you all the January values
fxts[.indexmday(fxts)==1] # beginning of the month
fxts[.indexwday(fxts)==1] # Mondays
> fxts
                    value
2011-01-02 19:58:00     1
2011-01-02 20:59:00     2
2011-01-03 18:59:00     3
2011-01-09 19:58:00     4
2011-01-09 20:59:00     5
2011-01-10 18:59:00     6
2011-01-16 18:59:00     7
2011-01-16 19:58:00     8
2011-01-16 20:59:00     9
fxts[.indexwday(fxts)==1] # this gives you all the Mondays
For subsetting by time of day you use
fxts["T19:30/T20:00"] # this gives you the daily time window you are looking for
And here is how you combine weekday and time window (subset the window first, then keep only Mondays):
mon <- .indexwday(fxts["T18:30/T21:00"]) == 1 # a logical vector marking Mondays within the window
fxts["T18:30/T21:00"][mon]                    # the matching values
                    value
2011-01-03 18:59:00     3
2011-01-10 18:59:00     6
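Tying this back to the original question, the standard deviation of the Monday 18:59 values follows directly (a sketch; the T18:30/T19:00 window assumes 18:59 is the only stamp falling inside it):
mon_1859 <- fxts["T18:30/T19:00"][.indexwday(fxts["T18:30/T19:00"]) == 1]
sd(coredata(mon_1859))  # one value per Monday, ~52 per year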

Related

Filter a data frame by two time series

Hi, I am new to R and would like to know if there is a simple way to filter data over multiple date ranges.
I have data with dates from 07.03.2003 to 31.12.2016.
I need to split/filter the data by multiple date ranges, as per below.
Dates required in the new data frame:
07.03.2003 to 06.03.2005
and
01.01.2013 to 31.12.2016
i.e. the new data frame should not include dates from 07.03.2005 to 31.12.2012.
Let's take the following data.frame with dates:
library(lubridate)
df <- data.frame(date = c(ymd("2017-02-02"), ymd("2016-02-02"), ymd("2014-02-01"), ymd("2012-01-01")))
date
1 2017-02-02
2 2016-02-02
3 2014-02-01
4 2012-01-01
I can filter this for a range of dates using lubridate::ymd and dplyr::between:
library(dplyr)
df1 <- filter(df, between(date, ymd("2017-01-01"), ymd("2017-03-01")))
date
1 2017-02-02
Or:
df2 <- filter(df, between(date, ymd("2013-01-01"), ymd("2014-04-01")))
date
1 2014-02-01
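To get both ranges into a single data frame, as the question asks, you can combine the two between() conditions with | (a sketch, using the question's end points):
library(dplyr)
library(lubridate)
df_new <- filter(df,
                 between(date, ymd("2003-03-07"), ymd("2005-03-06")) |
                 between(date, ymd("2013-01-01"), ymd("2016-12-31")))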
I would go with lubridate. In particular:
library(data.table)
library(lubridate)
set.seed(555)  # in order to be reproducible
N <- 1000      # number of pseudorandom numbers to be generated
date1 <- dmy("07-03-2003")
date2 <- dmy("06-03-2005")
date3 <- dmy("01-01-2013")
date4 <- dmy("31-12-2016")
Creating a data table with two columns (dates and numbers):
my_dt <- data.table(date_sample = sample(seq(date1, date4, by = "day"), N),
                    numeric_sample = sample(N, replace = FALSE))
> head(my_dt)
date_sample numeric_sample
1: 2007-04-11 2
2: 2006-04-20 71
3: 2007-12-20 46
4: 2016-05-23 78
5: 2011-10-07 5
6: 2003-09-10 47
Let's impose some cuts:
forbidden_dates <- interval(date2 + 1, date3 - 1)  # create the interval that dates should not fall in
> forbidden_dates
[1] 2005-03-07 UTC--2012-12-31 UTC
test_date1 <- dmy("08-03-2003")  # should not fall in the above range
test_date2 <- dmy("08-03-2005")  # should fall in the above range
Therefore:
test_date1 %within% forbidden_dates
[1] FALSE
test_date2 %within% forbidden_dates
[1] TRUE
A good way of visualizing the cut is to plot the data before and after applying it:
plot(my_dt)  # before the cut
my_dt <- my_dt[!(date_sample %within% forbidden_dates)]  # applying the temporal cut
plot(my_dt)  # after the cut
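A quick sanity check after the cut (should return FALSE if the filter worked):
# no remaining dates may fall inside the forbidden interval
any(my_dt$date_sample %within% forbidden_dates)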

period.apply over an hour with a chosen start time

So I have an xts time series over the year with time zone "UTC". The time interval between each row is 15 minutes.
x1 x2
2014-12-31 23:15:00 153.0 0.0
2014-12-31 23:30:00 167.1 5.4
2014-12-31 23:45:00 190.3 4.1
2015-01-01 00:00:00 167.1 9.7
As I want data over one hour to allow for comparison with other data sets, I tried to use period.apply:
dat <- period.apply(dat, endpoints(dat,on="hours",k=1), colSums)
The problem is that the first row in my new data set is 2014-12-31 23:45:00 and not 2015-01-01 00:00:00. I tried changing the endpoint vector but somehow it keeps saying that it is out of bounds. I also thought this was my answer: https://stats.stackexchange.com/questions/5305/how-to-re-sample-an-xts-time-series-in-r/19003#19003 but it was not. I don't want to change the names of my columns, I want to sum over a different interval.
Here is a reproducible example:
library(xts)
seq <- seq(from = ISOdate(2014, 12, 31, 23, 15), length.out = 100, by = "15 min", tz = "UTC")
xts <- xts(rep(1, 100), order.by = seq)
period.apply(xts, endpoints(xts,on="hours",k=1), colSums)
And the result looks like this:
2014-12-31 23:45:00 3
2015-01-01 00:45:00 4
2015-01-01 01:45:00 4
2015-01-01 02:45:00 4
and ends up like this:
2015-01-01 21:45:00 4
2015-01-01 22:45:00 4
2015-01-01 23:45:00 4
2015-01-02 00:00:00 1
Whereas I would like it to always sum over the same interval, meaning I would like only 4s.
(I am using RStudio 0.99.903 with R x64 3.3.2)
The problem is that you're using endpoints, but you want to align by the start of the interval, not the end. I thought you might be able to use a custom startpoints function (posted in another answer), but that produced weird results.
The basic idea of the work-around below is to subtract a small amount from all index values, then use endpoints and period.apply to aggregate. Then call align.time on the result. I'm not sure if this is a general solution, but it seems to work for your example.
library(xts)
seq <- seq(from = ISOdate(2014, 12, 31, 23, 15), length.out = 100, by = "15 min", tz = "UTC")
xts <- xts(rep(1, 100), order.by = seq)
# create a temporary object
tmp <- xts
# subtract a small amount of time from each index value
.index(tmp) <- .index(tmp)-0.001
# aggregate to hourly
agg <- period.apply(tmp, endpoints(tmp, "hours"), colSums)
# round index up to next hour
agg_aligned <- align.time(agg, 3600)
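If the work-around does what we want, every hour should now sum to 4 (four 15-minute observations per hour); a quick check:
head(agg_aligned)
table(coredata(agg_aligned))  # should show only 4s: 25 complete hours of four observations each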

Calculate running difference of time using difftime on one column of timestamps

How would you calculate the time difference, in minutes, between two consecutive rows of timestamps and add the result to a new column?
I have tried this:
data$hours <- as.numeric(floor(difftime(timestamps(data), (timestamps(data)[1]), units="mins")))
But that only gives the difference from the first timestamp onwards, not between consecutive rows.
Here is example data with the 'mins' column that I want added:
timestamps mins
2013-06-23 00:00:00 NA
2013-06-23 01:00:00 60
2013-06-23 02:00:00 60
2013-06-23 04:00:00 120
The code that you're using with the [1] is always referencing the first element of the timestamps vector.
To do what you want, you want to look at all but the first element minus all but the last element.
mytimes <- data.frame(timestamps = c("2013-06-23 00:00:00",
                                     "2013-06-23 01:00:00",
                                     "2013-06-23 02:00:00",
                                     "2013-06-23 04:00:00"),
                      mins = NA,
                      stringsAsFactors = FALSE)  # keep timestamps as character (matters before R 4.0)
mytimes$mins <- c(NA, difftime(mytimes$timestamps[-1],
                               mytimes$timestamps[-nrow(mytimes)],
                               units = "mins"))
What this code does is:
Set up a data frame so that you will keep the lengths of timestamps and mins the same.
Within that data frame, put the timestamps you have and the fact that you don't have any mins yet (i.e. NA).
Select all but the first element of timestamps mytimes$timestamps[-1]
Select all but the last element of timestamps mytimes$timestamps[-nrow(mytimes)]
Subtract them difftime (since they're well-formatted, you don't first have to make them POSIXct objects) with the units of minutes. units="mins"
Put an NA in front because you have one fewer difference than you have rows c(NA, ...)
Drop all of that back into the original data frame's mins column mytimes$mins <-
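A more compact equivalent is base diff(); a sketch (relying on as.numeric() for difftime objects accepting a units argument):
times <- as.POSIXct(mytimes$timestamps)
mytimes$mins <- c(NA, as.numeric(diff(times), units = "mins"))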
Another option is to calculate it with this approach:
# create some data for an MWE
hrs <- c(0,1,2,4)
df <- data.frame(timestamps = as.POSIXct(paste("2015-12-17",
paste(hrs, "00", "00", sep = ":"))))
df
# timestamps
# 1 2015-12-17 00:00:00
# 2 2015-12-17 01:00:00
# 3 2015-12-17 02:00:00
# 4 2015-12-17 04:00:00
# create a helper that returns a vector lagged by n periods
# (note: c() strips the POSIXct class, so the lagged values come back as epoch seconds)
lag <- function(x, n) c(rep(NA, n), x[1:(length(x) - n)])
# create a new column named mins: POSIXct minus epoch seconds leaves the
# difference in seconds, so divide by 60 to get minutes
df$mins <- as.numeric(df$timestamps - lag(df$timestamps, 1)) / 60
df
# timestamps mins
# 1 2015-12-17 00:00:00 NA
# 2 2015-12-17 01:00:00 60
# 3 2015-12-17 02:00:00 60
# 4 2015-12-17 04:00:00 120
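For completeness, dplyr ships a built-in lag() that keeps the POSIXct class, so the same column can be computed without a hand-rolled helper (a sketch, assuming dplyr is installed):
library(dplyr)
# POSIXct minus POSIXct yields a difftime; convert it explicitly to minutes
df$mins <- as.numeric(df$timestamps - dplyr::lag(df$timestamps, 1), units = "mins")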

Aggregate 5 minute data to hourly sums with NA's

My problem is as follows: I've got a time series with 5-minute precipitation data like:
Datum mm
1 2004-04-08 00:05:00 NA
2 2004-04-08 00:10:00 NA
3 2004-04-08 00:15:00 NA
4 2004-04-08 00:20:00 NA
5 2004-04-08 00:25:00 NA
6 2004-04-08 00:30:00 NA
With this structure:
'data.frame': 1098144 obs. of 2 variables:
$ Datum: POSIXlt, format: "2004-04-08 00:05:00" "2004-04-08 00:10:00" "2004-04-08 00:15:00" "2004-04-08 00:20:00" ...
$ mm : num NA NA NA NA NA NA NA NA NA NA ...
As you can see, the time series begins with a lot of NA's, but there is measured precipitation further down, although riddled with occasional single NA's due to malfunctions of the measuring station.
What I'm trying to achieve is to sum the measured precipitation into hourly sums, ignoring the NA's.
This is what I tried so far:
sums <- aggregate(precip["mm"],
                  list(cut(precip$Datum, "1 hour")), sum)
Even though the timestamps are correctly aggregated to hours, all sums come out as 0 or NA, even for hours that contain no NA's at all.
Additionally to be taken into account:
Hourly precipitation sums in meteorology always describe the cumulative sum up to a certain hour: the amount of precipitation at 0:00 describes the sum from 23:00 the previous day until 0:00. So I always need to sum up the previous hour.
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-03-08 23:00:00")
r <- seq(s, s + 1e4, "30 min")
precip <- data.frame(Datum = r, mm = sample(c(1:5, NA), 6, replace = TRUE))
Datum mm
2004-03-08 23:00:00 4
2004-03-08 23:30:00 1
2004-03-09 00:00:00 2
2004-03-09 00:30:00 4
2004-03-09 01:00:00 1
2004-03-09 01:30:00 4
With the above example, the result I am looking for is:
Datum mm
2004-03-09 00:00:00 5
2004-03-09 01:00:00 6
2004-03-09 02:00:00 5
Try adding na.rm=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
# Group.1 mm
# 1 2004-04-08 00:00:00 26
# 2 2004-04-08 01:00:00 35
# 3 2004-04-08 02:00:00 25
Reproducible Example
set.seed(1120)
s <- as.POSIXlt("2004-04-08 00:05:00")
r <- seq(s, s + 1e4, "5 min")
precip <- data.frame(Datum = r, mm = sample(c(1:5, NA), 34, replace = TRUE))
addendum
To your second question: if you would like measurements that fall exactly on the hour to be grouped with the preceding hour, add right=TRUE:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour", right=TRUE)), sum, na.rm=TRUE)
Further Explanation
Here is a more detailed demonstration of how the grouping works:
p <- c("2004-04-07 23:48:20", "2004-04-08 00:00:00", "2004-04-08 00:03:20")
ptime <- as.POSIXlt(p)
#[1] "2004-04-07 23:48:20 EDT" "2004-04-08 00:00:00 EDT" "2004-04-08 00:03:20 EDT"
We have three dates to separate into groups. If we use cut without any extra arguments, the second entry "2004-04-08 00:00:00 EDT" will be grouped with the third entry for hour "00:00":
cut(ptime, "1 hour")
#[1] 2004-04-07 23:00:00 2004-04-08 00:00:00 2004-04-08 00:00:00
But if we add the argument right=TRUE we can group it with the "23:00" hour:
cut(ptime, "1 hour", right=TRUE)
#[1] 2004-04-07 23:00:00 2004-04-07 23:00:00 2004-04-08 00:00:00
We can specify the behavior of edge cases.
edit
With your new data the original solution produces the desired output:
aggregate(precip['mm'], list(cut(precip$Datum, "1 hour")), sum, na.rm=TRUE)
Group.1 mm
1 2004-03-08 23:00:00 5
2 2004-03-09 00:00:00 6
3 2004-03-09 01:00:00 5
You can use dplyr to calculate the sums like this:
library(dplyr)
precip$hour <- strftime(precip$Datum, "%Y-%m-%d %H")
sum_hour <- precip %>%
  group_by(hour) %>%
  summarise(sum_hour = sum(mm, na.rm = TRUE))
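Note that this dplyr version labels each sum with the hour the measurements fall in, not the hour they run up to. To match the meteorological convention described in the question (the value at 0:00 covers 23:00 to 0:00), you can label by the end of the interval instead; a sketch using lubridate::ceiling_date, where change_on_boundary = TRUE pushes an on-the-hour stamp into the following hour, matching the desired output shown in the question:
library(dplyr)
library(lubridate)
sum_hour <- precip %>%
  mutate(hour = ceiling_date(Datum, "hour", change_on_boundary = TRUE)) %>%
  group_by(hour) %>%
  summarise(mm = sum(mm, na.rm = TRUE))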

Identify date format in R before converting

I have a simple data set with a date column and a value column. I noticed that the date sometimes comes in mm/dd/yy (%m/%d/%y) format and other times in mm/dd/YYYY (%m/%d/%Y) format. What is the best way to standardize the dates so that I can do other calculations without the formatting causing issues?
I tried the answers provided here
Changing date format in R
and here
How to change multiple Date formats in same column
Neither of these were able to fix the problem.
Below is a sample of the data
Date, Market
12/17/09,1.703
12/18/09,1.700
12/21/09,1.700
12/22/09,1.590
12/23/2009,1.568
12/24/2009,1.520
12/28/2009,1.500
12/29/2009,1.450
12/30/2009,1.450
12/31/2009,1.450
1/4/2010,1.440
When I read it into a new vector using something like this:
dt <- as.Date(inp$Date, format="%m/%d/%y")
I get the following output for the above segment
dt Market
2009-12-17 1.703
2009-12-18 1.700
2009-12-21 1.700
2009-12-22 1.590
2020-12-23 1.568
2020-12-24 1.520
2020-12-28 1.500
2020-12-29 1.450
2020-12-30 1.450
2020-12-31 1.450
2020-01-04 1.440
As you can see, we skipped from 2009 to 2020 at 12/23 because of the change in formatting. Any help is appreciated. Thanks.
# trim 4-digit years down to 2 digits, then parse everything with %y
dat$Date <- gsub("[0-9]{2}([0-9]{2})$", "\\1", dat$Date)
dat$Date <- as.Date(dat$Date, format = "%m/%d/%y")
dat
Date Market
# 1 2009-12-17 1.703
# 2 2009-12-18 1.700
# 3 2009-12-21 1.700
# 4 2009-12-22 1.590
# 5 2009-12-23 1.568
# 6 2009-12-24 1.520
# 7 2009-12-28 1.500
# 8 2009-12-29 1.450
# 9 2009-12-30 1.450
# 10 2009-12-31 1.450
# 11 2010-01-04 1.440
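An alternative sketch, assuming lubridate is available: its parsers accept both 2- and 4-digit years for the same order string, so the mixed column should parse in one call without the gsub step:
library(lubridate)
inp$Date <- mdy(inp$Date)  # handles "12/17/09" and "12/23/2009" alike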
