Starting minute in minute data aggregation in xts - r

I have a question about the xts to.hourly and split methods. They seem to assume that if my timestamp is 12:00 it refers to the minute 12:00-12:01. However, my data provider has 11:59-12:00 in mind. I couldn't find any parameter for this. Is the only solution to simply lag my time series by one minute?

Your question is actually about the endpoints function, which is what to.hourly, split.xts, and several other functions use to define intervals.
The endpoints function "assumes" that your timestamps are actual datetimes, and a time of 12:00:00 is the beginning of the 12 o'clock hour. If your data have 1-minute resolution, a time of 12:00:00 falls in the interval of 12:00:00.0-12:00:59.999.
There is no parameter you can change to make xts behave as if a time of 12:00:00 means anything other than what it actually is.
If you're certain that your data provider is putting a timestamp of 12:00 on data that occur in the interval of 11:59:00.0-11:59:59.999, then you might be able to simply subtract a small value from the xts object index.
# assuming 'x' is your xts object
.index(x) <- .index(x) - 0.001
That said, you need to think carefully about doing this because making a mistake can cause you to have look-ahead bias.
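Here is a minimal sketch of the idea, using made-up 1-minute data (not from the original answer), showing how nudging the index back changes which hourly bucket the 12:00:00 observation falls into:
library(xts)
# four 1-minute observations straddling the top of the hour
idx <- seq(as.POSIXct("2024-01-02 11:58:00", tz = "UTC"), by = "1 min", length.out = 4)
x <- xts(1:4, order.by = idx)
endpoints(x, "hours")       # 12:00:00 opens the 12 o'clock hour
.index(x) <- .index(x) - 0.001
endpoints(x, "hours")       # 12:00:00 (now 11:59:59.999) closes the 11 o'clock hour
Functions like to.hourly() and split() would then treat the 12:00 bar as part of the preceding hour.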

Related

What is the difference between force_tz() and with_tz() in the lubridate package of R?

Can someone please tell me the difference between with_tz() and force_tz() in the lubridate package of R? As I was coding I got the same result with either, so I'm confused about the circumstances in which I should use with_tz() over force_tz(), and vice versa.
meeting <- ymd_hms("1998-08-28 09:00:00", tz = "Asia/Kathmandu")
mistake1 <- with_tz(meeting, "America/Chicago")
with_tz(mistake1, "Asia/Kathmandu")
mistake2 <- force_tz(meeting, "America/Chicago")
with_tz(mistake2, "Asia/Kathmandu")
mistake3 <- force_tz(meeting, "America/Chicago")
force_tz(mistake3, "Asia/Kathmandu")
So, following the documentation, the key difference is that:
with_tz returns a date-time as it would appear in a different time zone. The actual moment of time measured does not change, just the time zone it is measured in.
whereas the other function
force_tz returns the date-time that has the same clock time as input time, but in the new time zone.
So, in other words, one function re-expresses a time point in a different time zone, whereas the other only changes the time zone designation. This is a crucial distinction: if you picture a timeline, with_tz still refers to the same point on that timeline, while force_tz moves you to the point that has the same clock time in the new time zone.
This actually shows how well designed the lubridate package is. Consider a scenario where you want to say something about events taking place in the evening across multiple localities. In that context you want to preserve the local clock times; depending on the details it may make sense to recode the timestamps so they all carry the same time zone, and then you can select events taking place after 8pm (see the short sketch after the example below).
In an alternative scenario you may want to look at events that took place within 18 hours of an arbitrary time point across multiple localities. Here you need to bring those events onto a common time zone so you can measure the span of hours between them. In that context the local hour of the event (whether it was evening or morning) is not relevant, only whether the event took place within 18 hours of point X (say point X is an afternoon GMT time), so you want the rest of the events on the GMT scale as well.
library("lubridate")
x <- ymd_hms("2009-08-07 00:00:01", tz = "America/New_York")
with_tz(x, "GMT")
# [1] "2009-08-07 04:00:01 GMT"
force_tz(x, "GMT")
# [1] "2009-08-07 00:00:01 GMT"
with_tz(x, "America/Costa_Rica")
# [1] "2009-08-06 22:00:01 CST"
force_tz(x, "America/Costa_Rica")
# [1] "2009-08-07 00:00:01 CST"
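As a hypothetical illustration of the first scenario (the cities and times below are made up): force_tz() puts local clock times from different zones onto one common scale, so "after 8pm local time" can be tested directly. For the second scenario you would use with_tz() instead, so that every event refers to the same actual instant.
ny  <- ymd_hms("2020-05-01 21:30:00", tz = "America/New_York")
ldn <- ymd_hms("2020-05-01 19:15:00", tz = "Europe/London")
# same nominal zone, same clock times: "evening" now means the same thing everywhere
local_clock <- c(force_tz(ny, "UTC"), force_tz(ldn, "UTC"))
local_clock[hour(local_clock) >= 20]
# [1] "2020-05-01 21:30:00 UTC"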

timedeltas and datetimes subtraction and converting to duration in minutes

I am at a standstill with this problem. I outlined it in another question (Creating data histograms/visualizations using ipython and filtering out some values), which meandered a bit, so I'd like to refocus the question and give it more context, since I am sure others must have run into this or have a workaround. I've also seen similar, but not identical, questions asked and can't quite adapt any of the solutions given so far.
I have columns in my data frame for Start Time and End Time and created a 'Duration' column for time lapsed. I'm using ipython.
The Start Time/End Time columns have fields that look like:
2014/03/30 15:45
A date and then a time in hh:mm
when I type:
pd.to_datetime(df['End Time']) and
pd.to_datetime(df['Start Time'])
I get fields resulting that look like:
2014-03-30 15:45:00
same date but with hyphens and same time but with :00 seconds appended
I then decided to create a new column for the difference between the End and Start times. The 'Duration' or time lapsed column was created by typing in one command:
df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time'])
The format of the fields in the duration column is:
01:14:00
no date just a time lapsed in the format hh:mm:ss
to indicate time lapsed or 74 mins in the above example.
When I type:
df.Duration.dtype
dtype('m8[ns]') is returned, whereas, when I type
df.Duration.head(4)
0 00:14:00
1 00:16:00
2 00:03:00
3 00:09:00
Name: Duration, dtype: timedelta64[ns]
is returned which seems to indicate a different dtype for Duration.
How can I convert the format I have in the Duration column to a single integer value of minutes (time lapsed)? I see no methods that I can use, and I'd write a function but wouldn't know how to treat the hh:mm:ss input. This must be a common requirement in data analysis. Should I be converting these dates and times differently if my end goal is a single integer indicating minutes lapsed? Should I just be using Excel? I have so far spent a day on this problem and it should be a simple one to solve.
Update:
THANK YOU!! (Jeff and Dataswede) I added a column with the command:
df['Durationendminusstart'] = pd.to_timedelta(df.Duration,unit='ns').astype('timedelta64[m]')
which seems to give me the Duration (minutes lapsed) as wanted, so that huge part is solved!
What is still not clear is why there were two different dtypes for the same column depending on how I asked; oh well, right now it doesn't matter.

Create a custom time zone

Is it possible to create a custom time zone in R for handling datetime objects?
More specifically, I am interested in dealing with POSIXct objects, and would like to create a time zone that corresponds to "US/Eastern" minus 17 hours. Time zones with a similar offset do not follow the same daylight saving convention as the US.
The reason for using a time zone so defined comes from FX trading, for which 5 pm EST is a reasonable 'midnight'.
When you are concerned about a specific "midnight-like" time for each day, I assume that you want to obtain a date (without a time) which switches over at that time. If that is your intention, then how about simply subtracting 17 hours (= 17*3600 seconds) from your vector of times and taking the date of the resulting POSIXct value?
That would avoid complicated time zone manipulations, which are usually handled not by R itself but by the underlying C library, as far as I know, so they might be difficult to achieve from within R. Instead, all computations would be performed in EST, and you'd still get a different switchover time than the local midnight.
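A rough sketch of that idea (the example timestamps are made up, and I'm assuming your times are stored as POSIXct in "US/Eastern"):
x <- as.POSIXct(c("2023-03-10 16:59:00", "2023-03-10 17:01:00"), tz = "US/Eastern")
# shift back 17 hours so the day "switches over" at 5pm Eastern, then take the date
trading_day <- as.Date(x - 17 * 3600, tz = "US/Eastern")
trading_day
# [1] "2023-03-09" "2023-03-10"
Observations up to 16:59 fall on the previous day and anything from 17:00 onward starts a new one, without defining a custom time zone.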

Index xts using string and return only observations at that exact time

I have an xts time series in R and am using the very handy feature of subsetting the series with a date/time string, for example
time_series["17/06/2006 12:00:00"]
This will return the nearest observation to that date/time - which is very handy in many situations. However, in this particular situation I only want to return the elements of the time series which are at that exact time. Is there a way to do this in xts using a nice date/time string like this?
In a more general case (I don't have this problem immediately now, but suspect I may run into it soon) - is it possible to extract the closest observation within a certain period of time? For example, the closest observation to the given date/time, assuming it is within 10 minutes of the given date/time - otherwise just discard that observation.
I suspect this more general case may require me writing a function to do this - which I am happy to do - I just wanted to check whether the more specific case (or the general case) was already catered for in xts.
AFAIK, the only way to do this is to use a subset that begins at the time you're interested in, then get the first observation of that.
e.g.
first(time_series["2006-06-17 12:00:00/2006-06-17 12:01"])
or, more generally, to get the 12:00 price every day, you can subset down to 1 minute of each day, then split by days and extract the first observation of each.
do.call(rbind, lapply(split(time_series["T12:00:00/T12:01"],'days'), first))
Here's a thread where Jeff (the xts author) contemplates adding the functionality you want
http://r.789695.n4.nabble.com/Find-first-trade-of-day-in-xts-object-td3598441.html#a3599887
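For the more general case you mention, a small helper along these lines might work (a rough sketch, not something built into xts; nearest_obs and its arguments are my own names):
nearest_obs <- function(x, when, tol = 10 * 60) {
  # return the observation of 'x' closest to 'when', but only if it is
  # within 'tol' seconds; otherwise return an empty xts object
  when <- as.POSIXct(when, tz = tzone(x))
  d <- abs(as.numeric(index(x)) - as.numeric(when))
  i <- which.min(d)
  if (d[i] <= tol) x[i] else x[0]
}
nearest_obs(time_series, "2006-06-17 12:00:00")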

Date conversion query

I've come into possession of hundreds of ASCII data files where the date and time are separate columns, like so:
date time
1-Jan-08 23:05
I need to convert this to a usable R date/time object, subtract 8 hours (the timezone conversion from UTC to Pacific) and then turn it into unix time. I need to do this since the data are collected every evening (from 5pm through 2am the following morning), so if I were to use a regular date/time format it would confound days (day 1 would span two dates when in fact it was just one evening of data collection). I'd like to consider each day's events separately.
Using unixtime will allow me to calculate time differences in events that occur each day (I will probably retain a date field in addition to the unix time). Can someone suggest an efficient way to do this?
Here is some data to use (this is in UTC)
dummy <- data.frame(date = "1-Jan-08", time = "23:05")
Paste them together (which works vectorised) and then parse, e.g.
datetime <- paste(dummy$date, dummy$time)
parsed <- strptime(datetime, "%d-%b-%y %H:%M")
which you can also assign as columns in the data frame.
Edit: strptime() has an optional tz="" argument you can use.
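To sketch the rest of the workflow described in the question (parse as UTC, apply the fixed 8-hour shift to Pacific, keep a date field, and get unix time); this is just one possible way to do it:
dummy <- data.frame(date = "1-Jan-08", time = "23:05")
datetime <- paste(dummy$date, dummy$time)
parsed  <- as.POSIXct(strptime(datetime, "%d-%b-%y %H:%M", tz = "UTC"))
pacific <- parsed - 8 * 3600          # fixed 8-hour shift, as described in the question
unixtime <- as.numeric(pacific)       # seconds since the epoch, after the shift
date_field <- as.Date(pacific)        # retained date field
unixtime
# [1] 1199199900
date_field
# [1] "2008-01-01"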
