I noticed some strange xts behaviour when trying to split an object that goes back a long way. The behaviour of split changes at the epoch.
library(xts)
#Create some data
dates <- seq(as.Date("1960-01-01"), as.Date("1980-01-01"), "days")
x <- rnorm(length(dates))
data <- xts(x, order.by=dates)
If we split the xts object by week, prior to 1970 it takes the last day of the week to be Monday; from 1970 onwards it is Sunday (the expected behaviour).
#Split the data, keep the last day of the week
lastdayofweek <- do.call(rbind, lapply(split(data, "weeks"), last))
head(lastdayofweek)
tail(lastdayofweek)
(The original post showed a 1960 calendar and a 1979 calendar here to illustrate the shift; the weekday check below makes the same point.)
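A quick way to confirm which weekday each endpoint falls on (a small check of my own, not part of the original post):
weekdays(index(head(lastdayofweek)))
weekdays(index(tail(lastdayofweek)))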
This seems to only be a problem for weeks, not months or years.
#Split the data, keep the last day of the month
lastdayofmonth <- do.call(rbind, lapply(split(data, "months"), last))
head(lastdayofmonth)
tail(lastdayofmonth)
The behaviour seems likely to be related to the following note from the xts documentation on CRAN, though I am not sure why it would apply only to weeks:
For dates prior to the epoch (1970-01-01) the ending time is aligned to the 59.0000 second. This is
due to a bug/feature in the R implementation of asPOSIXct and mktime0 at the C-source level. This
limits the precision of ranges prior to 1970 to 1 minute granularity with the current xts workaround.
My workaround has been to shift the dates forward before splitting, for pre-1970 data, when splitting on weeks (see the sketch below). I expect someone else has a more elegant solution (or a way to avoid the error).
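For example, here is a minimal sketch of that shift-then-split idea, using the data object from above (offset, shifted and lastdayofweek2 are just illustrative names; any offset that is a whole number of weeks and moves all dates past 1970 will do):
#Shift forward by a whole number of weeks so everything lies after the epoch
offset <- 7 * 530   #530 weeks; a multiple of 7 keeps the weekday alignment
shifted <- xts(coredata(data), order.by = index(data) + offset)
lastdayofweek2 <- do.call(rbind, lapply(split(shifted, "weeks"), last))
#Shift the resulting index back to the original dates
index(lastdayofweek2) <- index(lastdayofweek2) - offset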
EDIT: To be clear as to what the question is, I am looking for an answer that
a) specifies why this happens (so I can understand the nature of the error better, and therefore avoid it) and/or
b) gives the best workaround to deal with it.
One "workaround" would be to check out Rev. 743 or earlier, as it appears to me that this broke in Rev. 744.
svn checkout -r 743 svn://svn.r-forge.r-project.org/svnroot/xts/
But a much better idea is to file a bug report so that you don't have to use an old version forever. (Also, of course, other bugs may have been patched and/or new features added since Rev. 743.)
Related
I have a dataset containing solar power generation over 24 hours for many days, and I need to find the average of the power generated at each time of day; for example, the average of the power generated at 9:00:00 AM.
Start by extracting the hour from the date-time variable.
Assuming your data is called myData:
library(lubridate)
myData$Hour <- hour(strptime(myData$Time, format = "%Y-%m-%d %H:%M:%S"))
Then use ddply from the plyr package, which allows us to apply a function to a subset of data.
library(plyr)
myMeans <- ddply(myData[,c("Hour", "IT_solar_generation")], "Hour", numcolwise(mean))
The resulting data frame will have one column called Hour giving the hour of day, and another with the mean generation at each hour.
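If you would rather avoid the plyr dependency, a base-R sketch of the same calculation (assuming the same column names as above) would be:
myMeans2 <- aggregate(IT_solar_generation ~ Hour, data = myData, FUN = mean)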
Now, on a separate but important note: when you ask a question you should describe the attempts you've made so far to answer it. This isn't a help desk.
I imported date variables as strings from SQL (date1) into Stata and then created a new date variable (date2) like this:
gen double date2 = clock(date1, "YMDhms")
format date2 %tc
However, now I want to calculate the number of days between two dates (date3-date2), formatted as above, but I can't seem to do it.
I don't care about the hms, so perhaps I should strip that out first? And then deconstruct the date into YYYY MM DD as separate variables? Nothing I seem to do is working right now.
It sounds like by dates you actually mean timestamp (aka datetime) variables. In my experience, there's usually no need to cast dates/timestamps as strings since ODBC and Stata will handle the conversion to SIF td/tc formats nicely.
But perhaps you exported to a text file and then read in the data instead. Here are a couple of solutions.
tc timestamps are in milliseconds since 01jan1960 00:00:00.000, assuming 86,400 seconds per day (that is, ignoring leap seconds). This means that you need to divide your difference by 1000*60*60*24 = 86,400,000 milliseconds to get elapsed days.
For example, 2016 was a leap year:
. display (tc(01jan2017 00:00:00) - tc(01jan2016 00:00:00))/(1000*60*60*24)
366
You can also use the dofc() function to make dates out of timestamps and omit the division:
. display (dofc(tc(01jan2018 00:00:00)) - dofc(tc(01jan2016 00:00:00)))
731
2017 is not a leap year, so 366 + 365 = 731 days.
You can use generate with all these functions, though display is often easier for debugging initial attempts.
I have a large data set of logged timestamps corresponding to state changes (e.g., light switch flips) that look like this:
library(data.table)
library(lubridate)
foo <-
data.table(ts = ymd_hms("2013-01-01 01:00:01",
"2013-01-01 05:34:34",
"2013-01-02 14:12:12",
"2013-01-02 20:01:00",
"2013-01-02 23:01:00",
"2013-01-03 03:00:00",
"2013-05-04 05:00:00"),
state = c(1, 0, 1, 0, 0, 1, 0) )
And I'm trying to (1) convert the history of state logs into run-times in seconds, and (2) convert these into daily cumulative run-times. Most (but not all) of the time, consecutive logged state values alternate. This is a kludgy start, but it falls a little short.
foo[, dif := c(as.numeric(diff(ts), units = "secs"), NA)]  #pad with NA so lengths match
foo[state==1][, list(runtime = sum(dif)), .(floor_date(ts, "day"))]
In particular, when the state is "on" during a period that crosses midnight, this approach isn't smart enough to split things up, and it incorrectly reports a runtime longer than one day. Using diff is not very intelligent either, since it will make mistakes if there are consecutive identical states or NAs.
Any suggestions that will correctly resolve runtimes while still being fast and efficient for large data sets?
This should work. I played around with different starting values of foo, but there could still be some edge cases I didn't factor in. One thing to note: if your real data has a timezone that observes daylight saving time, this will break when making the data.table with all dates. You can work around that by doing a force_tz to UTC or GMT first and changing it back later (see the sketch below). On the other hand, if you need to account for a 25-hour or 23-hour day, you'll need to strategically change the timestamps back to your timezone.
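For example, a minimal sketch of that force_tz step (my own illustration, assuming ts is the timestamp column; you can convert back afterwards):
#force the timestamps to UTC so every calendar day in the join table is exactly 24 hours
foo[, ts := force_tz(ts, tzone = "UTC")]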
#I'm using devel version of data.table which includes shift function for leading/lagging variables
foo[,(paste0("next",names(foo))):=shift(.SD,1,0,"lead")]
#shift with fill=NA produced an error for some reason; this is a workaround
foo[nrow(foo),`:=`(nextts=NA,nextstate=NA)]
#make data.table with every date from min ts to max ts
complete<-data.table(datestamp=seq(from=floor_date(foo[,min(ts)],unit="day"),to=ceiling_date(foo[,max(ts)],unit="day"),by="days"))
#make column for end of day
complete[,enddate:=datestamp+hours(23)+minutes(59)+seconds(59.999)]
#set keys and then do overlapping join
setkey(foo,ts,nextts)
setkey(complete,datestamp,enddate)
overlap<-foverlaps(foo[state==1],complete,type="any")
#compute run time for each row
overlap[,runtime:=as.numeric(difftime(pmin(datestamp+days(1),nextts),pmax(datestamp,ts),units="secs"))]
#summarize down to seconds per day
overlap[,list(runtime=sum(runtime)),by=datestamp]
I want to create a single column with a sequence of date/times increasing every hour for one year or one month (for example). I was using code like this to generate the sequence:
start.date<-"2012-01-15"
start.time<-"00:00:00"
interval<-60 # 60 minutes
increment.mins<-interval*60
x<-paste(start.date,start.time)
for(i in 1:365){
print(strptime(x, "%Y-%m-%d %H:%M:%S")+i*increment.mins)
}
However, I am not sure how to specify the range of the sequence of dates and hours. Also, I have been having problems dealing with the first hour "00:00:00". What is the best way to specify the length of the date/time sequence for a month, year, etc.? Any suggestion will be appreciated.
I would strongly recommend using the POSIXct data type. This way you can use seq without any problems and work with the resulting timestamps however you want.
start <- as.POSIXct("2012-01-15")
interval <- 60
end <- start + as.difftime(1, units="days")
seq(from=start, by=interval*60, to=end)
Now you can do whatever you want with your vector of timestamps.
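For instance, to get a full year of hourly timestamps rather than a single day, the same seq call works with length.out (hourly.year is just an illustrative name):
hourly.year <- seq(from = start, by = "1 hour", length.out = 365*24)  #one value per hour for 365 days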
Try this. mondate is very clever about advancing by a month. For example, it will advance the last day of Jan to the last day of Feb, whereas other date/time classes tend to overshoot into Mar. chron does not use time zones, so this code cannot hit the time zone bugs you can get with POSIXct. Here x is from the question.
library(chron)
library(mondate)
start.time.num <- as.numeric(as.chron(x))
# +1 means one month. Use +12 if you want one year.
end.time.num <- as.numeric(as.chron(paste(mondate(x)+1, start.time)))
# 1/24 means one hour. Change as needed.
hours <- as.chron(seq(start.time.num, end.time.num, 1/24))
I want to apply a function to 20 trading days worth of hourly FX data (as one example amongst many).
I started off with rollapply(data,width=20*24,FUN=FUN,by=24). That seemed to be working well, I could even assert I always got 480 bars passed in... until I realized that wasn't what I wanted. The start and end time of those 480 bars was drifting over the years, due to changes in daylight savings, and market holidays.
So, what I want is a function that treats a day as from 22:00 to 22:00 of each day we have data for. (21:00 to 21:00 in N.Y. summertime - my data timezone is UTC, and daystart is defined at 5pm ET)
So, I made my own rollapply function with this at its core:
ep=endpoints(data,on=on,k=k)
sp=ep[1:(length(ep)-width)]+1
ep=ep[(width+1):length(ep)]
xx <- lapply(1:length(ep), function(ix) FUN(.subset_xts(data,sp[ix]:ep[ix]),...) )
I then called this with on="days", k=1 and width=20.
This has two problems:
Days is in days, not trading days! So, instead of typically 4 weeks of data, I get just under 3 weeks of data.
The cutoff is midnight UTC. I cannot work out how to change it to use the 22:00 (or 21:00) cutoff.
UPDATE: Problem 1 above is wrong! The xts endpoints function does work in trading days, not calendar days. The reason I thought otherwise is that the timezone issue made it look like a 6-day trading week: Sun to Fri. Once the timezone problem was fixed (see my self-answer), using width=20 and on="days" does indeed give me 4 weeks of data.
(The typically there is important: when there is a trading holiday during those 4 weeks I expect to receive 4 weeks 1 day's worth of data, i.e. always exactly 20 trading days.)
I started working on a function to cut the data into weeks, thinking I could then cut them into five 24hr chunks, but this feels like the wrong approach, and surely someone has invented this wheel before me?
Here is how to get the daybreak right:
x2=x
index(x2)=index(x2)+(7*3600)
indexTZ(x2)='America/New_York'
I.e. just setting the timezone puts the daybreak at 17:00; we want it to be at 24:00, so add 7 hours on first.
With help from:
time zones in POSIXct and xts, converting from GMT in R
Here is the full function:
library(xts)
rollapply_chunks.FX.xts=function(data,width,FUN,...,on="days",k=1){
data <- try.xts(data)
x2 <- data
index(x2) <- index(x2)+(7*3600)
indexTZ(x2) <- 'America/New_York'
ep <- endpoints(x2,on=on,k=k) #The end point of each calendar day (when on="days").
#Each entry points to the final bar of the day. ep[1]==0.
if(length(ep)<2){
stop("Cannot divide data up")
}else if(length(ep)==2){ #Can only fit one chunk in.
sp <- 1;ep <- ep[-1]
}else{
sp <- ep[1:(length(ep)-width)]+1
ep <- ep[(width+1):length(ep)]
}
xx <- lapply(1:length(ep), function(ix) FUN(.subset_xts(data,sp[ix]:ep[ix]),...) )
xx <- do.call(rbind,xx) #Join them up as one big matrix/data.frame.
tt <- index(data)[ep] #Implicit align="right". Use sp for align="left"
res <- xts(xx, tt)
return (res)
}
You can see we use the modified index to split up the original data. (If R uses copy-on-write under the covers, then the only extra memory requirement should be for a copy of the index, not of the data.)
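For completeness, a hypothetical call (my own example, assuming data is an hourly xts series and that the statistic you want is, say, the column means over each 20-trading-day window):
res <- rollapply_chunks.FX.xts(data, width=20, FUN=colMeans)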
(Legal bit: please consider it licensed under MIT, but explicit permission given to use in the GPL-2 XTS package if that is desired.)