Efficient chunking of time series in R

I am working with a number of large time series of currency pair pricing data in R. The files tend to be 100-300 MB in size, and I will generally be working with three files at a time. I am looking for a (much) more efficient way of working with the TIME column of these data.
My data begins looking like :
PAIR TIME BID ASK
1 USD/JPY 2012-01-02 00:00:00.307 77.023 77.055
2 USD/JPY 2012-01-02 00:00:00.493 77.030 77.049
3 USD/JPY 2012-01-02 00:00:05.003 77.030 77.050
4 USD/JPY 2012-01-02 00:00:05.005 77.023 77.056
5 USD/JPY 2012-01-02 00:00:05.006 77.024 77.056
6 USD/JPY 2012-01-02 00:00:06.008 77.023 77.056
... ... ... ...
R has no problem understanding the TIME column. For instance,
USDJPY$TIME[2] - USDJPY$TIME[1]
Gives output
Time difference of 0.1860001 secs
The data are already organized into files by month. Unfortunately, these monthly files are still much too large, and I would like to break the pricing data down by 'trading week' instead.
Forex trading occurs in continuous multi-day stretches, usually from Monday to Friday. Some trading holidays will suspend trading, and there will not be data on these days. The nature of trading scheduling is such that, if
USDJPY$TIME[t+1] - USDJPY$TIME[t]
... is greater than 12 hours, time t is the last time index for that week in USDJPY.
I have not found an acceptable way to break the data into trading weeks, by indices, or otherwise. All my attempts end up hanging. The USDJPY file contains ~1,900,000 rows.
One approach I tried :
USDJPY.diff <- c()   # grows by one element at every iteration
for(i in 1:(length(USDJPY$TIME)-1)){
  USDJPY.diff <- c(USDJPY.diff, USDJPY$TIME[i+1]-USDJPY$TIME[i])
}
takes far too long (I quit before it could finish)
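For reference, the pairwise gaps can also be computed in a single vectorized call instead of growing a vector inside a loop, which is what makes the attempt above hang. A base R sketch:
gaps <- diff(USDJPY$TIME)                   # difftime vector of length n - 1
units(gaps) <- "hours"                      # compare the gaps in hours
week_ends <- which(as.numeric(gaps) > 12)   # indices t where a trading week ends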

I would think data.table should speed things up here quite a bit:
library(data.table) #1.9.5
setDT(data)
data[, DIFF := as.numeric(difftime(TIME, shift(TIME, n = 1, type = "lag"), units = "hours"))]
Week number calculation (increment whenever the gap exceeds 12 hours; the first DIFF is NA, so exclude it from the comparison, and add 1 so weeks start at 1):
data[, Week.num := cumsum(!is.na(DIFF) & DIFF > 12) + 1]
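Once Week.num exists, the trading weeks can be pulled out individually or split into a list of per-week tables (a sketch using the same data object as above):
week1 <- data[Week.num == 1]          # first trading week
weeks <- split(data, data$Week.num)   # list with one table per trading week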

Related

Need formula on how to calculate the work completion days on average

A client gives 100 tasks to an employee.
The employee completes 50 tasks in 1 day,
20 tasks in 2 days,
15 tasks in 3 days,
4 tasks in 4 days,
5 tasks in 6 days,
6 tasks in 10 days.
Now I want to know, on average, how many days the employee takes to complete 1 task.
I need a formula for this query.
Assuming tasks are not completed in parallel (i.e. days are mutually exclusive with respect to completing/working on tasks), average days per task = 0.26:
=SUM(B2:B7)/SUM(A2:A7)
This is where the solution should terminate - however, I provide a number of checks/alternative approaches which serve to demonstrate (unequivocally) the veracity of the above function...
checks
check 1
The same can be derived using the 'weighted average' calculation:
=SUM((B2:B7/A2:A7)*A2:A7)/SUM(A2:A7)
check 2
Intuitively, if each task takes ~0.26 days to complete, and there are 100 tasks, then the total duration should be ~26 days; summing column B with =SUM(B2:B7) gives exactly that: 26.
check 3
If still unconvinced, you can calculate the average days per task for each category/type (i.e. for those that take 1,2,3,.., 10 days to complete):
=B2:B7/A2:A7
Then expand these out using sequence / other method:
=SEQUENCE(1,A2,G2,0)
Again, this yields 0.26, which should confirm (unequivocally) the veracity of the simple/direct ratio...
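For a quick cross-check outside the spreadsheet, the same arithmetic in R (a sketch; tasks and days mirror columns A and B of the sheet):
tasks <- c(50, 20, 15, 4, 5, 6)
days  <- c(1, 2, 3, 4, 6, 10)
sum(days) / sum(tasks)                  # 0.26 days per task on average
weighted.mean(days / tasks, w = tasks)  # same value via the weighted average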
Ta

Subsetting data by multiple date ranges - R

I'll get straight to the point: I have been given some data sets in .csv format containing regularly logged sensor data from a machine. However, this data set also contains measurements taken when the machine is turned off, which I would like to separate from the data logged from when it is turned on. To subset the relevant data I also have a file containing start and end times of these shutdowns. This file is several hundred rows long.
Examples of the relevant files for this problem:
file: sensor_data.csv
sens_name,time,measurement
sens_A,17/12/11 06:45,32.3321
sens_A,17/12/11 08:01,36.1290
sens_B,17/12/11 05:32,17.1122
sens_B,18/12/11 03:43,12.3189
##################################################
file: shutdowns.csv
shutdown_start,shutdown_end
17/12/11 07:46,17/12/11 08:23
17/12/11 08:23,17/12/11 09:00
17/12/11 09:00,17/12/11 13:30
18/12/11 01:42,18/12/11 07:43
To subset data in R, I have previously used the subset() function with simple conditions which has worked fine, but I don't know how to go about subsetting sensor data which fall outside multiple shutdown date ranges. I've already formatted the date and time data using as.POSIXlt().
I'm suspecting some scripting may be involved to come up with a good solution, but I'm afraid I am not yet experienced enough to handle this type of data.
Any help, advice, or solutions will be greatly appreciated. Let me know if there's anything else needed for a solution.
I prefer the POSIXct format for ranges within data frames. We create a logical index that is TRUE for readings falling outside every shutdown interval, i.e. t < shutdown_start OR t > shutdown_end for all shutdown rows. With this index we can then subset the data as necessary:
posixct <- function(x) as.POSIXct(x, format="%d/%m/%y %H:%M")
sensor_data$time <- posixct(sensor_data$time)
shutdowns[] <- lapply(shutdowns, posixct)
ind1 <- sapply(sensor_data$time, function(t) {
  sum(t < shutdowns[,1] | t > shutdowns[,2]) == nrow(shutdowns)})
#Measurements taken while the machine was on (outside all shutdowns)
sensor_data[ind1,]
# sens_name time measurement
# 1 sens_A 2011-12-17 06:45:00 32.3321
# 3 sens_B 2011-12-17 05:32:00 17.1122
#Measurements taken during a shutdown
sensor_data[!ind1,]
# sens_name time measurement
# 2 sens_A 2011-12-17 08:01:00 36.1290
# 4 sens_B 2011-12-18 03:43:00 12.3189
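With several hundred shutdown rows and large sensor files, the row-by-row sapply() above can get slow. A faster alternative (a sketch, assuming the same column names and POSIXct conversion as above) is to loop over the few hundred shutdown intervals instead of over every sensor reading, building one logical flag per reading:
off <- rep(FALSE, nrow(sensor_data))
for (j in seq_len(nrow(shutdowns))) {
  off <- off | (sensor_data$time >= shutdowns$shutdown_start[j] &
                sensor_data$time <= shutdowns$shutdown_end[j])
}
running_data  <- sensor_data[!off, ]   # machine on
shutdown_data <- sensor_data[off, ]    # machine off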

Ensuring temporal data density in R

ISSUE ---------
I have thousands of time series files (.csv) containing intermittent data spanning between 20 and 50 years (see df). Each file contains a date_time and a metric (temperature). The data are hourly, and where no measurement exists there is an NA.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density, i.e. that the ratio of NAs to data values is not too high. To do this, I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NA's
Ensure that no more than 10% of the days in a month are NA's
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially and if the file does not meet the requirements then I must create a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered the value of nested if loops, along with using sqldf, plyr, aggregate or even dplyr. But I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.
I think this will work for you. These functions check every rolling one-day, one-month, or 3-year window and spit out the number of NAs in the respective window. Not tested initially, since I didn't make up data to test them with. For checkdays, a returned value greater than 2.4 (10% of 24 hours) violates your 10% rule; for months the threshold is 72, and for 3-year periods you're hoping for values below 2628 (10% of 26,280 hours). Again, please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
# count the NAs in every rolling 24-hour window (assumes hourly data in column 2)
checkdays <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2])-23)){
    nadata = data[i:(i+23), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
# count the NAs in every rolling 720-hour (30-day) window
checkmonth <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2])-719)){
    nadata = data[i:(i+719), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
# count the NAs in every rolling 26,280-hour (3-year) window
check3years <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2])-26279)){
    nadata = data[i:(i+26279), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these. They work for me. Here are system times for a dataset a year long. So I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above and it wasn't good. For loops work fine for small data sets, but if they're not constructed properly they slow down drastically as the data grow; here the result vector is grown inside the loop, which forces repeated copying. I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading this awesome post Speed up the loop operation in R I rewrote the functions to be much faster. By minimising the amount of work done inside the loop and pre-allocating memory you can really speed things up. You need to call the function like checkdays(df[,2]), but it's faster this way.
# optimized: pre-allocate the result and pass the data column directly, e.g. checkdays(df[,2])
checkdays <- function(data){
  countNA = numeric(length(data)-23)   # pre-allocated result vector
  for(i in 1:(length(data)-23)){
    nadata = data[i:(i+23)]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. Regarding leap years, you should be able to modify the optimized function as I mentioned in the comments; just make sure you specify a leap-year dataset as a second dataset rather than as a second column.
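A possible further speed-up (a sketch, assuming hourly data with the NAs in a column named temp, and the zoo package installed): flag the NAs as 0/1 and count them per window with a rolling sum instead of an explicit loop. The same pattern can also collect the failing files the question asks for; the folder name and helper below are hypothetical.
library(zoo)

density_ok <- function(df) {
  na_flag   <- as.numeric(is.na(df$temp))
  day_NAs   <- rollsum(na_flag, k = 24)    # NAs in every 24-hour window
  month_NAs <- rollsum(na_flag, k = 720)   # NAs in every 720-hour (30-day) window
  # the 3-year check follows the same pattern with k = 26280
  all(day_NAs <= 2.4) && all(month_NAs <= 72)
}

# collect the files that fail the density checks (hypothetical folder name)
files   <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
failing <- files[!vapply(files, function(f) density_ok(read.csv(f)), logical(1))]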

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What shall I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today - should it be the tenure up until today, or should it be NA?
First, I shall say I disagree with the previous answer: for a subscription still active today, the tenure should be treated as neither the tenure up until today nor NA. What do we know exactly about those subscriptions? We know they have survived up until today; although we don't know their exact lifetimes, we do know they are at least as long as the tenure observed so far.
This is a situation known as right-censor in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, type="interval2")
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of computational details can be found: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
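A minimal sketch of the interval2 encoding for the three example subscriptions above (t2 = NA marks the still-active, right-censored row; censoring is inferred from the NA, so no event argument is passed):
library(survival)
t1 <- c(2, 3, 1)
t2 <- c(2, NA, 1)
Surv(t1, t2, type = "interval2")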
If a missing end date means that the subscription is still active, then you need to take the time until the current date as censor date.
NA won't work in the survival object; I think those cases would simply be omitted. That is not what you want, because these cases contain important information about survival.
SQL code to get the time to event (use in the SELECT part of your query):
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis; it does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
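A minimal sketch of this right-censored encoding, assuming the subscriptions data frame holds id/start_date/end_date as Date columns and that the cutoff date below (the date the data were pulled) is an assumption to substitute with your own:
library(survival)
cutoff <- as.Date("2013-10-01")                # assumed data-pull date
end <- subscriptions$end_date
end[is.na(end)] <- cutoff                      # censor active subscriptions at the pull date
subscriptions$status <- ifelse(is.na(subscriptions$end_date), 0L, 1L)   # 0 = still active
subscriptions$tenure_days <- as.numeric(end - subscriptions$start_date) # tenure in days
fit <- survfit(Surv(tenure_days, status) ~ 1, data = subscriptions)
plot(fit)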

Assigning week numbers in a time series to obtain weekly average price

Let's say I have a time series with daily data (business days), and I would like to organize the data by business weeks (Monday-Friday), in a similar fashion to this webpage from the EIA on futures prices of crude oil:
http://www.eia.gov/dnav/pet/hist/LeafHandler.ashx?n=PET&s=RCLC1&f=D
As you can see the prices are nicely organized by weeks in this webpage.
Is there any function in R that could organize the data in a similar fashion?
You can obtain the data in .xls format at:
http://www.eia.gov/dnav/pet/hist_xls/RCLC1d.xls
What I would like to do is to assign a week number to each daily observation something like this: (Look at the weeks column)
Date Price weeks day
1983-04-04 29.44 1 Monday
1983-04-05 29.71 1 Tuesday
1983-04-06 29.92 1 Wednesday
1983-04-07 30.17 1 Thursday
1983-04-08 30.38 1 Friday
1983-04-11 30.26 2 Monday
...
...
So far I have used the week function from the lubridate package, but it is not working well: once a year hits its 53rd week, the function fails to start the week count of the following year properly.
I have been trying to stay away from rep or seq-by-5/7 kinds of solutions, since there may be some observations that I need to filter from the data later on. I would therefore like a solution that doesn't depend on the particular vector of my data, but rather on the date class, i.e. POSIXct, xts or zoo.
Any hints would be greatly appreciated.
Wouldn't this work?:
as.POSIXlt(x)$yday %/% 7
I realize that it does have part of what you wanted to avoid, but it does draw its starting point from a recognized class. For your data (noting that I read it in with colClasses=c("Date", "numeric","numeric","character")):
> 1 + as.POSIXlt(dat$Date)$yday %/% 7
[1] 14 14 14 14 14 15
If you want to replicate those interval labels, try adding multiples of 7 to any Monday and Friday:
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(39)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(39)*7,
sep="")
#[1] "1984-01-02 to 1984-01-06" # The first new year change
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(39+52)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(39+52)*7,
sep="")
#[1] "1984-12-31 to 1985-01-04" # The second new year change
Here's a function that will accept an integer vector:
from8Apr83dts <- function(numwks) {
paste(as.Date(strptime("1983 Apr- 4",format="%Y %b- %d"))+(numwks)*7,
" to ",
as.Date(strptime("1983 Apr- 8",format="%Y %b- %d"))+(numwks)*7,
sep="")
}
# Usage
from8Apr83dts(39:40)
#[1] "1984-01-02 to 1984-01-06" "1984-01-09 to 1984-01-13"
