Ensuring temporal data density in R

ISSUE ---------
I have thousands of time series files (.csv) that contain intermittent data spanning between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data are hourly, and where no measurement exists there is an NA.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density, i.e. that the ratio of NAs to data values is not too high. To do this I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NAs
Ensure that no more than 10% of the days in a month are NAs
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially, and if a file does not meet the requirements then I must create a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered nested loops, along with sqldf, plyr, aggregate or even dplyr, but I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.

I think this will work for you. These functions check every hour for NAs in the next day, month or 3-year window, and return the number of NAs in each rolling window. So for checkdays, a value greater than 2.4 (10% of 24 hours) means you have a problem under your rule; for months the threshold is 72, and for 3-year periods you are hoping for values less than 2628. I hadn't tested them at first (see the update below), so please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
checkdays <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2])-23)){
    nadata <- data[i:(i+23),2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
checkmonth <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2])-719)){
    nadata <- data[i:(i+719),2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
check3years <- function(data){
  countNA <- NULL
  for(i in 1:(length(data[,2])-26279)){
    nadata <- data[i:(i+26279),2]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these, and they work for me. Here are system times for a year-long dataset, so I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above, and it wasn't good. For loops are dangerous: they work well for small data sets but can slow down drastically as datasets get larger if they're not constructed properly (here, growing countNA inside the loop forces R to copy the vector over and over). I cannot report system times for the functions above with your data because they never finished; I waited about 30 minutes. After reading this awesome post, Speed up the loop operation in R, I rewrote the functions to be much faster. By minimising the amount of work that happens inside the loop and pre-allocating memory you can really speed things up. You now need to call the function like checkdays(df[,2]), but it's faster this way.
checkdays <- function(data){
  countNA <- numeric(length(data)-23)
  for(i in 1:(length(data)-23)){
    nadata <- data[i:(i+23)]
    countNA[i] <- length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. Regarding leap years, you should be able to modify the optimized function as I mentioned in the comments; just make sure you specify a leap-year dataset as a second dataset rather than as a second column.
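As a hedged sketch of how these checks might be strung together across the thousands of files: the data/ folder, the fails_criteria and failed_files names are assumptions, the thresholds follow the 10% interpretation above, and it assumes all three functions have been rewritten in the optimized vector form so they take a column rather than a data frame.
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder of station files
fails_criteria <- function(file) {
  temp <- read.csv(file, stringsAsFactors = FALSE)[, 2]  # metric assumed to be in column 2
  # each file is assumed to span at least 3 years of hourly data, as stated in the question
  any(checkdays(temp)   > 2.4)  ||   # more than 10% NAs in some 24-hour window
  any(checkmonth(temp)  > 72)   ||   # more than 10% NAs in some 720-hour window
  any(check3years(temp) > 2628)      # more than 10% NAs in some 3-year window
}
failed_files <- data.frame(file = files[sapply(files, fails_criteria)])
Files that pass all three rolling-window checks are treated as dense enough; everything else ends up in failed_files, which is the list of failing files the question asks for.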

Related

Subsetting data by multiple date ranges - R

I'll get straight to the point: I have been given some data sets in .csv format containing regularly logged sensor data from a machine. However, this data also contains measurements taken when the machine is turned off, which I would like to separate from the data logged when it is turned on. To subset the relevant data I also have a file containing the start and end times of these shutdowns; this file is several hundred rows long.
Examples of the relevant files for this problem:
file: sensor_data.csv
sens_name,time,measurement
sens_A,17/12/11 06:45,32.3321
sens_A,17/12/11 08:01,36.1290
sens_B,17/12/11 05:32,17.1122
sens_B,18/12/11 03:43,12.3189
##################################################
file: shutdowns.csv
shutdown_start,shutdown_end
17/12/11 07:46,17/12/11 08:23
17/12/11 08:23,17/12/11 09:00
17/12/11 09:00,17/12/11 13:30
18/12/11 01:42,18/12/11 07:43
To subset data in R, I have previously used the subset() function with simple conditions which has worked fine, but I don't know how to go about subsetting sensor data which fall outside multiple shutdown date ranges. I've already formatted the date and time data using as.POSIXlt().
I'm suspecting some scripting may be involved to come up with a good solution, but I'm afraid I am not yet experienced enough to handle this type of data.
Any help, advice, or solutions will be greatly appreciated. Let me know if there's anything else needed for a solution.
I prefer the POSIXct format for ranges within data frames. We build a logical index marking the measurements that fall outside every shutdown range (t < shutdown_start OR t > shutdown_end for each row of shutdowns). With this index we can then subset the data either way:
posixct <- function(x) as.POSIXct(x, format="%d/%m/%y %H:%M")
sensor_data$time <- posixct(sensor_data$time)
shutdowns[] <- lapply(shutdowns, posixct)
# TRUE when a measurement lies outside every shutdown interval
ind1 <- sapply(sensor_data$time, function(t) {
  sum(t < shutdowns[,1] | t > shutdowns[,2]) == nrow(shutdowns)})
# Measurements taken while the machine was running (outside all shutdowns)
sensor_data[ind1,]
#   sens_name                time measurement
# 1    sens_A 2011-12-17 06:45:00     32.3321
# 3    sens_B 2011-12-17 05:32:00     17.1122
# Measurements taken during a shutdown
sensor_data[!ind1,]
#   sens_name                time measurement
# 2    sens_A 2011-12-17 08:01:00     36.1290
# 4    sens_B 2011-12-18 03:43:00     12.3189
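As a hedged alternative sketch (not part of the original answer), the same index can be built without looping over every timestamp by comparing all measurements against all shutdown ranges at once with outer():
# logical matrix: one row per measurement, one column per shutdown range
outside <- outer(sensor_data$time, shutdowns$shutdown_start, "<") |
           outer(sensor_data$time, shutdowns$shutdown_end, ">")
ind1 <- rowSums(outside) == nrow(shutdowns)   # TRUE = outside every shutdown
This avoids the per-element sapply, which matters once sensor_data grows to many thousands of rows.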

Quantifying Logged Non-activity in R -- Overlapping logged events

Update: This seems to be very well described in SQL forums -- how to account for the gaps in between time ranges (many of which overlap). So I may have to turn to SQL to solve this quickly, but I'm surprised it cannot be done in R. The interval class seems to get almost all the way there, but outside of a slow loop it appears difficult to apply to a vector-wide analysis. Please do let me know if you have any ideas; here's a description of the problem and its solution in SQL:
https://www.simple-talk.com/sql/t-sql-programming/calculating-gaps-between-overlapping-time-intervals-in-sql/
.... What I'd like to do is come up with a list of non-activity times from a log, and then filter it down to show only gaps of at least a minimum length.
1/17/2012 0:15 1/17/2012 0:31
1/20/2012 0:21 1/20/2012 0:22
1/15/2013 1:08 1/15/2013 1:10
1/15/2013 1:08 1/15/2013 1:10
1/15/2013 7:39 1/15/2013 7:41
1/15/2013 7:39 1/15/2013 7:41
1/16/2013 1:11 1/16/2013 1:15
1/16/2013 1:11 1/16/2013 1:15
I was going to just lag the end times into the start row and compute the difference, but then it became clear there were overlapping activities. I also tried "price is right" type matching to get the closest end time... except, of course, if things are going on simultaneously, this doesn't guarantee there's no activity from a still-unfinished simultaneous task.
I currently have date-time in and date-time out columns. I am hoping there is a better idea than taking the many millions of entries, and using seq.POSIXt to write every individual minute that has activity? But even that doesn't seem very workable.. But it would seem there would be some easy way to identify gaps of time of a minimum size, whether it be 5 minutes or 30. Any suggestions?
Assuming that 1/17/2012 00:15 is the first value in your data set, I would convert your data into two columns, each containing the number of minutes since this timestamp.
I.e., using the first 3 rows of your data as an example:
start_min | end_min
----------|--------
        0 |      16
     4323 |    4324
   528882 |  528884
      ... |     ...
Subtracting these two columns from each other tells you the minutes where activity occurred; you can then simply invert this to get your non-activity.
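A hedged sketch of this approach (the find_gaps name, the start/end column names, POSIXct inputs and the min_gap parameter are all assumptions, not from the original answer): it marks every logged minute as active and then inverts that mask to list gaps of at least min_gap minutes.
find_gaps <- function(log, min_gap = 5) {
  origin <- min(log$start)
  # minutes since the first logged start time
  s <- as.numeric(difftime(log$start, origin, units = "mins"))
  e <- as.numeric(difftime(log$end,   origin, units = "mins"))
  # mark each minute covered by at least one logged activity
  active <- logical(ceiling(max(e)) + 1)
  for (i in seq_along(s)) {
    active[(floor(s[i]) + 1):(ceiling(e[i]) + 1)] <- TRUE
  }
  # runs of inactive minutes are the gaps; keep only the long ones
  r <- rle(!active)
  run_end   <- cumsum(r$lengths)
  run_start <- run_end - r$lengths + 1
  keep <- r$values & r$lengths >= min_gap
  data.frame(gap_start   = origin + (run_start[keep] - 1) * 60,
             gap_minutes = r$lengths[keep])
}
Gaps shorter than min_gap are dropped, so filtering to the 5- or 30-minute minimums mentioned in the question is just a matter of changing that argument.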

using getSymbols to load different start time variables (time series data)

getSymbols(c("PI","RSXFS", "TB3MS", src="FRED",from="1959-1-1", from="1992-1", from="1934-1-1")
How can I load data with getSymbols using different start dates for multiple variables?
I need 200 variables from FRED. I can download the FRED codes easily, but the problem is the dates: each variable has a different starting date.
First I load the data in time series format, and then I will use the window command to fix the same time period for all 200 series.
Maybe you are looking for mapply:
symbols<-c("PI","RSXFS", "TB3MS")
begin.date<-c("1959-1-1","1992-1", "1934-1-1")
jj<- mapply(function(sym,dt) getSymbols(sym, src="FRED", from=dt,auto.assign = FALSE),symbols,begin.date)
head(jj[[3]])
TB3MS
1934-01-01 0.72
1934-02-01 0.62
1934-03-01 0.24
1934-04-01 0.15
1934-05-01 0.16
1934-06-01 0.15
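As a hedged follow-up sketch (not from the original answer; it assumes jj is the list returned by the mapply call above), the downloaded series can be merged and then trimmed with window() to the span where all of them overlap, which is what the question plans to do for all 200 series:
library(quantmod)                        # provides getSymbols(); xts comes along as a dependency
merged <- do.call(merge, jj)             # one xts object, one column per FRED symbol
starts <- do.call(c, lapply(jj, start))  # first observation of each series
ends   <- do.call(c, lapply(jj, end))    # last observation of each series
common <- window(merged, start = max(starts), end = min(ends))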

Efficient chunking timeseries in R

I am working with a number of large time series of currency-pair pricing data in R. Files tend to be 100-300 MB in size, and I will generally be working with 3 files at a time. I am looking for a (much) more efficient way of handling the TIME column of these data.
My data begins looking like :
PAIR TIME BID ASK
1 USD/JPY 2012-01-02 00:00:00.307 77.023 77.055
2 USD/JPY 2012-01-02 00:00:00.493 77.030 77.049
3 USD/JPY 2012-01-02 00:00:05.003 77.030 77.050
4 USD/JPY 2012-01-02 00:00:05.005 77.023 77.056
5 USD/JPY 2012-01-02 00:00:05.006 77.024 77.056
6 USD/JPY 2012-01-02 00:00:06.008 77.023 77.056
... ... ... ...
R has no problem understanding the TIME column. For instance,
USDJPY$TIME[2] - USDJPY$TIME[1]
Gives output
Time difference of 0.1860001 secs
Data are already organized into files by month. Unfortunately this is also much too large; I would like to break the pricing data down by 'trading week'.
Forex trading occurs in continuous multi-day stretches, usually from Monday to Friday. Some trading holidays will suspend trading, and there will not be data on these days. The nature of trading scheduling is such that, if
USDJPY$TIME[t+1] - USDJPY$TIME[t]
... is greater than 12 hours, time t is the last time index for that week in USDJPY.
I have not found an acceptable way to break the data into trading weeks, by indices, or otherwise. All my attempts end up hanging. The USDJPY file contains ~1,900,000 rows.
One approach I tried :
USDJPY.diff <- c()   # vector of successive time differences, grown inside the loop
for(i in 1:(length(USDJPY$TIME)-1)){
  USDJPY.diff <- c(USDJPY.diff, USDJPY$TIME[i+1]-USDJPY$TIME[i])
}
takes far too long (I quit before it could finish)
I would think data.table should speed things up here quite a bit:
library(data.table) #1.9.5
setDT(data)
# difference to the previous row, computed explicitly in hours
data[, DIFF := as.numeric(difftime(TIME, shift(TIME, n=1, type="lag"), units="hours"))]
Week number calc (increment when the difference is greater than 12 hours; the first DIFF is NA and is excluded):
data[, Week.num := cumsum(!is.na(DIFF) & DIFF > 12)]
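As a hedged follow-up (not part of the original answer; week 5 is just an illustrative index), once Week.num exists each trading week can be pulled out directly, or the whole table split into a list with one element per week:
week5   <- data[Week.num == 5]           # a single trading week
by_week <- split(data, data$Week.num)    # list with one table per trading week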

R: Calculate means for subset of a group

I want to calculate the mean for each Day, but only for a portion of the day (Time = 12-14). This code works for me, but I have to enter each day as a new line of code, which will amount to hundreds of lines.
This seems like it should be simple to do. I've done this easily when the grouping variables are the same, but I don't know how to do it when I don't want to include all values for the day.
Is there a better way to do this?
sapply(sap[sap$Day==165 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
sapply(sap[sap$Day==166 & sap$Time %in% c(12,12.1,12.2,12.3,12.4,12.5,13,13.1,13.2,13.3,13.4,13.5, 14), ],mean)
Here's what the data looks like:
Day Time StomCond_Trunc
165 12 33.57189926
165 12.1 50.29437636
165 12.2 35.59876214
165 12.3 24.39879768
Try this:
aggregate(StomCond_Trunc~Day,data=subset(sap,Time>=12 & Time<=14),mean)
If you have a large dataset, you may also want to look into the data.table package. Converting a data.frame to a data.table is quite easy.
Example:
Large(ish) dataset
df <- data.frame(Day=1:1000000,Time=sample(1:14,1000000,replace=T),StomCond_Trunc=rnorm(100000)*20)
Using aggregate on the data.frame
>system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
user system elapsed
16.255 0.377 24.263
Converting it to a data.table
dt <- data.table(df,key="Time")
>system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
user system elapsed
9.534 0.178 15.270
Update from Matthew. This timing has improved dramatically since the answer was originally written, due to a new optimization feature in data.table 1.8.2.
Retesting the difference between the two approaches, using data.table 1.8.2 in R 2.15.1:
df <- data.frame(Day=1:1000000,
Time=sample(1:14,1000000,replace=T),
StomCond_Trunc=rnorm(100000)*20)
system.time(aggregate(StomCond_Trunc~Day,data=subset(df,Time>=12 & Time<=14),mean))
# user system elapsed
# 10.19 0.27 10.47
dt <- data.table(df,key="Time")
system.time(dt[Time>=12 & Time<=14,mean(StomCond_Trunc),by=Day])
# user system elapsed
# 0.31 0.00 0.31
Using your original method, but with less typing:
sapply(sap[sap$Day==165 & sap$Time %in% seq(12, 14, 0.1), ],mean)
However, this is only a slight improvement over your original method. It's not as flexible as the other answers, since it depends on 0.1 increments in your time values; the other methods don't care about the increment size, which makes them more versatile. I'd recommend @Maiasaura's answer with data.table.
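As a hedged additional sketch (dplyr is not used in the answers above; the column names follow the sap example), the same per-day means over the 12-14 window can also be written as:
library(dplyr)
sap %>%
  filter(Time >= 12, Time <= 14) %>%
  group_by(Day) %>%
  summarise(mean_StomCond = mean(StomCond_Trunc, na.rm = TRUE))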
