foverlaps and within in data.table - r

I have 2 tables.
dums:
start end 10min
2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
2013-04-02 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45
map: start and end are 10-minute interval blocks spanning 2013-04-01 00:00:00 to 2013-04-04.
I want to add column 3 of dums to map as long as the start and end times are within the 10-minute blocks, and keep appending columns for further matches.
Ideally the output should be:
start end 10min
4/1/2013 0:00:00 4/1/2013 0:10:00 0.05 0
4/1/2013 0:10 4/1/2013 0:20 0.05 0
4/1/2013 0:20 4/1/2013 0:30 0.05 0
4/1/2013 0:30 4/1/2013 0:40 0.05 0
4/1/2013 0:40 4/1/2013 0:50 0.05 0.01
4/1/2013 0:50 4/1/2013 1:00 0.05 0.01
I tried
setkey(dums,start,end)
setkey(map,start,end)
foverlaps(map,dums,type="within",nomatch=0L)
I keep getting the error:
Error in foverlaps(map, dums, type = "within", nomatch = 0L) : All entries in column start should be <= corresponding entries in column end in data.table 'y'
Any pointers or alternative approaches?
Thanks

The error message
All entries in column start should be <= corresponding entries in column end in data.table 'y'
is probably caused by a typo in the dataset.
dums[start > end, which = TRUE]
returns 4, and row 4 of dums is:
start end min10
1: 2013-04-02 02:22:00 2013-04-01 04:33:12 0.2
After changing start to 2013-04-01 02:22:00 OP's code runs fine.
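As a minimal sketch of this kind of sanity check, the snippet below builds a two-row table (the name chk is illustrative; the timestamps are rows 3 and 4 of the question's data) and uses which = TRUE so that data.table returns the offending row numbers instead of the rows themselves.

```r
library(data.table)

# Two intervals taken from the question; the second one ends before it starts
chk <- data.table(
  start = as.POSIXct(c("2013-04-01 02:13:20", "2013-04-02 02:22:00"), tz = "UTC"),
  end   = as.POSIXct(c("2013-04-01 04:53:42", "2013-04-01 04:33:12"), tz = "UTC")
)

# which = TRUE returns the row numbers where the condition holds
bad <- chk[start > end, which = TRUE]
bad
```

Running such a check before foverlaps() points directly at the rows that would trigger the error.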
However, to achieve the expected output the result of foverlaps() needs to be reshaped from long to wide format.
This can be done in two ways:
dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ min10,
value.var = "min10")
i.start i.end 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45
1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05 NA NA NA NA NA NA NA NA
2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05 NA NA NA NA NA NA NA NA
3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05 NA NA NA NA NA NA NA NA
4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05 NA NA NA NA NA NA NA NA
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.1 NA NA NA NA NA NA NA
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 NA NA NA NA NA NA 0.35 NA 0.45
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 NA NA NA NA NA NA 0.35 NA 0.45
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 NA NA NA NA NA NA 0.35 NA 0.45
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 NA NA NA NA NA NA 0.35 NA 0.45
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 NA NA NA NA NA NA 0.35 NA 0.45
or, more in line with OP's expected result:
dcast(foverlaps(map, dums, nomatch = 0L), i.start + i.end ~ rowid(i.start),
value.var = "min10")
i.start i.end 1 2 3 4 5
1: 2013-04-01 00:00:00 2013-04-01 00:10:00 0.05 NA NA NA NA
2: 2013-04-01 00:10:00 2013-04-01 00:20:00 0.05 NA NA NA NA
3: 2013-04-01 00:20:00 2013-04-01 00:30:00 0.05 NA NA NA NA
4: 2013-04-01 00:30:00 2013-04-01 00:40:00 0.05 NA NA NA NA
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.05 0.10 NA NA NA
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.35 0.45 NA NA NA
312: 2013-04-03 04:50:00 2013-04-03 05:00:00 0.35 0.45 NA NA NA
313: 2013-04-03 05:00:00 2013-04-03 05:10:00 0.35 0.45 NA NA NA
314: 2013-04-03 05:10:00 2013-04-03 05:20:00 0.35 0.45 NA NA NA
315: 2013-04-03 05:20:00 2013-04-03 05:30:00 0.35 0.45 NA NA NA
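Note that type = "within" has been omitted from these calls, so foverlaps() uses its default, type = "any". With "any", a 10-minute block is matched as soon as it overlaps an interval in dums, which is why partially covered blocks such as 00:00:00 to 00:10:00 appear in the output.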
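To make the effect of the type argument concrete, here is a small sketch on made-up integer intervals (x, y and val are illustrative names, not taken from the question). It compares type = "any" with type = "within" for the same pair of tables.

```r
library(data.table)

# One long interval to look up (y) and three consecutive blocks (x)
y <- data.table(start = 5L, end = 25L, val = 0.05, key = c("start", "end"))
x <- data.table(start = c(0L, 10L, 20L), end = c(10L, 20L, 30L),
                key = c("start", "end"))

# "any": every block that overlaps [5, 25] at all is matched
any_rows <- foverlaps(x, y, type = "any", nomatch = 0L)

# "within": only blocks lying entirely inside [5, 25] are matched
within_rows <- foverlaps(x, y, type = "within", nomatch = 0L)
```

Here only the block from 10 to 20 survives with "within", while "any" keeps all three blocks.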
Data
# corrected
dums <- fread(
" 2013-04-01 00:00:54 UTC 2013-04-01 01:00:10 UTC 0.05
2013-04-01 00:40:26 UTC 2013-04-01 01:00:00 UTC 0.1
2013-04-01 02:13:20 UTC 2013-04-01 04:53:42 UTC 0.15
2013-04-01 02:22:00 UTC 2013-04-01 04:33:12 UTC 0.2
2013-04-01 02:26:23 UTC 2013-04-01 04:05:12 UTC 0.25
2013-04-01 02:42:47 UTC 2013-04-01 04:34:33 UTC 0.3
2013-04-01 02:53:12 UTC 2013-04-03 05:27:05 UTC 0.35
2013-04-02 02:54:08 UTC 2013-04-02 05:31:15 UTC 0.4
2013-04-03 02:57:16 UTC 2013-04-03 05:29:32 UTC 0.45"
)
dums <- dums[, .(start = as.POSIXct(paste(V1, V2, V3)),
end = as.POSIXct(paste(V4, V5, V6)),
min10 = V7)]
setkey(dums, start, end)
ts <- seq(as.POSIXct("2013-04-01 00:00:00 UTC"),
as.POSIXct("2013-04-04 00:00:00 UTC"),
by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
key = c("start", "end"))

That's a good catch with the POSIXct time being off for one row. I feel super silly to have glossed over such an error in the input data.
The ultimate goal is to have 3 columns: YYYY-DD-MM, start time (POSIXct), and end time (POSIXct).
The start and end times are 10-minute windows.
The number of days is 365, so effectively I am looking at 365 * 144 windows (144 ten-minute slices per day). The catch is that I have 450k rows of "dums" data, and min10 is not evenly spaced over discrete intervals; it is continuous data. If I have to aggregate (sum, mean, sd, etc.), is there any way to combine dcast, aggregation, foverlaps "within", and grouping? I could do it with a for loop that places the min10 value from start to end, but that looks super time consuming and inefficient.
The output would be
5: 2013-04-01 00:40:00 2013-04-01 00:50:00 0.15
---
311: 2013-04-03 04:40:00 2013-04-03 04:50:00 0.80
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
key = c("start", "end"))
# plus do something on the lines
dums[, .(count=.N, sum=sum(min10)), by = ID1]
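One possible way to get such per-block summaries, sketched here under the assumption that the long result of foverlaps() can simply be grouped by the block columns i.start and i.end, without dcast or a loop. The column names n, total and avg are illustrative, and only two of the corrected dums rows and the first two hours of blocks are used to keep the example small.

```r
library(data.table)

# Two of the corrected dums intervals
dums <- data.table(
  start = as.POSIXct(c("2013-04-01 00:00:54", "2013-04-01 00:40:26"), tz = "UTC"),
  end   = as.POSIXct(c("2013-04-01 01:00:10", "2013-04-01 01:00:00"), tz = "UTC"),
  min10 = c(0.05, 0.1)
)
setkey(dums, start, end)

# 10-minute blocks covering the first two hours only
ts  <- seq(as.POSIXct("2013-04-01 00:00:00", tz = "UTC"),
           as.POSIXct("2013-04-01 02:00:00", tz = "UTC"), by = "10 min")
map <- data.table(start = head(ts, -1L), end = tail(ts, -1L),
                  key = c("start", "end"))

# Overlap join, then one summary row per 10-minute block
agg <- foverlaps(map, dums, nomatch = 0L)[
  , .(n = .N, total = sum(min10), avg = mean(min10)),
  by = .(i.start, i.end)]
agg
```

For the block starting at 00:40:00 this gives a total of 0.15, which agrees with row 5 of the desired output above.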

Related

Comparing time series with different sampling rate (dates) in R

I have two long time series to compare, but their sampling is completely different. The first one is sampled hourly, the second one irregularly.
I would like to compare Value1 and Value2, so I would like to select the Value1 records from df1 at 02:00 on the dates that appear in df2. How can I solve this in R?
df1:
Date1                Value1
2014-01-01 01:00:00  0.16
2014-01-01 02:00:00  0.13
2014-01-01 03:00:00  0.6
2014-01-02 01:00:00  0.5
2014-01-02 02:00:00  0.22
2014-01-02 03:00:00  0.17
2014-01-19 01:00:00  0.2
2014-01-19 02:00:00  0.11
2014-01-19 03:00:00  0.15
2014-01-21 01:00:00  0.13
2014-01-21 02:00:00  0.33
2014-01-21 03:00:00  0.1
2014-01-23 01:00:00  0.09
2014-01-23 02:00:00  0.02
2014-01-23 03:00:00  0.16
df2:
Date2       Value2
2014-01-01  13
2014-01-19  76
2014-01-23  8
Desired output:
df_fused:
Date1                Value1  Value2
2014-01-01 02:00:00  0.13    13
2014-01-19 02:00:00  0.11    76
2014-01-23 02:00:00  0.02    8
Here is a data.table approach:
library( data.table )
#sample data can also be setDT(df1);setDT(df2)
df1 <- fread("Date1 Value1
2014-01-01 01:00:00 0.16
2014-01-01 02:00:00 0.13
2014-01-01 03:00:00 0.6
2014-01-02 01:00:00 0.5
2014-01-02 02:00:00 0.22
2014-01-02 03:00:00 0.17
2014-01-19 01:00:00 0.2
2014-01-19 02:00:00 0.11
2014-01-19 03:00:00 0.15
2014-01-21 01:00:00 0.13
2014-01-21 02:00:00 0.33
2014-01-21 03:00:00 0.1
2014-01-23 01:00:00 0.09
2014-01-23 02:00:00 0.02
2014-01-23 03:00:00 0.16")
df2 <- fread("Date2 Value2
2014-01-01 13
2014-01-19 76
2014-01-23 8")
#set dates to posix
df1[, Date1 := as.POSIXct( Date1, format = "%Y-%m-%d %H:%M:%S", tz = "UTC" )]
#set df2 dates to 02:00:00 time
df2[, Date2 := as.POSIXct( paste( Date2, "02:00:00" ), format = "%Y-%m-%d %H:%M:%S", tz = "UTC" )]
#join
df2[ df1, Value1 := i.Value1, on = .(Date2 = Date1)][]
# Date2 Value2 Value1
# 1: 2014-01-01 02:00:00 13 0.13
# 2: 2014-01-19 02:00:00 76 0.11
# 3: 2014-01-23 02:00:00 8 0.02

R: yes-no factor based on previous entries

I've got a time series dataset with data from a weather station. There are 3 columns: time (date and time); p (rain, mm); h (water level, m).
I need to make a new column factor_rain with 1 and 0 values: 1 if the water level (df$h) was influenced by rain (df$p), which is the case if there was rain within the last 5 hours (5 entries).
In all other cases, it should be 0.
A part of dataset is here:
df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
"2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
"2017-06-05 9:00:00", "2017-06-05 11:00:00",
"2017-06-05 13:00:00", "2017-06-05 16:00:00",
"2017-06-05 19:00:00", "2017-06-05 21:00:00", "2017-06-05 23:00:00",
"2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
"2017-06-06 16:00:00", "2017-06-06 17:00:00", "2017-06-06 18:00:00",
"2017-06-06 19:00:00"),
p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
NA, NA, NA, NA, NA, NA, NA, NA, NA),
h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
33,NA,NA,NA,29,NA,NA,NA,NA))
I tried the simplest way I could think of, but unfortunately it works only for one case:
> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
> Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
longer object length is not a multiple of shorter object length
Is there any way to fix it? If you can suggest how to use real time (something from the xts library, for example), that would be great. I mean using a 5-hour threshold, not 5 values.
By the way I need to get this as a result:
> df
time p h factor_rain
1 2017-06-04 9:00:00 NA 23 0
2 2017-06-04 13:00:00 NA NA 0
3 2017-06-04 17:00:00 16.4 NA 0
4 2017-06-04 19:00:00 NA NA 0
5 2017-06-04 21:00:00 NA NA 0
6 2017-06-04 23:00:00 NA 32 1
7 2017-06-05 9:00:00 NA NA 0
8 2017-06-05 11:00:00 NA NA 0
9 2017-06-05 13:00:00 NA 28 0
10 2017-06-05 16:00:00 NA NA 0
11 2017-06-05 19:00:00 12.0 NA 0
12 2017-06-05 21:00:00 NA 33 1
13 2017-06-05 23:00:00 NA NA 0
14 2017-06-06 9:00:00 NA NA 0
15 2017-06-06 11:00:00 NA NA 0
16 2017-06-06 13:00:00 NA 29 0
17 2017-06-06 16:00:00 NA NA 0
18 2017-06-06 17:00:00 NA NA 0
19 2017-06-06 18:00:00 NA NA 0
20 2017-06-06 19:00:00 NA NA 0
You can use
df$factorrain = FALSE
df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE
# time p h factorrain
# 1 2017-06-04 9:00:00 NA 23 FALSE
# 2 2017-06-04 13:00:00 NA NA FALSE
# 3 2017-06-04 17:00:00 16.4 NA TRUE
# 4 2017-06-04 19:00:00 NA NA TRUE
# 5 2017-06-04 21:00:00 NA NA TRUE
# 6 2017-06-04 23:00:00 NA 32 TRUE
# 7 2017-06-05 9:00:00 NA NA TRUE
# 8 2017-06-05 11:00:00 NA NA FALSE
# 9 2017-06-05 13:00:00 NA 28 FALSE
# 10 2017-06-05 16:00:00 NA NA FALSE
# 11 2017-06-05 19:00:00 12.0 NA TRUE
# 12 2017-06-05 21:00:00 NA 33 TRUE
# 13 2017-06-05 23:00:00 NA NA TRUE
# 14 2017-06-06 9:00:00 NA NA TRUE
# 15 2017-06-06 11:00:00 NA NA TRUE
# 16 2017-06-06 13:00:00 NA 29 FALSE
# 17 2017-06-06 16:00:00 NA NA FALSE
# 18 2017-06-06 17:00:00 NA NA FALSE
# 19 2017-06-06 18:00:00 NA NA FALSE
# 20 2017-06-06 19:00:00 NA NA FALSE
Or, a similar approach with apply,
df$factorrain = FALSE
df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
A solution can be achieved by using a non-equi join from data.table.
library(data.table)
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S")
setDT(df)
df[,timeLow := time-5*60*60]
df[df,.(time, p, h = i.h), on=.(time < time, time >= timeLow)][
,.(factor_rain = ifelse(!is.na(first(h)), any(!is.na(p)),FALSE)),by=.(time)][
df,.(time, p, h, factor_rain),on="time"]
# time p h factor_rain
# 1: 2017-06-04 09:00:00 NA 23 FALSE
# 2: 2017-06-04 13:00:00 NA NA FALSE
# 3: 2017-06-04 17:00:00 16.4 NA FALSE
# 4: 2017-06-04 19:00:00 NA NA FALSE
# 5: 2017-06-04 21:00:00 NA NA FALSE
# 6: 2017-06-04 23:00:00 NA 32 FALSE <-- There is no rain in last 5 hours
# 7: 2017-06-05 09:00:00 NA NA FALSE
# 8: 2017-06-05 11:00:00 NA NA FALSE
# 9: 2017-06-05 13:00:00 NA 28 FALSE
# 10: 2017-06-05 16:00:00 NA NA FALSE
# 11: 2017-06-05 19:00:00 12.0 NA FALSE
# 12: 2017-06-05 21:00:00 NA 33 TRUE
# 13: 2017-06-05 23:00:00 NA NA FALSE
# 14: 2017-06-06 09:00:00 NA NA FALSE
# 15: 2017-06-06 11:00:00 NA NA FALSE
# 16: 2017-06-06 13:00:00 NA 29 FALSE
# 17: 2017-06-06 16:00:00 NA NA FALSE
# 18: 2017-06-06 17:00:00 NA NA FALSE
# 19: 2017-06-06 18:00:00 NA NA FALSE
# 20: 2017-06-06 19:00:00 NA NA FALSE
Note: The solution can be optimized a bit. I'll take up optimization in a while.
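For comparison, the same 5-hour rule can be sketched in base R without a join. This is only an illustration on four rows taken from the question's data (rows 10, 11, 12 and 16); the helper names secs, rain_secs, window_secs and rained_before are assumptions introduced for the example.

```r
# Four rows from the question: rain at 19:00, water level readings at 21:00 and 13:00
df <- data.frame(
  time = c("2017-06-05 16:00:00", "2017-06-05 19:00:00",
           "2017-06-05 21:00:00", "2017-06-06 13:00:00"),
  p = c(NA, 12, NA, NA),
  h = c(NA, NA, 33, 29)
)
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

secs        <- as.numeric(df$time)     # timestamps in seconds
rain_secs   <- secs[!is.na(df$p)]      # times at which rain was recorded
window_secs <- 5 * 60 * 60             # 5 hours

# TRUE if any rain was recorded in the 5 hours before each row (excluding the row itself)
rained_before <- vapply(
  secs,
  function(s) any(rain_secs < s & rain_secs >= s - window_secs),
  logical(1)
)

# Flag only rows that have a water level reading and rain in the window
df$factor_rain <- as.integer(!is.na(df$h) & rained_before)
df
```

As in the data.table result above, only the 21:00 reading on 2017-06-05 is flagged, because it is the only water level measurement with rain in the preceding 5 hours.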

How to increase time series granularity in R Dataframe? [duplicate]

I have a dataframe that contains hourly weather information. I would like to increase the granularity of the time measurements (5 minute intervals instead of 60 minute intervals) while copying the other columns data into the new rows created:
Current Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 01:00:00 26 0.69
Target Dataframe Structure:
Date Temperature Humidity
2015-01-01 00:00:00 25 0.67
2015-01-01 00:05:00 25 0.67
2015-01-01 00:10:00 25 0.67
.
.
.
2015-01-01 00:55:00 25 0.67
2015-01-01 01:00:00 26 0.69
2015-01-01 01:05:00 26 0.69
2015-01-01 01:10:00 26 0.69
.
.
.
What I've Tried:
for(i in 1:nrow(df)) {
five.minutes <- seq(df$date[i], length = 12, by = "5 mins")
for(j in 1:length(five.minutes)) {
df$date[i]<-rbind(five.minutes[j])
}
}
Error I'm getting:
Error in as.POSIXct.numeric(value) : 'origin' must be supplied
One possible solution is to use fill from tidyr and left_join from dplyr.
The approach is to create a date/time series running from the minimum time to the maximum time plus 55 minutes. Left-join the dataframe onto that series, which provides all desired rows but with NA for Temperature and Humidity in the new ones. Then use fill to populate the NA values with the previous valid values.
library(dplyr)
library(tidyr)
# Data
df <- read.table(text = "Date Temperature Humidity
'2015-01-01 00:00:00' 25 0.67
'2015-01-01 01:00:00' 26 0.69
'2015-01-01 02:00:00' 28 0.69
'2015-01-01 03:00:00' 25 0.69", header = T, stringsAsFactors = F)
df$Date <- as.POSIXct(df$Date, format = "%Y-%m-%d %H:%M:%S")
# Create a dataframe with all possible date/times at intervals of 5 mins
Dates <- data.frame(Date = seq(min(df$Date), max(df$Date) + 55*60, by = 5*60))
# Starting from Dates keeps the rows in time order, which fill() relies on
result <- Dates %>%
left_join(df, by="Date") %>%
fill(Temperature, Humidity)
result
# Date Temperature Humidity
#1 2015-01-01 00:00:00 25 0.67
#2 2015-01-01 00:05:00 25 0.67
#3 2015-01-01 00:10:00 25 0.67
#4 2015-01-01 00:15:00 25 0.67
#5 2015-01-01 00:20:00 25 0.67
#6 2015-01-01 00:25:00 25 0.67
#7 2015-01-01 00:30:00 25 0.67
#8 2015-01-01 00:35:00 25 0.67
#9 2015-01-01 00:40:00 25 0.67
#10 2015-01-01 00:45:00 25 0.67
#11 2015-01-01 00:50:00 25 0.67
#12 2015-01-01 00:55:00 25 0.67
#13 2015-01-01 01:00:00 26 0.69
#14 2015-01-01 01:05:00 26 0.69
#.....
#.....
#44 2015-01-01 03:35:00 25 0.69
#45 2015-01-01 03:40:00 25 0.69
#46 2015-01-01 03:45:00 25 0.69
#47 2015-01-01 03:50:00 25 0.69
#48 2015-01-01 03:55:00 25 0.69
I think this might do:
library(tibble)
library(lubridate)
df=tibble(DateTime=c("2015-01-01 00:00:00","2015-01-01 01:00:00"),Temperature=c(25,26),Humidity=c(.67,.69))
df$DateTime<-ymd_hms(df$DateTime)
# 5-minute steps within each hourly interval (end point dropped), followed by the final timestamp
DateTime=as.POSIXct(c(sapply(1:(nrow(df)-1),function(x) head(seq(from=df$DateTime[x],to=df$DateTime[x+1],by="5 min"),-1)),df$DateTime[nrow(df)]),
origin="1970-01-01", tz="UTC")
# repeat each value 12 times (assumes exactly one hour between rows), then append the last row's value
Temperature=c(sapply(1:(nrow(df)-1),function(x) rep(df$Temperature[x],12)),df$Temperature[nrow(df)])
Humidity=c(sapply(1:(nrow(df)-1),function(x) rep(df$Humidity[x],12)),df$Humidity[nrow(df)])
tibble(as.character(DateTime),Temperature,Humidity)
<chr> <dbl> <dbl>
1 2015-01-01 00:00:00 25.0 0.670
2 2015-01-01 00:05:00 25.0 0.670
3 2015-01-01 00:10:00 25.0 0.670
4 2015-01-01 00:15:00 25.0 0.670
5 2015-01-01 00:20:00 25.0 0.670
6 2015-01-01 00:25:00 25.0 0.670
7 2015-01-01 00:30:00 25.0 0.670
8 2015-01-01 00:35:00 25.0 0.670
9 2015-01-01 00:40:00 25.0 0.670
10 2015-01-01 00:45:00 25.0 0.670
11 2015-01-01 00:50:00 25.0 0.670
12 2015-01-01 00:55:00 25.0 0.670
13 2015-01-01 01:00:00 26.0 0.690

how to take averaged diurnal for each month for two columns with ggplot2

I have a time series data of two columns, and I want a graph with averaged hourly pattern for each month, like the graph attached but with two time series.
timestamp ET_control ET_treatment
1 2016-01-01 00:00:00 NA NA
2 2016-01-01 00:30:00 NA NA
3 2016-01-01 01:00:00 NA NA
4 2016-01-01 01:30:00 NA NA
5 2016-01-01 02:00:00 NA NA
6 2016-01-01 02:30:00 NA NA
7 2016-01-01 03:00:00 NA NA
8 2016-01-01 03:30:00 NA NA
9 2016-01-01 04:00:00 NA NA
10 2016-01-01 04:30:00 NA NA
11 2016-01-01 05:00:00 NA NA
12 2016-01-01 05:30:00 NA NA
13 2016-01-01 06:00:00 NA NA
14 2016-01-01 06:30:00 NA NA
15 2016-01-01 07:00:00 NA NA
16 2016-01-01 07:30:00 NA NA
17 2016-01-01 08:00:00 NA NA
18 2016-01-01 08:30:00 NA NA
19 2016-01-01 09:00:00 NA NA
20 2016-01-01 09:30:00 NA NA
21 2016-01-01 10:00:00 NA NA
22 2016-01-01 10:30:00 NA NA
23 2016-01-01 11:00:00 NA NA
24 2016-01-01 11:30:00 0.09863437 NA
25 2016-01-01 12:00:00 0.11465258 NA
26 2016-01-01 12:30:00 0.12356855 NA
27 2016-01-01 13:00:00 0.09246215 0.085398782
28 2016-01-01 13:30:00 0.08843156 0.072877001
29 2016-01-01 14:00:00 0.08536019 0.081885947
30 2016-01-01 14:30:00 0.08558541 NA
31 2016-01-01 15:00:00 0.05571436 NA
32 2016-01-01 15:30:00 0.04087248 0.038582547
33 2016-01-01 16:00:00 0.04233724 NA
34 2016-01-01 16:30:00 0.02150660 0.019560578
35 2016-01-01 17:00:00 0.01803765 0.019691155
36 2016-01-01 17:30:00 NA 0.005190489
37 2016-01-01 18:00:00 NA NA
38 2016-01-01 18:30:00 NA NA
39 2016-01-01 19:00:00 NA NA
40 2016-01-01 19:30:00 NA NA
41 2016-01-01 20:00:00 NA NA
42 2016-01-01 20:30:00 NA NA
43 2016-01-01 21:00:00 NA NA
44 2016-01-01 21:30:00 NA NA
45 2016-01-01 22:00:00 NA NA
46 2016-01-01 22:30:00 NA NA
47 2016-01-01 23:00:00 NA NA
48 2016-01-01 23:30:00 NA NA
49 2016-01-02 00:00:00 NA NA
50 2016-01-02 00:30:00 NA NA
Given that t is your data.frame and the packages dplyr and ggplot2 are loaded:
t <- t %>% mutate(
month = format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%b"),
hour=format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%H"))
tm <- t %>% group_by(month, hour) %>%
summarize(ET_control_mean=mean(ET_control, na.rm=TRUE))
ggplot(tm, aes(x=hour, y=ET_control_mean)) + geom_point() + facet_wrap(~ month)
If you want to have both columns in your graph, you should transform your data into the 'long' format.
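The long-format idea could look roughly like the sketch below. It assumes tidyr is available in addition to dplyr and ggplot2, and it uses four rows copied from the question's data; the names et, et_long, series, ET and ET_mean are illustrative only.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Rows 27 to 30 of the question's data
et <- data.frame(
  timestamp    = c("2016-01-01 13:00:00", "2016-01-01 13:30:00",
                   "2016-01-01 14:00:00", "2016-01-01 14:30:00"),
  ET_control   = c(0.09246215, 0.08843156, 0.08536019, 0.08558541),
  ET_treatment = c(0.085398782, 0.072877001, 0.081885947, NA)
)

et_long <- et %>%
  mutate(ts    = as.POSIXct(timestamp, tz = "UTC"),
         month = format(ts, "%b"),
         hour  = as.integer(format(ts, "%H"))) %>%
  # stack the two ET columns into one value column plus a series label
  pivot_longer(c(ET_control, ET_treatment),
               names_to = "series", values_to = "ET") %>%
  group_by(month, hour, series) %>%
  summarise(ET_mean = mean(ET, na.rm = TRUE), .groups = "drop")

# One line per series, one panel per month
p <- ggplot(et_long, aes(x = hour, y = ET_mean, colour = series)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ month)
```

Printing p draws the plot; with a full year of data each facet would show the mean diurnal cycle of both series for one month.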

R: how to keep legitimate NAs in a merged zoo object

I have multiple time-series objects with a regular interval of five minutes, but they can have different start and end times. They can also log at different times, not necessarily at minutes 5, 10, 15, etc.
I want to merge those objects, but I want to keep the legitimate NAs intact. For example, if one object starts logging at a later time, then the NAs at the beginning are legitimate NAs. Likewise, if one object stops logging earlier, then the NAs at the end are legitimate.
But there is no option to keep both kinds of NAs intact with na.locf.
Here is an example of my problem:
library(zoo)
lines1="Index,x1
2014-01-01 00:00:00,73.06
2014-01-01 00:05:00,73.11
2014-01-01 00:10:00,73.16
2014-01-01 00:15:00,73.22"
lines2="Index,x2
2014-01-01 00:11:00,71.11
2014-01-01 00:16:00,70.12
2014-01-01 00:21:00,70.16
2014-01-01 00:26:00,70.19
2014-01-01 00:31:00,69.16"
lines3="Index,x3
2014-01-01 00:23:00,0
2014-01-01 00:28:00,1
2014-01-01 00:33:00,1
2014-01-01 00:38:00,0
2014-01-01 00:43:00,0"
df1=read.table(text = lines1, header = TRUE, sep = ",")
df2=read.table(text = lines2, header = TRUE, sep = ",")
df3=read.table(text = lines3, header = TRUE, sep = ",")
z1 = zoo(df1$x1, as.POSIXct(df1$Index))
z2 = zoo(df2$x2, as.POSIXct(df2$Index))
z3 = zoo(df3$x3, as.POSIXct(df3$Index))
z = merge(z1,z2,z3)
z
z.na.locf = na.locf(z)
z.na.locf
timesteps = seq(as.POSIXct("2014-01-01 00:00:00"),
as.POSIXct("2014-01-01 01:00:00"),
by = "5 min")
z.timesteps = na.locf(z, xout=timesteps)
z.timesteps
The merged object is this:
> z
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 NA 71.11 NA
2014-01-01 00:15:00 73.22 NA NA
2014-01-01 00:16:00 NA 70.12 NA
2014-01-01 00:21:00 NA 70.16 NA
2014-01-01 00:23:00 NA NA 0
2014-01-01 00:26:00 NA 70.19 NA
2014-01-01 00:28:00 NA NA 1
2014-01-01 00:31:00 NA 69.16 NA
2014-01-01 00:33:00 NA NA 1
2014-01-01 00:38:00 NA NA 0
2014-01-01 00:43:00 NA NA 0
Note that the NAs at the end of z1 are legitimate, as are those at the beginning of z3 and at the beginning and end of z2. The NAs that need to be replaced are the ones in the middle of the data. The problem is that when I try to fill in the missing values in the middle of the data, the legitimate NAs are gone too:
> z.na.locf
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 73.16 71.11 NA
2014-01-01 00:15:00 73.22 71.11 NA
2014-01-01 00:16:00 73.22 70.12 NA
2014-01-01 00:21:00 73.22 70.16 NA
2014-01-01 00:23:00 73.22 70.16 0
2014-01-01 00:26:00 73.22 70.19 0
2014-01-01 00:28:00 73.22 70.19 1
2014-01-01 00:31:00 73.22 69.16 1
2014-01-01 00:33:00 73.22 69.16 1
2014-01-01 00:38:00 73.22 69.16 0
2014-01-01 00:43:00 73.22 69.16 0
Note that for z1 and z2, the legitimate NAs in the end are gone.
Furthermore, if I want to re-sample the data to have the same regular timestamp, both NAs at the beginning and in the end are gone too.
> z.timesteps
z1 z2 z3
2014-01-01 00:00:00 73.06 71.11 0
2014-01-01 00:05:00 73.11 71.11 0
2014-01-01 00:10:00 73.16 71.11 0
2014-01-01 00:15:00 73.22 71.11 0
2014-01-01 00:20:00 73.22 70.12 0
2014-01-01 00:25:00 73.22 70.16 0
2014-01-01 00:30:00 73.22 70.19 1
2014-01-01 00:35:00 73.22 69.16 1
2014-01-01 00:40:00 73.22 69.16 0
2014-01-01 00:45:00 73.22 69.16 0
2014-01-01 00:50:00 73.22 69.16 0
2014-01-01 00:55:00 73.22 69.16 0
2014-01-01 01:00:00 73.22 69.16 0
Is there a way we can achieve what I need? Thanks for your help.
na.fill can help here. The following line of code will preserve runs of NAs at the beginning and at the end but fill in the remaining NAs using na.locf:
zz <- na.locf(z, na.rm = FALSE) + 0 * na.fill(z, fill = c(NA, 0, NA))
giving:
> zz
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 73.16 71.11 NA
2014-01-01 00:15:00 73.22 71.11 NA
2014-01-01 00:16:00 NA 70.12 NA
2014-01-01 00:21:00 NA 70.16 NA
2014-01-01 00:23:00 NA 70.16 0
2014-01-01 00:26:00 NA 70.19 0
2014-01-01 00:28:00 NA 70.19 1
2014-01-01 00:31:00 NA 69.16 1
2014-01-01 00:33:00 NA NA 1
2014-01-01 00:38:00 NA NA 0
2014-01-01 00:43:00 NA NA 0
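To see why the one-liner works, it may help to look at its two pieces on a tiny series. The sketch below uses a made-up five-point zoo object (s, mask and filled are illustrative names): na.fill with fill = c(NA, 0, NA) leaves leading and trailing NAs alone and puts 0 into interior gaps, so multiplying by 0 yields a mask that is 0 inside the data and NA at both ends.

```r
library(zoo)

# Leading NA, an interior gap, and a trailing NA
s <- zoo(c(NA, 1, NA, 3, NA), as.Date("2014-01-01") + 0:4)

# 0 wherever the series has started and not yet ended, NA at both ends
mask <- 0 * na.fill(s, fill = c(NA, 0, NA))

# Carry values forward everywhere, then re-impose the NAs at the ends
filled <- na.locf(s, na.rm = FALSE) + mask
coredata(filled)
```

The interior gap is filled with 1, while the first and last observations stay NA.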
Note 1: We could reduce the read.table / zoo lines to three lines of the form:
z1 <- read.zoo(text = lines1, header = TRUE, sep = ",", tz = "")
Note 2: Perhaps what you want to do next is:
timesteps <- seq(start(zz), start(zz) + 3600, by = "5 min")
m <- merge(zz, zoo(, timesteps))
m.na <- na.locf(m, na.rm = FALSE) + 0 * na.fill(m, fill = c(NA, 0, NA))
window(m.na, timesteps)

Resources