merging large data.frame to fill missing hourly dates - r

I want to fill some missing dates in large data.frames. I saw different posts, but nothing is working. I'm using merge, which I thought would be easy, but the result is not what I expect.
My data consists of hourly data for the whole year, with the corresponding values of a variable. Here is just a sample:
# sample of data
dput(head(x1))
structure(list(date = structure(c(14617, 14617, 14617, 14617,
14617, 14617), class = "Date"), value = c(-9999, -9999, -9999,
-9999, -9999, -9999)), .Names = c("date", "value"), row.names =
c(2923L, 6545L, 10167L, 13789L, 17411L, 21033L), class = "data.frame")
So, since I want to add the missing dates, I created the correct and complete time series:
# Create hourly data
times <- seq(as.POSIXct("2010-01-01 00:00:00"), as.POSIXct("2010-12-31 23:00:00"), by="hour")
# Split into days and hours
nt <- as.Date(strptime(times, "%Y-%m-%d"))
ndays <- data.frame("date"=nt,"hour"=format(as.POSIXct(strptime(times,"%Y-%m-%d %H:%M:%S",tz="")) ,format = "%H:%M:%S"))
Then, I tried to merge ndays and x1 to get a new data.frame with all the dates (and hours):
newdata <- merge(ndays,x1,by="date",all.x = T)
But I don't get the values of x1, only NAs, so I tried different combinations of merge options, but none of them works. If I use:
newdata <- merge(x1, ndays,by="date",all.x = T)
The result looks like:
head(newdata)
date value hour
1 2010-01-08 -9999 12:00:00
2 2010-01-08 -9999 01:00:00
3 2010-01-08 -9999 02:00:00
4 2010-01-08 -9999 03:00:00
5 2010-01-08 -9999 00:00:00
6 2010-01-08 -9999 05:00:00
.....
But what I want is:
head(newdata)
date value hour
2010-01-01 NA 00:00:00
........
2010-01-08 -9999 12:00:00
2010-01-08 -9999 01:00:00
2010-01-08 -9999 02:00:00
That is, I want all the dates, and the final data.frame has to have 8760 rows, i.e. the number of hours per year (the timestep).
If I do:
newdata <- merge(ndays,x1,by="date",all = T)
Again, I get a new data.frame with 193680 rows, because all combinations of matching rows are kept. But I only want the values of x1 plus the days and hours for the whole year.
What am I missing to work with merge? Should I write another function to do it?

If I understand correctly, I believe this can be solved by updating in a join. This is a special kind of left join, i.e., take all rows of ndays and copy value only into those rows where a matching date is found:
library(data.table)
setDT(ndays)[unique(setDT(x1)), on = "date", value := value]
Note that only the unique rows of x1 are used, assuming that there is only one distinct value per day.
# show some relevant rows
ndays[date %in% (as.IDate("2010-01-08") + (-1:+1))]
date hour value
1: 2010-01-07 00:00:00 NA
2: 2010-01-07 01:00:00 NA
3: 2010-01-07 02:00:00 NA
4: 2010-01-07 03:00:00 NA
5: 2010-01-07 04:00:00 NA
6: 2010-01-07 05:00:00 NA
7: 2010-01-07 06:00:00 NA
8: 2010-01-07 07:00:00 NA
9: 2010-01-07 08:00:00 NA
10: 2010-01-07 09:00:00 NA
11: 2010-01-07 10:00:00 NA
12: 2010-01-07 11:00:00 NA
13: 2010-01-07 12:00:00 NA
14: 2010-01-07 13:00:00 NA
15: 2010-01-07 14:00:00 NA
16: 2010-01-07 15:00:00 NA
17: 2010-01-07 16:00:00 NA
18: 2010-01-07 17:00:00 NA
19: 2010-01-07 18:00:00 NA
20: 2010-01-07 19:00:00 NA
21: 2010-01-07 20:00:00 NA
22: 2010-01-07 21:00:00 NA
23: 2010-01-07 22:00:00 NA
24: 2010-01-07 23:00:00 NA
25: 2010-01-08 00:00:00 -9999
26: 2010-01-08 01:00:00 -9999
27: 2010-01-08 02:00:00 -9999
28: 2010-01-08 03:00:00 -9999
29: 2010-01-08 04:00:00 -9999
30: 2010-01-08 05:00:00 -9999
31: 2010-01-08 06:00:00 -9999
32: 2010-01-08 07:00:00 -9999
33: 2010-01-08 08:00:00 -9999
34: 2010-01-08 09:00:00 -9999
35: 2010-01-08 10:00:00 -9999
36: 2010-01-08 11:00:00 -9999
37: 2010-01-08 12:00:00 -9999
38: 2010-01-08 13:00:00 -9999
39: 2010-01-08 14:00:00 -9999
40: 2010-01-08 15:00:00 -9999
41: 2010-01-08 16:00:00 -9999
42: 2010-01-08 17:00:00 -9999
43: 2010-01-08 18:00:00 -9999
44: 2010-01-08 19:00:00 -9999
45: 2010-01-08 20:00:00 -9999
46: 2010-01-08 21:00:00 -9999
47: 2010-01-08 22:00:00 -9999
48: 2010-01-08 23:00:00 -9999
49: 2010-01-09 00:00:00 NA
50: 2010-01-09 01:00:00 NA
51: 2010-01-09 02:00:00 NA
52: 2010-01-09 03:00:00 NA
53: 2010-01-09 04:00:00 NA
54: 2010-01-09 05:00:00 NA
55: 2010-01-09 06:00:00 NA
56: 2010-01-09 07:00:00 NA
57: 2010-01-09 08:00:00 NA
58: 2010-01-09 09:00:00 NA
59: 2010-01-09 10:00:00 NA
60: 2010-01-09 11:00:00 NA
61: 2010-01-09 12:00:00 NA
62: 2010-01-09 13:00:00 NA
63: 2010-01-09 14:00:00 NA
64: 2010-01-09 15:00:00 NA
65: 2010-01-09 16:00:00 NA
66: 2010-01-09 17:00:00 NA
67: 2010-01-09 18:00:00 NA
68: 2010-01-09 19:00:00 NA
69: 2010-01-09 20:00:00 NA
70: 2010-01-09 21:00:00 NA
71: 2010-01-09 22:00:00 NA
72: 2010-01-09 23:00:00 NA
date hour value
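For comparison (not part of the answer above), the original merge() approach can also produce the 8760-row result if the duplicated daily rows of x1 are collapsed first, under the same assumption of one distinct value per date:
# collapse x1 to one row per date, then left-join onto the full calendar
newdata <- merge(ndays, unique(x1), by = "date", all.x = TRUE)
nrow(newdata)  # 8760: one row per hour of the year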

Related

How to use own data with quantstrat and quantmod?

I am learning quantstrat and am working on a project where I use a local CSV file that I exported from MetaTrader 5. I managed to load the data into an xts object called fulldata_xts, from which I created the subsets bt_xts and wf_xts for the backtest and walk-forward respectively. Below is the head of fulldata_xts. I have added columns beyond the standard OHLCV.
EURUSD.Open EURUSD.High EURUSD.Low
2010-01-03 16:00:00 1.43259 1.43336 1.43151
2010-01-03 17:00:00 1.43151 1.43153 1.42879
2010-01-03 18:00:00 1.42885 1.42885 1.42569
2010-01-03 19:00:00 1.42702 1.42989 1.42700
2010-01-03 20:00:00 1.42938 1.42968 1.42718
2010-01-03 21:00:00 1.42847 1.42985 1.42822
EURUSD.Close EURUSD.Volume EURUSD.Vol
2010-01-03 16:00:00 1.43153 969 0
2010-01-03 17:00:00 1.42886 2098 0
2010-01-03 18:00:00 1.42705 2082 0
2010-01-03 19:00:00 1.42939 1544 0
2010-01-03 20:00:00 1.42848 1131 0
2010-01-03 21:00:00 1.42897 1040 0
EURUSD.Spread EURUSD.Year EURUSD.Month
2010-01-03 16:00:00 12 2010 1
2010-01-03 17:00:00 15 2010 1
2010-01-03 18:00:00 15 2010 1
2010-01-03 19:00:00 14 2010 1
2010-01-03 20:00:00 15 2010 1
2010-01-03 21:00:00 14 2010 1
EURUSD.Day EURUSD.Weekday EURUSD.Hour
2010-01-03 16:00:00 4 2 0
2010-01-03 17:00:00 4 2 1
2010-01-03 18:00:00 4 2 2
2010-01-03 19:00:00 4 2 3
2010-01-03 20:00:00 4 2 4
2010-01-03 21:00:00 4 2 5
EURUSD.Session EURUSD.EMA14
2010-01-03 16:00:00 0 NA
2010-01-03 17:00:00 0 NA
2010-01-03 18:00:00 0 NA
2010-01-03 19:00:00 0 NA
2010-01-03 20:00:00 0 NA
2010-01-03 21:00:00 0 NA
EURUSD.EMA14_Out
2010-01-03 16:00:00 0
2010-01-03 17:00:00 0
2010-01-03 18:00:00 0
2010-01-03 19:00:00 0
2010-01-03 20:00:00 0
2010-01-03 21:00:00 0
I am trying to create my own indicator using the following code:
add.indicator(strategy1.st, name = sentiment,
arguments = list(date = quote(Cl(mktdata))),
label = "sentiment")
I based the above code on a DataCamp course, but it is similar to what is being discussed here. My questions are:
How can I specify my own data, i.e. bt_xts, in the code above? Please correct me if I am wrong, but from what I gather the mktdata object gets created when the data is downloaded using quantstrat facilities, which does not apply in my case since I read the data from a CSV and converted it to a data.table and then to an xts object.
The sentiment function inside the add.indicator call above for now only returns 0, 1 or 2 (stay out, bullish, bearish) based on the day of the week. I plan to develop this further once I get the other parts of the strategy working. The function takes a date variable, hence the arguments = list(date = quote(Cl(mktdata))) part is incorrect. What should I put inside quote() to specify the date column of my data, bt_xts?
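No answer is recorded here, but as a rough sketch of the usual quantstrat pattern (the names bt_xts, strategy1.st, sentiment and the symbol "EURUSD" come from the question or are assumed): mktdata is built from whatever xts object is bound to the symbol name your portfolio was initialised with, so assigning your own object to that name is enough, and the row timestamps of mktdata can be passed with index():
# bind your own data to the symbol name used in initPortf()/applyStrategy();
# quantstrat then builds mktdata from this object instead of downloading anything
EURUSD <- bt_xts
# pass the row timestamps (not a price column) to the custom indicator
add.indicator(strategy1.st, name = "sentiment",
              arguments = list(date = quote(index(mktdata))),
              label = "sentiment")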

R: yes-no factor based on previous entries

I've got a time series dataset from a meteo station. There are 3 columns: time (date and time), p (rain, mm) and h (water level, m).
I need to make a new column factor_rain with 1 and 0 values: 1 if the water level (df$h) was influenced by rain (df$p), i.e. if there was rain within the last 5 hours (5 entries).
Otherwise it should be 0.
A part of the dataset is here:
df <- data.frame(time = c("2017-06-04 9:00:00", "2017-06-04 13:00:00", "2017-06-04 17:00:00",
"2017-06-04 19:00:00", "2017-06-04 21:00:00", "2017-06-04 23:00:00",
"2017-06-05 9:00:00", "2017-06-05 11:00:00",
"2017-06-05 13:00:00", "2017-06-05 16:00:00",
"2017-06-05 19:00:00", "2017-06-05 21:00:00", "2017-06-05 23:00:00",
"2017-06-06 9:00:00", "2017-06-06 11:00:00", "2017-06-06 13:00:00",
"2017-06-06 16:00:00", "2017-06-06 17:00:00", "2017-06-06 18:00:00",
"2017-06-06 19:00:00"),
p = c(NA, NA, 16.4, NA, NA, NA, NA, NA, NA, NA, 12,
NA, NA, NA, NA, NA, NA, NA, NA, NA),
h = c(23,NA,NA,NA,NA,32,NA,NA,28,NA,NA,
33,NA,NA,NA,29,NA,NA,NA,NA))
I tried the simplest way I could think of, but unfortunately it only works for one case:
> df$factor_rain[df$p[-c(1:5)] > 1 & df$h > 1] <- 1
> Warning message:
In df$p[-c(1:5)] > 1 & df$h > 1 :
longer object length is not a multiple of shorter object length
Is there any way to fix it? If you can suggest how to use real time (something from the xts library, for example) it would be great, i.e. using a 5-hour threshold, not 5 values.
By the way, this is the result I need:
> df
time p h factor_rain
1 2017-06-04 9:00:00 NA 23 0
2 2017-06-04 13:00:00 NA NA 0
3 2017-06-04 17:00:00 16.4 NA 0
4 2017-06-04 19:00:00 NA NA 0
5 2017-06-04 21:00:00 NA NA 0
6 2017-06-04 23:00:00 NA 32 1
7 2017-06-05 9:00:00 NA NA 0
8 2017-06-05 11:00:00 NA NA 0
9 2017-06-05 13:00:00 NA 28 0
10 2017-06-05 16:00:00 NA NA 0
11 2017-06-05 19:00:00 12.0 NA 0
12 2017-06-05 21:00:00 NA 33 1
13 2017-06-05 23:00:00 NA NA 0
14 2017-06-06 9:00:00 NA NA 0
15 2017-06-06 11:00:00 NA NA 0
16 2017-06-06 13:00:00 NA 29 0
17 2017-06-06 16:00:00 NA NA 0
18 2017-06-06 17:00:00 NA NA 0
19 2017-06-06 18:00:00 NA NA 0
20 2017-06-06 19:00:00 NA NA 0
You can use
df$factorrain = FALSE
df$factorrain[rowSums(expand.grid(which(!is.na(df$p)), 0:4))] = TRUE
# time p h factorrain
# 1 2017-06-04 9:00:00 NA 23 FALSE
# 2 2017-06-04 13:00:00 NA NA FALSE
# 3 2017-06-04 17:00:00 16.4 NA TRUE
# 4 2017-06-04 19:00:00 NA NA TRUE
# 5 2017-06-04 21:00:00 NA NA TRUE
# 6 2017-06-04 23:00:00 NA 32 TRUE
# 7 2017-06-05 9:00:00 NA NA TRUE
# 8 2017-06-05 11:00:00 NA NA FALSE
# 9 2017-06-05 13:00:00 NA 28 FALSE
# 10 2017-06-05 16:00:00 NA NA FALSE
# 11 2017-06-05 19:00:00 12.0 NA TRUE
# 12 2017-06-05 21:00:00 NA 33 TRUE
# 13 2017-06-05 23:00:00 NA NA TRUE
# 14 2017-06-06 9:00:00 NA NA TRUE
# 15 2017-06-06 11:00:00 NA NA TRUE
# 16 2017-06-06 13:00:00 NA 29 FALSE
# 17 2017-06-06 16:00:00 NA NA FALSE
# 18 2017-06-06 17:00:00 NA NA FALSE
# 19 2017-06-06 18:00:00 NA NA FALSE
# 20 2017-06-06 19:00:00 NA NA FALSE
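To see why this indexing works for the sample data: expand.grid() pairs each rain row with the offsets 0 to 4, and rowSums() turns every pair into a row index, i.e. the rain row itself plus the four following rows.
which(!is.na(df$p))
# [1]  3 11
rowSums(expand.grid(which(!is.na(df$p)), 0:4))
#  [1]  3 11  4 12  5 13  6 14  7 15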
Or, a similar approach with sapply:
df$factorrain = FALSE
df$factorrain[sapply(which(!is.na(df$p)), function(x) x+(0:4))] = TRUE
A solution can be achieved by using a non-equi join from data.table.
library(data.table)
df$time <- as.POSIXct(df$time, format = "%Y-%m-%d %H:%M:%S")
setDT(df)
df[,timeLow := time-5*60*60]
df[df,.(time, p, h = i.h), on=.(time < time, time >= timeLow)][
,.(factor_rain = ifelse(!is.na(first(h)), any(!is.na(p)),FALSE)),by=.(time)][
df,.(time, p, h, factor_rain),on="time"]
# time p h factor_rain
# 1: 2017-06-04 09:00:00 NA 23 FALSE
# 2: 2017-06-04 13:00:00 NA NA FALSE
# 3: 2017-06-04 17:00:00 16.4 NA FALSE
# 4: 2017-06-04 19:00:00 NA NA FALSE
# 5: 2017-06-04 21:00:00 NA NA FALSE
# 6: 2017-06-04 23:00:00 NA 32 FALSE <-- There is no rain in last 5 hours
# 7: 2017-06-05 09:00:00 NA NA FALSE
# 8: 2017-06-05 11:00:00 NA NA FALSE
# 9: 2017-06-05 13:00:00 NA 28 FALSE
# 10: 2017-06-05 16:00:00 NA NA FALSE
# 11: 2017-06-05 19:00:00 12.0 NA FALSE
# 12: 2017-06-05 21:00:00 NA 33 TRUE
# 13: 2017-06-05 23:00:00 NA NA FALSE
# 14: 2017-06-06 09:00:00 NA NA FALSE
# 15: 2017-06-06 11:00:00 NA NA FALSE
# 16: 2017-06-06 13:00:00 NA 29 FALSE
# 17: 2017-06-06 16:00:00 NA NA FALSE
# 18: 2017-06-06 17:00:00 NA NA FALSE
# 19: 2017-06-06 18:00:00 NA NA FALSE
# 20: 2017-06-06 19:00:00 NA NA FALSE
Note: The solution can be optimized a bit. I'll take up optimization in a while.

how to take averaged diurnal for each month for two columns with ggplot2

I have time series data with two columns, and I want a graph with the averaged hourly pattern for each month, like the graph attached, but with two time series.
timestamp ET_control ET_treatment
1 2016-01-01 00:00:00 NA NA
2 2016-01-01 00:30:00 NA NA
3 2016-01-01 01:00:00 NA NA
4 2016-01-01 01:30:00 NA NA
5 2016-01-01 02:00:00 NA NA
6 2016-01-01 02:30:00 NA NA
7 2016-01-01 03:00:00 NA NA
8 2016-01-01 03:30:00 NA NA
9 2016-01-01 04:00:00 NA NA
10 2016-01-01 04:30:00 NA NA
11 2016-01-01 05:00:00 NA NA
12 2016-01-01 05:30:00 NA NA
13 2016-01-01 06:00:00 NA NA
14 2016-01-01 06:30:00 NA NA
15 2016-01-01 07:00:00 NA NA
16 2016-01-01 07:30:00 NA NA
17 2016-01-01 08:00:00 NA NA
18 2016-01-01 08:30:00 NA NA
19 2016-01-01 09:00:00 NA NA
20 2016-01-01 09:30:00 NA NA
21 2016-01-01 10:00:00 NA NA
22 2016-01-01 10:30:00 NA NA
23 2016-01-01 11:00:00 NA NA
24 2016-01-01 11:30:00 0.09863437 NA
25 2016-01-01 12:00:00 0.11465258 NA
26 2016-01-01 12:30:00 0.12356855 NA
27 2016-01-01 13:00:00 0.09246215 0.085398782
28 2016-01-01 13:30:00 0.08843156 0.072877001
29 2016-01-01 14:00:00 0.08536019 0.081885947
30 2016-01-01 14:30:00 0.08558541 NA
31 2016-01-01 15:00:00 0.05571436 NA
32 2016-01-01 15:30:00 0.04087248 0.038582547
33 2016-01-01 16:00:00 0.04233724 NA
34 2016-01-01 16:30:00 0.02150660 0.019560578
35 2016-01-01 17:00:00 0.01803765 0.019691155
36 2016-01-01 17:30:00 NA 0.005190489
37 2016-01-01 18:00:00 NA NA
38 2016-01-01 18:30:00 NA NA
39 2016-01-01 19:00:00 NA NA
40 2016-01-01 19:30:00 NA NA
41 2016-01-01 20:00:00 NA NA
42 2016-01-01 20:30:00 NA NA
43 2016-01-01 21:00:00 NA NA
44 2016-01-01 21:30:00 NA NA
45 2016-01-01 22:00:00 NA NA
46 2016-01-01 22:30:00 NA NA
47 2016-01-01 23:00:00 NA NA
48 2016-01-01 23:30:00 NA NA
49 2016-01-02 00:00:00 NA NA
50 2016-01-02 00:30:00 NA NA
Given that t is your data.frame, and with the packages dplyr and ggplot2 loaded:
t <- t %>% mutate(
month = format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%b"),
hour=format(strptime(timestamp, "%Y-%m-%d %H:%M:%S"), "%H"))
tm <- t %>% group_by(month, hour) %>%
summarize(ET_control_mean=mean(ET_control, na.rm=T))
ggplot(tm, aes(x=hour, y=ET_control_mean)) + geom_point() + facet_wrap(~ month)
If you want to have both columns in your graph, you should transform your data into the 'long' format.
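For example, a sketch of that long-format step (assuming tidyr is available for pivot_longer; column names follow the data above):
library(tidyr)
tm2 <- t %>%
  group_by(month, hour) %>%
  summarize(ET_control = mean(ET_control, na.rm = TRUE),
            ET_treatment = mean(ET_treatment, na.rm = TRUE)) %>%
  pivot_longer(c(ET_control, ET_treatment),
               names_to = "series", values_to = "ET_mean")
ggplot(tm2, aes(x = hour, y = ET_mean, colour = series)) +
  geom_point() +
  facet_wrap(~ month)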

R: Compare data.table and pass variable while respecting key

I have two data.tables:
original <- data.frame(id = c(rep("RE01",5),rep("RE02",5)),date.time = head(seq.POSIXt(as.POSIXct("2015-11-01 01:00:00"),as.POSIXct("2015-11-05 01:00:00"),60*60*10),10))
compare <- data.frame(id = c("RE01","RE02"),seq = c(1,2),start = as.POSIXct(c("2015-11-01 20:00:00","2015-11-04 08:00:00")),end = as.POSIXct(c("2015-11-02 08:00:00","2015-11-04 20:00:00")))
setDT(original)
setDT(compare)
I would like to check the date in each row of original and see if it lies between the start and end dates of compare while respecting the id. If it does, a variable (compare$seq) should be passed to original as diff.seq. The output should look like this:
original
id date.time diff.seq
1 RE01 2015-11-01 01:00:00 NA
2 RE01 2015-11-01 11:00:00 NA
3 RE01 2015-11-01 21:00:00 1
4 RE01 2015-11-02 07:00:00 1
5 RE01 2015-11-02 17:00:00 NA
6 RE02 2015-11-03 03:00:00 NA
7 RE02 2015-11-03 13:00:00 NA
8 RE02 2015-11-03 23:00:00 NA
9 RE02 2015-11-04 09:00:00 2
10 RE02 2015-11-04 19:00:00 2
I've been reading the manual and SO for hours and trying "on", "by" and so on.. without any success. Can anybody point me in the right direction?
As said in the comments, this is very straightforward using data.table::foverlaps.
You basically have to create an additional column in the original data set in order to set the join boundaries, then key the two data sets by the columns you want to join on, and then simply run foverlaps and select the desired columns:
original[, end := date.time]
setkey(original, id, date.time, end)
setkey(compare, id, start, end)
foverlaps(original, compare)[, .(id, date.time, seq)]
# id date.time seq
# 1: RE01 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2
Alternatively, you can run foverlaps the other way around and then just update the original data set by reference while selecting the correct rows to update
indx <- foverlaps(compare, original, which = TRUE)
original[indx$yid, diff.seq := indx$xid]
original
# id date.time end diff.seq
# 1: RE01 2015-11-01 01:00:00 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2015-11-04 19:00:00 2
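As a small follow-up (not part of the answer), the temporary end column can be dropped by reference once the join is done:
original[, end := NULL]  # remove the helper join-boundary column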

R: how to keep legitimate NAs in a merged zoo object

I have multiple time-series objects with a regular interval of five minutes, but they can have different start and end times. They can also log at different times, not necessarily at minutes 5, 10, 15, etc.
I want to merge those objects, but I want to keep the legitimate NAs intact. For example, if one object starts logging at a later time, then the NAs at the beginning are legitimate NAs. The same applies if one object stops logging earlier: then the NAs at the end are legitimate.
But there is no option in na.locf to keep both kinds of NAs intact.
Here is an example of my problem:
lines1="Index,x1
2014-01-01 00:00:00,73.06
2014-01-01 00:05:00,73.11
2014-01-01 00:10:00,73.16
2014-01-01 00:15:00,73.22"
lines2="Index,x2
2014-01-01 00:11:00,71.11
2014-01-01 00:16:00,70.12
2014-01-01 00:21:00,70.16
2014-01-01 00:26:00,70.19
2014-01-01 00:31:00,69.16"
lines3="Index,x3
2014-01-01 00:23:00,0
2014-01-01 00:28:00,1
2014-01-01 00:33:00,1
2014-01-01 00:38:00,0
2014-01-01 00:43:00,0"
df1=read.table(text = lines1, header = TRUE, sep = ",")
df2=read.table(text = lines2, header = TRUE, sep = ",")
df3=read.table(text = lines3, header = TRUE, sep = ",")
z1 = zoo(df1$x1, as.POSIXct(df1$Index))
z2 = zoo(df2$x2, as.POSIXct(df2$Index))
z3 = zoo(df3$x3, as.POSIXct(df3$Index))
z = merge(z1,z2,z3)
z
z.na.locf = na.locf(z)
z.na.locf
timesteps = seq(as.POSIXct("2014-01-01 00:00:00"),
as.POSIXct("2014-01-01 01:00:00"),
by = "5 min")
z.timesteps = na.locf(z, xout=timesteps)
z.timesteps
The merged object is this:
> z
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 NA 71.11 NA
2014-01-01 00:15:00 73.22 NA NA
2014-01-01 00:16:00 NA 70.12 NA
2014-01-01 00:21:00 NA 70.16 NA
2014-01-01 00:23:00 NA NA 0
2014-01-01 00:26:00 NA 70.19 NA
2014-01-01 00:28:00 NA NA 1
2014-01-01 00:31:00 NA 69.16 NA
2014-01-01 00:33:00 NA NA 1
2014-01-01 00:38:00 NA NA 0
2014-01-01 00:43:00 NA NA 0
Note that the NAs at the end of z1 are legitimate, as are those at the beginning of z3 and at the beginning and end of z2. The NAs that need to be replaced are the ones in the middle of the data. The problem is that if I try to fill in the missing values in the middle of the data, the legitimate NAs are gone too:
> z.na.locf
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 73.16 71.11 NA
2014-01-01 00:15:00 73.22 71.11 NA
2014-01-01 00:16:00 73.22 70.12 NA
2014-01-01 00:21:00 73.22 70.16 NA
2014-01-01 00:23:00 73.22 70.16 0
2014-01-01 00:26:00 73.22 70.19 0
2014-01-01 00:28:00 73.22 70.19 1
2014-01-01 00:31:00 73.22 69.16 1
2014-01-01 00:33:00 73.22 69.16 1
2014-01-01 00:38:00 73.22 69.16 0
2014-01-01 00:43:00 73.22 69.16 0
Note that for z1 and z2, the legitimate NAs at the end are gone.
Furthermore, if I re-sample the data to a common regular timestamp, the NAs at both the beginning and the end are gone too:
> z.timesteps
z1 z2 z3
2014-01-01 00:00:00 73.06 71.11 0
2014-01-01 00:05:00 73.11 71.11 0
2014-01-01 00:10:00 73.16 71.11 0
2014-01-01 00:15:00 73.22 71.11 0
2014-01-01 00:20:00 73.22 70.12 0
2014-01-01 00:25:00 73.22 70.16 0
2014-01-01 00:30:00 73.22 70.19 1
2014-01-01 00:35:00 73.22 69.16 1
2014-01-01 00:40:00 73.22 69.16 0
2014-01-01 00:45:00 73.22 69.16 0
2014-01-01 00:50:00 73.22 69.16 0
2014-01-01 00:55:00 73.22 69.16 0
2014-01-01 01:00:00 73.22 69.16 0
Is there a way to achieve what I need? Thanks for your help.
na.fill can help here. The following line of code preserves runs of NAs at the beginning and at the end but fills in the remaining NAs using na.locf. The trick is that na.fill(z, fill = c(NA, 0, NA)) fills only the interior NAs (with 0) and leaves leading and trailing NAs untouched, so multiplying it by 0 and adding it acts as a mask that puts the NA runs at the ends back:
zz <- na.locf(z, na.rm = FALSE) + 0 * na.fill(z, fill = c(NA, 0, NA))
giving:
> zz
z1 z2 z3
2014-01-01 00:00:00 73.06 NA NA
2014-01-01 00:05:00 73.11 NA NA
2014-01-01 00:10:00 73.16 NA NA
2014-01-01 00:11:00 73.16 71.11 NA
2014-01-01 00:15:00 73.22 71.11 NA
2014-01-01 00:16:00 NA 70.12 NA
2014-01-01 00:21:00 NA 70.16 NA
2014-01-01 00:23:00 NA 70.16 0
2014-01-01 00:26:00 NA 70.19 0
2014-01-01 00:28:00 NA 70.19 1
2014-01-01 00:31:00 NA 69.16 1
2014-01-01 00:33:00 NA NA 1
2014-01-01 00:38:00 NA NA 0
2014-01-01 00:43:00 NA NA 0
Note 1: We could reduce the read.table / zoo lines to three lines of the form:
z1 <- read.zoo(text = lines1, header = TRUE, sep = ",", tz = "")
Note 2: Perhaps what you want to do next is:
timesteps <- seq(start(zz), start(zz) + 3600, by = "5 min")
m <- merge(zz, zoo(, timesteps))
m.na <- na.locf(m, na.rm = FALSE) + 0 * na.fill(m, fill = c(NA, 0, NA))
window(m.na, timesteps)
