In a dataframe, I have wind speed data measured four times a day, at 00:00, 06:00, 12:00 and 18:00. To combine these with other data, I need to fill the time in between to a resolution of 15 minutes. I would like to fill the gaps by simple interpolation.
The following example produces two corresponding sample dataframes, df1 and df2, which need to be merged. In the resulting merged dataframe, the gaps between the 6-hourly values (where windspeed is NA) need to be filled by simple interpolation. My problem is how to merge the two and do the actual interpolation between the given values.
First dataframe
Creation:
# create a corresponding sample data frame
df1 <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2015, 10, 1, 0, 0, 0, tz = "GMT"),
    to = ISOdatetime(2015, 10, 14, 23, 59, 0, tz = "GMT"),
    by = "6 hour"
  ),
  windspeed = abs(rnorm(14 * 4, 10, 4)) # abs() because windspeed should be positive
)
Resulting dataframe:
> # show the head of the dataframe
> head(df1)
date windspeed
1 2015-10-01 00:00:00 17.928217
2 2015-10-01 06:00:00 11.306025
3 2015-10-01 12:00:00 6.648131
4 2015-10-01 18:00:00 10.320146
5 2015-10-02 00:00:00 2.138559
6 2015-10-02 06:00:00 9.076344
Second dataframe
Creation:
# create a 2nd corresponding sample data frame
df2 <- data.frame(
  date = seq.POSIXt(
    from = ISOdatetime(2015, 10, 1, 0, 0, 0, tz = "GMT"),
    to = ISOdatetime(2015, 10, 14, 23, 59, 0, tz = "GMT"),
    by = "15 min"
  ),
  var = abs(rnorm(14 * 24 * 4, 300, 100))
)
Resulting dataframe:
> # show the head of the 2nd dataframe
> head(df2)
date var
1 2015-10-01 00:00:00 198.2657
2 2015-10-01 00:15:00 472.9041
3 2015-10-01 00:30:00 605.8776
4 2015-10-01 00:45:00 429.0949
5 2015-10-01 01:00:00 400.2390
6 2015-10-01 01:15:00 317.1503
Here is a solution.
First merge them, using all = TRUE to keep all rows:
df3 <- merge(df1, df2, all = TRUE)
Then use approx() for the interpolation:
df3$windspeed <- approx(x = df1$date, y = df1$windspeed, xout = df3$date)$y
The only caveat is that the trailing values will be NA unless your last timestamp has a windspeed value, but everything in between will be filled.
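If the trailing NAs are a problem, approx() can also extend past the last observation. A minimal sketch, assuming it is acceptable to carry the nearest observed value to the edges:
# rule = 2 uses the value at the closest data extreme instead of returning NA
df3$windspeed <- approx(x = df1$date, y = df1$windspeed,
                        xout = df3$date, rule = 2)$y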
Related
I have a dataframe in R (thousands of rows) containing data like this.
"id","ts"
1,2010-11-11 06:00:00
2,2010-11-11 06:01:00
3,2010-11-11 06:02:00
4,2010-11-11 06:03:00
...
11,2010-11-11 06:10:00
12,2010-11-11 06:11:00
13,2010-11-11 06:12:00
14,2010-11-11 06:13:00
15,2010-11-11 06:14:00
16,2010-11-11 06:15:00
17,2010-11-11 10:00:00
18,2010-11-11 10:01:00
19,2010-11-11 10:02:00
20,2010-11-11 10:03:00
21,2010-11-11 10:04:00
22,2010-11-11 10:05:00
...
I have data like the above for many days (11 Nov 2010 - 15 Dec 2010). Ideally, each day has timestamp data (as.POSIXct, tz = "UTC") in three time slots within the ranges given below. However, some days have data for only one or two time slots.
Slot1: 06:00:00 - 06:15:00
Slot2: 10:00:00 - 10:15:00
Slot3: 13:00:00 - 13:15:00
What I would like to do is to add a group column (a continuous group number running through to the 15 Dec 2010 data) based on the above three time ranges. The expected output is:
"id","ts","Group"
1,2010-11-11 06:00:00,1
2,2010-11-11 06:01:00,1
3,2010-11-11 06:02:00,1
4,2010-11-11 06:03:00,1
...
11,2010-11-11 06:10:00,1
12,2010-11-11 06:11:00,1
13,2010-11-11 06:12:00,1
14,2010-11-11 06:13:00,1
15,2010-11-11 06:14:00,1
16,2010-11-11 06:15:00,1
17,2010-11-11 10:00:00,2
18,2010-11-11 10:01:00,2
19,2010-11-11 10:02:00,2
20,2010-11-11 10:03:00,2
21,2010-11-11 10:04:00,2
22,2010-11-11 10:05:00,2
...
How could this be achieved in R?
Some reproducible sample data is here:
start1 <- as.POSIXct("2010-11-11 06:00:00", tz = "UTC") # tz must be an argument; a " UTC" suffix inside the string is ignored
end1   <- as.POSIXct("2010-11-11 06:15:00", tz = "UTC")
start2 <- as.POSIXct("2010-11-11 10:00:00", tz = "UTC")
end2   <- as.POSIXct("2010-11-11 10:15:00", tz = "UTC")
start3 <- as.POSIXct("2010-11-11 13:00:00", tz = "UTC")
end3   <- as.POSIXct("2010-11-11 13:15:00", tz = "UTC")
ts1 <- data.frame(ts = seq.POSIXt(start1, end1, by = "min"))
ts2 <- data.frame(ts = seq.POSIXt(start2, end2, by = "min"))
ts3 <- data.frame(ts = seq.POSIXt(start3, end3, by = "min"))
ts  <- data.frame(rbind(ts1, ts2, ts3))
id  <- data.frame(id = seq.int(1, 48, 1))
dat <- data.frame(cbind(id, ts))
You can extract the hour and minute values from ts and use case_when to assign a Group number.
library(dplyr)
library(lubridate)
dat %>%
  arrange(ts) %>%
  mutate(hour = hour(ts),
         minute = minute(ts),
         date = as.Date(ts),
         # slot within the day: 1, 2 or 3
         Group = case_when(hour == 6 & minute <= 15 ~ 1L,
                           hour == 10 & minute <= 15 ~ 2L,
                           hour == 13 & minute <= 15 ~ 3L),
         # make the slot number unique across days, then renumber consecutively
         Group = (as.integer(date - min(date)) * 3) + Group,
         Group = match(Group, unique(Group))) -> result
result
result
You can keep only the columns that you want using select, i.e. result %>% select(id, ts, Group).
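On the sample dat above, all three slots fall on the same day, so the groups come out as 1, 2 and 3; for instance:
result %>% select(id, ts, Group) %>% head(3)
#   id                  ts Group
# 1  1 2010-11-11 06:00:00     1
# 2  2 2010-11-11 06:01:00     1
# 3  3 2010-11-11 06:02:00     1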
I'm trying to summarize values for overlapping time periods.
I can use only tidyr, ggplot2 and dplyr libraries. Base R is preferred though.
My data looks like this, but usually it has around 100 records:
df <- structure(list(Start = structure(c(1546531200, 1546531200, 546531200, 1546638252.6316, 1546549800, 1546534800, 1546545600, 1546531200, 1546633120, 1547065942.1053), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Stop = structure(c(1546770243.1579, 1546607400, 1547110800, 1546670652.6316, 1547122863.1579, 1546638252.6316, 1546878293.5579, 1546416000, 1546849694.4, 1547186400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Value = c(12610, 520, 1500, 90, 331380, 27300, 6072, 4200, 61488, 64372)), .Names = c("Start", "Stop", "Value"), row.names = c(41L, 55L, 25L, 29L, 38L, 28L, 1L, 20L, 14L, 31L), class = c("tbl_df", "tbl", "data.frame"))
head(df) and str(df) give:
Start Stop Value
2019-01-03 16:00:00 2019-01-06 10:24:03 12610
2019-01-03 16:00:00 2019-01-04 13:10:00 520
2019-01-03 16:00:00 2019-01-10 09:00:00 1500
2019-01-04 21:44:12 2019-01-05 06:44:12 90
2019-01-03 21:10:00 2019-01-10 12:21:03 331380
2019-01-03 17:00:00 2019-01-04 21:44:12 27300
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 3 variables:
$ Start: POSIXct, format: "2019-01-03 16:00:00" "2019-01-03 16:00:00" ...
$ Stop : POSIXct, format: "2019-01-06 10:24:03" "2019-01-04 13:10:00" ...
$ Value: num 12610 520 1500 90 331380 ...
So there are overlapping time periods with "Start" and "Stop" dates and an assigned value. Within each record the value applies between df$Start and df$Stop; outside that range it counts as 0.
I want to create another dataframe from which I could show how these values sum up and change over time. The desired output would look like this (the "sum" column is made up):
> head(df2)
timestamp sum
"2019-01-02 09:00:00 CET" 14352
"2019-01-03 17:00:00 CET" 6253
"2019-01-03 18:00:00 CET" 23465
"2019-01-03 21:00:00 CET" 3241
"2019-01-03 22:10:00 CET" 23235
"2019-01-04 14:10:00 CET" 123321
To get unique timestamps:
timestamps <- sort(unique(c(df$`Start`, df$`Stop`)))
With such a df2 dataframe I could easily draw a graph with ggplot, but how do I get these sums?
I think I should iterate over the df data frame with either a custom function or some built-in summarise function that would work like this:
fnct <- function(date, min, max, value) {
  if (date >= min && date <= max) {
    a <- value
  } else {
    a <- 0
  }
  return(a)
}
...and for every given date from timestamps, iterate through df and give me the sum of values for that timestamp.
It looks really simple and I'm missing something very basic.
Here's a tidyverse solution similar to my response to this recent question. I gather to bring the timestamps (Starts and Stops) into one column, with another column specifying which. The Starts add the value and the Stops subtract it, and then we just take the cumulative sum to get values at all the instants when the sum changes.
For 100 records, there won't be any perceivable speed improvement from using data.table; in my experience it starts to make more of a difference around 1M records, especially when grouping is involved.
library(dplyr); library(tidyr)
df2 <- df %>%
  gather(type, time, Start:Stop) %>%
  mutate(chg = if_else(type == "Start", Value, -Value)) %>%
  arrange(time) %>%
  mutate(sum = cumsum(chg)) # EDIT: corrected per OP comment
> head(df2)
## A tibble: 6 x 5
# Value type time chg sum
# <dbl> <chr> <dttm> <dbl> <dbl>
#1 1500 Start 1987-04-27 14:13:20 1500 1500
#2 4200 Stop 2019-01-02 08:00:00 -4200 -2700
#3 12610 Start 2019-01-03 16:00:00 12610 9910
#4 520 Start 2019-01-03 16:00:00 520 10430
#5 4200 Start 2019-01-03 16:00:00 4200 14630
#6 27300 Start 2019-01-03 17:00:00 27300 41930
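Since the goal was a ggplot graph, a step plot suits this kind of event-based cumulative sum; a minimal sketch using the df2 built above:
library(ggplot2)
# the sum is constant between events, so geom_step() draws it honestly
ggplot(df2, aes(x = time, y = sum)) +
  geom_step()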
In the past I have tried to solve similar problems using the tidyverse/base R... but nothing comes even remotely close to the speed that data.table provides for these kinds of operations, so I encourage you to give it a try.
For questions like this, my favourite function is foverlaps() from the data.table package. With this function you can (fast!) perform an overlap join. If you want more flexibility in your join than foverlaps() provides, a non-equi join (again using data.table) is probably the best (and fastest!) option. But foverlaps() will do here (I guess).
I used the sample data you provided, but filtered out rows where Stop <= Start (probably a typo in your sample data). When df$Start is not before df$Stop, foverlaps gives a warning and won't execute.
library( data.table )
# create a data.table with the periods you wish to summarise on
# NB: UTC is used as timezone, since this is also the case in the sample data provided!
dt.dates <- data.table( id = paste0( "Day", 1:31 ),
                        Start = seq( as.POSIXct( "2019-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
                                     as.POSIXct( "2019-01-31 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
                                     by = "1 days"),
                        Stop = seq( as.POSIXct( "2019-01-02 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 1,
                                    as.POSIXct( "2019-02-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 1,
                                    by = "1 days") )
If you do not want to summarise on a daily basis but by hour, minute, second, or year, just change the values (and step size) in the dt.dates data.table so that they match your periods, as in the sketch below.
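For example, a hypothetical hourly version of the same table (dt.hours is a made-up name) could look like this:
# sketch: hourly periods for January 2019 instead of daily ones
dt.hours <- data.table( id = paste0( "Hour", 1:(31 * 24) ),
                        Start = seq( as.POSIXct( "2019-01-01 00:00:00", tz = "UTC" ),
                                     by = "1 hour", length.out = 31 * 24 ),
                        Stop = seq( as.POSIXct( "2019-01-01 01:00:00", tz = "UTC" ) - 1,
                                    by = "1 hour", length.out = 31 * 24 ) )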
#set df as data.table
dt <- as.data.table( df )
#filter out any row where Stop is smaller than Start
dt <- dt[ Start < Stop, ]
#perform overlap join
#first set keys
setkey(dt, Start, Stop)
#then perform join
result <- foverlaps( dt.dates, dt, type = "within" )
#summarise
result[, .( Value = sum( Value , na.rm = TRUE ) ), by = .(Day = i.Start) ]
output
# Day Value
# 1: 2019-01-01 1500
# 2: 2019-01-02 1500
# 3: 2019-01-03 1500
# 4: 2019-01-04 351562
# 5: 2019-01-05 413050
# 6: 2019-01-06 400440
# 7: 2019-01-07 332880
# 8: 2019-01-08 332880
# 9: 2019-01-09 332880
# 10: 2019-01-10 64372
# 11: 2019-01-11 0
# 12: 2019-01-12 0
# 13: 2019-01-13 0
# 14: 2019-01-14 0
# 15: 2019-01-15 0
# 16: 2019-01-16 0
# 17: 2019-01-17 0
# 18: 2019-01-18 0
# 19: 2019-01-19 0
# 20: 2019-01-20 0
# 21: 2019-01-21 0
# 22: 2019-01-22 0
# 23: 2019-01-23 0
# 24: 2019-01-24 0
# 25: 2019-01-25 0
# 26: 2019-01-26 0
# 27: 2019-01-27 0
# 28: 2019-01-28 0
# 29: 2019-01-29 0
# 30: 2019-01-30 0
# 31: 2019-01-31 0
# Day Value
plot
#summarise for plot
result.plot <- result[, .( Value = sum( Value , na.rm = TRUE ) ), by = .(Day = i.Start) ]
library( ggplot2 )
ggplot( data = result.plot, aes( x = Day, y = Value ) ) + geom_col()
I was trying to see if it is possible to set the start and end parameters of the ts() function used with the forecast R package. The reason for this is to then use window() to subset a train and test set by date.
The time frame is from 2015-01-01 00:00:00 to 2017-12-31 23:00:00:
index esti
2015-01-01 00:00:00 1
2015-01-01 01:00:00 2
2015-01-01 02:00:00 3
2015-01-01 03:00:00 2
2015-01-01 04:00:00 5
2015-01-01 05:00:00 2
...
2017-12-31 18:00:00 0
2017-12-31 19:00:00 1
2017-12-31 20:00:00 0
2017-12-31 21:00:00 2
2017-12-31 22:00:00 0
2017-12-31 23:00:00 4
I used the following syntax to create the time series object:
tmp <- ts(dat, start = c(2015,1), frequency=24)
The returned object is this:
Time Series:
Start = c(2015, 1)
End = c(2015, 6)
Frequency = 24
It looks as if the ts object isn't correct here...
As far as I understand, the ts class does not work well with hourly input; it is recommended to work with the xts or zoo packages instead. See this SO post.
Try the following:
## Creating an entire hourly dataframe similar to the example dat
x <-
lubridate::parse_date_time(
c("2015-01-01 00:00:00", "2017-12-31 23:00:00"),
orders = "ymdHMS"
)
y <- seq(x[1], x[2], by = "hour")
dat <- data.frame(
index = y, esti = sample(seq(0, 10), size = length(y),
replace = TRUE)
)
## xts package
library(xts)
tmp <- xts(dat$esti, order.by = dat$index) # use only the value column: xts stores a matrix, so including the POSIXct column would coerce everything to character
## Example window-ing
window(tmp, end = y[100])
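And a sketch of the train/test split you were after (the cut-off date is made up):
cutoff <- as.POSIXct("2017-01-01 00:00:00", tz = "UTC")
train <- window(tmp, end = cutoff - 1)
test <- window(tmp, start = cutoff)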
Let me know if this does not work out.
I have a table in R like:
start duration
02/01/2012 20:00:00 5
05/01/2012 07:00:00 6
etc... etc...
I got to this by importing a table from Microsoft Excel that looked like this:
date time duration
2012/02/01 20:00:00 5
etc...
I then merged the date and time columns by running the following code:
d.f <- within(d.f, { start=format(as.POSIXct(paste(date, time)), "%m/%d/%Y %H:%M:%S") })
I want to create a third column called 'end', which will be calculated as the number of hours after the start time. I am pretty sure that my time is a POSIXct vector. I have seen how to manipulate one datetime object, but how can I do that for the entire column?
The expected result should look like:
start duration end
02/01/2012 20:00:00 5 02/02/2012 01:00:00
05/01/2012 07:00:00 6 05/01/2012 13:00:00
etc... etc... etc...
Using lubridate
> library(lubridate)
> df$start <- mdy_hms(df$start)
> df$end <- df$start + hours(df$duration)
> df
# start duration end
#1 2012-02-01 20:00:00 5 2012-02-02 01:00:00
#2 2012-05-01 07:00:00 6 2012-05-01 13:00:00
data
df <- structure(list(start = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"
), duration = 5:6), .Names = c("start", "duration"), class = "data.frame", row.names = c(NA,
-2L))
You can simply add duration*3600 to the start column of the data frame (3600 seconds per hour), e.g. with one date:
start = as.POSIXct("02/01/2012 20:00:00",format="%m/%d/%Y %H:%M:%S")
start
[1] "2012-02-01 20:00:00 CST"
start + 5*3600
[1] "2012-02-02 01:00:00 CST"
I have read in and formatted my data set as shown below.
library(xts)
#Read data from file
x <- read.csv("data.dat", header=F)
x[is.na(x)] <- 0 # replace NAs with zero
#Construct data frames
rawdata.h <- data.frame(x[,2],x[,3],x[,4],x[,5],x[,6],x[,7],x[,8]) #Hourly data
rawdata.15min <- data.frame(x[,10]) #15 min data
#Convert time index to proper format
index.h <- as.POSIXct(strptime(x[,1], "%d.%m.%Y %H:%M"))
index.15min <- as.POSIXct(strptime(x[,9], "%d.%m.%Y %H:%M"))
#Set column names
names(rawdata.h) <- c("spot","RKup", "RKdown","RKcon","anm", "pp.stat","prod.h")
names(rawdata.15min) <- c("prod.15min")
#Convert data frames to time series objects
data.htemp <- xts(rawdata.h,order.by=index.h)
data.15mintemp <- xts(rawdata.15min,order.by=index.15min)
#Select desired subset period
data.h <- data.htemp["2013"]
data.15min <- data.15mintemp["2013"]
I want to be able to combine hourly data from data.h$prod.h with the 15-minute-resolution data from data.15min$prod.15min corresponding to the same hour.
An example would be to take the average of the hourly value at time 2013-12-01 00:00-01:00 with the last 15 minute value in that same hour, i.e. the 15 minute value from time 2013-12-01 00:45-01:00. I'm looking for a flexible way to do this with an arbitrary hour.
Any suggestions?
Edit: Just to clarify further: I want to do something like this:
N <- NROW(data.h$prod.h)
for (i in 1:N) {
  prod.average[i] <- mean(data.h$prod.h[i] + # INSERT CODE THAT FINDS LAST 15 MIN IN HOUR i )
}
I found a solution to my problem by converting the 15-minute data into hourly data using the very useful .index* family of functions from the xts package, as shown below.
prod.new <- data.15min$prod.15min[.indexmin(data.15min$prod.15min) %in% c(45:59)]
This creates a new time series with only the values occurring in the 45-59 minute interval of each hour.
For those curious my data looked like this:
Original hourly series:
> data.h$prod.h[1:4]
2013-01-01 00:00:00 19.744
2013-01-01 01:00:00 27.866
2013-01-01 02:00:00 26.227
2013-01-01 03:00:00 16.013
Original 15 minute series:
> data.15min$prod.15min[1:4]
2013-09-30 00:00:00 16.4251
2013-09-30 00:15:00 18.4495
2013-09-30 00:30:00 7.2125
2013-09-30 00:45:00 12.1913
2013-09-30 01:00:00 12.4606
2013-09-30 01:15:00 12.7299
2013-09-30 01:30:00 12.9992
2013-09-30 01:45:00 26.7522
New series with only the last 15 minutes in each hour:
> prod.new[1:4]
2013-09-30 00:45:00 12.1913
2013-09-30 01:45:00 26.7522
2013-09-30 02:45:00 5.0332
2013-09-30 03:45:00 2.6974
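To finish the averaging step from the question, one hedged sketch shifts the 45-minute stamps back to the top of the hour so the two series share an index, then averages them (this assumes both series cover the same hours; prod.last is a made-up name):
prod.last <- prod.new
index(prod.last) <- index(prod.last) - 45 * 60 # 00:45 becomes 00:00, etc.
# xts arithmetic aligns on common timestamps, so this averages matching hours
prod.average <- (data.h$prod.h + prod.last) / 2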
Short answer
library(dplyr)
df %>%
  group_by(t = cut(time, "30 min")) %>%
  summarise(v = mean(value))
Long answer
Since you want to compress the 15-minute time series to a coarser resolution (30 minutes), you should use the dplyr package or any other package that implements the "group by" concept.
For instance:
s <- seq(as.POSIXct("2017-01-01"), as.POSIXct("2017-01-02"), "15 min")
df <- data.frame(time = s, value = 1:97)
df is a time series with 97 rows and two columns.
head(df)
time value
1 2017-01-01 00:00:00 1
2 2017-01-01 00:15:00 2
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 4
5 2017-01-01 01:00:00 5
6 2017-01-01 01:15:00 6
The cut.POSIXt, group_by and summarise functions do the work:
df %>%
  group_by(t = cut(time, "30 min")) %>%
  summarise(v = mean(value))
t v
1 2017-01-01 00:00:00 1.5
2 2017-01-01 00:30:00 3.5
3 2017-01-01 01:00:00 5.5
4 2017-01-01 01:30:00 7.5
5 2017-01-01 02:00:00 9.5
6 2017-01-01 02:30:00 11.5
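If you prefer base R, the same grouping is a one-liner with aggregate(); a sketch on the same df:
# cut() buckets the timestamps into 30-minute bins, aggregate() averages per bin
aggregate(value ~ cut(time, "30 min"), data = df, FUN = mean)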
A more robust way is to convert the 15-minute values into hourly values by taking their average, then do whatever operation you want.
### 15 Minutes Data
min15 <- structure(list(V1 = structure(1:8, .Label = c("2013-01-01 00:00:00",
"2013-01-01 00:15:00", "2013-01-01 00:30:00", "2013-01-01 00:45:00",
"2013-01-01 01:00:00", "2013-01-01 01:15:00", "2013-01-01 01:30:00",
"2013-01-01 01:45:00"), class = "factor"), V2 = c(16.4251, 18.4495,
7.2125, 12.1913, 12.4606, 12.7299, 12.9992, 26.7522)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -8L))
min15
### Hourly Data
hourly <- structure(list(V1 = structure(1:4, .Label = c("2013-01-01 00:00:00",
"2013-01-01 01:00:00", "2013-01-01 02:00:00", "2013-01-01 03:00:00"
), class = "factor"), V2 = c(19.744, 27.866, 26.227, 16.013)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -4L))
hourly
### Convert the 15-min data into hourly data by taking the average of 4 values
min15$V1 <- as.POSIXct(min15$V1) # factor -> POSIXct; origin is only needed for numeric input
min15 <- aggregate(. ~ cut(min15$V1, "60 min"), min15[setdiff(names(min15), "V1")], mean)
min15
names(min15) <- c("time","min15")
names(hourly) <- c("time","hourly")
### merge the corresponding values
combined <- merge(hourly,min15)
### average of hourly and 15min values
rowMeans(combined[,2:3])
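To keep the averages alongside the merged data rather than as a bare vector, a small follow-up (avg is a hypothetical column name):
combined$avg <- rowMeans(combined[, c("hourly", "min15")])
combined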