Combining time series data with different resolution in R

I have read in and formatted my data set as shown below.
library(xts)
#Read data from file
x <- read.csv("data.dat", header=F)
x[is.na(x)] <- c(0) #If empty fill in zero
#Construct data frames
rawdata.h <- data.frame(x[,2],x[,3],x[,4],x[,5],x[,6],x[,7],x[,8]) #Hourly data
rawdata.15min <- data.frame(x[,10]) #15 min data
#Convert time index to proper format
index.h <- as.POSIXct(strptime(x[,1], "%d.%m.%Y %H:%M"))
index.15min <- as.POSIXct(strptime(x[,9], "%d.%m.%Y %H:%M"))
#Set column names
names(rawdata.h) <- c("spot","RKup", "RKdown","RKcon","anm", "pp.stat","prod.h")
names(rawdata.15min) <- c("prod.15min")
#Convert data frames to time series objects
data.htemp <- xts(rawdata.h,order.by=index.h)
data.15mintemp <- xts(rawdata.15min,order.by=index.15min)
#Select desired subset period
data.h <- data.htemp["2013"]
data.15min <- data.15mintemp["2013"]
I want to combine the hourly data in data.h$prod.h with the 15-minute data in data.15min$prod.15min that falls within the same hour.
An example would be to take the average of the hourly value at time 2013-12-01 00:00-01:00 with the last 15 minute value in that same hour, i.e. the 15 minute value from time 2013-12-01 00:45-01:00. I'm looking for a flexible way to do this with an arbitrary hour.
Any suggestions?
Edit: Just to clarify further: I want to do something like this:
N <- NROW(data.h$prod.h)
for (i in 1:N){
prod.average[i] <- mean(data.h$prod.h[i] + #INSERT CODE THAT FINDS LAST 15 MIN IN HOUR i )
}

I found a solution to my problem by converting the 15-minute data into hourly data using the very useful .index* functions from the xts package, as shown below.
prod.new <- data.15min$prod.15min[.indexmin(data.15min$prod.15min) %in% c(45:59)]
This creates a new time series containing only the values that occur in the 45-59 minute interval of each hour.
For those curious my data looked like this:
Original hourly series:
> data.h$prod.h[1:4]
2013-01-01 00:00:00 19.744
2013-01-01 01:00:00 27.866
2013-01-01 02:00:00 26.227
2013-01-01 03:00:00 16.013
Original 15 minute series:
> data.15min$prod.15min[1:8]
2013-09-30 00:00:00 16.4251
2013-09-30 00:15:00 18.4495
2013-09-30 00:30:00 7.2125
2013-09-30 00:45:00 12.1913
2013-09-30 01:00:00 12.4606
2013-09-30 01:15:00 12.7299
2013-09-30 01:30:00 12.9992
2013-09-30 01:45:00 26.7522
New series with only the last 15 minutes in each hour:
> prod.new[1:4]
2013-09-30 00:45:00 12.1913
2013-09-30 01:45:00 26.7522
2013-09-30 02:45:00 5.0332
2013-09-30 03:45:00 2.6974
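To finish the calculation from the question, the filtered readings still have to be lined up with the hourly series. The following is a minimal base R sketch of that step, not the poster's own code: the data frames hourly and quarter and their values are made up for illustration, and plain data frames are used instead of xts objects so the example runs without extra packages.

```r
# Toy data standing in for the two series (values are invented).
hourly <- data.frame(
  time   = as.POSIXct(c("2013-09-30 00:00", "2013-09-30 01:00"), tz = "UTC"),
  prod.h = c(20, 30)
)
quarter <- data.frame(
  time       = seq(as.POSIXct("2013-09-30 00:00", tz = "UTC"),
                   by = "15 min", length.out = 8),
  prod.15min = c(16, 18, 7, 12, 12, 13, 13, 26)
)

# Keep only the last quarter-hour reading of each hour (minute 45).
last_q <- quarter[format(quarter$time, "%M") == "45", ]

# Shift each reading back 45 minutes so it is keyed by the start of its hour.
last_q$time <- last_q$time - 45 * 60

# Join on the hour and average the two values row by row.
combined <- merge(hourly, last_q, by = "time")
combined$prod.average <- rowMeans(combined[, c("prod.h", "prod.15min")])
combined
```

Keying both series by the start of the hour avoids an explicit loop over hours, and any hour missing from either series simply drops out of the merge.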

Short answer
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
Long answer
Since you want to aggregate the 15-minute time series to a coarser resolution (30 minutes), you can use the dplyr package or any other package that implements the "group by" concept.
For instance:
library(dplyr)
s = seq(as.POSIXct("2017-01-01"), as.POSIXct("2017-01-02"), "15 min")
df = data.frame(time = s, value=1:97)
df is a time series with 97 rows and two columns.
head(df)
time value
1 2017-01-01 00:00:00 1
2 2017-01-01 00:15:00 2
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 4
5 2017-01-01 01:00:00 5
6 2017-01-01 01:15:00 6
The cut.POSIXt, group_by and summarise functions do the work:
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
t v
1 2017-01-01 00:00:00 1.5
2 2017-01-01 00:30:00 3.5
3 2017-01-01 01:00:00 5.5
4 2017-01-01 01:30:00 7.5
5 2017-01-01 02:00:00 9.5
6 2017-01-01 02:30:00 11.5
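Note that cut() returns a factor, so the t column above is no longer a date-time. If POSIXct bin labels are needed, one possible alternative is to floor the underlying seconds to a multiple of the bin width. This is only a sketch in base R, assuming UTC timestamps; bin_seconds and out are names introduced here for illustration.

```r
# Same toy series as above, with an explicit time zone.
s  <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"), by = "15 min")
df <- data.frame(time = s, value = 1:97)

# Width of one bin in seconds (30 minutes).
bin_seconds <- 30 * 60

# Floor the epoch seconds to the start of the enclosing 30-minute bin.
df$t <- as.POSIXct(floor(as.numeric(df$time) / bin_seconds) * bin_seconds,
                   origin = "1970-01-01", tz = "UTC")

# Average the values within each bin; the grouping column stays POSIXct.
out <- aggregate(value ~ t, data = df, FUN = mean)
head(out)
```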

A more robust way is to convert the 15-minute values into hourly values by averaging them, and then perform whatever operation you want.
### 15 Minutes Data
min15 <- structure(list(V1 = structure(1:8, .Label = c("2013-01-01 00:00:00",
"2013-01-01 00:15:00", "2013-01-01 00:30:00", "2013-01-01 00:45:00",
"2013-01-01 01:00:00", "2013-01-01 01:15:00", "2013-01-01 01:30:00",
"2013-01-01 01:45:00"), class = "factor"), V2 = c(16.4251, 18.4495,
7.2125, 12.1913, 12.4606, 12.7299, 12.9992, 26.7522)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -8L))
min15
### Hourly Data
hourly <- structure(list(V1 = structure(1:4, .Label = c("2013-01-01 00:00:00",
"2013-01-01 01:00:00", "2013-01-01 02:00:00", "2013-01-01 03:00:00"
), class = "factor"), V2 = c(19.744, 27.866, 26.227, 16.013)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -4L))
hourly
### Convert 15min data into hourly data by taking average of 4 values
min15$V1 <- as.POSIXct(min15$V1,origin="1970-01-01 0:0:0")
min15 <- aggregate(. ~ cut(min15$V1,"60 min"),min15[setdiff(names(min15), "V1")],mean)
min15
names(min15) <- c("time","min15")
names(hourly) <- c("time","hourly")
### merge the corresponding values
combined <- merge(hourly,min15)
### average of hourly and 15min values
rowMeans(combined[,2:3])

Related

Converting mixed times into 24 hour format

I currently have a dataset with multiple different time formats (AM/PM, numeric, 24hr format) and I'm trying to convert them all into 24hr format. Is there a way to standardize mixed-format columns?
Current sample data
time
12:30 PM
03:00 PM
0.961469907
0.913622685
0.911423611
09:10 AM
18:00
Desired output
new_time
12:30:00
15:00:00
23:04:31
21:55:37
21:52:27
09:10:00
18:00:00
I know how to do them all individually (an example is below), but is there a way to do it all in one go? I have a large amount of data and can't go line by line.
#for numeric time
> library(chron)
> x <- c(0.961469907, 0.913622685, 0.911423611)
> times(x)
[1] 23:04:31 21:55:37 21:52:27
The decimal times are a pain, but we can parse them first, feed them back as character, then use lubridate's parse_date_time to handle everything at once:
library(tidyverse)
library(chron)
# Create reproducible dataframe
df <-
tibble::tibble(
time = c(
"12:30 PM",
"03:00 PM",
0.961469907,
0.913622685,
0.911423611,
"09:10 AM",
"18:00")
)
# Parse times
df <-
df %>%
dplyr::mutate(
time_chron = chron::times(as.numeric(time)),
time_chron = if_else(
is.na(time_chron),
time,
as.character(time_chron)),
time_clean = lubridate::parse_date_time(
x = time_chron,
orders = c(
"%I:%M %p", # HH:MM AM/PM 12 hour format
"%H:%M:%S", # HH:MM:SS 24 hour format
"%H:%M")), # HH:MM 24 hour format
time_clean = hms::as_hms(time_clean)) %>%
select(-time_chron)
Which gives us
> df
# A tibble: 7 × 2
time time_clean
<chr> <time>
1 12:30 PM 12:30:00
2 03:00 PM 15:00:00
3 0.961469907 23:04:31
4 0.913622685 21:55:37
5 0.911423611 21:52:27
6 09:10 AM 09:10:00
7 18:00 18:00:00
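The decimal entries are fractions of a 24-hour day, which is how chron::times() reads them in the answer above. As a rough illustration of the arithmetic only, and not a replacement for that answer, the conversion can be sketched in base R by multiplying by the 86400 seconds in a day. The helper name frac_to_hms is made up for this example.

```r
# Convert a fraction of a day (0 <= frac < 1) to an "HH:MM:SS" string.
frac_to_hms <- function(frac) {
  secs <- as.integer(round(frac * 86400))   # whole seconds since midnight
  sprintf("%02d:%02d:%02d",
          secs %/% 3600L,                   # hours
          (secs %% 3600L) %/% 60L,          # minutes
          secs %% 60L)                      # seconds
}

frac_to_hms(c(0.961469907, 0.913622685, 0.911423611))
# [1] "23:04:31" "21:55:37" "21:52:27"
```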

Ozone time series

I’m working with a time series of continuous measurements of ozone concentration in ambient air over a 4-month period. Measurements are taken every 5 min 14 s, giving approximately 40000 data points.
I started processing my data in R, but ran into some problems due to my lack of experience.
My data.frame has Date as character and ozone concentration as numeric values.
Date O3_ppb
2018-05-26 17:55:58 UTC 33.95161
2018-05-26 18:01:12 UTC 35.12605
2018-05-26 18:06:28 UTC 36.03172
2018-05-26 18:11:42 UTC 36.81590
2018-05-26 18:16:57 UTC 37.11235
2018-05-26 18:22:12 UTC 37.26945
I wish to illustrate how the ozone concentration develops over the course of 24 h, based on one month of data. In other words, I would like a monthly average for every 5 min of the day.
My thinking was that I somehow need to sort my data into 5-minute groups across the 24 h, for example 00:00:00, 00:05:00, 00:10:00 …
But since there is drift in the measurement times, a measurement taken at 00:05:00 one day would be taken at 00:06:20 the next, and so on. And since the sensor reboots once in a while, the number of observations per day fluctuates a bit as well.
My question:
Is there a function or loop that can sort my data into 5-minute intervals and also take the drift into account, so that measurements that fall between, for example, 00:02:30 and 00:07:30 are sorted into a group called 00:05:00, and those between 00:07:30 and 00:12:30 into a 00:10:00 group?
Sorry if this is completely unintelligible, but I’m new to R and to programming in general. I really hope that someone can help me, so I can kick-start the project.
Here is a data.table approach using an overlap-join (foverlaps())
library( data.table )
dt <- fread(' Date O3_ppb
"2018-05-26 17:55:58" 33.95161
"2018-05-26 18:01:12" 35.12605
"2018-05-26 18:06:28" 36.03172
"2018-05-26 18:11:42" 36.81590
"2018-05-26 18:16:57" 37.11235
"2018-05-26 18:22:12" 37.26945', header = TRUE)
#set to posix
dt[, Date := as.POSIXct( Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC") ]
#create dummy variables to join on later
dt[, `:=`( Start = Date, Stop = Date ) ]
#create data.table with periods you wish to summarise on later
#notice the +/- 150 (=00:02:30) to set a 5 minute 'bandwidth' around the period.
dt.period <- data.table( period = seq( as.POSIXct( "2018-05-26 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
as.POSIXct( "2018-05-27 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
by = "5 mins"),
Start = seq( as.POSIXct( "2018-05-26 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 150,
as.POSIXct( "2018-05-27 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 150 ,
by = "5 mins"),
Stop = seq( as.POSIXct( "2018-05-26 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) + 150,
as.POSIXct( "2018-05-27 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) + 150,
by = "5 mins") )
#perform overlap join
#first set keys
setkey(dt.period, Start, Stop)
#then perform join
result <- foverlaps( dt, dt.period, type = "within", nomatch = NA )
#summarise
result[, .( O3_ppb_avg = mean( O3_ppb, na.rm = TRUE ) ), by = .(period) ]
output
# period O3_ppb_avg
# 1: 2018-05-26 17:55:00 33.95161
# 2: 2018-05-26 18:00:00 35.12605
# 3: 2018-05-26 18:05:00 36.03172
# 4: 2018-05-26 18:10:00 36.81590
# 5: 2018-05-26 18:15:00 37.11235
# 6: 2018-05-26 18:20:00 37.26945
Here's an approach using lubridate that just rounds to the closest 5 min, regardless of the time.
# Load data
library(tidyverse); library(lubridate)
df <- read.table(header = T, stringsAsFactors = F,
text = "Date O3_ppb
'2018-05-26 17:55:58 UTC' 33.95161
'2018-05-26 18:01:12 UTC' 35.12605
'2018-05-26 18:06:28 UTC' 36.03172
'2018-05-26 18:11:42 UTC' 36.81590
'2018-05-26 18:16:57 UTC' 37.11235
'2018-05-26 18:22:12 UTC' 37.26945") %>%
mutate(Date = ymd_hms(Date))
df2 <- df %>%
# By adding 2.5 min = 150 sec and rounding down, we get closest 5 min
mutate(Date_rnd = floor_date(Date + 150, "5 minutes"),
# One option is to group by decimal time of day
group = hour(Date_rnd) + minute(Date_rnd)/60,
# ...or could convert that to a time on a single day, in this case today
group_as_datetime = floor_date(Sys.time(), "1 day") + group*60*60)
Output
> df2
# Date O3_ppb Date_rnd group group_as_datetime
#1 2018-05-26 17:55:58 33.95161 2018-05-26 17:55:00 17.91667 2019-01-05 17:55:00
#2 2018-05-26 18:01:12 35.12605 2018-05-26 18:00:00 18.00000 2019-01-05 18:00:00
#3 2018-05-26 18:06:28 36.03172 2018-05-26 18:05:00 18.08333 2019-01-05 18:05:00
#4 2018-05-26 18:11:42 36.81590 2018-05-26 18:10:00 18.16667 2019-01-05 18:10:00
#5 2018-05-26 18:16:57 37.11235 2018-05-26 18:15:00 18.25000 2019-01-05 18:15:00
#6 2018-05-26 18:22:12 37.26945 2018-05-26 18:20:00 18.33333 2019-01-05 18:20:00
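Both answers bin the timestamps of a single day. To get the average daily profile the question asks about, readings from different days have to share a slot keyed by time of day alone. The following base R sketch shows one way this might be done, assuming UTC timestamps; the data frame ozone, its values, and the names bin, secs and slot are invented for the example.

```r
# Made-up readings on two different days, drifting around the 5-minute marks.
ozone <- data.frame(
  Date = as.POSIXct(c("2018-05-26 00:04:10", "2018-05-26 00:09:24",
                      "2018-05-27 00:06:20", "2018-05-27 00:11:34"),
                    tz = "UTC"),
  O3_ppb = c(30, 32, 34, 36)
)

bin <- 5 * 60                                  # slot width in seconds

# Seconds since midnight (UTC), ignoring the calendar date.
secs <- as.numeric(ozone$Date) %% 86400

# Add half a slot and floor, which rounds to the nearest 5-minute mark;
# the final modulo wraps 24:00 back to 00:00.
slot <- (floor((secs + bin / 2) / bin) * bin) %% 86400

# Label each slot as "HH:MM" so that all days share the same groups.
ozone$slot <- sprintf("%02d:%02d",
                      as.integer(slot %/% 3600),
                      as.integer((slot %% 3600) %/% 60))

# Mean concentration per time-of-day slot across all days.
profile <- aggregate(O3_ppb ~ slot, data = ozone, FUN = mean)
profile
```

Because the grouping key contains no date, days with missing readings contribute fewer values to a slot but do not break the average.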

Interpolation over time

In a dataframe, I have wind speed data measured four times a day, at 00:00, 06:00, 12:00 and 18:00. To combine these with other data, I need to fill in the times in between at a resolution of 15 minutes. I would like to fill the gaps by simple interpolation.
The following example produces two corresponding sample dataframes, df1 and df2, which need to be merged. In the resulting merged dataframe, the gaps between the 6-hourly values (the rows where windspeed is NA) need to be filled by simple interpolation. My problem is how to merge both and carry out the interpolation between the given values.
First dataframe
Creation:
# create a corresponding sample data frame
df1 <- data.frame(
date = seq.POSIXt(
from = ISOdatetime(2015,10,1,0,0,0, tz = "GMT"),
to = ISOdatetime(2015,10,14,23,59,0, tz= "GMT"),
by = "6 hour"
),
windspeed = abs(rnorm(14*4, 10, 4)) # abs() because windspeed should be positive
)
Resulting dataframe:
> # show the head of the dataframe
> head(df1)
date windspeed
1 2015-10-01 00:00:00 17.928217
2 2015-10-01 06:00:00 11.306025
3 2015-10-01 12:00:00 6.648131
4 2015-10-01 18:00:00 10.320146
5 2015-10-02 00:00:00 2.138559
6 2015-10-02 06:00:00 9.076344
Second dataframe
Creation:
# create a 2nd corresponding sample data frame
df2 <- data.frame(
date = seq.POSIXt(
from = ISOdatetime(2015,10,1,0,0,0, tz = "GMT"),
to = ISOdatetime(2015,10,14,23,59,0, tz= "GMT"),
by = "15 min"
),
var = abs(rnorm(14*24*4, 300, 100))
)
Resulting dataframe:
> # show the head of the 2nd dataframe
> head(df2)
date var
1 2015-10-01 00:00:00 198.2657
2 2015-10-01 00:15:00 472.9041
3 2015-10-01 00:30:00 605.8776
4 2015-10-01 00:45:00 429.0949
5 2015-10-01 01:00:00 400.2390
6 2015-10-01 01:15:00 317.1503
Here is one solution.
First merge the two data frames with all = TRUE to keep all rows:
df3 <- merge(df1, df2, all = TRUE)
Then use approx for the interpolation:
df3$windspeed <- approx(x = df1$date, y = df1$windspeed, xout = df3$date)$y
The only problem is that the last values will be NA unless the series ends on a windspeed observation, because approx does not extrapolate by default; everything in between will be filled.
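A small worked example may make the behaviour of approx clearer, including one way to deal with the trailing NA values. This is a sketch on made-up numbers, not the poster's data: coarse, fine_dates and interp are names introduced here, and rule = 2 (which repeats the nearest observed value outside the data range) is just one possible choice.

```r
# Three 6-hourly wind speed observations (invented values).
coarse <- data.frame(
  date      = seq(ISOdatetime(2015, 10, 1, 0, 0, 0, tz = "GMT"),
                  by = "6 hour", length.out = 3),
  windspeed = c(10, 16, 4)
)

# 15-minute grid that runs one hour past the last observation.
fine_dates <- seq(ISOdatetime(2015, 10, 1, 0, 0, 0, tz = "GMT"),
                  ISOdatetime(2015, 10, 1, 13, 0, 0, tz = "GMT"),
                  by = "15 min")

# Linear interpolation on the numeric timestamps; rule = 2 holds the
# last observed value for points beyond 12:00 instead of returning NA.
interp <- approx(x    = as.numeric(coarse$date),
                 y    = coarse$windspeed,
                 xout = as.numeric(fine_dates),
                 rule = 2)$y

data.frame(date = fine_dates, windspeed = interp)[c(1, 13, 37, 53), ]
```

Halfway between 00:00 (10) and 06:00 (16) the interpolated value is 13, and halfway between 06:00 (16) and 12:00 (4) it is 10; the 13:00 row repeats the last observation, 4.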

dealing with the datetime value in R

I have a large data.table with a single column, date, but str() reports it as chr.
date
2015-07-01 0:15:00
2015-07-01 0:30:00
2015-07-01 0:45:00
2015-07-01 0:60:00
2015-07-01 1:15:00
2015-07-01 1:30:00
2015-07-01 1:45:00
2015-07-01 1:60:00
What I want to do is:
make them conform to a standard format like 2015-07-01 00:15:00
correct the time, for example 2015-07-01 1:60:00 -> 2015-07-01 02:00:00
For the first one, I tried using as.POSIXct() to reset the format, which should be correct, but values like 2015-07-01 1:60:00 become NA after the conversion.
Does anybody have any ideas?
Here is some code to generate test data:
dd <- data.table(date = c("2015-07-01 0:15:00", "2015-07-01 0:30:00",
"2015-07-01 0:45:00","2015-07-01 0:60:00", "2015-07-01 1:15:00",
"2015-07-01 1:30:00","2015-07-01 1:45:00","2015-07-01 1:60:00","2015-07-01 2:15:00"))
Note: this table covers just one day, and its last value is
2015-07-01 23:60:00
If anything is unclear, feel free to let me know.
Thanks!
In base R you could try this:
df1$date <- gsub(":60:",":59:",df1$date, fixed = TRUE)
df1$date <- as.POSIXct(df1$date)
the59s <- grepl(":59:",df1$date)
df1$date[the59s] <- df1$date[the59s] + 60
#> df1
# date
#1 2015-07-01 00:15:00
#2 2015-07-01 00:30:00
#3 2015-07-01 00:45:00
#4 2015-07-01 01:00:00
#5 2015-07-01 01:15:00
#6 2015-07-01 01:30:00
#7 2015-07-01 01:45:00
#8 2015-07-01 02:00:00
#9 2015-07-01 02:15:00
The idea is to let POSIXct perform the conversion to the next hour / day / month / ... triggered by a "60 minutes" value. For this we first identify those entries containing :60: and replace that part with :59:. Then the column is converted into a POSIXct object. Afterwards we find all those entries containing a ":59:" and add 60 (seconds), thereby converting the time/date to the intended format.
In the case described by the OP the data contains only the quarter-hour values 15, 30, 45 and 60. A more general situation may include genuine 59-minute values that should not be converted to the next hour. It would then be better to store the relevant row indices before performing the conversion:
the60s <- grepl(":60:", df1$date)
df1$date <- gsub(":60:",":59:",df1$date, fixed = TRUE)
df1$date <- as.POSIXct(df1$date)
df1$date[the60s] <- df1$date[the60s] + 60
data:
df1 <- structure(list(date = structure(1:9, .Label = c("2015-07-01 0:15:00",
"2015-07-01 0:30:00", "2015-07-01 0:45:00", "2015-07-01 0:60:00",
"2015-07-01 1:15:00", "2015-07-01 1:30:00", "2015-07-01 1:45:00",
"2015-07-01 1:60:00", "2015-07-01 2:15:00"), class = "factor")),
.Names = "date", row.names = c(NA, -9L), class = "data.frame")
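A different route, shown here only as a sketch and not as part of the answer above, is to avoid the :59: substitution altogether: parse the date part on its own and add the hours, minutes and seconds as a plain number of seconds, so that a minute value of 60 rolls over naturally. The vector x and the names parts, day, offset and fixed are made up for the example, and UTC is assumed.

```r
x <- c("2015-07-01 0:45:00", "2015-07-01 0:60:00", "2015-07-01 23:60:00")

# Split each string into date, hour, minute and second pieces.
parts <- strsplit(x, "[ :]")

# Midnight of each date as POSIXct.
day <- as.POSIXct(vapply(parts, `[`, character(1), 1),
                  format = "%Y-%m-%d", tz = "UTC")

# Time of day in seconds; 60 minutes simply counts as 3600 seconds.
offset <- vapply(parts,
                 function(p) sum(as.numeric(p[2:4]) * c(3600, 60, 1)),
                 numeric(1))

fixed <- day + offset
format(fixed, "%Y-%m-%d %H:%M:%S")
# [1] "2015-07-01 00:45:00" "2015-07-01 01:00:00" "2015-07-02 00:00:00"
```

With this approach a value such as 23:60:00 lands on midnight of the following day without any special casing.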

R add specific (different) amounts of times to entire column

I have a table in R like:
start duration
02/01/2012 20:00:00 5
05/01/2012 07:00:00 6
etc... etc...
I got to this by importing a table from Microsoft Excel that looked like this:
date time duration
2012/02/01 20:00:00 5
etc...
I then merged the date and time columns by running the following code:
d.f <- within(d.f, { start=format(as.POSIXct(paste(date, time)), "%m/%d/%Y %H:%M:%S") })
I want to create a third column called 'end', which will be calculated as the number of hours after the start time. I am pretty sure that my time is a POSIXct vector. I have seen how to manipulate one datetime object, but how can I do that for the entire column?
The expected result should look like:
start duration end
02/01/2012 20:00:00 5 02/02/2012 01:00:00
05/01/2012 07:00:00 6 05/01/2012 13:00:00
etc... etc... etc...
Using lubridate
> library(lubridate)
> df$start <- mdy_hms(df$start)
> df$end <- df$start + hours(df$duration)
> df
# start duration end
#1 2012-02-01 20:00:00 5 2012-02-02 01:00:00
#2 2012-05-01 07:00:00 6 2012-05-01 13:00:00
data
df <- structure(list(start = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"
), duration = 5:6), .Names = c("start", "duration"), class = "data.frame", row.names = c(NA,
-2L))
You can simply add duration*3600 to the start column of the data frame. E.g. with one date:
start = as.POSIXct("02/01/2012 20:00:00",format="%m/%d/%Y %H:%M:%S")
start
[1] "2012-02-01 20:00:00 CST"
start + 5*3600
[1] "2012-02-02 01:00:00 CST"
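Since arithmetic on POSIXct is vectorised, the same idea extends to the whole column at once. A minimal sketch, assuming the start column is stored as text in the %m/%d/%Y %H:%M:%S layout from the question and using UTC to keep the output reproducible (the variable start is introduced here):

```r
d.f <- data.frame(
  start    = c("02/01/2012 20:00:00", "05/01/2012 07:00:00"),
  duration = c(5, 6)
)

# Parse the text column into date-times.
start <- as.POSIXct(d.f$start, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Add each row's duration (hours converted to seconds) to its own start
# time, then format the result in the same layout as the start column.
d.f$end <- format(start + d.f$duration * 3600, "%m/%d/%Y %H:%M:%S")
d.f
#                 start duration                 end
# 1 02/01/2012 20:00:00        5 02/02/2012 01:00:00
# 2 05/01/2012 07:00:00        6 05/01/2012 13:00:00
```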
