I’m working with a time series of continuous ozone-concentration measurements in ambient air over a 4-month period. Measurements are taken every 5 min 14 s, giving approximately 40,000 data points.
I started processing my data in R, but ran into some problems due to my lack of skills.
My data.frame has Date as characters and ozone concentration as numeric values.
Date O3_ppb
2018-05-26 17:55:58 UTC 33.95161
2018-05-26 18:01:12 UTC 35.12605
2018-05-26 18:06:28 UTC 36.03172
2018-05-26 18:11:42 UTC 36.81590
2018-05-26 18:16:57 UTC 37.11235
2018-05-26 18:22:12 UTC 37.26945
I wish to illustrate the daily development of the ozone concentration over the course of 24 h, based on one month of data. That is, I would like a monthly average for every 5-minute slot of the day.
My thinking was that I somehow need to sort my data into 5-minute groups over 24 h, e.g. 00:00:00, 00:05:00, 00:10:00, …
But because of drift in the measurement timing, a measurement taken at 00:05:00 one day is taken at 00:06:20 the next, and so on. And since the sensor reboots once in a while, the number of observations per day fluctuates a bit as well.
My question:
Is there a function or loop that can sort my data into 5-minute intervals while taking the drift into account, so that measurements falling between, for example, 00:02:30 and 00:07:30 are sorted into a group called 00:05:00, and those between 00:07:30 and 00:12:30 into a 00:10:00 group?
Sorry if this is completely unintelligible, but I’m new to R and to programming in general. I really hope someone can help me so I can kick-start the project.
Here is a data.table approach using an overlap join (foverlaps()):
library( data.table )
dt <- fread(' Date O3_ppb
"2018-05-26 17:55:58" 33.95161
"2018-05-26 18:01:12" 35.12605
"2018-05-26 18:06:28" 36.03172
"2018-05-26 18:11:42" 36.81590
"2018-05-26 18:16:57" 37.11235
"2018-05-26 18:22:12" 37.26945', header = TRUE)
#set to posix
dt[, Date := as.POSIXct( Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC") ]
#create dummy variables to join on later
dt[, `:=`( Start = Date, Stop = Date ) ]
#create data.table with periods you wish to summarise on later
#notice the +/- 150 (=00:02:30) to set a 5 minute 'bandwidth' around the period.
dt.period <- data.table( period = seq( as.POSIXct( "2018-05-26 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
as.POSIXct( "2018-05-27 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ),
by = "5 mins"),
Start = seq( as.POSIXct( "2018-05-26 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 150,
as.POSIXct( "2018-05-27 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) - 150 ,
by = "5 mins"),
Stop = seq( as.POSIXct( "2018-05-26 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) + 150,
as.POSIXct( "2018-05-27 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "UTC" ) + 150,
by = "5 mins") )
#perform overlap join
#first set keys
setkey(dt.period, Start, Stop)
#then perform join
result <- foverlaps( dt, dt.period, type = "within", nomatch = NA )
#summarise
result[, .( O3_ppb_avg = mean( O3_ppb, na.rm = TRUE ) ), by = .(period) ]
Output:
# period O3_ppb_avg
# 1: 2018-05-26 17:55:00 33.95161
# 2: 2018-05-26 18:00:00 35.12605
# 3: 2018-05-26 18:05:00 36.03172
# 4: 2018-05-26 18:10:00 36.81590
# 5: 2018-05-26 18:15:00 37.11235
# 6: 2018-05-26 18:20:00 37.26945
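The question ultimately asks for one 24-hour profile per month. A possible final step (a sketch, assuming dt.period has been extended to cover the whole month of data) is to group the joined rows by clock time rather than by absolute period:
#average per 5-minute slot of the day, pooled over all days in the month
#(tod is a hypothetical helper column holding just the clock time)
result[, .( O3_ppb_avg = mean( O3_ppb, na.rm = TRUE ) ),
       by = .( tod = format( period, "%H:%M:%S" ) ) ]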
Here's an approach using lubridate that just rounds to the closest 5 min, regardless of the time.
# Load data
library(tidyverse); library(lubridate)
df <- read.table(header = T, stringsAsFactors = F,
text = "Date O3_ppb
'2018-05-26 17:55:58 UTC' 33.95161
'2018-05-26 18:01:12 UTC' 35.12605
'2018-05-26 18:06:28 UTC' 36.03172
'2018-05-26 18:11:42 UTC' 36.81590
'2018-05-26 18:16:57 UTC' 37.11235
'2018-05-26 18:22:12 UTC' 37.26945") %>%
mutate(Date = ymd_hms(Date))
df2 <- df %>%
# By adding 2.5 min = 150 sec and rounding down, we get closest 5 min
mutate(Date_rnd = floor_date(Date + 150, "5 minutes"),
# One option is to group by decimal time of day
group = hour(Date_rnd) + minute(Date_rnd)/60,
# ...or could convert that to a time on a single day, in this case today
group_as_datetime = floor_date(Sys.time(), "1 day") + group*60*60)
Output
> df2
# Date O3_ppb Date_rnd group group_as_datetime
#1 2018-05-26 17:55:58 33.95161 2018-05-26 17:55:00 17.91667 2019-01-05 17:55:00
#2 2018-05-26 18:01:12 35.12605 2018-05-26 18:00:00 18.00000 2019-01-05 18:00:00
#3 2018-05-26 18:06:28 36.03172 2018-05-26 18:05:00 18.08333 2019-01-05 18:05:00
#4 2018-05-26 18:11:42 36.81590 2018-05-26 18:10:00 18.16667 2019-01-05 18:10:00
#5 2018-05-26 18:16:57 37.11235 2018-05-26 18:15:00 18.25000 2019-01-05 18:15:00
#6 2018-05-26 18:22:12 37.26945 2018-05-26 18:20:00 18.33333 2019-01-05 18:20:00
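From here, the monthly 24-hour profile is one group_by/summarise away. A sketch (assuming df2 holds a full month of observations), averaging within each rounded time of day:
df3 <- df2 %>%
  group_by(group_as_datetime) %>%
  summarise(O3_ppb_avg = mean(O3_ppb, na.rm = TRUE))
df3 then plots directly, with every day's observations mapped onto a single day's time axis.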
I currently have a dataset with multiple different time formats (AM/PM, numeric, 24-hour) and I'm trying to turn them all into 24-hour format. Is there a way to standardize mixed-format columns?
Current sample data
time
12:30 PM
03:00 PM
0.961469907
0.913622685
0.911423611
09:10 AM
18:00
Desired output
new_time
12:30:00
15:00:00
23:04:31
21:55:37
21:52:27
09:10:00
18:00:00
I know how to do them all individually (an example below), but is there a way to do it all in one go? I have a large amount of data and can't go line by line.
#for numeric time
> library(chron)
> x <- c(0.961469907, 0.913622685, 0.911423611)
> times(x)
[1] 23:04:31 21:55:37 21:52:27
The decimal times are a pain, but we can parse them first with chron, feed them back as characters, and then use lubridate's parse_date_time to do them all at once:
library(tidyverse)
library(chron)
# Create reproducible dataframe
df <-
tibble::tibble(
time = c(
"12:30 PM",
"03:00 PM",
0.961469907,
0.913622685,
0.911423611,
"09:10 AM",
"18:00")
)
# Parse times
df <-
df %>%
dplyr::mutate(
time_chron = chron::times(as.numeric(time)),
time_chron = if_else(
is.na(time_chron),
time,
as.character(time_chron)),
time_clean = lubridate::parse_date_time(
x = time_chron,
orders = c(
"%I:%M %p", # HH:MM AM/PM 12 hour format
"%H:%M:%S", # HH:MM:SS 24 hour format
"%H:%M")), # HH:MM 24 hour format
time_clean = hms::as_hms(time_clean)) %>%
select(-time_chron)
Which gives us
> df
# A tibble: 7 × 2
time time_clean
<chr> <time>
1 12:30 PM 12:30:00
2 03:00 PM 15:00:00
3 0.961469907 23:04:31
4 0.913622685 21:55:37
5 0.911423611 21:52:27
6 09:10 AM 09:10:00
7 18:00 18:00:00
I have a dataframe (vlinder) like the following, whereby the date and the timestamp (in UTC) are in separate columns:
date time.utc variable
1/04/2020 0:00:00 12
1/04/2020 0:05:00 54
In a first step, I combined the date and time variables into one column called dateandtime using the following code:
vlinder$dateandtime <- paste(vlinder$date, vlinder$time.utc)
which resulted in an extra column in dataframe vlinder:
date time.utc variable dateandtime
1/04/2020 0:00:00 12 1/04/2020 0:00:00
1/04/2020 0:05:00 54 1/04/2020 0:05:00
I want to convert the UTC time into local time (CEST, i.e. a time difference of 2 hours).
I tried using the following code, but I get something totally different.
vlinder$dateandtime <- as.POSIXct(vlinder$dateandtime, tz = "UTC")
vlinder$dateandtime.cest <- format(vlinder$dateandtime, tz = "Europe/Brussels", usetz = TRUE)
which results in:
date time.utc variable dateandtime dateandtime.cest
1/04/2020 0:00:00 12 0001-04-20 0001-04-20 00:17:30 LMT
1/04/2020 0:05:00 54 0001-04-20 0001-04-20 00:17:30 LMT
How can I solve this?
Many thanks!
Here's a lubridate and tidyverse answer: some data tidying, data type changes, and then bam. Note that the sample dates are day-first (1/04/2020 = 1 April), so dmy() is the parser to use; check lubridate::OlsonNames() for valid time zone (tz) names.
library(tidyverse)
library(lubridate)
df <- read.table(header = TRUE,
text = "date time.utc variable
1/04/2020 00:00:00 12
1/04/2020 00:05:00 54")
df <- df %>%
  mutate(date = dmy(date),   # day-first, matching 1/04/2020 = 1 April 2020
         datetime_utc = as_datetime(paste(date, time.utc)),
         datetime_cest = as_datetime(datetime_utc, tz = 'Europe/Brussels'))
        date time.utc variable        datetime_utc       datetime_cest
1 2020-04-01 00:00:00       12 2020-04-01 00:00:00 2020-04-01 02:00:00
2 2020-04-01 00:05:00       54 2020-04-01 00:05:00 2020-04-01 02:05:00
By default, as.POSIXct expects a date ordered Year-Month-Day, so the date 01/04/2020 is read as the 20th of April of year 1.
You just need to pass your time format to as.POSIXct:
vlinder$dateandtime <- as.POSIXct(vlinder$dateandtime, tz = "UTC", format = "%d/%m/%Y %H:%M:%S")
format(vlinder$dateandtime, tz = "Europe/Brussels", usetz = TRUE)
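For the two sample rows this should print something like the following (April falls in daylight-saving time in Brussels, hence the +2 h):
## [1] "2020-04-01 02:00:00 CEST" "2020-04-01 02:05:00 CEST"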
I'm working with some timestamps in POSIXct format. Right now they are all showing up as being in the timezone "UTC", but in reality some are known to be in the "America/New_York" timezone. I'd like to correct the timestamps so that they all read as the correct times.
I initially used an ifelse() statement along with lubridate::with_tz(). This didn't work as expected because ifelse() didn't return values in POSIXct.
Then I tried dplyr::if_else() based on other posts here, and that's not working as expected either.
I can change a single timestamp, or even a list of timestamps, to a different time zone using with_tz() (so I know it works), but when I use it within if_else() all values come back converted, as if they had all matched the "yes" argument.
library(lubridate)
library(dplyr)
x <- data.frame("ts" = as.POSIXct(c("2017-04-27 13:44:00 UTC",
"2017-03-10 12:22:00 UTC", "2017-03-22 10:24:00 UTC"), tz = "UTC"),
"tz" = c("UTC","EST","UTC"))
x <- mutate(x, ts_New = if_else(tz == "UTC", with_tz(ts, "America/New_York"), ts))
Expected results are below, where ts_New has timestamps adjusted to the new time zone, but only for rows with tz == "UTC". Timestamps with tz = "EST" shouldn't change.
ts tz ts_New
1 2017-04-27 13:44:00 UTC 2017-04-27 09:44:00
2 2017-03-10 12:22:00 EST 2017-03-10 12:22:00
3 2017-03-22 10:24:00 UTC 2017-03-22 06:24:00
Actual results are below, where all ts_New timestamps are adjusted to the new time zone regardless of the value in tz:
x
ts tz ts_New
1 2017-04-27 13:44:00 UTC 2017-04-27 09:44:00
2 2017-03-10 12:22:00 EST 2017-03-10 07:22:00
3 2017-03-22 10:24:00 UTC 2017-03-22 06:24:00
This doesn't fully answer your original question about why with_tz doesn't work with if_else. In short: a POSIXct vector stores instants plus a single display tzone attribute, and with_tz only changes that attribute, not the instants, so inside if_else both branches hold the same underlying instants and the combined result prints in one zone. As a workaround, we subtract 4 hours (the UTC/Eastern offset while daylight saving is in effect, which covers both UTC-labelled timestamps here) where tz == "UTC".
library(dplyr)
library(lubridate)
x %>% mutate(ts_New = if_else(tz == "UTC", ts - hours(4), ts))
# ts tz ts_New
#1 2017-04-27 13:44:00 UTC 2017-04-27 09:44:00
#2 2017-03-10 12:22:00 EST 2017-03-10 12:22:00
#3 2017-03-22 10:24:00 UTC 2017-03-22 06:24:00
Or in base R
x$ts_New <- x$ts
inds <- x$tz == "UTC"
x$ts_New[inds] <- x$ts_New[inds] - 4 * 60 * 60
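If you want the conversion to respect daylight saving rather than a fixed 4-hour shift, one possible pattern (a sketch on the sample data, not a canonical fix) is to convert with with_tz() and then force the result back into a single display zone so that both if_else() branches carry the same tzone:
library(dplyr)
library(lubridate)
x %>%
  mutate(ts_New = if_else(tz == "UTC",
                          # same instants rendered as New York clock time,
                          # then re-labelled UTC so both branches match
                          force_tz(with_tz(ts, "America/New_York"), "UTC"),
                          ts))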
I have read in and formatted my data set as shown below.
library(xts)
#Read data from file
x <- read.csv("data.dat", header=F)
x[is.na(x)] <- c(0) #If empty fill in zero
#Construct data frames
rawdata.h <- data.frame(x[,2],x[,3],x[,4],x[,5],x[,6],x[,7],x[,8]) #Hourly data
rawdata.15min <- data.frame(x[,10]) #15 min data
#Convert time index to proper format
index.h <- as.POSIXct(strptime(x[,1], "%d.%m.%Y %H:%M"))
index.15min <- as.POSIXct(strptime(x[,9], "%d.%m.%Y %H:%M"))
#Set column names
names(rawdata.h) <- c("spot","RKup", "RKdown","RKcon","anm", "pp.stat","prod.h")
names(rawdata.15min) <- c("prod.15min")
#Convert data frames to time series objects
data.htemp <- xts(rawdata.h,order.by=index.h)
data.15mintemp <- xts(rawdata.15min,order.by=index.15min)
#Select desired subset period
data.h <- data.htemp["2013"]
data.15min <- data.15mintemp["2013"]
I want to combine the hourly data in data.h$prod.h with the 15-minute data in data.15min$prod.15min corresponding to the same hour.
An example would be to take the average of the hourly value for 2013-12-01 00:00-01:00 and the last 15-minute value in that same hour, i.e. the 15-minute value for 2013-12-01 00:45-01:00. I'm looking for a flexible way to do this for an arbitrary hour.
Any suggestions?
Edit: Just to clarify further: I want to do something like this:
N <- NROW(data.h$prod.h)
for (i in 1:N){
prod.average[i] <- mean(data.h$prod.h[i] + #INSERT CODE THAT FINDS LAST 15 MIN IN HOUR i )
}
I found a solution to my problem by converting the 15-minute data into hourly data using the very useful .index* functions from the xts package, as shown below.
prod.new <- data.15min$prod.15min[.indexmin(data.15min$prod.15min) %in% c(45:59)]
This creates a new time series containing only the values occurring in the 45-59 minute interval of each hour.
For those curious my data looked like this:
Original hourly series:
> data.h$prod.h[1:4]
2013-01-01 00:00:00 19.744
2013-01-01 01:00:00 27.866
2013-01-01 02:00:00 26.227
2013-01-01 03:00:00 16.013
Original 15 minute series:
> data.15min$prod.15min[1:4]
2013-09-30 00:00:00 16.4251
2013-09-30 00:15:00 18.4495
2013-09-30 00:30:00 7.2125
2013-09-30 00:45:00 12.1913
2013-09-30 01:00:00 12.4606
2013-09-30 01:15:00 12.7299
2013-09-30 01:30:00 12.9992
2013-09-30 01:45:00 26.7522
New series with only the last 15 minutes in each hour:
> prod.new[1:4]
2013-09-30 00:45:00 12.1913
2013-09-30 01:45:00 26.7522
2013-09-30 02:45:00 5.0332
2013-09-30 03:45:00 2.6974
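To then complete the averaging asked about in the question, one possible sketch (assuming the two series overlap in time, and that each :45 value belongs with the hourly value stamped at the top of its hour) is to shift prod.new's index back 45 minutes, merge on the common index, and take row means:
library(xts)
index(prod.new) <- index(prod.new) - 45 * 60  # align :45 stamps with :00 stamps
combined <- merge(data.h$prod.h, prod.new, join = "inner")
prod.average <- xts(rowMeans(combined), order.by = index(combined))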
Short answer
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
Long answer
Since you want to compress the 15-minute time series to a coarser resolution (30 minutes), you should use the dplyr package, or any other package that implements the "group by" concept.
For instance:
s = seq(as.POSIXct("2017-01-01"), as.POSIXct("2017-01-02"), "15 min")
df = data.frame(time = s, value=1:97)
df is a time series with 97 rows and two columns.
head(df)
time value
1 2017-01-01 00:00:00 1
2 2017-01-01 00:15:00 2
3 2017-01-01 00:30:00 3
4 2017-01-01 00:45:00 4
5 2017-01-01 01:00:00 5
6 2017-01-01 01:15:00 6
The cut.POSIXt, group_by and summarise functions do the work:
df %>%
group_by(t = cut(time, "30 min")) %>%
summarise(v = mean(value))
t v
1 2017-01-01 00:00:00 1.5
2 2017-01-01 00:30:00 3.5
3 2017-01-01 01:00:00 5.5
4 2017-01-01 01:30:00 7.5
5 2017-01-01 02:00:00 9.5
6 2017-01-01 02:30:00 11.5
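Note that cut() returns a factor; if you need t back as a timestamp (e.g. for a plot's time axis), a small follow-up conversion does it:
df %>%
  group_by(t = cut(time, "30 min")) %>%
  summarise(v = mean(value)) %>%
  mutate(t = as.POSIXct(as.character(t)))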
A more robust way is to convert the 15-minute values into hourly values by taking their average; then do whatever operation you want.
### 15 Minutes Data
min15 <- structure(list(V1 = structure(1:8, .Label = c("2013-01-01 00:00:00",
"2013-01-01 00:15:00", "2013-01-01 00:30:00", "2013-01-01 00:45:00",
"2013-01-01 01:00:00", "2013-01-01 01:15:00", "2013-01-01 01:30:00",
"2013-01-01 01:45:00"), class = "factor"), V2 = c(16.4251, 18.4495,
7.2125, 12.1913, 12.4606, 12.7299, 12.9992, 26.7522)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -8L))
min15
### Hourly Data
hourly <- structure(list(V1 = structure(1:4, .Label = c("2013-01-01 00:00:00",
"2013-01-01 01:00:00", "2013-01-01 02:00:00", "2013-01-01 03:00:00"
), class = "factor"), V2 = c(19.744, 27.866, 26.227, 16.013)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -4L))
hourly
### Convert 15min data into hourly data by taking average of 4 values
min15$V1 <- as.POSIXct(min15$V1, origin = "1970-01-01 0:0:0")
min15 <- aggregate(. ~ cut(min15$V1, "60 min"), min15[setdiff(names(min15), "V1")], mean)
min15
names(min15) <- c("time","min15")
names(hourly) <- c("time","hourly")
### merge the corresponding values
combined <- merge(hourly,min15)
### average of hourly and 15min values
rowMeans(combined[,2:3])
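On the sample data above, the last line should return approximately the following (only the two hours present in both series survive the merge):
##        1        2
## 16.65680 22.05074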
I know this has been asked several times and I looked at the questions and followed the suggestions. However, I couldn't solve this one.
The datetime.csv can be found on https://www.dropbox.com/s/6bvhk4kei4pg8zq/datetime.csv
My code looks like:
jd1 <- read.csv("datetime.csv")
head(jd1)
Date Time
1 20100101 0:00
2 20100101 1:00
3 20100101 2:00
4 20100101 3:00
5 20100101 4:00
6 20100101 5:00
sapply(jd1,class)
> sapply(jd1,class)
Date Time
"integer" "factor"
jd1 <- transform(jd1, timestamp=format(as.POSIXct(paste(Date, Time)), "%Y%m%d %H:%M:%S"))
Error in as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
I tried the solution suggested by rcs on Converting two columns of date and time data to one, but it seems to give the same error.
Any help is highly appreciated.
Thanks.
The format string you're passing to format includes %S, which you don't have. But that won't fix the error, since it's coming from as.POSIXct. You need to pass the format string there instead and remove the call to the format function.
foo <- transform(jd1, timestamp=as.POSIXct(paste(Date, Time), format="%Y%m%d %H:%M"))
str(foo)
Compare this to the following, where %S has nothing to match, so every timestamp comes back NA:
bar <- transform(jd1, timestamp=as.POSIXct(paste(Date, Time), format="%Y%m%d %H:%M:%S"))
str(bar)
And with the result of calling format, which returns character, not POSIXct:
baz <- transform(jd1, timestamp=format(as.POSIXct(paste(Date, Time), format="%Y%m%d %H:%M"), format='%Y%m%d %H:%M:%S'))
str(baz)
If it's just this file, you don't even need to read it as CSV. The following will do:
# if you are reading just timestamps, you may want to read it as just one column
jd1 <- read.table("datetime.csv", header = TRUE, colClasses = c("character"))
jd1$timestamp <- as.POSIXct(jd1$Date.Time, format = "%Y%m%d,%H:%M")
head(jd1)
## Date.Time timestamp
## 1 20100101,0:00 2010-01-01 00:00:00
## 2 20100101,1:00 2010-01-01 01:00:00
## 3 20100101,2:00 2010-01-01 02:00:00
## 4 20100101,3:00 2010-01-01 03:00:00
## 5 20100101,4:00 2010-01-01 04:00:00
## 6 20100101,5:00 2010-01-01 05:00:00
# if you must read it as separate columns, as you may have other columns in your file
jd2 <- read.csv("datetime.csv", header = TRUE, colClasses = c("character", "character"))
jd2$timestamp <- as.POSIXct(paste(jd2$Date, jd2$Time, sep = " "), format = "%Y%m%d %H:%M")
head(jd2)
## Date Time timestamp
## 1 20100101 0:00 2010-01-01 00:00:00
## 2 20100101 1:00 2010-01-01 01:00:00
## 3 20100101 2:00 2010-01-01 02:00:00
## 4 20100101 3:00 2010-01-01 03:00:00
## 5 20100101 4:00 2010-01-01 04:00:00
## 6 20100101 5:00 2010-01-01 05:00:00
Arun's comment prompted me to do some benchmarking:
jd2 <- read.csv("datetime.csv", header = TRUE, colClasses = c("character", "character"))
library(microbenchmark)
microbenchmark(
  as.POSIXct(paste(jd2$Date, jd2$Time, sep = " "), format = "%Y%m%d %H:%M"),
  as.POSIXct(do.call(paste, c(jd2[c("Date", "Time")])), format = "%Y%m%d %H:%M"),
  transform(jd2, timestamp = as.POSIXct(paste(Date, Time), format = "%Y%m%d %H:%M")),
  times = 100)
## Unit: milliseconds
## expr min lq median uq max neval
## as.POSIXct(paste(jd2$Date, jd2$Time, sep = " "), format = "%Y%m%d %H:%M") 18.84720 18.87736 18.89542 18.93307 20.99021 100
## as.POSIXct(do.call(paste, c(jd2[c("Date", "Time")])), format = "%Y%m%d %H:%M") 18.94440 18.97917 18.99492 19.02220 21.07320 100
## transform(jd2, timestamp = as.POSIXct(paste(Date, Time), format = "%Y%m%d %H:%M")) 19.05581 19.10230 19.12612 19.16877 21.27490 100