I have hourly data of CO2 values and I would like to know what the CO2 concentration is during the night (e.g. 9pm-7am). A reproducible example:
library(tidyverse); library(lubridate)
times <- seq(ymd_hms("2020-01-01 08:00:00"),
             ymd_hms("2020-01-04 08:00:00"), by = "1 hours")
values <- runif(length(times), 1, 15)
df <- tibble(times, values)
How can I get mean nighttime values (e.g. between 9pm and 7am)? Of course I can filter like this:
df <- df %>%
  filter(!hour(times) %in% c(8:20))
And then give an ID to each observation during the night:
df$ID <- rep(LETTERS[1:round(nrow(df)/11)],
             times = 1, each = 11)
And finally group and summarise:
df_grouped <- df %>%
  group_by(ID) %>%
  summarise(value_mean = mean(values))
But I am sure this is not a good way. How can I do this better, especially the part where we assign an ID to the nighttime values?
You can use data.table::frollmean to get means over a given time window. In your case you want the means for the last 10 hours, so we set the n argument of the function to 10:
> df$means <- data.table::frollmean(df$values, 10)
> head(df, 20)
# A tibble: 20 x 3
times values means
<dttm> <dbl> <dbl>
1 2020-01-01 08:00:00 4.15 NA
2 2020-01-01 09:00:00 6.24 NA
3 2020-01-01 10:00:00 5.17 NA
4 2020-01-01 11:00:00 9.20 NA
5 2020-01-01 12:00:00 12.3 NA
6 2020-01-01 13:00:00 2.93 NA
7 2020-01-01 14:00:00 9.12 NA
8 2020-01-01 15:00:00 9.72 NA
9 2020-01-01 16:00:00 12.0 NA
10 2020-01-01 17:00:00 13.4 8.41
11 2020-01-01 18:00:00 10.2 9.01
12 2020-01-01 19:00:00 1.97 8.59
13 2020-01-01 20:00:00 11.9 9.26
14 2020-01-01 21:00:00 8.84 9.23
15 2020-01-01 22:00:00 10.1 9.01
16 2020-01-01 23:00:00 3.76 9.09
17 2020-01-02 00:00:00 9.98 9.18
18 2020-01-02 01:00:00 5.56 8.76
19 2020-01-02 02:00:00 5.22 8.09
20 2020-01-02 03:00:00 6.36 7.39
Each entry in the means column is the mean of that row's value together with the 9 preceding rows of the values column, so the first 9 entries are necessarily NA.
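You can see that alignment on a toy vector (made-up numbers, not the CO2 data):

data.table::frollmean(1:12, 10)
# [1]  NA  NA  NA  NA  NA  NA  NA  NA  NA 5.5 6.5 7.5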
You might also take a look at the tsibble package, which is built for manipulating time series.
You can compute the window length from the data itself (here, the number of rows from one 20:00 to the following 08:00, inclusive), but the timestamps need to be evenly spaced for this to work:
n <- diff(which(grepl('20:00:00|08:00:00', df$times))) + 1
n <- unique(n)
df$means <- data.table::frollmean(df$values, n)
> head(df, 20)
# A tibble: 20 x 3
times values means
<dttm> <dbl> <dbl>
1 2020-01-01 08:00:00 11.4 NA
2 2020-01-01 09:00:00 7.03 NA
3 2020-01-01 10:00:00 7.15 NA
4 2020-01-01 11:00:00 6.91 NA
5 2020-01-01 12:00:00 8.18 NA
6 2020-01-01 13:00:00 4.70 NA
7 2020-01-01 14:00:00 13.8 NA
8 2020-01-01 15:00:00 5.16 NA
9 2020-01-01 16:00:00 12.3 NA
10 2020-01-01 17:00:00 3.81 NA
11 2020-01-01 18:00:00 3.09 NA
12 2020-01-01 19:00:00 9.89 NA
13 2020-01-01 20:00:00 1.24 7.28
14 2020-01-01 21:00:00 8.07 7.02
15 2020-01-01 22:00:00 5.59 6.91
16 2020-01-01 23:00:00 5.77 6.81
17 2020-01-02 00:00:00 10.7 7.10
18 2020-01-02 01:00:00 3.44 6.73
19 2020-01-02 02:00:00 10.3 7.16
20 2020-01-02 03:00:00 4.61 6.45
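If you specifically want one mean per night (9pm-7am) rather than a rolling mean, a compact way to build the night ID is to shift each timestamp forward by 3 hours so that all observations of one night land on a single calendar date. A minimal sketch, assuming hourly timestamps in a single time zone (night_id is my own name):

library(dplyr)
library(lubridate)

df %>%
  filter(!hour(times) %in% 8:20) %>%                  # keep 9pm-7am only
  group_by(night_id = as.Date(times + hours(3))) %>%  # 21:00 -> next date, 00:00-07:00 -> same date
  summarise(value_mean = mean(values))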
I have an hourly time series over several days, as in the example below. I need to fill the NA values, but only for the morning hours (6:00 AM to 9:00 AM). Each morning gap should be filled with the average of the remaining hours of the same day, and likewise for the mornings of the other days.
set.seed(3)
df <- data.frame(timestamp = seq(as.POSIXct('2022-01-01', tz = 'utc'),
                                 as.POSIXct('2022-01-10 23:00', tz = 'utc'),
                                 by = '1 hour'),
                 value = runif(240))
df$value[runif(nrow(df)) < 0.3] <- NA
If I understand you correctly, this is one way to solve the task in dplyr:
df %>%
  dplyr::mutate(after = ifelse(lubridate::hour(timestamp) > 10, value, NA),
                day = format(timestamp, format = '%Y-%m-%d')) %>%
  dplyr::group_by(day) %>%
  dplyr::mutate(value = ifelse(lubridate::hour(timestamp) < 10 & is.na(value),
                               mean(after, na.rm = TRUE), value)) %>%
  dplyr::ungroup() %>%
  dplyr::select(-after, -day)
# A tibble: 240 x 2
timestamp value
<dttm> <dbl>
1 2022-01-01 00:00:00 0.427
2 2022-01-01 01:00:00 0.808
3 2022-01-01 02:00:00 0.385
4 2022-01-01 03:00:00 0.427
5 2022-01-01 04:00:00 0.602
6 2022-01-01 05:00:00 0.604
7 2022-01-01 06:00:00 0.125
8 2022-01-01 07:00:00 0.295
9 2022-01-01 08:00:00 0.578
10 2022-01-01 09:00:00 0.631
# ... with 230 more rows
# i Use `print(n = ...)` to see more rows
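If the window must be exactly 6:00-9:00 AM as in the question, a variant of the same idea is sketched below; the morning and day helper columns are my own names, and I treat 9:00 as part of the morning:

df %>%
  dplyr::group_by(day = as.Date(timestamp)) %>%
  dplyr::mutate(morning = dplyr::between(lubridate::hour(timestamp), 6, 9),
                value = ifelse(morning & is.na(value),
                               mean(value[!morning], na.rm = TRUE),  # average of non-morning hours of that day
                               value)) %>%
  dplyr::ungroup() %>%
  dplyr::select(-morning, -day)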
I have some air pollution data measured hourly.
Datetime              PM2.5   Station.id
2020-01-01 00:00:00   10      1
2020-01-01 01:00:00   NA      1
2020-01-01 02:00:00   15      1
2020-01-01 03:00:00   NA      1
2020-01-01 04:00:00   7       1
2020-01-01 05:00:00   20      1
2020-01-01 06:00:00   30      1
2020-01-01 00:00:00   NA      2
2020-01-01 01:00:00   17      2
2020-01-01 02:00:00   21      2
2020-01-01 03:00:00   55      2
I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has (1) a total of at least 18 hours of missing data AND (2) at least 8 hours of continuous missing data?
PS: In the original data, missing hours may either be marked as NA or be absent entirely.
The "most efficient" way will almost certainly use data.table. Something like this:
library(data.table)
setDT(your_data)
your_data[, date := as.IDate(Datetime)][,
  if (!(sum(is.na(PM2.5)) >= 18 &
        # longest run of consecutive NAs; c(0, ...) guards against days with no NAs
        with(rle(is.na(PM2.5)), max(c(0, lengths[values]))) >= 8)) .SD,
  by = .(date, Station.id)
]
#          date Station.id            Datetime PM2.5
# 1: 2020-01-01          1 2020-01-01 00:00:00    10
# 2: 2020-01-01          1 2020-01-01 01:00:00    NA
# 3: 2020-01-01          1 2020-01-01 02:00:00    15
# 4: 2020-01-01          1 2020-01-01 03:00:00    NA
# 5: 2020-01-01          1 2020-01-01 04:00:00     7
# 6: 2020-01-01          1 2020-01-01 05:00:00    20
# 7: 2020-01-01          1 2020-01-01 06:00:00    30
Using this sample data:
your_data = fread(text = 'Datetime, PM2.5, Station.id
2020-01-01 00:00:00, 10, 1
2020-01-01 01:00:00, NA, 1
2020-01-01 02:00:00, 15, 1
2020-01-01 03:00:00, NA, 1
2020-01-01 04:00:00, 7, 1
2020-01-01 05:00:00, 20, 1
2020-01-01 06:00:00, 30, 1')
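For intuition, rle() on the logical NA mask encodes runs of consecutive values, so the longest TRUE run is the longest stretch of consecutive missing hours. A toy example:

x <- c(NA, NA, NA, 4, NA)
r <- rle(is.na(x))
r$lengths                       # 3 1 1
r$values                        # TRUE FALSE TRUE
max(c(0, r$lengths[r$values]))  # 3, the longest run of consecutive NAs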
I have a list of tibbles that look like this:
> head(temp)
$AT
# A tibble: 8,784 × 2
price_eur datetime
<dbl> <dttm>
1 50.9 2021-01-01 00:00:00
2 48.2 2021-01-01 01:00:00
3 44.7 2021-01-01 02:00:00
4 42.9 2021-01-01 03:00:00
5 40.4 2021-01-01 04:00:00
6 40.2 2021-01-01 05:00:00
7 39.6 2021-01-01 06:00:00
8 40.1 2021-01-01 07:00:00
9 41.3 2021-01-01 08:00:00
10 44.9 2021-01-01 09:00:00
# … with 8,774 more rows
$IE
# A tibble: 7,198 × 2
price_eur datetime
<dbl> <dttm>
1 54.0 2021-01-01 01:00:00
2 53 2021-01-01 02:00:00
3 51.2 2021-01-01 03:00:00
4 48.1 2021-01-01 04:00:00
5 47.3 2021-01-01 05:00:00
6 47.6 2021-01-01 06:00:00
7 45.4 2021-01-01 07:00:00
8 43.4 2021-01-01 08:00:00
9 47.8 2021-01-01 09:00:00
10 51.8 2021-01-01 10:00:00
# … with 7,188 more rows
$`IT-Calabria`
# A tibble: 8,736 × 2
price_eur datetime
<dbl> <dttm>
1 50.9 2021-01-01 00:00:00
2 48.2 2021-01-01 01:00:00
3 44.7 2021-01-01 02:00:00
4 42.9 2021-01-01 03:00:00
5 40.4 2021-01-01 04:00:00
6 40.2 2021-01-01 05:00:00
7 39.6 2021-01-01 06:00:00
8 40.1 2021-01-01 07:00:00
9 41.3 2021-01-01 08:00:00
10 41.7 2021-01-01 09:00:00
# … with 8,726 more rows
The number of rows is different because there are missing observations, usually one or several days.
Ideally I need a tibble with a single datetime index and one column per series, with NAs where data is missing, and I'm stuck here.
We can do a full join by 'datetime':
library(dplyr)
library(purrr)
reduce(temp, full_join, by = "datetime")
If we need to rename the column 'price_eur' before the join, loop over the list with imap, rename 'price_eur' to the corresponding list name (.y), and do the join within reduce:
imap(temp, ~ .x %>%
       rename(!! .y := price_eur)) %>%
  reduce(full_join, by = 'datetime')
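For illustration, a minimal sketch on a made-up two-element list (the values here are invented):

library(dplyr)
library(purrr)

temp <- list(
  AT = tibble::tibble(price_eur = c(50.9, 48.2),
                      datetime = as.POSIXct(c("2021-01-01 00:00:00",
                                              "2021-01-01 01:00:00"), tz = "UTC")),
  IE = tibble::tibble(price_eur = 54,
                      datetime = as.POSIXct("2021-01-01 01:00:00", tz = "UTC"))
)

imap(temp, ~ rename(.x, !!.y := price_eur)) %>%
  reduce(full_join, by = "datetime")
# One row per datetime, one column per country, NA where a series has no observation.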
I am working with some temperature data where I have temperatures at certain depths, e.g. 0.9 m, 2.5 m and 5 m. I would like to interpolate these values to obtain the temperature at each metre, e.g. 1 m, 2 m and 3 m. The original data looks like this:
df
# A tibble: 4 x 3
date d_0.9 d_2.5
<dttm> <dbl> <dbl>
1 2004-01-05 03:00:00 7 8
2 2004-01-05 04:00:00 7.5 9
3 2004-01-05 05:00:00 7 8
4 2004-01-05 06:00:00 6.92 NA
What I would like to get is something like :
df_int
# A tibble: 4 x 5
date d_0.9 d_1 d_2 d_2.5
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2004-01-05 03:00:00 7 7.0625 7.6875 8
2 2004-01-05 04:00:00 7.5 7.59375 8.53125 9
3 2004-01-05 05:00:00 7 7.0625 7.6875 8
4 2004-01-05 06:00:00 6.92 NA NA NA
I have to do this for a very large data frame. Is there an efficient way of doing it?
Many thanks in advance
One option is to convert the data to long format, use a join to add rows for the depths we want to interpolate at, and then use approx for the interpolation:
library(tidyverse)
# Data
df = tibble(date=seq(as.POSIXct("2004-01-05 03:00:00"),
as.POSIXct("2004-01-05 06:00:00"),
by="1 hour"),
d_0.9 = c(7,7.5,7,6.92),
d_2.5 = c(8,NA,8,NA),
d_5.0 = c(10,10.5,9.4,NA))
# Create a data frame with all of the times and depths we want to interpolate at
depths = sort(unique(c(c(0.9, 2.5, 5), seq(ceiling(0.9), floor(5), 1))))
depths = crossing(date=unique(df$date), depth = depths)
# Convert data to long format, join to add interpolation depths, then interpolate
df.interp = df %>%
gather(depth, value, -date) %>%
mutate(depth = as.numeric(gsub("d_", "", depth))) %>%
full_join(depths) %>%
arrange(date, depth) %>%
group_by(date) %>%
mutate(value.interp = if(length(na.omit(value)) > 1) {
approx(depth, value, xout=depth)$y
} else {
value
})
In the code above, the if statement is included to prevent approx from throwing an error when a given date has fewer than two non-missing values.
df.interp
date depth value value.interp
1 2004-01-05 03:00:00 0.9 7.00 7.000000
2 2004-01-05 03:00:00 1.0 NA 7.062500
3 2004-01-05 03:00:00 2.0 NA 7.687500
4 2004-01-05 03:00:00 2.5 8.00 8.000000
5 2004-01-05 03:00:00 3.0 NA 8.400000
6 2004-01-05 03:00:00 4.0 NA 9.200000
7 2004-01-05 03:00:00 5.0 10.00 10.000000
8 2004-01-05 04:00:00 0.9 7.50 7.500000
9 2004-01-05 04:00:00 1.0 NA 7.573171
10 2004-01-05 04:00:00 2.0 NA 8.304878
11 2004-01-05 04:00:00 2.5 NA 8.670732
12 2004-01-05 04:00:00 3.0 NA 9.036585
13 2004-01-05 04:00:00 4.0 NA 9.768293
14 2004-01-05 04:00:00 5.0 10.50 10.500000
15 2004-01-05 05:00:00 0.9 7.00 7.000000
16 2004-01-05 05:00:00 1.0 NA 7.062500
17 2004-01-05 05:00:00 2.0 NA 7.687500
18 2004-01-05 05:00:00 2.5 8.00 8.000000
19 2004-01-05 05:00:00 3.0 NA 8.280000
20 2004-01-05 05:00:00 4.0 NA 8.840000
21 2004-01-05 05:00:00 5.0 9.40 9.400000
22 2004-01-05 06:00:00 0.9 6.92 6.920000
23 2004-01-05 06:00:00 1.0 NA NA
24 2004-01-05 06:00:00 2.0 NA NA
25 2004-01-05 06:00:00 2.5 NA NA
26 2004-01-05 06:00:00 3.0 NA NA
27 2004-01-05 06:00:00 4.0 NA NA
28 2004-01-05 06:00:00 5.0 NA NA
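If you then want the wide layout from the question back, the long result can be spread out again (a sketch consistent with the gather() used above):

df.interp %>%
  select(date, depth, value.interp) %>%
  mutate(depth = paste0("d_", depth)) %>%  # rebuild column names like d_0.9, d_1, ...
  spread(depth, value.interp)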
I have a dataframe with varying time steps that I want to convert to even time steps: every 10 minutes a value should be written, and if there is no new value, the previous one should be carried forward (see 2019-01-01 01:00:00 and 2019-01-01 02:30:00).
date ZUL_T
1 2019-01-01 00:04:00 23.3
2 2019-01-01 00:15:00 23.3
3 2019-01-01 00:26:00 19.9
4 2019-01-01 00:37:00 20.7
5 2019-01-01 00:48:00 21.9
6 2019-01-01 00:59:00 21.9
7 2019-01-01 01:10:00 18.8
8 2019-01-01 01:22:00 18.8
9 2019-01-01 01:33:00 20.7
10 2019-01-01 01:44:00 21.6
11 2019-01-01 01:55:00 19.2
12 2019-01-01 02:06:00 19.2
13 2019-01-01 02:17:00 19.6
14 2019-01-01 02:29:00 19.6
15 2019-01-01 02:40:00 20.5
This is my current code, but some time steps are missing from the result when there is no value in DS for that interval.
library(lubridate)
lowtime <- min(DS$date)
hightime <- max(DS$date)
# Set the minute and second to the nearest 10 minute value
minute(lowtime) <- floor(minute(lowtime)/10) * 10
minute(hightime) <- ceiling(minute(hightime)/10) * 10
second(lowtime) <- 0
second(hightime) <- 0
# Set the breakpoints at 10 minute intervals
breakpoints <- seq.POSIXt(lowtime, hightime, by = 600)
ZUL_T <- aggregate(ZUL_T ~ cut(date, breaks = breakpoints), DS, mean)
> data
date ZUL_T
1 2019-01-01 00:00:00 23.3
2 2019-01-01 00:10:00 23.3
3 2019-01-01 00:20:00 19.9
4 2019-01-01 00:30:00 20.7
5 2019-01-01 00:40:00 21.9
6 2019-01-01 00:50:00 21.9
7 2019-01-01 01:10:00 18.8
8 2019-01-01 01:20:00 18.8
9 2019-01-01 01:30:00 20.7
10 2019-01-01 01:40:00 21.6
11 2019-01-01 01:50:00 19.2
12 2019-01-01 02:00:00 19.2
13 2019-01-01 02:10:00 19.6
14 2019-01-01 02:20:00 19.6
15 2019-01-01 02:40:00 20.5
We can use floor_date from the lubridate package to round each time down to its 10-minute boundary, group by that, and sum the ZUL_T values (with one observation per bucket, the sum just returns that value).
library(dplyr)
library(lubridate)
library(tidyr)
df %>%
group_by(date = floor_date(ymd_hms(date), "10 mins")) %>%
summarise(ZUL_T = sum(ZUL_T))
# date ZUL_T
# <dttm> <dbl>
# 1 2019-01-01 00:00:00 23.3
# 2 2019-01-01 00:10:00 23.3
# 3 2019-01-01 00:20:00 19.9
# 4 2019-01-01 00:30:00 20.7
# 5 2019-01-01 00:40:00 21.9
# 6 2019-01-01 00:50:00 21.9
# 7 2019-01-01 01:10:00 18.8
# 8 2019-01-01 01:20:00 18.8
# 9 2019-01-01 01:30:00 20.7
#10 2019-01-01 01:40:00 21.6
#11 2019-01-01 01:50:00 19.2
#12 2019-01-01 02:00:00 19.2
#13 2019-01-01 02:10:00 19.6
#14 2019-01-01 02:20:00 19.6
#15 2019-01-01 02:40:00 20.5
and then use complete and fill to add the missing time steps and fill the resulting NA values with the previous value:
df %>%
  group_by(date = floor_date(ymd_hms(date), "10 mins")) %>%
  summarise(ZUL_T = sum(ZUL_T)) %>%
  complete(date = seq(min(date), max(date), "10 mins")) %>%
  fill(ZUL_T)
# date ZUL_T
# <dttm> <dbl>
# 1 2019-01-01 00:00:00 23.3
# 2 2019-01-01 00:10:00 23.3
# 3 2019-01-01 00:20:00 19.9
# 4 2019-01-01 00:30:00 20.7
# 5 2019-01-01 00:40:00 21.9
# 6 2019-01-01 00:50:00 21.9
# 7 2019-01-01 01:00:00 21.9
# 8 2019-01-01 01:10:00 18.8
# 9 2019-01-01 01:20:00 18.8
#10 2019-01-01 01:30:00 20.7
#11 2019-01-01 01:40:00 21.6
#12 2019-01-01 01:50:00 19.2
#13 2019-01-01 02:00:00 19.2
#14 2019-01-01 02:10:00 19.6
#15 2019-01-01 02:20:00 19.6
#16 2019-01-01 02:30:00 19.6
#17 2019-01-01 02:40:00 20.5
data
df <- structure(list(date = structure(1:15, .Label = c("2019-01-01 00:04:00",
"2019-01-01 00:15:00", "2019-01-01 00:26:00", "2019-01-01 00:37:00",
"2019-01-01 00:48:00", "2019-01-01 00:59:00", "2019-01-01 01:10:00",
"2019-01-01 01:22:00", "2019-01-01 01:33:00", "2019-01-01 01:44:00",
"2019-01-01 01:55:00", "2019-01-01 02:06:00", "2019-01-01 02:17:00",
"2019-01-01 02:29:00", "2019-01-01 02:40:00"), class = "factor"),
ZUL_T = c(23.3, 23.3, 19.9, 20.7, 21.9, 21.9, 18.8, 18.8,
20.7, 21.6, 19.2, 19.2, 19.6, 19.6, 20.5)),
class = "data.frame", row.names = c(NA,-15L))
You could merge with the breakpoints as a data frame.
# first, you probably need the upper bound 10 min later in time
minute(hightime) <- ceiling((minute(max(DS$date)) + 10)/10) * 10
breakpoints <- seq.POSIXt(lowtime, hightime, by=600)
Use aggregate in classic list notation to get proper names.
ZUL_T <- aggregate(list(ZUL_T = DS$ZUL_T),
                   list(date = cut(DS$date, breaks = breakpoints)), mean)
Now merge,
ZUL_T <- merge(transform(ZUL_T, date=as.character(date)),
data.frame(date=as.character(breakpoints[-length(breakpoints)]),
stringsAsFactors=F),
all=TRUE)
and replace each NA value with the value one row above (note that this only fills single-interval gaps; see the sketch after the output for longer gaps).
ZUL_T$ZUL_T[is.na(ZUL_T$ZUL_T)] <- ZUL_T$ZUL_T[which(is.na(ZUL_T$ZUL_T)) - 1]
ZUL_T
# date ZUL_T
# 1 2019-01-01 00:00:00 23.3
# 2 2019-01-01 00:10:00 23.3
# 3 2019-01-01 00:20:00 19.9
# 4 2019-01-01 00:30:00 20.7
# 5 2019-01-01 00:40:00 21.9
# 6 2019-01-01 00:50:00 21.9
# 7 2019-01-01 01:00:00 21.9
# 8 2019-01-01 01:10:00 18.8
# 9 2019-01-01 01:20:00 18.8
# 10 2019-01-01 01:30:00 20.7
# 11 2019-01-01 01:40:00 21.6
# 12 2019-01-01 01:50:00 19.2
# 13 2019-01-01 02:00:00 19.2
# 14 2019-01-01 02:10:00 19.6
# 15 2019-01-01 02:20:00 19.6
# 16 2019-01-01 02:30:00 19.6
# 17 2019-01-01 02:40:00 20.5
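The indexing replacement above only fills gaps one interval long. If several consecutive breakpoints can be empty, a last-observation-carried-forward fill is safer; a sketch using the zoo package:

# carry the last non-NA value forward; na.rm = FALSE keeps any leading NAs
ZUL_T$ZUL_T <- zoo::na.locf(ZUL_T$ZUL_T, na.rm = FALSE)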