I have some air pollution data measured by hours.
Datetime             PM2.5  Station.id
2020-01-01 00:00:00  10     1
2020-01-01 01:00:00  NA     1
2020-01-01 02:00:00  15     1
2020-01-01 03:00:00  NA     1
2020-01-01 04:00:00  7      1
2020-01-01 05:00:00  20     1
2020-01-01 06:00:00  30     1
2020-01-01 00:00:00  NA     2
2020-01-01 01:00:00  17     2
2020-01-01 02:00:00  21     2
2020-01-01 03:00:00  55     2
I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has (1) a total of 18 hours of missing data AND (2) 8 continuous hours of missing data?
PS. In the original data, the missing hours may either have been removed already or be present as NAs.
The "most efficient" way will almost certainly use data.table. Something like this:
library(data.table)
setDT(your_data)
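your_data[, date := as.IDate(Datetime)][,
  # keep a station-day unless it has >= 18 missing hours AND a run of >= 8 consecutive NAs
  # (the extra 0 in max() avoids a warning on days with no NAs)
  if (!(sum(is.na(PM2.5)) >= 18 &&
        with(rle(is.na(PM2.5)), max(lengths[values], 0)) >= 8
  )) .SD,
  by = .(date, Station.id)  # column name as in the question; drop it if your data has no station column
]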
# date Datetime PM2.5
# 1: 2020-01-01 2020-01-01 00:00:00 10
# 2: 2020-01-01 2020-01-01 01:00:00 NA
# 3: 2020-01-01 2020-01-01 02:00:00 15
# 4: 2020-01-01 2020-01-01 03:00:00 NA
# 5: 2020-01-01 2020-01-01 04:00:00 7
# 6: 2020-01-01 2020-01-01 05:00:00 20
# 7: 2020-01-01 2020-01-01 06:00:00 30
Using this sample data:
your_data = fread(text = 'Datetime PM2.5
2020-01-01 00:00:00 10
2020-01-01 01:00:00 NA
2020-01-01 02:00:00 15
2020-01-01 03:00:00 NA
2020-01-01 04:00:00 7
2020-01-01 05:00:00 20
2020-01-01 06:00:00 30')
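To make the rule inside the `if()` easier to check, here is a minimal base R sketch of the same test applied to a single day's vector of hourly values. The helper name `is_bad_day` and its arguments are made up for illustration, and it assumes missing hours are present as NA rather than already removed.

```r
# Hypothetical helper: TRUE when a day's hourly values should be dropped,
# i.e. at least `min_missing` NAs in total AND a run of at least `min_run` NAs.
is_bad_day <- function(pm, min_missing = 18, min_run = 8) {
  miss <- is.na(pm)
  runs <- rle(miss)
  # longest run of consecutive NAs; the 0 covers days without any NA
  longest_na_run <- max(runs$lengths[runs$values], 0)
  sum(miss) >= min_missing && longest_na_run >= min_run
}

day_ok    <- c(10, NA, 15, NA, 7, 20, 30, rep(12, 17))  # 2 NAs
day_bad   <- c(rep(NA, 18), 5, 6, 7, 8, 9, 10)          # 18 NAs in one run
day_split <- rep(c(NA, NA, NA, 1), 6)                   # 18 NAs, longest run is 3

is_bad_day(day_ok)
is_bad_day(day_bad)
is_bad_day(day_split)
```

Only `day_bad` meets both conditions, so it is the only day that would be removed.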
I have two dataframes, one of interest rates and one of monthly rolling standard deviations of price returns, that I have managed to merge together. However, the interest rate data has gaps in its dates where the markets were not open, i.e. weekends and holidays. The monthly returns are all dated on the first of the month, so where this coincides with a market closure the data doesn't merge correctly. An example of the dataframes is
Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341
Date InterestRate
03/11/2014 2
31/10/2014 1.5
As you can see there is no 01/11/2014 in the interest rate data so merging together gives me
Date InterestRate Rollingstd
03/11/2014 2 0.01341
31/10/2014 1.5 0.01341
I guess a fix for this would be to expand the interest rate dataframe so that it includes all dates and just fill the interest rate data up so it looks like this
Date InterestRate
03/11/2014 2
02/11/2014 1.5
01/11/2014 1.5
31/10/2014 1.5
This would ensure there are no missing dates in the dataframe. Any ideas on how I could do this?
Do you want this?
df2 <- read.table(text = 'Date InterestRate
03/11/2014 2
31/10/2014 1.5', header = T)
df1 <- read.table(text = 'Date Rollingstd
01/11/2014 0.00925
01/10/2014 0.01341', header = T)
library(tidyverse)
df1 %>% full_join(df2, by = 'Date') %>%
mutate(Date = as.Date(Date, '%d/%m/%Y')) %>%
arrange(Date) %>%
complete(Date = seq.Date(min(Date), max(Date), 'days')) %>%
fill(InterestRate, .direction = 'up') %>%
as.data.frame()
#> Date Rollingstd InterestRate
#> 1 2014-10-01 0.01341 1.5
#> 2 2014-10-02 NA 1.5
#> 3 2014-10-03 NA 1.5
#> 4 2014-10-04 NA 1.5
#> 5 2014-10-05 NA 1.5
#> 6 2014-10-06 NA 1.5
#> 7 2014-10-07 NA 1.5
#> 8 2014-10-08 NA 1.5
#> 9 2014-10-09 NA 1.5
#> 10 2014-10-10 NA 1.5
#> 11 2014-10-11 NA 1.5
#> 12 2014-10-12 NA 1.5
#> 13 2014-10-13 NA 1.5
#> 14 2014-10-14 NA 1.5
#> 15 2014-10-15 NA 1.5
#> 16 2014-10-16 NA 1.5
#> 17 2014-10-17 NA 1.5
#> 18 2014-10-18 NA 1.5
#> 19 2014-10-19 NA 1.5
#> 20 2014-10-20 NA 1.5
#> 21 2014-10-21 NA 1.5
#> 22 2014-10-22 NA 1.5
#> 23 2014-10-23 NA 1.5
#> 24 2014-10-24 NA 1.5
#> 25 2014-10-25 NA 1.5
#> 26 2014-10-26 NA 1.5
#> 27 2014-10-27 NA 1.5
#> 28 2014-10-28 NA 1.5
#> 29 2014-10-29 NA 1.5
#> 30 2014-10-30 NA 1.5
#> 31 2014-10-31 NA 1.5
#> 32 2014-11-01 0.00925 2.0
#> 33 2014-11-02 NA 2.0
#> 34 2014-11-03 NA 2.0
Created on 2021-05-23 by the reprex package (v2.0.0)
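If the goal is the table shown in the question, where closed-market dates carry the previous trading day's rate forward, filling in the "down" direction on the rate table alone may be closer to what is wanted. A minimal sketch, assuming the rates sit in a data frame called `rates` (a name chosen here for illustration) with `Date` already converted to class Date:

```r
library(dplyr)
library(tidyr)

# Example rates from the question, with Date parsed as a Date
rates <- data.frame(
  Date = as.Date(c("03/11/2014", "31/10/2014"), format = "%d/%m/%Y"),
  InterestRate = c(2, 1.5)
)

rates_daily <- rates %>%
  arrange(Date) %>%
  # add a row for every calendar day between the first and last date
  complete(Date = seq(min(Date), max(Date), by = "day")) %>%
  # carry the last known rate forward into the new rows
  fill(InterestRate, .direction = "down")

rates_daily
```

The daily table can then be joined to the monthly data on `Date`, so that a first-of-month date falling on a weekend or holiday still finds a rate.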
This is a bit of a curious case for which I have been unable to find a solution on stackoverflow. I have a dataset with a date-time column and a column of values that indicate an event, such as in the dat example below. The date-times are every hour, however, note that occasional "missed" hours exist (2 hours are missing between rows 12 & 13).
dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),
max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6))
> dat
datetime event
1 2010-04-03 03:00:00 1
2 2010-04-03 04:00:00 NA
3 2010-04-03 05:00:00 NA
4 2010-04-03 06:00:00 NA
5 2010-04-03 07:00:00 NA
6 2010-04-03 08:00:00 NA
7 2010-04-03 09:00:00 NA
8 2010-04-03 10:00:00 NA
9 2010-04-03 11:00:00 NA
10 2010-04-03 12:00:00 NA
11 2010-04-03 13:00:00 2
12 2010-04-03 14:00:00 NA
13 2010-04-03 17:00:00 NA
14 2010-04-03 18:00:00 NA
15 2010-04-03 19:00:00 NA
16 2010-04-03 20:00:00 NA
17 2010-04-03 21:00:00 3
18 2010-04-03 22:00:00 4
19 2010-04-03 23:00:00 NA
20 2010-04-04 00:00:00 NA
21 2010-04-04 01:00:00 NA
22 2010-04-04 02:00:00 NA
23 2010-04-04 03:00:00 NA
24 2010-04-04 04:00:00 NA
25 2010-04-04 05:00:00 NA
26 2010-04-04 06:00:00 NA
27 2010-04-04 07:00:00 NA
28 2010-04-04 08:00:00 5
29 2010-04-04 09:00:00 NA
30 2010-04-04 10:00:00 6
I would like each row within an interval of 7 hours after the event occurs to be identified with a unique identifier, but with the following caveats (hence the "curious case"):
if a subsequent event occurs within the 7 hours of the event prior, that subsequent event is essentially ignored (i.e., "event" number does not equal assigned identifier value), and
missing times are accounted for (i.e., the rule is based on the time elapsed, not the number of rows).
The product would look like result:
library(dplyr)
result <- dat %>%
mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
> result
datetime event id
1 2010-04-03 03:00:00 1 1
2 2010-04-03 04:00:00 NA 1
3 2010-04-03 05:00:00 NA 1
4 2010-04-03 06:00:00 NA 1
5 2010-04-03 07:00:00 NA 1
6 2010-04-03 08:00:00 NA 1
7 2010-04-03 09:00:00 NA 1
8 2010-04-03 10:00:00 NA 1
9 2010-04-03 11:00:00 NA NA
10 2010-04-03 12:00:00 NA NA
11 2010-04-03 13:00:00 2 2
12 2010-04-03 14:00:00 NA 2
13 2010-04-03 17:00:00 NA 2
14 2010-04-03 18:00:00 NA 2
15 2010-04-03 19:00:00 NA 2
16 2010-04-03 20:00:00 NA 2
17 2010-04-03 21:00:00 3 3
18 2010-04-03 22:00:00 4 3
19 2010-04-03 23:00:00 NA 3
20 2010-04-04 00:00:00 NA 3
21 2010-04-04 01:00:00 NA 3
22 2010-04-04 02:00:00 NA 3
23 2010-04-04 03:00:00 NA 3
24 2010-04-04 04:00:00 NA 3
25 2010-04-04 05:00:00 NA NA
26 2010-04-04 06:00:00 NA NA
27 2010-04-04 07:00:00 NA NA
28 2010-04-04 08:00:00 5 4
29 2010-04-04 09:00:00 NA 4
30 2010-04-04 10:00:00 6 4
Most ideally, this would be accomplished in a dplyr framework.
library(lubridate)
library(tidyverse)
dat <- data.frame(datetime = seq(min(as.POSIXct("2010-04-03 03:00:00 UTC")),
max(as.POSIXct("2010-04-04 10:00:00 UTC")), by = "hour")[-c(13,14)],
event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)) %>%
mutate(id = c(rep(1, 8), rep(NA, 2), rep(2, 6), rep(3, 8), rep(NA, 3), rep(4, 3)))
Events <- dat %>%
#Get only the rows with events
filter(!is.na(event)) %>%
#Get the duration of time between events
mutate(
EventLag = datetime - lag(datetime)) %>%
## Remove events that occurred within 7 hrs of the previous event; keep the first one, whose lag is NA.
## In the real data I do not suspect your first point would ever be an event...? Maybe this can be removed in the
## real dataset...
## Units are set explicitly so the comparison does not depend on difftime's automatic unit choice
filter(as.numeric(EventLag, units = "hours") > 7 | is.na(EventLag)) %>%
as.data.frame()
## You now have all of the events that are of interest (i.e. those that occurred outside of the 7 hr buffer)
## Give the events a new ID so there are no gaps
## Join them with the rest of the datetime stamps
Events <- Events %>%
mutate(ID = row_number()) %>%
dplyr::select(datetime, ID)
## Expand each event by 7 hrs
Events <- Events %>%
group_by(ID) %>%
do(data.frame(ID= .$ID, datetime= seq(.$datetime, .$datetime + hours(7), by = '1 hour'), stringsAsFactors=FALSE)) %>%
as.data.frame()
## Join with initial data by datetime
DatJoin <- dat %>%
left_join(Events, by = "datetime")
DatJoin
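Note that the filter above compares each event with the event immediately before it, not with the last event that actually opened a window. On the example data the two give the same answer, but they can differ when several events follow each other closely. A sequential sketch in base R that tracks the current window explicitly might look like this; `assign_ids` and `window_hours` are names invented for the illustration, and the data is rebuilt in UTC so the block runs on its own:

```r
# Rebuild the example data (UTC assumed for reproducibility)
dat <- data.frame(
  datetime = seq(as.POSIXct("2010-04-03 03:00:00", tz = "UTC"),
                 as.POSIXct("2010-04-04 10:00:00", tz = "UTC"),
                 by = "hour")[-c(13, 14)],
  event = c(1, rep(NA, 9), 2, rep(NA, 5), 3, 4, rep(NA, 9), 5, NA, 6)
)

# Hypothetical helper: walk through the rows once, opening a new window
# only when an event occurs outside the currently open window.
assign_ids <- function(datetime, event, window_hours = 7) {
  id <- rep(NA_integer_, length(datetime))
  current <- 0L
  window_end <- as.POSIXct(NA)
  for (i in seq_along(datetime)) {
    in_window <- !is.na(window_end) && datetime[i] <= window_end
    if (!in_window && !is.na(event[i])) {
      current <- current + 1L
      window_end <- datetime[i] + window_hours * 3600  # based on elapsed time, not rows
      in_window <- TRUE
    }
    if (in_window) id[i] <- current
  }
  id
}

dat$id <- assign_ids(dat$datetime, dat$event)
dat
```

Because the window end is a timestamp rather than a row count, the two missing hours after 14:00 are handled without any special casing.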
I have hourly data of CO2 values and I would like to know what is the CO2 concentration during the night (e.g. 9pm-7am). A reproducible example:
library(tidyverse); library(lubridate)
times <- seq(ymd_hms("2020-01-01 08:00:00"),
ymd_hms("2020-01-04 08:00:00"), by = "1 hours")
values <- runif(length(times), 1, 15)
df <- tibble(times, values)
How can I get mean nighttime values (e.g. between 9pm and 7am)? Of course I can filter like this:
df <- df %>%
filter(!hour(times) %in% c(8:20))
And then give id to each observation during the night
df$ID <- rep(LETTERS[1:round(nrow(df)/11)],
times = 1, each = 11)
And finally group and summarise
df_grouped <- df %>%
group_by(., ID) %>%
summarise(value_mean =mean(values))
But I am sure this is not a good way. How can I do this better, especially the part where an ID is assigned to the nighttime values?
You can use data.table::frollmean to get the means for a certain window time. In your case you want the means for the last 10 hours, so we set the n argument of the function to 10:
> df$means <- data.table::frollmean(df$values, 10)
> head(df, 20)
# A tibble: 20 x 3
times values means
<dttm> <dbl> <dbl>
1 2020-01-01 08:00:00 4.15 NA
2 2020-01-01 09:00:00 6.24 NA
3 2020-01-01 10:00:00 5.17 NA
4 2020-01-01 11:00:00 9.20 NA
5 2020-01-01 12:00:00 12.3 NA
6 2020-01-01 13:00:00 2.93 NA
7 2020-01-01 14:00:00 9.12 NA
8 2020-01-01 15:00:00 9.72 NA
9 2020-01-01 16:00:00 12.0 NA
10 2020-01-01 17:00:00 13.4 8.41
11 2020-01-01 18:00:00 10.2 9.01
12 2020-01-01 19:00:00 1.97 8.59
13 2020-01-01 20:00:00 11.9 9.26
14 2020-01-01 21:00:00 8.84 9.23
15 2020-01-01 22:00:00 10.1 9.01
16 2020-01-01 23:00:00 3.76 9.09
17 2020-01-02 00:00:00 9.98 9.18
18 2020-01-02 01:00:00 5.56 8.76
19 2020-01-02 02:00:00 5.22 8.09
20 2020-01-02 03:00:00 6.36 7.39
Each row of the means column is the mean of that row's value and the 9 preceding values, so the first 9 rows are NA.
You might also take a look at the tsibble package, which is built for manipulating time series.
You can also derive the window length from the times themselves, but the observations need to be evenly spaced in your data for this solution to work:
n <- diff(which(grepl('20:00:00|08:00:00', df$times))) + 1
n <- unique(n)
df$means <- data.table::frollmean(df$values, n)
> head(df, 20)
# A tibble: 20 x 3
times values means
<dttm> <dbl> <dbl>
1 2020-01-01 08:00:00 11.4 NA
2 2020-01-01 09:00:00 7.03 NA
3 2020-01-01 10:00:00 7.15 NA
4 2020-01-01 11:00:00 6.91 NA
5 2020-01-01 12:00:00 8.18 NA
6 2020-01-01 13:00:00 4.70 NA
7 2020-01-01 14:00:00 13.8 NA
8 2020-01-01 15:00:00 5.16 NA
9 2020-01-01 16:00:00 12.3 NA
10 2020-01-01 17:00:00 3.81 NA
11 2020-01-01 18:00:00 3.09 NA
12 2020-01-01 19:00:00 9.89 NA
13 2020-01-01 20:00:00 1.24 7.28
14 2020-01-01 21:00:00 8.07 7.02
15 2020-01-01 22:00:00 5.59 6.91
16 2020-01-01 23:00:00 5.77 6.81
17 2020-01-02 00:00:00 10.7 7.10
18 2020-01-02 01:00:00 3.44 6.73
19 2020-01-02 02:00:00 10.3 7.16
20 2020-01-02 03:00:00 4.61 6.45
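As for the ID step the question asks about, one option that avoids counting rows is to label each night by the calendar date on which it starts. The sketch below is base R only and assumes the night runs from 21:00 to 07:00 inclusive and that the times are in UTC; the column name `night_of` is made up for the example.

```r
set.seed(1)
times <- seq(as.POSIXct("2020-01-01 08:00:00", tz = "UTC"),
             as.POSIXct("2020-01-04 08:00:00", tz = "UTC"), by = "1 hour")
df <- data.frame(times, values = runif(length(times), 1, 15))

# Keep only nighttime observations (21:00 to 07:00)
hr <- as.integer(format(df$times, "%H"))
night <- df[hr >= 21 | hr <= 7, ]

# Shift back 8 hours so that 21:00-07:00 falls on a single calendar date,
# then use that date as the night's ID
night$night_of <- as.Date(night$times - 8 * 3600)

night_means <- aggregate(values ~ night_of, data = night, FUN = mean)
night_means
```

This also copes with missing hours, since the grouping depends on the timestamp and not on there being exactly 11 rows per night.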
I've got a dataframe of 3 variables: time (a POSIXct object), RRR (numeric) and he (factor). RRR is the amount of liquid precipitation and he is the hydrological event number; the time of each he value corresponds to the beginning of the flood event.
df <- structure(list(time = structure(c(1396879200, 1396922400, 1396976400,
1397008800, 1397095200, 1397332800, 1397354400, 1397397600, 1397451600,
1397484000, 1397527200, 1397786400, 1397959200, 1398002400, 1398024000,
1398132000, 1398175200, 1398218400, 1398261600, 1398369600, 1398466800,
1398477600, 1398520800, 1398564000, 1398607200, 1398747600, 1398780000,
1398909600, 1398952800, 1398974400, 1398996000),
class = c("POSIXct", "POSIXt"),
tzone = ""),
RRR = c(NA, 2, NA, 4, NA, NA, 0.9, 3,
NA, 0.4, 11, NA, 0.5, 1, NA, 13, 4, 0.8, 0.3, NA, NA, 8, 4, 11,
1, NA, 7, 1, 0.4, NA, 4),
he = c(1, NA, 2, NA, 3, 4, NA, NA,
5, NA, NA, 6, NA, NA, 7, NA, NA, NA, NA, 8, 9, NA, NA, NA, NA,
10, NA, NA, NA, 11, NA)),
class = "data.frame",
row.names = c(NA, -31L))
The head of my dataframe looks as follows:
> df
time RRR he
1 2014-04-07 18:00:00 NA 1
2 2014-04-08 06:00:00 2.0 NA
3 2014-04-08 21:00:00 NA 2
4 2014-04-09 06:00:00 4.0 NA
5 2014-04-10 06:00:00 NA 3
6 2014-04-13 00:00:00 NA 4
7 2014-04-13 06:00:00 0.9 NA
8 2014-04-13 18:00:00 3.0 NA
9 2014-04-14 09:00:00 NA 5
I need to calculate the time difference between time of every he value and last non-NA RRR value. For example, for he = 2 the desired difference would be difftime(df$time[3], df$time[2]), while for he = 4 the time difference should be difftime(df$time[6], df$time[4]). So in the end I want to get a dataframe like this, where 'diff' is the time difference in hours.
> df
time RRR he diff
1 2014-04-07 18:00:00 NA 1 NA
2 2014-04-08 06:00:00 2.0 NA NA
3 2014-04-08 21:00:00 NA 2 15
4 2014-04-09 06:00:00 4.0 NA NA
5 2014-04-10 06:00:00 NA 3 24
6 2014-04-13 00:00:00 NA 4 90
7 2014-04-13 06:00:00 0.9 NA NA
8 2014-04-13 18:00:00 3.0 NA NA
9 2014-04-14 09:00:00 NA 5 15
I'm sure that there must be easier ways, but using tidyverse and data.table you can do:
library(tidyverse)
library(data.table) # for rleid()
df %>%
mutate(time = as.POSIXct(time, format = "%Y-%m-%d %H:%M:%S")) %>% #Transforming "time" into a datetime object
fill(RRR) %>% #Filling the NA values in "RRR" with tha last non-NA value
group_by(temp = rleid(RRR)) %>% #Grouping by run length of "RRR"
mutate(temp2 = seq_along(temp)) %>% #Sequencing around the run length of "RRR"
group_by(RRR, temp) %>% #Group by "RRR" and run length of "RRR"
mutate(diff = ifelse(!is.na(he), difftime(time, time[temp2 == 1], units="hours"), NA)) %>% #Computing the difference in hours between the first occurrence of a non-NA "RRR" value and the non-NA "he" values
ungroup() %>%
select(-temp, -temp2, -RRR) %>% #Removing the redundant variables
rowid_to_column() %>% #Creating unique row IDs
left_join(df %>%
rowid_to_column() %>%
select(RRR, rowid), by = c("rowid" = "rowid")) %>% #Merging with the original df to get the original values of "RRR"
select(-rowid) #Removing the redundant variables
time he diff RRR
<dttm> <dbl> <dbl> <dbl>
1 2014-04-07 16:00:00 1. 0. NA
2 2014-04-08 04:00:00 NA NA 2.00
3 2014-04-08 19:00:00 2. 15. NA
4 2014-04-09 04:00:00 NA NA 4.00
5 2014-04-10 04:00:00 3. 24. NA
6 2014-04-12 22:00:00 4. 90. NA
7 2014-04-13 04:00:00 NA NA 0.900
8 2014-04-13 16:00:00 NA NA 3.00
9 2014-04-14 07:00:00 5. 15. NA
10 2014-04-14 16:00:00 NA NA 0.400
Here's a data.table approach making use of its non-equi join capabilities:
library(data.table)
setDT(df)
df[df[!is.na(he)][df[!is.na(RRR)], on = .(time>time), rrr_time := i.time],
on = .(time, he), rrr_time := i.rrr_time][, diff := difftime(time, rrr_time, units = "hours")]
The result is:
# time RRR he rrr_time diff
# <POSc> <num> <num> <POSc> <difftime>
# 1: 2014-04-07 16:00:00 NA 1 <NA> NA hours
# 2: 2014-04-08 04:00:00 2.0 NA <NA> NA hours
# 3: 2014-04-08 19:00:00 NA 2 2014-04-08 04:00:00 15 hours
# 4: 2014-04-09 04:00:00 4.0 NA <NA> NA hours
# 5: 2014-04-10 04:00:00 NA 3 2014-04-09 04:00:00 24 hours
# 6: 2014-04-12 22:00:00 NA 4 2014-04-09 04:00:00 90 hours
# 7: 2014-04-13 04:00:00 0.9 NA <NA> NA hours
# 8: 2014-04-13 16:00:00 3.0 NA <NA> NA hours
# 9: 2014-04-14 07:00:00 NA 5 2014-04-13 16:00:00 15 hours
# 10: 2014-04-14 16:00:00 0.4 NA <NA> NA hours
# 11: 2014-04-15 04:00:00 11.0 NA <NA> NA hours
# 12: 2014-04-18 04:00:00 NA 6 2014-04-15 04:00:00 72 hours
# 13: 2014-04-20 04:00:00 0.5 NA <NA> NA hours
# 14: 2014-04-20 16:00:00 1.0 NA <NA> NA hours
# 15: 2014-04-20 22:00:00 NA 7 2014-04-20 16:00:00 6 hours
# 16: 2014-04-22 04:00:00 13.0 NA <NA> NA hours
# 17: 2014-04-22 16:00:00 4.0 NA <NA> NA hours
# 18: 2014-04-23 04:00:00 0.8 NA <NA> NA hours
# 19: 2014-04-23 16:00:00 0.3 NA <NA> NA hours
# 20: 2014-04-24 22:00:00 NA 8 2014-04-23 16:00:00 30 hours
# 21: 2014-04-26 01:00:00 NA 9 2014-04-23 16:00:00 57 hours
# 22: 2014-04-26 04:00:00 8.0 NA <NA> NA hours
# 23: 2014-04-26 16:00:00 4.0 NA <NA> NA hours
# 24: 2014-04-27 04:00:00 11.0 NA <NA> NA hours
# 25: 2014-04-27 16:00:00 1.0 NA <NA> NA hours
# 26: 2014-04-29 07:00:00 NA 10 2014-04-27 16:00:00 39 hours
# 27: 2014-04-29 16:00:00 7.0 NA <NA> NA hours
# 28: 2014-05-01 04:00:00 1.0 NA <NA> NA hours
# 29: 2014-05-01 16:00:00 0.4 NA <NA> NA hours
# 30: 2014-05-01 22:00:00 NA 11 2014-05-01 16:00:00 6 hours
# 31: 2014-05-02 04:00:00 4.0 NA <NA> NA hours
# time RRR he rrr_time diff
A base alternative with findInterval (here d is the original data frame, called df in the question):
t_he <- d$time[!is.na(d$he)]
t_r <- d$time[!is.na(d$RRR)]
i <- findInterval(t_he, t_r)
d[!is.na(d$he), "diff"] <- t_he - t_r[replace(i, i == 0, NA)]
# time RRR he diff
# 1 2014-04-07 16:00:00 NA 1 NA hours
# 2 2014-04-08 04:00:00 2.0 NA NA hours
# 3 2014-04-08 19:00:00 NA 2 15 hours
# 4 2014-04-09 04:00:00 4.0 NA NA hours
# 5 2014-04-10 04:00:00 NA 3 24 hours
# 6 2014-04-12 22:00:00 NA 4 90 hours
# 7 2014-04-13 04:00:00 0.9 NA NA hours
# 8 2014-04-13 16:00:00 3.0 NA NA hours
# 9 2014-04-14 07:00:00 NA 5 15 hours
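Subtracting two POSIXct vectors lets R pick the units of the result automatically, so the numbers could come out in minutes or days on other data. A self-contained sketch of the same findInterval idea that fixes the units to hours, using the first nine rows from the question and assuming UTC timestamps:

```r
d <- data.frame(
  time = as.POSIXct(c("2014-04-07 18:00:00", "2014-04-08 06:00:00",
                      "2014-04-08 21:00:00", "2014-04-09 06:00:00",
                      "2014-04-10 06:00:00", "2014-04-13 00:00:00",
                      "2014-04-13 06:00:00", "2014-04-13 18:00:00",
                      "2014-04-14 09:00:00"), tz = "UTC"),
  RRR = c(NA, 2, NA, 4, NA, NA, 0.9, 3, NA),
  he  = c(1, NA, 2, NA, 3, 4, NA, NA, 5)
)

t_he <- d$time[!is.na(d$he)]   # event times
t_r  <- d$time[!is.na(d$RRR)]  # times with a precipitation value (must be sorted)

# index of the last precipitation time at or before each event; 0 means none yet
i <- findInterval(t_he, t_r)
i[i == 0] <- NA

d$diff <- NA_real_
d$diff[!is.na(d$he)] <- as.numeric(difftime(t_he, t_r[i], units = "hours"))
d
```

Storing `diff` as a plain numeric column in hours matches the layout requested in the question.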