I have a dataset of one year of hourly records. For my analysis I need to extract, separately for each month of the year, each hour of the day: January 00h, January 01h, January 02h, January 03h, ..., March 21h, March 22h, March 23h, and so on.
Thanks in advance for any useful help!
Select observations for specified hours of the day during a period (datetime, filter, subset, ...)
The code below (filter with month(time)) generates a fatal error:
Error: unexpected ')' in "at<-subset(groenenborgerno, timestamp=hour(time) == 01))"
groenenborgerno$timestamp <- as.POSIXct(groenenborgerno$date, format="%Y-%m-%d %H:%M:%S")
library(lubridate)
january01 <- filter(atimeframe,
                    (month(time) == 01 & hour(time) == 01))
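For what it's worth, the unexpected ')' error comes from the extra closing parenthesis, and subset() expects a logical condition rather than timestamp = .... A minimal corrected sketch, assuming the POSIXct column is the timestamp created above:
library(lubridate)
# keep only the January observations taken at 01:00
january01 <- subset(groenenborgerno,
                    month(timestamp) == 1 & hour(timestamp) == 1)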
Since no data is provided, I will try to answer your question with sample data:
require(lubridate)
require(tidyverse)
## Create some sample data:
time_index <- seq(from = as.POSIXct("2017-01-01 07:00"),
                  to = as.POSIXct("2018-01-01 18:00"), by = "hour")
value <- rnorm(n = length(time_index))
data <- data.frame(time_index, value)
data <- data %>%
  mutate(hour = hour(time_index),
         month = month(time_index)) %>%
  group_by(month, hour)
data
# A tibble: 8,772 x 4
# Groups: month, hour [288]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 07:00:00 -0.626 7 1
2 2017-01-01 08:00:00 0.184 8 1
3 2017-01-01 09:00:00 -0.836 9 1
4 2017-01-01 10:00:00 1.60 10 1
5 2017-01-01 11:00:00 0.330 11 1
6 2017-01-01 12:00:00 -0.820 12 1
7 2017-01-01 13:00:00 0.487 13 1
8 2017-01-01 14:00:00 0.738 14 1
9 2017-01-01 15:00:00 0.576 15 1
10 2017-01-01 16:00:00 -0.305 16 1
# ... with 8,762 more rows
Then just filter() the hour/month combination you want, like so:
data %>% filter(hour > 12 & month == 1)
# A tibble: 347 x 4
# Groups: month, hour [11]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 13:00:00 0.487 13 1
2 2017-01-01 14:00:00 0.738 14 1
3 2017-01-01 15:00:00 0.576 15 1
4 2017-01-01 16:00:00 -0.305 16 1
5 2017-01-01 17:00:00 1.51 17 1
6 2017-01-01 18:00:00 0.390 18 1
7 2017-01-01 19:00:00 -0.621 19 1
8 2017-01-01 20:00:00 -2.21 20 1
9 2017-01-01 21:00:00 1.12 21 1
10 2017-01-01 22:00:00 -0.0449 22 1
# ... with 337 more rows
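If you really do need a separate object for every month/hour combination (January 00h, January 01h, and so on), one option is to split the data into a named list rather than creating dozens of individual variables; a sketch using base split():
# one data frame per (month, hour) combination; names look like "1.0", "1.1", ..., "12.23"
by_month_hour <- split(data, list(data$month, data$hour))
by_month_hour[["1.13"]]   # January, 13:00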
Related
I have a daily time series as in the example below. I need to fill the NA values, but only for the morning hours from 6:00 AM to 9:00 AM; the gap filling should use the average of the remaining hours of the same day, and likewise for each morning of every other day.
set.seed(3)
df <- data.frame(
  timestamp = seq(as.POSIXct('2022-01-01', tz = 'utc'),
                  as.POSIXct('2022-01-10 23:00', tz = 'utc'),
                  by = '1 hour'),
  value = runif(240)
)
df$value[runif(nrow(df)) < 0.3] <- NA
If I understand you correctly, this is one way to solve the task in dplyr (every hour before 10:00 with a missing value is filled with the mean of the same day's values after 10:00; adjust the hour cut-offs if you only want 06:00-09:00):
df %>%
  dplyr::mutate(after = ifelse(lubridate::hour(timestamp) > 10, value, NA),  # values later in the day
                day = format(timestamp, format = '%Y-%m-%d')) %>%
  dplyr::group_by(day) %>%
  dplyr::mutate(value = ifelse(lubridate::hour(timestamp) < 10 & is.na(value),
                               mean(after, na.rm = TRUE), value)) %>%
  dplyr::ungroup() %>%
  dplyr::select(-after, -day)
# A tibble: 240 x 2
timestamp value
<dttm> <dbl>
1 2022-01-01 00:00:00 0.427
2 2022-01-01 01:00:00 0.808
3 2022-01-01 02:00:00 0.385
4 2022-01-01 03:00:00 0.427
5 2022-01-01 04:00:00 0.602
6 2022-01-01 05:00:00 0.604
7 2022-01-01 06:00:00 0.125
8 2022-01-01 07:00:00 0.295
9 2022-01-01 08:00:00 0.578
10 2022-01-01 09:00:00 0.631
# ... with 230 more rows
# i Use `print(n = ...)` to see more rows
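If you want the fill to apply strictly to 06:00-09:00 and the average to use all other hours of the day, a hedged variant of the same idea using dplyr::between() might look like this:
library(dplyr)
library(lubridate)

df %>%
  mutate(day = as.Date(timestamp),
         # values outside the 06:00-09:00 window, used for the daily average
         rest_of_day = ifelse(between(hour(timestamp), 6, 9), NA_real_, value)) %>%
  group_by(day) %>%
  mutate(value = ifelse(between(hour(timestamp), 6, 9) & is.na(value),
                        mean(rest_of_day, na.rm = TRUE), value)) %>%
  ungroup() %>%
  select(-day, -rest_of_day)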
I have data resembling the following structure, where the when variable denotes the day of measurement:
## Generate data.
set.seed(1986)
n <- 1000
y <- rnorm(n)
when <- as.POSIXct(strftime(seq(as.POSIXct("2021-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
                                as.POSIXct("2022-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"),
                                length.out = n),
                            format = "%Y-%m-%d"))
dta <- data.frame(y, when)
head(dta)
#> y when
#> 1 -0.04625141 2021-11-01
#> 2 0.28000082 2021-11-01
#> 3 0.25317063 2021-11-01
#> 4 -0.96411077 2021-11-02
#> 5 0.49222664 2021-11-02
#> 6 -0.69874551 2021-11-02
I need to compute averages of y over time. For instance, the following computes daily averages:
## Compute daily averages of y.
library(dplyr)
daily_avg <- dta %>%
  group_by(when) %>%
  summarise(daily_mean = mean(y)) %>%
  ungroup()
daily_avg
#> # A tibble: 366 × 2
#> when daily_mean
#> <dttm> <dbl>
#> 1 2021-11-01 00:00:00 0.162
#> 2 2021-11-02 00:00:00 -0.390
#> 3 2021-11-03 00:00:00 -0.485
#> 4 2021-11-04 00:00:00 -0.152
#> 5 2021-11-05 00:00:00 0.425
#> 6 2021-11-06 00:00:00 0.726
#> 7 2021-11-07 00:00:00 0.855
#> 8 2021-11-08 00:00:00 0.0608
#> 9 2021-11-09 00:00:00 -0.995
#> 10 2021-11-10 00:00:00 0.395
#> # … with 356 more rows
I am having a hard time computing weekly averages. Here is what I have tried so far:
## Fail - compute weekly averages of y.
library(lubridate)
dta$week <- week(dta$when) # This is wrong.
dta[165: 171, ]
#> y when week
#> 165 0.9758333 2021-12-30 52
#> 166 -0.8630091 2021-12-31 53
#> 167 0.3054031 2021-12-31 53
#> 168 1.2814421 2022-01-01 1
#> 169 0.1025440 2022-01-01 1
#> 170 1.3665411 2022-01-01 1
#> 171 -0.5373058 2022-01-02 1
Using the week function from the lubridate package ignores the fact that my data span across years. So, if I were to use code similar to the one I used for the daily averages, I would aggregate observations belonging to different years (but with the same week number). How can I solve this?
You can use %V (from ?strptime) for weeks, combining it with the year.
dta %>%
  group_by(week = format(when, format = "%Y-%V")) %>%
  summarize(daily_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 2
# week daily_mean
# <chr> <dbl>
# 1 2021-44 0.179
# 2 2021-45 0.0477
# 3 2021-46 0.0340
# 4 2021-47 0.356
# 5 2021-48 0.0544
# 6 2021-49 -0.0948
# 7 2021-50 -0.0419
# 8 2021-51 0.209
# 9 2021-52 0.251
# 10 2022-01 -0.197
# # ... with 44 more rows
There are different variants of "week", depending on your preference.
%V
Week of the year as decimal number (01–53) as defined in ISO 8601.
If the week (starting on Monday) containing 1 January has four or more
days in the new year, then it is considered week 1. Otherwise, it is
the last week of the previous year, and the next week is week 1.
(Accepted but ignored on input.)
%W
Week of the year as decimal number (00–53) using Monday as the first
day of week (and typically with the first Monday of the year as day 1
of week 1). The UK convention.
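One caveat with %V: it is defined against the ISO week-based year, so pairing it with the calendar year %Y can mislabel the days around New Year (for example, 2022-01-01 belongs to ISO week 52 of 2021, yet "%Y-%V" labels it "2022-52"). If you want strictly ISO-consistent groups, a variant pairing %V with the week-based year %G (also documented in ?strptime) would be:
dta %>%
  group_by(week = format(when, format = "%G-%V")) %>%
  summarize(weekly_mean = mean(y)) %>%
  ungroup()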
You can extract year and week from the dates and group by both:
dta %>%
  mutate(year = year(when),
         week = week(when)) %>%
  group_by(year, week) %>%
  summarise(y_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 3
# # Groups: year, week [54]
# year week y_mean
# <dbl> <dbl> <dbl>
# 1 2021 44 -0.222
# 2 2021 45 0.234
# 3 2021 46 0.0953
# 4 2021 47 0.206
# 5 2021 48 0.192
# 6 2021 49 -0.0831
# 7 2021 50 0.0282
# 8 2021 51 0.196
# 9 2021 52 0.132
# 10 2021 53 -0.279
# # ... with 44 more rows
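A lubridate-only variant of the same idea, assuming a version of lubridate that provides isoyear() and isoweek() (these label weeks per ISO 8601 and keep the year consistent with the week at year boundaries):
dta %>%
  group_by(year = isoyear(when), week = isoweek(when)) %>%
  summarise(y_mean = mean(y), .groups = "drop")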
I have some air pollution data measured hourly:
Datetime              PM2.5   Station.id
2020-01-01 00:00:00      10            1
2020-01-01 01:00:00      NA            1
2020-01-01 02:00:00      15            1
2020-01-01 03:00:00      NA            1
2020-01-01 04:00:00       7            1
2020-01-01 05:00:00      20            1
2020-01-01 06:00:00      30            1
2020-01-01 00:00:00      NA            2
2020-01-01 01:00:00      17            2
2020-01-01 02:00:00      21            2
2020-01-01 03:00:00      55            2
I have a very large amount of data collected from many stations. Using R, what is the most efficient way to remove a day when it has (1) a total of 18 hours of missing data AND (2) 8 hours of continuous missing data?
PS: in the original data, the missing hours may either have been removed entirely or be present as NA rows.
The "most efficient" way will almost certainly use data.table. Something like this:
library(data.table)
setDT(your_data)
your_data[, date := as.IDate(Datetime)][,
  # keep a day only if it does NOT have >= 18 missing hours
  # AND a run of >= 8 consecutive missing hours
  if (!(sum(is.na(PM2.5)) >= 18 &
        max(c(0, with(rle(is.na(PM2.5)), lengths[values]))) >= 8)) .SD,
  by = .(date)   # add Station.id here when the data contains several stations
]
# date Datetime PM2.5
# 1: 2020-01-01 2020-01-01 00:00:00 10
# 2: 2020-01-01 2020-01-01 01:00:00 NA
# 3: 2020-01-01 2020-01-01 02:00:00 15
# 4: 2020-01-01 2020-01-01 03:00:00 NA
# 5: 2020-01-01 2020-01-01 04:00:00 7
# 6: 2020-01-01 2020-01-01 05:00:00 20
# 7: 2020-01-01 2020-01-01 06:00:00 30
Using this sample data:
your_data = fread(text = "Datetime,PM2.5
2020-01-01 00:00:00,10
2020-01-01 01:00:00,NA
2020-01-01 02:00:00,15
2020-01-01 03:00:00,NA
2020-01-01 04:00:00,7
2020-01-01 05:00:00,20
2020-01-01 06:00:00,30")
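If you prefer dplyr, roughly the same logic can be expressed as a grouped filter; a sketch, assuming the full data has the Station.id column shown in the question:
library(dplyr)

your_data %>%
  mutate(date = as.Date(Datetime)) %>%
  group_by(Station.id, date) %>%
  # drop the whole day if it has >= 18 missing hours AND a run of >= 8 consecutive missing hours
  filter(!(sum(is.na(PM2.5)) >= 18 &
           with(rle(is.na(PM2.5)), max(c(0, lengths[values]))) >= 8)) %>%
  ungroup()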
I have a data table (lv_timest) with time stamps every 3 hours for each date:
# A tibble: 6 × 5
LV0_mean LV1_mean LV2_mean Date_time Date
<dbl> <dbl> <dbl> <S3:POSIXct> <date>
1 0.778 -4.12 0.736 2016-12-28 00:00:00 2016-12-28
2 0.376 -0.234 0.388 2016-12-28 03:00:00 2016-12-28
3 0.409 1.46 0.241 2016-12-28 06:00:00 2016-12-28
4 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
5 0.759 2.91 0.735 2016-12-28 12:00:00 2016-12-28
6 0.857 3.00 0.803 2016-12-28 15:00:00 2016-12-28
from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):
# A tibble: 6 × 4
Event_number Date_time Date Date_time_new
<int> <S3: POSIXct> <date> <S3: POSIXct>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6 369 2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00
For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <S3: POSIXct> <date> <S3: POSIXct> <dbl> <dbl> <dbl> <S3: POSIXct> <date>
75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
In fact, the time difference should not be over one hour. I thought of using the fuzzyjoin package for this and writing a function that computes the time difference between the timestamps of the two tables, in hours. However, fuzzy_inner_join replicates rows of the second table and matches several timestamps from the first table to them.
require(lubridate)
require(fuzzyjoin)
diff_timest <- function(x, y) {abs(x %--% y %/% hours(1)) <= 1}  # time interval as hours ≤ 1 hour
match_timest <- fuzzy_inner_join(event_timest, lv_timest,
                                 by = c("Date" = "Date",
                                        "Date_time_new" = "Date_time"),
                                 match_fun = list(`==`, diff_timest))
head(match_timest)
# A tibble: 6 × 9
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <dttm> <date> <dttm> <dbl> <dbl> <dbl> <dttm> <date>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00 1.24 1.83 2.05 2016-12-30 15:00:00 2016-12-30
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00 -0.128 -5.43 2.72 2017-01-07 06:00:00 2017-01-07
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.0751 0.171 2.56 2017-01-09 09:00:00 2017-01-09
5 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
6 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
Would there be another way to do this?
Conceptually, joining is a procedure of first getting all combinations of all rows, followed by a filter. We can do this manually:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
datetimes_a <- tibble(
id = seq(3),
group = "A",
datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
id = seq(3),
group = "B",
datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)
datetimes_a %>%
  # start with the cross product of all possible pairs
  expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
  mutate(diff = abs(datetime - datetime_b)) %>%
  # keep the shortest time difference for each row of datetimes_a
  group_by(id) %>%
  arrange(diff) %>%
  slice(1) %>%
  # time diff must be less than 1 hr
  filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups:   id [3]
#> id group datetime id_b group_b datetime_b diff
#> <int> <chr> <dttm> <int> <chr> <dttm> <drtn>
#> 1 1 A 2016-12-28 00:00:00 1 B 2016-12-28 00:00:10 10 secs
#> 2 2 A 2016-12-28 03:00:00 2 B 2016-12-28 03:20:00 1200 secs
#> 3 3 A 2016-12-28 23:59:59 3 B 2016-12-29 00:00:02 3 secs
Created on 2022-02-08 by the reprex package (v2.0.1)
This works also if the nearest timepoint is on another date e.g. right before and after midnight.
I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:
library(tidyverse)
library(lubridate)
match_timest <- event_timest %>%
  inner_join(lv_timest, by = "Date") %>%
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  group_by(Date_time.y) %>%
  filter(diff <= 1 & diff == min(diff)) %>%
  ungroup() %>%
  select(!diff)
Note:
this will still match multiple rows if more than one row is at exactly the same shortest difference from the index timestamp.
this won’t match timestamps from different dates — eg, 23:59:59 on 1/1/22 won’t be matched with 00:00:00 on 1/2/22. If you’d like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above.
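As a side note, assuming dplyr 1.1.0 or later, the explicit spelling of that Cartesian join is cross_join() (the by = character() form is deprecated there):
# all pairs of event and lv rows, to be filtered on the time difference afterwards
all_pairs <- event_timest %>% cross_join(lv_timest)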
I am doing time series analysis. Part of my data is as follow:
# A tibble: 6 x 3
time DOY Value
<dttm> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664
2 2015-01-08 13:00:00 8 0.647
3 2015-01-11 14:00:00 11 0.669
4 2015-01-11 15:00:00 11 0.644
5 2015-02-04 12:30:00 35 0.664
6 2015-02-04 13:00:00 35 0.647
I would like to calculate the maximum values of 7 consecutive days of the data. For example:
# A tibble: 6 x 4
time DOY Value Max
<dttm> <dbl> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664 11.669
2 2015-01-08 13:00:00 8 0.647 11.669
3 2015-01-11 14:00:00 11 0.669 11.669
4 2015-01-11 15:00:00 11 0.644 11.669
5 2015-02-04 12:30:00 35 0.664 35.664
6 2015-02-04 13:00:00 35 0.647 35.664
Welcome to R and Stack Overflow. As mentioned above, you will find many friends here if you provide a reproducible example and explain what you have done and/or where things go wrong for you. This helps others to help you.
Based on your data fragment, I show some basic operations that I think might help you. You may still need to adapt the principles to your problem case.
data
I turned your example into a tibble. Please note: when you work with dates, times, or date-times, I recommend you use the respective variable type. This will give you access to helpful functions, etc.
Please also note that although you show a 6 x 3 tibble above, the information really comprises four pieces (Date, time, DOY, and Value), which is how I lay it out below.
library(dplyr) # basic dataframe/tibble operations
library(lubridate) # for datetime handling
df <- tribble(
~Date, ~time, ~DOY, ~Value
,"2015-01-08", "12:30:00", 8, 0.664
,"2015-01-08", "13:00:00", 8, 0.647
,"2015-01-11", "14:00:00", 11, 0.669
,"2015-01-11", "15:00:00", 11, 0.644
,"2015-02-04", "12:30:00", 35, 0.664
,"2015-02-04", "13:00:00", 35, 0.647
)
df <- df %>%
  mutate(timestamp = ymd_hms(paste(Date, time)))
This yields:
df
# A tibble: 6 x 5
Date time DOY Value timestamp
<chr> <chr> <dbl> <dbl> <dttm>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00
Note: timestamp is now a datetime object (dttm).
binning of data
It is not fully clear what your consecutive 7 days are and/or how you "group" them.
I assume you want to pick the 7 days of a week.
As timestamp is a dttm, we can use the power of {lubridate} and extract the week from the datetime.
Note: you may want to bin/group your data differently. Think about what you want to achieve here and adapt this accordingly.
df <- df %>% mutate(bin = week(timestamp))
df
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 2
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 2
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 2
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 2
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
If you want to work on "7 consecutive days" you will need to identify the groups of 7 days. Again, there are different ways to do this; check what the integer-division (and modulo) operators do and how to apply them to your DOY (see the clarification further down).
operating on your groups/bins
You describe looking for the maximum per bin (7 days ~ week).
For such problems {dplyr} offers grouped operations. Read up on them:
df %>%
  group_by(bin) %>%
  # create a new variable and assign the max of each group to it
  summarise(MaxValue = max(Value))
# A tibble: 2 x 2
    bin MaxValue
  <dbl>    <dbl>
1     2    0.669
2     5    0.664
Obviously, you can perform many operations (summaries of your bins/groups).
Note: You can create bins on multiple variables. Read up on group_by() and summarise(..., .groups = "drop"), if you want to use this interim tibble for further calculations.
Hope this gets you started.
clarification on grouping by 7 days
If you have a sequence of (integer) numbers, there is a neat way to group them into n-element bins: integer division.
In your case the data already comes with a day-of-year (DOY) variable. For completeness: with lubridate you can pull the DOY from a timestamp with the function yday(), i.e. df %>% mutate(DOY = yday(timestamp)).
# let's use integer division to group our DOYs into groups of 7
##--------- does not look at date or day
##--------- DOY 1-6 := 0, DOY 7-13 := 1, ..., DOY 35-41 := 5
df <- df %>%
  mutate(bin = DOY %/% 7)
This yields:
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 1
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 1
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 1
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 1
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
And then build your max summary as before on the (new) grouping variable:
df %>%
  group_by(bin) %>%
  summarise(MaxValue = max(Value))
# A tibble: 2 x 2
    bin MaxValue
  <dbl>    <dbl>
1     1    0.669
2     5    0.664
For the example data given, the result is identical. However, with your full dataset, the offset between "weeks" (with their defined start day) and cutting your DOYs into bins of 7 consecutive days will give a different summary (unless the first day of the week (*) coincides with DOY 1).
(*): in lubridate you can set weeks to start on Monday or Sunday via a parameter (in case you ever need this).
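Finally, if by "maximum over 7 consecutive days" you actually mean a rolling 7-day window rather than fixed bins, one simple (not the fastest) sketch works directly on DOY; here I assume a trailing window covering the 7 days up to and including each row's DOY (packages such as slider or zoo offer faster rolling helpers if performance matters):
# for each row, the maximum Value whose DOY falls in [DOY - 6, DOY]
df <- df %>%
  mutate(Max7 = sapply(DOY, function(d) max(Value[DOY >= d - 6 & DOY <= d])))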