I have a data table (lv_timest) with time stamps every 3 hours for each date:
# A tibble: 6 × 5
LV0_mean LV1_mean LV2_mean Date_time Date
<dbl> <dbl> <dbl> <S3:POSIXct> <date>
1 0.778 -4.12 0.736 2016-12-28 00:00:00 2016-12-28
2 0.376 -0.234 0.388 2016-12-28 03:00:00 2016-12-28
3 0.409 1.46 0.241 2016-12-28 06:00:00 2016-12-28
4 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
5 0.759 2.91 0.735 2016-12-28 12:00:00 2016-12-28
6 0.857 3.00 0.803 2016-12-28 15:00:00 2016-12-28
from which I would like to extract the time stamps that match as closely as possible those of another table (event_timest):
# A tibble: 6 × 4
Event_number Date_time Date Date_time_new
<int> <S3: POSIXct> <date> <S3: POSIXct>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00
5 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00
6 369 2017-01-11 07:24:00 2017-01-11 2017-01-11 07:00:00
For example, for row 1 in table event_timest, I would extract row 4 from table lv_timest:
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <S3: POSIXct> <date> <S3: POSIXct> <dbl> <dbl> <dbl> <S3: POSIXct> <date>
75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
In fact, the time difference should not be over one hour. I thought of using the fuzzyjoin package for this and writing a function that computes the time difference between the timestamps of the two tables, in hours. However, fuzzy_inner_join duplicates rows from the first table whenever several timestamps in the second table match (see rows 4 and 5 below).
require(lubridate)
require(fuzzyjoin)
diff_timest <- function(x, y){abs(x%--%y %/% hours(1)) <= 1} # time interval as hours ≤ 1 hour
match_timest <- fuzzy_inner_join(event_timest, lv_timest,
by = c("Date" = "Date",
"Date_time_new" = "Date_time"),
match_fun = list(`==`, diff_timest))
head(match_timest)
# A tibble: 6 × 9
Event_number Date_time.x Date.x Date_time_new LV0_mean LV1_mean LV2_mean Date_time.y Date.y
<int> <dttm> <date> <dttm> <dbl> <dbl> <dbl> <dttm> <date>
1 75 2016-12-28 08:00:00 2016-12-28 2016-12-28 08:00:00 0.760 2.07 0.460 2016-12-28 09:00:00 2016-12-28
2 123 2016-12-30 14:02:00 2016-12-30 2016-12-30 14:00:00 1.24 1.83 2.05 2016-12-30 15:00:00 2016-12-30
3 264 2017-01-07 06:12:00 2017-01-07 2017-01-07 06:00:00 -0.128 -5.43 2.72 2017-01-07 06:00:00 2017-01-07
4 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.0751 0.171 2.56 2017-01-09 09:00:00 2017-01-09
5 317 2017-01-09 10:59:00 2017-01-09 2017-01-09 11:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
6 318 2017-01-09 13:31:00 2017-01-09 2017-01-09 14:00:00 -0.204 -0.797 2.28 2017-01-09 12:00:00 2017-01-09
Would there be another way to do this?
A join is conceptually a cross product of all row combinations followed by a filter. We can do this manually:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
datetimes_a <- tibble(
id = seq(3),
group = "A",
datetime = c("2016-12-28 00:00:00", "2016-12-28 03:00:00", "2016-12-28 23:59:59") %>% as.POSIXct()
)
datetimes_b <- tibble(
id = seq(3),
group = "B",
datetime = c("2016-12-28 00:00:10", "2016-12-28 03:20:00", "2016-12-29 00:00:02") %>% as.POSIXct()
)
datetimes_a %>%
# start with cross product of all possible pairs
expand_grid(datetimes_b %>% rename_all(~ paste0(.x, "_b"))) %>%
mutate(diff = abs(datetime - datetime_b)) %>%
# get shortest time difference
group_by(id) %>%
arrange(diff) %>%
slice(1) %>%
# time diff must be less than 1hr
filter(diff < hours(1))
#> # A tibble: 3 x 7
#> # Groups: id [3]
#> id group datetime id_b group_b datetime_b diff
#> <int> <chr> <dttm> <int> <chr> <dttm> <drtn>
#> 1 1 A 2016-12-28 00:00:00 1 B 2016-12-28 00:00:10 10 secs
#> 2 2 A 2016-12-28 03:00:00 2 B 2016-12-28 03:20:00 1200 secs
#> 3 3 A 2016-12-28 23:59:59 3 B 2016-12-29 00:00:02 3 secs
Created on 2022-02-08 by the reprex package (v2.0.1)
This works also if the nearest timepoint is on another date e.g. right before and after midnight.
I would suggest a standard join, followed by a grouped filter to the closest instance of each timestamp:
library(tidyverse)
library(lubridate)
match_timest <- event_timest %>%
inner_join(lv_timest, by = "Date") %>%
mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
group_by(Date_time.y) %>%
filter(diff <= 1 & diff == min(diff)) %>%
ungroup() %>%
select(!diff)
Note:
- this will still match multiple rows if more than one row lies at exactly the same shortest difference from the index timestamp.
- this won't match timestamps from different dates: e.g., 23:59:59 on 2022-01-01 won't be matched with 00:00:00 on 2022-01-02. If you'd like to do that, you can use a full Cartesian join (full_join(lv_timest, by = character())) rather than the inner_join() above, as sketched below.
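For completeness, a minimal sketch of that variant (assuming dplyr >= 1.1.0, where cross_join() supersedes full_join(..., by = character()); on older versions, substitute the latter). Grouping by Event_number keeps the single nearest lv_timest row per event, even across midnight:
library(dplyr)
match_timest <- event_timest %>%
  cross_join(lv_timest) %>%                 # all event x lv combinations
  mutate(diff = abs(as.numeric(Date_time.x - Date_time.y, units = "hours"))) %>%
  group_by(Event_number) %>%                # one group per event
  filter(diff <= 1 & diff == min(diff)) %>% # nearest match, at most one hour away
  ungroup() %>%
  select(!diff)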
Related
I have a daily time series as in the example here. I need to fill the NA values, but only for the morning hours, from 6:00 AM to 9:00 AM; each gap should be filled with the average over the remaining hours of the same day, and likewise for every other morning.
set.seed(3)
df <- data.frame(
  timestamp = seq(as.POSIXct('2022-01-01', tz = 'utc'),
                  as.POSIXct('2022-01-10 23:00', tz = 'utc'),
                  by = '1 hour'),
  value = runif(240)
)
df$value[runif(nrow(df)) < 0.3] <- NA
If I understand you correctly, this is one way to solve the task in dplyr:
df %>%
  dplyr::mutate(after = ifelse(lubridate::hour(timestamp) > 10, value, NA),
                day = format(timestamp, format = '%Y-%m-%d')) %>%
  dplyr::group_by(day) %>%
  dplyr::mutate(value = ifelse(lubridate::hour(timestamp) < 10 & is.na(value),
                               mean(after, na.rm = TRUE), value)) %>%
  dplyr::ungroup() %>%
  dplyr::select(-after, -day)
# A tibble: 240 x 2
timestamp value
<dttm> <dbl>
1 2022-01-01 00:00:00 0.427
2 2022-01-01 01:00:00 0.808
3 2022-01-01 02:00:00 0.385
4 2022-01-01 03:00:00 0.427
5 2022-01-01 04:00:00 0.602
6 2022-01-01 05:00:00 0.604
7 2022-01-01 06:00:00 0.125
8 2022-01-01 07:00:00 0.295
9 2022-01-01 08:00:00 0.578
10 2022-01-01 09:00:00 0.631
# ... with 230 more rows
# i Use `print(n = ...)` to see more rows
I have a list of tibbles that look like this:
> head(temp)
$AT
# A tibble: 8,784 × 2
price_eur datetime
<dbl> <dttm>
1 50.9 2021-01-01 00:00:00
2 48.2 2021-01-01 01:00:00
3 44.7 2021-01-01 02:00:00
4 42.9 2021-01-01 03:00:00
5 40.4 2021-01-01 04:00:00
6 40.2 2021-01-01 05:00:00
7 39.6 2021-01-01 06:00:00
8 40.1 2021-01-01 07:00:00
9 41.3 2021-01-01 08:00:00
10 44.9 2021-01-01 09:00:00
# … with 8,774 more rows
$IE
# A tibble: 7,198 × 2
price_eur datetime
<dbl> <dttm>
1 54.0 2021-01-01 01:00:00
2 53 2021-01-01 02:00:00
3 51.2 2021-01-01 03:00:00
4 48.1 2021-01-01 04:00:00
5 47.3 2021-01-01 05:00:00
6 47.6 2021-01-01 06:00:00
7 45.4 2021-01-01 07:00:00
8 43.4 2021-01-01 08:00:00
9 47.8 2021-01-01 09:00:00
10 51.8 2021-01-01 10:00:00
# … with 7,188 more rows
$`IT-Calabria`
# A tibble: 8,736 × 2
price_eur datetime
<dbl> <dttm>
1 50.9 2021-01-01 00:00:00
2 48.2 2021-01-01 01:00:00
3 44.7 2021-01-01 02:00:00
4 42.9 2021-01-01 03:00:00
5 40.4 2021-01-01 04:00:00
6 40.2 2021-01-01 05:00:00
7 39.6 2021-01-01 06:00:00
8 40.1 2021-01-01 07:00:00
9 41.3 2021-01-01 08:00:00
10 41.7 2021-01-01 09:00:00
# … with 8,726 more rows
The number of rows differs because there are missing observations, usually one or several days.
Ideally I need a tibble with a single datetime index and corresponding columns, with NAs where data is missing; this is where I'm stuck.
We can do a full join by 'datetime'
library(dplyr)
library(purrr)
reduce(temp, full_join, by = "datetime")
If we need to rename the column 'price_eur' before the join, loop over the list with imap, rename 'price_eur' to the corresponding list name (.y), and do the join within reduce:
imap(temp, ~ .x %>%
rename(!! .y := price_eur)) %>%
reduce(full_join, by = 'datetime')
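A quick toy example (made-up data standing in for your temp list) to show the shape of the result:
library(dplyr)
library(purrr)
temp <- list(
  AT = tibble(price_eur = c(50.9, 48.2),
              datetime = as.POSIXct(c("2021-01-01 00:00", "2021-01-01 01:00"), tz = "UTC")),
  IE = tibble(price_eur = c(54.0, 53.0),
              datetime = as.POSIXct(c("2021-01-01 01:00", "2021-01-01 02:00"), tz = "UTC"))
)
imap(temp, ~ rename(.x, !!.y := price_eur)) %>%
  reduce(full_join, by = "datetime")
# a tibble with columns datetime, AT, IE; timestamps present in only
# one of the tibbles get NA in the other column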
I am doing time series analysis. Part of my data is as follows:
# A tibble: 6 x 3
time DOY Value
<dttm> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664
2 2015-01-08 13:00:00 8 0.647
3 2015-01-11 14:00:00 11 0.669
4 2015-01-11 15:00:00 11 0.644
5 2015-02-04 12:30:00 35 0.664
6 2015-02-04 13:00:00 35 0.647
I would like to calculate the maximum values of 7 consecutive days of the data. For example:
# A tibble: 6 x 4
time DOY Value Max
<dttm> <dbl> <dbl> <dbl>
1 2015-01-08 12:30:00 8 0.664 11.669
2 2015-01-08 13:00:00 8 0.647 11.669
3 2015-01-11 14:00:00 11 0.669 11.669
4 2015-01-11 15:00:00 11 0.644 11.669
5 2015-02-04 12:30:00 35 0.664 35.664
6 2015-02-04 13:00:00 35 0.647 35.664
Welcome to R and Stack Overflow. As mentioned above, you will find many friends here if you provide a reproducible example and explain what you have done and/or where things go wrong for you. This helps others to help you.
Based on your data fragment, I show some basic operations that I think might help you. You may still need to adapt the principles to your problem case.
data
I turned your example into a tibble. Please note: when you work with dates, times, or date-times, I recommend you use the respective variable type. This will give you access to helpful functions, etc.
Please also note you showed a 6 × 3 tibble above. In fact, once the datetime is split up, your data is a 4-column structure with Date, time, DOY, and Value!
library(dplyr) # basic dataframe/tibble operations
library(lubridate) # for datetime handling
df <- tribble(
~Date, ~time, ~DOY, ~Value
,"2015-01-08", "12:30:00", 8, 0.664
,"2015-01-08", "13:00:00", 8, 0.647
,"2015-01-11", "14:00:00", 11, 0.669
,"2015-01-11", "15:00:00", 11, 0.644
,"2015-02-04", "12:30:00", 35, 0.664
,"2015-02-04", "13:00:00", 35, 0.647
)
df <- df %>%
  mutate(timestamp = ymd_hms(paste(Date, time)))
This yields:
df
# A tibble: 6 x 5
Date time DOY Value timestamp
<chr> <chr> <dbl> <dbl> <dttm>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00
Note: timestamp is now a datetime object (dttm).
binning of data
It is not fully clear what your "7 consecutive days" are and/or how you "group" them.
I assume you want to pick the 7 days of a calendar week.
As timestamp is a dttm, we can use the power of {lubridate} and extract the week from the datetime.
Note: you may want to bin/group your data differently. Think about what you want to achieve here and adapt this accordingly.
df <- df %>% mutate(bin = week(timestamp))
df
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 2
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 2
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 2
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 2
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
If you want to work on "7 consecutive days" you will need to identify the groups of 7 days. Again, there are different ways to do this; check what the integer division operator %/% does and how to apply it to your DOY.
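A minimal illustration of how %/% cuts a running count into blocks of 7:
(1:14) %/% 7
# [1] 0 0 0 0 0 0 1 1 1 1 1 1 1 2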
operating on your groups/bins
You describe looking for the maximum per bin (7 days ~ week).
{dplyr} offers grouped operations for such problems. Read up on them:
df %>%
  group_by(bin) %>%
  # we create a new variable and assign the max of each group to it
  summarise(MaxValue = max(Value))
# A tibble: 2 x 2
bin MaxValue
<dbl> <dbl>
1 2 0.669
2 5 0.664
Obviously, you can perform many other operations (summaries) on your bins/groups.
Note: You can create bins on multiple variables. Read up on group_by() and summarise(..., .groups = "drop") if you want to use this interim tibble for further calculations.
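For example, a short sketch using the df from above:
df %>%
  group_by(bin, DOY) %>%                              # bin on two variables at once
  summarise(MaxValue = max(Value), .groups = "drop")  # returns an ungrouped tibble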
Hope this gets you started.
clarification on grouping by 7 days
If you have a sequence of (integer) numbers, there is a neat way to group this into n-element bins, i.e. using integer division.
In your case the data already comes with a day-of-year DOY variable. For completeness: with lubridate you can pull the DOY from a timestamp with the function yday(), i.e. df %>% mutate(DOY = yday(timestamp)).
# let's use integer division to group our DOYs into groups of 7
##--------- does not look at date or day
##--------- DOY 0-6 -> bin 0, DOY 7-13 -> bin 1, ..., DOY 35-41 -> bin 5
df <- df %>%
mutate(bin = DOY %/% 7)
This yields:
# A tibble: 6 x 6
Date time DOY Value timestamp bin
<chr> <chr> <dbl> <dbl> <dttm> <dbl>
1 2015-01-08 12:30:00 8 0.664 2015-01-08 12:30:00 1
2 2015-01-08 13:00:00 8 0.647 2015-01-08 13:00:00 1
3 2015-01-11 14:00:00 11 0.669 2015-01-11 14:00:00 1
4 2015-01-11 15:00:00 11 0.644 2015-01-11 15:00:00 1
5 2015-02-04 12:30:00 35 0.664 2015-02-04 12:30:00 5
6 2015-02-04 13:00:00 35 0.647 2015-02-04 13:00:00 5
And then build your max summary as before on the (new) grouping variable:
df %>%
  group_by(bin) %>%
  summarise(MaxValue = max(Value))
# A tibble: 2 x 2
bin MaxValue
<dbl> <dbl>
1 1 0.669
2 5 0.664
For the example data given, the resulting maxima are identical. However, with your full dataset, the offset between calendar weeks (with their defined start day) and cutting your DOYs into bins of 7 consecutive days will yield a different summary (unless the first day of the week (*) coincides with DOY 1).
(*): lubridate lets you choose whether weeks start on Monday or Sunday via a week_start argument (in case you ever need this).
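A small sketch of that knob, using floor_date()'s week_start argument (1 = Monday, 7 = Sunday):
x <- ymd_hms("2015-01-08 12:30:00")           # a Thursday
floor_date(x, unit = "week", week_start = 1)  # Monday of that week: 2015-01-05
floor_date(x, unit = "week", week_start = 7)  # Sunday of that week: 2015-01-04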
I am a big fan of Hyndman's packages, but I stumbled over the Box-Cox transformation.
I have a dataframe
class(chicago_sales)
[1] "tbl_ts" "tbl_df" "tbl" "data.frame"
I am trying to mutate an extra column in which the Median_price variable is transformed.
foo <- chicago_sales %>%
mutate(bc = BoxCox(x = chicago_sales$Median_price, lambda =
BoxCox.lambda(chicago_sales$Median_price)))
This gives me some result (probably wrong, too), and I cannot apply autoplot.
I also tried to apply the code from Hyndman's book, but failed.
What am I doing wrong? Thanks!
UPDATED:
The issue: inside tsibbles, when using dplyr, you do not call chicago_sales$Median_price, but just Median_price. When using tsibbles I would advise fable and fabletools, but if you are using forecast, it should work like this:
library(tsibble)
library(dplyr)
library(forecast)
pedestrian %>%
mutate(bc = BoxCox(Count, BoxCox.lambda(Count)))
# A tsibble: 66,037 x 6 [1h] <Australia/Melbourne>
# Key: Sensor [4]
Sensor Date_Time Date Time Count bc
<chr> <dttm> <date> <int> <int> <dbl>
1 Birrarung Marr 2015-01-01 00:00:00 2015-01-01 0 1630 11.3
2 Birrarung Marr 2015-01-01 01:00:00 2015-01-01 1 826 9.87
3 Birrarung Marr 2015-01-01 02:00:00 2015-01-01 2 567 9.10
4 Birrarung Marr 2015-01-01 03:00:00 2015-01-01 3 264 7.65
5 Birrarung Marr 2015-01-01 04:00:00 2015-01-01 4 139 6.52
6 Birrarung Marr 2015-01-01 05:00:00 2015-01-01 5 77 5.54
7 Birrarung Marr 2015-01-01 06:00:00 2015-01-01 6 44 4.67
8 Birrarung Marr 2015-01-01 07:00:00 2015-01-01 7 56 5.04
9 Birrarung Marr 2015-01-01 08:00:00 2015-01-01 8 113 6.17
10 Birrarung Marr 2015-01-01 09:00:00 2015-01-01 9 166 6.82
# ... with 66,027 more rows
I used a built-in dataset from the tsibble package, as you did not provide a dput of chicago_sales.
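Applied to your own data, the same pattern would look roughly like this (untested, since chicago_sales isn't available here):
foo <- chicago_sales %>%
  mutate(bc = BoxCox(Median_price, BoxCox.lambda(Median_price)))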
I have a dataset of one year of hourly records.
For the analysis, I need to extract separately, for each month of the year, each hour of the day: January 00h, January 01h, January 02h, January 03h, ..., March 21h, March 22h, March 23h, ...
Thanks in advance for any useful help!
How do I select observations for specified hours of the day during a period (with datetime, filter, subset, ...)?
The code below (filter with month(time)) generates a fatal error:
Error: unexpected ')' in "at<-subset(groenenborgerno, timestamp=hour(time) == 01))"
groenenborgerno$timestamp <- as.POSIXct(groenenborgerno$date, format="%Y-%m-%d %H:%M:%S")
library(lubridate)
january01<-filter(atimeframe,
(month(time) == 01 & hour(time) == 01) )
Since no data is provided, I will try to answer your question with sample data:
require(lubridate)
require(tidyverse)
## Create some sample data:
time_index <- seq(from = as.POSIXct("2017-01-01 07:00"),
to = as.POSIXct("2018-01-01 18:00"), by = "hour")
value <- rnorm(n = length(time_index))
data <- data.frame(time_index,value)
data <- data %>%
  mutate(hour = hour(time_index),
         month = month(time_index)) %>%
  group_by(month, hour)
data
# A tibble: 8,772 x 4
# Groups: month, hour [288]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 07:00:00 -0.626 7 1
2 2017-01-01 08:00:00 0.184 8 1
3 2017-01-01 09:00:00 -0.836 9 1
4 2017-01-01 10:00:00 1.60 10 1
5 2017-01-01 11:00:00 0.330 11 1
6 2017-01-01 12:00:00 -0.820 12 1
7 2017-01-01 13:00:00 0.487 13 1
8 2017-01-01 14:00:00 0.738 14 1
9 2017-01-01 15:00:00 0.576 15 1
10 2017-01-01 16:00:00 -0.305 16 1
# ... with 8,762 more rows
and then just filter() for the hour/month combination you want, like so:
data %>% filter(hour > 12 & month == 1)
# A tibble: 347 x 4
# Groups: month, hour [11]
time_index value hour month
<dttm> <dbl> <int> <dbl>
1 2017-01-01 13:00:00 0.487 13 1
2 2017-01-01 14:00:00 0.738 14 1
3 2017-01-01 15:00:00 0.576 15 1
4 2017-01-01 16:00:00 -0.305 16 1
5 2017-01-01 17:00:00 1.51 17 1
6 2017-01-01 18:00:00 0.390 18 1
7 2017-01-01 19:00:00 -0.621 19 1
8 2017-01-01 20:00:00 -2.21 20 1
9 2017-01-01 21:00:00 1.12 21 1
10 2017-01-01 22:00:00 -0.0449 22 1
# ... with 337 more rows
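If you need every month/hour combination as its own object (January 00h, January 01h, ...), one option is group_split(), which turns the grouped data from above into a list with one tibble per group. A sketch:
by_month_hour <- data %>%
  group_by(month, hour) %>%
  group_split()        # a list of up to 12 * 24 = 288 tibbles
by_month_hour[[1]]     # the first month/hour combination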