Is there any way to join two data frames by date ranges? - r

I have two data frames. The first one is the forecasted demand for each of the company's items over the following 27 days, shown below:
library(tidyverse)
library(lubridate)
daily_forecast <- data.frame(
  item = c("A", "B", "A", "B"),
  date_fcsted = c("2020-8-1", "2020-8-1", "2020-8-15", "2020-8-15"),
  fcsted_qty = c(100, 200, 200, 100)
) %>%
  mutate(date_fcsted = ymd(date_fcsted)) %>%
  mutate(extended_date = date_fcsted + days(27))
and the other dataset is the actual daily demand for each item:
actual_orders <- data.frame(
  order_date = rep(seq(ymd("2020-8-3"), ymd("2020-9-15"), by = "1 week"), 2),
  item = rep(c("A", "B"), 7),
  order_qty = round(rnorm(n = 14, mean = 50, sd = 10), 0)
)
What I am trying to accomplish is to get the actual total demand for each item between date_fcsted and extended_date in the first dataset, and then join it back on so I can calculate the forecast accuracy.
Solutions with tidyverse would be highly appreciated.

You can try the following:
library(dplyr)
daily_forecast %>%
  left_join(actual_orders, by = 'item') %>%
  filter(order_date >= date_fcsted & order_date <= extended_date) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(value = sum(order_qty))

#  item  date_fcsted extended_date fcsted_qty value
#  <chr> <date>      <date>             <dbl> <dbl>
#1 A     2020-08-01  2020-08-28           100   179
#2 A     2020-08-15  2020-09-11           200   148
#3 B     2020-08-01  2020-08-28           200   190
#4 B     2020-08-15  2020-09-11           100   197
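If you then want the forecast accuracy the question asks about, one more mutate() on this summary does it. A minimal sketch; the exact metric is not specified in the question, so the one used here (1 minus the absolute percentage error) is an assumption:
daily_forecast %>%
  left_join(actual_orders, by = 'item') %>%
  filter(order_date >= date_fcsted & order_date <= extended_date) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_qty = sum(order_qty), .groups = "drop") %>%
  # accuracy metric is an assumption: 1 - absolute percentage error vs. actuals
  mutate(fcst_accuracy = 1 - abs(actual_qty - fcsted_qty) / actual_qty)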

You could also try fuzzy_join, as suggested by @Gregor Thomas. I added a row-number column to make sure you have unique rows independent of item and date ranges (but this may not be needed).
library(fuzzyjoin)
library(dplyr)
daily_forecast %>%
  mutate(rn = row_number()) %>%
  fuzzy_left_join(actual_orders,
                  by = c("item" = "item",
                         "date_fcsted" = "order_date",
                         "extended_date" = "order_date"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty))
Output
     rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
  <int> <chr>  <date>      <date>             <dbl>               <dbl>
1     1 A      2020-08-01  2020-08-28           100                 221
2     2 B      2020-08-01  2020-08-28           200                 219
3     3 A      2020-08-15  2020-09-11           200                 212
4     4 B      2020-08-15  2020-09-11           100                 216
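As another option, a sketch assuming dplyr >= 1.1.0 (which added non-equi joins via join_by()): the date-range condition can be expressed directly in the join, with no fuzzyjoin dependency.
library(dplyr)  # >= 1.1.0 for join_by()

actual_orders %>%
  inner_join(daily_forecast,
             join_by(item, between(order_date, date_fcsted, extended_date))) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty), .groups = "drop")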

Related

Group and add variable of type stock and another type in a single step?

I want to group by district, summing the 'incoming' values across quarters, and get the value of 'stock' in the last quarter (3), in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame("district" = rep(c("ARA", "BJI", "CMC"), each = 3),
                 "quarter" = rep(1:3, 3),
                 "incoming" = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
                 "stock" = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333))
df
  district quarter incoming stock
1      ARA       1     4044 19547
2      ARA       2     2992  3160
3      ARA       3     2556  1533
4      BJI       1     1639  5355
5      BJI       2     9547  6146
6      BJI       3     1191   355
7      CMC       1     2038  5816
8      CMC       2     1942  1119
9      CMC       3      225   333
The actual data frame has ~45,000 rows and 41 variables, of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
  district stock incoming
  <chr>    <dbl>    <dbl>
1 ARA       1533     9592
2 BJI        355    12377
3 CMC        333     4205
I know how to get to the result, but only in three steps, which I don't think is efficient and which seems error-prone given the size of the data.
My approach:
basea <- df %>%
  group_by(district) %>%
  filter(quarter == 3) %>%   # take only the last quarter
  summarise(across(stock, sum))

baseb <- df %>%
  group_by(district) %>%
  summarise(across(incoming, sum))

final <- full_join(basea, baseb)
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset only has 3 quarters and not 4; if that's not the case, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
  group_by(district) %>%
  summarise(stock = last(stock),
            incoming = sum(incoming))
# A tibble: 3 × 3
  district stock incoming
  <chr>    <dbl>    <dbl>
1 ARA       1533     9592
2 BJI        355    12377
3 CMC        333     4205
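Since the real data has several stock-type and flow-type columns, the same idea scales with across(). A sketch, where stock_cols and flow_cols are placeholder vectors you would fill with your own column names:
stock_cols <- c("stock")     # replace with your 8 stock-type columns
flow_cols  <- c("incoming")  # replace with the columns that should be summed

df %>%
  group_by(district) %>%
  summarise(across(all_of(stock_cols), last),
            across(all_of(flow_cols), sum))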
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
   district incoming stock
1:      ARA     9592  1533
2:      BJI    12377   355
3:      CMC     4205   333
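The data.table one-liner generalises in a similar way (a sketch; stock_cols and flow_cols are again placeholders for your real column sets):
library(data.table)

stock_cols <- c("stock")
flow_cols  <- c("incoming")

setDT(df)[, c(lapply(.SD[, flow_cols, with = FALSE], sum),
              lapply(.SD[, stock_cols, with = FALSE], last)),
          by = district]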
Here's a refactor that removes some of the duplicated code. This also seems like a prime use-case for creating a custom function that can be QC'd and maintained easier:
library(dplyr)
df <- data.frame("district" = rep(c("ARA", "BJI", "CMC"), each = 3),
                 "quarter" = rep(1:3, 3),
                 "incoming" = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
                 "stock" = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333))

aggregate_stocks <- function(df, n_quarter) {
  base <- df %>%
    group_by(district)
  basea <- base %>%
    filter(quarter == n_quarter) %>%
    summarise(across(stock, sum))
  baseb <- base %>%
    summarise(across(incoming, sum))
  final <- full_join(basea, baseb, by = "district")
  return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#>   district stock incoming
#>   <chr>    <dbl>    <dbl>
#> 1 ARA       1533     9592
#> 2 BJI        355    12377
#> 3 CMC        333     4205
Here is the same solution as @Tom Hoel's, but without using a function to do the subsetting; instead just use []:
library(dplyr)
df %>%
  group_by(district) %>%
  summarise(stock = stock[3],
            incoming = sum(incoming))
  district stock incoming
  <chr>    <dbl>    <dbl>
1 ARA       1533     9592
2 BJI        355    12377
3 CMC        333     4205

Group by a variable in dataframe R

I have a dataframe like below:

  Date      cat  cam  reg  per
  22-01-05  A     60  120   50
  22-01-05  B     20  100   20
  22-01-08  A     30  150   20
  22-01-08  B     30  100   30
But I want something like below:

  Date      cam  reg   per
  22-01-05   80  220  14.5
  22-01-08   60  250    24
How to get this using R?
I am not sure why your expected per values are like that, but maybe you want the following:
df <- data.frame(Date = c("22-01-05", "22-01-05", "22-01-08", "22-01-08"),
                 cat = c("A", "B", "A", "B"),
                 cam = c(60, 20, 30, 30),
                 reg = c(120, 100, 150, 100),
                 per = c(50, 20, 20, 30))

library(dplyr)

df %>%
  group_by(Date) %>%
  summarise(cam = sum(cam),
            reg = sum(reg),
            per = cam/reg)
#> # A tibble: 2 × 4
#>   Date       cam   reg   per
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 22-01-05    80   220 0.364
#> 2 22-01-08    60   250 0.24
Created on 2022-07-07 by the reprex package (v2.0.1)
Using only the dplyr package (which is part of the tidyverse), just do:
df %>%
  group_by(Date) %>%
  summarise(cam = sum(cam),
            reg = sum(reg),
            per = 100 * (cam/reg))
  Date       cam   reg   per
  <chr>    <int> <int> <dbl>
1 22-01-05    80   220  36.4
2 22-01-08    60   250    24
The nice thing about this syntax is that you can add further summary variables in the same clean, structured way: not only sums, but also means, medians, etc.
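For instance, across() keeps this compact when several columns need the same treatment. A minimal sketch of the same computation:
df %>%
  group_by(Date) %>%
  summarise(across(c(cam, reg), sum),   # sum both columns in one call
            per = 100 * cam / reg)      # uses the summed values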
You can try this, but I don't know how to get the expected per values of 14.5 and 24:
library(dplyr)
aggregate(cbind(cam, reg) ~ Date, df, sum) %>% mutate(per = 100 * (cam/reg))
A data.frame: 2 × 4
Date       cam   reg      per
<chr>    <dbl> <dbl>    <dbl>
22-01-05    80   220 36.36364
22-01-08    60   250 24.00000

What is an efficient programming way to find the closest time of a dataset to a reference (larger) dataset

I am searching for an efficient way to find the closest times of a small dataset (x) in comparison to a large dataset (a). The result has to be a vector of indices with the length of (a). I have already created a function which works very nicely; however, it is absolutely useless for large data as it takes days to process.
Here is my function:
function(x, a, which = TRUE, na.rm = FALSE) {
  if ("POSIXt" %in% class(x)) x <- as.numeric(x)
  if ("POSIXt" %in% class(a)) a <- as.numeric(a)
  sapply(a, function(y) DescTools::Closest(x, y, which = TRUE, na.rm = FALSE)[1])
}
Both datasets x and a are filtered and therefore have no consistent time stamp, but they are filtered according to the same requirements.
Vector a contains 20 Hz data with a length of 16020209, and x contains 30-second data with a length of 26908.
Any suggestions are very much appreciated! Thank you :)
One can use a rolling join from data.table:
library(data.table)
set.seed(1) # reproduciblity on Stackoverflow
DF_A <- data.table(x = seq(-500, by = 0.5, length.out = 26908),
                   idx = seq_len(26908))
DF_HZ <- data.table(x = round(runif(16020209, first(DF_A$x), last(DF_A$x)), 3),
                    idx_hz = seq_len(16020209))
DF_HZ[, x_hz := x + 0] # so we can check
DF_A[, x_a := x + 0] # so we can check
setkey(DF_A, x)
setkey(DF_HZ, x)
# The order(idx_hz) returns the result in the same order as
# DF_HZ but it is not necessary to match joins.
DF_A[DF_HZ, roll = "nearest"][order(idx_hz)]
#>                    x   idx     x_a   idx_hz      x_hz
#>         1:  3072.021  7145  3072.0        1  3072.021
#>         2:  4506.369 10014  4506.5        2  4506.369
#>         3:  7206.883 15415  7207.0        3  7206.883
#>         4: 11718.574 24438 11718.5        4 11718.574
#>         5:  2213.328  5428  2213.5        5  2213.328
#>        ---
#>  16020205: 10517.477 22036 10517.5 16020205 10517.477
#>  16020206: 11407.776 23817 11408.0 16020206 11407.776
#>  16020207: 12051.919 25105 12052.0 16020207 12051.919
#>  16020208:  3482.463  7966  3482.5 16020208  3482.463
#>  16020209:   817.366  2636   817.5 16020209   817.366
Created on 2020-11-11 by the reprex package (v0.3.0)
On my machine, the above (not including the creation of the dummy data) takes about 3 s.
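If you would rather stay in base R, a binary-search approach with findInterval() also avoids the element-by-element sapply() loop. A minimal sketch, assuming x is sorted in increasing order (sort it first if not) and that both inputs are numeric or POSIXct:
closest_idx <- function(x, a) {
  # x: small sorted reference vector, a: large query vector
  # returns, for each element of a, the index of the nearest element of x
  x <- as.numeric(x)
  a <- as.numeric(a)
  lo <- findInterval(a, x, all.inside = TRUE)  # left bracketing index, clamped to 1..length(x)-1
  hi <- lo + 1L                                # right bracketing index
  ifelse(abs(a - x[lo]) <= abs(x[hi] - a), lo, hi)
}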
I would use something like a SQL full join for this task since the second df is small, though it depends on your data size and RAM. Here is a simple example with test data:
library(dplyr)
# demo tibbles
tab1 <- tibble::tribble(
  ~time_1,      ~VALUE_1,
  "2020-11-01", 268L,
  "2020-11-02", 479L,
  "2020-11-03", 345L,
  "2020-11-04", 567L,
  "2020-11-05", 567L) %>%
  dplyr::mutate(time_1 = as.Date(time_1))

tab2 <- tibble::tribble(
  ~time_2,      ~VALUE_2,
  "2020-11-01", 268L,
  "2020-11-02", 479L) %>%
  dplyr::mutate(time_2 = as.Date(time_2))

# calculations
tab1 %>%
  dplyr::mutate(ID = dplyr::row_number()) %>%   # build ID from row number
  dplyr::full_join(tab2, by = character()) %>%
  dplyr::mutate(DIF = abs(time_1 - time_2)) %>%
  dplyr::group_by(ID) %>%
  dplyr::slice_min(order_by = DIF, n = 1)
  time_1     VALUE_1    ID time_2     VALUE_2 DIF
  <date>       <int> <int> <date>       <int> <drtn>
1 2020-11-01     268     1 2020-11-01     268 0 days
2 2020-11-02     479     2 2020-11-02     479 0 days
3 2020-11-03     345     3 2020-11-02     479 1 days
4 2020-11-04     567     4 2020-11-02     479 2 days
5 2020-11-05     567     5 2020-11-02     479 3 days
If size turns out to be a problem, you could split the large data.frame into smaller ones and then run it in a loop. In this case parallel processing would be a great option, since by splitting the large DF the calculations can be run independently.

aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I processed the dataset.
Can we find the day with the fewest deaths in the Asian region?
The important thing here is to sum the deaths of all countries in the Asia region, and then to sort and find that day.
As output:
date        region  death
2020/02/17  asia    6300  (sum for the Asia region)
The data in this output are just an example; the numbers are not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  arrange(d)
# A tibble: 107 x 3
   day        deaths     d
   <date>      <int> <int>
 1 2020-01-23     18     1   # <- this day saw only 1 death in the whole of Asia
 2 2020-01-29    133     2
 3 2020-02-21   2249     3
 4 2020-02-12   1118     5
 5 2020-01-24     26     8
 6 2020-02-23   2465    10
 7 2020-01-26     56    14
 8 2020-01-25     42    16
 9 2020-01-22     17    17
10 2020-01-27     82    26
# ... with 97 more rows
So the second day of records saw the least number of deaths recorded (so far).
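If you only need the single day with the fewest new deaths rather than the full ranking, the same pipeline can end in slice_min() instead of arrange() (a sketch):
df %>%
  mutate(day = as.Date(day)) %>%
  filter(region == "Asia") %>%
  group_by(day) %>%
  summarise(deaths = sum(death)) %>%
  mutate(d = c(first(deaths), diff(deaths))) %>%
  slice_min(d, n = 1)   # the day with the smallest daily increase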
Using the dplyr package for data treatment:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>%
  group_by(region, day) %>%                       # grouping by region and day
  summarise(death = sum(death)) %>%               # summing within the groups
  filter(region == "Asia", death == min(death))   # keeping only the minimum for Asia
Then you have:
> df_sum
# A tibble: 1 x 3
# Groups:   region [1]
  region day        death
  <fct>  <fct>      <int>
1 Asia   2020/01/22    17

Applying vector in grouped data within function using dplyr

I am trying to use forecast with group_by in R. If I don't use a function and just specify the vector to be used directly, there is no problem, but if I pass the vector in as a function argument it returns NAs.
Any suggestions?
fun <- function(data, VECTOR) {
  a <- data %>%
    group_by(company_id) %>%
    mutate(count = n()) %>%
    filter(count > 2) %>%
    arrange(company_id, date) %>%
    do(data.frame(forecast = forecast::forecast(.$VECTOR, h = 2)))
  return(a)
}
fun(data = test, VECTOR = oms)
data:
   company_id STMT_TO_DT NET_SALES
   <chr>      <date>         <dbl>
 1 55600727   2011-12-01   1951000
 2 55600727   2012-12-01   1934000
 3 55600727   2013-12-01   1902000
 4 55600727   2014-12-01   1951000
 5 55600727   2015-12-01   1930000
 6 55600784   2012-06-01       413
 7 55600784   2013-06-01       476
 8 55600784   2014-06-01       301
 9 55600784   2015-06-01       385
10 55600784   2016-06-01      1867
As stated before, if no function is used:
a <- data %>%
  group_by(company_id) %>%
  mutate(count = n()) %>%
  filter(count > 2) %>%
  arrange(company_id, STMT_TO_DT) %>%
  do(data.frame(forecast = forecast::forecast(.$NET_SALES, h = 2)))
The following results were obtained:
  company_id forecast.Point.Forecast forecast.Lo.80 forecast.Hi.80 forecast.Lo.95 forecast.Hi.95
  <chr>                        <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
1 55600727                   1936302   1.906994e+06    1965610.447   1891478.7440    1981125.371
2 55600727                   1936302   1.905888e+06    1966715.952   1889788.0197    1982816.095
3 55600784                       791   9.511396e+01       1486.886      -273.2659       1855.266
4 55600784                       854   1.581140e+02       1549.886      -210.2659       1918.266
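One likely cause: $ does not substitute its argument, so .$VECTOR looks for a column literally named "VECTOR" rather than the column you passed in. A possible fix (a sketch, not from the original thread) is to capture the bare column name with rlang and index the group's data frame with [[:
library(dplyr)

fun <- function(data, VECTOR) {
  col <- rlang::as_name(rlang::ensym(VECTOR))  # capture the bare column name as a string
  a <- data %>%
    group_by(company_id) %>%
    mutate(count = n()) %>%
    filter(count > 2) %>%
    arrange(company_id, STMT_TO_DT) %>%
    do(data.frame(forecast = forecast::forecast(.[[col]], h = 2)))
  return(a)
}

fun(data = test, VECTOR = NET_SALES)  # 'test' is assumed to be the data shown above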
