Using Prophet Package to Predict By Group in Dataframe in R
I am using the new Prophet package released by Facebook. It does time series predictions, and I want to apply it by group.
See the R section of the quick start guide: https://facebookincubator.github.io/prophet/docs/quick_start.html
This is my attempt:
grouped_output = df %>%
  group_by(group) %>%
  do(m = prophet(df[,c(1,3)])) %>%
  do(future = make_future_dataframe(m, period = 7)) %>%
  do(forecast = prophet:::predict.prophet(m, future))
grouped_output[[1]]
I then need to extract the results from the list for each group, which I am having trouble doing.
Below is my original dataframe without the groups:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90)
y<-as.numeric(y)
df <- data.frame(ds, y)
df
ds y
1 2016-11-01 15
2 2016-11-02 17
3 2016-11-03 18
4 2016-11-04 19
5 2016-11-05 20
6 2016-11-06 54
7 2016-11-07 67
8 2016-11-08 23
9 2016-11-09 12
10 2016-11-10 34
11 2016-11-11 12
12 2016-11-12 78
13 2016-11-13 34
14 2016-11-14 12
15 2016-11-15 3
16 2016-11-16 45
17 2016-11-17 67
18 2016-11-18 89
19 2016-11-19 12
20 2016-11-20 111
21 2016-11-21 123
22 2016-11-22 112
23 2016-11-23 14
24 2016-11-24 566
25 2016-11-25 345
26 2016-11-26 123
27 2016-11-27 567
28 2016-11-28 56
29 2016-11-29 87
30 2016-11-30 90
The function works when I apply it to a single group, as follows:
#install.packages('prophet')
library(prophet)
m<-prophet(df)
future <- make_future_dataframe(m, periods = 7)
forecast <- prophet:::predict.prophet(m, future)
forecast$yhat
[1] -2.649032 -29.762095 128.169781 59.573684 -11.623727 107.473617 -29.949730 -42.862455 -62.378408 104.797639 46.868610
[12] -12.502864 119.282058 -4.914921 -4.402638 -10.643570 169.309505 123.321261 74.734746 215.856347 99.290218 105.508059
[23] 102.882915 284.245984 237.401258 185.688202 321.466962 197.451536 194.280518 180.535663 349.304365 288.684031 222.337210
[34] 342.968499 203.648851 185.377165
I now want to change this so that it applies the prophet:::predict function to each group. So the NEW dataframe BY GROUP looks like this:
ds <- as.Date(c('2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30',
'2016-11-01','2016-11-02','2016-11-03','2016-11-04',
'2016-11-05','2016-11-06','2016-11-07','2016-11-08',
'2016-11-09','2016-11-10','2016-11-11','2016-11-12',
'2016-11-13','2016-11-14','2016-11-15','2016-11-16',
'2016-11-17','2016-11-18','2016-11-19','2016-11-20',
'2016-11-21','2016-11-22','2016-11-23','2016-11-24',
'2016-11-25','2016-11-26','2016-11-27','2016-11-28',
'2016-11-29','2016-11-30'))
y <- c(15,17,18,19,20,54,67,23,12,34,12,78,34,12,3,45,67,89,12,111,123,112,14,566,345,123,567,56,87,90,
45,23,12,10,21,34,12,45,12,44,87,45,32,67,1,57,87,99,33,234,456,123,89,333,411,232,455,55,90,21)
y<-as.numeric(y)
group<-c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
df <- data.frame(ds,group, y)
df
ds group y
1 2016-11-01 A 15
2 2016-11-02 A 17
3 2016-11-03 A 18
4 2016-11-04 A 19
5 2016-11-05 A 20
6 2016-11-06 A 54
7 2016-11-07 A 67
8 2016-11-08 A 23
9 2016-11-09 A 12
10 2016-11-10 A 34
11 2016-11-11 A 12
12 2016-11-12 A 78
13 2016-11-13 A 34
14 2016-11-14 A 12
15 2016-11-15 A 3
16 2016-11-16 A 45
17 2016-11-17 A 67
18 2016-11-18 A 89
19 2016-11-19 A 12
20 2016-11-20 A 111
21 2016-11-21 A 123
22 2016-11-22 A 112
23 2016-11-23 A 14
24 2016-11-24 A 566
25 2016-11-25 A 345
26 2016-11-26 A 123
27 2016-11-27 A 567
28 2016-11-28 A 56
29 2016-11-29 A 87
30 2016-11-30 A 90
31 2016-11-01 B 45
32 2016-11-02 B 23
33 2016-11-03 B 12
34 2016-11-04 B 10
35 2016-11-05 B 21
36 2016-11-06 B 34
37 2016-11-07 B 12
38 2016-11-08 B 45
39 2016-11-09 B 12
40 2016-11-10 B 44
41 2016-11-11 B 87
42 2016-11-12 B 45
43 2016-11-13 B 32
44 2016-11-14 B 67
45 2016-11-15 B 1
46 2016-11-16 B 57
47 2016-11-17 B 87
48 2016-11-18 B 99
49 2016-11-19 B 33
50 2016-11-20 B 234
51 2016-11-21 B 456
52 2016-11-22 B 123
53 2016-11-23 B 89
54 2016-11-24 B 333
55 2016-11-25 B 411
56 2016-11-26 B 232
57 2016-11-27 B 455
58 2016-11-28 B 55
59 2016-11-29 B 90
60 2016-11-30 B 21
How do I use the prophet package to predict yhat by group, rather than in total?
Here is a solution that uses tidyr::nest to nest the data by group, fits the models within those groups using purrr::map, and then retrieves the yhat values as requested.
I took your code but incorporated it into mutate() calls that compute the new columns using purrr::map.
library(prophet)
library(dplyr)
library(purrr)
library(tidyr)
d1 <- df %>%
  nest(-group) %>%
  mutate(m = map(data, prophet)) %>%
  mutate(future = map(m, make_future_dataframe, periods = 7)) %>%
  mutate(forecast = map2(m, future, predict))
Here is the output at this point:
d1
# A tibble: 2 × 5
group data m future
<fctr> <list> <list> <list>
1 A <tibble [30 × 2]> <S3: list> <data.frame [36 × 1]>
2 B <tibble [30 × 2]> <S3: list> <data.frame [36 × 1]>
# ... with 1 more variables: forecast <list>
Then I use unnest() to retrieve the data from the forecast column and select the yhat values as requested.
d <- d1 %>%
  unnest(forecast) %>%
  select(ds, group, yhat)
And here is the output for the newly forecasted values:
d %>% group_by(group) %>%
top_n(7, ds)
Source: local data frame [14 x 3]
Groups: group [2]
ds group yhat
<date> <fctr> <dbl>
1 2016-11-30 A 180.53422
2 2016-12-01 A 349.30277
3 2016-12-02 A 288.68215
4 2016-12-03 A 222.33501
5 2016-12-04 A 342.96654
6 2016-12-05 A 203.64625
7 2016-12-06 A 185.37395
8 2016-11-30 B 131.07827
9 2016-12-01 B 222.83703
10 2016-12-02 B 236.33555
11 2016-12-03 B 145.41001
12 2016-12-04 B 228.59687
13 2016-12-05 B 162.49244
14 2016-12-06 B 68.44477
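A side note for readers on newer package versions: tidyr changed its interface in version 1.0, so nest(-group) is superseded by nest(data = -group) and unnest() now takes an explicit cols argument. A minimal sketch of the same pipeline, assuming tidyr >= 1.0 and the packages loaded above:
d <- df %>%
  nest(data = -group) %>%
  mutate(m = map(data, prophet),
         future = map(m, make_future_dataframe, periods = 7),
         forecast = map2(m, future, predict)) %>%
  select(group, forecast) %>%   # drop the other list columns before unnesting
  unnest(cols = forecast) %>%
  select(ds, group, yhat)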
I was looking for a solution for the same problem. I came up with the following code, which is a bit simpler than the accepted answer.
library(tidyr)
library(dplyr)
library(prophet)
data = df %>%
  group_by(group) %>%
  do(predict(prophet(.), make_future_dataframe(prophet(.), periods = 7))) %>%
  select(ds, group, yhat)
And here are the predicted values:
data %>% group_by(group) %>%
top_n(7, ds)
# A tibble: 14 x 3
# Groups: group [2]
ds group yhat
<date> <fctr> <dbl>
1 2016-12-01 A 316.9709
2 2016-12-02 A 258.2153
3 2016-12-03 A 196.6835
4 2016-12-04 A 346.2338
5 2016-12-05 A 208.9083
6 2016-12-06 A 216.5847
7 2016-12-07 A 206.3642
8 2016-12-01 B 230.0424
9 2016-12-02 B 268.5359
10 2016-12-03 B 190.2903
11 2016-12-04 B 312.9019
12 2016-12-05 B 266.5584
13 2016-12-06 B 189.3556
14 2016-12-07 B 168.9791
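One caveat with the do() one-liner: it fits each prophet model twice, once inside predict() and once inside make_future_dataframe(). If fitting time matters, here is a minimal sketch that fits each model only once, using base split() plus dplyr's bind_rows() (preds is an illustrative name):
preds <- df %>%
  split(.$group) %>%
  lapply(function(g) {
    m <- prophet(g)   # fit once per group
    predict(m, make_future_dataframe(m, periods = 7))
  }) %>%
  bind_rows(.id = "group") %>%   # list names become the group column
  select(ds, group, yhat)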
Related
How best to do this join in R?
Below is the sample data. I know that I have to do a left join. The question is how to have it only return values that match (indcodelist = indcodelist2) but with the highest codetype value.

indcodelist <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000)
estemp <- c(11,21,31,41,51,61,55,21,22,874)
projemp <- c(15,25,36,45,52,61,31,29,31,899)
nchg <- c(4,4,5,4,1,0,-24,8,9,25)
firsttable <- data.frame(indcodelist,estemp,projemp,nchg)

indcodelist2 <- c(110000,111000,112000,113000,114000,115000,121000,210000,211000,315000,110000,111000,112000,113000)
codetype <- c(18,18,18,18,18,18,18,18,18,18,10,10,10,10)
codetitle <- c("Accountant","Doctor","Lawyer","Teacher","Economist","Financial Analyst","Meteorologist","Dentist","Editor","Veterinarian","Accounting Technician","Doctor","Lawyer","Teacher")
secondtable <- data.frame(indcodelist2,codetype,codetitle)

tried <- left_join(firsttable, secondtable, by = c(indcodelist = "indcodelist2"))

Desired result:

indcodelist estemp projemp nchg codetitle
     110000     11      15    4 Accountant
     111000     21      25    4 Doctor
If you only want values that match in both tables, inner_join might be what you're looking for. You can see this answer to understand different types of joins. To get the highest codetype, you can use dplyr::slice_max(). Be aware that the default behavior is to return values that tie: if there is more than one codetitle at the same codetype, they'll all be returned.

library(tidyverse)

firsttable %>%
  inner_join(., secondtable, by = c("indcodelist" = "indcodelist2")) %>%
  group_by(indcodelist) %>%
  slice_max(codetype)
#> # A tibble: 10 × 6
#> # Groups:   indcodelist [10]
#>    indcodelist estemp projemp  nchg codetype codetitle
#>          <dbl>  <dbl>   <dbl> <dbl>    <dbl> <chr>
#>  1      110000     11      15     4       18 Accountant
#>  2      111000     21      25     4       18 Doctor
#>  3      112000     31      36     5       18 Lawyer
#>  4      113000     41      45     4       18 Teacher
#>  5      114000     51      52     1       18 Economist
#>  6      115000     61      61     0       18 Financial Analyst
#>  7      121000     55      31   -24       18 Meteorologist
#>  8      210000     21      29     8       18 Dentist
#>  9      211000     22      31     9       18 Editor
#> 10      315000    874     899    25       18 Veterinarian

Created on 2022-09-15 by the reprex package (v2.0.1)
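If ties should never produce extra rows, slice_max() also takes a with_ties argument; a small sketch of the same pipeline forcing a single row per indcodelist:

firsttable %>%
  inner_join(secondtable, by = c("indcodelist" = "indcodelist2")) %>%
  group_by(indcodelist) %>%
  slice_max(codetype, n = 1, with_ties = FALSE) %>%  # one row even on ties
  ungroup()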
You might use {powerjoin}:

library(powerjoin)

power_inner_join(
  firsttable,
  secondtable |> summarize_by_keys(dplyr::across()[which.max(codetype),]),
  by = c("indcodelist" = "indcodelist2")
)
#>    indcodelist estemp projemp nchg codetype             codetitle
#> 1       110000     11      15    4       18            Accountant
#> 2       111000     21      25    4       18                Doctor
#> 3       112000     31      36    5       18                Lawyer
#> 4       113000     41      45    4       18               Teacher
#> 5       114000     51      52    1       18             Economist
#> 6       115000     61      61    0       18     Financial Analyst
#> 7       121000     55      31  -24       18         Meteorologist
#> 8       210000     21      29    8       18               Dentist
#> 9       211000     22      31    9       18                Editor
#> 10      315000    874     899   25       18          Veterinarian
Using mean in dplyr chain with curly braces always returns NA
Trying to create a simple function that summarizes a variable of choice via a dplyr chain. Here's my attempt:

get_mutated_df <- function(data, outcome){
  {{data}} %>%
    group_by(speed) %>%
    summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
    print()
}

data(cars)
get_mutated_df(cars, "dist")

However, this returns a tibble of NAs:

# A tibble: 19 × 2
  speed dist_mean
  <dbl>     <dbl>
1     4        NA
2     7        NA
3     8        NA
4     9        NA

What's the appropriate way of doing this?
Remove the {} around data, and pass dist unquoted rather than as the string "dist", and it will work. {{outcome}} captures the expression you pass in, so with "dist" the function computes mean("dist"), which returns NA.

get_mutated_df <- function(data, outcome){
  data %>%
    group_by(speed) %>%
    summarize(dist_mean = mean({{outcome}}, na.rm = T)) %>%
    print()
}

get_mutated_df(cars, dist)

   speed dist_mean
   <dbl>     <dbl>
 1     4       6
 2     7      13
 3     8      16
 4     9      10
 5    10      26
 6    11      22.5
 7    12      21.5
 8    13      35
 9    14      50.5
10    15      33.3
11    16      36
12    17      40.7
13    18      64.5
14    19      50
15    20      50.4
16    22      66
17    23      54
18    24      93.8
19    25      85

Code for the pre_ thing:

carss <- cars
carss$pre_dist <- cars$dist

get_mutated_df_2 <- function(data, outcome){
  outcome <- deparse(substitute(outcome))
  outcome <- paste0("pre_", outcome)
  outcome <- as.symbol(outcome)
  data %>%
    group_by(speed) %>%
    summarize(dist_mean := mean({{outcome}}, na.rm = T)) %>%
    print()
}

get_mutated_df_2(carss, dist)

   speed dist_mean
   <dbl>     <dbl>
 1     4       6
 2     7      13
 3     8      16
 4     9      10
 5    10      26
 6    11      22.5
 7    12      21.5
 8    13      35
 9    14      50.5
10    15      33.3
11    16      36
12    17      40.7
13    18      64.5
14    19      50
15    20      50.4
16    22      66
17    23      54
18    24      93.8
19    25      85
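If you do want the function to accept a quoted column name, as in the original get_mutated_df(cars, "dist") call, the .data pronoun is one option instead of {{ }}. A sketch (get_mutated_df_chr is a hypothetical name):

get_mutated_df_chr <- function(data, outcome) {
  data %>%
    group_by(speed) %>%
    summarize(dist_mean = mean(.data[[outcome]], na.rm = TRUE))  # look up column by string
}

get_mutated_df_chr(cars, "dist")   # works with a quoted column name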
Calculate area under the curve for time series data
I want to calculate the area under the curve for the time points for each id and column. Any suggestions? Which R packages to use? Many thanks!

id <- rep(1:3,each=5)
time <- rep(c(10,20,30,40,50),3)
q1 <- sample(100,15, replace=T)
q2 <- sample(100,15, replace=T)
q3 <- sample(100,15, replace=T)
df <- data.frame(id,time,q1,q2,q3)

df
id time q1 q2 q3
 1   10 38 55 38
 1   20 46 29 88
 1   30 16 28 97
 1   40 37 20 81
 1   50 59 27 42
 2   10 82 81 54
 2   20 45  3 23
 2   30 82 67 59
 2   40 27  3 42
 2   50 45 71 45
 3   10 39  8 29
 3   20 12  6 90
 3   30 92 11  7
 3   40 52  8 37
 3   50 81 57 80

Wanted output, something like this:

  q1   q2   q3
1 area area area
2 area area area
3 area area area
library(tidyverse)

id <- rep(1:3,each=5)
time <- rep(c(10,20,30,40,50),3)
q1 <- sample(100,15, replace=T)
q2 <- sample(100,15, replace=T)
q3 <- sample(100,15, replace=T)
df <- data.frame(id,time,q1,q2,q3)

df %>%
  arrange(time) %>%
  pivot_longer(cols = c(q1, q2, q3)) -> longer_df

longer_df %>%
  ggplot(aes(x = time, y = value, col = factor(id))) +
  geom_line() +
  geom_point() +
  facet_wrap(. ~ name)

longer_df %>%
  group_by(id, name) %>%
  mutate(lag_value = lag(value),
         midpoint_value = (value + lag_value)/2) %>%
  summarize(area = 10*sum(midpoint_value, na.rm = T)) %>%
  pivot_wider(values_from = area)
#> `summarise()` has grouped output by 'id'. You can override using the `.groups` argument.
#> # A tibble: 3 x 4
#> # Groups:   id [3]
#>      id    q1    q2    q3
#>   <int> <dbl> <dbl> <dbl>
#> 1     1  1960  1980  2075
#> 2     2  1025  2215  2180
#> 3     3  2105  1590  2110

Created on 2021-06-30 by the reprex package (v2.0.0)
Here I will use the trapz function to calculate the integral.

library(data.table)
library(caTools) # integrate with its trapz function

# data
df <- fread("id time q1 q2 q3
1 10 38 55 38
1 20 46 29 88
1 30 16 28 97
1 40 37 20 81
1 50 59 27 42
2 10 82 81 54
2 20 45 3 23
2 30 82 67 59
2 40 27 3 42
2 50 45 71 45
3 10 39 8 29
3 20 12 6 90
3 30 92 11 7
3 40 52 8 37
3 50 81 57 80")

# calculate the area with `trapz`
df[, lapply(.SD[,2:4], function(y) trapz(time, y)), by = id]
#>    id   q1   q2   q3
#> 1:  1 1475 1180 3060
#> 2:  2 2175 1490 1735
#> 3:  3 2160  575 1885

Created on 2021-06-30 by the reprex package (v2.0.0)
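For reference, the trapezoidal rule itself is a one-liner in base R, so no extra package is strictly required. A sketch (trap is a hypothetical helper; assumes time is sorted within each id):

trap <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

sapply(split(df, df$id), function(g)
  sapply(g[, c("q1", "q2", "q3")], function(v) trap(g$time, v)))
# returns a matrix with one column per id; transpose with t() if needed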
Slide along data frame rows and compare rows with next rows
I guess something similar should have been asked before, however I could only find answers for Python and SQL. So please notify me in the comments if this was also asked for R!

Data

Let's say we have a dataframe like this:

set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1,100), 20))
# In case you do not get the same dataframe, see the comment by @Ian Campbell - thanks!

   position value
1         1    27
2         2    37
3         3    57
4         4    89
5         5    20
6         6    86
7         7    97
8         8    62
9         9    58
10       10     6
11       11    19
12       12    16
13       13    61
14       14    34
15       15    67
16       16    43
17       17    88
18       18    83
19       19    32
20       20    63

Goal

I'm interested in calculating the average value for n positions and subtracting this from the average value of the next n positions, let's say n=5 for now.

What I tried

I used this method, however when I apply it to a bigger dataframe it takes a huge amount of time, so I wonder if there is a faster way:

calc <- function( pos ) {
  this.five <- df %>% slice(pos:(pos+4))
  next.five <- df %>% slice((pos+5):(pos+9))
  differ = mean(this.five$value) - mean(next.five$value)
  data.frame(dif = differ)
}

df %>%
  group_by(position) %>%
  do(calc(.$position))

That produces the following table:

   position     dif
      <int>   <dbl>
 1        1 -15.8
 2        2   9.40
 3        3  37.6
 4        4  38.8
 5        5  37.4
 6        6  22.4
 7        7   4.20
 8        8 -26.4
 9        9 -31
10       10 -35.4
11       11 -22.4
12       12 -22.3
13       13  -0.733
14       14  15.5
15       15  -0.400
16       16 NaN
17       17 NaN
18       18 NaN
19       19 NaN
20       20 NaN
I suspect a data.table approach may be faster.

library(data.table)
setDT(df)

df[, c("roll.position","rollmean") := lapply(.SD, frollmean, n = 5, fill = NA, align = "left")]
df[, result := rollmean[.I] - rollmean[.I + 5]]
df[, .(position, value, rollmean, result)]
#     position value rollmean result
#  1:        1    27     46.0  -15.8
#  2:        2    37     57.8    9.4
#  3:        3    57     69.8   37.6
#  4:        4    89     70.8   38.8
#  5:        5    20     64.6   37.4
#  6:        6    86     61.8   22.4
#  7:        7    97     48.4    4.2
#  8:        8    62     32.2  -26.4
#  9:        9    58     32.0  -31.0
# 10:       10     6     27.2  -35.4
# 11:       11    19     39.4  -22.4
# 12:       12    16     44.2     NA
# 13:       13    61     58.6     NA
# 14:       14    34     63.0     NA
# 15:       15    67     62.6     NA
# 16:       16    43     61.8     NA
# 17:       17    88       NA     NA
# 18:       18    83       NA     NA
# 19:       19    32       NA     NA
# 20:       20    63       NA     NA

Data

RNGkind(sample.kind = "Rounding")
set.seed(1); df <- data.frame(position = 1:20, value = sample(seq(1,100), 20))
RNGkind(sample.kind = "default")
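The same left-aligned rolling mean can also be computed without data.table, e.g. with zoo::rollmean plus dplyr::lead; a sketch under the same n = 5 assumption, matching the data.table result above:

library(zoo)
library(dplyr)

df %>%
  mutate(rollmean = zoo::rollmean(value, k = 5, fill = NA, align = "left"),
         dif = rollmean - lead(rollmean, 5))   # this five minus next five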
How to use R to find the closest date before a specific date in matrix A? And 14 days after the specific date?
I'm a newbie in R. I have two datasets, A and B.

A <- data.table::fread(
"V1 DATE ID
1 7/16/11 a
2 2/18/09 b
3 3/25/08 c")

B <- data.table::fread(
"V1 DATE ID Value
1 2013-06-13 a 109
2 2017-08-22 a 86
3 2017-09-15 a 88
4 2008-11-05 a 78
5 2009-02-17 a 74
6 2009-03-09 a 84
7 2009-03-17 a 81
8 2009-04-14 a 57
9 2009-04-21 a 65
10 2009-05-12 a 54
11 2009-06-08 a 54
12 2009-08-27 a 68
13 2011-08-26 b 199
14 2011-12-07 b 174
15 2012-01-31 b 66
16 2012-02-15 b 58
17 2012-04-17 b 59
18 2012-12-21 b 78
19 2013-01-14 b 91
20 2014-03-12 b 74
21 2014-08-28 b 98
22 2014-10-18 b 112
23 2010-12-15 b 36
24 2011-08-26 b 199
25 2011-12-07 b 174
26 2012-01-31 b 66
27 2012-02-15 b 58
28 2012-04-17 b 59
29 2015-05-08 c 105
30 2006-03-27 c 69
31 2007-03-12 c 104
32 2007-11-09 c 63
33 2008-03-25 c 239
34 2008-04-04 c 446
35 2008-04-09 c 354
36 2008-04-10 c 365
37 2008-04-11 c 366
38 2008-04-18 c 273
39 2008-04-28 c 271
40 2008-05-06 c 262
41 2008-05-19 c 72
42 2008-05-24 c 86
43 2008-06-20 c 47
44 2008-07-10 c 46
45 2008-08-06 c 55
46 2008-09-01 c 58
47 2008-09-29 c 56
48 2008-10-30 c 53
49 2008-12-09 c 71
50 2008-12-18 c 63
51 2009-01-14 c 60
52 2009-02-21 c 58
53 2009-03-28 c 54
54 2009-04-29 c 56
55 2009-04-30 c 59
56 2009-06-23 c 64
57 2009-07-24 c 69
58 2009-08-17 c 73
59 2009-10-04 c 127
60 2009-11-26 c 289
61 2009-12-02 c 277
62 2009-12-08 c 230")

I have tried for weeks to use R to:

1. Find the value from B where ID == A$ID and B$DATE is the closest date before, or the same date as, A$DATE. The expected result is: ID=c, DATE=2008-03-25, Value=239.
2. Find the value from B where ID == A$ID and B$DATE is 14 days after A$DATE. If there is no date exactly 14 days after, find the closest later date's value (like 15, 16 or 17 days after A$DATE). The expected result is: ID=c, DATE=2008-04-09, Value=354.
Both questions can be answered using a rolling join from data.table. However, there are two important steps in preparing the data:

1. The date strings need to be converted to class IDate (or Date) to allow for date arithmetic. (IDate uses an integer representation to save memory.)
2. The dataframes need to be coerced to data.table to enable the enhanced syntax. setDT() coerces a dataframe or tibble to data.table by reference, i.e., without copying. (BTW: the sample datasets provided by the OP were already data.tables, as the OP had used the data.table::fread() function.)

Data preparation:

library(data.table)
setDT(A)[, DATE := as.IDate(DATE, "%m/%d/%y")]
setDT(B)[, DATE := as.IDate(DATE)]

Now, we can apply the rolling join:

B[A, on = .(ID, DATE), roll = +Inf, .(ID, DATE, Value)]
   ID       DATE Value
1:  a 2011-07-16    68
2:  b 2009-02-18    NA
3:  c 2008-03-25   239

The result can be verified by printing B in proper order: B[order(ID, DATE)]. The earliest date for ID == "b" in B is 2010-12-15, so there is no date in B on or before 2009-02-18. Please note that the value in the DATE column of the result is the reference date A$DATE, not the matching B$DATE.

Edit after clarification of the expected result by the OP:

The second question can also be solved by a rolling join, but the code requires three modifications:

1. The reference dates A$DATE need to be shifted 14 days later.
2. We need a backward rolling join, because the OP wants to find the closest date in B on or after the shifted reference date.
3. According to the OP's expected result, the result should contain the matching B$DATE.

With the additional requirements we get:

B[A[, .(ID, DATE = DATE + 14)], on = .(ID, DATE), roll = -Inf, .(ID, DATE = x.DATE, Value)]
   ID       DATE Value
1:  a 2013-06-13   109
2:  b 2010-12-15    36
3:  c 2008-04-09   354
A solution using dplyr; q1 and q2 correspond to your two questions.

library(dplyr)

A$DATE <- as.Date(A$DATE, format = "%m/%d/%y")
B$DATE <- as.Date(B$DATE)

BA <- left_join(B, A, by = c("ID" = "ID"))

q1 <- BA %>%
  filter(ID %in% A$ID) %>%
  filter(DATE.x < DATE.y) %>%
  group_by(ID) %>%
  arrange(desc(DATE.x)) %>%
  slice(1)

q2 <- BA %>%
  filter(ID %in% A$ID) %>%
  group_by(ID) %>%
  filter(as.numeric(DATE.x) - as.numeric(DATE.y) >= 14)

q1
#> # A tibble: 2 x 6
#> # Groups:   ID [2]
#>    V1.x DATE.x     ID    Value  V1.y DATE.y
#>   <int> <date>     <chr> <int> <int> <date>
#> 1    12 2009-08-27 a        68     1 2011-07-16
#> 2    32 2007-11-09 c        63     3 2008-03-25

q2
#> # A tibble: 48 x 6
#> # Groups:   ID [3]
#>     V1.x DATE.x     ID    Value  V1.y DATE.y
#>    <int> <date>     <chr> <int> <int> <date>
#>  1     1 2013-06-13 a       109     1 2011-07-16
#>  2     2 2017-08-22 a        86     1 2011-07-16
#>  3     3 2017-09-15 a        88     1 2011-07-16
#>  4    13 2011-08-26 b       199     2 2009-02-18
#>  5    14 2011-12-07 b       174     2 2009-02-18
#>  6    15 2012-01-31 b        66     2 2009-02-18
#>  7    16 2012-02-15 b        58     2 2009-02-18
#>  8    17 2012-04-17 b        59     2 2009-02-18
#>  9    18 2012-12-21 b        78     2 2009-02-18
#> 10    19 2013-01-14 b        91     2 2009-02-18
#> # ... with 38 more rows
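Two caveats about q1 and q2 as written: q1 uses a strict DATE.x < DATE.y although the expected result allows "before or the same date", and q2 returns every row at least 14 days after A$DATE rather than only the closest one. A sketch tightening both, reusing the BA join from above:

q1 <- BA %>%
  filter(DATE.x <= DATE.y) %>%       # "on or before", so <= rather than <
  group_by(ID) %>%
  slice_max(DATE.x, n = 1)

q2 <- BA %>%
  filter(DATE.x >= DATE.y + 14) %>%  # at least 14 days after A$DATE
  group_by(ID) %>%
  slice_min(DATE.x, n = 1)           # keep only the closest such date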