merging two datasets based on common id and date within interval range - r

I have two datasets: DF1, a data frame that lists heads of state (leader_id) of countries (country_code) and the interval of their time in office (office_interval); and DF2, a data frame where every observation is an event with an ID (event_id), a country (country_code), and the date it occurred (event_date).
Data:
library(lubridate)
#Leader DF
leader_id <- c("Adam","Bob","Charlie","Derek", "Edgar")
country_code <- c(1,1,2,2,3)
office_interval <- c(interval(ymd("1900-01-01"), ymd("1905-01-01")),
                     interval(ymd("1910-01-01"), ymd("1915-01-01")),
                     interval(ymd("1920-01-01"), ymd("1925-01-01")),
                     interval(ymd("1930-01-01"), ymd("1935-01-01")),
                     interval(ymd("1940-01-01"), ymd("1945-01-01")))
DF1 <- data.frame(leader_id, country_code, office_interval)
#Event DF
event_id <- c(1,1,2,3,3)
country_code <- c(1,2,2,1,3)
event_date <- c(as.Date("1901-01-02"),
                as.Date("1920-01-02"),
                as.Date("1921-01-02"),
                as.Date("1911-01-02"),
                as.Date("1941-02-02"))
DF2 <- data.frame(event_id, country_code, event_date)
I would like to create a new column in DF2 that takes the leader_id from DF1 for each row in DF2 whose event_date falls within a leader's office_interval in the same country.
DF2 should look like this afterward:
  event_id country_code event_date leader_id
1        1            1 1901-01-02      Adam
2        1            2 1920-01-02   Charlie
3        2            2 1921-01-02   Charlie
4        3            1 1911-01-02       Bob
5        3            3 1941-02-02     Edgar
I've tried some solutions from here but I cannot get any of them to work.

Here is a solution that may work for your purpose. Note that the country has to be matched element-wise with ==, not with %in% (which tests membership in the whole vector and is therefore always TRUE here):
idx <- sapply(1:nrow(DF2), function(k)
  which(DF2$event_date[k] %within% DF1$office_interval &
          DF2$country_code[k] == DF1$country_code))
DF2$leader_id <- DF1$leader_id[idx]
such that
> DF2
  event_id country_code event_date leader_id
1        1            1 1901-01-02      Adam
2        1            2 1920-01-02   Charlie
3        2            2 1921-01-02   Charlie
4        3            1 1911-01-02       Bob
5        3            3 1941-02-02     Edgar
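Note that which() assumes every event matches exactly one leader; if an event could match zero or several leaders, sapply would return a list and the assignment would break. A defensive variant (a sketch, not part of the original answer) that returns NA in those cases:
# returns NA_character_ unless exactly one leader matches
DF2$leader_id <- vapply(seq_len(nrow(DF2)), function(k) {
  hit <- which(DF2$event_date[k] %within% DF1$office_interval &
                 DF2$country_code[k] == DF1$country_code)
  if (length(hit) == 1) as.character(DF1$leader_id[hit]) else NA_character_
}, character(1))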

We can left_join DF2 and DF1 by "country_code" and keep the records whose event date falls within the office interval.
library(dplyr)
library(lubridate)
left_join(DF2, DF1, by = "country_code") %>%
  filter(event_date %within% office_interval)
#  event_id country_code event_date leader_id                 office_interval
#1        1            1 1901-01-02      Adam 1900-01-01 UTC--1905-01-01 UTC
#2        1            2 1920-01-02   Charlie 1920-01-01 UTC--1925-01-01 UTC
#3        2            2 1921-01-02   Charlie 1920-01-01 UTC--1925-01-01 UTC
#4        3            1 1911-01-02       Bob 1910-01-01 UTC--1915-01-01 UTC
#5        3            3 1941-02-02     Edgar 1940-01-01 UTC--1945-01-01 UTC
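To reproduce the desired output exactly, the interval column can be dropped at the end (a small addition to the answer above):
left_join(DF2, DF1, by = "country_code") %>%
  filter(event_date %within% office_interval) %>%
  select(-office_interval)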

This should also work:
# extract start and end dates from the interval's printed form
# ("1900-01-01 UTC--1905-01-01 UTC")
DF1$start_date <- substr(DF1$office_interval, 1, 10)
DF1$end_date <- substr(DF1$office_interval, 17, 26)
# merge dataframes
DF2 <- merge(x = DF2, y = DF1, by.x = "country_code", by.y = "country_code")
# filter for correct times
DF2 <- DF2[(DF2$event_date >= DF2$start_date & DF2$event_date <= DF2$end_date),]
# select columns
DF2[1:4]
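If you would rather not rely on the interval's printed form, lubridate's int_start() and int_end() accessors extract the endpoints directly; a sketch of the same idea:
# equivalent, without the string surgery
DF1$start_date <- as.Date(lubridate::int_start(DF1$office_interval))
DF1$end_date <- as.Date(lubridate::int_end(DF1$office_interval))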

Related

Join with closest value between two values in R

I was working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
 1           1200            3
 2          31000            5
So, the first row was reported in March, but there's no way to know whether it reports March or February values, and it may also be an approximation of the real value. I've also got a table with the actual values for each ID; let's call it df2:
df2 = tibble(ID = c('1', '2') %>% rep(4) %>% sort,
             real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
             month = c(1, 2, 3, 4, 2, 3, 4, 5))
ID real_value month
 1       1200     1
 1       1230     2
 1      11000     3
 1         10     4
 2      25000     2
 2       3100     3
 2        100     4
 2      31030     5
So there are two challenges: first, I only care about the anchor month OR the month before the anchor month for each ID, and then I want to match to the closest value (sounds like a fuzzy join). My first step was therefore to filter the second table so it only keeps the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))
df2 = df2 %>%
  inner_join(filter_aux, by = c('ID', 'month' = 'anchor_month')) %>%
  distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
 1       1230     2
 1      11000     3
 2        100     4
 2      31030     5
Now I tried to do a difference_inner_join by ID and reported_value = real_value, i.e. df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value')), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
 1           1200            3          1230     2
 2          31000            5         31030     5
Thanks!
It was easier without fuzzy_join:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
  ID    reported_value anchor_month real_value month   dif
  <chr>          <dbl>        <dbl>      <dbl> <dbl> <dbl>
1 1               1200            3       1230     2    30
2 2              31000            5      31030     5    30
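One caveat: filter(dif == min(dif)) keeps every row tied for the minimum, so an ID with two equally close values would return two rows. If exactly one match per ID is wanted, a sketch using slice_min() (requires dplyr >= 1.0.0):
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  slice_min(dif, n = 1, with_ties = FALSE) %>%
  ungroup()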

Search second data set for values that occur in list from first data set

The below code creates two datasets:
df1 <-
read.table(textConnection("ID Code Date1 Date2
1 I611 01/01/2021 03/01/2021
2 L111 04/01/2021 09/01/2021
3 L111 01/01/2021 03/01/2021
4 Z538 08/01/2021 11/01/2021
5 I613 08/08/2021 09/09/2021
"), header=TRUE)
df2 <-
read.table(textConnection("ID State
1 Washington
49 California
1 Washington
40 Texas
1 Texas
2 Texas
2 Washington
50 Minnesota
60 Washington"), header=TRUE)
What I am looking to achieve is to search the second dataset for ID values that occur at least once in the first dataset's 'ID' column, and then count the matches grouped by the 'State' column of the second dataset. So the output would be:
State      IDs
Washington   3
Texas        2
Any assistance will be very much appreciated - thank you.
dplyr::semi_join() is useful for filtering one table based on values in another. You can then count() occurrences of each state in your filtered table.
library(dplyr)
semi_join(df2, df1) %>%
  count(State, name = "IDs")
Joining, by = "ID"
State IDs
1 Texas 2
2 Washington 3
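Passing by explicitly silences the "Joining, by" message:
semi_join(df2, df1, by = "ID") %>%
  count(State, name = "IDs")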
Filter df2 based on df1, then aggregate:
aggregate(ID ~ State, data = df2[ df2$ID %in% df1$ID, ], length)
# State ID
# 1 Texas 2
# 2 Washington 3
Using base R
stack(table(subset(df2, ID %in% df1$ID)$State))[2:1]
ind values
1 Texas 2
2 Washington 3
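If you want the same column names as the dplyr output, a small variant of the above (a sketch):
setNames(stack(table(subset(df2, ID %in% df1$ID)$State))[2:1], c("State", "IDs"))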

How to sum total observations in one dataset per ID that occur within time interval of another dataset

I have two datasets: DF1, a data frame that lists heads of state (leader_id) of countries (country_code) and the interval of their time in office (office_interval); and DF2, a data frame where every observation is an event with a country (country_code) and the date it occurred (event_date).
Reproducible data:
library(lubridate)
#Leader DF
leader_id <- c("Adam","Bob","Charlie")
country_code <- c(1,1,2)
office_interval <- c(interval(ymd("1900-01-01"), ymd("1905-01-01")),
                     interval(ymd("1910-01-01"), ymd("1915-01-01")),
                     interval(ymd("1920-01-01"), ymd("1925-01-01")))
DF1 <- data.frame(leader_id, country_code, office_interval)
#Event DF
country_code <- c(1,2,2,1)
event_date <- c(as.Date("1901-01-01"),
                as.Date("1902-01-01"),
                as.Date("1921-01-01"),
                as.Date("1901-02-02"))
DF2 <- data.frame(country_code, event_date)
I would like to create a new column, DF1$total_events, that sums the total number of observations in DF2 that occur within the same country_code and office_interval for each leader in DF1. It should look like this:
  leader_id country_code                 office_interval total_events
1      Adam            1 1900-01-01 UTC--1905-01-01 UTC            2
2       Bob            1 1910-01-01 UTC--1915-01-01 UTC            0
3   Charlie            2 1920-01-01 UTC--1925-01-01 UTC            1
I've tried to modify some solutions from this similar question, but I can't get anything to work on my data.
We can do a left_join on DF1 and DF2 by "country_code" and count the number of event dates that fall within office_interval.
library(dplyr)
library(lubridate)
DF1 %>%
  left_join(DF2, by = "country_code") %>%
  group_by(leader_id, country_code, office_interval) %>%
  summarise(total_events = sum(event_date %within% office_interval))
#  leader_id country_code                 office_interval total_events
#  <fct>            <dbl>                      <Interval>        <int>
#1 Adam                 1 1900-01-01 UTC--1905-01-01 UTC             2
#2 Bob                  1 1910-01-01 UTC--1915-01-01 UTC             0
#3 Charlie              2 1920-01-01 UTC--1925-01-01 UTC             1
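Note that summarise() leaves the result grouped by leader_id and country_code; on dplyr >= 1.0.0 you can drop the grouping on the way out (a minor addition to the answer):
DF1 %>%
  left_join(DF2, by = "country_code") %>%
  group_by(leader_id, country_code, office_interval) %>%
  summarise(total_events = sum(event_date %within% office_interval),
            .groups = "drop")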
Using data.table
library(data.table)
library(lubridate)
# join DF1 to DF2 on country_code, then count the matching events per leader
setDT(DF1)[DF2, on = .(country_code)][
  , .(total_events = sum(event_date %within% office_interval)),
  by = .(leader_id, country_code, new = office_interval)]

Merge dataframes based on interval condition

I have a dataframe like this
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97
that I would like to merge with this other dataframe
id date
1 26/08/88
2 10/05/96
The resulting merged dataframe should look like this
id    start      end     date
 1 20/06/88 24/07/89 26/08/88
 1 27/07/89 13/04/93       NA
 1 14/04/93  6/09/95       NA
 2  3/01/92 11/02/94       NA
 2 30/03/94 16/04/96       NA
 2 17/04/96 18/08/97 10/05/96
In practice I want to merge the two dataframes based on id and on the fact that date must lie within the interval spanned by the start and end vars of the first dataframe.
Do you have any suggestion on how to do this? I tried to use the fuzzyjoin package, but I ran into some memory issues.
Many thanks to everyone
Might be a dupe; will remove when I find a good target. In the meantime, we could use fuzzyjoin:
library(tidyverse)
library(fuzzyjoin)
df1 %>%
  mutate_at(2:3, as.Date, "%d/%m/%y") %>%
  fuzzy_left_join(
    df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),
    by = c("id" = "id", "start" = "date", "end" = "date"),
    match_fun = list(`==`, `<`, `>`))
#  id.x      start        end id.y       date
#1    1 1988-06-20 1989-07-24    1 1988-08-26
#2    1 1989-07-27 1993-04-13   NA       <NA>
#3    1 1993-04-14 1995-09-06   NA       <NA>
#4    2 1992-01-03 1994-02-11   NA       <NA>
#5    2 1994-03-30 1996-04-16   NA       <NA>
#6    2 1996-04-17 1997-08-18    2 1996-05-10
All that remains is tidying up the id columns.
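For example, appending a select() to the same pipeline keeps a single id column (a sketch under the same assumptions as the answer):
df1 %>%
  mutate_at(2:3, as.Date, "%d/%m/%y") %>%
  fuzzy_left_join(
    df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),
    by = c("id" = "id", "start" = "date", "end" = "date"),
    match_fun = list(`==`, `<`, `>`)) %>%
  select(id = id.x, start, end, date)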
Sample data
df1 <- read.table(text = "
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97", header = T)
df2 <- read.table(text = "
id date
1 26/08/88
2 10/05/96 ", header = T)
You can use sqldf for complex joins:
require(sqldf)
# convert the character dates to Date class first, so the comparisons
# below are chronological rather than lexicographic
df1[c("start", "end")] <- lapply(df1[c("start", "end")], as.Date, "%d/%m/%y")
df2$date <- as.Date(df2$date, "%d/%m/%y")
sqldf("SELECT df1.*, df2.date, df2.id AS id2
       FROM df1
       LEFT JOIN df2
         ON df1.id = df2.id AND
            df1.start < df2.date AND
            df1.end > df2.date")

Assign values of a new column based on the frequency of a special pattern in dataframe

I would like to create another column in a data frame that groups the rows belonging to each member in the first column, based on their order.
Here is a reproducible demo:
df1=c("Alex","23","ID #:123", "John","26","ID #:564")
df1=data.frame(df1)
library(dplyr)
library(data.table)
df1 %>% mutate(group= ifelse(df1 %like% "ID #:",1,NA ) )
This was the output from the demo:
df1 group
1 Alex NA
2 23 NA
3 ID #:123 1
4 John NA
5 26 NA
6 ID #:564 1
This is what I want:
df1 group
1 Alex 1
2 23 1
3 ID #:123 1
4 John 2
5 26 2
6 ID #:564 2
So I want a group column that indexes each member in order.
Thanks in advance for any replies or thoughts!
Shift the condition with lag first and then do a cumsum:
df1 %>%
  mutate(group = cumsum(lag(df1 %like% "ID #:", default = 1)))
# df1 group
#1 Alex 1
#2 23 1
#3 ID #:123 1
#4 John 2
#5 26 2
#6 ID #:564 2
Details:
df1 %>%
mutate(
# calculate the condition
cond = df1 %like% "ID #:",
# shift the condition down and fill the first value with 1
lag_cond = lag(cond, default = 1),
# increase the group when the condition is TRUE (ID encountered)
group= cumsum(lag_cond))
# df1 cond lag_cond group
#1 Alex FALSE TRUE 1
#2 23 FALSE FALSE 1
#3 ID #:123 TRUE FALSE 1
#4 John FALSE TRUE 2
#5 26 FALSE FALSE 2
#6 ID #:564 TRUE FALSE 2
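Since data.table is already loaded, an equivalent sketch using shift() in place of lag():
library(data.table)
# shift() lags the condition; fill = TRUE starts the first group at 1
setDT(df1)[, group := cumsum(shift(df1 %like% "ID #:", fill = TRUE))]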
You don't mention whether you're always expecting 3 rows per member. This code will allow you to toggle the number of rows per member (in case there's not always 3):
# Your code:
df1=c("Alex","23","ID #:123", "John","26","ID #:564")
df1=data.frame(df1)
library(dplyr)
library(data.table)
df1 %>% mutate(group= ifelse(df1 %like% "ID #:",1,NA ) )
number_of_rows_per_member <- 3 # change if necessary
positions <- 1:(nrow(df1) / number_of_rows_per_member)
group <- c()
for (i in 1:length(positions)) {
  group[(i * number_of_rows_per_member):((i * number_of_rows_per_member) - (number_of_rows_per_member - 1))] <- i
}
group # this is the group column
df1$group <- group # now just move the group column into your original dataframe
df1 # done!
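A vectorized equivalent of the loop, using rep() (a sketch under the same fixed-rows-per-member assumption):
# one group index per member, repeated once per row
df1$group <- rep(seq_len(nrow(df1) / number_of_rows_per_member),
                 each = number_of_rows_per_member)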
