Merge dataframes based on interval condition in R

I have a dataframe like this
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97
that I would like to merge with this other dataframe
id date
1 26/08/88
2 10/05/96
The resulting merged dataframe should look like this
id start end date
1 20/06/88 24/07/89 26/08/88
1 27/07/89 13/04/93 NA
1 14/04/93 6/09/95 NA
2 3/01/92 11/02/94 NA
2 30/03/94 16/04/96 NA
2 17/04/96 18/08/97 10/05/96
In practice, I want to merge the two dataframes based on id and on the condition that date must lie within the interval spanned by the start and end variables of the first dataframe.
Do you have any suggestions on how to do this? I tried the fuzzyjoin package, but I ran into memory issues.
Many thanks to everyone

Might be a dupe, will remove when I find a good target. In the meantime, we could use fuzzyjoin:
library(tidyverse)
library(fuzzyjoin)
df1 %>%
    mutate_at(2:3, as.Date, "%d/%m/%y") %>%
    fuzzy_left_join(
        df2 %>% mutate(date = as.Date(date, "%d/%m/%y")),
        by = c("id" = "id", "start" = "date", "end" = "date"),
        # id must match exactly, and start < date < end
        match_fun = list(`==`, `<`, `>`))
# id.x start end id.y date
#1 1 1988-06-20 1989-07-24 1 1988-08-26
#2 1 1989-07-27 1993-04-13 NA <NA>
#3 1 1993-04-14 1995-09-06 NA <NA>
#4 2 1992-01-03 1994-02-11 NA <NA>
#5 2 1994-03-30 1996-04-16 NA <NA>
#6 2 1996-04-17 1997-08-18 2 1996-05-10
All that remains is tidying up the id columns.
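For example, if the result of the pipe above is stored in res (a name assumed here for illustration), something along these lines keeps a single id column:
# id.x comes from df1, id.y from df2; keep one and rename it back to id
res %>%
    select(id = id.x, start, end, date)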
Sample data
df1 <- read.table(text = "
id start end
1 20/06/88 24/07/89
1 27/07/89 13/04/93
1 14/04/93 6/09/95
2 3/01/92 11/02/94
2 30/03/94 16/04/96
2 17/04/96 18/08/97", header = T)
df2 <- read.table(text = "
id date
1 26/08/88
2 10/05/96 ", header = T)

You can use sqldf for complex joins. Make sure start, end and date are stored as Date class first, so the comparisons are done on dates rather than on character strings:
require(sqldf)
sqldf("SELECT df1.*, df2.date, df2.id as id2
       FROM df1
       LEFT JOIN df2
         ON df1.id = df2.id AND
            df1.start < df2.date AND
            df1.end > df2.date")
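If fuzzyjoin keeps running out of memory, the same condition can also be written as a data.table non-equi update join, which is usually much lighter on memory. A minimal sketch, assuming the df1/df2 sample data above (this uses inclusive bounds; swap to strict ones if needed):
library(data.table)
setDT(df1)
setDT(df2)

# convert the d/m/y strings to Date
df1[, `:=`(start = as.Date(start, "%d/%m/%y"), end = as.Date(end, "%d/%m/%y"))]
df2[, date := as.Date(date, "%d/%m/%y")]

# for each event in df2, find the df1 row with the same id whose
# [start, end] interval contains the date, and copy that date into df1;
# df1 rows with no matching event keep NA
df1[df2, on = .(id, start <= date, end >= date), date := i.date]
df1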

Related

Adapting string variables to specific characteristics in R

I have the following data:
id code
1 I560
2 K980
3 R30
4 F500
5 650
I would like to do the following two actions on the column code:
i) keep the letter and the first two numbers after it, and
ii) remove the observations that do not start with a letter. So in the end, the data frame should look like this:
id code
1 I56
2 K98
3 R30
4 F50
In base R, you could do:
subset(transform(df, code = sub('([A-Z]\\d{2}).*', '\\1', code)),
       grepl('^[A-Z]', code))
Or using tidyverse functions
library(dplyr)
library(stringr)
df %>%
    mutate(code = str_extract(code, '[A-Z]\\d{2}')) %>%
    # codes without a leading letter become NA above and are dropped here
    filter(str_detect(code, '^[A-Z]'))
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
An option with substr from base R
df1$code <- substr(df1$code, 1, 3)
df1[grepl('^[A-Z]', df1$code),]
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
data
df1 <- structure(list(id = 1:5, code = c("I560", "K980", "R30", "F500",
"650")), row.names = c(NA, -5L), class = "data.frame")

Split a column list into columns

Suppose I have a DT as -
id values valid_types
1 2|3 100|200
2 4 200
3 2|1 500|100
The valid_types column tells me which valid types I need. There are 4 total types (100, 200, 500, 2000). Each entry specifies its valid types and the corresponding values as |-separated character strings.
I want to transform this to a DT which has the types as columns and their corresponding values.
Expected:
id 100 200 500
1 2 3 NA
2 NA 4 NA
3 1 NA 2
I thought I could take both columns and split them on |, which would give me two lists. I would then combine them by using the types list as names/keys and convert the final result to a DT.
But the idea I came up with is very convoluted and not really working.
Is there a better/easier way to do this?
Here is a data.table approach:
# split each row's values/valid_types into long form, then cast the types to columns
dcast(
    DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
    id ~ valid_types, value.var = "values"
)
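Note that because strsplit works on character vectors, the resulting 100/200/500 columns stay character; if integer columns are wanted, they can be converted afterwards, e.g. along these lines (a sketch):
res <- dcast(
    DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
    id ~ valid_types, value.var = "values"
)
# convert the newly created type columns from character to integer
cols <- setdiff(names(res), "id")
res[, (cols) := lapply(.SD, as.integer), .SDcols = cols]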
Using the tidyr library, you can use separate_rows with pivot_wider:
library(tidyr)
df %>%
    separate_rows(values, valid_types, sep = '\\|', convert = TRUE) %>%
    pivot_wider(names_from = valid_types, values_from = values)
# id `100` `200` `500`
# <int> <int> <int> <int>
#1 1 2 3 NA
#2 2 NA 4 NA
#3 3 1 NA 2
A data.table way would be:
library(data.table)
library(splitstackshape)

setDT(df)
dcast(cSplit(df, c('values', 'valid_types'), sep = '|', direction = 'long'),
      id ~ valid_types, value.var = 'values')
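Sample data matching the question (an assumption about the exact types, since the question does not include a constructor; values and valid_types are taken to be character):
library(data.table)

df <- data.frame(
    id = 1:3,
    values = c("2|3", "4", "2|1"),
    valid_types = c("100|200", "200", "500|100"),
    stringsAsFactors = FALSE
)
DT <- as.data.table(df)   # for the data.table answers above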

merging two datasets based on common id and date within interval range

I have two datasets: DF1, a data frame which lists heads of state (leader_id) of countries (country_code) and the interval of their time in office (office_interval); and DF2, a data frame where every observation is an event with an ID (event_id), a country (country_code) and the date it occurred (event_date).
Data:
library(lubridate)

# Leader DF
leader_id <- c("Adam", "Bob", "Charlie", "Derek", "Edgar")
country_code <- c(1, 1, 2, 2, 3)
office_interval <- c(interval(ymd("1900-01-01"), ymd("1905-01-01")),
                     interval(ymd("1910-01-01"), ymd("1915-01-01")),
                     interval(ymd("1920-01-01"), ymd("1925-01-01")),
                     interval(ymd("1930-01-01"), ymd("1935-01-01")),
                     interval(ymd("1940-01-01"), ymd("1945-01-01")))
DF1 <- data.frame(leader_id, country_code, office_interval)

# Event DF
event_id <- c(1, 1, 2, 3, 3)
country_code <- c(1, 2, 2, 1, 3)
event_date <- c(as.Date("1901-01-02"),
                as.Date("1920-01-02"),
                as.Date("1921-01-02"),
                as.Date("1911-01-02"),
                as.Date("1941-02-02"))
DF2 <- data.frame(event_id, country_code, event_date)
I would like to create a new column in DF2 that takes the leader_id from DF1 for each row in DF2 whose event_date falls within a leader's office_interval in the same country.
DF2 should look like this afterward:
event_id country_code event_date leader_id
1 1 1 1901-01-02 Adam
2 1 2 1920-01-02 Charlie
3 2 2 1921-01-02 Charlie
4 3 1 1911-01-02 Bob
5 3 3 1941-02-02 Edgar
I've tried some solutions from here but I cannot get any of them to work.
Here is a solution that may work for your purpose:
# for each event, find the DF1 row whose interval contains the date
# in the same country (comparing country_code row-wise, not with %in%)
idx <- sapply(1:nrow(DF2), function(k)
    which(DF2$event_date[k] %within% DF1$office_interval &
          DF1$country_code == DF2$country_code[k]))
DF2$leader_id <- DF1$leader_id[idx]
such that
> DF2
event_id country_code event_date leader_id
1 1 1 1901-01-02 Adam
2 1 2 1920-01-02 Charlie
3 2 2 1921-01-02 Charlie
4 3 1 1911-01-02 Bob
5 3 3 1941-02-02 Edgar
We can left_join DF2 and DF1 by "country_code" and keep the records whose event_date falls within the office_interval.
library(dplyr)
library(lubridate)
left_join(DF2, DF1, by = "country_code") %>%
    filter(event_date %within% office_interval)
# event_id country_code event_date leader_id office_interval
#1 1 1 1901-01-02 Adam 1900-01-01 UTC--1905-01-01 UTC
#2 1 2 1920-01-02 Charlie 1920-01-01 UTC--1925-01-01 UTC
#3 2 2 1921-01-02 Charlie 1920-01-01 UTC--1925-01-01 UTC
#4 3 1 1911-01-02 Bob 1910-01-01 UTC--1915-01-01 UTC
#5 3 3 1941-02-02 Edgar 1940-01-01 UTC--1945-01-01 UTC
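To match the desired DF2 exactly, the interval column can be dropped at the end:
left_join(DF2, DF1, by = "country_code") %>%
    filter(event_date %within% office_interval) %>%
    select(-office_interval)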
This should also work:
# add start and end date
DF1$start_date <- substr(DF1$office_interval, 1, 10)
DF1$end_date <- substr(DF1$office_interval, 17, 26)
# merge dataframes
DF2 <- merge(x = DF2, y = DF1, by.x = "country_code", by.y = "country_code")
# filter for correct times
DF2 <- DF2[(DF2$event_date >= DF2$start_date & DF2$event_date <= DF2$end_date),]
# select columns
DF2[1:4]

R - Merge two datasets if ALL values per group match

Say I have two dataframes df1 and df2 as follow:
df1:
EmployeeID Skill
1 A
1 B
1 C
2 B
2 D
2 C
2 F
3 A
3 J
df2:
Opportunity.ID Skill
12345 A
12345 B
56788 C
56788 B
56788 F
09988 H
What I'm looking to do is to build a new data frame with all the EmployeeIDs that have all of the skills required for a specific Opportunity.ID, and not only one of them. This is why a simple merge or left/right join will not be enough.
In our case, what I would like to have is:
Opportunity.ID Employee.ID
12345 1
56788 2
09988 NA
Note that employee 3 should not be assigned to opportunity 12345 because he only has one skill among the two required.
Any help would be greatly appreciated.
Here's one way using dplyr -
df2 %>%
    left_join(df1, by = "Skill") %>%
    group_by(Opportunity.ID) %>%
    mutate(test = ave(Skill, EmployeeID, FUN = function(x) all(Skill %in% x))) %>%
    ungroup() %>%
    filter(test != "FALSE") %>%
    distinct(Opportunity.ID, EmployeeID)
# A tibble: 3 x 2
Opportunity.ID EmployeeID
<int> <int>
1 12345 1
2 56788 2
3 9988 NA
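An equivalent formulation that avoids mixing ave() into the pipe counts, per opportunity, how many of the required skills each employee brings (a sketch, assuming the same df1/df2):
library(dplyr)

df2 %>%
    left_join(df1, by = "Skill") %>%
    group_by(Opportunity.ID) %>%
    mutate(n_required = n_distinct(Skill)) %>%           # skills the opportunity needs
    group_by(Opportunity.ID, EmployeeID) %>%
    filter(n_distinct(Skill) == n_required) %>%          # employee covers all of them
    ungroup() %>%
    distinct(Opportunity.ID, EmployeeID)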
There is probably a better solution, but with the data.table package I came to the following approach:
library(data.table) # load the package
setDT(df1)          # convert 'df1' to a 'data.table'
setDT(df2)          # convert 'df2' to a 'data.table'

df2[, .(EmployeeID = df1[.SD[, .(Skill, n = .N)], on = .(Skill)
                         ][, .(ne = .N), by = .(EmployeeID, n)
                           ][n == ne, EmployeeID])
    , by = Opportunity.ID]
which gives:
Opportunity.ID EmployeeID
1: 12345 1
2: 56788 2
3: 9988 NA
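Sample data matching the question (an assumption, since no constructor is given; Opportunity.ID is taken as numeric, which is why 09988 appears as 9988 in the outputs above):
df1 <- data.frame(
    EmployeeID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
    Skill = c("A", "B", "C", "B", "D", "C", "F", "A", "J"),
    stringsAsFactors = FALSE
)
df2 <- data.frame(
    Opportunity.ID = c(12345, 12345, 56788, 56788, 56788, 9988),
    Skill = c("A", "B", "C", "B", "F", "H"),
    stringsAsFactors = FALSE
)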

Find the data between two datasets which is within two weeks R

I have two dataframes as follows:
data1
Type date
1 A 2011-10-21
2 A 2011-11-18
3 A 2011-12-16
4 B 2011-10-20
5 B 2011-11-17
6 B 2011-12-15
and
data2
Date Type value
1 2011-10-25 A 1
2 2011-10-15 A 3
3 2011-11-10 A 4
4 2011-10-23 B 12
5 2011-10-27 B 1
6 2011-11-18 B 1
For each type (A, B) and date in data1, I want to check all the entries of that type in data2, keep the data2 dates that are within a two-week gap, and then sum their values as the output.
My ideal output would be
Type date Value
1 A 2011-10-21 4 (3+1)
2 A 2011-11-18 4
3 A 2011-12-16 NA (no values for A within two weeks)
4 B 2011-10-20 13 (12+1)
5 B 2011-11-17 1
6 B 2011-12-15 NA (no values for B within two weeks)
I can think of writing a loop in R, but it runs for a long time. I guess there should be a better way to do this in dplyr; I have tried but have not been able to complete it. Can anybody help me with this?
Thanks
How does this look? Assuming data1 as df1 and data2 as df2
library(dplyr)
library(lubridate)
df3 <- full_join(df1, df2, by = "Type")
df3 <- df3 %>% mutate(date1 = week(date), Date1 = week(Date))
df4 <- df3 %>% mutate(Key = ifelse(((date1 - Date1) %in% c(-2:2)), T, F))
df5 <- df4 %>%
    filter(Key == T) %>%
    group_by(Type, date) %>%
    summarise(Value = sum(value))
full_join(df1, df5, by = c("Type", "date"))
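The week() comparison above works on week-of-year numbers, so it is approximate (and would misbehave across year boundaries). A sketch of the same idea using an explicit 14-day window instead, assuming data1/data2 as in the question with Date-class date columns:
library(dplyr)

data1 %>%
    left_join(data2, by = "Type") %>%
    group_by(Type, date) %>%
    summarise(
        Value = {
            # indices of data2 rows whose Date is within 14 days of this data1 date
            sel <- which(abs(as.numeric(difftime(Date, date, units = "days"))) <= 14)
            if (length(sel)) sum(value[sel]) else NA_real_
        },
        .groups = "drop"
    )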
