Join with closest value between two values in R

I'm working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So the first row was reported in March, but there's no way to know whether it reports the March or the February value, and it can also be an approximation of the real value. I've also got a table with the actual values for each ID; let's call it df2:
df2 = tibble( ID = c('1', '2') %>% rep(4) %>% sort,
real_value = c(1200,1230,11000,10,25000,3100,100,31030),
month = c(1,2,3,4,2,3,4,5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, for each ID I only care about the anchor month OR the month before it; second, I want to match to the closest value within those months (sounds like a fuzzy join). My first step was to filter the second table so it only keeps the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))
df2 = df2 %>%
  inner_join(filter_aux, by = c('ID', 'month' = 'anchor_month')) %>%
  distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
Then I tried a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error, I'm guessing because ID is a string in my actual data. What gives? I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!

It was easier without fuzzyjoin: join on ID, compute the absolute difference, and keep the row with the smallest difference per ID:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
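The filtering step and the closest-value step can also be combined into one pipeline. The following is only a sketch, assuming the original (unfiltered) df1 and df2 from the question; the names prev_month and closest are made up for illustration, and slice_min with with_ties = FALSE is one possible way to keep a single row per ID when two values are equally close.

```r
library(dplyr)

# Data as given in the question
df1 <- tibble(ID = c("1", "2"),
              reported_value = c(1200, 31000),
              anchor_month = c(3, 5))
df2 <- tibble(ID = rep(c("1", "2"), each = 4),
              real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
              month = c(1, 2, 3, 4, 2, 3, 4, 5))

closest <- df1 %>%
  left_join(df2, by = "ID") %>%
  # previous month wraps from January back to December
  mutate(prev_month = if_else(anchor_month == 1, 12, anchor_month - 1)) %>%
  # keep only the anchor month and the month before it
  filter(month == anchor_month | month == prev_month) %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  # one row per ID: the smallest absolute difference
  slice_min(dif, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(ID, reported_value, anchor_month, closest_value = real_value, month)

closest
```

Because the join is on ID only, the character ID column never takes part in any arithmetic, which sidesteps the error seen with difference_inner_join.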

Related

Identifying values from one database to use in another database

I am working on a project in which I need to work with 2 databases, identify values from one database to use in another.
I have a dataframe 1,
df1<-data.frame("ID"=c(1,2,3),"Condition A"=c("B","B","A"),"Condition B"=c("1","1","2"),"Year"=c(2002,1988,1995))
and a dataframe 2,
df2 <- data.frame("Condition A"=c("A","A","B","B"),"Condiction B"=c("1","2","1","2"),"<1990"=c(20,30,50,80),"1990-2000"=c(100,90,80,30),">2000"=c(300,200,800,400))
I would like to add a new column to df1 called "Value" which, for each ID in df1, takes the value from column 3, 4 or 5 of df2 (depending on the year) in the row matching conditions A and B, which are available in both data frames. The end result would be something like this:
df1<-data.frame("ID"=c(1,2,3),"Condition A"=c("B","B","A"),"Condition B"=c("1","1","2"),"Year"=c(2002,1988,1995),"Value"=c(800,50,90))
thanks!
I think we can simply left_join, then mutate with case_when, then drop the undesired columns with select:
library(dplyr)
left_join(df1, df2, by = c("Condition.A", "Condition.B")) %>%
  mutate(Value = case_when(Year < 1990 ~ X.1990,
                           Year <= 2000 ~ X1990.2000,
                           Year > 2000 ~ X.2000)) %>%
  select(-starts_with("X"))
ID Condition.A Condition.B Year Value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
EDIT: I edited your code, removing the "Condiction" typo
You could use
library(dplyr)
library(tidyr)
df2 %>%
  rename(Condition.B = Condiction.B) %>%
  pivot_longer(matches("\\d{4}")) %>%
  right_join(df1, by = c("Condition.A", "Condition.B")) %>%
  filter(name == case_when(
    Year < 1990 ~ "X.1990",
    Year > 2000 ~ "X.2000",
    TRUE ~ "X1990.2000")) %>%
  select(ID, Condition.A, Condition.B, Year, Value = value) %>%
  arrange(ID)
This returns
# A tibble: 3 x 5
ID Condition.A Condition.B Year Value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
First we rename the misspelled column Condiction.B of df2 and bring df2 into a "long format" based on the "<1990", "1990-2000" and ">2000" columns. Note that data.frame() does not keep those names as written: with the default check.names = TRUE they are converted to the syntactic names X.1990, X1990.2000 and X.2000.
Next we use a right join with df1 on the two Condition columns.
Finally we filter just the matching years based on a hard coded case_when function and do some clean up (selecting and arranging).
We could do it this way:
Condiction must be a typo, so I changed it to Condition.
In df1, create a helper column that assigns each year to its group, named after the matching column in df2.
Bring df2 into long format.
Finally, apply left_join with by=c("Condition.A", "Condition.B", "helper"="name").
library(dplyr)
library(tidyr)
df1 <- df1 %>%
mutate(helper = case_when(Year >=1990 & Year <=2000 ~"X1990.2000",
Year <1990 ~ "X.1990",
Year >2000 ~ "X.2000"))
df2 <- df2 %>%
pivot_longer(
cols=starts_with("X")
)
df3 <- left_join(df1, df2, by=c("Condition.A", "Condition.B", "helper"="name")) %>%
select(-helper)
ID Condition.A Condition.B Year value
1 1 B 1 2002 800
2 2 B 1 1988 50
3 3 A 2 1995 90
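For comparison, the same lookup can be sketched in base R without reshaping df2. This is only an illustration, assuming whole-number years, the corrected column name Condition.B, and the syntactic column names that data.frame() produces; the names cols, period, row_idx and values are made up here.

```r
df1 <- data.frame(ID = c(1, 2, 3),
                  Condition.A = c("B", "B", "A"),
                  Condition.B = c("1", "1", "2"),
                  Year = c(2002, 1988, 1995))
df2 <- data.frame(Condition.A = c("A", "A", "B", "B"),
                  Condition.B = c("1", "2", "1", "2"),
                  X.1990 = c(20, 30, 50, 80),
                  X1990.2000 = c(100, 90, 80, 30),
                  X.2000 = c(300, 200, 800, 400))

cols <- c("X.1990", "X1990.2000", "X.2000")

# Assign each year to a period. Intervals are right-closed, so with
# whole-number years: <= 1989, 1990 to 2000, and > 2000.
period <- cut(df1$Year, breaks = c(-Inf, 1989, 2000, Inf), labels = cols)

# Row of df2 whose two conditions match each row of df1
row_idx <- match(paste(df1$Condition.A, df1$Condition.B, sep = "|"),
                 paste(df2$Condition.A, df2$Condition.B, sep = "|"))

# Pick one cell per df1 row: (matching row, period column)
values <- as.matrix(df2[cols])
df1$Value <- values[cbind(row_idx, as.integer(period))]

df1
```

The numeric columns are converted to a matrix first so that the two-column index returns numbers rather than text.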

How to find observations within a certain time range of each other in R

I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had at most 3 different medications administered within 365 days of each other. I am not sure whether it is better to use days of life or the date to get this outcome. Any help is appreciated.
One option: convert 'date' to Date class, group by 'ID', take the absolute difference between 'date' and its lag, flag gaps greater than 365 days, build a grouping index from those flags with cumsum, count the distinct 'meds' in each group with summarise, and finally take the maximum per 'ID'.
library(dplyr)
df1 %>%
  mutate(date = as.Date(date)) %>%
  group_by(ID) %>%
  mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
                                         units = 'days')))) %>%
  group_by(grp = cumsum(diffd > 365), .add = TRUE) %>%
  summarise(N = n_distinct(meds)) %>%
  group_by(ID) %>%
  summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
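If "within 365 days of each other" is read as "inside a single 365-day window", a sliding-window count is another way to approach it. The sketch below rests on that assumption; max_meds_in_window is a hypothetical helper, and the data frame is retyped from the sample in the question. It only tries windows that start on an administration date, since any window can be shifted forward to start at the first administration it contains without losing anything.

```r
library(dplyr)

df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  date = as.Date(c("2003-11-24", "2003-11-24", "2004-01-09", "2013-11-25",
                   "2013-11-26", "2013-11-26", "2008-06-05", "2008-04-07",
                   "2014-11-25", "2014-12-22")),
  meds = c("lasiks", "vigab", "lacos", "pheno", "vigab", "lasiks",
           "pheno", "vigab", "pheno", "pheno")
)

# Hypothetical helper: for one patient, the largest number of distinct
# meds found in any window that starts on an administration date and
# covers the following `width` days.
max_meds_in_window <- function(date, meds, width = 365) {
  max(vapply(seq_along(date), function(i) {
    in_window <- date >= date[i] & date <= date[i] + width
    length(unique(meds[in_window]))
  }, integer(1)))
}

result <- df %>%
  group_by(ID) %>%
  summarise(N = max_meds_in_window(date, meds))

result
```

The same helper would work on dayoflife instead of date, since it only needs values that can be compared and offset by a number of days.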

Joining data in R by first row, then second and so on

I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join, because the first (left) file needs to stay as it is: I don't want the join to return all combinations and add rows. I also don't want it to behave like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers always get the same first match. I need it to return the first match, then the second, then the third and so on (the dates are sorted so that the newest date always comes first for every ID number), BUT without adding rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column holding sub-IDs that number the rows within each ID in order. Then an inner_join on both columns joins everything together.
Since you didn't provide example data sets, I created two to show the principle.
library(dplyr)
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(sub_id = row_number())
df2 <- df2 %>%
  group_by(ID) %>%
  mutate(sub_id = row_number())
outcome <- df1 %>% inner_join(df2, by = c("ID", "sub_id"))
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an autoincrement id for each group, then join as usual
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
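The same idea can be sketched in base R for readers who prefer not to load dplyr. This is an illustration only, reusing the small example data from the first answer; seq_id and joined are made-up names. Using all.x = TRUE keeps every row of the left table even when the right table has fewer rows for that ID.

```r
df1 <- data.frame(ID = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11])

# Number the rows within each ID (1, 2, 3, ...) in their current order
df1$seq_id <- ave(seq_along(df1$ID), df1$ID, FUN = seq_along)
df2$seq_id <- ave(seq_along(df2$ID), df2$ID, FUN = seq_along)

# Join on both keys: the first row of an ID pairs with the first match,
# the second row with the second match, and so on
joined <- merge(df1, df2, by = c("ID", "seq_id"), all.x = TRUE, sort = FALSE)
joined <- joined[order(joined$ID, joined$seq_id), ]

joined
```

Because each (ID, seq_id) pair occurs at most once in either table, the join cannot multiply rows of the left table.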

How to get date difference from two dataframe in R

I have two data frames, as shown below:
DF_1
ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017
DF_2
ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017
For each ID in DF_1, I want the difference between its date and the most recent earlier (or equal) date for the same ID in DF_2.
For example: for ID=1 the date in DF_1 is 12-01-2017, and the most recent earlier date in DF_2 is 05-01-2017, because the 15th and 18th are both later than the DF_1 date.
Required Output:
ID Date1 Count
1 12/01/2017 7
2 15/02/2017 0
3 18/03/2017 -4
The following reproduces your expected output:
library(tidyverse);
df1 <- read.table(text =
"ID Date1
1 12/01/2017
2 15/02/2017
3 18/03/2017", header = T) %>%
mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"));
df2 <- read.table(text =
"ID Date1
1 05/01/2017
1 15/01/2017
1 18/01/2017
2 10/02/2017
2 13/02/2017
2 15/02/2017
3 22/03/2017", header = T) %>%
mutate(Date1 = as.Date(Date1, format = "%d/%m/%Y"));
left_join(df1, df2, by = "ID") %>%
  mutate(Count = Date1.x - Date1.y) %>%
  group_by(ID) %>%
  slice(ifelse(
    all(Count < 0),
    which.min(abs(Count)),
    which.min(replace(as.numeric(Count), Count < 0, NA)))) %>%
  select(ID, Date1.x, Count)
## A tibble: 3 x 3
## Groups: ID [3]
# ID Date1.x Count
# <int> <date> <time>
#1 1 2017-01-12 7
#2 2 2017-02-15 0
#3 3 2017-03-18 -4
Explanation: Calculate the time difference between df1$Date1 and df2$Date1, group entries by ID, and keep only the row with the smallest non-negative time difference, unless all time differences are negative, in which case report the one with the smallest absolute value.
I think your last row is wrong, as for ID=3 the df2 value is not before the df1 value. Assuming I'm right about that, you can do this...
df3 <- df2 %>% rename(Date2 = Date1) %>%
  left_join(df1, by = "ID") %>%
  mutate(datediff = as.Date(Date1, format = "%d/%m/%Y") - as.Date(Date2, format = "%d/%m/%Y")) %>%
  filter(datediff >= 0) %>%
  group_by(ID) %>%
  summarise(Date1 = first(Date1), Count = min(datediff))
df3
ID Date1 Count
1 1 12/01/2017 7
2 2 15/02/2017 0
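If IDs without any earlier date should stay in the result rather than disappear, one possible variant is to blank out the later dates instead of filtering them away. This is a sketch under that assumption, with the data retyped from the question; prev_date and result are illustrative names, and returning NA for such IDs is a choice made here, not something the question asks for.

```r
library(dplyr)

df1 <- data.frame(
  ID = c(1, 2, 3),
  Date1 = as.Date(c("12/01/2017", "15/02/2017", "18/03/2017"),
                  format = "%d/%m/%Y"))
df2 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3),
  Date2 = as.Date(c("05/01/2017", "15/01/2017", "18/01/2017", "10/02/2017",
                    "13/02/2017", "15/02/2017", "22/03/2017"),
                  format = "%d/%m/%Y"))

result <- df1 %>%
  left_join(df2, by = "ID") %>%
  # blank out df2 dates that fall after the df1 date
  mutate(Date2 = if_else(Date2 <= Date1, Date2, as.Date(NA))) %>%
  group_by(ID, Date1) %>%
  # most recent remaining date, or NA when none is left
  summarise(prev_date = if (all(is.na(Date2))) as.Date(NA)
                        else max(Date2, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(Count = as.numeric(Date1 - prev_date))

result
```

With this data, ID 3 is kept with a missing Count because its only df2 date is later than its df1 date.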

conditionally duplicating rows in a data frame

This is a sample of my data set:
day city count
1 1 A 50
2 2 A 100
3 2 B 110
4 2 C 90
Here is the code for reproducing it:
df <- data.frame(
day = c(1,2,2,2),
city = c("A","A","B","C"),
count = c(50,100,110,90)
)
As you can see, the count is missing for cities B and C on day 1. What I want to do is use city A's count as an estimate for the other two cities. So the desired output would be:
day city count
1 1 A 50
2 1 B 50
3 1 C 50
4 2 A 100
5 2 B 110
6 2 C 90
I could come up with a for loop to do it, but I feel like there should be an easier way of doing it. My idea is to count the number of observations for each day, and then for the days that the number of observations is less than the number of cities in the data set, I would replicate the row to complete the data for that day. Any better ideas? or a more efficient for-loop? Thanks.
With dplyr and tidyr, we can do:
library(dplyr)
library(tidyr)
df %>%
expand(day, city) %>%
left_join(df) %>%
group_by(day) %>%
fill(count, .direction = "up") %>%
fill(count, .direction = "down")
Alternatively, we can avoid the left_join using thelatemail's solution:
df %>%
complete(day, city) %>%
group_by(day) %>%
fill(count, .direction = "up") %>%
fill(count, .direction = "down")
Both return:
# A tibble: 6 x 3
day city count
<dbl> <fct> <dbl>
1 1. A 50.
2 1. B 50.
3 1. C 50.
4 2. A 100.
5 2. B 110.
6 2. C 90.
Data (slightly modified to show .direction filling both directions):
df <- data.frame(
day = c(1,2,2,2),
city = c("B","A","B","C"),
count = c(50,100,110,90)
)
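The two fill() calls can probably be collapsed into one, since tidyr (from version 1.0.0, as far as I know) accepts "updown" and "downup" as values for .direction. A minimal sketch using the data from the question; filled is just an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  day = c(1, 2, 2, 2),
  city = c("A", "A", "B", "C"),
  count = c(50, 100, 110, 90)
)

filled <- df %>%
  complete(day, city) %>%                 # add the missing day/city rows
  group_by(day) %>%
  fill(count, .direction = "updown") %>%  # fill up first, then down
  ungroup()

filled
```

When a day has only one observed count, as here, the order of the two directions makes no difference; it only matters when a missing row sits between two different observed values.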
