How to evaluate zodiac sign based on date of birth in R?

So I have a date-of-birth vector in a data.frame. Based on this date, I want to determine each respondent's zodiac sign.
I've seen this solution:
Checking if Date is Between two Dates in R
But, this approach would mean that I have to create 12 vectors times 2 for each zodiac sign (starting date and finishing date), to check if my date of birth falls between the two. Is there a more efficient way to do this?
So this is my data.frame:
data.frame(respondent = c(1,2,3,4,5), date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by="months") )
respondent date_of_birth
1 1 2011-12-30
2 2 2012-01-30
3 3 2012-03-01
4 4 2012-03-30
5 5 2012-04-30
and I want to get this:
respondent date_of_birth zodiac
1 1 2011-12-30 Capricorn
2 2 2012-01-30 Aquarius
3 3 2012-03-01 Pisces
4 4 2012-03-30 Aries
5 5 2012-04-30 Taurus

I think the *apply functions are made for this kind of work. You could use lapply on the date_of_birth column of your first data frame, together with a second data frame that indexes the zodiac signs by date, to produce a vector zodiac whose length equals the number of rows of your data frame.

That would work, and with a fully populated zodiac database it should be pretty easy. By that I mean a database that lists the start and end dates of every sign for each year, because otherwise it is difficult to compare dates across New Year. Also, please check that the boundary dates are correct; I don't know anything about zodiac signs.
library(fuzzyjoin)
birth.days <- data.frame(respondent = c(1,2,3,4,5), date_of_birth = seq(as.Date("2011-12-30"), as.Date("2012-04-30"), by="months") )
zodiacs <- data.frame(Zodiac = c("Capricorn")
, Start.Date = as.Date("2011-12-22")
, End.Date = as.Date("2012-01-20"))
fuzzy_left_join(birth.days, zodiacs,
by = c("date_of_birth" = "Start.Date", "date_of_birth" = "End.Date"),
match_fun = list(`>=`, `<`))
respondent date_of_birth Zodiac Start.Date End.Date
1 1 2011-12-30 Capricorn 2011-12-22 2012-01-20
2 2 2012-01-30 <NA> <NA> <NA>
3 3 2012-03-01 <NA> <NA> <NA>
4 4 2012-03-30 <NA> <NA> <NA>
5 5 2012-04-30 <NA> <NA> <NA>
Just as an example on how to populate a database with the dates:
Capricorn <- data.frame( Start.Date = seq.Date(from= as.Date("1900-12-22"), to = as.Date("2100-01-01"), by = "year")
, End.Date = seq.Date(from= as.Date("1901-01-20"), to = as.Date("2100-01-20"), by = "year")
, Zodiac = rep("Capricorn", 200 )
)
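As an alternative sketch that avoids a per-year table: a zodiac sign depends only on month and day, so each date can be reduced to a month-day number and looked up with findInterval(). The cut-off days below are commonly quoted boundaries and are an assumption (sources differ by a day for some signs), and zodiac_sign is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: map dates of birth to zodiac signs via month-day cut-offs.
# The boundary days are assumed values; adjust them to the convention you need.
zodiac_sign <- function(dates) {
  # Encode month and day as one integer, e.g. 30 January -> 130
  md <- as.integer(format(dates, "%m%d"))
  # First day (month * 100 + day) of each sign, in calendar order
  starts <- c(120, 219, 321, 420, 521, 621, 723, 823, 923, 1023, 1122, 1222)
  # Capricorn appears twice because it wraps around New Year
  signs <- c("Capricorn", "Aquarius", "Pisces", "Aries", "Taurus", "Gemini",
             "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius",
             "Capricorn")
  signs[findInterval(md, starts) + 1L]
}

birth.days <- data.frame(
  respondent = 1:5,
  date_of_birth = as.Date(c("2011-12-30", "2012-01-30", "2012-03-01",
                            "2012-03-30", "2012-04-30"))
)
birth.days$zodiac <- zodiac_sign(birth.days$date_of_birth)
birth.days
```

Because the lookup ignores the year, it needs no join and no table spanning 1900 to 2100; the trade-off is that the boundaries are fixed for every year.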

Related

Counting Number of People in a Hotel (R)

I am working with the R programming language. Suppose there is a hotel that has a list of customers with their check-in and check-out times (Note: The actual value of the dates is "POSIXct" and is written as "year-month-date".):
check_in_date <- c('2010-01-01', '2010-01-02' ,'2010-01-01', '2010-01-08', '2010-01-08', '2010-01-15', '2010-01-15', '2010-01-16', '2010-01-19', '2010-01-22')
check_out_date <- c('2010-01-07', '2010-01-04' ,'2010-01-09', '2010-01-21', '2010-01-11', '2010-01-22', 'still in hotel as of today', '2010-01-20', '2010-01-25', '2010-01-29')
Person = c("John", "Smith", "Alex", "Peter", "Will", "Matt", "Tim", "Kevin", "Tom", "Adam")
hotel <- data.frame(check_in_date, check_out_date, Person )
The data looks like something like this:
check_in_date check_out_date Person
1 2010-01-01 2010-01-07 John
2 2010-01-02 2010-01-04 Smith
3 2010-01-01 2010-01-09 Alex
4 2010-01-08 2010-01-21 Peter
5 2010-01-08 2010-01-11 Will
6 2010-01-15 2010-01-22 Matt
7 2010-01-15 still in hotel as of today Tim
8 2010-01-16 2010-01-20 Kevin
9 2010-01-19 2010-01-25 Tom
10 2010-01-22 2010-01-29 Adam
Question: I am trying to find out on any given day, how many people were still in the hotel. This would look something like this (just an example, does not correspond to the above data):
day_of_the_year Number_of_people_currently_in_hotel
1 2010-01-01 1
2 2010-01-02 1
3 2010-01-03 2
4 2010-01-04 0
5 2010-01-05 5
6 2010-01-06 5
7 2010-01-07 2
8 2010-01-08 2
9 2010-01-09 8
I tried to solve this problem in 3 steps:
First Step: I generated a column containing every date from the start to the end (e.g. in this example, let's suppose that there are 31 days : from the start to the end of Jan-2010)
day_of_the_year = seq(as.Date("2010/1/1"), as.Date("2010/1/31"),by="day")
Second Step: I then determined how many people checked in to the hotel at each day:
library(dplyr)
#create some indicator variable
hotel$event = 1
check_ins = hotel %>% group_by(check_in_date) %>% summarise(n = n())
check_in_date n
<chr> <int>
1 2010-01-01 2
2 2010-01-02 1
3 2010-01-08 2
4 2010-01-15 2
5 2010-01-16 1
6 2010-01-19 1
7 2010-01-22 1
Third Step: I then repeated a similar step to determine how many people checked out of the hotel each day:
check_outs = hotel %>% group_by(check_out_date) %>% summarise(n = n())
check_out_date n
<chr> <int>
1 2010-01-04 1
2 2010-01-07 1
3 2010-01-09 1
4 2010-01-11 1
5 2010-01-20 1
6 2010-01-21 1
7 2010-01-22 1
8 2010-01-25 1
9 2010-01-29 1
10 still in hotel as of today 1
Problem: Now, I am not sure how to combine the above 3 Steps in such a way so that we can find out how many people were staying at the hotel each day of the month. Can someone please show me how to do this?
Thanks!
Note: I found a "similar" question counting the number of people in the system in R , I am currently trying to see if I can adapt the methods used in this question for my problem.
I used hotel$check_in_date = as.Date(hotel$check_in_date) and hotel$check_out_date = as.Date(hotel$check_out_date) to convert the strings to dates; the "still in hotel as of today" entry becomes NA. The function below then counts the number of guests for a given date. Because of that entry for guests who are still checked in, the function works on a temporary copy of the data frame to avoid overwriting the original data.
count_guests = function(date) {
temp = hotel
temp$check_out_date = ifelse(is.na(temp$check_out_date), as.Date(date), temp$check_out_date)
counts = ifelse((temp$check_in_date <= date) &(temp$check_out_date >= date), 1, 0)
return(sum(counts))
}
count_guests(as.Date("2010-01-02"))
[1] 3
count_guests(as.Date("2010-01-10"))
[1] 2
count_guests(as.Date("2010-01-21"))
[1] 4
EDIT: On second thought it looks like you want a new data frame. This can be done easily with sapply(), which returns a plain numeric vector rather than a list column.
guests = data.frame(day_of_the_year = seq(as.Date("2010/1/1"), as.Date("2010/1/31"),by="day"))
guests$num_checked_in = sapply(guests$day_of_the_year, FUN = count_guests)
day_of_the_year num_checked_in
1 2010-01-01 2
2 2010-01-02 3
3 2010-01-03 3
4 2010-01-04 3
5 2010-01-05 2
...
I think this might help, but for a complete solution we need a reference date for guests who have not checked out yet.
library(tidyverse)
library(lubridate) # ymd() and today(); attached by library(tidyverse) only from tidyverse 2.0.0
hotel %>%
mutate(
across(.cols = ends_with("_date"),.fns = ymd),
check_out_date = if_else(is.na(check_out_date), today(),check_out_date)
) %>%
mutate(
date = map2(
.x = check_in_date,
.y = check_out_date,
.f = function(x,y)seq.Date(from = x,to = y,by = "1 day"))
) %>%
unnest(date) %>%
count(date)
# A tibble: 29 x 2
date n
<date> <int>
1 2010-01-01 2
2 2010-01-02 3
3 2010-01-03 3
4 2010-01-04 3
5 2010-01-05 2
6 2010-01-06 2
7 2010-01-07 2
8 2010-01-08 3
9 2010-01-09 3
10 2010-01-10 2
# ... with 19 more rows
You can try the lubridate package, which I believe is part of the tidyverse, so if you load tidyverse you should not have to load lubridate separately (it is attached automatically from tidyverse 2.0.0 onward).
Use ymd to convert character to date since year-month-day is the format of your date.
dt <- tibble(checkin = lubridate::ymd(check_in_date),
checkout = lubridate::ymd(check_out_date),
person = Person)
For anyone who has not checked out yet, assign a checkout date of today using the today() function. If you know the date when this data was collected, that may be a more sensible date to assign here.
Create interval objects with the checkin date as start and the checkout date as end.
Similarly, create an interval object for the date(s) you want to check. Here I am using 2010-01-07.
Find the overlap using int_overlaps().
dt<- dt %>% mutate(
checkout = replace_na(checkout, today()),
stay_interval = lubridate::interval(start = checkin, end = checkout),
date_of_interest = lubridate::interval(ymd("2010-01-07"), ymd("2010-01-07")),
stay = lubridate::int_overlaps(date_of_interest, stay_interval)
)
dt %>% count(stay)
# A tibble: 2 x 2
stay n
<lgl> <int>
1 FALSE 8
2 TRUE 2
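The three steps from the question (a calendar of days, check-ins per day, check-outs per day) can also be combined directly with cumulative sums. The following is a minimal base R sketch under stated assumptions: guests still count on their check-out day (as in the answers above), NA stands for "still in hotel as of today", every check-in falls inside the calendar, and count_per_day is a hypothetical helper name.

```r
check_in <- as.Date(c("2010-01-01", "2010-01-02", "2010-01-01", "2010-01-08",
                      "2010-01-08", "2010-01-15", "2010-01-15", "2010-01-16",
                      "2010-01-19", "2010-01-22"))
# NA stands for "still in hotel as of today"
check_out <- as.Date(c("2010-01-07", "2010-01-04", "2010-01-09", "2010-01-21",
                       "2010-01-11", "2010-01-22", NA, "2010-01-20",
                       "2010-01-25", "2010-01-29"))
days <- seq(as.Date("2010-01-01"), as.Date("2010-01-31"), by = "day")

# Hypothetical helper: number of events falling on each element of `days`;
# NA dates and dates outside the calendar are dropped
count_per_day <- function(x, days) {
  idx <- match(x, days)
  tabulate(idx[!is.na(idx)], nbins = length(days))
}

arrivals <- count_per_day(check_in, days)
departures <- count_per_day(check_out, days)

# A guest still counts on the check-out day, so departures take effect one day later
occupancy <- cumsum(arrivals) - c(0, head(cumsum(departures), -1))

guests <- data.frame(day_of_the_year = days, num_checked_in = occupancy)
head(guests)
```

This does one pass over the data instead of re-scanning all guests for every day, which may matter for a long calendar; for the first days of January it gives the same counts as count_guests above (2, 3, 3, 3, 2).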

Mutate based on two conditions in R dataframe

I have a R dataframe which can be generated from the code below
DF <- data.frame("Person_id" = c(1,1,1,1,2,2,2,2,3,3), "Type" = c("IN","OUT","IN","ANC","IN","OUT","IN","ANC","EM","ANC"), "Name" = c("Nara","Nara","Nara","Nara","Dora","Dora","Dora","Dora","Sara","Sara"),"day_1" = c("21/1/2002","21/4/2002","21/6/2002","21/9/2002","28/1/2012","28/4/2012","28/6/2012","28/9/2012","30/06/2004","30/06/2005"),"day_2" = c("23/1/2002","21/4/2002","","","30/1/2012","28/4/2012","","28/9/2012","",""))
What I would like to do is create two new columns as admit_start_date and admit_end_date based on few conditions which are given below
Rule 1
admit_start_date = day_1
admit_end_date = day_2 (sometimes day_2 can be NA. So refer Rule 2 below)
Rule 2
if day_2 is (null or blank or na) and Type is (Out or ANC or EM) then
admit_end_date = day_1
else (if Type is IN)
admit_end_date = day_1 + 5 (days)
This is what I am trying but doesn't seem to help
transform_dates = function(DF){ # this function is to create 'date' columns
DF %>%
mutate(admit_start_date = day_1) %>%
mutate(admit_end_date = day_2) %>%
admit_end_date = if_else(((Type == 'Out' & admit_end_date.isna() ==True|Type == 'ANC' & admit_end_date.isna() ==True|Type == 'EM' & admit_end_date.isna() ==True),day_1,day_1 + 5)
)
}
As you can see, I am not sure how to check for NA for a newly created column and replace those NAs with day_1 or day_1 + 5(days) based on Type column.
Can you please help?
I expect my output to be like as shown below
We can use case_when to specify each condition separately after converting "day" columns to actual date objects.
library(dplyr)
DF %>%
mutate_at(vars(starts_with('day')), as.Date, "%d/%m/%Y") %>%
mutate(admit_start_date = day_1,
admit_end_date = case_when(
!is.na(day_2) ~day_2,
is.na(day_2) & Type %in% c('OUT', 'ANC', 'EM') ~ day_1,
Type == 'IN' ~ day_1 + 5))
# Person_id Type Name day_1 day_2 admit_start_date admit_end_date
#1 1 IN Nara 2002-01-21 2002-01-23 2002-01-21 2002-01-23
#2 1 OUT Nara 2002-04-21 2002-04-21 2002-04-21 2002-04-21
#3 1 IN Nara 2002-06-21 <NA> 2002-06-21 2002-06-26
#4 1 ANC Nara 2002-09-21 <NA> 2002-09-21 2002-09-21
#5 2 IN Dora 2012-01-28 2012-01-30 2012-01-28 2012-01-30
#6 2 OUT Dora 2012-04-28 2012-04-28 2012-04-28 2012-04-28
#7 2 IN Dora 2012-06-28 <NA> 2012-06-28 2012-07-03
#8 2 ANC Dora 2012-09-28 2012-09-28 2012-09-28 2012-09-28
#9 3 EM Sara 2004-06-30 <NA> 2004-06-30 2004-06-30
#10 3 ANC Sara 2005-06-30 <NA> 2005-06-30 2005-06-30
The dates in the dataframe are not of class "Date" (check with class(DF$day_1)), so we use mutate_at to convert them to "Date", which lets us do date arithmetic on them. starts_with('day') means that any column whose name starts with "day" is converted to "Date" class. We use mutate_at when we want to apply the same function to multiple columns.
case_when is an alternative to nested ifelse statements. The conditions are evaluated in sequential order: if the first condition is satisfied for a row, the remaining conditions are not checked for that row; if it is not satisfied, the second condition is checked, and so on. Hence, no else is required here. If none of the conditions is satisfied, it returns NA. Check ?case_when.
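For comparison, the same two rules can be written in base R with logical masks instead of case_when. This is only a sketch on the first four rows of the example data, and the intermediate names (day_1, day_2, missing_end, is_in) are made up for illustration.

```r
DF <- data.frame(
  Type  = c("IN", "OUT", "IN", "ANC"),
  day_1 = c("21/1/2002", "21/4/2002", "21/6/2002", "21/9/2002"),
  day_2 = c("23/1/2002", "21/4/2002", "", ""),
  stringsAsFactors = FALSE
)

# Blank strings become NA when parsed with an explicit format
day_1 <- as.Date(DF$day_1, "%d/%m/%Y")
day_2 <- as.Date(DF$day_2, "%d/%m/%Y")

# Rule 1: start is day_1, end is day_2 when it is present
DF$admit_start_date <- day_1
end <- day_2

# Rule 2: fill missing end dates depending on Type
missing_end <- is.na(end)
is_in <- DF$Type == "IN"
end[missing_end & is_in]  <- day_1[missing_end & is_in] + 5   # IN: day_1 + 5 days
end[missing_end & !is_in] <- day_1[missing_end & !is_in]      # OUT/ANC/EM: day_1

DF$admit_end_date <- end
DF
```

The masks play the role of the sequential conditions: rows with a non-missing day_2 are never touched, which corresponds to the first case_when branch winning.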

How to diagonally subtract different columns in R

I have a dataset of a hypothetical exam.
id <- c(1,1,3,4,5,6,7,7,8,9,9)
test_date <- c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15")
result_date <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20")
data1 <- as_data_frame(id)
data1$test_date <- test_date
data1$result_date <- result_date
colnames(data1)[1] <- "id"
"id" indicates the ID of the students who have taken a particular exam. "test_date" is the date the students took the test and "result_date" is the date when the students' results are posted. I'm interested in finding out which students retook the exam BEFORE the result of that exam session was released, e.g. students who knew that they have underperformed and retook the exam without bothering to find out their scores. For example, student with "id" 1 took the exam for the second time on "2012-07-10" which was before the result date for his first exam - "2012-07-29".
I tried to:
data1%>%
group_by(id) %>%
arrange(id, test_date) %>%
filter(n() >= 2) %>% #To only get info on students who have taken the exam more than once and then merge it back in with the original data set using a join function
So essentially, I want to create a new column called "re_test" where it would equal 1 if a student retook the exam BEFORE receiving the result of a previous exam and 0 otherwise (those who retook after seeing their marks or those who did not retake).
I have tried to mutate in order to find cases where dates are either positive or negative by subtracting the 2nd test_date from the 1st result_date:
mutate(data1, re_test = result_date - lead(test_date, default = first(test_date)))
However, this leads to mixing up students with different id's. I tried to split but mutate won't work on a list of dataframes so now I'm stuck:
split(data1, data1$id)
Just to add on, this is a part of the desired result:
data2 <- as_data_frame(id <- c(1,1,3,4))
data2$test_date_result <- c("2012-06-27","2012-07-10", "2013-07-04","2012-03-24")
data2$result_date_result <- c("2012-07-29","2012-09-02","2013-08-01","2012-04-25")
data2$re_test <- c(1, 0, 0, 0)
Apologies for the verbosity and hope I was clear enough.
Thanks a lot in advance!
library(reshape2)
library(dplyr)
# first melt so that we can sequence by date
data1m <- data1 %>%
melt(id.vars = "id", measure.vars = c("test_date", "result_date"), value.name = "event_date")
# any two tests in a row is a flag - use dplyr::lag to compare with the previous event
data1mc <- data1m %>%
arrange(id, event_date) %>%
group_by(id) %>%
mutate (multi_test = (variable == "test_date" & lag(variable == "test_date"))) %>%
filter(multi_test)
# id variable event_date multi_test
# 1 1 test_date 2012-07-10 TRUE
# 2 9 test_date 2012-03-15 TRUE
## join back to the original
data1 %>%
left_join (data1mc %>% select(id, event_date, multi_test),
by=c("id" = "id", "test_date" = "event_date"))
I have a piecewise answer that may work for you. I first create a data.frame called student that contains the re-test information, and then join it with the data1 object. If students re-took the test multiple times, it will compare the last test to the first, which is a flaw, but I'm unsure if students have the ability to re-test multiple times?
student <- data1 %>%
group_by(id) %>%
summarise(retest = n() > 1 & test_date[n()] < result_date[1])
Students who took the test only once have nothing to compare, so the n() > 1 guard sets them to FALSE; without it, their single test date would be compared with their own result date and flagged TRUE. The next line is only a safeguard in case any NA remains (for example from a missing date).
student$retest[is.na(student$retest)] <- FALSE
Join the two data.frames to a single object called data2.
data2 <- left_join(data1, student, by='id')
I am sure there are more elegant ways to approach this. I did this by taking advantage of the structure of your data (sorted by id) and the lag function that can refer to the previous records while dealing with a current record.
### Ensure Data are sorted by ID ###
data1 <- arrange(data1,id)
### Create Flag for those that repeated ###
data1$repeater <- ifelse(lag(data1$id) == data1$id,1,0)
### I chose to do this on all data, you could filter on repeater flag first ###
data1$timegap <- as.Date(data1$result_date) - as.Date(data1$test_date)
data1$lagdate <- as.Date(data1$test_date) - lag(as.Date(data1$result_date))
### Display results where your repeater flag is 1 and there is negative time lag ###
data1[data1$repeater==1 & !is.na(data1$repeater) & as.numeric(data1$lagdate) < 0,]
# A tibble: 2 × 6
id test_date result_date repeater timegap lagdate
<dbl> <chr> <chr> <dbl> <time> <time>
1 1 2012-07-10 2012-09-02 1 54 days -19 days
2 9 2012-03-15 2012-04-20 1 36 days -2 days
I went with a simple shift comparison. 1 line of code.
data1 <- data.frame(id = c(1,1,3,4,5,6,7,7,8,9,9), test_date = c("2012-06-27","2012-07-10","2013-07-04","2012-03-24","2012-07-22", "2013-09-16","2012-06-21","2013-10-18", "2013-04-21", "2012-02-16", "2012-03-15"), result_date = c("2012-07-29","2012-09-02","2013-08-01","2012-04-25","2012-09-01","2013-10-20","2012-07-01","2013-10-31", "2013-05-17", "2012-03-17", "2012-04-20"))
data1$re_test <- unlist(lapply(split(data1,data1$id), function(x)
ifelse(as.Date(x$test_date) > c(NA, as.Date(x$result_date[-nrow(x)])), 0, 1)))
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 NA
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 NA
4 4 2012-03-24 2012-04-25 NA
5 5 2012-07-22 2012-09-01 NA
6 6 2013-09-16 2013-10-20 NA
7 7 2012-06-21 2012-07-01 NA
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 NA
10 9 2012-02-16 2012-03-17 NA
11 9 2012-03-15 2012-04-20 1
I think there is benefit in leaving NAs but if you really want all others as zero, simply:
data1$re_test <- ifelse(is.na(data1$re_test), 0, data1$re_test)
data1
id test_date result_date re_test
1 1 2012-06-27 2012-07-29 0
2 1 2012-07-10 2012-09-02 1
3 3 2013-07-04 2013-08-01 0
4 4 2012-03-24 2012-04-25 0
5 5 2012-07-22 2012-09-01 0
6 6 2013-09-16 2013-10-20 0
7 7 2012-06-21 2012-07-01 0
8 7 2013-10-18 2013-10-31 0
9 8 2013-04-21 2013-05-17 0
10 9 2012-02-16 2012-03-17 0
11 9 2012-03-15 2012-04-20 1
Let me know if you have any questions, cheers.
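The desired output in the question puts the flag on the earlier exam (the one whose result was not awaited), whereas the answers above flag the retake row. A base R sketch of that variant, assuming the data can be sorted by id and test_date and using made-up helper names (nxt, next_id, next_test, same_student):

```r
data1 <- data.frame(
  id = c(1, 1, 3, 4, 5, 6, 7, 7, 8, 9, 9),
  test_date = as.Date(c("2012-06-27", "2012-07-10", "2013-07-04", "2012-03-24",
                        "2012-07-22", "2013-09-16", "2012-06-21", "2013-10-18",
                        "2013-04-21", "2012-02-16", "2012-03-15")),
  result_date = as.Date(c("2012-07-29", "2012-09-02", "2013-08-01", "2012-04-25",
                          "2012-09-01", "2013-10-20", "2012-07-01", "2013-10-31",
                          "2013-05-17", "2012-03-17", "2012-04-20"))
)

# Sort so that each student's exams are consecutive and in time order
data1 <- data1[order(data1$id, data1$test_date), ]

# Index of the following row (NA for the last row)
n <- nrow(data1)
nxt <- c(seq_len(n)[-1], NA)
next_id <- data1$id[nxt]
next_test <- data1$test_date[nxt]

# Flag an exam when the same student's next exam precedes this exam's result
same_student <- !is.na(next_id) & next_id == data1$id
data1$re_test <- as.integer(same_student & next_test < data1$result_date)
data1
```

Only the first exams of students 1 and 9 are flagged; student 7 retook the exam long after the first result was posted, so that row stays 0.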

How to calculate the number of group using R?

This may be a very easy question. I have a data.table with more than 1000 rows, in which two columns can be set as the key. I want to count the number of groups in this dataset.
For example, a simple version of the data (ID and Act are the key columns):
ID ValueDate Act Volume
1 2015-01-01 EUR 21
1 2015-02-01 EUR 22
1 2015-01-01 MAD 12
1 2015-02-01 MAD 11
2 2015-01-01 EUR 5
2 2015-02-01 EUR 7
3 2015-01-01 EUR 4
3 2015-02-01 EUR 2
3 2015-03-01 EUR 6
Here is a code to generate test data:
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01","2015-02-01", "2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
Volume=c(21,22,12,11,5,7,4,2,6))
In this case, we can see that there are a total of 4 groups.
I first tried to set the key for this table:
setkey(dd, ID, Act)
Then I thought a count function could be used to count the groups.
Is a count function the right tool here, or is there a simpler method?
Thanks a lot !
nrow(dd[, .(cnt= sum(.N)), by= c("ID", "Act")])
# or using base R
{t <- table(interaction(dd$ID, dd$Act)); length(t[t>0])}
# or for the counts:
dd[, .(cnt= sum(.N)), by= c("ID", "Act")]
ID Act cnt
1: 1 EUR 2
2: 1 MAD 2
3: 2 EUR 2
4: 3 EUR 3
The fastest way should be uniqueN.
library(data.table)
dd <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01","2015-02-01", "2015-01-01","2015-02-01","2015-01-01","2015-02-01","2015-03-01"),
Act = c("EUR","EUR","MAD","MAD","EUR","EUR","EUR","EUR","EUR"),
Volume=c(21,22,12,11,5,7,4,2,6))
uniqueN(dd, by = c("ID", "Act"))
#[1] 4
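If the data is a plain data.frame rather than a data.table, the same count can be sketched in base R by counting the distinct (ID, Act) pairs; n_groups is just an illustrative variable name.

```r
dd <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  ValueDate = c("2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                "2015-01-01", "2015-02-01", "2015-01-01", "2015-02-01",
                "2015-03-01"),
  Act = c("EUR", "EUR", "MAD", "MAD", "EUR", "EUR", "EUR", "EUR", "EUR"),
  Volume = c(21, 22, 12, 11, 5, 7, 4, 2, 6)
)

# Keep only the grouping columns, drop duplicate rows, count what is left
n_groups <- nrow(unique(dd[, c("ID", "Act")]))
n_groups
# [1] 4
```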

R: Using values from data frame A from a date prior to populate a row in data frame B

This may be very complicated and I suspect requires advanced knowledge. I have now two different types of data.frames I need to combine:
The data:
Dataframe A:
lists all transfusion dates by patient ID. Every transfusion is represented by a separate row, patients can have multiple transfusions. Different patients can have transfusions on the same date.
Patient ID Transfusion.Date
1 01/01/2000
1 01/30/2000
2 04/01/2003
3 04/01/2003
Dataframes of Type B contain test results at other dates, also by patient ID:
Patient ID Test.Date Test.Value
1 11/30/1999 negative
1 01/15/2000 700 copies/uL
1 01/27/2000 900 copies/uL
2 03/30/2003 negative
What I would like to have is Dataframe A with the same number of rows (1 for each transfusion), and with the most recent Test.Value as a separate column. Each transfusion date should have the test result from the test performed most closely (prior) to the transfusion.
desired output:
-->
Patient ID Transfusion.Date Pre.Transfusion.Test
1 01/01/2000 negative
1 01/30/2000 900 copies/ul
2 04/01/2003 negative
3 04/01/2003 NA
I think the general strategy would be to subset the data.frames by patient IDs. Then take all transfusion dates for patient 1, check which result is closest to all available test_dates for each element and then return the value closest.
How can I tell R to do that?
Edit 1: Here is the R-code for these examples
df_A <- data.frame(MRN = c(1,1,2,3),
Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000',
'04/01/2003','04/01/2003'),'%m/%d/%Y'))
df_B <- data.frame(MRN = c(1,1,1,2),
Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000',
'03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative',
'700 copies/ul','900 copies/ul','negative'))
Edit 2:
To clarify, the resulting data should be: Patient A received transfusions on Day X and Day Y (in df_A). Prior to the transfusion on day X, his most recent test result was X (closest test date to the first transfusion, in df_B). Prior to the transfusion on day Y, his most recent test result was Y (prior to the second transfusion, also in df_B). df_B also contains a bunch of other test dates, which are not needed for the final output.
Here's using data.table's rolling joins:
require(data.table)
setkey(setDT(df_A), MRN, Transfusion.Date)
setkey(setDT(df_B), MRN, Test.Date)
df_B[df_A, roll=TRUE]
# MRN Test.Date Test.Result
# 1: 1 2000-01-01 negative
# 2: 1 2000-01-30 900 copies/ul
# 3: 2 2003-04-01 negative
# 4: 3 2003-04-01 NA
setDT converts data.frame to data.table by reference (without any additional copying). That'll result in df_A and df_B now being data.tables.
setkey sorts the data.table by the columns we provided, and marks those columns as key columns, which allows us to use binary search based joins.
We perform a join of the form x[i] on the key columns, where for each row of i, the matching rows of x (if any, else NA) along with i's rows are returned. This is what we call an equi-join. By adding roll = TRUE, in the event of a mismatch, the last observation is carried forward (LOCF). This is what we call a rolling join. The sorting in increasing order (due to setkey()) ensures that the last observation is the most recent date.
HTH
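To make the "last observation carried forward" idea concrete, here is a base R sketch of what the rolling join computes for each transfusion: among the same patient's tests dated on or before the transfusion, take the latest one. It is an illustration only, not how data.table implements the join, and latest_prior_test is a hypothetical helper name.

```r
df_A <- data.frame(
  MRN = c(1, 1, 2, 3),
  Transfusion.Date = as.Date(c("01/01/2000", "01/30/2000",
                               "04/01/2003", "04/01/2003"), "%m/%d/%Y")
)
df_B <- data.frame(
  MRN = c(1, 1, 1, 2),
  Test.Date = as.Date(c("11/30/1999", "01/15/2000", "01/27/2000",
                        "03/30/2003"), "%m/%d/%Y"),
  Test.Result = c("negative", "700 copies/ul", "900 copies/ul", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: result of the latest test on or before transfusion i
latest_prior_test <- function(i) {
  cand <- df_B[df_B$MRN == df_A$MRN[i] &
                 df_B$Test.Date <= df_A$Transfusion.Date[i], ]
  if (nrow(cand) == 0) return(NA_character_)  # no earlier test for this patient
  cand$Test.Result[which.max(cand$Test.Date)]
}

df_A$Pre.Transfusion.Test <- vapply(seq_len(nrow(df_A)), latest_prior_test,
                                    character(1))
df_A
```

Like roll = TRUE, this counts a test taken on the transfusion date itself as a match; use < instead of <= if only strictly earlier tests should qualify. The loop is quadratic in the worst case, so the keyed rolling join is the better choice for large tables.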
dfLast <- df_B[ df_B$Test.Date %in%
as.Date( tapply(df_B$Test.Date, df_B$MRN, tail,1),"1970-01-01"), ]
merge(df_A, dfLast, by=c(1:2,1:2) ,all.y=TRUE)
MRN Transfusion.Date Test.Result
1 1 2000-01-27 900 copies/ul
2 2 2003-03-30 negative
Edited. Had some logical errors and some syntactic errors. tapply returned the integer values of the Dates and, as you pointed out, I was using the wrong column name in the data reduction step.
OK thanks for everyone's help. It took me a lot of toil, blood, sweat, and tears, but this is the solution I came up with:
Merge both data frames:
df_AB <- merge(df_A, df_B, all.x = T)
df_AB:
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-01 2000-01-15 700 copies/ul
3 1 2000-01-01 2000-01-27 900 copies/ul
4 1 2000-01-30 1999-11-30 negative
5 1 2000-01-30 2000-01-15 700 copies/ul
6 1 2000-01-30 2000-01-27 900 copies/ul
7 2 2003-04-01 2003-03-30 negative
8 3 2003-04-01 <NA> <NA>
Using dplyr
df_tests <- df_AB %>%
group_by(MRN, Transfusion.Date) %>%
mutate(Time.Difference = Transfusion.Date - Test.Date) %>%
filter(Time.Difference > 0) %>%
arrange(Time.Difference) %>%
summarize(Test.Date = Test.Date[1], Test.Result = Test.Result[1])
df_tests:
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-30 2000-01-27 900 copies/ul
3 2 2003-04-01 2003-03-30 negative
using merge again for MRN3:
df_desired <- merge(df_A, df_tests, all.x = T)
MRN Transfusion.Date Test.Date Test.Result
1 1 2000-01-01 1999-11-30 negative
2 1 2000-01-30 2000-01-27 900 copies/ul
3 2 2003-04-01 2003-03-30 negative
4 3 2003-04-01 <NA> <NA>
